
1 Introduction

Cephalometric analysis is widely used in evaluation and treatment planning for orthodontic, orthognathic, and maxillofacial surgery. It provides the clinician with crucial information on the patient's dental, skeletal, and facial relationships. The key operation during the analysis is marking craniofacial landmarks [1] to assess and quantify the degree of anatomical abnormality. In practice, landmarks are located manually, which is tedious, time-consuming, and unreliable for achieving reproducible results. Hence, fully automatic and accurate landmark localization has been a long-standing goal with substantial clinical demand.

Current solutions can be classified into five categories: knowledge-based, pattern matching-based, statistical learning-based, hybrid, and deep learning-based methods. Methods in the first category [2] simulate the manual landmark detection process with human knowledge of the landmark structures; however, the rules become too complex to formulate as image complexity increases. Some researchers then employed search methods based on pattern matching [3, 21], but these are quite sensitive to individual variations. Considering that both the global spatial constraints and the local appearance of landmark locations are important, statistical learning-based approaches have been proposed, such as the Active Shape Model [4] and the Active Appearance Model [5]. Two frameworks [6, 7] combining random-forest regression-voting with statistical shape analysis performed well in the IEEE ISBI 2014 and 2015 Challenges [8, 9]. Since then, almost all methods have been benchmarked against the Grand Challenges dataset [10, 11, 22, 23]. There are also hybrid methods [12] integrating the techniques mentioned above. Deep learning [13], which has emerged in recent years, has achieved great success in many fields and is widely used in medical image analysis [14]. It learns features with multi-level semantics automatically, and thus has the potential to overcome the limitations of previous methods in feature definition and extraction. Several deep learning-based methods have been proposed for this task [11, 22], but they perform only comparably to previous state-of-the-art methods, without a prominent improvement.

In this paper, we propose an end-to-end deep learning framework that detects landmarks automatically, accurately, and efficiently. Our network architecture contains three sequential modules: a feature extraction module, an attentive feature pyramid fusion (AFPF) module, and a prediction module. In the feature extraction module, we use VGG-19 [15] as the backbone network. The critical AFPF module is designed from two observations that existing methods do not exploit. First, features extracted by different layers of a neural network have different resolutions and semantics, with higher-level semantics typically coming at lower resolution. Identifying landmarks on a boundary requires high-resolution, detailed structural information, while identifying landmarks at the center of a region requires deep semantic information. To meet the requirements of identifying all the landmarks, we fuse different levels of features to obtain a high-resolution, semantically enhanced fusion feature. Second, each landmark attends differently to the same features. We therefore utilize a self-attention mechanism to learn landmark-specific weights over the fusion feature. Results show that the novel AFPF module plays an important role in improving accuracy. It is also very flexible and can be inserted into other networks to improve their semantic representation. In the prediction module, we take inspiration from traditional methods that use cropped patches to predict offsets from the ground-truth landmarks: we adopt a combination of heat maps and offset maps to perform pixel-wise regression-voting, which proves more effective.

We evaluate the performance on the publicly available dataset from the ISBI 2015 Grand Challenge. Our landmark detection accuracy on the validation dataset (Test Dataset 1) is 86.67% within the clinically accepted precision range of 2.0 mm, with an average error of 1.17 mm. On the testing dataset (Test Dataset 2), we achieve an accuracy of 75.05% within 2.0 mm, with an average error of 1.48 mm. Our method outperforms the state of the art by 7%–11% on all measurements. The contributions of this paper are as follows.

  • We propose a new deep learning-based framework for cephalometric landmark detection.

  • We present a new and flexible module (AFPF) that produces high-resolution, semantically enhanced fusion features via a self-attention mechanism.

  • Our method outperforms the state of the art by 7%–11% on all measurements on a public dataset.

  • Our method has strong self-adaptive capability and performs well on unseen data sources, which is very practical for clinical application.

Fig. 1. Overview of our framework. Three consecutive modules, the feature extraction module, the AFPF module, and the prediction module, are shown in the blue, yellow, and purple areas respectively. Feature maps are reshaped in the green box of AFPF. Attention vectors of heat maps and offset maps for the landmarks are in the orange box of AFPF. (Color figure online)

2 Methods

Given a cephalometric radiograph I, the goal is to automatically detect the anatomical landmark positions \(P=(p_1,p_2,...,p_n)\), where \(p_i\) denotes the 2D position of a landmark and n is the number of cephalometric landmarks. Our proposed framework is illustrated in Fig. 1. In the first module, we use pre-trained VGG-19 [15] as the backbone network. Other networks such as ResNet [16] and Inception [17] can also serve as the feature extractor; we report their respective performance in Sect. 3. In the following, we give details of the AFPF module and the prediction module.
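For concreteness, the following PyTorch sketch shows one way to collect multi-level features from a pre-trained VGG-19 backbone. The specific tap points (the outputs of the five pooling stages) are our assumption for illustration, not a detail specified in the text.

```python
import torch
import torchvision

class VGG19Backbone(torch.nn.Module):
    """Collects feature maps at several depths of VGG-19; the chosen tap points
    (outputs of the five pooling stages) are an assumption for illustration."""
    def __init__(self):
        super().__init__()
        self.features = torchvision.models.vgg19(pretrained=True).features
        self.tap_points = {4, 9, 18, 27, 36}       # indices of the max-pooling layers

    def forward(self, x):
        feats = []
        for i, layer in enumerate(self.features):
            x = layer(x)
            if i in self.tap_points:
                feats.append(x)                    # strides 2, 4, 8, 16, 32 w.r.t. the input
        return feats
```

The AFPF module described next consumes such a list of multi-resolution feature maps.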

2.1 Attentive Feature Pyramid Fusion

The Attentive Feature Pyramid Fusion (AFPF) module takes features from different levels of the feature extraction module as input and produces a tensor T of size \((3n, h, w)\), where \((h, w)\) denotes the spatial size of the input image and 3n stands for n heat maps and 2n offset maps. Heat maps H indicate the rough area of a landmark, while offset maps O act as regressors that locate the precise position [24]. As Fig. 1 shows, we apply \(1\times 1\) lateral connections and upsampling to each level's features from the first module to generate feature maps with the same resolution and number of channels. These feature maps are then concatenated and passed through a dilated convolutional [18] block to form the feature pyramid F of size \((c,h_F,w_F)\). The dilated convolution enlarges the receptive field and aggregates multi-scale context so that more local information can be used to improve the estimation accuracy. Based on the observation that different landmarks attend differently to these feature maps, we use a self-attention mechanism [19, 26] to learn attention weights for each landmark. The attention weight \(a_k\) for the k-th landmark is computed as follows:

$$\begin{aligned} a_k = \text {softmax}({W_{k1}\tanh {(W_{k2}\tilde{F})}}), \end{aligned}$$
(1)

where \(a_k\) is an attention matrix composed of three attention vectors \((a_k^1,a_k^2,a_k^3)\), one for the heat map and two for the offset maps. The length of each attention vector is c, equal to the channel number of F. \(\tilde{F}\) is obtained by average pooling and reshaping, which transform F from size \((c,h_F,w_F)\) to size \((c,h_F\times w_F/ 64)\). \(W_{k1}\) and \(W_{k2}\) are trainable matrices implemented as fully connected layers without bias. For each landmark, we apply the attention weight \(a_k\) to the feature pyramid F by channel-wise multiplication to obtain the weighted feature pyramids \(F_{k}\):

$$\begin{aligned} F_{k} = c(a_k \otimes F), \end{aligned}$$
(2)

where \(F_{k}\) contains three weighted feature pyramids \((F_k^1,F_k^2,F_k^3)\), and \(F_k^j (j=1,2,3)\) has the same size as F. \(\otimes \) denotes channel-wise multiplication, and c is the channel number, which serves as a scale factor. By applying a \(1\times 1\) convolution to \(F_{k}\), we obtain an output with three channels for the k-th landmark, containing one heat map \(H_k'\) and two offset maps \(O'_k\). \(H_k'\) and \(O'_k\) are then upsampled to match the size of the input image, yielding \(H_k\) and \(O_k\). The AFPF module can also be used in other networks to improve their semantic representation; we show its flexibility in Sect. 3.
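As a concrete illustration, the sketch below wires together the fusion and attention steps described above for a single landmark. The channel count c, the dilation rates, and the hidden width d of the attention layers are assumptions; the 8×8 average pooling matches the stated \(h_F\times w_F/64\) reduction, and the per-branch \(1\times 1\) convolution is one plausible reading of how \(F_k\) is reduced to three maps.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeaturePyramidFusion(nn.Module):
    """Lateral 1x1 connections + upsampling + concatenation + dilated convolutions.
    The channel count c and the dilation rates are illustrative assumptions."""
    def __init__(self, in_channels, c=256):
        super().__init__()
        self.laterals = nn.ModuleList(nn.Conv2d(ch, c, kernel_size=1) for ch in in_channels)
        self.dilated = nn.Sequential(
            nn.Conv2d(c * len(in_channels), c, kernel_size=3, padding=2, dilation=2),
            nn.ReLU(inplace=True),
            nn.Conv2d(c, c, kernel_size=3, padding=4, dilation=4),
            nn.ReLU(inplace=True),
        )

    def forward(self, feats):
        target = feats[0].shape[-2:]               # spatial size of the highest-resolution level
        ups = [F.interpolate(l(f), size=target, mode='bilinear', align_corners=False)
               for l, f in zip(self.laterals, feats)]
        return self.dilated(torch.cat(ups, dim=1))  # feature pyramid F: (B, c, h_F, w_F)

class LandmarkAttention(nn.Module):
    """Attention head for one landmark, following Eqs. (1) and (2)."""
    def __init__(self, c, pooled_len, d=128):       # d: hidden width of W_k2 (assumed)
        super().__init__()
        self.c = c
        self.fc1 = nn.Linear(pooled_len, d, bias=False)   # W_k2
        self.fc2 = nn.Linear(d, 3, bias=False)            # W_k1: one row per output map
        self.heads = nn.ModuleList(nn.Conv2d(c, 1, kernel_size=1) for _ in range(3))

    def forward(self, pyramid):                     # pyramid F: (B, c, h_F, w_F)
        B, C, h, w = pyramid.shape
        # F~: 8x8 average pooling, then flattening to length h_F * w_F / 64
        f_tilde = F.avg_pool2d(pyramid, kernel_size=8).reshape(B, C, -1)
        # Eq. (1): a_k = softmax(W_k1 tanh(W_k2 F~)); softmax over the channel axis
        a = torch.softmax(self.fc2(torch.tanh(self.fc1(f_tilde))), dim=1)   # (B, c, 3)
        # Eq. (2): F_k^j = c * (a_k^j (x) F), then a 1x1 convolution per branch
        maps = [head(self.c * a[:, :, j, None, None] * pyramid)
                for j, head in enumerate(self.heads)]
        return torch.cat(maps, dim=1)               # (B, 3, h_F, w_F): H'_k and two O'_k
```

A full AFPF layer would hold one such attention head per landmark and bilinearly upsample the three resulting maps to the input resolution.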

2.2 Landmark Prediction with Regression Voting

In the prediction module, we combine the outputs of the AFPF module, namely the heat maps and the offset maps, to predict landmark positions.

In the training stage, for each pixel location \(x_i\) and the k-th landmark \(l_{k}\), we constrain the predicted probability in the heat map \(H_k(x_i)\) to be 1 if \(\left\| x_i-l_k\right\| _2 \le R\) and 0 otherwise, where R is the radius of a circular domain. Note that the heat maps and offset maps are generated from fused feature maps of different resolutions, so R is set to 40 to ensure a minimal corresponding activation area on the smallest feature map (stride of 32 pixels relative to the input size). The loss function \(L_{h}\) is defined as the mean logistic loss between the predicted heat maps and the ground truth. The offset maps predict the 2D offset vector \(O_k(x_{i})=(l_{k}-x_{i})/R\) (in the x and y directions respectively) from the pixel \(x_{i}\) to the corresponding landmark \(l_{k}\). The loss function \(L_{o}\) is defined as the L1 loss between the predicted offsets and the target. When training the offset maps, we calculate the loss only for positions \(x_{i}\) within R rather than over all pixels. The final loss function is defined as follows:

$$\begin{aligned} L(\theta ) = \alpha L_{h}(\theta )+(1-\alpha )L_{o}(\theta ) \end{aligned}$$
(3)

where \(\alpha \) is a factor to balance the loss function terms. We set \(\alpha = 2/3\) empirically.
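A minimal sketch of the target construction and the combined loss of Eq. (3) is given below, assuming binary cross-entropy as the "mean logistic loss" and a masked L1 loss for the offsets; function names and shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def make_targets(landmark, shape, R=40):
    """Binary heat-map target and normalized offset-map targets for one landmark
    at pixel position (x, y); `shape` is (H, W) of the full-resolution maps."""
    H, W = shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing='ij')
    dx, dy = landmark[0] - xs.float(), landmark[1] - ys.float()
    heat = (torch.sqrt(dx ** 2 + dy ** 2) <= R).float()   # 1 inside the radius-R disc
    offsets = torch.stack([dx, dy]) / R                    # O_k(x_i) = (l_k - x_i) / R
    return heat, offsets

def combined_loss(pred_heat, pred_off, gt_heat, gt_off, alpha=2/3):
    """Eq. (3): alpha * L_h + (1 - alpha) * L_o, with L_o evaluated only inside the disc."""
    l_h = F.binary_cross_entropy_with_logits(pred_heat, gt_heat)   # mean logistic loss (assumed BCE)
    mask = gt_heat                                                  # restricts L1 to ||x_i - l_k|| <= R
    l_o = (torch.abs(pred_off - gt_off) * mask).sum() / (2 * mask.sum() + 1e-6)
    return alpha * l_h + (1 - alpha) * l_o
```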

In the testing stage, we aggregate the heat map and the offset maps for each landmark to construct an activation map \(M_k\) via pixel-wise regression-voting as follows:

$$\begin{aligned} M_k(x_{i})=\sum _{x_{j}\in A_k}\mathbbm {1}\{\Vert x_{j}+\lfloor O_k(x_{j})\times R\rfloor -x_{i} \Vert =0\} \end{aligned}$$
(4)

where \(A_k\) is the set of pixels with the \(\pi R^2\) largest values in heat map \(H_k\) for the k-th landmark, and \(\mathbbm {1}\{\cdot \}\) is the indicator function. Finally, the pixel \(x_{i}\) with the highest activation value \(M_k(x_{i})\) is regarded as the most likely landmark position.
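The voting step of Eq. (4) could be implemented as in the following sketch, where `heat` is an (H, W) probability map and `offsets` a (2, H, W) tensor consistent with the conventions used above.

```python
import math
import torch

def vote_landmark(heat, offsets, R=40):
    """Pixel-wise regression voting (Eq. 4): the pi*R^2 highest-scoring heat-map pixels
    each cast one vote at x_j + floor(O_k(x_j) * R); the most-voted pixel wins."""
    H, W = heat.shape
    k = int(math.pi * R * R)
    _, top_idx = heat.flatten().topk(k)                        # candidate set A_k
    ys = torch.div(top_idx, W, rounding_mode='floor')
    xs = top_idx % W
    vx = (xs + torch.floor(offsets[0, ys, xs] * R)).long().clamp(0, W - 1)
    vy = (ys + torch.floor(offsets[1, ys, xs] * R)).long().clamp(0, H - 1)
    votes = torch.zeros(H, W)
    votes.index_put_((vy, vx), torch.ones(k), accumulate=True)  # activation map M_k
    best = votes.flatten().argmax()
    return (best % W).item(), torch.div(best, W, rounding_mode='floor').item()  # (x, y)
```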

3 Evaluations and Discussions

In this section, we first evaluate our method on the public dataset from the IEEE ISBI 2015 Challenge by comparing it with state-of-the-art methods. To highlight the contributions of different parts of the framework, we also report the performance of different configurations in an ablation study; in particular, these experiments illustrate the flexibility of the AFPF module. Furthermore, we test our framework on other large-scale datasets from various devices in the extended experiments. The feature extraction module is pre-trained on the ImageNet dataset [25]. We resize the input image to \(800\times 640\). The entire framework is built on PyTorch and optimized with the Adadelta optimizer using its default configuration. The batch size is 1. Training takes approximately 7 h for 350 epochs on a GTX 1080 Ti GPU.

3.1 Cephalometric Landmark Dataset

The IEEE ISBI 2015 Challenge [9] provides a public dataset for cephalometric landmark detection, which is the only related public dataset. The dataset consists of 400 cephalometric radiographs, each with 19 landmarks manually labeled by two doctors; the ground truth is the average of the two doctors' annotations. The image resolution is \(1935\times 2400\) pixels in TIFF format, and the pixel spacing is 0.1 mm. The pathology types for eight standard measurement methods can be calculated from the landmark positions. We use 150 images for training, 150 images for validation (Test Dataset 1), and 100 images for testing (Test Dataset 2), and adopt the evaluation metrics of the IEEE ISBI 2015 Challenge [9]: the mean radial error (MRE), the successful detection rate (SDR) at four target radii (2 mm, 2.5 mm, 3 mm, 4 mm), and the accuracy rate for pathology classification (APC).
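Given the 0.1 mm pixel spacing, the MRE and SDR metrics could be computed as in the following sketch (the array shapes and function name are our assumptions for illustration).

```python
import numpy as np

def mre_and_sdr(pred, gt, pixel_spacing=0.1, radii_mm=(2.0, 2.5, 3.0, 4.0)):
    """Mean radial error in mm and successful detection rates at the challenge radii.
    `pred` and `gt` are (N, 19, 2) arrays of landmark coordinates in pixels."""
    radial_err = np.linalg.norm(pred - gt, axis=-1) * pixel_spacing   # per-landmark error in mm
    mre = radial_err.mean()
    sdr = {r: 100.0 * (radial_err <= r).mean() for r in radii_mm}     # percentage within each radius
    return mre, sdr
```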

Table 1. Comparison with three state-of-the-art methods and ablation study of our own method on the dataset of the IEEE ISBI 2015 Challenge. The model indices 1 to 6 stand for Ibragimov et al. [20], Lindner et al. [7], Arik et al. [11], our method without the AFPF module, our method without the self-attention mechanism, and our complete method, respectively.
Table 2. Performance of the AFPF module with other feature-extraction networks. For each network, the second row shows the results obtained after removing the AFPF module.

3.2 Baselines

We compare our approach with the top two methods [7, 20] of the IEEE ISBI 2015 Challenge and with two more recent approaches proposed by Arik et al. [11] and Payer et al. [22]. We also remove the AFPF module and the attention mechanism respectively for the ablation study.

3.3 Analysis

The comparison with state-of-the-art methods is shown in Table 1. Our method (results shown in bold) performs much better on all measurements, achieving an MRE of 1.17 mm and 1.48 mm with standard deviations of 1.19 mm and 0.77 mm on Test Dataset 1 and Test Dataset 2 respectively. In terms of the successful detection rate (SDR) at the target radii evaluated in the Challenge, our method exceeds the top method by 7%–11%. As for the accuracy rate for pathology classification (APC), we achieve 79.05% and 81.95% on the two test datasets respectively, with an average classification accuracy of 84.7% over all classified subjects. Payer et al. [22] report results in a different format: they compute accuracy rates at the four target radii over the two test datasets combined, obtaining 73.33%, 78.76%, 83.24%, and 89.75% respectively, whereas we achieve the much higher accuracies of 82.03%, 88.74%, 92.74%, and 97.14%. The ablation study in Table 1 shows that the AFPF module clearly improves the detection accuracy and that the attention mechanism plays an important role within it. Note that even without the AFPF module, our network, which adopts the pixel-wise regression-voting technique based on heat maps and offset maps, already outperforms the others.

To further demonstrate the flexibility of the AFPF module, we replace the VGG-19 in the first module of our framework with other feature-extraction networks, including ResNet50 and Inception. The results in Table 2 show that the AFPF module can be used with many networks to improve performance. In addition, our method is efficient, processing one image within 70 ms on a GTX 1080 Ti GPU or within 7.8 s on an i7-6700K CPU.

Fig. 2. Data samples from the five datasets, containing 393, 154, 100, 709, and 501 samples respectively. Blue points are the ground-truth landmarks.

3.4 Extended Experiments

In the real world, images captured by different devices differ significantly, and a good algorithm should adapt well even to data from a new device. To test the generalization capability and stability of our approach, we use five datasets (Fig. 2) collected by four different devices. We name them Data-A, Data-B, Data-C, Data-D, and Data-E, where Data-D and Data-E come from the same device. All 19 landmarks in these images were manually relabeled by a single dentist to avoid inter-observer errors. We use Data-A, Data-C, and Data-D for training and Data-B and Data-E for testing. For Data-E, the MRE is 1.03 mm, with SDRs of 94.2%, 96.86%, 98.16%, and 99.31% at 2 mm, 2.5 mm, 3 mm, and 4 mm respectively. For Data-B, which comes from a new device that never appears in the training data, the MRE is 0.88 mm with SDRs of 94.73%, 97.56%, 98.80%, and 99.66%. This shows that our method performs well on both seen and unseen data sources and is very practical for clinical application.

4 Conclusion

We propose an end-to-end deep learning framework that automatically detects cephalometric landmarks with high accuracy. In our framework, the AFPF module produces high-resolution, semantically enhanced fusion features with attention to improve the prediction accuracy, and the pixel-wise regression-voting technique based on heat maps and offset maps further benefits performance. Our framework achieves state-of-the-art results on all evaluation metrics. In particular, our method performs well even on unseen data sources, which is significant for practical application. Our framework takes in raw images and outputs landmarks directly in real time without any human intervention, making it useful for fully automated cephalometric analysis. In the future, we will extend our method to more general landmark-detection tasks.