Abstract
Marking anatomical landmarks on cephalometric radiographs is a critical operation in cephalometric analysis. Locating these landmarks automatically and accurately is challenging because different landmarks require different levels of resolution and semantics. Based on this observation, we propose a novel attentive feature pyramid fusion (AFPF) module that explicitly shapes high-resolution, semantically enhanced fusion features to achieve significantly higher accuracy than existing deep learning-based methods. We also combine heat maps and offset maps to perform pixel-wise regression-voting, further improving detection accuracy. By incorporating the AFPF module and regression-voting, we develop an end-to-end deep learning framework that improves detection accuracy by 7%–11% over the state-of-the-art method on all evaluation metrics. We present ablation studies to give more insight into the components of our method, and demonstrate its generalization capability and stability on unseen data from diverse devices.
1 Introduction
Cephalometric analysis is widely used in evaluation and treatment planning for orthodontic, orthognathic and maxillofacial surgery. It provides the clinician with crucial information on the patient's dental, skeletal and facial relationships. The key operation in the analysis is marking craniofacial landmarks [1] to assess and quantify the degree of anatomical abnormality. In practice, landmarks are located manually, which is tedious, time-consuming, and unreliable for achieving reproducible results. Hence, fully automatic and accurate landmark localization has long been in great demand.
Current solutions can be classified into five categories: knowledge-based, pattern matching-based, statistical learning-based, hybrid, and deep learning-based methods. Methods in the first category [2] simulate the manual landmark detection process using human knowledge of the landmark structures. However, as image complexity increases, the rules become too complex to formulate. Some researchers therefore employed search methods based on pattern matching [3, 21], but these are quite sensitive to individual variation. Considering that both the global spatial constraints and the local appearance of landmark locations are important, statistical learning-based approaches have been proposed, such as the Active Shape Model [4] and the Active Appearance Model [5]. Two frameworks [6, 7] combining random-forest regression-voting with statistical shape analysis performed well in the IEEE ISBI 2014 and 2015 Challenges [8, 9]; since then, almost all methods have been benchmarked against the Grand Challenges dataset [10, 11, 22, 23]. There are also hybrid methods [12] integrating the techniques mentioned above. Deep learning [13], which has emerged in recent years, has achieved great success in many fields and is widely used in medical image analysis [14]. It learns features with multi-level semantics automatically, and thus has the potential to overcome the limitations of previous methods in feature definition and extraction. Some deep learning-based methods have been proposed for this problem [11, 22], but they perform only comparably to previous state-of-the-art methods, without marked improvement.
In this paper, we propose an end-to-end deep learning framework that detects landmarks automatically, accurately, and efficiently. Our network architecture contains three sequential modules: a feature extraction module, an attentive feature pyramid fusion (AFPF) module, and a prediction module. In the feature extraction module, we use VGG-19 [15] as the backbone network. We design the critical AFPF module based on two observations that existing methods overlook. The first is that features extracted by different layers of a neural network have different resolutions and semantics, with richer semantics usually accompanying lower resolution. Identifying landmarks on a boundary requires high-resolution, detailed structural information, while identifying landmarks at the center of a region requires deep semantic information. To meet the requirements of all landmarks, we fuse different levels of features into a high-resolution, semantically enhanced fusion feature. The second is that each landmark attends differently to the same features. We use a self-attention mechanism to learn landmark-specific weights over the fusion feature. Results show that the novel AFPF module plays an important role in improving accuracy. It is also very flexible and can be inserted into other networks to improve their semantic representations. In the prediction module, we draw inspiration from traditional methods that use cropped patches to predict offsets to the ground-truth landmarks: we combine heat maps and offset maps to perform pixel-wise regression-voting, which proves more effective.
We evaluate the performance on the publicly available dataset from the ISBI 2015 Grand Challenge. On the validation dataset (Test Dataset 1), our landmark detection accuracy is 86.67% within the clinically accepted precision range of 2.0 mm, with an average error of 1.17 mm. On the testing dataset (Test Dataset 2), we achieve 75.05% accuracy within 2.0 mm, with an average error of 1.48 mm. Our method outperforms the state of the art by 7%–11% on all measurements. The contributions of this paper are as follows.
- We propose a new deep learning-based framework for cephalometric landmark detection.
- We present a new and flexible module (AFPF) that produces high-resolution, semantically enhanced fusion features via a self-attention mechanism.
- Our method outperforms the state of the art by 7%–11% on all measurements on a public dataset.
- Our method has strong self-adaptive capability and performs well on unseen data sources, which is very practical for clinical application.
Fig. 1. Overview of our framework. The three consecutive modules, the feature extraction module, the AFPF module, and the prediction module, are shown in the blue, yellow and purple areas respectively. Feature maps are reshaped in the green box of the AFPF module; attention vectors of heat maps and offset maps for the landmarks are in the orange box. (Color figure online)
2 Methods
Given a cephalometric radiograph I, the goal is to detect the anatomical landmark positions \(P=(p_1,p_2,...,p_n)\) automatically, where p denotes the 2D position of a landmark and n is the number of cephalometric landmarks. Our proposed framework is illustrated in Fig. 1. In the first module, we use a pre-trained VGG-19 [15] as the backbone network; other networks such as ResNet [16] and Inception [17] can also serve as feature extractors, and we report their respective performance in Sect. 3. In the following, we detail the AFPF module and the prediction module.
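For concreteness, the multi-level feature extraction can be sketched in PyTorch as follows. This is a minimal sketch: the stage boundaries below are the standard VGG-19 pooling stages and are illustrative rather than our exact tap points.

```python
import torch
import torchvision


class VGGFeatures(torch.nn.Module):
    """Collect feature maps at several depths of a pre-trained VGG-19,
    to be fused later by the AFPF module (sketch; tap points assumed)."""

    def __init__(self):
        super().__init__()
        vgg = torchvision.models.vgg19(pretrained=True).features
        # Split the convolutional trunk into stages, each ending at a pooling layer.
        self.stages = torch.nn.ModuleList(
            [vgg[:5], vgg[5:10], vgg[10:19], vgg[19:28], vgg[28:37]])

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)  # strides 2, 4, 8, 16, 32 relative to the input
        return feats
```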
2.1 Attentive Feature Pyramid Fusion
The Attentive Feature Pyramid Fusion (AFPF) module takes the different levels of features from the feature extraction module as input and produces a tensor T of size (3n, h, w), where (h, w) denotes the spatial size of the input image and 3n stands for n heat maps and 2n offset maps. Heat maps H indicate the rough area of a landmark, while offset maps O act as regressors that locate the precise position [24]. As Fig. 1 shows, we apply \(1\times 1\) lateral connections and upsampling to each level of features from the first module to generate feature maps with the same resolution and number of channels. These feature maps are then concatenated and passed through a dilated convolutional [18] block to form the feature pyramid F of size \((c,h_F,w_F)\). The dilated convolution enlarges the receptive field and aggregates multi-scale context so that more local information can be used to improve estimation accuracy. Based on the observation that different landmarks attend differently to these feature maps, we use a self-attention mechanism [19, 26] to learn attention weights for each landmark. The attention weight \(a_k\) for the k-th landmark is computed as follows:
$$a_k = \mathrm{softmax}\left(W_{k2}\tanh\left(W_{k1}\tilde{F}\right)\right)$$
where \(a_k\) is an attention matrix composed of three attention vectors \((a_k^1,a_k^2,a_k^3)\), one for the heat map and two for the offset maps. The length of each attention vector is c, the channel number of F. \(\tilde{F}\) is obtained by average pooling and reshaping operations that transfer F from size \((c,h_F,w_F)\) to size \((c,h_F\times w_F/ 64)\). \(W_{k1}\) and \(W_{k2}\) are trainable matrices implemented as fully connected layers without bias. For each landmark, we apply the attention weight \(a_k\) to the feature pyramid F with channel-wise multiplication to obtain the weighted feature pyramids \(F_{k}\):
$$F_k^j = c\,\left(a_k^j \otimes F\right),\quad j=1,2,3$$
where \(F_{k}\) contains three weighted feature pyramids \((F_k^1,F_k^2,F_k^3)\), and each \(F_k^j (j=1,2,3)\) has the same size as F. \(\otimes \) denotes channel-wise multiplication, and the channel number c serves as a scale factor. Applying a \(1\times 1\) convolution to \(F_{k}\) yields a three-channel output for the k-th landmark, containing one heat map \(H_k'\) and two offset maps \(O'_k\). These are then upsampled to the size of the input image, giving \(H_k\) and \(O_k\). The AFPF module can also be used within other networks to improve their semantic representation; we show its flexibility in Sect. 3.
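For concreteness, a minimal PyTorch sketch of the AFPF computation is given below. The channel count c, hidden size d, pooled grid size, dilation rate, and upsampling factor `up` are illustrative assumptions, and the per-landmark attention is implemented with linear layers shared across landmarks that emit all 3n attention vectors at once.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AFPF(nn.Module):
    """Sketch of attentive feature pyramid fusion. c, d, pool_hw, the
    dilation rate, and `up` are illustrative, not the paper's exact values."""

    def __init__(self, in_channels, n=19, c=256, d=128, pool_hw=(10, 8), up=2):
        super().__init__()
        self.pool_hw, self.up = pool_hw, up
        # 1x1 lateral connections unify the channel count of every level.
        self.laterals = nn.ModuleList(nn.Conv2d(ch, c, 1) for ch in in_channels)
        # Dilated convolution enlarges the receptive field after concatenation.
        self.dilated = nn.Sequential(
            nn.Conv2d(c * len(in_channels), c, 3, padding=2, dilation=2),
            nn.ReLU(inplace=True))
        # Self-attention: two bias-free linear maps produce 3 attention
        # vectors (1 heat map + 2 offset maps) of length c per landmark.
        m = pool_hw[0] * pool_hw[1]
        self.W1 = nn.Linear(m, d, bias=False)
        self.W2 = nn.Linear(d, 3 * n, bias=False)
        # Grouped 1x1 convolution: one output map per weighted pyramid.
        self.head = nn.Conv2d(3 * n * c, 3 * n, 1, groups=3 * n)

    def forward(self, feats):
        size = feats[0].shape[-2:]  # finest tapped level gives (h_F, w_F)
        fused = torch.cat(
            [F.interpolate(lat(f), size=size, mode='bilinear', align_corners=False)
             for lat, f in zip(self.laterals, feats)], dim=1)
        P = self.dilated(fused)                        # pyramid F: (B, c, h_F, w_F)

        # F~ via average pooling + reshape (pooled to a fixed grid so the
        # linear layers have static sizes), then a_k = softmax(W2 tanh(W1 F~)).
        f_tilde = F.adaptive_avg_pool2d(P, self.pool_hw).flatten(2)  # (B, c, m)
        a = self.W2(torch.tanh(self.W1(f_tilde)))      # (B, c, 3n)
        a = a.transpose(1, 2).softmax(dim=-1)          # (B, 3n, c)

        # F_k^j = c * (a_k^j (x) F): channel-wise re-weighting per output map.
        # Materializing (B, 3n, c, h, w) is memory-hungry but fine for a sketch.
        c = P.shape[1]
        weighted = c * a.unsqueeze(-1).unsqueeze(-1) * P.unsqueeze(1)
        out = self.head(weighted.flatten(1, 2))        # (B, 3n, h_F, w_F)
        # Upsample to the input resolution: n heat maps, then 2n offset maps.
        return F.interpolate(out, scale_factor=self.up,
                             mode='bilinear', align_corners=False)
```

Here `up` is assumed to equal the stride of the finest tapped level; a production implementation would avoid materializing the weighted tensor, e.g. by looping over landmarks.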
2.2 Landmark Prediction with Regression Voting
In the prediction module, we combine the outputs of the AFPF module, the heat maps and offset maps, to predict landmark positions.
In the training stage, for each pixel location \(x_i\) and the k-th landmark \(l_{k}\), we constrain the predicted probability in the heat map \(H_k(x_i)\) to be 1 if \(\left\| x_i-l_k\right\| _2 \le R\) and 0 otherwise, where R is the radius of a circular domain. Note that the heat maps and offset maps are generated from fused feature maps of different resolutions, so R is set to 40 to ensure a minimal corresponding activation area on the smallest feature map (stride of 32 pixels relative to the input). The loss function \(L_{h}\) is defined as the mean logistic loss between the predicted heat maps and the ground truth. The offset maps predict the 2D offset vector (in the x and y directions) \(O_k(x_{i})=(l_{k}-x_{i})/R\) from the pixel \(x_{i}\) to the corresponding landmark \(l_{k}\). The loss function \(L_{o}\) is defined as the L1 loss between the predicted offsets and the targets; when training the offset maps, we compute the loss only for positions \(x_{i}\) within radius R of the landmark rather than for all pixels. The final loss function is defined as follows:
$$L = \alpha L_h + (1-\alpha)\,L_o$$
where \(\alpha \) is a factor to balance the loss function terms. We set \(\alpha = 2/3\) empirically.
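Under these definitions, the target construction and loss can be sketched as follows (per-image shapes; implementing the logistic loss with binary cross-entropy on logits and the masked-mean normalization of \(L_o\) are our assumptions):

```python
import torch
import torch.nn.functional as F


def landmark_targets(landmarks, h, w, R=40):
    """Per-image targets. landmarks: (n, 2) tensor of (x, y) pixel positions.
    Returns heat maps (n, h, w) and offset maps (n, 2, h, w)."""
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing='ij')
    grid = torch.stack([xs, ys], dim=-1).float()            # (h, w, 2) as (x, y)
    d = landmarks.view(-1, 1, 1, 2).float() - grid.unsqueeze(0)  # l_k - x_i
    heat = (d.norm(dim=-1) <= R).float()                    # 1 inside radius R
    offset = d / R                                          # O_k(x_i) = (l_k - x_i) / R
    return heat, offset.permute(0, 3, 1, 2)


def detection_loss(pred_heat, pred_off, heat, off, alpha=2 / 3):
    """L = alpha * L_h + (1 - alpha) * L_o with alpha = 2/3."""
    l_h = F.binary_cross_entropy_with_logits(pred_heat, heat)  # mean logistic loss
    mask = heat.unsqueeze(1)                   # offsets only count within radius R
    l_o = (F.l1_loss(pred_off, off, reduction='none') * mask).sum() \
        / (2 * mask.sum()).clamp(min=1)
    return alpha * l_h + (1 - alpha) * l_o
```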
In the testing stage, we aggregate the heat map and the offset maps for each landmark to construct an activation map \(M_k\) via pixel-wise regression-voting as follows:
$$M_k(x_i) = \sum_{x_j \in A_k} \mathbbm{1}\left\{\left\| x_j + R\,O_k(x_j) - x_i\right\|_2 < 1\right\}$$
where \(A_k\) is the set of pixels with the \(\pi R^2\) largest values in heat map \(H_k\) for the k-th landmark, and \(\mathbbm {1}\{\cdot \}\) is the indicator function. Finally, the pixel \(x_{i}\) with the highest activation value \(M_k(x_{i})\) is regarded as the most likely landmark position.
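A minimal sketch of this voting procedure for a single landmark follows (rounding each vote to the nearest pixel is an assumption about the discretization):

```python
import math
import torch


def vote_landmark(heat, offset, R=40):
    """Regression-voting for one landmark. heat: (h, w) probabilities,
    offset: (2, h, w) predicted (x, y) offsets scaled by 1/R."""
    h, w = heat.shape
    k = int(math.pi * R * R)                      # |A_k|: top pi*R^2 candidates
    idx = heat.flatten().topk(k).indices
    ys, xs = idx // w, idx % w
    # Each candidate pixel votes at the position it predicts: x_j + R * O_k(x_j).
    vx = (xs + R * offset[0, ys, xs]).round().long().clamp(0, w - 1)
    vy = (ys + R * offset[1, ys, xs]).round().long().clamp(0, h - 1)
    votes = torch.zeros(h, w)
    votes.index_put_((vy, vx), torch.ones(k), accumulate=True)
    peak = votes.flatten().argmax()
    return (peak % w).item(), (peak // w).item()  # (x, y) of the activation peak
```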
3 Evaluations and Discussions
In this section, we first evaluate our method on the public dataset from the IEEE ISBI 2015 Challenge by comparing it with state-of-the-art methods. To highlight the contribution of each part of the framework, we also report the performance of different configurations in an ablation study; in particular, our experiments illustrate the flexibility of the AFPF module. Furthermore, we test our framework on other large-scale datasets from various devices in the extended experiments. The feature extraction module is pre-trained on the ImageNet dataset [25]. We resize the input image to \(800\times 640\). The entire framework is built on PyTorch and optimized with the Adadelta optimizer using its default configuration; the batch size is 1. Training takes approximately 7 h for 350 epochs on a GTX 1080 Ti GPU.
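The reported setup corresponds roughly to the following loop (a sketch: `model` is assumed to return the heat and offset maps as a pair, `loader` is a placeholder data pipeline, and `detection_loss` is the function sketched in Sect. 2.2):

```python
import torch
from torch.optim import Adadelta


def train(model, loader, epochs=350, device='cuda'):
    """Sketch of the reported setup: Adadelta with default hyper-parameters,
    batch size 1, 350 epochs. `loader` yields one (image, heat, offset)
    triple at a time, with images resized to 800x640."""
    opt = Adadelta(model.parameters())
    model.to(device).train()
    for _ in range(epochs):
        for img, heat, off in loader:
            pred_heat, pred_off = model(img.to(device))
            loss = detection_loss(pred_heat, pred_off,
                                  heat.to(device), off.to(device))
            opt.zero_grad()
            loss.backward()
            opt.step()
```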
3.1 Cephalometric Landmark Dataset
The IEEE ISBI 2015 Challenge [9] provides a public dataset for cephalometric landmark detection, the only related public dataset. It consists of 400 cephalometric radiographs, each annotated with 19 landmarks manually labeled by two doctors; the ground truth is the average of the two doctors' annotations. The images are \(1935\times 2400\) pixels in TIFF format, with a pixel spacing of 0.1 mm. Pathology types for eight standard measurement methods can be calculated from the landmark positions. We use 150 images for training, 150 for validation, and 100 for testing, and adopt the evaluation metrics of the IEEE ISBI 2015 Challenge [9]: the mean radial error (MRE), the successful detection rate (SDR) within four target radii (2 mm, 2.5 mm, 3 mm, 4 mm), and the accuracy rate for pathology classification (APC).
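These metrics are straightforward to compute from predicted and ground-truth positions; a sketch using the 0.1 mm pixel spacing stated above:

```python
import numpy as np


def mre_sdr(pred, gt, spacing=0.1, radii=(2.0, 2.5, 3.0, 4.0)):
    """Mean radial error (mm) and successful detection rates following the
    ISBI 2015 protocol. pred, gt: (N, 19, 2) arrays of pixel coordinates."""
    err = np.linalg.norm(pred - gt, axis=-1) * spacing   # radial errors in mm
    return err.mean(), {r: float((err <= r).mean()) for r in radii}
```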
3.2 Baselines
We compare our approach with the top two methods [7, 20] in the IEEE ISBI 2015 Challenge and with two newer approaches by Arik et al. [11] and Payer et al. [22]. For the ablation study, we also remove the AFPF module and the attention mechanism, respectively.
3.3 Analysis
The comparison with state-of-the-art methods is shown in Table 1. Our method (results marked in bold) performs substantially better on all measurements, achieving MREs of 1.17 mm and 1.48 mm with standard deviations of 1.19 mm and 0.77 mm on test dataset 1 and test dataset 2, respectively. In terms of the successful detection rate (SDR) at the target radii evaluated in the Challenge, our method exceeds the best prior method by 7%–11%. For the accuracy rate of pathology classification (APC), we achieve 79.05% and 81.95% on the two test datasets, with an average classification accuracy of 84.7% over all classified subjects. Payer et al. [22] report results in a different format: counting accuracy rates at the four target radii over the two test datasets combined, they obtain 73.33%, 78.76%, 83.24%, and 89.75%, respectively, while we achieve the much higher 82.03%, 88.74%, 92.74%, and 97.14%. The ablation study in Table 1 shows that the AFPF module clearly improves detection accuracy and that the attention mechanism plays an important role within it. Note that even without the AFPF module, our network, which adopts pixel-wise regression-voting based on heat maps and offset maps, already outperforms the others.
To further demonstrate the flexibility of the AFPF module, we replace the VGG-19 in the first module of our framework with other feature-extraction networks, including ResNet50 and Inception. The results in Table 2 show that the AFPF module improves performance across networks. In addition, our method is efficient, processing one image within 70 ms on a GTX 1080 Ti GPU or 7.8 s on an i7-6700K CPU.
3.4 Extended Experiments
In the real world, images captured by different devices differ significantly, and a good algorithm should adapt even to data from a new device. To test the generalization capability and stability of our approach, we use five datasets (Fig. 2) collected with four different devices, named Data-A, Data-B, Data-C, Data-D, and Data-E, where Data-D and Data-E come from the same device. In all five datasets, the 19 landmarks were manually relabeled by a single dentist to avoid inter-observer error. We use Data-A, Data-C, and Data-D for training, and Data-B and Data-E for testing. On Data-E, the MRE is 1.03 mm with SDRs of 94.2%, 96.86%, 98.16% and 99.31% at 2 mm, 2.5 mm, 3 mm and 4 mm, respectively. On Data-B, which comes from a device not represented in the training data, the MRE is 0.88 mm with SDRs of 94.73%, 97.56%, 98.80% and 99.66%. Our method thus performs well on both seen and unseen data sources and is highly practical for clinical application.
4 Conclusion
We have proposed an end-to-end deep learning framework that automatically detects cephalometric landmarks with high accuracy. In our framework, the AFPF module produces high-resolution, semantically enhanced fusion features with attention, improving prediction accuracy, and the pixel-wise regression-voting based on heat maps and offset maps further benefits performance. Our framework achieves state-of-the-art results on all evaluation metrics. Notably, it performs well even on unseen data sources, which is significant for practical deployment. Since it takes raw images and outputs landmarks directly in real time without any human intervention, it is suitable for fully automated cephalometric analysis. In the future, we will extend our method to more general landmark-detection tasks.
References
Ricketts, R.M., Roth, R.H., Chaconasand, S.J., Schulhof, R.J., Engel, G.A.: Orthodontic diagnosis and planning, vol. 1, p. 267. RMDS, Denver (1982)
Levy-Mandel, A.D., Venetsanopoulos, A.N., Tsotsos, J.K.: Knowledge-based landmarking of cephalograms. CBR 19(3), 282–309 (1986)
El-Feghi, I., Sid-Ahmed, M.A., Ahmadi, M.: Automatic localization of craniofacial landmarks for assisted cephalometry. Pattern Recogn. 37(3), 609–621 (2004)
Cootes, T.F., Taylor, C.J., Cooper, D.H., Graham, J.: Active shape models-their training and application. CVIU 61(1), 38–59 (1995)
Cootes, T.F., Edwards, G.J., Taylor, C.J.: Active appearance models. TPAMI 23(6), 681–685 (2001)
Ibragimov, B., Likar, B., Pernus, F., Vrtovec, T.: Automatic cephalometric x-ray landmark detection by applying game theory and random forests. In: ISBI (2014)
Lindner, C., Cootes, T.F.: Fully automatic cephalometric evaluation using random forest regression-voting. In: ISBI (2015)
Wang, C., Huang, C., Hsieh, M., Li, C., et al.: Evaluation and comparison of anatomical landmark detection methods for cephalometric x-ray images: a grand challenge. TMI 34(9), 1890–1900 (2015)
Wang, C., Huang, C., Lee, J., Li, C., Chang, S., et al.: A benchmark for comparison of dental radiography analysis algorithms. MIA 31, 63–76 (2016)
Lindner, C., Wang, C., Huang, C., Li, C., et al.: Fully automatic system for accurate localisation and analysis of cephalometric landmarks in lateral cephalograms. Sci. Rep. 6, 33581 (2016)
Arik, S., Ibragimov, B., Xing, L.: Fully automated quantitative cephalometry using convolutional neural networks. J. Med. Imaging 4(1), 014501 (2017)
Yue, W., Yin, D., Li, C., Wang, G., Xu, T.: Automated 2-D cephalometric analysis on x-ray images by a model-based approach. TBE 53(8), 1615–1623 (2006)
LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)
Litjens, G., Kooi, T., Bejnordi, B.E., et al.: A survey on deep learning in medical image analysis. MIA 42, 60–88 (2017)
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556 (2014)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR, pp. 770–778 (2016)
Szegedy, C., et al.: Going deeper with convolutions. In: CVPR, pp. 1–9 (2015)
Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. arXiv:1511.07122 (2015)
Vaswani, A., Shazeer, N., Parmar, N., et al.: Attention is all you need. In: NIPS, pp. 5998–6008 (2017)
Ibragimov, B., Likar, B., Pernus, F., Vrtovec, T.: Computerized cephalometry by game theory with shape- and appearance-based landmark refinement. In: ISBI (2015)
Cardillo, J., Sid-Ahmed, M.A.: An image processing system for locating craniofacial landmarks. TMI 13(2), 275–289 (1994)
Payer, C., Štern, D., Bischof, H., Urschler, M.: Integrating spatial configuration into heatmap regression based CNNs for landmark localization. MIA (2019)
Urschler, M., Ebner, T., Štern, D.: Integrating geometric configuration and appearance information into a unified framework for anatomical landmark localization. MIA 43, 23–36 (2018)
Papandreou, G., Zhu, T., Kanazawa, N., et al.: Towards accurate multi-person pose estimation in the wild. In: CVPR, vol. 3, p. 6 (2017)
Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NIPS, pp. 1097–1105 (2012)
Ma, Y., Zhu, X., Zhang, S., Yang, R., Wang, W., Manocha, D.: TrafficPredict: trajectory prediction for heterogeneous traffic-agents. arXiv:1811.02146 (2018)
Acknowledgment
This work was partially supported by the Innovative Technology Fund (ITS/411/17FX) and the General Research Fund (17210419), Hong Kong SAR.