Indoor scene understanding via RGB-D image segmentation employing depth-based CNN and CRFs

Li, Wei; Gu, Junhua; Dong, Yongfeng; Dong, Yao; Han, Jungong

doi:10.1007/s11042-019-07882-w

Published: 05 July 2019

Multimedia Tools and Applications, Volume 79, pages 35475–35489 (2020)

Wei Li (ORCID: orcid.org/0000-0002-3947-4495)^1,2,3,
Junhua Gu^4,5,
Yongfeng Dong^4,5,
Yao Dong^4,5 &
Jungong Han^6

Abstract

With the availability of low-cost depth sensing devices such as the Microsoft Kinect, interest in indoor environment understanding is growing, and semantic segmentation of RGB-D images lies at its core. Recent research shows that convolutional neural networks (CNNs) still dominate image semantic segmentation. However, the down-sampling performed within CNNs leads to unclear segmentation boundaries and poor classification accuracy. To address this problem, we propose a novel end-to-end deep architecture, termed FuseCRFNet, which seamlessly incorporates a fully connected Conditional Random Field (CRF) model into a depth-based CNN framework. The proposed method exploits pixel-to-pixel relationships to increase the accuracy of semantic segmentation. More importantly, we formulate the CRF as one of the layers in FuseCRFNet: in forward propagation it refines the coarse segmentation, and in backward propagation it passes the errors back to facilitate training. We evaluate FuseCRFNet on the SUN RGB-D dataset, and the results show that the proposed algorithm outperforms existing semantic segmentation algorithms with an improvement in accuracy of at least 2%, further verifying its effectiveness.
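The abstract does not spell out the CRF formulation, so the following is only a generic sketch of how a fully connected CRF can refine coarse per-pixel CNN scores, written as one naive mean-field update with a Gaussian affinity kernel and a Potts label compatibility. It is not the authors' FuseCRFNet layer. The function names, the single kernel, and the parameters `theta` and `weight` are assumptions made for illustration, and the O(N^2) affinity matrix is only practical for toy inputs.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mean_field_step(unary_logits, q, features, theta=1.0, weight=1.0):
    """One naive mean-field update for a fully connected CRF (hypothetical helper).

    unary_logits: (N, C) per-pixel class scores from a CNN (higher = more likely)
    q:            (N, C) current per-pixel label distributions
    features:     (N, F) per-pixel features, e.g. position, colour, depth
    """
    # Gaussian affinity between every pair of pixels.
    d2 = ((features[:, None, :] - features[None, :, :]) ** 2).sum(axis=-1)
    k = np.exp(-d2 / (2.0 * theta ** 2))
    np.fill_diagonal(k, 0.0)  # a pixel sends no message to itself

    # Message passing: affinity-weighted sum of the other pixels' beliefs.
    msg = k @ q  # shape (N, C)

    # With a Potts compatibility, the pairwise term reduces (up to a
    # per-pixel constant) to rewarding labels that similar pixels agree on.
    return softmax(unary_logits + weight * msg, axis=1)

# Toy example: pixels 0-2 are close in feature space, pixel 3 is far away.
features = np.array([[0.0], [0.1], [0.2], [5.0]])
unary = np.array([[2.0, 0.0],   # confidently class 0
                  [0.0, 0.5],   # weakly class 1 (a noisy prediction)
                  [2.0, 0.0],   # confidently class 0
                  [0.0, 2.0]])  # confidently class 1, but dissimilar
q0 = softmax(unary, axis=1)
refined = mean_field_step(unary, q0, features)
print(q0.argmax(axis=1), refined.argmax(axis=1))
```

In this toy case the weakly labelled pixel 1 is pulled toward the label of its two similar neighbours, while the dissimilar pixel 3 keeps its own label. Because every operation in the update is differentiable, a step like this can in principle sit inside a network and pass gradients back to the CNN, which is the general idea behind treating the CRF as a layer.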

Acknowledgments

This work is supported in part by the Science and Technology Program of Tianjin, China (14ZCDGSF00124), in part by the Basic Research Program of Tianjin, China (17JCTPJC55400), and in part by the NSF of Hebei Province through the Key Program under Grant F2016202144.

I would like to acknowledge Professor Junhua Gu for his support as scientific advisor and for co-authoring the paper.

Author information

Authors and Affiliations

School of Electrical Engineering, Hebei University of Technology, Tianjin, 300401, China
Wei Li
State Key Laboratory of Reliability and Intelligence of Electrical Equipment, Hebei University of Technology, Tianjin, China
Wei Li
Key Laboratory of Electromagnetic Field and Electrical Apparatus Reliability of Hebei Province, Hebei University of Technology, Tianjin, China
Wei Li
School of Artificial Intelligence, Hebei University of Technology, Tianjin, 300401, China
Junhua Gu, Yongfeng Dong & Yao Dong
Key Laboratory of Big Data Computing, Hebei, Tianjin, China
Junhua Gu, Yongfeng Dong & Yao Dong
School of Computing and Communications, Lancaster University, Lancaster, UK
Jungong Han

Corresponding author

Correspondence to Wei Li.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Li, W., Gu, J., Dong, Y. et al. Indoor scene understanding via RGB-D image segmentation employing depth-based CNN and CRFs. Multimed Tools Appl 79, 35475–35489 (2020). https://doi.org/10.1007/s11042-019-07882-w

Received: 01 February 2019
Revised: 05 April 2019
Accepted: 10 June 2019
Published: 05 July 2019
Issue Date: December 2020
DOI: https://doi.org/10.1007/s11042-019-07882-w

Keywords