Abstract
Text in videos can be categorized into three types: overlaid text, layered text, and scene text. Existing detection methods typically target one type of text and perform poorly on the others. To our knowledge, few works have attempted to build a system that detects all types of text simultaneously. In this paper, we present a unified video text detector that accurately localizes all types of text in videos. Our system consists of a spatial text detector and a temporal fusion filter. First, we explore three different strategies for training the spatial text detector, which is based on deep convolutional neural networks, so that it can detect various kinds of text without knowing the text type. Then, we propose a new area-first non-maximum suppression procedure, combined with multiple constraints, to remove redundant bounding boxes. Finally, the temporal fusion filter exploits spatial location and text component features to integrate the detection results of consecutive frames and further remove false positives. To validate the proposed approach, we carry out comprehensive experiments on three publicly available datasets that together cover overlaid text, layered text, and scene text. The experimental results demonstrate that our method consistently outperforms state-of-the-art methods.
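The abstract does not spell out the area-first non-maximum suppression step, so the following is only a minimal sketch of one plausible reading: candidate boxes are visited in order of decreasing area rather than decreasing confidence, and a smaller box is dropped when it overlaps an already kept box too strongly. The function name `area_first_nms`, the thresholds `iou_thresh` and `contain_thresh`, and the containment test standing in for the "multiple constraints" are all assumptions made for illustration, not the authors' actual procedure.

```python
from typing import List, Tuple

# Axis-aligned box as (x1, y1, x2, y2); this layout is an assumption.
Box = Tuple[float, float, float, float]


def area(b: Box) -> float:
    """Area of a box; degenerate boxes have area 0."""
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def intersection(a: Box, b: Box) -> float:
    """Area of the overlap between two boxes."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, w) * max(0.0, h)


def area_first_nms(boxes: List[Box],
                   iou_thresh: float = 0.5,
                   contain_thresh: float = 0.8) -> List[Box]:
    """Hypothetical area-first NMS: larger boxes are considered first.

    A candidate is treated as redundant if, against any kept box, either
    its IoU exceeds iou_thresh or the fraction of the candidate covered
    by the kept box exceeds contain_thresh (an illustrative constraint).
    """
    kept: List[Box] = []
    for cand in sorted(boxes, key=area, reverse=True):
        cand_area = area(cand)
        redundant = False
        for k in kept:
            inter = intersection(cand, k)
            union = cand_area + area(k) - inter
            iou = inter / union if union > 0 else 0.0
            # Zero-area candidates count as fully contained.
            contained = inter / cand_area if cand_area > 0 else 1.0
            if iou > iou_thresh or contained > contain_thresh:
                redundant = True
                break
        if not redundant:
            kept.append(cand)
    return kept


if __name__ == "__main__":
    detections = [(0, 0, 100, 20), (10, 2, 60, 18), (200, 0, 260, 20)]
    print(area_first_nms(detections))
```

In this sketch the word-level box `(10, 2, 60, 18)` lies entirely inside the larger line-level box and is suppressed, while the distant box is kept. Ordering by area favors whole text lines over fragments, which is one reason such an ordering might be preferred over score-first suppression for text.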







Notes
The dataset will be made publicly available soon.
Funding
This work is supported by the National Key R&D Program of China under contract No. 2017YFB1002203, NSFC projects under Grant 61772495, the NSFC Key Projects of International (Regional) Cooperation and Exchanges under Grant 61860206004, and the Ningbo 2025 Key Project of Science and Technology Innovation under No. 2018B10071.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Human and animal rights
All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki Declaration and its later amendments or comparable ethical standards. This article does not contain any studies with animals performed by any of the authors.
Informed consent
Informed consent was obtained from all individual participants included in the study.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Cai, Y., Wang, W. Robustly detect different types of text in videos. Neural Comput & Applic 32, 12827–12840 (2020). https://doi.org/10.1007/s00521-020-04729-6