Abstract
We present a novel form of interactive object segmentation called Click Carving, which enables accurate segmentation of objects in images and videos with only a few point clicks. Whereas conventional interactive pipelines take the user's initialization as a starting point, we show the value of the system taking the lead even during initialization. In particular, for a given image or video frame, the system precomputes a ranked list of thousands of possible segmentation hypotheses (also referred to as object region proposals) using appearance and motion cues. The user then looks at the top-ranked proposals and clicks on the object boundary to carve away erroneous ones. This process iterates (typically 2–3 times), with the system revising the top-ranked proposal set each time, until the user is satisfied with the resulting segmentation mask. In the case of images, this mask is the final object segmentation. In the case of videos, the object region proposals rely on motion as well, and the resulting segmentation mask in the first frame is further propagated across the video to obtain a complete spatio-temporal object tube. On six challenging image and video datasets, we provide extensive comparisons with both existing work and simpler alternative methods. Overall, the proposed Click Carving approach strikes an excellent balance between accuracy and human effort. It outperforms all similarly fast methods, and is competitive with or better than those requiring 2–12 times the effort.
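To make the interaction loop concrete, the following is a minimal sketch of the carving iteration described above, not the authors' implementation. It assumes (our assumptions, not from the paper) that proposals are binary NumPy masks with precomputed scores, and that a boundary click removes every proposal whose boundary does not pass within a small pixel tolerance of the click; the paper's actual proposal generation and ranking cues are richer.

```python
# Hypothetical sketch of the Click Carving interaction loop (not the paper's code).
import numpy as np
from scipy.ndimage import binary_erosion

def boundary_pixels(mask):
    """Return (row, col) coordinates of the mask's boundary pixels."""
    boundary = mask & ~binary_erosion(mask)
    return np.argwhere(boundary)

def carve(proposals, scores, click, tol=5.0):
    """Keep only proposals whose boundary passes within `tol` pixels of the click."""
    kept_masks, kept_scores = [], []
    click = np.asarray(click, dtype=float)
    for mask, score in zip(proposals, scores):
        pts = boundary_pixels(mask)
        if len(pts) and np.min(np.linalg.norm(pts - click, axis=1)) <= tol:
            kept_masks.append(mask)
            kept_scores.append(score)
    return kept_masks, kept_scores

def click_carving_loop(proposals, scores, get_user_click, is_satisfied):
    """Show the top-ranked proposal; take a boundary click; carve; re-rank; repeat."""
    while proposals and not is_satisfied(proposals[int(np.argmax(scores))]):
        click = get_user_click()  # (row, col) placed on the true object boundary
        proposals, scores = carve(proposals, scores, click)
    return proposals[int(np.argmax(scores))] if proposals else None
```

In the video setting, the mask returned for the first frame would then be handed to a propagation method to produce the spatio-temporal object tube.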
Notes
More details and videos can be found at: http://vision.cs.utexas.edu/projects/clickcarving/.
Code available at: http://vision.cs.utexas.edu/projects/clickcarving/.
The unsupervised NLC method (Faktor and Irani 2014) reports state-of-the-art results on a subset of the SegTrack-v2 dataset. We were unable to reproduce those results using the publicly available NLC code, possibly because of an OS incompatibility.
IVID (Shankar Nagaraja et al. 2015) does not report annotation times for SegTrack-v2. The VSB100 dataset was also not used in their experiments.
References
Acuna, D., Ling, H., Kar, A., & Fidler, S. (2018). Efficient interactive annotation of segmentation datasets with Polygon-RNN++. In CVPR.
Arbeláez, P., Pont-Tuset, J., Barron, J., Marques, F., & Malik, J. (2014). Multiscale combinatorial grouping. In CVPR.
Badrinarayanan, V., Galasso, F., & Cipolla, R. (2010). Label propagation in video sequences. In CVPR.
Bai, X., & Sapiro, G. (2007). Distancecut: Interactive segmentation and matting of images and videos. In 2007 IEEE international conference on image processing.
Bai, X., Wang, J., Simons, D., & Sapiro, G. (2009) Video snapcut: Robust video object cutout using localized classifiers. In SIGGRAPH.
Batra, D., Kowdle, A., Parikh, D., Luo, J., & Chen, T. (2010). iCoseg: Interactive co-segmentation with intelligent scribble guidance. In CVPR.
Bearman, A., Russakovsky, O., Ferrari, V., & Fei-Fei, L. (2015). What’s the point: Semantic segmentation with point supervision. ArXiv e-prints.
Bell, S., Upchurch, P., Snavely, N., & Bala, K. (2015). Material recognition in the wild with the materials in context database. In Computer Vision and Pattern Recognition (CVPR).
Boykov, Y., & Jolly, M. (2001). Interactive graph cuts for optimal boundary and region segmentation of objects in N-D images. In CVPR.
Carreira, J., & Sminchisescu, C. (2012). CPMC: Automatic object segmentation using constrained parametric min-cuts. PAMI, 34(7), 1312–1328.
Castrejón, L., Kundu, K., Urtasun, R., & Fidler, S. (2017). Annotating object instances with a polygon-rnn. In CVPR.
Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. L. (2015). Semantic image segmentation with deep convolutional nets and fully connected crfs. In ICLR.
Cheng, M.-M., Zhang, G.-X., Mitra, N. J., Huang, X., & Hu, S.-M. (2011). Global contrast based salient region detection. In CVPR (pp. 409–416).
Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., & Schiele, B. (2016). The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).
Faktor, A., & Irani, M. (2014). Video segmentation by non-local consensus voting. In Proceedings of the British machine vision conference. BMVA Press.
Fathi, A., Balcan, M., Ren, X., & Rehg, J. (2011). Combining self training and active learning for video segmentation. In BMVC.
Fragkiadaki, K., Arbelaez, P., Felsen, P., & Malik, J. (2015). Learning to segment moving objects in videos. In CVPR.
Galasso, F., Nagaraja, N. S., Cardenas, T. J., Brox, T., & Schiele, B. (2013). A unified video segmentation benchmark: Annotation, metrics and analysis. In ICCV.
Godec, M., Roth, P. M., & Bischof, H. (2011). Hough-based tracking of non-rigid objects. In ICCV.
Grundmann, M., Kwatra, V., Han, M., & Essa, I. (2010). Efficient hierarchical graph based video segmentation. In CVPR.
Gulshan, V., Rother, C., Criminisi, A., Blake, A., & Zisserman, A. (2010). Geodesic star convexity for interactive image segmentation. In CVPR.
Jain, S., & Grauman, K. (2013). Predicting sufficient annotation strength for interactive foreground segmentation. In ICCV.
Jain, S. D., & Grauman, K. (2014). Supervoxel-consistent foreground propagation in video. In ECCV 2014. Lecture notes in computer science (pp. 656–671). Springer.
Jain, S. D., & Grauman, K. (2016). Click carving: Segmenting objects in video with point clicks. In AAAI conference on human computation and crowdsourcing (HCOMP).
Jiang, B., Zhang, L., Lu, H., Yang, C., & Yang, M.-H. (2013). Saliency detection via absorbing markov chain. In ICCV.
Karasev, V., Ravichandran, A., & Soatto, S. (2014). Active frame, location, and detector selection for automated and manual video annotation. In Proceedings of the IEEE conference on computer vision and pattern recognition.
Kass, M., Witkin, A., & Terzopoulos, D. (1988). Snakes: Active contour models. IJCV, 1(4), 321–331.
Kohli, P., Nickisch, H., Rother, C., & Rhemann, C. (2012). User-centric learning and evaluation of interactive segmentation systems. IJCV, 100(3), 261–274.
Krähenbühl, P., & Koltun, V. (2014). In Computer vision—ECCV 2014: 13th European conference, Zurich, Switzerland, September 6–12, 2014, proceedings, part V, chapter geodesic object proposals (pp. 725–739). Cham: Springer.
Krause, A., & Guestrin, C. (2007). Near-optimal observation selection using submodular functions. In National conference on artificial intelligence (AAAI), nectar track.
Lee, Y. J., Kim, J., & Grauman, K. (2011). Key-segments for video object segmentation. In ICCV.
Lempitsky, V. S., Kohli, P., Rother, C., & Sharp, T. (2009). Image segmentation with a bounding box prior. In ICCV.
Levinkov, E., Tompkin, J., Bonneel, N., Kirchhoff, S., Andres, B., & Pfister, H. (2016). Interactive multicut video segmentation. In Proceedings of the 24th Pacific conference on computer graphics and applications: Short papers (pp. 33–38).
Li, F., Kim, T., Humayun, A., Tsai, D., & Rehg, J. M. (2013). Video segmentation by tracking many figure-ground segments. In ICCV.
Li, X., Zhao, L., Wei, L., Yang, M.-H., Fei, W., Zhuang, Y., et al. (2016). DeepSaliency: Multi-task deep neural network model for salient object detection. IEEE TIP, 25(8), 3919–3930.
Li, Y., Hou, X., Koch, C., Rehg, J. M., & Yuille, A. L. (2014). The secrets of salient object segmentation. In CVPR.
Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. In ECCV.
Liu, T., Yuan, Z., Sun, J., Wang, J., Zheng, N., Tang, X., et al. (2011). Learning to detect a salient object. PAMI, 33(2), 353–367.
Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In CVPR.
Ma, T., & Latecki, L. (2012). Maximum weight cliques with mutex constraints for video object segmentation. In CVPR.
Malisiewicz, T., & Efros, A. A. (2007). Spatial support for objects via multiple segmentations. In BMVC.
Malmberg, F., Strand, R., & Nyström, I. (2011). Generalized hard constraints for graph segmentation. In SCIA.
McGuinness, K., & O’Connor, N. E. (2010). A comparative evaluation of interactive segmentation algorithms. Pattern Recognition, 43(2), 434–444. Interactive Imaging and Vision.
Mortensen, E., & Barrett, W. (1995). Intelligent scissors for image composition. In SIGGRAPH.
Nickisch, H., Rother, C., Kohli, P., & Rhemann, C. (2010). Learning an interactive segmentation system. In Proceedings of the seventh Indian conference on computer vision, graphics and image processing, ICVGIP ’10 (pp. 274–281). New York, NY: ACM.
Noh, H., Hong, S., & Han, B. (2015). Learning deconvolution network for semantic segmentation. In 2015 IEEE international conference on computer vision (ICCV).
Oneata, D., Revaud, J., Verbeek, J., & Schmid, C. (2014). Spatio-temporal object detection proposals. In ECCV.
Papadopoulos, D., Uijlings, J., Keller, F., & Ferrari, V. (2017). Training object class detectors with click supervision. In CVPR.
Papazoglou, A., & Ferrari, V. (2013). Fast object segmentation in unconstrained video. In ICCV.
Perazzi, F., Krähenbühl, P., Pritch, Y., & Hornung, A. (2012). Saliency filters: Contrast based filtering for salient region detection. In CVPR (pp. 733–740).
Pinheiro, P. O., Collobert, R., & Dollár, P. (2015). Learning to segment object candidates. In NIPS.
Pont-Tuset, J., Farré, M. A., & Smolic, A. (2015). Semi-automatic video object segmentation by advanced manipulation of segmentation hierarchies. In International workshop on content-based multimedia indexing (CBMI).
Ren, X., & Malik, J. (2007). Tracking as repeated figure/ground segmentation. In CVPR.
Rother, C., Kolmogorov, V., & Blake, A. (2004). GrabCut: Interactive foreground extraction using iterated graph cuts. In SIGGRAPH.
Russakovsky, O., Li, L.-J., & Fei-Fei, L. (2015). Best of both worlds: Human–machine collaboration for object annotation. In CVPR.
Shankar Nagaraja, N., Schmidt, F. R., & Brox, T. (2015). Video segmentation with just a few strokes. In ICCV.
Sundberg, P., Brox, T., Maire, M., Arbelaez, P., & Malik, J. (2011). Occlusion boundary detection and figure/ground assignment from optical flow. In CVPR, Washington, DC, USA.
Tsai, D., Flagg, M., & Rehg, J. (2010). Motion coherent tracking with multi-label mrf optimization. In BMVC.
The OpenCV reference manual, 2.4.9.0 edition, April 2014.
Uijlings, J. R. R., van de Sande, K. E. A., Gevers, T., & Smeulders, A. W. M. (2013). Selective search for object recognition. International Journal of Computer Vision, 104(2), 154–171.
Vijayanarasimhan, S., & Grauman, K. (2012). Active frame selection for label propagation in videos. In ECCV.
Vondrick, C., & Ramanan, D. (2011). Video annotation and tracking with active learning. In NIPS.
Wang, J., Bhat, P., Colburn, A., Agrawala, M., & Cohen, M. F. (2005). Interactive video cutout. ACM Transactions on Graphics, 24(3), 585–594.
Wang, T., Han, B., & Collomosse, J. (2014). Touchcut: Fast image and video segmentation using single-touch interaction. Computer Vision and Image Understanding, 120, 14–30.
Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2015). Learning to detect motion boundaries. In CVPR 2015, Boston, United States.
Wen, L., Du, D., Lei, Z., Li, S. Z., & Yang, M.-H. (2015). Jots: Joint online tracking and segmentation. In CVPR.
Wu, Z., Li, F., Sukthankar, R., & Rehg, J. M. (2015). Robust video segment proposals with painless occlusion handling. In CVPR.
Xu, N., Price, B. L., Cohen, S., Yang, J., & Huang, T. S. (2016). Deep interactive object selection. In CVPR (pp. 373–381).
Yu, G., & Yuan, J. (2015). Fast action proposals for human action detection and search. In CVPR.
Zhang, D., Javed, O., & Shah, M. (2013). Video object segmentation through spatially accurate and temporally dense extraction of primary object regions. In CVPR.
Zhao, R., Ouyang, W., Li, H., & Wang, X. (2015). Saliency detection by multi-context learning. In CVPR.
Zheng, S., Jayasumana, S., Romera-Paredes, B., Vineet, V., Su, Z., Du, D., et al. (2015). Conditional random fields as recurrent neural networks. In ICCV.
Acknowledgements
This research is supported in part by ONR PECASE N00014-15-1-2291, NSF IIS-1514118, a gift from Qualcomm and a gift from AWS Machine Learning. We would like to thank Shankar Nagaraja for providing the iVideoseg dataset timing data. We also thank all the participants in our user studies.
Additional information
Communicated by Jakob Verbeek.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Jain, S.D., Grauman, K. Click Carving: Interactive Object Segmentation in Images and Videos with Point Clicks. Int J Comput Vis 127, 1321–1344 (2019). https://doi.org/10.1007/s11263-019-01184-2