Abstract
Every day, countless surgeries are performed worldwide, each within the distinct settings of operating rooms (ORs) that vary not only in their setups but also in the personnel, tools, and equipment used. This inherent diversity poses a substantial challenge for achieving a holistic understanding of the OR, as it requires models to generalize beyond their initial training datasets. To reduce this gap, we introduce ORacle, an advanced vision-language model designed for holistic OR domain modeling, which incorporates multi-view and temporal capabilities and can leverage external knowledge during inference, enabling it to adapt to previously unseen surgical scenarios. This capability is further enhanced by our novel data augmentation framework, which significantly diversifies the training dataset, ensuring ORacle's proficiency in applying the provided knowledge effectively. In rigorous testing on scene graph generation and downstream tasks on the 4D-OR dataset, ORacle not only demonstrates state-of-the-art performance but does so while requiring less data than existing models. Furthermore, its adaptability is displayed through its ability to interpret unseen views, actions, and appearances of tools and equipment. This demonstrates ORacle's potential to significantly enhance the scalability and affordability of OR domain modeling and opens a pathway for future advancements in surgical data science. Our code, pretrained models, and data are publicly available at https://github.com/egeozsoy/Oracle.
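The abstract describes ORacle as generating semantic scene graphs of the OR, i.e., sets of (subject, predicate, object) triplets such as "head surgeon, cutting, patient". As a minimal sketch of what consuming such output could look like, the snippet below parses a triplet string into structured tuples; the `<subject, predicate, object>` serialization and the `parse_scene_graph` helper are illustrative assumptions, not ORacle's actual output format or API.

```python
import re


def parse_scene_graph(text: str):
    """Parse a string of <subject, predicate, object> triplets into tuples.

    NOTE: this serialization is a hypothetical example; the format emitted
    by the actual ORacle model may differ.
    """
    # Capture everything between each pair of angle brackets,
    # then split on commas into the three triplet components.
    return [
        tuple(part.strip() for part in match.split(","))
        for match in re.findall(r"<([^>]+)>", text)
    ]


example = "<head surgeon, cutting, patient>; <nurse, holding, instrument table>"
print(parse_scene_graph(example))
# [('head surgeon', 'cutting', 'patient'), ('nurse', 'holding', 'instrument table')]
```

Representing the scene as triplets like this is what allows downstream tasks (e.g., role prediction or phase recognition) to query the graph symbolically rather than re-processing raw video.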
E. Özsoy and C. Pellegrini—Equal contribution.
Acknowledgements
This work has been supported by Software Campus and the Bundesministerium für Bildung und Forschung (BMBF) under grant [ZN 01IS17049]. The authors have also been partially supported by Stryker and by the EVUK programme ("Next-generation AI for Integrated Diagnostics") of the Free State of Bavaria.
Ethics declarations
Disclosure of Interests
The authors have no competing interests to declare that are relevant to the content of this article.
1 Electronic supplementary material
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Özsoy, E., Pellegrini, C., Keicher, M., Navab, N. (2024). ORacle: Large Vision-Language Models for Knowledge-Guided Holistic OR Domain Modeling. In: Linguraru, M.G., et al. Medical Image Computing and Computer Assisted Intervention – MICCAI 2024. MICCAI 2024. Lecture Notes in Computer Science, vol 15006. Springer, Cham. https://doi.org/10.1007/978-3-031-72089-5_43
DOI: https://doi.org/10.1007/978-3-031-72089-5_43
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72088-8
Online ISBN: 978-3-031-72089-5
eBook Packages: Computer Science, Computer Science (R0)