Title: 🚀 ICRA 2026: SurgVidLM: Towards Multi-grained Video Understanding with Large Language Model in Robot-assisted Surgery!

Thrilled to share our latest work, SurgVidLM, the first video-language model specifically designed to address both full and fine-grained surgical video comprehension.

Surgical scene understanding is critical for surgeon training and robotic decision-making. While current Multimodal Large Language Models (MLLMs) excel at image analysis, they often overlook the fine-grained temporal reasoning required to capture detailed task execution and the specific procedural steps within a surgery. This motivated us to bridge the gap between global video understanding and micro-action analysis.

🧠✨ What we developed:

A novel framework and resource for surgical video reasoning that includes:

๐Ÿ”น ๐“๐ฐ๐จ-๐ฌ๐ญ๐š๐ ๐ž ๐’๐ญ๐š๐ ๐ž๐…๐จ๐œ๐ฎ๐ฌ ๐ฆ๐ž๐œ๐ก๐š๐ง๐ข๐ฌ๐ฆ: The first stage extracts global procedural context, while the second stage performs high-frequency local analysis for fine-grained task execution.

๐Ÿ”น ๐Œ๐ฎ๐ฅ๐ญ๐ข-๐Ÿ๐ซ๐ž๐ช๐ฎ๐ž๐ง๐œ๐ฒ ๐…๐ฎ๐ฌ๐ข๐จ๐ง ๐€๐ญ๐ญ๐ž๐ง๐ญ๐ข๐จ๐ง (๐Œ๐…๐€): Effectively integrates low-frequency global features with high-frequency local details to ensure comprehensive scene perception.

๐Ÿ”น ๐’๐•๐”-๐Ÿ‘๐Ÿ๐Š ๐ƒ๐š๐ญ๐š๐ฌ๐ž๐ญ: We constructed a large-scale dataset with over 31,000 video-instruction pairs, featuring hierarchical knowledge representation for enhanced visual reasoning.

🎯 Key Results:

✅ SurgVidLM significantly outperforms existing models (like Qwen2-VL) in multi-grained surgical video understanding tasks.

✅ Capable of inferring anatomical landmarks (e.g., Denonvilliers’ fascia) and providing the clinical motivation behind actions, moving beyond simple visual description.

✅ Demonstrated strong performance on unseen surgical tasks, underscoring the robustness of our hierarchical training approach.

💡 Why it matters:

This work shows that by combining global context with localized high-frequency focus, we can significantly reduce “hallucinations” in surgical AI. It provides a pathway toward more intelligent, context-aware surgical assistants that can understand not just what is happening, but how and why specific steps are performed.

🌱 What’s next?

We are exploring how to extend this multi-grained understanding to real-time intraoperative guidance, and how to integrate it with physical robotic control for autonomous sub-tasks.
