Understanding with Large Language Model in Robot-Assisted Surgery!
Thrilled to share our latest work, SurgVidLM, the first video-language model specifically designed to address both full and fine-grained surgical video comprehension.
Surgical scene understanding is critical for surgical training and robotic decision-making. While current Multimodal Large Language Models (MLLMs) excel at image analysis, they often overlook the fine-grained temporal reasoning required to capture detailed task execution and specific procedural steps within a surgery. This motivated us to bridge the gap between global video understanding and micro-action analysis.
🧠✨ What we developed:
A novel framework and resource for surgical video reasoning that includes:
🔹 Two-stage StageFocus mechanism: The first stage extracts global procedural context, while the second stage performs high-frequency local analysis for fine-grained task execution.
🔹 Multi-frequency Fusion Attention (MFA): Effectively integrates low-frequency global features with high-frequency local details to ensure comprehensive scene perception (see the sketch after this list).
🔹 SVU-31K Dataset: We constructed a large-scale dataset with over 31,000 video-instruction pairs, featuring hierarchical knowledge representation for enhanced visual reasoning.
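For intuition, here is a minimal PyTorch sketch of how a two-stage, multi-frequency fusion of this kind could look: densely sampled local tokens cross-attend to sparsely sampled global tokens, and the two streams are then fused. The class and argument names (MultiFrequencyFusionAttention, d_model, n_heads) and the frame counts are illustrative assumptions, not the released SurgVidLM implementation.

```python
# Illustrative sketch only; names, dimensions, and frame counts are assumptions,
# not the authors' released code.
import torch
import torch.nn as nn


class MultiFrequencyFusionAttention(nn.Module):
    """Cross-attention that lets high-frequency local tokens attend to
    low-frequency global context, then fuses both streams."""

    def __init__(self, d_model: int = 768, n_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.fuse = nn.Sequential(
            nn.LayerNorm(2 * d_model),
            nn.Linear(2 * d_model, d_model),
        )

    def forward(self, local_tokens: torch.Tensor, global_tokens: torch.Tensor) -> torch.Tensor:
        # Local (densely sampled clip) tokens query the sparsely sampled
        # full-procedure tokens to retrieve procedural context.
        ctx, _ = self.cross_attn(query=local_tokens, key=global_tokens, value=global_tokens)
        # Concatenate local details with the retrieved global context
        # and project back to the model dimension.
        return self.fuse(torch.cat([local_tokens, ctx], dim=-1))


if __name__ == "__main__":
    # Stage 1: low-frequency sampling across the whole procedure (e.g. 32 frames).
    global_tokens = torch.randn(1, 32, 768)
    # Stage 2: high-frequency sampling of the clip under analysis (e.g. 128 frames).
    local_tokens = torch.randn(1, 128, 768)
    fused = MultiFrequencyFusionAttention()(local_tokens, global_tokens)
    print(fused.shape)  # torch.Size([1, 128, 768])
```

The fused tokens would then be passed to the language model so that answers about a short clip remain grounded in the surrounding procedural context.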
🎯 Key Results:
✅ SurgVidLM significantly outperforms existing models (like Qwen2-VL) in multi-grained surgical video understanding tasks.
✅ Capable of inferring anatomical landmarks (e.g., Denonvilliers’ fascia) and articulating the clinical rationale behind actions, moving beyond simple visual description.
✅ Demonstrated strong performance on unseen surgical tasks, highlighting the robustness of our hierarchical training approach.
💡 Why it matters:
This work shows that by combining global context with localized high-frequency focus, we can significantly reduce “hallucinations” in surgical AI. It provides a pathway toward more intelligent, context-aware surgical assistants that can understand not just what is happening, but how and why specific steps are performed.
🌱 What’s next?
We are exploring how to extend this multi-grained understanding to real-time intraoperative guidance and how to integrate it with physical robotic control for autonomous sub-tasks.

