{"id":3438,"date":"2026-02-12T14:11:20","date_gmt":"2026-02-12T14:11:20","guid":{"rendered":"http:\/\/www.labren.org\/mm\/?p=3438"},"modified":"2026-02-12T14:16:28","modified_gmt":"2026-02-12T14:16:28","slug":"title-%f0%9f%9a%80-icra-2026-%f0%9d%91%ba%f0%9d%92%96%f0%9d%92%93%f0%9d%92%88%f0%9d%91%bd%f0%9d%92%8a%f0%9d%92%85%f0%9d%91%b3%f0%9d%91%b4-%f0%9d%91%bb%f0%9d%92%90%f0%9d%92%98%f0%9d%92%82","status":"publish","type":"post","link":"http:\/\/www.labren.org\/mm\/news\/title-%f0%9f%9a%80-icra-2026-%f0%9d%91%ba%f0%9d%92%96%f0%9d%92%93%f0%9d%92%88%f0%9d%91%bd%f0%9d%92%8a%f0%9d%92%85%f0%9d%91%b3%f0%9d%91%b4-%f0%9d%91%bb%f0%9d%92%90%f0%9d%92%98%f0%9d%92%82\/","title":{"rendered":"Title: \ud83d\ude80 ICRA 2026: \ud835\udc7a\ud835\udc96\ud835\udc93\ud835\udc88\ud835\udc7d\ud835\udc8a\ud835\udc85\ud835\udc73\ud835\udc74: \ud835\udc7b\ud835\udc90\ud835\udc98\ud835\udc82\ud835\udc93\ud835\udc85\ud835\udc94 \ud835\udc74\ud835\udc96\ud835\udc8d\ud835\udc95\ud835\udc8a-\ud835\udc88\ud835\udc93\ud835\udc82\ud835\udc8a\ud835\udc8f\ud835\udc86\ud835\udc85 \ud835\udc7d\ud835\udc8a\ud835\udc85\ud835\udc86\ud835\udc90"},"content":{"rendered":"\n<p>\ud835\udc14\ud835\udc27\ud835\udc1d\ud835\udc1e\ud835\udc2b\ud835\udc2c\ud835\udc2d\ud835\udc1a\ud835\udc27\ud835\udc1d\ud835\udc22\ud835\udc27\ud835\udc20 \ud835\udc30\ud835\udc22\ud835\udc2d\ud835\udc21 \ud835\udc0b\ud835\udc1a\ud835\udc2b\ud835\udc20\ud835\udc1e \ud835\udc0b\ud835\udc1a\ud835\udc27\ud835\udc20\ud835\udc2e\ud835\udc1a\ud835\udc20\ud835\udc1e \ud835\udc0c\ud835\udc28\ud835\udc1d\ud835\udc1e\ud835\udc25 \ud835\udc22\ud835\udc27 \ud835\udc11\ud835\udc28\ud835\udc1b\ud835\udc28\ud835\udc2d-\ud835\udc1a\ud835\udc2c\ud835\udc2c\ud835\udc22\ud835\udc2c\ud835\udc2d\ud835\udc1e\ud835\udc1d \ud835\udc12\ud835\udc2e\ud835\udc2b\ud835\udc20\ud835\udc1e\ud835\udc2b\ud835\udc32!<\/p>\n\n\n\n<p>Thrilled to share our latest work, 
\ud835\udc12\ud835\udc2e\ud835\udc2b\ud835\udc20\ud835\udc15\ud835\udc22\ud835\udc1d\ud835\udc0b\ud835\udc0c, the first video-language model specifically designed to address both full and fine-grained surgical video comprehension.<\/p>\n\n\n\n<p>Surgical scene understanding is critical for training and robotic decision-making. While current Multimodal Large Language Models (MLLMs) excel at image analysis, they often overlook the fine-grained temporal reasoning required to capture detailed task execution and specific procedural processes within a surgery. This motivated us to bridge the gap between global video understanding and micro-action analysis.<\/p>\n\n\n\n<p>\ud83e\udde0\u2728 What we developed:<\/p>\n\n\n\n<p>A novel framework and resource for surgical video reasoning that includes:<\/p>\n\n\n\n<p>\ud83d\udd39 \ud835\udc13\ud835\udc30\ud835\udc28-\ud835\udc2c\ud835\udc2d\ud835\udc1a\ud835\udc20\ud835\udc1e \ud835\udc12\ud835\udc2d\ud835\udc1a\ud835\udc20\ud835\udc1e\ud835\udc05\ud835\udc28\ud835\udc1c\ud835\udc2e\ud835\udc2c \ud835\udc26\ud835\udc1e\ud835\udc1c\ud835\udc21\ud835\udc1a\ud835\udc27\ud835\udc22\ud835\udc2c\ud835\udc26: The first stage extracts global procedural context, while the second stage performs high-frequency local analysis for fine-grained task execution.<\/p>\n\n\n\n<p>\ud83d\udd39 \ud835\udc0c\ud835\udc2e\ud835\udc25\ud835\udc2d\ud835\udc22-\ud835\udc1f\ud835\udc2b\ud835\udc1e\ud835\udc2a\ud835\udc2e\ud835\udc1e\ud835\udc27\ud835\udc1c\ud835\udc32 \ud835\udc05\ud835\udc2e\ud835\udc2c\ud835\udc22\ud835\udc28\ud835\udc27 \ud835\udc00\ud835\udc2d\ud835\udc2d\ud835\udc1e\ud835\udc27\ud835\udc2d\ud835\udc22\ud835\udc28\ud835\udc27 (\ud835\udc0c\ud835\udc05\ud835\udc00): Effectively integrates low-frequency global features with high-frequency local details to ensure comprehensive scene perception.<\/p>\n\n\n\n<p>\ud83d\udd39 \ud835\udc12\ud835\udc15\ud835\udc14-\ud835\udfd1\ud835\udfcf\ud835\udc0a 
\ud835\udc03\ud835\udc1a\ud835\udc2d\ud835\udc1a\ud835\udc2c\ud835\udc1e\ud835\udc2d: We constructed a large-scale dataset with over 31,000 video-instruction pairs, featuring hierarchical knowledge representation for enhanced visual reasoning.<\/p>\n\n\n\n<p>\ud83c\udfaf Key Results:<\/p>\n\n\n\n<p>\u2705 SurgVidLM significantly outperforms existing models (like Qwen2-VL) in multi-grained surgical video understanding tasks.<\/p>\n\n\n\n<p>\u2705 It can infer anatomical landmarks (e.g., Denonvilliers&#8217; fascia) and provide clinical motivation, moving beyond simple visual description.<\/p>\n\n\n\n<p>\u2705 It performs strongly on unseen surgical tasks, supporting the robustness of our hierarchical training approach.<\/p>\n\n\n\n<p>\ud83d\udca1 Why it matters:<\/p>\n\n\n\n<p>This work shows that by combining global context with localized high-frequency focus, we can significantly reduce &#8220;hallucinations&#8221; in surgical AI. It provides a pathway toward more intelligent, context-aware surgical assistants that can understand not just what is happening, but how and why specific steps are performed.<\/p>\n\n\n\n<p>\ud83c\udf31 What\u2019s next?<\/p>\n\n\n\n<p>We are exploring how to extend this multi-grained understanding to real-time intraoperative guidance and how to integrate it with physical robotic control for autonomous sub-tasks.<\/p>\n\n\n\n<figure class=\"wp-block-gallery has-nested-images columns-default is-cropped wp-block-gallery-1 is-layout-flex wp-block-gallery-is-layout-flex\">\n<figure class=\"wp-block-image size-large\"><a href=\"http:\/\/www.labren.org\/mm\/wp-content\/uploads\/2026\/02\/Snipaste_2026-02-12_22-13-34.png\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"602\" data-id=\"3441\" src=\"http:\/\/www.labren.org\/mm\/wp-content\/uploads\/2026\/02\/Snipaste_2026-02-12_22-13-34-1024x602.png\" alt=\"\" class=\"wp-image-3441\" 
srcset=\"http:\/\/www.labren.org\/mm\/wp-content\/uploads\/2026\/02\/Snipaste_2026-02-12_22-13-34-1024x602.png 1024w, http:\/\/www.labren.org\/mm\/wp-content\/uploads\/2026\/02\/Snipaste_2026-02-12_22-13-34-300x176.png 300w, http:\/\/www.labren.org\/mm\/wp-content\/uploads\/2026\/02\/Snipaste_2026-02-12_22-13-34-768x452.png 768w, http:\/\/www.labren.org\/mm\/wp-content\/uploads\/2026\/02\/Snipaste_2026-02-12_22-13-34-1536x903.png 1536w, http:\/\/www.labren.org\/mm\/wp-content\/uploads\/2026\/02\/Snipaste_2026-02-12_22-13-34-595x350.png 595w, http:\/\/www.labren.org\/mm\/wp-content\/uploads\/2026\/02\/Snipaste_2026-02-12_22-13-34-150x88.png 150w, http:\/\/www.labren.org\/mm\/wp-content\/uploads\/2026\/02\/Snipaste_2026-02-12_22-13-34-473x278.png 473w, http:\/\/www.labren.org\/mm\/wp-content\/uploads\/2026\/02\/Snipaste_2026-02-12_22-13-34.png 1757w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/a><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><a href=\"http:\/\/www.labren.org\/mm\/wp-content\/uploads\/2026\/02\/Snipaste_2026-02-12_22-13-50.png\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"739\" data-id=\"3440\" src=\"http:\/\/www.labren.org\/mm\/wp-content\/uploads\/2026\/02\/Snipaste_2026-02-12_22-13-50-1024x739.png\" alt=\"\" class=\"wp-image-3440\" srcset=\"http:\/\/www.labren.org\/mm\/wp-content\/uploads\/2026\/02\/Snipaste_2026-02-12_22-13-50-1024x739.png 1024w, http:\/\/www.labren.org\/mm\/wp-content\/uploads\/2026\/02\/Snipaste_2026-02-12_22-13-50-300x217.png 300w, http:\/\/www.labren.org\/mm\/wp-content\/uploads\/2026\/02\/Snipaste_2026-02-12_22-13-50-768x554.png 768w, http:\/\/www.labren.org\/mm\/wp-content\/uploads\/2026\/02\/Snipaste_2026-02-12_22-13-50-485x350.png 485w, http:\/\/www.labren.org\/mm\/wp-content\/uploads\/2026\/02\/Snipaste_2026-02-12_22-13-50-150x108.png 150w, http:\/\/www.labren.org\/mm\/wp-content\/uploads\/2026\/02\/Snipaste_2026-02-12_22-13-50-385x278.png 385w, 
http:\/\/www.labren.org\/mm\/wp-content\/uploads\/2026\/02\/Snipaste_2026-02-12_22-13-50.png 1471w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/a><\/figure>\n<\/figure>\n","protected":false},"excerpt":{"rendered":"<p>\ud835\udc14\ud835\udc27\ud835\udc1d\ud835\udc1e\ud835\udc2b\ud835\udc2c\ud835\udc2d\ud835\udc1a\ud835\udc27\ud835\udc1d\ud835\udc22\ud835\udc27\ud835\udc20 \ud835\udc30\ud835\udc22\ud835\udc2d\ud835\udc21 \ud835\udc0b\ud835\udc1a\ud835\udc2b\ud835\udc20\ud835\udc1e \ud835\udc0b\ud835\udc1a\ud835\udc27\ud835\udc20\ud835\udc2e\ud835\udc1a\ud835\udc20\ud835\udc1e \ud835\udc0c\ud835\udc28\ud835\udc1d\ud835\udc1e\ud835\udc25 \ud835\udc22\ud835\udc27 \ud835\udc11\ud835\udc28\ud835\udc1b\ud835\udc28\ud835\udc2d-\ud835\udc1a\ud835\udc2c\ud835\udc2c\ud835\udc22\ud835\udc2c\ud835\udc2d\ud835\udc1e\ud835\udc1d \ud835\udc12\ud835\udc2e\ud835\udc2b\ud835\udc20\ud835\udc1e\ud835\udc2b\ud835\udc32! Thrilled to share our latest work, \ud835\udc12\ud835\udc2e\ud835\udc2b\ud835\udc20\ud835\udc15\ud835\udc22\ud835\udc1d\ud835\udc0b\ud835\udc0c, the first video-language model specifically designed to address both full and fine-grained surgical video comprehension. Surgical scene understanding is critical for training and robotic decision-making. 
While current Multimodal Large Language Models (MLLMs) excel at image\u2026 <a class=\"continue-reading-link\" href=\"http:\/\/www.labren.org\/mm\/news\/title-%f0%9f%9a%80-icra-2026-%f0%9d%91%ba%f0%9d%92%96%f0%9d%92%93%f0%9d%92%88%f0%9d%91%bd%f0%9d%92%8a%f0%9d%92%85%f0%9d%91%b3%f0%9d%91%b4-%f0%9d%91%bb%f0%9d%92%90%f0%9d%92%98%f0%9d%92%82\/\">Continue reading<\/a><\/p>\n","protected":false},"author":17,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"ngg_post_thumbnail":0,"footnotes":""},"categories":[4],"tags":[],"class_list":["post-3438","post","type-post","status-publish","format-standard","hentry","category-news"],"_links":{"self":[{"href":"http:\/\/www.labren.org\/mm\/wp-json\/wp\/v2\/posts\/3438","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/www.labren.org\/mm\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/www.labren.org\/mm\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/www.labren.org\/mm\/wp-json\/wp\/v2\/users\/17"}],"replies":[{"embeddable":true,"href":"http:\/\/www.labren.org\/mm\/wp-json\/wp\/v2\/comments?post=3438"}],"version-history":[{"count":2,"href":"http:\/\/www.labren.org\/mm\/wp-json\/wp\/v2\/posts\/3438\/revisions"}],"predecessor-version":[{"id":3442,"href":"http:\/\/www.labren.org\/mm\/wp-json\/wp\/v2\/posts\/3438\/revisions\/3442"}],"wp:attachment":[{"href":"http:\/\/www.labren.org\/mm\/wp-json\/wp\/v2\/media?parent=3438"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/www.labren.org\/mm\/wp-json\/wp\/v2\/categories?post=3438"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/www.labren.org\/mm\/wp-json\/wp\/v2\/tags?post=3438"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}