In this project, we tackled the unique challenges of robotic endoscopy by integrating vision, language grounding, and motion planning into one end-to-end framework. EndoVLA enables:
– Precise polyp tracking through surgeon-issued prompts
– Delineation and following of abnormal mucosal regions
– Adherence to circumferential cutting markers during resections
We introduced a dual-phase training strategy:
1. Supervised fine-tuning on our new EndoVLA-Motion dataset
2. Reinforcement fine-tuning with task-aware rewards
This approach improves tracking accuracy and achieves zero-shot generalization across diverse GI scenes.
The paper is available at: https://lnkd.in/g35DF7Fq