We present 𝐍𝐞𝐮𝐫𝐨-𝐕𝐋𝐀, a scenario-aware model designed for the motion control of a parallel continuum neurosurgical robot.
Robotic surgery systems have garnered significant attention for their precision and efficiency, yet achieving autonomous tasks in complex neurosurgical environments remains challenging. Although Vision-Language-Action (VLA) models hold great potential, their development is constrained by the scarcity of data from surgical environments and robotic kinematics. To address this issue, this paper proposes NeuroVLA: a VLA model specifically designed for neurosurgical robotic tumor debulking tasks. Through phantom experiments conducted on a flexible parallel continuum robot, we constructed a dataset and decomposed the debulking task into four skill-based instructions. NeuroVLA utilizes a Vision-Language Model (VLM) as its backbone for scene reasoning, enabling the robot to comprehend the surgical scene and its own state. Experimental results demonstrate that after training on 90 debulking segments, NeuroVLA can infer actions based on images, language instructions, and the robot’s state. It achieved average pixel distance errors of 29.10 pixels and 21.55 pixels for the “alignment” and “transfer” skills, respectively, and success rates of 88.89% and 100% for the “grasping” and “release” skills.
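As described above, NeuroVLA maps images, a language instruction, and the robot's own state to actions, with the debulking task decomposed into four skill-based instructions. The snippet below is a minimal sketch of what such an inference interface could look like; it is not the authors' code, and every name in it (Observation, DummyVLAPolicy, the 6-DoF action dimension, the 12-dimensional robot state) is an illustrative assumption.

```python
# Hypothetical sketch of a scenario-aware VLA inference call (not the paper's API).
from dataclasses import dataclass
import numpy as np

# The four skill-based instructions named in the post; execution order is not specified there.
SKILLS = ["alignment", "grasping", "transfer", "release"]

@dataclass
class Observation:
    image: np.ndarray        # camera frame of the surgical scene, e.g. (H, W, 3) uint8
    instruction: str         # natural-language skill instruction, e.g. one of SKILLS
    robot_state: np.ndarray  # proprioceptive state of the parallel continuum robot (assumed 12-D)

class DummyVLAPolicy:
    """Stand-in for a VLM-backed policy: fuses image, language, and state into an action."""

    def predict_action(self, obs: Observation) -> np.ndarray:
        # A real model would perform scene reasoning here; this placeholder
        # just returns a zero action of an assumed 6-DoF end-effector dimension.
        return np.zeros(6)

if __name__ == "__main__":
    policy = DummyVLAPolicy()
    obs = Observation(
        image=np.zeros((224, 224, 3), dtype=np.uint8),
        instruction="alignment: align the forceps with the tumor fragment",
        robot_state=np.zeros(12),
    )
    print("predicted action:", policy.predict_action(obs))
```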
🧠 Technical Framework:
● End-to-end scenario-aware VLA model
● Skill-based scenario inference mechanism
● Debulking task dataset in neurosurgery
🎯 Experimental Results:
● NeuroVLA achieves much lower pixel distance (PD) errors on the “alignment” and “transfer” skills (29.10 px / 21.55 px), far outperforming baseline models such as Octo (79.72 px / 65.46 px).
● On the “grasping” and “release” skills, NeuroVLA is more robust, reaching an 88.89% grasping success rate and a 100% release success rate. In contrast, baseline models often misinterpret incomplete forceps closure as task completion, leading to grasping failures. (A sketch of how such evaluation metrics might be computed follows below.)
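The post reports PD errors and success rates but does not give their exact definitions. The sketch below assumes PD is the Euclidean distance in image space between a predicted and a reference point, and that success rate is a simple trial ratio; the function names and toy numbers are hypothetical.

```python
# Hypothetical evaluation helpers (not from the paper).
import numpy as np

def pixel_distance_error(pred_xy, target_xy) -> float:
    """Euclidean distance in pixels between predicted and reference image points."""
    return float(np.linalg.norm(np.asarray(pred_xy) - np.asarray(target_xy)))

def success_rate(outcomes) -> float:
    """Fraction of successful trials, as a percentage."""
    return 100.0 * sum(outcomes) / len(outcomes)

if __name__ == "__main__":
    # Toy numbers for illustration only.
    print(pixel_distance_error([120, 80], [140, 95]))   # -> 25.0
    print(success_rate([True] * 8 + [False]))           # -> ~88.89 (8/9 successes)
```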
#ICRA2026