Thrilled to share our latest work, GeoLanG, a unified geometry-aware framework for language-guided robotic grasping.
Language-guided grasping is a key capability for intuitive human–robot interaction. A robot should not only detect objects but also understand natural instructions such as “pick up the blue cup behind the bowl.” While recent multimodal models have shown promising results, most existing approaches rely on multi-stage pipelines that loosely couple perception and grasp prediction. These methods often overlook the tight integration of geometry, language, and visual reasoning, making them fragile in cluttered, occluded, or low-texture environments. This motivated us to bridge the gap between semantic language understanding and precise geometric grasp execution.
🧠✨ What we developed:
A novel unified framework for geometry-aware language-guided grasping that includes:
🔹 Unified RGB-D Multimodal Representation:
We embed RGB, depth, and language features into a shared representation space, enabling consistent cross-modal semantic alignment for accurate target reasoning (a minimal sketch follows this list).
🔹 Depth-Guided Geometric Module (DGGM):
Instead of treating depth as an auxiliary input, we explicitly inject geometric priors derived from depth into the attention mechanism, strengthening object discrimination under occlusion and ambiguous visual conditions (see the attention sketch below).
🔹 Adaptive Dense Channel Integration (ADCI):
A dynamic multi-layer fusion strategy that balances global semantic cues and fine-grained geometric details for robust grasp prediction (see the fusion sketch below).
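To make the shared-representation idea concrete, here is a minimal PyTorch-style sketch: per-modality features are linearly projected into one token space so a downstream transformer can attend across modalities. All module names and dimensions here are illustrative assumptions, not the released GeoLanG code.

```python
import torch
import torch.nn as nn

class SharedMultimodalEmbedding(nn.Module):
    """Hypothetical sketch: project RGB, depth, and language features
    into a single shared token space (all dims are assumptions)."""
    def __init__(self, rgb_dim=2048, depth_dim=1024, text_dim=768, d_model=256):
        super().__init__()
        self.rgb_proj = nn.Linear(rgb_dim, d_model)
        self.depth_proj = nn.Linear(depth_dim, d_model)
        self.text_proj = nn.Linear(text_dim, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, rgb_feats, depth_feats, text_feats):
        # Each input is (batch, num_tokens, dim) from its own backbone.
        tokens = torch.cat([
            self.rgb_proj(rgb_feats),
            self.depth_proj(depth_feats),
            self.text_proj(text_feats),
        ], dim=1)
        # A shared LayerNorm keeps the three modalities on a comparable scale.
        return self.norm(tokens)  # (batch, total_tokens, d_model)
```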
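For the DGGM, one way to read "injecting depth-derived geometric priors into attention" is as an additive bias on the attention logits computed from pairwise depth differences. The sketch below illustrates that reading only; the bias network, head count, and shapes are our assumptions, not the paper's exact module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthBiasedAttention(nn.Module):
    """Illustrative depth-biased self-attention: a learned bias derived
    from pairwise depth gaps is added to the attention logits."""
    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.n_heads = n_heads
        self.scale = (d_model // n_heads) ** -0.5
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # Maps |depth_i - depth_j| to one bounded bias per head (assumed form).
        self.depth_bias = nn.Sequential(nn.Linear(1, n_heads), nn.Tanh())

    def forward(self, x, depth):
        # x: (B, N, d_model); depth: (B, N) per-token depth values.
        B, N, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(B, N, self.n_heads, -1).transpose(1, 2) for t in (q, k, v))
        logits = (q @ k.transpose(-2, -1)) * self.scale           # (B, H, N, N)
        gap = (depth.unsqueeze(2) - depth.unsqueeze(1)).abs()     # (B, N, N)
        bias = self.depth_bias(gap.unsqueeze(-1)).permute(0, 3, 1, 2)
        attn = F.softmax(logits + bias, dim=-1)                   # geometric prior in the logits
        out = (attn @ v).transpose(1, 2).reshape(B, N, -1)
        return self.out(out)
```

Under this reading, tokens that are geometrically close can be pulled together (or pushed apart) regardless of appearance, which is what helps under occlusion and low texture.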
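And for ADCI, one plausible minimal form of adaptive multi-layer fusion is a learned per-layer, per-channel gate over feature maps from several depths of the network. Again, this is a hedged sketch under our own assumptions, not the paper's definition of ADCI.

```python
import torch
import torch.nn as nn

class AdaptiveLayerFusion(nn.Module):
    """Hypothetical adaptive fusion: a gate predicted from globally pooled
    features weighs shallow (fine-grained) against deep (semantic) layers
    per channel before summing them."""
    def __init__(self, channels=256, num_layers=4):
        super().__init__()
        hidden = num_layers * channels // 4
        self.gate = nn.Sequential(
            nn.Linear(num_layers * channels, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, num_layers * channels),
        )

    def forward(self, feats):
        # feats: list of num_layers tensors, each (B, C, H, W), same resolution.
        stacked = torch.stack(feats, dim=1)                 # (B, L, C, H, W)
        B, L, C, H, W = stacked.shape
        pooled = stacked.mean(dim=(-2, -1)).flatten(1)      # (B, L*C)
        # Softmax over the layer axis yields a convex, input-dependent mix.
        w = self.gate(pooled).view(B, L, C, 1, 1).softmax(dim=1)
        return (stacked * w).sum(dim=1)                     # (B, C, H, W)
```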
🎯 Key Results:
✅ GeoLanG significantly outperforms prior multi-stage baselines on OCID-VLG for language-guided grasping.
✅ Demonstrates strong robustness in cluttered and heavily occluded scenes.
✅ Successfully validated on real robotic hardware, showing reliable sim-to-real transfer.
💡 Why it matters:
This work shows that tightly coupling geometric reasoning with multimodal language understanding can significantly enhance robotic grasp reliability. By embedding depth-aware geometric priors directly into attention mechanisms, we reduce ambiguity and improve consistency in grasp decision-making.
GeoLanG provides a pathway toward more intelligent robotic systems that understand not just what object to grasp, but also how to grasp it robustly in complex real-world environments.
🌱 What's next?
We are exploring how to extend this geometry-aware multimodal reasoning toward:
🔹 Real-time interactive grasping
🔹 Multi-step manipulation tasks
🔹 Integration with motion planning and autonomous robotic control
#ICRA2026 #CUHK