🚀 ICRA 2026: GeoLanG: Geometry-Aware Language-Guided Grasping with Unified RGB-D Multimodal Learning

Thrilled to share our latest work, GeoLanG, a unified geometry-aware framework for language-guided robotic grasping.

Language-guided grasping is a key capability for intuitive human–robot interaction. A robot should not only detect objects but also understand natural-language instructions such as “pick up the blue cup behind the bowl.” While recent multimodal models have shown promising results, most existing approaches rely on multi-stage pipelines that loosely couple perception and grasp prediction. These methods often overlook the tight integration of geometric, linguistic, and visual reasoning, making them fragile in cluttered, occluded, or low-texture environments. This motivated us to bridge the gap between semantic language understanding and precise geometric grasp execution.

🧠✨ What we developed:

A novel unified framework for geometry-aware language-guided grasping that includes:

🔹 Unified RGB-D Multimodal Representation:

 We embed RGB, depth, and language features into a shared representation space, enabling consistent cross-modal semantic alignment for accurate target reasoning.

🔹 Depth-Guided Geometric Module (DGGM):

 Instead of treating depth as an auxiliary input, we explicitly inject depth-derived geometric priors into the attention mechanism, strengthening object discrimination under occlusion and ambiguous visual conditions.

🔹 Adaptive Dense Channel Integration (ADCI):

 A dynamic multi-layer fusion strategy that balances global semantic cues and fine-grained geometric details for robust grasp prediction (a simplified sketch of all three components follows below).
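
To make these three components concrete, here is a minimal, self-contained PyTorch sketch of the core ideas. This is an illustrative simplification written for this post, not the actual GeoLanG implementation: all module, parameter, and tensor names (SharedEmbedding, DepthGuidedAttention, prior_weight, AdaptiveChannelFusion, depth_prior) are hypothetical stand-ins, and the paper's architecture differs in detail.

```python
import torch
import torch.nn as nn


class SharedEmbedding(nn.Module):
    """Projects RGB, depth, and language tokens into one shared space
    so they can be reasoned over jointly (illustrative stand-in)."""

    def __init__(self, rgb_dim: int, depth_dim: int, text_dim: int, dim: int):
        super().__init__()
        self.rgb_proj = nn.Linear(rgb_dim, dim)
        self.depth_proj = nn.Linear(depth_dim, dim)
        self.text_proj = nn.Linear(text_dim, dim)

    def forward(self, rgb, depth, text):
        # Each input: (B, N_modality, modality_dim) token sequences.
        return torch.cat(
            [self.rgb_proj(rgb), self.depth_proj(depth), self.text_proj(text)],
            dim=1,
        )


class DepthGuidedAttention(nn.Module):
    """Self-attention whose logits are biased by a depth-derived prior,
    so geometrically coherent tokens attend to each other more strongly."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # Learnable strength of the geometric prior (an assumption).
        self.prior_weight = nn.Parameter(torch.tensor(1.0))

    def forward(self, tokens, depth_prior):
        # tokens: (B, N, C); depth_prior: (B, N, N) pairwise affinity,
        # e.g. exp(-|z_i - z_j|) computed from the depth map.
        B, N, C = tokens.shape
        qkv = self.qkv(tokens).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)            # each (B, H, N, d)
        attn = (q @ k.transpose(-2, -1)) * self.scale   # (B, H, N, N)
        # Inject the geometric prior additively into the attention logits,
        # rather than concatenating depth as just another input channel.
        attn = attn + self.prior_weight * depth_prior.unsqueeze(1)
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


class AdaptiveChannelFusion(nn.Module):
    """Fuses features from several layers with input-conditioned channel
    gates, trading off global semantics against fine geometric detail."""

    def __init__(self, dim: int, num_layers: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(dim * num_layers, dim * num_layers),
            nn.Sigmoid(),
        )
        self.proj = nn.Linear(dim * num_layers, dim)

    def forward(self, layer_feats):
        # layer_feats: list of (B, N, C) features from different depths.
        stacked = torch.cat(layer_feats, dim=-1)              # (B, N, C*L)
        gates = self.gate(stacked.mean(dim=1, keepdim=True))  # (B, 1, C*L)
        return self.proj(stacked * gates)                     # (B, N, C)
```

The design point the sketch tries to capture: depth shapes the attention logits directly, as an additive bias, instead of being treated as an extra input channel, and the fusion weights are computed from the features themselves rather than fixed per layer.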

🎯 Key Results:

✅ GeoLanG significantly outperforms prior multi-stage baselines on OCID-VLG for language-guided grasping.

✅ Demonstrates strong robustness in cluttered and heavily occluded scenes.

✅ Successfully validated on real robotic hardware, showing reliable sim-to-real transfer.

💡 Why it matters:

This work shows that tightly coupling geometric reasoning with multimodal language understanding can significantly enhance robotic grasp reliability. By embedding depth-aware geometric priors directly into attention mechanisms, we reduce ambiguity and improve consistency in grasp decision-making.

GeoLanG provides a pathway toward more intelligent robotic systems that understand not just what object to grasp, but also how to grasp it robustly in complex real-world environments.

🌱 What’s next?

We are exploring how to extend this geometry-aware multimodal reasoning toward:

🔹 Real-time interactive grasping

🔹 Multi-step manipulation tasks

🔹 Integration with motion planning and autonomous robotic control

#ICRA2026 #CUHK
