Semantic perception has enabled 3D scene understanding and affordance-driven object interaction. However, robots operating in real-world environments face a critical limitation: they cannot anticipate how objects move. Long-horizon mobile manipulation requires closing the gap between semantics, geometry, and kinematics. In this work, we present MoMa-SG, a novel framework for building semantic-kinematic 3D scene graphs of articulated scenes containing a myriad of interactable objects. Given RGB-D sequences containing multiple object articulations, we temporally segment object interactions and infer object motion using occlusion-robust point tracking. We then lift point trajectories into 3D and estimate articulation models using a novel unified twist formulation that robustly recovers revolute and prismatic joint parameters in a single optimization pass. Next, we associate objects with estimated articulations and detect contained objects by reasoning over parent-child relations at identified opening states. We also introduce the novel Arti4D-Semantic dataset, which uniquely combines hierarchical object semantics with articulation axis annotations across 62 in-the-wild RGB-D sequences containing 600 object interactions and three distinct observation paradigms. We extensively evaluate MoMa-SG on two datasets and ablate key design choices. Real-world experiments on a quadruped and a mobile manipulator demonstrate that our semantic-kinematic scene graphs enable robust manipulation of articulated objects in everyday home environments.
MoMa-SG overview. We first discover interaction segments via probabilistic fusion of an interaction prior (YOLOv9) and depth disparity. For each segment, CoTracker3 tracks keypoints; a novel regularized twist formulation with a geometric cosine prior estimates articulation models. SemanticSAM with CLIP features builds an open-vocabulary 3D map; objects are matched to articulations via a binary integer program; and child objects within articulated containers are discovered at maximum-opening states.
Probabilistic fusion of hand/agent visibility (YOLOv9) and depth warp disparity. Robustly finds interaction segments even under occlusion or in low-dynamics scenarios such as opening large doors.
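The fusion step could look roughly like the sketch below: both cues are mapped to per-frame probabilities, combined with a noisy-OR rule, smoothed, and thresholded into segments. The function name, the sigmoid mapping of the depth residual, and all thresholds are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

def fuse_interaction_cues(hand_conf, depth_residual, tau_depth=0.05,
                          smooth_w=15, p_thresh=0.5):
    """Fuse per-frame interaction cues into binary interaction segments.

    hand_conf:      (T,) hand/agent detection confidence (e.g., YOLOv9), in [0, 1]
    depth_residual: (T,) mean depth disparity after warping frame t into t-1 [m]
    All thresholds are illustrative.
    """
    hand_conf = np.asarray(hand_conf, dtype=float)
    depth_residual = np.asarray(depth_residual, dtype=float)
    # Map the depth residual to a pseudo-probability with a soft threshold.
    p_depth = 1.0 / (1.0 + np.exp(-(depth_residual - tau_depth) / (0.25 * tau_depth)))
    # Noisy-OR fusion: a frame counts as interacting if either cue fires.
    p_interact = 1.0 - (1.0 - hand_conf) * (1.0 - p_depth)
    # Temporal smoothing bridges short occlusions and low-dynamics frames.
    p_smooth = np.convolve(p_interact, np.ones(smooth_w) / smooth_w, mode="same")
    active = (p_smooth > p_thresh).astype(int)
    # Extract (start, end) frame indices of the active runs (end exclusive).
    edges = np.diff(np.concatenate(([0], active, [0])))
    starts, ends = np.flatnonzero(edges == 1), np.flatnonzero(edges == -1)
    return list(zip(starts, ends))
```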
Novel geometric cosine prior on point trajectory pairs enforces orthogonality for revolute joints and suppresses rotation for prismatic joints, eliminating the need for separate joint-type classification rules.
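One plausible instantiation of such a pairwise term, for intuition only: under a rigid rotation about an axis, the vector between two co-moving points changes only orthogonally to that axis, and under pure translation it does not change at all, so penalizing the axis-aligned component of its change covers both joint types with a single term. The names, pair sampling, and squared penalty below are assumptions; the paper's exact regularizer may differ.

```python
import numpy as np

def pairwise_cosine_prior(tracks, axis_dir, num_pairs=200, seed=0):
    """Unified pairwise regularizer on lifted 3D point tracks (illustrative).

    tracks:   (N, T, 3) 3D trajectories of points tracked on the moving part
    axis_dir: (3,) candidate joint-axis direction (unit norm)
    """
    tracks = np.asarray(tracks, dtype=float)
    rng = np.random.default_rng(seed)
    n = tracks.shape[0]
    i, j = rng.integers(0, n, size=(2, num_pairs))
    r_first = tracks[i, 0] - tracks[j, 0]    # pairwise vectors in the first frame
    r_last = tracks[i, -1] - tracks[j, -1]   # pairwise vectors in the last frame
    delta = r_last - r_first                 # how each pairwise vector changed
    # Ideal revolute motion changes pairwise vectors orthogonally to the axis;
    # ideal prismatic motion leaves them unchanged, so one term covers both.
    along_axis = delta @ np.asarray(axis_dir, dtype=float)
    return float(np.mean(along_axis ** 2))
```

This scalar would be added with a weight to the twist-fitting residual, so that the joint type falls out of a single optimization rather than a separate classification step.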
Boundary-aware incremental 3D mapping with SemanticSAM + CLIP, IoU-constrained articulation–object assignment via BIP, and hierarchical parent-child containment discovery.
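For the assignment step, a minimal sketch of how an IoU-constrained one-to-one matching could be posed as a binary integer program, here with scipy.optimize.milp; the IoU threshold, cost, and constraint layout are assumptions about the general setup, not the exact program used by MoMa-SG.

```python
import numpy as np
from scipy.optimize import Bounds, LinearConstraint, milp

def assign_articulations(iou, min_iou=0.25):
    """Match articulation estimates to mapped objects via a binary integer program.

    iou: (A, O) spatial IoU between articulated-part estimates and map objects.
    Maximizes total IoU under one-to-one assignment; pairs below min_iou are excluded.
    Threshold and solver setup are illustrative.
    """
    iou = np.asarray(iou, dtype=float)
    A, O = iou.shape
    c = -iou.ravel()                          # milp minimizes, so negate the IoU reward
    row = np.zeros((A, A * O))                # each articulation assigned at most once
    for a in range(A):
        row[a, a * O:(a + 1) * O] = 1
    col = np.zeros((O, A * O))                # each object assigned at most once
    for o in range(O):
        col[o, o::O] = 1
    constraints = [LinearConstraint(row, 0, 1), LinearConstraint(col, 0, 1)]
    ub = (iou.ravel() >= min_iou).astype(float)   # forbid low-IoU pairs via bounds
    res = milp(c, constraints=constraints, integrality=np.ones(A * O),
               bounds=Bounds(0, ub))
    x = res.x.reshape(A, O).round().astype(int)
    return [(a, o) for a in range(A) for o in range(O) if x[a, o] == 1]
```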
Arti4D-Semantic extends the original Arti4D dataset with rich semantic labels: instance segmentation of articulated parents and contained children, independently verified semantic categories, articulation axes, ground-truth interaction segments, and parent-child containment type (STATIC vs. ARTICULATED). It is the first benchmark to jointly address articulated object estimation and semantic 3D mapping in unconstrained in-the-wild settings.
Human demonstrator with a head-mounted RGB-D camera. Four labeled scenes (RH078, RR080, DR080, RH201) with ground-truth camera poses, plus an unlabeled scene (MHZH). Covers drawers, cabinets, doors, fridges, and sliding furniture.
External fixed observer viewpoint (MHZH-EXO). Introduces additional challenges from bystander motion and perspective changes. No ground-truth camera poses; pseudo-ground-truth poses are constructed via DROID-SLAM.
Teleoperated Toyota HSR with on-board camera (HSR split, 8 sequences). Characterized by higher depth noise and robot motion artifacts. Tests generalization to real manipulation scenarios.
| Method | 1D-IoU ↑ | Precision ↑ | Recall ↑ | Segment-IoU ↑ | I_on [s] ↓ | I_off [s] ↓ |
|---|---|---|---|---|---|---|
| Pandora | 0.359 | 0.690 | 0.400 | 0.689 | 0.744 | 0.725 |
| HMM | 0.471 | 0.871 | 0.532 | 0.683 | 0.770 | 0.762 |
| ArtiPoint | 0.575 | 0.678 | 0.714 | 0.698 | 1.050 | 0.558 |
| MoMa-SG (Ours) | 0.649 | 0.786 | 0.800 | 0.718 | 0.816 | 0.742 |
MoMa-SG achieves the highest 1D-IoU, recall, and segment IoU. The depth warp prior provides a significant boost over hand-tracking alone.
| Method | Pris. θ_err ↓ | Rev. θ_err ↓ | Rev. d_L2 ↓ | Type Acc. ↑ | Pris. Recall ↑ | Rev. Recall ↑ |
|---|---|---|---|---|---|---|
| Ours w/o twist regularization | 28.764° | 33.278° | 0.135 m | 0.542 | 1.000 | 0.305 |
| Ours w/ XFeat keypoints | 15.768° | 31.538° | 0.152 m | 0.823 | 0.824 | 0.821 |
| ArtGS | 52.070° | 62.638° | 0.301 m | 0.619 | 1.000 | 0.000 |
| Pandora | 46.814° | 50.620° | 0.195 m | 0.559 | 0.772 | 0.444 |
| ArtiPoint w/ Sturm et al. | 17.754° | 28.467° | 0.313 m | 0.872 | 0.854 | 0.905 |
| ArtiPoint w/ prior | 23.272° | 26.358° | 0.248 m | 0.776 | 0.709 | 0.906 |
| MoMa-SG (Ours) | 13.190° | 22.982° | 0.091 m | 0.884 | 0.917 | 0.866 |
MoMa-SG outperforms all baselines on axis and position errors. The ablation rows (w/o twist regularization, w/ XFeat keypoints) show the importance of twist regularization and of GFTT keypoints.
| Method | Pris. θ_err ↓ | Rev. θ_err ↓ | Rev. d_L2 ↓ | Type Acc. ↑ |
|---|---|---|---|---|
| ArtiPoint | 35.88° | 25.43° | 0.278 m | 0.611 |
| MoMa-SG (Ours) | 7.15° | 16.91° | 0.115 m | 0.895 |
Results on 19 manipulation demonstrations from the DROID dataset. MoMa-SG achieves a 5× reduction in prismatic axis error, demonstrating strong cross-dataset generalization.
| Object Set | Method | IoU ↑ | Recall @0.25 ↑ | Recall @0.50 ↑ | Recall @0.75 ↑ |
|---|---|---|---|---|---|
| Free (O) | Ours w/o boundary-aware merging | 0.481 | 0.807 | 0.534 | 0.086 |
| Free (O) | ConceptGraphs | 0.438 | 0.332 | 0.127 | 0.018 |
| Free (O) | MoMa-SG (Ours) | 0.533 | 0.824 | 0.646 | 0.163 |
| Matched (OA) | Pandora | 0.065 | 0.012 | 0.000 | 0.000 |
| Matched (OA) | MoMa-SG (Ours) | 0.292 | 0.454 | 0.303 | 0.061 |
MoMa-SG significantly outperforms baselines. Boundary-aware merging is critical for distinguishing adjacent drawers and cabinets.
Real-robot result metrics: Toyota HSR overall success (opening & closing); Boston Dynamics Spot overall opening success; mean prismatic state error and mean revolute state error across 54 tested configurations.
MoMa-SG produces an embodiment-agnostic semantic-kinematic map. The same scene graph drives both the Toyota HSR and Boston Dynamics Spot without modification. Experiments span two kitchen environments and include autonomous opening and closing, online articulation state estimation with retrial behavior, and full language-grounded long-horizon manipulation (e.g., "get the milk from the fridge") using a GPT-5-mini planner.
Online articulation state estimation across diverse opening configurations. Enables automatic retrial after gripping failures by re-estimating current state via point-cloud overlap under the estimated twist.
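A minimal sketch of how this overlap check could work, assuming the articulated part's reference point cloud, the current observation, and the estimated joint (type, axis, origin) are available; the dictionary layout, sweep resolution, and inlier distance are illustrative.

```python
import numpy as np
from scipy.spatial import cKDTree

def estimate_joint_state(part_cloud, observed_cloud, joint,
                         q_range, steps=50, inlier_dist=0.02):
    """Sweep the joint value and score point-cloud overlap (illustrative).

    part_cloud:     (N, 3) part points in the closed/reference state
    observed_cloud: (M, 3) current observation of the part, same frame
    joint: dict with 'type' ('revolute'|'prismatic'), 'axis' (3,), 'origin' (3,)
    q_range: (q_min, q_max) in radians or meters
    """
    part_cloud = np.asarray(part_cloud, dtype=float)
    tree = cKDTree(np.asarray(observed_cloud, dtype=float))
    axis = np.asarray(joint["axis"], dtype=float)
    origin = np.asarray(joint["origin"], dtype=float)
    best_q, best_score = q_range[0], -1.0
    for q in np.linspace(*q_range, steps):
        if joint["type"] == "prismatic":
            moved = part_cloud + q * axis
        else:  # revolute: Rodrigues rotation about the axis through `origin`
            k = axis / np.linalg.norm(axis)
            p = part_cloud - origin
            moved = (p * np.cos(q)
                     + np.cross(k, p) * np.sin(q)
                     + k * (p @ k)[:, None] * (1 - np.cos(q))) + origin
        # Overlap score: fraction of predicted points with a nearby observation.
        dists, _ = tree.query(moved, k=1)
        score = float(np.mean(dists < inlier_dist))
        if score > best_score:
            best_q, best_score = q, score
    return best_q, best_score
```

If the best overlap score stays low after a gripping attempt, the state can be re-estimated from a fresh observation and the grasp retried, matching the retrial behavior described above.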
Natural language commands mapped to sequences of open(), inspect(), retrieve(), and close() actions using the open-vocabulary scene graph as context for a receding-horizon GPT-5-mini planner.
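A rough sketch of what one iteration of such a planner call could look like, assuming the standard OpenAI chat completions API; the prompt format, JSON step schema, and helper names are illustrative and not the paper's exact interface.

```python
import json
from openai import OpenAI

ACTIONS = ["open", "inspect", "retrieve", "close"]

def plan_next_actions(command, scene_graph_summary, model="gpt-5-mini"):
    """Query the LLM planner for the next primitive actions (illustrative).

    scene_graph_summary: short text listing of object nodes, their joint
    types, and parent-child containment from the scene graph.
    """
    prompt = (
        f"Scene graph:\n{scene_graph_summary}\n\n"
        f"Task: {command}\n"
        f"Reply with a JSON list of steps, each {{'action': one of {ACTIONS}, "
        f"'target': object id}}. Only include the steps needed right now."
    )
    client = OpenAI()
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    steps = json.loads(resp.choices[0].message.content)
    # Receding horizon: execute the first step, update the scene graph
    # (e.g., new articulation state), then re-plan with the new context.
    return steps
```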
Identical scene representation used across Toyota HSR (whole-body) and Boston Dynamics Spot (quadruped arm). Online grasp generation via SAM3 handle detection and predominant-axis estimation.
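The predominant-axis step could reduce to a PCA over the handle's 3D points, as in the sketch below; it assumes the handle detection mask has already been back-projected to a point set and is not the exact grasp generator deployed on the robots.

```python
import numpy as np

def handle_grasp_pose(handle_points):
    """Grasp center and predominant axis of a segmented handle (illustrative).

    handle_points: (N, 3) handle points in the robot/world frame. The gripper
    closing direction would then be chosen orthogonal to the returned axis.
    """
    pts = np.asarray(handle_points, dtype=float)
    center = pts.mean(axis=0)
    # PCA: the eigenvector with the largest eigenvalue of the covariance
    # matrix is the predominant (long) axis of the handle.
    cov = np.cov((pts - center).T)
    eigvals, eigvecs = np.linalg.eigh(cov)
    predominant_axis = eigvecs[:, np.argmax(eigvals)]
    return center, predominant_axis / np.linalg.norm(predominant_axis)
```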
This work was funded by an academic grant from NVIDIA and the BrainLinks-BrainTools center of the University of Freiburg.