We present MoMa-SG, a novel framework for building semantic-kinematic 3D scene graphs of articulated scenes from in-the-wild RGB-D observations. Given a single ego-centric, exo-centric, or robot-centric demonstration, MoMa-SG distills interactions into an actionable scene graph that enables one-shot, open-world mobile manipulation on diverse robot embodiments.

MoMa-SG Teaser

Abstract

Semantics has enabled 3D scene understanding and affordance-driven object interaction. However, robots operating in real-world environments face a critical limitation: they cannot anticipate how objects move. Long-horizon mobile manipulation requires closing the gap between semantics, geometry, and kinematics. In this work, we present MoMa-SG, a novel framework for building semantic-kinematic 3D scene graphs of articulated scenes containing a myriad of interactable objects. Given RGB-D sequences containing multiple object articulations, we temporally segment object interactions and infer object motion using occlusion-robust point tracking. We then lift point trajectories into 3D and estimate articulation models using a novel unified twist estimation formulation that robustly estimates revolute and prismatic joint parameters in a single optimization pass. Next, we associate objects with estimated articulations and detect contained objects by reasoning over parent-child relations at identified opening states. We also introduce the novel Arti4D-Semantic dataset, which uniquely combines hierarchical object semantics with object axis annotations across 62 in-the-wild RGB-D sequences containing 600 object interactions and three distinct observation paradigms. We extensively evaluate MoMa-SG on two datasets and ablate key design choices. Real-world experiments on a quadruped and a mobile manipulator demonstrate that our semantic-kinematic scene graphs enable robust manipulation of articulated objects in everyday home environments.

Method

MoMa-SG Pipeline

MoMa-SG overview. We first discover interaction segments via probabilistic fusion of an interaction prior (YOLOv9) and depth warp disparity. For each segment, CoTracker3 tracks keypoints, and a novel regularized twist formulation with a geometric cosine prior estimates the articulation model. SemanticSAM with CLIP features builds an open-vocabulary 3D map, objects are matched to articulations via a binary integer program, and child objects within articulated containers are discovered at maximum-opening states.

🔍 Interaction Discovery

Probabilistic fusion of hand/agent visibility (YOLOv9) and depth warp disparity robustly finds interaction segments, even under occlusion or in low-dynamics scenarios such as opening large doors.
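
A minimal sketch of this kind of cue fusion, assuming per-frame hand/agent confidences and a normalized depth-warp disparity are already computed; the fusion weights, smoothing window, and function name below are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np


def fuse_interaction_scores(hand_conf, depth_disparity, w_hand=0.6, w_depth=0.4,
                            threshold=0.5, min_len=5):
    """Fuse per-frame cues into binary interaction segments.

    hand_conf:       (T,) hand/agent detection confidence per frame (e.g., from YOLOv9)
    depth_disparity: (T,) normalized depth-warp disparity per frame in [0, 1]
    Returns a list of (start, end) frame indices of detected interaction segments.
    """
    hand_conf = np.asarray(hand_conf, dtype=float)
    depth_disparity = np.asarray(depth_disparity, dtype=float)

    # Weighted probabilistic fusion of the two cues (illustrative weights).
    p_interact = w_hand * hand_conf + w_depth * depth_disparity

    # Simple temporal smoothing to suppress single-frame spikes.
    kernel = np.ones(5) / 5.0
    p_smooth = np.convolve(p_interact, kernel, mode="same")

    # Threshold and extract contiguous segments longer than min_len frames.
    active = p_smooth > threshold
    segments, start = [], None
    for t, a in enumerate(active):
        if a and start is None:
            start = t
        elif not a and start is not None:
            if t - start >= min_len:
                segments.append((start, t))
            start = None
    if start is not None and len(active) - start >= min_len:
        segments.append((start, len(active)))
    return segments
```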

🔧 Regularized Twist Estimation

A novel geometric cosine prior on point trajectory pairs enforces orthogonality for revolute joints and suppresses rotation for prismatic joints, eliminating the need for separate joint-type classification rules.
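
For intuition, below is a plain least-squares twist fit to lifted 3D point displacements using the small-motion model d_i ≈ ω × p_i + v, together with the standard screw-theoretic decoding of axis parameters. It omits the paper's geometric cosine prior and unified regularization; the explicit threshold on ||ω|| used here for joint-type selection is exactly the kind of hand-tuned rule the regularized formulation avoids.

```python
import numpy as np


def skew(p):
    """3x3 skew-symmetric matrix such that skew(p) @ w == np.cross(p, w)."""
    return np.array([[0.0, -p[2], p[1]],
                     [p[2], 0.0, -p[0]],
                     [-p[1], p[0], 0.0]])


def fit_twist(points, displacements):
    """Least-squares fit of an instantaneous twist (omega, v) to tracked 3D points.

    Uses the small-motion model  d_i ≈ omega x p_i + v  for each tracked point.
    points, displacements: (N, 3) arrays of 3D positions and frame-to-frame motion.
    Returns the decoded joint parameters.
    """
    # Stack the linear system  [-[p_i]_x  I] [omega; v] = d_i  over all points.
    A = np.concatenate([np.hstack([-skew(p), np.eye(3)]) for p in points], axis=0)
    b = np.concatenate(displacements, axis=0)
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    omega, v = x[:3], x[3:]

    if np.linalg.norm(omega) > 1e-3:          # rotation-dominated motion -> revolute
        axis = omega / np.linalg.norm(omega)
        point_on_axis = np.cross(omega, v) / np.dot(omega, omega)
        return {"type": "revolute", "axis": axis, "point": point_on_axis}
    else:                                      # translation-dominated motion -> prismatic
        axis = v / np.linalg.norm(v)
        return {"type": "prismatic", "axis": axis}
```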

🗺️ Scene Graph Construction

Boundary-aware incremental 3D mapping with SemanticSAM + CLIP, IoU-constrained articulation–object assignment via a binary integer program (BIP), and hierarchical parent-child containment discovery.
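
A sketch of how such an IoU-constrained assignment could be posed as a binary integer program with scipy.optimize.milp; the function name, the min_iou threshold, and the exact constraint set are illustrative assumptions rather than the paper's formulation:

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds


def assign_articulations(iou, min_iou=0.1):
    """Match articulations to mapped objects via a small binary integer program.

    iou: (A, O) matrix of 3D IoU between articulated-part point clouds and map objects.
    Maximizes total IoU subject to one-to-one assignment; pairs below min_iou are forbidden.
    Returns a list of (articulation_idx, object_idx) matches.
    """
    A, O = iou.shape
    c = -iou.flatten()                         # milp minimizes, so negate the IoU reward

    rows = []
    # Each articulation is assigned to at most one object.
    for a in range(A):
        r = np.zeros(A * O)
        r[a * O:(a + 1) * O] = 1.0
        rows.append(r)
    # Each object receives at most one articulation.
    for o in range(O):
        r = np.zeros(A * O)
        r[o::O] = 1.0
        rows.append(r)
    constraints = LinearConstraint(np.vstack(rows), -np.inf, 1.0)

    # Forbid low-IoU pairs by clamping their upper bound to zero.
    ub = (iou.flatten() >= min_iou).astype(float)
    res = milp(c, constraints=constraints, integrality=np.ones(A * O),
               bounds=Bounds(np.zeros(A * O), ub))

    x = res.x.reshape(A, O) > 0.5
    return [(a, o) for a in range(A) for o in range(O) if x[a, o]]
```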

Contributions

Arti4D-Semantic Dataset

62 RGB-D sequences
600 object interactions
5 distinct scenes
3 observation paradigms
29 parent object categories
54 child object categories

Arti4D-Semantic extends the original Arti4D dataset with rich semantic labels: instance segmentation of articulated parents and contained children, independently verified semantic categories, articulation axes, ground-truth interaction segments, and parent-child containment type (STATIC vs. ARTICULATED). It is the first benchmark to jointly address articulated object estimation and semantic 3D mapping in unconstrained in-the-wild settings.

Observation Paradigms

👁️ Ego-Centric (EGO)

Human demonstrator with a head-mounted RGB-D camera. Four labeled scenes (RH078, RR080, DR080, RH201) with ground-truth camera poses, plus an unlabeled scene (MHZH). Covers drawers, cabinets, doors, fridges, and sliding furniture.

📷 Exo-Centric (EXO)

External fixed observer viewpoint (MHZH-EXO). Introduces additional challenges from bystander motion and perspective changes. No ground-truth camera poses are available; pseudo-ground truth is constructed via DROID-SLAM.

🤖 Robot-Centric (ROBOT)

Teleoperated Toyota HSR with on-board camera (HSR split, 8 sequences). Characterized by higher depth noise and robot motion artifacts. Tests generalization to real manipulation scenarios.

Experimental Results

Interaction Segmentation on Arti4D-Semantic

| Method | 1D-IoU ↑ | Precision ↑ | Recall ↑ | Segment-IoU ↑ | I_on [s] ↓ | I_off [s] ↓ |
|---|---|---|---|---|---|---|
| Pandora | 0.359 | 0.690 | 0.400 | 0.689 | 0.744 | 0.725 |
| HMM | 0.471 | 0.871 | 0.532 | 0.683 | 0.770 | 0.762 |
| ArtiPoint | 0.575 | 0.678 | 0.714 | 0.698 | 1.050 | 0.558 |
| MoMa-SG (Ours) | 0.649 | 0.786 | 0.800 | 0.718 | 0.816 | 0.742 |

MoMa-SG achieves the highest 1D-IoU, recall, and segment IoU. The depth warp prior provides a significant boost over hand-tracking alone.

Articulation Estimation on Arti4D

| Method | Pris. θ_err ↓ | Rev. θ_err ↓ | Rev. d_L2 ↓ | Type Acc. ↑ | Pris. Recall ↑ | Rev. Recall ↑ |
|---|---|---|---|---|---|---|
| Ours w/o twist regularization | 28.764° | 33.278° | 0.135 m | 0.542 | 1.000 | 0.305 |
| Ours w/ XFeat keypoints | 15.768° | 31.538° | 0.152 m | 0.823 | 0.824 | 0.821 |
| ArtGS | 52.070° | 62.638° | 0.301 m | 0.619 | 1.000 | 0.000 |
| Pandora | 46.814° | 50.620° | 0.195 m | 0.559 | 0.772 | 0.444 |
| ArtiPoint w/ Sturm et al. | 17.754° | 28.467° | 0.313 m | 0.872 | 0.854 | 0.905 |
| ArtiPoint w/ prior | 23.272° | 26.358° | 0.248 m | 0.776 | 0.709 | 0.906 |
| MoMa-SG (Ours) | 13.190° | 22.982° | 0.091 m | 0.884 | 0.917 | 0.866 |

MoMa-SG outperforms all baselines on axis orientation and position errors. The ablation rows (Ours w/o twist regularization, Ours w/ XFeat keypoints) show the importance of twist regularization and of GFTT keypoints.

Cross-Dataset Generalization: DROID

| Method | Pris. θ_err ↓ | Rev. θ_err ↓ | Rev. d_L2 ↓ | Type Acc. ↑ |
|---|---|---|---|---|
| ArtiPoint | 35.88° | 25.43° | 0.278 m | 0.611 |
| MoMa-SG (Ours) | 7.15° | 16.91° | 0.115 m | 0.895 |

Results on 19 manipulation demonstrations from DROID. MoMa-SG achieves a roughly 5× reduction in prismatic axis error, demonstrating strong cross-dataset generalization.

3D Articulated Part Segmentation & Containment

| Object Set | Method | IoU ↑ | Recall @0.25 ↑ | Recall @0.50 ↑ | Recall @0.75 ↑ |
|---|---|---|---|---|---|
| Free (O) | Ours w/o boundary-aware merging | 0.481 | 0.807 | 0.534 | 0.086 |
| Free (O) | ConceptGraphs | 0.438 | 0.332 | 0.127 | 0.018 |
| Free (O) | MoMa-SG (Ours) | 0.533 | 0.824 | 0.646 | 0.163 |
| Matched (OA) | Pandora | 0.065 | 0.012 | 0.000 | 0.000 |
| Matched (OA) | MoMa-SG (Ours) | 0.292 | 0.454 | 0.303 | 0.061 |

MoMa-SG significantly outperforms baselines. Boundary-aware merging is critical for distinguishing adjacent drawers and cabinets.

Real-World Mobile Manipulation

85.5% overall success on the Toyota HSR (opening & closing)
80.8% overall opening success on the Boston Dynamics Spot
1.7 cm mean prismatic state error across 54 tested configurations
6.5° mean revolute state error across 54 tested configurations

MoMa-SG produces an embodiment-agnostic semantic-kinematic map: the same scene graph drives both the Toyota HSR and the Boston Dynamics Spot without modification. Experiments span two kitchen environments and include autonomous opening and closing, online articulation state estimation with retrial behavior, and full language-grounded long-horizon manipulation (e.g., "get the milk from the fridge") using a GPT-5-mini planner.

🔄 State Estimation & Retrial

Online articulation state estimation across diverse opening configurations enables automatic retrial after gripping failures by re-estimating the current state via point-cloud overlap under the estimated twist.
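
A minimal sketch of state estimation by point-cloud overlap under an estimated joint model, using a simple grid search over joint values; the function names, inlier distance, and grid are illustrative assumptions, not the system's exact overlap measure:

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.spatial.transform import Rotation


def apply_joint(points, joint, q):
    """Move a part's closed-state point cloud to joint value q (angle [rad] or distance [m])."""
    if joint["type"] == "prismatic":
        return points + q * joint["axis"]
    # Revolute: rotate about the axis passing through joint["point"].
    R = Rotation.from_rotvec(q * joint["axis"]).as_matrix()
    return (points - joint["point"]) @ R.T + joint["point"]


def estimate_state(part_points, observed_points, joint, q_grid, inlier_dist=0.02):
    """Pick the joint value whose predicted part cloud best overlaps the observation."""
    tree = cKDTree(observed_points)
    scores = []
    for q in q_grid:
        pred = apply_joint(part_points, joint, q)
        d, _ = tree.query(pred)
        scores.append(np.mean(d < inlier_dist))   # fraction of predicted points explained
    return q_grid[int(np.argmax(scores))]


# Example: sweep a door joint from closed to 90 degrees open.
# q_grid = np.linspace(0.0, np.pi / 2, 46)
```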

🗣️ Language-Grounded Tasks

Natural language commands are mapped to sequences of open(), inspect(), retrieve(), and close() actions, using the open-vocabulary scene graph as context for a receding-horizon GPT-5-mini planner.
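
A sketch of what such a receding-horizon loop could look like; the query_llm callable, the JSON action format, and the robot interface below are illustrative placeholders, not the actual GPT-5-mini API or the system's prompt:

```python
import json

# Hypothetical skill set; the real system exposes open(), inspect(),
# retrieve(), and close() primitives grounded in the scene graph.
SKILLS = {"open", "inspect", "retrieve", "close", "done"}


def plan_and_execute(task, scene_graph, robot, query_llm, max_steps=10):
    """Receding-horizon planning loop: one skill call per LLM query.

    task:        natural-language command, e.g. "get the milk from the fridge"
    scene_graph: dict-like semantic-kinematic scene graph (serialized as JSON context)
    query_llm:   callable(prompt) -> JSON string {"skill": ..., "target": ...}
                 (stand-in for the LLM call; not an actual API signature)
    robot:       object exposing the skill primitives as methods
    """
    history = []
    for _ in range(max_steps):
        prompt = (
            f"Task: {task}\n"
            f"Scene graph: {json.dumps(scene_graph)}\n"
            f"Executed so far: {history}\n"
            'Reply with JSON {"skill": "...", "target": "..."} or {"skill": "done"}.'
        )
        action = json.loads(query_llm(prompt))
        if action["skill"] == "done":
            break
        assert action["skill"] in SKILLS
        getattr(robot, action["skill"])(action["target"])   # e.g. robot.open("fridge_0")
        history.append(action)
    return history
```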

🤖 Embodiment-Agnostic

An identical scene representation is used across the Toyota HSR (whole-body manipulation) and the Boston Dynamics Spot (quadruped with arm). Online grasps are generated via SAM3 handle detection and predominant-axis estimation.
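
A minimal sketch of predominant-axis estimation on a segmented handle point cloud via PCA; the function name and return convention are illustrative, and the actual grasp generation pipeline may differ:

```python
import numpy as np


def handle_grasp_axis(handle_points):
    """Estimate a handle's predominant axis via PCA on its segmented 3D points.

    handle_points: (N, 3) points back-projected from a handle mask (e.g., from SAM3).
    Returns (center, axis): grasp center and unit direction of the handle's long side.
    A gripper would typically approach perpendicular to this axis.
    """
    pts = np.asarray(handle_points, dtype=float)
    center = pts.mean(axis=0)
    # Principal component of the centered points = direction of largest extent.
    _, _, vt = np.linalg.svd(pts - center, full_matrices=False)
    axis = vt[0] / np.linalg.norm(vt[0])
    return center, axis
```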

Publication

Martin Büchner, Adrian Röfer, Tim Engelbracht, Tim Welschehold, Zuria Bauer, Hermann Blum, Marc Pollefeys, and Abhinav Valada.
Articulated 3D Scene Graphs for Open-World Mobile Manipulation.
arXiv preprint arXiv:2602.16356 [cs.RO], February 2026.
@article{buechner2026momasg,
  title={Articulated 3D Scene Graphs for Open-World Mobile Manipulation},
  author={Buechner, Martin and Roefer, Adrian and Engelbracht, Tim
          and Welschehold, Tim and Bauer, Zuria and Blum, Hermann
          and Pollefeys, Marc and Valada, Abhinav},
  journal={arXiv preprint arXiv:2602.16356},
  year={2026}
}

Authors

Martin Büchner, University of Freiburg
Adrian Röfer, University of Freiburg
Tim Engelbracht, ETH Zürich
Tim Welschehold, University of Freiburg
Zuria Bauer, ETH Zürich
Hermann Blum, ETH Zürich / Uni Bonn
Marc Pollefeys, ETH Zürich
Abhinav Valada, University of Freiburg

Acknowledgment

This work was funded by an academic grant from NVIDIA and the BrainLinks-BrainTools center of the University of Freiburg.