Accurate three-dimensional (3D) human motion tracking is essential for immersive augmented reality (AR) and virtual reality (VR) applications, allowing users to engage with virtual environments through realistic full-body avatars. Achieving this level of detail, however, is challenging when the driving signals are sparse, typically coming only from upper-body sensors, such as head-mounted devices and hand controllers.
To address this challenge, we propose EgoMDM (Egocentric Motion Diffusion Model), an end-to-end diffusion-based framework designed to reconstruct full-body motion from sparse tracking signals. EgoMDM models human motion in a conditional autoregressive manner using a unidirectional recurrent neural network, making it well-suited for real-time applications.
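Below is a minimal, hypothetical sketch (not the authors' released code) of what a unidirectional-RNN diffusion denoiser conditioned on sparse head/hand tracking signals might look like. Module names, feature sizes, and the GRU backbone are illustrative assumptions; the point is only that a unidirectional recurrence lets the hidden state be carried frame by frame, which is what makes real-time, streaming use plausible.

```python
# Illustrative sketch only: a unidirectional GRU denoiser that predicts a clean
# full-body pose from a noisy pose, the diffusion timestep, and the sparse
# head/hand tracking condition. All dimensions are placeholders.
import torch
import torch.nn as nn

class SparseConditionedDenoiser(nn.Module):
    def __init__(self, pose_dim=132, cond_dim=54, hidden_dim=512):
        super().__init__()
        self.t_embed = nn.Sequential(nn.Linear(1, 128), nn.SiLU(), nn.Linear(128, 128))
        self.rnn = nn.GRU(pose_dim + cond_dim + 128, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, pose_dim)

    def forward(self, noisy_pose, t, cond, h=None):
        # noisy_pose: (B, T, pose_dim), cond: (B, T, cond_dim), t: (B,) diffusion step
        t_feat = self.t_embed(t.float().view(-1, 1, 1).expand(-1, noisy_pose.size(1), 1))
        x = torch.cat([noisy_pose, cond, t_feat], dim=-1)
        out, h = self.rnn(x, h)        # unidirectional: only past frames are visible
        return self.head(out), h       # predicted clean pose and carried hidden state
```

Because the recurrence never looks at future frames, the returned hidden state can be reused across incoming frames at test time instead of re-encoding the whole sequence.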
By embedding local-to-global translation, forward and inverse kinematics, and foot-contact detection within the diffusion framework, EgoMDM achieves seamless, end-to-end motion synthesis, effectively reducing artifacts such as foot sliding and ground penetration. Additionally, EgoMDM is conditioned on the user's body scale, allowing it to generalize across a diverse population and produce consistent avatar shapes over time. In extensive experiments on the AMASS motion capture dataset, EgoMDM achieves state-of-the-art performance in both motion tracking accuracy and synthesis quality, demonstrating its robustness and adaptability across a wide range of human motion scenarios. Furthermore, EgoMDM significantly outperforms existing models when tested on real sensor inputs, highlighting its applicability to real-world data.
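As a rough illustration of how predicted foot contacts can reduce the foot-sliding artifact mentioned above, the following is a hedged sketch of a foot-skating penalty. It is an assumption about one plausible formulation, not the paper's exact loss: it simply penalizes foot velocity in frames where the predicted contact probability is high.

```python
# Assumed, illustrative foot-skating penalty (not necessarily the paper's loss):
# encourage zero foot velocity whenever a foot joint is predicted to be in contact.
import torch

def foot_skating_loss(foot_pos, contact_prob):
    # foot_pos: (B, T, F, 3) global foot-joint positions
    # contact_prob: (B, T, F) predicted foot-ground contact probabilities
    vel = foot_pos[:, 1:] - foot_pos[:, :-1]       # per-frame foot displacement
    speed = vel.norm(dim=-1)                       # (B, T-1, F)
    return (contact_prob[:, 1:] * speed).mean()    # penalize motion during contact
```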
Given the tracking signals from the headset and two hand controllers, we first denoise the partially represented motion and construct the full-body motion using an analytical IK solver. The denoiser network estimates the foot-ground contact probabilities, limb joint twists and positions, and torso joint angles. After local-to-global translation, we construct the initial global-space human motion, which a residual refinement network then refines. Finally, the full-body mesh motion is computed analytically.
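To make the local-to-global step above concrete, here is a minimal sketch, under assumed conventions (rotation matrices, a fixed kinematic tree, per-frame root velocities), of accumulating a global root trajectory and running forward kinematics to obtain global joint positions. The skeleton layout and shapes are placeholders, not the paper's actual representation.

```python
# Hedged sketch of local-to-global translation + forward kinematics.
import torch

def local_to_global(root_vel, local_rot, local_offsets, parents):
    # root_vel: (T, 3) per-frame root translation deltas
    # local_rot: (T, J, 3, 3) per-joint local rotation matrices
    # local_offsets: (J, 3) bone offsets from each joint's parent
    # parents: list of parent indices, parents[0] == -1 for the root
    root_pos = torch.cumsum(root_vel, dim=0)        # global root trajectory
    T, J = local_rot.shape[:2]
    glob_rot = torch.empty_like(local_rot)
    glob_pos = torch.empty(T, J, 3, dtype=local_rot.dtype)
    glob_rot[:, 0] = local_rot[:, 0]
    glob_pos[:, 0] = root_pos
    for j in range(1, J):                           # walk the kinematic tree root-to-leaf
        p = parents[j]
        glob_rot[:, j] = glob_rot[:, p] @ local_rot[:, j]
        glob_pos[:, j] = glob_pos[:, p] + glob_rot[:, p] @ local_offsets[j]
    return glob_pos, glob_rot
```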
A comparison of motion-tracking accuracy between HMD-Poser and EgoMDM (ours). Ground truth is derived from the AMASS mocap dataset. Vertices are colored by their per-vertex distance to the ground-truth motion (red indicates larger error). Across various movement scenarios, the motion synthesized by EgoMDM is closer to the reference ground-truth motion than that of the state-of-the-art method.
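For reference, the per-vertex distance used for the color coding in this comparison can be computed as below; this is a generic Euclidean vertex error, written here as a small illustrative helper rather than the paper's evaluation code.

```python
# Per-vertex Euclidean error between predicted and ground-truth mesh vertices.
import torch

def per_vertex_error(pred_verts, gt_verts):
    # pred_verts, gt_verts: (T, V, 3) mesh vertex positions over T frames
    return (pred_verts - gt_verts).norm(dim=-1)   # (T, V) per-vertex distance
```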
A comparison of motion-synthesis quality between AvatarJLM, HMD-Poser, and EgoMDM (ours). EgoMDM shows less foot skating (first row), floating (second row), and floor penetration (third row) compared to the other methods.
Our model demonstrates robust performance even in challenging scenarios such as uneven terrain, showcasing its adaptability beyond flat ground-plane assumptions.
@inproceedings{shin2026egomdm,
author = {Shin, Soyong and Pahuja, Anuj and Richard, Alexander and Kitani, Kris and Saragih, Jason and Chen, Yuhua and Xu, Weipeng and Halilaj, Eni and Bagautdinov, Timur},
title = {EgoMDM: Diffusion-based Human Motion Synthesis from Sparse Egocentric Sensors},
booktitle = {International Conference on 3D Vision (3DV)},
year = {2026},
}