3DLLM-Mem: Long-Term Spatial-Temporal Memory for Embodied 3D Large Language Model

Wenbo Hu, Yining Hong, Yanjun Wang, Leison Gao, Zibu Wei, Xingcheng Yao, Nanyun Peng, Yonatan Bitton, Idan Szpektor, and Kai-Wei Chang, in NeurIPS, 2025.

Best Paper at Foundation Models Meet Embodied Agents Workshop at CVPR 2025

Code

Download the full text


Abstract

Humans excel at performing complex tasks by leveraging long-term memory across temporal and spatial experiences. In contrast, current Large Language Models (LLMs) struggle to plan and act in dynamic, multi-room 3D environments because they lack proper 3D spatial-temporal memory modeling. To address this, the authors introduce 3DMem-Bench, a benchmark with over 26,000 trajectories and 2,892 embodied tasks designed to evaluate an agent’s ability to reason over long-term memory in 3D environments. They then propose 3DLLM-Mem, a dynamic memory management and fusion model for embodied spatial-temporal reasoning. The model uses working-memory tokens to selectively attend to and fuse the most useful spatial and temporal features from episodic memory, enabling agents to focus on task-relevant information while maintaining memory efficiency. Experiments show that 3DLLM-Mem achieves state-of-the-art performance across various tasks, outperforming strong baselines by 16.5% in success rate on the most challenging in-the-wild tasks of 3DMem-Bench.
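The working-memory mechanism described in the abstract can be pictured roughly as cross-attention from a small, fixed set of working-memory tokens over a growing bank of episodic spatial-temporal features. The sketch below is only a minimal illustration of that idea under assumed shapes and names (`MemoryFusionSketch`, `n_working_tokens`, etc. are hypothetical), not the authors' implementation; see the paper and code for the actual 3DLLM-Mem architecture.

```python
import torch
import torch.nn as nn

class MemoryFusionSketch(nn.Module):
    """Hypothetical sketch: learnable working-memory tokens cross-attend over
    an episodic memory bank and return a compact, task-relevant summary."""

    def __init__(self, d_model: int = 256, n_working_tokens: int = 8, n_heads: int = 8):
        super().__init__()
        # Learnable working-memory tokens (used as attention queries).
        self.working_tokens = nn.Parameter(torch.randn(n_working_tokens, d_model))
        # Cross-attention: queries = working tokens, keys/values = episodic memory.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, episodic_memory: torch.Tensor) -> torch.Tensor:
        # episodic_memory: (batch, n_memory_features, d_model), e.g. spatial-temporal
        # features accumulated over previously visited rooms and time steps.
        batch = episodic_memory.size(0)
        queries = self.working_tokens.unsqueeze(0).expand(batch, -1, -1)
        fused, _ = self.cross_attn(queries, episodic_memory, episodic_memory)
        # The fused output has a fixed number of tokens regardless of how large
        # the episodic memory has grown, keeping the downstream context bounded.
        return self.norm(fused)

# Usage: fuse 26 episodic feature vectors into 8 working-memory tokens.
memory_bank = torch.randn(2, 26, 256)
fused = MemoryFusionSketch()(memory_bank)
print(fused.shape)  # torch.Size([2, 8, 256])
```

The design intuition this sketch tries to convey is the one stated in the abstract: the agent attends only to the most useful episodic features, so memory cost stays roughly constant while task-relevant spatial and temporal information is preserved.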


Bib Entry

@inproceedings{hu2025tdllm,
  title = {3DLLM-Mem: Long-Term Spatial-Temporal Memory for Embodied 3D Large Language Model},
  author = {Hu, Wenbo and Hong, Yining and Wang, Yanjun and Gao, Leison and Wei, Zibu and Yao, Xingcheng and Peng, Nanyun and Bitton, Yonatan and Szpektor, Idan and Chang, Kai-Wei},
  booktitle = {NeurIPS},
  year = {2025}
}

Related Publications