LaViDa: A Large Diffusion Language Model for Multimodal Understanding
Shufan Li, Konstantinos Kallidromitis, Hritik Bansal, Akash Gokul, Yusuke Kato, Kazuki Kozuka, Jason Kuen, Zhe Lin, Kai-Wei Chang, and Aditya Grover, in NeurIPS, 2025.
Spotlight (top 5% papers)
Abstract
Existing autoregressive vision-language models (VLMs) offer impressive visual reasoning but suffer from slow sequential decoding and limited control over generation. Discrete diffusion models (DMs) provide parallel decoding and bidirectional context, yet their use in multimodal tasks is underexplored. LaViDa introduces a family of diffusion-based VLMs that couple a vision encoder with a diffusion language model and jointly fine-tune the combined components for multimodal instruction following. The model incorporates complementary masking to improve training efficiency, a prefix KV cache for faster inference, and timestep shifting for high-quality sampling. LaViDa achieves performance competitive with or superior to autoregressive VLMs on multimodal benchmarks such as MMMU and COCO, while offering flexible speed-quality trade-offs and controllable generation. For example, LaViDa surpasses Open-LLaVa-Next-Llama3-8B by +4.1 CIDEr on COCO captioning with a 1.92x speedup and improves constrained poem completion by 59%. Code and models are available at the authors' repository.
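To make the decoding scheme concrete, below is a minimal, hedged sketch of parallel masked-diffusion decoding as described in the abstract: the prompt is a fixed prefix (loosely analogous to LaViDa's prefix KV cache, which avoids re-encoding it each step), the answer starts fully masked, and tokens are revealed over a small number of denoising steps, with the step count setting the speed-quality trade-off. The `toy_predictor`, its fake confidence scores, and the unmasking schedule are all illustrative stand-ins, not the paper's actual model or sampler.

```python
import math

MASK = "<mask>"

def toy_predictor(tokens):
    # Stand-in for the diffusion model: proposes a token and a confidence
    # score for every masked position. A real model predicts all positions
    # in parallel from bidirectional context; here both the token and the
    # confidence are fabricated for illustration.
    proposals = {}
    for i, tok in enumerate(tokens):
        if tok == MASK:
            proposals[i] = (f"tok{i}", 1.0 / (1 + i % 3))
    return proposals

def diffusion_decode(prompt, answer_len, steps=4):
    # Parallel decoding sketch: the prompt acts as a fixed prefix while the
    # answer region starts fully masked and is revealed over `steps`
    # iterations. Fewer steps means faster decoding; more steps allows the
    # model to revise with more context (the speed-quality trade-off).
    tokens = list(prompt) + [MASK] * answer_len
    for step in range(steps):
        proposals = toy_predictor(tokens)
        if not proposals:
            break
        # Unmask only the most confident positions this step, spreading the
        # remaining masks evenly over the steps left; the rest stay masked
        # and are re-predicted next iteration.
        budget = math.ceil(len(proposals) / (steps - step))
        best = sorted(proposals.items(), key=lambda kv: -kv[1][1])[:budget]
        for i, (tok, _conf) in best:
            tokens[i] = tok
    return tokens
```

Running `diffusion_decode(["Q:", "caption?"], 6, steps=3)` fills all six masked answer slots in three parallel passes instead of six sequential ones, which is the source of the speedup the abstract reports.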
Bib Entry
@inproceedings{li2025lavida,
  title     = {LaViDa: A Large Diffusion Language Model for Multimodal Understanding},
  author    = {Li, Shufan and Kallidromitis, Konstantinos and Bansal, Hritik and Gokul, Akash and Kato, Yusuke and Kozuka, Kazuki and Kuen, Jason and Lin, Zhe and Chang, Kai-Wei and Grover, Aditya},
  booktitle = {NeurIPS},
  year      = {2025}
}
Related Publications
- HoneyBee: Data Recipes for Vision-Language Reasoners, CVPR, 2026
- MotionEdit: Benchmarking and Learning Motion-Centric Image Editing, CVPR, 2026
- PARTONOMY: Large Multimodal Models with Part-Level Visual Understanding, NeurIPS, 2025
- STIV: Scalable Text and Image Conditioned Video Generation, ICCV, 2025
- Verbalized Representation Learning for Interpretable Few-Shot Generalization, ICCV, 2025
- Contrastive Visual Data Augmentation, ICML, 2025
- SYNTHIA: Novel Concept Design with Affordance Composition, ACL, 2025
- SlowFast-VGen: Slow-Fast Learning for Action-Driven Long Video Generation, ICLR, 2025
- Towards a holistic framework for multimodal LLM in 3D brain CT radiology report generation, Nature Communications, 2025
- Enhancing Large Vision Language Models with Self-Training on Image Comprehension, NeurIPS, 2024
- CoBIT: A Contrastive Bi-directional Image-Text Generation Model, ICLR, 2024
- DesCo: Learning Object Recognition with Rich Language Descriptions, NeurIPS, 2023
- What's "up" with vision-language models? Investigating their struggle to understand spatial relations, EMNLP, 2023
- Text Encoders are Performance Bottlenecks in Contrastive Vision-Language Models, EMNLP, 2023
- MetaVL: Transferring In-Context Learning Ability From Language Models to Vision-Language Models, ACL (short), 2023
- REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge, CVPR, 2023
- Grounded Language-Image Pre-training, CVPR, 2022
- How Much Can CLIP Benefit Vision-and-Language Tasks?, ICLR, 2022