LaViDa: A Large Diffusion Language Model for Multimodal Understanding

Shufan Li, Konstantinos Kallidromitis, Hritik Bansal, Akash Gokul, Yusuke Kato, Kazuki Kozuka, Jason Kuen, Zhe Lin, Kai-Wei Chang, and Aditya Grover, in NeurIPS, 2025.

Spotlight (top 5% papers)

Code

Download the full text


Abstract

Existing autoregressive vision-language models (VLMs) offer impressive visual reasoning but suffer from slow sequential decoding and limited control over generation. Discrete diffusion models (DMs) provide parallel decoding and bidirectional context, yet their use in multimodal tasks remains underexplored. LaViDa introduces a family of diffusion-based VLMs that integrate a vision encoder into a diffusion language model and jointly fine-tune the combined components for multimodal instruction following. The model incorporates complementary masking for training efficiency, a prefix KV cache for faster inference, and timestep shifting for high-quality sampling. LaViDa matches or surpasses autoregressive VLMs on multimodal benchmarks such as MMMU and COCO captioning, while offering flexible speed-quality trade-offs and controllable generation. For example, LaViDa surpasses Open-LLaVa-Next-Llama3-8B by +4.1 CIDEr on COCO captioning with a 1.92x speedup and improves constrained poem completion by 59%. Code and models are available at the authors’ repository.


Bib Entry

@inproceedings{li2025lavida,
  title = {LaViDa: A Large Diffusion Language Model for Multimodal Understanding},
  author = {Li, Shufan and Kallidromitis, Konstantinos and Bansal, Hritik and Gokul, Akash and Kato, Yusuke and Kozuka, Kazuki and Kuen, Jason and Lin, Zhe and Chang, Kai-Wei and Grover, Aditya},
  booktitle = {NeurIPS},
  year = {2025}
}

Related Publications

  1. HoneyBee: Data Recipes for Vision-Language Reasoners, CVPR, 2026
  2. MotionEdit: Benchmarking and Learning Motion-Centric Image Editing, CVPR, 2026
  3. PARTONOMY: Large Multimodal Models with Part-Level Visual Understanding, NeurIPS, 2025
  4. STIV: Scalable Text and Image Conditioned Video Generation, ICCV, 2025
  5. Verbalized Representation Learning for Interpretable Few-Shot Generalization, ICCV, 2025
  6. Contrastive Visual Data Augmentation, ICML, 2025
  7. SYNTHIA: Novel Concept Design with Affordance Composition, ACL, 2025
  8. SlowFast-VGen: Slow-Fast Learning for Action-Driven Long Video Generation, ICLR, 2025
  9. Towards a holistic framework for multimodal LLM in 3D brain CT radiology report generation, Nature Communications, 2025
  10. Enhancing Large Vision Language Models with Self-Training on Image Comprehension, NeurIPS, 2024
  11. CoBIT: A Contrastive Bi-directional Image-Text Generation Model, ICLR, 2024
  12. DesCo: Learning Object Recognition with Rich Language Descriptions, NeurIPS, 2023
  13. "What's 'up' with vision-language models? Investigating their struggle to understand spatial relations.", EMNLP, 2023
  14. Text Encoders are Performance Bottlenecks in Contrastive Vision-Language Models, EMNLP, 2023
  15. MetaVL: Transferring In-Context Learning Ability From Language Models to Vision-Language Models, ACL (short), 2023
  16. REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge, CVPR, 2023
  17. Grounded Language-Image Pre-training, CVPR, 2022
  18. How Much Can CLIP Benefit Vision-and-Language Tasks?, ICLR, 2022