SlowFast-VGen: Slow-Fast Learning for Action-Driven Long Video Generation

Yining Hong, Beide Liu, Maxine Wu, Yuanhao Zhai, Kai-Wei Chang, Linjie Li, Kevin Lin, Chung-Ching Lin, Jianfeng Wang, Zhengyuan Yang, Ying Nian Wu, and Lijuan Wang, in ICLR, 2025.

Spotlight (top 5% of papers)


Abstract

Human beings are endowed with a complementary learning system, which bridges the slow learning of general world dynamics with the fast storage of episodic memory from new experiences. Previous video generation models, however, primarily focus on slow learning by pre-training on vast amounts of data, overlooking the fast learning phase crucial for episodic memory storage. This oversight leads to inconsistencies across temporally distant frames when generating longer videos, as these frames fall beyond the model’s context window. To this end, we introduce SlowFast-VGen, a novel dual-speed learning system for action-driven long video generation. Our approach incorporates a masked conditional video diffusion model for the slow learning of world dynamics, alongside an inference-time fast learning strategy based on a temporal LoRA module. Specifically, the fast learning process updates its temporal LoRA parameters based on local inputs and outputs, thereby efficiently storing episodic memory in its parameters. We further propose a slow-fast learning loop algorithm that seamlessly integrates the inner fast learning loop into the outer slow learning loop, enabling the recall of prior multi-episode experiences for context-aware skill learning. To facilitate the slow learning of an approximate world model, we collect a large-scale dataset of 200k videos with language action annotations, covering a wide range of scenarios. Extensive experiments show that SlowFast-VGen outperforms baselines across various metrics for action-driven video generation, achieving an FVD score of 514 compared to 782, and maintaining consistency in longer videos, with an average of 0.37 scene cuts versus 0.89. The slow-fast learning loop algorithm also significantly enhances performance on long-horizon planning tasks. Project Website: https://slowfast-vgen.github.io
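
Below is a minimal PyTorch sketch of the slow-fast learning loop described in the abstract: a "slow" base layer stands in for the pretrained diffusion backbone, while a temporal LoRA adapter is updated at inference time on local input/output pairs to store episodic memory. All names here (TemporalLoRALinear, fast_adapt, slow_fast_loop) and the toy MSE objective are hypothetical simplifications for illustration, not the paper's actual masked conditional video diffusion model.

import torch
import torch.nn as nn

class TemporalLoRALinear(nn.Module):
    """Linear layer with a low-rank (LoRA) adapter; only lora_A/lora_B train fast."""
    def __init__(self, dim, rank=4):
        super().__init__()
        self.base = nn.Linear(dim, dim)                    # slow weights (world dynamics)
        self.lora_A = nn.Parameter(torch.zeros(rank, dim)) # low-rank adapter, episodic memory
        self.lora_B = nn.Parameter(torch.randn(dim, rank) * 0.01)

    def forward(self, x):
        return self.base(x) + x @ self.lora_A.t() @ self.lora_B.t()

def fast_adapt(model, chunk_in, chunk_out, steps=10, lr=1e-2):
    """Inner (fast) loop: fit only the LoRA parameters to a local
    input/output chunk pair, leaving the slow base weights untouched."""
    lora_params = [p for n, p in model.named_parameters() if "lora" in n]
    opt = torch.optim.Adam(lora_params, lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(chunk_in), chunk_out)
        loss.backward()
        opt.step()
    return loss.item()

def slow_fast_loop(model, episodes, slow_lr=1e-3):
    """Outer (slow) loop: after fast-adapting on each episode's chunks,
    take one slow gradient step on the base weights over the whole episode."""
    base_params = [p for n, p in model.named_parameters() if "base" in n]
    slow_opt = torch.optim.Adam(base_params, lr=slow_lr)
    for chunks in episodes:                      # one episode = a list of chunk pairs
        for chunk_in, chunk_out in chunks:       # inner fast-learning loop
            fast_adapt(model, chunk_in, chunk_out)
        slow_opt.zero_grad()
        ep_in = torch.cat([c[0] for c in chunks])
        ep_out = torch.cat([c[1] for c in chunks])
        nn.functional.mse_loss(model(ep_in), ep_out).backward()
        slow_opt.step()

# Toy usage: two "episodes" of random 16-dim feature chunks.
model = TemporalLoRALinear(dim=16)
episodes = [[(torch.randn(8, 16), torch.randn(8, 16)) for _ in range(3)]
            for _ in range(2)]
slow_fast_loop(model, episodes)

In the paper, the inner loop runs at inference time on the model's own locally generated chunks, so the LoRA parameters serve as a parametric episodic memory; the random targets above merely stand in for those local inputs and outputs.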


Bib Entry

@inproceedings{hong2025slowfast,
  title = {SlowFast-VGen: Slow-Fast Learning for Action-Driven Long Video Generation},
  author = {Hong, Yining and Liu, Beide and Wu, Maxine and Zhai, Yuanhao and Chang, Kai-Wei and Li, Linjie and Lin, Kevin and Lin, Chung-Ching and Wang, Jianfeng and Yang, Zhengyuan and Wu, Ying Nian and Wang, Lijuan},
  booktitle = {ICLR},
  year = {2025}
}

Related Publications

  1. VisRet: Visualization Improves Knowledge-Intensive Text-to-Image Retrieval, ACL, 2026
  2. HoneyBee: Data Recipes for Vision-Language Reasoners, CVPR, 2026
  3. MotionEdit: Benchmarking and Learning Motion-Centric Image Editing, CVPR, 2026
  4. LaViDa: A Large Diffusion Language Model for Multimodal Understanding, NeurIPS, 2025
  5. PARTONOMY: Large Multimodal Models with Part-Level Visual Understanding, NeurIPS, 2025
  6. Verbalized Representation Learning for Interpretable Few-Shot Generalization, ICCV, 2025
  7. STIV: Scalable Text and Image Conditioned Video Generation, ICCV, 2025
  8. Contrastive Visual Data Augmentation, ICML, 2025
  9. SYNTHIA: Novel Concept Design with Affordance Composition, ACL, 2025
  10. Towards a holistic framework for multimodal LLM in 3D brain CT radiology report generation, Nature Communications, 2025
  11. Enhancing Large Vision Language Models with Self-Training on Image Comprehension, NeurIPS, 2024
  12. CoBIT: A Contrastive Bi-directional Image-Text Generation Model, ICLR, 2024
  13. DesCo: Learning Object Recognition with Rich Language Descriptions, NeurIPS, 2023
  14. Text Encoders are Performance Bottlenecks in Contrastive Vision-Language Models, EMNLP, 2023
  15. "What's 'up' with vision-language models? Investigating their struggle to understand spatial relations.", EMNLP, 2023
  16. MetaVL: Transferring In-Context Learning Ability From Language Models to Vision-Language Models, ACL (short), 2023
  17. REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge, CVPR, 2023
  18. Grounded Language-Image Pre-training, CVPR, 2022
  19. How Much Can CLIP Benefit Vision-and-Language Tasks?, ICLR, 2022