MotionEdit: Benchmarking and Learning Motion-Centric Image Editing

Yixin Wan, Lei Ke, Wenhao Yu, Kai-Wei Chang, and Dong Yu, in CVPR, 2026.

Code

Download the full text


Abstract

We introduce MotionEdit, a novel dataset for motion-centric image editing: the task of modifying subject actions and interactions while preserving identity, structure, and physical plausibility. Unlike existing image editing datasets that focus on static appearance changes or contain only sparse, low-quality motion edits, MotionEdit provides high-fidelity image pairs depicting realistic motion transformations extracted and verified from continuous videos. This new task is not only scientifically challenging but also practically significant, powering downstream applications such as frame-controlled video synthesis and animation. To evaluate model performance on this novel task, we introduce MotionEdit-Bench, a benchmark that challenges models on motion-centric edits and measures performance with generative, discriminative, and preference-based metrics. Benchmark results reveal that motion editing remains highly challenging for existing state-of-the-art diffusion-based editing models. To address this gap, we propose MotionNFT (Motion-guided Negative-aware Fine Tuning), a post-training framework that computes motion alignment rewards based on how well the motion flow between input and model-edited images matches the ground-truth motion, guiding models toward accurate motion transformations. Extensive experiments on FLUX.1 Kontext and Qwen-Image-Edit show that MotionNFT consistently improves editing quality and motion fidelity of both base models on the motion editing task without sacrificing general editing ability, demonstrating its effectiveness. Our code is at https://github.com/elainew728/motion-edit/.
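The abstract describes MotionNFT's reward only at a high level; the exact formulation is given in the paper. Below is a minimal, hypothetical sketch of how such a motion-alignment reward could be computed with an off-the-shelf optical-flow estimator. The estimator choice (RAFT from torchvision), the endpoint-error distance, the exponential mapping, and names such as `motion_alignment_reward` and `beta` are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: score an edit by how well the optical flow
# (input -> model-edited image) matches the flow (input -> ground-truth edit).
# The flow model, distance, and reward mapping are illustrative choices only.
import torch
from torchvision.models.optical_flow import raft_large, Raft_Large_Weights

device = "cuda" if torch.cuda.is_available() else "cpu"
flow_model = raft_large(weights=Raft_Large_Weights.DEFAULT).eval().to(device)

@torch.no_grad()
def estimate_flow(src: torch.Tensor, dst: torch.Tensor) -> torch.Tensor:
    """Dense flow from src to dst; images are (B, 3, H, W) tensors in [-1, 1]."""
    # RAFT returns a list of iterative refinements; keep the final (B, 2, H, W) flow.
    return flow_model(src.to(device), dst.to(device))[-1]

def motion_alignment_reward(inp, edited, gt, beta: float = 0.1) -> torch.Tensor:
    """Reward in (0, 1]: approaches 1 when the edit reproduces the ground-truth motion."""
    flow_pred = estimate_flow(inp, edited)  # motion implied by the model's edit
    flow_gt = estimate_flow(inp, gt)        # motion in the verified ground-truth pair
    # Per-image mean endpoint error between the two flow fields.
    epe = torch.linalg.norm(flow_pred - flow_gt, dim=1).mean(dim=(1, 2))
    return torch.exp(-beta * epe)
```

In a post-training setup of this kind, such a reward could be used to weight well-aligned ("positive") versus misaligned ("negative") edits during fine-tuning; see the paper and repository for the actual MotionNFT objective.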



Bib Entry

@inproceedings{wan2026motionedit,
  title = {MotionEdit: Benchmarking and Learning Motion-Centric Image Editing},
  author = {Wan, Yixin and Ke, Lei and Yu, Wenhao and Chang, Kai-Wei and Yu, Dong},
  booktitle = {CVPR},
  year = {2026}
}

Related Publications

  1. VisRet: Visualization Improves Knowledge-Intensive Text-to-Image Retrieval, ACL, 2026
  2. HoneyBee: Data Recipes for Vision-Language Reasoners, CVPR, 2026
  3. LaViDa: A Large Diffusion Language Model for Multimodal Understanding, NeurIPS, 2025
  4. PARTONOMY: Large Multimodal Models with Part-Level Visual Understanding, NeurIPS, 2025
  5. SlowFast-VGen: Slow-Fast Learning for Action-Driven Long Video Generation, ICLR, 2025
  6. Verbalized Representation Learning for Interpretable Few-Shot Generalization, ICCV, 2025
  7. STIV: Scalable Text and Image Conditioned Video Generation, ICCV, 2025
  8. Contrastive Visual Data Augmentation, ICML, 2025
  9. SYNTHIA: Novel Concept Design with Affordance Composition, ACL, 2025
  10. Towards a holistic framework for multimodal LLM in 3D brain CT radiology report generation, Nature Communications, 2025
  11. Enhancing Large Vision Language Models with Self-Training on Image Comprehension, NeurIPS, 2024
  12. CoBIT: A Contrastive Bi-directional Image-Text Generation Model, ICLR, 2024
  13. DesCo: Learning Object Recognition with Rich Language Descriptions, NeurIPS, 2023
  14. Text Encoders are Performance Bottlenecks in Contrastive Vision-Language Models, EMNLP, 2023
  15. "What's 'up' with vision-language models? Investigating their struggle to understand spatial relations.", EMNLP, 2023
  16. MetaVL: Transferring In-Context Learning Ability From Language Models to Vision-Language Models, ACL (short), 2023
  17. REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge, CVPR, 2023
  18. Grounded Language-Image Pre-training, CVPR, 2022
  19. How Much Can CLIP Benefit Vision-and-Language Tasks?, ICLR, 2022