MotionEdit: Benchmarking and Learning Motion-Centric Image Editing
Yixin Wan, Lei Ke, Wenhao Yu, Kai-Wei Chang, and Dong Yu, in CVPR, 2026.
Code | Download the full text
Abstract
We introduce MotionEdit, a novel dataset for motion-centric image editing: the task of modifying subject actions and interactions while preserving identity, structure, and physical plausibility. Unlike existing image editing datasets, which focus on static appearance changes or contain only sparse, low-quality motion edits, MotionEdit provides high-fidelity image pairs depicting realistic motion transformations extracted and verified from continuous videos. This new task is not only scientifically challenging but also practically significant, powering downstream applications such as frame-controlled video synthesis and animation. To evaluate model performance on this novel task, we introduce MotionEdit-Bench, a benchmark that challenges models on motion-centric edits and measures their performance with generative, discriminative, and preference-based metrics. Benchmark results reveal that motion editing remains highly challenging for existing state-of-the-art diffusion-based editing models. To address this gap, we propose MotionNFT (Motion-guided Negative-aware Fine-Tuning), a post-training framework that computes motion alignment rewards based on how well the motion flow between the input and model-edited images matches the ground-truth motion, guiding models toward accurate motion transformations. Extensive experiments on FLUX.1 Kontext and Qwen-Image-Edit show that MotionNFT consistently improves the editing quality and motion fidelity of both base models on the motion editing task without sacrificing general editing ability, demonstrating its effectiveness. Our code is available at https://github.com/elainew728/motion-edit/.
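The motion alignment reward described above could be sketched as follows. This is a minimal illustrative sketch, not the paper's actual formulation: it assumes the optical flows (input→edited and input→ground-truth) have already been computed by an off-the-shelf flow estimator, and the function name, cosine-similarity scoring, and magnitude weighting are all assumptions for illustration.

```python
import numpy as np

def motion_alignment_reward(flow_edit: np.ndarray,
                            flow_gt: np.ndarray,
                            eps: float = 1e-8) -> float:
    """Hypothetical reward in [-1, 1] for how well a model edit's motion
    matches the ground-truth motion.

    flow_edit: (H, W, 2) optical flow from the input image to the edited image.
    flow_gt:   (H, W, 2) optical flow from the input image to the ground truth.
    """
    # Per-pixel cosine similarity between the two flow fields.
    dot = (flow_edit * flow_gt).sum(axis=-1)
    norms = (np.linalg.norm(flow_edit, axis=-1)
             * np.linalg.norm(flow_gt, axis=-1))
    cos = dot / (norms + eps)

    # Weight by ground-truth motion magnitude so that static background
    # pixels do not dominate the score (an illustrative design choice).
    weight = np.linalg.norm(flow_gt, axis=-1)
    return float((cos * weight).sum() / (weight.sum() + eps))
```

With perfectly matched flows the reward approaches 1, and with opposing motion it approaches -1, so it can serve as a reward signal during post-training; the actual MotionNFT objective should be taken from the paper and released code.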
🤔🤔Tired of static, lifeless image edits?
— Yixin Wan (@yixin_wan_) December 12, 2025
Not anymore! 🤗
🚀🚀We introduce MotionEdit, a framework supporting image editing that understands action, motion, interaction beyond static changes! 🤩🤩
🔗Full paper: https://t.co/lKKQX6DJPj
✨Project page: https://t.co/3XdkxVgM4b
Bib Entry
@inproceedings{wan2026motionedit,
title = {MotionEdit: Benchmarking and Learning Motion-Centric Image Editing},
author = {Wan, Yixin and Ke, Lei and Yu, Wenhao and Chang, Kai-Wei and Yu, Dong},
booktitle = {CVPR},
year = {2026}
}
Related Publications
- VisRet: Visualization Improves Knowledge-Intensive Text-to-Image Retrieval, ACL, 2026
- HoneyBee: Data Recipes for Vision-Language Reasoners, CVPR, 2026
- LaViDa: A Large Diffusion Language Model for Multimodal Understanding, NeurIPS, 2025
- PARTONOMY: Large Multimodal Models with Part-Level Visual Understanding, NeurIPS, 2025
- SlowFast-VGen: Slow-Fast Learning for Action-Driven Long Video Generation, ICLR, 2025
- Verbalized Representation Learning for Interpretable Few-Shot Generalization, ICCV, 2025
- STIV: Scalable Text and Image Conditioned Video Generation, ICCV, 2025
- Contrastive Visual Data Augmentation, ICML, 2025
- SYNTHIA: Novel Concept Design with Affordance Composition, ACL, 2025
- Towards a holistic framework for multimodal LLM in 3D brain CT radiology report generation, Nature Communications, 2025
- Enhancing Large Vision Language Models with Self-Training on Image Comprehension, NeurIPS, 2024
- CoBIT: A Contrastive Bi-directional Image-Text Generation Model, ICLR, 2024
- DesCo: Learning Object Recognition with Rich Language Descriptions, NeurIPS, 2023
- Text Encoders are Performance Bottlenecks in Contrastive Vision-Language Models, EMNLP, 2023
- "What's 'up' with vision-language models? Investigating their struggle to understand spatial relations.", EMNLP, 2023
- MetaVL: Transferring In-Context Learning Ability From Language Models to Vision-Language Models, ACL (short), 2023
- REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge, CVPR, 2023
- Grounded Language-Image Pre-training, CVPR, 2022
- How Much Can CLIP Benefit Vision-and-Language Tasks?, ICLR, 2022