OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles

Yihe Deng, Hritik Bansal, Fan Yin, Nanyun Peng, Wei Wang, and Kai-Wei Chang, in NeurIPS, 2025.

Code

Download the full text


Abstract

OpenVLThinker is among the first open-source large vision-language models (LVLMs) to exhibit sophisticated chain-of-thought reasoning. When reasoning capabilities from text-only models are distilled into LVLMs via supervised fine-tuning (SFT), performance often degrades due to imprecise visual grounding; pure reinforcement-learning (RL) methods suffer from large search spaces that inhibit reflective behaviors in smaller models. The authors find that alternating between SFT and RL markedly improves performance after a few iterations. Initially, the base LVLM seldom exhibits reasoning behaviors, but SFT surfaces these latent behaviors and narrows the RL search space. Each subsequent RL stage refines the model’s reasoning and provides higher-quality SFT data for the next iteration. OpenVLThinker-7B achieves consistent gains across six benchmarks requiring mathematical and general reasoning, improving MathVista by 3.8%, EMMA by 2.4%, and HallusionBench by 1.6%, illustrating the synergy between SFT and RL for complex multimodal reasoning. The authors make the code, model and data publicly available.
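
The iterative recipe in the abstract can be summarized as a simple training loop. The sketch below is a minimal illustration under assumptions, not the authors' released implementation; every function name here (distill_reasoning_traces, supervised_finetune, reinforce) is a hypothetical placeholder for the corresponding stage.

```python
# Minimal sketch of the iterative SFT-RL cycle described in the abstract.
# All function names and bodies are hypothetical placeholders, not the
# authors' released code.

def distill_reasoning_traces(model, prompts):
    """Generate chain-of-thought SFT data. In the first iteration the traces
    would come from a distilled text-only reasoner; afterwards they come
    from the RL-refined LVLM itself."""
    ...

def supervised_finetune(model, sft_data):
    """SFT stage: surfaces latent reasoning behaviors in the base LVLM and
    narrows the subsequent RL search space."""
    ...

def reinforce(model, prompts):
    """RL stage: refines the model's reasoning beyond what SFT alone gives."""
    ...

def iterative_sft_rl(base_lvlm, prompts, num_iterations=3):
    """Alternate SFT and RL; each cycle yields higher-quality SFT data."""
    model = base_lvlm
    sft_data = distill_reasoning_traces(model, prompts)
    for _ in range(num_iterations):
        model = supervised_finetune(model, sft_data)
        model = reinforce(model, prompts)
        # The improved model produces better traces for the next SFT round.
        sft_data = distill_reasoning_traces(model, prompts)
    return model
```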


Bib Entry

@inproceedings{deng2025openvlthinker,
  title = {OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles},
  author = {Deng, Yihe and Bansal, Hritik and Yin, Fan and Peng, Nanyun and Wang, Wei and Chang, Kai-Wei},
  booktitle = {NeurIPS},
  year = {2025}
}

Related Publications