IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models
Haoxuan You, Rui Sun, Zhecan Wang, Long Chen, Gengyu Wang, Hammad Ayyubi, Kai-Wei Chang, and Shih-Fu Chang, in EMNLP-Finding, 2023.
Download the full text
Abstract
The field of vision-and-language (VL) understanding has made unprecedented progress with end-to-end large pre-trained VL models (VLMs). However, they still fall short in zero-shot reasoning tasks that require multi-step inferencing. To achieve this goal, previous works resort to a divide-and-conquer pipeline. In this paper, we argue that previous efforts have several inherent shortcomings: 1) They rely on domain-specific sub-question decomposing models. 2) They force models to predict the final answer even if the sub-questions or sub-answers provide insufficient information. We address these limitations via IdealGPT, a framework that iteratively decomposes VL reasoning using large language models (LLMs). Specifically, IdealGPT utilizes an LLM to generate sub-questions, a VLM to provide corresponding sub-answers, and another LLM to reason to achieve the final answer. These three modules perform the divide-and-conquer procedure iteratively until the model is confident about the final answer to the main question. We evaluate IdealGPT on multiple challenging VL reasoning tasks under a zero-shot setting. In particular, our IdealGPT outperforms the best existing GPT-4-like models by an absolute 10% on VCR and 15% on SNLI-VE.
Bib Entry
@inproceedings{you2023idealgpt,
author = {You, Haoxuan and Sun, Rui and Wang, Zhecan and Chen, Long and Wang, Gengyu and Ayyubi, Hammad and Chang, Kai-Wei and Chang, Shih-Fu},
booktitle = {EMNLP-Finding},
title = {IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models},
year = {2023}
}
Related Publications
- QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search, ICML, 2025
- DRS: Deep Question Reformulation With Structured Output, ACL-Findings, 2025
- V-ALPHASOCIAL: Benchmark and Self-Reflective Chain-of-Thought Generation for Visual Social Commonsense Reasoning, ACL-Findings, 2025
- Towards Safety Reasoning in LLMs: AI-agentic Deliberation for Policy-embedded CoT Data Creation, ACL-Findings, 2025
- VISCO: Benchmarking Fine-Grained Critique and Correction Towards Self-Improvement in Visual Reasoning, CVPR, 2025
- MQuAKE-Remastered: Multi-Hop Knowledge Editing Can Only Be Advanced with Reliable Evaluations, ICLR, 2025
- BRIEF: Bridging Retrieval and Inference for Multi-hop Reasoning via Compression, NAACL-Finding, 2025
- QUDSELECT: Selective Decoding for Questions Under Discussion Parsing, EMNLP, 2024
- Model Editing Harms General Abilities of Large Language Models: Regularization to the Rescue, EMNLP, 2024
- LLM-A*: Large Language Model Enhanced Incremental Heuristic Search on Path Planning, EMNLP-Finding, 2024
- Tree-of-Traversals: A Zero-Shot Reasoning Algorithm for Augmenting Black-box Language Models with Knowledge Graphs, ACL, 2024
- Are LLMs Capable of Data-based Statistical and Causal Reasoning? Benchmarking Advanced Quantitative Reasoning with Data, ACL-Findings, 2024
- Can small language models help large language models reason better?: LM-guided chain-of-thought, LREC-COLING, 2024