Are LLMs Capable of Data-based Statistical and Causal Reasoning? Benchmarking Advanced Quantitative Reasoning with Data

Xiao Liu, Zirui Wu, Xueqing Wu, Pan Lu, Kai-Wei Chang, and Yansong Feng, in ACL-Findings, 2024.

Code

Download the full text

Abstract

Quantitative reasoning is a critical skill to analyze data, yet the assessment of such ability remains limited. To address this gap, we introduce the Quantitative Reasoning with Data (QRData) benchmark, aiming to evaluate Large Language Models’ capability in statistical and causal reasoning with real-world data. The benchmark comprises a carefully constructed dataset of 411 questions accompanied by data sheets from textbooks, online learning materials, and academic papers. To compare models’ quantitative reasoning abilities on data and text, we enrich the benchmark with an auxiliary set of 290 text-only questions, namely QRText. We evaluate natural language reasoning, program-based reasoning, and agent reasoning methods including Chain-of-Thought, Program-of-Thoughts, ReAct, and code interpreter assistants on diverse models. The strongest model GPT-4 achieves an accuracy of 58%, which has a large room for improvement. Among open-source models, Deepseek-coder-instruct, a code LLM pretrained on 2T tokens, gets the highest accuracy of 37%. Analysis reveals that models encounter difficulties in data analysis and causal reasoning, and struggle in using causal knowledge and provided data simultaneously.

Source Code

Bib Entry

@inproceedings{liu2024are,
  title = {Are LLMs Capable of Data-based Statistical and Causal Reasoning? Benchmarking Advanced Quantitative Reasoning with Data},
  author = {Liu, Xiao and Wu, Zirui and Wu, Xueqing and Lu, Pan and Chang, Kai-Wei and Feng, Yansong},
  booktitle = {ACL-Findings},
  year = {2024}
}

Related Publications

Learning Structured Reasoning via Tractable Trajectory Control, ICML, 2026
Training LLMs for Divide-and-Conquer Reasoning, ACL, 2026
BRIEF-Pro: Universal Context Compression with Short-to-Long Synthesis for Fast and Accurate Multi-Hop Reasoning, ACL-Findings, 2026
Beyond Facts: Benchmarking Distributional Reading Comprehension in Large Language Models, ACL-Findings, 2026
MQuAKE-Remastered: Multi-Hop Knowledge Editing Can Only Be Advanced with Reliable Evaluations, ICLR, 2025
Towards Safety Reasoning in LLMs: AI-agentic Deliberation for Policy-embedded CoT Data Creation, ACL-Findings, 2025
QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search, ICML, 2025
DRS: Deep Question Reformulation With Structured Output, ACL-Findings, 2025
V-ALPHASOCIAL: Benchmark and Self-Reflective Chain-of-Thought Generation for Visual Social Commonsense Reasoning, ACL-Findings, 2025
VISCO: Benchmarking Fine-Grained Critique and Correction Towards Self-Improvement in Visual Reasoning, CVPR, 2025
BRIEF: Bridging Retrieval and Inference for Multi-hop Reasoning via Compression, NAACL-Finding, 2025
QUDSELECT: Selective Decoding for Questions Under Discussion Parsing, EMNLP, 2024
Model Editing Harms General Abilities of Large Language Models: Regularization to the Rescue, EMNLP, 2024
LLM-A*: Large Language Model Enhanced Incremental Heuristic Search on Path Planning, EMNLP-Finding, 2024
Tree-of-Traversals: A Zero-Shot Reasoning Algorithm for Augmenting Black-box Language Models with Knowledge Graphs, ACL, 2024
Can small language models help large language models reason better?: LM-guided chain-of-thought, LREC-COLING, 2024
IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models, EMNLP-Finding, 2023