Towards Safety Reasoning in LLMs: AI-agentic Deliberation for Policy-embedded CoT Data Creation

Tharindu Kumarage, Ninareh Mehrabi, Anil Ramakrishna, Xinyan Zhao, Richard Zemel, Kai-Wei Chang, Aram Galstyan, Rahul Gupta, and Charith Peris, in ACL-Findings, 2025.

Abstract

Safety reasoning is a recent paradigm in which LLMs reason over safety policies before generating responses, thereby mitigating limitations of existing safety measures such as over-refusal and jailbreak vulnerabilities. However, implementing this paradigm is challenging due to the resource-intensive process of creating high-quality policy-embedded chain-of-thought (CoT) datasets while ensuring the reasoning remains accurate and free from hallucinations or policy conflicts. To tackle this, we propose AIDSAFE: Agentic Iterative Deliberation for Safety Reasoning, a novel data generation recipe that leverages multi-agent deliberation to iteratively expand reasoning over safety policies. A data refiner stage in AIDSAFE ensures high-quality outputs by eliminating repetitive, redundant, and deceptive thoughts. AIDSAFE-generated CoTs provide a strong foundation for supervised fine-tuning (SFT)-based safety training. Additionally, to address the need for preference data in alignment stages, such as DPO training, we introduce a supplemental recipe that uses belief augmentation to create distinct selected and rejected CoT samples. Our evaluations demonstrate that AIDSAFE-generated CoTs achieve superior policy adherence and reasoning quality. Consequently, we show that fine-tuning open-source LLMs on these CoTs can significantly improve safety generalization and jailbreak robustness while maintaining acceptable utility and over-refusal accuracy. The AIDSAFE-generated CoT datasets are publicly available on Hugging Face.
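To make the recipe concrete, below is a minimal Python sketch of an AIDSAFE-style pipeline: multiple agents iteratively extend a policy-grounded deliberation, and a refiner stage filters repetitive or deceptive thoughts. The function names, prompts, agent roles, and stopping criteria here are illustrative assumptions, not the paper's actual implementation; `generate` stands in for any LLM completion call.

```python
# Hypothetical sketch of AIDSAFE-style agentic deliberation (illustrative only;
# the paper's actual agent prompts, roles, and refinement criteria may differ).
from typing import Callable, List

def deliberate(
    prompt: str,
    policies: List[str],
    generate: Callable[[str], str],  # any LLM completion function
    n_agents: int = 3,
    n_rounds: int = 2,
) -> List[str]:
    """Iteratively expand policy-grounded reasoning via multiple agents."""
    policy_block = "\n".join(f"- {p}" for p in policies)
    thoughts: List[str] = []
    for _ in range(n_rounds):
        for agent_id in range(n_agents):
            agent_prompt = (
                f"You are deliberation agent {agent_id}.\n"
                f"Safety policies:\n{policy_block}\n"
                f"User prompt: {prompt}\n"
                f"Deliberation so far:\n" + "\n".join(thoughts) + "\n"
                "Add one new reasoning step that cites a policy, or correct an "
                "earlier step if it conflicts with the policies."
            )
            thoughts.append(generate(agent_prompt).strip())
    return thoughts

def refine(thoughts: List[str], generate: Callable[[str], str]) -> List[str]:
    """Refiner stage: drop repetitive, redundant, or deceptive thoughts."""
    kept: List[str] = []
    for t in thoughts:
        verdict = generate(
            "Reasoning steps kept so far:\n" + "\n".join(kept) + "\n"
            f"Is this new step repetitive, redundant, or deceptive?\n{t}\n"
            "Answer KEEP or DROP."
        )
        if "KEEP" in verdict.upper():
            kept.append(t)
    return kept
```

In this sketch, the refined thought list would be concatenated into a policy-embedded CoT for SFT, and a belief-augmented variant of the same loop could produce contrasting "selected" and "rejected" CoTs for preference training such as DPO.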


Bib Entry

@inproceedings{kumarage2025towards,
  title = {Towards Safety Reasoning in LLMs: AI-agentic Deliberation for Policy-embedded CoT Data Creation},
  author = {Kumarage, Tharindu and Mehrabi, Ninareh and Ramakrishna, Anil and Zhao, Xinyan and Zemel, Richard and Chang, Kai-Wei and Galstyan, Aram and Gupta, Rahul and Peris, Charith},
  booktitle = {ACL-Findings},
  year = {2025}
}

Related Publications

  1. QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search, ICML, 2025
  2. DRS: Deep Question Reformulation With Structured Output, ACL-Findings, 2025
  3. V-ALPHASOCIAL: Benchmark and Self-Reflective Chain-of-Thought Generation for Visual Social Commonsense Reasoning, ACL-Findings, 2025
  4. VISCO: Benchmarking Fine-Grained Critique and Correction Towards Self-Improvement in Visual Reasoning, CVPR, 2025
  5. MQuAKE-Remastered: Multi-Hop Knowledge Editing Can Only Be Advanced with Reliable Evaluations, ICLR, 2025
  6. BRIEF: Bridging Retrieval and Inference for Multi-hop Reasoning via Compression, NAACL-Findings, 2025
  7. QUDSELECT: Selective Decoding for Questions Under Discussion Parsing, EMNLP, 2024
  8. Model Editing Harms General Abilities of Large Language Models: Regularization to the Rescue, EMNLP, 2024
  9. LLM-A*: Large Language Model Enhanced Incremental Heuristic Search on Path Planning, EMNLP-Findings, 2024
  10. Tree-of-Traversals: A Zero-Shot Reasoning Algorithm for Augmenting Black-box Language Models with Knowledge Graphs, ACL, 2024
  11. Are LLMs Capable of Data-based Statistical and Causal Reasoning? Benchmarking Advanced Quantitative Reasoning with Data, ACL-Findings, 2024
  12. Can Small Language Models Help Large Language Models Reason Better?: LM-Guided Chain-of-Thought, LREC-COLING, 2024
  13. IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models, EMNLP-Findings, 2023