Mitigating Over-Refusal in Aligned Large Language Models via Inference-Time Activation Energy
Eric Hanchen Jiang, Weixuan Ou, Run Liu, Shengyuan Pang, Guancheng Wan, Ranjie Duan, Wei Dong, Kai-Wei Chang, XiaoFeng Wang, Ying Nian Wu, and Xinfeng Li, in ACL, 2026.
Download the full text
Abstract
Safety alignment of large language models currently faces a central challenge: existing alignment techniques often prioritize refusing harmful prompts at the cost of overcautious behavior, leading models to incorrectly refuse benign requests. A key goal of safety alignment is therefore to improve safety while simultaneously minimizing false refusals. In this work, we introduce Energy Landscape Steering (ELS), a novel, fine-tuning-free framework designed to resolve this challenge through dynamic, inference-time intervention. We train a lightweight external Energy-Based Model (EBM) to assign high energy to undesirable states (false refusals or jailbreaks) and low energy to desirable states (helpful responses or safe rejections). During inference, the EBM maps the LLM's internal activations to an energy landscape, and we use the gradient of the energy function to steer the hidden states toward low-energy regions in real time. This dynamically guides the model toward desirable behavior without modifying its parameters. By decoupling behavioral control from the model's core knowledge, ELS provides a flexible and computationally efficient solution. Extensive experiments across diverse models demonstrate its effectiveness, raising compliance on the ORB-H benchmark from 57.3% to 82.6% while maintaining baseline safety performance. Our work establishes a promising paradigm for building LLMs that simultaneously achieve high safety and low false refusal rates.
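The gradient-based steering step described in the abstract can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: it substitutes a fixed quadratic energy function for the learned EBM (so the gradient is analytic) and applies gradient descent to a toy hidden-state vector, mirroring the inference-time update h ← h − α·∇E(h). All dimensions, step sizes, and the energy form are illustrative assumptions.

```python
import numpy as np

# Toy stand-in for the learned EBM: E(h) = 0.5 * ||W h - b||^2.
# In ELS the energy model is trained to score LLM hidden states;
# here a fixed quadratic keeps the gradient analytic and the sketch runnable.
rng = np.random.default_rng(0)
d = 16                                  # hidden-state dimension (illustrative)
W = rng.standard_normal((d, d)) / np.sqrt(d)
b = rng.standard_normal(d)

def energy(h):
    r = W @ h - b
    return 0.5 * float(r @ r)

def energy_grad(h):
    return W.T @ (W @ h - b)            # analytic dE/dh for the quadratic

def steer(h, step=0.1, n_steps=20):
    """Inference-time steering: nudge the hidden state down the energy gradient."""
    for _ in range(n_steps):
        h = h - step * energy_grad(h)
    return h

h0 = rng.standard_normal(d)             # stand-in for an LLM activation
h1 = steer(h0)
assert energy(h1) < energy(h0)          # steering moves toward a low-energy region
```

In the actual framework the gradient would come from backpropagating through the trained EBM at each intervention layer; the quadratic here only demonstrates the update rule.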
Bib Entry
@inproceedings{jiang2026overrefusal,
title = {Mitigating Over-Refusal in Aligned Large Language Models via Inference-Time Activation Energy},
author = {Jiang, Eric Hanchen and Ou, Weixuan and Liu, Run and Pang, Shengyuan and Wan, Guancheng and Duan, Ranjie and Dong, Wei and Chang, Kai-Wei and Wang, XiaoFeng and Wu, Ying Nian and Li, Xinfeng},
booktitle = {ACL},
year = {2026}
}
Related Publications
- MM-PoisonRAG: Disrupting Multimodal RAG with Local and Global Knowledge Poisoning Attacks, ACL, 2026
- SWAN: Semantic Watermarking with Abstract Meaning Representation, ACL, 2026
- ARES: Adaptive Red-Teaming and End-to-End Repair of Policy-Reward System, ACL, 2026
- Open-Domain Safety Policy Construction, EACL-Findings, 2026
- Customize Multi-modal RAI Guardrails with Precedent-based predictions, COLM, 2025
- X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents, COLM, 2025
- Vulnerability of LLMs to Vertically Aligned Text Manipulations, ACL, 2025
- Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models, CVPR, 2025
- Vulnerability of Large Language Models to Output Prefix Jailbreaks: Impact of Positions on Safety, NAACL-Findings, 2025
- SafeWorld: Geo-Diverse Safety Alignment, NeurIPS, 2024
- FLIRT: Feedback Loop In-context Red Teaming, EMNLP, 2024
- Data Advisor: Data Curation with Foresight for Safety Alignment of Large Language Models, EMNLP, 2024
- Prompt-Driven LLM Safeguarding via Directed Representation Optimization, ICML, 2024