Mitigating Over-Refusal in Aligned Large Language Models via Inference-Time Activation Energy

Eric Hanchen Jiang, Weixuan Ou, Run Liu, Shengyuan Pang, Guancheng Wan, Ranjie Duan, Wei Dong, Kai-Wei Chang, XiaoFeng Wang, Ying Nian Wu, and Xinfeng Li, in ACL, 2026.


Bib Entry

@inproceedings{jiang2026overrefusal,
  title = {Mitigating Over-Refusal in Aligned Large Language Models via Inference-Time Activation Energy},
  author = {Jiang, Eric Hanchen and Ou, Weixuan and Liu, Run and Pang, Shengyuan and Wan, Guancheng and Duan, Ranjie and Dong, Wei and Chang, Kai-Wei and Wang, XiaoFeng and Wu, Ying Nian and Li, Xinfeng},
  booktitle = {ACL},
  year = {2026}
}

Related Publications

  1. MM-PoisonRAG: Disrupting Multimodal RAG with Local and Global Knowledge Poisoning Attacks, ACL, 2026
  2. SWAN: Semantic Watermarking with Abstract Meaning Representation, ACL, 2026
  3. ARES: Adaptive Red-Teaming and End-to-End Repair of Policy-Reward System, ACL, 2026
  4. Open-Domain Safety Policy Construction, EACL-Findings, 2026
  5. Customize Multi-modal RAI Guardrails with Precedent-based predictions, COLM, 2025
  6. X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents, COLM, 2025
  7. Vulnerability of LLMs to Vertically Aligned Text Manipulations, ACL, 2025
  8. Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models, CVPR, 2025
  9. Vulnerability of Large Language Models to Output Prefix Jailbreaks: Impact of Positions on Safety, NAACL-Findings, 2025
  10. SafeWorld: Geo-Diverse Safety Alignment, NeurIPS, 2024
  11. FLIRT: Feedback Loop In-context Red Teaming, EMNLP, 2024
  12. Data Advisor: Data Curation with Foresight for Safety Alignment of Large Language Models, EMNLP, 2024
  13. Prompt-Driven LLM Safeguarding via Directed Representation Optimization, ICML, 2024