Vulnerability of Large Language Models to Output Prefix Jailbreaks: Impact of Positions on Safety

Yiwei Wang, Muhao Chen, Nanyun Peng, and Kai-Wei Chang, in NAACL-Findings, 2025.

Abstract

Previous research on jailbreak attacks has mainly focused on optimizing the content of adversarial snippets injected into input prompts to expose LLM security vulnerabilities. Much of this work pursues increasingly complex, less readable adversarial snippets in search of higher attack success rates. In contrast to this trend, our research investigates the impact of the adversarial snippet's position on the effectiveness of jailbreak attacks. We find that placing a simple, readable adversarial snippet at the beginning of the output effectively exposes LLM safety vulnerabilities, yielding much higher attack success rates than input-suffix attacks or prompt-based output jailbreaks. More precisely, we discover that directly enforcing an output prefix that embeds the user's target is an effective way to expose LLMs' safety vulnerabilities.

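A minimal sketch (not the authors' code) of the positional contrast described above, using Hugging Face transformers; the model name and the placeholder strings are illustrative assumptions. The same snippet appears in both settings, and only its position changes: appended to the user's input, or prefilled at the start of the output so that decoding must continue from it.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: any open chat model would do; this name is illustrative only.
model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

request = "<user request>"         # placeholder, not a real prompt
snippet = "<adversarial snippet>"  # placeholder; the paper uses simple, readable text

# (a) Input-suffix attack: the snippet is appended to the input prompt.
input_suffix_prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": request + " " + snippet}],
    tokenize=False, add_generation_prompt=True,
)

# (b) Output-prefix attack: the snippet is placed at the start of the
# output, so generation continues from it rather than from a refusal.
output_prefix_prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": request}],
    tokenize=False, add_generation_prompt=True,
) + snippet

for prompt in (input_suffix_prompt, output_prefix_prompt):
    enc = tokenizer(prompt, return_tensors="pt")
    out = model.generate(**enc, max_new_tokens=64)
    # Print only the newly generated continuation, not the prompt.
    print(tokenizer.decode(out[0, enc["input_ids"].shape[1]:], skip_special_tokens=True))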

Bib Entry

@inproceedings{wang2025vulnerability,
  title = {Vulnerability of Large Language Models to Output Prefix Jailbreaks: Impact of Positions on Safety},
  author = {Wang, Yiwei and Chen, Muhao and Peng, Nanyun and Chang, Kai-Wei},
  booktitle = {NAACL-Findings},
  year = {2025}
}

Related Publications

  1. MM-PoisonRAG: Disrupting Multimodal RAG with Local and Global Knowledge Poisoning Attacks, ACL, 2026
  2. SWAN: Semantic Watermarking with Abstract Meaning Representation, ACL, 2026
  3. Mitigating Over-Refusal in Aligned Large Language Models via Inference-Time Activation Energy, ACL, 2026
  4. ARES: Adaptive Red-Teaming and End-to-End Repair of Policy-Reward System, ACL, 2026
  5. Open-Domain Safety Policy Construction, EACL-Findings, 2026
  6. Customize Multi-modal RAI Guardrails with Precedent-Based Predictions, COLM, 2025
  7. X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents, COLM, 2025
  8. Vulnerability of LLMs to Vertically Aligned Text Manipulations, ACL, 2025
  9. Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models, CVPR, 2025
  10. SafeWorld: Geo-Diverse Safety Alignment, NeurIPS, 2024
  11. FLIRT: Feedback Loop In-context Red Teaming, EMNLP, 2024
  12. Data Advisor: Data Curation with Foresight for Safety Alignment of Large Language Models, EMNLP, 2024
  13. Prompt-Driven LLM Safeguarding via Directed Representation Optimization, ICML, 2024