Prompt-Driven LLM Safeguarding via Directed Representation Optimization

Chujie Zheng, Fan Yin, Hao Zhou, Fandong Meng, Jie Zhou, Kai-Wei Chang, Minlie Huang, and Nanyun Peng, in ICML, 2024.

Download the full text


Bib Entry

@inproceedings{zheng2024prompt,
  title = {Prompt-Driven LLM Safeguarding via Directed Representation Optimization},
  author = {Zheng, Chujie and Yin, Fan and Zhou, Hao and Meng, Fandong and Zhou, Jie and Chang, Kai-Wei and Huang, Minlie and Peng, Nanyun},
  year = {2024},
  booktitle = {ICML}
}

Related Publications

  1. Open-Domain Safety Policy Construction, EACL-Findings, 2026
  2. Customize Multi-modal RAI Guardrails with Precedent-based Predictions, COLM, 2025
  3. X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents, COLM, 2025
  4. Vulnerability of LLMs to Vertically Aligned Text Manipulations, ACL, 2025
  5. Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models, CVPR, 2025
  6. Vulnerability of Large Language Models to Output Prefix Jailbreaks: Impact of Positions on Safety, NAACL-Findings, 2025
  7. SafeWorld: Geo-Diverse Safety Alignment, NeurIPS, 2024
  8. FLIRT: Feedback Loop In-context Red Teaming, EMNLP, 2024
  9. Data Advisor: Data Curation with Foresight for Safety Alignment of Large Language Models, EMNLP, 2024