Open-Domain Safety Policy Construction
Di Wu, Siyue Liu, Zixiang Ji, Ya-Liang Chang, Zhe-Yu Liu, Andrew Pleffer, and Kai-Wei Chang, in EACL-Findings, 2026.
Code | Download the full text
Abstract
Moderation layers are increasingly a core component of many products built on user- or model-generated content. However, drafting and maintaining domain-specific safety policies remains costly. We present Deep Policy Research (DPR), a minimal agentic system that drafts a full content moderation policy from only human-written seed domain information. DPR uses a single web search tool and lightweight scaffolding to iteratively propose search queries, distill diverse web sources into policy rules, and organize the rules into an indexed document. We evaluate DPR on (1) the OpenAI undesired content benchmark across five domains with two compact reader LLMs and (2) an in-house multimodal advertisement moderation benchmark. DPR consistently outperforms definition-only and in-context learning baselines, and in our end-to-end setting it is competitive with expert-written policy sections in several domains. Moreover, under the same seed specification and evaluation protocol, DPR outperforms a general-purpose deep research system, suggesting that a task-specific, structured research loop can be more effective than generic web research for policy drafting.
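To make the loop the abstract describes concrete, here is a minimal Python sketch of the propose-search-distill-index cycle. It assumes two generic callables, `llm(prompt) -> str` and `web_search(query) -> str`; every function name, prompt, and the round budget is a hypothetical placeholder for illustration, not the paper's actual implementation.

```python
# Minimal sketch of a DPR-style research loop, assuming generic
# `llm` and `web_search` callables. All names and prompts here are
# hypothetical illustrations, not the paper's implementation.

def deep_policy_research(seed_info, llm, web_search, max_rounds=5):
    """Draft an indexed moderation policy from human-written seed domain info."""
    rules = []
    for _ in range(max_rounds):
        # 1. Propose search queries conditioned on the seed and the rules so far.
        query_text = llm(
            f"Seed domain info:\n{seed_info}\n\nExisting rules:\n{rules}\n\n"
            "Propose web search queries (one per line) that would surface "
            "policy-relevant areas not yet covered."
        )
        queries = [q.strip() for q in query_text.splitlines() if q.strip()]
        # 2. Single web search tool: retrieve sources for each query.
        sources = [web_search(q) for q in queries]
        # 3. Distill the retrieved sources into candidate policy rules.
        rule_text = llm(
            f"Web sources:\n{sources}\n\n"
            "Distill these into concise, non-overlapping moderation rules, "
            "one per line."
        )
        rules.extend(r.strip() for r in rule_text.splitlines() if r.strip())
    # 4. Organize the accumulated rules into an indexed policy document.
    return llm(
        f"Rules:\n{rules}\n\n"
        "Organize these rules into a policy document with numbered, "
        "indexed sections."
    )
```

The point of the sketch is the structured, task-specific loop itself: each round conditions query proposal on the rules already gathered, which is what distinguishes this from a single pass of generic web research.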
Bib Entry
@inproceedings{wu2026opendomain,
  title = {Open-Domain Safety Policy Construction},
  author = {Wu, Di and Liu, Siyue and Ji, Zixiang and Chang, Ya-Liang and Liu, Zhe-Yu and Pleffer, Andrew and Chang, Kai-Wei},
  booktitle = {EACL-Findings},
  year = {2026}
}
Related Publications
- MM-PoisonRAG: Disrupting Multimodal RAG with Local and Global Knowledge Poisoning Attacks, ACL, 2026
- SWAN: Semantic Watermarking with Abstract Meaning Representation, ACL, 2026
- Mitigating Over-Refusal in Aligned Large Language Models via Inference-Time Activation Energy, ACL, 2026
- ARES: Adaptive Red-Teaming and End-to-End Repair of Policy-Reward System, ACL, 2026
- Customize Multi-modal RAI Guardrails with Precedent-based Predictions, COLM, 2025
- X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents, COLM, 2025
- Vulnerability of LLMs to Vertically Aligned Text Manipulations, ACL, 2025
- Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models, CVPR, 2025
- Vulnerability of Large Language Models to Output Prefix Jailbreaks: Impact of Positions on Safety, NAACL-Findings, 2025
- SafeWorld: Geo-Diverse Safety Alignment, NeurIPS, 2024
- FLIRT: Feedback Loop In-context Red Teaming, EMNLP, 2024
- Data Advisor: Data Curation with Foresight for Safety Alignment of Large Language Models, EMNLP, 2024
- Prompt-Driven LLM Safeguarding via Directed Representation Optimization, ICML, 2024