SafeWorld: Geo-Diverse Safety Alignment
Da Yin, Haoyi Qiu, Kung-Hsiang Huang, Kai-Wei Chang, and Nanyun Peng, in NeurIPS, 2024.
Download the full text
Abstract
Content Warning: By its nature, this paper may contain examples of harmful content.

In the rapidly evolving field of Large Language Models (LLMs), ensuring safety is a crucial and widely discussed topic. However, existing works often overlook the geo-diversity of cultural and legal standards across the world. To demonstrate the challenges posed by geo-diverse safety standards, we introduce SAFEWORLD, a novel benchmark specifically designed to evaluate LLMs' ability to generate responses that are not only helpful but also culturally sensitive and legally compliant across diverse global contexts. SAFEWORLD encompasses 2,775 test user queries, each grounded in high-quality, human-verified cultural norms and legal policies from 50 countries and 493 regions/races. On top of this benchmark, we propose a multi-dimensional automatic safety evaluation framework that assesses the contextual appropriateness, accuracy, and comprehensiveness of responses. Our evaluations reveal that current LLMs struggle to meet these criteria. To enhance LLMs' alignment with geo-diverse safety standards, we synthesize helpful preference pairs for Direct Preference Optimization (DPO) alignment training. The preference pair construction aims to encourage LLMs to behave appropriately and provide precise references to relevant cultural norms and policies when necessary. Our trained SAFEWORLDLM outperforms all competing models, including GPT-4o, on all three evaluation dimensions by a large margin. Global human evaluators also report a nearly 20% higher win rate in helpfulness and harmfulness evaluation.
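The abstract's alignment step trains on preference pairs with Direct Preference Optimization. As a rough illustration of the standard DPO objective (not the paper's implementation; the function name, argument names, and `beta=0.1` default are illustrative assumptions), the per-pair loss can be sketched as:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for a single preference pair.

    logp_* are summed token log-probabilities of the full response under
    the policy being trained; ref_logp_* are the same quantities under the
    frozen reference model. All names here are illustrative.
    """
    # Implicit rewards: how much the policy has moved away from the
    # reference model on each response.
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    margin = chosen_reward - rejected_reward
    # -log sigmoid(margin): shrinks as the policy increasingly prefers
    # the chosen (e.g., culturally appropriate) response over the rejected one.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy and reference assign identical log-probabilities, the margin is zero and the loss equals log 2; pushing probability toward the chosen response drives the loss below that baseline.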
Bib Entry
@inproceedings{yin2024safeworld,
title = {SafeWorld: Geo-Diverse Safety Alignment},
author = {Yin, Da and Qiu, Haoyi and Huang, Kung-Hsiang and Chang, Kai-Wei and Peng, Nanyun},
booktitle = {NeurIPS},
year = {2024}
}
Related Publications
- Open-Domain Safety Policy Construction, EACL-Findings, 2026
- Customize Multi-modal RAI Guardrails with Precedent-based Predictions, COLM, 2025
- X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents, COLM, 2025
- Vulnerability of LLMs to Vertically Aligned Text Manipulations, ACL, 2025
- Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models, CVPR, 2025
- Vulnerability of Large Language Models to Output Prefix Jailbreaks: Impact of Positions on Safety, NAACL-Findings, 2025
- FLIRT: Feedback Loop In-context Red Teaming, EMNLP, 2024
- Data Advisor: Data Curation with Foresight for Safety Alignment of Large Language Models, EMNLP, 2024
- Prompt-Driven LLM Safeguarding via Directed Representation Optimization, ICML, 2024