On Localizing and Deleting Toxic Memories in Large Language Models
Anubrata Das, Manoj Kumar, Ninareh Mehrabi, Anil Ramakrishna, Anna Rumshisky, Kai-Wei Chang, Aram Galstyan, Morteza Ziyadi, and Rahul Gupta, in NAACL-Findings, 2025.
Bib Entry
@inproceedings{das2025localizing,
  title = {On Localizing and Deleting Toxic Memories in Large Language Models},
  author = {Das, Anubrata and Kumar, Manoj and Mehrabi, Ninareh and Ramakrishna, Anil and Rumshisky, Anna and Chang, Kai-Wei and Galstyan, Aram and Ziyadi, Morteza and Gupta, Rahul},
  booktitle = {NAACL-Findings},
  year = {2025}
}
Related Publications
- A Meta-Evaluation of Measuring LLM Misgendering, COLM, 2025
- White Men Lead, Black Women Help? Benchmarking Language Agency Social Biases in LLMs, ACL, 2025
- Controllable Generation via Locally Constrained Resampling, ICLR, 2025
- Attribute Controlled Fine-tuning for Large Language Models: A Case Study on Detoxification, EMNLP-Findings, 2024
- Mitigating Bias for Question Answering Models by Tracking Bias Influence, NAACL, 2024
- Are you talking to ['xem'] or ['x', 'em']? On Tokenization and Addressing Misgendering in LLMs with Pronoun Tokenization Parity, NAACL-Findings, 2024
- Are Personalized Stochastic Parrots More Dangerous? Evaluating Persona Biases in Dialogue Systems, EMNLP-Findings, 2023
- Kelly is a Warm Person, Joseph is a Role Model: Gender Biases in LLM-Generated Reference Letters, EMNLP-Findings, 2023
- The Tail Wagging the Dog: Dataset Construction Biases of Social Bias Benchmarks, ACL (short), 2023
- Factoring the Matrix of Domination: A Critical Review and Reimagination of Intersectionality in AI Fairness, AIES, 2023
- How well can Text-to-Image Generative Models understand Ethical Natural Language Interventions?, EMNLP (short), 2022
- On the Intrinsic and Extrinsic Fairness Evaluation Metrics for Contextualized Language Representations, ACL (short), 2022
- Societal Biases in Language Generation: Progress and Challenges, ACL, 2021
- "Nice Try, Kiddo": Investigating Ad Hominems in Dialogue Responses, NAACL, 2021
- BOLD: Dataset and metrics for measuring biases in open-ended language generation, FAccT, 2021
- Towards Controllable Biases in Language Generation, EMNLP-Findings, 2020
- The Woman Worked as a Babysitter: On Biases in Language Generation, EMNLP (short), 2019