A Meta-Evaluation of Measuring LLM Misgendering

Arjun Subramonian, Vagrant Gautam, Preethi Seshadri, Dietrich Klakow, Kai-Wei Chang, and Yizhou Sun, in COLM, 2025.


Abstract

Numerous methods have been proposed to measure LLM misgendering, including probability-based evaluations (e.g., automatically with templatic sentences) and generation-based evaluations (e.g., with automatic heuristics or human validation). However, it has gone unexamined whether these evaluation methods have convergent validity, that is, whether their results align. Therefore, we conduct a systematic meta-evaluation of these methods across three existing datasets for LLM misgendering. We propose a method to transform each dataset to enable parallel probability- and generation-based evaluation. Then, by automatically evaluating a suite of 6 models from 3 families, we find that these methods can disagree with each other at the instance, dataset, and model levels, conflicting on 20.2% of evaluation instances. Finally, with a human evaluation of 2400 LLM generations, we show that misgendering behaviour is complex and goes far beyond pronouns, which automatic evaluations are not currently designed to capture, suggesting essential disagreement with human evaluations. Based on our findings, we provide recommendations for future evaluations of LLM misgendering. Our results are also more widely relevant, as they call into question broader methodological conventions in LLM evaluation, which often assume that different evaluation methods agree.


Bib Entry

@inproceedings{subramonian2025meta,
  title = {A Meta-Evaluation of Measuring LLM Misgendering},
  author = {Subramonian, Arjun and Gautam, Vagrant and Seshadri, Preethi and Klakow, Dietrich and Chang, Kai-Wei and Sun, Yizhou},
  booktitle = {Conference on Language Modeling (COLM)},
  year = {2025}
}

Related Publications

  1. White Men Lead, Black Women Help? Benchmarking Language Agency Social Biases in LLMs, ACL, 2025
  2. Controllable Generation via Locally Constrained Resampling, ICLR, 2025
  3. On Localizing and Deleting Toxic Memories in Large Language Models, NAACL-Findings, 2025
  4. Attribute Controlled Fine-tuning for Large Language Models: A Case Study on Detoxification, EMNLP-Findings, 2024
  5. Mitigating Bias for Question Answering Models by Tracking Bias Influence, NAACL, 2024
  6. Are you talking to ['xem'] or ['x', 'em']? On Tokenization and Addressing Misgendering in LLMs with Pronoun Tokenization Parity, NAACL-Findings, 2024
  7. Are Personalized Stochastic Parrots More Dangerous? Evaluating Persona Biases in Dialogue Systems, EMNLP-Findings, 2023
  8. Kelly is a Warm Person, Joseph is a Role Model: Gender Biases in LLM-Generated Reference Letters, EMNLP-Findings, 2023
  9. The Tail Wagging the Dog: Dataset Construction Biases of Social Bias Benchmarks, ACL (short), 2023
  10. Factoring the Matrix of Domination: A Critical Review and Reimagination of Intersectionality in AI Fairness, AIES, 2023
  11. How well can Text-to-Image Generative Models understand Ethical Natural Language Interventions?, EMNLP (short), 2022
  12. On the Intrinsic and Extrinsic Fairness Evaluation Metrics for Contextualized Language Representations, ACL (short), 2022
  13. Societal Biases in Language Generation: Progress and Challenges, ACL, 2021
  14. "Nice Try, Kiddo": Investigating Ad Hominems in Dialogue Responses, NAACL, 2021
  15. BOLD: Dataset and Metrics for Measuring Biases in Open-Ended Language Generation, FAccT, 2021
  16. Towards Controllable Biases in Language Generation, EMNLP-Findings, 2020
  17. The Woman Worked as a Babysitter: On Biases in Language Generation, EMNLP (short), 2019