PARTONOMY: Large Multimodal Models with Part-Level Visual Understanding
Ansel Blume, Jeonghwan Kim, Hyeonjeong Ha, Elen Chatikyan, Xiaomeng Jin, Khanh Duy Nguyen, Nanyun Peng, Kai-Wei Chang, Derek Hoiem, and Heng Ji, in NeurIPS, 2025.
Spotlight (top 5% papers)
Code · Download the full text
Abstract
Real-world objects are composed of distinct, object-specific parts that support fine-grained reasoning, yet large multimodal models (LMMs) struggle to identify parts and reason about part-whole relationships. This paper introduces PARTONOMY, an LMM benchmark designed for pixel-level part grounding. The benchmark combines existing part datasets with a newly annotated set comprising 862 part labels and 534 object labels. Experiments reveal that state-of-the-art segmenting LMMs perform poorly on part-level tasks (e.g., a strong model attains only 5.9% global IoU), highlighting a major capability gap. The authors identify architectural shortcomings in current segmenting LMMs, such as reliance on [SEG] tokens and discarding predicted segmentations, and train several part-centric LMMs to address these issues. They propose PLUM, a novel segmenting LMM that uses span tagging instead of [SEG] tokens and conditions on prior predictions in a feedback loop. Trained on PARTONOMY, PLUM achieves stronger performance on reasoning-based segmentation, VQA, and visual hallucination benchmarks, opening avenues for more grounded visual understanding in LMMs.
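The feedback-loop idea in the abstract — conditioning each new part prediction on the masks predicted so far, instead of emitting each segmentation independently — can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the decoder here is a hypothetical stand-in, and all function names are invented for illustration.

```python
import numpy as np

def predict_mask(image, part_query, prior_masks):
    """Hypothetical stand-in for a segmentation decoder.

    A real model would predict a part mask from the image and query;
    here we carve out a deterministic band per part. The key point is
    the conditioning step: prior predictions are fed back so that
    already-segmented pixels are not re-claimed.
    """
    h, w = image.shape[:2]
    band = np.zeros((h, w), dtype=bool)
    start = sum(ord(c) for c in part_query) % h  # toy, query-dependent offset
    band[start:start + max(1, h // 4), :] = True
    for m in prior_masks:
        band &= ~m  # condition on prior predictions: keep masks disjoint
    return band

def segment_parts(image, part_queries):
    """Feedback loop: each prediction sees all earlier predictions."""
    masks = []
    for q in part_queries:
        masks.append(predict_mask(image, q, masks))
    return masks

image = np.zeros((8, 8, 3))
masks = segment_parts(image, ["wing", "fuselage", "tail"])
# Because each step conditions on earlier masks, the part masks are disjoint.
assert all(not np.any(masks[i] & masks[j])
           for i in range(len(masks)) for j in range(i + 1, len(masks)))
```

Contrast this with a [SEG]-token design, where each mask is decoded independently and nothing prevents later predictions from overlapping or contradicting earlier ones.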
Bib Entry
@inproceedings{blume2025partonomy,
title = {PARTONOMY: Large Multimodal Models with Part-Level Visual Understanding},
author = {Blume, Ansel and Kim, Jeonghwan and Ha, Hyeonjeong and Chatikyan, Elen and Jin, Xiaomeng and Nguyen, Khanh Duy and Peng, Nanyun and Chang, Kai-Wei and Hoiem, Derek and Ji, Heng},
booktitle = {NeurIPS},
year = {2025}
}
Related Publications
- HoneyBee: Data Recipes for Vision-Language Reasoners, CVPR, 2026
- MotionEdit: Benchmarking and Learning Motion-Centric Image Editing, CVPR, 2026
- LaViDa: A Large Diffusion Language Model for Multimodal Understanding, NeurIPS, 2025
- STIV: Scalable Text and Image Conditioned Video Generation, ICCV, 2025
- Verbalized Representation Learning for Interpretable Few-Shot Generalization, ICCV, 2025
- Contrastive Visual Data Augmentation, ICML, 2025
- SYNTHIA: Novel Concept Design with Affordance Composition, ACL, 2025
- SlowFast-VGen: Slow-Fast Learning for Action-Driven Long Video Generation, ICLR, 2025
- Towards a holistic framework for multimodal LLM in 3D brain CT radiology report generation, Nature Communications, 2025
- Enhancing Large Vision Language Models with Self-Training on Image Comprehension, NeurIPS, 2024
- CoBIT: A Contrastive Bi-directional Image-Text Generation Model, ICLR, 2024
- DesCo: Learning Object Recognition with Rich Language Descriptions, NeurIPS, 2023
- "What's 'up' with vision-language models? Investigating their struggle to understand spatial relations.", EMNLP, 2023
- Text Encoders are Performance Bottlenecks in Contrastive Vision-Language Models, EMNLP, 2023
- MetaVL: Transferring In-Context Learning Ability From Language Models to Vision-Language Models, ACL (short), 2023
- REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge, CVPR, 2023
- Grounded Language-Image Pre-training, CVPR, 2022
- How Much Can CLIP Benefit Vision-and-Language Tasks?, ICLR, 2022