PARTONOMY: Large Multimodal Models with Part-Level Visual Understanding
Ansel Blume, Jeonghwan Kim, Hyeonjeong Ha, Elen Chatikyan, Xiaomeng Jin, Khanh Duy Nguyen, Nanyun Peng, Kai-Wei Chang, Derek Hoiem, and Heng Ji, in NeurIPS, 2025.
Spotlight (top 5% of papers)
Code · Download the full text
Abstract
Real-world objects are composed of distinct, object-specific parts that support fine-grained reasoning, yet large multimodal models (LMMs) struggle to identify parts and to reason about part-whole relationships. This paper introduces PARTONOMY, an LMM benchmark for pixel-level part grounding that combines existing part datasets with a newly annotated set comprising 862 part labels and 534 object labels. Experiments show that state-of-the-art segmenting LMMs perform poorly on part-level tasks (e.g., a strong model attains only 5.9% global IoU), exposing a substantial capability gap. The authors identify architectural shortcomings of current segmenting LMMs, such as relying on [SEG] tokens and discarding predicted segmentations, and train several part-centric LMMs to address them. They propose PLUM, a novel segmenting LMM that replaces [SEG] tokens with span tagging and conditions on its prior predictions in a feedback loop. Trained on PARTONOMY, PLUM achieves stronger performance on reasoning-based segmentation, VQA, and visual hallucination benchmarks, opening avenues for more grounded visual understanding in LMMs.
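The feedback-loop idea described above can be illustrated with a minimal sketch. The `FeedbackSegmenter` class, its layer choices, and the tensor shapes below are hypothetical placeholders, not PLUM's actual architecture (which also involves span tagging and an LMM backbone omitted here); the sketch only shows the general pattern of conditioning each new segmentation prediction on the previously decoded mask.

```python
import torch
import torch.nn as nn

class FeedbackSegmenter(nn.Module):
    """Toy decoder that conditions each part prediction on the previously predicted mask."""

    def __init__(self, feat_dim: int = 256, mask_size: int = 64):
        super().__init__()
        self.mask_size = mask_size
        # Encodes a prior mask into a feature vector (hypothetical design choice).
        self.mask_encoder = nn.Sequential(
            nn.Flatten(), nn.Linear(mask_size * mask_size, feat_dim)
        )
        # Decodes image features fused with prior-mask features into the next mask.
        self.mask_decoder = nn.Linear(2 * feat_dim, mask_size * mask_size)

    def forward(self, image_feats: torch.Tensor, num_parts: int) -> torch.Tensor:
        """image_feats: (B, feat_dim) pooled visual features; returns (B, num_parts, H, W)."""
        b = image_feats.size(0)
        prior = torch.zeros(b, self.mask_size, self.mask_size)  # no prediction yet
        masks = []
        for _ in range(num_parts):
            prior_feats = self.mask_encoder(prior)               # summarize prior prediction
            fused = torch.cat([image_feats, prior_feats], dim=-1)
            logits = self.mask_decoder(fused).view(b, self.mask_size, self.mask_size)
            prior = torch.sigmoid(logits)                        # feed the prediction back in
            masks.append(prior)
        return torch.stack(masks, dim=1)

# Example: predict three part masks for a batch of two images.
model = FeedbackSegmenter()
feats = torch.randn(2, 256)
print(model(feats, num_parts=3).shape)  # torch.Size([2, 3, 64, 64])
```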
Bib Entry
@inproceedings{blume2025partonomy,
  title     = {PARTONOMY: Large Multimodal Models with Part-Level Visual Understanding},
  author    = {Blume, Ansel and Kim, Jeonghwan and Ha, Hyeonjeong and Chatikyan, Elen and Jin, Xiaomeng and Nguyen, Khanh Duy and Peng, Nanyun and Chang, Kai-Wei and Hoiem, Derek and Ji, Heng},
  booktitle = {NeurIPS},
  year      = {2025}
}