The Progress Illusion: Revisiting Meta-Evaluation Standards of LLM Evaluators
Abstract
LLM judges have gained popularity as an inexpensive and performant substitute for human evaluation. However, we observe that the meta-evaluation setting in which the reliability of these LLM evaluators is established differs substantially from the setting in which they are used during model development. To address this, we revisit meta-evaluations of LLM evaluators under a setting that more closely aligns with practice by examining evaluators’ ability to distinguish test system pairs that are closer in capability. Our fine-grained approach shows that all LLM evaluators’ correlations with human judgments are concerningly low when the models perform similarly, exposing a key limitation of current norms. Equipped with this better methodology, we next analyze the impact that the choice of reference model has on LLM-as-a-judge evaluator performance. We show that single-reference evaluators only perform well at ranking test systems that fall within particular capability ranges, even when the standard meta-evaluation reports high overall correlation. Taken together, our analysis reveals critical issues with current LLM meta-evaluation, and we recommend avenues for improvement.
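To make the fine-grained meta-evaluation idea concrete, the sketch below (not the paper's code; all system names, scores, and the gap threshold are illustrative assumptions) measures judge-human pairwise ranking agreement separately for system pairs that are close in capability and pairs that are far apart, under human scores taken as ground truth.

    # Minimal sketch: fine-grained meta-evaluation of an LLM judge.
    # All scores, system names, and the 0.05 gap threshold are hypothetical.
    from itertools import combinations

    # Hypothetical per-system quality scores: higher = better.
    human_scores = {"sys_a": 0.82, "sys_b": 0.80, "sys_c": 0.65, "sys_d": 0.40}
    judge_scores = {"sys_a": 0.78, "sys_b": 0.81, "sys_c": 0.66, "sys_d": 0.35}

    def pairwise_agreement(pairs):
        """Fraction of system pairs that judge and humans rank in the same order."""
        if not pairs:
            return float("nan")
        agree = sum(
            (human_scores[a] - human_scores[b]) * (judge_scores[a] - judge_scores[b]) > 0
            for a, b in pairs
        )
        return agree / len(pairs)

    # Bucket system pairs by how close they are in human-judged capability.
    all_pairs = list(combinations(human_scores, 2))
    close_pairs = [p for p in all_pairs if abs(human_scores[p[0]] - human_scores[p[1]]) < 0.05]
    far_pairs = [p for p in all_pairs if abs(human_scores[p[0]] - human_scores[p[1]]) >= 0.05]

    print("agreement on close-capability pairs:", pairwise_agreement(close_pairs))
    print("agreement on distant-capability pairs:", pairwise_agreement(far_pairs))

A standard meta-evaluation would report a single aggregate correlation over all pairs; splitting by capability gap, as above, is what surfaces the drop in agreement on close pairs that the abstract describes.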
Presented at Findings of the Association for Computational Linguistics: EMNLP 2025 (Suzhou, China, November 2025).
Recommended citation: Tianruo Rose Xu, Vedant Gaur, Liu Leqi, Tanya Goyal. The Progress Illusion: Revisiting Meta-Evaluation Standards of LLM Evaluators. Findings of the Association for Computational Linguistics: EMNLP 2025, Suzhou, China, November 2025.
