Validating LLM-as-a-Judge Systems in the Absence of Gold Labels (2025)

Luke Guerdan  Solon Barocas  Kenneth Holstein  Hanna Wallach  Zhiwei Steven Wu  Alexandra Chouldechova

Abstract

The LLM-as-a-judge paradigm, in which a judge LLM system replaces human raters in rating the outputs of other generative AI (GenAI) systems, has come to play a critical role in scaling and standardizing GenAI evaluations. To validate judge systems, evaluators collect multiple human ratings for each item in a validation corpus, and then aggregate the ratings into a single, per-item gold label rating. High agreement rates between these gold labels and judge system ratings are then taken as a sign of good judge system performance. In many cases, however, items or rating criteria may be ambiguous, or there may be principled disagreement among human raters. In such settings, gold labels may not exist for many of the items. In this paper, we introduce a framework for LLM-as-a-judge validation in the absence of gold labels. We present a theoretical analysis drawing connections between different measures of judge system performance under different rating elicitation and aggregation schemes. We also demonstrate empirically that existing validation approaches can select judge systems that are highly suboptimal, performing as much as 34% worse than the systems selected by alternative approaches that we describe. Based on our findings, we provide concrete recommendations for developing more reliable approaches to LLM-as-a-judge validation.


1 Introduction

To improve efficiency, scalability, and repeatability, GenAI evaluations commonly rely on LLM-based judges as a substitute for human raters when rating system outputs for properties like their “relevance”, “helpfulness”, or “toxicity.” In this LLM-as-a-judge paradigm, illustrated in Figure 1, a judge LLM system is used to rate the outputs of a target GenAI system according to instructions specified in a rating task (e.g., Szymanski et al., 2024; Zheng et al., 2023; Bubeck et al., 2023). When using this paradigm, however, validating that judge systems produce accurate ratings is critical to the reliability of the resulting GenAI evaluations.

To validate a judge system, evaluators first curate a validation corpus in which each item asks a rater to rate a single target system output according to a set of rating task instructions. Each item is then rated by multiple human raters and by the judge system. The human ratings are then aggregated into a single gold label rating for each item in the validation corpus. Judge system performance is then assessed by calculating the agreement rate between these gold labels and judge system ratings (Lu & Zhong, 2024; Kim et al., 2024; Jung et al., 2024; Dong et al., 2024; Es et al., 2023; Dubois et al., 2024; Shankar et al., 2024).

[Figure 1]

Although this approach is well-motivated when there is a single “correct” rating for each item, judge systems are often used in settings where this gold-label assumption is violated. Items or rating criteria may be ambiguous, or there may be principled disagreement among human raters, leading to multiple “correct” ratings. We say that such rating tasks are indeterminate. In practice, indeterminate rating tasks are very common (Figure 9). For example, raters may reasonably disagree about the “helpfulness” of an item based on its length (Li et al., 2024a), or note that the “toxicity” of an item depends on the cultural context (Goyal et al., 2022).

We show that task indeterminacy substantively affects how judge systems can be validated, which judge systems are selected, and, most critically, the resulting conclusions that are drawn about target systems. We introduce a framework for understanding different approaches to LLM-as-a-judge validation as determined by the design of the rating task, including the rating elicitation scheme; the rating aggregation scheme; and the metric used to quantify human–judge agreement. We illustrate, both theoretically and empirically, the implications of these choices for judge system selection and for the resulting conclusions that are drawn about target systems. Our main contributions are listed below:

  • We provide the first framework for LLM-as-a-judge validation under rating task indeterminacy—i.e., where many items may have multiple “correct” ratings. This framework enables us to compare existing validation approaches and to develop principled alternatives.

  • We use this framework to compare existing validation approaches to alternative approaches that account for task indeterminacy. We demonstrate that the best-performing judge systems under the former can be among the worst-performing under the latter (§5), and explain how this arises from differences in how human raters and judge systems resolve ambiguities in forced choice rating tasks (§6). These findings highlight the importance of the rating elicitation scheme and pinpoint a mechanism by which rating task indeterminacy can confound LLM-as-a-judge validation.

  • We conduct an empirical study in which we use five different commercial LLMs as judge systems to rate the “toxicity” of a target system’s outputs using the Civil Comments dataset (§7). We demonstrate that the judge system selected by commonly used existing validation approaches performs 34% worse than the judge system selected when using our framework to explicitly account for the effects of rating task indeterminacy.

Our findings demonstrate that existing validation approaches can be highly suboptimal when used for indeterminate rating tasks. Because the conclusions that are drawn about target systems depend critically on judge system performance, this jeopardizes the validity of GenAI evaluations. We draw on our findings to offer concrete recommendations for developing more reliable approaches to LLM-as-a-judge validation under rating task indeterminacy.

2 Preliminaries

Let $\mathcal{G}_{\text{judge}}$ denote the judge system and let $\mathcal{G}_{\text{target}}$ denote the target system. We treat $\mathcal{G}_{\text{target}}$ as a black box that produces a string of output tokens given a string of input tokens. Output tokens may represent an arbitrary modality (e.g., text, images, audio, or video), depending on the system.

Rating Task. We express a target system evaluation as a rating task consisting of $n$ items $\mathcal{T}=\{t_1,t_2,\dots,t_n\}$. Each item $t_i\in\mathcal{T}$ consists of 1) an output generated by $\mathcal{G}_{\text{target}}$, 2) instructions for rating that output, and 3) a set of response options, i.e., possible ratings, for that output. As is often the case in practice, we assume that the rating instructions and response options are the same for all items.

Our framework is compatible with two types of rating tasks (Zheng et al., 2023): Single output grading tasks instruct a rater (either a human or a judge system) to rate an output generated by $\mathcal{G}_{\text{target}}$ for a specific property, like “helpfulness” or “toxicity”. Pairwise comparison tasks instruct a rater to express a preference between two outputs generated by one or more target systems. Both types of rating tasks often include a rubric, annotation guidelines, and few-shot examples specifying how outputs should be rated. Letting $\mathcal{O}=\{o_1,o_2,\dots,o_q\}$ denote the ordered set of response options for a rating task, two common choices for response options are $\mathcal{O}=\{\text{Yes},\text{No}\}$ for single output grading tasks and $\mathcal{O}=\{\text{Win},\text{Tie},\text{Lose}\}$ for pairwise comparison tasks.

Rating Elicitation. We examine two schemes for eliciting ratings from human raters and judge systems: forced choice and response set (Figure 2). Forced choice elicitation instructs a rater to select a single option from $\mathcal{O}$. Response set elicitation instructs a rater to select all options from $\mathcal{O}$ that are reasonable. Letting $\mathcal{P}(\mathcal{O})$ denote the power set of $\mathcal{O}$, we define $\mathcal{Q}=\{\mathcal{S}_1,\mathcal{S}_2,\dots,\mathcal{S}_w\}\subseteq\mathcal{P}(\mathcal{O})$ to be the ordered set of all admissible response sets; response set elicitation instructs a rater to select a response set $\mathcal{S}\in\mathcal{Q}$. For some tasks, $|\mathcal{O}|=|\mathcal{Q}|$, in which case the two elicitation schemes carry equivalent information. However, differences arise when rating tasks are underspecified.

Definition 2.1 (Underspecified Rating Task).

A rating task is underspecified if $|\mathcal{O}| < |\mathcal{Q}|$.

For example, consider the single output grading task shown in Figure 2, where $\mathcal{O}=\{\text{Yes},\text{No}\}$ and $\mathcal{Q}=\{\{\text{Yes}\},\{\text{No}\},\{\text{Yes},\text{No}\}\}$. This task is underspecified because, under response set elicitation, a rater who determines that both options are reasonable can select a response set containing both Yes and No. We can construct a fully specified variant of this rating task by adding a Maybe option and instructing raters to always select Maybe if they determine that both Yes and No are reasonable options. When tasks are underspecified, forced choice elicitation captures less information than response set elicitation, as the sketch below illustrates.
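As a concrete illustration, here is a minimal Python sketch of Definition 2.1 for this task and its fully specified variant. The representation of $\mathcal{Q}$ as all non-empty subsets of $\mathcal{O}$ is an assumption made for the example.

```python
# A minimal sketch of Definition 2.1, assuming the admissible response
# sets Q are all non-empty subsets of the option set O.
from itertools import combinations

options = ["Yes", "No"]  # the ordered option set O
admissible = [set(c) for r in range(1, len(options) + 1)
              for c in combinations(options, r)]

def is_underspecified(O, Q):
    # A task is underspecified when |O| < |Q|: response set elicitation
    # can then express strictly more than forced choice elicitation.
    return len(O) < len(Q)

print(admissible)                              # [{'Yes'}, {'No'}, {'Yes', 'No'}]
print(is_underspecified(options, admissible))  # True

# Adding a Maybe option that stands in for {Yes, No} fully specifies
# the task: |O'| = |Q'| = 3.
print(is_underspecified(["Yes", "No", "Maybe"],
                        [{"Yes"}, {"No"}, {"Maybe"}]))  # False
```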

[Figure 2]

Human Raters. Let $\mathcal{R}$ denote a population of human raters, such as all target system users in a geographic region, a demographic group (e.g., females over 45), or a set of domain experts (e.g., licensed radiologists). We let $R$ denote a random variable modeling the selection of raters from $\mathcal{R}$.

Judge System. We assume black-box access to the judge system $\mathcal{G}_{\text{judge}}$, which, given an input item $t_i$, returns an output that is then mapped to a response option in $\mathcal{O}$ under forced choice elicitation or to a response set in $\mathcal{Q}$ under response set elicitation.

2.1 A Probabilistic Model for Ratings

We model the distribution of human and judge system ratings via a joint distribution $\mathbb{P}(T,O^j,S^j,O^h,S^h,R)$. Here, the random variables $O^j$ and $S^j$ denote the forced choice and response set ratings, respectively, returned by the judge system for a random item $T$. Similarly, $O^h$ and $S^h$ denote the forced choice and response set ratings, respectively, that a randomly drawn rater $R$ assigns to an item $T$. We evaluate the quality of judge system ratings by comparing them with human ratings on a per-item basis. For each item $i$, let $\mathbb{P}_i^h=\mathbb{P}(O^h,S^h\mid T=t_i)$ denote the human rating distribution and let $\mathbb{P}_i^j=\mathbb{P}(O^j,S^j\mid T=t_i)$ denote the rating distribution of $\mathcal{G}_{\text{judge}}$. We omit $h$ and $j$ when referring to a generic rater (human or judge system) and let $\mathbb{P}_T$ denote a rating distribution conditioned on a random item $T$.

Aggregation Functions. We introduce aggregation functions $a:\Delta\rightarrow\mathcal{Y}$ to consolidate the full rating distribution into a rating vector. For example, applying a hard aggregation function $\mathbf{y}=a_{\text{hard}}(\mathbb{P}_i)$ recovers a binary one-hot vector encoding a single response option (e.g., Yes) (see §4). The rating space $\mathcal{Y}=\{a(\mathbb{P}_i):\mathbb{P}_i\in\Delta\}$ contains all rating vectors that can be recovered from an aggregation function.

We use aggregation functions to define random variables over rating vectors. Specifically, let $Y=a(\mathbb{P}_T)$ denote the random rating vector obtained by applying an aggregation function to the rating distribution of a random item $T$. This setup enables us to reason probabilistically about aggregated ratings, e.g., by computing the expected agreement between aggregated rating vectors recovered from humans and the judge system (see §2.2). Let $(a^h,Y^h,\mathbf{y}^h,\mathcal{Y}^h)$ and $(a^j,Y^j,\mathbf{y}^j,\mathcal{Y}^j)$ denote the aggregation function, random rating vector, rating vector realization, and rating space for humans and $\mathcal{G}_{\text{judge}}$, respectively.

2.2 Evaluation Goals

Our goal is to characterize the validity of using a judge system as a surrogate for human ratings in evaluations of the target system. We approach validation of the judge system from two complementary angles: first, by directly measuring human–judge agreement (§2.2.1), and second, by examining the extent to which relying on judge system ratings in place of human ratings affects the conclusions drawn from downstream evaluations of $\mathcal{G}_{\text{target}}$ (§2.2.2).

2.2.1 Measuring Human-Judge Agreement

The standard approach for validating judge systems involves computing a measure of human–judge agreement (Table 1). Specifically, given the joint distribution $\mathbb{P}(\cdot)$, we evaluate

$M(Y^j, Y^h) = \mathbb{E}_{(T, Y^j, Y^h)}\bigl[m(Y^j, Y^h)\bigr]$,   (1)

where $m:\mathcal{Y}^j\times\mathcal{Y}^h\rightarrow\mathbb{R}$ is an agreement metric (e.g., Hit Rate, Cohen's $\kappa$, KL-divergence). The expectation is taken over the joint distribution of random items $T$ and the corresponding aggregated rating vectors $Y^j$ and $Y^h$.

While Eq. (1) assumes that we know the rating distribution, in practice we only have access to a small corpus of ratings. Therefore, we also estimate the agreement rate,

$\hat{M}(\hat{Y}^j, \hat{Y}^h) = \mathbb{E}_{(T, Y^j, Y^h)}\bigl[m(\hat{Y}^j, \hat{Y}^h)\bigr]$.   (2)

Above, we estimate $\hat{Y}^h$ from a corpus of human ratings $\mathcal{C}=\{(T_v,R_v,O^h_v)\}_{v=1}^N \overset{\text{iid}}{\sim} \mathbb{P}(\cdot)$. We assume this corpus contains only forced choice ratings, as this is the format used in existing GenAI evaluations. For each item, we estimate $\hat{Y}^j$ by repeatedly sampling a response from $\mathcal{G}_{\text{judge}}$.
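As an illustration, the sketch below estimates Eq. (2) with Hit Rate as the metric $m$ and hard (majority vote) aggregation on both sides. The ratings are hypothetical stand-ins for a validation corpus, not data from the paper.

```python
# A sketch of estimating Eq. (2) with Hit Rate as m and hard (majority
# vote) aggregation. Option indices 0/1 encode a binary option set O.
import numpy as np

human_ratings = {0: [0, 0, 1], 1: [1, 1, 0], 2: [0, 1, 1]}  # per-item raters
judge_ratings = {0: [1, 1, 0], 1: [1, 0, 1], 2: [1, 1, 1]}  # repeated samples

def empirical_dist(ratings, n_options=2):
    # Empirical forced choice distribution for one item.
    counts = np.bincount(ratings, minlength=n_options)
    return counts / counts.sum()

def a_hard(dist):
    # Hard aggregation: one-hot vector at the modal option.
    return np.eye(len(dist))[np.argmax(dist)]

hits = [float(np.array_equal(a_hard(empirical_dist(judge_ratings[i])),
                             a_hard(empirical_dist(human_ratings[i]))))
        for i in human_ratings]
print(np.mean(hits))  # estimated hit rate; 2/3 on this toy corpus
```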

2.2.2 Measuring the Performance of a Judge System on Downstream Evaluation Tasks

We also examine how selecting judge systems based on certain human–judge agreement metrics impacts judge system performance on downstream evaluation tasks. (This is analogous to classical supervised learning, where we might select a probabilistic classifier $\hat{f}=\hat{f}_\lambda$ by choosing $\lambda$ to minimize cross-validated negative log-likelihood, and then apply the model to a task where its misclassification error at a particular threshold is the more relevant performance metric.) We focus on two specific downstream tasks. In content filtering tasks, the judge system is used to identify which outputs from $\mathcal{G}_{\text{target}}$ to allow or suppress. In prevalence estimation tasks, the judge system is used to estimate the prevalence of a certain property (e.g., “toxic” language) in $\mathcal{G}_{\text{target}}$ outputs.

Both content filtering and prevalence estimation require making binary categorizations of each item (e.g., whether an item contains “toxic” language). We categorize items using a threshold function $s_k^\tau(\mathbf{y})=\mathbb{1}[\mathbf{y}_k\geq\tau]$ that labels an item as positive if the $k$th option has at least a $\tau$ chance of being selected. The cutoff $\tau$ is a policy determination. For example, the authors of the Civil Comments toxicity classification dataset (Borkan et al., 2019) use $\tau=0.5$ when determining whether to categorize an item as toxic or non-toxic.

Content Filtering. We evaluate $\mathcal{G}_{\text{judge}}$ on content filtering tasks by measuring how often it makes the same allow/suppress decisions as the population of human raters. We quantify this via the decision consistency metric:

$C^\tau_k(Y^j, Y^h) = \mathbb{E}_{(T, Y^j, Y^h)}\bigl[\mathbb{1}[s^\tau_k(Y^j) = s^\tau_k(Y^h)]\bigr]$.

High consistency indicates that, if deployed, $\mathcal{G}_{\text{judge}}$ would often make the same decisions as human raters about which outputs from $\mathcal{G}_{\text{target}}$ to allow or suppress (at the cutoff $\tau$).

Prevalence Estimation. In prevalence estimation tasks, $\mathcal{G}_{\text{judge}}$ is used to estimate the proportion of outputs from $\mathcal{G}_{\text{target}}$ that have a certain property. For example, in single output grading tasks, prevalence estimation recovers the proportion of $\mathcal{G}_{\text{target}}$ outputs that are “relevant” or “toxic.” In pairwise comparison tasks, prevalence estimation recovers the win rate, i.e., the proportion of items where an output from one target model $\mathcal{G}_{\text{target}}^z$ is rated as preferable to an output from a second target model $\mathcal{G}_{\text{target}}^w$ (Chiang et al., 2024). We measure the estimation bias between estimates obtained from human raters versus the judge system via

$B^\tau_k(Y^j, Y^h) = \mathbb{E}[s^\tau_k(Y^j)] - \mathbb{E}[s^\tau_k(Y^h)]$.

For example, when using $\mathcal{G}_{\text{judge}}$ to rate $\mathcal{G}_{\text{target}}$ responses to automated red-teaming attacks designed to elicit toxicity (Mazeika et al., 2024; Ganguli et al., 2022), $B<0$ indicates that $\mathcal{G}_{\text{judge}}$ underestimates the prevalence of toxic outputs (i.e., the attack success rate) as compared to human ratings.
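The sketch below computes both downstream metrics from hypothetical per-item rating vectors. It also illustrates that the two metrics capture different failure modes: a judge can match the human prevalence exactly ($B=0$) while disagreeing on many individual filtering decisions.

```python
# A sketch of decision consistency C_k^tau and estimation bias B_k^tau,
# computed from hypothetical per-item probabilities that option k
# (e.g., "toxic") is selected.
import numpy as np

tau = 0.5  # policy cutoff, as in Civil Comments
y_human = np.array([0.9, 0.4, 0.6, 0.2, 0.7])  # hypothetical human vectors
y_judge = np.array([0.8, 0.6, 0.7, 0.1, 0.4])  # hypothetical judge vectors

s_human = (y_human >= tau).astype(int)  # threshold function s_k^tau
s_judge = (y_judge >= tau).astype(int)

consistency = np.mean(s_judge == s_human)   # C_k^tau
bias = np.mean(s_judge) - np.mean(s_human)  # B_k^tau
print(consistency, bias)  # 0.6 0.0: identical prevalence, different items
```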

Prevalence estimation has become a central task in the GenAI evaluation literature. The popular Chatbot Arena leaderboard uses a Bradley-Terry model to estimate the win rate (a form of prevalence estimate) in pairwise comparison tasks (Chiang et al., 2024). Similarly, Prediction Powered Inference (PPI) is increasingly used to improve the sample-efficiency of prevalence estimators by combining gold label human ratings with judge system ratings (Angelopoulos et al., 2023b, a; Boyeau et al., 2024; Eyre & Madras, 2024). (PPI setups assume that humans and the judge system rate mutually exclusive subsets of items, whereas we validate the judge system by assuming that a subset of items has been rated by both. Nevertheless, our findings challenge the reliability of using a single per-item human rating as a gold label, a foundational assumption in the PPI framework.) In this work, we show that current approaches used to elicit and aggregate human ratings can yield misleading evaluations of target systems. Because human ratings also serve as a foundation for PPI estimators and the Bradley-Terry model, our findings in turn call into question the reliability of these widely used methodologies for GenAI evaluation.

3 Decomposing Sources of Human Rating Variation in LLM-as-a-Judge Validation

With this framework in place, we now develop a model that decomposes sources of rating variation in the LLM-as-a-judge validation pipeline. Because human ratings are the benchmark for validating judge systems, disentangling “meaningful signal” from noise in rating variation is critical. Our model disentangles: (1) genuine differences in how raters interpret a rating task; (2) inconsistencies introduced by lapses in attention (i.e., error); and (3) variation introduced by requiring a rater to select a single option when they determine that more than one is reasonable, i.e., forced choice selection effects. We show that failing to account for each factor can lead to misleading evaluation results (§6–7).

Our rating model decomposes the human rating distribution for each item: $\mathbb{P}_i^h=\mathbb{P}(S_*^h,S^h,O^h\mid t_i)$. To capture the potential for rater error, we distinguish between a human rater's stable response set $S_*^h$, i.e., the options they would consistently endorse when carefully completing the rating task, and the observed response set $S^h$ they provide through response set elicitation (Figure 3). A rater's stable response set can differ from their observed response set if they fail to identify one or more options that could reasonably apply to a rating instruction (or erroneously endorse others). We describe differences between the stable and observed response sets via an error matrix $\mathbf{E}_i\in\mathbb{R}^{|\mathcal{Q}|\times|\mathcal{Q}|}$, where each entry encodes the probability that a rater endorses $\mathcal{S}_v$ given that their stable response set is $\mathcal{S}_{v^*}$. We assume that error rates are constant across all raters:

Assumption 3.1 (Error Independence).

$S^h \perp R \mid S_*^h, T$.

While a rich literature exists on rater-dependent error modeling (Klie et al., 2023; Gordon et al., 2021), we make this simplifying assumption to examine the aggregate effects of rating error on downstream evaluations of judge systems.

We use a transition matrix $\mathbf{F}_i\in\mathbb{R}^{|\mathcal{O}|\times|\mathcal{Q}|}$ to represent how raters pick an option from their response set. Each element of $\mathbf{F}_i$ contains the probability of a rater selecting the $k$th option (e.g., Yes) given that they would select the $v$th response set (e.g., both Yes and No). As with the error matrix, we also assume that $\mathbf{F}_i$ is fixed across raters:

Assumption 3.2 (Forced Choice Independence).

$O^h \perp \{S_*^h, R\} \mid S^h, T$.

Both $\mathbf{E}_i$ and $\mathbf{F}_i$ have reverse matrices, denoted $\mathbf{E}'_i$ and $\mathbf{F}'_i$, respectively, that encode conditional probabilities in the reverse direction. Entries of $\mathbf{F}'_i$ denote the probability of a rater endorsing the $v$th (observed) response set (e.g., both Yes and No) given that they selected the $k$th forced choice option (e.g., Yes). Entries of $\mathbf{E}'_i$ denote the probability that a rater's stable response set is $\mathcal{S}_{v^*}$ given that their observed response set is $\mathcal{S}_v$.

Our rating model connects different representations of human rating variation (Figure 3). The response set distribution $\boldsymbol{\theta}^*_i$, with entries $\mathbb{P}(S_*^h=\mathcal{S}_{v^*}\mid t_i)$, represents genuine differences in how a population of raters interprets an item in a rating task. This rating distribution, which is uncorrupted by error or forced choice selection effects, is our target parameter. In contrast, the forced choice distribution $\mathbf{O}_i$, with entries $\mathbb{P}(O^h=o_k\mid t_i)$, describes the distribution observed under rater error and forced choice selection effects. The following result shows that we can decompose the forced choice distribution into rater error, forced choice selection effects, and the response set distribution, and vice versa:

Theorem 3.3 (Rating Decomposition).

Assume 3.1 and 3.2 hold on $\mathbb{P}(\cdot)$. Then $\mathbf{O}_i=\mathbf{F}_i(\mathbf{E}_i\boldsymbol{\theta}^*_i)$ and $\boldsymbol{\theta}^*_i=\mathbf{E}'_i(\mathbf{F}'_i\mathbf{O}_i)$ hold for all conditional rating distributions $\mathbb{P}_i^h\in\Delta$.
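To illustrate Theorem 3.3 numerically, the following sketch builds toy $\mathbf{E}_i$ and $\mathbf{F}_i$ matrices for the task with $\mathcal{Q}=(\{\text{Yes}\},\{\text{No}\},\{\text{Yes},\text{No}\})$ and checks both directions of the decomposition. All matrix values are our own illustrative assumptions; the reverse matrices are obtained from the forward ones via Bayes' rule.

```python
# A numerical sketch of Theorem 3.3 on a toy task with
# Q = ({Yes}, {No}, {Yes, No}) and O = (Yes, No).
import numpy as np

theta = np.array([0.5, 0.3, 0.2])  # response set distribution theta*_i

eps = 0.1                                    # rating error probability
E = np.array([[1 - eps, eps / 2, eps / 2],   # E_i: stable -> observed
              [eps / 2, 1 - eps, eps / 2],
              [eps / 2, eps / 2, 1 - eps]])

F = np.array([[1.0, 0.0, 0.5],   # F_i: response set -> forced choice;
              [0.0, 1.0, 0.5]])  # {Yes, No} splits 50/50 when forced

O = F @ (E @ theta)              # forward direction: O_i = F_i E_i theta*_i
print(O)                         # forced choice distribution

# Reverse direction: theta*_i = E'_i F'_i O_i, with reverse matrices
# computed via Bayes' rule from the forward quantities.
S = E @ theta                    # observed response set distribution
F_rev = (F * S).T / O            # P(S = s_v | O = o_k)
E_rev = (E * theta).T / S        # P(S* = s_{v*} | S = s_v)
print(E_rev @ (F_rev @ O))       # recovers theta (up to float error)
```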

This theorem shows how genuine differences in raters' interpretations of an item propagate through error and forced choice selection. It also provides a mechanism for recovering $\boldsymbol{\theta}^*_i$ from $\mathbf{O}_i$ by applying the reverse error and forced choice transition matrices. Given this decomposition, we might wonder when the response set distribution is identifiable from $\mathbf{O}_i$. The following result shows that this is only possible when a rating task is fully specified:

Theorem 3.4 (Response Set Identifiability).

Assume 3.1 and 3.2 hold on $\mathbb{P}(\cdot)$. Further, assume that $\mathcal{O}\subseteq\mathcal{Q}$ and let $\mathbf{E}_i$ be the identity matrix. Then $\boldsymbol{\theta}^*_i$ is identifiable from $\mathbf{O}_i$ if and only if the rating task is fully specified.

This theorem shows that, even in an idealized setting with no rater error, information is lost when compressing a response set rating into a forced choice rating. A practical implication is that rating tasks should be fully specified whenever possible, enabling direct recovery of $\boldsymbol{\theta}_i^*$ from $\mathbf{O}_i$. In §6–7, we show that underspecified rating tasks yield substantive discrepancies in validations of judge systems.

4 Defining the Performance of a Judge System in the Absence of Gold Labels

Our model for human rating variation establishes how to define the performance of a judge system under indeterminacy. In particular, let $a^h$ and $a^j$ denote the human and judge aggregation functions used to consolidate rating variation (i.e., represented via the forced choice or response set distributions) into a rating vector. Given a judge performance metric $m$, we call $\mathbf{p}=(a^h,a^j,m)$ a definition of performance. As we describe next, many such definitions could reasonably be used to validate a judge system. We describe each definition of performance by enumerating over aggregation functions:

Hard aggregation. The hard aggregation function is defined as $a_{\text{hard}}(\mathbb{P}_i)=\mathbf{e}_{k^*}$, where $\mathbf{e}_{k^*}$ is an $|\mathcal{O}|$-dimensional basis vector and $k^*=\arg\max_k \mathbf{O}_{i,k}$ is the mode of the forced choice distribution. Performance measures that rely on hard aggregation are consistent with categorical human–judge agreement metrics (e.g., Krippendorff's $\alpha$). Measures relying on hard aggregation impose a gold-label assumption, and are the status quo in existing judge system validations (Lu & Zhong, 2024; Jung et al., 2024; Dong et al., 2024; Es et al., 2023; Dubois et al., 2024; Bubeck et al., 2023; Zheng et al., 2023; Faisal et al., 2024; Gu et al., 2024; Thakur et al., 2024; Li et al., 2024b; Chen et al., 2024; Chiang et al., 2024; Dorner et al., 2024; Mirzakhmedova et al., 2024; Chaudhary et al., 2024; Kim et al., 2024; Dettmers et al., 2024).

Soft aggregation. The soft aggregation function $a_{\text{soft}}(\mathbb{P}_i)=\mathbf{O}_i$ returns a probability vector over forced choice options. Each entry $\mathbf{O}_{i,k}$ represents the probability that the $k$th option is selected by a rater under forced choice elicitation. Definitions of performance that rely on soft aggregation are consistent with distributional human–judge agreement metrics (e.g., KL-divergence). Prior work has proposed soft label aggregation with distributional agreement metrics for evaluating ML systems under indeterminacy (Uma et al., 2020; Peterson et al., 2019; Collins et al., 2022). However, soft aggregation is seldom used in judge system validations.

Our rating model (§3) connects these categorical and distributional definitions of performance to multi-label definitions, which provide a more granular representation of rating variation over response set data. Let $\boldsymbol{\Lambda}_i\in\{0,1\}^{|\mathcal{O}|\times|\mathcal{Q}|}$ be a binary matrix indicating whether the $k$th option is in the $v$th response set. We define the multi-label vector as $\mathbf{\Omega}_i=\boldsymbol{\Lambda}_i(\mathbf{E}_i\boldsymbol{\theta}^*_i)$. Each entry $\mathbf{\Omega}_{i,k}$ describes the probability that a rater selects the $k$th option in their observed response set under response set elicitation. (Unlike the forced choice and response set distributions, entries in the multi-label vector need not sum to one.) Let $\mathbf{\Omega}^*_i=\boldsymbol{\Lambda}_i\boldsymbol{\theta}^*_i$ denote the corresponding multi-label vector that is uncorrupted by rater error. Two additional aggregation functions are consistent with multi-label vectors:

Hard Response Set. The hard response set (hrs) function $a_{\text{hrs}}(\mathbb{P}_i)=\mathbb{1}\{\mathbf{\Omega}_i\geq\tau\}$ maps the response set distribution to a binary multi-label vector. The $k$th entry of this vector is one if there is at least a $\tau$ probability that a response set containing option $k$ is selected during response set elicitation. This aggregation function is consistent with measuring the coverage of a predicted judge system response in a response set containing multiple “correct” options.

Soft Response Set. The soft response set (srs) function $a_{\text{srs}}(\mathbb{P}_i)=\mathbf{\Omega}_i$ directly returns the non-thresholded multi-label vector. Each entry $\mathbf{\Omega}_{i,k}$ denotes the probability that a rater endorses the $k$th option during response set elicitation. Definitions of performance that apply srs aggregation to the human rating distribution are consistent with continuous metrics such as Mean Squared Error and Binary Cross-Entropy.
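To make the four aggregation functions concrete, here is a minimal Python sketch on the toy task from §2, with $\mathcal{O}=(\text{Yes},\text{No})$ and $\mathcal{Q}=(\{\text{Yes}\},\{\text{No}\},\{\text{Yes},\text{No}\})$. The distributions and the forced choice selection matrix are illustrative assumptions, and the error matrix is taken to be the identity (so $\mathbf{\Omega}_i=\mathbf{\Omega}^*_i$).

```python
# A sketch of the four aggregation functions for a toy task with
# O = (Yes, No) and Q = ({Yes}, {No}, {Yes, No}); E_i is the identity.
import numpy as np

theta = np.array([0.5, 0.3, 0.2])  # response set distribution theta*_i
Lambda = np.array([[1, 0, 1],      # Lambda_i: is option k in set v?
                   [0, 1, 1]])
F = np.array([[1.0, 0.0, 0.5],     # forced choice selection matrix F_i
              [0.0, 1.0, 0.5]])

O_i = F @ theta                    # forced choice distribution
Omega = Lambda @ theta             # multi-label vector (error-free)

def a_hard(O):                     # one-hot at the modal option
    return np.eye(len(O))[np.argmax(O)]

def a_soft(O):                     # the forced choice distribution itself
    return O

def a_hrs(Om, tau=0.5):            # binary multi-label vector at cutoff tau
    return (Om >= tau).astype(int)

def a_srs(Om):                     # non-thresholded multi-label vector
    return Om

# Note that hrs can endorse both options at once ([1, 1] here), which
# hard aggregation cannot express.
print(a_hard(O_i), a_soft(O_i), a_hrs(Omega), a_srs(Omega))
```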

Table 1 in Appendix A lists many definitions of performance that are consistent with these aggregation functions. This table also summarizes definitions of performance commonly used in (1) LLM-as-a-judge validations and (2) prior work studying evaluation under indeterminacy.

5 Ranking Judge Systems Under Competing Definitions of Performance

Given that there are many ways of defining performance under indeterminacy, it is unclear when one approach is preferable to another. One way to distinguish among competing definitions is by examining their downstream impact on judge system validation: when do two performance definitions yield a consistent ranking of judge systems? We now use our framework to formally investigate this question.

Let $\mathcal{G}_{\text{judge}}^w$ and $\mathcal{G}_{\text{judge}}^z$ denote two judge systems, each described by their conditional rating distributions $\mathbb{P}_T^{j,w}$ and $\mathbb{P}_T^{j,z}$, respectively. We can compare these systems with respect to a performance definition $\mathbf{p}$ via

$\delta_{\mathbf{p}}(z, w) = \mathbb{E}_{\mathbb{P}^*}\bigl[m\bigl(a^j(\mathbb{P}_T^{j,z}),\, a^h(\mathbb{P}_T^h)\bigr) - m\bigl(a^j(\mathbb{P}_T^{j,w}),\, a^h(\mathbb{P}_T^h)\bigr)\bigr]$,   (3)

where $\mathbb{P}^*$ represents the full joint distribution over the responses returned by both judge systems and human raters.

To formalize a comparison between two systems, we let $\mathcal{G}_{\text{judge}}^z\succeq_{\mathbf{p}}\mathcal{G}_{\text{judge}}^w$ denote that $\delta_{\mathbf{p}}(z,w)\geq 0$. For instance, when using Hit Rate with hard aggregation, $\succeq_{\mathbf{p}}$ implies that $z$ achieves greater agreement with a majority vote over human ratings than $w$. (For metrics where lower values indicate better performance, like KL-divergence, we invert the definition such that $\mathcal{G}_{\text{judge}}^z\succeq_{\mathbf{p}}\mathcal{G}_{\text{judge}}^w\iff\delta_{\mathbf{p}}(z,w)\leq 0$.) Now, suppose that we would like to compare judge systems under a different definition of performance, denoted by $\mathbf{p}'=(a^{h\prime},a^{j\prime},m')$. The following condition describes when these two definitions are guaranteed to yield an equivalent ranking of judge systems:

Definition 5.1 (Rank Consistency).

We say that $\mathbf{p}$ and $\mathbf{p}'$ are rank consistent if, for all $\mathbb{P}^*$, $\mathcal{G}_{\text{judge}}^z\succeq_{\mathbf{p}}\mathcal{G}_{\text{judge}}^w\iff\mathcal{G}_{\text{judge}}^z\succeq_{\mathbf{p}'}\mathcal{G}_{\text{judge}}^w$.

While there are many possible relationships between two definitions of performance, monotonicity captures one key property we might expect: when one system's performance improves with respect to $\mathbf{p}$, it should also improve with respect to $\mathbf{p}'$ if the two definitions are compatible. We formalize this notion in the following definition:

Definition 5.2 (Monotone Transformation).

$\mathbf{p}$ is a monotone transformation of $\mathbf{p}'$ if there exists a monotone increasing function $f$ such that $m'\bigl(a^{j\prime}(\mathbb{P}_i^j),\, a^{h\prime}(\mathbb{P}_i^h)\bigr) = f\bigl(m\bigl(a^j(\mathbb{P}_i^j),\, a^h(\mathbb{P}_i^h)\bigr)\bigr)$ for all $(\mathbb{P}_i^j,\mathbb{P}_i^h)\in\Delta\times\Delta$.

The following result shows that if two performance definitions are not monotone transformations of one another, there exist judge systems and a distribution over human ratings such that the definitions will yield contradictory rankings:

Theorem 5.3 (Necessary Condition for Rank Consistency).

If $\mathbf{p}$ is not a monotone transformation of $\mathbf{p}'$, then $\mathbf{p}$ and $\mathbf{p}'$ are not rank consistent.

Theorem 5.3 provides a useful tool for comparing definitions of performance: we can show that two definitions are not rank consistent by demonstrating a monotonicity violation.

We provide two examples of monotonicity violations in Appendix C. The first shows a violation between Hit Rate (defined over $\mathbf{O}$) and KL-Divergence (defined over $\mathbf{O}$). The second shows a violation between KL-Divergence (defined over $\mathbf{O}$) and Mean Squared Error (defined over $\mathbf{\Omega}$). This second example illustrates a pernicious issue arising in underspecified tasks: using Theorem 3.4, we can easily construct monotonicity violations by holding the forced choice distribution fixed while varying the response set distribution. This suggests that monotonicity, and by extension rank consistency, is unlikely to hold between definitions of performance defined over the forced choice distribution (i.e., categorical, distributional) and multi-label definitions.
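To make the flavor of such a violation concrete, consider the following minimal numeric sketch (ours, in the spirit of the Appendix C examples, with hypothetical forced choice distributions): a judge that matches the modal human rating but is poorly calibrated beats a well-calibrated judge under Hit Rate, while KL-Divergence ranks the two the other way.

```python
import numpy as np

def hit_rate(human, judge):
    """Agreement between the hard-aggregated (modal) human and judge ratings."""
    return float(np.argmax(human) == np.argmax(judge))

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions with full support."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

human   = np.array([0.60, 0.40])  # human forced choice distribution
judge_a = np.array([0.90, 0.10])  # matches the human mode, poorly calibrated
judge_b = np.array([0.45, 0.55])  # mismatched mode, close in distribution

print(hit_rate(human, judge_a), hit_rate(human, judge_b))            # 1.0 0.0
print(kl_divergence(human, judge_a), kl_divergence(human, judge_b))  # ~0.31 ~0.05
# Hit Rate ranks judge A first; KL-Divergence ranks judge B first.
```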

Table 8: Decision consistency of the judge system selected by each human–judge agreement metric, as the sensitivity parameter ($\beta$) varies. Each metric is listed with the judge system(s) it selects; values in parentheses give the gap in decision consistency relative to the best available judge system.

| Metric (selected judge system) | $\beta$=0.0 | 0.1 | 0.2 | 0.3 | 0.4 |
|---|---|---|---|---|---|
| Hit Rate (h/h): Sonnet 3.5 | 0.61 | 0.66 | 0.68 | 0.55 (-0.29) | 0.42 (-0.58) |
| KLD (s/s), (judge, human): Sonnet 3.5 | 0.61 | 0.66 | 0.68 | 0.55 (-0.29) | 0.42 (-0.58) |
| KLD (s/s), (human, judge): Mistral Small | 0.06 (-0.55) | 0.26 (-0.40) | 0.52 (-0.16) | 0.84 | 1.00 |
| MSE (srs/srs): Mistral Small, 3.5 Turbo | 0.06 (-0.55) | 0.26 (-0.40) | 0.52 (-0.16) | 0.79 (-0.05) | 0.96 (-0.04) |
| Coverage (h/hrs): Sonnet 3.5, 4o Mini, 3.5 Turbo | 0.61 | 0.66 | 0.68 | 0.64 (-0.20) | 0.96 (-0.04) |
| Consistency ($\tau=.5$): Sonnet 3.5, Mistral Small | 0.61 | 0.66 | 0.68 | 0.84 | 1.00 |

6 Reconciling Definitions of Performance via Synthetic Experiments

Given that a single monotonicity violation can yield an inversion in system rankings, we might wonder how often inversions occur in practice. Yet merely documenting rank inversions would not provide a means of selecting among two definitions of performance that rank judge systems differently. Therefore, our experiments examine how well judge systems selected using different human–judge agreement metrics perform on downstream evaluation tasks.

Experiment Design. We use our rating decomposition (§3) to sample human and judge system rating distributions:

Human Rating Distribution. For each item, we sample the response set distribution $\boldsymbol{\theta}^{*,h}_{i} \sim \text{Dir}(\mathbf{1}_{|\mathcal{Q}|})$. We let $\epsilon$ denote the probability that a rater selects an observed response set that differs from their stable response set. We construct the error matrix $\mathbf{E}^{h}_{i}$ such that diagonal entries denote the probability of no rating error $(1-\epsilon)$ and off-diagonal entries denote the probability of rating error ($\epsilon$). We use a skew parameter $\eta$ to control how errors are distributed across response sets. We let $k=0$ denote the index of the option used to categorize items (e.g., as "toxic"). The skew parameter controls whether errors systematically favor ($\eta>0$) or disfavor ($\eta<0$) response sets containing option $k=0$.
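A minimal sketch of this sampling step follows. The Dirichlet draw and the $(1-\epsilon)/\epsilon$ structure follow the text; the exponential tilt used to implement the skew $\eta$ is an illustrative assumption of ours, as the exact parameterization is deferred to Appendix D.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_error_matrix(contains_k0, eps=0.3, eta=0.0):
    """Column-stochastic error matrix E over response sets.

    Diagonal entries carry the no-error probability (1 - eps); the error
    mass eps is spread over the remaining response sets, tilted by the
    skew eta toward (eta > 0) or away from (eta < 0) sets containing
    option k = 0. The exponential tilt is an illustrative choice of ours.
    """
    contains_k0 = np.asarray(contains_k0, dtype=float)
    n = len(contains_k0)
    E = np.zeros((n, n))
    weights = np.exp(eta * contains_k0)   # upweight sets containing o_0
    for v_star in range(n):               # column = stable response set
        w = weights.copy()
        w[v_star] = 0.0                   # errors land on *other* sets
        E[:, v_star] = eps * w / w.sum()
        E[v_star, v_star] = 1.0 - eps
    return E

theta = rng.dirichlet(np.ones(4))   # response set distribution over |Q| = 4 sets
E = make_error_matrix(contains_k0=[1, 1, 0, 0], eps=0.3, eta=1.0)
print(E.sum(axis=0), (E @ theta).sum())  # columns sum to 1; E @ theta stays a distribution
```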

We model $\mathbf{F}_{i}^{h}$ by sampling an exponential decay function $\mathbf{F}^{h}_{i,k,v} \propto \exp(-\gamma^{h} \cdot r_{k})$. Here, $r_{k}$ denotes the rank (low to high) of the $k$th option in the $v$th response set and $\gamma^{h}$ controls the strength of selection effects. We measure the magnitude of forced choice selection effects via

$$\Gamma^{h} = \frac{1}{|\mathcal{Q}_{+}|} \sum_{\mathcal{S}_{v} \in \mathcal{Q}} \frac{1}{|\mathcal{S}_{v}|} P(O^{h} = o_{0} \mid o_{0} \in \mathcal{S}_{v}),$$

where $\mathcal{Q}_{+}$ denotes the set of response sets containing the option at index $k=0$. A value of $\Gamma^{h}=1$ indicates random chance of the first option being selected (i.e., no selection effects), while $\Gamma^{h}=2$ and $\Gamma^{h}=0.5$ indicate a rater is twice or half as likely as chance to select the first option (i.e., positive and negative selection effects, respectively). We then compute the forced choice distribution via $\mathbf{O}^{h}_{i} = \mathbf{F}_{i}^{h}(\mathbf{E}^{h}_{i}(\boldsymbol{\theta}^{*,h}_{i}))$.
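The sketch below instantiates the exponential-decay forced choice map. Two details are our assumptions: we take an option's rank $r_{k}$ to be its position within the (ordered) response set, and we compute the selection effect as the average ratio of $o_{0}$'s selection probability to its chance level $1/|\mathcal{S}_{v}|$, matching the interpretation of $\Gamma^{h}=1$ as random chance.

```python
import numpy as np

def forced_choice_matrix(response_sets, n_options, gamma=1.0):
    """Exponential-decay forced choice map F (options x response sets).

    F[k, v] is the probability that a rater whose response set is
    response_sets[v] reports option k when forced to choose one option.
    Within a set, probability decays exponentially with the option's
    rank r_k (here: its position in the ordered set), with strength gamma.
    """
    F = np.zeros((n_options, len(response_sets)))
    for v, s in enumerate(response_sets):
        decay = np.exp(-gamma * np.arange(len(s)))  # rank 0, 1, ... within set
        F[list(s), v] = decay / decay.sum()         # normalize within the set
    return F

def selection_effect(F, response_sets, k0=0):
    """Average ratio of option k0's selection probability to its chance
    level 1/|S_v|, over response sets containing k0 (1.0 = no effect)."""
    ratios = [F[k0, v] * len(s) for v, s in enumerate(response_sets) if k0 in s]
    return float(np.mean(ratios))

# Three options; response sets are ordered tuples of option indices.
Q = [(0,), (1,), (2,), (0, 1), (0, 1, 2)]
F = forced_choice_matrix(Q, n_options=3, gamma=1.0)
print(selection_effect(F, Q))  # > 1: selection favors option k = 0

# Composing the decomposition: O = F @ (E @ theta), with E and theta
# sampled as in the previous sketch.
```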

Judge Rating Distribution. We model judge systems by sampling an ensemble of distributions with varying similarity to the human rating distribution. We control the deviation of the $z$th judge's rating distribution via $\sigma^{z} \sim \mathcal{U}(\sigma_{\text{min}}, \sigma_{\text{max}})$. We then sample the $z$th judge's response set distribution by applying $\boldsymbol{\theta}_{i}^{*,z} = \Pi_{\Delta}\{\boldsymbol{\theta}^{*}_{i} + \boldsymbol{\epsilon}_{i}^{z}\}$, where $\boldsymbol{\epsilon}_{i}^{z} \sim \mathcal{N}(0, (\sigma^{z})^{2}\mathbf{I})$ and $\Pi_{\Delta}$ projects onto the probability simplex. We sample $\mathbf{F}_{i}^{z}$ following the same procedure used to sample the human rating distribution and let $\Gamma^{z}$ denote the magnitude of the $z$th judge system's forced choice selection effects. We assume that all variation in a judge system's response set distribution is captured in stable response sets (i.e., the judge system is not affected by rater error). As such, we let $\mathbf{E}_{i}^{z}$ be the identity matrix and compute $\mathbf{O}^{z}_{i} = \mathbf{F}_{i}^{z}(\mathbf{E}^{z}_{i}(\boldsymbol{\theta}^{*,z}_{i}))$.
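The projection $\Pi_{\Delta}$ is not pinned down in the text; the sketch below uses the standard sorting-based Euclidean projection onto the probability simplex, and the $\sigma_{\text{min}}, \sigma_{\text{max}}$ values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def project_to_simplex(y):
    """Sorting-based Euclidean projection of y onto the probability simplex."""
    u = np.sort(y)[::-1]                  # sort coordinates, descending
    css = np.cumsum(u)
    idx = np.arange(1, len(y) + 1)
    rho = np.nonzero(u + (1.0 - css) / idx > 0)[0][-1]
    tau = (1.0 - css[rho]) / (rho + 1.0)
    return np.maximum(y + tau, 0.0)

def sample_judge_theta(theta_human, sigma_min=0.05, sigma_max=0.5):
    """Perturb the human response set distribution by Gaussian noise with a
    judge-specific scale sigma^z, then project back onto the simplex."""
    sigma_z = rng.uniform(sigma_min, sigma_max)   # deviation of this judge
    eps = rng.normal(0.0, sigma_z, size=theta_human.shape)
    return project_to_simplex(theta_human + eps)

theta_h = rng.dirichlet(np.ones(5))       # human response set distribution
theta_z = sample_judge_theta(theta_h)     # one judge system in the ensemble
print(theta_z.sum(), theta_z.min() >= 0)  # ~1.0 True
```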

We refer to forced choice selection effects between humans and the $z$th judge system as symmetric when $\text{sign}(\Gamma^{h} - 1) = \text{sign}(\Gamma^{z} - 1)$. In contrast, we refer to forced choice selection effects as asymmetric when $\text{sign}(\Gamma^{h} - 1) \neq \text{sign}(\Gamma^{z} - 1)$. We provide additional experiment setup details and describe our finite sample estimation approach in Appendix D.

Results. Figure 6 reports the performance of judge systems selected via different human–judge agreement metrics. The left panel shows the performance of judge systems on a fully specified rating task with no rating error. The center panel shows the performance of judge systems on an underspecified rating task with symmetric (left) and asymmetric (right) forced choice selection effects and no rating error. The right panel introduces additional random ($\epsilon = 0.3$, $\eta = 0$) and additive ($\epsilon = 0.3$, $\eta = 1$) rater error to the rating process. (The rightmost column is labeled "additive error" because the positive forced choice selection effects ($\Gamma = 2$) and positive skew ($\eta = 1$) jointly shift probability mass toward option $k = 0$.)

Finding 1: Fully specified rating tasks make more effective use of a limited annotation budget. Figure 6 shows that the annotation budget — i.e., the number of human ratings collected for each item in the evaluation corpus — has a significant impact on the quality of selected judge systems. While current practice is to select judge systems via a single rating per item, increasing the budget to three ratings per item yields a larger benefit than accounting for other factors manipulated in our experiments (i.e., forced choice selection effects and the human–judge agreement metric). Further, fully specifying rating tasks enables more effective use of a limited annotation budget. Judge systems selected with just one rating per item on a fully specified task (Figure 6, left) match the performance of those selected with three ratings per item on an underspecified task (Figure 6, center) – a 66% reduction in annotation budget.

Finding 2: Categorical agreement metrics are unreliable in underspecified rating tasks with asymmetric selection effects. As shown in the center panel of Figure 6, using categorical agreement metrics to select judge systems is unreliable when (1) rating tasks are underspecified and (2) selection effects are asymmetric. Figure 6 corroborates these findings by indicating weak Spearman correlation between categorical human–judge agreement metrics and downstream performance metrics under asymmetric selection effects. Because asymmetric selection effects cannot be detected from forced choice ratings alone (prior work has documented that humans and LLMs exhibit asymmetric survey response biases similar to those modeled by forced choice selection effects; Tjuatja et al., 2024), distributional agreement metrics should be used when selecting judge systems from forced choice ratings on underspecified tasks.

Finding 3: The impact of rating error on judge system selection varies by context and is sometimes less critical than forced choice selection effects. Given the substantial literature investigating rater error (Klie et al., 2023; Plank, 2022; Gordon et al., 2021), its relatively modest impact on judge system selection is unexpected. The right panel of Figure 6 shows that even with high error rates ($\epsilon = 0.3$) and adversarial conditions with additive effects ($\eta = 1$), rater error has a less significant effect on judge system selections than forced choice selection effects (Figure 10 shows that these findings are robust to different choices of $\epsilon$ and $\eta$). Lemma B.1 (Appendix B.2) pinpoints the mechanism driving these empirical findings: when rater error affects human ratings but not the ratings assigned by a judge system, its impact on the comparative ranking of judge systems is limited. This is in contrast to forced choice selection effects, which affect both human and judge system ratings. Nevertheless, rater error can yield rank inversions under specific conditions that fall outside the scope of our main experiment design (see Figure 11).

We find that rater error has a significant effect on prevalence estimates obtained from human ratings (Figure 12), particularly when error is correlated with the option used to categorize items. Taken together, our results suggest that more research is needed to characterize the effects of rater error at various points in the LLM-as-a-judge validation pipeline and in downstream performance estimates.

Finding 4: Underspecified rating tasks, forced choice selection effects, and finite sample error yield misleading evaluations of target systems. Figure 6 shows that factors in the human rating process affect evaluations of the target system, irrespective of whether a judge system is introduced. The left panel shows the consistency between thresholded decisions obtained from the population rating distribution versus a finite sample approximation. Allow/suppress decisions obtained from sampling just one rating per item disagree with the decisions produced by thresholding the population rating distribution for 30% of items. The right panel shows that, for underspecified tasks, prevalence estimates obtained from forced choice ratings underestimate the prevalence of the property of interest (e.g., "toxicity") as compared to prevalence estimates obtained from response set ratings. (Figure 6 isolates the role of forced choice selection effects by assuming no rater error; Figure 12 in Appendix D characterizes interactions between forced choice selection effects and rater error.) These findings underscore that carefully structuring the rating task and accounting for forced choice selection effects is essential for reliable target system evaluations.

7 Case Study: Validating LLM-as-a-Judge Systems for Toxicity Detection

We examine how the choice of performance definition impacts judge system validation in practice by constructing a toxicity rating task. We use the Civil Comments dataset (Borkan et al., 2019) and assume that each comment is an output from $\mathcal{G}_{\text{target}}$ corresponding to an item in a rating task (see §E). We compare human ratings against ratings provided by five judge systems: Mistral Small, Mistral Large, Claude Sonnet 3.5, GPT 3.5 Turbo, and GPT 4o Mini.

The Civil Comments data contains forced choice ratings elicited from an underspecified rating task. It is possible to recover three options from the provided data: $o_{1} =$ Very Toxic, $o_{2} =$ Toxic, and $o_{3} =$ Not Toxic OR Hard to Say. We model how these forced choice options map to response sets via a sensitivity parameter $\beta = \mathbb{P}(o_{2} \in \mathcal{S} \mid O^{h} = o_{3})$. This parameter denotes the probability that a rater would endorse a response set that contains Toxic given that their forced choice response was Not Toxic or Hard to Say.
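Under this model, an item's forced choice distribution over $(o_{1}, o_{2}, o_{3})$ maps to a per-item probability of endorsing Toxic. The sketch below uses toy numbers and one added assumption of ours: raters answering $o_{1}$ or $o_{2}$ always endorse Toxic, which the forced choice ratings alone do not pin down.

```python
import numpy as np

def toxicity_endorsement(p_forced, beta):
    """Per-item probability that a rater's response set contains Toxic.

    p_forced = (P(o_1), P(o_2), P(o_3)) is the item's forced choice
    distribution; beta is the probability that a rater answering o_3
    would nonetheless endorse a set containing Toxic. We assume raters
    answering o_1 or o_2 always endorse Toxic (our simplification).
    """
    p1, p2, p3 = p_forced
    return p1 + p2 + beta * p3

def prevalence(items, beta, tau=0.5):
    """Share of items whose endorsement probability clears the cutoff tau."""
    probs = np.array([toxicity_endorsement(p, beta) for p in items])
    return float(np.mean(probs >= tau))

items = [(0.20, 0.25, 0.55), (0.10, 0.20, 0.70), (0.40, 0.20, 0.40)]  # toy corpus
print(prevalence(items, beta=0.0))  # 0.33: forced choice ratings only
print(prevalence(items, beta=0.1))  # 0.67: a small beta doubles the estimate
```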

Finding 5: The ranking of judge systems depends on the choice of agreement metric. Table 8 illustrates that, even when there is a perfect mapping between forced choice and response set distributions ($\beta = 0$), the selected judge system depends on the choice of human–judge agreement metric. Claude Sonnet 3.5 is selected when targeting Hit Rate (h/h), Coverage (h/hrs), and KL-Divergence (s/s) measured as the deviation of the judge system rating distribution from the human rating distribution. In contrast, Mistral Small is selected when targeting MSE (srs/srs) or KL-Divergence (s/s) measured as the deviation of the human rating distribution from the judge system rating distribution. However, Mistral Small has 90% worse decision consistency than Sonnet 3.5 when $\beta = 0$.

This finding illustrates that selecting a judge system based on metrics developed for the determinate task setting can produce a model that is highly suboptimal on downstream content filtering tasks. Current practice is to measure asymmetric distributional metrics (e.g., KL-Divergence) as the deviation of the human rating distribution from the judge system rating distribution (Peterson et al., 2019; Collins et al., 2022; Uma et al., 2020; Fornaciari et al., 2021). While results presented in Appendix D indicate that judge systems selected with this directionality perform well in some settings, we observe that the opposite directionality is preferable when $\beta \leq .3$ in Table 8. We recommend using a symmetric metric (i.e., JS-Divergence) when selecting judge systems in practice, as this approach performs reliably across settings.
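A minimal sketch contrasting the two KL directionalities with the recommended symmetric alternative (the distributions are illustrative):

```python
import numpy as np

def kl(p, q):
    """KL(p || q), restricted to the support of p."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded, unlike KL."""
    m = 0.5 * (np.asarray(p, dtype=float) + np.asarray(q, dtype=float))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

human = [0.6, 0.3, 0.1]
judge = [0.2, 0.7, 0.1]
print(kl(human, judge), kl(judge, human))  # ~0.41 vs ~0.37: direction matters
print(js_divergence(human, judge))         # ~0.09: one symmetric score
```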

Finding 6: The ranking of judge systems depends on forced choice selection effects. Table 8 also shows that the ranking of judge systems inverts when $\beta \geq .3$. Using Hit Rate to select a judge system, as is common practice in the literature (Table 1), yields Sonnet 3.5 across all settings of $\beta$. While Sonnet 3.5 has the highest decision consistency at $\beta = 0$, the best model (defined with respect to decision consistency) inverts to Mistral Small when $\beta \geq .3$. Yet selecting Sonnet 3.5 when $\beta = .3$ yields a 34.5% reduction in decision consistency as compared to Mistral Small. This underscores the importance of fully specifying rating tasks or using response set ratings when the rating task is underspecified.

One way to assess whether this value of $\beta$ is plausible is to inspect the value recovered by each judge system: Sonnet 3.5 (0.32), Mistral Small (0.44), GPT 3.5 Turbo (0.04), GPT 4o Mini (0.06), and Mistral Large (0.11). Notably, the two models that align most closely with human ratings, Mistral Small and Sonnet 3.5, both yield $\beta \geq .3$. This suggests the performance inversions shown in Table 8 are plausible.

Finding 7: Forced choice ratings underestimate the prevalence of toxicity in target system outputs. Figure 8 shows the prevalence of toxicity in $\mathcal{G}_{\text{target}}$ outputs as a function of the sensitivity parameter ($\beta$) and cutoff ($\tau$). The setting with $\tau = .5$ and $\beta = 0$ corresponds to the status quo approach of using forced choice ratings to estimate prevalence. However, we find that small increases in the value of $\beta$ ($\geq .1$) yield significant changes in the estimated prevalence of toxicity in $\mathcal{G}_{\text{target}}$ outputs. For example, at $\tau = .5$, increasing $\beta$ from 0 to 0.05 doubles the estimated prevalence of toxicity in target system outputs. This substantiates the findings of our synthetic data experiments (Figure 6, right) and underscores the importance of carefully modeling the rating process to obtain reliable evaluations of the target system.

8 Conclusion

We introduce a framework for LLM-as-a-judge validation under rating task indeterminacy. Our framework provides a methodological foundation for more principled validation of judge systems designed to rate concepts such as "helpfulness", "toxicity", and "relevance" in target system outputs. We identify key factors in the LLM-as-a-judge validation pipeline (rating task design, judge performance measures, and rating elicitation and aggregation schemes) that can significantly affect the downstream ratings performed by judge systems. We show that current practices for validating judge systems can yield misleading assessments of judge system performance and unreliable evaluations of target systems.

9 Impact Statement

In this paper, we introduce a framework for LLM-as-a-judge validation in the absence of gold labels. We demonstrate that key methodological considerations, including rating task design, measures of judge system performance, and rating elicitation and aggregation schemes, have substantive downstream impacts on judge system ratings. Although our arguments have potential societal consequences, especially if the practices we advocate for see adoption and thus change current GenAI evaluation practice, there are no consequences we feel the need to highlight that are specific to this work rather than applicable to any work aiming to improve upon current evaluation practices. The validation practices described in this paper are not an endorsement for the adoption of a judge system in any particular setting.

10 Acknowledgments

We thank members of the Sociotechnical Alignment Center at Microsoft Research New York City for their helpful comments on early versions of this work. We also thank attendees and reviewers of the Statistical Frontiers in LLMs and Foundation Models (SFLLM) and Evaluating Evaluations (EvalEval) workshops at NeurIPS, and attendees of the Fairness, Explainability, Accountability, and Transparency (FEAT) reading group at Carnegie Mellon University. This work was supported in part by an award from the UL Research Institutes through the Center for Advancing Safety of Machine Intelligence (CASMI) at Northwestern University.

References

  • Angelopoulos et al. (2023a) Angelopoulos, A. N., Bates, S., Fannjiang, C., Jordan, M. I., and Zrnic, T. Prediction-powered inference. Science, 382(6671):669–674, 2023a.
  • Angelopoulos et al. (2023b) Angelopoulos, A. N., Duchi, J. C., and Zrnic, T. PPI++: Efficient prediction-powered inference. arXiv preprint arXiv:2311.01453, 2023b.
  • Bencke et al. (2024) Bencke, L., Paula, F. S., dos Santos, B. G., and Moreira, V. P. Can we trust LLMs as relevance judges? In Simpósio Brasileiro de Banco de Dados (SBBD), pp. 600–612. SBC, 2024.
  • Borkan et al. (2019) Borkan, D., Dixon, L., Sorensen, J., Thain, N., and Vasserman, L. Nuanced metrics for measuring unintended bias with real data for text classification. CoRR, abs/1903.04561, 2019. URL http://arxiv.org/abs/1903.04561.
  • Boyeau et al. (2024) Boyeau, P., Angelopoulos, A. N., Yosef, N., Malik, J., and Jordan, M. I. AutoEval done right: Using synthetic data for model evaluation. arXiv preprint arXiv:2403.07008, 2024.
  • Bubeck et al. (2023) Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., Lee, P., Lee, Y. T., Li, Y., Lundberg, S., et al. Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv preprint arXiv:2303.12712, 2023.
  • Chaudhary et al. (2024) Chaudhary, M., Gupta, H., Bhat, S., and Varma, V. Towards understanding the robustness of LLM-based evaluations under perturbations. arXiv preprint arXiv:2412.09269, 2024.
  • Chen et al. (2024) Chen, D., Chen, R., Zhang, S., Wang, Y., Liu, Y., Zhou, H., Zhang, Q., Wan, Y., Zhou, P., and Sun, L. MLLM-as-a-judge: Assessing multimodal LLM-as-a-judge with vision-language benchmark. In Forty-first International Conference on Machine Learning, 2024.
  • Chiang et al. (2024) Chiang, W.-L., Zheng, L., Sheng, Y., Angelopoulos, A. N., Li, T., Li, D., Zhu, B., Zhang, H., Jordan, M., Gonzalez, J. E., et al. Chatbot Arena: An open platform for evaluating LLMs by human preference. In Forty-first International Conference on Machine Learning, 2024.
  • Collins et al. (2022) Collins, K. M., Bhatt, U., and Weller, A. Eliciting and learning with soft labels from every annotator. In Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, volume 10, pp. 40–52, 2022.
  • Dettmers et al. (2024) Dettmers, T., Pagnoni, A., Holtzman, A., and Zettlemoyer, L. QLoRA: Efficient finetuning of quantized LLMs. Advances in Neural Information Processing Systems, 36, 2024.
  • Dong et al. (2024) Dong, Y. R., Hu, T., and Collier, N. Can LLM be a personalized judge? arXiv preprint arXiv:2406.11657, 2024.
  • Dorner et al. (2024) Dorner, F. E., Nastl, V. Y., and Hardt, M. Limits to scalable evaluation at the frontier: LLM as judge won't beat twice the data. arXiv preprint arXiv:2410.13341, 2024.
  • Dubois et al. (2024) Dubois, Y., Li, C. X., Taori, R., Zhang, T., Gulrajani, I., Ba, J., Guestrin, C., Liang, P. S., and Hashimoto, T. B. AlpacaFarm: A simulation framework for methods that learn from human feedback. Advances in Neural Information Processing Systems, 36, 2024.
  • Es et al. (2023) Es, S., James, J., Espinosa-Anke, L., and Schockaert, S. RAGAS: Automated evaluation of retrieval augmented generation. arXiv preprint arXiv:2309.15217, 2023.
  • Eyre & Madras (2024) Eyre, B. and Madras, D. Auto-evaluation with few labels through post-hoc regression. arXiv preprint arXiv:2411.12665, 2024.
  • Faisal et al. (2024) Faisal, F., Rahman, M. M., and Anastasopoulos, A. Dialectal toxicity detection: Evaluating LLM-as-a-judge consistency across language varieties. arXiv preprint arXiv:2411.10954, 2024.
  • Fisch et al. (2020) Fisch, A., Schuster, T., Jaakkola, T., and Barzilay, R. Efficient conformal prediction via cascaded inference with expanded admission. arXiv preprint arXiv:2007.03114, 2020.
  • Fornaciari et al. (2021) Fornaciari, T., Uma, A., Paun, S., Plank, B., Hovy, D., Poesio, M., et al. Beyond black & white: Leveraging annotator disagreement via soft-label multi-task learning. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 2021.
  • Ganguli et al. (2022) Ganguli, D., Lovitt, L., Kernion, J., Askell, A., Bai, Y., Kadavath, S., Mann, B., Perez, E., Schiefer, N., Ndousse, K., et al. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858, 2022.
  • Gordon et al. (2021) Gordon, M. L., Zhou, K., Patel, K., Hashimoto, T., and Bernstein, M. S. The disagreement deconvolution: Bringing machine learning performance metrics in line with reality. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, pp. 1–14, 2021.
  • Goyal et al. (2022) Goyal, N., Kivlichan, I. D., Rosen, R., and Vasserman, L. Is your toxicity my toxicity? Exploring the impact of rater identity on toxicity annotation. Proceedings of the ACM on Human-Computer Interaction, 6(CSCW2):1–28, 2022.
  • Gu et al. (2024) Gu, J., Jiang, X., Shi, Z., Tan, H., Zhai, X., Xu, C., Li, W., Shen, Y., Ma, S., Liu, H., et al. A survey on LLM-as-a-judge. arXiv preprint arXiv:2411.15594, 2024.
  • Jung et al. (2024) Jung, J., Brahman, F., and Choi, Y. Trust or escalate: LLM judges with provable guarantees for human agreement. arXiv preprint arXiv:2407.18370, 2024.
  • Kim et al. (2024) Kim, T. S., Lee, Y., Shin, J., Kim, Y.-H., and Kim, J. EvalLM: Interactive evaluation of large language model prompts on user-defined criteria. In Proceedings of the CHI Conference on Human Factors in Computing Systems, pp. 1–21, 2024.
  • Klie et al. (2023) Klie, J.-C., Webber, B., and Gurevych, I. Annotation error detection: Analyzing the past and present for a more coherent future. Computational Linguistics, 49(1):157–198, 2023.
  • Li et al. (2024a) Li, X., Lipton, Z. C., and Leqi, L. Personalized language modeling from personalized human feedback. arXiv preprint arXiv:2402.05133, 2024a.
  • Li et al. (2024b) Li, Z., Wang, C., Ma, P., Wu, D., Wang, S., Gao, C., and Liu, Y. Split and merge: Aligning position biases in LLM-based evaluators. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 11084–11108, 2024b.
  • Lu & Zhong (2024) Lu, H. and Zhong, F. Can vision-language models replace human annotators: A case study with CelebA dataset. arXiv preprint arXiv:2410.09416, 2024.
  • Mazeika et al. (2024) Mazeika, M., Phan, L., Yin, X., Zou, A., Wang, Z., Mu, N., Sakhaee, E., Li, N., Basart, S., Li, B., et al. HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249, 2024.
  • Mirzakhmedova et al. (2024) Mirzakhmedova, N., Gohsen, M., Chang, C. H., and Stein, B. Are large language models reliable argument quality annotators? In Conference on Advances in Robust Argumentation Machines, pp. 129–146. Springer, 2024.
  • Nie et al. (2020) Nie, Y., Zhou, X., and Bansal, M. What can we learn from collective human opinions on natural language inference data? arXiv preprint arXiv:2010.03532, 2020.
  • Pavlick & Kwiatkowski (2019) Pavlick, E. and Kwiatkowski, T. Inherent disagreements in human textual inferences. Transactions of the Association for Computational Linguistics, 7:677–694, 2019.
  • Peterson et al. (2019) Peterson, J. C., Battleday, R. M., Griffiths, T. L., and Russakovsky, O. Human uncertainty makes classification more robust. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9617–9626, 2019.
  • Plank (2022) Plank, B. The 'problem' of human label variation: On ground truth in data, modeling and evaluation. arXiv preprint arXiv:2211.02570, 2022.
  • Rahmani et al. (2024) Rahmani, H. A., Yilmaz, E., Craswell, N., Mitra, B., Thomas, P., Clarke, C. L., Aliannejadi, M., Siro, C., and Faggioli, G. LLMJudge: LLMs for relevance judgments. arXiv preprint arXiv:2408.08896, 2024.
  • Shankar et al. (2024) Shankar, S., Zamfirescu-Pereira, J., Hartmann, B., Parameswaran, A. G., and Arawjo, I. Who validates the validators? Aligning LLM-assisted evaluation of LLM outputs with human preferences. arXiv preprint arXiv:2404.12272, 2024.
  • Szymanski et al. (2024) Szymanski, A., Ziems, N., Eicher-Miller, H. A., Li, T. J.-J., Jiang, M., and Metoyer, R. A. Limitations of the LLM-as-a-judge approach for evaluating LLM outputs in expert knowledge tasks. arXiv preprint arXiv:2410.20266, 2024.
  • Takehi et al. (2024) Takehi, R., Voorhees, E. M., and Sakai, T. LLM-assisted relevance assessments: When should we ask LLMs for help? arXiv preprint arXiv:2411.06877, 2024.
  • Thakur et al. (2024) Thakur, A. S., Choudhary, K., Ramayapally, V. S., Vaidyanathan, S., and Hupkes, D. Judging the judges: Evaluating alignment and vulnerabilities in LLMs-as-judges. arXiv preprint arXiv:2406.12624, 2024.
  • Tjuatja et al. (2024) Tjuatja, L., Chen, V., Wu, T., Talwalkar, A., and Neubig, G. Do LLMs exhibit human-like response biases? A case study in survey design. Transactions of the Association for Computational Linguistics, 12:1011–1026, 2024.
  • Uma et al. (2020) Uma, A., Fornaciari, T., Hovy, D., Paun, S., Plank, B., and Poesio, M. A case for soft loss functions. In Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, volume 8, pp. 173–177, 2020.
  • Zheng et al. (2023) Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36:46595–46623, 2023.

Appendix A Overview of Appendices

This work contains the following appendices:

  • Appendix A: Table 1 provides a summary of aggregation functions adopted in prior work on (1) validating LLM-as-a-judge systems, and (2) evaluation under task indeterminacy.

  • Appendix B provides additional theoretical analysis, including (i) proofs (§ B.1); and (ii) additional theoretical results characterizing the rank consistency of MSE under error-free versus error-corrupted human ratings (§ B.2).

  • Appendix C provides examples of monotonicity violations among pairs of human–judge agreement metrics (see § 5).

  • Appendix D provides further synthetic experiment setup details and empirical results.

  • Appendix E provides further setup details and empirical results for the toxicity detection case study.

Table 1: Summary of human–judge agreement metrics and aggregation functions adopted in prior work on validating LLM-as-a-judge systems and evaluation under task indeterminacy.

| Metric Type | Metric | Judge Aggregation | Human Aggregation | Relevant Work |
|---|---|---|---|---|
| Categorical | Hit Rate (↑) | $a^{j}_{\text{hard}}$ | $a^{h}_{\text{hard}}$ | Lu & Zhong (2024); Jung et al. (2024); Dong et al. (2024); Es et al. (2023); Dubois et al. (2024); Bubeck et al. (2023); Zheng et al. (2023); Faisal et al. (2024); Gu et al. (2024); Thakur et al. (2024); Li et al. (2024b); Chen et al. (2024); Chiang et al. (2024); Dorner et al. (2024) |
| | Krippendorff's α (↑) | $a^{j}_{\text{hard}}$ | $a^{h}_{\text{hard}}$ | Mirzakhmedova et al. (2024); Chaudhary et al. (2024) |
| | Fleiss' κ (↑) | $a^{j}_{\text{hard}}$ | $a^{h}_{\text{hard}}$ | Kim et al. (2024); Dettmers et al. (2024); Bencke et al. (2024) |
| | Cohen's κ (↑) | $a^{j}_{\text{hard}}$ | $a^{h}_{\text{hard}}$ | Rahmani et al. (2024); Bencke et al. (2024) |
| | Scott's π (↑) | $a^{j}_{\text{hard}}$ | $a^{h}_{\text{hard}}$ | Thakur et al. (2024) |
| Distributional | KL Divergence (↓) | $a^{j}_{\text{soft}}$ | $a^{h}_{\text{soft}}$ | Nie et al. (2020); Fornaciari et al. (2021) |
| | Cross-Entropy (↓) | $a^{j}_{\text{soft}}$ | $a^{h}_{\text{soft}}$ | Peterson et al. (2019); Pavlick & Kwiatkowski (2019); Collins et al. (2022) |
| | JS Divergence (↓) | $a^{j}_{\text{soft}}$ | $a^{h}_{\text{soft}}$ | Nie et al. (2020); Fornaciari et al. (2021) |
| Categorical | Multi-label Coverage (↑) | $a^{j}_{\text{hard}}, a^{j}_{\text{hrs}}$ | $a^{h}_{\text{hard}}, a^{h}_{\text{hrs}}$ | Fisch et al. (2020); Takehi et al. (2024) |
| Predictive | Efficiency (↓) | $a^{j}_{\text{hard}}, a^{j}_{\text{hrs}}$ | N/A | Fisch et al. (2020) |
| | Recall (↑) | $a^{j}_{\text{hard}}, a^{j}_{\text{hrs}}$ | $a^{h}_{\text{hard}}, a^{h}_{\text{hrs}}$ | |
| | Precision (↑) | $a^{j}_{\text{hard}}, a^{j}_{\text{hrs}}$ | $a^{h}_{\text{hard}}, a^{h}_{\text{hrs}}$ | |
| Continuous | Multi-label Binary Cross Entropy (↓) | $a^{j}_{*}$ | $a^{h}_{\text{hard}}, a^{h}_{\text{hrs}}$ | |
| | Mean Squared Error (↓) | $a^{j}_{*}$ | $a^{h}_{*}$ | |

Appendix B Additional Theoretical Analysis

B.1 Proofs

B.1.1 Theorem 3.3

Proof.

The forward model $\boldsymbol{O}_{i} = \mathbf{F}_{i}(\mathbf{E}_{i}\boldsymbol{\theta}^{*}_{i})$ follows by the factorization below (we omit $i$ from all subscripts for brevity):

\begin{align*}
\mathbb{P}(o_{k} \mid t) &= \sum_{v, v^{*}, r} \mathbb{P}(o_{k} \mid s_{v}, s_{v^{*}}, r, t) \cdot \mathbb{P}(s_{v}, s_{v^{*}}, r \mid t) \\
&= \sum_{v, v^{*}, r} \mathbb{P}(o_{k} \mid s_{v}, t) \cdot \mathbb{P}(s_{v}, s_{v^{*}}, r \mid t) \tag{4} \\
&= \sum_{v, v^{*}, r} \mathbb{P}(o_{k} \mid s_{v}, t) \cdot \mathbb{P}(s_{v} \mid s_{v^{*}}, r, t) \cdot \mathbb{P}(s_{v^{*}}, r \mid t) \\
&= \sum_{v, v^{*}, r} \mathbb{P}(o_{k} \mid s_{v}, t) \cdot \mathbb{P}(s_{v} \mid s_{v^{*}}, t) \cdot \mathbb{P}(s_{v^{*}}, r \mid t) \tag{5} \\
&= \sum_{v} \mathbb{P}(o_{k} \mid s_{v}, t) \cdot \sum_{v^{*}} \mathbb{P}(s_{v} \mid s_{v^{*}}, t) \cdot \sum_{r} \mathbb{P}(s_{v^{*}}, r \mid t) \\
&= \sum_{v} \mathbf{F}_{k,v} \cdot \sum_{v^{*}} \mathbf{E}_{v,v^{*}} \cdot \boldsymbol{\theta}^{*}_{v^{*}}.
\end{align*}

Above, (4) holds by forced choice independence and (5) holds by error independence. The reverse model $\boldsymbol{\theta}^{*}_{i} = \mathbf{E}_{i}^{\prime}(\mathbf{F}_{i}^{\prime}\mathbf{O}_{i})$ follows by the factorization:

\begin{align*}
\mathbb{P}(s_{v^{*}} \mid t) &= \sum_{r, v, k} \mathbb{P}(s_{v^{*}} \mid r, s_{v}, o_{k}, t) \cdot \mathbb{P}(s_{v} \mid r, o_{k}, t) \cdot \mathbb{P}(o_{k}, r \mid t) \\
&= \sum_{r, v, k} \mathbb{P}(s_{v^{*}} \mid r, s_{v}, t) \cdot \mathbb{P}(s_{v} \mid r, o_{k}, t) \cdot \mathbb{P}(o_{k}, r \mid t) \tag{6} \\
&= \sum_{r, v, k} \left( \frac{\mathbb{P}(s_{v} \mid r, s_{v^{*}}, t) \cdot \mathbb{P}(r, s_{v^{*}} \mid t)}{\mathbb{P}(r, s_{v} \mid t)} \right) \cdot \left( \frac{\mathbb{P}(o_{k} \mid r, s_{v}, t) \cdot \mathbb{P}(r, s_{v} \mid t)}{\mathbb{P}(r, o_{k} \mid t)} \right) \cdot \mathbb{P}(o_{k}, r \mid t) \\
&= \sum_{r, v, k} \mathbb{P}(s_{v} \mid s_{v^{*}}, t) \cdot \mathbb{P}(r, s_{v^{*}} \mid t) \cdot \mathbb{P}(o_{k} \mid s_{v}, t) \tag{7} \\
&= \sum_{v, k} \left( \frac{\mathbb{P}(s_{v^{*}} \mid s_{v}, t) \cdot \mathbb{P}(s_{v} \mid t)}{\mathbb{P}(s_{v^{*}} \mid t)} \right) \cdot \left( \frac{\mathbb{P}(s_{v} \mid o_{k}, t) \cdot \mathbb{P}(o_{k} \mid t)}{\mathbb{P}(s_{v} \mid t)} \right) \cdot \sum_{r} \mathbb{P}(r, s_{v^{*}} \mid t) \\
&= \sum_{v, k} \mathbb{P}(s_{v^{*}} \mid s_{v}, t) \cdot \mathbb{P}(s_{v} \mid o_{k}, t) \cdot \mathbb{P}(o_{k} \mid t) \\
&= \sum_{v} \mathbf{E}^{\prime}_{v^{*},v} \cdot \sum_{k} \mathbf{F}^{\prime}_{v,k} \cdot \mathbf{O}_{k},
\end{align*}

where (6) holds by forced choice independence and (7) holds by forced choice and error independence. ∎

B.1.2 Theorem 3.4

Proof.

We remove dependence on $i$ from all terms for brevity. To begin, note that $\boldsymbol{\theta}^{*}$ is identifiable from $\mathbf{O}$ if and only if the system $\boldsymbol{\theta}^{*} = \mathbf{E}^{\prime}(\mathbf{F}^{\prime}\mathbf{O})$ is fully determined; taking $\mathbf{E}$ as the identity matrix, this system simplifies to $\mathbf{O} = \mathbf{F}\boldsymbol{\theta}^{*}$. The system $\mathbf{F}\boldsymbol{\theta}^{*}$ must be consistent because Theorem 3.3 establishes a solution. A consistent system with $n_{o} = |\mathcal{O}|$ equations and $n_{s} = |\mathcal{Q}|$ unknowns is fully determined if and only if rank($\mathbf{F}$) $= n_{s}$.

We will first show that $\mathcal{Q}=\{\{o_{k}\}:o_{k}\in\mathcal{O}\}$ implies that $\mathrm{rank}(\mathbf{F})=n_{s}$. To begin, note that (1) $\sum_{k}\mathbf{F}_{k,v}=1,\;\forall v\in\{1,\dots,n_{s}\}$, because each column of $\mathbf{F}$ represents a valid probability distribution; and (2) $\mathbf{F}_{a,v}=0,\;\forall a\neq k,\;\forall v\in\{1,\dots,n_{s}\}$, because $\mathbf{\Lambda}_{k,v}=0\implies\mathbf{F}_{k,v}=0$. This implies that

\[
1=\sum_{a}\mathbf{F}_{a,v}=\sum_{a\neq k}\mathbf{F}_{a,v}+\mathbf{F}_{k,v}=\mathbf{F}_{k,v},\qquad\forall v\in\{1,\dots,n_{s}\}.
\]

Thus, each singleton set $\{o_{k}\}\in\mathcal{Q}$ maps to a standard basis vector $\mathbf{e}_{k}\in\mathbb{R}^{n_{o}}$. Further, because $n_{s}=n_{o}$ by definition of $\mathcal{Q}$ and $\mathcal{O}\subseteq\mathcal{Q}$, each option must appear in exactly one set, giving exactly $n_{s}=n_{o}$ distinct basis vectors. The rank of a matrix equals the number of its linearly independent column vectors. Because the $k$ standard basis vectors are linearly independent, it follows that $\mathrm{rank}(\mathbf{F})=n_{o}=n_{s}$.

We now show the reverse implication, $\mathrm{rank}(\mathbf{F})=n_{s}\implies\mathcal{Q}=\{\{o_{k}\}:o_{k}\in\mathcal{O}\}$, by contradiction. Suppose there exists a set $\mathcal{S}_{v}\in\mathcal{Q}$ containing more than one option, i.e., $|\mathcal{S}_{v}|>1$. Let $\mathbf{v}$ denote the column of $\mathbf{F}$ corresponding to $\mathcal{S}_{v}$. Since $\mathcal{O}\subseteq\mathcal{Q}$, for each option $o_{k}\in\mathcal{S}_{v}$ there exists a column of $\mathbf{F}$ equal to the standard basis vector $\mathbf{e}_{k}$, as shown above. Therefore, $\mathbf{v}$ can be written as a linear combination of these basis vectors, $\mathbf{v}=\sum_{k}\alpha_{k}\mathbf{e}_{k}$ with $\alpha_{k}\in[0,1]$. This shows that $\mathbf{v}$ is linearly dependent on the columns corresponding to the singleton sets $\{o_{k}\}$ for $o_{k}\in\mathcal{S}_{v}$, so $\mathbf{F}$ cannot have $n_{s}$ linearly independent columns, contradicting $\mathrm{rank}(\mathbf{F})=n_{s}$. ∎
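To illustrate the rank condition, the following minimal sketch (with hypothetical matrices, not taken from our experiments) checks identifiability by computing $\mathrm{rank}(\mathbf{F})$ for a singleton-only $\mathcal{Q}$ and for a $\mathcal{Q}$ that also contains a two-option response set:

```python
import numpy as np

# Columns of F hold P(forced choice o_a | stable response set S_v), one column
# per response set in Q. The values below are hypothetical.

# Q = {{o1}, {o2}, {o3}}: every column is a standard basis vector.
F_singletons = np.eye(3)

# Q = {{o1}, {o2}, {o3}, {o1,o2}}: the extra column is a convex combination of
# e1 and e2, so the columns are linearly dependent.
F_with_pair = np.column_stack([np.eye(3), np.array([0.5, 0.5, 0.0])])

print(np.linalg.matrix_rank(F_singletons) == F_singletons.shape[1])  # True  -> identifiable
print(np.linalg.matrix_rank(F_with_pair) == F_with_pair.shape[1])    # False -> not identifiable
```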

B.1.3 Theorem 5.3

Proof.

Proof by contradiction. Let $\mathbf{p}$ and $\mathbf{p}'$ denote a pair of performance definitions with increasing cardinality (i.e., higher values are better). Let $Y^{z}=a^{j}(\mathbb{P}_{T}^{j,z})$, $Y^{w}=a^{j}(\mathbb{P}_{T}^{j,w})$, and $Y^{h}=a^{h}(\mathbb{P}_{T}^{h})$ denote random functions of $T$ corresponding to definition $\mathbf{p}$. Let $Y^{z}_{*}=a_{j}'(\mathbb{P}_{T}^{j,z})$, $Y^{w}_{*}=a_{j}'(\mathbb{P}_{T}^{j,w})$, and $Y^{h}_{*}=a_{h}'(\mathbb{P}_{T}^{h})$ correspond to definition $\mathbf{p}'$.
Since $\mathbf{p}$ is not a monotone transformation of $\mathbf{p}'$, by definition there must exist distributions $\{(\mathbb{P}_{i}^{j,z},\mathbb{P}_{i}^{h}),(\mathbb{P}_{i}^{j,w},\mathbb{P}_{i}^{h})\}\in\Delta\times\Delta$ corresponding to realizations of these random variables that satisfy

\[
m(Y^{z},Y^{h})<m(Y^{w},Y^{h}),\qquad m'(Y^{z}_{*},Y^{h}_{*})>m'(Y^{w}_{*},Y^{h}_{*}).
\]

Now suppose that $\mathbb{P}^{*}$ places all marginal probability mass over $T$ on the $i$'th item, i.e., $\mathbb{P}^{*}(T=t_{i})=1$. Then:

\begin{align*}
\delta_{\mathbf{p}}(z,w)&=\mathbb{E}_{\mathbb{P}^{*}}[m(Y^{z},Y^{h})-m(Y^{w},Y^{h})]<0\\
\delta_{\mathbf{p}'}(z,w)&=\mathbb{E}_{\mathbb{P}^{*}}[m'(Y^{z}_{*},Y^{h}_{*})-m'(Y^{w}_{*},Y^{h}_{*})]>0
\end{align*}

Thus, rank consistency is violated because there exists a distribution $\mathbb{P}^{*}$ for which $\mathcal{G}_{judge}^{z}\succ_{\mathbf{p}'}\mathcal{G}_{judge}^{w}$ but $\mathcal{G}_{judge}^{w}\succ_{\mathbf{p}}\mathcal{G}_{judge}^{z}$. This yields a contradiction, proving the result. ∎

B.2 Rank Consistency Under Rater Error

Lemma B.1 (Rank Consistency of MSE (srs/srs) Under Rater Error).

Let $\boldsymbol{\theta}^{*}$ and $\boldsymbol{\theta}=\mathbf{E}\boldsymbol{\theta}^{*}$ denote the stable and observed response set distributions for human raters.\footnote{We omit subscript $i$ from all terms for brevity. We also omit superscript $h$ from human response set distribution, error, and multi-label vectors where the context is clear.} Let $\boldsymbol{\theta}^{j,z}$ and $\boldsymbol{\theta}^{j,w}$ denote the observed response set distributions for judge systems $\mathcal{G}^{z}_{judge}$ and $\mathcal{G}^{w}_{judge}$, where the rater error matrices $\mathbf{E}^{j,z}$ and $\mathbf{E}^{j,w}$ of both judge systems are the identity. Let $\mathbf{\Lambda}$ be the binary matrix mapping response sets to options, and define $\delta^{*}=\text{MSE}(\boldsymbol{\Omega}^{*},\boldsymbol{\Omega}^{j,z})-\text{MSE}(\boldsymbol{\Omega}^{*},\boldsymbol{\Omega}^{j,w})$ as the difference in MSE under error-free conditions. The ranking of judge systems using MSE with soft response set aggregation is preserved under human rating error if and only if:

\[
\text{sign}\big((\boldsymbol{\theta}^{*}-\boldsymbol{\theta})^{T}\mathbf{\Lambda}^{T}\mathbf{\Lambda}(\boldsymbol{\theta}^{j,w}-\boldsymbol{\theta}^{j,z})\big)=\text{sign}(\delta^{*}). \tag{8}
\]

This lemma provides conditions under which measuring the performance of a judge system against error-corrupted versus error-free human ratings yields a consistent ranking of judge systems (when measured via MSE (srs/srs)). The condition essentially requires that the direction of the error-induced shift in human ratings ($\boldsymbol{\theta}^{*}-\boldsymbol{\theta}$) matches the direction of the stable response set shift across judge systems ($\boldsymbol{\theta}^{j,w}-\boldsymbol{\theta}^{j,z}$) when projected onto the multi-label space. If human rating error and judge system performance differences shift the response set distribution in the same direction, the ranking of judge systems will be consistent across error-free and error-corrupted ratings. Conversely, rankings can invert under an inverse relationship.

Eq. (8) is satisfied in our experimental setup (§6) because the ensemble of judge system rating distributions is generated by adding random perturbations (i.e., uncorrelated with rater error) to the human stable response set vector. Thus we see little change in the reliability of MSE (srs/srs) between settings with no rater error (Figure 6, center) and with rater error (Figure 6, right).
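For concreteness, the condition in Eq. (8) can be checked numerically for any candidate configuration. The sketch below uses hypothetical values for $\boldsymbol{\theta}^{*}$, $\mathbf{E}$, $\mathbf{\Lambda}$, and the judge distributions; it is not our experimental configuration:

```python
import numpy as np

def mse(a, b):
    # Squared-L2 convention used throughout Appendix B.
    return float(np.sum((a - b) ** 2))

def rank_consistent(theta_star, E, theta_jz, theta_jw, Lam):
    """Evaluate the sign condition of Eq. (8) for MSE (srs/srs)."""
    M = Lam.T @ Lam
    delta_star = (mse(Lam @ theta_star, Lam @ theta_jz)
                  - mse(Lam @ theta_star, Lam @ theta_jw))
    shift_h = theta_star - E @ theta_star   # error-induced shift in human ratings
    shift_j = theta_jw - theta_jz           # shift across judge systems
    return np.sign(shift_h @ M @ shift_j) == np.sign(delta_star)

# Hypothetical configuration: two options, Q = {{o1}, {o2}, {o1,o2}}.
Lam = np.array([[1.0, 0.0, 1.0],
                [0.0, 1.0, 1.0]])
E = np.array([[0.9, 0.1, 0.0],              # column-stochastic rater error matrix
              [0.1, 0.8, 0.1],
              [0.0, 0.1, 0.9]])
theta_star = np.array([0.4, 0.5, 0.1])
theta_jz   = np.array([0.0, 0.6, 0.4])
theta_jw   = np.array([0.4, 0.5, 0.1])
print(rank_consistent(theta_star, E, theta_jz, theta_jw, Lam))  # whether the ranking is preserved
```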

Proof of Lemma B.1.

For brevity, let $\mathbf{M}=\mathbf{\Lambda}^{T}\mathbf{\Lambda}$. The difference in judge system MSE measured against the multi-label human rating vector $\mathbf{\Omega}^{*}=\mathbf{\Lambda}\boldsymbol{\theta}^{*}$ derived from the stable response set distribution is given by:

\begin{align*}
\delta^{*}&=\text{MSE}(\mathbf{\Omega}^{*},\mathbf{\Omega}^{j,z})-\text{MSE}(\mathbf{\Omega}^{*},\mathbf{\Omega}^{j,w})\\
&=\|\mathbf{\Lambda}\boldsymbol{\theta}^{*}-\mathbf{\Lambda}\boldsymbol{\theta}^{j,z}\|_{2}^{2}-\|\mathbf{\Lambda}\boldsymbol{\theta}^{*}-\mathbf{\Lambda}\boldsymbol{\theta}^{j,w}\|_{2}^{2}\\
&=(\boldsymbol{\theta}^{*}-\boldsymbol{\theta}^{j,z})^{T}\mathbf{M}(\boldsymbol{\theta}^{*}-\boldsymbol{\theta}^{j,z})-(\boldsymbol{\theta}^{*}-\boldsymbol{\theta}^{j,w})^{T}\mathbf{M}(\boldsymbol{\theta}^{*}-\boldsymbol{\theta}^{j,w})\\
&=(\boldsymbol{\theta}^{*})^{T}\mathbf{M}\boldsymbol{\theta}^{*}-(\boldsymbol{\theta}^{*})^{T}\mathbf{M}\boldsymbol{\theta}^{j,z}-(\boldsymbol{\theta}^{j,z})^{T}\mathbf{M}\boldsymbol{\theta}^{*}+(\boldsymbol{\theta}^{j,z})^{T}\mathbf{M}\boldsymbol{\theta}^{j,z}\\
&\qquad-(\boldsymbol{\theta}^{*})^{T}\mathbf{M}\boldsymbol{\theta}^{*}+(\boldsymbol{\theta}^{*})^{T}\mathbf{M}\boldsymbol{\theta}^{j,w}+(\boldsymbol{\theta}^{j,w})^{T}\mathbf{M}\boldsymbol{\theta}^{*}-(\boldsymbol{\theta}^{j,w})^{T}\mathbf{M}\boldsymbol{\theta}^{j,w}\\
&=(\boldsymbol{\theta}^{j,z})^{T}\mathbf{M}\boldsymbol{\theta}^{j,z}-(\boldsymbol{\theta}^{j,w})^{T}\mathbf{M}\boldsymbol{\theta}^{j,w}+2(\boldsymbol{\theta}^{*})^{T}\mathbf{M}(\boldsymbol{\theta}^{j,w}-\boldsymbol{\theta}^{j,z})
\end{align*}

Let $\mathbf{\Omega}=\mathbf{\Lambda}\boldsymbol{\theta}=\mathbf{\Lambda}(\mathbf{E}\boldsymbol{\theta}^{*})$ denote the multi-label vector recovered from the observed response set distribution. Applying the same derivation as above to the error-corrupted MSE metric yields:

\begin{align*}
\delta&=\text{MSE}(\mathbf{\Omega},\mathbf{\Omega}^{j,z})-\text{MSE}(\mathbf{\Omega},\mathbf{\Omega}^{j,w})\\
&=(\boldsymbol{\theta}^{j,z})^{T}\mathbf{M}\boldsymbol{\theta}^{j,z}-(\boldsymbol{\theta}^{j,w})^{T}\mathbf{M}\boldsymbol{\theta}^{j,w}+2(\mathbf{E}\boldsymbol{\theta}^{*})^{T}\mathbf{M}(\boldsymbol{\theta}^{j,w}-\boldsymbol{\theta}^{j,z})
\end{align*}

Observe that the first two terms appear in both expansions. Thus, to establish the conditions required for rank consistency, i.e., $\text{sign}(\delta)=\text{sign}(\delta^{*})$, it suffices to analyze the third term.

  • Case 1: $\delta^{*}<0$ ($\mathcal{G}^{z}_{judge}$ is better than $\mathcal{G}^{w}_{judge}$ under no rater error). For both inequalities to hold, we need:

\begin{align*}
2(\mathbf{E}\boldsymbol{\theta}^{*})^{T}\mathbf{M}(\boldsymbol{\theta}^{j,w}-\boldsymbol{\theta}^{j,z})&\leq 2(\boldsymbol{\theta}^{*})^{T}\mathbf{M}(\boldsymbol{\theta}^{j,w}-\boldsymbol{\theta}^{j,z})\\
(\mathbf{E}\boldsymbol{\theta}^{*})^{T}\mathbf{M}(\boldsymbol{\theta}^{j,w}-\boldsymbol{\theta}^{j,z})&\leq(\boldsymbol{\theta}^{*})^{T}\mathbf{M}(\boldsymbol{\theta}^{j,w}-\boldsymbol{\theta}^{j,z})\\
(\boldsymbol{\theta}^{*})^{T}\mathbf{M}(\boldsymbol{\theta}^{j,w}-\boldsymbol{\theta}^{j,z})-(\mathbf{E}\boldsymbol{\theta}^{*})^{T}\mathbf{M}(\boldsymbol{\theta}^{j,w}-\boldsymbol{\theta}^{j,z})&\geq 0\\
\big((\boldsymbol{\theta}^{*})^{T}-(\mathbf{E}\boldsymbol{\theta}^{*})^{T}\big)\mathbf{M}(\boldsymbol{\theta}^{j,w}-\boldsymbol{\theta}^{j,z})&\geq 0\\
(\boldsymbol{\theta}^{*}-\mathbf{E}\boldsymbol{\theta}^{*})^{T}\mathbf{M}(\boldsymbol{\theta}^{j,w}-\boldsymbol{\theta}^{j,z})&\geq 0
\end{align*}
  • Case 2: $\delta^{*}>0$ ($\mathcal{G}^{w}_{judge}$ is better than $\mathcal{G}^{z}_{judge}$ under no rater error). For rank consistency, we need $\delta>0$ as well. Following similar steps, we get:

\begin{align*}
2(\mathbf{E}\boldsymbol{\theta}^{*})^{T}\mathbf{M}(\boldsymbol{\theta}^{j,w}-\boldsymbol{\theta}^{j,z})&\geq 2(\boldsymbol{\theta}^{*})^{T}\mathbf{M}(\boldsymbol{\theta}^{j,w}-\boldsymbol{\theta}^{j,z})\\
-2(\mathbf{E}\boldsymbol{\theta}^{*})^{T}\mathbf{M}(\boldsymbol{\theta}^{j,w}-\boldsymbol{\theta}^{j,z})&\leq-2(\boldsymbol{\theta}^{*})^{T}\mathbf{M}(\boldsymbol{\theta}^{j,w}-\boldsymbol{\theta}^{j,z})\\
2(\mathbf{E}\boldsymbol{\theta}^{*})^{T}\mathbf{M}(\boldsymbol{\theta}^{j,z}-\boldsymbol{\theta}^{j,w})&\leq 2(\boldsymbol{\theta}^{*})^{T}\mathbf{M}(\boldsymbol{\theta}^{j,z}-\boldsymbol{\theta}^{j,w})\\
(\boldsymbol{\theta}^{*}-\mathbf{E}\boldsymbol{\theta}^{*})^{T}\mathbf{M}(\boldsymbol{\theta}^{j,z}-\boldsymbol{\theta}^{j,w})&\geq 0
\end{align*}

Combining the two cases, the ranking is preserved exactly when $\text{sign}\big((\boldsymbol{\theta}^{*}-\mathbf{E}\boldsymbol{\theta}^{*})^{T}\mathbf{M}(\boldsymbol{\theta}^{j,w}-\boldsymbol{\theta}^{j,z})\big)=\text{sign}(\delta^{*})$, which is the condition in Eq. (8) since $\boldsymbol{\theta}=\mathbf{E}\boldsymbol{\theta}^{*}$. ∎

Appendix C Examples of Monotonicity Violations Among Pairs of Performance Metrics

Example 1: Hit Rate (Forced Choice) and KL-Divergence (Forced Choice).

Let $\mathbf{p}=(a^{h}_{\text{hard}},a^{j}_{\text{hard}},\text{Hit Rate})$ and let $\mathbf{p}'=(a^{h}_{\text{soft}},a^{j}_{\text{soft}},\text{KL-Divergence})$. Consider the forced choice distributions recovered from human ratings and from judge systems $z$ and $w$, respectively: $\mathbf{O}^{h}=a^{h}_{\text{soft}}(\mathbb{P}^{h})$, $\mathbf{O}^{j,z}=a^{j}_{\text{soft}}(\mathbb{P}^{j,z})$, $\mathbf{O}^{j,w}=a^{j}_{\text{soft}}(\mathbb{P}^{j,w})$, where we omit $i$ from all terms for brevity.

Suppose these distributions are defined over three options $\mathcal{O}=\{o_{1},o_{2},o_{3}\}$ and, for an item $i$, take the values:

\[
\text{Human: }\mathbf{O}^{h}=(0.6,0.3,0.1),\qquad\text{Judge }z\text{: }\mathbf{O}^{j,z}=(0.8,0.1,0.1),\qquad\text{Judge }w\text{: }\mathbf{O}^{j,w}=(0.5,0.4,0.1)
\]

Under $\mathbf{p}$, we have:

\[
\text{HR}(\mathbf{O}^{j,z},\mathbf{O}^{h})=1.0>\text{HR}(\mathbf{O}^{j,w},\mathbf{O}^{h})=0.0\implies\mathcal{G}_{judge}^{z}\succ_{\mathbf{p}}\mathcal{G}_{judge}^{w}.
\]

But under $\mathbf{p}'$, we have:

\[
\text{KL}(\mathbf{O}^{h}\,\|\,\mathbf{O}^{j,z})\approx 0.15>\text{KL}(\mathbf{O}^{h}\,\|\,\mathbf{O}^{j,w})\approx 0.02\implies\mathcal{G}_{judge}^{w}\succ_{\mathbf{p}'}\mathcal{G}_{judge}^{z}.
\]

Thus, we have identified a pair of conditional rating distributions $\mathbb{P}^{j,z}$, $\mathbb{P}^{j,w}$ and a corresponding human rating distribution $\mathbb{P}^{h}$ for which $\mathbf{p}'$ is not a monotone transformation of $\mathbf{p}$, so rank consistency between $\mathbf{p}$ and $\mathbf{p}'$ cannot hold.
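The divergence values above can be reproduced numerically; the following is a minimal sketch, assuming KL divergence with the natural logarithm (which matches the reported magnitudes):

```python
import numpy as np

def kl(p, q):
    # KL(p || q) with natural logarithm (assumed convention).
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

O_h  = [0.6, 0.3, 0.1]   # human forced-choice distribution
O_jz = [0.8, 0.1, 0.1]   # judge z
O_jw = [0.5, 0.4, 0.1]   # judge w

print(round(kl(O_h, O_jz), 3))  # 0.157 -> judge z looks worse under KL
print(round(kl(O_h, O_jw), 3))  # 0.023 -> judge w looks better under KL
```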

Example 2: KL-Divergence (Forced Choice) and MSE (Multi-Label). Let $\mathbf{p}=(a^{h}_{\text{soft}},a^{j}_{\text{soft}},\text{KL-divergence})$ and let $\mathbf{p}'=(a^{h}_{\text{srs}},a^{j}_{\text{srs}},\text{MSE})$. Let $\mathcal{O}=\{o_{1},o_{2}\}$ and $\mathcal{Q}=\{\{o_{1}\},\{o_{2}\},\{o_{1},o_{2}\}\}$. Suppose there is no rater error (i.e., $\mathbf{E}^{h}$ and $\mathbf{E}^{j}$ are both the identity). Let $\mathbb{P}^{h}_{i}$ satisfy the decomposition:

\[
\mathbf{O}^{h}=(.4,.6)^{\top},\quad\mathbf{F}^{h}=\begin{bmatrix}1&0&0\\0&1&1\end{bmatrix},\quad\boldsymbol{\theta}^{*,h}=(0.4,0.5,0.1)^{\top},\quad\mathbf{\Omega}^{h}=(.5,.6)^{\top}.
\]

Let $\mathbb{P}^{j,z}_{i}$ denote the conditional rating distribution of the judge system satisfying:

\[
\mathbf{O}^{j,z}=(.4,.6)^{\top},\quad\mathbf{F}^{j,z}=\begin{bmatrix}1&0&1\\0&1&0\end{bmatrix},\quad\boldsymbol{\theta}^{*,j,z}=(0.0,0.6,0.4)^{\top},\quad\mathbf{\Omega}^{j,z}=(.4,.6)^{\top}.
\]

Let $\mathbb{P}^{j,w}_{i}$ denote the conditional rating distribution of the judge system satisfying:

\[
\mathbf{O}^{j,w}=(.5,.5)^{\top},\quad\mathbf{F}^{j,w}=\begin{bmatrix}1&0&1\\0&1&0\end{bmatrix},\quad\boldsymbol{\theta}^{*,j,w}=(0.4,0.5,0.1)^{\top},\quad\mathbf{\Omega}^{j,w}=(.5,.6)^{\top}.
\]

Under $\mathbf{p}$, we have:

\[
\text{KL}(\mathbf{O}^{h}\,\|\,\mathbf{O}^{j,z})=0<\text{KL}(\mathbf{O}^{h}\,\|\,\mathbf{O}^{j,w})\approx 0.02\implies\mathcal{G}_{judge}^{z}\succ_{\mathbf{p}}\mathcal{G}_{judge}^{w}.
\]

But under $\mathbf{p}'$:

\[
\text{MSE}(\mathbf{\Omega}^{j,z},\mathbf{\Omega}^{h})=0.01>\text{MSE}(\mathbf{\Omega}^{j,w},\mathbf{\Omega}^{h})=0.00\implies\mathcal{G}_{judge}^{w}\succ_{\mathbf{p}'}\mathcal{G}_{judge}^{z},
\]

yielding a violation of monotonicity. Thus, we have identified a pair of conditional rating distributions $\mathbb{P}^{j,z}_{i}$, $\mathbb{P}^{j,w}_{i}$ and a corresponding human rating distribution $\mathbb{P}^{h}_{i}$ for which $\mathbf{p}'$ is not a monotone transformation of $\mathbf{p}$, so rank consistency between $\mathbf{p}$ and $\mathbf{p}'$ cannot hold.
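As a sanity check, the following sketch reproduces both comparisons from the stated vectors, assuming natural-log KL and the squared-$\ell_2$ MSE convention used in the proof of Lemma B.1:

```python
import numpy as np

kl  = lambda p, q: float(np.sum(p * np.log(p / q)))   # natural-log KL divergence
mse = lambda a, b: float(np.sum((a - b) ** 2))        # squared-L2 convention

O_h,  O_jz,  O_jw  = np.array([.4, .6]), np.array([.4, .6]), np.array([.5, .5])
Om_h, Om_jz, Om_jw = np.array([.5, .6]), np.array([.4, .6]), np.array([.5, .6])

print(kl(O_h, O_jz), round(kl(O_h, O_jw), 3))             # 0.0 vs ~0.02 -> z preferred under p
print(round(mse(Om_jz, Om_h), 3), round(mse(Om_jw, Om_h), 3))  # 0.01 vs 0.0 -> w preferred under p'
```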

Appendix D Additional Synthetic Experiment Setup Details and Results


D.1 Setup Details

We run all experiments with 50 judge systems. We use 100 items in all experiments and select option and response set configurations satisfying $2\leq|\mathcal{O}|\leq 10$, $2\leq|\mathcal{Q}|\leq 30$, and $|\mathcal{O}|\leq|\mathcal{Q}|$. We let $\sigma_{\text{min}}=0.02$ and $\sigma_{\text{max}}=0.4$ when sampling the ensemble of judge systems. While performing synthetic experiments, we estimate $\hat{\mathbf{O}}_{i}$, where $\hat{\mathbf{O}}_{i,k}=\hat{\mathbb{P}}(O=o_{k})$, for both human and judge rating distributions via the maximum likelihood estimator. We then compute $\hat{\boldsymbol{\theta}}^{*}_{i}=\mathbf{E}_{i}'(\mathbf{F}_{i}'\hat{\mathbf{O}}_{i})$, which also allows us to compute $\hat{\mathbf{\Omega}}_{i}=\mathbf{\Lambda}\hat{\boldsymbol{\theta}}^{*}_{i}$. We apply aggregation functions to the estimated forced choice distribution and estimated multi-label vector to obtain estimated performance metrics.
For example, while applying soft aggregation, we obtain $\hat{Y}^{h}=a^{h}_{srs}(\hat{\mathbf{\Omega}}^{h}_{i})$ and $\hat{Y}^{j}=a^{j}_{srs}(\hat{\mathbf{\Omega}}^{j}_{i})$ after estimating $(\hat{\mathbf{\Omega}}^{h}_{i},\hat{\mathbf{\Omega}}^{j}_{i})$ via the procedure outlined above. The expected performance is then given by $\hat{M}_{mse}(\hat{Y}^{h},\hat{Y}^{j})=\mathbb{E}_{(T,\hat{Y}^{h},\hat{Y}^{j})}[\|\hat{Y}^{h}-\hat{Y}^{j}\|_{2}^{2}]$. We use the same finite sample estimation approach to recover other metrics. This estimator is consistent so long as $\hat{\mathbf{\Omega}}\xrightarrow{p}\mathbf{\Omega}$ as $n\rightarrow\infty$, where $\xrightarrow{p}$ denotes convergence in probability; this follows by a standard maximum likelihood convergence argument.
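A compact sketch of this plug-in estimation pipeline for a single item (all matrices below are hypothetical placeholders, not the sampled configurations used in our experiments):

```python
import numpy as np

def estimate_item(ratings, n_options, E_inv, F_inv, Lam):
    """Plug-in estimates for one item, following the procedure above."""
    # MLE of the forced-choice distribution: empirical rating frequencies.
    counts = np.bincount(ratings, minlength=n_options)
    O_hat = counts / counts.sum()
    # theta_hat = E'(F' O_hat): undo forced-choice selection, then rater error.
    theta_hat = E_inv @ (F_inv @ O_hat)
    # Omega_hat = Lam theta_hat: project response sets to the multi-label space.
    Omega_hat = Lam @ theta_hat
    return O_hat, theta_hat, Omega_hat

# Hypothetical configuration: two options, Q = {{o1}, {o2}, {o1,o2}}.
Lam   = np.array([[1.0, 0.0, 1.0],
                  [0.0, 1.0, 1.0]])
F_inv = np.array([[1.0, 0.0],      # reverse forced-choice transition matrix:
                  [0.0, 1.0],      # each forced choice maps back to its singleton set
                  [0.0, 0.0]])
E_inv = np.eye(3)                  # no rater error
ratings = np.array([0, 0, 1, 0, 1])   # five forced-choice ratings for one item
print(estimate_item(ratings, 2, E_inv, F_inv, Lam))
```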

[Figures 11–15]

Appendix E Additional Case Study Setup Details and Results

We sample $N = 200$ comments from the Civil Comments dataset and suppose that they are outputs from $\mathcal{G}_{\text{target}}$. We stratify sampled comments by observed disagreement in forced choice responses. Specifically, we select an even sample of comments with toxicity annotations at 10%, 20%, 30%, 40%, and 50%. We use the matrix $\mathbf{F}_i$ to construct reverse forced choice transition matrices from observed forced choice ratings. The parametrization shown in Table 2 yields a conservative analysis of forced choice selection effects: mapping forced choices to a response set containing both Very Toxic and Toxic would flip thresholded decisions from $s_{.5}(\hat{\mathbf{y}}^h)$ at a smaller magnitude of $\beta$. Figure 16 shows the prompts used to collect ratings from judge systems.
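Before turning to Table 2, here is a minimal sketch of the disagreement-stratified sampling step, assuming a pandas DataFrame with a hypothetical `toxicity` column holding the fraction of raters who labeled each comment toxic; column names and sample sizes are illustrative.

```python
import pandas as pd

def stratified_sample(comments: pd.DataFrame, n_total: int = 200,
                      levels=(0.1, 0.2, 0.3, 0.4, 0.5)) -> pd.DataFrame:
    """Draw an even sample of comments at each observed-disagreement level.
    Assumes a 'toxicity' column giving the fraction of annotators who
    rated the comment toxic."""
    per_level = n_total // len(levels)
    samples = [
        comments[comments["toxicity"].round(1) == lvl]
        .sample(per_level, random_state=0)
        for lvl in levels
    ]
    return pd.concat(samples, ignore_index=True)
```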

Table 2: Reverse forced choice transition matrix. Rows are forced choice options; columns are response sets (VT = Very Toxic, T = Toxic, N/U = No/Unsure).

| Option | VT | T | N/U | VT, T | T, N/U | VT, N/U | VT, T, N/U |
|---|---|---|---|---|---|---|---|
| Very Toxic | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| Toxic | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| No/Unsure | 0 | $\beta$ | $1-\beta$ | 0 | 0 | 0 | 0 |
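To illustrate the sensitivity analysis that Table 2 supports, the sketch below builds the $\beta$-parametrized reverse transition matrix and shows how a thresholded decision $s_{.5}(\hat{\mathbf{y}}^h)$ can flip as $\beta$ grows. The response set ordering and the scoring of response sets as toxic are assumptions of the sketch.

```python
import numpy as np

# Response set order: [VT, T, N/U, {VT,T}, {T,N/U}, {VT,N/U}, {VT,T,N/U}]
def reverse_transition_matrix(beta: float) -> np.ndarray:
    """Reverse forced choice transition matrix from Table 2. Rows are the
    forced choice options (Very Toxic, Toxic, No/Unsure)."""
    return np.array([
        [1.0, 0.0,  0.0,        0.0, 0.0, 0.0, 0.0],  # Very Toxic
        [0.0, 1.0,  0.0,        0.0, 0.0, 0.0, 0.0],  # Toxic
        [0.0, beta, 1.0 - beta, 0.0, 0.0, 0.0, 0.0],  # No/Unsure
    ])

def soft_toxicity(forced_choice_dist: np.ndarray, beta: float) -> float:
    """Map a forced choice distribution through the reverse matrix and
    score response sets containing Toxic or Very Toxic as 1 (an assumed
    scoring rule for illustration)."""
    q = forced_choice_dist @ reverse_transition_matrix(beta)
    toxic_sets = np.array([1, 1, 0, 1, 1, 1, 1])
    return float(q @ toxic_sets)

# A comment rated Toxic by 40% of raters flips the thresholded decision
# once beta moves enough No/Unsure mass onto Toxic: 0.4 + 0.6*beta >= 0.5.
dist = np.array([0.0, 0.4, 0.6])  # (Very Toxic, Toxic, No/Unsure)
for beta in (0.0, 0.2, 0.4):
    print(beta, soft_toxicity(dist, beta) >= 0.5)
```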

Consistency of the judge system selected by each metric at each value of the sensitivity parameter $\beta$. Parenthesized values give the difference from the best attainable consistency (bottom row). Where multiple judge systems are listed, the selected system varies with $\beta$.

| Metric | Selected judge(s) | $\beta=0.0$ | $\beta=0.1$ | $\beta=0.2$ | $\beta=0.3$ | $\beta=0.4$ |
|---|---|---|---|---|---|---|
| Hit Rate (h/h) | Sonnet 3.5 | 0.61 (0) | 0.66 (0) | 0.68 (0) | 0.55 (-0.29) | 0.42 (-0.58) |
| Fleiss’s $\kappa$ (h/h) | Sonnet 3.5 | 0.61 (0) | 0.66 (0) | 0.68 (0) | 0.55 (-0.29) | 0.42 (-0.58) |
| Cohen’s $\kappa$ (h/h) | Mistral Large | 0.27 (-0.34) | 0.38 (-0.28) | 0.58 (-0.10) | 0.75 (-0.09) | 0.77 (-0.23) |
| MSE (h/h) | Sonnet 3.5 | 0.61 (0) | 0.66 (0) | 0.68 (0) | 0.55 (-0.29) | 0.42 (-0.58) |
| KLD(h,j) (s/s) | Mistral Small | 0.06 (-0.55) | 0.26 (-0.40) | 0.52 (-0.16) | 0.84 (0) | 1.00 (0) |
| KLD(j,h) (s/s) | Sonnet 3.5 | 0.61 (0) | 0.66 (0) | 0.68 (0) | 0.55 (-0.29) | 0.42 (-0.58) |
| CE(h,j) (s/s) | Mistral Small | 0.06 (-0.55) | 0.26 (-0.40) | 0.52 (-0.16) | 0.84 (0) | 1.00 (0) |
| CE(j,h) (s/s) | Sonnet 3.5 | 0.61 (0) | 0.66 (0) | 0.68 (0) | 0.55 (-0.29) | 0.42 (-0.58) |
| JSD (s/s) | Sonnet 3.5 | 0.61 (0) | 0.66 (0) | 0.68 (0) | 0.55 (-0.29) | 0.42 (-0.58) |
| MSE (s/s) | Sonnet 3.5 | 0.61 (0) | 0.66 (0) | 0.68 (0) | 0.55 (-0.29) | 0.42 (-0.58) |
| Coverage (h/hrs) | Sonnet 3.5; 4o Mini; 3.5 Turbo | 0.61 (0) | 0.66 (0) | 0.68 (0) | 0.64 (-0.20) | 0.96 (-0.04) |
| MSE (srs/srs) | Mistral Small; 3.5 Turbo | 0.06 (-0.55) | 0.26 (-0.40) | 0.52 (-0.16) | 0.79 (-0.05) | 0.96 (-0.04) |
| Consistency ($\tau=.5$) | Sonnet 3.5; Mistral Small | 0.61 | 0.66 | 0.68 | 0.84 | 1.00 |
Bias (MAE) of the judge system selected by each metric at each value of the sensitivity parameter $\beta$. Parenthesized values give the difference from the best attainable bias (bottom row). Where multiple judge systems are listed, the selected system varies with $\beta$.

| Metric | Selected judge(s) | $\beta=0.0$ | $\beta=0.1$ | $\beta=0.2$ | $\beta=0.3$ | $\beta=0.4$ |
|---|---|---|---|---|---|---|
| Hit Rate (h/h) | Sonnet 3.5 | 0.37 (0) | 0.17 (0) | 0.09 (0) | 0.42 (+0.35) | 0.58 (+0.57) |
| Fleiss’s $\kappa$ (h/h) | Sonnet 3.5 | 0.37 (0) | 0.17 (0) | 0.09 (0) | 0.42 (+0.35) | 0.58 (+0.57) |
| Cohen’s $\kappa$ (h/h) | Mistral Large | 0.71 (+0.34) | 0.51 (+0.34) | 0.26 (+0.17) | 0.07 (0) | 0.24 (+0.23) |
| MSE (h/h) | Sonnet 3.5 | 0.37 (0) | 0.17 (0) | 0.09 (0) | 0.42 (+0.35) | 0.58 (+0.57) |
| KLD(h,j) (s/s) | Mistral Small | 0.94 (+0.57) | 0.74 (+0.57) | 0.49 (+0.40) | 0.16 (+0.09) | 0.01 (0) |
| KLD(j,h) (s/s) | Sonnet 3.5 | 0.37 (0) | 0.17 (0) | 0.09 (0) | 0.42 (+0.35) | 0.58 (+0.57) |
| CE(h,j) (s/s) | Mistral Small | 0.94 (+0.57) | 0.74 (+0.57) | 0.49 (+0.40) | 0.16 (+0.09) | 0.01 (0) |
| CE(j,h) (s/s) | Sonnet 3.5 | 0.37 (0) | 0.17 (0) | 0.09 (0) | 0.42 (+0.35) | 0.58 (+0.57) |
| JSD (s/s) | Sonnet 3.5 | 0.37 (0) | 0.17 (0) | 0.09 (0) | 0.42 (+0.35) | 0.58 (+0.57) |
| MSE (s/s) | Sonnet 3.5 | 0.37 (0) | 0.17 (0) | 0.09 (0) | 0.42 (+0.35) | 0.58 (+0.57) |
| Coverage (h/hrs) | Sonnet 3.5; 4o Mini; 3.5 Turbo | 0.37 (0) | 0.17 (0) | 0.09 (0) | 0.16 (+0.09) | 0.05 (+0.04) |
| MSE (srs/srs) | Mistral Small; 3.5 Turbo | 0.94 (+0.57) | 0.74 (+0.57) | 0.49 (+0.40) | 0.12 (+0.05) | 0.05 (+0.04) |
| Bias (MAE) ($\gamma=.5$) | Sonnet 3.5; Mistral Large; Mistral Small | 0.37 | 0.17 | 0.09 | 0.07 | 0.01 |