Luke Guerdan, Solon Barocas, Kenneth Holstein, Hanna Wallach, Zhiwei Steven Wu, Alexandra Chouldechova
Abstract
The LLM-as-a-judge paradigm, in which a judge LLM system replaces human raters in rating the outputs of other generative AI (GenAI) systems, has come to play a critical role in scaling and standardizing GenAI evaluations. To validate judge systems, evaluators collect multiple human ratings for each item in a validation corpus and then aggregate the ratings into a single, per-item gold label rating. High agreement rates between these gold labels and judge system ratings are then taken as a sign of good judge system performance. In many cases, however, items or rating criteria may be ambiguous, or there may be principled disagreement among human raters. In such settings, gold labels may not exist for many of the items. In this paper, we introduce a framework for LLM-as-a-judge validation in the absence of gold labels. We present a theoretical analysis drawing connections between different measures of judge system performance under different rating elicitation and aggregation schemes. We also demonstrate empirically that existing validation approaches can select judge systems that are highly suboptimal, performing as much as 34% worse than the systems selected by alternative approaches that we describe. Based on our findings, we provide concrete recommendations for developing more reliable approaches to LLM-as-a-judge validation.
Keywords: Machine Learning, ICML
1 Introduction
To improve efficiency, scalability, and repeatability, GenAI evaluations commonly rely on LLM-based judges as a substitute for human raters when rating system outputs for properties like their “relevance”, “helpfulness”, or “toxicity.” In this LLM-as-a-judge paradigm, illustrated in Figure 1, a judge LLM system is used to rate the outputs of a target GenAI system according to instructions specified in a rating task (e.g., Szymanski et al., 2024; Zheng et al., 2023; Bubeck et al., 2023). However, when using this paradigm, validating that judge systems produce accurate ratings is critical to the reliability of the resulting GenAI evaluations.
To validate a judge system, evaluators first curate a validation corpus in which each item asks a rater to rate a single target system output according to a set of rating task instructions. Each item is then rated by multiple human raters and by the judge system. The human ratings are then aggregated into a single gold label rating for each item in the validation corpus. Judge system performance is then assessed by calculating the agreement rate between these gold labels and judge system ratings (Lu & Zhong, 2024; Kim et al., 2024; Jung et al., 2024; Dong et al., 2024; Es et al., 2023; Dubois et al., 2024; Shankar et al., 2024).
Although this approach is well-motivated when there is a single “correct” rating for each item, judge systems are often used in settings where this gold-label assumption is violated. Items or rating criteria may be ambiguous, or there may be principled disagreement among human raters, leading to multiple “correct” ratings. We say that such rating tasks are indeterminate. In practice, indeterminate rating tasks are very common (Figure 9). For example, raters may reasonably disagree about the “helpfulness” of an item based on its length (Li et al., 2024a), or note that the “toxicity” of an item depends on the cultural context (Goyal et al., 2022).
We show that task indeterminacy substantively affects how judge systems can be validated, which judge systems are selected, and, most critically, the resulting conclusions that are drawn about target systems. We introduce a framework for understanding different approaches to LLM-as-a-judge validation as determined by the rating task design, including the rating elicitation scheme, the rating aggregation scheme, and the metric used to quantify human–judge agreement. We illustrate both theoretically and empirically the implications of these choices for judge system selection and for the resulting conclusions that are drawn about target systems. Our main contributions are listed below:
• We provide the first framework for LLM-as-a-judge validation under rating task indeterminacy—i.e., where many items may have multiple “correct” ratings. This framework enables us to compare existing validation approaches and to develop principled alternatives.
• We use this framework to compare existing validation approaches to alternative approaches that account for task indeterminacy. We demonstrate that the best-performing judge systems under the former can be among the worst-performing under the latter (§5), and explain how this arises from differences in how human raters and judge systems resolve ambiguities in forced-choice rating tasks (§6). These findings highlight the importance of the rating elicitation scheme and pinpoint a mechanism by which rating task indeterminacy can confound LLM-as-a-judge validation.
• We conduct an empirical study in which we use five different commercial LLMs as judge systems to rate the “toxicity” of a target system’s outputs using the Civil Comments dataset (§7). We demonstrate that the judge system selected by commonly used existing validation approaches performs 34% worse than the judge system selected when using our framework to explicitly account for the effects of rating task indeterminacy.
Our findings demonstrate that existing validation approaches can be highly suboptimal when used for indeterminate rating tasks. Because the conclusions drawn about target systems depend critically on judge system performance, this jeopardizes the validity of GenAI evaluations. We draw on our findings to offer concrete recommendations for developing more reliable approaches to LLM-as-a-judge validation under rating task indeterminacy.
2 Preliminaries
We consider a judge system and a target system. We treat the target system as a black box that produces a string of output tokens given a string of input tokens. Output tokens can represent an arbitrary modality (e.g., text, images, audio, or video), depending on the system.
Rating Task. We express a target system evaluation as a rating task consisting of a set of items. Each item consists of 1) an output generated by the target system, 2) instructions for rating that output, and 3) a set of response options—i.e., possible ratings—for that output. As is often the case in practice, we assume that the rating instructions and response options are the same for all items.
Our framework is compatible with two types of rating tasks (Zheng et al., 2023): Single output grading tasks instruct a rater (either a human or the judge system) to rate an output generated by the target system for a specific property like “helpfulness” or “toxicity”. Pairwise comparison tasks instruct a rater to provide a preference between two outputs generated by one or more target systems. Both types of rating tasks often include a rubric, annotation guidelines, and few-shot examples specifying how outputs should be rated. The response options for a rating task form an ordered set; common choices include graded ratings of a single property for single output grading tasks and preference judgments between two outputs for pairwise comparison tasks.
Rating Elicitation. We examine two schemes for eliciting ratings from human raters and judge systems: forced choice and response set (Figure 2). Forced choice elicitation instructs a rater to select a single option from the set of response options. Response set elicitation instructs a rater to select all options that they consider reasonable; the admissible response sets form an ordered subset of the power set of the response options. For some tasks, every admissible response set contains a single option, in which case the two elicitation schemes carry equivalent information. However, differences arise when rating tasks are underspecified.
Definition 2.1 (Underspecified Rating Task). A rating task is underspecified if some admissible response set contains more than one response option.
For example, consider the single output grading task shown in Figure 2, which has the response options Yes and No. This task is underspecified because a rater can select a response set containing both Yes and No under response set elicitation if they determine that both options are reasonable. We can construct a fully specified variant of this rating task by adding a Maybe option and instructing raters to always select Maybe if they determine that both Yes and No are reasonable options. When tasks are underspecified, forced choice elicitation captures less information than response set elicitation.
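To make the distinction concrete, the following minimal Python sketch encodes the Yes/No example above; the helper names and the use of the full power set of options as the admissible response sets are illustrative assumptions rather than the paper's implementation.

```python
# Illustrative sketch of forced choice vs. response set elicitation and an
# underspecification check, assuming underspecification means some admissible
# response set contains more than one option.
from itertools import chain, combinations

def admissible_response_sets(options):
    """All non-empty subsets of the option set (power set minus the empty set)."""
    return [frozenset(c) for c in chain.from_iterable(
        combinations(options, r) for r in range(1, len(options) + 1))]

def is_underspecified(response_sets):
    """Underspecified iff some admissible response set has more than one option."""
    return any(len(s) > 1 for s in response_sets)

options = ["Yes", "No"]
response_sets = admissible_response_sets(options)      # {Yes}, {No}, {Yes, No}
print(is_underspecified(response_sets))                 # True

# Fully specified variant: add "Maybe" and instruct raters to select it whenever
# both Yes and No are reasonable, so every admissible response set is a singleton.
options_full = ["Yes", "No", "Maybe"]
allowed_full = [frozenset({o}) for o in options_full]
print(is_underspecified(allowed_full))                   # False
```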
Human Raters. We consider a population of human raters, such as all target system users in a geographic region, a demographic group (e.g., females over 45), or a set of domain experts (e.g., licensed radiologists). We model the selection of a rater from this population as a random variable.
Judge System. We assume black-box access to the judge system, which, given an input item, returns an output that is then mapped to a response option under forced choice elicitation or to a response set under response set elicitation.
2.1 A Probabilistic Model for Ratings
We model the distribution of human and judge system ratings via a joint distribution over items and ratings. Its random variables denote the forced choice and response set ratings returned by the judge system for a random item, together with the forced choice and response set ratings that a randomly drawn human rater assigns to that item. We evaluate the quality of judge system ratings by comparing them with human ratings on a per-item basis. For each item, we consider the human rating distribution and the rating distribution of the judge system, and we refer to a generic rating distribution, conditioned on a random item, when the distinction between a human rater and the judge system is not needed.
Aggregation Functions. We introduce aggregation functions to consolidate the full rating distribution into a rating vector. For example, applying a hard aggregation function recovers a binary one-hot vector encoding a single rating task option (e.g., Yes) (see §4). The rating space contains all rating vectors that can be recovered from an aggregation function.
We use aggregation functions to define random variables over rating vectors. Specifically, we consider the random rating vector obtained by applying an aggregation function to the rating distribution of a random item. This random variable setup enables us to reason probabilistically about aggregated ratings, e.g., by computing the expected agreement between aggregated rating vectors recovered from humans and the judge system (see §2.2). We maintain separate aggregation functions, random rating vectors, rating vector realizations, and rating spaces for human raters and the judge system.
2.2 Evaluation Goals
Our goal is to characterize the validity of using a judge system as a surrogate for human ratings in evaluations of the target system. We approach validation of the judge system from two complementary angles: first, by directly measuring human–judge agreement (§2.2.1), and second, by examining the extent to which relying on judge system ratings in place of human ratings affects the conclusions drawn from downstream evaluations of the target system (§2.2.2).
2.2.1 Measuring Human-Judge Agreement
The standard approach for validating judge systems involves computing a measure of human–judge agreement (Table 1). Specifically, given the joint distribution over items and ratings, we evaluate
(1)
where the agreement metric can be, e.g., Hit Rate, Cohen's kappa, or KL-Divergence. The expectation is taken over the joint distribution of random items and the corresponding aggregated human and judge rating vectors.
While Eq. (1) assumes that we know the rating distribution, in practice we only have access to a small corpus of ratings. Therefore, we also estimate the agreement rate,
(2)
Above, we estimate the human rating distribution from a corpus of human ratings. We assume this corpus contains only forced choice ratings, as this is the format used in existing GenAI evaluations. For each item, we estimate the judge system's rating distribution by repeatedly sampling a response from the judge system.
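As an illustration of this estimation step, the sketch below computes a simple agreement estimate from a small corpus of forced choice ratings; the soft aggregation, the Hit Rate metric, and all variable names are illustrative choices, not the paper's exact estimator.

```python
# Hedged sketch: estimate per-item human rating distributions by empirical
# frequency and average a simple agreement metric over the validation corpus.
import numpy as np

def soft_aggregate(ratings, n_options):
    """Empirical forced-choice distribution for one item (maximum likelihood)."""
    counts = np.bincount(ratings, minlength=n_options)
    return counts / counts.sum()

def hit_rate(human_dist, judge_rating):
    """1 if the judge's forced-choice rating matches the human majority option."""
    return float(judge_rating == np.argmax(human_dist))

def estimated_agreement(human_corpus, judge_ratings, n_options):
    """Average agreement over items in the validation corpus."""
    return np.mean([
        hit_rate(soft_aggregate(np.asarray(r), n_options), j)
        for r, j in zip(human_corpus, judge_ratings)
    ])

# Example: 3 items, 3 human ratings per item, two options indexed 0 and 1.
human_corpus = [[0, 0, 1], [1, 1, 1], [0, 1, 1]]
judge_ratings = [0, 1, 0]
print(estimated_agreement(human_corpus, judge_ratings, n_options=2))
```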
2.2.2 Measuring the Performance of a Judge System on Downstream Evaluation Tasks
We also examine how selecting judge systems based on certain human–judge agreement metrics impacts judge system performance on downstream evaluation tasks. (This is analogous to how, in classical supervised learning, we might select a probabilistic classifier by minimizing cross-validated negative log-likelihood, but then apply the model to a task where its misclassification error at a particular threshold is the more relevant performance metric.) We focus on two specific downstream tasks. In content filtering tasks, the judge system is used to identify which target system outputs to allow or suppress. In prevalence estimation tasks, the judge system is used to estimate the prevalence of a certain property (e.g., containing “toxic” language) in target system outputs.
Both content filtering and prevalence estimation require making binary categorizations of each item (e.g., whether an item contains “toxicity”). We categorize items using a threshold function that labels an item as positive if the kth option has at least a given probability of being selected. The cutoff is a policy determination. For example, the authors of the Civil Comments toxicity classification dataset (Borkan et al., 2019) use such a cutoff when determining whether to categorize an item as toxic or non-toxic.
Content Filtering. We evaluate the judge system on content filtering tasks by measuring how often it makes the same allow/suppress decisions as the population of human raters, which we quantify via the decision consistency metric. High consistency indicates that, if deployed, the judge system would often make the same decisions as human raters on which target system outputs to allow or suppress (at the chosen cutoff).
Prevalence Estimation. In prevalence estimation tasks, the judge system is used to estimate the proportion of target system outputs that have a certain property. For example, in single-output rating tasks, prevalence estimation recovers the proportion of outputs that are “relevant” or “toxic.” In pairwise comparison tasks, prevalence estimation recovers the win rate—i.e., the proportion of items where an output from one target model is rated as preferable to an output from a second target model (Chiang et al., 2024). We measure the estimation bias as the difference between prevalence estimates obtained from the judge system and from human raters.
For example, when using the judge system to rate responses to automated red-teaming attacks designed to elicit toxicity (Mazeika et al., 2024; Ganguli et al., 2022), a negative bias indicates that the judge system underestimates the prevalence of toxic outputs (i.e., the attack success rate) as compared to human ratings.
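A minimal sketch of these two downstream metrics is shown below, assuming per-item human and judge rating distributions and a thresholding rule at a policy-chosen cutoff; all names are illustrative rather than the paper's implementation.

```python
# Hedged sketch of decision consistency and estimation bias, assuming items are
# categorized as positive when option k has probability >= tau.
import numpy as np

def threshold(dist, k, tau):
    """1 if the k-th option has at least a tau chance of being selected."""
    return int(dist[k] >= tau)

def decision_consistency(human_dists, judge_dists, k, tau):
    """Fraction of items where human and judge thresholded decisions agree."""
    return np.mean([
        threshold(h, k, tau) == threshold(j, k, tau)
        for h, j in zip(human_dists, judge_dists)
    ])

def estimation_bias(human_dists, judge_dists, k, tau):
    """Judge-based prevalence minus human-based prevalence; a negative value
    means the judge underestimates prevalence relative to human ratings."""
    human_prev = np.mean([threshold(h, k, tau) for h in human_dists])
    judge_prev = np.mean([threshold(j, k, tau) for j in judge_dists])
    return judge_prev - human_prev

# Example with two items and two options (option 1 is the "positive" category).
human = [np.array([0.3, 0.7]), np.array([0.8, 0.2])]
judge = [np.array([0.4, 0.6]), np.array([0.6, 0.4])]
print(decision_consistency(human, judge, k=1, tau=0.5))
print(estimation_bias(human, judge, k=1, tau=0.5))
```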
Prevalence estimation has become a central task in the GenAI evaluation literature. The popular Chatbot Arena leaderboard uses a Bradley-Terry model to estimate the win rate (a form of prevalence estimate) in pairwise comparison tasks (Chiang et al., 2024). Similarly, Prediction Powered Inference (PPI) is increasingly used to improve the sample-efficiency of prevalence estimators by combining gold label human ratings with judge system ratings (Angelopoulos et al., 2023b, a; Boyeau et al., 2024; Eyre & Madras, 2024). (PPI setups assume that humans and the judge system rate mutually exclusive subsets of items. In contrast, we validate the judge system by assuming that a subset of items has been rated by both humans and the judge system. Nevertheless, our findings challenge the reliability of using a single per-item human rating as a gold label, a foundational assumption in the PPI framework.) In this work, we show that current approaches used to elicit and aggregate human ratings can yield misleading evaluations of target systems. Because human ratings also serve as a foundation for PPI estimators and the Bradley-Terry model, our findings in turn call into question the reliability of these widely used methodologies for GenAI evaluation.
3 Decomposing Sources of Human Rating Variation in LLM-as-a-Judge Validation
With this framework in place, we now develop a model that decomposes sources of rating variation in the LLM-as-a-judge validation pipeline. Because human ratings are the benchmark for validating judge systems, disentangling “meaningful signal” from noise in rating variation is critical. Our model disentangles: (1) genuine differences in how raters interpret a rating task; (2) inconsistencies introduced by lapses in attention (i.e., error); and (3) variation introduced by requiring a rater to select a single option when they determine that more than one is reasonable, i.e., forced choice selection effects. We show that failing to account for each factor can lead to misleading evaluation results (§6-7).
Our rating model decomposes the human rating distribution for each item. To capture the potential for rater error, we distinguish between a human rater’s stable response set—i.e., the options they would consistently endorse when carefully completing the rating task—and the observed response set they provide through response set elicitation (Figure 3). A rater’s stable response set can differ from their observed response set if they fail to identify one or more options that could reasonably apply to a rating instruction (or erroneously endorse others). We describe differences between the stable and observed response sets via an error matrix, where each entry encodes the probability that a rater endorses a particular observed response set given their stable response set. We assume that error rates are constant across all raters:
Assumption 3.1 (Error Independence). The probability of endorsing an observed response set given a stable response set does not depend on the individual rater.
While a rich literature exists on rater-dependent error modeling (Klie etal., 2023; Gordon etal., 2021), we make this simplifying assumption to examine the aggregate effects of rating error on downstream evaluations of judge systems.
We use a transition matrix to represent how raters pick an option from their response set under forced choice elicitation. Each element of this matrix contains the probability of a rater selecting a particular option (e.g., Yes) given that they would select a particular response set (e.g., both Yes and No). As with the error matrix, we assume that the transition matrix is fixed across raters:
Assumption 3.2 (Forced Choice Independence). The probability of selecting a forced choice option given a response set does not depend on the individual rater.
Both the error matrix and the transition matrix have reverse counterparts that encode conditional probabilities in the reverse direction. Entries in the reverse transition matrix denote the probability of a rater endorsing a particular (observed) response set (e.g., both Yes and No) given that they selected a particular forced choice option (e.g., Yes). Entries in the reverse error matrix denote the probability of a particular stable response set given a rater’s observed response set.
Our rating model connects different representations of human rating variation (Fig. 3). The response set distribution represents genuine differences in how a population of raters interprets an item in a rating task. This rating distribution, which is uncorrupted by error or forced choice selection effects, is our target parameter. In contrast, the forced choice distribution describes the probability distribution over options that is observed under rater error and forced choice selection effects. The following result shows that we can decompose the response set distribution into rater error, forced choice selection effects, and the forced choice distribution:
Theorem 3.3 (Rating Decomposition). Assume 3.1 and 3.2 hold. Then the observed forced choice distribution is obtained by applying the error matrix and then the forced choice transition matrix to the stable response set distribution, and the stable response set distribution is recovered by applying the corresponding reverse matrices to the forced choice distribution; these relationships hold for all conditional rating distributions.
This theorem shows how genuine differences in raters’ interpretations of an item in a rating task propagate through error and forced choice selection. It also provides a mechanism for recovering the stable response set distribution from the forced choice distribution by applying the reverse error and forced choice transition matrices. Given this decomposition, we might wonder when the response set distribution is identifiable from the forced choice distribution. The following result shows that this is only possible when a rating task is fully specified:
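The sketch below illustrates the forward direction of this decomposition on the Yes/No example, under assumed matrix conventions (columns of the error matrix indexed by stable response sets, rows of the transition matrix indexed by observed response sets); the numbers are invented for illustration only.

```python
# Hedged sketch of the forward model in Theorem 3.3 for one item, assuming
# E[i, j] = P(observed set i | stable set j) and
# T[i, k] = P(forced-choice option k | observed set i).
import numpy as np

options = ["Yes", "No"]
response_sets = [frozenset({"Yes"}), frozenset({"No"}), frozenset({"Yes", "No"})]

# Stable response-set distribution (the target parameter, uncorrupted by error).
p_stable = np.array([0.5, 0.2, 0.3])

# Error matrix: columns are stable sets, rows are observed sets.
E = np.array([[0.90, 0.05, 0.05],
              [0.05, 0.90, 0.05],
              [0.05, 0.05, 0.90]])

# Forced-choice transition matrix: a rater holding {Yes, No} must break the tie;
# here they favor "Yes" (a forced choice selection effect).
T = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.7, 0.3]])

p_observed = E @ p_stable      # observed response-set distribution
p_forced = T.T @ p_observed    # forced-choice distribution
print(p_observed, p_forced)

# Recovering p_stable from p_forced requires the reverse matrices and, per
# Theorem 3.4, is only possible in general when the task is fully specified.
```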
Theorem 3.4 (Response Set Identifiability). Assume 3.1 and 3.2 hold. Further, assume that there is no rater error, i.e., the error matrix is the identity. Then the response set distribution is identifiable from the forced choice distribution if and only if the rating task is fully specified.
This theorem shows that, even in an idealized setting with no rater error, an information loss occurs when compressing a response set rating into a forced choice rating. A practical implication is that rating tasks should be fully specified when possible to enable direct recovery of the response set distribution from forced choice ratings. In §6-7, we show that underspecified rating tasks yield substantive discrepancies in validations of judge systems.
4 Defining the Performance of a Judge System in the Absence of Gold Labels
Our model for human rating variation establishes how to define the performance of a judge system under indeterminacy. In particular, human and judge aggregation functions are used to consolidate rating variation (represented via the forced choice or response set distributions) into rating vectors. Given a judge performance metric together with a pair of human and judge aggregation functions, we call the resulting combination a definition of performance. As we describe next, many such definitions could reasonably be used to validate a judge system. We describe each definition of performance by enumerating over aggregation functions:
Hard aggregation. The hard aggregation function returns a one-hot basis vector encoding the mode of the forced choice distribution. Performance measures that rely on hard aggregation are consistent with categorical human–judge agreement metrics (e.g., Krippendorff’s alpha). Measures relying on hard aggregation impose a gold-label assumption and are the status quo in existing judge system validations (Lu & Zhong, 2024; Jung et al., 2024; Dong et al., 2024; Es et al., 2023; Dubois et al., 2024; Bubeck et al., 2023; Zheng et al., 2023; Faisal et al., 2024; Gu et al., 2024; Thakur et al., 2024; Li et al., 2024b; Chen et al., 2024; Chiang et al., 2024; Dorner et al., 2024; Mirzakhmedova et al., 2024; Chaudhary et al., 2024; Kim et al., 2024; Dettmers et al., 2024).
Soft aggregation. The soft aggregation function returns a probability vector over forced choice responses. Each entry represents the probability that the corresponding option is selected by a rater under forced choice elicitation. Definitions of performance that rely on soft aggregation are consistent with distributional human–judge agreement metrics (e.g., KL-Divergence). Prior work has proposed soft label aggregation with distributional agreement metrics for evaluating ML systems under indeterminacy (Uma et al., 2020; Peterson et al., 2019; Collins et al., 2022). However, soft aggregation is seldom used in judge system validations.
Our rating model (§3) connects these categorical and distributional definitions of performance to multi-label definitions, which provide a more granular representation of rating variation over response set data. A binary membership matrix indicates whether each option is contained in each response set; multiplying it by the response set distribution yields the multi-label vector. Each entry of this vector describes the probability that a rater selects the corresponding option in their observed response set under response set elicitation. (Unlike the forced choice and response set distributions, the entries of the multi-label vector need not sum to one.) We also consider the corresponding multi-label vector that is uncorrupted by rater error. Two additional aggregation functions are consistent with multi-label vectors:
Hard Response Set. The hard response set (hrs) function maps the response set distribution to a binary multi-label vector. Each entry of this vector is one if there is at least a threshold probability that a response set containing the corresponding option is selected during response set elicitation. This aggregation function is consistent with measuring the coverage of a predicted judge system response in a response set containing multiple “correct” options.
Soft Response Set. The soft response set (srs) function directly returns the non-thresholded multi-label vector. Each entry denotes the probability that a rater endorsed the corresponding option during response set elicitation. Definitions of performance that apply srs aggregation to the human rating distribution are consistent with continuous metrics such as Mean Squared Error and Binary Cross Entropy.
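The four aggregation functions can be summarized in a few lines of code; the sketch below assumes a forced choice distribution, a response set distribution, and a binary membership matrix for the Yes/No example, with illustrative names and an assumed 0.5 threshold for hard response set aggregation.

```python
# Hedged sketches of the four aggregation functions described above.
import numpy as np

def hard(p_fc):
    """One-hot vector on the modal forced-choice option (gold-label style)."""
    e = np.zeros_like(p_fc)
    e[np.argmax(p_fc)] = 1.0
    return e

def soft(p_fc):
    """Probability vector over forced-choice options."""
    return p_fc

def soft_response_set(p_rs, G):
    """Per-option probability of appearing in the observed response set.
    Entries need not sum to one."""
    return G.T @ p_rs

def hard_response_set(p_rs, G, tau=0.5):
    """Binary multi-label vector: 1 if an option appears with probability >= tau."""
    return (soft_response_set(p_rs, G) >= tau).astype(float)

# Two options; three response sets {Yes}, {No}, {Yes, No}; G[s, k] = 1 iff
# option k is in response set s.
G = np.array([[1, 0], [0, 1], [1, 1]])
p_rs = np.array([0.5, 0.2, 0.3])
p_fc = np.array([0.65, 0.35])
print(hard(p_fc), soft(p_fc), soft_response_set(p_rs, G), hard_response_set(p_rs, G))
```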
Table 1 in Appendix A lists many definitions of performance that are consistent with these aggregation functions. This table also summarizes definitions of performance commonly used in (1) LLM-as-a-judge validations and (2) prior work studying evaluation under indeterminacy.
5 Ranking Judge Systems Under Competing Definitions of Performance
Given that there are many ways of defining performance under indeterminacy, it is unclear when one approach is preferable over another. One way to distinguish among competing definitions is by examining their downstream impact on judge system validation: when do two performance definitions yield a consistent ranking of judge systems? We now use our framework to formally investigate this question.
Let two judge systems be described by their respective conditional rating distributions. We can compare these systems with respect to a performance definition via
(3)
where the expectation is taken over the full joint distribution of the responses returned by both judge systems and the human ratings.
To formalize a comparison between two systems, we say that one judge system dominates another under a definition of performance when it achieves a higher value of that definition. For instance, when using Hit Rate with hard aggregation, a dominant judge system achieves greater agreement with a majority vote over human ratings than the other system. (For metrics where lower values indicate better performance, like KL-divergence, we invert the definition accordingly.) Now, suppose that we would like to compare judge systems under a different definition of performance. The following condition describes when two definitions are guaranteed to yield an equivalent ranking of judge systems:
Definition 5.1 (Rank Consistency). Two definitions of performance are rank consistent if, for all pairs of judge systems and all human rating distributions, they induce the same ranking of the judge systems.
While there are many possible relationships between two definitions of performance, monotonicity captures one key property we might expect: when one system’s performance improves with respect to one definition, it should also improve with respect to the other if the two definitions are compatible. We formalize this notion in the following definition:
Definition 5.2 (Monotone Transformation). One definition of performance is a monotone transformation of another if there exists a monotone increasing function mapping the value of the latter to the value of the former for all rating distributions.
The following result shows that if two performance definitions are not monotone transformations of one another, there exist judge systems and a distribution over human ratings such that the definitions will yield contradictory rankings:
Theorem 5.3 (Necessary Condition for Rank Consistency). If one definition of performance is not a monotone transformation of another, then the two definitions are not rank consistent.
Theorem 5.3 provides a useful tool for comparing definitions of performance: we can show that two definitions are not rank consistent by demonstrating a monotonicity violation.
We provide two examples of monotonicity violations in Appendix C. The first shows a violation between Hit Rate and KL-Divergence, both defined over the forced choice distribution. The second shows a violation between KL-Divergence (defined over the forced choice distribution) and Mean Squared Error (defined over the multi-label vector). This second example illustrates a pernicious issue arising in underspecified tasks: using Theorem 3.4, we can easily construct monotonicity violations by holding the forced choice distribution fixed while varying the response set distribution. This suggests that monotonicity, and by extension rank consistency, is unlikely to hold between definitions of performance defined over the forced choice distribution (i.e., categorical and distributional definitions) and multi-label definitions.
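The following sketch shows one way such a violation can be surfaced empirically, by randomly sampling human and judge forced choice distributions and checking whether Hit Rate and (negated) KL-Divergence ever order two judges differently; the sampling scheme and metrics are illustrative, not the constructions used in Appendix C.

```python
# Hedged sketch: random search for a rank inversion between two definitions of
# performance on a single item.
import numpy as np

rng = np.random.default_rng(0)

def hit_rate(human, judge):
    """1 if the judge's modal option matches the human modal option."""
    return float(np.argmax(judge) == np.argmax(human))

def neg_kl(human, judge):
    """Negative KL(human || judge), so larger is better for both metrics."""
    eps = 1e-9
    return -np.sum(human * np.log((human + eps) / (judge + eps)))

def random_dist(k):
    return rng.dirichlet(np.ones(k))

for _ in range(10_000):
    human, j1, j2 = random_dist(3), random_dist(3), random_dist(3)
    a = (hit_rate(human, j1), hit_rate(human, j2))
    b = (neg_kl(human, j1), neg_kl(human, j2))
    if a[0] > a[1] and b[0] < b[1]:  # judge 1 wins on one metric, loses on the other
        print("Rank inversion found:", human, j1, j2)
        break
```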
Table: decision consistency of the judge system(s) selected under each human–judge agreement metric across values of the sensitivity parameter (β); parenthesized values give the difference from the consistency of the best available judge system.

| Metric (selected judge system(s)) | β = 0.0 | β = 0.1 | β = 0.2 | β = 0.3 | β = 0.4 |
| --- | --- | --- | --- | --- | --- |
| Hit Rate (h/h): Sonnet 3.5 | 0.61 | 0.66 | 0.68 | 0.55 (-0.29) | 0.42 (-0.58) |
| KLD (s/s) (judge, human): Sonnet 3.5 | 0.61 | 0.66 | 0.68 | 0.55 (-0.29) | 0.42 (-0.58) |
| KLD (s/s) (human, judge): Mistral Small | 0.06 (-0.55) | 0.26 (-0.40) | 0.52 (-0.16) | 0.84 | 1.00 |
| MSE (srs/srs): Mistral Small, 3.5 Turbo | 0.06 (-0.55) | 0.26 (-0.40) | 0.52 (-0.16) | 0.79 (-0.05) | 0.96 (-0.04) |
| Coverage (h/hrs): Sonnet 3.5, 4o Mini, 3.5 Turbo | 0.61 | 0.66 | 0.68 | 0.64 (-0.20) | 0.96 (-0.04) |
| Consistency (best available): Sonnet 3.5, Mistral Small | 0.61 | 0.66 | 0.68 | 0.84 | 1.00 |
6 Reconciling Definitions of Performance via Synthetic Experiments
Given that a single monotonicity violation can yield an inversion in system rankings, we might wonder how often inversions occur in practice. Yet merely documenting rank inversions would not provide a means of selecting among two definitions of performance that rank judge systems differently. Therefore, our experiments examine how well judge systems selected using different human–judge agreement metrics perform on downstream evaluation tasks.
Experiment Design. We use our rating decomposition (§3) to sample human and judge system rating distributions:
Human Rating Distribution. For each item, we sample a response set distribution. An error rate parameter denotes the probability that a rater selects an observed response set that differs from their stable response set: we construct the error matrix such that diagonal entries denote the probability of no rating error and off-diagonal entries denote the probability of rating error. We use a skew parameter to control how errors are distributed across response sets. Letting one option be the option used to categorize items (e.g., as “toxic”), the skew parameter controls whether errors systematically favor (positive skew) or disfavor (negative skew) response sets containing that option.
We model the forced choice transition matrix by sampling an exponential decay function over the ranks of the options within each response set, where a decay parameter controls the strength of selection effects. We measure the magnitude of forced choice selection effects by comparing the probability that the categorization option is selected under forced choice elicitation against random chance, over the response sets containing that option. A value of one indicates that the option is selected at random chance (i.e., no selection effects), while values of two and one half indicate that a rater is twice or half as likely as chance to select it (i.e., positive and negative selection effects, respectively). We then compute the forced choice distribution by applying the error matrix and the transition matrix to the response set distribution.
Judge Rating Distribution. We model judge systems by sampling an ensemble of rating distributions with varying similarity to the human rating distribution. We control the deviation of each judge’s rating distribution via a noise scale, and sample each judge’s response set distribution by adding noise to the human stable response set distribution and projecting onto the probability simplex. We sample each judge’s forced choice transition matrix following the same procedure used for the human rating distribution, and record the magnitude of each judge system’s forced choice selection effects. We assume that all variation in a judge system’s response set distribution is captured in stable response sets, i.e., the judge system is not affected by rater error. As such, we let each judge’s error matrix be the identity when computing its forced choice distribution.
We refer to forced choice selection effects between humans and a judge system as symmetric when the two magnitudes are equal, and as asymmetric when they differ. We provide additional experiment setup details and describe our finite sample estimation approach in Appendix D.
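For concreteness, the sketch below mirrors a simplified version of this sampling procedure: it omits the skew term, spreads errors uniformly off the diagonal of the error matrix, and uses exponential-decay weights over option rank for the forced choice transition matrix. These simplifications and all names are assumptions, not the exact setup in Appendix D.

```python
# Hedged sketch of the synthetic sampling procedure for one item.
import numpy as np

rng = np.random.default_rng(0)
response_sets = [frozenset({0}), frozenset({1}), frozenset({0, 1})]
n_options, n_sets = 2, len(response_sets)

def error_matrix(eps):
    """Diagonal = P(no rating error); off-diagonal mass eps split uniformly."""
    E = np.full((n_sets, n_sets), eps / (n_sets - 1))
    np.fill_diagonal(E, 1.0 - eps)
    return E

def transition_matrix(gamma):
    """P(option | response set) via exp(-gamma * rank) over options in the set."""
    T = np.zeros((n_sets, n_options))
    for i, s in enumerate(response_sets):
        opts = sorted(s)
        w = np.exp(-gamma * np.arange(len(opts)))
        T[i, opts] = w / w.sum()
    return T

# Human rating distribution: stable response sets, rater error, selection effects.
p_stable = rng.dirichlet(np.ones(n_sets))
p_fc_human = transition_matrix(1.0).T @ (error_matrix(0.1) @ p_stable)

# Judge system: perturb the human stable distribution; no rater error (E = identity).
noise = rng.normal(0.0, 0.2, n_sets)
p_judge = np.clip(p_stable + noise, 0, None)
p_judge /= p_judge.sum()                   # crude projection onto the simplex
p_fc_judge = transition_matrix(0.5).T @ p_judge
print(p_fc_human, p_fc_judge)
```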
Results. Figure 6 reports the performance of judge systems selected via different human–judge agreement metrics. The left panel shows the performance of judge systems on a fully specified rating task with no rating error. The center panel shows the performance of judge systems on an underspecified rating task with symmetric (left) and asymmetric (right) forced choice selection effects and no rating error. The right panel introduces additional random and additive rater error to the rating process. (We label the rightmost column “additive error” because the positive forced choice selection effects and positive skew jointly shift probability mass toward the categorization option.)
Finding 1: Fully specified rating tasks make more effective use of a limited annotation budget. Figure 6 shows that the annotation budget — i.e., the number of human ratings collected for each item in the evaluation corpus — has a significant impact on the quality of selected judge systems. While current practice is to select judge systems via a single rating per item, increasing the budget to three ratings per item yields a larger benefit than accounting for other factors manipulated in our experiments (i.e., forced choice selection effects and the human–judge agreement metric). Further, fully specifying rating tasks enables more effective use of a limited annotation budget. Judge systems selected with just one rating per item on a fully specified task (Figure 6, left) match the performance of those selected with three ratings per item on an underspecified task (Figure 6, center) – a 66% reduction in annotation budget.
Finding 2: Categorical agreement metrics are unreliable in underspecified rating tasks with asymmetric selection effects. As shown in the center panel of Figure 6, using categorical agreement metrics to select judge systems is unreliable when (1) rating tasks are underspecified and (2) selection effects are asymmetric. Figure 6 corroborates these findings by indicating weak Spearman correlation between categorical human–judge agreement metrics and downstream performance metrics under asymmetric selection effects. Because asymmetric selection effects cannot be detected from forced choice ratings alone (prior work has documented that humans and LLMs exhibit asymmetric survey response biases similar to those modeled by forced choice selection effects; Tjuatja et al., 2024), distributional agreement metrics should be used when selecting judge systems from forced choice ratings on underspecified tasks.
Finding 3: The impact of rating error on judge system selection varies by context and is sometimes less critical than forced choice selection effects. Given the substantial literature investigating rater error (Klie et al., 2023; Plank, 2022; Gordon et al., 2021), its relatively modest impact on judge system selection is unexpected. The right panel of Figure 6 shows that, even with high error rates and adversarial conditions with additive effects, error has a less significant effect on judge system selection than forced choice selection effects. (Figure 10 shows that these findings are robust to different parameter choices.) Lemma B.1 (Appendix B.2) pinpoints the mechanism driving these empirical findings. When rater error affects human ratings but not the ratings assigned by a judge system, its impact on the comparative ranking of judge systems is limited. This is in contrast to forced choice selection effects, which affect both human and judge system ratings. Nevertheless, rater error can yield rank inversions under specific conditions that fall outside the scope of our main experiment design (see Figure 11).
We find that rater error has a significant effect on prevalence estimates obtained from human ratings (Figure 12), particularly when error is correlated with the option used to categorize items. Taken together, our results suggest that more research is needed to characterize the effects of rater error at various points in the LLM-as-a-judge validation pipeline and in downstream performance estimates.
Finding 4: Underspecified rating tasks, forced choice selection effects, and finite sample error yield misleading evaluations of target systems. Figure 6 shows that factors in the human rating process affect evaluations of the target system, irrespective of whether a judge system is introduced. The left panel shows the consistency between thresholded decisions obtained from the population rating distribution versus a finite sample approximation. Allow/suppress decisions obtained from sampling just one rating per item disagree with the decisions produced by thresholding the population rating distribution on 30% of items. The right panel shows that, for underspecified tasks, prevalence estimates obtained from forced choice ratings underestimate the prevalence of the property of interest (e.g., “toxicity”) as compared to prevalence estimates obtained from response set ratings. (Figure 6 isolates the role of forced choice selection effects by assuming no rater error; Figure 12 in Appendix D characterizes interactions between forced choice selection effects and rater error.) These findings underscore that carefully structuring the rating task and accounting for forced choice selection effects is essential for reliable target system evaluations.
7 Case Study: Validating LLM-as-a-Judge Systems for Toxicity Detection
We examine how the choice of performance definition impacts judge system validation in practice by constructing a toxicity rating task. We use the Civil Comments dataset (Borkan et al., 2019) and treat each comment as a target system output corresponding to an item in a rating task (see Appendix E). We compare human ratings against ratings provided by five judge systems: Mistral Small, Mistral Large, Claude Sonnet 3.5, GPT 3.5 Turbo, and GPT 4o Mini.
The Civil Comments data contains forced choice ratings elicited from an underspecified rating task. It is possible to recover three options from the provided data: Very Toxic, Toxic, and Not Toxic OR Hard to Say. We model how these forced choice options map to response sets via a sensitivity parameter β, which denotes the probability that a rater would endorse a response set containing Toxic given that their forced choice response was Not Toxic or Hard to Say.
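The sketch below shows how such a sensitivity parameter can be used to map observed forced choice ratings to response set distributions; the response set layout and the illustrative forced choice distribution are assumptions, not the exact parametrization in Table 2.

```python
# Hedged sketch of the beta sensitivity analysis for the toxicity case study.
import numpy as np

options = ["Very Toxic", "Toxic", "Not Toxic or Hard to Say"]
response_sets = [frozenset({"Very Toxic"}), frozenset({"Toxic"}),
                 frozenset({"Not Toxic or Hard to Say"}),
                 frozenset({"Toxic", "Not Toxic or Hard to Say"})]

def reverse_transition(beta):
    """T_rev[k, s] = P(observed response set s | forced-choice option k)."""
    T_rev = np.zeros((len(options), len(response_sets)))
    T_rev[0, 0] = 1.0          # Very Toxic -> {Very Toxic}
    T_rev[1, 1] = 1.0          # Toxic -> {Toxic}
    T_rev[2, 2] = 1.0 - beta   # Not Toxic / Hard to Say -> singleton set
    T_rev[2, 3] = beta         # ... or a response set that also contains Toxic
    return T_rev

p_fc = np.array([0.05, 0.25, 0.70])   # illustrative forced-choice distribution
for beta in [0.0, 0.1, 0.2]:
    p_rs = p_fc @ reverse_transition(beta)
    # Probability that a rater's response set contains Toxic.
    print(beta, p_rs[1] + p_rs[3])
```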
Finding 5: The ranking of judge systems depends on the choice of agreement metric. Table 8 illustrates that, even when there is a perfect mapping between forced choice and response set distributions (β = 0), the selected judge system depends on the choice of human–judge agreement metric. Claude Sonnet 3.5 is selected when targeting Hit Rate (h/h), Coverage (h/hrs), or KL-Divergence (s/s) measured as the deviation of the judge system rating distribution from the human rating distribution. In contrast, Mistral Small is selected when targeting MSE (srs/srs) or KL-Divergence (s/s) measured as the deviation of the human rating distribution from the judge system rating distribution. However, Mistral Small has 90% worse decision consistency than Sonnet 3.5 when β = 0.
This finding illustrates that selecting a judge system based on metrics developed for the determinate task setting can produce a model that performs highly suboptimally on downstream content filtering tasks. Current practice is to measure asymmetric distributional metrics (e.g., KL-Divergence) as the deviation of the human rating distribution from the judge system rating distribution (Peterson et al., 2019; Collins et al., 2022; Uma et al., 2020; Fornaciari et al., 2021). While results presented in Appendix D indicate that judge systems selected with this directionality perform well in some settings, we observe that the opposite directionality is preferable for small values of β in Table 8. We recommend using a symmetric metric (i.e., JS-Divergence) when selecting judge systems in practice, as this approach performs reliably across settings.
Finding 6: The ranking of judge systems depends on forced choice selection effects. Table 8 also shows that the ranking of judge systems inverts as β increases. Using Hit Rate to select a judge system, as is common practice in the literature (Table 1), yields Sonnet 3.5 across all settings of β. While Sonnet 3.5 has the highest decision consistency for small values of β, the best model (defined against consistency) inverts to Mistral Small at larger values of β. Yet selecting Sonnet 3.5 at β = 0.3 yields a 34.5% reduction in decision consistency as compared to Mistral Small. This underscores the importance of fully specifying rating tasks or using response set ratings when the rating task is underspecified.
One way to assess whether this value of β is plausible is to inspect the values recovered by the judge systems themselves (Sonnet 3.5, Mistral Small, GPT 3.5 Turbo, GPT 4o Mini, and Mistral Large). Notably, the two models that align most closely with human ratings, Mistral Small and Sonnet 3.5, both yield values consistent with this regime. This suggests the performance inversions shown in Table 8 are plausible.
Finding 7: Forced choice ratings underestimate the prevalence of toxicity in target system outputs. Figure 8 shows the prevalence of toxicity in target system outputs as a function of the sensitivity parameter β and the cutoff. The setting with β = 0 corresponds to the status quo approach of using forced choice ratings to estimate prevalence. However, we find that small increases in the value of β yield significant changes in the estimated prevalence of toxicity. For example, at a fixed cutoff, increasing β from 0 to 0.05 doubles the estimated prevalence of toxicity in target system outputs. This substantiates the findings of our synthetic data experiments (Figure 6, right) and underscores the importance of carefully modeling the rating process to obtain reliable evaluations of the target system.
8 Conclusion
We introduce a framework for LLM-as-a-judge validation under rating task indeterminacy. Our framework provides a methodological foundation for more principled validation of judge systems designed to rate concepts such as “helpfulness”, “toxicity”, and “relevance” in target system outputs. We identify key factors in the LLM-as-a-judge validation pipeline: rating task design, judge performance measures, and rating elicitation and aggregation schemes, which can significantly affect downstream ratings performed by judge systems. We show that current practices for validating judge systems can yield misleading assessments of judge system performance and unreliable evaluations of target systems.
9 Impact Statement
In this paper, we introduce a framework for LLM-as-a-judge validation in the absence of gold labels. We demonstrate that key methodological considerations, including rating task design, measures of judge system performance, and rating elicitation and aggregation schemes, have substantive downstream impacts on judge system ratings. Although our arguments have potential societal consequences, especially if the practices we advocate for see adoption and thus change current GenAI evaluation practice, there are no consequences we feel the need to highlight that are specific to this work rather than applicable to any work aiming to improve upon current evaluation practices. The validation practices described in this paper are not an endorsement for the adoption of a judge system in any particular setting.
10 Acknowledgments
We thank members of the Sociotechnical Alignment Center at Microsoft Research New York City for their helpful comments on early versions of this work. We also thank attendees and reviewers of the Statistical Frontiers in LLMs and Foundation Models (SFLLM) and Evaluating Evaluations (EvalEval) workshops at NeurIPS, and attendees of the Fairness, Explainability, Accountability, and Transparency (FEAT) reading group at Carnegie Mellon University. This work was supported in part by an award from the UL Research Institutes through the Center for Advancing Safety of Machine Intelligence (CASMI) at Northwestern University.
Bencke et al. (2024) Bencke, L., Paula, F. S., dos Santos, B. G., and Moreira, V. P. Can we trust LLMs as relevance judges? In Simpósio Brasileiro de Banco de Dados (SBBD), pp. 600–612. SBC, 2024.
Borkan et al. (2019) Borkan, D., Dixon, L., Sorensen, J., Thain, N., and Vasserman, L. Nuanced metrics for measuring unintended bias with real data for text classification. CoRR, abs/1903.04561, 2019. URL http://arxiv.org/abs/1903.04561.
Boyeau et al. (2024) Boyeau, P., Angelopoulos, A. N., Yosef, N., Malik, J., and Jordan, M. I. AutoEval done right: Using synthetic data for model evaluation. arXiv preprint arXiv:2403.07008, 2024.
Bubeck et al. (2023) Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., Lee, P., Lee, Y. T., Li, Y., Lundberg, S., et al. Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv preprint arXiv:2303.12712, 2023.
Chaudhary et al. (2024) Chaudhary, M., Gupta, H., Bhat, S., and Varma, V. Towards understanding the robustness of LLM-based evaluations under perturbations. arXiv preprint arXiv:2412.09269, 2024.
Chen et al. (2024) Chen, D., Chen, R., Zhang, S., Wang, Y., Liu, Y., Zhou, H., Zhang, Q., Wan, Y., Zhou, P., and Sun, L. MLLM-as-a-judge: Assessing multimodal LLM-as-a-judge with vision-language benchmark. In Forty-first International Conference on Machine Learning, 2024.
Chiang et al. (2024) Chiang, W.-L., Zheng, L., Sheng, Y., Angelopoulos, A. N., Li, T., Li, D., Zhu, B., Zhang, H., Jordan, M., Gonzalez, J. E., et al. Chatbot Arena: An open platform for evaluating LLMs by human preference. In Forty-first International Conference on Machine Learning, 2024.
Collins et al. (2022) Collins, K. M., Bhatt, U., and Weller, A. Eliciting and learning with soft labels from every annotator. In Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, volume 10, pp. 40–52, 2022.
Dettmers et al. (2024) Dettmers, T., Pagnoni, A., Holtzman, A., and Zettlemoyer, L. QLoRA: Efficient finetuning of quantized LLMs. Advances in Neural Information Processing Systems, 36, 2024.
Dong et al. (2024) Dong, Y. R., Hu, T., and Collier, N. Can LLM be a personalized judge? arXiv preprint arXiv:2406.11657, 2024.
Dorner et al. (2024) Dorner, F. E., Nastl, V. Y., and Hardt, M. Limits to scalable evaluation at the frontier: LLM as judge won't beat twice the data. arXiv preprint arXiv:2410.13341, 2024.
Dubois et al. (2024) Dubois, Y., Li, C. X., Taori, R., Zhang, T., Gulrajani, I., Ba, J., Guestrin, C., Liang, P. S., and Hashimoto, T. B. AlpacaFarm: A simulation framework for methods that learn from human feedback. Advances in Neural Information Processing Systems, 36, 2024.
Es et al. (2023) Es, S., James, J., Espinosa-Anke, L., and Schockaert, S. RAGAS: Automated evaluation of retrieval augmented generation. arXiv preprint arXiv:2309.15217, 2023.
Eyre & Madras (2024) Eyre, B. and Madras, D. Auto-evaluation with few labels through post-hoc regression. arXiv preprint arXiv:2411.12665, 2024.
Faisal et al. (2024) Faisal, F., Rahman, M. M., and Anastasopoulos, A. Dialectal toxicity detection: Evaluating LLM-as-a-judge consistency across language varieties. arXiv preprint arXiv:2411.10954, 2024.
Fisch et al. (2020) Fisch, A., Schuster, T., Jaakkola, T., and Barzilay, R. Efficient conformal prediction via cascaded inference with expanded admission. arXiv preprint arXiv:2007.03114, 2020.
Fornaciari et al. (2021) Fornaciari, T., Uma, A., Paun, S., Plank, B., Hovy, D., Poesio, M., et al. Beyond black & white: Leveraging annotator disagreement via soft-label multi-task learning. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 2021.
Ganguli et al. (2022) Ganguli, D., Lovitt, L., Kernion, J., Askell, A., Bai, Y., Kadavath, S., Mann, B., Perez, E., Schiefer, N., Ndousse, K., et al. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858, 2022.
Gordon et al. (2021) Gordon, M. L., Zhou, K., Patel, K., Hashimoto, T., and Bernstein, M. S. The disagreement deconvolution: Bringing machine learning performance metrics in line with reality. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, pp. 1–14, 2021.
Goyal et al. (2022) Goyal, N., Kivlichan, I. D., Rosen, R., and Vasserman, L. Is your toxicity my toxicity? Exploring the impact of rater identity on toxicity annotation. Proceedings of the ACM on Human-Computer Interaction, 6(CSCW2):1–28, 2022.
Jung et al. (2024) Jung, J., Brahman, F., and Choi, Y. Trust or escalate: LLM judges with provable guarantees for human agreement. arXiv preprint arXiv:2407.18370, 2024.
Kim et al. (2024) Kim, T. S., Lee, Y., Shin, J., Kim, Y.-H., and Kim, J. EvalLM: Interactive evaluation of large language model prompts on user-defined criteria. In Proceedings of the CHI Conference on Human Factors in Computing Systems, pp. 1–21, 2024.
Klie et al. (2023) Klie, J.-C., Webber, B., and Gurevych, I. Annotation error detection: Analyzing the past and present for a more coherent future. Computational Linguistics, 49(1):157–198, 2023.
Li et al. (2024a) Li, X., Lipton, Z. C., and Leqi, L. Personalized language modeling from personalized human feedback. arXiv preprint arXiv:2402.05133, 2024a.
Li et al. (2024b) Li, Z., Wang, C., Ma, P., Wu, D., Wang, S., Gao, C., and Liu, Y. Split and merge: Aligning position biases in LLM-based evaluators. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 11084–11108, 2024b.
Lu & Zhong (2024) Lu, H. and Zhong, F. Can vision-language models replace human annotators: A case study with CelebA dataset. arXiv preprint arXiv:2410.09416, 2024.
Mazeika et al. (2024) Mazeika, M., Phan, L., Yin, X., Zou, A., Wang, Z., Mu, N., Sakhaee, E., Li, N., Basart, S., Li, B., et al. HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249, 2024.
Mirzakhmedova et al. (2024) Mirzakhmedova, N., Gohsen, M., Chang, C. H., and Stein, B. Are large language models reliable argument quality annotators? In Conference on Advances in Robust Argumentation Machines, pp. 129–146. Springer, 2024.
Nie et al. (2020) Nie, Y., Zhou, X., and Bansal, M. What can we learn from collective human opinions on natural language inference data? arXiv preprint arXiv:2010.03532, 2020.
Pavlick & Kwiatkowski (2019) Pavlick, E. and Kwiatkowski, T. Inherent disagreements in human textual inferences. Transactions of the Association for Computational Linguistics, 7:677–694, 2019.
Peterson et al. (2019) Peterson, J. C., Battleday, R. M., Griffiths, T. L., and Russakovsky, O. Human uncertainty makes classification more robust. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9617–9626, 2019.
Plank (2022) Plank, B. The ‘problem’ of human label variation: On ground truth in data, modeling and evaluation. arXiv preprint arXiv:2211.02570, 2022.
Rahmani et al. (2024) Rahmani, H. A., Yilmaz, E., Craswell, N., Mitra, B., Thomas, P., Clarke, C. L., Aliannejadi, M., Siro, C., and Faggioli, G. LLMJudge: LLMs for relevance judgments. arXiv preprint arXiv:2408.08896, 2024.
Shankar et al. (2024) Shankar, S., Zamfirescu-Pereira, J., Hartmann, B., Parameswaran, A. G., and Arawjo, I. Who validates the validators? Aligning LLM-assisted evaluation of LLM outputs with human preferences. arXiv preprint arXiv:2404.12272, 2024.
Szymanski et al. (2024) Szymanski, A., Ziems, N., Eicher-Miller, H. A., Li, T. J.-J., Jiang, M., and Metoyer, R. A. Limitations of the LLM-as-a-judge approach for evaluating LLM outputs in expert knowledge tasks. arXiv preprint arXiv:2410.20266, 2024.
Takehi et al. (2024) Takehi, R., Voorhees, E. M., and Sakai, T. LLM-assisted relevance assessments: When should we ask LLMs for help? arXiv preprint arXiv:2411.06877, 2024.
Thakur et al. (2024) Thakur, A. S., Choudhary, K., Ramayapally, V. S., Vaidyanathan, S., and Hupkes, D. Judging the judges: Evaluating alignment and vulnerabilities in LLMs-as-judges. arXiv preprint arXiv:2406.12624, 2024.
Tjuatja et al. (2024) Tjuatja, L., Chen, V., Wu, T., Talwalkar, A., and Neubig, G. Do LLMs exhibit human-like response biases? A case study in survey design. Transactions of the Association for Computational Linguistics, 12:1011–1026, 2024.
Uma et al. (2020) Uma, A., Fornaciari, T., Hovy, D., Paun, S., Plank, B., and Poesio, M. A case for soft loss functions. In Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, volume 8, pp. 173–177, 2020.
Zheng et al. (2023) Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36:46595–46623, 2023.
Appendix
This work contains the following appendices:
• Appendix A: Table 1 provides a summary of aggregation functions adopted in prior work on (1) validating LLM-as-a-judge systems, and (2) evaluation under task indeterminacy.
• Appendix B provides additional theoretical analysis, including (i) proofs (§B.1); and (ii) additional theoretical results characterizing rank consistency of MSE under error-free versus error-corrupted human ratings (§B.2).
• Appendix C provides examples of monotonicity violations among pairs of human–judge agreement metrics (see §5).
• Appendix D provides further synthetic experiment setup details and empirical results.
• Appendix E provides further setup details and empirical results for the toxicity detection case study.
Appendix B Additional Theoretical Analysis
B.1 Proofs
B.1.1 Theorem 3.3
Proof.
The forward model follows by the following factorization (we omit the item index from all subscripts for brevity):
(4)
(5)
Above, (4) holds by forced choice independence and (5) holds by error independence. The reverse model follows by the following factorization:
(6)
(7)
where (6) holds by forced choice independence and (7) holds by forced choice and error independence.∎
B.1.2 Theorem 3.4
Proof.
We remove the dependence on the item from all terms for brevity. To begin, note that the response set distribution is identifiable from the forced choice distribution if and only if the corresponding linear system is fully determined (where the system simplifies by taking the error matrix as the identity). The system must be consistent because Theorem 3.3 establishes a solution. A consistent system with this number of equations and unknowns is fully determined if and only if the transition matrix has rank equal to the number of admissible response sets.
We first show that full specification implies that the transition matrix has rank equal to the number of admissible response sets. To begin, note that (1) each column of the transition matrix represents a valid probability distribution, and (2) an option can receive probability mass only if it appears in the corresponding response set. This implies that a column corresponding to a singleton set places all of its probability mass on that set’s single option.
Thus, each singleton set maps to a standard basis vector. Further, because the task is fully specified, each option must appear in exactly one set, giving us exactly as many distinct basis vectors as admissible response sets. The rank of a matrix is equal to the number of linearly independent column vectors. Because these standard basis vectors are linearly independent, the transition matrix has full column rank.
We show the reverse implication, that full column rank implies full specification, by contradiction. Suppose there exists an admissible response set containing more than one option. Since the singleton set for each option is admissible, for each option in this set there exists a column of the transition matrix equal to the corresponding standard basis vector, as shown above. The column corresponding to the multi-option set is a probability distribution supported on these options and can therefore be written as a linear combination of these basis vectors. This shows that this column is linearly dependent with the columns corresponding to the singleton sets, so the transition matrix cannot have linearly independent columns, contradicting the assumption that it has full column rank.
∎
B.1.3 Theorem 5.3.
Proof.
Proof by contradiction. Consider two definitions of performance, each oriented so that higher values are better, together with the random functions of the rating distributions that they induce. Since the first is not a monotone transformation of the second, by definition there must exist rating distributions corresponding to realizations of these random variables satisfying
Now suppose that the distribution over items places all marginal probability mass on this item. Then:
Thus, rank consistency is violated because there exists a distribution under which the two definitions order the judge systems differently. This provides a contradiction, proving the result.
∎
B.2 Rank Consistency Under Rater Error
Lemma B.1 (Rank Consistency of MSE (srs/srs) Under Rater Error). Let the stable and observed response set distributions for human raters be given. (We omit the item subscript from all terms for brevity, as well as the superscript distinguishing the human response set distribution, error matrix, and multi-label vectors where the context is clear.) Let two judge systems have observed response set distributions whose error matrices are the identity. Let the binary membership matrix map response sets to options, and define the difference in MSE between the two judge systems under error-free conditions. The ranking of judge systems using MSE with soft response set aggregation is preserved under human rating error if and only if:
(8)
This lemma provides conditions under which measuring the performance of a judge system against error-corrupted versus error-free human ratings yields a consistent ranking of judge systems (when measured via MSE (srs/srs)). The condition essentially requires that the direction of the error-induced shift in human ratings matches the direction of the stable response set shift across judge systems when projected to the multi-label space. If human rating error and judge system performance differences shift the response set distribution in the same direction, the ranking of judge systems will be consistent for error-free and error-corrupted ratings. Conversely, rankings can invert under an inverse relationship.
Eq. (8) is satisfied in our experimental setup (§6) because the ensemble of judge system rating distributions is generated by adding random perturbations (i.e., uncorrelated with rater error) to the human stable response set vector. Thus we see little change in the reliability of MSE (srs/srs) across settings with no rater error (Figure 6, center) and rater error (Figure 6, right).
Proof of Lemma B.1.
For brevity, we suppress the item index in what follows. The difference in judge system MSE measured against the multi-label human rating vector derived from the stable response set distribution is given by:
Let the multi-label vector recovered from the observed (error-corrupted) response set distribution be defined analogously. Applying the same derivation as above to the error-corrupted MSE metric yields:
Observe that the first two terms appear in both expansions. Thus we focus on the third term when establishing the condition required for rank consistency, namely that the two differences have the same sign.
• Case 1: The first judge system outperforms the second under no rater error. For both inequalities to hold, we need:
• Case 2: The second judge system outperforms the first under no rater error. For rank consistency, we need the same sign condition to hold. Following similar steps, we get:
∎
Appendix C Examples of Monotonicity Violations Among Pairs of Performance Metrics
Example 1: Hit Rate (Forced Choice) and KL-Divergence (Forced Choice).
Let the first definition of performance be Hit Rate with hard aggregation and the second be KL-Divergence with soft aggregation. Consider forced choice distributions recovered from human ratings and from two judge systems (we omit the item index from all terms for brevity).
Suppose these distributions are defined over three options for a single item:
Under Hit Rate, the first judge system achieves the higher value. But under KL-Divergence, the second judge system achieves the higher value. Thus, we have identified a pair of judge system rating distributions and a corresponding human rating distribution for which one definition is not a monotone transformation of the other, so rank consistency between the two definitions cannot hold.
Example 2: KL-Divergence (Forced Choice) and MSE (Multi-Label).
Let the first definition of performance be KL-Divergence with soft aggregation and the second be MSE with soft response set aggregation. Suppose that humans have no rater error (i.e., the error matrix and its reverse are both the identity). Let the human rating distribution satisfy the decomposition:
Let the conditional rating distribution of the first judge system satisfy:
Let the conditional rating distribution of the second judge system satisfy:
Under KL-Divergence, the first judge system achieves the higher value. But under MSE, the second judge system achieves the higher value, yielding a violation of monotonicity. Thus, we have identified a pair of judge system rating distributions and a corresponding human rating distribution for which one definition is not a monotone transformation of the other, so rank consistency between the two definitions cannot hold.
Appendix D Additional Synthetic Experiment Setup Details and Results
D.1 Setup Details
We run all experiments with an ensemble of judge systems. We use 100 items in all experiments and select option and response set configurations consistent with our rating model. While performing synthetic experiments, we estimate the forced choice distribution for both human and judge rating distributions via the maximum likelihood estimator, which also allows us to compute the estimated multi-label vector. We apply aggregation functions to the estimated forced choice distribution and estimated multi-label vector to obtain estimated performance metrics. For example, when applying soft aggregation, we obtain the estimated human and judge rating vectors after estimating the forced choice distributions via the procedure outlined above, and the expected performance is then estimated by averaging the agreement metric over items. We use the same finite sample estimation approach to recover other metrics. This estimator is consistent so long as the estimated distributions converge in probability to their population counterparts as the number of ratings per item grows, which follows by a standard maximum likelihood convergence argument.
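A minimal sketch of this finite sample estimation step is shown below, using empirical frequencies as maximum likelihood estimates of the per-item forced choice distribution and multi-label vector; the MSE computation and all names are illustrative.

```python
# Hedged sketch of per-item maximum likelihood estimation from a small number
# of sampled ratings.
import numpy as np

def estimate_fc(fc_ratings, n_options):
    """Empirical forced-choice distribution from sampled ratings."""
    return np.bincount(fc_ratings, minlength=n_options) / len(fc_ratings)

def estimate_multilabel(response_sets, n_options):
    """Per-option endorsement frequency from sampled response sets."""
    m = np.zeros(n_options)
    for s in response_sets:
        for k in s:
            m[k] += 1.0
    return m / len(response_sets)

# Three simulated ratings for one item on a two-option task.
fc_ratings = np.array([0, 0, 1])
rs_ratings = [{0}, {0, 1}, {1}]
p_hat = estimate_fc(fc_ratings, 2)            # [2/3, 1/3]
m_hat = estimate_multilabel(rs_ratings, 2)    # [2/3, 2/3]
m_judge = np.array([0.5, 0.6])                # hypothetical judge multi-label vector
print(np.mean((m_hat - m_judge) ** 2))        # per-item MSE (srs/srs)
```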
Appendix E Additional Case Study Setup Details and Results
We sample comments from the Civil Comments dataset and suppose that they are outputs from the target system. We stratify sampled comments by observed disagreement in forced choice responses. Specifically, we select an even sample of comments with toxicity annotation rates of 10%, 20%, 30%, 40%, and 50%. We use the sensitivity parametrization to construct reverse forced choice transition matrices from observed forced choice ratings. The parametrization shown in Table 2 yields a conservative analysis of forced choice selection effects: mapping forced choices to a response set containing both Very Toxic and Toxic would flip thresholded decisions at a smaller magnitude of β. Figure 16 shows the prompts used to collect ratings from judge systems.