
Judge

LLM-as-judge evaluation — score agent output against quality criteria using configurable models, rubrics, and pass/fail thresholds.

This interface is experimental. No real implementation exists yet. The API shape may change before stabilization.

Overview

Judge is the checks-phase interface for LLM-as-judge evaluation. It uses a model to score agent output against quality criteria, returning a numeric score (0–1), a pass/fail result, and natural-language reasoning. Results can optionally include a per-criterion breakdown via ruleResults, feed into audit records, and trigger approval flows when a score falls below threshold.

Provider analogues: OpenAI Evals, Braintrust Scorers, LangSmith Evaluators, Arize Phoenix.

Evaluation Flow

An agent's output and a JudgeConfig are passed to evaluate(). The judge model scores the content against the configured criteria, compares the score to the threshold, and returns a JudgeResult; when passed is false, the result can feed audit records and trigger an approval flow.

TypeScript API

import type { JudgeConfig, JudgeResult, Judge } from 'osprotocol/checks/judge'

JudgeConfig

interface JudgeConfig {
  model?: string
  criteria: string
  threshold?: number
  metadata?: Record<string, unknown>
}
| Field | Type | Description |
| --- | --- | --- |
| model | string | Model identifier to use as judge. Falls back to the provider default when omitted. |
| criteria | string | Natural-language description of what constitutes a passing result. Required. |
| threshold | number | Minimum score (0–1) for passed: true. Defaults to a provider-defined value when omitted. |
| metadata | Record<string, unknown> | Arbitrary metadata attached to this evaluation run. |

JudgeResult

interface JudgeResult {
  score: number
  passed: boolean
  reasoning: string
  ruleResults?: RuleResult[]
  metadata?: Record<string, unknown>
}
| Field | Type | Description |
| --- | --- | --- |
| score | number | Numeric quality score in the range 0–1. |
| passed | boolean | true when score >= threshold. |
| reasoning | string | Model-generated explanation for the score. |
| ruleResults | RuleResult[] | Optional per-criterion breakdown from linked rules. |
| metadata | Record<string, unknown> | Arbitrary metadata returned by the judge. |
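RuleResult is produced by linked Rules rather than defined on this page. The sketch below is a hedged guess at its shape, inferred only from the fields the usage examples further down read (name, passed, score); the real definition belongs to the Rules interface and may differ.

```typescript
// Sketch only: inferred from example usage, not a spec definition.
interface RuleResult {
  name: string                        // identifier of the rule that ran
  passed: boolean                     // deterministic pass/fail for this rule
  score?: number                      // optional per-rule score contribution
  metadata?: Record<string, unknown>  // arbitrary rule-level metadata
}

const example: RuleResult = { name: 'accuracy', passed: true, score: 0.9 }
console.log(example.name, example.passed) // accuracy true
```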

Judge

interface Judge {
  evaluate(content: unknown, config: JudgeConfig): Promise<JudgeResult>
}

evaluate accepts any content value (string, object, or structured output) and a JudgeConfig, and returns a Promise<JudgeResult>.
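Since no real implementation exists yet, the contract can still be exercised with a deterministic stub. The sketch below repeats the interfaces inline for self-containment and substitutes a trivial non-emptiness heuristic for the model call; the heuristic and the 0.5 fallback threshold are illustrative assumptions, not spec behavior.

```typescript
// Interfaces repeated inline so the sketch is self-contained.
interface JudgeConfig {
  model?: string
  criteria: string
  threshold?: number
  metadata?: Record<string, unknown>
}

interface JudgeResult {
  score: number
  passed: boolean
  reasoning: string
  metadata?: Record<string, unknown>
}

interface Judge {
  evaluate(content: unknown, config: JudgeConfig): Promise<JudgeResult>
}

// Deterministic stub: scores 1 for non-empty content, 0 otherwise.
// A real judge would send the content and criteria to config.model.
const stubJudge: Judge = {
  async evaluate(content, config) {
    const text = typeof content === 'string' ? content : JSON.stringify(content) ?? ''
    const score = text.length > 0 ? 1 : 0
    const threshold = config.threshold ?? 0.5 // assumed default; the spec leaves this provider-defined
    return {
      score,
      passed: score >= threshold,
      reasoning: `Stub check against criteria: ${config.criteria}`,
    }
  },
}

stubJudge
  .evaluate('Paris is the capital of France.', { criteria: 'factually accurate' })
  .then((result) => console.log(result.passed, result.score)) // true 1
```

A stub like this is useful for wiring up checks-phase plumbing (audit records, approval gates) before a model-backed judge is available.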

Usage Examples

Basic evaluation

const result = await judge.evaluate(agentOutput, {
  criteria: 'The response must be factually accurate, concise, and free of harmful content.',
  threshold: 0.8,
})

console.log(result.passed)    // true | false
console.log(result.score)     // e.g. 0.92
console.log(result.reasoning) // "The response was accurate and well-scoped..."

Breakdown by criteria using ruleResults

const result = await judge.evaluate(agentOutput, {
  model: 'claude-opus-4-6',
  criteria: 'Evaluate accuracy, tone, and completeness separately.',
  threshold: 0.75,
})

if (result.ruleResults) {
  for (const rule of result.ruleResults) {
    console.log(rule.name, rule.passed, rule.score)
  }
}

Conditional approval trigger

const result = await judge.evaluate(agentOutput, {
  criteria: 'Output must not contain PII and must follow the brand voice guide.',
  threshold: 0.9,
})

if (!result.passed) {
  // Route to human approval before publishing
  await approvalGate.request({
    reason: result.reasoning,
    score: result.score,
  })
}

Rules vs Judge

| | Rules | Judge |
| --- | --- | --- |
| Evaluation method | Deterministic / programmatic | LLM-based qualitative evaluation |
| Output | Pass/fail per rule | Score (0–1) + reasoning |
| Best for | Schema validation, format checks, required fields | Tone, accuracy, helpfulness, nuanced quality |
| Latency | Low | Higher (model call required) |
| Cost | None | Model inference cost |
| Auditability | Exact rule match | Natural-language reasoning |

Use Rules for hard constraints and Judge for qualitative grading where human-like judgment is required.
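This split suggests a natural layering: run deterministic rules first (cheap, exact), and spend a model call on qualitative grading only once the hard constraints pass. A hedged sketch of that pattern — requiredFieldsRule, judgeScore, and the 0.8 threshold are illustrative stand-ins, not spec APIs:

```typescript
interface CheckOutcome {
  passed: boolean
  reason: string
}

// Deterministic rule: a hard constraint checked without any model call.
function requiredFieldsRule(output: { title?: string; body?: string }): CheckOutcome {
  const passed = Boolean(output.title && output.body)
  return { passed, reason: passed ? 'required fields present' : 'missing title or body' }
}

// Stand-in for Judge.evaluate; a real implementation would call a model.
async function judgeScore(_output: unknown): Promise<number> {
  return 0.9
}

async function checkOutput(output: { title?: string; body?: string }): Promise<CheckOutcome> {
  // 1. Hard constraints first: low latency, zero inference cost.
  const rule = requiredFieldsRule(output)
  if (!rule.passed) return rule

  // 2. Qualitative grading only once the cheap checks pass.
  const score = await judgeScore(output)
  return { passed: score >= 0.8, reason: `judge score ${score}` }
}
```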

Integration

  • Rules — deterministic checks whose results can be surfaced as ruleResults inside a JudgeResult
  • Audit — JudgeResult records are written to the audit log for traceability
  • Approval — when passed is false, evaluation results can trigger a human approval gate before execution continues