Model Evaluations Dashboard

ChatGPT Performance - Mylius

Average scores across all evaluation categories for different naming conditions

How Average Scores Are Calculated:

For each evaluation category (Iconography, Association, Atmosphere, Emotion), the average score is calculated by:

  1. Collecting all individual scores for that category from the evaluation data
  2. Summing all the scores together
  3. Dividing the total by the number of evaluations to get the average
  4. This process is repeated for each of the three evaluation conditions (a minimal computation sketch follows the list):
    • First Iteration: the judge model (o3) knows the true model names
    • Random Names: the judge model (o3) sees anonymized placeholder names such as Model_1, Model_2, etc.
    • Mixed Names: the judge model (o3) sees mismatched, shuffled model names, e.g. metadata generated by GPT-4.1 is labeled as Gemini
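
A minimal sketch of this averaging, assuming the evaluation data is an iterable of records with condition, category, and score fields (the field names here are illustrative, not the actual export schema):

```python
from collections import defaultdict

def average_scores(evaluations):
    """Average judge scores per (condition, category) pair.

    `evaluations` is assumed to be an iterable of dicts such as
    {"condition": "random_names", "category": "Emotion", "score": 7.5}.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for ev in evaluations:
        key = (ev["condition"], ev["category"])  # step 1: collect per condition & category
        sums[key] += ev["score"]                 # step 2: sum the scores
        counts[key] += 1
    # step 3: divide by the number of evaluations
    return {key: sums[key] / counts[key] for key in sums}

# Usage (hypothetical data):
# avg = average_scores(evals)
# avg[("random_names", "Emotion")]  -> average Emotion score under Random Names
```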

Evaluation of Judge Model Behavior in Scoring Metadata

Findings

  1. Judge Model Neutrality Toward AI Model Identities
    • When comparing the original labels with randomized labels, scores remained very similar.
    • When comparing original labels with mismatched labels, scores again stayed stable across most categories (Iconography, Association, Atmosphere, Emotion).
    • This indicates that the judge model does not exhibit a systematic bias toward specific AI model names.
    • In other words, o3 does not reward or penalize ChatGPT, Claude, Gemini, Mistral, or Grok based on identity—its scoring is robust when model names are hidden or shuffled.
    ✅ Conclusion: There is no evidence of “brand favoritism” among AI models (a sketch of this per-category delta check follows the list).
  2. Lower Scores for Human-Labeled Outputs
    • A notable effect appears with human-generated metadata:
    • When explicitly labeled as Human, scores were consistently lower across categories.
    • When the same human annotations were mislabeled as coming from an AI model, scores increased significantly (up to +1.6 points on a 0–10 scale in some categories).
    • This suggests a label-based bias: the judge model implicitly expects model outputs to be “better” and penalizes the Human label.
    ✅ Conclusion: To avoid systematic underestimation of human work, human annotations should be anonymized or masked in evaluation.
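
One way to surface both findings, and to apply the masking recommended above, is to compare per-category averages across conditions and to strip author labels before judging. A minimal sketch, assuming the `(condition, category)` averages from the earlier sketch and a simple dict-based metadata record (condition and field names are assumptions):

```python
def label_bias_deltas(averages, baseline="first_iteration", other="mixed_names"):
    """Per-category score difference between two naming conditions.

    `averages` is the {(condition, category): mean_score} mapping from the
    sketch above. Deltas near zero (as observed for the AI models) suggest no
    name-based bias; a large shift for relabeled human data points to a
    penalty attached to the "Human" label itself.
    """
    return {
        category: value - averages[(baseline, category)]
        for (condition, category), value in averages.items()
        if condition == other
    }

def mask_author(record):
    """Replace the author/model label with a neutral placeholder before the
    record is shown to the judge model (the field name is an assumption)."""
    masked = dict(record)
    masked["author"] = "Annotator_X"  # hides both "Human" and AI model names
    return masked
```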

Why Human Data Scores Lower Overall

Even without labels, human annotations received lower average scores than AI outputs. Analysis of sample outputs reveals several reasons (a small diagnostic sketch follows the list):

  • Structural Completeness Bias
    • AI outputs always fill every category (Iconography, Association, Atmosphere, Emotion).
    • Human outputs often leave fields blank (especially Atmosphere/Emotion), which the judge interprets as lower quality.
  • Verbosity and Detail
    • AI systematically expands metadata into long, exhaustive lists: motifs, objects, stylistic elements, concepts.
    • Human annotations are more selective and concise, focusing on core motifs.
    • o3 appears calibrated to reward verbosity and taxonomic breadth.
  • Consistency of Schema
    • AI outputs follow a rigid schema that aligns with the judge’s expectations.
    • Human annotations show greater variation (sometimes mixing interpretation, sometimes using unusual categories such as “Bibel”, German for “Bible”).
    • This inconsistency is penalized.
  • Interpretive vs. Descriptive Bias
    • Humans sometimes make interpretive attributions (e.g. “Jesus Christus, Maria, Johannes” for a family prayer scene).
    • AI stays closer to neutral description.
    • o3 rewards descriptive neutrality over interpretive subjectivity.
  • Atmosphere/Emotion Coverage
    • AI always provides rich sets of adjectives for atmosphere/emotion.
    • Human entries are minimal or missing.
    • These omissions weigh heavily in the scoring.
  • Evaluator Alignment Effect
    • As an LLM, o3 is biased toward outputs that look like LLM outputs: exhaustive, well-structured, adjective-rich.
    • Human data looks stylistically different, and thus is scored lower—even when it may be more precise or historically meaningful.
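
The structural effects above (blank fields, shorter entries) can be quantified before judging, so that “structurally sparser” is not conflated with “semantically worse.” A minimal sketch, assuming each metadata record maps the four category names to free-text strings (the record layout is an assumption):

```python
CATEGORIES = ["Iconography", "Association", "Atmosphere", "Emotion"]

def structure_stats(record):
    """Field coverage and verbosity of one metadata record.

    Missing or empty category fields count as blank.
    """
    texts = {c: (record.get(c) or "").strip() for c in CATEGORIES}
    filled = [c for c, t in texts.items() if t]
    word_counts = {c: len(t.split()) for c, t in texts.items()}
    return {
        "coverage": len(filled) / len(CATEGORIES),  # blank Atmosphere/Emotion lowers this
        "total_words": sum(word_counts.values()),   # rough proxy for verbosity / taxonomic breadth
        "words_per_field": word_counts,
    }
```

Comparing these statistics for human vs. AI records makes it possible to separate the judge’s structural preferences from genuine quality differences when interpreting the scores.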