Model Evaluations Dashboard

ChatGPT Performance - Mylius

Average scores across all evaluation categories for different naming conditions

How Average Scores Are Calculated:

For each evaluation category (Iconography, Association, Atmosphere, Emotion), the average score is calculated by:

  1. Collecting all individual scores for that category from the evaluation data
  2. Summing all the scores together
  3. Dividing the total by the number of evaluations to get the average
  4. This process is repeated for each of the three evaluation conditions (a minimal computation sketch follows the list):
    • First Iteration: the judge model (o3) knows the true model names
    • Random Names: the judge model (o3) sees anonymized placeholder names such as Model_1, Model_2, etc.
    • Mixed Names: the judge model (o3) sees mismatched, shuffled model names, e.g. metadata generated by GPT-4.1 is labeled as Gemini
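
A minimal sketch of this averaging, assuming the evaluation data is an iterable of records with condition, category, and score fields (the field names here are illustrative, not the actual export schema):

```python
from collections import defaultdict

def average_scores(evaluations):
    """Average judge scores per (condition, category) pair.

    `evaluations` is assumed to be an iterable of dicts such as
    {"condition": "random_names", "category": "Emotion", "score": 7.5}.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for ev in evaluations:
        key = (ev["condition"], ev["category"])  # step 1: collect per condition & category
        sums[key] += ev["score"]                 # step 2: sum the scores
        counts[key] += 1
    # step 3: divide by the number of evaluations
    return {key: sums[key] / counts[key] for key in sums}

# Usage (hypothetical data):
# avg = average_scores(evals)
# avg[("random_names", "Emotion")]  -> average Emotion score under Random Names
```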

Evaluation of Judge Model Behavior in Scoring Metadata

Findings

  1. Judge Model Neutrality Toward AI Model Identities
    • When comparing the original labels with randomized labels, scores remained very similar.
    • When comparing original labels with mismatched labels, scores again stayed stable across most categories (Iconography, Association, Atmosphere, Emotion).
    • This indicates that the judge model does not exhibit a systematic bias toward specific AI model names.
    • In other words, o3 does not reward or penalize ChatGPT, Claude, Gemini, Mistral, or Grok based on identity—its scoring is robust when model names are hidden or shuffled.
    ✅ Conclusion: There is no evidence of “brand favoritism” among AI models (a sketch of this per-category delta check follows the list).
  2. Lower Scores for Human-Labeled Outputs
    • A notable effect appears with human-generated metadata:
    • When explicitly labeled as Human, scores were consistently lower across categories.
    • When the same human annotations were mislabeled as coming from an AI model, scores increased significantly (up to +1.6 points on a 0–10 scale in some categories).
    • This suggests a label-based bias: the judge model implicitly expects model outputs to be “better” and penalizes the Human label.
    ✅ Conclusion: To avoid systematic underestimation of human work, human annotations should be anonymized or masked in evaluation.
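
One way to surface both findings, and to apply the masking recommended above, is to compare per-category averages across conditions and to strip author labels before judging. A minimal sketch, assuming the `(condition, category)` averages from the earlier sketch and a simple dict-based metadata record (condition and field names are assumptions):

```python
def label_bias_deltas(averages, baseline="first_iteration", other="mixed_names"):
    """Per-category score difference between two naming conditions.

    `averages` is the {(condition, category): mean_score} mapping from the
    sketch above. Deltas near zero (as observed for the AI models) suggest no
    name-based bias; a large shift for relabeled human data points to a
    penalty attached to the "Human" label itself.
    """
    return {
        category: value - averages[(baseline, category)]
        for (condition, category), value in averages.items()
        if condition == other
    }

def mask_author(record):
    """Replace the author/model label with a neutral placeholder before the
    record is shown to the judge model (the field name is an assumption)."""
    masked = dict(record)
    masked["author"] = "Annotator_X"  # hides both "Human" and AI model names
    return masked
```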

Why Human Data Scores Lower Overall

Even without labels, human annotations received lower average scores than AI outputs. Analysis of sample outputs reveals several reasons (a small diagnostic sketch follows the list):

  • Structural Completeness Bias
    • AI outputs always fill every category (Iconography, Association, Atmosphere, Emotion).
    • Human outputs often leave fields blank (especially Atmosphere/Emotion), which the judge interprets as lower quality.
  • Verbosity and Detail
    • AI systematically expands metadata into long, exhaustive lists: motifs, objects, stylistic elements, concepts.
    • Human annotations are more selective and concise, focusing on core motifs.
    • o3 appears calibrated to reward verbosity and taxonomic breadth.
  • Consistency of Schema
    • AI outputs follow a rigid schema that aligns with the judge’s expectations.
    • Human annotations show greater variation (sometimes mixing interpretation, sometimes using unusual categories such as “Bibel”, German for “Bible”).
    • This inconsistency is penalized.
  • Interpretive vs. Descriptive Bias
    • Humans sometimes make interpretive attributions (e.g. “Jesus Christus, Maria, Johannes” for a family prayer scene).
    • AI stays closer to neutral description.
    • o3 rewards descriptive neutrality over interpretive subjectivity.
  • Atmosphere/Emotion Coverage
    • AI always provides rich sets of adjectives for atmosphere/emotion.
    • Human entries are minimal or missing.
    • These omissions weigh heavily in the scoring.
  • Evaluator Alignment Effect
    • As an LLM, o3 is biased toward outputs that look like LLM outputs: exhaustive, well-structured, adjective-rich.
    • Human data looks stylistically different, and thus is scored lower—even when it may be more precise or historically meaningful.
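
The structural effects above (blank fields, shorter entries) can be quantified before judging, so that “structurally sparser” is not conflated with “semantically worse.” A minimal sketch, assuming each metadata record maps the four category names to free-text strings (the record layout is an assumption):

```python
CATEGORIES = ["Iconography", "Association", "Atmosphere", "Emotion"]

def structure_stats(record):
    """Field coverage and verbosity of one metadata record.

    Missing or empty category fields count as blank.
    """
    texts = {c: (record.get(c) or "").strip() for c in CATEGORIES}
    filled = [c for c, t in texts.items() if t]
    word_counts = {c: len(t.split()) for c, t in texts.items()}
    return {
        "coverage": len(filled) / len(CATEGORIES),  # blank Atmosphere/Emotion lowers this
        "total_words": sum(word_counts.values()),   # rough proxy for verbosity / taxonomic breadth
        "words_per_field": word_counts,
    }
```

Comparing these statistics for human vs. AI records makes it possible to separate the judge’s structural preferences from genuine quality differences when interpreting the scores.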