Average scores across all evaluation categories for different naming conditions
For each evaluation category (Iconography, Association, Atmosphere, Emotion), the average score is calculated by:
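As a rough illustration of per-category averaging, the sketch below assumes a plain arithmetic mean over all scored samples within each naming condition; the exact formula used in the source may differ. The record layout, the `average_scores` helper, the condition names, and the score values are all hypothetical placeholders, not data from the study.

```python
from collections import defaultdict
from statistics import mean

# The four evaluation categories named in the text.
CATEGORIES = ("Iconography", "Association", "Atmosphere", "Emotion")


def average_scores(records):
    """Return {condition: {category: mean score}}.

    Assumption: each record is a dict with a 'condition' key and one
    numeric score per category. This layout is hypothetical.
    """
    buckets = defaultdict(lambda: defaultdict(list))
    for rec in records:
        for cat in CATEGORIES:
            buckets[rec["condition"]][cat].append(rec[cat])
    return {
        cond: {cat: mean(vals) for cat, vals in cats.items()}
        for cond, cats in buckets.items()
    }


# Placeholder records for illustration only (not results from the study).
example = [
    {"condition": "unlabeled", "Iconography": 4.0, "Association": 3.0,
     "Atmosphere": 5.0, "Emotion": 2.0},
    {"condition": "unlabeled", "Iconography": 2.0, "Association": 5.0,
     "Atmosphere": 3.0, "Emotion": 4.0},
    {"condition": "labeled", "Iconography": 1.0, "Association": 2.0,
     "Atmosphere": 3.0, "Emotion": 4.0},
]
result = average_scores(example)
```

Grouping by condition first keeps the output aligned with a table that has one row per naming condition and one column per category.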
Even without labels, human annotations received lower average scores than AI outputs. Analysis of sample outputs reveals several reasons: