Ranking Methodology

Aggregating across tasks: HELM-style win rate

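The per-task metrics differ in scale and in direction (higher is better for some, lower for others), so averaging them directly is not meaningful. Following HELM, each task is instead reduced to pairwise comparisons between models, and a model's win rate on a task is: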

WR_task = (models beaten + 0.5 × models tied) / (N − 1)

where N is the number of models with data on that task. The win rate ranges from 0 (beaten by every other model) to 1 (beats every other model). A model's mean win rate is the mean of its per-task win rates across the tasks in scope. The Individual and Distributional columns on the leaderboard are this mean computed over the corresponding subset of tasks; the overall column is the average of those two columns.
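
The win-rate computation can be sketched in Python as follows. This is a minimal illustration, not the leaderboard's actual code: the function names, the `higher_is_better` flag, and the input layout (one dict of model scores per task) are assumptions made for the example, as is the choice to average only over tasks on which a model has data.

```python
from statistics import mean

def task_win_rates(scores, higher_is_better=True):
    """Win rate of each model on one task.

    scores: dict mapping model name -> metric value (hypothetical layout);
    models without data on the task are simply absent from the dict.
    """
    n = len(scores)
    if n < 2:
        return {}  # N - 1 = 0, so the win rate is undefined
    rates = {}
    for model, s in scores.items():
        beaten = tied = 0
        for other, t in scores.items():
            if other == model:
                continue
            if s == t:
                tied += 1
            elif (s > t) == higher_is_better:
                beaten += 1
        # WR_task = (models beaten + 0.5 * models tied) / (N - 1)
        rates[model] = (beaten + 0.5 * tied) / (n - 1)
    return rates

def mean_win_rate(per_task_rates, model):
    """Mean of a model's per-task win rates (assumption: tasks where the
    model has no data are skipped rather than counted as zero)."""
    vals = [r[model] for r in per_task_rates if model in r]
    return mean(vals) if vals else None

# Made-up scores: one higher-is-better task, one lower-is-better task.
accuracy = task_win_rates({"A": 0.9, "B": 0.8, "C": 0.8})
error = task_win_rates({"A": 0.3, "B": 0.1}, higher_is_better=False)
print(accuracy)                              # {'A': 1.0, 'B': 0.25, 'C': 0.25}
print(mean_win_rate([accuracy, error], "B")) # 0.625
```

In the example, B and C tie on the first task, so each earns 0.5 of a win out of N − 1 = 2 comparisons, giving 0.25.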

Elo rating (alternative ranking)

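Each task generates pairwise matchups between models, with ties counted as draws. Matchups are processed sequentially with K = 32 and an initial rating of 1500. Because Elo updates depend on the order in which matchups are processed, the matchup list is shuffled 200 times with a fixed seed and the resulting ratings are averaged. Ratings are reported separately for the Individual and Distributional task subsets.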
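
A rough sketch of this procedure in Python is given below. It assumes the standard Elo expected-score formula with a 400-point logistic scale, which the text above does not state explicitly; the function names, the `(model_a, model_b, score_a)` matchup format, and the seed value are likewise hypothetical choices for illustration.

```python
import random

def expected_score(r_a, r_b):
    # Standard Elo expectation (assumed here; the 400 scale is not from the source).
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_ratings(matchups, k=32.0, initial=1500.0, n_shuffles=200, seed=0):
    """Average Elo ratings over several shuffled orderings of the matchups.

    matchups: list of (model_a, model_b, score_a), where score_a is
    1.0 for a win by model_a, 0.5 for a draw, and 0.0 for a loss.
    """
    rng = random.Random(seed)  # fixed seed makes the result reproducible
    models = sorted({m for a, b, _ in matchups for m in (a, b)})
    totals = dict.fromkeys(models, 0.0)
    order = list(matchups)
    for _ in range(n_shuffles):
        rng.shuffle(order)
        ratings = dict.fromkeys(models, initial)
        for a, b, score_a in order:
            delta = k * (score_a - expected_score(ratings[a], ratings[b]))
            ratings[a] += delta  # zero-sum update: what a gains, b loses
            ratings[b] -= delta
        for m in models:
            totals[m] += ratings[m]
    return {m: totals[m] / n_shuffles for m in models}

# Made-up matchups: A beats B and C; B and C draw.
example = [("A", "B", 1.0), ("A", "C", 1.0), ("B", "C", 0.5)]
print(elo_ratings(example))
```

Since each update is zero-sum, the average rating across models stays at the initial value of 1500 in every shuffle, so the averaged ratings are directly comparable within one run.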