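The per-task metrics differ in scale and in direction (higher is better for some, lower is better for others), so averaging them directly is not meaningful. Following HELM, each task is instead reduced to pairwise comparisons between models, and a model's win rate on a task is the fraction of the other models it beats:

$$\text{win rate}_t(m) = \frac{1}{N-1} \sum_{m' \neq m} \mathbb{1}\left[\, m \text{ beats } m' \text{ on task } t \,\right]$$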
where N is the number of models with data on that task. A model's mean win rate is the mean of its per-task win rates across the tasks in scope. The Individual and Distributional columns on the leaderboard are this mean computed over the corresponding subset of tasks; the overall column is their average.
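The aggregation described above can be sketched in Python. This is a minimal illustration rather than the leaderboard's actual code: the function names, the `tasks` input layout, and the toy scores are hypothetical, and the sketch assumes a tie counts as no win for either model, which the text does not specify.

```python
from statistics import mean


def task_win_rates(scores, higher_is_better=True):
    """Per-task win rate: fraction of the other N - 1 models a model beats.

    scores: {model_name: metric_value} for the models with data on this task.
    Assumption: a tie is not a win for either side.
    """
    models = list(scores)
    n = len(models)
    if n < 2:
        return {}  # no pairwise comparison is possible
    sign = 1 if higher_is_better else -1  # flip lower-is-better metrics
    rates = {}
    for m in models:
        wins = sum(
            1 for o in models
            if o != m and sign * scores[m] > sign * scores[o]
        )
        rates[m] = wins / (n - 1)
    return rates


def mean_win_rates(tasks):
    """Mean of per-task win rates over the tasks where a model has data.

    tasks: {task_name: (scores, higher_is_better)} (hypothetical layout).
    """
    per_model = {}
    for scores, higher_is_better in tasks.values():
        for m, rate in task_win_rates(scores, higher_is_better).items():
            per_model.setdefault(m, []).append(rate)
    return {m: mean(rates) for m, rates in per_model.items()}


# Toy data: one higher-is-better task and one lower-is-better task.
tasks = {
    "accuracy_task": ({"A": 0.9, "B": 0.7, "C": 0.5}, True),
    "error_task": ({"A": 0.3, "B": 0.1}, False),
}
result = mean_win_rates(tasks)
print(result)
```

Because each task contributes only a rank-based fraction in [0, 1], tasks with very different metric ranges carry equal weight in the mean. Model C has no data on the second task, so its mean is taken over the first task alone.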
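Each task generates one pairwise matchup per pair of models, with ties counted as draws. The matchups are processed as Elo updates with K = 32 and an initial rating of 1500. Because the result depends on the order of the matchups, the order is shuffled 200 times with a fixed seed and the resulting ratings are averaged. Ratings are reported separately for the Individual and Distributional task subsets.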
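The rating procedure above can be sketched as follows, assuming the standard Elo update with a 400-point logistic scale. The function names, the `(model_a, model_b, score_a)` matchup layout, the seed value, and the toy scores are hypothetical; only K = 32, the initial rating of 1500, and the 200 shuffles come from the text.

```python
import random
from itertools import combinations


def task_matchups(scores, higher_is_better=True):
    """One matchup per pair of models: (a, b, score_a), score_a in {1, 0.5, 0}."""
    sign = 1 if higher_is_better else -1
    matchups = []
    for a, b in combinations(sorted(scores), 2):
        diff = sign * (scores[a] - scores[b])
        score_a = 1.0 if diff > 0 else 0.0 if diff < 0 else 0.5  # tie = draw
        matchups.append((a, b, score_a))
    return matchups


def elo_ratings(matchups, k=32.0, initial=1500.0, n_shuffles=200, seed=0):
    """Average final Elo ratings over several shuffled matchup orders.

    The seed value is an assumption; the text only says the seed is fixed.
    """
    rng = random.Random(seed)
    models = sorted({m for a, b, _ in matchups for m in (a, b)})
    totals = {m: 0.0 for m in models}
    for _ in range(n_shuffles):
        order = list(matchups)
        rng.shuffle(order)
        ratings = {m: initial for m in models}
        for a, b, score_a in order:
            # Expected score of a against b under the standard Elo model.
            expected_a = 1.0 / (1.0 + 10 ** ((ratings[b] - ratings[a]) / 400.0))
            delta = k * (score_a - expected_a)
            ratings[a] += delta
            ratings[b] -= delta  # zero-sum update
        for m in models:
            totals[m] += ratings[m]
    return {m: totals[m] / n_shuffles for m in models}


# Toy task: A beats B and C; B and C tie.
matchups = task_matchups({"A": 0.9, "B": 0.7, "C": 0.7})
ratings = elo_ratings(matchups)
print(ratings)
```

In a full run, the matchups from every task in a subset would be concatenated before calling `elo_ratings`, once for the Individual tasks and once for the Distributional tasks. Averaging over shuffles reduces the dependence on matchup order, and the fixed seed keeps the reported ratings reproducible.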