BehaviorBench Leaderboards

A benchmark for foundation models on behavioral-science tasks, evaluated at the individual and distributional levels.

Paper (TBA) | BeFM1.5-4B | BeFM1.5-70B | Dataset | Code | Chat with BeFM
Ranking method: Mean Win Rate (HELM-style). See full methodology →
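
As a rough illustration of how a HELM-style mean win rate can be computed, the sketch below scores each model by the fraction of other models it beats on each task, then averages those fractions across tasks. It is not the leaderboard's actual implementation. The function name `mean_win_rate`, the model and task names, the scores, the higher-is-better convention, and the choice to count a tie as half a win are all assumptions made for this example; the full methodology is the authoritative description.

```python
import math


def mean_win_rate(scores):
    """Return {model: mean win rate} from {model: {task: score}}.

    Assumptions (illustrative only): higher scores are better, and a tie
    counts as half a win. Tasks a model was not evaluated on are skipped.
    """
    models = list(scores)
    tasks = sorted({task for per_model in scores.values() for task in per_model})
    result = {}
    for model in models:
        per_task_rates = []
        for task in tasks:
            if task not in scores[model]:
                continue
            # Only compare against other models that were scored on this task.
            rivals = [m for m in models if m != model and task in scores[m]]
            if not rivals:
                continue
            own = scores[model][task]
            wins = 0.0
            for rival in rivals:
                other = scores[rival][task]
                if own > other:
                    wins += 1.0
                elif own == other:
                    wins += 0.5  # assumed tie handling
            per_task_rates.append(wins / len(rivals))
        # Average the per-task win rates; NaN if nothing could be compared.
        if per_task_rates:
            result[model] = sum(per_task_rates) / len(per_task_rates)
        else:
            result[model] = math.nan
    return result


# Hypothetical scores, invented purely for illustration.
example_scores = {
    "model_a": {"task_1": 0.8, "task_2": 0.6},
    "model_b": {"task_1": 0.7, "task_2": 0.9},
    "model_c": {"task_1": 0.5, "task_2": 0.6},
}
rates = mean_win_rate(example_scores)
for name, rate in sorted(rates.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rate:.3f}")
```

Because the metric is built from pairwise comparisons, it depends only on the ordering of models within each task and not on the size of the score gaps, which makes tasks with different score scales comparable.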
Tags: a chip such as "reasoning: high" next to a model name shows the reasoning_effort setting used for evaluation. The chip appears only for reasoning models.

Citation

If you use BehaviorBench or BeFM in your work, please consider citing:

@misc{behaviorbench2026,
  title  = {{BehaviorBench}: Benchmarking Foundation Models for Behavioral Science Tasks},
  author = {Huang, Jin* and Xie, Yutong* and Song, Wanli and Zhang, Xingjian and Yuan, Walter and Jackson, Matthew O. and Mei, Qiaozhu},
  year   = {2026},
  note   = {Preprint coming soon}
}