BehaviorBench Leaderboards

A benchmark for foundation models on behavioral-science tasks, evaluated at the individual and distributional levels.


Stanford

Ranking: models are ranked by Mean Win Rate (HELM-style); see the full methodology for details.
Tags: a "reasoning: high" chip next to a model name shows the reasoning_effort setting used for evaluation. The chip appears only for reasoning models.
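
As a rough illustration of the Mean Win Rate ranking mentioned above, the sketch below assumes the usual HELM-style reading: on each task, a model's win rate is the fraction of other models it strictly outscores, and the final number is the average of those per-task rates. The function name, the model and task names, the scores, and the choice to count ties as non-wins are all assumptions made for this example; the benchmark's full methodology is the authoritative definition.

```python
from statistics import mean


def mean_win_rate(scores):
    """Average, over tasks, of the fraction of other models a model beats.

    scores: {model_name: {task_name: score}}, where higher is better.
    Assumption for this sketch: a tie does not count as a win.
    """
    models = list(scores)
    result = {}
    for model in models:
        per_task_rates = []
        for task, score in scores[model].items():
            # Compare only against other models that were scored on this task.
            rivals = [m for m in models if m != model and task in scores[m]]
            if not rivals:
                continue
            wins = sum(score > scores[m][task] for m in rivals)
            per_task_rates.append(wins / len(rivals))
        result[model] = mean(per_task_rates) if per_task_rates else 0.0
    return result


# Hypothetical scores, for illustration only (not leaderboard data).
example = {
    "model_a": {"task_1": 0.9, "task_2": 0.5},
    "model_b": {"task_1": 0.7, "task_2": 0.6},
    "model_c": {"task_1": 0.8, "task_2": 0.4},
}
rates = mean_win_rate(example)
ranking = sorted(rates, key=rates.get, reverse=True)
print(ranking)
```

Because each per-task rate depends only on the ordering of models, this kind of aggregate is insensitive to the scale of individual metrics, which is one reason it is often used to combine heterogeneous tasks.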

Citation

If you use BehaviorBench or BeFM in your work, please consider citing:

@misc{huang2026behaviorbenchbenchmarkingfoundationmodels,
  title={BehaviorBench: Benchmarking Foundation Models for Behavioral Science Tasks},
  author={Jin Huang and Yutong Xie and Wanli Song and Xingjian Zhang and Walter Yuan and Matthew O. Jackson and Qiaozhu Mei},
  year={2026},
  eprint={2606.24162},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2606.24162},
}