T-Bench

T-Bench Leaderboard

*This agent was evaluated by installing it in the task container. Some of its failed tasks are due to installation failures rather than the agent's ability.
Rank	Agent	Model	Organization	Date	Resolved Tasks	Accuracy
1	T-Agent	o3-mini	Stanford	2025-04-18	13/62	20.97%
2	Claude Code*	N/A	Anthropic	2025-04-18	12/62	19.35%
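
The Accuracy column appears to be the Resolved Tasks fraction expressed as a percentage and rounded to two decimal places. The snippet below is a minimal sketch of that arithmetic under that assumption; the helper name `accuracy_percent` is hypothetical and not part of T-Bench.

```python
def accuracy_percent(resolved: str) -> float:
    """Convert a 'resolved/total' string such as '13/62' to a percentage.

    Hypothetical helper for illustration; assumes accuracy = resolved / total.
    """
    solved, total = (int(part) for part in resolved.split("/"))
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * solved / total, 2)


# Rows taken from the leaderboard above.
for agent, resolved in [("T-Agent", "13/62"), ("Claude Code", "12/62")]:
    print(f"{agent}: {accuracy_percent(resolved):.2f}%")
```

With the two rows above this gives 20.97% and 19.35%, matching the table.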

Please email mikeam@cs.stanford.edu or alex@laude.org to add your results to the leaderboard.