Tau2-Bench Telecom

Name: Tau2-Bench Telecom leaderboard
Creator: Sierra

Agentic

Sierra's customer-support agent benchmark: resolve telecom troubleshooting tasks by talking to a simulated user, calling tools and strictly following policy. Pass@1, higher is better.

Each task pairs the agent with an LLM-simulated user in a dual-control environment where both sides can act and use tools, so the agent must guide the user through steps it cannot perform itself (rebooting a phone, toggling settings) while staying inside a written policy document. A run passes only if the final database state and required actions match the ground truth. Telecom is the hardest of the original three domains (retail, airline, telecom) and the one vendors quote, with roughly 114 tasks averaged over repeated trials. The paper also reports pass^k reliability (probability of succeeding on all k independent trials), which falls off sharply for less consistent models. Scores here are vendor-reported pass@1 figures; Sierra's own re-evaluations with a standardized user simulator run several points lower.
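The two metrics above can be illustrated with a small sketch. It assumes each task was run n times with c successes and estimates pass^k per task as C(c, k) / C(n, k), averaged over tasks, which is the estimator commonly used for this metric. With k = 1 this reduces to the ordinary pass@1 success rate. The function name, the results layout and the sample data are hypothetical, not taken from the tau2-bench codebase.

```python
from math import comb


def pass_hat_k(results, k):
    """Estimate pass^k: the chance that k independent trials of a task all succeed.

    `results` maps a task id to a list of boolean trial outcomes
    (hypothetical layout, not the tau2-bench result format).
    """
    if not results:
        raise ValueError("no tasks given")
    total = 0.0
    for trials in results.values():
        n = len(trials)   # trials run for this task
        c = sum(trials)   # successful trials
        if k > n:
            raise ValueError("k cannot exceed the number of trials per task")
        # Fraction of size-k subsets of the n trials that are all successes.
        total += comb(c, k) / comb(n, k)
    return total / len(results)


# Made-up outcomes for three tasks, four trials each.
results = {
    "task_a": [True, True, True, True],
    "task_b": [True, False, True, True],
    "task_c": [False, False, True, False],
}

for k in (1, 2, 4):
    print(f"pass^{k} = {pass_hat_k(results, k):.3f}")
```

In this made-up data the score drops from about 0.667 at k = 1 to 0.500 at k = 2 and about 0.333 at k = 4, which is the fall-off the paragraph describes: tasks the agent solves only some of the time contribute less and less as k grows.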

Leaderboard

# | Model | Score | Provider

Sources:

Official tau-bench leaderboard
sierra-research/tau2-bench
Tau2-Bench paper (arXiv 2506.07982)

Details:

Category: Agentic
Created by: Sierra
Models tested: 14
Leader: Claude Opus 4.6
Top score: 99.3%

Updated June 2026