Tau2-Bench Telecom
Sierra's customer-support agent benchmark: resolve telecom troubleshooting tasks by talking to a simulated user, calling tools, and strictly following policy. Pass@1, higher is better.
Each task pairs the agent with an LLM-simulated user in a dual-control environment where both sides can act and use tools, so the agent must guide the user through steps it cannot perform itself (rebooting a phone, toggling settings) while staying inside a written policy document. A run passes only if the final database state and required actions match the ground truth. Telecom is the hardest of the original three domains (retail, airline, telecom) and the one vendors quote, with roughly 114 tasks averaged over repeated trials. The paper also reports pass^k reliability (probability of succeeding on all k independent trials), which falls off sharply for less consistent models. Scores here are vendor-reported pass@1 figures; Sierra's own re-evaluations with a standardized user simulator run several points lower.
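To make the difference between pass@1 and pass^k concrete, here is a minimal Python sketch. It assumes each task is run n times and that pass^k is estimated per task as C(c, k) / C(n, k), where c is the number of successful trials, then averaged over tasks. That estimator is an assumption about how the metric is computed, not a detail confirmed by this page. The function names and the trial data are hypothetical.

```python
from math import comb


def pass_at_1(results):
    """Average per-task success rate across all trials."""
    return sum(sum(trials) / len(trials) for trials in results) / len(results)


def pass_hat_k(results, k):
    """Estimate pass^k: the chance that k trials of one task all succeed.

    For a task with n trials and c successes, C(c, k) / C(n, k) is the
    fraction of k-trial subsets made up entirely of successes. The
    per-task values are then averaged over tasks.
    """
    total = 0.0
    for trials in results:
        n, c = len(trials), sum(trials)
        if k > n:
            raise ValueError("k cannot exceed the number of trials per task")
        total += comb(c, k) / comb(n, k)
    return total / len(results)


# Hypothetical outcomes: three tasks, four trials each (True = task passed).
results = [
    [True, True, True, True],      # always succeeds
    [True, False, True, True],     # succeeds 3 times out of 4
    [False, False, True, False],   # succeeds 1 time out of 4
]

print(round(pass_at_1(results), 4))      # 0.6667
print(round(pass_hat_k(results, 2), 4))  # 0.5
print(round(pass_hat_k(results, 4), 4))  # 0.3333
```

In this made-up data, pass@1 is about 67%, but pass^4 drops to about 33% because only the first task succeeds on every trial. With k = 1 the two metrics coincide.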
Category: Agentic
Created by: Sierra
Models tested: 14
Leader: Claude Opus 4.6
Top score: 99.3%
Updated June 2026