Tau2-Bench Telecom

Agentic

Sierra's customer-support agent benchmark: the agent resolves telecom troubleshooting tasks by talking to a simulated user, calling tools, and strictly following policy. Metric: pass@1 (higher is better).

Each task pairs the agent with an LLM-simulated user in a dual-control environment where both sides can act and use tools, so the agent must guide the user through steps it cannot perform itself (rebooting a phone, toggling settings) while staying inside a written policy document. A run passes only if the final database state and required actions match the ground truth. Telecom is the hardest of the original three domains (retail, airline, telecom) and the one vendors quote; it has roughly 114 tasks, and scores are averaged over repeated trials. The paper also reports pass^k, a reliability metric giving the probability that the agent succeeds on all k independent trials of a task; it falls off sharply for less consistent models. Scores here are vendor-reported pass@1 figures; Sierra's own re-evaluations with a standardized user simulator come out several points lower.
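To make the difference between pass@1 and pass^k concrete, here is a minimal Python sketch. It assumes each task is run n times with c successes and uses the combinatorial estimator C(c, k) / C(n, k), averaged over tasks. The function names and the toy results are hypothetical, and whether any given vendor computes its figures exactly this way is an assumption.

```python
from math import comb


def pass_at_1(results):
    """Mean per-task success rate, averaged over tasks.

    `results` is a list of per-task lists of booleans, one per trial.
    """
    per_task = [sum(trials) / len(trials) for trials in results]
    return sum(per_task) / len(per_task)


def pass_hat_k(results, k):
    """Estimate pass^k: the chance that k independent trials all succeed.

    For a task with n trials and c successes, the estimate is
    C(c, k) / C(n, k); the benchmark score is the mean over tasks.
    """
    estimates = []
    for trials in results:
        n, c = len(trials), sum(trials)
        if k > n:
            raise ValueError("k cannot exceed the number of trials per task")
        estimates.append(comb(c, k) / comb(n, k))
    return sum(estimates) / len(estimates)


# Hypothetical toy data: 3 tasks, 4 trials each (not real benchmark results).
toy_results = [
    [True, True, True, True],      # always solved
    [True, False, True, False],    # solved half the time
    [False, False, False, False],  # never solved
]

for k in (1, 2, 4):
    print(f"pass^{k} = {pass_hat_k(toy_results, k):.3f}")
print(f"pass@1 = {pass_at_1(toy_results):.3f}")
```

With k = 1 the two metrics coincide. As k grows, only the task that is solved on every trial keeps contributing, which is why inconsistent models lose ground quickly on pass^k.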

Leaderboard
Updated June 2026