Berkeley Function Calling Leaderboard V4
Agentic
The reference leaderboard for tool calling: overall accuracy across single-turn, multi-turn, web search, memory and hallucination-resistance categories. Higher is better.
BFCL V4 averages a wide set of subcategories into one overall accuracy figure. Single-turn calls (simple, multiple, parallel) are graded by AST matching against possible answers, while multi-turn agentic tasks are graded by executing the calls and checking state. V4 added web search, memory (key-value, vector store, recursive summarization backends) and format sensitivity categories, and hallucination resistance is measured through relevance and irrelevance detection (calling a function when appropriate, abstaining when none fits). Models appear in native function-calling (FC) or prompt mode, recorded on each row; the board also publishes total run cost and latency columns, which are whole-benchmark figures rather than per-task costs, so they are not charted here.
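The two grading ideas named above, matching a single-turn call against a set of possible answers and abstaining when no function fits, can be sketched roughly as follows. This is an illustration only and not the benchmark's actual grader: the function names `parse_call`, `matches_possible_answers` and `abstained_correctly` are hypothetical, as is the convention that `None` in an accepted-values list marks a parameter as optional.

```python
import ast


def parse_call(source: str):
    """Parse a call such as 'f(a=1, b="x")' into (function name, keyword dict)."""
    node = ast.parse(source, mode="eval").body
    if not isinstance(node, ast.Call) or not isinstance(node.func, (ast.Name, ast.Attribute)):
        raise ValueError("not a function call")
    # Keep the sketch simple: only keyword arguments with literal values.
    if node.args or any(kw.arg is None for kw in node.keywords):
        raise ValueError("only keyword arguments are supported in this sketch")
    name = ast.unparse(node.func)
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return name, kwargs


def matches_possible_answers(call_src: str, expected_name: str, possible_answers: dict) -> bool:
    """Return True if the call uses the expected function and every argument
    value is among the accepted values for its parameter.

    Assumption: `possible_answers` maps parameter name -> list of accepted
    values, and None in that list means the parameter may be omitted.
    """
    try:
        name, kwargs = parse_call(call_src)
    except (SyntaxError, ValueError):
        return False
    if name != expected_name:
        return False
    if set(kwargs) - set(possible_answers):
        return False  # the model supplied a parameter that is not expected
    for param, accepted in possible_answers.items():
        if param not in kwargs:
            if None not in accepted:
                return False  # a required parameter is missing
        elif kwargs[param] not in accepted:
            return False
    return True


def abstained_correctly(model_calls: list) -> bool:
    """Irrelevance check: when no function fits, the right answer is no call."""
    return len(model_calls) == 0
```

For example, with accepted answers `{"city": ["Paris", "Paris, France"], "unit": ["celsius", None]}`, the call `get_weather(city="Paris")` would pass because `unit` is optional under the assumed convention, while `get_weather(city="London")` would fail. Comparing parsed structure rather than raw strings means argument order and whitespace do not affect the result.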
- 1. Claude Opus 4.5: 77.47%, Anthropic
- 2. 73.24%, Anthropic
- 3. Gemini 3 Pro (Prompt): 72.51%, Google DeepMind
- 4. GLM 4.6 (FC, Thinking): 72.38%, Z.AI
- 5. 68.7%, Anthropic
- 6. 68.14%, Google DeepMind
- 7. o3 (Prompt): 63.05%, OpenAI
- 8. 59.06%, Moonshot AI
- 9. DeepSeek V3.2 Exp (Prompt, Thinking): 56.73%, DeepSeek
- 10. 56.24%, Google DeepMind
- 11. GPT-5.2 (FC): 55.87%, OpenAI
- 12. 55.46%, OpenAI
- 13. 54.12%, DeepSeek
- 14. GPT-4.1 (FC): 53.96%, OpenAI
Category: Agentic
Created by: UC Berkeley
Models tested: 12
Configs tested: 14
Leader: Claude Opus 4.5
Top score: 77.47%
Updated April 2026