Berkeley Function Calling Leaderboard V4

Agentic

The reference leaderboard for tool calling: overall accuracy across single-turn, multi-turn, web search, memory and hallucination-resistance categories. Higher is better.

The board was last updated in April 2026, so the very newest frontier models (Claude Fable 5, Claude Opus 4.8, GPT-5.5) are not on it yet.

BFCL V4 averages a wide set of subcategories into a single overall accuracy figure. Single-turn calls (simple, multiple, parallel) are graded by AST matching against a set of possible answers, while multi-turn agentic tasks are graded by executing the calls and checking the resulting state. V4 added web search, memory (with key-value, vector store and recursive summarization backends) and format sensitivity categories. Hallucination resistance is measured through relevance and irrelevance detection: the model should call a function when one is appropriate and abstain when none fits. Each model appears in native function-calling (FC) mode or prompt mode, and the mode is recorded on each row. The board also publishes total run cost and latency columns; these are whole-benchmark figures rather than per-task costs, so they are not charted here.
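To make the single-turn grading idea concrete, here is a minimal sketch of AST-style matching. It is not the BFCL checker itself: the function names `parse_call` and `matches`, the shape of `possible_answers`, and the use of `None` to mark an optional parameter are all assumptions made for illustration. The point is only that the model's call is parsed into a structure and compared parameter by parameter, not as a raw string.

```python
import ast


def parse_call(source: str):
    """Parse one call expression such as f(a=1, b='x') into (name, kwargs)."""
    node = ast.parse(source, mode="eval").body
    if not isinstance(node, ast.Call):
        raise ValueError("not a function call")
    if node.args:
        # Simplification for this sketch: only keyword arguments are supported.
        raise ValueError("positional arguments are not handled")
    name = ast.unparse(node.func)
    # literal_eval raises ValueError for anything that is not a plain literal.
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return name, kwargs


def matches(source: str, expected_name: str, possible_answers: dict) -> bool:
    """Return True if the call matches one of the acceptable answers.

    possible_answers maps each parameter name to a list of accepted values.
    Hypothetical convention: None in that list means the parameter may be omitted.
    """
    try:
        name, kwargs = parse_call(source)
    except (SyntaxError, ValueError):
        return False
    if name != expected_name:
        return False
    # Reject parameters the reference answer does not know about.
    if any(param not in possible_answers for param in kwargs):
        return False
    for param, accepted in possible_answers.items():
        if param not in kwargs:
            if None not in accepted:
                return False  # a required parameter is missing
            continue
        if kwargs[param] not in accepted:
            return False
    return True


# Hypothetical reference answer for a made-up get_weather tool.
answers = {"city": ["Paris", "Paris, France"], "unit": ["celsius", None]}
print(matches("get_weather(city='Paris')", "get_weather", answers))
print(matches("get_weather(city='London')", "get_weather", answers))
```

Structural comparison of this kind tolerates differences in argument order, quoting and whitespace that an exact string comparison would count as failures.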

Leaderboard

#   Model              Mode              Score   Provider
1   Claude Opus 4.5    FC                77.47%  Anthropic
2   Claude Sonnet 4.5  FC                73.24%  Anthropic
3   Gemini 3 Pro       Prompt            72.51%  Google DeepMind
4   GLM 4.6            FC, Thinking      72.38%  Z.AI
5   Claude Haiku 4.5   FC                68.7%   Anthropic
6   Gemini 3 Pro       FC                68.14%  Google DeepMind
7   o3                 Prompt            63.05%  OpenAI
8   Kimi K2 0711       FC                59.06%  Moonshot AI
9   DeepSeek V3.2 Exp  Prompt, Thinking  56.73%  DeepSeek
10  Gemini 2.5 Flash   FC                56.24%  Google DeepMind
11  GPT-5.2            FC                55.87%  OpenAI
12  GPT-5 Mini         FC                55.46%  OpenAI
13  DeepSeek V3.2 Exp  FC                54.12%  DeepSeek
14  GPT-4.1            FC                53.96%  OpenAI

Sources:

Raw leaderboard CSV (data_overall.csv)
BFCL V4 leaderboard
BFCL V4 announcement
BFCL paper (ICML 2025)
ShishirPatil/gorilla (BFCL eval code)

Details:

Category: Agentic
Created by: UC Berkeley
Models tested: 12
Configs tested: 14
Leader: Claude Opus 4.5
Top score: 77.47%
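The summary figures follow directly from the leaderboard rows: a config is a model plus a mode, so a model listed in both FC and prompt mode counts as two configs but one model. The sketch below recomputes them from the rows as transcribed on this page; the variable names are illustrative and are not taken from the BFCL code or CSV.

```python
# (model, mode, score in percent), copied from the leaderboard rows above.
configs = [
    ("Claude Opus 4.5", "FC", 77.47),
    ("Claude Sonnet 4.5", "FC", 73.24),
    ("Gemini 3 Pro", "Prompt", 72.51),
    ("GLM 4.6", "FC, Thinking", 72.38),
    ("Claude Haiku 4.5", "FC", 68.7),
    ("Gemini 3 Pro", "FC", 68.14),
    ("o3", "Prompt", 63.05),
    ("Kimi K2 0711", "FC", 59.06),
    ("DeepSeek V3.2 Exp", "Prompt, Thinking", 56.73),
    ("Gemini 2.5 Flash", "FC", 56.24),
    ("GPT-5.2", "FC", 55.87),
    ("GPT-5 Mini", "FC", 55.46),
    ("DeepSeek V3.2 Exp", "FC", 54.12),
    ("GPT-4.1", "FC", 53.96),
]

# Distinct model names, ignoring the mode each one was run in.
models = {model for model, _mode, _score in configs}

# The leading config is the row with the highest overall score.
leader, leader_mode, top_score = max(configs, key=lambda row: row[2])

print(f"Configs tested: {len(configs)}")
print(f"Models tested: {len(models)}")
print(f"Leader: {leader} ({leader_mode}) at {top_score}%")
```

Gemini 3 Pro and DeepSeek V3.2 Exp each appear in two modes, which accounts for the gap between 14 configs and 12 models.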

Updated April 2026