Andrew Ginns

🧜‍♀️ Merbench - LLM Evaluation

Getting LLMs to consistently nail the Mermaid diagram syntax can be... an adventure.

Merbench evaluates an LLM's ability to autonomously write and debug Mermaid syntax. The agent can access an MCP server that validates its code and provides error feedback, guiding it towards a correct solution.

Each model is tested across three difficulty levels, with a maximum of five attempts per test case. Performance is measured by the final success rate, averaged over complete runs, which reflects both the model's understanding of Mermaid syntax and how effectively it uses the validation tool.
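The attempt loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Merbench's actual implementation: `generate` and `validate` are hypothetical callables standing in for the LLM call and the MCP validation tool, and the toy validator checks only for a diagram-type keyword rather than parsing real Mermaid.

```python
from typing import Callable, Optional, Tuple

MAX_ATTEMPTS = 5  # from the text: at most five attempts per test case

# Hypothetical signatures (assumptions, not Merbench's API):
#   generate(task, feedback) -> candidate Mermaid source
#   validate(source)         -> (ok, error_message)
Generate = Callable[[str, Optional[str]], str]
Validate = Callable[[str], Tuple[bool, str]]


def solve_with_feedback(task: str, generate: Generate, validate: Validate,
                        max_attempts: int = MAX_ATTEMPTS) -> Tuple[bool, int]:
    """Return (success, attempts_used) for one test case."""
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        candidate = generate(task, feedback)
        ok, error = validate(candidate)
        if ok:
            return True, attempt
        feedback = error  # pass the validator's error into the next attempt
    return False, max_attempts


def toy_validate(source: str) -> Tuple[bool, str]:
    """Toy stand-in for the validator: only checks the leading keyword."""
    if source.strip().startswith(("graph", "flowchart", "sequenceDiagram")):
        return True, ""
    return False, "expected a diagram type such as 'flowchart' on the first line"


def toy_generate(task: str, feedback: Optional[str]) -> str:
    """Toy stand-in for the model: corrects its output once it sees feedback."""
    if feedback is None:
        return "A --> B"
    return "flowchart TD\n    A --> B"


result = solve_with_feedback("draw A to B", toy_generate, toy_validate)
print(result)  # (True, 2)
```

The toy model fails on its first attempt, receives the error text, and succeeds on the second. A model that never corrects itself exhausts all five attempts and the test case counts as a failure.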

Evaluation Summary

Total Evaluation Runs: 2012
Models Evaluated: 45
Test Cases: 3

Providers Tested

Amazon, Anthropic, Google, OSS, OpenAI
Data updated: Dec 20, 2025
What do these metrics mean?
Success Rate: The percentage of successful Mermaid diagram generations out of all runs.
Avg Cost/Run: The average cost in USD to generate one diagram, based on provider pricing.
Price/Success: The effective cost for each successful diagram, calculated as (Avg Cost / Success Rate).
Avg Duration: The average time in seconds taken to generate a diagram.
Avg Tokens: The average number of tokens (input + output) used per run.
Runs: The total number of times this model was run in the evaluation.
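As a rough illustration of how these metrics relate, the sketch below aggregates a list of per-run records into the leaderboard columns. The `Run` record and `summarize` function are hypothetical names introduced here, and the sample numbers are made up. Returning `None` when no run succeeds mirrors the N/A shown for Price/Success on models with a 0.0% success rate.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Run:
    """One evaluation run (hypothetical record layout)."""
    success: bool
    cost_usd: float
    duration_s: float
    tokens: int  # input + output


def summarize(runs: List[Run]) -> dict:
    """Aggregate runs into the metrics defined above."""
    n = len(runs)
    if n == 0:
        raise ValueError("no runs to summarize")
    success_rate = sum(r.success for r in runs) / n
    avg_cost = sum(r.cost_usd for r in runs) / n
    # Price/Success = Avg Cost / Success Rate; undefined with zero successes.
    price_per_success: Optional[float] = (
        avg_cost / success_rate if success_rate > 0 else None
    )
    return {
        "success_rate": success_rate,
        "avg_cost": avg_cost,
        "price_per_success": price_per_success,
        "avg_duration": sum(r.duration_s for r in runs) / n,
        "avg_tokens": sum(r.tokens for r in runs) / n,
        "runs": n,
    }


# Made-up sample data: three successes out of four runs.
runs = [
    Run(True, 0.125, 40.0, 60_000),
    Run(True, 0.25, 50.0, 70_000),
    Run(False, 0.125, 60.0, 80_000),
    Run(True, 0.25, 50.0, 70_000),
]
summary = summarize(runs)
print(f"{summary['success_rate']:.1%}")        # 75.0%
print(f"${summary['avg_cost']:.2f}")           # $0.19
print(f"${summary['price_per_success']:.2f}")  # $0.25
```

Because the leaderboard shows costs rounded to cents, dividing the displayed Avg Cost/Run by the displayed Success Rate may not reproduce the displayed Price/Success exactly.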

Model Leaderboard

Rank | Model | Success Rate | Avg Cost/Run | Price/Success | Avg Duration | Avg Tokens | Runs | Provider
1 | gpt-5.2 (medium) | 100.0% | $0.17 | $0.17 | 43.19s | 69,768 | 45 | OpenAI
2 | claude-opus-4.5 | 100.0% | $0.53 | $0.53 | 58.24s | 93,280 | 45 | Anthropic
3 | gpt-5.1-codex-max (medium) | 97.8% | $0.12 | $0.13 | 54.27s | 78,085 | 45 | OpenAI
4 | gemini-3-pro-preview | 97.8% | $0.25 | $0.25 | 79.49s | 80,741 | 45 | Google
5 | claude-sonnet-4.5 | 95.6% | $0.26 | $0.27 | 58.42s | 75,860 | 45 | Anthropic
6 | gpt-5 (medium) | 93.3% | $0.14 | $0.15 | 76.25s | 74,458 | 45 | OpenAI
7 | gpt-5.1-codex (medium) | 93.3% | $0.14 | $0.15 | 63.58s | 89,230 | 45 | OpenAI
8 | o3 | 93.3% | $0.15 | $0.16 | 118.85s | 64,968 | 45 | OpenAI
9 | gpt-5.1 (medium) | 91.1% | $0.11 | $0.12 | 53.77s | 61,876 | 45 | OpenAI
10 | gpt-5.2 (none) | 86.7% | $0.13 | $0.15 | 21.41s | 57,198 | 45 | OpenAI
11 | gpt-5-codex | 80.0% | $0.08 | $0.10 | 59.10s | 47,452 | 45 | OpenAI
12 | gemini-3-flash-preview | 77.8% | $0.05 | $0.06 | 40.33s | 69,554 | 45 | Google
13 | o4-mini | 62.2% | $0.08 | $0.13 | 60.44s | 59,667 | 45 | OpenAI
14 | gpt-5.1 (none) | 55.6% | $0.03 | $0.06 | 33.31s | 17,784 | 45 | OpenAI
15 | gemini-2.5-flash-preview-09-2025 | 45.7% | $0.03 | $0.07 | 46.23s | 46,035 | 46 | Google
16 | gpt-4.1 | 42.2% | $0.01 | $0.03 | 16.41s | 4,300 | 45 | OpenAI
17 | gpt-5-mini | 35.6% | $0.01 | $0.02 | 51.49s | 8,576 | 45 | OpenAI
18 | gemini-2.5-pro | 35.6% | $0.10 | $0.29 | 52.74s | 31,300 | 45 | Google
19 | gpt-5.1-codex-mini (medium) | 33.3% | $0.01 | $0.04 | 31.13s | 32,398 | 45 | OpenAI
20 | gemini-2.5-pro-preview-06-05 | 27.1% | $0.04 | $0.14 | 36.92s | 8,246 | 48 | Google
21 | gemini-2.5-pro-preview-05-06 | 26.7% | $0.13 | $0.49 | 49.85s | 19,754 | 45 | Google
22 | gemini-2.5-pro-preview-03-25 | 22.9% | $0.11 | $0.49 | 57.17s | 16,394 | 48 | Google
23 | gpt-4.1-mini | 20.0% | $0.00 | $0.01 | 23.43s | 3,733 | 45 | OpenAI
24 | gpt-5-nano | 13.3% | $0.00 | $0.03 | 88.50s | 11,963 | 45 | OpenAI
25 | gemini-2.5-flash | 13.3% | $0.01 | $0.09 | 10.15s | 6,991 | 90 | Google
26 | qwen3-30b-a3b-thinking-2507-mlx | 10.3% | $0.00 | $0.02 | 92.27s | 8,167 | 39 | OSS
27 | seed-oss-36b-instruct-mlx | 6.3% | $0.00 | $0.01 | 396.96s | 3,054 | 16 | OSS
28 | gemini-2.5-flash-lite-preview-06-17 | 4.4% | $0.00 | $0.02 | 4.40s | 5,234 | 45 | Google
29 | gemini-2.5-flash-preview-05-20 | 4.4% | $0.01 | $0.18 | 9.26s | 5,120 | 45 | Google
30 | gemini-2.5-flash-preview-04-17 | 4.4% | $0.02 | $0.52 | 24.15s | 10,493 | 45 | Google
31 | gemini-2.5-flash-lite | 3.3% | $0.00 | $0.04 | 5.90s | 9,507 | 90 | Google
32 | gpt-oss-20b | 2.2% | $0.00 | $0.01 | 47.90s | 3,897 | 45 | OSS
33 | nova-premier-v1:0 | 2.2% | $0.03 | $1.29 | 58.14s | 7,520 | 45 | Amazon
34 | nova-micro-v1:0 | 0.0% | $0.00 | N/A | 19.31s | 1,798 | 45 | Amazon
35 | nova-lite-v1:0 | 0.0% | $0.00 | N/A | 24.21s | 3,091 | 45 | Amazon
36 | gemini-2.0-flash | 0.0% | $0.00 | N/A | 4.21s | 1,326 | 60 | Google
37 | xlam-2-32b-fc-r | 0.0% | $0.00 | N/A | 171.86s | 8,211 | 15 | OSS
38 | qwen3-coder-30b-a3b-instruct-mlx | 0.0% | $0.00 | N/A | 21.61s | 4,384 | 45 | OSS
39 | qwen3-30b-a3b-2507 | 0.0% | $0.00 | N/A | 23.04s | 4,278 | 45 | OSS
40 | gpt-4.1-nano | 0.0% | $0.00 | N/A | 18.99s | 3,920 | 45 | OpenAI
41 | gemma-3-27b | 0.0% | $0.00 | N/A | 120.44s | 6,955 | 45 | OSS
42 | gemini-2.5-flash-lite-preview-09-2025 | 0.0% | $0.00 | N/A | 5.68s | 5,688 | 45 | Google
43 | nova-pro-v1:0 | 0.0% | $0.00 | N/A | 49.14s | 905 | 45 | Amazon
44 | llama-xlam-2-70b-fc-r | 0.0% | $0.00 | N/A | 238.06s | 8,592 | 15 | OSS
45 | magistral-small-2509-mlx | 0.0% | $0.00 | N/A | 582.19s | 4,439 | 15 | OSS

[Chart: Performance vs Efficiency Trade-offs]

[Chart: Performance by Difficulty Level]

[Chart: Token Usage Breakdown]

[Chart: Failure Analysis by Reason]

Last updated: January 12, 2026 at 01:40 AM UTC