Andrew Ginns

🧜‍♀️ Merbench - LLM Evaluation

Getting LLMs to consistently nail the Mermaid diagram syntax can be... an adventure.

Merbench evaluates an LLM's ability to autonomously write and debug Mermaid syntax. The agent can access an MCP server that validates its code and provides error feedback, guiding it towards a correct solution.

Each model is tested across three difficulty levels, with at most five attempts per test case. Performance is measured by the final success rate, averaged over complete runs, which reflects both understanding of Mermaid syntax and effective tool use.
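The attempt loop described above can be sketched roughly as follows. This is a minimal illustration, not Merbench's actual harness: `run_test_case`, `generate`, and `validate` are hypothetical names, and the toy validator stands in for the real MCP validation server (it only checks a prefix and does not parse Mermaid).

```python
from typing import Callable, Optional, Tuple

MAX_ATTEMPTS = 5  # the per-test-case attempt limit stated in the text


def run_test_case(
    generate: Callable[[str, Optional[str]], str],
    validate: Callable[[str], Tuple[bool, str]],
    prompt: str,
    max_attempts: int = MAX_ATTEMPTS,
) -> Tuple[bool, int]:
    """Ask the model for a diagram, validate it, and feed errors back.

    Returns (succeeded, attempts_used). Both callables are hypothetical
    stand-ins for the LLM agent and the validation tool.
    """
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        diagram = generate(prompt, feedback)
        ok, error = validate(diagram)
        if ok:
            return True, attempt
        feedback = error  # the error message guides the next attempt
    return False, max_attempts


# Toy stand-ins so the sketch runs without any model or server.
def toy_validate(diagram: str) -> Tuple[bool, str]:
    if diagram.startswith("graph "):
        return True, ""
    return False, "diagram must start with 'graph '"


def toy_generate(prompt: str, feedback: Optional[str]) -> str:
    # First try omits the header; after feedback, it is corrected.
    return "A --> B" if feedback is None else "graph TD\nA --> B"


result = run_test_case(toy_generate, toy_validate, "Draw A pointing to B")
print(result)
```

A run counts as a success only if some attempt within the limit validates; a model that never recovers from feedback exhausts all five attempts and the run is recorded as a failure.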

Evaluation Summary

Total Evaluation Runs: 1696
Models Evaluated: 38
Test Cases: 3

Providers Tested

Amazon, Google, OSS, OpenAI
Data updated: Nov 16, 2025
What do these metrics mean?
Success Rate: The percentage of runs that ended with a successful Mermaid diagram generation.
Avg Cost/Run: The average cost in USD to generate one diagram, based on provider pricing.
Price/Success: The effective cost of each successful diagram, calculated as Avg Cost/Run divided by Success Rate (as a fraction). Shown as N/A when the success rate is 0%.
Avg Duration: The average time in seconds taken to generate a diagram.
Avg Tokens: The average number of tokens (input + output) used per run.
Runs: The total number of times this model was run in the evaluation.
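As a worked illustration of these definitions, the sketch below computes Success Rate, Avg Cost/Run, and Price/Success from a list of runs. The `Run` record and the function names are assumptions made for the example, not Merbench's code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Run:
    """One evaluation run (hypothetical record shape)."""
    success: bool
    cost_usd: float


def success_rate(runs: List[Run]) -> float:
    """Fraction of runs that succeeded, between 0.0 and 1.0."""
    return sum(r.success for r in runs) / len(runs)


def avg_cost(runs: List[Run]) -> float:
    """Mean cost in USD per run."""
    return sum(r.cost_usd for r in runs) / len(runs)


def price_per_success(cost: float, rate: float) -> Optional[float]:
    """Avg Cost/Run divided by Success Rate; None (N/A) if nothing succeeded."""
    if rate == 0:
        return None
    return cost / rate


# Four made-up runs: three successes at $0.02 each, one failure at $0.04.
runs = [Run(True, 0.02), Run(True, 0.02), Run(True, 0.02), Run(False, 0.04)]
rate = success_rate(runs)   # 0.75
cost = avg_cost(runs)       # about 0.025
print(rate, cost, price_per_success(cost, rate))

# Using displayed leaderboard figures for openai:gpt-5-codex
# ($0.0811 per run, 80.0% success):
print(price_per_success(0.0811, 0.80))
```

The last call gives roughly 0.1014, against the $0.1013 shown in the leaderboard; the small gap presumably comes from the displayed inputs being rounded.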

Model Leaderboard

| Rank | Model | Success Rate | Avg Cost/Run | Price/Success | Avg Duration | Avg Tokens | Runs | Provider |
|---|---|---|---|---|---|---|---|---|
| 1 | openai:gpt-5.1(medium) | 82.2% | $0.0885 | $0.1076 | 58.33s | 51,824 | 45 | OpenAI |
| 2 | openai:gpt-5-codex | 80.0% | $0.0811 | $0.1013 | 59.10s | 47,451.956 | 45 | OpenAI |
| 3 | openai:o3 | 77.8% | $0.1161 | $0.1493 | 50.75s | 50,068.044 | 45 | OpenAI |
| 4 | openai:gpt-5-mini | 75.6% | $0.0153 | $0.0202 | 72.09s | 42,373.667 | 45 | OpenAI |
| 5 | openai:gpt-5.1-codex | 73.3% | $0.0673 | $0.0918 | 48.23s | 41,631.556 | 45 | OpenAI |
| 6 | openai:gpt-5 | 66.7% | $0.0862 | $0.1292 | 102.82s | 45,380.267 | 45 | OpenAI |
| 7 | openai:gpt-5.1 | 48.9% | $0.0342 | $0.0700 | 21.86s | 17,134.067 | 45 | OpenAI |
| 8 | openai:gpt-5.1-codex-mini | 33.3% | $0.0075 | $0.0225 | 25.89s | 20,605.356 | 45 | OpenAI |
| 9 | gemini-2.5-flash-preview-09-2025 | 31.1% | $0.0206 | $0.0661 | 32.33s | 22,980.822 | 45 | Google |
| 10 | gemini-2.5-pro-preview-06-05 | 27.1% | $0.0387 | $0.1427 | 36.92s | 8,245.333 | 48 | Google |
| 11 | gemini-2.5-pro-preview-05-06 | 26.7% | $0.1308 | $0.4904 | 49.85s | 19,753.911 | 45 | Google |
| 12 | gemini-2.5-pro-preview-03-25 | 22.9% | $0.1133 | $0.4942 | 57.17s | 16,393.313 | 48 | Google |
| 13 | openai:gpt-5-nano | 22.2% | $0.0030 | $0.0133 | 98.28s | 21,206.533 | 45 | OpenAI |
| 14 | gemini-2.5-pro | 20.0% | $0.0544 | $0.2722 | 32.94s | 14,255.511 | 45 | Google |
| 15 | gemini-2.5-flash | 13.3% | $0.0118 | $0.0887 | 10.15s | 6,990.467 | 90 | Google |
| 16 | qwen3-30b-a3b-thinking-2507-mlx | 10.3% | $0.0017 | $0.0168 | 92.27s | 8,166.795 | 39 | OSS |
| 17 | seed-oss-36b-instruct-mlx | 6.3% | $0.0009 | $0.0150 | 396.96s | 3,053.438 | 16 | OSS |
| 18 | gemini-2.5-flash-lite-preview-06-17 | 4.4% | $0.0008 | $0.0189 | 4.40s | 5,233.378 | 45 | Google |
| 19 | gemini-2.5-flash-preview-04-17 | 4.4% | $0.0233 | $0.5237 | 24.15s | 10,492.711 | 45 | Google |
| 20 | gemini-2.5-flash-preview-05-20 | 4.4% | $0.0080 | $0.1789 | 9.26s | 5,119.933 | 45 | Google |
| 21 | gemini-2.5-flash-lite | 3.3% | $0.0013 | $0.0382 | 5.90s | 9,506.689 | 90 | Google |
| 22 | bedrock:us.amazon.nova-premier-v1:0 | 2.2% | $0.0287 | $1.2916 | 58.14s | 7,519.867 | 45 | Amazon |
| 23 | gpt-oss-20b | 2.2% | $0.0002 | $0.0111 | 47.90s | 3,896.022 | 45 | OSS |
| 24 | us.amazon.nova-premier-v1:0 | 2.2% | $0.0287 | $1.2916 | 58.14s | 7,519.867 | 45 | Amazon |
| 25 | bedrock:us.amazon.nova-lite-v1:0 | 0.0% | $0.0003 | N/A | 24.21s | 3,090.2 | 45 | Amazon |
| 26 | bedrock:us.amazon.nova-micro-v1:0 | 0.0% | $0.0001 | N/A | 19.31s | 1,797.067 | 45 | Amazon |
| 27 | gemini-2.0-flash | 0.0% | $0.0003 | N/A | 4.21s | 1,325.667 | 60 | Google |
| 28 | bedrock:us.amazon.nova-pro-v1:0 | 0.0% | $0.0011 | N/A | 49.14s | 904.2 | 45 | Amazon |
| 29 | google/gemma-3-27b | 0.0% | $0.0007 | N/A | 120.44s | 6,954.467 | 45 | OSS |
| 30 | gemini-2.5-flash-lite-preview-09-2025 | 0.0% | $0.0011 | N/A | 5.68s | 5,687.822 | 45 | Google |
| 31 | llama-xlam-2-70b-fc-r | 0.0% | $0.0027 | N/A | 238.06s | 8,591.267 | 15 | OSS |
| 32 | magistral-small-2509-mlx | 0.0% | $0.0042 | N/A | 582.19s | 4,438.333 | 15 | OSS |
| 33 | qwen/qwen3-30b-a3b-2507 | 0.0% | $0.0006 | N/A | 23.04s | 4,277.978 | 45 | OSS |
| 34 | qwen3-coder-30b-a3b-instruct-mlx | 0.0% | $0.0005 | N/A | 21.61s | 4,383.356 | 45 | OSS |
| 35 | us.amazon.nova-lite-v1:0 | 0.0% | $0.0003 | N/A | 24.21s | 3,090.2 | 45 | Amazon |
| 36 | us.amazon.nova-micro-v1:0 | 0.0% | $0.0001 | N/A | 19.31s | 1,797.067 | 45 | Amazon |
| 37 | us.amazon.nova-pro-v1:0 | 0.0% | $0.0011 | N/A | 49.14s | 904.2 | 45 | Amazon |
| 38 | xlam-2-32b-fc-r | 0.0% | $0.0004 | N/A | 171.86s | 8,210.2 | 15 | OSS |

[Chart: Performance vs Efficiency Trade-offs]

[Chart: Performance by Difficulty Level]

[Chart: Token Usage Breakdown]

[Chart: Failure Analysis by Reason]

Last updated: November 26, 2025 at 01:20 AM UTC