Andrew Ginns

🧜‍♀️ Merbench - LLM Evaluation

Getting LLMs to consistently nail the Mermaid diagram syntax can be... an adventure.

Merbench evaluates an LLM's ability to autonomously write and debug Mermaid syntax. The agent can access an MCP server that validates its code and provides error feedback, guiding it towards a correct solution.
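The validate-and-retry loop described above can be sketched roughly as follows. This is a minimal illustration, not Merbench's actual implementation: `solve_with_feedback`, `ValidationResult`, `toy_generate`, and `toy_validate` are hypothetical names, the model and the MCP validation tool are replaced by plain Python callables, and the toy validator only checks a prefix rather than real Mermaid syntax. The default of five attempts mirrors the limit stated on this page.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class ValidationResult:
    """Outcome of one validation call (hypothetical shape, not the real tool's)."""
    ok: bool
    error: Optional[str] = None


def solve_with_feedback(
    generate: Callable[[Optional[str]], str],
    validate: Callable[[str], ValidationResult],
    max_attempts: int = 5,
) -> Tuple[bool, int]:
    """Ask for code, validate it, and feed errors back until success or the cap.

    Returns (succeeded, attempts_used).
    """
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        code = generate(feedback)      # model call, given the last error (if any)
        result = validate(code)        # stand-in for the validation tool call
        if result.ok:
            return True, attempt
        feedback = result.error        # guide the next attempt
    return False, max_attempts


# Toy stand-ins so the sketch runs without a model or a validation server.
def toy_validate(code: str) -> ValidationResult:
    if code.startswith("graph "):
        return ValidationResult(ok=True)
    return ValidationResult(ok=False, error="unknown diagram type")


def toy_generate(feedback: Optional[str]) -> str:
    # First attempt contains a typo; it is corrected once feedback arrives.
    return "graph TD; A-->B" if feedback else "grph TD; A-->B"


print(solve_with_feedback(toy_generate, toy_validate))  # (True, 2)
```

Passing the generator and validator in as callables keeps the loop independent of any particular model client or tool protocol.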

Each model is tested across three difficulty levels, with at most five attempts per test case. Performance is measured by the final success rate, averaged over complete runs, which reflects both the model's understanding of Mermaid syntax and its effective use of the validation tool.
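As a rough sketch of how such a success rate could be aggregated, the snippet below takes one pass/fail outcome per run and reports the fraction of passes for each model. The record layout, the `success_rates` helper, and the model names are assumptions made for illustration; the page does not document how Merbench stores or averages its results.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def success_rates(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Map each model to passed_runs / total_runs.

    Each record is (model_name, passed), where `passed` is the final outcome
    of a run after all of its allowed attempts.
    """
    totals: Dict[str, int] = defaultdict(int)
    passes: Dict[str, int] = defaultdict(int)
    for model, passed in records:
        totals[model] += 1
        passes[model] += int(passed)
    return {model: passes[model] / totals[model] for model in totals}


# Made-up outcomes for two hypothetical models (not Merbench data).
records = [
    ("model-a", True), ("model-a", False), ("model-a", False),
    ("model-b", False), ("model-b", False), ("model-b", False),
]
rates = success_rates(records)
for model, rate in rates.items():
    print(f"{model}: {rate:.1%}")
```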

Evaluation Summary

Total Evaluation Runs: 699
Models Evaluated: 13
Test Cases: 3

Providers Tested

Amazon, Google
Data updated: Jul 23, 2025

Model Leaderboard

Rank  Model                                Success Rate  Avg Cost/Run  Avg Duration  Avg Tokens  Runs  Provider
1     gemini-2.5-pro-preview-06-05         29.4%         $0.0383       36.84s        8,111.882   51    Google
2     gemini-2.5-pro-preview-05-06         26.7%         $0.1308       49.85s        19,753.911  45    Google
3     gemini-2.5-pro-preview-03-25         22.9%         $0.1133       57.17s        16,393.313  48    Google
4     gemini-2.5-flash                     13.3%         $0.0128       10.15s        6,990.467   45    Google
5     gemini-2.5-flash-lite-preview-06-17  5.0%          $0.0008       4.40s         4,974.583   60    Google
6     gemini-2.5-flash-preview-05-20       5.0%          $0.0101       9.75s         5,771.55    60    Google
7     gemini-2.5-flash-preview-04-17       4.4%          $0.0233       24.15s        10,492.711  45    Google
8     bedrock:us.amazon.nova-premier-v1:0  3.3%          $0.0356       63.19s        9,528.967   60    Amazon
9     gemini-2.5-flash-lite                2.2%          $0.0009       5.57s         5,586.156   45    Google
10    gemini-2.0-flash                     0.0%          $0.0003       4.21s         1,325.667   60    Google
11    bedrock:us.amazon.nova-pro-v1:0      0.0%          $0.0008       49.53s        678.15      60    Amazon
12    bedrock:us.amazon.nova-micro-v1:0    0.0%          $0.0001       18.83s        1,783.85    60    Amazon
13    bedrock:us.amazon.nova-lite-v1:0     0.0%          $0.0002       24.54s        2,799.317   60    Amazon
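The leaderboard can be cross-checked against the Evaluation Summary with a few lines of arithmetic. The sketch below transcribes only the model, run count, and provider columns from the leaderboard on this page; the variable names are arbitrary, and nothing here reflects how the dashboard itself computes its totals.

```python
from collections import Counter

# (model, runs, provider), transcribed from the leaderboard on this page.
leaderboard = [
    ("gemini-2.5-pro-preview-06-05", 51, "Google"),
    ("gemini-2.5-pro-preview-05-06", 45, "Google"),
    ("gemini-2.5-pro-preview-03-25", 48, "Google"),
    ("gemini-2.5-flash", 45, "Google"),
    ("gemini-2.5-flash-lite-preview-06-17", 60, "Google"),
    ("gemini-2.5-flash-preview-05-20", 60, "Google"),
    ("gemini-2.5-flash-preview-04-17", 45, "Google"),
    ("bedrock:us.amazon.nova-premier-v1:0", 60, "Amazon"),
    ("gemini-2.5-flash-lite", 45, "Google"),
    ("gemini-2.0-flash", 60, "Google"),
    ("bedrock:us.amazon.nova-pro-v1:0", 60, "Amazon"),
    ("bedrock:us.amazon.nova-micro-v1:0", 60, "Amazon"),
    ("bedrock:us.amazon.nova-lite-v1:0", 60, "Amazon"),
]

total_runs = sum(runs for _, runs, _ in leaderboard)
model_count = len(leaderboard)
models_per_provider = Counter(provider for _, _, provider in leaderboard)

print(f"Total runs: {total_runs}")        # Total runs: 699
print(f"Models evaluated: {model_count}")  # Models evaluated: 13
print(f"Providers: {sorted(models_per_provider)}")  # Providers: ['Amazon', 'Google']
```

The totals agree with the summary figures of 699 runs, 13 models, and two providers.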

[Chart: Performance vs Efficiency Trade-offs]

[Chart: Performance by Difficulty Level]

[Chart: Token Usage Breakdown]

[Chart: Failure Analysis by Reason]

Last updated: August 19, 2025 at 01:19 AM UTC