🧜‍♀️ Merbench - LLM Evaluation
Getting LLMs to consistently nail Mermaid diagram syntax can be... an adventure.
Merbench evaluates an LLM's ability to autonomously write and debug Mermaid syntax. The agent can access an MCP server that validates its code and provides error feedback, guiding it towards a correct solution.
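The validate-and-retry loop described above can be sketched roughly as follows. This is a minimal illustration, not Merbench's actual implementation: `solve_with_feedback`, `toy_generate`, and `toy_validate` are hypothetical names, the validator is a toy stand-in for the MCP server (it only checks the opening keyword, not real Mermaid grammar), and the attempt cap is passed in as a parameter.

```python
from typing import Callable, Optional, Tuple

# Hypothetical signatures; neither reflects Merbench's or any MCP library's real API.
Generate = Callable[[str, Optional[str]], str]   # (task, last_error) -> Mermaid source
Validate = Callable[[str], Tuple[bool, str]]     # source -> (is_valid, error_message)


def solve_with_feedback(task: str, generate: Generate, validate: Validate,
                        max_attempts: int = 5) -> Tuple[bool, int]:
    """Return (succeeded, attempts_used) for a single test case."""
    last_error: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        source = generate(task, last_error)
        ok, error = validate(source)
        if ok:
            return True, attempt
        last_error = error  # fed back to the model on the next attempt
    return False, max_attempts


def toy_validate(source: str) -> Tuple[bool, str]:
    # Toy stand-in for the validator: checks only the first keyword.
    if source.startswith(("flowchart", "graph")):
        return True, ""
    return False, "diagram must start with 'flowchart' or 'graph'"


def toy_generate(task: str, last_error: Optional[str]) -> str:
    # Toy stand-in for the model: wrong on the first try, fixed after feedback.
    if last_error is None:
        return "A --> B"
    return "flowchart TD\n    A --> B"


result = solve_with_feedback("two connected nodes", toy_generate, toy_validate)
print(result)  # (True, 2)
```

Keeping the generator and validator as plain callables means the same loop can be exercised with stubs like these, without a live model or server.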
Each model is tested across three difficulty levels, with a limit of five attempts per test case. Performance is measured by the final success rate, averaged over five complete runs, so it reflects both an understanding of Mermaid syntax and effective tool usage.
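One plausible reading of that metric, sketched below, is to compute the fraction of test cases solved in each complete run and then take the mean across runs. The function names and the outcome data are made up for illustration; the page does not spell out the exact aggregation.

```python
from statistics import mean
from typing import List


def run_success_rate(outcomes: List[bool]) -> float:
    """Fraction of test cases that ended in a valid diagram within one run."""
    return sum(outcomes) / len(outcomes)


def averaged_success_rate(runs: List[List[bool]]) -> float:
    """Mean of the per-run success rates across all complete runs."""
    return mean(run_success_rate(run) for run in runs)


# Made-up outcomes: five complete runs, three test cases each (one per difficulty level).
runs = [
    [True, True, False],
    [True, True, True],
    [True, False, False],
    [True, True, True],
    [True, True, False],
]

rate = averaged_success_rate(runs)
print(f"{rate:.1%}")  # 73.3%
```

Because every run here covers the same number of test cases, averaging per-run rates gives the same result as pooling all 15 outcomes.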
Evaluation Summary
Total Evaluation Runs: 105
Models Evaluated: 7
Test Cases: 3
Providers Tested
Data updated: Jul 3, 2025
Model Leaderboard
Rank | Model | Success Rate | Avg Cost/Run (USD) | Avg Duration | Avg Tokens | Runs | Provider |
---|---|---|---|---|---|---|---|
1 | gemini-2.5-pro-preview-06-05 | | $0.0087 | 46.89s | 8,693.733 | 15 | |
2 | gemini-2.5-pro-preview-05-06 | | $0.0461 | 77.49s | 46,132.333 | 15 | |
3 | gemini-2.5-pro-preview-03-25 | | $0.0379 | 100.73s | 37,934.067 | 15 | |
4 | gemini-2.5-flash | | $0.0128 | 12.85s | 12,838.467 | 15 | |
5 | gemini-2.5-flash-lite-preview-06-17 | | $0.0042 | 4.42s | 4,198.2 | 15 | |
6 | gemini-2.5-flash-preview-04-17 | | $0.0205 | 27.50s | 20,486.067 | 15 | |
7 | gemini-2.0-flash | | $0.0016 | 7.10s | 1,581.533 | 15 | |
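To compare models on efficiency rather than raw totals, the leaderboard columns can be turned into per-unit figures. The sketch below derives cost per 1,000 tokens and tokens per second from the average cost, duration, and token values listed above; the helper names are illustrative and not part of Merbench.

```python
from typing import List, Tuple

# (model, avg cost per run in USD, avg duration in seconds, avg tokens per run)
# Values copied from the leaderboard table.
rows: List[Tuple[str, float, float, float]] = [
    ("gemini-2.5-pro-preview-06-05", 0.0087, 46.89, 8693.733),
    ("gemini-2.5-pro-preview-05-06", 0.0461, 77.49, 46132.333),
    ("gemini-2.5-pro-preview-03-25", 0.0379, 100.73, 37934.067),
    ("gemini-2.5-flash", 0.0128, 12.85, 12838.467),
    ("gemini-2.5-flash-lite-preview-06-17", 0.0042, 4.42, 4198.2),
    ("gemini-2.5-flash-preview-04-17", 0.0205, 27.50, 20486.067),
    ("gemini-2.0-flash", 0.0016, 7.10, 1581.533),
]


def cost_per_1k_tokens(cost_usd: float, tokens: float) -> float:
    """Average cost in USD for every 1,000 tokens consumed."""
    return cost_usd / tokens * 1000


def tokens_per_second(tokens: float, duration_s: float) -> float:
    """Average token throughput over the whole run."""
    return tokens / duration_s


for model, cost, duration, tokens in rows:
    print(f"{model:<38} "
          f"${cost_per_1k_tokens(cost, tokens):.5f}/1k tok  "
          f"{tokens_per_second(tokens, duration):8.1f} tok/s")
```

In this snapshot every row works out to roughly $0.001 per 1,000 tokens, so the cost column tracks token usage almost exactly; the differences between models show up mainly in duration and total tokens.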
[Chart: Performance vs Efficiency Trade-offs]
[Chart: Performance by Difficulty Level]
[Chart: Token Usage Breakdown]
[Chart: Failure Analysis by Reason]
Last updated: July 4, 2025 at 03:09 PM UTC