CentralGauge - AL Code Benchmark for LLMs

Active

An open source benchmark for evaluating LLM performance on AL code generation for Microsoft Dynamics 365 Business Central, with 56 tasks across three difficulty tiers, real compilation, and test execution.

CentralGauge measures how well LLMs generate AL code for Business Central. Built in TypeScript on Deno, it runs its 56 tasks (Easy, Medium, Hard) against Docker-containerized BC environments.

Generated code is compiled in a real BC container and tested with actual test codeunits, not syntactic approximations. Scoring: 50 points for successful compilation, 30 for passing tests, and 20 for code patterns (10 for required patterns present, 10 for forbidden patterns absent). The pass threshold is 70 points. Models get a second attempt to fix compilation errors, at a 10-point penalty.
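The rubric above can be sketched as a small scoring function. This is an illustrative sketch only; the interface and names are hypothetical, not CentralGauge's actual API.

```typescript
// Hypothetical result shape for one benchmark task attempt.
interface AttemptResult {
  compiled: boolean;               // compiled in a real BC container
  testsPassed: boolean;            // test codeunits passed
  requiredPatternsFound: boolean;  // required code patterns present
  forbiddenPatternsAbsent: boolean; // no forbidden patterns used
  usedSecondAttempt: boolean;      // needed a retry to fix compile errors
}

const PASS_THRESHOLD = 70;

function score(r: AttemptResult): number {
  let points = 0;
  if (r.compiled) points += 50;                // 50: compilation
  if (r.testsPassed) points += 30;             // 30: passing tests
  if (r.requiredPatternsFound) points += 10;   // 10: required patterns
  if (r.forbiddenPatternsAbsent) points += 10; // 10: forbidden patterns absent
  if (r.usedSecondAttempt) points -= 10;       // 10-point retry penalty
  return points;
}

function passes(r: AttemptResult): boolean {
  return score(r) >= PASS_THRESHOLD;
}
```

For example, a model that compiles and matches all patterns on its second attempt but fails the tests scores 50 + 10 + 10 - 10 = 60 and does not pass.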

Supports OpenAI, Anthropic, Google Gemini, Azure OpenAI, OpenRouter (200+ models), and local Ollama instances. Each run tracks token usage and cost, and results are stored in SQLite for historical comparison. Reports are generated in HTML and JSON.