I Rewrote My Tree-Sitter Grammar From Scratch. It Got 10x Smaller.

Update (less than 24 hours later): Every single error is fixed. 100% of production files parse successfully. Read more in the “What’s left” section below.

The parser.c file was 106 MB. Bigger than most of the projects it was meant to parse. GitHub’s per-file limit is 100 MB. I couldn’t push it.

That’s the short version of why tree-sitter-al v2 exists. The long version involves 291 property rules, an external scanner, and a single-session ground-up rewrite that cut the grammar from 8,500 lines to about 3,100.

The numbers

Metric	Before	After
parser.c	106 MB (can’t push)	10.6 MB
Errors	14	0
Success rate	99.91%	100%
Symbols	2,249	~740
States	29,126	~5,300
grammar.js	8,500 lines	~3,100 lines
Tests	1,225	1,404
Keywords	invisible in queries	82 named nodes
Query files	3 (partial)	5 (comprehensive)

Same language. Same 15,358 production files for validation. Fewer errors, smaller parser, more tests.

How the old grammar painted itself into a corner

The original grammar grew organically over months. Every AL property got its own rule: caption_property, editable_property, source_table_property, and so on. 291 of them. Each one checked, at parse time, which values the property could accept. Caption only accepts strings. Editable only accepts booleans.

No other tree-sitter grammar does this. There’s a reason for that. It’s the compiler’s job to validate property values, not the parser’s. But the approach worked, and incremental changes kept the error count low.

The problem surfaced when I tried to make keywords visible in the parse tree. Tree-sitter queries (for syntax highlighting, code navigation, symbol search) match named nodes, or anonymous nodes by their literal string. Keywords created with regex-based kw() are neither: a regex token has no fixed literal to match, so it is invisible to queries. So procedure, if, table, begin are all opaque tokens that no query can target.

I added 80 named keyword rules. Parser worked fine, no regressions. But parser.c grew from 95 MB to 106 MB. Past the GitHub limit.

The real issue was the 2,249 symbols. Each named keyword added states, and the grammar was already at tree-sitter’s practical limits.

One C function to replace 291 rules

The reason the old grammar needed 291 property rules was an LR(1) parsing limitation. A generic identifier = value ; property rule conflicts with identifier : type ; variable declarations. The parser sees identifier and can’t decide which path to take with just one token of lookahead.

I solved this with an external scanner. When the parser state allows both properties and variables, the scanner reads ahead past the identifier:

1. Match the identifier.
2. Skip whitespace.
3. If the next character is = (and not part of := or ==): emit PROPERTY_NAME.
4. Otherwise: let the grammar handle it as a regular identifier.
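
The steps above can be modelled in a few lines. The real scanner is a C function working on tree-sitter's lexer; this is only a Python sketch of the decision it makes. The function name, the two token constants, and the simplified identifier pattern are assumptions for illustration and are not taken from the actual scanner.

```python
import re

# Hypothetical token names for this sketch; the real scanner's symbols may differ.
PROPERTY_NAME = "PROPERTY_NAME"
IDENTIFIER = "IDENTIFIER"

# Simplified identifier pattern (assumption): bare names or "quoted names".
_IDENT = re.compile(r'[A-Za-z_][A-Za-z0-9_]*|"[^"\n]*"')


def classify_identifier(source, pos=0):
    """Return (token_kind, end_offset) for the identifier at pos, or None."""
    match = _IDENT.match(source, pos)
    if match is None:
        return None
    end = match.end()

    # Peek past whitespace without making it part of the token.
    look = end
    while look < len(source) and source[look] in " \t\r\n":
        look += 1

    # A lone '=' means "Name = Value;". A following '=' ('==') is rejected.
    # ':' and ':=' never match because the next character is ':' rather than '='.
    is_property = (
        look < len(source)
        and source[look] == "="
        and source[look + 1:look + 2] != "="
    )
    return (PROPERTY_NAME if is_property else IDENTIFIER, end)
```

Called on Caption = 'Hello'; this reports a property name. On Result: Decimal; or Result := 1; it falls through to a regular identifier, which is the distinction a single token of LR(1) lookahead cannot make.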

One C function. Trivial lookahead. It replaces 291 grammar rules with one generic property rule that accepts any Name = Value ;. The ~20 properties with genuinely unique syntax (CalcFormula, TableRelation, Permissions, etc.) keep their dedicated rules. Everything else just works.

Building from zero

I didn’t try to refactor v1. I started with an empty grammar.js and built up in phases, validating against all 15,358 production files after each one.

Phase	What	Success Rate	parser.c
1	Scaffold + 19 object types	~0%	203 KB
2	Generic property + complex properties	~5%	289 KB
3	Fields, keys, sections, types	~10%	917 KB
4	Procedures, triggers, variables	~15%	1.2 MB
5	Statements & expressions	33.4%	2.4 MB
6	Preprocessor, extensions, views	87.1%	4.6 MB
7	Edge cases, property values	97.9%	5.4 MB
8	Split constructs, fragmented if-else	99.95%	9.4 MB
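
A per-phase check like this can be sketched as a small harness. This is a guess at the shape of the validation loop, not the script actually used; load_al_files, success_rate, and the parses_cleanly callback are hypothetical names. With py-tree-sitter the callback would typically be something like lambda src: not parser.parse(src).root_node.has_error.

```python
from pathlib import Path
from typing import Callable, Dict, List, Tuple


def load_al_files(root: str) -> Dict[str, bytes]:
    """Read every .al file under root into memory, keyed by path."""
    return {str(p): p.read_bytes() for p in sorted(Path(root).rglob("*.al"))}


def success_rate(
    sources: Dict[str, bytes],
    parses_cleanly: Callable[[bytes], bool],
) -> Tuple[float, List[str]]:
    """Return (percent of files that parse without errors, names of failing files)."""
    failed = [name for name, src in sources.items() if not parses_cleanly(src)]
    if not sources:
        # No input files: report 0% rather than dividing by zero.
        return 0.0, failed
    return 100.0 * (len(sources) - len(failed)) / len(sources), failed
```

Keeping the parser behind a callback means the same harness can be rerun after every phase, and the list of failing files shows what to tackle next. With 7 failures out of 15,358 files it reports 99.95% when rounded to two decimals.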

The entire rewrite happened in a single session, with Claude as the copilot. Months of incremental work, replaced in one sitting. Either a testament to better architecture or a warning about sunk cost. Phase 5 to 6 was the biggest jump: preprocessor support took it from 33% to 87% in one phase.

The begin/end discovery

Update (less than 24 hours later): Everything below is wrong. I fixed it. See the update at the end.

One thing that surprised me: begin and end can never be named nodes. Not as named rules, not with alias() (named or anonymous). Any naming mechanism changes the internal token type, and that breaks GLR backtracking in preprocessor-split constructs.

When AL code has begin inside a #if branch and the matching end elsewhere, the parser needs to backtrack through multiple alternative parse paths. Named tokens trigger different error recovery behavior that inserts MISSING tokens instead of trying alternatives. This is a fundamental tree-sitter limitation, not something you can work around in the grammar.

~~So 80 out of 82 keywords are named nodes. The two that refuse? begin and end. The most fundamental pair in the language. Every highlighting query needs to match them as literal strings instead.~~

Turns out, you can work around it. The trick is to make the parser context-aware: in normal code, begin and end are fully visible to tooling. Inside #if blocks, the parser backs off and treats them as plain text so it doesn’t get confused by split code paths. All 82 keywords now work with syntax highlighting and code navigation.
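
The post does not show the actual mechanism, so what follows is only a toy model of the idea, assuming the context is tracked as a count of currently open #if blocks. The function name and the "named"/"plain" labels are made up for illustration, and the sketch ignores strings and comments.

```python
import re

# Words and preprocessor directives such as "#if" or "#endif".
_WORD = re.compile(r"#?[A-Za-z_][A-Za-z0-9_]*")


def classify_begin_end(source):
    """Return (word, kind) for each begin/end in source.

    kind is "named" outside any #if block and "plain" inside one.
    """
    depth = 0  # number of currently open #if blocks
    result = []
    for match in _WORD.finditer(source):
        word = match.group().lower()  # AL keywords are case-insensitive
        if word == "#if":
            depth += 1
        elif word == "#endif":
            depth = max(depth - 1, 0)
        elif word in ("begin", "end"):
            result.append((word, "plain" if depth else "named"))
    return result
```

In this model a begin at the top level is reported as a named node, while a begin between #if and #endif is reported as plain text, so split code paths never involve the named tokens.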

Now available everywhere

V2 is up on PyPI and npm:

Python (tree-sitter 0.24+):

pip install tree-sitter-al

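import tree_sitter
import tree_sitter_al

# Wrap the grammar's language pointer and build a parser for it.
lang = tree_sitter.Language(tree_sitter_al.language())
parser = tree_sitter.Parser(lang)

# parse() takes bytes. str(node) gives the S-expression; recent py-tree-sitter releases no longer have Node.sexp().
tree = parser.parse(b'codeunit 50100 MyCodeunit { }')
print(str(tree.root_node))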

Node.js:

npm install tree-sitter-al

Pre-built binaries for all platforms (WASM, Linux, Windows, macOS) are available from GitHub Releases.

The Python package is the one I’m most excited about. I’m already using it in code-graph-rag for graph-based AL code analysis. Any Python tool that speaks tree-sitter can pick it up the same way.

People are also using the grammar for editor support. Some developers write AL extensions entirely in Neovim, and tree-sitter-al gives them proper syntax highlighting, indentation, and code folding that previously only existed in VS Code. If you’re building an editor integration or a language server, the query files are ready to drop in.

The grammar has also been accepted into the tree-sitter-language-pack, which means any tool that uses the pack gets AL support automatically.

~~What’s left~~ What was left

Update: This section aged poorly. Within 24 hours of publishing, every single error was fixed.

Seven files still fail to parse. All seven have the same pattern: begin inside one #if branch with the matching end in a completely separate #if block. Fixing this would require the scanner to look ahead through entire #if blocks to match begin/end pairs, which is a lot of scanner complexity for 0.05% of files.

~~99.95% on production code is the line I’m comfortable shipping at. But if you want to take a crack at the scanner problem, I’d love the help. The repo is open.~~

Zero files fail. 15,358 out of 15,358. 100%.

It turns out “a lot of complexity” was one extra counter. The seven errors fell into five distinct patterns. None of them required anything novel. I just had to find the right approach.

The lesson: “fundamental limitation” is often “I haven’t found the trick yet.”

Try it

The grammar is open source: github.com/SShadowS/tree-sitter-al

Here’s what the parser sees when it reads a simple AL procedure:

codeunit 50100 MyCodeunit
{
    procedure GetTotal(Count: Integer): Decimal
    var
        Result: Decimal;
    begin
        Result := Count * 1.5;
        exit(Result);
    end;
}

If you’re building AL tooling, code analysis, or IDE features, this handles ~~effectively~~ all production AL code. Every file. I built the query files too: syntax highlighting, code navigation, scope analysis, indentation, and folding, all out of the box.