
Why Parser Speed Matters for OLTP
MemCP is an in-memory, MySQL-compatible database. It handles two kinds of workloads: analytics queries that scan millions of rows, and OLTP queries — small INSERTs, UPDATEs, DELETEs, and SELECTs coming in rapid fire from applications.
For analytics queries, parsing a 1000-character SELECT takes milliseconds against a query that runs for seconds or minutes. Nobody cares. For OLTP, every microsecond spent in the parser is a microsecond added to every request. A web application doing 10,000 INSERTs/s spends more time parsing SQL than executing it if the parser is slow.
The queries that matter for parser performance are the ones that come in high volume: single-row inserts, point updates, key lookups. And bulk statements — INSERT INTO t VALUES (...) with hundreds or thousands of row tuples, or SELECT a,b,c,... FROM t with long column lists — where the parser processes the same tuple or list grammar thousands of times in a single call.
Packrat Parsing and Why MemCP Uses It
MemCP’s SQL parser is built on go-packrat, a packrat parser combinator library. A packrat parser works by combining small parsers (atoms, regex matchers) into larger ones through combinators: And (sequence), Or (alternatives), Kleene (repetition). The grammar for INSERT INTO t VALUES (1,'a'), (2,'b') is built from:
insertStmt = And(INSERT, INTO, tableName, VALUES, tupleList)
tupleList = Many(tuple, comma)
tuple = And(lparen, valueList, rparen)
valueList = Many(value, comma)
value = Or(integer, string, null, ...)
Packrat parsing guarantees O(n) time through memoization: every parser result at every input position is cached, so no work is done twice. This matters for grammars with backtracking — if Or tries alternative A and it fails at position 50, the partial results from positions 0-49 are cached and reused when alternative B is tried.
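To make the memoization concrete, here is a minimal sketch of the packrat idea in Go. The names (`Parser`, `Result`, `Scanner`, `Apply`) are illustrative only and do not mirror go-packrat's actual API; left-recursion handling is omitted.

```go
package packratsketch

// A minimal sketch of packrat memoization, not go-packrat's implementation.

type Result struct {
	Matched bool
	End     int // input position after the match
}

type Parser interface {
	Match(s *Scanner, pos int) Result
}

type Scanner struct {
	Input string
	memo  []map[Parser]Result // result cache: position -> parser -> result
}

func newScanner(input string) *Scanner {
	return &Scanner{Input: input, memo: make([]map[Parser]Result, len(input)+1)}
}

// Apply runs p at pos, consulting the cache first, so backtracking
// never repeats work that was already done at this position.
func (s *Scanner) Apply(p Parser, pos int) Result {
	if r, ok := s.memo[pos][p]; ok {
		return r // cache hit: O(1), no re-parse
	}
	r := p.Match(s, pos)
	if s.memo[pos] == nil {
		s.memo[pos] = make(map[Parser]Result)
	}
	s.memo[pos][p] = r
	return r
}
```

Every (parser, position) pair is computed at most once; when an Or alternative fails and another is tried, the already-parsed prefix becomes a series of cache hits.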
MemCP uses packrat combinators instead of a generated parser (like yacc) for a specific reason: extensibility. MemCP’s Scheme runtime lets libraries define new SQL syntax by constructing parser trees at runtime:
(define my-parser (sql-and (sql-atom "CUSTOM") (sql-atom "KEYWORD") sql-expr))
(register-syntax "custom_stmt" my-parser)
A generated parser would require a build step and recompilation. Combinator parsers are data structures — they can be constructed, composed, and extended at runtime.
The Problem
Profiling a bulk INSERT showed the parser responsible for tens of thousands of heap allocations per query. Each allocation is cheap individually, but at OLTP volumes they add up: GC pressure, cache misses, wasted cycles.
We optimized go-packrat over four releases, v2.1.16 through v2.1.19. The results were surprising — reducing allocations doesn't always make things faster. In the end, choosing the right changes got us a 2x speedup.
Measurements
All measurements: (time (parse_sql ...)) in MemCP, median of multiple runs. The input is INSERT INTO t VALUES('a','b'), ... with 3, 100, 1000, and 10000 rows.
Absolute times:
| Version | x3 | x100 | x1000 | x10000 |
|---|---|---|---|---|
| v2.1.15 (baseline) | 1143 us | 14.6 ms | 150.9 ms | 1267.6 ms |
| v2.1.16 | 542 us | 10.2 ms | 78.9 ms | 727.4 ms |
| v2.1.17 | 639 us | 10.6 ms | 99.7 ms | 1013.2 ms |
| v2.1.18 | 978 us | 22.7 ms | 214.1 ms | 2115.6 ms |
| v2.1.19 | 577 us | 9.0 ms | 73.0 ms | 675.8 ms |
Relative to v2.1.15 baseline:
| Version | x3 | x100 | x1000 | x10000 |
|---|---|---|---|---|
| v2.1.15 | 1.0x | 1.0x | 1.0x | 1.0x |
| v2.1.16 | 2.1x | 1.4x | 1.9x | 1.7x |
| v2.1.17 | 1.8x | 1.4x | 1.5x | 1.3x |
| v2.1.18 | 1.2x | 0.6x | 0.7x | 0.6x |
| v2.1.19 | 2.0x | 1.6x | 2.1x | 1.9x |
Per-row cost (median / row count):
| Version | x3 | x100 | x1000 | x10000 |
|---|---|---|---|---|
| v2.1.15 | 381 us | 146 us | 151 us | 127 us |
| v2.1.16 | 181 us | 102 us | 79 us | 73 us |
| v2.1.17 | 213 us | 106 us | 100 us | 101 us |
| v2.1.18 | 326 us | 227 us | 214 us | 212 us |
| v2.1.19 | 192 us | 90 us | 73 us | 68 us |
v2.1.19 achieves the best per-row cost at scale: 68 us/row at 10,000 rows, 7-11% faster than v2.1.16 and nearly 2x the baseline. Its per-row cost drops with scale (192 -> 68 us), showing good amortization of fixed overhead.
What Changed — and What We Learned
v2.1.16: Replacing Expensive Operations (2x speedup)
Seven internal changes, all pure wins:
- Keyword matching without regex. Every SQL keyword was matched via `(?i)^SELECT` with `FindStringSubmatch`. Replaced with `strings.EqualFold` for case-insensitive matching and `==` for case-sensitive matching. With 313 atom parsers in the grammar, this was the highest-impact single change (sketched below).
- Word boundaries as `[]bool` instead of `map[int]bool`. A flat slice indexed by position: one allocation instead of thousands of map buckets.
- Whitespace skip fast path. A byte check (`<= ' '` or `== '/'`) gates the regex call. ~80% of positions have no whitespace, so `Skip()` becomes a single comparison.
- Flat memoization. `map[int]map[Parser]*MemoEntry` flattened to `[]map[Parser]*MemoEntry`. One array index replaces one hash lookup for the outer dimension.
- Lr object pooling. Left-recursion markers recycled via `sync.Pool`.
- Slice pre-allocation. `And`/`Kleene`/`ManyParser` pre-allocate result slices instead of growing from nil.
- Specialized regex matchers. MemCP's 9 regex patterns are recognized at construction time and replaced with hand-written `func(string) int` matchers. Identifiers use a 256-bit bitmap lookup. Integers and floats use byte loops. String bodies use a backslash-skip loop. `regexp.Regexp` is never compiled for these patterns.
Every one of these replaces an expensive operation with a cheaper one. The replacement is faster AND allocates less.
Best of all, the public interface didn't change: users still declare their grammar with regex patterns, and common patterns are transparently replaced with faster hand-written matchers.
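To make the keyword match and the whitespace fast path concrete, here is a minimal sketch of the idea. Function names (`matchKeyword`, `skipWhitespace`) and details are assumptions for illustration, not the library's source.

```go
package fastpathsketch

import "strings"

// matchKeyword replaces the old (?i)^SELECT-style regex with a bounds check
// plus strings.EqualFold — no regexp machinery, no allocations. The real
// parser additionally checks a word boundary so "SELECTED" doesn't match "SELECT".
func matchKeyword(input string, pos int, keyword string) bool {
	end := pos + len(keyword)
	if end > len(input) {
		return false
	}
	return strings.EqualFold(input[pos:end], keyword)
}

// skipWhitespace gates the expensive path behind a single byte comparison.
// ~80% of positions have nothing to skip, so the common case is one compare.
func skipWhitespace(input string, pos int) int {
	if pos >= len(input) || (input[pos] > ' ' && input[pos] != '/') {
		return pos // fast path: no whitespace, no comment
	}
	for pos < len(input) && input[pos] <= ' ' {
		pos++
	}
	// comment skipping ("/* ... */", "-- ...") would follow here in the real parser
	return pos
}
```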
v2.1.17-18: The Allocation Trap (1.4-2.9x regression)
These versions pursued allocation reduction further — replacing the inner map[Parser]*MemoEntry with a linked list through slab-allocated entries (v2.1.17), then packing the per-position data into a uint32 with multi-slab indirection (v2.1.18).
Allocations dropped from ~630 to ~30 (v2.1.17) to ~7 (v2.1.18). Real-world performance went in the opposite direction.
What went wrong:
- `MemoEntry` grew from 40 to 64 bytes (added `rule` and `nextMemo` fields for the linked list). Entries per cache line dropped from 1.6 to 1.0.
- Linked-list traversal replaced hash lookup. Go's small maps (2-5 entries) use contiguous bucket arrays — one cache line load. The linked list follows pointers to slab entries allocated in DFS parse-tree order, not grouped by position. Each hop risks a cache miss.
- Double indirection in v2.1.18. Every memo access became `s.memoSlabs[idx>>8][idx&0xFF]` — two pointer dereferences plus arithmetic, where a direct `*MemoEntry` pointer was one (both layouts are sketched below).
- The regression scales with input size. At 3 rows, the working set fits in L1/L2 regardless — v2.1.18 is 1.8x slower. At 10,000 rows, the working set exceeds L2 and every cache miss pays the full penalty — v2.1.18 is 2.9x slower.
The per-row cost confirms this. v2.1.16’s drops from 181 to 73 us/row (fixed overhead amortizes, marginal cost is low). v2.1.18’s stays flat at ~212-227 us/row — per-access overhead that scales with the working set.
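For reference, the two layouts look roughly like this. The field names (`rule`, `nextMemo`, `memoSlabs`) follow the description above, but the code is schematic, not the library's exact source.

```go
package memosketch

type Parser interface{ Match(input string, pos int) int }

type MemoEntry struct {
	rule     Parser
	end      int
	nextMemo uint32 // v2.1.17/18 only: link to the next entry for this position
}

// v2.1.16 / v2.1.19 layout: flat slice over positions, small map per position.
// Lookup is one array index plus one hash probe into a contiguous bucket array.
type mapMemo struct {
	memo []map[Parser]*MemoEntry
}

func (m *mapMemo) lookup(pos int, p Parser) *MemoEntry {
	return m.memo[pos][p]
}

// v2.1.18 layout: a packed uint32 head per position, entries stored in
// 256-element slabs and linked in parse order (index 0 is reserved as the
// empty marker in this sketch). Every hop costs two dereferences plus
// arithmetic, and consecutive hops rarely share a cache line.
type slabMemo struct {
	heads     []uint32
	memoSlabs [][]MemoEntry
}

func (m *slabMemo) lookup(pos int, p Parser) *MemoEntry {
	for idx := m.heads[pos]; idx != 0; {
		e := &m.memoSlabs[idx>>8][idx&0xFF]
		if e.rule == p {
			return e
		}
		idx = e.nextMemo
	}
	return nil
}
```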
v2.1.19: Combining the Best (new fastest)
Reverted to v2.1.16’s map-based memoization. Kept the additive features from v2.1.17-18 that don’t affect memo access patterns:
- Combinator buffer reuse (v2.1.17): `And`/`Kleene`/`ManyParser` reuse a `[]T` buffer across non-recursive calls, detected by a depth counter. Avoids per-call slice allocations.
- FindStringIndex (v2.1.17): `MatchRegexp` uses `FindStringIndex` instead of `FindStringSubmatch`, avoiding a `[]string` allocation per regex match.
- CharMap dispatch (v2.1.18): `OrParser.SetCharMap()` installs a first-byte lookup table. After skipping whitespace, only the 1-2 sub-parsers matching the current byte are tried instead of all alternatives (sketched below).
- NoMemo bypass (v2.1.18): `KleeneParser` and `ManyParser` gained a `NoMemo` flag that calls `Match()` directly, bypassing the memo table entirely.
- Scanner.Reset() (v2.1.18): reinitializes a Scanner for a new input, reusing allocated slices.
The result is 7-11% faster than v2.1.16 at 100+ rows. The buffer reuse, FindStringIndex, and CharMap dispatch each shave a small amount off the per-row cost without affecting cache behavior.
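The CharMap idea, schematically: the types and method names below (`subParser`, `FirstBytes`, `orDispatch`) are illustrative, not `OrParser`'s actual interface.

```go
package dispatchsketch

// An Or node that dispatches on the first byte of the input instead of
// trying every alternative in order.

type subParser interface {
	Match(input string, pos int) int // new position, or -1 on failure
	FirstBytes() []byte              // bytes this alternative can start with
}

type orDispatch struct {
	charMap [256][]subParser // byte value -> alternatives that can start with it
}

func newOrDispatch(alts []subParser) *orDispatch {
	o := &orDispatch{}
	for _, a := range alts {
		for _, b := range a.FirstBytes() {
			o.charMap[b] = append(o.charMap[b], a)
		}
	}
	return o
}

func (o *orDispatch) Match(input string, pos int) int {
	if pos >= len(input) {
		return -1
	}
	// Only the 1-2 alternatives that can start with input[pos] are tried.
	for _, a := range o.charMap[input[pos]] {
		if end := a.Match(input, pos); end >= 0 {
			return end
		}
	}
	return -1
}
```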
The Lesson: Allocations Are a Proxy Metric
The allocation count across versions tells a misleading story:
| Version | Allocs (synthetic, 200 tuples) | Real-world speed |
|---|---|---|
| v2.1.15 | ~6,800 | slowest |
| v2.1.16 | ~630 | fast |
| v2.1.17 | ~30 | slower |
| v2.1.18 | ~7 | slowest since baseline |
| v2.1.19 | ~630 | fastest |
v2.1.19 has the same allocation count as v2.1.16 and is the fastest version. v2.1.18 has 99% fewer allocations and is the slowest since baseline.
What matters is memory access latency, not allocation count. Go’s allocator and GC handle short-lived objects efficiently — allocating a small map costs ~200ns, and the GC reclaims it for free if it dies young. Replacing that map with a slab-allocated linked list eliminates the allocation but makes every subsequent lookup slower by polluting cache lines with scattered, pointer-chased data.
For hot-loop data structures, spatial locality matters more than allocation count. A small Go map with 2-5 entries has excellent locality: one contiguous bucket array, O(1) lookup, fits in a cache line. That’s hard to beat with hand-rolled data structures.
Does the Grammar Even Need Memoization?
Packrat memoization guarantees O(n) by preventing redundant re-parsing during backtracking. But SQL is essentially LL(1) — each construct is determined by its first token. The top-level dispatch (SELECT vs INSERT) benefits from memoization: ~50 lookups per query. The Kleene/Many inner loop processing value tuples or column lists does not — it advances position-by-position, never revisiting. Every memo entry created in the loop is written once and never read.
For a 10,000-row INSERT, the inner loop creates tens of thousands of memo entries that are never reused — megabytes of write-only data occupying cache lines.
The NoMemo flag addresses this. When set, Kleene/ManyParser call Match() directly, bypassing applyRule() and the memo table entirely. This turns the inner loop into pure recursive descent — no memo overhead, no cache pollution.
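Schematically, the repetition loop changes like this. The types here are illustrative stand-ins; the real `KleeneParser`/`ManyParser` and `applyRule` live in go-packrat.

```go
package nomemosketch

type result struct {
	matched bool
	end     int
}

type parser interface {
	match(input string, pos int) result
}

type kleene struct {
	sub       parser
	noMemo    bool                           // bypass flag from v2.1.18/19
	applyRule func(p parser, pos int) result // memoized packrat path provided by the scanner
}

// matchLoop consumes as many repetitions of sub as possible. With noMemo,
// each iteration is a direct call: no memo entry is written, so a
// 10,000-tuple VALUES list no longer fills the cache with write-only data.
func (k *kleene) matchLoop(input string, pos int) result {
	for {
		var r result
		if k.noMemo {
			r = k.sub.match(input, pos) // pure recursive descent
		} else {
			r = k.applyRule(k.sub, pos) // one write-only memo entry per iteration
		}
		if !r.matched || r.end == pos { // stop on failure or empty match
			break
		}
		pos = r.end
	}
	return result{matched: true, end: pos}
}
```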
A synthetic SELECT col_0, col_1, ... col_N FROM t benchmark inside go-packrat shows the effect clearly:
| Columns | NewScanner | Reset | Reset + NoMemo |
|---|---|---|---|
| 3 | 47 allocs / 4.6 us | 35 allocs / 2.0 us | 21 allocs / 1.5 us |
| 10 | 89 allocs / 8.9 us | 77 allocs / 4.3 us | 21 allocs / 2.1 us |
| 50 | 329 allocs / 34 us | 317 allocs / 19 us | 21 allocs / 4.9 us |
| 200 | 1229 allocs / 122 us | 1217 allocs / 71 us | 21 allocs / 15 us |
| 1000 | 6029 allocs / 507 us | 6018 allocs / 393 us | 21 allocs / 74 us |
With NoMemo, the allocation count is constant at 21 regardless of column count — the column list loop allocates nothing. The 21 allocs are structural overhead (Scanner construction, the SELECT and FROM keyword lookups). At 1000 columns, NoMemo is 6.9x faster than the default path and scales perfectly linearly: 74 ns/column with no cache degradation.
The same applies to INSERT value lists, Kleene repetitions, and any other Many/Kleene loop over non-left-recursive sub-parsers. Combined with map-based memoization for the structural parse, this gives:
- Structural parse (statement dispatch, clauses): memoization, small working set, benefits from caching
- Repetition loops (value tuples, column lists): direct calls, zero memo overhead
NoMemo and Scanner.Reset() require caller-side changes (setting the flag, pooling Scanners). These are available in v2.1.19 but not yet activated in MemCP. When they are, the per-query allocation count will drop further without the cache locality penalty.
What’s Next
The last remaining source of per-query allocations is the callback interface. The merge callback func(string, ...T) T has no access to per-query state. MemCP’s callbacks allocate result objects on the heap because there’s no way to pass a per-query arena through the parser.
Planned: a UserData any field on Scanner, accessible from callbacks. This lets the callback use a per-query arena allocator, replacing individual heap allocations with bulk arena allocation.
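A hypothetical sketch of what such a per-query arena could look like once callbacks can reach it through `Scanner.UserData`. Everything here is an assumption — the field is planned, not shipped — and the usage comment shows intent only.

```go
package arenasketch

const chunkSize = 1024

type node struct {
	kind     string
	children []*node
}

// arena is a simple chunked bump allocator: nodes are handed out from
// fixed-size chunks, so pointers stay valid and one allocation serves
// ~1000 callback results instead of one heap allocation each.
type arena struct {
	chunks [][]node
}

func (a *arena) newNode(kind string) *node {
	if n := len(a.chunks); n == 0 || len(a.chunks[n-1]) == cap(a.chunks[n-1]) {
		a.chunks = append(a.chunks, make([]node, 0, chunkSize))
	}
	last := len(a.chunks) - 1
	a.chunks[last] = append(a.chunks[last], node{kind: kind})
	return &a.chunks[last][len(a.chunks[last])-1]
}

// Inside a merge callback, the arena would be fetched from the scanner
// (hypothetical API) and reused across the whole query:
//
//	arena := scanner.UserData.(*arena)
//	return arena.newNode("insert_tuple")
```

Between queries the arena can be truncated and reused, turning thousands of short-lived result allocations into a handful of chunk allocations.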
Outlook: JIT for the Parser
The next logical step is to apply the MemCP JIT not only to scan loops, but also to the packrat parser objects themselves.
Today, the parser is built from combinators (And, Or, Many, Atom) that are interpreted at runtime. Even after heavy optimization, this is still a generic dispatch mechanism.
In the future, frequently used grammar paths — such as INSERT, SELECT, value lists, or identifiers — could be compiled at runtime into specialized matching loops. This would allow:
- embedding static rule structure as immediates
- turning `Or` alternatives into direct first-byte dispatch
- emitting `Many` as compact native loops
- eliminating unnecessary memo lookups
The result would be a transition from a flexible combinator interpreter to a specialized native state machine — without sacrificing the runtime extensibility that makes the packrat approach powerful.
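As a thought experiment, the compiled output for a value grammar might resemble hand-written code like the following — purely conceptual, written by hand here; nothing like it exists in MemCP or go-packrat today.

```go
package jitsketch

import "strings"

// matchValue is what an Or over (string | integer | NULL) could compile to:
// a direct first-byte switch plus tight byte loops instead of combinator dispatch.
// Returns the position after the value, or -1 if nothing matches.
func matchValue(input string, pos int) int {
	if pos >= len(input) {
		return -1
	}
	switch c := input[pos]; {
	case c == '\'': // string literal
		return matchStringLiteral(input, pos)
	case c >= '0' && c <= '9', c == '-': // integer
		return matchNumber(input, pos)
	case c == 'N' || c == 'n': // NULL
		if pos+4 <= len(input) && strings.EqualFold(input[pos:pos+4], "NULL") {
			return pos + 4
		}
		return -1
	default:
		return -1
	}
}

func matchStringLiteral(input string, pos int) int {
	for i := pos + 1; i < len(input); i++ {
		switch input[i] {
		case '\\':
			i++ // skip the escaped character
		case '\'':
			return i + 1
		}
	}
	return -1
}

func matchNumber(input string, pos int) int {
	i := pos
	if i < len(input) && input[i] == '-' {
		i++
	}
	start := i
	for i < len(input) && input[i] >= '0' && input[i] <= '9' {
		i++
	}
	if i == start {
		return -1
	}
	return i
}
```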
The parser speedup was step one.
A parser JIT could be step two.
Try It
go-packrat is open source under GPLv3 (with custom licensing available):
go get github.com/launix-de/go-packrat/v2@v2.1.19
MemCP is at github.com/launix-de/memcp.