
Why Parser Speed Matters for OLTP
MemCP is an in-memory, MySQL-compatible database. It handles two kinds of workloads: analytics queries that scan millions of rows, and OLTP queries — small INSERTs, UPDATEs, DELETEs, and SELECTs coming in rapid fire from applications.
For analytics queries, parsing a 1000-character SELECT takes milliseconds against a query that runs for seconds or minutes. Nobody cares. For OLTP, every microsecond spent in the parser is a microsecond added to every request. A web application doing 10,000 INSERTs/s spends more time parsing SQL than executing it if the parser is slow.
The queries that matter for parser performance are the ones that come in high volume: single-row inserts, point updates, key lookups. And bulk statements — INSERT INTO t VALUES (...) with hundreds or thousands of row tuples, or SELECT a,b,c,... FROM t with long column lists — where the parser processes the same tuple or list grammar thousands of times in a single call.
Packrat Parsing and Why MemCP Uses It
MemCP’s SQL parser is built on go-packrat, a packrat parser combinator library. A packrat parser works by combining small parsers (atoms, regex matchers) into larger ones through combinators: And (sequence), Or (alternatives), Kleene (repetition). The grammar for INSERT INTO t VALUES (1,'a'), (2,'b') is built from:
insertStmt = And(INSERT, INTO, tableName, VALUES, tupleList)
tupleList = Many(tuple, comma)
tuple = And(lparen, valueList, rparen)
valueList = Many(value, comma)
value = Or(integer, string, null, ...)
Packrat parsing guarantees O(n) time through memoization: every parser result at every input position is cached, so no work is done twice. This matters for grammars with backtracking — if Or tries alternative A and it fails at position 50, the partial results from positions 0-49 are cached and reused when alternative B is tried.
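To make the memoization concrete, here is a minimal sketch of the packrat idea in Go. The names (`Parser`, `Result`, `Scanner`, `Apply`) are illustrative only and do not mirror go-packrat's actual API; left-recursion handling is omitted.

```go
package packratsketch

// A minimal sketch of packrat memoization, not go-packrat's implementation.

type Result struct {
	Matched bool
	End     int // input position after the match
}

type Parser interface {
	Match(s *Scanner, pos int) Result
}

type Scanner struct {
	Input string
	memo  []map[Parser]Result // result cache: position -> parser -> result
}

func newScanner(input string) *Scanner {
	return &Scanner{Input: input, memo: make([]map[Parser]Result, len(input)+1)}
}

// Apply runs p at pos, consulting the cache first, so backtracking
// never repeats work that was already done at this position.
func (s *Scanner) Apply(p Parser, pos int) Result {
	if r, ok := s.memo[pos][p]; ok {
		return r // cache hit: O(1), no re-parse
	}
	r := p.Match(s, pos)
	if s.memo[pos] == nil {
		s.memo[pos] = make(map[Parser]Result)
	}
	s.memo[pos][p] = r
	return r
}
```

Every (parser, position) pair is computed at most once; when an Or alternative fails and another is tried, the already-parsed prefix becomes a series of cache hits.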
MemCP uses packrat combinators instead of a generated parser (like yacc) for a specific reason: extensibility. MemCP’s Scheme runtime lets libraries define new SQL syntax by constructing parser trees at runtime:
(define my-parser (sql-and (sql-atom "CUSTOM") (sql-atom "KEYWORD") sql-expr))
(register-syntax "custom_stmt" my-parser)
A generated parser would require a build step and recompilation. Combinator parsers are data structures — they can be constructed, composed, and extended at runtime.
The Problem
Profiling a bulk INSERT showed the parser responsible for tens of thousands of heap allocations per query. Each allocation is cheap individually, but at OLTP volumes they add up: GC pressure, cache misses, wasted cycles.
We optimized go-packrat over four releases, v2.1.16 through v2.1.19. The results were surprising — reducing allocations doesn't always make things faster. In the end, choosing the right changes got us a 2x speedup.
Measurements
All measurements: (time (parse_sql ...)) in MemCP, median of multiple runs. The input is INSERT INTO t VALUES('a','b'), ... with 3, 100, 1000, and 10000 rows.
Absolute times:
| Version | x3 | x100 | x1000 | x10000 |
|---|---|---|---|---|
| v2.1.15 (baseline) | 1143 us | 14.6 ms | 150.9 ms | 1267.6 ms |
| v2.1.16 | 542 us | 10.2 ms | 78.9 ms | 727.4 ms |
| v2.1.17 | 639 us | 10.6 ms | 99.7 ms | 1013.2 ms |
| v2.1.18 | 978 us | 22.7 ms | 214.1 ms | 2115.6 ms |
| v2.1.19 | 577 us | 9.0 ms | 73.0 ms | 675.8 ms |
Relative to v2.1.15 baseline:
| Version | x3 | x100 | x1000 | x10000 |
|---|---|---|---|---|
| v2.1.15 | 1.0x | 1.0x | 1.0x | 1.0x |
| v2.1.16 | 2.1x | 1.4x | 1.9x | 1.7x |
| v2.1.17 | 1.8x | 1.4x | 1.5x | 1.3x |
| v2.1.18 | 1.2x | 0.6x | 0.7x | 0.6x |
| v2.1.19 | 2.0x | 1.6x | 2.1x | 1.9x |
Per-row cost (median / row count):
| Version | x3 | x100 | x1000 | x10000 |
|---|---|---|---|---|
| v2.1.15 | 381 us | 146 us | 151 us | 127 us |
| v2.1.16 | 181 us | 102 us | 79 us | 73 us |
| v2.1.17 | 213 us | 106 us | 100 us | 101 us |
| v2.1.18 | 326 us | 227 us | 214 us | 212 us |
| v2.1.19 | 192 us | 90 us | 73 us | 68 us |
v2.1.19 achieves the best per-row cost at scale: 68 us/row at 10,000 rows, 7-11% faster than v2.1.16 and nearly 2x the baseline. Its per-row cost drops with scale (192 -> 68 us), showing good amortization of fixed overhead.
What Changed — and What We Learned
v2.1.16: Replacing Expensive Operations (2x speedup)
Seven internal changes, all pure wins:
- Keyword matching without regex. Every SQL keyword was matched via `(?i)^SELECT` with `FindStringSubmatch`. Replaced with `strings.EqualFold` for case-insensitive matching and `==` for case-sensitive matching. With 313 atom parsers in the grammar, this was the highest-impact single change (sketched below).
- Word boundaries as `[]bool` instead of `map[int]bool`. A flat slice indexed by position: one allocation instead of thousands of map buckets.
- Whitespace skip fast path. A byte check (`<= ' '` or `== '/'`) gates the regex call. ~80% of positions have no whitespace, so `Skip()` becomes a single comparison.
- Flat memoization. `map[int]map[Parser]*MemoEntry` flattened to `[]map[Parser]*MemoEntry`. One array index replaces one hash lookup for the outer dimension.
- Lr object pooling. Left-recursion markers recycled via `sync.Pool`.
- Slice pre-allocation. `And`/`Kleene`/`ManyParser` pre-allocate result slices instead of growing from nil.
- Specialized regex matchers. MemCP's 9 regex patterns are recognized at construction time and replaced with hand-written `func(string) int` matchers. Identifiers use a 256-bit bitmap lookup. Integers and floats use byte loops. String bodies use a backslash-skip loop. `regexp.Regexp` is never compiled for these patterns.
Every one of these replaces an expensive operation with a cheaper one. The replacement is faster AND allocates less.
Best of all, the public interface didn't change: users still declare their grammar with regex patterns, and common patterns are transparently replaced with faster hand-written matchers.
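To make the keyword match and the whitespace fast path concrete, here is a minimal sketch of the idea. Function names (`matchKeyword`, `skipWhitespace`) and details are assumptions for illustration, not the library's source.

```go
package fastpathsketch

import "strings"

// matchKeyword replaces the old (?i)^SELECT-style regex with a bounds check
// plus strings.EqualFold — no regexp machinery, no allocations. The real
// parser additionally checks a word boundary so "SELECTED" doesn't match "SELECT".
func matchKeyword(input string, pos int, keyword string) bool {
	end := pos + len(keyword)
	if end > len(input) {
		return false
	}
	return strings.EqualFold(input[pos:end], keyword)
}

// skipWhitespace gates the expensive path behind a single byte comparison.
// ~80% of positions have nothing to skip, so the common case is one compare.
func skipWhitespace(input string, pos int) int {
	if pos >= len(input) || (input[pos] > ' ' && input[pos] != '/') {
		return pos // fast path: no whitespace, no comment
	}
	for pos < len(input) && input[pos] <= ' ' {
		pos++
	}
	// comment skipping ("/* ... */", "-- ...") would follow here in the real parser
	return pos
}
```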
v2.1.17-18: The Allocation Trap (1.4-2.9x regression)
These versions pursued allocation reduction further — replacing the inner map[Parser]*MemoEntry with a linked list through slab-allocated entries (v2.1.17), then packing the per-position data into a uint32 with multi-slab indirection (v2.1.18).
Allocations dropped from ~630 to ~30 (v2.1.17) to ~7 (v2.1.18). Real-world performance went in the opposite direction.
What went wrong:
- `MemoEntry` grew from 40 to 64 bytes (added `rule` and `nextMemo` fields for the linked list). Entries per cache line dropped from 1.6 to 1.0.
- Linked-list traversal replaced hash lookup. Go's small maps (2-5 entries) use contiguous bucket arrays — one cache line load. The linked list follows pointers to slab entries allocated in DFS parse-tree order, not grouped by position. Each hop risks a cache miss.
- Double indirection in v2.1.18. Every memo access became `s.memoSlabs[idx>>8][idx&0xFF]` — two pointer dereferences plus arithmetic, where a direct `*MemoEntry` pointer was one (both layouts are sketched below).
- The regression scales with input size. At 3 rows, the working set fits in L1/L2 regardless — v2.1.18 is 1.8x slower. At 10,000 rows, the working set exceeds L2 and every cache miss pays the full penalty — v2.1.18 is 2.9x slower.
The per-row cost confirms this. v2.1.16’s drops from 181 to 73 us/row (fixed overhead amortizes, marginal cost is low). v2.1.18’s stays flat at ~212-227 us/row — per-access overhead that scales with the working set.
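For reference, the two layouts look roughly like this. The field names (`rule`, `nextMemo`, `memoSlabs`) follow the description above, but the code is schematic, not the library's exact source.

```go
package memosketch

type Parser interface{ Match(input string, pos int) int }

type MemoEntry struct {
	rule     Parser
	end      int
	nextMemo uint32 // v2.1.17/18 only: link to the next entry for this position
}

// v2.1.16 / v2.1.19 layout: flat slice over positions, small map per position.
// Lookup is one array index plus one hash probe into a contiguous bucket array.
type mapMemo struct {
	memo []map[Parser]*MemoEntry
}

func (m *mapMemo) lookup(pos int, p Parser) *MemoEntry {
	return m.memo[pos][p]
}

// v2.1.18 layout: a packed uint32 head per position, entries stored in
// 256-element slabs and linked in parse order (index 0 is reserved as the
// empty marker in this sketch). Every hop costs two dereferences plus
// arithmetic, and consecutive hops rarely share a cache line.
type slabMemo struct {
	heads     []uint32
	memoSlabs [][]MemoEntry
}

func (m *slabMemo) lookup(pos int, p Parser) *MemoEntry {
	for idx := m.heads[pos]; idx != 0; {
		e := &m.memoSlabs[idx>>8][idx&0xFF]
		if e.rule == p {
			return e
		}
		idx = e.nextMemo
	}
	return nil
}
```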
v2.1.19: Combining the Best (new fastest)
Reverted to v2.1.16’s map-based memoization. Kept the additive features from v2.1.17-18 that don’t affect memo access patterns:
- Combinator buffer reuse (v2.1.17): `And`/`Kleene`/`ManyParser` reuse a `[]T` buffer across non-recursive calls, detected by a depth counter. Avoids per-call slice allocations.
- FindStringIndex (v2.1.17): `MatchRegexp` uses `FindStringIndex` instead of `FindStringSubmatch`, avoiding a `[]string` allocation per regex match.
- CharMap dispatch (v2.1.18): `OrParser.SetCharMap()` installs a first-byte lookup table. After skipping whitespace, only the 1-2 sub-parsers matching the current byte are tried instead of all alternatives (sketched below).
- NoMemo bypass (v2.1.18): `KleeneParser` and `ManyParser` gained a `NoMemo` flag that calls `Match()` directly, bypassing the memo table entirely.
- Scanner.Reset() (v2.1.18): reinitializes a Scanner for a new input, reusing allocated slices.
The result is 7-11% faster than v2.1.16 at 100+ rows. The buffer reuse, FindStringIndex, and CharMap dispatch each shave a small amount off the per-row cost without affecting cache behavior.
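The CharMap idea, schematically: the types and method names below (`subParser`, `FirstBytes`, `orDispatch`) are illustrative, not `OrParser`'s actual interface.

```go
package dispatchsketch

// An Or node that dispatches on the first byte of the input instead of
// trying every alternative in order.

type subParser interface {
	Match(input string, pos int) int // new position, or -1 on failure
	FirstBytes() []byte              // bytes this alternative can start with
}

type orDispatch struct {
	charMap [256][]subParser // byte value -> alternatives that can start with it
}

func newOrDispatch(alts []subParser) *orDispatch {
	o := &orDispatch{}
	for _, a := range alts {
		for _, b := range a.FirstBytes() {
			o.charMap[b] = append(o.charMap[b], a)
		}
	}
	return o
}

func (o *orDispatch) Match(input string, pos int) int {
	if pos >= len(input) {
		return -1
	}
	// Only the 1-2 alternatives that can start with input[pos] are tried.
	for _, a := range o.charMap[input[pos]] {
		if end := a.Match(input, pos); end >= 0 {
			return end
		}
	}
	return -1
}
```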
The Lesson: Allocations Are a Proxy Metric
The allocation count across versions tells a misleading story:
| Version | Allocs (synthetic, 200 tuples) | Real-world speed |
|---|---|---|
| v2.1.15 | ~6,800 | slowest |
| v2.1.16 | ~630 | fast |
| v2.1.17 | ~30 | slower |
| v2.1.18 | ~7 | slowest since baseline |
| v2.1.19 | ~630 | fastest |
v2.1.19 has the same allocation count as v2.1.16 and is the fastest version. v2.1.18 has 99% fewer allocations and is the slowest since baseline.
What matters is memory access latency, not allocation count. Go’s allocator and GC handle short-lived objects efficiently — allocating a small map costs ~200ns, and the GC reclaims it for free if it dies young. Replacing that map with a slab-allocated linked list eliminates the allocation but makes every subsequent lookup slower by polluting cache lines with scattered, pointer-chased data.
For hot-loop data structures, spatial locality matters more than allocation count. A small Go map with 2-5 entries has excellent locality: one contiguous bucket array, O(1) lookup, fits in a cache line. That’s hard to beat with hand-rolled data structures.
Does the Grammar Even Need Memoization?
Packrat memoization guarantees O(n) by preventing redundant re-parsing during backtracking. But SQL is essentially LL(1) — each construct is determined by its first token. The top-level dispatch (SELECT vs INSERT) benefits from memoization: ~50 lookups per query. The Kleene/Many inner loop processing value tuples or column lists does not — it advances position-by-position, never revisiting. Every memo entry created in the loop is written once and never read.
For a 10,000-row INSERT, the inner loop creates tens of thousands of memo entries that are never reused — megabytes of write-only data occupying cache lines.
The NoMemo flag addresses this. When set, Kleene/ManyParser call Match() directly, bypassing applyRule() and the memo table entirely. This turns the inner loop into pure recursive descent — no memo overhead, no cache pollution.
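Schematically, the repetition loop changes like this. The types here are illustrative stand-ins; the real `KleeneParser`/`ManyParser` and `applyRule` live in go-packrat.

```go
package nomemosketch

type result struct {
	matched bool
	end     int
}

type parser interface {
	match(input string, pos int) result
}

type kleene struct {
	sub       parser
	noMemo    bool                           // bypass flag from v2.1.18/19
	applyRule func(p parser, pos int) result // memoized packrat path provided by the scanner
}

// matchLoop consumes as many repetitions of sub as possible. With noMemo,
// each iteration is a direct call: no memo entry is written, so a
// 10,000-tuple VALUES list no longer fills the cache with write-only data.
func (k *kleene) matchLoop(input string, pos int) result {
	for {
		var r result
		if k.noMemo {
			r = k.sub.match(input, pos) // pure recursive descent
		} else {
			r = k.applyRule(k.sub, pos) // one write-only memo entry per iteration
		}
		if !r.matched || r.end == pos { // stop on failure or empty match
			break
		}
		pos = r.end
	}
	return result{matched: true, end: pos}
}
```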
A synthetic SELECT col_0, col_1, ... col_N FROM t benchmark inside go-packrat shows the effect clearly:
| Columns | NewScanner | Reset | Reset + NoMemo |
|---|---|---|---|
| 3 | 47 allocs / 4.6 us | 35 allocs / 2.0 us | 21 allocs / 1.5 us |
| 10 | 89 allocs / 8.9 us | 77 allocs / 4.3 us | 21 allocs / 2.1 us |
| 50 | 329 allocs / 34 us | 317 allocs / 19 us | 21 allocs / 4.9 us |
| 200 | 1229 allocs / 122 us | 1217 allocs / 71 us | 21 allocs / 15 us |
| 1000 | 6029 allocs / 507 us | 6018 allocs / 393 us | 21 allocs / 74 us |
With NoMemo, the allocation count is constant at 21 regardless of column count — the column list loop allocates nothing. The 21 allocs are structural overhead (Scanner construction, the SELECT and FROM keyword lookups). At 1000 columns, NoMemo is 6.9x faster than the default path and scales perfectly linearly: 74 ns/column with no cache degradation.
The same applies to INSERT value lists, Kleene repetitions, and any other Many/Kleene loop over non-left-recursive sub-parsers. Combined with map-based memoization for the structural parse, this gives:
- Structural parse (statement dispatch, clauses): memoization, small working set, benefits from caching
- Repetition loops (value tuples, column lists): direct calls, zero memo overhead
NoMemo and Scanner.Reset() require caller-side changes (setting the flag, pooling Scanners). These are available in v2.1.19 but not yet activated in MemCP. When they are, the per-query allocation count will drop further without the cache locality penalty.
What’s Next
The last remaining source of per-query allocations is the callback interface. The merge callback func(string, ...T) T has no access to per-query state. MemCP’s callbacks allocate result objects on the heap because there’s no way to pass a per-query arena through the parser.
Planned: a UserData any field on Scanner, accessible from callbacks. This lets the callback use a per-query arena allocator, replacing individual heap allocations with bulk arena allocation.
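A hypothetical sketch of what such a per-query arena could look like once callbacks can reach it through `Scanner.UserData`. Everything here is an assumption — the field is planned, not shipped — and the usage comment shows intent only.

```go
package arenasketch

const chunkSize = 1024

type node struct {
	kind     string
	children []*node
}

// arena is a simple chunked bump allocator: nodes are handed out from
// fixed-size chunks, so pointers stay valid and one allocation serves
// ~1000 callback results instead of one heap allocation each.
type arena struct {
	chunks [][]node
}

func (a *arena) newNode(kind string) *node {
	if n := len(a.chunks); n == 0 || len(a.chunks[n-1]) == cap(a.chunks[n-1]) {
		a.chunks = append(a.chunks, make([]node, 0, chunkSize))
	}
	last := len(a.chunks) - 1
	a.chunks[last] = append(a.chunks[last], node{kind: kind})
	return &a.chunks[last][len(a.chunks[last])-1]
}

// Inside a merge callback, the arena would be fetched from the scanner
// (hypothetical API) and reused across the whole query:
//
//	arena := scanner.UserData.(*arena)
//	return arena.newNode("insert_tuple")
```

Between queries the arena can be truncated and reused, turning thousands of short-lived result allocations into a handful of chunk allocations.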
Outlook: JIT for the Parser
The next logical step is to apply the MemCP JIT not only to scan loops, but also to the packrat parser objects themselves.
Today, the parser is built from combinators (And, Or, Many, Atom) that are interpreted at runtime. Even after heavy optimization, this is still a generic dispatch mechanism.
In the future, frequently used grammar paths — such as INSERT, SELECT, value lists, or identifiers — could be compiled at runtime into specialized matching loops. This would allow:
- embedding static rule structure as immediates
- turning `Or` alternatives into direct first-byte dispatch
- emitting `Many` as compact native loops
- eliminating unnecessary memo lookups
The result would be a transition from a flexible combinator interpreter to a specialized native state machine — without sacrificing the runtime extensibility that makes the packrat approach powerful.
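As a thought experiment, the compiled output for a value grammar might resemble hand-written code like the following — purely conceptual, written by hand here; nothing like it exists in MemCP or go-packrat today.

```go
package jitsketch

import "strings"

// matchValue is what an Or over (string | integer | NULL) could compile to:
// a direct first-byte switch plus tight byte loops instead of combinator dispatch.
// Returns the position after the value, or -1 if nothing matches.
func matchValue(input string, pos int) int {
	if pos >= len(input) {
		return -1
	}
	switch c := input[pos]; {
	case c == '\'': // string literal
		return matchStringLiteral(input, pos)
	case c >= '0' && c <= '9', c == '-': // integer
		return matchNumber(input, pos)
	case c == 'N' || c == 'n': // NULL
		if pos+4 <= len(input) && strings.EqualFold(input[pos:pos+4], "NULL") {
			return pos + 4
		}
		return -1
	default:
		return -1
	}
}

func matchStringLiteral(input string, pos int) int {
	for i := pos + 1; i < len(input); i++ {
		switch input[i] {
		case '\\':
			i++ // skip the escaped character
		case '\'':
			return i + 1
		}
	}
	return -1
}

func matchNumber(input string, pos int) int {
	i := pos
	if i < len(input) && input[i] == '-' {
		i++
	}
	start := i
	for i < len(input) && input[i] >= '0' && input[i] <= '9' {
		i++
	}
	if i == start {
		return -1
	}
	return i
}
```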
The parser speedup was step one.
A parser JIT could be step two.
Try It
go-packrat is open source under GPLv3 (with custom licensing available):
go get github.com/launix-de/go-packrat/v2@v2.1.19
MemCP is at github.com/launix-de/memcp.