{"id":8131,"date":"2026-02-18T20:15:07","date_gmt":"2026-02-18T19:15:07","guid":{"rendered":"https:\/\/launix.de\/launix\/?p=8131"},"modified":"2026-03-10T11:17:18","modified_gmt":"2026-03-10T10:17:18","slug":"36-faster-without-changing-a-single-query-inside-memcps-new-scan-engine","status":"publish","type":"post","link":"https:\/\/launix.de\/launix\/en\/36-faster-without-changing-a-single-query-inside-memcps-new-scan-engine\/","title":{"rendered":"36% Faster Without Changing a Single Query: Inside MemCP&#8217;s New Scan Engine"},"content":{"rendered":"\n<p>MemCP is an in-memory columnar database written in Go with a Scheme scripting layer for query compilation, planning and optimization. Its storage engine organizes data into shards, each containing a compressed main storage and a small append-only delta buffer for recent writes. Until now, the two primary scan paths \u2014 unordered aggregation (<code>scan<\/code>) and ordered retrieval (<code>scan_order<\/code>) \u2014 each contained their own inline loops for reading columns, dispatching between main and delta storage, calling the map function, and folding the reduce. This worked, but the duplication made it harder to optimize either path and \u2014 more critically \u2014 impossible to swap in a network-transparent implementation for distributed query execution.<\/p>\n\n\n\n<!--more-->\n\n\n\n<p>This post describes the refactoring we just landed: <strong>ShardMapReducer<\/strong>, a universal batch map+reduce interface that unifies both scan paths, eliminates per-row heap allocations, and lays the groundwork for clustering.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">The Problem: Two Scan Paths, Duplicated Logic<\/h2>\n\n\n\n<p>Before this change, <code>scan.go<\/code> and <code>scan_order.go<\/code> each independently:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Opened column readers for the requested columns<\/li>\n\n\n\n<li>Iterated over matching record IDs<\/li>\n\n\n\n<li>For each ID, branched on whether it lived in main storage or the delta buffer<\/li>\n\n\n\n<li>Called <code>ColumnStorage.GetValue(id)<\/code> or <code>getDelta(id - mainCount, col)<\/code><\/li>\n\n\n\n<li>Invoked the map callback<\/li>\n\n\n\n<li>Folded the result into an accumulator via reduce<\/li>\n<\/ol>\n\n\n\n<p>The two paths differed only in <em>how<\/em> record IDs were produced: <code>scan<\/code> iterates an index and filters inline; <code>scan_order<\/code> sorts filtered IDs per shard, then merges them globally via a priority queue. But the map+reduce hot loop was copy-pasted between them, and any optimization (JIT compilation, batch column reads, SIMD) would have to be applied twice.<\/p>\n\n\n\n<p>Worse, both paths interleaved filter evaluation with map+reduce execution in a single loop. For each row, the code would read the filter columns, evaluate the condition, then \u2014 if the row passed \u2014 immediately read the map columns and call map+reduce. 
<h2 class="wp-block-heading">The Solution: ShardMapReducer</h2>

<p>We introduced a single struct that encapsulates the map+reduce pipeline:</p>

<pre class="wp-block-code"><code>type ShardMapReducer struct {
    shard     *storageShard
    mainCols  []ColumnStorage   // direct main storage references
    colNames  []string          // for delta access
    isUpdate  []bool            // $update columns get UpdateFunction
    args      []scm.Scmer       // pre-allocated argument buffer
    mapFn     func(...scm.Scmer) scm.Scmer
    reduceFn  func(...scm.Scmer) scm.Scmer
    mainCount uint
}</code></pre>

<p>The key method is <code>Stream(acc, recids) -&gt; acc</code>: given an accumulator and a batch of record IDs, it returns the new accumulator after applying map+reduce to each record. Column readers are opened once at construction (<code>OpenMapReducer</code>), and the args buffer is pre-allocated — no allocations in the hot path.</p>

<h3 class="wp-block-heading">Run Detection: Main vs Delta</h3>

<p>The <code>Stream</code> method doesn't just iterate blindly. It partitions the incoming ID batch into <strong>contiguous runs</strong> of the same storage type:</p>

<pre class="wp-block-code"><code>recids: [3, 7, 12, 15, 1001, 1003, 18, 22, 1005]
         \_____________/  \________/  \_____/ \__/
          MainBlock(4)  DeltaBlock(2) Main(2) Delta(1)</code></pre>

<p>Each run is dispatched to either <code>processMainBlock</code> or <code>processDeltaBlock</code>. The main block is a tight loop with direct <code>ColumnStorage.GetValue</code> calls — no branching on storage type per row. In the common case (99%+ of data in compressed main storage, only a handful of recent delta rows), this means a single <code>processMainBlock</code> call covers almost the entire batch.</p>
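<p>In code, the run detection inside <code>Stream</code> can be as simple as the following sketch (the loop structure is an assumption; the actual implementation may differ in details):</p>

<pre class="wp-block-code"><code>// Sketch: partition recids into contiguous main/delta runs, dispatch each.
func (m *ShardMapReducer) Stream(acc scm.Scmer, recids []uint) scm.Scmer {
    for len(recids) &gt; 0 {
        isMain := recids[0] &lt; m.mainCount
        n := 1
        for n &lt; len(recids) &amp;&amp; (recids[n] &lt; m.mainCount) == isMain {
            n++ // extend the run while the storage type stays the same
        }
        if isMain {
            acc = m.processMainBlock(acc, recids[:n])
        } else {
            acc = m.processDeltaBlock(acc, recids[:n])
        }
        recids = recids[n:]
    }
    return acc
}</code></pre>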
<p>These two methods are the future JIT compilation targets — more on that below.</p>

<h3 class="wp-block-heading">Stack Buffer in scan.go</h3>

<p>The unordered scan path now collects filtered record IDs into a stack-allocated buffer before flushing them to the MapReducer:</p>

<pre class="wp-block-code"><code>var buf [1024]uint
bufN := 0

// ... inside index iteration:
buf[bufN] = idx
bufN++
if bufN == 1024 {
    acc = mapper.Stream(acc, buf[:bufN])
    bufN = 0
}

// ... after the iteration: flush the remainder
if bufN &gt; 0 {
    acc = mapper.Stream(acc, buf[:bufN])
}</code></pre>

<p>1024 items × 8 bytes = 8 KB — fits comfortably in L1 cache, requires zero heap allocation. The old code called map+reduce inline for every passing row; the new code batches them, letting the MapReducer process a dense block of IDs with optimal memory access patterns.</p>

<h3 class="wp-block-heading">Batched Merge in scan_order</h3>

<p>The ordered scan path merges sorted per-shard results via a global priority queue. The merge pops items one at a time from the heap, accumulating consecutive same-shard items into a buffer. When the next heap pop yields a different shard, the buffer is flushed to the previous shard's MapReducer:</p>

<pre class="wp-block-code"><code>var buf [1024]uint
bufN := 0
var bufShard *shardqueue

for len(q.q) &gt; 0 {
    qx := q.q[0]                    // peek at heap root
    item := qx.items[0]             // take its front item
    qx.items = qx.items[1:]         // consume it

    if bufShard != nil &amp;&amp; bufShard != qx {
        // shard changed — flush buffer to previous shard's mapper
        acc = bufShard.mapper.Stream(acc, buf[:bufN])
        bufN = 0
    }
    bufShard = qx
    buf[bufN] = item
    bufN++
    if bufN == len(buf) {            // buffer full — flush
        acc = bufShard.mapper.Stream(acc, buf[:bufN])
        bufN = 0
    }

    // re-heapify or remove exhausted shard
    if len(qx.items) &gt; 0 { heap.Fix(&amp;q, 0) } else { heap.Pop(&amp;q) }
}
// flush remaining
if bufN &gt; 0 { acc = bufShard.mapper.Stream(acc, buf[:bufN]) }</code></pre>

<p>This uses the same 8 KB stack buffer as the unordered path. When a single shard dominates a region of the sort order, the buffer fills with consecutive same-shard items and flushes in one <code>Stream()</code> call. For cluster mode, each <code>Stream()</code> call becomes a network RPC — batching N consecutive same-shard items into one call reduces round-trips from N to ~N/batchsize.</p>
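<p>For orientation, the merge above leans on a per-shard queue and a global heap. Their approximate shape is sketched below; the field names and the sort-key representation are assumptions, not the exact MemCP definitions:</p>

<pre class="wp-block-code"><code>import "container/heap"

// Per-shard result: sorted record IDs plus this shard's MapReducer.
type shardqueue struct {
    items  []uint           // record IDs, front = next in sort order
    keys   []int64          // assumed: sort key per item, consumed in
    mapper *ShardMapReducer //          lockstep with items
}

// Global priority queue over all shards, ordered by each front item's key.
type globalqueue struct{ q []*shardqueue }

func (g *globalqueue) Len() int           { return len(g.q) }
func (g *globalqueue) Less(i, j int) bool { return g.q[i].keys[0] &lt; g.q[j].keys[0] }
func (g *globalqueue) Swap(i, j int)      { g.q[i], g.q[j] = g.q[j], g.q[i] }
func (g *globalqueue) Push(x any)         { g.q = append(g.q, x.(*shardqueue)) }
func (g *globalqueue) Pop() any {
    last := g.q[len(g.q)-1]
    g.q = g.q[:len(g.q)-1]
    return last
}

var _ heap.Interface = (*globalqueue)(nil) // compile-time interface check</code></pre>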
<h2 class="wp-block-heading">Benchmark Results</h2>

<p>We benchmarked the new code against the previous version on a table with 2 million rows, 4 columns (INT, INT, INT, TEXT). Five runs per query, median reported:</p>

<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Query</th><th>Baseline</th><th>ShardMapReducer</th><th>Speedup</th></tr></thead><tbody><tr><td><code>SELECT COUNT(*)</code></td><td>1273 ms</td><td>946 ms</td><td><strong>1.35x (-26%)</strong></td></tr><tr><td><code>SELECT SUM(value)</code></td><td>1394 ms</td><td>1078 ms</td><td><strong>1.29x (-23%)</strong></td></tr><tr><td><code>ORDER BY value DESC LIMIT 100</code></td><td>1607 ms</td><td>1096 ms</td><td><strong>1.47x (-32%)</strong></td></tr><tr><td><code>WHERE value &gt; 500000</code> (filter+count)</td><td>1614 ms</td><td>1034 ms</td><td><strong>1.56x (-36%)</strong></td></tr></tbody></table></figure>

<p>The variance also dropped significantly (e.g., SCAN+FILTER: ±20 ms vs ±278 ms on the baseline), confirming more predictable memory access patterns.</p>

<h2 class="wp-block-heading">Where Does the Speedup Come From?</h2>

<p>A 23-36% improvement from what is essentially a refactoring — no algorithmic changes, same data structures, same query plans — deserves a closer look. There is no single silver bullet; the speedup is the cumulative effect of removing many small inefficiencies that, at 2 million iterations, add up to hundreds of milliseconds.</p>

<h3 class="wp-block-heading">Eliminated Main/Delta Branching Per Row</h3>

<p>The old code checked for every column of every row whether the record lived in main storage or the delta buffer:</p>

<pre class="wp-block-code"><code>// Old: inside the per-row loop, for EACH column
if idx &lt; t.main_count {
    mdataset[i] = mcols[i].GetValue(idx)
} else {
    mdataset[i] = t.getDelta(int(idx-t.main_count), col)
}</code></pre>

<p>With 2M rows and 4 columns, that is <strong>8 million conditional branches</strong>. The branch predictor learns "almost always main", but each misprediction costs roughly 15 CPU cycles on modern hardware.</p>

<p>The new code partitions the batch <strong>once</strong> via run detection. <code>processMainBlock</code> contains <strong>no branch</strong> — just a tight loop calling <code>col.GetValue(id)</code>. With 99.9% of data in main storage, the entire 2M-row scan typically executes as a single <code>processMainBlock</code> call.</p>

<h3 class="wp-block-heading">Phase Separation: Filter vs Map+Reduce</h3>

<p>This is likely the largest contributor, especially for filtered scans.</p>

<p>The old code ran everything in one interleaved loop:</p>

<pre class="wp-block-code"><code>for each row:
  read condition columns   -&gt; load cache lines for filter columns
  evaluate filter
  if passes:
    read callback columns  -&gt; load cache lines for map columns (evicts filter data!)
    call map + reduce</code></pre>

<p>The new code separates these into distinct phases:</p>

<pre class="wp-block-code"><code>Phase 1 (filter): for each row:
  read condition columns only   -&gt; L1 stays hot for filter columns
  if passes: buf[n++] = idx

Phase 2 (map+reduce): mapper.Stream(buf[:n])
  read callback columns only    -&gt; L1 now fully dedicated to map columns
  call map + reduce</code></pre>

<p>In Phase 1, only the filter columns compete for cache space. In Phase 2, only the map columns are accessed. The old code forced both column sets to share the same cache lines, causing thrashing on every row that passed the filter.</p>

<h3 class="wp-block-heading">Why SCAN+FILTER Benefits Most (1.56x)</h3>

<p><code>WHERE value &gt; 500000</code> filters roughly half the rows. The old code still interleaved filter and map column reads for every passing row — the cache pollution hit it on every match. The new code collects all passing IDs first (Phase 1 touches only the filter column), then processes them in a clean sequential sweep through the map columns (Phase 2). The more selective the filter, the denser and more cache-friendly the resulting ID batches become.</p>

<h3 class="wp-block-heading">Why COUNT Benefits Least (1.35x)</h3>

<p>COUNT has no real map phase — map is essentially identity, reduce is a counter increment. There are no "map columns" to read, so the phase-separation advantage barely applies. The 35% speedup for COUNT comes almost entirely from the eliminated per-row branching and the pre-allocated args buffer. It represents the "floor" of what the structural improvements buy you even without cache-locality gains.</p>

<h3 class="wp-block-heading">Pre-Allocated Args Buffer</h3>

<p><code>ShardMapReducer.args</code> is allocated once at construction and reused for every row. The old code reused a <code>mdataset</code> slice too, but allocated it fresh per scan call. More importantly, the MapReducer is a longer-lived object that the Go runtime can place more efficiently, reducing GC tracing overhead across millions of iterations.</p>

<h3 class="wp-block-heading">Stack Buffer Eliminates GC Pressure</h3>

<p><code>var buf [1024]uint</code> lives on the Go stack — no heap allocation, no GC tracking. The old code had no equivalent buffer and invoked map+reduce inline per row. With 2M rows, the cumulative GC scheduling overhead was measurable — the reduced variance (±20 ms vs ±278 ms for SCAN+FILTER) is a direct indicator of less GC interference.</p>

<h3 class="wp-block-heading">The Compound Effect</h3>

<p>None of these improvements alone would explain 36%. But they add up:</p>

<ul class="wp-block-list">
<li>8M fewer branches saves ~10-20 ms (misprediction cost)</li>
<li>Phase separation saves ~100-200 ms (cache locality)</li>
<li>Stack buffer saves ~50-100 ms (GC pressure)</li>
<li>Tighter inner loops save ~50-100 ms (instruction cache, branch predictor warmup)</li>
</ul>
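<p>A quick cross-check of these estimates: they sum to roughly 210-420 ms, while the measured baseline-to-new deltas range from 316 ms (SUM) to 580 ms (SCAN+FILTER). Estimate and measurement agree to within a factor of about 1.5, which supports the claim that no single effect dominates.</p>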
<p>At 2 million rows, "a few nanoseconds per row" becomes hundreds of milliseconds. The old code suffered death by a thousand cuts — each overhead was invisible in isolation, but collectively they consumed 23-36% of the scan time.</p>

<h2 class="wp-block-heading">Cluster Readiness: Scmer Serialization</h2>

<p>The <code>OpenMapReducer</code> constructor accepts the original <code>scm.Scmer</code> values for <code>mapFn</code> and <code>reduceFn</code> (instead of pre-optimized Go functions), storing them alongside the optimized callables. This separation enables future network serialization: the coordinator ships the Scmer ASTs to remote nodes, which call <code>OptimizeProcToSerialFunction</code> (and eventually JIT) on the receiver side. Combined with the batched merge in <code>scan_order</code>, where consecutive same-shard items are flushed as a single <code>Stream()</code> call, this means each RPC covers an entire batch rather than a single row — critical for keeping network overhead manageable during distributed ORDER BY queries.</p>

<h2 class="wp-block-heading">What This Enables: Network-Transparent Distributed Scans</h2>

<p>The real motivation behind ShardMapReducer is not just the local speedup — it is the <strong>abstraction boundary</strong> it creates for clustering.</p>

<p>MemCP's cluster design uses a MOESI-inspired cache coherence protocol with CRUSH-based directory assignment. Each shard has a deterministic "home node" for coordination, but data can be cached on any node. When a query needs to scan a shard that lives on a remote node, we need to ship the computation there rather than pulling the data.</p>

<p>With ShardMapReducer, the interface is already in place:</p>

<pre class="wp-block-code"><code>Local:   acc = mapper.Stream(acc, recids)    // tight loop on column storage
Remote:  acc = rpc.Stream(acc, recids)       // send (acc, recids), receive new acc</code></pre>

<p>The coordinator sends the serialized filter/map/reduce lambdas along with the batch of record IDs. The remote node executes <code>processMainBlock</code>/<code>processDeltaBlock</code> locally and returns only the <strong>accumulator</strong> — typically a single number for aggregates, or a small assoc-list for GROUP BY. The data never crosses the wire; only the computation result does.</p>
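<p>A remote implementation then only has to satisfy the same <code>Stream</code> contract. The sketch below is hypothetical; <code>ClusterConn</code>, <code>Call</code> and the handle mechanism are assumed names, not existing MemCP APIs:</p>

<pre class="wp-block-code"><code>// Hypothetical remote counterpart of ShardMapReducer (sketch only).
type RemoteMapReducer struct {
    conn   *ClusterConn // assumed RPC connection to the shard's home node
    handle uint64       // remote mapper id, returned by an earlier RPC that
                        // shipped the serialized filter/map/reduce Scmer ASTs
}

// Stream sends (handle, acc, recids) and receives only the new accumulator;
// column data never crosses the wire.
func (r *RemoteMapReducer) Stream(acc scm.Scmer, recids []uint) scm.Scmer {
    return r.conn.Call("Stream", r.handle, acc, recids)
}</code></pre>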
<p>For the unordered scan path, this maps directly onto the existing two-phase architecture (a coordinator-side sketch follows the list):</p>

<ul class="wp-block-list">
<li><strong>Phase 1</strong> (per-shard, now per-node): filter + map + reduce -&gt; one accumulator per node</li>
<li><strong>Phase 2</strong> (coordinator): reduce2 over all node results -&gt; final answer</li>
</ul>
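<p>In Go, the coordinator side could look like the following sketch (<code>distributedAggregate</code> and the <code>reduce2</code> signature are assumptions; in practice phase 1 would run on all nodes concurrently):</p>

<pre class="wp-block-code"><code>// The caller-facing contract that local and remote mappers share.
type MapReducer interface {
    Stream(acc scm.Scmer, recids []uint) scm.Scmer
}

// Sketch: phase 1 folds per node (possibly over RPC), phase 2 merges locally.
func distributedAggregate(mappers []MapReducer, idsPerNode [][]uint,
    neutral scm.Scmer, reduce2 func(a, b scm.Scmer) scm.Scmer) scm.Scmer {
    acc := neutral
    for i, m := range mappers {
        nodeAcc := m.Stream(neutral, idsPerNode[i]) // phase 1: near the data
        acc = reduce2(acc, nodeAcc)                 // phase 2: coordinator merge
    }
    return acc
}</code></pre>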
<p>For ordered scans, the remote node can perform the filter+sort phase locally and return sorted record metadata. The coordinator then drives the global merge via the priority queue, calling <code>Stream</code> on the appropriate node for each batch of consecutive same-shard items.</p>

<p>The batch size naturally adapts: local scans flush at 1024 items, ordered scans accumulate consecutive same-shard items during the priority-queue merge before streaming them through the MapReducer, and remote scans can bundle all of a node's matching IDs into a single RPC. This "graceful degradation" from batch=all down to batch=1 falls out of the interface without special cases.</p>

<h2 class="wp-block-heading">The JIT Opportunity: processMainBlock Is Almost Trivially Compilable</h2>

<p>The 36% speedup we measured is with Go's standard compiled code — interpreted Scheme lambdas for map and reduce, generic <code>ColumnStorage.GetValue</code> calls with interface dispatch. There is a lot of performance still left on the table.</p>

<p>Consider what <code>processMainBlock</code> actually does:</p>

<pre class="wp-block-code"><code>func (m *ShardMapReducer) processMainBlock(acc Scmer, recids []uint) Scmer {
    for _, id := range recids {
        for i, col := range m.mainCols {
            m.args[i] = col.GetValue(id)
        }
        acc = m.reduceFn(acc, m.mapFn(m.args...))
    }
    return acc
}</code></pre>

<p>This is a tight loop with exactly three responsibilities: read columns, call map, call reduce. Before ShardMapReducer, these responsibilities were buried inside a 200-line function that also handled filter evaluation, main/delta branching, lock management, delta column lookups, transaction visibility checks, and index iteration. Extracting <code>processMainBlock</code> into its own method was a prerequisite — you cannot JIT-compile a function that does everything.</p>

<p>Now that the method is isolated, JIT compilation becomes almost mechanical. At query planning time, we know:</p>

<ul class="wp-block-list">
<li><strong>Which columns</strong> are read (their concrete storage types: StorageInt, StorageEnum, StorageFloat, …)</li>
<li><strong>What map does</strong> (the Scheme AST — e.g., <code>(list col1 col2)</code> for a simple projection, or <code>(* col1 col2)</code> for a computed expression)</li>
<li><strong>What reduce does</strong> (e.g., <code>(+ acc mapped)</code> for SUM, <code>(set_assoc acc key val)</code> for GROUP BY collection)</li>
</ul>

<p>A specialized <code>processMainBlock</code> for <code>SELECT SUM(value) FROM t</code> could compile down to:</p>

<pre class="wp-block-code"><code>// Generated at query time — no interface dispatch, no Scmer boxing
func processMainBlock_SUM_StorageInt(col *StorageInt, recids []uint) int64 {
    acc := int64(0)
    for _, id := range recids {
        acc += col.values[id]  // direct array access, no GetValue overhead
    }
    return acc
}</code></pre>

<p>No <code>interface{}</code> boxing, no virtual method calls, no Scheme interpreter overhead — just a loop over a contiguous integer array. This is the kind of code that Go (or a codegen backend) can auto-vectorize with SIMD instructions. On AVX2 hardware, a single 256-bit instruction adds four int64 values at once; AVX-512 doubles that to eight.</p>
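<p>The same mechanical specialization applies to other reduce shapes. For a GROUP BY COUNT over an integer column, a generated kernel could look like this (a sketch under the same assumptions as above, with <code>StorageInt.values</code> as a plain int64 slice):</p>

<pre class="wp-block-code"><code>// Hypothetical generated kernel for SELECT category, COUNT(*) ... GROUP BY category
func processMainBlock_GROUPCOUNT_StorageInt(col *StorageInt, recids []uint,
    acc map[int64]int64) map[int64]int64 {
    for _, id := range recids {
        acc[col.values[id]]++ // direct array read, no boxing; the assoc-list
    }                         // reduce collapses to a native map increment
    return acc
}</code></pre>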
<p>The old scan architecture made this impossible because the JIT target would have been a monolithic function mixing filter logic, storage dispatch, lock handling, and reduce folding. ShardMapReducer cleanly separates concerns: the filter phase is the caller's responsibility, storage dispatch happens via run detection in <code>Stream</code>, and <code>processMainBlock</code> is purely "read columns, compute, fold" — exactly the pattern that JIT backends (and LLVM, and even Go's own compiler with sufficient inlining) can optimize aggressively.</p>

<p>The same applies to <code>processDeltaBlock</code>, though it matters less in practice — delta storage typically holds fewer than 0.1% of rows and is only accessed during the brief window between writes and the next shard rebuild.</p>

<h2 class="wp-block-heading">Next Steps</h2>

<p>The local foundation is in place. ShardMapReducer is the single point where all three future optimization axes converge:</p>

<p><strong>Clustering</strong> (replace what is below the interface):</p>

<ol class="wp-block-list">
<li><strong>Binary Scmer wire format</strong> — replace JSON serialization with a compact tag+payload encoding for efficient RPC</li>
<li><strong>Scan RPC protocol</strong> — ship filter/map/reduce lambdas to remote nodes, receive accumulator results</li>
<li><strong>Batched aggregate push-down</strong> — fuse multiple GROUP BY aggregates into a single remote scan instead of N separate passes</li>
<li><strong>Remote ShardMapReducer</strong> — implement <code>Stream()</code> as an RPC call, transparent to the caller</li>
</ol>

<p><strong>JIT compilation</strong> (replace what is inside the interface):</p>

<ol start="5" class="wp-block-list">
<li><strong>Type-specialized processMainBlock</strong> — generate Go code (or machine code) for concrete column types and known map/reduce functions, eliminating interface dispatch and Scmer boxing</li>
<li><strong>SIMD vectorization</strong> — for simple aggregates (SUM, COUNT, MIN, MAX) over integer/float columns, process 4-8 values per CPU cycle</li>
</ol>

<p><strong>Query optimization</strong> (improve what is above the interface):</p>

<ol start="7" class="wp-block-list">
<li><strong>Predicate push-down into processMainBlock</strong> — fuse the filter phase into the map loop for cases where filter and map columns overlap, avoiding the two-pass overhead</li>
<li><strong>Columnar batch reads</strong> — instead of <code>GetValue(id)</code> per row, read contiguous column segments into a temporary buffer for even better cache line utilization</li>
</ol>

<p>The beauty of the current design is that none of these optimizations require changes to the scan logic in <code>scan.go</code> or <code>scan_order.go</code>. The callers see <code>Stream(acc, recids) -&gt; acc</code> — whether that runs interpreted Scheme on local storage, JIT-compiled native code, or a network RPC to a remote node is entirely hidden behind the interface. That was the whole point.</p>