{"id":8164,"date":"2026-03-15T16:39:18","date_gmt":"2026-03-15T15:39:18","guid":{"rendered":"https:\/\/launix.de\/launix\/?p=8164"},"modified":"2026-03-15T16:39:34","modified_gmt":"2026-03-15T15:39:34","slug":"why-eliminating-48-of-memory-writes-only-gave-us-4-speedup","status":"publish","type":"post","link":"https:\/\/launix.de\/launix\/en\/why-eliminating-48-of-memory-writes-only-gave-us-4-speedup\/","title":{"rendered":"Why Eliminating 48% of Memory Writes Only Gave Us 4% Speedup"},"content":{"rendered":"\n<p><em>Lessons from building a bit-parallel SIMD simulator for a RISC-V CPU<\/em><\/p>\n\n\n\n<!--more-->\n\n\n\n<h2 class=\"wp-block-heading\">The Setup<\/h2>\n\n\n\n<p>We&#8217;re building a bit-parallel simulation engine for <a href=\"https:\/\/verilator.org\">Verilator<\/a>, the open-source Verilog simulator. The idea: instead of evaluating a circuit signal-by-signal, group thousands of independent 1-bit operations (AND, OR, XOR, NOT, MUX) into SIMD batches. One 64-bit word operation computes 64 independent logic gates simultaneously.<\/p>\n\n\n\n<p>Our test target is the <a href=\"https:\/\/github.com\/openhwgroup\/cva6\">CVA6<\/a>, a 6-stage RISC-V CPU with ~570K signal bits. The bit-parallel engine decomposes the entire combinational logic into ~514K bit-level operations organized in ~250 batches, executed via a wavefront scheduler.<\/p>\n\n\n\n<p>After a long series of optimizations, we reached <strong>1.45\u00d7 speedup<\/strong> over Verilator&#8217;s standard eval path (22.7 kHz vs 15.7 kHz simulation clock). Then we discovered something interesting in the profiling data.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">The Discovery: 48.8% Dead Writes<\/h2>\n\n\n\n<p>Our engine uses a flat <code>uint64_t<\/code> bank (~9,500 words, 76 KB) where all signal values live. Each batch reads its inputs from the bank, computes, and writes results back. 
An earlier optimization — <strong>expression fusion</strong> — inlines single-consumer intermediate results directly into the consumer&#8217;s compute expression, bypassing the bank entirely.</p>

<p>But here&#8217;s the thing: the fused producer batches were still writing their results to the bank, even though nobody reads them. A quick analysis revealed:</p>

<ul class="wp-block-list">
<li><strong>16,779 total bank write lines</strong> in the generated code</li>

<li><strong>8,181 of them write to words that are never read</strong> (48.8%)</li>

<li>These dead writes come from producer batches whose outputs are consumed exclusively via fusion</li>
</ul>

<p>Half of all memory writes were provably useless.</p>

<h2 class="wp-block-heading">The Fix</h2>

<p>Simple: before emitting a bank write, check whether the output word has exactly one consumer AND that consumer fuses the expression inline. If so, skip the write:</p>

<pre class="wp-block-code"><code>if (isDeadWrite(batch.outSegBase + dw)) continue;  // consumer fuses inline</code></pre>

<h2 class="wp-block-heading">The Disappointment</h2>

<p><strong>Result: 1.49× speedup.</strong> Up from 1.45×. Only +4%.</p>

<p>We eliminated 48.8% of all bank writes and got 4% speedup. Where did the other 44.8% go?</p>

<h2 class="wp-block-heading">Why Writes Are (Almost) Free</h2>

<p>The answer is <strong>store buffers</strong>. Modern x86 CPUs have deep store buffers (typically 56–72 entries on recent Intel/AMD cores) that absorb writes without stalling the pipeline.
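<p>A rough way to convince yourself of this (a micro-benchmark sketch, not the engine&#8217;s code; absolute numbers vary by core): run the same reduction loop with and without one extra store per iteration. The checksums must match, and on cores with deep store buffers the timing difference is typically lost in the noise.</p>

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// The extra stores model the dead bank writes: each value is written once and
// never read again, so it just passes through the store buffer while the
// interleaved loads keep draining it.
uint64_t kernel(std::vector<uint64_t>& bank, bool deadStores) {
    uint64_t acc = 0;
    for (std::size_t i = 0; i + 1 < bank.size(); ++i) {
        acc ^= bank[i] & bank[i + 1];  // reads only see original values
        if (deadStores)
            bank[i] = acc;             // dead store: index i is never read again
    }
    return acc;
}
```

<p>Timing the two variants on a bank-sized buffer (~9,500 words) typically shows the dead stores costing almost nothing, precisely because loads are interleaved — the same access pattern our generated batches produce.</p>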
A <code>mov [mem], reg</code> instruction retires in one cycle regardless of whether the target cache line is hot, cold, or even evicted — the store buffer holds the write until the cache hierarchy can absorb it.</p>

<p>Our bank is 76 KB, larger than a typical L1 data cache (32–48 KB per core), but the per-pass working set is smaller and accessed mostly sequentially, so the hot lines stay cached. Even when writes miss L1, the store buffer absorbs them. The CPU only stalls when the store buffer is <em>full</em>, which takes on the order of 60 back-to-back stores arriving faster than the cache hierarchy can drain them.</p>

<p>In our generated code, writes and reads are interleaved (each batch reads its inputs, then writes its output), so the store buffer never fills up. The dead writes were hiding in the buffer, costing essentially zero pipeline cycles.</p>

<h2 class="wp-block-heading">What Actually Costs Time</h2>

<p>Our profiling revealed the real bottleneck: <strong>shuffles</strong>. 80% of the generated code consists of shuffle operations (pext/pdep, shift+mask) that rearrange bits from scattered source positions into contiguous batch input words. Only 20% is actual compute (AND, OR, XOR on 64-bit words).</p>

<p>The shuffle cost comes from the data layout: each batch combines operations from different signals across the entire CPU. Input bits are scattered across the bank.
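<p>What such a gather actually computes can be modeled with a portable version of the BMI2 <code>pext</code> instruction (a sketch for illustration; the generated code emits the hardware instruction or shift+mask sequences):</p>

```cpp
#include <cassert>
#include <cstdint>

// Portable model of x86 BMI2 pext: pick out of `word` the bits selected by
// `mask` and pack them contiguously into the low bits of the result. This is
// the core primitive of the gather shuffles: scattered signal bits from one
// bank word become adjacent lanes of a batch input word.
uint64_t pext64(uint64_t word, uint64_t mask) {
    uint64_t result = 0;
    for (int out = 0; mask != 0; mask &= mask - 1, ++out) {
        uint64_t lowest = mask & (~mask + 1);  // isolate lowest set mask bit
        if (word & lowest)
            result |= 1ull << out;
    }
    return result;
}
```

<p>A full batch input word is typically assembled from several such extracts, shifted into place and OR-ed together — which is why the shuffle code, not the single-instruction compute, dominates the instruction count.</p>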
Even after our fixpoint layout optimization (which reduced shuffle cost by 18%), 97% of gathers still require bit-level permutation.</p>

<p>This is the fundamental tension: batching operations for SIMD parallelism means bringing together bits from different signals, but those signals live at different bank positions.</p>

<h2 class="wp-block-heading">Performance Timeline</h2>

<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Optimization</th><th>Speedup</th><th>Note</th></tr></thead><tbody><tr><td>Initial correct implementation</td><td>1.28×</td><td>Wavefront scheduler, zero-copy reads</td></tr><tr><td>+ Expression fusion</td><td>1.32×</td><td>Inline single-consumer intermediates</td></tr><tr><td>+ Fixpoint input sort</td><td>1.45×</td><td>Global shuffle-cost minimization</td></tr><tr><td>+ Dead-write elimination</td><td>1.49×</td><td>Skip writes for fused outputs</td></tr></tbody></table></figure>

<p>The biggest single gain came from the fixpoint sort (+13%), not from reducing writes or compute.</p>

<h2 class="wp-block-heading">Lessons Learned</h2>

<ol class="wp-block-list">
<li><strong>Writes are cheap, reads are expensive.</strong> On modern CPUs with deep store buffers, writes to hot cache lines are essentially free. Focus optimization on reducing <em>reads</em>, especially scattered reads that cause cache misses.</li>

<li><strong>The shuffle is the bottleneck, not the compute.</strong> In any data-parallel system where the parallelism comes from combining independent operations, the cost of <em>gathering</em> inputs dominates the cost of <em>computing</em> on them.
This is well known in GPU programming, and it applies just as much to CPU SIMD.</li>

<li><strong>Local optimizations can hurt globally.</strong> Our per-batch input sort improved local shuffle cost but degraded downstream batches (whose inputs&#8217; bank locations changed). Only the fixpoint iteration — which accepts a change only when the <em>global</em> cost decreases — found the right balance.</li>

<li><strong>Measure before optimizing.</strong> We spent time on predicated execution (dark-silicon pruning), which showed no measurable gain with random test data. The shuffle analysis, by contrast, pointed straight at the real bottleneck.</li>
</ol>

<h2 class="wp-block-heading">What&#8217;s Next</h2>

<ul class="wp-block-list">
<li><strong>Store buffer exploitation</strong>: since writes are nearly free, we could <em>speculatively</em> write intermediate results to the bank even when they might be dead, trading bank space for simpler code generation.</li>

<li><strong>Layout graph coloring</strong>: place signals that are frequently read together into adjacent bank words, reducing cache-line crossings.</li>

<li><strong>AVX-512</strong>: the current 64-bit word size limits us to 64 parallel operations per instruction. AVX-512 would give 512, but it requires careful alignment, and the shuffle problem becomes even more critical.</li>

<li><strong>Multithreading</strong>: independent batch clusters could run on separate cores, with the bank in shared memory. The wavefront structure naturally provides a dependency graph for task scheduling.</li>
</ul>

<h2 class="wp-block-heading">The Code</h2>

<p>The bit-parallel engine is implemented behind a Verilator <code>--bit-parallel</code> flag: ~2,500 lines of C++ in the Verilator compiler (scheduler, layout optimizer, code emitter) plus the generated code (~127K lines for CVA6).
All standard Verilator regression tests pass.</p>

<p>The full implementation evolved over 22 commits in a single development session, starting from a broken 512-instance model and ending with a working 1.49× speedup on a real RISC-V CPU design.</p>

<p><em>Carl-Philip Hänsch, March 15, 2026 · filed under Programming</em></p>