{"id":4901,"date":"2023-02-14T20:49:52","date_gmt":"2023-02-14T19:49:52","guid":{"rendered":"https:\/\/launix.de\/launix\/?p=4901"},"modified":"2023-03-02T10:26:44","modified_gmt":"2023-03-02T09:26:44","slug":"sequence-compression-in-in-memory-database-yields-99-memory-savings-and-a-total-of-13","status":"publish","type":"post","link":"https:\/\/launix.de\/launix\/en\/sequence-compression-in-in-memory-database-yields-99-memory-savings-and-a-total-of-13\/","title":{"rendered":"Sequence Compression in In-Memory Database yields 99% memory savings and a total of 13%"},"content":{"rendered":"\n<p>One of the most interesting compression techniques on columnar storages is <strong>sequence compression<\/strong>.<\/p>\n\n\n\n<!--more-->\n\n\n\n<p>A sequence is a column of numbers where each distance between two neighbouring numbers is equal. Example:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><code>1 2 3 4 5 6 7 8 9<\/code><\/li>\n\n\n\n<li><code>3 3 3 3 3 3 3 3 3 3<\/code><\/li>\n\n\n\n<li><code>10 20 30 40<\/code><\/li>\n\n\n\n<li><code>8 7 6 5<\/code><\/li>\n<\/ul>\n\n\n\n<p>A sequence can be described by its starting point, its stride as well as its length:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><code>(1,1,9)<\/code><\/li>\n\n\n\n<li><code>(3,0,10)<\/code><\/li>\n\n\n\n<li><code>(10,10,4)<\/code><\/li>\n\n\n\n<li><code>(8,-1,4)<\/code><\/li>\n<\/ul>\n\n\n\n<p>The longer the sequence, the higher the compression ratio. For a typical SQL engine workload with AUTO_INCREMENT IDs, this means that you can store 150,000 IDs by just using three numbers.<\/p>\n\n\n\n<p>Whenever the sequence is interrupted, a new sequence has to be described. This means the column<\/p>\n\n\n\n<p><code>1 2 3 4 6 7 8<\/code><\/p>\n\n\n\n<p>will be encoded as<\/p>\n\n\n\n<p><code>(1,1,4)(6,1,3)<\/code><\/p>\n\n\n\n<p>which is nearly the same storage-wise as the uncompressed column.<\/p>\n\n\n\n<p>So the compression ratio is as bigger as the the column is regular. Since we use bit compression for the stride, we can use a heuristic that less than 30% of the values are allowed to trigger a sequence-restart.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Tweaking the format for fast random access<\/h2>\n\n\n\n<p>To efficiently read a sequence compressed column, we have to alter the format a little bit. Instead of recording <code>(start, stride, count)<\/code>, we will record <code>(recordId, start, stride)<\/code>.<\/p>\n\n\n\n<p>The value <code>count<\/code> is automatically derived from the difference between its successor recordId. But as you see, we won&#8217;t need it anyways. <code>recordId<\/code> is a unsigned integer counting seemlessly from 0 to array_size-1. Whenever we want to find a specific value, we can use the following algorithm:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Given <code>i<\/code> is the recordId we want to read out<\/li>\n\n\n\n<li>Do a binary search through the array of sequences such that we get the lowest <code>x<\/code> with <code>sequence[x].recordId &lt;= i<\/code><\/li>\n\n\n\n<li>Calculate the result by <code>sequence[x].start + (i - sequence[x].recordId) * sequence[x].stride<\/code><\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">Evaluation<\/h2>\n\n\n\n<p>In our example with <a aria-label=\"OppelBI (opens in a new tab)\" href=\"https:\/\/oppel.bi\" target=\"_blank\" rel=\"noreferrer noopener\" class=\"ek-link\">OppelBI<\/a> data (mostly online shop statistics), we were able to compress some integer storages by 99% which yields an overall saving of 13% of storage (our 55MB MySQL data is compressed to 15MB rather than 17MB)<\/p>\n\n\n\n<p>For the TPC-H benchmark, we achieved a saving of 7MB for the 1.1GB dataset (scale factor 1) which is only 0,8%. We attribute this to the randomness of TPC-H&#8217;s data. Real world ERP or shop data is much more regular and less string-heavvy than TPC-H&#8217;s benchmark data.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Optimized for AI workloads<\/h2>\n\n\n\n<p>Especially AI workloads like this table:<\/p>\n\n\n\n<p><code>AIInferenceMatrix(matrixId integer, column integer, row integer, value double)<\/code><\/p>\n\n\n\n<p>can be sequence-compressed very efficiently if you keep the row and column values sorted. This also yields a matrix&#8217;s <code>value<\/code> column memory layout that exactly matches the internal representation of the matrix that can be directly passed to TensorFlow or similar AI libraries.<\/p>\n\n\n\n<p>This implies that AI workloads no longer have to be stored as stupid BLOBs but rather can get a full relational storage which means easier access for partial data.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Conclusion<\/h2>\n\n\n\n<p>Sequence Compression is a powerful technique to further compress data in RAM. It reduces cache misses and thus fastens up random access to &#8220;boring&#8221; data.<\/p>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>One of the most interesting compression techniques on columnar storages is sequence compression.<\/p>","protected":false},"author":2,"featured_media":4902,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_editorskit_title_hidden":false,"_editorskit_reading_time":2,"_editorskit_is_block_options_detached":false,"_editorskit_block_options_position":"{}","_uag_custom_page_level_css":"","footnotes":""},"categories":[129,128],"tags":[],"class_list":["post-4901","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-memcp","category-programming","single-item"],"featured_image_urls_v2":{"full":["https:\/\/launix.de\/launix\/wp-content\/uploads\/2023\/02\/animals-737407_1920.jpg",1920,1280,false],"thumbnail":["https:\/\/launix.de\/launix\/wp-content\/uploads\/2023\/02\/animals-737407_1920-150x150.jpg",150,150,true],"medium":["https:\/\/launix.de\/launix\/wp-content\/uploads\/2023\/02\/animals-737407_1920-300x200.jpg",300,200,true],"medium_large":["https:\/\/launix.de\/launix\/wp-content\/uploads\/2023\/02\/animals-737407_1920-768x512.jpg",751,501,true],"large":["https:\/\/launix.de\/launix\/wp-content\/uploads\/2023\/02\/animals-737407_1920-1024x683.jpg",751,501,true],"1536x1536":["https:\/\/launix.de\/launix\/wp-content\/uploads\/2023\/02\/animals-737407_1920-1536x1024.jpg",1536,1024,true],"2048x2048":["https:\/\/launix.de\/launix\/wp-content\/uploads\/2023\/02\/animals-737407_1920.jpg",1920,1280,false],"trp-custom-language-flag":["https:\/\/launix.de\/launix\/wp-content\/uploads\/2023\/02\/animals-737407_1920.jpg",18,12,false],"xs-thumb":["https:\/\/launix.de\/launix\/wp-content\/uploads\/2023\/02\/animals-737407_1920-64x64.jpg",64,64,true],"appku-shop-single":["https:\/\/launix.de\/launix\/wp-content\/uploads\/2023\/02\/animals-737407_1920.jpg",620,413,false]},"post_excerpt_stackable_v2":"<p>One of the most interesting compression techniques on columnar storages is sequence compression. A sequence is a column of numbers where each distance between two neighbouring numbers is equal. Example: 1 2 3 4 5 6 7 8 9 3 3 3 3 3 3 3 3 3 3 10 20 30 40 8 7 6 5 A sequence can be described by its starting point, its stride as well as its length: (1,1,9) (3,0,10) (10,10,4) (8,-1,4) The longer the sequence, the higher the compression ratio. For a typical SQL engine workload with AUTO_INCREMENT IDs, this means that you can&hellip;<\/p>\n","category_list_v2":"<a href=\"https:\/\/launix.de\/launix\/en\/category\/memcp\/\" rel=\"category tag\">MemCP<\/a>, <a href=\"https:\/\/launix.de\/launix\/en\/category\/programming\/\" rel=\"category tag\">Programming<\/a>","author_info_v2":{"name":"Carl-Philip H\u00e4nsch","url":"https:\/\/launix.de\/launix\/en\/author\/carli\/"},"comments_num_v2":"0 comments","uagb_featured_image_src":{"full":["https:\/\/launix.de\/launix\/wp-content\/uploads\/2023\/02\/animals-737407_1920.jpg",1920,1280,false],"thumbnail":["https:\/\/launix.de\/launix\/wp-content\/uploads\/2023\/02\/animals-737407_1920-150x150.jpg",150,150,true],"medium":["https:\/\/launix.de\/launix\/wp-content\/uploads\/2023\/02\/animals-737407_1920-300x200.jpg",300,200,true],"medium_large":["https:\/\/launix.de\/launix\/wp-content\/uploads\/2023\/02\/animals-737407_1920-768x512.jpg",751,501,true],"large":["https:\/\/launix.de\/launix\/wp-content\/uploads\/2023\/02\/animals-737407_1920-1024x683.jpg",751,501,true],"1536x1536":["https:\/\/launix.de\/launix\/wp-content\/uploads\/2023\/02\/animals-737407_1920-1536x1024.jpg",1536,1024,true],"2048x2048":["https:\/\/launix.de\/launix\/wp-content\/uploads\/2023\/02\/animals-737407_1920.jpg",1920,1280,false],"trp-custom-language-flag":["https:\/\/launix.de\/launix\/wp-content\/uploads\/2023\/02\/animals-737407_1920.jpg",18,12,false],"xs-thumb":["https:\/\/launix.de\/launix\/wp-content\/uploads\/2023\/02\/animals-737407_1920-64x64.jpg",64,64,true],"appku-shop-single":["https:\/\/launix.de\/launix\/wp-content\/uploads\/2023\/02\/animals-737407_1920.jpg",620,413,false]},"uagb_author_info":{"display_name":"Carl-Philip H\u00e4nsch","author_link":"https:\/\/launix.de\/launix\/en\/author\/carli\/"},"uagb_comment_info":0,"uagb_excerpt":"One of the most interesting compression techniques on columnar storages is sequence compression.","_links":{"self":[{"href":"https:\/\/launix.de\/launix\/en\/wp-json\/wp\/v2\/posts\/4901","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/launix.de\/launix\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/launix.de\/launix\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/launix.de\/launix\/en\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/launix.de\/launix\/en\/wp-json\/wp\/v2\/comments?post=4901"}],"version-history":[{"count":4,"href":"https:\/\/launix.de\/launix\/en\/wp-json\/wp\/v2\/posts\/4901\/revisions"}],"predecessor-version":[{"id":4908,"href":"https:\/\/launix.de\/launix\/en\/wp-json\/wp\/v2\/posts\/4901\/revisions\/4908"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/launix.de\/launix\/en\/wp-json\/wp\/v2\/media\/4902"}],"wp:attachment":[{"href":"https:\/\/launix.de\/launix\/en\/wp-json\/wp\/v2\/media?parent=4901"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/launix.de\/launix\/en\/wp-json\/wp\/v2\/categories?post=4901"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/launix.de\/launix\/en\/wp-json\/wp\/v2\/tags?post=4901"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}