{"id":4698,"date":"2023-01-17T13:43:15","date_gmt":"2023-01-17T12:43:15","guid":{"rendered":"https:\/\/launix.de\/launix\/?p=4698"},"modified":"2023-03-02T10:26:44","modified_gmt":"2023-03-02T09:26:44","slug":"on-designing-an-interface-for-columnar-in-memory-storage-in-golang","status":"publish","type":"post","link":"https:\/\/launix.de\/launix\/en\/on-designing-an-interface-for-columnar-in-memory-storage-in-golang\/","title":{"rendered":"On designing an interface for columnar in-memory storage in golang"},"content":{"rendered":"\n<p>The advantages for columnar storages over row based storages are the ability for good in-memory compression <strong>(low memory usage)<\/strong> and cache locality when accessing only few columns <strong>(performance)<\/strong>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Designing an Interface<\/h2>\n\n\n\n<p>When designing an interface for a storage engine for an in-memory database, a lot of considerations have to be made. Here are the design goals:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The interface must be simple so that using it does not require implementing all kinds of corner cases (especially with datatypes)<\/li>\n\n\n\n<li>The interface must allow fast value fetches for random data access<\/li>\n\n\n\n<li>The interface must allow analyzing the data and finding the perfect storage and compression format<\/li>\n<\/ul>\n\n\n\n<p>Here&#8217;s what I came up with:<\/p>\n\n\n\n<pre><font color=\"#E9AD0C\">type<\/font> ColumnStorage <font color=\"#E9AD0C\">interface<\/font> {\n        getValue(<font color=\"#87FFAF\">uint<\/font>) scmer <font color=\"#33C7DE\">\/\/ read function<\/font>\n        <font color=\"#33C7DE\">\/\/ buildup functions 1) prepare 2) scan, 3) proposeCompression(), if != nil repeat at 1, 4) init, 5) build; all values are passed through twice<\/font>\n        <font color=\"#33C7DE\">\/\/ analyze<\/font>\n        prepare()\n        scan(<font color=\"#87FFAF\">uint<\/font>, scmer)\n        proposeCompression() ColumnStorage\n\n        <font color=\"#33C7DE\">\/\/ store<\/font>\n        init(<font color=\"#87FFAF\">uint<\/font>)\n        build(<font color=\"#87FFAF\">uint<\/font>, scmer)\n        finish()\n}\n<\/pre>\n\n\n\n<p>The read interface is pretty simple: <code>getValue(i)<\/code> will read the column value at recordId <code>i<\/code> and return the <code>scmer<\/code> value. <code>scmer<\/code> is a kind of &#8220;any&#8221; datatype that is used in our scheme interpreter to further process the data.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Data Analysis on Columnar Storages<\/h2>\n\n\n\n<p>The write interface is a bit more complicated and basically works in two passes:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>analyze phase:\n<ul class=\"wp-block-list\">\n<li><code>prepare()<\/code> is called<\/li>\n\n\n\n<li>every future value that shall be stored will be passed to <code>scan(recordId, value)<\/code><\/li>\n\n\n\n<li>the columnar storage will analyze the data. If it knows a better storage format than its own class, it will return an other column container when calling <code>proposeCompression()<\/code><\/li>\n\n\n\n<li>when <code>proposeCompression()<\/code> returns another container, repeat step 1 analyze phase with the new container<\/li>\n\n\n\n<li>otherwise proceed with the store phase<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li>store phase\n<ul class=\"wp-block-list\">\n<li><code>init(size)<\/code> is called where the storage can allocate the memory for its columns<\/li>\n\n\n\n<li>for each value, <code>build(recordId, value)<\/code> is called to write the data to memory<\/li>\n\n\n\n<li>for cleanup of possible reverse dictionaries that are only needed in the store phase, <code>finish()<\/code> is called<\/li>\n<\/ul>\n<\/li>\n<\/ol>\n\n\n\n<p>During analyze phase, statistics about the data can be collected. This allows a storage engine to check how wide the integers are or if there are lots of string duplicates.<\/p>\n\n\n\n<p>The classes implementing that interface are then structured as follows:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><code>StorageSCMER<\/code> is a universal value storage; fast but needs lot of RAM<\/li>\n\n\n\n<li><code>StorageSCMER<\/code> analyzes a column whether it better fits the StorageString or StorageInt scheme; otherwise (i.e. for float64 values, stay at StorageSCMER)<\/li>\n\n\n\n<li><code>StorageInt<\/code> is a bit compressed integer storage. When analyzed values do not exceed i.e. 15, the storage will only eat up 4 bits per stored integer. The numbers are stored in an array of 64 bit integers which are shifted when reading and writing<\/li>\n\n\n\n<li><code>StorageString<\/code> is a dictionary compressed string storage using golang slices<\/li>\n<\/ul>\n\n\n\n<p>With this interface we have built a powerful interface to implement any columnar storage compression format. We can support bit compressed integers as well as dictionaries.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Evaluation<\/h2>\n\n\n\n<p>We loaded a bunch of 55MiB of .jsonl files into our storage. This is our memory usage using the storage interface:<\/p>\n\n\n\n<figure class=\"wp-block-table caption-align-default\"><table><tbody><tr><td><strong>Storage<\/strong><\/td><td class=\"has-text-align-right\" data-align=\"right\"><strong>Memory Usage<\/strong><\/td><\/tr><tr><td>MySQL InnoDB original storage<\/td><td class=\"has-text-align-right\" data-align=\"right\">55 MiB Disk Space<\/td><\/tr><tr><td>Export from MySQL in .jsonl format<\/td><td class=\"has-text-align-right\" data-align=\"right\">55 MiB Disk Space<\/td><\/tr><tr><td><\/td><td class=\"has-text-align-right\" data-align=\"right\"><\/td><\/tr><tr><td>Imported .jsonl as []map[string]scmer into delta storage<\/td><td class=\"has-text-align-right\" data-align=\"right\">233 MiB RAM<\/td><\/tr><tr><td>Moved to main storage only using <code>StorageSCMER<\/code><\/td><td class=\"has-text-align-right\" data-align=\"right\">81 MiB RAM<\/td><\/tr><tr><td>Using <code>StorageInt<\/code> for integer columns<\/td><td class=\"has-text-align-right\" data-align=\"right\">46 MiB RAM<\/td><\/tr><tr><td>Using <code>StorageString<\/code> for string columns<\/td><td class=\"has-text-align-right\" data-align=\"right\">58 MiB RAM<\/td><\/tr><tr><td>Using <code>StorageInt<\/code> and <code>StorageString<\/code><\/td><td class=\"has-text-align-right\" data-align=\"right\">23 MiB RAM<\/td><\/tr><\/tbody><\/table><figcaption class=\"wp-element-caption\">Evaluation<\/figcaption><\/figure>\n\n\n\n<p>As you see, RAM compressed columnar storage needs only about half the RAM compared to disk based InnoDB storage or .json files.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><a href=\"https:\/\/launix.de\/launix\/wp-content\/uploads\/2023\/01\/inmemdb.png\"><img loading=\"lazy\" decoding=\"async\" width=\"962\" height=\"993\" src=\"https:\/\/launix.de\/launix\/wp-content\/uploads\/2023\/01\/inmemdb.png\" alt=\"\" class=\"wp-image-4702\" srcset=\"https:\/\/launix.de\/launix\/wp-content\/uploads\/2023\/01\/inmemdb.png 962w, https:\/\/launix.de\/launix\/wp-content\/uploads\/2023\/01\/inmemdb-291x300.png 291w, https:\/\/launix.de\/launix\/wp-content\/uploads\/2023\/01\/inmemdb-768x793.png 768w, https:\/\/launix.de\/launix\/wp-content\/uploads\/2023\/01\/inmemdb-600x619.png 600w\" sizes=\"auto, (max-width: 962px) 100vw, 962px\" \/><\/a><\/figure>\n","protected":false},"excerpt":{"rendered":"<p>The advantages for columnar storages over row based storages are the ability for good in-memory compression (low memory usage) and cache locality when accessing only few columns (performance). Designing an Interface When designing an interface for a storage engine for an in-memory database, a lot of considerations have to be made. Here are the design&#8230;<\/p>","protected":false},"author":2,"featured_media":4704,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_editorskit_title_hidden":false,"_editorskit_reading_time":2,"_editorskit_is_block_options_detached":false,"_editorskit_block_options_position":"{}","_uag_custom_page_level_css":"","footnotes":""},"categories":[129,128],"tags":[],"class_list":["post-4698","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-memcp","category-programming","single-item"],"featured_image_urls_v2":{"full":["https:\/\/launix.de\/launix\/wp-content\/uploads\/2023\/01\/chips-20072_1920.jpg",1920,1280,false],"thumbnail":["https:\/\/launix.de\/launix\/wp-content\/uploads\/2023\/01\/chips-20072_1920-150x150.jpg",150,150,true],"medium":["https:\/\/launix.de\/launix\/wp-content\/uploads\/2023\/01\/chips-20072_1920-300x200.jpg",300,200,true],"medium_large":["https:\/\/launix.de\/launix\/wp-content\/uploads\/2023\/01\/chips-20072_1920-768x512.jpg",751,501,true],"large":["https:\/\/launix.de\/launix\/wp-content\/uploads\/2023\/01\/chips-20072_1920-1024x683.jpg",751,501,true],"1536x1536":["https:\/\/launix.de\/launix\/wp-content\/uploads\/2023\/01\/chips-20072_1920-1536x1024.jpg",1536,1024,true],"2048x2048":["https:\/\/launix.de\/launix\/wp-content\/uploads\/2023\/01\/chips-20072_1920.jpg",1920,1280,false],"trp-custom-language-flag":["https:\/\/launix.de\/launix\/wp-content\/uploads\/2023\/01\/chips-20072_1920.jpg",18,12,false],"xs-thumb":["https:\/\/launix.de\/launix\/wp-content\/uploads\/2023\/01\/chips-20072_1920-64x64.jpg",64,64,true],"appku-shop-single":["https:\/\/launix.de\/launix\/wp-content\/uploads\/2023\/01\/chips-20072_1920.jpg",620,413,false]},"post_excerpt_stackable_v2":"<p>The advantages for columnar storages over row based storages are the ability for good in-memory compression (low memory usage) and cache locality when accessing only few columns (performance). Designing an Interface When designing an interface for a storage engine for an in-memory database, a lot of considerations have to be made. Here are the design goals: The interface must be simple so that using it does not require implementing all kinds of corner cases (especially with datatypes) The interface must allow fast value fetches for random data access The interface must allow analyzing the data and finding the perfect storage&hellip;<\/p>\n","category_list_v2":"<a href=\"https:\/\/launix.de\/launix\/en\/category\/memcp\/\" rel=\"category tag\">MemCP<\/a>, <a href=\"https:\/\/launix.de\/launix\/en\/category\/programming\/\" rel=\"category tag\">Programming<\/a>","author_info_v2":{"name":"Carl-Philip H\u00e4nsch","url":"https:\/\/launix.de\/launix\/en\/author\/carli\/"},"comments_num_v2":"0 comments","uagb_featured_image_src":{"full":["https:\/\/launix.de\/launix\/wp-content\/uploads\/2023\/01\/chips-20072_1920.jpg",1920,1280,false],"thumbnail":["https:\/\/launix.de\/launix\/wp-content\/uploads\/2023\/01\/chips-20072_1920-150x150.jpg",150,150,true],"medium":["https:\/\/launix.de\/launix\/wp-content\/uploads\/2023\/01\/chips-20072_1920-300x200.jpg",300,200,true],"medium_large":["https:\/\/launix.de\/launix\/wp-content\/uploads\/2023\/01\/chips-20072_1920-768x512.jpg",751,501,true],"large":["https:\/\/launix.de\/launix\/wp-content\/uploads\/2023\/01\/chips-20072_1920-1024x683.jpg",751,501,true],"1536x1536":["https:\/\/launix.de\/launix\/wp-content\/uploads\/2023\/01\/chips-20072_1920-1536x1024.jpg",1536,1024,true],"2048x2048":["https:\/\/launix.de\/launix\/wp-content\/uploads\/2023\/01\/chips-20072_1920.jpg",1920,1280,false],"trp-custom-language-flag":["https:\/\/launix.de\/launix\/wp-content\/uploads\/2023\/01\/chips-20072_1920.jpg",18,12,false],"xs-thumb":["https:\/\/launix.de\/launix\/wp-content\/uploads\/2023\/01\/chips-20072_1920-64x64.jpg",64,64,true],"appku-shop-single":["https:\/\/launix.de\/launix\/wp-content\/uploads\/2023\/01\/chips-20072_1920.jpg",620,413,false]},"uagb_author_info":{"display_name":"Carl-Philip H\u00e4nsch","author_link":"https:\/\/launix.de\/launix\/en\/author\/carli\/"},"uagb_comment_info":0,"uagb_excerpt":"The advantages for columnar storages over row based storages are the ability for good in-memory compression (low memory usage) and cache locality when accessing only few columns (performance). Designing an Interface When designing an interface for a storage engine for an in-memory database, a lot of considerations have to be made. Here are the design...","_links":{"self":[{"href":"https:\/\/launix.de\/launix\/en\/wp-json\/wp\/v2\/posts\/4698","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/launix.de\/launix\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/launix.de\/launix\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/launix.de\/launix\/en\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/launix.de\/launix\/en\/wp-json\/wp\/v2\/comments?post=4698"}],"version-history":[{"count":4,"href":"https:\/\/launix.de\/launix\/en\/wp-json\/wp\/v2\/posts\/4698\/revisions"}],"predecessor-version":[{"id":4822,"href":"https:\/\/launix.de\/launix\/en\/wp-json\/wp\/v2\/posts\/4698\/revisions\/4822"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/launix.de\/launix\/en\/wp-json\/wp\/v2\/media\/4704"}],"wp:attachment":[{"href":"https:\/\/launix.de\/launix\/en\/wp-json\/wp\/v2\/media?parent=4698"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/launix.de\/launix\/en\/wp-json\/wp\/v2\/categories?post=4698"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/launix.de\/launix\/en\/wp-json\/wp\/v2\/tags?post=4698"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}