On Compressing NULL values in bit-compressed Integer Storages

Usually, databases store NULL values in form of bitmasks. In this case, each value eats up 1 bit for the possibility to become NULL. I will prove that we can do better.

Uncompressed column stores of integers are a continuous column of i.e. 64 bit integers. The problem with 64 bit integers is that every possible value has a meaning. You cannot just take one value (i.e. MAXINT-1) and declare it to be NULL. Malicious users could smuggle that value in and turn a value NULL potentilly breaking the software, in worst case introducing a security risk. So the only chance to decode NULL in 64 bit integers is to introduce 65 bits.

In bit-compressed integer storages however, we exactly know the value range of our stored values. Whenever we know what the highest value is that will be stored into our storage, we can declare the next highest number to be NULL. This is how we do the detection:

diff --git a/storage/storage-int.go b/storage/storage-int.go
index 7a9dff0..4fed6d3 100644
--- a/storage/storage-int.go
+++ b/storage/storage-int.go
@@ -24,6 +24,8 @@ type StorageInt struct {
        chunk []uint64
        bitsize uint8
        hasNegative bool
+       hasNull bool
+       null uint64 // which value is null
 }

@@ -82,7 +92,15 @@
 func (s *StorageInt) scan(i uint, value scm.Scmer) {
        // storage is so simple, dont need scan
+       if value == nil {
+               s.hasNull = true
+               return
+       }
        v := toInt(value)
+       if v >= int64(s.null) {
+               // mark 1+highest value as null
+               s.null = uint64(v) + 1
+       }
        if v < 0 {
                s.hasNegative = true
                v = -v

@@ -93,20 +111,32 @@
 func (s *StorageInt) init(i uint) {
-       // allocate
+       if s.hasNull {
+               // need an extra bit because of null??
+               l := uint8(bits.Len64(uint64(s.null)))
+               if l > s.bitsize {
+                       s.bitsize = l
+               }
+       }
        if s.hasNegative {
                s.bitsize = s.bitsize + 1
        }

The last patch is to ensure that if we have e.g. booleans with 0 and 1 (1 bit), having NULL will expand the storage to 2 bits. Otherwise, we would have an encodign problem.

The read operation on the storage is fairly easy:

 func (s *StorageInt) getValue(i uint) scm.Scmer {
-       if (s.hasNegative) {
-               return scm.Number(s.getValueInt(i))
-       } else {
-               return scm.Number(s.getValueUInt(i))
+       if (s.hasNegative) { // with sign expansion
+               v := s.getValueInt(i)
+               if s.hasNull && uint64(v) == s.null {
+                       return nil
+               }
+               return scm.Number(v)
+       } else { // without sign expansion
+               v := s.getValueUInt(i)
+               if s.hasNull && v == s.null {
+                       return nil
+               }
+               return scm.Number(v)
        }
 }

So we found a way to encode NULL inside a bit compressed integer storage as a single value.

I measured the RAM usage of our test workload. Before the NULL implementation, some columns of our biggest table had to use StorageSCMER because of the NULL values. This implementation took 23 MB of RAM.

With the new NULL implementation, these columns have been able to be encoded in StorageInt which made the workload occupy only 16 MiB of RAM. That’s a 40% savings in overall memory!

memcp: Eine

When the

Comments are closed