When thinking about compression, one often assumes that a bit cannot be compressed into less than one bit. In this article, I will prove otherwise.
Shannon's Coding Theorem
But there must be a way, according to Shannon's Coding Theorem. A single bit is defined as the amount of information needed to encode a decision where both outcomes occur with 50-50 probability.
Whenever an event has a higher probability, say 75%, this lower bound drops to about 0.415 bits. Its 25% counterpart, however, takes 2 bits in this encoding. But how can we encode something like half bits or even smaller on a computer whose smallest storage unit is a whole bit?
The number of bits needed to encode an event according to Shannon's Coding Theorem is (ld being the binary logarithm):
E(X) = -ld(P(X))
So the total amount of data needed to store a bit mask with the distribution of 25% 1’s and 75% 0’s is:
H(X) = -len * ΣP(X)*ld(P(X))
= len * (0.25 * 2 bits + 0.75 * 0.415 bits)
≈ len * 0.81 bits
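As a quick sanity check, here is a small Go snippet (purely illustrative, not part of MemCP; the function name entropyPerBit is my own) that evaluates this formula for the 25/75 distribution:

```go
package main

import (
	"fmt"
	"math"
)

// entropyPerBit returns the Shannon entropy in bits for a binary
// distribution where a 1 occurs with probability p.
func entropyPerBit(p float64) float64 {
	return -p*math.Log2(p) - (1-p)*math.Log2(1-p)
}

func main() {
	fmt.Printf("%.2f bits per bit\n", entropyPerBit(0.25)) // prints 0.81
}
```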
So a 25-75 distribution only needs 0.81 bits per bit. But how do we encode this?
Sub-Bit encoding
Given an integer greater than or equal to 2, let's say 100, we can split any number into two parts:
a) number / 100 (integer division)
b) number % 100 (modulo – remainder of the integer division)
Whenever we read the modulo of a big number, we get a value between 0 and 99. When the probability of a 0 is 81% (and thus 19% for a 1), we can treat every remainder >= 81 as a 1 and every remainder < 81 as a 0. In case of a 0, we are left with a number between 0 and 80; in case of a 1, we have a number between 81 and 99 from which we subtract 81, so we get a number between 0 and 18. This leftover value is the remainder of our bit extraction.
After we have obtained this remainder-remainder (either 0..80 for a 0 or 0..18 for a 1), we store it back into the big number by setting number = (number / 100) * remainder_remainder_size + remainder_remainder, where remainder_remainder_size is 81 for a 0 and 19 for a 1.
This way we get a kind of "stack" structure from which we can extract single bits while shrinking the number by only a fractional number of bits.
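Here is a minimal Go sketch of this push/pop scheme, assuming a divisor of 100 and a split point of 81 as in the example above; the names pushBit and popBit are made up for illustration and this is not MemCP's actual code:

```go
package main

import (
	"fmt"
	"math/big"
)

// Assumed parameters: remainders 0..80 decode to 0, remainders 81..99 decode to 1.
const divisor = 100
const split = 81

// popBit extracts the most recently pushed bit from the big number and
// stores the remainder-remainder back, shrinking the number by only a
// fraction of a bit on average.
func popBit(n *big.Int) int {
	q := new(big.Int)
	r := new(big.Int)
	q.DivMod(n, big.NewInt(divisor), r)
	bit := 0
	rrSize := int64(split) // 81 possible leftovers for a 0
	if r.Int64() >= split {
		bit = 1
		rrSize = divisor - split // 19 possible leftovers for a 1
		r.Sub(r, big.NewInt(split))
	}
	// number = (number / 100) * remainder_remainder_size + remainder_remainder
	n.Mul(q, big.NewInt(rrSize))
	n.Add(n, r)
	return bit
}

// pushBit is the inverse operation: it encodes one bit into the number.
// Bits have to be popped in reverse order (last in, first out).
func pushBit(n *big.Int, bit int) {
	rrSize := int64(split)
	offset := int64(0)
	if bit == 1 {
		rrSize = divisor - split
		offset = split
	}
	q := new(big.Int)
	rr := new(big.Int)
	q.DivMod(n, big.NewInt(rrSize), rr)
	n.Mul(q, big.NewInt(divisor))
	n.Add(n, big.NewInt(offset))
	n.Add(n, rr)
}

func main() {
	n := big.NewInt(0)
	bits := []int{0, 1, 0, 0, 0, 0, 1, 0}
	for _, b := range bits {
		pushBit(n, b)
	}
	fmt.Println("encoded as:", n)
	for range bits {
		fmt.Print(popBit(n), " ") // bits come back out in reverse order
	}
	fmt.Println()
}
```

Note that the bits come back out in reverse order of how they were pushed, which is exactly why the structure behaves like a stack.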
Understanding fractional bits
Given our divisor is 256 (which is 8 bits), we extract 8 bits from the big number with every division. When the probability of our event is 1/256, we have a remainder_remainder_size of 1, so the number is multiplied by 1 and 0 is added. We have effectively extracted 8 bits from the number.
Given our divisor is 256 and the probability is 50% (so the remainder split is at 128), we extract 8 bits and push back 7 bits (number * 128 + remainder_remainder), so we have effectively extracted 1 bit from the number.
Given our divisor is 256 and the probability is 255/256 with a split point of 255, we extract 8 bits and store back 7.994 bits (ld(255)), so we have effectively extracted only about 0.006 bits.
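To make these numbers concrete, here is a tiny illustrative Go helper (effectiveBits is a hypothetical name) that computes how many bits one division effectively extracts, i.e. ld(divisor) - ld(remainder_remainder_size):

```go
package main

import (
	"fmt"
	"math"
)

// effectiveBits returns how many bits one division effectively extracts:
// ld(divisor) bits are popped, ld(rrSize) bits are pushed back.
func effectiveBits(divisor, rrSize float64) float64 {
	return math.Log2(divisor) - math.Log2(rrSize)
}

func main() {
	fmt.Println(effectiveBits(256, 1))   // 8 bits  (event probability 1/256)
	fmt.Println(effectiveBits(256, 128)) // 1 bit   (probability 50%)
	fmt.Println(effectiveBits(256, 255)) // ~0.0056 (probability 255/256)
}
```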
Putting it all together
To store large amounts of sub-bits, a few optimizations are needed. Storing millions of bits would require big numbers that are hundreds of thousands of bits long, and the more bits we want to store, the larger our storage number gets.
On modern processor architectures, instead of handling one huge number, one can use multiple numbers that do not exceed 64 bits each. A 64-bit number can store about 200 bits given a 1:8 probability ratio, as used in sparse bit maps such as those in MemCP.
This also solves another problem: random access. With sub-bit compression, the single bits are not that easily accessible; to reach a specific bit, one first has to pop all bits that sit above it in the stack-like structure.
However, when about 200 bits fit into a 64-bit integer, one can store the number of bits per 64-bit chunk in an 8-bit integer. When randomly accessing a bit, this bit-count array lets us skip all 64-bit integers that come before the wanted bit; then we only have to unravel the bits from the right 64-bit integer.
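A possible layout for this is sketched below in Go: the 64-bit chunks and their bit counts live in two parallel arrays, whole chunks are skipped by their counts, and only the final chunk is unravelled. This is my own illustration of the idea, not MemCP's actual implementation, and all names (subBitVector, bitAt, popBit64, pushBit64) are hypothetical:

```go
package main

import "fmt"

// Same assumed divisor and split point as in the earlier example.
const divisor = 100
const split = 81

// subBitVector: every 64-bit chunk is its own sub-bit stack, and counts
// records how many bits each chunk holds.
type subBitVector struct {
	chunks []uint64
	counts []uint8
}

// popBit64 pops the most recently pushed bit from a 64-bit chunk.
func popBit64(n uint64) (bit int, rest uint64) {
	q, r := n/divisor, n%divisor
	if r >= split {
		return 1, q*(divisor-split) + (r - split)
	}
	return 0, q*split + r
}

// pushBit64 encodes one bit into a 64-bit chunk (inverse of popBit64).
func pushBit64(n uint64, bit int) uint64 {
	if bit == 1 {
		rr := uint64(divisor - split)
		return (n/rr)*divisor + split + n%rr
	}
	return (n/split)*divisor + n%split
}

// bitAt skips whole chunks using the counts array, then pops bits from
// the right chunk until the wanted position is reached.
func (v *subBitVector) bitAt(i int) int {
	chunk := 0
	for i >= int(v.counts[chunk]) {
		i -= int(v.counts[chunk]) // skip chunks before the wanted bit
		chunk++
	}
	n := v.chunks[chunk]
	for {
		bit, rest := popBit64(n)
		if i == 0 {
			return bit
		}
		i--
		n = rest
	}
}

func main() {
	bits := []int{0, 0, 1, 0, 0, 0, 0, 1, 0, 0}
	var chunk uint64
	// push in reverse order so that bitAt(0) returns bits[0]
	for i := len(bits) - 1; i >= 0; i-- {
		chunk = pushBit64(chunk, bits[i])
	}
	v := &subBitVector{chunks: []uint64{chunk}, counts: []uint8{uint8(len(bits))}}
	for i := range bits {
		fmt.Print(v.bitAt(i), " ") // prints the original sequence
	}
	fmt.Println()
}
```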
In-Memory Compression in MemCP Database
MemCP is an in-memory storage engine licensed under the GPLv3 that is capable of extreme compression of data in RAM. This helps improve access times when memory bandwidth is a bottleneck on big mainframe architectures.
It compresses data with an average compression ratio of 1:5 compared to disk-based database systems, which makes data access extremely fast. The columnar storage format allows efficient handling of analytic workloads on big data in real time.
The database is able to handle big data in real time, especially for ERP systems or QA measurement systems that have to compute results over large amounts of data.