Most databases implement indices as a kind of tree. I will show you that columnar storages can do even better.
The most widely used kind of tree structure in databases is the B-Tree. A B-Tree is a n-ary tree whose nodes fit exactly into one „page“ – may it be a cache line of 64 bytes or a HDD page of 512 bytes or a memory page of 4K.
The advantage of btrees is that it balances the memory fetch time with the time spent on computing on a chunk of memory. A btree node is basically layouted like pointer|value|pointer|value|pointer|value|pointer
This means, half of the space of the index is wasted on pointers, the other half is wasted on values. Latter is problematic when values get wide – especially when dealing with strings.
We will provide a different approach for indices that are far more memory efficient and thus faster in in-memory storage engines: sorted list indices.
Sorted lists of Record IDs
In contrast to OLTP databases that need fast insert and delete procedures, in a separated main and delta storage, we don’t have to care about insert performance because we never insert into the main storage. When items are inserted, they are inserted into the delta storage and the main storage is rebuilt on demand. Deletions are also easy to handle. When scanning through a table, we can just ignore deleted items.
So instead of storing pointers and values (where on multidimensional indices, values can span multiple columns), we can just store the recordId. For a 10k items storage, this would require less then 20KiB of RAM ! The list of recordIds can then be sorted according to our sort criterion and compressed into a bit compressed integer storage.
Whenever we want to find a certain item or a range of items, we can now walk through the sorted list of recordIds and perform our binary search algorithms.
This is the data structure I came up with:
type StorageIndex struct { cols []string // sort equal-cols alphabetically, so similar conditions are canonical savings float64 // store the amount of time savings here -> add selectivity (outputted / size) on each sortedItems StorageInt // we can do binary searches here t *table inactive bool }
Here some facts:
cols
is the list of columns that are considered in the index- The list of sorted recordIds is stored in
sortedItems
- The items are sorted according to the value of
cols[0]
; in case of equality,cols[1]
is considered and so on - In the beginning, we set
inactive
to true, so the index is considered but not built - In
savings
, we accumulate the amount of computation time we could save by having the index - As soon as
savings
exceeds a threshold, the index is built. Otherwise, we will do a full table scan. - The index implements
func (s *StorageIndex) iterate(lower []scmer, upperLast scmer) chan uint
which will start a index scan beginning atlower
and will stop as soon as the last column reaches a value greater thanupperLast
. Wheninactive
is set, the function will instead increase thesavings
value and stream all recordIds of the table
Finding the Perfect Index for a Certain Condition
In memcp, we use scheme – a lisp dialect – as the internal query plan language. This has one big advantage: Every piece of code is an array that can be examined by our optimizer.
We implemented a helper function func (t *table) iterateIndex(condition scmer) chan uint
that takes a condition, finds the perfect index or creates it and then streams all recordIds that might fit into the condition into a ring buffer (chan uint
)
So we define a datatype like:
type columnboundaries struct{ col string lower scmer upper scmer }
And then collect all conditions into a variable cols
of the type []columnbounbdaries
. For doing so, we implemented a recursive function traverseCondition(condition scmer)
that scans the condition for equal?
, <
, >
, <=
, >=
as well as BETWEEN ... AND ...
and IN
expressions. Whenever a column value is compared against a constant, we can fill in a columnboundaries
object.
After collecting all boundaries, we sort cols
according to the following rules:
equals?
is more selective than<
or>
comparisons, so we put all boundaries withlower == upper
first- An index scan can only consider one range column (as long as we don’t use spatial indices), so we can only keep one boundary with
lower != upper
- To save memory, we sort all
equals?
-boundaries alphabetically by the column name, so that similar SQL queries can use the same index - Instead of an index (col1, col2), the already existing index (col1, col2, col3) can be used as an alternative
As soon as the perfect index constellation is found, we create the StorageIndex
object, but set it to inactive
first. We then delegate the task of filling a chan uint
to the StorageIndex
object by calling func (s *StorageIndex) iterate(lower []scmer, upperLast scmer) chan uint
Automatic Index Building and Cost Calculation
We did some measurements on the performance of on-the-fly index building:
> (scan "PLZ" (lambda (Ort) (equal? Ort "Neugersdorf")) (lambda (PLZ Ort) (print PLZ " - " Ort))) 02727 - Neugersdorf ==> "8.275433ms" > (scan "PLZ" (lambda (Ort) (equal? Ort "Neugersdorf")) (lambda (PLZ Ort) (print PLZ " - " Ort))) building index on PLZ over [Ort] 02727 - Neugersdorf ==> "18.187862ms" > (scan "PLZ" (lambda (Ort) (equal? Ort "Neugersdorf")) (lambda (PLZ Ort) (print PLZ " - " Ort))) 02727 - Neugersdorf ==> "43.871µs" > (scan "PLZ" (lambda (Ort) (equal? Ort "Neugersdorf")) (lambda (PLZ Ort) (print PLZ " - " Ort))) 02727 - Neugersdorf ==> "45.681µs" > (scan "PLZ" (lambda (Ort) (equal? Ort "Neugersdorf")) (lambda (PLZ Ort) (print PLZ " - " Ort))) 02727 - Neugersdorf ==> "39.021µs" >
The first query did an unoptimized table scan with 8.3ms
The second query did the index build with 18.2ms
The third query already uses the index and only needs 42µs – this is a speedup of almost 200x!
So you see: A index build is about twice the cost as a full table scan. This means, as soon as a index is used the second time, we can build the index and remove the inactive
flag. When we increase savings
by 1.0 every time the index is used and build the index at a threshold of 2.0, we will have the optimal heuristic for index creation.
When a table is rebuild, we can of course take over our measurements stored in the savings
variable to start the index build a bit earlier.
Comments are closed