
Most databases implement indices as a kind of tree. I will show you that columnar storages can do even better.
The most widely used kind of tree structure in databases is the B-Tree. A B-Tree is a n-ary tree whose nodes fit exactly into one “page” – may it be a cache line of 64 bytes or a HDD page of 512 bytes or a memory page of 4K.
The advantage of btrees is that it balances the memory fetch time with the time spent on computing on a chunk of memory. A btree node is basically layouted like pointer|value|pointer|value|pointer|value|pointer
This means, half of the space of the index is wasted on pointers, the other half is wasted on values. Latter is problematic when values get wide – especially when dealing with strings.
We will provide a different approach for indices that are far more memory efficient and thus faster in in-memory storage engines: sorted list indices.
Sorted lists of Record IDs
In contrast to OLTP databases that need fast insert and delete procedures, in a separated main and delta storage, we don’t have to care about insert performance because we never insert into the main storage. When items are inserted, they are inserted into the delta storage and the main storage is rebuilt on demand. Deletions are also easy to handle. When scanning through a table, we can just ignore deleted items.
So instead of storing pointers and values (where on multidimensional indices, values can span multiple columns), we can just store the recordId. For a 10k items storage, this would require less then 20KiB of RAM ! The list of recordIds can then be sorted according to our sort criterion and compressed into a bit compressed integer storage.
Whenever we want to find a certain item or a range of items, we can now walk through the sorted list of recordIds and perform our binary search algorithms.
This is the data structure I came up with:
type StorageIndex struct {
        cols []string // sort equal-cols alphabetically, so similar conditions are canonical
        savings float64 // store the amount of time savings here -> add selectivity (outputted / size) on each
        sortedItems StorageInt // we can do binary searches here
        t *table
        inactive bool
}
Here some facts:
- colsis the list of columns that are considered in the index
- The list of sorted recordIds is stored in sortedItems
- The items are sorted according to the value of cols[0]; in case of equality,cols[1]is considered and so on
- In the beginning, we set inactiveto true, so the index is considered but not built
- In savings, we accumulate the amount of computation time we could save by having the index
- As soon as savingsexceeds a threshold, the index is built. Otherwise, we will do a full table scan.
- The index implements func (s *StorageIndex) iterate(lower []scmer, upperLast scmer) chan uintwhich will start a index scan beginning atlowerand will stop as soon as the last column reaches a value greater thanupperLast. Wheninactiveis set, the function will instead increase thesavingsvalue and stream all recordIds of the table
Finding the Perfect Index for a Certain Condition
In memcp, we use scheme – a lisp dialect – as the internal query plan language. This has one big advantage: Every piece of code is an array that can be examined by our optimizer.
We implemented a helper function func (t *table) iterateIndex(condition scmer) chan uint that takes a condition, finds the perfect index or creates it and then streams all recordIds that might fit into the condition into a ring buffer (chan uint)
So we define a datatype like:
type columnboundaries struct{
        col string
        lower scmer
        upper scmer
}
And then collect all conditions into a variable cols of the type []columnbounbdaries. For doing so, we implemented a recursive function traverseCondition(condition scmer) that scans the condition for equal?, <, >, <=, >= as well as BETWEEN ... AND ... and IN expressions. Whenever a column value is compared against a constant, we can fill in a columnboundaries object.
After collecting all boundaries, we sort cols according to the following rules:
- equals?is more selective than- <or- >comparisons, so we put all boundaries with- lower == upperfirst
- An index scan can only consider one range column (as long as we don’t use spatial indices), so we can only keep one boundary with lower != upper
- To save memory, we sort all equals?-boundaries alphabetically by the column name, so that similar SQL queries can use the same index
- Instead of an index (col1, col2), the already existing index (col1, col2, col3) can be used as an alternative
As soon as the perfect index constellation is found, we create the StorageIndex object, but set it to inactive first. We then delegate the task of filling a chan uint to the StorageIndex object by calling func (s *StorageIndex) iterate(lower []scmer, upperLast scmer) chan uint
Automatic Index Building and Cost Calculation
We did some measurements on the performance of on-the-fly index building:
> (scan "PLZ" (lambda (Ort) (equal? Ort "Neugersdorf")) (lambda (PLZ Ort) (print PLZ " - " Ort))) 02727 - Neugersdorf ==> "8.275433ms" > (scan "PLZ" (lambda (Ort) (equal? Ort "Neugersdorf")) (lambda (PLZ Ort) (print PLZ " - " Ort))) building index on PLZ over [Ort] 02727 - Neugersdorf ==> "18.187862ms" > (scan "PLZ" (lambda (Ort) (equal? Ort "Neugersdorf")) (lambda (PLZ Ort) (print PLZ " - " Ort))) 02727 - Neugersdorf ==> "43.871µs" > (scan "PLZ" (lambda (Ort) (equal? Ort "Neugersdorf")) (lambda (PLZ Ort) (print PLZ " - " Ort))) 02727 - Neugersdorf ==> "45.681µs" > (scan "PLZ" (lambda (Ort) (equal? Ort "Neugersdorf")) (lambda (PLZ Ort) (print PLZ " - " Ort))) 02727 - Neugersdorf ==> "39.021µs" >
The first query did an unoptimized table scan with 8.3ms
The second query did the index build with 18.2ms
The third query already uses the index and only needs 42µs – this is a speedup of almost 200x!
So you see: A index build is about twice the cost as a full table scan. This means, as soon as a index is used the second time, we can build the index and remove the inactive flag. When we increase savings by 1.0 every time the index is used and build the index at a threshold of 2.0, we will have the optimal heuristic for index creation.
When a table is rebuild, we can of course take over our measurements stored in the savings variable to start the index build a bit earlier.
 English
English				 German
German					          
Comments are closed