⚠️ This library is in active development: there is currently no release schedule!
This package is developed by Well-Typed LLP on behalf of Input Output Global, Inc. (IOG) and INTERSECT. The main contributors are Duncan Coutts, Joris Dral, Matthias Heinzel, Wolfgang Jeltsch, Wen Kokke, and Alex Washburn.
This package contains an efficient implementation of on-disk key–value
storage, implemented as a log-structured merge-tree or LSM-tree. An
LSM-tree is a data structure for key–value mappings, similar to Data.Map,
but optimized for large tables with a high insertion volume.
It has support for:
-
Basic key–value operations, such as lookup, insert, and delete.
-
Range lookups, which efficiently retrieve the values for all keys in a given range.
-
Monoidal upserts (or "mupserts"), which combine the stored and new values.
-
BLOB storage, which associates a large auxiliary BLOB with a key.
-
Durable on-disk persistence and rollback via named snapshots.
-
Cheap table duplication where all duplicates can be independently accessed and modified.
-
High-performance lookups on SSDs using I/O batching and parallelism.
This package exports two modules:
-
Database.LSMTree.Simple
This module exports a simplified API which picks sensible defaults for a number of configuration parameters.
It does not support mupserts or BLOBs because of their unintuitive interaction (see Mupserts and BLOBs).
If you are looking at this package for the first time, it is strongly recommended that you start by reading this module.
-
Database.LSMTree
This module exports the full API.
The interaction between mupserts and BLOBs is unintuitive. A mupsert updates the value associated with the key by combining the old and new value with a user-specified function. However, this does not apply to any BLOB value associated with the key, which is simply overwritten by the new BLOB value.
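The behaviour described above can be modelled with an ordinary in-memory map. The following is a minimal sketch and does not use this package's API: the `Entry` type and the `mupsertWithBlob` function are hypothetical names introduced only for illustration, and the sketch assumes every entry carries a BLOB.

```haskell
import qualified Data.Map.Strict as Map
import Data.Map.Strict (Map)

-- | Hypothetical model of a table entry: a value plus its associated BLOB.
data Entry v b = Entry
  { entryValue :: v
  , entryBlob  :: b
  } deriving (Eq, Show)

-- | Model of a mupsert that also carries a BLOB. The old and new values
-- are combined with the user-specified function, but the BLOB is not
-- combined: the new BLOB simply replaces the old one.
mupsertWithBlob
  :: Ord k
  => (v -> v -> v)        -- ^ combines the new value with the old value
  -> k
  -> v
  -> b
  -> Map k (Entry v b)
  -> Map k (Entry v b)
mupsertWithBlob combine k v blob = Map.insertWith merge k (Entry v blob)
  where
    -- Map.insertWith passes the new entry first and the old entry second.
    merge new old =
      Entry (combine (entryValue new) (entryValue old)) (entryBlob new)
```

In this model, mupserting the value 1 with BLOB "blob-a" and then the value 2 with BLOB "blob-b" under the same key, using `(+)` as the combining function, leaves the value 3 paired with "blob-b": the values are summed, while the first BLOB is lost.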
-
This package only supports 64-bit, little-endian systems.
-
On Windows, the package has only been tested with NTFS filesystems.
-
On Linux, executables using this package, including test and benchmark suites, must be compiled with the -threaded RTS option enabled.
LSM-trees can be used concurrently, but with a few restrictions:
-
Each session locks its session directory. This means that a database cannot be accessed from different processes at the same time.
-
Tables can be used concurrently, and concurrent use of read operations such as lookups is deterministic. However, concurrent use of write operations such as insert or delete with any other operation results in a race condition.
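One conventional way to respect the second restriction is to serialise access through a lock owned by the application. The following is a minimal sketch under stated assumptions: it guards an in-memory `Data.Map` standing in for a table, it does not use this package's API, and `GuardedTable`, `newGuardedTable`, `guardedInsert`, and `guardedLookup` are hypothetical names.

```haskell
import Control.Concurrent.MVar (MVar, modifyMVar_, newMVar, withMVar)
import qualified Data.Map.Strict as Map

-- | Hypothetical stand-in for a table, protected by an MVar used as a lock.
newtype GuardedTable k v = GuardedTable (MVar (Map.Map k v))

newGuardedTable :: IO (GuardedTable k v)
newGuardedTable = GuardedTable <$> newMVar Map.empty

-- | A write takes the lock, so it cannot overlap with any other operation.
guardedInsert :: Ord k => GuardedTable k v -> k -> v -> IO ()
guardedInsert (GuardedTable var) k v =
  modifyMVar_ var (pure . Map.insert k v)

-- | A read also takes the lock, so it cannot overlap with a write.
guardedLookup :: Ord k => GuardedTable k v -> k -> IO (Maybe v)
guardedLookup (GuardedTable var) k =
  withMVar var (pure . Map.lookup k)
```

This simple scheme also serialises reads with each other, which gives up the concurrent reads that the package does permit. A reader-writer lock would keep reads concurrent at the cost of more machinery.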
The worst-case time and space complexities are given in big-O notation. The time cost of operations on LSM-trees is generally dominated by the number of disk I/O actions. As such, the worst-case complexity of each basic operation refers to the number of disk I/O actions.
TODO: Describe the time complexity of the basic operations.
The in-memory size of an LSM-tree is described in terms of the variable
n, which refers to the number of physical database entries. A
physical database entry is any key–operation pair, e.g., Insert k v
or Delete k, whereas a logical database entry is determined by all
physical entries with the same key.
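The distinction between physical and logical entries can be illustrated with a small model. This is a sketch and not the package's internal representation: the `Operation` type and the `logicalEntries` function are hypothetical, the model covers only inserts and deletes, and it assumes the physical entries are listed from oldest to newest.

```haskell
import Data.List (foldl')
import qualified Data.Map.Strict as Map
import Data.Map.Strict (Map)

-- | Hypothetical model of the operation half of a key–operation pair.
data Operation v = Insert v | Delete
  deriving (Eq, Show)

-- | A physical entry pairs a key with an operation.
type PhysicalEntry k v = (k, Operation v)

-- | Resolve physical entries (oldest first) into logical entries.
-- Later operations on a key take precedence over earlier ones.
logicalEntries :: Ord k => [PhysicalEntry k v] -> Map k v
logicalEntries = foldl' step Map.empty
  where
    step acc (k, Insert v) = Map.insert k v acc
    step acc (k, Delete)   = Map.delete k acc

-- | Four physical entries (n = 4) that resolve to one logical entry.
example :: [PhysicalEntry Int String]
example = [(1, Insert "a"), (1, Insert "b"), (2, Insert "c"), (2, Delete)]
```

Here `length example` is 4, while `logicalEntries example` contains only key 1 mapped to "b". The size bounds in this section are stated in terms of the former count, not the latter.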
The worst-case in-memory size of an LSM-tree is O(n).
-
The worst-case size of the write buffer is O(1).
The maximum size of the write buffer depends on the write buffer allocation strategy, which is determined by the confWriteBufferAlloc field of TableConfig. Regardless of the write buffer allocation strategy, the size of the write buffer may never exceed 4 GiB.
AllocNumEntries maxEntries
The maximum size of the write buffer is the maximum number of entries multiplied by the average size of a key–operation pair.
-
The worst-case size of the Bloom filters is O(n).
The total size of all Bloom filters depends on the Bloom filter allocation strategy, which is determined by the confBloomFilterAlloc field of TableConfig.
AllocFixed bitsPerPhysicalEntry
The total size of all Bloom filters is the number of bits per physical entry multiplied by the number of physical entries.
AllocRequestFPR requestedFPR
TODO: How does one determine the Bloom filter size using AllocRequestFPR?
-
The worst-case size of the indexes is O(n).
The total size of all indexes depends on the index type, which is determined by the confFencePointerIndex field of TableConfig. The size of the various indexes is described in reference to the size of the database in memory pages.
OrdinaryIndex
An ordinary index stores the maximum serialised key for each memory page. The total size of all indexes is proportional to the average size of one serialised key per memory page.
CompactIndex
A compact index stores the 64 most significant bits of the minimum serialised key for each memory page, as well as 1 bit per memory page to resolve clashes, 1 bit per memory page to mark overflow pages, and a negligible amount of memory for tie breakers. The total size of all indexes is approximately 66 bits per memory page.
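The size descriptions above translate into simple back-of-the-envelope arithmetic. The helpers below are a sketch for estimation only and are not part of this package: all function names are hypothetical, whole numbers are used for simplicity, and the entry counts, average entry size, and page counts are inputs that the caller has to supply.

```haskell
-- | Estimated maximum write buffer size in bytes for a fixed entry limit:
-- the maximum number of entries times the average key–operation pair size.
writeBufferBytes :: Integer -> Integer -> Integer
writeBufferBytes maxEntries avgEntryBytes = maxEntries * avgEntryBytes

-- | Estimated total Bloom filter size in bits for a fixed number of bits
-- per physical entry and n physical entries.
bloomFilterBits :: Integer -> Integer -> Integer
bloomFilterBits bitsPerPhysicalEntry n = bitsPerPhysicalEntry * n

-- | Estimated total compact index size in bits: roughly 66 bits per page
-- (64 key bits, 1 clash bit, 1 overflow bit), ignoring tie breakers.
compactIndexBits :: Integer -> Integer
compactIndexBits pages = 66 * pages

-- | Convert a size in bits to bytes, rounding up.
bitsToBytes :: Integer -> Integer
bitsToBytes bits = (bits + 7) `div` 8
```

For example, with 10 bits per physical entry and 100000000 physical entries, `bitsToBytes (bloomFilterBits 10 100000000)` gives 125000000 bytes, and a compact index over 1000000 memory pages gives `bitsToBytes (compactIndexBits 1000000)`, which is 8250000 bytes.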
The total size of an LSM-tree must not exceed 2^41 physical
entries. Violation of this condition is checked and will throw a
TableTooLargeError.
The implementation of LSM-trees in this package draws inspiration from:
-
Chris Okasaki. 1998. "Purely Functional Data Structures." doi:10.1017/CBO9780511530104
-
Niv Dayan, Manos Athanassoulis, and Stratos Idreos. 2017. "Monkey: Optimal Navigable Key-Value Store." doi:10.1145/3035918.3064054
-
Subhadeep Sarkar, Dimitris Staratzis, Zichen Zhu, and Manos Athanassoulis. 2021. "Constructing and analyzing the LSM compaction design space." doi:10.14778/3476249.3476274