Options creation¶
Options objects¶
Important
The default values mentioned here describe the values of the C++ library only. This wrapper does not set any default value itself, so this document may become outdated as soon as the RocksDB developers change a default value. If you really depend on a default value, double-check it against the corresponding version of the C++ library.
- class rocksdb.ColumnFamilyOptions¶
- __init__(**kwargs)¶
All options mentioned below can also be passed as keyword-arguments in the constructor. For example:
import rocksdb

opts = rocksdb.ColumnFamilyOptions(disable_auto_compactions=True)
# is the same as
opts = rocksdb.ColumnFamilyOptions()
opts.disable_auto_compactions = True
- write_buffer_size¶
Amount of data to build up in memory (backed by an unsorted log on disk) before converting to a sorted on-disk file.
Larger values increase performance, especially during bulk loads. Up to max_write_buffer_number write buffers may be held in memory at the same time, so you may wish to adjust this parameter to control memory usage. Also, a larger write buffer will result in a longer recovery time the next time the database is opened.
Note that write_buffer_size is enforced per column family. See db_write_buffer_size for sharing memory across column families.
Type:int
Default:67108864
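The interaction between write_buffer_size and max_write_buffer_number described above can be sketched as a rough upper bound on memtable memory. The helper name and the column_families parameter are hypothetical and not part of the wrapper, and real memory usage may differ from this estimate.

```python
def max_memtable_bytes(write_buffer_size, max_write_buffer_number, column_families=1):
    """Rough upper bound on memory held by write buffers.

    Assumes every column family fills all of its write buffers at once,
    since write_buffer_size is enforced per column family.
    """
    return write_buffer_size * max_write_buffer_number * column_families


# Documented C++ defaults: 64 MiB buffers, at most 2 per column family.
print(max_memtable_bytes(67108864, 2))
print(max_memtable_bytes(67108864, 2, column_families=3))
```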
- max_write_buffer_number¶
The maximum number of write buffers that are built up in memory. The default is 2, so that when 1 write buffer is being flushed to storage, new writes can continue to the other write buffer.
Type:int
Default:2
- min_write_buffer_number_to_merge¶
The minimum number of write buffers that will be merged together before writing to storage. If set to 1, all write buffers are flushed to L0 as individual files, which increases read amplification because a get request has to check all of these files. Also, an in-memory merge may result in writing less data to storage if there are duplicate records in each of these individual write buffers.
Type:int
Default:1
- compression_opts¶
A dictionary specifying different options for compression algorithms. When setting, only the values present in the dictionary are applied.
Type:dict
- window_bits (int, default: -14) FIXME
- level (int, default: kDefaultCompressionLevel) FIXME
- strategy (int, default: 0) FIXME
- max_dict_bytes (int, default: 0) Maximum size of dictionaries used to prime the compression library. Enabling dictionary can improve compression ratios when there are repetitions across data blocks.
The dictionary is created by sampling the SST file data. If zstd_max_train_bytes is nonzero, the samples are passed through zstd’s dictionary generator. Otherwise, the random samples are used directly as the dictionary.
When compression dictionary is disabled, we compress and write each block before buffering data for the next one. When compression dictionary is enabled, we buffer all SST file data in-memory so we can sample it, as data can only be compressed and written after the dictionary has been finalized. So users of this feature may see increased memory usage.
- zstd_max_train_bytes (int, default: 0) Maximum size of training data passed to zstd’s dictionary trainer. Using zstd’s dictionary trainer can achieve even better compression ratio improvements than using max_dict_bytes alone.
The training data will be used to generate a dictionary of max_dict_bytes.
- parallel_threads (int, default: 1) Number of threads for parallel compression. Parallel compression is enabled only if parallel_threads > 1.
THE FEATURE IS STILL EXPERIMENTAL
This option is valid only when BlockBasedTable is used.
When parallel compression is enabled, SST file sizes might be more inflated compared to the target size, because more data of unknown compressed size is in flight when compression is parallelized. To be reasonably accurate, this inflation is also estimated by using the historical compression ratio and the current bytes in flight.
- enabled (bool, default: False) When the compression options are set by the user, it will be set to “True”. For bottommost_compression_opts, the user must set enabled=True to enable it. Otherwise, bottommost compression will use compression_opts as the default compression options. For compression_opts, even if enabled=False, it is still used as the compression options for the compression process.
- bottommost_compression_opts¶
Different options for compression algorithms used by bottommost_compression if it is enabled. To enable it, please see the definition of compression_opts.
- compression¶
Compress blocks using the specified compression algorithm. This parameter can be changed dynamically.
If you do not set the level key of compression_opts, or set it to kDefaultCompressionLevel, we will attempt to pick the default corresponding to compression as follows:
CompressionType.zstd_compression: 3
CompressionType.zlib_compression: Z_DEFAULT_COMPRESSION (currently -1)
CompressionType.lz4hc_compression: 0
For all others, we do not specify a compression level.
Type: Member of rocksdb.CompressionType
- bottommost_compression¶
Compression algorithm that will be used for the bottommost level that contains files.
Type: Member of rocksdb.CompressionType
- compaction_pri¶
If compaction_style is kCompactionStyleLevel, determines which files in each level are prioritized to be picked for compaction.
Type: Member of rocksdb.CompactionPri
- max_compaction_bytes¶
We try to limit number of bytes in one compaction to be lower than this threshold. But it’s not guaranteed. Value 0 will be sanitized.
Type:int
Default:target_file_size_base * 25
- num_levels¶
Number of levels for this database
Type:int
Default:7
- level0_file_num_compaction_trigger¶
Number of files to trigger level-0 compaction. A value <0 means that level-0 compaction will not be triggered by number of files at all.
Type:int
Default:4
- level0_slowdown_writes_trigger¶
Soft limit on number of level-0 files. We start slowing down writes at this point. A value <0 means that no writing slow down will be triggered by number of files in level-0.
Type:int
Default:20
- level0_stop_writes_trigger¶
Maximum number of level-0 files. We stop writes at this point.
Type:int
Default:24
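The three level-0 triggers above can be read as thresholds on the number of level-0 files. The following is a minimal sketch of that reading, not the library's actual scheduling logic. The function name and the returned labels are hypothetical, and it assumes each trigger fires once the file count reaches the configured value.

```python
def level0_write_state(num_l0_files, compaction_trigger=4,
                       slowdown_trigger=20, stop_trigger=24):
    """Classify the write state implied by the level-0 file count.

    Negative compaction/slowdown triggers disable the corresponding
    behaviour, as described for those options.
    """
    if num_l0_files >= stop_trigger:
        return "stop"        # writes are stopped
    if slowdown_trigger >= 0 and num_l0_files >= slowdown_trigger:
        return "slowdown"    # writes are slowed down
    if compaction_trigger >= 0 and num_l0_files >= compaction_trigger:
        return "compact"     # level-0 compaction is triggered
    return "normal"


for n in (3, 4, 20, 24):
    print(n, level0_write_state(n))
```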
- max_mem_compaction_level¶
Maximum level to which a new compacted memtable is pushed if it does not create overlap. We try to push to level 2 to avoid the relatively expensive level 0=>1 compactions and to avoid some expensive manifest file operations. We do not push all the way to the largest level since that can generate a lot of wasted disk space if the same key space is being repeatedly overwritten.
Type:int
Default:2
- target_file_size_base¶
Target file size for compaction. target_file_size_base is the per-file size for level-1. The target file size for level L can be calculated by target_file_size_base * (target_file_size_multiplier ^ (L-1)).
For example, if target_file_size_base is 2MB and target_file_size_multiplier is 10, then each file on level-1 will be 2MB, each file on level-2 will be 20MB, and each file on level-3 will be 200MB.
Type:int
Default:2097152
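The per-level formula above is easy to evaluate directly. This is a small sketch with a hypothetical helper name; it only reproduces the arithmetic from the description and does not query RocksDB.

```python
MB = 1024 * 1024


def target_file_size(level, base=2 * MB, multiplier=1):
    """Target file size for `level` (1-based), following
    target_file_size_base * (target_file_size_multiplier ^ (L-1))."""
    if level < 1:
        raise ValueError("level must be >= 1")
    return base * multiplier ** (level - 1)


# The example from the text: base 2MB, multiplier 10.
for level in (1, 2, 3):
    print(level, target_file_size(level, base=2 * MB, multiplier=10) // MB, "MB")
```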
- target_file_size_multiplier¶
By default target_file_size_multiplier is 1, which means that by default files in different levels will have similar size.
Type:int
Default:1
- max_bytes_for_level_base¶
Control the maximum total data size for a level. max_bytes_for_level_base is the max total for level-1. The maximum number of bytes for level L can be calculated as (max_bytes_for_level_base) * (max_bytes_for_level_multiplier ^ (L-1)).
For example, if max_bytes_for_level_base is 20MB and max_bytes_for_level_multiplier is 10, the total data size for level-1 will be 20MB, the total file size for level-2 will be 200MB, and the total file size for level-3 will be 2GB.
Type:int
Default:10485760
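Combining the level capacity above with the per-file target size gives a rough estimate of how many files a full level holds. This is a sketch under stated assumptions: the helper names are hypothetical, max_bytes_for_level_multiplier_additional is ignored, and the defaults are the C++ defaults quoted in this document.

```python
MB = 1024 * 1024


def max_bytes_for_level(level, base=10 * MB, multiplier=10):
    """Capacity of `level` (1-based):
    max_bytes_for_level_base * (max_bytes_for_level_multiplier ^ (L-1))."""
    if level < 1:
        raise ValueError("level must be >= 1")
    return base * multiplier ** (level - 1)


def estimated_files_in_level(level, level_base=10 * MB, level_multiplier=10,
                             file_base=2 * MB, file_multiplier=1):
    """Approximate file count of a full level: capacity / target file size."""
    capacity = max_bytes_for_level(level, level_base, level_multiplier)
    file_size = file_base * file_multiplier ** (level - 1)
    return -(-capacity // file_size)  # integer ceiling division


for level in (1, 2, 3):
    print(level, max_bytes_for_level(level) // MB, "MB,",
          estimated_files_in_level(level), "files")
```

An estimate like this can also help when choosing a value for max_open_files.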
- max_bytes_for_level_multiplier¶
Type:int
Default:10
- max_bytes_for_level_multiplier_additional¶
Different max-size multipliers for different levels. These are multiplied by max_bytes_for_level_multiplier to arrive at the max-size of each level.
Type:[int]
Default:[1, 1, 1, 1, 1, 1, 1]
- arena_block_size¶
Size of one block in arena memory allocation. If <= 0, a proper value is automatically calculated (usually 1/10 of write_buffer_size).
Type:int
Default:0
- disable_auto_compactions¶
Disable automatic compactions. Manual compactions can still be issued on this database.
Type:bool
Default:False
- compaction_style¶
The compaction style. Set it to "level" to use level-style compaction. For universal-style compaction use "universal". For FIFO compaction use "fifo". If no compaction style should be used, use "none".
Type:string
Default:level
- compaction_options_universal¶
Options to use for universal-style compaction. They only make sense if rocksdb.Options.compaction_style is set to "universal".
It is a dict with the following keys.
size_ratio: Percentage flexibility while comparing file size. If the candidate file(s) size is 1% smaller than the next file’s size, then include the next file into this candidate set. Default: 1
min_merge_width: The minimum number of files in a single compaction run. Default: 2
max_merge_width: The maximum number of files in a single compaction run. Default: UINT_MAX
max_size_amplification_percent: The size amplification is defined as the amount (in percentage) of additional storage needed to store a single byte of data in the database. For example, a size amplification of 2% means that a database that contains 100 bytes of user-data may occupy up to 102 bytes of physical storage. By this definition, a fully compacted database has a size amplification of 0%. Rocksdb uses the following heuristic to calculate size amplification: it assumes that all files excluding the earliest file contribute to the size amplification. Default: 200, which means that a 100 byte database could require up to 300 bytes of storage.
compression_size_percent: If this option is set to -1 (the default value), all the output files will follow the compression type specified.
If this option is not negative, we will try to make sure the compressed size is just above this value. In normal cases, at least this percentage of data will be compressed.
When we are compacting to a new file, the criterion for whether it needs to be compressed is as follows. Assume this list of files, sorted by generation time:
A1...An B1...Bm C1...Ct
where A1 is the newest and Ct is the oldest, and we are going to compact B1...Bm. We calculate the total size of all the files as total_size, as well as the total size of C1...Ct as total_C. The compaction output file will be compressed if total_C / total_size < this percentage. Default: -1
stop_style: The algorithm used to stop picking files into a single compaction. Can be either "similar_size" or "total_size".
similar_size: Pick files of similar size.
total_size: Total size of picked files is greater than next file.
Default: "total_size"
For setting options, just assign a dict with the fields to set. It is allowed to omit keys in this dict. Missing keys are just not set to the underlying options object.
This example just changes the stop_style and leaves the other options untouched.
opts = rocksdb.Options()
opts.compaction_options_universal = {'stop_style': 'similar_size'}
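Two of the universal-compaction keys above are defined by simple arithmetic, which the following sketch reproduces. Both function names are hypothetical, the code is not taken from RocksDB, and it assumes compression_size_percent is interpreted as a percentage (for example 50 means 50%).

```python
def max_physical_bytes(user_bytes, max_size_amplification_percent=200):
    """Storage that `user_bytes` of user data may occupy at the given
    size amplification, per the definition above."""
    return user_bytes * (100 + max_size_amplification_percent) // 100


def output_is_compressed(total_c, total_size, compression_size_percent=-1):
    """Apply the documented criterion: compress the compaction output if
    total_C / total_size < compression_size_percent.

    With a negative value the output simply follows the configured
    compression type, which is reported here as True.
    """
    if compression_size_percent < 0:
        return True
    # Compare without floating point: total_c / total_size < percent / 100
    return total_c * 100 < total_size * compression_size_percent


print(max_physical_bytes(100, 2))
print(max_physical_bytes(100))
print(output_is_compressed(total_c=10, total_size=100, compression_size_percent=50))
print(output_is_compressed(total_c=60, total_size=100, compression_size_percent=50))
```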
- max_sequential_skip_in_iterations¶
An iteration->Next() sequentially skips over keys with the same user-key unless this option is set. This number specifies the number of keys (with the same userkey) that will be sequentially skipped before a reseek is issued.
Type:int
Default:8
- inplace_update_support¶
Allows thread-safe in-place updates. An update is applied in place only if all of the following hold:
key exists in current memtable
sizeof(new_value) <= sizeof(old_value)
old_value for that key is a put, i.e. kTypeValue
Type:bool
Default:False
- inplace_update_num_locks¶
Number of locks used for inplace update. Default: 10000, if inplace_update_support = True, else 0.
Type:int
Default:10000
- table_factory¶
Factory for the files forming the persistent data storage. Sometimes they are also named SST-Files. Right now you can assign instances of the following classes.
rocksdb.TotalOrderPlainTableFactory
Default:rocksdb.BlockBasedTableFactory
- memtable_factory¶
This is a factory that provides MemTableRep objects. Right now you can assign instances of the following classes.
Default:rocksdb.SkipListMemtableFactory
- comparator¶
Comparator used to define the order of keys in the table. A python comparator must implement the rocksdb.interfaces.Comparator interface.
Requires: The client must ensure that the comparator supplied here has the same name and orders keys exactly the same as the comparator provided to previous open calls on the same DB.
Default:rocksdb.BytewiseComparator
- merge_operator¶
The client must provide a merge operator if the Merge operation needs to be accessed. Calling Merge on a DB without a merge operator would result in rocksdb.errors.NotSupported. The client must ensure that the merge operator supplied here has the same name and exactly the same semantics as the merge operator provided to previous open calls on the same DB. The only exception is reserved for upgrade, where a DB previously without a merge operator is introduced to the Merge operation for the first time. It’s necessary to specify a merge operator when opening the DB in this case.
A python merge operator must implement the rocksdb.interfaces.MergeOperator or rocksdb.interfaces.AssociativeMergeOperator interface.
Default:None
- prefix_extractor¶
If not None, use the specified function to determine the prefixes for keys. These prefixes will be placed in the filter. Depending on the workload, this can reduce the read-IOP cost for scans when a prefix is passed to the calls generating an iterator (rocksdb.DB.iterkeys() …).
A python prefix_extractor must implement the rocksdb.interfaces.SliceTransform interface.
For prefix filtering to work properly, “prefix_extractor” and “comparator” must be such that the following properties hold:
key.starts_with(prefix(key))
compare(prefix(key), key) <= 0
If compare(k1, k2) <= 0, then compare(prefix(k1), prefix(k2)) <= 0
prefix(prefix(key)) == prefix(key)
Default:None
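The four properties listed above can be checked on sample keys before wiring a prefix function into a real SliceTransform. This is a minimal sketch in plain Python: prefix, compare and check_prefix_properties are hypothetical helpers, the prefix is assumed to be a fixed-length truncation, and compare stands in for a bytewise comparator. Passing on a sample does not prove the properties for every possible key.

```python
from itertools import product


def prefix(key, length=4):
    """Fixed-length prefix: the first `length` bytes of the key."""
    return key[:length]


def compare(a, b):
    """Bytewise three-way comparison: negative, zero or positive."""
    return (a > b) - (a < b)


def check_prefix_properties(keys, prefix_fn=prefix, compare_fn=compare):
    """Return True if all four documented properties hold on `keys`."""
    for key in keys:
        p = prefix_fn(key)
        if not key.startswith(p):        # key.starts_with(prefix(key))
            return False
        if compare_fn(p, key) > 0:       # compare(prefix(key), key) <= 0
            return False
        if prefix_fn(p) != p:            # prefix(prefix(key)) == prefix(key)
            return False
    for k1, k2 in product(keys, repeat=2):
        # compare(k1, k2) <= 0 implies compare(prefix(k1), prefix(k2)) <= 0
        if compare_fn(k1, k2) <= 0 and compare_fn(prefix_fn(k1), prefix_fn(k2)) > 0:
            return False
    return True


sample = [b"user0001", b"user0002", b"usr", b"", b"zeta9999", b"abc"]
print(check_prefix_properties(sample))
# A "prefix" built from the key's tail violates the first property.
print(check_prefix_properties(sample, prefix_fn=lambda k: k[-2:]))
```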
- optimize_filters_for_hits¶
This flag specifies that the implementation should optimize the filters mainly for cases where keys are found rather than also optimize for keys missed. This would be used in cases where the application knows that there are very few misses or the performance in the case of misses is not important.
For now, this flag allows us to not store filters for the last level i.e the largest level which contains data of the LSM store. For keys which are hits, the filters in this level are not useful because we will search for the data anyway. NOTE: the filters in other levels are still useful even for key hit because they tell us whether to look in that level or go to the higher level.
Type:bool
Default:False
- paranoid_file_checks¶
After writing every SST file, reopen it and read all the keys. Checks the hash of all of the keys and values written versus the keys in the file and signals a corruption if they do not match
Type:bool
Default:False
- class rocksdb.Options¶
- __init__(**kwargs)¶
Inherits all attributes from ColumnFamilyOptions.
All options mentioned below can also be passed as keyword-arguments in the constructor. For example:
import rocksdb

opts = rocksdb.Options(create_if_missing=True)
# is the same as
opts = rocksdb.Options()
opts.create_if_missing = True
- create_if_missing¶
If True, the database will be created if it is missing.
Type:bool
Default:False
- create_missing_column_families¶
If True, missing column families will be automatically created.
Type:bool
Default:False
- error_if_exists¶
If True, an error is raised if the database already exists.
Type:bool
Default:False
- paranoid_checks¶
If True, the implementation will do aggressive checking of the data it is processing and will stop early if it detects any errors. This may have unforeseen ramifications: for example, a corruption of one DB entry may cause a large number of entries to become unreadable or the entire DB to become unopenable. If any of the writes to the database fails (Put, Delete, Merge, Write), the database will switch to read-only mode and fail all other Write operations.
Type:bool
Default:True
- max_open_files¶
Number of open files that can be used by the DB. You may need to increase this if your database has a large working set. Value -1 means files opened are always kept open. You can estimate number of files based on target_file_size_base and target_file_size_multiplier for level-based compaction. For universal-style compaction, you can usually set it to -1.
Type:int
Default:5000
- use_fsync¶
If true, then every store to stable storage will issue an fsync. If false, then every store to stable storage will issue an fdatasync. This parameter should be set to true when storing data on a filesystem like ext3 that can lose files after a reboot.
Type:bool
Default:False
- db_log_dir¶
This specifies the info LOG dir. If it is empty, the log files will be in the same dir as data. If it is non empty, the log files will be in the specified dir, and the db data dir’s absolute path will be used as the log file name’s prefix.
Type:unicode
Default:""
- wal_dir¶
This specifies the absolute dir path for write-ahead logs (WAL). If it is empty, the log files will be in the same dir as data; dbname is used as the data dir by default. If it is non-empty, the log files will be kept in the specified dir. When destroying the db, all log files in wal_dir and the dir itself are deleted.
Type:unicode
Default:""
- delete_obsolete_files_period_micros¶
The periodicity with which obsolete files get deleted. The default value is 6 hours. The files that go out of scope through the compaction process will still be deleted automatically on every compaction, regardless of this setting.
Type:int
Default:21600000000
- max_background_compactions¶
Maximum number of concurrent background compaction jobs, submitted to the default LOW priority thread pool.
Type:int
Default:1
- stats_history_buffer_size¶
If not zero, periodically take stats snapshots and store them in memory. The memory size for stats snapshots is capped at stats_history_buffer_size.
Type:int
Default:1048576
- max_background_jobs¶
Maximum number of concurrent background jobs (compactions and flushes).
Type:int
Default:2
- max_background_flushes¶
Maximum number of concurrent background memtable flush jobs, submitted to the HIGH priority thread pool. By default, all background jobs (major compaction and memtable flush) go to the LOW priority pool. If this option is set to a positive number, memtable flush jobs will be submitted to the HIGH priority pool. It is important when the same Env is shared by multiple db instances. Without a separate pool, long running major compaction jobs could potentially block memtable flush jobs of other db instances, leading to unnecessary Put stalls.
Type:int
Default:1
- max_log_file_size¶
Specify the maximal size of the info log file. If the log file is larger than max_log_file_size, a new info log file will be created. If max_log_file_size == 0, all logs will be written to one log file.
Type:int
Default:0
- log_file_time_to_roll¶
Time for the info log file to roll (in seconds). If specified with a non-zero value, the log file will be rolled if it has been active longer than log_file_time_to_roll. A value of 0 means disabled.
Type:int
Default:0
- keep_log_file_num¶
Maximal info log files to be kept.
Type:int
Default:1000
- max_manifest_file_size¶
The manifest file is rolled over on reaching this limit. The older manifest file will be deleted. The default value is MAX_INT so that roll-over does not take place.
Type:int
Default:(2**64) - 1
- table_cache_numshardbits¶
Number of shards used for table cache.
Type:int
Default:4
- wal_ttl_seconds, wal_size_limit_mb
The following two fields affect how archived logs will be deleted.
If both set to 0, logs will be deleted asap and will not get into the archive.
If wal_ttl_seconds is 0 and wal_size_limit_mb is not 0, WAL files will be checked every 10 min and if the total size is greater than wal_size_limit_mb, they will be deleted starting with the earliest until the size limit is met. All empty files will be deleted.
If wal_ttl_seconds is not 0 and wal_size_limit_mb is 0, then WAL files will be checked every wal_ttl_seconds / 2 and those that are older than wal_ttl_seconds will be deleted.
If both are not 0, WAL files will be checked every 10 min and both checks will be performed with ttl being first.
Type:int
Default:0
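The four combinations of wal_ttl_seconds and wal_size_limit_mb described above can be summarised in a small lookup. This is only a restatement of the documented rules, not library code: the function name and the returned tuple layout are hypothetical.

```python
def wal_archive_policy(wal_ttl_seconds=0, wal_size_limit_mb=0):
    """Return (check_interval_seconds, checks) for archived WAL files.

    check_interval_seconds is None when logs are deleted as soon as
    possible and never reach the archive. `checks` lists the checks in
    the order they are performed.
    """
    if wal_ttl_seconds == 0 and wal_size_limit_mb == 0:
        return None, []
    if wal_ttl_seconds == 0:
        return 600, ["size"]                    # checked every 10 min
    if wal_size_limit_mb == 0:
        return wal_ttl_seconds / 2, ["ttl"]     # checked every ttl / 2
    return 600, ["ttl", "size"]                 # ttl check runs first


print(wal_archive_policy())
print(wal_archive_policy(wal_size_limit_mb=512))
print(wal_archive_policy(wal_ttl_seconds=3600))
print(wal_archive_policy(wal_ttl_seconds=3600, wal_size_limit_mb=512))
```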
- manifest_preallocation_size¶
Number of bytes to preallocate (via fallocate) the manifest files. Default is 4mb, which is reasonable to reduce random IO as well as prevent overallocation for mounts that preallocate large amounts of data (such as xfs’s allocsize option).
Type:int
Default:4194304
- enable_write_thread_adaptive_yield¶
If True, threads synchronizing with the write batch group leader will wait for up to write_thread_max_yield_usec before blocking on a mutex. This can substantially improve throughput for concurrent workloads, regardless of whether allow_concurrent_memtable_write is enabled.
Type:bool
Default:True
- allow_concurrent_memtable_write¶
If True, allow multi-writers to update mem tables in parallel. Only some memtable_factory-s support concurrent writes; currently it is implemented only for SkipListFactory. Concurrent memtable writes are not compatible with inplace_update_support or filter_deletes. It is strongly recommended to set enable_write_thread_adaptive_yield if you are going to use this feature.
Type:bool
Default:True
- allow_mmap_reads¶
Allow the OS to mmap file for reading sst tables
Type:bool
Default:True
- allow_mmap_writes¶
Allow the OS to mmap file for writing
Type:bool
Default:False
- is_fd_close_on_exec¶
Prevent child processes from inheriting open files.
Type:bool
Default:True
- stats_dump_period_sec¶
If not zero, dump rocksdb.stats to LOG every stats_dump_period_sec
Type:int
Default:3600
- advise_random_on_open¶
If set true, will hint the underlying file system that the file access pattern is random, when a sst file is opened.
Type:bool
Default:True
- use_adaptive_mutex¶
Use adaptive mutex, which spins in the user space before resorting to kernel. This could reduce context switch when the mutex is not heavily contended. However, if the mutex is hot, we could end up wasting spin time.
Type:bool
Default:False
- bytes_per_sync¶
Allows OS to incrementally sync files to disk while they are being written, asynchronously, in the background. Issue one request for every bytes_per_sync written. 0 turns it off.
Type:int
Default:0
- row_cache¶
A global cache for table-level rows. If None, this cache is not used. Otherwise it must be an instance of rocksdb.LRUCache.
Default:None
- IncreaseParallelism(total_threads=16)¶
By default, RocksDB uses only one background thread for flush and compaction. Calling this function will set it up such that total of total_threads is used. A good value for total_threads is the number of cores. You almost definitely want to call this function if your system is bottlenecked by RocksDB.
CompactionPri¶
CompressionTypes¶
BytewiseComparator¶
- class rocksdb.BytewiseComparator¶
Wraps the rocksdb Bytewise Comparator, which uses lexicographic byte-wise ordering.
BloomFilterPolicy¶
LRUCache¶
TableFactories¶
Currently RocksDB supports two types of tables: plain table and block-based table.
Instances of these classes can be assigned to rocksdb.Options.table_factory
Block-based table: This is the default table type that RocksDB inherited from LevelDB. It was designed for storing data on hard disks or flash devices.
Plain table: It is one of RocksDB’s SST file formats, optimized for low query latency on pure-memory or really low-latency media.
Tutorial of rocksdb table formats is available here: https://github.com/facebook/rocksdb/wiki/A-Tutorial-of-RocksDB-SST-formats
- class rocksdb.BlockBasedTableFactory¶
Wraps BlockBasedTableFactory of RocksDB.
- __init__(index_type='binary_search', hash_index_allow_collision=True, checksum='crc32', block_cache, block_cache_compressed, filter_policy=None, no_block_cache=False, block_size=None, block_size_deviation=None, block_restart_interval=None, whole_key_filtering=None, enable_index_compression=None, cache_index_and_filter_blocks=None, format_version=None)¶
- Parameters:
index_type (string) –
binary_search: a space efficient index block that is optimized for binary-search-based index.
hash_search: the hash index. If enabled, will do hash lookup when Options.prefix_extractor is provided.
hash_index_allow_collision (bool) – Influence the behavior when hash_search is used. If False, stores a precise prefix to block range mapping. If True, does not store prefix and allows prefix hash collision (less memory consumption).
checksum (string) – Use the specified checksum type. Newly created table files will be protected with this checksum type. Old table files will still be readable, even though they have a different checksum type. Can be either crc32 or xxhash.
block_cache –
Control over blocks (user data is stored in a set of blocks, and a block is the unit of reading from disk).
If None, rocksdb will automatically create and use an 8MB internal cache. If not None, use the specified cache for blocks. In that case it must be an instance of rocksdb.LRUCache.
block_cache_compressed – If None, rocksdb will not use a compressed block cache. If not None, use the specified cache for compressed blocks. In that case it must be an instance of rocksdb.LRUCache.
filter_policy – If not None, use the specified filter policy to reduce disk reads. A python filter policy must implement the rocksdb.interfaces.FilterPolicy interface. Recommended is an instance of rocksdb.BloomFilterPolicy.
no_block_cache (bool) – Disable block cache. If this is set to true, then no block cache should be used, and the block_cache should point to None.
block_size (int) – If set to None the rocksdb default of 4096 is used. Approximate size of user data packed per block. Note that the block size specified here corresponds to uncompressed data. The actual size of the unit read from disk may be smaller if compression is enabled. This parameter can be changed dynamically.
block_size_deviation (int) – If set to None the rocksdb default of 10 is used. This is used to close a block before it reaches the configured ‘block_size’. If the percentage of free space in the current block is less than this specified number and adding a new record to the block will exceed the configured block size, then this block will be closed and the new record will be written to the next block.
block_restart_interval (int) – If set to None the rocksdb default of 16 is used. Number of keys between restart points for delta encoding of keys. This parameter can be changed dynamically. Most clients should leave this parameter alone.
whole_key_filtering (bool) – If set to None the rocksdb default of True is used. If True, place whole keys in the filter (not just prefixes). This must generally be true for gets to be efficient.
enable_index_compression (bool) – If set to None the rocksdb default of True is used. Store index blocks on disk in compressed format. Setting this option to False will avoid the overhead of decompression if index blocks are evicted and read back.
cache_index_and_filter_blocks (bool) – If set to None the rocksdb default of False is used. Indicates if we’d put index/filter blocks to the block cache. If False, each “table reader” object will pre-load index/filter block during table initialization.
format_version (int) –
If set to None the rocksdb default of 4 is used. There are currently 6 versions:
- 0
This version is currently written out by all RocksDB’s versions by default. Can be read by really old RocksDB’s. Doesn’t support changing checksum (default is CRC32).
- 1
Can be read by RocksDB’s versions since 3.0. Supports non-default checksum, like xxHash. It is written by RocksDB when BlockBasedTableOptions::checksum is something other than kCRC32c. (version 0 is silently upconverted)
- 2
Can be read by RocksDB’s versions since 3.10. Changes the way we encode compressed blocks with LZ4, BZip2 and Zlib compression. If you don’t plan to run RocksDB before version 3.10, you should probably use this.
- 3
Can be read by RocksDB’s versions since 5.15. Changes the way we encode the keys in index blocks. If you don’t plan to run RocksDB before version 5.15, you should probably use this. This option only affects newly written tables. When reading existing tables, the information about version is read from the footer.
- 4
Can be read by RocksDB’s versions since 5.16. Changes the way we encode the values in index blocks. If you don’t plan to run RocksDB before version 5.16 and you are using index_block_restart_interval > 1, you should probably use this as it would reduce the index size. This option only affects newly written tables. When reading existing tables, the information about version is read from the footer.
- 5
Can be read by RocksDB’s versions since 6.6.0. Full and partitioned filters use a generally faster and more accurate Bloom filter implementation, with a different schema.
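The compatibility notes in the version list above can be turned into a small lookup that picks the newest format_version still readable by the oldest RocksDB release you need to support. This is a sketch with hypothetical names: the table only restates the "can be read since" versions listed above, and version 0 is treated as readable by any release.

```python
# format_version -> oldest RocksDB release that can read it (from the list above)
READABLE_SINCE = {
    0: (0, 0, 0),
    1: (3, 0, 0),
    2: (3, 10, 0),
    3: (5, 15, 0),
    4: (5, 16, 0),
    5: (6, 6, 0),
}


def parse_version(text):
    """Turn '5.15' or '6.6.0' into a 3-tuple of ints, padding with zeros."""
    parts = [int(p) for p in text.split(".")][:3]
    return tuple(parts + [0] * (3 - len(parts)))


def newest_readable_format_version(oldest_rocksdb):
    """Highest format_version that `oldest_rocksdb` can still read."""
    oldest = parse_version(oldest_rocksdb)
    return max(v for v, since in READABLE_SINCE.items() if since <= oldest)


for release in ("2.8", "3.9.0", "5.15.10", "5.16.0", "6.6"):
    print(release, newest_readable_format_version(release))
```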
- class rocksdb.PlainTableFactory¶
Plain Table with prefix-only seek. It wraps rocksdb PlainTableFactory.
For this factory, you need to set rocksdb.Options.prefix_extractor properly to make it work. Look-up will start with prefix hash lookup for key prefix. Inside the hash bucket found, a binary search is executed for hash conflicts. Finally, a linear search is used.
- __init__(user_key_len=0, bloom_bits_per_key=10, hash_table_ratio=0.75, index_sparseness=10, huge_page_tlb_size=0, encoding_type='plain', full_scan_mode=False, store_index_in_file=False)¶
- Parameters:
user_key_len (int) – Plain table has optimization for fix-sized keys, which can be specified via user_key_len. Alternatively, you can pass 0 if your keys have variable lengths.
bloom_bits_per_key (int) – The number of bits used for the bloom filter per prefix. You may disable it by passing 0.
hash_table_ratio (float) – The desired utilization of the hash table used for prefix hashing. hash_table_ratio = number of prefixes / #buckets in the hash table.
index_sparseness (int) – Inside each prefix, the number of keys covered by one index record for the binary search inside each hash bucket. For encoding type prefix, the value will be used when writing to determine an interval to rewrite the full key. It will also be used as a suggestion and satisfied when possible.
huge_page_tlb_size (int) – If <=0, allocate hash indexes and blooms from malloc. Otherwise from huge page TLB. The user needs to reserve huge pages for it to be allocated, like:
sysctl -w vm.nr_hugepages=20
See linux doc Documentation/vm/hugetlbpage.txt
encoding_type (string) –
How to encode the keys. The value will determine how to encode keys when writing to a new SST file. This value will be stored inside the SST file which will be used when reading from the file, which makes it possible for users to choose different encoding type when reopening a DB. Files with different encoding types can co-exist in the same DB and can be read.
plain: Always write full keys without any special encoding.
prefix: Find opportunity to write the same prefix once for multiple rows. In some cases, when a key follows a previous key with the same prefix, instead of writing out the full key, it just writes out the size of the shared prefix, as well as other bytes, to save some bytes.
When using this option, the user is required to use the same prefix extractor to make sure the same prefix will be extracted from the same key. The Name() value of the prefix extractor will be stored in the file. When reopening the file, the name of the options.prefix_extractor given will be bitwise compared to the prefix extractors stored in the file. An error will be returned if the two don’t match.
full_scan_mode (bool) – Mode for reading the whole file record by record without using the index.
store_index_in_file (bool) – Compute the plain table index and bloom filter during file building and store them in the file. When reading the file, the index will be mmaped instead of being recomputed.
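The hash_table_ratio definition above (number of prefixes divided by the number of buckets) can be inverted to see how many buckets a given ratio implies. This is a sketch of the arithmetic only: the helper name is hypothetical and the value RocksDB actually allocates may be rounded differently.

```python
import math


def implied_bucket_count(num_prefixes, hash_table_ratio=0.75):
    """Buckets implied by hash_table_ratio = num_prefixes / buckets."""
    if hash_table_ratio <= 0:
        raise ValueError("hash_table_ratio must be positive")
    return math.ceil(num_prefixes / hash_table_ratio)


print(implied_bucket_count(3000))        # default ratio of 0.75
print(implied_bucket_count(3000, 0.5))   # lower utilization, more buckets
```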
MemtableFactories¶
RocksDB has different classes to represent the in-memory buffer for the current operations. You have to assign instances of the following classes to rocksdb.Options.memtable_factory.
This page has a comparison of the most popular ones:
https://github.com/facebook/rocksdb/wiki/Hash-based-memtable-implementations
- class rocksdb.VectorMemtableFactory¶
This creates MemTableReps that are backed by an std::vector. On iteration, the vector is sorted. This is useful for workloads where iteration is very rare and writes are generally not issued after reads begin.
- __init__(count=0)¶
- Parameters:
count (int) – Passed to the constructor of the underlying std::vector of each VectorRep. On initialization, the underlying array will be at least count bytes reserved for usage.
- class rocksdb.HashSkipListMemtableFactory¶
This class contains a fixed array of buckets, each pointing to a skiplist (null if the bucket is empty).
Note
rocksdb.Options.prefix_extractor must be set, otherwise rocksdb falls back to skip-list.
- __init__(bucket_count=1000000, skiplist_height=4, skiplist_branching_factor=4)¶
- Parameters:
bucket_count (int) – number of fixed array buckets
skiplist_height (int) – the max height of the skiplist
skiplist_branching_factor (int) – probabilistic size ratio between adjacent link lists in the skiplist
- class rocksdb.HashLinkListMemtableFactory¶
The factory is to create memtables with a hashed linked list. It contains a fixed array of buckets, each pointing to a sorted single linked list (null if the bucket is empty).
Note
rocksdb.Options.prefix_extractor must be set, otherwise rocksdb falls back to skip-list.
- __init__(bucket_count=50000)¶
- Parameters:
bucket_count (int) – number of fixed array buckets