{% if include.truncate %} {% if page.content contains '' %} diff --git a/docs/_posts/2017-12-19-write-prepared-txn.markdown b/docs/_posts/2017-12-19-write-prepared-txn.markdown index d592b6f7b16..439b3f83cc4 100644 --- a/docs/_posts/2017-12-19-write-prepared-txn.markdown +++ b/docs/_posts/2017-12-19-write-prepared-txn.markdown @@ -7,8 +7,6 @@ category: blog RocksDB supports both optimistic and pessimistic concurrency controls. The pessimistic transactions make use of locks to provide isolation between the transactions. The default write policy in pessimistic transactions is _WriteCommitted_, which means that the data is written to the DB, i.e., the memtable, only after the transaction is committed. This policy simplified the implementation but came with some limitations in throughput, transaction size, and variety in supported isolation levels. Below, we explain these in detail and present the other write policies, _WritePrepared_ and _WriteUnprepared_. We then dive into the design of _WritePrepared_ transactions. -> _WritePrepared_ are to be announced as production-ready soon. - ### WriteCommitted, Pros and Cons With the _WriteCommitted_ write policy, the data is written to the memtable only after the transaction commits. This greatly simplifies the read path as any data that is read by other transactions can be assumed to be committed. This write policy, however, implies that the writes are buffered in memory in the meanwhile. This makes memory a bottleneck for large transactions. The delay of the commit phase in 2PC (two-phase commit) also becomes noticeable since most of the work, i.e., writing to the memtable, is done at the commit phase. When the commits of multiple transactions are done in a serial fashion, such as in the 2PC implementation of MySQL, the lengthy commit latency becomes a major contributor to lower throughput. Moreover, this write policy cannot provide weaker isolation levels, such as READ UNCOMMITTED, that could potentially provide higher throughput for some applications. @@ -28,10 +26,16 @@ With _WritePrepared_, a transaction still buffers the writes in a write batch ob The _CommitCache_ is a lock-free data structure that caches the recent commit entries. Looking up the entries in the cache should be enough for almost all the transactions that commit in a timely manner. When evicting the older entries from the cache, it still maintains some other data structures to cover the corner cases for transactions that take abnormally long to finish. We will cover them in the design details below. -### Preliminary Results -The full experimental results are to be reported soon. Here we present the improvement in tps observed in some preliminary experiments with MyRocks: -* sysbench update-noindex: 25% -* sysbench read-write: 7.6% -* linkbench: 3.7% +### Benchmark Results +Here we present the improvements observed in MyRocks with sysbench and linkbench:
+
+| benchmark | tps | p95 latency | cpu/query |
+|-----------|-----|-------------|-----------|
+| insert | 68% | | |
+| update-noindex | 30% | 38% | |
+| update-index | 61% | 28% | |
+| read-write | 6% | 3.5% | |
+| read-only | -1.2% | -1.8% | |
+| linkbench | 1.9% | | 0.6% (overall) |
+
+Here are also the detailed results for [In-Memory Sysbench](https://gist.github.com/maysamyabandeh/bdb868091b2929a6d938615fdcf58424) and [SSD Sysbench](https://gist.github.com/maysamyabandeh/ff94f378ab48925025c34c47eff99306) courtesy of [@mdcallag](https://github.com/mdcallag).
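To make the policy switch concrete, here is a minimal sketch of opening a pessimistic `TransactionDB` with the _WritePrepared_ policy and running a 2PC-style transaction. Status checks are mostly omitted, and the DB path and transaction name are illustrative only.

```cpp
#include <cassert>
#include <rocksdb/utilities/transaction_db.h>

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  rocksdb::TransactionDBOptions txn_db_options;
  // Select the WritePrepared policy instead of the default WriteCommitted.
  txn_db_options.write_policy = rocksdb::TxnDBWritePolicy::WRITE_PREPARED;

  rocksdb::TransactionDB* db = nullptr;
  rocksdb::Status s = rocksdb::TransactionDB::Open(
      options, txn_db_options, "/tmp/wp_example", &db);  // path is illustrative
  assert(s.ok());

  rocksdb::Transaction* txn = db->BeginTransaction(rocksdb::WriteOptions());
  txn->SetName("xid-1");     // a name is required before Prepare() in 2PC
  txn->Put("key", "value");  // buffered in the transaction's write batch
  txn->Prepare();            // with WritePrepared, data reaches the memtable here
  txn->Commit();             // commit only records the prepare/commit seq mapping

  delete txn;
  delete db;
  return 0;
}
```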
Learn more [here](https://github.com/facebook/rocksdb/wiki/WritePrepared-Transactions). diff --git a/docs/_posts/2018-11-21-delete-range.markdown b/docs/_posts/2018-11-21-delete-range.markdown new file mode 100644 index 00000000000..96fc3562d19 --- /dev/null +++ b/docs/_posts/2018-11-21-delete-range.markdown @@ -0,0 +1,292 @@ +--- +title: "DeleteRange: A New Native RocksDB Operation" +layout: post +author: +- abhimadan +- ajkr +category: blog +--- +## Motivation + +### Deletion patterns in LSM + +Deleting a range of keys is a common pattern in RocksDB. Most systems built on top of +RocksDB have multi-component key schemas, where keys sharing a common prefix are +logically related. Here are some examples. + +MyRocks is a MySQL fork using RocksDB as its storage engine. Each key's first +four bytes identify the table or index to which that key belongs. Thus dropping +a table or index involves deleting all the keys with that prefix. + +Rockssandra is a Cassandra variant that uses RocksDB as its storage engine. One +of its admin tool commands, `nodetool cleanup`, removes key-ranges that have been migrated +to other nodes in the cluster. + +Marketplace uses RocksDB to store product data. Its key begins with product ID, +and it stores various data associated with the product in separate keys. When a +product is removed, all these keys must be deleted. + +When we decide what to improve, we try to find a use case that's common across +users, since we want to build a generally useful system, not one that has many +one-off features for individual users. The range deletion pattern is common as +illustrated above, so from this perspective it's a good target for optimization. + +### Existing mechanisms: challenges and opportunities + +The most common pattern we see is scan-and-delete, i.e., advance an iterator +through the to-be-deleted range, and issue a `Delete` for each key. This is +slow (involves read I/O) so cannot be done in any critical path. Additionally, +it creates many tombstones, which slows down iterators and doesn't offer a deadline +for space reclamation. + +Another common pattern is using a custom compaction filter that drops keys in +the deleted range(s). This deletes the range asynchronously, so cannot be used +in cases where readers must not see keys in deleted ranges. Further, it has the +disadvantage of outputting tombstones to all but the bottom level. That's +because compaction cannot detect whether dropping a key would cause an older +version at a lower level to reappear. + +If space reclamation time is important, or it is important that the deleted +range not affect iterators, the user can trigger `CompactRange` on the deleted +range. This can involve arbitrarily long waits in the compaction queue, and +increases write-amp. By the time it's finished, however, the range is completely +gone from the LSM. + +`DeleteFilesInRange` can be used prior to compacting the deleted range as long +as snapshot readers do not need to access them. It drops files that are +completely contained in the deleted range. That saves write-amp because, in +`CompactRange`, the file data would have to be rewritten several times before it +reaches the bottom of the LSM, where tombstones can finally be dropped. + +In addition to the above approaches having various drawbacks, they are quite +complicated to reason about and implement. 
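To contrast the scan-and-delete workaround with a single native call, here is a hedged C++ sketch; statuses are unchecked and bytewise key ordering is assumed for the end-of-range comparison.

```cpp
#include <memory>
#include <rocksdb/db.h>
#include <rocksdb/write_batch.h>

// Workaround: iterate over [begin, end) and emit a point tombstone per key.
// Incurs read I/O and leaves one tombstone behind for every deleted key.
void ScanAndDelete(rocksdb::DB* db, const rocksdb::Slice& begin,
                   const rocksdb::Slice& end) {
  std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(rocksdb::ReadOptions()));
  rocksdb::WriteBatch batch;
  for (it->Seek(begin); it->Valid() && it->key().compare(end) < 0; it->Next()) {
    batch.Delete(it->key());
  }
  db->Write(rocksdb::WriteOptions(), &batch);
}

// Native operation: a single write that logically covers the whole range.
void NativeDeleteRange(rocksdb::DB* db, const rocksdb::Slice& begin,
                       const rocksdb::Slice& end) {
  db->DeleteRange(rocksdb::WriteOptions(), db->DefaultColumnFamily(), begin, end);
}
```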
In an ideal world, deleting a range +of keys would be (1) simple, i.e., a single API call; (2) synchronous, i.e., +when the call finishes, the keys are guaranteed to be wiped from the DB; (3) low +latency so it can be used in critical paths; and (4) a first-class operation +with all the guarantees of any other write, like atomicity, crash-recovery, etc. + +## v1: Getting it to work + +### Where to persist them? + +The first place we thought about storing them is inline with the data blocks. +We could not think of a good way to do it, however, since the start of a range +tombstone covering a key could be anywhere, making binary search impossible. +So, we decided to investigate segregated storage. + +A second solution we considered is appending to the manifest. This file is +append-only, periodically compacted, and stores metadata like the level to which +each SST belongs. This is tempting because it leverages an existing file, which +is maintained in the background and fully read when the DB is opened. However, +it conceptually violates the manifest's purpose, which is to store metadata. It +also has no way to detect when a range tombstone no longer covers anything and +is droppable. Further, it'd be possible for keys above a range tombstone to disappear +when they have their seqnums zeroed upon compaction to the bottommost level. + +A third candidate is using a separate column family. This has similar problems +to the manifest approach. That is, we cannot easily detect when a range +tombstone is obsolete, and seqnum zeroing can cause a key +to go from above a range tombstone to below, i.e., disappearing. The upside is +we can reuse logic for memory buffering, consistent reads/writes, etc. + +The problems with the second and third solutions indicate a need for range +tombstones to be aware of flush/compaction. An easy way to achieve this is put +them in the SST files themselves - but not in the data blocks, as explained for +the first solution. So, we introduced a separate meta-block for range tombstones. +This resolved the problem of when to obsolete range tombstones, as it's simple: +when they're compacted to the bottom level. We also reused the LSM invariants +that newer versions of a key are always in a higher level to prevent the seqnum +zeroing problem. This approach has the side benefit of constraining the range +tombstones seen during reads to ones in a similar key-range. + +![](/static/images/delrange/delrange_sst_blocks.png) +{: style="display: block; margin-left: auto; margin-right: auto; width: 80%"} + +*When there are range tombstones in an SST, they are segregated in a separate meta-block* +{: style="text-align: center"} + +![](/static/images/delrange/delrange_key_schema.png) +{: style="display: block; margin-left: auto; margin-right: auto; width: 80%"} + +*Logical range tombstones (left) and their corresponding physical key-value representation (right)* +{: style="text-align: center"} + +### Write path + +`WriteBatch` stores range tombstones in its buffer which are logged to the WAL and +then applied to a dedicated range tombstone memtable during `Write`. Later in +the background the range tombstone memtable and its corresponding data memtable +are flushed together into a single SST with a range tombstone meta-block. SSTs +periodically undergo compaction which rewrites SSTs with point data and range +tombstones dropped or merged wherever possible. + +We chose to use a dedicated memtable for range tombstones. 
The memtable representation is always skiplist in order to minimize overhead in the usual case, namely when the memtable contains zero or only a few range tombstones. The range tombstones are segregated into a separate memtable for the same reason we segregated range tombstones in SSTs: we did not know how to interleave a range tombstone with point data in a way that would let us find it for an arbitrary key it covers. ![](/static/images/delrange/delrange_write_path.png) {: style="display: block; margin-left: auto; margin-right: auto; width: 70%"} *Lifetime of point keys and range tombstones in RocksDB* {: style="text-align: center"} During flush and compaction, we chose to write out all non-obsolete range tombstones unsorted. Sorting by a single dimension is easy to implement, but doesn't bring asymptotic improvement to queries over range data. Ideally, we want to store skylines (see “Read Path” subsection below) computed over our ranges so we can binary search. However, two concerns make doing this in flush and compaction unsatisfactory: (1) we need to store multiple skylines, one for each snapshot, which further complicates the range tombstone meta-block encoding; and (2) even if we implement this, the range tombstone memtable still needs to be linearly scanned. Given these concerns, we decided to defer the collapsing work to the read side, hoping a good caching strategy could optimize it at some future point. ### Read path In point lookups, we aggregate range tombstones in an unordered vector as we search through the live memtable, immutable memtables, and then SSTs. When a key matching the lookup key is found, we scan through the vector to check whether the key is deleted. In iterators, we aggregate range tombstones into a skyline as we visit the live memtable, immutable memtables, and SSTs. The skyline is expensive to construct but makes it fast to determine whether a key is covered. The skyline keeps track of the most recent range tombstone found in order to optimize `Next` and `Prev`. |![](/static/images/delrange/delrange_uncollapsed.png) |![](/static/images/delrange/delrange_collapsed.png) | *([Image source: Leetcode](https://leetcode.com/problems/the-skyline-problem/description/)) The skyline problem involves taking building location/height data in the unsearchable form of A and converting it to the form of B, which is binary-searchable. With overlapping range tombstones, to achieve efficient searching we need to solve an analogous problem, where the x-axis is the key-space and the y-axis is the sequence number.* {: style="text-align: center"} ### Performance characteristics For the v1 implementation, writes are much faster than the scan-and-delete (optionally within a transaction) pattern. `DeleteRange` only logs to the WAL and applies to the memtable. Logging to the WAL always `fflush`es, and optionally `fsync`s or `fdatasync`s. Applying to the memtable is always an in-memory operation. Since range tombstones have a dedicated skiplist memtable, the complexity of inserting is O(log(T)), where T is the number of existing buffered range tombstones. Reading in the presence of v1 range tombstones, however, is much slower than reads in a database where scan-and-delete has happened, due to the linear scan over range tombstone memtables/meta-blocks. Iterating in a database with v1 range tombstones is usually slower than in a scan-and-delete database, although the gap lessens as iterations grow longer.
+When an iterator is first created and seeked, we construct a skyline over its +tombstones. This operation is O(T\*log(T)) where T is the number of tombstones +found across live memtable, immutable memtable, L0 files, and one file from each +of the L1+ levels. However, moving the iterator forwards or backwards is simply +a constant-time operation (excluding edge cases, e.g., many range tombstones +between consecutive point keys). + +## v2: Making it fast + +`DeleteRange`’s negative impact on read perf is a barrier to its adoption. The +root cause is range tombstones are not stored or cached in a format that can be +efficiently searched. We needed to design DeleteRange so that we could maintain +write performance while making read performance competitive with workarounds +used in production (e.g., scan-and-delete). + +### Representations + +The key idea of the redesign is that, instead of globally collapsing range tombstones, + we can locally “fragment” them for each SST file and memtable to guarantee that: + +* no range tombstones overlap; and +* range tombstones are ordered by start key. + +Combined, these properties make range tombstones binary searchable. This + fragmentation will happen on the read path, but unlike the previous design, we can + easily cache many of these range tombstone fragments on the read path. + +### Write path + +The write path remains unchanged. + +### Read path + +When an SST file is opened, its range tombstones are fragmented and cached. For point + lookups, we binary search each file's fragmented range tombstones for one that covers + the lookup key. Unlike the old design, once we find a tombstone, we no longer need to + search for the key in lower levels, since we know that any keys on those levels will be + covered (though we do still check the current level since there may be keys written after + the range tombstone). + +For range scans, we create iterators over all the fragmented range + tombstones and store them in a list, seeking each one to cover the start key of the range + scan (if possible), and query each encountered key in this structure as in the old design, + advancing range tombstone iterators as necessary. In effect, we implicitly create a skyline. + This requires significantly less work on iterator creation, but since each memtable/SST has +its own range tombstone iterator, querying range tombstones requires key comparisons (and +possibly iterator increments) for several iterators (as opposed to v1, where we had a global +collapsed representation of all range tombstones). As a result, very long range scans may become + slower than before, but short range scans are an order of magnitude faster, which are the + more common class of range scan. + +## Benchmarks + +To understand the performance of this new design, we used `db_bench` to compare point lookup, short range scan, + and long range scan performance across: + +* the v1 DeleteRange design, +* the scan-and-delete workaround, and +* the v2 DeleteRange design. + +In these benchmarks, we used a database with 5 million data keys, and 10000 range tombstones (ignoring +those dropped during compaction) that were written in regular intervals after 4.5 million data keys were written. +Writing the range tombstones ensures that most of them are not compacted away, and we have more tombstones +in higher levels that cover keys in lower levels, which allows the benchmarks to exercise more interesting behavior +when reading deleted keys. 
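Before the numbers, a self-contained, hedged sketch of the fragmentation idea from the v2 read path above may help; this is not RocksDB's actual fragmenter (which also tracks per-snapshot sequence numbers), just an illustration of splitting overlapping tombstones into disjoint, binary-searchable fragments.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

struct RangeTombstone {
  std::string start, end;  // covers keys in [start, end)
  uint64_t seqno;          // assumed > 0 in this sketch
};

// Split possibly-overlapping tombstones at every start/end boundary into
// disjoint fragments sorted by start key, keeping only the newest seqno per
// fragment. Naive O(n*m) construction; good enough to show the invariants.
std::vector<RangeTombstone> Fragment(const std::vector<RangeTombstone>& ts) {
  std::vector<std::string> bounds;
  for (const auto& t : ts) { bounds.push_back(t.start); bounds.push_back(t.end); }
  std::sort(bounds.begin(), bounds.end());
  bounds.erase(std::unique(bounds.begin(), bounds.end()), bounds.end());
  std::vector<RangeTombstone> frags;
  for (size_t i = 0; i + 1 < bounds.size(); ++i) {
    uint64_t max_seq = 0;
    for (const auto& t : ts) {
      if (t.start <= bounds[i] && bounds[i + 1] <= t.end) {
        max_seq = std::max(max_seq, t.seqno);
      }
    }
    if (max_seq > 0) frags.push_back({bounds[i], bounds[i + 1], max_seq});
  }
  return frags;  // disjoint and ordered by start key => binary searchable
}

// Binary search the fragments for one covering `key` that is visible at
// `read_seqno`; returns true if the key should be treated as deleted.
bool Covered(const std::vector<RangeTombstone>& frags, const std::string& key,
             uint64_t read_seqno) {
  auto it = std::upper_bound(frags.begin(), frags.end(), key,
                             [](const std::string& k, const RangeTombstone& f) {
                               return k < f.start;
                             });
  if (it == frags.begin()) return false;
  --it;  // last fragment whose start <= key
  return key < it->end && it->seqno <= read_seqno;
}
```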
+ +Point lookup benchmarks read 100000 keys from a database using `readwhilewriting`. Range scan benchmarks used +`seekrandomwhilewriting` and seeked 100000 times, and advanced up to 10 keys away from the seek position for short range scans, and advanced up to 1000 keys away from the seek position for long range scans. + +The results are summarized in the tables below, averaged over 10 runs (note the +different SHAs for v1 benchmarks are due to a new `db_bench` flag that was added in order to compare performance with databases with no tombstones; for brevity, those results are not reported here). Also note that the block cache was large enough to hold the entire db, so the large throughput is due to limited I/Os and little time spent on decompression. The range tombstone blocks are always pinned uncompressed in memory. We believe these setup details should not affect relative performance between versions. + +### Point Lookups + +|Name |SHA |avg micros/op |avg ops/sec | +|v1 |35cd754a6 |1.3179 |759,830.90 | +|scan-del |7528130e3 |0.6036 |1,667,237.70 | +|v2 |7528130e3 |0.6128 |1,634,633.40 | + +### Short Range Scans + +|Name |SHA |avg micros/op |avg ops/sec | +|v1 |0ed738fdd |6.23 |176,562.00 | +|scan-del |PR 4677 |2.6844 |377,313.00 | +|v2 |PR 4677 |2.8226 |361,249.70 | + +### Long Range scans + +|Name |SHA |avg micros/op |avg ops/sec | +|v1 |0ed738fdd |52.7066 |19,074.00 | +|scan-del |PR 4677 |38.0325 |26,648.60 | +|v2 |PR 4677 |41.2882 |24,714.70 | + +## Future Work + +Note that memtable range tombstones are fragmented every read; for now this is acceptable, + since we expect there to be relatively few range tombstones in memtables (and users can + enforce this by keeping track of the number of memtable range deletions and manually flushing + after it passes a threshold). In the future, a specialized data structure can be used for storing + range tombstones in memory to avoid this work. + +Another future optimization is to create a new format version that requires range tombstones to + be stored in a fragmented form. This would save time when opening SST files, and when `max_open_files` +is not -1 (i.e., files may be opened several times). + +## Acknowledgements + +Special thanks to Peter Mattis and Nikhil Benesch from Cockroach Labs, who were early users of +DeleteRange v1 in production, contributed the cleanest/most efficient v1 aggregation implementation, found and fixed bugs, and provided initial DeleteRange v2 design and continued help. + +Thanks to Huachao Huang and Jinpeng Zhang from PingCAP for early DeleteRange v1 adoption, bug reports, and fixes. 
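As a postscript to the Future Work note above: until memtable range tombstones get a specialized in-memory representation, users can keep their own count of memtable range deletions and flush once it passes a threshold. A hedged sketch of that workaround follows; the wrapper class, its name, and the threshold are illustrative, not a RocksDB API.

```cpp
#include <atomic>
#include <rocksdb/db.h>

// Counts DeleteRange calls and triggers a manual flush after a threshold so
// that per-read fragmentation of memtable range tombstones stays cheap.
class RangeDeleteTracker {
 public:
  RangeDeleteTracker(rocksdb::DB* db, int threshold)
      : db_(db), threshold_(threshold) {}

  rocksdb::Status DeleteRange(const rocksdb::Slice& begin,
                              const rocksdb::Slice& end) {
    rocksdb::Status s = db_->DeleteRange(rocksdb::WriteOptions(),
                                         db_->DefaultColumnFamily(), begin, end);
    if (s.ok() && ++pending_ >= threshold_) {
      pending_ = 0;
      s = db_->Flush(rocksdb::FlushOptions());  // move tombstones out of the memtable
    }
    return s;
  }

 private:
  rocksdb::DB* db_;
  const int threshold_;
  std::atomic<int> pending_{0};
};
```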
diff --git a/docs/_sass/_blog.scss b/docs/_sass/_blog.scss index 74335d10b41..12a73c1fcda 100644 --- a/docs/_sass/_blog.scss +++ b/docs/_sass/_blog.scss @@ -35,11 +35,13 @@ border-radius: 50%; height: 50px; left: 50%; - margin-left: -25px; + margin-left: auto; + margin-right: auto; + display: inline-block; overflow: hidden; - position: absolute; + position: static; top: -25px; width: 50px; } } -} \ No newline at end of file +} diff --git a/docs/static/images/delrange/delrange_collapsed.png b/docs/static/images/delrange/delrange_collapsed.png new file mode 100644 index 00000000000..52246c2c1d6 Binary files /dev/null and b/docs/static/images/delrange/delrange_collapsed.png differ diff --git a/docs/static/images/delrange/delrange_key_schema.png b/docs/static/images/delrange/delrange_key_schema.png new file mode 100644 index 00000000000..0a14d4a3a52 Binary files /dev/null and b/docs/static/images/delrange/delrange_key_schema.png differ diff --git a/docs/static/images/delrange/delrange_sst_blocks.png b/docs/static/images/delrange/delrange_sst_blocks.png new file mode 100644 index 00000000000..6003e42ae89 Binary files /dev/null and b/docs/static/images/delrange/delrange_sst_blocks.png differ diff --git a/docs/static/images/delrange/delrange_uncollapsed.png b/docs/static/images/delrange/delrange_uncollapsed.png new file mode 100644 index 00000000000..39c7097af96 Binary files /dev/null and b/docs/static/images/delrange/delrange_uncollapsed.png differ diff --git a/docs/static/images/delrange/delrange_write_path.png b/docs/static/images/delrange/delrange_write_path.png new file mode 100644 index 00000000000..229dfb349ac Binary files /dev/null and b/docs/static/images/delrange/delrange_write_path.png differ diff --git a/env/env.cc b/env/env.cc index 9b7f5e40ded..a41feaf00e6 100644 --- a/env/env.cc +++ b/env/env.cc @@ -43,7 +43,7 @@ uint64_t Env::GetThreadID() const { Status Env::ReuseWritableFile(const std::string& fname, const std::string& old_fname, - unique_ptr* result, + std::unique_ptr* result, const EnvOptions& options) { Status s = RenameFile(old_fname, fname); if (!s.ok()) { @@ -242,11 +242,11 @@ void Fatal(Logger* info_log, const char* format, ...) { va_end(ap); } -void LogFlush(const shared_ptr& info_log) { +void LogFlush(const std::shared_ptr& info_log) { LogFlush(info_log.get()); } -void Log(const InfoLogLevel log_level, const shared_ptr& info_log, +void Log(const InfoLogLevel log_level, const std::shared_ptr& info_log, const char* format, ...) { va_list ap; va_start(ap, format); @@ -254,49 +254,49 @@ void Log(const InfoLogLevel log_level, const shared_ptr& info_log, va_end(ap); } -void Header(const shared_ptr& info_log, const char* format, ...) { +void Header(const std::shared_ptr& info_log, const char* format, ...) { va_list ap; va_start(ap, format); Headerv(info_log.get(), format, ap); va_end(ap); } -void Debug(const shared_ptr& info_log, const char* format, ...) { +void Debug(const std::shared_ptr& info_log, const char* format, ...) { va_list ap; va_start(ap, format); Debugv(info_log.get(), format, ap); va_end(ap); } -void Info(const shared_ptr& info_log, const char* format, ...) { +void Info(const std::shared_ptr& info_log, const char* format, ...) { va_list ap; va_start(ap, format); Infov(info_log.get(), format, ap); va_end(ap); } -void Warn(const shared_ptr& info_log, const char* format, ...) { +void Warn(const std::shared_ptr& info_log, const char* format, ...) 
{ va_list ap; va_start(ap, format); Warnv(info_log.get(), format, ap); va_end(ap); } -void Error(const shared_ptr& info_log, const char* format, ...) { +void Error(const std::shared_ptr& info_log, const char* format, ...) { va_list ap; va_start(ap, format); Errorv(info_log.get(), format, ap); va_end(ap); } -void Fatal(const shared_ptr& info_log, const char* format, ...) { +void Fatal(const std::shared_ptr& info_log, const char* format, ...) { va_list ap; va_start(ap, format); Fatalv(info_log.get(), format, ap); va_end(ap); } -void Log(const shared_ptr& info_log, const char* format, ...) { +void Log(const std::shared_ptr& info_log, const char* format, ...) { va_list ap; va_start(ap, format); Logv(info_log.get(), format, ap); @@ -305,7 +305,7 @@ void Log(const shared_ptr& info_log, const char* format, ...) { Status WriteStringToFile(Env* env, const Slice& data, const std::string& fname, bool should_sync) { - unique_ptr file; + std::unique_ptr file; EnvOptions soptions; Status s = env->NewWritableFile(fname, &file, soptions); if (!s.ok()) { @@ -324,7 +324,7 @@ Status WriteStringToFile(Env* env, const Slice& data, const std::string& fname, Status ReadFileToString(Env* env, const std::string& fname, std::string* data) { EnvOptions soptions; data->clear(); - unique_ptr file; + std::unique_ptr file; Status s = env->NewSequentialFile(fname, &file, soptions); if (!s.ok()) { return s; diff --git a/env/env_basic_test.cc b/env/env_basic_test.cc index e33c79f3a29..22983dbecdb 100644 --- a/env/env_basic_test.cc +++ b/env/env_basic_test.cc @@ -2,17 +2,14 @@ // Use of this source code is governed by a BSD-style license that can be // found in the LICENSE file. See the AUTHORS file for names of contributors. -#include #include #include #include +#include -#include "cloud/aws/aws_env.h" #include "env/mock_env.h" #include "rocksdb/env.h" #include "rocksdb/utilities/object_registry.h" -#include "util/logging.h" -#include "util/stderr_logger.h" #include "util/testharness.h" namespace rocksdb { @@ -63,7 +60,9 @@ class EnvBasicTestWithParam : public testing::Test, test_dir_ = test::PerThreadDBPath(env_, "env_basic_test"); } - void SetUp() { env_->CreateDirIfMissing(test_dir_); } + void SetUp() { + env_->CreateDirIfMissing(test_dir_); + } void TearDown() { std::vector files; @@ -92,55 +91,6 @@ static std::unique_ptr mock_env(new MockEnv(Env::Default())); INSTANTIATE_TEST_CASE_P(MockEnv, EnvBasicTestWithParam, ::testing::Values(mock_env.get())); #ifndef ROCKSDB_LITE - -#ifdef USE_AWS -// Register an AWS env -void CreateAwsEnv(const std::string& dbpath, - std::unique_ptr* result) { - std::shared_ptr info_log; - info_log.reset(new rocksdb::StderrLogger(rocksdb::InfoLogLevel::DEBUG_LEVEL)); - std::string aws_access_key_id; - std::string aws_secret_access_key; - std::string aws_region; - Status st = rocksdb::AwsEnv::GetTestCredentials( - &aws_access_key_id, &aws_secret_access_key, &aws_region); - if (!st.ok()) { - Log(InfoLogLevel::DEBUG_LEVEL, info_log, st.ToString().c_str()); - return; - } - rocksdb::CloudEnvOptions coptions; - coptions.credentials.access_key_id = aws_access_key_id; - coptions.credentials.secret_key = aws_secret_access_key; - rocksdb::CloudEnv* s; - ROCKS_LOG_INFO(info_log, "Created new aws env with path %s", dbpath.c_str()); - st = rocksdb::AwsEnv::NewAwsEnv( - Env::Default(), - "envtest." + AwsEnv::GetTestBucketSuffix(), dbpath, aws_region, - "envtest." 
+ AwsEnv::GetTestBucketSuffix(), dbpath, aws_region, - coptions, std::move(info_log), &s); - assert(st.ok()); - if (!st.ok()) { - Log(InfoLogLevel::DEBUG_LEVEL, info_log, st.ToString().c_str()); - return; - } - ((CloudEnvImpl*)s)->TEST_DisableCloudManifest(); - ((AwsEnv*)s)->TEST_SetFileDeletionDelay(std::chrono::seconds(0)); - // If we are keeping wal in cloud storage, then tail it as well. - // so that our unit tests can run to completion. - if (!coptions.keep_local_log_files) { - AwsEnv* aws = static_cast(s); - aws->StartTailingStream(); - } - result->reset(new NormalizingEnvWrapper(s)); -} -static rocksdb::Registrar s3_reg( - "s3://.*", - [](const std::string& uri, std::unique_ptr* env_guard) { - CreateAwsEnv(uri, env_guard); - return env_guard->get(); - }); -#endif /* USE_AWS */ - static std::unique_ptr mem_env(NewMemEnv(Env::Default())); INSTANTIATE_TEST_CASE_P(MemEnv, EnvBasicTestWithParam, ::testing::Values(mem_env.get())); @@ -183,7 +133,7 @@ INSTANTIATE_TEST_CASE_P(CustomEnv, EnvMoreTestWithParam, TEST_P(EnvBasicTestWithParam, Basics) { uint64_t file_size; - unique_ptr writable_file; + std::unique_ptr writable_file; std::vector children; // Check that the directory is empty. @@ -236,8 +186,8 @@ TEST_P(EnvBasicTestWithParam, Basics) { ASSERT_EQ(0U, file_size); // Check that opening non-existent file fails. - unique_ptr seq_file; - unique_ptr rand_file; + std::unique_ptr seq_file; + std::unique_ptr rand_file; ASSERT_TRUE(!env_->NewSequentialFile(test_dir_ + "/non_existent", &seq_file, soptions_) .ok()); @@ -258,22 +208,20 @@ TEST_P(EnvBasicTestWithParam, Basics) { } TEST_P(EnvBasicTestWithParam, ReadWrite) { - unique_ptr writable_file; - unique_ptr seq_file; - unique_ptr rand_file; + std::unique_ptr writable_file; + std::unique_ptr seq_file; + std::unique_ptr rand_file; Slice result; char scratch[100]; - std::string fname = "/100.sst"; - ASSERT_OK( - env_->NewWritableFile(test_dir_ + fname, &writable_file, soptions_)); + ASSERT_OK(env_->NewWritableFile(test_dir_ + "/f", &writable_file, soptions_)); ASSERT_OK(writable_file->Append("hello ")); ASSERT_OK(writable_file->Append("world")); ASSERT_OK(writable_file->Close()); writable_file.reset(); // Read sequentially. - ASSERT_OK(env_->NewSequentialFile(test_dir_ + fname, &seq_file, soptions_)); + ASSERT_OK(env_->NewSequentialFile(test_dir_ + "/f", &seq_file, soptions_)); ASSERT_OK(seq_file->Read(5, &result, scratch)); // Read "hello". ASSERT_EQ(0, result.compare("hello")); ASSERT_OK(seq_file->Skip(1)); @@ -286,8 +234,7 @@ TEST_P(EnvBasicTestWithParam, ReadWrite) { ASSERT_EQ(0U, result.size()); // Random reads. - ASSERT_OK( - env_->NewRandomAccessFile(test_dir_ + fname, &rand_file, soptions_)); + ASSERT_OK(env_->NewRandomAccessFile(test_dir_ + "/f", &rand_file, soptions_)); ASSERT_OK(rand_file->Read(6, 5, &result, scratch)); // Read "world". ASSERT_EQ(0, result.compare("world")); ASSERT_OK(rand_file->Read(0, 5, &result, scratch)); // Read "hello". @@ -297,12 +244,10 @@ TEST_P(EnvBasicTestWithParam, ReadWrite) { // Too high offset. ASSERT_TRUE(rand_file->Read(1000, 5, &result, scratch).ok()); - // delete test file - ASSERT_TRUE(env_->DeleteFile(test_dir_ + fname).ok()); } TEST_P(EnvBasicTestWithParam, Misc) { - unique_ptr writable_file; + std::unique_ptr writable_file; ASSERT_OK(env_->NewWritableFile(test_dir_ + "/b", &writable_file, soptions_)); // These are no-ops, but we test they return success. 
@@ -313,7 +258,6 @@ TEST_P(EnvBasicTestWithParam, Misc) { } TEST_P(EnvBasicTestWithParam, LargeWrite) { - std::string fname = "/f.log"; const size_t kWriteSize = 300 * 1024; char* scratch = new char[kWriteSize * 2]; @@ -322,17 +266,16 @@ TEST_P(EnvBasicTestWithParam, LargeWrite) { write_data.append(1, static_cast(i)); } - unique_ptr writable_file; - ASSERT_OK( - env_->NewWritableFile(test_dir_ + fname, &writable_file, soptions_)); + std::unique_ptr writable_file; + ASSERT_OK(env_->NewWritableFile(test_dir_ + "/f", &writable_file, soptions_)); ASSERT_OK(writable_file->Append("foo")); ASSERT_OK(writable_file->Append(write_data)); ASSERT_OK(writable_file->Close()); writable_file.reset(); - unique_ptr seq_file; + std::unique_ptr seq_file; Slice result; - ASSERT_OK(env_->NewSequentialFile(test_dir_ + fname, &seq_file, soptions_)); + ASSERT_OK(env_->NewSequentialFile(test_dir_ + "/f", &seq_file, soptions_)); ASSERT_OK(seq_file->Read(3, &result, scratch)); // Read "foo". ASSERT_EQ(0, result.compare("foo")); @@ -344,7 +287,7 @@ TEST_P(EnvBasicTestWithParam, LargeWrite) { read += result.size(); } ASSERT_TRUE(write_data == read_data); - delete[] scratch; + delete [] scratch; } TEST_P(EnvMoreTestWithParam, GetModTime) { @@ -397,7 +340,7 @@ TEST_P(EnvMoreTestWithParam, GetChildren) { // if dir is a file, returns IOError ASSERT_OK(env_->CreateDir(test_dir_)); - unique_ptr writable_file; + std::unique_ptr writable_file; ASSERT_OK( env_->NewWritableFile(test_dir_ + "/file", &writable_file, soptions_)); ASSERT_OK(writable_file->Close()); diff --git a/env/env_chroot.cc b/env/env_chroot.cc index 6a1fda8a834..f6236c81b2c 100644 --- a/env/env_chroot.cc +++ b/env/env_chroot.cc @@ -50,7 +50,7 @@ class ChrootEnv : public EnvWrapper { } virtual Status NewRandomAccessFile(const std::string& fname, - unique_ptr* result, + std::unique_ptr* result, const EnvOptions& options) override { auto status_and_enc_path = EncodePathWithNewBasename(fname); if (!status_and_enc_path.first.ok()) { @@ -61,7 +61,7 @@ class ChrootEnv : public EnvWrapper { } virtual Status NewWritableFile(const std::string& fname, - unique_ptr* result, + std::unique_ptr* result, const EnvOptions& options) override { auto status_and_enc_path = EncodePathWithNewBasename(fname); if (!status_and_enc_path.first.ok()) { @@ -73,7 +73,7 @@ class ChrootEnv : public EnvWrapper { virtual Status ReuseWritableFile(const std::string& fname, const std::string& old_fname, - unique_ptr* result, + std::unique_ptr* result, const EnvOptions& options) override { auto status_and_enc_path = EncodePathWithNewBasename(fname); if (!status_and_enc_path.first.ok()) { @@ -89,7 +89,7 @@ class ChrootEnv : public EnvWrapper { } virtual Status NewRandomRWFile(const std::string& fname, - unique_ptr* result, + std::unique_ptr* result, const EnvOptions& options) override { auto status_and_enc_path = EncodePathWithNewBasename(fname); if (!status_and_enc_path.first.ok()) { @@ -100,7 +100,7 @@ class ChrootEnv : public EnvWrapper { } virtual Status NewDirectory(const std::string& dir, - unique_ptr* result) override { + std::unique_ptr* result) override { auto status_and_enc_path = EncodePathWithNewBasename(dir); if (!status_and_enc_path.first.ok()) { return status_and_enc_path.first; @@ -238,7 +238,7 @@ class ChrootEnv : public EnvWrapper { } virtual Status NewLogger(const std::string& fname, - shared_ptr* result) override { + std::shared_ptr* result) override { auto status_and_enc_path = EncodePathWithNewBasename(fname); if (!status_and_enc_path.first.ok()) { return 
status_and_enc_path.first; diff --git a/env/env_encryption.cc b/env/env_encryption.cc index e80796fe0c7..e38693e3ce7 100644 --- a/env/env_encryption.cc +++ b/env/env_encryption.cc @@ -422,7 +422,7 @@ class EncryptedEnv : public EnvWrapper { // NewRandomAccessFile opens a file for random read access. virtual Status NewRandomAccessFile(const std::string& fname, - unique_ptr* result, + std::unique_ptr* result, const EnvOptions& options) override { result->reset(); if (options.use_mmap_reads) { @@ -456,10 +456,10 @@ class EncryptedEnv : public EnvWrapper { (*result) = std::unique_ptr(new EncryptedRandomAccessFile(underlying.release(), stream.release(), prefixLength)); return Status::OK(); } - + // NewWritableFile opens a file for sequential writing. virtual Status NewWritableFile(const std::string& fname, - unique_ptr* result, + std::unique_ptr* result, const EnvOptions& options) override { result->reset(); if (options.use_mmap_writes) { @@ -505,8 +505,8 @@ class EncryptedEnv : public EnvWrapper { // // The returned file will only be accessed by one thread at a time. virtual Status ReopenWritableFile(const std::string& fname, - unique_ptr* result, - const EnvOptions& options) override { + std::unique_ptr* result, + const EnvOptions& options) override { result->reset(); if (options.use_mmap_writes) { return Status::InvalidArgument(); @@ -546,7 +546,7 @@ class EncryptedEnv : public EnvWrapper { // Reuse an existing file by renaming it and opening it as writable. virtual Status ReuseWritableFile(const std::string& fname, const std::string& old_fname, - unique_ptr* result, + std::unique_ptr* result, const EnvOptions& options) override { result->reset(); if (options.use_mmap_writes) { @@ -590,7 +590,7 @@ class EncryptedEnv : public EnvWrapper { // // The returned file will only be accessed by one thread at a time. virtual Status NewRandomRWFile(const std::string& fname, - unique_ptr* result, + std::unique_ptr* result, const EnvOptions& options) override { result->reset(); if (options.use_mmap_reads || options.use_mmap_writes) { @@ -692,7 +692,7 @@ Status BlockAccessCipherStream::Encrypt(uint64_t fileOffset, char *data, size_t auto blockSize = BlockSize(); uint64_t blockIndex = fileOffset / blockSize; size_t blockOffset = fileOffset % blockSize; - unique_ptr blockBuffer; + std::unique_ptr blockBuffer; std::string scratch; AllocateScratch(scratch); @@ -705,8 +705,8 @@ Status BlockAccessCipherStream::Encrypt(uint64_t fileOffset, char *data, size_t // We're not encrypting a full block. // Copy data to blockBuffer if (!blockBuffer.get()) { - // Allocate buffer - blockBuffer = unique_ptr(new char[blockSize]); + // Allocate buffer + blockBuffer = std::unique_ptr(new char[blockSize]); } block = blockBuffer.get(); // Copy plain data to block buffer @@ -737,7 +737,7 @@ Status BlockAccessCipherStream::Decrypt(uint64_t fileOffset, char *data, size_t auto blockSize = BlockSize(); uint64_t blockIndex = fileOffset / blockSize; size_t blockOffset = fileOffset % blockSize; - unique_ptr blockBuffer; + std::unique_ptr blockBuffer; std::string scratch; AllocateScratch(scratch); @@ -750,8 +750,8 @@ Status BlockAccessCipherStream::Decrypt(uint64_t fileOffset, char *data, size_t // We're not decrypting a full block. 
// Copy data to blockBuffer if (!blockBuffer.get()) { - // Allocate buffer - blockBuffer = unique_ptr(new char[blockSize]); + // Allocate buffer + blockBuffer = std::unique_ptr(new char[blockSize]); } block = blockBuffer.get(); // Copy encrypted data to block buffer @@ -882,7 +882,9 @@ size_t CTREncryptionProvider::PopulateSecretPrefixPart(char* /*prefix*/, return 0; } -Status CTREncryptionProvider::CreateCipherStream(const std::string& fname, const EnvOptions& options, Slice &prefix, unique_ptr* result) { +Status CTREncryptionProvider::CreateCipherStream( + const std::string& fname, const EnvOptions& options, Slice& prefix, + std::unique_ptr* result) { // Read plain text part of prefix. auto blockSize = cipher_.BlockSize(); uint64_t initialCounter; @@ -905,8 +907,9 @@ Status CTREncryptionProvider::CreateCipherStream(const std::string& fname, const Status CTREncryptionProvider::CreateCipherStreamFromPrefix( const std::string& /*fname*/, const EnvOptions& /*options*/, uint64_t initialCounter, const Slice& iv, const Slice& /*prefix*/, - unique_ptr* result) { - (*result) = unique_ptr(new CTRCipherStream(cipher_, iv.data(), initialCounter)); + std::unique_ptr* result) { + (*result) = std::unique_ptr( + new CTRCipherStream(cipher_, iv.data(), initialCounter)); return Status::OK(); } diff --git a/env/env_hdfs.cc b/env/env_hdfs.cc index 1eaea3a1ce5..14fb902f0d4 100644 --- a/env/env_hdfs.cc +++ b/env/env_hdfs.cc @@ -381,7 +381,7 @@ const std::string HdfsEnv::pathsep = "/"; // open a file for sequential reading Status HdfsEnv::NewSequentialFile(const std::string& fname, - unique_ptr* result, + std::unique_ptr* result, const EnvOptions& options) { result->reset(); HdfsReadableFile* f = new HdfsReadableFile(fileSys_, fname); @@ -396,7 +396,7 @@ Status HdfsEnv::NewSequentialFile(const std::string& fname, // open a file for random reading Status HdfsEnv::NewRandomAccessFile(const std::string& fname, - unique_ptr* result, + std::unique_ptr* result, const EnvOptions& options) { result->reset(); HdfsReadableFile* f = new HdfsReadableFile(fileSys_, fname); @@ -411,7 +411,7 @@ Status HdfsEnv::NewRandomAccessFile(const std::string& fname, // create a new file for writing Status HdfsEnv::NewWritableFile(const std::string& fname, - unique_ptr* result, + std::unique_ptr* result, const EnvOptions& options) { result->reset(); Status s; @@ -437,7 +437,7 @@ class HdfsDirectory : public Directory { }; Status HdfsEnv::NewDirectory(const std::string& name, - unique_ptr* result) { + std::unique_ptr* result) { int value = hdfsExists(fileSys_, name.c_str()); switch (value) { case HDFS_EXISTS: @@ -581,7 +581,7 @@ Status HdfsEnv::UnlockFile(FileLock* lock) { } Status HdfsEnv::NewLogger(const std::string& fname, - shared_ptr* result) { + std::shared_ptr* result) { HdfsWritableFile* f = new HdfsWritableFile(fileSys_, fname); if (f == nullptr || !f->isValid()) { delete f; @@ -610,10 +610,10 @@ Status NewHdfsEnv(Env** hdfs_env, const std::string& fsname) { // dummy placeholders used when HDFS is not available namespace rocksdb { Status HdfsEnv::NewSequentialFile(const std::string& /*fname*/, - unique_ptr* /*result*/, + std::unique_ptr* /*result*/, const EnvOptions& /*options*/) { return Status::NotSupported("Not compiled with hdfs support"); - } +} Status NewHdfsEnv(Env** /*hdfs_env*/, const std::string& /*fsname*/) { return Status::NotSupported("Not compiled with hdfs support"); diff --git a/env/env_posix.cc b/env/env_posix.cc index 34d49b9dc15..c2e456a6614 100644 --- a/env/env_posix.cc +++ b/env/env_posix.cc @@ -142,7 +142,7 
@@ class PosixEnv : public Env { } virtual Status NewSequentialFile(const std::string& fname, - unique_ptr* result, + std::unique_ptr* result, const EnvOptions& options) override { result->reset(); int fd = -1; @@ -192,7 +192,7 @@ class PosixEnv : public Env { } virtual Status NewRandomAccessFile(const std::string& fname, - unique_ptr* result, + std::unique_ptr* result, const EnvOptions& options) override { result->reset(); Status s; @@ -249,7 +249,7 @@ class PosixEnv : public Env { } virtual Status OpenWritableFile(const std::string& fname, - unique_ptr* result, + std::unique_ptr* result, const EnvOptions& options, bool reopen = false) { result->reset(); @@ -333,20 +333,20 @@ class PosixEnv : public Env { } virtual Status NewWritableFile(const std::string& fname, - unique_ptr* result, + std::unique_ptr* result, const EnvOptions& options) override { return OpenWritableFile(fname, result, options, false); } virtual Status ReopenWritableFile(const std::string& fname, - unique_ptr* result, + std::unique_ptr* result, const EnvOptions& options) override { return OpenWritableFile(fname, result, options, true); } virtual Status ReuseWritableFile(const std::string& fname, const std::string& old_fname, - unique_ptr* result, + std::unique_ptr* result, const EnvOptions& options) override { result->reset(); Status s; @@ -430,7 +430,7 @@ class PosixEnv : public Env { } virtual Status NewRandomRWFile(const std::string& fname, - unique_ptr* result, + std::unique_ptr* result, const EnvOptions& options) override { int fd = -1; int flags = cloexec_flags(O_RDWR, &options); @@ -455,7 +455,7 @@ class PosixEnv : public Env { virtual Status NewMemoryMappedFileBuffer( const std::string& fname, - unique_ptr* result) override { + std::unique_ptr* result) override { int fd = -1; Status status; int flags = cloexec_flags(O_RDWR, nullptr); @@ -497,7 +497,7 @@ class PosixEnv : public Env { } virtual Status NewDirectory(const std::string& name, - unique_ptr* result) override { + std::unique_ptr* result) override { result->reset(); int fd; int flags = cloexec_flags(0, nullptr); @@ -791,7 +791,7 @@ class PosixEnv : public Env { } virtual Status NewLogger(const std::string& fname, - shared_ptr* result) override { + std::shared_ptr* result) override { FILE* f; { IOSTATS_TIMER_GUARD(open_nanos); diff --git a/env/env_test.cc b/env/env_test.cc index eda6b9d5d76..36cbd735d7d 100644 --- a/env/env_test.cc +++ b/env/env_test.cc @@ -181,11 +181,11 @@ TEST_F(EnvPosixTest, DISABLED_FilePermission) { std::vector fileNames{ test::PerThreadDBPath(env_, "testfile"), test::PerThreadDBPath(env_, "testfile1")}; - unique_ptr wfile; + std::unique_ptr wfile; ASSERT_OK(env_->NewWritableFile(fileNames[0], &wfile, soptions)); ASSERT_OK(env_->NewWritableFile(fileNames[1], &wfile, soptions)); wfile.reset(); - unique_ptr rwfile; + std::unique_ptr rwfile; ASSERT_OK(env_->NewRandomRWFile(fileNames[1], &rwfile, soptions)); struct stat sb; @@ -217,7 +217,7 @@ TEST_F(EnvPosixTest, MemoryMappedFileBuffer) { std::string expected_data; std::string fname = test::PerThreadDBPath(env_, "testfile"); { - unique_ptr wfile; + std::unique_ptr wfile; const EnvOptions soptions; ASSERT_OK(env_->NewWritableFile(fname, &wfile, soptions)); @@ -812,7 +812,7 @@ class IoctlFriendlyTmpdir { #ifndef ROCKSDB_LITE TEST_F(EnvPosixTest, PositionedAppend) { - unique_ptr writable_file; + std::unique_ptr writable_file; EnvOptions options; options.use_direct_writes = true; options.use_mmap_writes = false; @@ -832,7 +832,7 @@ TEST_F(EnvPosixTest, PositionedAppend) { // The file now has 
1 sector worth of a followed by a page worth of b // Verify the above - unique_ptr seq_file; + std::unique_ptr seq_file; ASSERT_OK(env_->NewSequentialFile(ift.name() + "/f", &seq_file, options)); char scratch[kPageSize * 2]; Slice result; @@ -851,10 +851,10 @@ TEST_P(EnvPosixTestWithParam, RandomAccessUniqueID) { soptions.use_direct_reads = soptions.use_direct_writes = direct_io_; IoctlFriendlyTmpdir ift; std::string fname = ift.name() + "/testfile"; - unique_ptr wfile; + std::unique_ptr wfile; ASSERT_OK(env_->NewWritableFile(fname, &wfile, soptions)); - unique_ptr file; + std::unique_ptr file; // Get Unique ID ASSERT_OK(env_->NewRandomAccessFile(fname, &file, soptions)); @@ -921,7 +921,7 @@ TEST_P(EnvPosixTestWithParam, AllocateTest) { EnvOptions soptions; soptions.use_mmap_writes = false; soptions.use_direct_reads = soptions.use_direct_writes = direct_io_; - unique_ptr wfile; + std::unique_ptr wfile; ASSERT_OK(env_->NewWritableFile(fname, &wfile, soptions)); // allocate 100 MB @@ -990,14 +990,14 @@ TEST_P(EnvPosixTestWithParam, RandomAccessUniqueIDConcurrent) { fnames.push_back(ift.name() + "/" + "testfile" + ToString(i)); // Create file. - unique_ptr wfile; + std::unique_ptr wfile; ASSERT_OK(env_->NewWritableFile(fnames[i], &wfile, soptions)); } // Collect and check whether the IDs are unique. std::unordered_set ids; for (const std::string fname : fnames) { - unique_ptr file; + std::unique_ptr file; std::string unique_id; ASSERT_OK(env_->NewRandomAccessFile(fname, &file, soptions)); size_t id_size = file->GetUniqueId(temp_id, MAX_ID_SIZE); @@ -1033,14 +1033,14 @@ TEST_P(EnvPosixTestWithParam, RandomAccessUniqueIDDeletes) { for (int i = 0; i < 1000; ++i) { // Create file. { - unique_ptr wfile; + std::unique_ptr wfile; ASSERT_OK(env_->NewWritableFile(fname, &wfile, soptions)); } // Get Unique ID std::string unique_id; { - unique_ptr file; + std::unique_ptr file; ASSERT_OK(env_->NewRandomAccessFile(fname, &file, soptions)); size_t id_size = file->GetUniqueId(temp_id, MAX_ID_SIZE); ASSERT_TRUE(id_size > 0); @@ -1076,7 +1076,7 @@ TEST_P(EnvPosixTestWithParam, InvalidateCache) { // Create file. 
{ - unique_ptr wfile; + std::unique_ptr wfile; #if !defined(OS_MACOSX) && !defined(OS_WIN) && !defined(OS_SOLARIS) && !defined(OS_AIX) if (soptions.use_direct_writes) { soptions.use_direct_writes = false; @@ -1090,7 +1090,7 @@ TEST_P(EnvPosixTestWithParam, InvalidateCache) { // Random Read { - unique_ptr file; + std::unique_ptr file; auto scratch = NewAligned(kSectorSize, 0); Slice result; #if !defined(OS_MACOSX) && !defined(OS_WIN) && !defined(OS_SOLARIS) && !defined(OS_AIX) @@ -1107,7 +1107,7 @@ TEST_P(EnvPosixTestWithParam, InvalidateCache) { // Sequential Read { - unique_ptr file; + std::unique_ptr file; auto scratch = NewAligned(kSectorSize, 0); Slice result; #if !defined(OS_MACOSX) && !defined(OS_WIN) && !defined(OS_SOLARIS) && !defined(OS_AIX) @@ -1252,7 +1252,7 @@ TEST_P(EnvPosixTestWithParam, LogBufferMaxSizeTest) { TEST_P(EnvPosixTestWithParam, Preallocation) { rocksdb::SyncPoint::GetInstance()->EnableProcessing(); const std::string src = test::PerThreadDBPath(env_, "testfile"); - unique_ptr srcfile; + std::unique_ptr srcfile; EnvOptions soptions; soptions.use_direct_reads = soptions.use_direct_writes = direct_io_; #if !defined(OS_MACOSX) && !defined(OS_WIN) && !defined(OS_SOLARIS) && !defined(OS_AIX) && !defined(OS_OPENBSD) && !defined(OS_FREEBSD) @@ -1315,7 +1315,7 @@ TEST_P(EnvPosixTestWithParam, ConsistentChildrenAttributes) { for (int i = 0; i < kNumChildren; ++i) { const std::string path = test::TmpDir(env_) + "/" + "testfile_" + std::to_string(i); - unique_ptr file; + std::unique_ptr file; #if !defined(OS_MACOSX) && !defined(OS_WIN) && !defined(OS_SOLARIS) && !defined(OS_AIX) && !defined(OS_OPENBSD) && !defined(OS_FREEBSD) if (soptions.use_direct_writes) { rocksdb::SyncPoint::GetInstance()->SetCallBack( @@ -1368,50 +1368,110 @@ TEST_P(EnvPosixTestWithParam, WritableFileWrapper) { inc(1); return Status::OK(); } - Status Truncate(uint64_t /*size*/) override { return Status::OK(); } - Status Close() override { inc(2); return Status::OK(); } - Status Flush() override { inc(3); return Status::OK(); } - Status Sync() override { inc(4); return Status::OK(); } - Status Fsync() override { inc(5); return Status::OK(); } - void SetIOPriority(Env::IOPriority /*pri*/) override { inc(6); } - uint64_t GetFileSize() override { inc(7); return 0; } + + Status PositionedAppend(const Slice& /*data*/, + uint64_t /*offset*/) override { + inc(2); + return Status::OK(); + } + + Status Truncate(uint64_t /*size*/) override { + inc(3); + return Status::OK(); + } + + Status Close() override { + inc(4); + return Status::OK(); + } + + Status Flush() override { + inc(5); + return Status::OK(); + } + + Status Sync() override { + inc(6); + return Status::OK(); + } + + Status Fsync() override { + inc(7); + return Status::OK(); + } + + bool IsSyncThreadSafe() const override { + inc(8); + return true; + } + + bool use_direct_io() const override { + inc(9); + return true; + } + + size_t GetRequiredBufferAlignment() const override { + inc(10); + return 0; + } + + void SetIOPriority(Env::IOPriority /*pri*/) override { inc(11); } + + Env::IOPriority GetIOPriority() override { + inc(12); + return Env::IOPriority::IO_LOW; + } + + void SetWriteLifeTimeHint(Env::WriteLifeTimeHint /*hint*/) override { + inc(13); + } + + Env::WriteLifeTimeHint GetWriteLifeTimeHint() override { + inc(14); + return Env::WriteLifeTimeHint::WLTH_NOT_SET; + } + + uint64_t GetFileSize() override { + inc(15); + return 0; + } + + void SetPreallocationBlockSize(size_t /*size*/) override { inc(16); } + void GetPreallocationStatus(size_t* 
/*block_size*/, size_t* /*last_allocated_block*/) override { - inc(8); + inc(17); } + size_t GetUniqueId(char* /*id*/, size_t /*max_size*/) const override { - inc(9); + inc(18); return 0; } + Status InvalidateCache(size_t /*offset*/, size_t /*length*/) override { - inc(10); + inc(19); return Status::OK(); } - protected: - Status Allocate(uint64_t /*offset*/, uint64_t /*len*/) override { - inc(11); + Status RangeSync(uint64_t /*offset*/, uint64_t /*nbytes*/) override { + inc(20); return Status::OK(); } - Status RangeSync(uint64_t /*offset*/, uint64_t /*nbytes*/) override { - inc(12); + + void PrepareWrite(size_t /*offset*/, size_t /*len*/) override { inc(21); } + + Status Allocate(uint64_t /*offset*/, uint64_t /*len*/) override { + inc(22); return Status::OK(); } public: - ~Base() { - inc(13); - } + ~Base() { inc(23); } }; class Wrapper : public WritableFileWrapper { public: explicit Wrapper(WritableFile* target) : WritableFileWrapper(target) {} - - void CallProtectedMethods() { - Allocate(0, 0); - RangeSync(0, 0); - } }; int step = 0; @@ -1420,19 +1480,30 @@ TEST_P(EnvPosixTestWithParam, WritableFileWrapper) { Base b(&step); Wrapper w(&b); w.Append(Slice()); + w.PositionedAppend(Slice(), 0); + w.Truncate(0); w.Close(); w.Flush(); w.Sync(); w.Fsync(); + w.IsSyncThreadSafe(); + w.use_direct_io(); + w.GetRequiredBufferAlignment(); w.SetIOPriority(Env::IOPriority::IO_HIGH); + w.GetIOPriority(); + w.SetWriteLifeTimeHint(Env::WriteLifeTimeHint::WLTH_NOT_SET); + w.GetWriteLifeTimeHint(); w.GetFileSize(); + w.SetPreallocationBlockSize(0); w.GetPreallocationStatus(nullptr, nullptr); w.GetUniqueId(nullptr, 0); w.InvalidateCache(0, 0); - w.CallProtectedMethods(); + w.RangeSync(0, 0); + w.PrepareWrite(0, 0); + w.Allocate(0, 0); } - EXPECT_EQ(14, step); + EXPECT_EQ(24, step); } TEST_P(EnvPosixTestWithParam, PosixRandomRWFile) { @@ -1567,7 +1638,7 @@ TEST_P(EnvPosixTestWithParam, PosixRandomRWFileRandomized) { const std::string path = test::PerThreadDBPath(env_, "random_rw_file_rand"); env_->DeleteFile(path); - unique_ptr file; + std::unique_ptr file; #ifdef OS_LINUX // Cannot open non-existing file. @@ -1641,7 +1712,7 @@ class TestEnv : public EnvWrapper { int GetCloseCount() { return close_count; } virtual Status NewLogger(const std::string& /*fname*/, - shared_ptr* result) { + std::shared_ptr* result) { result->reset(new TestLogger(this)); return Status::OK(); } @@ -1685,8 +1756,8 @@ INSTANTIATE_TEST_CASE_P(DefaultEnvWithDirectIO, EnvPosixTestWithParam, #endif // !defined(ROCKSDB_LITE) #if !defined(ROCKSDB_LITE) && !defined(OS_WIN) -static unique_ptr chroot_env(NewChrootEnv(Env::Default(), - test::TmpDir(Env::Default()))); +static std::unique_ptr chroot_env( + NewChrootEnv(Env::Default(), test::TmpDir(Env::Default()))); INSTANTIATE_TEST_CASE_P( ChrootEnvWithoutDirectIO, EnvPosixTestWithParam, ::testing::Values(std::pair(chroot_env.get(), false))); diff --git a/env/mock_env.cc b/env/mock_env.cc index 12c096cefba..84b30607172 100644 --- a/env/mock_env.cc +++ b/env/mock_env.cc @@ -319,7 +319,7 @@ class TestMemLogger : public Logger { static const uint64_t flush_every_seconds_ = 5; std::atomic_uint_fast64_t last_flush_micros_; Env* env_; - bool flush_pending_; + std::atomic flush_pending_; public: TestMemLogger(std::unique_ptr f, Env* env, @@ -424,7 +424,7 @@ MockEnv::~MockEnv() { // Partial implementation of the Env interface. 
Status MockEnv::NewSequentialFile(const std::string& fname, - unique_ptr* result, + std::unique_ptr* result, const EnvOptions& /*soptions*/) { auto fn = NormalizePath(fname); MutexLock lock(&mutex_); @@ -441,7 +441,7 @@ Status MockEnv::NewSequentialFile(const std::string& fname, } Status MockEnv::NewRandomAccessFile(const std::string& fname, - unique_ptr* result, + std::unique_ptr* result, const EnvOptions& /*soptions*/) { auto fn = NormalizePath(fname); MutexLock lock(&mutex_); @@ -458,7 +458,7 @@ Status MockEnv::NewRandomAccessFile(const std::string& fname, } Status MockEnv::NewRandomRWFile(const std::string& fname, - unique_ptr* result, + std::unique_ptr* result, const EnvOptions& /*soptions*/) { auto fn = NormalizePath(fname); MutexLock lock(&mutex_); @@ -476,7 +476,7 @@ Status MockEnv::NewRandomRWFile(const std::string& fname, Status MockEnv::ReuseWritableFile(const std::string& fname, const std::string& old_fname, - unique_ptr* result, + std::unique_ptr* result, const EnvOptions& options) { auto s = RenameFile(old_fname, fname); if (!s.ok()) { @@ -487,7 +487,7 @@ Status MockEnv::ReuseWritableFile(const std::string& fname, } Status MockEnv::NewWritableFile(const std::string& fname, - unique_ptr* result, + std::unique_ptr* result, const EnvOptions& env_options) { auto fn = NormalizePath(fname); MutexLock lock(&mutex_); @@ -503,7 +503,7 @@ Status MockEnv::NewWritableFile(const std::string& fname, } Status MockEnv::NewDirectory(const std::string& /*name*/, - unique_ptr* result) { + std::unique_ptr* result) { result->reset(new MockEnvDirectory()); return Status::OK(); } @@ -660,7 +660,7 @@ Status MockEnv::LinkFile(const std::string& src, const std::string& dest) { } Status MockEnv::NewLogger(const std::string& fname, - shared_ptr* result) { + std::shared_ptr* result) { auto fn = NormalizePath(fname); MutexLock lock(&mutex_); auto iter = file_map_.find(fn); diff --git a/env/mock_env.h b/env/mock_env.h index 816256ab08c..87b8deaf8c3 100644 --- a/env/mock_env.h +++ b/env/mock_env.h @@ -28,28 +28,28 @@ class MockEnv : public EnvWrapper { // Partial implementation of the Env interface. 
virtual Status NewSequentialFile(const std::string& fname, - unique_ptr* result, + std::unique_ptr* result, const EnvOptions& soptions) override; virtual Status NewRandomAccessFile(const std::string& fname, - unique_ptr* result, + std::unique_ptr* result, const EnvOptions& soptions) override; virtual Status NewRandomRWFile(const std::string& fname, - unique_ptr* result, + std::unique_ptr* result, const EnvOptions& options) override; virtual Status ReuseWritableFile(const std::string& fname, const std::string& old_fname, - unique_ptr* result, + std::unique_ptr* result, const EnvOptions& options) override; virtual Status NewWritableFile(const std::string& fname, - unique_ptr* result, + std::unique_ptr* result, const EnvOptions& env_options) override; virtual Status NewDirectory(const std::string& name, - unique_ptr* result) override; + std::unique_ptr* result) override; virtual Status FileExists(const std::string& fname) override; @@ -81,7 +81,7 @@ class MockEnv : public EnvWrapper { const std::string& target) override; virtual Status NewLogger(const std::string& fname, - shared_ptr* result) override; + std::shared_ptr* result) override; virtual Status LockFile(const std::string& fname, FileLock** flock) override; diff --git a/env/mock_env_test.cc b/env/mock_env_test.cc index 19e259ccd85..abd5b89f0b7 100644 --- a/env/mock_env_test.cc +++ b/env/mock_env_test.cc @@ -29,7 +29,7 @@ TEST_F(MockEnvTest, Corrupt) { const std::string kGood = "this is a good string, synced to disk"; const std::string kCorrupted = "this part may be corrupted"; const std::string kFileName = "/dir/f"; - unique_ptr writable_file; + std::unique_ptr writable_file; ASSERT_OK(env_->NewWritableFile(kFileName, &writable_file, soptions_)); ASSERT_OK(writable_file->Append(kGood)); ASSERT_TRUE(writable_file->GetFileSize() == kGood.size()); @@ -37,7 +37,7 @@ TEST_F(MockEnvTest, Corrupt) { std::string scratch; scratch.resize(kGood.size() + kCorrupted.size() + 16); Slice result; - unique_ptr rand_file; + std::unique_ptr rand_file; ASSERT_OK(env_->NewRandomAccessFile(kFileName, &rand_file, soptions_)); ASSERT_OK(rand_file->Read(0, kGood.size(), &result, &(scratch[0]))); ASSERT_EQ(result.compare(kGood), 0); diff --git a/hdfs/env_hdfs.h b/hdfs/env_hdfs.h index b0c9e33fd78..a77c42e0af8 100644 --- a/hdfs/env_hdfs.h +++ b/hdfs/env_hdfs.h @@ -255,23 +255,24 @@ class HdfsEnv : public Env { } virtual Status NewSequentialFile(const std::string& fname, - unique_ptr* result, + std::unique_ptr* result, const EnvOptions& options) override; - virtual Status NewRandomAccessFile(const std::string& /*fname*/, - unique_ptr* /*result*/, - const EnvOptions& /*options*/) override { + virtual Status NewRandomAccessFile( + const std::string& /*fname*/, + std::unique_ptr* /*result*/, + const EnvOptions& /*options*/) override { return notsup; } virtual Status NewWritableFile(const std::string& /*fname*/, - unique_ptr* /*result*/, + std::unique_ptr* /*result*/, const EnvOptions& /*options*/) override { return notsup; } virtual Status NewDirectory(const std::string& /*name*/, - unique_ptr* /*result*/) override { + std::unique_ptr* /*result*/) override { return notsup; } @@ -328,7 +329,7 @@ class HdfsEnv : public Env { virtual Status UnlockFile(FileLock* /*lock*/) override { return notsup; } virtual Status NewLogger(const std::string& /*fname*/, - shared_ptr* /*result*/) override { + std::shared_ptr* /*result*/) override { return notsup; } diff --git a/include/rocksdb/advanced_options.h b/include/rocksdb/advanced_options.h index 940a6f6b74a..fe331482e26 100644 
--- a/include/rocksdb/advanced_options.h +++ b/include/rocksdb/advanced_options.h @@ -413,6 +413,7 @@ struct AdvancedColumnFamilyOptions { // of the level. // At the same time max_bytes_for_level_multiplier and // max_bytes_for_level_multiplier_additional are still satisfied. + // (When L0 is too large, we make some adjustment. See below.) // // With this option on, from an empty DB, we make last level the base level, // which means merging L0 data into the last level, until it exceeds @@ -451,6 +452,29 @@ struct AdvancedColumnFamilyOptions { // max_bytes_for_level_base, for a more predictable LSM tree shape. It is // useful to limit worse case space amplification. // + // + // If compaction from L0 lags behind, a special mode is turned on that + // prioritizes write amplification over max_bytes_for_level_multiplier and + // max_bytes_for_level_base. Whether L0 compaction lags behind is determined + // from the number of L0 files and the total L0 size: if the number of L0 files + // is at least double level0_file_num_compaction_trigger, or the total size is + // at least max_bytes_for_level_base, this mode is turned on. The target of L1 + // then grows to the actual data size in L0, and the target of each level is + // determined so that every level has the same level multiplier. + // + // For example, suppose the L0 size is 100MB, the size of the last level is + // 1600MB, max_bytes_for_level_base = 80MB, and max_bytes_for_level_multiplier = 10. + // Since the L0 size is larger than max_bytes_for_level_base, this is the L0 + // compaction backlogged mode, so the L1 target is set to 100MB. + // Based on max_bytes_for_level_multiplier = 10, at least 3 non-0 levels will + // be needed. The level multiplier is then calculated to be 4, and the three + // levels' targets become [100MB, 400MB, 1600MB]. + // + // In this mode, the number of levels will be no more than in the normal mode, + // and the level multiplier will be lower. Write amplification is therefore + // likely to be reduced. + // + // // max_bytes_for_level_multiplier_additional is ignored with this flag on. // // Turning this feature on or off for an existing DB can cause unexpected @@ -478,19 +502,25 @@ struct AdvancedColumnFamilyOptions { // threshold. But it's not guaranteed. // Value 0 will be sanitized. // - // Default: result.target_file_size_base * 25 + // Default: target_file_size_base * 25 + // + // Dynamically changeable through SetOptions() API uint64_t max_compaction_bytes = 0; // All writes will be slowed down to at least delayed_write_rate if estimated // bytes needed to be compaction exceed this threshold. // // Default: 64GB + // + // Dynamically changeable through SetOptions() API uint64_t soft_pending_compaction_bytes_limit = 64 * 1073741824ull; // All writes are stopped if estimated bytes needed to be compaction exceed // this threshold. // // Default: 256GB + // + // Dynamically changeable through SetOptions() API uint64_t hard_pending_compaction_bytes_limit = 256 * 1073741824ull; // The compaction style. 
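Editor's note: to make the worked example in the comment above concrete, the following is an illustrative sketch (not part of this diff) of the arithmetic that derives the level count, the uniform multiplier, and the per-level targets in the L0-backlogged mode. The function name and rounding behavior are assumptions for exposition only; the real logic lives inside RocksDB's compaction code and may differ in detail.

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Grow the base target to the L0 size, pick the smallest number of non-L0
// levels that still reaches the last level with the configured multiplier,
// then derive a smaller uniform multiplier between that base and the last
// level.
std::vector<uint64_t> SketchLevelTargets(uint64_t l0_size,
                                         uint64_t last_level_size,
                                         uint64_t max_bytes_for_level_base,
                                         double max_bytes_for_level_multiplier) {
  uint64_t base = std::max(l0_size, max_bytes_for_level_base);
  int num_levels = 1;
  double reach = static_cast<double>(base);
  while (reach < static_cast<double>(last_level_size)) {
    reach *= max_bytes_for_level_multiplier;
    ++num_levels;
  }
  double multiplier =
      num_levels > 1 ? std::pow(static_cast<double>(last_level_size) / base,
                                1.0 / (num_levels - 1))
                     : 1.0;
  std::vector<uint64_t> targets;
  double target = static_cast<double>(base);
  for (int i = 0; i < num_levels; ++i) {
    targets.push_back(static_cast<uint64_t>(target));
    target *= multiplier;
  }
  // With the numbers above (L0 = 100MB, last level = 1600MB, base = 80MB,
  // multiplier = 10) this yields 3 levels, a multiplier of 4, and targets
  // {100MB, 400MB, 1600MB}.
  return targets;
}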
Default: kCompactionStyleLevel @@ -502,13 +532,17 @@ struct AdvancedColumnFamilyOptions { CompactionPri compaction_pri = kByCompensatedSize; // The options needed to support Universal Style compactions + // + // Dynamically changeable through SetOptions() API + // Dynamic change example: + // SetOptions("compaction_options_universal", "{size_ratio=2;}") CompactionOptionsUniversal compaction_options_universal; // The options for FIFO compaction style // // Dynamically changeable through SetOptions() API // Dynamic change example: - // SetOption("compaction_options_fifo", "{max_table_files_size=100;ttl=2;}") + // SetOptions("compaction_options_fifo", "{max_table_files_size=100;ttl=2;}") CompactionOptionsFIFO compaction_options_fifo; // An iteration->Next() sequentially skips over keys with the same @@ -578,7 +612,10 @@ struct AdvancedColumnFamilyOptions { bool optimize_filters_for_hits = false; // After writing every SST file, reopen it and read all the keys. + // // Default: false + // + // Dynamically changeable through SetOptions() API bool paranoid_file_checks = false; // In debug mode, RocksDB run consistency checks on the LSM every time the LSM @@ -588,7 +625,10 @@ struct AdvancedColumnFamilyOptions { bool force_consistency_checks = false; // Measure IO stats in compactions and flushes, if true. + // // Default: false + // + // Dynamically changeable through SetOptions() API bool report_bg_io_stats = false; // Non-bottom-level files older than TTL will go through the compaction diff --git a/include/rocksdb/c.h b/include/rocksdb/c.h index 0899ed62559..cf46054aa34 100644 --- a/include/rocksdb/c.h +++ b/include/rocksdb/c.h @@ -1422,6 +1422,10 @@ extern ROCKSDB_LIBRARY_API const char* rocksdb_livefiles_smallestkey( const rocksdb_livefiles_t*, int index, size_t* size); extern ROCKSDB_LIBRARY_API const char* rocksdb_livefiles_largestkey( const rocksdb_livefiles_t*, int index, size_t* size); +extern ROCKSDB_LIBRARY_API uint64_t rocksdb_livefiles_entries( + const rocksdb_livefiles_t*, int index); +extern ROCKSDB_LIBRARY_API uint64_t rocksdb_livefiles_deletions( + const rocksdb_livefiles_t*, int index); extern ROCKSDB_LIBRARY_API void rocksdb_livefiles_destroy( const rocksdb_livefiles_t*); diff --git a/include/rocksdb/cache.h b/include/rocksdb/cache.h index da3b934d830..190112b37e8 100644 --- a/include/rocksdb/cache.h +++ b/include/rocksdb/cache.h @@ -25,6 +25,7 @@ #include #include #include +#include "rocksdb/memory_allocator.h" #include "rocksdb/slice.h" #include "rocksdb/statistics.h" #include "rocksdb/status.h" @@ -58,13 +59,24 @@ struct LRUCacheOptions { // BlockBasedTableOptions::cache_index_and_filter_blocks_with_high_priority. double high_pri_pool_ratio = 0.0; + // If non-nullptr will use this allocator instead of system allocator when + // allocating memory for cache blocks. Call this method before you start using + // the cache! + // + // Caveat: when the cache is used as block cache, the memory allocator is + // ignored when dealing with compression libraries that allocate memory + // internally (currently only XPRESS). 
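Editor's note: several options above are newly documented as dynamically changeable. As an illustrative sketch (not part of this diff), this is how such options can be changed on a live DB through the SetOptions() API referenced in those comments. The function name is hypothetical, `db` and `cf` are assumed to be an open rocksdb::DB and a column family handle, and the values are examples only.

#include <cassert>

#include "rocksdb/db.h"

void TuneCompactionOnline(rocksdb::DB* db, rocksdb::ColumnFamilyHandle* cf) {
  // Nested option structs take the brace syntax shown in the comments above.
  rocksdb::Status s = db->SetOptions(
      cf, {{"compaction_options_universal", "{size_ratio=2;}"},
           {"paranoid_file_checks", "true"}});
  assert(s.ok());
}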
+ std::shared_ptr memory_allocator; + LRUCacheOptions() {} LRUCacheOptions(size_t _capacity, int _num_shard_bits, - bool _strict_capacity_limit, double _high_pri_pool_ratio) + bool _strict_capacity_limit, double _high_pri_pool_ratio, + std::shared_ptr _memory_allocator = nullptr) : capacity(_capacity), num_shard_bits(_num_shard_bits), strict_capacity_limit(_strict_capacity_limit), - high_pri_pool_ratio(_high_pri_pool_ratio) {} + high_pri_pool_ratio(_high_pri_pool_ratio), + memory_allocator(std::move(_memory_allocator)) {} }; // Create a new cache with a fixed size capacity. The cache is sharded @@ -75,10 +87,10 @@ struct LRUCacheOptions { // high_pri_pool_pct. // num_shard_bits = -1 means it is automatically determined: every shard // will be at least 512KB and number of shard bits will not exceed 6. -extern std::shared_ptr NewLRUCache(size_t capacity, - int num_shard_bits = -1, - bool strict_capacity_limit = false, - double high_pri_pool_ratio = 0.0); +extern std::shared_ptr NewLRUCache( + size_t capacity, int num_shard_bits = -1, + bool strict_capacity_limit = false, double high_pri_pool_ratio = 0.0, + std::shared_ptr memory_allocator = nullptr); extern std::shared_ptr NewLRUCache(const LRUCacheOptions& cache_opts); @@ -97,7 +109,8 @@ class Cache { // likely to get evicted than low priority entries. enum class Priority { HIGH, LOW }; - Cache() {} + Cache(std::shared_ptr allocator = nullptr) + : memory_allocator_(std::move(allocator)) {} // Destroys all existing entries by calling the "deleter" // function that was passed via the Insert() function. @@ -228,10 +241,14 @@ class Cache { virtual void TEST_mark_as_data_block(const Slice& /*key*/, size_t /*charge*/) {} + MemoryAllocator* memory_allocator() const { return memory_allocator_.get(); } + private: // No copying allowed Cache(const Cache&); Cache& operator=(const Cache&); + + std::shared_ptr memory_allocator_; }; } // namespace rocksdb diff --git a/include/rocksdb/db.h b/include/rocksdb/db.h index f1430bce83f..6a37084c52e 100644 --- a/include/rocksdb/db.h +++ b/include/rocksdb/db.h @@ -287,16 +287,12 @@ class DB { // a non-OK status on error. It is not an error if no keys exist in the range // ["begin_key", "end_key"). // - // This feature is currently an experimental performance optimization for - // deleting very large ranges of contiguous keys. Invoking it many times or on - // small ranges may severely degrade read performance; in particular, the - // resulting performance can be worse than calling Delete() for each key in - // the range. Note also the degraded read performance affects keys outside the - // deleted ranges, and affects database operations involving scans, like flush - // and compaction. - // - // Consider setting ReadOptions::ignore_range_deletions = true to speed - // up reads for key(s) that are known to be unaffected by range deletions. + // This feature is now usable in production, with the following caveats: + // 1) Accumulating many range tombstones in the memtable will degrade read + // performance; this can be avoided by manually flushing occasionally. + // 2) Limiting the maximum number of open files in the presence of range + // tombstones can degrade read performance. To avoid this problem, set + // max_open_files to -1 whenever possible. virtual Status DeleteRange(const WriteOptions& options, ColumnFamilyHandle* column_family, const Slice& begin_key, const Slice& end_key); @@ -572,6 +568,11 @@ class DB { // log files that should be kept. 
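Editor's note: the updated DeleteRange() comment above lists two caveats. As an illustrative sketch (not part of this diff), the first caveat can be handled by flushing once enough range tombstones have accumulated. The helper name, the `tombstones_since_flush` counter, and the threshold of 1000 are assumptions for the example.

#include "rocksdb/db.h"

rocksdb::Status DropKeyRange(rocksdb::DB* db, rocksdb::ColumnFamilyHandle* cf,
                             const rocksdb::Slice& begin,
                             const rocksdb::Slice& end,
                             int* tombstones_since_flush) {
  rocksdb::Status s = db->DeleteRange(rocksdb::WriteOptions(), cf, begin, end);
  if (!s.ok()) {
    return s;
  }
  // Caveat 1: many range tombstones in the memtable degrade reads, so flush
  // occasionally. Caveat 2 is a configuration matter: prefer
  // max_open_files = -1 in DBOptions when range tombstones are common.
  if (++*tombstones_since_flush >= 1000) {
    s = db->Flush(rocksdb::FlushOptions(), cf);
    *tombstones_since_flush = 0;
  }
  return s;
}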
static const std::string kMinLogNumberToKeep; + // "rocksdb.min-obsolete-sst-number-to-keep" - return the minimum file + // number for an obsolete SST to be kept. The max value of `uint64_t` + // will be returned if all obsolete files can be deleted. + static const std::string kMinObsoleteSstNumberToKeep; + // "rocksdb.total-sst-files-size" - returns total size (bytes) of all SST // files. // WARNING: may slow down online queries if there are too many files. @@ -670,6 +671,7 @@ class DB { // "rocksdb.current-super-version-number" // "rocksdb.estimate-live-data-size" // "rocksdb.min-log-number-to-keep" + // "rocksdb.min-obsolete-sst-number-to-keep" // "rocksdb.total-sst-files-size" // "rocksdb.live-sst-files-size" // "rocksdb.base-level" @@ -900,11 +902,22 @@ class DB { virtual DBOptions GetDBOptions() const = 0; // Flush all mem-table data. + // Flush a single column family, even when atomic flush is enabled. To flush + // multiple column families, use Flush(options, column_families). virtual Status Flush(const FlushOptions& options, ColumnFamilyHandle* column_family) = 0; virtual Status Flush(const FlushOptions& options) { return Flush(options, DefaultColumnFamily()); } + // Flushes multiple column families. + // If atomic flush is not enabled, Flush(options, column_families) is + // equivalent to calling Flush(options, column_family) multiple times. + // If atomic flush is enabled, Flush(options, column_families) will flush all + // column families specified in 'column_families' up to the latest sequence + // number at the time when flush is requested. + virtual Status Flush( + const FlushOptions& options, + const std::vector& column_families) = 0; // Flush the WAL memory buffer to the file. If sync is true, it calls SyncWAL // afterwards. @@ -979,9 +992,9 @@ class DB { // cleared aggressively and the iterator might keep getting invalid before // an update is read. virtual Status GetUpdatesSince( - SequenceNumber seq_number, unique_ptr* iter, - const TransactionLogIterator::ReadOptions& - read_options = TransactionLogIterator::ReadOptions()) = 0; + SequenceNumber seq_number, std::unique_ptr* iter, + const TransactionLogIterator::ReadOptions& read_options = + TransactionLogIterator::ReadOptions()) = 0; // Windows API macro interference #undef DeleteFile diff --git a/include/rocksdb/env.h b/include/rocksdb/env.h index 7558364614d..bc439ac1c4c 100644 --- a/include/rocksdb/env.h +++ b/include/rocksdb/env.h @@ -137,9 +137,8 @@ class Env { // // The returned file will only be accessed by one thread at a time. virtual Status NewSequentialFile(const std::string& fname, - unique_ptr* result, - const EnvOptions& options) - = 0; + std::unique_ptr* result, + const EnvOptions& options) = 0; // Create a brand new random access read-only file with the // specified name. On success, stores a pointer to the new file in @@ -149,9 +148,8 @@ class Env { // // The returned file may be concurrently accessed by multiple threads. virtual Status NewRandomAccessFile(const std::string& fname, - unique_ptr* result, - const EnvOptions& options) - = 0; + std::unique_ptr* result, + const EnvOptions& options) = 0; // These values match Linux definition // https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/uapi/linux/fcntl.h#n56 enum WriteLifeTimeHint { @@ -171,7 +169,7 @@ class Env { // // The returned file will only be accessed by one thread at a time. 
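Editor's note: an illustrative sketch (not part of this diff) of the multi-column-family Flush() overload and the "rocksdb.min-obsolete-sst-number-to-keep" property documented above. `db`, `cf1`, and `cf2` are assumed to be an open DB and its column family handles; error handling is abbreviated.

#include <cstdint>
#include <vector>

#include "rocksdb/db.h"

void FlushBothAndInspect(rocksdb::DB* db, rocksdb::ColumnFamilyHandle* cf1,
                         rocksdb::ColumnFamilyHandle* cf2) {
  // With DBOptions::atomic_flush = true (added later in this diff), both
  // column families are flushed atomically up to a single sequence number;
  // otherwise this behaves like two independent Flush() calls.
  std::vector<rocksdb::ColumnFamilyHandle*> cfs = {cf1, cf2};
  rocksdb::Status s = db->Flush(rocksdb::FlushOptions(), cfs);
  if (!s.ok()) {
    return;
  }
  // The new property reports the minimum file number an obsolete SST must
  // have in order to be kept; it is the max uint64_t value when every
  // obsolete file can be deleted.
  uint64_t min_obsolete_sst = 0;
  db->GetIntProperty(rocksdb::DB::Properties::kMinObsoleteSstNumberToKeep,
                     &min_obsolete_sst);
}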
virtual Status NewWritableFile(const std::string& fname, - unique_ptr* result, + std::unique_ptr* result, const EnvOptions& options) = 0; // Create an object that writes to a new file with the specified @@ -182,7 +180,7 @@ class Env { // // The returned file will only be accessed by one thread at a time. virtual Status ReopenWritableFile(const std::string& /*fname*/, - unique_ptr* /*result*/, + std::unique_ptr* /*result*/, const EnvOptions& /*options*/) { return Status::NotSupported(); } @@ -190,7 +188,7 @@ class Env { // Reuse an existing file by renaming it and opening it as writable. virtual Status ReuseWritableFile(const std::string& fname, const std::string& old_fname, - unique_ptr* result, + std::unique_ptr* result, const EnvOptions& options); // Open `fname` for random read and write, if file doesn't exist the file @@ -199,7 +197,7 @@ class Env { // // The returned file will only be accessed by one thread at a time. virtual Status NewRandomRWFile(const std::string& /*fname*/, - unique_ptr* /*result*/, + std::unique_ptr* /*result*/, const EnvOptions& /*options*/) { return Status::NotSupported("RandomRWFile is not implemented in this Env"); } @@ -209,7 +207,7 @@ class Env { // file in `*result`. The file must exist prior to this call. virtual Status NewMemoryMappedFileBuffer( const std::string& /*fname*/, - unique_ptr* /*result*/) { + std::unique_ptr* /*result*/) { return Status::NotSupported( "MemoryMappedFileBuffer is not implemented in this Env"); } @@ -222,7 +220,7 @@ class Env { // *result and returns OK. On failure stores nullptr in *result and // returns non-OK. virtual Status NewDirectory(const std::string& name, - unique_ptr* result) = 0; + std::unique_ptr* result) = 0; // Returns OK if the named file exists. // NotFound if the named file does not exist, @@ -370,7 +368,7 @@ class Env { // Create and return a log file for storing informational messages. virtual Status NewLogger(const std::string& fname, - shared_ptr* result) = 0; + std::shared_ptr* result) = 0; // Returns the number of micro-seconds since some fixed point in time. // It is often used as system time such as in GenericRateLimiter @@ -942,24 +940,32 @@ class FileLock { void operator=(const FileLock&); }; -extern void LogFlush(const shared_ptr& info_log); +extern void LogFlush(const std::shared_ptr& info_log); extern void Log(const InfoLogLevel log_level, - const shared_ptr& info_log, const char* format, ...); + const std::shared_ptr& info_log, const char* format, + ...); // a set of log functions with different log levels. -extern void Header(const shared_ptr& info_log, const char* format, ...); -extern void Debug(const shared_ptr& info_log, const char* format, ...); -extern void Info(const shared_ptr& info_log, const char* format, ...); -extern void Warn(const shared_ptr& info_log, const char* format, ...); -extern void Error(const shared_ptr& info_log, const char* format, ...); -extern void Fatal(const shared_ptr& info_log, const char* format, ...); +extern void Header(const std::shared_ptr& info_log, const char* format, + ...); +extern void Debug(const std::shared_ptr& info_log, const char* format, + ...); +extern void Info(const std::shared_ptr& info_log, const char* format, + ...); +extern void Warn(const std::shared_ptr& info_log, const char* format, + ...); +extern void Error(const std::shared_ptr& info_log, const char* format, + ...); +extern void Fatal(const std::shared_ptr& info_log, const char* format, + ...); // Log the specified data to *info_log if info_log is non-nullptr. 
// The default info log level is InfoLogLevel::INFO_LEVEL. -extern void Log(const shared_ptr& info_log, const char* format, ...) +extern void Log(const std::shared_ptr& info_log, const char* format, + ...) # if defined(__GNUC__) || defined(__clang__) - __attribute__((__format__ (__printf__, 2, 3))) + __attribute__((__format__(__printf__, 2, 3))) # endif ; @@ -1005,37 +1011,38 @@ class EnvWrapper : public Env { Env* target() const { return target_; } // The following text is boilerplate that forwards all methods to target() - Status NewSequentialFile(const std::string& f, unique_ptr* r, + Status NewSequentialFile(const std::string& f, + std::unique_ptr* r, const EnvOptions& options) override { return target_->NewSequentialFile(f, r, options); } Status NewRandomAccessFile(const std::string& f, - unique_ptr* r, + std::unique_ptr* r, const EnvOptions& options) override { return target_->NewRandomAccessFile(f, r, options); } - Status NewWritableFile(const std::string& f, unique_ptr* r, + Status NewWritableFile(const std::string& f, std::unique_ptr* r, const EnvOptions& options) override { return target_->NewWritableFile(f, r, options); } Status ReopenWritableFile(const std::string& fname, - unique_ptr* result, + std::unique_ptr* result, const EnvOptions& options) override { return target_->ReopenWritableFile(fname, result, options); } Status ReuseWritableFile(const std::string& fname, const std::string& old_fname, - unique_ptr* r, + std::unique_ptr* r, const EnvOptions& options) override { return target_->ReuseWritableFile(fname, old_fname, r, options); } Status NewRandomRWFile(const std::string& fname, - unique_ptr* result, + std::unique_ptr* result, const EnvOptions& options) override { return target_->NewRandomRWFile(fname, result, options); } Status NewDirectory(const std::string& name, - unique_ptr* result) override { + std::unique_ptr* result) override { return target_->NewDirectory(name, result); } Status FileExists(const std::string& f) override { @@ -1113,7 +1120,7 @@ class EnvWrapper : public Env { return target_->GetTestDirectory(path); } Status NewLogger(const std::string& fname, - shared_ptr* result) override { + std::shared_ptr* result) override { return target_->NewLogger(fname, result); } uint64_t NowMicros() override { return target_->NowMicros(); } @@ -1224,36 +1231,57 @@ class WritableFileWrapper : public WritableFile { Status Sync() override { return target_->Sync(); } Status Fsync() override { return target_->Fsync(); } bool IsSyncThreadSafe() const override { return target_->IsSyncThreadSafe(); } + + bool use_direct_io() const override { return target_->use_direct_io(); } + + size_t GetRequiredBufferAlignment() const override { + return target_->GetRequiredBufferAlignment(); + } + void SetIOPriority(Env::IOPriority pri) override { target_->SetIOPriority(pri); } + Env::IOPriority GetIOPriority() override { return target_->GetIOPriority(); } + + void SetWriteLifeTimeHint(Env::WriteLifeTimeHint hint) override { + target_->SetWriteLifeTimeHint(hint); + } + + Env::WriteLifeTimeHint GetWriteLifeTimeHint() override { + return target_->GetWriteLifeTimeHint(); + } + uint64_t GetFileSize() override { return target_->GetFileSize(); } + + void SetPreallocationBlockSize(size_t size) override { + target_->SetPreallocationBlockSize(size); + } + void GetPreallocationStatus(size_t* block_size, size_t* last_allocated_block) override { target_->GetPreallocationStatus(block_size, last_allocated_block); } + size_t GetUniqueId(char* id, size_t max_size) const override { return 
target_->GetUniqueId(id, max_size); } + Status InvalidateCache(size_t offset, size_t length) override { return target_->InvalidateCache(offset, length); } - void SetPreallocationBlockSize(size_t size) override { - target_->SetPreallocationBlockSize(size); + Status RangeSync(uint64_t offset, uint64_t nbytes) override { + return target_->RangeSync(offset, nbytes); } + void PrepareWrite(size_t offset, size_t len) override { target_->PrepareWrite(offset, len); } - protected: Status Allocate(uint64_t offset, uint64_t len) override { return target_->Allocate(offset, len); } - Status RangeSync(uint64_t offset, uint64_t nbytes) override { - return target_->RangeSync(offset, nbytes); - } private: WritableFile* target_; diff --git a/include/rocksdb/env_encryption.h b/include/rocksdb/env_encryption.h index 70dce616a62..a6e91954656 100644 --- a/include/rocksdb/env_encryption.h +++ b/include/rocksdb/env_encryption.h @@ -142,8 +142,9 @@ class EncryptionProvider { // CreateCipherStream creates a block access cipher stream for a file given // given name and options. - virtual Status CreateCipherStream(const std::string& fname, const EnvOptions& options, - Slice& prefix, unique_ptr* result) = 0; + virtual Status CreateCipherStream( + const std::string& fname, const EnvOptions& options, Slice& prefix, + std::unique_ptr* result) = 0; }; // This encryption provider uses a CTR cipher stream, with a given block cipher @@ -174,10 +175,11 @@ class CTREncryptionProvider : public EncryptionProvider { // CreateCipherStream creates a block access cipher stream for a file given // given name and options. - virtual Status CreateCipherStream(const std::string& fname, const EnvOptions& options, - Slice& prefix, unique_ptr* result) override; + virtual Status CreateCipherStream( + const std::string& fname, const EnvOptions& options, Slice& prefix, + std::unique_ptr* result) override; - protected: + protected: // PopulateSecretPrefixPart initializes the data into a new prefix block // that will be encrypted. This function will store the data in plain text. // It will be encrypted later (before written to disk). @@ -187,8 +189,10 @@ class CTREncryptionProvider : public EncryptionProvider { // CreateCipherStreamFromPrefix creates a block access cipher stream for a file given // given name and options. The given prefix is already decrypted. - virtual Status CreateCipherStreamFromPrefix(const std::string& fname, const EnvOptions& options, - uint64_t initialCounter, const Slice& iv, const Slice& prefix, unique_ptr* result); + virtual Status CreateCipherStreamFromPrefix( + const std::string& fname, const EnvOptions& options, + uint64_t initialCounter, const Slice& iv, const Slice& prefix, + std::unique_ptr* result); }; } // namespace rocksdb diff --git a/include/rocksdb/filter_policy.h b/include/rocksdb/filter_policy.h index 4e1dc3bfc93..9c0904456f4 100644 --- a/include/rocksdb/filter_policy.h +++ b/include/rocksdb/filter_policy.h @@ -145,6 +145,6 @@ class FilterPolicy { // ignores trailing spaces, it would be incorrect to use a // FilterPolicy (like NewBloomFilterPolicy) that does not ignore // trailing spaces in keys. 
-extern const FilterPolicy* NewBloomFilterPolicy(int bits_per_key, - bool use_block_based_builder = true); +extern const FilterPolicy* NewBloomFilterPolicy( + int bits_per_key, bool use_block_based_builder = false); } diff --git a/include/rocksdb/listener.h b/include/rocksdb/listener.h index 46ce712dc5b..9b4e8a86664 100644 --- a/include/rocksdb/listener.h +++ b/include/rocksdb/listener.h @@ -4,6 +4,7 @@ #pragma once +#include #include #include #include @@ -143,6 +144,21 @@ struct TableFileDeletionInfo { Status status; }; +struct FileOperationInfo { + using TimePoint = std::chrono::time_point; + + const std::string& path; + uint64_t offset; + size_t length; + const TimePoint& start_timestamp; + const TimePoint& finish_timestamp; + Status status; + FileOperationInfo(const std::string& _path, const TimePoint& start, + const TimePoint& finish) + : path(_path), start_timestamp(start), finish_timestamp(finish) {} +}; + struct FlushJobInfo { // the name of the column family std::string cf_name; @@ -177,6 +193,8 @@ struct CompactionJobInfo { explicit CompactionJobInfo(const CompactionJobStats& _stats) : stats(_stats) {} + // the id of the column family where the compaction happened. + uint32_t cf_id; // the name of the column family where the compaction happened. std::string cf_name; // the status indicating whether the compaction was successful or not. @@ -297,6 +315,16 @@ class EventListener { // returned value. virtual void OnTableFileDeleted(const TableFileDeletionInfo& /*info*/) {} + // A callback function to RocksDB which will be called before + // RocksDB starts a compaction. The default implementation is + // a no-op. + // + // Note that this function must be implemented in a way such that + // it should not run for an extended period of time before the function + // returns. Otherwise, RocksDB may be blocked. + virtual void OnCompactionBegin(DB* /*db*/, + const CompactionJobInfo& /*ci*/) {} + // A callback function for RocksDB which will be called whenever // a registered RocksDB compacts a file. The default implementation // is a no-op. @@ -395,6 +423,18 @@ class EventListener { // returns. Otherwise, RocksDB may be blocked. virtual void OnStallConditionsChanged(const WriteStallInfo& /*info*/) {} + // A callback function for RocksDB which will be called whenever a file read + // operation finishes. + virtual void OnFileReadFinish(const FileOperationInfo& /* info */) {} + + // A callback function for RocksDB which will be called whenever a file write + // operation finishes. + virtual void OnFileWriteFinish(const FileOperationInfo& /* info */) {} + + // If true, OnFileReadFinish() and OnFileWriteFinish() will be called. If + // false, then they won't be called. + virtual bool ShouldBeNotifiedOnFileIO() { return false; } + // A callback function for RocksDB which will be called just before // starting the automatic recovery process for recoverable background // errors, such as NoSpace(). The callback can suppress the automatic diff --git a/include/rocksdb/memory_allocator.h b/include/rocksdb/memory_allocator.h new file mode 100644 index 00000000000..889c0e92182 --- /dev/null +++ b/include/rocksdb/memory_allocator.h @@ -0,0 +1,77 @@ +// Copyright (c) 2011-present, Facebook, Inc. All rights reserved. +// This source code is licensed under both the GPLv2 (found in the +// COPYING file in the root directory) and Apache 2.0 License +// (found in the LICENSE.Apache file in the root directory). 
+ +#pragma once + +#include "rocksdb/status.h" + +#include + +namespace rocksdb { + +// MemoryAllocator is an interface that a client can implement to supply custom +// memory allocation and deallocation methods. See rocksdb/cache.h for more +// information. +// All methods should be thread-safe. +class MemoryAllocator { + public: + virtual ~MemoryAllocator() = default; + + // Name of the cache allocator, printed in the log + virtual const char* Name() const = 0; + + // Allocate a block of at least size. Has to be thread-safe. + virtual void* Allocate(size_t size) = 0; + + // Deallocate previously allocated block. Has to be thread-safe. + virtual void Deallocate(void* p) = 0; + + // Returns the memory size of the block allocated at p. The default + // implementation, which just returns the original allocation_size, is fine. + virtual size_t UsableSize(void* /*p*/, size_t allocation_size) const { + // default implementation just returns the allocation size + return allocation_size; + } +}; + +struct JemallocAllocatorOptions { + // Jemalloc tcache caches allocations by size class. For each size class, + // it caches between 20 (for large size classes) and 200 (for small size + // classes) allocations. To reduce tcache memory usage in case the allocator + // is accessed by a large number of threads, we can control whether to cache + // an allocation by its size. + bool limit_tcache_size = false; + + // Lower bound of allocation size to use tcache, if limit_tcache_size=true. + // When used with block cache, it is recommended to set it to block_size/4. + size_t tcache_size_lower_bound = 1024; + + // Upper bound of allocation size to use tcache, if limit_tcache_size=true. + // When used with block cache, it is recommended to set it to block_size. + size_t tcache_size_upper_bound = 16 * 1024; +}; + +// Generates a memory allocator that allocates through Jemalloc and uses +// madvise() with MADV_DONTDUMP to exclude cache items from core dumps. +// Applications can use the allocator with block cache to exclude block cache +// usage from core dumps. +// +// Implementation details: +// The JemallocNodumpAllocator creates a dedicated jemalloc arena, and all +// allocations of the JemallocNodumpAllocator go through that arena. +// The memory allocator hooks the arena's memory allocation and calls +// madvise() with the MADV_DONTDUMP flag to exclude the allocated memory from +// core dumps. A side benefit of using a single arena is reduced jemalloc +// metadata for some workloads. + +// To mitigate mutex contention from using one single arena, jemalloc tcache +// (thread-local cache) is enabled to cache unused allocations for future use. +// The tcache normally incurs about 0.5MB of extra memory usage per thread. +// This usage can be reduced by limiting which allocation sizes are cached. 
+extern Status NewJemallocNodumpAllocator( + JemallocAllocatorOptions& options, + std::shared_ptr* memory_allocator); + +} // namespace rocksdb diff --git a/include/rocksdb/metadata.h b/include/rocksdb/metadata.h index a9773bf40c4..e62d4f40982 100644 --- a/include/rocksdb/metadata.h +++ b/include/rocksdb/metadata.h @@ -63,7 +63,10 @@ struct SstFileMetaData { smallestkey(""), largestkey(""), num_reads_sampled(0), - being_compacted(false) {} + being_compacted(false), + num_entries(0), + num_deletions(0) {} + SstFileMetaData(const std::string& _file_name, const std::string& _path, size_t _size, SequenceNumber _smallest_seqno, SequenceNumber _largest_seqno, @@ -78,7 +81,9 @@ struct SstFileMetaData { smallestkey(_smallestkey), largestkey(_largestkey), num_reads_sampled(_num_reads_sampled), - being_compacted(_being_compacted) {} + being_compacted(_being_compacted), + num_entries(0), + num_deletions(0) {} // File size in bytes. size_t size; @@ -93,11 +98,15 @@ struct SstFileMetaData { std::string largestkey; // Largest user defined key in the file. uint64_t num_reads_sampled; // How many times the file is read. bool being_compacted; // true if the file is currently being compacted. + + uint64_t num_entries; + uint64_t num_deletions; }; // The full set of metadata associated with each SST file. struct LiveFileMetaData : SstFileMetaData { std::string column_family_name; // Name of the column family int level; // Level at which this file resides. + LiveFileMetaData() : column_family_name(), level(0) {} }; } // namespace rocksdb diff --git a/include/rocksdb/options.h b/include/rocksdb/options.h index 0ed3ad91049..3ace2db2bad 100644 --- a/include/rocksdb/options.h +++ b/include/rocksdb/options.h @@ -188,8 +188,7 @@ struct ColumnFamilyOptions : public AdvancedColumnFamilyOptions { // Dynamically changeable through SetOptions() API size_t write_buffer_size = 64 << 20; - // Compress blocks using the specified compression algorithm. This - // parameter can be changed dynamically. + // Compress blocks using the specified compression algorithm. // // Default: kSnappyCompression, if it's supported. If snappy is not linked // with the library, the default is kNoCompression. @@ -212,6 +211,8 @@ struct ColumnFamilyOptions : public AdvancedColumnFamilyOptions { // - kZlibCompression: Z_DEFAULT_COMPRESSION (currently -1) // - kLZ4HCCompression: 0 // - For all others, we do not specify a compression level + // + // Dynamically changeable through SetOptions() API CompressionType compression; // Compression algorithm that will be used for the bottommost level that @@ -418,7 +419,10 @@ struct DBOptions { // files opened are always kept open. You can estimate number of files based // on target_file_size_base and target_file_size_multiplier for level-based // compaction. For universal-style compaction, you can usually set it to -1. + // // Default: -1 + // + // Dynamically changeable through SetDBOptions() API. int max_open_files = -1; // If max_open_files is -1, DB will open all files on DB::Open(). You can @@ -433,7 +437,10 @@ struct DBOptions { // [sum of all write_buffer_size * max_write_buffer_number] * 4 // This option takes effect only when there are more than one column family as // otherwise the wal size is dictated by the write_buffer_size. + // // Default: 0 + // + // Dynamically changeable through SetDBOptions() API. uint64_t max_total_wal_size = 0; // If non-null, then we should collect metrics about database operations @@ -494,13 +501,23 @@ struct DBOptions { // value is 6 hours. 
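Editor's note: an illustrative sketch (not part of this diff) that wires the jemalloc no-dump allocator declared above into the block cache through LRUCacheOptions (see cache.h earlier in this diff). The helper name, capacity, and tcache bounds are example values, and whether NewJemallocNodumpAllocator succeeds depends on how RocksDB was built.

#include <memory>

#include "rocksdb/cache.h"
#include "rocksdb/memory_allocator.h"

std::shared_ptr<rocksdb::Cache> MakeNodumpBlockCache() {
  rocksdb::JemallocAllocatorOptions jopts;
  jopts.limit_tcache_size = true;
  jopts.tcache_size_lower_bound = 1024;      // roughly block_size / 4
  jopts.tcache_size_upper_bound = 4 * 1024;  // roughly block_size

  std::shared_ptr<rocksdb::MemoryAllocator> allocator;
  rocksdb::Status s = rocksdb::NewJemallocNodumpAllocator(jopts, &allocator);

  rocksdb::LRUCacheOptions copts;
  copts.capacity = static_cast<size_t>(1) << 30;  // 1GB, example value
  if (s.ok()) {
    // Cache blocks now come from the dedicated jemalloc arena and are
    // excluded from core dumps via MADV_DONTDUMP.
    copts.memory_allocator = allocator;
  }
  return rocksdb::NewLRUCache(copts);
}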
The files that get out of scope by compaction // process will still get automatically delete on every compaction, // regardless of this setting + // + // Default: 6 hours + // + // Dynamically changeable through SetDBOptions() API. uint64_t delete_obsolete_files_period_micros = 6ULL * 60 * 60 * 1000000; // Maximum number of concurrent background jobs (compactions and flushes). + // + // Default: 2 + // + // Dynamically changeable through SetDBOptions() API. int max_background_jobs = 2; // NOT SUPPORTED ANYMORE: RocksDB automatically decides this based on the // value of max_background_jobs. This option is ignored. + // + // Dynamically changeable through SetDBOptions() API. int base_background_compactions = -1; // NOT SUPPORTED ANYMORE: RocksDB automatically decides this based on the @@ -515,7 +532,10 @@ struct DBOptions { // If you're increasing this, also consider increasing number of threads in // LOW priority thread pool. For more information, see // Env::SetBackgroundThreads + // // Default: -1 + // + // Dynamically changeable through SetDBOptions() API. int max_background_compactions = -1; // This value represents the maximum number of threads that will @@ -644,7 +664,10 @@ struct DBOptions { bool skip_log_error_on_recovery = false; // if not zero, dump rocksdb.stats to LOG every stats_dump_period_sec + // // Default: 600 (10 min) + // + // Dynamically changeable through SetDBOptions() API. unsigned int stats_dump_period_sec = 600; // If set true, will hint the underlying file system that the file @@ -711,6 +734,8 @@ struct DBOptions { // true. // // Default: 0 + // + // Dynamically changeable through SetDBOptions() API. size_t compaction_readahead_size = 0; // This is a maximum buffer size that is used by WinMmapReadableFile in @@ -737,6 +762,8 @@ struct DBOptions { // write requests if the logical sector size is unusual // // Default: 1024 * 1024 (1 MB) + // + // Dynamically changeable through SetDBOptions() API. size_t writable_file_max_buffer_size = 1024 * 1024; @@ -759,17 +786,23 @@ struct DBOptions { // to smooth out write I/Os over time. Users shouldn't rely on it for // persistency guarantee. // Issue one request for every bytes_per_sync written. 0 turns it off. - // Default: 0 // // You may consider using rate_limiter to regulate write rate to device. // When rate limiter is enabled, it automatically enables bytes_per_sync // to 1MB. // // This option applies to table files + // + // Default: 0, turned off + // + // Dynamically changeable through SetDBOptions() API. uint64_t bytes_per_sync = 0; // Same as bytes_per_sync, but applies to WAL files + // // Default: 0, turned off + // + // Dynamically changeable through SetDBOptions() API. uint64_t wal_bytes_per_sync = 0; // A vector of EventListeners which callback functions will be called @@ -796,6 +829,8 @@ struct DBOptions { // Unit: byte per second. // // Default: 0 + // + // Dynamically changeable through SetDBOptions() API. uint64_t delayed_write_rate = 0; // By default, a single write thread queue is maintained. The thread gets @@ -945,6 +980,20 @@ struct DBOptions { // relies on manual invocation of FlushWAL to write the WAL buffer to its // file. bool manual_wal_flush = false; + + // If true, RocksDB supports flushing multiple column families and committing + // their results atomically to MANIFEST. Note that it is not + // necessary to set atomic_flush to true if WAL is always enabled since WAL + // allows the database to be restored to the last persistent state in WAL. 
+ // This option is useful when there are column families with writes NOT + // protected by WAL. + // For manual flush, the application has to specify which column families to + // flush atomically in DB::Flush. + // For auto-triggered flush, RocksDB atomically flushes ALL column families. + // + // Currently, any WAL-enabled writes after atomic flush may be replayed + // independently if the process crashes later and tries to recover. + bool atomic_flush = false; }; // Options to control the behavior of a database (passed to DB::Open) @@ -1290,6 +1339,11 @@ struct IngestExternalFileOptions { bool write_global_seqno = true; }; -struct TraceOptions {}; +// TraceOptions is used for StartTrace +struct TraceOptions { + // To prevent the trace file from growing larger than the available storage + // space, the user can set the max trace file size in bytes. Default is 64GB + uint64_t max_trace_file_size = uint64_t{64} * 1024 * 1024 * 1024; +}; } // namespace rocksdb diff --git a/include/rocksdb/perf_context.h b/include/rocksdb/perf_context.h index d3771d3f082..3f125c21364 100644 --- a/include/rocksdb/perf_context.h +++ b/include/rocksdb/perf_context.h @@ -5,6 +5,7 @@ #pragma once +#include #include #include @@ -16,12 +17,44 @@ namespace rocksdb { // and transparently. // Use SetPerfLevel(PerfLevel::kEnableTime) to enable time stats. +// Break down performance counters by level and store per-level perf context in +// PerfContextByLevel +struct PerfContextByLevel { + // # of times bloom filter has avoided file reads, i.e., negatives. + uint64_t bloom_filter_useful = 0; + // # of times bloom FullFilter has not avoided the reads. + uint64_t bloom_filter_full_positive = 0; + // # of times bloom FullFilter has not avoided the reads and data actually + // exists. + uint64_t bloom_filter_full_true_positive = 0; + + // total number of user keys returned (only includes keys that are found; + // does not include keys that are deleted or merged without a final put) + uint64_t user_key_return_count; + + // total nanos spent on reading data from SST files + uint64_t get_from_table_nanos; + + void Reset(); // reset all performance counters to zero +}; + struct PerfContext { + ~PerfContext(); + void Reset(); // reset all performance counters to zero std::string ToString(bool exclude_zero_counters = false) const; + // enable per-level perf context and allocate storage for PerfContextByLevel + void EnablePerLevelPerfContext(); + + // temporarily disable per-level perf context by setting the flag to false + void DisablePerLevelPerfContext(); + + // free the space for PerfContextByLevel, also disable per-level perf context + void ClearPerLevelPerfContext(); + uint64_t user_key_comparison_count; // total number of user key comparisons uint64_t block_cache_hit_count; // total number of block cache hits uint64_t block_read_count; // total number of block reads (with IO) @@ -168,6 +201,8 @@ struct PerfContext { uint64_t env_lock_file_nanos; uint64_t env_unlock_file_nanos; uint64_t env_new_logger_nanos; + std::map* level_to_perf_context = nullptr; + bool per_level_perf_context_enabled = false; }; // Get Thread-local PerfContext object pointer diff --git a/include/rocksdb/sst_file_reader.h b/include/rocksdb/sst_file_reader.h new file mode 100644 index 00000000000..e58c84792e6 --- /dev/null +++ b/include/rocksdb/sst_file_reader.h @@ -0,0 +1,45 @@ +// Copyright (c) 2011-present, Facebook, Inc. All rights reserved. 
+// This source code is licensed under both the GPLv2 (found in the +// COPYING file in the root directory) and Apache 2.0 License +// (found in the LICENSE.Apache file in the root directory). + +#pragma once + +#ifndef ROCKSDB_LITE + +#include "rocksdb/slice.h" +#include "rocksdb/options.h" +#include "rocksdb/iterator.h" +#include "rocksdb/table_properties.h" + +namespace rocksdb { + +// SstFileReader is used to read sst files that are generated by DB or +// SstFileWriter. +class SstFileReader { + public: + SstFileReader(const Options& options); + + ~SstFileReader(); + + // Prepares to read from the file located at "file_path". + Status Open(const std::string& file_path); + + // Returns a new iterator over the table contents. + // Most read options provide the same control as we read from DB. + // If "snapshot" is nullptr, the iterator returns only the latest keys. + Iterator* NewIterator(const ReadOptions& options); + + std::shared_ptr GetTableProperties() const; + + // Verifies whether there is corruption in this table. + Status VerifyChecksum(); + + private: + struct Rep; + std::unique_ptr rep_; +}; + +} // namespace rocksdb + +#endif // !ROCKSDB_LITE diff --git a/include/rocksdb/statistics.h b/include/rocksdb/statistics.h index c493a18240c..14e6195faea 100644 --- a/include/rocksdb/statistics.h +++ b/include/rocksdb/statistics.h @@ -155,7 +155,8 @@ enum Tickers : uint32_t { // Disabled by default. To enable it set stats level to kAll DB_MUTEX_WAIT_MICROS, RATE_LIMIT_DELAY_MILLIS, - NO_ITERATORS, // number of iterators currently open + // DEPRECATED number of iterators currently open + NO_ITERATORS, // Number of MultiGet calls, keys read, and bytes read NUMBER_MULTIGET_CALLS, @@ -322,159 +323,15 @@ enum Tickers : uint32_t { // Number of keys actually found in MultiGet calls (vs number requested by caller) // NUMBER_MULTIGET_KEYS_READ gives the number requested by caller NUMBER_MULTIGET_KEYS_FOUND, + + NO_ITERATOR_CREATED, // number of iterators created + NO_ITERATOR_DELETED, // number of iterators deleted TICKER_ENUM_MAX }; // The order of items listed in Tickers should be the same as // the order listed in TickersNameMap -const std::vector> TickersNameMap = { - {BLOCK_CACHE_MISS, "rocksdb.block.cache.miss"}, - {BLOCK_CACHE_HIT, "rocksdb.block.cache.hit"}, - {BLOCK_CACHE_ADD, "rocksdb.block.cache.add"}, - {BLOCK_CACHE_ADD_FAILURES, "rocksdb.block.cache.add.failures"}, - {BLOCK_CACHE_INDEX_MISS, "rocksdb.block.cache.index.miss"}, - {BLOCK_CACHE_INDEX_HIT, "rocksdb.block.cache.index.hit"}, - {BLOCK_CACHE_INDEX_ADD, "rocksdb.block.cache.index.add"}, - {BLOCK_CACHE_INDEX_BYTES_INSERT, "rocksdb.block.cache.index.bytes.insert"}, - {BLOCK_CACHE_INDEX_BYTES_EVICT, "rocksdb.block.cache.index.bytes.evict"}, - {BLOCK_CACHE_FILTER_MISS, "rocksdb.block.cache.filter.miss"}, - {BLOCK_CACHE_FILTER_HIT, "rocksdb.block.cache.filter.hit"}, - {BLOCK_CACHE_FILTER_ADD, "rocksdb.block.cache.filter.add"}, - {BLOCK_CACHE_FILTER_BYTES_INSERT, - "rocksdb.block.cache.filter.bytes.insert"}, - {BLOCK_CACHE_FILTER_BYTES_EVICT, "rocksdb.block.cache.filter.bytes.evict"}, - {BLOCK_CACHE_DATA_MISS, "rocksdb.block.cache.data.miss"}, - {BLOCK_CACHE_DATA_HIT, "rocksdb.block.cache.data.hit"}, - {BLOCK_CACHE_DATA_ADD, "rocksdb.block.cache.data.add"}, - {BLOCK_CACHE_DATA_BYTES_INSERT, "rocksdb.block.cache.data.bytes.insert"}, - {BLOCK_CACHE_BYTES_READ, "rocksdb.block.cache.bytes.read"}, - {BLOCK_CACHE_BYTES_WRITE, "rocksdb.block.cache.bytes.write"}, - {BLOOM_FILTER_USEFUL, "rocksdb.bloom.filter.useful"}, - 
{BLOOM_FILTER_FULL_POSITIVE, "rocksdb.bloom.filter.full.positive"}, - {BLOOM_FILTER_FULL_TRUE_POSITIVE, - "rocksdb.bloom.filter.full.true.positive"}, - {PERSISTENT_CACHE_HIT, "rocksdb.persistent.cache.hit"}, - {PERSISTENT_CACHE_MISS, "rocksdb.persistent.cache.miss"}, - {SIM_BLOCK_CACHE_HIT, "rocksdb.sim.block.cache.hit"}, - {SIM_BLOCK_CACHE_MISS, "rocksdb.sim.block.cache.miss"}, - {MEMTABLE_HIT, "rocksdb.memtable.hit"}, - {MEMTABLE_MISS, "rocksdb.memtable.miss"}, - {GET_HIT_L0, "rocksdb.l0.hit"}, - {GET_HIT_L1, "rocksdb.l1.hit"}, - {GET_HIT_L2_AND_UP, "rocksdb.l2andup.hit"}, - {COMPACTION_KEY_DROP_NEWER_ENTRY, "rocksdb.compaction.key.drop.new"}, - {COMPACTION_KEY_DROP_OBSOLETE, "rocksdb.compaction.key.drop.obsolete"}, - {COMPACTION_KEY_DROP_RANGE_DEL, "rocksdb.compaction.key.drop.range_del"}, - {COMPACTION_KEY_DROP_USER, "rocksdb.compaction.key.drop.user"}, - {COMPACTION_RANGE_DEL_DROP_OBSOLETE, - "rocksdb.compaction.range_del.drop.obsolete"}, - {COMPACTION_OPTIMIZED_DEL_DROP_OBSOLETE, - "rocksdb.compaction.optimized.del.drop.obsolete"}, - {COMPACTION_CANCELLED, "rocksdb.compaction.cancelled"}, - {NUMBER_KEYS_WRITTEN, "rocksdb.number.keys.written"}, - {NUMBER_KEYS_READ, "rocksdb.number.keys.read"}, - {NUMBER_KEYS_UPDATED, "rocksdb.number.keys.updated"}, - {BYTES_WRITTEN, "rocksdb.bytes.written"}, - {BYTES_READ, "rocksdb.bytes.read"}, - {NUMBER_DB_SEEK, "rocksdb.number.db.seek"}, - {NUMBER_DB_NEXT, "rocksdb.number.db.next"}, - {NUMBER_DB_PREV, "rocksdb.number.db.prev"}, - {NUMBER_DB_SEEK_FOUND, "rocksdb.number.db.seek.found"}, - {NUMBER_DB_NEXT_FOUND, "rocksdb.number.db.next.found"}, - {NUMBER_DB_PREV_FOUND, "rocksdb.number.db.prev.found"}, - {ITER_BYTES_READ, "rocksdb.db.iter.bytes.read"}, - {NO_FILE_CLOSES, "rocksdb.no.file.closes"}, - {NO_FILE_OPENS, "rocksdb.no.file.opens"}, - {NO_FILE_ERRORS, "rocksdb.no.file.errors"}, - {STALL_L0_SLOWDOWN_MICROS, "rocksdb.l0.slowdown.micros"}, - {STALL_MEMTABLE_COMPACTION_MICROS, "rocksdb.memtable.compaction.micros"}, - {STALL_L0_NUM_FILES_MICROS, "rocksdb.l0.num.files.stall.micros"}, - {STALL_MICROS, "rocksdb.stall.micros"}, - {DB_MUTEX_WAIT_MICROS, "rocksdb.db.mutex.wait.micros"}, - {RATE_LIMIT_DELAY_MILLIS, "rocksdb.rate.limit.delay.millis"}, - {NO_ITERATORS, "rocksdb.num.iterators"}, - {NUMBER_MULTIGET_CALLS, "rocksdb.number.multiget.get"}, - {NUMBER_MULTIGET_KEYS_READ, "rocksdb.number.multiget.keys.read"}, - {NUMBER_MULTIGET_BYTES_READ, "rocksdb.number.multiget.bytes.read"}, - {NUMBER_FILTERED_DELETES, "rocksdb.number.deletes.filtered"}, - {NUMBER_MERGE_FAILURES, "rocksdb.number.merge.failures"}, - {BLOOM_FILTER_PREFIX_CHECKED, "rocksdb.bloom.filter.prefix.checked"}, - {BLOOM_FILTER_PREFIX_USEFUL, "rocksdb.bloom.filter.prefix.useful"}, - {NUMBER_OF_RESEEKS_IN_ITERATION, "rocksdb.number.reseeks.iteration"}, - {GET_UPDATES_SINCE_CALLS, "rocksdb.getupdatessince.calls"}, - {BLOCK_CACHE_COMPRESSED_MISS, "rocksdb.block.cachecompressed.miss"}, - {BLOCK_CACHE_COMPRESSED_HIT, "rocksdb.block.cachecompressed.hit"}, - {BLOCK_CACHE_COMPRESSED_ADD, "rocksdb.block.cachecompressed.add"}, - {BLOCK_CACHE_COMPRESSED_ADD_FAILURES, - "rocksdb.block.cachecompressed.add.failures"}, - {WAL_FILE_SYNCED, "rocksdb.wal.synced"}, - {WAL_FILE_BYTES, "rocksdb.wal.bytes"}, - {WRITE_DONE_BY_SELF, "rocksdb.write.self"}, - {WRITE_DONE_BY_OTHER, "rocksdb.write.other"}, - {WRITE_TIMEDOUT, "rocksdb.write.timeout"}, - {WRITE_WITH_WAL, "rocksdb.write.wal"}, - {COMPACT_READ_BYTES, "rocksdb.compact.read.bytes"}, - {COMPACT_WRITE_BYTES, "rocksdb.compact.write.bytes"}, - 
{FLUSH_WRITE_BYTES, "rocksdb.flush.write.bytes"}, - {NUMBER_DIRECT_LOAD_TABLE_PROPERTIES, - "rocksdb.number.direct.load.table.properties"}, - {NUMBER_SUPERVERSION_ACQUIRES, "rocksdb.number.superversion_acquires"}, - {NUMBER_SUPERVERSION_RELEASES, "rocksdb.number.superversion_releases"}, - {NUMBER_SUPERVERSION_CLEANUPS, "rocksdb.number.superversion_cleanups"}, - {NUMBER_BLOCK_COMPRESSED, "rocksdb.number.block.compressed"}, - {NUMBER_BLOCK_DECOMPRESSED, "rocksdb.number.block.decompressed"}, - {NUMBER_BLOCK_NOT_COMPRESSED, "rocksdb.number.block.not_compressed"}, - {MERGE_OPERATION_TOTAL_TIME, "rocksdb.merge.operation.time.nanos"}, - {FILTER_OPERATION_TOTAL_TIME, "rocksdb.filter.operation.time.nanos"}, - {ROW_CACHE_HIT, "rocksdb.row.cache.hit"}, - {ROW_CACHE_MISS, "rocksdb.row.cache.miss"}, - {READ_AMP_ESTIMATE_USEFUL_BYTES, "rocksdb.read.amp.estimate.useful.bytes"}, - {READ_AMP_TOTAL_READ_BYTES, "rocksdb.read.amp.total.read.bytes"}, - {NUMBER_RATE_LIMITER_DRAINS, "rocksdb.number.rate_limiter.drains"}, - {NUMBER_ITER_SKIP, "rocksdb.number.iter.skip"}, - {BLOB_DB_NUM_PUT, "rocksdb.blobdb.num.put"}, - {BLOB_DB_NUM_WRITE, "rocksdb.blobdb.num.write"}, - {BLOB_DB_NUM_GET, "rocksdb.blobdb.num.get"}, - {BLOB_DB_NUM_MULTIGET, "rocksdb.blobdb.num.multiget"}, - {BLOB_DB_NUM_SEEK, "rocksdb.blobdb.num.seek"}, - {BLOB_DB_NUM_NEXT, "rocksdb.blobdb.num.next"}, - {BLOB_DB_NUM_PREV, "rocksdb.blobdb.num.prev"}, - {BLOB_DB_NUM_KEYS_WRITTEN, "rocksdb.blobdb.num.keys.written"}, - {BLOB_DB_NUM_KEYS_READ, "rocksdb.blobdb.num.keys.read"}, - {BLOB_DB_BYTES_WRITTEN, "rocksdb.blobdb.bytes.written"}, - {BLOB_DB_BYTES_READ, "rocksdb.blobdb.bytes.read"}, - {BLOB_DB_WRITE_INLINED, "rocksdb.blobdb.write.inlined"}, - {BLOB_DB_WRITE_INLINED_TTL, "rocksdb.blobdb.write.inlined.ttl"}, - {BLOB_DB_WRITE_BLOB, "rocksdb.blobdb.write.blob"}, - {BLOB_DB_WRITE_BLOB_TTL, "rocksdb.blobdb.write.blob.ttl"}, - {BLOB_DB_BLOB_FILE_BYTES_WRITTEN, "rocksdb.blobdb.blob.file.bytes.written"}, - {BLOB_DB_BLOB_FILE_BYTES_READ, "rocksdb.blobdb.blob.file.bytes.read"}, - {BLOB_DB_BLOB_FILE_SYNCED, "rocksdb.blobdb.blob.file.synced"}, - {BLOB_DB_BLOB_INDEX_EXPIRED_COUNT, - "rocksdb.blobdb.blob.index.expired.count"}, - {BLOB_DB_BLOB_INDEX_EXPIRED_SIZE, "rocksdb.blobdb.blob.index.expired.size"}, - {BLOB_DB_BLOB_INDEX_EVICTED_COUNT, - "rocksdb.blobdb.blob.index.evicted.count"}, - {BLOB_DB_BLOB_INDEX_EVICTED_SIZE, "rocksdb.blobdb.blob.index.evicted.size"}, - {BLOB_DB_GC_NUM_FILES, "rocksdb.blobdb.gc.num.files"}, - {BLOB_DB_GC_NUM_NEW_FILES, "rocksdb.blobdb.gc.num.new.files"}, - {BLOB_DB_GC_FAILURES, "rocksdb.blobdb.gc.failures"}, - {BLOB_DB_GC_NUM_KEYS_OVERWRITTEN, "rocksdb.blobdb.gc.num.keys.overwritten"}, - {BLOB_DB_GC_NUM_KEYS_EXPIRED, "rocksdb.blobdb.gc.num.keys.expired"}, - {BLOB_DB_GC_NUM_KEYS_RELOCATED, "rocksdb.blobdb.gc.num.keys.relocated"}, - {BLOB_DB_GC_BYTES_OVERWRITTEN, "rocksdb.blobdb.gc.bytes.overwritten"}, - {BLOB_DB_GC_BYTES_EXPIRED, "rocksdb.blobdb.gc.bytes.expired"}, - {BLOB_DB_GC_BYTES_RELOCATED, "rocksdb.blobdb.gc.bytes.relocated"}, - {BLOB_DB_FIFO_NUM_FILES_EVICTED, "rocksdb.blobdb.fifo.num.files.evicted"}, - {BLOB_DB_FIFO_NUM_KEYS_EVICTED, "rocksdb.blobdb.fifo.num.keys.evicted"}, - {BLOB_DB_FIFO_BYTES_EVICTED, "rocksdb.blobdb.fifo.bytes.evicted"}, - {TXN_PREPARE_MUTEX_OVERHEAD, "rocksdb.txn.overhead.mutex.prepare"}, - {TXN_OLD_COMMIT_MAP_MUTEX_OVERHEAD, - "rocksdb.txn.overhead.mutex.old.commit.map"}, - {TXN_DUPLICATE_KEY_OVERHEAD, "rocksdb.txn.overhead.duplicate.key"}, - {TXN_SNAPSHOT_MUTEX_OVERHEAD, 
"rocksdb.txn.overhead.mutex.snapshot"}, - {NUMBER_MULTIGET_KEYS_FOUND, "rocksdb.number.multiget.keys.found"}, -}; +extern const std::vector> TickersNameMap; /** * Keep adding histogram's here. @@ -557,57 +414,10 @@ enum Histograms : uint32_t { // Time spent flushing memtable to disk FLUSH_TIME, - HISTOGRAM_ENUM_MAX, // TODO(ldemailly): enforce HistogramsNameMap match + HISTOGRAM_ENUM_MAX, }; -const std::vector> HistogramsNameMap = { - {DB_GET, "rocksdb.db.get.micros"}, - {DB_WRITE, "rocksdb.db.write.micros"}, - {COMPACTION_TIME, "rocksdb.compaction.times.micros"}, - {SUBCOMPACTION_SETUP_TIME, "rocksdb.subcompaction.setup.times.micros"}, - {TABLE_SYNC_MICROS, "rocksdb.table.sync.micros"}, - {COMPACTION_OUTFILE_SYNC_MICROS, "rocksdb.compaction.outfile.sync.micros"}, - {WAL_FILE_SYNC_MICROS, "rocksdb.wal.file.sync.micros"}, - {MANIFEST_FILE_SYNC_MICROS, "rocksdb.manifest.file.sync.micros"}, - {TABLE_OPEN_IO_MICROS, "rocksdb.table.open.io.micros"}, - {DB_MULTIGET, "rocksdb.db.multiget.micros"}, - {READ_BLOCK_COMPACTION_MICROS, "rocksdb.read.block.compaction.micros"}, - {READ_BLOCK_GET_MICROS, "rocksdb.read.block.get.micros"}, - {WRITE_RAW_BLOCK_MICROS, "rocksdb.write.raw.block.micros"}, - {STALL_L0_SLOWDOWN_COUNT, "rocksdb.l0.slowdown.count"}, - {STALL_MEMTABLE_COMPACTION_COUNT, "rocksdb.memtable.compaction.count"}, - {STALL_L0_NUM_FILES_COUNT, "rocksdb.num.files.stall.count"}, - {HARD_RATE_LIMIT_DELAY_COUNT, "rocksdb.hard.rate.limit.delay.count"}, - {SOFT_RATE_LIMIT_DELAY_COUNT, "rocksdb.soft.rate.limit.delay.count"}, - {NUM_FILES_IN_SINGLE_COMPACTION, "rocksdb.numfiles.in.singlecompaction"}, - {DB_SEEK, "rocksdb.db.seek.micros"}, - {WRITE_STALL, "rocksdb.db.write.stall"}, - {SST_READ_MICROS, "rocksdb.sst.read.micros"}, - {NUM_SUBCOMPACTIONS_SCHEDULED, "rocksdb.num.subcompactions.scheduled"}, - {BYTES_PER_READ, "rocksdb.bytes.per.read"}, - {BYTES_PER_WRITE, "rocksdb.bytes.per.write"}, - {BYTES_PER_MULTIGET, "rocksdb.bytes.per.multiget"}, - {BYTES_COMPRESSED, "rocksdb.bytes.compressed"}, - {BYTES_DECOMPRESSED, "rocksdb.bytes.decompressed"}, - {COMPRESSION_TIMES_NANOS, "rocksdb.compression.times.nanos"}, - {DECOMPRESSION_TIMES_NANOS, "rocksdb.decompression.times.nanos"}, - {READ_NUM_MERGE_OPERANDS, "rocksdb.read.num.merge_operands"}, - {BLOB_DB_KEY_SIZE, "rocksdb.blobdb.key.size"}, - {BLOB_DB_VALUE_SIZE, "rocksdb.blobdb.value.size"}, - {BLOB_DB_WRITE_MICROS, "rocksdb.blobdb.write.micros"}, - {BLOB_DB_GET_MICROS, "rocksdb.blobdb.get.micros"}, - {BLOB_DB_MULTIGET_MICROS, "rocksdb.blobdb.multiget.micros"}, - {BLOB_DB_SEEK_MICROS, "rocksdb.blobdb.seek.micros"}, - {BLOB_DB_NEXT_MICROS, "rocksdb.blobdb.next.micros"}, - {BLOB_DB_PREV_MICROS, "rocksdb.blobdb.prev.micros"}, - {BLOB_DB_BLOB_FILE_WRITE_MICROS, "rocksdb.blobdb.blob.file.write.micros"}, - {BLOB_DB_BLOB_FILE_READ_MICROS, "rocksdb.blobdb.blob.file.read.micros"}, - {BLOB_DB_BLOB_FILE_SYNC_MICROS, "rocksdb.blobdb.blob.file.sync.micros"}, - {BLOB_DB_GC_MICROS, "rocksdb.blobdb.gc.micros"}, - {BLOB_DB_COMPRESSION_MICROS, "rocksdb.blobdb.compression.micros"}, - {BLOB_DB_DECOMPRESSION_MICROS, "rocksdb.blobdb.decompression.micros"}, - {FLUSH_TIME, "rocksdb.db.flush.micros"}, -}; +extern const std::vector> HistogramsNameMap; struct HistogramData { double median; diff --git a/include/rocksdb/table.h b/include/rocksdb/table.h index a177d1c7ae1..a99c8bf6e72 100644 --- a/include/rocksdb/table.h +++ b/include/rocksdb/table.h @@ -47,6 +47,7 @@ enum ChecksumType : char { kNoChecksum = 0x0, kCRC32c = 0x1, kxxHash = 0x2, + kxxHash64 = 0x3, }; // For advanced 
user only @@ -137,6 +138,8 @@ struct BlockBasedTableOptions { // If non-NULL use the specified cache for compressed blocks. // If NULL, rocksdb will not use a compressed block cache. + // Note: though it looks similar to `block_cache`, RocksDB doesn't put the + // same type of object there. std::shared_ptr block_cache_compressed = nullptr; // Approximate size of user data packed per block. Note that the @@ -449,7 +452,7 @@ class TableFactory { // NewTableReader() is called in three places: // (1) TableCache::FindTable() calls the function when table cache miss // and cache the table object returned. - // (2) SstFileReader (for SST Dump) opens the table and dump the table + // (2) SstFileDumper (for SST Dump) opens the table and dump the table // contents using the iterator of the table. // (3) DBImpl::IngestExternalFile() calls this function to read the contents of // the sst file it's attempting to add @@ -461,8 +464,8 @@ class TableFactory { // table_reader is the output table reader. virtual Status NewTableReader( const TableReaderOptions& table_reader_options, - unique_ptr&& file, uint64_t file_size, - unique_ptr* table_reader, + std::unique_ptr&& file, uint64_t file_size, + std::unique_ptr* table_reader, bool prefetch_index_and_filter_in_cache = true) const = 0; // Return a table builder to write to a file for this table type. diff --git a/include/rocksdb/table_properties.h b/include/rocksdb/table_properties.h index d545e455ffc..75c180ff4fc 100644 --- a/include/rocksdb/table_properties.h +++ b/include/rocksdb/table_properties.h @@ -40,6 +40,8 @@ struct TablePropertiesNames { static const std::string kRawValueSize; static const std::string kNumDataBlocks; static const std::string kNumEntries; + static const std::string kDeletedKeys; + static const std::string kMergeOperands; static const std::string kNumRangeDeletions; static const std::string kFormatVersion; static const std::string kFixedKeyLen; @@ -152,6 +154,10 @@ struct TableProperties { uint64_t num_data_blocks = 0; // the number of entries in this table uint64_t num_entries = 0; + // the number of deletions in the table + uint64_t num_deletions = 0; + // the number of merge operands in the table + uint64_t num_merge_operands = 0; // the number of range deletions in this table uint64_t num_range_deletions = 0; // format version, reserved for backward compatibility @@ -216,6 +222,10 @@ struct TableProperties { // Below is a list of non-basic properties that are collected by database // itself. Especially some properties regarding to the internal keys (which // is unknown to `table`). +// +// DEPRECATED: these properties now belong as TableProperties members. Please +// use TableProperties::num_deletions and TableProperties::num_merge_operands, +// respectively. 
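Editor's note: an illustrative sketch (not part of this diff) that reads the new num_deletions counter directly from TableProperties, here via the SstFileReader introduced earlier in this diff, instead of the now-deprecated GetDeletedKeys() helper. The function name, file path handling, and output format are assumptions for the example.

#include <iostream>
#include <memory>
#include <string>

#include "rocksdb/options.h"
#include "rocksdb/sst_file_reader.h"

void DumpSstSummary(const std::string& path) {
  rocksdb::Options options;
  rocksdb::SstFileReader reader(options);
  if (!reader.Open(path).ok() || !reader.VerifyChecksum().ok()) {
    return;
  }
  // Iterate the file contents like a regular DB iterator.
  std::unique_ptr<rocksdb::Iterator> it(
      reader.NewIterator(rocksdb::ReadOptions()));
  for (it->SeekToFirst(); it->Valid(); it->Next()) {
    std::cout << it->key().ToString() << " => " << it->value().ToString()
              << "\n";
  }
  // Table-level counters, including the new num_deletions member.
  auto props = reader.GetTableProperties();
  std::cout << props->num_entries << " entries, " << props->num_deletions
            << " deletions\n";
}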
extern uint64_t GetDeletedKeys(const UserCollectedProperties& props); extern uint64_t GetMergeOperands(const UserCollectedProperties& props, bool* property_present); diff --git a/include/rocksdb/trace_reader_writer.h b/include/rocksdb/trace_reader_writer.h index 31226487b85..28919a0fadc 100644 --- a/include/rocksdb/trace_reader_writer.h +++ b/include/rocksdb/trace_reader_writer.h @@ -24,6 +24,7 @@ class TraceWriter { virtual Status Write(const Slice& data) = 0; virtual Status Close() = 0; + virtual uint64_t GetFileSize() = 0; }; // TraceReader allows reading RocksDB traces from any system, one operation at diff --git a/include/rocksdb/transaction_log.h b/include/rocksdb/transaction_log.h index 1d8ef918612..cf80a633f1c 100644 --- a/include/rocksdb/transaction_log.h +++ b/include/rocksdb/transaction_log.h @@ -60,7 +60,7 @@ struct BatchResult { // Add empty __ctor and __dtor for the rule of five // However, preserve the original semantics and prohibit copying - // as the unique_ptr member does not copy. + // as the std::unique_ptr member does not copy. BatchResult() {} ~BatchResult() {} diff --git a/include/rocksdb/utilities/env_mirror.h b/include/rocksdb/utilities/env_mirror.h index bc27cdc4884..40e9411ffae 100644 --- a/include/rocksdb/utilities/env_mirror.h +++ b/include/rocksdb/utilities/env_mirror.h @@ -48,20 +48,21 @@ class EnvMirror : public EnvWrapper { delete b_; } - Status NewSequentialFile(const std::string& f, unique_ptr* r, + Status NewSequentialFile(const std::string& f, + std::unique_ptr* r, const EnvOptions& options) override; Status NewRandomAccessFile(const std::string& f, - unique_ptr* r, + std::unique_ptr* r, const EnvOptions& options) override; - Status NewWritableFile(const std::string& f, unique_ptr* r, + Status NewWritableFile(const std::string& f, std::unique_ptr* r, const EnvOptions& options) override; Status ReuseWritableFile(const std::string& fname, const std::string& old_fname, - unique_ptr* r, + std::unique_ptr* r, const EnvOptions& options) override; virtual Status NewDirectory(const std::string& name, - unique_ptr* result) override { - unique_ptr br; + std::unique_ptr* result) override { + std::unique_ptr br; Status as = a_->NewDirectory(name, result); Status bs = b_->NewDirectory(name, &br); assert(as == bs); diff --git a/include/rocksdb/utilities/object_registry.h b/include/rocksdb/utilities/object_registry.h index b046ba7c1f5..86a51b92ead 100644 --- a/include/rocksdb/utilities/object_registry.h +++ b/include/rocksdb/utilities/object_registry.h @@ -27,8 +27,8 @@ namespace rocksdb { template T* NewCustomObject(const std::string& target, std::unique_ptr* res_guard); -// Returns a new T when called with a string. Populates the unique_ptr argument -// if granting ownership to caller. +// Returns a new T when called with a string. Populates the std::unique_ptr +// argument if granting ownership to caller. 
template using FactoryFunc = std::function*)>; diff --git a/include/rocksdb/utilities/stackable_db.h b/include/rocksdb/utilities/stackable_db.h index 721203f7ce4..eae3a85ea1f 100644 --- a/include/rocksdb/utilities/stackable_db.h +++ b/include/rocksdb/utilities/stackable_db.h @@ -278,6 +278,11 @@ class StackableDB : public DB { ColumnFamilyHandle* column_family) override { return db_->Flush(fopts, column_family); } + virtual Status Flush( + const FlushOptions& fopts, + const std::vector& column_families) override { + return db_->Flush(fopts, column_families); + } virtual Status SyncWAL() override { return db_->SyncWAL(); @@ -364,7 +369,7 @@ class StackableDB : public DB { } virtual Status GetUpdatesSince( - SequenceNumber seq_number, unique_ptr* iter, + SequenceNumber seq_number, std::unique_ptr* iter, const TransactionLogIterator::ReadOptions& read_options) override { return db_->GetUpdatesSince(seq_number, iter, read_options); } diff --git a/include/rocksdb/utilities/transaction.h b/include/rocksdb/utilities/transaction.h index 86627d4f458..c1e2441bc37 100644 --- a/include/rocksdb/utilities/transaction.h +++ b/include/rocksdb/utilities/transaction.h @@ -239,14 +239,15 @@ class Transaction { // An overload of the above method that receives a PinnableSlice // For backward compatibility a default implementation is provided virtual Status GetForUpdate(const ReadOptions& options, - ColumnFamilyHandle* /*column_family*/, + ColumnFamilyHandle* column_family, const Slice& key, PinnableSlice* pinnable_val, - bool /*exclusive*/ = true) { + bool exclusive = true) { if (pinnable_val == nullptr) { std::string* null_str = nullptr; - return GetForUpdate(options, key, null_str); + return GetForUpdate(options, column_family, key, null_str, exclusive); } else { - auto s = GetForUpdate(options, key, pinnable_val->GetSelf()); + auto s = GetForUpdate(options, column_family, key, + pinnable_val->GetSelf(), exclusive); pinnable_val->PinSelf(); return s; } diff --git a/include/rocksdb/utilities/transaction_db.h b/include/rocksdb/utilities/transaction_db.h index 3d7bc355a37..1a692f2a7ac 100644 --- a/include/rocksdb/utilities/transaction_db.h +++ b/include/rocksdb/utilities/transaction_db.h @@ -171,8 +171,8 @@ struct KeyLockInfo { struct DeadlockInfo { TransactionID m_txn_id; uint32_t m_cf_id; - std::string m_waiting_key; bool m_exclusive; + std::string m_waiting_key; }; struct DeadlockPath { diff --git a/include/rocksdb/version.h b/include/rocksdb/version.h index c24ba1d3902..24cef677f11 100644 --- a/include/rocksdb/version.h +++ b/include/rocksdb/version.h @@ -5,8 +5,8 @@ #pragma once #define ROCKSDB_MAJOR 5 -#define ROCKSDB_MINOR 17 -#define ROCKSDB_PATCH 2 +#define ROCKSDB_MINOR 18 +#define ROCKSDB_PATCH 3 // Do not use these. We made the mistake of declaring macros starting with // double underscore. Now we have to live with our choice. 
We'll deprecate these diff --git a/include/rocksdb/write_buffer_manager.h b/include/rocksdb/write_buffer_manager.h index 856cf4b2463..dea904c187e 100644 --- a/include/rocksdb/write_buffer_manager.h +++ b/include/rocksdb/write_buffer_manager.h @@ -30,6 +30,8 @@ class WriteBufferManager { bool enabled() const { return buffer_size_ != 0; } + bool cost_to_cache() const { return cache_rep_ != nullptr; } + // Only valid if enabled() size_t memory_usage() const { return memory_used_.load(std::memory_order_relaxed); diff --git a/java/CMakeLists.txt b/java/CMakeLists.txt index 96c08b23189..8f4ec9a568a 100644 --- a/java/CMakeLists.txt +++ b/java/CMakeLists.txt @@ -25,6 +25,7 @@ set(JNI_NATIVE_SOURCES rocksjni/jnicallback.cc rocksjni/loggerjnicallback.cc rocksjni/lru_cache.cc + rocksjni/memory_util.cc rocksjni/memtablejni.cc rocksjni/merge_operator.cc rocksjni/native_comparator_wrapper_test.cc @@ -57,6 +58,7 @@ set(JNI_NATIVE_SOURCES rocksjni/writebatchhandlerjnicallback.cc rocksjni/write_batch_test.cc rocksjni/write_batch_with_index.cc + rocksjni/write_buffer_manager.cc ) set(NATIVE_JAVA_CLASSES @@ -96,6 +98,7 @@ set(NATIVE_JAVA_CLASSES org.rocksdb.IngestExternalFileOptions org.rocksdb.Logger org.rocksdb.LRUCache + org.rocksdb.MemoryUtil org.rocksdb.MemTableConfig org.rocksdb.NativeComparatorWrapper org.rocksdb.NativeLibraryLoader @@ -130,6 +133,7 @@ set(NATIVE_JAVA_CLASSES org.rocksdb.TransactionLogIterator org.rocksdb.TransactionOptions org.rocksdb.TtlDB + org.rocksdb.UInt64AddOperator org.rocksdb.VectorMemTableConfig org.rocksdb.WBWIRocksIterator org.rocksdb.WriteBatch @@ -142,6 +146,7 @@ set(NATIVE_JAVA_CLASSES org.rocksdb.SnapshotTest org.rocksdb.WriteBatchTest org.rocksdb.WriteBatchTestInternalHelper + org.rocksdb.WriteBufferManager ) include(FindJava) @@ -222,6 +227,8 @@ add_jar( src/main/java/org/rocksdb/IngestExternalFileOptions.java src/main/java/org/rocksdb/Logger.java src/main/java/org/rocksdb/LRUCache.java + src/main/java/org/rocksdb/MemoryUsageType.java + src/main/java/org/rocksdb/MemoryUtil.java src/main/java/org/rocksdb/MemTableConfig.java src/main/java/org/rocksdb/MergeOperator.java src/main/java/org/rocksdb/MutableColumnFamilyOptionsInterface.java @@ -278,6 +285,7 @@ add_jar( src/main/java/org/rocksdb/WriteBatch.java src/main/java/org/rocksdb/WriteBatchWithIndex.java src/main/java/org/rocksdb/WriteOptions.java + src/main/java/org/rocksdb/WriteBufferManager.java src/main/java/org/rocksdb/util/BytewiseComparator.java src/main/java/org/rocksdb/util/DirectBytewiseComparator.java src/main/java/org/rocksdb/util/Environment.java @@ -290,6 +298,7 @@ add_jar( src/test/java/org/rocksdb/RocksDBExceptionTest.java src/test/java/org/rocksdb/RocksMemoryResource.java src/test/java/org/rocksdb/SnapshotTest.java + src/main/java/org/rocksdb/UInt64AddOperator.java src/test/java/org/rocksdb/WriteBatchTest.java src/test/java/org/rocksdb/util/CapturingWriteBatchHandler.java src/test/java/org/rocksdb/util/WriteBatchGetter.java diff --git a/java/Makefile b/java/Makefile index f58fff06e50..b3b89eb8372 100644 --- a/java/Makefile +++ b/java/Makefile @@ -30,6 +30,8 @@ NATIVE_JAVA_CLASSES = org.rocksdb.AbstractCompactionFilter\ org.rocksdb.HashSkipListMemTableConfig\ org.rocksdb.Logger\ org.rocksdb.LRUCache\ + org.rocksdb.MemoryUsageType\ + org.rocksdb.MemoryUtil\ org.rocksdb.MergeOperator\ org.rocksdb.NativeComparatorWrapper\ org.rocksdb.OptimisticTransactionDB\ @@ -60,10 +62,12 @@ NATIVE_JAVA_CLASSES = org.rocksdb.AbstractCompactionFilter\ org.rocksdb.VectorMemTableConfig\ org.rocksdb.Snapshot\ 
org.rocksdb.StringAppendOperator\ + org.rocksdb.UInt64AddOperator\ org.rocksdb.WriteBatch\ org.rocksdb.WriteBatch.Handler\ org.rocksdb.WriteOptions\ org.rocksdb.WriteBatchWithIndex\ + org.rocksdb.WriteBufferManager\ org.rocksdb.WBWIRocksIterator NATIVE_JAVA_TEST_CLASSES = org.rocksdb.RocksDBExceptionTest\ @@ -111,6 +115,7 @@ JAVA_TESTS = org.rocksdb.BackupableDBOptionsTest\ org.rocksdb.KeyMayExistTest\ org.rocksdb.LoggerTest\ org.rocksdb.LRUCacheTest\ + org.rocksdb.MemoryUtilTest\ org.rocksdb.MemTableTest\ org.rocksdb.MergeTest\ org.rocksdb.MixedOptionsTest\ diff --git a/java/rocksjni/compaction_options_fifo.cc b/java/rocksjni/compaction_options_fifo.cc index 95bbfc621dc..00761b6ac5f 100644 --- a/java/rocksjni/compaction_options_fifo.cc +++ b/java/rocksjni/compaction_options_fifo.cc @@ -46,6 +46,53 @@ jlong Java_org_rocksdb_CompactionOptionsFIFO_maxTableFilesSize(JNIEnv* /*env*/, return static_cast(opt->max_table_files_size); } +/* + * Class: org_rocksdb_CompactionOptionsFIFO + * Method: setTtl + * Signature: (JJ)V + */ +void Java_org_rocksdb_CompactionOptionsFIFO_setTtl(JNIEnv* /*env*/, + jobject /*jobj*/, + jlong jhandle, jlong ttl) { + auto* opt = reinterpret_cast(jhandle); + opt->ttl = static_cast(ttl); +} + +/* + * Class: org_rocksdb_CompactionOptionsFIFO + * Method: ttl + * Signature: (J)J + */ +jlong Java_org_rocksdb_CompactionOptionsFIFO_ttl(JNIEnv* /*env*/, + jobject /*jobj*/, + jlong jhandle) { + auto* opt = reinterpret_cast(jhandle); + return static_cast(opt->ttl); +} + +/* + * Class: org_rocksdb_CompactionOptionsFIFO + * Method: setAllowCompaction + * Signature: (JZ)V + */ +void Java_org_rocksdb_CompactionOptionsFIFO_setAllowCompaction( + JNIEnv* /*env*/, jobject /*jobj*/, jlong jhandle, + jboolean allow_compaction) { + auto* opt = reinterpret_cast(jhandle); + opt->allow_compaction = static_cast(allow_compaction); +} + +/* + * Class: org_rocksdb_CompactionOptionsFIFO + * Method: allowCompaction + * Signature: (J)Z + */ +jboolean Java_org_rocksdb_CompactionOptionsFIFO_allowCompaction( + JNIEnv* /*env*/, jobject /*jobj*/, jlong jhandle) { + auto* opt = reinterpret_cast(jhandle); + return static_cast(opt->allow_compaction); +} + /* * Class: org_rocksdb_CompactionOptionsFIFO * Method: disposeInternal diff --git a/java/rocksjni/memory_util.cc b/java/rocksjni/memory_util.cc new file mode 100644 index 00000000000..9c2bfd04e2b --- /dev/null +++ b/java/rocksjni/memory_util.cc @@ -0,0 +1,100 @@ +// Copyright (c) 2011-present, Facebook, Inc. All rights reserved. +// This source code is licensed under both the GPLv2 (found in the +// COPYING file in the root directory) and Apache 2.0 License +// (found in the LICENSE.Apache file in the root directory). 
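// --- Illustrative usage sketch (not part of this patch) ---
// The setTtl/setAllowCompaction JNI bindings above back the Java
// CompactionOptionsFIFO setters added later in this patch. Assuming the
// existing Options#setCompactionStyle and Options#setCompactionOptionsFIFO
// APIs, FIFO compaction with TTL-based deletion could be configured roughly
// as follows; the path, size, and 7-day TTL are arbitrary example values.
import org.rocksdb.CompactionOptionsFIFO;
import org.rocksdb.CompactionStyle;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class FifoTtlExample {
  public static void main(String[] args) throws RocksDBException {
    RocksDB.loadLibrary();
    try (final CompactionOptionsFIFO fifoOptions = new CompactionOptionsFIFO()
             .setMaxTableFilesSize(1024L * 1024L * 1024L) // keep at most ~1 GB of SST files
             .setTtl(7 * 24 * 60 * 60)                    // drop files older than 7 days
             .setAllowCompaction(true);                   // allow intra-L0 compaction of small files
         final Options options = new Options()
             .setCreateIfMissing(true)
             .setCompactionStyle(CompactionStyle.FIFO)
             .setCompactionOptionsFIFO(fifoOptions);
         final RocksDB db = RocksDB.open(options, "/tmp/fifo-ttl-example")) {
      db.put("key".getBytes(), "value".getBytes());
    }
  }
}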
+ +#include +#include +#include +#include +#include + +#include "include/org_rocksdb_MemoryUtil.h" + +#include "rocksjni/portal.h" + +#include "rocksdb/utilities/memory_util.h" + + +/* + * Class: org_rocksdb_MemoryUtil + * Method: getApproximateMemoryUsageByType + * Signature: ([J[J)Ljava/util/Map; + */ +jobject Java_org_rocksdb_MemoryUtil_getApproximateMemoryUsageByType( + JNIEnv *env, jclass /*jclazz*/, jlongArray jdb_handles, jlongArray jcache_handles) { + + std::vector dbs; + jsize db_handle_count = env->GetArrayLength(jdb_handles); + if(db_handle_count > 0) { + jlong *ptr_jdb_handles = env->GetLongArrayElements(jdb_handles, nullptr); + if (ptr_jdb_handles == nullptr) { + // exception thrown: OutOfMemoryError + return nullptr; + } + for (jsize i = 0; i < db_handle_count; i++) { + dbs.push_back(reinterpret_cast(ptr_jdb_handles[i])); + } + env->ReleaseLongArrayElements(jdb_handles, ptr_jdb_handles, JNI_ABORT); + } + + std::unordered_set cache_set; + jsize cache_handle_count = env->GetArrayLength(jcache_handles); + if(cache_handle_count > 0) { + jlong *ptr_jcache_handles = env->GetLongArrayElements(jcache_handles, nullptr); + if (ptr_jcache_handles == nullptr) { + // exception thrown: OutOfMemoryError + return nullptr; + } + for (jsize i = 0; i < cache_handle_count; i++) { + auto *cache_ptr = + reinterpret_cast *>(ptr_jcache_handles[i]); + cache_set.insert(cache_ptr->get()); + } + env->ReleaseLongArrayElements(jcache_handles, ptr_jcache_handles, JNI_ABORT); + } + + std::map usage_by_type; + if(rocksdb::MemoryUtil::GetApproximateMemoryUsageByType(dbs, cache_set, &usage_by_type) != rocksdb::Status::OK()) { + // Non-OK status + return nullptr; + } + + jobject jusage_by_type = rocksdb::HashMapJni::construct( + env, static_cast(usage_by_type.size())); + if (jusage_by_type == nullptr) { + // exception occurred + return nullptr; + } + const rocksdb::HashMapJni::FnMapKV + fn_map_kv = + [env](const std::pair& pair) { + // Construct key + const jobject jusage_type = + rocksdb::ByteJni::valueOf(env, rocksdb::MemoryUsageTypeJni::toJavaMemoryUsageType(pair.first)); + if (jusage_type == nullptr) { + // an error occurred + return std::unique_ptr>(nullptr); + } + // Construct value + const jobject jusage_value = + rocksdb::LongJni::valueOf(env, pair.second); + if (jusage_value == nullptr) { + // an error occurred + return std::unique_ptr>(nullptr); + } + // Construct and return pointer to pair of jobjects + return std::unique_ptr>( + new std::pair(jusage_type, + jusage_value)); + }; + + if (!rocksdb::HashMapJni::putAll(env, jusage_by_type, usage_by_type.begin(), + usage_by_type.end(), fn_map_kv)) { + // exception occcurred + jusage_by_type = nullptr; + } + + return jusage_by_type; + +} diff --git a/java/rocksjni/merge_operator.cc b/java/rocksjni/merge_operator.cc index 782153f5712..e06a06f7e35 100644 --- a/java/rocksjni/merge_operator.cc +++ b/java/rocksjni/merge_operator.cc @@ -13,6 +13,7 @@ #include #include "include/org_rocksdb_StringAppendOperator.h" +#include "include/org_rocksdb_UInt64AddOperator.h" #include "rocksdb/db.h" #include "rocksdb/memtablerep.h" #include "rocksdb/merge_operator.h" @@ -47,3 +48,28 @@ void Java_org_rocksdb_StringAppendOperator_disposeInternal(JNIEnv* /*env*/, reinterpret_cast*>(jhandle); delete sptr_string_append_op; // delete std::shared_ptr } + +/* + * Class: org_rocksdb_UInt64AddOperator + * Method: newSharedUInt64AddOperator + * Signature: ()J + */ +jlong Java_org_rocksdb_UInt64AddOperator_newSharedUInt64AddOperator( + JNIEnv* /*env*/, jclass /*jclazz*/) { + auto* 
sptr_uint64_add_op = new std::shared_ptr( + rocksdb::MergeOperators::CreateUInt64AddOperator()); + return reinterpret_cast(sptr_uint64_add_op); +} + +/* + * Class: org_rocksdb_UInt64AddOperator + * Method: disposeInternal + * Signature: (J)V + */ +void Java_org_rocksdb_UInt64AddOperator_disposeInternal(JNIEnv* /*env*/, + jobject /*jobj*/, + jlong jhandle) { + auto* sptr_uint64_add_op = + reinterpret_cast*>(jhandle); + delete sptr_uint64_add_op; // delete std::shared_ptr +} diff --git a/java/rocksjni/options.cc b/java/rocksjni/options.cc index 9aed80e1e66..342ee3e9e4c 100644 --- a/java/rocksjni/options.cc +++ b/java/rocksjni/options.cc @@ -250,6 +250,20 @@ void Java_org_rocksdb_Options_setWriteBufferSize(JNIEnv* env, jobject /*jobj*/, } } +/* + * Class: org_rocksdb_Options + * Method: setWriteBufferManager + * Signature: (JJ)V + */ +void Java_org_rocksdb_Options_setWriteBufferManager(JNIEnv* /*env*/, jobject /*jobj*/, + jlong joptions_handle, + jlong jwrite_buffer_manager_handle) { + auto* write_buffer_manager = + reinterpret_cast *>(jwrite_buffer_manager_handle); + reinterpret_cast(joptions_handle)->write_buffer_manager = + *write_buffer_manager; +} + /* * Class: org_rocksdb_Options * Method: writeBufferSize @@ -1956,8 +1970,8 @@ jbyte Java_org_rocksdb_Options_compressionType(JNIEnv* /*env*/, * @param jcompression_levels A reference to a java byte array * where each byte indicates a compression level * - * @return A unique_ptr to the vector, or unique_ptr(nullptr) if a JNI exception - * occurs + * @return A std::unique_ptr to the vector, or std::unique_ptr(nullptr) if a JNI + * exception occurs */ std::unique_ptr> rocksdb_compression_vector_helper(JNIEnv* env, jbyteArray jcompression_levels) { @@ -5518,6 +5532,20 @@ void Java_org_rocksdb_DBOptions_setDbWriteBufferSize( opt->db_write_buffer_size = static_cast(jdb_write_buffer_size); } +/* + * Class: org_rocksdb_DBOptions + * Method: setWriteBufferManager + * Signature: (JJ)V + */ +void Java_org_rocksdb_DBOptions_setWriteBufferManager(JNIEnv* /*env*/, jobject /*jobj*/, + jlong jdb_options_handle, + jlong jwrite_buffer_manager_handle) { + auto* write_buffer_manager = + reinterpret_cast *>(jwrite_buffer_manager_handle); + reinterpret_cast(jdb_options_handle)->write_buffer_manager = + *write_buffer_manager; +} + /* * Class: org_rocksdb_DBOptions * Method: dbWriteBufferSize @@ -6525,6 +6553,31 @@ jlong Java_org_rocksdb_ReadOptions_iterateUpperBound(JNIEnv* /*env*/, return reinterpret_cast(upper_bound_slice_handle); } +/* + * Class: org_rocksdb_ReadOptions + * Method: setIterateLowerBound + * Signature: (JJ)I + */ +void Java_org_rocksdb_ReadOptions_setIterateLowerBound( + JNIEnv* /*env*/, jobject /*jobj*/, jlong jhandle, + jlong jlower_bound_slice_handle) { + reinterpret_cast(jhandle)->iterate_lower_bound = + reinterpret_cast(jlower_bound_slice_handle); +} + +/* + * Class: org_rocksdb_ReadOptions + * Method: iterateLowerBound + * Signature: (J)J + */ +jlong Java_org_rocksdb_ReadOptions_iterateLowerBound(JNIEnv* /*env*/, + jobject /*jobj*/, + jlong jhandle) { + auto& lower_bound_slice_handle = + reinterpret_cast(jhandle)->iterate_lower_bound; + return reinterpret_cast(lower_bound_slice_handle); +} + ///////////////////////////////////////////////////////////////////// // rocksdb::ComparatorOptions diff --git a/java/rocksjni/portal.h b/java/rocksjni/portal.h index a0d1846a659..0bf2867c1c0 100644 --- a/java/rocksjni/portal.h +++ b/java/rocksjni/portal.h @@ -26,6 +26,7 @@ #include "rocksdb/rate_limiter.h" #include "rocksdb/status.h" #include 
"rocksdb/utilities/backupable_db.h" +#include "rocksdb/utilities/memory_util.h" #include "rocksdb/utilities/transaction_db.h" #include "rocksdb/utilities/write_batch_with_index.h" #include "rocksjni/compaction_filter_factory_jnicallback.h" @@ -2251,7 +2252,7 @@ class ByteJni : public JavaClass { * @param env A pointer to the Java environment * * @return The Java Method ID or nullptr if the class or method id could not - * be retieved + * be retrieved */ static jmethodID getByteValueMethod(JNIEnv* env) { jclass clazz = getJClass(env); @@ -2264,6 +2265,39 @@ class ByteJni : public JavaClass { assert(mid != nullptr); return mid; } + + /** + * Calls the Java Method: Byte#valueOf, returning a constructed Byte jobject + * + * @param env A pointer to the Java environment + * + * @return A constructing Byte object or nullptr if the class or method id could not + * be retrieved, or an exception occurred + */ + static jobject valueOf(JNIEnv* env, jbyte jprimitive_byte) { + jclass clazz = getJClass(env); + if (clazz == nullptr) { + // exception occurred accessing class + return nullptr; + } + + static jmethodID mid = + env->GetStaticMethodID(clazz, "valueOf", "(B)Ljava/lang/Byte;"); + if (mid == nullptr) { + // exception thrown: NoSuchMethodException or OutOfMemoryError + return nullptr; + } + + const jobject jbyte_obj = + env->CallStaticObjectMethod(clazz, mid, jprimitive_byte); + if (env->ExceptionCheck()) { + // exception occurred + return nullptr; + } + + return jbyte_obj; + } + }; // The portal class for java.lang.StringBuilder @@ -3345,8 +3379,12 @@ class TickerTypeJni { return 0x5D; case rocksdb::Tickers::NUMBER_MULTIGET_KEYS_FOUND: return 0x5E; - case rocksdb::Tickers::TICKER_ENUM_MAX: + case rocksdb::Tickers::NO_ITERATOR_CREATED: return 0x5F; + case rocksdb::Tickers::NO_ITERATOR_DELETED: + return 0x60; + case rocksdb::Tickers::TICKER_ENUM_MAX: + return 0x61; default: // undefined/default @@ -3549,6 +3587,10 @@ class TickerTypeJni { case 0x5E: return rocksdb::Tickers::NUMBER_MULTIGET_KEYS_FOUND; case 0x5F: + return rocksdb::Tickers::NO_ITERATOR_CREATED; + case 0x60: + return rocksdb::Tickers::NO_ITERATOR_DELETED; + case 0x61: return rocksdb::Tickers::TICKER_ENUM_MAX; default: @@ -3795,6 +3837,48 @@ class RateLimiterModeJni { } }; +// The portal class for org.rocksdb.MemoryUsageType +class MemoryUsageTypeJni { +public: + // Returns the equivalent org.rocksdb.MemoryUsageType for the provided + // C++ rocksdb::MemoryUtil::UsageType enum + static jbyte toJavaMemoryUsageType( + const rocksdb::MemoryUtil::UsageType& usage_type) { + switch(usage_type) { + case rocksdb::MemoryUtil::UsageType::kMemTableTotal: + return 0x0; + case rocksdb::MemoryUtil::UsageType::kMemTableUnFlushed: + return 0x1; + case rocksdb::MemoryUtil::UsageType::kTableReadersTotal: + return 0x2; + case rocksdb::MemoryUtil::UsageType::kCacheTotal: + return 0x3; + default: + // undefined: use kNumUsageTypes + return 0x4; + } + } + + // Returns the equivalent C++ rocksdb::MemoryUtil::UsageType enum for the + // provided Java org.rocksdb.MemoryUsageType + static rocksdb::MemoryUtil::UsageType toCppMemoryUsageType( + jbyte usage_type) { + switch(usage_type) { + case 0x0: + return rocksdb::MemoryUtil::UsageType::kMemTableTotal; + case 0x1: + return rocksdb::MemoryUtil::UsageType::kMemTableUnFlushed; + case 0x2: + return rocksdb::MemoryUtil::UsageType::kTableReadersTotal; + case 0x3: + return rocksdb::MemoryUtil::UsageType::kCacheTotal; + default: + // undefined/default: use kNumUsageTypes + return 
rocksdb::MemoryUtil::UsageType::kNumUsageTypes; + } + } +}; + // The portal class for org.rocksdb.Transaction class TransactionJni : public JavaClass { public: diff --git a/java/rocksjni/statisticsjni.cc b/java/rocksjni/statisticsjni.cc index 3ac1e5b413e..8fddc437a0b 100644 --- a/java/rocksjni/statisticsjni.cc +++ b/java/rocksjni/statisticsjni.cc @@ -11,11 +11,11 @@ namespace rocksdb { StatisticsJni::StatisticsJni(std::shared_ptr stats) - : StatisticsImpl(stats, false), m_ignore_histograms() { + : StatisticsImpl(stats), m_ignore_histograms() { } StatisticsJni::StatisticsJni(std::shared_ptr stats, - const std::set ignore_histograms) : StatisticsImpl(stats, false), + const std::set ignore_histograms) : StatisticsImpl(stats), m_ignore_histograms(ignore_histograms) { } diff --git a/java/rocksjni/table.cc b/java/rocksjni/table.cc index 5f5f8cd2abf..3dbd13280ad 100644 --- a/java/rocksjni/table.cc +++ b/java/rocksjni/table.cc @@ -37,7 +37,7 @@ jlong Java_org_rocksdb_PlainTableConfig_newTableFactoryHandle( /* * Class: org_rocksdb_BlockBasedTableConfig * Method: newTableFactoryHandle - * Signature: (ZJIJJIIZIZZZJIBBI)J + * Signature: (ZJIJJIIZJZZZZJZZJIBBI)J */ jlong Java_org_rocksdb_BlockBasedTableConfig_newTableFactoryHandle( JNIEnv * /*env*/, jobject /*jobj*/, jboolean no_block_cache, @@ -45,7 +45,10 @@ jlong Java_org_rocksdb_BlockBasedTableConfig_newTableFactoryHandle( jlong block_size, jint block_size_deviation, jint block_restart_interval, jboolean whole_key_filtering, jlong jfilter_policy, jboolean cache_index_and_filter_blocks, + jboolean cache_index_and_filter_blocks_with_high_priority, jboolean pin_l0_filter_and_index_blocks_in_cache, + jboolean partition_filters, jlong metadata_block_size, + jboolean pin_top_level_index_and_filter, jboolean hash_index_allow_collision, jlong block_cache_compressed_size, jint block_cache_compressd_num_shard_bits, jbyte jchecksum_type, jbyte jindex_type, jint jformat_version) { @@ -77,8 +80,13 @@ jlong Java_org_rocksdb_BlockBasedTableConfig_newTableFactoryHandle( options.filter_policy = *pFilterPolicy; } options.cache_index_and_filter_blocks = cache_index_and_filter_blocks; + options.cache_index_and_filter_blocks_with_high_priority = + cache_index_and_filter_blocks_with_high_priority; options.pin_l0_filter_and_index_blocks_in_cache = pin_l0_filter_and_index_blocks_in_cache; + options.partition_filters = partition_filters; + options.metadata_block_size = metadata_block_size; + options.pin_top_level_index_and_filter = pin_top_level_index_and_filter; options.hash_index_allow_collision = hash_index_allow_collision; if (block_cache_compressed_size > 0) { if (block_cache_compressd_num_shard_bits > 0) { diff --git a/java/rocksjni/write_buffer_manager.cc b/java/rocksjni/write_buffer_manager.cc new file mode 100644 index 00000000000..043f69031c0 --- /dev/null +++ b/java/rocksjni/write_buffer_manager.cc @@ -0,0 +1,38 @@ +// Copyright (c) 2011-present, Facebook, Inc. All rights reserved. +// This source code is licensed under both the GPLv2 (found in the +// COPYING file in the root directory) and Apache 2.0 License +// (found in the LICENSE.Apache file in the root directory). 
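// --- Illustrative usage sketch (not part of this patch) ---
// The Options/DBOptions setWriteBufferManager JNI bindings and the new
// write_buffer_manager.cc above wrap a shared rocksdb::WriteBufferManager.
// Assuming the Java WriteBufferManager class added later in this patch takes
// a buffer size and a Cache, two DBs could share one memtable budget that is
// also charged against a shared block cache, roughly as follows; paths and
// sizes are arbitrary example values.
import org.rocksdb.Cache;
import org.rocksdb.LRUCache;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.WriteBufferManager;

public class WriteBufferManagerExample {
  public static void main(String[] args) throws RocksDBException {
    RocksDB.loadLibrary();
    try (final Cache cache = new LRUCache(512L * 1024L * 1024L);    // shared block cache
         final WriteBufferManager writeBufferManager =
             new WriteBufferManager(128L * 1024L * 1024L, cache);   // cap memtable memory, cost it to the cache
         final Options options = new Options()
             .setCreateIfMissing(true)
             .setWriteBufferManager(writeBufferManager);
         final RocksDB db1 = RocksDB.open(options, "/tmp/wbm-example-1");
         final RocksDB db2 = RocksDB.open(options, "/tmp/wbm-example-2")) {
      // Both DBs now draw their memtable memory from the same 128 MB budget.
      db1.put("a".getBytes(), "1".getBytes());
      db2.put("b".getBytes(), "2".getBytes());
    }
  }
}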
+ +#include + +#include "include/org_rocksdb_WriteBufferManager.h" + +#include "rocksdb/cache.h" +#include "rocksdb/write_buffer_manager.h" + +/* + * Class: org_rocksdb_WriteBufferManager + * Method: newWriteBufferManager + * Signature: (JJ)J + */ +jlong Java_org_rocksdb_WriteBufferManager_newWriteBufferManager( + JNIEnv* /*env*/, jclass /*jclazz*/, jlong jbuffer_size, jlong jcache_handle) { + auto* cache_ptr = + reinterpret_cast *>(jcache_handle); + auto* write_buffer_manager = new std::shared_ptr( + std::make_shared(jbuffer_size, *cache_ptr)); + return reinterpret_cast(write_buffer_manager); +} + +/* + * Class: org_rocksdb_WriteBufferManager + * Method: disposeInternal + * Signature: (J)V + */ +void Java_org_rocksdb_WriteBufferManager_disposeInternal( + JNIEnv* /*env*/, jobject /*jobj*/, jlong jhandle) { + auto* write_buffer_manager = + reinterpret_cast *>(jhandle); + assert(write_buffer_manager != nullptr); + delete write_buffer_manager; +} diff --git a/java/src/main/java/org/rocksdb/BlockBasedTableConfig.java b/java/src/main/java/org/rocksdb/BlockBasedTableConfig.java index 2dbbc64d358..1032be6e799 100644 --- a/java/src/main/java/org/rocksdb/BlockBasedTableConfig.java +++ b/java/src/main/java/org/rocksdb/BlockBasedTableConfig.java @@ -22,7 +22,11 @@ public BlockBasedTableConfig() { wholeKeyFiltering_ = true; filter_ = null; cacheIndexAndFilterBlocks_ = false; + cacheIndexAndFilterBlocksWithHighPriority_ = false; pinL0FilterAndIndexBlocksInCache_ = false; + partitionFilters_ = false; + metadataBlockSize_ = 4096; + pinTopLevelIndexAndFilter_ = true; hashIndexAllowCollision_ = true; blockCacheCompressedSize_ = 0; blockCacheCompressedNumShardBits_ = 0; @@ -246,6 +250,31 @@ public BlockBasedTableConfig setCacheIndexAndFilterBlocks( return this; } + /** + * Indicates if index and filter blocks will be treated as high-priority in the block cache. + * See note below about applicability. If not specified, defaults to false. + * + * @return if index and filter blocks will be treated as high-priority. + */ + public boolean cacheIndexAndFilterBlocksWithHighPriority() { + return cacheIndexAndFilterBlocksWithHighPriority_; + } + + /** + * If true, cache index and filter blocks with high priority. If set to true, + * depending on implementation of block cache, index and filter blocks may be + * less likely to be evicted than data blocks. + * + * @param cacheIndexAndFilterBlocksWithHighPriority if index and filter blocks + * will be treated as high-priority. + * @return the reference to the current config. + */ + public BlockBasedTableConfig setCacheIndexAndFilterBlocksWithHighPriority( + final boolean cacheIndexAndFilterBlocksWithHighPriority) { + cacheIndexAndFilterBlocksWithHighPriority_ = cacheIndexAndFilterBlocksWithHighPriority; + return this; + } + /** * Indicating if we'd like to pin L0 index/filter blocks to the block cache. If not specified, defaults to false. @@ -269,6 +298,70 @@ public BlockBasedTableConfig setPinL0FilterAndIndexBlocksInCache( return this; } + /** + * Indicating if we're using partitioned filters. Defaults to false. + * + * @return if we're using partition filters. + */ + public boolean partitionFilters() { + return partitionFilters_; + } + + /** + * Use partitioned full filters for each SST file. This option is incompatible with + * block-based filters. + * + * @param partitionFilters use partition filters. + * @return the reference to the current config. 
+ */ + public BlockBasedTableConfig setPartitionFilters(final boolean partitionFilters) { + partitionFilters_ = partitionFilters; + return this; + } + + /** + * @return block size for partitioned metadata. + */ + public long metadataBlockSize() { + return metadataBlockSize_; + } + + /** + * Set block size for partitioned metadata. + * + * @param metadataBlockSize Partitioned metadata block size. + * @return the reference to the current config. + */ + public BlockBasedTableConfig setMetadataBlockSize( + final long metadataBlockSize) { + metadataBlockSize_ = metadataBlockSize; + return this; + } + + /** + * Indicates if top-level index and filter blocks should be pinned. + * + * @return if top-level index and filter blocks should be pinned. + */ + public boolean pinTopLevelIndexAndFilter() { + return pinTopLevelIndexAndFilter_; + } + + /** + * If cacheIndexAndFilterBlocks is true and the below is true, then + * the top-level index of partitioned filter and index blocks are stored in + * the cache, but a reference is held in the "table reader" object so the + * blocks are pinned and only evicted from cache when the table reader is + * freed. This is not limited to l0 in LSM tree. + * + * @param pinTopLevelIndexAndFilter if top-level index and filter blocks should be pinned. + * @return the reference to the current config. + */ + public BlockBasedTableConfig setPinTopLevelIndexAndFilter(final boolean pinTopLevelIndexAndFilter) { + pinTopLevelIndexAndFilter_ = pinTopLevelIndexAndFilter; + return this; + } + /** * Influence the behavior when kHashSearch is used. if false, stores a precise prefix to block range mapping @@ -440,20 +533,27 @@ public int formatVersion() { return newTableFactoryHandle(noBlockCache_, blockCacheSize_, blockCacheNumShardBits_, blockCacheHandle, blockSize_, blockSizeDeviation_, blockRestartInterval_, wholeKeyFiltering_, filterHandle, cacheIndexAndFilterBlocks_, - pinL0FilterAndIndexBlocksInCache_, hashIndexAllowCollision_, blockCacheCompressedSize_, - blockCacheCompressedNumShardBits_, checksumType_.getValue(), indexType_.getValue(), - formatVersion_); + cacheIndexAndFilterBlocksWithHighPriority_, pinL0FilterAndIndexBlocksInCache_, + partitionFilters_, metadataBlockSize_, pinTopLevelIndexAndFilter_, + hashIndexAllowCollision_, blockCacheCompressedSize_, blockCacheCompressedNumShardBits_, + checksumType_.getValue(), indexType_.getValue(), formatVersion_); } private native long newTableFactoryHandle(boolean noBlockCache, long blockCacheSize, int blockCacheNumShardBits, long blockCacheHandle, long blockSize, int blockSizeDeviation, int blockRestartInterval, boolean wholeKeyFiltering, long filterPolicyHandle, - boolean cacheIndexAndFilterBlocks, boolean pinL0FilterAndIndexBlocksInCache, - boolean hashIndexAllowCollision, long blockCacheCompressedSize, - int blockCacheCompressedNumShardBits, byte checkSumType, byte indexType, int formatVersion); + boolean cacheIndexAndFilterBlocks, boolean cacheIndexAndFilterBlocksWithHighPriority, + boolean pinL0FilterAndIndexBlocksInCache, boolean partitionFilters, long metadataBlockSize, + boolean pinTopLevelIndexAndFilter, boolean hashIndexAllowCollision, + long blockCacheCompressedSize, int blockCacheCompressedNumShardBits, + byte checkSumType, byte indexType, int formatVersion); private boolean cacheIndexAndFilterBlocks_; + private boolean cacheIndexAndFilterBlocksWithHighPriority_; private boolean pinL0FilterAndIndexBlocksInCache_; + private boolean partitionFilters_; + private long metadataBlockSize_; + private boolean 
pinTopLevelIndexAndFilter_; private IndexType indexType_; private boolean hashIndexAllowCollision_; private ChecksumType checksumType_; diff --git a/java/src/main/java/org/rocksdb/CompactionOptionsFIFO.java b/java/src/main/java/org/rocksdb/CompactionOptionsFIFO.java index f795807804d..36d78fe6e6f 100644 --- a/java/src/main/java/org/rocksdb/CompactionOptionsFIFO.java +++ b/java/src/main/java/org/rocksdb/CompactionOptionsFIFO.java @@ -42,8 +42,77 @@ public long maxTableFilesSize() { return maxTableFilesSize(nativeHandle_); } + /** + * Drop files older than TTL. TTL based deletion will take precedence over + * size based deletion if ttl > 0. + * delete if sst_file_creation_time < (current_time - ttl). + * unit: seconds. Ex: 1 day = 1 * 24 * 60 * 60 + * + * Default: 0 (disabled) + * + * @param ttl The ttl for the table files in seconds + * + * @return the reference to the current options. + */ + public CompactionOptionsFIFO setTtl(final long ttl) { + setTtl(nativeHandle_, ttl); + return this; + } + + /** + * The current ttl value. + * Drop files older than TTL. TTL based deletion will take precedence over + * size based deletion if ttl > 0. + * delete if sst_file_creation_time < (current_time - ttl). + * + * Default: 0 (disabled) + * + * @return the ttl in seconds + */ + public long ttl() { + return ttl(nativeHandle_); + } + + /** + * If true, try to do compaction to compact smaller files into larger ones. + * Minimum files to compact follows options.level0_file_num_compaction_trigger + * and compaction won't trigger if average compact bytes per del file is + * larger than options.write_buffer_size. This is to protect large files + * from being compacted again. + * + * Default: false + * + * @param allowCompaction should allow intra-L0 compaction? + * + * @return the reference to the current options. + */ + public CompactionOptionsFIFO setAllowCompaction(final boolean allowCompaction) { + setAllowCompaction(nativeHandle_, allowCompaction); + return this; + } + + /** + * Check if intra-L0 compaction is enabled. + * If true, try to do compaction to compact smaller files into larger ones. + * Minimum files to compact follows options.level0_file_num_compaction_trigger + * and compaction won't trigger if average compact bytes per del file is + * larger than options.write_buffer_size. This is to protect large files + * from being compacted again. 
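// --- Illustrative usage sketch (not part of this patch) ---
// The new BlockBasedTableConfig setters above (partition filters, metadata
// block size, top-level index/filter pinning, high-priority cache entries)
// are normally combined with a full Bloom filter and the two-level index.
// A rough configuration sketch assuming the existing BloomFilter, IndexType,
// and Options#setTableFormatConfig APIs; the concrete sizes are arbitrary.
import org.rocksdb.BlockBasedTableConfig;
import org.rocksdb.BloomFilter;
import org.rocksdb.IndexType;
import org.rocksdb.Options;

public class PartitionedFilterExample {
  public static Options buildOptions() {
    final BlockBasedTableConfig tableConfig = new BlockBasedTableConfig()
        .setFilter(new BloomFilter(10, false))         // full (non block-based) Bloom filter
        .setIndexType(IndexType.kTwoLevelIndexSearch)  // partitioned index
        .setPartitionFilters(true)                     // partition the filter alongside the index
        .setMetadataBlockSize(4096)                    // target size of each partition block
        .setCacheIndexAndFilterBlocks(true)
        .setCacheIndexAndFilterBlocksWithHighPriority(true)
        .setPinTopLevelIndexAndFilter(true);           // keep the top-level index/filter pinned
    return new Options()
        .setCreateIfMissing(true)
        .setTableFormatConfig(tableConfig);
  }
}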
+ * + * Default: false + * + * @return a boolean value indicating whether intra-L0 compaction is enabled + */ + public boolean allowCompaction() { + return allowCompaction(nativeHandle_); + } + private native void setMaxTableFilesSize(long handle, long maxTableFilesSize); private native long maxTableFilesSize(long handle); + private native void setTtl(long handle, long ttl); + private native long ttl(long handle); + private native void setAllowCompaction(long handle, boolean allowCompaction); + private native boolean allowCompaction(long handle); private native static long newCompactionOptionsFIFO(); @Override protected final native void disposeInternal(final long handle); diff --git a/java/src/main/java/org/rocksdb/DBOptions.java b/java/src/main/java/org/rocksdb/DBOptions.java index c3232938893..280623a208e 100644 --- a/java/src/main/java/org/rocksdb/DBOptions.java +++ b/java/src/main/java/org/rocksdb/DBOptions.java @@ -46,6 +46,7 @@ public DBOptions(DBOptions other) { this.numShardBits_ = other.numShardBits_; this.rateLimiter_ = other.rateLimiter_; this.rowCache_ = other.rowCache_; + this.writeBufferManager_ = other.writeBufferManager_; } /** @@ -668,6 +669,20 @@ public DBOptions setDbWriteBufferSize(final long dbWriteBufferSize) { } @Override + public DBOptions setWriteBufferManager(final WriteBufferManager writeBufferManager) { + assert(isOwningHandle()); + setWriteBufferManager(nativeHandle_, writeBufferManager.nativeHandle_); + this.writeBufferManager_ = writeBufferManager; + return this; + } + + @Override + public WriteBufferManager writeBufferManager() { + assert(isOwningHandle()); + return this.writeBufferManager_; + } + + @Override public long dbWriteBufferSize() { assert(isOwningHandle()); return dbWriteBufferSize(nativeHandle_); @@ -1087,6 +1102,8 @@ private native void setAdviseRandomOnOpen( private native boolean adviseRandomOnOpen(long handle); private native void setDbWriteBufferSize(final long handle, final long dbWriteBufferSize); + private native void setWriteBufferManager(final long dbOptionsHandle, + final long writeBufferManagerHandle); private native long dbWriteBufferSize(final long handle); private native void setAccessHintOnCompactionStart(final long handle, final byte accessHintOnCompactionStart); @@ -1158,4 +1175,5 @@ private native void setAvoidFlushDuringShutdown(final long handle, private int numShardBits_; private RateLimiter rateLimiter_; private Cache rowCache_; + private WriteBufferManager writeBufferManager_; } diff --git a/java/src/main/java/org/rocksdb/DBOptionsInterface.java b/java/src/main/java/org/rocksdb/DBOptionsInterface.java index 7c406eaf8ab..accfb4c59ae 100644 --- a/java/src/main/java/org/rocksdb/DBOptionsInterface.java +++ b/java/src/main/java/org/rocksdb/DBOptionsInterface.java @@ -991,6 +991,28 @@ public interface DBOptionsInterface { */ T setDbWriteBufferSize(long dbWriteBufferSize); + /** + * Use passed {@link WriteBufferManager} to control memory usage across + * multiple column families and/or DB instances. + * + * Check + * https://github.com/facebook/rocksdb/wiki/Write-Buffer-Manager + * for more details on when to use it + * + * @param writeBufferManager The WriteBufferManager to use + * @return the reference of the current options. + */ + T setWriteBufferManager(final WriteBufferManager writeBufferManager); + + /** + * Reference to {@link WriteBufferManager} used by it.
+ * + * Default: null (Disabled) + * + * @return a reference to WriteBufferManager + */ + WriteBufferManager writeBufferManager(); + /** * Amount of data to build up in memtables across all column * families before writing to disk. diff --git a/java/src/main/java/org/rocksdb/MemoryUsageType.java b/java/src/main/java/org/rocksdb/MemoryUsageType.java new file mode 100644 index 00000000000..3523cd0ee65 --- /dev/null +++ b/java/src/main/java/org/rocksdb/MemoryUsageType.java @@ -0,0 +1,72 @@ +// Copyright (c) 2011-present, Facebook, Inc. All rights reserved. +// This source code is licensed under both the GPLv2 (found in the +// COPYING file in the root directory) and Apache 2.0 License +// (found in the LICENSE.Apache file in the root directory). + +package org.rocksdb; + +/** + * MemoryUsageType + * + *

<p>The value will be used as a key to indicate the type of memory usage + * described</p>
+ */ +public enum MemoryUsageType { + /** + * Memory usage of all the mem-tables. + */ + kMemTableTotal((byte) 0), + /** + * Memory usage of those un-flushed mem-tables. + */ + kMemTableUnFlushed((byte) 1), + /** + * Memory usage of all the table readers. + */ + kTableReadersTotal((byte) 2), + /** + * Memory usage by Cache. + */ + kCacheTotal((byte) 3), + /** + * Max usage types - copied to keep 1:1 with native. + */ + kNumUsageTypes((byte) 4); + + /** + * Returns the byte value of the enumerations value + * + * @return byte representation + */ + public byte getValue() { + return value_; + } + + /** + *

<p>Get the MemoryUsageType enumeration value by + * passing the byte identifier to this method.</p>
+ * + * @param byteIdentifier of MemoryUsageType. + * + * @return MemoryUsageType instance. + * + * @throws IllegalArgumentException if the usage type for the byteIdentifier + * cannot be found + */ + public static MemoryUsageType getMemoryUsageType(final byte byteIdentifier) { + for (final MemoryUsageType MemoryUsageType : MemoryUsageType.values()) { + if (MemoryUsageType.getValue() == byteIdentifier) { + return MemoryUsageType; + } + } + + throw new IllegalArgumentException( + "Illegal value provided for MemoryUsageType."); + } + + private MemoryUsageType(byte value) { + value_ = value; + } + + private final byte value_; +} diff --git a/java/src/main/java/org/rocksdb/MemoryUtil.java b/java/src/main/java/org/rocksdb/MemoryUtil.java new file mode 100644 index 00000000000..52b2175e6b1 --- /dev/null +++ b/java/src/main/java/org/rocksdb/MemoryUtil.java @@ -0,0 +1,60 @@ +// Copyright (c) 2011-present, Facebook, Inc. All rights reserved. +// This source code is licensed under both the GPLv2 (found in the +// COPYING file in the root directory) and Apache 2.0 License +// (found in the LICENSE.Apache file in the root directory). + +package org.rocksdb; + +import java.util.*; + +/** + * JNI passthrough for MemoryUtil. + */ +public class MemoryUtil { + + /** + *

<p>Returns the approximate memory usage of different types in the input + * list of DBs and Cache set. For instance, in the output map the key + * kMemTableTotal will be associated with the memory + * usage of all the mem-tables from all the input rocksdb instances.</p>
+ * + *

<p>Note that for memory usage inside Cache class, we will + * only report the usage of the input "cache_set" without + * including those Cache usage inside the input list "dbs" + * of DBs.</p>
+ * + * @param dbs List of dbs to collect memory usage for. + * @param caches Set of caches to collect memory usage for. + * @return Map from {@link MemoryUsageType} to memory usage as a {@link Long}. + */ + public static Map getApproximateMemoryUsageByType(final List dbs, final Set caches) { + int dbCount = (dbs == null) ? 0 : dbs.size(); + int cacheCount = (caches == null) ? 0 : caches.size(); + long[] dbHandles = new long[dbCount]; + long[] cacheHandles = new long[cacheCount]; + if (dbCount > 0) { + ListIterator dbIter = dbs.listIterator(); + while (dbIter.hasNext()) { + dbHandles[dbIter.nextIndex()] = dbIter.next().nativeHandle_; + } + } + if (cacheCount > 0) { + // NOTE: This index handling is super ugly but I couldn't get a clean way to track both the + // index and the iterator simultaneously within a Set. + int i = 0; + for (Cache cache : caches) { + cacheHandles[i] = cache.nativeHandle_; + i++; + } + } + Map byteOutput = getApproximateMemoryUsageByType(dbHandles, cacheHandles); + Map output = new HashMap<>(); + for(Map.Entry longEntry : byteOutput.entrySet()) { + output.put(MemoryUsageType.getMemoryUsageType(longEntry.getKey()), longEntry.getValue()); + } + return output; + } + + private native static Map getApproximateMemoryUsageByType(final long[] dbHandles, + final long[] cacheHandles); +} diff --git a/java/src/main/java/org/rocksdb/Options.java b/java/src/main/java/org/rocksdb/Options.java index cac4fc5a368..2ff4ec12040 100644 --- a/java/src/main/java/org/rocksdb/Options.java +++ b/java/src/main/java/org/rocksdb/Options.java @@ -70,6 +70,7 @@ public Options(Options other) { this.compactionOptionsFIFO_ = other.compactionOptionsFIFO_; this.compressionOptions_ = other.compressionOptions_; this.rowCache_ = other.rowCache_; + this.writeBufferManager_ = other.writeBufferManager_; } @Override @@ -724,6 +725,20 @@ public Options setDbWriteBufferSize(final long dbWriteBufferSize) { } @Override + public Options setWriteBufferManager(final WriteBufferManager writeBufferManager) { + assert(isOwningHandle()); + setWriteBufferManager(nativeHandle_, writeBufferManager.nativeHandle_); + this.writeBufferManager_ = writeBufferManager; + return this; + } + + @Override + public WriteBufferManager writeBufferManager() { + assert(isOwningHandle()); + return this.writeBufferManager_; + } + + @Override public long dbWriteBufferSize() { assert(isOwningHandle()); return dbWriteBufferSize(nativeHandle_); @@ -1690,6 +1705,8 @@ private native void setAdviseRandomOnOpen( private native boolean adviseRandomOnOpen(long handle); private native void setDbWriteBufferSize(final long handle, final long dbWriteBufferSize); + private native void setWriteBufferManager(final long handle, + final long writeBufferManagerHandle); private native long dbWriteBufferSize(final long handle); private native void setAccessHintOnCompactionStart(final long handle, final byte accessHintOnCompactionStart); @@ -1909,4 +1926,5 @@ private native void setForceConsistencyChecks(final long handle, private CompactionOptionsFIFO compactionOptionsFIFO_; private CompressionOptions compressionOptions_; private Cache rowCache_; + private WriteBufferManager writeBufferManager_; } diff --git a/java/src/main/java/org/rocksdb/ReadOptions.java b/java/src/main/java/org/rocksdb/ReadOptions.java index be8aec6b32c..f176d249b02 100644 --- a/java/src/main/java/org/rocksdb/ReadOptions.java +++ b/java/src/main/java/org/rocksdb/ReadOptions.java @@ -27,6 +27,7 @@ public ReadOptions() { public ReadOptions(ReadOptions other) { 
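// --- Illustrative usage sketch (not part of this patch) ---
// MemoryUtil.getApproximateMemoryUsageByType, added above, reports memory
// usage broken down by MemoryUsageType for a set of DBs and caches. A rough
// example of querying it; the path and cache size are arbitrary.
import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import org.rocksdb.Cache;
import org.rocksdb.LRUCache;
import org.rocksdb.MemoryUsageType;
import org.rocksdb.MemoryUtil;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class MemoryUsageExample {
  public static void main(String[] args) throws RocksDBException {
    RocksDB.loadLibrary();
    try (final Cache cache = new LRUCache(64L * 1024L * 1024L);
         final Options options = new Options().setCreateIfMissing(true);
         final RocksDB db = RocksDB.open(options, "/tmp/memory-usage-example")) {
      final Map<MemoryUsageType, Long> usage =
          MemoryUtil.getApproximateMemoryUsageByType(
              Arrays.asList(db), Collections.singleton(cache));
      System.out.println("memtable total: " + usage.get(MemoryUsageType.kMemTableTotal));
      System.out.println("table readers:  " + usage.get(MemoryUsageType.kTableReadersTotal));
      System.out.println("cache total:    " + usage.get(MemoryUsageType.kCacheTotal));
    }
  }
}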
super(copyReadOptions(other.nativeHandle_)); iterateUpperBoundSlice_ = other.iterateUpperBoundSlice_; + iterateLowerBoundSlice_ = other.iterateLowerBoundSlice_; } /** @@ -423,15 +424,65 @@ public Slice iterateUpperBound() { return null; } + /** + * Defines the smallest key at which the backward iterator can return an + * entry. Once the bound is passed, Valid() will be false. + * `iterate_lower_bound` is inclusive ie the bound value is a valid entry. + * + * If prefix_extractor is not null, the Seek target and `iterate_lower_bound` + * need to have the same prefix. This is because ordering is not guaranteed + * outside of prefix domain. + * + * Default: nullptr + * + * @param iterateLowerBound Slice representing the lower bound + * @return the reference to the current ReadOptions. + */ + public ReadOptions setIterateLowerBound(final Slice iterateLowerBound) { + assert(isOwningHandle()); + if (iterateLowerBound != null) { + // Hold onto a reference so it doesn't get garbaged collected out from under us. + iterateLowerBoundSlice_ = iterateLowerBound; + setIterateLowerBound(nativeHandle_, iterateLowerBoundSlice_.getNativeHandle()); + } + return this; + } + + /** + * Defines the smallest key at which the backward iterator can return an + * entry. Once the bound is passed, Valid() will be false. + * `iterate_lower_bound` is inclusive ie the bound value is a valid entry. + * + * If prefix_extractor is not null, the Seek target and `iterate_lower_bound` + * need to have the same prefix. This is because ordering is not guaranteed + * outside of prefix domain. + * + * Default: nullptr + * + * @return Slice representing current iterate_lower_bound setting, or null if + * one does not exist. + */ + public Slice iterateLowerBound() { + assert(isOwningHandle()); + long lowerBoundSliceHandle = iterateLowerBound(nativeHandle_); + if (lowerBoundSliceHandle != 0) { + // Disown the new slice - it's owned by the C++ side of the JNI boundary + // from the perspective of this method. + return new Slice(lowerBoundSliceHandle, false); + } + return null; + } + // instance variables // NOTE: If you add new member variables, please update the copy constructor above! // - // Hold a reference to any iterate upper bound that was set on this object - // until we're destroyed or it's overwritten. That way the caller can freely + // Hold a reference to any iterate upper/lower bound that was set on this object + // until we're destroyed or it's overwritten. That way the caller can freely // leave scope without us losing the Java Slice object, which during close() // would also reap its associated rocksdb::Slice native object since it's // possibly (likely) to be an owning handle. 
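// --- Illustrative usage sketch (not part of this patch) ---
// setIterateLowerBound, added above, complements the existing
// setIterateUpperBound: a scan is limited to [lower, upper), with the lower
// bound inclusive and the upper bound exclusive. A rough range-scan sketch;
// the path and key names are arbitrary example values.
import org.rocksdb.Options;
import org.rocksdb.ReadOptions;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.RocksIterator;
import org.rocksdb.Slice;

public class BoundedScanExample {
  public static void main(String[] args) throws RocksDBException {
    RocksDB.loadLibrary();
    try (final Options options = new Options().setCreateIfMissing(true);
         final RocksDB db = RocksDB.open(options, "/tmp/bounded-scan-example");
         final Slice lower = new Slice("user:0100".getBytes());
         final Slice upper = new Slice("user:0200".getBytes());
         final ReadOptions readOptions = new ReadOptions()
             .setIterateLowerBound(lower)   // inclusive; also guards backward iteration
             .setIterateUpperBound(upper);  // exclusive
         final RocksIterator it = db.newIterator(readOptions)) {
      for (it.seek("user:0100".getBytes()); it.isValid(); it.next()) {
        System.out.println(new String(it.key()) + " = " + new String(it.value()));
      }
    }
  }
}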
protected Slice iterateUpperBoundSlice_; + protected Slice iterateLowerBoundSlice_; private native static long newReadOptions(); private native static long copyReadOptions(long handle); @@ -465,6 +516,9 @@ private native void setIgnoreRangeDeletions(final long handle, private native void setIterateUpperBound(final long handle, final long upperBoundSliceHandle); private native long iterateUpperBound(final long handle); + private native void setIterateLowerBound(final long handle, + final long upperBoundSliceHandle); + private native long iterateLowerBound(final long handle); @Override protected final native void disposeInternal(final long handle); diff --git a/java/src/main/java/org/rocksdb/RocksDB.java b/java/src/main/java/org/rocksdb/RocksDB.java index 38be3333f45..7ac08fdf05b 100644 --- a/java/src/main/java/org/rocksdb/RocksDB.java +++ b/java/src/main/java/org/rocksdb/RocksDB.java @@ -439,6 +439,12 @@ protected void storeOptionsInstance(DBOptionsInterface options) { options_ = options; } + private static void checkBounds(int offset, int len, int size) { + if ((offset | len | (offset + len) | (size - (offset + len))) < 0) { + throw new IndexOutOfBoundsException(String.format("offset(%d), len(%d), size(%d)", offset, len, size)); + } + } + /** * Set the database entry for "key" to "value". * @@ -453,6 +459,28 @@ public void put(final byte[] key, final byte[] value) put(nativeHandle_, key, 0, key.length, value, 0, value.length); } + /** + * Set the database entry for "key" to "value" + * + * @param key The specified key to be inserted + * @param offset the offset of the "key" array to be used, must be non-negative and + * no larger than "key".length + * @param len the length of the "key" array to be used, must be non-negative and + * must be non-negative and no larger than ("key".length - offset) + * @param value the value associated with the specified key + * @param vOffset the offset of the "value" array to be used, must be non-negative and + * no longer than "key".length + * @param vLen the length of the "value" array to be used, must be non-negative and + * must be non-negative and no larger than ("value".length - offset) + * + * @throws RocksDBException thrown if errors happens in underlying native library. + */ + public void put(final byte[] key, int offset, int len, final byte[] value, int vOffset, int vLen) throws RocksDBException { + checkBounds(offset, len, key.length); + checkBounds(vOffset, vLen, value.length); + put(nativeHandle_, key, offset, len, value, vOffset, vLen); + } + /** * Set the database entry for "key" to "value" in the specified * column family. @@ -473,6 +501,32 @@ public void put(final ColumnFamilyHandle columnFamilyHandle, columnFamilyHandle.nativeHandle_); } + /** + * Set the database entry for "key" to "value" in the specified + * column family. 
+ * + * @param columnFamilyHandle {@link org.rocksdb.ColumnFamilyHandle} + * instance + * @param key The specified key to be inserted + * @param offset the offset of the "key" array to be used, must be non-negative and + * no larger than "key".length + * @param len the length of the "key" array to be used, must be non-negative and + * must be non-negative and no larger than ("key".length - offset) + * @param value the value associated with the specified key + * @param vOffset the offset of the "value" array to be used, must be non-negative and + * no longer than "key".length + * @param vLen the length of the "value" array to be used, must be non-negative and + * must be non-negative and no larger than ("value".length - offset) + * + * @throws RocksDBException thrown if errors happens in underlying native library. + */ + public void put(final ColumnFamilyHandle columnFamilyHandle, final byte[] key, int offset, int len, final byte[] value, int vOffset, int vLen) throws RocksDBException { + checkBounds(offset, len, key.length); + checkBounds(vOffset, vLen, value.length); + put(nativeHandle_, key, offset, len, value, vOffset, vLen, + columnFamilyHandle.nativeHandle_); + } + /** * Set the database entry for "key" to "value". * @@ -489,6 +543,32 @@ public void put(final WriteOptions writeOpts, final byte[] key, key, 0, key.length, value, 0, value.length); } + /** + * Set the database entry for "key" to "value". + * + * @param writeOpts {@link org.rocksdb.WriteOptions} instance. + * @param key The specified key to be inserted + * @param offset the offset of the "key" array to be used, must be non-negative and + * no larger than "key".length + * @param len the length of the "key" array to be used, must be non-negative and + * must be non-negative and no larger than ("key".length - offset) + * @param value the value associated with the specified key + * @param vOffset the offset of the "value" array to be used, must be non-negative and + * no longer than "key".length + * @param vLen the length of the "value" array to be used, must be non-negative and + * must be non-negative and no larger than ("value".length - offset) + * + * @throws RocksDBException thrown if error happens in underlying + * native library. + */ + public void put(final WriteOptions writeOpts, byte[] key, int offset, int len, byte[] value, int vOffset, int vLen) throws RocksDBException { + checkBounds(offset, len, key.length); + checkBounds(vOffset, vLen, value.length); + put(nativeHandle_, writeOpts.nativeHandle_, + key, offset, len, value, vOffset, vLen); + } + + /** * Set the database entry for "key" to "value" for the specified * column family. @@ -512,6 +592,36 @@ public void put(final ColumnFamilyHandle columnFamilyHandle, 0, value.length, columnFamilyHandle.nativeHandle_); } + /** + * Set the database entry for "key" to "value" for the specified + * column family. + * + * @param columnFamilyHandle {@link org.rocksdb.ColumnFamilyHandle} + * instance + * @param writeOpts {@link org.rocksdb.WriteOptions} instance. 
+ * @param key The specified key to be inserted + * @param offset the offset of the "key" array to be used, must be non-negative and + * no larger than "key".length + * @param len the length of the "key" array to be used, must be non-negative and + * must be non-negative and no larger than ("key".length - offset) + * @param value the value associated with the specified key + * @param vOffset the offset of the "value" array to be used, must be non-negative and + * no longer than "key".length + * @param vLen the length of the "value" array to be used, must be non-negative and + * must be non-negative and no larger than ("value".length - offset) + * + * @throws RocksDBException thrown if error happens in underlying + * native library. + */ + public void put(final ColumnFamilyHandle columnFamilyHandle, + final WriteOptions writeOpts, final byte[] key, int offset, int len, + final byte[] value, int vOffset, int vLen) throws RocksDBException { + checkBounds(offset, len, key.length); + checkBounds(vOffset, vLen, value.length); + put(nativeHandle_, writeOpts.nativeHandle_, key, offset, len, value, + vOffset, vLen, columnFamilyHandle.nativeHandle_); + } + /** * If the key definitely does not exist in the database, then this method * returns false, else true. @@ -528,6 +638,27 @@ public boolean keyMayExist(final byte[] key, final StringBuilder value) { return keyMayExist(nativeHandle_, key, 0, key.length, value); } + /** + * If the key definitely does not exist in the database, then this method + * returns false, else true. + * + * This check is potentially lighter-weight than invoking DB::Get(). One way + * to make this lighter weight is to avoid doing any IOs. + * + * @param key byte array of a key to search for + * @param offset the offset of the "key" array to be used, must be non-negative and + * no larger than "key".length + * @param len the length of the "key" array to be used, must be non-negative and + * @param value StringBuilder instance which is a out parameter if a value is + * found in block-cache. + * + * @return boolean value indicating if key does not exist or might exist. + */ + public boolean keyMayExist(final byte[] key, int offset, int len, final StringBuilder value) { + checkBounds(offset, len, key.length); + return keyMayExist(nativeHandle_, key, offset, len, value); + } + /** * If the key definitely does not exist in the database, then this method * returns false, else true. @@ -547,6 +678,30 @@ public boolean keyMayExist(final ColumnFamilyHandle columnFamilyHandle, columnFamilyHandle.nativeHandle_, value); } + /** + * If the key definitely does not exist in the database, then this method + * returns false, else true. + * + * This check is potentially lighter-weight than invoking DB::Get(). One way + * to make this lighter weight is to avoid doing any IOs. + * + * @param columnFamilyHandle {@link ColumnFamilyHandle} instance + * @param key byte array of a key to search for + * @param offset the offset of the "key" array to be used, must be non-negative and + * no larger than "key".length + * @param len the length of the "key" array to be used, must be non-negative and + * @param value StringBuilder instance which is a out parameter if a value is + * found in block-cache. + * @return boolean value indicating if key does not exist or might exist. 
+ */ + public boolean keyMayExist(final ColumnFamilyHandle columnFamilyHandle, + final byte[] key, int offset, int len, final StringBuilder value) { + checkBounds(offset, len, key.length); + return keyMayExist(nativeHandle_, key, offset, len, + columnFamilyHandle.nativeHandle_, value); + } + + /** * If the key definitely does not exist in the database, then this method * returns false, else true. @@ -566,6 +721,29 @@ public boolean keyMayExist(final ReadOptions readOptions, key, 0, key.length, value); } + /** + * If the key definitely does not exist in the database, then this method + * returns false, else true. + * + * This check is potentially lighter-weight than invoking DB::Get(). One way + * to make this lighter weight is to avoid doing any IOs. + * + * @param readOptions {@link ReadOptions} instance + * @param key byte array of a key to search for + * @param offset the offset of the "key" array to be used, must be non-negative and + * no larger than "key".length + * @param len the length of the "key" array to be used, must be non-negative and + * @param value StringBuilder instance which is a out parameter if a value is + * found in block-cache. + * @return boolean value indicating if key does not exist or might exist. + */ + public boolean keyMayExist(final ReadOptions readOptions, + final byte[] key, int offset, int len, final StringBuilder value) { + checkBounds(offset, len, key.length); + return keyMayExist(nativeHandle_, readOptions.nativeHandle_, + key, offset, len, value); + } + /** * If the key definitely does not exist in the database, then this method * returns false, else true. @@ -588,6 +766,32 @@ public boolean keyMayExist(final ReadOptions readOptions, value); } + /** + * If the key definitely does not exist in the database, then this method + * returns false, else true. + * + * This check is potentially lighter-weight than invoking DB::Get(). One way + * to make this lighter weight is to avoid doing any IOs. + * + * @param readOptions {@link ReadOptions} instance + * @param columnFamilyHandle {@link ColumnFamilyHandle} instance + * @param key byte array of a key to search for + * @param offset the offset of the "key" array to be used, must be non-negative and + * no larger than "key".length + * @param len the length of the "key" array to be used, must be non-negative and + * @param value StringBuilder instance which is a out parameter if a value is + * found in block-cache. + * @return boolean value indicating if key does not exist or might exist. + */ + public boolean keyMayExist(final ReadOptions readOptions, + final ColumnFamilyHandle columnFamilyHandle, final byte[] key, int offset, int len, + final StringBuilder value) { + checkBounds(offset, len, key.length); + return keyMayExist(nativeHandle_, readOptions.nativeHandle_, + key, offset, len, columnFamilyHandle.nativeHandle_, + value); + } + /** * Apply the specified updates to the database. * @@ -631,6 +835,30 @@ public void merge(final byte[] key, final byte[] value) merge(nativeHandle_, key, 0, key.length, value, 0, value.length); } + /** + * Add merge operand for key/value pair. + * + * @param key the specified key to be merged. + * @param offset the offset of the "key" array to be used, must be non-negative and + * no larger than "key".length + * @param len the length of the "key" array to be used, must be non-negative and + * @param value the value to be merged with the current value for the specified key. 
+ * @param vOffset the offset of the "value" array to be used, must be non-negative and + * no longer than "key".length + * @param vLen the length of the "value" array to be used, must be non-negative and + * must be non-negative and no larger than ("value".length - offset) + * + * @throws RocksDBException thrown if error happens in underlying + * native library. + */ + public void merge(final byte[] key, int offset, int len, final byte[] value, int vOffset, int vLen) + throws RocksDBException { + checkBounds(offset, len, key.length); + checkBounds(vOffset, vLen, value.length); + merge(nativeHandle_, key, offset, len, value, vOffset, vLen); + } + + /** * Add merge operand for key/value pair in a ColumnFamily. * @@ -648,6 +876,32 @@ public void merge(final ColumnFamilyHandle columnFamilyHandle, columnFamilyHandle.nativeHandle_); } + /** + * Add merge operand for key/value pair in a ColumnFamily. + * + * @param columnFamilyHandle {@link ColumnFamilyHandle} instance + * @param key the specified key to be merged. + * @param offset the offset of the "key" array to be used, must be non-negative and + * no larger than "key".length + * @param len the length of the "key" array to be used, must be non-negative and + * @param value the value to be merged with the current value for + * the specified key. + * @param vOffset the offset of the "value" array to be used, must be non-negative and + * no longer than "key".length + * @param vLen the length of the "value" array to be used, must be non-negative and + * must be non-negative and no larger than ("value".length - offset) + * + * @throws RocksDBException thrown if error happens in underlying + * native library. + */ + public void merge(final ColumnFamilyHandle columnFamilyHandle, + final byte[] key, int offset, int len, final byte[] value, int vOffset, int vLen) throws RocksDBException { + checkBounds(offset, len, key.length); + checkBounds(vOffset, vLen, value.length); + merge(nativeHandle_, key, offset, len, value, vOffset, vLen, + columnFamilyHandle.nativeHandle_); + } + /** * Add merge operand for key/value pair. * @@ -665,6 +919,32 @@ public void merge(final WriteOptions writeOpts, final byte[] key, key, 0, key.length, value, 0, value.length); } + /** + * Add merge operand for key/value pair. + * + * @param writeOpts {@link WriteOptions} for this write. + * @param key the specified key to be merged. + * @param offset the offset of the "key" array to be used, must be non-negative and + * no larger than "key".length + * @param len the length of the "key" array to be used, must be non-negative and + * @param value the value to be merged with the current value for + * the specified key. + * @param vOffset the offset of the "value" array to be used, must be non-negative and + * no longer than "key".length + * @param vLen the length of the "value" array to be used, must be non-negative and + * must be non-negative and no larger than ("value".length - offset) + * + * @throws RocksDBException thrown if error happens in underlying + * native library. + */ + public void merge(final WriteOptions writeOpts, final byte[] key, int offset, int len, + final byte[] value, int vOffset, int vLen) throws RocksDBException { + checkBounds(offset, len, key.length); + checkBounds(vOffset, vLen, value.length); + merge(nativeHandle_, writeOpts.nativeHandle_, + key, offset, len, value, vOffset, vLen); + } + /** * Add merge operand for key/value pair. 
* @@ -685,13 +965,44 @@ public void merge(final ColumnFamilyHandle columnFamilyHandle, columnFamilyHandle.nativeHandle_); } + /** + * Add merge operand for key/value pair. + * + * @param columnFamilyHandle {@link ColumnFamilyHandle} instance + * @param writeOpts {@link WriteOptions} for this write. + * @param key the specified key to be merged. + * @param offset the offset of the "key" array to be used, must be non-negative and + * no larger than "key".length + * @param len the length of the "key" array to be used, must be non-negative and + * @param value the value to be merged with the current value for + * the specified key. + * @param vOffset the offset of the "value" array to be used, must be non-negative and + * no longer than "key".length + * @param vLen the length of the "value" array to be used, must be non-negative and + * must be non-negative and no larger than ("value".length - offset) + * + * @throws RocksDBException thrown if error happens in underlying + * native library. + */ + public void merge(final ColumnFamilyHandle columnFamilyHandle, + final WriteOptions writeOpts, final byte[] key, int offset, int len, + final byte[] value, int vOffset, int vLen) throws RocksDBException { + checkBounds(offset, len, key.length); + checkBounds(vOffset, vLen, value.length); + merge(nativeHandle_, writeOpts.nativeHandle_, + key, offset, len, value, vOffset, vLen, + columnFamilyHandle.nativeHandle_); + } + // TODO(AR) we should improve the #get() API, returning -1 (RocksDB.NOT_FOUND) is not very nice // when we could communicate better status into, also the C++ code show that -2 could be returned /** * Get the value associated with the specified key within column family* + * * @param key the key to retrieve the value. * @param value the out-value to receive the retrieved value. + * * @return The size of the actual value that matches the specified * {@code key} in byte. If the return value is greater than the * length of {@code value}, then it indicates that the size of the @@ -706,6 +1017,35 @@ public int get(final byte[] key, final byte[] value) throws RocksDBException { return get(nativeHandle_, key, 0, key.length, value, 0, value.length); } + /** + * Get the value associated with the specified key within column family* + * + * @param key the key to retrieve the value. + * @param offset the offset of the "key" array to be used, must be non-negative and + * no larger than "key".length + * @param len the length of the "key" array to be used, must be non-negative and + * @param value the out-value to receive the retrieved value. + * @param vOffset the offset of the "value" array to be used, must be non-negative and + * no longer than "key".length + * @param vLen the length of the "value" array to be used, must be non-negative and + * must be non-negative and no larger than ("value".length - offset) + * + * @return The size of the actual value that matches the specified + * {@code key} in byte. If the return value is greater than the + * length of {@code value}, then it indicates that the size of the + * input buffer {@code value} is insufficient and partial result will + * be returned. RocksDB.NOT_FOUND will be returned if the value not + * found. + * + * @throws RocksDBException thrown if error happens in underlying + * native library. 
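+ * <p>
+ * A minimal usage sketch (illustrative only; assumes an open {@code RocksDB} named {@code db}):
+ * <pre>
+ *   byte[] keyBuf = "prefix-key-suffix".getBytes();
+ *   byte[] valBuf = new byte[128];
+ *   int size = db.get(keyBuf, 7, 3, valBuf, 0, valBuf.length);  // looks up the key "key"
+ *   // size is RocksDB.NOT_FOUND if absent; if size exceeds valBuf.length,
+ *   // only a partial value was copied into valBuf
+ * </pre>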
+ */ + public int get(final byte[] key, int offset, int len, final byte[] value, int vOffset, int vLen) throws RocksDBException { + checkBounds(offset, len, key.length); + checkBounds(vOffset, vLen, value.length); + return get(nativeHandle_, key, offset, len, value, vOffset, vLen); + } + /** * Get the value associated with the specified key within column family. * @@ -729,6 +1069,39 @@ public int get(final ColumnFamilyHandle columnFamilyHandle, final byte[] key, columnFamilyHandle.nativeHandle_); } + /** + * Get the value associated with the specified key within column family. + * + * @param columnFamilyHandle {@link org.rocksdb.ColumnFamilyHandle} + * instance + * @param key the key to retrieve the value. + * @param offset the offset of the "key" array to be used, must be non-negative and + * no larger than "key".length + * @param len the length of the "key" array to be used, must be non-negative and + * @param value the out-value to receive the retrieved value. + * @param vOffset the offset of the "value" array to be used, must be non-negative and + * no longer than "key".length + * @param vLen the length of the "value" array to be used, must be non-negative and + * must be non-negative and no larger than ("value".length - offset) + * + * @return The size of the actual value that matches the specified + * {@code key} in byte. If the return value is greater than the + * length of {@code value}, then it indicates that the size of the + * input buffer {@code value} is insufficient and partial result will + * be returned. RocksDB.NOT_FOUND will be returned if the value not + * found. + * + * @throws RocksDBException thrown if error happens in underlying + * native library. + */ + public int get(final ColumnFamilyHandle columnFamilyHandle, final byte[] key, int offset, int len, + final byte[] value, int vOffset, int vLen) throws RocksDBException, IllegalArgumentException { + checkBounds(offset, len, key.length); + checkBounds(vOffset, vLen, value.length); + return get(nativeHandle_, key, offset, len, value, vOffset, vLen, + columnFamilyHandle.nativeHandle_); + } + /** * Get the value associated with the specified key. * @@ -750,6 +1123,38 @@ public int get(final ReadOptions opt, final byte[] key, return get(nativeHandle_, opt.nativeHandle_, key, 0, key.length, value, 0, value.length); } + + /** + * Get the value associated with the specified key. + * + * @param opt {@link org.rocksdb.ReadOptions} instance. + * @param key the key to retrieve the value. + * @param offset the offset of the "key" array to be used, must be non-negative and + * no larger than "key".length + * @param len the length of the "key" array to be used, must be non-negative and + * @param value the out-value to receive the retrieved value. + * @param vOffset the offset of the "value" array to be used, must be non-negative and + * no longer than "key".length + * @param vLen the length of the "value" array to be used, must be non-negative and + * must be non-negative and no larger than ("value".length - offset) + * @return The size of the actual value that matches the specified + * {@code key} in byte. If the return value is greater than the + * length of {@code value}, then it indicates that the size of the + * input buffer {@code value} is insufficient and partial result will + * be returned. RocksDB.NOT_FOUND will be returned if the value not + * found. + * + * @throws RocksDBException thrown if error happens in underlying + * native library. 
+ */ + public int get(final ReadOptions opt, final byte[] key, int offset, int len, + final byte[] value, int vOffset, int vLen) throws RocksDBException { + checkBounds(offset, len, key.length); + checkBounds(vOffset, vLen, value.length); + return get(nativeHandle_, opt.nativeHandle_, + key, offset, len, value, vOffset, vLen); + } + /** * Get the value associated with the specified key within column family. * @@ -775,6 +1180,40 @@ public int get(final ColumnFamilyHandle columnFamilyHandle, 0, value.length, columnFamilyHandle.nativeHandle_); } + /** + * Get the value associated with the specified key within column family. + * + * @param columnFamilyHandle {@link org.rocksdb.ColumnFamilyHandle} + * instance + * @param opt {@link org.rocksdb.ReadOptions} instance. + * @param key the key to retrieve the value. + * @param offset the offset of the "key" array to be used, must be non-negative and + * no larger than "key".length + * @param len the length of the "key" array to be used, must be non-negative and + * @param value the out-value to receive the retrieved value. + * @param vOffset the offset of the "value" array to be used, must be non-negative and + * no longer than "key".length + * @param vLen the length of the "value" array to be used, must be non-negative and + * must be non-negative and no larger than ("value".length - offset) + * @return The size of the actual value that matches the specified + * {@code key} in byte. If the return value is greater than the + * length of {@code value}, then it indicates that the size of the + * input buffer {@code value} is insufficient and partial result will + * be returned. RocksDB.NOT_FOUND will be returned if the value not + * found. + * + * @throws RocksDBException thrown if error happens in underlying + * native library. + */ + public int get(final ColumnFamilyHandle columnFamilyHandle, + final ReadOptions opt, final byte[] key, int offset, int len, final byte[] value, int vOffset, int vLen) + throws RocksDBException { + checkBounds(offset, len, key.length); + checkBounds(vOffset, vLen, value.length); + return get(nativeHandle_, opt.nativeHandle_, key, offset, len, value, + vOffset, vLen, columnFamilyHandle.nativeHandle_); + } + /** * The simplified version of get which returns a new byte array storing * the value associated with the specified input key if any. null will be @@ -791,6 +1230,26 @@ public byte[] get(final byte[] key) throws RocksDBException { return get(nativeHandle_, key, 0, key.length); } + /** + * The simplified version of get which returns a new byte array storing + * the value associated with the specified input key if any. null will be + * returned if the specified key is not found. + * + * @param key the key retrieve the value. + * @param offset the offset of the "key" array to be used, must be non-negative and + * no larger than "key".length + * @param len the length of the "key" array to be used, must be non-negative and + * @return a byte array storing the value associated with the input key if + * any. null if it does not find the specified key. + * + * @throws RocksDBException thrown if error happens in underlying + * native library. + */ + public byte[] get(final byte[] key, int offset, int len) throws RocksDBException { + checkBounds(offset, len, key.length); + return get(nativeHandle_, key, offset, len); + } + /** * The simplified version of get which returns a new byte array storing * the value associated with the specified input key if any. 
null will be @@ -811,6 +1270,30 @@ public byte[] get(final ColumnFamilyHandle columnFamilyHandle, columnFamilyHandle.nativeHandle_); } + /** + * The simplified version of get which returns a new byte array storing + * the value associated with the specified input key if any. null will be + * returned if the specified key is not found. + * + * @param columnFamilyHandle {@link org.rocksdb.ColumnFamilyHandle} + * instance + * @param key the key retrieve the value. + * @param offset the offset of the "key" array to be used, must be non-negative and + * no larger than "key".length + * @param len the length of the "key" array to be used, must be non-negative and + * @return a byte array storing the value associated with the input key if + * any. null if it does not find the specified key. + * + * @throws RocksDBException thrown if error happens in underlying + * native library. + */ + public byte[] get(final ColumnFamilyHandle columnFamilyHandle, + final byte[] key, int offset, int len) throws RocksDBException { + checkBounds(offset, len, key.length); + return get(nativeHandle_, key, offset, len, + columnFamilyHandle.nativeHandle_); + } + /** * The simplified version of get which returns a new byte array storing * the value associated with the specified input key if any. null will be @@ -829,6 +1312,28 @@ public byte[] get(final ReadOptions opt, final byte[] key) return get(nativeHandle_, opt.nativeHandle_, key, 0, key.length); } + /** + * The simplified version of get which returns a new byte array storing + * the value associated with the specified input key if any. null will be + * returned if the specified key is not found. + * + * @param key the key retrieve the value. + * @param offset the offset of the "key" array to be used, must be non-negative and + * no larger than "key".length + * @param len the length of the "key" array to be used, must be non-negative and + * @param opt Read options. + * @return a byte array storing the value associated with the input key if + * any. null if it does not find the specified key. + * + * @throws RocksDBException thrown if error happens in underlying + * native library. + */ + public byte[] get(final ReadOptions opt, final byte[] key, int offset, int len) + throws RocksDBException { + checkBounds(offset, len, key.length); + return get(nativeHandle_, opt.nativeHandle_, key, offset, len); + } + /** * The simplified version of get which returns a new byte array storing * the value associated with the specified input key if any. null will be @@ -850,6 +1355,31 @@ public byte[] get(final ColumnFamilyHandle columnFamilyHandle, columnFamilyHandle.nativeHandle_); } + /** + * The simplified version of get which returns a new byte array storing + * the value associated with the specified input key if any. null will be + * returned if the specified key is not found. + * + * @param columnFamilyHandle {@link org.rocksdb.ColumnFamilyHandle} + * instance + * @param key the key retrieve the value. + * @param offset the offset of the "key" array to be used, must be non-negative and + * no larger than "key".length + * @param len the length of the "key" array to be used, must be non-negative and + * @param opt Read options. + * @return a byte array storing the value associated with the input key if + * any. null if it does not find the specified key. + * + * @throws RocksDBException thrown if error happens in underlying + * native library. 
+ */ + public byte[] get(final ColumnFamilyHandle columnFamilyHandle, + final ReadOptions opt, final byte[] key, int offset, int len) throws RocksDBException { + checkBounds(offset, len, key.length); + return get(nativeHandle_, opt.nativeHandle_, key, offset, len, + columnFamilyHandle.nativeHandle_); + } + /** * Returns a map of keys for which values were found in DB. * @@ -1073,6 +1603,23 @@ public void delete(final byte[] key) throws RocksDBException { delete(nativeHandle_, key, 0, key.length); } + /** + * Delete the database entry (if any) for "key". Returns OK on + * success, and a non-OK status on error. It is not an error if "key" + * did not exist in the database. + * + * @param key Key to delete within database + * @param offset the offset of the "key" array to be used, must be non-negative and + * no larger than "key".length + * @param len the length of the "key" array to be used, must be non-negative and + * + * @throws RocksDBException thrown if error happens in underlying + * native library. + */ + public void delete(final byte[] key, int offset, int len) throws RocksDBException { + delete(nativeHandle_, key, offset, len); + } + /** * Remove the database entry (if any) for "key". Returns OK on * success, and a non-OK status on error. It is not an error if "key" @@ -1110,6 +1657,26 @@ public void delete(final ColumnFamilyHandle columnFamilyHandle, delete(nativeHandle_, key, 0, key.length, columnFamilyHandle.nativeHandle_); } + /** + * Delete the database entry (if any) for "key". Returns OK on + * success, and a non-OK status on error. It is not an error if "key" + * did not exist in the database. + * + * @param columnFamilyHandle {@link org.rocksdb.ColumnFamilyHandle} + * instance + * @param key Key to delete within database + * @param offset the offset of the "key" array to be used, must be non-negative and + * no larger than "key".length + * @param len the length of the "key" array to be used, must be non-negative and + * + * @throws RocksDBException thrown if error happens in underlying + * native library. + */ + public void delete(final ColumnFamilyHandle columnFamilyHandle, + final byte[] key, int offset, int len) throws RocksDBException { + delete(nativeHandle_, key, offset, len, columnFamilyHandle.nativeHandle_); + } + /** * Remove the database entry (if any) for "key". Returns OK on * success, and a non-OK status on error. It is not an error if "key" @@ -1145,6 +1712,25 @@ public void delete(final WriteOptions writeOpt, final byte[] key) delete(nativeHandle_, writeOpt.nativeHandle_, key, 0, key.length); } + /** + * Delete the database entry (if any) for "key". Returns OK on + * success, and a non-OK status on error. It is not an error if "key" + * did not exist in the database. + * + * @param writeOpt WriteOptions to be used with delete operation + * @param key Key to delete within database + * @param offset the offset of the "key" array to be used, must be non-negative and + * no larger than "key".length + * @param len the length of the "key" array to be used, must be non-negative and + * + * @throws RocksDBException thrown if error happens in underlying + * native library. + */ + public void delete(final WriteOptions writeOpt, final byte[] key, int offset, int len) + throws RocksDBException { + delete(nativeHandle_, writeOpt.nativeHandle_, key, offset, len); + } + /** * Remove the database entry (if any) for "key". Returns OK on * success, and a non-OK status on error. 
It is not an error if "key" @@ -1187,6 +1773,29 @@ public void delete(final ColumnFamilyHandle columnFamilyHandle, columnFamilyHandle.nativeHandle_); } + /** + * Delete the database entry (if any) for "key". Returns OK on + * success, and a non-OK status on error. It is not an error if "key" + * did not exist in the database. + * + * @param columnFamilyHandle {@link org.rocksdb.ColumnFamilyHandle} + * instance + * @param writeOpt WriteOptions to be used with delete operation + * @param key Key to delete within database + * @param offset the offset of the "key" array to be used, must be non-negative and + * no larger than "key".length + * @param len the length of the "key" array to be used, must be non-negative and + * no larger than ("key".length - offset) + * + * @throws RocksDBException thrown if error happens in underlying + * native library. + */ + public void delete(final ColumnFamilyHandle columnFamilyHandle, + final WriteOptions writeOpt, final byte[] key, int offset, int len) + throws RocksDBException { + delete(nativeHandle_, writeOpt.nativeHandle_, key, offset, len, + columnFamilyHandle.nativeHandle_); + } + /** * Remove the database entry for {@code key}. Requires that the key exists * and was not overwritten. It is not an error if the key did not exist
diff --git a/java/src/main/java/org/rocksdb/StatisticsCollector.java b/java/src/main/java/org/rocksdb/StatisticsCollector.java index 48cf8af88e6..fb3f57150f0 100644 --- a/java/src/main/java/org/rocksdb/StatisticsCollector.java +++ b/java/src/main/java/org/rocksdb/StatisticsCollector.java @@ -93,9 +93,9 @@ public void run() { statsCallback.histogramCallback(histogramType, histogramData); } } - - Thread.sleep(_statsCollectionInterval); } + + Thread.sleep(_statsCollectionInterval); } catch (final InterruptedException e) { Thread.currentThread().interrupt();
diff --git a/java/src/main/java/org/rocksdb/TickerType.java b/java/src/main/java/org/rocksdb/TickerType.java index fdcf62ff8a5..08ed18fb3eb 100644 --- a/java/src/main/java/org/rocksdb/TickerType.java +++ b/java/src/main/java/org/rocksdb/TickerType.java @@ -304,9 +304,9 @@ public enum TickerType { RATE_LIMIT_DELAY_MILLIS((byte) 0x37), /** - * Number of iterators currently open. + * Number of iterators created. */ - NO_ITERATORS((byte) 0x38), + NO_ITERATOR_CREATED((byte) 0x38), /** * Number of MultiGet calls. @@ -475,7 +475,12 @@ public enum TickerType { */ NUMBER_MULTIGET_KEYS_FOUND((byte) 0x5E), - TICKER_ENUM_MAX((byte) 0x5F); + /** + * Number of iterators deleted. + */ + NO_ITERATOR_DELETED((byte) 0x5F), + + TICKER_ENUM_MAX((byte) 0x60); private final byte value;
diff --git a/java/src/main/java/org/rocksdb/UInt64AddOperator.java b/java/src/main/java/org/rocksdb/UInt64AddOperator.java new file mode 100644 index 00000000000..cce9b298d8a --- /dev/null +++ b/java/src/main/java/org/rocksdb/UInt64AddOperator.java @@ -0,0 +1,19 @@ +// Copyright (c) 2011-present, Facebook, Inc. All rights reserved. +// This source code is licensed under both the GPLv2 (found in the +// COPYING file in the root directory) and Apache 2.0 License +// (found in the LICENSE.Apache file in the root directory). + +package org.rocksdb; + +/** + * UInt64AddOperator is a merge operator that accumulates a long + * integer value.
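+ * <p>
+ * A minimal usage sketch (illustrative only; it assumes an {@code Options} instance created
+ * with {@code setMergeOperator(new UInt64AddOperator())}, an open {@code RocksDB} named
+ * {@code db}, and long values encoded as 8-byte arrays via {@code java.nio.ByteBuffer},
+ * as in the accompanying MergeTest):
+ * <pre>
+ *   db.put("counter".getBytes(), ByteBuffer.allocate(8).putLong(100).array());
+ *   db.merge("counter".getBytes(), ByteBuffer.allocate(8).putLong(1).array());
+ *   // db.get("counter".getBytes()) now holds the encoded value 101
+ * </pre>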
+ */ +public class UInt64AddOperator extends MergeOperator { + public UInt64AddOperator() { + super(newSharedUInt64AddOperator()); + } + + private native static long newSharedUInt64AddOperator(); + @Override protected final native void disposeInternal(final long handle); +} diff --git a/java/src/main/java/org/rocksdb/WriteBufferManager.java b/java/src/main/java/org/rocksdb/WriteBufferManager.java new file mode 100644 index 00000000000..a5f80644fb5 --- /dev/null +++ b/java/src/main/java/org/rocksdb/WriteBufferManager.java @@ -0,0 +1,30 @@ +package org.rocksdb; + +import org.rocksdb.Cache; + +/** + * Java wrapper over native write_buffer_manager class + */ +public class WriteBufferManager extends RocksObject { + static { + RocksDB.loadLibrary(); + } + + /** + * Construct a new instance of WriteBufferManager. + * + * Check + * https://github.com/facebook/rocksdb/wiki/Write-Buffer-Manager + * for more details on when to use it + * + * @param bufferSizeBytes buffer size(in bytes) to use for native write_buffer_manager + * @param cache cache whose memory should be bounded by this write buffer manager + */ + public WriteBufferManager(final long bufferSizeBytes, final Cache cache){ + super(newWriteBufferManager(bufferSizeBytes, cache.nativeHandle_)); + } + + private native static long newWriteBufferManager(final long bufferSizeBytes, final long cacheHandle); + @Override + protected native void disposeInternal(final long handle); +} diff --git a/java/src/test/java/org/rocksdb/BlockBasedTableConfigTest.java b/java/src/test/java/org/rocksdb/BlockBasedTableConfigTest.java index 2b15b69f812..754cf11c039 100644 --- a/java/src/test/java/org/rocksdb/BlockBasedTableConfigTest.java +++ b/java/src/test/java/org/rocksdb/BlockBasedTableConfigTest.java @@ -95,6 +95,46 @@ public void cacheIndexAndFilterBlocks() { } + @Test + public void cacheIndexAndFilterBlocksWithHighPriority() { + BlockBasedTableConfig blockBasedTableConfig = new BlockBasedTableConfig(); + blockBasedTableConfig.setCacheIndexAndFilterBlocksWithHighPriority(true); + assertThat(blockBasedTableConfig.cacheIndexAndFilterBlocksWithHighPriority()). + isTrue(); + } + + @Test + public void pinL0FilterAndIndexBlocksInCache() { + BlockBasedTableConfig blockBasedTableConfig = new BlockBasedTableConfig(); + blockBasedTableConfig.setPinL0FilterAndIndexBlocksInCache(true); + assertThat(blockBasedTableConfig.pinL0FilterAndIndexBlocksInCache()). + isTrue(); + } + + @Test + public void partitionFilters() { + BlockBasedTableConfig blockBasedTableConfig = new BlockBasedTableConfig(); + blockBasedTableConfig.setPartitionFilters(true); + assertThat(blockBasedTableConfig.partitionFilters()). + isTrue(); + } + + @Test + public void metadataBlockSize() { + BlockBasedTableConfig blockBasedTableConfig = new BlockBasedTableConfig(); + blockBasedTableConfig.setMetadataBlockSize(1024); + assertThat(blockBasedTableConfig.metadataBlockSize()). + isEqualTo(1024); + } + + @Test + public void pinTopLevelIndexAndFilter() { + BlockBasedTableConfig blockBasedTableConfig = new BlockBasedTableConfig(); + blockBasedTableConfig.setPinTopLevelIndexAndFilter(false); + assertThat(blockBasedTableConfig.pinTopLevelIndexAndFilter()). 
+ isFalse(); + } + @Test public void hashIndexAllowCollision() { BlockBasedTableConfig blockBasedTableConfig = new BlockBasedTableConfig(); diff --git a/java/src/test/java/org/rocksdb/CompactionOptionsFIFOTest.java b/java/src/test/java/org/rocksdb/CompactionOptionsFIFOTest.java index 370a28e8196..df4c98ec14c 100644 --- a/java/src/test/java/org/rocksdb/CompactionOptionsFIFOTest.java +++ b/java/src/test/java/org/rocksdb/CompactionOptionsFIFOTest.java @@ -18,9 +18,27 @@ public class CompactionOptionsFIFOTest { @Test public void maxTableFilesSize() { final long size = 500 * 1024 * 1026; - try(final CompactionOptionsFIFO opt = new CompactionOptionsFIFO()) { + try (final CompactionOptionsFIFO opt = new CompactionOptionsFIFO()) { opt.setMaxTableFilesSize(size); assertThat(opt.maxTableFilesSize()).isEqualTo(size); } } + + @Test + public void ttl() { + final long ttl = 7 * 24 * 60 * 60; // 7 days + try (final CompactionOptionsFIFO opt = new CompactionOptionsFIFO()) { + opt.setTtl(ttl); + assertThat(opt.ttl()).isEqualTo(ttl); + } + } + + @Test + public void allowCompaction() { + final boolean allowCompaction = true; + try (final CompactionOptionsFIFO opt = new CompactionOptionsFIFO()) { + opt.setAllowCompaction(allowCompaction); + assertThat(opt.allowCompaction()).isEqualTo(allowCompaction); + } + } } diff --git a/java/src/test/java/org/rocksdb/DBOptionsTest.java b/java/src/test/java/org/rocksdb/DBOptionsTest.java index 453639d5744..bad01c4354b 100644 --- a/java/src/test/java/org/rocksdb/DBOptionsTest.java +++ b/java/src/test/java/org/rocksdb/DBOptionsTest.java @@ -424,6 +424,26 @@ public void dbWriteBufferSize() { } } + @Test + public void setWriteBufferManager() throws RocksDBException { + try (final DBOptions opt = new DBOptions(); + final Cache cache = new LRUCache(1 * 1024 * 1024); + final WriteBufferManager writeBufferManager = new WriteBufferManager(2000l, cache)) { + opt.setWriteBufferManager(writeBufferManager); + assertThat(opt.writeBufferManager()).isEqualTo(writeBufferManager); + } + } + + @Test + public void setWriteBufferManagerWithZeroBufferSize() throws RocksDBException { + try (final DBOptions opt = new DBOptions(); + final Cache cache = new LRUCache(1 * 1024 * 1024); + final WriteBufferManager writeBufferManager = new WriteBufferManager(0l, cache)) { + opt.setWriteBufferManager(writeBufferManager); + assertThat(opt.writeBufferManager()).isEqualTo(writeBufferManager); + } + } + @Test public void accessHintOnCompactionStart() { try(final DBOptions opt = new DBOptions()) { diff --git a/java/src/test/java/org/rocksdb/KeyMayExistTest.java b/java/src/test/java/org/rocksdb/KeyMayExistTest.java index 8092270eb2d..577fe2eadfe 100644 --- a/java/src/test/java/org/rocksdb/KeyMayExistTest.java +++ b/java/src/test/java/org/rocksdb/KeyMayExistTest.java @@ -48,12 +48,33 @@ public void keyMayExist() throws RocksDBException { assertThat(exists).isTrue(); assertThat(retValue.toString()).isEqualTo("value"); + // Slice key + StringBuilder builder = new StringBuilder("prefix"); + int offset = builder.toString().length(); + builder.append("slice key 0"); + int len = builder.toString().length() - offset; + builder.append("suffix"); + + byte[] sliceKey = builder.toString().getBytes(); + byte[] sliceValue = "slice value 0".getBytes(); + db.put(sliceKey, offset, len, sliceValue, 0, sliceValue.length); + + retValue = new StringBuilder(); + exists = db.keyMayExist(sliceKey, offset, len, retValue); + assertThat(exists).isTrue(); + assertThat(retValue.toString().getBytes()).isEqualTo(sliceValue); + // Test 
without column family but with readOptions try (final ReadOptions readOptions = new ReadOptions()) { retValue = new StringBuilder(); exists = db.keyMayExist(readOptions, "key".getBytes(), retValue); assertThat(exists).isTrue(); assertThat(retValue.toString()).isEqualTo("value"); + + retValue = new StringBuilder(); + exists = db.keyMayExist(readOptions, sliceKey, offset, len, retValue); + assertThat(exists).isTrue(); + assertThat(retValue.toString().getBytes()).isEqualTo(sliceValue); } // Test with column family @@ -63,6 +84,13 @@ public void keyMayExist() throws RocksDBException { assertThat(exists).isTrue(); assertThat(retValue.toString()).isEqualTo("value"); + // Test slice sky with column family + retValue = new StringBuilder(); + exists = db.keyMayExist(columnFamilyHandleList.get(0), sliceKey, offset, len, + retValue); + assertThat(exists).isTrue(); + assertThat(retValue.toString().getBytes()).isEqualTo(sliceValue); + // Test with column family and readOptions try (final ReadOptions readOptions = new ReadOptions()) { retValue = new StringBuilder(); @@ -71,11 +99,23 @@ public void keyMayExist() throws RocksDBException { retValue); assertThat(exists).isTrue(); assertThat(retValue.toString()).isEqualTo("value"); + + // Test slice key with column family and read options + retValue = new StringBuilder(); + exists = db.keyMayExist(readOptions, + columnFamilyHandleList.get(0), sliceKey, offset, len, + retValue); + assertThat(exists).isTrue(); + assertThat(retValue.toString().getBytes()).isEqualTo(sliceValue); } // KeyMayExist in CF1 must return false assertThat(db.keyMayExist(columnFamilyHandleList.get(1), "key".getBytes(), retValue)).isFalse(); + + // slice key + assertThat(db.keyMayExist(columnFamilyHandleList.get(1), + sliceKey, 1, 3, retValue)).isFalse(); } finally { for (final ColumnFamilyHandle columnFamilyHandle : columnFamilyHandleList) { diff --git a/java/src/test/java/org/rocksdb/MemoryUtilTest.java b/java/src/test/java/org/rocksdb/MemoryUtilTest.java new file mode 100644 index 00000000000..73fcc87c32e --- /dev/null +++ b/java/src/test/java/org/rocksdb/MemoryUtilTest.java @@ -0,0 +1,143 @@ +// Copyright (c) 2011-present, Facebook, Inc. All rights reserved. +// This source code is licensed under both the GPLv2 (found in the +// COPYING file in the root directory) and Apache 2.0 License +// (found in the LICENSE.Apache file in the root directory). 
+ +package org.rocksdb; + +import org.junit.ClassRule; +import org.junit.Rule; +import org.junit.Test; +import org.junit.rules.TemporaryFolder; + +import java.nio.charset.StandardCharsets; +import java.util.*; + +import static org.assertj.core.api.Assertions.assertThat; + +public class MemoryUtilTest { + + private static final String MEMTABLE_SIZE = "rocksdb.size-all-mem-tables"; + private static final String UNFLUSHED_MEMTABLE_SIZE = "rocksdb.cur-size-all-mem-tables"; + private static final String TABLE_READERS = "rocksdb.estimate-table-readers-mem"; + + private final byte[] key = "some-key".getBytes(StandardCharsets.UTF_8); + private final byte[] value = "some-value".getBytes(StandardCharsets.UTF_8); + + @ClassRule + public static final RocksMemoryResource rocksMemoryResource = + new RocksMemoryResource(); + + @Rule public TemporaryFolder dbFolder1 = new TemporaryFolder(); + @Rule public TemporaryFolder dbFolder2 = new TemporaryFolder(); + + /** + * Test MemoryUtil.getApproximateMemoryUsageByType before and after a put + get + */ + @Test + public void getApproximateMemoryUsageByType() throws RocksDBException { + try (final Cache cache = new LRUCache(8 * 1024 * 1024); + final Options options = + new Options() + .setCreateIfMissing(true) + .setTableFormatConfig(new BlockBasedTableConfig().setBlockCache(cache)); + final FlushOptions flushOptions = + new FlushOptions().setWaitForFlush(true); + final RocksDB db = + RocksDB.open(options, dbFolder1.getRoot().getAbsolutePath())) { + + List dbs = new ArrayList<>(1); + dbs.add(db); + Set caches = new HashSet<>(1); + caches.add(cache); + Map usage = MemoryUtil.getApproximateMemoryUsageByType(dbs, caches); + + assertThat(usage.get(MemoryUsageType.kMemTableTotal)).isEqualTo( + db.getAggregatedLongProperty(MEMTABLE_SIZE)); + assertThat(usage.get(MemoryUsageType.kMemTableUnFlushed)).isEqualTo( + db.getAggregatedLongProperty(UNFLUSHED_MEMTABLE_SIZE)); + assertThat(usage.get(MemoryUsageType.kTableReadersTotal)).isEqualTo( + db.getAggregatedLongProperty(TABLE_READERS)); + assertThat(usage.get(MemoryUsageType.kCacheTotal)).isEqualTo(0); + + db.put(key, value); + db.flush(flushOptions); + db.get(key); + + usage = MemoryUtil.getApproximateMemoryUsageByType(dbs, caches); + assertThat(usage.get(MemoryUsageType.kMemTableTotal)).isGreaterThan(0); + assertThat(usage.get(MemoryUsageType.kMemTableTotal)).isEqualTo( + db.getAggregatedLongProperty(MEMTABLE_SIZE)); + assertThat(usage.get(MemoryUsageType.kMemTableUnFlushed)).isGreaterThan(0); + assertThat(usage.get(MemoryUsageType.kMemTableUnFlushed)).isEqualTo( + db.getAggregatedLongProperty(UNFLUSHED_MEMTABLE_SIZE)); + assertThat(usage.get(MemoryUsageType.kTableReadersTotal)).isGreaterThan(0); + assertThat(usage.get(MemoryUsageType.kTableReadersTotal)).isEqualTo( + db.getAggregatedLongProperty(TABLE_READERS)); + assertThat(usage.get(MemoryUsageType.kCacheTotal)).isGreaterThan(0); + + } + } + + /** + * Test MemoryUtil.getApproximateMemoryUsageByType with null inputs + */ + @Test + public void getApproximateMemoryUsageByTypeNulls() throws RocksDBException { + Map usage = MemoryUtil.getApproximateMemoryUsageByType(null, null); + + assertThat(usage.get(MemoryUsageType.kMemTableTotal)).isEqualTo(null); + assertThat(usage.get(MemoryUsageType.kMemTableUnFlushed)).isEqualTo(null); + assertThat(usage.get(MemoryUsageType.kTableReadersTotal)).isEqualTo(null); + assertThat(usage.get(MemoryUsageType.kCacheTotal)).isEqualTo(null); + } + + /** + * Test MemoryUtil.getApproximateMemoryUsageByType with two DBs and two caches + */ + 
@Test + public void getApproximateMemoryUsageByTypeMultiple() throws RocksDBException { + try (final Cache cache1 = new LRUCache(1 * 1024 * 1024); + final Options options1 = + new Options() + .setCreateIfMissing(true) + .setTableFormatConfig(new BlockBasedTableConfig().setBlockCache(cache1)); + final RocksDB db1 = + RocksDB.open(options1, dbFolder1.getRoot().getAbsolutePath()); + final Cache cache2 = new LRUCache(1 * 1024 * 1024); + final Options options2 = + new Options() + .setCreateIfMissing(true) + .setTableFormatConfig(new BlockBasedTableConfig().setBlockCache(cache2)); + final RocksDB db2 = + RocksDB.open(options2, dbFolder2.getRoot().getAbsolutePath()); + final FlushOptions flushOptions = + new FlushOptions().setWaitForFlush(true); + + ) { + List dbs = new ArrayList<>(1); + dbs.add(db1); + dbs.add(db2); + Set caches = new HashSet<>(1); + caches.add(cache1); + caches.add(cache2); + + for (RocksDB db: dbs) { + db.put(key, value); + db.flush(flushOptions); + db.get(key); + } + + Map usage = MemoryUtil.getApproximateMemoryUsageByType(dbs, caches); + assertThat(usage.get(MemoryUsageType.kMemTableTotal)).isEqualTo( + db1.getAggregatedLongProperty(MEMTABLE_SIZE) + db2.getAggregatedLongProperty(MEMTABLE_SIZE)); + assertThat(usage.get(MemoryUsageType.kMemTableUnFlushed)).isEqualTo( + db1.getAggregatedLongProperty(UNFLUSHED_MEMTABLE_SIZE) + db2.getAggregatedLongProperty(UNFLUSHED_MEMTABLE_SIZE)); + assertThat(usage.get(MemoryUsageType.kTableReadersTotal)).isEqualTo( + db1.getAggregatedLongProperty(TABLE_READERS) + db2.getAggregatedLongProperty(TABLE_READERS)); + assertThat(usage.get(MemoryUsageType.kCacheTotal)).isGreaterThan(0); + + } + } + +} diff --git a/java/src/test/java/org/rocksdb/MergeTest.java b/java/src/test/java/org/rocksdb/MergeTest.java index 73b90869cf1..b2ec62635a1 100644 --- a/java/src/test/java/org/rocksdb/MergeTest.java +++ b/java/src/test/java/org/rocksdb/MergeTest.java @@ -5,6 +5,7 @@ package org.rocksdb; +import java.nio.ByteBuffer; import java.util.Arrays; import java.util.List; import java.util.ArrayList; @@ -44,6 +45,38 @@ public void stringOption() } } + private byte[] longToByteArray(long l) { + ByteBuffer buf = ByteBuffer.allocate(Long.BYTES); + buf.putLong(l); + return buf.array(); + } + + private long longFromByteArray(byte[] a) { + ByteBuffer buf = ByteBuffer.allocate(Long.BYTES); + buf.put(a); + buf.flip(); + return buf.getLong(); + } + + @Test + public void uint64AddOption() + throws InterruptedException, RocksDBException { + try (final Options opt = new Options() + .setCreateIfMissing(true) + .setMergeOperatorName("uint64add"); + final RocksDB db = RocksDB.open(opt, + dbFolder.getRoot().getAbsolutePath())) { + // writing (long)100 under key + db.put("key".getBytes(), longToByteArray(100)); + // merge (long)1 under key + db.merge("key".getBytes(), longToByteArray(1)); + + final byte[] value = db.get("key".getBytes()); + final long longValue = longFromByteArray(value); + assertThat(longValue).isEqualTo(101); + } + } + @Test public void cFStringOption() throws InterruptedException, RocksDBException { @@ -86,6 +119,48 @@ public void cFStringOption() } } + @Test + public void cFUInt64AddOption() + throws InterruptedException, RocksDBException { + + try (final ColumnFamilyOptions cfOpt1 = new ColumnFamilyOptions() + .setMergeOperatorName("uint64add"); + final ColumnFamilyOptions cfOpt2 = new ColumnFamilyOptions() + .setMergeOperatorName("uint64add") + ) { + final List cfDescriptors = Arrays.asList( + new ColumnFamilyDescriptor(RocksDB.DEFAULT_COLUMN_FAMILY, cfOpt1), 
+ new ColumnFamilyDescriptor(RocksDB.DEFAULT_COLUMN_FAMILY, cfOpt2) + ); + + final List columnFamilyHandleList = new ArrayList<>(); + try (final DBOptions opt = new DBOptions() + .setCreateIfMissing(true) + .setCreateMissingColumnFamilies(true); + final RocksDB db = RocksDB.open(opt, + dbFolder.getRoot().getAbsolutePath(), cfDescriptors, + columnFamilyHandleList)) { + try { + // writing (long)100 under key + db.put(columnFamilyHandleList.get(1), + "cfkey".getBytes(), longToByteArray(100)); + // merge (long)1 under key + db.merge(columnFamilyHandleList.get(1), + "cfkey".getBytes(), longToByteArray(1)); + + byte[] value = db.get(columnFamilyHandleList.get(1), + "cfkey".getBytes()); + long longValue = longFromByteArray(value); + assertThat(longValue).isEqualTo(101); + } finally { + for (final ColumnFamilyHandle handle : columnFamilyHandleList) { + handle.close(); + } + } + } + } + } + @Test public void operatorOption() throws InterruptedException, RocksDBException { @@ -108,6 +183,28 @@ public void operatorOption() } } + @Test + public void uint64AddOperatorOption() + throws InterruptedException, RocksDBException { + try (final UInt64AddOperator uint64AddOperator = new UInt64AddOperator(); + final Options opt = new Options() + .setCreateIfMissing(true) + .setMergeOperator(uint64AddOperator); + final RocksDB db = RocksDB.open(opt, + dbFolder.getRoot().getAbsolutePath())) { + // Writing (long)100 under key + db.put("key".getBytes(), longToByteArray(100)); + + // Writing (long)1 under key + db.merge("key".getBytes(), longToByteArray(1)); + + final byte[] value = db.get("key".getBytes()); + final long longValue = longFromByteArray(value); + + assertThat(longValue).isEqualTo(101); + } + } + @Test public void cFOperatorOption() throws InterruptedException, RocksDBException { @@ -170,6 +267,68 @@ public void cFOperatorOption() } } + @Test + public void cFUInt64AddOperatorOption() + throws InterruptedException, RocksDBException { + try (final UInt64AddOperator uint64AddOperator = new UInt64AddOperator(); + final ColumnFamilyOptions cfOpt1 = new ColumnFamilyOptions() + .setMergeOperator(uint64AddOperator); + final ColumnFamilyOptions cfOpt2 = new ColumnFamilyOptions() + .setMergeOperator(uint64AddOperator) + ) { + final List cfDescriptors = Arrays.asList( + new ColumnFamilyDescriptor(RocksDB.DEFAULT_COLUMN_FAMILY, cfOpt1), + new ColumnFamilyDescriptor("new_cf".getBytes(), cfOpt2) + ); + final List columnFamilyHandleList = new ArrayList<>(); + try (final DBOptions opt = new DBOptions() + .setCreateIfMissing(true) + .setCreateMissingColumnFamilies(true); + final RocksDB db = RocksDB.open(opt, + dbFolder.getRoot().getAbsolutePath(), cfDescriptors, + columnFamilyHandleList) + ) { + try { + // writing (long)100 under key + db.put(columnFamilyHandleList.get(1), + "cfkey".getBytes(), longToByteArray(100)); + // merge (long)1 under key + db.merge(columnFamilyHandleList.get(1), + "cfkey".getBytes(), longToByteArray(1)); + byte[] value = db.get(columnFamilyHandleList.get(1), + "cfkey".getBytes()); + long longValue = longFromByteArray(value); + + // Test also with createColumnFamily + try (final ColumnFamilyOptions cfHandleOpts = + new ColumnFamilyOptions() + .setMergeOperator(uint64AddOperator); + final ColumnFamilyHandle cfHandle = + db.createColumnFamily( + new ColumnFamilyDescriptor("new_cf2".getBytes(), + cfHandleOpts)) + ) { + // writing (long)200 under cfkey2 + db.put(cfHandle, "cfkey2".getBytes(), longToByteArray(200)); + // merge (long)50 under cfkey2 + db.merge(cfHandle, new WriteOptions(), 
"cfkey2".getBytes(), + longToByteArray(50)); + value = db.get(cfHandle, "cfkey2".getBytes()); + long longValueTmpCf = longFromByteArray(value); + + assertThat(longValue).isEqualTo(101); + assertThat(longValueTmpCf).isEqualTo(250); + } + } finally { + for (final ColumnFamilyHandle columnFamilyHandle : + columnFamilyHandleList) { + columnFamilyHandle.close(); + } + } + } + } + } + @Test public void operatorGcBehaviour() throws RocksDBException { @@ -182,7 +341,6 @@ public void operatorGcBehaviour() //no-op } - // test reuse try (final Options opt = new Options() .setMergeOperator(stringAppendOperator); @@ -213,6 +371,48 @@ public void operatorGcBehaviour() } } + @Test + public void uint64AddOperatorGcBehaviour() + throws RocksDBException { + try (final UInt64AddOperator uint64AddOperator = new UInt64AddOperator()) { + try (final Options opt = new Options() + .setCreateIfMissing(true) + .setMergeOperator(uint64AddOperator); + final RocksDB db = RocksDB.open(opt, + dbFolder.getRoot().getAbsolutePath())) { + //no-op + } + + // test reuse + try (final Options opt = new Options() + .setMergeOperator(uint64AddOperator); + final RocksDB db = RocksDB.open(opt, + dbFolder.getRoot().getAbsolutePath())) { + //no-op + } + + // test param init + try (final UInt64AddOperator uint64AddOperator2 = new UInt64AddOperator(); + final Options opt = new Options() + .setMergeOperator(uint64AddOperator2); + final RocksDB db = RocksDB.open(opt, + dbFolder.getRoot().getAbsolutePath())) { + //no-op + } + + // test replace one with another merge operator instance + try (final Options opt = new Options() + .setMergeOperator(uint64AddOperator); + final UInt64AddOperator newUInt64AddOperator = new UInt64AddOperator()) { + opt.setMergeOperator(newUInt64AddOperator); + try (final RocksDB db = RocksDB.open(opt, + dbFolder.getRoot().getAbsolutePath())) { + //no-op + } + } + } + } + @Test public void emptyStringInSetMergeOperatorByName() { try (final Options opt = new Options() diff --git a/java/src/test/java/org/rocksdb/OptionsTest.java b/java/src/test/java/org/rocksdb/OptionsTest.java index 7f7679d732c..2571c3e26fb 100644 --- a/java/src/test/java/org/rocksdb/OptionsTest.java +++ b/java/src/test/java/org/rocksdb/OptionsTest.java @@ -645,6 +645,26 @@ public void dbWriteBufferSize() { } } + @Test + public void setWriteBufferManager() throws RocksDBException { + try (final Options opt = new Options(); + final Cache cache = new LRUCache(1 * 1024 * 1024); + final WriteBufferManager writeBufferManager = new WriteBufferManager(2000l, cache)) { + opt.setWriteBufferManager(writeBufferManager); + assertThat(opt.writeBufferManager()).isEqualTo(writeBufferManager); + } + } + + @Test + public void setWriteBufferManagerWithZeroBufferSize() throws RocksDBException { + try (final Options opt = new Options(); + final Cache cache = new LRUCache(1 * 1024 * 1024); + final WriteBufferManager writeBufferManager = new WriteBufferManager(0l, cache)) { + opt.setWriteBufferManager(writeBufferManager); + assertThat(opt.writeBufferManager()).isEqualTo(writeBufferManager); + } + } + @Test public void accessHintOnCompactionStart() { try (final Options opt = new Options()) { diff --git a/java/src/test/java/org/rocksdb/ReadOptionsTest.java b/java/src/test/java/org/rocksdb/ReadOptionsTest.java index f7d799909d9..4e860ae4ccf 100644 --- a/java/src/test/java/org/rocksdb/ReadOptionsTest.java +++ b/java/src/test/java/org/rocksdb/ReadOptionsTest.java @@ -144,16 +144,34 @@ public void iterateUpperBoundNull() { } } + @Test + public void iterateLowerBound() { + try 
(final ReadOptions opt = new ReadOptions()) { + Slice lowerBound = buildRandomSlice(); + opt.setIterateLowerBound(lowerBound); + assertThat(Arrays.equals(lowerBound.data(), opt.iterateLowerBound().data())).isTrue(); + } + } + + @Test + public void iterateLowerBoundNull() { + try (final ReadOptions opt = new ReadOptions()) { + assertThat(opt.iterateLowerBound()).isNull(); + } + } + @Test public void copyConstructor() { try (final ReadOptions opt = new ReadOptions()) { opt.setVerifyChecksums(false); opt.setFillCache(false); opt.setIterateUpperBound(buildRandomSlice()); + opt.setIterateLowerBound(buildRandomSlice()); ReadOptions other = new ReadOptions(opt); assertThat(opt.verifyChecksums()).isEqualTo(other.verifyChecksums()); assertThat(opt.fillCache()).isEqualTo(other.fillCache()); assertThat(Arrays.equals(opt.iterateUpperBound().data(), other.iterateUpperBound().data())).isTrue(); + assertThat(Arrays.equals(opt.iterateLowerBound().data(), other.iterateLowerBound().data())).isTrue(); } } @@ -237,6 +255,22 @@ public void failIterateUpperBoundUninitialized() { } } + @Test + public void failSetIterateLowerBoundUninitialized() { + try (final ReadOptions readOptions = + setupUninitializedReadOptions(exception)) { + readOptions.setIterateLowerBound(null); + } + } + + @Test + public void failIterateLowerBoundUninitialized() { + try (final ReadOptions readOptions = + setupUninitializedReadOptions(exception)) { + readOptions.iterateLowerBound(); + } + } + private ReadOptions setupUninitializedReadOptions( ExpectedException exception) { final ReadOptions readOptions = new ReadOptions(); diff --git a/java/src/test/java/org/rocksdb/RocksDBTest.java b/java/src/test/java/org/rocksdb/RocksDBTest.java index 158b8d56a89..66ebc69db8a 100644 --- a/java/src/test/java/org/rocksdb/RocksDBTest.java +++ b/java/src/test/java/org/rocksdb/RocksDBTest.java @@ -4,6 +4,7 @@ // (found in the LICENSE.Apache file in the root directory). 
package org.rocksdb; +import org.junit.Assert; import org.junit.Assume; import org.junit.ClassRule; import org.junit.Rule; @@ -11,6 +12,7 @@ import org.junit.rules.ExpectedException; import org.junit.rules.TemporaryFolder; +import java.nio.ByteBuffer; import java.util.*; import static org.assertj.core.api.Assertions.assertThat; @@ -70,6 +72,57 @@ public void put() throws RocksDBException { "value".getBytes()); assertThat(db.get("key2".getBytes())).isEqualTo( "12345678".getBytes()); + + + // put + Segment key3 = sliceSegment("key3"); + Segment key4 = sliceSegment("key4"); + Segment value0 = sliceSegment("value 0"); + Segment value1 = sliceSegment("value 1"); + db.put(key3.data, key3.offset, key3.len, value0.data, value0.offset, value0.len); + db.put(opt, key4.data, key4.offset, key4.len, value1.data, value1.offset, value1.len); + + // compare + Assert.assertTrue(value0.isSamePayload(db.get(key3.data, key3.offset, key3.len))); + Assert.assertTrue(value1.isSamePayload(db.get(key4.data, key4.offset, key4.len))); + } + } + + private static Segment sliceSegment(String key) { + ByteBuffer rawKey = ByteBuffer.allocate(key.length() + 4); + rawKey.put((byte)0); + rawKey.put((byte)0); + rawKey.put(key.getBytes()); + + return new Segment(rawKey.array(), 2, key.length()); + } + + private static class Segment { + final byte[] data; + final int offset; + final int len; + + public boolean isSamePayload(byte[] value) { + if (value == null) { + return false; + } + if (value.length != len) { + return false; + } + + for (int i = 0; i < value.length; i++) { + if (data[i + offset] != value[i]) { + return false; + } + } + + return true; + } + + public Segment(byte[] value, int offset, int len) { + this.data = value; + this.offset = offset; + this.len = len; } } @@ -242,6 +295,18 @@ public void merge() throws RocksDBException { db.merge(wOpt, "key2".getBytes(), "xxxx".getBytes()); assertThat(db.get("key2".getBytes())).isEqualTo( "xxxx".getBytes()); + + Segment key3 = sliceSegment("key3"); + Segment key4 = sliceSegment("key4"); + Segment value0 = sliceSegment("value 0"); + Segment value1 = sliceSegment("value 1"); + + db.merge(key3.data, key3.offset, key3.len, value0.data, value0.offset, value0.len); + db.merge(wOpt, key4.data, key4.offset, key4.len, value1.data, value1.offset, value1.len); + + // compare + Assert.assertTrue(value0.isSamePayload(db.get(key3.data, key3.offset, key3.len))); + Assert.assertTrue(value1.isSamePayload(db.get(key4.data, key4.offset, key4.len))); } } @@ -259,6 +324,18 @@ public void delete() throws RocksDBException { db.delete(wOpt, "key2".getBytes()); assertThat(db.get("key1".getBytes())).isNull(); assertThat(db.get("key2".getBytes())).isNull(); + + + Segment key3 = sliceSegment("key3"); + Segment key4 = sliceSegment("key4"); + db.put("key3".getBytes(), "key3 value".getBytes()); + db.put("key4".getBytes(), "key4 value".getBytes()); + + db.delete(key3.data, key3.offset, key3.len); + db.delete(wOpt, key4.data, key4.offset, key4.len); + + assertThat(db.get("key3".getBytes())).isNull(); + assertThat(db.get("key4".getBytes())).isNull(); } } diff --git a/memtable/alloc_tracker.cc b/memtable/alloc_tracker.cc index 9889cc4230c..a1fa4938c52 100644 --- a/memtable/alloc_tracker.cc +++ b/memtable/alloc_tracker.cc @@ -24,7 +24,8 @@ AllocTracker::~AllocTracker() { FreeMem(); } void AllocTracker::Allocate(size_t bytes) { assert(write_buffer_manager_ != nullptr); - if (write_buffer_manager_->enabled()) { + if (write_buffer_manager_->enabled() || + write_buffer_manager_->cost_to_cache()) { 
bytes_allocated_.fetch_add(bytes, std::memory_order_relaxed); write_buffer_manager_->ReserveMem(bytes); } @@ -32,7 +33,8 @@ void AllocTracker::Allocate(size_t bytes) { void AllocTracker::DoneAllocating() { if (write_buffer_manager_ != nullptr && !done_allocating_) { - if (write_buffer_manager_->enabled()) { + if (write_buffer_manager_->enabled() || + write_buffer_manager_->cost_to_cache()) { write_buffer_manager_->ScheduleFreeMem( bytes_allocated_.load(std::memory_order_relaxed)); } else { @@ -47,7 +49,8 @@ void AllocTracker::FreeMem() { DoneAllocating(); } if (write_buffer_manager_ != nullptr && !freed_) { - if (write_buffer_manager_->enabled()) { + if (write_buffer_manager_->enabled() || + write_buffer_manager_->cost_to_cache()) { write_buffer_manager_->FreeMem( bytes_allocated_.load(std::memory_order_relaxed)); } else { diff --git a/memtable/hash_skiplist_rep.cc b/memtable/hash_skiplist_rep.cc index 93082b1ec28..a5c46011e3f 100644 --- a/memtable/hash_skiplist_rep.cc +++ b/memtable/hash_skiplist_rep.cc @@ -168,7 +168,7 @@ class HashSkipListRep : public MemTableRep { Bucket* list_; Bucket::Iterator iter_; // here we track if we own list_. If we own it, we are also - // responsible for it's cleaning. This is a poor man's shared_ptr + // responsible for it's cleaning. This is a poor man's std::shared_ptr bool own_list_; std::unique_ptr arena_; std::string tmp_; // For passing to EncodeKey diff --git a/memtable/write_buffer_manager.cc b/memtable/write_buffer_manager.cc index 21b18c8f76e..7f2e664ab5e 100644 --- a/memtable/write_buffer_manager.cc +++ b/memtable/write_buffer_manager.cc @@ -79,7 +79,7 @@ WriteBufferManager::~WriteBufferManager() { void WriteBufferManager::ReserveMemWithCache(size_t mem) { #ifndef ROCKSDB_LITE assert(cache_rep_ != nullptr); - // Use a mutex to protect various data structures. Can be optimzied to a + // Use a mutex to protect various data structures. Can be optimized to a // lock-free solution if it ends up with a performance bottleneck. std::lock_guard lock(cache_rep_->cache_mutex_); @@ -102,14 +102,14 @@ void WriteBufferManager::ReserveMemWithCache(size_t mem) { void WriteBufferManager::FreeMemWithCache(size_t mem) { #ifndef ROCKSDB_LITE assert(cache_rep_ != nullptr); - // Use a mutex to protect various data structures. Can be optimzied to a + // Use a mutex to protect various data structures. Can be optimized to a // lock-free solution if it ends up with a performance bottleneck. std::lock_guard lock(cache_rep_->cache_mutex_); size_t new_mem_used = memory_used_.load(std::memory_order_relaxed) - mem; memory_used_.store(new_mem_used, std::memory_order_relaxed); // Gradually shrink memory costed in the block cache if the actual // usage is less than 3/4 of what we reserve from the block cache. - // We do this becausse: + // We do this because: // 1. we don't pay the cost of the block cache immediately a memtable is // freed, as block cache insert is expensive; // 2. 
eventually, if we walk away from a temporary memtable size increase, diff --git a/monitoring/perf_context.cc b/monitoring/perf_context.cc index 9bba841f8f5..423443869be 100644 --- a/monitoring/perf_context.cc +++ b/monitoring/perf_context.cc @@ -15,7 +15,7 @@ PerfContext perf_context; #if defined(OS_SOLARIS) __thread PerfContext perf_context_; #else -__thread PerfContext perf_context; +thread_local PerfContext perf_context; #endif #endif @@ -31,6 +31,12 @@ PerfContext* get_perf_context() { #endif } +PerfContext::~PerfContext() { +#if !defined(NPERF_CONTEXT) && defined(ROCKSDB_SUPPORT_THREAD_LOCAL) && !defined(OS_SOLARIS) + ClearPerLevelPerfContext(); +#endif +} + void PerfContext::Reset() { #ifndef NPERF_CONTEXT user_key_comparison_count = 0; @@ -104,6 +110,11 @@ void PerfContext::Reset() { env_lock_file_nanos = 0; env_unlock_file_nanos = 0; env_new_logger_nanos = 0; + if (per_level_perf_context_enabled && level_to_perf_context) { + for (auto& kv : *level_to_perf_context) { + kv.second.Reset(); + } + } #endif } @@ -112,6 +123,25 @@ void PerfContext::Reset() { ss << #counter << " = " << counter << ", "; \ } +#define PERF_CONTEXT_BY_LEVEL_OUTPUT_ONE_COUNTER(counter) \ + if (per_level_perf_context_enabled && \ + level_to_perf_context) { \ + ss << #counter << " = "; \ + for (auto& kv : *level_to_perf_context) { \ + if (!exclude_zero_counters || (kv.second.counter > 0)) { \ + ss << kv.second.counter << "@level" << kv.first << ", "; \ + } \ + } \ + } + +void PerfContextByLevel::Reset() { +#ifndef NPERF_CONTEXT + bloom_filter_useful = 0; + bloom_filter_full_positive = 0; + bloom_filter_full_true_positive = 0; +#endif +} + std::string PerfContext::ToString(bool exclude_zero_counters) const { #ifdef NPERF_CONTEXT return ""; @@ -186,8 +216,30 @@ std::string PerfContext::ToString(bool exclude_zero_counters) const { PERF_CONTEXT_OUTPUT(env_lock_file_nanos); PERF_CONTEXT_OUTPUT(env_unlock_file_nanos); PERF_CONTEXT_OUTPUT(env_new_logger_nanos); + PERF_CONTEXT_BY_LEVEL_OUTPUT_ONE_COUNTER(bloom_filter_useful); + PERF_CONTEXT_BY_LEVEL_OUTPUT_ONE_COUNTER(bloom_filter_full_positive); + PERF_CONTEXT_BY_LEVEL_OUTPUT_ONE_COUNTER(bloom_filter_full_true_positive); return ss.str(); #endif } +void PerfContext::EnablePerLevelPerfContext() { + if (!level_to_perf_context) { + level_to_perf_context = new std::map(); + } + per_level_perf_context_enabled = true; +} + +void PerfContext::DisablePerLevelPerfContext(){ + per_level_perf_context_enabled = false; +} + +void PerfContext::ClearPerLevelPerfContext(){ + if (level_to_perf_context) { + delete level_to_perf_context; + level_to_perf_context = nullptr; + } + per_level_perf_context_enabled = false; +} + } diff --git a/monitoring/perf_context_imp.h b/monitoring/perf_context_imp.h index cfcded1c96b..d67654914e8 100644 --- a/monitoring/perf_context_imp.h +++ b/monitoring/perf_context_imp.h @@ -16,7 +16,7 @@ extern PerfContext perf_context; extern __thread PerfContext perf_context_; #define perf_context (*get_perf_context()) #else -extern __thread PerfContext perf_context; +extern thread_local PerfContext perf_context; #endif #endif @@ -59,6 +59,22 @@ extern __thread PerfContext perf_context; perf_context.metric += value; \ } +// Increase metric value +#define PERF_COUNTER_BY_LEVEL_ADD(metric, value, level) \ + if (perf_level >= PerfLevel::kEnableCount && \ + perf_context.per_level_perf_context_enabled && \ + perf_context.level_to_perf_context) { \ + if ((*(perf_context.level_to_perf_context)).find(level) != \ + (*(perf_context.level_to_perf_context)).end()) { \ + 
(*(perf_context.level_to_perf_context))[level].metric += value; \ + } \ + else { \ + PerfContextByLevel empty_context; \ + (*(perf_context.level_to_perf_context))[level] = empty_context; \ + (*(perf_context.level_to_perf_context))[level].metric += value; \ + } \ + } \ + #endif } diff --git a/monitoring/statistics.cc b/monitoring/statistics.cc index 59ce3d9e0a8..cba427ae4b7 100644 --- a/monitoring/statistics.cc +++ b/monitoring/statistics.cc @@ -17,13 +17,214 @@ namespace rocksdb { +// The order of items listed in Tickers should be the same as +// the order listed in TickersNameMap +const std::vector> TickersNameMap = { + {BLOCK_CACHE_MISS, "rocksdb.block.cache.miss"}, + {BLOCK_CACHE_HIT, "rocksdb.block.cache.hit"}, + {BLOCK_CACHE_ADD, "rocksdb.block.cache.add"}, + {BLOCK_CACHE_ADD_FAILURES, "rocksdb.block.cache.add.failures"}, + {BLOCK_CACHE_INDEX_MISS, "rocksdb.block.cache.index.miss"}, + {BLOCK_CACHE_INDEX_HIT, "rocksdb.block.cache.index.hit"}, + {BLOCK_CACHE_INDEX_ADD, "rocksdb.block.cache.index.add"}, + {BLOCK_CACHE_INDEX_BYTES_INSERT, "rocksdb.block.cache.index.bytes.insert"}, + {BLOCK_CACHE_INDEX_BYTES_EVICT, "rocksdb.block.cache.index.bytes.evict"}, + {BLOCK_CACHE_FILTER_MISS, "rocksdb.block.cache.filter.miss"}, + {BLOCK_CACHE_FILTER_HIT, "rocksdb.block.cache.filter.hit"}, + {BLOCK_CACHE_FILTER_ADD, "rocksdb.block.cache.filter.add"}, + {BLOCK_CACHE_FILTER_BYTES_INSERT, + "rocksdb.block.cache.filter.bytes.insert"}, + {BLOCK_CACHE_FILTER_BYTES_EVICT, "rocksdb.block.cache.filter.bytes.evict"}, + {BLOCK_CACHE_DATA_MISS, "rocksdb.block.cache.data.miss"}, + {BLOCK_CACHE_DATA_HIT, "rocksdb.block.cache.data.hit"}, + {BLOCK_CACHE_DATA_ADD, "rocksdb.block.cache.data.add"}, + {BLOCK_CACHE_DATA_BYTES_INSERT, "rocksdb.block.cache.data.bytes.insert"}, + {BLOCK_CACHE_BYTES_READ, "rocksdb.block.cache.bytes.read"}, + {BLOCK_CACHE_BYTES_WRITE, "rocksdb.block.cache.bytes.write"}, + {BLOOM_FILTER_USEFUL, "rocksdb.bloom.filter.useful"}, + {BLOOM_FILTER_FULL_POSITIVE, "rocksdb.bloom.filter.full.positive"}, + {BLOOM_FILTER_FULL_TRUE_POSITIVE, + "rocksdb.bloom.filter.full.true.positive"}, + {PERSISTENT_CACHE_HIT, "rocksdb.persistent.cache.hit"}, + {PERSISTENT_CACHE_MISS, "rocksdb.persistent.cache.miss"}, + {SIM_BLOCK_CACHE_HIT, "rocksdb.sim.block.cache.hit"}, + {SIM_BLOCK_CACHE_MISS, "rocksdb.sim.block.cache.miss"}, + {MEMTABLE_HIT, "rocksdb.memtable.hit"}, + {MEMTABLE_MISS, "rocksdb.memtable.miss"}, + {GET_HIT_L0, "rocksdb.l0.hit"}, + {GET_HIT_L1, "rocksdb.l1.hit"}, + {GET_HIT_L2_AND_UP, "rocksdb.l2andup.hit"}, + {COMPACTION_KEY_DROP_NEWER_ENTRY, "rocksdb.compaction.key.drop.new"}, + {COMPACTION_KEY_DROP_OBSOLETE, "rocksdb.compaction.key.drop.obsolete"}, + {COMPACTION_KEY_DROP_RANGE_DEL, "rocksdb.compaction.key.drop.range_del"}, + {COMPACTION_KEY_DROP_USER, "rocksdb.compaction.key.drop.user"}, + {COMPACTION_RANGE_DEL_DROP_OBSOLETE, + "rocksdb.compaction.range_del.drop.obsolete"}, + {COMPACTION_OPTIMIZED_DEL_DROP_OBSOLETE, + "rocksdb.compaction.optimized.del.drop.obsolete"}, + {COMPACTION_CANCELLED, "rocksdb.compaction.cancelled"}, + {NUMBER_KEYS_WRITTEN, "rocksdb.number.keys.written"}, + {NUMBER_KEYS_READ, "rocksdb.number.keys.read"}, + {NUMBER_KEYS_UPDATED, "rocksdb.number.keys.updated"}, + {BYTES_WRITTEN, "rocksdb.bytes.written"}, + {BYTES_READ, "rocksdb.bytes.read"}, + {NUMBER_DB_SEEK, "rocksdb.number.db.seek"}, + {NUMBER_DB_NEXT, "rocksdb.number.db.next"}, + {NUMBER_DB_PREV, "rocksdb.number.db.prev"}, + {NUMBER_DB_SEEK_FOUND, "rocksdb.number.db.seek.found"}, + {NUMBER_DB_NEXT_FOUND, 
"rocksdb.number.db.next.found"}, + {NUMBER_DB_PREV_FOUND, "rocksdb.number.db.prev.found"}, + {ITER_BYTES_READ, "rocksdb.db.iter.bytes.read"}, + {NO_FILE_CLOSES, "rocksdb.no.file.closes"}, + {NO_FILE_OPENS, "rocksdb.no.file.opens"}, + {NO_FILE_ERRORS, "rocksdb.no.file.errors"}, + {STALL_L0_SLOWDOWN_MICROS, "rocksdb.l0.slowdown.micros"}, + {STALL_MEMTABLE_COMPACTION_MICROS, "rocksdb.memtable.compaction.micros"}, + {STALL_L0_NUM_FILES_MICROS, "rocksdb.l0.num.files.stall.micros"}, + {STALL_MICROS, "rocksdb.stall.micros"}, + {DB_MUTEX_WAIT_MICROS, "rocksdb.db.mutex.wait.micros"}, + {RATE_LIMIT_DELAY_MILLIS, "rocksdb.rate.limit.delay.millis"}, + {NO_ITERATORS, "rocksdb.num.iterators"}, + {NUMBER_MULTIGET_CALLS, "rocksdb.number.multiget.get"}, + {NUMBER_MULTIGET_KEYS_READ, "rocksdb.number.multiget.keys.read"}, + {NUMBER_MULTIGET_BYTES_READ, "rocksdb.number.multiget.bytes.read"}, + {NUMBER_FILTERED_DELETES, "rocksdb.number.deletes.filtered"}, + {NUMBER_MERGE_FAILURES, "rocksdb.number.merge.failures"}, + {BLOOM_FILTER_PREFIX_CHECKED, "rocksdb.bloom.filter.prefix.checked"}, + {BLOOM_FILTER_PREFIX_USEFUL, "rocksdb.bloom.filter.prefix.useful"}, + {NUMBER_OF_RESEEKS_IN_ITERATION, "rocksdb.number.reseeks.iteration"}, + {GET_UPDATES_SINCE_CALLS, "rocksdb.getupdatessince.calls"}, + {BLOCK_CACHE_COMPRESSED_MISS, "rocksdb.block.cachecompressed.miss"}, + {BLOCK_CACHE_COMPRESSED_HIT, "rocksdb.block.cachecompressed.hit"}, + {BLOCK_CACHE_COMPRESSED_ADD, "rocksdb.block.cachecompressed.add"}, + {BLOCK_CACHE_COMPRESSED_ADD_FAILURES, + "rocksdb.block.cachecompressed.add.failures"}, + {WAL_FILE_SYNCED, "rocksdb.wal.synced"}, + {WAL_FILE_BYTES, "rocksdb.wal.bytes"}, + {WRITE_DONE_BY_SELF, "rocksdb.write.self"}, + {WRITE_DONE_BY_OTHER, "rocksdb.write.other"}, + {WRITE_TIMEDOUT, "rocksdb.write.timeout"}, + {WRITE_WITH_WAL, "rocksdb.write.wal"}, + {COMPACT_READ_BYTES, "rocksdb.compact.read.bytes"}, + {COMPACT_WRITE_BYTES, "rocksdb.compact.write.bytes"}, + {FLUSH_WRITE_BYTES, "rocksdb.flush.write.bytes"}, + {NUMBER_DIRECT_LOAD_TABLE_PROPERTIES, + "rocksdb.number.direct.load.table.properties"}, + {NUMBER_SUPERVERSION_ACQUIRES, "rocksdb.number.superversion_acquires"}, + {NUMBER_SUPERVERSION_RELEASES, "rocksdb.number.superversion_releases"}, + {NUMBER_SUPERVERSION_CLEANUPS, "rocksdb.number.superversion_cleanups"}, + {NUMBER_BLOCK_COMPRESSED, "rocksdb.number.block.compressed"}, + {NUMBER_BLOCK_DECOMPRESSED, "rocksdb.number.block.decompressed"}, + {NUMBER_BLOCK_NOT_COMPRESSED, "rocksdb.number.block.not_compressed"}, + {MERGE_OPERATION_TOTAL_TIME, "rocksdb.merge.operation.time.nanos"}, + {FILTER_OPERATION_TOTAL_TIME, "rocksdb.filter.operation.time.nanos"}, + {ROW_CACHE_HIT, "rocksdb.row.cache.hit"}, + {ROW_CACHE_MISS, "rocksdb.row.cache.miss"}, + {READ_AMP_ESTIMATE_USEFUL_BYTES, "rocksdb.read.amp.estimate.useful.bytes"}, + {READ_AMP_TOTAL_READ_BYTES, "rocksdb.read.amp.total.read.bytes"}, + {NUMBER_RATE_LIMITER_DRAINS, "rocksdb.number.rate_limiter.drains"}, + {NUMBER_ITER_SKIP, "rocksdb.number.iter.skip"}, + {BLOB_DB_NUM_PUT, "rocksdb.blobdb.num.put"}, + {BLOB_DB_NUM_WRITE, "rocksdb.blobdb.num.write"}, + {BLOB_DB_NUM_GET, "rocksdb.blobdb.num.get"}, + {BLOB_DB_NUM_MULTIGET, "rocksdb.blobdb.num.multiget"}, + {BLOB_DB_NUM_SEEK, "rocksdb.blobdb.num.seek"}, + {BLOB_DB_NUM_NEXT, "rocksdb.blobdb.num.next"}, + {BLOB_DB_NUM_PREV, "rocksdb.blobdb.num.prev"}, + {BLOB_DB_NUM_KEYS_WRITTEN, "rocksdb.blobdb.num.keys.written"}, + {BLOB_DB_NUM_KEYS_READ, "rocksdb.blobdb.num.keys.read"}, + {BLOB_DB_BYTES_WRITTEN, "rocksdb.blobdb.bytes.written"}, 
+ {BLOB_DB_BYTES_READ, "rocksdb.blobdb.bytes.read"}, + {BLOB_DB_WRITE_INLINED, "rocksdb.blobdb.write.inlined"}, + {BLOB_DB_WRITE_INLINED_TTL, "rocksdb.blobdb.write.inlined.ttl"}, + {BLOB_DB_WRITE_BLOB, "rocksdb.blobdb.write.blob"}, + {BLOB_DB_WRITE_BLOB_TTL, "rocksdb.blobdb.write.blob.ttl"}, + {BLOB_DB_BLOB_FILE_BYTES_WRITTEN, "rocksdb.blobdb.blob.file.bytes.written"}, + {BLOB_DB_BLOB_FILE_BYTES_READ, "rocksdb.blobdb.blob.file.bytes.read"}, + {BLOB_DB_BLOB_FILE_SYNCED, "rocksdb.blobdb.blob.file.synced"}, + {BLOB_DB_BLOB_INDEX_EXPIRED_COUNT, + "rocksdb.blobdb.blob.index.expired.count"}, + {BLOB_DB_BLOB_INDEX_EXPIRED_SIZE, "rocksdb.blobdb.blob.index.expired.size"}, + {BLOB_DB_BLOB_INDEX_EVICTED_COUNT, + "rocksdb.blobdb.blob.index.evicted.count"}, + {BLOB_DB_BLOB_INDEX_EVICTED_SIZE, "rocksdb.blobdb.blob.index.evicted.size"}, + {BLOB_DB_GC_NUM_FILES, "rocksdb.blobdb.gc.num.files"}, + {BLOB_DB_GC_NUM_NEW_FILES, "rocksdb.blobdb.gc.num.new.files"}, + {BLOB_DB_GC_FAILURES, "rocksdb.blobdb.gc.failures"}, + {BLOB_DB_GC_NUM_KEYS_OVERWRITTEN, "rocksdb.blobdb.gc.num.keys.overwritten"}, + {BLOB_DB_GC_NUM_KEYS_EXPIRED, "rocksdb.blobdb.gc.num.keys.expired"}, + {BLOB_DB_GC_NUM_KEYS_RELOCATED, "rocksdb.blobdb.gc.num.keys.relocated"}, + {BLOB_DB_GC_BYTES_OVERWRITTEN, "rocksdb.blobdb.gc.bytes.overwritten"}, + {BLOB_DB_GC_BYTES_EXPIRED, "rocksdb.blobdb.gc.bytes.expired"}, + {BLOB_DB_GC_BYTES_RELOCATED, "rocksdb.blobdb.gc.bytes.relocated"}, + {BLOB_DB_FIFO_NUM_FILES_EVICTED, "rocksdb.blobdb.fifo.num.files.evicted"}, + {BLOB_DB_FIFO_NUM_KEYS_EVICTED, "rocksdb.blobdb.fifo.num.keys.evicted"}, + {BLOB_DB_FIFO_BYTES_EVICTED, "rocksdb.blobdb.fifo.bytes.evicted"}, + {TXN_PREPARE_MUTEX_OVERHEAD, "rocksdb.txn.overhead.mutex.prepare"}, + {TXN_OLD_COMMIT_MAP_MUTEX_OVERHEAD, + "rocksdb.txn.overhead.mutex.old.commit.map"}, + {TXN_DUPLICATE_KEY_OVERHEAD, "rocksdb.txn.overhead.duplicate.key"}, + {TXN_SNAPSHOT_MUTEX_OVERHEAD, "rocksdb.txn.overhead.mutex.snapshot"}, + {NUMBER_MULTIGET_KEYS_FOUND, "rocksdb.number.multiget.keys.found"}, + {NO_ITERATOR_CREATED, "rocksdb.num.iterator.created"}, + {NO_ITERATOR_DELETED, "rocksdb.num.iterator.deleted"}, +}; + +const std::vector> HistogramsNameMap = { + {DB_GET, "rocksdb.db.get.micros"}, + {DB_WRITE, "rocksdb.db.write.micros"}, + {COMPACTION_TIME, "rocksdb.compaction.times.micros"}, + {SUBCOMPACTION_SETUP_TIME, "rocksdb.subcompaction.setup.times.micros"}, + {TABLE_SYNC_MICROS, "rocksdb.table.sync.micros"}, + {COMPACTION_OUTFILE_SYNC_MICROS, "rocksdb.compaction.outfile.sync.micros"}, + {WAL_FILE_SYNC_MICROS, "rocksdb.wal.file.sync.micros"}, + {MANIFEST_FILE_SYNC_MICROS, "rocksdb.manifest.file.sync.micros"}, + {TABLE_OPEN_IO_MICROS, "rocksdb.table.open.io.micros"}, + {DB_MULTIGET, "rocksdb.db.multiget.micros"}, + {READ_BLOCK_COMPACTION_MICROS, "rocksdb.read.block.compaction.micros"}, + {READ_BLOCK_GET_MICROS, "rocksdb.read.block.get.micros"}, + {WRITE_RAW_BLOCK_MICROS, "rocksdb.write.raw.block.micros"}, + {STALL_L0_SLOWDOWN_COUNT, "rocksdb.l0.slowdown.count"}, + {STALL_MEMTABLE_COMPACTION_COUNT, "rocksdb.memtable.compaction.count"}, + {STALL_L0_NUM_FILES_COUNT, "rocksdb.num.files.stall.count"}, + {HARD_RATE_LIMIT_DELAY_COUNT, "rocksdb.hard.rate.limit.delay.count"}, + {SOFT_RATE_LIMIT_DELAY_COUNT, "rocksdb.soft.rate.limit.delay.count"}, + {NUM_FILES_IN_SINGLE_COMPACTION, "rocksdb.numfiles.in.singlecompaction"}, + {DB_SEEK, "rocksdb.db.seek.micros"}, + {WRITE_STALL, "rocksdb.db.write.stall"}, + {SST_READ_MICROS, "rocksdb.sst.read.micros"}, + {NUM_SUBCOMPACTIONS_SCHEDULED, 
"rocksdb.num.subcompactions.scheduled"}, + {BYTES_PER_READ, "rocksdb.bytes.per.read"}, + {BYTES_PER_WRITE, "rocksdb.bytes.per.write"}, + {BYTES_PER_MULTIGET, "rocksdb.bytes.per.multiget"}, + {BYTES_COMPRESSED, "rocksdb.bytes.compressed"}, + {BYTES_DECOMPRESSED, "rocksdb.bytes.decompressed"}, + {COMPRESSION_TIMES_NANOS, "rocksdb.compression.times.nanos"}, + {DECOMPRESSION_TIMES_NANOS, "rocksdb.decompression.times.nanos"}, + {READ_NUM_MERGE_OPERANDS, "rocksdb.read.num.merge_operands"}, + {BLOB_DB_KEY_SIZE, "rocksdb.blobdb.key.size"}, + {BLOB_DB_VALUE_SIZE, "rocksdb.blobdb.value.size"}, + {BLOB_DB_WRITE_MICROS, "rocksdb.blobdb.write.micros"}, + {BLOB_DB_GET_MICROS, "rocksdb.blobdb.get.micros"}, + {BLOB_DB_MULTIGET_MICROS, "rocksdb.blobdb.multiget.micros"}, + {BLOB_DB_SEEK_MICROS, "rocksdb.blobdb.seek.micros"}, + {BLOB_DB_NEXT_MICROS, "rocksdb.blobdb.next.micros"}, + {BLOB_DB_PREV_MICROS, "rocksdb.blobdb.prev.micros"}, + {BLOB_DB_BLOB_FILE_WRITE_MICROS, "rocksdb.blobdb.blob.file.write.micros"}, + {BLOB_DB_BLOB_FILE_READ_MICROS, "rocksdb.blobdb.blob.file.read.micros"}, + {BLOB_DB_BLOB_FILE_SYNC_MICROS, "rocksdb.blobdb.blob.file.sync.micros"}, + {BLOB_DB_GC_MICROS, "rocksdb.blobdb.gc.micros"}, + {BLOB_DB_COMPRESSION_MICROS, "rocksdb.blobdb.compression.micros"}, + {BLOB_DB_DECOMPRESSION_MICROS, "rocksdb.blobdb.decompression.micros"}, + {FLUSH_TIME, "rocksdb.db.flush.micros"}, +}; + std::shared_ptr CreateDBStatistics() { - return std::make_shared(nullptr, false); + return std::make_shared(nullptr); } -StatisticsImpl::StatisticsImpl(std::shared_ptr stats, - bool enable_internal_stats) - : stats_(std::move(stats)), enable_internal_stats_(enable_internal_stats) {} +StatisticsImpl::StatisticsImpl(std::shared_ptr stats) + : stats_(std::move(stats)) {} StatisticsImpl::~StatisticsImpl() {} @@ -33,10 +234,7 @@ uint64_t StatisticsImpl::getTickerCount(uint32_t tickerType) const { } uint64_t StatisticsImpl::getTickerCountLocked(uint32_t tickerType) const { - assert( - enable_internal_stats_ ? - tickerType < INTERNAL_TICKER_ENUM_MAX : - tickerType < TICKER_ENUM_MAX); + assert(tickerType < TICKER_ENUM_MAX); uint64_t res = 0; for (size_t core_idx = 0; core_idx < per_core_stats_.Size(); ++core_idx) { res += per_core_stats_.AccessAtCore(core_idx)->tickers_[tickerType]; @@ -52,10 +250,7 @@ void StatisticsImpl::histogramData(uint32_t histogramType, std::unique_ptr StatisticsImpl::getHistogramImplLocked( uint32_t histogramType) const { - assert( - enable_internal_stats_ ? - histogramType < INTERNAL_HISTOGRAM_ENUM_MAX : - histogramType < HISTOGRAM_ENUM_MAX); + assert(histogramType < HISTOGRAM_ENUM_MAX); std::unique_ptr res_hist(new HistogramImpl()); for (size_t core_idx = 0; core_idx < per_core_stats_.Size(); ++core_idx) { res_hist->Merge( @@ -80,8 +275,7 @@ void StatisticsImpl::setTickerCount(uint32_t tickerType, uint64_t count) { } void StatisticsImpl::setTickerCountLocked(uint32_t tickerType, uint64_t count) { - assert(enable_internal_stats_ ? tickerType < INTERNAL_TICKER_ENUM_MAX - : tickerType < TICKER_ENUM_MAX); + assert(tickerType < TICKER_ENUM_MAX); for (size_t core_idx = 0; core_idx < per_core_stats_.Size(); ++core_idx) { if (core_idx == 0) { per_core_stats_.AccessAtCore(core_idx)->tickers_[tickerType] = count; @@ -95,8 +289,7 @@ uint64_t StatisticsImpl::getAndResetTickerCount(uint32_t tickerType) { uint64_t sum = 0; { MutexLock lock(&aggregate_lock_); - assert(enable_internal_stats_ ? 
tickerType < INTERNAL_TICKER_ENUM_MAX - : tickerType < TICKER_ENUM_MAX); + assert(tickerType < TICKER_ENUM_MAX); for (size_t core_idx = 0; core_idx < per_core_stats_.Size(); ++core_idx) { sum += per_core_stats_.AccessAtCore(core_idx)->tickers_[tickerType].exchange( @@ -110,10 +303,7 @@ uint64_t StatisticsImpl::getAndResetTickerCount(uint32_t tickerType) { } void StatisticsImpl::recordTick(uint32_t tickerType, uint64_t count) { - assert( - enable_internal_stats_ ? - tickerType < INTERNAL_TICKER_ENUM_MAX : - tickerType < TICKER_ENUM_MAX); + assert(tickerType < TICKER_ENUM_MAX); per_core_stats_.Access()->tickers_[tickerType].fetch_add( count, std::memory_order_relaxed); if (stats_ && tickerType < TICKER_ENUM_MAX) { @@ -122,10 +312,7 @@ void StatisticsImpl::recordTick(uint32_t tickerType, uint64_t count) { } void StatisticsImpl::measureTime(uint32_t histogramType, uint64_t value) { - assert( - enable_internal_stats_ ? - histogramType < INTERNAL_HISTOGRAM_ENUM_MAX : - histogramType < HISTOGRAM_ENUM_MAX); + assert(histogramType < HISTOGRAM_ENUM_MAX); per_core_stats_.Access()->histograms_[histogramType].Add(value); if (stats_ && histogramType < HISTOGRAM_ENUM_MAX) { stats_->measureTime(histogramType, value); @@ -157,41 +344,36 @@ std::string StatisticsImpl::ToString() const { std::string res; res.reserve(20000); for (const auto& t : TickersNameMap) { - if (t.first < TICKER_ENUM_MAX || enable_internal_stats_) { - char buffer[kTmpStrBufferSize]; - snprintf(buffer, kTmpStrBufferSize, "%s COUNT : %" PRIu64 "\n", - t.second.c_str(), getTickerCountLocked(t.first)); - res.append(buffer); - } + assert(t.first < TICKER_ENUM_MAX); + char buffer[kTmpStrBufferSize]; + snprintf(buffer, kTmpStrBufferSize, "%s COUNT : %" PRIu64 "\n", + t.second.c_str(), getTickerCountLocked(t.first)); + res.append(buffer); } for (const auto& h : HistogramsNameMap) { - if (h.first < HISTOGRAM_ENUM_MAX || enable_internal_stats_) { - char buffer[kTmpStrBufferSize]; - HistogramData hData; - getHistogramImplLocked(h.first)->Data(&hData); - // don't handle failures - buffer should always be big enough and arguments - // should be provided correctly - int ret = snprintf( - buffer, kTmpStrBufferSize, - "%s P50 : %f P95 : %f P99 : %f P100 : %f COUNT : %" PRIu64 " SUM : %" - PRIu64 "\n", h.second.c_str(), hData.median, hData.percentile95, - hData.percentile99, hData.max, hData.count, hData.sum); - if (ret < 0 || ret >= kTmpStrBufferSize) { - assert(false); - continue; - } - res.append(buffer); + assert(h.first < HISTOGRAM_ENUM_MAX); + char buffer[kTmpStrBufferSize]; + HistogramData hData; + getHistogramImplLocked(h.first)->Data(&hData); + // don't handle failures - buffer should always be big enough and arguments + // should be provided correctly + int ret = snprintf( + buffer, kTmpStrBufferSize, + "%s P50 : %f P95 : %f P99 : %f P100 : %f COUNT : %" PRIu64 " SUM : %" + PRIu64 "\n", h.second.c_str(), hData.median, hData.percentile95, + hData.percentile99, hData.max, hData.count, hData.sum); + if (ret < 0 || ret >= kTmpStrBufferSize) { + assert(false); + continue; } + res.append(buffer); } res.shrink_to_fit(); return res; } bool StatisticsImpl::HistEnabledForType(uint32_t type) const { - if (LIKELY(!enable_internal_stats_)) { - return type < HISTOGRAM_ENUM_MAX; - } - return true; + return type < HISTOGRAM_ENUM_MAX; } } // namespace rocksdb diff --git a/monitoring/statistics.h b/monitoring/statistics.h index 4427c8c5465..dcd5f7a010c 100644 --- a/monitoring/statistics.h +++ b/monitoring/statistics.h @@ -41,8 +41,7 @@ enum HistogramsInternal 
: uint32_t { class StatisticsImpl : public Statistics { public: - StatisticsImpl(std::shared_ptr stats, - bool enable_internal_stats); + StatisticsImpl(std::shared_ptr stats); virtual ~StatisticsImpl(); virtual uint64_t getTickerCount(uint32_t ticker_type) const override; @@ -62,8 +61,6 @@ class StatisticsImpl : public Statistics { private: // If non-nullptr, forwards updates to the object pointed to by `stats_`. std::shared_ptr stats_; - // TODO(ajkr): clean this up since there are no internal stats anymore - bool enable_internal_stats_; // Synchronizes anything that operates across other cores' local data, // such that operations like Reset() can be performed atomically. mutable port::Mutex aggregate_lock_; diff --git a/monitoring/statistics_test.cc b/monitoring/statistics_test.cc index 43aacde9c1b..a77022bfb3d 100644 --- a/monitoring/statistics_test.cc +++ b/monitoring/statistics_test.cc @@ -16,7 +16,7 @@ class StatisticsTest : public testing::Test {}; // Sanity check to make sure that contents and order of TickersNameMap // match Tickers enum -TEST_F(StatisticsTest, Sanity) { +TEST_F(StatisticsTest, SanityTickers) { EXPECT_EQ(static_cast(Tickers::TICKER_ENUM_MAX), TickersNameMap.size()); @@ -26,6 +26,18 @@ TEST_F(StatisticsTest, Sanity) { } } +// Sanity check to make sure that contents and order of HistogramsNameMap +// match Tickers enum +TEST_F(StatisticsTest, SanityHistograms) { + EXPECT_EQ(static_cast(Histograms::HISTOGRAM_ENUM_MAX), + HistogramsNameMap.size()); + + for (uint32_t h = 0; h < Histograms::HISTOGRAM_ENUM_MAX; h++) { + auto pair = HistogramsNameMap[static_cast(h)]; + ASSERT_EQ(pair.first, h) << "Miss match at " << pair.second; + } +} + } // namespace rocksdb int main(int argc, char** argv) { diff --git a/options/cf_options.h b/options/cf_options.h index 1658bf427a3..69b0b0105af 100644 --- a/options/cf_options.h +++ b/options/cf_options.h @@ -18,7 +18,7 @@ namespace rocksdb { // ImmutableCFOptions is a data struct used by RocksDB internal. It contains a // subset of Options that should not be changed during the entire lifetime // of DB. Raw pointers defined in this struct do not have ownership to the data -// they point to. Options contains shared_ptr to these data. +// they point to. Options contains std::shared_ptr to these data. 
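As an aside on the ownership convention the comment above spells out (Options owns its components through std::shared_ptr, while ImmutableCFOptions only borrows them), here is a small hypothetical sketch of that split; the type names are invented for the example and are not RocksDB APIs.

```cpp
// Hypothetical illustration of the borrow pattern described above: the
// long-lived options object owns components via std::shared_ptr, and the
// immutable snapshot keeps non-owning raw pointers into the same objects.
#include <memory>

struct ComparatorLike {};

struct OwningOptions {
  std::shared_ptr<const ComparatorLike> comparator =
      std::make_shared<const ComparatorLike>();
};

struct ImmutableView {
  explicit ImmutableView(const OwningOptions& opts)
      : comparator(opts.comparator.get()) {}
  // Non-owning: only valid while the OwningOptions instance is alive.
  const ComparatorLike* comparator;
};
```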
struct ImmutableCFOptions { ImmutableCFOptions(); explicit ImmutableCFOptions(const Options& options); diff --git a/options/db_options.cc b/options/db_options.cc index fd3cdcccd66..4e8134511ba 100644 --- a/options/db_options.cc +++ b/options/db_options.cc @@ -85,7 +85,8 @@ ImmutableDBOptions::ImmutableDBOptions(const DBOptions& options) allow_ingest_behind(options.allow_ingest_behind), preserve_deletes(options.preserve_deletes), two_write_queues(options.two_write_queues), - manual_wal_flush(options.manual_wal_flush) { + manual_wal_flush(options.manual_wal_flush), + atomic_flush(options.atomic_flush) { } void ImmutableDBOptions::Dump(Logger* log) const { diff --git a/options/db_options.h b/options/db_options.h index 107d35c8770..2cd83b55d43 100644 --- a/options/db_options.h +++ b/options/db_options.h @@ -78,6 +78,7 @@ struct ImmutableDBOptions { bool preserve_deletes; bool two_write_queues; bool manual_wal_flush; + bool atomic_flush; }; struct MutableDBOptions { diff --git a/options/options_helper.cc b/options/options_helper.cc index f4c59ff06e7..27a2252a02e 100644 --- a/options/options_helper.cc +++ b/options/options_helper.cc @@ -126,6 +126,7 @@ DBOptions BuildDBOptions(const ImmutableDBOptions& immutable_db_options, immutable_db_options.preserve_deletes; options.two_write_queues = immutable_db_options.two_write_queues; options.manual_wal_flush = immutable_db_options.manual_wal_flush; + options.atomic_flush = immutable_db_options.atomic_flush; return options; } @@ -215,7 +216,8 @@ std::map std::unordered_map OptionsHelper::checksum_type_string_map = {{"kNoChecksum", kNoChecksum}, {"kCRC32c", kCRC32c}, - {"kxxHash", kxxHash}}; + {"kxxHash", kxxHash}, + {"kxxHash64", kxxHash64}}; std::unordered_map OptionsHelper::compression_type_string_map = { @@ -1554,7 +1556,11 @@ std::unordered_map offsetof(struct ImmutableDBOptions, manual_wal_flush)}}, {"seq_per_batch", {0, OptionType::kBoolean, OptionVerificationType::kDeprecated, false, - 0}}}; + 0}}, + {"atomic_flush", + {offsetof(struct DBOptions, atomic_flush), OptionType::kBoolean, + OptionVerificationType::kNormal, false, + offsetof(struct ImmutableDBOptions, atomic_flush)}}}; std::unordered_map OptionsHelper::block_base_table_index_type_string_map = { diff --git a/options/options_parser.cc b/options/options_parser.cc index f9144b67d77..32cfb8d5316 100644 --- a/options/options_parser.cc +++ b/options/options_parser.cc @@ -48,7 +48,7 @@ Status PersistRocksDBOptions(const DBOptions& db_opt, if (!s.ok()) { return s; } - unique_ptr writable; + std::unique_ptr writable; writable.reset(new WritableFileWriter(std::move(wf), file_name, EnvOptions(), nullptr /* statistics */)); diff --git a/options/options_settable_test.cc b/options/options_settable_test.cc index ded152ba99e..cad1af3d769 100644 --- a/options/options_settable_test.cc +++ b/options/options_settable_test.cc @@ -291,7 +291,8 @@ TEST_F(OptionsSettableTest, DBOptionsAllFieldsSettable) { "concurrent_prepare=false;" "two_write_queues=false;" "manual_wal_flush=false;" - "seq_per_batch=false;", + "seq_per_batch=false;" + "atomic_flush=false", new_options)); ASSERT_EQ(unset_bytes_base, NumUnsetBytes(new_options_ptr, sizeof(DBOptions), diff --git a/port/jemalloc_helper.h b/port/jemalloc_helper.h new file mode 100644 index 00000000000..412a80d26a4 --- /dev/null +++ b/port/jemalloc_helper.h @@ -0,0 +1,49 @@ +// Copyright (c) 2011-present, Facebook, Inc. All rights reserved. 
+// This source code is licensed under both the GPLv2 (found in the +// COPYING file in the root directory) and Apache 2.0 License +// (found in the LICENSE.Apache file in the root directory). + +#pragma once + +#ifdef ROCKSDB_JEMALLOC +#ifdef __FreeBSD__ +#include +#else +#include +#endif + +// Declare non-standard jemalloc APIs as weak symbols. We can null-check these +// symbols to detect whether jemalloc is linked with the binary. +extern "C" void* mallocx(size_t, int) __attribute__((__weak__)); +extern "C" void* rallocx(void*, size_t, int) __attribute__((__weak__)); +extern "C" size_t xallocx(void*, size_t, size_t, int) __attribute__((__weak__)); +extern "C" size_t sallocx(const void*, int) __attribute__((__weak__)); +extern "C" void dallocx(void*, int) __attribute__((__weak__)); +extern "C" void sdallocx(void*, size_t, int) __attribute__((__weak__)); +extern "C" size_t nallocx(size_t, int) __attribute__((__weak__)); +extern "C" int mallctl(const char*, void*, size_t*, void*, size_t) + __attribute__((__weak__)); +extern "C" int mallctlnametomib(const char*, size_t*, size_t*) + __attribute__((__weak__)); +extern "C" int mallctlbymib(const size_t*, size_t, void*, size_t*, void*, + size_t) __attribute__((__weak__)); +extern "C" void malloc_stats_print(void (*)(void*, const char*), void*, + const char*) __attribute__((__weak__)); +extern "C" size_t malloc_usable_size(JEMALLOC_USABLE_SIZE_CONST void*) + JEMALLOC_CXX_THROW __attribute__((__weak__)); + +// Check if Jemalloc is linked with the binary. Note the main program might be +// using a different memory allocator even this method return true. +// It is loosely based on folly::usingJEMalloc(), minus the check that actually +// allocate memory and see if it is through jemalloc, to handle the dlopen() +// case: +// https://github.com/facebook/folly/blob/76cf8b5841fb33137cfbf8b224f0226437c855bc/folly/memory/Malloc.h#L147 +static inline bool HasJemalloc() { + return mallocx != nullptr && rallocx != nullptr && xallocx != nullptr && + sallocx != nullptr && dallocx != nullptr && sdallocx != nullptr && + nallocx != nullptr && mallctl != nullptr && + mallctlnametomib != nullptr && mallctlbymib != nullptr && + malloc_stats_print != nullptr && malloc_usable_size != nullptr; +} + +#endif // ROCKSDB_JEMALLOC diff --git a/port/win/env_win.cc b/port/win/env_win.cc index 723a273f0bf..d3013906709 100644 --- a/port/win/env_win.cc +++ b/port/win/env_win.cc @@ -102,7 +102,8 @@ WinEnvIO::~WinEnvIO() { Status WinEnvIO::DeleteFile(const std::string& fname) { Status result; - BOOL ret = DeleteFileA(fname.c_str()); + BOOL ret = RX_DeleteFile(RX_FN(fname).c_str()); + if(!ret) { auto lastError = GetLastError(); result = IOErrorFromWindowsError("Failed to delete: " + fname, @@ -114,7 +115,7 @@ Status WinEnvIO::DeleteFile(const std::string& fname) { Status WinEnvIO::Truncate(const std::string& fname, size_t size) { Status s; - int result = truncate(fname.c_str(), size); + int result = rocksdb::port::Truncate(fname, size); if (result != 0) { s = IOError("Failed to truncate: " + fname, errno); } @@ -151,8 +152,8 @@ Status WinEnvIO::NewSequentialFile(const std::string& fname, { IOSTATS_TIMER_GUARD(open_nanos); - hFile = CreateFileA( - fname.c_str(), GENERIC_READ, + hFile = RX_CreateFile( + RX_FN(fname).c_str(), GENERIC_READ, FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE, NULL, OPEN_EXISTING, // Original fopen mode is "rb" fileFlags, NULL); @@ -190,7 +191,7 @@ Status WinEnvIO::NewRandomAccessFile(const std::string& fname, { IOSTATS_TIMER_GUARD(open_nanos); 
hFile = - CreateFileA(fname.c_str(), GENERIC_READ, + RX_CreateFile(RX_FN(fname).c_str(), GENERIC_READ, FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE, NULL, OPEN_EXISTING, fileFlags, NULL); } @@ -217,7 +218,7 @@ Status WinEnvIO::NewRandomAccessFile(const std::string& fname, "NewRandomAccessFile failed to map empty file: " + fname, EINVAL); } - HANDLE hMap = CreateFileMappingA(hFile, NULL, PAGE_READONLY, + HANDLE hMap = RX_CreateFileMapping(hFile, NULL, PAGE_READONLY, 0, // Whole file at its present length 0, NULL); // Mapping name @@ -302,8 +303,8 @@ Status WinEnvIO::OpenWritableFile(const std::string& fname, HANDLE hFile = 0; { IOSTATS_TIMER_GUARD(open_nanos); - hFile = CreateFileA( - fname.c_str(), + hFile = RX_CreateFile( + RX_FN(fname).c_str(), desired_access, // Access desired shared_mode, NULL, // Security attributes @@ -366,7 +367,7 @@ Status WinEnvIO::NewRandomRWFile(const std::string & fname, { IOSTATS_TIMER_GUARD(open_nanos); hFile = - CreateFileA(fname.c_str(), + RX_CreateFile(RX_FN(fname).c_str(), desired_access, shared_mode, NULL, // Security attributes @@ -399,8 +400,8 @@ Status WinEnvIO::NewMemoryMappedFileBuffer(const std::string & fname, HANDLE hFile = INVALID_HANDLE_VALUE; { IOSTATS_TIMER_GUARD(open_nanos); - hFile = CreateFileA( - fname.c_str(), GENERIC_READ | GENERIC_WRITE, + hFile = RX_CreateFile( + RX_FN(fname).c_str(), GENERIC_READ | GENERIC_WRITE, FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE, NULL, OPEN_EXISTING, // Open only if it exists @@ -432,7 +433,7 @@ Status WinEnvIO::NewMemoryMappedFileBuffer(const std::string & fname, "The specified file size does not fit into 32-bit memory addressing: " + fname); } - HANDLE hMap = CreateFileMappingA(hFile, NULL, PAGE_READWRITE, + HANDLE hMap = RX_CreateFileMapping(hFile, NULL, PAGE_READWRITE, 0, // Whole file at its present length 0, NULL); // Mapping name @@ -483,7 +484,7 @@ Status WinEnvIO::NewDirectory(const std::string& name, // 0 - for access means read metadata { IOSTATS_TIMER_GUARD(open_nanos); - handle = ::CreateFileA(name.c_str(), 0, + handle = RX_CreateFile(RX_FN(name).c_str(), 0, FILE_SHARE_DELETE | FILE_SHARE_READ | FILE_SHARE_WRITE, NULL, OPEN_EXISTING, @@ -509,8 +510,7 @@ Status WinEnvIO::FileExists(const std::string& fname) { // which is consistent with _access() impl on windows // but can be added WIN32_FILE_ATTRIBUTE_DATA attrs; - if (FALSE == GetFileAttributesExA(fname.c_str(), GetFileExInfoStandard, - &attrs)) { + if (FALSE == RX_GetFileAttributesEx(RX_FN(fname).c_str(), GetFileExInfoStandard, &attrs)) { auto lastError = GetLastError(); switch (lastError) { case ERROR_ACCESS_DENIED: @@ -535,11 +535,12 @@ Status WinEnvIO::GetChildren(const std::string& dir, result->clear(); std::vector output; - WIN32_FIND_DATA data; + RX_WIN32_FIND_DATA data; + memset(&data, 0, sizeof(data)); std::string pattern(dir); pattern.append("\\").append("*"); - HANDLE handle = ::FindFirstFileExA(pattern.c_str(), + HANDLE handle = RX_FindFirstFileEx(RX_FN(pattern).c_str(), FindExInfoBasic, // Do not want alternative name &data, FindExSearchNameMatch, @@ -572,8 +573,9 @@ Status WinEnvIO::GetChildren(const std::string& dir, data.cFileName[MAX_PATH - 1] = 0; while (true) { - output.emplace_back(data.cFileName); - BOOL ret =- ::FindNextFileA(handle, &data); + auto x = RX_FILESTRING(data.cFileName, RX_FNLEN(data.cFileName)); + output.emplace_back(FN_TO_RX(x)); + BOOL ret =- RX_FindNextFile(handle, &data); // If the function fails the return value is zero // and non-zero otherwise. Not TRUE or FALSE. 
if (ret == FALSE) { @@ -588,8 +590,7 @@ Status WinEnvIO::GetChildren(const std::string& dir, Status WinEnvIO::CreateDir(const std::string& name) { Status result; - - BOOL ret = CreateDirectoryA(name.c_str(), NULL); + BOOL ret = RX_CreateDirectory(RX_FN(name).c_str(), NULL); if (!ret) { auto lastError = GetLastError(); result = IOErrorFromWindowsError( @@ -606,7 +607,7 @@ Status WinEnvIO::CreateDirIfMissing(const std::string& name) { return result; } - BOOL ret = CreateDirectoryA(name.c_str(), NULL); + BOOL ret = RX_CreateDirectory(RX_FN(name).c_str(), NULL); if (!ret) { auto lastError = GetLastError(); if (lastError != ERROR_ALREADY_EXISTS) { @@ -622,7 +623,7 @@ Status WinEnvIO::CreateDirIfMissing(const std::string& name) { Status WinEnvIO::DeleteDir(const std::string& name) { Status result; - BOOL ret = RemoveDirectoryA(name.c_str()); + BOOL ret = RX_RemoveDirectory(RX_FN(name).c_str()); if (!ret) { auto lastError = GetLastError(); result = IOErrorFromWindowsError("Failed to remove dir: " + name, lastError); @@ -635,7 +636,7 @@ Status WinEnvIO::GetFileSize(const std::string& fname, Status s; WIN32_FILE_ATTRIBUTE_DATA attrs; - if (GetFileAttributesExA(fname.c_str(), GetFileExInfoStandard, &attrs)) { + if (RX_GetFileAttributesEx(RX_FN(fname).c_str(), GetFileExInfoStandard, &attrs)) { ULARGE_INTEGER file_size; file_size.HighPart = attrs.nFileSizeHigh; file_size.LowPart = attrs.nFileSizeLow; @@ -670,7 +671,7 @@ Status WinEnvIO::GetFileModificationTime(const std::string& fname, Status s; WIN32_FILE_ATTRIBUTE_DATA attrs; - if (GetFileAttributesExA(fname.c_str(), GetFileExInfoStandard, &attrs)) { + if (RX_GetFileAttributesEx(RX_FN(fname).c_str(), GetFileExInfoStandard, &attrs)) { *file_mtime = FileTimeToUnixTime(attrs.ftLastWriteTime); } else { auto lastError = GetLastError(); @@ -688,7 +689,7 @@ Status WinEnvIO::RenameFile(const std::string& src, // rename() is not capable of replacing the existing file as on Linux // so use OS API directly - if (!MoveFileExA(src.c_str(), target.c_str(), MOVEFILE_REPLACE_EXISTING)) { + if (!RX_MoveFileEx(RX_FN(src).c_str(), RX_FN(target).c_str(), MOVEFILE_REPLACE_EXISTING)) { DWORD lastError = GetLastError(); std::string text("Failed to rename: "); @@ -704,7 +705,7 @@ Status WinEnvIO::LinkFile(const std::string& src, const std::string& target) { Status result; - if (!CreateHardLinkA(target.c_str(), src.c_str(), NULL)) { + if (!RX_CreateHardLink(RX_FN(target).c_str(), RX_FN(src).c_str(), NULL)) { DWORD lastError = GetLastError(); if (lastError == ERROR_NOT_SAME_DEVICE) { return Status::NotSupported("No cross FS links allowed"); @@ -721,8 +722,9 @@ Status WinEnvIO::LinkFile(const std::string& src, Status WinEnvIO::NumFileLinks(const std::string& fname, uint64_t* count) { Status s; - HANDLE handle = ::CreateFileA( - fname.c_str(), 0, FILE_SHARE_DELETE | FILE_SHARE_READ | FILE_SHARE_WRITE, + HANDLE handle = RX_CreateFile( + RX_FN(fname).c_str(), 0, + FILE_SHARE_DELETE | FILE_SHARE_READ | FILE_SHARE_WRITE, NULL, OPEN_EXISTING, FILE_FLAG_BACKUP_SEMANTICS, NULL); if (INVALID_HANDLE_VALUE == handle) { @@ -758,7 +760,7 @@ Status WinEnvIO::AreFilesSame(const std::string& first, } // 0 - for access means read metadata - HANDLE file_1 = ::CreateFileA(first.c_str(), 0, + HANDLE file_1 = RX_CreateFile(RX_FN(first).c_str(), 0, FILE_SHARE_DELETE | FILE_SHARE_READ | FILE_SHARE_WRITE, NULL, OPEN_EXISTING, @@ -773,7 +775,7 @@ Status WinEnvIO::AreFilesSame(const std::string& first, } UniqueCloseHandlePtr g_1(file_1, CloseHandleFunc); - HANDLE file_2 = ::CreateFileA(second.c_str(), 
0, + HANDLE file_2 = RX_CreateFile(RX_FN(second).c_str(), 0, FILE_SHARE_DELETE | FILE_SHARE_READ | FILE_SHARE_WRITE, NULL, OPEN_EXISTING, FILE_FLAG_BACKUP_SEMANTICS, // make opening folders possible @@ -835,7 +837,7 @@ Status WinEnvIO::LockFile(const std::string& lockFname, HANDLE hFile = 0; { IOSTATS_TIMER_GUARD(open_nanos); - hFile = CreateFileA(lockFname.c_str(), (GENERIC_READ | GENERIC_WRITE), + hFile = RX_CreateFile(RX_FN(lockFname).c_str(), (GENERIC_READ | GENERIC_WRITE), ExclusiveAccessON, NULL, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL); } @@ -898,8 +900,8 @@ Status WinEnvIO::NewLogger(const std::string& fname, HANDLE hFile = 0; { IOSTATS_TIMER_GUARD(open_nanos); - hFile = CreateFileA( - fname.c_str(), GENERIC_WRITE, + hFile = RX_CreateFile( + RX_FN(fname).c_str(), GENERIC_WRITE, FILE_SHARE_READ | FILE_SHARE_DELETE, // In RocksDb log files are // renamed and deleted before // they are closed. This enables @@ -992,17 +994,17 @@ Status WinEnvIO::GetAbsolutePath(const std::string& db_path, // For test compatibility we will consider starting slash as an // absolute path if ((!db_path.empty() && (db_path[0] == '\\' || db_path[0] == '/')) || - !PathIsRelativeA(db_path.c_str())) { + !RX_PathIsRelative(RX_FN(db_path).c_str())) { *output_path = db_path; return Status::OK(); } - std::string result; + RX_FILESTRING result; result.resize(MAX_PATH); // Hopefully no changes the current directory while we do this // however _getcwd also suffers from the same limitation - DWORD len = GetCurrentDirectoryA(MAX_PATH, &result[0]); + DWORD len = RX_GetCurrentDirectory(MAX_PATH, &result[0]); if (len == 0) { auto lastError = GetLastError(); return IOErrorFromWindowsError("Failed to get current working directory", @@ -1010,8 +1012,9 @@ Status WinEnvIO::GetAbsolutePath(const std::string& db_path, } result.resize(len); - - result.swap(*output_path); + std::string res = FN_TO_RX(result); + + res.swap(*output_path); return Status::OK(); } @@ -1076,7 +1079,7 @@ EnvOptions WinEnvIO::OptimizeForManifestRead( // Returns true iff the named directory exists and is a directory. bool WinEnvIO::DirExists(const std::string& dname) { WIN32_FILE_ATTRIBUTE_DATA attrs; - if (GetFileAttributesExA(dname.c_str(), GetFileExInfoStandard, &attrs)) { + if (RX_GetFileAttributesEx(RX_FN(dname).c_str(), GetFileExInfoStandard, &attrs)) { return 0 != (attrs.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY); } return false; @@ -1085,7 +1088,7 @@ bool WinEnvIO::DirExists(const std::string& dname) { size_t WinEnvIO::GetSectorSize(const std::string& fname) { size_t sector_size = kSectorSize; - if (PathIsRelativeA(fname.c_str())) { + if (RX_PathIsRelative(RX_FN(fname).c_str())) { return sector_size; } diff --git a/port/win/env_win.h b/port/win/env_win.h index 81b323a7119..d61ac3acd6d 100644 --- a/port/win/env_win.h +++ b/port/win/env_win.h @@ -109,8 +109,8 @@ class WinEnvIO { // The returned file will only be accessed by one thread at a time. 
virtual Status NewRandomRWFile(const std::string& fname, - unique_ptr* result, - const EnvOptions& options); + std::unique_ptr* result, + const EnvOptions& options); virtual Status NewMemoryMappedFileBuffer( const std::string& fname, diff --git a/port/win/io_win.h b/port/win/io_win.h index 3b08c394f4a..c46876b8c0c 100644 --- a/port/win/io_win.h +++ b/port/win/io_win.h @@ -58,7 +58,7 @@ class WinFileData { protected: const std::string filename_; HANDLE hFile_; - // If ture, the I/O issued would be direct I/O which the buffer + // If true, the I/O issued would be direct I/O which the buffer // will need to be aligned (not sure there is a guarantee that the buffer // passed in is aligned). const bool use_direct_io_; diff --git a/port/win/port_win.cc b/port/win/port_win.cc index 75b4ec6de90..6ca5bba3b94 100644 --- a/port/win/port_win.cc +++ b/port/win/port_win.cc @@ -26,11 +26,30 @@ #include #include +#ifdef ROCKSDB_WINDOWS_UTF8_FILENAMES +// utf8 <-> utf16 +#include +#include +#include +#endif + #include "util/logging.h" namespace rocksdb { namespace port { +#ifdef ROCKSDB_WINDOWS_UTF8_FILENAMES +std::string utf16_to_utf8(const std::wstring& utf16) { + std::wstring_convert,wchar_t> convert; + return convert.to_bytes(utf16); +} + +std::wstring utf8_to_utf16(const std::string& utf8) { + std::wstring_convert> converter; + return converter.from_bytes(utf8); +} +#endif + void gettimeofday(struct timeval* tv, struct timezone* /* tz */) { using namespace std::chrono; @@ -110,7 +129,7 @@ void InitOnce(OnceType* once, void (*initializer)()) { struct DIR { HANDLE handle_; bool firstread_; - WIN32_FIND_DATA data_; + RX_WIN32_FIND_DATA data_; dirent entry_; DIR() : handle_(INVALID_HANDLE_VALUE), @@ -137,7 +156,7 @@ DIR* opendir(const char* name) { std::unique_ptr dir(new DIR); - dir->handle_ = ::FindFirstFileExA(pattern.c_str(), + dir->handle_ = RX_FindFirstFileEx(RX_FN(pattern).c_str(), FindExInfoBasic, // Do not want alternative name &dir->data_, FindExSearchNameMatch, @@ -148,8 +167,9 @@ DIR* opendir(const char* name) { return nullptr; } + RX_FILESTRING x(dir->data_.cFileName, RX_FNLEN(dir->data_.cFileName)); strcpy_s(dir->entry_.d_name, sizeof(dir->entry_.d_name), - dir->data_.cFileName); + FN_TO_RX(x).c_str()); return dir.release(); } @@ -165,14 +185,15 @@ struct dirent* readdir(DIR* dirp) { return &dirp->entry_; } - auto ret = ::FindNextFileA(dirp->handle_, &dirp->data_); + auto ret = RX_FindNextFile(dirp->handle_, &dirp->data_); if (ret == 0) { return nullptr; } + RX_FILESTRING x(dirp->data_.cFileName, RX_FNLEN(dirp->data_.cFileName)); strcpy_s(dirp->entry_.d_name, sizeof(dirp->entry_.d_name), - dirp->data_.cFileName); + FN_TO_RX(x).c_str()); return &dirp->entry_; } @@ -182,11 +203,15 @@ int closedir(DIR* dirp) { return 0; } -int truncate(const char* path, int64_t len) { +int truncate(const char* path, int64_t length) { if (path == nullptr) { errno = EFAULT; return -1; } + return rocksdb::port::Truncate(path, length); +} + +int Truncate(std::string path, int64_t len) { if (len < 0) { errno = EINVAL; @@ -194,7 +219,7 @@ int truncate(const char* path, int64_t len) { } HANDLE hFile = - CreateFile(path, GENERIC_READ | GENERIC_WRITE, + RX_CreateFile(RX_FN(path).c_str(), GENERIC_READ | GENERIC_WRITE, FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE, NULL, // Security attrs OPEN_EXISTING, // Truncate existing file only diff --git a/port/win/port_win.h b/port/win/port_win.h index 41ccea68d45..9b8ba9ff89f 100644 --- a/port/win/port_win.h +++ b/port/win/port_win.h @@ -327,11 +327,62 @@ inline void* 
pthread_getspecific(pthread_key_t key) { // using C-runtime to implement. Note, this does not // feel space with zeros in case the file is extended. int truncate(const char* path, int64_t length); +int Truncate(std::string path, int64_t length); void Crash(const std::string& srcfile, int srcline); extern int GetMaxOpenFiles(); +std::string utf16_to_utf8(const std::wstring& utf16); +std::wstring utf8_to_utf16(const std::string& utf8); } // namespace port + +#ifdef ROCKSDB_WINDOWS_UTF8_FILENAMES + +#define RX_FILESTRING std::wstring +#define RX_FN(a) rocksdb::port::utf8_to_utf16(a) +#define FN_TO_RX(a) rocksdb::port::utf16_to_utf8(a) +#define RX_FNLEN(a) ::wcslen(a) + +#define RX_DeleteFile DeleteFileW +#define RX_CreateFile CreateFileW +#define RX_CreateFileMapping CreateFileMappingW +#define RX_GetFileAttributesEx GetFileAttributesExW +#define RX_FindFirstFileEx FindFirstFileExW +#define RX_FindNextFile FindNextFileW +#define RX_WIN32_FIND_DATA WIN32_FIND_DATAW +#define RX_CreateDirectory CreateDirectoryW +#define RX_RemoveDirectory RemoveDirectoryW +#define RX_GetFileAttributesEx GetFileAttributesExW +#define RX_MoveFileEx MoveFileExW +#define RX_CreateHardLink CreateHardLinkW +#define RX_PathIsRelative PathIsRelativeW +#define RX_GetCurrentDirectory GetCurrentDirectoryW + +#else + +#define RX_FILESTRING std::string +#define RX_FN(a) a +#define FN_TO_RX(a) a +#define RX_FNLEN(a) strlen(a) + +#define RX_DeleteFile DeleteFileA +#define RX_CreateFile CreateFileA +#define RX_CreateFileMapping CreateFileMappingA +#define RX_GetFileAttributesEx GetFileAttributesExA +#define RX_FindFirstFileEx FindFirstFileExA +#define RX_CreateDirectory CreateDirectoryA +#define RX_FindNextFile FindNextFileA +#define RX_WIN32_FIND_DATA WIN32_FIND_DATA +#define RX_CreateDirectory CreateDirectoryA +#define RX_RemoveDirectory RemoveDirectoryA +#define RX_GetFileAttributesEx GetFileAttributesExA +#define RX_MoveFileEx MoveFileExA +#define RX_CreateHardLink CreateHardLinkA +#define RX_PathIsRelative PathIsRelativeA +#define RX_GetCurrentDirectory GetCurrentDirectoryA + +#endif + using port::pthread_key_t; using port::pthread_key_create; using port::pthread_key_delete; diff --git a/port/win/win_thread.cc b/port/win/win_thread.cc index b48af2370fc..9a976e2c6b8 100644 --- a/port/win/win_thread.cc +++ b/port/win/win_thread.cc @@ -40,7 +40,7 @@ struct WindowsThread::Data { void WindowsThread::Init(std::function&& func) { data_ = std::make_shared(std::move(func)); - // We create another instance of shared_ptr to get an additional ref + // We create another instance of std::shared_ptr to get an additional ref // since we may detach and destroy this instance before the threadproc // may start to run. 
We choose to allocate this additional ref on the heap // so we do not need to synchronize and allow this thread to proceed diff --git a/src.mk b/src.mk index e2ad3f45c18..97dad2034b3 100644 --- a/src.mk +++ b/src.mk @@ -11,6 +11,7 @@ LIB_SOURCES = \ db/compaction_iterator.cc \ db/compaction_job.cc \ db/compaction_picker.cc \ + db/compaction_picker_fifo.cc \ db/compaction_picker_universal.cc \ db/convenience.cc \ db/db_filesnapshot.cc \ @@ -43,6 +44,7 @@ LIB_SOURCES = \ db/merge_helper.cc \ db/merge_operator.cc \ db/range_del_aggregator.cc \ + db/range_tombstone_fragmenter.cc \ db/repair.cc \ db/snapshot_impl.cc \ db/table_cache.cc \ @@ -120,6 +122,7 @@ LIB_SOURCES = \ table/plain_table_index.cc \ table/plain_table_key_coding.cc \ table/plain_table_reader.cc \ + table/sst_file_reader.cc \ table/sst_file_writer.cc \ table/table_properties.cc \ table/two_level_iterator.cc \ @@ -142,6 +145,7 @@ LIB_SOURCES = \ util/filename.cc \ util/filter_policy.cc \ util/hash.cc \ + util/jemalloc_nodump_allocator.cc \ util/log_buffer.cc \ util/murmurhash.cc \ util/random.cc \ @@ -341,6 +345,7 @@ MAIN_SOURCES = \ db/repair_test.cc \ db/range_del_aggregator_test.cc \ db/range_del_aggregator_bench.cc \ + db/range_tombstone_fragmenter_test.cc \ db/table_properties_collector_test.cc \ db/util_merge_operators_test.cc \ db/version_builder_test.cc \ @@ -369,6 +374,7 @@ MAIN_SOURCES = \ table/data_block_hash_index_test.cc \ table/full_filter_block_test.cc \ table/merger_test.cc \ + table/sst_file_reader_test.cc \ table/table_reader_bench.cc \ table/table_test.cc \ third-party/gtest-1.7.0/fused-src/gtest/gtest-all.cc \ @@ -449,6 +455,7 @@ JNI_NATIVE_SOURCES = \ java/rocksjni/loggerjnicallback.cc \ java/rocksjni/lru_cache.cc \ java/rocksjni/memtablejni.cc \ + java/rocksjni/memory_util.cc \ java/rocksjni/merge_operator.cc \ java/rocksjni/native_comparator_wrapper_test.cc \ java/rocksjni/optimistic_transaction_db.cc \ @@ -481,4 +488,5 @@ JNI_NATIVE_SOURCES = \ java/rocksjni/write_batch.cc \ java/rocksjni/writebatchhandlerjnicallback.cc \ java/rocksjni/write_batch_test.cc \ - java/rocksjni/write_batch_with_index.cc + java/rocksjni/write_batch_with_index.cc \ + java/rocksjni/write_buffer_manager.cc diff --git a/table/adaptive_table_factory.cc b/table/adaptive_table_factory.cc index 0a3e9415ad7..bbba3b91935 100644 --- a/table/adaptive_table_factory.cc +++ b/table/adaptive_table_factory.cc @@ -42,8 +42,8 @@ extern const uint64_t kCuckooTableMagicNumber; Status AdaptiveTableFactory::NewTableReader( const TableReaderOptions& table_reader_options, - unique_ptr&& file, uint64_t file_size, - unique_ptr* table, + std::unique_ptr&& file, uint64_t file_size, + std::unique_ptr* table, bool /*prefetch_index_and_filter_in_cache*/) const { Footer footer; auto s = ReadFooterFromFile(file.get(), nullptr /* prefetch_buffer */, diff --git a/table/adaptive_table_factory.h b/table/adaptive_table_factory.h index 00af6a76e95..2a82dbfa988 100644 --- a/table/adaptive_table_factory.h +++ b/table/adaptive_table_factory.h @@ -35,8 +35,8 @@ class AdaptiveTableFactory : public TableFactory { Status NewTableReader( const TableReaderOptions& table_reader_options, - unique_ptr&& file, uint64_t file_size, - unique_ptr* table, + std::unique_ptr&& file, uint64_t file_size, + std::unique_ptr* table, bool prefetch_index_and_filter_in_cache = true) const override; TableBuilder* NewTableBuilder( diff --git a/table/block.cc b/table/block.cc index c8247828e4f..4e8d6e5ca5a 100644 --- a/table/block.cc +++ b/table/block.cc @@ -781,47 +781,45 @@ 
Block::Block(BlockContents&& contents, SequenceNumber _global_seqno, size_ = 0; // Error marker } else { // Should only decode restart points for uncompressed blocks - if (compression_type() == kNoCompression) { - num_restarts_ = NumRestarts(); - switch (IndexType()) { - case BlockBasedTableOptions::kDataBlockBinarySearch: - restart_offset_ = static_cast(size_) - - (1 + num_restarts_) * sizeof(uint32_t); - if (restart_offset_ > size_ - sizeof(uint32_t)) { - // The size is too small for NumRestarts() and therefore - // restart_offset_ wrapped around. - size_ = 0; - } + num_restarts_ = NumRestarts(); + switch (IndexType()) { + case BlockBasedTableOptions::kDataBlockBinarySearch: + restart_offset_ = static_cast(size_) - + (1 + num_restarts_) * sizeof(uint32_t); + if (restart_offset_ > size_ - sizeof(uint32_t)) { + // The size is too small for NumRestarts() and therefore + // restart_offset_ wrapped around. + size_ = 0; + } + break; + case BlockBasedTableOptions::kDataBlockBinaryAndHash: + if (size_ < sizeof(uint32_t) /* block footer */ + + sizeof(uint16_t) /* NUM_BUCK */) { + size_ = 0; break; - case BlockBasedTableOptions::kDataBlockBinaryAndHash: - if (size_ < sizeof(uint32_t) /* block footer */ + - sizeof(uint16_t) /* NUM_BUCK */) { - size_ = 0; - break; - } - - uint16_t map_offset; - data_block_hash_index_.Initialize( - contents.data.data(), - static_cast(contents.data.size() - - sizeof(uint32_t)), /*chop off - NUM_RESTARTS*/ - &map_offset); - - restart_offset_ = map_offset - num_restarts_ * sizeof(uint32_t); - - if (restart_offset_ > map_offset) { - // map_offset is too small for NumRestarts() and - // therefore restart_offset_ wrapped around. - size_ = 0; - break; - } + } + + uint16_t map_offset; + data_block_hash_index_.Initialize( + contents.data.data(), + static_cast(contents.data.size() - + sizeof(uint32_t)), /*chop off + NUM_RESTARTS*/ + &map_offset); + + restart_offset_ = map_offset - num_restarts_ * sizeof(uint32_t); + + if (restart_offset_ > map_offset) { + // map_offset is too small for NumRestarts() and + // therefore restart_offset_ wrapped around. + size_ = 0; break; - default: - size_ = 0; // Error marker - } + } + break; + default: + size_ = 0; // Error marker + } } - } if (read_amp_bytes_per_bit != 0 && statistics && size_ != 0) { read_amp_bitmap_.reset(new BlockReadAmpBitmap( restart_offset_, read_amp_bytes_per_bit, statistics)); @@ -834,6 +832,7 @@ DataBlockIter* Block::NewIterator(const Comparator* cmp, const Comparator* ucmp, bool /*total_order_seek*/, bool /*key_includes_seq*/, bool /*value_is_full*/, + bool block_contents_pinned, BlockPrefixIndex* /*prefix_index*/) { DataBlockIter* ret_iter; if (iter != nullptr) { @@ -852,7 +851,7 @@ DataBlockIter* Block::NewIterator(const Comparator* cmp, const Comparator* ucmp, } else { ret_iter->Initialize( cmp, ucmp, data_, restart_offset_, num_restarts_, global_seqno_, - read_amp_bitmap_.get(), cachable(), + read_amp_bitmap_.get(), block_contents_pinned, data_block_hash_index_.Valid() ? &data_block_hash_index_ : nullptr); if (read_amp_bitmap_) { if (read_amp_bitmap_->GetStatistics() != stats) { @@ -870,6 +869,7 @@ IndexBlockIter* Block::NewIterator(const Comparator* cmp, const Comparator* ucmp, IndexBlockIter* iter, Statistics* /*stats*/, bool total_order_seek, bool key_includes_seq, bool value_is_full, + bool block_contents_pinned, BlockPrefixIndex* prefix_index) { IndexBlockIter* ret_iter; if (iter != nullptr) { @@ -890,7 +890,8 @@ IndexBlockIter* Block::NewIterator(const Comparator* cmp, total_order_seek ? 
nullptr : prefix_index; ret_iter->Initialize(cmp, ucmp, data_, restart_offset_, num_restarts_, prefix_index_ptr, key_includes_seq, value_is_full, - cachable(), nullptr /* data_block_hash_index */); + block_contents_pinned, + nullptr /* data_block_hash_index */); } return ret_iter; diff --git a/table/block.h b/table/block.h index 83900b56f55..1a8073203b4 100644 --- a/table/block.h +++ b/table/block.h @@ -153,14 +153,12 @@ class Block { size_t size() const { return size_; } const char* data() const { return data_; } - bool cachable() const { return contents_.cachable; } // The additional memory space taken by the block data. size_t usable_size() const { return contents_.usable_size(); } uint32_t NumRestarts() const; + bool own_bytes() const { return contents_.own_bytes(); } + BlockBasedTableOptions::DataBlockIndexType IndexType() const; - CompressionType compression_type() const { - return contents_.compression_type; - } // If comparator is InternalKeyComparator, user_comparator is its user // comparator; they are equal otherwise. @@ -170,7 +168,7 @@ class Block { // // key_includes_seq, default true, means that the keys are in internal key // format. - // value_is_full, default ture, means that no delta encoding is + // value_is_full, default true, means that no delta encoding is // applied to values. // // NewIterator @@ -180,6 +178,14 @@ class Block { // If `prefix_index` is not nullptr this block will do hash lookup for the key // prefix. If total_order_seek is true, prefix_index_ is ignored. // + // If `block_contents_pinned` is true, the caller will guarantee that when + // the cleanup functions are transferred from the iterator to other + // classes, e.g. PinnableSlice, the pointer to the bytes will still be + // valid. Either the iterator holds cache handle or ownership of some resource + // and release them in a release function, or caller is sure that the data + // will not go away (for example, it's from mmapped file which will not be + // closed). + // // NOTE: for the hash based lookup, if a key prefix doesn't match any key, // the iterator will simply be set as "invalid", rather than returning // the key that is just pass the target key. @@ -188,7 +194,8 @@ class Block { const Comparator* comparator, const Comparator* user_comparator, TBlockIter* iter = nullptr, Statistics* stats = nullptr, bool total_order_seek = true, bool key_includes_seq = true, - bool value_is_full = true, BlockPrefixIndex* prefix_index = nullptr); + bool value_is_full = true, bool block_contents_pinned = false, + BlockPrefixIndex* prefix_index = nullptr); // Report an approximation of how much memory has been used. size_t ApproximateMemoryUsage() const; @@ -295,7 +302,9 @@ class BlockIter : public InternalIteratorBase { Slice value_; Status status_; bool key_pinned_; - // whether the block data is guaranteed to outlive this iterator + // Whether the block data is guaranteed to outlive this iterator, and + // as long as the cleanup functions are transferred to another class, + // e.g. PinnableSlice, the pointer to the bytes will still be valid. bool block_contents_pinned_; SequenceNumber global_seqno_; @@ -449,7 +458,7 @@ class IndexBlockIter final : public BlockIter { } // key_includes_seq, default true, means that the keys are in internal key // format. - // value_is_full, default ture, means that no delta encoding is + // value_is_full, default true, means that no delta encoding is // applied to values. 
IndexBlockIter(const Comparator* comparator, const Comparator* user_comparator, const char* data, diff --git a/table/block_based_filter_block_test.cc b/table/block_based_filter_block_test.cc index 8de857f4efc..3cba09847a8 100644 --- a/table/block_based_filter_block_test.cc +++ b/table/block_based_filter_block_test.cc @@ -55,7 +55,7 @@ class FilterBlockTest : public testing::Test { TEST_F(FilterBlockTest, EmptyBuilder) { BlockBasedFilterBlockBuilder builder(nullptr, table_options_); - BlockContents block(builder.Finish(), false, kNoCompression); + BlockContents block(builder.Finish()); ASSERT_EQ("\\x00\\x00\\x00\\x00\\x0b", EscapeString(block.data)); BlockBasedFilterBlockReader reader(nullptr, table_options_, true, std::move(block), nullptr); @@ -75,7 +75,7 @@ TEST_F(FilterBlockTest, SingleChunk) { builder.StartBlock(300); builder.Add("hello"); ASSERT_EQ(5, builder.NumAdded()); - BlockContents block(builder.Finish(), false, kNoCompression); + BlockContents block(builder.Finish()); BlockBasedFilterBlockReader reader(nullptr, table_options_, true, std::move(block), nullptr); ASSERT_TRUE(reader.KeyMayMatch("foo", nullptr, 100)); @@ -107,7 +107,7 @@ TEST_F(FilterBlockTest, MultiChunk) { builder.Add("box"); builder.Add("hello"); - BlockContents block(builder.Finish(), false, kNoCompression); + BlockContents block(builder.Finish()); BlockBasedFilterBlockReader reader(nullptr, table_options_, true, std::move(block), nullptr); @@ -152,7 +152,7 @@ class BlockBasedFilterBlockTest : public testing::Test { TEST_F(BlockBasedFilterBlockTest, BlockBasedEmptyBuilder) { FilterBlockBuilder* builder = new BlockBasedFilterBlockBuilder( nullptr, table_options_); - BlockContents block(builder->Finish(), false, kNoCompression); + BlockContents block(builder->Finish()); ASSERT_EQ("\\x00\\x00\\x00\\x00\\x0b", EscapeString(block.data)); FilterBlockReader* reader = new BlockBasedFilterBlockReader( nullptr, table_options_, true, std::move(block), nullptr); @@ -174,7 +174,7 @@ TEST_F(BlockBasedFilterBlockTest, BlockBasedSingleChunk) { builder->Add("box"); builder->StartBlock(300); builder->Add("hello"); - BlockContents block(builder->Finish(), false, kNoCompression); + BlockContents block(builder->Finish()); FilterBlockReader* reader = new BlockBasedFilterBlockReader( nullptr, table_options_, true, std::move(block), nullptr); ASSERT_TRUE(reader->KeyMayMatch("foo", nullptr, 100)); @@ -210,7 +210,7 @@ TEST_F(BlockBasedFilterBlockTest, BlockBasedMultiChunk) { builder->Add("box"); builder->Add("hello"); - BlockContents block(builder->Finish(), false, kNoCompression); + BlockContents block(builder->Finish()); FilterBlockReader* reader = new BlockBasedFilterBlockReader( nullptr, table_options_, true, std::move(block), nullptr); diff --git a/table/block_based_table_builder.cc b/table/block_based_table_builder.cc index 59c385d65ae..a4007b07a2c 100644 --- a/table/block_based_table_builder.cc +++ b/table/block_based_table_builder.cc @@ -42,6 +42,7 @@ #include "util/coding.h" #include "util/compression.h" #include "util/crc32c.h" +#include "util/memory_allocator.h" #include "util/stop_watch.h" #include "util/string_util.h" #include "util/xxhash.h" @@ -449,6 +450,11 @@ void BlockBasedTableBuilder::Add(const Slice& key, const Slice& value) { r->props.num_entries++; r->props.raw_key_size += key.size(); r->props.raw_value_size += value.size(); + if (value_type == kTypeDeletion || value_type == kTypeSingleDeletion) { + r->props.num_deletions++; + } else if (value_type == kTypeMerge) { + r->props.num_merge_operands++; + } 
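The hunk above makes the table builder count deletion and merge entries as keys are added; once the file is finished, those counters are persisted with the table's properties. Below is a minimal sketch of how they might be read back, assuming the convenience overload of DB::GetPropertiesOfAllTables for the default column family; error handling is reduced to an early return.

```cpp
// Sketch only: dump per-file entry/deletion/merge counts. The num_deletions
// and num_merge_operands fields correspond to the counters incremented in
// BlockBasedTableBuilder::Add above.
#include <cstdio>
#include "rocksdb/db.h"
#include "rocksdb/table_properties.h"

void PrintTableEntryStats(rocksdb::DB* db) {
  rocksdb::TablePropertiesCollection props;
  rocksdb::Status s = db->GetPropertiesOfAllTables(&props);
  if (!s.ok()) {
    return;
  }
  for (const auto& file_and_props : props) {
    const rocksdb::TableProperties& tp = *file_and_props.second;
    std::printf("%s: entries=%llu deletions=%llu merges=%llu\n",
                file_and_props.first.c_str(),
                static_cast<unsigned long long>(tp.num_entries),
                static_cast<unsigned long long>(tp.num_deletions),
                static_cast<unsigned long long>(tp.num_merge_operands));
  }
}
```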
r->index_builder->OnKeyAdded(key); NotifyCollectTableCollectorsOnAdd(key, value, r->offset, @@ -609,6 +615,18 @@ void BlockBasedTableBuilder::WriteRawBlock(const Slice& block_contents, EncodeFixed32(trailer_without_type, XXH32_digest(xxh)); break; } + case kxxHash64: { + XXH64_state_t* const state = XXH64_createState(); + XXH64_reset(state, 0); + XXH64_update(state, block_contents.data(), + static_cast(block_contents.size())); + XXH64_update(state, trailer, 1); // Extend to cover block type + EncodeFixed32(trailer_without_type, + static_cast(XXH64_digest(state) & // lower 32 bits + uint64_t{0xffffffff})); + XXH64_freeState(state); + break; + } } assert(r->status.ok()); @@ -636,9 +654,9 @@ Status BlockBasedTableBuilder::status() const { return rep_->status; } -static void DeleteCachedBlock(const Slice& /*key*/, void* value) { - Block* block = reinterpret_cast(value); - delete block; +static void DeleteCachedBlockContents(const Slice& /*key*/, void* value) { + BlockContents* bc = reinterpret_cast(value); + delete bc; } // @@ -654,13 +672,16 @@ Status BlockBasedTableBuilder::InsertBlockInCache(const Slice& block_contents, size_t size = block_contents.size(); - std::unique_ptr ubuf(new char[size + 1]); + auto ubuf = + AllocateBlock(size + 1, block_cache_compressed->memory_allocator()); memcpy(ubuf.get(), block_contents.data(), size); ubuf[size] = type; - BlockContents results(std::move(ubuf), size, true, type); - - Block* block = new Block(std::move(results), kDisableGlobalSequenceNumber); + BlockContents* block_contents_to_cache = + new BlockContents(std::move(ubuf), size); +#ifndef NDEBUG + block_contents_to_cache->is_raw_block = true; +#endif // NDEBUG // make cache key by appending the file offset to the cache prefix id char* end = EncodeVarint64( @@ -671,8 +692,10 @@ Status BlockBasedTableBuilder::InsertBlockInCache(const Slice& block_contents, (end - r->compressed_cache_key_prefix)); // Insert into compressed block cache. - block_cache_compressed->Insert(key, block, block->ApproximateMemoryUsage(), - &DeleteCachedBlock); + block_cache_compressed->Insert( + key, block_contents_to_cache, + block_contents_to_cache->ApproximateMemoryUsage(), + &DeleteCachedBlockContents); // Invalidate OS cache. 
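The new `kxxHash64` checksum type still has to fit the existing 4-byte slot in the block trailer, so only the low 32 bits of the 64-bit digest are stored. Below is a minimal sketch of that computation, assuming the xxHash library (`xxhash.h`) is available; the streaming calls mirror the pattern above, but the helper name and the demo buffer are made up.

```cpp
#include <xxhash.h>
#include <cstdint>
#include <iostream>
#include <string>

// Hash the block payload plus the one-byte type, keep the lower 32 bits.
uint32_t BlockChecksumXxh64(const char* data, size_t n, char block_type) {
  XXH64_state_t* state = XXH64_createState();
  XXH64_reset(state, 0 /* seed */);
  XXH64_update(state, data, n);
  XXH64_update(state, &block_type, 1);  // extend the hash to cover the type byte
  uint64_t digest = XXH64_digest(state);
  XXH64_freeState(state);
  return static_cast<uint32_t>(digest & uint64_t{0xffffffff});  // lower 32 bits
}

int main() {
  std::string block = "example block contents";
  uint32_t checksum =
      BlockChecksumXxh64(block.data(), block.size(), 0 /* e.g. kNoCompression */);
  std::cout << std::hex << checksum << "\n";
  return 0;
}
```

The verification path computes the same truncated digest over the contiguous (contents + type byte) buffer, so writer and reader agree on the stored 32-bit value.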
r->file->InvalidateCache(static_cast(r->offset), size); diff --git a/table/block_based_table_factory.cc b/table/block_based_table_factory.cc index 485aed87041..fbb7406a3d8 100644 --- a/table/block_based_table_factory.cc +++ b/table/block_based_table_factory.cc @@ -194,8 +194,8 @@ BlockBasedTableFactory::BlockBasedTableFactory( Status BlockBasedTableFactory::NewTableReader( const TableReaderOptions& table_reader_options, - unique_ptr&& file, uint64_t file_size, - unique_ptr* table_reader, + std::unique_ptr&& file, uint64_t file_size, + std::unique_ptr* table_reader, bool prefetch_index_and_filter_in_cache) const { return BlockBasedTable::Open( table_reader_options.ioptions, table_reader_options.env_options, diff --git a/table/block_based_table_factory.h b/table/block_based_table_factory.h index b30bd6232ac..cde6f653573 100644 --- a/table/block_based_table_factory.h +++ b/table/block_based_table_factory.h @@ -53,8 +53,8 @@ class BlockBasedTableFactory : public TableFactory { Status NewTableReader( const TableReaderOptions& table_reader_options, - unique_ptr&& file, uint64_t file_size, - unique_ptr* table_reader, + std::unique_ptr&& file, uint64_t file_size, + std::unique_ptr* table_reader, bool prefetch_index_and_filter_in_cache = true) const override; TableBuilder* NewTableBuilder( diff --git a/table/block_based_table_reader.cc b/table/block_based_table_reader.cc index 9f2e02d6806..a126de88c04 100644 --- a/table/block_based_table_reader.cc +++ b/table/block_based_table_reader.cc @@ -78,13 +78,14 @@ Status ReadBlockFromFile( RandomAccessFileReader* file, FilePrefetchBuffer* prefetch_buffer, const Footer& footer, const ReadOptions& options, const BlockHandle& handle, std::unique_ptr* result, const ImmutableCFOptions& ioptions, - bool do_uncompress, const Slice& compression_dict, + bool do_uncompress, bool maybe_compressed, const Slice& compression_dict, const PersistentCacheOptions& cache_options, SequenceNumber global_seqno, - size_t read_amp_bytes_per_bit, const bool immortal_file = false) { + size_t read_amp_bytes_per_bit, MemoryAllocator* memory_allocator) { BlockContents contents; BlockFetcher block_fetcher(file, prefetch_buffer, footer, options, handle, &contents, ioptions, do_uncompress, - compression_dict, cache_options, immortal_file); + maybe_compressed, compression_dict, cache_options, + memory_allocator); Status s = block_fetcher.ReadBlockContents(); if (s.ok()) { result->reset(new Block(std::move(contents), global_seqno, @@ -94,6 +95,20 @@ Status ReadBlockFromFile( return s; } +inline MemoryAllocator* GetMemoryAllocator( + const BlockBasedTableOptions& table_options) { + return table_options.block_cache.get() + ? table_options.block_cache->memory_allocator() + : nullptr; +} + +inline MemoryAllocator* GetMemoryAllocatorForCompressedBlock( + const BlockBasedTableOptions& table_options) { + return table_options.block_cache_compressed.get() + ? table_options.block_cache_compressed->memory_allocator() + : nullptr; +} + // Delete the resource that is held by the iterator. 
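Several read paths above now take a `MemoryAllocator*` so that block buffers can come from the block cache's allocator instead of plain `new[]`. The sketch below models that plumbing with simplified stand-in types (they are not the RocksDB classes); the only point it makes is that a null allocator falls back to ordinary heap allocation, while a configured one supplies and later reclaims the buffer.

```cpp
#include <cstddef>
#include <functional>
#include <memory>

// Simplified stand-in for the allocator interface.
class MemoryAllocator {
 public:
  virtual ~MemoryAllocator() = default;
  virtual void* Allocate(size_t size) = 0;
  virtual void Deallocate(void* p) = 0;
};

using BlockBuffer = std::unique_ptr<char[], std::function<void(char*)>>;

// Counterpart of an AllocateBlock-style helper: nullptr means plain new[].
BlockBuffer AllocateBlock(size_t size, MemoryAllocator* allocator) {
  if (allocator == nullptr) {
    return BlockBuffer(new char[size], [](char* p) { delete[] p; });
  }
  char* p = static_cast<char*>(allocator->Allocate(size));
  return BlockBuffer(p, [allocator](char* q) { allocator->Deallocate(q); });
}

// Toy cache/options types, just enough to mirror the helper added above.
struct Cache {
  MemoryAllocator* memory_allocator() const { return allocator_; }
  MemoryAllocator* allocator_ = nullptr;
};

struct BlockBasedTableOptions {
  std::shared_ptr<Cache> block_cache;
};

MemoryAllocator* GetMemoryAllocator(const BlockBasedTableOptions& opts) {
  return opts.block_cache ? opts.block_cache->memory_allocator() : nullptr;
}

int main() {
  BlockBasedTableOptions opts;  // no block cache configured -> plain heap
  BlockBuffer buf = AllocateBlock(4096, GetMemoryAllocator(opts));
  buf[0] = 'x';
  return 0;
}
```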
template void DeleteHeldResource(void* arg, void* /*ignored*/) { @@ -215,13 +230,15 @@ class PartitionIndexReader : public IndexReader, public Cleanable { IndexReader** index_reader, const PersistentCacheOptions& cache_options, const int level, const bool index_key_includes_seq, - const bool index_value_is_full) { + const bool index_value_is_full, + MemoryAllocator* memory_allocator) { std::unique_ptr index_block; auto s = ReadBlockFromFile( file, prefetch_buffer, footer, ReadOptions(), index_handle, &index_block, ioptions, true /* decompress */, - Slice() /*compression dict*/, cache_options, - kDisableGlobalSequenceNumber, 0 /* read_amp_bytes_per_bit */); + true /*maybe_compressed*/, Slice() /*compression dict*/, cache_options, + kDisableGlobalSequenceNumber, 0 /* read_amp_bytes_per_bit */, + memory_allocator); if (s.ok()) { *index_reader = new PartitionIndexReader( @@ -239,6 +256,8 @@ class PartitionIndexReader : public IndexReader, public Cleanable { Statistics* kNullStats = nullptr; // Filters are already checked before seeking the index if (!partition_map_.empty()) { + // We don't return pinned datat from index blocks, so no need + // to set `block_contents_pinned`. return NewTwoLevelIterator( new BlockBasedTable::PartitionedIndexIteratorState( table_, &partition_map_, index_key_includes_seq_, @@ -250,6 +269,8 @@ class PartitionIndexReader : public IndexReader, public Cleanable { auto ro = ReadOptions(); ro.fill_cache = fill_cache; bool kIsIndex = true; + // We don't return pinned datat from index blocks, so no need + // to set `block_contents_pinned`. return new BlockBasedTableIterator( table_, ro, *icomparator_, index_block_->NewIterator( @@ -270,6 +291,8 @@ class PartitionIndexReader : public IndexReader, public Cleanable { IndexBlockIter biter; BlockHandle handle; Statistics* kNullStats = nullptr; + // We don't return pinned datat from index blocks, so no need + // to set `block_contents_pinned`. 
index_block_->NewIterator( icomparator_, icomparator_->user_comparator(), &biter, kNullStats, true, index_key_includes_seq_, index_value_is_full_); @@ -312,7 +335,7 @@ class PartitionIndexReader : public IndexReader, public Cleanable { const bool is_index = true; // TODO: Support counter batch update for partitioned index and // filter blocks - s = table_->MaybeLoadDataBlockToCache( + s = table_->MaybeReadBlockAndLoadToCache( prefetch_buffer.get(), rep, ro, handle, compression_dict, &block, is_index, nullptr /* get_context */); @@ -388,13 +411,15 @@ class BinarySearchIndexReader : public IndexReader { IndexReader** index_reader, const PersistentCacheOptions& cache_options, const bool index_key_includes_seq, - const bool index_value_is_full) { + const bool index_value_is_full, + MemoryAllocator* memory_allocator) { std::unique_ptr index_block; auto s = ReadBlockFromFile( file, prefetch_buffer, footer, ReadOptions(), index_handle, &index_block, ioptions, true /* decompress */, - Slice() /*compression dict*/, cache_options, - kDisableGlobalSequenceNumber, 0 /* read_amp_bytes_per_bit */); + true /*maybe_compressed*/, Slice() /*compression dict*/, cache_options, + kDisableGlobalSequenceNumber, 0 /* read_amp_bytes_per_bit */, + memory_allocator); if (s.ok()) { *index_reader = new BinarySearchIndexReader( @@ -409,6 +434,8 @@ class BinarySearchIndexReader : public IndexReader { IndexBlockIter* iter = nullptr, bool /*dont_care*/ = true, bool /*dont_care*/ = true) override { Statistics* kNullStats = nullptr; + // We don't return pinned datat from index blocks, so no need + // to set `block_contents_pinned`. return index_block_->NewIterator( icomparator_, icomparator_->user_comparator(), iter, kNullStats, true, index_key_includes_seq_, index_value_is_full_); @@ -458,13 +485,15 @@ class HashIndexReader : public IndexReader { InternalIterator* meta_index_iter, IndexReader** index_reader, bool /*hash_index_allow_collision*/, const PersistentCacheOptions& cache_options, - const bool index_key_includes_seq, const bool index_value_is_full) { + const bool index_key_includes_seq, const bool index_value_is_full, + MemoryAllocator* memory_allocator) { std::unique_ptr index_block; auto s = ReadBlockFromFile( file, prefetch_buffer, footer, ReadOptions(), index_handle, &index_block, ioptions, true /* decompress */, - Slice() /*compression dict*/, cache_options, - kDisableGlobalSequenceNumber, 0 /* read_amp_bytes_per_bit */); + true /*maybe_compressed*/, Slice() /*compression dict*/, cache_options, + kDisableGlobalSequenceNumber, 0 /* read_amp_bytes_per_bit */, + memory_allocator); if (!s.ok()) { return s; @@ -502,8 +531,9 @@ class HashIndexReader : public IndexReader { BlockContents prefixes_contents; BlockFetcher prefixes_block_fetcher( file, prefetch_buffer, footer, ReadOptions(), prefixes_handle, - &prefixes_contents, ioptions, true /* decompress */, - dummy_comp_dict /*compression dict*/, cache_options); + &prefixes_contents, ioptions, true /*decompress*/, + true /*maybe_compressed*/, dummy_comp_dict /*compression dict*/, + cache_options, memory_allocator); s = prefixes_block_fetcher.ReadBlockContents(); if (!s.ok()) { return s; @@ -511,8 +541,9 @@ class HashIndexReader : public IndexReader { BlockContents prefixes_meta_contents; BlockFetcher prefixes_meta_block_fetcher( file, prefetch_buffer, footer, ReadOptions(), prefixes_meta_handle, - &prefixes_meta_contents, ioptions, true /* decompress */, - dummy_comp_dict /*compression dict*/, cache_options); + &prefixes_meta_contents, ioptions, true /*decompress*/, 
+ true /*maybe_compressed*/, dummy_comp_dict /*compression dict*/, + cache_options, memory_allocator); s = prefixes_meta_block_fetcher.ReadBlockContents(); if (!s.ok()) { // TODO: log error @@ -534,10 +565,12 @@ class HashIndexReader : public IndexReader { IndexBlockIter* iter = nullptr, bool total_order_seek = true, bool /*dont_care*/ = true) override { Statistics* kNullStats = nullptr; + // We don't return pinned datat from index blocks, so no need + // to set `block_contents_pinned`. return index_block_->NewIterator( icomparator_, icomparator_->user_comparator(), iter, kNullStats, total_order_seek, index_key_includes_seq_, index_value_is_full_, - prefix_index_.get()); + false /* block_contents_pinned */, prefix_index_.get()); } virtual size_t size() const override { return index_block_->size(); } @@ -572,8 +605,7 @@ class HashIndexReader : public IndexReader { assert(index_block_ != nullptr); } - ~HashIndexReader() { - } + ~HashIndexReader() {} std::unique_ptr index_block_; std::unique_ptr prefix_index_; @@ -737,9 +769,9 @@ Status BlockBasedTable::Open(const ImmutableCFOptions& ioptions, const EnvOptions& env_options, const BlockBasedTableOptions& table_options, const InternalKeyComparator& internal_comparator, - unique_ptr&& file, + std::unique_ptr&& file, uint64_t file_size, - unique_ptr* table_reader, + std::unique_ptr* table_reader, const SliceTransform* prefix_extractor, const bool prefetch_index_and_filter_in_cache, const bool skip_filters, const int level, @@ -807,7 +839,7 @@ Status BlockBasedTable::Open(const ImmutableCFOptions& ioptions, // raw pointer will be used to create HashIndexReader, whose reset may // access a dangling pointer. Rep* rep = new BlockBasedTable::Rep(ioptions, env_options, table_options, - internal_comparator, skip_filters, + internal_comparator, skip_filters, level, immortal_table); rep->file = std::move(file); rep->footer = footer; @@ -818,7 +850,7 @@ Status BlockBasedTable::Open(const ImmutableCFOptions& ioptions, rep->internal_prefix_transform.reset( new InternalKeySliceTransform(prefix_extractor)); SetupCacheKeyPrefix(rep, file_size); - unique_ptr new_table(new BlockBasedTable(rep)); + std::unique_ptr new_table(new BlockBasedTable(rep)); // page cache options rep->persistent_cache_options = @@ -878,7 +910,9 @@ Status BlockBasedTable::Open(const ImmutableCFOptions& ioptions, if (s.ok()) { s = ReadProperties(meta_iter->value(), rep->file.get(), prefetch_buffer.get(), rep->footer, rep->ioptions, - &table_properties, false /* compression_type_missing */); + &table_properties, + false /* compression_type_missing */, + nullptr /* memory_allocator */); } if (!s.ok()) { @@ -921,9 +955,10 @@ Status BlockBasedTable::Open(const ImmutableCFOptions& ioptions, ReadOptions read_options; read_options.verify_checksums = false; BlockFetcher compression_block_fetcher( - rep->file.get(), prefetch_buffer.get(), rep->footer, read_options, - compression_dict_handle, compression_dict_cont.get(), rep->ioptions, false /* decompress */, - Slice() /*compression dict*/, cache_options); + rep->file.get(), prefetch_buffer.get(), rep->footer, read_options, + compression_dict_handle, compression_dict_cont.get(), rep->ioptions, + false /* decompress */, false /*maybe_compressed*/, + Slice() /*compression dict*/, cache_options); s = compression_block_fetcher.ReadBlockContents(); if (!s.ok()) { @@ -964,20 +999,22 @@ Status BlockBasedTable::Open(const ImmutableCFOptions& ioptions, rep->ioptions.info_log, "Error when seeking to range delete tombstones block from file: %s", 
s.ToString().c_str()); - } else { - if (found_range_del_block && !rep->range_del_handle.IsNull()) { - ReadOptions read_options; - s = MaybeLoadDataBlockToCache( - prefetch_buffer.get(), rep, read_options, rep->range_del_handle, - Slice() /* compression_dict */, &rep->range_del_entry, - false /* is_index */, nullptr /* get_context */); - if (!s.ok()) { - ROCKS_LOG_WARN( - rep->ioptions.info_log, - "Encountered error while reading data from range del block %s", - s.ToString().c_str()); - } + } else if (found_range_del_block && !rep->range_del_handle.IsNull()) { + ReadOptions read_options; + s = MaybeReadBlockAndLoadToCache( + prefetch_buffer.get(), rep, read_options, rep->range_del_handle, + Slice() /* compression_dict */, &rep->range_del_entry, + false /* is_index */, nullptr /* get_context */); + if (!s.ok()) { + ROCKS_LOG_WARN( + rep->ioptions.info_log, + "Encountered error while reading data from range del block %s", + s.ToString().c_str()); } + auto iter = std::unique_ptr( + new_table->NewUnfragmentedRangeTombstoneIterator(read_options)); + rep->fragmented_range_dels = std::make_shared( + std::move(iter), internal_comparator); } bool need_upper_bound_check = @@ -1019,7 +1056,7 @@ Status BlockBasedTable::Open(const ImmutableCFOptions& ioptions, bool disable_prefix_seek = rep->index_type == BlockBasedTableOptions::kHashSearch && need_upper_bound_check; - unique_ptr> iter( + std::unique_ptr> iter( new_table->NewIndexIterator(ReadOptions(), disable_prefix_seek, nullptr, &index_entry)); s = iter->status(); @@ -1094,7 +1131,7 @@ Status BlockBasedTable::Open(const ImmutableCFOptions& ioptions, if (tail_prefetch_stats != nullptr) { assert(prefetch_buffer->min_offset_read() < file_size); tail_prefetch_stats->RecordEffectiveSize( - static_cast(file_size) - prefetch_buffer->min_offset_read()); + static_cast(file_size) - prefetch_buffer->min_offset_read()); } *table_reader = std::move(new_table); } @@ -1148,9 +1185,10 @@ Status BlockBasedTable::ReadMetaBlock(Rep* rep, Status s = ReadBlockFromFile( rep->file.get(), prefetch_buffer, rep->footer, ReadOptions(), rep->footer.metaindex_handle(), &meta, rep->ioptions, - true /* decompress */, Slice() /*compression dict*/, - rep->persistent_cache_options, kDisableGlobalSequenceNumber, - 0 /* read_amp_bytes_per_bit */); + true /* decompress */, true /*maybe_compressed*/, + Slice() /*compression dict*/, rep->persistent_cache_options, + kDisableGlobalSequenceNumber, 0 /* read_amp_bytes_per_bit */, + GetMemoryAllocator(rep->table_options)); if (!s.ok()) { ROCKS_LOG_ERROR(rep->ioptions.info_log, @@ -1169,15 +1207,14 @@ Status BlockBasedTable::ReadMetaBlock(Rep* rep, Status BlockBasedTable::GetDataBlockFromCache( const Slice& block_cache_key, const Slice& compressed_block_cache_key, - Cache* block_cache, Cache* block_cache_compressed, - const ImmutableCFOptions& ioptions, const ReadOptions& read_options, - BlockBasedTable::CachableEntry* block, uint32_t format_version, - const Slice& compression_dict, size_t read_amp_bytes_per_bit, bool is_index, - GetContext* get_context) { + Cache* block_cache, Cache* block_cache_compressed, Rep* rep, + const ReadOptions& read_options, + BlockBasedTable::CachableEntry* block, const Slice& compression_dict, + size_t read_amp_bytes_per_bit, bool is_index, GetContext* get_context) { Status s; - Block* compressed_block = nullptr; + BlockContents* compressed_block = nullptr; Cache::Handle* block_cache_compressed_handle = nullptr; - Statistics* statistics = ioptions.statistics; + Statistics* statistics = rep->ioptions.statistics; // 
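At open time the range-deletion block is now turned into a fragmented, shared representation instead of being re-read per iterator. The toy fragmenter below shows the underlying idea, splitting overlapping `[start, end)` tombstones at every boundary so the resulting fragments never overlap; it is not RocksDB's `FragmentedRangeTombstoneList`, and the key and sequence-number types are deliberately simplified.

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <iterator>
#include <set>
#include <string>
#include <vector>

struct RangeTombstone {
  std::string start, end;  // deletes keys in [start, end)
  uint64_t seq;
};

struct Fragment {
  std::string start, end;
  std::vector<uint64_t> seqs;  // covering sequence numbers, newest first
};

std::vector<Fragment> FragmentTombstones(const std::vector<RangeTombstone>& ts) {
  // Every start/end key is a potential fragment boundary.
  std::set<std::string> bounds;
  for (const auto& t : ts) {
    bounds.insert(t.start);
    bounds.insert(t.end);
  }
  std::vector<Fragment> out;
  for (auto it = bounds.begin(); it != bounds.end(); ++it) {
    auto next = std::next(it);
    if (next == bounds.end()) break;
    Fragment f{*it, *next, {}};
    for (const auto& t : ts) {
      if (t.start <= f.start && f.end <= t.end) f.seqs.push_back(t.seq);
    }
    if (!f.seqs.empty()) {
      std::sort(f.seqs.rbegin(), f.seqs.rend());  // newest first
      out.push_back(std::move(f));
    }
  }
  return out;
}

int main() {
  std::vector<RangeTombstone> ts = {{"a", "m", 10}, {"f", "z", 20}};
  for (const auto& f : FragmentTombstones(ts)) {
    std::cout << "[" << f.start << "," << f.end << ") newest seq "
              << f.seqs.front() << "\n";
  }
  // Expected fragments: [a,f) seq 10, [f,m) seqs 20 and 10, [m,z) seq 20.
  return 0;
}
```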
Lookup uncompressed cache first if (block_cache != nullptr) { @@ -1220,32 +1257,34 @@ Status BlockBasedTable::GetDataBlockFromCache( // found compressed block RecordTick(statistics, BLOCK_CACHE_COMPRESSED_HIT); - compressed_block = reinterpret_cast( + compressed_block = reinterpret_cast( block_cache_compressed->Value(block_cache_compressed_handle)); - assert(compressed_block->compression_type() != kNoCompression); + CompressionType compression_type = compressed_block->get_compression_type(); + assert(compression_type != kNoCompression); // Retrieve the uncompressed contents into a new buffer BlockContents contents; - UncompressionContext uncompresssion_ctx(compressed_block->compression_type(), - compression_dict); - s = UncompressBlockContents(uncompresssion_ctx, compressed_block->data(), - compressed_block->size(), &contents, - format_version, ioptions); + UncompressionContext uncompresssion_ctx(compression_type, compression_dict); + s = UncompressBlockContents(uncompresssion_ctx, compressed_block->data.data(), + compressed_block->data.size(), &contents, + rep->table_options.format_version, rep->ioptions, + GetMemoryAllocator(rep->table_options)); // Insert uncompressed block into block cache if (s.ok()) { block->value = - new Block(std::move(contents), compressed_block->global_seqno(), + new Block(std::move(contents), rep->get_global_seqno(is_index), read_amp_bytes_per_bit, statistics); // uncompressed block - assert(block->value->compression_type() == kNoCompression); - if (block_cache != nullptr && block->value->cachable() && + if (block_cache != nullptr && block->value->own_bytes() && read_options.fill_cache) { size_t charge = block->value->ApproximateMemoryUsage(); s = block_cache->Insert(block_cache_key, block->value, charge, &DeleteCachedEntry, &(block->cache_handle)); +#ifndef NDEBUG block_cache->TEST_mark_as_data_block(block_cache_key, charge); +#endif // NDEBUG if (s.ok()) { if (get_context != nullptr) { get_context->get_context_stats_.num_cache_add++; @@ -1290,64 +1329,77 @@ Status BlockBasedTable::PutDataBlockToCache( const Slice& block_cache_key, const Slice& compressed_block_cache_key, Cache* block_cache, Cache* block_cache_compressed, const ReadOptions& /*read_options*/, const ImmutableCFOptions& ioptions, - CachableEntry* block, Block* raw_block, uint32_t format_version, - const Slice& compression_dict, size_t read_amp_bytes_per_bit, bool is_index, - Cache::Priority priority, GetContext* get_context) { - assert(raw_block->compression_type() == kNoCompression || + CachableEntry* cached_block, BlockContents* raw_block_contents, + CompressionType raw_block_comp_type, uint32_t format_version, + const Slice& compression_dict, SequenceNumber seq_no, + size_t read_amp_bytes_per_bit, MemoryAllocator* memory_allocator, + bool is_index, Cache::Priority priority, GetContext* get_context) { + assert(raw_block_comp_type == kNoCompression || block_cache_compressed != nullptr); Status s; // Retrieve the uncompressed contents into a new buffer - BlockContents contents; + BlockContents uncompressed_block_contents; Statistics* statistics = ioptions.statistics; - if (raw_block->compression_type() != kNoCompression) { - UncompressionContext uncompression_ctx(raw_block->compression_type(), + if (raw_block_comp_type != kNoCompression) { + UncompressionContext uncompression_ctx(raw_block_comp_type, compression_dict); - s = UncompressBlockContents(uncompression_ctx, raw_block->data(), - raw_block->size(), &contents, format_version, - ioptions); + s = UncompressBlockContents( + uncompression_ctx, 
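`GetDataBlockFromCache` follows a lookup order worth spelling out: the uncompressed block cache is consulted first, and a hit in the compressed block cache is decompressed and then promoted into the uncompressed cache before being returned. A minimal sketch of that order follows, using plain maps as stand-in caches and a placeholder in place of real decompression; it ignores cache handles, statistics, and fill-cache policy.

```cpp
#include <iostream>
#include <optional>
#include <string>
#include <unordered_map>

using BlockCache = std::unordered_map<std::string, std::string>;

std::string Uncompress(const std::string& compressed) {
  return "uncompressed(" + compressed + ")";  // placeholder transform
}

std::optional<std::string> GetDataBlockFromCache(const std::string& key,
                                                 BlockCache& block_cache,
                                                 BlockCache& compressed_cache) {
  if (auto it = block_cache.find(key); it != block_cache.end()) {
    return it->second;  // fast path: already uncompressed
  }
  if (auto it = compressed_cache.find(key); it != compressed_cache.end()) {
    std::string block = Uncompress(it->second);
    block_cache.emplace(key, block);  // promote for future readers
    return block;
  }
  return std::nullopt;  // caller must read the block from the file
}

int main() {
  BlockCache block_cache, compressed_cache;
  compressed_cache["sst1:1234"] = "raw bytes";
  auto block = GetDataBlockFromCache("sst1:1234", block_cache, compressed_cache);
  std::cout << (block ? *block : "miss") << "\n";
  std::cout << block_cache.count("sst1:1234") << "\n";  // 1: now promoted
  return 0;
}
```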
raw_block_contents->data.data(), + raw_block_contents->data.size(), &uncompressed_block_contents, + format_version, ioptions, memory_allocator); } if (!s.ok()) { - delete raw_block; return s; } - if (raw_block->compression_type() != kNoCompression) { - block->value = new Block(std::move(contents), raw_block->global_seqno(), - read_amp_bytes_per_bit, - statistics); // uncompressed block + if (raw_block_comp_type != kNoCompression) { + cached_block->value = new Block(std::move(uncompressed_block_contents), + seq_no, read_amp_bytes_per_bit, + statistics); // uncompressed block } else { - block->value = raw_block; - raw_block = nullptr; + cached_block->value = + new Block(std::move(*raw_block_contents), seq_no, + read_amp_bytes_per_bit, ioptions.statistics); } // Insert compressed block into compressed block cache. // Release the hold on the compressed cache entry immediately. - if (block_cache_compressed != nullptr && raw_block != nullptr && - raw_block->cachable()) { - s = block_cache_compressed->Insert(compressed_block_cache_key, raw_block, - raw_block->ApproximateMemoryUsage(), - &DeleteCachedEntry); + if (block_cache_compressed != nullptr && + raw_block_comp_type != kNoCompression && raw_block_contents != nullptr && + raw_block_contents->own_bytes()) { +#ifndef NDEBUG + assert(raw_block_contents->is_raw_block); +#endif // NDEBUG + + // We cannot directly put raw_block_contents because this could point to + // an object in the stack. + BlockContents* block_cont_for_comp_cache = + new BlockContents(std::move(*raw_block_contents)); + s = block_cache_compressed->Insert( + compressed_block_cache_key, block_cont_for_comp_cache, + block_cont_for_comp_cache->ApproximateMemoryUsage(), + &DeleteCachedEntry); if (s.ok()) { // Avoid the following code to delete this cached block. 
- raw_block = nullptr; RecordTick(statistics, BLOCK_CACHE_COMPRESSED_ADD); } else { RecordTick(statistics, BLOCK_CACHE_COMPRESSED_ADD_FAILURES); + delete block_cont_for_comp_cache; } } - delete raw_block; // insert into uncompressed block cache - assert((block->value->compression_type() == kNoCompression)); - if (block_cache != nullptr && block->value->cachable()) { - size_t charge = block->value->ApproximateMemoryUsage(); - s = block_cache->Insert(block_cache_key, block->value, charge, - &DeleteCachedEntry, &(block->cache_handle), - priority); + if (block_cache != nullptr && cached_block->value->own_bytes()) { + size_t charge = cached_block->value->ApproximateMemoryUsage(); + s = block_cache->Insert(block_cache_key, cached_block->value, charge, + &DeleteCachedEntry, + &(cached_block->cache_handle), priority); +#ifndef NDEBUG block_cache->TEST_mark_as_data_block(block_cache_key, charge); +#endif // NDEBUG if (s.ok()) { - assert(block->cache_handle != nullptr); + assert(cached_block->cache_handle != nullptr); if (get_context != nullptr) { get_context->get_context_stats_.num_cache_add++; get_context->get_context_stats_.num_cache_bytes_write += charge; @@ -1373,12 +1425,12 @@ Status BlockBasedTable::PutDataBlockToCache( RecordTick(statistics, BLOCK_CACHE_DATA_BYTES_INSERT, charge); } } - assert(reinterpret_cast( - block_cache->Value(block->cache_handle)) == block->value); + assert(reinterpret_cast(block_cache->Value( + cached_block->cache_handle)) == cached_block->value); } else { RecordTick(statistics, BLOCK_CACHE_ADD_FAILURES); - delete block->value; - block->value = nullptr; + delete cached_block->value; + cached_block->value = nullptr; } } @@ -1399,10 +1451,11 @@ FilterBlockReader* BlockBasedTable::ReadFilter( Slice dummy_comp_dict; - BlockFetcher block_fetcher(rep->file.get(), prefetch_buffer, rep->footer, - ReadOptions(), filter_handle, &block, - rep->ioptions, false /* decompress */, - dummy_comp_dict, rep->persistent_cache_options); + BlockFetcher block_fetcher( + rep->file.get(), prefetch_buffer, rep->footer, ReadOptions(), + filter_handle, &block, rep->ioptions, false /* decompress */, + false /*maybe_compressed*/, dummy_comp_dict, + rep->persistent_cache_options, GetMemoryAllocator(rep->table_options)); Status s = block_fetcher.ReadBlockContents(); if (!s.ok()) { @@ -1551,12 +1604,16 @@ InternalIteratorBase* BlockBasedTable::NewIndexIterator( GetContext* get_context) { // index reader has already been pre-populated. if (rep_->index_reader) { + // We don't return pinned datat from index blocks, so no need + // to set `block_contents_pinned`. return rep_->index_reader->NewIterator( input_iter, read_options.total_order_seek || disable_prefix_seek, read_options.fill_cache); } // we have a pinned index block if (rep_->index_entry.IsSet()) { + // We don't return pinned datat from index blocks, so no need + // to set `block_contents_pinned`. return rep_->index_entry.value->NewIterator( input_iter, read_options.total_order_seek || disable_prefix_seek, read_options.fill_cache); @@ -1639,6 +1696,8 @@ InternalIteratorBase* BlockBasedTable::NewIndexIterator( } assert(cache_handle); + // We don't return pinned datat from index blocks, so no need + // to set `block_contents_pinned`. 
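One detail of `PutDataBlockToCache` above is that the compressed contents are moved into a freshly heap-allocated `BlockContents` before insertion, because the argument may live on the caller's stack while the cache keeps only a raw pointer plus a deleter. The sketch below illustrates that ownership hand-off with simplified stand-in types; it is not the real `Cache` interface.

```cpp
#include <functional>
#include <iostream>
#include <string>
#include <unordered_map>
#include <utility>

struct BlockContents {
  std::string data;  // owned bytes (RocksDB uses an allocator-aware buffer)
};

// Toy cache: stores raw pointers and runs the deleter when entries go away.
class Cache {
 public:
  using Deleter = std::function<void(void*)>;
  void Insert(const std::string& key, void* value, Deleter deleter) {
    entries_[key] = {value, std::move(deleter)};
  }
  ~Cache() {
    for (auto& kv : entries_) kv.second.second(kv.second.first);
  }

 private:
  std::unordered_map<std::string, std::pair<void*, Deleter>> entries_;
};

void PutCompressedBlockToCache(const std::string& key, BlockContents* raw,
                               Cache* compressed_cache) {
  // Move the (possibly stack-resident) contents into a heap object the cache
  // can own and later delete; `raw` is left empty after this call.
  BlockContents* heap_copy = new BlockContents(std::move(*raw));
  compressed_cache->Insert(key, heap_copy, [](void* v) {
    delete static_cast<BlockContents*>(v);
  });
}

int main() {
  Cache compressed_cache;
  BlockContents raw{"compressed bytes"};  // lives on the stack
  PutCompressedBlockToCache("sst1:1234", &raw, &compressed_cache);
  std::cout << "moved-from size: " << raw.data.size() << "\n";  // typically 0
  return 0;
}
```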
auto* iter = index_reader->NewIterator( input_iter, read_options.total_order_seek || disable_prefix_seek); @@ -1673,9 +1732,9 @@ TBlockIter* BlockBasedTable::NewDataBlockIterator( if (rep->compression_dict_block) { compression_dict = rep->compression_dict_block->data; } - s = MaybeLoadDataBlockToCache(prefetch_buffer, rep, ro, handle, - compression_dict, &block, is_index, - get_context); + s = MaybeReadBlockAndLoadToCache(prefetch_buffer, rep, ro, handle, + compression_dict, &block, is_index, + get_context); } TBlockIter* iter; @@ -1697,10 +1756,13 @@ TBlockIter* BlockBasedTable::NewDataBlockIterator( READ_BLOCK_GET_MICROS); s = ReadBlockFromFile( rep->file.get(), prefetch_buffer, rep->footer, ro, handle, - &block_value, rep->ioptions, rep->blocks_maybe_compressed, - compression_dict, rep->persistent_cache_options, + &block_value, rep->ioptions, + rep->blocks_maybe_compressed /*do_decompress*/, + rep->blocks_maybe_compressed, compression_dict, + rep->persistent_cache_options, is_index ? kDisableGlobalSequenceNumber : rep->global_seqno, - rep->table_options.read_amp_bytes_per_bit, rep->immortal_table); + rep->table_options.read_amp_bytes_per_bit, + GetMemoryAllocator(rep->table_options)); } if (s.ok()) { block.value = block_value.release(); @@ -1710,10 +1772,20 @@ TBlockIter* BlockBasedTable::NewDataBlockIterator( if (s.ok()) { assert(block.value != nullptr); const bool kTotalOrderSeek = true; + // Block contents are pinned and remain pinned after the iterator + // is destroyed as long as cleanup functions are moved to another object, + // when: + // 1. block cache handle is set to be released in cleanup function, or + // 2. it's pointing to an immortal source. If own_bytes is true then we are + // not reading data from the original source, whether immortal or not. + // Otherwise, the block is pinned iff the source is immortal. + bool block_contents_pinned = + (block.cache_handle != nullptr || + (!block.value->own_bytes() && rep->immortal_table)); iter = block.value->NewIterator( &rep->internal_comparator, rep->internal_comparator.user_comparator(), iter, rep->ioptions.statistics, kTotalOrderSeek, key_includes_seq, - index_key_is_full); + index_key_is_full, block_contents_pinned); if (block.cache_handle != nullptr) { iter->RegisterCleanup(&ReleaseCachedEntry, block_cache, block.cache_handle); @@ -1722,7 +1794,7 @@ TBlockIter* BlockBasedTable::NewDataBlockIterator( // insert a dummy record to block cache to track the memory usage Cache::Handle* cache_handle; // There are two other types of cache keys: 1) SST cache key added in - // `MaybeLoadDataBlockToCache` 2) dummy cache key added in + // `MaybeReadBlockAndLoadToCache` 2) dummy cache key added in // `write_buffer_manager`.
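The `block_contents_pinned` decision above boils down to a small predicate, and restating it standalone makes the two cases easier to see: either a cache handle keeps the bytes alive until the registered cleanup runs, or the block does not own a private copy and the underlying file is immortal (for example an mmapped file that is never closed). Names below are illustrative, not RocksDB API.

```cpp
#include <cassert>

// True if the bytes backing the block stay valid after the iterator is gone,
// provided the iterator's cleanup functions are transferred to another owner.
bool BlockContentsPinned(bool has_cache_handle, bool block_owns_bytes,
                         bool immortal_source) {
  return has_cache_handle || (!block_owns_bytes && immortal_source);
}

int main() {
  // Cached block: pinned regardless of the file's lifetime.
  assert(BlockContentsPinned(true, true, false));
  // Block reading straight out of an mmapped, never-closed file: pinned.
  assert(BlockContentsPinned(false, false, true));
  // Freshly heap-allocated block with no cache handle: not pinned; the
  // iterator owns the only copy and deletes it on destruction.
  assert(!BlockContentsPinned(false, true, false));
  return 0;
}
```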
Use longer prefix (41 bytes) to differentiate // from SST cache key(31 bytes), and use non-zero prefix to // differentiate from `write_buffer_manager` @@ -1758,25 +1830,28 @@ TBlockIter* BlockBasedTable::NewDataBlockIterator( return iter; } -Status BlockBasedTable::MaybeLoadDataBlockToCache( +Status BlockBasedTable::MaybeReadBlockAndLoadToCache( FilePrefetchBuffer* prefetch_buffer, Rep* rep, const ReadOptions& ro, const BlockHandle& handle, Slice compression_dict, CachableEntry* block_entry, bool is_index, GetContext* get_context) { assert(block_entry != nullptr); const bool no_io = (ro.read_tier == kBlockCacheTier); Cache* block_cache = rep->table_options.block_cache.get(); + + // No point to cache compressed blocks if it never goes away Cache* block_cache_compressed = - rep->table_options.block_cache_compressed.get(); + rep->immortal_table ? nullptr + : rep->table_options.block_cache_compressed.get(); + // First, try to get the block from the cache + // // If either block cache is enabled, we'll try to read from it. Status s; + char cache_key[kMaxCacheKeyPrefixSize + kMaxVarint64Length]; + char compressed_cache_key[kMaxCacheKeyPrefixSize + kMaxVarint64Length]; + Slice key /* key to the block cache */; + Slice ckey /* key to the compressed block cache */; if (block_cache != nullptr || block_cache_compressed != nullptr) { - Statistics* statistics = rep->ioptions.statistics; - char cache_key[kMaxCacheKeyPrefixSize + kMaxVarint64Length]; - char compressed_cache_key[kMaxCacheKeyPrefixSize + kMaxVarint64Length]; - Slice key, /* key to the block cache */ - ckey /* key to the compressed block cache */; - // create key for block cache if (block_cache != nullptr) { key = GetCacheKey(rep->cache_key_prefix, rep->cache_key_prefix_size, @@ -1789,30 +1864,42 @@ Status BlockBasedTable::MaybeLoadDataBlockToCache( compressed_cache_key); } - s = GetDataBlockFromCache( - key, ckey, block_cache, block_cache_compressed, rep->ioptions, ro, - block_entry, rep->table_options.format_version, compression_dict, - rep->table_options.read_amp_bytes_per_bit, is_index, get_context); + s = GetDataBlockFromCache(key, ckey, block_cache, block_cache_compressed, + rep, ro, block_entry, compression_dict, + rep->table_options.read_amp_bytes_per_bit, + is_index, get_context); + // Can't find the block from the cache. If I/O is allowed, read from the + // file. if (block_entry->value == nullptr && !no_io && ro.fill_cache) { - std::unique_ptr raw_block; + Statistics* statistics = rep->ioptions.statistics; + bool do_decompress = + block_cache_compressed == nullptr && rep->blocks_maybe_compressed; + CompressionType raw_block_comp_type; + BlockContents raw_block_contents; { StopWatch sw(rep->ioptions.env, statistics, READ_BLOCK_GET_MICROS); - s = ReadBlockFromFile( + BlockFetcher block_fetcher( rep->file.get(), prefetch_buffer, rep->footer, ro, handle, - &raw_block, rep->ioptions, - block_cache_compressed == nullptr && rep->blocks_maybe_compressed, + &raw_block_contents, rep->ioptions, + do_decompress /* do uncompress */, rep->blocks_maybe_compressed, compression_dict, rep->persistent_cache_options, - is_index ? 
kDisableGlobalSequenceNumber : rep->global_seqno, - rep->table_options.read_amp_bytes_per_bit, rep->immortal_table); + GetMemoryAllocator(rep->table_options), + GetMemoryAllocatorForCompressedBlock(rep->table_options)); + s = block_fetcher.ReadBlockContents(); + raw_block_comp_type = block_fetcher.get_compression_type(); } if (s.ok()) { + SequenceNumber seq_no = rep->get_global_seqno(is_index); + // If filling cache is allowed and a cache is configured, try to put the + // block to the cache. s = PutDataBlockToCache( key, ckey, block_cache, block_cache_compressed, ro, rep->ioptions, - block_entry, raw_block.release(), rep->table_options.format_version, - compression_dict, rep->table_options.read_amp_bytes_per_bit, - is_index, + block_entry, &raw_block_contents, raw_block_comp_type, + rep->table_options.format_version, compression_dict, seq_no, + rep->table_options.read_amp_bytes_per_bit, + GetMemoryAllocator(rep->table_options), is_index, is_index && rep->table_options .cache_index_and_filter_blocks_with_high_priority ? Cache::Priority::HIGH @@ -1855,6 +1942,8 @@ BlockBasedTable::PartitionedIndexIteratorState::NewSecondaryIterator( RecordTick(rep->ioptions.statistics, BLOCK_CACHE_BYTES_READ, block_cache->GetUsage(block->second.cache_handle)); Statistics* kNullStats = nullptr; + // We don't return pinned datat from index blocks, so no need + // to set `block_contents_pinned`. return block->second.value->NewIterator( &rep->internal_comparator, rep->internal_comparator.user_comparator(), nullptr, kNullStats, true, index_key_includes_seq_, index_key_is_full_); @@ -1933,7 +2022,7 @@ bool BlockBasedTable::PrefixMayMatch( // Then, try find it within each block // we already know prefix_extractor and prefix_extractor_name must match // because `CheckPrefixMayMatch` first checks `check_filter_ == true` - unique_ptr> iiter( + std::unique_ptr> iiter( NewIndexIterator(no_io_read_options, /* need_upper_bound_check */ false)); iiter->Seek(internal_prefix); @@ -2249,7 +2338,20 @@ InternalIterator* BlockBasedTable::NewIterator( } } -InternalIterator* BlockBasedTable::NewRangeTombstoneIterator( +FragmentedRangeTombstoneIterator* BlockBasedTable::NewRangeTombstoneIterator( + const ReadOptions& read_options) { + if (rep_->fragmented_range_dels == nullptr) { + return nullptr; + } + SequenceNumber snapshot = kMaxSequenceNumber; + if (read_options.snapshot != nullptr) { + snapshot = read_options.snapshot->GetSequenceNumber(); + } + return new FragmentedRangeTombstoneIterator( + rep_->fragmented_range_dels, rep_->internal_comparator, snapshot); +} + +InternalIterator* BlockBasedTable::NewUnfragmentedRangeTombstoneIterator( const ReadOptions& read_options) { if (rep_->range_del_handle.IsNull()) { // The block didn't exist, nullptr indicates no range tombstones. 
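`NewRangeTombstoneIterator` now bounds the fragmented tombstones by the read snapshot, falling back to `kMaxSequenceNumber` when no snapshot is set. Below is a toy check of that visibility rule, reusing the simplified fragment type from the earlier sketch; it is not the `FragmentedRangeTombstoneIterator` API, and the sentinel value is a stand-in for RocksDB's constant.

```cpp
#include <cstdint>
#include <iostream>
#include <limits>
#include <string>
#include <vector>

// Toy stand-in for RocksDB's "no snapshot" sentinel.
constexpr uint64_t kMaxSequenceNumber = std::numeric_limits<uint64_t>::max();

struct Fragment {
  std::string start, end;     // [start, end)
  std::vector<uint64_t> seqs; // covering sequence numbers, newest first
};

// Is `user_key` covered by a tombstone visible at `snapshot`?
bool RangeDeleted(const std::string& user_key, uint64_t snapshot,
                  const std::vector<Fragment>& fragments) {
  for (const auto& f : fragments) {
    if (f.start <= user_key && user_key < f.end) {
      for (uint64_t seq : f.seqs) {
        if (seq <= snapshot) return true;  // visible tombstone covers the key
      }
    }
  }
  return false;
}

int main() {
  std::vector<Fragment> fragments = {{"f", "m", {20}}};
  std::cout << RangeDeleted("g", kMaxSequenceNumber, fragments) << "\n";  // 1
  std::cout << RangeDeleted("g", 15, fragments) << "\n";  // 0: older snapshot
  return 0;
}
```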
@@ -2302,6 +2404,7 @@ bool BlockBasedTable::FullFilterKeyMayMatch( } if (may_match) { RecordTick(rep_->ioptions.statistics, BLOOM_FILTER_FULL_POSITIVE); + PERF_COUNTER_BY_LEVEL_ADD(bloom_filter_full_positive, 1, rep_->level); } return may_match; } @@ -2326,6 +2429,7 @@ Status BlockBasedTable::Get(const ReadOptions& read_options, const Slice& key, if (!FullFilterKeyMayMatch(read_options, filter, key, no_io, prefix_extractor)) { RecordTick(rep_->ioptions.statistics, BLOOM_FILTER_USEFUL); + PERF_COUNTER_BY_LEVEL_ADD(bloom_filter_useful, 1, rep_->level); } else { IndexBlockIter iiter_on_stack; // if prefix_extractor found in block differs from options, disable @@ -2358,6 +2462,7 @@ Status BlockBasedTable::Get(const ReadOptions& read_options, const Slice& key, // TODO: think about interaction with Merge. If a user key cannot // cross one data block, we should be fine. RecordTick(rep_->ioptions.statistics, BLOOM_FILTER_USEFUL); + PERF_COUNTER_BY_LEVEL_ADD(bloom_filter_useful, 1, rep_->level); break; } else { DataBlockIter biter; @@ -2410,6 +2515,8 @@ Status BlockBasedTable::Get(const ReadOptions& read_options, const Slice& key, } if (matched && filter != nullptr && !filter->IsBlockBased()) { RecordTick(rep_->ioptions.statistics, BLOOM_FILTER_FULL_TRUE_POSITIVE); + PERF_COUNTER_BY_LEVEL_ADD(bloom_filter_full_true_positive, 1, + rep_->level); } if (s.ok()) { s = iiter->status(); @@ -2524,11 +2631,11 @@ Status BlockBasedTable::VerifyChecksumInBlocks( BlockHandle handle = index_iter->value(); BlockContents contents; Slice dummy_comp_dict; - BlockFetcher block_fetcher(rep_->file.get(), nullptr /* prefetch buffer */, - rep_->footer, ReadOptions(), handle, &contents, - rep_->ioptions, false /* decompress */, - dummy_comp_dict /*compression dict*/, - rep_->persistent_cache_options); + BlockFetcher block_fetcher( + rep_->file.get(), nullptr /* prefetch buffer */, rep_->footer, + ReadOptions(), handle, &contents, rep_->ioptions, + false /* decompress */, false /*maybe_compressed*/, + dummy_comp_dict /*compression dict*/, rep_->persistent_cache_options); s = block_fetcher.ReadBlockContents(); if (!s.ok()) { break; @@ -2550,11 +2657,11 @@ Status BlockBasedTable::VerifyChecksumInBlocks( s = handle.DecodeFrom(&input); BlockContents contents; Slice dummy_comp_dict; - BlockFetcher block_fetcher(rep_->file.get(), nullptr /* prefetch buffer */, - rep_->footer, ReadOptions(), handle, &contents, - rep_->ioptions, false /* decompress */, - dummy_comp_dict /*compression dict*/, - rep_->persistent_cache_options); + BlockFetcher block_fetcher( + rep_->file.get(), nullptr /* prefetch buffer */, rep_->footer, + ReadOptions(), handle, &contents, rep_->ioptions, + false /* decompress */, false /*maybe_compressed*/, + dummy_comp_dict /*compression dict*/, rep_->persistent_cache_options); s = block_fetcher.ReadBlockContents(); if (!s.ok()) { break; @@ -2583,8 +2690,7 @@ bool BlockBasedTable::TEST_KeyInCache(const ReadOptions& options, Status s; s = GetDataBlockFromCache( - cache_key, ckey, block_cache, nullptr, rep_->ioptions, options, &block, - rep_->table_options.format_version, + cache_key, ckey, block_cache, nullptr, rep_, options, &block, rep_->compression_dict_block ? 
rep_->compression_dict_block->data : Slice(), 0 /* read_amp_bytes_per_bit */); @@ -2644,7 +2750,8 @@ Status BlockBasedTable::CreateIndexReader( rep_->table_properties == nullptr || rep_->table_properties->index_key_is_user_key == 0, rep_->table_properties == nullptr || - rep_->table_properties->index_value_is_delta_encoded == 0); + rep_->table_properties->index_value_is_delta_encoded == 0, + GetMemoryAllocator(rep_->table_options)); } case BlockBasedTableOptions::kBinarySearch: { return BinarySearchIndexReader::Create( @@ -2653,7 +2760,8 @@ Status BlockBasedTable::CreateIndexReader( rep_->table_properties == nullptr || rep_->table_properties->index_key_is_user_key == 0, rep_->table_properties == nullptr || - rep_->table_properties->index_value_is_delta_encoded == 0); + rep_->table_properties->index_value_is_delta_encoded == 0, + GetMemoryAllocator(rep_->table_options)); } case BlockBasedTableOptions::kHashSearch: { std::unique_ptr meta_guard; @@ -2675,7 +2783,8 @@ Status BlockBasedTable::CreateIndexReader( rep_->table_properties == nullptr || rep_->table_properties->index_key_is_user_key == 0, rep_->table_properties == nullptr || - rep_->table_properties->index_value_is_delta_encoded == 0); + rep_->table_properties->index_value_is_delta_encoded == 0, + GetMemoryAllocator(rep_->table_options)); } meta_index_iter = meta_iter_guard.get(); } @@ -2688,7 +2797,8 @@ Status BlockBasedTable::CreateIndexReader( rep_->table_properties == nullptr || rep_->table_properties->index_key_is_user_key == 0, rep_->table_properties == nullptr || - rep_->table_properties->index_value_is_delta_encoded == 0); + rep_->table_properties->index_value_is_delta_encoded == 0, + GetMemoryAllocator(rep_->table_options)); } default: { std::string error_message = @@ -2699,7 +2809,7 @@ Status BlockBasedTable::CreateIndexReader( } uint64_t BlockBasedTable::ApproximateOffsetOf(const Slice& key) { - unique_ptr> index_iter( + std::unique_ptr> index_iter( NewIndexIterator(ReadOptions())); index_iter->Seek(key); @@ -2857,7 +2967,8 @@ Status BlockBasedTable::DumpTable(WritableFile* out_file, BlockFetcher block_fetcher( rep_->file.get(), nullptr /* prefetch_buffer */, rep_->footer, ReadOptions(), handle, &block, rep_->ioptions, - false /*decompress*/, dummy_comp_dict /*compression dict*/, + false /*decompress*/, false /*maybe_compressed*/, + dummy_comp_dict /*compression dict*/, rep_->persistent_cache_options); s = block_fetcher.ReadBlockContents(); if (!s.ok()) { diff --git a/table/block_based_table_reader.h b/table/block_based_table_reader.h index 3cada0c2c2d..cb6a865660c 100644 --- a/table/block_based_table_reader.h +++ b/table/block_based_table_reader.h @@ -16,6 +16,7 @@ #include #include +#include "db/range_tombstone_fragmenter.h" #include "options/cf_options.h" #include "rocksdb/options.h" #include "rocksdb/persistent_cache.h" @@ -88,8 +89,9 @@ class BlockBasedTable : public TableReader { const EnvOptions& env_options, const BlockBasedTableOptions& table_options, const InternalKeyComparator& internal_key_comparator, - unique_ptr&& file, - uint64_t file_size, unique_ptr* table_reader, + std::unique_ptr&& file, + uint64_t file_size, + std::unique_ptr* table_reader, const SliceTransform* prefix_extractor = nullptr, bool prefetch_index_and_filter_in_cache = true, bool skip_filters = false, int level = -1, @@ -112,7 +114,7 @@ class BlockBasedTable : public TableReader { bool skip_filters = false, bool for_compaction = false) override; - InternalIterator* NewRangeTombstoneIterator( + FragmentedRangeTombstoneIterator* 
NewRangeTombstoneIterator( const ReadOptions& read_options) override; // @param skip_filters Disables loading/accessing the filter block @@ -255,13 +257,11 @@ class BlockBasedTable : public TableReader { // @param block_entry value is set to the uncompressed block if found. If // in uncompressed block cache, also sets cache_handle to reference that // block. - static Status MaybeLoadDataBlockToCache(FilePrefetchBuffer* prefetch_buffer, - Rep* rep, const ReadOptions& ro, - const BlockHandle& handle, - Slice compression_dict, - CachableEntry* block_entry, - bool is_index = false, - GetContext* get_context = nullptr); + static Status MaybeReadBlockAndLoadToCache( + FilePrefetchBuffer* prefetch_buffer, Rep* rep, const ReadOptions& ro, + const BlockHandle& handle, Slice compression_dict, + CachableEntry* block_entry, bool is_index = false, + GetContext* get_context = nullptr); // For the following two functions: // if `no_io == true`, we will not try to read filter/index from sst file @@ -299,9 +299,9 @@ class BlockBasedTable : public TableReader { // dictionary. static Status GetDataBlockFromCache( const Slice& block_cache_key, const Slice& compressed_block_cache_key, - Cache* block_cache, Cache* block_cache_compressed, - const ImmutableCFOptions& ioptions, const ReadOptions& read_options, - BlockBasedTable::CachableEntry* block, uint32_t format_version, + Cache* block_cache, Cache* block_cache_compressed, Rep* rep, + const ReadOptions& read_options, + BlockBasedTable::CachableEntry* block, const Slice& compression_dict, size_t read_amp_bytes_per_bit, bool is_index = false, GetContext* get_context = nullptr); @@ -311,16 +311,18 @@ class BlockBasedTable : public TableReader { // On success, Status::OK will be returned; also @block will be populated with // uncompressed block and its cache handle. // - // REQUIRES: raw_block is heap-allocated. PutDataBlockToCache() will be - // responsible for releasing its memory if error occurs. + // Allocated memory managed by raw_block_contents will be transferred to + // PutDataBlockToCache(). After the call, the object will be invalid. // @param compression_dict Data for presetting the compression library's // dictionary. 
static Status PutDataBlockToCache( const Slice& block_cache_key, const Slice& compressed_block_cache_key, Cache* block_cache, Cache* block_cache_compressed, const ReadOptions& read_options, const ImmutableCFOptions& ioptions, - CachableEntry* block, Block* raw_block, uint32_t format_version, - const Slice& compression_dict, size_t read_amp_bytes_per_bit, + CachableEntry* block, BlockContents* raw_block_contents, + CompressionType raw_block_comp_type, uint32_t format_version, + const Slice& compression_dict, SequenceNumber seq_no, + size_t read_amp_bytes_per_bit, MemoryAllocator* memory_allocator, bool is_index = false, Cache::Priority pri = Cache::Priority::LOW, GetContext* get_context = nullptr); @@ -383,6 +385,9 @@ class BlockBasedTable : public TableReader { friend class PartitionedFilterBlockReader; friend class PartitionedFilterBlockTest; + + InternalIterator* NewUnfragmentedRangeTombstoneIterator( + const ReadOptions& read_options); }; // Maitaning state of a two-level iteration on a partitioned index structure @@ -431,7 +436,7 @@ struct BlockBasedTable::Rep { Rep(const ImmutableCFOptions& _ioptions, const EnvOptions& _env_options, const BlockBasedTableOptions& _table_opt, const InternalKeyComparator& _internal_comparator, bool skip_filters, - const bool _immortal_table) + int _level, const bool _immortal_table) : ioptions(_ioptions), env_options(_env_options), table_options(_table_opt), @@ -444,6 +449,7 @@ struct BlockBasedTable::Rep { prefix_filtering(true), range_del_handle(BlockHandle::NullBlockHandle()), global_seqno(kDisableGlobalSequenceNumber), + level(_level), immortal_table(_immortal_table) {} const ImmutableCFOptions& ioptions; @@ -452,7 +458,7 @@ struct BlockBasedTable::Rep { const FilterPolicy* const filter_policy; const InternalKeyComparator& internal_comparator; Status status; - unique_ptr file; + std::unique_ptr file; char cache_key_prefix[kMaxCacheKeyPrefixSize]; size_t cache_key_prefix_size = 0; char persistent_cache_key_prefix[kMaxCacheKeyPrefixSize]; @@ -468,8 +474,8 @@ struct BlockBasedTable::Rep { // index_reader and filter will be populated and used only when // options.block_cache is nullptr; otherwise we will get the index block via // the block cache. - unique_ptr index_reader; - unique_ptr filter; + std::unique_ptr index_reader; + std::unique_ptr filter; enum class FilterType { kNoFilter, @@ -494,7 +500,7 @@ struct BlockBasedTable::Rep { // module should not be relying on db module. However to make things easier // and compatible with existing code, we introduce a wrapper that allows // block to extract prefix without knowing if a key is internal or not. - unique_ptr internal_prefix_transform; + std::unique_ptr internal_prefix_transform; std::shared_ptr table_prefix_extractor; // only used in level 0 files when pin_l0_filter_and_index_blocks_in_cache is @@ -509,6 +515,7 @@ struct BlockBasedTable::Rep { // cache is enabled. CachableEntry range_del_entry; BlockHandle range_del_handle; + std::shared_ptr fragmented_range_dels; // If global_seqno is used, all Keys in this file will have the same // seqno with value `global_seqno`. @@ -517,12 +524,20 @@ struct BlockBasedTable::Rep { // and every key have it's own seqno. SequenceNumber global_seqno; + // the level when the table is opened, could potentially change when trivial + // move is involved + int level; + // If false, blocks in this file are definitely all uncompressed. Knowing this // before reading individual blocks enables certain optimizations. 
bool blocks_maybe_compressed = true; bool closed = false; const bool immortal_table; + + SequenceNumber get_global_seqno(bool is_index) const { + return is_index ? kDisableGlobalSequenceNumber : global_seqno; + } }; template diff --git a/table/block_fetcher.cc b/table/block_fetcher.cc index ea97066ec40..9ad254a59f5 100644 --- a/table/block_fetcher.cc +++ b/table/block_fetcher.cc @@ -17,13 +17,14 @@ #include "rocksdb/env.h" #include "table/block.h" #include "table/block_based_table_reader.h" -#include "table/persistent_cache_helper.h" #include "table/format.h" +#include "table/persistent_cache_helper.h" #include "util/coding.h" #include "util/compression.h" #include "util/crc32c.h" #include "util/file_reader_writer.h" #include "util/logging.h" +#include "util/memory_allocator.h" #include "util/stop_watch.h" #include "util/string_util.h" #include "util/xxhash.h" @@ -48,6 +49,12 @@ void BlockFetcher::CheckBlockChecksum() { case kxxHash: actual = XXH32(data, static_cast(block_size_) + 1, 0); break; + case kxxHash64: + actual =static_cast ( + XXH64(data, static_cast(block_size_) + 1, 0) & + uint64_t{0xffffffff} + ); + break; default: status_ = Status::Corruption( "unknown checksum type " + ToString(footer_.checksum()) + " in " + @@ -107,9 +114,11 @@ bool BlockFetcher::TryGetCompressedBlockFromPersistentCache() { if (cache_options_.persistent_cache && cache_options_.persistent_cache->IsCompressed()) { // lookup uncompressed cache mode p-cache + std::unique_ptr raw_data; status_ = PersistentCacheHelper::LookupRawPage( - cache_options_, handle_, &heap_buf_, block_size_ + kBlockTrailerSize); + cache_options_, handle_, &raw_data, block_size_ + kBlockTrailerSize); if (status_.ok()) { + heap_buf_ = CacheAllocationPtr(raw_data.release()); used_buf_ = heap_buf_.get(); slice_ = Slice(heap_buf_.get(), block_size_); return true; @@ -131,8 +140,13 @@ void BlockFetcher::PrepareBufferForBlockFromFile() { // If we've got a small enough hunk of data, read it in to the // trivially allocated stack buffer instead of needing a full malloc() used_buf_ = &stack_buf_[0]; + } else if (maybe_compressed_ && !do_uncompress_) { + compressed_buf_ = AllocateBlock(block_size_ + kBlockTrailerSize, + memory_allocator_compressed_); + used_buf_ = compressed_buf_.get(); } else { - heap_buf_.reset(new char[block_size_ + kBlockTrailerSize]); + heap_buf_ = + AllocateBlock(block_size_ + kBlockTrailerSize, memory_allocator_); used_buf_ = heap_buf_.get(); } } @@ -159,29 +173,45 @@ void BlockFetcher::InsertUncompressedBlockToPersistentCacheIfNeeded() { } } +inline void BlockFetcher::CopyBufferToHeap() { + assert(used_buf_ != heap_buf_.get()); + heap_buf_ = AllocateBlock(block_size_ + kBlockTrailerSize, memory_allocator_); + memcpy(heap_buf_.get(), used_buf_, block_size_ + kBlockTrailerSize); +} + inline void BlockFetcher::GetBlockContents() { if (slice_.data() != used_buf_) { // the slice content is not the buffer provided - *contents_ = BlockContents(Slice(slice_.data(), block_size_), - immortal_source_, compression_type); + *contents_ = BlockContents(Slice(slice_.data(), block_size_)); } else { // page can be either uncompressed or compressed, the buffer either stack // or heap provided. 
Refer to https://github.com/facebook/rocksdb/pull/4096 if (got_from_prefetch_buffer_ || used_buf_ == &stack_buf_[0]) { - assert(used_buf_ != heap_buf_.get()); - heap_buf_.reset(new char[block_size_ + kBlockTrailerSize]); - memcpy(heap_buf_.get(), used_buf_, block_size_ + kBlockTrailerSize); + CopyBufferToHeap(); + } else if (used_buf_ == compressed_buf_.get()) { + if (compression_type_ == kNoCompression && + memory_allocator_ != memory_allocator_compressed_) { + CopyBufferToHeap(); + } else { + heap_buf_ = std::move(compressed_buf_); + } } - *contents_ = BlockContents(std::move(heap_buf_), block_size_, true, - compression_type); + *contents_ = BlockContents(std::move(heap_buf_), block_size_); } +#ifndef NDEBUG + contents_->is_raw_block = true; +#endif } Status BlockFetcher::ReadBlockContents() { block_size_ = static_cast(handle_.size()); if (TryGetUncompressBlockFromPersistentCache()) { + compression_type_ = kNoCompression; +#ifndef NDEBUG + contents_->is_raw_block = true; +#endif // NDEBUG return Status::OK(); } if (TryGetFromPrefetchBuffer()) { @@ -222,15 +252,16 @@ Status BlockFetcher::ReadBlockContents() { PERF_TIMER_GUARD(block_decompress_time); - compression_type = - static_cast(slice_.data()[block_size_]); + compression_type_ = get_block_compression_type(slice_.data(), block_size_); - if (do_uncompress_ && compression_type != kNoCompression) { + if (do_uncompress_ && compression_type_ != kNoCompression) { // compressed page, uncompress, update cache - UncompressionContext uncompression_ctx(compression_type, compression_dict_); - status_ = - UncompressBlockContents(uncompression_ctx, slice_.data(), block_size_, - contents_, footer_.version(), ioptions_); + UncompressionContext uncompression_ctx(compression_type_, + compression_dict_); + status_ = UncompressBlockContents(uncompression_ctx, slice_.data(), + block_size_, contents_, footer_.version(), + ioptions_, memory_allocator_); + compression_type_ = kNoCompression; } else { GetBlockContents(); } diff --git a/table/block_fetcher.h b/table/block_fetcher.h index 9e0d2448dd5..aed73a39252 100644 --- a/table/block_fetcher.h +++ b/table/block_fetcher.h @@ -10,6 +10,7 @@ #pragma once #include "table/block.h" #include "table/format.h" +#include "util/memory_allocator.h" namespace rocksdb { class BlockFetcher { @@ -24,9 +25,11 @@ class BlockFetcher { FilePrefetchBuffer* prefetch_buffer, const Footer& footer, const ReadOptions& read_options, const BlockHandle& handle, BlockContents* contents, const ImmutableCFOptions& ioptions, - bool do_uncompress, const Slice& compression_dict, + bool do_uncompress, bool maybe_compressed, + const Slice& compression_dict, const PersistentCacheOptions& cache_options, - const bool immortal_source = false) + MemoryAllocator* memory_allocator = nullptr, + MemoryAllocator* memory_allocator_compressed = nullptr) : file_(file), prefetch_buffer_(prefetch_buffer), footer_(footer), @@ -35,10 +38,13 @@ class BlockFetcher { contents_(contents), ioptions_(ioptions), do_uncompress_(do_uncompress), - immortal_source_(immortal_source), + maybe_compressed_(maybe_compressed), compression_dict_(compression_dict), - cache_options_(cache_options) {} + cache_options_(cache_options), + memory_allocator_(memory_allocator), + memory_allocator_compressed_(memory_allocator_compressed) {} Status ReadBlockContents(); + CompressionType get_compression_type() const { return compression_type_; } private: static const uint32_t kDefaultStackBufferSize = 5000; @@ -51,17 +57,20 @@ class BlockFetcher { BlockContents* contents_; const 
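The fetch path now distinguishes three destinations for the raw bytes, and `CopyBufferToHeap` exists to re-home data that landed in the stack (or prefetch) buffer before ownership is transferred to a longer-lived object. Below is a simplified sketch of that policy; the trailer size, threshold, and types are stand-ins rather than the real `BlockFetcher` members.

```cpp
#include <cstddef>
#include <cstring>
#include <memory>

constexpr size_t kStackBufferSize = 5000;
constexpr size_t kBlockTrailerSize = 5;  // assumed: 1 type byte + 4 checksum bytes

enum class BufferKind { kStack, kCompressedHeap, kHeap };

BufferKind ChooseBuffer(size_t block_size, bool maybe_compressed,
                        bool do_uncompress) {
  if (block_size + kBlockTrailerSize < kStackBufferSize) {
    return BufferKind::kStack;  // small read: no allocation at all
  }
  if (maybe_compressed && !do_uncompress) {
    return BufferKind::kCompressedHeap;  // will be kept/cached compressed as-is
  }
  return BufferKind::kHeap;  // regular allocator-backed buffer
}

// Re-home bytes that currently live in a stack (or prefetch) buffer so a
// longer-lived object can own them.
std::unique_ptr<char[]> CopyBufferToHeap(const char* src, size_t n) {
  std::unique_ptr<char[]> heap(new char[n]);
  std::memcpy(heap.get(), src, n);
  return heap;
}

int main() {
  char stack_buf[kStackBufferSize] = "small block bytes";
  if (ChooseBuffer(18, /*maybe_compressed=*/true, /*do_uncompress=*/true) ==
      BufferKind::kStack) {
    auto owned = CopyBufferToHeap(stack_buf, 18);
    (void)owned;
  }
  return 0;
}
```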
ImmutableCFOptions& ioptions_; bool do_uncompress_; - const bool immortal_source_; + bool maybe_compressed_; const Slice& compression_dict_; const PersistentCacheOptions& cache_options_; + MemoryAllocator* memory_allocator_; + MemoryAllocator* memory_allocator_compressed_; Status status_; Slice slice_; char* used_buf_ = nullptr; size_t block_size_; - std::unique_ptr heap_buf_; + CacheAllocationPtr heap_buf_; + CacheAllocationPtr compressed_buf_; char stack_buf_[kDefaultStackBufferSize]; bool got_from_prefetch_buffer_ = false; - rocksdb::CompressionType compression_type; + rocksdb::CompressionType compression_type_; // return true if found bool TryGetUncompressBlockFromPersistentCache(); @@ -69,6 +78,8 @@ class BlockFetcher { bool TryGetFromPrefetchBuffer(); bool TryGetCompressedBlockFromPersistentCache(); void PrepareBufferForBlockFromFile(); + // Copy content from used_buf_ to new heap buffer. + void CopyBufferToHeap(); void GetBlockContents(); void InsertCompressedBlockToPersistentCacheIfNeeded(); void InsertUncompressedBlockToPersistentCacheIfNeeded(); diff --git a/table/block_test.cc b/table/block_test.cc index 0ca6ec3f6de..5ac9ffb2141 100644 --- a/table/block_test.cc +++ b/table/block_test.cc @@ -117,7 +117,6 @@ TEST_F(BlockTest, SimpleTest) { // create block reader BlockContents contents; contents.data = rawblock; - contents.cachable = false; Block reader(std::move(contents), kDisableGlobalSequenceNumber); // read contents of block sequentially @@ -188,7 +187,6 @@ TEST_F(BlockTest, ValueDeltaEncodingTest) { // create block reader BlockContents contents; contents.data = rawblock; - contents.cachable = false; Block reader(std::move(contents), kDisableGlobalSequenceNumber); const bool kTotalOrderSeek = true; @@ -247,7 +245,6 @@ BlockContents GetBlockContents(std::unique_ptr *builder, BlockContents contents; contents.data = rawblock; - contents.cachable = false; return contents; } @@ -257,8 +254,7 @@ void CheckBlockContents(BlockContents contents, const int max_key, const std::vector &values) { const size_t prefix_size = 6; // create block reader - BlockContents contents_ref(contents.data, contents.cachable, - contents.compression_type); + BlockContents contents_ref(contents.data); Block reader1(std::move(contents), kDisableGlobalSequenceNumber); Block reader2(std::move(contents_ref), kDisableGlobalSequenceNumber); @@ -486,7 +482,6 @@ TEST_F(BlockTest, BlockWithReadAmpBitmap) { // create block reader BlockContents contents; contents.data = rawblock; - contents.cachable = true; Block reader(std::move(contents), kDisableGlobalSequenceNumber, kBytesPerBit, stats.get()); @@ -521,7 +516,6 @@ TEST_F(BlockTest, BlockWithReadAmpBitmap) { // create block reader BlockContents contents; contents.data = rawblock; - contents.cachable = true; Block reader(std::move(contents), kDisableGlobalSequenceNumber, kBytesPerBit, stats.get()); @@ -558,7 +552,6 @@ TEST_F(BlockTest, BlockWithReadAmpBitmap) { // create block reader BlockContents contents; contents.data = rawblock; - contents.cachable = true; Block reader(std::move(contents), kDisableGlobalSequenceNumber, kBytesPerBit, stats.get()); diff --git a/table/cuckoo_table_builder.cc b/table/cuckoo_table_builder.cc index 7d9842a95f0..f590e6ad405 100644 --- a/table/cuckoo_table_builder.cc +++ b/table/cuckoo_table_builder.cc @@ -289,6 +289,7 @@ Status CuckooTableBuilder::Finish() { } } properties_.num_entries = num_entries_; + properties_.num_deletions = num_entries_ - num_values_; properties_.fixed_key_len = key_size_; properties_.user_collected_properties[ 
CuckooTablePropertyNames::kValueLength].assign( diff --git a/table/cuckoo_table_builder_test.cc b/table/cuckoo_table_builder_test.cc index 27eacf6ec95..c1e350327f3 100644 --- a/table/cuckoo_table_builder_test.cc +++ b/table/cuckoo_table_builder_test.cc @@ -43,8 +43,15 @@ class CuckooBuilderTest : public testing::Test { std::string expected_unused_bucket, uint64_t expected_table_size, uint32_t expected_num_hash_func, bool expected_is_last_level, uint32_t expected_cuckoo_block_size = 1) { + uint64_t num_deletions = 0; + for (const auto& key : keys) { + ParsedInternalKey parsed; + if (ParseInternalKey(key, &parsed) && parsed.type == kTypeDeletion) { + num_deletions++; + } + } // Read file - unique_ptr read_file; + std::unique_ptr read_file; ASSERT_OK(env_->NewRandomAccessFile(fname, &read_file, env_options_)); uint64_t read_file_size; ASSERT_OK(env_->GetFileSize(fname, &read_file_size)); @@ -56,7 +63,7 @@ class CuckooBuilderTest : public testing::Test { // Assert Table Properties. TableProperties* props = nullptr; - unique_ptr file_reader( + std::unique_ptr file_reader( new RandomAccessFileReader(std::move(read_file), fname)); ASSERT_OK(ReadTableProperties(file_reader.get(), read_file_size, kCuckooTableMagicNumber, ioptions, @@ -90,6 +97,7 @@ class CuckooBuilderTest : public testing::Test { ASSERT_EQ(expected_is_last_level, is_last_level_found); ASSERT_EQ(props->num_entries, keys.size()); + ASSERT_EQ(props->num_deletions, num_deletions); ASSERT_EQ(props->fixed_key_len, keys.empty() ? 0 : keys[0].size()); ASSERT_EQ(props->data_size, expected_unused_bucket.size() * (expected_table_size + expected_cuckoo_block_size - 1)); @@ -126,9 +134,10 @@ class CuckooBuilderTest : public testing::Test { } } - std::string GetInternalKey(Slice user_key, bool zero_seqno) { + std::string GetInternalKey(Slice user_key, bool zero_seqno, + ValueType type = kTypeValue) { IterKey ikey; - ikey.SetInternalKey(user_key, zero_seqno ? 0 : 1000, kTypeValue); + ikey.SetInternalKey(user_key, zero_seqno ? 
0 : 1000, type); return ikey.GetInternalKey().ToString(); } @@ -152,10 +161,10 @@ class CuckooBuilderTest : public testing::Test { }; TEST_F(CuckooBuilderTest, SuccessWithEmptyFile) { - unique_ptr writable_file; + std::unique_ptr writable_file; fname = test::PerThreadDBPath("EmptyFile"); ASSERT_OK(env_->NewWritableFile(fname, &writable_file, env_options_)); - unique_ptr file_writer( + std::unique_ptr file_writer( new WritableFileWriter(std::move(writable_file), fname, EnvOptions())); CuckooTableBuilder builder(file_writer.get(), kHashTableRatio, 4, 100, BytewiseComparator(), 1, false, false, @@ -169,50 +178,57 @@ TEST_F(CuckooBuilderTest, SuccessWithEmptyFile) { } TEST_F(CuckooBuilderTest, WriteSuccessNoCollisionFullKey) { - uint32_t num_hash_fun = 4; - std::vector user_keys = {"key01", "key02", "key03", "key04"}; - std::vector values = {"v01", "v02", "v03", "v04"}; - // Need to have a temporary variable here as VS compiler does not currently - // support operator= with initializer_list as a parameter - std::unordered_map> hm = { - {user_keys[0], {0, 1, 2, 3}}, - {user_keys[1], {1, 2, 3, 4}}, - {user_keys[2], {2, 3, 4, 5}}, - {user_keys[3], {3, 4, 5, 6}}}; - hash_map = std::move(hm); - - std::vector expected_locations = {0, 1, 2, 3}; - std::vector keys; - for (auto& user_key : user_keys) { - keys.push_back(GetInternalKey(user_key, false)); - } - uint64_t expected_table_size = GetExpectedTableSize(keys.size()); - - unique_ptr writable_file; - fname = test::PerThreadDBPath("NoCollisionFullKey"); - ASSERT_OK(env_->NewWritableFile(fname, &writable_file, env_options_)); - unique_ptr file_writer( - new WritableFileWriter(std::move(writable_file), fname, EnvOptions())); - CuckooTableBuilder builder(file_writer.get(), kHashTableRatio, num_hash_fun, - 100, BytewiseComparator(), 1, false, false, - GetSliceHash, 0 /* column_family_id */, - kDefaultColumnFamilyName); - ASSERT_OK(builder.status()); - for (uint32_t i = 0; i < user_keys.size(); i++) { - builder.Add(Slice(keys[i]), Slice(values[i])); - ASSERT_EQ(builder.NumEntries(), i + 1); + for (auto type : {kTypeValue, kTypeDeletion}) { + uint32_t num_hash_fun = 4; + std::vector user_keys = {"key01", "key02", "key03", "key04"}; + std::vector values; + if (type == kTypeValue) { + values = {"v01", "v02", "v03", "v04"}; + } else { + values = {"", "", "", ""}; + } + // Need to have a temporary variable here as VS compiler does not currently + // support operator= with initializer_list as a parameter + std::unordered_map> hm = { + {user_keys[0], {0, 1, 2, 3}}, + {user_keys[1], {1, 2, 3, 4}}, + {user_keys[2], {2, 3, 4, 5}}, + {user_keys[3], {3, 4, 5, 6}}}; + hash_map = std::move(hm); + + std::vector expected_locations = {0, 1, 2, 3}; + std::vector keys; + for (auto& user_key : user_keys) { + keys.push_back(GetInternalKey(user_key, false, type)); + } + uint64_t expected_table_size = GetExpectedTableSize(keys.size()); + + std::unique_ptr writable_file; + fname = test::PerThreadDBPath("NoCollisionFullKey"); + ASSERT_OK(env_->NewWritableFile(fname, &writable_file, env_options_)); + std::unique_ptr file_writer( + new WritableFileWriter(std::move(writable_file), fname, EnvOptions())); + CuckooTableBuilder builder(file_writer.get(), kHashTableRatio, num_hash_fun, + 100, BytewiseComparator(), 1, false, false, + GetSliceHash, 0 /* column_family_id */, + kDefaultColumnFamilyName); ASSERT_OK(builder.status()); + for (uint32_t i = 0; i < user_keys.size(); i++) { + builder.Add(Slice(keys[i]), Slice(values[i])); + ASSERT_EQ(builder.NumEntries(), i + 1); + 
ASSERT_OK(builder.status()); + } + size_t bucket_size = keys[0].size() + values[0].size(); + ASSERT_EQ(expected_table_size * bucket_size - 1, builder.FileSize()); + ASSERT_OK(builder.Finish()); + ASSERT_OK(file_writer->Close()); + ASSERT_LE(expected_table_size * bucket_size, builder.FileSize()); + + std::string expected_unused_bucket = GetInternalKey("key00", true); + expected_unused_bucket += std::string(values[0].size(), 'a'); + CheckFileContents(keys, values, expected_locations, expected_unused_bucket, + expected_table_size, 2, false); } - size_t bucket_size = keys[0].size() + values[0].size(); - ASSERT_EQ(expected_table_size * bucket_size - 1, builder.FileSize()); - ASSERT_OK(builder.Finish()); - ASSERT_OK(file_writer->Close()); - ASSERT_LE(expected_table_size * bucket_size, builder.FileSize()); - - std::string expected_unused_bucket = GetInternalKey("key00", true); - expected_unused_bucket += std::string(values[0].size(), 'a'); - CheckFileContents(keys, values, expected_locations, - expected_unused_bucket, expected_table_size, 2, false); } TEST_F(CuckooBuilderTest, WriteSuccessWithCollisionFullKey) { @@ -236,10 +252,10 @@ TEST_F(CuckooBuilderTest, WriteSuccessWithCollisionFullKey) { } uint64_t expected_table_size = GetExpectedTableSize(keys.size()); - unique_ptr writable_file; + std::unique_ptr writable_file; fname = test::PerThreadDBPath("WithCollisionFullKey"); ASSERT_OK(env_->NewWritableFile(fname, &writable_file, env_options_)); - unique_ptr file_writer( + std::unique_ptr file_writer( new WritableFileWriter(std::move(writable_file), fname, EnvOptions())); CuckooTableBuilder builder(file_writer.get(), kHashTableRatio, num_hash_fun, 100, BytewiseComparator(), 1, false, false, @@ -284,11 +300,11 @@ TEST_F(CuckooBuilderTest, WriteSuccessWithCollisionAndCuckooBlock) { } uint64_t expected_table_size = GetExpectedTableSize(keys.size()); - unique_ptr writable_file; + std::unique_ptr writable_file; uint32_t cuckoo_block_size = 2; fname = test::PerThreadDBPath("WithCollisionFullKey2"); ASSERT_OK(env_->NewWritableFile(fname, &writable_file, env_options_)); - unique_ptr file_writer( + std::unique_ptr file_writer( new WritableFileWriter(std::move(writable_file), fname, EnvOptions())); CuckooTableBuilder builder( file_writer.get(), kHashTableRatio, num_hash_fun, 100, @@ -338,10 +354,10 @@ TEST_F(CuckooBuilderTest, WithCollisionPathFullKey) { } uint64_t expected_table_size = GetExpectedTableSize(keys.size()); - unique_ptr writable_file; + std::unique_ptr writable_file; fname = test::PerThreadDBPath("WithCollisionPathFullKey"); ASSERT_OK(env_->NewWritableFile(fname, &writable_file, env_options_)); - unique_ptr file_writer( + std::unique_ptr file_writer( new WritableFileWriter(std::move(writable_file), fname, EnvOptions())); CuckooTableBuilder builder(file_writer.get(), kHashTableRatio, num_hash_fun, 100, BytewiseComparator(), 1, false, false, @@ -388,10 +404,10 @@ TEST_F(CuckooBuilderTest, WithCollisionPathFullKeyAndCuckooBlock) { } uint64_t expected_table_size = GetExpectedTableSize(keys.size()); - unique_ptr writable_file; + std::unique_ptr writable_file; fname = test::PerThreadDBPath("WithCollisionPathFullKeyAndCuckooBlock"); ASSERT_OK(env_->NewWritableFile(fname, &writable_file, env_options_)); - unique_ptr file_writer( + std::unique_ptr file_writer( new WritableFileWriter(std::move(writable_file), fname, EnvOptions())); CuckooTableBuilder builder(file_writer.get(), kHashTableRatio, num_hash_fun, 100, BytewiseComparator(), 2, false, false, @@ -431,10 +447,10 @@ TEST_F(CuckooBuilderTest, 
WriteSuccessNoCollisionUserKey) { std::vector expected_locations = {0, 1, 2, 3}; uint64_t expected_table_size = GetExpectedTableSize(user_keys.size()); - unique_ptr writable_file; + std::unique_ptr writable_file; fname = test::PerThreadDBPath("NoCollisionUserKey"); ASSERT_OK(env_->NewWritableFile(fname, &writable_file, env_options_)); - unique_ptr file_writer( + std::unique_ptr file_writer( new WritableFileWriter(std::move(writable_file), fname, EnvOptions())); CuckooTableBuilder builder(file_writer.get(), kHashTableRatio, num_hash_fun, 100, BytewiseComparator(), 1, false, false, @@ -475,10 +491,10 @@ TEST_F(CuckooBuilderTest, WriteSuccessWithCollisionUserKey) { std::vector expected_locations = {0, 1, 2, 3}; uint64_t expected_table_size = GetExpectedTableSize(user_keys.size()); - unique_ptr writable_file; + std::unique_ptr writable_file; fname = test::PerThreadDBPath("WithCollisionUserKey"); ASSERT_OK(env_->NewWritableFile(fname, &writable_file, env_options_)); - unique_ptr file_writer( + std::unique_ptr file_writer( new WritableFileWriter(std::move(writable_file), fname, EnvOptions())); CuckooTableBuilder builder(file_writer.get(), kHashTableRatio, num_hash_fun, 100, BytewiseComparator(), 1, false, false, @@ -521,10 +537,10 @@ TEST_F(CuckooBuilderTest, WithCollisionPathUserKey) { std::vector expected_locations = {0, 1, 3, 4, 2}; uint64_t expected_table_size = GetExpectedTableSize(user_keys.size()); - unique_ptr writable_file; + std::unique_ptr writable_file; fname = test::PerThreadDBPath("WithCollisionPathUserKey"); ASSERT_OK(env_->NewWritableFile(fname, &writable_file, env_options_)); - unique_ptr file_writer( + std::unique_ptr file_writer( new WritableFileWriter(std::move(writable_file), fname, EnvOptions())); CuckooTableBuilder builder(file_writer.get(), kHashTableRatio, num_hash_fun, 2, BytewiseComparator(), 1, false, false, @@ -566,10 +582,10 @@ TEST_F(CuckooBuilderTest, FailWhenCollisionPathTooLong) { }; hash_map = std::move(hm); - unique_ptr writable_file; + std::unique_ptr writable_file; fname = test::PerThreadDBPath("WithCollisionPathUserKey"); ASSERT_OK(env_->NewWritableFile(fname, &writable_file, env_options_)); - unique_ptr file_writer( + std::unique_ptr file_writer( new WritableFileWriter(std::move(writable_file), fname, EnvOptions())); CuckooTableBuilder builder(file_writer.get(), kHashTableRatio, num_hash_fun, 2, BytewiseComparator(), 1, false, false, @@ -594,10 +610,10 @@ TEST_F(CuckooBuilderTest, FailWhenSameKeyInserted) { uint32_t num_hash_fun = 4; std::string user_key = "repeatedkey"; - unique_ptr writable_file; + std::unique_ptr writable_file; fname = test::PerThreadDBPath("FailWhenSameKeyInserted"); ASSERT_OK(env_->NewWritableFile(fname, &writable_file, env_options_)); - unique_ptr file_writer( + std::unique_ptr file_writer( new WritableFileWriter(std::move(writable_file), fname, EnvOptions())); CuckooTableBuilder builder(file_writer.get(), kHashTableRatio, num_hash_fun, 100, BytewiseComparator(), 1, false, false, diff --git a/table/cuckoo_table_factory.cc b/table/cuckoo_table_factory.cc index 84d22468eb9..74d18d51213 100644 --- a/table/cuckoo_table_factory.cc +++ b/table/cuckoo_table_factory.cc @@ -14,7 +14,7 @@ namespace rocksdb { Status CuckooTableFactory::NewTableReader( const TableReaderOptions& table_reader_options, - unique_ptr&& file, uint64_t file_size, + std::unique_ptr&& file, uint64_t file_size, std::unique_ptr* table, bool /*prefetch_index_and_filter_in_cache*/) const { std::unique_ptr new_reader(new CuckooTableReader( diff --git 
a/table/cuckoo_table_factory.h b/table/cuckoo_table_factory.h index a96635de57d..eb3c5e51768 100644 --- a/table/cuckoo_table_factory.h +++ b/table/cuckoo_table_factory.h @@ -60,8 +60,8 @@ class CuckooTableFactory : public TableFactory { Status NewTableReader( const TableReaderOptions& table_reader_options, - unique_ptr&& file, uint64_t file_size, - unique_ptr* table, + std::unique_ptr&& file, uint64_t file_size, + std::unique_ptr* table, bool prefetch_index_and_filter_in_cache = true) const override; TableBuilder* NewTableBuilder( diff --git a/table/cuckoo_table_reader_test.cc b/table/cuckoo_table_reader_test.cc index 36083c54747..74fb52e6c78 100644 --- a/table/cuckoo_table_reader_test.cc +++ b/table/cuckoo_table_reader_test.cc @@ -95,7 +95,7 @@ class CuckooReaderTest : public testing::Test { const Comparator* ucomp = BytewiseComparator()) { std::unique_ptr writable_file; ASSERT_OK(env->NewWritableFile(fname, &writable_file, env_options)); - unique_ptr file_writer( + std::unique_ptr file_writer( new WritableFileWriter(std::move(writable_file), fname, env_options)); CuckooTableBuilder builder( @@ -115,7 +115,7 @@ class CuckooReaderTest : public testing::Test { // Check reader now. std::unique_ptr read_file; ASSERT_OK(env->NewRandomAccessFile(fname, &read_file, env_options)); - unique_ptr file_reader( + std::unique_ptr file_reader( new RandomAccessFileReader(std::move(read_file), fname)); const ImmutableCFOptions ioptions(options); CuckooTableReader reader(ioptions, std::move(file_reader), file_size, ucomp, @@ -144,7 +144,7 @@ class CuckooReaderTest : public testing::Test { void CheckIterator(const Comparator* ucomp = BytewiseComparator()) { std::unique_ptr read_file; ASSERT_OK(env->NewRandomAccessFile(fname, &read_file, env_options)); - unique_ptr file_reader( + std::unique_ptr file_reader( new RandomAccessFileReader(std::move(read_file), fname)); const ImmutableCFOptions ioptions(options); CuckooTableReader reader(ioptions, std::move(file_reader), file_size, ucomp, @@ -323,7 +323,7 @@ TEST_F(CuckooReaderTest, WhenKeyNotFound) { CreateCuckooFileAndCheckReader(); std::unique_ptr read_file; ASSERT_OK(env->NewRandomAccessFile(fname, &read_file, env_options)); - unique_ptr file_reader( + std::unique_ptr file_reader( new RandomAccessFileReader(std::move(read_file), fname)); const ImmutableCFOptions ioptions(options); CuckooTableReader reader(ioptions, std::move(file_reader), file_size, ucmp, @@ -411,7 +411,7 @@ void WriteFile(const std::vector& keys, std::unique_ptr writable_file; ASSERT_OK(env->NewWritableFile(fname, &writable_file, env_options)); - unique_ptr file_writer( + std::unique_ptr file_writer( new WritableFileWriter(std::move(writable_file), fname, env_options)); CuckooTableBuilder builder( file_writer.get(), hash_ratio, 64, 1000, test::Uint64Comparator(), 5, @@ -432,7 +432,7 @@ void WriteFile(const std::vector& keys, env->GetFileSize(fname, &file_size); std::unique_ptr read_file; ASSERT_OK(env->NewRandomAccessFile(fname, &read_file, env_options)); - unique_ptr file_reader( + std::unique_ptr file_reader( new RandomAccessFileReader(std::move(read_file), fname)); const ImmutableCFOptions ioptions(options); @@ -464,7 +464,7 @@ void ReadKeys(uint64_t num, uint32_t batch_size) { env->GetFileSize(fname, &file_size); std::unique_ptr read_file; ASSERT_OK(env->NewRandomAccessFile(fname, &read_file, env_options)); - unique_ptr file_reader( + std::unique_ptr file_reader( new RandomAccessFileReader(std::move(read_file), fname)); const ImmutableCFOptions ioptions(options); diff --git 
a/table/data_block_hash_index_test.cc b/table/data_block_hash_index_test.cc index dc62917f2a1..ac12bbf935d 100644 --- a/table/data_block_hash_index_test.cc +++ b/table/data_block_hash_index_test.cc @@ -7,12 +7,14 @@ #include #include +#include "db/table_properties_collector.h" #include "rocksdb/slice.h" #include "table/block.h" #include "table/block_based_table_reader.h" #include "table/block_builder.h" #include "table/data_block_hash_index.h" #include "table/get_context.h" +#include "table/table_builder.h" #include "util/testharness.h" #include "util/testutil.h" @@ -282,7 +284,6 @@ TEST(DataBlockHashIndex, BlockRestartIndexExceedMax) { // create block reader BlockContents contents; contents.data = rawblock; - contents.cachable = false; Block reader(std::move(contents), kDisableGlobalSequenceNumber); ASSERT_EQ(reader.IndexType(), @@ -305,7 +306,6 @@ TEST(DataBlockHashIndex, BlockRestartIndexExceedMax) { // create block reader BlockContents contents; contents.data = rawblock; - contents.cachable = false; Block reader(std::move(contents), kDisableGlobalSequenceNumber); ASSERT_EQ(reader.IndexType(), @@ -337,7 +337,6 @@ TEST(DataBlockHashIndex, BlockSizeExceedMax) { // create block reader BlockContents contents; contents.data = rawblock; - contents.cachable = false; Block reader(std::move(contents), kDisableGlobalSequenceNumber); ASSERT_EQ(reader.IndexType(), @@ -362,7 +361,6 @@ TEST(DataBlockHashIndex, BlockSizeExceedMax) { // create block reader BlockContents contents; contents.data = rawblock; - contents.cachable = false; Block reader(std::move(contents), kDisableGlobalSequenceNumber); // the index type have fallen back to binary when build finish. @@ -390,7 +388,6 @@ TEST(DataBlockHashIndex, BlockTestSingleKey) { // create block reader BlockContents contents; contents.data = rawblock; - contents.cachable = false; Block reader(std::move(contents), kDisableGlobalSequenceNumber); const InternalKeyComparator icmp(BytewiseComparator()); @@ -472,7 +469,6 @@ TEST(DataBlockHashIndex, BlockTestLarge) { // create block reader BlockContents contents; contents.data = rawblock; - contents.cachable = false; Block reader(std::move(contents), kDisableGlobalSequenceNumber); const InternalKeyComparator icmp(BytewiseComparator()); @@ -540,9 +536,9 @@ TEST(DataBlockHashIndex, BlockTestLarge) { void TestBoundary(InternalKey& ik1, std::string& v1, InternalKey& ik2, std::string& v2, InternalKey& seek_ikey, GetContext& get_context, Options& options) { - unique_ptr file_writer; - unique_ptr file_reader; - unique_ptr table_reader; + std::unique_ptr file_writer; + std::unique_ptr file_reader; + std::unique_ptr table_reader; int level_ = -1; std::vector keys; @@ -555,7 +551,7 @@ void TestBoundary(InternalKey& ik1, std::string& v1, InternalKey& ik2, soptions.use_mmap_reads = ioptions.allow_mmap_reads; file_writer.reset( test::GetWritableFileWriter(new test::StringSink(), "" /* don't care */)); - unique_ptr builder; + std::unique_ptr builder; std::vector> int_tbl_prop_collector_factories; std::string column_family_name; diff --git a/table/format.cc b/table/format.cc index 16d959c3dce..0e43e824334 100644 --- a/table/format.cc +++ b/table/format.cc @@ -24,6 +24,7 @@ #include "util/crc32c.h" #include "util/file_reader_writer.h" #include "util/logging.h" +#include "util/memory_allocator.h" #include "util/stop_watch.h" #include "util/string_util.h" #include "util/xxhash.h" @@ -279,8 +280,8 @@ Status ReadFooterFromFile(RandomAccessFileReader* file, Status UncompressBlockContentsForCompressionType( const UncompressionContext& 
uncompression_ctx, const char* data, size_t n, BlockContents* contents, uint32_t format_version, - const ImmutableCFOptions& ioptions) { - std::unique_ptr ubuf; + const ImmutableCFOptions& ioptions, MemoryAllocator* allocator) { + CacheAllocationPtr ubuf; assert(uncompression_ctx.type() != kNoCompression && "Invalid compression type"); @@ -296,81 +297,82 @@ Status UncompressBlockContentsForCompressionType( if (!Snappy_GetUncompressedLength(data, n, &ulength)) { return Status::Corruption(snappy_corrupt_msg); } - ubuf.reset(new char[ulength]); + ubuf = AllocateBlock(ulength, allocator); if (!Snappy_Uncompress(data, n, ubuf.get())) { return Status::Corruption(snappy_corrupt_msg); } - *contents = BlockContents(std::move(ubuf), ulength, true, kNoCompression); + *contents = BlockContents(std::move(ubuf), ulength); break; } case kZlibCompression: - ubuf.reset(Zlib_Uncompress( + ubuf = Zlib_Uncompress( uncompression_ctx, data, n, &decompress_size, - GetCompressFormatForVersion(kZlibCompression, format_version))); + GetCompressFormatForVersion(kZlibCompression, format_version), + allocator); if (!ubuf) { static char zlib_corrupt_msg[] = "Zlib not supported or corrupted Zlib compressed block contents"; return Status::Corruption(zlib_corrupt_msg); } - *contents = - BlockContents(std::move(ubuf), decompress_size, true, kNoCompression); + *contents = BlockContents(std::move(ubuf), decompress_size); break; case kBZip2Compression: - ubuf.reset(BZip2_Uncompress( + ubuf = BZip2_Uncompress( data, n, &decompress_size, - GetCompressFormatForVersion(kBZip2Compression, format_version))); + GetCompressFormatForVersion(kBZip2Compression, format_version), + allocator); if (!ubuf) { static char bzip2_corrupt_msg[] = "Bzip2 not supported or corrupted Bzip2 compressed block contents"; return Status::Corruption(bzip2_corrupt_msg); } - *contents = - BlockContents(std::move(ubuf), decompress_size, true, kNoCompression); + *contents = BlockContents(std::move(ubuf), decompress_size); break; case kLZ4Compression: - ubuf.reset(LZ4_Uncompress( + ubuf = LZ4_Uncompress( uncompression_ctx, data, n, &decompress_size, - GetCompressFormatForVersion(kLZ4Compression, format_version))); + GetCompressFormatForVersion(kLZ4Compression, format_version), + allocator); if (!ubuf) { static char lz4_corrupt_msg[] = "LZ4 not supported or corrupted LZ4 compressed block contents"; return Status::Corruption(lz4_corrupt_msg); } - *contents = - BlockContents(std::move(ubuf), decompress_size, true, kNoCompression); + *contents = BlockContents(std::move(ubuf), decompress_size); break; case kLZ4HCCompression: - ubuf.reset(LZ4_Uncompress( + ubuf = LZ4_Uncompress( uncompression_ctx, data, n, &decompress_size, - GetCompressFormatForVersion(kLZ4HCCompression, format_version))); + GetCompressFormatForVersion(kLZ4HCCompression, format_version), + allocator); if (!ubuf) { static char lz4hc_corrupt_msg[] = "LZ4HC not supported or corrupted LZ4HC compressed block contents"; return Status::Corruption(lz4hc_corrupt_msg); } - *contents = - BlockContents(std::move(ubuf), decompress_size, true, kNoCompression); + *contents = BlockContents(std::move(ubuf), decompress_size); break; case kXpressCompression: + // XPRESS allocates memory internally, thus no support for custom + // allocator. 
ubuf.reset(XPRESS_Uncompress(data, n, &decompress_size)); if (!ubuf) { static char xpress_corrupt_msg[] = "XPRESS not supported or corrupted XPRESS compressed block contents"; return Status::Corruption(xpress_corrupt_msg); } - *contents = - BlockContents(std::move(ubuf), decompress_size, true, kNoCompression); + *contents = BlockContents(std::move(ubuf), decompress_size); break; case kZSTD: case kZSTDNotFinalCompression: - ubuf.reset(ZSTD_Uncompress(uncompression_ctx, data, n, &decompress_size)); + ubuf = ZSTD_Uncompress(uncompression_ctx, data, n, &decompress_size, + allocator); if (!ubuf) { static char zstd_corrupt_msg[] = "ZSTD not supported or corrupted ZSTD compressed block contents"; return Status::Corruption(zstd_corrupt_msg); } - *contents = - BlockContents(std::move(ubuf), decompress_size, true, kNoCompression); + *contents = BlockContents(std::move(ubuf), decompress_size); break; default: return Status::Corruption("bad block type"); @@ -396,11 +398,13 @@ Status UncompressBlockContentsForCompressionType( Status UncompressBlockContents(const UncompressionContext& uncompression_ctx, const char* data, size_t n, BlockContents* contents, uint32_t format_version, - const ImmutableCFOptions& ioptions) { + const ImmutableCFOptions& ioptions, + MemoryAllocator* allocator) { assert(data[n] != kNoCompression); assert(data[n] == uncompression_ctx.type()); - return UncompressBlockContentsForCompressionType( - uncompression_ctx, data, n, contents, format_version, ioptions); + return UncompressBlockContentsForCompressionType(uncompression_ctx, data, n, + contents, format_version, + ioptions, allocator); } } // namespace rocksdb diff --git a/table/format.h b/table/format.h index 6e0e99c1c74..0039c70a417 100644 --- a/table/format.h +++ b/table/format.h @@ -26,6 +26,7 @@ #include "port/port.h" // noexcept #include "table/persistent_cache_options.h" #include "util/file_reader_writer.h" +#include "util/memory_allocator.h" namespace rocksdb { @@ -188,24 +189,42 @@ Status ReadFooterFromFile(RandomAccessFileReader* file, // 1-byte type + 32-bit crc static const size_t kBlockTrailerSize = 5; +inline CompressionType get_block_compression_type(const char* block_data, + size_t block_size) { + return static_cast(block_data[block_size]); +} + struct BlockContents { Slice data; // Actual contents of data - bool cachable; // True iff data can be cached - CompressionType compression_type; - std::unique_ptr allocation; + CacheAllocationPtr allocation; + +#ifndef NDEBUG + // Whether the block is a raw block, which contains compression type + // byte. It is only used for assertion. 
+ bool is_raw_block = false; +#endif // NDEBUG + + BlockContents() {} + + BlockContents(const Slice& _data) : data(_data) {} - BlockContents() : cachable(false), compression_type(kNoCompression) {} + BlockContents(CacheAllocationPtr&& _data, size_t _size) + : data(_data.get(), _size), allocation(std::move(_data)) {} - BlockContents(const Slice& _data, bool _cachable, - CompressionType _compression_type) - : data(_data), cachable(_cachable), compression_type(_compression_type) {} + BlockContents(std::unique_ptr&& _data, size_t _size) + : data(_data.get(), _size) { + allocation.reset(_data.release()); + } + + bool own_bytes() const { return allocation.get() != nullptr; } - BlockContents(std::unique_ptr&& _data, size_t _size, bool _cachable, - CompressionType _compression_type) - : data(_data.get(), _size), - cachable(_cachable), - compression_type(_compression_type), - allocation(std::move(_data)) {} + // It's the caller's responsibility to make sure that this is + // for raw block contents, which contains the compression + // byte in the end. + CompressionType get_compression_type() const { + assert(is_raw_block); + return get_block_compression_type(data.data(), data.size()); + } // The additional memory space taken by the block data. size_t usable_size() const { @@ -220,15 +239,20 @@ struct BlockContents { } } + size_t ApproximateMemoryUsage() const { + return usable_size() + sizeof(*this); + } + BlockContents(BlockContents&& other) ROCKSDB_NOEXCEPT { *this = std::move(other); } BlockContents& operator=(BlockContents&& other) { data = std::move(other.data); - cachable = other.cachable; - compression_type = other.compression_type; allocation = std::move(other.allocation); +#ifndef NDEBUG + is_raw_block = other.is_raw_block; +#endif // NDEBUG return *this; } }; @@ -252,7 +276,7 @@ extern Status ReadBlockContents( extern Status UncompressBlockContents( const UncompressionContext& uncompression_ctx, const char* data, size_t n, BlockContents* contents, uint32_t compress_format_version, - const ImmutableCFOptions& ioptions); + const ImmutableCFOptions& ioptions, MemoryAllocator* allocator = nullptr); // This is an extension to UncompressBlockContents that accepts // a specific compression type. This is used by un-wrapped blocks @@ -260,7 +284,7 @@ extern Status UncompressBlockContents( extern Status UncompressBlockContentsForCompressionType( const UncompressionContext& uncompression_ctx, const char* data, size_t n, BlockContents* contents, uint32_t compress_format_version, - const ImmutableCFOptions& ioptions); + const ImmutableCFOptions& ioptions, MemoryAllocator* allocator = nullptr); // Implementation details follow. 
Clients should ignore, diff --git a/table/get_context.cc b/table/get_context.cc index 0aa75b6079c..6f0bd2ebbc3 100644 --- a/table/get_context.cc +++ b/table/get_context.cc @@ -43,7 +43,7 @@ GetContext::GetContext(const Comparator* ucmp, Statistics* statistics, GetState init_state, const Slice& user_key, PinnableSlice* pinnable_val, bool* value_found, MergeContext* merge_context, - RangeDelAggregator* _range_del_agg, Env* env, + SequenceNumber* _max_covering_tombstone_seq, Env* env, SequenceNumber* seq, PinnedIteratorsManager* _pinned_iters_mgr, ReadCallback* callback, bool* is_blob_index) @@ -56,7 +56,7 @@ GetContext::GetContext(const Comparator* ucmp, pinnable_val_(pinnable_val), value_found_(value_found), merge_context_(merge_context), - range_del_agg_(_range_del_agg), + max_covering_tombstone_seq_(_max_covering_tombstone_seq), env_(env), seq_(seq), replay_log_(nullptr), @@ -185,7 +185,8 @@ bool GetContext::SaveValue(const ParsedInternalKey& parsed_key, auto type = parsed_key.type; // Key matches. Process it if ((type == kTypeValue || type == kTypeMerge || type == kTypeBlobIndex) && - range_del_agg_ != nullptr && range_del_agg_->ShouldDelete(parsed_key)) { + max_covering_tombstone_seq_ != nullptr && + *max_covering_tombstone_seq_ > parsed_key.sequence) { type = kTypeRangeDeletion; } switch (type) { diff --git a/table/get_context.h b/table/get_context.h index 066be104ba8..407473808f1 100644 --- a/table/get_context.h +++ b/table/get_context.h @@ -6,7 +6,6 @@ #pragma once #include #include "db/merge_context.h" -#include "db/range_del_aggregator.h" #include "db/read_callback.h" #include "rocksdb/env.h" #include "rocksdb/statistics.h" @@ -52,8 +51,9 @@ class GetContext { GetContext(const Comparator* ucmp, const MergeOperator* merge_operator, Logger* logger, Statistics* statistics, GetState init_state, const Slice& user_key, PinnableSlice* value, bool* value_found, - MergeContext* merge_context, RangeDelAggregator* range_del_agg, - Env* env, SequenceNumber* seq = nullptr, + MergeContext* merge_context, + SequenceNumber* max_covering_tombstone_seq, Env* env, + SequenceNumber* seq = nullptr, PinnedIteratorsManager* _pinned_iters_mgr = nullptr, ReadCallback* callback = nullptr, bool* is_blob_index = nullptr); @@ -76,7 +76,9 @@ class GetContext { GetState State() const { return state_; } - RangeDelAggregator* range_del_agg() { return range_del_agg_; } + SequenceNumber* max_covering_tombstone_seq() { + return max_covering_tombstone_seq_; + } PinnedIteratorsManager* pinned_iters_mgr() { return pinned_iters_mgr_; } @@ -111,7 +113,7 @@ class GetContext { PinnableSlice* pinnable_val_; bool* value_found_; // Is value set correctly? 
Used by KeyMayExist MergeContext* merge_context_; - RangeDelAggregator* range_del_agg_; + SequenceNumber* max_covering_tombstone_seq_; Env* env_; // If a key is found, seq_ will be set to the SequenceNumber of most recent // write to the key or kMaxSequenceNumber if unknown diff --git a/table/iterator.cc b/table/iterator.cc index 97c47fb2854..3a1063f6ef9 100644 --- a/table/iterator.cc +++ b/table/iterator.cc @@ -103,7 +103,7 @@ Status Iterator::GetProperty(std::string prop_name, std::string* prop) { *prop = "0"; return Status::OK(); } - return Status::InvalidArgument("Undentified property."); + return Status::InvalidArgument("Unidentified property."); } namespace { diff --git a/table/meta_blocks.cc b/table/meta_blocks.cc index 256730bfa7a..fdf8a56120e 100644 --- a/table/meta_blocks.cc +++ b/table/meta_blocks.cc @@ -79,6 +79,8 @@ void PropertyBlockBuilder::AddTableProperty(const TableProperties& props) { Add(TablePropertiesNames::kIndexValueIsDeltaEncoded, props.index_value_is_delta_encoded); Add(TablePropertiesNames::kNumEntries, props.num_entries); + Add(TablePropertiesNames::kDeletedKeys, props.num_deletions); + Add(TablePropertiesNames::kMergeOperands, props.num_merge_operands); Add(TablePropertiesNames::kNumRangeDeletions, props.num_range_deletions); Add(TablePropertiesNames::kNumDataBlocks, props.num_data_blocks); Add(TablePropertiesNames::kFilterSize, props.filter_size); @@ -173,7 +175,8 @@ Status ReadProperties(const Slice& handle_value, RandomAccessFileReader* file, FilePrefetchBuffer* prefetch_buffer, const Footer& footer, const ImmutableCFOptions& ioptions, TableProperties** table_properties, - bool compression_type_missing) { + bool /*compression_type_missing*/, + MemoryAllocator* memory_allocator) { assert(table_properties); Slice v = handle_value; @@ -189,15 +192,13 @@ Status ReadProperties(const Slice& handle_value, RandomAccessFileReader* file, Slice compression_dict; PersistentCacheOptions cache_options; - BlockFetcher block_fetcher( - file, prefetch_buffer, footer, read_options, handle, &block_contents, - ioptions, false /* decompress */, compression_dict, cache_options); + BlockFetcher block_fetcher(file, prefetch_buffer, footer, read_options, + handle, &block_contents, ioptions, + false /* decompress */, false /*maybe_compressed*/, + compression_dict, cache_options, memory_allocator); s = block_fetcher.ReadBlockContents(); - // override compression_type when table file is known to contain undefined - // value at compression type marker - if (compression_type_missing) { - block_contents.compression_type = kNoCompression; - } + // property block is never compressed. Need to add uncompress logic if we are + // to compress it.. 
if (!s.ok()) { return s; @@ -229,6 +230,10 @@ Status ReadProperties(const Slice& handle_value, RandomAccessFileReader* file, {TablePropertiesNames::kNumDataBlocks, &new_table_properties->num_data_blocks}, {TablePropertiesNames::kNumEntries, &new_table_properties->num_entries}, + {TablePropertiesNames::kDeletedKeys, + &new_table_properties->num_deletions}, + {TablePropertiesNames::kMergeOperands, + &new_table_properties->num_merge_operands}, {TablePropertiesNames::kNumRangeDeletions, &new_table_properties->num_range_deletions}, {TablePropertiesNames::kFormatVersion, @@ -263,6 +268,12 @@ Status ReadProperties(const Slice& handle_value, RandomAccessFileReader* file, {key, handle.offset() + iter.ValueOffset()}); if (pos != predefined_uint64_properties.end()) { + if (key == TablePropertiesNames::kDeletedKeys || + key == TablePropertiesNames::kMergeOperands) { + // Insert in user-collected properties for API backwards compatibility + new_table_properties->user_collected_properties.insert( + {key, raw_val.ToString()}); + } // handle predefined rocksdb properties uint64_t val; if (!GetVarint64(&raw_val, &val)) { @@ -305,9 +316,10 @@ Status ReadProperties(const Slice& handle_value, RandomAccessFileReader* file, Status ReadTableProperties(RandomAccessFileReader* file, uint64_t file_size, uint64_t table_magic_number, - const ImmutableCFOptions &ioptions, + const ImmutableCFOptions& ioptions, TableProperties** properties, - bool compression_type_missing) { + bool compression_type_missing, + MemoryAllocator* memory_allocator) { // -- Read metaindex block Footer footer; auto s = ReadFooterFromFile(file, nullptr /* prefetch_buffer */, file_size, @@ -323,19 +335,17 @@ Status ReadTableProperties(RandomAccessFileReader* file, uint64_t file_size, Slice compression_dict; PersistentCacheOptions cache_options; - BlockFetcher block_fetcher( - file, nullptr /* prefetch_buffer */, footer, read_options, - metaindex_handle, &metaindex_contents, ioptions, false /* decompress */, - compression_dict, cache_options); + BlockFetcher block_fetcher(file, nullptr /* prefetch_buffer */, footer, + read_options, metaindex_handle, + &metaindex_contents, ioptions, + false /* decompress */, false /*maybe_compressed*/, + compression_dict, cache_options, memory_allocator); s = block_fetcher.ReadBlockContents(); if (!s.ok()) { return s; } - // override compression_type when table file is known to contain undefined - // value at compression type marker - if (compression_type_missing) { - metaindex_contents.compression_type = kNoCompression; - } + // property blocks are never compressed. Need to add uncompress logic if we + // are to compress it. 
Block metaindex_block(std::move(metaindex_contents), kDisableGlobalSequenceNumber); std::unique_ptr meta_iter( @@ -352,7 +362,8 @@ Status ReadTableProperties(RandomAccessFileReader* file, uint64_t file_size, TableProperties table_properties; if (found_properties_block == true) { s = ReadProperties(meta_iter->value(), file, nullptr /* prefetch_buffer */, - footer, ioptions, properties, compression_type_missing); + footer, ioptions, properties, compression_type_missing, + memory_allocator); } else { s = Status::NotFound(); } @@ -375,10 +386,11 @@ Status FindMetaBlock(InternalIterator* meta_index_iter, Status FindMetaBlock(RandomAccessFileReader* file, uint64_t file_size, uint64_t table_magic_number, - const ImmutableCFOptions &ioptions, + const ImmutableCFOptions& ioptions, const std::string& meta_block_name, BlockHandle* block_handle, - bool compression_type_missing) { + bool /*compression_type_missing*/, + MemoryAllocator* memory_allocator) { Footer footer; auto s = ReadFooterFromFile(file, nullptr /* prefetch_buffer */, file_size, &footer, table_magic_number); @@ -395,16 +407,14 @@ Status FindMetaBlock(RandomAccessFileReader* file, uint64_t file_size, BlockFetcher block_fetcher( file, nullptr /* prefetch_buffer */, footer, read_options, metaindex_handle, &metaindex_contents, ioptions, - false /* do decompression */, compression_dict, cache_options); + false /* do decompression */, false /*maybe_compressed*/, + compression_dict, cache_options, memory_allocator); s = block_fetcher.ReadBlockContents(); if (!s.ok()) { return s; } - // override compression_type when table file is known to contain undefined - // value at compression type marker - if (compression_type_missing) { - metaindex_contents.compression_type = kNoCompression; - } + // meta blocks are never compressed. Need to add uncompress logic if we are to + // compress it. Block metaindex_block(std::move(metaindex_contents), kDisableGlobalSequenceNumber); @@ -420,7 +430,8 @@ Status ReadMetaBlock(RandomAccessFileReader* file, uint64_t table_magic_number, const ImmutableCFOptions& ioptions, const std::string& meta_block_name, - BlockContents* contents, bool compression_type_missing) { + BlockContents* contents, bool /*compression_type_missing*/, + MemoryAllocator* memory_allocator) { Status status; Footer footer; status = ReadFooterFromFile(file, prefetch_buffer, file_size, &footer, @@ -439,17 +450,14 @@ Status ReadMetaBlock(RandomAccessFileReader* file, BlockFetcher block_fetcher(file, prefetch_buffer, footer, read_options, metaindex_handle, &metaindex_contents, ioptions, - false /* decompress */, compression_dict, - cache_options); + false /* decompress */, false /*maybe_compressed*/, + compression_dict, cache_options, memory_allocator); status = block_fetcher.ReadBlockContents(); if (!status.ok()) { return status; } - // override compression_type when table file is known to contain undefined - // value at compression type marker - if (compression_type_missing) { - metaindex_contents.compression_type = kNoCompression; - } + // meta block is never compressed. Need to add uncompress logic if we are to + // compress it. 
// Finding metablock Block metaindex_block(std::move(metaindex_contents), @@ -469,7 +477,8 @@ Status ReadMetaBlock(RandomAccessFileReader* file, // Reading metablock BlockFetcher block_fetcher2( file, prefetch_buffer, footer, read_options, block_handle, contents, - ioptions, false /* decompress */, compression_dict, cache_options); + ioptions, false /* decompress */, false /*maybe_compressed*/, + compression_dict, cache_options, memory_allocator); return block_fetcher2.ReadBlockContents(); } diff --git a/table/meta_blocks.h b/table/meta_blocks.h index a18c8edc47c..1c8fe686ca8 100644 --- a/table/meta_blocks.h +++ b/table/meta_blocks.h @@ -11,12 +11,13 @@ #include "db/builder.h" #include "db/table_properties_collector.h" -#include "util/kv_map.h" #include "rocksdb/comparator.h" +#include "rocksdb/memory_allocator.h" #include "rocksdb/options.h" #include "rocksdb/slice.h" #include "table/block_builder.h" #include "table/format.h" +#include "util/kv_map.h" namespace rocksdb { @@ -96,7 +97,8 @@ Status ReadProperties(const Slice& handle_value, RandomAccessFileReader* file, FilePrefetchBuffer* prefetch_buffer, const Footer& footer, const ImmutableCFOptions& ioptions, TableProperties** table_properties, - bool compression_type_missing = false); + bool compression_type_missing = false, + MemoryAllocator* memory_allocator = nullptr); // Directly read the properties from the properties block of a plain table. // @returns a status to indicate if the operation succeeded. On success, @@ -108,9 +110,10 @@ Status ReadProperties(const Slice& handle_value, RandomAccessFileReader* file, // `ReadProperties`, `FindMetaBlock`, and `ReadMetaBlock` Status ReadTableProperties(RandomAccessFileReader* file, uint64_t file_size, uint64_t table_magic_number, - const ImmutableCFOptions &ioptions, + const ImmutableCFOptions& ioptions, TableProperties** properties, - bool compression_type_missing = false); + bool compression_type_missing = false, + MemoryAllocator* memory_allocator = nullptr); // Find the meta block from the meta index block. Status FindMetaBlock(InternalIterator* meta_index_iter, @@ -120,10 +123,11 @@ Status FindMetaBlock(InternalIterator* meta_index_iter, // Find the meta block Status FindMetaBlock(RandomAccessFileReader* file, uint64_t file_size, uint64_t table_magic_number, - const ImmutableCFOptions &ioptions, + const ImmutableCFOptions& ioptions, const std::string& meta_block_name, BlockHandle* block_handle, - bool compression_type_missing = false); + bool compression_type_missing = false, + MemoryAllocator* memory_allocator = nullptr); // Read the specified meta block with name meta_block_name // from `file` and initialize `contents` with contents of this block. 
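The meta_blocks.{h,cc} hunks above and below thread an optional `MemoryAllocator*` through the meta-block readers (`ReadProperties`, `ReadTableProperties`, `FindMetaBlock`, `ReadMetaBlock`), so property and meta block buffers can come from a caller-supplied allocator instead of a plain `new char[]`. A minimal caller-side sketch, assuming the `MemoryAllocator` interface in `rocksdb/memory_allocator.h` exposes `Name`/`Allocate`/`Deallocate` as it does around this change; `CountingAllocator` and `ReadPropsWithAllocator` are illustrative names, not part of the patch:

```cpp
#include <atomic>
#include <new>

#include "rocksdb/memory_allocator.h"
#include "table/meta_blocks.h"

namespace {

// Illustrative allocator that just tracks how many bytes were handed out for
// block buffers; not part of this patch.
class CountingAllocator : public rocksdb::MemoryAllocator {
 public:
  const char* Name() const override { return "CountingAllocator"; }

  void* Allocate(size_t size) override {
    allocated_.fetch_add(size, std::memory_order_relaxed);
    return ::operator new(size);
  }

  void Deallocate(void* p) override { ::operator delete(p); }

  size_t allocated() const {
    return allocated_.load(std::memory_order_relaxed);
  }

 private:
  std::atomic<size_t> allocated_{0};
};

// Hypothetical call site: the allocator goes in the new trailing parameter of
// ReadTableProperties added by this diff.
rocksdb::Status ReadPropsWithAllocator(
    rocksdb::RandomAccessFileReader* file, uint64_t file_size,
    uint64_t table_magic_number, const rocksdb::ImmutableCFOptions& ioptions,
    rocksdb::TableProperties** props, CountingAllocator* allocator) {
  return rocksdb::ReadTableProperties(file, file_size, table_magic_number,
                                      ioptions, props,
                                      false /* compression_type_missing */,
                                      allocator);
}

}  // namespace
```

Passing `nullptr` (the default) keeps the previous `new[]`-backed behavior, which is why existing call sites that do not care about the allocator compile unchanged.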
@@ -134,6 +138,7 @@ Status ReadMetaBlock(RandomAccessFileReader* file, const ImmutableCFOptions& ioptions, const std::string& meta_block_name, BlockContents* contents, - bool compression_type_missing = false); + bool compression_type_missing = false, + MemoryAllocator* memory_allocator = nullptr); } // namespace rocksdb diff --git a/table/mock_table.cc b/table/mock_table.cc index a5473b30bc8..65a43616969 100644 --- a/table/mock_table.cc +++ b/table/mock_table.cc @@ -60,8 +60,8 @@ MockTableFactory::MockTableFactory() : next_id_(1) {} Status MockTableFactory::NewTableReader( const TableReaderOptions& /*table_reader_options*/, - unique_ptr&& file, uint64_t /*file_size*/, - unique_ptr* table_reader, + std::unique_ptr&& file, uint64_t /*file_size*/, + std::unique_ptr* table_reader, bool /*prefetch_index_and_filter_in_cache*/) const { uint32_t id = GetIDFromFile(file.get()); diff --git a/table/mock_table.h b/table/mock_table.h index 92cf87370ff..2f123a963cd 100644 --- a/table/mock_table.h +++ b/table/mock_table.h @@ -157,8 +157,8 @@ class MockTableFactory : public TableFactory { const char* Name() const override { return "MockTable"; } Status NewTableReader( const TableReaderOptions& table_reader_options, - unique_ptr&& file, uint64_t file_size, - unique_ptr* table_reader, + std::unique_ptr&& file, uint64_t file_size, + std::unique_ptr* table_reader, bool prefetch_index_and_filter_in_cache = true) const override; TableBuilder* NewTableBuilder( const TableBuilderOptions& table_builder_options, diff --git a/table/partitioned_filter_block_test.cc b/table/partitioned_filter_block_test.cc index 0b11c0df2a7..ffa8a9a5630 100644 --- a/table/partitioned_filter_block_test.cc +++ b/table/partitioned_filter_block_test.cc @@ -33,7 +33,7 @@ class MockedBlockBasedTable : public BlockBasedTable { const SliceTransform* prefix_extractor) const override { Slice slice = slices[filter_blk_handle.offset()]; auto obj = new FullFilterBlockReader( - prefix_extractor, true, BlockContents(slice, false, kNoCompression), + prefix_extractor, true, BlockContents(slice), rep_->table_options.filter_policy->GetFilterBitsReader(slice), nullptr); return {obj, nullptr}; } @@ -44,7 +44,7 @@ class MockedBlockBasedTable : public BlockBasedTable { const SliceTransform* prefix_extractor) const override { Slice slice = slices[filter_blk_handle.offset()]; auto obj = new FullFilterBlockReader( - prefix_extractor, true, BlockContents(slice, false, kNoCompression), + prefix_extractor, true, BlockContents(slice), rep_->table_options.filter_policy->GetFilterBitsReader(slice), nullptr); return obj; } @@ -147,10 +147,10 @@ class PartitionedFilterBlockTest const bool kImmortal = true; table.reset(new MockedBlockBasedTable( new BlockBasedTable::Rep(ioptions, env_options, table_options_, icomp, - !kSkipFilters, !kImmortal))); + !kSkipFilters, 0, !kImmortal))); auto reader = new PartitionedFilterBlockReader( - prefix_extractor, true, BlockContents(slice, false, kNoCompression), - nullptr, nullptr, icomp, table.get(), pib->seperator_is_key_plus_seq(), + prefix_extractor, true, BlockContents(slice), nullptr, nullptr, icomp, + table.get(), pib->seperator_is_key_plus_seq(), !pib->get_use_value_delta_encoding()); return reader; } diff --git a/table/persistent_cache_helper.cc b/table/persistent_cache_helper.cc index 103f57c80ac..4e90697a6e5 100644 --- a/table/persistent_cache_helper.cc +++ b/table/persistent_cache_helper.cc @@ -29,12 +29,9 @@ void PersistentCacheHelper::InsertUncompressedPage( const BlockContents& contents) { 
assert(cache_options.persistent_cache); assert(!cache_options.persistent_cache->IsCompressed()); - if (!contents.cachable || contents.compression_type != kNoCompression) { - // We shouldn't cache this. Either - // (1) content is not cacheable - // (2) content is compressed - return; - } + // Precondition: + // (1) content is cacheable + // (2) content is not compressed // construct the page key char cache_key[BlockBasedTable::kMaxCacheKeyPrefixSize + kMaxVarint64Length]; @@ -109,8 +106,7 @@ Status PersistentCacheHelper::LookupUncompressedPage( // update stats RecordTick(cache_options.statistics, PERSISTENT_CACHE_HIT); // construct result and return - *contents = - BlockContents(std::move(data), size, false /*cacheable*/, kNoCompression); + *contents = BlockContents(std::move(data), size); return Status::OK(); } diff --git a/table/plain_table_builder.cc b/table/plain_table_builder.cc index 717635cc1a9..453b6c768b5 100644 --- a/table/plain_table_builder.cc +++ b/table/plain_table_builder.cc @@ -166,6 +166,12 @@ void PlainTableBuilder::Add(const Slice& key, const Slice& value) { properties_.num_entries++; properties_.raw_key_size += key.size(); properties_.raw_value_size += value.size(); + if (internal_key.type == kTypeDeletion || + internal_key.type == kTypeSingleDeletion) { + properties_.num_deletions++; + } else if (internal_key.type == kTypeMerge) { + properties_.num_merge_operands++; + } // notify property collectors NotifyCollectTableCollectorsOnAdd( diff --git a/table/plain_table_factory.cc b/table/plain_table_factory.cc index b88a689d4b0..273a1bd4f2f 100644 --- a/table/plain_table_factory.cc +++ b/table/plain_table_factory.cc @@ -19,8 +19,8 @@ namespace rocksdb { Status PlainTableFactory::NewTableReader( const TableReaderOptions& table_reader_options, - unique_ptr&& file, uint64_t file_size, - unique_ptr* table, + std::unique_ptr&& file, uint64_t file_size, + std::unique_ptr* table, bool /*prefetch_index_and_filter_in_cache*/) const { return PlainTableReader::Open( table_reader_options.ioptions, table_reader_options.env_options, diff --git a/table/plain_table_factory.h b/table/plain_table_factory.h index f540a92b89d..157e3acda01 100644 --- a/table/plain_table_factory.h +++ b/table/plain_table_factory.h @@ -149,8 +149,8 @@ class PlainTableFactory : public TableFactory { const char* Name() const override { return "PlainTable"; } Status NewTableReader(const TableReaderOptions& table_reader_options, - unique_ptr&& file, - uint64_t file_size, unique_ptr* table, + std::unique_ptr&& file, + uint64_t file_size, std::unique_ptr* table, bool prefetch_index_and_filter_in_cache) const override; TableBuilder* NewTableBuilder( diff --git a/table/plain_table_key_coding.h b/table/plain_table_key_coding.h index 321e0aed594..9a27ad06b78 100644 --- a/table/plain_table_key_coding.h +++ b/table/plain_table_key_coding.h @@ -114,7 +114,7 @@ class PlainTableFileReader { }; // Keep buffers for two recent reads. 
- std::array, 2> buffers_; + std::array, 2> buffers_; uint32_t num_buf_; Status status_; diff --git a/table/plain_table_reader.cc b/table/plain_table_reader.cc index 4f6c99f94af..ae656763cbb 100644 --- a/table/plain_table_reader.cc +++ b/table/plain_table_reader.cc @@ -91,14 +91,13 @@ class PlainTableIterator : public InternalIterator { }; extern const uint64_t kPlainTableMagicNumber; -PlainTableReader::PlainTableReader(const ImmutableCFOptions& ioptions, - unique_ptr&& file, - const EnvOptions& storage_options, - const InternalKeyComparator& icomparator, - EncodingType encoding_type, - uint64_t file_size, - const TableProperties* table_properties, - const SliceTransform* prefix_extractor) +PlainTableReader::PlainTableReader( + const ImmutableCFOptions& ioptions, + std::unique_ptr&& file, + const EnvOptions& storage_options, const InternalKeyComparator& icomparator, + EncodingType encoding_type, uint64_t file_size, + const TableProperties* table_properties, + const SliceTransform* prefix_extractor) : internal_comparator_(icomparator), encoding_type_(encoding_type), full_scan_mode_(false), @@ -118,8 +117,8 @@ PlainTableReader::~PlainTableReader() { Status PlainTableReader::Open( const ImmutableCFOptions& ioptions, const EnvOptions& env_options, const InternalKeyComparator& internal_comparator, - unique_ptr&& file, uint64_t file_size, - unique_ptr* table_reader, const int bloom_bits_per_key, + std::unique_ptr&& file, uint64_t file_size, + std::unique_ptr* table_reader, const int bloom_bits_per_key, double hash_table_ratio, size_t index_sparseness, size_t huge_page_tlb_size, bool full_scan_mode, const SliceTransform* prefix_extractor) { if (file_size > PlainTableIndex::kMaxFileSize) { diff --git a/table/plain_table_reader.h b/table/plain_table_reader.h index df08a98fa17..5f8248dd717 100644 --- a/table/plain_table_reader.h +++ b/table/plain_table_reader.h @@ -48,7 +48,7 @@ struct PlainTableReaderFileInfo { bool is_mmap_mode; Slice file_data; uint32_t data_end_offset; - unique_ptr file; + std::unique_ptr file; PlainTableReaderFileInfo(unique_ptr&& _file, const EnvOptions& storage_options, @@ -71,8 +71,8 @@ class PlainTableReader: public TableReader { static Status Open(const ImmutableCFOptions& ioptions, const EnvOptions& env_options, const InternalKeyComparator& internal_comparator, - unique_ptr&& file, - uint64_t file_size, unique_ptr* table, + std::unique_ptr&& file, + uint64_t file_size, std::unique_ptr* table, const int bloom_bits_per_key, double hash_table_ratio, size_t index_sparseness, size_t huge_page_tlb_size, bool full_scan_mode, @@ -104,7 +104,7 @@ class PlainTableReader: public TableReader { } PlainTableReader(const ImmutableCFOptions& ioptions, - unique_ptr&& file, + std::unique_ptr&& file, const EnvOptions& env_options, const InternalKeyComparator& internal_comparator, EncodingType encoding_type, uint64_t file_size, @@ -153,8 +153,8 @@ class PlainTableReader: public TableReader { DynamicBloom bloom_; PlainTableReaderFileInfo file_info_; Arena arena_; - std::unique_ptr index_block_alloc_; - std::unique_ptr bloom_block_alloc_; + CacheAllocationPtr index_block_alloc_; + CacheAllocationPtr bloom_block_alloc_; const ImmutableCFOptions& ioptions_; uint64_t file_size_; diff --git a/table/sst_file_reader.cc b/table/sst_file_reader.cc new file mode 100644 index 00000000000..a915449bee0 --- /dev/null +++ b/table/sst_file_reader.cc @@ -0,0 +1,84 @@ +// Copyright (c) 2011-present, Facebook, Inc. All rights reserved. 
+// This source code is licensed under both the GPLv2 (found in the +// COPYING file in the root directory) and Apache 2.0 License +// (found in the LICENSE.Apache file in the root directory). + +#ifndef ROCKSDB_LITE + +#include "rocksdb/sst_file_reader.h" + +#include "db/db_iter.h" +#include "options/cf_options.h" +#include "table/get_context.h" +#include "table/table_reader.h" +#include "table/table_builder.h" +#include "util/file_reader_writer.h" + +namespace rocksdb { + +struct SstFileReader::Rep { + Options options; + EnvOptions soptions; + ImmutableCFOptions ioptions; + MutableCFOptions moptions; + + std::unique_ptr table_reader; + + Rep(const Options& opts) + : options(opts), + soptions(options), + ioptions(options), + moptions(ColumnFamilyOptions(options)) {} +}; + +SstFileReader::SstFileReader(const Options& options) + : rep_(new Rep(options)) {} + +SstFileReader::~SstFileReader() {} + +Status SstFileReader::Open(const std::string& file_path) { + auto r = rep_.get(); + Status s; + uint64_t file_size = 0; + std::unique_ptr file; + std::unique_ptr file_reader; + s = r->options.env->GetFileSize(file_path, &file_size); + if (s.ok()) { + s = r->options.env->NewRandomAccessFile(file_path, &file, r->soptions); + } + if (s.ok()) { + file_reader.reset(new RandomAccessFileReader(std::move(file), file_path)); + } + if (s.ok()) { + s = r->options.table_factory->NewTableReader( + TableReaderOptions(r->ioptions, r->moptions.prefix_extractor.get(), + r->soptions, r->ioptions.internal_comparator), + std::move(file_reader), file_size, &r->table_reader); + } + return s; +} + +Iterator* SstFileReader::NewIterator(const ReadOptions& options) { + auto r = rep_.get(); + auto sequence = options.snapshot != nullptr ? + options.snapshot->GetSequenceNumber() : + kMaxSequenceNumber; + auto internal_iter = r->table_reader->NewIterator( + options, r->moptions.prefix_extractor.get()); + return NewDBIterator(r->options.env, options, r->ioptions, r->moptions, + r->ioptions.user_comparator, internal_iter, sequence, + r->moptions.max_sequential_skip_in_iterations, + nullptr /* read_callback */); +} + +std::shared_ptr SstFileReader::GetTableProperties() const { + return rep_->table_reader->GetTableProperties(); +} + +Status SstFileReader::VerifyChecksum() { + return rep_->table_reader->VerifyChecksum(); +} + +} // namespace rocksdb + +#endif // !ROCKSDB_LITE diff --git a/table/sst_file_reader_test.cc b/table/sst_file_reader_test.cc new file mode 100644 index 00000000000..8da366fd7cc --- /dev/null +++ b/table/sst_file_reader_test.cc @@ -0,0 +1,106 @@ +// Copyright (c) 2011-present, Facebook, Inc. All rights reserved. +// This source code is licensed under both the GPLv2 (found in the +// COPYING file in the root directory) and Apache 2.0 License +// (found in the LICENSE.Apache file in the root directory). 
+ +#ifndef ROCKSDB_LITE + +#include + +#include "rocksdb/sst_file_reader.h" +#include "rocksdb/sst_file_writer.h" +#include "util/testharness.h" +#include "util/testutil.h" +#include "utilities/merge_operators.h" + +namespace rocksdb { + +std::string EncodeAsString(uint64_t v) { + char buf[16]; + snprintf(buf, sizeof(buf), "%08" PRIu64, v); + return std::string(buf); +} + +std::string EncodeAsUint64(uint64_t v) { + std::string dst; + PutFixed64(&dst, v); + return dst; +} + +class SstFileReaderTest : public testing::Test { + public: + SstFileReaderTest() { + options_.merge_operator = MergeOperators::CreateUInt64AddOperator(); + sst_name_ = test::PerThreadDBPath("sst_file"); + } + + void CreateFileAndCheck(const std::vector& keys) { + SstFileWriter writer(soptions_, options_); + ASSERT_OK(writer.Open(sst_name_)); + for (size_t i = 0; i + 2 < keys.size(); i += 3) { + ASSERT_OK(writer.Put(keys[i], keys[i])); + ASSERT_OK(writer.Merge(keys[i+1], EncodeAsUint64(i+1))); + ASSERT_OK(writer.Delete(keys[i+2])); + } + ASSERT_OK(writer.Finish()); + + ReadOptions ropts; + SstFileReader reader(options_); + ASSERT_OK(reader.Open(sst_name_)); + ASSERT_OK(reader.VerifyChecksum()); + std::unique_ptr iter(reader.NewIterator(ropts)); + iter->SeekToFirst(); + for (size_t i = 0; i + 2 < keys.size(); i += 3) { + ASSERT_TRUE(iter->Valid()); + ASSERT_EQ(iter->key().compare(keys[i]), 0); + ASSERT_EQ(iter->value().compare(keys[i]), 0); + iter->Next(); + ASSERT_TRUE(iter->Valid()); + ASSERT_EQ(iter->key().compare(keys[i+1]), 0); + ASSERT_EQ(iter->value().compare(EncodeAsUint64(i+1)), 0); + iter->Next(); + } + ASSERT_FALSE(iter->Valid()); + } + + protected: + Options options_; + EnvOptions soptions_; + std::string sst_name_; +}; + +const uint64_t kNumKeys = 100; + +TEST_F(SstFileReaderTest, Basic) { + std::vector keys; + for (uint64_t i = 0; i < kNumKeys; i++) { + keys.emplace_back(EncodeAsString(i)); + } + CreateFileAndCheck(keys); +} + +TEST_F(SstFileReaderTest, Uint64Comparator) { + options_.comparator = test::Uint64Comparator(); + std::vector keys; + for (uint64_t i = 0; i < kNumKeys; i++) { + keys.emplace_back(EncodeAsUint64(i)); + } + CreateFileAndCheck(keys); +} + +} // namespace rocksdb + +int main(int argc, char** argv) { + ::testing::InitGoogleTest(&argc, argv); + return RUN_ALL_TESTS(); +} + +#else +#include + +int main(int /*argc*/, char** /*argv*/) { + fprintf(stderr, "SKIPPED as SstFileReader is not supported in ROCKSDB_LITE\n"); + return 0; +} + +#endif // ROCKSDB_LITE diff --git a/table/sst_file_writer.cc b/table/sst_file_writer.cc index e0c4c31896b..a752504c8f6 100644 --- a/table/sst_file_writer.cc +++ b/table/sst_file_writer.cc @@ -238,7 +238,8 @@ Status SstFileWriter::Open(const std::string& file_path) { nullptr /* compression_dict */, r->skip_filters, r->column_family_name, unknown_level); r->file_writer.reset( - new WritableFileWriter(std::move(sst_file), file_path, r->env_options)); + new WritableFileWriter(std::move(sst_file), file_path, r->env_options, + nullptr /* stats */, r->ioptions.listeners)); // TODO(tec) : If table_factory is using compressed block cache, we will // be adding the external sst file blocks into it, which is wasteful. 
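The new `SstFileReader` added above (`table/sst_file_reader.cc` plus its test) exposes standalone read access to an SST file, mirroring `SstFileWriter` on the write side. A minimal round-trip sketch using only the API shown in this diff (`Open`, `VerifyChecksum`, `GetTableProperties`, `NewIterator`); the path and keys are placeholders and error handling is reduced to early returns:

```cpp
#include <cstdio>
#include <memory>
#include <string>

#include "rocksdb/iterator.h"
#include "rocksdb/options.h"
#include "rocksdb/sst_file_reader.h"
#include "rocksdb/sst_file_writer.h"
#include "rocksdb/table_properties.h"

// Write a small SST with SstFileWriter, then inspect it with the new
// SstFileReader. "/tmp/example.sst" is a placeholder path.
int main() {
  rocksdb::Options options;
  rocksdb::EnvOptions env_options;
  const std::string path = "/tmp/example.sst";

  rocksdb::SstFileWriter writer(env_options, options);
  rocksdb::Status s = writer.Open(path);
  if (s.ok()) s = writer.Put("key1", "value1");
  if (s.ok()) s = writer.Put("key2", "value2");  // keys must be added in order
  if (s.ok()) s = writer.Finish();
  if (!s.ok()) {
    std::fprintf(stderr, "write failed: %s\n", s.ToString().c_str());
    return 1;
  }

  rocksdb::SstFileReader reader(options);
  s = reader.Open(path);
  if (s.ok()) s = reader.VerifyChecksum();
  if (!s.ok()) {
    std::fprintf(stderr, "read failed: %s\n", s.ToString().c_str());
    return 1;
  }

  // Table properties now also carry the deletion / merge-operand counters
  // introduced elsewhere in this diff.
  std::shared_ptr<const rocksdb::TableProperties> props =
      reader.GetTableProperties();
  std::printf("entries=%llu deletions=%llu\n",
              static_cast<unsigned long long>(props->num_entries),
              static_cast<unsigned long long>(props->num_deletions));

  std::unique_ptr<rocksdb::Iterator> iter(
      reader.NewIterator(rocksdb::ReadOptions()));
  for (iter->SeekToFirst(); iter->Valid(); iter->Next()) {
    std::printf("%s -> %s\n", iter->key().ToString().c_str(),
                iter->value().ToString().c_str());
  }
  return iter->status().ok() ? 0 : 1;
}
```

The accompanying `table/sst_file_reader_test.cc` exercises the same sequence, additionally covering `Merge` and `Delete` entries and a `Uint64Comparator`.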
diff --git a/table/table_properties.cc b/table/table_properties.cc index 207a6419119..56e1d03f1f7 100644 --- a/table/table_properties.cc +++ b/table/table_properties.cc @@ -78,6 +78,9 @@ std::string TableProperties::ToString( AppendProperty(result, "# data blocks", num_data_blocks, prop_delim, kv_delim); AppendProperty(result, "# entries", num_entries, prop_delim, kv_delim); + AppendProperty(result, "# deletions", num_deletions, prop_delim, kv_delim); + AppendProperty(result, "# merge operands", num_merge_operands, prop_delim, + kv_delim); AppendProperty(result, "# range deletions", num_range_deletions, prop_delim, kv_delim); @@ -170,6 +173,8 @@ void TableProperties::Add(const TableProperties& tp) { raw_value_size += tp.raw_value_size; num_data_blocks += tp.num_data_blocks; num_entries += tp.num_entries; + num_deletions += tp.num_deletions; + num_merge_operands += tp.num_merge_operands; num_range_deletions += tp.num_range_deletions; } @@ -195,6 +200,9 @@ const std::string TablePropertiesNames::kNumDataBlocks = "rocksdb.num.data.blocks"; const std::string TablePropertiesNames::kNumEntries = "rocksdb.num.entries"; +const std::string TablePropertiesNames::kDeletedKeys = "rocksdb.deleted.keys"; +const std::string TablePropertiesNames::kMergeOperands = + "rocksdb.merge.operands"; const std::string TablePropertiesNames::kNumRangeDeletions = "rocksdb.num.range-deletions"; const std::string TablePropertiesNames::kFilterPolicy = diff --git a/table/table_reader.h b/table/table_reader.h index 505b5ba1fb8..a5f15e13044 100644 --- a/table/table_reader.h +++ b/table/table_reader.h @@ -9,6 +9,7 @@ #pragma once #include +#include "db/range_tombstone_fragmenter.h" #include "rocksdb/slice_transform.h" #include "table/internal_iterator.h" @@ -44,7 +45,7 @@ class TableReader { bool skip_filters = false, bool for_compaction = false) = 0; - virtual InternalIterator* NewRangeTombstoneIterator( + virtual FragmentedRangeTombstoneIterator* NewRangeTombstoneIterator( const ReadOptions& /*read_options*/) { return nullptr; } diff --git a/table/table_reader_bench.cc b/table/table_reader_bench.cc index 4032c4a5a1e..fbcfac826c8 100644 --- a/table/table_reader_bench.cc +++ b/table/table_reader_bench.cc @@ -86,9 +86,9 @@ void TableReaderBenchmark(Options& opts, EnvOptions& env_options, const ImmutableCFOptions ioptions(opts); const ColumnFamilyOptions cfo(opts); const MutableCFOptions moptions(cfo); - unique_ptr file_writer; + std::unique_ptr file_writer; if (!through_db) { - unique_ptr file; + std::unique_ptr file; env->NewWritableFile(file_name, &file, env_options); std::vector > @@ -127,9 +127,9 @@ void TableReaderBenchmark(Options& opts, EnvOptions& env_options, db->Flush(FlushOptions()); } - unique_ptr table_reader; + std::unique_ptr table_reader; if (!through_db) { - unique_ptr raf; + std::unique_ptr raf; s = env->NewRandomAccessFile(file_name, &raf, env_options); if (!s.ok()) { fprintf(stderr, "Create File Error: %s\n", s.ToString().c_str()); @@ -137,7 +137,7 @@ void TableReaderBenchmark(Options& opts, EnvOptions& env_options, } uint64_t file_size; env->GetFileSize(file_name, &file_size); - unique_ptr file_reader( + std::unique_ptr file_reader( new RandomAccessFileReader(std::move(raf), file_name)); s = opts.table_factory->NewTableReader( TableReaderOptions(ioptions, moptions.prefix_extractor.get(), @@ -170,12 +170,12 @@ void TableReaderBenchmark(Options& opts, EnvOptions& env_options, if (!through_db) { PinnableSlice value; MergeContext merge_context; - RangeDelAggregator range_del_agg(ikc, {} /* snapshots */); + 
SequenceNumber max_covering_tombstone_seq = 0; GetContext get_context(ioptions.user_comparator, ioptions.merge_operator, ioptions.info_log, ioptions.statistics, GetContext::kNotFound, Slice(key), &value, nullptr, &merge_context, - &range_del_agg, env); + &max_covering_tombstone_seq, env); s = table_reader->Get(read_options, key, &get_context, nullptr); } else { s = db->Get(read_options, key, &result); diff --git a/table/table_test.cc b/table/table_test.cc index 26383fa8179..5ec613bec44 100644 --- a/table/table_test.cc +++ b/table/table_test.cc @@ -232,7 +232,6 @@ class BlockConstructor: public Constructor { data_ = builder.Finish().ToString(); BlockContents contents; contents.data = data_; - contents.cachable = false; block_ = new Block(std::move(contents), kDisableGlobalSequenceNumber); return Status::OK(); } @@ -325,7 +324,7 @@ class TableConstructor: public Constructor { soptions.use_mmap_reads = ioptions.allow_mmap_reads; file_writer_.reset(test::GetWritableFileWriter(new test::StringSink(), "" /* don't care */)); - unique_ptr builder; + std::unique_ptr builder; std::vector> int_tbl_prop_collector_factories; std::string column_family_name; @@ -423,9 +422,9 @@ class TableConstructor: public Constructor { } uint64_t uniq_id_; - unique_ptr file_writer_; - unique_ptr file_reader_; - unique_ptr table_reader_; + std::unique_ptr file_writer_; + std::unique_ptr file_reader_; + std::unique_ptr table_reader_; bool convert_to_internal_key_; int level_; @@ -508,7 +507,7 @@ class InternalIteratorFromIterator : public InternalIterator { virtual Status status() const override { return it_->status(); } private: - unique_ptr it_; + std::unique_ptr it_; }; class DBConstructor: public Constructor { @@ -1024,7 +1023,7 @@ class HarnessTest : public testing::Test { WriteBufferManager write_buffer_; bool support_prev_; bool only_support_prefix_seek_; - shared_ptr internal_comparator_; + std::shared_ptr internal_comparator_; }; static bool Between(uint64_t val, uint64_t low, uint64_t high) { @@ -1278,6 +1277,13 @@ TEST_P(BlockBasedTableTest, RangeDelBlock) { std::vector keys = {"1pika", "2chu"}; std::vector vals = {"p", "c"}; + std::vector expected_tombstones = { + {"1pika", "2chu", 0}, + {"2chu", "c", 1}, + {"2chu", "c", 0}, + {"c", "p", 0}, + }; + for (int i = 0; i < 2; i++) { RangeTombstone t(keys[i], vals[i], i); std::pair p = t.Serialize(); @@ -1310,14 +1316,15 @@ TEST_P(BlockBasedTableTest, RangeDelBlock) { ASSERT_FALSE(iter->Valid()); iter->SeekToFirst(); ASSERT_TRUE(iter->Valid()); - for (int i = 0; i < 2; i++) { + for (size_t i = 0; i < expected_tombstones.size(); i++) { ASSERT_TRUE(iter->Valid()); ParsedInternalKey parsed_key; ASSERT_TRUE(ParseInternalKey(iter->key(), &parsed_key)); RangeTombstone t(parsed_key, iter->value()); - ASSERT_EQ(t.start_key_, keys[i]); - ASSERT_EQ(t.end_key_, vals[i]); - ASSERT_EQ(t.seq_, i); + const auto& expected_t = expected_tombstones[i]; + ASSERT_EQ(t.start_key_, expected_t.start_key_); + ASSERT_EQ(t.end_key_, expected_t.end_key_); + ASSERT_EQ(t.seq_, expected_t.seq_); iter->Next(); } ASSERT_TRUE(!iter->Valid()); @@ -1385,8 +1392,8 @@ void PrefetchRange(TableConstructor* c, Options* opt, // prefetch auto* table_reader = dynamic_cast(c->GetTableReader()); Status s; - unique_ptr begin, end; - unique_ptr i_begin, i_end; + std::unique_ptr begin, end; + std::unique_ptr i_begin, i_end; if (key_begin != nullptr) { if (c->ConvertToInternalKey()) { i_begin.reset(new InternalKey(key_begin, kMaxSequenceNumber, kTypeValue)); @@ -1417,7 +1424,7 @@ TEST_P(BlockBasedTableTest, 
PrefetchTest) { // The purpose of this test is to test the prefetching operation built into // BlockBasedTable. Options opt; - unique_ptr ikc; + std::unique_ptr ikc; ikc.reset(new test::PlainInternalKeyComparator(opt.comparator)); opt.compression = kNoCompression; BlockBasedTableOptions table_options = GetBlockBasedTableOptions(); @@ -2009,7 +2016,7 @@ TEST_P(BlockBasedTableTest, FilterBlockInBlockCache) { // -- PART 1: Open with regular block cache. // Since block_cache is disabled, no cache activities will be involved. - unique_ptr iter; + std::unique_ptr iter; int64_t last_cache_bytes_read = 0; // At first, no block will be accessed. @@ -2343,7 +2350,7 @@ TEST_P(BlockBasedTableTest, NoObjectInCacheAfterTableClose) { } // Create a table Options opt; - unique_ptr ikc; + std::unique_ptr ikc; ikc.reset(new test::PlainInternalKeyComparator(opt.comparator)); opt.compression = kNoCompression; BlockBasedTableOptions table_options = @@ -2419,7 +2426,7 @@ TEST_P(BlockBasedTableTest, BlockCacheLeak) { // unique ID from the file. Options opt; - unique_ptr ikc; + std::unique_ptr ikc; ikc.reset(new test::PlainInternalKeyComparator(opt.comparator)); opt.compression = kNoCompression; BlockBasedTableOptions table_options = GetBlockBasedTableOptions(); @@ -2442,7 +2449,7 @@ TEST_P(BlockBasedTableTest, BlockCacheLeak) { const MutableCFOptions moptions(opt); c.Finish(opt, ioptions, moptions, table_options, *ikc, &keys, &kvmap); - unique_ptr iter( + std::unique_ptr iter( c.NewIterator(moptions.prefix_extractor.get())); iter->SeekToFirst(); while (iter->Valid()) { @@ -2477,6 +2484,78 @@ TEST_P(BlockBasedTableTest, BlockCacheLeak) { c.ResetTableReader(); } +namespace { +class CustomMemoryAllocator : public MemoryAllocator { + public: + virtual const char* Name() const override { return "CustomMemoryAllocator"; } + + void* Allocate(size_t size) override { + ++numAllocations; + auto ptr = new char[size + 16]; + memcpy(ptr, "memory_allocator_", 16); // mangle first 16 bytes + return reinterpret_cast(ptr + 16); + } + void Deallocate(void* p) override { + ++numDeallocations; + char* ptr = reinterpret_cast(p) - 16; + delete[] ptr; + } + + std::atomic numAllocations; + std::atomic numDeallocations; +}; +} // namespace + +TEST_P(BlockBasedTableTest, MemoryAllocator) { + auto custom_memory_allocator = std::make_shared(); + { + Options opt; + std::unique_ptr ikc; + ikc.reset(new test::PlainInternalKeyComparator(opt.comparator)); + opt.compression = kNoCompression; + BlockBasedTableOptions table_options; + table_options.block_size = 1024; + LRUCacheOptions lruOptions; + lruOptions.memory_allocator = custom_memory_allocator; + lruOptions.capacity = 16 * 1024 * 1024; + lruOptions.num_shard_bits = 4; + table_options.block_cache = NewLRUCache(std::move(lruOptions)); + opt.table_factory.reset(NewBlockBasedTableFactory(table_options)); + + TableConstructor c(BytewiseComparator(), + true /* convert_to_internal_key_ */); + c.Add("k01", "hello"); + c.Add("k02", "hello2"); + c.Add("k03", std::string(10000, 'x')); + c.Add("k04", std::string(200000, 'x')); + c.Add("k05", std::string(300000, 'x')); + c.Add("k06", "hello3"); + c.Add("k07", std::string(100000, 'x')); + std::vector keys; + stl_wrappers::KVMap kvmap; + const ImmutableCFOptions ioptions(opt); + const MutableCFOptions moptions(opt); + c.Finish(opt, ioptions, moptions, table_options, *ikc, &keys, &kvmap); + + std::unique_ptr iter( + c.NewIterator(moptions.prefix_extractor.get())); + iter->SeekToFirst(); + while (iter->Valid()) { + iter->key(); + iter->value(); + 
iter->Next(); + } + ASSERT_OK(iter->status()); + } + + // out of scope, block cache should have been deleted, all allocations + // deallocated + EXPECT_EQ(custom_memory_allocator->numAllocations.load(), + custom_memory_allocator->numDeallocations.load()); + // make sure that allocations actually happened through the cache allocator + EXPECT_GT(custom_memory_allocator->numAllocations.load(), 0); +} + TEST_P(BlockBasedTableTest, NewIndexIteratorLeak) { // A regression test to avoid data race described in // https://github.com/facebook/rocksdb/issues/1267 @@ -2550,7 +2629,7 @@ TEST_F(PlainTableTest, BasicPlainTableProperties) { PlainTableFactory factory(plain_table_options); test::StringSink sink; - unique_ptr file_writer( + std::unique_ptr file_writer( test::GetWritableFileWriter(new test::StringSink(), "" /* don't care */)); Options options; const ImmutableCFOptions ioptions(options); @@ -2579,7 +2658,7 @@ TEST_F(PlainTableTest, BasicPlainTableProperties) { test::StringSink* ss = static_cast(file_writer->writable_file()); - unique_ptr file_reader( + std::unique_ptr file_reader( test::GetRandomAccessFileReader( new test::StringSource(ss->contents(), 72242, true))); @@ -2658,9 +2737,9 @@ static void DoCompressionTest(CompressionType comp) { ASSERT_TRUE(Between(c.ApproximateOffsetOf("abc"), 0, 0)); ASSERT_TRUE(Between(c.ApproximateOffsetOf("k01"), 0, 0)); ASSERT_TRUE(Between(c.ApproximateOffsetOf("k02"), 0, 0)); - ASSERT_TRUE(Between(c.ApproximateOffsetOf("k03"), 2000, 3000)); - ASSERT_TRUE(Between(c.ApproximateOffsetOf("k04"), 2000, 3000)); - ASSERT_TRUE(Between(c.ApproximateOffsetOf("xyz"), 4000, 6100)); + ASSERT_TRUE(Between(c.ApproximateOffsetOf("k03"), 2000, 3500)); + ASSERT_TRUE(Between(c.ApproximateOffsetOf("k04"), 2000, 3500)); + ASSERT_TRUE(Between(c.ApproximateOffsetOf("xyz"), 4000, 6500)); c.ResetTableReader(); } @@ -2706,6 +2785,7 @@ TEST_F(GeneralTableTest, ApproximateOffsetOfCompressed) { } } +#ifndef ROCKSDB_VALGRIND_RUN // RandomizedHarnessTest is very slow for certain combination of arguments // Split into 8 pieces to reduce the time individual tests take. 
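// The MemoryAllocator test above exercises a new extension point: block cache
// allocations can be routed through a user-supplied allocator via
// LRUCacheOptions::memory_allocator. A condensed sketch of the wiring follows;
// the CountingAllocator class and the "rocksdb/memory_allocator.h" header
// name are assumptions based on this patch, not guaranteed details of any
// particular release.
#include <atomic>
#include <cstddef>
#include <memory>

#include "rocksdb/cache.h"
#include "rocksdb/memory_allocator.h"
#include "rocksdb/options.h"
#include "rocksdb/table.h"

class CountingAllocator : public rocksdb::MemoryAllocator {
 public:
  const char* Name() const override { return "CountingAllocator"; }
  void* Allocate(size_t size) override {
    bytes_allocated_ += size;
    return ::operator new(size);
  }
  void Deallocate(void* p) override { ::operator delete(p); }

 private:
  std::atomic<size_t> bytes_allocated_{0};
};

rocksdb::Options OptionsWithCustomCacheAllocator() {
  rocksdb::LRUCacheOptions cache_opts;
  cache_opts.capacity = 64 << 20;
  cache_opts.num_shard_bits = 4;
  cache_opts.memory_allocator = std::make_shared<CountingAllocator>();

  rocksdb::BlockBasedTableOptions table_opts;
  table_opts.block_cache = rocksdb::NewLRUCache(cache_opts);

  rocksdb::Options options;
  options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(table_opts));
  return options;
}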
TEST_F(HarnessTest, Randomized1) { @@ -2789,6 +2869,7 @@ TEST_F(HarnessTest, RandomizedLongDB) { ASSERT_GT(files, 0); } #endif // ROCKSDB_LITE +#endif // ROCKSDB_VALGRIND_RUN class MemTableTest : public testing::Test {}; @@ -2824,7 +2905,8 @@ TEST_F(MemTableTest, Simple) { iter = memtable->NewIterator(ReadOptions(), &arena); arena_iter_guard.set(iter); } else { - iter = memtable->NewRangeTombstoneIterator(ReadOptions()); + iter = memtable->NewRangeTombstoneIterator( + ReadOptions(), kMaxSequenceNumber /* read_seq */); iter_guard.reset(iter); } if (iter == nullptr) { @@ -2924,6 +3006,26 @@ TEST_F(HarnessTest, FooterTests) { ASSERT_EQ(decoded_footer.index_handle().size(), index.size()); ASSERT_EQ(decoded_footer.version(), 1U); } + { + // xxhash64 block based + std::string encoded; + Footer footer(kBlockBasedTableMagicNumber, 1); + BlockHandle meta_index(10, 5), index(20, 15); + footer.set_metaindex_handle(meta_index); + footer.set_index_handle(index); + footer.set_checksum(kxxHash64); + footer.EncodeTo(&encoded); + Footer decoded_footer; + Slice encoded_slice(encoded); + decoded_footer.DecodeFrom(&encoded_slice); + ASSERT_EQ(decoded_footer.table_magic_number(), kBlockBasedTableMagicNumber); + ASSERT_EQ(decoded_footer.checksum(), kxxHash64); + ASSERT_EQ(decoded_footer.metaindex_handle().offset(), meta_index.offset()); + ASSERT_EQ(decoded_footer.metaindex_handle().size(), meta_index.size()); + ASSERT_EQ(decoded_footer.index_handle().offset(), index.offset()); + ASSERT_EQ(decoded_footer.index_handle().size(), index.size()); + ASSERT_EQ(decoded_footer.version(), 1U); + } // Plain table is not supported in ROCKSDB_LITE #ifndef ROCKSDB_LITE { @@ -3151,7 +3253,7 @@ TEST_F(PrefixTest, PrefixAndWholeKeyTest) { TEST_P(BlockBasedTableTest, DISABLED_TableWithGlobalSeqno) { BlockBasedTableOptions bbto = GetBlockBasedTableOptions(); test::StringSink* sink = new test::StringSink(); - unique_ptr file_writer( + std::unique_ptr file_writer( test::GetWritableFileWriter(sink, "" /* don't care */)); Options options; options.table_factory.reset(NewBlockBasedTableFactory(bbto)); @@ -3189,7 +3291,7 @@ TEST_P(BlockBasedTableTest, DISABLED_TableWithGlobalSeqno) { // Helper function to get version, global_seqno, global_seqno_offset std::function GetVersionAndGlobalSeqno = [&]() { - unique_ptr file_reader( + std::unique_ptr file_reader( test::GetRandomAccessFileReader( new test::StringSource(ss_rw.contents(), 73342, true))); @@ -3218,9 +3320,9 @@ TEST_P(BlockBasedTableTest, DISABLED_TableWithGlobalSeqno) { }; // Helper function to get the contents of the table InternalIterator - unique_ptr table_reader; + std::unique_ptr table_reader; std::function GetTableInternalIter = [&]() { - unique_ptr file_reader( + std::unique_ptr file_reader( test::GetRandomAccessFileReader( new test::StringSource(ss_rw.contents(), 73342, true))); @@ -3333,7 +3435,7 @@ TEST_P(BlockBasedTableTest, BlockAlignTest) { BlockBasedTableOptions bbto = GetBlockBasedTableOptions(); bbto.block_align = true; test::StringSink* sink = new test::StringSink(); - unique_ptr file_writer( + std::unique_ptr file_writer( test::GetWritableFileWriter(sink, "" /* don't care */)); Options options; options.compression = kNoCompression; @@ -3365,7 +3467,7 @@ TEST_P(BlockBasedTableTest, BlockAlignTest) { file_writer->Flush(); test::RandomRWStringSink ss_rw(sink); - unique_ptr file_reader( + std::unique_ptr file_reader( test::GetRandomAccessFileReader( new test::StringSource(ss_rw.contents(), 73342, true))); @@ -3423,7 +3525,7 @@ TEST_P(BlockBasedTableTest, 
PropertiesBlockRestartPointTest) { BlockBasedTableOptions bbto = GetBlockBasedTableOptions(); bbto.block_align = true; test::StringSink* sink = new test::StringSink(); - unique_ptr file_writer( + std::unique_ptr file_writer( test::GetWritableFileWriter(sink, "" /* don't care */)); Options options; @@ -3458,7 +3560,7 @@ TEST_P(BlockBasedTableTest, PropertiesBlockRestartPointTest) { file_writer->Flush(); test::RandomRWStringSink ss_rw(sink); - unique_ptr file_reader( + std::unique_ptr file_reader( test::GetRandomAccessFileReader( new test::StringSource(ss_rw.contents(), 73342, true))); @@ -3477,10 +3579,10 @@ TEST_P(BlockBasedTableTest, PropertiesBlockRestartPointTest) { Slice compression_dict; PersistentCacheOptions cache_options; - BlockFetcher block_fetcher(file, nullptr /* prefetch_buffer */, footer, - read_options, handle, contents, ioptions, - false /* decompress */, compression_dict, - cache_options); + BlockFetcher block_fetcher( + file, nullptr /* prefetch_buffer */, footer, read_options, handle, + contents, ioptions, false /* decompress */, + false /*maybe_compressed*/, compression_dict, cache_options); ASSERT_OK(block_fetcher.ReadBlockContents()); }; @@ -3566,7 +3668,8 @@ TEST_P(BlockBasedTableTest, PropertiesMetaBlockLast) { BlockFetcher block_fetcher( table_reader.get(), nullptr /* prefetch_buffer */, footer, ReadOptions(), metaindex_handle, &metaindex_contents, ioptions, false /* decompress */, - compression_dict, pcache_opts); + false /*maybe_compressed*/, compression_dict, pcache_opts, + nullptr /*memory_allocator*/); ASSERT_OK(block_fetcher.ReadBlockContents()); Block metaindex_block(std::move(metaindex_contents), kDisableGlobalSequenceNumber); diff --git a/tools/benchmark.sh b/tools/benchmark.sh index 6d09204900f..0ba1081e195 100755 --- a/tools/benchmark.sh +++ b/tools/benchmark.sh @@ -151,8 +151,8 @@ function summarize_result { stall_pct=$( grep "^Cumulative stall" $test_out| tail -1 | awk '{ print $5 }' ) ops_sec=$( grep ^${bench_name} $test_out | awk '{ print $5 }' ) mb_sec=$( grep ^${bench_name} $test_out | awk '{ print $7 }' ) - lo_wgb=$( grep "^ L0" $test_out | tail -1 | awk '{ print $8 }' ) - sum_wgb=$( grep "^ Sum" $test_out | tail -1 | awk '{ print $8 }' ) + lo_wgb=$( grep "^ L0" $test_out | tail -1 | awk '{ print $9 }' ) + sum_wgb=$( grep "^ Sum" $test_out | tail -1 | awk '{ print $9 }' ) sum_size=$( grep "^ Sum" $test_out | tail -1 | awk '{ printf "%.1f", $3 / 1024.0 }' ) wamp=$( echo "scale=1; $sum_wgb / $lo_wgb" | bc ) wmb_ps=$( echo "scale=1; ( $sum_wgb * 1024.0 ) / $uptime" | bc ) diff --git a/tools/check_format_compatible.sh b/tools/check_format_compatible.sh index 5959fb83293..2d260c7ecc3 100755 --- a/tools/check_format_compatible.sh +++ b/tools/check_format_compatible.sh @@ -56,7 +56,7 @@ declare -a backward_compatible_checkout_objs=("2.2.fb.branch" "2.3.fb.branch" "2 declare -a forward_compatible_checkout_objs=("3.10.fb" "3.11.fb" "3.12.fb" "3.13.fb" "4.0.fb" "4.1.fb" "4.2.fb" "4.3.fb" "4.4.fb" "4.5.fb" "4.6.fb" "4.7.fb" "4.8.fb" "4.9.fb" "4.10.fb" "4.11.fb" "4.12.fb" "4.13.fb" "5.0.fb" "5.1.fb" "5.2.fb" "5.3.fb" "5.4.fb" "5.5.fb" "5.6.fb" "5.7.fb" "5.8.fb" "5.9.fb" "5.10.fb") declare -a forward_compatible_with_options_checkout_objs=("5.11.fb" "5.12.fb" "5.13.fb" "5.14.fb") declare -a checkout_objs=(${backward_compatible_checkout_objs[@]} ${forward_compatible_checkout_objs[@]} ${forward_compatible_with_options_checkout_objs[@]}) -declare -a extern_sst_ingestion_compatible_checkout_objs=("5.14.fb" "5.15.fb") +declare -a 
extern_sst_ingestion_compatible_checkout_objs=("5.14.fb" "5.15.fb" "5.16.fb" "5.17.fb") generate_db() { diff --git a/tools/db_bench_tool.cc b/tools/db_bench_tool.cc index 2dd3f402fef..2e20fd8275f 100644 --- a/tools/db_bench_tool.cc +++ b/tools/db_bench_tool.cc @@ -34,10 +34,12 @@ #include "cloud/aws/aws_env.h" #include "db/db_impl.h" +#include "db/malloc_stats.h" #include "db/version_set.h" #include "hdfs/env_hdfs.h" #include "monitoring/histogram.h" #include "monitoring/statistics.h" +#include "options/cf_options.h" #include "port/port.h" #include "port/stack_trace.h" #include "rocksdb/cache.h" @@ -46,7 +48,6 @@ #include "rocksdb/filter_policy.h" #include "rocksdb/memtablerep.h" #include "rocksdb/options.h" -#include "options/cf_options.h" #include "rocksdb/perf_context.h" #include "rocksdb/persistent_cache.h" #include "rocksdb/rate_limiter.h" @@ -249,6 +250,10 @@ DEFINE_bool(reverse_iterator, false, "When true use Prev rather than Next for iterators that do " "Seek and then Next"); +DEFINE_int64(max_scan_distance, 0, + "Used to define iterate_upper_bound (or iterate_lower_bound " + "if FLAGS_reverse_iterator is set to true) when value is nonzero"); + DEFINE_bool(use_uint64_comparator, false, "use Uint64 user comparator"); DEFINE_int64(batch_size, 1, "Batch size"); @@ -641,9 +646,11 @@ DEFINE_bool(optimize_filters_for_hits, false, DEFINE_uint64(delete_obsolete_files_period_micros, 0, "Ignored. Left here for backward compatibility"); +DEFINE_int64(writes_before_delete_range, 0, + "Number of writes before DeleteRange is called regularly."); + DEFINE_int64(writes_per_range_tombstone, 0, - "Number of writes between range " - "tombstones"); + "Number of writes between range tombstones"); DEFINE_int64(range_tombstone_width, 100, "Number of keys in tombstone's range"); @@ -941,6 +948,9 @@ DEFINE_uint64(max_compaction_bytes, rocksdb::Options().max_compaction_bytes, #ifndef ROCKSDB_LITE DEFINE_bool(readonly, false, "Run read only benchmarks."); + +DEFINE_bool(print_malloc_stats, false, + "Print malloc stats to stdout after benchmarks finish."); #endif // ROCKSDB_LITE DEFINE_bool(disable_auto_compactions, false, "Do not auto trigger compactions"); @@ -1195,11 +1205,12 @@ class ReportFileOpEnv : public EnvWrapper { counters_.bytes_written_ = 0; } - Status NewSequentialFile(const std::string& f, unique_ptr* r, + Status NewSequentialFile(const std::string& f, + std::unique_ptr* r, const EnvOptions& soptions) override { class CountingFile : public SequentialFile { private: - unique_ptr target_; + std::unique_ptr target_; ReportFileOpCounters* counters_; public: @@ -1227,11 +1238,11 @@ class ReportFileOpEnv : public EnvWrapper { } Status NewRandomAccessFile(const std::string& f, - unique_ptr* r, + std::unique_ptr* r, const EnvOptions& soptions) override { class CountingFile : public RandomAccessFile { private: - unique_ptr target_; + std::unique_ptr target_; ReportFileOpCounters* counters_; public: @@ -1256,11 +1267,11 @@ class ReportFileOpEnv : public EnvWrapper { return s; } - Status NewWritableFile(const std::string& f, unique_ptr* r, + Status NewWritableFile(const std::string& f, std::unique_ptr* r, const EnvOptions& soptions) override { class CountingFile : public WritableFile { private: - unique_ptr target_; + std::unique_ptr target_; ReportFileOpCounters* counters_; public: @@ -2026,12 +2037,15 @@ class Benchmark { int prefix_size_; int64_t keys_per_prefix_; int64_t entries_per_batch_; + int64_t writes_before_delete_range_; int64_t writes_per_range_tombstone_; int64_t range_tombstone_width_; 
int64_t max_num_range_tombstones_; WriteOptions write_options_; Options open_options_; // keep options around to properly destroy db later +#ifndef ROCKSDB_LITE TraceOptions trace_options_; +#endif int64_t reads_; int64_t deletes_; double read_random_exp_range_; @@ -2553,6 +2567,7 @@ void VerifyDBFromDB(std::string& truth_db_name) { value_size_ = FLAGS_value_size; key_size_ = FLAGS_key_size; entries_per_batch_ = FLAGS_batch_size; + writes_before_delete_range_ = FLAGS_writes_before_delete_range; writes_per_range_tombstone_ = FLAGS_writes_per_range_tombstone; range_tombstone_width_ = FLAGS_range_tombstone_width; max_num_range_tombstones_ = FLAGS_max_num_range_tombstones; @@ -2907,6 +2922,7 @@ void VerifyDBFromDB(std::string& truth_db_name) { } SetPerfLevel(static_cast (shared->perf_level)); + perf_context.EnablePerLevelPerfContext(); thread->stats.Start(thread->tid); (arg->bm->*(arg->method))(thread); thread->stats.Stop(); @@ -3934,9 +3950,13 @@ void VerifyDBFromDB(std::string& truth_db_name) { bytes += value_size_ + key_size_; ++num_written; if (writes_per_range_tombstone_ > 0 && - num_written / writes_per_range_tombstone_ <= + num_written > writes_before_delete_range_ && + (num_written - writes_before_delete_range_) / + writes_per_range_tombstone_ <= max_num_range_tombstones_ && - num_written % writes_per_range_tombstone_ == 0) { + (num_written - writes_before_delete_range_) % + writes_per_range_tombstone_ == + 0) { int64_t begin_num = key_gens[id]->Next(); if (FLAGS_expand_range_tombstones) { for (int64_t offset = 0; offset < range_tombstone_width_; @@ -4287,7 +4307,7 @@ void VerifyDBFromDB(std::string& truth_db_name) { } if (levelMeta.level == 0) { for (auto& fileMeta : levelMeta.files) { - fprintf(stdout, "Level[%d]: %s(size: %" PRIu64 " bytes)\n", + fprintf(stdout, "Level[%d]: %s(size: %" ROCKSDB_PRIszt " bytes)\n", levelMeta.level, fileMeta.name.c_str(), fileMeta.size); } } else { @@ -4606,9 +4626,31 @@ void VerifyDBFromDB(std::string& truth_db_name) { std::unique_ptr key_guard; Slice key = AllocateKey(&key_guard); + std::unique_ptr upper_bound_key_guard; + Slice upper_bound = AllocateKey(&upper_bound_key_guard); + std::unique_ptr lower_bound_key_guard; + Slice lower_bound = AllocateKey(&lower_bound_key_guard); + Duration duration(FLAGS_duration, reads_); char value_buffer[256]; while (!duration.Done(1)) { + int64_t seek_pos = thread->rand.Next() % FLAGS_num; + GenerateKeyFromInt((uint64_t)seek_pos, FLAGS_num, &key); + if (FLAGS_max_scan_distance != 0) { + if (FLAGS_reverse_iterator) { + GenerateKeyFromInt( + (uint64_t)std::max((int64_t)0, + seek_pos - FLAGS_max_scan_distance), + FLAGS_num, &lower_bound); + options.iterate_lower_bound = &lower_bound; + } else { + GenerateKeyFromInt( + (uint64_t)std::min(FLAGS_num, seek_pos + FLAGS_max_scan_distance), + FLAGS_num, &upper_bound); + options.iterate_upper_bound = &upper_bound; + } + } + if (!FLAGS_use_tailing_iterator) { if (db_.db != nullptr) { delete single_iter; @@ -4629,7 +4671,6 @@ void VerifyDBFromDB(std::string& truth_db_name) { iter_to_use = multi_iters[thread->rand.Next() % multi_iters.size()]; } - GenerateKeyFromInt(thread->rand.Next() % FLAGS_num, FLAGS_num, &key); iter_to_use->Seek(key); read++; if (iter_to_use->Valid() && iter_to_use->key().compare(key) == 0) { @@ -5726,7 +5767,7 @@ void VerifyDBFromDB(std::string& truth_db_name) { void Replay(ThreadState* /*thread*/, DBWithColumnFamilies* db_with_cfh) { Status s; - unique_ptr trace_reader; + std::unique_ptr trace_reader; s = NewFileTraceReader(FLAGS_env, EnvOptions(), 
FLAGS_trace_file, &trace_reader); if (!s.ok()) { @@ -5854,6 +5895,15 @@ int db_bench_tool(int argc, char** argv) { rocksdb::Benchmark benchmark; benchmark.Run(); + +#ifndef ROCKSDB_LITE + if (FLAGS_print_malloc_stats) { + std::string stats_string; + rocksdb::DumpMallocStats(&stats_string); + fprintf(stdout, "Malloc stats:\n%s\n", stats_string.c_str()); + } +#endif // ROCKSDB_LITE + return 0; } } // namespace rocksdb diff --git a/tools/db_bench_tool_test.cc b/tools/db_bench_tool_test.cc index 67426066eb9..dfc461193c4 100644 --- a/tools/db_bench_tool_test.cc +++ b/tools/db_bench_tool_test.cc @@ -279,7 +279,7 @@ const std::string options_file_content = R"OPTIONS_FILE( TEST_F(DBBenchTest, OptionsFileFromFile) { const std::string kOptionsFileName = test_path_ + "/OPTIONS_flash"; - unique_ptr writable; + std::unique_ptr writable; ASSERT_OK(Env::Default()->NewWritableFile(kOptionsFileName, &writable, EnvOptions())); ASSERT_OK(writable->Append(options_file_content)); diff --git a/tools/db_crashtest.py b/tools/db_crashtest.py index 59528128b4c..0bf43780df5 100644 --- a/tools/db_crashtest.py +++ b/tools/db_crashtest.py @@ -15,6 +15,9 @@ # default_params < {blackbox,whitebox}_default_params < # simple_default_params < # {blackbox,whitebox}_simple_default_params < args +# for enable_atomic_flush: +# default_params < {blackbox,whitebox}_default_params < +# atomic_flush_params < args expected_values_file = tempfile.NamedTemporaryFile() @@ -122,6 +125,15 @@ def is_direct_io_supported(dbname): whitebox_simple_default_params = {} +atomic_flush_params = { + "atomic_flush": 1, + "disable_wal": 1, + "reopen": 0, + # use small value for write_buffer_size so that RocksDB triggers flush + # more frequently + "write_buffer_size": 1024 * 1024, +} + def finalize_and_sanitize(src_params): dest_params = dict([(k, v() if callable(v) else v) @@ -152,6 +164,8 @@ def gen_cmd_params(args): params.update(blackbox_simple_default_params) if args.test_type == 'whitebox': params.update(whitebox_simple_default_params) + if args.enable_atomic_flush: + params.update(atomic_flush_params) for k, v in vars(args).items(): if v is not None: @@ -164,7 +178,7 @@ def gen_cmd(params, unknown_params): '--{0}={1}'.format(k, v) for k, v in finalize_and_sanitize(params).items() if k not in set(['test_type', 'simple', 'duration', 'interval', - 'random_kill_odd']) + 'random_kill_odd', 'enable_atomic_flush']) and v is not None] + unknown_params return cmd @@ -356,6 +370,7 @@ def main(): db_stress multiple times") parser.add_argument("test_type", choices=["blackbox", "whitebox"]) parser.add_argument("--simple", action="store_true") + parser.add_argument("--enable_atomic_flush", action='store_true') all_params = dict(default_params.items() + blackbox_default_params.items() diff --git a/tools/db_repl_stress.cc b/tools/db_repl_stress.cc index 5901b97778e..c640b5945b0 100644 --- a/tools/db_repl_stress.cc +++ b/tools/db_repl_stress.cc @@ -67,7 +67,7 @@ struct ReplicationThread { static void ReplicationThreadBody(void* arg) { ReplicationThread* t = reinterpret_cast(arg); DB* db = t->db; - unique_ptr iter; + std::unique_ptr iter; SequenceNumber currentSeqNum = 1; while (!t->stop.load(std::memory_order_acquire)) { iter.reset(); diff --git a/tools/db_stress.cc b/tools/db_stress.cc index 45a7c9a0d0a..20b2899e957 100644 --- a/tools/db_stress.cc +++ b/tools/db_stress.cc @@ -133,6 +133,8 @@ DEFINE_bool(test_batches_snapshots, false, "\t(b) No long validation at the end (more speed up)\n" "\t(c) Test snapshot and atomicity of batch writes"); 
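// The --max_scan_distance flag added to db_bench above bounds each seek by
// filling in ReadOptions::iterate_upper_bound (or iterate_lower_bound when
// scanning in reverse). Outside of db_bench the same bounded-scan pattern
// looks roughly like this; the function name and key arguments are
// illustrative assumptions.
#include <memory>

#include "rocksdb/db.h"
#include "rocksdb/options.h"
#include "rocksdb/slice.h"

void BoundedForwardScan(rocksdb::DB* db, const rocksdb::Slice& start,
                        const rocksdb::Slice& end_exclusive) {
  rocksdb::ReadOptions ro;
  // The bound must stay alive for the iterator's lifetime; the iterator stops
  // before reaching end_exclusive, so the scan cannot run past the bound.
  ro.iterate_upper_bound = &end_exclusive;
  std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(ro));
  for (it->Seek(start); it->Valid(); it->Next()) {
    // consume it->key() / it->value()
  }
}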
+DEFINE_bool(atomic_flush, false, "If true, the test enables atomic flush\n"); + DEFINE_int32(threads, 32, "Number of concurrent threads to run."); DEFINE_int32(ttl, -1, @@ -790,46 +792,36 @@ class Stats { } } - void AddBytesForWrites(int nwrites, size_t nbytes) { + void AddBytesForWrites(long nwrites, size_t nbytes) { writes_ += nwrites; bytes_ += nbytes; } - void AddGets(int ngets, int nfounds) { + void AddGets(long ngets, long nfounds) { founds_ += nfounds; gets_ += ngets; } - void AddPrefixes(int nprefixes, int count) { + void AddPrefixes(long nprefixes, long count) { prefixes_ += nprefixes; iterator_size_sums_ += count; } - void AddIterations(int n) { - iterations_ += n; - } + void AddIterations(long n) { iterations_ += n; } - void AddDeletes(int n) { - deletes_ += n; - } + void AddDeletes(long n) { deletes_ += n; } void AddSingleDeletes(size_t n) { single_deletes_ += n; } - void AddRangeDeletions(int n) { - range_deletions_ += n; - } + void AddRangeDeletions(long n) { range_deletions_ += n; } - void AddCoveredByRangeDeletions(int n) { - covered_by_range_deletions_ += n; - } + void AddCoveredByRangeDeletions(long n) { covered_by_range_deletions_ += n; } - void AddErrors(int n) { - errors_ += n; - } + void AddErrors(long n) { errors_ += n; } - void AddNumCompactFilesSucceed(int n) { num_compact_files_succeed_ += n; } + void AddNumCompactFilesSucceed(long n) { num_compact_files_succeed_ += n; } - void AddNumCompactFilesFailed(int n) { num_compact_files_failed_ += n; } + void AddNumCompactFilesFailed(long n) { num_compact_files_failed_ += n; } void Report(const char* name) { std::string extra; @@ -948,7 +940,7 @@ class SharedState { if (status.ok()) { status = FLAGS_env->GetFileSize(FLAGS_expected_values_path, &size); } - unique_ptr wfile; + std::unique_ptr wfile; if (status.ok() && size == 0) { const EnvOptions soptions; status = FLAGS_env->NewWritableFile(FLAGS_expected_values_path, &wfile, @@ -1743,6 +1735,9 @@ class StressTest { } } if (snap_state.key_vec != nullptr) { + // When `prefix_extractor` is set, seeking to beginning and scanning + // across prefixes are only supported with `total_order_seek` set. 
+ ropt.total_order_seek = true; std::unique_ptr iterator(db->NewIterator(ropt)); std::unique_ptr> tmp_bitvec(new std::vector(FLAGS_max_key)); for (iterator->SeekToFirst(); iterator->Valid(); iterator->Next()) { @@ -1892,27 +1887,6 @@ class StressTest { } } - if (FLAGS_backup_one_in > 0 && - thread->rand.Uniform(FLAGS_backup_one_in) == 0) { - std::string backup_dir = FLAGS_db + "/.backup" + ToString(thread->tid); - BackupableDBOptions backup_opts(backup_dir); - BackupEngine* backup_engine = nullptr; - Status s = BackupEngine::Open(FLAGS_env, backup_opts, &backup_engine); - if (s.ok()) { - s = backup_engine->CreateNewBackup(db_); - } - if (s.ok()) { - s = backup_engine->PurgeOldBackups(0 /* num_backups_to_keep */); - } - if (!s.ok()) { - printf("A BackupEngine operation failed with: %s\n", - s.ToString().c_str()); - } - if (backup_engine != nullptr) { - delete backup_engine; - } - } - if (FLAGS_compact_files_one_in > 0 && thread->rand.Uniform(FLAGS_compact_files_one_in) == 0) { auto* random_cf = @@ -1975,15 +1949,6 @@ class StressTest { auto column_family = column_families_[rand_column_family]; - if (FLAGS_flush_one_in > 0 && - thread->rand.Uniform(FLAGS_flush_one_in) == 0) { - FlushOptions flush_opts; - Status status = db_->Flush(flush_opts, column_family); - if (!status.ok()) { - fprintf(stdout, "Unable to perform Flush(): %s\n", status.ToString().c_str()); - } - } - if (FLAGS_compact_range_one_in > 0 && thread->rand.Uniform(FLAGS_compact_range_one_in) == 0) { int64_t end_key_num; @@ -2007,6 +1972,21 @@ class StressTest { std::vector rand_column_families = GenerateColumnFamilies(FLAGS_column_families, rand_column_family); + + if (FLAGS_flush_one_in > 0 && + thread->rand.Uniform(FLAGS_flush_one_in) == 0) { + FlushOptions flush_opts; + std::vector cfhs; + std::for_each( + rand_column_families.begin(), rand_column_families.end(), + [this, &cfhs](int k) { cfhs.push_back(column_families_[k]); }); + Status status = db_->Flush(flush_opts, cfhs); + if (!status.ok()) { + fprintf(stdout, "Unable to perform Flush(): %s\n", + status.ToString().c_str()); + } + } + std::vector rand_keys = GenerateKeys(rand_key); if (FLAGS_ingest_external_file_one_in > 0 && @@ -2014,6 +1994,15 @@ class StressTest { TestIngestExternalFile(thread, rand_column_families, rand_keys, lock); } + if (FLAGS_backup_one_in > 0 && + thread->rand.Uniform(FLAGS_backup_one_in) == 0) { + Status s = TestBackupRestore(thread, rand_column_families, rand_keys); + if (!s.ok()) { + VerificationAbort(shared, "Backup/restore gave inconsistent state", + s); + } + } + if (FLAGS_acquire_snapshot_one_in > 0 && thread->rand.Uniform(FLAGS_acquire_snapshot_one_in) == 0) { auto snapshot = db_->GetSnapshot(); @@ -2029,6 +2018,9 @@ class StressTest { if (FLAGS_compare_full_db_state_snapshot && (thread->tid == 0)) { key_vec = new std::vector(FLAGS_max_key); + // When `prefix_extractor` is set, seeking to beginning and scanning + // across prefixes are only supported with `total_order_seek` set. 
+ ropt.total_order_seek = true; std::unique_ptr iterator(db_->NewIterator(ropt)); for (iterator->SeekToFirst(); iterator->Valid(); iterator->Next()) { uint64_t key_val; @@ -2199,6 +2191,106 @@ class StressTest { return s; } +#ifdef ROCKSDB_LITE + virtual Status TestBackupRestore( + ThreadState* /* thread */, + const std::vector& /* rand_column_families */, + const std::vector& /* rand_keys */) { + assert(false); + fprintf(stderr, + "RocksDB lite does not support " + "TestBackupRestore\n"); + std::terminate(); + } +#else // ROCKSDB_LITE + virtual Status TestBackupRestore(ThreadState* thread, + const std::vector& rand_column_families, + const std::vector& rand_keys) { + // Note the column families chosen by `rand_column_families` cannot be + // dropped while the locks for `rand_keys` are held. So we should not have + // to worry about accessing those column families throughout this function. + assert(rand_column_families.size() == rand_keys.size()); + std::string backup_dir = FLAGS_db + "/.backup" + ToString(thread->tid); + std::string restore_dir = FLAGS_db + "/.restore" + ToString(thread->tid); + BackupableDBOptions backup_opts(backup_dir); + BackupEngine* backup_engine = nullptr; + Status s = BackupEngine::Open(FLAGS_env, backup_opts, &backup_engine); + if (s.ok()) { + s = backup_engine->CreateNewBackup(db_); + } + if (s.ok()) { + delete backup_engine; + backup_engine = nullptr; + s = BackupEngine::Open(FLAGS_env, backup_opts, &backup_engine); + } + if (s.ok()) { + s = backup_engine->RestoreDBFromLatestBackup(restore_dir /* db_dir */, + restore_dir /* wal_dir */); + } + if (s.ok()) { + s = backup_engine->PurgeOldBackups(0 /* num_backups_to_keep */); + } + DB* restored_db = nullptr; + std::vector restored_cf_handles; + if (s.ok()) { + Options restore_options(options_); + restore_options.listeners.clear(); + std::vector cf_descriptors; + // TODO(ajkr): `column_family_names_` is not safe to access here when + // `clear_column_family_one_in != 0`. But we can't easily switch to + // `ListColumnFamilies` to get names because it won't necessarily give + // the same order as `column_family_names_`. 
+ assert(FLAGS_clear_column_family_one_in == 0); + for (auto name : column_family_names_) { + cf_descriptors.emplace_back(name, ColumnFamilyOptions(restore_options)); + } + s = DB::Open(DBOptions(restore_options), restore_dir, cf_descriptors, + &restored_cf_handles, &restored_db); + } + // for simplicity, currently only verifies existence/non-existence of a few + // keys + for (size_t i = 0; s.ok() && i < rand_column_families.size(); ++i) { + std::string key_str = Key(rand_keys[i]); + Slice key = key_str; + std::string restored_value; + Status get_status = restored_db->Get( + ReadOptions(), restored_cf_handles[rand_column_families[i]], key, + &restored_value); + bool exists = + thread->shared->Exists(rand_column_families[i], rand_keys[i]); + if (get_status.ok()) { + if (!exists) { + s = Status::Corruption( + "key exists in restore but not in original db"); + } + } else if (get_status.IsNotFound()) { + if (exists) { + s = Status::Corruption( + "key exists in original db but not in restore"); + } + } else { + s = get_status; + } + } + if (backup_engine != nullptr) { + delete backup_engine; + backup_engine = nullptr; + } + if (restored_db != nullptr) { + for (auto* cf_handle : restored_cf_handles) { + restored_db->DestroyColumnFamilyHandle(cf_handle); + } + delete restored_db; + restored_db = nullptr; + } + if (!s.ok()) { + printf("A backup/restore operation failed with: %s\n", + s.ToString().c_str()); + } + return s; + } +#endif // ROCKSDB_LITE + void VerificationAbort(SharedState* shared, std::string msg, Status s) const { printf("Verification failed: %s. Status is %s\n", msg.c_str(), s.ToString().c_str()); @@ -2218,6 +2310,8 @@ class StressTest { fprintf(stdout, "Format version : %d\n", FLAGS_format_version); fprintf(stdout, "TransactionDB : %s\n", FLAGS_use_txn ? "true" : "false"); + fprintf(stdout, "Atomic flush : %s\n", + FLAGS_atomic_flush ? 
"true" : "false"); fprintf(stdout, "Column families : %d\n", FLAGS_column_families); if (!FLAGS_test_batches_snapshots) { fprintf(stdout, "Clear CFs one in : %d\n", @@ -2363,6 +2457,7 @@ class StressTest { FLAGS_universal_max_merge_width; options_.compaction_options_universal.max_size_amplification_percent = FLAGS_universal_max_size_amplification_percent; + options_.atomic_flush = FLAGS_atomic_flush; } else { #ifdef ROCKSDB_LITE fprintf(stderr, "--options_file not supported in lite mode\n"); @@ -2594,7 +2689,7 @@ class NonBatchedOpsStressTest : public StressTest { } if (!thread->rand.OneIn(2)) { // Use iterator to verify this range - unique_ptr iter( + std::unique_ptr iter( db_->NewIterator(options, column_families_[cf])); iter->Seek(Key(start)); for (auto i = start; i < end; i++) { @@ -2733,16 +2828,15 @@ class NonBatchedOpsStressTest : public StressTest { } Iterator* iter = db_->NewIterator(ro_copy, cfh); - int64_t count = 0; + long count = 0; for (iter->Seek(prefix); iter->Valid() && iter->key().starts_with(prefix); iter->Next()) { ++count; } - assert(count <= - (static_cast(1) << ((8 - FLAGS_prefix_size) * 8))); + assert(count <= (static_cast(1) << ((8 - FLAGS_prefix_size) * 8))); Status s = iter->status(); if (iter->status().ok()) { - thread->stats.AddPrefixes(1, static_cast(count)); + thread->stats.AddPrefixes(1, count); } else { thread->stats.AddErrors(1); } @@ -3272,7 +3366,7 @@ class BatchedOpsStressTest : public StressTest { iters[i]->Seek(prefix_slices[i]); } - int count = 0; + long count = 0; while (iters[0]->Valid() && iters[0]->key().starts_with(prefix_slices[0])) { count++; std::string values[10]; @@ -3327,6 +3421,274 @@ class BatchedOpsStressTest : public StressTest { virtual void VerifyDb(ThreadState* /* thread */) const {} }; +class AtomicFlushStressTest : public StressTest { + public: + AtomicFlushStressTest() : batch_id_(0) {} + + virtual ~AtomicFlushStressTest() {} + + virtual Status TestPut(ThreadState* thread, WriteOptions& write_opts, + const ReadOptions& /* read_opts */, + const std::vector& rand_column_families, + const std::vector& rand_keys, + char (&value)[100], + std::unique_ptr& /* lock */) { + std::string key_str = Key(rand_keys[0]); + Slice key = key_str; + uint64_t value_base = batch_id_.fetch_add(1); + size_t sz = + GenerateValue(static_cast(value_base), value, sizeof(value)); + Slice v(value, sz); + WriteBatch batch; + for (auto cf : rand_column_families) { + ColumnFamilyHandle* cfh = column_families_[cf]; + if (FLAGS_use_merge) { + batch.Merge(cfh, key, v); + } else { /* !FLAGS_use_merge */ + batch.Put(cfh, key, v); + } + } + Status s = db_->Write(write_opts, &batch); + if (!s.ok()) { + fprintf(stderr, "multi put or merge error: %s\n", s.ToString().c_str()); + thread->stats.AddErrors(1); + } else { + auto num = static_cast(rand_column_families.size()); + thread->stats.AddBytesForWrites(num, (sz + 1) * num); + } + + return s; + } + + virtual Status TestDelete(ThreadState* thread, WriteOptions& write_opts, + const std::vector& rand_column_families, + const std::vector& rand_keys, + std::unique_ptr& /* lock */) { + std::string key_str = Key(rand_keys[0]); + Slice key = key_str; + WriteBatch batch; + for (auto cf : rand_column_families) { + ColumnFamilyHandle* cfh = column_families_[cf]; + batch.Delete(cfh, key); + } + Status s = db_->Write(write_opts, &batch); + if (!s.ok()) { + fprintf(stderr, "multidel error: %s\n", s.ToString().c_str()); + thread->stats.AddErrors(1); + } else { + thread->stats.AddDeletes(static_cast(rand_column_families.size())); + } + 
return s; + } + + virtual Status TestDeleteRange(ThreadState* thread, WriteOptions& write_opts, + const std::vector& rand_column_families, + const std::vector& rand_keys, + std::unique_ptr& /* lock */) { + int64_t rand_key = rand_keys[0]; + auto shared = thread->shared; + int64_t max_key = shared->GetMaxKey(); + if (rand_key > max_key - FLAGS_range_deletion_width) { + rand_key = + thread->rand.Next() % (max_key - FLAGS_range_deletion_width + 1); + } + std::string key_str = Key(rand_key); + Slice key = key_str; + std::string end_key_str = Key(rand_key + FLAGS_range_deletion_width); + Slice end_key = end_key_str; + WriteBatch batch; + for (auto cf : rand_column_families) { + ColumnFamilyHandle* cfh = column_families_[rand_column_families[cf]]; + batch.DeleteRange(cfh, key, end_key); + } + Status s = db_->Write(write_opts, &batch); + if (!s.ok()) { + fprintf(stderr, "multi del range error: %s\n", s.ToString().c_str()); + thread->stats.AddErrors(1); + } else { + thread->stats.AddRangeDeletions( + static_cast(rand_column_families.size())); + } + return s; + } + + virtual void TestIngestExternalFile( + ThreadState* /* thread */, + const std::vector& /* rand_column_families */, + const std::vector& /* rand_keys */, + std::unique_ptr& /* lock */) { + assert(false); + fprintf(stderr, + "AtomicFlushStressTest does not support TestIngestExternalFile " + "because it's not possible to verify the result\n"); + std::terminate(); + } + + virtual Status TestGet(ThreadState* thread, const ReadOptions& readoptions, + const std::vector& rand_column_families, + const std::vector& rand_keys) { + std::string key_str = Key(rand_keys[0]); + Slice key = key_str; + auto cfh = + column_families_[rand_column_families[thread->rand.Next() % + rand_column_families.size()]]; + std::string from_db; + Status s = db_->Get(readoptions, cfh, key, &from_db); + if (s.ok()) { + thread->stats.AddGets(1, 1); + } else if (s.IsNotFound()) { + thread->stats.AddGets(1, 0); + } else { + thread->stats.AddErrors(1); + } + return s; + } + + virtual Status TestPrefixScan(ThreadState* thread, + const ReadOptions& readoptions, + const std::vector& rand_column_families, + const std::vector& rand_keys) { + std::string key_str = Key(rand_keys[0]); + Slice key = key_str; + Slice prefix = Slice(key.data(), FLAGS_prefix_size); + + std::string upper_bound; + Slice ub_slice; + ReadOptions ro_copy = readoptions; + if (thread->rand.OneIn(2) && GetNextPrefix(prefix, &upper_bound)) { + ub_slice = Slice(upper_bound); + ro_copy.iterate_upper_bound = &ub_slice; + } + auto cfh = + column_families_[rand_column_families[thread->rand.Next() % + rand_column_families.size()]]; + Iterator* iter = db_->NewIterator(ro_copy, cfh); + long count = 0; + for (iter->Seek(prefix); iter->Valid() && iter->key().starts_with(prefix); + iter->Next()) { + ++count; + } + assert(count <= (static_cast(1) << ((8 - FLAGS_prefix_size) * 8))); + Status s = iter->status(); + if (s.ok()) { + thread->stats.AddPrefixes(1, count); + } else { + thread->stats.AddErrors(1); + } + delete iter; + return s; + } + + virtual void VerifyDb(ThreadState* thread) const { + ReadOptions options(FLAGS_verify_checksum, true); + // We must set total_order_seek to true because we are doing a SeekToFirst + // on a column family whose memtables may support (by default) prefix-based + // iterator. In this case, NewIterator with options.total_order_seek being + // false returns a prefix-based iterator. Calling SeekToFirst using this + // iterator causes the iterator to become invalid. 
That means we cannot + // iterate the memtable using this iterator any more, although the memtable + // contains the most up-to-date key-values. + options.total_order_seek = true; + assert(thread != nullptr); + auto shared = thread->shared; + std::vector > iters(column_families_.size()); + for (size_t i = 0; i != column_families_.size(); ++i) { + iters[i].reset(db_->NewIterator(options, column_families_[i])); + } + for (auto& iter : iters) { + iter->SeekToFirst(); + } + size_t num = column_families_.size(); + assert(num == iters.size()); + std::vector statuses(num, Status::OK()); + do { + size_t valid_cnt = 0; + size_t idx = 0; + for (auto& iter : iters) { + if (iter->Valid()) { + ++valid_cnt; + } else { + statuses[idx] = iter->status(); + } + ++idx; + } + if (valid_cnt == 0) { + Status status; + for (size_t i = 0; i != num; ++i) { + const auto& s = statuses[i]; + if (!s.ok()) { + status = s; + fprintf(stderr, "Iterator on cf %s has error: %s\n", + column_families_[i]->GetName().c_str(), + s.ToString().c_str()); + shared->SetVerificationFailure(); + } + } + if (status.ok()) { + fprintf(stdout, "Finished scanning all column families.\n"); + } + break; + } else if (valid_cnt != iters.size()) { + for (size_t i = 0; i != num; ++i) { + if (!iters[i]->Valid()) { + if (statuses[i].ok()) { + fprintf(stderr, "Finished scanning cf %s\n", + column_families_[i]->GetName().c_str()); + } else { + fprintf(stderr, "Iterator on cf %s has error: %s\n", + column_families_[i]->GetName().c_str(), + statuses[i].ToString().c_str()); + } + } else { + fprintf(stderr, "cf %s has remaining data to scan\n", + column_families_[i]->GetName().c_str()); + } + } + shared->SetVerificationFailure(); + break; + } + // If the program reaches here, then all column families' iterators are + // still valid. 
+ Slice key; + Slice value; + for (size_t i = 0; i != num; ++i) { + if (i == 0) { + key = iters[i]->key(); + value = iters[i]->value(); + } else { + if (key.compare(iters[i]->key()) != 0) { + fprintf(stderr, "Verification failed\n"); + fprintf(stderr, "cf%s: %s => %s\n", + column_families_[0]->GetName().c_str(), + key.ToString(true /* hex */).c_str(), + value.ToString(/* hex */).c_str()); + fprintf(stderr, "cf%s: %s => %s\n", + column_families_[i]->GetName().c_str(), + iters[i]->key().ToString(true /* hex */).c_str(), + iters[i]->value().ToString(true /* hex */).c_str()); + shared->SetVerificationFailure(); + } + } + } + for (auto& iter : iters) { + iter->Next(); + } + } while (true); + } + + virtual std::vector GenerateColumnFamilies( + const int /* num_column_families */, int /* rand_column_family */) const { + std::vector ret; + int num = static_cast(column_families_.size()); + int k = 0; + std::generate_n(back_inserter(ret), num, [&k]() -> int { return k++; }); + return ret; + } + + private: + std::atomic batch_id_; +}; + } // namespace rocksdb int main(int argc, char** argv) { @@ -3415,6 +3777,11 @@ int main(int argc, char** argv) { "Error: nooverwritepercent must be 0 when using file ingestion\n"); exit(1); } + if (FLAGS_clear_column_family_one_in > 0 && FLAGS_backup_one_in > 0) { + fprintf(stderr, + "Error: clear_column_family_one_in must be 0 when using backup\n"); + exit(1); + } // Choose a location for the test database if none given with --db= if (FLAGS_db.empty()) { @@ -3428,7 +3795,9 @@ int main(int argc, char** argv) { rocksdb_kill_prefix_blacklist = SplitString(FLAGS_kill_prefix_blacklist); std::unique_ptr stress; - if (FLAGS_test_batches_snapshots) { + if (FLAGS_atomic_flush) { + stress.reset(new rocksdb::AtomicFlushStressTest()); + } else if (FLAGS_test_batches_snapshots) { stress.reset(new rocksdb::BatchedOpsStressTest()); } else { stress.reset(new rocksdb::NonBatchedOpsStressTest()); diff --git a/tools/ldb_cmd.cc b/tools/ldb_cmd.cc index 4b6f6f4d8a2..997718ef28e 100644 --- a/tools/ldb_cmd.cc +++ b/tools/ldb_cmd.cc @@ -1964,11 +1964,11 @@ void DumpWalFile(std::string wal_file, bool print_header, bool print_values, bool is_write_committed, LDBCommandExecuteResult* exec_state) { Env* env_ = Env::Default(); EnvOptions soptions; - unique_ptr wal_file_reader; + std::unique_ptr wal_file_reader; Status status; { - unique_ptr file; + std::unique_ptr file; status = env_->NewSequentialFile(wal_file, &file, soptions); if (status.ok()) { wal_file_reader.reset( @@ -1999,7 +1999,8 @@ void DumpWalFile(std::string wal_file, bool print_header, bool print_values, } DBOptions db_options; log::Reader reader(db_options.info_log, std::move(wal_file_reader), - &reporter, true /* checksum */, log_number); + &reporter, true /* checksum */, log_number, + false /* retry_after_eof */); std::string scratch; WriteBatch batch; Slice record; @@ -2844,8 +2845,8 @@ void DumpSstFile(std::string filename, bool output_hex, bool show_properties) { return; } // no verification - rocksdb::SstFileReader reader(filename, false, output_hex); - Status st = reader.ReadSequential(true, std::numeric_limits::max(), false, // has_from + rocksdb::SstFileDumper dumper(filename, false, output_hex); + Status st = dumper.ReadSequential(true, std::numeric_limits::max(), false, // has_from from_key, false, // has_to to_key); if (!st.ok()) { @@ -2859,21 +2860,17 @@ void DumpSstFile(std::string filename, bool output_hex, bool show_properties) { std::shared_ptr table_properties_from_reader; - st = 
reader.ReadTableProperties(&table_properties_from_reader); + st = dumper.ReadTableProperties(&table_properties_from_reader); if (!st.ok()) { std::cerr << filename << ": " << st.ToString() << ". Try to use initial table properties" << std::endl; - table_properties = reader.GetInitTableProperties(); + table_properties = dumper.GetInitTableProperties(); } else { table_properties = table_properties_from_reader.get(); } if (table_properties != nullptr) { std::cout << std::endl << "Table Properties:" << std::endl; std::cout << table_properties->ToString("\n") << std::endl; - std::cout << "# deleted keys: " - << rocksdb::GetDeletedKeys( - table_properties->user_collected_properties) - << std::endl; } } } diff --git a/tools/sst_dump_test.cc b/tools/sst_dump_test.cc index beab224d129..9032123cc6f 100644 --- a/tools/sst_dump_test.cc +++ b/tools/sst_dump_test.cc @@ -43,7 +43,7 @@ void createSST(const std::string& file_name, std::shared_ptr tf; tf.reset(new rocksdb::BlockBasedTableFactory(table_options)); - unique_ptr file; + std::unique_ptr file; Env* env = Env::Default(); EnvOptions env_options; ReadOptions read_options; @@ -51,7 +51,7 @@ void createSST(const std::string& file_name, const ImmutableCFOptions imoptions(opts); const MutableCFOptions moptions(opts); rocksdb::InternalKeyComparator ikc(opts.comparator); - unique_ptr tb; + std::unique_ptr tb; ASSERT_OK(env->NewWritableFile(file_name, &file, env_options)); diff --git a/tools/sst_dump_tool.cc b/tools/sst_dump_tool.cc index 6ca56aad98c..25699777e89 100644 --- a/tools/sst_dump_tool.cc +++ b/tools/sst_dump_tool.cc @@ -43,7 +43,7 @@ namespace rocksdb { -SstFileReader::SstFileReader(const std::string& file_path, bool verify_checksum, +SstFileDumper::SstFileDumper(const std::string& file_path, bool verify_checksum, bool output_hex) : file_name_(file_path), read_num_(0), @@ -74,7 +74,7 @@ static const std::vector> {CompressionType::kXpressCompression, "kXpressCompression"}, {CompressionType::kZSTD, "kZSTD"}}; -Status SstFileReader::GetTableReader(const std::string& file_path) { +Status SstFileDumper::GetTableReader(const std::string& file_path) { // Warning about 'magic_number' being uninitialized shows up only in UBsan // builds. Though access is guarded by 's.ok()' checks, fix the issue to // avoid any warnings. 
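// The AtomicFlushStressTest added above leans on two pieces of public API:
// the atomic_flush option and the DB::Flush() overload that takes a set of
// column family handles. A minimal sketch of how a user would drive it; the
// function name is an illustrative assumption.
#include <vector>

#include "rocksdb/db.h"
#include "rocksdb/options.h"

rocksdb::Status FlushAllAtomically(
    rocksdb::DB* db, const std::vector<rocksdb::ColumnFamilyHandle*>& cfs) {
  // Requires the DB to have been opened with options.atomic_flush = true so
  // that the memtables of the listed column families are flushed as a unit.
  rocksdb::FlushOptions flush_opts;
  return db->Flush(flush_opts, cfs);
}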
@@ -83,7 +83,7 @@ Status SstFileReader::GetTableReader(const std::string& file_path) { // read table magic number Footer footer; - unique_ptr file; + std::unique_ptr file; uint64_t file_size = 0; Status s = options_.env->NewRandomAccessFile(file_path, &file, soptions_); if (s.ok()) { @@ -123,10 +123,10 @@ Status SstFileReader::GetTableReader(const std::string& file_path) { return s; } -Status SstFileReader::NewTableReader( +Status SstFileDumper::NewTableReader( const ImmutableCFOptions& /*ioptions*/, const EnvOptions& /*soptions*/, const InternalKeyComparator& /*internal_comparator*/, uint64_t file_size, - unique_ptr* /*table_reader*/) { + std::unique_ptr* /*table_reader*/) { // We need to turn off pre-fetching of index and filter nodes for // BlockBasedTable if (BlockBasedTableFactory::kName == options_.table_factory->Name()) { @@ -143,12 +143,12 @@ Status SstFileReader::NewTableReader( std::move(file_), file_size, &table_reader_); } -Status SstFileReader::VerifyChecksum() { +Status SstFileDumper::VerifyChecksum() { return table_reader_->VerifyChecksum(); } -Status SstFileReader::DumpTable(const std::string& out_filename) { - unique_ptr out_file; +Status SstFileDumper::DumpTable(const std::string& out_filename) { + std::unique_ptr out_file; Env* env = Env::Default(); env->NewWritableFile(out_filename, &out_file, soptions_); Status s = table_reader_->DumpTable(out_file.get(), @@ -157,23 +157,23 @@ Status SstFileReader::DumpTable(const std::string& out_filename) { return s; } -uint64_t SstFileReader::CalculateCompressedTableSize( +uint64_t SstFileDumper::CalculateCompressedTableSize( const TableBuilderOptions& tb_options, size_t block_size) { - unique_ptr out_file; - unique_ptr env(NewMemEnv(Env::Default())); + std::unique_ptr out_file; + std::unique_ptr env(NewMemEnv(Env::Default())); env->NewWritableFile(testFileName, &out_file, soptions_); - unique_ptr dest_writer; + std::unique_ptr dest_writer; dest_writer.reset( new WritableFileWriter(std::move(out_file), testFileName, soptions_)); BlockBasedTableOptions table_options; table_options.block_size = block_size; BlockBasedTableFactory block_based_tf(table_options); - unique_ptr table_builder; + std::unique_ptr table_builder; table_builder.reset(block_based_tf.NewTableBuilder( tb_options, TablePropertiesCollectorFactory::Context::kUnknownColumnFamily, dest_writer.get())); - unique_ptr iter(table_reader_->NewIterator( + std::unique_ptr iter(table_reader_->NewIterator( ReadOptions(), moptions_.prefix_extractor.get())); for (iter->SeekToFirst(); iter->Valid(); iter->Next()) { if (!iter->status().ok()) { @@ -192,7 +192,7 @@ uint64_t SstFileReader::CalculateCompressedTableSize( return size; } -int SstFileReader::ShowAllCompressionSizes( +int SstFileDumper::ShowAllCompressionSizes( size_t block_size, const std::vector>& compression_types) { @@ -226,7 +226,7 @@ int SstFileReader::ShowAllCompressionSizes( return 0; } -Status SstFileReader::ReadTableProperties(uint64_t table_magic_number, +Status SstFileDumper::ReadTableProperties(uint64_t table_magic_number, RandomAccessFileReader* file, uint64_t file_size) { TableProperties* table_properties = nullptr; @@ -240,7 +240,7 @@ Status SstFileReader::ReadTableProperties(uint64_t table_magic_number, return s; } -Status SstFileReader::SetTableOptionsByMagicNumber( +Status SstFileDumper::SetTableOptionsByMagicNumber( uint64_t table_magic_number) { assert(table_properties_); if (table_magic_number == kBlockBasedTableMagicNumber || @@ -283,7 +283,7 @@ Status SstFileReader::SetTableOptionsByMagicNumber( return 
Status::OK(); } -Status SstFileReader::SetOldTableOptions() { +Status SstFileDumper::SetOldTableOptions() { assert(table_properties_ == nullptr); options_.table_factory = std::make_shared(); fprintf(stdout, "Sst file format: block-based(old version)\n"); @@ -291,7 +291,7 @@ Status SstFileReader::SetOldTableOptions() { return Status::OK(); } -Status SstFileReader::ReadSequential(bool print_kv, uint64_t read_num, +Status SstFileDumper::ReadSequential(bool print_kv, uint64_t read_num, bool has_from, const std::string& from_key, bool has_to, const std::string& to_key, bool use_from_as_prefix) { @@ -348,7 +348,7 @@ Status SstFileReader::ReadSequential(bool print_kv, uint64_t read_num, return ret; } -Status SstFileReader::ReadTableProperties( +Status SstFileDumper::ReadTableProperties( std::shared_ptr* table_properties) { if (!table_reader_) { return init_result_; @@ -570,16 +570,16 @@ int SSTDumpTool::Run(int argc, char** argv) { filename = std::string(dir_or_file) + "/" + filename; } - rocksdb::SstFileReader reader(filename, verify_checksum, + rocksdb::SstFileDumper dumper(filename, verify_checksum, output_hex); - if (!reader.getStatus().ok()) { + if (!dumper.getStatus().ok()) { fprintf(stderr, "%s: %s\n", filename.c_str(), - reader.getStatus().ToString().c_str()); + dumper.getStatus().ToString().c_str()); continue; } if (command == "recompress") { - reader.ShowAllCompressionSizes( + dumper.ShowAllCompressionSizes( set_block_size ? block_size : 16384, compression_types.empty() ? kCompressions : compression_types); return 0; @@ -589,7 +589,7 @@ int SSTDumpTool::Run(int argc, char** argv) { std::string out_filename = filename.substr(0, filename.length() - 4); out_filename.append("_dump.txt"); - st = reader.DumpTable(out_filename); + st = dumper.DumpTable(out_filename); if (!st.ok()) { fprintf(stderr, "%s: %s\n", filename.c_str(), st.ToString().c_str()); exit(1); @@ -601,7 +601,7 @@ int SSTDumpTool::Run(int argc, char** argv) { // scan all files in give file path. if (command == "" || command == "scan" || command == "check") { - st = reader.ReadSequential( + st = dumper.ReadSequential( command == "scan", read_num > 0 ? 
(read_num - total_read) : read_num, has_from || use_from_as_prefix, from_key, has_to, to_key, use_from_as_prefix); @@ -609,14 +609,14 @@ int SSTDumpTool::Run(int argc, char** argv) { fprintf(stderr, "%s: %s\n", filename.c_str(), st.ToString().c_str()); } - total_read += reader.GetReadNumber(); + total_read += dumper.GetReadNumber(); if (read_num > 0 && total_read > read_num) { break; } } if (command == "verify") { - st = reader.VerifyChecksum(); + st = dumper.VerifyChecksum(); if (!st.ok()) { fprintf(stderr, "%s is corrupted: %s\n", filename.c_str(), st.ToString().c_str()); @@ -631,11 +631,11 @@ int SSTDumpTool::Run(int argc, char** argv) { std::shared_ptr table_properties_from_reader; - st = reader.ReadTableProperties(&table_properties_from_reader); + st = dumper.ReadTableProperties(&table_properties_from_reader); if (!st.ok()) { fprintf(stderr, "%s: %s\n", filename.c_str(), st.ToString().c_str()); fprintf(stderr, "Try to use initial table properties\n"); - table_properties = reader.GetInitTableProperties(); + table_properties = dumper.GetInitTableProperties(); } else { table_properties = table_properties_from_reader.get(); } @@ -646,19 +646,6 @@ int SSTDumpTool::Run(int argc, char** argv) { "------------------------------\n" " %s", table_properties->ToString("\n ", ": ").c_str()); - fprintf(stdout, "# deleted keys: %" PRIu64 "\n", - rocksdb::GetDeletedKeys( - table_properties->user_collected_properties)); - - bool property_present; - uint64_t merge_operands = rocksdb::GetMergeOperands( - table_properties->user_collected_properties, &property_present); - if (property_present) { - fprintf(stdout, " # merge operands: %" PRIu64 "\n", - merge_operands); - } else { - fprintf(stdout, " # merge operands: UNKNOWN\n"); - } } total_num_files += 1; total_num_data_blocks += table_properties->num_data_blocks; diff --git a/tools/sst_dump_tool_imp.h b/tools/sst_dump_tool_imp.h index ca60dd93c9c..9e83d8d0402 100644 --- a/tools/sst_dump_tool_imp.h +++ b/tools/sst_dump_tool_imp.h @@ -15,9 +15,9 @@ namespace rocksdb { -class SstFileReader { +class SstFileDumper { public: - explicit SstFileReader(const std::string& file_name, bool verify_checksum, + explicit SstFileDumper(const std::string& file_name, bool verify_checksum, bool output_hex); Status ReadSequential(bool print_kv, uint64_t read_num, bool has_from, @@ -57,7 +57,7 @@ class SstFileReader { const EnvOptions& soptions, const InternalKeyComparator& internal_comparator, uint64_t file_size, - unique_ptr* table_reader); + std::unique_ptr* table_reader); std::string file_name_; uint64_t read_num_; @@ -70,13 +70,13 @@ class SstFileReader { Options options_; Status init_result_; - unique_ptr table_reader_; - unique_ptr file_; + std::unique_ptr table_reader_; + std::unique_ptr file_; const ImmutableCFOptions ioptions_; const MutableCFOptions moptions_; InternalKeyComparator internal_comparator_; - unique_ptr table_properties_; + std::unique_ptr table_properties_; }; } // namespace rocksdb diff --git a/tools/trace_analyzer_tool.cc b/tools/trace_analyzer_tool.cc index 7915322f0e7..49f2175a394 100644 --- a/tools/trace_analyzer_tool.cc +++ b/tools/trace_analyzer_tool.cc @@ -139,7 +139,7 @@ DEFINE_bool(no_key, false, DEFINE_bool(print_overall_stats, true, " Print the stats of the whole trace, " "like total requests, keys, and etc."); -DEFINE_bool(print_key_distribution, false, "Print the key size distribution."); +DEFINE_bool(output_key_distribution, false, "Print the key size distribution."); DEFINE_bool( output_value_distribution, false, "Out put the value size 
distribution, only available for Put and Merge.\n" @@ -158,6 +158,9 @@ DEFINE_int32(value_interval, 8, "To output the value distribution, we need to set the value " "intervals and make the statistic of the value size distribution " "in different intervals. The default is 8."); +DEFINE_double(sample_ratio, 1.0, + "If the trace size is extremely huge or user want to sample " + "the trace when analyzing, sample ratio can be set (0, 1.0]"); namespace rocksdb { @@ -276,9 +279,17 @@ TraceAnalyzer::TraceAnalyzer(std::string& trace_path, std::string& output_path, total_access_keys_ = 0; total_gets_ = 0; total_writes_ = 0; + trace_create_time_ = 0; begin_time_ = 0; end_time_ = 0; time_series_start_ = 0; + cur_time_sec_ = 0; + if (FLAGS_sample_ratio > 1.0 || FLAGS_sample_ratio <= 0) { + sample_max_ = 1; + } else { + sample_max_ = static_cast(1.0 / FLAGS_sample_ratio); + } + ta_.resize(kTaTypeNum); ta_[0].type_name = "get"; if (FLAGS_analyze_get) { @@ -328,6 +339,9 @@ TraceAnalyzer::TraceAnalyzer(std::string& trace_path, std::string& output_path, } else { ta_[7].enabled = false; } + for (int i = 0; i < kTaTypeNum; i++) { + ta_[i].sample_count = 0; + } } TraceAnalyzer::~TraceAnalyzer() {} @@ -363,6 +377,13 @@ Status TraceAnalyzer::PrepareProcessing() { if (!s.ok()) { return s; } + + qps_stats_name = + output_path_ + "/" + FLAGS_output_prefix + "-cf_qps_stats.txt"; + s = env_->NewWritableFile(qps_stats_name, &cf_qps_f_, env_options_); + if (!s.ok()) { + return s; + } } return Status::OK(); } @@ -422,6 +443,7 @@ Status TraceAnalyzer::StartProcessing() { fprintf(stderr, "Cannot read the header\n"); return s; } + trace_create_time_ = header.ts; if (FLAGS_output_time_series) { time_series_start_ = header.ts; } @@ -521,7 +543,7 @@ Status TraceAnalyzer::MakeStatistics() { } // Generate the key size distribution data - if (FLAGS_print_key_distribution) { + if (FLAGS_output_key_distribution) { if (stat.second.a_key_size_stats.find(record.first.size()) == stat.second.a_key_size_stats.end()) { stat.second.a_key_size_stats[record.first.size()] = 1; @@ -565,17 +587,31 @@ Status TraceAnalyzer::MakeStatistics() { // find the medium of the key size uint64_t k_count = 0; + bool get_mid = false; for (auto& record : stat.second.a_key_size_stats) { k_count += record.second; - if (k_count >= stat.second.a_key_mid) { + if (!get_mid && k_count >= stat.second.a_key_mid) { stat.second.a_key_mid = record.first; - break; + get_mid = true; + } + if (FLAGS_output_key_distribution && stat.second.a_key_size_f) { + ret = sprintf(buffer_, "%" PRIu64 " %" PRIu64 "\n", record.first, + record.second); + if (ret < 0) { + return Status::IOError("Format output failed"); + } + std::string printout(buffer_); + s = stat.second.a_key_size_f->Append(printout); + if (!s.ok()) { + fprintf(stderr, "Write key size distribution file failed\n"); + return s; + } } } // output the value size distribution uint64_t v_begin = 0, v_end = 0, v_count = 0; - bool get_mid = false; + get_mid = false; for (auto& record : stat.second.a_value_size_stats) { v_begin = v_end; v_end = (record.first + 1) * FLAGS_value_interval; @@ -740,6 +776,9 @@ Status TraceAnalyzer::MakeStatisticCorrelation(TraceStats& stats, // Process the statistics of QPS Status TraceAnalyzer::MakeStatisticQPS() { + if(begin_time_ == 0) { + begin_time_ = trace_create_time_; + } uint32_t duration = static_cast((end_time_ - begin_time_) / 1000000); int ret; @@ -818,6 +857,32 @@ Status TraceAnalyzer::MakeStatisticQPS() { stat.second.a_ave_qps = (static_cast(cf_qps_sum)) / duration; } + // Output the 
accessed unique key number change overtime + if (stat.second.a_key_num_f) { + uint64_t cur_uni_key = + static_cast(stat.second.a_key_stats.size()); + double cur_ratio = 0.0; + uint64_t cur_num = 0; + for (uint32_t i = 0; i < duration; i++) { + auto find_time = stat.second.uni_key_num.find(i); + if (find_time != stat.second.uni_key_num.end()) { + cur_ratio = (static_cast(find_time->second)) / cur_uni_key; + cur_num = find_time->second; + } + ret = sprintf(buffer_, "%" PRIu64 " %.12f\n", cur_num, cur_ratio); + if (ret < 0) { + return Status::IOError("Format the output failed"); + } + std::string printout(buffer_); + s = stat.second.a_key_num_f->Append(printout); + if (!s.ok()) { + fprintf(stderr, + "Write accessed unique key number change file failed\n"); + return s; + } + } + } + // output the prefix of top k access peak if (FLAGS_output_prefix_cut > 0 && stat.second.a_top_qps_prefix_f) { while (!stat.second.top_k_qps_sec.empty()) { @@ -882,6 +947,33 @@ Status TraceAnalyzer::MakeStatisticQPS() { } } + if (cf_qps_f_) { + int cfs_size = static_cast(cfs_.size()); + uint32_t v; + for (uint32_t i = 0; i < duration; i++) { + for (int cf = 0; cf < cfs_size; cf++) { + if (cfs_[cf].cf_qps.find(i) != cfs_[cf].cf_qps.end()) { + v = cfs_[cf].cf_qps[i]; + } else { + v = 0; + } + if (cf < cfs_size - 1) { + ret = sprintf(buffer_, "%u ", v); + } else { + ret = sprintf(buffer_, "%u\n", v); + } + if (ret < 0) { + return Status::IOError("Format the output failed"); + } + std::string printout(buffer_); + s = cf_qps_f_->Append(printout); + if (!s.ok()) { + return s; + } + } + } + } + qps_peak_ = qps_peak; for (int type = 0; type <= kTaTypeNum; type++) { if (duration == 0) { @@ -1010,7 +1102,7 @@ Status TraceAnalyzer::ReProcessing() { } // Make the statistics fo the key size distribution - if (FLAGS_print_key_distribution) { + if (FLAGS_output_key_distribution) { if (cfs_[cf_id].w_key_size_stats.find(input_key.size()) == cfs_[cf_id].w_key_size_stats.end()) { cfs_[cf_id].w_key_size_stats[input_key.size()] = 1; @@ -1129,6 +1221,11 @@ Status TraceAnalyzer::KeyStatsInsertion(const uint32_t& type, tmp_qps_map[prefix] = 1; ta_[type].stats[cf_id].a_qps_prefix_stats[time_in_sec] = tmp_qps_map; } + if (time_in_sec != cur_time_sec_) { + ta_[type].stats[cf_id].uni_key_num[cur_time_sec_] = + static_cast(ta_[type].stats[cf_id].a_key_stats.size()); + cur_time_sec_ = time_in_sec; + } } else { found_stats->second.a_count++; found_stats->second.a_key_size_sqsum += MultiplyCheckOverflow( @@ -1149,6 +1246,11 @@ Status TraceAnalyzer::KeyStatsInsertion(const uint32_t& type, s = StatsUnitCorrelationUpdate(found_key->second, type, ts, key); } } + if (time_in_sec != cur_time_sec_) { + found_stats->second.uni_key_num[cur_time_sec_] = + static_cast(found_stats->second.a_key_stats.size()); + cur_time_sec_ = time_in_sec; + } auto found_value = found_stats->second.a_value_size_stats.find(dist_value_size); @@ -1189,6 +1291,10 @@ Status TraceAnalyzer::KeyStatsInsertion(const uint32_t& type, cfs_[cf_id] = cf_unit; } + if (FLAGS_output_qps_stats) { + cfs_[cf_id].cf_qps[time_in_sec]++; + } + if (FLAGS_output_time_series) { TraceUnit trace_u; trace_u.type = type; @@ -1251,6 +1357,9 @@ Status TraceAnalyzer::OpenStatsOutputFiles(const std::string& type, if (FLAGS_output_key_stats) { s = CreateOutputFile(type, new_stats.cf_name, "accessed_key_stats.txt", &new_stats.a_key_f); + s = CreateOutputFile(type, new_stats.cf_name, + "accessed_unique_key_num_change.txt", + &new_stats.a_key_num_f); if (!FLAGS_key_space_dir.empty()) { s = CreateOutputFile(type, 
new_stats.cf_name, "whole_key_stats.txt", &new_stats.w_key_f); @@ -1289,6 +1398,12 @@ Status TraceAnalyzer::OpenStatsOutputFiles(const std::string& type, &new_stats.a_value_size_f); } + if (FLAGS_output_key_distribution) { + s = CreateOutputFile(type, new_stats.cf_name, + "accessed_key_size_distribution.txt", + &new_stats.a_key_size_f); + } + if (FLAGS_output_qps_stats) { s = CreateOutputFile(type, new_stats.cf_name, "qps_stats.txt", &new_stats.a_qps_f); @@ -1328,6 +1443,10 @@ void TraceAnalyzer::CloseOutputFiles() { stat.second.a_key_f->Close(); } + if (stat.second.a_key_num_f) { + stat.second.a_key_num_f->Close(); + } + if (stat.second.a_count_dist_f) { stat.second.a_count_dist_f->Close(); } @@ -1340,6 +1459,10 @@ void TraceAnalyzer::CloseOutputFiles() { stat.second.a_value_size_f->Close(); } + if (stat.second.a_key_size_f) { + stat.second.a_key_size_f->Close(); + } + if (stat.second.a_qps_f) { stat.second.a_qps_f->Close(); } @@ -1373,6 +1496,15 @@ Status TraceAnalyzer::HandleGet(uint32_t column_family_id, } } + if (ta_[TraceOperationType::kGet].sample_count >= sample_max_) { + ta_[TraceOperationType::kGet].sample_count = 0; + } + if (ta_[TraceOperationType::kGet].sample_count > 0) { + ta_[TraceOperationType::kGet].sample_count++; + return Status::OK(); + } + ta_[TraceOperationType::kGet].sample_count++; + if (!ta_[TraceOperationType::kGet].enabled) { return Status::OK(); } @@ -1400,6 +1532,15 @@ Status TraceAnalyzer::HandlePut(uint32_t column_family_id, const Slice& key, } } + if (ta_[TraceOperationType::kPut].sample_count >= sample_max_) { + ta_[TraceOperationType::kPut].sample_count = 0; + } + if (ta_[TraceOperationType::kPut].sample_count > 0) { + ta_[TraceOperationType::kPut].sample_count++; + return Status::OK(); + } + ta_[TraceOperationType::kPut].sample_count++; + if (!ta_[TraceOperationType::kPut].enabled) { return Status::OK(); } @@ -1424,6 +1565,15 @@ Status TraceAnalyzer::HandleDelete(uint32_t column_family_id, } } + if (ta_[TraceOperationType::kDelete].sample_count >= sample_max_) { + ta_[TraceOperationType::kDelete].sample_count = 0; + } + if (ta_[TraceOperationType::kDelete].sample_count > 0) { + ta_[TraceOperationType::kDelete].sample_count++; + return Status::OK(); + } + ta_[TraceOperationType::kDelete].sample_count++; + if (!ta_[TraceOperationType::kDelete].enabled) { return Status::OK(); } @@ -1448,6 +1598,15 @@ Status TraceAnalyzer::HandleSingleDelete(uint32_t column_family_id, } } + if (ta_[TraceOperationType::kSingleDelete].sample_count >= sample_max_) { + ta_[TraceOperationType::kSingleDelete].sample_count = 0; + } + if (ta_[TraceOperationType::kSingleDelete].sample_count > 0) { + ta_[TraceOperationType::kSingleDelete].sample_count++; + return Status::OK(); + } + ta_[TraceOperationType::kSingleDelete].sample_count++; + if (!ta_[TraceOperationType::kSingleDelete].enabled) { return Status::OK(); } @@ -1473,6 +1632,15 @@ Status TraceAnalyzer::HandleDeleteRange(uint32_t column_family_id, } } + if (ta_[TraceOperationType::kRangeDelete].sample_count >= sample_max_) { + ta_[TraceOperationType::kRangeDelete].sample_count = 0; + } + if (ta_[TraceOperationType::kRangeDelete].sample_count > 0) { + ta_[TraceOperationType::kRangeDelete].sample_count++; + return Status::OK(); + } + ta_[TraceOperationType::kRangeDelete].sample_count++; + if (!ta_[TraceOperationType::kRangeDelete].enabled) { return Status::OK(); } @@ -1499,6 +1667,15 @@ Status TraceAnalyzer::HandleMerge(uint32_t column_family_id, const Slice& key, } } + if (ta_[TraceOperationType::kMerge].sample_count >= 
sample_max_) { + ta_[TraceOperationType::kMerge].sample_count = 0; + } + if (ta_[TraceOperationType::kMerge].sample_count > 0) { + ta_[TraceOperationType::kMerge].sample_count++; + return Status::OK(); + } + ta_[TraceOperationType::kMerge].sample_count++; + if (!ta_[TraceOperationType::kMerge].enabled) { return Status::OK(); } @@ -1535,6 +1712,15 @@ Status TraceAnalyzer::HandleIter(uint32_t column_family_id, } } + if (ta_[type].sample_count >= sample_max_) { + ta_[type].sample_count = 0; + } + if (ta_[type].sample_count > 0) { + ta_[type].sample_count++; + return Status::OK(); + } + ta_[type].sample_count++; + if (!ta_[type].enabled) { return Status::OK(); } @@ -1596,6 +1782,8 @@ void TraceAnalyzer::PrintStatistics() { ta_[type].total_succ_access += stat.a_succ_count; printf("*********************************************************\n"); printf("colume family id: %u\n", stat.cf_id); + printf("Total number of queries to this cf by %s: %" PRIu64 "\n", + ta_[type].type_name.c_str(), stat.a_count); printf("Total unique keys in this cf: %" PRIu64 "\n", total_a_keys); printf("Average key size: %f key size medium: %" PRIu64 " Key size Variation: %f\n", @@ -1642,15 +1830,6 @@ void TraceAnalyzer::PrintStatistics() { } } - // print the key size distribution - if (FLAGS_print_key_distribution) { - printf("The key size distribution\n"); - for (auto& record : stat.a_key_size_stats) { - printf("key_size %" PRIu64 " nums: %" PRIu64 "\n", record.first, - record.second); - } - } - // print the operation correlations if (!FLAGS_print_correlation.empty()) { for (int correlation = 0; @@ -1700,6 +1879,8 @@ void TraceAnalyzer::PrintStatistics() { printf("Average QPS per second: %f Peak QPS: %u\n", qps_ave_[kTaTypeNum], qps_peak_[kTaTypeNum]); } + printf("The statistics related to query number need to times: %u\n", + sample_max_); printf("Total_requests: %" PRIu64 " Total_accessed_keys: %" PRIu64 " Total_gets: %" PRIu64 " Total_write_batch: %" PRIu64 "\n", total_requests_, total_access_keys_, total_gets_, total_writes_); diff --git a/tools/trace_analyzer_tool.h b/tools/trace_analyzer_tool.h index ac9f42f1c07..be96f5005da 100644 --- a/tools/trace_analyzer_tool.h +++ b/tools/trace_analyzer_tool.h @@ -115,12 +115,15 @@ struct TraceStats { top_k_qps_sec; std::list time_series; std::vector> correlation_output; + std::map uni_key_num; std::unique_ptr time_series_f; std::unique_ptr a_key_f; std::unique_ptr a_count_dist_f; std::unique_ptr a_prefix_cut_f; std::unique_ptr a_value_size_f; + std::unique_ptr a_key_size_f; + std::unique_ptr a_key_num_f; std::unique_ptr a_qps_f; std::unique_ptr a_top_qps_prefix_f; std::unique_ptr w_key_f; @@ -140,6 +143,7 @@ struct TypeUnit { uint64_t total_keys; uint64_t total_access; uint64_t total_succ_access; + uint32_t sample_count; std::map stats; TypeUnit() = default; ~TypeUnit() = default; @@ -155,6 +159,7 @@ struct CfUnit { uint64_t a_count; // the total keys in this cf that are accessed std::map w_key_size_stats; // whole key space key size // statistic this cf + std::map cf_qps; }; class TraceAnalyzer { @@ -204,11 +209,15 @@ class TraceAnalyzer { uint64_t total_access_keys_; uint64_t total_gets_; uint64_t total_writes_; + uint64_t trace_create_time_; uint64_t begin_time_; uint64_t end_time_; uint64_t time_series_start_; + uint32_t sample_max_; + uint32_t cur_time_sec_; std::unique_ptr trace_sequence_f_; // readable trace std::unique_ptr qps_f_; // overall qps + std::unique_ptr cf_qps_f_; // The qps of each CF> std::unique_ptr wkey_input_f_; std::vector ta_; // The main statistic 
collecting data structure std::map cfs_; // All the cf_id appears in this trace; diff --git a/util/auto_roll_logger_test.cc b/util/auto_roll_logger_test.cc index 5a6b3abc112..284a9815218 100644 --- a/util/auto_roll_logger_test.cc +++ b/util/auto_roll_logger_test.cc @@ -230,7 +230,7 @@ TEST_F(AutoRollLoggerTest, CompositeRollByTimeAndSizeLogger) { TEST_F(AutoRollLoggerTest, CreateLoggerFromOptions) { DBOptions options; NoSleepEnv nse(Env::Default()); - shared_ptr logger; + std::shared_ptr logger; // Normal logger ASSERT_OK(CreateLoggerFromOptions(kTestDir, options, &logger)); @@ -273,7 +273,7 @@ TEST_F(AutoRollLoggerTest, CreateLoggerFromOptions) { TEST_F(AutoRollLoggerTest, LogFlushWhileRolling) { DBOptions options; - shared_ptr logger; + std::shared_ptr logger; InitTestDb(); options.max_log_file_size = 1024 * 5; diff --git a/util/autovector.h b/util/autovector.h index b5c84712450..97348d818a0 100644 --- a/util/autovector.h +++ b/util/autovector.h @@ -271,7 +271,12 @@ class autovector { template void emplace_back(Args&&... args) { - push_back(value_type(args...)); + if (num_stack_items_ < kSize) { + values_[num_stack_items_++] = + std::move(value_type(std::forward(args)...)); + } else { + vect_.emplace_back(std::forward(args)...); + } } void pop_back() { diff --git a/util/compression.h b/util/compression.h index e918e14fbec..e91faeac658 100644 --- a/util/compression.h +++ b/util/compression.h @@ -14,8 +14,10 @@ #include #include "rocksdb/options.h" +#include "rocksdb/table.h" #include "util/coding.h" #include "util/compression_context_cache.h" +#include "util/memory_allocator.h" #ifdef SNAPPY #include @@ -495,11 +497,10 @@ inline bool Zlib_Compress(const CompressionContext& ctx, // header in varint32 format // @param compression_dict Data for presetting the compression library's // dictionary. -inline char* Zlib_Uncompress(const UncompressionContext& ctx, - const char* input_data, size_t input_length, - int* decompress_size, - uint32_t compress_format_version, - int windowBits = -14) { +inline CacheAllocationPtr Zlib_Uncompress( + const UncompressionContext& ctx, const char* input_data, + size_t input_length, int* decompress_size, uint32_t compress_format_version, + MemoryAllocator* allocator = nullptr, int windowBits = -14) { #ifdef ZLIB uint32_t output_len = 0; if (compress_format_version == 2) { @@ -541,9 +542,9 @@ inline char* Zlib_Uncompress(const UncompressionContext& ctx, _stream.next_in = (Bytef*)input_data; _stream.avail_in = static_cast(input_length); - char* output = new char[output_len]; + auto output = AllocateBlock(output_len, allocator); - _stream.next_out = (Bytef*)output; + _stream.next_out = (Bytef*)output.get(); _stream.avail_out = static_cast(output_len); bool done = false; @@ -561,19 +562,17 @@ inline char* Zlib_Uncompress(const UncompressionContext& ctx, size_t old_sz = output_len; uint32_t output_len_delta = output_len / 5; output_len += output_len_delta < 10 ? 10 : output_len_delta; - char* tmp = new char[output_len]; - memcpy(tmp, output, old_sz); - delete[] output; - output = tmp; + auto tmp = AllocateBlock(output_len, allocator); + memcpy(tmp.get(), output.get(), old_sz); + output = std::move(tmp); // Set more output. 
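// Note: AllocateBlock() returns a CacheAllocationPtr, a unique_ptr over the decompression buffer
// whose CustomDeleter hands the memory back to the supplied MemoryAllocator, or falls back to
// delete[] when no allocator is given; that is why the error paths below can simply return
// nullptr without freeing the buffer manually.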
- _stream.next_out = (Bytef*)(output + old_sz); + _stream.next_out = (Bytef*)(output.get() + old_sz); _stream.avail_out = static_cast(output_len - old_sz); break; } case Z_BUF_ERROR: default: - delete[] output; inflateEnd(&_stream); return nullptr; } @@ -590,6 +589,7 @@ inline char* Zlib_Uncompress(const UncompressionContext& ctx, (void)input_length; (void)decompress_size; (void)compress_format_version; + (void)allocator; (void)windowBits; return nullptr; #endif @@ -660,9 +660,9 @@ inline bool BZip2_Compress(const CompressionContext& /*ctx*/, // block header // compress_format_version == 2 -- decompressed size is included in the block // header in varint32 format -inline char* BZip2_Uncompress(const char* input_data, size_t input_length, - int* decompress_size, - uint32_t compress_format_version) { +inline CacheAllocationPtr BZip2_Uncompress( + const char* input_data, size_t input_length, int* decompress_size, + uint32_t compress_format_version, MemoryAllocator* allocator = nullptr) { #ifdef BZIP2 uint32_t output_len = 0; if (compress_format_version == 2) { @@ -690,9 +690,9 @@ inline char* BZip2_Uncompress(const char* input_data, size_t input_length, _stream.next_in = (char*)input_data; _stream.avail_in = static_cast(input_length); - char* output = new char[output_len]; + auto output = AllocateBlock(output_len, allocator); - _stream.next_out = (char*)output; + _stream.next_out = (char*)output.get(); _stream.avail_out = static_cast(output_len); bool done = false; @@ -709,18 +709,16 @@ inline char* BZip2_Uncompress(const char* input_data, size_t input_length, assert(compress_format_version != 2); uint32_t old_sz = output_len; output_len = output_len * 1.2; - char* tmp = new char[output_len]; - memcpy(tmp, output, old_sz); - delete[] output; - output = tmp; + auto tmp = AllocateBlock(output_len, allocator); + memcpy(tmp.get(), output.get(), old_sz); + output = std::move(tmp); // Set more output. - _stream.next_out = (char*)(output + old_sz); + _stream.next_out = (char*)(output.get() + old_sz); _stream.avail_out = static_cast(output_len - old_sz); break; } default: - delete[] output; BZ2_bzDecompressEnd(&_stream); return nullptr; } @@ -736,6 +734,7 @@ inline char* BZip2_Uncompress(const char* input_data, size_t input_length, (void)input_length; (void)decompress_size; (void)compress_format_version; + (void)allocator; return nullptr; #endif } @@ -791,6 +790,7 @@ inline bool LZ4_Compress(const CompressionContext& ctx, #else // up to r123 outlen = LZ4_compress_limitedOutput(input, &(*output)[output_header_len], static_cast(length), compress_bound); + (void)ctx; #endif // LZ4_VERSION_NUMBER >= 10400 if (outlen == 0) { @@ -814,10 +814,12 @@ inline bool LZ4_Compress(const CompressionContext& ctx, // header in varint32 format // @param compression_dict Data for presetting the compression library's // dictionary. 
-inline char* LZ4_Uncompress(const UncompressionContext& ctx, - const char* input_data, size_t input_length, - int* decompress_size, - uint32_t compress_format_version) { +inline CacheAllocationPtr LZ4_Uncompress(const UncompressionContext& ctx, + const char* input_data, + size_t input_length, + int* decompress_size, + uint32_t compress_format_version, + MemoryAllocator* allocator = nullptr) { #ifdef LZ4 uint32_t output_len = 0; if (compress_format_version == 2) { @@ -837,7 +839,7 @@ inline char* LZ4_Uncompress(const UncompressionContext& ctx, input_data += 8; } - char* output = new char[output_len]; + auto output = AllocateBlock(output_len, allocator); #if LZ4_VERSION_NUMBER >= 10400 // r124+ LZ4_streamDecode_t* stream = LZ4_createStreamDecode(); if (ctx.dict().size()) { @@ -845,17 +847,17 @@ inline char* LZ4_Uncompress(const UncompressionContext& ctx, static_cast(ctx.dict().size())); } *decompress_size = LZ4_decompress_safe_continue( - stream, input_data, output, static_cast(input_length), + stream, input_data, output.get(), static_cast(input_length), static_cast(output_len)); LZ4_freeStreamDecode(stream); #else // up to r123 - *decompress_size = - LZ4_decompress_safe(input_data, output, static_cast(input_length), - static_cast(output_len)); + *decompress_size = LZ4_decompress_safe(input_data, output.get(), + static_cast(input_length), + static_cast(output_len)); + (void)ctx; #endif // LZ4_VERSION_NUMBER >= 10400 if (*decompress_size < 0) { - delete[] output; return nullptr; } assert(*decompress_size == static_cast(output_len)); @@ -866,6 +868,7 @@ inline char* LZ4_Uncompress(const UncompressionContext& ctx, (void)input_length; (void)decompress_size; (void)compress_format_version; + (void)allocator; return nullptr; #endif } @@ -1028,9 +1031,10 @@ inline bool ZSTD_Compress(const CompressionContext& ctx, const char* input, // @param compression_dict Data for presetting the compression library's // dictionary. 
-inline char* ZSTD_Uncompress(const UncompressionContext& ctx, - const char* input_data, size_t input_length, - int* decompress_size) { +inline CacheAllocationPtr ZSTD_Uncompress( + const UncompressionContext& ctx, const char* input_data, + size_t input_length, int* decompress_size, + MemoryAllocator* allocator = nullptr) { #ifdef ZSTD uint32_t output_len = 0; if (!compression::GetDecompressedSizeInfo(&input_data, &input_length, @@ -1038,17 +1042,17 @@ inline char* ZSTD_Uncompress(const UncompressionContext& ctx, return nullptr; } - char* output = new char[output_len]; + auto output = AllocateBlock(output_len, allocator); size_t actual_output_length; #if ZSTD_VERSION_NUMBER >= 500 // v0.5.0+ ZSTD_DCtx* context = ctx.GetZSTDContext(); assert(context != nullptr); actual_output_length = ZSTD_decompress_usingDict( - context, output, output_len, input_data, input_length, ctx.dict().data(), - ctx.dict().size()); + context, output.get(), output_len, input_data, input_length, + ctx.dict().data(), ctx.dict().size()); #else // up to v0.4.x actual_output_length = - ZSTD_decompress(output, output_len, input_data, input_length); + ZSTD_decompress(output.get(), output_len, input_data, input_length); #endif // ZSTD_VERSION_NUMBER >= 500 assert(actual_output_length == output_len); *decompress_size = static_cast(actual_output_length); @@ -1058,6 +1062,7 @@ inline char* ZSTD_Uncompress(const UncompressionContext& ctx, (void)input_data; (void)input_length; (void)decompress_size; + (void)allocator; return nullptr; #endif } diff --git a/util/delete_scheduler.cc b/util/delete_scheduler.cc index 1d51055a3bf..f5ee2844896 100644 --- a/util/delete_scheduler.cc +++ b/util/delete_scheduler.cc @@ -52,11 +52,12 @@ DeleteScheduler::~DeleteScheduler() { } Status DeleteScheduler::DeleteFile(const std::string& file_path, - const std::string& dir_to_sync) { + const std::string& dir_to_sync, + const bool force_bg) { Status s; - if (rate_bytes_per_sec_.load() <= 0 || + if (rate_bytes_per_sec_.load() <= 0 || (!force_bg && total_trash_size_.load() > - sst_file_manager_->GetTotalSize() * max_trash_db_ratio_.load()) { + sst_file_manager_->GetTotalSize() * max_trash_db_ratio_.load())) { // Rate limiting is disabled or trash size makes up more than // max_trash_db_ratio_ (default 25%) of the total DB size TEST_SYNC_POINT("DeleteScheduler::DeleteFile"); @@ -275,7 +276,7 @@ Status DeleteScheduler::DeleteTrashFile(const std::string& path_in_trash, Status my_status = env_->NumFileLinks(path_in_trash, &num_hard_links); if (my_status.ok()) { if (num_hard_links == 1) { - unique_ptr wf; + std::unique_ptr wf; my_status = env_->ReopenWritableFile(path_in_trash, &wf, EnvOptions()); if (my_status.ok()) { diff --git a/util/delete_scheduler.h b/util/delete_scheduler.h index cbd13ecefd0..29b70517b67 100644 --- a/util/delete_scheduler.h +++ b/util/delete_scheduler.h @@ -46,8 +46,11 @@ class DeleteScheduler { rate_bytes_per_sec_.store(bytes_per_sec); } - // Mark file as trash directory and schedule it's deletion - Status DeleteFile(const std::string& fname, const std::string& dir_to_sync); + // Mark file as trash directory and schedule it's deletion. If force_bg is + // set, it forces the file to always be deleted in the background thread, + // except when rate limiting is disabled + Status DeleteFile(const std::string& fname, const std::string& dir_to_sync, + const bool force_bg = false); // Wait for all files being deleteing in the background to finish or for // destructor to be called. 
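The `force_bg` flag added to `DeleteScheduler::DeleteFile` above lets a caller insist on background, rate-limited deletion even when accumulated trash already exceeds `max_trash_db_ratio` (25% of the DB size by default); only a disabled rate limit still falls back to an immediate delete. Below is a minimal caller sketch using the `DeleteDBFile` helper this patch introduces in `util/file_util.cc` further down, assuming an `ImmutableDBOptions` obtained elsewhere; the wrapper name `PurgeObsoleteFile` is purely illustrative.

```cpp
// Sketch only: both headers are internal, so this stands in for RocksDB's own
// file-deletion call sites rather than application code.
#include <string>

#include "options/db_options.h"  // rocksdb::ImmutableDBOptions
#include "util/file_util.h"      // rocksdb::DeleteDBFile

rocksdb::Status PurgeObsoleteFile(const rocksdb::ImmutableDBOptions* db_options,
                                  const std::string& fname,
                                  const std::string& dir_to_sync) {
  // force_bg = true: always hand the file to the DeleteScheduler's background
  // thread, skipping the trash-to-DB-size ratio check; if rate limiting is
  // disabled the file is still deleted immediately.
  return rocksdb::DeleteDBFile(db_options, fname, dir_to_sync,
                               /*force_bg=*/true);
}
```

With `force_bg=false` (the default) the behavior matches the existing `DeleteSSTFile` path, which the patch reimplements as a thin wrapper around `DeleteDBFile`.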
diff --git a/util/fault_injection_test_env.cc b/util/fault_injection_test_env.cc index 3b3dbbe99bd..64e9da1aac6 100644 --- a/util/fault_injection_test_env.cc +++ b/util/fault_injection_test_env.cc @@ -29,12 +29,12 @@ std::string GetDirName(const std::string filename) { // A basic file truncation function suitable for this test. Status Truncate(Env* env, const std::string& filename, uint64_t length) { - unique_ptr orig_file; + std::unique_ptr orig_file; const EnvOptions options; Status s = env->NewSequentialFile(filename, &orig_file, options); if (!s.ok()) { - fprintf(stderr, "Cannot truncate file %s: %s\n", filename.c_str(), - s.ToString().c_str()); + fprintf(stderr, "Cannot open file %s for truncation: %s\n", + filename.c_str(), s.ToString().c_str()); return s; } @@ -46,7 +46,7 @@ Status Truncate(Env* env, const std::string& filename, uint64_t length) { #endif if (s.ok()) { std::string tmp_name = GetDirName(filename) + "/truncate.tmp"; - unique_ptr tmp_file; + std::unique_ptr tmp_file; s = env->NewWritableFile(tmp_name, &tmp_file, options); if (s.ok()) { s = tmp_file->Append(result); @@ -103,7 +103,7 @@ Status TestDirectory::Fsync() { } TestWritableFile::TestWritableFile(const std::string& fname, - unique_ptr&& f, + std::unique_ptr&& f, FaultInjectionTestEnv* env) : state_(fname), target_(std::move(f)), @@ -157,8 +157,8 @@ Status TestWritableFile::Sync() { } Status FaultInjectionTestEnv::NewDirectory(const std::string& name, - unique_ptr* result) { - unique_ptr r; + std::unique_ptr* result) { + std::unique_ptr r; Status s = target()->NewDirectory(name, &r); assert(s.ok()); if (!s.ok()) { @@ -168,9 +168,9 @@ Status FaultInjectionTestEnv::NewDirectory(const std::string& name, return Status::OK(); } -Status FaultInjectionTestEnv::NewWritableFile(const std::string& fname, - unique_ptr* result, - const EnvOptions& soptions) { +Status FaultInjectionTestEnv::NewWritableFile( + const std::string& fname, std::unique_ptr* result, + const EnvOptions& soptions) { if (!IsFilesystemActive()) { return GetError(); } @@ -197,6 +197,27 @@ Status FaultInjectionTestEnv::NewWritableFile(const std::string& fname, return s; } +Status FaultInjectionTestEnv::ReopenWritableFile( + const std::string& fname, std::unique_ptr* result, + const EnvOptions& soptions) { + if (!IsFilesystemActive()) { + return GetError(); + } + Status s = target()->ReopenWritableFile(fname, result, soptions); + if (s.ok()) { + result->reset(new TestWritableFile(fname, std::move(*result), this)); + // WritableFileWriter* file is opened + // again then it will be truncated - so forget our saved state. 
+ UntrackFile(fname); + MutexLock l(&mutex_); + open_files_.insert(fname); + auto dir_and_name = GetDirAndName(fname); + auto& list = dir_to_new_files_since_last_sync_[dir_and_name.first]; + list.insert(dir_and_name.second); + } + return s; +} + Status FaultInjectionTestEnv::NewRandomAccessFile( const std::string& fname, std::unique_ptr* result, const EnvOptions& soptions) { diff --git a/util/fault_injection_test_env.h b/util/fault_injection_test_env.h index 563986e29ec..d3775d3a3fe 100644 --- a/util/fault_injection_test_env.h +++ b/util/fault_injection_test_env.h @@ -56,7 +56,7 @@ struct FileState { class TestWritableFile : public WritableFile { public: explicit TestWritableFile(const std::string& fname, - unique_ptr&& f, + std::unique_ptr&& f, FaultInjectionTestEnv* env); virtual ~TestWritableFile(); virtual Status Append(const Slice& data) override; @@ -77,7 +77,7 @@ class TestWritableFile : public WritableFile { private: FileState state_; - unique_ptr target_; + std::unique_ptr target_; bool writable_file_opened_; FaultInjectionTestEnv* env_; }; @@ -94,7 +94,7 @@ class TestDirectory : public Directory { private: FaultInjectionTestEnv* env_; std::string dirname_; - unique_ptr dir_; + std::unique_ptr dir_; }; class FaultInjectionTestEnv : public EnvWrapper { @@ -104,12 +104,16 @@ class FaultInjectionTestEnv : public EnvWrapper { virtual ~FaultInjectionTestEnv() {} Status NewDirectory(const std::string& name, - unique_ptr* result) override; + std::unique_ptr* result) override; Status NewWritableFile(const std::string& fname, - unique_ptr* result, + std::unique_ptr* result, const EnvOptions& soptions) override; + Status ReopenWritableFile(const std::string& fname, + std::unique_ptr* result, + const EnvOptions& soptions) override; + Status NewRandomAccessFile(const std::string& fname, std::unique_ptr* result, const EnvOptions& soptions) override; diff --git a/util/file_reader_writer.cc b/util/file_reader_writer.cc index cd09f712255..9e40d4d4082 100644 --- a/util/file_reader_writer.cc +++ b/util/file_reader_writer.cc @@ -98,8 +98,21 @@ Status RandomAccessFileReader::Read(uint64_t offset, size_t n, Slice* result, allowed = read_size; } Slice tmp; + + FileOperationInfo::TimePoint start_ts; + uint64_t orig_offset = 0; + if (ShouldNotifyListeners()) { + start_ts = std::chrono::system_clock::now(); + orig_offset = aligned_offset + buf.CurrentSize(); + } s = file_->Read(aligned_offset + buf.CurrentSize(), allowed, &tmp, buf.Destination()); + if (ShouldNotifyListeners()) { + auto finish_ts = std::chrono::system_clock::now(); + NotifyOnFileReadFinish(orig_offset, tmp.size(), start_ts, finish_ts, + s); + } + buf.Size(buf.CurrentSize() + tmp.size()); if (!s.ok() || tmp.size() < allowed) { break; @@ -131,7 +144,22 @@ Status RandomAccessFileReader::Read(uint64_t offset, size_t n, Slice* result, allowed = n; } Slice tmp_result; + +#ifndef ROCKSDB_LITE + FileOperationInfo::TimePoint start_ts; + if (ShouldNotifyListeners()) { + start_ts = std::chrono::system_clock::now(); + } +#endif s = file_->Read(offset + pos, allowed, &tmp_result, scratch + pos); +#ifndef ROCKSDB_LITE + if (ShouldNotifyListeners()) { + auto finish_ts = std::chrono::system_clock::now(); + NotifyOnFileReadFinish(offset + pos, tmp_result.size(), start_ts, + finish_ts, s); + } +#endif + if (res_scratch == nullptr) { // we can't simply use `scratch` because reads of mmap'd files return // data in a different buffer. 
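The instrumentation above stamps each `RandomAccessFile::Read` (and, in the following hunks, each buffered and direct write) with `FileOperationInfo` start/finish timestamps and, when listeners are registered, forwards them to those `EventListener`s. Here is a rough sketch of such a listener, assuming the `OnFileReadFinish`/`OnFileWriteFinish`/`ShouldBeNotifiedOnFileIO` hooks this patch relies on; the class name `FileIoCounter` is hypothetical.

```cpp
// Sketch only: counts bytes moved by successful reads and writes. Only
// listeners that return true from ShouldBeNotifiedOnFileIO() are retained by
// the RandomAccessFileReader / WritableFileWriter constructors.
#include <atomic>
#include <cstdint>

#include "rocksdb/listener.h"

class FileIoCounter : public rocksdb::EventListener {
 public:
  void OnFileReadFinish(const rocksdb::FileOperationInfo& info) override {
    if (info.status.ok()) {
      read_bytes_.fetch_add(info.length, std::memory_order_relaxed);
    }
  }

  void OnFileWriteFinish(const rocksdb::FileOperationInfo& info) override {
    if (info.status.ok()) {
      written_bytes_.fetch_add(info.length, std::memory_order_relaxed);
    }
  }

  bool ShouldBeNotifiedOnFileIO() override { return true; }

  uint64_t read_bytes() const { return read_bytes_.load(); }
  uint64_t written_bytes() const { return written_bytes_.load(); }

 private:
  std::atomic<uint64_t> read_bytes_{0};
  std::atomic<uint64_t> written_bytes_{0};
};
```

An instance of such a listener would be supplied through the new `listeners` vectors accepted by the extended reader and writer constructors shown in the `file_reader_writer.h` hunks below.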
@@ -414,7 +442,22 @@ Status WritableFileWriter::WriteBuffered(const char* data, size_t size) { { IOSTATS_TIMER_GUARD(write_nanos); TEST_SYNC_POINT("WritableFileWriter::Flush:BeforeAppend"); + +#ifndef ROCKSDB_LITE + FileOperationInfo::TimePoint start_ts; + uint64_t old_size = writable_file_->GetFileSize(); + if (ShouldNotifyListeners()) { + start_ts = std::chrono::system_clock::now(); + old_size = next_write_offset_; + } +#endif s = writable_file_->Append(Slice(src, allowed)); +#ifndef ROCKSDB_LITE + if (ShouldNotifyListeners()) { + auto finish_ts = std::chrono::system_clock::now(); + NotifyOnFileWriteFinish(old_size, allowed, start_ts, finish_ts, s); + } +#endif if (!s.ok()) { return s; } @@ -477,8 +520,16 @@ Status WritableFileWriter::WriteDirect() { { IOSTATS_TIMER_GUARD(write_nanos); TEST_SYNC_POINT("WritableFileWriter::Flush:BeforeAppend"); + FileOperationInfo::TimePoint start_ts; + if (ShouldNotifyListeners()) { + start_ts = std::chrono::system_clock::now(); + } // direct writes must be positional s = writable_file_->PositionedAppend(Slice(src, size), write_offset); + if (ShouldNotifyListeners()) { + auto finish_ts = std::chrono::system_clock::now(); + NotifyOnFileWriteFinish(write_offset, size, start_ts, finish_ts, s); + } if (!s.ok()) { buf_.Size(file_advance + leftover_tail); return s; @@ -753,7 +804,7 @@ std::unique_ptr NewReadaheadRandomAccessFile( } Status NewWritableFile(Env* env, const std::string& fname, - unique_ptr* result, + std::unique_ptr* result, const EnvOptions& options) { Status s = env->NewWritableFile(fname, result, options); TEST_KILL_RANDOM("NewWritableFile:0", rocksdb_kill_odds * REDUCE_ODDS2); diff --git a/util/file_reader_writer.h b/util/file_reader_writer.h index a2c90f2b330..1083c685cb7 100644 --- a/util/file_reader_writer.h +++ b/util/file_reader_writer.h @@ -12,6 +12,7 @@ #include #include "port/port.h" #include "rocksdb/env.h" +#include "rocksdb/listener.h" #include "rocksdb/rate_limiter.h" #include "util/aligned_buffer.h" #include "util/sync_point.h" @@ -62,6 +63,24 @@ class SequentialFileReader { class RandomAccessFileReader { private: +#ifndef ROCKSDB_LITE + void NotifyOnFileReadFinish(uint64_t offset, size_t length, + const FileOperationInfo::TimePoint& start_ts, + const FileOperationInfo::TimePoint& finish_ts, + const Status& status) const { + FileOperationInfo info(file_name_, start_ts, finish_ts); + info.offset = offset; + info.length = length; + info.status = status; + + for (auto& listener : listeners_) { + listener->OnFileReadFinish(info); + } + } +#endif // ROCKSDB_LITE + + bool ShouldNotifyListeners() const { return !listeners_.empty(); } + std::unique_ptr file_; std::string file_name_; Env* env_; @@ -70,16 +89,15 @@ class RandomAccessFileReader { HistogramImpl* file_read_hist_; RateLimiter* rate_limiter_; bool for_compaction_; + std::vector> listeners_; public: - explicit RandomAccessFileReader(std::unique_ptr&& raf, - std::string _file_name, - Env* env = nullptr, - Statistics* stats = nullptr, - uint32_t hist_type = 0, - HistogramImpl* file_read_hist = nullptr, - RateLimiter* rate_limiter = nullptr, - bool for_compaction = false) + explicit RandomAccessFileReader( + std::unique_ptr&& raf, std::string _file_name, + Env* env = nullptr, Statistics* stats = nullptr, uint32_t hist_type = 0, + HistogramImpl* file_read_hist = nullptr, + RateLimiter* rate_limiter = nullptr, bool for_compaction = false, + const std::vector>& listeners = {}) : file_(std::move(raf)), file_name_(std::move(_file_name)), env_(env), @@ -87,7 +105,19 @@ class 
RandomAccessFileReader { hist_type_(hist_type), file_read_hist_(file_read_hist), rate_limiter_(rate_limiter), - for_compaction_(for_compaction) {} + for_compaction_(for_compaction), + listeners_() { +#ifndef ROCKSDB_LITE + std::for_each(listeners.begin(), listeners.end(), + [this](const std::shared_ptr& e) { + if (e->ShouldBeNotifiedOnFileIO()) { + listeners_.emplace_back(e); + } + }); +#else // !ROCKSDB_LITE + (void)listeners; +#endif + } RandomAccessFileReader(RandomAccessFileReader&& o) ROCKSDB_NOEXCEPT { *this = std::move(o); @@ -124,6 +154,24 @@ class RandomAccessFileReader { // Use posix write to write data to a file. class WritableFileWriter { private: +#ifndef ROCKSDB_LITE + void NotifyOnFileWriteFinish(uint64_t offset, size_t length, + const FileOperationInfo::TimePoint& start_ts, + const FileOperationInfo::TimePoint& finish_ts, + const Status& status) { + FileOperationInfo info(file_name_, start_ts, finish_ts); + info.offset = offset; + info.length = length; + info.status = status; + + for (auto& listener : listeners_) { + listener->OnFileWriteFinish(info); + } + } +#endif // ROCKSDB_LITE + + bool ShouldNotifyListeners() const { return !listeners_.empty(); } + std::unique_ptr writable_file_; std::string file_name_; AlignedBuffer buf_; @@ -142,11 +190,13 @@ class WritableFileWriter { uint64_t bytes_per_sync_; RateLimiter* rate_limiter_; Statistics* stats_; + std::vector> listeners_; public: - WritableFileWriter(std::unique_ptr&& file, - const std::string& _file_name, const EnvOptions& options, - Statistics* stats = nullptr) + WritableFileWriter( + std::unique_ptr&& file, const std::string& _file_name, + const EnvOptions& options, Statistics* stats = nullptr, + const std::vector>& listeners = {}) : writable_file_(std::move(file)), file_name_(_file_name), buf_(), @@ -159,11 +209,22 @@ class WritableFileWriter { last_sync_size_(0), bytes_per_sync_(options.bytes_per_sync), rate_limiter_(options.rate_limiter), - stats_(stats) { + stats_(stats), + listeners_() { TEST_SYNC_POINT_CALLBACK("WritableFileWriter::WritableFileWriter:0", reinterpret_cast(max_buffer_size_)); buf_.Alignment(writable_file_->GetRequiredBufferAlignment()); buf_.AllocateNewBuffer(std::min((size_t)65536, max_buffer_size_)); +#ifndef ROCKSDB_LITE + std::for_each(listeners.begin(), listeners.end(), + [this](const std::shared_ptr& e) { + if (e->ShouldBeNotifiedOnFileIO()) { + listeners_.emplace_back(e); + } + }); +#else // !ROCKSDB_LITE + (void)listeners; +#endif } WritableFileWriter(const WritableFileWriter&) = delete; @@ -254,7 +315,7 @@ class FilePrefetchBuffer { }; extern Status NewWritableFile(Env* env, const std::string& fname, - unique_ptr* result, + std::unique_ptr* result, const EnvOptions& options); bool ReadOneLine(std::istringstream* iss, SequentialFile* seq_file, std::string* output, bool* has_data, Status* result); diff --git a/util/file_reader_writer_test.cc b/util/file_reader_writer_test.cc index 3ca44ecc095..72dd625c1fb 100644 --- a/util/file_reader_writer_test.cc +++ b/util/file_reader_writer_test.cc @@ -71,8 +71,8 @@ TEST_F(WritableFileWriterTest, RangeSync) { EnvOptions env_options; env_options.bytes_per_sync = kMb; - unique_ptr wf(new FakeWF); - unique_ptr writer( + std::unique_ptr wf(new FakeWF); + std::unique_ptr writer( new WritableFileWriter(std::move(wf), "" /* don't care */, env_options)); Random r(301); std::unique_ptr large_buf(new char[10 * kMb]); @@ -147,14 +147,14 @@ TEST_F(WritableFileWriterTest, IncrementalBuffer) { env_options.writable_file_max_buffer_size = (attempt < kNumAttempts / 2) 
? 512 * 1024 : 700 * 1024; std::string actual; - unique_ptr wf(new FakeWF(&actual, + std::unique_ptr wf(new FakeWF(&actual, #ifndef ROCKSDB_LITE - attempt % 2 == 1, + attempt % 2 == 1, #else - false, + false, #endif - no_flush)); - unique_ptr writer(new WritableFileWriter( + no_flush)); + std::unique_ptr writer(new WritableFileWriter( std::move(wf), "" /* don't care */, env_options)); std::string target; @@ -206,9 +206,9 @@ TEST_F(WritableFileWriterTest, AppendStatusReturn) { bool use_direct_io_; bool io_error_; }; - unique_ptr wf(new FakeWF()); + std::unique_ptr wf(new FakeWF()); wf->Setuse_direct_io(true); - unique_ptr writer( + std::unique_ptr writer( new WritableFileWriter(std::move(wf), "" /* don't care */, EnvOptions())); ASSERT_OK(writer->Append(std::string(2 * kMb, 'a'))); diff --git a/util/file_util.cc b/util/file_util.cc index aa2994b1e9f..3f730f3e840 100644 --- a/util/file_util.cc +++ b/util/file_util.cc @@ -19,16 +19,16 @@ Status CopyFile(Env* env, const std::string& source, const std::string& destination, uint64_t size, bool use_fsync) { const EnvOptions soptions; Status s; - unique_ptr src_reader; - unique_ptr dest_writer; + std::unique_ptr src_reader; + std::unique_ptr dest_writer; { - unique_ptr srcfile; + std::unique_ptr srcfile; s = env->NewSequentialFile(source, &srcfile, soptions); if (!s.ok()) { return s; } - unique_ptr destfile; + std::unique_ptr destfile; s = env->NewWritableFile(destination, &destfile, soptions); if (!s.ok()) { return s; @@ -71,9 +71,9 @@ Status CreateFile(Env* env, const std::string& destination, const std::string& contents, bool use_fsync) { const EnvOptions soptions; Status s; - unique_ptr dest_writer; + std::unique_ptr dest_writer; - unique_ptr destfile; + std::unique_ptr destfile; s = env->NewWritableFile(destination, &destfile, soptions); if (!s.ok()) { return s; @@ -89,16 +89,23 @@ Status CreateFile(Env* env, const std::string& destination, Status DeleteSSTFile(const ImmutableDBOptions* db_options, const std::string& fname, const std::string& dir_to_sync) { + return DeleteDBFile(db_options, fname, dir_to_sync, false); +} + +Status DeleteDBFile(const ImmutableDBOptions* db_options, + const std::string& fname, const std::string& dir_to_sync, + const bool force_bg) { #ifndef ROCKSDB_LITE auto sfm = static_cast(db_options->sst_file_manager.get()); if (sfm) { - return sfm->ScheduleFileDeletion(fname, dir_to_sync); + return sfm->ScheduleFileDeletion(fname, dir_to_sync, force_bg); } else { return db_options->env->DeleteFile(fname); } #else (void)dir_to_sync; + (void)force_bg; // SstFileManager is not supported in ROCKSDB_LITE return db_options->env->DeleteFile(fname); #endif diff --git a/util/file_util.h b/util/file_util.h index 5c05c9def6e..cd054518e17 100644 --- a/util/file_util.h +++ b/util/file_util.h @@ -25,4 +25,9 @@ extern Status DeleteSSTFile(const ImmutableDBOptions* db_options, const std::string& fname, const std::string& path_to_sync); +extern Status DeleteDBFile(const ImmutableDBOptions* db_options, + const std::string& fname, + const std::string& path_to_sync, + const bool force_bg); + } // namespace rocksdb diff --git a/util/heap.h b/util/heap.h index 4d5894134f2..6093c20e2bf 100644 --- a/util/heap.h +++ b/util/heap.h @@ -92,9 +92,9 @@ class BinaryHeap { reset_root_cmp_cache(); } - bool empty() const { - return data_.empty(); - } + bool empty() const { return data_.empty(); } + + size_t size() const { return data_.size(); } void reset_root_cmp_cache() { root_cmp_cache_ = port::kMaxSizet; } diff --git a/util/jemalloc_nodump_allocator.cc 
b/util/jemalloc_nodump_allocator.cc new file mode 100644 index 00000000000..cdd08e932e3 --- /dev/null +++ b/util/jemalloc_nodump_allocator.cc @@ -0,0 +1,206 @@ +// Copyright (c) 2011-present, Facebook, Inc. All rights reserved. +// This source code is licensed under both the GPLv2 (found in the +// COPYING file in the root directory) and Apache 2.0 License +// (found in the LICENSE.Apache file in the root directory). + +#include "util/jemalloc_nodump_allocator.h" + +#include +#include + +#include "port/likely.h" +#include "port/port.h" +#include "util/string_util.h" + +namespace rocksdb { + +#ifdef ROCKSDB_JEMALLOC_NODUMP_ALLOCATOR + +std::atomic JemallocNodumpAllocator::original_alloc_{nullptr}; + +JemallocNodumpAllocator::JemallocNodumpAllocator( + JemallocAllocatorOptions& options, + std::unique_ptr&& arena_hooks, unsigned arena_index) + : options_(options), + arena_hooks_(std::move(arena_hooks)), + arena_index_(arena_index), + tcache_(&JemallocNodumpAllocator::DestroyThreadSpecificCache) {} + +int JemallocNodumpAllocator::GetThreadSpecificCache(size_t size) { + // We always enable tcache. The only corner case is when there are a ton of + // threads accessing with low frequency, then it could consume a lot of + // memory (may reach # threads * ~1MB) without bringing too much benefit. + if (options_.limit_tcache_size && (size <= options_.tcache_size_lower_bound || + size > options_.tcache_size_upper_bound)) { + return MALLOCX_TCACHE_NONE; + } + unsigned* tcache_index = reinterpret_cast(tcache_.Get()); + if (UNLIKELY(tcache_index == nullptr)) { + // Instantiate tcache. + tcache_index = new unsigned(0); + size_t tcache_index_size = sizeof(unsigned); + int ret = + mallctl("tcache.create", tcache_index, &tcache_index_size, nullptr, 0); + if (ret != 0) { + // No good way to expose the error. Silently disable tcache. + delete tcache_index; + return MALLOCX_TCACHE_NONE; + } + tcache_.Reset(static_cast(tcache_index)); + } + return MALLOCX_TCACHE(*tcache_index); +} + +void* JemallocNodumpAllocator::Allocate(size_t size) { + int tcache_flag = GetThreadSpecificCache(size); + return mallocx(size, MALLOCX_ARENA(arena_index_) | tcache_flag); +} + +void JemallocNodumpAllocator::Deallocate(void* p) { + // Obtain tcache. + size_t size = 0; + if (options_.limit_tcache_size) { + size = malloc_usable_size(p); + } + int tcache_flag = GetThreadSpecificCache(size); + // No need to pass arena index to dallocx(). Jemalloc will find arena index + // from its own metadata. + dallocx(p, tcache_flag); +} + +void* JemallocNodumpAllocator::Alloc(extent_hooks_t* extent, void* new_addr, + size_t size, size_t alignment, bool* zero, + bool* commit, unsigned arena_ind) { + extent_alloc_t* original_alloc = + original_alloc_.load(std::memory_order_relaxed); + assert(original_alloc != nullptr); + void* result = original_alloc(extent, new_addr, size, alignment, zero, commit, + arena_ind); + if (result != nullptr) { + int ret = madvise(result, size, MADV_DONTDUMP); + if (ret != 0) { + fprintf( + stderr, + "JemallocNodumpAllocator failed to set MADV_DONTDUMP, error code: %d", + ret); + assert(false); + } + } + return result; +} + +Status JemallocNodumpAllocator::DestroyArena(unsigned arena_index) { + assert(arena_index != 0); + std::string key = "arena." 
+ ToString(arena_index) + ".destroy"; + int ret = mallctl(key.c_str(), nullptr, 0, nullptr, 0); + if (ret != 0) { + return Status::Incomplete("Failed to destroy jemalloc arena, error code: " + + ToString(ret)); + } + return Status::OK(); +} + +void JemallocNodumpAllocator::DestroyThreadSpecificCache(void* ptr) { + assert(ptr != nullptr); + unsigned* tcache_index = static_cast(ptr); + size_t tcache_index_size = sizeof(unsigned); + int ret __attribute__((__unused__)) = + mallctl("tcache.destroy", nullptr, 0, tcache_index, tcache_index_size); + // Silently ignore error. + assert(ret == 0); + delete tcache_index; +} + +JemallocNodumpAllocator::~JemallocNodumpAllocator() { + // Destroy tcache before destroying arena. + autovector tcache_list; + tcache_.Scrape(&tcache_list, nullptr); + for (void* tcache_index : tcache_list) { + DestroyThreadSpecificCache(tcache_index); + } + // Destroy arena. Silently ignore error. + Status s __attribute__((__unused__)) = DestroyArena(arena_index_); + assert(s.ok()); +} + +size_t JemallocNodumpAllocator::UsableSize(void* p, + size_t /*allocation_size*/) const { + return malloc_usable_size(static_cast(p)); +} +#endif // ROCKSDB_JEMALLOC_NODUMP_ALLOCATOR + +Status NewJemallocNodumpAllocator( + JemallocAllocatorOptions& options, + std::shared_ptr* memory_allocator) { + *memory_allocator = nullptr; + Status unsupported = Status::NotSupported( + "JemallocNodumpAllocator only available with jemalloc version >= 5 " + "and MADV_DONTDUMP is available."); +#ifndef ROCKSDB_JEMALLOC_NODUMP_ALLOCATOR + (void)options; + return unsupported; +#else + if (!HasJemalloc()) { + return unsupported; + } + if (memory_allocator == nullptr) { + return Status::InvalidArgument("memory_allocator must be non-null."); + } + if (options.limit_tcache_size && + options.tcache_size_lower_bound >= options.tcache_size_upper_bound) { + return Status::InvalidArgument( + "tcache_size_lower_bound larger or equal to tcache_size_upper_bound."); + } + + // Create arena. + unsigned arena_index = 0; + size_t arena_index_size = sizeof(arena_index); + int ret = + mallctl("arenas.create", &arena_index, &arena_index_size, nullptr, 0); + if (ret != 0) { + return Status::Incomplete("Failed to create jemalloc arena, error code: " + + ToString(ret)); + } + assert(arena_index != 0); + + // Read existing hooks. + std::string key = "arena." + ToString(arena_index) + ".extent_hooks"; + extent_hooks_t* hooks; + size_t hooks_size = sizeof(hooks); + ret = mallctl(key.c_str(), &hooks, &hooks_size, nullptr, 0); + if (ret != 0) { + JemallocNodumpAllocator::DestroyArena(arena_index); + return Status::Incomplete("Failed to read existing hooks, error code: " + + ToString(ret)); + } + + // Store existing alloc. + extent_alloc_t* original_alloc = hooks->alloc; + extent_alloc_t* expected = nullptr; + bool success = + JemallocNodumpAllocator::original_alloc_.compare_exchange_strong( + expected, original_alloc); + if (!success && original_alloc != expected) { + JemallocNodumpAllocator::DestroyArena(arena_index); + return Status::Incomplete("Original alloc conflict."); + } + + // Set the custom hook. + std::unique_ptr new_hooks(new extent_hooks_t(*hooks)); + new_hooks->alloc = &JemallocNodumpAllocator::Alloc; + extent_hooks_t* hooks_ptr = new_hooks.get(); + ret = mallctl(key.c_str(), nullptr, nullptr, &hooks_ptr, sizeof(hooks_ptr)); + if (ret != 0) { + JemallocNodumpAllocator::DestroyArena(arena_index); + return Status::Incomplete("Failed to set custom hook, error code: " + + ToString(ret)); + } + + // Create cache allocator. 
+ memory_allocator->reset( + new JemallocNodumpAllocator(options, std::move(new_hooks), arena_index)); + return Status::OK(); +#endif // ROCKSDB_JEMALLOC_NODUMP_ALLOCATOR +} + +} // namespace rocksdb diff --git a/util/jemalloc_nodump_allocator.h b/util/jemalloc_nodump_allocator.h new file mode 100644 index 00000000000..e93c1223778 --- /dev/null +++ b/util/jemalloc_nodump_allocator.h @@ -0,0 +1,79 @@ +// Copyright (c) 2011-present, Facebook, Inc. All rights reserved. +// This source code is licensed under both the GPLv2 (found in the +// COPYING file in the root directory) and Apache 2.0 License +// (found in the LICENSE.Apache file in the root directory). + +#pragma once + +#include +#include + +#include "port/jemalloc_helper.h" +#include "port/port.h" +#include "rocksdb/memory_allocator.h" +#include "util/core_local.h" +#include "util/thread_local.h" + +#if defined(ROCKSDB_JEMALLOC) && defined(ROCKSDB_PLATFORM_POSIX) + +#include + +#if (JEMALLOC_VERSION_MAJOR >= 5) && defined(MADV_DONTDUMP) +#define ROCKSDB_JEMALLOC_NODUMP_ALLOCATOR + +namespace rocksdb { + +class JemallocNodumpAllocator : public MemoryAllocator { + public: + JemallocNodumpAllocator(JemallocAllocatorOptions& options, + std::unique_ptr&& arena_hooks, + unsigned arena_index); + ~JemallocNodumpAllocator(); + + const char* Name() const override { return "JemallocNodumpAllocator"; } + void* Allocate(size_t size) override; + void Deallocate(void* p) override; + size_t UsableSize(void* p, size_t allocation_size) const override; + + private: + friend Status NewJemallocNodumpAllocator( + JemallocAllocatorOptions& options, + std::shared_ptr* memory_allocator); + + // Custom alloc hook to replace jemalloc default alloc. + static void* Alloc(extent_hooks_t* extent, void* new_addr, size_t size, + size_t alignment, bool* zero, bool* commit, + unsigned arena_ind); + + // Destroy arena on destruction of the allocator, or on failure. + static Status DestroyArena(unsigned arena_index); + + // Destroy tcache on destruction of the allocator, or thread exit. + static void DestroyThreadSpecificCache(void* ptr); + + // Get or create tcache. Return flag suitable to use with `mallocx`: + // either MALLOCX_TCACHE_NONE or MALLOCX_TCACHE(tc). + int GetThreadSpecificCache(size_t size); + + // A function pointer to jemalloc default alloc. Use atomic to make sure + // NewJemallocNodumpAllocator is thread-safe. + // + // Hack: original_alloc_ needs to be static for Alloc() to access it. + // alloc needs to be static to pass to jemalloc as function pointer. + static std::atomic original_alloc_; + + const JemallocAllocatorOptions options_; + + // Custom hooks has to outlive corresponding arena. + const std::unique_ptr arena_hooks_; + + // Arena index. + const unsigned arena_index_; + + // Hold thread-local tcache index. 
+ ThreadLocalPtr tcache_; +}; + +} // namespace rocksdb +#endif // (JEMALLOC_VERSION_MAJOR >= 5) && MADV_DONTDUMP +#endif // ROCKSDB_JEMALLOC && ROCKSDB_PLATFORM_POSIX diff --git a/util/log_write_bench.cc b/util/log_write_bench.cc index b4e12b948c5..5c9b3e84bf4 100644 --- a/util/log_write_bench.cc +++ b/util/log_write_bench.cc @@ -35,9 +35,9 @@ void RunBenchmark() { Env* env = Env::Default(); EnvOptions env_options = env->OptimizeForLogWrite(EnvOptions()); env_options.bytes_per_sync = FLAGS_bytes_per_sync; - unique_ptr file; + std::unique_ptr file; env->NewWritableFile(file_name, &file, env_options); - unique_ptr writer; + std::unique_ptr writer; writer.reset(new WritableFileWriter(std::move(file), env_options)); std::string record; diff --git a/util/logging.h b/util/logging.h index 992e0018d7c..f605d36a5ac 100644 --- a/util/logging.h +++ b/util/logging.h @@ -11,40 +11,47 @@ // with macros. #pragma once -#include "port/port.h" // Helper macros that include information about file name and line number -#define STRINGIFY(x) #x -#define TOSTRING(x) STRINGIFY(x) -#define PREPEND_FILE_LINE(FMT) ("[" __FILE__ ":" TOSTRING(__LINE__) "] " FMT) +#define ROCKS_LOG_STRINGIFY(x) #x +#define ROCKS_LOG_TOSTRING(x) ROCKS_LOG_STRINGIFY(x) +#define ROCKS_LOG_PREPEND_FILE_LINE(FMT) ("[%s:" ROCKS_LOG_TOSTRING(__LINE__) "] " FMT) + +inline const char* RocksLogShorterFileName(const char* file) +{ + // 15 is the length of "util/logging.h". + // If the name of this file changed, please change this number, too. + return file + (sizeof(__FILE__) > 15 ? sizeof(__FILE__) - 15 : 0); +} // Don't inclide file/line info in HEADER level -#define ROCKS_LOG_HEADER(LGR, FMT, ...) \ +#define ROCKS_LOG_HEADER(LGR, FMT, ...) \ rocksdb::Log(InfoLogLevel::HEADER_LEVEL, LGR, FMT, ##__VA_ARGS__) -#define ROCKS_LOG_DEBUG(LGR, FMT, ...) \ - rocksdb::Log(InfoLogLevel::DEBUG_LEVEL, LGR, PREPEND_FILE_LINE(FMT), \ - ##__VA_ARGS__) +#define ROCKS_LOG_DEBUG(LGR, FMT, ...) \ + rocksdb::Log(InfoLogLevel::DEBUG_LEVEL, LGR, ROCKS_LOG_PREPEND_FILE_LINE(FMT), \ + RocksLogShorterFileName(__FILE__), ##__VA_ARGS__) -#define ROCKS_LOG_INFO(LGR, FMT, ...) \ - rocksdb::Log(InfoLogLevel::INFO_LEVEL, LGR, PREPEND_FILE_LINE(FMT), \ - ##__VA_ARGS__) +#define ROCKS_LOG_INFO(LGR, FMT, ...) \ + rocksdb::Log(InfoLogLevel::INFO_LEVEL, LGR, ROCKS_LOG_PREPEND_FILE_LINE(FMT), \ + RocksLogShorterFileName(__FILE__), ##__VA_ARGS__) -#define ROCKS_LOG_WARN(LGR, FMT, ...) \ - rocksdb::Log(InfoLogLevel::WARN_LEVEL, LGR, PREPEND_FILE_LINE(FMT), \ - ##__VA_ARGS__) +#define ROCKS_LOG_WARN(LGR, FMT, ...) \ + rocksdb::Log(InfoLogLevel::WARN_LEVEL, LGR, ROCKS_LOG_PREPEND_FILE_LINE(FMT), \ + RocksLogShorterFileName(__FILE__), ##__VA_ARGS__) -#define ROCKS_LOG_ERROR(LGR, FMT, ...) \ - rocksdb::Log(InfoLogLevel::ERROR_LEVEL, LGR, PREPEND_FILE_LINE(FMT), \ - ##__VA_ARGS__) +#define ROCKS_LOG_ERROR(LGR, FMT, ...) \ + rocksdb::Log(InfoLogLevel::ERROR_LEVEL, LGR, ROCKS_LOG_PREPEND_FILE_LINE(FMT), \ + RocksLogShorterFileName(__FILE__), ##__VA_ARGS__) -#define ROCKS_LOG_FATAL(LGR, FMT, ...) \ - rocksdb::Log(InfoLogLevel::FATAL_LEVEL, LGR, PREPEND_FILE_LINE(FMT), \ - ##__VA_ARGS__) +#define ROCKS_LOG_FATAL(LGR, FMT, ...) \ + rocksdb::Log(InfoLogLevel::FATAL_LEVEL, LGR, ROCKS_LOG_PREPEND_FILE_LINE(FMT), \ + RocksLogShorterFileName(__FILE__), ##__VA_ARGS__) -#define ROCKS_LOG_BUFFER(LOG_BUF, FMT, ...) \ - rocksdb::LogToBuffer(LOG_BUF, PREPEND_FILE_LINE(FMT), ##__VA_ARGS__) +#define ROCKS_LOG_BUFFER(LOG_BUF, FMT, ...) 
\ + rocksdb::LogToBuffer(LOG_BUF, ROCKS_LOG_PREPEND_FILE_LINE(FMT), \ + RocksLogShorterFileName(__FILE__), ##__VA_ARGS__) -#define ROCKS_LOG_BUFFER_MAX_SZ(LOG_BUF, MAX_LOG_SIZE, FMT, ...) \ - rocksdb::LogToBuffer(LOG_BUF, MAX_LOG_SIZE, PREPEND_FILE_LINE(FMT), \ - ##__VA_ARGS__) +#define ROCKS_LOG_BUFFER_MAX_SZ(LOG_BUF, MAX_LOG_SIZE, FMT, ...) \ + rocksdb::LogToBuffer(LOG_BUF, MAX_LOG_SIZE, ROCKS_LOG_PREPEND_FILE_LINE(FMT), \ + RocksLogShorterFileName(__FILE__), ##__VA_ARGS__) diff --git a/util/memory_allocator.h b/util/memory_allocator.h new file mode 100644 index 00000000000..99a7241d0a9 --- /dev/null +++ b/util/memory_allocator.h @@ -0,0 +1,38 @@ +// Copyright (c) 2011-present, Facebook, Inc. All rights reserved. +// This source code is licensed under both the GPLv2 (found in the +// COPYING file in the root directory) and Apache 2.0 License +// (found in the LICENSE.Apache file in the root directory). +// + +#pragma once + +#include "rocksdb/memory_allocator.h" + +namespace rocksdb { + +struct CustomDeleter { + CustomDeleter(MemoryAllocator* a = nullptr) : allocator(a) {} + + void operator()(char* ptr) const { + if (allocator) { + allocator->Deallocate(reinterpret_cast(ptr)); + } else { + delete[] ptr; + } + } + + MemoryAllocator* allocator; +}; + +using CacheAllocationPtr = std::unique_ptr; + +inline CacheAllocationPtr AllocateBlock(size_t size, + MemoryAllocator* allocator) { + if (allocator) { + auto block = reinterpret_cast(allocator->Allocate(size)); + return CacheAllocationPtr(block, allocator); + } + return CacheAllocationPtr(new char[size]); +} + +} // namespace rocksdb diff --git a/util/mock_time_env.h b/util/mock_time_env.h new file mode 100644 index 00000000000..c6ab8a7483d --- /dev/null +++ b/util/mock_time_env.h @@ -0,0 +1,43 @@ +// Copyright (c) 2011-present, Facebook, Inc. All rights reserved. +// This source code is licensed under both the GPLv2 (found in the +// COPYING file in the root directory) and Apache 2.0 License +// (found in the LICENSE.Apache file in the root directory). + +#pragma once + +#include "rocksdb/env.h" + +namespace rocksdb { + +class MockTimeEnv : public EnvWrapper { + public: + explicit MockTimeEnv(Env* base) : EnvWrapper(base) {} + + virtual Status GetCurrentTime(int64_t* time) override { + assert(time != nullptr); + assert(current_time_ <= + static_cast(std::numeric_limits::max())); + *time = static_cast(current_time_); + return Status::OK(); + } + + virtual uint64_t NowMicros() override { + assert(current_time_ <= std::numeric_limits::max() / 1000000); + return current_time_ * 1000000; + } + + virtual uint64_t NowNanos() override { + assert(current_time_ <= std::numeric_limits::max() / 1000000000); + return current_time_ * 1000000000; + } + + void set_current_time(uint64_t time) { + assert(time >= current_time_); + current_time_ = time; + } + + private: + std::atomic current_time_{0}; +}; + +} // namespace rocksdb diff --git a/util/repeatable_thread.h b/util/repeatable_thread.h index 34164ca562b..3506234f9e9 100644 --- a/util/repeatable_thread.h +++ b/util/repeatable_thread.h @@ -10,6 +10,7 @@ #include "port/port.h" #include "rocksdb/env.h" +#include "util/mock_time_env.h" #include "util/mutexlock.h" namespace rocksdb { @@ -80,7 +81,17 @@ class RepeatableThread { cond_var_.SignalAll(); #endif while (running_) { +#ifndef NDEBUG + if (dynamic_cast(env_) != nullptr) { + // MockTimeEnv is used. Since it is not easy to mock TimedWait, + // we wait without timeout to wait for TEST_WaitForRun to wake us up. 
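The MockTimeEnv added above is exactly what the NDEBUG branch below detects. A minimal sketch of how a test drives it, using only the members shown in mock_time_env.h:

#include <cassert>
#include <cstdint>

#include "rocksdb/env.h"
#include "util/mock_time_env.h"

// Freeze the clock at an explicit value and observe it through the Env API.
void MockTimeEnvExample() {
  rocksdb::MockTimeEnv mock_env(rocksdb::Env::Default());
  mock_env.set_current_time(100);  // seconds; the mock clock only moves forward
  assert(mock_env.NowMicros() == 100 * 1000000ULL);
  int64_t now = 0;
  rocksdb::Status s = mock_env.GetCurrentTime(&now);
  assert(s.ok() && now == 100);
  (void)s;
  (void)now;
}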
+ cond_var_.Wait(); + } else { + cond_var_.TimedWait(wait_until); + } +#else cond_var_.TimedWait(wait_until); +#endif if (env_->NowMicros() >= wait_until) { break; } diff --git a/util/slice_transform_test.cc b/util/slice_transform_test.cc index ddbb9f4bfac..2eb56af6d6c 100644 --- a/util/slice_transform_test.cc +++ b/util/slice_transform_test.cc @@ -24,7 +24,7 @@ TEST_F(SliceTransformTest, CapPrefixTransform) { std::string s; s = "abcdefge"; - unique_ptr transform; + std::unique_ptr transform; transform.reset(NewCappedPrefixTransform(6)); ASSERT_EQ(transform->Transform(s).ToString(), "abcdef"); @@ -115,7 +115,7 @@ TEST_F(SliceTransformDBTest, CapPrefix) { ASSERT_OK(db()->Put(wo, "foo3", "bar3")); ASSERT_OK(db()->Flush(fo)); - unique_ptr iter(db()->NewIterator(ro)); + std::unique_ptr iter(db()->NewIterator(ro)); iter->Seek("foo"); ASSERT_OK(iter->status()); diff --git a/util/sst_file_manager_impl.cc b/util/sst_file_manager_impl.cc index ee1394bc91e..733cd9cf609 100644 --- a/util/sst_file_manager_impl.cc +++ b/util/sst_file_manager_impl.cc @@ -402,8 +402,11 @@ bool SstFileManagerImpl::CancelErrorRecovery(ErrorHandler* handler) { } Status SstFileManagerImpl::ScheduleFileDeletion( - const std::string& file_path, const std::string& path_to_sync) { - return delete_scheduler_.DeleteFile(file_path, path_to_sync); + const std::string& file_path, const std::string& path_to_sync, + const bool force_bg) { + TEST_SYNC_POINT("SstFileManagerImpl::ScheduleFileDeletion"); + return delete_scheduler_.DeleteFile(file_path, path_to_sync, + force_bg); } void SstFileManagerImpl::WaitForEmptyTrash() { diff --git a/util/sst_file_manager_impl.h b/util/sst_file_manager_impl.h index d11035df80c..211b4fa7160 100644 --- a/util/sst_file_manager_impl.h +++ b/util/sst_file_manager_impl.h @@ -111,9 +111,12 @@ class SstFileManagerImpl : public SstFileManager { // not guaranteed bool CancelErrorRecovery(ErrorHandler* db); - // Mark file as trash and schedule it's deletion. + // Mark file as trash and schedule it's deletion. If force_bg is set, it + // forces the file to be deleting in the background regardless of DB size, + // except when rate limited delete is disabled virtual Status ScheduleFileDeletion(const std::string& file_path, - const std::string& dir_to_sync); + const std::string& dir_to_sync, + const bool force_bg = false); // Wait for all files being deleteing in the background to finish or for // destructor to be called. 
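The force_bg flag documented above only changes how eagerly a trashed file is removed. A hypothetical call site (every name other than ScheduleFileDeletion is an assumption for illustration):

#include <cassert>
#include <string>

#include "util/sst_file_manager_impl.h"

// Ask the SstFileManager to trash a file in the background even when the DB
// is small enough that it would normally be deleted inline. If rate-limited
// deletion is disabled, the flag has no effect and the file is removed
// immediately.
void TrashObsoleteFile(rocksdb::SstFileManagerImpl* sfm,
                       const std::string& file_path,
                       const std::string& dir_to_sync) {
  rocksdb::Status s =
      sfm->ScheduleFileDeletion(file_path, dir_to_sync, /*force_bg=*/true);
  assert(s.ok());
  (void)s;
}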
diff --git a/util/sync_point.cc b/util/sync_point.cc index ce0fa0a9727..4599c256d9f 100644 --- a/util/sync_point.cc +++ b/util/sync_point.cc @@ -17,9 +17,7 @@ SyncPoint* SyncPoint::GetInstance() { return &sync_point; } -SyncPoint::SyncPoint() : - impl_(new Data) { -} +SyncPoint::SyncPoint() : impl_(new Data) {} SyncPoint:: ~SyncPoint() { delete impl_; diff --git a/util/testutil.cc b/util/testutil.cc index 0983f759ce9..2f8e31cd571 100644 --- a/util/testutil.cc +++ b/util/testutil.cc @@ -126,19 +126,19 @@ const Comparator* Uint64Comparator() { WritableFileWriter* GetWritableFileWriter(WritableFile* wf, const std::string& fname) { - unique_ptr file(wf); + std::unique_ptr file(wf); return new WritableFileWriter(std::move(file), fname, EnvOptions()); } RandomAccessFileReader* GetRandomAccessFileReader(RandomAccessFile* raf) { - unique_ptr file(raf); + std::unique_ptr file(raf); return new RandomAccessFileReader(std::move(file), "[test RandomAccessFileReader]"); } SequentialFileReader* GetSequentialFileReader(SequentialFile* se, const std::string& fname) { - unique_ptr file(se); + std::unique_ptr file(se); return new SequentialFileReader(std::move(file), fname); } @@ -401,5 +401,21 @@ Status DestroyDir(Env* env, const std::string& dir) { return s; } +bool IsDirectIOSupported(Env* env, const std::string& dir) { + EnvOptions env_options; + env_options.use_mmap_writes = false; + env_options.use_direct_writes = true; + std::string tmp = TempFileName(dir, 999); + Status s; + { + std::unique_ptr file; + s = env->NewWritableFile(tmp, &file, env_options); + } + if (s.ok()) { + s = env->DeleteFile(tmp); + } + return s.ok(); +} + } // namespace test } // namespace rocksdb diff --git a/util/testutil.h b/util/testutil.h index c16c0cbe503..2aab3df72c4 100644 --- a/util/testutil.h +++ b/util/testutil.h @@ -64,7 +64,7 @@ class ErrorEnv : public EnvWrapper { num_writable_file_errors_(0) { } virtual Status NewWritableFile(const std::string& fname, - unique_ptr* result, + std::unique_ptr* result, const EnvOptions& soptions) override { result->reset(); if (writable_file_error_) { @@ -554,7 +554,7 @@ class StringEnv : public EnvWrapper { const Status WriteToNewFile(const std::string& file_name, const std::string& content) { - unique_ptr r; + std::unique_ptr r; auto s = NewWritableFile(file_name, &r, EnvOptions()); if (!s.ok()) { return s; @@ -567,7 +567,8 @@ class StringEnv : public EnvWrapper { } // The following text is boilerplate that forwards all methods to target() - Status NewSequentialFile(const std::string& f, unique_ptr* r, + Status NewSequentialFile(const std::string& f, + std::unique_ptr* r, const EnvOptions& /*options*/) override { auto iter = files_.find(f); if (iter == files_.end()) { @@ -577,11 +578,11 @@ class StringEnv : public EnvWrapper { return Status::OK(); } Status NewRandomAccessFile(const std::string& /*f*/, - unique_ptr* /*r*/, + std::unique_ptr* /*r*/, const EnvOptions& /*options*/) override { return Status::NotSupported(); } - Status NewWritableFile(const std::string& f, unique_ptr* r, + Status NewWritableFile(const std::string& f, std::unique_ptr* r, const EnvOptions& /*options*/) override { auto iter = files_.find(f); if (iter != files_.end()) { @@ -591,7 +592,7 @@ class StringEnv : public EnvWrapper { return Status::OK(); } virtual Status NewDirectory(const std::string& /*name*/, - unique_ptr* /*result*/) override { + std::unique_ptr* /*result*/) override { return Status::NotSupported(); } Status FileExists(const std::string& f) override { @@ -747,5 +748,7 @@ std::string 
RandomName(Random* rnd, const size_t len); Status DestroyDir(Env* env, const std::string& dir); +bool IsDirectIOSupported(Env* env, const std::string& dir); + } // namespace test } // namespace rocksdb diff --git a/util/thread_local.cc b/util/thread_local.cc index dea2002a021..7346eff11e8 100644 --- a/util/thread_local.cc +++ b/util/thread_local.cc @@ -204,7 +204,7 @@ extern "C" { // The linker must not discard thread_callback_on_exit. (We force a reference // to this variable with a linker /include:symbol pragma to ensure that.) If // this variable is discarded, the OnThreadExit function will never be called. -#ifdef _WIN64 +#ifndef _X86_ // .CRT section is merged with .rdata on x64 so it must be constant data. #pragma const_seg(".CRT$XLB") @@ -219,7 +219,7 @@ const PIMAGE_TLS_CALLBACK p_thread_callback_on_exit = #pragma comment(linker, "/include:_tls_used") #pragma comment(linker, "/include:p_thread_callback_on_exit") -#else // _WIN64 +#else // _X86_ #pragma data_seg(".CRT$XLB") PIMAGE_TLS_CALLBACK p_thread_callback_on_exit = wintlscleanup::WinOnThreadExit; @@ -229,7 +229,7 @@ PIMAGE_TLS_CALLBACK p_thread_callback_on_exit = wintlscleanup::WinOnThreadExit; #pragma comment(linker, "/INCLUDE:__tls_used") #pragma comment(linker, "/INCLUDE:_p_thread_callback_on_exit") -#endif // _WIN64 +#endif // _X86_ #else // https://github.com/couchbase/gperftools/blob/master/src/windows/port.cc diff --git a/util/thread_operation.h b/util/thread_operation.h index 025392b59de..f1827da0a0c 100644 --- a/util/thread_operation.h +++ b/util/thread_operation.h @@ -70,7 +70,7 @@ static OperationStageInfo global_op_stage_table[] = { {ThreadStatus::STAGE_MEMTABLE_ROLLBACK, "MemTableList::RollbackMemtableFlush"}, {ThreadStatus::STAGE_MEMTABLE_INSTALL_FLUSH_RESULTS, - "MemTableList::InstallMemtableFlushResults"}, + "MemTableList::TryInstallMemtableFlushResults"}, }; // The structure that describes a state. diff --git a/util/threadpool_imp.cc b/util/threadpool_imp.cc index d850b7c9e9f..b431830ee6d 100644 --- a/util/threadpool_imp.cc +++ b/util/threadpool_imp.cc @@ -188,7 +188,7 @@ void ThreadPoolImpl::Impl::BGThread(size_t thread_id) { bool low_cpu_priority = false; while (true) { -// Wait until there is an item that is ready to run + // Wait until there is an item that is ready to run std::unique_lock lock(mu_); // Stop waiting if the thread needs to do work or needs to terminate. 
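The IsDirectIOSupported() helper declared above (and defined in testutil.cc earlier in this diff) exists so tests can skip direct-I/O cases on filesystems such as tmpfs that reject O_DIRECT. A small sketch of the intended pattern; the wrapper name is an assumption:

#include <string>

#include "rocksdb/env.h"
#include "util/testutil.h"

// Returns true when a direct-I/O test case should run under `dir`. The helper
// simply tries to create a throwaway file with use_direct_writes = true and
// reports whether the open succeeded.
bool ShouldRunDirectIOCase(rocksdb::Env* env, const std::string& dir) {
  return rocksdb::test::IsDirectIOSupported(env, dir);
}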
while (!exit_all_threads_ && !IsLastExcessiveThread(thread_id) && @@ -198,7 +198,7 @@ void ThreadPoolImpl::Impl::BGThread(size_t thread_id) { if (exit_all_threads_) { // mechanism to let BG threads exit safely - if(!wait_for_jobs_to_complete_ || + if (!wait_for_jobs_to_complete_ || queue_.empty()) { break; } diff --git a/util/trace_replay.cc b/util/trace_replay.cc index cd2e3ee95e2..5b9bec651e4 100644 --- a/util/trace_replay.cc +++ b/util/trace_replay.cc @@ -16,6 +16,8 @@ namespace rocksdb { +const std::string kTraceMagic = "feedcafedeadbeef"; + namespace { void EncodeCFAndKey(std::string* dst, uint32_t cf_id, const Slice& key) { PutFixed32(dst, cf_id); @@ -29,14 +31,20 @@ void DecodeCFAndKey(std::string& buffer, uint32_t* cf_id, Slice* key) { } } // namespace -Tracer::Tracer(Env* env, std::unique_ptr&& trace_writer) - : env_(env), trace_writer_(std::move(trace_writer)) { +Tracer::Tracer(Env* env, const TraceOptions& trace_options, + std::unique_ptr&& trace_writer) + : env_(env), + trace_options_(trace_options), + trace_writer_(std::move(trace_writer)) { WriteHeader(); } Tracer::~Tracer() { trace_writer_.reset(); } Status Tracer::Write(WriteBatch* write_batch) { + if (IsTraceFileOverMax()) { + return Status::OK(); + } Trace trace; trace.ts = env_->NowMicros(); trace.type = kTraceWrite; @@ -45,6 +53,9 @@ Status Tracer::Write(WriteBatch* write_batch) { } Status Tracer::Get(ColumnFamilyHandle* column_family, const Slice& key) { + if (IsTraceFileOverMax()) { + return Status::OK(); + } Trace trace; trace.ts = env_->NowMicros(); trace.type = kTraceGet; @@ -53,6 +64,9 @@ Status Tracer::Get(ColumnFamilyHandle* column_family, const Slice& key) { } Status Tracer::IteratorSeek(const uint32_t& cf_id, const Slice& key) { + if (IsTraceFileOverMax()) { + return Status::OK(); + } Trace trace; trace.ts = env_->NowMicros(); trace.type = kTraceIteratorSeek; @@ -61,6 +75,9 @@ Status Tracer::IteratorSeek(const uint32_t& cf_id, const Slice& key) { } Status Tracer::IteratorSeekForPrev(const uint32_t& cf_id, const Slice& key) { + if (IsTraceFileOverMax()) { + return Status::OK(); + } Trace trace; trace.ts = env_->NowMicros(); trace.type = kTraceIteratorSeekForPrev; @@ -68,6 +85,11 @@ Status Tracer::IteratorSeekForPrev(const uint32_t& cf_id, const Slice& key) { return WriteTrace(trace); } +bool Tracer::IsTraceFileOverMax() { + uint64_t trace_file_size = trace_writer_->GetFileSize(); + return (trace_file_size > trace_options_.max_trace_file_size); +} + Status Tracer::WriteHeader() { std::ostringstream s; s << kTraceMagic << "\t" @@ -103,7 +125,7 @@ Status Tracer::WriteTrace(const Trace& trace) { Status Tracer::Close() { return WriteFooter(); } Replayer::Replayer(DB* db, const std::vector& handles, - unique_ptr&& reader) + std::unique_ptr&& reader) : trace_reader_(std::move(reader)) { assert(db != nullptr); db_ = static_cast(db->GetRootDB()); diff --git a/util/trace_replay.h b/util/trace_replay.h index b324696f013..d935f65ce7e 100644 --- a/util/trace_replay.h +++ b/util/trace_replay.h @@ -10,6 +10,7 @@ #include #include "rocksdb/env.h" +#include "rocksdb/options.h" #include "rocksdb/trace_reader_writer.h" namespace rocksdb { @@ -21,7 +22,7 @@ class DBImpl; class Slice; class WriteBatch; -const std::string kTraceMagic = "feedcafedeadbeef"; +extern const std::string kTraceMagic; const unsigned int kTraceTimestampSize = 8; const unsigned int kTraceTypeSize = 1; const unsigned int kTracePayloadLengthSize = 4; @@ -55,13 +56,15 @@ struct Trace { // Trace RocksDB operations using a TraceWriter. 
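The TraceOptions plumbed into Tracer below is what enforces the file-size cap: once IsTraceFileOverMax() reports the cap exceeded, further Write/Get/IteratorSeek calls become no-ops. A possible end-to-end sketch from the user side, assuming the DB::StartTrace/DB::EndTrace entry points and the NewFileTraceWriter factory from the same release:

#include <memory>
#include <string>

#include "rocksdb/db.h"
#include "rocksdb/options.h"
#include "rocksdb/trace_reader_writer.h"

// Record a workload trace, capped at roughly 64 MB of trace file.
rocksdb::Status TraceWorkload(rocksdb::DB* db, rocksdb::Env* env,
                              const std::string& trace_file) {
  rocksdb::TraceOptions trace_opts;
  trace_opts.max_trace_file_size = 64ull << 20;
  std::unique_ptr<rocksdb::TraceWriter> writer;
  rocksdb::Status s = rocksdb::NewFileTraceWriter(env, rocksdb::EnvOptions(),
                                                  trace_file, &writer);
  if (!s.ok()) {
    return s;
  }
  s = db->StartTrace(trace_opts, std::move(writer));
  if (!s.ok()) {
    return s;
  }
  // ... run the queries of interest; once the writer's file size exceeds the
  // cap, the tracer silently stops appending ...
  return db->EndTrace();
}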
class Tracer { public: - Tracer(Env* env, std::unique_ptr&& trace_writer); + Tracer(Env* env, const TraceOptions& trace_options, + std::unique_ptr&& trace_writer); ~Tracer(); Status Write(WriteBatch* write_batch); Status Get(ColumnFamilyHandle* cfname, const Slice& key); Status IteratorSeek(const uint32_t& cf_id, const Slice& key); Status IteratorSeekForPrev(const uint32_t& cf_id, const Slice& key); + bool IsTraceFileOverMax(); Status Close(); @@ -71,7 +74,8 @@ class Tracer { Status WriteTrace(const Trace& trace); Env* env_; - unique_ptr trace_writer_; + TraceOptions trace_options_; + std::unique_ptr trace_writer_; }; // Replay RocksDB operations from a trace. diff --git a/util/transaction_test_util.cc b/util/transaction_test_util.cc index 63339189170..58d95b2ae19 100644 --- a/util/transaction_test_util.cc +++ b/util/transaction_test_util.cc @@ -13,6 +13,7 @@ #include #include #include +#include #include #include @@ -135,8 +136,7 @@ bool RandomTransactionInserter::DoInsert(DB* db, Transaction* txn, std::vector set_vec(num_sets_); std::iota(set_vec.begin(), set_vec.end(), static_cast(0)); - std::random_shuffle(set_vec.begin(), set_vec.end(), - [&](uint64_t r) { return rand_->Uniform(r); }); + std::shuffle(set_vec.begin(), set_vec.end(), std::random_device{}); // For each set, pick a key at random and increment it for (uint16_t set_i : set_vec) { @@ -258,10 +258,8 @@ Status RandomTransactionInserter::Verify(DB* db, uint16_t num_sets, std::vector set_vec(num_sets); std::iota(set_vec.begin(), set_vec.end(), static_cast(0)); - if (rand) { - std::random_shuffle(set_vec.begin(), set_vec.end(), - [&](uint64_t r) { return rand->Uniform(r); }); - } + std::shuffle(set_vec.begin(), set_vec.end(), std::random_device{}); + // For each set of keys with the same prefix, sum all the values for (uint16_t set_i : set_vec) { // Five digits (since the largest uint16_t is 65535) plus the NUL diff --git a/util/vector_iterator.h b/util/vector_iterator.h new file mode 100644 index 00000000000..da60eb229cf --- /dev/null +++ b/util/vector_iterator.h @@ -0,0 +1,100 @@ +#pragma once + +#include +#include +#include + +#include "db/dbformat.h" +#include "rocksdb/iterator.h" +#include "rocksdb/slice.h" +#include "table/internal_iterator.h" + +namespace rocksdb { + +// Iterator over a vector of keys/values +class VectorIterator : public InternalIterator { + public: + VectorIterator(std::vector keys, std::vector values, + const InternalKeyComparator* icmp) + : keys_(std::move(keys)), + values_(std::move(values)), + indexed_cmp_(icmp, &keys_), + current_(keys.size()) { + assert(keys_.size() == values_.size()); + + indices_.reserve(keys_.size()); + for (size_t i = 0; i < keys_.size(); i++) { + indices_.push_back(i); + } + std::sort(indices_.begin(), indices_.end(), indexed_cmp_); + } + + virtual bool Valid() const override { + return !indices_.empty() && current_ < indices_.size(); + } + + virtual void SeekToFirst() override { current_ = 0; } + virtual void SeekToLast() override { current_ = indices_.size() - 1; } + + virtual void Seek(const Slice& target) override { + current_ = std::lower_bound(indices_.begin(), indices_.end(), target, + indexed_cmp_) - + indices_.begin(); + } + + virtual void SeekForPrev(const Slice& target) override { + current_ = std::lower_bound(indices_.begin(), indices_.end(), target, + indexed_cmp_) - + indices_.begin(); + if (!Valid()) { + SeekToLast(); + } else { + Prev(); + } + } + + virtual void Next() override { current_++; } + virtual void Prev() override { current_--; } + + virtual Slice 
key() const override { + return Slice(keys_[indices_[current_]]); + } + virtual Slice value() const override { + return Slice(values_[indices_[current_]]); + } + + virtual Status status() const override { return Status::OK(); } + + virtual bool IsKeyPinned() const override { return true; } + virtual bool IsValuePinned() const override { return true; } + + private: + struct IndexedKeyComparator { + IndexedKeyComparator(const InternalKeyComparator* c, + const std::vector* ks) + : cmp(c), keys(ks) {} + + bool operator()(size_t a, size_t b) const { + return cmp->Compare((*keys)[a], (*keys)[b]) < 0; + } + + bool operator()(size_t a, const Slice& b) const { + return cmp->Compare((*keys)[a], b) < 0; + } + + bool operator()(const Slice& a, size_t b) const { + return cmp->Compare(a, (*keys)[b]) < 0; + } + + const InternalKeyComparator* cmp; + const std::vector* keys; + }; + + std::vector keys_; + std::vector values_; + IndexedKeyComparator indexed_cmp_; + std::vector indices_; + size_t current_; +}; + +} // namespace rocksdb diff --git a/util/xxhash.cc b/util/xxhash.cc index 4bce61a4878..2ec95a636e5 100644 --- a/util/xxhash.cc +++ b/util/xxhash.cc @@ -34,6 +34,39 @@ You can contact the author at : //************************************** // Tuning parameters //************************************** +/*!XXH_FORCE_MEMORY_ACCESS : + * By default, access to unaligned memory is controlled by `memcpy()`, which is + * safe and portable. Unfortunately, on some target/compiler combinations, the + * generated assembly is sub-optimal. The below switch allow to select different + * access method for improved performance. Method 0 (default) : use `memcpy()`. + * Safe and portable. Method 1 : `__packed` statement. It depends on compiler + * extension (ie, not portable). This method is safe if your compiler supports + * it, and *generally* as fast or faster than `memcpy`. Method 2 : direct + * access. This method doesn't depend on compiler but violate C standard. It can + * generate buggy code on targets which do not support unaligned memory + * accesses. But in some circumstances, it's the only known way to get the most + * performance (ie GCC + ARMv6) See http://stackoverflow.com/a/32095106/646947 + * for details. Prefer these methods in priority order (0 > 1 > 2) + */ + +#include "util/util.h" + +#ifndef XXH_FORCE_MEMORY_ACCESS /* can be defined externally, on command line \ + for example */ +#if defined(__GNUC__) && \ + (defined(__ARM_ARCH_6__) || defined(__ARM_ARCH_6J__) || \ + defined(__ARM_ARCH_6K__) || defined(__ARM_ARCH_6Z__) || \ + defined(__ARM_ARCH_6ZK__) || defined(__ARM_ARCH_6T2__)) +#define XXH_FORCE_MEMORY_ACCESS 2 +#elif (defined(__INTEL_COMPILER) && !defined(_WIN32)) || \ + (defined(__GNUC__) && \ + (defined(__ARM_ARCH_7__) || defined(__ARM_ARCH_7A__) || \ + defined(__ARM_ARCH_7R__) || defined(__ARM_ARCH_7M__) || \ + defined(__ARM_ARCH_7S__))) +#define XXH_FORCE_MEMORY_ACCESS 1 +#endif +#endif + // Unaligned memory access is automatically enabled for "common" CPU, such as x86. // For others CPU, the compiler will be more cautious, and insert extra code to ensure aligned access is respected. // If you know your target CPU supports unaligned memory access, you want to force this option manually to improve performance. @@ -58,6 +91,21 @@ You can contact the author at : // This option has no impact on Little_Endian CPU. #define XXH_FORCE_NATIVE_FORMAT 0 +/*!XXH_FORCE_ALIGN_CHECK : + * This is a minor performance trick, only useful with lots of very small keys. 
+ * It means : check for aligned/unaligned input. + * The check costs one initial branch per hash; + * set it to 0 when the input is guaranteed to be aligned, + * or when alignment doesn't matter for performance. + */ +#ifndef XXH_FORCE_ALIGN_CHECK /* can be defined externally */ +#if defined(__i386) || defined(_M_IX86) || defined(__x86_64__) || \ + defined(_M_X64) +#define XXH_FORCE_ALIGN_CHECK 0 +#else +#define XXH_FORCE_ALIGN_CHECK 1 +#endif +#endif //************************************** // Compiler Specific Options @@ -91,7 +139,7 @@ FORCE_INLINE void XXH_free (void* p) { free(p); } // for memcpy() #include FORCE_INLINE void* XXH_memcpy(void* dest, const void* src, size_t size) { return memcpy(dest,src,size); } - +#include /* assert */ namespace rocksdb { //************************************** @@ -134,6 +182,34 @@ typedef struct _U32_S { U32 v; } _PACKED U32_S; #define A32(x) (((U32_S *)(x))->v) +#if (defined(XXH_FORCE_MEMORY_ACCESS) && (XXH_FORCE_MEMORY_ACCESS == 2)) + +/* Force direct memory access. Only works on CPU which support unaligned memory + * access in hardware */ +static U32 XXH_read32(const void* memPtr) { return *(const U32*)memPtr; } + +#elif (defined(XXH_FORCE_MEMORY_ACCESS) && (XXH_FORCE_MEMORY_ACCESS == 1)) + +/* __pack instructions are safer, but compiler specific, hence potentially + * problematic for some compilers */ +/* currently only defined for gcc and icc */ +typedef union { + U32 u32; +} __attribute__((packed)) unalign; +static U32 XXH_read32(const void* ptr) { return ((const unalign*)ptr)->u32; } + +#else + +/* portable and safe solution. Generally efficient. + * see : http://stackoverflow.com/a/32095106/646947 + */ +static U32 XXH_read32(const void* memPtr) { + U32 val; + memcpy(&val, memPtr, sizeof(val)); + return val; +} + +#endif /* XXH_FORCE_DIRECT_MEMORY_ACCESS */ //*************************************** // Compiler-specific Functions and Macros @@ -143,8 +219,10 @@ typedef struct _U32_S { U32 v; } _PACKED U32_S; // Note : although _rotl exists for minGW (GCC under windows), performance seems poor #if defined(_MSC_VER) # define XXH_rotl32(x,r) _rotl(x,r) +#define XXH_rotl64(x, r) _rotl64(x, r) #else # define XXH_rotl32(x,r) ((x << r) | (x >> (32 - r))) +#define XXH_rotl64(x, r) ((x << r) | (x >> (64 - r))) #endif #if defined(_MSC_VER) // Visual Studio @@ -199,12 +277,25 @@ FORCE_INLINE U32 XXH_readLE32_align(const U32* ptr, XXH_endianess endian, XXH_al return endian==XXH_littleEndian ? *ptr : XXH_swap32(*ptr); } -FORCE_INLINE U32 XXH_readLE32(const U32* ptr, XXH_endianess endian) { return XXH_readLE32_align(ptr, endian, XXH_unaligned); } +FORCE_INLINE U32 XXH_readLE32_align(const void* ptr, XXH_endianess endian, + XXH_alignment align) { + if (align == XXH_unaligned) + return endian == XXH_littleEndian ? XXH_read32(ptr) + : XXH_swap32(XXH_read32(ptr)); + else + return endian == XXH_littleEndian ? 
*(const U32*)ptr + : XXH_swap32(*(const U32*)ptr); +} +FORCE_INLINE U32 XXH_readLE32(const U32* ptr, XXH_endianess endian) { + return XXH_readLE32_align(ptr, endian, XXH_unaligned); +} //**************************** // Simple Hash Functions //**************************** +#define XXH_get32bits(p) XXH_readLE32_align(p, endian, align) + FORCE_INLINE U32 XXH32_endian_align(const void* input, int len, U32 seed, XXH_endianess endian, XXH_alignment align) { const BYTE* p = (const BYTE*)input; @@ -476,4 +567,508 @@ U32 XXH32_digest (void* state_in) return h32; } +/* ******************************************************************* + * 64-bit hash functions + *********************************************************************/ + + #if (defined(XXH_FORCE_MEMORY_ACCESS) && (XXH_FORCE_MEMORY_ACCESS==2)) + + /* Force direct memory access. Only works on CPU which support unaligned memory access in hardware */ + static U64 XXH_read64(const void* memPtr) { return *(const U64*) memPtr; } + + #elif (defined(XXH_FORCE_MEMORY_ACCESS) && (XXH_FORCE_MEMORY_ACCESS==1)) + + /* __pack instructions are safer, but compiler specific, hence potentially problematic for some compilers */ + /* currently only defined for gcc and icc */ + typedef union { U32 u32; U64 u64; } __attribute__((packed)) unalign64; + static U64 XXH_read64(const void* ptr) { return ((const unalign64*)ptr)->u64; } + + #else + + /* portable and safe solution. Generally efficient. + * see : http://stackoverflow.com/a/32095106/646947 + */ + + static U64 XXH_read64(const void* memPtr) + { + U64 val; + memcpy(&val, memPtr, sizeof(val)); + return val; + } +#endif /* XXH_FORCE_DIRECT_MEMORY_ACCESS */ + +#if defined(_MSC_VER) /* Visual Studio */ +#define XXH_swap64 _byteswap_uint64 +#elif XXH_GCC_VERSION >= 403 +#define XXH_swap64 __builtin_bswap64 +#else +static U64 XXH_swap64(U64 x) { + return ((x << 56) & 0xff00000000000000ULL) | + ((x << 40) & 0x00ff000000000000ULL) | + ((x << 24) & 0x0000ff0000000000ULL) | + ((x << 8) & 0x000000ff00000000ULL) | + ((x >> 8) & 0x00000000ff000000ULL) | + ((x >> 24) & 0x0000000000ff0000ULL) | + ((x >> 40) & 0x000000000000ff00ULL) | + ((x >> 56) & 0x00000000000000ffULL); +} +#endif + +FORCE_INLINE U64 XXH_readLE64_align(const void* ptr, XXH_endianess endian, + XXH_alignment align) { + if (align == XXH_unaligned) + return endian == XXH_littleEndian ? XXH_read64(ptr) + : XXH_swap64(XXH_read64(ptr)); + else + return endian == XXH_littleEndian ? *(const U64*)ptr + : XXH_swap64(*(const U64*)ptr); +} + +FORCE_INLINE U64 XXH_readLE64(const void* ptr, XXH_endianess endian) { + return XXH_readLE64_align(ptr, endian, XXH_unaligned); +} + +static U64 XXH_readBE64(const void* ptr) { + return XXH_CPU_LITTLE_ENDIAN ? 
XXH_swap64(XXH_read64(ptr)) : XXH_read64(ptr); +} + +/*====== xxh64 ======*/ + +static const U64 PRIME64_1 = + 11400714785074694791ULL; /* 0b1001111000110111011110011011000110000101111010111100101010000111 + */ +static const U64 PRIME64_2 = + 14029467366897019727ULL; /* 0b1100001010110010101011100011110100100111110101001110101101001111 + */ +static const U64 PRIME64_3 = + 1609587929392839161ULL; /* 0b0001011001010110011001111011000110011110001101110111100111111001 + */ +static const U64 PRIME64_4 = + 9650029242287828579ULL; /* 0b1000010111101011110010100111011111000010101100101010111001100011 + */ +static const U64 PRIME64_5 = + 2870177450012600261ULL; /* 0b0010011111010100111010110010111100010110010101100110011111000101 + */ + +static U64 XXH64_round(U64 acc, U64 input) { + acc += input * PRIME64_2; + acc = XXH_rotl64(acc, 31); + acc *= PRIME64_1; + return acc; +} + +static U64 XXH64_mergeRound(U64 acc, U64 val) { + val = XXH64_round(0, val); + acc ^= val; + acc = acc * PRIME64_1 + PRIME64_4; + return acc; +} + +static U64 XXH64_avalanche(U64 h64) { + h64 ^= h64 >> 33; + h64 *= PRIME64_2; + h64 ^= h64 >> 29; + h64 *= PRIME64_3; + h64 ^= h64 >> 32; + return h64; +} + +#define XXH_get64bits(p) XXH_readLE64_align(p, endian, align) + +static U64 XXH64_finalize(U64 h64, const void* ptr, size_t len, + XXH_endianess endian, XXH_alignment align) { + const BYTE* p = (const BYTE*)ptr; + +#define PROCESS1_64 \ + h64 ^= (*p++) * PRIME64_5; \ + h64 = XXH_rotl64(h64, 11) * PRIME64_1; + +#define PROCESS4_64 \ + h64 ^= (U64)(XXH_get32bits(p)) * PRIME64_1; \ + p += 4; \ + h64 = XXH_rotl64(h64, 23) * PRIME64_2 + PRIME64_3; + +#define PROCESS8_64 \ + { \ + U64 const k1 = XXH64_round(0, XXH_get64bits(p)); \ + p += 8; \ + h64 ^= k1; \ + h64 = XXH_rotl64(h64, 27) * PRIME64_1 + PRIME64_4; \ + } + + switch (len & 31) { + case 24: + PROCESS8_64; + FALLTHROUGH_INTENDED; + /* fallthrough */ + case 16: + PROCESS8_64; + FALLTHROUGH_INTENDED; + /* fallthrough */ + case 8: + PROCESS8_64; + return XXH64_avalanche(h64); + + case 28: + PROCESS8_64; + FALLTHROUGH_INTENDED; + /* fallthrough */ + case 20: + PROCESS8_64; + FALLTHROUGH_INTENDED; + /* fallthrough */ + case 12: + PROCESS8_64; + FALLTHROUGH_INTENDED; + /* fallthrough */ + case 4: + PROCESS4_64; + return XXH64_avalanche(h64); + + case 25: + PROCESS8_64; + FALLTHROUGH_INTENDED; + /* fallthrough */ + case 17: + PROCESS8_64; + FALLTHROUGH_INTENDED; + /* fallthrough */ + case 9: + PROCESS8_64; + PROCESS1_64; + return XXH64_avalanche(h64); + + case 29: + PROCESS8_64; + FALLTHROUGH_INTENDED; + /* fallthrough */ + case 21: + PROCESS8_64; + FALLTHROUGH_INTENDED; + /* fallthrough */ + case 13: + PROCESS8_64; + FALLTHROUGH_INTENDED; + /* fallthrough */ + case 5: + PROCESS4_64; + PROCESS1_64; + return XXH64_avalanche(h64); + + case 26: + PROCESS8_64; + FALLTHROUGH_INTENDED; + /* fallthrough */ + case 18: + PROCESS8_64; + FALLTHROUGH_INTENDED; + /* fallthrough */ + case 10: + PROCESS8_64; + PROCESS1_64; + PROCESS1_64; + return XXH64_avalanche(h64); + + case 30: + PROCESS8_64; + FALLTHROUGH_INTENDED; + /* fallthrough */ + case 22: + PROCESS8_64; + FALLTHROUGH_INTENDED; + /* fallthrough */ + case 14: + PROCESS8_64; + FALLTHROUGH_INTENDED; + /* fallthrough */ + case 6: + PROCESS4_64; + PROCESS1_64; + PROCESS1_64; + return XXH64_avalanche(h64); + + case 27: + PROCESS8_64; + FALLTHROUGH_INTENDED; + /* fallthrough */ + case 19: + PROCESS8_64; + FALLTHROUGH_INTENDED; + /* fallthrough */ + case 11: + PROCESS8_64; + PROCESS1_64; + PROCESS1_64; + PROCESS1_64; + return 
XXH64_avalanche(h64); + + case 31: + PROCESS8_64; + FALLTHROUGH_INTENDED; + /* fallthrough */ + case 23: + PROCESS8_64; + FALLTHROUGH_INTENDED; + /* fallthrough */ + case 15: + PROCESS8_64; + FALLTHROUGH_INTENDED; + /* fallthrough */ + case 7: + PROCESS4_64; + FALLTHROUGH_INTENDED; + /* fallthrough */ + case 3: + PROCESS1_64; + FALLTHROUGH_INTENDED; + /* fallthrough */ + case 2: + PROCESS1_64; + FALLTHROUGH_INTENDED; + /* fallthrough */ + case 1: + PROCESS1_64; + FALLTHROUGH_INTENDED; + /* fallthrough */ + case 0: + return XXH64_avalanche(h64); + } + + /* impossible to reach */ + assert(0); + return 0; /* unreachable, but some compilers complain without it */ +} + +FORCE_INLINE U64 XXH64_endian_align(const void* input, size_t len, U64 seed, + XXH_endianess endian, XXH_alignment align) { + const BYTE* p = (const BYTE*)input; + const BYTE* bEnd = p + len; + U64 h64; + +#if defined(XXH_ACCEPT_NULL_INPUT_POINTER) && \ + (XXH_ACCEPT_NULL_INPUT_POINTER >= 1) + if (p == NULL) { + len = 0; + bEnd = p = (const BYTE*)(size_t)32; + } +#endif + + if (len >= 32) { + const BYTE* const limit = bEnd - 32; + U64 v1 = seed + PRIME64_1 + PRIME64_2; + U64 v2 = seed + PRIME64_2; + U64 v3 = seed + 0; + U64 v4 = seed - PRIME64_1; + + do { + v1 = XXH64_round(v1, XXH_get64bits(p)); + p += 8; + v2 = XXH64_round(v2, XXH_get64bits(p)); + p += 8; + v3 = XXH64_round(v3, XXH_get64bits(p)); + p += 8; + v4 = XXH64_round(v4, XXH_get64bits(p)); + p += 8; + } while (p <= limit); + + h64 = XXH_rotl64(v1, 1) + XXH_rotl64(v2, 7) + XXH_rotl64(v3, 12) + + XXH_rotl64(v4, 18); + h64 = XXH64_mergeRound(h64, v1); + h64 = XXH64_mergeRound(h64, v2); + h64 = XXH64_mergeRound(h64, v3); + h64 = XXH64_mergeRound(h64, v4); + + } else { + h64 = seed + PRIME64_5; + } + + h64 += (U64)len; + + return XXH64_finalize(h64, p, len, endian, align); +} + +unsigned long long XXH64(const void* input, size_t len, + unsigned long long seed) { +#if 0 + /* Simple version, good for code maintenance, but unfortunately slow for small inputs */ + XXH64_state_t state; + XXH64_reset(&state, seed); + XXH64_update(&state, input, len); + return XXH64_digest(&state); +#else + XXH_endianess endian_detected = (XXH_endianess)XXH_CPU_LITTLE_ENDIAN; + + if (XXH_FORCE_ALIGN_CHECK) { + if ((((size_t)input) & 7) == + 0) { /* Input is aligned, let's leverage the speed advantage */ + if ((endian_detected == XXH_littleEndian) || XXH_FORCE_NATIVE_FORMAT) + return XXH64_endian_align(input, len, seed, XXH_littleEndian, + XXH_aligned); + else + return XXH64_endian_align(input, len, seed, XXH_bigEndian, XXH_aligned); + } + } + + if ((endian_detected == XXH_littleEndian) || XXH_FORCE_NATIVE_FORMAT) + return XXH64_endian_align(input, len, seed, XXH_littleEndian, + XXH_unaligned); + else + return XXH64_endian_align(input, len, seed, XXH_bigEndian, XXH_unaligned); +#endif +} + +/*====== Hash Streaming ======*/ + +XXH64_state_t* XXH64_createState(void) { + return (XXH64_state_t*)XXH_malloc(sizeof(XXH64_state_t)); +} +XXH_errorcode XXH64_freeState(XXH64_state_t* statePtr) { + XXH_free(statePtr); + return XXH_OK; +} + +void XXH64_copyState(XXH64_state_t* dstState, const XXH64_state_t* srcState) { + memcpy(dstState, srcState, sizeof(*dstState)); +} + +XXH_errorcode XXH64_reset(XXH64_state_t* statePtr, unsigned long long seed) { + XXH64_state_t state; /* using a local state to memcpy() in order to avoid + strict-aliasing warnings */ + memset(&state, 0, sizeof(state)); + state.v1 = seed + PRIME64_1 + PRIME64_2; + state.v2 = seed + PRIME64_2; + state.v3 = seed + 0; + state.v4 = seed - 
PRIME64_1; + /* do not write into reserved, planned to be removed in a future version */ + memcpy(statePtr, &state, sizeof(state) - sizeof(state.reserved)); + return XXH_OK; +} + +FORCE_INLINE XXH_errorcode XXH64_update_endian(XXH64_state_t* state, + const void* input, size_t len, + XXH_endianess endian) { + if (input == NULL) +#if defined(XXH_ACCEPT_NULL_INPUT_POINTER) && \ + (XXH_ACCEPT_NULL_INPUT_POINTER >= 1) + return XXH_OK; +#else + return XXH_ERROR; +#endif + + { + const BYTE* p = (const BYTE*)input; + const BYTE* const bEnd = p + len; + + state->total_len += len; + + if (state->memsize + len < 32) { /* fill in tmp buffer */ + XXH_memcpy(((BYTE*)state->mem64) + state->memsize, input, len); + state->memsize += (U32)len; + return XXH_OK; + } + + if (state->memsize) { /* tmp buffer is full */ + XXH_memcpy(((BYTE*)state->mem64) + state->memsize, input, + 32 - state->memsize); + state->v1 = + XXH64_round(state->v1, XXH_readLE64(state->mem64 + 0, endian)); + state->v2 = + XXH64_round(state->v2, XXH_readLE64(state->mem64 + 1, endian)); + state->v3 = + XXH64_round(state->v3, XXH_readLE64(state->mem64 + 2, endian)); + state->v4 = + XXH64_round(state->v4, XXH_readLE64(state->mem64 + 3, endian)); + p += 32 - state->memsize; + state->memsize = 0; + } + + if (p + 32 <= bEnd) { + const BYTE* const limit = bEnd - 32; + U64 v1 = state->v1; + U64 v2 = state->v2; + U64 v3 = state->v3; + U64 v4 = state->v4; + + do { + v1 = XXH64_round(v1, XXH_readLE64(p, endian)); + p += 8; + v2 = XXH64_round(v2, XXH_readLE64(p, endian)); + p += 8; + v3 = XXH64_round(v3, XXH_readLE64(p, endian)); + p += 8; + v4 = XXH64_round(v4, XXH_readLE64(p, endian)); + p += 8; + } while (p <= limit); + + state->v1 = v1; + state->v2 = v2; + state->v3 = v3; + state->v4 = v4; + } + + if (p < bEnd) { + XXH_memcpy(state->mem64, p, (size_t)(bEnd - p)); + state->memsize = (unsigned)(bEnd - p); + } + } + + return XXH_OK; +} + +XXH_errorcode XXH64_update(XXH64_state_t* state_in, const void* input, + size_t len) { + XXH_endianess endian_detected = (XXH_endianess)XXH_CPU_LITTLE_ENDIAN; + + if ((endian_detected == XXH_littleEndian) || XXH_FORCE_NATIVE_FORMAT) + return XXH64_update_endian(state_in, input, len, XXH_littleEndian); + else + return XXH64_update_endian(state_in, input, len, XXH_bigEndian); +} + +FORCE_INLINE U64 XXH64_digest_endian(const XXH64_state_t* state, + XXH_endianess endian) { + U64 h64; + + if (state->total_len >= 32) { + U64 const v1 = state->v1; + U64 const v2 = state->v2; + U64 const v3 = state->v3; + U64 const v4 = state->v4; + + h64 = XXH_rotl64(v1, 1) + XXH_rotl64(v2, 7) + XXH_rotl64(v3, 12) + + XXH_rotl64(v4, 18); + h64 = XXH64_mergeRound(h64, v1); + h64 = XXH64_mergeRound(h64, v2); + h64 = XXH64_mergeRound(h64, v3); + h64 = XXH64_mergeRound(h64, v4); + } else { + h64 = state->v3 /*seed*/ + PRIME64_5; + } + + h64 += (U64)state->total_len; + + return XXH64_finalize(h64, state->mem64, (size_t)state->total_len, endian, + XXH_aligned); +} + +unsigned long long XXH64_digest(const XXH64_state_t* state_in) { + XXH_endianess endian_detected = (XXH_endianess)XXH_CPU_LITTLE_ENDIAN; + + if ((endian_detected == XXH_littleEndian) || XXH_FORCE_NATIVE_FORMAT) + return XXH64_digest_endian(state_in, XXH_littleEndian); + else + return XXH64_digest_endian(state_in, XXH_bigEndian); +} + +/*====== Canonical representation ======*/ + +void XXH64_canonicalFromHash(XXH64_canonical_t* dst, XXH64_hash_t hash) { + XXH_STATIC_ASSERT(sizeof(XXH64_canonical_t) == sizeof(XXH64_hash_t)); + if (XXH_CPU_LITTLE_ENDIAN) hash = XXH_swap64(hash); + 
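The streaming reset/update/digest path defined above and the one-shot XXH64() compute the same digest for the same input and seed. A small consistency-check sketch using only the functions declared in xxhash.h:

#include <cassert>
#include <cstddef>

#include "util/xxhash.h"

// One-shot and streaming XXH64 agree on the same input and seed.
unsigned long long HashBothWays(const void* data, size_t len) {
  unsigned long long one_shot = rocksdb::XXH64(data, len, /*seed=*/0);

  rocksdb::XXH64_state_t* state = rocksdb::XXH64_createState();
  rocksdb::XXH64_reset(state, /*seed=*/0);
  rocksdb::XXH64_update(state, data, len);
  unsigned long long streamed = rocksdb::XXH64_digest(state);
  rocksdb::XXH64_freeState(state);

  assert(one_shot == streamed);
  return one_shot;
}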
memcpy(dst, &hash, sizeof(*dst)); +} + +XXH64_hash_t XXH64_hashFromCanonical(const XXH64_canonical_t* src) { + return XXH_readBE64(src); +} } // namespace rocksdb diff --git a/util/xxhash.h b/util/xxhash.h index 3343e3488f4..88352ac75f9 100644 --- a/util/xxhash.h +++ b/util/xxhash.h @@ -59,6 +59,14 @@ It depends on successfully passing SMHasher test set. #pragma once +#include + +#if !defined(__VMS) && \ + (defined(__cplusplus) || \ + (defined(__STDC_VERSION__) && (__STDC_VERSION__ >= 199901L) /* C99 */)) +#include +#endif + #if defined (__cplusplus) namespace rocksdb { #endif @@ -67,6 +75,7 @@ namespace rocksdb { //**************************** // Type //**************************** +/* size_t */ typedef enum { XXH_OK=0, XXH_ERROR } XXH_errorcode; @@ -157,7 +166,74 @@ To free memory context, use XXH32_digest(), or free(). #define XXH32_result XXH32_digest #define XXH32_getIntermediateResult XXH32_intermediateDigest +/*-********************************************************************** + * 64-bit hash + ************************************************************************/ +typedef unsigned long long XXH64_hash_t; +/*! XXH64() : + Calculate the 64-bit hash of sequence of length "len" stored at memory + address "input". "seed" can be used to alter the result predictably. This + function runs faster on 64-bit systems, but slower on 32-bit systems (see + benchmark). +*/ +XXH64_hash_t XXH64(const void* input, size_t length, unsigned long long seed); + +/*====== Streaming ======*/ +typedef struct XXH64_state_s XXH64_state_t; /* incomplete type */ +XXH64_state_t* XXH64_createState(void); +XXH_errorcode XXH64_freeState(XXH64_state_t* statePtr); +void XXH64_copyState(XXH64_state_t* dst_state, const XXH64_state_t* src_state); + +XXH_errorcode XXH64_reset(XXH64_state_t* statePtr, unsigned long long seed); +XXH_errorcode XXH64_update(XXH64_state_t* statePtr, const void* input, + size_t length); +XXH64_hash_t XXH64_digest(const XXH64_state_t* statePtr); + +/*====== Canonical representation ======*/ +typedef struct { + unsigned char digest[8]; +} XXH64_canonical_t; +void XXH64_canonicalFromHash(XXH64_canonical_t* dst, XXH64_hash_t hash); +XXH64_hash_t XXH64_hashFromCanonical(const XXH64_canonical_t* src); + +/* These definitions are only present to allow + * static allocation of XXH state, on stack or in a struct for example. + * Never **ever** use members directly. 
*/ + +#if !defined(__VMS) && \ + (defined(__cplusplus) || \ + (defined(__STDC_VERSION__) && (__STDC_VERSION__ >= 199901L) /* C99 */)) + +struct XXH64_state_s { + uint64_t total_len; + uint64_t v1; + uint64_t v2; + uint64_t v3; + uint64_t v4; + uint64_t mem64[4]; + uint32_t memsize; + uint32_t reserved[2]; /* never read nor write, might be removed in a future + version */ +}; /* typedef'd to XXH64_state_t */ + +#else + +#ifndef XXH_NO_LONG_LONG /* remove 64-bit support */ +struct XXH64_state_s { + unsigned long long total_len; + unsigned long long v1; + unsigned long long v2; + unsigned long long v3; + unsigned long long v4; + unsigned long long mem64[4]; + unsigned memsize; + unsigned reserved[2]; /* never read nor write, might be removed in a future + version */ +}; /* typedef'd to XXH64_state_t */ +#endif + +#endif #if defined (__cplusplus) } // namespace rocksdb diff --git a/utilities/backupable/backupable_db.cc b/utilities/backupable/backupable_db.cc index 4cafc6ab148..78def188cf4 100644 --- a/utilities/backupable/backupable_db.cc +++ b/utilities/backupable/backupable_db.cc @@ -305,16 +305,16 @@ class BackupEngineImpl : public BackupEngine { // @param contents If non-empty, the file will be created with these contents. Status CopyOrCreateFile(const std::string& src, const std::string& dst, const std::string& contents, Env* src_env, - Env* dst_env, bool sync, RateLimiter* rate_limiter, + Env* dst_env, const EnvOptions& src_env_options, + bool sync, RateLimiter* rate_limiter, uint64_t* size = nullptr, uint32_t* checksum_value = nullptr, uint64_t size_limit = 0, std::function progress_callback = []() {}); - Status CalculateChecksum(const std::string& src, - Env* src_env, - uint64_t size_limit, - uint32_t* checksum_value); + Status CalculateChecksum(const std::string& src, Env* src_env, + const EnvOptions& src_env_options, + uint64_t size_limit, uint32_t* checksum_value); struct CopyOrCreateResult { uint64_t size; @@ -331,6 +331,7 @@ class BackupEngineImpl : public BackupEngine { std::string contents; Env* src_env; Env* dst_env; + EnvOptions src_env_options; bool sync; RateLimiter* rate_limiter; uint64_t size_limit; @@ -338,14 +339,15 @@ class BackupEngineImpl : public BackupEngine { std::function progress_callback; CopyOrCreateWorkItem() - : src_path(""), - dst_path(""), - contents(""), - src_env(nullptr), - dst_env(nullptr), - sync(false), - rate_limiter(nullptr), - size_limit(0) {} + : src_path(""), + dst_path(""), + contents(""), + src_env(nullptr), + dst_env(nullptr), + src_env_options(), + sync(false), + rate_limiter(nullptr), + size_limit(0) {} CopyOrCreateWorkItem(const CopyOrCreateWorkItem&) = delete; CopyOrCreateWorkItem& operator=(const CopyOrCreateWorkItem&) = delete; @@ -360,6 +362,7 @@ class BackupEngineImpl : public BackupEngine { contents = std::move(o.contents); src_env = o.src_env; dst_env = o.dst_env; + src_env_options = std::move(o.src_env_options); sync = o.sync; rate_limiter = o.rate_limiter; size_limit = o.size_limit; @@ -370,14 +373,15 @@ class BackupEngineImpl : public BackupEngine { CopyOrCreateWorkItem(std::string _src_path, std::string _dst_path, std::string _contents, Env* _src_env, Env* _dst_env, - bool _sync, RateLimiter* _rate_limiter, - uint64_t _size_limit, + EnvOptions _src_env_options, bool _sync, + RateLimiter* _rate_limiter, uint64_t _size_limit, std::function _progress_callback = []() {}) : src_path(std::move(_src_path)), dst_path(std::move(_dst_path)), contents(std::move(_contents)), src_env(_src_env), dst_env(_dst_env), + 
src_env_options(std::move(_src_env_options)), sync(_sync), rate_limiter(_rate_limiter), size_limit(_size_limit), @@ -471,7 +475,8 @@ class BackupEngineImpl : public BackupEngine { std::vector& backup_items_to_finish, BackupID backup_id, bool shared, const std::string& src_dir, const std::string& fname, // starts with "/" - RateLimiter* rate_limiter, uint64_t size_bytes, uint64_t size_limit = 0, + const EnvOptions& src_env_options, RateLimiter* rate_limiter, + uint64_t size_bytes, uint64_t size_limit = 0, bool shared_checksum = false, std::function progress_callback = []() {}, const std::string& contents = std::string()); @@ -479,9 +484,9 @@ class BackupEngineImpl : public BackupEngine { // backup state data BackupID latest_backup_id_; BackupID latest_valid_backup_id_; - std::map> backups_; - std::map>> corrupt_backups_; + std::map> backups_; + std::map>> + corrupt_backups_; std::unordered_map> backuped_file_infos_; std::atomic stop_backup_; @@ -492,10 +497,10 @@ class BackupEngineImpl : public BackupEngine { Env* backup_env_; // directories - unique_ptr backup_directory_; - unique_ptr shared_directory_; - unique_ptr meta_directory_; - unique_ptr private_directory_; + std::unique_ptr backup_directory_; + std::unique_ptr shared_directory_; + std::unique_ptr meta_directory_; + std::unique_ptr private_directory_; static const size_t kDefaultCopyFileBufferSize = 5 * 1024 * 1024LL; // 5MB size_t copy_file_buffer_size_; @@ -616,7 +621,7 @@ Status BackupEngineImpl::Initialize() { } assert(backups_.find(backup_id) == backups_.end()); backups_.insert(std::make_pair( - backup_id, unique_ptr(new BackupMeta( + backup_id, std::unique_ptr(new BackupMeta( GetBackupMetaFile(backup_id, false /* tmp */), GetBackupMetaFile(backup_id, true /* tmp */), &backuped_file_infos_, backup_env_)))); @@ -723,9 +728,10 @@ Status BackupEngineImpl::Initialize() { CopyOrCreateResult result; result.status = CopyOrCreateFile( work_item.src_path, work_item.dst_path, work_item.contents, - work_item.src_env, work_item.dst_env, work_item.sync, - work_item.rate_limiter, &result.size, &result.checksum_value, - work_item.size_limit, work_item.progress_callback); + work_item.src_env, work_item.dst_env, work_item.src_env_options, + work_item.sync, work_item.rate_limiter, &result.size, + &result.checksum_value, work_item.size_limit, + work_item.progress_callback); work_item.result.set_value(std::move(result)); } }); @@ -761,7 +767,7 @@ Status BackupEngineImpl::CreateNewBackupWithMetadata( } auto ret = backups_.insert(std::make_pair( - new_backup_id, unique_ptr(new BackupMeta( + new_backup_id, std::unique_ptr(new BackupMeta( GetBackupMetaFile(new_backup_id, false /* tmp */), GetBackupMetaFile(new_backup_id, true /* tmp */), &backuped_file_infos_, backup_env_)))); @@ -796,8 +802,10 @@ Status BackupEngineImpl::CreateNewBackupWithMetadata( if (s.ok()) { CheckpointImpl checkpoint(db); uint64_t sequence_number = 0; + DBOptions db_options = db->GetDBOptions(); + EnvOptions src_raw_env_options(db_options); s = checkpoint.CreateCustomCheckpoint( - db->GetDBOptions(), + db_options, [&](const std::string& /*src_dirname*/, const std::string& /*fname*/, FileType) { // custom checkpoint will switch to calling copy_file_cb after it sees @@ -815,11 +823,33 @@ Status BackupEngineImpl::CreateNewBackupWithMetadata( if (type == kTableFile) { st = db_env_->GetFileSize(src_dirname + fname, &size_bytes); } + EnvOptions src_env_options; + switch (type) { + case kLogFile: + src_env_options = + db_env_->OptimizeForLogRead(src_raw_env_options); + break; + case 
kTableFile: + src_env_options = db_env_->OptimizeForCompactionTableRead( + src_raw_env_options, ImmutableDBOptions(db_options)); + break; + case kDescriptorFile: + src_env_options = + db_env_->OptimizeForManifestRead(src_raw_env_options); + break; + default: + // Other backed up files (like options file) are not read by live + // DB, so don't need to worry about avoiding mixing buffered and + // direct I/O. Just use plain defaults. + src_env_options = src_raw_env_options; + break; + } if (st.ok()) { st = AddBackupFileWorkItem( live_dst_paths, backup_items_to_finish, new_backup_id, options_.share_table_files && type == kTableFile, src_dirname, - fname, rate_limiter, size_bytes, size_limit_bytes, + fname, src_env_options, rate_limiter, size_bytes, + size_limit_bytes, options_.share_files_with_checksum && type == kTableFile, progress_callback); } @@ -829,8 +859,9 @@ Status BackupEngineImpl::CreateNewBackupWithMetadata( Log(options_.info_log, "add file for backup %s", fname.c_str()); return AddBackupFileWorkItem( live_dst_paths, backup_items_to_finish, new_backup_id, - false /* shared */, "" /* src_dir */, fname, rate_limiter, - contents.size(), 0 /* size_limit */, false /* shared_checksum */, + false /* shared */, "" /* src_dir */, fname, + EnvOptions() /* src_env_options */, rate_limiter, contents.size(), + 0 /* size_limit */, false /* shared_checksum */, progress_callback, contents); } /* create_file_cb */, &sequence_number, flush_before_backup ? 0 : port::kMaxUint64); @@ -869,7 +900,7 @@ Status BackupEngineImpl::CreateNewBackupWithMetadata( s = new_backup->StoreToFile(options_.sync); } if (s.ok() && options_.sync) { - unique_ptr backup_private_directory; + std::unique_ptr backup_private_directory; backup_env_->NewDirectory( GetAbsolutePath(GetPrivateFileRel(new_backup_id, false)), &backup_private_directory); @@ -1114,7 +1145,8 @@ Status BackupEngineImpl::RestoreDBFromBackup( dst.c_str()); CopyOrCreateWorkItem copy_or_create_work_item( GetAbsolutePath(file), dst, "" /* contents */, backup_env_, db_env_, - false, rate_limiter, 0 /* size_limit */); + EnvOptions() /* src_env_options */, false, rate_limiter, + 0 /* size_limit */); RestoreAfterCopyOrCreateWorkItem after_copy_or_create_work_item( copy_or_create_work_item.result.get_future(), file_info->checksum_value); @@ -1183,15 +1215,15 @@ Status BackupEngineImpl::VerifyBackup(BackupID backup_id) { Status BackupEngineImpl::CopyOrCreateFile( const std::string& src, const std::string& dst, const std::string& contents, - Env* src_env, Env* dst_env, bool sync, RateLimiter* rate_limiter, - uint64_t* size, uint32_t* checksum_value, uint64_t size_limit, - std::function progress_callback) { + Env* src_env, Env* dst_env, const EnvOptions& src_env_options, bool sync, + RateLimiter* rate_limiter, uint64_t* size, uint32_t* checksum_value, + uint64_t size_limit, std::function progress_callback) { assert(src.empty() != contents.empty()); Status s; - unique_ptr dst_file; - unique_ptr src_file; - EnvOptions env_options; - env_options.use_mmap_writes = false; + std::unique_ptr dst_file; + std::unique_ptr src_file; + EnvOptions dst_env_options; + dst_env_options.use_mmap_writes = false; // TODO:(gzh) maybe use direct reads/writes here if possible if (size != nullptr) { *size = 0; @@ -1205,18 +1237,18 @@ Status BackupEngineImpl::CopyOrCreateFile( size_limit = std::numeric_limits::max(); } - s = dst_env->NewWritableFile(dst, &dst_file, env_options); + s = dst_env->NewWritableFile(dst, &dst_file, dst_env_options); if (s.ok() && !src.empty()) { - s = 
src_env->NewSequentialFile(src, &src_file, env_options); + s = src_env->NewSequentialFile(src, &src_file, src_env_options); } if (!s.ok()) { return s; } - unique_ptr dest_writer( - new WritableFileWriter(std::move(dst_file), dst, env_options)); - unique_ptr src_reader; - unique_ptr buf; + std::unique_ptr dest_writer( + new WritableFileWriter(std::move(dst_file), dst, dst_env_options)); + std::unique_ptr src_reader; + std::unique_ptr buf; if (!src.empty()) { src_reader.reset(new SequentialFileReader(std::move(src_file), src)); buf.reset(new char[copy_file_buffer_size_]); @@ -1276,9 +1308,10 @@ Status BackupEngineImpl::AddBackupFileWorkItem( std::unordered_set& live_dst_paths, std::vector& backup_items_to_finish, BackupID backup_id, bool shared, const std::string& src_dir, - const std::string& fname, RateLimiter* rate_limiter, uint64_t size_bytes, - uint64_t size_limit, bool shared_checksum, - std::function progress_callback, const std::string& contents) { + const std::string& fname, const EnvOptions& src_env_options, + RateLimiter* rate_limiter, uint64_t size_bytes, uint64_t size_limit, + bool shared_checksum, std::function progress_callback, + const std::string& contents) { assert(!fname.empty() && fname[0] == '/'); assert(contents.empty() != src_dir.empty()); @@ -1289,7 +1322,7 @@ Status BackupEngineImpl::AddBackupFileWorkItem( if (shared && shared_checksum) { // add checksum and file length to the file name - s = CalculateChecksum(src_dir + fname, db_env_, size_limit, + s = CalculateChecksum(src_dir + fname, db_env_, src_env_options, size_limit, &checksum_value); if (!s.ok()) { return s; @@ -1365,8 +1398,8 @@ Status BackupEngineImpl::AddBackupFileWorkItem( // the file is present and referenced by a backup ROCKS_LOG_INFO(options_.info_log, "%s already present, calculate checksum", fname.c_str()); - s = CalculateChecksum(src_dir + fname, db_env_, size_limit, - &checksum_value); + s = CalculateChecksum(src_dir + fname, db_env_, src_env_options, + size_limit, &checksum_value); } } live_dst_paths.insert(final_dest_path); @@ -1376,8 +1409,8 @@ Status BackupEngineImpl::AddBackupFileWorkItem( copy_dest_path->c_str()); CopyOrCreateWorkItem copy_or_create_work_item( src_dir.empty() ? 
"" : src_dir + fname, *copy_dest_path, contents, - db_env_, backup_env_, options_.sync, rate_limiter, size_limit, - progress_callback); + db_env_, backup_env_, src_env_options, options_.sync, rate_limiter, + size_limit, progress_callback); BackupAfterCopyOrCreateWorkItem after_copy_or_create_work_item( copy_or_create_work_item.result.get_future(), shared, need_to_copy, backup_env_, temp_dest_path, final_dest_path, dst_relative); @@ -1399,6 +1432,7 @@ Status BackupEngineImpl::AddBackupFileWorkItem( } Status BackupEngineImpl::CalculateChecksum(const std::string& src, Env* src_env, + const EnvOptions& src_env_options, uint64_t size_limit, uint32_t* checksum_value) { *checksum_value = 0; @@ -1406,17 +1440,13 @@ Status BackupEngineImpl::CalculateChecksum(const std::string& src, Env* src_env, size_limit = std::numeric_limits::max(); } - EnvOptions env_options; - env_options.use_mmap_writes = false; - env_options.use_direct_reads = false; - std::unique_ptr src_file; - Status s = src_env->NewSequentialFile(src, &src_file, env_options); + Status s = src_env->NewSequentialFile(src, &src_file, src_env_options); if (!s.ok()) { return s; } - unique_ptr src_reader( + std::unique_ptr src_reader( new SequentialFileReader(std::move(src_file), src)); std::unique_ptr buf(new char[copy_file_buffer_size_]); Slice data; @@ -1634,15 +1664,15 @@ Status BackupEngineImpl::BackupMeta::LoadFromFile( const std::unordered_map& abs_path_to_size) { assert(Empty()); Status s; - unique_ptr backup_meta_file; + std::unique_ptr backup_meta_file; s = env_->NewSequentialFile(meta_filename_, &backup_meta_file, EnvOptions()); if (!s.ok()) { return s; } - unique_ptr backup_meta_reader( + std::unique_ptr backup_meta_reader( new SequentialFileReader(std::move(backup_meta_file), meta_filename_)); - unique_ptr buf(new char[max_backup_meta_file_size_ + 1]); + std::unique_ptr buf(new char[max_backup_meta_file_size_ + 1]); Slice data; s = backup_meta_reader->Read(max_backup_meta_file_size_, &data, buf.get()); @@ -1736,7 +1766,7 @@ Status BackupEngineImpl::BackupMeta::LoadFromFile( Status BackupEngineImpl::BackupMeta::StoreToFile(bool sync) { Status s; - unique_ptr backup_meta_file; + std::unique_ptr backup_meta_file; EnvOptions env_options; env_options.use_mmap_writes = false; env_options.use_direct_writes = false; @@ -1745,7 +1775,7 @@ Status BackupEngineImpl::BackupMeta::StoreToFile(bool sync) { return s; } - unique_ptr buf(new char[max_backup_meta_file_size_]); + std::unique_ptr buf(new char[max_backup_meta_file_size_]); size_t len = 0, buf_size = max_backup_meta_file_size_; len += snprintf(buf.get(), buf_size, "%" PRId64 "\n", timestamp_); len += snprintf(buf.get() + len, buf_size - len, "%" PRIu64 "\n", @@ -1762,7 +1792,8 @@ Status BackupEngineImpl::BackupMeta::StoreToFile(bool sync) { else if (len + hex_meta_strlen >= buf_size) { backup_meta_file->Append(Slice(buf.get(), len)); buf.reset(); - unique_ptr new_reset_buf(new char[max_backup_meta_file_size_]); + std::unique_ptr new_reset_buf( + new char[max_backup_meta_file_size_]); buf.swap(new_reset_buf); len = 0; } @@ -1776,7 +1807,7 @@ Status BackupEngineImpl::BackupMeta::StoreToFile(bool sync) { "%" ROCKSDB_PRIszt "\n", files_.size()) >= buf_size) { backup_meta_file->Append(Slice(buf.get(), len)); buf.reset(); - unique_ptr new_reset_buf(new char[max_backup_meta_file_size_]); + std::unique_ptr new_reset_buf(new char[max_backup_meta_file_size_]); buf.swap(new_reset_buf); len = 0; } @@ -1794,7 +1825,8 @@ Status BackupEngineImpl::BackupMeta::StoreToFile(bool sync) { if (newlen >= 
buf_size) { backup_meta_file->Append(Slice(buf.get(), len)); buf.reset(); - unique_ptr new_reset_buf(new char[max_backup_meta_file_size_]); + std::unique_ptr new_reset_buf( + new char[max_backup_meta_file_size_]); buf.swap(new_reset_buf); len = 0; } diff --git a/utilities/backupable/backupable_db_test.cc b/utilities/backupable/backupable_db_test.cc index 9fdc058fd03..26ff00e91a1 100644 --- a/utilities/backupable/backupable_db_test.cc +++ b/utilities/backupable/backupable_db_test.cc @@ -179,7 +179,8 @@ class TestEnv : public EnvWrapper { bool fail_reads_; }; - Status NewSequentialFile(const std::string& f, unique_ptr* r, + Status NewSequentialFile(const std::string& f, + std::unique_ptr* r, const EnvOptions& options) override { MutexLock l(&mutex_); if (dummy_sequential_file_) { @@ -187,11 +188,18 @@ class TestEnv : public EnvWrapper { new TestEnv::DummySequentialFile(dummy_sequential_file_fail_reads_)); return Status::OK(); } else { - return EnvWrapper::NewSequentialFile(f, r, options); + Status s = EnvWrapper::NewSequentialFile(f, r, options); + if (s.ok()) { + if ((*r)->use_direct_io()) { + ++num_direct_seq_readers_; + } + ++num_seq_readers_; + } + return s; } } - Status NewWritableFile(const std::string& f, unique_ptr* r, + Status NewWritableFile(const std::string& f, std::unique_ptr* r, const EnvOptions& options) override { MutexLock l(&mutex_); written_files_.push_back(f); @@ -199,7 +207,28 @@ class TestEnv : public EnvWrapper { return Status::NotSupported("Sorry, can't do this"); } limit_written_files_--; - return EnvWrapper::NewWritableFile(f, r, options); + Status s = EnvWrapper::NewWritableFile(f, r, options); + if (s.ok()) { + if ((*r)->use_direct_io()) { + ++num_direct_writers_; + } + ++num_writers_; + } + return s; + } + + virtual Status NewRandomAccessFile(const std::string& fname, + unique_ptr* result, + const EnvOptions& options) override { + MutexLock l(&mutex_); + Status s = EnvWrapper::NewRandomAccessFile(fname, result, options); + if (s.ok()) { + if ((*result)->use_direct_io()) { + ++num_direct_rand_readers_; + } + ++num_rand_readers_; + } + return s; } virtual Status DeleteFile(const std::string& fname) override { @@ -308,13 +337,30 @@ class TestEnv : public EnvWrapper { void SetNewDirectoryFailure(bool fail) { new_directory_failure_ = fail; } virtual Status NewDirectory(const std::string& name, - unique_ptr* result) override { + std::unique_ptr* result) override { if (new_directory_failure_) { return Status::IOError("SimulatedFailure"); } return EnvWrapper::NewDirectory(name, result); } + void ClearFileOpenCounters() { + MutexLock l(&mutex_); + num_rand_readers_ = 0; + num_direct_rand_readers_ = 0; + num_seq_readers_ = 0; + num_direct_seq_readers_ = 0; + num_writers_ = 0; + num_direct_writers_ = 0; + } + + int num_rand_readers() { return num_rand_readers_; } + int num_direct_rand_readers() { return num_direct_rand_readers_; } + int num_seq_readers() { return num_seq_readers_; } + int num_direct_seq_readers() { return num_direct_seq_readers_; } + int num_writers() { return num_writers_; } + int num_direct_writers() { return num_direct_writers_; } + private: port::Mutex mutex_; bool dummy_sequential_file_ = false; @@ -328,6 +374,15 @@ class TestEnv : public EnvWrapper { bool get_children_failure_ = false; bool create_dir_if_missing_failure_ = false; bool new_directory_failure_ = false; + + // Keeps track of how many files of each type were successfully opened, and + // out of those, how many were opened with direct I/O. 
+ std::atomic num_rand_readers_; + std::atomic num_direct_rand_readers_; + std::atomic num_seq_readers_; + std::atomic num_direct_seq_readers_; + std::atomic num_writers_; + std::atomic num_direct_writers_; }; // TestEnv class FileManager : public EnvWrapper { @@ -427,7 +482,7 @@ class FileManager : public EnvWrapper { } Status WriteToFile(const std::string& fname, const std::string& data) { - unique_ptr file; + std::unique_ptr file; EnvOptions env_options; env_options.use_mmap_writes = false; Status s = EnvWrapper::NewWritableFile(fname, &file, env_options); @@ -620,22 +675,22 @@ class BackupableDBTest : public testing::Test { std::shared_ptr logger_; // envs - unique_ptr db_chroot_env_; - unique_ptr backup_chroot_env_; - unique_ptr test_db_env_; - unique_ptr test_backup_env_; - unique_ptr file_manager_; + std::unique_ptr db_chroot_env_; + std::unique_ptr backup_chroot_env_; + std::unique_ptr test_db_env_; + std::unique_ptr test_backup_env_; + std::unique_ptr file_manager_; // all the dbs! DummyDB* dummy_db_; // BackupableDB owns dummy_db_ - unique_ptr db_; - unique_ptr backup_engine_; + std::unique_ptr db_; + std::unique_ptr backup_engine_; // options Options options_; protected: - unique_ptr backupable_options_; + std::unique_ptr backupable_options_; }; // BackupableDBTest void AppendPath(const std::string& path, std::vector& v) { @@ -1633,6 +1688,59 @@ TEST_F(BackupableDBTest, WriteOnlyEngineNoSharedFileDeletion) { AssertBackupConsistency(i + 1, 0, (i + 1) * kNumKeys); } } + +TEST_P(BackupableDBTestWithParam, BackupUsingDirectIO) { + // Tests direct I/O on the backup engine's reads and writes on the DB env and + // backup env + // We use ChrootEnv underneath so the below line checks for direct I/O support + // in the chroot directory, not the true filesystem root. + if (!test::IsDirectIOSupported(test_db_env_.get(), "/")) { + return; + } + const int kNumKeysPerBackup = 100; + const int kNumBackups = 3; + options_.use_direct_reads = true; + OpenDBAndBackupEngine(true /* destroy_old_data */); + for (int i = 0; i < kNumBackups; ++i) { + FillDB(db_.get(), i * kNumKeysPerBackup /* from */, + (i + 1) * kNumKeysPerBackup /* to */); + ASSERT_OK(db_->Flush(FlushOptions())); + + // Clear the file open counters and then do a bunch of backup engine ops. + // For all ops, files should be opened in direct mode. + test_backup_env_->ClearFileOpenCounters(); + test_db_env_->ClearFileOpenCounters(); + CloseBackupEngine(); + OpenBackupEngine(); + ASSERT_OK(backup_engine_->CreateNewBackup(db_.get(), + false /* flush_before_backup */)); + ASSERT_OK(backup_engine_->VerifyBackup(i + 1)); + CloseBackupEngine(); + OpenBackupEngine(); + std::vector backup_infos; + backup_engine_->GetBackupInfo(&backup_infos); + ASSERT_EQ(static_cast(i + 1), backup_infos.size()); + + // Verify backup engine always opened files with direct I/O + ASSERT_EQ(0, test_db_env_->num_writers()); + ASSERT_EQ(0, test_db_env_->num_rand_readers()); + ASSERT_GT(test_db_env_->num_direct_seq_readers(), 0); + // Currently the DB doesn't support reading WALs or manifest with direct + // I/O, so subtract two. 
+ ASSERT_EQ(test_db_env_->num_seq_readers() - 2, + test_db_env_->num_direct_seq_readers()); + ASSERT_EQ(0, test_db_env_->num_rand_readers()); + } + CloseDBAndBackupEngine(); + + for (int i = 0; i < kNumBackups; ++i) { + AssertBackupConsistency(i + 1 /* backup_id */, + i * kNumKeysPerBackup /* start_exist */, + (i + 1) * kNumKeysPerBackup /* end_exist */, + (i + 2) * kNumKeysPerBackup /* end */); + } +} + } // anon namespace } // namespace rocksdb diff --git a/utilities/blob_db/blob_db_impl.cc b/utilities/blob_db/blob_db_impl.cc index 1a32bd562eb..bf46bf6b1c0 100644 --- a/utilities/blob_db/blob_db_impl.cc +++ b/utilities/blob_db/blob_db_impl.cc @@ -26,6 +26,7 @@ #include "util/cast_util.h" #include "util/crc32c.h" #include "util/file_reader_writer.h" +#include "util/file_util.h" #include "util/filename.h" #include "util/logging.h" #include "util/mutexlock.h" @@ -404,82 +405,91 @@ std::shared_ptr BlobDBImpl::FindBlobFileLocked( return (b1 || b2) ? nullptr : (*finditr); } -std::shared_ptr BlobDBImpl::CheckOrCreateWriterLocked( - const std::shared_ptr& bfile) { - std::shared_ptr writer = bfile->GetWriter(); - if (writer) return writer; - - Status s = CreateWriterLocked(bfile); - if (!s.ok()) return nullptr; - - writer = bfile->GetWriter(); - return writer; +Status BlobDBImpl::CheckOrCreateWriterLocked( + const std::shared_ptr& blob_file, + std::shared_ptr* writer) { + assert(writer != nullptr); + *writer = blob_file->GetWriter(); + if (*writer != nullptr) { + return Status::OK(); + } + Status s = CreateWriterLocked(blob_file); + if (s.ok()) { + *writer = blob_file->GetWriter(); + } + return s; } -std::shared_ptr BlobDBImpl::SelectBlobFile() { +Status BlobDBImpl::SelectBlobFile(std::shared_ptr* blob_file) { + assert(blob_file != nullptr); { ReadLock rl(&mutex_); if (open_non_ttl_file_ != nullptr) { - return open_non_ttl_file_; + *blob_file = open_non_ttl_file_; + return Status::OK(); } } // CHECK again WriteLock wl(&mutex_); if (open_non_ttl_file_ != nullptr) { - return open_non_ttl_file_; + *blob_file = open_non_ttl_file_; + return Status::OK(); } - std::shared_ptr bfile = NewBlobFile("SelectBlobFile"); - assert(bfile); + *blob_file = NewBlobFile("SelectBlobFile"); + assert(*blob_file != nullptr); // file not visible, hence no lock - std::shared_ptr writer = CheckOrCreateWriterLocked(bfile); - if (!writer) { + std::shared_ptr writer; + Status s = CheckOrCreateWriterLocked(*blob_file, &writer); + if (!s.ok()) { ROCKS_LOG_ERROR(db_options_.info_log, - "Failed to get writer from blob file: %s", - bfile->PathName().c_str()); - return nullptr; + "Failed to get writer from blob file: %s, error: %s", + (*blob_file)->PathName().c_str(), s.ToString().c_str()); + return s; } - bfile->file_size_ = BlobLogHeader::kSize; - bfile->header_.compression = bdb_options_.compression; - bfile->header_.has_ttl = false; - bfile->header_.column_family_id = + (*blob_file)->file_size_ = BlobLogHeader::kSize; + (*blob_file)->header_.compression = bdb_options_.compression; + (*blob_file)->header_.has_ttl = false; + (*blob_file)->header_.column_family_id = reinterpret_cast(DefaultColumnFamily())->GetID(); - bfile->header_valid_ = true; - bfile->SetColumnFamilyId(bfile->header_.column_family_id); - bfile->SetHasTTL(false); - bfile->SetCompression(bdb_options_.compression); + (*blob_file)->header_valid_ = true; + (*blob_file)->SetColumnFamilyId((*blob_file)->header_.column_family_id); + (*blob_file)->SetHasTTL(false); + (*blob_file)->SetCompression(bdb_options_.compression); - Status s = writer->WriteHeader(bfile->header_); 
+ s = writer->WriteHeader((*blob_file)->header_); if (!s.ok()) { ROCKS_LOG_ERROR(db_options_.info_log, "Failed to write header to new blob file: %s" " status: '%s'", - bfile->PathName().c_str(), s.ToString().c_str()); - return nullptr; + (*blob_file)->PathName().c_str(), s.ToString().c_str()); + return s; } - blob_files_.insert(std::make_pair(bfile->BlobFileNumber(), bfile)); - open_non_ttl_file_ = bfile; + blob_files_.insert( + std::make_pair((*blob_file)->BlobFileNumber(), *blob_file)); + open_non_ttl_file_ = *blob_file; total_blob_size_ += BlobLogHeader::kSize; - return bfile; + return s; } -std::shared_ptr BlobDBImpl::SelectBlobFileTTL(uint64_t expiration) { +Status BlobDBImpl::SelectBlobFileTTL(uint64_t expiration, + std::shared_ptr* blob_file) { + assert(blob_file != nullptr); assert(expiration != kNoExpiration); uint64_t epoch_read = 0; - std::shared_ptr bfile; { ReadLock rl(&mutex_); - bfile = FindBlobFileLocked(expiration); + *blob_file = FindBlobFileLocked(expiration); epoch_read = epoch_of_.load(); } - if (bfile) { - assert(!bfile->Immutable()); - return bfile; + if (*blob_file != nullptr) { + assert(!(*blob_file)->Immutable()); + return Status::OK(); } uint64_t exp_low = @@ -487,61 +497,66 @@ std::shared_ptr BlobDBImpl::SelectBlobFileTTL(uint64_t expiration) { uint64_t exp_high = exp_low + bdb_options_.ttl_range_secs; ExpirationRange expiration_range = std::make_pair(exp_low, exp_high); - bfile = NewBlobFile("SelectBlobFileTTL"); - assert(bfile); + *blob_file = NewBlobFile("SelectBlobFileTTL"); + assert(*blob_file != nullptr); ROCKS_LOG_INFO(db_options_.info_log, "New blob file TTL range: %s %d %d", - bfile->PathName().c_str(), exp_low, exp_high); + (*blob_file)->PathName().c_str(), exp_low, exp_high); LogFlush(db_options_.info_log); // we don't need to take lock as no other thread is seeing bfile yet - std::shared_ptr writer = CheckOrCreateWriterLocked(bfile); - if (!writer) { - ROCKS_LOG_ERROR(db_options_.info_log, - "Failed to get writer from blob file with TTL: %s", - bfile->PathName().c_str()); - return nullptr; + std::shared_ptr writer; + Status s = CheckOrCreateWriterLocked(*blob_file, &writer); + if (!s.ok()) { + ROCKS_LOG_ERROR( + db_options_.info_log, + "Failed to get writer from blob file with TTL: %s, error: %s", + (*blob_file)->PathName().c_str(), s.ToString().c_str()); + return s; } - bfile->header_.expiration_range = expiration_range; - bfile->header_.compression = bdb_options_.compression; - bfile->header_.has_ttl = true; - bfile->header_.column_family_id = + (*blob_file)->header_.expiration_range = expiration_range; + (*blob_file)->header_.compression = bdb_options_.compression; + (*blob_file)->header_.has_ttl = true; + (*blob_file)->header_.column_family_id = reinterpret_cast(DefaultColumnFamily())->GetID(); - ; - bfile->header_valid_ = true; - bfile->SetColumnFamilyId(bfile->header_.column_family_id); - bfile->SetHasTTL(true); - bfile->SetCompression(bdb_options_.compression); - bfile->file_size_ = BlobLogHeader::kSize; + (*blob_file)->header_valid_ = true; + (*blob_file)->SetColumnFamilyId((*blob_file)->header_.column_family_id); + (*blob_file)->SetHasTTL(true); + (*blob_file)->SetCompression(bdb_options_.compression); + (*blob_file)->file_size_ = BlobLogHeader::kSize; // set the first value of the range, since that is // concrete at this time. 
also necessary to add to open_ttl_files_ - bfile->expiration_range_ = expiration_range; + (*blob_file)->expiration_range_ = expiration_range; WriteLock wl(&mutex_); // in case the epoch has shifted in the interim, then check // check condition again - should be rare. if (epoch_of_.load() != epoch_read) { - auto bfile2 = FindBlobFileLocked(expiration); - if (bfile2) return bfile2; + std::shared_ptr blob_file2 = FindBlobFileLocked(expiration); + if (blob_file2 != nullptr) { + *blob_file = std::move(blob_file2); + return Status::OK(); + } } - Status s = writer->WriteHeader(bfile->header_); + s = writer->WriteHeader((*blob_file)->header_); if (!s.ok()) { ROCKS_LOG_ERROR(db_options_.info_log, "Failed to write header to new blob file: %s" " status: '%s'", - bfile->PathName().c_str(), s.ToString().c_str()); - return nullptr; + (*blob_file)->PathName().c_str(), s.ToString().c_str()); + return s; } - blob_files_.insert(std::make_pair(bfile->BlobFileNumber(), bfile)); - open_ttl_files_.insert(bfile); + blob_files_.insert( + std::make_pair((*blob_file)->BlobFileNumber(), *blob_file)); + open_ttl_files_.insert(*blob_file); total_blob_size_ += BlobLogHeader::kSize; epoch_of_++; - return bfile; + return s; } class BlobDBImpl::BlobInserter : public WriteBatch::Handler { @@ -695,36 +710,41 @@ Status BlobDBImpl::PutBlobValue(const WriteOptions& /*options*/, return s; } - std::shared_ptr bfile = (expiration != kNoExpiration) - ? SelectBlobFileTTL(expiration) - : SelectBlobFile(); - assert(bfile != nullptr); - assert(bfile->compression() == bdb_options_.compression); - - s = AppendBlob(bfile, headerbuf, key, value_compressed, expiration, - &index_entry); - if (expiration == kNoExpiration) { - RecordTick(statistics_, BLOB_DB_WRITE_BLOB); + std::shared_ptr blob_file; + if (expiration != kNoExpiration) { + s = SelectBlobFileTTL(expiration, &blob_file); } else { - RecordTick(statistics_, BLOB_DB_WRITE_BLOB_TTL); + s = SelectBlobFile(&blob_file); + } + if (s.ok()) { + assert(blob_file != nullptr); + assert(blob_file->compression() == bdb_options_.compression); + s = AppendBlob(blob_file, headerbuf, key, value_compressed, expiration, + &index_entry); } - if (s.ok()) { if (expiration != kNoExpiration) { - bfile->ExtendExpirationRange(expiration); + blob_file->ExtendExpirationRange(expiration); } - s = CloseBlobFileIfNeeded(bfile); - if (s.ok()) { - s = WriteBatchInternal::PutBlobIndex(batch, column_family_id, key, - index_entry); + s = CloseBlobFileIfNeeded(blob_file); + } + if (s.ok()) { + s = WriteBatchInternal::PutBlobIndex(batch, column_family_id, key, + index_entry); + } + if (s.ok()) { + if (expiration == kNoExpiration) { + RecordTick(statistics_, BLOB_DB_WRITE_BLOB); + } else { + RecordTick(statistics_, BLOB_DB_WRITE_BLOB_TTL); } } else { ROCKS_LOG_ERROR(db_options_.info_log, "Failed to append blob to FILE: %s: KEY: %s VALSZ: %d" " status: '%s' blob_file: '%s'", - bfile->PathName().c_str(), key.ToString().c_str(), + blob_file->PathName().c_str(), key.ToString().c_str(), value.size(), s.ToString().c_str(), - bfile->DumpState().c_str()); + blob_file->DumpState().c_str()); } } @@ -867,9 +887,10 @@ Status BlobDBImpl::AppendBlob(const std::shared_ptr& bfile, uint64_t key_offset = 0; { WriteLock lockbfile_w(&bfile->mutex_); - std::shared_ptr writer = CheckOrCreateWriterLocked(bfile); - if (!writer) { - return Status::IOError("Failed to create blob writer"); + std::shared_ptr writer; + s = CheckOrCreateWriterLocked(bfile, &writer); + if (!s.ok()) { + return s; } // write the blob to the blob log. 
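
The BlobDB hunks above change SelectBlobFile, SelectBlobFileTTL, and CheckOrCreateWriterLocked from returning a possibly-null std::shared_ptr to returning a Status with the blob file (or writer) handed back through an out-parameter, so a failure to open a blob file or write its header reaches the Put path as a concrete error instead of a bare nullptr. A minimal sketch of that error-propagation pattern, using hypothetical names (BlobLog, OpenBlobLog, PutValue) rather than the real BlobDBImpl members:

    #include <cassert>
    #include <memory>
    #include <string>
    #include "rocksdb/status.h"

    using rocksdb::Status;

    struct BlobLog {                 // stand-in for the real BlobFile class
      std::string path;
    };

    // The out-parameter is populated only on success; every failure carries
    // a concrete Status instead of being collapsed into a null pointer.
    Status OpenBlobLog(const std::string& path,
                       std::shared_ptr<BlobLog>* blob_log) {
      assert(blob_log != nullptr);
      if (path.empty()) {
        return Status::IOError("cannot open blob log with empty path");
      }
      *blob_log = std::make_shared<BlobLog>();
      (*blob_log)->path = path;
      return Status::OK();
    }

    Status PutValue(const std::string& path) {
      std::shared_ptr<BlobLog> blob_log;
      Status s = OpenBlobLog(path, &blob_log);
      if (!s.ok()) {
        return s;  // caller can check IsIOError() instead of crashing on null
      }
      // ... append the value to *blob_log ...
      return s;
    }

This is the same shape the new PutIOError test relies on: once the fault-injection env marks the filesystem inactive, Put() is expected to surface IsIOError() rather than trip an assertion on a missing writer.
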
@@ -1459,8 +1480,7 @@ Status BlobDBImpl::GCFileAndUpdateLSM(const std::shared_ptr& bfptr, return s; } - auto* cfh = - db_impl_->GetColumnFamilyHandleUnlocked(bfptr->column_family_id()); + auto cfh = db_impl_->DefaultColumnFamily(); auto* cfd = reinterpret_cast(cfh)->cfd(); auto column_family_id = cfd->GetID(); bool has_ttl = header.has_ttl; @@ -1575,7 +1595,13 @@ Status BlobDBImpl::GCFileAndUpdateLSM(const std::shared_ptr& bfptr, reason += bfptr->PathName(); newfile = NewBlobFile(reason); - new_writer = CheckOrCreateWriterLocked(newfile); + s = CheckOrCreateWriterLocked(newfile, &new_writer); + if (!s.ok()) { + ROCKS_LOG_ERROR(db_options_.info_log, + "Failed to open file %s for writer, error: %s", + newfile->PathName().c_str(), s.ToString().c_str()); + break; + } // Can't use header beyond this point newfile->header_ = std::move(header); newfile->header_valid_ = true; @@ -1720,7 +1746,8 @@ std::pair BlobDBImpl::DeleteObsoleteFiles(bool aborted) { bfile->PathName().c_str()); blob_files_.erase(bfile->BlobFileNumber()); - Status s = env_->DeleteFile(bfile->PathName()); + Status s = DeleteDBFile(&(db_impl_->immutable_db_options()), + bfile->PathName(), blob_dir_, true); if (!s.ok()) { ROCKS_LOG_ERROR(db_options_.info_log, "File failed to be deleted as obsolete %s", @@ -1810,7 +1837,7 @@ Status DestroyBlobDB(const std::string& dbname, const Options& options, uint64_t number; FileType type; if (ParseFileName(f, &number, &type) && type == kBlobFile) { - Status del = env->DeleteFile(blobdir + "/" + f); + Status del = DeleteDBFile(&soptions, blobdir + "/" + f, blobdir, true); if (status.ok() && !del.ok()) { status = del; } diff --git a/utilities/blob_db/blob_db_impl.h b/utilities/blob_db/blob_db_impl.h index 4296d5c6abb..8d5148def61 100644 --- a/utilities/blob_db/blob_db_impl.h +++ b/utilities/blob_db/blob_db_impl.h @@ -255,10 +255,11 @@ class BlobDBImpl : public BlobDB { // find an existing blob log file based on the expiration unix epoch // if such a file does not exist, return nullptr - std::shared_ptr SelectBlobFileTTL(uint64_t expiration); + Status SelectBlobFileTTL(uint64_t expiration, + std::shared_ptr* blob_file); // find an existing blob log file to append the value to - std::shared_ptr SelectBlobFile(); + Status SelectBlobFile(std::shared_ptr* blob_file); std::shared_ptr FindBlobFileLocked(uint64_t expiration) const; @@ -309,8 +310,8 @@ class BlobDBImpl : public BlobDB { // returns a Writer object for the file. If writer is not // already present, creates one. Needs Write Mutex to be held - std::shared_ptr CheckOrCreateWriterLocked( - const std::shared_ptr& bfile); + Status CheckOrCreateWriterLocked(const std::shared_ptr& blob_file, + std::shared_ptr* writer); // Iterate through keys and values on Blob and write into // separate file the remaining blobs and delete/update pointers @@ -347,7 +348,8 @@ class BlobDBImpl : public BlobDB { ColumnFamilyOptions cf_options_; EnvOptions env_options_; - // Raw pointer of statistic. db_options_ has a shared_ptr to hold ownership. + // Raw pointer of statistic. db_options_ has a std::shared_ptr to hold + // ownership. 
Statistics* statistics_; // by default this is "blob_dir" under dbname_ diff --git a/utilities/blob_db/blob_db_test.cc b/utilities/blob_db/blob_db_test.cc index cf8f1217aa0..d9cca123e96 100644 --- a/utilities/blob_db/blob_db_test.cc +++ b/utilities/blob_db/blob_db_test.cc @@ -18,6 +18,7 @@ #include "util/cast_util.h" #include "util/fault_injection_test_env.h" #include "util/random.h" +#include "util/sst_file_manager_impl.h" #include "util/string_util.h" #include "util/sync_point.h" #include "util/testharness.h" @@ -374,6 +375,19 @@ TEST_F(BlobDBTest, GetIOError) { fault_injection_env_->SetFilesystemActive(true); } +TEST_F(BlobDBTest, PutIOError) { + Options options; + options.env = fault_injection_env_.get(); + BlobDBOptions bdb_options; + bdb_options.min_blob_size = 0; // Make sure value write to blob file + bdb_options.disable_background_tasks = true; + Open(bdb_options, options); + fault_injection_env_->SetFilesystemActive(false, Status::IOError()); + ASSERT_TRUE(Put("foo", "v1").IsIOError()); + fault_injection_env_->SetFilesystemActive(true, Status::IOError()); + ASSERT_OK(Put("bar", "v1")); +} + TEST_F(BlobDBTest, WriteBatch) { Random rnd(301); BlobDBOptions bdb_options; @@ -749,6 +763,52 @@ TEST_F(BlobDBTest, ReadWhileGC) { } } +TEST_F(BlobDBTest, SstFileManager) { + // run the same test for Get(), MultiGet() and Iterator each. + std::shared_ptr sst_file_manager( + NewSstFileManager(mock_env_.get())); + sst_file_manager->SetDeleteRateBytesPerSecond(1); + SstFileManagerImpl *sfm = + static_cast(sst_file_manager.get()); + + BlobDBOptions bdb_options; + bdb_options.min_blob_size = 0; + Options db_options; + + int files_deleted_directly = 0; + int files_scheduled_to_delete = 0; + rocksdb::SyncPoint::GetInstance()->SetCallBack( + "SstFileManagerImpl::ScheduleFileDeletion", + [&](void * /*arg*/) { files_scheduled_to_delete++; }); + rocksdb::SyncPoint::GetInstance()->SetCallBack( + "DeleteScheduler::DeleteFile", + [&](void * /*arg*/) { files_deleted_directly++; }); + SyncPoint::GetInstance()->EnableProcessing(); + db_options.sst_file_manager = sst_file_manager; + + Open(bdb_options, db_options); + + // Create one obselete file and clean it. + blob_db_->Put(WriteOptions(), "foo", "bar"); + auto blob_files = blob_db_impl()->TEST_GetBlobFiles(); + ASSERT_EQ(1, blob_files.size()); + std::shared_ptr bfile = blob_files[0]; + ASSERT_OK(blob_db_impl()->TEST_CloseBlobFile(bfile)); + GCStats gc_stats; + ASSERT_OK(blob_db_impl()->TEST_GCFileAndUpdateLSM(bfile, &gc_stats)); + blob_db_impl()->TEST_DeleteObsoleteFiles(); + + // Even if SSTFileManager is not set, DB is creating a dummy one. + ASSERT_EQ(1, files_scheduled_to_delete); + ASSERT_EQ(0, files_deleted_directly); + Destroy(); + // Make sure that DestroyBlobDB() also goes through delete scheduler. 
+ ASSERT_GE(2, files_scheduled_to_delete); + ASSERT_EQ(0, files_deleted_directly); + SyncPoint::GetInstance()->DisableProcessing(); + sfm->WaitForEmptyTrash(); +} + TEST_F(BlobDBTest, SnapshotAndGarbageCollection) { BlobDBOptions bdb_options; bdb_options.min_blob_size = 0; diff --git a/utilities/blob_db/blob_dump_tool.h b/utilities/blob_db/blob_dump_tool.h index e91feffa794..ff4672fd3f3 100644 --- a/utilities/blob_db/blob_dump_tool.h +++ b/utilities/blob_db/blob_dump_tool.h @@ -33,7 +33,7 @@ class BlobDumpTool { private: std::unique_ptr reader_; - std::unique_ptr buffer_; + std::unique_ptr buffer_; size_t buffer_size_; Status Read(uint64_t offset, size_t size, Slice* result); diff --git a/utilities/blob_db/blob_log_format.h b/utilities/blob_db/blob_log_format.h index 3e1b686aa12..fcc042f06db 100644 --- a/utilities/blob_db/blob_log_format.h +++ b/utilities/blob_db/blob_log_format.h @@ -10,7 +10,9 @@ #ifndef ROCKSDB_LITE #include +#include #include + #include "rocksdb/options.h" #include "rocksdb/slice.h" #include "rocksdb/status.h" @@ -106,8 +108,8 @@ struct BlobLogRecord { uint32_t blob_crc = 0; Slice key; Slice value; - std::string key_buf; - std::string value_buf; + std::unique_ptr key_buf; + std::unique_ptr value_buf; uint64_t record_size() const { return kHeaderSize + key_size + value_size; } diff --git a/utilities/blob_db/blob_log_reader.cc b/utilities/blob_db/blob_log_reader.cc index 4996d987b63..0f098f2d45c 100644 --- a/utilities/blob_db/blob_log_reader.cc +++ b/utilities/blob_db/blob_log_reader.cc @@ -24,10 +24,9 @@ Reader::Reader(unique_ptr&& file_reader, Env* env, buffer_(), next_byte_(0) {} -Status Reader::ReadSlice(uint64_t size, Slice* slice, std::string* buf) { +Status Reader::ReadSlice(uint64_t size, Slice* slice, char* buf) { StopWatch read_sw(env_, statistics_, BLOB_DB_BLOB_FILE_READ_MICROS); - buf->reserve(static_cast(size)); - Status s = file_->Read(next_byte_, static_cast(size), slice, &(*buf)[0]); + Status s = file_->Read(next_byte_, static_cast(size), slice, buf); next_byte_ += size; if (!s.ok()) { return s; @@ -42,7 +41,7 @@ Status Reader::ReadSlice(uint64_t size, Slice* slice, std::string* buf) { Status Reader::ReadHeader(BlobLogHeader* header) { assert(file_.get() != nullptr); assert(next_byte_ == 0); - Status s = ReadSlice(BlobLogHeader::kSize, &buffer_, &backing_store_); + Status s = ReadSlice(BlobLogHeader::kSize, &buffer_, header_buf_); if (!s.ok()) { return s; } @@ -56,7 +55,7 @@ Status Reader::ReadHeader(BlobLogHeader* header) { Status Reader::ReadRecord(BlobLogRecord* record, ReadLevel level, uint64_t* blob_offset) { - Status s = ReadSlice(BlobLogRecord::kHeaderSize, &buffer_, &backing_store_); + Status s = ReadSlice(BlobLogRecord::kHeaderSize, &buffer_, header_buf_); if (!s.ok()) { return s; } @@ -80,14 +79,18 @@ Status Reader::ReadRecord(BlobLogRecord* record, ReadLevel level, break; case kReadHeaderKey: - s = ReadSlice(record->key_size, &record->key, &record->key_buf); + record->key_buf.reset(new char[record->key_size]); + s = ReadSlice(record->key_size, &record->key, record->key_buf.get()); next_byte_ += record->value_size; break; case kReadHeaderKeyBlob: - s = ReadSlice(record->key_size, &record->key, &record->key_buf); + record->key_buf.reset(new char[record->key_size]); + s = ReadSlice(record->key_size, &record->key, record->key_buf.get()); if (s.ok()) { - s = ReadSlice(record->value_size, &record->value, &record->value_buf); + record->value_buf.reset(new char[record->value_size]); + s = ReadSlice(record->value_size, &record->value, + 
record->value_buf.get()); } if (s.ok()) { s = record->CheckBlobCRC(); diff --git a/utilities/blob_db/blob_log_reader.h b/utilities/blob_db/blob_log_reader.h index 4b780decd52..45e2e955145 100644 --- a/utilities/blob_db/blob_log_reader.h +++ b/utilities/blob_db/blob_log_reader.h @@ -60,19 +60,19 @@ class Reader { Status ReadRecord(BlobLogRecord* record, ReadLevel level = kReadHeader, uint64_t* blob_offset = nullptr); - Status ReadSlice(uint64_t size, Slice* slice, std::string* buf); - void ResetNextByte() { next_byte_ = 0; } uint64_t GetNextByte() const { return next_byte_; } private: + Status ReadSlice(uint64_t size, Slice* slice, char* buf); + const std::unique_ptr file_; Env* env_; Statistics* statistics_; - std::string backing_store_; Slice buffer_; + char header_buf_[BlobLogRecord::kHeaderSize]; // which byte to read next. For asserting proper usage uint64_t next_byte_; diff --git a/utilities/cassandra/cassandra_functional_test.cc b/utilities/cassandra/cassandra_functional_test.cc index 3e612b3ad6a..653e6da72b8 100644 --- a/utilities/cassandra/cassandra_functional_test.cc +++ b/utilities/cassandra/cassandra_functional_test.cc @@ -101,7 +101,7 @@ class TestCompactionFilterFactory : public CompactionFilterFactory { virtual std::unique_ptr CreateCompactionFilter( const CompactionFilter::Context& /*context*/) override { - return unique_ptr(new CassandraCompactionFilter( + return std::unique_ptr(new CassandraCompactionFilter( purge_ttl_on_expiration_, gc_grace_period_in_seconds_)); } diff --git a/utilities/cassandra/format.cc b/utilities/cassandra/format.cc index 4a22658de15..42cd7206b61 100644 --- a/utilities/cassandra/format.cc +++ b/utilities/cassandra/format.cc @@ -266,7 +266,7 @@ RowValue RowValue::ConvertExpiredColumnsToTombstones(bool* changed) const { std::static_pointer_cast(column); if(expiring_column->Expired()) { - shared_ptr tombstone = expiring_column->ToTombstone(); + std::shared_ptr tombstone = expiring_column->ToTombstone(); new_columns.push_back(tombstone); *changed = true; continue; diff --git a/utilities/checkpoint/checkpoint_impl.cc b/utilities/checkpoint/checkpoint_impl.cc index 48f9200fb64..9863ac1d564 100644 --- a/utilities/checkpoint/checkpoint_impl.cc +++ b/utilities/checkpoint/checkpoint_impl.cc @@ -133,7 +133,7 @@ Status CheckpointImpl::CreateCheckpoint(const std::string& checkpoint_dir, s = db_->GetEnv()->RenameFile(full_private_path, checkpoint_dir); } if (s.ok()) { - unique_ptr checkpoint_directory; + std::unique_ptr checkpoint_directory; db_->GetEnv()->NewDirectory(checkpoint_dir, &checkpoint_directory); if (checkpoint_directory != nullptr) { s = checkpoint_directory->Fsync(); diff --git a/utilities/checkpoint/checkpoint_test.cc b/utilities/checkpoint/checkpoint_test.cc index 62c78faa8b4..b8436ccf590 100644 --- a/utilities/checkpoint/checkpoint_test.cc +++ b/utilities/checkpoint/checkpoint_test.cc @@ -164,6 +164,16 @@ class CheckpointTest : public testing::Test { return DB::OpenForReadOnly(options, dbname_, &db_); } + Status ReadOnlyReopenWithColumnFamilies(const std::vector& cfs, + const Options& options) { + std::vector column_families; + for (const auto& cf : cfs) { + column_families.emplace_back(cf, options); + } + return DB::OpenForReadOnly(options, dbname_, column_families, &handles_, + &db_); + } + Status TryReopen(const Options& options) { Close(); last_options_ = options; @@ -612,6 +622,69 @@ TEST_F(CheckpointTest, CheckpointWithUnsyncedDataDropped) { db_ = nullptr; } +TEST_F(CheckpointTest, CheckpointReadOnlyDB) { + ASSERT_OK(Put("foo", 
"foo_value")); + ASSERT_OK(Flush()); + Close(); + Options options = CurrentOptions(); + ASSERT_OK(ReadOnlyReopen(options)); + Checkpoint* checkpoint = nullptr; + ASSERT_OK(Checkpoint::Create(db_, &checkpoint)); + ASSERT_OK(checkpoint->CreateCheckpoint(snapshot_name_)); + delete checkpoint; + checkpoint = nullptr; + Close(); + DB* snapshot_db = nullptr; + ASSERT_OK(DB::Open(options, snapshot_name_, &snapshot_db)); + ReadOptions read_opts; + std::string get_result; + ASSERT_OK(snapshot_db->Get(read_opts, "foo", &get_result)); + ASSERT_EQ("foo_value", get_result); + delete snapshot_db; +} + +TEST_F(CheckpointTest, CheckpointReadOnlyDBWithMultipleColumnFamilies) { + Options options = CurrentOptions(); + CreateAndReopenWithCF({"pikachu", "eevee"}, options); + for (int i = 0; i != 3; ++i) { + ASSERT_OK(Put(i, "foo", "foo_value")); + ASSERT_OK(Flush(i)); + } + Close(); + Status s = ReadOnlyReopenWithColumnFamilies( + {kDefaultColumnFamilyName, "pikachu", "eevee"}, options); + ASSERT_OK(s); + Checkpoint* checkpoint = nullptr; + ASSERT_OK(Checkpoint::Create(db_, &checkpoint)); + ASSERT_OK(checkpoint->CreateCheckpoint(snapshot_name_)); + delete checkpoint; + checkpoint = nullptr; + Close(); + + std::vector column_families{ + {kDefaultColumnFamilyName, options}, + {"pikachu", options}, + {"eevee", options}}; + DB* snapshot_db = nullptr; + std::vector snapshot_handles; + s = DB::Open(options, snapshot_name_, column_families, &snapshot_handles, + &snapshot_db); + ASSERT_OK(s); + ReadOptions read_opts; + for (int i = 0; i != 3; ++i) { + std::string get_result; + s = snapshot_db->Get(read_opts, snapshot_handles[i], "foo", &get_result); + ASSERT_OK(s); + ASSERT_EQ("foo_value", get_result); + } + + for (auto snapshot_h : snapshot_handles) { + delete snapshot_h; + } + snapshot_handles.clear(); + delete snapshot_db; +} + } // namespace rocksdb int main(int argc, char** argv) { diff --git a/utilities/column_aware_encoding_exp.cc b/utilities/column_aware_encoding_exp.cc index 988a59b3c77..c251c985ec6 100644 --- a/utilities/column_aware_encoding_exp.cc +++ b/utilities/column_aware_encoding_exp.cc @@ -88,7 +88,7 @@ class ColumnAwareEncodingExp { EnvOptions env_options; if (CompressionTypeSupported(compression_type)) { fprintf(stdout, "[%s]\n", FLAGS_compression_type.c_str()); - unique_ptr encoded_out_file; + std::unique_ptr encoded_out_file; std::unique_ptr env(NewMemEnv(Env::Default())); if (!FLAGS_encoded_file.empty()) { @@ -116,7 +116,7 @@ class ColumnAwareEncodingExp { uint64_t encode_time = sw.ElapsedNanosSafe(false /* reset */); fprintf(stdout, "Encode time: %" PRIu64 "\n", encode_time); if (decode) { - unique_ptr decoded_out_file; + std::unique_ptr decoded_out_file; if (!FLAGS_decoded_file.empty()) { env->NewWritableFile(FLAGS_decoded_file, &decoded_out_file, env_options); diff --git a/utilities/date_tiered/date_tiered_db_impl.cc b/utilities/date_tiered/date_tiered_db_impl.cc index 978bfb2e495..2574d379f2a 100644 --- a/utilities/date_tiered/date_tiered_db_impl.cc +++ b/utilities/date_tiered/date_tiered_db_impl.cc @@ -389,7 +389,7 @@ Iterator* DateTieredDBImpl::NewIterator(const ReadOptions& opts) { for (auto& item : handle_map_) { auto handle = item.second; builder.AddIterator(db_impl->NewInternalIterator( - arena, db_iter->GetRangeDelAggregator(), handle)); + arena, db_iter->GetRangeDelAggregator(), kMaxSequenceNumber, handle)); } auto internal_iter = builder.Finish(); db_iter->SetIterUnderDBIter(internal_iter); diff --git a/utilities/date_tiered/date_tiered_test.cc 
b/utilities/date_tiered/date_tiered_test.cc index 8e7fced58a0..35f15584e5a 100644 --- a/utilities/date_tiered/date_tiered_test.cc +++ b/utilities/date_tiered/date_tiered_test.cc @@ -13,6 +13,7 @@ #include "rocksdb/compaction_filter.h" #include "rocksdb/utilities/date_tiered_db.h" +#include "port/port.h" #include "util/logging.h" #include "util/string_util.h" #include "util/testharness.h" @@ -131,7 +132,7 @@ class DateTieredTest : public testing::Test { Options options_; KVMap::iterator kv_it_; const std::string kNewValue_ = "new_value"; - unique_ptr test_comp_filter_; + std::unique_ptr test_comp_filter_; }; // Puts a set of values and checks its presence using Get during ttl diff --git a/utilities/debug.cc b/utilities/debug.cc index e0c5f5566eb..72fcbf0f54d 100644 --- a/utilities/debug.cc +++ b/utilities/debug.cc @@ -19,9 +19,11 @@ Status GetAllKeyVersions(DB* db, Slice begin_key, Slice end_key, DBImpl* idb = static_cast(db->GetRootDB()); auto icmp = InternalKeyComparator(idb->GetOptions().comparator); - RangeDelAggregator range_del_agg(icmp, {} /* snapshots */); + ReadRangeDelAggregator range_del_agg(&icmp, + kMaxSequenceNumber /* upper_bound */); Arena arena; - ScopedArenaIterator iter(idb->NewInternalIterator(&arena, &range_del_agg)); + ScopedArenaIterator iter( + idb->NewInternalIterator(&arena, &range_del_agg, kMaxSequenceNumber)); if (!begin_key.empty()) { InternalKey ikey; diff --git a/utilities/document/json_document_test.cc b/utilities/document/json_document_test.cc index 977905b9156..9d79c41cf5d 100644 --- a/utilities/document/json_document_test.cc +++ b/utilities/document/json_document_test.cc @@ -249,21 +249,23 @@ TEST_F(JSONDocumentTest, OperatorEqualsTest) { ASSERT_TRUE(JSONDocument(static_cast(15)) == JSONDocument(static_cast(15))); - unique_ptr arrayWithInt8Doc(JSONDocument::ParseJSON("[8]")); + std::unique_ptr arrayWithInt8Doc( + JSONDocument::ParseJSON("[8]")); ASSERT_TRUE(arrayWithInt8Doc != nullptr); ASSERT_TRUE(arrayWithInt8Doc->IsArray()); ASSERT_TRUE((*arrayWithInt8Doc)[0].IsInt64()); ASSERT_TRUE((*arrayWithInt8Doc)[0] == JSONDocument(static_cast(8))); - unique_ptr arrayWithInt16Doc(JSONDocument::ParseJSON("[512]")); + std::unique_ptr arrayWithInt16Doc( + JSONDocument::ParseJSON("[512]")); ASSERT_TRUE(arrayWithInt16Doc != nullptr); ASSERT_TRUE(arrayWithInt16Doc->IsArray()); ASSERT_TRUE((*arrayWithInt16Doc)[0].IsInt64()); ASSERT_TRUE((*arrayWithInt16Doc)[0] == JSONDocument(static_cast(512))); - unique_ptr arrayWithInt32Doc( - JSONDocument::ParseJSON("[1000000]")); + std::unique_ptr arrayWithInt32Doc( + JSONDocument::ParseJSON("[1000000]")); ASSERT_TRUE(arrayWithInt32Doc != nullptr); ASSERT_TRUE(arrayWithInt32Doc->IsArray()); ASSERT_TRUE((*arrayWithInt32Doc)[0].IsInt64()); @@ -277,8 +279,8 @@ TEST_F(JSONDocumentTest, OperatorEqualsTest) { } TEST_F(JSONDocumentTest, JSONDocumentBuilderTest) { - unique_ptr parsedArray( - JSONDocument::ParseJSON("[1, [123, \"a\", \"b\"], {\"b\":\"c\"}]")); + std::unique_ptr parsedArray( + JSONDocument::ParseJSON("[1, [123, \"a\", \"b\"], {\"b\":\"c\"}]")); ASSERT_TRUE(parsedArray != nullptr); JSONDocumentBuilder builder; diff --git a/utilities/env_librados_test.cc b/utilities/env_librados_test.cc index 7d9b252ea41..fb10224e7d7 100644 --- a/utilities/env_librados_test.cc +++ b/utilities/env_librados_test.cc @@ -108,7 +108,7 @@ class EnvLibradosTest : public testing::Test { TEST_F(EnvLibradosTest, Basics) { uint64_t file_size; - unique_ptr writable_file; + std::unique_ptr writable_file; std::vector children; 
ASSERT_OK(env_->CreateDir("/dir")); @@ -150,8 +150,8 @@ TEST_F(EnvLibradosTest, Basics) { ASSERT_EQ(3U, file_size); // Check that opening non-existent file fails. - unique_ptr seq_file; - unique_ptr rand_file; + std::unique_ptr seq_file; + std::unique_ptr rand_file; ASSERT_TRUE( !env_->NewSequentialFile("/dir/non_existent", &seq_file, soptions_).ok()); ASSERT_TRUE(!seq_file); @@ -169,9 +169,9 @@ TEST_F(EnvLibradosTest, Basics) { } TEST_F(EnvLibradosTest, ReadWrite) { - unique_ptr writable_file; - unique_ptr seq_file; - unique_ptr rand_file; + std::unique_ptr writable_file; + std::unique_ptr seq_file; + std::unique_ptr rand_file; Slice result; char scratch[100]; @@ -210,7 +210,7 @@ TEST_F(EnvLibradosTest, ReadWrite) { TEST_F(EnvLibradosTest, Locks) { FileLock* lock = nullptr; - unique_ptr writable_file; + std::unique_ptr writable_file; ASSERT_OK(env_->CreateDir("/dir")); @@ -229,7 +229,7 @@ TEST_F(EnvLibradosTest, Misc) { ASSERT_OK(env_->GetTestDirectory(&test_dir)); ASSERT_TRUE(!test_dir.empty()); - unique_ptr writable_file; + std::unique_ptr writable_file; ASSERT_TRUE(!env_->NewWritableFile("/a/b", &writable_file, soptions_).ok()); ASSERT_OK(env_->NewWritableFile("/a", &writable_file, soptions_)); @@ -249,14 +249,14 @@ TEST_F(EnvLibradosTest, LargeWrite) { write_data.append(1, 'h'); } - unique_ptr writable_file; + std::unique_ptr writable_file; ASSERT_OK(env_->CreateDir("/dir")); ASSERT_OK(env_->NewWritableFile("/dir/g", &writable_file, soptions_)); ASSERT_OK(writable_file->Append("foo")); ASSERT_OK(writable_file->Append(write_data)); writable_file.reset(); - unique_ptr seq_file; + std::unique_ptr seq_file; Slice result; ASSERT_OK(env_->NewSequentialFile("/dir/g", &seq_file, soptions_)); ASSERT_OK(seq_file->Read(3, &result, scratch)); // Read "foo". @@ -282,7 +282,7 @@ TEST_F(EnvLibradosTest, FrequentlySmallWrite) { write_data.append(1, 'h'); } - unique_ptr writable_file; + std::unique_ptr writable_file; ASSERT_OK(env_->CreateDir("/dir")); ASSERT_OK(env_->NewWritableFile("/dir/g", &writable_file, soptions_)); ASSERT_OK(writable_file->Append("foo")); @@ -292,7 +292,7 @@ TEST_F(EnvLibradosTest, FrequentlySmallWrite) { } writable_file.reset(); - unique_ptr seq_file; + std::unique_ptr seq_file; Slice result; ASSERT_OK(env_->NewSequentialFile("/dir/g", &seq_file, soptions_)); ASSERT_OK(seq_file->Read(3, &result, scratch)); // Read "foo". @@ -317,7 +317,7 @@ TEST_F(EnvLibradosTest, Truncate) { write_data.append(1, 'h'); } - unique_ptr writable_file; + std::unique_ptr writable_file; ASSERT_OK(env_->CreateDir("/dir")); ASSERT_OK(env_->NewWritableFile("/dir/g", &writable_file, soptions_)); ASSERT_OK(writable_file->Append(write_data)); @@ -801,7 +801,7 @@ class EnvLibradosMutipoolTest : public testing::Test { TEST_F(EnvLibradosMutipoolTest, Basics) { uint64_t file_size; - unique_ptr writable_file; + std::unique_ptr writable_file; std::vector children; std::vector v = {"/tmp/dir1", "/tmp/dir2", "/tmp/dir3", "/tmp/dir4", "dir"}; @@ -850,8 +850,8 @@ TEST_F(EnvLibradosMutipoolTest, Basics) { ASSERT_EQ(3U, file_size); // Check that opening non-existent file fails. - unique_ptr seq_file; - unique_ptr rand_file; + std::unique_ptr seq_file; + std::unique_ptr rand_file; ASSERT_TRUE( !env_->NewSequentialFile(dir_non_existent.c_str(), &seq_file, soptions_).ok()); ASSERT_TRUE(!seq_file); diff --git a/utilities/env_mirror.cc b/utilities/env_mirror.cc index d14de97d00d..327d8e16228 100644 --- a/utilities/env_mirror.cc +++ b/utilities/env_mirror.cc @@ -16,7 +16,7 @@ namespace rocksdb { // Env's. 
This is useful for debugging purposes. class SequentialFileMirror : public SequentialFile { public: - unique_ptr a_, b_; + std::unique_ptr a_, b_; std::string fname; explicit SequentialFileMirror(std::string f) : fname(f) {} @@ -60,7 +60,7 @@ class SequentialFileMirror : public SequentialFile { class RandomAccessFileMirror : public RandomAccessFile { public: - unique_ptr a_, b_; + std::unique_ptr a_, b_; std::string fname; explicit RandomAccessFileMirror(std::string f) : fname(f) {} @@ -95,7 +95,7 @@ class RandomAccessFileMirror : public RandomAccessFile { class WritableFileMirror : public WritableFile { public: - unique_ptr a_, b_; + std::unique_ptr a_, b_; std::string fname; explicit WritableFileMirror(std::string f) : fname(f) {} @@ -191,7 +191,7 @@ class WritableFileMirror : public WritableFile { }; Status EnvMirror::NewSequentialFile(const std::string& f, - unique_ptr* r, + std::unique_ptr* r, const EnvOptions& options) { if (f.find("/proc/") == 0) { return a_->NewSequentialFile(f, r, options); @@ -208,7 +208,7 @@ Status EnvMirror::NewSequentialFile(const std::string& f, } Status EnvMirror::NewRandomAccessFile(const std::string& f, - unique_ptr* r, + std::unique_ptr* r, const EnvOptions& options) { if (f.find("/proc/") == 0) { return a_->NewRandomAccessFile(f, r, options); @@ -225,7 +225,7 @@ Status EnvMirror::NewRandomAccessFile(const std::string& f, } Status EnvMirror::NewWritableFile(const std::string& f, - unique_ptr* r, + std::unique_ptr* r, const EnvOptions& options) { if (f.find("/proc/") == 0) return a_->NewWritableFile(f, r, options); WritableFileMirror* mf = new WritableFileMirror(f); @@ -241,7 +241,7 @@ Status EnvMirror::NewWritableFile(const std::string& f, Status EnvMirror::ReuseWritableFile(const std::string& fname, const std::string& old_fname, - unique_ptr* r, + std::unique_ptr* r, const EnvOptions& options) { if (fname.find("/proc/") == 0) return a_->ReuseWritableFile(fname, old_fname, r, options); diff --git a/utilities/env_mirror_test.cc b/utilities/env_mirror_test.cc index 2bf8ec8583a..812595ca1ee 100644 --- a/utilities/env_mirror_test.cc +++ b/utilities/env_mirror_test.cc @@ -32,7 +32,7 @@ class EnvMirrorTest : public testing::Test { TEST_F(EnvMirrorTest, Basics) { uint64_t file_size; - unique_ptr writable_file; + std::unique_ptr writable_file; std::vector children; ASSERT_OK(env_->CreateDir("/dir")); @@ -91,8 +91,8 @@ TEST_F(EnvMirrorTest, Basics) { ASSERT_EQ(3U, file_size); // Check that opening non-existent file fails. - unique_ptr seq_file; - unique_ptr rand_file; + std::unique_ptr seq_file; + std::unique_ptr rand_file; ASSERT_TRUE( !env_->NewSequentialFile("/dir/non_existent", &seq_file, soptions_).ok()); ASSERT_TRUE(!seq_file); @@ -110,9 +110,9 @@ TEST_F(EnvMirrorTest, Basics) { } TEST_F(EnvMirrorTest, ReadWrite) { - unique_ptr writable_file; - unique_ptr seq_file; - unique_ptr rand_file; + std::unique_ptr writable_file; + std::unique_ptr seq_file; + std::unique_ptr rand_file; Slice result; char scratch[100]; @@ -162,7 +162,7 @@ TEST_F(EnvMirrorTest, Misc) { ASSERT_OK(env_->GetTestDirectory(&test_dir)); ASSERT_TRUE(!test_dir.empty()); - unique_ptr writable_file; + std::unique_ptr writable_file; ASSERT_OK(env_->NewWritableFile("/a/b", &writable_file, soptions_)); // These are no-ops, but we test they return success. 
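
Every Env subclass touched in this patch (TestEnv, EnvMirror, TimedEnv) now spells its smart-pointer parameters as std::unique_ptr, matching the signatures declared on rocksdb::Env, and the backup test counts file opens by intercepting those entry points. A small wrapper written against the same signatures, under the hypothetical name CountingEnv, illustrates the pattern; it is a sketch, not part of the patch:

    #include <atomic>
    #include <memory>
    #include <string>
    #include "rocksdb/env.h"

    // Delegates to the wrapped Env and counts successful sequential-file
    // opens, similar in spirit to the instrumented TestEnv above.
    class CountingEnv : public rocksdb::EnvWrapper {
     public:
      explicit CountingEnv(rocksdb::Env* base) : EnvWrapper(base) {}

      rocksdb::Status NewSequentialFile(
          const std::string& fname,
          std::unique_ptr<rocksdb::SequentialFile>* result,
          const rocksdb::EnvOptions& options) override {
        rocksdb::Status s = EnvWrapper::NewSequentialFile(fname, result, options);
        if (s.ok()) {
          ++num_seq_files_;
        }
        return s;
      }

      int num_seq_files() const { return num_seq_files_.load(); }

     private:
      std::atomic<int> num_seq_files_{0};
    };

Pointing Options::env at such a wrapper lets a test assert on num_seq_files() in the same way BackupUsingDirectIO asserts on num_direct_seq_readers().
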
@@ -181,13 +181,13 @@ TEST_F(EnvMirrorTest, LargeWrite) { write_data.append(1, static_cast(i)); } - unique_ptr writable_file; + std::unique_ptr writable_file; ASSERT_OK(env_->NewWritableFile("/dir/f", &writable_file, soptions_)); ASSERT_OK(writable_file->Append("foo")); ASSERT_OK(writable_file->Append(write_data)); writable_file.reset(); - unique_ptr seq_file; + std::unique_ptr seq_file; Slice result; ASSERT_OK(env_->NewSequentialFile("/dir/f", &seq_file, soptions_)); ASSERT_OK(seq_file->Read(3, &result, scratch)); // Read "foo". diff --git a/utilities/env_timed.cc b/utilities/env_timed.cc index 6afd45bf999..86455ee65c0 100644 --- a/utilities/env_timed.cc +++ b/utilities/env_timed.cc @@ -18,21 +18,21 @@ class TimedEnv : public EnvWrapper { explicit TimedEnv(Env* base_env) : EnvWrapper(base_env) {} virtual Status NewSequentialFile(const std::string& fname, - unique_ptr* result, + std::unique_ptr* result, const EnvOptions& options) override { PERF_TIMER_GUARD(env_new_sequential_file_nanos); return EnvWrapper::NewSequentialFile(fname, result, options); } virtual Status NewRandomAccessFile(const std::string& fname, - unique_ptr* result, + std::unique_ptr* result, const EnvOptions& options) override { PERF_TIMER_GUARD(env_new_random_access_file_nanos); return EnvWrapper::NewRandomAccessFile(fname, result, options); } virtual Status NewWritableFile(const std::string& fname, - unique_ptr* result, + std::unique_ptr* result, const EnvOptions& options) override { PERF_TIMER_GUARD(env_new_writable_file_nanos); return EnvWrapper::NewWritableFile(fname, result, options); @@ -40,21 +40,21 @@ class TimedEnv : public EnvWrapper { virtual Status ReuseWritableFile(const std::string& fname, const std::string& old_fname, - unique_ptr* result, + std::unique_ptr* result, const EnvOptions& options) override { PERF_TIMER_GUARD(env_reuse_writable_file_nanos); return EnvWrapper::ReuseWritableFile(fname, old_fname, result, options); } virtual Status NewRandomRWFile(const std::string& fname, - unique_ptr* result, + std::unique_ptr* result, const EnvOptions& options) override { PERF_TIMER_GUARD(env_new_random_rw_file_nanos); return EnvWrapper::NewRandomRWFile(fname, result, options); } virtual Status NewDirectory(const std::string& name, - unique_ptr* result) override { + std::unique_ptr* result) override { PERF_TIMER_GUARD(env_new_directory_nanos); return EnvWrapper::NewDirectory(name, result); } @@ -131,7 +131,7 @@ class TimedEnv : public EnvWrapper { } virtual Status NewLogger(const std::string& fname, - shared_ptr* result) override { + std::shared_ptr* result) override { PERF_TIMER_GUARD(env_new_logger_nanos); return EnvWrapper::NewLogger(fname, result); } diff --git a/utilities/geodb/geodb_impl.cc b/utilities/geodb/geodb_impl.cc index 97c4da0f736..9150b16b2c5 100644 --- a/utilities/geodb/geodb_impl.cc +++ b/utilities/geodb/geodb_impl.cc @@ -222,7 +222,7 @@ GeoIterator* GeoDBImpl::SearchRadial(const GeoPosition& pos, Iterator* iter = db_->NewIterator(ReadOptions()); // Process each prospective quadkey - for (std::string qid : qids) { + for (const std::string& qid : qids) { // The user is interested in only these many objects. 
if (number_of_values == 0) { break; diff --git a/utilities/options/options_util_test.cc b/utilities/options/options_util_test.cc index bf830190c6a..4c12f1a67d2 100644 --- a/utilities/options/options_util_test.cc +++ b/utilities/options/options_util_test.cc @@ -104,8 +104,8 @@ class DummyTableFactory : public TableFactory { virtual Status NewTableReader( const TableReaderOptions& /*table_reader_options*/, - unique_ptr&& /*file*/, uint64_t /*file_size*/, - unique_ptr* /*table_reader*/, + std::unique_ptr&& /*file*/, + uint64_t /*file_size*/, std::unique_ptr* /*table_reader*/, bool /*prefetch_index_and_filter_in_cache*/) const override { return Status::NotSupported(); } diff --git a/utilities/persistent_cache/block_cache_tier.cc b/utilities/persistent_cache/block_cache_tier.cc index 1ebf8ae6b3a..f7f72df6dfc 100644 --- a/utilities/persistent_cache/block_cache_tier.cc +++ b/utilities/persistent_cache/block_cache_tier.cc @@ -263,7 +263,7 @@ Status BlockCacheTier::InsertImpl(const Slice& key, const Slice& data) { return Status::OK(); } -Status BlockCacheTier::Lookup(const Slice& key, unique_ptr* val, +Status BlockCacheTier::Lookup(const Slice& key, std::unique_ptr* val, size_t* size) { StopWatchNano timer(opt_.env, /*auto_start=*/ true); @@ -287,7 +287,7 @@ Status BlockCacheTier::Lookup(const Slice& key, unique_ptr* val, assert(file->refs_); - unique_ptr scratch(new char[lba.size_]); + std::unique_ptr scratch(new char[lba.size_]); Slice blk_key; Slice blk_val; @@ -369,7 +369,7 @@ bool BlockCacheTier::Reserve(const size_t size) { const double retain_fac = (100 - kEvictPct) / static_cast(100); while (size + size_ > opt_.cache_size * retain_fac) { - unique_ptr f(metadata_.Evict()); + std::unique_ptr f(metadata_.Evict()); if (!f) { // nothing is evictable return false; diff --git a/utilities/persistent_cache/block_cache_tier_file.h b/utilities/persistent_cache/block_cache_tier_file.h index ef5dbab0408..e38b6c9a1d3 100644 --- a/utilities/persistent_cache/block_cache_tier_file.h +++ b/utilities/persistent_cache/block_cache_tier_file.h @@ -149,7 +149,7 @@ class RandomAccessCacheFile : public BlockCacheFile { public: explicit RandomAccessCacheFile(Env* const env, const std::string& dir, const uint32_t cache_id, - const shared_ptr& log) + const std::shared_ptr& log) : BlockCacheFile(env, dir, cache_id), log_(log) {} virtual ~RandomAccessCacheFile() {} diff --git a/utilities/persistent_cache/persistent_cache_bench.cc b/utilities/persistent_cache/persistent_cache_bench.cc index 7d26c3a7de3..64d75c7a518 100644 --- a/utilities/persistent_cache/persistent_cache_bench.cc +++ b/utilities/persistent_cache/persistent_cache_bench.cc @@ -251,7 +251,7 @@ class CacheTierBenchmark { // create data for a key by filling with a certain pattern std::unique_ptr NewBlock(const uint64_t val) { - unique_ptr data(new char[FLAGS_iosize]); + std::unique_ptr data(new char[FLAGS_iosize]); memset(data.get(), val % 255, FLAGS_iosize); return data; } diff --git a/utilities/persistent_cache/persistent_cache_test.h b/utilities/persistent_cache/persistent_cache_test.h index 37e842f2e2a..ad99ea864bd 100644 --- a/utilities/persistent_cache/persistent_cache_test.h +++ b/utilities/persistent_cache/persistent_cache_test.h @@ -157,7 +157,7 @@ class PersistentCacheTierTest : public testing::Test { memset(edata, '0' + (i % 10), sizeof(edata)); auto k = prefix + PaddedNumber(i, /*count=*/8); Slice key(k); - unique_ptr block; + std::unique_ptr block; size_t block_size; if (eviction_enabled) { @@ -210,7 +210,7 @@ class PersistentCacheTierTest : 
public testing::Test { } const std::string path_; - shared_ptr log_; + std::shared_ptr log_; std::shared_ptr cache_; std::atomic key_{0}; size_t max_keys_ = 0; diff --git a/utilities/spatialdb/spatial_db.cc b/utilities/spatialdb/spatial_db.cc index 627eb9de6e4..b34976eb818 100644 --- a/utilities/spatialdb/spatial_db.cc +++ b/utilities/spatialdb/spatial_db.cc @@ -473,7 +473,7 @@ class SpatialIndexCursor : public Cursor { } - unique_ptr value_getter_; + std::unique_ptr value_getter_; bool valid_; Status status_; diff --git a/utilities/spatialdb/spatial_db_test.cc b/utilities/spatialdb/spatial_db_test.cc index 783b347d0a8..cb92af8b1a0 100644 --- a/utilities/spatialdb/spatial_db_test.cc +++ b/utilities/spatialdb/spatial_db_test.cc @@ -94,7 +94,7 @@ TEST_F(SpatialDBTest, FeatureSetSerializeTest) { ASSERT_EQ(deserialized.Get("m").get_double(), 3.25); // corrupted serialization - serialized = serialized.substr(0, serialized.size() - 4); + serialized = serialized.substr(0, serialized.size() - 1); deserialized.Clear(); ASSERT_TRUE(!deserialized.Deserialize(serialized)); } diff --git a/utilities/trace/file_trace_reader_writer.cc b/utilities/trace/file_trace_reader_writer.cc index 36baefc7bc2..4a81516a8b7 100644 --- a/utilities/trace/file_trace_reader_writer.cc +++ b/utilities/trace/file_trace_reader_writer.cc @@ -83,16 +83,18 @@ Status FileTraceWriter::Write(const Slice& data) { return file_writer_->Append(data); } +uint64_t FileTraceWriter::GetFileSize() { return file_writer_->GetFileSize(); } + Status NewFileTraceReader(Env* env, const EnvOptions& env_options, const std::string& trace_filename, std::unique_ptr* trace_reader) { - unique_ptr trace_file; + std::unique_ptr trace_file; Status s = env->NewRandomAccessFile(trace_filename, &trace_file, env_options); if (!s.ok()) { return s; } - unique_ptr file_reader; + std::unique_ptr file_reader; file_reader.reset( new RandomAccessFileReader(std::move(trace_file), trace_filename)); trace_reader->reset(new FileTraceReader(std::move(file_reader))); @@ -102,13 +104,13 @@ Status NewFileTraceReader(Env* env, const EnvOptions& env_options, Status NewFileTraceWriter(Env* env, const EnvOptions& env_options, const std::string& trace_filename, std::unique_ptr* trace_writer) { - unique_ptr trace_file; + std::unique_ptr trace_file; Status s = env->NewWritableFile(trace_filename, &trace_file, env_options); if (!s.ok()) { return s; } - unique_ptr file_writer; + std::unique_ptr file_writer; file_writer.reset(new WritableFileWriter(std::move(trace_file), trace_filename, env_options)); trace_writer->reset(new FileTraceWriter(std::move(file_writer))); diff --git a/utilities/trace/file_trace_reader_writer.h b/utilities/trace/file_trace_reader_writer.h index b363a3f09f7..863f5d9d061 100644 --- a/utilities/trace/file_trace_reader_writer.h +++ b/utilities/trace/file_trace_reader_writer.h @@ -22,7 +22,7 @@ class FileTraceReader : public TraceReader { virtual Status Close() override; private: - unique_ptr file_reader_; + std::unique_ptr file_reader_; Slice result_; size_t offset_; char* const buffer_; @@ -39,9 +39,10 @@ class FileTraceWriter : public TraceWriter { virtual Status Write(const Slice& data) override; virtual Status Close() override; + virtual uint64_t GetFileSize() override; private: - unique_ptr file_writer_; + std::unique_ptr file_writer_; }; } // namespace rocksdb diff --git a/utilities/transactions/pessimistic_transaction.cc b/utilities/transactions/pessimistic_transaction.cc index 67a333f3b08..d895d9d9357 100644 --- 
a/utilities/transactions/pessimistic_transaction.cc +++ b/utilities/transactions/pessimistic_transaction.cc @@ -232,7 +232,7 @@ Status WriteCommittedTxn::PrepareInternal() { WriteBatchInternal::MarkEndPrepare(GetWriteBatch()->GetWriteBatch(), name_); Status s = db_impl_->WriteImpl(write_options, GetWriteBatch()->GetWriteBatch(), - /*callback*/ nullptr, &log_number_, /*log ref*/ 0, + /*callback*/ nullptr, &log_number_, /*log_ref*/ 0, /* disable_memtable*/ true); return s; } @@ -322,12 +322,27 @@ Status PessimisticTransaction::Commit() { } Status WriteCommittedTxn::CommitWithoutPrepareInternal() { - Status s = db_->Write(write_options_, GetWriteBatch()->GetWriteBatch()); + uint64_t seq_used = kMaxSequenceNumber; + auto s = + db_impl_->WriteImpl(write_options_, GetWriteBatch()->GetWriteBatch(), + /*callback*/ nullptr, /*log_used*/ nullptr, + /*log_ref*/ 0, /*disable_memtable*/ false, &seq_used); + assert(!s.ok() || seq_used != kMaxSequenceNumber); + if (s.ok()) { + SetId(seq_used); + } return s; } Status WriteCommittedTxn::CommitBatchInternal(WriteBatch* batch, size_t) { - Status s = db_->Write(write_options_, batch); + uint64_t seq_used = kMaxSequenceNumber; + auto s = db_impl_->WriteImpl(write_options_, batch, /*callback*/ nullptr, + /*log_used*/ nullptr, /*log_ref*/ 0, + /*disable_memtable*/ false, &seq_used); + assert(!s.ok() || seq_used != kMaxSequenceNumber); + if (s.ok()) { + SetId(seq_used); + } return s; } @@ -345,8 +360,15 @@ Status WriteCommittedTxn::CommitInternal() { // in non recovery mode and simply insert the values WriteBatchInternal::Append(working_batch, GetWriteBatch()->GetWriteBatch()); - auto s = db_impl_->WriteImpl(write_options_, working_batch, nullptr, nullptr, - log_number_); + uint64_t seq_used = kMaxSequenceNumber; + auto s = + db_impl_->WriteImpl(write_options_, working_batch, /*callback*/ nullptr, + /*log_used*/ nullptr, /*log_ref*/ log_number_, + /*disable_memtable*/ false, &seq_used); + assert(!s.ok() || seq_used != kMaxSequenceNumber); + if (s.ok()) { + SetId(seq_used); + } return s; } diff --git a/utilities/transactions/pessimistic_transaction_db.cc b/utilities/transactions/pessimistic_transaction_db.cc index 6b016ef72a8..8eb21777a99 100644 --- a/utilities/transactions/pessimistic_transaction_db.cc +++ b/utilities/transactions/pessimistic_transaction_db.cc @@ -146,7 +146,9 @@ Status PessimisticTransactionDB::Initialize( assert(real_trx); real_trx->SetLogNumber(batch_info.log_number_); assert(seq != kMaxSequenceNumber); - real_trx->SetId(seq); + if (GetTxnDBOptions().write_policy != WRITE_COMMITTED) { + real_trx->SetId(seq); + } s = real_trx->SetName(recovered_trx->name_); if (!s.ok()) { diff --git a/utilities/transactions/transaction_lock_mgr.cc b/utilities/transactions/transaction_lock_mgr.cc index d285fd30ed4..8086f7c7c07 100644 --- a/utilities/transactions/transaction_lock_mgr.cc +++ b/utilities/transactions/transaction_lock_mgr.cc @@ -104,7 +104,7 @@ void DeadlockInfoBuffer::AddNewPath(DeadlockPath path) { return; } - paths_buffer_[buffer_idx_] = path; + paths_buffer_[buffer_idx_] = std::move(path); buffer_idx_ = (buffer_idx_ + 1) % paths_buffer_.size(); } @@ -222,9 +222,9 @@ void TransactionLockMgr::RemoveColumnFamily(uint32_t column_family_id) { } } -// Look up the LockMap shared_ptr for a given column_family_id. +// Look up the LockMap std::shared_ptr for a given column_family_id. // Note: The LockMap is only valid as long as the caller is still holding on -// to the returned shared_ptr. +// to the returned std::shared_ptr. 
std::shared_ptr TransactionLockMgr::GetLockMap( uint32_t column_family_id) { // First check thread-local cache @@ -494,8 +494,8 @@ bool TransactionLockMgr::IncrementWaiters( auto extracted_info = wait_txn_map_.Get(queue_values[head]); path.push_back({queue_values[head], extracted_info.m_cf_id, - extracted_info.m_waiting_key, - extracted_info.m_exclusive}); + extracted_info.m_exclusive, + extracted_info.m_waiting_key}); head = queue_parents[head]; } env->GetCurrentTime(&deadlock_time); diff --git a/utilities/transactions/transaction_test.cc b/utilities/transactions/transaction_test.cc index f49c9225741..0968b9a3493 100644 --- a/utilities/transactions/transaction_test.cc +++ b/utilities/transactions/transaction_test.cc @@ -606,6 +606,7 @@ TEST_P(TransactionTest, DeadlockCycleShared) { } } +#ifndef ROCKSDB_VALGRIND_RUN TEST_P(TransactionStressTest, DeadlockCycle) { WriteOptions write_options; ReadOptions read_options; @@ -768,6 +769,7 @@ TEST_P(TransactionStressTest, DeadlockStress) { t.join(); } } +#endif // ROCKSDB_VALGRIND_RUN TEST_P(TransactionTest, CommitTimeBatchFailTest) { WriteOptions write_options; @@ -1097,6 +1099,7 @@ TEST_P(TransactionTest, TwoPhaseEmptyWriteTest) { } } +#ifndef ROCKSDB_VALGRIND_RUN TEST_P(TransactionStressTest, TwoPhaseExpirationTest) { Status s; @@ -1334,6 +1337,7 @@ TEST_P(TransactionTest, PersistentTwoPhaseTransactionTest) { // deleting transaction should unregister transaction ASSERT_EQ(db->GetTransactionByName("xid"), nullptr); } +#endif // ROCKSDB_VALGRIND_RUN // TODO this test needs to be updated with serial commits TEST_P(TransactionTest, DISABLED_TwoPhaseMultiThreadTest) { diff --git a/utilities/transactions/write_prepared_transaction_test.cc b/utilities/transactions/write_prepared_transaction_test.cc index 127f8cc8648..1d645d237fc 100644 --- a/utilities/transactions/write_prepared_transaction_test.cc +++ b/utilities/transactions/write_prepared_transaction_test.cc @@ -731,6 +731,71 @@ TEST_P(WritePreparedTransactionTest, MaybeUpdateOldCommitMap) { MaybeUpdateOldCommitMapTestWithNext(p, c, s, ns, false); } +// Reproduce the bug with two snapshots with the same sequence number and test +// that the release of the first snapshot will not affect the reads by the other +// snapshot +TEST_P(WritePreparedTransactionTest, DoubleSnapshot) { + TransactionOptions txn_options; + Status s; + + // Insert initial value + ASSERT_OK(db->Put(WriteOptions(), "key", "value1")); + + WritePreparedTxnDB* wp_db = dynamic_cast(db); + Transaction* txn = + wp_db->BeginTransaction(WriteOptions(), txn_options, nullptr); + ASSERT_OK(txn->SetName("txn")); + ASSERT_OK(txn->Put("key", "value2")); + ASSERT_OK(txn->Prepare()); + // Three snapshots with the same seq number + const Snapshot* snapshot0 = wp_db->GetSnapshot(); + const Snapshot* snapshot1 = wp_db->GetSnapshot(); + const Snapshot* snapshot2 = wp_db->GetSnapshot(); + ASSERT_OK(txn->Commit()); + SequenceNumber cache_size = wp_db->COMMIT_CACHE_SIZE; + SequenceNumber overlap_seq = txn->GetId() + cache_size; + delete txn; + + // 4th snapshot with a larger seq + const Snapshot* snapshot3 = wp_db->GetSnapshot(); + // Cause an eviction to advance max evicted seq number + // This also fetches the 4 snapshots from db since their seq is lower than the + // new max + wp_db->AddCommitted(overlap_seq, overlap_seq); + + ReadOptions ropt; + // It should see the value before commit + ropt.snapshot = snapshot2; + PinnableSlice pinnable_val; + s = wp_db->Get(ropt, wp_db->DefaultColumnFamily(), "key", &pinnable_val); + ASSERT_OK(s); +
ASSERT_TRUE(pinnable_val == "value1"); + pinnable_val.Reset(); + + wp_db->ReleaseSnapshot(snapshot1); + + // It should still see the value before commit + s = wp_db->Get(ropt, wp_db->DefaultColumnFamily(), "key", &pinnable_val); + ASSERT_OK(s); + ASSERT_TRUE(pinnable_val == "value1"); + pinnable_val.Reset(); + + // Cause an eviction to advance max evicted seq number and trigger updating + // the snapshot list + overlap_seq += cache_size; + wp_db->AddCommitted(overlap_seq, overlap_seq); + + // It should still see the value before commit + s = wp_db->Get(ropt, wp_db->DefaultColumnFamily(), "key", &pinnable_val); + ASSERT_OK(s); + ASSERT_TRUE(pinnable_val == "value1"); + pinnable_val.Reset(); + + wp_db->ReleaseSnapshot(snapshot0); + wp_db->ReleaseSnapshot(snapshot2); + wp_db->ReleaseSnapshot(snapshot3); +} + // Test that the entries in old_commit_map_ get garbage collected properly TEST_P(WritePreparedTransactionTest, OldCommitMapGC) { const size_t snapshot_cache_bits = 0; @@ -816,6 +881,7 @@ TEST_P(WritePreparedTransactionTest, CheckAgainstSnapshotsTest) { std::vector snapshots = {100l, 200l, 300l, 400l, 500l, 600l, 700l, 800l, 900l}; const size_t snapshot_cache_bits = 2; + const uint64_t cache_size = 1ul << snapshot_cache_bits; // Safety check to express the intended size in the test. Can be adjusted if // the snapshots lists changed. assert((1ul << snapshot_cache_bits) * 2 + 1 == snapshots.size()); @@ -843,6 +909,57 @@ TEST_P(WritePreparedTransactionTest, CheckAgainstSnapshotsTest) { commit_entry.prep_seq <= snapshots.back(); ASSERT_EQ(expect_update, !wp_db->old_commit_map_empty_); } + + // Test that search will include multiple snapshot from snapshot cache + { + // exclude first and last item in the cache + CommitEntry commit_entry = {snapshots.front() + 1, + snapshots[cache_size - 1] - 1}; + wp_db->old_commit_map_empty_ = true; // reset + wp_db->old_commit_map_.clear(); + wp_db->CheckAgainstSnapshots(commit_entry); + ASSERT_EQ(wp_db->old_commit_map_.size(), cache_size - 2); + } + + // Test that search will include multiple snapshot from old snapshots + { + // include two in the middle + CommitEntry commit_entry = {snapshots[cache_size] + 1, + snapshots[cache_size + 2] + 1}; + wp_db->old_commit_map_empty_ = true; // reset + wp_db->old_commit_map_.clear(); + wp_db->CheckAgainstSnapshots(commit_entry); + ASSERT_EQ(wp_db->old_commit_map_.size(), 2); + } + + // Test that search will include both snapshot cache and old snapshots + // Case 1: includes all in snapshot cache + { + CommitEntry commit_entry = {snapshots.front() - 1, snapshots.back() + 1}; + wp_db->old_commit_map_empty_ = true; // reset + wp_db->old_commit_map_.clear(); + wp_db->CheckAgainstSnapshots(commit_entry); + ASSERT_EQ(wp_db->old_commit_map_.size(), snapshots.size()); + } + + // Case 2: includes all snapshot caches except the smallest + { + CommitEntry commit_entry = {snapshots.front() + 1, snapshots.back() + 1}; + wp_db->old_commit_map_empty_ = true; // reset + wp_db->old_commit_map_.clear(); + wp_db->CheckAgainstSnapshots(commit_entry); + ASSERT_EQ(wp_db->old_commit_map_.size(), snapshots.size() - 1); + } + + // Case 3: includes only the largest of snapshot cache + { + CommitEntry commit_entry = {snapshots[cache_size - 1] - 1, + snapshots.back() + 1}; + wp_db->old_commit_map_empty_ = true; // reset + wp_db->old_commit_map_.clear(); + wp_db->CheckAgainstSnapshots(commit_entry); + ASSERT_EQ(wp_db->old_commit_map_.size(), snapshots.size() - cache_size + 1); + } } // This test is too slow for travis diff --git 
a/utilities/transactions/write_prepared_txn_db.cc b/utilities/transactions/write_prepared_txn_db.cc index 2d8e4fcee1d..ca728d50713 100644 --- a/utilities/transactions/write_prepared_txn_db.cc +++ b/utilities/transactions/write_prepared_txn_db.cc @@ -379,9 +379,9 @@ void WritePreparedTxnDB::Init(const TransactionDBOptions& /* unused */) { // around. INC_STEP_FOR_MAX_EVICTED = std::max(COMMIT_CACHE_SIZE / 100, static_cast(1)); - snapshot_cache_ = unique_ptr[]>( + snapshot_cache_ = std::unique_ptr[]>( new std::atomic[SNAPSHOT_CACHE_SIZE] {}); - commit_cache_ = unique_ptr[]>( + commit_cache_ = std::unique_ptr[]>( new std::atomic[COMMIT_CACHE_SIZE] {}); } @@ -554,12 +554,6 @@ const std::vector WritePreparedTxnDB::GetSnapshotListFromDB( return db_impl_->snapshots().GetAll(nullptr, max); } -void WritePreparedTxnDB::ReleaseSnapshot(const Snapshot* snapshot) { - auto snap_seq = snapshot->GetSequenceNumber(); - ReleaseSnapshotInternal(snap_seq); - db_impl_->ReleaseSnapshot(snapshot); -} - void WritePreparedTxnDB::ReleaseSnapshotInternal( const SequenceNumber snap_seq) { // relax is enough since max increases monotonically, i.e., if snap_seq < @@ -572,14 +566,16 @@ void WritePreparedTxnDB::ReleaseSnapshotInternal( bool need_gc = false; { WPRecordTick(TXN_OLD_COMMIT_MAP_MUTEX_OVERHEAD); - ROCKS_LOG_WARN(info_log_, "old_commit_map_mutex_ overhead"); + ROCKS_LOG_WARN(info_log_, "old_commit_map_mutex_ overhead for %" PRIu64, + snap_seq); ReadLock rl(&old_commit_map_mutex_); auto prep_set_entry = old_commit_map_.find(snap_seq); need_gc = prep_set_entry != old_commit_map_.end(); } if (need_gc) { WPRecordTick(TXN_OLD_COMMIT_MAP_MUTEX_OVERHEAD); - ROCKS_LOG_WARN(info_log_, "old_commit_map_mutex_ overhead"); + ROCKS_LOG_WARN(info_log_, "old_commit_map_mutex_ overhead for %" PRIu64, + snap_seq); WriteLock wl(&old_commit_map_mutex_); old_commit_map_.erase(snap_seq); old_commit_map_empty_.store(old_commit_map_.empty(), @@ -588,6 +584,33 @@ void WritePreparedTxnDB::ReleaseSnapshotInternal( } } +void WritePreparedTxnDB::CleanupReleasedSnapshots( + const std::vector& new_snapshots, + const std::vector& old_snapshots) { + auto newi = new_snapshots.begin(); + auto oldi = old_snapshots.begin(); + for (; newi != new_snapshots.end() && oldi != old_snapshots.end();) { + assert(*newi >= *oldi); // cannot have new snapshots with lower seq + if (*newi == *oldi) { // still not released + auto value = *newi; + while (newi != new_snapshots.end() && *newi == value) { + newi++; + } + while (oldi != old_snapshots.end() && *oldi == value) { + oldi++; + } + } else { + assert(*newi > *oldi); // *oldi is released + ReleaseSnapshotInternal(*oldi); + oldi++; + } + } + // Everything remained in old_snapshots is released and must be cleaned up + for (; oldi != old_snapshots.end(); oldi++) { + ReleaseSnapshotInternal(*oldi); + } +} + void WritePreparedTxnDB::UpdateSnapshots( const std::vector& snapshots, const SequenceNumber& version) { @@ -636,6 +659,12 @@ void WritePreparedTxnDB::UpdateSnapshots( // Update the size at the end. Otherwise a parallel reader might read // items that are not set yet. snapshots_total_.store(snapshots.size(), std::memory_order_release); + + // Note: this must be done after the snapshots data structures are updated + // with the new list of snapshots. 
+ CleanupReleasedSnapshots(snapshots, snapshots_all_); + snapshots_all_ = snapshots; + TEST_SYNC_POINT("WritePreparedTxnDB::UpdateSnapshots:p:end"); TEST_SYNC_POINT("WritePreparedTxnDB::UpdateSnapshots:s:end"); } @@ -654,13 +683,20 @@ void WritePreparedTxnDB::CheckAgainstSnapshots(const CommitEntry& evicted) { // place before gets overwritten the reader that reads bottom-up will // eventully see it. const bool next_is_larger = true; - SequenceNumber snapshot_seq = kMaxSequenceNumber; + // We will set to true if the border line snapshot suggests that. + bool search_larger_list = false; size_t ip1 = std::min(cnt, SNAPSHOT_CACHE_SIZE); for (; 0 < ip1; ip1--) { - snapshot_seq = snapshot_cache_[ip1 - 1].load(std::memory_order_acquire); + SequenceNumber snapshot_seq = + snapshot_cache_[ip1 - 1].load(std::memory_order_acquire); TEST_IDX_SYNC_POINT("WritePreparedTxnDB::CheckAgainstSnapshots:p:", ++sync_i); TEST_IDX_SYNC_POINT("WritePreparedTxnDB::CheckAgainstSnapshots:s:", sync_i); + if (ip1 == SNAPSHOT_CACHE_SIZE) { // border line snapshot + // snapshot_seq < commit_seq => larger_snapshot_seq <= commit_seq + // then later also continue the search to larger snapshots + search_larger_list = snapshot_seq < evicted.commit_seq; + } if (!MaybeUpdateOldCommitMap(evicted.prep_seq, evicted.commit_seq, snapshot_seq, !next_is_larger)) { break; @@ -675,17 +711,20 @@ void WritePreparedTxnDB::CheckAgainstSnapshots(const CommitEntry& evicted) { #endif TEST_SYNC_POINT("WritePreparedTxnDB::CheckAgainstSnapshots:p:end"); TEST_SYNC_POINT("WritePreparedTxnDB::CheckAgainstSnapshots:s:end"); - if (UNLIKELY(SNAPSHOT_CACHE_SIZE < cnt && ip1 == SNAPSHOT_CACHE_SIZE && - snapshot_seq < evicted.prep_seq)) { + if (UNLIKELY(SNAPSHOT_CACHE_SIZE < cnt && search_larger_list)) { // Then access the less efficient list of snapshots_ WPRecordTick(TXN_SNAPSHOT_MUTEX_OVERHEAD); - ROCKS_LOG_WARN(info_log_, "snapshots_mutex_ overhead"); + ROCKS_LOG_WARN(info_log_, + "snapshots_mutex_ overhead for <%" PRIu64 ",%" PRIu64 + "> with %" ROCKSDB_PRIszt " snapshots", + evicted.prep_seq, evicted.commit_seq, cnt); ReadLock rl(&snapshots_mutex_); // Items could have moved from the snapshots_ to snapshot_cache_ before // accquiring the lock. To make sure that we do not miss a valid snapshot, // read snapshot_cache_ again while holding the lock. 
for (size_t i = 0; i < SNAPSHOT_CACHE_SIZE; i++) { - snapshot_seq = snapshot_cache_[i].load(std::memory_order_acquire); + SequenceNumber snapshot_seq = + snapshot_cache_[i].load(std::memory_order_acquire); if (!MaybeUpdateOldCommitMap(evicted.prep_seq, evicted.commit_seq, snapshot_seq, next_is_larger)) { break; @@ -713,7 +752,10 @@ bool WritePreparedTxnDB::MaybeUpdateOldCommitMap( // then snapshot_seq < commit_seq if (prep_seq <= snapshot_seq) { // overlapping range WPRecordTick(TXN_OLD_COMMIT_MAP_MUTEX_OVERHEAD); - ROCKS_LOG_WARN(info_log_, "old_commit_map_mutex_ overhead"); + ROCKS_LOG_WARN(info_log_, + "old_commit_map_mutex_ overhead for %" PRIu64 + " commit entry: <%" PRIu64 ",%" PRIu64 ">", + snapshot_seq, prep_seq, commit_seq); WriteLock wl(&old_commit_map_mutex_); old_commit_map_empty_.store(false, std::memory_order_release); auto& vec = old_commit_map_[snapshot_seq]; diff --git a/utilities/transactions/write_prepared_txn_db.h b/utilities/transactions/write_prepared_txn_db.h index ec76e271634..e0263d4f7b9 100644 --- a/utilities/transactions/write_prepared_txn_db.h +++ b/utilities/transactions/write_prepared_txn_db.h @@ -112,8 +112,6 @@ class WritePreparedTxnDB : public PessimisticTransactionDB { const std::vector& column_families, std::vector* iterators) override; - virtual void ReleaseSnapshot(const Snapshot* snapshot) override; - // Check whether the transaction that wrote the value with sequence number seq // is visible to the snapshot with sequence number snapshot_seq. // Returns true if commit_seq <= snapshot_seq @@ -222,7 +220,6 @@ class WritePreparedTxnDB : public PessimisticTransactionDB { // rare case and it is ok to pay the cost of mutex ReadLock for such old, // reading transactions. WPRecordTick(TXN_OLD_COMMIT_MAP_MUTEX_OVERHEAD); - ROCKS_LOG_WARN(info_log_, "old_commit_map_mutex_ overhead"); ReadLock rl(&old_commit_map_mutex_); auto prep_set_entry = old_commit_map_.find(snapshot_seq); bool found = prep_set_entry != old_commit_map_.end(); @@ -380,6 +377,7 @@ class WritePreparedTxnDB : public PessimisticTransactionDB { friend class WritePreparedTransactionTest_AdvanceMaxEvictedSeqWithDuplicatesTest_Test; friend class WritePreparedTransactionTest_BasicRecoveryTest_Test; + friend class WritePreparedTransactionTest_DoubleSnapshot_Test; friend class WritePreparedTransactionTest_IsInSnapshotEmptyMapTest_Test; friend class WritePreparedTransactionTest_OldCommitMapGC_Test; friend class WritePreparedTransactionTest_RollbackTest_Test; @@ -519,6 +517,11 @@ class WritePreparedTxnDB : public PessimisticTransactionDB { // version value. void UpdateSnapshots(const std::vector& snapshots, const SequenceNumber& version); + // Check the new list of snapshots against the old one to see if any of the + // snapshots have been released, and do the cleanup for the released ones. + void CleanupReleasedSnapshots( + const std::vector& new_snapshots, + const std::vector& old_snapshots); // Check an evicted entry against live snapshots to see if it should be kept // around or it can be safely discarded (and hence assume committed for all @@ -549,10 +552,14 @@ class WritePreparedTxnDB : public PessimisticTransactionDB { static const size_t DEF_SNAPSHOT_CACHE_BITS = static_cast(7); const size_t SNAPSHOT_CACHE_BITS; const size_t SNAPSHOT_CACHE_SIZE; - unique_ptr[]> snapshot_cache_; + std::unique_ptr[]> snapshot_cache_; // 2nd list for storing snapshots. The list sorted in ascending order. // Thread-safety is provided with snapshots_mutex_.
std::vector snapshots_; + // The list of all snapshots: snapshots_ + snapshot_cache_. This list, although + // redundant, simplifies the CleanupReleasedSnapshots implementation. + // Thread-safety is provided with snapshots_mutex_. + std::vector snapshots_all_; // The version of the latest list of snapshots. This can be used to avoid // rewriting a list that is concurrently updated with a more recent version. SequenceNumber snapshots_version_ = 0; @@ -567,7 +574,7 @@ class WritePreparedTxnDB : public PessimisticTransactionDB { const CommitEntry64bFormat FORMAT; // commit_cache_ must be initialized to zero to tell apart an empty index from // a filled one. Thread-safety is provided with commit_cache_mutex_. - unique_ptr[]> commit_cache_; + std::unique_ptr[]> commit_cache_; // The largest evicted *commit* sequence number from the commit_cache_. If a // seq is smaller than max_evicted_seq_ it might or might not be present in // commit_cache_. So commit_cache_ must first be checked before consulting diff --git a/utilities/ttl/ttl_test.cc b/utilities/ttl/ttl_test.cc index ee7b317aafd..f434d185700 100644 --- a/utilities/ttl/ttl_test.cc +++ b/utilities/ttl/ttl_test.cc @@ -370,14 +370,14 @@ class TtlTest : public testing::Test { static const int64_t kSampleSize_ = 100; std::string dbname_; DBWithTTL* db_ttl_; - unique_ptr env_; + std::unique_ptr env_; private: Options options_; KVMap kvmap_; KVMap::iterator kv_it_; const std::string kNewValue_ = "new_value"; - unique_ptr test_comp_filter_; + std::unique_ptr test_comp_filter_; }; // class TtlTest // If TTL is non positive or not provided, the behaviour is TTL = infinity
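Aside for readers of this patch: the released-snapshot detection added to write_prepared_txn_db.cc above is essentially a merge-style walk over two sorted lists of sequence numbers (the previous and the current snapshot lists). The following is a minimal standalone sketch of that walk, not code from the patch; the name FindReleasedSnapshots and the main() driver are illustrative only, and the sketch collects the released sequence numbers into a vector instead of calling ReleaseSnapshotInternal on each one as the patch does.

// Standalone illustration: given the old and new snapshot lists, both sorted
// in ascending order and possibly containing duplicate sequence numbers,
// return the entries that appear only in the old list, i.e., the snapshots
// that were released in between.
#include <cassert>
#include <cstdint>
#include <iostream>
#include <vector>

using SequenceNumber = uint64_t;

std::vector<SequenceNumber> FindReleasedSnapshots(
    const std::vector<SequenceNumber>& new_snapshots,
    const std::vector<SequenceNumber>& old_snapshots) {
  std::vector<SequenceNumber> released;
  auto newi = new_snapshots.begin();
  auto oldi = old_snapshots.begin();
  while (newi != new_snapshots.end() && oldi != old_snapshots.end()) {
    // Snapshots taken after the old list was captured have higher sequence
    // numbers, so the new iterator can never point below the old one.
    assert(*newi >= *oldi);
    if (*newi == *oldi) {
      // Still alive; skip all duplicates of this value on both sides.
      const SequenceNumber value = *newi;
      while (newi != new_snapshots.end() && *newi == value) ++newi;
      while (oldi != old_snapshots.end() && *oldi == value) ++oldi;
    } else {
      // *oldi has no counterpart in the new list, so it was released.
      released.push_back(*oldi);
      ++oldi;
    }
  }
  // Anything left over in the old list was released as well.
  for (; oldi != old_snapshots.end(); ++oldi) {
    released.push_back(*oldi);
  }
  return released;
}

int main() {
  // Old list had snapshots at seq 10 (twice), 20, and 30; the new list kept
  // 10 and 30 and gained a newer snapshot at 40.
  std::vector<SequenceNumber> old_list = {10, 10, 20, 30};
  std::vector<SequenceNumber> new_list = {10, 30, 40};
  for (SequenceNumber s : FindReleasedSnapshots(new_list, old_list)) {
    std::cout << "released: " << s << "\n";  // prints 20
  }
  return 0;
}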