Skip to content

Commit

Permalink
Improve / clean up meta block code & integrity (facebook#9163)
Browse files Browse the repository at this point in the history
Summary:
* Checksums are now checked on meta blocks unless specifically
suppressed or not applicable (e.g. plain table). (Was other way around.)
This means a number of cases that were not checking checksums now are,
including direct read TableProperties in Version::GetTableProperties
(fixed in meta_blocks ReadTableProperties), reading any block from
PersistentCache (fixed in BlockFetcher), read TableProperties in
SstFileDumper (ldb/sst_dump/BackupEngine) before table reader open,
maybe more.
* For that to work, I moved the global_seqno+TableProperties checksum
logic to the shared table/ code, because that is used by many utilies
such as SstFileDumper.
* Also for that to work, we have to know when we're dealing with a block
that has a checksum (trailer), so added that capability to Footer based
on magic number, and from there BlockFetcher.
* Knowledge of trailer presence has also fixed a problem where other
table formats were reading blocks including bytes for a non-existant
trailer--and awkwardly kind-of not using them, e.g. no shared code
checking checksums. (BlockFetcher compression type was populated
incorrectly.) Now we only read what is needed.
* Minimized code duplication and differing/incompatible/awkward
abstractions in meta_blocks.{cc,h} (e.g. SeekTo in metaindex block
without parsing block handle)
* Moved some meta block handling code from table_properties*.*
* Moved some code specific to block-based table from shared table/ code
to BlockBasedTable class. The checksum stuff means we can't completely
separate it, but things that don't need to be in shared table/ code
should not be.
* Use unique_ptr rather than raw ptr in more places. (Note: you can
std::move from unique_ptr to shared_ptr.)

Without enhancements to GetPropertiesOfAllTablesTest (see below),
net reduction of roughly 100 lines of code.

Pull Request resolved: facebook#9163

Test Plan:
existing tests and
* Enhanced DBTablePropertiesTest.GetPropertiesOfAllTablesTest to verify that
checksums are now checked on direct read of table properties by TableCache
(new test would fail before this change)
* Also enhanced DBTablePropertiesTest.GetPropertiesOfAllTablesTest to test
putting table properties under old meta name
* Also generally enhanced that same test to actually test what it was
supposed to be testing already, by kicking things out of table cache when
we don't want them there.

Reviewed By: ajkr, mrambacher

Differential Revision: D32514757

Pulled By: pdillinger

fbshipit-source-id: 507964b9311d186ae8d1131182290cbd97a99fa9
  • Loading branch information
pdillinger authored and facebook-github-bot committed Nov 18, 2021
1 parent f429466 commit 230660b
Show file tree
Hide file tree
Showing 36 changed files with 493 additions and 503 deletions.
2 changes: 2 additions & 0 deletions HISTORY.md
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,7 @@
* Added input sanitization on negative bytes passed into `GenericRateLimiter::Request`.
* Fixed an assertion failure in CompactionIterator when write-prepared transaction is used. We prove that certain operations can lead to a Delete being followed by a SingleDelete (same user key). We can drop the SingleDelete.
* Fixed a bug of timestamp-based GC which can cause all versions of a key under full_history_ts_low to be dropped. This bug will be triggered when some of the ikeys' timestamps are lower than full_history_ts_low, while others are newer.
* In some cases outside of the DB read and compaction paths, SST block checksums are now checked where they were not before.
* Explicitly check for and disallow the `BlockBasedTableOptions` if insertion into one of {`block_cache`, `block_cache_compressed`, `persistent_cache`} can show up in another of these. (RocksDB expects to be able to use the same key for different physical data among tiers.)

### Behavior Changes
Expand All @@ -29,6 +30,7 @@
* Made FileSystem extend the Customizable class and added a CreateFromString method. Implementations need to be registered with the ObjectRegistry and to implement a Name() method in order to be created via this method.
* Clarified in API comments that RocksDB is not exception safe for callbacks and custom extensions. An exception propagating into RocksDB can lead to undefined behavior, including data loss, unreported corruption, deadlocks, and more.
* Marked `WriteBufferManager` as `final` because it is not intended for extension.
* Removed unimportant implementation details from table_properties.h
* Add API `FSDirectory::FsyncWithDirOptions()`, which provides extra information like directory fsync reason in `DirFsyncOptions`. File system like btrfs is using that to skip directory fsync for creating a new file, or when renaming a file, fsync the target file instead of the directory, which improves the `DB::Open()` speed by ~20%.
* `DB::Open()` is not going be blocked by obsolete file purge if `DBOptions::avoid_unnecessary_blocking_io` is set to true.
* In builds where glibc provides `gettid()`, info log ("LOG" file) lines now print a system-wide thread ID from `gettid()` instead of the process-local `pthread_self()`. For all users, the thread ID format is changed from hexadecimal to decimal integer.
Expand Down
7 changes: 7 additions & 0 deletions Makefile
Original file line number Diff line number Diff line change
Expand Up @@ -306,6 +306,13 @@ ifdef COMPILE_WITH_ASAN
EXEC_LDFLAGS += -fsanitize=address
PLATFORM_CCFLAGS += -fsanitize=address
PLATFORM_CXXFLAGS += -fsanitize=address
ifeq ($(LIB_MODE),shared)
ifdef USE_CLANG
# Fix false ODR violation; see https://github.com/google/sanitizers/issues/1017
EXEC_LDFLAGS += -mllvm -asan-use-private-alias=1
PLATFORM_CXXFLAGS += -mllvm -asan-use-private-alias=1
endif
endif
endif

# TSAN doesn't work well with jemalloc. If we're compiling with TSAN, we should use regular malloc.
Expand Down
2 changes: 1 addition & 1 deletion db/corruption_test.cc
Original file line number Diff line number Diff line change
Expand Up @@ -544,7 +544,7 @@ TEST_F(CorruptionTest, RangeDeletionCorrupted) {
fs->GetFileSize(filename, file_opts.io_options, &file_size, nullptr));

BlockHandle range_del_handle;
ASSERT_OK(FindMetaBlock(
ASSERT_OK(FindMetaBlockInFile(
file_reader.get(), file_size, kBlockBasedTableMagicNumber,
ImmutableOptions(options_), kRangeDelBlock, &range_del_handle));

Expand Down
79 changes: 77 additions & 2 deletions db/db_table_properties_test.cc
Original file line number Diff line number Diff line change
Expand Up @@ -14,8 +14,10 @@
#include "port/port.h"
#include "port/stack_trace.h"
#include "rocksdb/db.h"
#include "rocksdb/types.h"
#include "rocksdb/utilities/table_properties_collectors.h"
#include "table/format.h"
#include "table/meta_blocks.h"
#include "test_util/testharness.h"
#include "test_util/testutil.h"
#include "util/random.h"
Expand Down Expand Up @@ -63,21 +65,49 @@ class DBTablePropertiesTest : public DBTestBase,
TEST_F(DBTablePropertiesTest, GetPropertiesOfAllTablesTest) {
Options options = CurrentOptions();
options.level0_file_num_compaction_trigger = 8;
// Part of strategy to prevent pinning table files
options.max_open_files = 42;
Reopen(options);

// Create 4 tables
for (int table = 0; table < 4; ++table) {
// Use old meta name for table properties for one file
if (table == 3) {
SyncPoint::GetInstance()->SetCallBack(
"BlockBasedTableBuilder::WritePropertiesBlock:Meta", [&](void* meta) {
*reinterpret_cast<const std::string**>(meta) =
&kPropertiesBlockOldName;
});
SyncPoint::GetInstance()->EnableProcessing();
}
// Build file
for (int i = 0; i < 10 + table; ++i) {
ASSERT_OK(db_->Put(WriteOptions(), ToString(table * 100 + i), "val"));
}
ASSERT_OK(db_->Flush(FlushOptions()));
}
SyncPoint::GetInstance()->DisableProcessing();
std::string original_session_id;
ASSERT_OK(db_->GetDbSessionId(original_session_id));

// Part of strategy to prevent pinning table files
SyncPoint::GetInstance()->SetCallBack(
"VersionEditHandler::LoadTables:skip_load_table_files",
[&](void* skip_load) { *reinterpret_cast<bool*>(skip_load) = true; });
SyncPoint::GetInstance()->EnableProcessing();

// 1. Read table properties directly from file
Reopen(options);
// Clear out auto-opened files
dbfull()->TEST_table_cache()->EraseUnRefEntries();
ASSERT_EQ(dbfull()->TEST_table_cache()->GetUsage(), 0U);
VerifyTableProperties(db_, 10 + 11 + 12 + 13);

// 2. Put two tables to table cache and
Reopen(options);
// Clear out auto-opened files
dbfull()->TEST_table_cache()->EraseUnRefEntries();
ASSERT_EQ(dbfull()->TEST_table_cache()->GetUsage(), 0U);
// fetch key from 1st and 2nd table, which will internally place that table to
// the table cache.
for (int i = 0; i < 2; ++i) {
Expand All @@ -88,12 +118,57 @@ TEST_F(DBTablePropertiesTest, GetPropertiesOfAllTablesTest) {

// 3. Put all tables to table cache
Reopen(options);
// fetch key from 1st and 2nd table, which will internally place that table to
// the table cache.
// fetch key from all tables, which will place them in table cache.
for (int i = 0; i < 4; ++i) {
Get(ToString(i * 100 + 0));
}
VerifyTableProperties(db_, 10 + 11 + 12 + 13);

// 4. Try to read CORRUPT properties (a) directly from file, and (b)
// through reader on Get

// It's not practical to prevent table file read on Open, so we
// corrupt after open and after purging table cache.
for (bool direct : {true, false}) {
Reopen(options);
// Clear out auto-opened files
dbfull()->TEST_table_cache()->EraseUnRefEntries();
ASSERT_EQ(dbfull()->TEST_table_cache()->GetUsage(), 0U);

TablePropertiesCollection props;
ASSERT_OK(db_->GetPropertiesOfAllTables(&props));
std::string sst_file = props.begin()->first;

// Corrupt the file's TableProperties using session id
std::string contents;
ASSERT_OK(
ReadFileToString(env_->GetFileSystem().get(), sst_file, &contents));
size_t pos = contents.find(original_session_id);
ASSERT_NE(pos, std::string::npos);
ASSERT_OK(test::CorruptFile(env_, sst_file, static_cast<int>(pos), 1,
/*verify checksum fails*/ false));

// Try to read CORRUPT properties
if (direct) {
ASSERT_TRUE(db_->GetPropertiesOfAllTables(&props).IsCorruption());
} else {
bool found_corruption = false;
for (int i = 0; i < 4; ++i) {
std::string result = Get(ToString(i * 100 + 0));
if (result.find_first_of("Corruption: block checksum mismatch") !=
std::string::npos) {
found_corruption = true;
}
}
ASSERT_TRUE(found_corruption);
}

// UN-corrupt file for next iteration
ASSERT_OK(test::CorruptFile(env_, sst_file, static_cast<int>(pos), 1,
/*verify checksum fails*/ false));
}

SyncPoint::GetInstance()->DisableProcessing();
}

TEST_F(DBTablePropertiesTest, CreateOnDeletionCollectorFactory) {
Expand Down
40 changes: 18 additions & 22 deletions db/plain_table_db_test.cc
Original file line number Diff line number Diff line change
Expand Up @@ -266,24 +266,22 @@ class TestPlainTableReader : public PlainTableReader {
const EnvOptions& env_options, const InternalKeyComparator& icomparator,
EncodingType encoding_type, uint64_t file_size, int bloom_bits_per_key,
double hash_table_ratio, size_t index_sparseness,
const TableProperties* table_properties,
std::unique_ptr<TableProperties>&& props,
std::unique_ptr<RandomAccessFileReader>&& file,
const ImmutableOptions& ioptions, const SliceTransform* prefix_extractor,
bool* expect_bloom_not_match, bool store_index_in_file,
uint32_t column_family_id, const std::string& column_family_name)
: PlainTableReader(ioptions, std::move(file), env_options, icomparator,
encoding_type, file_size, table_properties,
encoding_type, file_size, props.get(),
prefix_extractor),
expect_bloom_not_match_(expect_bloom_not_match) {
Status s = MmapDataIfNeeded();
EXPECT_TRUE(s.ok());

s = PopulateIndex(const_cast<TableProperties*>(table_properties),
bloom_bits_per_key, hash_table_ratio, index_sparseness,
2 * 1024 * 1024);
s = PopulateIndex(props.get(), bloom_bits_per_key, hash_table_ratio,
index_sparseness, 2 * 1024 * 1024);
EXPECT_TRUE(s.ok());

TableProperties* props = const_cast<TableProperties*>(table_properties);
EXPECT_EQ(column_family_id, static_cast<uint32_t>(props->column_family_id));
EXPECT_EQ(column_family_name, props->column_family_name);
if (store_index_in_file) {
Expand All @@ -297,7 +295,7 @@ class TestPlainTableReader : public PlainTableReader {
EXPECT_TRUE(num_blocks_ptr != props->user_collected_properties.end());
}
}
table_properties_.reset(props);
table_properties_ = std::move(props);
}

~TestPlainTableReader() override {}
Expand Down Expand Up @@ -337,26 +335,24 @@ class TestPlainTableFactory : public PlainTableFactory {
std::unique_ptr<RandomAccessFileReader>&& file, uint64_t file_size,
std::unique_ptr<TableReader>* table,
bool /*prefetch_index_and_filter_in_cache*/) const override {
TableProperties* props = nullptr;
auto s =
ReadTableProperties(file.get(), file_size, kPlainTableMagicNumber,
table_reader_options.ioptions, &props,
true /* compression_type_missing */);
std::unique_ptr<TableProperties> props;
auto s = ReadTableProperties(file.get(), file_size, kPlainTableMagicNumber,
table_reader_options.ioptions, &props);
EXPECT_TRUE(s.ok());

if (store_index_in_file_) {
BlockHandle bloom_block_handle;
s = FindMetaBlock(file.get(), file_size, kPlainTableMagicNumber,
table_reader_options.ioptions,
BloomBlockBuilder::kBloomBlock, &bloom_block_handle,
/* compression_type_missing */ true);
s = FindMetaBlockInFile(file.get(), file_size, kPlainTableMagicNumber,
table_reader_options.ioptions,
BloomBlockBuilder::kBloomBlock,
&bloom_block_handle);
EXPECT_TRUE(s.ok());

BlockHandle index_block_handle;
s = FindMetaBlock(file.get(), file_size, kPlainTableMagicNumber,
table_reader_options.ioptions,
PlainTableIndexBuilder::kPlainTableIndexBlock,
&index_block_handle, /* compression_type_missing */ true);
s = FindMetaBlockInFile(file.get(), file_size, kPlainTableMagicNumber,
table_reader_options.ioptions,
PlainTableIndexBuilder::kPlainTableIndexBlock,
&index_block_handle);
EXPECT_TRUE(s.ok());
}

Expand All @@ -370,8 +366,8 @@ class TestPlainTableFactory : public PlainTableFactory {
std::unique_ptr<PlainTableReader> new_reader(new TestPlainTableReader(
table_reader_options.env_options,
table_reader_options.internal_comparator, encoding_type, file_size,
bloom_bits_per_key_, hash_table_ratio_, index_sparseness_, props,
std::move(file), table_reader_options.ioptions,
bloom_bits_per_key_, hash_table_ratio_, index_sparseness_,
std::move(props), std::move(file), table_reader_options.ioptions,
table_reader_options.prefix_extractor, expect_bloom_not_match_,
store_index_in_file_, column_family_id_, column_family_name_));

Expand Down
10 changes: 10 additions & 0 deletions db/table_cache.cc
Original file line number Diff line number Diff line change
Expand Up @@ -655,6 +655,16 @@ size_t TableCache::GetMemoryUsageByTableReader(
return ret;
}

bool TableCache::HasEntry(Cache* cache, uint64_t file_number) {
Cache::Handle* handle = cache->Lookup(GetSliceForFileNumber(&file_number));
if (handle) {
cache->Release(handle);
return true;
} else {
return false;
}
}

void TableCache::Evict(Cache* cache, uint64_t file_number) {
cache->Erase(GetSliceForFileNumber(&file_number));
}
Expand Down
3 changes: 3 additions & 0 deletions db/table_cache.h
Original file line number Diff line number Diff line change
Expand Up @@ -125,6 +125,9 @@ class TableCache {
// Evict any entry for the specified file number
static void Evict(Cache* cache, uint64_t file_number);

// Query whether specified file number is currently in cache
static bool HasEntry(Cache* cache, uint64_t file_number);

// Clean table handle and erase it from the table cache
// Used in DB close, or the file is not live anymore.
void EraseHandle(const FileDescriptor& fd, Cache::Handle* handle);
Expand Down
14 changes: 5 additions & 9 deletions db/table_properties_collector_test.cc
Original file line number Diff line number Diff line change
Expand Up @@ -291,11 +291,9 @@ void TestCustomizedTablePropertiesCollector(
std::unique_ptr<RandomAccessFileReader> fake_file_reader(
new RandomAccessFileReader(std::move(source), "test"));

TableProperties* props;
std::unique_ptr<TableProperties> props;
Status s = ReadTableProperties(fake_file_reader.get(), fwf->contents().size(),
magic_number, ioptions, &props,
true /* compression_type_missing */);
std::unique_ptr<TableProperties> props_guard(props);
magic_number, ioptions, &props);
ASSERT_OK(s);

auto user_collected = props->user_collected_properties;
Expand Down Expand Up @@ -432,13 +430,11 @@ void TestInternalKeyPropertiesCollector(
std::unique_ptr<RandomAccessFileReader> reader(
new RandomAccessFileReader(std::move(source), "test"));

TableProperties* props;
Status s =
ReadTableProperties(reader.get(), fwf->contents().size(), magic_number,
ioptions, &props, true /* compression_type_missing */);
std::unique_ptr<TableProperties> props;
Status s = ReadTableProperties(reader.get(), fwf->contents().size(),
magic_number, ioptions, &props);
ASSERT_OK(s);

std::unique_ptr<TableProperties> props_guard(props);
auto user_collected = props->user_collected_properties;
uint64_t deleted = GetDeletedKeys(user_collected);
ASSERT_EQ(5u, deleted); // deletes + single-deletes
Expand Down
6 changes: 5 additions & 1 deletion db/version_edit_handler.cc
Original file line number Diff line number Diff line change
Expand Up @@ -517,7 +517,11 @@ Status VersionEditHandler::MaybeCreateVersion(const VersionEdit& /*edit*/,
Status VersionEditHandler::LoadTables(ColumnFamilyData* cfd,
bool prefetch_index_and_filter_in_cache,
bool is_initial_load) {
if (skip_load_table_files_) {
bool skip_load_table_files = skip_load_table_files_;
TEST_SYNC_POINT_CALLBACK(
"VersionEditHandler::LoadTables:skip_load_table_files",
&skip_load_table_files);
if (skip_load_table_files) {
return Status::OK();
}
assert(cfd != nullptr);
Expand Down
7 changes: 3 additions & 4 deletions db/version_set.cc
Original file line number Diff line number Diff line change
Expand Up @@ -1271,24 +1271,23 @@ Status Version::GetTableProperties(std::shared_ptr<const TableProperties>* tp,
return s;
}

TableProperties* raw_table_properties;
// By setting the magic number to kInvalidTableMagicNumber, we can by
// pass the magic number check in the footer.
std::unique_ptr<RandomAccessFileReader> file_reader(
new RandomAccessFileReader(
std::move(file), file_name, nullptr /* env */, io_tracer_,
nullptr /* stats */, 0 /* hist_type */, nullptr /* file_read_hist */,
nullptr /* rate_limiter */, ioptions->listeners));
std::unique_ptr<TableProperties> props;
s = ReadTableProperties(
file_reader.get(), file_meta->fd.GetFileSize(),
Footer::kInvalidTableMagicNumber /* table's magic number */, *ioptions,
&raw_table_properties, false /* compression_type_missing */);
&props);
if (!s.ok()) {
return s;
}
*tp = std::move(props);
RecordTick(ioptions->stats, NUMBER_DIRECT_LOAD_TABLE_PROPERTIES);

*tp = std::shared_ptr<const TableProperties>(raw_table_properties);
return s;
}

Expand Down
4 changes: 0 additions & 4 deletions include/rocksdb/table_properties.h
Original file line number Diff line number Diff line change
Expand Up @@ -71,10 +71,6 @@ struct TablePropertiesNames {
static const std::string kFastCompressionEstimatedDataSize;
};

extern const std::string kPropertiesBlock;
extern const std::string kCompressionDictBlock;
extern const std::string kRangeDelBlock;

// `TablePropertiesCollector` provides the mechanism for users to collect
// their own properties that they are interested in. This class is essentially
// a collection of callback functions that will be invoked during table
Expand Down
10 changes: 9 additions & 1 deletion table/block_based/block_based_table_builder.cc
Original file line number Diff line number Diff line change
Expand Up @@ -46,6 +46,7 @@
#include "table/block_based/full_filter_block.h"
#include "table/block_based/partitioned_filter_block.h"
#include "table/format.h"
#include "table/meta_blocks.h"
#include "table/table_builder.h"
#include "util/coding.h"
#include "util/compression.h"
Expand All @@ -62,6 +63,8 @@ extern const std::string kHashIndexPrefixesMetadataBlock;
// Without anonymous namespace here, we fail the warning -Wmissing-prototypes
namespace {

constexpr size_t kBlockTrailerSize = BlockBasedTable::kBlockTrailerSize;

// Create a filter block builder based on its type.
FilterBlockBuilder* CreateFilterBlockBuilder(
const ImmutableCFOptions& /*opt*/, const MutableCFOptions& mopt,
Expand Down Expand Up @@ -1722,7 +1725,12 @@ void BlockBasedTableBuilder::WritePropertiesBlock(
&props_block_size);
}
#endif // !NDEBUG
meta_index_builder->Add(kPropertiesBlock, properties_block_handle);

const std::string* properties_block_meta = &kPropertiesBlock;
TEST_SYNC_POINT_CALLBACK(
"BlockBasedTableBuilder::WritePropertiesBlock:Meta",
&properties_block_meta);
meta_index_builder->Add(*properties_block_meta, properties_block_handle);
}
}

Expand Down
Loading

0 comments on commit 230660b

Please sign in to comment.