-
Notifications
You must be signed in to change notification settings - Fork 4
S3 hive style writes #697
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: antalya
Are you sure you want to change the base?
S3 hive style writes #697
Conversation
Depends on #700 |
and writing more tests are the only thing missing I guess |
…rt table function
This is an automated comment for commit be49e03 with description of existing statuses. It's updated for the latest CI running ❌ Click here to open a full report in a separate page
Successful checks
|
|
||
bool has_partition_wildcard = configuration->withPartitionWildcard(); | ||
|
||
if (has_partition_wildcard && !partitioning_style_to_wildcard_acceptance.at(configuration->partitioning_style)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please get that value once and save to a meaningfully named variable, because it is not entirely clear what does this bool flag conveys...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean something like the below?
bool partitioning_style_supports_wildcard = partitioning_style_to_wildcard_acceptance.at(configuration->partitioning_style)
if (has_partition_wildcard && !partitioning_style_supports_wildcard)
{
...
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, that'll be nice
"partitioning_style", | ||
"write_partition_columns_into_files" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those two need some documentation with examples...
ASTPtr getPartitionByAst(const ASTPtr & table_level_partition_by, const ASTPtr & query, const StorageObjectStorage::ConfigurationPtr & configuration) | ||
{ | ||
ASTPtr query_partition_by = nullptr; | ||
if (const auto insert_query = std::dynamic_pointer_cast<ASTInsertQuery>(query)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not query->as<ASTInsertQuery>()
here?
src/Storages/PartitionStrategy.cpp
Outdated
{ | ||
Names extractPartitionRequiredColumns(ASTPtr & partition_by, const Block & sample_block, ContextPtr context) | ||
{ | ||
auto pby_clone = partition_by->clone(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to clone here?
@@ -74,7 +127,9 @@ static const std::unordered_set<std::string_view> optional_configuration_keys = | |||
"max_single_part_upload_size", | |||
"max_connections", | |||
"expiration_window_seconds", | |||
"no_sign_request" | |||
"no_sign_request", | |||
"partitioning_style", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was doubting if "partition_strategy" might be a better name, and then I saw that you have a class PartitionStrategy
that represents a thing that is controlled by this setting, and now I am certain.
src/Storages/PartitionStrategy.h
Outdated
ContextPtr context; | ||
}; | ||
|
||
struct PartitionStrategyProvider |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer name PartitionStrategyFactory
here, the nuances of the name is debatable, but IMO provider
is:
- more vague, while this particular class just builds an instance based on input arguments.
- most often used in CH codebase in different meaning: like a simpler interface to the underlying complex system (metrics provider, columns in dictionaries), or give access to some pre-existing pre-configured component, managed elsewhere.
Also since it returns std::shared_ptr it is not entirely clear if that is a pre-existing object (that already has some state) and implicitly shared between callers, or a new one (which caller may freely modify and do not be afraid of side effects elsewhere)
src/Storages/PartitionedSink.cpp
Outdated
/* | ||
* `write_partition_columns_into_files` | ||
*/ | ||
const auto chunk = getPartitionStrategy()->getChunkWithoutPartitionColumnsIfNeeded(input_chunk); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not just partition_strategy
here?
const auto chunk = getPartitionStrategy()->getChunkWithoutPartitionColumnsIfNeeded(input_chunk); | |
const auto chunk = partition_strategy->getChunkWithoutPartitionColumnsIfNeeded(input_chunk); |
namespace DB | ||
{ | ||
|
||
struct PartitionStrategy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is PartitionStrategy
usable with non-object-storage storage?
} | ||
|
||
namespace StorageObjectStorageSetting | ||
{ | ||
extern const StorageObjectStorageSettingsBool allow_dynamic_metadata_for_data_lakes; | ||
} | ||
|
||
namespace | ||
{ | ||
void sanityCheckPartitioningConfiguration( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO this whole function (and getPartitionByAst) can be moved inside PartitionStrategyProvider
, since those only validate/prepare parameters for PartitionStrategyProvider::get
.
{ | ||
if (!table_level_partition_by && !query_partition_by) | ||
{ | ||
// do we want to assert that `partitioning_style` is not set to something different style AND |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds like we want it
setVirtuals(VirtualColumnUtils::getVirtualsForFileLikeStorage(metadata.columns, context_, sample_path)); | ||
setInMemoryMetadata(metadata); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
{ | ||
std::string path; | ||
|
||
if (!prefix.empty()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I should probably add some logic here to prevent double slashes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not a problem in my tests because AWS S3 sdk splits the key into several string using the slash delimiter, but we should be careful
For some odd reason linking is failing on some builds:
|
// create table s3_table engine=s3('{_partition_id}'); -- partition id wildcard set, but no partition expression | ||
// create table s3_table engine=s3(partition_strategy='hive'); -- partition strategy set, but no partition expression | ||
if (partition_by) | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is somewhat problematic.
PartitionStrategyFactory
needs a sample block, so it needs to be after the call to VirtualColumnUtils::getVirtualsForFileLikeStorage(metadata.columns, context, sample_path, format_settings)
since it might alter the number of columns. This function needs a sample_path
, that is resolved by getPathSample
.
On the other hand, PartitionStrategyFactory
performs some bucket & key validation which should happen before getPathSample
Tests are broken because we are missing some fixes by upstream, the below pr contains the list which I believe to fix those issues: Once we have 25.2, it'll be easier. |
More info on ClickHouse#76802
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Add support for hive partition style writes. Depends on #710 and #711, tests will not pass otherwise
Documentation entry for user-facing changes