S3 hive style writes #697

Status: Open. Wants to merge 43 commits into base: antalya


@arthurpassos (Collaborator) commented Mar 20, 2025

More info on ClickHouse#76802

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Add support for Hive-style partitioned writes. Depends on #710 and #711; tests will not pass without them.

Documentation entry for user-facing changes

@arthurpassos (Collaborator Author) commented:

Depends on #700

@arthurpassos (Collaborator Author) commented:

Depends on #700. That and writing more tests are the only things missing, I guess.

@altinity-robot (Collaborator) commented Mar 27, 2025

This is an automated comment for commit be49e03 with a description of existing statuses. It is updated for the latest CI run.

❌ Click here to open a full report in a separate page

| Check name | Description | Status |
| --- | --- | --- |
| Builds | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ❌ failure |
| Integration tests | The integration tests report. In parenthesis the package type is given, and in square brackets are the optional part/total tests | ❌ failure |
| Regression aarch64 Tiered Storage s3amazon | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ❌ failure |
| Regression aarch64 Tiered Storage s3gcs | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ❌ failure |
| Sign aarch64 | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ❌ error |
| Sign release | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ❌ error |
| Stateless tests | Runs stateless functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc | ❌ failure |

Successful checks

| Check name | Description | Status |
| --- | --- | --- |
| Compatibility check | Checks that clickhouse binary runs on distributions with old libc versions. If it fails, ask a maintainer for help | ✅ success |
| Docker keeper image | The check to build and optionally push the mentioned image to docker hub | ✅ success |
| Docker server image | The check to build and optionally push the mentioned image to docker hub | ✅ success |
| Install packages | Checks that the built packages are installable in a clear environment | ✅ success |
| Ready for release | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success |
| Regression aarch64 Alter attach partition | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success |
| Regression aarch64 Alter move partition | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success |
| Regression aarch64 Alter replace partition | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success |
| Regression aarch64 Benchmark aws_s3 | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success |
| Regression aarch64 Benchmark gcs | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success |
| Regression aarch64 Benchmark minio | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success |
| Regression aarch64 Clickhouse Keeper SSL | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success |
| Regression aarch64 LDAP authentication | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success |
| Regression aarch64 LDAP external_user_directory | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success |
| Regression aarch64 LDAP role_mapping | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success |
| Regression aarch64 Parquet aws_s3 | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success |
| Regression aarch64 Parquet minio | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success |
| Regression aarch64 Parquet | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success |
| Regression aarch64 S3 azure | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success |
| Regression aarch64 S3 gcs | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success |
| Regression aarch64 S3 minio | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success |
| Regression aarch64 Tiered Storage minio | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success |
| Regression aarch64 aes_encryption | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success |
| Regression aarch64 atomic_insert | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success |
| Regression aarch64 base_58 | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success |
| Regression aarch64 clickhouse_keeper | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success |
| Regression aarch64 data_types | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success |
| Regression aarch64 datetime64_extended_range | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success |
| Regression aarch64 disk_level_encryption | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success |
| Regression aarch64 dns | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success |
| Regression aarch64 engines | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success |
| Regression aarch64 example | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success |
| Regression aarch64 extended_precision_data_types | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success |
| Regression aarch64 kafka | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success |
| Regression aarch64 kerberos | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success |
| Regression aarch64 key_value | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success |
| Regression aarch64 lightweight_delete | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success |
| Regression aarch64 memory | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success |
| Regression aarch64 part_moves_between_shards | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success |
| Regression aarch64 selects | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success |
| Regression aarch64 session_timezone | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success |
| Regression aarch64 tiered_storage | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success |
| Regression aarch64 window_functions | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success |
| Regression release Alter move partition | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success |
| Stateful tests | Runs stateful functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc | ✅ success |


bool has_partition_wildcard = configuration->withPartitionWildcard();

if (has_partition_wildcard && !partitioning_style_to_wildcard_acceptance.at(configuration->partitioning_style))

Member:

Please get that value once and save it to a meaningfully named variable, because it is not entirely clear what this bool flag conveys.

Collaborator Author:

You mean something like the below?

bool partitioning_style_supports_wildcard = partitioning_style_to_wildcard_acceptance.at(configuration->partitioning_style);

if (has_partition_wildcard && !partitioning_style_supports_wildcard)
{
    ...
}

Member:

yes, that'll be nice
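
For readers following the thread, the agreed refactoring could look roughly like the self-contained sketch below. Only the variable names and the `{_partition_id}` wildcard come from this PR; the `PartitioningStyle` enum, the contents of the map, and the helper functions `hasPartitionWildcard` and `validatePartitioning` are assumptions made for illustration and are not the actual ClickHouse code.

```cpp
#include <stdexcept>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the partitioning styles discussed in this PR.
enum class PartitioningStyle { Wildcard, Hive };

// Assumed values: which style accepts the {_partition_id} wildcard in the path.
static const std::unordered_map<PartitioningStyle, bool> partitioning_style_to_wildcard_acceptance = {
    {PartitioningStyle::Wildcard, true},
    {PartitioningStyle::Hive, false},
};

// Hypothetical helper: true if the path contains the partition wildcard.
bool hasPartitionWildcard(const std::string & path)
{
    return path.find("{_partition_id}") != std::string::npos;
}

// Looks the flag up once and stores it under a descriptive name,
// so the condition reads as a sentence.
void validatePartitioning(const std::string & path, PartitioningStyle style)
{
    const bool has_partition_wildcard = hasPartitionWildcard(path);
    const bool partitioning_style_supports_wildcard = partitioning_style_to_wildcard_acceptance.at(style);

    if (has_partition_wildcard && !partitioning_style_supports_wildcard)
        throw std::invalid_argument("The {_partition_id} wildcard is not supported by the selected partitioning style");
}
```

With these assumed map values, a path containing `{_partition_id}` passes for `PartitioningStyle::Wildcard` and throws for `PartitioningStyle::Hive`.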

Comment on lines 131 to 132
"partitioning_style",
"write_partition_columns_into_files"

Member:

Those two need some documentation with examples...

ASTPtr getPartitionByAst(const ASTPtr & table_level_partition_by, const ASTPtr & query, const StorageObjectStorage::ConfigurationPtr & configuration)
{
ASTPtr query_partition_by = nullptr;
if (const auto insert_query = std::dynamic_pointer_cast<ASTInsertQuery>(query))

Member:

why not query->as<ASTInsertQuery>() here?
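
To illustrate the suggestion: an `as<T>()` member returns a raw pointer (or null) without creating an extra `shared_ptr`, which is enough when the result is only inspected. The sketch below uses simplified stand-ins for `IAST`, `ASTInsertQuery` and `ASTSelectQuery`, and the stand-in `as<T>()` is built on `dynamic_cast`; the real ClickHouse classes and their cast mechanism may differ, and `getQueryPartitionBy` is a hypothetical helper name.

```cpp
#include <memory>

// Simplified stand-in for the AST base class (assumption, not the real IAST).
struct IAST
{
    virtual ~IAST() = default;

    // Returns a pointer to the derived node, or nullptr if the type does not match.
    template <typename Derived>
    Derived * as() { return dynamic_cast<Derived *>(this); }
};

using ASTPtr = std::shared_ptr<IAST>;

// Stand-in query nodes, reduced to what the example needs.
struct ASTInsertQuery : IAST { ASTPtr partition_by; };
struct ASTSelectQuery : IAST {};

// Hypothetical helper: extracts the query-level PARTITION BY, if any.
ASTPtr getQueryPartitionBy(const ASTPtr & query)
{
    if (!query)
        return nullptr;

    // No shared_ptr copy is made; the raw pointer is only used inside this scope.
    if (const auto * insert_query = query->as<ASTInsertQuery>())
        return insert_query->partition_by;

    return nullptr;
}
```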

{
Names extractPartitionRequiredColumns(ASTPtr & partition_by, const Block & sample_block, ContextPtr context)
{
auto pby_clone = partition_by->clone();

Member:

Why do we need to clone here?

@@ -74,7 +127,9 @@ static const std::unordered_set<std::string_view> optional_configuration_keys =
     "max_single_part_upload_size",
     "max_connections",
     "expiration_window_seconds",
-    "no_sign_request"
+    "no_sign_request",
+    "partitioning_style",

@Enmk (Member) commented Apr 2, 2025:

I was wondering whether "partition_strategy" might be a better name; then I saw that you have a class PartitionStrategy that represents the thing controlled by this setting, and now I am certain.

ContextPtr context;
};

struct PartitionStrategyProvider

@Enmk (Member) commented Apr 2, 2025:

I would prefer the name PartitionStrategyFactory here. The nuances of the name are debatable, but IMO "provider" is:

  1. more vague, while this particular class just builds an instance based on input arguments.
  2. most often used in the CH codebase with a different meaning: a simpler interface to an underlying complex system (metrics provider, columns in dictionaries), or access to some pre-existing, pre-configured component managed elsewhere.

Also, since it returns std::shared_ptr, it is not entirely clear whether that is a pre-existing object (that already has some state) implicitly shared between callers, or a new one (which the caller may freely modify without fear of side effects elsewhere).

/*
* `write_partition_columns_into_files`
*/
const auto chunk = getPartitionStrategy()->getChunkWithoutPartitionColumnsIfNeeded(input_chunk);

Member:

why not just partition_strategy here?

Suggested change:
- const auto chunk = getPartitionStrategy()->getChunkWithoutPartitionColumnsIfNeeded(input_chunk);
+ const auto chunk = partition_strategy->getChunkWithoutPartitionColumnsIfNeeded(input_chunk);

namespace DB
{

struct PartitionStrategy

Member:

is PartitionStrategy usable with non-object-storage storage?

}

namespace StorageObjectStorageSetting
{
extern const StorageObjectStorageSettingsBool allow_dynamic_metadata_for_data_lakes;
}

namespace
{
void sanityCheckPartitioningConfiguration(

Member:

IMO this whole function (and getPartitionByAst) can be moved inside PartitionStrategyProvider, since those only validate/prepare parameters for PartitionStrategyProvider::get.

{
if (!table_level_partition_by && !query_partition_by)
{
// do we want to assert that `partitioning_style` is not set to something different style AND

Collaborator Author:

sounds like we want it

setVirtuals(VirtualColumnUtils::getVirtualsForFileLikeStorage(metadata.columns, context_, sample_path));
setInMemoryMetadata(metadata);
{
std::string path;

if (!prefix.empty())

Collaborator Author:

I should probably add some logic here to prevent double slashes

Collaborator Author:

This is not a problem in my tests because the AWS S3 SDK splits the key into several strings using the slash delimiter, but we should be careful.
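
One possible way to guard against double slashes is to normalize both sides before joining them. The sketch below is a minimal, self-contained illustration; `joinKey` is a hypothetical helper name and is not part of this PR or of the AWS SDK.

```cpp
#include <string>
#include <string_view>

// Hypothetical helper: joins a key prefix and a suffix with exactly one '/'.
// Trailing slashes of the prefix and leading slashes of the suffix are dropped first.
std::string joinKey(std::string_view prefix, std::string_view suffix)
{
    while (!prefix.empty() && prefix.back() == '/')
        prefix.remove_suffix(1);
    while (!suffix.empty() && suffix.front() == '/')
        suffix.remove_prefix(1);

    // If either side is empty after trimming, no separator is needed.
    if (prefix.empty())
        return std::string(suffix);
    if (suffix.empty())
        return std::string(prefix);

    std::string path;
    path.reserve(prefix.size() + 1 + suffix.size());
    path.append(prefix);
    path.push_back('/');
    path.append(suffix);
    return path;
}
```

For example, `joinKey("data/", "/year=2025/file.parquet")` yields `data/year=2025/file.parquet`. Slashes inside either argument are left untouched, so this only protects the junction between the two parts.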

@arthurpassos (Collaborator Author) commented:

For some odd reason linking is failing on some builds:

cmake/libabsl_bad_variant_access.a  contrib/abseil-cpp-cmake/libabsl_raw_logging_internal.a  contrib/abseil-cpp-cmake/libabsl_log_severity.a  rust/workspace/lib_ch_rust_skim_rust.a  contrib/unixodbc-cmake/lib_ltdl.a  -Wl,--start-group  contrib/libcxx-cmake/libcxx.a  contrib/libcxxabi-cmake/libcxxabi.a  contrib/libunwind-cmake/libunwind.a  base/glibc-compatibility/libglibc-compatibility.a  base/glibc-compatibility/memcpy/libmemcpy.a  -Wl,--end-group  -nodefaultlibs /usr/lib/llvm-19/lib/clang/19/lib/linux/libclang_rt.builtins-x86_64.a   -lc -lm -lrt -lpthread -ldl && :
Apr 03 22:12:37 ld.lld-19: error: undefined symbol: DB::generateSnowflakeID()
Apr 03 22:12:37 >>> referenced by PartitionStrategy.cpp:222 (./build_docker/./src/Storages/PartitionStrategy.cpp:222)
Apr 03 22:12:37 >>>               PartitionStrategy.cpp.o:(DB::HiveStylePartitionStrategy::getPath(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&)) in archive src/libdbms.a
Apr 03 22:12:37 clang++-19: error: linker command failed with exit code 1 (use -v to see invocation)
Apr 03 22:12:38 [13326/13732] Generating StorageSystemLicenses.generated.cpp
Apr 03 22:38:46 [13327/13732] Building CXX object src/AggregateFunctions/CMakeFiles/clickhouse_aggregate_functions.dir/AggregateFunctionAvgWeighted.cpp.o
Apr 03 22:38:46 sccache: warning: The server looks like it shut down unexpectedly, compiling locally instead
Apr 03 22:38:46 ninja: build stopped: subcommand failed.

// create table s3_table engine=s3('{_partition_id}'); -- partition id wildcard set, but no partition expression
// create table s3_table engine=s3(partition_strategy='hive'); -- partition strategy set, but no partition expression
if (partition_by)
{

Collaborator Author:

This is somewhat problematic.

PartitionStrategyFactory needs a sample block, so it needs to be after the call to VirtualColumnUtils::getVirtualsForFileLikeStorage(metadata.columns, context, sample_path, format_settings) since it might alter the number of columns. This function needs a sample_path, that is resolved by getPathSample.

On the other hand, PartitionStrategyFactory performs some bucket and key validation, which should happen before getPathSample.

@arthurpassos (Collaborator Author) commented:

Tests are broken because we are missing some upstream fixes. The PR below contains the list of changes that I believe fix those issues:

ClickHouse#71636

Once we have 25.2, it'll be easier.

@MyroTk added and then removed the label antalya-25.2.2 (Planned for 25.2.2 release) on Apr 7, 2025
5 participants