[feature](datalake) Add BucketShuffleJoin support for bucketed hive tables #27784

Open: Nitin-Kashyap wants to merge 3 commits into master from feature-hiveBucketShuffle

@Nitin-Kashyap (Contributor) commented Nov 29, 2023

Add BucketShuffleJoin support for bucketed hive tables generated by Spark. (27783)

Proposed changes

Issue Number: close #27783

1. Updated the original planner to consider BucketShuffle for bucketed Hive tables.
2. Updated the Nereids planner to support BucketShuffle join on Hive tables.
3. Added Spark-style hash calculation in BE for shuffling one side of the join.

### Sample Output:
NereidsPlanner
OldPlanner

@github-actions github-actions bot left a comment

clang-tidy made some suggestions

be/src/vec/columns/column_decimal.cpp (outdated)
be/src/vec/columns/column_map.cpp
be/src/vec/columns/column_string.cpp (outdated)

clang-tidy review says "All clean, LGTM! 👍"

@Nitin-Kashyap Nitin-Kashyap force-pushed the feature-hiveBucketShuffle branch from b4464d4 to f9e42ab Compare November 30, 2023 04:48

@github-actions github-actions bot left a comment

clang-tidy made some suggestions

be/src/vec/columns/column_vector.cpp (outdated)
@Nitin-Kashyap Nitin-Kashyap force-pushed the feature-hiveBucketShuffle branch 2 times, most recently from ed212e1 to eaf29b0 Compare November 30, 2023 05:47

clang-tidy review says "All clean, LGTM! 👍"

@morningman morningman self-assigned this Nov 30, 2023
@morningman (Contributor)

Hi @Nitin-Kashyap, thanks for your contribution.
Could you please provide the CREATE TABLE statements used for the Hive table on the Spark side,
so that we can test this case?

@morningman (Contributor)

BTW, does this only work for Hive bucketed tables created by Spark?
What if the Hive table was created by another system with a different hash function?

@Nitin-Kashyap (Contributor, Author) commented Dec 1, 2023

Hi @Nitin-Kashyap , thanks for your contribution. Could you please provide some create table stmt of hive table on spark side, so that we can test this case?

@morningman Please find below the sample test I used for this case:

CREATE TABLE parquet_test (
     user_id   INT,
     key       VARCHAR(20),
     part      VARCHAR(10)
)
USING parquet
PARTITIONED BY (part)
CLUSTERED BY (user_id) INTO 3 BUCKETS;

INSERT INTO parquet_test2 VALUES (31, 'U31', 'IN'),  (11,'U11','IN'), (21, 'U21', 'IN');

@Nitin-Kashyap (Contributor, Author) commented Dec 1, 2023

BTW, is it only suitable for "spark created" hive bucket table? What if the hive table is created by other system with different hash function?

@morningman Yes, in the current scope it only understands bucketed tables created by Spark. It identifies them by the table properties that Spark defines for the bucket specification.
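
To illustrate the detection idea described above: the thread does not show which property keys the FE actually checks, so the keys below are an assumption based on how Spark is generally understood to record a bucket spec in the Hive metastore table properties. The helper name `spark_bucket_spec` is hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Assumed property keys written by Spark for datasource tables.
# They are not confirmed by this PR thread.
NUM_BUCKETS_KEY = "spark.sql.sources.schema.numBuckets"
NUM_BUCKET_COLS_KEY = "spark.sql.sources.schema.numBucketCols"
BUCKET_COL_PREFIX = "spark.sql.sources.schema.bucketCol."


def spark_bucket_spec(props: Dict[str, str]) -> Optional[Tuple[int, List[str]]]:
    """Return (num_buckets, bucket_columns) if props describe a Spark bucket spec."""
    if NUM_BUCKETS_KEY not in props:
        return None  # not a Spark-bucketed table under these assumptions
    try:
        num_buckets = int(props[NUM_BUCKETS_KEY])
        num_cols = int(props.get(NUM_BUCKET_COLS_KEY, "0"))
    except ValueError:
        return None  # malformed values: treat the table as not bucketed

    # Bucket columns are stored one per key: bucketCol.0, bucketCol.1, ...
    cols = [props.get(f"{BUCKET_COL_PREFIX}{i}") for i in range(num_cols)]
    if num_buckets <= 0 or not cols or any(c is None for c in cols):
        return None
    return num_buckets, [c for c in cols if c is not None]
```

A planner could fall back to an ordinary shuffle join whenever such a helper returns `None`, which would match the stated scope of handling only Spark-created bucketed tables.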

I plan to add support for Hive and Hudi as well at some point (hopefully in the next PR). For this I have left a placeholder in THashType (HIVE_MOD: Hive and Hudi use the same hash method). For Hudi, however, some more changes are needed on the FE side to identify the bucket id from the file path.

@Nitin-Kashyap Nitin-Kashyap force-pushed the feature-hiveBucketShuffle branch from eaf29b0 to 34c701c Compare December 2, 2023 12:19

github-actions bot commented Dec 2, 2023

clang-tidy review says "All clean, LGTM! 👍"

@Nitin-Kashyap Nitin-Kashyap force-pushed the feature-hiveBucketShuffle branch from 34c701c to d25350a Compare December 4, 2023 05:05

github-actions bot commented Dec 4, 2023

clang-tidy review says "All clean, LGTM! 👍"

be/src/vec/utils/util.hpp (outdated)

github-actions bot commented Dec 4, 2023

clang-tidy review says "All clean, LGTM! 👍"

@Nitin-Kashyap Nitin-Kashyap force-pushed the feature-hiveBucketShuffle branch 2 times, most recently from c76ddc5 to 40431d1 Compare December 6, 2024 05:51
@Nitin-Kashyap Nitin-Kashyap force-pushed the feature-hiveBucketShuffle branch from 40431d1 to 843b9af Compare December 13, 2024 05:05

clang-tidy review says "All clean, LGTM! 👍"

@Nitin-Kashyap Nitin-Kashyap force-pushed the feature-hiveBucketShuffle branch 2 times, most recently from 73123f0 to 4091dd6 Compare December 13, 2024 05:15

clang-tidy review says "All clean, LGTM! 👍"

@Nitin-Kashyap Nitin-Kashyap force-pushed the feature-hiveBucketShuffle branch from 4091dd6 to 5e0d2a0 Compare December 13, 2024 05:21

clang-tidy review says "All clean, LGTM! 👍"

@Nitin-Kashyap Nitin-Kashyap force-pushed the feature-hiveBucketShuffle branch 2 times, most recently from b53b7e0 to 5c27041 Compare December 13, 2024 05:48

clang-tidy review says "All clean, LGTM! 👍"

@Nitin-Kashyap (Contributor, Author)

run buildall

@Nitin-Kashyap Nitin-Kashyap force-pushed the feature-hiveBucketShuffle branch from 5c27041 to 4a57ca3 Compare December 13, 2024 06:12

clang-tidy review says "All clean, LGTM! 👍"

@Nitin-Kashyap Nitin-Kashyap force-pushed the feature-hiveBucketShuffle branch from 4a57ca3 to 471a7c5 Compare December 13, 2024 06:37

clang-tidy review says "All clean, LGTM! 👍"

@Nitin-Kashyap Nitin-Kashyap force-pushed the feature-hiveBucketShuffle branch from 471a7c5 to 714534c Compare December 13, 2024 08:29

clang-tidy review says "All clean, LGTM! 👍"

@Nitin-Kashyap Nitin-Kashyap force-pushed the feature-hiveBucketShuffle branch from 714534c to 10db37d Compare December 13, 2024 08:39

clang-tidy review says "All clean, LGTM! 👍"

@Nitin-Kashyap Nitin-Kashyap force-pushed the feature-hiveBucketShuffle branch from 10db37d to 7784db9 Compare December 13, 2024 08:52

clang-tidy review says "All clean, LGTM! 👍"

Nitin-Kashyap and others added 3 commits December 31, 2024 19:41
… generated by Spark. (27783)

    1. Original planner updated to consider BucketShuffle for bucketed hive table
    2. Nereids planner updated for bucketShuffle join on hive tables.
    3. Added spark style hash calculation in BE for shuffle on one side.
    4. Added shuffle hash selection based on left(non-shuffling) side.
@Nitin-Kashyap Nitin-Kashyap force-pushed the feature-hiveBucketShuffle branch from 7784db9 to 3780f23 Compare December 31, 2024 14:53
@924060929 (Contributor)

You should support enable_fallback_to_original_planner=true on the master branch.


Successfully merging this pull request may close these issues.

[Feature] Enable BucketShuffle Join for Hive tables
7 participants