HIVE-26653: Wrong results when (map) joining multiple tables on partition column #6165
base: master
Conversation
zabetak left a comment:
I have some more high level questions/comments regarding the bug/solution but will post those under the JIRA ticket.
    smallTableValueRow[c] =
        VectorizedBatchUtil.getPrimitiveWritable(primitiveTypeInfos[c].getPrimitiveCategory());
Which real use case is this code trying to simulate? Is it equivalent to having null values in the data, or something else? Basically, I am trying to understand under what circumstances we have these values in the table.
From a quick look, it seems that we are using special values (e.g., new Text(ArrayUtils.EMPTY_BYTE_ARRAY)) but not real nulls. If that's the case, then I don't see why we need to handle this separately from VectorRandomRowSource.randomWritablePrimitiveRow. Wouldn't it make more sense to tune the random generator to occasionally generate these "special" values if they can really appear in practice?
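For readers following along, the empty-value vs. null distinction the comment raises can be illustrated in plain Java without any Hadoop dependency; the class and method names below are hypothetical stand-ins, not Hive code:

```java
// Illustrative sketch: an empty byte payload (what new
// Text(ArrayUtils.EMPTY_BYTE_ARRAY) carries) is a present-but-empty value,
// which is not the same thing as a true SQL NULL (an absent value).
public class EmptyVsNull {

    // Classify a raw value the way a (de)serializer would see it.
    static String classify(byte[] value) {
        if (value == null) {
            return "NULL";   // absent value, i.e. a true SQL NULL
        }
        if (value.length == 0) {
            return "EMPTY";  // present but zero-length, e.g. ''
        }
        return "VALUE";      // ordinary non-empty payload
    }

    public static void main(String[] args) {
        System.out.println(classify(null));           // NULL
        System.out.println(classify(new byte[0]));    // EMPTY
        System.out.println(classify(new byte[] {1})); // VALUE
    }
}
```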
    ONLY_ONE,
    NO_REGULAR_SMALL_KEYS,
    EMPTY_VALUE, // Generate empty value entries.
By going over the code, I get the impression that the ValueOption enumeration is about the generation of the keys of the small table, not the values. Mixing the two creates confusion and makes the code harder to understand.
    final boolean isEmptyValue =
        testDesc.smallTableGenerationParameters.getValueOption() == ValueOption.EMPTY_VALUE &&
        testDesc.smallTableRetainValueColumnNums.length > 0 &&
        testDesc.smallTableRetainValueColumnNums.length == testDesc.bigTableKeyColumnNums.length;
This simulation is problematic because it makes the solution and the test code somewhat identical. We're implementing the copy logic in two places (prod & test), so the tests will trivially pass as things stand right now and immediately fail if the implementation changes in the future.
     * @throws Exception Exception
     */
    @Test
    public void testSmallTableKeyOnlyProjectionWithEmptyValueString() throws Exception {
Adding tests in this class is useful, but it may not be the best option in every situation. These tests depend on random generation of input/output, and they are good for covering the general behavior of the join operators, but for edge cases and very specific bugs, fixed input/output and a fixed join configuration would be much easier to reason about.
For showcasing the bug in this PR (if there is one), it would really help to have a dedicated test case, possibly in another class, with well-defined and minimal input/output and join settings. Then we can discuss whether we also need these randomized tests. The bug implies a problem in a binary join operator, so we should be able to demonstrate the issue by correctly picking the schema/data for the left and right side of the join, with a few rows on each side.
        throws HiveException {

    // Check if the small table value is empty.
    boolean isSmallTableValueEmpty = byteSegmentRef.getLength() == 0;
The fact that we need to check the actual data in order to decide how to evaluate the join (or rather the creation of the resulting row) is somewhat suspicious and a bit brittle. Ideally, the compiler should be able to determine exactly how the operator should behave via the query plan. Can we exploit (or add) information in the query plan in order to drive the copy decision below?
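For context, the runtime decision the diff introduces can be sketched in isolation. This is a simplified, self-contained sketch under assumed semantics; all type and method names are stand-ins, not the actual Hive classes:

```java
// Sketch of the copy decision under discussion: when the serialized
// small-table value is empty and only the small table's key is projected,
// copy the big-table key (equal by the inner-join condition) instead of
// deserializing the empty payload into NULLs.
public class SmallTableValueCopySketch {

    // Returns the projected small-table columns for one matched row.
    static long[] projectSmallTable(long[] bigTableKey, byte[] serializedValue) {
        if (serializedValue.length == 0) {
            // Empty value: the projected columns are exactly the join keys.
            return bigTableKey.clone();
        }
        return deserialize(serializedValue);
    }

    // Stub deserializer: one long column per 8 bytes (illustrative only).
    static long[] deserialize(byte[] bytes) {
        return new long[bytes.length / 8];
    }

    public static void main(String[] args) {
        long[] key = {7L, 42L};
        long[] projected = projectSmallTable(key, new byte[0]);
        System.out.println(projected[0] + "," + projected[1]); // 7,42
    }
}
```

The reviewer's point stands regardless of the sketch: the branch is driven by inspecting the data at runtime rather than by plan-time information.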
    if (smallTableValueMapping.getCount() > 0 &&
        smallTableValueMapping.getCount() == bigTableKeyColumnMap.length) {
I don't understand the reasoning/intuition behind this check. Why do we care about values and keys being of the same length?
What changes were proposed in this pull request?
In INNER joins, when we only project the small table's joining key, we can run into a situation where the hash map's value is empty. Then, if we serialize the empty value, we will get NULLs. Instead, we should just copy the key into the vectorized batch.
Why are the changes needed?
Explained in detail: https://issues.apache.org/jira/browse/HIVE-26653
Does this PR introduce any user-facing change?
No
How was this patch tested?
Added unit tests in TestMapJoinOperator.java. The original issue is not reproducible anymore because of an unrelated patch, as explained in the Jira.
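As a toy reproduction of the symptom described above (assumed semantics, plain Java, no Hive classes): deserializing an empty hash-map value yields a NULL column, while copying the join key yields the expected value.

```java
// Hypothetical illustration of the bug's symptom: an empty serialized value
// naively deserializes to null, whereas the fix copies the matched join key.
public class EmptyValueSymptom {

    // Naive path: read one nullable column from the serialized value;
    // an empty payload produces null (the wrong result in the bug report).
    static Integer naiveReadColumn(byte[] serializedValue) {
        return serializedValue.length == 0 ? null : (int) serializedValue[0];
    }

    public static void main(String[] args) {
        int joinKey = 5;                   // the matched key from the big table
        byte[] hashMapValue = new byte[0]; // only the key was projected

        Integer naive = naiveReadColumn(hashMapValue);       // null
        Integer fixed = (hashMapValue.length == 0)
            ? joinKey                                        // copy the key
            : naive;
        System.out.println("naive=" + naive + " fixed=" + fixed); // naive=null fixed=5
    }
}
```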