@soumyakanti3578
Contributor

…tion column

What changes were proposed in this pull request?

In INNER joins, when we project only the small table's joining key, we can run into a situation where the hashmap's value is empty. Then, if we serialize the empty value, we get NULLs. Instead, we should just copy the key into the vectorized batch.
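The situation described above can be illustrated with a toy model. This is only a sketch under stated assumptions: it does not use Hive classes, and `KeyOnlyProjectionSketch`, `fromStoredValue`, and `project` are hypothetical names invented for illustration. It assumes the projected columns are exactly the join key columns, so that in an inner join the matching big-table key carries the same values.

```java
// Toy model of the behavior described above. None of these names are Hive APIs.
final class KeyOnlyProjectionSketch {

  // Naive path: fill the output columns from the stored small-table value.
  // An empty stored value leaves every output column null.
  static Long[] fromStoredValue(long[] storedValue, int outputColumns) {
    Long[] out = new Long[outputColumns];
    for (int i = 0; i < Math.min(storedValue.length, outputColumns); i++) {
      out[i] = storedValue[i];
    }
    return out;
  }

  // Proposed path (assumption: output columns are exactly the key columns):
  // when the stored value is empty, copy the matching key instead.
  static Long[] project(long[] bigTableKey, long[] storedValue, int outputColumns) {
    boolean valueIsEmpty = storedValue.length == 0;
    if (valueIsEmpty && outputColumns == bigTableKey.length) {
      Long[] out = new Long[outputColumns];
      for (int i = 0; i < outputColumns; i++) {
        out[i] = bigTableKey[i];
      }
      return out;
    }
    return fromStoredValue(storedValue, outputColumns);
  }
}
```

Calling `fromStoredValue(new long[0], 1)` leaves the single output column null, while `project(new long[] {42L}, new long[0], 1)` fills it with the key.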

Why are the changes needed?

Explained in detail: https://issues.apache.org/jira/browse/HIVE-26653

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added unit tests in TestMapJoinOperator.java. The original issue is not reproducible anymore because of an unrelated patch, as explained in the Jira.

Member

@zabetak zabetak left a comment

I have some more high level questions/comments regarding the bug/solution but will post those under the JIRA ticket.

Comment on lines +312 to +313
smallTableValueRow[c] =
VectorizedBatchUtil.getPrimitiveWritable(primitiveTypeInfos[c].getPrimitiveCategory());
Member

Which real use-case is this code trying to simulate? Is this equivalent to having null values in the data, or something else? Basically, I am trying to understand under what circumstances we have these values in the table.

From a quick look, it seems that we are using special values (e.g., new Text(ArrayUtils.EMPTY_BYTE_ARRAY)) but not really nulls. If that's the case, then I don't see why we need to handle this separately from VectorRandomRowSource.randomWritablePrimitiveRow. Wouldn't it make more sense to tune the random generator to occasionally generate these "special" values if they can really appear in practice?
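To make the suggestion concrete, here is a minimal sketch of a generator that occasionally emits an empty value. It is not based on VectorRandomRowSource; the class `OccasionalEmptyGenerator` and the `emptyProbability` parameter are hypothetical and only show the general idea.

```java
import java.util.Random;

// Hypothetical generator, not part of Hive's test utilities.
final class OccasionalEmptyGenerator {
  private final Random random;
  private final double emptyProbability;

  OccasionalEmptyGenerator(long seed, double emptyProbability) {
    this.random = new Random(seed);
    this.emptyProbability = emptyProbability;
  }

  // Returns an empty array with the configured probability,
  // otherwise a random array of 1..maxLength bytes (maxLength must be >= 1).
  byte[] nextBytes(int maxLength) {
    if (random.nextDouble() < emptyProbability) {
      return new byte[0];
    }
    byte[] bytes = new byte[1 + random.nextInt(maxLength)];
    random.nextBytes(bytes);
    return bytes;
  }
}
```

With a fixed seed the sequence is reproducible, so a failure caused by an empty value could be replayed.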

  ONLY_ONE,
- NO_REGULAR_SMALL_KEYS
+ NO_REGULAR_SMALL_KEYS,
+ EMPTY_VALUE, // Generate empty value entries.
Member

By going over the code, I get the impression that the ValueOption enumeration is about the generation of the keys of the small table not the values. Mixing the two creates confusion and makes the code harder to understand.
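One possible way to keep the two concerns apart is sketched below. Everything here is hypothetical: `KeyGeneration`, `ValueGeneration`, and `SmallTableParameters` are illustrative names, not the existing test classes, and only the two constants visible in the diff are reused.

```java
// Hypothetical split of key-related and value-related generation options.
final class GenerationOptionsSketch {

  // How small-table keys are generated (constants taken from the diff above).
  enum KeyGeneration { ONLY_ONE, NO_REGULAR_SMALL_KEYS }

  // How small-table values are generated (assumed, for illustration only).
  enum ValueGeneration { REGULAR, EMPTY }

  static final class SmallTableParameters {
    final KeyGeneration keyGeneration;
    final ValueGeneration valueGeneration;

    SmallTableParameters(KeyGeneration keyGeneration, ValueGeneration valueGeneration) {
      this.keyGeneration = keyGeneration;
      this.valueGeneration = valueGeneration;
    }

    boolean generatesEmptyValues() {
      return valueGeneration == ValueGeneration.EMPTY;
    }
  }
}
```

With two independent settings, an empty-value scenario can be combined with any key-generation mode instead of replacing it.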

Comment on lines +1612 to +1615
final boolean isEmptyValue =
testDesc.smallTableGenerationParameters.getValueOption() == ValueOption.EMPTY_VALUE &&
testDesc.smallTableRetainValueColumnNums.length > 0 &&
testDesc.smallTableRetainValueColumnNums.length == testDesc.bigTableKeyColumnNums.length;
Member

This simulation is problematic because it makes the solution and the test code somewhat identical. We're implementing the copy logic in two places (prod and test), so the tests trivially pass as things stand and will immediately fail if the implementation changes in the future.

* @throws Exception Exception
*/
@Test
public void testSmallTableKeyOnlyProjectionWithEmptyValueString() throws Exception {
Member

@zabetak zabetak Nov 21, 2025

Adding tests in this class is useful, but it may not be the best option in every situation. These tests depend on random generation of input/output; they are good for covering the general behavior of the join operators, but for edge cases and very specific bugs, fixed input, output, and join configuration would be much easier to reason about.

For showcasing the bug in this PR (if there is one), it would really help to have a dedicated test case possibly in another class and have well-defined and minimal input/output and join settings. Then we can discuss if we also need these randomized tests. The bug implies a problem in a binary join operator so we should be able to demonstrate the issue by correctly picking the schema/data for the left and right side of the join having a few rows on each side.
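The shape of such a fixed-input test might look like the sketch below. It is a toy inner join over in-memory collections, not Hive's operator or test harness; `FixedInputJoinSketch` and `innerJoinKeyOnly` are hypothetical names. The small table is modeled as a map from key to a possibly empty value, and only the keys from both sides are projected.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Toy inner join with fixed, minimal input. Not Hive code.
final class FixedInputJoinSketch {

  // Emits "bigKey,smallKey" for every big-table row with a match in the small table.
  // The small-table value is deliberately ignored: only keys are projected.
  static List<String> innerJoinKeyOnly(List<String> bigTableKeys, Map<String, byte[]> smallTable) {
    List<String> output = new ArrayList<>();
    for (String key : bigTableKeys) {
      if (smallTable.containsKey(key)) {
        output.add(key + "," + key);
      }
    }
    return output;
  }
}
```

A test would then join big-table keys `a`, `b`, `c` against a small table holding `a` and `c` with empty values, and compare the result to the literal rows `a,a` and `c,c`. Because the expected rows are written out by hand, the test does not repeat the production copy logic.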

throws HiveException {

// Check if the small table value is empty.
boolean isSmallTableValueEmpty = byteSegmentRef.getLength() == 0;
Member

The fact that we need to check the actual data in order to decide how to evaluate the join (or rather the creation of the resulting row) is somewhat suspicious and a bit brittle. Ideally, the compiler should be able to determine exactly how the operator should behave via the query plan. Can we exploit (or add) information in the query plan in order to drive the copy decision below?
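As a rough illustration of driving the decision from the plan rather than from the data, consider the sketch below. It is purely hypothetical: `JoinPlanInfo`, `smallTableOutputIsKeyOnly`, and `chooseOutputPath` do not exist in Hive, and the sketch assumes that small-table column indexes below the key count refer to key columns.

```java
// Hypothetical plan-time descriptor. Not an existing Hive class.
final class PlanDrivenCopySketch {

  static final class JoinPlanInfo {
    final boolean smallTableOutputIsKeyOnly;

    // Assumption: column indexes below smallTableKeyCount are key columns.
    JoinPlanInfo(int[] retainedSmallTableColumns, int smallTableKeyCount) {
      boolean keyOnly = retainedSmallTableColumns.length > 0;
      for (int column : retainedSmallTableColumns) {
        if (column >= smallTableKeyCount) {
          keyOnly = false;
        }
      }
      this.smallTableOutputIsKeyOnly = keyOnly;
    }
  }

  // The operator reads the flag once and never inspects the stored value length.
  static String chooseOutputPath(JoinPlanInfo plan) {
    return plan.smallTableOutputIsKeyOnly ? "COPY_KEY" : "DESERIALIZE_VALUE";
  }
}
```

In this form the choice is fixed when the plan is built, so the same query always takes the same path regardless of what the hash table contains.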

Comment on lines +608 to +609
if (smallTableValueMapping.getCount() > 0 &&
smallTableValueMapping.getCount() == bigTableKeyColumnMap.length) {
Member

I don't understand the reasoning/intuition behind this check. Why do we care about values and keys being of the same length?
