feat: add NestedLoopJoin rel #188

Merged (11 commits) on Nov 3, 2023

Conversation

danepitkin
Contributor

Requires substrait-io/substrait#561 to be merged and released.

@danepitkin danepitkin force-pushed the danepitkin/nestedloopjoin branch 2 times, most recently from 4aad77f to 90d7b66 Compare October 10, 2023 23:47
@danepitkin danepitkin marked this pull request as draft October 10, 2023 23:48
@vbarua
Member

vbarua commented Oct 20, 2023

substrait-io/substrait#561 has been merged. After it's released over the weekend, we can update the submodule to point to it (preferably as its own PR).

@vbarua
Member

vbarua commented Oct 23, 2023

I went ahead and updated substrait-java with the latest version of substrait, which includes the NestedLoopJoin changes.

@danepitkin danepitkin force-pushed the danepitkin/nestedloopjoin branch from 560136d to ef9c234 Compare October 25, 2023 19:46
@danepitkin danepitkin marked this pull request as ready for review October 26, 2023 02:23
Contributor Author

@danepitkin danepitkin left a comment

Thanks for updating to the latest version of Substrait! I reviewed the HashJoin PR and tried to implement everything that is applicable for NLJ.

Comment on lines 546 to 552
.condition(
    // defaults to true if the join expression is unassigned, resulting in a cartesian
    // join
    Optional.of(
        rel.hasExpression()
            ? converter.from(rel.getExpression())
            : Expression.BoolLiteral.builder().value(true).build()))
Contributor Author

One thing to note for NLJ: the spec says the join expression is optional and defaults to true, which results in a cartesian join. I've added code here to apply that default.

@vbarua vbarua self-requested a review October 30, 2023 15:25
Comment on lines 46 to 49
Rel roundTrip(Rel rel) {
  io.substrait.proto.Rel protoRel = relProtoConverter.toProto(rel);
  Rel relReturned = protoRelConverter.from(protoRel);
  assertEquals(rel, relReturned);
  return protoRelConverter.from(protoRel);
}
Contributor

Looks better. 👍

Contributor Author

Thanks! I split up roundtrip functionality from verification-of-roundtrip so I could test inequality for NLJ more easily.
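
To illustrate the split described in this comment, here is a minimal sketch in plain Java. The class `RoundTripSketch`, the method `verifyRoundTrip`, and the integer/string codec are hypothetical stand-ins chosen for the example; they are not the substrait-java test helpers, which operate on `Rel` and its protobuf form.

```java
import java.util.Objects;
import java.util.function.Function;

/** Hypothetical stand-in for the test helper; not the substrait-java class. */
final class RoundTripSketch {

  /** Encodes then decodes the value and returns the result without asserting anything. */
  static <T, P> T roundTrip(T value, Function<T, P> toProto, Function<P, T> fromProto) {
    return fromProto.apply(toProto.apply(value));
  }

  /** Verification is a separate step layered on top of roundTrip. */
  static <T, P> void verifyRoundTrip(T value, Function<T, P> toProto, Function<P, T> fromProto) {
    T returned = roundTrip(value, toProto, fromProto);
    if (!Objects.equals(value, returned)) {
      throw new AssertionError("round trip changed value: " + value + " -> " + returned);
    }
  }

  public static void main(String[] args) {
    Function<Integer, String> encode = i -> Integer.toString(i);
    Function<Integer, String> lossyEncode = i -> Integer.toString(Math.abs(i));
    Function<String, Integer> decode = Integer::parseInt;

    // Lossless codec: verification passes silently.
    verifyRoundTrip(42, encode, decode);

    // Lossy codec (drops the sign): the caller can assert inequality directly.
    int returned = roundTrip(-7, lossyEncode, decode);
    System.out.println(returned != -7); // prints "true"
  }
}
```

Keeping the assertion out of `roundTrip` means a test that expects the result to differ from the input can call it directly without tripping an equality check.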

@Value.Immutable
public abstract class NestedLoopJoin extends BiRel implements HasExtension {

public abstract Optional<Expression> getCondition();
Member

What do you think about making this

public abstract Expression getCondition();

The spec indicates that

  // optional, defaults to true (a cartesian join)
  Expression expression = 4;

but we can be stricter within the POJO layer and require it. You're already inserting a boolean literal condition if the protobuf doesn't contain one: https://github.com/substrait-io/substrait-java/pull/188/files#r1372502462

Contributor Author

Sounds good to me! I wonder if the spec should not have made it optional here? https://substrait.io/relations/physical_relations/#nlj-operator

Contributor Author

I can go ahead and change the spec, since I just added it recently.

Contributor Author

@danepitkin danepitkin Nov 1, 2023

Hmm, it's worth noting that JoinRel uses public abstract Optional<Expression> getCondition(); even though its condition is required according to the spec. Maybe it is best to implement NLJ similarly? I'll defer to your choice!

Contributor Author

I made the change in the latest commit if it is the preferred choice!

Member

The JoinRel in the protobuf spec doesn't have any information about whether condition is required or not:
https://github.com/substrait-io/substrait/blob/b3071bc9cd77cf916568641c83056a285f8123be/proto/substrait/algebra.proto#L156-L160

I'm guessing that's why it's Optional currently. I think it would make sense to make the JoinRel condition be required as well. That can be a separate change though.

NestedLoopJoin.builder()
    .left(left)
    .right(right)
    .condition(converter.from(rel.getExpression()))
Member

I think we still want your original code that checks whether the expression is set and, if it is not, sets the condition to True. The expression field in the protobuf isn't marked as required, so we can't assume it's present when reading the protos in.

Member

How I'm thinking about this is that this is effectively required, because when it is null it should be treated as True. But we can handle that behaviour at our serialisation boundary and have it be properly required internally, which makes it easier to work with within our code.
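
The approach described here (optional on the wire, required internally) can be sketched in plain Java. Everything in this block is an assumption made for illustration: `ProtoJoin`, `Join`, and `fromProto` are hypothetical names, and expressions are modelled as strings rather than substrait-java `Expression` objects.

```java
import java.util.Objects;

/** Hypothetical stand-ins for the protobuf message and the POJO; not substrait-java types. */
final class BoundaryDefaultSketch {
  static final String TRUE_LITERAL = "true";

  /** Wire-level message: the join expression may be unset. */
  static final class ProtoJoin {
    private final String expression; // null models an unset protobuf field

    ProtoJoin(String expression) {
      this.expression = expression;
    }

    boolean hasExpression() {
      return expression != null;
    }

    String getExpression() {
      return expression;
    }
  }

  /** Internal model: the condition is always present. */
  static final class Join {
    private final String condition;

    Join(String condition) {
      this.condition = Objects.requireNonNull(condition, "condition");
    }

    String getCondition() {
      return condition;
    }
  }

  /** Deserialisation boundary: an unset expression becomes the literal true. */
  static Join fromProto(ProtoJoin proto) {
    String condition = proto.hasExpression() ? proto.getExpression() : TRUE_LITERAL;
    return new Join(condition);
  }

  public static void main(String[] args) {
    System.out.println(fromProto(new ProtoJoin(null)).getCondition()); // prints "true"
    System.out.println(fromProto(new ProtoJoin("a = f")).getCondition()); // prints "a = f"
  }
}
```

With the default applied once at the boundary, code that consumes `Join` never has to handle a missing condition.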

Contributor Author

Thanks for the explanation! +1

NestedLoopJoin.builder()
    .from(
        b.nestedLoopJoin(
            __ -> b.bool(true), NestedLoopJoin.JoinType.INNER, leftTable, rightTable))
Member

The test in ExtensionRoundtripTest.java uses a bool true condition as well. Could you use a more interesting condition here, like a key equality comparison between the two tables (i.e. leftTable.a = rightTable.f), so that we have a test that verifies that the condition expressions are being converted correctly?

Contributor Author

Good point! Will update.

Member

@vbarua vbarua left a comment

Thanks for the updates. Left some small comments.

@danepitkin danepitkin force-pushed the danepitkin/nestedloopjoin branch from fa3aa86 to 47aa9d8 Compare November 3, 2023 15:54
@danepitkin
Contributor Author

danepitkin commented Nov 3, 2023

I can repro the failure locally. I'm still digging into it, but it seems the test case is surfacing an actual bug.

The issue is in the last param in this equal comparator output:

condition=ScalarFunctionInvocation{declaration=equal:any_any, arguments=[FieldReference{segments=[StructField{offset=0}], type=I64{nullable=false}}, FieldReference{segments=[StructField{offset=2}], type=I64{nullable=false}}]
condition=ScalarFunctionInvocation{declaration=equal:any_any, arguments=[FieldReference{segments=[StructField{offset=0}], type=I64{nullable=false}}, FieldReference{segments=[StructField{offset=2}], type=Str{nullable=false}}]

NestedLoopJoin.builder()
    .from(
        b.nestedLoopJoin(
            __ -> b.equal(b.fieldReference(leftTable, 0), b.fieldReference(rightTable, 2)),
Member

@vbarua vbarua Nov 3, 2023

Ah! This is subtle and weird (and something to improve in the builder actually).

The condition on the join is evaluated relative to the join's record shape, which is the left table's record followed by the right table's record. It would look like:

 0, 1, 2, 3, 4, 5
(a, b, c, d, e, f)

b.fieldReference(rightTable, 2) here ends up referring to c
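
A small sketch of the offset arithmetic, assuming (as the layout above suggests) that the left table has columns a, b, c and the right table has columns d, e, f. The class and method names are hypothetical and the tables are modelled as plain lists of column names, not substrait-java relations.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of join record offsets; not substrait-java code. */
final class JoinOffsetSketch {

  /** Columns of the join's record: the left columns followed by the right columns. */
  static List<String> joinRecord(List<String> left, List<String> right) {
    List<String> joined = new ArrayList<>(left);
    joined.addAll(right);
    return joined;
  }

  /** Offset of a right-side column within the join record. */
  static int rightOffset(int leftWidth, int rightIndex) {
    return leftWidth + rightIndex;
  }

  public static void main(String[] args) {
    List<String> left = Arrays.asList("a", "b", "c");
    List<String> right = Arrays.asList("d", "e", "f");
    List<String> joined = joinRecord(left, right);

    // Using the right table's local index directly lands on a left column.
    System.out.println(joined.get(2)); // prints "c"

    // Shifting by the left table's width reaches the intended column.
    System.out.println(joined.get(rightOffset(left.size(), 2))); // prints "f"
  }
}
```

Under these assumptions the comparison leftTable.a = rightTable.f corresponds to offsets 0 and 5 in the join record.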

Contributor Author

Ah that makes sense! Thank you

Member

I'm realising there's nothing else in the code that uses this, which is why it hasn't come up before.

Contributor Author

@danepitkin danepitkin Nov 3, 2023

For the test case, should I create a unioned named table to refer to? Index 5 is out of bounds when creating the Rel, but in bounds after the roundtrip.

Contributor Author

Hmm, we have newInputRelReference(int index, List<Rel> rels), which might handle this. I'll dig into this more myself.

  return Arrays.stream(indexes)
      .mapToObj(index -> fieldReference(inputs, index))
      .collect(java.util.stream.Collectors.toList());
}
Member

Good call on these, they will be quite helpful.
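
As a rough illustration of what a multi-input field lookup involves, the sketch below resolves a flat index against the widths of several inputs and maps a varargs list of indexes the same way the fragment above does. This is one possible approach, not the builder's actual implementation: `InputIndexSketch`, `resolve`, and `resolveAll` are hypothetical names, and inputs are reduced to their column counts.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical sketch of resolving flat indexes across several inputs. */
final class InputIndexSketch {

  /** Returns {inputPosition, localIndex} for a flat index over the concatenated inputs. */
  static int[] resolve(List<Integer> inputWidths, int index) {
    if (index < 0) {
      throw new IndexOutOfBoundsException("negative index: " + index);
    }
    int remaining = index;
    for (int i = 0; i < inputWidths.size(); i++) {
      int width = inputWidths.get(i);
      if (remaining < width) {
        return new int[] {i, remaining};
      }
      remaining -= width; // skip past this input's columns
    }
    throw new IndexOutOfBoundsException("index " + index + " exceeds the combined width");
  }

  /** Varargs convenience that resolves each index in turn. */
  static List<int[]> resolveAll(List<Integer> inputWidths, int... indexes) {
    return Arrays.stream(indexes)
        .mapToObj(index -> resolve(inputWidths, index))
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // Two inputs with three columns each, matching the (a..f) layout discussed earlier.
    List<int[]> resolved = resolveAll(Arrays.asList(3, 3), 0, 5);
    System.out.println(Arrays.toString(resolved.get(0))); // prints "[0, 0]"
    System.out.println(Arrays.toString(resolved.get(1))); // prints "[1, 2]"
  }
}
```

Resolving against all inputs at once is what makes index 5 valid here, whereas it would be out of bounds against either three-column table on its own.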

Member

@vbarua vbarua left a comment

Looks good! Thanks for adding this.

@danepitkin
Contributor Author

Thank you so much for the help!

@vbarua vbarua merged commit b66d5b1 into substrait-io:main Nov 3, 2023
7 checks passed
@danepitkin danepitkin mentioned this pull request Nov 3, 2023
ajegou pushed a commit to ajegou/substrait-java that referenced this pull request Mar 29, 2024
* feat: more builder support for field references