-
Notifications
You must be signed in to change notification settings - Fork 73
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add NestedLoopJoin rel #188
feat: add NestedLoopJoin rel #188
Conversation
4aad77f
to
90d7b66
Compare
substrait-io/substrait#561 has been merged. After it's released over the weekend, we can update the submodule to point to it (preferably as it's own PR). |
I went ahead and updated substrait-java with the latest version of substrait, which includes the NestedLoopJoin changes. |
560136d
to
ef9c234
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for updating to the latest version of Substrait! I reviewed the HashJoin PR and tried to implement everything that is applicable for NLJ.
.condition( | ||
// defaults to true if the join expression is unassigned, resulting in a cartesian | ||
// join | ||
Optional.of( | ||
rel.hasExpression() | ||
? converter.from(rel.getExpression()) | ||
: Expression.BoolLiteral.builder().value(true).build())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing to note for NLJ is that the spec says the Join Expression is optional, but defaults to True which becomes a cartesian join. I've added the code to default to True here.
Rel roundTrip(Rel rel) { | ||
io.substrait.proto.Rel protoRel = relProtoConverter.toProto(rel); | ||
Rel relReturned = protoRelConverter.from(protoRel); | ||
assertEquals(rel, relReturned); | ||
return protoRelConverter.from(protoRel); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks better. 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I split up roundtrip functionality from verification-of-roundtrip so I could test inequality for NLJ more easily.
core/src/test/java/io/substrait/type/proto/ExtensionRoundtripTest.java
Outdated
Show resolved
Hide resolved
@Value.Immutable | ||
public abstract class NestedLoopJoin extends BiRel implements HasExtension { | ||
|
||
public abstract Optional<Expression> getCondition(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about making this
public abstract Expression getCondition();
The spec indicates that
// optional, defaults to true (a cartesian join)
Expression expression = 4;
but we can be stricter within the POJO layer and require it. You're already inserting a boolean literal condition if the protobuf doesn't contain one https://github.com/substrait-io/substrait-java/pull/188/files#r1372502462
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me! I wonder if the spec should not have made it optional here? https://substrait.io/relations/physical_relations/#nlj-operator
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can go ahead and change the spec, since I just added it recently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, it's worth noting that JoinRel
uses public abstract Optional<Expression> getCondition();
, which is required according to the spec. Maybe it is best to implement NLJ similarly? I'll defer to your choice!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made the change in the latest commit if it is the preferred choice!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The JoinRel in the protobuf spec doesn't have any information about whether condition is required or not:
https://github.com/substrait-io/substrait/blob/b3071bc9cd77cf916568641c83056a285f8123be/proto/substrait/algebra.proto#L156-L160
I'm guessing that's why it's Optional currently. I think it would make sense to make the JoinRel condition be required as well. That can be a separate change though.
NestedLoopJoin.builder() | ||
.left(left) | ||
.right(right) | ||
.condition(converter.from(rel.getExpression())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we still want your original code that checks if the condition is null and then sets the condition to True if it is. The expression
field in the protobuf isn't marked as required, so we can't assume it's present when reading the protos in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How I'm thinking about this is that this is effectively required, because when it is null it should be treated as True. But we can handle that behaviour at our serialisation boundary and have it be properly required internally, which makes it easier to work with within our code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the explanation! +1
NestedLoopJoin.builder() | ||
.from( | ||
b.nestedLoopJoin( | ||
__ -> b.bool(true), NestedLoopJoin.JoinType.INNER, leftTable, rightTable)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test in ExtensionRoundtripTest.java use a bool true condition as well. Could you use a more interesting condition here like a key equality comparison between the two tables (ie. leftTable.a = rightTable.f
) so that we can have a test that verifies that the condition expressions are being converted correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! Will update.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the updates. Left some small comments.
fa3aa86
to
47aa9d8
Compare
I can repro the failure locally. I'm still digging into it, it seems the test case is surfacing an actual bug. The issue is in the last param in this equal comparator output:
|
NestedLoopJoin.builder() | ||
.from( | ||
b.nestedLoopJoin( | ||
__ -> b.equal(b.fieldReference(leftTable, 0), b.fieldReference(rightTable, 2)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah! This is subtle and weird (and something to improve in the builder actually).
The condition on the join is evaluated relative to the joins record shape, which is the union of the left and right table records. It would look like:
0, 1, 2, 3, 4, 5
(a, b, c, d, e, f)
b.fieldReference(rightTable, 2)
here ends up referring to c
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah that makes sense! Thank you
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm realising there's nothing else in the code that uses this, which is why it hasn't come up before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the test case, should I create a unioned named table to refer to? 5
is out of bounds when creating the Rel, but in bounds after the roundtrip
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, we have newInputRelReference(int index, List<Rel> rels)
, which might handle this. I'll dig into this more myself.
return Arrays.stream(indexes) | ||
.mapToObj(index -> fieldReference(inputs, index)) | ||
.collect(java.util.stream.Collectors.toList()); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call on these, they will be quite helpful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Thanks for adding this.
Thank you so much for the help! |
* feat: more builder support for field references
Requires substrait-io/substrait#561 to be merged and released.