fix(isthmus): handle Subqueries/set predicates with field references outside of the subquery #383

Open
wants to merge 2 commits into
base: main
from

Conversation

mbwhite
Contributor

@mbwhite mbwhite commented Apr 10, 2025

This resolves issue #382, found with TPC-H 17 when converting back to SQL from Substrait.
The subquery references fields in an outer scope; the Calcite correlation variables were represented as Substrait 'outer references'.

However, when converting back to Calcite from a plan with these 'outer references', they were ignored.

Notes:

  • I've added two new TPC-x tests. These are similar to the existing tests, but convert SQL to Substrait and then Substrait back to SQL
  • Additional SQL using subqueries has been added; would like to add some more. Suggestions welcome.
  • Added these as 1xx.sql files but could move to a separate folder?

@mbwhite mbwhite changed the title from "Handle Subqueries/set predicates with field references outside of the subquery" to "fix(isthmus): handle Subqueries/set predicates with field references outside of the subquery" Apr 10, 2025
@mbwhite mbwhite marked this pull request as ready for review April 10, 2025 11:57
Member

@vbarua vbarua left a comment

Left some questions

@mbwhite mbwhite force-pushed the tpc-h-fixes branch 4 times, most recently from eaa231c to 074c9ef April 11, 2025 09:47
@mbwhite
Contributor Author

mbwhite commented Apr 11, 2025

@vbarua questions have been answered; I've updated the test cases to better separate concerns.

Found the hidden setting in VS Code to change the formatter back to Spotless!

io.substrait.plan.Plan plan = new ProtoPlanConverter().from(possible);
SubstraitToCalcite substraitToCalcite = new SubstraitToCalcite(extensions, typeFactory);
RelNode relRoot = substraitToCalcite.convert(plan.getRoots().get(0)).project(true);
System.out.println(SubstraitToSql.toSql(relRoot));
Member

minor: avoid printing during tests. I suggest assertNotNull instead.

Contributor Author

Also changed the other existing tests where this occurred.
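
A minimal sketch of the suggested change, under stated assumptions: `convertToSql` is a hypothetical stand-in for the Substrait to Calcite to SQL conversion in the test, and `Objects.requireNonNull` is used only to keep the sketch free of dependencies. In the actual test, JUnit's `assertNotNull` would take its place.

```java
import java.util.Objects;

public class NoPrintInTestSketch {
    // Hypothetical stand-in for the conversion exercised by the real test.
    public static String convertToSql(String planName) {
        return "SELECT 1 /* " + planName + " */";
    }

    // Check that a result was produced instead of printing it to stdout.
    public static String checkedSql(String planName) {
        String sql = convertToSql(planName);
        return Objects.requireNonNull(sql, "conversion returned null SQL");
    }

    public static void main(String[] args) {
        // No output on success; a null result would throw NullPointerException.
        checkedSql("tpch17");
    }
}
```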

@@ -487,25 +492,44 @@ public RexNode visit(Expression.Cast expr) throws RuntimeException {
typeConverter.toCalcite(typeFactory, expr.getType()), expr.input().accept(this), safeCast);
}

AtomicInteger correlIdCount = new AtomicInteger(0);
Member

What is this for? It's never used.

if (outerref.isPresent()) {
if (segment instanceof FieldReference.StructField) {
FieldReference.StructField f = (FieldReference.StructField) segment;
var node = referenceRelList.get(outerref.get() - 1).get();
Member

Do we need to track or handle the fact that there might be multiple Filters that can add correlation variables? We only ever add to this list.

How would we know which ones came from which input?

This is one of the issues I had in mind when I mentioned
https://github.com/substrait-io/substrait-java/pull/383/files#r2038429378
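
One way to picture the concern is a scope stack that is popped when a relation has been fully converted, rather than an append-only list. The sketch below is an illustration under assumptions, not the PR's implementation: `OuterScopeSketch`, `enter`, `exit`, and `resolve` are hypothetical names, and relations are modelled as plain strings. The only detail taken from Substrait is that an outer reference carries a `steps_out` count of at least 1.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class OuterScopeSketch {
    // Enclosing relations; the head of the deque is the innermost enclosing scope.
    private final Deque<String> scopes = new ArrayDeque<>();

    // Call when conversion descends into a relation that may own subqueries.
    public void enter(String relation) {
        scopes.push(relation);
    }

    // Call when that relation is finished, so stale scopes never accumulate.
    public void exit() {
        scopes.pop();
    }

    // stepsOut = 1 is the nearest enclosing relation, 2 the one outside it, and so on.
    public String resolve(int stepsOut) {
        if (stepsOut < 1 || stepsOut > scopes.size()) {
            throw new IllegalArgumentException("no enclosing scope at stepsOut=" + stepsOut);
        }
        int depth = 1;
        for (String scope : scopes) { // iterates from innermost to outermost
            if (depth++ == stepsOut) {
                return scope;
            }
        }
        throw new IllegalStateException("unreachable");
    }

    public static void main(String[] args) {
        OuterScopeSketch s = new OuterScopeSketch();
        s.enter("outer filter");
        s.enter("subquery filter");
        System.out.println(s.resolve(1)); // subquery filter
        System.out.println(s.resolve(2)); // outer filter
        s.exit();
        System.out.println(s.resolve(1)); // outer filter
    }
}
```

With push and pop paired around each relation, two sibling Filters never see each other's entries, which is the situation an append-only list cannot distinguish.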

Contributor Author

Honestly, the Calcite docs don't really help with the scope of these variables. There is some concept of a 'namespace', e.g. from this error:

All correlation variables should resolve to the same namespace. Prev ns=org.apache.calcite.sql.validate.IdentifierNamespace@d36c1c3, new ns=org.apache.calcite.sql.validate.IdentifierNamespace@96abc7

which came from

select
    c1.c_name,
    o1.o_orderstatus,
    o1.o_totalprice
from
    customer c1,
    orders o1
where
    o1.o_custkey = c1.c_custkey
    and o1.o_totalprice > (
        select
            avg(o_totalprice)
        from
            orders o2, customer c2
        where
            o2.o_totalprice < c1.c_acctbal
            and o2.o_totalprice > (
                select
                    avg(c3.c_acctbal)
                from
                    customer c3
                where
                    c3.c_custkey = o2.o_custkey
                    and c3.c_address = o1.o_clerk
            )
    );

Change the last line to c3.c_address = o2.o_clerk and it's OK.

Contributor Author

The implication is that each relation and its immediate subexpression share the same namespace.
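
To make the observation concrete, the toy model below counts how many query levels each alias reference in the SQL above has to step out from the innermost subquery. It is only a sketch of the reading given here, not Calcite's validator logic: the class name, the alias-to-level map, and the idea of comparing step counts are assumptions made for illustration.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class StepsOutSketch {
    // Query level at which each alias is declared in the example (1 = outermost query).
    static final Map<String, Integer> DECLARED_AT =
        Map.of("c1", 1, "o1", 1, "o2", 2, "c2", 2, "c3", 3);

    // Levels to step out from a query at fromLevel to reach the alias; 0 means local.
    public static int stepsOut(String alias, int fromLevel) {
        Integer declared = DECLARED_AT.get(alias);
        if (declared == null) {
            throw new IllegalArgumentException("unknown alias: " + alias);
        }
        return fromLevel - declared;
    }

    // Distinct outer levels touched by the aliases referenced in one subquery.
    public static Set<Integer> outerLevels(int fromLevel, String... aliases) {
        Set<Integer> levels = new TreeSet<>();
        for (String alias : aliases) {
            int steps = stepsOut(alias, fromLevel);
            if (steps > 0) {
                levels.add(steps);
            }
        }
        return levels;
    }

    public static void main(String[] args) {
        // Failing form: the innermost subquery references both o2 and o1.
        System.out.println(outerLevels(3, "c3", "o2", "c3", "o1")); // [1, 2]
        // Working form: the last predicate uses o2.o_clerk instead.
        System.out.println(outerLevels(3, "c3", "o2", "c3", "o2")); // [1]
    }
}
```

In this model the failing query is the one whose innermost subquery reaches two different outer levels, which matches the behaviour reported above but should not be taken as a statement of Calcite's actual rule.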

2 participants