Skip to content

Conversation

@gord02
Copy link
Contributor

@gord02 gord02 commented Dec 1, 2025

This PR is to add support for the Expression Nested List type.

Issue

Key Changes:

  • Substrait pojo to proto and back conversion
  • Substrait pojo to calcite and back conversion
  • Added new type in Isthmus to support Nested functions list during the conversion to Calcite from Substrait
  • Added code to not do a project remapping on an empty table

Testing:

  • added tests for roundtrip between substrait to proto
  • added tests for roundtrip between substrait and calcite

./gradlew test --tests io.substrait.isthmus.NestedExpressionsTest --debug-jvm

@gord02 gord02 force-pushed the gordon.hamilton/expressionNestedTypes branch from 18967e4 to 9e99091 Compare December 2, 2025 15:50
Copy link
Member

@benbellick benbellick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few comments. Thanks!

@benbellick benbellick self-requested a review December 2, 2025 20:42
Copy link
Member

@benbellick benbellick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just left a few more comments but the core/ part is looking great!

Copy link
Member

@benbellick benbellick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the core stuff looks great! I will try and take a pass at the isthmus stuff, but as I mentioned to you online, I am not as comfortable with that part of the codebase. If the changes there seem simple enough that I am comfortable approving, I will do so. Otherwise, I'll leave it to @vbarua or others to make the final judgement call 🙂

@vbarua vbarua changed the title Add support for Expression nested List Type feat: support Nested Lists Dec 3, 2025
@gord02 gord02 marked this pull request as draft December 9, 2025 14:31
@gord02 gord02 marked this pull request as ready for review December 9, 2025 19:51
Copy link
Member

@vbarua vbarua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working this, its looking really good overall. I did have some comments, but they are minor suggestions. If you could take a look and respond to them, I think we can get this PR merge in this week easily 🙂

Copy link

@asolimando asolimando left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few comments purely on the Isthmus side of things

Comment on lines 14 to 15
* constructor creates a special type of SqlKind.ARRAY_VALUE_CONSTRUCTOR for lists that store
* non-literal expressions.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Conceptually and by looking at the tests (NestedListWithJustLiteralsTest), it seems this can indeed handle literal expressions, can you double check?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can handle both literals and non-literals. I'll update the comment to avoid the confusion

* constructor creates a special type of SqlKind.ARRAY_VALUE_CONSTRUCTOR for lists that store
* non-literal expressions.
*/
public class NestedListConstructor extends SqlMultisetValueConstructor {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess you picked SqlMultisetValueConstructor instead of the more natural SqlArrayValueConstructor because CallConverters.java#L144 would then match on your NestedList and treat it as the regular SqlArrayValueConstructor (then invoke LiteralConstructorConverter.java#L32 which is not what we want here).

SqlMultisetValueConstructor is conceptually wrong as a multiset is radically different from an array/list in standard SQL, on top of that I imagine that tomorrow you might want to support something similar and hit the same issue you are trying to avoid here by using this class in the first place.

The impedance mismatch is that in SQL (and Calcite), arrays and lists are technically the same entity, while IIRC in Substrait they are treated as different entities (@benbellick can you confirm this?).

By looking at LiteralConstructorConverter, there is an implicit assumption that arrays store only literals, we go down that route without checking if elements in the array are really literals (LiteralConstructorConverter.java#L62).

It's probably enough to change LiteralConstructorConverter::toNonEmptyListLiteral to something like this (haven't tested it):

private Optional<Expression> toNonEmptyListLiteral(
      RexCall call, Function<RexNode, Expression> topLevelConverter) {
    List<Expression> expressions = call.operands.stream()
        .map(topLevelConverter)
        .collect(Collectors.toList());

    // Check if all operands are actually literals
    if (expressions.stream().allMatch(e -> e instanceof Expression.Literal)) {
      return Optional.of(ExpressionCreator.list(
          call.getType().isNullable(),
          expressions.stream()
              .map(e -> (Expression.Literal) e)
              .collect(Collectors.toList())));
    }

    return Optional.empty();
  }

I suggest to extend SqlArrayValueConstructor (which, I know, extends SqlMultisetValueConstructor, but still), then fix LiteralConstructorConverter as suggested, so that we can continue with NestedExpressionConverter which comes just after, and we should be good

Copy link
Contributor Author

@gord02 gord02 Dec 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to make sure I am understanding it correctly, let me know your thoughts on the following scenarios: We want to ensure that the roundtrip of both a nested list with literals and non-literals are both returned to a nested list. If the literalConstructorConverter is run first on a list of just literals, then it would pass and then wouldn't be mapped back to a nested list. In the other case, where the nestedExpressionConverter is run first, the literal lists that were originally not a NestedList would become a nested list. Does the above account for this or is the difference not important?

Also, is there a way to meaningfully extend the SqlArrayValueConstructor class? Its definition is bare with just a constructor to its parent type.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no distinction between lists and arrays in substrait. Only list is a legitimate type. Array is informally used in some places in the docs, but that doesn't actually exist as a distinct type.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To your point about

We want to ensure that the roundtrip of both a nested list with literals and non-literals are both returned to a nested list.

The design you have works to solve the problem of "NestedList in is always NestedList out". Thinking about what Alessandro is saying though, I actually don't think it's worth preserving that. There is a little bit of redundancy on the Substrait side in that list containing only literal values can be expressed equivalently as a Literal.List or a Nested.List.

In Calcite, these can both be mapped a SqlArrayValueConstructor. You introduced your NestedListConstructor to be able to distinguish between the two incoming cases so as to roundtrip them, but what if instead of doing that we just said:

  1. A SqlArrayValueConstructor call with all literals gets turned into a Substrait Literal.List
  2. A SqlArrayValueConstructor call with any non-literals gets turned into Substrat Nested.List

This isn't as nice from a roundtrip perspective, but from a pure Calcite perspective we're mapping to the most specific Substrait construct that we can use, which is better in my opinion.

One thing I would push for that's different from what Alessandro is suggesting is to have a single call converter SqlArrayValueConstructorCallConverter that combines the array converting code in LiteralConstructorConverter and your code in NestedExpressionCallConverter. Effectively, let's not split the code that handles arrays with only literals, and arrays with non-literals, and just put it all in one class so that it's easy to see how the behaviour works and you don't have to look at 2 different converters for 1 Calcite construct.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @benbellick and @vbarua for the clarification, the fact that only List exists in Substrait actually makes things easier, as we can map it to Array in standard SQL (and in Calcite, where it's informally called List too sometimes).

In light of this clarification, my previous suggestion is an unneeded over complication, and I fully agree with Victor's proposal.

Expression.NestedList.builder().addValues(b.str("xzy")).addValues(b.str("abc")).build();

Rel project =
io.substrait.relation.Project.builder()
Copy link

@asolimando asolimando Dec 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I am not very familiar with coding style conventions in substrait-java, but I generally prefer to use imports over fully qualified names for Java classes (in case you accept to change it, I saw this pattern in quite a few places, it should be fixed consistently)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants