
[SPARK-49960][SQL] Provide extension point for custom AgnosticEncoder serde #48477

Open
wants to merge 3 commits into
base: master

Conversation

@chris-twiner (Contributor) commented Oct 15, 2024

What changes were proposed in this pull request?

4.0.0-preview2 introduced, as part of SPARK-49025 (PR #47785), changes that derive ExpressionEncoders purely from AgnosticEncoders. This PR adds a trait:

@DeveloperApi
trait AgnosticExpressionPathEncoder[T]
  extends AgnosticEncoder[T] {
  def toCatalyst(input: Expression): Expression
  def fromCatalyst(inputPath: Expression): Expression
}

and hooks into the SerializerBuildHelper and DeserializerBuildHelper matches to allow seamless extension by non-Connect custom encoders (such as frameless or sparksql-scalapb).
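For illustration, a downstream encoder implementing the new trait could look roughly like this (a minimal sketch, assuming the trait and the usual Catalyst object expressions are importable; StringWrapper and its accessor wiring are hypothetical, not part of the PR):

import scala.reflect.{classTag, ClassTag}

import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.catalyst.expressions.objects.{Invoke, NewInstance, StaticInvoke}
import org.apache.spark.sql.types.{DataType, ObjectType, StringType}
import org.apache.spark.unsafe.types.UTF8String

// Hypothetical wrapper type a library might want to persist as a plain string column.
case class StringWrapper(value: String)

object StringWrapperEncoder extends AgnosticExpressionPathEncoder[StringWrapper] {
  override def isPrimitive: Boolean = false
  override def dataType: DataType = StringType
  override def clsTag: ClassTag[StringWrapper] = classTag[StringWrapper]

  // Serializer side: extract the wrapped value and convert it to Catalyst's UTF8String.
  override def toCatalyst(input: Expression): Expression =
    StaticInvoke(
      classOf[UTF8String],
      StringType,
      "fromString",
      Invoke(input, "value", ObjectType(classOf[String])) :: Nil)

  // Deserializer side: rebuild the wrapper from the string value at inputPath.
  override def fromCatalyst(inputPath: Expression): Expression =
    NewInstance(
      classOf[StringWrapper],
      Invoke(inputPath, "toString", ObjectType(classOf[String])) :: Nil,
      ObjectType(classOf[StringWrapper]))
}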

SPARK-49960 provides the same information.

Why are the changes needed?

Without this change (or something similar) there is no way for custom encoders to integrate with the 4.0.0-preview2 derived encoders, something that has worked, and that developers have benefited from, since pre-2.4 days. This stops code such as Dataset.joinWith from deriving a working tuple encoder, because the provided ExpressionEncoder is now discarded under preview2. Supplying a custom AgnosticEncoder under preview2 also fails, as only the built-in preview2 AgnosticEncoders are handled in the De/SerializerBuildHelper matches, triggering a MatchError.

Does this PR introduce any user-facing change?

No

How was this patch tested?

A test was added using a "custom" string encoder and joinWith, based on an existing joinWith test. Removing the new case statements in either BuildHelper will trigger the MatchError.
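As a sketch of the shape of such a test (hypothetical names, reusing the StringWrapper encoder sketched above; spark is assumed to be an active SparkSession, and the real test lives in the PR's diff):

import org.apache.spark.sql.Dataset
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

// Wrap the custom agnostic encoder; without the new BuildHelper hooks,
// deriving the tuple encoder inside joinWith would hit the MatchError.
implicit val wrapperEnc: ExpressionEncoder[StringWrapper] =
  ExpressionEncoder(StringWrapperEncoder)

val left = spark.createDataset(Seq(StringWrapper("a"), StringWrapper("b")))
val right = spark.createDataset(Seq(StringWrapper("b"), StringWrapper("c")))

// joinWith derives an encoder for the (StringWrapper, StringWrapper) tuple
// from the two element encoders.
val joined: Dataset[(StringWrapper, StringWrapper)] =
  left.joinWith(right, left("value") === right("value"))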

Was this patch authored or co-authored using generative AI tooling?

No

github-actions bot added the SQL label Oct 15, 2024
@chris-twiner (Contributor, Author) commented:

@hvanhovell fyi

Review comment on the new trait in the diff:

 * @tparam T over T
 */
@DeveloperApi
trait AgnosticExpressionPathEncoder[T]
@hvanhovell (Contributor) commented:
@chris-twiner can you give me an example of what exactly you are missing from the agnostic encoder framework? I'd rather solve this problem at that level than create an escape hatch to raw Catalyst expressions. I am not saying that we should not do this, but I'd like to have a (small) discussion first.

My rationale for pushing for agnostic encoders is that I want to create a situation where the Classic and Connect Spark SQL interfaces are on par. Bespoke Catalyst encoders sort of defeat that.

@chris-twiner (Contributor, Author) commented Oct 16, 2024:
@hvanhovell thanks for getting back to me. Per the JIRA, this is existing pre-4.0 functionality that no longer fully works.
Frameless, for example, uses an extensible encoder derivation with injection/ADT support to provide type-safe usage at compile time; Quality uses injections to store a result ADT efficiently; and this SO question describes a similar, often-occurring example that can be solved the same way. Lastly, as the built-in encoders are not extensible, you can bump into their derivation limitations (java.util.Calendar, for example).

With regard to a fully unified interface implementation: that's understood, but this change is a minimal requirement to re-enable frameless-style usage. I don't yet have any direct way to provide parity for Connect (although your unification work provides a clear basis); I track this under frameless #701. To go further down that route I'd also need custom expression support in Connect (but that's off topic, and I know it's there to be used).

@hvanhovell (Contributor) commented:

Yeah, that makes sense. However, I do want to call out that this is mostly internal API; we do not guarantee any compatibility between (minor) releases. For that reason, historically, most Spark libraries have had to create per-Spark-version releases. The issue here, IMO, falls into that category.

I understand that this is a somewhat frustrating and impractical stance. I am open to having this interface for now, provided that in the future we can migrate towards AgnosticEncoders. The latter probably requires us to add additional encoders to the agnostic framework (e.g. an encoder for union types...).

@chris-twiner (Contributor, Author) commented:

With regard to it being internal: very much understood; it's the price paid for the functionality and performance gains. As I target Databricks as well, there is yet more fun involved (hence shim's complicated version support).

@hvanhovell (Contributor) commented Oct 16, 2024:
(deleted my previous comment) I thought GH had lost it....

@hvanhovell (Contributor) commented:

If Databricks compatibility is something you want, then AgnosticEncoders are your friend.

@eejbyfeldt commented:

Would the TransformingEncoder not be enough of a customization point to implement the custom encoders provided by at least frameless (I am not familiar enough with the others)?

I guess it is not ideal, as it would require a rewrite of that part of the library. But, at least from my experiments, creating custom encoders by first creating AgnosticEncoders is much easier than creating ExpressionEncoders directly. And, based on the comments by @hvanhovell, that seems to be the better approach for downstream libraries.
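For reference, a TransformingEncoder-based version of a simple wrapper encoder might look roughly like this (a sketch against the preview2 AgnosticEncoders API, whose exact signatures may differ; StringWrapper is the same hypothetical wrapper type used in the sketches above):

import scala.reflect.classTag

import org.apache.spark.sql.catalyst.encoders.Codec
import org.apache.spark.sql.catalyst.encoders.AgnosticEncoders.{StringEncoder, TransformingEncoder}

case class StringWrapper(value: String)

// The codec is plain JVM code applied per value; Catalyst sees only the
// transformed (string) representation, never the conversion logic itself.
class StringWrapperCodec extends Codec[StringWrapper, String] {
  override def encode(in: StringWrapper): String = in.value
  override def decode(out: String): StringWrapper = StringWrapper(out)
}

val wrapperEncoder =
  TransformingEncoder(classTag[StringWrapper], StringEncoder, () => new StringWrapperCodec)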

@chris-twiner (Contributor, Author) commented:

> Would the TransformingEncoder not be enough of a customization point to implement the custom encoders provided by at least frameless (I am not familiar enough with the others)?
>
> I guess it is not ideal, as it would require a rewrite of that part of the library. But, at least from my experiments, creating custom encoders by first creating AgnosticEncoders is much easier than creating ExpressionEncoders directly. And, based on the comments by @hvanhovell, that seems to be the better approach for downstream libraries.

hihi, per the response above: aside from the extra black-box indirection that will not allow optimisation (e.g. constant folding etc.), the transformed type itself must already be fully representable by the built-in AgnosticEncoders; that's a significant amount of added spaghetti instead of simply using the ExpressionEncoder already present. Longer term, a full rewrite to AgnosticEncoder derivation (typelevel/frameless#701) might be possible. TransformingEncoder could be a good implementation for the Injection/Union approach; although the black-box indirection would still exist, it might be tolerable there.

@eejbyfeldt commented:

> extra black-box indirection that will not allow optimisation (e.g. constant folding etc.)

I am not sure I am following here. Is the black box you are talking about the "code hiding" inside the Codec type? I am not following why it would not be possible for Spark to do constant folding by just executing that code.

@chris-twiner (Contributor, Author) commented Jan 10, 2025:

> extra black-box indirection that will not allow optimisation (e.g. constant folding etc.)
>
> I am not sure I am following here. Is the black box you are talking about the "code hiding" inside the Codec type? I am not following why it would not be possible for Spark to do constant folding by just executing that code.

Spark can only optimise Expressions, not general JVM byte code. Similarly, as Codecs are not Expressions, reference resolution will not take place on anything inside the codec black box (stopping general use of ExpressionEncoders within the codec). TransformingEncoders are very good for the cases already implemented, but they fall short where Spark could (as is currently done) optimise Expressions, including taking part in codegen. In the general case, the use of custom TransformingEncoder codecs would also prohibit Databricks Photon usage (or Fabric's upcoming equivalent). Photon is only provided as an example; the same caveat applies to any custom Expression as well. Most of sparksql-scalapb and frameless should be compatible, as they resolve to standard Spark Expressions (except UDFs); Quality's custom expressions definitely are not, however.
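As a rough illustration of the distinction (a hypothetical analogy using a Scala UDF to stand in for opaque codec code; not code from the PR, and spark is assumed to be an active SparkSession):

import org.apache.spark.sql.functions.{lit, udf, upper}

// A pure Catalyst expression: the optimizer constant-folds upper("a") into the
// literal "A" at planning time.
spark.range(1).select(upper(lit("a"))).queryExecution.optimizedPlan

// Opaque JVM code, as with a Codec: the optimizer cannot see inside the
// function, so the call survives into the executed plan unchanged.
val toUpper = udf((s: String) => s.toUpperCase)
spark.range(1).select(toUpper(lit("a"))).queryExecution.optimizedPlan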

I'm more than happy to chat further directly (outside of this PR, to minimise noise) if you would like to, but this PR is aimed at supporting 100% of the previous functionality, rather than edge or sub use-cases that may, long term, be better handled through another (possibly new) AgnosticEncoder (also hopefully keeping tech debt lower for users of frameless, sparksql-scalapb and Quality).

> I am not sure I am following here. Is the black box you are talking about the "code hiding" inside the Codec type? I am not following why it would not be possible for Spark to do constant folding by just executing that code.

Sorry, I just realised your original point was probably about TransformingEncoders/Codecs as a replacement for frameless Injections/ADTs, rather than as a general solution for 100% of ExpressionEncoder replacements. It definitely does do that, for Classic at least (I've not attempted it via Connect).
