
[SPARK-49960][SQL] Provide extension point for custom AgnosticEncoder serde #48477

Open
wants to merge 3 commits into
base: master

Conversation

@chris-twiner (Contributor) commented Oct 15, 2024

What changes were proposed in this pull request?

4.0.0-preview2 introduced, as part of SPARK-49025 (PR #47785), changes that derive ExpressionEncoders purely from AgnosticEncoders. This PR adds a trait:

@DeveloperApi
trait AgnosticExpressionPathEncoder[T]
  extends AgnosticEncoder[T] {
  def toCatalyst(input: Expression): Expression
  def fromCatalyst(inputPath: Expression): Expression
}

and hooks into the SerializerBuildHelper and DeserializerBuildHelper matches to allow seamless extension by non-Connect custom encoders (such as frameless or sparksql-scalapb).
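For illustration, a downstream encoder implementing the new trait could look roughly like this (a minimal sketch, assuming the trait and the usual Catalyst object expressions are importable; StringWrapper and its accessor wiring are hypothetical, not part of the PR):

import scala.reflect.{classTag, ClassTag}

import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.catalyst.expressions.objects.{Invoke, NewInstance, StaticInvoke}
import org.apache.spark.sql.types.{DataType, ObjectType, StringType}
import org.apache.spark.unsafe.types.UTF8String

// Hypothetical wrapper type a library might want to persist as a plain string column.
case class StringWrapper(value: String)

object StringWrapperEncoder extends AgnosticExpressionPathEncoder[StringWrapper] {
  override def isPrimitive: Boolean = false
  override def dataType: DataType = StringType
  override def clsTag: ClassTag[StringWrapper] = classTag[StringWrapper]

  // Serializer side: extract the wrapped value and convert it to Catalyst's UTF8String.
  override def toCatalyst(input: Expression): Expression =
    StaticInvoke(
      classOf[UTF8String],
      StringType,
      "fromString",
      Invoke(input, "value", ObjectType(classOf[String])) :: Nil)

  // Deserializer side: rebuild the wrapper from the string value at inputPath.
  override def fromCatalyst(inputPath: Expression): Expression =
    NewInstance(
      classOf[StringWrapper],
      Invoke(inputPath, "toString", ObjectType(classOf[String])) :: Nil,
      ObjectType(classOf[StringWrapper]))
}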

SPARK-49960 provides the same information.

Why are the changes needed?

Without this change (or something similar) there is no way for custom encoders to integrate with the 4.0.0-preview2 derived encoders, something that has worked, and that developers have benefited from, since pre-2.4 days. This stops code such as Dataset.joinWith from deriving a working tuple encoder, because the provided ExpressionEncoder is now discarded under preview2. Supplying a custom AgnosticEncoder under preview2 also fails, as only the built-in preview2 AgnosticEncoders are handled in the De/SerializerBuildHelper matches, triggering a MatchError.

Does this PR introduce any user-facing change?

No

How was this patch tested?

A test was added using a "custom" string encoder and joinWith, based on an existing joinWith test. Removing the new case statements in either BuildHelper will trigger the MatchError.
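As a sketch of the shape of such a test (hypothetical names, reusing the StringWrapper encoder sketched above; spark is assumed to be an active SparkSession, and the real test lives in the PR's diff):

import org.apache.spark.sql.Dataset
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

// Wrap the custom agnostic encoder; without the new BuildHelper hooks,
// deriving the tuple encoder inside joinWith would hit the MatchError.
implicit val wrapperEnc: ExpressionEncoder[StringWrapper] =
  ExpressionEncoder(StringWrapperEncoder)

val left = spark.createDataset(Seq(StringWrapper("a"), StringWrapper("b")))
val right = spark.createDataset(Seq(StringWrapper("b"), StringWrapper("c")))

// joinWith derives an encoder for the (StringWrapper, StringWrapper) tuple
// from the two element encoders.
val joined: Dataset[(StringWrapper, StringWrapper)] =
  left.joinWith(right, left("value") === right("value"))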

Was this patch authored or co-authored using generative AI tooling?

No

github-actions bot added the SQL label Oct 15, 2024
@chris-twiner (Contributor, Author) commented:

@hvanhovell fyi

Review comment on the new trait in the diff:

 * @tparam T over T
 */
@DeveloperApi
trait AgnosticExpressionPathEncoder[T]
@hvanhovell (Contributor) commented:
@chris-twiner can you give me an example of what exactly you are missing from the agnostic encoder framework? I'd rather solve this problem at that level than create an escape hatch to raw Catalyst expressions. I am not saying that we should not do this, but I'd like to have a (small) discussion first.

My rationale for pushing for agnostic encoders is that I want to create a situation where the Classic and Connect Spark SQL interfaces are on par. Bespoke Catalyst encoders sort of defeat that.

@chris-twiner (Contributor, Author) commented Oct 16, 2024:
@hvanhovell thanks for getting back to me. Per the JIRA, this is existing pre-4.0 functionality that no longer fully works.
Frameless, for example, uses an extensible encoder derivation with injection/ADT support to provide type-safe usage at compile time; Quality uses injections to store a result ADT efficiently; and this SO question describes a similar, often-occurring example that can be solved the same way. Lastly, as the built-in encoders are not extensible, you can bump into their derivation limitations (java.util.Calendar, for example).

With regard to a fully unified interface implementation: that's understood, but this change is a minimal requirement to re-enable frameless-style usage. I don't yet have any direct way to provide parity for Connect (although your unification work provides a clear basis); I track this under frameless #701. To go further down that route I'd also need custom expression support in Connect (but that's off topic, and I know it's there to be used).

@hvanhovell (Contributor) commented:

Yeah, that makes sense. However, I do want to call out that this is mostly internal API; we do not guarantee any compatibility between (minor) releases. For that reason, historically, most Spark libraries have had to create per-Spark-version releases. The issue here, IMO, falls into that category.

I understand that this is a somewhat frustrating and impractical stance. I am open to having this interface for now, provided that in the future we can migrate towards AgnosticEncoders. The latter probably requires us to add additional encoders to the agnostic framework (e.g. an encoder for union types...).

@chris-twiner (Contributor, Author) commented:

With regard to it being internal: very much understood; it's the price paid for the functionality and performance gains. As I target Databricks as well, there is yet more fun involved (hence shim's complicated version support).

@hvanhovell (Contributor) commented Oct 16, 2024:
(deleted my previous comment) I thought GH had lost it....

@hvanhovell (Contributor) commented:

If Databricks compatibility is something you want, then AgnosticEncoders are your friend.

@eejbyfeldt commented:

Would the TransformingEncoder not be enough of a customization point to implement the custom encoders provided by at least frameless (I am not familiar enough with the others)?

I guess it is not ideal, as it would require a rewrite of that part of the library. But, at least from my experiments, creating custom encoders by first creating AgnosticEncoders is much easier than creating ExpressionEncoders directly. And, based on the comments by @hvanhovell, that seems to be the better approach for downstream libraries.
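For reference, a TransformingEncoder-based version of a simple wrapper encoder might look roughly like this (a sketch against the preview2 AgnosticEncoders API, whose exact signatures may differ; StringWrapper is the same hypothetical wrapper type used in the sketches above):

import scala.reflect.classTag

import org.apache.spark.sql.catalyst.encoders.Codec
import org.apache.spark.sql.catalyst.encoders.AgnosticEncoders.{StringEncoder, TransformingEncoder}

case class StringWrapper(value: String)

// The codec is plain JVM code applied per value; Catalyst sees only the
// transformed (string) representation, never the conversion logic itself.
class StringWrapperCodec extends Codec[StringWrapper, String] {
  override def encode(in: StringWrapper): String = in.value
  override def decode(out: String): StringWrapper = StringWrapper(out)
}

val wrapperEncoder =
  TransformingEncoder(classTag[StringWrapper], StringEncoder, () => new StringWrapperCodec)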

@chris-twiner (Contributor, Author) commented:

> Would the TransformingEncoder not be enough of a customization point to implement the custom encoders provided by at least frameless (I am not familiar enough with the others)?
>
> I guess it is not ideal, as it would require a rewrite of that part of the library. But, at least from my experiments, creating custom encoders by first creating AgnosticEncoders is much easier than creating ExpressionEncoders directly. And, based on the comments by @hvanhovell, that seems to be the better approach for downstream libraries.

hihi, per the response above: aside from the extra black-box indirection that will not allow optimisation (e.g. constant folding etc.), the transformed type itself must already be fully representable by the built-in AgnosticEncoders; that's a significant amount of added spaghetti instead of simply using the ExpressionEncoder already present. Longer term, a full rewrite to AgnosticEncoder derivation (typelevel/frameless#701) might be possible. TransformingEncoder could be a good implementation for the Injection/Union approach; although the black-box indirection would still exist, it might be tolerable there.

@eejbyfeldt commented:

> extra black-box indirection that will not allow optimisation (e.g. constant folding etc.)

I am not sure I am following here. Is the black box you are talking about the "code hiding" inside the Codec type? I am not following why it would not be possible for Spark to do constant folding by just executing that code.

@chris-twiner (Contributor, Author) commented Jan 10, 2025:

> extra black-box indirection that will not allow optimisation (e.g. constant folding etc.)
>
> I am not sure I am following here. Is the black box you are talking about the "code hiding" inside the Codec type? I am not following why it would not be possible for Spark to do constant folding by just executing that code.

Spark can only optimise Expressions, not general JVM byte code. Similarly, as Codecs are not Expressions, reference resolution will not take place on anything inside the codec black box (stopping general use of ExpressionEncoders within the codec). TransformingEncoders are very good for the cases already implemented, but they fall short where Spark could (as is currently done) optimise Expressions, including taking part in codegen. In the general case, the use of custom TransformingEncoder codecs would also prohibit Databricks Photon usage (or Fabric's upcoming equivalent). Photon is only provided as an example; the same caveat applies to any custom Expression as well. Most of sparksql-scalapb and frameless should be compatible, as they resolve to standard Spark Expressions (except UDFs); Quality's custom expressions definitely are not, however.
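As a rough illustration of the distinction (a hypothetical analogy using a Scala UDF to stand in for opaque codec code; not code from the PR, and spark is assumed to be an active SparkSession):

import org.apache.spark.sql.functions.{lit, udf, upper}

// A pure Catalyst expression: the optimizer constant-folds upper("a") into the
// literal "A" at planning time.
spark.range(1).select(upper(lit("a"))).queryExecution.optimizedPlan

// Opaque JVM code, as with a Codec: the optimizer cannot see inside the
// function, so the call survives into the executed plan unchanged.
val toUpper = udf((s: String) => s.toUpperCase)
spark.range(1).select(toUpper(lit("a"))).queryExecution.optimizedPlan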

I'm more than happy to chat further directly (outside of this PR, to minimise noise) if you would like to, but this PR is aimed at supporting 100% of the previous functionality, rather than edge or sub use-cases that may, long term, be better handled through another (possibly new) AgnosticEncoder (also hopefully keeping tech debt lower for users of frameless, sparksql-scalapb and Quality).

> I am not sure I am following here. Is the black box you are talking about the "code hiding" inside the Codec type? I am not following why it would not be possible for Spark to do constant folding by just executing that code.

Sorry, I just realised your original point was probably about TransformingEncoders/Codecs as a replacement for frameless Injections/ADTs, rather than as a general solution for 100% of ExpressionEncoder replacements. It definitely does do that, for Classic at least (I've not attempted it via Connect).
