[SPARK-49960][SQL] Provide extension point for custom AgnosticEncoder serde #48477
Conversation
@hvanhovell fyi
  * @tparam T over T
  */
@DeveloperApi
trait AgnosticExpressionPathEncoder[T]
@chris-twiner can you give me an example of what exactly you are missing from the agnostic encoder framework? I'd rather solve this problem at that level than create an escape hatch to raw Catalyst expressions. I am not saying that we should not do this, but I'd like to have a (small) discussion first.
My rationale for pushing for agnostic encoders is that I want to create a situation where the Classic and Connect Spark SQL interfaces are on par. Bespoke Catalyst encoders - sort of - defeat that.
@hvanhovell thanks for getting back to me. Per the JIRA, this is existing pre-4.0 functionality that is no longer fully working.
Frameless, for example, uses an extensible encoder derivation with injection/ADT support to provide type-safe usage at compile time. Quality, for example, uses injections to store a result ADT efficiently; this SO question is a similar, often-occurring example that can be solved. Lastly, as the inbuilt encoders are not extensible, you can bump into their derivation limitations (java.util.Calendar, for example); see the injection sketch below.
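As a concrete illustration of the injection-style derivation mentioned above, here is a minimal, untested sketch; it assumes frameless's Injection[A, B] apply/invert shape, and the Calendar-to-millis mapping is only an example:

```scala
import java.util.Calendar
import frameless.{Injection, TypedEncoder}

// java.util.Calendar has no built-in Spark encoder, so inject it into a Long (epoch millis).
implicit val calendarInjection: Injection[Calendar, Long] =
  Injection(
    (c: Calendar) => c.getTimeInMillis,
    (millis: Long) => {
      val c = Calendar.getInstance()
      c.setTimeInMillis(millis)
      c
    }
  )

// With the injection in implicit scope, frameless can derive an encoder for
// case classes that contain a Calendar field.
case class Event(name: String, at: Calendar)
val eventEncoder: TypedEncoder[Event] = TypedEncoder[Event]
```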
Wrt a fully unified interface impl, that's understood, but this change is a minimal requirement to re-enable frameless-style usage. I don't have any direct way to provide parity for Connect yet (although your unification work provides a clear basis); I track it under frameless #701. To go further down that route I'd also need custom expression support in Connect (but that's off topic, and I know it's there to be used).
Yeah, that makes sense. However, I do want to call out that this is mostly internal API; we do not guarantee any compatibility between (minor) releases. For that reason, historically, most Spark libraries have had to create per-Spark-version releases. The issue here IMO falls in that category.
I understand that this is a somewhat frustrating and impractical stance. I am open to having this interface for now, provided that in the future we can migrate towards AgnosticEncoders. The latter probably requires us to add additional encoders to the agnostic framework (e.g. an encoder for union types...).
Wrt internal: very much understood, it's the price paid for the functionality and performance gains. As I target Databricks as well there is yet more fun - hence shim's complicated version support.
(deleted my previous comment) I thought GH had lost it....
If Databricks compatibility is something you want, then AgnosticEncoders are your friend.
Would TransformingEncoder not work for this? I guess it is not ideal, as it would require a rewrite of that part of the library. But at least from my experiments, creating custom encoders by first creating ...
Hihi, per the response above: aside from the extra black-box indirection that will not allow optimisation (e.g. constant folding etc.), the transformed type itself must already be fully representable by AgnosticEncoders, so that's a significant amount of added spaghetti instead of simply using the ExpressionEncoder already present. Longer term, a full rewrite to AgnosticEncoder derivation (typelevel/frameless#701) might be possible (TransformingEncoder could be a good implementation for the Injection/union approach; although the black-box indirection would still exist, it might be tolerable for that).
I am not sure I am following here. Is the black box you are talking about the "code hiding" inside the Codec?
Spark can only optimise Expressions, not general JVM byte code. Similarly, as Codecs are not Expressions, reference resolution will not take place on anything inside of that codec black box (stopping general use of ExpressionEncoders within the codec). TransformingEncoders are very good for the cases already implemented but fall short where Spark could, as is currently done, optimise Expressions (including taking part in codegen). The use of custom TransformingEncoder codecs would also prohibit Databricks Photon usage (or Fabric's upcoming equivalent) etc. in the general case (only provided as an example; the same is true for any custom Expression as well: most of sparksql-scalapb and frameless should be compatible as they resolve to standard Spark Expressions, except udfs; Quality custom expressions are definitely not, however). I'm more than happy to chat further directly (outside of this PR, to minimise noise) if you would like to, but this PR is aimed at supporting 100% of the previous functionality, not edge or sub use-cases that long-term may be better handled through another (possibly new) AgnosticEncoder (also hopefully keeping tech debt lower for users of frameless, sparksql-scalapb and Quality).
Sorry, I just realised your original point was probably around TransformingEncoders/Codecs as a replacement for frameless Injections / ADTs, more than a general solution for 100% of ExpressionEncoder replacements. It definitely does that, for classic at least (I've not attempted it via connect).
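For illustration, a rough sketch of the TransformingEncoder/Codec route discussed above, again using java.util.Calendar as the transformed type; the Codec trait shape and the TransformingEncoder parameter order shown here are my reading of the Spark 4 catalyst encoders API and may differ from the actual signatures:

```scala
import java.util.Calendar
import scala.reflect.classTag
import org.apache.spark.sql.catalyst.encoders.Codec
import org.apache.spark.sql.catalyst.encoders.AgnosticEncoders.{BoxedLongEncoder, TransformingEncoder}

// Codec shape assumed: encode into the transformed representation, decode back out.
// Everything in here is opaque JVM code to Catalyst - the "black box" referred to above.
class CalendarCodec extends Codec[Calendar, java.lang.Long] {
  override def encode(in: Calendar): java.lang.Long = in.getTimeInMillis
  override def decode(out: java.lang.Long): Calendar = {
    val c = Calendar.getInstance()
    c.setTimeInMillis(out)
    c
  }
}

// Parameter order (class tag, transformed encoder, codec provider) is assumed.
val calendarEncoder =
  TransformingEncoder(classTag[Calendar], BoxedLongEncoder, () => new CalendarCodec)
```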
What changes were proposed in this pull request?
4.0.0-preview2 introduced, as part of SPARK-49025 (PR #47785), changes which drive ExpressionEncoder derivation purely from AgnosticEncoders. This PR adds a trait:
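A sketch of the shape of such a trait, based on the review context above; the extension of AgnosticEncoder and the toCatalyst/fromCatalyst member names are assumptions on my part, mirroring the serializer/deserializer hooks described below:

```scala
import org.apache.spark.annotation.DeveloperApi
import org.apache.spark.sql.catalyst.encoders.AgnosticEncoder
import org.apache.spark.sql.catalyst.expressions.Expression

/**
 * Extension point allowing libraries to supply their own serializer and
 * deserializer expressions for a type T.
 *
 * @tparam T over T
 */
@DeveloperApi
trait AgnosticExpressionPathEncoder[T] extends AgnosticEncoder[T] {
  /** Builds the Catalyst expression that serializes an instance of T (name assumed). */
  def toCatalyst(input: Expression): Expression

  /** Builds the Catalyst expression that deserializes a path back into T (name assumed). */
  def fromCatalyst(inputPath: Expression): Expression
}
```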
and hooks it into the De/SerializationBuildHelper matches to allow seamless extension by non-Connect custom encoders (such as frameless or sparksql-scalapb).
SPARK-49960 provides the same information.
Why are the changes needed?
Without this change (or something similar) there is no way for custom encoders to integrate with the 4.0.0-preview2 derived encoders, something which has worked, and which devs have benefited from, since pre-2.4 days. It stops code such as Dataset.joinWith from deriving a working tuple encoder (as the provided ExpressionEncoder is now discarded under preview2). Supplying a custom AgnosticEncoder under preview2 also fails, as only the built-in preview2 AgnosticEncoders are handled in De/SerializationBuildHelper, triggering a MatchError.
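A minimal illustration of the failure mode with a hypothetical custom AgnosticEncoder; the wrapper type, the encoder class, and its member list are illustrative rather than taken from the PR, and may not be exhaustive:

```scala
import scala.reflect.{classTag, ClassTag}
import org.apache.spark.sql.catalyst.encoders.AgnosticEncoder
import org.apache.spark.sql.types.{DataType, StringType}

// Hypothetical wrapper type with a hand-written encoder.
case class Name(value: String)

// An AgnosticEncoder subclass that De/SerializationBuildHelper does not know about.
// Under preview2 the build helpers pattern-match only the built-in AgnosticEncoders,
// so building an ExpressionEncoder from this (e.g. for the tuple encoder that
// Dataset.joinWith derives) falls through every case and throws a MatchError.
class NameEncoder extends AgnosticEncoder[Name] {
  override def isPrimitive: Boolean = false
  override def dataType: DataType = StringType
  override def clsTag: ClassTag[Name] = classTag[Name]
}
```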
Does this PR introduce any user-facing change?
No
How was this patch tested?
A test was added using a "custom" string encoder and joinWith, based on an existing joinWith test. Removing the case statements in either BuildHelper will trigger the MatchError.
Was this patch authored or co-authored using generative AI tooling?
No