feat: ScalaUDF and Java UDF support via Janino codegen #4267

Draft: mbutrovich wants to merge 67 commits
## Conversation

This was referenced May 8, 2026.

**mbutrovich** (author): There are about 4 Spark SQL test failures that look like they might need updating, but otherwise it's looking good. Not going to worry about them until we discuss moving forward.
## Which issue does this PR close?
Closes #.
## Rationale for this change
#4232 merged the JVM UDF bridge. This PR adds a codegen dispatcher on top: a `CometUDF` (`CometScalaUDFCodegen`) that compiles a specialized batch kernel per bound `ScalaUDF` expression and input schema via Janino. Without this path, any plan containing a `ScalaUDF` falls back to Spark for the enclosing operator, losing native execution on the surrounding plan.

The dispatcher is one of potentially many `CometUDF` implementations the bridge can route to. Hand-written `CometUDF`s for specific expression families (e.g. regex in #4239, JSON in #4305) remain a parallel path; the bridge dispatches by class name from the proto and does not require everything to go through the dispatcher.

Benefits:

- Any `ScalaUDF` whose argument and return types are in the supported surface routes through native without a hand-written `CometUDF`.
- Codegen covers the whole `ScalaUDF` argument tree, so Catalyst sub-expressions inside the UDF (`upper(s)`, `concat(c1, c2)`, `monotonically_increasing_id()`, higher-order functions like `transform`/`filter`/`array_max`) compile into the same per-row loop as the user function.
- Opt-in via `spark.comet.exec.scalaUDF.codegen.enabled` (default `true`). When disabled, plans containing a `ScalaUDF` fall back to Spark for that operator.
- The `CometUDF` contract loosens from "should be stateless" to "may hold per-task state in fields": one instance per Spark task attempt per class, reused across all batches of the task, dropped on task completion. Per-instance access is single-threaded because Spark runs one native future per partition and Tokio polls one future per worker at a time.

## What changes are included in this PR?
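As a mental model for the kernels described below: the Catalyst sub-expressions and the user function fuse into one per-row loop over the batch. The following hand-written Java analogue sketches that shape for a UDF applied over `upper(s)`; every name here is illustrative, not Comet's actual emitted code.

```java
public class Main {
    // Hand-written analogue of a generated kernel for udf(upper(s)) where the
    // (hypothetical) user function is s -> s + "!". The Catalyst
    // sub-expression (upper) and the user function share one per-row loop.
    static String[] processBatch(String[] input) {
        String[] out = new String[input.length];
        for (int row = 0; row < input.length; row++) {
            String s = input[row];
            if (s == null) {          // NullIntolerant-style short-circuit:
                out[row] = null;      // skip both sub-expression and UDF
                continue;
            }
            String upper = s.toUpperCase();  // inlined Catalyst sub-expression
            out[row] = upper + "!";          // inlined user function
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(
            processBatch(new String[] {"a", null, "bc"})));
    }
}
```

The real kernels read from Arrow vectors rather than Java arrays, but the fused loop and the null short-circuit are the same idea.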
- New package `org.apache.comet.codegen`: `CometBatchKernelCodegen` (orchestrator) + `CometBatchKernelCodegenInput`/`CometBatchKernelCodegenOutput` (per-side emission) + a `CometBatchKernel` Java base + `CometInternalRow`/`CometArrayData`/`CometMapData` shim bases + `CometSpecializedGettersDispatch` for shared `get(ordinal, dataType)` dispatch. The framework is generic over Catalyst expressions; today's only consumer is the ScalaUDF dispatcher.
- New package `org.apache.comet.udf.codegen`: `CometScalaUDFCodegen` (bridge entry, per-task compile cache, per-partition kernel state).
- Complex types: `ArrayType`, `StructType`, and `MapType` as both input and output, including arbitrary nesting. Sealed `ArrowColumnSpec` plus recursive nested-class emission. Each generated `InputArray_*`/`InputStruct_*`/`InputMap_*` instance is allocated fresh per `getArray(i)`/`getStruct(i, n)`/`getMap(i)` call with `final` slice fields, matching Spark's `ColumnarRow`/`ColumnarArray` model. Allocate-fresh keeps retain-by-reference consumers (e.g. `ArrayDistinct.nullSafeEval` stashing references in an `OpenHashSet`) correct without giving up lazy reads.
- Null guards on `getStruct`/`getArray`/`getMap`/`getDecimal`/`getUTF8String`/`getBinary` inside `InputArray_*` and `InputStruct_*`: when the element / field is nullable, the emitter prepends `if (isNullAt(...)) return null;` so consumers like `Flatten.doGenCode` (which use `CodeGenerator.setArrayElement` and skip the caller-side `isNullAt` check on reference types) don't store non-null shells / empty bytes / garbage decimals where Spark would store null. The guard is elided when the spec says non-nullable.
- Higher-order functions: `canHandle` admits `HigherOrderFunction`/`LambdaFunction`/`NamedLambdaVariable` despite their `CodegenFallback` mixin. `CodegenFallback.doGenCode` emits a single `((Expression) references[N]).eval(row)` call site per HOF; the kernel dispatches to `Expression.eval(InternalRow)`, which iterates the array, mutates `NamedLambdaVariable.value`'s `AtomicReference` per element, and recursively evaluates the lambda body. Lambda-body leaf reads resolve through the kernel's typed Arrow getters since the kernel is an `InternalRow`. Cost model: per-row interpreted eval inside the HOF subtree; surrounding native operators stay native, surrounding non-HOF expressions stay codegen.
- Per-task expression state: `boundExpr` (and any in-tree mutable state, notably `NamedLambdaVariable.value` for HOFs) lives on the per-task `CometScalaUDFCodegen` instance, not on the companion. Concurrent partitions running the same query never share an expression-tree object, mirroring Spark's per-task closure-deserialize model. Bytecode dedup stays JVM-wide via `CodeGenerator.compile`'s source-keyed cache.
- Optimizations per `(expression, input schema)`: zero-copy UTF8 reads on `VarCharVector`, non-nullable `isNullAt` elision, decimal short-precision fast path on both sides, UTF8 on-heap write shortcut, pre-sized variable-length output buffers, `NullIntolerant` short-circuit, non-nullable output short-circuit, nullable-element elision on array / map writes, subexpression elimination. Complex-type output writes hoist `getChildByOrdinal` and casts into once-per-batch setup so the per-row body has no runtime type dispatch and no redundant casts. In-code TODOs flag three further optimizations the input side has and the output side does not yet (UTF8 inline-unsafe write, cached write-buffer addresses, nested var-width sizing).
- `canHandle` mirrors WSCG's `spark.sql.codegen.maxFields` gate by counting nested input fields plus the output field and refusing once the total exceeds the configured cap. Comet has no mid-execution fallback, so the gate fires at plan time rather than letting an oversized kernel reach Janino.
- `CometBatchKernelCodegen.compile` logs the formatted Java source via `CodeFormatter.format` when compile throws, matching WSCG's diagnostics shape.
- Per-task instance cache: `ConcurrentHashMap<Long, ConcurrentHashMap<String, CometUDF>>` keyed by `(taskAttemptId, className)` with a `TaskCompletionListener` evicting the per-task entry. Invariant to Tokio work-stealing across batches: a task that migrates between workers still sees the same instance.
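The per-task instance cache described above can be sketched with plain JDK types. `Udf`, `PerTaskUdfCache`, and the method names below are hypothetical stand-ins, not the actual Comet classes:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class Main {
    public static void main(String[] args) {
        PerTaskUdfCache cache = new PerTaskUdfCache();
        Udf a1 = cache.getOrCreate(1L, "MyUdf", name -> x -> x);
        Udf a2 = cache.getOrCreate(1L, "MyUdf", name -> x -> x);
        Udf b = cache.getOrCreate(2L, "MyUdf", name -> x -> x);
        System.out.println(a1 == a2); // same task attempt: instance reused
        System.out.println(a1 == b);  // different task attempt: fresh instance
        cache.evictTask(1L);          // what a TaskCompletionListener would do
        System.out.println(cache.taskCount());
    }
}

// Hypothetical stand-in for the CometUDF interface (illustrative name only).
interface Udf {
    Object apply(Object input);
}

// Two-level cache: taskAttemptId -> (className -> instance). Keying on the
// task attempt id (not the worker thread) is what makes the cache invariant
// to work-stealing: a migrated task still resolves to the same instance.
final class PerTaskUdfCache {
    private final ConcurrentHashMap<Long, ConcurrentHashMap<String, Udf>> cache =
        new ConcurrentHashMap<>();

    Udf getOrCreate(long taskAttemptId, String className, Function<String, Udf> factory) {
        return cache
            .computeIfAbsent(taskAttemptId, id -> new ConcurrentHashMap<>())
            .computeIfAbsent(className, factory);
    }

    // Drop every per-task instance at once on task completion.
    void evictTask(long taskAttemptId) { cache.remove(taskAttemptId); }

    int taskCount() { return cache.size(); }
}
```

`computeIfAbsent` makes both lookups atomic per map, so two batches of the same task racing on first use still construct at most one instance.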
- Cache invariants are asserted throughout (single listener registration, non-null cache, reflective-instantiate success, `TaskContext` install effect).
- `CometScalaUDF` routes any `ScalaUDF` whose tree passes `CometBatchKernelCodegen.canHandle`. Proto build is inlined; no other expressions adopt the dispatcher in this PR.
- Arrow plumbing: `Utils.toArrowField` and `Field.createVector` for every output type. The input spec derives Spark `DataType`s via `Utils.fromArrowField`. Exception paths close partially allocated vectors to avoid leaks. The Arrow `Field` is computed once per `(expression, schema)` cache entry rather than per batch.
- Docs: `docs/source/user-guide/latest/jvm_udf_dispatch.md` covers the on/off config, supported and unsupported types (including HOF support and the `maxFields` ceiling), behavior notes, and the cross-query recompile caveat. Architecture lives in Scaladoc on `CometScalaUDFCodegen` and `CometBatchKernelCodegen`; in-code `TODO`s carry the open items.

## How are these changes tested?
- `CometCodegenSourceSuite`: generated-source assertions for every optimization, every complex-type shape, the null-guard contract on nested reference-typed getters (positive and negative cases per Struct / Array / Map element and field), and a `CacheKey` discrimination test asserting `(bytes, specs)` keys differ on `ArrowColumnSpec.nullable`.
- `CometCodegenDispatchSmokeSuite`: end-to-end correctness across the scalar and complex type surface (primitives, binary input and output, decimal precision boundaries, date / timestamp / timestampNTZ, array / struct / map round-trips including nested shapes and primitive-keyed maps), composed UDF trees, subquery reuse, `TaskContext` propagation, per-task cache eviction across sequential runs, kernel-cache reuse across batches of one query, `ScalaUDF` as a child of a native Spark expression, plus a `spark.sql.codegen.maxFields` gate test asserting plan-time fallback. Regression tests pin the null-guard contract via `array_max(flatten(arr))` over `Array<Array<Binary>>`/`Array<Array<String>>`/`Array<Array<DECIMAL(10,2)>>`/`Array<Array<DECIMAL(30,2)>>` with null inner elements, and `array_distinct` over `Array<Struct>` for the allocate-fresh identity contract.
- `CometCodegenHOFSuite`: HOF regressions covering `ArrayTransform` over `Array<Int>`, `array_max(transform(...))`, `array_max(filter(...))`, and a per-task isolation regression that runs the same HOF query twice and asserts each matches Spark (guards against the JVM-wide-cache race that would have concurrent partitions clobber `NamedLambdaVariable.value`).
- `CometCodegenDispatchFuzzSuite`: schema-driven fuzz over random Parquet files. Identity ScalaUDF on every primitive column; cardinality probe on every complex column (arrays, maps, structs); per-column `array_max` element fuzz over `Array<primitive>`; per-column `array_max(flatten(...))` element fuzz over `Array<Array<primitive>>`; `array_max(map_keys/map_values(...))` element fuzz over `Map<primitive, primitive>`; `array_distinct` element fuzz over `Array<Struct<primitives>>`; randomized decimal identity fuzz across the `MAX_LONG_DIGITS = 18` boundary at varying null densities.
- `CometScalaUDFCompositionBenchmark`: Spark vs Comet with the dispatcher enabled vs disabled, over three composed-UDF shapes.
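For reference, the plan-time `maxFields`-style gate exercised above amounts to a recursive field count plus a threshold check. A minimal sketch over a toy type model follows; the exact counting rule and every name here are assumptions for illustration, not Comet's implementation:

```java
import java.util.Arrays;
import java.util.List;

public class Main {
    // Recursive field count: each node contributes one field plus its
    // children. (Whether containers count themselves is an assumption here;
    // this only illustrates the shape of the gate.)
    static int countFields(Type t) {
        if (t instanceof Primitive) return 1;
        if (t instanceof Array) return 1 + countFields(((Array) t).element);
        int n = 1;
        for (Type f : ((Struct) t).fields) n += countFields(f);
        return n;
    }

    // Plan-time gate: refuse once nested input fields plus the output field
    // exceed the cap, instead of letting an oversized kernel reach Janino.
    static boolean canHandle(List<Type> inputs, Type output, int maxFields) {
        int total = countFields(output);
        for (Type in : inputs) total += countFields(in);
        return total <= maxFields;
    }

    public static void main(String[] args) {
        Type wide = new Struct(new Primitive(), new Primitive(), new Array(new Primitive()));
        System.out.println(countFields(wide));                             // 1 + 1 + 1 + 2
        System.out.println(canHandle(List.of(wide), new Primitive(), 10)); // under the cap
        System.out.println(canHandle(List.of(wide), new Primitive(), 5));  // over the cap
    }
}

// Toy stand-ins for Spark DataTypes (illustrative names only).
abstract class Type {}
final class Primitive extends Type {}
final class Array extends Type {
    final Type element;
    Array(Type element) { this.element = element; }
}
final class Struct extends Type {
    final List<Type> fields;
    Struct(Type... fields) { this.fields = Arrays.asList(fields); }
}
```

Because the check runs over the logical types at plan time, an oversized `ScalaUDF` falls back to Spark for that operator rather than failing mid-execution.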