
feat: ScalaUDF and Java UDF support via Janino codegen#4267

Draft
mbutrovich wants to merge 67 commits into apache:main from mbutrovich:codegen_scala_udf

Conversation

@mbutrovich (Contributor) commented May 8, 2026

Which issue does this PR close?

Closes #.

Rationale for this change

#4232 merged the JVM UDF bridge. This PR adds a codegen dispatcher on top: a CometUDF (CometScalaUDFCodegen) that compiles a specialized batch kernel per bound ScalaUDF expression and input schema via Janino. Without this path, any plan containing a ScalaUDF falls back to Spark for the enclosing operator, losing native execution on the surrounding plan.
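The shape of the specialized per-row loop such a kernel compiles down to can be sketched in plain Java. This is illustrative only: the class and method names are hypothetical, and plain arrays stand in for Arrow vectors, which is not Comet's actual API.

```java
// Sketch of the kind of batch kernel a codegen dispatcher might emit for a
// ScalaUDF of type (long) -> long over one input column. KernelSketch,
// processBatch, and the array stand-ins for Arrow vectors are all
// illustrative, not Comet's real generated code.
import java.util.function.LongUnaryOperator;

public class KernelSketch {

    // A "compiled" kernel: fixed input/output types, one tight loop,
    // no per-row type dispatch.
    static void processBatch(long[] input, boolean[] inputNull,
                             long[] output, boolean[] outputNull,
                             LongUnaryOperator udf) {
        for (int row = 0; row < input.length; row++) {
            if (inputNull[row]) {           // NullIntolerant short-circuit
                outputNull[row] = true;
                continue;
            }
            output[row] = udf.applyAsLong(input[row]);
            outputNull[row] = false;
        }
    }

    public static void main(String[] args) {
        long[] out = new long[3];
        boolean[] outNull = new boolean[3];
        processBatch(new long[]{1, 2, 3}, new boolean[]{false, true, false},
                     out, outNull, x -> x * 10);
        System.out.println(out[0] + "," + outNull[1] + "," + out[2]); // 10,true,30
    }
}
```

The point of generating one such class per (expression, input schema) is that the loop body is monomorphic: the JIT sees concrete types and no branching on DataType per row.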

The dispatcher is one of potentially many CometUDF implementations the bridge can route to. Hand-written CometUDFs for specific expression families (e.g. regex in #4239, JSON in #4305) remain a parallel path; the bridge dispatches by class name from the proto and does not require everything to go through the dispatcher.

Benefits:

  • Any ScalaUDF whose argument and return types are in the supported surface routes through native without a hand-written CometUDF.
  • The dispatcher binds the entire ScalaUDF argument tree, so Catalyst sub-expressions inside the UDF (upper(s), concat(c1, c2), monotonically_increasing_id(), higher-order functions like transform / filter / array_max) compile into the same per-row loop as the user function.
  • Surrounding native operators stay native; the UDF is no longer a whole-operator fallback boundary.

Gated by spark.comet.exec.scalaUDF.codegen.enabled (default true). When disabled, plans containing a ScalaUDF fall back to Spark for that operator.
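For example, the dispatcher can be disabled per job; the spark-submit form below is illustrative (the flag name is the one introduced in this PR):

```shell
# Disable the codegen dispatcher; plans containing a ScalaUDF then fall back
# to Spark for the enclosing operator.
spark-submit \
  --conf spark.comet.exec.scalaUDF.codegen.enabled=false \
  my-app.jar
```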

The CometUDF contract loosens from "should be stateless" to "may hold per-task state in fields." One instance per Spark task attempt per class, reused across all batches of the task, dropped on task completion. Per-instance access is single-threaded because Spark runs one native future per partition and Tokio polls one future per worker at a time.
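The per-task instance model can be sketched as a two-level concurrent map keyed by (taskAttemptId, className). The class and method names below are placeholders, and the real bridge evicts via a Spark TaskCompletionListener rather than an explicit call:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Sketch of a (taskAttemptId, className) -> instance cache, mirroring the
// two-level ConcurrentHashMap shape this PR describes. PerTaskCache and
// UdfLike are placeholders; the real bridge drops the per-task entry from a
// TaskCompletionListener rather than an explicit evictTask call.
public class PerTaskCache {
    interface UdfLike {}

    private final ConcurrentHashMap<Long, ConcurrentHashMap<String, UdfLike>> cache =
        new ConcurrentHashMap<>();

    // One instance per (task attempt, class), reused across all batches of
    // the task; safe under Tokio work-stealing because the key is the task,
    // not the worker thread.
    UdfLike getOrCreate(long taskAttemptId, String className, Supplier<UdfLike> factory) {
        return cache
            .computeIfAbsent(taskAttemptId, id -> new ConcurrentHashMap<>())
            .computeIfAbsent(className, name -> factory.get());
    }

    // On task completion: drop every instance belonging to the task.
    void evictTask(long taskAttemptId) {
        cache.remove(taskAttemptId);
    }
}
```

Because lookups go through computeIfAbsent, a task that migrates between workers mid-query still resolves to the same instance, which is the invariant the per-task state contract relies on.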

What changes are included in this PR?

  • Generic codegen infrastructure under org.apache.comet.codegen: CometBatchKernelCodegen (orchestrator) + CometBatchKernelCodegenInput / CometBatchKernelCodegenOutput (per-side emission) + CometBatchKernel Java base + CometInternalRow / CometArrayData / CometMapData shim bases + CometSpecializedGettersDispatch for shared get(ordinal, dataType) dispatch. The framework is generic over Catalyst expressions; today's only consumer is the ScalaUDF dispatcher.
  • ScalaUDF dispatcher under org.apache.comet.udf.codegen: CometScalaUDFCodegen (bridge entry, per-task compile cache, per-partition kernel state).
  • Complex type support: ArrayType, StructType, and MapType as both input and output, including arbitrary nesting. Sealed ArrowColumnSpec plus recursive nested-class emission. Each generated InputArray_* / InputStruct_* / InputMap_* instance is allocated fresh per getArray(i) / getStruct(i, n) / getMap(i) call with final slice fields, matching Spark's ColumnarRow / ColumnarArray model. Allocate-fresh keeps retain-by-reference consumers (e.g. ArrayDistinct.nullSafeEval stashing references in an OpenHashSet) correct without giving up lazy reads.
  • Null-handling contract on nested reference-typed getters (getStruct / getArray / getMap / getDecimal / getUTF8String / getBinary inside InputArray_* and InputStruct_*): when the element / field is nullable, the emitter prepends if (isNullAt(...)) return null; so consumers like Flatten.doGenCode (which use CodeGenerator.setArrayElement and skip the caller-side isNullAt check on reference types) don't store non-null shells / empty bytes / garbage decimals where Spark would store null. Elided when the spec says non-nullable.
  • Higher-order function support: canHandle admits HigherOrderFunction / LambdaFunction / NamedLambdaVariable despite their CodegenFallback mixin. CodegenFallback.doGenCode emits a single ((Expression) references[N]).eval(row) call site per HOF; the kernel dispatches to Expression.eval(InternalRow), which iterates the array, mutates NamedLambdaVariable.value's AtomicReference per element, and recursively evaluates the lambda body. Lambda-body leaf reads resolve through the kernel's typed Arrow getters since the kernel is an InternalRow. Cost model: per-row interpreted-eval inside the HOF subtree; surrounding native operators stay native, surrounding non-HOF expressions stay codegen.
  • Per-task scope on the dispatcher's compile cache. The deserialized boundExpr (and any in-tree mutable state, notably NamedLambdaVariable.value for HOFs) lives on the per-task CometScalaUDFCodegen instance, not on the companion. Concurrent partitions running the same query never share an expression-tree object, mirroring Spark's per-task closure-deserialize model. Bytecode dedup stays JVM-wide via CodeGenerator.compile's source-keyed cache.
  • Optimization set applied per (expression, input schema): zero-copy UTF8 reads on VarCharVector, non-nullable isNullAt elision, decimal short-precision fast path on both sides, UTF8 on-heap write shortcut, pre-sized variable-length output buffers, NullIntolerant short-circuit, non-nullable output short-circuit, nullable-element elision on array / map writes, subexpression elimination. Complex-type output writes hoist getChildByOrdinal and cast to once-per-batch setup so the per-row body has no runtime type dispatch and no redundant casts. In-code TODOs flag three further optimizations the input side has and the output side does not yet (UTF8 inline-unsafe write, cached write-buffer addresses, nested var-width sizing).
  • Method-size guardrail: canHandle mirrors WSCG's spark.sql.codegen.maxFields gate by counting nested input fields plus the output field and refusing once the total exceeds the configured cap. Comet has no mid-execution fallback, so the gate fires at plan time rather than letting an oversized kernel reach Janino.
  • Failing-source dump on Janino exception: CometBatchKernelCodegen.compile logs the formatted Java source via CodeFormatter.format when compile throws, matching WSCG's diagnostics shape.
  • Bridge instance cache: ConcurrentHashMap<Long, ConcurrentHashMap<String, CometUDF>> keyed by (taskAttemptId, className) with a TaskCompletionListener evicting the per-task entry. Invariant to Tokio work-stealing across batches: a task that migrates between workers still sees the same instance. Assertions on every invariant (single listener registration, non-null cache, reflective-instantiate success, TaskContext install effect).
  • Serde routing: CometScalaUDF routes any ScalaUDF whose tree passes CometBatchKernelCodegen.canHandle. Proto build is inlined; no other expressions adopt the dispatcher in this PR.
  • Allocation reuses Utils.toArrowField and Field.createVector for every output type. Input spec derives Spark DataTypes via Utils.fromArrowField. Exception paths close partially allocated vectors to avoid leaks. The Arrow Field is computed once per (expression, schema) cache entry rather than per batch.
  • User guide page docs/source/user-guide/latest/jvm_udf_dispatch.md covers the on/off config, supported and unsupported types (including HOF support and the maxFields ceiling), behavior notes, and the cross-query recompile caveat. Architecture lives in Scaladoc on CometScalaUDFCodegen and CometBatchKernelCodegen; in-code TODOs carry the open items.
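As one isolated illustration of the method-size guardrail in the list above, a plan-time field-count gate might look like the sketch below. FieldNode is a hypothetical stand-in for a Catalyst DataType tree; the real check walks Spark types and reads the cap from spark.sql.codegen.maxFields.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a plan-time method-size gate: count every nested input field plus
// the output field and refuse once the total exceeds the configured cap.
// FieldNode and canHandle here are illustrative stand-ins for Comet's actual
// Catalyst-type walk; the cap would come from spark.sql.codegen.maxFields.
public class MaxFieldsGate {
    static class FieldNode {
        final List<FieldNode> children = new ArrayList<>();
        FieldNode(FieldNode... kids) {
            for (FieldNode k : kids) children.add(k);
        }
    }

    // A field counts itself plus all nested fields (struct fields, array
    // elements, map keys/values would each contribute here).
    static int countFields(FieldNode node) {
        int total = 1;
        for (FieldNode child : node.children) total += countFields(child);
        return total;
    }

    // Fires at plan time: with no mid-execution fallback, an oversized kernel
    // must never reach the compiler.
    static boolean canHandle(List<FieldNode> inputs, FieldNode output, int maxFields) {
        int total = countFields(output);
        for (FieldNode in : inputs) total += countFields(in);
        return total <= maxFields;
    }
}
```

A struct with two leaf fields counts as three (itself plus its children), so a single such input plus a scalar output totals four and passes only when the cap is at least four.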

How are these changes tested?

  • CometCodegenSourceSuite: generated-source assertions for every optimization, every complex-type shape, the null-guard contract on nested reference-typed getters (positive and negative cases per Struct / Array / Map element and Struct / Array / Map field), and a CacheKey discrimination test asserting (bytes, specs) keys differ on ArrowColumnSpec.nullable.
  • CometCodegenDispatchSmokeSuite: end-to-end correctness across the scalar and complex type surface (primitives, binary input and output, decimal precision boundaries, date / timestamp / timestampNTZ, array / struct / map round-trips including nested shapes and primitive-keyed maps), composed UDF trees, subquery reuse, TaskContext propagation, per-task cache eviction across sequential runs, kernel-cache reuse across batches of one query, ScalaUDF as a child of a native Spark expression, plus a spark.sql.codegen.maxFields gate test asserting plan-time fallback. Plus regression tests pinning the null-guard contract via array_max(flatten(arr)) over Array<Array<Binary>> / Array<Array<String>> / Array<Array<DECIMAL(10,2)>> / Array<Array<DECIMAL(30,2)>> with null inner elements, and array_distinct over Array<Struct> for the allocate-fresh identity contract.
  • CometCodegenHOFSuite: HOF regressions covering ArrayTransform over Array<Int>, array_max(transform(...)), array_max(filter(...)), and a per-task isolation regression that runs the same HOF query twice and asserts each matches Spark (guards against the JVM-wide-cache race that would have concurrent partitions clobber NamedLambdaVariable.value).
  • CometCodegenDispatchFuzzSuite: schema-driven fuzz over random parquet files. Identity ScalaUDF on every primitive column; cardinality probe on every complex column (arrays, maps, structs); per-column array_max element fuzz over Array<primitive>; per-column array_max(flatten(...)) element fuzz over Array<Array<primitive>>; array_max(map_keys/map_values(...)) element fuzz over Map<primitive, primitive>; array_distinct element fuzz over Array<Struct<primitives>>; randomized decimal identity fuzz across the MAX_LONG_DIGITS=18 boundary at varying null densities.
  • CometScalaUDFCompositionBenchmark: Spark vs Comet with the dispatcher enabled vs disabled, over three composed-UDF shapes.

@mbutrovich (Contributor, Author)

There are about four Spark SQL test failures that look like they might need updating, but otherwise it's looking good. I won't worry about them until we discuss moving forward.

@mbutrovich mbutrovich changed the title feat: Arrow-direct codegen dispatcher for Spark expressions and Scala UDFs feat: add ScalaUDF support via a codegen dispatcher May 14, 2026
@mbutrovich mbutrovich moved this from Todo to In progress in Comet Development May 14, 2026
@mbutrovich mbutrovich changed the title feat: add ScalaUDF support via a codegen dispatcher feat: ScalaUDF and Java UDF support via Janino codegen May 15, 2026
