
feat: ScalaUDF and Java UDF support via Janino codegen#4267

Draft
mbutrovich wants to merge 67 commits into apache:main from mbutrovich:codegen_scala_udf

Conversation

@mbutrovich (Contributor) commented May 8, 2026

Which issue does this PR close?

Closes #.

Rationale for this change

#4232 merged the JVM UDF bridge. This PR adds a codegen dispatcher on top: a CometUDF (CometScalaUDFCodegen) that compiles a specialized batch kernel per bound ScalaUDF expression and input schema via Janino. Without this path, any plan containing a ScalaUDF falls back to Spark for the enclosing operator, losing native execution on the surrounding plan.
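The shape of the specialized per-row loop such a kernel compiles down to can be sketched in plain Java. This is illustrative only: the class and method names are hypothetical, and plain arrays stand in for Arrow vectors, which is not Comet's actual API.

```java
// Sketch of the kind of batch kernel a codegen dispatcher might emit for a
// ScalaUDF of type (long) -> long over one input column. KernelSketch,
// processBatch, and the array stand-ins for Arrow vectors are all
// illustrative, not Comet's real generated code.
import java.util.function.LongUnaryOperator;

public class KernelSketch {

    // A "compiled" kernel: fixed input/output types, one tight loop,
    // no per-row type dispatch.
    static void processBatch(long[] input, boolean[] inputNull,
                             long[] output, boolean[] outputNull,
                             LongUnaryOperator udf) {
        for (int row = 0; row < input.length; row++) {
            if (inputNull[row]) {           // NullIntolerant short-circuit
                outputNull[row] = true;
                continue;
            }
            output[row] = udf.applyAsLong(input[row]);
            outputNull[row] = false;
        }
    }

    public static void main(String[] args) {
        long[] out = new long[3];
        boolean[] outNull = new boolean[3];
        processBatch(new long[]{1, 2, 3}, new boolean[]{false, true, false},
                     out, outNull, x -> x * 10);
        System.out.println(out[0] + "," + outNull[1] + "," + out[2]); // 10,true,30
    }
}
```

The point of generating one such class per (expression, input schema) is that the loop body is monomorphic: the JIT sees concrete types and no branching on DataType per row.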

The dispatcher is one of potentially many CometUDF implementations the bridge can route to. Hand-written CometUDFs for specific expression families (e.g. regex in #4239, JSON in #4305) remain a parallel path; the bridge dispatches by class name from the proto and does not require everything to go through the dispatcher.

Benefits:

  • Any ScalaUDF whose argument and return types are in the supported surface routes through native without a hand-written CometUDF.
  • The dispatcher binds the entire ScalaUDF argument tree, so Catalyst sub-expressions inside the UDF (upper(s), concat(c1, c2), monotonically_increasing_id(), higher-order functions like transform / filter / array_max) compile into the same per-row loop as the user function.
  • Surrounding native operators stay native; the UDF is no longer a whole-operator fallback boundary.

Gated by spark.comet.exec.scalaUDF.codegen.enabled (default true). When disabled, plans containing a ScalaUDF fall back to Spark for that operator.
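For example, the dispatcher can be disabled per job; the spark-submit form below is illustrative (the flag name is the one introduced in this PR):

```shell
# Disable the codegen dispatcher; plans containing a ScalaUDF then fall back
# to Spark for the enclosing operator.
spark-submit \
  --conf spark.comet.exec.scalaUDF.codegen.enabled=false \
  my-app.jar
```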

The CometUDF contract loosens from "should be stateless" to "may hold per-task state in fields." One instance per Spark task attempt per class, reused across all batches of the task, dropped on task completion. Per-instance access is single-threaded because Spark runs one native future per partition and Tokio polls one future per worker at a time.
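The per-task instance model can be sketched as a two-level concurrent map keyed by (taskAttemptId, className). The class and method names below are placeholders, and the real bridge evicts via a Spark TaskCompletionListener rather than an explicit call:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Sketch of a (taskAttemptId, className) -> instance cache, mirroring the
// two-level ConcurrentHashMap shape this PR describes. PerTaskCache and
// UdfLike are placeholders; the real bridge drops the per-task entry from a
// TaskCompletionListener rather than an explicit evictTask call.
public class PerTaskCache {
    interface UdfLike {}

    private final ConcurrentHashMap<Long, ConcurrentHashMap<String, UdfLike>> cache =
        new ConcurrentHashMap<>();

    // One instance per (task attempt, class), reused across all batches of
    // the task; safe under Tokio work-stealing because the key is the task,
    // not the worker thread.
    UdfLike getOrCreate(long taskAttemptId, String className, Supplier<UdfLike> factory) {
        return cache
            .computeIfAbsent(taskAttemptId, id -> new ConcurrentHashMap<>())
            .computeIfAbsent(className, name -> factory.get());
    }

    // On task completion: drop every instance belonging to the task.
    void evictTask(long taskAttemptId) {
        cache.remove(taskAttemptId);
    }
}
```

Because lookups go through computeIfAbsent, a task that migrates between workers mid-query still resolves to the same instance, which is the invariant the per-task state contract relies on.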

What changes are included in this PR?

  • Generic codegen infrastructure under org.apache.comet.codegen: CometBatchKernelCodegen (orchestrator) + CometBatchKernelCodegenInput / CometBatchKernelCodegenOutput (per-side emission) + CometBatchKernel Java base + CometInternalRow / CometArrayData / CometMapData shim bases + CometSpecializedGettersDispatch for shared get(ordinal, dataType) dispatch. The framework is generic over Catalyst expressions; today's only consumer is the ScalaUDF dispatcher.
  • ScalaUDF dispatcher under org.apache.comet.udf.codegen: CometScalaUDFCodegen (bridge entry, per-task compile cache, per-partition kernel state).
  • Complex type support: ArrayType, StructType, and MapType as both input and output, including arbitrary nesting. Sealed ArrowColumnSpec plus recursive nested-class emission. Each generated InputArray_* / InputStruct_* / InputMap_* instance is allocated fresh per getArray(i) / getStruct(i, n) / getMap(i) call with final slice fields, matching Spark's ColumnarRow / ColumnarArray model. Allocate-fresh keeps retain-by-reference consumers (e.g. ArrayDistinct.nullSafeEval stashing references in an OpenHashSet) correct without giving up lazy reads.
  • Null-handling contract on nested reference-typed getters (getStruct / getArray / getMap / getDecimal / getUTF8String / getBinary inside InputArray_* and InputStruct_*): when the element / field is nullable, the emitter prepends if (isNullAt(...)) return null; so consumers like Flatten.doGenCode (which use CodeGenerator.setArrayElement and skip the caller-side isNullAt check on reference types) don't store non-null shells / empty bytes / garbage decimals where Spark would store null. Elided when the spec says non-nullable.
  • Higher-order function support: canHandle admits HigherOrderFunction / LambdaFunction / NamedLambdaVariable despite their CodegenFallback mixin. CodegenFallback.doGenCode emits a single ((Expression) references[N]).eval(row) call site per HOF; the kernel dispatches to Expression.eval(InternalRow), which iterates the array, mutates NamedLambdaVariable.value's AtomicReference per element, and recursively evaluates the lambda body. Lambda-body leaf reads resolve through the kernel's typed Arrow getters since the kernel is an InternalRow. Cost model: per-row interpreted-eval inside the HOF subtree; surrounding native operators stay native, surrounding non-HOF expressions stay codegen.
  • Per-task scope on the dispatcher's compile cache. The deserialized boundExpr (and any in-tree mutable state, notably NamedLambdaVariable.value for HOFs) lives on the per-task CometScalaUDFCodegen instance, not on the companion. Concurrent partitions running the same query never share an expression-tree object, mirroring Spark's per-task closure-deserialize model. Bytecode dedup stays JVM-wide via CodeGenerator.compile's source-keyed cache.
  • Optimization set applied per (expression, input schema): zero-copy UTF8 reads on VarCharVector, non-nullable isNullAt elision, decimal short-precision fast path on both sides, UTF8 on-heap write shortcut, pre-sized variable-length output buffers, NullIntolerant short-circuit, non-nullable output short-circuit, nullable-element elision on array / map writes, subexpression elimination. Complex-type output writes hoist getChildByOrdinal and cast to once-per-batch setup so the per-row body has no runtime type dispatch and no redundant casts. In-code TODOs flag three further optimizations the input side has and the output side does not yet (UTF8 inline-unsafe write, cached write-buffer addresses, nested var-width sizing).
  • Method-size guardrail: canHandle mirrors WSCG's spark.sql.codegen.maxFields gate by counting nested input fields plus the output field and refusing once the total exceeds the configured cap. Comet has no mid-execution fallback, so the gate fires at plan time rather than letting an oversized kernel reach Janino.
  • Failing-source dump on Janino exception: CometBatchKernelCodegen.compile logs the formatted Java source via CodeFormatter.format when compile throws, matching WSCG's diagnostics shape.
  • Bridge instance cache: ConcurrentHashMap<Long, ConcurrentHashMap<String, CometUDF>> keyed by (taskAttemptId, className) with a TaskCompletionListener evicting the per-task entry. Invariant to Tokio work-stealing across batches: a task that migrates between workers still sees the same instance. Assertions on every invariant (single listener registration, non-null cache, reflective-instantiate success, TaskContext install effect).
  • Serde routing: CometScalaUDF routes any ScalaUDF whose tree passes CometBatchKernelCodegen.canHandle. Proto build is inlined; no other expressions adopt the dispatcher in this PR.
  • Allocation reuses Utils.toArrowField and Field.createVector for every output type. Input spec derives Spark DataTypes via Utils.fromArrowField. Exception paths close partially allocated vectors to avoid leaks. The Arrow Field is computed once per (expression, schema) cache entry rather than per batch.
  • User guide page docs/source/user-guide/latest/jvm_udf_dispatch.md covers the on/off config, supported and unsupported types (including HOF support and the maxFields ceiling), behavior notes, and the cross-query recompile caveat. Architecture lives in Scaladoc on CometScalaUDFCodegen and CometBatchKernelCodegen; in-code TODOs carry the open items.
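As one isolated illustration of the method-size guardrail in the list above, a plan-time field-count gate might look like the sketch below. FieldNode is a hypothetical stand-in for a Catalyst DataType tree; the real check walks Spark types and reads the cap from spark.sql.codegen.maxFields.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a plan-time method-size gate: count every nested input field plus
// the output field and refuse once the total exceeds the configured cap.
// FieldNode and canHandle here are illustrative stand-ins for Comet's actual
// Catalyst-type walk; the cap would come from spark.sql.codegen.maxFields.
public class MaxFieldsGate {
    static class FieldNode {
        final List<FieldNode> children = new ArrayList<>();
        FieldNode(FieldNode... kids) {
            for (FieldNode k : kids) children.add(k);
        }
    }

    // A field counts itself plus all nested fields (struct fields, array
    // elements, map keys/values would each contribute here).
    static int countFields(FieldNode node) {
        int total = 1;
        for (FieldNode child : node.children) total += countFields(child);
        return total;
    }

    // Fires at plan time: with no mid-execution fallback, an oversized kernel
    // must never reach the compiler.
    static boolean canHandle(List<FieldNode> inputs, FieldNode output, int maxFields) {
        int total = countFields(output);
        for (FieldNode in : inputs) total += countFields(in);
        return total <= maxFields;
    }
}
```

A struct with two leaf fields counts as three (itself plus its children), so a single such input plus a scalar output totals four and passes only when the cap is at least four.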

How are these changes tested?

  • CometCodegenSourceSuite: generated-source assertions for every optimization, every complex-type shape, the null-guard contract on nested reference-typed getters (positive and negative cases per Struct / Array / Map element and Struct / Array / Map field), and a CacheKey discrimination test asserting (bytes, specs) keys differ on ArrowColumnSpec.nullable.
  • CometCodegenDispatchSmokeSuite: end-to-end correctness across the scalar and complex type surface (primitives, binary input and output, decimal precision boundaries, date / timestamp / timestampNTZ, array / struct / map round-trips including nested shapes and primitive-keyed maps), composed UDF trees, subquery reuse, TaskContext propagation, per-task cache eviction across sequential runs, kernel-cache reuse across batches of one query, ScalaUDF as a child of a native Spark expression, plus a spark.sql.codegen.maxFields gate test asserting plan-time fallback. Plus regression tests pinning the null-guard contract via array_max(flatten(arr)) over Array<Array<Binary>> / Array<Array<String>> / Array<Array<DECIMAL(10,2)>> / Array<Array<DECIMAL(30,2)>> with null inner elements, and array_distinct over Array<Struct> for the allocate-fresh identity contract.
  • CometCodegenHOFSuite: HOF regressions covering ArrayTransform over Array<Int>, array_max(transform(...)), array_max(filter(...)), and a per-task isolation regression that runs the same HOF query twice and asserts each matches Spark (guards against the JVM-wide-cache race that would have concurrent partitions clobber NamedLambdaVariable.value).
  • CometCodegenDispatchFuzzSuite: schema-driven fuzz over random parquet files. Identity ScalaUDF on every primitive column; cardinality probe on every complex column (arrays, maps, structs); per-column array_max element fuzz over Array<primitive>; per-column array_max(flatten(...)) element fuzz over Array<Array<primitive>>; array_max(map_keys/map_values(...)) element fuzz over Map<primitive, primitive>; array_distinct element fuzz over Array<Struct<primitives>>; randomized decimal identity fuzz across the MAX_LONG_DIGITS=18 boundary at varying null densities.
  • CometScalaUDFCompositionBenchmark: Spark vs Comet with the dispatcher enabled vs disabled, over three composed-UDF shapes.

@mbutrovich (Contributor, Author)

There are about four Spark SQL test failures that look like they might need updating, but otherwise it's looking good. I won't worry about them until we discuss moving forward.

@mbutrovich mbutrovich changed the title feat: Arrow-direct codegen dispatcher for Spark expressions and Scala UDFs feat: add ScalaUDF support via a codegen dispatcher May 14, 2026
@mbutrovich mbutrovich moved this from Todo to In progress in Comet Development May 14, 2026
@mbutrovich mbutrovich changed the title feat: add ScalaUDF support via a codegen dispatcher feat: ScalaUDF and Java UDF support via Janino codegen May 15, 2026
