
Publish UDF agent skills [skip ci] #15058

Merged
abellina merged 16 commits into NVIDIA:main from rishic3:aether-udf-skills on Jun 16, 2026

@rishic3 rishic3 commented Jun 11, 2026

Collaborator

Closes #15014. See the epic #14977 for the next phases.

Description

Publishes the UDF skills, docs, and tests. The tests are yet to be wired up to CI, which is planned in #15013.

Installation from this branch can be tested like so:

npx skills add "https://github.com/rishic3/spark-rapids/tree/aether-udf-skills"

Once merged, the following will work:

npx skills add NVIDIA/spark-rapids

Note that only directories with a SKILL.md file will be detected as skills.
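
To make the detection rule concrete, here is a minimal Java sketch of a scan that keeps only directories containing a `SKILL.md` file. It is not the `skills` CLI's implementation: the class `SkillScan`, the method `findSkillDirs`, and the choice to search recursively are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of the skills CLI.
final class SkillScan {
    /** Returns every directory under root that directly contains a SKILL.md file. */
    static List<Path> findSkillDirs(Path root) {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                .filter(Files::isDirectory)
                .filter(dir -> Files.isRegularFile(dir.resolve("SKILL.md")))
                .sorted()
                .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path root = Paths.get(args.length > 0 ? args[0] : "skills");
        if (!Files.isDirectory(root)) {
            System.out.println("Not a directory: " + root);
            return;
        }
        findSkillDirs(root).forEach(System.out::println);
    }
}
```

Under this rule a directory such as `skills/docs/` that has no `SKILL.md` would simply be skipped.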

Changes taking effect outside of skills/ are:

  • updating the root pom.xml to ignore skills/ in RAT and scalastyle checks
  • ignoring template pom.xml files under skills/ in make-scala-version-build-files.sh
  • updating the root LICENSE with the dual CC-BY-4.0 and Apache 2.0 license

Checklists

Documentation

  • Updated for new or modified user-facing features or behaviors
  • No user-facing change

Testing

  • Added or modified tests to cover new code paths
  • Covered by existing tests
    (Please provide the names of the existing tests in the PR description.)
  • Not required

Performance

  • Tests ran and results are added in the PR description
  • Issue filed with a link in the PR description
  • Not required

@rishic3 rishic3 requested a review from a team as a code owner June 11, 2026 03:05
Signed-off-by: Rishi Chandra <rishic@nvidia.com>
@rishic3 rishic3 changed the title from "Publish UDF agent skills" to "Publish UDF agent skills [skip ci]" Jun 11, 2026
greptile-apps Bot commented Jun 11, 2026

Contributor

Greptile Summary

This PR publishes the UDF agent skills — a collection of AI-assistant skill packages (SKILL.md-based) covering SQL conversion, cuDF/CUDA conversion, benchmarking, and test generation — along with dual-licensed documentation/examples and the build changes needed to keep the main Maven project from processing the skills subtree.

  • Build wiring: pom.xml and scala2.13/pom.xml exclude skills/** from RAT and scalastyle checks; make-scala-version-build-files.sh skips skills/ pom templates to prevent them from being transformed for Scala 2.13.
  • Scala test templates (UnitTest, CudfComparisonTest, SqlComparisonTest): all three are missing the installMutableClassLoader() setup that the Java counterparts include and that is required to avoid a RAPIDS ShimLoader failure on Java 17 under a forked Surefire JVM; the Scala TestUtils class does not define this method at all.
  • BenchUtils.scala Scaladoc example: the executeGpu doc block shows scala.io.Source.fromFile(...).mkString without a close, while the actual template code (SqlComparisonTest.scala) correctly uses try/finally.

Confidence Score: 4/5

Safe to merge for build infrastructure and documentation; the Scala test templates distributed to users will silently fail on Java 17 with the RAPIDS plugin until the missing classloader setup is added.

The Scala test scaffolding omits the URLClassLoader setup that the Java counterpart explicitly documents as required for Java 17 / RAPIDS ShimLoader. Every Scala-template user who runs mvn test with spark.plugins = com.nvidia.spark.SQLPlugin on Java 17 will get a ShimLoader exception before a single test runs. The Java templates, build changes, and documentation are all clean.

skills/udf-gen-test/templates/scala/src/test/scala/com/udf/TestUtils.scala and all three Scala test classes (UnitTest.scala, CudfComparisonTest.scala, SqlComparisonTest.scala) need the installMutableClassLoader setup added.
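
For readers unfamiliar with the classloader issue, the following is a rough Java sketch of what a helper like `installMutableClassLoader()` might do. The body is an assumption, not the code in `TestUtils.java`: it only shows the general technique of wrapping the thread's context class loader in a `URLClassLoader`, which the application class loader stopped being as of Java 9.

```java
import java.net.URL;
import java.net.URLClassLoader;

// Illustrative only; the real TestUtils.installMutableClassLoader() may differ.
final class MutableClassLoaderSketch {
    /**
     * Ensures the current thread's context class loader is a URLClassLoader
     * and returns it.
     */
    static ClassLoader installMutableClassLoader() {
        Thread thread = Thread.currentThread();
        ClassLoader current = thread.getContextClassLoader();
        if (current instanceof URLClassLoader) {
            return current; // already suitable; leave it alone
        }
        // Empty URL list: the wrapper delegates all loading to its parent
        // but exposes the URLClassLoader type to callers that require it.
        URLClassLoader wrapper = new URLClassLoader(new URL[0], current);
        thread.setContextClassLoader(wrapper);
        return wrapper;
    }

    public static void main(String[] args) {
        ClassLoader installed = installMutableClassLoader();
        System.out.println(installed.getClass().getName());
    }
}
```

A test class would presumably call such a helper once, before the Spark session is created, so that the plugin sees the wrapped loader.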

Important Files Changed

| Filename | Overview |
| --- | --- |
| skills/udf-gen-test/templates/scala/src/main/scala/com/udf/Arm.scala | Adds ARM helpers (withResource, closeOnExcept, closeAll); closeAll now wraps each close in try/catch as requested, addressing the previous review concern. |
| skills/udf-gen-test/templates/scala/src/main/scala/com/udf/bench/MicroBenchRunner.scala | Microbenchmark runner; copyAllToHost correctly catches exceptions and calls closeAll; readParquetData and limitTable resource handling looks correct. |
| skills/udf-gen-test/templates/scala/src/test/scala/com/udf/UnitTest.scala | Scala test template missing installMutableClassLoader() setup that the Java counterpart requires for Java 17 RAPIDS ShimLoader compatibility. |
| skills/udf-gen-test/templates/scala/src/test/scala/com/udf/TestUtils.scala | Missing the installMutableClassLoader() method that Java TestUtils.java provides for Java 17 / RAPIDS ShimLoader compatibility. |
| skills/udf-gen-test/templates/scala/src/test/scala/com/udf/SqlComparisonTest.scala | scala.io.Source is now correctly closed with try/finally, addressing the previous review comment. |
| skills/udf-gen-test/templates/scala/src/main/scala/com/udf/bench/BenchUtils.scala | Scaladoc example for executeGpu shows scala.io.Source without close; actual production logic is correct stub; P2 doc issue. |
| pom.xml | Excludes skills/ from RAT license checks and scalastyle checks; correctly motivated by dual CC-BY-4.0/Apache-2.0 licensing in the skills subtree. |
| build/make-scala-version-build-files.sh | Adds a guard to skip skills/** pom.xml template files, preventing them from being processed by the Scala version build file generator. |
| skills/udf-convert-to-cudf/examples/URLDecode.java | RapidsUDF example with proper try-with-resources for GPU intermediates; correct null handling on CPU path. |
| skills/udf-gen-test/templates/java/src/main/java/com/udf/bench/MicroBenchRunner.java | Java microbenchmark template; resource management (closeAll, try-with-resources) is correct throughout. |
| skills/udf-gen-test/templates/java/src/test/java/com/udf/TestUtils.java | Includes installMutableClassLoader() with clear explanation for Java 17 RAPIDS ShimLoader compatibility; assertDataFrameEquals properly sorts and compares rows. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[User: npx skills add NVIDIA/spark-rapids] --> B[Skill discovery: SKILL.md files]
    B --> C1[udf-convert-to-sql]
    B --> C2[udf-convert-to-cudf]
    B --> C3[udf-convert-to-cuda]
    B --> C4[udf-gen-test]
    B --> C5[udf-judge-conversion]
    B --> C6[udf-optimize-cudf]
    B --> C7[udf-benchmark]
    C4 --> D1[Java template]
    C4 --> D2[Scala template]
    D1 --> E1[TestUtils.installMutableClassLoader ✅]
    D2 --> E2[TestUtils - method missing ❌]
    E1 --> F1[RAPIDS ShimLoader works on Java 17]
    E2 --> F2[RAPIDS ShimLoader fails on Java 17]

Reviews (6): Last reviewed commit: "add a note on heap size"

Comment thread skills/tests/test_export/scala_fixtures.py Outdated
Comment thread skills/tests/test_export/utils.py Outdated
Comment thread skills/docs/dev/VERSIONS.md Outdated
pxLi commented Jun 12, 2026

Member

I recommend starting with a skills-only change first. The example and test code should be added together with the CI updates in phase 1.

Refer to #14977

rishic3 commented Jun 12, 2026

Collaborator Author

> I recommend starting with a skills-only change first. The example and test code should be added together with the CI updates in phase 1.
>
> Refer to #14977

Sounds good, deferring tests/ to a follow-up. Can you clarify which examples you meant to defer? Some examples are bundled in the skill and given to the agent for reference, while the top-level examples are for users to test on example UDFs.

@sameerz sameerz added the "feature request" label Jun 12, 2026
sameerz previously approved these changes Jun 12, 2026
rishic3 commented Jun 15, 2026

Collaborator Author

build

Comment thread skills/udf-convert-to-cuda/examples/cosine_similarity.cu
Comment thread skills/udf-convert-to-cuda/references/NATIVE_BUILD_ENV.md Outdated
if (resources != null) {
    for (AutoCloseable r : resources) {
        if (r != null) {
            try { r.close(); } catch (Exception ignore) {}

Collaborator

Should we have the ignored exceptions be added to the original exception with addSuppressed?
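
One way to act on this suggestion, sketched under the assumption that `closeAll` is handed the exception that triggered the cleanup. The `original` parameter and the class name `CloseAllSketch` are hypothetical and are not the template's actual signature.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical variant of closeAll; the template's real signature may differ.
final class CloseAllSketch {
    /**
     * Closes every non-null resource. A failure while closing is attached to
     * `original` as a suppressed exception instead of being dropped.
     */
    static void closeAll(Throwable original, List<? extends AutoCloseable> resources) {
        if (resources == null) {
            return;
        }
        for (AutoCloseable r : resources) {
            if (r == null) {
                continue;
            }
            try {
                r.close();
            } catch (Exception closeFailure) {
                original.addSuppressed(closeFailure);
            }
        }
    }

    public static void main(String[] args) {
        AutoCloseable ok = () -> {};
        AutoCloseable bad = () -> { throw new IllegalStateException("close failed"); };
        RuntimeException original = new RuntimeException("copy to host failed");
        closeAll(original, Arrays.asList(ok, null, bad));
        System.out.println(original.getSuppressed().length); // prints 1
    }
}
```

With this shape the caller can rethrow `original` and still see every close failure in the stack trace under "Suppressed:".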

Comment thread skills/udf-gen-test/templates/java/src/test/java/com/udf/CudfComparisonTest.java Outdated
Comment thread skills/udf-gen-test/templates/java/run_gen_data.sh Outdated
Comment thread skills/udf-gen-test/templates/java/run_gen_data.sh Outdated
Comment thread skills/udf-convert-to-cudf/references/RAPIDS_UDF.md
Comment thread skills/udf-gen-test/templates/java/.mvn/jvm.config
* SPDX-License-Identifier: Apache-2.0
*/

package com.udf.bench;

Collaborator

It would be nice if we could reuse DBGen() here (see datagen module). If not it would be good to discuss why we can't use it.

Collaborator Author

I didn't realize it supports custom generators, GeneratorFunction looks like what we'd need. But the README says it is not published on Maven? I could file a follow-up on this.

<scala.binary.version>2.12</scala.binary.version>
<scala.version>2.12.15</scala.version>
<!-- Spark/RAPIDS versions -->
<spark.version>3.5.5</spark.version>

Collaborator

so why are we hardcoding these variables here (binary.version, spark.version, scala.version). Is it because these are meant to be replaced by the llm?

In other words, why are these poms so verbose compared to the regular module poms.

@rishic3 rishic3 Jun 15, 2026

Collaborator Author

These skill templates are meant to be standalone projects, copied out of the repo and into the user space. So we'll try to keep the version pins up to date with the latest GA release, but they could be adjusted at runtime if the user so specifies.
As I understand it, spark-rapids' root pom.xml is where most of the complexity (and all the version pins) lives, and the modules just inherit from it, which is why they are much simpler.

* @param f The function to execute with the resource
* @return The result of the function
*/
def withResource[T <: AutoCloseable, R](resource: T)(f: T => R): R = {

Collaborator

We should move Arm.* from sql-plugin to the sql-plugin-api module so we don't need to clone parts of it here.

Collaborator Author

Yeah, I attempted this a while ago in #13424 but never pushed it through 🙂. It may be a good time to revisit.

@abellina abellina left a comment

Collaborator

I left some comments on my first pass today. I'll take another look once there are more updates to the change.

if (runGpu) {
    try {
        long[] times = runBenchmark(warmup, measured, profile, () -> {
            try (ColumnVector result = executeGpu(table, numRows)) {}

Collaborator

why does executeGpu return a result and executeCpu doesn't?

why do we just ignore the result here and close it?

Collaborator Author

We don't care about either result as this is just for measurement, but having executeGpu return something is just to make sure we don't leak the output column.

Collaborator

then executeGpu should close it inside and return void.

Collaborator Author

I'll follow up on this
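
The shape being asked for could look roughly like the sketch below. `FakeColumn` is a hypothetical stand-in for a cuDF `ColumnVector`, and `evaluate` and the two-argument `runBenchmark` are simplified placeholders rather than the template's real methods.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch only: no GPU types are used so the example runs anywhere.
final class BenchLoopSketch {
    /** Hypothetical stand-in for a GPU column; counts instances not yet closed. */
    static final class FakeColumn implements AutoCloseable {
        static final AtomicInteger OPEN = new AtomicInteger();
        FakeColumn() { OPEN.incrementAndGet(); }
        @Override public void close() { OPEN.decrementAndGet(); }
    }

    /** Placeholder for the real UDF evaluation, which would return a device column. */
    static FakeColumn evaluate(int numRows) {
        return new FakeColumn();
    }

    /** Runs the UDF once and releases its output before returning. */
    static void executeGpu(int numRows) {
        try (FakeColumn result = evaluate(numRows)) {
            // Result intentionally unused: only the elapsed time matters here.
        }
    }

    /** Times `iterations` calls of `body` and returns each duration in nanoseconds. */
    static long[] runBenchmark(int iterations, Runnable body) {
        long[] times = new long[iterations];
        for (int i = 0; i < iterations; i++) {
            long start = System.nanoTime();
            body.run();
            times[i] = System.nanoTime() - start;
        }
        return times;
    }

    public static void main(String[] args) {
        long[] times = runBenchmark(5, () -> executeGpu(1_000));
        System.out.println(times.length + " runs, open columns: " + FakeColumn.OPEN.get());
    }
}
```

Keeping the close inside `executeGpu` means the benchmark lambda has nothing to manage, and `executeCpu` and `executeGpu` end up with the same `void` shape.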

@revans2 revans2 left a comment

Collaborator

I don't see any more blockers. There is more that would be nice to clean up. But we can look into it later.

revans2 commented Jun 16, 2026

Collaborator

build

    mapper.writer(printer).writeValue(new File(path), report);
    System.err.println("Report written to: " + path);
} catch (Exception e) {
    System.err.println("Failed to write report: " + e.getMessage());

Collaborator

this would be another swallowed error case.

Collaborator Author

Will follow up
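
A minimal sketch of the alternative, assuming the report has already been serialized to a string (the template uses a JSON mapper instead; `ReportWriterSketch` and `writeReport` are hypothetical names). Here the failure is rethrown with context rather than printed and dropped. Simply removing the `catch` and letting the exception propagate is an equally valid option.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative only; not the template's report-writing code.
final class ReportWriterSketch {
    /** Writes the report, or throws so the caller sees the failure. */
    static void writeReport(Path path, String reportJson) {
        try {
            Files.write(path, reportJson.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            // Rethrow with context instead of printing and continuing.
            throw new UncheckedIOException("Failed to write report to " + path, e);
        }
        System.err.println("Report written to: " + path);
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempFile("bench-report", ".json");
        writeReport(out, "{\"status\":\"ok\"}");
    }
}
```

Either way, a benchmark run that cannot persist its report now fails visibly instead of exiting as if it had succeeded.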

* Assert.assertEquals("UNKNOWN", results[2].getAs("risk_level"));
* }</pre>
*/
public static void verifyUDFResults(Dataset<Row> resultDF, Dataset<Row> testDF) {

Collaborator

nit, would be nice if this was "assertUDFResultsEqual". I had to read the comment to realize this was going to assert not throw.

Collaborator Author

Makes sense, will follow up

}

test("UDF vs SQL expression") {
  val testDF = UnitTest.createTestData(spark).repartition(1)

@abellina abellina Jun 16, 2026

Collaborator

in the past, single partition execution has yielded some nasty bugs, especially with our gpu execs (like hash aggregate). Having multiple tasks with splits is more of the natural Spark execution as well. Why repartition(1)? This implies a single task will run the udf, which seems odd, especially since you have 4 cores total.

@rishic3 rishic3 Jun 16, 2026

Collaborator Author

This was in response to cases where we would see degenerate execution because the test dataframe would be too small, i.e. we were just passing the UDF single-row columns and not actually exercising columnar execution. Maybe we can replace with a repartition(2) or require a minimum number of test rows
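
The "minimum number of test rows" idea could be enforced with a small guard like the one below. It is a sketch only: `TestDataGuard` and `requireMultiRowPartitions` are hypothetical names, and it assumes rows are spread evenly across partitions, which `repartition(n)` approximates but does not guarantee.

```java
// Hypothetical guard; not part of the published templates.
final class TestDataGuard {
    /**
     * Fails fast when the test data is too small for every partition to
     * hold at least two rows, assuming an even spread across partitions.
     */
    static void requireMultiRowPartitions(long rowCount, int numPartitions) {
        if (numPartitions < 1) {
            throw new IllegalArgumentException("numPartitions must be >= 1");
        }
        // Integer division gives the size of the smallest partition under an even spread.
        long smallestPartition = rowCount / numPartitions;
        if (smallestPartition < 2) {
            throw new IllegalStateException(
                "Need at least " + (2L * numPartitions) + " rows for "
                + numPartitions + " partitions, got " + rowCount);
        }
    }

    public static void main(String[] args) {
        requireMultiRowPartitions(10, 2); // passes: 5 rows per partition
        try {
            requireMultiRowPartitions(3, 2); // smallest partition would hold 1 row
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

A test could call such a guard with the DataFrame's row count before repartitioning, so that a too-small fixture is reported as a setup error rather than silently exercising single-row columns.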

@abellina abellina left a comment

Collaborator

I think we can take care of my comments as follow ups

@abellina abellina merged commit b71faa5 into NVIDIA:main Jun 16, 2026
46 of 48 checks passed
rishic3 commented Jun 16, 2026

Collaborator Author

Thanks @abellina @revans2 for shepherding this!

rishic3 added a commit that referenced this pull request Jun 25, 2026
… in skill templates [skip ci] (#15116)

### Description

This is a follow-up to address a few comments on #15058.
- ([comment](#15058 (comment)))
  - executeGpu now closes its result internally and returns void
- ([comment](#15058 (comment)))
  - no longer catching write errors and letting it throw (more easily reviewable with whitespace off)
- ([comment](#15058 (comment)))
  - rename `verifyUDFResults` -> `assertUDFResults` to clarify that this calls assertions
- ([comment](#15058 (comment)))
  - bumped to `repartition(2)` and encouraging 10+ test cases, for more realistic parallelism while ensuring the UDF is actually fed multi-row columns

### Checklists

Documentation
- [ ] Updated for new or modified user-facing features or behaviors
- [X] No user-facing change

Testing
- [ ] Added or modified tests to cover new code paths
- [ ] Covered by existing tests
(Please provide the names of the existing tests in the PR description.)
- [X] Not required

Performance
- [ ] Tests ran and results are added in the PR description
- [ ] Issue filed with a link in the PR description
- [X] Not required

---------

Signed-off-by: Rishi Chandra <rishic@nvidia.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Labels

feature request New feature or request

Development

Successfully merging this pull request may close these issues.

Publish UDF agent skills, unit tests, and docs to skills/.

6 participants