Publish UDF agent skills [skip ci]#15058
Conversation
Greptile SummaryThis PR publishes the UDF agent skills — a collection of AI-assistant skill packages (
Confidence Score: 4/5Safe to merge for build infrastructure and documentation; the Scala test templates distributed to users will silently fail on Java 17 with the RAPIDS plugin until the missing classloader setup is added. The Scala test scaffolding omits the URLClassLoader setup that the Java counterpart explicitly documents as required for Java 17 / RAPIDS ShimLoader. Every Scala-template user who runs skills/udf-gen-test/templates/scala/src/test/scala/com/udf/TestUtils.scala and all three Scala test classes (UnitTest.scala, CudfComparisonTest.scala, SqlComparisonTest.scala) need the installMutableClassLoader setup added. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[User: npx skills add NVIDIA/spark-rapids] --> B[Skill discovery: SKILL.md files]
B --> C1[udf-convert-to-sql]
B --> C2[udf-convert-to-cudf]
B --> C3[udf-convert-to-cuda]
B --> C4[udf-gen-test]
B --> C5[udf-judge-conversion]
B --> C6[udf-optimize-cudf]
B --> C7[udf-benchmark]
C4 --> D1[Java template]
C4 --> D2[Scala template]
D1 --> E1[TestUtils.installMutableClassLoader ✅]
D2 --> E2[TestUtils - method missing ❌]
E1 --> F1[RAPIDS ShimLoader works on Java 17]
E2 --> F2[RAPIDS ShimLoader fails on Java 17]
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
A[User: npx skills add NVIDIA/spark-rapids] --> B[Skill discovery: SKILL.md files]
B --> C1[udf-convert-to-sql]
B --> C2[udf-convert-to-cudf]
B --> C3[udf-convert-to-cuda]
B --> C4[udf-gen-test]
B --> C5[udf-judge-conversion]
B --> C6[udf-optimize-cudf]
B --> C7[udf-benchmark]
C4 --> D1[Java template]
C4 --> D2[Scala template]
D1 --> E1[TestUtils.installMutableClassLoader ✅]
D2 --> E2[TestUtils - method missing ❌]
E1 --> F1[RAPIDS ShimLoader works on Java 17]
E2 --> F2[RAPIDS ShimLoader fails on Java 17]
Reviews (6): Last reviewed commit: "add a note on heap size" | Re-trigger Greptile |
|
I recommend starting with a skills-only change first. The example and test code should be added together with the CI updates in phase 1. Refer to #14977 |
Sounds good, deferring tests/ to a follow-up. Can you clarify which examples you meant to defer? Some examples are bundled in the skill and given to the agent for reference, the top-level examples is for users to test on example UDFs. |
|
build |
| if (resources != null) { | ||
| for (AutoCloseable r : resources) { | ||
| if (r != null) { | ||
| try { r.close(); } catch (Exception ignore) {} |
There was a problem hiding this comment.
Should we have the ignored exceptions be added to the original exception with addSuppressed?
| * SPDX-License-Identifier: Apache-2.0 | ||
| */ | ||
|
|
||
| package com.udf.bench; |
There was a problem hiding this comment.
It would be nice if we could reuse DBGen() here (see datagen module). If not it would be good to discuss why we can't use it.
There was a problem hiding this comment.
I didn't realize it supports custom generators, GeneratorFunction looks like what we'd need. But the README says it is not published on Maven? I could file a follow-up on this.
| <scala.binary.version>2.12</scala.binary.version> | ||
| <scala.version>2.12.15</scala.version> | ||
| <!-- Spark/RAPIDS versions --> | ||
| <spark.version>3.5.5</spark.version> |
There was a problem hiding this comment.
so why are we hardcoding these variables here (binary.version, spark.version, scala.version). Is it because these are meant to be replaced by the llm?
In other words, why are these poms so verbose compared to the regular module poms.
There was a problem hiding this comment.
These skill templates are meant to be standalone projects, copied out of the repo and into the user space. So we'll try to keep the version pins up to date with the latest GA release, but could be adjusted at runtime if the user so specifies.
As I understand spark-rapids' root pom.xml is where most of the complexity is (and all the version pins) and the modules just inherit it, hence why they are much simpler.
| * @param f The function to execute with the resource | ||
| * @return The result of the function | ||
| */ | ||
| def withResource[T <: AutoCloseable, R](resource: T)(f: T => R): R = { |
There was a problem hiding this comment.
We should move Arm.* from sql-plugin to the sql-plugin-api module so we don't need to clone parts of it here.
There was a problem hiding this comment.
Yeah, I attempted this a while ago #13424 but never pushed it through 🙂 . May be a good time to revisit.
abellina
left a comment
There was a problem hiding this comment.
I left some comments on my first pass today, I'll take another look once there are more updates to the change.
| if (runGpu) { | ||
| try { | ||
| long[] times = runBenchmark(warmup, measured, profile, () -> { | ||
| try (ColumnVector result = executeGpu(table, numRows)) {} |
There was a problem hiding this comment.
why does executeGpu return a result and executeCpu doesnt?
why do we just ignore the result here and close it?
There was a problem hiding this comment.
We don't care about either result as this is just for measurement, but having executeGpu return something is just to make sure we don't leak the output column.
There was a problem hiding this comment.
then executeGpu should close it inside and return void.
There was a problem hiding this comment.
I'll follow up on this
revans2
left a comment
There was a problem hiding this comment.
I don't see any more blockers. There is more that would be nice to clean up. But we can look into it later.
|
build |
| mapper.writer(printer).writeValue(new File(path), report); | ||
| System.err.println("Report written to: " + path); | ||
| } catch (Exception e) { | ||
| System.err.println("Failed to write report: " + e.getMessage()); |
There was a problem hiding this comment.
this would be another swallowed error case.
| * Assert.assertEquals("UNKNOWN", results[2].getAs("risk_level")); | ||
| * }</pre> | ||
| */ | ||
| public static void verifyUDFResults(Dataset<Row> resultDF, Dataset<Row> testDF) { |
There was a problem hiding this comment.
nit, would be nice if this was "assertUDFResultsEqual". I had to read the comment to realize this was going to assert not throw.
There was a problem hiding this comment.
Makes sense, will follow-up
| } | ||
|
|
||
| test("UDF vs SQL expression") { | ||
| val testDF = UnitTest.createTestData(spark).repartition(1) |
There was a problem hiding this comment.
in the past, single partition execution has yielded some nasty bugs, especially with our gpu execs (like hash aggregate). Having multiple tasks with splits is more of the natural Spark execution as well. Why repartition(1)? This implies a single task will run the udf, which seems odd, especially since you have 4 cores total.
There was a problem hiding this comment.
This was in response to cases where we would see degenerate execution because the test dataframe would be too small, i.e. we were just passing the UDF single-row columns and not actually exercising columnar execution. Maybe we can replace with a repartition(2) or require a minimum number of test rows
abellina
left a comment
There was a problem hiding this comment.
I think we can take care of my comments as follow ups
… in skill templates [skip ci] (#15116) ### Description This is a follow-up to address a few comments on #15058. - ([comment](#15058 (comment))) - executeGpu now closes its result internally and returns void - ([comment](#15058 (comment))) - no longer catching write errors and letting it throw (more easily reviewable with whitespace off) - ([comment](#15058 (comment))) - rename `verifyUDFResults` -> `assertUDFResults` to clarify that this calls assertions - ([comment](#15058 (comment))) - bumped to `repartition(2)` and encouraging 10+ test cases, for more realistic parallelism while ensuring the UDF is actually fed multi-row columns ### Checklists Documentation - [ ] Updated for new or modified user-facing features or behaviors - [X] No user-facing change Testing - [ ] Added or modified tests to cover new code paths - [ ] Covered by existing tests (Please provide the names of the existing tests in the PR description.) - [X] Not required Performance - [ ] Tests ran and results are added in the PR description - [ ] Issue filed with a link in the PR description - [X] Not required --------- Signed-off-by: Rishi Chandra <rishic@nvidia.com> Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Closes #15014. See the epic #14977 for the next phases.
Description
Publishes the UDF skills, docs, and tests. The tests are yet to be wired up to CI, which is planned in #15013.
Installation from this branch can be tested like so:
npx skills add "https://github.com/rishic3/spark-rapids/tree/aether-udf-skills"Once merged, the following will work:
Note that only directories with a SKILL.md file will be detected as skills.
Changes taking effect outside of
skills/are:pom.xmlto ignoreskills/in RAT and scalastyle checkspom.xmlfiles underskills/inmake-scala-version-build-files.shChecklists
Documentation
Testing
(Please provide the names of the existing tests in the PR description.)
Performance