diff --git a/LICENSE b/LICENSE
index 261eeb9e9f8..884b334376f 100644
--- a/LICENSE
+++ b/LICENSE
@@ -1,3 +1,406 @@
+Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+This code is dual-licensed: documentation/skills under the CC-BY-4.0 license terms and source code under the Apache-2.0 license terms.
+
+
+Attribution 4.0 International
+
+=======================================================================
+
+Creative Commons Corporation ("Creative Commons") is not a law firm and
+does not provide legal services or legal advice. Distribution of
+Creative Commons public licenses does not create a lawyer-client or
+other relationship. Creative Commons makes its licenses and related
+information available on an "as-is" basis. Creative Commons gives no
+warranties regarding its licenses, any material licensed under their
+terms and conditions, or any related information. Creative Commons
+disclaims all liability for damages resulting from their use to the
+fullest extent possible.
+
+Using Creative Commons Public Licenses
+
+Creative Commons public licenses provide a standard set of terms and
+conditions that creators and other rights holders may use to share
+original works of authorship and other material subject to copyright
+and certain other rights specified in the public license below. The
+following considerations are for informational purposes only, are not
+exhaustive, and do not form part of our licenses.
+
+     Considerations for licensors: Our public licenses are
+     intended for use by those authorized to give the public
+     permission to use material in ways otherwise restricted by
+     copyright and certain other rights. Our licenses are
+     irrevocable. Licensors should read and understand the terms
+     and conditions of the license they choose before applying it.
+     Licensors should also secure all rights necessary before
+     applying our licenses so that the public can reuse the
+     material as expected. Licensors should clearly mark any
+     material not subject to the license. This includes other CC-
+     licensed material, or material used under an exception or
+     limitation to copyright. More considerations for licensors:
+    wiki.creativecommons.org/Considerations_for_licensors
+
+     Considerations for the public: By using one of our public
+     licenses, a licensor grants the public permission to use the
+     licensed material under specified terms and conditions. If
+     the licensor's permission is not necessary for any reason--for
+     example, because of any applicable exception or limitation to
+     copyright--then that use is not regulated by the license. Our
+     licenses grant only permissions under copyright and certain
+     other rights that a licensor has authority to grant. Use of
+     the licensed material may still be restricted for other
+     reasons, including because others have copyright or other
+     rights in the material. A licensor may make special requests,
+     such as asking that all changes be marked or described.
+     Although not required by our licenses, you are encouraged to
+     respect those requests where reasonable. More considerations
+     for the public:
+    wiki.creativecommons.org/Considerations_for_licensees
+
+=======================================================================
+
+Creative Commons Attribution 4.0 International Public License
+
+By exercising the Licensed Rights (defined below), You accept and agree
+to be bound by the terms and conditions of this Creative Commons
+Attribution 4.0 International Public License ("Public License"). To the
+extent this Public License may be interpreted as a contract, You are
+granted the Licensed Rights in consideration of Your acceptance of
+these terms and conditions, and the Licensor grants You such rights in
+consideration of benefits the Licensor receives from making the
+Licensed Material available under these terms and conditions.
+
+
+Section 1 -- Definitions.
+
+  a. Adapted Material means material subject to Copyright and Similar
+     Rights that is derived from or based upon the Licensed Material
+     and in which the Licensed Material is translated, altered,
+     arranged, transformed, or otherwise modified in a manner requiring
+     permission under the Copyright and Similar Rights held by the
+     Licensor. For purposes of this Public License, where the Licensed
+     Material is a musical work, performance, or sound recording,
+     Adapted Material is always produced where the Licensed Material is
+     synched in timed relation with a moving image.
+
+  b. Adapter's License means the license You apply to Your Copyright
+     and Similar Rights in Your contributions to Adapted Material in
+     accordance with the terms and conditions of this Public License.
+
+  c. Copyright and Similar Rights means copyright and/or similar rights
+     closely related to copyright including, without limitation,
+     performance, broadcast, sound recording, and Sui Generis Database
+     Rights, without regard to how the rights are labeled or
+     categorized. For purposes of this Public License, the rights
+     specified in Section 2(b)(1)-(2) are not Copyright and Similar
+     Rights.
+
+  d. Effective Technological Measures means those measures that, in the
+     absence of proper authority, may not be circumvented under laws
+     fulfilling obligations under Article 11 of the WIPO Copyright
+     Treaty adopted on December 20, 1996, and/or similar international
+     agreements.
+
+  e. Exceptions and Limitations means fair use, fair dealing, and/or
+     any other exception or limitation to Copyright and Similar Rights
+     that applies to Your use of the Licensed Material.
+
+  f. Licensed Material means the artistic or literary work, database,
+     or other material to which the Licensor applied this Public
+     License.
+
+  g. Licensed Rights means the rights granted to You subject to the
+     terms and conditions of this Public License, which are limited to
+     all Copyright and Similar Rights that apply to Your use of the
+     Licensed Material and that the Licensor has authority to license.
+
+  h. Licensor means the individual(s) or entity(ies) granting rights
+     under this Public License.
+
+  i. Share means to provide material to the public by any means or
+     process that requires permission under the Licensed Rights, such
+     as reproduction, public display, public performance, distribution,
+     dissemination, communication, or importation, and to make material
+     available to the public including in ways that members of the
+     public may access the material from a place and at a time
+     individually chosen by them.
+
+  j. Sui Generis Database Rights means rights other than copyright
+     resulting from Directive 96/9/EC of the European Parliament and of
+     the Council of 11 March 1996 on the legal protection of databases,
+     as amended and/or succeeded, as well as other essentially
+     equivalent rights anywhere in the world.
+
+  k. You means the individual or entity exercising the Licensed Rights
+     under this Public License. Your has a corresponding meaning.
+
+
+Section 2 -- Scope.
+
+  a. License grant.
+
+       1. Subject to the terms and conditions of this Public License,
+          the Licensor hereby grants You a worldwide, royalty-free,
+          non-sublicensable, non-exclusive, irrevocable license to
+          exercise the Licensed Rights in the Licensed Material to:
+
+            a. reproduce and Share the Licensed Material, in whole or
+               in part; and
+
+            b. produce, reproduce, and Share Adapted Material.
+
+       2. Exceptions and Limitations. For the avoidance of doubt, where
+          Exceptions and Limitations apply to Your use, this Public
+          License does not apply, and You do not need to comply with
+          its terms and conditions.
+
+       3. Term. The term of this Public License is specified in Section
+          6(a).
+
+       4. Media and formats; technical modifications allowed. The
+          Licensor authorizes You to exercise the Licensed Rights in
+          all media and formats whether now known or hereafter created,
+          and to make technical modifications necessary to do so. The
+          Licensor waives and/or agrees not to assert any right or
+          authority to forbid You from making technical modifications
+          necessary to exercise the Licensed Rights, including
+          technical modifications necessary to circumvent Effective
+          Technological Measures. For purposes of this Public License,
+          simply making modifications authorized by this Section 2(a)
+          (4) never produces Adapted Material.
+
+       5. Downstream recipients.
+
+            a. Offer from the Licensor -- Licensed Material. Every
+               recipient of the Licensed Material automatically
+               receives an offer from the Licensor to exercise the
+               Licensed Rights under the terms and conditions of this
+               Public License.
+
+            b. No downstream restrictions. You may not offer or impose
+               any additional or different terms or conditions on, or
+               apply any Effective Technological Measures to, the
+               Licensed Material if doing so restricts exercise of the
+               Licensed Rights by any recipient of the Licensed
+               Material.
+
+       6. No endorsement. Nothing in this Public License constitutes or
+          may be construed as permission to assert or imply that You
+          are, or that Your use of the Licensed Material is, connected
+          with, or sponsored, endorsed, or granted official status by,
+          the Licensor or others designated to receive attribution as
+          provided in Section 3(a)(1)(A)(i).
+
+  b. Other rights.
+
+       1. Moral rights, such as the right of integrity, are not
+          licensed under this Public License, nor are publicity,
+          privacy, and/or other similar personality rights; however, to
+          the extent possible, the Licensor waives and/or agrees not to
+          assert any such rights held by the Licensor to the limited
+          extent necessary to allow You to exercise the Licensed
+          Rights, but not otherwise.
+
+       2. Patent and trademark rights are not licensed under this
+          Public License.
+
+       3. To the extent possible, the Licensor waives any right to
+          collect royalties from You for the exercise of the Licensed
+          Rights, whether directly or through a collecting society
+          under any voluntary or waivable statutory or compulsory
+          licensing scheme. In all other cases the Licensor expressly
+          reserves any right to collect such royalties.
+
+
+Section 3 -- License Conditions.
+
+Your exercise of the Licensed Rights is expressly made subject to the
+following conditions.
+
+  a. Attribution.
+
+       1. If You Share the Licensed Material (including in modified
+          form), You must:
+
+            a. retain the following if it is supplied by the Licensor
+               with the Licensed Material:
+
+                 i. identification of the creator(s) of the Licensed
+                    Material and any others designated to receive
+                    attribution, in any reasonable manner requested by
+                    the Licensor (including by pseudonym if
+                    designated);
+
+                ii. a copyright notice;
+
+               iii. a notice that refers to this Public License;
+
+                iv. a notice that refers to the disclaimer of
+                    warranties;
+
+                 v. a URI or hyperlink to the Licensed Material to the
+                    extent reasonably practicable;
+
+            b. indicate if You modified the Licensed Material and
+               retain an indication of any previous modifications; and
+
+            c. indicate the Licensed Material is licensed under this
+               Public License, and include the text of, or the URI or
+               hyperlink to, this Public License.
+
+       2. You may satisfy the conditions in Section 3(a)(1) in any
+          reasonable manner based on the medium, means, and context in
+          which You Share the Licensed Material. For example, it may be
+          reasonable to satisfy the conditions by providing a URI or
+          hyperlink to a resource that includes the required
+          information.
+
+       3. If requested by the Licensor, You must remove any of the
+          information required by Section 3(a)(1)(A) to the extent
+          reasonably practicable.
+
+       4. If You Share Adapted Material You produce, the Adapter's
+          License You apply must not prevent recipients of the Adapted
+          Material from complying with this Public License.
+
+
+Section 4 -- Sui Generis Database Rights.
+
+Where the Licensed Rights include Sui Generis Database Rights that
+apply to Your use of the Licensed Material:
+
+  a. for the avoidance of doubt, Section 2(a)(1) grants You the right
+     to extract, reuse, reproduce, and Share all or a substantial
+     portion of the contents of the database;
+
+  b. if You include all or a substantial portion of the database
+     contents in a database in which You have Sui Generis Database
+     Rights, then the database in which You have Sui Generis Database
+     Rights (but not its individual contents) is Adapted Material; and
+
+  c. You must comply with the conditions in Section 3(a) if You Share
+     all or a substantial portion of the contents of the database.
+
+For the avoidance of doubt, this Section 4 supplements and does not
+replace Your obligations under this Public License where the Licensed
+Rights include other Copyright and Similar Rights.
+
+
+Section 5 -- Disclaimer of Warranties and Limitation of Liability.
+
+  a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE
+     EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS
+     AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF
+     ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS,
+     IMPLIED, STATUTORY, OR OTHER. THIS INCLUDES, WITHOUT LIMITATION,
+     WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR
+     PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS,
+     ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT
+     KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT
+     ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU.
+
+  b. TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE
+     TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION,
+     NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT,
+     INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES,
+     COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR
+     USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN
+     ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR
+     DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR
+     IN PART, THIS LIMITATION MAY NOT APPLY TO YOU.
+
+  c. The disclaimer of warranties and limitation of liability provided
+     above shall be interpreted in a manner that, to the extent
+     possible, most closely approximates an absolute disclaimer and
+     waiver of all liability.
+
+
+Section 6 -- Term and Termination.
+
+  a. This Public License applies for the term of the Copyright and
+     Similar Rights licensed here. However, if You fail to comply with
+     this Public License, then Your rights under this Public License
+     terminate automatically.
+
+  b. Where Your right to use the Licensed Material has terminated under
+     Section 6(a), it reinstates:
+
+       1. automatically as of the date the violation is cured, provided
+          it is cured within 30 days of Your discovery of the
+          violation; or
+
+       2. upon express reinstatement by the Licensor.
+
+     For the avoidance of doubt, this Section 6(b) does not affect any
+     right the Licensor may have to seek remedies for Your violations
+     of this Public License.
+
+  c. For the avoidance of doubt, the Licensor may also offer the
+     Licensed Material under separate terms or conditions or stop
+     distributing the Licensed Material at any time; however, doing so
+     will not terminate this Public License.
+
+  d. Sections 1, 5, 6, 7, and 8 survive termination of this Public
+     License.
+
+
+Section 7 -- Other Terms and Conditions.
+
+  a. The Licensor shall not be bound by any additional or different
+     terms or conditions communicated by You unless expressly agreed.
+
+  b. Any arrangements, understandings, or agreements regarding the
+     Licensed Material not stated herein are separate from and
+     independent of the terms and conditions of this Public License.
+
+
+Section 8 -- Interpretation.
+
+  a. For the avoidance of doubt, this Public License does not, and
+     shall not be interpreted to, reduce, limit, restrict, or impose
+     conditions on any use of the Licensed Material that could lawfully
+     be made without permission under this Public License.
+
+  b. To the extent possible, if any provision of this Public License is
+     deemed unenforceable, it shall be automatically reformed to the
+     minimum extent necessary to make it enforceable. If the provision
+     cannot be reformed, it shall be severed from this Public License
+     without affecting the enforceability of the remaining terms and
+     conditions.
+
+  c. No term or condition of this Public License will be waived and no
+     failure to comply consented to unless expressly agreed to by the
+     Licensor.
+
+  d. Nothing in this Public License constitutes or may be interpreted
+     as a limitation upon, or waiver of, any privileges and immunities
+     that apply to the Licensor or You, including from the legal
+     processes of any jurisdiction or authority.
+
+
+=======================================================================
+
+Creative Commons is not a party to its public
+licenses. Notwithstanding, Creative Commons may elect to apply one of
+its public licenses to material it publishes and in those instances
+will be considered the “Licensor.” The text of the Creative Commons
+public licenses is dedicated to the public domain under the CC0 Public
+Domain Dedication. Except for the limited purpose of indicating that
+material is shared under a Creative Commons public license or as
+otherwise permitted by the Creative Commons policies published at
+creativecommons.org/policies, Creative Commons does not authorize the
+use of the trademark "Creative Commons" or any other trademark or logo
+of Creative Commons without its prior written consent including,
+without limitation, in connection with any unauthorized modifications
+to any of its public licenses or any other arrangements,
+understandings, or agreements concerning use of licensed material. For
+the avoidance of doubt, this paragraph does not form part of the
+public licenses.
+
+Creative Commons may be contacted at creativecommons.org.
+
+
+
+
                                  Apache License
                            Version 2.0, January 2004
                         http://www.apache.org/licenses/
@@ -186,7 +589,7 @@
       same "printed page" as the copyright notice for easier
       identification within third-party archives.
 
-   Copyright [yyyy] [name of copyright owner]
+   Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES.
 
    Licensed under the Apache License, Version 2.0 (the "License");
    you may not use this file except in compliance with the License.
diff --git a/build/make-scala-version-build-files.sh b/build/make-scala-version-build-files.sh
index 21bf4471147..295ff44de1e 100755
--- a/build/make-scala-version-build-files.sh
+++ b/build/make-scala-version-build-files.sh
@@ -1,6 +1,6 @@
 #!/usr/bin/env bash
 #
-# Copyright (c) 2023-2025, NVIDIA CORPORATION. All rights reserved.
+# Copyright (c) 2023-2026, NVIDIA CORPORATION. All rights reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@@ -78,6 +78,11 @@ for f in $(git ls-files '**pom.xml'); do
     echo "Skipping $f"
     continue
   fi
+  # Skills package their own pom.xml templates. Ignore those.
+  if [[ $f == skills/* ]]; then
+    echo "Skipping $f"
+    continue
+  fi
   echo $f
   tof="$TO_DIR/$f"
   mkdir -p $(dirname $tof)
diff --git a/pom.xml b/pom.xml
index 450211bcc4a..cc51b5215a0 100644
--- a/pom.xml
+++ b/pom.xml
@@ -1654,6 +1654,8 @@ This will force full Scala code rebuild in downstream modules.
                         <exclude>**/target/**/*</exclude>
                         <exclude>**/cufile.log</exclude>
                         <exclude>**/cudf_log.txt</exclude>
+                        <!-- Agent skills package has its own mixed Apache-2.0/CC-BY-4.0 licensing. -->
+                        <exclude>skills/**</exclude>
                         <exclude>thirdparty/parquet-testing/**</exclude>
                     </excludes>
                 </configuration>
@@ -1704,6 +1706,7 @@ This will force full Scala code rebuild in downstream modules.
                                     <dirset dir="${spark.rapids.source.basedir}">
                                         <include name="**/src/main"/>
                                         <include name="**/src/test"/>
+                                        <exclude name="skills/**"/>
                                         <exclude name="**/target/*/generated/src/**"/>
                                     </dirset>
                                 </pathconvert>
diff --git a/scala2.13/pom.xml b/scala2.13/pom.xml
index 6b9a9aa8d68..69882d86cc1 100644
--- a/scala2.13/pom.xml
+++ b/scala2.13/pom.xml
@@ -1654,6 +1654,8 @@ This will force full Scala code rebuild in downstream modules.
                         <exclude>**/target/**/*</exclude>
                         <exclude>**/cufile.log</exclude>
                         <exclude>**/cudf_log.txt</exclude>
+                        <!-- Agent skills package has its own mixed Apache-2.0/CC-BY-4.0 licensing. -->
+                        <exclude>skills/**</exclude>
                         <exclude>thirdparty/parquet-testing/**</exclude>
                     </excludes>
                 </configuration>
@@ -1704,6 +1706,7 @@ This will force full Scala code rebuild in downstream modules.
                                     <dirset dir="${spark.rapids.source.basedir}">
                                         <include name="**/src/main"/>
                                         <include name="**/src/test"/>
+                                        <exclude name="skills/**"/>
                                         <exclude name="**/target/*/generated/src/**"/>
                                     </dirset>
                                 </pathconvert>
diff --git a/skills/.gitignore b/skills/.gitignore
new file mode 100644
index 00000000000..16a0e72b69f
--- /dev/null
+++ b/skills/.gitignore
@@ -0,0 +1,165 @@
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[codz]
+*$py.class
+
+# C extensions
+*.so
+
+# Distribution / packaging
+.Python
+build/_build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+share/python-wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+
+# PyInstaller
+#  Usually these files are written by a python script from a template
+#  before PyInstaller builds the exe, so as to inject date/other infos into it.
+*.manifest
+*.spec
+
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.nox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*.cover
+*.py.cover
+.hypothesis/
+.pytest_cache/
+cover/
+
+# Translations
+*.mo
+*.pot
+
+# Django stuff:
+*.log
+local_settings.py
+db.sqlite3
+db.sqlite3-journal
+
+# Flask stuff:
+instance/
+.webassets-cache
+
+# Scrapy stuff:
+.scrapy
+
+# Sphinx documentation
+docs/_build/
+
+# PyBuilder
+.pybuilder/
+target/
+
+# Jupyter Notebook
+.ipynb_checkpoints
+
+# IPython
+profile_default/
+ipython_config.py
+
+# pdm
+.pdm-python
+.pdm-build/
+
+# pixi
+.pixi
+
+# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
+__pypackages__/
+
+# Celery stuff
+celerybeat-schedule
+celerybeat.pid
+
+# SageMath parsed files
+*.sage.py
+
+# Environments
+.env
+.envrc
+.venv
+env/
+venv/
+ENV/
+env.bak/
+venv.bak/
+
+# Spyder project settings
+.spyderproject
+.spyproject
+
+# Rope project settings
+.ropeproject
+
+# mkdocs documentation
+/site
+
+# mypy
+.mypy_cache/
+.dmypy.json
+dmypy.json
+
+# Pyre type checker
+.pyre/
+
+# pytype static type analyzer
+.pytype/
+
+# Cython debug symbols
+cython_debug/
+
+# Abstra
+.abstra/
+
+# Visual Studio Code
+.vscode/
+.cursor/
+.claude/
+
+# Ruff stuff:
+.ruff_cache/
+
+# PyPI configuration file
+.pypirc
+
+# Marimo
+marimo/_static/
+marimo/_lsp/
+__marimo__/
+
+# Streamlit
+.streamlit/secrets.toml
+
+# Scala
+.scala-build/
+.metals/
+.bsp/
+
+# Maven config under skills is source, not generated output.
+!**/.mvn/
+!**/.mvn/**
diff --git a/skills/README.md b/skills/README.md
new file mode 100644
index 00000000000..ca60dcfdad8
--- /dev/null
+++ b/skills/README.md
@@ -0,0 +1,174 @@
+# Project Aether Agent Skills
+
+Aether Agent is a set of skills to convert Apache Spark User-Defined Functions (UDFs) for GPU acceleration with the [RAPIDS Accelerator for Apache Spark](https://github.com/NVIDIA/spark-rapids). It provides:
+
+1. **Test generation** -- Create unit tests and test data for existing UDFs.
+2. **Conversion** -- Convert a UDF to a GPU-compatible implementation (SQL, cuDF RapidsUDF, or native CUDA RapidsUDF).
+3. **Benchmarking** -- Generate synthetic data and benchmark the original UDF against the GPU implementation.
+4. **Optimization** -- Iteratively profile and optimize a cuDF RapidsUDF for GPU performance.
+
+<details open>
+<summary><strong>Table of Contents</strong></summary>
+
+- [Installation](#installation)
+- [Supported Formats](#supported-formats)
+- [Prerequisites](#prerequisites)
+- [Selecting an LLM](#selecting-an-llm)
+- [Quick Start](#quick-start)
+  - [Using Skills](#using-skills)
+  - [Try the Workflow](#try-the-workflow)
+
+</details>
+
+## Installation
+
+Install via the [skills CLI](https://github.com/vercel-labs/skills). Installing all skills is recommended, as they are designed to work together.
+
+```bash
+npx skills add NVIDIA/spark-rapids --skill '*' [--agent <agent>]
+```
+
+## Supported Formats
+
+| UDF Type  | cuDF RapidsUDF | CUDA RapidsUDF | Spark SQL |
+|-----------|----------------|------------------------|-----------|
+| Java UDF  | Yes | Yes | Yes |
+| Hive UDF  | Yes | Yes | Yes |
+| Scala UDF | Yes | Yes | Yes |
+| Java UDAF | -- | -- | Yes |
+| Hive UDAF | -- | -- | Yes |
+| Scala UDAF | -- | -- | Yes |
+
+## Prerequisites
+
+- **[Maven](https://maven.apache.org/install.html)** is required to build/compile UDFs.
+- **[JDK](https://docs.oracle.com/en/java/javase/index.html)** must be installed on the system.
+- **Local GPU** with [CUDA toolkit](https://developer.nvidia.com/cuda/toolkit) is required (see [Spark RAPIDS compatibility](https://nvidia.github.io/spark-rapids/docs/download.html) for version requirements).
+
+If a local GPU is not available, another option is to run Aether Agent from a cloud instance, such as AWS EC2.
+
+## Selecting an LLM
+
+For best results, we recommend the latest reasoning models from OpenAI, Anthropic, or Google. As a good proxy, models near the top of the [Terminal-Bench 2.0 leaderboard](https://www.tbench.ai/leaderboard/terminal-bench/2.0) tend to perform well.
+
+## Quick Start
+
+Skills work with any IDE or LLM agent that supports the [agent skills spec](https://agentskills.io) (e.g., Cursor, Codex, Claude Code).
+
+### Using Skills
+
+Skills follow a multi-step workflow:
+
+1. **[udf-gen-test](udf-gen-test/SKILL.md)** -- Generate a unit test for the UDF
+2. **[udf-convert-to-cudf](udf-convert-to-cudf/SKILL.md)**, **[udf-convert-to-cuda](udf-convert-to-cuda/SKILL.md)**, or **[udf-convert-to-sql](udf-convert-to-sql/SKILL.md)** -- Convert the UDF to a GPU-compatible implementation
+3. **[udf-judge-conversion](udf-judge-conversion/SKILL.md)** -- Review generated tests and implementations for coverage gaps, bugs, and edge cases
+4. **[udf-benchmark](udf-benchmark/SKILL.md)** -- Benchmark CPU vs GPU performance
+5. **[udf-optimize-cudf](udf-optimize-cudf/SKILL.md)** -- Iteratively profile and optimize the cuDF RapidsUDF
+
+To invoke a skill, use your IDE's skill command, or simply describe the task and let the agent load the skill automatically.
+
+```bash
+# Manual invocation
+❯ Use the /udf-gen-test skill to generate a unit test for @FormatPhoneUDF.java
+
+# Automatic invocation
+❯ Generate a unit test for @FormatPhoneUDF.java
+```
+
+Each skill builds on the output of the previous one -- udf-gen-test produces a project with a passing unit test, which the conversion skills use as input, and the udf-benchmark skill uses the conversion output.
+
+You can invoke multiple steps in a single prompt:
+
+```bash
+❯ Generate a unit test for @FormatPhoneUDF.java, then convert it to cuDF, native CUDA, or SQL and benchmark
+```
+
+### Try the Workflow
+
+Once you've installed the skills, try the workflow with one of the provided example UDFs:
+- Java: [FormatPhoneUDF.java](examples/FormatPhoneUDF.java)
+- Hive: [IntegerMultiplyBy2UDF.java](examples/IntegerMultiplyBy2UDF.java)
+- Scala: [CalculateRiskUDF.scala](examples/CalculateRiskUDF.scala)
+
+#### Workflow
+
+(Example dialogue is from Claude Code.)
+
+1. Open your project in your IDE of choice with the skills installed.
+2. Ask the agent to generate a unit test:
+
+    ```bash
+    ❯ Please generate a unit test for @FormatPhoneUDF.java
+      ⎿  Read FormatPhoneUDF.java (23 lines)
+
+    ● Skill(udf-gen-test)
+      ⎿  Successfully loaded skill
+    ● I'll generate a unit test for FormatPhoneUDF.java. Let me start by setting up the project.
+    ● ...
+    ● Unit test generation is complete. Here's a summary:
+
+      Results
+
+      - Project directory: FormatPhoneUDF/
+      - Unit test: FormatPhoneUDF/src/test/java/com/udf/UnitTest.java
+      - Status: All tests passing
+      - Test coverage: 12 cases
+    ```
+
+3. Once the test passes, convert the UDF:
+
+    ```bash
+    ❯ Please convert the UDF to cuDF
+
+    ● I'll use the udf-convert-to-cudf skill to convert this UDF to a GPU-accelerated RapidsUDF.
+    ● Skill(udf-convert-to-cudf)
+      ⎿  Successfully loaded skill
+    ● ...
+    ● Results
+
+      The RapidsUDF conversion is complete.
+
+      - RapidsUDF: FormatPhoneUDF/src/main/java/com/udf/FormatPhoneRapidsUDF.java
+      - Comparison test: passes with no memory leaks
+    ```
+
+   Alternatively, ask for a judge pass as part of the conversion request:
+
+    ```bash
+    ❯ Please convert the UDF to cuDF and run the judge review agent on the result
+    ```
+
+   **Note:** The conversion agent will only invoke the judge when you explicitly ask for it. Otherwise, the agent performs a local self-review. You can also invoke the udf-judge-conversion skill on its own, at any time.
+
+4. Finally, benchmark the result:
+
+    ```bash
+    ❯ Please benchmark the implementations on 100M rows.
+
+    ● Skill(udf-benchmark)
+      ⎿  Successfully loaded skill
+    ● ...
+    ● Benchmark Results — 100M rows
+
+      ┌─────────┬─────────────┐
+      │  Mode   │ Runtime (s) │
+      ├─────────┼─────────────┤
+      │ CPU     │ 16.27       │
+      ├─────────┼─────────────┤
+      │ GPU     │ 6.52        │
+      ├─────────┼─────────────┤
+      │ Speedup │ 2.50x       │
+      └─────────┴─────────────┘
+
+      The GPU RapidsUDF implementation is 2.5x faster than the CPU UDF on 100 million rows.
+    ```
+
+5. Optionally, for cuDF RapidsUDF conversions, optimize the implementation:
+
+    ```bash
+    ❯ Please optimize the implementation
+
+    ● Skill(udf-optimize-cudf)
+      ⎿  Successfully loaded skill
+    ● ...
+    ```
diff --git a/skills/docs/dev/VERSIONS.md b/skills/docs/dev/VERSIONS.md
new file mode 100644
index 00000000000..83956663ca3
--- /dev/null
+++ b/skills/docs/dev/VERSIONS.md
@@ -0,0 +1,61 @@
+# Version Update Guide
+
+## Files To Update
+
+### Java udf-gen-test Maven template
+
+File: `skills/udf-gen-test/templates/java/pom.xml`
+
+Update these properties together:
+
+- `<scala.binary.version>`
+- `<spark.version>`
+- `<rapids4spark.version>`
+- `<cuda.version>` if the RAPIDS artifact classifier changes
+- `<cudf.git.branch>`
+- `<rapids.cmake.branch>`
+
+### Scala udf-gen-test Maven template
+
+File: `skills/udf-gen-test/templates/scala/pom.xml`
+
+Update these properties together:
+
+- `<scala.binary.version>`
+- `<scala.version>`
+- `<spark.version>`
+- `<rapids4spark.version>`
+- `<cuda.version>` if the RAPIDS artifact classifier changes
+- `<cudf.git.branch>`
+- `<rapids.cmake.branch>`
+
+### Native CUDA build image
+
+File: `skills/udf-convert-to-cuda/templates/cuda/Dockerfile`
+
+Update this default value:
+
+- `CUDA_VERSION` must match the CUDA toolkit version spark-rapids is built against (the same version the native build uses on the host).
+
+### Native CUDA dependency extraction
+
+File: `skills/udf-convert-to-cuda/templates/cuda/native/scripts/extract-cudf-libs.sh`
+
+Update these default values:
+
+- `SCALA_VERSION`
+- `RAPIDS4SPARK_VERSION`
+- `CUDA_VERSION` if the RAPIDS artifact classifier changes
+- `CUDF_BRANCH`
+
+### Native CUDA CMake template
+
+File: `skills/udf-convert-to-cuda/templates/cuda/native/src/main/cpp/CMakeLists.txt`
+
+Update these values:
+
+- `RAPIDS_CMAKE_BRANCH`
+- `project(RAPIDSUDFJNI VERSION ...)`
+- `rapids_cpm_find(cudf ...)`
+
+`RAPIDS_CMAKE_BRANCH` should generally match the RAPIDS/cuDF branch or tag used by the Maven templates and `extract-cudf-libs.sh`. The `rapids_cpm_find(cudf ...)` version should use the zero-padded RAPIDS CPM version format, for example `26.04.00` for release `26.04.0`.
diff --git a/skills/examples/CalculateRiskUDF.scala b/skills/examples/CalculateRiskUDF.scala
new file mode 100644
index 00000000000..8e9dc6070c8
--- /dev/null
+++ b/skills/examples/CalculateRiskUDF.scala
@@ -0,0 +1,24 @@
+/*
+ * SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: Apache-2.0
+ */
+
+package examples
+
+/**
+ * Map a credit score to a risk category.
+ *
+ * @param creditScore Credit score; may be null
+ * @return Risk category: "LOW", "MEDIUM", "HIGH", or "VERY_HIGH"; "UNKNOWN" if the input is null
+ */
+class CalculateRiskUDF extends Function1[Integer, String] with Serializable {
+  override def apply(creditScore: Integer): String = {
+    Option(creditScore) match {
+      case Some(score) if score >= 750 => "LOW"
+      case Some(score) if score >= 650 => "MEDIUM"
+      case Some(score) if score >= 500 => "HIGH"
+      case Some(_) => "VERY_HIGH"
+      case None => "UNKNOWN"
+    }
+  }
+}
diff --git a/skills/examples/FormatPhoneUDF.java b/skills/examples/FormatPhoneUDF.java
new file mode 100644
index 00000000000..f747887911d
--- /dev/null
+++ b/skills/examples/FormatPhoneUDF.java
@@ -0,0 +1,26 @@
+/*
+ * SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: Apache-2.0
+ */
+
+package examples;
+
+import org.apache.spark.sql.api.java.UDF1;
+
+/** Strip non-digit characters and format as (XXX) XXX-XXXX. */
+public class FormatPhoneUDF implements UDF1<String, String> {
+    @Override
+    public String call(String phone) throws Exception {
+        if (phone == null) {
+            return null;
+        }
+        String digits = phone.replaceAll("[^0-9]", "");
+        if (digits.length() != 10) {
+            return null;
+        }
+        return String.format("(%s) %s-%s",
+            digits.substring(0, 3),
+            digits.substring(3, 6),
+            digits.substring(6));
+    }
+}
diff --git a/skills/examples/IntegerMultiplyBy2UDF.java b/skills/examples/IntegerMultiplyBy2UDF.java
new file mode 100644
index 00000000000..c6b0cb0055f
--- /dev/null
+++ b/skills/examples/IntegerMultiplyBy2UDF.java
@@ -0,0 +1,78 @@
+/*
+ * SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: Apache-2.0
+ */
+
+package examples;
+
+import org.apache.hadoop.hive.ql.exec.Description;
+import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
+import org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
+import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
+import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
+import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
+import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;
+import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils;
+import org.apache.hadoop.io.LongWritable;
+import org.apache.log4j.Logger;
+
+@Description(name = "integer_multiply_by_2", value = "_FUNC_(x) - Returns x * 2 for integer values")
+public class IntegerMultiplyBy2UDF extends GenericUDF {
+    private static final Logger LOG = Logger.getLogger(IntegerMultiplyBy2UDF.class);
+    private PrimitiveObjectInspector inputOI;
+
+    @Override
+    public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
+        if (arguments.length != 1) {
+            throw new UDFArgumentException("Exactly one argument is expected.");
+        }
+
+        ObjectInspector oi = arguments[0];
+        if (oi.getCategory() != ObjectInspector.Category.PRIMITIVE) {
+            throw new UDFArgumentTypeException(0, "Argument must be PRIMITIVE, but " + oi.getCategory().name() + " was passed.");
+        }
+
+        inputOI = (PrimitiveObjectInspector) oi;
+
+        // Check that the input is an integral type
+        if (inputOI.getPrimitiveCategory() != PrimitiveObjectInspector.PrimitiveCategory.INT &&
+            inputOI.getPrimitiveCategory() != PrimitiveObjectInspector.PrimitiveCategory.LONG &&
+            inputOI.getPrimitiveCategory() != PrimitiveObjectInspector.PrimitiveCategory.SHORT &&
+            inputOI.getPrimitiveCategory() != PrimitiveObjectInspector.PrimitiveCategory.BYTE) {
+            throw new UDFArgumentTypeException(0, "Argument must be numeric (INT/LONG/SHORT/BYTE), but " + inputOI.getPrimitiveCategory().name() + " was passed.");
+        }
+
+        // Return LongWritable type for the result
+        return PrimitiveObjectInspectorFactory.writableLongObjectInspector;
+    }
+
+    @Override
+    public Object evaluate(DeferredObject[] arguments) throws HiveException {
+        if (arguments == null || arguments.length != 1) {
+            return null;
+        }
+
+        Object input = arguments[0].get();
+        if (input == null) {
+            return null;
+        }
+
+        long value = getLongValue(input);
+        return new LongWritable(value * 2);
+    }
+
+    @Override
+    public String getDisplayString(String[] children) {
+        return "integer_multiply_by_2(" + (children != null ? String.join(",", children) : "") + ")";
+    }
+
+    private long getLongValue(Object obj) {
+        if (obj instanceof Number) {
+            return ((Number) obj).longValue();
+        } else {
+            throw new IllegalArgumentException("Cannot convert " + obj.getClass().getName() + " to long");
+        }
+    }
+}
diff --git a/skills/udf-benchmark/CUDF_MICROBENCHMARKS.md b/skills/udf-benchmark/CUDF_MICROBENCHMARKS.md
new file mode 100644
index 00000000000..703384979de
--- /dev/null
+++ b/skills/udf-benchmark/CUDF_MICROBENCHMARKS.md
@@ -0,0 +1,30 @@
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: CC-BY-4.0
+-->
+
+# cuDF Microbenchmarks
+
+Measures fine-grained CPU vs. GPU performance on in-memory data, without Spark overhead.
+
+## Contents
+- [ ] Implement MicroBenchRunner
+- [ ] Run microbenchmarks
+
+## Implement MicroBenchRunner
+
+Fill in the three TODO methods following the docstrings.
+
+## Run Microbenchmarks
+
+Generate data first, or reuse the output of an earlier GenData run, then run:
+```bash
+./run_micro_benchmark.sh --mode all --data-path data/bench_data_<rows>_rows.parquet --rows <rows>
+```
+
+Note that the specified number of rows is coalesced into a single cuDF table.
+Larger tables (over 1 GB) generally show the GPU's advantage more clearly.
+
+## Next Steps
+
+To profile and iteratively optimize GPU performance, use the **udf-optimize-cudf** skill.
diff --git a/skills/udf-benchmark/SKILL.md b/skills/udf-benchmark/SKILL.md
new file mode 100644
index 00000000000..d1efc8c8d24
--- /dev/null
+++ b/skills/udf-benchmark/SKILL.md
@@ -0,0 +1,82 @@
+---
+name: udf-benchmark
+description: Assists with benchmarking and profiling the performance of an Apache Spark UDF on the GPU. This is step 3 of 3 in the UDF conversion workflow (udf-gen-test -> udf-convert-to-* -> udf-benchmark). Use this skill when you have a CPU UDF and a RapidsUDF or SQL implementation, and need to benchmark the performance of the CPU UDF against the GPU implementation.
+license: CC-BY-4.0 AND Apache-2.0
+metadata:
+  spdx-file-copyright-text: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+model: inherit
+---
+
+# UDF Benchmark
+
+## Workflow
+
+- [ ] Step 1: Implement BenchUtils (fill in TODO methods)
+- [ ] Step 2: Validate with a small dataset
+- [ ] Step 3: Generate full benchmark data and run benchmarks
+- [ ] Step 4: cuDF microbenchmarks (skip for SQL targets)
+
+**Before making any edits, create a visible TODO checklist for every workflow step in this skill and keep it updated.** Do not produce a final answer until every required checklist item is marked complete.
+
+## Prerequisites
+
+- Project directory from Steps 1-2 (udf-gen-test, udf-convert-to-*) with passing tests
+
+Derive `<CamelName>` and `<snake_name>` from the UDF class name.
+
+> **Note:** Commands require access to `/tmp` (Spark temp storage) and `/dev` (GPU device). If commands fail due to sandbox restrictions, re-run them unsandboxed.
+
+## Step 1: Implement BenchUtils
+
+Read `src/main/<java|scala>/com/udf/bench/BenchUtils.<java|scala>`. Replace placeholders with the actual camel/snake UDF name.
+
+Fill in the TODO methods following the docstrings. For variable-length inputs, generate sizable rows representative of enterprise-scale data. Refer to the unit test for schema and example data.
+
+## Step 2: Validate
+
+Make scripts executable:
+```bash
+chmod +x *.sh
+```
+
+Run validation mode to test with a small dataset:
+```bash
+./run_gen_data.sh --rows 1000 --validate
+```
+
+This runs both the CPU and GPU implementations on the dataset.
+If validation fails, analyze the error and fix the BenchUtils implementation.
+
+## Step 3: Generate Data and Run Benchmarks
+
+The scripts set the default heap size to 16g in `.mvn/jvm.config`; adjust depending on data size.
+
+### Generate benchmark data (10M rows):
+```bash
+./run_gen_data.sh --rows 10000000
+```
+
+### Run benchmarks:
+```bash
+# CPU benchmark
+./run_spark_benchmark.sh --mode cpu --data-path data/bench_data_10000000_rows.parquet
+
+# GPU benchmark
+./run_spark_benchmark.sh --mode gpu --data-path data/bench_data_10000000_rows.parquet
+```
+
+Results are saved to the `results/` directory as JSON files.
+
+## Step 4: cuDF Microbenchmarks
+
+> Skip this step for SQL targets. This only applies to cuDF RapidsUDF conversions.
+
+Follow [CUDF_MICROBENCHMARKS.md](CUDF_MICROBENCHMARKS.md) to implement and run in-memory microbenchmarks.
+
+## Output
+
+Upon successful completion:
+- Benchmark utilities: `src/main/<java|scala>/com/udf/bench/BenchUtils.<java|scala>`
+- Microbenchmarks (cuDF): `src/main/<java|scala>/com/udf/bench/MicroBenchRunner.<java|scala>`
+- Generated data: `data/`
+- Benchmark results: `results/`
diff --git a/skills/udf-convert-to-cuda/SKILL.md b/skills/udf-convert-to-cuda/SKILL.md
new file mode 100644
index 00000000000..f43f7a98fb3
--- /dev/null
+++ b/skills/udf-convert-to-cuda/SKILL.md
@@ -0,0 +1,169 @@
+---
+name: udf-convert-to-cuda
+description: Assists with converting a non-aggregating Apache Spark UDF to a native CUDA RapidsUDF using JNI and libcudf. This is step 2 of 3 in the UDF conversion workflow (udf-gen-test -> udf-convert-to-cuda -> udf-benchmark). Use this skill when you have a CPU UDF with a unit test and need to convert it to a native CUDA implementation. Prefer udf-convert-to-cudf unless a CUDA implementation is necessary for performance or correctness, or if requested by the user.
+license: CC-BY-4.0 AND Apache-2.0
+metadata:
+  spdx-file-copyright-text: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+model: inherit
+---
+
+# Convert UDF to Native CUDA RapidsUDF
+
+## Workflow
+
+- [ ] Step 1: Copy CUDA add-on templates into the udf-gen-test project
+- [ ] Step 2: Create the Java RapidsUDF/JNI bridge
+- [ ] Step 3: Implement the CUDA/libcudf native function
+- [ ] Step 4: Build with the `cuda-native-udf` Maven profile
+- [ ] Step 5: Fill in the comparison test and iterate
+- [ ] Step 6: Run judge subagent if requested
+- [ ] Step 7: Review conversion
+
+**Before making any edits, create a visible TODO checklist for every workflow step in this skill and keep it updated.** Do not produce a final answer until every required checklist item is marked complete.
+
+## Prerequisites
+
+- Project directory from Step 1 (`udf-gen-test`) with a passing unit test
+- Native build tools: CMake 3.30.4+, a CUDA-compatible C++ compiler, `git`, and `unzip`
+- Docker is optional, but can be used for a stable native build environment
+
+Derive `<CamelName>` and `<snake_name>` from the UDF class name.
+
+> **Note:** Commands require access to `/tmp` (Spark temp storage) and `/dev` (GPU device). If commands fail due to sandbox restrictions, re-run them unsandboxed.
+
+## Step 1: Copy CUDA Add-On Templates
+
+Copy this skill's CUDA templates into the existing project:
+```bash
+cp -r templates/cuda/* <project_root>/<CamelName>/
+chmod +x <project_root>/<CamelName>/native/scripts/extract-cudf-libs.sh
+```
+
+The `udf-gen-test` Maven template already contains an inactive `cuda-native-udf` profile. The native profile is activated only when you build with `-Pcuda-native-udf`.
+
+Read [NATIVE_BUILD_ENV.md](references/NATIVE_BUILD_ENV.md) before changing build configuration.
+Read `examples/` for native RapidsUDF examples.
+
+## Step 2: Create the RapidsUDF/JNI Bridge
+
+Use `src/main/java/com/udf/PlaceholderUDFNameNativeRapidsUDF.java` as a starting point:
+
+1. Rename it to `<CamelName>NativeRapidsUDF.java`.
+2. Rename the class to `<CamelName>NativeRapidsUDF`.
+3. Copy the original CPU UDF interface and row-by-row method onto the class.
+4. Implement `evaluateColumnar` to validate column count/types and call the native method.
+5. Rename the native method to a descriptive operation name, e.g. `cosineSimilarityNative`.
+
+For Scala projects, keep this Java wrapper under `src/main/java/com/udf/` and register it from the Scala test/project. JNI can be used from Scala, but the Java wrapper keeps native symbol names and examples simpler.
+If the Java wrapper's CPU fallback needs to call a Scala object, direct references can fail before `scala-maven-plugin` compiles the Scala classes; use reflection in the row-by-row fallback only, and keep `evaluateColumnar` on the normal JNI path.
+
+Read [JNI_CUDA_GUIDE.md](references/JNI_CUDA_GUIDE.md) for the `evaluateColumnar` contract, type mapping, pointer ownership, `NativeDepsLoader`, and native memory rules.
+**Note:** Memory allocations must use the active RMM resource; avoid using ad hoc CUDA or Thrust allocators directly.
+
+## Step 3: Implement Native CUDA Code
+
+Rename and edit:
+- `native/src/main/cpp/src/PlaceholderUDFNameJni.cpp`
+- `native/src/main/cpp/src/placeholder_udf_name.cu`
+- `native/src/main/cpp/src/placeholder_udf_name.hpp`
+
+Update `native/src/main/cpp/CMakeLists.txt` `SOURCE_FILES` to match the renamed files. If libcudf ABI/version compatibility is unclear, defer to the user.
+
+Read [JNI_CUDA_GUIDE.md](references/JNI_CUDA_GUIDE.md) before writing kernels.
+
+Verify cuDF header names before choosing includes or APIs. After dependency extraction, the active header tree will be cloned under `target/cudf-repo/cpp/include`.
+
+### Critical Requirements
+
+- **NEVER use `copyToHost()` or native methods that copy inputs from GPU to CPU.** This defeats the purpose of GPU acceleration.
+- **Do NOT hardcode test values.** The RapidsUDF must implement actual business logic for ANY potential input.
+
+## Step 4: Build
+
+The native Maven profile uses the RAPIDS dependency already declared in `pom.xml`.
+
+```bash
+mvn package -Pcuda-native-udf -DskipTests
+```
+
+To use the Docker build environment:
+```bash
+docker build -t cuda-udf-build .
+mkdir -p "$HOME/.m2"
+docker run --rm --gpus all \
+  --user "$(id -u):$(id -g)" \
+  -e HOME=/workspace \
+  -v "$PWD":/workspace \
+  -v "$HOME/.m2":/workspace/.m2 \
+  -w /workspace \
+  cuda-udf-build \
+  -c "mvn -B -Dmaven.repo.local=/workspace/.m2/repository package -Pcuda-native-udf -DskipTests -Dnative.build.path=/workspace/target/native-build-docker"
+```
+
+If the build fails while resolving cuDF headers or RAPIDS CMake, check network access and the generated `cudf.git.branch` / `rapids.cmake.branch` properties. These properties may contain either a branch or a tag.
+
+## Step 5: Test and Iterate
+
+Fill in the target-specific TODOs in `src/test/<java|scala>/com/udf/CudfComparisonTest.<java|scala>`:
+- Register `<CamelName>NativeRapidsUDF` as the GPU implementation
+- Replace placeholder UDF names
+
+Run:
+```bash
+# Java
+mvn test -Dtest=CudfComparisonTest -Pcuda-native-udf
+
+# Scala project using a Java native RapidsUDF wrapper
+mvn test -Dsuites=com.udf.CudfComparisonTest -Pcuda-native-udf
+```
+
+To run the tests inside the Docker build environment:
+
+```bash
+docker run --rm --gpus all \
+  --user "$(id -u):$(id -g)" \
+  -e HOME=/workspace \
+  -v "$PWD":/workspace \
+  -v "$HOME/.m2":/workspace/.m2 \
+  -v /etc/passwd:/etc/passwd:ro \
+  -v /etc/group:/etc/group:ro \
+  -w /workspace \
+  cuda-udf-build \
+  -c "mvn -B -Dmaven.repo.local=/workspace/.m2/repository test -Dtest=CudfComparisonTest -Pcuda-native-udf -Dnative.build.path=/workspace/target/native-build-docker -DskipCudfExtraction=true"
+```
+
+If tests fail, iterate on the Java bridge or native implementation.
+
+### Difficult Test Failures
+
+Treat the unit test as the CPU behavior specification. Do not weaken or remove test cases silently.
+
+- Tests that check for CPU errors may not be directly applicable to a columnar implementation: the GPU path typically evaluates a whole column and may produce nulls for invalid rows instead of throwing row-level exceptions. If this causes an unavoidable mismatch, add a clear comment in the test and a `TODO/NOTE` in the implementation explaining the mismatch.
+- If a test case does not pass because of inherent CUDA/libcudf/API limitations or low-level GPU/CPU semantic differences, comment out the conflicting assertion/test only after documenting how you tried to make the behavior match and why those attempts failed. Add a note to the user.
+- If the behavior is important, common, or part of the documented input domain, **always prefer fixing the implementation** over commenting out the test case. The exception is a performance-vs-correctness tradeoff that the user explicitly approves.
+
+## Step 6: Run Judge Subagent If Requested
+
+If the user explicitly asked for the judge, a judge subagent, or a review agent, treat that as an explicit request for delegation: you **MUST** launch a separate subagent with `model: inherit` and instruct it to use the **udf-judge-conversion** skill. Ask it to review the `UnitTest`, `CudfComparisonTest`, Java bridge, and JNI/CUDA sources.
+
+If the user did not request a judge/review agent, mark this step as skipped and continue to Step 7. If a required judge subagent is blocked by tool policy, stop and tell the user that explicit permission/instruction is needed.
+
+If you run the judge, wait for it to complete and review its report. If the judge finds any issues, 1) fix the issues, 2) re-run the tests, and 3) re-run the judge subagent.
+
+## Step 7: Review Conversion
+
+Review your own work to ensure:
+- The test runs on the GPU and directly compares CPU-GPU outputs
+- The implementation does not overfit to test cases
+- No `copyToHost()` or row-by-row GPU-to-CPU copying is used for computation
+- No debug statements (e.g., `TableDebug.get().debug(...)`) remain in final output
+
+## Output
+
+Upon successful completion:
+- Native RapidsUDF file at `src/main/java/com/udf/<CamelName>NativeRapidsUDF.java`
+- JNI/CUDA sources under `native/src/main/cpp/src/`
+- Packaged native library in the generated UDF JAR
+- Comparison test passes
+
+These outputs are required for step 3 of the UDF conversion workflow (**udf-benchmark**).
diff --git a/skills/udf-convert-to-cuda/examples/CosineSimilarityJni.cpp b/skills/udf-convert-to-cuda/examples/CosineSimilarityJni.cpp
new file mode 100644
index 00000000000..39cc570d5e4
--- /dev/null
+++ b/skills/udf-convert-to-cuda/examples/CosineSimilarityJni.cpp
@@ -0,0 +1,69 @@
+/*
+ * SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: Apache-2.0
+ */
+
+#include "cosine_similarity.hpp"
+
+#include <cudf/column/column.hpp>
+#include <cudf/column/column_view.hpp>
+#include <cudf/lists/lists_column_view.hpp>
+
+#include <jni.h>
+
+#include <memory>
+#include <new>
+#include <stdexcept>
+#include <string>
+
+namespace {
+
+constexpr char const* RUNTIME_ERROR_CLASS = "java/lang/RuntimeException";
+constexpr char const* ILLEGAL_ARG_CLASS = "java/lang/IllegalArgumentException";
+
+void throw_java_exception(JNIEnv* env, char const* class_name, char const* message)
+{
+  jclass ex_class = env->FindClass(class_name);
+  if (ex_class != nullptr) {
+    env->ThrowNew(ex_class, message);
+  }
+}
+
+}  // namespace
+
+extern "C" {
+
+JNIEXPORT jlong JNICALL
+Java_com_udf_CosineSimilarityNativeRapidsUDF_cosineSimilarity(JNIEnv* env,
+                                                              jclass,
+                                                              jlong j_view1,
+                                                              jlong j_view2)
+{
+  try {
+    auto v1 = reinterpret_cast<cudf::column_view const*>(j_view1);
+    auto v2 = reinterpret_cast<cudf::column_view const*>(j_view2);
+    if (v1 == nullptr || v2 == nullptr) {
+      throw_java_exception(env, ILLEGAL_ARG_CLASS, "input column view is null");
+      return 0;
+    }
+    if (v1->type().id() != v2->type().id() || v1->type().id() != cudf::type_id::LIST) {
+      throw_java_exception(env, ILLEGAL_ARG_CLASS, "inputs are not list columns");
+      return 0;
+    }
+
+    auto lv1 = cudf::lists_column_view(*v1);
+    auto lv2 = cudf::lists_column_view(*v2);
+    std::unique_ptr<cudf::column> result = cosine_similarity(lv1, lv2);
+    return reinterpret_cast<jlong>(result.release());
+  } catch (std::bad_alloc const& e) {
+    auto message = std::string("Unable to allocate native memory: ") + e.what();
+    throw_java_exception(env, RUNTIME_ERROR_CLASS, message.c_str());
+  } catch (std::invalid_argument const& e) {
+    throw_java_exception(env, ILLEGAL_ARG_CLASS, e.what());
+  } catch (std::exception const& e) {
+    throw_java_exception(env, RUNTIME_ERROR_CLASS, e.what());
+  }
+  return 0;
+}
+
+}
diff --git a/skills/udf-convert-to-cuda/examples/CosineSimilarityNativeRapidsUDF.java b/skills/udf-convert-to-cuda/examples/CosineSimilarityNativeRapidsUDF.java
new file mode 100644
index 00000000000..af953a35516
--- /dev/null
+++ b/skills/udf-convert-to-cuda/examples/CosineSimilarityNativeRapidsUDF.java
@@ -0,0 +1,56 @@
+/*
+ * SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: Apache-2.0
+ */
+
+package com.udf;
+
+import ai.rapids.cudf.ColumnVector;
+import com.nvidia.spark.RapidsUDF;
+import org.apache.spark.sql.api.java.UDF2;
+
+import scala.collection.mutable.WrappedArray;
+
+/**
+ * Native CUDA RapidsUDF example for cosine similarity over two LIST(FLOAT32) columns.
+ */
+public class CosineSimilarityNativeRapidsUDF
+        implements UDF2<WrappedArray<Float>, WrappedArray<Float>, Float>, RapidsUDF {
+    @Override
+    public Float call(WrappedArray<Float> v1, WrappedArray<Float> v2) {
+        if (v1 == null || v2 == null) {
+            return null;
+        }
+        if (v1.length() != v2.length()) {
+            throw new IllegalArgumentException("Array lengths must match: "
+                + v1.length() + " != " + v2.length());
+        }
+
+        double dotProduct = 0;
+        double magnitude1 = 0;
+        double magnitude2 = 0;
+        for (int i = 0; i < v1.length(); i++) {
+            float f1 = v1.apply(i);
+            float f2 = v2.apply(i);
+            dotProduct += f1 * f2;
+            magnitude1 += f1 * f1;
+            magnitude2 += f2 * f2;
+        }
+        return (float) (dotProduct / (Math.sqrt(magnitude1) * Math.sqrt(magnitude2)));
+    }
+
+    @Override
+    public ColumnVector evaluateColumnar(int numRows, ColumnVector... args) {
+        if (args.length != 2) {
+            throw new IllegalArgumentException("Unexpected argument count: " + args.length);
+        }
+        if (numRows != args[0].getRowCount() || numRows != args[1].getRowCount()) {
+            throw new IllegalArgumentException("Input row count mismatch");
+        }
+
+        NativeUDFLoader.ensureLoaded();
+        return new ColumnVector(cosineSimilarity(args[0].getNativeView(), args[1].getNativeView()));
+    }
+
+    private static native long cosineSimilarity(long vectorView1, long vectorView2);
+}
diff --git a/skills/udf-convert-to-cuda/examples/cosine_similarity.cu b/skills/udf-convert-to-cuda/examples/cosine_similarity.cu
new file mode 100644
index 00000000000..e36e3c17cfc
--- /dev/null
+++ b/skills/udf-convert-to-cuda/examples/cosine_similarity.cu
@@ -0,0 +1,119 @@
+/*
+ * SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: Apache-2.0
+ */
+
+#include "cosine_similarity.hpp"
+
+#include <cudf/column/column_factories.hpp>
+#include <cudf/lists/lists_column_view.hpp>
+#include <cudf/null_mask.hpp>
+#include <cudf/table/table_view.hpp>
+#include <cudf/utilities/bit.hpp>
+#include <cudf/utilities/type_checks.hpp>
+
+#include <rmm/cuda_stream_view.hpp>
+#include <rmm/device_uvector.hpp>
+#include <rmm/exec_policy.hpp>
+
+#include <cuda/std/cmath>
+
+#include <thrust/iterator/counting_iterator.h>
+#include <thrust/logical.h>
+#include <thrust/transform.h>
+
+namespace {
+
+struct cosine_similarity_functor {
+  float const* const v1;
+  float const* const v2;
+  int32_t const* const v1_offsets;
+  int32_t const* const v2_offsets;
+
+  __device__ float operator()(cudf::size_type row_idx)
+  {
+    auto const v1_start_idx = v1_offsets[row_idx];
+    auto const v1_num_elems = v1_offsets[row_idx + 1] - v1_start_idx;
+    auto const v2_start_idx = v2_offsets[row_idx];
+    auto const v2_num_elems = v2_offsets[row_idx + 1] - v2_start_idx;
+
+    double magnitude1 = 0;
+    double magnitude2 = 0;
+    double dot_product = 0;
+    for (auto i = 0; i < v1_num_elems; i++) {
+      float const f1 = v1[v1_start_idx + i];
+      float const f2 = v2[v2_start_idx + i];
+      magnitude1 += f1 * f1;
+      magnitude2 += f2 * f2;
+      dot_product += f1 * f2;
+    }
+    return static_cast<float>(dot_product / (cuda::std::sqrt(magnitude1) * cuda::std::sqrt(magnitude2)));
+  }
+};
+
+}  // namespace
+
+std::unique_ptr<cudf::column> cosine_similarity(cudf::lists_column_view const& lv1,
+                                                cudf::lists_column_view const& lv2,
+                                                rmm::cuda_stream_view stream,
+                                                rmm::device_async_resource_ref mr)
+{
+  if (!cudf::have_same_types(lv1.child(), lv2.child()) ||
+      lv1.child().type().id() != cudf::type_id::FLOAT32) {
+    throw std::invalid_argument("inputs are not lists of floats");
+  }
+
+  auto const row_count = lv1.size();
+  if (row_count != lv2.size()) {
+    throw std::invalid_argument("input row counts do not match");
+  }
+  if (row_count == 0) {
+    return cudf::make_empty_column(cudf::data_type{cudf::type_id::FLOAT32});
+  }
+  if (lv1.child().null_count() != 0 || lv2.child().null_count() != 0) {
+    throw std::invalid_argument("null floats are not supported");
+  }
+
+  auto const lv1_offsets_ptr = lv1.offsets().data<int32_t>();
+  auto const lv2_offsets_ptr = lv2.offsets().data<int32_t>();
+  auto const lv1_null_mask = lv1.parent().null_mask();
+  auto const lv2_null_mask = lv2.parent().null_mask();
+
+  bool const are_offsets_equal =
+    thrust::all_of(rmm::exec_policy_nosync(stream),
+                   thrust::make_counting_iterator<cudf::size_type>(0),
+                   thrust::make_counting_iterator<cudf::size_type>(row_count),
+                   [lv1_offsets_ptr, lv2_offsets_ptr, lv1_null_mask, lv2_null_mask]
+                   __device__(cudf::size_type idx) -> bool {
+                     bool const lv1_is_null =
+                       lv1_null_mask != nullptr && !cudf::bit_is_set(lv1_null_mask, idx);
+                     bool const lv2_is_null =
+                       lv2_null_mask != nullptr && !cudf::bit_is_set(lv2_null_mask, idx);
+                     if (lv1_is_null || lv2_is_null) {
+                       return true;
+                     }
+                     return (lv1_offsets_ptr[idx + 1] - lv1_offsets_ptr[idx]) ==
+                            (lv2_offsets_ptr[idx + 1] - lv2_offsets_ptr[idx]);
+                   });
+  if (!are_offsets_equal) {
+    throw std::invalid_argument("input list lengths do not match for every row");
+  }
+
+  rmm::device_uvector<float> float_results(row_count, stream, mr);
+  thrust::transform(rmm::exec_policy_nosync(stream),
+                    thrust::make_counting_iterator<cudf::size_type>(0),
+                    thrust::make_counting_iterator<cudf::size_type>(row_count),
+                    float_results.data(),
+                    cosine_similarity_functor({lv1.child().data<float>(),
+                                               lv2.child().data<float>(),
+                                               lv1.offsets().data<int32_t>(),
+                                               lv2.offsets().data<int32_t>()}));
+
+  auto [null_mask, null_count] =
+    cudf::bitmask_and(cudf::table_view({lv1.parent(), lv2.parent()}), stream, mr);
+  return std::make_unique<cudf::column>(cudf::data_type{cudf::type_id::FLOAT32},
+                                        row_count,
+                                        float_results.release(),
+                                        std::move(null_mask),
+                                        null_count);
+}
diff --git a/skills/udf-convert-to-cuda/examples/cosine_similarity.hpp b/skills/udf-convert-to-cuda/examples/cosine_similarity.hpp
new file mode 100644
index 00000000000..99b78ede0f7
--- /dev/null
+++ b/skills/udf-convert-to-cuda/examples/cosine_similarity.hpp
@@ -0,0 +1,22 @@
+/*
+ * SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: Apache-2.0
+ */
+
+#pragma once
+
+#include <cudf/column/column.hpp>
+#include <cudf/lists/lists_column_view.hpp>
+#include <cudf/utilities/default_stream.hpp>
+#include <cudf/utilities/memory_resource.hpp>
+
+#include <rmm/cuda_stream_view.hpp>
+#include <rmm/resource_ref.hpp>
+
+#include <memory>
+
+std::unique_ptr<cudf::column> cosine_similarity(
+  cudf::lists_column_view const& lv1,
+  cudf::lists_column_view const& lv2,
+  rmm::cuda_stream_view stream      = cudf::get_default_stream(),
+  rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref());
diff --git a/skills/udf-convert-to-cuda/references/JNI_CUDA_GUIDE.md b/skills/udf-convert-to-cuda/references/JNI_CUDA_GUIDE.md
new file mode 100644
index 00000000000..2137b79837b
--- /dev/null
+++ b/skills/udf-convert-to-cuda/references/JNI_CUDA_GUIDE.md
@@ -0,0 +1,162 @@
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: CC-BY-4.0
+-->
+
+# JNI and CUDA RapidsUDF Guide
+
+## RapidsUDF Contract
+
+The RapidsUDF interface lets a CPU UDF run on the GPU when using the RAPIDS Accelerator for Apache Spark. It has a single method to override, `evaluateColumnar`. The CPU evaluation method must remain on the native RapidsUDF class so Spark can fall back to the CPU if the surrounding plan cannot run on the GPU.
+
+`evaluateColumnar(int numRows, ColumnVector... args)` receives columnar forms of the same inputs as the CPU UDF. All input columns should have `numRows` rows. Scalar inputs may be expanded into full columns by the RAPIDS Accelerator, so do not rely on detecting scalar-vs-column input.
+
+The returned `ColumnVector` must have `numRows` rows and a cuDF type that matches the Spark return type:
+
+| Spark Type | cuDF Type |
+|---|---|
+| BooleanType | BOOL8 |
+| ByteType | INT8 |
+| ShortType | INT16 |
+| IntegerType | INT32 |
+| LongType | INT64 |
+| FloatType | FLOAT32 |
+| DoubleType | FLOAT64 |
+| DecimalType | DECIMAL32, DECIMAL64, DECIMAL128 * |
+| DateType | TIMESTAMP_DAYS |
+| TimestampType | TIMESTAMP_MICROSECONDS |
+| StringType | STRING |
+| ArrayType | LIST of element type |
+| MapType | LIST of STRUCT(key, value) |
+| StructType | STRUCT of fields |
+
+For example, if the CPU UDF returns the Spark type `ArrayType(MapType(StringType, StringType))`, then `evaluateColumnar` must return a column of type `LIST(LIST(STRUCT(STRING, STRING)))`.
+
+*Note: cuDF's DECIMAL32 corresponds to precision <= 9 digits, DECIMAL64 to 9 < precision <= 18 digits, and DECIMAL128 to 18 < precision <= 38 digits. Precision greater than 38 digits is unsupported.
+cuDF decimal scales are the negation of Spark `DecimalType` scales. For example, Spark `DecimalType(precision=11, scale=2)` translates to cuDF type `DECIMAL64(scale=-2)`.
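+
+As a rough illustration of the precision and scale rules above, the following host-side C++ sketch maps a Spark `DecimalType(precision, scale)` to a cuDF decimal type name and scale. The `CudfDecimalType` struct and the `spark_decimal_to_cudf` function are hypothetical names used only for this example; they are not part of cuDF or the RAPIDS Accelerator, and rejecting a precision below 1 is an assumption.
+
+```cpp
+#include <cstdint>
+#include <optional>
+#include <string>
+
+// Hypothetical illustration type; not a cuDF or RAPIDS Accelerator API.
+struct CudfDecimalType {
+  std::string type_id;  // "DECIMAL32", "DECIMAL64", or "DECIMAL128"
+  int32_t scale;        // cuDF scale, i.e. the negated Spark scale
+};
+
+// Returns std::nullopt when the precision is outside 1..38
+// (the lower bound of 1 is an assumption made for this sketch).
+std::optional<CudfDecimalType> spark_decimal_to_cudf(int precision, int spark_scale)
+{
+  if (precision < 1 || precision > 38) { return std::nullopt; }
+
+  char const* type_id = "DECIMAL128";  // 18 < precision <= 38
+  if (precision <= 9) {
+    type_id = "DECIMAL32";
+  } else if (precision <= 18) {
+    type_id = "DECIMAL64";
+  }
+  return CudfDecimalType{type_id, -spark_scale};
+}
+```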
+
+For `ArrayType(elementType, containsNull)`, the LIST parent null mask represents null arrays. Child nulls represent null array elements and must match the `containsNull` contract. Either preserve child nulls deliberately or reject them explicitly.
+
+## Java Wrapper
+
+Use `NativeDepsLoader.loadNativeDeps(new String[] {"rapidsudfjni"})` from a synchronized loader. Call it from `evaluateColumnar`, not a static initializer, because the Spark driver may not have the executor CUDA runtime.
+
+Pass input columns to JNI with `ColumnVector.getNativeView()`. Wrap the native result with `new ColumnVector(nativeHandle)`.
+
+Do not close input `ColumnVector`s. The RAPIDS Accelerator owns them. Closing inputs can cause double-close errors.
+
+## JNI and Native Ownership
+
+JNI arguments are non-owning pointers:
+```cpp
+auto input = reinterpret_cast<cudf::column_view const*>(j_input);
+```
+
+The native function must allocate and return an owning `cudf::column`:
+```cpp
+std::unique_ptr<cudf::column> result = compute(*input);
+return reinterpret_cast<jlong>(result.release());
+```
+
+Never return a pointer to an input view, child view, stack object, or a column owned by a temporary that will be destroyed before Java wraps it.
+
+Catch `std::bad_alloc`, `std::invalid_argument`, and `std::exception`, in that order so the more specific handlers run first, then throw Java exceptions with `JNIEnv::ThrowNew`.
+
+## CUDA/libcudf Implementation
+
+Start with libcudf column APIs before writing custom kernels. Use custom CUDA kernels when the operation requires fused logic, custom reductions, or logic unavailable in cuDF Java/libcudf primitives.
+
+### Checklist
+
+- Validate input types and row counts in Java before crossing JNI when possible
+- Validate libcudf types again in JNI for native safety
+- Preserve Spark null semantics
+- Prefer `cudf::column_view`/`cudf::lists_column_view` for input views
+- Return `std::unique_ptr<cudf::column>`
+- Avoid host copies in the final implementation
+- Prefer public libcudf APIs; avoid using `cudf::detail`
+- Keep one native function focused on one UDF operation
+
+### Correctness Pitfalls
+
+- **Null values of fixed-width columns are undefined memory.** Check the null mask (`cudf::bit_is_set(...)` or `column_device_view::is_valid(...)`) before reading element values.
+- **Empty list/string columns have no offsets.** Accessing the offsets child of an empty list or string column is undefined behavior. Handle the empty case early (e.g., return `cudf::make_empty_column(...)`).
+- **Use `cudf::have_same_types(a, b)` for type comparison**, not `a.type() == b.type()` — equality misses differences such as decimal scale.
+- **`cudf::size_type` is `int32_t`. LIST offsets are always `int32_t`.** String offsets may be `int32_t` or `int64_t` for large strings.
+- **Nested column null masks must agree across levels.** When constructing LIST/STRUCT output yourself, ensure parent and child null masks are consistent.
+- **`CUDF_EXPECTS` conditions must be pure predicates** — side effects inside the condition may only execute in debug builds.
+
+### Useful Patterns
+
+- `rmm::device_uvector<T>`: temporary device output buffers that can be released into a `cudf::column`
+- `rmm::exec_policy_nosync(stream)`: pass the intended CUDA stream to Thrust algorithms (prefer the `_nosync` variant unless you need an implicit host-device sync)
+- `cudf::make_empty_column(...)`: return correctly typed empty outputs
+- `cudf::make_numeric_column(...)`: allocate fixed-width output columns with a null mask
+- `cudf::bitmask_and(cudf::table_view({...}))`: combine input validity masks for output null semantics
+- `cudf::lists_column_view`: inspect list offsets, child columns, parent null masks, and nested list shapes
+- `cudf::strings_column_view`: inspect string chars/offsets when implementing string kernels
+- `cudf::create_null_mask(...)`: create all-valid, all-null, or uninitialized masks for new outputs
+- CUB and Thrust APIs: useful for scans, reductions, transforms, selection, and sorting when libcudf does not provide the exact operation
+
+### Memory Allocation
+
+- All device allocations must go through the active RMM memory resource.
+- Use libcudf factories or RMM types such as `rmm::device_uvector<T>` and `rmm::device_buffer`; avoid direct calls to `cudaMalloc`, `cudaMallocAsync`, or other ad hoc device allocators.
+- Use the output MR for returned columns when the API exposes one; use `cudf::get_current_device_resource_ref()` for short-lived temporary buffers.
+- Use RMM pinned memory for large host buffers. Small CPU-only metadata may use normal C++ containers.
+
+Example allocating CUB scratch buffers through RMM:
+
+```cpp
+size_t temp_storage_bytes = 0;
+cub::DeviceScan::InclusiveSum(nullptr, temp_storage_bytes, in, out, n, stream.value());
+rmm::device_buffer temp_storage(temp_storage_bytes, stream, cudf::get_current_device_resource_ref());
+cub::DeviceScan::InclusiveSum(temp_storage.data(), temp_storage_bytes, in, out, n, stream.value());
+```
+
+### Stream and MR Plumbing
+
+Top-level native functions should accept stream and MR as the last two parameters, with defaults:
+
+```cpp
+std::unique_ptr<cudf::column> my_op(
+    cudf::column_view const& input,
+    rmm::cuda_stream_view stream      = cudf::get_default_stream(),
+    rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref());
+```
+
+Use the passed-in `mr` for the returned column and `cudf::get_current_device_resource_ref()` for short-lived temporaries. Propagate `stream` to every libcudf call, Thrust call, and kernel launch — do not introduce `rmm::cuda_stream_default` inside the implementation.
+
+### Kernel Launch Discipline
+
+Always check kernel launches; silent launch failures cause downstream corruption.
+
+```cpp
+my_kernel<<<grid, block, 0, stream.value()>>>(args);
+CUDF_CHECK_CUDA(stream.value());
+```
+
+Prefer `cuda::std::` (e.g. `cuda::std::min`, `cuda::std::sqrt`, `cuda::std::numeric_limits<T>`) over `std::` inside `__device__` and `CUDF_HOST_DEVICE` code.
+
+Avoid synchronizing in the hot path except when required to fetch output sizes or while debugging.
+
+### Output Construction
+
+For variable-size list outputs:
+1. Compute per-row child sizes on device, using zero for null parent rows.
+2. Prefix-sum sizes into an `INT32` offsets column of length `numRows + 1`.
+3. Allocate the child column from the final offset, fill it on device, and set child nulls if `containsNull=true`.
+4. Assemble the LIST column from offsets, child column, parent null mask, and parent null count.
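+
+Steps 1 and 2 above can be sketched on the host as follows. This is only an illustration of the offsets arithmetic; a real implementation would run the scan on the device (for example with Thrust or CUB on the passed-in stream), and the function name `sizes_to_offsets` is hypothetical.
+
+```cpp
+#include <cstdint>
+#include <numeric>
+#include <vector>
+
+// Host-side illustration only. `sizes[i]` is the child element count for
+// row i, with 0 used for null parent rows.
+std::vector<int32_t> sizes_to_offsets(std::vector<int32_t> const& sizes)
+{
+  // One more offset than rows; offsets[0] stays 0.
+  std::vector<int32_t> offsets(sizes.size() + 1, 0);
+  // offsets[i + 1] = sizes[0] + ... + sizes[i]
+  std::partial_sum(sizes.begin(), sizes.end(), offsets.begin() + 1);
+  return offsets;
+}
+```
+
+For sizes `{2, 0, 3}` the offsets are `{0, 2, 2, 5}`, and the final offset (5) is the length of the child column to allocate in step 3.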
+
+For string outputs, construct proper offsets, chars, and null masks. For scalar numeric outputs, prefer libcudf transforms/reductions where possible.
+
+## Debugging
+
+Rerun tests with `-Ddebug.memory.leaks=true` to enable Java refcount debugging; this catches leaked `ColumnVector`, `Table`, `Scalar`, and Java-owned buffer objects.
+Note that it does **not** catch native memory leaks; use RMM RAII patterns to ensure all native allocations are freed.
+
+For native kernel memory errors, run the comparison test under Compute Sanitizer:
+
+```bash
+compute-sanitizer --tool memcheck mvn test <flags>
+```
diff --git a/skills/udf-convert-to-cuda/references/NATIVE_BUILD_ENV.md b/skills/udf-convert-to-cuda/references/NATIVE_BUILD_ENV.md
new file mode 100644
index 00000000000..56af56fd34d
--- /dev/null
+++ b/skills/udf-convert-to-cuda/references/NATIVE_BUILD_ENV.md
@@ -0,0 +1,92 @@
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: CC-BY-4.0
+-->
+
+# Native CUDA UDF Build Environment
+
+## Dependency Model
+
+The native build uses the RAPIDS JAR already resolved by Maven. The `cuda-native-udf` profile asks Maven to copy `rapids-4-spark_<scala>-<version>-<cuda>.jar` and `rapids-4-spark_<scala>-<version>.jar` into `target/rapids-jar`. The `native/scripts/extract-cudf-libs.sh` script then extracts `libcudf.so*` and `libnvcomp.so*`, clones matching cuDF headers, builds `librapidsudfjni.so`, and packages it in the UDF JAR for `NativeDepsLoader`.
+
+No separate manual JAR download is required. Maven should resolve the RAPIDS dependency declared in `pom.xml`; the native profile reuses the same coordinates and copies the resolved JAR into `target/rapids-jar`.
+
+The profile first tries the CUDA-classified artifact (`-cuda12`) and then the unclassified artifact. If extraction fails, the selected JAR probably does not contain Linux native CUDA libraries or the Maven cache/repository is inconsistent with the generated version properties.
+
+## Required Tools
+
+- CUDA toolkit matching the version spark-rapids was built against, plus a compatible NVIDIA driver
+- CMake 3.30.4+
+- C++ compiler compatible with the selected CUDA toolkit
+- JDK 17
+- Maven
+- `git`
+- `unzip`
+
+## CUDA Toolkit Version
+
+The native build compiles against the prebuilt libcudf in the spark-rapids jar, so the local CUDA toolkit must match the version spark-rapids was built against.
+
+1. Get the CUDA version(s) spark-rapids is built against:
+
+```bash
+curl -fsSL https://nvidia.github.io/spark-rapids/docs/download.html \
+  | grep -Eo '[^<>]*built against CUDA[^<>]*'
+```
+
+2. Check the active toolkit (`nvcc --version`). CMake uses `$CUDACXX`, else `nvcc` on `PATH`, else `$CUDAToolkit_ROOT/bin/nvcc` — the default `PATH` `nvcc` may not be the one you want.
+
+3. If it doesn't match, point the build at a matching toolkit that's already installed; otherwise install one that matches:
+
+```bash
+export CUDACXX=/usr/local/cuda-<major.minor>/bin/nvcc
+export CUDAToolkit_ROOT=/usr/local/cuda-<major.minor>
+export PATH="$CUDAToolkit_ROOT/bin:$PATH"
+```
+
+Docker is optional. Use it when local compiler/CMake/CUDA versions drift or when the build needs to be reproducible across machines.
+
+The provided Dockerfile installs JDK 17 and sets it via `/etc/profile.d/java17.sh`. If a modified Dockerfile or alternate entrypoint bypasses the login shell and `mvn` reports Java 8, export `JAVA_HOME=/usr/lib/jvm/java-17-openjdk` and prepend `$JAVA_HOME/bin` to `PATH` explicitly.
+
+Use the full Docker command listed in SKILL.md. It runs as the calling user to avoid root-owned artifacts, mounts the project and Maven cache, and uses a Docker-specific native build path so CMake cache paths do not conflict with host builds.
+
+If a previous root container run already wrote `target/` artifacts, fix ownership or clean them before rerunning as a non-root user.
+
+CMake stores absolute source and build paths in `CMakeCache.txt`. A host-generated `target/native-build` cannot be reused from `/workspace/target/native-build` inside Docker. Use `mvn clean`, remove the stale native build directory, or pass a Docker-specific path such as `-Dnative.build.path=/workspace/target/native-build-docker`.
+
+## Version Alignment
+
+Keep these values aligned:
+- Spark version
+- Scala binary version
+- `rapids4spark.version`
+- `cuda.version`
+- `cudf.git.branch`
+- `rapids.cmake.branch`
+- JDK version
+
+The generated template maps RAPIDS `<major>.<minor>.<patch>` to the `v<major>.<minor>.00` cuDF and rapids-cmake tags. If building a snapshot, a custom RAPIDS JAR, or a patch release with known native ABI changes, verify the matching cuDF/RMM/CCCL versions with the user.
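+
+The tag mapping described above can be written as a small string transformation. The sketch below is a hypothetical helper, not code from the generated template, and it assumes the input always has the `<major>.<minor>.<patch>` form.
+
+```cpp
+#include <string>
+
+// Hypothetical helper: maps a RAPIDS version such as "26.04.0" to the
+// cuDF / rapids-cmake tag "v26.04.00".
+// Assumes the input has exactly three dot-separated components.
+std::string rapids_version_to_cudf_tag(std::string const& rapids_version)
+{
+  auto const last_dot = rapids_version.rfind('.');
+  // Keep "<major>.<minor>" and replace the patch component with "00".
+  return "v" + rapids_version.substr(0, last_dot) + ".00";
+}
+```
+
+Patch releases map to the same tag, which is why a patch release with native ABI changes needs a manual check.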
+
+## Fast Rebuilds and Verification
+
+After the first successful extraction, use `-DskipCudfExtraction=true` while iterating on Java/JNI/CUDA source:
+
+```bash
+mvn package -Pcuda-native-udf -DskipCudfExtraction=true -DskipTests
+```
+
+Verify deployable packaging with:
+
+```bash
+jar tf target/*.jar | grep librapidsudfjni.so
+```
+
+## Build Modes
+
+Default: `USE_PREBUILT_CUDF=ON`.
+
+This extracts `libcudf` from the RAPIDS JAR and builds only the UDF JNI/CUDA library. This is the stable, fast path.
+
+Escape hatch: `-DUSE_PREBUILT_CUDF=OFF`.
+
+This builds cuDF from source through RAPIDS CMake/CPM. It is slow and more sensitive to branch drift; ask the user before using it.
diff --git a/skills/udf-convert-to-cuda/templates/cuda/Dockerfile b/skills/udf-convert-to-cuda/templates/cuda/Dockerfile
new file mode 100644
index 00000000000..f75a5f19376
--- /dev/null
+++ b/skills/udf-convert-to-cuda/templates/cuda/Dockerfile
@@ -0,0 +1,65 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+# Reproducible build image for native CUDA RapidsUDF code.
+ARG CUDA_VERSION=12.9.1
+ARG LINUX_VERSION=rockylinux8
+
+FROM nvidia/cuda:${CUDA_VERSION}-devel-${LINUX_VERSION}
+
+ARG TOOLSET_VERSION=14
+ARG CMAKE_VERSION=3.30.4
+ARG CMAKE_ARCH=x86_64
+ARG CCACHE_VERSION=4.11.2
+ARG PARALLEL_LEVEL=10
+
+ENV TOOLSET_VERSION=${TOOLSET_VERSION}
+ENV PARALLEL_LEVEL=${PARALLEL_LEVEL}
+ENV JAVA_HOME=/usr/lib/jvm/java-17-openjdk
+
+RUN dnf --enablerepo=powertools install -y \
+  gcc-toolset-${TOOLSET_VERSION} \
+  git \
+  java-17-openjdk-devel \
+  maven \
+  ninja-build \
+  patch \
+  python39 \
+  scl-utils \
+  tar \
+  unzip \
+  wget \
+  zlib-devel \
+  && alternatives --set python /usr/bin/python3
+
+RUN cd /usr/local && \
+  wget --quiet https://github.com/Kitware/CMake/releases/download/v${CMAKE_VERSION}/cmake-${CMAKE_VERSION}-linux-${CMAKE_ARCH}.tar.gz && \
+  tar zxf cmake-${CMAKE_VERSION}-linux-${CMAKE_ARCH}.tar.gz && \
+  rm cmake-${CMAKE_VERSION}-linux-${CMAKE_ARCH}.tar.gz
+ENV PATH=${JAVA_HOME}/bin:/usr/local/cmake-${CMAKE_VERSION}-linux-${CMAKE_ARCH}/bin:${PATH}
+
+# Bake the SCL activation and Java 17 environment into /etc/profile.d so they are restored by `bash -l` on every container start.
+RUN printf 'source /opt/rh/gcc-toolset-%s/enable\n' "${TOOLSET_VERSION}" \
+      > /etc/profile.d/scl-gcc-toolset.sh && \
+    printf '%s\n%s\n' \
+      'export JAVA_HOME=/usr/lib/jvm/java-17-openjdk' \
+      'export PATH=$JAVA_HOME/bin:$PATH' \
+      > /etc/profile.d/java17.sh
+
+RUN cd /tmp && \
+  wget --quiet https://github.com/ccache/ccache/releases/download/v${CCACHE_VERSION}/ccache-${CCACHE_VERSION}.tar.gz && \
+  tar zxf ccache-${CCACHE_VERSION}.tar.gz && \
+  rm ccache-${CCACHE_VERSION}.tar.gz && \
+  cd ccache-${CCACHE_VERSION} && \
+  mkdir build && \
+  cd build && \
+  scl enable gcc-toolset-${TOOLSET_VERSION} \
+    "cmake .. \
+      -DCMAKE_BUILD_TYPE=Release \
+      -DZSTD_FROM_INTERNET=ON \
+      -DREDIS_STORAGE_BACKEND=OFF && \
+    cmake --build . --parallel ${PARALLEL_LEVEL} --target install" && \
+  cd ../.. && \
+  rm -rf ccache-${CCACHE_VERSION}
+
+ENTRYPOINT ["bash", "-l"]
diff --git a/skills/udf-convert-to-cuda/templates/cuda/native/scripts/extract-cudf-libs.sh b/skills/udf-convert-to-cuda/templates/cuda/native/scripts/extract-cudf-libs.sh
new file mode 100644
index 00000000000..52020e71c3c
--- /dev/null
+++ b/skills/udf-convert-to-cuda/templates/cuda/native/scripts/extract-cudf-libs.sh
@@ -0,0 +1,81 @@
+#!/bin/bash
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+PROJECT_DIR="$(cd "${SCRIPT_DIR}/../.." && pwd)"
+TARGET_DIR="${TARGET_DIR:-${PROJECT_DIR}/target}"
+NATIVE_DEPS_DIR="${TARGET_DIR}/native-deps"
+CUDF_REPO_DIR="${TARGET_DIR}/cudf-repo"
+RAPIDS_JAR_DIR="${TARGET_DIR}/rapids-jar"
+
+SCALA_VERSION="${SCALA_VERSION:-2.12}"
+RAPIDS4SPARK_VERSION="${RAPIDS4SPARK_VERSION:-26.04.0}"
+CUDA_VERSION="${CUDA_VERSION:-cuda12}"
+CUDF_BRANCH="${CUDF_BRANCH:-v26.04.00}"
+
+mkdir -p "${NATIVE_DEPS_DIR}" "${CUDF_REPO_DIR}"
+
+choose_rapids_jar() {
+  local candidates=(
+    "${RAPIDS_JAR_DIR}/rapids-4-spark_${SCALA_VERSION}-${RAPIDS4SPARK_VERSION}-${CUDA_VERSION}.jar"
+    "${RAPIDS_JAR_DIR}/rapids-4-spark_${SCALA_VERSION}-${RAPIDS4SPARK_VERSION}.jar"
+    "${HOME}/.m2/repository/com/nvidia/rapids-4-spark_${SCALA_VERSION}/${RAPIDS4SPARK_VERSION}/rapids-4-spark_${SCALA_VERSION}-${RAPIDS4SPARK_VERSION}-${CUDA_VERSION}.jar"
+    "${HOME}/.m2/repository/com/nvidia/rapids-4-spark_${SCALA_VERSION}/${RAPIDS4SPARK_VERSION}/rapids-4-spark_${SCALA_VERSION}-${RAPIDS4SPARK_VERSION}.jar"
+  )
+
+  for candidate in "${candidates[@]}"; do
+    if [[ -f "${candidate}" ]]; then
+      echo "${candidate}"
+      return 0
+    fi
+  done
+
+  echo "ERROR: Could not find a rapids-4-spark jar." >&2
+  echo "Tried target/rapids-jar and ~/.m2 for version ${RAPIDS4SPARK_VERSION} (${CUDA_VERSION})." >&2
+  echo "Run the build through Maven with -Pcuda-native-udf so the profile can copy the RAPIDS dependency first." >&2
+  return 1
+}
+
+JAR_PATH="$(choose_rapids_jar)"
+
+echo "Using RAPIDS jar: ${JAR_PATH}"
+echo "Using cuDF header ref: ${CUDF_BRANCH}"
+
+TEMP_DIR="${TARGET_DIR}/cudf-extract"
+rm -rf "${TEMP_DIR}"
+mkdir -p "${TEMP_DIR}"
+
+if ! unzip -o "${JAR_PATH}" "*/libcudf.so*" "*/libnvcomp.so*" -d "${TEMP_DIR}"; then
+  echo "ERROR: Failed to extract libcudf/libnvcomp from ${JAR_PATH}" >&2
+  echo "The selected RAPIDS jar may not include native Linux CUDA libraries." >&2
+  rm -rf "${TEMP_DIR}"
+  exit 1
+fi
+
+while IFS= read -r source_file; do
+  cp -f "${source_file}" "${NATIVE_DEPS_DIR}/$(basename "${source_file}")"
+done < <(find "${TEMP_DIR}" -name "*.so*")
+rm -rf "${TEMP_DIR}"
+
+if [[ ! -f "${NATIVE_DEPS_DIR}/libcudf.so" ]]; then
+  echo "ERROR: libcudf.so was not extracted into ${NATIVE_DEPS_DIR}" >&2
+  exit 1
+fi
+
+if [[ ! -d "${CUDF_REPO_DIR}/.git" ]]; then
+  git clone --depth 1 --branch "${CUDF_BRANCH}" https://github.com/rapidsai/cudf.git "${CUDF_REPO_DIR}"
+else
+  echo "Using existing cuDF headers at ${CUDF_REPO_DIR}"
+fi
+
+if [[ ! -d "${CUDF_REPO_DIR}/cpp/include" ]]; then
+  echo "ERROR: cuDF headers not found at ${CUDF_REPO_DIR}/cpp/include" >&2
+  exit 1
+fi
+
+echo "Native dependencies ready:"
+echo "  Libraries: ${NATIVE_DEPS_DIR}"
+echo "  Headers:   ${CUDF_REPO_DIR}/cpp/include"
diff --git a/skills/udf-convert-to-cuda/templates/cuda/native/src/main/cpp/CMakeLists.txt b/skills/udf-convert-to-cuda/templates/cuda/native/src/main/cpp/CMakeLists.txt
new file mode 100644
index 00000000000..a9398b4938c
--- /dev/null
+++ b/skills/udf-convert-to-cuda/templates/cuda/native/src/main/cpp/CMakeLists.txt
@@ -0,0 +1,119 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+cmake_minimum_required(VERSION 3.30.4 FATAL_ERROR)
+
+set(RAPIDS_CMAKE_BRANCH "v26.04.00" CACHE STRING "rapids-cmake branch or tag")
+if(RAPIDS_CMAKE_BRANCH MATCHES "^v(.+)")
+  set(rapids-cmake-version "${CMAKE_MATCH_1}")
+  set(rapids-cmake-tag "${RAPIDS_CMAKE_BRANCH}")
+else()
+  set(rapids-cmake-branch "${RAPIDS_CMAKE_BRANCH}")
+endif()
+set(NATIVE_LIBRARY_NAME "rapidsudfjni" CACHE STRING "JNI shared library target name")
+set(NATIVE_DEPS_DIR "${CMAKE_CURRENT_SOURCE_DIR}/../../../../target/native-deps" CACHE PATH "Directory containing prebuilt libcudf")
+set(CUDF_SOURCE_DIR "${CMAKE_CURRENT_SOURCE_DIR}/../../../../target/cudf-repo/cpp" CACHE PATH "cuDF source directory for headers")
+set(GPU_ARCHS "RAPIDS" CACHE STRING "CUDA architectures")
+
+file(DOWNLOAD
+  https://raw.githubusercontent.com/rapidsai/rapids-cmake/${RAPIDS_CMAKE_BRANCH}/RAPIDS.cmake
+  ${CMAKE_BINARY_DIR}/RAPIDS.cmake
+)
+include(${CMAKE_BINARY_DIR}/RAPIDS.cmake)
+
+include(rapids-cmake)
+include(rapids-cpm)
+include(rapids-cuda)
+
+if(DEFINED ENV{CXX} AND NOT "$ENV{CXX}" STREQUAL "")
+  set(CMAKE_CXX_COMPILER "$ENV{CXX}" CACHE FILEPATH "C++ compiler" FORCE)
+endif()
+
+if(DEFINED GPU_ARCHS)
+  set(CMAKE_CUDA_ARCHITECTURES "${GPU_ARCHS}")
+endif()
+rapids_cuda_init_architectures(RAPIDSUDFJNI)
+
+project(RAPIDSUDFJNI VERSION 26.04.0 LANGUAGES C CXX CUDA)
+
+set(CMAKE_EXPORT_COMPILE_COMMANDS ON)
+set(CMAKE_POSITION_INDEPENDENT_CODE ON)
+set(CMAKE_CXX_STANDARD 20)
+set(CMAKE_CXX_STANDARD_REQUIRED ON)
+set(CMAKE_CUDA_STANDARD 20)
+set(CMAKE_CUDA_STANDARD_REQUIRED ON)
+set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -w --expt-extended-lambda --expt-relaxed-constexpr")
+
+option(USE_PREBUILT_CUDF "Use libcudf extracted from the rapids-4-spark jar" ON)
+option(PER_THREAD_DEFAULT_STREAM "Build with per-thread default stream" ON)
+option(CUDF_ENABLE_ARROW_S3 "Enable Arrow S3 support in source-build mode" OFF)
+
+if(USE_PREBUILT_CUDF)
+  if(NOT EXISTS "${NATIVE_DEPS_DIR}")
+    message(FATAL_ERROR "NATIVE_DEPS_DIR does not exist: ${NATIVE_DEPS_DIR}")
+  endif()
+  if(NOT EXISTS "${CUDF_SOURCE_DIR}/include")
+    message(FATAL_ERROR "CUDF_SOURCE_DIR headers not found: ${CUDF_SOURCE_DIR}/include")
+  endif()
+
+  find_library(CUDF_LIBRARY NAMES cudf PATHS "${NATIVE_DEPS_DIR}" NO_DEFAULT_PATH REQUIRED)
+
+  get_property(rapids-cmake-dir GLOBAL PROPERTY rapids-cmake-dir)
+  if(NOT rapids-cmake-dir)
+    set(rapids-cmake-dir "${CMAKE_BINARY_DIR}/_deps/rapids-cmake-src")
+  endif()
+
+  rapids_cpm_init()
+  include("${rapids-cmake-dir}/cpm/cccl.cmake")
+  rapids_cpm_cccl()
+  include("${rapids-cmake-dir}/cpm/rmm.cmake")
+  rapids_cpm_rmm()
+
+  if(NOT TARGET rmm::rmm)
+    message(FATAL_ERROR "rmm::rmm target was not created")
+  endif()
+
+  get_target_property(RMM_INCLUDE_DIRS rmm::rmm INTERFACE_INCLUDE_DIRECTORIES)
+
+  add_library(cudf_imported SHARED IMPORTED GLOBAL)
+  set_target_properties(cudf_imported PROPERTIES IMPORTED_LOCATION "${CUDF_LIBRARY}")
+  target_include_directories(cudf_imported INTERFACE
+    "${CUDF_SOURCE_DIR}/include"
+    ${RMM_INCLUDE_DIRS}
+  )
+  target_link_libraries(cudf_imported INTERFACE rmm::rmm)
+  add_library(cudf::cudf ALIAS cudf_imported)
+else()
+  rapids_cpm_init()
+  rapids_cpm_find(cudf 26.04.00
+    CPM_ARGS
+      GIT_REPOSITORY https://github.com/rapidsai/cudf.git
+      GIT_TAG        ${RAPIDS_CMAKE_BRANCH}
+      GIT_SHALLOW    TRUE
+      SOURCE_SUBDIR  cpp
+      OPTIONS        "BUILD_TESTS OFF"
+                     "BUILD_BENCHMARKS OFF"
+                     "CUDF_ENABLE_ARROW_S3 ${CUDF_ENABLE_ARROW_S3}"
+                     "CUDF_KVIKIO_REMOTE_IO OFF"
+                     "DISABLE_DEPRECATION_WARNING ON"
+                     "AUTO_DETECT_CUDA_ARCHITECTURES OFF"
+  )
+endif()
+
+find_package(JNI REQUIRED)
+
+set(SOURCE_FILES
+  "src/PlaceholderUDFNameJni.cpp"
+  "src/placeholder_udf_name.cu"
+)
+
+add_library(${NATIVE_LIBRARY_NAME} SHARED ${SOURCE_FILES})
+set_target_properties(${NATIVE_LIBRARY_NAME} PROPERTIES BUILD_RPATH "\$ORIGIN")
+
+if(PER_THREAD_DEFAULT_STREAM)
+  target_compile_definitions(${NATIVE_LIBRARY_NAME} PRIVATE CUDA_API_PER_THREAD_DEFAULT_STREAM)
+endif()
+
+target_include_directories(${NATIVE_LIBRARY_NAME} PRIVATE ${JNI_INCLUDE_DIRS})
+target_compile_definitions(${NATIVE_LIBRARY_NAME} PUBLIC SPDLOG_ACTIVE_LEVEL=SPDLOG_LEVEL_OFF)
+target_link_libraries(${NATIVE_LIBRARY_NAME} cudf::cudf)
diff --git a/skills/udf-convert-to-cuda/templates/cuda/native/src/main/cpp/src/PlaceholderUDFNameJni.cpp b/skills/udf-convert-to-cuda/templates/cuda/native/src/main/cpp/src/PlaceholderUDFNameJni.cpp
new file mode 100644
index 00000000000..1b07459b9fd
--- /dev/null
+++ b/skills/udf-convert-to-cuda/templates/cuda/native/src/main/cpp/src/PlaceholderUDFNameJni.cpp
@@ -0,0 +1,58 @@
+/*
+ * SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: Apache-2.0
+ */
+
+#include "placeholder_udf_name.hpp"
+
+#include <cudf/column/column.hpp>
+#include <cudf/column/column_view.hpp>
+
+#include <jni.h>
+
+#include <memory>
+#include <string>
+
+namespace {
+
+constexpr char const* RUNTIME_ERROR_CLASS = "java/lang/RuntimeException";
+constexpr char const* ILLEGAL_ARG_CLASS = "java/lang/IllegalArgumentException";
+
+void throw_java_exception(JNIEnv* env, char const* class_name, char const* message)
+{
+  jclass ex_class = env->FindClass(class_name);
+  if (ex_class != nullptr) {
+    env->ThrowNew(ex_class, message);
+  }
+}
+
+}  // namespace
+
+extern "C" {
+
+JNIEXPORT jlong JNICALL
+Java_com_udf_PlaceholderUDFNameNativeRapidsUDF_evaluateNative(JNIEnv* env,
+                                                              jclass,
+                                                              jlong input_view)
+{
+  try {
+    auto input = reinterpret_cast<cudf::column_view const*>(input_view);
+    if (input == nullptr) {
+      throw_java_exception(env, ILLEGAL_ARG_CLASS, "input column view is null");
+      return 0;
+    }
+
+    std::unique_ptr<cudf::column> result = placeholder_udf_name(*input);
+    return reinterpret_cast<jlong>(result.release());
+  } catch (std::bad_alloc const& e) {
+    auto message = std::string("Unable to allocate native memory: ") + e.what();
+    throw_java_exception(env, RUNTIME_ERROR_CLASS, message.c_str());
+  } catch (std::invalid_argument const& e) {
+    throw_java_exception(env, ILLEGAL_ARG_CLASS, e.what());
+  } catch (std::exception const& e) {
+    throw_java_exception(env, RUNTIME_ERROR_CLASS, e.what());
+  }
+  return 0;
+}
+
+}
diff --git a/skills/udf-convert-to-cuda/templates/cuda/native/src/main/cpp/src/placeholder_udf_name.cu b/skills/udf-convert-to-cuda/templates/cuda/native/src/main/cpp/src/placeholder_udf_name.cu
new file mode 100644
index 00000000000..5e0de8f20ff
--- /dev/null
+++ b/skills/udf-convert-to-cuda/templates/cuda/native/src/main/cpp/src/placeholder_udf_name.cu
@@ -0,0 +1,18 @@
+/*
+ * SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: Apache-2.0
+ */
+
+#include "placeholder_udf_name.hpp"
+
+#include <cudf/column/column_factories.hpp>
+#include <cudf/null_mask.hpp>
+#include <cudf/types.hpp>
+
+std::unique_ptr<cudf::column> placeholder_udf_name(cudf::column_view const& input)
+{
+  // TODO: Replace this placeholder with the actual CUDA/libcudf implementation.
+  auto null_mask = cudf::create_null_mask(input.size(), cudf::mask_state::ALL_NULL);
+  return cudf::make_numeric_column(
+    cudf::data_type{cudf::type_id::INT32}, input.size(), std::move(null_mask), input.size());
+}
diff --git a/skills/udf-convert-to-cuda/templates/cuda/native/src/main/cpp/src/placeholder_udf_name.hpp b/skills/udf-convert-to-cuda/templates/cuda/native/src/main/cpp/src/placeholder_udf_name.hpp
new file mode 100644
index 00000000000..d34ac2f8828
--- /dev/null
+++ b/skills/udf-convert-to-cuda/templates/cuda/native/src/main/cpp/src/placeholder_udf_name.hpp
@@ -0,0 +1,13 @@
+/*
+ * SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: Apache-2.0
+ */
+
+#pragma once
+
+#include <cudf/column/column.hpp>
+#include <cudf/column/column_view.hpp>
+
+#include <memory>
+
+std::unique_ptr<cudf::column> placeholder_udf_name(cudf::column_view const& input);
diff --git a/skills/udf-convert-to-cuda/templates/cuda/src/main/java/com/udf/NativeUDFLoader.java b/skills/udf-convert-to-cuda/templates/cuda/src/main/java/com/udf/NativeUDFLoader.java
new file mode 100644
index 00000000000..d5469882951
--- /dev/null
+++ b/skills/udf-convert-to-cuda/templates/cuda/src/main/java/com/udf/NativeUDFLoader.java
@@ -0,0 +1,29 @@
+/*
+ * SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: Apache-2.0
+ */
+
+package com.udf;
+
+import ai.rapids.cudf.NativeDepsLoader;
+
+import java.io.IOException;
+
+/** Loads JNI libraries packaged in this UDF jar. */
+public final class NativeUDFLoader {
+    private static boolean loaded;
+
+    private NativeUDFLoader() {
+    }
+
+    public static synchronized void ensureLoaded() {
+        if (!loaded) {
+            try {
+                NativeDepsLoader.loadNativeDeps(new String[] {"rapidsudfjni"});
+                loaded = true;
+            } catch (IOException e) {
+                throw new RuntimeException("Failed to load native CUDA UDF library", e);
+            }
+        }
+    }
+}
diff --git a/skills/udf-convert-to-cuda/templates/cuda/src/main/java/com/udf/PlaceholderUDFNameNativeRapidsUDF.java b/skills/udf-convert-to-cuda/templates/cuda/src/main/java/com/udf/PlaceholderUDFNameNativeRapidsUDF.java
new file mode 100644
index 00000000000..8212de0b399
--- /dev/null
+++ b/skills/udf-convert-to-cuda/templates/cuda/src/main/java/com/udf/PlaceholderUDFNameNativeRapidsUDF.java
@@ -0,0 +1,45 @@
+/*
+ * SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: Apache-2.0
+ */
+
+package com.udf;
+
+import ai.rapids.cudf.ColumnVector;
+import com.nvidia.spark.RapidsUDF;
+// TODO: add imports for CPU UDF's base type, e.g.:
+//   import org.apache.hadoop.hive.ql.exec.UDF;
+//   import org.apache.spark.sql.api.java.UDFn;
+
+/**
+ * Template for a native CUDA RapidsUDF.
+ *
+ * 1. Rename this class and file to {@code <CamelName>NativeRapidsUDF}.
+ * 2. Match the CPU UDF's Spark contract:
+ *    - Hive UDF       : add {@code extends org.apache.hadoop.hive.ql.exec.UDF}
+ *    - Java typed UDF : add {@code implements UDFn<T1,...,R>} alongside {@code RapidsUDF}
+ *    - Scala CPU UDF  : implement the equivalent {@code UDFn<...>} contract.
+ *                       Invoke the Scala UDF via reflection from {@code call(...)}.
+ * 3. Add the CPU evaluation method.
+ * 4. Update {@code evaluateColumnar} and {@code evaluateNative} as needed to match the signature.
+ */
+public class PlaceholderUDFNameNativeRapidsUDF implements RapidsUDF {
+
+    // TODO: copy the original CPU evaluation method here (evaluate / call).
+
+    @Override
+    public ColumnVector evaluateColumnar(int numRows, ColumnVector... args) {
+        if (args.length != 1) {
+            throw new IllegalArgumentException("Unexpected argument count: " + args.length);
+        }
+        if (numRows != args[0].getRowCount()) {
+            throw new IllegalArgumentException(
+                "Expected " + numRows + " rows, received " + args[0].getRowCount());
+        }
+
+        NativeUDFLoader.ensureLoaded();
+        return new ColumnVector(evaluateNative(args[0].getNativeView()));
+    }
+
+    private static native long evaluateNative(long inputView);
+}
diff --git a/skills/udf-convert-to-cudf/SKILL.md b/skills/udf-convert-to-cudf/SKILL.md
new file mode 100644
index 00000000000..58d4b01d2b7
--- /dev/null
+++ b/skills/udf-convert-to-cudf/SKILL.md
@@ -0,0 +1,128 @@
+---
+name: udf-convert-to-cudf
+description: Assists with converting an Apache Spark UDF to a GPU-accelerated RapidsUDF using cuDF Java APIs. This is step 2 of 3 in the UDF conversion workflow (udf-gen-test -> udf-convert-to-cudf -> udf-benchmark). Use this skill when you have a CPU UDF with a unit test and need to convert it to a RapidsUDF.
+license: CC-BY-4.0 AND Apache-2.0
+metadata:
+  spdx-file-copyright-text: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+model: inherit
+---
+
+# Convert UDF to cuDF RapidsUDF
+
+## Workflow
+
+- [ ] Step 1: Create the RapidsUDF file
+- [ ] Step 2: Implement the `evaluateColumnar` method
+- [ ] Step 3: Build and test
+- [ ] Step 4: Check for memory leaks
+- [ ] Step 5: Run judge subagent if requested
+- [ ] Step 6: Review conversion
+
+**Before making any edits, create a visible TODO checklist for every workflow step in this skill and keep it updated.** Do not produce a final answer until every required checklist item is marked complete.
+
+## Prerequisites
+
+- Project directory from Step 1 (udf-gen-test) with passing unit test
+
+Derive `<CamelName>` and `<snake_name>` from the UDF class name.
+
+> **Note:** Commands require access to `/tmp` (Spark temp storage) and `/dev` (GPU device). If commands fail due to sandbox restrictions, re-run them unsandboxed.
+
+## Step 1: Create the RapidsUDF File
+
+Create a copy of the original UDF file in the same source directory (`src/main/<java|scala>/com/udf/`), then modify it:
+
+1. Add imports:
+    Java: `import ai.rapids.cudf.*;`, `import com.nvidia.spark.RapidsUDF;`
+    Scala: `import ai.rapids.cudf._`, `import com.nvidia.spark.RapidsUDF`, `import Arm.{withResource, closeOnExcept}`
+2. Add `implements RapidsUDF` to the class declaration
+3. Add the `evaluateColumnar` method stub:
+    Java: `public ColumnVector evaluateColumnar(int numRows, ColumnVector... args) { }`
+    Scala: `def evaluateColumnar(numRows: Int, args: ColumnVector*): ColumnVector = { }`
+4. Rename the class and the file to `<CamelName>RapidsUDF`
+
+## Step 2: Implement the `evaluateColumnar` Method
+
+### Background
+
+**Read `references/RAPIDS_UDF.md`** for detailed background on:
+- How RapidsUDF and `evaluateColumnar` work
+- Input ColumnVector types and output type mapping
+- Debugging techniques and GPU memory management
+
+**Read `examples/` for example RapidsUDF implementations for the target language.**
+
+### Implementation
+
+1. Clone https://github.com/rapidsai/cudf (branch matching spark-rapids version) to `~/.cache/aether_agent/` if not already present. Explore `java/src/<main|test>/java/ai/rapids/cudf` for relevant methods and usage patterns.
+2. Implement the `evaluateColumnar` method using cuDF APIs.
+
+### Critical Requirements
+
+- **NEVER use `copyToHost()` or any other method that copies data from the GPU to the CPU.** Doing so defeats the purpose of GPU acceleration.
+- **Do NOT hardcode test values.** The RapidsUDF must implement actual business logic for ANY potential input
+
+## Step 3: Build and Test
+
+Fill in the target-specific TODOs in `src/test/<java|scala>/com/udf/CudfComparisonTest.<java|scala>`:
+- Implement `registerRapidsUDF` to register the new RapidsUDF class.
+- Replace placeholders with the actual camel/snake UDF name
+
+Then run the test:
+```bash
+# Java
+mvn test -Dtest=CudfComparisonTest
+
+# Scala
+mvn test -Dsuites=com.udf.CudfComparisonTest
+```
+
+If the test fails, analyze the error and iterate on the RapidsUDF implementation.
+
+### Difficult Test Failures
+
+Treat the unit test as the CPU behavior specification. Do not weaken or remove test cases silently.
+
+- Tests that check for CPU errors may not be directly applicable to a columnar implementation: the GPU path typically evaluates a whole column and may produce nulls for invalid rows instead of throwing row-level exceptions. If this causes an unavoidable mismatch, add a clear comment in the test and a `TODO/NOTE` in the implementation explaining the mismatch.
+- If a test case does not pass because of inherent cuDF/libcudf/API limitations or low-level GPU/CPU semantic differences, comment out the conflicting assertion/test only after documenting how you tried to make the behavior match and why those attempts failed. Add a note to the user.
+- If the behavior is important, common, or part of the documented input domain, **always prefer fixing the implementation** over commenting out the test case. The exception is a performance-vs-correctness tradeoff that the user explicitly approves.
+
+## Step 4: Memory Leak Check
+
+Re-run with memory leak detection:
+```bash
+# Java
+mvn test -Dtest=CudfComparisonTest -Ddebug.memory.leaks=true > /tmp/memleak.log 2>&1
+
+# Scala
+mvn test -Dsuites=com.udf.CudfComparisonTest -Ddebug.memory.leaks=true > /tmp/memleak.log 2>&1
+
+# Check for leaks
+grep "LEAKED" /tmp/memleak.log | head -5
+```
+
+If leaks are found, ensure all GPU objects are properly closed.
+
+## Step 5: Run Judge Subagent If Requested
+
+If the user explicitly asked for the judge, a judge subagent, or a review agent, treat that as an explicit request for delegation: you **MUST** launch a separate subagent with `model: inherit` and instruct it to use the **udf-judge-conversion** skill. Ask it to review the `UnitTest`, `CudfComparisonTest`, and RapidsUDF implementation.
+
+If the user did not request a judge/review agent, mark this step as skipped and continue to Step 6. If a required judge subagent is blocked by tool policy, stop and tell the user that explicit permission/instruction is needed.
+
+If you run the judge, wait for it to complete and review its report. If the judge finds any issues, 1) fix the issues, 2) re-run the tests and leak checks, and 3) re-run the judge subagent.
+
+## Step 6: Review Conversion
+
+Review your own work to ensure:
+- The test runs on the GPU and directly compares CPU-GPU outputs
+- The implementation does not overfit to test cases
+- No `copyToHost()` or row-by-row GPU-to-CPU copying is used for computation
+- No debug statements (e.g., `TableDebug.get().debug(...)`) remain in final output
+
+## Output
+
+Upon successful completion:
+- RapidsUDF file at `src/main/<java|scala>/com/udf/<CamelName>RapidsUDF.<java|scala>`
+- Comparison test passes with no memory leaks
+
+These outputs are required for **Step 3: Benchmark**.
diff --git a/skills/udf-convert-to-cudf/examples/URLDecode.java b/skills/udf-convert-to-cudf/examples/URLDecode.java
new file mode 100644
index 00000000000..7122e15a68e
--- /dev/null
+++ b/skills/udf-convert-to-cudf/examples/URLDecode.java
@@ -0,0 +1,57 @@
+/*
+ * SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: Apache-2.0
+ */
+
+import ai.rapids.cudf.*;
+import com.nvidia.spark.RapidsUDF;
+import org.apache.spark.sql.api.java.UDF1;
+
+import java.io.UnsupportedEncodingException;
+import java.net.URLDecoder;
+
+/** Decode URL-encoded strings. */
+public class URLDecode implements UDF1<String, String>, RapidsUDF {
+  /** Row-by-row implementation that executes on the CPU */
+  @Override
+  public String call(String s) {
+    String result = null;
+    if (s != null) {
+      try {
+        result = URLDecoder.decode(s, "utf-8");
+      } catch (IllegalArgumentException ignored) {
+        result = s;
+      } catch (UnsupportedEncodingException e) {
+        // utf-8 is a builtin, standard encoding, so this should never happen
+        throw new RuntimeException(e);
+      }
+    }
+    return result;
+  }
+
+  /** Columnar implementation that runs on the GPU */
+  @Override
+  public ColumnVector evaluateColumnar(int numRows, ColumnVector... args) {
+    // The CPU implementation takes a single string argument, so similarly
+    // there should only be one column argument of type STRING.
+    if (args.length != 1) {
+      throw new IllegalArgumentException("Unexpected argument count: " + args.length);
+    }
+    ColumnVector input = args[0];
+    if (numRows != input.getRowCount()) {
+      throw new IllegalArgumentException("Expected " + numRows + " rows, received " + input.getRowCount());
+    }
+    if (!input.getType().equals(DType.STRING)) {
+      throw new IllegalArgumentException("Argument type is not a string column: " +
+          input.getType());
+    }
+
+    // The cudf urlDecode does not convert '+' to a space, so do that as a pre-pass first.
+    // All intermediate results are closed to avoid leaking GPU resources.
+    try (Scalar plusScalar = Scalar.fromString("+");
+         Scalar spaceScalar = Scalar.fromString(" ");
+         ColumnVector replaced = input.stringReplace(plusScalar, spaceScalar)) {
+      return replaced.urlDecode();
+    }
+  }
+}
diff --git a/skills/udf-convert-to-cudf/examples/URLDecodeExtendsFunction.scala b/skills/udf-convert-to-cudf/examples/URLDecodeExtendsFunction.scala
new file mode 100644
index 00000000000..4a5e4f086f0
--- /dev/null
+++ b/skills/udf-convert-to-cudf/examples/URLDecodeExtendsFunction.scala
@@ -0,0 +1,44 @@
+/*
+ * SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: Apache-2.0
+ */
+
+import java.net.URLDecoder
+
+import ai.rapids.cudf._
+import com.nvidia.spark.RapidsUDF
+import Arm.{withResource, closeOnExcept}
+
+/** Decode URL-encoded strings. */
+class URLDecode extends Function1[String, String] with RapidsUDF with Serializable {
+  /** Row-by-row implementation that executes on the CPU */
+  override def apply(s: String): String = {
+    Option(s).map { s =>
+      try {
+        URLDecoder.decode(s, "utf-8")
+      } catch {
+        case _: IllegalArgumentException => s
+      }
+    }.orNull
+  }
+
+  /** Columnar implementation that runs on the GPU */
+  override def evaluateColumnar(numRows: Int, args: ColumnVector*): ColumnVector = {
+    // The CPU implementation takes a single string argument, so similarly
+    // there should only be one column argument of type STRING.
+    require(args.length == 1, s"Unexpected argument count: ${args.length}")
+    val input = args.head
+    require(numRows == input.getRowCount, s"Expected $numRows rows, received ${input.getRowCount}")
+    require(input.getType == DType.STRING, s"Argument type is not a string: ${input.getType}")
+
+    // The cudf urlDecode does not convert '+' to a space, so do that as a pre-pass first.
+    // All intermediate results are closed using withResource to avoid leaking GPU resources.
+    withResource(Scalar.fromString("+")) { plusScalar =>
+      withResource(Scalar.fromString(" ")) { spaceScalar =>
+        withResource(input.stringReplace(plusScalar, spaceScalar)) { replaced =>
+          replaced.urlDecode()
+        }
+      }
+    }
+  }
+}
diff --git a/skills/udf-convert-to-cudf/examples/URLDecodeHive.java b/skills/udf-convert-to-cudf/examples/URLDecodeHive.java
new file mode 100644
index 00000000000..d5b571e7085
--- /dev/null
+++ b/skills/udf-convert-to-cudf/examples/URLDecodeHive.java
@@ -0,0 +1,57 @@
+/*
+ * SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: Apache-2.0
+ */
+
+import ai.rapids.cudf.*;
+import com.nvidia.spark.RapidsUDF;
+import org.apache.hadoop.hive.ql.exec.UDF;
+
+import java.io.UnsupportedEncodingException;
+import java.net.URLDecoder;
+
+/** Decode URL-encoded strings. */
+public class URLDecode extends UDF implements RapidsUDF {
+
+  /** Row-by-row implementation that executes on the CPU */
+  public String evaluate(String s) {
+    String result = null;
+    if (s != null) {
+      try {
+        result = URLDecoder.decode(s, "utf-8");
+      } catch (IllegalArgumentException ignored) {
+        result = s;
+      } catch (UnsupportedEncodingException e) {
+        // utf-8 is a builtin, standard encoding, so this should never happen
+        throw new RuntimeException(e);
+      }
+    }
+    return result;
+  }
+
+  /** Columnar implementation that runs on the GPU */
+  @Override
+  public ColumnVector evaluateColumnar(int numRows, ColumnVector... args) {
+    // The CPU implementation takes a single string argument, so similarly
+    // there should only be one column argument of type STRING.
+    if (args.length != 1) {
+      throw new IllegalArgumentException("Unexpected argument count: " + args.length);
+    }
+    ColumnVector input = args[0];
+    if (numRows != input.getRowCount()) {
+      throw new IllegalArgumentException("Expected " + numRows + " rows, received " + input.getRowCount());
+    }
+    if (!input.getType().equals(DType.STRING)) {
+      throw new IllegalArgumentException("Argument type is not a string column: " +
+          input.getType());
+    }
+
+    // The cudf urlDecode does not convert '+' to a space, so do that as a pre-pass first.
+    // All intermediate results are closed to avoid leaking GPU resources.
+    try (Scalar plusScalar = Scalar.fromString("+");
+         Scalar spaceScalar = Scalar.fromString(" ");
+         ColumnVector replaced = input.stringReplace(plusScalar, spaceScalar)) {
+      return replaced.urlDecode();
+    }
+  }
+}
diff --git a/skills/udf-convert-to-cudf/examples/URLDecodeWithField.scala b/skills/udf-convert-to-cudf/examples/URLDecodeWithField.scala
new file mode 100644
index 00000000000..7dc4f122dcd
--- /dev/null
+++ b/skills/udf-convert-to-cudf/examples/URLDecodeWithField.scala
@@ -0,0 +1,49 @@
+/*
+ * SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: Apache-2.0
+ */
+
+import java.net.URLDecoder
+
+import ai.rapids.cudf._
+import com.nvidia.spark.RapidsUDF
+import org.apache.spark.sql.functions.udf
+import Arm.{withResource, closeOnExcept}
+
+/** Decode URL-encoded strings. */
+object URLDecode {
+  val myUDF = udf(
+    new Function1[String, String] with RapidsUDF with Serializable {
+      /** Row-by-row implementation that executes on the CPU */
+      override def apply(s: String): String = {
+        Option(s).map { s =>
+          try {
+            URLDecoder.decode(s, "utf-8")
+          } catch {
+            case _: IllegalArgumentException => s
+          }
+        }.orNull
+      }
+
+      /** Columnar implementation that runs on the GPU */
+      override def evaluateColumnar(numRows: Int, args: ColumnVector*): ColumnVector = {
+        // The CPU implementation takes a single string argument, so similarly
+        // there should only be one column argument of type STRING.
+        require(args.length == 1, s"Unexpected argument count: ${args.length}")
+        val input = args.head
+        require(numRows == input.getRowCount, s"Expected $numRows rows, received ${input.getRowCount}")
+        require(input.getType == DType.STRING, s"Argument type is not a string: ${input.getType}")
+
+        // The cudf urlDecode does not convert '+' to a space, so do that as a pre-pass first.
+        // All intermediate results are closed using withResource to avoid leaking GPU resources.
+        withResource(Scalar.fromString("+")) { plusScalar =>
+          withResource(Scalar.fromString(" ")) { spaceScalar =>
+            withResource(input.stringReplace(plusScalar, spaceScalar)) { replaced =>
+              replaced.urlDecode()
+            }
+          }
+        }
+      }
+    }
+  )
+}
diff --git a/skills/udf-convert-to-cudf/references/RAPIDS_UDF.md b/skills/udf-convert-to-cudf/references/RAPIDS_UDF.md
new file mode 100644
index 00000000000..737aa640d41
--- /dev/null
+++ b/skills/udf-convert-to-cudf/references/RAPIDS_UDF.md
@@ -0,0 +1,111 @@
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: CC-BY-4.0
+-->
+
+# Background: RAPIDS Accelerated UDFs
+
+These instructions document how to implement a GPU version of an existing CPU UDF using the RapidsUDF interface. The RapidsUDF interface provides a way to run a CPU UDF on the GPU when using the RAPIDS Accelerator for Apache Spark.
+
+## Implementation
+
+The original CPU implementation lives in the UDF's row-based method (`evaluate` for Hive UDFs, `call` for Java `UDFn` classes, `apply` for Scala functions). To make a UDF run on the GPU, you must implement the RapidsUDF interface, which declares a single method to override: `evaluateColumnar`. The `evaluateColumnar` method should use existing cuDF methods from the [Java APIs of RAPIDS cuDF](https://docs.rapids.ai/api/cudf-java/legacy) to perform the UDF computation by operating on cuDF ColumnVectors.
+
+Note that you must keep both CPU and GPU evaluate methods, so that the UDF will still work if a higher-level operation involving the Rapids UDF falls back to the CPU.
+
+Refer to examples/ for example RapidsUDF implementations.
+
+## Interpreting Inputs
+
+The RAPIDS Accelerator passes the inputs of the CPU UDF, in columnar form, through the `args` array. For example, if the CPU UDF expects two inputs, a String and an Integer, then `evaluateColumnar` is invoked with an array of two cuDF ColumnVector instances of type STRING and INT32 respectively.
+
+Note that passing scalar inputs to a RAPIDS accelerated UDF is supported with limitations. The scalar value will be replicated into a full column before being passed to evaluateColumnar. Therefore the UDF implementation cannot easily detect the difference between a scalar input and a columnar input.
+
+The implementation of `evaluateColumnar` must return a column with exactly `numRows` rows, which equals the number of input rows. All input columns will contain the same number of rows.
+
+## Generating Output
+
+evaluateColumnar must return a ColumnVector of an appropriate cuDF type to match the result type of the original UDF.
+
+The following table shows the mapping of Spark types to equivalent cuDF columnar types:
+
+| Spark Type    | cuDF Type                                  |
+|---------------|--------------------------------------------|
+| BooleanType   | BOOL8                                      |
+| ByteType      | INT8                                       |
+| ShortType     | INT16                                      |
+| IntegerType   | INT32                                      |
+| LongType      | INT64                                      |
+| FloatType     | FLOAT32                                    |
+| DoubleType    | FLOAT64                                    |
+| DecimalType   | DECIMAL32, DECIMAL64, DECIMAL128 *         |
+| DateType      | TIMESTAMP_DAYS                             |
+| TimestampType | TIMESTAMP_MICROSECONDS                     |
+| StringType    | STRING                                     |
+| NullType      | INT8                                       |
+| ArrayType     | LIST of the underlying element type        |
+| MapType       | LIST of STRUCT of the key and value types  |
+| StructType    | STRUCT of all the field types              |
+
+For example, if the CPU UDF returns the Spark type `ArrayType(MapType(StringType, StringType))` then evaluateColumnar must return a column of type `LIST(LIST(STRUCT(STRING,STRING)))`.
+
+*Note: cuDF's DECIMAL32 corresponds to precision <= 9 digits, DECIMAL64 corresponds to 9 < precision <= 18 digits, and DECIMAL128 corresponds to 18 < precision <= 38 digits. Precision greater than 38 digits is unsupported.
+
+Note that cuDF decimals use a negative scale relative to Spark DecimalType. For example, Spark DecimalType(precision=11, scale=2) would translate to cuDF type DECIMAL64(scale=-2).
+
+## Debugging
+
+When debugging, it may be helpful to print data type information about cuDF objects. For example, to get information about a ColumnVector:
+
+```java
+System.out.println("Param 1 info: " + param1Column);
+```
+
+Example output:
+
+```text
+Param 1 info: ColumnVector{rows=10, type=INT32, nullCount=Optional.empty, offHeap=(ID: 880 7d1d4c5951e0)}
+```
+
+To print the actual values in a column or table, use `TableDebug`:
+
+```java
+TableDebug debugger = TableDebug.get();
+debugger.debug("Param 1 data:", param1Column);
+```
+
+Note that you should NEVER call this from production code, since it causes a device-to-host copy.
+
+## Managing Memory
+
+JVM memory management is a poor fit for GPU operations because the JVM assumes that all the memory it manages is on the Java heap. **Therefore, you must free GPU resources promptly with try-with-resources or try-finally blocks**, calling `close()` to release each resource. Use `incRefCount()` only when an object needs an additional owner that will close it separately.
+
+The JVM's garbage collector is generally triggered when the JVM heap runs out of free space, but not necessarily when GPU memory runs out.
+To help detect the resulting GPU memory leaks, the cuDF Java code tracks these objects. If the garbage collector ends up freeing the memory instead of an explicit `close()`, cuDF logs an error like the following:
+
+```text
+ERROR ColumnVector: A DEVICE COLUMN VECTOR WAS LEAKED (ID: 15 7fb5f94d8fa0)
+```
+
+These messages are an indication that an object on the GPU was not properly closed. Once a leak is detected, the Spark driver/executor `extraJavaOptions` can be set to `-Dai.rapids.refcount.debug=true -ea` to get a stack trace for the leak.
+
+The user will run the unit test and provide tracebacks if memory leaks occur to help you debug the issue.
+
+For Scala, use `withResource` and `closeOnExcept` from the `Arm` object for resource management.
+
+**Note:** Avoid placing the input ColumnVectors (those passed in `args`) in try-finally or try-with-resources blocks. The RAPIDS Accelerator will close the input columns for you. For example, avoid doing this:
+
+```java
+ColumnVector param1 = args[0];
+try {
+  // Do something with param1
+} finally {
+  param1.close();
+}
+```
+
+This will result in a double-close error:
+
+```text
+java.lang.IllegalStateException: Close called too many times ColumnVector{rows=10, type=INT32, nullCount=Optional.empty, offHeap=(ID: 637 0)}
+```
diff --git a/skills/udf-convert-to-sql/SKILL.md b/skills/udf-convert-to-sql/SKILL.md
new file mode 100644
index 00000000000..a55f464e555
--- /dev/null
+++ b/skills/udf-convert-to-sql/SKILL.md
@@ -0,0 +1,87 @@
+---
+name: udf-convert-to-sql
+description: Assists with converting an Apache Spark UDF to a functionally equivalent Spark SQL expression. This is step 2 of 3 in the UDF conversion workflow (udf-gen-test -> udf-convert-to-sql -> udf-benchmark). Use this skill when you have a CPU UDF with a unit test and need to convert it to SQL for GPU acceleration.
+license: CC-BY-4.0 AND Apache-2.0
+metadata:
+  spdx-file-copyright-text: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+model: inherit
+---
+
+# Convert UDF to Spark SQL
+
+## Workflow
+
+- [ ] Step 1: Implement the SQL expression
+- [ ] Step 2: Fill in the comparison test and iterate
+- [ ] Step 3: Run judge subagent if requested
+- [ ] Step 4: Review conversion
+
+**Before making any edits, create a visible TODO checklist for every workflow step in this skill and keep it updated.** Do not produce a final answer until every required checklist item is marked complete.
+
+## Prerequisites
+
+- Project directory from Step 1 (udf-gen-test) with passing unit test
+
+Derive `<CamelName>` and `<snake_name>` from the UDF class name.
+
+> **Note:** Commands require access to `/tmp` (Spark temp storage) and `/dev` (GPU device). If commands fail due to sandbox restrictions, re-run them unsandboxed.
+
+## Step 1: Implement the SQL Expression
+
+Implement the SQL expression in a file at `src/main/resources/<snake_name>.sql`.
+
+**Read `examples/` for example UDF-to-SQL conversions for the target language.**
+
+### Guidelines
+
+- Focus on correctness FIRST, then GPU compatibility — the test will report which operators are not GPU-compatible
+- Avoid expensive joins; prefer window functions, CTEs, and built-in array/map functions over explode-and-aggregate patterns
+
+**Do NOT hardcode test sample values or outputs.** The SQL expression must work correctly for ANY potential input.
+
+## Step 2: Fill In the Comparison Test and Iterate
+
+Update `src/test/<java|scala>/com/udf/SqlComparisonTest.<java|scala>`:
+- Update the SQL file path to point to your `src/main/resources/<snake_name>.sql` file
+- Replace placeholders with the actual camel/snake UDF name
+
+Then run the test:
+```bash
+# Java
+mvn test -Dtest=SqlComparisonTest
+
+# Scala
+mvn test -Dsuites=com.udf.SqlComparisonTest
+```
+
+If the test fails, analyze the error and iterate on the SQL expression.
+
+### Difficult Test Failures
+
+Treat the unit test as the CPU behavior specification. Do not weaken or remove test cases silently.
+
+- Tests that check for CPU errors may not be directly applicable to SQL operators: Spark RAPIDS typically evaluates a whole column/batch and may produce nulls for invalid rows instead of throwing one row-level exception. Make an explicit judgment call about the UDF contract. Add a clear comment in the test and a `TODO/NOTE` in the SQL statement explaining the mismatch.
+- In rare cases, the Spark RAPIDS Plugin has known discrepancies in certain SQL operators. If a test case does not pass because of these discrepancies, notify the user and comment out the conflicting assertion/test only after documenting how you tried to make the behavior match and why those attempts failed.
+- If the behavior is important, common, or part of the documented input domain, **always prefer fixing the SQL expression** over commenting out the test case. The exception is a performance-vs-correctness tradeoff that the user explicitly approves.
+
+## Step 3: Run Judge Subagent If Requested
+
+If the user explicitly asked for the judge, a judge subagent, or a review agent, treat that as an explicit request for delegation: you **MUST** launch a separate subagent with `model: inherit` and instruct it to use the **udf-judge-conversion** skill. Ask it to review the `UnitTest`, `SqlComparisonTest`, and SQL expression.
+
+If the user did not request a judge/review agent, mark this step as skipped and continue to Step 4. If a required judge subagent is blocked by tool policy, stop and tell the user that explicit permission/instruction is needed.
+
+If you run the judge, wait for it to complete and review its report. If the judge finds any issues, 1) fix the issues, 2) re-run the tests, and 3) re-run the judge subagent.
+
+## Step 4: Review Conversion
+
+Review your own work to ensure:
+- The test runs on the GPU and directly compares CPU-SQL outputs
+- The implementation does not overfit to test cases
+
+## Output
+
+Upon successful completion:
+- SQL file at `src/main/resources/<snake_name>.sql`
+- Comparison test passes
+
+These outputs are required for **Step 3: Benchmark**.
diff --git a/skills/udf-convert-to-sql/examples/FormatPhone.java b/skills/udf-convert-to-sql/examples/FormatPhone.java
new file mode 100644
index 00000000000..1b0d227199a
--- /dev/null
+++ b/skills/udf-convert-to-sql/examples/FormatPhone.java
@@ -0,0 +1,27 @@
+/*
+ * SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: Apache-2.0
+ */
+
+import org.apache.spark.sql.api.java.UDF1;
+
+/**
+ * Strip non-digit characters and format as (XXX) XXX-XXXX.
+ * See format_phone.sql for equivalent SQL expression.
+ */
+public class FormatPhone implements UDF1<String, String> {
+    @Override
+    public String call(String phone) throws Exception {
+        if (phone == null) {
+            return null;
+        }
+        String digits = phone.replaceAll("[^0-9]", "");
+        if (digits.length() != 10) {
+            return null;
+        }
+        return String.format("(%s) %s-%s",
+            digits.substring(0, 3),
+            digits.substring(3, 6),
+            digits.substring(6));
+    }
+}
diff --git a/skills/udf-convert-to-sql/examples/FormatPhone.scala b/skills/udf-convert-to-sql/examples/FormatPhone.scala
new file mode 100644
index 00000000000..0aebe0ef11d
--- /dev/null
+++ b/skills/udf-convert-to-sql/examples/FormatPhone.scala
@@ -0,0 +1,22 @@
+/*
+ * SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: Apache-2.0
+ */
+
+import org.apache.spark.sql.functions.udf
+
+/**
+ * Strip non-digit characters and format as (XXX) XXX-XXXX.
+ * See format_phone.sql for equivalent SQL expression.
+ */
+object FormatPhone {
+  val formatPhone = udf((phone: String) => {
+    Option(phone).flatMap { p =>
+      val digits = p.replaceAll("[^0-9]", "")
+      if (digits.length == 10)
+        Some(s"(${digits.substring(0, 3)}) ${digits.substring(3, 6)}-${digits.substring(6)}")
+      else
+        None
+    }.orNull
+  })
+}
diff --git a/skills/udf-convert-to-sql/examples/FormatPhoneHive.java b/skills/udf-convert-to-sql/examples/FormatPhoneHive.java
new file mode 100644
index 00000000000..4609b7254ee
--- /dev/null
+++ b/skills/udf-convert-to-sql/examples/FormatPhoneHive.java
@@ -0,0 +1,26 @@
+/*
+ * SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: Apache-2.0
+ */
+
+import org.apache.hadoop.hive.ql.exec.UDF;
+
+/**
+ * Strip non-digit characters and format as (XXX) XXX-XXXX.
+ * See format_phone.sql for equivalent SQL expression.
+ */
+public class FormatPhone extends UDF {
+    public String evaluate(String phone) {
+        if (phone == null) {
+            return null;
+        }
+        String digits = phone.replaceAll("[^0-9]", "");
+        if (digits.length() != 10) {
+            return null;
+        }
+        return String.format("(%s) %s-%s",
+            digits.substring(0, 3),
+            digits.substring(3, 6),
+            digits.substring(6));
+    }
+}
diff --git a/skills/udf-convert-to-sql/examples/NormalizeTags.java b/skills/udf-convert-to-sql/examples/NormalizeTags.java
new file mode 100644
index 00000000000..152d63bb480
--- /dev/null
+++ b/skills/udf-convert-to-sql/examples/NormalizeTags.java
@@ -0,0 +1,37 @@
+/*
+ * SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: Apache-2.0
+ */
+
+import org.apache.spark.sql.api.java.UDF1;
+import scala.collection.Seq;
+import scala.collection.Iterator;
+
+import java.util.ArrayList;
+import java.util.List;
+import java.util.TreeSet;
+
+/**
+ * Lowercase, deduplicate, and sort a variable-length tag array.
+ * See normalize_tags.sql for equivalent SQL expression.
+ */
+public class NormalizeTags implements UDF1<Seq<String>, List<String>> {
+    @Override
+    public List<String> call(Seq<String> tags) throws Exception {
+        if (tags == null) {
+            return null;
+        }
+        TreeSet<String> result = new TreeSet<>();
+        Iterator<String> it = tags.iterator();
+        while (it.hasNext()) {
+            String tag = it.next();
+            if (tag != null) {
+                String stripped = tag.replaceAll("^ +| +$", "").toLowerCase();
+                if (!stripped.isEmpty()) {
+                    result.add(stripped);
+                }
+            }
+        }
+        return result.isEmpty() ? null : new ArrayList<>(result);
+    }
+}
diff --git a/skills/udf-convert-to-sql/examples/NormalizeTags.scala b/skills/udf-convert-to-sql/examples/NormalizeTags.scala
new file mode 100644
index 00000000000..92a2eee6954
--- /dev/null
+++ b/skills/udf-convert-to-sql/examples/NormalizeTags.scala
@@ -0,0 +1,25 @@
+/*
+ * SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: Apache-2.0
+ */
+
+import org.apache.spark.sql.expressions.UserDefinedFunction
+import org.apache.spark.sql.functions.udf
+
+/**
+ * Trim spaces, lowercase, deduplicate, and sort a variable-length tag array, dropping null and empty tags.
+ * Returns null if no tags remain. See normalize_tags.sql for the equivalent SQL expression.
+ */
+object NormalizeTags {
+  val normalizeTags: UserDefinedFunction = udf((tags: Seq[String]) => {
+    Option(tags).flatMap { ts =>
+      val cleaned = ts
+        .filter(_ != null)
+        .map(_.replaceAll("^ +| +$", "").toLowerCase)
+        .filter(_.nonEmpty)
+        .distinct
+        .sorted
+      if (cleaned.isEmpty) None else Some(cleaned)
+    }.orNull
+  })
+}
diff --git a/skills/udf-convert-to-sql/examples/NormalizeTagsHive.java b/skills/udf-convert-to-sql/examples/NormalizeTagsHive.java
new file mode 100644
index 00000000000..058bd210c5e
--- /dev/null
+++ b/skills/udf-convert-to-sql/examples/NormalizeTagsHive.java
@@ -0,0 +1,32 @@
+/*
+ * SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: Apache-2.0
+ */
+
+import org.apache.hadoop.hive.ql.exec.UDF;
+
+import java.util.ArrayList;
+import java.util.List;
+import java.util.TreeSet;
+
+/**
+ * Trim spaces, lowercase, deduplicate, and sort a variable-length tag array, dropping null and empty tags.
+ * Returns null if no tags remain. See normalize_tags.sql for the equivalent SQL expression.
+ */
+public class NormalizeTagsHive extends UDF {
+    public List<String> evaluate(List<String> tags) {
+        if (tags == null) {
+            return null;
+        }
+        TreeSet<String> result = new TreeSet<>();
+        for (String tag : tags) {
+            if (tag != null) {
+                String stripped = tag.replaceAll("^ +| +$", "").toLowerCase();
+                if (!stripped.isEmpty()) {
+                    result.add(stripped);
+                }
+            }
+        }
+        return result.isEmpty() ? null : new ArrayList<>(result);
+    }
+}
diff --git a/skills/udf-convert-to-sql/examples/format_phone.sql b/skills/udf-convert-to-sql/examples/format_phone.sql
new file mode 100644
index 00000000000..6a35040c0e7
--- /dev/null
+++ b/skills/udf-convert-to-sql/examples/format_phone.sql
@@ -0,0 +1,17 @@
+-- SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+-- SPDX-License-Identifier: Apache-2.0
+
+SELECT
+  CASE
+    WHEN phone IS NULL THEN NULL
+    WHEN LENGTH(REGEXP_REPLACE(phone, '[^0-9]', '')) != 10 THEN NULL
+    ELSE CONCAT(
+      '(',
+      SUBSTR(REGEXP_REPLACE(phone, '[^0-9]', ''), 1, 3),
+      ') ',
+      SUBSTR(REGEXP_REPLACE(phone, '[^0-9]', ''), 4, 3),
+      '-',
+      SUBSTR(REGEXP_REPLACE(phone, '[^0-9]', ''), 7, 4)
+    )
+  END AS result
+FROM __table__
diff --git a/skills/udf-convert-to-sql/examples/normalize_tags.sql b/skills/udf-convert-to-sql/examples/normalize_tags.sql
new file mode 100644
index 00000000000..385d6adfca8
--- /dev/null
+++ b/skills/udf-convert-to-sql/examples/normalize_tags.sql
@@ -0,0 +1,15 @@
+-- SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+-- SPDX-License-Identifier: Apache-2.0
+
+SELECT
+  CASE
+    WHEN tags IS NULL THEN NULL
+    WHEN SIZE(FILTER(tags, x -> x IS NOT NULL AND TRIM(x) != '')) = 0 THEN NULL
+    ELSE ARRAY_SORT(ARRAY_DISTINCT(
+      TRANSFORM(
+        FILTER(tags, x -> x IS NOT NULL AND TRIM(x) != ''),
+        x -> LOWER(TRIM(x))
+      )
+    ))
+  END AS result
+FROM __table__
diff --git a/skills/udf-gen-test/SKILL.md b/skills/udf-gen-test/SKILL.md
new file mode 100644
index 00000000000..44b668dfd12
--- /dev/null
+++ b/skills/udf-gen-test/SKILL.md
@@ -0,0 +1,148 @@
+---
+name: udf-gen-test
+description: Assists with generating a unit test for an Apache Spark UDF. This is step 1 of 3 in the UDF conversion workflow (udf-gen-test -> udf-convert-to-* -> udf-benchmark). Use this skill when you have a CPU UDF and need to create a unit test for the UDF before converting it into a GPU-compatible implementation.
+license: CC-BY-4.0 AND Apache-2.0
+metadata:
+  spdx-file-copyright-text: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+model: inherit
+---
+
+# UDF Unit Test Generation
+
+## Workflow
+
+- [ ] Step 1: Set up project (copy template, add UDF source)
+- [ ] Step 2: Implement the unit test (fill in TODO methods)
+- [ ] Step 3: Compile and test until passing
+- [ ] Step 4: Run coverage and inspect gaps
+- [ ] Step 5: Verify outputs
+
+**Before making any edits, create a visible TODO checklist for every workflow step in this skill and keep it updated.** Do not produce a final answer until every required checklist item is marked complete.
+
+## Prerequisites
+
+- Path to the input UDF file (Java or Scala)
+
+Derive `<CamelName>` and `<snake_name>` from the UDF class name.
+
+> **Note:** Commands require access to `/tmp` (Spark temp storage) and `/dev` (GPU device). If commands fail due to sandbox restrictions, re-run them unsandboxed.
+
+## Step 1: Set Up the Project
+
+### 1a. Copy the template project
+
+The template project is in this skill's `templates/` directory; copy the variant that matches the UDF's language:
+```bash
+cp -r templates/<java|scala> <project_root>/<CamelName>/
+```
+
+This provides a complete Maven project with all test and benchmark infrastructure.
+
+### 1b. Copy or extract the UDF source
+
+Before copying code, decide whether the input UDF is already self-contained:
+- If the UDF file contains only the target UDF and local helpers it directly needs, copy it as-is.
+- If the UDF is part of a larger project or a file containing unrelated UDFs/classes, extract only the target UDF class/object and all local helper classes/methods required for that UDF to compile and run (modifying package declarations as needed).
+
+The template project should contain the smallest self-contained implementation of the target CPU UDF.
+
+Place the resulting source file(s) in the source directory:
+- Java: `<CamelName>/src/main/java/com/udf/`
+- Scala: `<CamelName>/src/main/scala/com/udf/`
+
+Set the package declaration to `com.udf`:
+- Java: `package com.udf;`
+- Scala: `package com.udf`
+
+## Step 2: Implement the Unit Test
+
+Read `src/test/<java|scala>/com/udf/UnitTest.<java|scala>`. Replace the placeholders with the `<CamelName>` and `<snake_name>` values derived from the UDF class name.
+
+Fill in the TODO methods following the docstrings. Include diverse edge cases in `createTestData` (nulls, empty strings, malformed inputs, varying lengths).
+
+### Test Data Coverage
+
+The generated tests should serve as a strong specification of the CPU UDF's behavior over a documented input domain, so that they can later be used to check that a GPU or SQL implementation preserves that behavior.
+For each input type and visible UDF branch, include applicable examples from these coverage dimensions:
+- null inputs and null elements
+- empty strings, arrays, maps, or structs
+- malformed or unparsable inputs
+- edges of input boundaries, such as min/max valid values, string length, or array length
+- numeric sign/identity cases, such as negative, zero, and positive values
+- string variety, such as unicode, ASCII, and encoding-sensitive inputs
+- date/time boundaries, such as epoch, end-of-day/month/year, leap day, and DST/timezone transitions
+- decimal precision and scale
+- duplicate rows and repeated values
+- mixed valid/invalid rows in the same DataFrame
+- nested empty and nested null values
+
+Assertions should verify schema, row count, deterministic ordering, output values, null propagation, and exception/default behavior. Every visible UDF branch should be covered by the unit test or explicitly documented as out of scope.
+
+### Critical Requirements
+
+- Do NOT hardcode the UDF name; use the provided `udfName` argument. This ensures the correct registered UDF is exercised.
+- Assume the user's UDF implementation is correct; the assertions should reflect its actual behavior.
+
+## Step 3: Compile and Test
+
+```bash
+# Java
+mvn test -Dtest=UnitTest
+
+# Scala
+mvn test -Dsuites=com.udf.UnitTest
+```
+
+If the build or the test fails, analyze the error output (stdout/stderr) and fix the test code. Continue iterating until the test passes.
+
+## Step 4: Coverage Report
+
+The template projects use JaCoCo (Java) and scoverage (Scala) for code coverage.
+
+```bash
+# Java
+mvn -Pcoverage test jacoco:report -Dtest=UnitTest
+
+# Scala
+mvn -Pcoverage scoverage:report -Dsuites=com.udf.UnitTest
+```
+
+For Java, read `target/site/jacoco/jacoco.csv` and inspect the LINE, BRANCH, and METHOD counters (the `*_MISSED` and `*_COVERED` columns) for the target CPU UDF class and local helper classes. For line-level detail, read `target/site/jacoco/jacoco.xml`: counters appear as `<counter type="...">` elements, and source-line misses appear under `<sourcefile><line nr="..." mi="..." ci="..." mb="..." cb="...">`.
+
+For Scala, read `target/scoverage.xml` and inspect statement, branch, and method-level coverage for the target CPU UDF class/object and local helper classes/objects. scoverage XML stores package/class/method `statement-rate` and `branch-rate` attributes, and each executable statement has `line`, `branch`, and `invocation-count` attributes.
+
+Use the coverage report as actionable feedback:
+1. Inspect missed Java line, branch, and method coverage, or missed Scala statement, branch, and method-level coverage.
+2. Add test cases and assertions that exercise those paths.
+3. Re-run the unit test and coverage report.
+4. Repeat until important CPU UDF branches are covered.
+
+If a missed line, statement, branch, or method path cannot or should not be tested, add a clear comment explaining why. Examples include:
+- unreachable defensive code
+- unsupported input domains
+- unrelated template infrastructure
+
+Report the relevant counters for the target CPU UDF and local helper classes/objects:
+- Java: LINE, BRANCH, and METHOD counters from JaCoCo.
+- Scala: statement and branch coverage from scoverage, plus method-level statement/branch rates from `<method>` elements.
+
+> **Note:** JaCoCo and scoverage do not track source-level coverage in external JARs. If the UDF relies on business logic in an external JAR, make a note of this residual coverage gap.
+
+## Step 5: Verify Outputs
+
+After the test passes, verify that:
+1. The test data covers various edge cases and reflects realistic input formats
+2. The assertions reflect actual UDF behavior (no "cheating" by hardcoding values)
+3. The coverage report shows strong coverage of the target CPU UDF and local helper logic
+4. Any uncovered lines, branches, or methods are explicitly explained
+5. Any external JAR logic invoked by the UDF is called out as outside the coverage scope
+
+If any of these checks fails, revise the test code and re-run.
+
+## Output
+
+Upon successful completion:
+- Project directory: `<project_root>/<CamelName>/`
+- Unit test: `src/test/<java|scala>/com/udf/UnitTest.<java|scala>`
+
+These outputs are required for step 2 of the UDF conversion workflow (**Convert UDF**, `udf-convert-to-*`).
diff --git a/skills/udf-gen-test/templates/java/.mvn/jvm.config b/skills/udf-gen-test/templates/java/.mvn/jvm.config
new file mode 100644
index 00000000000..0ae13fa9a86
--- /dev/null
+++ b/skills/udf-gen-test/templates/java/.mvn/jvm.config
@@ -0,0 +1,16 @@
+-Xmx16g
+-ea
+--add-opens=java.base/java.lang=ALL-UNNAMED
+--add-opens=java.base/java.lang.invoke=ALL-UNNAMED
+--add-opens=java.base/java.lang.reflect=ALL-UNNAMED
+--add-opens=java.base/java.io=ALL-UNNAMED
+--add-opens=java.base/java.net=ALL-UNNAMED
+--add-opens=java.base/java.nio=ALL-UNNAMED
+--add-opens=java.base/java.util=ALL-UNNAMED
+--add-opens=java.base/java.util.concurrent=ALL-UNNAMED
+--add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED
+--add-opens=java.base/sun.nio.ch=ALL-UNNAMED
+--add-opens=java.base/sun.nio.cs=ALL-UNNAMED
+--add-opens=java.base/sun.security.action=ALL-UNNAMED
+--add-opens=java.base/sun.util.calendar=ALL-UNNAMED
+--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED
diff --git a/skills/udf-gen-test/templates/java/pom.xml b/skills/udf-gen-test/templates/java/pom.xml
new file mode 100644
index 00000000000..6925eb2f55f
--- /dev/null
+++ b/skills/udf-gen-test/templates/java/pom.xml
@@ -0,0 +1,342 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+-->
+<project xmlns="http://maven.apache.org/POM/4.0.0"
+         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
+    <modelVersion>4.0.0</modelVersion>
+    <groupId>com.udf</groupId>
+    <artifactId>aether-agent-udfs</artifactId>
+    <version>1.0.0</version>
+    <name>Aether UDF Conversion</name>
+    <description>This project contains UDFs that will be converted from CPU to GPU.</description>
+    <packaging>jar</packaging>
+
+    <properties>
+        <maven.compiler.source>17</maven.compiler.source>
+        <maven.compiler.target>17</maven.compiler.target>
+        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
+        <project.reporting.sourceEncoding>UTF-8</project.reporting.sourceEncoding>
+        <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
+        <scala.binary.version>2.12</scala.binary.version>
+        <!-- Spark/RAPIDS versions -->
+        <spark.version>3.5.5</spark.version>
+        <rapids4spark.version>26.04.0</rapids4spark.version>
+        <jacoco.version>0.8.14</jacoco.version>
+        <jacoco.agent.argLine></jacoco.agent.argLine>
+        <cuda.version>cuda12</cuda.version>
+        <cudf.git.branch>v26.04.00</cudf.git.branch>
+        <rapids.cmake.branch>v26.04.00</rapids.cmake.branch>
+        <!-- Memory leak debugging -->
+        <!-- SLF4J logs are off by default, enable cuDF logs if memory leak debugging is enabled -->
+        <debug.memory.leaks>false</debug.memory.leaks>
+        <cudf.log.level>off</cudf.log.level>
+        <!-- Native CUDA UDF build configuration. The cuda-native-udf profile uses these. -->
+        <USE_PREBUILT_CUDF>ON</USE_PREBUILT_CUDF>
+        <GPU_ARCHS>RAPIDS</GPU_ARCHS>
+        <CPP_PARALLEL_LEVEL>10</CPP_PARALLEL_LEVEL>
+        <PER_THREAD_DEFAULT_STREAM>ON</PER_THREAD_DEFAULT_STREAM>
+        <CUDF_ENABLE_ARROW_S3>OFF</CUDF_ENABLE_ARROW_S3>
+        <skipCudfExtraction>false</skipCudfExtraction>
+        <native.library.name>rapidsudfjni</native.library.name>
+        <native.build.path>${project.build.directory}/native-build</native.build.path>
+        <!-- These args apply to the forked Surefire JVM -->
+        <!-- Benchmarks run in the Maven JVM via exec:java, and args are in .mvn/jvm.config -->
+        <test.jvm.args>-Xmx5g -ea
+            -Dai.rapids.refcount.debug=${debug.memory.leaks}
+            -Dorg.slf4j.simpleLogger.defaultLogLevel=off
+            -Dorg.slf4j.simpleLogger.log.ai.rapids.cudf=${cudf.log.level}
+            --add-opens=java.base/java.lang=ALL-UNNAMED
+            --add-opens=java.base/java.lang.invoke=ALL-UNNAMED
+            --add-opens=java.base/java.lang.reflect=ALL-UNNAMED
+            --add-opens=java.base/java.io=ALL-UNNAMED
+            --add-opens=java.base/java.net=ALL-UNNAMED
+            --add-opens=java.base/java.nio=ALL-UNNAMED
+            --add-opens=java.base/java.util=ALL-UNNAMED
+            --add-opens=java.base/java.util.concurrent=ALL-UNNAMED
+            --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED
+            --add-opens=java.base/sun.nio.ch=ALL-UNNAMED
+            --add-opens=java.base/sun.nio.cs=ALL-UNNAMED
+            --add-opens=java.base/sun.security.action=ALL-UNNAMED
+            --add-opens=java.base/sun.util.calendar=ALL-UNNAMED
+            --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED</test.jvm.args>
+    </properties>
+
+    <profiles>
+        <profile>
+            <id>debug-leaks</id>
+            <activation>
+                <property>
+                    <name>debug.memory.leaks</name>
+                    <value>true</value>
+                </property>
+            </activation>
+            <properties>
+                <cudf.log.level>error</cudf.log.level>
+            </properties>
+        </profile>
+        <profile>
+            <id>coverage</id>
+            <build>
+                <plugins>
+                    <plugin>
+                        <groupId>org.jacoco</groupId>
+                        <artifactId>jacoco-maven-plugin</artifactId>
+                        <version>${jacoco.version}</version>
+                        <configuration>
+                            <formats>
+                                <format>HTML</format>
+                                <format>XML</format>
+                                <format>CSV</format>
+                            </formats>
+                            <excludes>
+                                <exclude>com/udf/bench/*</exclude>
+                                <exclude>com/udf/SparkUtils*</exclude>
+                            </excludes>
+                        </configuration>
+                        <executions>
+                            <execution>
+                                <id>prepare-agent</id>
+                                <goals>
+                                    <goal>prepare-agent</goal>
+                                </goals>
+                                <configuration>
+                                    <propertyName>jacoco.agent.argLine</propertyName>
+                                </configuration>
+                            </execution>
+                            <execution>
+                                <id>report</id>
+                                <phase>verify</phase>
+                                <goals>
+                                    <goal>report</goal>
+                                </goals>
+                            </execution>
+                        </executions>
+                    </plugin>
+                </plugins>
+            </build>
+        </profile>
+        <profile>
+            <id>cuda-native-udf</id>
+            <build>
+                <plugins>
+                    <plugin>
+                        <groupId>org.apache.maven.plugins</groupId>
+                        <artifactId>maven-dependency-plugin</artifactId>
+                        <version>3.6.1</version>
+                        <executions>
+                            <execution>
+                                <id>copy-rapids-jar-with-classifier</id>
+                                <phase>generate-sources</phase>
+                                <goals>
+                                    <goal>copy</goal>
+                                </goals>
+                                <configuration>
+                                    <artifactItems>
+                                        <artifactItem>
+                                            <groupId>com.nvidia</groupId>
+                                            <artifactId>rapids-4-spark_${scala.binary.version}</artifactId>
+                                            <version>${rapids4spark.version}</version>
+                                            <classifier>${cuda.version}</classifier>
+                                            <type>jar</type>
+                                            <overWrite>false</overWrite>
+                                            <outputDirectory>${project.build.directory}/rapids-jar</outputDirectory>
+                                        </artifactItem>
+                                    </artifactItems>
+                                    <ignoreMissingArtifact>true</ignoreMissingArtifact>
+                                </configuration>
+                            </execution>
+                            <execution>
+                                <id>copy-rapids-jar-no-classifier</id>
+                                <phase>generate-sources</phase>
+                                <goals>
+                                    <goal>copy</goal>
+                                </goals>
+                                <configuration>
+                                    <artifactItems>
+                                        <artifactItem>
+                                            <groupId>com.nvidia</groupId>
+                                            <artifactId>rapids-4-spark_${scala.binary.version}</artifactId>
+                                            <version>${rapids4spark.version}</version>
+                                            <type>jar</type>
+                                            <overWrite>false</overWrite>
+                                            <outputDirectory>${project.build.directory}/rapids-jar</outputDirectory>
+                                        </artifactItem>
+                                    </artifactItems>
+                                    <ignoreMissingArtifact>true</ignoreMissingArtifact>
+                                </configuration>
+                            </execution>
+                        </executions>
+                    </plugin>
+                    <plugin>
+                        <groupId>org.apache.maven.plugins</groupId>
+                        <artifactId>maven-antrun-plugin</artifactId>
+                        <version>3.1.0</version>
+                        <executions>
+                            <execution>
+                                <id>extract-cuda-native-dependencies</id>
+                                <phase>generate-sources</phase>
+                                <configuration>
+                                    <skip>${skipCudfExtraction}</skip>
+                                    <target>
+                                        <exec executable="bash" dir="${project.basedir}" failonerror="true">
+                                            <arg value="native/scripts/extract-cudf-libs.sh"/>
+                                            <env key="RAPIDS4SPARK_VERSION" value="${rapids4spark.version}"/>
+                                            <env key="SCALA_VERSION" value="${scala.binary.version}"/>
+                                            <env key="CUDA_VERSION" value="${cuda.version}"/>
+                                            <env key="CUDF_BRANCH" value="${cudf.git.branch}"/>
+                                            <env key="TARGET_DIR" value="${project.build.directory}"/>
+                                        </exec>
+                                    </target>
+                                </configuration>
+                                <goals>
+                                    <goal>run</goal>
+                                </goals>
+                            </execution>
+                            <execution>
+                                <id>cmake-cuda-native-udf</id>
+                                <phase>compile</phase>
+                                <configuration>
+                                    <target>
+                                        <mkdir dir="${native.build.path}"/>
+                                        <exec executable="cmake" dir="${native.build.path}" failonerror="true">
+                                            <arg value="${project.basedir}/native/src/main/cpp"/>
+                                            <arg value="-DCMAKE_BUILD_TYPE=Release"/>
+                                            <arg value="-DNATIVE_LIBRARY_NAME=${native.library.name}"/>
+                                            <arg value="-DBUILD_UDF_BENCHMARKS=OFF"/>
+                                            <arg value="-DGPU_ARCHS=${GPU_ARCHS}"/>
+                                            <arg value="-DPER_THREAD_DEFAULT_STREAM=${PER_THREAD_DEFAULT_STREAM}"/>
+                                            <arg value="-DCUDF_ENABLE_ARROW_S3=${CUDF_ENABLE_ARROW_S3}"/>
+                                            <arg value="-DUSE_PREBUILT_CUDF=${USE_PREBUILT_CUDF}"/>
+                                            <arg value="-DNATIVE_DEPS_DIR=${project.build.directory}/native-deps"/>
+                                            <arg value="-DCUDF_SOURCE_DIR=${project.build.directory}/cudf-repo/cpp"/>
+                                            <arg value="-DRAPIDS_CMAKE_BRANCH=${rapids.cmake.branch}"/>
+                                        </exec>
+                                        <exec executable="cmake" failonerror="true">
+                                            <arg value="--build"/>
+                                            <arg value="${native.build.path}"/>
+                                            <arg value="-j${CPP_PARALLEL_LEVEL}"/>
+                                            <arg value="-v"/>
+                                        </exec>
+                                    </target>
+                                </configuration>
+                                <goals>
+                                    <goal>run</goal>
+                                </goals>
+                            </execution>
+                        </executions>
+                    </plugin>
+                    <plugin>
+                        <groupId>org.apache.maven.plugins</groupId>
+                        <artifactId>maven-resources-plugin</artifactId>
+                        <version>3.3.1</version>
+                        <executions>
+                            <execution>
+                                <id>copy-cuda-native-library-to-classes</id>
+                                <phase>process-classes</phase>
+                                <goals>
+                                    <goal>copy-resources</goal>
+                                </goals>
+                                <configuration>
+                                    <overwrite>true</overwrite>
+                                    <outputDirectory>${project.build.outputDirectory}/${os.arch}/${os.name}</outputDirectory>
+                                    <resources>
+                                        <resource>
+                                            <directory>${native.build.path}</directory>
+                                            <includes>
+                                                <include>lib${native.library.name}.so</include>
+                                            </includes>
+                                        </resource>
+                                    </resources>
+                                </configuration>
+                            </execution>
+                        </executions>
+                    </plugin>
+                </plugins>
+            </build>
+        </profile>
+    </profiles>
+
+    <dependencies>
+        <!-- Spark -->
+        <dependency>
+            <groupId>org.apache.spark</groupId>
+            <artifactId>spark-hive_${scala.binary.version}</artifactId>
+            <version>${spark.version}</version>
+            <scope>provided</scope>
+        </dependency>
+        <!-- RAPIDS plugin -->
+        <dependency>
+            <groupId>com.nvidia</groupId>
+            <artifactId>rapids-4-spark_${scala.binary.version}</artifactId>
+            <version>${rapids4spark.version}</version>
+            <scope>provided</scope>
+        </dependency>
+        <!-- JUnit 4 -->
+        <dependency>
+            <groupId>junit</groupId>
+            <artifactId>junit</artifactId>
+            <version>4.13.2</version>
+            <scope>test</scope>
+        </dependency>
+        <!-- SLF4J -->
+        <dependency>
+            <groupId>org.slf4j</groupId>
+            <artifactId>slf4j-simple</artifactId>
+            <version>1.7.36</version>
+            <scope>test</scope>
+        </dependency>
+    </dependencies>
+
+    <build>
+        <plugins>
+            <!-- Surefire for JUnit tests -->
+            <plugin>
+                <groupId>org.apache.maven.plugins</groupId>
+                <artifactId>maven-surefire-plugin</artifactId>
+                <version>3.1.2</version>
+                <configuration>
+                    <includes>
+                        <include>**/*Test.java</include>
+                    </includes>
+                    <argLine>@{jacoco.agent.argLine} ${test.jvm.args}</argLine>
+                </configuration>
+            </plugin>
+            <!-- exec-maven-plugin for running main classes -->
+            <plugin>
+                <groupId>org.codehaus.mojo</groupId>
+                <artifactId>exec-maven-plugin</artifactId>
+                <version>3.1.0</version>
+            </plugin>
+            <!-- Shade plugin to create an uber JAR -->
+            <plugin>
+                <groupId>org.apache.maven.plugins</groupId>
+                <artifactId>maven-shade-plugin</artifactId>
+                <version>3.2.4</version>
+                <executions>
+                    <execution>
+                        <phase>package</phase>
+                        <goals>
+                            <goal>shade</goal>
+                        </goals>
+                        <configuration>
+                            <createDependencyReducedPom>false</createDependencyReducedPom>
+                            <filters>
+                                <filter>
+                                    <artifact>*:*</artifact>
+                                    <excludes>
+                                        <exclude>META-INF/*.SF</exclude>
+                                        <exclude>META-INF/*.DSA</exclude>
+                                        <exclude>META-INF/*.RSA</exclude>
+                                    </excludes>
+                                </filter>
+                            </filters>
+                        </configuration>
+                    </execution>
+                </executions>
+            </plugin>
+        </plugins>
+    </build>
+</project>
diff --git a/skills/udf-gen-test/templates/java/run_gen_data.sh b/skills/udf-gen-test/templates/java/run_gen_data.sh
new file mode 100644
index 00000000000..1a7b4b2adbc
--- /dev/null
+++ b/skills/udf-gen-test/templates/java/run_gen_data.sh
@@ -0,0 +1,72 @@
+#!/bin/bash
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+# Generate or validate benchmark data
+
+set -e
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+cd "$SCRIPT_DIR"
+
+print_usage() {
+    echo "Usage: $0 --rows NUM [--validate] [--output-path PATH] [--mvn-arg ARG]..."
+}
+
+ROWS=""
+VALIDATE=""
+OUTPUT_PATH=""
+MAVEN_ARGS=()
+
+while [[ $# -gt 0 ]]; do
+    case $1 in
+        --rows) ROWS="$2"; shift 2;;
+        --validate) VALIDATE="true"; shift;;
+        --output-path) OUTPUT_PATH="$2"; shift 2;;
+        --mvn-arg) MAVEN_ARGS+=("$2"); shift 2;;
+        *)
+            echo "Unknown option: $1"
+            print_usage
+            exit 1
+            ;;
+    esac
+done
+
+if [ -z "$ROWS" ]; then
+    echo "Error: --rows is required"
+    print_usage
+    exit 1
+fi
+
+SPARK_CONFS=(
+    --spark-conf spark.master="local[8]"
+    --spark-conf spark.rapids.sql.enabled="true"
+    --spark-conf spark.plugins="com.nvidia.spark.SQLPlugin"
+    --spark-conf spark.locality.wait="0s"
+    --spark-conf spark.sql.cache.serializer="com.nvidia.spark.ParquetCachedBatchSerializer"
+    --spark-conf spark.rapids.sql.format.parquet.reader.type="MULTITHREADED"
+    --spark-conf spark.rapids.sql.reader.batchSizeBytes="1000MB"
+    --spark-conf spark.sql.files.maxPartitionBytes="512MB"
+    --spark-conf spark.rapids.sql.metrics.level="DEBUG"
+)
+
+EXEC_ARGS="--rows $ROWS --partitions 32"
+for arg in "${SPARK_CONFS[@]}"; do
+    EXEC_ARGS="$EXEC_ARGS $arg"
+done
+
+if [ -n "$VALIDATE" ]; then
+    EXEC_ARGS="$EXEC_ARGS --validate"
+    echo "Running GenData in validation mode with $ROWS rows..."
+else
+    if [ -z "$OUTPUT_PATH" ]; then
+        OUTPUT_PATH="data/bench_data_${ROWS}_rows.parquet"
+    fi
+    EXEC_ARGS="$EXEC_ARGS --output-path $OUTPUT_PATH"
+    echo "Running GenData to generate $ROWS rows -> $OUTPUT_PATH..."
+fi
+
+mvn "${MAVEN_ARGS[@]}" compile exec:java \
+    -Dexec.mainClass="com.udf.bench.GenData" \
+    -Dexec.classpathScope=compile \
+    -Dexec.args="$EXEC_ARGS"
diff --git a/skills/udf-gen-test/templates/java/run_micro_benchmark.sh b/skills/udf-gen-test/templates/java/run_micro_benchmark.sh
new file mode 100644
index 00000000000..6cac6d3e8c9
--- /dev/null
+++ b/skills/udf-gen-test/templates/java/run_micro_benchmark.sh
@@ -0,0 +1,60 @@
+#!/bin/bash
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+# Run in-memory microbenchmark for RapidsUDFs.
+
+set -e
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+cd "$SCRIPT_DIR"
+
+print_usage() {
+    echo "Usage: $0 --mode cpu|gpu|all --data-path PATH [--rows N] [--warmup N] [--measured N] [--pool-fraction F] [--profile] [--mvn-arg ARG]..."
+}
+
+MODE=""
+DATA_PATH=""
+PROFILE=""
+MAVEN_ARGS=()
+RUNNER_ARGS=()
+
+while [[ $# -gt 0 ]]; do
+    case $1 in
+        --mode) MODE="$2"; RUNNER_ARGS+=("$1" "$2"); shift 2;;
+        --data-path) DATA_PATH="$2"; RUNNER_ARGS+=("$1" "$2"); shift 2;;
+        --profile) PROFILE="true"; RUNNER_ARGS+=("$1"); shift;;
+        --mvn-arg) MAVEN_ARGS+=("$2"); shift 2;;
+        *) RUNNER_ARGS+=("$1"); shift;;
+    esac
+done
+
+if [ -z "$MODE" ] || [ -z "$DATA_PATH" ]; then
+    echo "Error: --mode and --data-path are required"
+    print_usage
+    exit 1
+fi
+
+MVN_CMD=(
+    mvn "${MAVEN_ARGS[@]}" compile exec:java
+    -Dexec.mainClass=com.udf.bench.MicroBenchRunner
+    -Dexec.classpathScope=compile
+    "-Dexec.args=${RUNNER_ARGS[*]}"
+)
+
+if [ -n "$PROFILE" ]; then
+    REPORT_PATH="results/microbench_$(date +%Y%m%d_%H%M%S)"
+    mkdir -p results
+    echo "Running microbenchmark (mode=$MODE) on $DATA_PATH with nsys profiling..."
+    echo "nsys report will be saved to: ${REPORT_PATH}.nsys-rep"
+    nsys profile \
+        -c cudaProfilerApi \
+        --capture-range-end=stop \
+        --trace=cuda,nvtx \
+        --nvtx-domain-include="libcudf" \
+        -o "$REPORT_PATH" \
+        "${MVN_CMD[@]}"
+else
+    echo "Running microbenchmark (mode=$MODE) on $DATA_PATH..."
+    "${MVN_CMD[@]}"
+fi
diff --git a/skills/udf-gen-test/templates/java/run_spark_benchmark.sh b/skills/udf-gen-test/templates/java/run_spark_benchmark.sh
new file mode 100644
index 00000000000..d8b2b1d1b70
--- /dev/null
+++ b/skills/udf-gen-test/templates/java/run_spark_benchmark.sh
@@ -0,0 +1,69 @@
+#!/bin/bash
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+# Run CPU or GPU Spark benchmark.
+
+set -e
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+cd "$SCRIPT_DIR"
+
+print_usage() {
+    echo "Usage: $0 --mode cpu|gpu --data-path PATH [--result-path PATH] [--mvn-arg ARG]..."
+}
+
+MODE=""
+DATA_PATH=""
+RESULT_PATH=""
+MAVEN_ARGS=()
+
+while [[ $# -gt 0 ]]; do
+    case $1 in
+        --mode) MODE="$2"; shift 2;;
+        --data-path) DATA_PATH="$2"; shift 2;;
+        --result-path) RESULT_PATH="$2"; shift 2;;
+        --mvn-arg) MAVEN_ARGS+=("$2"); shift 2;;
+        *)
+            echo "Unknown option: $1"
+            print_usage
+            exit 1
+            ;;
+    esac
+done
+
+if [ -z "$MODE" ] || [ -z "$DATA_PATH" ]; then
+    echo "Error: --mode and --data-path are required"
+    print_usage
+    exit 1
+fi
+
+DATA_BASENAME=$(basename "$DATA_PATH" .parquet)
+TIMESTAMP=$(date +%Y%m%d_%H%M%S)
+if [ -z "$RESULT_PATH" ]; then
+    RESULT_PATH="results/${MODE}_${DATA_BASENAME}_${TIMESTAMP}_result.json"
+fi
+
+SPARK_CONFS=(
+    --spark-conf spark.master="local[8]"
+    --spark-conf spark.rapids.sql.enabled="true"
+    --spark-conf spark.plugins="com.nvidia.spark.SQLPlugin"
+    --spark-conf spark.locality.wait="0s"
+    --spark-conf spark.sql.cache.serializer="com.nvidia.spark.ParquetCachedBatchSerializer"
+    --spark-conf spark.rapids.sql.format.parquet.reader.type="MULTITHREADED"
+    --spark-conf spark.rapids.sql.reader.batchSizeBytes="1000MB"
+    --spark-conf spark.sql.files.maxPartitionBytes="512MB"
+    --spark-conf spark.rapids.sql.metrics.level="DEBUG"
+)
+
+EXEC_ARGS="--mode $MODE --data-path $DATA_PATH --result-path $RESULT_PATH"
+for arg in "${SPARK_CONFS[@]}"; do
+    EXEC_ARGS="$EXEC_ARGS $arg"
+done
+EXEC_ARGS="$EXEC_ARGS --spark-conf spark.app.name=${MODE}_${DATA_BASENAME}_${TIMESTAMP}"
+
+echo "Running $MODE benchmark on $DATA_PATH..."
+mvn "${MAVEN_ARGS[@]}" compile exec:java \
+    -Dexec.mainClass="com.udf.bench.SparkBenchRunner" \
+    -Dexec.classpathScope=compile \
+    -Dexec.args="$EXEC_ARGS"
diff --git a/skills/udf-gen-test/templates/java/src/main/java/com/udf/SparkUtils.java b/skills/udf-gen-test/templates/java/src/main/java/com/udf/SparkUtils.java
new file mode 100644
index 00000000000..d50816e6fdf
--- /dev/null
+++ b/skills/udf-gen-test/templates/java/src/main/java/com/udf/SparkUtils.java
@@ -0,0 +1,126 @@
+/*
+ * SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: Apache-2.0
+ */
+
+package com.udf;
+
+import com.nvidia.spark.rapids.ExplainPlan;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+
+/**
+ * Spark utility methods.
+ */
+public class SparkUtils {
+
+    /**
+     * Apply key=value Spark configs to a builder.
+     *
+     * @param builder     the SparkSession builder to configure
+     * @param sparkConfs  "spark.key=value" config strings
+     * @return the same builder, for chaining
+     */
+    public static SparkSession.Builder applySparkConfs(
+            SparkSession.Builder builder, List<String> sparkConfs) {
+        for (String conf : sparkConfs) {
+            String[] kv = conf.split("=", 2);
+            if (kv.length == 2) builder.config(kv[0], kv[1]);
+        }
+        return builder;
+    }
+
+    /**
+     * Get a required argument from a parsed argument map, or throw.
+     *
+     * @param parsed the parsed argument map
+     * @param key    the argument key (without "--" prefix)
+     * @return the argument value
+     * @throws IllegalArgumentException if the key is missing
+     */
+    public static String requireArg(Map<String, String> parsed, String key) {
+        String val = parsed.get(key);
+        if (val == null) {
+            throw new IllegalArgumentException("--" + key + " is required");
+        }
+        return val;
+    }
+
+    /**
+     * Ops that cause fallback but can be ignored, since they only arise from test scaffolding:
+     * - RDDScanExec/LocalTableScanExec: appear when using spark.createDataFrame()
+     * - CollectLimitExec: appears during DataFrame collection (e.g. df.show())
+     * - ToPrettyString: appears when using df.show()
+     */
+    private static final Set<String> IGNORE_OPERATIONS = new HashSet<>(
+        Arrays.asList("RDDScanExec", "LocalTableScanExec", "CollectLimitExec", "ToPrettyString")
+    );
+
+    /**
+     * Assert that the DataFrame's plan can run on GPU.
+     * NOTE: This is only reliable in explainOnly mode, with AQE disabled.
+     *
+     * @param df the DataFrame to check
+     * @throws RuntimeException if any operations cannot run on GPU
+     */
+    public static void assertPlanRunsOnGpu(Dataset<Row> df) {
+        assertPlanRunsOnGpu(df, false);
+    }
+
+    /**
+     * Assert that the DataFrame's plan can run on GPU.
+     * NOTE: This is only reliable in explainOnly mode, with AQE disabled.
+     *
+     * @param df             the DataFrame to check
+     * @param returnFullPlan if true, include the full plan in the error message
+     * @throws RuntimeException if any operations cannot run on GPU
+     */
+    public static void assertPlanRunsOnGpu(Dataset<Row> df, boolean returnFullPlan) {
+        String plan = getGpuPlan(df);
+        List<String> unsupportedOps = getUnsupportedOps(plan);
+        if (!unsupportedOps.isEmpty()) {
+            StringBuilder sb = new StringBuilder();
+            sb.append("Some operations cannot run on GPU.\nFound the following unsupported ops:\n");
+            for (String op : unsupportedOps) {
+                sb.append("- ").append(op).append("\n");
+            }
+            if (returnFullPlan) {
+                sb.append("\nFull physical plan:\n").append(plan);
+            }
+            throw new RuntimeException(sb.toString());
+        }
+    }
+
+    /** Get the potential GPU plan using the RAPIDS ExplainPlan API. */
+    private static String getGpuPlan(Dataset<Row> df) {
+        return ExplainPlan.explainPotentialGpuPlan(df, "NOT_ON_GPU");
+    }
+
+    /** Parse the plan for unsupported operations (lines starting with '!'). */
+    private static List<String> getUnsupportedOps(String plan) {
+        List<String> result = new ArrayList<>();
+        for (String line : plan.split("\n")) {
+            // Each unsupported line looks like: ![Exec] <OPERATION> cannot run on GPU
+            String trimmed = line.trim();
+            if (trimmed.startsWith("!")) {
+                int start = trimmed.indexOf('<');
+                int end = trimmed.indexOf('>');
+                if (start >= 0 && end > start) {
+                    String op = trimmed.substring(start + 1, end);
+                    if (!IGNORE_OPERATIONS.contains(op)) {
+                        result.add(trimmed);
+                    }
+                }
+            }
+        }
+        return result;
+    }
+}
diff --git a/skills/udf-gen-test/templates/java/src/main/java/com/udf/bench/BenchUtils.java b/skills/udf-gen-test/templates/java/src/main/java/com/udf/bench/BenchUtils.java
new file mode 100644
index 00000000000..8fbf2fa2fa5
--- /dev/null
+++ b/skills/udf-gen-test/templates/java/src/main/java/com/udf/bench/BenchUtils.java
@@ -0,0 +1,112 @@
+/*
+ * SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: Apache-2.0
+ */
+
+package com.udf.bench;
+
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+import org.apache.spark.sql.types.DataTypes;
+import static org.apache.spark.sql.functions.*;
+
+/**
+ * Benchmark utilities.
+ *   - generateSyntheticData: Create benchmark data for the UDF
+ *   - executeCpu: Register and run the CPU UDF
+ *   - executeGpu: Register and run the GPU implementation
+ */
+public class BenchUtils {
+
+    // ---------------------------------------------------------------------------
+    // Data generation
+    // ---------------------------------------------------------------------------
+
+    /**
+     * TODO: Generate a synthetic DataFrame matching the unit test schema.
+     *
+     * Use {@code spark.range(0, numRows, 1, numPartitions)} as the base, then apply
+     * randomized column generators to produce data matching the UDF's expected input.
+     *
+     * Requirements:
+     *   - Column names and types MUST match the unit test dataset schema
+     *   - Data should be realistic and varied (different lengths, edge cases, etc.)
+     *   - For variable-length inputs, generate sizable rows representative of
+     *     enterprise-scale data
+     *
+     * Example:
+     * <pre>{@code
+     *   Dataset<Row> baseDF = spark.range(0, numRows, 1, numPartitions).toDF("id");
+     *   return baseDF.select(
+     *       col("id"),
+     *       expr("CAST(rand() * 850 AS INT)").alias("credit_score")
+     *   );
+     * }</pre>
+     *
+     * @param spark         active SparkSession
+     * @param numRows       number of rows to generate
+     * @param numPartitions number of output partitions
+     * @return DataFrame with the same schema as the unit test data
+     */
+    public static Dataset<Row> generateSyntheticData(
+            SparkSession spark, long numRows, int numPartitions) {
+        return null; // TODO
+    }
+
+    // ---------------------------------------------------------------------------
+    // Execution
+    // ---------------------------------------------------------------------------
+
+    /**
+     * TODO: Execute the CPU UDF on the benchmark DataFrame.
+     *   1. Register the CPU UDF with Spark
+     *   2. Execute it on {@code df}
+     *   3. Return the result DataFrame
+     *
+     * Example:
+     * <pre>{@code
+     *   df.createOrReplaceTempView("bench_table");
+     *   spark.sql("CREATE TEMPORARY FUNCTION calculate_risk AS 'com.udf.CalculateRiskUDF'");
+     *   return spark.sql("SELECT *, calculate_risk(credit_score) AS risk_level FROM bench_table");
+     * }</pre>
+     *
+     * @param spark active SparkSession
+     * @param df    input benchmark DataFrame
+     * @return result DataFrame after applying the CPU UDF
+     */
+    public static Dataset<Row> executeCpu(SparkSession spark, Dataset<Row> df) {
+        return null; // TODO
+    }
+
+    /**
+     * TODO: Execute the GPU implementation on the benchmark DataFrame.
+     *
+     * For RapidsUDF - register the RapidsUDF and run the same query as executeCpu:
+     * <pre>{@code
+     *   df.createOrReplaceTempView("bench_table");
+     *   spark.sql("CREATE TEMPORARY FUNCTION calculate_risk_rapids AS 'com.udf.CalculateRiskRapidsUDF'");
+     *   return spark.sql("SELECT *, calculate_risk_rapids(credit_score) AS risk_level FROM bench_table");
+     * }</pre>
+     *
+     * For SQL - read the SQL file from src/main/resources/ and adapt it for
+     * benchmarking. The SQL was written for the unit test, so you must:
+     *   1. Replace "test_table" with "bench_table"
+     *   2. Replace the SELECT column list with "SELECT *" to avoid referencing
+     *      columns that may not exist in the benchmark DataFrame
+     * <pre>{@code
+     *   df.createOrReplaceTempView("bench_table");
+     *   String sqlContent = new String(Files.readAllBytes(Paths.get("src/main/resources/calculate_risk.sql")));
+     *   String benchSql = sqlContent.replace("test_table", "bench_table");
+     *   // Also replace the SELECT column list with SELECT * if needed
+     *   return spark.sql(benchSql);
+     * }</pre>
+     *
+     * @param spark active SparkSession
+     * @param df    input benchmark DataFrame
+     * @return result DataFrame after applying the GPU implementation
+     */
+    public static Dataset<Row> executeGpu(SparkSession spark, Dataset<Row> df) {
+        return null; // TODO
+    }
+}
diff --git a/skills/udf-gen-test/templates/java/src/main/java/com/udf/bench/GenData.java b/skills/udf-gen-test/templates/java/src/main/java/com/udf/bench/GenData.java
new file mode 100644
index 00000000000..94a22eea753
--- /dev/null
+++ b/skills/udf-gen-test/templates/java/src/main/java/com/udf/bench/GenData.java
@@ -0,0 +1,110 @@
+/*
+ * SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: Apache-2.0
+ */
+
+package com.udf.bench;
+
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+import com.udf.SparkUtils;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+
+/**
+ * Generates benchmark data and optionally validates by running
+ * BenchUtils.executeCpu and BenchUtils.executeGpu.
+ *
+ * Usage:
+ *   mvn exec:java -Dexec.mainClass=com.udf.bench.GenData \
+ *     -Dexec.args="--rows 1000 --validate --spark-conf k=v ..."
+ */
+public class GenData {
+
+    public static void main(String[] args) {
+        Map<String, String> argMap = new HashMap<>();
+        List<String> sparkConfs = new ArrayList<>();
+        parseArgs(args, argMap, sparkConfs);
+
+        long rows = Long.parseLong(SparkUtils.requireArg(argMap, "rows"));
+        int partitions = Integer.parseInt(argMap.getOrDefault("partitions", "32"));
+        boolean validate = argMap.containsKey("validate");
+        String outputPath = argMap.get("output-path");
+
+        // Build Spark session
+        SparkSession.Builder builder = SparkSession.builder().appName("GenData");
+        SparkUtils.applySparkConfs(builder, sparkConfs);
+        SparkSession spark = builder.enableHiveSupport().getOrCreate();
+
+        try {
+            // Generate synthetic data
+            Dataset<Row> df = BenchUtils.generateSyntheticData(spark, rows, partitions);
+
+            // Verify row count
+            long actualRows = df.count();
+            if (actualRows != rows) {
+                System.err.println("Row count mismatch: expected=" + rows
+                    + ", actual=" + actualRows);
+                System.exit(1);
+            }
+            System.out.println("Generated " + actualRows + " rows across "
+                + partitions + " partitions");
+
+            if (validate) {
+                // Validation mode: run both the CPU and GPU paths; do not write output
+                for (String label : new String[]{"cpu", "gpu"}) {
+                    try {
+                        if ("cpu".equals(label)) {
+                            BenchUtils.executeCpu(spark, df).collect();
+                        } else {
+                            BenchUtils.executeGpu(spark, df).collect();
+                        }
+                        System.out.println("Validation (" + label + ") passed.");
+                    } catch (Exception e) {
+                        System.err.println("Validation (" + label + ") failed: "
+                            + e.getClass().getSimpleName() + ": " + e.getMessage());
+                        e.printStackTrace(System.err);
+                        System.exit(1);
+                    }
+                }
+            } else {
+                // Generation mode: write the dataset to the output path
+                if (outputPath == null) {
+                    throw new IllegalArgumentException(
+                        "--output-path is required when not in validation mode");
+                }
+                df.write().mode("overwrite").parquet(outputPath);
+                System.out.println("Successfully generated dataset and saved to: " + outputPath);
+            }
+        } catch (Exception e) {
+            System.err.println("Failed to generate dataset: "
+                + e.getClass().getSimpleName());
+            e.printStackTrace(System.err);
+            System.exit(1);
+        } finally {
+            spark.stop();
+        }
+
+        System.exit(0);
+    }
+
+    /** Parse CLI arguments. */
+    private static void parseArgs(String[] args, Map<String, String> map, List<String> sparkConfs) {
+        int i = 0;
+        while (i < args.length) {
+            switch (args[i]) {
+                case "--rows":        map.put("rows", args[i + 1]); i += 2; break;
+                case "--partitions":  map.put("partitions", args[i + 1]); i += 2; break;
+                case "--validate":    map.put("validate", "true"); i += 1; break;
+                case "--output-path": map.put("output-path", args[i + 1]); i += 2; break;
+                case "--spark-conf":  sparkConfs.add(args[i + 1]); i += 2; break;
+                default:
+                    throw new IllegalArgumentException("Unknown argument: " + args[i]);
+            }
+        }
+    }
+}
diff --git a/skills/udf-gen-test/templates/java/src/main/java/com/udf/bench/MicroBenchRunner.java b/skills/udf-gen-test/templates/java/src/main/java/com/udf/bench/MicroBenchRunner.java
new file mode 100644
index 00000000000..877bfe214c4
--- /dev/null
+++ b/skills/udf-gen-test/templates/java/src/main/java/com/udf/bench/MicroBenchRunner.java
@@ -0,0 +1,320 @@
+/*
+ * SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: Apache-2.0
+ */
+
+package com.udf.bench;
+
+import java.io.File;
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.Map;
+
+import ai.rapids.cudf.ColumnVector;
+import ai.rapids.cudf.Cuda;
+import ai.rapids.cudf.CudaMemInfo;
+import ai.rapids.cudf.HostColumnVector;
+import ai.rapids.cudf.Rmm;
+import ai.rapids.cudf.RmmAllocationMode;
+import ai.rapids.cudf.Table;
+
+/**
+ * Microbenchmark runner for CPU vs. RapidsUDF. Measures UDF execution time on an in-memory dataset.
+ *
+ * Reads the Parquet part files (produced by GenData) in the --data-path directory via cuDF Table.readParquet.
+ * Benchmarks the CPU (row-by-row evaluate) and GPU (evaluateColumnar) paths.
+ * Data loading and host/device transfers are excluded from the timings.
+ *
+ * Usage:
+ *   mvn exec:java -Dexec.mainClass=com.udf.bench.MicroBenchRunner \
+ *     -Dexec.args="--mode all --data-path data/bench_data --rows 1000000"
+ */
+public class MicroBenchRunner {
+
+    private static final int DEFAULT_WARMUP = 2;
+    private static final int DEFAULT_MEASURED = 4;
+    private static final float DEFAULT_RMM_ALLOC_FRACTION = 0.9f;
+
+    /**
+     * TODO: Extract column data from host memory into Java objects.
+     *
+     * Called once before the CPU timing loop. Convert the HostColumnVectors to
+     * an array of typed Java arrays for executeCpu.
+     * Use hostColumns[i].getJavaString(row), .getInt(row), .getDouble(row),
+     * .getStruct(row), .getList(row), etc. to extract values into typed arrays.
+     *
+     * This runs outside the timing loop so that the overhead of extracting and
+     * boxing Java values from cuDF is not measured.
+     *
+     * Example for a UDF that takes (String, int):
+     * <pre>{@code
+     *   String[] col0 = new String[numRows];
+     *   int[] col1 = new int[numRows];
+     *   for (int i = 0; i < numRows; i++) {
+     *       col0[i] = hostColumns[0].getJavaString(i);
+     *       col1[i] = hostColumns[1].getInt(i);
+     *   }
+     *   return new Object[] { col0, col1 };
+     * }</pre>
+     *
+     * @param hostColumns all columns copied to host memory
+     * @param numRows     number of rows in the dataset
+     * @return array of typed Java arrays, one per UDF input column
+     */
+    public static Object[] prepareCpuData(HostColumnVector[] hostColumns, int numRows) {
+        // TODO: Extract columns to Java arrays
+        return null; // TODO
+    }
+
+    /**
+     * TODO: Execute the CPU UDF on Java data row-by-row.
+     *
+     * Example:
+     * <pre>{@code
+     *   // Assumes com.udf.PlaceholderUDFName is imported at the top of this file.
+     *   String[] col0 = (String[]) data[0];
+     *   int[] col1 = (int[]) data[1];
+     *   PlaceholderUDFName udf = new PlaceholderUDFName();
+     *   for (int i = 0; i < numRows; i++) {
+     *       udf.evaluate(col0[i], col1[i]);
+     *   }
+     * }</pre>
+     *
+     * @param data    Java arrays from {@link #prepareCpuData}
+     * @param numRows number of rows in the dataset
+     */
+    public static void executeCpu(Object[] data, int numRows) {
+        // TODO: Cast arrays and call CPU UDF evaluate() per row
+    }
+
+    /**
+     * TODO: Execute the GPU UDF via evaluateColumnar.
+     *
+     * Example:
+     * <pre>{@code
+     *   // Assumes com.udf.PlaceholderRapidsUDFName is imported at the top of this file.
+     *   PlaceholderRapidsUDFName udf = new PlaceholderRapidsUDFName();
+     *   return udf.evaluateColumnar(numRows,
+     *       table.getColumn(0), table.getColumn(1));
+     * }</pre>
+     *
+     * @param table   the dataset loaded on GPU
+     * @param numRows number of rows in the dataset
+     * @return result ColumnVector (NOTE: caller must close)
+     */
+    public static ColumnVector executeGpu(Table table, int numRows) {
+        // TODO: Instantiate RapidsUDF and call evaluateColumnar()
+        return null; // TODO
+    }
+
+    public static void main(String[] args) {
+        Map<String, String> argMap = new HashMap<>();
+        parseArgs(args, argMap);
+
+        String dataPath = argMap.get("data-path");
+        if (dataPath == null) {
+            throw new IllegalArgumentException("--data-path is required");
+        }
+        String mode = argMap.getOrDefault("mode", "all");
+        int maxRows = Integer.parseInt(argMap.getOrDefault("rows", "-1"));
+        float rmmAllocFraction = Float.parseFloat(argMap.getOrDefault("pool-fraction", String.valueOf(DEFAULT_RMM_ALLOC_FRACTION)));
+        int warmup = Integer.parseInt(argMap.getOrDefault("warmup", String.valueOf(DEFAULT_WARMUP)));
+        int measured = Integer.parseInt(argMap.getOrDefault("measured", String.valueOf(DEFAULT_MEASURED)));
+        boolean profile = argMap.containsKey("profile");
+
+        // Resolve execution mode
+        if (!"cpu".equals(mode) && !"gpu".equals(mode) && !"all".equals(mode)) {
+            throw new IllegalArgumentException(
+                "Unknown mode: '" + mode + "'. Must be 'cpu', 'gpu', or 'all'.");
+        }
+        boolean runCpu = "cpu".equals(mode) || "all".equals(mode);
+        boolean runGpu = "gpu".equals(mode) || "all".equals(mode);
+
+        // Initialize RMM pool
+        if (!Rmm.isInitialized()) {
+            CudaMemInfo memInfo = Cuda.memGetInfo();
+            long poolSize = (long) (memInfo.free * rmmAllocFraction) & ~255L;
+            Rmm.initialize(RmmAllocationMode.POOL, null, poolSize);
+        }
+
+        // Read Parquet data into cuDF table
+        try (Table table = readParquetData(dataPath, maxRows)) {
+            int numRows = (int) table.getRowCount();
+            int numCols = table.getNumberOfColumns();
+            double mb = getTableSizeMB(table);
+            System.out.printf("Loaded %,d rows x %d columns (%.1f MB) from: %s%n",
+                numRows, numCols, mb, dataPath);
+            System.out.printf("Microbenchmark: mode=%s, warmup=%d, measured=%d%n",
+                mode, warmup, measured);
+
+            double cpuMinMs = Double.NaN;
+            double gpuMinMs = Double.NaN;
+
+            // --- CPU Benchmark ---
+            if (runCpu) {
+                HostColumnVector[] hostColumns = copyAllToHost(table);
+                try {
+                    Object[] cpuData = prepareCpuData(hostColumns, numRows);
+                    long[] times = runBenchmark(warmup, measured, false, () ->
+                        executeCpu(cpuData, numRows));
+                    double medianMs = (times[(times.length - 1) / 2] + times[times.length / 2]) / 2e6;
+                    cpuMinMs = times[0] / 1e6;
+                    System.out.printf("   CPU  | %,14d rows | median %10.1f ms | min %10.1f ms%n",
+                        numRows, medianMs, cpuMinMs);
+                } catch (Exception e) {
+                    System.err.printf("CPU benchmark failed: %s%n", e.getMessage());
+                    e.printStackTrace(System.err);
+                    System.exit(1);
+                } finally {
+                    closeAll(hostColumns);
+                }
+            }
+
+            // --- GPU Benchmark ---
+            if (runGpu) {
+                try {
+                    long[] times = runBenchmark(warmup, measured, profile, () -> {
+                        try (ColumnVector result = executeGpu(table, numRows)) {}
+                    });
+                    double medianMs = (times[(times.length - 1) / 2] + times[times.length / 2]) / 2e6;
+                    gpuMinMs = times[0] / 1e6;
+                    System.out.printf("   GPU  | %,14d rows | median %10.1f ms | min %10.1f ms%n",
+                        numRows, medianMs, gpuMinMs);
+                } catch (Exception e) {
+                    System.err.printf("GPU benchmark failed: %s%n", e.getMessage());
+                    e.printStackTrace(System.err);
+                    System.exit(1);
+                }
+            }
+
+            // --- Speedup ---
+            if (!Double.isNaN(cpuMinMs) && !Double.isNaN(gpuMinMs)) {
+                double speedup = cpuMinMs / gpuMinMs;
+                System.out.printf(">> Speedup: %.2fx (CPU/GPU best)%n", speedup);
+            }
+        }
+
+        System.exit(0);
+    }
+
+    /**
+     * Run warmup + measured iterations. Profile the measured iterations if enabled.
+     * @return sorted array of measured elapsed times in nanoseconds
+     */
+    private static long[] runBenchmark(int warmup, int measured, boolean profile, Runnable block) {
+        for (int i = 0; i < warmup; i++) {
+            block.run();
+        }
+        long[] times = new long[measured];
+        if (profile) Cuda.profilerStart(); // one capture range spanning all measured iterations
+        for (int i = 0; i < measured; i++) {
+            long start = System.nanoTime();
+            block.run();
+            times[i] = System.nanoTime() - start;
+        }
+        if (profile) Cuda.profilerStop();
+        Arrays.sort(times);
+        return times;
+    }
+
+    /**
+     * Read Parquet partition files from a directory into a cuDF Table.
+     * Reads files in sorted order, stopping once maxRows is reached.
+     * @param maxRows stop after accumulating this many rows; -1 means read all.
+     */
+    private static Table readParquetData(String dataPath, int maxRows) {
+        File[] partFiles = new File(dataPath).listFiles((dir, name) -> name.endsWith(".parquet"));
+        if (partFiles == null || partFiles.length == 0) {
+            throw new IllegalArgumentException("No .parquet files found in: " + dataPath);
+        }
+        Arrays.sort(partFiles);
+
+        Table[] tables = new Table[partFiles.length];
+        int count = 0;
+        long totalRows = 0;
+        try {
+            for (int i = 0; i < partFiles.length; i++) {
+                tables[i] = Table.readParquet(partFiles[i]);
+                count++;
+                totalRows += tables[i].getRowCount();
+                if (maxRows > 0 && totalRows >= maxRows) break;
+            }
+            if (count == 1) {
+                return limitTable(tables[0], maxRows);
+            }
+            try (Table combined = Table.concatenate(Arrays.copyOf(tables, count))) {
+                return limitTable(combined, maxRows);
+            }
+        } finally {
+            closeAll(tables);
+        }
+    }
+
+    /** Return a new Table with at most numRows rows. */
+    private static Table limitTable(Table table, int numRows) {
+        int n = (numRows <= 0)
+            ? (int) table.getRowCount()
+            : (int) Math.min(numRows, table.getRowCount());
+        ColumnVector[] cols = new ColumnVector[table.getNumberOfColumns()];
+        try {
+            for (int i = 0; i < cols.length; i++) {
+                cols[i] = table.getColumn(i).subVector(0, n);
+            }
+            return new Table(cols);
+        } finally {
+            closeAll(cols);
+        }
+    }
+
+    /** Get the size of the table in MB. */
+    private static double getTableSizeMB(Table table) {
+        long bytes = 0;
+        for (int i = 0; i < table.getNumberOfColumns(); i++) {
+            bytes += table.getColumn(i).getDeviceMemorySize();
+        }
+        return bytes / (1024.0 * 1024.0);
+    }
+
+    /** Copy all device columns to host memory. */
+    private static HostColumnVector[] copyAllToHost(Table table) {
+        HostColumnVector[] hostCols = new HostColumnVector[table.getNumberOfColumns()];
+        try {
+            for (int i = 0; i < hostCols.length; i++) {
+                hostCols[i] = table.getColumn(i).copyToHost();
+            }
+            return hostCols;
+        } catch (Exception e) {
+            closeAll(hostCols);
+            throw e;
+        }
+    }
+
+    /** Close all resources in an array. */
+    private static void closeAll(AutoCloseable[] resources) {
+        if (resources != null) {
+            for (AutoCloseable r : resources) {
+                if (r != null) {
+                    try { r.close(); } catch (Exception ignore) {}
+                }
+            }
+        }
+    }
+
+    /** Parse CLI arguments. */
+    private static void parseArgs(String[] args, Map<String, String> map) {
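+        for (int i = 0; i < args.length; ) {
+            if (i + 1 >= args.length && !"--profile".equals(args[i])) throw new IllegalArgumentException("Missing value for argument: " + args[i]);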
+            switch (args[i]) {
+                case "--mode":        map.put("mode", args[i + 1]); i += 2; break;
+                case "--data-path":   map.put("data-path", args[i + 1]); i += 2; break;
+                case "--warmup":      map.put("warmup", args[i + 1]); i += 2; break;
+                case "--measured":    map.put("measured", args[i + 1]); i += 2; break;
+                case "--rows":        map.put("rows", args[i + 1]); i += 2; break;
+                case "--pool-fraction": map.put("pool-fraction", args[i + 1]); i += 2; break;
+                case "--profile":     map.put("profile", "true"); i += 1; break;
+                default:
+                    throw new IllegalArgumentException("Unknown argument: " + args[i]);
+            }
+        }
+    }
+}
diff --git a/skills/udf-gen-test/templates/java/src/main/java/com/udf/bench/SparkBenchRunner.java b/skills/udf-gen-test/templates/java/src/main/java/com/udf/bench/SparkBenchRunner.java
new file mode 100644
index 00000000000..a165a38d5ac
--- /dev/null
+++ b/skills/udf-gen-test/templates/java/src/main/java/com/udf/bench/SparkBenchRunner.java
@@ -0,0 +1,163 @@
+/*
+ * SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: Apache-2.0
+ */
+
+package com.udf.bench;
+
+import java.io.File;
+import java.io.PrintWriter;
+import java.io.StringWriter;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.LinkedHashMap;
+import java.util.List;
+import java.util.Map;
+
+import com.fasterxml.jackson.core.util.DefaultIndenter;
+import com.fasterxml.jackson.core.util.DefaultPrettyPrinter;
+import com.fasterxml.jackson.databind.ObjectMapper;
+import com.fasterxml.jackson.databind.SerializationFeature;
+import com.udf.SparkUtils;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+
+/**
+ * UDF benchmark runner. Measures the end-to-end runtime of:
+ *   Read Parquet -> Execute (CPU or GPU) -> Write no-op sink
+ *
+ * Produces a JSON file with the benchmark results.
+ * On error, also produces a separate error log file.
+ *
+ * Usage:
+ *   mvn exec:java -Dexec.mainClass=com.udf.bench.SparkBenchRunner \
+ *     -Dexec.args="--mode cpu --data-path data/bench_data_10M_rows.parquet ..."
+ */
+public class SparkBenchRunner {
+
+    private static final String DEFAULT_SPARK_LOG_LEVEL = "ERROR";
+
+    public static void main(String[] args) {
+        Map<String, String> argMap = new HashMap<>();
+        List<String> sparkConfs = new ArrayList<>();
+        parseArgs(args, argMap, sparkConfs);
+
+        String mode = SparkUtils.requireArg(argMap, "mode");
+        String dataPath = SparkUtils.requireArg(argMap, "data-path");
+        String resultPath = SparkUtils.requireArg(argMap, "result-path");
+        String sparkLogLevel = argMap.getOrDefault("spark-log-level", DEFAULT_SPARK_LOG_LEVEL);
+
+        // Validate mode
+        if (!"cpu".equals(mode) && !"gpu".equals(mode)) {
+            throw new IllegalArgumentException(
+                "Unknown mode: '" + mode + "'. Must be 'cpu' or 'gpu'.");
+        }
+
+        // Build Spark session
+        SparkSession.Builder builder = SparkSession.builder();
+        SparkUtils.applySparkConfs(builder, sparkConfs);
+        SparkSession spark = builder.enableHiveSupport().getOrCreate();
+        spark.sparkContext().setLogLevel(sparkLogLevel);
+
+        try {
+            // --- START JOB ---
+            long startTime = System.nanoTime();
+            Dataset<Row> df = spark.read().parquet(dataPath);
+            Dataset<Row> resultDf = "cpu".equals(mode)
+                ? BenchUtils.executeCpu(spark, df)
+                : BenchUtils.executeGpu(spark, df);
+            resultDf.write().format("noop").mode("overwrite").save();
+            double elapsed = (System.nanoTime() - startTime) / 1e9;
+            // --- END JOB ---
+
+            System.err.printf("E2E Runtime (s): %.2f%n", elapsed);
+
+            writeReport(resultPath, mode, dataPath, elapsed, "success", args, null, null);
+
+        } catch (Exception e) {
+            System.err.println("Benchmark run failed: " + e.getClass().getSimpleName());
+            e.printStackTrace(System.err);
+
+            // The stack trace goes to a separate log file whose path never collides with the report path.
+            String errorLogPath = resultPath.replaceFirst("(_result)?\\.json$", "") + "_error.log";
+            writeErrorLog(errorLogPath, e);
+
+            writeReport(resultPath, mode, dataPath, -1, "error", args,
+                e.getMessage() != null ? e.getMessage() : e.toString(), errorLogPath);
+
+            System.exit(1);
+        } finally {
+            spark.stop();
+        }
+
+        System.exit(0);
+    }
+
+    /** Write a JSON benchmark report containing the result and args. */
+    private static void writeReport(
+            String path, String mode, String dataPath, double elapsed,
+            String status, String[] cliArgs,
+            String errorMessage, String errorLogFile) {
+        File resultDir = new File(path).getParentFile();
+        if (resultDir != null) resultDir.mkdirs();
+
+        try {
+            Map<String, Object> report = new LinkedHashMap<>();
+            report.put("mode", mode);
+            report.put("data_path", dataPath);
+            report.put("status", status);
+            report.put("e2e_runtime", elapsed);
+            report.put("cli_args", Arrays.asList(cliArgs));
+            if (errorMessage != null) {
+                Map<String, String> error = new LinkedHashMap<>();
+                error.put("error_message", errorMessage);
+                if (errorLogFile != null) {
+                    error.put("error_log_file", errorLogFile);
+                }
+                report.put("error", error);
+            }
+
+            ObjectMapper mapper = new ObjectMapper();
+            mapper.enable(SerializationFeature.INDENT_OUTPUT);
+            DefaultPrettyPrinter printer = new DefaultPrettyPrinter();
+            printer.indentArraysWith(DefaultIndenter.SYSTEM_LINEFEED_INSTANCE);
+            mapper.writer(printer).writeValue(new File(path), report);
+            System.err.println("Report written to: " + path);
+        } catch (Exception e) {
+            System.err.println("Failed to write report: " + e.getMessage());
+        }
+    }
+
+    /** Write an exception to an error log file. */
+    private static void writeErrorLog(String path, Exception e) {
+        File logDir = new File(path).getParentFile();
+        if (logDir != null) logDir.mkdirs();
+
+        try (PrintWriter pw = new PrintWriter(path)) {
+            StringWriter sw = new StringWriter();
+            e.printStackTrace(new PrintWriter(sw));
+            pw.print(sw.toString());
+            System.err.println("Error details written to: " + path);
+        } catch (Exception writeErr) {
+            System.err.println("Failed to write error log: " + writeErr.getMessage());
+        }
+    }
+
+    /** Parse CLI arguments. */
+    private static void parseArgs(String[] args, Map<String, String> map, List<String> sparkConfs) {
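+        for (int i = 0; i < args.length; ) {
+            if (i + 1 >= args.length) throw new IllegalArgumentException("Missing value for argument: " + args[i]);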
+            switch (args[i]) {
+                case "--mode":            map.put("mode", args[i + 1]); i += 2; break;
+                case "--data-path":       map.put("data-path", args[i + 1]); i += 2; break;
+                case "--result-path":     map.put("result-path", args[i + 1]); i += 2; break;
+                case "--spark-log-level": map.put("spark-log-level", args[i + 1]); i += 2; break;
+                case "--spark-conf":      sparkConfs.add(args[i + 1]); i += 2; break;
+                default:
+                    throw new IllegalArgumentException("Unknown argument: " + args[i]);
+            }
+        }
+    }
+}
diff --git a/skills/udf-gen-test/templates/java/src/test/java/com/udf/CudfComparisonTest.java b/skills/udf-gen-test/templates/java/src/test/java/com/udf/CudfComparisonTest.java
new file mode 100644
index 00000000000..e8109a25cd2
--- /dev/null
+++ b/skills/udf-gen-test/templates/java/src/test/java/com/udf/CudfComparisonTest.java
@@ -0,0 +1,70 @@
+/*
+ * SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: Apache-2.0
+ */
+
+package com.udf;
+
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+import org.junit.AfterClass;
+import org.junit.BeforeClass;
+import org.junit.Test;
+
+public class CudfComparisonTest {
+
+    private static SparkSession spark;
+    private static ClassLoader origContextClassLoader;
+
+    @BeforeClass
+    public static void setUp() {
+        origContextClassLoader = TestUtils.installMutableClassLoader();
+        spark = SparkSession.builder()
+            .appName("UDF vs. RapidsUDF Comparison Test")
+            .master("local[4]")
+            .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
+            .config("spark.rapids.memory.gpu.pool", "NONE")
+            .config("spark.rapids.sql.explain", "NONE")
+            .enableHiveSupport()
+            .getOrCreate();
+    }
+
+    @AfterClass
+    public static void tearDown() {
+        if (spark != null) spark.stop();
+        if (origContextClassLoader != null) {
+            Thread.currentThread().setContextClassLoader(origContextClassLoader);
+        }
+    }
+
+    /** TODO: Register the RapidsUDF with Spark. */
+    public static void registerRapidsUDF(SparkSession spark, String udfName) { }
+
+    @Test
+    public void testCpuVsRapidsUDF() {
+        Dataset<Row> testDF = UnitTest.createTestData(spark).repartition(1);
+
+        // Run CPU UDF
+        UnitTest.registerUDF(spark, "placeholder_udf_name");
+        Dataset<Row> cpuResultDF = UnitTest.executeUDF(
+            spark, "placeholder_udf_name", testDF);
+        UnitTest.verifyUDFResults(cpuResultDF, testDF);
+
+        // Run RapidsUDF
+        registerRapidsUDF(spark, "placeholder_rapids_udf_name");
+        Dataset<Row> gpuResultDF = UnitTest.executeUDF(
+            spark, "placeholder_rapids_udf_name", testDF);
+        UnitTest.verifyUDFResults(gpuResultDF, testDF);
+
+        // Compare
+        TestUtils.assertDataFrameEquals(gpuResultDF, cpuResultDF);
+    }
+
+    /**
+     * TODO: If UnitTest adds extra @Test methods beyond the main result checks,
+     * add corresponding comparison tests here. Each case should run the same input
+     * through the CPU UDF and the RapidsUDF, apply equivalent assertions to both
+     * outputs, and compare the RapidsUDF output against the CPU output.
+     */
+}
diff --git a/skills/udf-gen-test/templates/java/src/test/java/com/udf/SqlComparisonTest.java b/skills/udf-gen-test/templates/java/src/test/java/com/udf/SqlComparisonTest.java
new file mode 100644
index 00000000000..26d08fc5f44
--- /dev/null
+++ b/skills/udf-gen-test/templates/java/src/test/java/com/udf/SqlComparisonTest.java
@@ -0,0 +1,76 @@
+/*
+ * SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: Apache-2.0
+ */
+
+package com.udf;
+
+import java.io.IOException;
+import java.nio.file.Files;
+import java.nio.file.Paths;
+
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+import org.junit.AfterClass;
+import org.junit.BeforeClass;
+import org.junit.Test;
+
+public class SqlComparisonTest {
+
+    private static SparkSession spark;
+    private static ClassLoader origContextClassLoader;
+
+    @BeforeClass
+    public static void setUp() {
+        origContextClassLoader = TestUtils.installMutableClassLoader();
+        spark = SparkSession.builder()
+            .appName("UDF vs. SQL Comparison Test")
+            .master("local[4]")
+            .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
+            .config("spark.rapids.skipGpuArchitectureCheck", "true")
+            .config("spark.rapids.sql.mode", "explainOnly")
+            .config("spark.sql.adaptive.enabled", "false")
+            .enableHiveSupport()
+            .getOrCreate();
+    }
+
+    @AfterClass
+    public static void tearDown() {
+        if (spark != null) spark.stop();
+        if (origContextClassLoader != null) {
+            Thread.currentThread().setContextClassLoader(origContextClassLoader);
+        }
+    }
+
+    @Test
+    public void testUdfVsSqlExpression() throws IOException {
+        Dataset<Row> testDF = UnitTest.createTestData(spark).repartition(1);
+
+        // Run CPU UDF
+        UnitTest.registerUDF(spark, "placeholder_udf_name");
+        Dataset<Row> udfResultDF = UnitTest.executeUDF(
+            spark, "placeholder_udf_name", testDF);
+        UnitTest.verifyUDFResults(udfResultDF, testDF);
+
+        // Read and execute SQL expression
+        testDF.createOrReplaceTempView("test_table");
+        String sqlContent = new String(Files.readAllBytes(
+            Paths.get("src/main/resources/placeholder_udf_name.sql")), java.nio.charset.StandardCharsets.UTF_8);
+        Dataset<Row> sqlResultDF = spark.sql(sqlContent);
+        UnitTest.verifyUDFResults(sqlResultDF, testDF);
+
+        // Compare
+        TestUtils.assertDataFrameEquals(sqlResultDF, udfResultDF);
+
+        // Verify GPU compatibility
+        SparkUtils.assertPlanRunsOnGpu(sqlResultDF);
+    }
+
+    /**
+     * TODO: If UnitTest adds extra @Test methods beyond the main result checks,
+     * add corresponding comparison tests here. Each case should run the same input
+     * through the CPU UDF and the SQL expression, apply equivalent assertions to
+     * both outputs, and compare the SQL output against the CPU output.
+     */
+}
diff --git a/skills/udf-gen-test/templates/java/src/test/java/com/udf/TestUtils.java b/skills/udf-gen-test/templates/java/src/test/java/com/udf/TestUtils.java
new file mode 100644
index 00000000000..3beef6e6424
--- /dev/null
+++ b/skills/udf-gen-test/templates/java/src/test/java/com/udf/TestUtils.java
@@ -0,0 +1,78 @@
+/*
+ * SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: Apache-2.0
+ */
+
+package com.udf;
+
+import java.net.URL;
+import java.net.URLClassLoader;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.junit.Assert;
+
+/**
+ * Shared test utilities.
+ */
+public class TestUtils {
+
+    /**
+     * Install a URLClassLoader as the thread context classloader so that
+     * RAPIDS ShimLoader.findURLClassLoader() can discover and mutate it.
+     * https://github.com/NVIDIA/spark-rapids/blob/main/sql-plugin-api/src/main/scala/com/nvidia/spark/rapids/ShimLoader.scala
+     *
+     * On Java 17 in a forked Surefire JVM the context classloader is the
+     * AppClassLoader, which is not a URLClassLoader. Without a URLClassLoader
+     * the RAPIDS ShimLoader throws because it cannot install shim classes,
+     * e.g. https://github.com/NVIDIA/spark-rapids/issues/13915.
+     *
+     * Must be called before plugin initialization, i.e., before SparkSession.getOrCreate().
+     * Returns the original context classloader for the caller to restore on tearDown.
+     */
+    public static ClassLoader installMutableClassLoader() {
+        ClassLoader original = Thread.currentThread().getContextClassLoader();
+        if (original instanceof URLClassLoader) {
+            return original;
+        }
+        // Wrap the original AppClassLoader in a child URLClassLoader with an empty search path.
+        // ShimLoader will populate it with shim directories via addURL().
+        URLClassLoader wrapper = new URLClassLoader(new URL[0], original);
+        Thread.currentThread().setContextClassLoader(wrapper);
+        return original;
+    }
+
+    /** Compare two DataFrames ignoring row order (rows are sorted by string form), reporting per-column mismatches. */
+    public static void assertDataFrameEquals(Dataset<Row> actual, Dataset<Row> expected) {
+        Assert.assertEquals("Schema mismatch", expected.schema(), actual.schema());
+
+        Row[] actualRows = (Row[]) actual.collect();
+        Row[] expectedRows = (Row[]) expected.collect();
+        Arrays.sort(actualRows, (a, b) -> a.toString().compareTo(b.toString()));
+        Arrays.sort(expectedRows, (a, b) -> a.toString().compareTo(b.toString()));
+
+        Assert.assertEquals("Row count mismatch", expectedRows.length, actualRows.length);
+
+        List<String> mismatches = new ArrayList<>();
+        String[] fields = actual.schema().fieldNames();
+        for (int i = 0; i < actualRows.length; i++) {
+            for (String field : fields) {
+                Object aVal = actualRows[i].getAs(field);
+                Object eVal = expectedRows[i].getAs(field);
+                boolean eq = (aVal == null && eVal == null)
+                    || (aVal != null && aVal.equals(eVal));
+                if (!eq) {
+                    mismatches.add(String.format("  [row %d] %s: actual=%s, expected=%s",
+                        i, field, aVal, eVal));
+                }
+            }
+        }
+        if (!mismatches.isEmpty()) {
+            Assert.fail("\nFound " + mismatches.size() + " column-level mismatches:\n"
+                + String.join("\n", mismatches) + "\n");
+        }
+    }
+}
diff --git a/skills/udf-gen-test/templates/java/src/test/java/com/udf/UnitTest.java b/skills/udf-gen-test/templates/java/src/test/java/com/udf/UnitTest.java
new file mode 100644
index 00000000000..17d6b177fe5
--- /dev/null
+++ b/skills/udf-gen-test/templates/java/src/test/java/com/udf/UnitTest.java
@@ -0,0 +1,122 @@
+/*
+ * SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: Apache-2.0
+ */
+
+package com.udf;
+
+import java.util.Arrays;
+import java.util.List;
+
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.RowFactory;
+import org.apache.spark.sql.SparkSession;
+import org.apache.spark.sql.types.DataTypes;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+import org.junit.AfterClass;
+import org.junit.Assert;
+import org.junit.BeforeClass;
+import org.junit.Test;
+
+public class UnitTest {
+
+    private static SparkSession spark;
+    private static ClassLoader origContextClassLoader;
+
+    @BeforeClass
+    public static void setUp() {
+        origContextClassLoader = TestUtils.installMutableClassLoader();
+        spark = SparkSession.builder()
+            .appName("UDF Unit Test")
+            .master("local[4]")
+            .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
+            .config("spark.rapids.skipGpuArchitectureCheck", "true")
+            .config("spark.rapids.sql.mode", "explainOnly")
+            .config("spark.sql.adaptive.enabled", "false")
+            .enableHiveSupport()
+            .getOrCreate();
+    }
+
+    @AfterClass
+    public static void tearDown() {
+        if (spark != null) spark.stop();
+        if (origContextClassLoader != null) {
+            Thread.currentThread().setContextClassLoader(origContextClassLoader);
+        }
+    }
+
+    /**
+     * TODO: Create a test DataFrame with diverse test cases including edge cases.
+     *
+     * Example:
+     * <pre>{@code
+     *   StructType schema = new StructType(new StructField[]{
+     *       DataTypes.createStructField("id", DataTypes.IntegerType, false),
+     *       DataTypes.createStructField("credit_score", DataTypes.IntegerType, true)
+     *   });
+     *   List<Row> data = Arrays.asList(
+     *       RowFactory.create(1, 800),
+     *       RowFactory.create(2, 550),
+     *       RowFactory.create(3, null)
+     *   );
+     *   return spark.createDataFrame(data, schema);
+     * }</pre>
+     */
+    public static Dataset<Row> createTestData(SparkSession spark) {
+        return null; // TODO
+    }
+
+    /**
+     * TODO: Register the UDF with Spark.
+     *
+     * Example (Hive UDF):
+     * <pre>{@code
+     *   spark.sql("CREATE TEMPORARY FUNCTION " + udfName
+     *       + " AS 'com.udf.CalculateRiskUDF'");
+     * }</pre>
+     */
+    public static void registerUDF(SparkSession spark, String udfName) {
+        // TODO
+    }
+
+    /**
+     * TODO: Execute the UDF on the test DataFrame and return the result.
+     *
+     * Example:
+     * <pre>{@code
+     *   testDF.createOrReplaceTempView("test_table");
+     *   return spark.sql("SELECT *, " + udfName
+     *       + "(credit_score) AS risk_level FROM test_table");
+     * }</pre>
+     */
+    public static Dataset<Row> executeUDF(SparkSession spark, String udfName, Dataset<Row> testDF) {
+        return null; // TODO
+    }
+
+    /**
+     * TODO: Verify UDF results using Assert statements.
+     *
+     * Example:
+     * <pre>{@code
+     *   Row[] results = (Row[]) resultDF.sort("id").collect();
+     *   Assert.assertEquals("LOW", results[0].getAs("risk_level"));
+     *   Assert.assertEquals("MEDIUM", results[1].getAs("risk_level"));
+     *   Assert.assertEquals("UNKNOWN", results[2].getAs("risk_level"));
+     * }</pre>
+     */
+    public static void verifyUDFResults(Dataset<Row> resultDF, Dataset<Row> testDF) {
+        // TODO
+    }
+
+    @Test
+    public void testUDFProducesCorrectResults() {
+        Dataset<Row> testDF = createTestData(spark).repartition(1);
+
+        registerUDF(spark, "placeholder_udf_name");
+        Dataset<Row> resultDF = executeUDF(spark, "placeholder_udf_name", testDF);
+
+        verifyUDFResults(resultDF, testDF);
+    }
+}
diff --git a/skills/udf-gen-test/templates/scala/.mvn/jvm.config b/skills/udf-gen-test/templates/scala/.mvn/jvm.config
new file mode 100644
index 00000000000..f8f3f2490b0
--- /dev/null
+++ b/skills/udf-gen-test/templates/scala/.mvn/jvm.config
@@ -0,0 +1,15 @@
+-Xmx16g
+--add-opens=java.base/java.lang=ALL-UNNAMED
+--add-opens=java.base/java.lang.invoke=ALL-UNNAMED
+--add-opens=java.base/java.lang.reflect=ALL-UNNAMED
+--add-opens=java.base/java.io=ALL-UNNAMED
+--add-opens=java.base/java.net=ALL-UNNAMED
+--add-opens=java.base/java.nio=ALL-UNNAMED
+--add-opens=java.base/java.util=ALL-UNNAMED
+--add-opens=java.base/java.util.concurrent=ALL-UNNAMED
+--add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED
+--add-opens=java.base/sun.nio.ch=ALL-UNNAMED
+--add-opens=java.base/sun.nio.cs=ALL-UNNAMED
+--add-opens=java.base/sun.security.action=ALL-UNNAMED
+--add-opens=java.base/sun.util.calendar=ALL-UNNAMED
+--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED
diff --git a/skills/udf-gen-test/templates/scala/pom.xml b/skills/udf-gen-test/templates/scala/pom.xml
new file mode 100644
index 00000000000..132034c8421
--- /dev/null
+++ b/skills/udf-gen-test/templates/scala/pom.xml
@@ -0,0 +1,382 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+-->
+<project xmlns="http://maven.apache.org/POM/4.0.0"
+         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
+    <modelVersion>4.0.0</modelVersion>
+    <groupId>com.udf</groupId>
+    <artifactId>aether-agent-udfs</artifactId>
+    <version>1.0.0</version>
+    <name>Aether UDF Conversion</name>
+    <description>This project contains UDFs that will be converted from CPU to GPU.</description>
+    <packaging>jar</packaging>
+
+    <properties>
+        <maven.compiler.source>17</maven.compiler.source>
+        <maven.compiler.target>17</maven.compiler.target>
+        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
+        <project.reporting.sourceEncoding>UTF-8</project.reporting.sourceEncoding>
+        <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
+        <scala.binary.version>2.12</scala.binary.version>
+        <scala.version>2.12.15</scala.version>
+        <!-- Spark/RAPIDS versions -->
+        <spark.version>3.5.5</spark.version>
+        <rapids4spark.version>26.04.0</rapids4spark.version>
+        <scoverage.plugin.version>2.1.0</scoverage.plugin.version>
+        <scoverage.scalac.plugin.version>2.0.11</scoverage.scalac.plugin.version>
+        <scalatest.jvm.args>${test.jvm.args}</scalatest.jvm.args>
+        <cuda.version>cuda12</cuda.version>
+        <cudf.git.branch>v26.04.00</cudf.git.branch>
+        <rapids.cmake.branch>v26.04.00</rapids.cmake.branch>
+        <!-- Memory leak debugging -->
+        <!-- SLF4J logs are off by default; the debug-leaks profile enables cuDF error logs when debug.memory.leaks=true -->
+        <debug.memory.leaks>false</debug.memory.leaks>
+        <cudf.log.level>off</cudf.log.level>
+        <!-- Native CUDA UDF build configuration. The cuda-native-udf profile uses these. -->
+        <USE_PREBUILT_CUDF>ON</USE_PREBUILT_CUDF>
+        <GPU_ARCHS>RAPIDS</GPU_ARCHS>
+        <CPP_PARALLEL_LEVEL>10</CPP_PARALLEL_LEVEL>
+        <PER_THREAD_DEFAULT_STREAM>ON</PER_THREAD_DEFAULT_STREAM>
+        <CUDF_ENABLE_ARROW_S3>OFF</CUDF_ENABLE_ARROW_S3>
+        <skipCudfExtraction>false</skipCudfExtraction>
+        <native.library.name>rapidsudfjni</native.library.name>
+        <native.build.path>${project.build.directory}/native-build</native.build.path>
+        <!-- These args apply to the forked ScalaTest JVM -->
+        <!-- Benchmarks run in the Maven JVM via exec:java; their JVM args come from .mvn/jvm.config -->
+        <test.jvm.args>-Xmx5g -ea
+            -Dai.rapids.refcount.debug=${debug.memory.leaks}
+            -Dorg.slf4j.simpleLogger.defaultLogLevel=off
+            -Dorg.slf4j.simpleLogger.log.ai.rapids.cudf=${cudf.log.level}
+            --add-opens=java.base/java.lang=ALL-UNNAMED
+            --add-opens=java.base/java.lang.invoke=ALL-UNNAMED
+            --add-opens=java.base/java.lang.reflect=ALL-UNNAMED
+            --add-opens=java.base/java.io=ALL-UNNAMED
+            --add-opens=java.base/java.net=ALL-UNNAMED
+            --add-opens=java.base/java.nio=ALL-UNNAMED
+            --add-opens=java.base/java.util=ALL-UNNAMED
+            --add-opens=java.base/java.util.concurrent=ALL-UNNAMED
+            --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED
+            --add-opens=java.base/sun.nio.ch=ALL-UNNAMED
+            --add-opens=java.base/sun.nio.cs=ALL-UNNAMED
+            --add-opens=java.base/sun.security.action=ALL-UNNAMED
+            --add-opens=java.base/sun.util.calendar=ALL-UNNAMED
+            --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED</test.jvm.args>
+    </properties>
+
+    <profiles>
+        <profile>
+            <id>debug-leaks</id>
+            <activation>
+                <property>
+                    <name>debug.memory.leaks</name>
+                    <value>true</value>
+                </property>
+            </activation>
+            <properties>
+                <cudf.log.level>error</cudf.log.level>
+            </properties>
+        </profile>
+        <profile>
+            <id>coverage</id>
+            <build>
+                <plugins>
+                    <plugin>
+                        <groupId>org.scoverage</groupId>
+                        <artifactId>scoverage-maven-plugin</artifactId>
+                        <version>${scoverage.plugin.version}</version>
+                        <configuration>
+                            <scalaVersion>${scala.version}</scalaVersion>
+                            <scalacPluginVersion>${scoverage.scalac.plugin.version}</scalacPluginVersion>
+                            <highlighting>true</highlighting>
+                            <excludedPackages>com\.udf\.bench\..*;com\.udf\.SparkUtils.*;com\.udf\.Arm.*</excludedPackages>
+                        </configuration>
+                    </plugin>
+                </plugins>
+            </build>
+        </profile>
+        <profile>
+            <id>cuda-native-udf</id>
+            <build>
+                <plugins>
+                    <plugin>
+                        <groupId>org.apache.maven.plugins</groupId>
+                        <artifactId>maven-dependency-plugin</artifactId>
+                        <version>3.6.1</version>
+                        <executions>
+                            <execution>
+                                <id>copy-rapids-jar-with-classifier</id>
+                                <phase>generate-sources</phase>
+                                <goals>
+                                    <goal>copy</goal>
+                                </goals>
+                                <configuration>
+                                    <artifactItems>
+                                        <artifactItem>
+                                            <groupId>com.nvidia</groupId>
+                                            <artifactId>rapids-4-spark_${scala.binary.version}</artifactId>
+                                            <version>${rapids4spark.version}</version>
+                                            <classifier>${cuda.version}</classifier>
+                                            <type>jar</type>
+                                            <overWrite>false</overWrite>
+                                            <outputDirectory>${project.build.directory}/rapids-jar</outputDirectory>
+                                        </artifactItem>
+                                    </artifactItems>
+                                    <ignoreMissingArtifact>true</ignoreMissingArtifact>
+                                </configuration>
+                            </execution>
+                            <execution>
+                                <id>copy-rapids-jar-no-classifier</id>
+                                <phase>generate-sources</phase>
+                                <goals>
+                                    <goal>copy</goal>
+                                </goals>
+                                <configuration>
+                                    <artifactItems>
+                                        <artifactItem>
+                                            <groupId>com.nvidia</groupId>
+                                            <artifactId>rapids-4-spark_${scala.binary.version}</artifactId>
+                                            <version>${rapids4spark.version}</version>
+                                            <type>jar</type>
+                                            <overWrite>false</overWrite>
+                                            <outputDirectory>${project.build.directory}/rapids-jar</outputDirectory>
+                                        </artifactItem>
+                                    </artifactItems>
+                                    <ignoreMissingArtifact>true</ignoreMissingArtifact>
+                                </configuration>
+                            </execution>
+                        </executions>
+                    </plugin>
+                    <plugin>
+                        <groupId>org.apache.maven.plugins</groupId>
+                        <artifactId>maven-antrun-plugin</artifactId>
+                        <version>3.1.0</version>
+                        <executions>
+                            <execution>
+                                <id>extract-cuda-native-dependencies</id>
+                                <phase>generate-sources</phase>
+                                <configuration>
+                                    <skip>${skipCudfExtraction}</skip>
+                                    <target>
+                                        <exec executable="bash" dir="${project.basedir}" failonerror="true">
+                                            <arg value="native/scripts/extract-cudf-libs.sh"/>
+                                            <env key="RAPIDS4SPARK_VERSION" value="${rapids4spark.version}"/>
+                                            <env key="SCALA_VERSION" value="${scala.binary.version}"/>
+                                            <env key="CUDA_VERSION" value="${cuda.version}"/>
+                                            <env key="CUDF_BRANCH" value="${cudf.git.branch}"/>
+                                            <env key="TARGET_DIR" value="${project.build.directory}"/>
+                                        </exec>
+                                    </target>
+                                </configuration>
+                                <goals>
+                                    <goal>run</goal>
+                                </goals>
+                            </execution>
+                            <execution>
+                                <id>cmake-cuda-native-udf</id>
+                                <phase>compile</phase>
+                                <configuration>
+                                    <target>
+                                        <mkdir dir="${native.build.path}"/>
+                                        <exec executable="cmake" dir="${native.build.path}" failonerror="true">
+                                            <arg value="${project.basedir}/native/src/main/cpp"/>
+                                            <arg value="-DCMAKE_BUILD_TYPE=Release"/>
+                                            <arg value="-DNATIVE_LIBRARY_NAME=${native.library.name}"/>
+                                            <arg value="-DBUILD_UDF_BENCHMARKS=OFF"/>
+                                            <arg value="-DGPU_ARCHS=${GPU_ARCHS}"/>
+                                            <arg value="-DPER_THREAD_DEFAULT_STREAM=${PER_THREAD_DEFAULT_STREAM}"/>
+                                            <arg value="-DCUDF_ENABLE_ARROW_S3=${CUDF_ENABLE_ARROW_S3}"/>
+                                            <arg value="-DUSE_PREBUILT_CUDF=${USE_PREBUILT_CUDF}"/>
+                                            <arg value="-DNATIVE_DEPS_DIR=${project.build.directory}/native-deps"/>
+                                            <arg value="-DCUDF_SOURCE_DIR=${project.build.directory}/cudf-repo/cpp"/>
+                                            <arg value="-DRAPIDS_CMAKE_BRANCH=${rapids.cmake.branch}"/>
+                                        </exec>
+                                        <exec executable="cmake" failonerror="true">
+                                            <arg value="--build"/>
+                                            <arg value="${native.build.path}"/>
+                                            <arg value="-j${CPP_PARALLEL_LEVEL}"/>
+                                            <arg value="-v"/>
+                                        </exec>
+                                    </target>
+                                </configuration>
+                                <goals>
+                                    <goal>run</goal>
+                                </goals>
+                            </execution>
+                        </executions>
+                    </plugin>
+                    <plugin>
+                        <groupId>org.apache.maven.plugins</groupId>
+                        <artifactId>maven-resources-plugin</artifactId>
+                        <version>3.3.1</version>
+                        <executions>
+                            <execution>
+                                <id>copy-cuda-native-library-to-classes</id>
+                                <phase>process-classes</phase>
+                                <goals>
+                                    <goal>copy-resources</goal>
+                                </goals>
+                                <configuration>
+                                    <overwrite>true</overwrite>
+                                    <outputDirectory>${project.build.outputDirectory}/${os.arch}/${os.name}</outputDirectory>
+                                    <resources>
+                                        <resource>
+                                            <directory>${native.build.path}</directory>
+                                            <includes>
+                                                <include>lib${native.library.name}.so</include>
+                                            </includes>
+                                        </resource>
+                                    </resources>
+                                </configuration>
+                            </execution>
+                        </executions>
+                    </plugin>
+                </plugins>
+            </build>
+        </profile>
+    </profiles>
+
+    <dependencies>
+        <!-- Spark -->
+        <dependency>
+            <groupId>org.apache.spark</groupId>
+            <artifactId>spark-hive_${scala.binary.version}</artifactId>
+            <version>${spark.version}</version>
+            <scope>provided</scope>
+        </dependency>
+        <!-- Scala -->
+        <dependency>
+            <groupId>org.scala-lang</groupId>
+            <artifactId>scala-library</artifactId>
+            <version>${scala.version}</version>
+        </dependency>
+        <!-- RAPIDS plugin -->
+        <dependency>
+            <groupId>com.nvidia</groupId>
+            <artifactId>rapids-4-spark_${scala.binary.version}</artifactId>
+            <version>${rapids4spark.version}</version>
+            <scope>provided</scope>
+        </dependency>
+        <!-- ScalaTest -->
+        <dependency>
+            <groupId>org.scalatest</groupId>
+            <artifactId>scalatest_${scala.binary.version}</artifactId>
+            <version>3.2.17</version>
+            <scope>test</scope>
+        </dependency>
+        <!-- SLF4J -->
+        <dependency>
+            <groupId>org.slf4j</groupId>
+            <artifactId>slf4j-simple</artifactId>
+            <version>1.7.36</version>
+            <scope>test</scope>
+        </dependency>
+    </dependencies>
+
+    <build>
+        <sourceDirectory>src/main/scala</sourceDirectory>
+        <testSourceDirectory>src/test/scala</testSourceDirectory>
+        <plugins>
+            <!-- Compile Java sources if present, e.g., CUDA JNI wrappers. -->
+            <plugin>
+                <groupId>org.codehaus.mojo</groupId>
+                <artifactId>build-helper-maven-plugin</artifactId>
+                <version>3.5.0</version>
+                <executions>
+                    <execution>
+                        <id>add-java-source</id>
+                        <phase>generate-sources</phase>
+                        <goals>
+                            <goal>add-source</goal>
+                        </goals>
+                        <configuration>
+                            <sources>
+                                <source>src/main/java</source>
+                            </sources>
+                        </configuration>
+                    </execution>
+                </executions>
+            </plugin>
+            <!-- Scala compiler plugin -->
+            <plugin>
+                <groupId>net.alchim31.maven</groupId>
+                <artifactId>scala-maven-plugin</artifactId>
+                <version>4.3.0</version>
+                <executions>
+                    <execution>
+                        <goals>
+                            <goal>compile</goal>
+                            <goal>testCompile</goal>
+                        </goals>
+                    </execution>
+                </executions>
+                <configuration>
+                    <scalaVersion>${scala.version}</scalaVersion>
+                </configuration>
+            </plugin>
+            
+            <!-- Disable default surefire -->
+            <plugin>
+                <groupId>org.apache.maven.plugins</groupId>
+                <artifactId>maven-surefire-plugin</artifactId>
+                <version>3.1.2</version>
+                <configuration>
+                    <skipTests>true</skipTests>
+                </configuration>
+            </plugin>
+            <!-- ScalaTest runner -->
+            <plugin>
+                <groupId>org.scalatest</groupId>
+                <artifactId>scalatest-maven-plugin</artifactId>
+                <version>2.2.0</version>
+                <configuration>
+                    <reportsDirectory>${project.build.directory}/surefire-reports</reportsDirectory>
+                    <argLine>${scalatest.jvm.args}</argLine>
+                </configuration>
+                <executions>
+                    <execution>
+                        <id>test</id>
+                        <goals>
+                            <goal>test</goal>
+                        </goals>
+                    </execution>
+                </executions>
+            </plugin>
+            <!-- exec-maven-plugin for running main classes -->
+            <plugin>
+                <groupId>org.codehaus.mojo</groupId>
+                <artifactId>exec-maven-plugin</artifactId>
+                <version>3.1.0</version>
+            </plugin>
+            <!-- Shade plugin to create an uber JAR -->
+            <plugin>
+                <groupId>org.apache.maven.plugins</groupId>
+                <artifactId>maven-shade-plugin</artifactId>
+                <version>3.2.4</version>
+                <executions>
+                    <execution>
+                        <phase>package</phase>
+                        <goals>
+                            <goal>shade</goal>
+                        </goals>
+                        <configuration>
+                            <createDependencyReducedPom>false</createDependencyReducedPom>
+                            <filters>
+                                <filter>
+                                    <artifact>*:*</artifact>
+                                    <excludes>
+                                        <exclude>META-INF/*.SF</exclude>
+                                        <exclude>META-INF/*.DSA</exclude>
+                                        <exclude>META-INF/*.RSA</exclude>
+                                    </excludes>
+                                </filter>
+                            </filters>
+                        </configuration>
+                    </execution>
+                </executions>
+            </plugin>
+        </plugins>
+    </build>
+</project>
diff --git a/skills/udf-gen-test/templates/scala/run_gen_data.sh b/skills/udf-gen-test/templates/scala/run_gen_data.sh
new file mode 100644
index 00000000000..1a7b4b2adbc
--- /dev/null
+++ b/skills/udf-gen-test/templates/scala/run_gen_data.sh
@@ -0,0 +1,72 @@
+#!/bin/bash
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+# Generate or validate benchmark data
+
+set -e
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+cd "$SCRIPT_DIR"
+
+print_usage() {
+    echo "Usage: $0 --rows NUM [--validate] [--output-path PATH] [--mvn-arg ARG]..."
+}
+
+ROWS=""
+VALIDATE=""
+OUTPUT_PATH=""
+MAVEN_ARGS=()
+
+while [[ $# -gt 0 ]]; do
+    case $1 in
+        --rows) ROWS="$2"; shift 2;;
+        --validate) VALIDATE="true"; shift;;
+        --output-path) OUTPUT_PATH="$2"; shift 2;;
+        --mvn-arg) MAVEN_ARGS+=("$2"); shift 2;;
+        *)
+            echo "Unknown option: $1"
+            print_usage
+            exit 1
+            ;;
+    esac
+done
+
+if [ -z "$ROWS" ]; then
+    echo "Error: --rows is required"
+    print_usage
+    exit 1
+fi
+
+SPARK_CONFS=(
+    --spark-conf spark.master="local[8]"
+    --spark-conf spark.rapids.sql.enabled="true"
+    --spark-conf spark.plugins="com.nvidia.spark.SQLPlugin"
+    --spark-conf spark.locality.wait="0s"
+    --spark-conf spark.sql.cache.serializer="com.nvidia.spark.ParquetCachedBatchSerializer"
+    --spark-conf spark.rapids.sql.format.parquet.reader.type="MULTITHREADED"
+    --spark-conf spark.rapids.sql.reader.batchSizeBytes="1000MB"
+    --spark-conf spark.sql.files.maxPartitionBytes="512MB"
+    --spark-conf spark.rapids.sql.metrics.level="DEBUG"
+)
+
+EXEC_ARGS="--rows $ROWS --partitions 32"
+for arg in "${SPARK_CONFS[@]}"; do
+    EXEC_ARGS="$EXEC_ARGS $arg"
+done
+
+if [ -n "$VALIDATE" ]; then
+    EXEC_ARGS="$EXEC_ARGS --validate"
+    echo "Running GenData in validation mode with $ROWS rows..."
+else
+    if [ -z "$OUTPUT_PATH" ]; then
+        OUTPUT_PATH="data/bench_data_${ROWS}_rows.parquet"
+    fi
+    EXEC_ARGS="$EXEC_ARGS --output-path $OUTPUT_PATH"
+    echo "Running GenData to generate $ROWS rows -> $OUTPUT_PATH..."
+fi
+
+mvn "${MAVEN_ARGS[@]}" compile exec:java \
+    -Dexec.mainClass="com.udf.bench.GenData" \
+    -Dexec.classpathScope=compile \
+    -Dexec.args="$EXEC_ARGS"
diff --git a/skills/udf-gen-test/templates/scala/run_micro_benchmark.sh b/skills/udf-gen-test/templates/scala/run_micro_benchmark.sh
new file mode 100644
index 00000000000..6cac6d3e8c9
--- /dev/null
+++ b/skills/udf-gen-test/templates/scala/run_micro_benchmark.sh
@@ -0,0 +1,60 @@
+#!/bin/bash
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+# Run in-memory microbenchmark for RapidsUDFs.
+
+set -e
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+cd "$SCRIPT_DIR"
+
+print_usage() {
+    echo "Usage: $0 --mode cpu|gpu|all --data-path PATH [--rows N] [--warmup N] [--measured N] [--pool-fraction F] [--profile] [--mvn-arg ARG]..."
+}
+
+MODE=""
+DATA_PATH=""
+PROFILE=""
+MAVEN_ARGS=()
+RUNNER_ARGS=()
+
+while [[ $# -gt 0 ]]; do
+    case $1 in
+        --mode) MODE="$2"; RUNNER_ARGS+=("$1" "$2"); shift 2;;
+        --data-path) DATA_PATH="$2"; RUNNER_ARGS+=("$1" "$2"); shift 2;;
+        --profile) PROFILE="true"; RUNNER_ARGS+=("$1"); shift;;
+        --mvn-arg) MAVEN_ARGS+=("$2"); shift 2;;
+        *) RUNNER_ARGS+=("$1"); shift;;
+    esac
+done
+
+if [ -z "$MODE" ] || [ -z "$DATA_PATH" ]; then
+    echo "Error: --mode and --data-path are required"
+    print_usage
+    exit 1
+fi
+
+MVN_CMD=(
+    mvn "${MAVEN_ARGS[@]}" compile exec:java
+    -Dexec.mainClass=com.udf.bench.MicroBenchRunner
+    -Dexec.classpathScope=compile
+    "-Dexec.args=${RUNNER_ARGS[*]}"
+)
+
+if [ -n "$PROFILE" ]; then
+    REPORT_PATH="results/microbench_$(date +%Y%m%d_%H%M%S)"
+    mkdir -p results
+    echo "Running microbenchmark (mode=$MODE) on $DATA_PATH with nsys profiling..."
+    echo "nsys report will be saved to: ${REPORT_PATH}.nsys-rep"
+    nsys profile \
+        -c cudaProfilerApi \
+        --capture-range-end=stop \
+        --trace=cuda,nvtx \
+        --nvtx-domain-include="libcudf" \
+        -o "$REPORT_PATH" \
+        "${MVN_CMD[@]}"
+else
+    echo "Running microbenchmark (mode=$MODE) on $DATA_PATH..."
+    "${MVN_CMD[@]}"
+fi
diff --git a/skills/udf-gen-test/templates/scala/run_spark_benchmark.sh b/skills/udf-gen-test/templates/scala/run_spark_benchmark.sh
new file mode 100644
index 00000000000..d8b2b1d1b70
--- /dev/null
+++ b/skills/udf-gen-test/templates/scala/run_spark_benchmark.sh
@@ -0,0 +1,69 @@
+#!/bin/bash
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+# Run CPU or GPU Spark benchmark.
+
+set -e
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+cd "$SCRIPT_DIR"
+
+print_usage() {
+    echo "Usage: $0 --mode cpu|gpu --data-path PATH [--result-path PATH] [--mvn-arg ARG]..."
+}
+
+MODE=""
+DATA_PATH=""
+RESULT_PATH=""
+MAVEN_ARGS=()
+
+while [[ $# -gt 0 ]]; do
+    case $1 in
+        --mode) MODE="$2"; shift 2;;
+        --data-path) DATA_PATH="$2"; shift 2;;
+        --result-path) RESULT_PATH="$2"; shift 2;;
+        --mvn-arg) MAVEN_ARGS+=("$2"); shift 2;;
+        *)
+            echo "Unknown option: $1"
+            print_usage
+            exit 1
+            ;;
+    esac
+done
+
+if [ -z "$MODE" ] || [ -z "$DATA_PATH" ]; then
+    echo "Error: --mode and --data-path are required"
+    print_usage
+    exit 1
+fi
+
+DATA_BASENAME=$(basename "$DATA_PATH" .parquet)
+TIMESTAMP=$(date +%Y%m%d_%H%M%S)
+if [ -z "$RESULT_PATH" ]; then
+    RESULT_PATH="results/${MODE}_${DATA_BASENAME}_${TIMESTAMP}_result.json"
+fi
+
+SPARK_CONFS=(
+    --spark-conf spark.master="local[8]"
+    --spark-conf spark.rapids.sql.enabled="true"
+    --spark-conf spark.plugins="com.nvidia.spark.SQLPlugin"
+    --spark-conf spark.locality.wait="0s"
+    --spark-conf spark.sql.cache.serializer="com.nvidia.spark.ParquetCachedBatchSerializer"
+    --spark-conf spark.rapids.sql.format.parquet.reader.type="MULTITHREADED"
+    --spark-conf spark.rapids.sql.reader.batchSizeBytes="1000MB"
+    --spark-conf spark.sql.files.maxPartitionBytes="512MB"
+    --spark-conf spark.rapids.sql.metrics.level="DEBUG"
+)
+
+EXEC_ARGS="--mode $MODE --data-path $DATA_PATH --result-path $RESULT_PATH"
+for arg in "${SPARK_CONFS[@]}"; do
+    EXEC_ARGS="$EXEC_ARGS $arg"
+done
+EXEC_ARGS="$EXEC_ARGS --spark-conf spark.app.name=${MODE}_${DATA_BASENAME}_${TIMESTAMP}"
+
+echo "Running $MODE benchmark on $DATA_PATH..."
+mvn "${MAVEN_ARGS[@]}" compile exec:java \
+    -Dexec.mainClass="com.udf.bench.SparkBenchRunner" \
+    -Dexec.classpathScope=compile \
+    -Dexec.args="$EXEC_ARGS"
diff --git a/skills/udf-gen-test/templates/scala/src/main/scala/com/udf/Arm.scala b/skills/udf-gen-test/templates/scala/src/main/scala/com/udf/Arm.scala
new file mode 100644
index 00000000000..dd0c48d5f7c
--- /dev/null
+++ b/skills/udf-gen-test/templates/scala/src/main/scala/com/udf/Arm.scala
@@ -0,0 +1,67 @@
+/*
+ * SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: Apache-2.0
+ */
+
+package com.udf
+
+/**
+ * Automatic resource management (ARM).
+ */
+object Arm {
+  /**
+   * Helper to auto-close GPU resources after use.
+   *
+   * @param resource The AutoCloseable resource
+   * @param f The function to execute with the resource
+   * @return The result of the function
+   */
+  def withResource[T <: AutoCloseable, R](resource: T)(f: T => R): R = {
+    try {
+      f(resource)
+    } finally {
+      if (resource != null) {
+        resource.close()
+      }
+    }
+  }
+
+  /**
+   * Helper to close a resource only if `f` throws an exception; on success it stays open.
+   *
+   * @param resource The AutoCloseable resource
+   * @param f The function to execute with the resource
+   * @return The result of the function
+   */
+  def closeOnExcept[T <: AutoCloseable, R](resource: T)(f: T => R): R = {
+    try {
+      f(resource)
+    } catch {
+      case e: Exception =>
+        if (resource != null) {
+          try {
+            resource.close()
+          } catch {
+            case closeException: Exception =>
+              e.addSuppressed(closeException)
+          }
+        }
+        throw e
+    }
+  }
+
+  /**
+   * Close all resources in an array, skipping nulls and ignoring exceptions from close().
+   *
+   * @param resources The array of resources to close
+   */
+  def closeAll[T <: AutoCloseable](resources: Array[T]): Unit = {
+    if (resources != null) {
+      resources.foreach { r =>
+        if (r != null) {
+          try { r.close() } catch { case _: Exception => }
+        }
+      }
+    }
+  }
+}
diff --git a/skills/udf-gen-test/templates/scala/src/main/scala/com/udf/SparkUtils.scala b/skills/udf-gen-test/templates/scala/src/main/scala/com/udf/SparkUtils.scala
new file mode 100644
index 00000000000..8684af60cef
--- /dev/null
+++ b/skills/udf-gen-test/templates/scala/src/main/scala/com/udf/SparkUtils.scala
@@ -0,0 +1,84 @@
+/*
+ * SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: Apache-2.0
+ */
+
+package com.udf
+
+import com.nvidia.spark.rapids.ExplainPlan
+import org.apache.spark.sql.{DataFrame, SparkSession}
+
+/**
+ * Spark utility methods.
+ */
+object SparkUtils {
+
+  /**
+   * Apply key=value Spark configs to a builder.
+   *
+   * @param builder     the SparkSession builder to configure
+   * @param sparkConfs  "spark.key=value" config strings; entries without '=' are skipped
+   * @return the same builder, for chaining
+   */
+  def applySparkConfs(
+      builder: SparkSession.Builder,
+      sparkConfs: Seq[String]
+  ): SparkSession.Builder = {
+    for (conf <- sparkConfs) {
+      val kv = conf.split("=", 2)
+      if (kv.length == 2) builder.config(kv(0), kv(1))
+    }
+    builder
+  }
+
+  /**
+   * Ops that cause fallback but can be ignored, since they are strictly used for testing:
+   * - RDDScanExec/LocalTableScanExec: surfaces due to spark.createDataFrame()
+   * - CollectLimitExec: surfaces during dataframe collection (e.g. df.show())
+   * - ToPrettyString: surfaces due to df.show()
+   */
+  private val IgnoreOperations = Set(
+    "RDDScanExec", "LocalTableScanExec", "CollectLimitExec", "ToPrettyString"
+  )
+
+  /**
+   * Assert that the DataFrame's plan can run on GPU.
+   * NOTE: This is only reliable in explainOnly mode, with AQE disabled.
+   *
+   * @param df              the DataFrame to check
+   * @param returnFullPlan  if true, include the full plan in the error message
+   * @throws RuntimeException if any operations cannot run on GPU
+   */
+  def assertPlanRunsOnGpu(df: DataFrame, returnFullPlan: Boolean = false): Unit = {
+    val plan = getGpuPlan(df)
+    val unsupportedOps = getUnsupportedOps(plan)
+    if (unsupportedOps.nonEmpty) {
+      val opsList = unsupportedOps.map(op => s"- $op").mkString("\n")
+      var errorMsg = s"Some operations cannot run on GPU.\nFound the following unsupported ops:\n$opsList"
+      if (returnFullPlan) {
+        errorMsg += s"\n\nFull physical plan:\n$plan"
+      }
+      throw new RuntimeException(errorMsg)
+    }
+  }
+
+  /** Get the potential GPU plan using the RAPIDS ExplainPlan API. */
+  private def getGpuPlan(df: DataFrame): String = {
+    ExplainPlan.explainPotentialGpuPlan(df, "NOT_ON_GPU")
+  }
+
+  /** Parse the plan for unsupported operations (lines starting with '!'). */
+  private def getUnsupportedOps(plan: String): Seq[String] = {
+    plan.split("\n").filter(_.trim.startsWith("!")).flatMap { line =>
+      // Each unsupported line looks like: ![Exec] <OPERATION> cannot run on GPU
+      val start = line.indexOf('<')
+      val end = line.indexOf('>')
+      if (start >= 0 && end > start) {
+        val op = line.substring(start + 1, end)
+        if (!IgnoreOperations.contains(op)) Some(line.trim) else None
+      } else {
+        None
+      }
+    }.toSeq
+  }
+}
diff --git a/skills/udf-gen-test/templates/scala/src/main/scala/com/udf/bench/BenchUtils.scala b/skills/udf-gen-test/templates/scala/src/main/scala/com/udf/bench/BenchUtils.scala
new file mode 100644
index 00000000000..e4d5c3471ed
--- /dev/null
+++ b/skills/udf-gen-test/templates/scala/src/main/scala/com/udf/bench/BenchUtils.scala
@@ -0,0 +1,109 @@
+/*
+ * SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: Apache-2.0
+ */
+
+package com.udf.bench
+
+import org.apache.spark.sql.{DataFrame, SparkSession}
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+/**
+ * Benchmark utilities.
+ *   - generateSyntheticData: Create benchmark data for the UDF
+ *   - executeCpu: Register and run the CPU UDF
+ *   - executeGpu: Register and run the GPU implementation
+ */
+object BenchUtils {
+
+  // ---------------------------------------------------------------------------
+  // Data generation
+  // ---------------------------------------------------------------------------
+
+  /**
+   * TODO: Generate a synthetic DataFrame matching the unit test schema.
+   *
+   * Use `spark.range(0, numRows, 1, numPartitions)` as the base, then apply
+   * randomized column generators to produce data matching the UDF's expected input.
+   *
+   * Requirements:
+   *   - Column names and types MUST match the unit test dataset schema
+   *   - Data should be realistic and varied (different lengths, edge cases, etc.)
+   *   - For variable-length inputs, generate sizable rows representative of
+   *     enterprise-scale data
+   *
+   * Example:
+   * {{{
+   *   val baseDF = spark.range(0, numRows, 1, numPartitions)
+   *   baseDF.select(
+   *     col("id"),
+   *     (rand() * 850).cast(IntegerType).alias("credit_score")
+   *   )
+   * }}}
+   *
+   * @param spark         active SparkSession
+   * @param numRows       number of rows to generate
+   * @param numPartitions number of output partitions
+   * @return DataFrame with the same schema as the unit test data
+   */
+  def generateSyntheticData(
+      spark: SparkSession,
+      numRows: Long,
+      numPartitions: Int
+  ): DataFrame = ???
+
+  // ---------------------------------------------------------------------------
+  // Execution
+  // ---------------------------------------------------------------------------
+
+  /**
+   * TODO: Execute the CPU UDF on the benchmark DataFrame.
+   *   1. Register the CPU UDF with Spark
+   *   2. Execute it on `df`
+   *   3. Return the result DataFrame
+   *
+   * Example:
+   * {{{
+   *   import com.udf.CalculateRiskUDF
+   *   spark.udf.register("calculate_risk", new CalculateRiskUDF())
+   *   df.createOrReplaceTempView("bench_table")
+   *   spark.sql("SELECT *, calculate_risk(credit_score) AS risk_level FROM bench_table")
+   * }}}
+   *
+   * @param spark active SparkSession
+   * @param df    input benchmark DataFrame
+   * @return result DataFrame after applying the CPU UDF
+   */
+  def executeCpu(spark: SparkSession, df: DataFrame): DataFrame = ???
+
+  /**
+   * TODO: Execute the GPU implementation on the benchmark DataFrame.
+   *
+   * For RapidsUDF - register RapidsUDF and run the same query as executeCpu:
+   * {{{
+   *   import com.udf.CalculateRiskRapidsUDF
+   *   spark.udf.register("calculate_risk_rapids", new CalculateRiskRapidsUDF())
+   *   df.createOrReplaceTempView("bench_table")
+   *   spark.sql("SELECT *, calculate_risk_rapids(credit_score) AS risk_level FROM bench_table")
+   * }}}
+   *
+   * For SQL - read the SQL file from src/main/resources/ and adapt it for
+   * benchmarking. The SQL was written for the unit test, so you must:
+   *   1. Replace "test_table" with "bench_table"
+   *   2. Replace the SELECT column list with "SELECT *" to avoid referencing
+   *      columns that may not exist in the benchmark DataFrame
+   * {{{
+   *   df.createOrReplaceTempView("bench_table")
+   *   val sqlContent = scala.io.Source.fromFile("src/main/resources/calculate_risk.sql").mkString
+   *   val benchSql = sqlContent.replace("test_table", "bench_table")
+   *   // Also replace the SELECT column list with SELECT * if needed
+   *   spark.sql(benchSql)
+   * }}}
+   *
+   * @param spark active SparkSession
+   * @param df    input benchmark DataFrame
+   * @return result DataFrame after applying the GPU implementation
+   */
+  def executeGpu(spark: SparkSession, df: DataFrame): DataFrame = ???
+}
diff --git a/skills/udf-gen-test/templates/scala/src/main/scala/com/udf/bench/GenData.scala b/skills/udf-gen-test/templates/scala/src/main/scala/com/udf/bench/GenData.scala
new file mode 100644
index 00000000000..31377658c07
--- /dev/null
+++ b/skills/udf-gen-test/templates/scala/src/main/scala/com/udf/bench/GenData.scala
@@ -0,0 +1,101 @@
+/*
+ * SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: Apache-2.0
+ */
+
+package com.udf.bench
+
+import com.udf.SparkUtils
+import org.apache.spark.sql.SparkSession
+
+/**
+ * Generates benchmark data and optionally validates by running
+ * BenchUtils.executeCpu and BenchUtils.executeGpu.
+ *
+ * Usage:
+ *   mvn exec:java -Dexec.mainClass=com.udf.bench.GenData \
+ *     -Dexec.args="--rows 1000 --validate --spark-conf k=v ..."
+ */
+object GenData {
+
+  def main(args: Array[String]): Unit = {
+    val (parsed, sparkConfs) = parseArgs(args)
+
+    val rows = parsed.getOrElse("rows",
+      throw new IllegalArgumentException("--rows is required")).toLong
+    val partitions = parsed.getOrElse("partitions", "32").toInt
+    val validate = parsed.contains("validate")
+    val outputPath = parsed.get("output-path")
+
+    // Build Spark session
+    val builder = SparkSession.builder().appName("GenData")
+    SparkUtils.applySparkConfs(builder, sparkConfs)
+    val spark = builder.getOrCreate()
+
+    try {
+      // Generate synthetic data
+      val df = BenchUtils.generateSyntheticData(spark, rows, partitions)
+
+      // Verify row count
+      val actualRows = df.count()
+      if (actualRows != rows) {
+        System.err.println(s"Row count mismatch: expected=$rows, actual=$actualRows")
+        sys.exit(1)
+      }
+      println(s"Generated $actualRows rows across $partitions partitions")
+
+      if (validate) {
+        // Validation mode — run both CPU and GPU execute, don't write
+        for ((label, executeFn) <- Seq(
+          ("cpu", BenchUtils.executeCpu _),
+          ("gpu", BenchUtils.executeGpu _)
+        )) {
+          try {
+            executeFn(spark, df).collect()
+            println(s"Validation ($label) passed.")
+          } catch {
+            case e: Exception =>
+              System.err.println(
+                s"Validation ($label) failed: ${e.getClass.getSimpleName}: ${e.getMessage}")
+              e.printStackTrace(System.err)
+              sys.exit(1)
+          }
+        }
+      } else {
+        // Generation mode — write to output path
+        val path = outputPath.getOrElse(
+          throw new IllegalArgumentException("--output-path is required when not in validation mode"))
+        df.write.mode("overwrite").parquet(path)
+        println(s"Successfully generated dataset and saved to: $path")
+      }
+    } catch {
+      case e: Exception =>
+        System.err.println(s"Failed to generate dataset: ${e.getClass.getSimpleName}")
+        e.printStackTrace(System.err)
+        sys.exit(1)
+    } finally {
+      spark.stop()
+    }
+
+    sys.exit(0)
+  }
+
+  /** Parse CLI arguments. */
+  private def parseArgs(args: Array[String]): (Map[String, String], Seq[String]) = {
+    var map = Map.empty[String, String]
+    var sparkConfs = Seq.empty[String]
+    var i = 0
+    while (i < args.length) {
+      args(i) match {
+        case "--rows"        => map += ("rows" -> args(i + 1)); i += 2
+        case "--partitions"  => map += ("partitions" -> args(i + 1)); i += 2
+        case "--validate"    => map += ("validate" -> "true"); i += 1
+        case "--output-path" => map += ("output-path" -> args(i + 1)); i += 2
+        case "--spark-conf"  => sparkConfs :+= args(i + 1); i += 2
+        case other =>
+          throw new IllegalArgumentException(s"Unknown argument: $other")
+      }
+    }
+    (map, sparkConfs)
+  }
+}
diff --git a/skills/udf-gen-test/templates/scala/src/main/scala/com/udf/bench/MicroBenchRunner.scala b/skills/udf-gen-test/templates/scala/src/main/scala/com/udf/bench/MicroBenchRunner.scala
new file mode 100644
index 00000000000..f1f22cc4469
--- /dev/null
+++ b/skills/udf-gen-test/templates/scala/src/main/scala/com/udf/bench/MicroBenchRunner.scala
@@ -0,0 +1,297 @@
+/*
+ * SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: Apache-2.0
+ */
+
+package com.udf.bench
+
+import java.io.File
+
+import scala.collection.mutable.ArrayBuffer
+
+import ai.rapids.cudf.{
+  ColumnVector,
+  Cuda,
+  CudaMemInfo,
+  HostColumnVector,
+  Rmm,
+  RmmAllocationMode,
+  Table
+}
+import com.udf.Arm.{closeAll, withResource}
+
+/**
+ * Microbenchmark runner for CPU vs. RapidsUDF. Measures UDF execution time on in-memory data.
+ *
+ * Reads Parquet files (produced by GenData) via cuDF Table.readParquet.
+ * Benchmarks the CPU (row-by-row evaluate) and GPU (evaluateColumnar) paths.
+ * Data loading and host/device transfers are excluded from the timing.
+ *
+ * Usage:
+ *   mvn exec:java -Dexec.mainClass=com.udf.bench.MicroBenchRunner \
+ *     -Dexec.args="--mode all --data-path data/bench_data --rows 1000000"
+ */
+object MicroBenchRunner {
+
+  private val DefaultWarmup = 2
+  private val DefaultMeasured = 4
+  private val DefaultRmmAllocFraction = 0.9f
+
+  /**
+   * TODO: Extract column data from host memory into Scala objects.
+   *
+   * Called once before the CPU timing loop. Convert the HostColumnVectors to
+   * arrays of Scala objects for executeCpu.
+   * Use hostColumns(i).getJavaString(row), .getInt(row), .getDouble(row),
+   * .getStruct(row), .getList(row), etc. to extract values into typed arrays.
+   *
+   * This runs outside the timing loop so that the overhead of extracting and
+   * boxing Java types from cuDF is not measured.
+   *
+   * Example for a UDF that takes (String, Int):
+   * {{{
+   *   val col0 = Array.tabulate(numRows)(i => hostColumns(0).getJavaString(i))
+   *   val col1 = Array.tabulate(numRows)(i => hostColumns(1).getInt(i))
+   *   Array[AnyRef](col0, col1.asInstanceOf[AnyRef])
+   * }}}
+   *
+   * @param hostColumns all columns copied to host memory
+   * @param numRows     number of rows in the dataset
+   * @return array of typed arrays, one per UDF input column
+   */
+  def prepareCpuData(
+      hostColumns: Array[HostColumnVector],
+      numRows: Int
+  ): Array[AnyRef] = ???
+
+  /**
+   * TODO: Execute the CPU UDF on Scala data row-by-row.
+   *
+   * Example:
+   * {{{
+   *   val col0 = data(0).asInstanceOf[Array[String]]
+   *   val col1 = data(1).asInstanceOf[Array[Int]]
+   *   val udf = new com.udf.PlaceholderUDFName()
+   *   var i = 0
+   *   while (i < numRows) {
+   *     udf.apply(col0(i), col1(i))
+   *     i += 1
+   *   }
+   * }}}
+   *
+   * @param data    typed arrays from [[prepareCpuData]]
+   * @param numRows number of rows in the dataset
+   */
+  def executeCpu(data: Array[AnyRef], numRows: Int): Unit = ???
+
+  /**
+   * TODO: Execute the GPU UDF via evaluateColumnar.
+   *
+   * Example:
+   * {{{
+   *   val udf = new com.udf.PlaceholderRapidsUDFName()
+   *   udf.evaluateColumnar(numRows,
+   *     table.getColumn(0), table.getColumn(1))
+   * }}}
+   *
+   * @param table   the dataset loaded on GPU
+   * @param numRows number of rows in the dataset
+   * @return result ColumnVector (NOTE: caller must close)
+   */
+  def executeGpu(table: Table, numRows: Int): ColumnVector = ???
+
+  def main(args: Array[String]): Unit = {
+    val parsed = parseArgs(args)
+
+    val dataPath = parsed.getOrElse("data-path",
+      throw new IllegalArgumentException("--data-path is required"))
+    val mode = parsed.getOrElse("mode", "all")
+    val maxRows = parsed.getOrElse("rows", "-1").toInt
+    val rmmAllocFraction = parsed.getOrElse("pool-fraction", DefaultRmmAllocFraction.toString).toFloat
+    val warmup = parsed.getOrElse("warmup", DefaultWarmup.toString).toInt
+    val measured = parsed.getOrElse("measured", DefaultMeasured.toString).toInt
+    val profile = parsed.contains("profile")
+
+    mode match {
+      case "cpu" | "gpu" | "all" =>
+      case other => throw new IllegalArgumentException(
+        s"Unknown mode: '$other'. Must be 'cpu', 'gpu', or 'all'.")
+    }
+    val runCpu = mode == "cpu" || mode == "all"
+    val runGpu = mode == "gpu" || mode == "all"
+
+    // Initialize RMM pool
+    if (!Rmm.isInitialized()) {
+      val memInfo = Cuda.memGetInfo()
+      val poolSize = (memInfo.free * rmmAllocFraction).toLong & ~255L
+      Rmm.initialize(RmmAllocationMode.POOL, null, poolSize)
+    }
+
+    // Read Parquet data into cuDF table
+    withResource(readParquetData(dataPath, maxRows)) { table =>
+      val numRows = table.getRowCount.toInt
+      val numCols = table.getNumberOfColumns
+      val mb = getTableSizeMB(table)
+      println(f"Loaded $numRows%,d rows x $numCols columns ($mb%.1f MB) from: $dataPath")
+      println(s"Microbenchmark: mode=$mode, warmup=$warmup, measured=$measured")
+
+      var cpuMinMs: Option[Double] = None
+      var gpuMinMs: Option[Double] = None
+
+      // --- CPU Benchmark ---
+      if (runCpu) {
+        val hostColumns = copyAllToHost(table)
+        try {
+          val cpuData = prepareCpuData(hostColumns, numRows)
+          val times = runBenchmark(warmup, measured) {
+            executeCpu(cpuData, numRows)
+          }
+          val medianMs = (times((times.length - 1) / 2) + times(times.length / 2)) / 2e6
+          val minMs = times(0) / 1e6
+          cpuMinMs = Some(minMs)
+          println(
+            f"   CPU  | $numRows%,14d rows | median $medianMs%10.1f ms | min $minMs%10.1f ms")
+        } catch {
+          case e: Exception =>
+            System.err.println(s"CPU benchmark failed: ${e.getMessage}")
+            e.printStackTrace(System.err)
+            System.exit(1)
+        } finally {
+          closeAll(hostColumns)
+        }
+      }
+
+      // --- GPU Benchmark ---
+      if (runGpu) {
+        try {
+          val times = runBenchmark(warmup, measured, profile = profile) {
+            withResource(executeGpu(table, numRows)) { _ => }
+          }
+          val medianMs = (times((times.length - 1) / 2) + times(times.length / 2)) / 2e6
+          val minMs = times(0) / 1e6
+          gpuMinMs = Some(minMs)
+          println(
+            f"   GPU  | $numRows%,14d rows | median $medianMs%10.1f ms | min $minMs%10.1f ms")
+        } catch {
+          case e: Exception =>
+            System.err.println(s"GPU benchmark failed: ${e.getMessage}")
+            e.printStackTrace(System.err)
+            System.exit(1)
+        }
+      }
+
+      // --- Speedup ---
+      for (cpu <- cpuMinMs; gpu <- gpuMinMs) {
+        val speedup = cpu / gpu
+        println(f">> Speedup: $speedup%.2fx (CPU/GPU best)")
+      }
+    }
+
+    System.exit(0)
+  }
+
+  /**
+   * Run warmup + measured iterations. Profile the measured iterations if enabled.
+   * @return sorted array of measured elapsed times in nanoseconds
+   */
+  private def runBenchmark(warmup: Int, measured: Int, profile: Boolean = false)
+      (block: => Unit): Array[Long] = {
+    for (_ <- 0 until warmup) block
+    (0 until measured).map { i =>
+      if (profile) Cuda.profilerStart()
+      val start = System.nanoTime()
+      block
+      val elapsed = System.nanoTime() - start
+      if (profile) Cuda.profilerStop()
+      elapsed
+    }.toArray.sorted
+  }
+
+  /**
+   * Read Parquet partition files from a directory into a cuDF Table.
+   * Reads files in sorted order, stopping once maxRows is reached.
+   * @param maxRows stop after accumulating this many rows; values <= 0 (e.g. -1) read all.
+   */
+  private def readParquetData(dataPath: String, maxRows: Int): Table = {
+    val partFiles = new File(dataPath).listFiles((_, name) => name.endsWith(".parquet"))
+    if (partFiles == null || partFiles.isEmpty) {
+      throw new IllegalArgumentException(s"No .parquet files found in: $dataPath")
+    }
+
+    val tables = ArrayBuffer.empty[Table]
+    var totalRows = 0L
+    try {
+      for (f <- partFiles.sorted if maxRows <= 0 || totalRows < maxRows) {
+        val t = Table.readParquet(f)
+        tables += t
+        totalRows += t.getRowCount
+      }
+      if (tables.length == 1) {
+        limitTable(tables(0), maxRows)
+      } else {
+        withResource(Table.concatenate(tables.toArray: _*)) { combined =>
+          limitTable(combined, maxRows)
+        }
+      }
+    } finally {
+      closeAll(tables.toArray)
+    }
+  }
+
+  /** Return a new Table with at most numRows rows (all rows if numRows <= 0). */
+  private def limitTable(table: Table, numRows: Int): Table = {
+    val n = if (numRows <= 0) table.getRowCount.toInt
+      else Math.min(numRows, table.getRowCount).toInt
+    val cols = new Array[ColumnVector](table.getNumberOfColumns)
+    try {
+      for (i <- cols.indices) {
+        cols(i) = table.getColumn(i).subVector(0, n)
+      }
+      new Table(cols: _*)
+    } finally {
+      closeAll(cols)
+    }
+  }
+
+  /** Get the device memory footprint of the table in MB (1 MB = 1024 * 1024 bytes). */
+  private def getTableSizeMB(table: Table): Double = {
+    (0 until table.getNumberOfColumns)
+      .map(i => table.getColumn(i).getDeviceMemorySize)
+      .sum / (1024.0 * 1024.0)
+  }
+
+  /** Copy all device columns to host memory. */
+  private def copyAllToHost(table: Table): Array[HostColumnVector] = {
+    val hostCols = new Array[HostColumnVector](table.getNumberOfColumns)
+    try {
+      for (i <- hostCols.indices) {
+        hostCols(i) = table.getColumn(i).copyToHost()
+      }
+      hostCols
+    } catch {
+      case e: Throwable =>
+        closeAll(hostCols)
+        throw e
+    }
+  }
+
+  /** Parse CLI arguments. */
+  private def parseArgs(args: Array[String]): Map[String, String] = {
+    var map = Map.empty[String, String]
+    var i = 0
+    while (i < args.length) {
+      args(i) match {
+        case "--mode"        => map += ("mode" -> args(i + 1)); i += 2
+        case "--data-path"   => map += ("data-path" -> args(i + 1)); i += 2
+        case "--warmup"      => map += ("warmup" -> args(i + 1)); i += 2
+        case "--measured"    => map += ("measured" -> args(i + 1)); i += 2
+        case "--rows"        => map += ("rows" -> args(i + 1)); i += 2
+        case "--pool-fraction" => map += ("pool-fraction" -> args(i + 1)); i += 2
+        case "--profile"     => map += ("profile" -> "true"); i += 1
+        case other =>
+          throw new IllegalArgumentException(s"Unknown argument: $other")
+      }
+    }
+    map
+  }
+}
diff --git a/skills/udf-gen-test/templates/scala/src/main/scala/com/udf/bench/SparkBenchRunner.scala b/skills/udf-gen-test/templates/scala/src/main/scala/com/udf/bench/SparkBenchRunner.scala
new file mode 100644
index 00000000000..3eafe49a201
--- /dev/null
+++ b/skills/udf-gen-test/templates/scala/src/main/scala/com/udf/bench/SparkBenchRunner.scala
@@ -0,0 +1,176 @@
+/*
+ * SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: Apache-2.0
+ */
+
+package com.udf.bench
+
+import java.io.{File, PrintWriter, StringWriter}
+import com.fasterxml.jackson.core.util.{DefaultIndenter, DefaultPrettyPrinter}
+import com.fasterxml.jackson.databind.{ObjectMapper, SerializationFeature}
+import com.udf.SparkUtils
+import org.apache.spark.sql.{DataFrame, SparkSession}
+
+/**
+ * UDF benchmark runner. Measures the end-to-end runtime of:
+ *   Read Parquet -> Execute (CPU or GPU) -> Write no-op sink
+ *
+ * Produces a JSON file with the benchmark results.
+ * On error, also produces a separate error log file.
+ *
+ * Usage:
+ *   mvn exec:java -Dexec.mainClass=com.udf.bench.SparkBenchRunner \
+ *     -Dexec.args="--mode cpu --data-path data/bench_data_10M_rows.parquet ..."
+ */
+object SparkBenchRunner {
+
+  private val DefaultSparkLogLevel = "ERROR"
+
+  def main(args: Array[String]): Unit = {
+    val (parsed, sparkConfs) = parseArgs(args)
+
+    val mode = parsed.getOrElse("mode",
+      throw new IllegalArgumentException("--mode is required (cpu or gpu)"))
+    val dataPath = parsed.getOrElse("data-path",
+      throw new IllegalArgumentException("--data-path is required"))
+    val resultPath = parsed.getOrElse("result-path",
+      throw new IllegalArgumentException("--result-path is required"))
+    val sparkLogLevel = parsed.getOrElse("spark-log-level", DefaultSparkLogLevel)
+
+    // Resolve execution mode
+    val executeFn: (SparkSession, DataFrame) => DataFrame = mode match {
+      case "cpu" => BenchUtils.executeCpu
+      case "gpu" => BenchUtils.executeGpu
+      case other =>
+        throw new IllegalArgumentException(
+          s"Unknown mode: '$other'. Must be 'cpu' or 'gpu'.")
+    }
+
+    // Build Spark session
+    val builder = SparkSession.builder()
+    SparkUtils.applySparkConfs(builder, sparkConfs)
+    val spark = builder.getOrCreate()
+    spark.sparkContext.setLogLevel(sparkLogLevel)
+
+    try {
+      // --- START JOB ---
+      val startTime = System.nanoTime()
+      val df = spark.read.parquet(dataPath)
+      val resultDf = executeFn(spark, df)
+      resultDf.write.format("noop").mode("overwrite").save()
+      val elapsed = (System.nanoTime() - startTime) / 1e9
+      // --- END JOB ---
+
+      System.err.println(f"E2E Runtime (s): $elapsed%.2f")
+
+      writeReport(
+        path = resultPath,
+        mode = mode,
+        dataPath = dataPath,
+        elapsed = elapsed,
+        status = "success",
+        cliArgs = args)
+
+    } catch {
+      case e: Exception =>
+        System.err.println(s"Benchmark run failed: ${e.getClass.getSimpleName}")
+        e.printStackTrace(System.err)
+
+        // Stack trace goes to a separate log file whose path never collides with resultPath.
+        val errorLogPath = resultPath.stripSuffix("_result.json").stripSuffix(".json") + "_error.log"
+        writeErrorLog(errorLogPath, e)
+
+        writeReport(
+          path = resultPath,
+          mode = mode,
+          dataPath = dataPath,
+          elapsed = -1,
+          status = "error",
+          cliArgs = args,
+          errorMessage = Option(e.getMessage),
+          errorLogFile = Some(errorLogPath))
+
+        sys.exit(1)
+    } finally {
+      spark.stop()
+    }
+
+    sys.exit(0)
+  }
+
+  /** Write a JSON benchmark report containing the result and args. */
+  private def writeReport(
+      path: String,
+      mode: String,
+      dataPath: String,
+      elapsed: Double,
+      status: String,
+      cliArgs: Array[String],
+      errorMessage: Option[String] = None,
+      errorLogFile: Option[String] = None
+  ): Unit = {
+    val resultDir = new File(path).getParentFile
+    if (resultDir != null) resultDir.mkdirs()
+
+    try {
+      import java.util.{LinkedHashMap => JLinkedHashMap, Arrays => JArrays}
+      val report = new JLinkedHashMap[String, AnyRef]()
+      report.put("mode", mode)
+      report.put("data_path", dataPath)
+      report.put("status", status)
+      report.put("e2e_runtime", java.lang.Double.valueOf(elapsed))
+      report.put("cli_args", JArrays.asList(cliArgs: _*))
+      errorMessage.foreach { msg =>
+        val error = new JLinkedHashMap[String, String]()
+        error.put("error_message", msg)
+        errorLogFile.foreach(f => error.put("error_log_file", f))
+        report.put("error", error)
+      }
+
+      val mapper = new ObjectMapper()
+      mapper.enable(SerializationFeature.INDENT_OUTPUT)
+      val printer = new DefaultPrettyPrinter()
+      printer.indentArraysWith(DefaultIndenter.SYSTEM_LINEFEED_INSTANCE)
+      mapper.writer(printer).writeValue(new File(path), report)
+      System.err.println(s"Report written to: $path")
+    } catch {
+      case e: Exception =>
+        System.err.println(s"Failed to write report: ${e.getMessage}")
+    }
+  }
+
+  /** Write an exception to an error log file. */
+  private def writeErrorLog(path: String, e: Exception): Unit = {
+    val logDir = new File(path).getParentFile
+    if (logDir != null) logDir.mkdirs()
+
+    val pw = new PrintWriter(path)
+    try {
+      val sw = new StringWriter()
+      e.printStackTrace(new PrintWriter(sw))
+      pw.print(sw.toString)
+    } finally {
+      pw.close()
+    }
+    System.err.println(s"Error details written to: $path")
+  }
+
+  /** Parse CLI arguments. */
+  private def parseArgs(args: Array[String]): (Map[String, String], Seq[String]) = {
+    var map = Map.empty[String, String]
+    var sparkConfs = Seq.empty[String]
+    var i = 0
+    while (i < args.length) {
+      args(i) match {
+        case "--mode"            => map += ("mode" -> args(i + 1)); i += 2
+        case "--data-path"       => map += ("data-path" -> args(i + 1)); i += 2
+        case "--result-path"     => map += ("result-path" -> args(i + 1)); i += 2
+        case "--spark-log-level" => map += ("spark-log-level" -> args(i + 1)); i += 2
+        case "--spark-conf"      => sparkConfs :+= args(i + 1); i += 2
+        case other =>
+          throw new IllegalArgumentException(s"Unknown argument: $other")
+      }
+    }
+    (map, sparkConfs)
+  }
+}
diff --git a/skills/udf-gen-test/templates/scala/src/test/scala/com/udf/CudfComparisonTest.scala b/skills/udf-gen-test/templates/scala/src/test/scala/com/udf/CudfComparisonTest.scala
new file mode 100644
index 00000000000..32361dc95c0
--- /dev/null
+++ b/skills/udf-gen-test/templates/scala/src/test/scala/com/udf/CudfComparisonTest.scala
@@ -0,0 +1,56 @@
+/*
+ * SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: Apache-2.0
+ */
+
+package com.udf
+
+import org.apache.spark.sql.SparkSession
+import org.scalatest.funsuite.AnyFunSuite
+import org.scalatest.BeforeAndAfterAll
+
+class CudfComparisonTest extends AnyFunSuite with BeforeAndAfterAll {
+
+  var spark: SparkSession = _
+
+  override def beforeAll(): Unit = {
+    spark = SparkSession.builder()
+      .appName("UDF vs. RapidsUDF Comparison Test")
+      .master("local[4]")
+      .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
+      .config("spark.rapids.memory.gpu.pool", "NONE")
+      .config("spark.rapids.sql.explain", "NONE")
+      .getOrCreate()
+  }
+
+  override def afterAll(): Unit = {
+    if (spark != null) spark.stop()
+  }
+
+  /** TODO: Register the RapidsUDF with Spark. */
+  def registerRapidsUDF(spark: SparkSession, udfName: String): Unit = ???
+
+  test("UDF vs RapidsUDF") {
+    val testDF = UnitTest.createTestData(spark).repartition(1)
+
+    // Run CPU UDF
+    UnitTest.registerUDF(spark, "placeholder_udf_name")
+    val cpuResultDF = UnitTest.executeUDF(spark, "placeholder_udf_name", testDF)
+    UnitTest.verifyUDFResults(cpuResultDF, testDF)
+
+    // Run RapidsUDF
+    registerRapidsUDF(spark, "placeholder_rapids_udf_name")
+    val gpuResultDF = UnitTest.executeUDF(spark, "placeholder_rapids_udf_name", testDF)
+    UnitTest.verifyUDFResults(gpuResultDF, testDF)
+
+    // Compare
+    TestUtils.assertDataFrameEquals(actual = gpuResultDF, expected = cpuResultDF)
+  }
+
+  /**
+   * TODO: If UnitTest adds extra tests beyond the main result checks, add
+   * corresponding comparison tests here. Each case should run the same input
+   * through the CPU UDF and the RapidsUDF, apply equivalent assertions to both
+   * outputs, and compare the RapidsUDF output against the CPU output.
+   */
+}
diff --git a/skills/udf-gen-test/templates/scala/src/test/scala/com/udf/SqlComparisonTest.scala b/skills/udf-gen-test/templates/scala/src/test/scala/com/udf/SqlComparisonTest.scala
new file mode 100644
index 00000000000..8aa167dbc22
--- /dev/null
+++ b/skills/udf-gen-test/templates/scala/src/test/scala/com/udf/SqlComparisonTest.scala
@@ -0,0 +1,59 @@
+/*
+ * SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: Apache-2.0
+ */
+
+package com.udf
+
+import org.apache.spark.sql.SparkSession
+import org.scalatest.funsuite.AnyFunSuite
+import org.scalatest.BeforeAndAfterAll
+
+class SqlComparisonTest extends AnyFunSuite with BeforeAndAfterAll {
+
+  var spark: SparkSession = _
+
+  override def beforeAll(): Unit = {
+    spark = SparkSession.builder()
+      .appName("UDF vs. SQL Comparison Test")
+      .master("local[4]")
+      .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
+      .config("spark.rapids.skipGpuArchitectureCheck", "true")
+      .config("spark.rapids.sql.mode", "explainOnly")
+      .config("spark.sql.adaptive.enabled", "false")
+      .getOrCreate()
+  }
+
+  override def afterAll(): Unit = {
+    if (spark != null) spark.stop()
+  }
+
+  test("UDF vs SQL expression") {
+    val testDF = UnitTest.createTestData(spark).repartition(1)
+
+    // Run CPU UDF
+    UnitTest.registerUDF(spark, "placeholder_udf_name")
+    val udfResultDF = UnitTest.executeUDF(spark, "placeholder_udf_name", testDF)
+    UnitTest.verifyUDFResults(udfResultDF, testDF)
+
+    // Read and execute SQL expression
+    testDF.createOrReplaceTempView("test_table")
+    val sqlSource = scala.io.Source.fromFile("src/main/resources/placeholder_udf_name.sql")
+    val sqlContent = try sqlSource.mkString finally sqlSource.close()
+    val sqlResultDF = spark.sql(sqlContent)
+    UnitTest.verifyUDFResults(sqlResultDF, testDF)
+
+    // Compare results
+    TestUtils.assertDataFrameEquals(actual = sqlResultDF, expected = udfResultDF)
+
+    // Verify GPU compatibility
+    SparkUtils.assertPlanRunsOnGpu(sqlResultDF)
+  }
+
+  /**
+   * TODO: If UnitTest adds extra tests beyond the main result checks, add
+   * corresponding comparison tests here. Each case should run the same input
+   * through the CPU UDF and the SQL expression, apply equivalent assertions to
+   * both outputs, and compare the SQL output against the CPU output.
+   */
+}
diff --git a/skills/udf-gen-test/templates/scala/src/test/scala/com/udf/TestUtils.scala b/skills/udf-gen-test/templates/scala/src/test/scala/com/udf/TestUtils.scala
new file mode 100644
index 00000000000..a6d87d5e1b1
--- /dev/null
+++ b/skills/udf-gen-test/templates/scala/src/test/scala/com/udf/TestUtils.scala
@@ -0,0 +1,47 @@
+/*
+ * SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: Apache-2.0
+ */
+
+package com.udf
+
+import org.apache.spark.sql.DataFrame
+
+/**
+ * Shared test utilities.
+ */
+object TestUtils {
+
+  /** Compare two DataFrames after sorting rows by string form; reports per-column mismatches. */
+  def assertDataFrameEquals(
+    actual: DataFrame,
+    expected: DataFrame
+  ): Unit = {
+    assert(actual.schema == expected.schema,
+      s"Schema mismatch:\n  actual:   ${actual.schema}\n  expected: ${expected.schema}")
+
+    val actualRows  = actual.collect().sortBy(_.toString)
+    val expectedRows = expected.collect().sortBy(_.toString)
+
+    assert(actualRows.length == expectedRows.length,
+      s"Row count mismatch: actual=${actualRows.length}, expected=${expectedRows.length}")
+
+    val mismatches = scala.collection.mutable.ArrayBuffer.empty[String]
+    for (i <- actualRows.indices) {
+      val aRow = actualRows(i)
+      val eRow = expectedRows(i)
+      for (field <- actual.schema.fieldNames) {
+        val aVal = Option(aRow.getAs[Any](field))
+        val eVal = Option(eRow.getAs[Any](field))
+        if (aVal != eVal) {
+          mismatches += s"  [row $i] $field: actual=$aVal, expected=$eVal"
+        }
+      }
+    }
+
+    if (mismatches.nonEmpty) {
+      throw new AssertionError(
+        s"\nFound ${mismatches.length} column-level mismatches:\n${mismatches.mkString("\n")}\n")
+    }
+  }
+}
diff --git a/skills/udf-gen-test/templates/scala/src/test/scala/com/udf/UnitTest.scala b/skills/udf-gen-test/templates/scala/src/test/scala/com/udf/UnitTest.scala
new file mode 100644
index 00000000000..71298f472c3
--- /dev/null
+++ b/skills/udf-gen-test/templates/scala/src/test/scala/com/udf/UnitTest.scala
@@ -0,0 +1,96 @@
+/*
+ * SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: Apache-2.0
+ */
+
+package com.udf
+
+import org.apache.spark.sql.{DataFrame, Row, SparkSession}
+import org.apache.spark.sql.types._
+import org.scalatest.Assertions
+import org.scalatest.funsuite.AnyFunSuite
+import org.scalatest.BeforeAndAfterAll
+
+object UnitTest extends Assertions {
+  /**
+   * TODO: Create a test DataFrame with diverse test cases including edge cases.
+   *
+   * Example:
+   * {{{
+   *   val schema = StructType(Seq(
+   *     StructField("id", IntegerType, nullable = false),
+   *     StructField("credit_score", IntegerType, nullable = true)
+   *   ))
+   *   val testData = Seq(
+   *     Row(1, 800),
+   *     Row(2, 550),
+   *     Row(3, null)
+   *   )
+   *   spark.createDataFrame(spark.sparkContext.parallelize(testData), schema)
+   * }}}
+   */
+  def createTestData(spark: SparkSession): DataFrame = ???
+
+  /**
+   * TODO: Register the UDF with Spark.
+   *
+   * Example:
+   * {{{
+   *   spark.udf.register(udfName, new CalculateRiskUDF())
+   * }}}
+   */
+  def registerUDF(spark: SparkSession, udfName: String): Unit = ???
+
+  /**
+   * TODO: Execute the UDF on the test DataFrame and return the result.
+   *
+   * Example:
+   * {{{
+   *   testDF.createOrReplaceTempView("test_table")
+   *   spark.sql(s"SELECT *, $udfName(credit_score) AS risk_level FROM test_table")
+   * }}}
+   */
+  def executeUDF(spark: SparkSession, udfName: String, testDF: DataFrame): DataFrame = ???
+
+  /**
+   * TODO: Verify UDF results using assert statements.
+   *
+   * Example:
+   * {{{
+   *   val results = resultDF.collect().sortBy(_.getAs[Int]("id"))
+   *   assert(results(0).getAs[String]("risk_level") === "LOW")
+   *   assert(results(1).getAs[String]("risk_level") === "MEDIUM")
+   *   assert(results(2).getAs[String]("risk_level") === "UNKNOWN")
+   * }}}
+   */
+  def verifyUDFResults(resultDF: DataFrame, testDF: DataFrame): Unit = ???
+}
+
+class UnitTest extends AnyFunSuite with BeforeAndAfterAll {
+
+  var spark: SparkSession = _
+
+  override def beforeAll(): Unit = {
+    spark = SparkSession.builder()
+      .appName("UDF Unit Test")
+      .master("local[4]")
+      .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
+      .config("spark.rapids.skipGpuArchitectureCheck", "true")
+      .config("spark.rapids.sql.mode", "explainOnly")
+      .config("spark.sql.adaptive.enabled", "false")
+      .getOrCreate()
+  }
+
+  override def afterAll(): Unit = {
+    if (spark != null) spark.stop()
+  }
+
+  test("UDF produces correct results") {
+    val testDF = UnitTest.createTestData(spark).repartition(1)
+
+    UnitTest.registerUDF(spark, "placeholder_udf_name")
+    val resultDF = UnitTest.executeUDF(spark, "placeholder_udf_name", testDF)
+
+    UnitTest.verifyUDFResults(resultDF, testDF)
+  }
+}
diff --git a/skills/udf-judge-conversion/SKILL.md b/skills/udf-judge-conversion/SKILL.md
new file mode 100644
index 00000000000..91784aeb9ad
--- /dev/null
+++ b/skills/udf-judge-conversion/SKILL.md
@@ -0,0 +1,92 @@
+---
+name: udf-judge-conversion
+description: Reviews generated UDF tests and GPU/SQL implementations for robustness, anti-cheating, and GPU execution integrity. Use when the user requests a judge/review-agent pass, or when manually reviewing a completed conversion.
+license: CC-BY-4.0 AND Apache-2.0
+metadata:
+  spdx-file-copyright-text: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+model: inherit
+---
+
+# Judge UDF Conversion
+
+## Purpose
+
+Review a completed UDF conversion and its tests as a skeptical QA/code-review subagent.
+Your job is to determine whether the GPU/SQL implementation is a properly validated functional replacement for the CPU UDF.
+
+## Inputs
+
+Review the files that exist in the generated project:
+- CPU UDF source under `src/main/<java|scala>/com/udf/`
+- `src/test/<java|scala>/com/udf/UnitTest.<java|scala>`
+- `src/test/<java|scala>/com/udf/CudfComparisonTest.<java|scala>` or `SqlComparisonTest.<java|scala>`
+- GPU/SQL implementation files
+- coverage reports, test output, or comments documenting accepted discrepancies if present
+
+## Workflow
+
+- [ ] Step 1: Read the unit test and comparison test.
+- [ ] Step 2: Judge whether the tests are strong enough to specify CPU behavior.
+- [ ] Step 3: Judge whether the implementation cheats or silently falls back to CPU logic.
+- [ ] Step 4: Report actionable findings.
+
+## Unit Test Checks
+
+The unit test should be a strong specification of the CPU UDF behavior over its documented input domain.
+
+Check that:
+- Test data covers applicable edge cases such as nulls, empty values, malformed inputs, boundaries, duplicates, mixed valid/invalid rows, nested empties/nulls, unicode, timestamps/timezones, and decimal scale.
+- Assertions verify schema, row count, deterministic ordering, output values, null propagation, and exception/default behavior where applicable.
+- The test exercises visible CPU UDF branches. Coverage reports should support this when available.
+- Assertions reflect the CPU UDF's actual behavior and do not merely assert weak properties such as non-null output.
+- Extra unit tests outside the shared `verifyUDFResults` path are mirrored in the comparison test and run against both CPU and GPU/SQL paths.
+
+## Comparison Test Checks
+
+The comparison test should provide strong evidence that the converted implementation preserves the CPU UDF behavior.
+
+Check that:
+- The CPU path and GPU/SQL path run on the same input data.
+- The CPU result and GPU/SQL result are compared directly.
+- The comparison test actually runs on the GPU with the Spark RAPIDS plugin enabled.
+- The converted path is also validated with the same result assertions used for the CPU path.
+- Additional unit test cases are converted into CPU-vs-GPU/SQL comparison cases, not left as CPU-only tests.
+- Commented-out tests or assertions include a clear explanation and a user-facing note. Documented deviations are acceptable only if the reason is explicit.
+    - Note: you should not accept a documented deviation that removes coverage of the UDF's core logic.
+
+## Implementation Checks
+
+Fail the review if the implementation is tailored to the tests instead of implementing the UDF generally. Look for:
+- Hardcoded test inputs, IDs, row counts, or expected outputs.
+- Conditional branches that only handle exact values from the tests.
+- Literal lookup tables derived from test data.
+
+Fail the review if the implementation silently performs logic row-by-row on the CPU. Look for:
+- `copyToHost()`, `cudaMemcpyDeviceToHost`, or row-by-row scalar reads such as `getJavaString` that move input data to the CPU.
+    - Note: small CPU objects for metadata or temporary storage are acceptable.
+
+If a GPU API's behavior is unclear, inspect the implementation or docs for the SQL/cuDF/libcudf/Thrust APIs invoked by the UDF. Clone the matching source if needed to understand subtle null, type, boundary, or semantic behavior under the hood.
+
+## Output
+
+Start with a clear verdict:
+- `PASS`: no blocking issues found (use `PASS with non-blocking risks` if documented, non-blocking gaps remain)
+- `FAIL`: one or more blocking issues found
+
+### Verdict Examples
+
+`PASS`: The unit test covers normal inputs plus meaningful edge cases, coverage gaps are explained, the comparison test runs the same cases through CPU and GPU/SQL paths, the implementation is general, and there are no hidden CPU fallbacks or test-derived literals.
+
+`PASS with non-blocking risks`: One malformed-input assertion is commented out because the CPU throws a row-level exception while the GPU path returns null for that row, and comments explain the attempted fixes and why the behavior is outside the supported GPU contract. The normal input domain and core UDF logic are still fully tested.
+
+`FAIL`: A test for the primary transformation is commented out, most assertions only check row counts or non-null output, or the comparison test leaves extra CPU-only unit tests unmatched. These failures weaken confidence even if comments are present.
+
+`FAIL`: The implementation contains test-specific literals, dispatches on exact test rows, calls the CPU UDF from the GPU/SQL path, or copies column data to the host to perform normal business logic.
+
+For failures, concisely list specific findings with:
+- file/path
+- issue
+- why it matters
+- suggested fix
+
+Also include any non-blocking risks or test gaps separately.
diff --git a/skills/udf-optimize-cudf/SKILL.md b/skills/udf-optimize-cudf/SKILL.md
new file mode 100644
index 00000000000..63c27b2ed3c
--- /dev/null
+++ b/skills/udf-optimize-cudf/SKILL.md
@@ -0,0 +1,149 @@
+---
+name: udf-optimize-cudf
+description: Iteratively optimizes a cuDF RapidsUDF implementation for GPU performance. Use after testing and benchmarking with udf-benchmark. Runs a loop of profiling, optimizing, testing, and benchmarking until performance converges or the iteration budget is exhausted.
+license: CC-BY-4.0 AND Apache-2.0
+metadata:
+  spdx-file-copyright-text: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+model: inherit
+---
+
+# Optimize cuDF RapidsUDF
+
+## Workflow
+
+- [ ] Step 0: Create backup and establish baseline
+- [ ] Steps 1-4: Iterative optimization loop (repeat up to N iterations)
+  - [ ] Step 1: Profile with nsys
+  - [ ] Step 2: Implement one targeted change
+  - [ ] Step 3: Run unit tests (fail &rarr; discard, retry)
+  - [ ] Step 4: Run microbenchmarks (no improvement &rarr; discard, retry)
+- [ ] Final Step 1: Run judge subagent if requested
+- [ ] Final Step 2: Review optimized implementation and report results
+
+## Prerequisites
+
+- Project directory with passing unit tests and cuDF comparison test
+- MicroBenchRunner implemented and working (from the **udf-benchmark** skill)
+- Benchmark data generated (reuse from the benchmark step)
+
+Derive `<CamelName>` and `<snake_name>` from the UDF class name.
+
+> **Note:** Commands require access to `/tmp` (Spark temp storage) and `/dev` (GPU device). If commands fail due to sandbox restrictions, re-run them unsandboxed.
+
+## Step 0: Create Backup and Establish Baseline
+
+1. Create a backup of the current RapidsUDF implementation:
+```bash
+cp src/main/<java|scala>/com/udf/<CamelName>RapidsUDF.<java|scala> \
+   src/main/<java|scala>/com/udf/<CamelName>RapidsUDF.<java|scala>.bak
+```
+
+2. If no `.orig.bak` exists yet, save the original unoptimized implementation:
+```bash
+cp src/main/<java|scala>/com/udf/<CamelName>RapidsUDF.<java|scala> \
+   src/main/<java|scala>/com/udf/<CamelName>RapidsUDF.<java|scala>.orig.bak
+```
+This file is never overwritten; it preserves the pre-optimization baseline.
+
+3. If no baseline microbenchmark results exist, run the baseline now:
+```bash
+./run_micro_benchmark.sh --mode all --data-path data/bench_data_<rows>_rows.parquet --rows <rows>
+```
+
+Record the baseline GPU time and speedup. This is the number to beat.
+
+## Iterative Optimization Loop
+
+Repeat the following steps up to **N iterations** (default: 10). Also stop early if no improvement is found after **3 consecutive failed attempts**.
+
+Maintain an **optimization log** throughout the loop: for each iteration, record what change was attempted and whether it improved, regressed, or had no effect. This prevents repeating failed approaches and feeds the final report.
+
+### Step 1: Profile with nsys
+
+Profile the current implementation to identify bottlenecks:
+```bash
+./run_micro_benchmark.sh --mode gpu --data-path data/bench_data_<rows>_rows.parquet --rows <rows> --profile
+```
+
+Summarize the NVTX range statistics, which include the ranges emitted by libcudf:
+```bash
+nsys stats --report nvtx_sum --format csv -o rapidsudf results/<report>.nsys-rep
+```
+
+Consult **references/OPTIMIZATION_PATTERNS.md** for interpreting profiler output and identifying optimization opportunities.
+
+> **Tip:** Profiling frequently is strongly recommended. Without profiler data, optimization changes are guesses. Try using other `nsys stats` commands as needed.
+
+### Step 2: Implement One Targeted Change
+
+Based on profiling insights (or optimization patterns from the reference), make **one targeted change** to the RapidsUDF implementation. Making one change at a time lets you attribute any performance difference to that specific change.
+
+### Step 3: Run Unit Tests
+
+```bash
+# Java
+mvn test -Dtest=CudfComparisonTest
+
+# Scala
+mvn test -Dsuites=com.udf.CudfComparisonTest
+```
+
+- **Tests pass** &rarr; proceed to Step 4
+- **Tests fail** &rarr; analyze the failure. If it is an ordinary implementation bug, fix it and rerun the test. If the targeted optimization introduces a CPU/GPU semantic mismatch that cannot be resolved, discard changes by restoring from backup:
+  ```bash
+  cp src/main/<java|scala>/com/udf/<CamelName>RapidsUDF.<java|scala>.bak \
+     src/main/<java|scala>/com/udf/<CamelName>RapidsUDF.<java|scala>
+  ```
+  Log the failure reason, then return to Step 1.
+
+Sometimes, matching a certain edge case is impossible without a major performance tradeoff. If so, document the attempted fix, the benchmark evidence, and the exact behavior difference, then ask the user whether the performance-vs-correctness tradeoff is acceptable.
+Do not comment out tests or accept a correctness difference during optimization unless the user explicitly approves that tradeoff.
+
+### Step 4: Run Microbenchmarks
+
+```bash
+./run_micro_benchmark.sh --mode all --data-path data/bench_data_<rows>_rows.parquet --rows <rows>
+```
+
+Compare the GPU time against the current best (from the last checkpoint).
+
+- **Performance improved** &rarr; create a new checkpoint:
+  ```bash
+  cp src/main/<java|scala>/com/udf/<CamelName>RapidsUDF.<java|scala> \
+     src/main/<java|scala>/com/udf/<CamelName>RapidsUDF.<java|scala>.bak
+  ```
+  Record the new best GPU time. Reset the consecutive-failure counter. Return to Step 1.
+
+- **Performance did NOT improve** &rarr; discard changes:
+  ```bash
+  cp src/main/<java|scala>/com/udf/<CamelName>RapidsUDF.<java|scala>.bak \
+     src/main/<java|scala>/com/udf/<CamelName>RapidsUDF.<java|scala>
+  ```
+  Increment the consecutive-failure counter. Return to Step 1.
+
+## Final Step 1: Run Judge Subagent If Requested
+
+If the user explicitly asked for the judge, a judge subagent, or a review agent, treat that as an explicit request for delegation: you **MUST** launch a separate subagent with `model: inherit` and instruct it to use the **udf-judge-conversion** skill. Ask it to review the `UnitTest`, `CudfComparisonTest`, optimized RapidsUDF implementation, and optimization log as a cuDF conversion.
+
+If the user did not request a judge/review agent, mark this step as skipped and continue to Final Step 2. If a required judge subagent is blocked by tool policy, stop and tell the user that explicit permission/instruction is needed.
+
+If you run the judge, include the judge verdict in the final report. If there are any blocking issues, fix them or report the last known-good checkpoint.
+
+## Final Step 2: Review Optimized Implementation and Report Results
+
+After completing all iterations (or early-stopping), review your own work to ensure the optimization did not weaken correctness, introduce hardcoded test behavior, hide CPU fallback logic, or comment out core test coverage.
+
+Once the review is done, report:
+1. **Baseline**: starting GPU time and speedup
+2. **Final**: best GPU time and speedup (from the last checkpoint)
+3. **Successful optimizations**: what changes improved performance and by how much
+4. **Failed optimizations**: what was attempted but did not help
+5. **Review result**: self-review summary, or judge PASS/failures if the judge was requested
+
+## Output
+
+Upon successful completion:
+- Optimized RapidsUDF: `src/main/<java|scala>/com/udf/<CamelName>RapidsUDF.<java|scala>`
+- Backup of best version: `src/main/<java|scala>/com/udf/<CamelName>RapidsUDF.<java|scala>.bak`
+- Original unoptimized version: `src/main/<java|scala>/com/udf/<CamelName>RapidsUDF.<java|scala>.orig.bak`
+- Benchmark results: `results/`
diff --git a/skills/udf-optimize-cudf/references/OPTIMIZATION_PATTERNS.md b/skills/udf-optimize-cudf/references/OPTIMIZATION_PATTERNS.md
new file mode 100644
index 00000000000..2e51fac8012
--- /dev/null
+++ b/skills/udf-optimize-cudf/references/OPTIMIZATION_PATTERNS.md
@@ -0,0 +1,22 @@
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: CC-BY-4.0
+-->
+
+# cuDF Optimization Patterns
+
+## Guidelines
+
+- **Rule of thumb:** fewer cuDF API calls typically result in better performance. Look for ways to collapse multiple operations into fewer calls.
+- Explore the cuDF repo `java/src/<main|test>/java/ai/rapids/cudf` to find alternative cuDF methods.
+
+## Profiling Signals
+
+| Signal | Where to look | What it means | Action |
+|---|---|---|---|
+| High invocation count | `nvtx_sum` Instances column | Loops or too many small kernels | Batch into fewer calls |
+| Low GPU utilization | `kernel_time / wall_time` | Launch/memory overhead dominates | Reduce total API calls |
+| Many `make_*_column` calls | `nvtx_sum` | Excessive intermediate columns | Shorten transformation chains |
+| Expensive kernel | `nvtx_sum` | A single API call dominates runtime and a cheaper equivalent may exist (e.g., regex &rarr; stringReplace, stringSplitRecord &rarr; stringSplit) | Swap to cheaper cuDF API |
+| GPU slower than CPU at large scale | Speedup results | Algorithm has serial dependencies that don't parallelize well | Rethink overall algorithm to maximize columnar parallelism and reduce divergence |
+| Many gather/scatter or struct unpacking ops | `nvtx_sum` | Non-contiguous memory access patterns | Use APIs that leverage contiguous access (e.g., operate on cuDF child columns directly) |