cumulative updates since 25.12 to align with cuML 26.06 #1019

Merged
eordentlich merged 13 commits into NVIDIA:main from eordentlich:eo-26.06-updates on Jun 25, 2026

@eordentlich

Collaborator

In addition, this PR:

  • drops support for Spark 3.3, because cuML's minimum Python version is incompatible with it
  • drops DB (Databricks) 13.3 and 14.3 support in the benchmark and notebook examples

@eordentlich

Collaborator Author

build

@greptile-apps

greptile-apps Bot commented Jun 3, 2026

Contributor

Greptile Summary

This PR aligns spark-rapids-ml with cuML 26.06, bumps the package version to 26.6.0, drops Spark 3.3 / Python 3.10 support, and updates Databricks benchmark tooling to target runtimes 15.4–17.3.

  • cuML API updates: Handle() is now explicitly passed to LogisticRegressionMG, LinearRegressionMG, and PCAMG; treelite deserialization is removed from the RF inference path (model bytes are assigned directly); radius is added to KNN defaults; device_ids and force_serial_epochs are added to UMAP defaults.
  • Spark 3.3 cleanup: All version-guarded compatibility shims for PySpark < 3.4 are removed from source, tests, and scripts; PySpark requirement widened to >=3.4.1,<4.0.
  • Bug fixes: UMAP model loading now uses pdf["data"] (column access) instead of pdf.data (attribute access); _load_sparse_data in tests now handles all-zero rows during normalization to avoid division by zero.

Confidence Score: 4/5

The PR is safe to merge; the cuML API changes are straightforward adapter updates and the Spark 3.3 cleanup is well-scoped.

Changes are primarily version-bump bookkeeping plus targeted API adapter fixes for cuML 26.06. The max_depth deprecated placeholder in _get_cuml_params_default is unconventional and could be confusing, but is functionally safe because _initialize_cuml_params always overwrites it with the Spark default before any cuML call. The undocumented n_components=1 to 2 change in test_pipeline.py mildly reduces test coverage of the 1-component UMAP case without explanation.

python/src/spark_rapids_ml/tree.py (the deprecated placeholder and the treelite removal) and python/tests/test_pipeline.py (silent n_components change) deserve a second look.

Important Files Changed

| Filename | Overview |
| --- | --- |
| python/src/spark_rapids_ml/tree.py | Removes treelite deserialization (assigns model bytes directly to treelite_model_bytes), sets n_features_in explicitly on RF models, and marks max_depth default as deprecated in _get_cuml_params_default. |
| python/src/spark_rapids_ml/classification.py | Adds explicit Handle() construction for LogisticRegressionMG to match new cuML 26.06 API requirement. |
| python/src/spark_rapids_ml/regression.py | Removes deprecated normalize/standardization mapping for LinearRegression and adds explicit Handle() to LinearRegressionMG, aligning with cuML 26.06. |
| python/src/spark_rapids_ml/umap.py | Adds device_ids and force_serial_epochs to default params; fixes pdf["data"] column access (was pdf.data); imports cast from typing. |
| python/src/spark_rapids_ml/knn.py | Adds radius: 1.0 default parameter to both NearestNeighborsClass and ApproximateNearestNeighborsClass to match new cuML 26.06 API. |
| python/pyproject.toml | Version bumped to 26.6.0; minimum Python raised from 3.10 to 3.11; classifiers updated to 3.11/3.12; PySpark range widened to >=3.4.1,<4.0. |
| python/benchmark/databricks/run_benchmark.sh | Adds dynamic SCALA_VERSION selection (2.12 vs 2.13) based on Databricks runtime version; updates SPARK_RAPIDS_VERSION to 26.06.0; drops support for DB 13.3 and 14.3. |
| python/tests/test_pipeline.py | Changes UMAP test parameter from n_components=1 to n_components=2 without explanation. |
| python/tests/test_logistic_regression.py | Removes all PySpark < 3.4 version-guard early returns; tests now run the full sparse path unconditionally. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[RandomForest model bytes from training worker] --> B{cuML 25.12}
    A --> C{cuML 26.06}
    B --> D[treelite.Model.deserialize_bytes]
    D --> E[rf._treelite_model_bytes = treelite_obj]
    C --> F[rf._treelite_model_bytes = model bytes directly]
    E --> G[rf.predict]
    F --> G
    G --> H[Transform output]

Reviews (1): Last reviewed commit: "make copyrights more consistent, make ex..."

Comment thread python/benchmark/databricks/init-pip-cuda-12.sh Outdated
Comment on lines +40 to +42
if [[ $db_version > 16.4 ]]; then
SCALA_VERSION=2.13
fi

Contributor

P2 Lexicographic version comparison is fragile for DB runtime versions

[[ $db_version > 16.4 ]] uses bash string ordering, not numeric ordering. For the currently listed versions (15.4, 16.4, 17.3) this happens to be correct, but a future runtime like "16.10" would compare as less than "16.4" lexicographically (since "1" < "4"), incorrectly keeping SCALA_VERSION=2.12. The same pattern applies in init-pip-cuda-12.sh at the [[ $DATABRICKS_RUNTIME_VERSION < "17.3" ]] guards. Consider splitting on . and doing integer comparisons, or using sort -V-based logic.
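
The integer-comparison alternative suggested above could look roughly like the sketch below. It is only an illustration: `version_gt` is a hypothetical helper name that does not appear in the PR, the runtime value `16.10` is made up to show the failure case, and the helper assumes both versions have the form MAJOR.MINOR.

```shell
# Hypothetical helper (not part of the PR): succeeds when version $1 is
# strictly greater than version $2, comparing MAJOR and MINOR as integers.
# Assumes both arguments have the form MAJOR.MINOR (e.g. 16.4).
version_gt() {
    local major_a major_b minor_a minor_b
    major_a=${1%%.*}
    major_b=${2%%.*}
    # Take the text after the first dot, then cut it at any further dot.
    minor_a=${1#*.}; minor_a=${minor_a%%.*}
    minor_b=${2#*.}; minor_b=${minor_b%%.*}
    if [ "$major_a" -gt "$major_b" ]; then return 0; fi
    if [ "$major_a" -lt "$major_b" ]; then return 1; fi
    [ "$minor_a" -gt "$minor_b" ]
}

# Usage mirroring the guard under review, with a made-up future runtime.
db_version=16.10
SCALA_VERSION=2.12
if version_gt "$db_version" 16.4; then
    SCALA_VERSION=2.13
fi
echo "SCALA_VERSION=$SCALA_VERSION"
```

With this helper, `16.10` is treated as newer than `16.4`, so the guard selects 2.13, whereas the string comparison in `[[ ... > ... ]]` would leave 2.12 in place. A `sort -V`-based check is another option where GNU coreutils is available.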

Comment thread ci/Dockerfile
# install cuML
ARG RAPIDS_VERSION=25.12
RUN conda install -y -c rapidsai -c conda-forge -c nvidia cuml=$RAPIDS_VERSION cuvs=$RAPIDS_VERSION python=3.10 pylibraft=$RAPIDS_VERSION raft-dask=$RAPIDS_VERSION cuda-version=12.2 numpy~=1.0 \
ARG RAPIDS_VERSION=26.06

Contributor

P2 CI now uses rapidsai-nightly channel

The channel was changed from rapidsai (stable) to rapidsai-nightly for 26.06. If this is temporary until the stable release is published, it should be tracked — nightly packages can change daily and may introduce unintended breakage in CI. Consider adding a comment noting this should be switched back to rapidsai once 26.06 is released to the stable channel.

@eordentlich

Collaborator Author

build

1 similar comment
@eordentlich

Collaborator Author

build

Comment thread .claude/skills/update-rapids-version/SKILL.md
Comment thread ci/test.sh
rishic3 previously approved these changes Jun 9, 2026
Comment thread .claude/skills/update-rapids-version/SKILL.md
Signed-off-by: Erik Ordentlich <eordentlich@gmail.com>
Signed-off-by: Erik Ordentlich <eordentlich@gmail.com>
Signed-off-by: Erik Ordentlich <eordentlich@gmail.com>
Signed-off-by: Erik Ordentlich <eordentlich@gmail.com>
…mpatible with pyspark 3.3

Signed-off-by: Erik Ordentlich <eordentlich@gmail.com>
Signed-off-by: Erik Ordentlich <eordentlich@gmail.com>
Signed-off-by: Erik Ordentlich <eordentlich@gmail.com>
Signed-off-by: Erik Ordentlich <eordentlich@gmail.com>
Signed-off-by: Erik Ordentlich <eordentlich@gmail.com>
…ataproc

Signed-off-by: Erik Ordentlich <eordentlich@gmail.com>
@eordentlich

Collaborator Author

build

Comment thread python/benchmark/databricks/gpu_etl_cluster_spec.sh Outdated
…iles to use python > 3.10, fix databricks 17.3 with plugin, update plugin to 26.06

Signed-off-by: Erik Ordentlich <eordentlich@gmail.com>
Signed-off-by: Erik Ordentlich <eordentlich@gmail.com>
@eordentlich

Collaborator Author

build

1 similar comment
@eordentlich

Collaborator Author

build

@rishic3 rishic3 left a comment

Collaborator

LGTM, very minor comment

Comment thread notebooks/aws-emr/init-bootstrap-action.sh Outdated
…on 3.11 for compatibility with spark < 4

Signed-off-by: Erik Ordentlich <eordentlich@gmail.com>
@eordentlich

Collaborator Author

build

1 similar comment
@eordentlich

Collaborator Author

build

@eordentlich eordentlich merged commit b6cb77a into NVIDIA:main Jun 25, 2026
4 checks passed
@eordentlich eordentlich deleted the eo-26.06-updates branch June 25, 2026 18:34