Describe the bug
Build: examples_PCA_build_nightly/1550
The PCA example notebook (pca.ipynb) fails at gpu_pca_loaded.fit(data_df) during the jupyter nbconvert --execute step. The error is a Py4JException reporting that no JVM mapInPandas method accepting a Boolean (barrier) argument exists. The installed Python pyspark is 3.5.8 (resolved from the spark-rapids-ml requirements.txt constraint 'pyspark>=3.4.1,<4.0'), while the Spark runtime distribution running the standalone cluster is 3.4.3. PySpark 3.5.x mapInPandas passes an extra 'barrier' Boolean argument that the Spark 3.4.3 Dataset.mapInPandas does not accept. This is a deterministic client/server version mismatch.
Error logs:
Collecting pyspark<4.0,>=3.4.1 (from requirements.txt line 17)
Downloading pyspark-3.5.8.tar.gz (317.8 MB)
...
wget https://.../org/apache/spark/3.4.3/spark-3.4.3-bin-hadoop3.tgz
...
nbclient.exceptions.CellExecutionError: An error occurred while executing the following cell:
gpu_pca_model = gpu_pca_loaded.fit(data_df)
...
File ".../spark_rapids_ml/core.py", line 1000, in _call_cuml_fit_func
dataset.mapInPandas(_train_udf, schema=self._out_schema())
File ".../pyspark/sql/pandas/map_ops.py", line 112, in mapInPandas
jdf = self._jdf.mapInPandas(udf_column._jc.expr(), barrier)
Py4JError: An error occurred while calling o665.mapInPandas. Trace:
py4j.Py4JException: Method mapInPandas([class org.apache.spark.sql.catalyst.expressions.PythonUDF, class java.lang.Boolean]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:321)
at py4j.Gateway.invoke(Gateway.java:274)
Environment details
- spark-rapids-ml 26.6.0; requirements.txt pyspark>=3.4.1,<4.0 -> installs pyspark 3.5.8
- Spark runtime distribution 3.4.3 (bin-hadoop3) standalone cluster
- conda env pca-nightly, python 3.11
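Possible guard (a sketch, not part of the current CI scripts): the mismatch described above could be caught before the notebook runs by comparing the major.minor of the Python pyspark package with that of the Spark runtime. The helper names major_minor and check_pyspark_matches_runtime are hypothetical, and the check assumes that matching major.minor is sufficient for the Py4J calls to line up.

```python
import re


def major_minor(version: str) -> tuple[int, int]:
    """Extract (major, minor) from a version string such as '3.5.8'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def check_pyspark_matches_runtime(client_version: str, runtime_version: str) -> None:
    """Raise if the Python pyspark package and the Spark runtime differ in major.minor.

    Assumption: patch-level differences (e.g. 3.4.1 vs 3.4.3) are treated as compatible.
    """
    if major_minor(client_version) != major_minor(runtime_version):
        raise RuntimeError(
            f"pyspark {client_version} does not match Spark runtime {runtime_version}; "
            "install a pyspark release with the same major.minor as the runtime"
        )


# Hypothetical usage in the notebook, once a SparkSession named `spark` exists:
#   import pyspark
#   check_pyspark_matches_runtime(pyspark.__version__, spark.version)
```

With the versions from this build, check_pyspark_matches_runtime("3.5.8", "3.4.3") raises immediately with a readable message instead of failing later inside mapInPandas. Pinning pyspark to the runtime's release line in the CI environment (for example pip install 'pyspark==3.4.3') would be one way to satisfy the check.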