2 changes: 1 addition & 1 deletion .github/workflows/docker-publish.yml
@@ -74,6 +74,6 @@ jobs:
with:
platforms: linux/amd64,linux/arm64
push: true
-tags: ghcr.io/apache/datafusion-comet:spark-3.5-scala-2.12-${{ env.COMET_VERSION }}
+tags: ghcr.io/apache/datafusion-comet:spark-4.1-scala-2.13-${{ env.COMET_VERSION }}
file: kube/Dockerfile
no-cache: true
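
As a quick sanity check of the published image (using a hypothetical `COMET_VERSION` of `0.14.0`; substitute whatever the workflow actually published), the multi-arch tag can be pulled and its manifest inspected:

```shell
# Hypothetical version; substitute the release the workflow actually published
export COMET_VERSION=0.14.0
docker pull ghcr.io/apache/datafusion-comet:spark-4.1-scala-2.13-$COMET_VERSION
# Both linux/amd64 and linux/arm64 should appear in the manifest
docker manifest inspect ghcr.io/apache/datafusion-comet:spark-4.1-scala-2.13-$COMET_VERSION | grep architecture
```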
4 changes: 2 additions & 2 deletions .github/workflows/pr_build_linux.yml
@@ -444,7 +444,7 @@ jobs:
uses: ./.github/actions/setup-builder
with:
rust-version: ${{ env.RUST_VERSION }}
-jdk-version: 11
+jdk-version: 17

- name: Download native library
uses: actions/download-artifact@v8
@@ -502,7 +502,7 @@ jobs:
uses: ./.github/actions/setup-builder
with:
rust-version: ${{ env.RUST_VERSION }}
-jdk-version: 11
+jdk-version: 17

- name: Download native library
uses: actions/download-artifact@v8
2 changes: 1 addition & 1 deletion docs/source/contributor-guide/benchmarking_aws_ec2.md
@@ -104,7 +104,7 @@ make release
Set `COMET_JAR` environment variable.

```shell
-export COMET_JAR=/home/ec2-user/datafusion-comet/spark/target/comet-spark-spark3.5_2.12-$COMET_VERSION.jar
+export COMET_JAR=/home/ec2-user/datafusion-comet/spark/target/comet-spark-spark4.1_2.13-$COMET_VERSION.jar
```
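
A quick, purely illustrative check that the variable points at a real artifact with the expected `spark4.1_2.13` suffix:

```shell
# Illustrative sanity check: the jar should exist at the exported path
ls -lh "$COMET_JAR"
```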

## Run Benchmarks
12 changes: 6 additions & 6 deletions docs/source/contributor-guide/benchmarking_macos.md
@@ -55,13 +55,13 @@ export DF_BENCH=`pwd`

## Install Spark

-Install Apache Spark. This example refers to 3.5.4 version.
+Install Apache Spark. This example refers to 4.1.1 version.

```shell
-wget https://archive.apache.org/dist/spark/spark-3.5.4/spark-3.5.4-bin-hadoop3.tgz
-tar xzf spark-3.5.4-bin-hadoop3.tgz
-sudo mv spark-3.5.4-bin-hadoop3 /opt
-export SPARK_HOME=/opt/spark-3.5.4-bin-hadoop3/
+wget https://archive.apache.org/dist/spark/spark-4.1.1/spark-4.1.1-bin-hadoop3.tgz
+tar xzf spark-4.1.1-bin-hadoop3.tgz
+sudo mv spark-4.1.1-bin-hadoop3 /opt
+export SPARK_HOME=/opt/spark-4.1.1-bin-hadoop3/
```
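
Before starting the cluster, it is worth confirming that the expected version is picked up. A minimal check, assuming the paths above:

```shell
# Should report Spark 4.1.1 (illustrative check)
$SPARK_HOME/bin/spark-submit --version
```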

Start Spark in standalone mode:
@@ -129,7 +129,7 @@ make release COMET_FEATURES=mimalloc
Set `COMET_JAR` to point to the location of the Comet jar file. Example for Comet 0.8

```shell
-export COMET_JAR=`pwd`/spark/target/comet-spark-spark3.5_2.12-0.8.0-SNAPSHOT.jar
+export COMET_JAR=`pwd`/spark/target/comet-spark-spark4.1_2.13-0.8.0-SNAPSHOT.jar
```

Run the following command (the `--data` parameter will need to be updated to point to your S3 bucket):
6 changes: 3 additions & 3 deletions docs/source/contributor-guide/benchmarking_spark_sql_perf.md
@@ -34,8 +34,8 @@ partitioning and writing to Parquet format automatically.

## Prerequisites

-- Java 17 (for Spark 3.5+)
-- Apache Spark 3.5.x
+- Java 17
+- Apache Spark 4.1.x
- SBT (Scala Build Tool)
- C compiler toolchain (`gcc`, `make`, `flex`, `bison`, `byacc`)
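
A minimal sketch for verifying these prerequisites up front, assuming the tools are on `PATH`:

```shell
# Illustrative prerequisite checks; exact output varies by system
java -version                      # expect a 17.x JDK
sbt --version
gcc --version && make --version
flex --version && bison --version
```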

@@ -225,7 +225,7 @@ Build Comet from source and launch `spark-shell` with both the Comet and spark-s

```shell
make release
-export COMET_JAR=$(pwd)/spark/target/comet-spark-spark3.5_2.12-*.jar
+export COMET_JAR=$(pwd)/spark/target/comet-spark-spark4.1_2.13-*.jar

$SPARK_HOME/bin/spark-shell \
--master $SPARK_MASTER \
2 changes: 1 addition & 1 deletion docs/source/contributor-guide/debugging.md
@@ -136,7 +136,7 @@ make release COMET_FEATURES=backtrace
Set `RUST_BACKTRACE=1` for the Spark worker/executor process, or for `spark-submit` if running in local mode.

```console
-RUST_BACKTRACE=1 $SPARK_HOME/spark-shell --jars spark/target/comet-spark-spark3.5_2.12-$COMET_VERSION.jar --conf spark.plugins=org.apache.spark.CometPlugin --conf spark.comet.enabled=true --conf spark.comet.exec.enabled=true
+RUST_BACKTRACE=1 $SPARK_HOME/spark-shell --jars spark/target/comet-spark-spark4.1_2.13-$COMET_VERSION.jar --conf spark.plugins=org.apache.spark.CometPlugin --conf spark.comet.enabled=true --conf spark.comet.exec.enabled=true
```
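
If the failure occurs on executors rather than in the driver, the variable can also be propagated through Spark's `spark.executorEnv.*` mechanism. A sketch, with the caveat that behavior depends on the cluster manager:

```shell
# Sketch: propagate RUST_BACKTRACE to executor processes (deployment-dependent)
RUST_BACKTRACE=1 $SPARK_HOME/bin/spark-shell \
  --jars spark/target/comet-spark-spark4.1_2.13-$COMET_VERSION.jar \
  --conf spark.executorEnv.RUST_BACKTRACE=1 \
  --conf spark.plugins=org.apache.spark.CometPlugin \
  --conf spark.comet.enabled=true \
  --conf spark.comet.exec.enabled=true
```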

Get the expanded exception details
2 changes: 1 addition & 1 deletion docs/source/contributor-guide/iceberg-spark-tests.md
@@ -40,7 +40,7 @@ Here is an overview of the changes that the diffs make to Iceberg:
Run `make release` in Comet to install the Comet JAR into the local Maven repository, specifying the Spark version.

```shell
PROFILES="-Pspark-3.5" make release
PROFILES="-Pspark-4.1" make release
```
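
To confirm the install step worked, one illustrative check (assuming the default local repository at `~/.m2`) is to look for the versioned artifact directory:

```shell
# The jar should appear in the local Maven repository after `make release`
ls ~/.m2/repository/org/apache/datafusion/comet-spark-spark4.1_2.13/
```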

## 2. Clone Iceberg and Apply Diff
8 changes: 4 additions & 4 deletions docs/source/user-guide/latest/datasources.md
Expand Up @@ -69,12 +69,12 @@ Unlike to native Comet reader the Datafusion reader fully supports nested types
To build Comet with native DataFusion reader and remote HDFS support it is required to have a JDK installed

Example:
-Build a Comet for `spark-3.5` provide a JDK path in `JAVA_HOME`
+Build a Comet for `spark-4.1` provide a JDK path in `JAVA_HOME`
Provide the JRE linker path in `RUSTFLAGS`, the path can vary depending on the system. Typically JRE linker is a part of installed JDK

```shell
export JAVA_HOME="/opt/homebrew/opt/openjdk@11"
make release PROFILES="-Pspark-3.5" COMET_FEATURES=hdfs RUSTFLAGS="-L $JAVA_HOME/libexec/openjdk.jdk/Contents/Home/lib/server"
export JAVA_HOME="/opt/homebrew/opt/openjdk@17"
make release PROFILES="-Pspark-4.1" COMET_FEATURES=hdfs RUSTFLAGS="-L $JAVA_HOME/libexec/openjdk.jdk/Contents/Home/lib/server"
```
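
Once built, the reader can be pointed at a remote HDFS. A minimal sketch, assuming a namenode reachable at `hdfs://namenode:9000` (a hypothetical host) and treating the `native_datafusion` scan implementation value as the relevant setting:

```shell
# Minimal sketch; the namenode host and scan setting are assumptions
export COMET_JAR=$(pwd)/spark/target/comet-spark-spark4.1_2.13-*.jar
$SPARK_HOME/bin/spark-shell \
  --jars $COMET_JAR \
  --conf spark.plugins=org.apache.spark.CometPlugin \
  --conf spark.comet.enabled=true \
  --conf spark.comet.scan.impl=native_datafusion \
  --conf spark.hadoop.fs.defaultFS=hdfs://namenode:9000
```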

Start Comet with experimental reader and HDFS support as [described](installation.md/#run-spark-shell-with-comet-enabled)
@@ -149,7 +149,7 @@ docker compose -f kube/local/hdfs-docker-compose.yml up
- Build a project with HDFS support

```shell
JAVA_HOME="/opt/homebrew/opt/openjdk@11" make release PROFILES="-Pspark-3.5" COMET_FEATURES=hdfs RUSTFLAGS="-L /opt/homebrew/opt/openjdk@11/libexec/openjdk.jdk/Contents/Home/lib/server"
JAVA_HOME="/opt/homebrew/opt/openjdk@17" make release PROFILES="-Pspark-4.1" COMET_FEATURES=hdfs RUSTFLAGS="-L /opt/homebrew/opt/openjdk@17/libexec/openjdk.jdk/Contents/Home/lib/server"
```

- Run local test
4 changes: 2 additions & 2 deletions docs/source/user-guide/latest/iceberg.md
@@ -31,7 +31,7 @@ reader is enabled by default. To disable it, set `spark.comet.scan.icebergNative

```shell
$SPARK_HOME/bin/spark-shell \
---packages org.apache.datafusion:comet-spark-spark3.5_2.12:0.14.0,org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.8.1,org.apache.iceberg:iceberg-core:1.8.1 \
+--packages org.apache.datafusion:comet-spark-spark4.1_2.13:0.14.0,org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.8.1,org.apache.iceberg:iceberg-core:1.8.1 \
--repositories https://repo1.maven.org/maven2/ \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
--conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkCatalog \
@@ -106,7 +106,7 @@ configure Spark to use a REST catalog with Comet's native Iceberg scan:

```shell
$SPARK_HOME/bin/spark-shell \
---packages org.apache.datafusion:comet-spark-spark3.5_2.12:0.14.0,org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.8.1,org.apache.iceberg:iceberg-core:1.8.1 \
+--packages org.apache.datafusion:comet-spark-spark4.1_2.13:0.14.0,org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.8.1,org.apache.iceberg:iceberg-core:1.8.1 \
--repositories https://repo1.maven.org/maven2/ \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
--conf spark.sql.catalog.rest_cat=org.apache.iceberg.spark.SparkCatalog \
6 changes: 3 additions & 3 deletions docs/source/user-guide/latest/installation.md
@@ -85,7 +85,7 @@ Here are the direct links for downloading the Comet $COMET_VERSION jar file.
- [Comet plugin for Spark 3.5 / Scala 2.12](https://repo1.maven.org/maven2/org/apache/datafusion/comet-spark-spark3.5_2.12/$COMET_VERSION/comet-spark-spark3.5_2.12-$COMET_VERSION.jar)
- [Comet plugin for Spark 3.5 / Scala 2.13](https://repo1.maven.org/maven2/org/apache/datafusion/comet-spark-spark3.5_2.13/$COMET_VERSION/comet-spark-spark3.5_2.13-$COMET_VERSION.jar)
- [Comet plugin for Spark 4.0 / Scala 2.13](https://repo1.maven.org/maven2/org/apache/datafusion/comet-spark-spark4.0_2.13/$COMET_VERSION/comet-spark-spark4.0_2.13-$COMET_VERSION.jar)
-- [Comet plugin for Spark 4.1 / Scala 2.13 (Experimental)](https://repo1.maven.org/maven2/org/apache/datafusion/comet-spark-spark4.1_2.13/$COMET_VERSION/comet-spark-spark4.1_2.13-$COMET_VERSION.jar)
+- [Comet plugin for Spark 4.1 / Scala 2.13](https://repo1.maven.org/maven2/org/apache/datafusion/comet-spark-spark4.1_2.13/$COMET_VERSION/comet-spark-spark4.1_2.13-$COMET_VERSION.jar)
- [Comet plugin for Spark 4.2 / Scala 2.13 (Experimental)](https://repo1.maven.org/maven2/org/apache/datafusion/comet-spark-spark4.2_2.13/$COMET_VERSION/comet-spark-spark4.2_2.13-$COMET_VERSION.jar)
<!-- ENDIF -->

@@ -105,7 +105,7 @@ See the [Comet Kubernetes Guide](kubernetes.md) guide.
Make sure `SPARK_HOME` points to the same Spark version as Comet was built for.

```shell
-export COMET_JAR=spark/target/comet-spark-spark3.5_2.12-$COMET_VERSION.jar
+export COMET_JAR=spark/target/comet-spark-spark4.1_2.13-$COMET_VERSION.jar

$SPARK_HOME/bin/spark-shell \
--jars $COMET_JAR \
@@ -161,7 +161,7 @@ explicitly contain Comet otherwise Spark may use a different class-loader for th
components which will then fail at runtime. For example:

```
---driver-class-path spark/target/comet-spark-spark3.5_2.12-$COMET_VERSION.jar
+--driver-class-path spark/target/comet-spark-spark4.1_2.13-$COMET_VERSION.jar
```
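
Putting the pieces together, a cluster submission might look like the following sketch; the application class and jar are placeholders, not part of Comet:

```shell
# Sketch of a cluster-mode submit; com.example.MyApp and my-app.jar are hypothetical
$SPARK_HOME/bin/spark-submit \
  --jars $COMET_JAR \
  --driver-class-path $COMET_JAR \
  --conf spark.executor.extraClassPath=$COMET_JAR \
  --conf spark.plugins=org.apache.spark.CometPlugin \
  --conf spark.comet.enabled=true \
  --class com.example.MyApp \
  my-app.jar
```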

Some cluster managers may require additional configuration, see <https://spark.apache.org/docs/latest/cluster-overview.html>
14 changes: 7 additions & 7 deletions docs/source/user-guide/latest/kubernetes.md
@@ -69,30 +69,30 @@ metadata:
spec:
type: Scala
mode: cluster
-image: apache/datafusion-comet:0.7.0-spark3.5.5-scala2.12-java11
+image: apache/datafusion-comet:0.7.0-spark4.1.1-scala2.13-java17
imagePullPolicy: IfNotPresent
mainClass: org.apache.spark.examples.SparkPi
-mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.5.5.jar
+mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.13-4.1.1.jar
sparkConf:
"spark.executor.extraClassPath": "/opt/spark/jars/comet-spark-spark3.5_2.12-0.7.0.jar"
"spark.driver.extraClassPath": "/opt/spark/jars/comet-spark-spark3.5_2.12-0.7.0.jar"
"spark.executor.extraClassPath": "/opt/spark/jars/comet-spark-spark4.1_2.13-0.7.0.jar"
"spark.driver.extraClassPath": "/opt/spark/jars/comet-spark-spark4.1_2.13-0.7.0.jar"
"spark.plugins": "org.apache.spark.CometPlugin"
"spark.comet.enabled": "true"
"spark.comet.exec.enabled": "true"
"spark.comet.exec.shuffle.enabled": "true"
"spark.comet.exec.shuffle.mode": "auto"
"spark.shuffle.manager": "org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager"
-sparkVersion: 3.5.6
+sparkVersion: 4.1.1
driver:
labels:
-version: 3.5.6
+version: 4.1.1
cores: 1
coreLimit: 1200m
memory: 512m
serviceAccount: spark-operator-spark
executor:
labels:
-version: 3.5.6
+version: 4.1.1
instances: 1
cores: 1
coreLimit: 1200m
10 changes: 5 additions & 5 deletions docs/source/user-guide/latest/source.md
@@ -38,7 +38,7 @@ cd apache-datafusion-comet-$COMET_VERSION
Build

```console
make release-nogit PROFILES="-Pspark-3.5"
make release-nogit PROFILES="-Pspark-4.1"
```

## Building from the GitHub repository
@@ -53,17 +53,17 @@ Build Comet for a specific Spark version:

```console
cd datafusion-comet
make release PROFILES="-Pspark-3.5"
make release PROFILES="-Pspark-4.1"
```

-Note that the project builds for Scala 2.12 by default but can be built for Scala 2.13 using an additional profile:
+Note that the project builds for Scala 2.13 by default but can be built for Scala 2.12 using an additional profile:

```console
make release PROFILES="-Pspark-3.5 -Pscala-2.13"
make release PROFILES="-Pspark-3.5 -Pscala-2.12"
```

To build Comet from the source distribution on an isolated environment without an access to `github.com` it is necessary to disable `git-commit-id-maven-plugin`, otherwise you will face errors that there is no access to the git during the build process. In that case you may use:

```console
make release-nogit PROFILES="-Pspark-3.5"
make release-nogit PROFILES="-Pspark-4.1"
```
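
Whichever profile is used, the resulting jar lands under `spark/target` with the Spark/Scala suffix, so a wildcard matches both snapshot and release builds. An illustrative check:

```shell
# Illustrative: list the jar produced by the build
ls spark/target/comet-spark-spark4.1_2.13-*.jar
```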
16 changes: 8 additions & 8 deletions kube/Dockerfile
@@ -15,14 +15,14 @@
# limitations under the License.
#

-FROM apache/spark:3.5.8 AS builder
+FROM apache/spark:4.1.1 AS builder

USER root

-# Installing JDK11 as the image comes with JRE
+# Installing JDK17 as the image comes with JRE
RUN apt update \
&& apt install -y curl \
&& apt install -y openjdk-11-jdk \
&& apt install -y openjdk-17-jdk \
&& apt clean

RUN apt install -y gcc-10 g++-10 cpp-10 unzip
Expand All @@ -37,8 +37,8 @@ ENV PATH="$PATH:/root/.local/bin"
RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
ENV PATH="/root/.cargo/bin:${PATH}"
ENV RUSTFLAGS="-C debuginfo=line-tables-only -C incremental=false"
-ENV SPARK_VERSION=3.5
-ENV SCALA_VERSION=2.12
+ENV SPARK_VERSION=4.1
+ENV SCALA_VERSION=2.13

# copy source files to Docker image
RUN mkdir /comet
@@ -70,9 +70,9 @@ RUN mkdir -p /root/.m2 && \
RUN cd /comet \
&& JAVA_HOME=$(readlink -f $(which javac) | sed "s/\/bin\/javac//") make release-nogit PROFILES="-Pspark-$SPARK_VERSION -Pscala-$SCALA_VERSION"

-FROM apache/spark:3.5.8
-ENV SPARK_VERSION=3.5
-ENV SCALA_VERSION=2.12
+FROM apache/spark:4.1.1
+ENV SPARK_VERSION=4.1
+ENV SCALA_VERSION=2.13
USER root

# note the use of a wildcard in the file name so that this works with both snapshot and final release versions
34 changes: 20 additions & 14 deletions pom.xml
@@ -65,24 +65,24 @@ under the License.
<extra-enforcer-rules.version>1.7.0</extra-enforcer-rules.version>
<scalafmt.version>3.6.1</scalafmt.version>
<apache-rat-plugin.version>0.16.1</apache-rat-plugin.version>
<scala.version>2.12.18</scala.version>
<scala.binary.version>2.12</scala.binary.version>
<scala.version>2.13.17</scala.version>
<scala.binary.version>2.13</scala.binary.version>
<scala.plugin.version>4.9.6</scala.plugin.version>
<scalatest.version>3.2.16</scalatest.version>
<scalatest-maven-plugin.version>2.2.0</scalatest-maven-plugin.version>
<spark.version>3.5.8</spark.version>
<spark.version.short>3.5</spark.version.short>
<spark.version>4.1.1</spark.version>
<spark.version.short>4.1</spark.version.short>
<spark.maven.scope>provided</spark.maven.scope>
<protobuf.version>3.25.5</protobuf.version>
<parquet.version>1.13.1</parquet.version>
<parquet.version>1.16.0</parquet.version>
<parquet.maven.scope>provided</parquet.maven.scope>
<hadoop.version>3.3.4</hadoop.version>
<arrow.version>18.3.0</arrow.version>
<codehaus.jackson.version>1.9.13</codehaus.jackson.version>
<spotless.version>2.43.0</spotless.version>
<jacoco.version>0.8.11</jacoco.version>
<semanticdb.version>4.8.8</semanticdb.version>
<slf4j.version>2.0.7</slf4j.version>
<semanticdb.version>4.13.6</semanticdb.version>
<slf4j.version>2.0.17</slf4j.version>
<guava.version>33.2.1-jre</guava.version>
<testcontainers.version>1.21.0</testcontainers.version>
<amazon-awssdk-v2.version>2.31.51</amazon-awssdk-v2.version>
@@ -116,8 +116,8 @@ under the License.
-Djdk.reflect.useDirectMethodHandle=false
</extraJavaTestArgs>
<argLine>-ea -Xmx4g -Xss4m ${extraJavaTestArgs}</argLine>
<shims.majorVerSrc>spark-3.x</shims.majorVerSrc>
<shims.minorVerSrc>spark-3.5</shims.minorVerSrc>
<shims.majorVerSrc>spark-4.x</shims.majorVerSrc>
<shims.minorVerSrc>spark-4.1</shims.minorVerSrc>
</properties>

<dependencyManagement>
@@ -635,10 +635,13 @@
<id>spark-3.4</id>
<properties>
<scala.version>2.12.17</scala.version>
<scala.binary.version>2.12</scala.binary.version>
<spark.version>3.4.3</spark.version>
<spark.version.short>3.4</spark.version.short>
<parquet.version>1.13.1</parquet.version>
<semanticdb.version>4.8.8</semanticdb.version>
<slf4j.version>2.0.6</slf4j.version>
<shims.majorVerSrc>spark-3.x</shims.majorVerSrc>
<shims.minorVerSrc>spark-3.4</shims.minorVerSrc>
<java.version>11</java.version>
<maven.compiler.source>${java.version}</maven.compiler.source>
@@ -650,10 +653,13 @@
<id>spark-3.5</id>
<properties>
<scala.version>2.12.18</scala.version>
<scala.binary.version>2.12</scala.binary.version>
<spark.version>3.5.8</spark.version>
<spark.version.short>3.5</spark.version.short>
<parquet.version>1.13.1</parquet.version>
<semanticdb.version>4.8.8</semanticdb.version>
<slf4j.version>2.0.7</slf4j.version>
<shims.majorVerSrc>spark-3.x</shims.majorVerSrc>
<shims.minorVerSrc>spark-3.5</shims.minorVerSrc>
<java.version>11</java.version>
<maven.compiler.source>${java.version}</maven.compiler.source>
@@ -662,10 +668,8 @@
</profile>

<profile>
<!-- FIXME: this is WIP. Tests may fail https://github.com/apache/datafusion-comet/issues/551 -->
<id>spark-4.0</id>
<properties>
<!-- Use Scala 2.13 by default -->
<scala.version>2.13.16</scala.version>
<scala.binary.version>2.13</scala.binary.version>
<spark.version>4.0.2</spark.version>
@@ -675,15 +679,13 @@ under the License.
<slf4j.version>2.0.16</slf4j.version>
<shims.majorVerSrc>spark-4.x</shims.majorVerSrc>
<shims.minorVerSrc>spark-4.0</shims.minorVerSrc>
<!-- Use jdk17 by default -->
<java.version>17</java.version>
<maven.compiler.source>${java.version}</maven.compiler.source>
<maven.compiler.target>${java.version}</maven.compiler.target>
</properties>
</profile>

<profile>
<!-- WIP: Spark 4.1 support, with its own shim sources for 4.1-specific APIs -->
<id>spark-4.1</id>
<properties>
<!-- Spark 4.1.1 is compiled against Scala 2.13.17 and emits calls into stdlib methods
@@ -699,7 +701,6 @@
<slf4j.version>2.0.17</slf4j.version>
<shims.majorVerSrc>spark-4.x</shims.majorVerSrc>
<shims.minorVerSrc>spark-4.1</shims.minorVerSrc>
<!-- Use jdk17 by default -->
<java.version>17</java.version>
<maven.compiler.source>${java.version}</maven.compiler.source>
<maven.compiler.target>${java.version}</maven.compiler.target>
@@ -729,6 +730,11 @@

<profile>
<id>scala-2.12</id>
<properties>
<scala.version>2.12.18</scala.version>
<scala.binary.version>2.12</scala.binary.version>
<semanticdb.version>4.8.8</semanticdb.version>
</properties>
</profile>

<profile>