diff --git a/README.md b/README.md
index b049799..fad8123 100644
--- a/README.md
+++ b/README.md
@@ -1,7 +1,35 @@
-# OpenTelemetry SparkListener
+# Spot: Spark-OpenTelemetry
+
+This package connects [Apache Spark™][sp-home] to [OpenTelemetry][ot-home].
+
+It adds a layer of indirection that allows any Spark or PySpark job to report metrics to an [OpenTelemetry Collector][ot-col], or directly to any [supported backend][ot-export].
+
+## Status
+
+ℹ️ This project is in initial development. It's not ready for use.
+
+## Usage
+
+The recommended way to use Spot relies on [OpenTelemetry Autoconfigure][ot-auto] to obtain the OpenTelemetry configuration. You pass the `spot-complete-*.jar` to spark-submit to make Spot available to your job, and configure `spark.extraListeners` to enable it.
 
 ```bash
+SCALA_VERSION=2.12 # Use 2.12 or 2.13, whichever matches your Spark deployment.
 spark-submit \
+  --jars com.xebia.data.spot.spot-complete_${SCALA_VERSION}-x.y.z.jar \
   --conf spark.extraListeners=com.xebia.data.spot.TelemetrySparkListener \
-  ...
+  com.example.MySparkJob
 ```
+
+### Prerequisites
+
+Instrumenting for telemetry is useless until you publish the recorded data somewhere. This might be the native metrics suite of your chosen cloud provider, or a free or commercial third-party stack such as Prometheus + Tempo + Grafana. Your instrumented Spark jobs can publish directly to the backend, or route their traffic through an OpenTelemetry Collector. Choosing the backend and routing architecture is outside the scope of this document.
+
+If you're running Spark on Kubernetes, you should install and configure the [OpenTelemetry Operator][ot-k8s-oper]. In any other deployment you should set the appropriate [environment variables for autoconf][ot-auto-env] (see the example below).
+
+[ot-auto]: https://opentelemetry.io/docs/languages/java/instrumentation/#automatic-configuration
+[ot-auto-env]: https://opentelemetry.io/docs/languages/java/configuration/
+[ot-col]: https://opentelemetry.io/docs/collector/
+[ot-export]: https://opentelemetry.io/ecosystem/registry/?component=exporter
+[ot-home]: https://opentelemetry.io/
+[ot-k8s-oper]: https://opentelemetry.io/docs/kubernetes/operator/
+[sp-home]: https://spark.apache.org
diff --git a/build.sbt b/build.sbt
index 759737d..01b6df0 100644
--- a/build.sbt
+++ b/build.sbt
@@ -1,17 +1,44 @@
 ThisBuild / organization := "com.xebia.data"
-ThisBuild / scalaVersion := "2.13.13"
-ThisBuild / crossScalaVersions := Seq("2.12.18", "2.13.13")
+ThisBuild / scmInfo := Some(ScmInfo(
+  url("https://github.com/xebia/spot"),
+  "https://github.com/xebia/spot.git",
+  "git@github.com:xebia/spot.git"))
+
+ThisBuild / scalaVersion := "2.13.14"
+ThisBuild / crossScalaVersions := Seq("2.12.19", "2.13.14")
+
+import Dependencies._
 
 lazy val spot = project
   .in(file("./spot"))
+  .disablePlugins(AssemblyPlugin)
   .settings(
     name := "spot",
     libraryDependencies ++= Seq(
-      "org.apache.spark" %% "spark-core" % "3.5.1",
+      `opentelemetry-api`,
+      `spark-core` % Provided
+    ),
+  )
-      "io.opentelemetry" % "opentelemetry-api" % "1.37.0",
-      "io.opentelemetry" % "opentelemetry-sdk" % "1.37.0" % Runtime,
+
+lazy val `spot-complete` = project
+  .in(file("./spot-complete"))
+  .dependsOn(spot)
+  .settings(
+    name := "spot-complete",
+    libraryDependencies ++= Seq(
+      `opentelemetry-sdk`,
+      `opentelemetry-sdk-autoconfigure`
+    ),
+    assembly / assemblyJarName := s"${name.value}_${scalaBinaryVersion.value}-${version.value}.jar",
+    assembly / assemblyOption ~= {
+      _.withIncludeScala(false)
+    }
+  )
-      "io.opentelemetry" % "opentelemetry-sdk-extension-autoconfigure" % "1.34.0" % Optional,
-    )
+
+lazy val root = project
+  .in(file("."))
+  .aggregate(spot, `spot-complete`)
+  .disablePlugins(AssemblyPlugin)
+  .settings(
+    publish / skip := true,
   )
diff --git a/project/Dependencies.scala b/project/Dependencies.scala
new file mode 100644
index 0000000..b2654cb
--- /dev/null
+++ b/project/Dependencies.scala
@@ -0,0 +1,11 @@
+import sbt._
+
+object Dependencies {
+  private[this] val openTelemetryVersion = "1.39.0"
+  private[this] val openTelemetryAutoConf = "1.38.0"
+
+  val `opentelemetry-api` = "io.opentelemetry" % "opentelemetry-api" % openTelemetryVersion
+  val `opentelemetry-sdk` = "io.opentelemetry" % "opentelemetry-sdk" % openTelemetryVersion
+  val `opentelemetry-sdk-autoconfigure` = "io.opentelemetry" % "opentelemetry-sdk-extension-autoconfigure" % openTelemetryAutoConf
+  val `spark-core` = "org.apache.spark" %% "spark-core" % "3.5.1"
+}
diff --git a/project/plugins.sbt b/project/plugins.sbt
new file mode 100644
index 0000000..e252f09
--- /dev/null
+++ b/project/plugins.sbt
@@ -0,0 +1,2 @@
+addSbtPlugin("com.github.sbt" % "sbt-dynver" % "5.0.1")
+addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "2.2.0")
diff --git a/spot-complete/README.md b/spot-complete/README.md
new file mode 100644
index 0000000..1d9b764
--- /dev/null
+++ b/spot-complete/README.md
@@ -0,0 +1 @@
+The `spot-complete` sbt project packages Spot and its dependencies as a single JAR file.
diff --git a/spot/README.md b/spot/README.md
new file mode 100644
index 0000000..2557eb5
--- /dev/null
+++ b/spot/README.md
@@ -0,0 +1 @@
+The `spot` sbt project packages the Spark-OpenTelemetry listener on its own.
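The Prerequisites section above leaves the autoconfiguration environment variables to the reader. The snippet below is a minimal sketch of the non-Kubernetes case, not part of this change set: it sets the two variables most deployments need before running the spark-submit command from the README. The service name `my-spark-job` and the collector endpoint `otel-collector.example.com:4317` are illustrative assumptions, not values defined by this diff.

```bash
# Sketch only: OpenTelemetry autoconfigure reads these environment variables.
# Service name and endpoint are placeholders; by default the SDK exports over OTLP.
export OTEL_SERVICE_NAME="my-spark-job"
export OTEL_EXPORTER_OTLP_ENDPOINT="http://otel-collector.example.com:4317"

SCALA_VERSION=2.12
spark-submit \
  --jars com.xebia.data.spot.spot-complete_${SCALA_VERSION}-x.y.z.jar \
  --conf spark.extraListeners=com.xebia.data.spot.TelemetrySparkListener \
  com.example.MySparkJob
```

Any other setting supported by autoconfiguration (exporter selection, headers, resource attributes) can be supplied the same way; see the [configuration reference][ot-auto-env] linked from the README.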