38 changes: 0 additions & 38 deletions .github/workflows/connectors_test.yaml

This file was deleted.

266 changes: 0 additions & 266 deletions build.sbt
@@ -1218,240 +1218,6 @@ lazy val hudi = (project in file("hudi"))
Compile / packageBin := assembly.value
)

/**
* We want to publish the `standalone` project's shaded JAR (created from the
* build/sbt standalone/assembly command).
*
* However, build/sbt standalone/publish and build/sbt standalone/publishLocal will use the
* non-shaded JAR from the build/sbt standalone/package command.
*
* So, we create an impostor, cosmetic project used only for publishing.
*
* build/sbt standalone/assembly
* - creates connectors/standalone/target/scala-2.12/delta-standalone-original-shaded_2.12-0.2.1-SNAPSHOT.jar
* (this is the shaded JAR we want)
*
* build/sbt standaloneCosmetic/publishM2
* - packages the shaded JAR (above) and then produces:
* -- .m2/repository/io/delta/delta-standalone_2.12/0.2.1-SNAPSHOT/delta-standalone_2.12-0.2.1-SNAPSHOT.pom
* -- .m2/repository/io/delta/delta-standalone_2.12/0.2.1-SNAPSHOT/delta-standalone_2.12-0.2.1-SNAPSHOT.jar
* -- .m2/repository/io/delta/delta-standalone_2.12/0.2.1-SNAPSHOT/delta-standalone_2.12-0.2.1-SNAPSHOT-sources.jar
* -- .m2/repository/io/delta/delta-standalone_2.12/0.2.1-SNAPSHOT/delta-standalone_2.12-0.2.1-SNAPSHOT-javadoc.jar
*/
lazy val standaloneCosmetic = project
.dependsOn(storage) // this doesn't impact the output artifact (jar), only the pom.xml dependencies
.disablePlugins(JavaFormatterPlugin, ScalafmtPlugin)
.settings(
name := "delta-standalone",
commonSettings,
releaseSettings,
exportJars := true,
Compile / packageBin := (standaloneParquet / assembly).value,
Compile / packageSrc := (standalone / Compile / packageSrc).value,
libraryDependencies ++= scalaCollectionPar(scalaVersion.value) ++ Seq(
"org.apache.hadoop" % "hadoop-client" % hadoopVersion % "provided",
"org.apache.parquet" % "parquet-hadoop" % "1.12.3" % "provided",
// parquet4s-core dependencies that are not shaded are added with compile scope.
"com.chuusai" %% "shapeless" % "2.3.4",
"org.scala-lang.modules" %% "scala-collection-compat" % "2.4.3"
)
)

lazy val testStandaloneCosmetic = (project in file("connectors/testStandaloneCosmetic"))
.dependsOn(standaloneCosmetic)
.dependsOn(goldenTables % "test")
.disablePlugins(JavaFormatterPlugin, ScalafmtPlugin)
.settings(
name := "test-standalone-cosmetic",
commonSettings,
skipReleaseSettings,
libraryDependencies ++= Seq(
"org.apache.hadoop" % "hadoop-client" % hadoopVersion,
"org.scalatest" %% "scalatest" % scalaTestVersionForConnectors % "test",
)
)

/**
* A test project to verify `ParquetSchemaConverter` APIs are working after the user provides
* `parquet-hadoop`. We use a separate project because we want to test whether Delta Standalone APIs
* other than `ParquetSchemaConverter` work without `parquet-hadoop` in `testStandaloneCosmetic`.
*/
lazy val testParquetUtilsWithStandaloneCosmetic = project.dependsOn(standaloneCosmetic)
.disablePlugins(JavaFormatterPlugin, ScalafmtPlugin)
.settings(
name := "test-parquet-utils-with-standalone-cosmetic",
commonSettings,
skipReleaseSettings,
libraryDependencies ++= Seq(
"org.apache.hadoop" % "hadoop-client" % hadoopVersion,
"org.apache.parquet" % "parquet-hadoop" % "1.12.3" % "provided",
"org.scalatest" %% "scalatest" % scalaTestVersionForConnectors % "test",
)
)

def scalaCollectionPar(version: String) = version match {
case v if v.startsWith("2.13.") =>
Seq("org.scala-lang.modules" %% "scala-parallel-collections" % "1.0.4")
case _ => Seq()
}

/**
* The public API ParquetSchemaConverter exposes Parquet classes in its methods so we cannot apply
* shading rules on it. However, sbt-assembly doesn't allow excluding a single file. Hence, we
* create a separate project to skip the shading.
*/
lazy val standaloneParquet = (project in file("connectors/standalone-parquet"))
.disablePlugins(JavaFormatterPlugin, ScalafmtPlugin)
.dependsOn(standaloneWithoutParquetUtils)
.settings(
name := "delta-standalone-parquet",
commonSettings,
skipReleaseSettings,
libraryDependencies ++= Seq(
"org.apache.parquet" % "parquet-hadoop" % "1.12.3" % "provided",
"org.scalatest" %% "scalatest" % scalaTestVersionForConnectors % "test"
),
assemblyPackageScala / assembleArtifact := false
)

/** A dummy project to allow `standaloneParquet` to depend on the shaded standalone jar. */
lazy val standaloneWithoutParquetUtils = project
.disablePlugins(JavaFormatterPlugin, ScalafmtPlugin)
.settings(
name := "delta-standalone-without-parquet-utils",
commonSettings,
skipReleaseSettings,
exportJars := true,
Compile / packageBin := (standalone / assembly).value
)

// TODO scalastyle settings
lazy val standalone = (project in file("connectors/standalone"))
.dependsOn(storage % "compile->compile;provided->provided")
.dependsOn(goldenTables % "test")
.disablePlugins(JavaFormatterPlugin, ScalafmtPlugin)
.settings(
name := "delta-standalone-original",
commonSettings,
skipReleaseSettings,
standaloneMimaSettings,
// When updating any dependency here, we should also review `pomPostProcess` in project
// `standaloneCosmetic` and update it accordingly.
libraryDependencies ++= scalaCollectionPar(scalaVersion.value) ++ Seq(
"org.apache.hadoop" % "hadoop-client" % hadoopVersion % "provided",
"com.github.mjakubowski84" %% "parquet4s-core" % parquet4sVersion excludeAll (
ExclusionRule("org.slf4j", "slf4j-api")
),
"com.fasterxml.jackson.module" %% "jackson-module-scala" % "2.12.3",
"org.json4s" %% "json4s-jackson" % "3.7.0-M11" excludeAll (
ExclusionRule("com.fasterxml.jackson.core"),
ExclusionRule("com.fasterxml.jackson.module")
),
"org.scalatest" %% "scalatest" % scalaTestVersionForConnectors % "test",
),
Compile / sourceGenerators += Def.task {
val file = (Compile / sourceManaged).value / "io" / "delta" / "standalone" / "package.scala"
IO.write(file,
s"""package io.delta
|
|package object standalone {
| val VERSION = "${version.value}"
| val NAME = "Delta Standalone"
|}
|""".stripMargin)
Seq(file)
},

/**
* Standalone packaged (unshaded) jar.
*
* Build with `build/sbt standalone/package` command.
* e.g. connectors/standalone/target/scala-2.12/delta-standalone-original-unshaded_2.12-0.2.1-SNAPSHOT.jar
*/
artifactName := { (sv: ScalaVersion, module: ModuleID, artifact: Artifact) =>
artifact.name + "-unshaded" + "_" + sv.binary + "-" + module.revision + "." + artifact.extension
},

/**
* Standalone assembly (shaded) jar. This is what we want to release.
*
* Build with `build/sbt standalone/assembly` command.
* e.g. connectors/standalone/target/scala-2.12/delta-standalone-original-shaded_2.12-0.2.1-SNAPSHOT.jar
*/
assembly / logLevel := Level.Info,
assembly / test := {},
assembly / assemblyJarName := s"${name.value}-shaded_${scalaBinaryVersion.value}-${version.value}.jar",
// We exclude jars first, and then we shade what is remaining. Note: the input here is only
// `libraryDependencies` jars, not `.dependsOn(_)` jars.
assembly / assemblyExcludedJars := {
val cp = (assembly / fullClasspath).value
val allowedPrefixes = Set("META_INF", "io", "json4s", "jackson", "paranamer",
"parquet4s", "parquet-", "audience-annotations", "commons-pool")
cp.filter { f =>
!allowedPrefixes.exists(prefix => f.data.getName.startsWith(prefix))
}
},
assembly / assemblyShadeRules := Seq(
ShadeRule.rename("com.fasterxml.jackson.**" -> "shadedelta.@0").inAll,
ShadeRule.rename("com.thoughtworks.paranamer.**" -> "shadedelta.@0").inAll,
ShadeRule.rename("org.json4s.**" -> "shadedelta.@0").inAll,
ShadeRule.rename("com.github.mjakubowski84.parquet4s.**" -> "shadedelta.@0").inAll,
ShadeRule.rename("org.apache.commons.pool.**" -> "shadedelta.@0").inAll,
ShadeRule.rename("org.apache.parquet.**" -> "shadedelta.@0").inAll,
ShadeRule.rename("shaded.parquet.**" -> "shadedelta.@0").inAll,
ShadeRule.rename("org.apache.yetus.audience.**" -> "shadedelta.@0").inAll
),
assembly / assemblyMergeStrategy := {
// Discard `module-info.class` to fix the `different file contents found` error.
// TODO Upgrade SBT to 1.5 which will do this automatically
case "module-info.class" => MergeStrategy.discard
// Discard the unused `parquet.thrift` so that we don't conflict with the file used by the user
case "parquet.thrift" => MergeStrategy.discard
// Discard the jackson service configs that we don't need. These files are not shaded, so
// adding them may conflict with another jackson version used by the user.
case PathList("META-INF", "services", xs @ _*) => MergeStrategy.discard
// This project `.dependsOn` delta-storage, and its classes will be included by default
// in this assembly jar. Manually discard them since it is already a compile-time dependency.
case PathList("io", "delta", "storage", xs @ _*) => MergeStrategy.discard
case x =>
val oldStrategy = (assembly / assemblyMergeStrategy).value
oldStrategy(x)
},
assembly / artifact := {
val art = (assembly / artifact).value
art.withClassifier(Some("assembly"))
},
addArtifact(assembly / artifact, assembly),

// Unidoc setting
unidocSourceFilePatterns += SourceFilePattern("io/delta/standalone/"),
javaCheckstyleSettings("dev/connectors-checkstyle.xml")
).configureUnidoc()


/*
TODO (TD): Tests are failing for some reason
lazy val compatibility = (project in file("connectors/oss-compatibility-tests"))
// depend on standalone test codes as well
.dependsOn(standalone % "compile->compile;test->test")
.dependsOn(spark % "test -> compile")
.settings(
name := "compatibility",
commonSettings,
skipReleaseSettings,
libraryDependencies ++= Seq(
// Test Dependencies
"io.netty" % "netty-buffer" % "4.1.63.Final" % "test",
"org.scalatest" %% "scalatest" % "3.1.0" % "test",
"commons-io" % "commons-io" % "2.8.0" % "test",
"org.apache.spark" %% "spark-sql" % defaultSparkVersion % "test",
"org.apache.spark" %% "spark-catalyst" % defaultSparkVersion % "test" classifier "tests",
"org.apache.spark" %% "spark-core" % defaultSparkVersion % "test" classifier "tests",
"org.apache.spark" %% "spark-sql" % defaultSparkVersion % "test" classifier "tests",
)
)
*/

lazy val goldenTables = (project in file("connectors/golden-tables"))
.disablePlugins(JavaFormatterPlugin, ScalafmtPlugin)
.settings(
@@ -1471,38 +1237,6 @@ lazy val goldenTables = (project in file("connectors/golden-tables"))
)
)

def sqlDeltaImportScalaVersion(scalaBinaryVersion: String): String = {
scalaBinaryVersion match {
// sqlDeltaImport doesn't support 2.11. We return 2.12 so that we can resolve the dependencies
// but we will not publish sqlDeltaImport with Scala 2.11.
case "2.11" => "2.12"
case _ => scalaBinaryVersion
}
}

lazy val sqlDeltaImport = (project in file("connectors/sql-delta-import"))
.disablePlugins(JavaFormatterPlugin, ScalafmtPlugin)
.settings (
name := "sql-delta-import",
commonSettings,
skipReleaseSettings,
publishArtifact := scalaBinaryVersion.value != "2.11",
Test / publishArtifact := false,
libraryDependencies ++= Seq(
// Using released delta-spark JAR instead of module dependency to break circular dependency
"io.delta" %% "delta-spark" % "3.3.2",

"io.netty" % "netty-buffer" % "4.1.63.Final" % "test",
"org.apache.spark" % ("spark-sql_" + sqlDeltaImportScalaVersion(scalaBinaryVersion.value)) % defaultSparkVersion % "provided",
"org.rogach" %% "scallop" % "3.5.1",
"org.scalatest" %% "scalatest" % scalaTestVersionForConnectors % "test",
"com.h2database" % "h2" % "1.4.200" % "test",
"org.apache.spark" % ("spark-catalyst_" + sqlDeltaImportScalaVersion(scalaBinaryVersion.value)) % defaultSparkVersion % "test",
"org.apache.spark" % ("spark-core_" + sqlDeltaImportScalaVersion(scalaBinaryVersion.value)) % defaultSparkVersion % "test",
"org.apache.spark" % ("spark-sql_" + sqlDeltaImportScalaVersion(scalaBinaryVersion.value)) % defaultSparkVersion % "test"
)
)

/**
* Get list of python files and return the mapping between source files and target paths
* in the generated package JAR.
25 changes: 1 addition & 24 deletions connectors/README.md
@@ -1,24 +1 @@
## Delta Standalone

Delta Standalone, formerly known as the Delta Standalone Reader (DSR), is a JVM library to read **and write** Delta tables. Unlike https://github.com/delta-io/delta, this project doesn't use Spark to read or write tables and it has only a few transitive dependencies. It can be used by any application that cannot use a Spark cluster.
- To compile the project, run `build/sbt standalone/compile`
- To test the project, run `build/sbt standalone/test`
- To publish the JAR, run `build/sbt standaloneCosmetic/publishM2`

See [Delta Standalone](https://docs.delta.io/latest/delta-standalone.html) for detailed documentation.
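
As a quick illustration (a minimal sketch, not taken from this repository; the table path `/tmp/delta-table` is a placeholder), reading an existing table with the Delta Standalone API looks roughly like this:

```scala
import io.delta.standalone.DeltaLog
import org.apache.hadoop.conf.Configuration

object ReadDeltaTableExample {
  def main(args: Array[String]): Unit = {
    // Load the Delta log for an existing table (placeholder path).
    val log = DeltaLog.forTable(new Configuration(), "/tmp/delta-table")

    // Inspect the latest snapshot: its version and the data files it references.
    val snapshot = log.snapshot()
    println(s"Latest version: ${snapshot.getVersion}")
    snapshot.getAllFiles.forEach(file => println(file.getPath))
  }
}
```

Writes follow a similar pattern through `log.startTransaction()` and the transaction's `commit(...)`; see the documentation linked above for the full API.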

## Hive Connector

Read Delta tables directly from Apache Hive using the [Hive Connector](/hive/README.md). See the dedicated [README.md](/hive/README.md) for more details.

## Flink/Delta Connector

Use the [Flink/Delta Connector](flink/README.md) to read and write Delta tables from Apache Flink applications. The connector includes a sink for writing to Delta tables from Apache Flink and a source for reading Delta tables with Apache Flink (still in progress). See the dedicated [README.md](/flink/README.md) for more details.

## sql-delta-import

[sql-delta-import](/sql-delta-import/readme.md) allows importing data from a JDBC source into a Delta table.

## Power BI connector
The connector for [Microsoft Power BI](https://powerbi.microsoft.com/) is basically just a custom Power Query function that allows you to read a Delta table from any file-based [data source supported by Microsoft Power BI](https://docs.microsoft.com/en-us/power-bi/connect-data/desktop-data-sources). Details can be found in the dedicated [README.md](/powerbi/README.md).

Connector projects are no longer maintained in the `master` branch and will not receive new releases, due to the migration to the Delta Kernel project. They will continue to be supported in maintenance mode from the `spark-3.5-support` branch.