How Spark Submit works
The following explanations are mainly meant to help understand the submission of YARN jobs in cluster mode.
- The `spark-submit` script launches a Java process using the `spark-class` script, with main class `org.apache.spark.deploy.SparkSubmit`.
- `spark-class`
  - finds `java`
  - executes `$SPARK_HOME/conf/spark-env.sh` through `$SPARK_HOME/bin/load-spark-env.sh`
  - starts the Java process
- `org.apache.spark.launcher.Main` builds the command to be executed by `spark-class`, e.g. something like:
```
/usr/lib/jvm/java-1.8.0-openjdk/jre/bin/java -cp /opt/spark/conf/:/opt/spark/jars/*:/opt/hadoop/ \
  -Dscala.usejavacp=true org.apache.spark.deploy.SparkSubmit --master yarn --deploy-mode cluster \
  --conf spark.executor.memory=1g --conf spark.driver.memory=1g \
  --class za.co.absa.hyperdrive.driver.drivers.CommandLineIngestionDriver --name Hyperdrive \
  --jars spark-jobs-current.jar hyperdrive-release-latest.jar arg1 arg2
```
- `SparkSubmit`
  - `main` -> `doSubmit` -> `submit` -> `doRunMain` -> `runMain` -> `prepareSubmitEnvironment`
  - `prepareSubmitEnvironment` returns, among other things, the main class to execute.
  - Also, login with keytab and principal is done here: https://github.com/apache/spark/blob/v3.2.0/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L359 (see the sketch below)
  - For a YARN job in cluster mode, it returns the YARN client main class, which is `org.apache.spark.deploy.yarn.YarnClusterApplication`.
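As a hedged illustration of what that keytab login involves, the sketch below uses the Hadoop `UserGroupInformation` API, which performs Kerberos logins in this context. The principal and keytab path are made-up placeholders, and Spark's actual code at the linked line does more (e.g. propagating the values into the `SparkConf`):

```scala
import java.io.File
import org.apache.hadoop.security.UserGroupInformation

// Placeholder values standing in for the --principal and --keytab arguments
val principal = "hyperdrive@EXAMPLE.COM"
val keytab = "/etc/security/keytabs/hyperdrive.keytab"

// The keytab file must exist locally before attempting the login
require(new File(keytab).exists(), s"Keytab file: $keytab does not exist")

// Logs the current JVM in via Kerberos, so that subsequent
// HDFS and YARN calls made during submission are authenticated
UserGroupInformation.loginUserFromKeytab(principal, keytab)
```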
- `org.apache.spark.deploy.yarn.Client`
  - `YarnClusterApplication.start` -> `Client.run` -> `Client.submitApplication`
  - The `__spark_conf__.zip` archive is created here: https://github.com/apache/spark/blob/v3.2.1/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L750
  - The upload of jars happens here: https://github.com/apache/spark/blob/v3.2.0/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L391
  - `submitApplication` calls `yarnClient.createApplication()` and then `yarnClient.submitApplication(ApplicationSubmissionContext)`, as sketched below.
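To make those two calls concrete, here is a stripped-down sketch using the Hadoop `YarnClient` API directly. Spark's `Client.submitApplication` does considerably more in between, notably uploading resources and building the ApplicationMaster's container launch context; the application name below is a placeholder:

```scala
import org.apache.hadoop.yarn.client.api.YarnClient
import org.apache.hadoop.yarn.conf.YarnConfiguration

val yarnClient = YarnClient.createYarnClient()
yarnClient.init(new YarnConfiguration())
yarnClient.start()

// First call: asks the ResourceManager for a new application id
val newApp = yarnClient.createApplication()
val appContext = newApp.getApplicationSubmissionContext
appContext.setApplicationName("Hyperdrive") // placeholder name

// Spark's Client would now populate the context further:
// AM container launch context, resource requests, queue, tags, ...

// Second call: submits the populated context to the ResourceManager
val appId = yarnClient.submitApplication(appContext)
println(s"Submitted application $appId")
```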
- There are two implementations of `AbstractLauncher`, which can be used to programmatically launch a Spark job with the method `startApplication` (see the sketch below):
  - `SparkLauncher` creates a `java.lang.Process` that executes the `spark-submit` script.
  - `InProcessLauncher` calls `org.apache.spark.deploy.InProcessSparkSubmit.main()` directly within a thread (using `new Thread()`).
  - Both `SparkLauncher` and `InProcessLauncher` start a static instance of `LauncherServer`, which is used to keep track of the launched Spark jobs. The `LauncherServer` is not used when submitting an application through `spark-submit`.
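A minimal sketch of programmatic submission with `SparkLauncher`; the Spark home is a placeholder path, and the jar, main class and arguments are taken from the example command above:

```scala
import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

// Forks a child process that runs the spark-submit script
val handle: SparkAppHandle = new SparkLauncher()
  .setSparkHome("/opt/spark") // placeholder path
  .setMaster("yarn")
  .setDeployMode("cluster")
  .setAppResource("hyperdrive-release-latest.jar")
  .setMainClass("za.co.absa.hyperdrive.driver.drivers.CommandLineIngestionDriver")
  .addAppArgs("arg1", "arg2")
  // startApplication registers the job with the static LauncherServer,
  // so state changes are reported back to this JVM through the handle
  .startApplication(new SparkAppHandle.Listener {
    override def stateChanged(h: SparkAppHandle): Unit =
      println(s"State changed: ${h.getState}")
    override def infoChanged(h: SparkAppHandle): Unit = ()
  })
```

`InProcessLauncher` is configured through the same `AbstractLauncher` builder methods and would be a drop-in alternative here, except that `setSparkHome` is a `SparkLauncher`-specific method: running in-process does not fork a `spark-submit` process.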
- Is it really necessary for `spark-submit` to have the full `$SPARK_HOME`, or are only specific files required?