How Spark Submit works
The following explanations are mainly meant to help understand the submission of YARN jobs in cluster mode.
- The `spark-submit` script launches a Java process using the `spark-class` script, with main class `org.apache.spark.deploy.SparkSubmit`.
- `spark-class`
  - finds `java`
  - executes `$SPARK_HOME/conf/spark-env.sh` through `$SPARK_HOME/bin/load-spark-env.sh`
  - starts the Java process
- `org.apache.spark.launcher.Main` builds the command to be executed by `spark-class`, e.g. something like:
```
/usr/lib/jvm/java-1.8.0-openjdk/jre/bin/java -cp /opt/spark/conf/:/opt/spark/jars/*:/opt/hadoop/ \
  -Dscala.usejavacp=true org.apache.spark.deploy.SparkSubmit --master yarn --deploy-mode cluster \
  --conf spark.executor.memory=1g --conf spark.driver.memory=1g \
  --class za.co.absa.hyperdrive.driver.drivers.CommandLineIngestionDriver --name Hyperdrive \
  --jars spark-jobs-current.jar hyperdrive-release-latest.jar arg1 arg2
```
- `SparkSubmit`
  - `main` -> `doSubmit` -> `submit` -> `doRunMain` -> `runMain` -> `prepareSubmitEnvironment`
  - `prepareSubmitEnvironment` returns, among other things, the main class to execute.
  - Also, login with keytab and principal is done here: https://github.com/apache/spark/blob/v3.2.0/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L359 (see the sketch below)
  - For a YARN job in cluster mode, it returns the YARN client main class, which is `org.apache.spark.deploy.yarn.YarnClusterApplication`.
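As a hedged illustration of what that keytab login involves, the sketch below uses the Hadoop `UserGroupInformation` API, which performs Kerberos logins in this context. The principal and keytab path are made-up placeholders, and Spark's actual code at the linked line does more (e.g. propagating the values into the `SparkConf`):

```scala
import java.io.File
import org.apache.hadoop.security.UserGroupInformation

// Placeholder values standing in for the --principal and --keytab arguments
val principal = "hyperdrive@EXAMPLE.COM"
val keytab = "/etc/security/keytabs/hyperdrive.keytab"

// The keytab file must exist locally before attempting the login
require(new File(keytab).exists(), s"Keytab file: $keytab does not exist")

// Logs the current JVM in via Kerberos, so that subsequent
// HDFS and YARN calls made during submission are authenticated
UserGroupInformation.loginUserFromKeytab(principal, keytab)
```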
- `org.apache.spark.deploy.yarn.Client`
  - `YarnClusterApplication.start` -> `Client.run` -> `Client.submitApplication`
  - The `__spark_conf__.zip` archive is created here: https://github.com/apache/spark/blob/v3.2.1/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L750
  - The upload of jars happens here: https://github.com/apache/spark/blob/v3.2.0/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L391
  - `submitApplication` calls `yarnClient.createApplication()` and then `yarnClient.submitApplication(ApplicationSubmissionContext)`, as sketched below.
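To make those two calls concrete, here is a stripped-down sketch using the Hadoop `YarnClient` API directly. Spark's `Client.submitApplication` does considerably more in between, notably uploading resources and building the ApplicationMaster's container launch context; the application name below is a placeholder:

```scala
import org.apache.hadoop.yarn.client.api.YarnClient
import org.apache.hadoop.yarn.conf.YarnConfiguration

val yarnClient = YarnClient.createYarnClient()
yarnClient.init(new YarnConfiguration())
yarnClient.start()

// First call: asks the ResourceManager for a new application id
val newApp = yarnClient.createApplication()
val appContext = newApp.getApplicationSubmissionContext
appContext.setApplicationName("Hyperdrive") // placeholder name

// Spark's Client would now populate the context further:
// AM container launch context, resource requests, queue, tags, ...

// Second call: submits the populated context to the ResourceManager
val appId = yarnClient.submitApplication(appContext)
println(s"Submitted application $appId")
```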
- There are two implementations of `AbstractLauncher`, which can be used to programmatically launch a Spark job with the method `startApplication` (see the sketch below):
  - `SparkLauncher` creates a `java.lang.Process` that executes the `spark-submit` script.
  - `InProcessLauncher` calls `org.apache.spark.deploy.InProcessSparkSubmit.main()` directly within a thread (using `new Thread()`).
  - Both `SparkLauncher` and `InProcessLauncher` start a static instance of `LauncherServer`, which is used to keep track of the launched Spark jobs. The `LauncherServer` is not used when submitting an application through `spark-submit`.
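A minimal sketch of programmatic submission with `SparkLauncher`; the Spark home is a placeholder path, and the jar, main class and arguments are taken from the example command above:

```scala
import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

// Forks a child process that runs the spark-submit script
val handle: SparkAppHandle = new SparkLauncher()
  .setSparkHome("/opt/spark") // placeholder path
  .setMaster("yarn")
  .setDeployMode("cluster")
  .setAppResource("hyperdrive-release-latest.jar")
  .setMainClass("za.co.absa.hyperdrive.driver.drivers.CommandLineIngestionDriver")
  .addAppArgs("arg1", "arg2")
  // startApplication registers the job with the static LauncherServer,
  // so state changes are reported back to this JVM through the handle
  .startApplication(new SparkAppHandle.Listener {
    override def stateChanged(h: SparkAppHandle): Unit =
      println(s"State changed: ${h.getState}")
    override def infoChanged(h: SparkAppHandle): Unit = ()
  })
```

`InProcessLauncher` is configured through the same `AbstractLauncher` builder methods and would be a drop-in alternative here, except that `setSparkHome` is a `SparkLauncher`-specific method: running in-process does not fork a `spark-submit` process.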
- Is it really necessary for `spark-submit` to have the full `$SPARK_HOME`, or are only specific files required?