Lesson 2: Spark Programming APIs

In this lesson, you see how Spark makes large-scale data analysis much more accessible by providing application programming interfaces in languages familiar to data scientists.

We start by using Python and the PySpark interface to perform basic data analysis on a large dataset.
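
As a taste of what that looks like, the sketch below loads a dataset, parses it, and counts records with PySpark. The file name flights.csv and its comma-separated layout are placeholder assumptions, not the lesson's actual data.

```python
# A minimal PySpark sketch: load a raw dataset, parse it, and count records.
# "flights.csv" and its comma-separated layout are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lesson2-sketch").getOrCreate()

# Read the raw file as an RDD of lines.
lines = spark.sparkContext.textFile("flights.csv")

# Drop the header row and split each remaining line into fields.
header = lines.first()
rows = (lines.filter(lambda line: line != header)
             .map(lambda line: line.split(",")))

print(rows.count())  # number of data records
print(rows.take(2))  # peek at the first two parsed rows
```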

Throughout the lesson, you learn how PySpark and SparkR interface with a Spark cluster and the Spark core framework, which is written in Scala.

With the SparkR API, you see how to reason about your data from a higher level using the DataFrame abstraction.

And finally, you see how adding structure to your data with Spark SQL lets you construct complex queries that analyze your data efficiently.
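
To make that workflow concrete before the segments below: read data into a structured DataFrame, register it as a temporary view, and query it with SQL. The column names origin and dep_delay are assumptions for illustration, not the lesson's schema.

```python
# Sketch of the Spark SQL workflow: structure the data, then query it.
# Column names (origin, dep_delay) are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lesson2-sql-sketch").getOrCreate()

# Reading with a header and schema inference yields a structured DataFrame.
df = spark.read.csv("flights.csv", header=True, inferSchema=True)

# Registering a temporary view makes the DataFrame queryable with SQL.
df.createOrReplaceTempView("flights")

# Spark's optimizer turns the declarative query into an efficient plan.
spark.sql("""
    SELECT origin, AVG(dep_delay) AS avg_delay
    FROM flights
    GROUP BY origin
    ORDER BY avg_delay DESC
    LIMIT 10
""").show()
```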

Objectives

  • Understand how the extensibility of the core Spark framework enables programming APIs to be developed in other languages
  • Work with the PySpark API to perform an end-to-end analysis of flight-delay data
  • See how the DataFrame API in Spark allows for efficient analyses, both in terms of developer productivity and execution performance
  • Aggregate and filter a large dataset with Spark to visualize locally with ggplot2 (see the sketch after this list)
  • Understand how to integrate existing SQL workflows with Spark
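
Here is the sketch referenced above: the aggregate-then-collect pattern behind the data-analysis objectives. Filtering and grouping run on the cluster, and only the small aggregated result comes back to the driver for local visualization. The lesson does its plotting in SparkR with ggplot2; this PySpark equivalent and its column names are assumptions.

```python
# Sketch of cluster-side aggregation followed by local collection.
# Column names (dep_delay, carrier) are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lesson2-delays-sketch").getOrCreate()
df = spark.read.csv("flights.csv", header=True, inferSchema=True)

# Run the heavy work on the cluster: keep meaningfully delayed flights,
# then compute the average delay and flight count per carrier.
delays = (df.filter(F.col("dep_delay") > 15)
            .groupBy("carrier")
            .agg(F.avg("dep_delay").alias("avg_delay"),
                 F.count("*").alias("n_flights")))

# Only the few aggregated rows travel back to the driver, where they
# can be handed to a local plotting library.
local = delays.toPandas()
print(local)
```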

Examples

2.1: Introduction to the Spark Programming APIs

2.2: PySpark: Loading and Importing Data

2.3: PySpark: Parsing and Transforming Data

2.4: PySpark: Analyzing Flight Delays

2.5: SparkR: Introduction to DataFrames

2.6: SparkR: Aggregations and Analysis

2.7: SparkR: Visualizing Data with ggplot2

2.8: Why (Spark) SQL?

2.9: Spark SQL: Adding Structure to Your Data

2.10: Spark SQL: Integration into Existing Workflows