Lesson 3: Your First Spark Application

Throughout this lesson, you perform a case study based on data from DonorsChoose.org, a non-profit dedicated to supporting teachers in the classroom.

You focus on performing tasks you may already know from the standard data science process, but this time with data at scale.

You start with a formalization of the data science workflow and see where Spark can help you build a machine learning pipeline.

From there, you use Spark and its higher-level libraries to explore the DonorsChoose.org datasets and compute basic statistics to better understand the data.

Once you have done some basic exploratory analysis, you dive into a primer on natural language processing and see how to use text as an input to a model.
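To make the idea concrete before reaching for Spark's MLlib equivalents, here is a toy illustration of tokenization and tf-idf weighting in plain Python. The three example sentences are invented for illustration; the point is the arithmetic: a term that appears in fewer documents gets a higher weight.

```python
import math

docs = [
    "students need new books",
    "students need art supplies",
    "classroom needs books",
]

# Tokenize: lowercase and split on whitespace.
tokenized = [d.lower().split() for d in docs]

def tf_idf(term, doc_tokens, corpus):
    # Term frequency: share of the document's tokens that are this term.
    tf = doc_tokens.count(term) / len(doc_tokens)
    # Document frequency: how many documents contain the term.
    df = sum(1 for d in corpus if term in d)
    # Inverse document frequency: rarer terms get larger weights.
    idf = math.log(len(corpus) / df)
    return tf * idf

# "books" appears in 2 of 3 documents; "new" in only 1 of 3,
# so "new" receives the higher tf-idf weight in the first document.
w_books = tf_idf("books", tokenized[0], tokenized)
w_new = tf_idf("new", tokenized[0], tokenized)
```

This is exactly the weighting you later apply at scale to the project essays, where it surfaces the distinctive vocabulary of each essay.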

With properly vectorized data, you implement your own machine learning model with Spark to understand a little bit more about machine learning and see how you might build a complex algorithm with the framework.
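The core of k-means, the algorithm you implement in this lesson, is a short assign/update loop (Lloyd's algorithm). The single-machine sketch below shows that loop on one-dimensional toy points; the Spark version distributes the assignment step across partitions, but the logic is the same. Points and starting centroids here are invented, not from the DonorsChoose data.

```python
def kmeans(points, centroids, iters=10):
    """Lloyd's algorithm on 1-D points, for illustration only."""
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            idx = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[idx].append(p)
        # Update step: each centroid moves to the mean of its cluster
        # (an empty cluster keeps its old centroid).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

centers = kmeans([1.0, 1.2, 0.8, 9.0, 9.5, 8.5], centroids=[0.0, 5.0])
```

Because both steps are just distance computations and per-group means, they map naturally onto Spark transformations and aggregations, which is why k-means is a good first algorithm to build with the framework.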

Objectives

  • Understand how Spark can help with common tasks from the Data Science process
  • Perform EDA on the DonorsChoose.org project data with Spark
  • See how to perform basic data quality checks at scale with Spark
  • Visualize the distribution of your data locally with pandas after aggregating and summarizing it with Spark
  • Understand the basics of tokenizing and vectorizing text
  • See how to leverage NLP techniques (like tf-idf weighting) to make sense of text
  • Understand the basics of the different types of machine learning
  • Perform k-means clustering on DonorsChoose.org project essays
  • Understand how to interpret the results of k-means, as well as the limitations of the algorithm

Examples

References

3.1: How Spark Fits into the Data Science Process

3.2: Introduction to Exploratory Data Analysis

3.3: Case Study: DonorsChoose.org

3.4: Data Quality Checks with Accumulators

3.5: Making Sense of Data: Summary Statistics and Distributions

3.6: Working with Text: Introduction to NLP

3.7: Tokenization and Vectorization with Spark

3.8: Summarization with tf-idf

3.9: Introduction to Machine Learning

3.10: Unsupervised Learning with Spark: Implementing k-means

3.11: Testing k-means with DonorsChoose.org Essays

3.12: Challenges of k-means: Latent Features, Interpretation, and Validation