Throughout this lesson, you perform a case study based on data from DonorsChoose.org, a non-profit dedicated to supporting teachers in the classroom.
You focus on performing tasks you may be familiar with from the standard data science process, this time with data at scale.
You start with a formalization of the data science workflow and see where Spark fits into building a machine learning pipeline.
From there, you use Spark and its higher-level libraries to explore the DonorsChoose.org datasets and compute basic statistics to better understand the data.
Once you have done some basic exploratory analysis, you dive into a primer on natural language processing and see how to use text as an input to a model.
With properly vectorized data, you implement your own machine learning model with Spark to deepen your understanding of machine learning and see how to build a more complex algorithm with the framework.
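Before reaching for Spark's MLlib (which ships `HashingTF` and `IDF` transformers for this), it can help to see the tf-idf weighting scheme in miniature. The sketch below is plain single-machine Python, not the Spark API, and the toy documents are invented for illustration:

```python
import math
from collections import Counter

def tf_idf(docs):
    """tf-idf weights for a list of already-tokenized documents."""
    n = len(docs)
    df = Counter()                      # document frequency per term
    for doc in docs:
        df.update(set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)               # raw term frequency
        weighted.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weighted

# toy corpus: a term shared across documents ("classroom") is
# down-weighted relative to terms unique to one document
docs = [["funding", "classroom", "books"],
        ["classroom", "supplies"],
        ["books", "books", "reading"]]
weights = tf_idf(docs)
```

The intuition carries over directly to the distributed setting: a term frequent in one project essay but rare across the corpus gets a high weight, which is what makes tf-idf useful for finding the words that characterize a document.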
- Understand how Spark can help with common tasks from the Data Science process
- Perform EDA on the DonorsChoose.org project data with Spark
- See how to perform basic data quality checks at scale with Spark
- Visualize the distribution of your data locally with pandas after aggregating and summarizing it with Spark
- Understand the basics of tokenizing and vectorizing text
- See how to leverage NLP techniques (like tf-idf weighting) to make sense of text
- Understand the basics of the different types of machine learning
- Perform k-means clustering on DonorsChoose.org project essays
- Understand how to interpret the results of k-means, as well as the limitations of the algorithm
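Spark's MLlib provides a distributed `KMeans`, but the underlying procedure (Lloyd's algorithm) fits in a few lines of plain Python. A minimal single-machine sketch, with made-up 2-D points standing in for vectorized essays:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm: alternate assigning points to the nearest
    centroid and moving each centroid to its cluster's mean."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)   # initialize from random points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assignment step: nearest centroid by squared distance
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
        # update step: each centroid moves to its cluster's mean
        for j, members in enumerate(clusters):
            if members:  # keep the old centroid if its cluster emptied
                centroids[j] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
    return centroids

# two well-separated blobs; k-means should recover their means
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers = sorted(kmeans(points, 2))
```

The sketch also makes the algorithm's limitations concrete: results depend on the random initialization, `k` must be chosen up front, and the squared-distance assignment favors roughly spherical clusters.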
- 3.3 - 3.5: donors_choose_eda.ipynb
- 3.6 - 3.8: natural_language_processing.ipynb
- 3.10 - 3.12: kmeans.ipynb
- Extra: nlp_ec2cluster.ipynb
- Quora: What is the work flow or process of a data scientist?
- Stitch Fix: Rapid Development & Performance in Spark For Data Scientists
- HyperLogLog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm
- Sketch of the Day: HyperLogLog — Cornerstone of a Big Data Infrastructure
- Spark Docs (1.4.1): DataFrame NA Functions
- Spark Docs (1.4.1): Accumulators
- Statistical and Mathematical Functions with DataFrames in Spark
- A simple algorithm for finding frequent elements in streams and bags
- Wikipedia: Summary Statistics
- Wikipedia: Anscombe's quartet
- Pandas Plotting
- Plotting Spark DataFrames with Plotly
- Tf-idf: A Single-Page Tutorial
- Tutorial: Finding Important Words in Text Using TF-IDF
- Text Summarization: Generating Snippets
- Spark Docs (1.4.1): Feature Extraction and Transformation