Throughout this lesson, you perform a case study based on data from DonorsChoose.org, a non-profit dedicated to supporting teachers in the classroom.
You focus on performing tasks you may be familiar with from the standard data science process, this time with data at scale.
You start with a formalization of the data science workflow and see where Spark fits into building a machine learning pipeline.
From there, you use Spark and its higher-level libraries to explore the DonorsChoose.org datasets and compute basic statistics to better understand the data.
Once you have done some basic exploratory analysis, you dive into a primer on natural language processing and see how to use text as an input to a model.
With properly vectorized data, you implement your own machine learning model with Spark to deepen your understanding of machine learning and see how to build a more complex algorithm with the framework.
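Before reaching for Spark's MLlib (which ships `HashingTF` and `IDF` transformers for this), it can help to see the tf-idf weighting scheme in miniature. The sketch below is plain single-machine Python, not the Spark API, and the toy documents are invented for illustration:

```python
import math
from collections import Counter

def tf_idf(docs):
    """tf-idf weights for a list of already-tokenized documents."""
    n = len(docs)
    df = Counter()                      # document frequency per term
    for doc in docs:
        df.update(set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)               # raw term frequency
        weighted.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weighted

# toy corpus: a term shared across documents ("classroom") is
# down-weighted relative to terms unique to one document
docs = [["funding", "classroom", "books"],
        ["classroom", "supplies"],
        ["books", "books", "reading"]]
weights = tf_idf(docs)
```

The intuition carries over directly to the distributed setting: a term frequent in one project essay but rare across the corpus gets a high weight, which is what makes tf-idf useful for finding the words that characterize a document.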
- Understand how Spark can help with common tasks from the Data Science process
- Perform EDA on the DonorsChoose.org project data with Spark
- See how to perform basic data quality checks at scale with Spark
- Visualize the distribution of your data locally with pandas after aggregating and summarizing it with Spark
- Understand the basics of tokenizing and vectorizing text
- See how to leverage NLP techniques (like tf-idf weighting) to make sense of text
- Understand the basics of the different types of machine learning
- Perform k-means clustering on DonorsChoose.org project essays
- Understand how to interpret the results of k-means, as well as the limitations of the algorithm
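Spark's MLlib provides a distributed `KMeans`, but the underlying procedure (Lloyd's algorithm) fits in a few lines of plain Python. A minimal single-machine sketch, with made-up 2-D points standing in for vectorized essays:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm: alternate assigning points to the nearest
    centroid and moving each centroid to its cluster's mean."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)   # initialize from random points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assignment step: nearest centroid by squared distance
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
        # update step: each centroid moves to its cluster's mean
        for j, members in enumerate(clusters):
            if members:  # keep the old centroid if its cluster emptied
                centroids[j] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
    return centroids

# two well-separated blobs; k-means should recover their means
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers = sorted(kmeans(points, 2))
```

The sketch also makes the algorithm's limitations concrete: results depend on the random initialization, `k` must be chosen up front, and the squared-distance assignment favors roughly spherical clusters.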
- 3.3 - 3.5: donors_choose_eda.ipynb
- 3.6 - 3.8: natural_language_processing.ipynb
- 3.10 - 3.12: kmeans.ipynb
- Extra: nlp_ec2cluster.ipynb
- Quora: What is the work flow or process of a data scientist?
- Stitch Fix: Rapid Development & Performance in Spark For Data Scientists
- HyperLogLog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm
- Sketch of the Day: HyperLogLog — Cornerstone of a Big Data Infrastructure
- Spark Docs (1.4.1): DataFrame NA Functions
- Spark Docs (1.4.1): Accumulators
- Statistical and Mathematical Functions with DataFrames in Spark
- A simple algorithm for finding frequent elements in streams and bags
- Wikipedia: Summary Statistics
- Wikipedia: Anscombe's quartet
- Pandas Plotting
- Plotting Spark DataFrames with Plotly
- Tf-idf: A Single-Page Tutorial
- Tutorial: Finding Important Words in Text Using TF-IDF
- Text Summarization: Generating Snippets
- Spark Docs (1.4.1): Feature Extraction and Transformation