Part 1: Data mining, text mining and Big Data Concepts by examples with Spark and Python

What do data tell us?

What is happening? : Descriptive
Why is it happening? : Diagnostic
What is likely to happen? : Predictive
What do I need to do? : Prescriptive
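The four questions above can be illustrated on a toy dataset. The sketch below is only an illustration: the `ads` and `sales` figures are hypothetical, and a simple least-squares line stands in for the much richer models covered in the activities. Note that a fitted slope only shows an association, so the "diagnostic" step here is a simplification.

```python
# Hypothetical data: monthly advertising budget and resulting sales (arbitrary units).
ads = [1, 2, 3, 4]
sales = [10, 12, 14, 16]


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


# Descriptive: what is happening? Summarise the observed sales.
average_sales = mean(sales)

# Diagnostic: why is it happening? Fit the least-squares line
# sales = slope * ads + intercept to relate sales to the budget.
mean_ads = mean(ads)
slope = (
    sum((a - mean_ads) * (s - average_sales) for a, s in zip(ads, sales))
    / sum((a - mean_ads) ** 2 for a in ads)
)
intercept = average_sales - slope * mean_ads

# Predictive: what is likely to happen? Estimate sales for a budget of 5.
predicted_sales = slope * 5 + intercept

# Prescriptive: what do I need to do? Find the budget needed to reach a target.
target_sales = 20
required_ads = (target_sales - intercept) / slope

print(average_sales, slope, intercept, predicted_sales, required_ads)
```

Each step reuses the result of the previous one, which is the usual progression from describing data to acting on it.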

This part presents techniques and theories of data mining and text mining through simple examples and more realistic use cases.

These hands-on activities are mainly inspired by this workshop: https://github.com/PacktPublishing/Mastering-Big-Data-Analytics-with-PySpark, authored by Danny Meijer. This repository in fact started as a fork of it, adapted to Windows 10 (WSL), with added parts drawn from our courses at Cnam Liban: Ingénierie de la fouille et de la visualisation de données massives (engineering of massive data mining and visualization) and Cours de bases de données documentaires et distribuées (course on document-oriented and distributed databases).

About the Workshop

Spark and its Python API, PySpark, help you perform data analysis at scale and build more scalable analyses and pipelines. These activities start by introducing you to PySpark's potential for performing effective analyses of large datasets. You'll learn how to interact with Spark from Python and connect Jupyter to Spark to provide rich data visualizations. After that, you'll delve into the various Spark components and Spark's architecture.

You'll learn to work with Apache Spark and perform ML tasks. You'll gather and query data using Spark SQL, which helps overcome the challenges involved in reading it. You'll use Resilient Distributed Datasets and the DataFrame API to operate with Spark MLlib and learn about the Pipeline API. Finally, we provide tips and tricks about open and free ('libre') environments for this kind of work: data analysis, machine learning, development, IDEs, ...

By the end of this course, you will not only be familiar with data analytics but will also have learned to use PySpark to easily analyze large datasets at scale in organizations.

What You Will Learn

Install a development environment for data analytics and machine learning using the Spark framework
Gain better knowledge of Python, machine learning, functional programming, distributed paradigms for data analysis, and interactive tools for learning and experimenting
Gain knowledge of vital Data Analytics concepts via practical use cases
Create elegant data visualizations using Jupyter
Run, process, and analyze large datasets using PySpark
Utilize Spark SQL to easily load big data into DataFrames
Create fast and scalable Machine Learning applications using MLlib with Spark
Perform exploratory Data Analysis in a scalable way
Achieve scalable, high-throughput and fault-tolerant processing of data streams using Spark Streaming

Learning-by-example lectures (activities): prerequisites, organisation and outcomes

Each lecture will be presented with a short video
You will have to try it yourself. Do not be afraid, it will be straightforward: you clone a git repository and work in the interactive environment JupyterLab
Several meetings with me will be held to answer your questions and give further explanations if needed
You will use the issue tracker for your questions and answers
Google Meet will be used for the web conferences, which we will plan after the first session (16/11/2020, 17h)
Each activity is approximately a 15-minute video plus 30 minutes of personal work
These sessions (starting 16/11/2020, 17h) will be composed of 6 activities

The fundamentals behind machine learning, data mining and text mining will be presented gradually by covering specific activities

Platforms: Google Classroom, Google Drive, GitHub, Windows 10 WSL, Linux, Spark, Python, PySpark and Jupyter Notebook.

What is Spark? Installing the environment: WSL (Windows Subsystem for Linux), Java, Python, Spark, PySpark and JupyterLab

Core concepts behind Hadoop and Spark: the filter-map-reduce paradigm
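As a minimal sketch of the filter-map-reduce idea, the snippet below uses only Python's built-in `filter` and `map` and `functools.reduce` on a small in-memory list. The sample `lines` are hypothetical and stand in for a large distributed dataset. PySpark RDDs expose methods with the same names (`filter`, `map`, `reduce`), so the same three-step shape carries over, with Spark distributing the work across a cluster.

```python
from functools import reduce

# Hypothetical sample data: a few lines of text standing in for a large dataset.
lines = [
    "spark makes big data simple",
    "",
    "hadoop introduced map reduce",
    "spark extends map reduce",
]

# filter: keep only the non-empty lines.
non_empty = filter(lambda line: line.strip() != "", lines)

# map: transform each remaining line into its number of words.
word_counts = map(lambda line: len(line.split()), non_empty)

# reduce: combine the per-line counts into a single total.
total_words = reduce(lambda a, b: a + b, word_counts, 0)

print(total_words)  # 13
```

Because `filter` and `map` are lazy in Python 3, nothing is computed until `reduce` consumes the values, which loosely mirrors how Spark defers transformations until an action is called.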

A series of activities to manipulate PySpark and collect data, gradually presenting the fundamentals behind machine learning, data mining and text mining


Repository: opentrainingcamp/DataMiningSpark
