- What is happening? : Descriptive
- Why is it happening? : Diagnostic
- What is likely to happen? : Predictive
- What do I need to do? : Prescriptive
This part presents techniques and theories of data mining and text mining, through simple examples and more realistic use cases.
These hands-on activities are mainly inspired by this workshop: https://github.com/PacktPublishing/Mastering-Big-Data-Analytics-with-PySpark, authored by Danny Meijer. They started as a fork of that repository, adapted to Windows 10 (WSL) and extended with material from our courses at Cnam Liban: *Ingénierie de la fouille et de la visualisation de données massives* and *Cours de bases de données documentaires et distribuées*.
Spark and its Python API, PySpark, help you perform data analysis at scale; they enable you to build more scalable analyses and pipelines. These activities start by introducing you to PySpark's potential for performing effective analyses of large datasets. You'll learn how to interact with Spark from Python and connect Jupyter to Spark to produce rich data visualizations. After that, you'll delve into the various Spark components and the Spark architecture.
You'll learn to work with Apache Spark and perform machine learning tasks, gathering and querying data with Spark SQL to overcome the challenges of reading large datasets. You'll use Resilient Distributed Datasets (RDDs) and the DataFrame API to work with Spark MLlib, and learn about the Pipeline API. Finally, we provide tips and tricks about open and free ('libre') environments for this kind of work: data analysis, machine learning, development, IDEs, and more.
By the end of this course, you will not only be familiar with data analytics but will also have learned to use PySpark to easily analyze large datasets at scale in organizations.
- Install a development environment for data analytics and machine learning based on the Spark framework
- Deepen your knowledge of Python, machine learning, functional programming, distributed paradigms for data analysis, and interactive tools for learning and experimenting
- Gain knowledge of vital Data Analytics concepts via practical use cases
- Create elegant data visualizations using Jupyter
- Run, process, and analyze large datasets using PySpark
- Utilize Spark SQL to easily load big data into DataFrames
- Create fast and scalable Machine Learning applications using MLlib with Spark
- Perform exploratory Data Analysis in a scalable way
- Achieve scalable, high-throughput and fault-tolerant processing of data streams using Spark Streaming
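The "functional programming" and "distributed paradigms" in the objectives above are closely related: Spark's RDD operations mirror plain functional primitives. As a rough analogy in pure Python (no Spark involved), a pipeline like `rdd.map(...).reduce(...)` has the same shape as:

```python
from functools import reduce

data = range(1, 11)

# map: apply a pure function to every element
# (Spark applies it independently on each partition)
squares = map(lambda x: x * x, data)

# reduce: combine results with an associative operation, so partial
# results computed on different machines can be merged in any order
total = reduce(lambda a, b: a + b, squares)

print(total)  # sum of squares of 1..10 -> 385
```

Because the mapped function is pure and the reduction is associative, the work can be split across machines without changing the result; that is the core idea the course builds on.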
- Each lecture will be presented with a short video
- You will try things by yourself, by cloning a Git repository and working in an interactive JupyterLab environment; do not be afraid, it will be straightforward
- Several meetings with me will be scheduled for your questions and further explanations if needed
- You will use the issue tracker for your questions and answers
- Web conferences will be held over Google Meet; we will plan them after the first session (16/11/2020, 17h)
- Each activity is approximately a 15-minute video plus 30 minutes of personal work
- These sessions (starting 16/11/2020, 17h) will consist of 6 activities
The fundamentals of machine learning, data mining, and text mining will be presented gradually through these specific activities.
Platforms: Google Classroom, Google Drive, GitHub, Windows 10 WSL, Linux, Spark, Python, PySpark, and Jupyter Notebook.