opentrainingcamp/DataMiningSpark

What do data tell us?

  • What is happening? : Descriptive
  • Why is it happening? : Diagnostic
  • What is likely to happen? : Predictive
  • What do I need to do? : Prescriptive

This part presents techniques and theories of data mining and text mining through simple examples and more realistic use cases.

These hands-on activities are mainly inspired by this workshop: https://github.com/PacktPublishing/Mastering-Big-Data-Analytics-with-PySpark (authored by Danny Meijer). This repository in fact started as a fork, adapted to Windows 10 (WSL), with additional parts drawn from our courses at Cnam Liban: "Ingénierie de la fouille et de la visualisation de données massives" and "Cours de bases de données documentaires et distribuées".

About the WorkShop

Spark and its Python API, PySpark, help you perform data analysis at scale and build more scalable analyses and pipelines. These activities start by introducing you to PySpark's potential for performing effective analyses of large datasets. You'll learn how to interact with Spark from Python and connect Jupyter to Spark to provide rich data visualizations. After that, you'll delve into Spark's architecture and its various components.

You'll learn to work with Apache Spark and perform ML tasks, gathering and querying data with Spark SQL to overcome the challenges involved in reading it. You'll use Resilient Distributed Datasets and the DataFrame API to operate with Spark MLlib and learn about the Pipeline API. Finally, we provide tips and tricks about open and free ('libre') environments for this kind of work: data analysis, machine learning, development, IDE, ...

By the end of this course, you will not only be familiar with data analytics but will also have learned to use PySpark to analyze large datasets at scale in organizations.

What You Will Learn

  • Install a development environment for data analytics and machine learning using the Spark framework
  • Gain better knowledge of Python, machine learning, functional programming, distributed paradigms for data analysis, and interactive tools for learning and experimenting
  • Gain knowledge of vital Data Analytics concepts via practical use cases
  • Create elegant data visualizations using Jupyter
  • Run, process, and analyze large chunks of datasets using PySpark
  • Utilize Spark SQL to easily load big data into DataFrames
  • Create fast and scalable Machine Learning applications using MLlib with Spark
  • Perform exploratory Data Analysis in a scalable way
  • Achieve scalable, high-throughput and fault-tolerant processing of data streams using Spark Streaming

Learning-by-example lectures (activities): prerequisites, organisation and outcomes

  1. Each lecture will be presented with a short video.
  2. You will have to try it yourself. Do not be afraid, it will be straightforward: clone a git repository and use an interactive environment, JupyterLab.
  3. Several meetings with me will be held for your questions and further explanations if needed.
  4. You will use the issue tracker for your questions and answers.
  5. Google Meet will be used for the web conferences, which we will plan after the first session (16/11/2020, 17h).
  6. Each activity is approximately a 15-minute video plus 30 minutes of personal work.
  7. These sessions (starting 16/11/2020, 17h) will be composed of 6 activities.

The fundamentals behind machine learning, data mining and text mining will be presented gradually over the course of specific activities.

Platforms: Google Classroom, Google Drive, GitHub, Windows 10 WSL, Linux, Spark, Python, PySpark and Jupyter Notebook.

What is Spark? Installing the environment: WSL (Windows Subsystem for Linux), Java, Python, Spark, PySpark and JupyterLab

Core concepts behind Hadoop and Spark: the filter-map-reduce paradigm
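
As a rough illustration of the filter, map and reduce steps, here is a minimal sketch in plain Python (no Spark needed). The sample `lines` data and the variable names are assumptions made for this example, not material from the course. It keeps the non-empty lines (filter), turns each line into its word count (map), and sums the counts into a single value (reduce).

```python
from functools import reduce

# Hypothetical sample data, invented for this illustration.
lines = [
    "spark makes big data simple",
    "",
    "map and reduce are core ideas",
    "filter drops unwanted records",
]

# filter: keep only the lines that contain some text
non_empty = filter(lambda line: line.strip() != "", lines)

# map: transform each remaining line into its number of words
word_counts = map(lambda line: len(line.split()), non_empty)

# reduce: combine all per-line counts into one total
total_words = reduce(lambda a, b: a + b, word_counts, 0)

print(total_words)
```

PySpark RDDs expose methods with the same names (`filter`, `map`, `reduce`), so the same three-step shape carries over when the data is distributed across a cluster instead of held in a local list.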

A series of activities to manipulate PySpark and collect data, gradually presenting the fundamentals behind machine learning, data mining and text mining

About

Data and text mining with Spark
