
Ecommerce Sales Prediction using Apache Spark

Contributor(s): Yash Sethia, Ritesh Kumar, Shubham

About

Apache Spark, written in Scala, is a general-purpose distributed data-processing engine. In other words: it loads big data, performs computations on it in a distributed way, and stores the results.

Apache Spark contains libraries for data analysis, machine learning, graph analysis, and streaming live data. Spark is generally faster than Hadoop because Hadoop writes intermediate results to disk (that is, lots of I/O operations), whereas Spark keeps intermediate results in memory (that is, in-memory computation) whenever possible. Moreover, Spark evaluates operations lazily and optimizes them just before producing the final result: it maintains a series of transformations to be performed without actually executing them until we ask for the result. This way, Spark can find the best execution path by looking at all the transformations required (for example, fusing two separate steps that add 5 and then 20 to each element of the dataset into a single step that adds 25, or skipping computation on parts of the dataset that will eventually be filtered out of the final result).
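As a small illustration of this laziness, here is a minimal sketch in PySpark (the Python API introduced below), assuming a running SparkSession named spark. The two map transformations are only recorded; nothing executes until the collect() action:

```python
# Minimal sketch of lazy evaluation, assuming a SparkSession named `spark`.
rdd = spark.sparkContext.parallelize(range(10))

plus_5 = rdd.map(lambda x: x + 5)       # transformation: recorded, not executed
plus_25 = plus_5.map(lambda x: x + 20)  # still nothing has run

# The action below triggers execution; Spark can pipeline both maps
# into a single pass over the partitions.
print(plus_25.collect())  # [25, 26, 27, 28, 29, 30, 31, 32, 33, 34]
```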

This makes Spark one of the most popular tools for big data analytics currently. PySpark is an interface for Apache Spark in Python. It not only allows you to write Spark applications using Python APIs, but also provides the PySpark shell for interactively analyzing your data in a distributed environment.
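For instance, a standalone PySpark application starts by creating a SparkSession (the PySpark shell creates one automatically). A minimal sketch follows; the file name, header, and schema-inference options are illustrative assumptions, not the repository's actual settings:

```python
from pyspark.sql import SparkSession

# Entry point for DataFrame-based Spark applications.
spark = SparkSession.builder \
    .appName("EcommerceSalesPrediction") \
    .getOrCreate()

# Load the data as a distributed DataFrame (hypothetical file name;
# header and inferSchema options are assumptions).
df = spark.read.csv("ecommerce_customers.csv", header=True, inferSchema=True)
df.printSchema()
```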

In this notebook, we have the data of an Ecommerce website with the following fields:

  • Email ID of the user
  • Address of the user
  • Average Session Length of the user
  • Time spent by the user on the app
  • Time spent by the user on the website
  • Length of the membership of the user
  • Yearly amount spent by the user

The data is, however, distributed. We therefore use the PySpark API of Apache Spark to handle this data, bring out meaningful inferences, and try to predict the amount spent by users using Linear Regression; a sketch of such a pipeline is shown below.
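A hedged sketch of that regression step using PySpark MLlib might look like the following. The column names are assumptions based on the field list above and may differ from those used in the notebook:

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

# Combine the numeric fields into a single feature vector
# (column names here are assumptions based on the field list).
assembler = VectorAssembler(
    inputCols=["Avg Session Length", "Time on App",
               "Time on Website", "Length of Membership"],
    outputCol="features",
)
data = assembler.transform(df).select("features", "Yearly Amount Spent")

# Hold out a test set for evaluation.
train, test = data.randomSplit([0.7, 0.3], seed=42)

lr = LinearRegression(labelCol="Yearly Amount Spent")
model = lr.fit(train)

# Evaluate the fitted model on the held-out data.
results = model.evaluate(test)
print("RMSE:", results.rootMeanSquaredError)
print("R2:  ", results.r2)
```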

To run this in your system:

  • Clone this repository
  • In a terminal, navigate to the folder containing this repository and run the following command:

    jupyter notebook

  • This will open the base directory of this repository in your browser. Now, open the notebook (.ipynb) file
  • Run all the cells of this file to see the results
