Data Analysis Project with Apache Hadoop and Apache Spark

Project Description

This is a team project for the Master's degree course "Large Scale Data Management Systems" (2024), part of the "Data Science & Machine Learning" MSc program at NTUA. The project involves analyzing large datasets using advanced data science techniques and tools, with a focus on Apache Hadoop and Apache Spark. The primary objective is to develop practical skills in installing, managing, and utilizing these distributed systems for large-scale data analysis.

Teams consist of two members working together to apply these tools and techniques effectively.

For more details, please refer to the Project_assignment.pdf file.

Objectives

- To become familiar with and develop skills in installing and managing the distributed systems Apache Spark and Apache Hadoop.
- To utilize modern techniques through Spark's APIs for large-scale data analysis.
- To understand the capabilities and limitations of these tools in relation to available resources and chosen configurations.

Datasets

Primary Dataset

Los Angeles Crime Data: This dataset contains crime records for Los Angeles from 2010 to the present. The data is available in CSV format from the following links:

Secondary Datasets

LA Police Stations: Contains the locations of the 21 police stations in Los Angeles. Available in CSV format:
- LA Police Stations Dataset
Median Household Income by Zip Code (Los Angeles County): Contains information about median household income by ZIP code based on census data from 2015, 2017, 2019, and 2021. For this project, only the 2015 data is required. Available in CSV format:
- 2015 Income Data
Reverse Geocoding Dataset: Used to map coordinates (latitude, longitude) to ZIP codes within Los Angeles.
- Download from http://www.dblab.ece.ntua.gr/files/classes/data.tar.gz
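To illustrate how the reverse geocoding dataset might be used, the sketch below attaches a ZIP code to each record by looking up its exact (latitude, longitude) pair. It is plain Python rather than Spark, and everything specific in it is an assumption: the column names `LAT`, `LON` and `ZIPcode`, the sample rows, and the helper names `load_zip_lookup` and `attach_zip` are hypothetical and may not match the real files. In the project itself the same idea would more likely be expressed as a join between two Spark DataFrames.

```python
import csv
import io

# Hypothetical sample rows for illustration only; the real file's
# column names and values may differ.
REVGEO_CSV = """LAT,LON,ZIPcode
34.0500,-118.2500,90012
34.1000,-118.3000,90028
"""


def load_zip_lookup(csv_text):
    """Build a {(lat, lon): zip_code} dictionary from reverse-geocoding CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {
        (float(row["LAT"]), float(row["LON"])): row["ZIPcode"]
        for row in reader
    }


def attach_zip(records, lookup):
    """Return copies of the records with a 'zip' key (None if coordinates are unknown)."""
    return [
        {**rec, "zip": lookup.get((rec["lat"], rec["lon"]))}
        for rec in records
    ]


lookup = load_zip_lookup(REVGEO_CSV)

# Hypothetical crime records: one with known coordinates, one without a match.
crimes = [
    {"id": "a", "lat": 34.05, "lon": -118.25},
    {"id": "b", "lat": 0.0, "lon": 0.0},
]

tagged = attach_zip(crimes, lookup)
print(tagged)
```

The lookup relies on exact coordinate matches, so it only works if the crime records and the reverse geocoding file share the same coordinate values. Records without a match keep `zip` set to `None`, which mirrors the behaviour of a left join.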
