
kprimetis/Big_Data_Course_Project


Data Analysis Project with Apache Hadoop and Apache Spark

Project Description

This is a team project for the Master's degree course "Large Scale Data Management Systems" (2024), part of the "Data Science & Machine Learning" MSc program at NTUA. The project involves analyzing large datasets using advanced data science techniques and tools, with a focus on Apache Hadoop and Apache Spark. The primary objective is to develop practical skills in installing, managing, and utilizing these distributed systems for large-scale data analysis.

Teams consist of two members working together to apply these tools and techniques effectively.

For more details, please refer to the project.pdf file.

Objectives

  • To become familiar with installing and managing the distributed systems Apache Spark and Apache Hadoop, and to build practical skill in doing so.
  • To utilize modern techniques through Spark's APIs for large-scale data analysis.
  • To understand the capabilities and limitations of these tools in relation to available resources and chosen configurations.

Datasets

Primary Dataset

Los Angeles Crime Data: This dataset contains crime records for Los Angeles from 2010 to the present. The data is available in CSV format from the following links:

Secondary Datasets

  1. LA Police Stations: Contains the locations of the 21 police stations in Los Angeles. Available in CSV format:

  2. Median Household Income by Zip Code (Los Angeles County): Contains information about median household income by ZIP code based on census data from 2015, 2017, 2019, and 2021. For this project, only the 2015 data is required. Available in CSV format:

  3. Reverse Geocoding Dataset: This dataset is crucial for mapping coordinates (latitude, longitude) to ZIP Codes within Los Angeles.
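
To make the role of the reverse geocoding dataset concrete, here is a minimal sketch of the coordinate-to-ZIP lookup, written in plain Python so it runs without a cluster. The column names (`DR_NO`, `LAT`, `LON`, `ZIPcode`) and the sample rows are assumptions made up for illustration, not the real schemas or records. In the project itself the same idea would more likely be expressed as a Spark DataFrame join on the coordinate columns.

```python
import csv
import io

# Synthetic rows for illustration only. The column names are assumptions,
# not the actual schemas of the crime or reverse geocoding datasets.
CRIMES_CSV = """DR_NO,LAT,LON
1,34.0141,-118.2978
2,34.0459,-118.2545
3,34.2000,-118.5000
"""

REVGEOCODING_CSV = """LAT,LON,ZIPcode
34.0141,-118.2978,90007
34.0459,-118.2545,90013
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def build_zip_lookup(geo_rows):
    """Index the reverse geocoding rows by their (LAT, LON) pair."""
    return {(row["LAT"], row["LON"]): row["ZIPcode"] for row in geo_rows}


def attach_zip(crime_rows, lookup):
    """Left join: keep every crime row, with ZIPcode set to None if unmatched."""
    result = []
    for row in crime_rows:
        key = (row["LAT"], row["LON"])
        result.append({**row, "ZIPcode": lookup.get(key)})
    return result


crimes_with_zip = attach_zip(
    read_rows(CRIMES_CSV),
    build_zip_lookup(read_rows(REVGEOCODING_CSV)),
)

for row in crimes_with_zip:
    print(row["DR_NO"], row["ZIPcode"])
```

The lookup compares coordinates as exact strings, so it only matches when both files write a coordinate identically. Real data may need normalization (for example, consistent rounding) before joining. Unmatched records are kept with a ZIP code of `None` rather than dropped, so missing coverage stays visible.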
