This is a team project for the Master's degree course "Large Scale Data Management Systems" (2024), part of the "Data Science & Machine Learning" MSc program at NTUA. The project involves analyzing large datasets using advanced data science techniques and tools, with a focus on Apache Hadoop and Apache Spark. The primary objective is to develop practical skills in installing, managing, and utilizing these distributed systems for large-scale data analysis.
Teams consist of two members working together to apply these tools and techniques effectively.
For the full assignment description, see the project.pdf file.
- To become familiar with and develop skills in installing and managing the distributed systems Apache Spark and Apache Hadoop.
- To utilize modern techniques through Spark's APIs for large-scale data analysis.
- To understand the capabilities and limitations of these tools in relation to available resources and chosen configurations.
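
One way to explore how resources and configuration affect a job is to run the same script under a small grid of executor settings and compare the results. The following is a minimal sketch of that idea, not the method prescribed by the assignment. The property names (`spark.executor.instances`, `spark.executor.cores`, `spark.executor.memory`) and the `--conf` flag of `spark-submit` are standard Spark, but the values in the grid and the script name `query.py` are hypothetical placeholders.

```python
from itertools import product

# Hypothetical resource grid; adjust to what the cluster actually offers.
EXECUTOR_INSTANCES = [2, 4]
EXECUTOR_CORES = [1, 2]
EXECUTOR_MEMORY = ["2g"]


def config_grid(instances, cores, memory):
    """Return one dict of Spark properties per combination of settings."""
    return [
        {
            "spark.executor.instances": str(n),
            "spark.executor.cores": str(c),
            "spark.executor.memory": m,
        }
        for n, c, m in product(instances, cores, memory)
    ]


def spark_submit_command(script, conf):
    """Build a spark-submit argument list with one --conf flag per property."""
    args = ["spark-submit"]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args.append(script)
    return args


if __name__ == "__main__":
    # "query.py" is a placeholder name for the job being measured.
    grid = config_grid(EXECUTOR_INSTANCES, EXECUTOR_CORES, EXECUTOR_MEMORY)
    for conf in grid:
        print(" ".join(spark_submit_command("query.py", conf)))
```

Each printed line is a command that could be run and timed separately, so that execution times can be compared across configurations.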
Los Angeles Crime Data: This dataset contains crime records for Los Angeles from 2010 to the present. The data is available in CSV format from the following links:
-
LA Police Stations: Contains the locations of the 21 police stations in Los Angeles. Available in CSV format:
-
Median Household Income by Zip Code (Los Angeles County): Contains information about median household income by ZIP code based on census data from 2015, 2017, 2019, and 2021. For this project, only the 2015 data is required. Available in CSV format:
-
Reverse Geocoding Dataset: Maps coordinates (latitude, longitude) to ZIP codes within Los Angeles. It is needed to relate crime locations to the income data, which is organized by ZIP code.
- Download from http://www.dblab.ece.ntua.gr/files/classes/data.tar.gz
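
The coordinate-to-ZIP mapping described above amounts to a lookup keyed on the (latitude, longitude) pair. The sketch below illustrates the idea in plain Python on a few made-up rows. The column names `LAT`, `LON`, and `ZIPcode`, the sample values, and the `id` field are assumptions for illustration and should be checked against the actual files. In Spark, the same operation would typically be expressed as a join on the coordinate columns.

```python
import csv
import io

# Made-up sample in the assumed layout of the reverse geocoding CSV.
SAMPLE_REVGEOCODING = """LAT,LON,ZIPcode
34.0141,-118.2978,90007
34.0459,-118.2545,90014
"""


def load_zip_lookup(csv_file):
    """Read rows into a dict mapping (LAT, LON) -> ZIP code.

    Coordinates are kept as strings so that matching is exact and does
    not depend on floating-point comparison.
    """
    reader = csv.DictReader(csv_file)
    return {(row["LAT"], row["LON"]): row["ZIPcode"] for row in reader}


def attach_zip(crimes, lookup):
    """Return copies of the crime records with a ZIP field added.

    Records whose coordinates are missing from the lookup get ZIP=None.
    """
    return [
        {**crime, "ZIP": lookup.get((crime["LAT"], crime["LON"]))}
        for crime in crimes
    ]


if __name__ == "__main__":
    lookup = load_zip_lookup(io.StringIO(SAMPLE_REVGEOCODING))
    # Hypothetical crime records; "id" is a placeholder field.
    crimes = [
        {"id": "a", "LAT": "34.0459", "LON": "-118.2545"},
        {"id": "b", "LAT": "0.0", "LON": "0.0"},
    ]
    for record in attach_zip(crimes, lookup):
        print(record)
```

Records without a match keep a ZIP of `None`, which makes it easy to count or filter out crimes with missing or invalid coordinates before joining with the income data.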