For our capstone project, we were tasked with building a full ETL pipeline using High Volume For-Hire Vehicle (HVFHV), Yellow Taxi, and Green Taxi trip data. Our goal was to extract data from a raw Amazon S3 bucket, clean it, and load it into a conformed S3 bucket. We then applied the necessary transformations and loaded the results into a transformed S3 bucket, from which we loaded the data into Snowflake for final analysis and visualization. We used a mix of AWS services and other tools: Amazon S3, AWS Glue Crawler, Amazon Athena, Databricks/PySpark, GitHub Codespaces, Amazon Bedrock, Snowflake, ThoughtSpot, and Tableau.
For our use case, we focused on analyzing traffic congestion patterns within New York City boroughs and on how the distribution of landmarks relates to those patterns. To do this, we used the taxi datasets provided to us and supplemented them with external landmark data for our analysis, visualizations, and solutions. Our end goal was to help tourists navigate NYC more easily while reducing travel time.
- Yellow Taxi Data: 9 files (September 2023 to May 2024)
- Green Taxi Data: 9 files (September 2023 to May 2024)
- HVFHV Data: 5 files (January 2024 to May 2024)
- Individual Landmark Data: 1 file (17 columns)
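
As a quick sanity check on the file counts above, the monthly ranges can be enumerated in a few lines of Python. This is only a sketch: the file name prefixes and the `.parquet` extension are assumptions modeled on the public NYC TLC naming convention, not the actual object keys in our buckets, and `month_range` and `expected_files` are hypothetical helper names.

```python
from datetime import date


def month_range(start: date, end: date) -> list[str]:
    """Return 'YYYY-MM' labels for every month from start to end, inclusive."""
    labels = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        labels.append(f"{year:04d}-{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return labels


# Date ranges come from the dataset list above.
# The prefixes are assumed names, not confirmed bucket keys.
DATASETS = {
    "yellow_tripdata": (date(2023, 9, 1), date(2024, 5, 1)),
    "green_tripdata": (date(2023, 9, 1), date(2024, 5, 1)),
    "fhvhv_tripdata": (date(2024, 1, 1), date(2024, 5, 1)),
}


def expected_files(datasets=DATASETS) -> dict[str, list[str]]:
    """Map each dataset prefix to the monthly file names we would expect."""
    return {
        prefix: [f"{prefix}_{label}.parquet" for label in month_range(start, end)]
        for prefix, (start, end) in datasets.items()
    }


if __name__ == "__main__":
    for prefix, files in expected_files().items():
        # Print the count plus the first and last expected file per dataset.
        print(prefix, len(files), files[0], files[-1])
```

Comparing this expected list against a listing of the raw bucket is one way to spot a missing month before running the rest of the pipeline.
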
| Name | Email |
|---|---|
| Peter Alonzo | [email protected] |
| Nithila Annadurai | [email protected] |
| Alina Baby | [email protected] |