GitHub - Adarsh-Hota/ETL_spark-on-dataproc: A PySpark project that performs ETL on a Dataproc cluster and writes data to Google Cloud Storage or BigQuery.

Overview

A simple Spark project that uses PySpark to extract data from the NYC taxi dataset, apply some transformations, and store the result in GCS or BigQuery.

The Python notebook in the notebooks folder can be run locally in Jupyter. After it runs, the output data is stored in the 'result' folder.

The code folder contains two Python files that run on a GCP Dataproc cluster. One writes the data to Cloud Storage; the other writes it to BigQuery.
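
The two Dataproc jobs described above might look roughly like the following sketch. It is not the repository's actual code: the column names follow the public yellow-taxi schema, the filters are illustrative only, and `transform`, `writer_settings`, and `run` are hypothetical helpers. The BigQuery branch assumes the spark-bigquery-connector is available on the cluster.

```python
import sys

# Column names from the public yellow-taxi schema (an assumption, not taken from the repo).
PICKUP_COL = "tpep_pickup_datetime"
DROPOFF_COL = "tpep_dropoff_datetime"


def transform(df):
    """Illustrative cleanup: drop implausible trips and add a duration column."""
    # Imported lazily so this module can be loaded on a machine without Spark.
    from pyspark.sql import functions as F

    duration_min = (F.unix_timestamp(DROPOFF_COL) - F.unix_timestamp(PICKUP_COL)) / 60.0
    return (
        df.filter((F.col("trip_distance") > 0) & (F.col("passenger_count") > 0))
        .withColumn("trip_duration_min", duration_min)
        .filter(F.col("trip_duration_min") > 0)
    )


def writer_settings(target, destination, temp_bucket=""):
    """Return the DataFrameWriter format and options for the chosen sink."""
    if target == "gcs":
        # destination is a gs:// path; Parquet is an assumed output format.
        return {"format": "parquet", "options": {"path": destination}}
    if target == "bigquery":
        # destination is "dataset.table"; the connector stages data in temp_bucket.
        return {
            "format": "bigquery",
            "options": {"table": destination, "temporaryGcsBucket": temp_bucket},
        }
    raise ValueError(f"unknown target: {target!r}")


def run(input_path, target, destination, temp_bucket=""):
    """Read Parquet input, transform it, and write it to GCS or BigQuery."""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("nyc-taxi-etl").getOrCreate()
    settings = writer_settings(target, destination, temp_bucket)
    cleaned = transform(spark.read.parquet(input_path))
    (
        cleaned.write.format(settings["format"])
        .options(**settings["options"])
        .mode("overwrite")
        .save()
    )
    spark.stop()


if __name__ == "__main__" and len(sys.argv) in (4, 5):
    # Arguments: input_path target destination [temp_bucket]
    run(*sys.argv[1:])
```

For example, `run("gs://my-bucket/data/", "gcs", "gs://my-bucket/result/")` would write Parquet files to Cloud Storage, while passing `"bigquery"` with a `dataset.table` name and a staging bucket would load the same data into BigQuery. The bucket and table names here are placeholders. Keeping the sink choice in one small function means the transformation code is shared between both jobs.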

Setup

GCP Compute Engine VM
- Instance configuration
  - Machine type: e2-standard-4
  - Boot disk image: ubuntu-2004-focal-v20230918
  - Boot disk size: 30 GB
  - Boot disk type: Balanced persistent disk
- Environment configuration
  - Spark: 3.5.0
  - Scala: 2.13.8
  - Jupyter Notebook: 6.5.4
  - Conda: 23.7.4
  - Python: 3.11.5
  - openjdk: 11.0.21
  - OpenJDK Runtime Environment: 11.0.21+9-post-Ubuntu-0ubuntu120.04
  - OpenJDK 64-Bit Server VM: 11.0.21+9-post-Ubuntu-0ubuntu120.04

Dataproc Cluster configuration
- Image version: 2.1.35-debian11
- Master node: Standard (1 master, N workers)
  - Machine type: e2-standard-2
  - Number of GPUs: 0
  - Primary disk type: pd-standard
  - Primary disk size: 500 GB
  - Local SSDs: 0
- Worker nodes: 2
  - Machine type: n2-standard-4
  - Number of GPUs: 0
  - Primary disk type: pd-standard
  - Primary disk size: 500 GB
  - Local SSDs: 0
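
As a rough illustration, the cluster configuration above could be expressed as a `gcloud dataproc clusters create` command. The sketch below only builds and prints the command; it does not run it. The cluster name and region are placeholders, and `dataproc_create_command` is a hypothetical helper, not part of the repository.

```python
import shlex


def dataproc_create_command(cluster_name, region):
    """Build a gcloud argument list mirroring the cluster settings listed above."""
    return [
        "gcloud", "dataproc", "clusters", "create", cluster_name,
        f"--region={region}",
        "--image-version=2.1.35-debian11",
        # Master node settings
        "--master-machine-type=e2-standard-2",
        "--master-boot-disk-type=pd-standard",
        "--master-boot-disk-size=500GB",
        # Worker node settings
        "--num-workers=2",
        "--worker-machine-type=n2-standard-4",
        "--worker-boot-disk-type=pd-standard",
        "--worker-boot-disk-size=500GB",
    ]


# Placeholder name and region; substitute your own before running the printed command.
print(shlex.join(dataproc_create_command("etl-cluster", "us-central1")))
```

Building the command as a list keeps each setting on its own line, so it is easy to compare against the configuration and to pass to `subprocess.run` if desired.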

Data Source

NYC Taxi & Limousine Commission website - https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
