This is for prospective data engineers interviewing at Holman.
- Fork this repo to your personal GitHub account.
- Complete the data exercises below in whatever way you see fit. Your project should be shareable via a link prior to your 2nd round interview. Here are some potential implementation options:
- Develop within a PySpark Project: https://github.com/AlexIoannides/pyspark-example-project
- Use a Jupyter notebook: https://jupyter.org/ or https://colab.google/
- Implement a local database (a minimal sketch follows).
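The local-database option can be as light as a single SQLite file created with Python's standard library. This is only a sketch: the file name and schema below are illustrative, not prescribed by the exercise.

```python
# Illustrative only: a local SQLite database file via the standard library.
# The file name and table definition are placeholders, not requirements.
import sqlite3

conn = sqlite3.connect("holman_exercises.db")  # creates the file on first use
conn.execute("CREATE TABLE IF NOT EXISTS example (id INTEGER PRIMARY KEY, note TEXT)")
conn.commit()
conn.close()
```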
Build a pipeline that turns multiple .csv files into a single data object.
- Reference the data within ./datasets/olympics.
- Combine all files in this directory into a single consumable artifact, such as a database table, a parquet file, or a .csv file. Dataframes are appreciated, but the final object must be queryable in a programmatic fashion.
- The above task should produce code that includes a schema definition, transformation logic, and data upserts (one possible shape is sketched below).
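A hedged sketch of one possible approach, assuming pandas and a local SQLite target. The column names (athlete, country, year, medal) and the (athlete, year) conflict key are placeholders -- inspect the actual files under ./datasets/olympics and adjust the schema, transformations, and key to match.

```python
import glob
import sqlite3
import pandas as pd

# 1. Load every csv in the directory into one DataFrame.
paths = sorted(glob.glob("./datasets/olympics/*.csv"))
df = pd.concat((pd.read_csv(p) for p in paths), ignore_index=True)

# 2. Transformation: normalize column names and drop exact duplicates.
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
df = df.drop_duplicates()

# 3. Schema definition + upsert into a queryable SQLite table.
#    Column names and the (athlete, year) key are assumptions.
conn = sqlite3.connect("holman_exercises.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS olympics (
           athlete TEXT,
           country TEXT,
           year    INTEGER,
           medal   TEXT,
           PRIMARY KEY (athlete, year)
       )"""
)
df.to_sql("olympics_stage", conn, if_exists="replace", index=False)  # staging table
conn.execute(
    """INSERT INTO olympics (athlete, country, year, medal)
       SELECT athlete, country, year, medal FROM olympics_stage
       WHERE true
       ON CONFLICT (athlete, year) DO UPDATE SET
           country = excluded.country,
           medal   = excluded.medal"""
)
conn.commit()
print(conn.execute("SELECT COUNT(*) FROM olympics").fetchone())
conn.close()
```

Staging the frame with to_sql and upserting in SQL keeps the load idempotent: re-running the pipeline updates existing keys instead of duplicating rows.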
Build a pipeline that turns a single .csv file into a single data object.
- Reference the data within ./datasets/countries.
- Using the countries of the world.csv, create a database table, a parquet file, or a .csv file. Dataframes are appreciated, but the final object must be queryable in a programmatic fashion.
- The above task should produce code that includes a schema definition, transformation logic, and data upserts (again, a sketch follows).
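A comparable sketch for the single-file pipeline, this time landing in parquet. It assumes pandas with pyarrow installed and that the file exposes a Country column (check the real header set); because parquet files are immutable, the "upsert" is emulated by merging on the country name and rewriting the file.

```python
import os
import pandas as pd

SRC = "./datasets/countries/countries of the world.csv"
OUT = "./output/countries.parquet"  # output location is illustrative

# 1. Load and define the schema: strip headers and type the key column.
df = pd.read_csv(SRC)
df.columns = [c.strip() for c in df.columns]
df["Country"] = df["Country"].astype("string").str.strip()

# 2. Transformation: drop rows without a country and de-duplicate on the key.
df = df.dropna(subset=["Country"]).drop_duplicates(subset=["Country"])

# 3. Upsert: if the parquet file already exists, incoming rows replace existing
#    rows that share the same Country value, then everything is rewritten.
os.makedirs(os.path.dirname(OUT), exist_ok=True)
if os.path.exists(OUT):
    existing = pd.read_parquet(OUT)
    existing = existing[~existing["Country"].isin(df["Country"])]
    df = pd.concat([existing, df], ignore_index=True)
df.to_parquet(OUT, index=False)

# The artifact remains queryable programmatically, e.g. pd.read_parquet(OUT)
```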
Transform one or both of the olympics/countries tables to facilitate a join across these tables.
- Normalize the data by applying a foreign key to one or both tables. The key should uniquely represent one country and ensure no Cartesian joins occur. (We are aware that no true key exists; an artificial key will need to be produced.)
- Transform the two data objects via denormalization and deliver one consumable artifact in the form of a table, a parquet file, or something queryable (a sketch of one approach follows).
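One way to satisfy both bullets, assuming pandas and the artifacts produced in the earlier sketches: build a small country dimension with a surrogate country_id, attach it to both tables as the foreign key, then denormalize with a validated many-to-one join. The column names (country on the olympics side, Country on the countries side) are assumptions, and in practice the olympics values may need a manual crosswalk to country names before the keys line up.

```python
# A sketch, not a reference solution: file locations and column names are
# assumptions carried over from the earlier sketches.
import sqlite3
import pandas as pd

conn = sqlite3.connect("holman_exercises.db")
olympics = pd.read_sql("SELECT * FROM olympics", conn)
countries = pd.read_parquet("./output/countries.parquet")

def norm(s: pd.Series) -> pd.Series:
    """Normalize a country-name column so equal names compare equal."""
    return s.astype("string").str.strip().str.lower()

# 1. Artificial key: one surrogate country_id per distinct normalized name.
dim_country = (
    pd.DataFrame({"country_name": norm(countries["Country"])})
    .drop_duplicates()
    .reset_index(drop=True)
)
dim_country["country_id"] = dim_country.index + 1
key = dim_country.set_index("country_name")["country_id"]

# 2. Normalization: attach country_id as a foreign key on both sides.
countries["country_id"] = norm(countries["Country"]).map(key)
olympics["country_id"] = norm(olympics["country"]).map(key)

# 3. Denormalization: many-to-one join (validate= guards against a Cartesian
#    blow-up by asserting country_id is unique on the countries side), then
#    persist one queryable artifact.
joined = olympics.merge(countries, on="country_id", how="left", validate="m:1")
joined.to_parquet("./output/olympics_countries.parquet", index=False)
```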