Due date: October 25, 2024 (11:59pm)
This homework has three parts.
- The first part (released Monday, Oct 14) reviews some of the basics of data processing in the context of Pandas (input, cleaning, validation, and manipulation of DataFrames).
- The second part (released Wednesday, Oct 16) is about performance measuring & comparison, and will explore several design choices which can affect performance.
- Finally, the third part (released Friday, Oct 18) is a shorter part about programming in the shell, and its relevance to data processing.
Clone this repository to your own machine (or open up a Codespace), then open up and complete `part1.py`.
Parts 2 and 3 will be made available throughout the week.
If you get stuck, please ask a question on Piazza!
We will use Gradescope to submit the code for this assignment.
Instructions will be posted on Piazza.
You can submit code on Gradescope in one of two ways: via a private GitHub repository that contains your work, or by uploading a `.zip` file.
In order to receive credit for your work, please follow these guidelines.
- Make sure that `python3 part1.py`, `python3 part2.py`, and `python3 part3.py` run successfully with no errors, and the same for `pytest part1.py`, `pytest part2.py`, and `pytest part3.py`. We cannot give credit to code that doesn't run!
- Make sure that your `output/part1-answers.txt`, `output/part2-answers.txt`, and `output/part3-answers.txt` files are generated and up-to-date. These files should contain the answers to the questions in the assignment. Additionally, make sure your code is generating all plots in the `output/` folder. Include all of these in your final code that you commit/upload.
- If you are using GitHub to submit, make sure that you `git commit` and `git push` your latest code to your personal repository.
- Don't rename any functions or methods or change the function signatures unless asked to do so.
- As discussed in the syllabus, a small number of points on each homework (at most 10% of the grade) are reserved for style points. This also includes whether your free response answers are present and thoughtful. Here are some things to consider: are your variable names chosen appropriately? Have you added comments with `#` or docstrings with `"""` where appropriate? Have you removed any obsolete, unused code blocks, functions, or variables?
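The first two guidelines above can be checked before you submit. The following is a minimal, optional sketch and is not part of the assignment or the grading setup: the helper names `commands_to_check`, `missing_answer_files`, and `run_checks` are hypothetical, and the script assumes it is run from the repository root with `python3` and `pytest` available on your PATH. It does not check for plots, since their file names are not listed here.

```python
import subprocess
from pathlib import Path

# The three parts named in the guidelines above.
PARTS = ("part1", "part2", "part3")


def commands_to_check(parts=PARTS):
    """Return the commands that the guidelines say must run without errors."""
    commands = []
    for part in parts:
        commands.append(["python3", f"{part}.py"])
        commands.append(["pytest", f"{part}.py"])
    return commands


def missing_answer_files(root=".", parts=PARTS):
    """Return the expected answer files that do not exist under root/output."""
    expected = [Path(root) / "output" / f"{part}-answers.txt" for part in parts]
    return [str(path) for path in expected if not path.is_file()]


def run_checks(root="."):
    """Run every command inside root and return a list of problems found.

    An empty list means all commands exited with status 0 and all
    answer files exist. This is a hypothetical helper, not a grader.
    """
    problems = []
    for command in commands_to_check():
        result = subprocess.run(command, cwd=root)
        if result.returncode != 0:
            problems.append("failed: " + " ".join(command))
    # Check the answer files last, since running the parts regenerates them.
    problems.extend("missing: " + path for path in missing_answer_files(root))
    return problems
```

To use it, call `run_checks()` from the repository root and print whatever it returns. Until Parts 2 and 3 are released, expect their commands and answer files to be reported as problems.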
Many thanks to Hassnain (the TA) and the data science course at LUMS (CS 334 taught by Dr. Mobin Javed) for the data and some of the exercises that were used in Part 1 of this homework assignment.