Due date: Friday, October 17, 2025 (11:59pm)
This homework has three parts.
- The first part reviews some of the basics of data processing in the context of Pandas (input, cleaning, validation, and manipulation of DataFrames).
- The second part is about measuring and comparing performance, and will explore several design choices which can affect performance.
- Finally, the third part is shorter and covers programming in the shell and its relevance to data processing.
Clone this repository to your own machine (or open up a Codespace),
then open up and complete part1.py.
Parts 2 and 3 will be made available throughout the week.
If you get stuck, please ask a question on Piazza!
We will use Gradescope to submit the code for this assignment.
Instructions will be posted on Piazza.
You can submit code on Gradescope in one of two ways:
via a private GitHub repository that contains your work,
or by uploading a .zip file.
In order to receive credit for your work, please follow the guidelines below.
- Make sure that python3 part1.py, python3 part2.py, and python3 part3.py run successfully with no errors, and the same for pytest part1.py, pytest part2.py, and pytest part3.py. We cannot give credit to code that doesn't run!
- If you are uploading a zip file, you should include all data and auxiliary files with your code. If the files your code depends on are missing, it won't run on someone else's machine!
- You will likely want to exclude the hidden folders, in particular .git, from your zip file, as they add a lot of extra files to your upload.
- Make sure that your output/part1-answers.txt, output/part2-answers.txt, and output/part3-answers.txt files are generated and up-to-date. These files should contain the answers to the questions in the assignment. Additionally, make sure your code is generating all plots in the output/ folder. Include all of these in your final code that you commit/upload.
- If you are using GitHub to submit, make sure that you git commit and git push your latest code to your personal repository.
- Don't rename any functions or methods or change the function signatures unless asked to do so.
- As discussed in the syllabus, a small number of points on each homework (at most 10% of the grade) are reserved for style points. This also includes whether your free response answers are present and thoughtful. Here are some things to consider: are your variable names chosen appropriately? Have you added comments with # or docstrings with """ where appropriate? Have you removed any obsolete, unused code blocks, functions, or variables?
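The file-related items in the guidelines above (answer files present, hidden folders left out of the zip) can be checked with a short script before you submit. The following is only a sketch and is not part of the assignment or the official grading setup: the helper names missing_outputs and make_submission_zip are hypothetical, and it assumes you run it against the root of your cloned repository.

```python
import zipfile
from pathlib import Path

# Answer files named in the guidelines above.
EXPECTED_OUTPUTS = [
    "output/part1-answers.txt",
    "output/part2-answers.txt",
    "output/part3-answers.txt",
]


def missing_outputs(repo_root):
    """Return the expected answer files that do not exist under repo_root."""
    root = Path(repo_root)
    return [name for name in EXPECTED_OUTPUTS if not (root / name).is_file()]


def make_submission_zip(repo_root, zip_path):
    """Zip everything under repo_root except hidden files and folders.

    Hypothetical helper: anything whose relative path contains a component
    starting with "." (such as .git) is skipped. Returns the relative paths
    that were written to the archive.
    """
    root = Path(repo_root).resolve()
    target = Path(zip_path).resolve()
    included = []
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(root.rglob("*")):
            relative = path.relative_to(root)
            # Skip anything that is hidden or lives inside a hidden folder.
            if any(part.startswith(".") for part in relative.parts):
                continue
            # Skip directories, and the archive itself if it sits in the repo.
            if not path.is_file() or path == target:
                continue
            archive.write(path, relative.as_posix())
            included.append(relative.as_posix())
    return included
```

For example, calling missing_outputs(".") from the repository root lists any answer files that have not been generated yet, and make_submission_zip(".", "../submission.zip") writes an archive without .git next to the repository. This does not replace running python3 and pytest on each part; it only covers the file checks.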
Many thanks to the data science course at LUMS (CS 334 taught by Dr. Mobin Javed) for the data and some of the exercises that were used in Part 1 of this homework assignment.