A set of Python scripts to clean Sue's contact data
January 3, 2020
Bill McCann
The follow tasks will complete this project
Manual Steps
- Copy data to this project's /data folder. Sue has duplicate files, so only copy the originals.
- Manually convert Excel workbooks to comma separated values, then remove the workbooks.
- Create a metadata catelog. The metadata will be a table that contains a row for each file. The columns will be the file name followed by indices to map columns for First Name, Last Name, email address and phone number.
Coding Steps
- Create a virtual environment
- Create script to read metadata catelog into a simple list
- Create a function to process all records in the list
- Create a function to read data file
- Create a function to map a row to new object
- Create a function to update a dictionary
- Create a function to write the dictionary to an output file
The file personal_contacts-a5c6e0cf-7063-463d-a69a-3e373715d925-v1.csv has two rows for the column headings. I delete the first row.
The Garden State export has separate columns for street number and street name. Since there is no data in these columns I am not going to bother writing code to deal with that.
The file personal_contacts-a5c6e0cf-7063-463d-a69a-3e373715d925-v1.csv has several phone numbers across multiple columns. For this data cleaning, I am only using two phone numbers. A quick scan shows that column L and O are the most frequently populated. I will use those.
Create Virtual Environment
The virtual environment was created using the following commands.
sudo apt-get update
sudo apt-get install python3-venv
python3 -m venv venv
First start the Python virtual environment
source ./venv/bin/activate