| Title | Data Cleaning Tasks |
|---|---|
| Date | 01.01.2018 |
- Adjust Values: Open Refine and load it in your browser. You should see `127.0.0.1:XXXX` in your address bar. Add `/preferences` to this address and adjust the `value` limit to 10,000. This lets us perform operations on larger datasets.
- Get Data: Go to the INFX 551 GitHub repository and find the `Data` tab. Download the file under `DataCleaning` titled `Building_Permits.csv`. This is a dataset from the City of Seattle open data portal (read more about it at Building Permits: Current).
- Load Data: Upload this data to Refine by selecting the file from your desktop (or whichever directory you downloaded it to). Be sure to select `commas (.csv)` as the upload option.
- Replace all missing values with `N/A`
- For each column, trim leading and trailing whitespace
- Convert values with all UPPERCASE to Name Case
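
To check the result of the three cleaning steps outside Refine, the same logic can be sketched in plain Python. This is only an illustration, not the Refine workflow itself: the helper names `clean_cell` and `clean_rows` and the sample rows are made up, and the sketch assumes that an empty or whitespace-only cell counts as "missing".

```python
def clean_cell(value):
    """Apply the three cleaning steps to a single cell value."""
    # Trim leading and trailing whitespace (None is treated as empty).
    text = (value or "").strip()
    # Assumption: a cell that is empty after trimming counts as missing.
    if text == "":
        return "N/A"
    # Convert values written entirely in UPPERCASE to Name Case.
    if text.isupper():
        return text.title()
    return text


def clean_rows(rows):
    """Clean every cell of every row (each row is a dict of column -> value)."""
    return [{col: clean_cell(val) for col, val in row.items()} for row in rows]


# Made-up sample rows, not taken from Building_Permits.csv.
sample = [
    {"Category": "  SINGLE FAMILY / DUPLEX ", "Status": ""},
    {"Category": "Commercial", "Status": "PERMIT ISSUED"},
]
cleaned = clean_rows(sample)
```

Note that `str.title()` capitalizes the letter after every non-letter character, so values containing apostrophes may need extra handling.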
- The `Category` and the `Status` columns have a series of codes. It would be helpful to know how many codes exist in this dataset. How would we find out?
- The `Value` column could be summarized as a range. Create a new column directly to the left of `Value` and title it `Value Range`. Now, cluster all values into High (> $1 million), Medium ($500,000 - $999,999), and Low ($1 - $499,999). How many cells did this affect?
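
As a cross-check for the two questions above, here is a minimal Python sketch of counting distinct codes and binning values. The function names and sample rows are hypothetical. It assumes that `Value` cells may carry `$` signs and thousands separators. The stated ranges do not cover 0 or exactly $1,000,000, so the sketch labels those (and blank cells) `Unclassified`, which is a choice made here rather than part of the task.

```python
from collections import Counter


def count_codes(rows, column):
    """Return a Counter mapping each distinct code in `column` to its row count."""
    return Counter(row[column] for row in rows)


def value_range(raw):
    """Map a raw Value cell to High, Medium, Low, or Unclassified."""
    try:
        amount = float(str(raw).replace("$", "").replace(",", "").strip())
    except ValueError:
        return "Unclassified"  # blank or non-numeric cell
    if amount > 1_000_000:
        return "High"
    if 500_000 <= amount < 1_000_000:
        return "Medium"
    if 1 <= amount < 500_000:
        return "Low"
    return "Unclassified"  # 0, negative values, or exactly 1,000,000


# Made-up sample rows, not taken from Building_Permits.csv.
rows = [
    {"Category": "COMMERCIAL", "Value": "$1,250,000"},
    {"Category": "COMMERCIAL", "Value": "750000"},
    {"Category": "INDUSTRIAL", "Value": "4500.00"},
    {"Category": "INDUSTRIAL", "Value": ""},
]
codes = count_codes(rows, "Category")          # distinct codes and their counts
ranges = [value_range(row["Value"]) for row in rows]
affected = sum(1 for label in ranges if label != "Unclassified")
```

Here `len(codes)` gives the number of distinct codes, and `affected` counts the cells that received one of the three range labels.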
- The `Application Date` and the `Issue Date` tell us the lag time in the city's response to applications for building permits. Use the values in these two columns to create a new third column titled `Permit Issue Period`. In this column, calculate the time between Application Date and Issue Date. You should provide the value in days. What is the longest period you observe in this dataset?
- The `Location` field is static. We want these values to be linked to Google Maps so that a user coming to the dataset can simply click to see a housing permit site. Use the text string `"https://www.google.com/maps/place/"` (the quotation marks are important) to create this link.
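
The last two tasks can also be sketched in Python as a sanity check. This is a rough illustration under assumptions: the date format `MM/DD/YYYY` is a guess about the export and may need adjusting, the helper names and sample rows are invented, and the location text is URL-encoded before being appended to the prefix given above.

```python
from datetime import datetime
from urllib.parse import quote

DATE_FORMAT = "%m/%d/%Y"  # assumed format; adjust to match the actual file
MAPS_PREFIX = "https://www.google.com/maps/place/"


def permit_issue_period(application_date, issue_date):
    """Days between application and issue, or None if either date is missing."""
    if not application_date or not issue_date:
        return None
    applied = datetime.strptime(application_date, DATE_FORMAT)
    issued = datetime.strptime(issue_date, DATE_FORMAT)
    return (issued - applied).days


def maps_link(location):
    """Build a clickable Google Maps URL from a Location cell."""
    return MAPS_PREFIX + quote(location.strip())


# Made-up sample rows, not taken from Building_Permits.csv.
rows = [
    {"Application Date": "01/15/2017", "Issue Date": "03/01/2017",
     "Location": "700 5th Ave"},
    {"Application Date": "02/01/2017", "Issue Date": "",
     "Location": "600 4th Ave"},
]
periods = [
    permit_issue_period(r["Application Date"], r["Issue Date"]) for r in rows
]
longest = max(p for p in periods if p is not None)
links = [maps_link(r["Location"]) for r in rows]
```

Rows without an issue date yield `None` and are skipped when looking for the longest period.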