R package to assist with address cleaning tasks

License: MIT (LICENSE.md). The repository also contains a LICENSE file of unknown type.

EvictionLab/addressr

addressr

The goal of addressr is to standardize address cleaning for various datasets used by the Eviction Lab.

Installation

You can install the development version of addressr from GitHub with:

# install.packages("pak")
pak::pak("EvictionLab/addressr")

Usage

Addresses can be difficult to work with and messy in many different ways.

Here is an example of a relatively clean table:

library(dplyr)

address_table <- dplyr::tribble(
  ~address, ~city, ~state,
  "456 Jersey Avenue #102", "Montclair", "NJ",
  "123-125 N Street Rd", "Cincinnati", "OH",
  "789 Pirate Cv East DBA ELAB LLC", "Memphis", "TN",
  "3548 1ST ST FL 1", "St. Louis", "MO",
)

Even in a table like this, geocoders can be thrown off by address ranges (123-125), unit numbers (#102, FL 1), directionals (N), and inconsistent street suffixes (Avenue, Rd, Cv).

The clean_address() function streamlines address cleaning and returns the data in a standard format:

library(addressr)

cleaned_addresses <- address_table |>
  clean_address(address) |>
  janitor::remove_empty("cols")
#> preserve original data: 0.007 sec elapsed
#> extract address parts: 0.014 sec elapsed
#> standardize street suffix, directions & ordinals: 0.034 sec elapsed
#> check street numbers, units, and buildings: 0.01 sec elapsed
#> tidy output: 0.003 sec elapsed
#> total clean time: 0.072 sec elapsed

cleaned_addresses
#> # A tibble: 4 × 11
#>   address            city  state clean_address street_number_multi pre_direction
#>   <chr>              <chr> <chr> <chr>         <chr>               <chr>        
#> 1 456 Jersey Avenue… Mont… NJ    456 JERSEY A… 456                 <NA>         
#> 2 123-125 N Street … Cinc… OH    123-125 N ST… 123-125             N            
#> 3 789 Pirate Cv Eas… Memp… TN    789 PIRATE C… 789                 <NA>         
#> 4 3548 1ST ST FL 1   St. … MO    3548 FIRST S… 3548                <NA>         
#> # ℹ 5 more variables: street_name <chr>, street_suffix <chr>, unit_type <chr>,
#> #   unit <chr>, extra_back <chr>

By default, the function returns all address components. You can also choose which columns to return with the output argument:

address_table |>
  clean_address(
    address,
    output = c("clean_address", "short_address", "street_number", "unit", "extra")
  )
#> preserve original data: 0.002 sec elapsed
#> extract address parts: 0.013 sec elapsed
#> standardize street suffix, directions & ordinals: 0.03 sec elapsed
#> check street numbers, units, and buildings: 0.009 sec elapsed
#> tidy output: 0.004 sec elapsed
#> total clean time: 0.06 sec elapsed
#> # A tibble: 4 × 8
#>   address      city  state clean_address short_address street_number unit  extra
#>   <chr>        <chr> <chr> <chr>         <chr>         <chr>         <chr> <chr>
#> 1 456 Jersey … Mont… NJ    456 JERSEY A… 456 JERSEY    456           102   <NA> 
#> 2 123-125 N S… Cinc… OH    123-125 N ST… 123-125 STRE… 123-125       <NA>  <NA> 
#> 3 789 Pirate … Memp… TN    789 PIRATE C… 789 PIRATE    789           <NA>  EAST…
#> 4 3548 1ST ST… St. … MO    3548 FIRST S… 3548 FIRST    3548          1     <NA>

In the output, the clean_address column contains the street number, pre-direction, street name, street suffix, and post-direction. The short_address column contains only the street number and street name.
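
To make the composition of these two columns concrete, here is a minimal base R sketch that pastes components together while skipping missing ones. compose_address() is a hypothetical helper written for illustration, not an addressr function, and the component values are assumed from the second example row.

```r
# Hypothetical helper (not part of addressr): join the non-missing
# components of one address into a single string.
compose_address <- function(...) {
  parts <- c(...)
  paste(parts[!is.na(parts)], collapse = " ")
}

# Component values assumed for "123-125 N Street Rd"; it has no post-direction.
full_address <- compose_address("123-125", "N", "STREET", "ROAD", NA_character_)
short_address <- compose_address("123-125", "STREET")

full_address   # number, pre-direction, name, suffix (post-direction is NA)
short_address  # number and street name only
```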

The package can also separate rows with a street range or multiple addresses:

address_table <- address_table |> 
  add_row(address = "928-928 S Montgomery Ave 1500 Jefferson Rd", city = "New York", state = "NY") |> 
  add_row(address = "1500-1502 1550 Ptree Rd", city = "Atlanta", state = "GA")

address_table |> clean_address(address, separate_street_range = TRUE, separate_multi_address = TRUE)
#> preserve original data: 0.002 sec elapsed
#> separate multiple addresses: 0.007 sec elapsed
#> extract address parts: 0.011 sec elapsed
#> standardize street suffix, directions & ordinals: 0.03 sec elapsed
#> check street numbers, units, and buildings: 0.021 sec elapsed
#> tidy output: 0.003 sec elapsed
#> total clean time: 0.075 sec elapsed
#> # A tibble: 10 × 20
#>    address                 city  state original_row_id addressr_id clean_address
#>    <chr>                   <chr> <chr>           <int> <chr>       <chr>        
#>  1 456 Jersey Avenue #102  Mont… NJ                  1 1           456 JERSEY A…
#>  2 123-125 N Street Rd     Cinc… OH                  2 2-N1        123 N STREET…
#>  3 123-125 N Street Rd     Cinc… OH                  2 2-N2        125 N STREET…
#>  4 789 Pirate Cv East DBA… Memp… TN                  3 3           789 PIRATE C…
#>  5 3548 1ST ST FL 1        St. … MO                  4 4           3548 FIRST S…
#>  6 928-928 S Montgomery A… New … NY                  5 5-A1        928 S MONTGO…
#>  7 928-928 S Montgomery A… New … NY                  5 5-A2        1500 JEFFERS…
#>  8 1500-1502 1550 Ptree Rd Atla… GA                  6 6-N1        1500 PEACHTR…
#>  9 1500-1502 1550 Ptree Rd Atla… GA                  6 6-N2        1502 PEACHTR…
#> 10 1500-1502 1550 Ptree Rd Atla… GA                  6 6-N3        1550 PEACHTR…
#> # ℹ 14 more variables: street_number_coords <chr>, street_number <chr>,
#> #   street_number_multi <chr>, street_number_fraction <chr>,
#> #   pre_direction <chr>, street_name <chr>, street_suffix <chr>,
#> #   post_direction <chr>, building <chr>, unit_type <chr>, unit <chr>,
#> #   extra_front <chr>, extra_back <chr>, extra_unit <chr>
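
One way to picture the range separation is the base R sketch below. expand_street_range() and expand_with_ids() are hypothetical helpers, not addressr functions. Stepping by 2 (staying on one side of the street) and the "-N" id suffix are assumptions chosen to reproduce the rows printed above; the package's actual rules may differ.

```r
# Hypothetical helper: turn "123-125" into c("123", "125").
# Assumption: ranges step by 2, so both numbers share the same parity.
expand_street_range <- function(street_number) {
  parts <- suppressWarnings(
    as.integer(strsplit(street_number, "-", fixed = TRUE)[[1]])
  )
  # Anything that is not a clean "low-high" pair is returned unchanged.
  if (length(parts) != 2 || anyNA(parts) || parts[1] > parts[2]) {
    return(street_number)
  }
  as.character(seq(parts[1], parts[2], by = 2L))
}

# Hypothetical helper: one output row per expanded number, with ids
# shaped like the addressr_id column above ("2-N1", "2-N2").
expand_with_ids <- function(row_id, street_number) {
  numbers <- expand_street_range(street_number)
  ids <- if (length(numbers) > 1) {
    paste0(row_id, "-N", seq_along(numbers))
  } else {
    as.character(row_id)
  }
  data.frame(addressr_id = ids, street_number = numbers,
             stringsAsFactors = FALSE)
}

expand_with_ids(2, "123-125")
expand_with_ids(1, "456")
```

A degenerate range such as "928-928" collapses to the single number "928" under this sketch.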

If a common pattern is not removed by clean_address(), you can use extract_remove_squish() to pre-clean the data. It moves text matching a pattern out of the address column and into a new column:

address_table |> 
  add_row(address = "246 S Bend St Unit 530 3 Bedroom") |> 
  extract_remove_squish(address, other, "\\d Bedroom")
#> # A tibble: 7 × 4
#>   address                                    city       state other    
#>   <chr>                                      <chr>      <chr> <chr>    
#> 1 456 Jersey Avenue #102                     Montclair  NJ    <NA>     
#> 2 123-125 N Street Rd                        Cincinnati OH    <NA>     
#> 3 789 Pirate Cv East DBA ELAB LLC            Memphis    TN    <NA>     
#> 4 3548 1ST ST FL 1                           St. Louis  MO    <NA>     
#> 5 928-928 S Montgomery Ave 1500 Jefferson Rd New York   NY    <NA>     
#> 6 1500-1502 1550 Ptree Rd                    Atlanta    GA    <NA>     
#> 7 246 S Bend St Unit 530                     <NA>       <NA>  3 Bedroom
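
The three steps the function name suggests (extract, remove, squish) can be sketched in base R as follows. extract_remove_squish_sketch() is a hypothetical stand-in operating on a character vector, not the package's implementation, and it only handles the first match in each string.

```r
# Hypothetical stand-in (not the addressr implementation): pull the first
# match of `pattern` out of `x`, delete it, and collapse leftover whitespace.
extract_remove_squish_sketch <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)
  hit <- !is.na(m) & m > -1L

  # Extract: keep the matched text, NA where nothing matched.
  extracted <- rep(NA_character_, length(x))
  extracted[hit] <- regmatches(x, m)

  # Remove the match, then squish repeated and trailing whitespace.
  remaining <- sub(pattern, "", x, perl = TRUE)
  remaining <- trimws(gsub("\\s+", " ", remaining, perl = TRUE))

  data.frame(address = remaining, other = extracted,
             stringsAsFactors = FALSE)
}

result <- extract_remove_squish_sketch(
  c("246 S Bend St Unit 530 3 Bedroom", "3548 1ST ST FL 1"),
  "\\d Bedroom"
)
result
```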

Two functions convert a column between abbreviated and long spellings: switch_abbreviation() and str_replace_names().

  • switch_abbreviation() handles the abbreviations included in the address_abbreviations dataset: directions, all_street_suffixes, official_street_suffixes, unit_types, special_units, and highways.
    • all_street_suffixes has a one-to-many relationship and should only be used long-to-short, to standardize variant spellings into the same official abbreviation (AV, AVE, AVN to AVE).
    • official_street_suffixes has a one-to-one relationship between short and long spellings. It can be used either short-to-long or long-to-short (assuming all endings are in the official format).

address_table |> 
  clean_address(address) |> 
  mutate(
    street_suffix_short = switch_abbreviation(street_suffix, "official_street_suffixes", "long-to-short"),
    .keep = "used"
    )
#> preserve original data: 0.004 sec elapsed
#> extract address parts: 0.011 sec elapsed
#> standardize street suffix, directions & ordinals: 0.039 sec elapsed
#> check street numbers, units, and buildings: 0.008 sec elapsed
#> tidy output: 0.003 sec elapsed
#> total clean time: 0.066 sec elapsed
#> # A tibble: 6 × 2
#>   street_suffix street_suffix_short
#>   <chr>         <chr>              
#> 1 AVENUE        AVE                
#> 2 ROAD          RD                 
#> 3 COVE          CV                 
#> 4 STREET        ST                 
#> 5 ROAD          RD                 
#> 6 ROAD          RD
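
The difference between the one-to-many and one-to-one tables can be illustrated with a tiny lookup in base R. The table and to_official() below are made up for illustration and are not the contents or API of address_abbreviations. Mapping each variant to its official abbreviation is unambiguous, while the reverse direction has several candidates per abbreviation.

```r
# Hypothetical miniature lookup: several variant spellings per official
# abbreviation (one-to-many when read from official to variant).
suffix_lookup <- data.frame(
  variant  = c("AV", "AVE", "AVN", "AVENUE", "RD", "ROAD"),
  official = c("AVE", "AVE", "AVE", "AVE", "RD", "RD"),
  stringsAsFactors = FALSE
)

# Variant -> official is safe: every variant has exactly one target.
# Values without a match are returned unchanged.
to_official <- function(x) {
  idx <- match(toupper(x), suffix_lookup$variant)
  ifelse(is.na(idx), x, suffix_lookup$official[idx])
}

standardized <- to_official(c("Av", "AVN", "Road", "CV"))
standardized

# Official -> variant is ambiguous: count the candidates per abbreviation.
candidates <- lengths(split(suffix_lookup$variant, suffix_lookup$official))
candidates
```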

  • str_replace_names() replaces values found in one vector with the corresponding values from another vector of the same length (here, state abbreviations with state names):

address_table |> 
  mutate(state = str_replace_names(state, state.abb, state.name))
#> # A tibble: 6 × 3
#>   address                                    city       state     
#>   <chr>                                      <chr>      <chr>     
#> 1 456 Jersey Avenue #102                     Montclair  New Jersey
#> 2 123-125 N Street Rd                        Cincinnati Ohio      
#> 3 789 Pirate Cv East DBA ELAB LLC            Memphis    Tennessee 
#> 4 3548 1ST ST FL 1                           St. Louis  Missouri  
#> 5 928-928 S Montgomery Ave 1500 Jefferson Rd New York   New York  
#> 6 1500-1502 1550 Ptree Rd                    Atlanta    Georgia
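
The idea of paired vectors can be sketched with match() in base R. replace_by_lookup() is a hypothetical helper, not the package's implementation; it assumes whole-value matching, which may differ from how str_replace_names() treats partial matches inside longer strings.

```r
# Hypothetical helper: swap each value of `x` found in `from` for the
# value at the same position in `to`; unmatched values are left as-is.
replace_by_lookup <- function(x, from, to) {
  stopifnot(length(from) == length(to))
  idx <- match(x, from)
  ifelse(is.na(idx), x, to[idx])
}

# state.abb and state.name ship with base R (datasets package).
states <- replace_by_lookup(c("NJ", "OH", "ZZ"), state.abb, state.name)
states
```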
