Both Zip_Backfill scripts run from the command line and take two file arguments: input and output. Input is a .csv file of addresses whose zip codes need to be backfilled, with columns in this order: Lon, Lat, Number, Street, City, District, Region, Zip, ID. Output is a .csv file that the backfilled addresses are written to. Address rows are returned in the same format as Input.
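To make the row format concrete, here is a minimal sketch of the read, fill, write round trip. It is not the code from either script. It assumes a header row, separate Lon and Lat columns, and that only rows with an empty Zip are filled; `backfill_rows` and the `lookup` callable are hypothetical names, and the sample rows are made up.

```python
import csv
import io

# Column order as described above (header row is an assumption).
FIELDS = ["Lon", "Lat", "Number", "Street", "City", "District", "Region", "Zip", "ID"]


def backfill_rows(reader, writer, lookup):
    """Copy rows to writer, filling empty Zip values via lookup(lon, lat).

    Returns the number of rows whose Zip was filled in.
    """
    filled = 0
    for row in reader:
        if not row["Zip"].strip():
            found = lookup(float(row["Lon"]), float(row["Lat"]))
            if found is not None:
                row["Zip"] = found
                filled += 1
        writer.writerow(row)
    return filled


# Made-up sample input: the first row is missing its zip code.
sample = (
    "Lon,Lat,Number,Street,City,District,Region,Zip,ID\n"
    "-87.31,40.61,101,Main St,Fowler,Benton,IN,,a1\n"
    "-87.32,40.60,202,Oak St,Fowler,Benton,IN,47944,a2\n"
)
source = io.StringIO(sample)
target = io.StringIO()
reader = csv.DictReader(source)
writer = csv.DictWriter(target, fieldnames=FIELDS, lineterminator="\n")
writer.writeheader()

# Stand-in lookup that always answers with the same zip code.
count = backfill_rows(reader, writer, lambda lon, lat: "47944")
print(target.getvalue())
```

In the real scripts the lookup would be the polygon search against the shapefile rather than a constant.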
All other files are for testing purposes.
The difference between the two scripts lies in how they search census block polygons. Zip_Backfill.py scans the polygons in the loaded shapefile in order, from beginning to end ('for poly in zips.geom'), checking whether each one contains the address point. Zip_Backfill_Fast.py takes advantage of two facts: geographically close zip codes tend to be numerically close, and geographically close address points tend to sit together within a file. It therefore tries the most recently matched zip code first, then zip codes progressively farther from it, until it finds a match.
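The search order used by the fast script can be sketched roughly as follows. This is an illustration under assumptions, not the actual code from Zip_Backfill_Fast.py: it assumes the polygons are held in a list sorted by zip code, `outward_indices` and `find_zip` are hypothetical names, and plain bounding boxes stand in for the real shapefile geometry. The zip codes and coordinates are made up.

```python
def outward_indices(start, n):
    """Yield start, start+1, start-1, start+2, ... while staying within [0, n)."""
    if 0 <= start < n:
        yield start
    for offset in range(1, n):
        hi, lo = start + offset, start - offset
        if hi < n:
            yield hi
        if lo >= 0:
            yield lo


def find_zip(point, zones, contains, last_index=0):
    """Search zones outward from last_index.

    zones is a list of (zip_code, shape) sorted by zip_code.
    Returns (zip_code, index) on a match, or (None, last_index) if nothing matches.
    """
    for i in outward_indices(last_index, len(zones)):
        code, shape = zones[i]
        if contains(shape, point):
            return code, i
    return None, last_index


def in_box(box, point):
    """Stand-in containment test for an (xmin, ymin, xmax, ymax) box."""
    xmin, ymin, xmax, ymax = box
    x, y = point
    return xmin <= x < xmax and ymin <= y < ymax


# Made-up zones, sorted by zip code.
zones = [
    ("47917", (0.0, 0.0, 1.0, 1.0)),
    ("47918", (1.0, 0.0, 2.0, 1.0)),
    ("47921", (2.0, 0.0, 3.0, 1.0)),
]

last = 0
for pt in [(0.5, 0.5), (0.6, 0.4), (1.2, 0.5), (9.0, 9.0)]:
    code, last = find_zip(pt, zones, in_box, last)
    print(pt, code)
```

When consecutive rows fall in the same or a neighboring zip code, the match is found within the first few checks, which is consistent with the large speedup reported below. A point outside every polygon still costs a full scan.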
### Speed (MacBook Air)
Benton Indiana, 10229 rows:
- Zip_Backfill.py ~ 1 Hour 15 Minutes
- Zip_Backfill_Fast.py ~ 30 Seconds
### Accuracy
Running the current test file of 36 rows (three rows per state, for three states in each of the four OpenAddresses areas) against the Census TIGER/Line ZCTA shapefile:
- 30/36 are found and backfilled with zip codes
- 27/30 of the backfilled zip codes agree with the zip code returned by the Google API
Rows randomly picked from all files:
- 11,148/13,000 = 86% backfilled
- 1,608/1,750 = 92% agree with Google API
- 8,587/8,593 = 99% agree with Mapbox API