Draft data import pipeline #69
&& deps=" \
unzip \
make \
wget \
We don't need these dependencies immediately, but we'll probably need them once we start collecting shapefiles.
if var_code not in self.codes_to_names.keys():
    continue
# Process nulls
if var_stat == -666666666:
I couldn't find any docs on this in the Census API. I was getting a few responses that looked like this from 2016 data, so I went ahead and treated them like nulls. I've since switched to 2017 data and I'm not sure whether the unusual responses persisted.
this would be good to report upstream.
Good call, I'll open an issue.
Opened datamade/census#72 for this.
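A minimal sketch of the null handling discussed in this thread, assuming `-666666666` (the value in the diff) is the only sentinel that needs handling. `NULL_SENTINELS` and `clean_stat` are hypothetical names, not part of the PR. As far as I'm aware, the Census Bureau's notes on ACS estimate and annotation values list other large negative placeholders too, so the set may need extending.

```python
# Hypothetical helper: treat the Census sentinel seen in this PR as a null.
# Only -666666666 comes from the diff; extend the set if other sentinels turn up.
NULL_SENTINELS = {-666666666}


def clean_stat(var_stat):
    """Return None for missing or sentinel values, otherwise the value unchanged."""
    if var_stat is None:
        return None
    try:
        numeric = float(var_stat)
    except (TypeError, ValueError):
        # Non-numeric values (e.g. geography names) pass through untouched.
        return var_stat
    return None if numeric in NULL_SENTINELS else var_stat
```

Keeping the sentinels in one named set means a newly discovered placeholder only has to be added in one place.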
data/scripts/import_acs_data.py
Outdated
philly_res = c.acs5.state_county(codes, PA_FIPS, PHILLY_FIPS)
assert len(philly_res) > 0
for row in philly_res:
    writer.write_acs_row(PA_FIPS + PHILLY_FIPS, row)
I'm not totally confident what the convention is for notating sub-state FIPS codes. This code is written assuming that it's the concatenation of all the containing geometries.
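For what it's worth, Census GEOIDs are built the way this comment assumes: each level's code is appended to the codes of the geographies that contain it (2-digit state, 3-digit county, 6-digit tract, 1-digit block group). A rough sketch, where `build_geoid` is a hypothetical helper and the constant values are assumed to match the PR's `PA_FIPS` and `PHILLY_FIPS`:

```python
PA_FIPS = "42"       # Pennsylvania state FIPS (2 digits); assumed to match the PR
PHILLY_FIPS = "101"  # Philadelphia County FIPS (3 digits); assumed to match the PR


def build_geoid(state, county=None, tract=None, block_group=None):
    """Concatenate FIPS components from the largest geography to the smallest."""
    parts = [(state, 2), (county, 3), (tract, 6), (block_group, 1)]
    geoid = ""
    for value, width in parts:
        if value is None:
            break  # stop at the first level that was not supplied
        geoid += str(value).zfill(width)  # zero-pad to the fixed width
    return geoid
```

Zero-padding matters because the codes are fixed-width strings, so they should be stored as text rather than integers.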
data/scripts/import_acs_data.py
Outdated
writer.write_acs_row(fips, row)

# Blockgroup-level data
block_res = c.acs5.state_county_blockgroup(codes, PA_FIPS, PHILLY_FIPS, '*')
A few of these variables aren't available at blockgroup resolution for the 5-year estimates. Rather than make this distinction in the import code, I decided to go ahead and structure the data the same way (i.e. each variable CSV file has the same FIPS codes) on the assumption that the application code will handle checking for null rows and warning that the blockgroup resolution is unavailable.
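A sketch of what the application-side check might look like, assuming a CSV with `fips` and `value` columns where an empty value means null. That layout and the name `blockgroup_available` are hypothetical; the PR does not show the actual file format.

```python
import csv
import io


def blockgroup_available(rows, value_field="value"):
    """Return True if any block-group row (12-digit FIPS) has a non-empty value."""
    bg_rows = [r for r in rows if len(r["fips"]) == 12]
    return any(r[value_field] not in ("", None) for r in bg_rows)


# Hypothetical CSV layout: one row per FIPS code, empty value means null.
sample = io.StringIO("fips,value\n42,100\n42101,50\n421010001001,\n421010001002,\n")
rows = list(csv.DictReader(sample))
if not blockgroup_available(rows):
    print("Warning: block-group resolution is unavailable for this variable")
```

Because every variable file carries the same FIPS codes, the check only needs to look at values, not at which rows exist.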
@@ -0,0 +1,598 @@
age_and_sex:
  verbose_name: Age and Sex
None of the verbose names in this file are currently being used, but I assume at some point we'll want to use them to create a crosswalk for user display.
What is YAML giving us that having this data as a Python dict would not?
Two primary considerations:
- YAML is easier to write than Python dicts or JSON, and I had to write this out by hand
- There's enough data here that it needs to be in its own separate file to avoid cluttering up the code, so we might as well store it in a language-agnostic data format
okay
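To illustrate the display crosswalk mentioned earlier in this thread: once the YAML is parsed, it is an ordinary mapping, so the crosswalk could be a one-line comprehension. The dict below only mirrors the first entry visible in the diff; `median_income` is a made-up second entry and `display_names` is a hypothetical helper.

```python
# This mapping mirrors what a YAML loader would return for the first entry
# of the file in this diff; "median_income" is a made-up second entry.
tables = {
    "age_and_sex": {"verbose_name": "Age and Sex"},
    "median_income": {"verbose_name": "Median Income"},
}


def display_names(tables):
    """Map each table key to its verbose name, falling back to the key itself."""
    return {key: spec.get("verbose_name", key) for key, spec in tables.items()}
```

The same function works whether the mapping comes from YAML, JSON, or a Python literal, which is part of the case for a language-agnostic file.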
2885383 to 6fc9021
Requested review from @fgregg who may be able to accurately assess my use of https://github.com/datamade/census.
data/scripts/import_acs_data.py
Outdated
writer.write_acs_row(PA_FIPS, row)

# County-level data
philly_res = c.acs5.state_county(codes, PA_FIPS, PHILLY_FIPS)
Why is Philadelphia County the right unit, and not the city of Philadelphia?
If Philadelphia County is the right unit, I would call this philly_county_fips.
Is there an ACS geography that corresponds to municipalities like Philly? I didn't immediately see one in the census docs. I went with the county because the county and city are coterminous: https://en.wikipedia.org/wiki/Philadelphia_County,_Pennsylvania
Named places. This was one of the reasons why we originally wrote https://github.com/datamade/census_area/#usage
Newer versions of the Census API actually have native support for named places, which we don't have convenience methods for in our census wrapper. (https://github.com/uscensusbureau/api-geoHierarchy-changes/blob/master/changes.md?eml=gd&utm_medium=email&utm_source=govdelivery)
Since they are coterminous, this seems fine. It would be good to add a comment about that fact.
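For reference, a sketch of how a place-level query could be built against the Census Data API directly, using its `for`/`in` geography predicates. This only constructs the URL and makes no request. The 2017 ACS 5-year endpoint, the place code `60000` for Philadelphia city, and the helper name `place_query_url` are assumptions to verify against the Census docs, not something taken from this PR.

```python
from urllib.parse import urlencode

# Assumed endpoint: 2017 ACS 5-year estimates.
BASE_URL = "https://api.census.gov/data/2017/acs/acs5"


def place_query_url(variables, state_fips, place_fips, key=None):
    """Build a Census Data API URL for a single named place within a state."""
    params = {
        "get": ",".join(variables),
        "for": f"place:{place_fips}",
        "in": f"state:{state_fips}",
    }
    if key:
        params["key"] = key  # API key is optional for low request volumes
    # Keep ':' and ',' unescaped so the URL stays readable.
    return f"{BASE_URL}?{urlencode(params, safe=':,')}"


# Assumed codes: state 42 (Pennsylvania), place 60000 (Philadelphia city).
url = place_query_url(["NAME", "B01001_001E"], "42", "60000")
```

Since the county and city are coterminous, the county query in the PR should return the same estimates; this is mainly useful for cities where that does not hold.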
# County-level data
philly_res = c.acs5.state_county(codes, PA_FIPS, PHILLY_FIPS)
assert len(philly_res) > 0
Does the fact that you need this assertion suggest a problem upstream?
I didn't run into any problems with empty results here, I just wanted to be defensive and have it complain loudly in case of an obvious error. Does that make sense?
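One possible alternative to the bare `assert`, sketched under the assumption that the goal is simply to fail loudly on an empty response. `require_rows` is a hypothetical helper, not part of the PR. An explicit exception also survives `python -O`, which strips `assert` statements.

```python
def require_rows(rows, label):
    """Raise a descriptive error if an API call returned no rows."""
    if not rows:
        raise ValueError(f"No ACS rows returned for {label}")
    return rows
```

Usage would look like `philly_res = require_rows(c.acs5.state_county(codes, PA_FIPS, PHILLY_FIPS), "Philadelphia County")`, which keeps the defensive check while naming the geography that failed.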
Nice looking code!
Overview
Draft a simple data import pipeline for extracting a variety of ACS data for the app.
Notes
For now, I didn't put the raw data under version control, since we're not totally sure what intermediate/final data will look like yet. I figure it would make sense to do that once the full pipeline is in a semi-finished state.
Testing Instructions
Closes #67