Draft data import pipeline #69

Merged: 8 commits, Apr 8, 2019

Conversation

jeancochrane
Contributor

@jeancochrane jeancochrane commented Apr 2, 2019

Overview

Draft a simple data import pipeline for extracting a variety of ACS data for the app.

Notes

For now, I didn't put the raw data under version control, since we're not totally sure what intermediate/final data will look like yet. I figure it would make sense to do that once the full pipeline is in a semi-finished state.

Testing Instructions

Closes #67

&& deps=" \
unzip \
make \
wget \
Contributor Author

We don't need these dependencies immediately, but we'll probably need them once we start collecting shapefiles.

if var_code not in self.codes_to_names.keys():
    continue
# Process nulls
if var_stat == -666666666:
Contributor Author

I couldn't find any docs on this in the Census API. I was getting a few responses that looked like this from 2016 data, so I went ahead and treated them like nulls. I've since switched to 2017 data and I'm not sure whether the unusual responses persisted.

Member

this would be good to report upstream.

Contributor Author

Good call, I'll open an issue.

Contributor Author

Opened datamade/census#72 for this.
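
A minimal sketch of the null handling discussed in this thread, assuming the sentinel can arrive as a number or a numeric string. The helper name clean_stat and the constant name are hypothetical and not part of the PR; only the value -666666666 comes from the diff above.

```python
# Sentinel seen in some 2016 ACS responses (value taken from the diff above).
# Treating it as null is the PR's working assumption, not confirmed API behavior.
ACS_NULL_SENTINEL = -666666666


def clean_stat(var_stat):
    """Return None for missing values, otherwise the value unchanged.

    Hypothetical helper for illustration; the PR's actual writer may differ.
    """
    if var_stat is None:
        return None
    try:
        # The API may return numbers as strings, so compare numerically.
        if float(var_stat) == ACS_NULL_SENTINEL:
            return None
    except (TypeError, ValueError):
        # Non-numeric values (e.g. a geography name) pass through untouched.
        pass
    return var_stat
```

Keeping the check in one place would make it easy to adjust or remove once the upstream issue clarifies what the value means.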

philly_res = c.acs5.state_county(codes, PA_FIPS, PHILLY_FIPS)
assert len(philly_res) > 0
for row in philly_res:
    writer.write_acs_row(PA_FIPS + PHILLY_FIPS, row)
Contributor Author

I'm not totally confident what the convention is for notating sub-state FIPS codes. This code is written assuming that it's the concatenation of all the containing geometries.
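
For reference, Census GEOIDs for nested geographies do follow this concatenation pattern: state (2 digits), county (3), tract (6), block group (1). The sketch below illustrates it. The helper build_geoid is hypothetical, and the constant values are the standard codes for Pennsylvania and Philadelphia County, which are assumed to match the PR's PA_FIPS and PHILLY_FIPS (their definitions are not shown in the diff).

```python
PA_FIPS = "42"        # Pennsylvania state FIPS code
PHILLY_FIPS = "101"   # Philadelphia County FIPS code


def build_geoid(state, county=None, tract=None, block_group=None):
    """Concatenate FIPS components from largest to smallest geography.

    Hypothetical helper; expected widths are state=2, county=3,
    tract=6, block group=1.
    """
    parts = [(state, 2), (county, 3), (tract, 6), (block_group, 1)]
    geoid = ""
    seen_gap = False
    for code, width in parts:
        if code is None:
            seen_gap = True
            continue
        if seen_gap:
            # e.g. a tract was given without its county
            raise ValueError("a sub-geography needs all of its parent codes")
        if len(code) != width or not code.isdigit():
            raise ValueError(f"expected {width} digits, got {code!r}")
        geoid += code
    return geoid
```

Codes are kept as strings so that leading zeros in tract numbers survive the concatenation.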

writer.write_acs_row(fips, row)

# Blockgroup-level data
block_res = c.acs5.state_county_blockgroup(codes, PA_FIPS, PHILLY_FIPS, '*')
Contributor Author

A few of these variables aren't available at blockgroup resolution for the 5-year estimates. Rather than make this distinction in the import code, I decided to go ahead and structure the data the same way (i.e. each variable CSV file has the same FIPS codes) on the assumption that the application code will handle checking for null rows and warning that the blockgroup resolution is unavailable.
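
A rough sketch of what that application-side check could look like. The CSV layout here (columns named fips and value, nulls written as empty strings) is an assumption for illustration, since the output format is not shown in this PR, and the function name is hypothetical.

```python
import csv
import io

# state (2) + county (3) + tract (6) + block group (1)
BLOCKGROUP_GEOID_LENGTH = 12


def blockgroup_data_available(csv_file):
    """Return True if any block group row in a variable CSV has a value.

    Assumes hypothetical columns "fips" and "value", with nulls written
    as empty strings.
    """
    reader = csv.DictReader(csv_file)
    for row in reader:
        is_blockgroup = len(row["fips"]) == BLOCKGROUP_GEOID_LENGTH
        if is_blockgroup and row["value"] != "":
            return True
    return False


# Usage: a variable with state and county values but a null block group row.
sample = io.StringIO("fips,value\n42,100\n42101,50\n421010001001,\n")
if not blockgroup_data_available(sample):
    print("Warning: this variable is unavailable at block group resolution")
```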

@@ -0,0 +1,598 @@
age_and_sex:
  verbose_name: Age and Sex
Contributor Author

None of the verbose names in this file are currently being used, but I assume at some point we'll want to use them to create a crosswalk for user display.

Member

what is yaml giving us that just having this data as python dict is not?

Contributor Author

Two primary considerations:

  1. YAML is easier to write than Python dicts or JSON, and I had to write this out by hand
  2. There's enough data here that it needs to be in its own separate file to avoid cluttering up the code, so we might as well store it in a language-agnostic data format

Member

okay

@jeancochrane jeancochrane force-pushed the jfc/data-import-pipeline#67 branch from 2885383 to 6fc9021 Compare April 3, 2019 21:45
@jeancochrane jeancochrane marked this pull request as ready for review April 3, 2019 21:47
@jeancochrane jeancochrane changed the title from "[ WIP ] Draft data import pipeline" to "Draft data import pipeline" Apr 3, 2019
@jeancochrane jeancochrane requested a review from fgregg April 3, 2019 21:59
@jeancochrane
Contributor Author

Requested review from @fgregg who may be able to accurately assess my use of https://github.com/datamade/census.

writer.write_acs_row(PA_FIPS, row)

# County-level data
philly_res = c.acs5.state_county(codes, PA_FIPS, PHILLY_FIPS)
Member

why is philadelphia county the right unit, and not the city of philadelphia?

If philadelphia county is the right unit, I would call this philly_county_fips.

Contributor Author

Is there an ACS geography that corresponds to municipalities like Philly? I didn't immediately see one in the census docs. I went with the county because the county and city are coterminous: https://en.wikipedia.org/wiki/Philadelphia_County,_Pennsylvania

Member

Named places. This was one of the reasons why we originally wrote https://github.com/datamade/census_area/#usage

Newer versions of the Census API actually have native support for named places, which we don't have convenience methods for in our census wrapper. (https://github.com/uscensusbureau/api-geoHierarchy-changes/blob/master/changes.md?eml=gd&utm_medium=email&utm_source=govdelivery)

Since they are coterminous, this seems fine. It would be good to add a comment about that fact.


# County-level data
philly_res = c.acs5.state_county(codes, PA_FIPS, PHILLY_FIPS)
assert len(philly_res) > 0
Member

that you need this assertion suggests a problem upstream?

Contributor Author

I didn't run into any problems with empty results here, I just wanted to be defensive and have it complain loudly in case of an obvious error. Does that make sense?
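
One possible alternative to the bare assert, sketched under the assumption that a descriptive error is preferable here. The helper name require_results is hypothetical and not part of the PR.

```python
def require_results(results, description):
    """Raise a descriptive error if an API call returned no rows.

    Hypothetical helper; `description` names the query for the error message.
    """
    if not results:
        raise RuntimeError(f"No ACS rows returned for {description}")
    return results
```

Unlike an assert statement, this check still runs when Python is started with the -O flag, and the message says which query came back empty.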

@fgregg
Member

fgregg commented Apr 4, 2019

Nice looking code!

@jeancochrane jeancochrane merged commit c907930 into master Apr 8, 2019
@jeancochrane jeancochrane deleted the jfc/data-import-pipeline#67 branch April 8, 2019 14:43