Extract APD incident report data and host it on GitHub #7

Closed
luqmaan opened this issue Jul 9, 2015 · 14 comments
Labels
Inactive A project that hasn't had movement in a while or is lacking a strong project champion

Comments

@luqmaan
Member

luqmaan commented Jul 9, 2015

One sentence description:

Make the APD incident report data accessible by hosting it on GitHub.

Link (more details/brain dump/alpha):

Miles Hutson (http://github.com/CuriousG102) has done some awesome work making a crime API from the APD incidents data. He created a scraper (https://github.com/CuriousG102/austin-crime-data/blob/master/download.py) that fetches the data from https://www.austintexas.gov/police/reports/search2.cfm.

To build on top of his work, I think we should extract the data from the APD incidents database and store it in a repo on GitHub (like https://github.com/scascketta/CapMetrics).

This makes analysis of the entire dataset easier. Instead of making many API calls to fetch incidents, you can just open a spreadsheet.
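
As a rough sketch of that payoff: once the repo is cloned, the whole dataset is one read away (assuming a hypothetical `incidents.csv` with made-up column names):

```python
# A minimal sketch: analyze the full dataset locally after cloning the repo.
# "incidents.csv" and its column names are hypothetical placeholders.
import pandas as pd

incidents = pd.read_csv("incidents.csv", parse_dates=["date"])
print(incidents["crime_type"].value_counts().head(10))
```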

Project Needs (dev/design/resources):

Development and policy help needed.

Status (in progress, pie-in-the-sky)

pie-in-the-sky

@CuriousG102

I'd be open to the idea. This makes one-off analyses much easier for people and also enables a core tenet of data journalism: reproducibility. For example, if someone disputes your results, it would be nice to be able to say "run my program against commit 65a4e and you'll see the same thing."

Since things are constantly being added to the Postgres database, you obviously can't do that right now.

Any suggestions for structure? Do you just picture a 200,000+ line CSV file? And on ethics: should I do data fuzzing on the coordinates?

@courtney-rosenthal
Member

I guess I'm unclear on the benefit of this versus asking Miles to provide an API that emits (e.g.) CSV, presuming his system can handle the load.

@luqmaan
Member Author

luqmaan commented Jul 9, 2015

I'm not sure an HTTP API would be able to return the entire dataset in a single response.

I like the way CapMetrics does things. Have a look: https://github.com/scascketta/CapMetrics/tree/master/data/vehicle_positions

There are a lot of benefits to this (a sketch of the file layout follows the list):

  • You can browse the data by day in your browser
  • You can clone the repo and immediately manipulate all the data at once
  • The data is versioned, since it's stored in git
  • The repo can be forked
  • It's free and easy
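
A rough sketch of what writing that file-per-day layout could look like, assuming the scraper yields plain dicts with an ISO date field (all names here are hypothetical):

```python
# A sketch of a CapMetrics-style file-per-day layout under data/incidents/.
# The incident dict shape ("date" as an ISO string, etc.) is hypothetical.
import csv
import os
from collections import defaultdict

def write_daily_csvs(incidents, out_dir="data/incidents"):
    os.makedirs(out_dir, exist_ok=True)
    by_day = defaultdict(list)
    for incident in incidents:
        by_day[incident["date"]].append(incident)  # e.g. "2015-07-09"
    for day, rows in by_day.items():
        with open(os.path.join(out_dir, f"{day}.csv"), "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=rows[0].keys())
            writer.writeheader()
            writer.writerows(rows)
```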

How would you fuzz the coordinates?

@CuriousG102

Ditto what Luqmaan said. On fuzzing: I would round at the hundredths.
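
Rounding at the hundredths is a one-liner; at Austin's latitude it leaves roughly a kilometer of positional uncertainty (a sketch):

```python
# A sketch of coordinate fuzzing by rounding to the hundredths place,
# which is roughly 1 km of positional uncertainty at Austin's latitude.
def fuzz_coords(lat, lon):
    return round(lat, 2), round(lon, 2)

fuzz_coords(30.267153, -97.743061)  # -> (30.27, -97.74)
```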


@luqmaan luqmaan removed the discussion label Oct 1, 2015
@decibel

decibel commented Mar 7, 2016

GitHub doesn't seem like an appropriate place to host this data; it's a version control system, not a database.

Perhaps we could get Amazon to donate some S3 space to Open Austin? That would allow for posting dumps of the entire Postgres database, as well as incremental updates.
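
For what posting dumps could look like, a minimal sketch with boto3; the database name, bucket, and key prefix are all hypothetical:

```python
# A sketch of posting a nightly Postgres dump to S3 with boto3.
# The database name, bucket, and key prefix are hypothetical placeholders.
import subprocess
from datetime import date

import boto3

dump_path = f"apd-incidents-{date.today()}.sql"
subprocess.run(["pg_dump", "-f", dump_path, "apd_incidents"], check=True)
boto3.client("s3").upload_file(dump_path, "open-austin-data", f"apd/{dump_path}")
```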

It also depends on how complex the data structure is. If it's really just simple CSV then maybe a file-per-day in git isn't that bad, but it sounds like it's more sophisticated than that.

As for fuzzing, I don't think that's necessary. This is all public data anyway.

@CuriousG102

APD has changed their site, so the scraper associated with my project has been broken for a while now. It could be fixed pretty trivially by someone who knows Python and has time to read up on the libraries I'm using, but between school and an impending out-of-state internship I don't really have the time right now. So if there's still real interest in doing something with this data, I'm happy to help someone else get started with running this stack themselves, and I'm happy to take and review pull requests.

Hosting shouldn't be a problem: this can run on a $5 DigitalOcean server if the memory-inefficient Beautiful Soup code is replaced with Scrapy's XPath selectors. That really needs to be done anyway, because the Beautiful Soup code is grafted on from an earlier project and is hard to read compared to how XPath selectors would work. If minimal changes are made to fix it without replacing Beautiful Soup 4, it can run on a $10 DigitalOcean server.
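
For reference, the kind of replacement being suggested is parsing with Scrapy's selectors instead of building a full Beautiful Soup tree. A sketch, with illustrative XPath expressions rather than the actual APD page structure:

```python
# A sketch of XPath-based extraction with Scrapy's Selector in place of
# Beautiful Soup. The XPath expressions and field names are illustrative;
# the real APD page structure would need to be inspected.
from scrapy.selector import Selector

FIELDS = ["incident_number", "crime_type", "date", "address"]

def parse_incidents(html):
    sel = Selector(text=html)
    for row in sel.xpath("//table[@id='results']//tr[td]"):
        cells = row.xpath("./td/text()").getall()
        yield dict(zip(FIELDS, cells))
```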

@luqmaan
Member Author

luqmaan commented Mar 7, 2016

@decibel We can use S3 since Code for America pays for our AWS.

But CSV files on GitHub are easier for people to access and fork.

@decibel

decibel commented Mar 7, 2016 via email

@CuriousG102

Here is how the data is represented in the Django server the scraper resides on:
https://github.com/CuriousG102/interactives-backend/blob/f53e6ff5a600a39feaf0f2e3460272ab14f8a451/interactivesBackend/crimeAPI/models.py

The ORM models only work with Postgres because they use Postgres-specific GIS features.
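
For context, an illustrative sketch of the kind of GeoDjango model that ties the project to PostGIS; this is not the actual models.py (see the link above for the real fields):

```python
# An illustrative GeoDjango model, not the actual models.py linked above.
# PointField and spatial lookups require a spatial backend such as PostGIS.
from django.contrib.gis.db import models

class Incident(models.Model):
    incident_number = models.CharField(max_length=32, unique=True)
    crime_type = models.CharField(max_length=128)
    occurred_at = models.DateTimeField()
    location = models.PointField(geography=True, null=True)
```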

The linked repo has the full implementation of the REST API, interactives, etc. The data is fundamentally relational, so any dump that doesn't preserve that will have issues. Which issues you can live with (and why you want a dump at all rather than using an API) will be determined by your use case.

As I said, I don't have time to actively maintain this project, so it's up to you to write the code if you want to get the scraper running again and dump the data. I'll take any reasonable PRs to https://github.com/CuriousG102/interactives-backend/. If you replace the Beautiful Soup code with lxml or Scrapy's XPath code, I'll even dockerize it and run it on a $5 DigitalOcean server for you.

@decibel

decibel commented Mar 7, 2016

Just to throw out another option... would xml2dict be an option? I'm actually using it for a client in a plpythonu function that translates XML to a dict, which I then convert to JSON using json.dumps. (I ultimately split JSON objects and arrays into separate tables and then create views that represent all the fields, but that's probably more effort than it's worth here.)
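
If the new site exposes XML, that conversion is only a couple of lines; a sketch using the xmltodict package (assuming that's the library meant by "xml2dict"), with a made-up record:

```python
# A sketch of the XML -> dict -> JSON conversion described above, using
# the xmltodict package (assuming that is the library meant). The sample
# record is made up.
import json
import xmltodict

xml = "<incident><id>123</id><type>THEFT</type></incident>"
record = xmltodict.parse(xml)
print(json.dumps(record, indent=2))
```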

BTW, it looks like the new APD site provides specific vehicle information where appropriate too.

@amaliebarras
Contributor

Looks like there's been lots of discussion on this... I can see the benefit of hosting this data in a backup place like GitHub, or even data.world. Is anyone still interested in this project?

@tracypholmes

tracypholmes commented Feb 28, 2017

I actually created a Ruby CLI gem based on the 2016 data: austin_crime. It's not much, and I'm refactoring, but I made it work. I'm interested and would need some help, as I'm not that familiar with Python (I currently work primarily in Ruby) and am still learning things.

@amaliebarras
Contributor

Awesome! I'll add the Python label to this issue as well!

@werdnanoslen werdnanoslen added this to the Status: Abandoned milestone Sep 16, 2017
@twentysixmoons twentysixmoons added Inactive A project that hasn't had movement in a while or is lacking a strong project champion and removed Data wrangling Infrastructure Open data Public Health Transportation labels Feb 28, 2018
@mscarey
Copy link
Contributor

mscarey commented Apr 7, 2019

I think the APD site and the state of the art in crime data mapping have changed enough that it's time to close this. See also issue #139, scrAPD, which is similar except that it has only traffic fatalities in scope.

@mscarey mscarey closed this as completed Apr 7, 2019