Extract APD incident report data and host it on GitHub #7
Comments
I'd be open to the idea. This makes one-off analyses much easier for people and also enables a core tenet of data journalism: reproducibility. For example, if someone disputes your results, it would be nice to be able to say "run my program against commit 65a4e and you'll see the same thing". Since things are constantly being added to the Postgres database, you obviously can't do that right now. Any suggestions for structure? Do you just picture a 200,000+ line CSV file? Ethics: should I do data fuzzing on coordinates?
I guess I'm unclear on the benefit of this versus asking Miles to provide an API that emits (e.g.) CSV, presuming his system can handle the load.
I'm not sure an HTTP API would be able to respond with all the data ever collected in a single call. I like the way CapMetrics does things; have a look at https://github.com/scascketta/CapMetrics/tree/master/data/vehicle_positions. There are a lot of benefits to this approach.
How would you fuzz the coordinates?
Ditto what Luqmaan said. Fuzzing: I would round to the hundredths.
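For illustration, a minimal sketch of that rounding approach; the function and values here are purely illustrative, not part of the project:

```python
# A rough sketch of coordinate fuzzing by rounding to the hundredths place
# (roughly 1 km of precision); nothing here is the project's actual code.
def fuzz_coords(lat, lon, places=2):
    return round(lat, places), round(lon, places)

print(fuzz_coords(30.267153, -97.743061))  # (30.27, -97.74)
```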
GitHub doesn't seem an appropriate place to host this data; it's a version control system, not a database. Perhaps we could get Amazon to donate some S3 space to Open Austin? That would allow for posting dumps of the entire Postgres database, as well as incremental updates. It also depends on how complex the data structure is. If it's really just simple CSV, then maybe a file-per-day in git isn't that bad, but it sounds like it's more sophisticated than that. As for fuzzing, I don't think that's necessary. This is all public data anyway.
APD has changed their site, so the scraper associated with my project has been broken for a while now. It could be fixed pretty trivially by someone who knows Python and has time to read up on the libraries I'm using, but between school and an impending internship out of state I don't really have the time right now. So if there's still real interest in doing something with this data, I'm happy to help someone else get started with running this stack themselves, and I'm happy to take and review pull requests.

Hosting shouldn't be a problem: this can run on a $5 DigitalOcean server if the memory-inefficient Beautiful Soup code is replaced with Scrapy's XPath selectors. That really needs to be done anyway, because the Beautiful Soup code is grafted on from an earlier project and is hard to read compared to how XPath selectors would work. If minimal changes are made to fix it without replacing Beautiful Soup 4, then it can run on a $10 DigitalOcean server.
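For illustration, a minimal sketch of what the XPath-selector approach could look like using Scrapy's `Selector`; the table structure assumed here is hypothetical and would need to match the real markup of the APD report page:

```python
# A rough sketch, not the project's actual parser: swap Beautiful Soup for
# Scrapy's XPath selectors. The //table//tr structure is an assumption about
# the APD report page, not something confirmed against the live site.
from scrapy.selector import Selector

def parse_incidents(html):
    sel = Selector(text=html)
    for row in sel.xpath("//table//tr"):
        cells = row.xpath("./td//text()").extract()
        if cells:
            # one incident per table row, one field per cell (assumed layout)
            yield [c.strip() for c in cells]
```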
@decibel We can use S3, since Code for America pays for our AWS. But CSV files on GitHub are easier for people to access and fork.
> But CSV files on GitHub are easier to access.

I was under the impression that the data was more complex than a single CSV, but if it really is just a daily/weekly CSV file, then GitHub is fine for raw data. It would be nice to see what exactly was in the database someone mentioned.

In terms of hosting, if there's an AWS account already set up, then that would be an easy way to host the data-pulling script.
Here is how the data is represented in the Django server the scraper resides on. The ORM only works with Postgres because of the usage of Postgres-specific GIS features. The linked repo has the full implementation of the REST API, interactives, etc.

The data is fundamentally relational, so any dump that doesn't preserve that will have issues. Which issues you can live with (and why you want the dump at all rather than using an API) will be determined by your use case.

As I said, I do not have time to actively maintain this project, so it's up to you to write the code if you care to get the scraper running again and dump the data. I'll take any PRs that are reasonable to https://github.com/CuriousG102/interactives-backend/. If you replace the Beautiful Soup code with lxml or Scrapy's XPath code, I'll even dockerize it and run it on a $5 DigitalOcean server for you.
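One way to keep a flat dump relational is to write one CSV per table and retain the foreign-key columns so the files can be re-joined later. A minimal sketch, assuming hypothetical table names and a Postgres connection string:

```python
# A rough sketch of a per-table CSV dump that keeps foreign keys intact.
# The table names and DSN are placeholders, not the project's real schema.
import psycopg2

TABLES = ["incidents", "incident_locations"]  # hypothetical table names

def dump_tables(dsn, tables=TABLES):
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            for table in tables:
                with open(f"{table}.csv", "w", newline="") as f:
                    # COPY ... WITH CSV HEADER keeps column names in the file
                    cur.copy_expert(f"COPY {table} TO STDOUT WITH CSV HEADER", f)
    finally:
        conn.close()
```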
Just to throw out another option... would xml2dict be an option? I'm actually using that for a client in a plpythonu function that translates XML to a dict, which I then convert to JSON using json.dumps. (I ultimately split JSON objects and arrays into separate tables and then create views that represent all the fields, but that's probably more effort than it's worth here.)

BTW, it looks like the new APD site provides specific vehicle information where appropriate too.
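For illustration, a minimal sketch of that conversion using the xmltodict package (a commonly available equivalent of the library named above), written as plain Python rather than inside a plpythonu function:

```python
# A rough sketch of the XML-to-dict-to-JSON idea described above, using the
# xmltodict package; the sample document is made up for illustration.
import json
import xmltodict

def xml_to_json(xml_text):
    """Parse an XML document into a dict and serialize it as JSON."""
    return json.dumps(xmltodict.parse(xml_text))

print(xml_to_json("<incident><id>1</id><type>theft</type></incident>"))
```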
Looks like there's been lots of discussion on this... I can see a benefit in hosting this data in a backup place like GitHub, or even data.world. Is anyone still interested in this project?
I actually created a Ruby CLI gem based on the 2016 data, austin_crime. It's not much, and I'm refactoring, but I made it work. I'm interested and would need some help, as I'm not that familiar with Python (I currently work primarily in Ruby) and am still learning things.
Awesome! I'll add the Python label to this issue as well!
I think the APD site and the state of the art in crime data mapping have changed enough that it's time to close this. See also issue #139, scrAPD, which is similar except it has only traffic fatalities in scope.
One sentence description:
Make the APD incident report data accessible by hosting it on GitHub.
Link (more details/brain dump/alpha):
Miles Hutson (http://github.com/CuriousG102) has done some awesome work making a crime API from the APD incidents data. He created a scraper (https://github.com/CuriousG102/austin-crime-data/blob/master/download.py) that fetches the data from https://www.austintexas.gov/police/reports/search2.cfm.
To build on top of his work, I think we should extract the data from the APD incidents database and store it in a repo on GitHub (like https://github.com/scascketta/CapMetrics).
This makes analysis of the entire dataset easier. Instead of making many API calls to fetch incidents, you can just open a spreadsheet.
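For illustration, a one-off analysis against a committed dump could be as simple as the following; the file and column names are hypothetical:

```python
# A rough sketch of the "just open a spreadsheet" workflow; incidents.csv is a
# hypothetical file name for a committed dump, not something that exists yet.
import pandas as pd

incidents = pd.read_csv("incidents.csv")
print(incidents.head())
print(incidents["crime_type"].value_counts())  # column name is hypothetical
```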
Project Needs (dev/design/resources):
Development help, policy help needed.
Status (in progress, pie-in-the-sky):
pie-in-the-sky