Extract APD incident report data and host it on GitHub #7
Comments
I'd be open to the idea. This makes one-off analyses much easier for people and also enables a core tenet of data journalism: reproducibility. For example, if someone disputes your results, it would be nice to be able to say "run my program against commit 65a4e and you'll see the same thing". Since things are constantly being added to the Postgres database, you obviously can't do that right now. Any suggestions for structure? Do you just picture a 200,000+ line CSV file? Ethics: should I do data fuzzing on coordinates?
I guess I'm unclear on the benefit of this versus asking Miles to provide an API that emits (e.g.) CSV, presuming his system can handle the load.
I'm not sure an HTTP API would be able to respond with all the data ever collected in a single call. I like the way CapMetrics does things; have a look at https://github.com/scascketta/CapMetrics/tree/master/data/vehicle_positions. There are a lot of benefits to this approach.
How would you fuzz the coordinates?
Ditto what Luqmaan said. Fuzzing: I would round to the hundredths.
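For illustration, a minimal sketch of that rounding approach; the function and values here are purely illustrative, not part of the project:

```python
# A rough sketch of coordinate fuzzing by rounding to the hundredths place
# (roughly 1 km of precision); nothing here is the project's actual code.
def fuzz_coords(lat, lon, places=2):
    return round(lat, places), round(lon, places)

print(fuzz_coords(30.267153, -97.743061))  # (30.27, -97.74)
```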
GitHub doesn't seem an appropriate place to host this data; it's a version control system, not a database. Perhaps we could get Amazon to donate some S3 space to Open Austin? That would allow for posting dumps of the entire Postgres database, as well as incremental updates. It also depends on how complex the data structure is. If it's really just simple CSV, then maybe a file-per-day in git isn't that bad, but it sounds like it's more sophisticated than that. As for fuzzing, I don't think that's necessary. This is all public data anyway.
APD has changed their site, so the scraper associated with my project has been broken for a while now. It could be fixed pretty trivially by someone who knows Python and has time to read up on the libraries I'm using, but between school and an impending internship out of state I don't really have the time right now. So if there's still real interest in doing something with this data, I'm happy to help someone else get started with running this stack themselves, and I'm happy to take and review pull requests.

Hosting shouldn't be a problem: this can run on a $5 DigitalOcean server if the memory-inefficient Beautiful Soup code is replaced with Scrapy's XPath selectors. That really needs to be done anyway, because the Beautiful Soup code is grafted on from an earlier project and is hard to read compared to how XPath selectors would work. If minimal changes are made to fix it without replacing Beautiful Soup 4, then it can run on a $10 DigitalOcean server.
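For illustration, a minimal sketch of what the XPath-selector approach could look like using Scrapy's `Selector`; the table structure assumed here is hypothetical and would need to match the real markup of the APD report page:

```python
# A rough sketch, not the project's actual parser: swap Beautiful Soup for
# Scrapy's XPath selectors. The //table//tr structure is an assumption about
# the APD report page, not something confirmed against the live site.
from scrapy.selector import Selector

def parse_incidents(html):
    sel = Selector(text=html)
    for row in sel.xpath("//table//tr"):
        cells = row.xpath("./td//text()").extract()
        if cells:
            # one incident per table row, one field per cell (assumed layout)
            yield [c.strip() for c in cells]
```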
@decibel We can use S3, since Code for America pays for our AWS. But CSV files on GitHub are easier for people to access and fork.
> But CSV files on GitHub are easier to access.

I was under the impression that the data was more complex than a single CSV, but if it really is just a daily/weekly CSV file, then GitHub is fine for raw data. It would be nice to see what exactly was in the database someone mentioned.

In terms of hosting, if there's an AWS account already set up, then that would be an easy way to host the data-pulling script.
Here is how the data is represented in the Django server the scraper resides on. The ORM only works with Postgres because of the usage of Postgres-specific GIS features. The linked repo has the full implementation of the REST API, interactives, etc.

The data is fundamentally relational, so any dump that doesn't preserve that will have issues. Which issues you can live with (and why you want the dump at all rather than using an API) will be determined by your use case.

As I said, I do not have time to actively maintain this project, so it's up to you to write the code if you care to get the scraper running again and dump the data. I'll take any PRs that are reasonable to https://github.com/CuriousG102/interactives-backend/. If you replace the Beautiful Soup code with lxml or Scrapy's XPath code, I'll even dockerize it and run it on a $5 DigitalOcean server for you.
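One way to keep a flat dump relational is to write one CSV per table and retain the foreign-key columns so the files can be re-joined later. A minimal sketch, assuming hypothetical table names and a Postgres connection string:

```python
# A rough sketch of a per-table CSV dump that keeps foreign keys intact.
# The table names and DSN are placeholders, not the project's real schema.
import psycopg2

TABLES = ["incidents", "incident_locations"]  # hypothetical table names

def dump_tables(dsn, tables=TABLES):
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            for table in tables:
                with open(f"{table}.csv", "w", newline="") as f:
                    # COPY ... WITH CSV HEADER keeps column names in the file
                    cur.copy_expert(f"COPY {table} TO STDOUT WITH CSV HEADER", f)
    finally:
        conn.close()
```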
Just to throw out another option... would xml2dict be an option? I'm actually using that for a client in a plpythonu function that translates XML to a dict, which I then convert to JSON using json.dumps. (I ultimately split JSON objects and arrays into separate tables and then create views that represent all the fields, but that's probably more effort than it's worth here.)

BTW, it looks like the new APD site provides specific vehicle information where appropriate too.
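For illustration, a minimal sketch of that conversion using the xmltodict package (a commonly available equivalent of the library named above), written as plain Python rather than inside a plpythonu function:

```python
# A rough sketch of the XML-to-dict-to-JSON idea described above, using the
# xmltodict package; the sample document is made up for illustration.
import json
import xmltodict

def xml_to_json(xml_text):
    """Parse an XML document into a dict and serialize it as JSON."""
    return json.dumps(xmltodict.parse(xml_text))

print(xml_to_json("<incident><id>1</id><type>theft</type></incident>"))
```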
Looks like there's been lots of discussion on this... I can see a benefit in hosting this data in a backup place like GitHub, or even data.world. Is anyone still interested in this project?
I actually created a Ruby CLI gem based on the 2016 data, austin_crime. It's not much, and I'm refactoring, but I made it work. I'm interested and would need some help, as I'm not that familiar with Python (I currently work primarily in Ruby) and am still learning things.
Awesome! I'll add the Python label to this issue as well!
I think the APD site and the state of the art in crime data mapping have changed enough that it's time to close this. See also issue #139, scrAPD, which is similar except it has only traffic fatalities in scope.
One sentence description:
Make the APD incident report data accessible by hosting it on GitHub.
Link (more details/brain dump/alpha):
Miles Hutson (http://github.com/CuriousG102) has done some awesome work making a crime API from the APD incidents data. He created a scraper (https://github.com/CuriousG102/austin-crime-data/blob/master/download.py) that fetches the data from https://www.austintexas.gov/police/reports/search2.cfm.
To build on top of his work, I think we should extract the data from the APD incidents database and store it in a repo on GitHub (like https://github.com/scascketta/CapMetrics).
This makes analysis of the entire dataset easier. Instead of making many API calls to fetch incidents, you can just open a spreadsheet.
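For illustration, a one-off analysis against a committed dump could be as simple as the following; the file and column names are hypothetical:

```python
# A rough sketch of the "just open a spreadsheet" workflow; incidents.csv is a
# hypothetical file name for a committed dump, not something that exists yet.
import pandas as pd

incidents = pd.read_csv("incidents.csv")
print(incidents.head())
print(incidents["crime_type"].value_counts())  # column name is hypothetical
```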
Project Needs (dev/design/resources):
Development help, policy help needed.
Status (in progress, pie-in-the-sky):
pie-in-the-sky