This repo collects information by locale (county or town) from critical swing states for VoteByMail.io. Code for each state is under the state's name.
To get started, run the make create-install
command. There are other useful commands there.
The real work is done by PyInvoke, a simple task runner which was installed by the previous command.
Data is saved in the /public
folder of the public-data branch by state (e.g. florida.json
). Each file is a json array of all election-official contacts for locale. The format of the contacts depends on the state but supports (at a minimum) the following typescript
interface
interface Contact {
// each contact should have a locale and either a county or city
// it should also have either an email or fax and preferably the county official's name.
locale: string // locale name, unique within state
county?: string // county name
city?: string // city or township name
official?: string // name of election's official
emails?: string[] // array of emails
faxes?: string[] // list of fax numbers in E.164 format + ext
// optional fields
phones?: string[] // list of phone numbers in E.164 format + ext
url?: string // url for locale's election information
address?: string // mailing address data
physicalAddress?: string // physical address
party?: string // party affiliation of official
}
NB:, fields with a question mark (e.g. county?
) indicate that the value may possibly be empty, i.e. no such key exists. If no values are provided by the state, this is how it is indicated.
Phone and fax numbers should be in E.164 format, with optional extensions allowed. Since all numbers are US, they should match the regex '^\+1\d{10}(x\d+)?$'
. Collecting a state calls normalize_state
in src/common.py
, which will automatically reformat phone and fax numbers that are close to this format.
State data is no longer saved in the master
branch of this repo.
Data releases are tagged with the date of collection using the format data/yyyy-mm-dd
. Files that could not be collected or had no changes will be carried over from previous commits.
Each state's crawler is put under its own folder (e.g. states/new_york/main.py
), with potentially other files in the folder.
-
The goal is for each state's crawler to fetch and process all of its required inputs without human intervention, so that we can easily re-run scripts periodically to collect fresh data.
-
We use
cache_request
fromcommon.py
to request webpages so that the results are saved to a local cache for faster development. Thecommon
module also contains several other functions which may be useful, includingcache_selenium
andcache_webkit
. -
Each state's
main.py
should include a function namedfetch_data()
, which will be called using PyInvoke using (e.g.)inv collect new_york
-
The results are then sorted using
normalize_state
fromcommon.py
and saved in the above json format. -
Once you have the data, verify that it works by running tests:
make test
-
Also, rerun the Jupyter notebook
analysis/Analysis.ipynb
from scratch to update the analytics. You can see how many fields you were able to parse. To start the jupyter notebook, runmake jupyter
. Run the notebook. Make sure that you have all the values you need. Do not commit the notebook changes. Jsut throw them away. They just block rebase merging.
The public_data
GitHub Actions workflow will periodically run, collect fresh data, commit any updated .json files to the public-data
branch, and push a new data version tag in the format data/yyyy-mm-dd
.
To trigger this workflow between scheduled runs (i.e., after a new state is added to master
), push any commit to the trigger-public-data
branch. This will run the update workflow based on the latest code on master branch (admittedly, this is a bit of a hack to trigger a github action). If updated data is found, this will also push a new data version tag as part of the normal workflow.
For example,
git checkout trigger-public-data
touch 20200714_unscheduled_run.txt
git add 20200714_unscheduled_run.txt
git commit -m "20200714 unscheduled run"
git push origin trigger-public-data
This will automatically create a tag with the updated data and proper date.
Some state code (i.e. Nevada) does not seem to work on Github Actions. Instead, it needs to be run manually. To do this, you must run the following commands (changing values for the tag as necessary)
git checkout master
inv collect nevada
cp public/nevada.json /tmp/.
git checkout public-data
cp /tmp/nevada.json public/.
git commit -am 'Adding Nevada'
git tag data/2020-08-07
git push origin data/2020-08-07
Please submit code via pull requests, ideally from this repo if you have access or from your own fork if you do not.
- This repository has a continuous integration (CI) workflow to run pylint and tests on pull requests. The tests must pass for CI for code to be merged.
- We strive to only use rebase merges
- Please don't save changes to the Jupyter notebook
analysis/Analysis.ipynb
(it will break your rebase merge).
To update a version, tag the commit with a bumped semvar version and push the tag. Admittedly we are a little loose on the definition of a "minor" vs "patch" increment. For example, if the previous version was 1.4.0
and we chose to increment to 1.5.0
, we would deploy using:
git tag v1.5.0
git push origin v1.5.0
To see a list of all tags, run
git fetch
git tag --list
DO NOT DELETE TAGS ONCE THEY ARE PUBLISHED! Just increment the minor version and republish if you made a mistake. We rely on stable tags for production.
- To get started, look at the
Makefile
. You can install files, startup Jupyter, etc ... - To run tasks, we use PyInvoke. Look at
tasks.py
file
This repository is for VoteByMail.io.