-
Notifications
You must be signed in to change notification settings - Fork 10
YAMZ - a crowdsourced metadata dictionary
License
nassibnassar/yamz
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
This is the README for the YAMZ metadictionary and includes instructions for deploying on a local machine for testing and on Heroku (heroku.com) for a scalable production version. These assume a Ubuntu GNU/Linux environment, but should be easily adaptable to any system; YAMZ is written in Python and uses only cross-platform packages. This file is formatted by hand and does not contain markdown. Authored by Chris Patton. Last updated 5 March 2016 (jak). YAMZ is formerly known as SeaIce; for this reason, the database tables and API use names based on "SeaIce". Contents ======== 0. Prerequisites 0.1 Create the deploy_keys branch 1. Configuring a local instance 1.1 Postgres authentication 1.2 Create the database 1.3 Create a role for standard queries 1.4 OAuth credentials and app key 1.5 N2T persistent identifier resolver credentials 1.6 Test the instance 2. Deploying to Heroku 2.1 Heroku-Postgres 2.2 Mailgun 2.3 Heroku-Scheduler 2.4 Making changes 2.5 Exporting the dictionary 3. URL forwarding 4. Building the docs 0. Prerequisites ================ The contents of this directory are as follows: sea.py . . . . . . . . . . Console utility for scoring and classifying terms and other things. ice.py . . . . . . . . . . Web server front end. digest.py . . . . . . . . Console email notification utility. requirements.txt . . . . . Heroku package dependencies. Procfile . . . . . . . . . Heroku configuration. seaice/ . . . . . . . . . The SeaIce Python module. html/ . . . . . . . . . . HTML templates, static Javascript and CSS, including bootstrap.js. doc/ . . . . . . . . . . . API documentation and tools for building it. # .seaice/.seaic_auth . . . DB credentials, API keys, app key, etc. Note .seaice/.seaice_auth . . . DB credentials, API keys, app key, etc. Note that these files are just templates and don't contain actual keys. Before you get started, you need to set up a database and some software packages. On Mac OS X, this may suffice: $ pip install psycopg2 Flask configparser flask-login flask-OAuth \ python-dateutil On Ubuntu, grab the following: python-flask . . . . . . . Simple HTTP server. postgresql . . . . . . . . We're using PostgreSQL for database managment. python-psycopg2 . . . . . Python API for PostgreSQL. python-pip . . . . . . . . Package manager for additional Python programs. and then download a package from pip that handles configuration files nicely: $ sudo pip install \ configparser flask-login flask-OAuth python-dateutil urlparse 0.1 Create the deploy_keys branch ================================= The 'master' branch contains all the code to deploy locally or on heroku. To deploy to heroku, we first need to create a local branch (one time operation) called "deploy_keys" and then check it out, $ git branch deploy_keys # create branch $ git checkout deploy_keys # update working tree Then edit .seaice and .seaice_auth with actual API and app keys, and push to heroku with $ git push heroku deploy_keys:master See section 2 for more on heroku. IMPORTANT: NEVER PUSH THIS BRANCH TO GITHUB. 1. Configuring a local instance =============================== To start, we'll set up a database in postgres. First, we need to do some configuration. Postgres requires an administrative user called 'postgres'. It may be a good idea to create a SeaIce user (called "role" in postgres jargon) with read/write access granted on the tables. This work takes place in the top-level source directory. On Linux (assuming postgres was installed using sudo), set postgres' password: $ sudo -u postgres psql template1 template1=# alter user postgres with encrypted password 'PASS'; template1=# \q [quit] On a Mac, assuming homebrew installed the software in your home directory (ie, without sudo), you'll need to initialize postgres first, with something like these commands: $ initdb /usr/local/var/postgres $ cp /usr/local/Cellar/postgresql/9.6.3/homebrew.mxcl.postgresql.plist \ ~/Library/LaunchAgents/ $ launchctl load -w ~/Library/LaunchAgents/homebrew.mxcl.postgresql.plist That makes postgres run automatically after a reboot. Create user 'postgres' and set the password to 'PASS': $ createuser -d postgres $ createuser -d SeaIce $ psql -U postgres -c "alter user postgres with encrypted password 'PASS'" 1.1 Postgres authentication =========================== Now configure the authentication method for postgres and all other users connecting locally. In /etc/postgresql/9.1/main/pg_hba.conf (on our Mac, /usr/local/var/postgres/pg_hba.conf), change "peer" in the line that will become (xxx shouldn't the md5 show below?) local all postgres peer to "md5" for the administrative account and local unix domain socket connections. Next, we want to only be able to connect to the database from the local machine. In /etc/postgresql/9.1/main/postgresql.conf (on our Mac, /usr/local/var/postgres/postgresql.conf), uncomment the line listen_addresses = 'localhost' After you've done this, you need to restart the postgres server. On Linux, $ sudo service postgresql restart or on our homebrew-based Mac, $ launchctl stop ~/Library/LaunchAgents/homebrew.mxcl.postgresql.plist $ launchctl start ~/Library/LaunchAgents/homebrew.mxcl.postgresql.plist 1.2 Create the database ======================= Finally, log back in to postgres to create the database, $ sudo -u postgres psql postgres=# create database seaice with owner postgres; or on our Mac, $ psql -U postgres -c 'create database seaice with owner postgres' (Using unique, completely random passwords is a good idea here.) Next, create a configuration file for the database and user account you set up. Create a file called '.seaice' like: [default] dbname = seaice user = postgres password = PASS xxx do this in a separate "local_deploy" dir? xxx user = reader? xxx separate section for [contributor] ? eg xxx [contributor] xxx dbname = seaice xxx user = contributor xxx password = PASSWORD3 IMPORTANT NOTE: A template of this file is provided in the github repository. Your working version of this file should remain secret and must not be published. Set the correct file permissions with: $ chmod 600 .seaice This file is used by the SeaIce DB connector to grant access to the database. To initialize the DB schema and tables, type: # $ ./sea.py --init-db --config=.seaice $ ./sea.py --init-db --config=local_deploy/.seaice xxx move this section above preceding, since you may need those roles? 1.3 Create a role for standard queries ====================================== At this point, it's suggested that you set up user standard read/write permissions on the table (no DROP, CREATE, GRANT, etc.) for most of the database queries. Note that this isn't applicable in Heroku; the postgres interface there doesn't allow you to control user views. postgres=# create user contributor with encrypted password 'PASS'; postgres=# \c seaice; postgres=# grant usage on schema SI, SI_Notify to contributor; postgres=# grant select, insert, update, delete on all tables in schema SI, SI_Notify to contributor; On our Mac, that would be $ psql -c "create user contributor with encrypted password 'PASS'" template1 xxx shouldn't we do the next line too? grant usage on schema ... $ psql -c "grant usage on schema SI, SI_Notify to contributor" seaice $ psql -c "grant select, insert, update, delete on all tables in schema SI, SI_Notify to contributor" seaice Add the configuration to '.seaice': [contributor] dbname = seaice user = contributor password = PASS XXXX move this last few lines after [dev] and [heroku] are set up xxx and/or add --deploy=dev so that local instance can come up without having to configure heroku section The web user interface creates a database connection pool with the same role. You can specify this on the command line: # $ ./ice.py --role=contributor --config=.seaice $ ./ice.py --role=contributor --config=local_deploy/.seaice '--role' defaults to 'default'. 1.4 OAuth credentials and app key ================================= YAMZ uses Google for third party authentication (OAuth-2.0) management of logins. Visit https://console.developers.google.com to set this service up for your instance. Navigate to something like API Manager -> Credentials and select whatever lets you create a new OAauth client ID. For local configuration, supply these answers: Application type . . . . . . . . . . Web application Authorized javascript origins . . . http://localhost:5000 Authorized redirect URI . . . . . . http://localhost:5000/authorized Create another set of credentials for your heroku instance, say yamz-dev: Application type . . . . . . . . . . Web application Authorized javascript origins . . . http://yamz-dev.herokuapp.com Authorized redirect URI . . . . . . http://yamz-dev.herokuapp.com/authorized In each case, you should obtain a pair of values to put into another configuration file called '.seaice_auth'. Create or edit this file, replacing google_client_id with the returned 'Client ID' and replacing google_client_secret with the returned 'Client secret'. xxx Where does app_secret come in? does it come from the 'API key'? Manoj: app_secret only needed for heroku deployment (See section 2.) XXX Are the instructions there complete, eg, redirect URL?) XXX ?document this in a 0.x section, since it applies to local and heroku? Next, create a configuration file called '.seaice_auth' with the appropriate client IDs and secret keys. For instance, you may have credentials for 'http://localhost:5000', as well as a deployment on heroku: XXX the google_client_id identifies your client software/(app?) and is paired with the redirect URL, eg, one for 'http://localhost:5000' and another for http://yamz.net... xxx is this correct? each unique post-auth redirection target needs its own unique google_client_id XXX to do: allow local dev to take place offline, ie, without contact with google for Auth or with minters and binders and n2t xxx to do: let people create test terms that expire in two weeks [dev] google_client_id = 000-fella.apps.googleusercontent.com google_client_secret = SECRET1 app_secret = SECRET2 [heroku] google_client_id = 000-guy.apps.googleusercontent.com google_client_secret = SECRET3 app_secret = SECRET4 IMPORTANT NOTE: A template of this file is provided in the github repository. This file should remain secret and must not be published. We provide the template, since heroku requires a commited file. For convenience, this file will also keep the Flask app's secret key. For this key, enter a long, random string of characters. Finally, set the correct file permissions with: $ chmod 600 .seaice_auth 1.5 N2T persistent identifier resolver credentials ================================================== Whenever a new term is created, YAMZ uses an API to n2t.net (maintained by the California Digital Library) in order to generate ("mint") a persistent identifier. The main role of n2t.net is to be a resolver for the public- facing URLs that persistently identify YAMZ terms. It is necessary to provide a minter password for API access to this web service. To do so include a line in ".seaice_auth" for every view: minter_password = PASS A password found in the MINTER_PASSWORD environment variable, however, will be preferred over the file setting. This password is used again in the API call to store metadata in a YAMZ binder on n2t.net. The main bit of metadata stored is the redirection target URL that supports resolution of ARK identifiers for YAMZ terms. Because real identifiers are meant to be persistent, no local or test instance of YAMZ should ever set the boolean "prod_mode" parameter in ".seaice_auth". For such instances the generated and updated terms should just be for identifiers meant to be thrown away. Only on the real production instance of YAMZ, when you're done testing term creation and update, should it be set to "enable" (the default is don't enable). 1.6 Test the instance ===================== First, create the database schema: $ ./sea --config=.seaice --init-db Start the local server with: $ ./ice.py --config=.seaice --deploy=dev If all goes well, you should be able to navigate to your server by typing 'http://localhost:5000' in the address bar. To verify that you've set up Google OAuth-2.0 correctly, try logging in. This will create an account. Try adding a new term, modifying and deleting a term, and commenting on terms. To classify a term, do: $ ./sea.py --config=.seaice --classify-terms 2. Deploying to Heroku ====================== The YAMZ prototype is currently hosted at http://yamz.herokuapp.com. Heroku is a cloud-computing service which allows users to host web-based software projects. Heroku is scalable for a price; however, we can still achieve quite a bit without spending money. We have access to a small Postgres database, can schedule jobs, use a variety of packages (all we need are available), and deploy easily with Git. Some limitations of Heroku are that it is impossible to set up DB roles and any local files cannot be assumed to persist after a reboot. To begin, you need to setup an account with Heroku and download their software. (It's nothing major, just some tools for running commands, interacting with the database, etc.) Visit http://www.heroku.com. Heroku requires a couple additional configuration files and some small code changes. The additional files (already set up in the repo) are: Procfile . . . . . . . specifies the commands that start web server, as well as periodic jobs. requirements.txt . . . a list of packages required by our software that Heroku needs to make available. These are available via pip. I used the following tutorial: https://devcenter.heroku.com/articles/python to set these up. Once you've set up your heroku account, you're ready to deploy. The recommended best practice for managing your heroku instance is to set up a local branch called 'deploy_keys' based on 'master'. In this branch, edit .seaice and .seaice_auth to contain actual passwords and API and app keys. NOTE: IT IS CRITICAL THAT YOU DON'T PUSH THIS BRANCH TO GITHUB. Publishing these secrets compromises the security of the entire app. Login via the heroku website and create a new app. Let's say we've named it "fella". Navigate to the directory containing the cloned repository. Create and checkout the branch 'deploy_keys'. $ git checkout -b deploy_keys $ heroku login $ heroku git:remote -a fella This creates a "slug" containing our code and its dependencies. To get the web app running, we'll now need to set up a database and a couple of heroku backend services. 2.1 Heroku-Postgres =================== Heroku-Postgres is a scalable DB interface for heroku apps. (See the python section of devcenter.heroku.com/articles/heroku-postgresql .) To create a free addon, $ heroku addons:create heroku-postgresql:hobby-dev The 'master' branch is set up to use either a local postgres database server or Heroku-Postgres. The location of the DB in the "cloud" is specified when you create the heroku addon, and heroku automatically sets the instance's environment variable DATABASE_URL, which you can query with $ heroku config Using 'sea' or 'ice' with '--config=heroku' will force SeaIce to use the web address found in this variable to connect to the DB. (Note this is the default.) Heroku-Postgres doesn't allow you to create roles, so '--role' will be ignored and the default will be used. To create the database schema: $ heroku run python sea.py --init-db 2.2 Mailgun =========== YAMZ provides an email notification service for users who opt in. A utility called 'digest' collects for each user all notifications that haven't previously been emailed into a single digest. The code uses a heroku backend app called Mailgun for SMTP service. To set this up, simply type (you may be asked to verify your heroku account with a credit card, but note your card should not be charged for the most basic service level) $ heroku addons:create mailgun This sets a number of instance environment variables (see "heroku config"). Of them the code uses "MAILGUN_SMTP_LOGIN" and "MAILGUN_SMTP_PASSWORD" to connect to Mailgun. Normally that happens when notifications are harvested by the scheduler (below), but to send out notifications manually, type: $ heroku run python digest.py 2.3 Heroku-Scheduler ==================== There are two periodic jobs that need to be scheduled in YAMZ: the term classifier and the email digest. To set this up, do: $ heroku addons:create scheduler $ heroku addons:open scheduler The second command will take you to the web interface for the scheduler. Add the following two jobs: "python sea.py --classify-terms" . . . . . every 10 minutes "python digest.py" . . . . . . . . . . . . once per day 2.4 Starting the instance ========================= Now that your instance is all prepared, you can get it up and running with $ git push heroku deploy_keys:master This pushes the secret keys found in the local deploy_keys branch so that they update the remote master branch on heroku. (xxx see section 1.4 and ??? for setting the secrets) # xxx app_secret is the (api key?) password from netrc, or "heroku auth:token"? 2.5 Making changes ================== Deploying changes to heroku is made easy with Git. Suppose we have changes to 'master' that we want to push to heroku. $ git checkout deploy_keys $ git merge master # updates deploy_keys with latest master commits $ git push heroku deploy_keys:master The first command checks out the already created local 'deploy_keys' branch. The second command merges the latest commits from the master branch into it, and the final command updates the heroku master branch, which also restarts the instance. This keeps the secrets outside the master branch. When you next checkout the master branch, however, your keys and secrets in the .seaice* files will be overwritten, so you may want to save them in separate files that you can copy back in to the branch when you later deploy again; just make sure those separate files don't ever become part of any branch that will show up in the public github repo. 2.6 Exporting the dictionary ============================ The SeaIce API includes queries for importing and exporting database tables in JSON formatted objects. This could be used to backup the entire database. Note however that imports must be done in the proper order in order to satisfy foreign key constraints. To back up the dictionary, do: $ heroku config | grep DATABASE_URL DATABASE_URL: <whatever> $ export DATABASE_URL=<whatever> $ ./sea.py --config=heroku --export=Terms >terms.json 3. URL forwarding ================= The current stable implementation of YAMZ is redirected from http://yamz.net. Setting this up takes a bit of doing. The following instructions are synthsized from http://lifesforlearning.com/heroku-with-godaddy/ for redirecting a domain name managed by GoDaddy to a Heroku app. Launch the "Domains" app on GoDaddy. Under "Forward Domain" for the appropriate domain (let's call it "fella.org"), add the following settings: Forward to . . . . . . . . . . . . . . . . . . . . http://www.fella.org Redirect type . . . . . . . . . . . . . . . . . . 301 (Permanent) Forward settings . . . . . . . . . . . . . . . . . Forward only Update nameservers and DNS settings to support this change . . . . . . . . . yes Next, under "Manage DNS", remove all entries except for 'A (Host)' and 'NS (Nameserver)', and add the following under 'CName (Alias)': Record type . . . . . . . . . . . . . . . . . CNAME (Alias) Host . . . . . . . . . . . . . . . . . . . . . www Points to . . . . . . . . . . . . . . . . . . http://fella.herokuapp.com TTL . . . . . . . . . . . . . . . . . . . . . 1 Hour Next, change the IP address for entry '@' under 'A (Host)' to 50.63.202.31 (the current IP address of yamz.net). That's it for DNS configuration. The last thing we need to do is modify the redirect URLs in the Google OAuth API. Edit the authorized javascript origins and redirect URI by replacing "fella.herokuapp.com" with "fella.org" and save. It can take a couple hours to a day for your DNS settings to propogate. Once it's done, you can navigate to YAMZ by typing "fella.org" into your browser. Try logging in to verify that the OAuth settings are also correct. 4. Building the docs ==================== The seaice package (but not this README file) is autodoc'ed using python-sphinx. To install on Ubuntu: $ sudo apt-get install python-sphinx The directory doc/sphinx includes a Makefile for exporting the docs to various media. For example, make html make latex
About
YAMZ - a crowdsourced metadata dictionary
Resources
License
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published