A set of scripts to make it easier to set up census-postgres on an Amazon EC2 instance.
If you just want to use American Community Survey data on your own PostgreSQL machine, it's recommended to use the dumps that Census Reporter provides. Read more about those dumps in our Tumblr post.
Use the next section only if you want to go through the process of rebuilding these data dumps from scratch.
These are the steps I follow when I want to start from scratch and load all ACS releases into the database.
- Launch a `c1.xlarge` instance using the most recent Ubuntu 14.04 image, making sure to connect all four of the ephemeral storage volumes to block devices during the setup walkthrough. If you have the `aws` command line tool installed and configured, this command should do it (a status-check sketch follows this list):

  ```
  aws ec2 request-spot-instances --dry-run \
      --spot-price 1.5 \
      --instance-count 1 \
      --launch-specification '{
          "InstanceType": "c1.xlarge",
          "ImageId": "ami-xxxxxxx",
          "BlockDeviceMappings": [
              {"VirtualName": "ephemeral0", "DeviceName": "/dev/sdb"},
              {"VirtualName": "ephemeral1", "DeviceName": "/dev/sdc"},
              {"VirtualName": "ephemeral2", "DeviceName": "/dev/sdd"},
              {"VirtualName": "ephemeral3", "DeviceName": "/dev/sde"}
          ]
      }'
  ```
- Connect to it and immediately launch `screen`.
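Before you can connect, you need the instance's address. A minimal sketch for tracking the request down, assuming you submitted it without `--dry-run` (the instance ID below is a placeholder):

```
# Check whether the spot request has been fulfilled and grab the instance ID
aws ec2 describe-spot-instance-requests \
    --query 'SpotInstanceRequests[].[SpotInstanceRequestId,State,InstanceId]'

# Look up the public DNS name to SSH into (substitute your actual instance ID)
aws ec2 describe-instances --instance-ids i-xxxxxxxx \
    --query 'Reservations[].Instances[].PublicDnsName'
```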
Next, set up a RAID0 array across the four ephemeral volumes:
```
sudo su -
apt-get update
apt-get -y install mdadm xfsprogs
umount /mnt
yes | mdadm --create /dev/md0 --level=0 -c256 --raid-devices=4 /dev/xvdb /dev/xvdc /dev/xvdd /dev/xvde
echo 'DEVICE /dev/xvdb /dev/xvdc /dev/xvdd /dev/xvde' > /etc/mdadm/mdadm.conf
mdadm --detail --scan >> /etc/mdadm/mdadm.conf
blockdev --setra 65536 /dev/md0
mkfs.xfs -f /dev/md0
mkdir -p /mnt && mount -t xfs -o noatime /dev/md0 /mnt
exit
cd /mnt
```
After this, `df -h` should indicate that you have 1.7TB of disk available at `/dev/md0`.
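To double-check the array before moving on (these are standard mdadm/XFS checks, not part of this repo's scripts):

```
cat /proc/mdstat                # md0 should be an active raid0 across 4 devices
sudo mdadm --detail /dev/md0    # shows array state, chunk size and member disks
df -h /mnt                      # the XFS mount should report roughly 1.7TB
```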
Installs PostgreSQL 9.1 and puts the data directory on the `/dev/md0` partition we just created above.
```
sudo su -
apt-get install -y postgresql-9.1 postgresql-9.1-postgis
/etc/init.d/postgresql stop
mkdir /mnt/postgresql
sed -i "s/data_directory = '\/var\/lib\/postgresql\/9.1\/main'/data_directory = '\/mnt\/postgresql\/9.1\/main'/" /etc/postgresql/9.1/main/postgresql.conf
mv /var/lib/postgresql/9.1 /mnt/postgresql/
chown -R postgres:postgres /mnt/postgresql
/etc/init.d/postgresql start
exit
```
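If you want to confirm PostgreSQL is really using the relocated data directory, ask the server directly:

```
sudo -u postgres psql -c "SHOW data_directory;"
# Should print: /mnt/postgresql/9.1/main
```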
Create a user and database for census data.
```
sudo -u postgres psql -c "CREATE ROLE census WITH NOSUPERUSER LOGIN UNENCRYPTED PASSWORD 'censuspassword';"
sudo -u postgres psql -c "CREATE DATABASE census WITH OWNER census;"
```
Make login passwordless.
echo "localhost:5432:census:census:censuspassword" > /home/ubuntu/.pgpass
chmod 0600 /home/ubuntu/.pgpass
After this you should be able to connect to your empty PostgreSQL database without being prompted for a password:

```
psql -h localhost -U census census
```
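For a non-interactive check that the `.pgpass` entry is being picked up (this should print a row without ever prompting):

```
psql -h localhost -U census -d census -c "SELECT current_user, current_database();"
```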
Downloads the raw data from the Census Bureau to prepare for loading into the database you just created. We'll use aria2 to help us download these files as quickly as possible.
```
sudo mkdir -p /mnt/tmp
sudo chown ubuntu /mnt/tmp
sudo apt-get install -y aria2 git
cd /home/ubuntu
git clone https://github.com/censusreporter/census-postgres-scripts.git
git clone https://github.com/censusreporter/census-postgres.git
cd census-postgres-scripts
```
With this set up, we can use the scripts I wrote to download the data from the Census Bureau in a relatively consistent manner. It probably makes sense to run at least some of these in parallel across several screen sessions (a sketch for doing that follows the list below). If a recent ACS release doesn't show up on this list, follow the new release checklist to create one for the new release before continuing.
```
./02_download_acs_2007_1yr.sh
./02_download_acs_2007_3yr.sh
./02_download_acs_2008_1yr.sh
./02_download_acs_2008_3yr.sh
./02_download_acs_2009_1yr.sh
./02_download_acs_2009_3yr.sh
./02_download_acs_2009_5yr.sh
./02_download_acs_2010_1yr.sh
./02_download_acs_2010_3yr.sh
./02_download_acs_2010_5yr.sh
./02_download_acs_2011_1yr.sh
./02_download_acs_2011_3yr.sh
./02_download_acs_2011_5yr.sh
./02_download_acs_2012_1yr.sh
./02_download_acs_2012_3yr.sh
./02_download_acs_2012_5yr.sh
```
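As a sketch of the parallel approach mentioned above, you could start each download in its own detached screen session. The loop and session names here are my own convention, not part of these scripts, and you may want to run fewer at a time to avoid saturating the network:

```
# Start one detached screen session per download script
for s in ./02_download_acs_*.sh; do
    screen -dmS "$(basename "$s" .sh)" bash "$s"
done
screen -ls    # list sessions; reattach to one with: screen -r <name>
```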
An hour or two and 279GB later, you should have a directory at `/mnt/tmp` full of raw, expanded Census Bureau ACS data.
Once we have the ACS data downloaded, it's time to actually load that data into PostgreSQL. Again, since each release is slightly different, there are a number of scripts I hand-crafted to do this import in a consistent way. If a recent ACS release doesn't show up on this list, follow the new release checklist to create one for the new release before continuing.
```
./03_import_acs_2007_1yr.sh
./03_import_acs_2007_3yr.sh
./03_import_acs_2008_1yr.sh
./03_import_acs_2008_3yr.sh
./03_import_acs_2009_1yr.sh
./03_import_acs_2009_3yr.sh
./03_import_acs_2009_5yr.sh
./03_import_acs_2010_1yr.sh
./03_import_acs_2010_3yr.sh
./03_import_acs_2010_5yr.sh
./03_import_acs_2011_1yr.sh
./03_import_acs_2011_3yr.sh
./03_import_acs_2011_5yr.sh
./03_import_acs_2012_1yr.sh
./03_import_acs_2012_3yr.sh
./03_import_acs_2012_5yr.sh
```
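Each import script creates a schema named after its release (for example `acs2012_5yr`). A quick sanity check is to list the schemas and count rows in one of them; the `geoheader` table name assumes the census-postgres layout:

```
psql -h localhost -U census -d census -c "\dn"
psql -h localhost -U census -d census -c "SELECT count(*) FROM acs2012_5yr.geoheader;"
```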
We just imported the estimate/error values for the ACS data. We also rely on tables that describe what the various Census tables and columns mean and how they relate to each other. We'll load that information here.
```
cd /home/ubuntu
git clone https://github.com/censusreporter/census-table-metadata.git
cd census-table-metadata
psql -U census -d census -h localhost -f census_metadata.sql
sudo -u postgres psql -d census -f census_metadata_load.sql
```
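Once loaded, you can query this metadata alongside the data tables. A minimal sketch, assuming the `census_table_metadata` table and its `table_id`/`table_title` columns from the census-table-metadata project:

```
psql -h localhost -U census -d census \
    -c "SELECT table_id, table_title FROM census_table_metadata LIMIT 5;"
```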
The geodata part of our APIs comes from the Census Bureau's TIGER 2012 dataset. Let's download and import it.
```
cd /home/ubuntu/census-postgres-scripts
./11_set_up_postgis.sh
./12_download_tiger_2012.sh
./13_import_tiger_2012.sh
psql -U census -d census -h localhost -f 13_index_tiger_2012.sql
```
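A quick spatial sanity check, assuming the import produced a `tiger2012` schema with a `county` table whose geometries are stored in a `geom` column with SRID 4326 (the coordinates are an arbitrary point in Chicago):

```
psql -U census -d census -h localhost -c "
SELECT name FROM tiger2012.county
WHERE ST_Contains(geom, ST_SetSRID(ST_MakePoint(-87.65, 41.85), 4326));"
```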
TIGER data for American Indian geographies is a bit different from the others. This script makes the necessary database adjustments so that Census Reporter can treat them equally:
```
cd /home/ubuntu/census-postgres-scripts
psql -U census -d census -h localhost -f 14_aiannh_tables_2017.sql
```
Census Reporter uses GIS analysis where necessary to identify parent/child relationships between geographies (see `13_import_tiger_2012.sh`). However, in some cases other data is available to handle containment more accurately.
CBSAs are delineated explicitly as one or more counties. Because the location of the delineation files seems unreliable, we advise generating the SQL separately as needed, using `geocontainment_scripts/cbsa_containment.py`, and moving the created SQL script to the root of this repository. If the delineations haven't changed, then the script can just be copied and the schema year updated.
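A minimal sketch of that copy-and-update path, assuming the only year-specific strings in the file are the schema year and the filename (diff the result before loading):

```
cd /home/ubuntu/census-postgres-scripts
# Hypothetical: derive a 2019 containment script from the 2018 one
sed 's/2018/2019/g' 15_cbsa_geocontainment_2018.sql > 15_cbsa_geocontainment_2019.sql
```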
```
cd /home/ubuntu/census-postgres-scripts
psql -U census -d census -h localhost -f 15_cbsa_geocontainment_2018.sql
```
What to do when the Census Bureau releases a new set of ACS data.