This program orchestrates Docker containers to run each step of processing the rental listings data.
Docker is used to set up the environment specific to each piece of the processor. We need Docker because the requirements and dependencies differ so much between steps (Mapping, Cleaning, Geolocating). This saves you from installing each environment on your host machine and from relying on the host to have the correct packages, at the correct versions, to run each piece of the processor.
We need Git LFS to pull down the large volume/reference files required for processing.
git clone https://github.com/MAPC/rental-listing-processor
One environment file is loaded into the Mapper container. In it you must specify the year and quarter you are pulling listings data for and add the database information. The Mapper is the only container that communicates with the database (what this container does is described below).
The environments for the other containers are written into the docker-compose.yml file. We use an environment file for the Mapper because it requires sensitive information that we don't want to commit by mistake; the other containers require no such information, so their environments can be committed.
cp .env.mapper.template .env.mapper
vim .env.mapper
Provide login/location information to the Mapper container by filling in the blank fields of .env.mapper.
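Before starting a run, it can help to confirm that no field was left blank. The following is a minimal sketch, assuming .env.mapper uses plain KEY=VALUE lines with # comments; the helper name blank_fields is hypothetical and not part of this repository.

```python
from pathlib import Path


def blank_fields(env_text):
    """Return the names of KEY=VALUE entries whose value is empty."""
    missing = []
    for raw in env_text.splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Treat empty quotes ('' or "") the same as no value at all.
        if not value.strip().strip("'\""):
            missing.append(key.strip())
    return missing


if __name__ == "__main__":
    env_path = Path(".env.mapper")
    if env_path.exists():
        missing = blank_fields(env_path.read_text())
        print("Blank fields:", ", ".join(missing) if missing else "none")
    else:
        print(".env.mapper not found; copy it from .env.mapper.template first")
```

Run from the project root, this lists any fields that still need a value before you start the processor.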
./process.py
The Mapper container pulls the data from the database according to the months specified in the map program. We need the Mapper because the Cleaner container expects the input data in a specific format/structure. The Mapper is responsible for shaping the timely data into a consumable CSV file.
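As an illustration of how a year and quarter translate into the months to pull, here is a small sketch assuming calendar quarters (Q1 is January through March). The function name months_in_quarter is hypothetical; the actual selection logic lives in the map program and may differ.

```python
def months_in_quarter(year, quarter):
    """Return (year, month) pairs for a calendar quarter, e.g. Q2 -> April to June."""
    if quarter not in (1, 2, 3, 4):
        raise ValueError("quarter must be 1, 2, 3, or 4")
    # Q1 starts in month 1, Q2 in month 4, Q3 in month 7, Q4 in month 10.
    first_month = 3 * (quarter - 1) + 1
    return [(year, month) for month in range(first_month, first_month + 3)]


print(months_in_quarter(2019, 2))  # [(2019, 4), (2019, 5), (2019, 6)]
```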
The Cleaner generates a few of the deliverables and also organizes the input datasets for the Geolocator to process. After the Cleaner runs, we have to manually move one of the files it creates so that the Geolocator has access to it.
The final step is running the Geolocator. This takes the longest of all the processing steps, usually 5 to 8 hours.
You should now see a .zip archive in the project root which contains the output of the process.
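For orientation, the overall flow can be sketched as a sequence of docker-compose run calls, one per step. This is only a sketch under stated assumptions: the service names mapper, cleaner, and geolocator are guesses at what docker-compose.yml defines, the function names are hypothetical, and the actual process.py may work differently. The sketch also omits the manual file move and the creation of the .zip archive.

```python
import subprocess

# Assumed service names; check docker-compose.yml for the real ones.
STEPS = ["mapper", "cleaner", "geolocator"]


def compose_command(service):
    """Build the docker-compose invocation for a single step."""
    return ["docker-compose", "run", "--rm", service]


def run_pipeline(steps=STEPS, runner=subprocess.run):
    """Run each step in order; check=True stops at the first failing step."""
    for service in steps:
        runner(compose_command(service), check=True)
```

Calling run_pipeline() from the project root would run the three steps one after another and raise subprocess.CalledProcessError if any container exits with a non-zero status.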