Overview:
- Private configuration: `/etc/barberini-analytics/`
  - `secrets/`: Secret files.
- Database: `/var/barberini-analytics/db-data/`
  - `applied_migrations.txt`: Local version registry of the migration system. See Migration system.
  - `server.{crt,key}`: Certificate files. E.g., these can be created using certbot; see configuration.
  - `pg-data/`: Postgres internals of the database container.
- Logs: `/var/log/barberini-analytics/`

  Remark: All logs older than two weeks will be cleaned up automatically via `cron.sh`. No chance to take a longer vacation!
- `database.env`: Local credentials for the database.
- `keys.env`: Contains access tokens for various public APIs.
- `smtp.env`: Contains access parameters for the mailer service. The mailer is used to send notification emails about failures in the mining pipeline or the CI pipeline. If you need to fix the mailer, you should be able to log in at account.google.com using these credentials. Otherwise, just exchange the credentials to use another account.
- `secret_files/`: Various files that should be available in the luigi container. Used to store intended amounts of global state. Note that from within the luigi container (`make connect`), you can access these files in `/app/secret_files/`! A small access sketch follows this list.
  - `absa/`: Large external datasets used for the implementation of the bachelor thesis about ABSA.
    - `german_word_embeddings.model` (CACHE)
    - `spacy_models` (CACHE)
    - `SePL-german-v1.1.csv` (see `FetchSepl`, manually requested from http://www.opinion-mining.org)
  - `google_gmb_*.json`: Required for the Google Maps task. See implementation.
  - `ig_session`: Required for Instagram thumbnail fetching. See `FetchIgPostThumbnails`.

  CACHE files do not need to be provided after installation but will be restored automatically.
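To illustrate the point above, here is a minimal sketch of how code running inside the luigi container might read one of these files. Only the `/app/secret_files/` mount point is taken from this document; the helper function and the use of `pathlib` are illustrative assumptions.

```python
from pathlib import Path

# Inside the luigi container, the secret files are available at
# /app/secret_files/ (see the listing above). Everything else in this
# sketch is an assumption.
SECRET_FILES_DIR = Path('/app/secret_files')


def read_secret_file(name: str) -> bytes:
    """Return the raw content of a secret file, e.g. 'ig_session'."""
    path = SECRET_FILES_DIR / name
    if not path.exists():
        raise FileNotFoundError(
            f"Expected secret file {path} - was it provisioned during setup?")
    return path.read_bytes()


if __name__ == '__main__':
    # Hypothetical usage: load the Instagram session blob mentioned above.
    session_blob = read_secret_file('ig_session')
    print(f"Loaded {len(session_blob)} bytes")
```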
Our recommended workflow consists of the following policies:
- a protected `master` branch
- a new merge request for every change
- a merge policy that rejects every branch unless the CI has passed (ideally, use Pipelines for Merged Results). For more information about our CI pipeline, head here.
- `.gitlab-ci.yml`: Configuration file for GitLab CI. See Continuous Integration.
- `luigi.cfg`: Configuration file for luigi, a framework for pipeline orchestration that is used for our central mining pipeline. In particular, you can configure timeouts and notification emails on failures here. By the way: if desired, it is also possible to reduce the number of notification emails by bundling them.
- `Makefile`: Smorgasbord of everyday commands used during development. It is recommended to scan this file once to get a brief overview. To run any of these targets, for example, type `make connect` to open a connection to the luigi docker container. You can also use bash completion to show all make targets.
- `data/`: Configuration files and constants for several components. See `CONFIGURATION.md`.
- `docker/`: Stuff for container configuration. See Docker containers.
  - `docker-compose*.yml`: Configuration files for all docker containers.
  - `Dockerfile`: Scripts to be executed when building the luigi container.
  - `Dockerfile_gplay_api`: Scripts to be executed when building the gplay_api container.
  - `requirements.txt`: pip dependencies. See also: Update project dependencies.
- (`output*`: Pipeline artifacts. Not actually part of this repository; excluded via `.gitignore`.)
- `power_bi/`: Data analytics reports for Power BI. These reports are stored as template files to avoid storing secrets in the repository.
- `scripts/`: Smorgasbord of scripts for development, operations, and manual data manipulations.
  - `migrations/`: Migration scripts for separate DB schema changes. See Migration system.
  - `running/`: Scripts for automated pipeline operations.
  - `setup/`: Scripts for setting up the solution on another VM/workstation. No claim on completeness. See installation.
  - `tests/`: Scripts used as part of CI tests.
    - `run_minimal_mining_pipeline.sh`: See the minimal mining description in CI stages.
  - `update/`: Scripts for occasional use. See the particular documentations.
    - `historic_data/`: Scripts to scrape all data from the gomus system again. Regularly, we only scrape data of the latest few weeks in the daily pipeline. This script can be used if older changes have to be respected (e.g. after the VM has been down for a while, or after retroactive changes in the booking system have been made that go beyond simple cancellations).
- `src/`: Source code of data integration and analysis tasks. The rough structure follows the different data sources (see Data sources) or analysis fields. In particular, the following paths are of special relevance:
  - `_utils/`: Miscellaneous helper methods and classes used for database access and preprocessing.
  - `gomus/`: Component for the integration of the go~mus booking system. Museum data are accessed using web scrapers, undocumented HTTP calls to download reports, and the public API.
    - `_utils/`: Web scrapers.
- `tests/`: Unit tests and acceptance tests for the solution. The rough structure follows the `src/` tree. In particular, the following paths are of special relevance:
  - `_utils/`: Classes for our domain-specific test framework.
  - `utils/`: Unit tests for `src/_utils`.
  - `schema/`: Acceptance tests for the database schema.
  - `test_data/`: Contains sample files used as mocking inputs or expected outputs of units under test.
- `visualizations/sigma/`: Custom Power BI Visual for the sigma text graph. See documentation there.
Docker is a state-of-the-art technology used for containerizing package dependencies.
This project uses multiple Docker containers for several purposes.
They are all defined in the `docker/` folder.
At the moment, our solution includes three docker containers:
- `barberini_analytics_luigi` (aka "luigi container" or "the docker"): Primary docker container used for all execution of source code. It is set up and destroyed automatically as part of every automated pipeline run (see `cron.sh`). To manually start up the docker, use `make startup`. To open a bash shell on the docker, run `make connect`. To stop the docker again, use `make shutdown`.
- `barberini_analytics_db` (aka "database container"): Postgres container intended to run permanently. If it is not running, it will be started automatically as part of every automated pipeline run.
- `gplay_api`: Special docker container used to host the Google Play API scraper. See `docker/Dockerfile_gplay_api` and `src/gplay/gplay_reviews` for further information.
Our whole data-mining pipeline is built using the pipeline orchestration framework Luigi.
It can be configured via the `luigi.cfg` file (see above).
Here is a short crash course on how it works:
- For each script, you can define a task by subclassing `luigi.Task`.
- Every task can define three main pieces of information by overriding the corresponding methods (a minimal sketch follows this list):
  - an output target (`output()`): the name of the file the task will produce. A task is `complete()` iff its output file exists.
  - dependencies (`requires()`): a collection of tasks that need to be completed before the requiring task is allowed to run.
  - the `run()` method: contains the actual task logic. A task can also yield dynamic dependencies simply by implementing `run()` as a generator that yields other task instances.
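To make the three methods above concrete, here is a minimal, self-contained sketch of two Luigi tasks. The task names, file paths, and contents are purely illustrative and not taken from this repository; only the general `luigi.Task` API (`requires()`, `output()`, `run()`) is as described above.

```python
import luigi


class FetchRawData(luigi.Task):
    """Hypothetical upstream task: writes a raw data file."""

    def output(self):
        # The task counts as complete() as soon as this file exists.
        return luigi.LocalTarget('output/raw_data.txt')

    def run(self):
        with self.output().open('w') as f:
            f.write('some raw data\n')


class TransformData(luigi.Task):
    """Hypothetical downstream task: depends on FetchRawData."""

    def requires(self):
        # Luigi schedules FetchRawData before this task may run.
        return FetchRawData()

    def output(self):
        return luigi.LocalTarget('output/transformed_data.txt')

    def run(self):
        # self.input() refers to the output target of the required task.
        with self.input().open('r') as raw, self.output().open('w') as out:
            out.write(raw.read().upper())


if __name__ == '__main__':
    # Run the small example pipeline locally without a central scheduler.
    luigi.build([TransformData()], local_scheduler=True)
```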
To run our whole pipeline, we define some `WrapperTask`s in the `_fill_db` module (a sketch of the idea follows below).
See `make luigi` and `scripts/running/fill_db.sh` for triggering a run of the whole pipeline.
To rerun a certain task during development, remove its output file and trigger the pipeline again.
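For illustration, this is roughly what a wrapper task looks like in Luigi. The class names and parameter values below are invented for the example and do not reflect the actual contents of the `_fill_db` module; a `luigi.WrapperTask` simply declares its requirements and is complete once all of them are.

```python
import luigi


class ExampleSourceTask(luigi.Task):
    """Stand-in for a real per-data-source task (hypothetical)."""

    source = luigi.Parameter()

    def output(self):
        return luigi.LocalTarget(f'output/{self.source}.txt')

    def run(self):
        with self.output().open('w') as f:
            f.write(f'data for {self.source}\n')


class FillDb(luigi.WrapperTask):
    """Illustrative wrapper task in the spirit of the _fill_db module.

    A WrapperTask has no output of its own; it is complete as soon as
    all of its required tasks are complete.
    """

    def requires(self):
        # In the real pipeline these would be the source-specific tasks
        # described under "Data sources".
        yield ExampleSourceTask(source='apple_appstore')
        yield ExampleSourceTask(source='gplay')


if __name__ == '__main__':
    luigi.build([FillDb()], local_scheduler=True)
```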
This project currently collects and integrates data from the following sources:

| Data source | Relevant data | Access point | Authentication | Current status |
|---|---|---|---|---|
| Apple App Store | Ratings and comments about the app | RSS feed | None | Fully operational (presumably) |
| Facebook | | Facebook Graph API | Facebook Access Token | Partially operational |
| Google Maps Reviews | Ratings and comments | Google My Business API | GMB API Key | Fully operational (presumably) |
| Google Play Reviews | Ratings and comments about the app | Scraper | None | Fully operational |
| Gomus | Exhibitions | | None | Fully operational |
| | Bookings | | Gomus session ID | |
| | Customers | | | |
| | Daily entries | | | |
| | Orders | | | |
| Instagram | | Facebook Graph API | Facebook Access Token | Fully operational |
| Twitter/X | All tweets related to the museum | Scraper | None | Out of service |
Our CI pipeline is designed to work with GitLab and is controlled via the `.gitlab-ci.yml` file.
These are our different CI stages (non-exhaustive list):
- build: Test the tech stack setup, especially the Dockerfiles and dependencies.
- unittest: Run all unit tests in our `tests/` directory. See Repository overview.
- minimal-mining-pipeline: Test setup and running of the entire pipeline in a minimal mode, providing an isolated context against production data. See `scripts/tests/run_minimal_mining_pipeline.sh`.
- lint: Makes sure all Python code in the repository follows the PEP8 coding style guide.
During the development of this solution, a lot of database schema changes accumulated. To manage the complexity of synchronizing such changes on every VM and every developer's workstation, and in order to ensure internal stability by testing these changes, we developed a schema migration system.

This is how it works: every schema change is defined as a new migration version.
Pipelines automatically scan for newly added migration versions and apply them.
All applied migration versions are stored in `/var/barberini-analytics/db-data/applied_migrations.txt`.
Usually you do not want to touch that file manually.
To add a new migration, check out the latest version name `nnn` under `scripts/migrations/` and create a new script file named `migration_{nnn + 1}.xxx`.
The script file can have an arbitrary extension, but it must be either an `.sql` transaction or provide a valid shebang.
If you use a shebang, make sure to `chmod +x` that file (a sketch of such a script follows below).
You can also run `make migration` to create a new SQL migration script.
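As an illustration of the shebang variant, a migration script could be an executable Python file along these lines. The file name, the use of `psycopg2`, and the environment variables for the connection are assumptions made for this sketch; this document only requires a valid shebang and executable permissions.

```python
#!/usr/bin/env python3
"""Hypothetical migration script, e.g. scripts/migrations/migration_042.py.

Remember to `chmod +x` this file so the migration runner can execute it.
"""
import os

import psycopg2  # assumption: psycopg2 is available in the environment

# Assumption: connection parameters are provided via environment variables
# (e.g. from database.env); adapt to however your setup passes credentials.
connection = psycopg2.connect(
    host=os.environ.get('POSTGRES_HOST', 'localhost'),
    dbname=os.environ['POSTGRES_DB'],
    user=os.environ['POSTGRES_USER'],
    password=os.environ['POSTGRES_PASSWORD'],
)

with connection, connection.cursor() as cursor:
    # The actual schema change: keep it small and self-contained, since
    # this version must never be edited once it has been applied elsewhere.
    cursor.execute('''
        ALTER TABLE example_table
        ADD COLUMN example_column TEXT;
    ''')

connection.close()
```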
To apply all pending migrations, run `make apply-pending-migrations`.
This is done automatically via `cron.sh`.
To apply all migrations regardless of the applied-migrations file, run `scripts/migrations/migrate.sh` without any arguments.
Remark: A migration system is characterized by the immutability of all previously defined versions. This induces the policy that you must never change any existing migration script that could already have been applied elsewhere. To revert an existing migration, create a new migration script that implements how to revert the changes of the migration in question.