Project description: The problem my project aims to solve is the distributed, unorganized state of information about government officials' financial decisions. My project is for anyone interested in government officials' financial disclosures, including their stock and options transaction records. It is unique because it identifies ways in which political activities correlate with government officials' financial transactions. This enables transparency, helps deter insider trading, and surfaces potential conflicts of interest. It also allows retail investors to monitor officials' trades and follow them before restrictions are enacted.
Application Stack Architecture:
Architecture Description
- Data collectors use the Python requests library to fetch data from two main public sites.
- Data pre-processor converts the downloaded CSV data into structured raw data and inserts/updates the corresponding tables in the database.
- Job scheduler has two jobs:
a. invoke the periodic data collectors and pre-processor.
b. invoke the batch data processor to transform structured raw data into application data structures and save/update them in the database.
- Batch data processor performs data transformation and analysis, and stores the results for the API server to use.
- API server provides endpoints for the application server:
a. /dashboard
b. /search (by name/filing date/stock symbol)
- Integration server: this component facilitates the integration and deployment of application code from the source repository to the staging environment.
- Frontend server uses a template language to serve HTML pages built from the API server endpoints.
- Web server acts as a reverse proxy for the application server.
- Integration server also handles:
a. monitoring the health of endpoints on the API server, frontend, and web server.
b. continuous integration testing on each code change and build.
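The /dashboard and /search endpoints above can be sketched as a minimal Flask app. This is a sketch under assumptions: the in-memory TRADES list and its field names are hypothetical stand-ins for the real database-backed service layer.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical in-memory stand-in for the database-backed service layer.
TRADES = [
    {"name": "Jane Doe", "symbol": "AAPL", "filing_date": "2024-01-05"},
    {"name": "John Roe", "symbol": "MSFT", "filing_date": "2024-02-10"},
]

@app.route("/dashboard")
def dashboard():
    # Return all trade records for the dashboard view.
    return jsonify(TRADES)

@app.route("/search")
def search():
    # Filter by name, filing date, or stock symbol via query parameters.
    wanted = {k: v for k, v in request.args.items()
              if k in ("name", "symbol", "filing_date")}
    rows = [t for t in TRADES if all(t.get(k) == v for k, v in wanted.items())]
    return jsonify(rows)
```

In the real service the list comprehension would be replaced by a SQL query against the PostgreSQL tables.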
The initial architecture diagram (Week 1) is almost identical; the only design changes are:
- A relational database (PostgreSQL) was chosen instead of a document database. Initially I thought the application would need to store a raw PDF file per record, which is a large amount of data for a relational database to handle. During development, however, I was able to parse the trades text out of the PDF files, which significantly reduced the amount of data to store. Search is also more robust using a relational database's built-in SQL queries.
- Performance metrics services were added, using Heroku-managed ones.
The application integrates with GitHub to make it easy to deploy to my app stack running on Heroku. When GitHub integration is configured for the app, Heroku can automatically build and release (if the build is successful). Continuous delivery is implemented with a Heroku pipeline, which runs functional and unit tests automatically on every code push to GitHub, as well as on any merge to master from the dev branch that serves as staging. Staging is promoted to the production servers after the tests pass. A few illustrative tests were written using the standard pytest library and run continuously on each code change in GitHub.
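An illustrative pytest-style test might look like the following; the parse_amount helper is a hypothetical example of the kind of unit under test (disclosure amount ranges like "$1,001 - $15,000" appear in the source filings), not the project's actual code.

```python
# test_service.py -- illustrative unit test in pytest style.

def parse_amount(range_str):
    """Convert a disclosure amount range like '$1,001 - $15,000'
    into a (low, high) pair of integers."""
    low, high = range_str.replace("$", "").replace(",", "").split("-")
    return int(low), int(high)

def test_parse_amount():
    # pytest discovers and runs any function named test_*.
    assert parse_amount("$1,001 - $15,000") == (1001, 15000)
```

Running `pytest` in the pipeline executes every `test_*` function on each push.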
Staging environment: This is a pre-production environment where the application is deployed and tested before being released to the production environment.
The monitoring service watches the system's performance, health, and potential issues, providing visibility and alerting mechanisms. Heroku provides server performance metrics and alert services, including monitoring of application response time, memory, and throughput. The alert service sends notifications on system events in production, such as unresponsive endpoints, resource exhaustion, or throughput exceeding a threshold.
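The endpoint health monitoring the integration server performs can be sketched as a simple probe; the function name and the healthy/elapsed return shape are illustrative, not the actual implementation.

```python
import time
import urllib.request

def check_endpoint(url, timeout=5):
    """Probe a single endpoint; return (healthy, elapsed_seconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            # Treat any HTTP 200 as healthy; other codes count as failures.
            healthy = resp.status == 200
    except OSError:
        # URLError (connection refused, timeout, DNS failure) subclasses OSError.
        healthy = False
    return healthy, time.monotonic() - start
```

A scheduled loop over the API, frontend, and web server URLs, alerting when `healthy` is False or elapsed time is too high, would complete the picture.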
Code Structure:
Web Application (app.py/service.py)
- Standard Python Flask web app that routes HTTP requests and responds with templated data from the database on web pages.
- An APScheduler BackgroundScheduler is started when app.py starts; it runs fetcher.py's 'main' method periodically. The timestamp of the last data collection is displayed at the bottom of the site page.
- service.py provides DB records for each endpoint by retrieving them from the database. It performs certain data conversions on the unstructured raw trades data.
- fetcher.py uses the Python requests library to fetch the two main public sites, respectively:
- https://disclosures-clerk.house.gov/public_disc/financial-pdfs/{year}FD.zip — all Congress members' trades disclosed in the year.
- https://disclosures-clerk.house.gov/public_disc/ptr-pdfs/2024/{docId} — an individual trade disclosure doc record.
- attaches each parsed individual trade doc to its record.
- saves records to the PostgreSQL database (hosted on Heroku).
- includes a function to collect multiple years of records.
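fetcher.py's download targets follow the {year}/{docId} URL patterns above. A minimal sketch of building them (the function names are illustrative; the actual download, unzip, and retry logic is omitted):

```python
BASE = "https://disclosures-clerk.house.gov/public_disc"

def yearly_zip_url(year):
    # Zip of all Congress members' financial disclosures for a given year.
    return f"{BASE}/financial-pdfs/{year}FD.zip"

def trade_doc_url(year, doc_id):
    # An individual trade disclosure (PTR) PDF document.
    return f"{BASE}/ptr-pdfs/{year}/{doc_id}"

def multi_year_urls(start, end):
    # The multi-year collection walks an inclusive span of years.
    return [yearly_zip_url(y) for y in range(start, end + 1)]
```

Each URL would then be fetched with `requests.get(url)` and the response body saved or unzipped.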
- processor.py helps fetcher.py convert raw data into structured records, and extracts the trades from each record's trade PDF document.
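Trade extraction from PDF text can be sketched with a regular expression. The line format here (symbol, P/S purchase-or-sale flag, date) is a hypothetical simplification; the real PTR PDF layout varies, and the PDF-to-text step itself is assumed to have already happened.

```python
import re

# Hypothetical line format for a trade extracted from PTR PDF text;
# treat this pattern as illustrative only.
TRADE_RE = re.compile(
    r"(?P<symbol>[A-Z]{1,5})\s+(?P<type>[PS])\s+(?P<date>\d{2}/\d{2}/\d{4})"
)

def extract_trades(text):
    """Pull (symbol, purchase/sale, date) dicts out of raw PDF text."""
    return [m.groupdict() for m in TRADE_RE.finditer(text)]
```

The extracted dicts are what get stored as structured records instead of the full PDFs, which is what made the PostgreSQL choice viable.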
- db.py inserts/updates/removes government trade records in a PostgreSQL database hosted on Heroku.
- DB connection parameters are in the .env file. Sample db records screenshot here.
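db.py's insert/update path can be sketched as a parameterized upsert. The table and column names here are assumptions about the schema, and the psycopg2 connection (using the .env parameters) is shown commented out so the sketch stays self-contained.

```python
# Parameterized upsert for a trades table (schema names are assumptions).
UPSERT_SQL = """
INSERT INTO trades (doc_id, member_name, symbol, filing_date)
VALUES (%(doc_id)s, %(member_name)s, %(symbol)s, %(filing_date)s)
ON CONFLICT (doc_id) DO UPDATE SET
    member_name = EXCLUDED.member_name,
    symbol = EXCLUDED.symbol,
    filing_date = EXCLUDED.filing_date;
"""

def upsert_params(record):
    # Map a parsed record dict onto the SQL parameter names;
    # missing fields become NULL.
    return {k: record.get(k)
            for k in ("doc_id", "member_name", "symbol", "filing_date")}

# With psycopg2 (connection parameters come from the .env file):
# import os, psycopg2
# conn = psycopg2.connect(os.environ["DATABASE_URL"])
# with conn, conn.cursor() as cur:
#     cur.execute(UPSERT_SQL, upsert_params(record))
```

Keying the conflict clause on the disclosure doc id makes re-running the collector idempotent: re-fetched records update in place instead of duplicating.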
The HTML page lets users view/search/select government trading records. The application server runs a Python Flask stack. Sort/search functionality uses sortable.js, and the 'last run' line at the bottom displays the data collection time.
Public url of my project: https://govtrade-a46bca12cc9b.herokuapp.com/
python3 -m venv venv
source ./venv/bin/activate
pip install -r requirements.txt
export FLASK_APP=src/app.py
flask run --port 1234 --debug