Authors: Yacun Wang, Jianming Geng, Huy Trinh
June 16th, 2023
This project aims to improve the profitability of an insurance company by understanding the patterns and factors that contribute to accidents in the city of San Diego.
Throughout the analysis, we use Python in Jupyter Notebooks for data processing, insertion, and analysis. The Python packages listed in the notebooks can be installed with pip install <package>.
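Before opening the notebooks, it can help to see which packages still need installing. The following is a minimal sketch using only the standard library. The names in `REQUIRED` are assumptions for illustration and should be replaced with the import names that actually appear in the notebooks.

```python
from importlib.util import find_spec


def missing_packages(import_names):
    """Return the top-level import names that cannot be found.

    Each entry should be a top-level import name (e.g. "pandas"),
    not a dotted submodule path.
    """
    return [name for name in import_names if find_spec(name) is None]


# Hypothetical list: replace with the import names used in the notebooks.
REQUIRED = ["pandas", "psycopg2", "neo4j"]
```

Calling `missing_packages(REQUIRED)` returns the names that still need a `pip install`. Note that an import name can differ from the name of the package on PyPI.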
We use two databases to store our data:
- PostgreSQL (Relational Database): Version 13+
- Neo4j (Graph Database):
  - Base Kernel: Version 5.7.0
  - Awesome Procedures on Cypher (APOC) Library: Version 5.7.0
  - Graph Data Science (GDS) Library: Version 2.3.7
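To confirm that an installed database satisfies the versions listed above, a small comparison helper can be sketched as follows. It is an illustration rather than part of the project code, and it assumes you have already obtained the version string yourself (for example from `SHOW server_version;` in PostgreSQL). The function names are hypothetical.

```python
import re


def parse_version(text):
    """Extract the leading dotted version number as a tuple of ints.

    Trailing build details such as "13.4 (Debian 13.4-1)" are ignored.
    The string must start with the version number.
    """
    match = re.match(r"\s*(\d+(?:\.\d+)*)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def meets_minimum(reported, minimum):
    """Return True if `reported` is at least `minimum`."""
    have, need = parse_version(reported), parse_version(minimum)
    # Pad with zeros so that "13" and "13.0" compare as equal.
    width = max(len(have), len(need))
    have += (0,) * (width - len(have))
    need += (0,) * (width - len(need))
    return have >= need
```

For example, `meets_minimum(reported, "13")` checks a PostgreSQL version string against the 13+ requirement.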
We use the San Diego Data Portal for our raw data, which includes information on accidents, Get-It-Done reports, and roads:
- Traffic Collision Records
- Vehicles and People Involved in Accidents
- San Diego Get-It-Done Requests
- Roads
- San Diego City Boundary
Running the data preprocessing requires Python GIS packages and an ArcGIS Online account with ESRI credits. Because the processed data are already provided, we do not recommend rerunning the preprocessing.
All processed data are created and stored in the Shared Google Drive. Download these files and store them in the data directory, as well as in the "neo4j-docker - GDS"/db/import directory.
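Once the files are downloaded into the data directory, the second copy for Neo4j can be made with a short script. This is a minimal sketch, not part of the project code. The file names come from the repository layout in this README, while the function name and the paths passed to it are assumptions about your local setup.

```python
import shutil
from pathlib import Path

# File names taken from the repository layout in this README.
PROCESSED_FILES = [
    "accidents.csv",
    "accidents_info.csv",
    "roads.csv",
    "reports.csv",
    "accidents_on_road.csv",
    "reports_on_road.csv",
]


def copy_processed_data(data_dir, import_dir, filenames=PROCESSED_FILES):
    """Copy processed CSVs from data_dir into import_dir.

    Returns (copied, missing): the file names that were copied and
    the file names that were not found in data_dir.
    """
    data_dir, import_dir = Path(data_dir), Path(import_dir)
    import_dir.mkdir(parents=True, exist_ok=True)
    copied, missing = [], []
    for name in filenames:
        source = data_dir / name
        if source.is_file():
            shutil.copy2(source, import_dir / name)
            copied.append(name)
        else:
            missing.append(name)
    return copied, missing
```

Run from the repository root, a call such as `copy_processed_data("data", "neo4j-docker - GDS/db/import")` would copy the files and report any that have not been downloaded yet. Adjust both paths to match where the Neo4j Docker folder lives on your machine.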
As suggested by the data sources and their relationships, the analysis questions are divided into four parts:
- Accident Information
- Accident Vehicle Information
- Accident Road Information
- Accident Information in Relation to Get-It-Done Reports
San-Diego-Accident-Analysis/
├── data/ <- all processed data files
│ ├── accidents.csv
│ ├── accidents_info.csv
│ ├── roads.csv
│ ├── reports.csv
│ ├── accidents_on_road.csv
│ └── reports_on_road.csv
├── src/ <- all code
│ ├── preprocessing.ipynb
│ ├── data-loading.ipynb
│ └── data-analysis.ipynb
└── README.md