Investigating downloads of vulnerable Python packages from PyPI.
- dbt documentation on GitHub Pages.
- Public dataset on BigQuery (US location): `pypi-vulns.published_us`
- Please consider the BigQuery dataset structure and schema experimental and subject to unannounced breaking changes at this stage. If you would like something held stable, please raise an issue to let me know.
- I do not version this dataset, as I will not be maintaining previous versions.
I'm performing this initial analysis on package downloads from a single date, 2023-11-05. There are a few reasons for that:
- PyPI downloads is a big dataset: a single day in late 2023 is on the order of 250 GB. At $5/TB scanned, that's roughly $1.25 per scan of one day of the full dataset.
- The Safety public dataset is updated monthly, so I can use the 2023-10-01 update to be sure that any vulnerabilities I'm considering have been in the public domain and accessible via tools for at least a month.
I can get an idea of what's going on, and figure out how to solve the problems that need solving, with a relatively small snapshot dataset. So I copy just the columns I need for one day, with minimal processing, into a new table and work from that (a sketch of the copy follows).
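As an illustration of that snapshot copy, here is a minimal sketch; the destination table name is hypothetical and the column list is illustrative, while the source is the public `bigquery-public-data.pypi.file_downloads` table.

```python
# Sketch: copy one day of PyPI downloads, keeping only the columns needed,
# into a small working table. The destination table name is hypothetical.
from google.cloud import bigquery

client = bigquery.Client()  # uses application default credentials

client.query(
    """
    CREATE OR REPLACE TABLE `my-project.scratch.file_downloads_2023_11_05` AS
    SELECT
      DATE(timestamp) AS download_date,
      project AS package,
      file.version AS package_version
    FROM `bigquery-public-data.pypi.file_downloads`
    WHERE DATE(timestamp) = '2023-11-05'
    """
).result()  # block until the copy job completes
```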
The following query bills 277 MB:
```sql
SELECT
  package,
  was_vulnerable,
  downloads,
  proportion_vulnerable_downloads
FROM `pypi-vulns.published_us.vulnerable_downloads_by_package`
WHERE was_vulnerable
  AND download_date = '2023-11-05'
ORDER BY downloads DESC
LIMIT 10
```
| package | was_vulnerable | downloads | proportion_vulnerable_downloads |
|---|---|---|---|
| requests | true | 2435390 | 0.2838941569600516 |
| numpy | true | 2124978 | 0.44167721853126357 |
| urllib3 | true | 2052931 | 0.18725861594174412 |
| flask | true | 1878314 | 0.6503661420171899 |
| awscli | true | 1831423 | 0.45352032247727847 |
| cryptography | true | 1534228 | 0.34972655174363154 |
| sqlalchemy | true | 1401421 | 0.55823008581653166 |
| scikit-learn | true | 1058386 | 1.0 |
| pydantic | true | 1036847 | 0.37088439111792182 |
| setuptools | true | 934051 | 0.13253937971267768 |
See CONTRIBUTORS.md for guidance.
- Python == 3.11 (see https://docs.getdbt.com/faqs/Core/install-python-compatibility)
- [RECOMMENDED] VSCode, to use the built-in tasks
- Access to a GCP project with BigQuery enabled
- [RECOMMENDED] set the environment variable `PIP_REQUIRE_VIRTUALENV=true`
  - prevents accidentally installing into your system Python installation (if you have permissions to do so)
First, set up the local software; this step needs no data warehouse credentials.
A VSCode task triggers the shell script `.dev_scripts/init_and_update.sh`, which should take care of setting up a virtualenv if necessary, then installing/updating the software and running a vulnerability scan.
Note: the vulnerability scan is performed using `safety`, which is not free for commercial use and has limitations on the freshness and completeness of its vulnerability database.
The script documents the steps involved in a full setup, in case you are unable to run a bash script and need to translate them to some other language; a hypothetical Python sketch follows.
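For illustration, a rough Python equivalent of those steps might look like this; the venv location and `requirements.txt` are assumptions rather than details taken from the script.

```python
# Hypothetical Python translation of the setup steps described above; the
# authoritative version is .dev_scripts/init_and_update.sh.
import subprocess
import sys
from pathlib import Path

venv = Path(".venv")  # assumed venv location
if not venv.exists():
    subprocess.run([sys.executable, "-m", "venv", str(venv)], check=True)

bin_dir = venv / ("Scripts" if sys.platform == "win32" else "bin")
pip = str(bin_dir / "pip")
subprocess.run([pip, "install", "--upgrade", "pip"], check=True)
subprocess.run([pip, "install", "--upgrade", "-r", "requirements.txt"], check=True)  # assumed requirements file
subprocess.run([str(bin_dir / "safety"), "check"], check=True)  # vulnerability scan
```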
Next, set up credentials and environment, and test connectivity.
- update `.env` with appropriate values
  - note: this needs the project ID, not the project name (using the wrong one manifests as a 404 error)
  - run `. .env` to load the updated values into your current terminal session
- get credentials
  - these must be application default credentials: `gcloud auth application-default login`
  - if there is no valid credential, the error message says that default credentials were not found
- `dbt debug` should now succeed and list settings/versions
  - if `dbt` is not found, you may need to activate your venv at the terminal as described earlier
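To test connectivity outside dbt, here is a quick sketch (not part of this repo) using the BigQuery client library:

```python
# Sketch: confirm application default credentials resolve and BigQuery responds.
# google.auth.default() raises DefaultCredentialsError if no credential is found.
import google.auth
from google.cloud import bigquery

credentials, project_id = google.auth.default()
print(f"project: {project_id}")

client = bigquery.Client(credentials=credentials, project=project_id)
print(list(client.query("SELECT 1 AS ok").result()))
```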
I use the public database used by the [safety] package as a reference for which PyPI packages have known vulnerabilities.
Automation is provided to take care of making the current version of the public Safety DB available as a BigQuery table. The source data is stored as a large JSON array, so it needs a bit of processing before it can be loaded into BigQuery.
`etl/safety_db/load_missing_partitions.py` takes care of loading any Safety DB commits that are not currently available in the data warehouse.
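As an illustration of the processing involved (not the repo's actual loader), the sketch below flattens a Safety DB file into newline-delimited JSON, which BigQuery can load directly. The file names, and the assumed layout of one JSON document keyed by package name, are assumptions based on safety-db's `insecure_full.json`.

```python
# Sketch only: flatten a Safety DB JSON document into newline-delimited JSON
# for BigQuery. File names and layout are assumptions, not the repo's loader.
import json

with open("insecure_full.json") as src:
    safety_db = json.load(src)

with open("safety_db.ndjson", "w") as dst:
    for package, advisories in safety_db.items():
        if package.startswith("$"):  # skip metadata entries such as "$meta"
            continue
        for advisory in advisories:
            dst.write(json.dumps({"package": package, "advisory": advisory}) + "\n")
```

The resulting file could then be loaded with, for example, `bq load --source_format=NEWLINE_DELIMITED_JSON`.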