Commit 29dc84b: Update README to point folks to Kaggle.
zaneselvans committed Dec 3, 2023 (1 parent: 385ea92)
1 changed file (README.md) with 40 additions and 129 deletions.

This repository contains a collection of
[Jupyter notebooks](https://jupyter.org) with examples of how to use the data
and software distributed by [Catalyst Cooperative](https://catalyst.coop)'s
[Public Utility Data Liberation (PUDL) project](https://github.com/catalyst-cooperative/pudl).

## Run PUDL Notebooks on Kaggle

The easiest way to get up and running with these examples and a fresh copy of all the
PUDL data is on [Kaggle](https://www.kaggle.com):

- [PUDL Data on Kaggle](https://www.kaggle.com/datasets/catalystcooperative/pudl-project/data)
- [01 PUDL Data Access](https://www.kaggle.com/code/catalystcooperative/01-pudl-data-access)
- [02 State Hourly Electricity Demand](https://www.kaggle.com/code/catalystcooperative/02-state-hourly-electricity-demand)

Kaggle offers substantial free computing resources and convenient data storage, so you
can start playing with the PUDL data without needing to set up any software or download
any data.
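Whether on Kaggle or on your own machine, most of the example notebooks boil down to querying PUDL's SQLite database with pandas. Here is a minimal sketch of that pattern, using a tiny in-memory stand-in database so it runs anywhere; the `/kaggle/input` path in the comment is an assumption about Kaggle's dataset mount point, so check the dataset's file listing for the real path.

```python
import sqlite3

import pandas as pd

# On Kaggle, attached datasets are mounted read-only under /kaggle/input;
# the exact path below is an assumption -- check the dataset page's file list.
# PUDL_DB = "/kaggle/input/pudl-project/pudl.sqlite"

# For illustration, build a tiny in-memory stand-in with the same access pattern.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE plants (plant_id INTEGER, capacity_mw REAL)")
conn.executemany("INSERT INTO plants VALUES (?, ?)", [(1, 50.0), (2, 125.5)])

# The same pd.read_sql() call works against the real PUDL database connection.
df = pd.read_sql("SELECT plant_id, capacity_mw FROM plants", conn)
print(df.shape)  # (2, 2)
```

The hypothetical table and column names above are placeholders; the `01 PUDL Data Access` notebook shows the real schema.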

## Running Jupyter locally

If you're already familiar with git, Python environments, filesystem paths, and running
Jupyter notebooks, you can also work with these notebooks and the PUDL data on your own machine:

- Create a Python environment that includes common data science packages. We like to use
the [mamba](https://github.com/mamba-org/mamba) package manager and the
[conda-forge](https://conda-forge.org/#about) channel.
- Clone this repository.
- [Download the PUDL dataset from Kaggle](https://www.kaggle.com/datasets/catalystcooperative/pudl-project/download) (it's ~8GB!) and unzip it somewhere conveniently accessible from the
notebooks in the cloned repo.
- Start your JupyterLab or Jupyter Notebook server and navigate to the notebooks in
the cloned repo.
- You'll need to adjust the file paths in the notebooks to point at the directory where
you put the PUDL data, and might need to adjust the packages installed in your Python
environment to work with the notebooks.
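One way to make that path adjustment explicit is a small helper at the top of each notebook that fails fast when the data directory hasn't been set up yet. This is a sketch, not part of the PUDL tooling; `resolve_pudl_db` and the example directory are hypothetical names:

```python
from pathlib import Path


def resolve_pudl_db(pudl_dir: Path) -> Path:
    """Return the path to pudl.sqlite, failing early if the directory is wrong."""
    db_path = pudl_dir / "pudl.sqlite"
    if not db_path.exists():
        raise FileNotFoundError(
            f"No PUDL database at {db_path}; "
            "point pudl_dir at the directory where you unzipped the Kaggle download."
        )
    return db_path


# Hypothetical location -- change to wherever you extracted the archive:
# db = resolve_pudl_db(Path.home() / "pudl_data")
```

Failing early with a clear message beats the opaque "unable to open database file" error SQLite gives for a bad path.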

## Other Data Access Methods

See [the PUDL documentation](https://catalystcoop-pudl.readthedocs.io/en/latest/data_access.html)
for other data access methods.

If you're familiar with cloud services, you can check out:

- AWS S3, listed in the [AWS Open Data Registry](https://registry.opendata.aws/catalyst-cooperative-pudl/):
  `s3://pudl.catalyst.coop` (free access)
- Google Cloud Storage: `gs://pudl.catalyst.coop` (requester pays)
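Because the S3 bucket allows free anonymous access, you can read files from it with pandas without AWS credentials. A sketch, assuming the `s3fs` package is installed and using a hypothetical object key (list the bucket to find real file names):

```python
import pandas as pd


def read_pudl_parquet(url: str) -> pd.DataFrame:
    """Read a Parquet file from the public PUDL bucket without AWS credentials.

    storage_options={"anon": True} tells s3fs to make unsigned requests,
    which is what a publicly readable bucket expects.
    """
    return pd.read_parquet(url, storage_options={"anon": True})


# Hypothetical object key -- inspect the bucket for the actual layout:
# df = read_pudl_parquet("s3://pudl.catalyst.coop/some_table.parquet")
```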

## Stalk us on the Internet

- [WWW](https://catalyst.coop)
- Email: [[email protected]](mailto:[email protected])
- Mastodon: [@CatalystCoop@mastodon.energy](https://mastodon.energy/@CatalystCoop)
- BlueSky: [@catalyst.coop](https://bsky.app/profile/catalyst.coop)
- [Kaggle](https://www.kaggle.com/catalystcooperative)
- [HuggingFace](https://huggingface.co/catalystcooperative)
- [GitHub](https://github.com/catalyst-cooperative)
- Twitter: [@CatalystCoop](https://twitter.com/CatalystCoop)
