add documentation to ddataflow :) (#25)
Co-authored-by: theodore.meynard <[email protected]>
theopinard and theodoremeynard authored Apr 18, 2024
1 parent 1fb61c5 commit a1b032a
Showing 18 changed files with 113 additions and 88 deletions.
51 changes: 20 additions & 31 deletions .github/workflows/pages.yml
@@ -2,41 +2,30 @@
 name: Deploy static content to Pages
 
 on:
-  # Runs on pushes targeting the default branch
   push:
-    branches: ["main"]
-
-  # Allows you to run this workflow manually from the Actions tab
-  workflow_dispatch:
-
-# Sets permissions of the GITHUB_TOKEN to allow deployment to GitHub Pages
+    branches:
+      - master
+      - main
 permissions:
-  contents: read
-  pages: write
-  id-token: write
-
-# Allow one concurrent deployment
-concurrency:
-  group: "pages"
-  cancel-in-progress: true
-
+  contents: write
 jobs:
-  # Single deploy job since we're just deploying
   deploy:
-    environment:
-      name: github-pages
-      url: ${{ steps.deployment.outputs.page_url }}
     runs-on: ubuntu-latest
     steps:
-      - name: Checkout
-        uses: actions/checkout@v3
-      - name: Setup Pages
-        uses: actions/configure-pages@v2
-      - name: Upload artifact
-        uses: actions/upload-pages-artifact@v1
+      - uses: actions/checkout@v4
+      - name: Configure Git Credentials
+        run: |
+          git config user.name github-actions[bot]
+          git config user.email 41898282+github-actions[bot]@users.noreply.github.com
+      - uses: actions/setup-python@v5
+        with:
+          python-version: 3.x
+      - run: echo "cache_id=$(date --utc '+%V')" >> $GITHUB_ENV
+      - uses: actions/cache@v4
         with:
-          # Upload entire repository
-          path: 'html'
-      - name: Deploy to GitHub Pages
-        id: deployment
-        uses: actions/deploy-pages@v1
+          key: mkdocs-material-${{ env.cache_id }}
+          path: .cache
+          restore-keys: |
+            mkdocs-material-
+      - run: pip install mkdocs mkdocstrings[python] mkdocs-material
+      - run: mkdocs gh-deploy --force
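For context on the new workflow: `mkdocs gh-deploy --force` builds the site and force-pushes it to the `gh-pages` branch, which is why the dedicated Pages deploy actions could be replaced by a plain `contents: write` grant. The cache key embeds the UTC ISO week number (`date --utc '+%V'`), so the mkdocs-material cache is reused within a week and rebuilt after the week rolls over. A small Python illustration of that key scheme:

```python
# Illustration of the cache-key scheme above (not part of the commit):
# the key embeds the ISO week number, so it rolls over weekly.
from datetime import datetime, timezone

week = datetime.now(timezone.utc).isocalendar()[1]  # ISO week, e.g. 16 for Apr 18, 2024
print(f"mkdocs-material-{week:02d}")  # same value the workflow stores in cache_id
```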
1 change: 1 addition & 0 deletions .gitignore
@@ -3,3 +3,4 @@
.idea/*
*.swp
dist/
+site/
17 changes: 14 additions & 3 deletions README.md
@@ -3,7 +3,7 @@
DDataFlow is an end-to-end testing and local development solution for machine learning and data pipelines using PySpark.
Check out this blog post if you want to [understand its design motivation more deeply](https://www.getyourguide.careers/posts/ddataflow-a-tool-for-data-end-to-end-tests-for-machine-learning-pipelines).

-![ddataflow overview](ddataflow.png)
+![ddataflow overview](docs/ddataflow.png)

You can find our documentation in the [docs folder](https://github.com/getyourguide/DDataFlow/tree/main/docs). And see the complete code reference [here](https://code.getyourguide.com/DDataFlow/ddataflow/ddataflow.html).

@@ -15,7 +15,7 @@ You can find our documentation in the [docs folder](https://github.com/getyourguide/DDataFlow/tree/main/docs).

Enables running the pipelines in the CI

-## 1. Install Ddataflow
+## 1. Install DDataflow

```sh
pip install ddataflow
Expand Down Expand Up @@ -95,4 +95,15 @@ Check out our [FAQ in case of problems](https://github.com/getyourguide/DDataFlo

## Contributing

-This project requires manual release at the moment. See the docs and request a pypi access if you want to contribute.
+We welcome contributions to DDataFlow! If you would like to contribute, please follow these guidelines:
+
+1. Fork the repository and create a new branch for your contribution.
+2. Make your changes and ensure that the code passes all tests.
+3. Submit a pull request with a clear description of your changes and the problem they solve.
+
+Please note that all contributions are subject to review and approval by the project maintainers. We appreciate your help in making DDataFlow even better!
+
+If you have any questions or need any help, please don't hesitate to reach out. We are here to assist you throughout the contribution process.
+
+## License
+DDataFlow is licensed under the [MIT License](https://github.com/getyourguide/DDataFlow/blob/main/LICENSE).
6 changes: 3 additions & 3 deletions docs/FAQ.md
@@ -1,11 +1,11 @@
# FAQ



-## Im trying to download data but the system is complaining my databricks cli is not are not configure
+## I am trying to download data but the system is complaining my databricks cli is not configured

After installing ddataflow, run the configuration procedure on the machine where you installed it:

-```
+```sh
databricks configure --token
```
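`databricks configure --token` prompts for the workspace URL and a personal access token; the legacy databricks CLI stores them in `~/.databrickscfg`. A quick sanity check, assuming that default credentials path:

```python
# Checks whether the databricks CLI has written its config; the
# ~/.databrickscfg location is an assumption about the legacy CLI.
from pathlib import Path

cfg = Path.home() / ".databrickscfg"
if cfg.exists():
    print(f"databricks CLI looks configured: {cfg}")
else:
    print("no config found, run: databricks configure --token")
```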

8 changes: 0 additions & 8 deletions docs/_releasing.md

This file was deleted.

1 change: 1 addition & 0 deletions docs/api_reference/DDataflow.md
@@ -0,0 +1 @@
+::: ddataflow.ddataflow.DDataflow
1 change: 1 addition & 0 deletions docs/api_reference/DataSource.md
@@ -0,0 +1 @@
+::: ddataflow.data_source
1 change: 1 addition & 0 deletions docs/api_reference/DataSourceDownloader.md
@@ -0,0 +1 @@
+::: ddataflow.downloader
1 change: 1 addition & 0 deletions docs/api_reference/DataSources.md
@@ -0,0 +1 @@
+::: ddataflow.data_sources
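These four one-line pages use mkdocstrings' `::: package.module` injection syntax: at build time each marker is replaced with API documentation generated from the referenced object's docstrings, via the `mkdocstrings` plugin configured in the `mkdocs.yml` added at the end of this commit.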
Binary file added docs/ddataflow.png
9 changes: 9 additions & 0 deletions docs/index.md
@@ -0,0 +1,9 @@
+# Home
+
+DDataFlow is an end-to-end testing and local development solution for machine learning and data pipelines using PySpark.
+
+## Features
+
+- Read a subset of your data to speed up pipeline runs during tests
+- Write artifacts to a test location so you don't pollute production
+- Download data to enable local machine development (see the sketch below)
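A minimal sketch of how these features surface in pipeline code, assuming the `ddataflow_config.py` convention from the integrator manual; the `source()` accessor name is illustrative, so check the API reference pages for the authoritative interface:

```python
# A hedged sketch, not the definitive API: it assumes a project-level
# ddataflow_config.py exposing a configured DDataflow instance, and an
# accessor that applies the per-source filters when DDataflow is enabled.
from ddataflow_config import ddataflow

# With DDataflow enabled (e.g. in tests or CI), reads go through the
# configured sampling, such as a df.limit(500) filter.
tours = ddataflow.source("demo_tours")

# With DDataflow disabled (production), the same call should resolve to
# the full spark.table("demo_tours"), so pipeline code stays unchanged.
tours.show()
```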
3 changes: 1 addition & 2 deletions docs/integrator_manual.md
@@ -14,13 +14,12 @@ pip install ddataflow
DDataflow is declarative and completely configurable through a single configuration passed at DDataflow startup. To create a configuration for your project, simply run:

```shell
-
ddataflow setup_project
```

You can also use this config in a notebook, with databricks-connect, or in the repository with db-rocket. Example config below:

-```py
+```python
# later, save this script as ddataflow_config.py to follow our convention
from ddataflow import DDataflow
import pyspark.sql.functions as F
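The config example above is cut off at the hunk boundary. For orientation, here is a fuller sketch assembled from the `examples/ddataflow_config.py` fragment further down in this diff; the `project_folder_name` key and the `DDataflow(**config)` call follow the project's documented convention but are assumptions here, not a verbatim copy of the file:

```python
# A sketch of a complete ddataflow_config.py, based on the example
# fragment shown later in this commit; keys other than data_sources
# are assumptions about the convention, not authoritative.
from ddataflow import DDataflow

config = {
    "data_sources": {
        # each source declares how to read the full table and how to
        # shrink it during tests
        "demo_tours": {
            "source": lambda spark: spark.table("demo_tours"),
            "filter": lambda df: df.limit(500),
        },
        "demo_locations": {
            "source": lambda spark: spark.table("demo_locations"),
            "default_sampling": True,
        },
    },
    # assumed: a name under which sampled/downloaded data is grouped
    "project_folder_name": "ddataflow_demo",
}

# a single instance the rest of the project imports
ddataflow = DDataflow(**config)
```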
27 changes: 26 additions & 1 deletion docs/local_development.md
@@ -1,4 +1,4 @@
-# Local development with DDataflow
+# Local Development

DDataflow also enables you to develop with local data. We see this, though, as a more advanced use case, which might not be
the first choice for everybody. First, make a copy of the files you need to download in DBFS.
@@ -22,3 +22,28 @@ python yourproject/train.py
```

The downloaded data sources will be stored at `$HOME/.ddataflow`.
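A small way to see what has actually been downloaded, assuming only the storage path documented above (the layout underneath depends on your configured data sources):

```python
# Lists whatever DDataflow has materialized under $HOME/.ddataflow.
from pathlib import Path

ddataflow_home = Path.home() / ".ddataflow"
if ddataflow_home.exists():
    for path in sorted(ddataflow_home.rglob("*")):
        print(path.relative_to(ddataflow_home))
else:
    print(f"nothing downloaded yet: {ddataflow_home}")
```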

+## Local setup for Spark
+
+If you run Spark locally, you might need to tweak some parameters compared to your cluster. Below is a good example you can use.
+
+```py
+def configure_spark():
+    if ddataflow.is_local():
+        import pyspark
+
+        # local runs need a writable warehouse dir and generous driver memory
+        spark_conf = pyspark.SparkConf()
+        spark_conf.set("spark.sql.warehouse.dir", "/tmp")
+        spark_conf.set("spark.sql.catalogImplementation", "hive")
+        spark_conf.set("spark.driver.memory", "15g")
+        spark_conf.setMaster("local[*]")
+        sc = pyspark.SparkContext(conf=spark_conf)
+        session = pyspark.sql.SparkSession(sc)
+
+        return session
+
+    # on the cluster, reuse the existing session
+    from pyspark.sql import SparkSession
+    return SparkSession.builder.getOrCreate()
+```
+
+If you run into a Snappy compression problem, please reinstall pyspark!
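Usage is then a one-liner in the entrypoint, and the same code path works locally and on the cluster because the helper branches on `ddataflow.is_local()`; the `ddataflow` instance is assumed to come from your `ddataflow_config.py`:

```python
# Assumes configure_spark() from the snippet above, defined in a module
# that also imports ddataflow from the conventional ddataflow_config.py.
spark = configure_spark()

df = spark.createDataFrame([(1, "tour_a"), (2, "tour_b")], ["id", "name"])
df.show()
```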
30 changes: 0 additions & 30 deletions docs/running_spark_locally.md

This file was deleted.

2 changes: 1 addition & 1 deletion docs/sampling.md
Expand Up @@ -11,7 +11,7 @@ Add the following to your setup.py
],
```

-## With Dbrocket
+## With DBrocket

Cell 1

10 changes: 2 additions & 8 deletions docs/troubleshooting.md
@@ -1,9 +1,3 @@

-One drawback of having ddataflow in the root folder is that it can conflict with other ddtaflow- installations.
-Prefer installing ddataflow in submodules of your main project.
-
-myproject/main_module/ddataflow_config.py
-
-instead of globally like this:
-
-myproject/ddataflow_config.py
+One drawback of having ddataflow in the root folder is that it can conflict with other ddataflow installations.
+Prefer installing ddataflow in submodules of your main project (`myproject/main_module/ddataflow_config.py`) instead of globally (`myproject/ddataflow_config.py`).
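In code, the submodule placement only changes the import path; a sketch with a hypothetical project layout:

```python
# Hypothetical layout: myproject/main_module/ddataflow_config.py
# (the project and module names are placeholders).
from myproject.main_module.ddataflow_config import ddataflow

print(ddataflow.is_local())
```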
2 changes: 1 addition & 1 deletion examples/ddataflow_config.py
Expand Up @@ -6,7 +6,7 @@
"demo_tours": {
"source": lambda spark: spark.table('demo_tours'),
"filter": lambda df: df.limit(500)
}
},
"demo_locations": {
"source": lambda spark: spark.table('demo_locations'),
"default_sampling": True,
Expand Down
31 changes: 31 additions & 0 deletions mkdocs.yml
@@ -0,0 +1,31 @@
+site_name: DDataflow
+site_url: https://example.com/
+
+theme:
+  name: material
+
+markdown_extensions:
+  - pymdownx.superfences
+
+nav:
+  - 'index.md'
+  - 'integrator_manual.md'
+  - 'local_development.md'
+  - 'sampling.md'
+  - API Reference:
+    - 'api_reference/DDataflow.md'
+    - 'api_reference/DataSource.md'
+    - 'api_reference/DataSources.md'
+    - 'api_reference/DataSourceDownloader.md'
+  - 'troubleshooting.md'
+  - 'FAQ.md'
+
+plugins:
+  - search
+  - mkdocstrings:
+      handlers:
+        # See: https://mkdocstrings.github.io/python/usage/
+        python:
+          options:
+            docstring_style: sphinx
+            allow_inspection: true
