
add documentation to ddataflow :)
theodoremeynard committed Apr 16, 2024
1 parent 1ad5cdc commit d27a3bf
Showing 18 changed files with 113 additions and 88 deletions.
51 changes: 20 additions & 31 deletions .github/workflows/pages.yml
@@ -2,41 +2,30 @@
name: Deploy static content to Pages

on:
-  # Runs on pushes targeting the default branch
  push:
-    branches: ["main"]
-
-  # Allows you to run this workflow manually from the Actions tab
-  workflow_dispatch:
-
-# Sets permissions of the GITHUB_TOKEN to allow deployment to GitHub Pages
+    branches:
+      - master
+      - main
permissions:
-  contents: read
-  pages: write
-  id-token: write
-
-# Allow one concurrent deployment
-concurrency:
-  group: "pages"
-  cancel-in-progress: true
-
+  contents: write
jobs:
-  # Single deploy job since we're just deploying
  deploy:
-    environment:
-      name: github-pages
-      url: ${{ steps.deployment.outputs.page_url }}
    runs-on: ubuntu-latest
    steps:
-      - name: Checkout
-        uses: actions/checkout@v3
-      - name: Setup Pages
-        uses: actions/configure-pages@v2
-      - name: Upload artifact
-        uses: actions/upload-pages-artifact@v1
-        with:
-          # Upload entire repository
-          path: 'html'
-      - name: Deploy to GitHub Pages
-        id: deployment
-        uses: actions/deploy-pages@v1
+      - uses: actions/checkout@v4
+      - name: Configure Git Credentials
+        run: |
+          git config user.name github-actions[bot]
+          git config user.email 41898282+github-actions[bot]@users.noreply.github.com
+      - uses: actions/setup-python@v5
+        with:
+          python-version: 3.x
+      - run: echo "cache_id=$(date --utc '+%V')" >> $GITHUB_ENV
+      - uses: actions/cache@v4
+        with:
+          key: mkdocs-material-${{ env.cache_id }}
+          path: .cache
+          restore-keys: |
+            mkdocs-material-
+      - run: pip install mkdocs mkdocstrings[python] mkdocs-material
+      - run: mkdocs gh-deploy --force
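
Net effect of the workflow change: instead of uploading a prebuilt `html` folder through the Pages actions, the site is now built and published with `mkdocs gh-deploy --force`, which pushes the rendered site to the `gh-pages` branch (hence the switch to `contents: write`). The `cache_id` is the current ISO week number (`date --utc '+%V'`), so the `.cache` directory used by mkdocs-material is refreshed at most weekly.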
1 change: 1 addition & 0 deletions .gitignore
@@ -3,3 +3,4 @@
.idea/*
*.swp
dist/
+site/
17 changes: 14 additions & 3 deletions README.md
@@ -3,7 +3,7 @@
DDataFlow is an end-to-end testing and local development solution for machine learning and data pipelines using PySpark.
Check out this blog post if you want to [understand its design motivation in more depth](https://www.getyourguide.careers/posts/ddataflow-a-tool-for-data-end-to-end-tests-for-machine-learning-pipelines).

-![ddataflow overview](ddataflow.png)
+![ddataflow overview](docs/ddataflow.png)

You can find our documentation in the [docs folder](https://github.com/getyourguide/DDataFlow/tree/main/docs). And see the complete code reference [here](https://code.getyourguide.com/DDataFlow/ddataflow/ddataflow.html).

@@ -15,7 +15,7 @@

Enables running the pipelines in the CI

-## 1. Install Ddataflow
+## 1. Install DDataflow

```sh
pip install ddataflow
@@ -95,4 +95,15 @@

## Contributing

-This project requires manual release at the moment. See the docs and request a pypi access if you want to contribute.
+We welcome contributions to DDataFlow! If you would like to contribute, please follow these guidelines:
+
+1. Fork the repository and create a new branch for your contribution.
+2. Make your changes and ensure that the code passes all tests.
+3. Submit a pull request with a clear description of your changes and the problem they solve.
+
+Please note that all contributions are subject to review and approval by the project maintainers. We appreciate your help in making DDataFlow even better!
+
+If you have any questions or need any help, please don't hesitate to reach out. We are here to assist you throughout the contribution process.
+
+## License
+DDataFlow is licensed under the [MIT License](https://github.com/getyourguide/DDataFlow/blob/main/LICENSE).
6 changes: 3 additions & 3 deletions docs/FAQ.md
@@ -1,11 +1,11 @@
# FAQ



-## Im trying to download data but the system is complaining my databricks cli is not are not configure
+## I am trying to download data but the system is complaining my databricks cli is not configured

After installing ddataflow, run the configure procedure on the machine where you installed it:

-```
+```sh
databricks configure --token
```
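
For context: `databricks configure --token` prompts for the workspace host and a personal access token and stores them in `~/.databrickscfg`, which is the configuration the data download step relies on.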

8 changes: 0 additions & 8 deletions docs/_releasing.md

This file was deleted.

1 change: 1 addition & 0 deletions docs/api_reference/DDataflow.md
@@ -0,0 +1 @@
::: ddataflow.ddataflow.DDataflow
1 change: 1 addition & 0 deletions docs/api_reference/DataSource.md
@@ -0,0 +1 @@
::: ddataflow.data_source
1 change: 1 addition & 0 deletions docs/api_reference/DataSourceDownloader.md
@@ -0,0 +1 @@
::: ddataflow.downloader
1 change: 1 addition & 0 deletions docs/api_reference/DataSources.md
@@ -0,0 +1 @@
::: ddataflow.data_sources
Binary file added docs/ddataflow.png
9 changes: 9 additions & 0 deletions docs/index.md
@@ -0,0 +1,9 @@
# Home

DDataFlow is an end-to-end testing and local development solution for machine learning and data pipelines using PySpark.

## Features

- Read a subset of your data to speed up pipeline runs during tests
- Write artifacts to a test location so you don't pollute production
- Download data to enable development on a local machine
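
To make the features above concrete, here is a minimal `ddataflow_config.py` sketch assembled from the example config changed later in this commit (`examples/ddataflow_config.py`); the table names are demo values, and `project_folder_name` is an assumption based on the project's setup conventions.

```python
# Minimal ddataflow_config.py sketch. Demo table names;
# project_folder_name is an assumed convention, adjust to your project.
from ddataflow import DDataflow

config = {
    "data_sources": {
        "demo_tours": {
            # how to load the full production table
            "source": lambda spark: spark.table("demo_tours"),
            # how to shrink it during tests and local runs
            "filter": lambda df: df.limit(500),
        },
        "demo_locations": {
            "source": lambda spark: spark.table("demo_locations"),
            # fall back to the built-in sampling strategy
            "default_sampling": True,
        },
    },
    "project_folder_name": "ddataflow_demo",
}

# one shared instance, imported by the pipeline code
ddataflow = DDataflow(**config)
```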
3 changes: 1 addition & 2 deletions docs/integrator_manual.md
@@ -14,13 +14,12 @@ pip install ddataflow
DDataflow is declarative and completely configurable through a single configuration passed at DDataflow startup. To create a configuration for your project, simply run:

```shell
-
ddataflow setup_project
```

You can also use this config in a notebook, with databricks-connect, or in the repository with db-rocket. Example config below:

-```py
+```python
# later, save this script as ddataflow_config.py to follow our convention
from ddataflow import DDataflow
import pyspark.sql.functions as F
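
To connect the dots, a sketch of how such a config might be consumed in pipeline code. `load_tours` is a hypothetical helper, and the `source()` accessor is an assumption about the DDataflow API; check the API reference pages added in this commit for the real surface.

```python
# Hypothetical usage sketch -- source() is assumed, not confirmed here.
# With DDataflow disabled it should resolve to the raw spark.table("demo_tours");
# with it enabled, to the filtered/sampled version.
from ddataflow_config import ddataflow


def load_tours():
    return ddataflow.source("demo_tours")
```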
27 changes: 26 additions & 1 deletion docs/local_development.md
@@ -1,4 +1,4 @@
-# Local development with DDataflow
+# Local Development

DDataflow also enables you to develop with local data. We see this as a more advanced use case, though, and it might not be
the first choice for everybody. First, make a copy of the files you need to download in dbfs.
@@ -22,3 +22,28 @@ python yourproject/train.py
```

The downloaded data sources will be stored at `$HOME/.ddataflow`.

## Local setup for Spark

If you run Spark locally, you might need to tweak some parameters compared to your cluster. Below is a good example you can use.

```py
# assumes the shared instance defined in ddataflow_config.py
from ddataflow_config import ddataflow
from pyspark.sql import SparkSession


def configure_spark():
    if ddataflow.is_local():
        import pyspark

        # local runs need a writable warehouse dir and a local master
        spark_conf = pyspark.SparkConf()
        spark_conf.set("spark.sql.warehouse.dir", "/tmp")
        spark_conf.set("spark.sql.catalogImplementation", "hive")
        spark_conf.set("spark.driver.memory", "15g")
        spark_conf.setMaster("local[*]")
        sc = pyspark.SparkContext(conf=spark_conf)
        return pyspark.sql.SparkSession(sc)

    return SparkSession.builder.getOrCreate()
```
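
A pipeline entry point would then call `spark = configure_spark()` once at startup, before any data source is read.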

If you run into a Snappy compression problem, reinstall pyspark.
30 changes: 0 additions & 30 deletions docs/running_spark_locally.md

This file was deleted.

2 changes: 1 addition & 1 deletion docs/sampling.md
@@ -11,7 +11,7 @@ Add the following to your setup.py
],
```

-## With Dbrocket
+## With DBrocket

Cell 1

10 changes: 2 additions & 8 deletions docs/troubleshooting.md
@@ -1,9 +1,3 @@

-One drawback of having ddataflow in the root folder is that it can conflict with other ddtaflow- installations.
-Prefer installing ddataflow in submodules of your main project.
-
-myproject/main_module/ddataflow_config.py
-
-instead of globally like this:
-
-myproject/ddataflow_config.py
+One drawback of having ddataflow in the root folder is that it can conflict with other ddataflow installations.
+Prefer installing ddataflow in submodules of your main project (`myproject/main_module/ddataflow_config.py`) instead of globally (`myproject/ddataflow_config.py`).
2 changes: 1 addition & 1 deletion examples/ddataflow_config.py
@@ -6,7 +6,7 @@
"demo_tours": {
"source": lambda spark: spark.table('demo_tours'),
"filter": lambda df: df.limit(500)
}
},
"demo_locations": {
"source": lambda spark: spark.table('demo_locations'),
"default_sampling": True,
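
Worth noting: without the added comma between the `demo_tours` and `demo_locations` entries, importing `ddataflow_config.py` fails with a `SyntaxError`, so the example config could not be loaded at all.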
31 changes: 31 additions & 0 deletions mkdocs.yml
@@ -0,0 +1,31 @@
site_name: DDataflow
site_url: https://example.com/

theme:
name: material

markdown_extensions:
- pymdownx.superfences

nav:
- 'index.md'
- 'integrator_manual.md'
- 'local_development.md'
- 'sampling.md'
- API Reference:
- 'api_reference/DDataflow.md'
- 'api_reference/DataSource.md'
- 'api_reference/DataSources.md'
- 'api_reference/DataSourceDownloader.md'
- 'troubleshooting.md'
- 'FAQ.md'

plugins:
- search
- mkdocstrings:
handlers:
# See: https://mkdocstrings.github.io/python/usage/
python:
options:
docstring_style: sphinx
allow_inspection: true
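
To preview the new docs locally, the workflow's steps translate to `pip install mkdocs mkdocstrings[python] mkdocs-material` followed by `mkdocs serve`; the Pages workflow above then publishes the same site with `mkdocs gh-deploy --force` on every push to master or main.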
