This repository has been archived by the owner on May 31, 2023. It is now read-only.

Solgate as an operator #32

Open
tumido opened this issue Sep 17, 2020 · 0 comments
Assignees
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@tumido
Member

tumido commented Sep 17, 2020

Overview

Is your feature request related to a problem? Please describe.
Solgate should be accompanied by operator capabilities: only a single CRD instance should be required to define a dataset and set up a new sync pipeline that keeps said dataset in sync.

Describe the solution you'd like
Once a CRD instance is created, the operator would run the initial sync and then keep the dataset up to date. Optionally, it may be asked to delete the dataset's data when the CRD instance is deleted.

Describe alternatives you've considered
n/a

Additional context
This would streamline deployment of solgate. It should be easy to implement via kopf: https://github.com/nolar/kopf
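The lifecycle described above can be sketched as a pure decision function, which is easy to unit test before it is wired into any framework. This is only an illustration: the event names and action labels are hypothetical, and the defaults for `initialSync` and `cleanupOnDelete` are assumptions. In a kopf-based operator, a function like this would presumably be called from handlers registered with `@kopf.on.create` and `@kopf.on.delete`.

```python
from typing import Any, Dict, List


def plan_actions(event: str, spec: Dict[str, Any]) -> List[str]:
    """Return the ordered, high-level actions for one DataSet event.

    `event` is one of "create", "trigger" or "delete" (hypothetical names).
    `spec` is the `spec` section of the DataSet resource as a plain dict.
    """
    if event == "create":
        actions = ["register-triggers"]
        # Assumption: a missing initialSync field means "no initial sync".
        if spec.get("initialSync", False):
            actions.insert(0, "initial-sync")
        return actions
    if event == "trigger":
        # A calendar or webhook trigger fired: bring destinations up to date.
        return ["sync"]
    if event == "delete":
        actions = ["unregister-triggers"]
        # Assumption: data is kept unless cleanupOnDelete is explicitly true.
        if spec.get("cleanupOnDelete", False):
            actions.append("delete-destination-data")
        return actions
    raise ValueError(f"unknown event: {event!r}")
```

Keeping the decision separate from the side effects means the same logic could be reused whichever backend ends up running the actual sync jobs.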

Proposal

Define a CustomResourceDefinition that holds information about the dataset, such as its name, origin (source), desired on-cluster location, sync triggers, etc.

It can look something like this:

apiVersion: solgate.io/v1alpha1
kind: DataSet
metadata:
  name: my-dataset
  annotations:  # Populated by the operator
    solgate.io/dataset-origin: kaggle
    solgate.io/dataset-name: Original name of the dataset
    solgate.io/dataset-description: Description pulled from origin if available
spec:
  initialSync: true  # Schedule an initial sync of the dataset to destinations
  cleanupOnDelete: false # Delete data from destinations on DataSet object delete

  triggers: # Forwarded to KNative or Argo Events
    - calendar:
        interval: x
        schedule: cron
    - webhook: "..."

  source: # Source with type like S3/Mailing list/Kaggle, etc.
    s3:
      endpoint:
      bucket:
      key:
      accessKeySecret:
        name:
        key:
      secretKeySecret:
        name:
        key:

  destinations: # S3 or PersistentVolume
    - name:
      s3:
        endpoint:
        bucket:
        key:
        accessKeySecret:
          name:
          key:
        secretKeySecret:
          name:
          key:
      ...
status: # Filled in by the operator
  triggers:
    - calendar:
      lastActivated: # timestamp
  destinations:
    - name:
      lastSync: # timestamp
  ...
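To make the shape of the manifest above concrete, here is a minimal sketch of how an operator might validate the `spec` section once it has been loaded into a plain dict. The class and function names are hypothetical, only S3 sources and destinations are handled, secret references and triggers are ignored, and treating `key` as optional is an assumption.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class S3Location:
    endpoint: str
    bucket: str
    key: str = ""  # Assumption: an empty key means the whole bucket.


@dataclass(frozen=True)
class DataSetSpec:
    source: S3Location
    destinations: Dict[str, S3Location]
    initial_sync: bool = False
    cleanup_on_delete: bool = False


def parse_s3(raw: Dict[str, Any]) -> S3Location:
    """Validate one `s3:` block and convert it to an S3Location."""
    missing = [name for name in ("endpoint", "bucket") if not raw.get(name)]
    if missing:
        raise ValueError("s3 block is missing: " + ", ".join(missing))
    return S3Location(raw["endpoint"], raw["bucket"], raw.get("key") or "")


def parse_spec(spec: Dict[str, Any]) -> DataSetSpec:
    """Validate the `spec` section of a DataSet resource."""
    source = spec.get("source") or {}
    if "s3" not in source:
        raise ValueError("this sketch only handles s3 sources")
    destinations: Dict[str, S3Location] = {}
    for raw in spec.get("destinations") or []:
        name = raw.get("name")
        if not name:
            raise ValueError("every destination needs a name")
        if name in destinations:
            raise ValueError(f"duplicate destination name: {name}")
        if "s3" not in raw:
            raise ValueError(f"destination {name}: only s3 is handled here")
        destinations[name] = parse_s3(raw["s3"])
    if not destinations:
        raise ValueError("at least one destination is required")
    return DataSetSpec(
        source=parse_s3(source["s3"]),
        destinations=destinations,
        initial_sync=bool(spec.get("initialSync", False)),
        cleanup_on_delete=bool(spec.get("cleanupOnDelete", False)),
    )
```

Rejecting a malformed resource early would let the operator report the problem in the resource status instead of failing later inside a sync job.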

The operator may (aka a far-fetched road map):

  • schedule the initial sync using a MinIO client mirror job
  • schedule a sync pipeline to keep the datasets up to date using different backends (Argo Workflows, etc.)
  • trigger individual preprocessing pipelines for the destinations via different backends (Argo Workflows, Airflow, ...)
  • delete data on DataSet resource delete
  • fire events when a dataset is updated
  • if deployed cluster-wide, monitor DataSet instances across namespaces and coordinate their sync jobs, so that the same source is not pulled separately for many data scientists at similar times. Instead, aggregate the destinations and sync them all together. This opens up optimizations: for example, a sync within the same S3 cluster can use the object .copy method, which is much faster than copying byte by byte.
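The cluster-wide coordination in the last item can be sketched as a small planning step: group all DataSet instances by source, merge their destinations, and flag the destinations that share the source's endpoint as candidates for a server-side copy. The types and the function name below are hypothetical, and comparing endpoints is only an assumed stand-in for "within the same S3 cluster".

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Set, Tuple


class Location(NamedTuple):
    endpoint: str
    bucket: str
    key: str


class SyncJob(NamedTuple):
    source: Location
    destination: Location
    server_side_copy: bool  # True when source and destination share an endpoint


def plan_sync_jobs(
    datasets: List[Tuple[Location, List[Location]]],
) -> Dict[Location, List[SyncJob]]:
    """Aggregate destinations per source so each source is pulled only once.

    `datasets` holds one (source, destinations) pair per DataSet instance,
    regardless of the namespace it lives in.
    """
    grouped: Dict[Location, Set[Location]] = defaultdict(set)
    for source, destinations in datasets:
        # Using a set drops destinations requested by more than one DataSet.
        grouped[source].update(destinations)

    plan: Dict[Location, List[SyncJob]] = {}
    for source, destinations in grouped.items():
        plan[source] = [
            SyncJob(source, dest, server_side_copy=(dest.endpoint == source.endpoint))
            for dest in sorted(destinations)  # sorted for a deterministic plan
        ]
    return plan
```

With two DataSet instances that share a source, the plan contains a single entry for that source, so it only has to be read once per sync round.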

As a result, adding a new dataset to the cluster and keeping it up to date is just a matter of deploying a single DataSet resource (compared to deploying many manifests for the current solgate to spin up a new sync pipeline instance). It can also serve as the basis for a "local" dataset catalogue, which can be aggregated from the annotations.
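As a rough illustration of the catalogue idea, the `solgate.io/dataset-*` annotations from the example manifest could be collected into a flat list of entries. The function name and the shape of the entries are hypothetical, and the input is assumed to be a list of DataSet resources already fetched from the cluster as plain dicts.

```python
from typing import Any, Dict, List

ANNOTATION_PREFIX = "solgate.io/dataset-"


def build_catalogue(datasets: List[Dict[str, Any]]) -> List[Dict[str, str]]:
    """Build one catalogue entry per DataSet resource from its annotations."""
    catalogue = []
    for obj in datasets:
        metadata = obj.get("metadata") or {}
        entry = {
            "resource": metadata.get("name", ""),
            "namespace": metadata.get("namespace", ""),
        }
        for key, value in (metadata.get("annotations") or {}).items():
            # Keep only the operator-populated dataset annotations and
            # strip the common prefix, e.g. "origin", "name", "description".
            if key.startswith(ANNOTATION_PREFIX):
                entry[key[len(ANNOTATION_PREFIX):]] = value
        catalogue.append(entry)
    return catalogue
```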

@tumido tumido self-assigned this Sep 17, 2020
@tumido tumido changed the title Provide an operator Solgate as an operator Sep 21, 2020
@sesheta sesheta added kind/feature Categorizes issue or PR as related to a new feature. and removed enhancement labels Feb 13, 2021