This repository has been archived by the owner on May 31, 2023. It is now read-only.
Is your feature request related to a problem? Please describe.
Solgate should be accompanied by operator capabilities: only a single CRD instance should be required to define a dataset and set up a new sync pipeline that keeps that dataset in sync.
Describe the solution you'd like
Once a CRD instance is created, the operator would facilitate the initial sync and keep the dataset up to date. Optionally, it could be requested to delete the dataset when the CRD instance is deleted.
Describe alternatives you've considered
n/a
Additional context
Would streamline deployment of solgate. Should be easy to implement via https://github.com/nolar/kopf
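To illustrate the kopf idea, here is a minimal sketch of the reconciliation decisions such an operator could make. The field names follow the spec drafted in this proposal; the function names are hypothetical, and the actual kopf create/delete handlers would simply call functions like these:

```python
# Hypothetical sketch of the decisions a kopf-based operator could make
# for the proposed DataSet CRD. Field names follow the draft spec in this
# proposal; the function names and action labels are illustrative only.

def actions_on_create(spec: dict) -> list:
    """Decide what to do when a DataSet resource is created."""
    actions = []
    if spec.get("initialSync", False):
        # Schedule the one-off initial sync to all destinations.
        actions.append("schedule-initial-sync")
    for trigger in spec.get("triggers", []):
        # Each trigger would be forwarded to KNative or Argo Events.
        if "calendar" in trigger:
            actions.append("register-calendar-trigger")
        elif "webhook" in trigger:
            actions.append("register-webhook-trigger")
    return actions

def actions_on_delete(spec: dict) -> list:
    """Decide what to do when a DataSet resource is deleted."""
    return ["cleanup-destinations"] if spec.get("cleanupOnDelete", False) else []
```

In kopf terms, `actions_on_create` would sit behind an `@kopf.on.create(...)` handler and `actions_on_delete` behind `@kopf.on.delete(...)`, keeping the reconciliation logic itself trivially testable.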
Proposal
Define a CustomResourceDefinition that holds information about the dataset, such as its name, origin (source), desired on-cluster location, sync triggers, etc.
It could look something like this:
```yaml
apiVersion: solgate.io/v1alpha1
kind: DataSet
metadata:
  name: my-dataset
  annotations: # Populated by the operator
    solgate.io/dataset-origin: kaggle
    solgate.io/dataset-name: Original name of the dataset
    solgate.io/dataset-description: Description pulled from origin if available
spec:
  initialSync: true # Schedule an initial sync of the dataset to destinations
  cleanupOnDelete: false # Delete data from destinations on DataSet object delete
  triggers: # Forwarded to KNative or Argo Events
    - calendar:
        interval: x
        schedule: cron
    - webhook: "..."
  source: # Source with type like S3/Mailing list/Kaggle, etc.
    s3:
      endpoint:
      bucket:
      key:
      accessKeySecret:
        name:
        key:
      secretKeySecret:
        name:
        key:
  destinations: # S3 or PersistentVolume
    - name:
      s3:
        endpoint:
        bucket:
        key:
        accessKeySecret:
          name:
          key:
        secretKeySecret:
          name:
          key:
    ...
status: # Filled in by the operator
  triggers:
    - calendar:
        lastActivated: # timestamp
  destinations:
    - name:
      lastSync: # timestamp
  ...
```
The operator may (a far-fetched roadmap):
- schedule the initial sync using a MinIO client mirror job
- schedule a sync pipeline to keep the datasets up to date using different backends (Argo Workflows, etc.)
- trigger individual preprocessing pipelines for the destinations via different backends (Argo Workflows, Airflow, ...)
- delete data on DataSet resource delete
- fire events when a dataset is updated
- if deployed cluster-wide, monitor DataSet instances in distinct namespaces and coordinate the sync jobs, so we're not pulling the same source for many data scientists at similar times; instead, aggregate the destinations and sync them all together. This also opens the door to optimizations: a sync within the same S3 cluster can use the object `copy` method, which is much faster than copying byte by byte.
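The cluster-wide coordination idea above can be sketched as a simple aggregation step: group DataSet specs by source identity so each origin is pulled once and fanned out to every requested destination. This is an illustrative sketch, not part of solgate; the function name is hypothetical and the source identity is simplified to endpoint and bucket:

```python
from collections import defaultdict

def aggregate_syncs(datasets):
    """Group DataSet specs by S3 source so each origin is pulled only once,
    fanning out to all requested destinations in a single sync job.
    Illustrative only: a real operator might normalize endpoint/bucket/key
    and handle non-S3 sources as well."""
    jobs = defaultdict(list)
    for ds in datasets:
        # Hashable identity for the source, simplified to (endpoint, bucket).
        src = (ds["source"]["s3"]["endpoint"], ds["source"]["s3"]["bucket"])
        jobs[src].extend(dest["name"] for dest in ds["destinations"])
    return dict(jobs)
```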
As a result, adding a new dataset to the cluster and keeping it up to date is just a matter of deploying a single DataSet resource (compared to the many manifests the current solgate requires to spin up a new sync pipeline instance). It can also serve as a base for a "local" dataset catalogue (aggregated from the annotations).