This repository has been archived by the owner on May 31, 2023. It is now read-only.
Is your feature request related to a problem? Please describe.
Solgate should be accompanied by operator capabilities: only a single CRD instance should be required to define a dataset and set up a new sync pipeline that keeps that dataset in sync.
Describe the solution you'd like
Once a CRD instance is created, the operator would facilitate the initial sync and keep the dataset up to date. Optionally, it could be requested to delete the dataset when the CRD instance is deleted.
Describe alternatives you've considered
n/a
Additional context
Would streamline deployment of solgate. Should be easy to implement via https://github.com/nolar/kopf
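To illustrate the kopf idea, here is a minimal sketch of the reconciliation decisions such an operator could make. The field names follow the spec drafted in this proposal; the function names are hypothetical, and the actual kopf create/delete handlers would simply call functions like these:

```python
# Hypothetical sketch of the decisions a kopf-based operator could make
# for the proposed DataSet CRD. Field names follow the draft spec in this
# proposal; the function names and action labels are illustrative only.

def actions_on_create(spec: dict) -> list:
    """Decide what to do when a DataSet resource is created."""
    actions = []
    if spec.get("initialSync", False):
        # Schedule the one-off initial sync to all destinations.
        actions.append("schedule-initial-sync")
    for trigger in spec.get("triggers", []):
        # Each trigger would be forwarded to KNative or Argo Events.
        if "calendar" in trigger:
            actions.append("register-calendar-trigger")
        elif "webhook" in trigger:
            actions.append("register-webhook-trigger")
    return actions

def actions_on_delete(spec: dict) -> list:
    """Decide what to do when a DataSet resource is deleted."""
    return ["cleanup-destinations"] if spec.get("cleanupOnDelete", False) else []
```

In kopf terms, `actions_on_create` would sit behind an `@kopf.on.create(...)` handler and `actions_on_delete` behind `@kopf.on.delete(...)`, keeping the reconciliation logic itself trivially testable.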
Proposal
Define a CustomResourceDefinition that holds information about the dataset, such as its name, origin (source), desired on-cluster location, sync triggers, etc.
It could look something like this:
```yaml
apiVersion: solgate.io/v1alpha1
kind: DataSet
metadata:
  name: my-dataset
  annotations: # Populated by the operator
    solgate.io/dataset-origin: kaggle
    solgate.io/dataset-name: Original name of the dataset
    solgate.io/dataset-description: Description pulled from origin if available
spec:
  initialSync: true # Schedule an initial sync of the dataset to destinations
  cleanupOnDelete: false # Delete data from destinations on DataSet object delete
  triggers: # Forwarded to KNative or Argo Events
    - calendar:
        interval: x
        schedule: cron
    - webhook: "..."
  source: # Source with type like S3/Mailing list/Kaggle, etc.
    s3:
      endpoint:
      bucket:
      key:
      accessKeySecret:
        name:
        key:
      secretKeySecret:
        name:
        key:
  destinations: # S3 or PersistentVolume
    - name:
      s3:
        endpoint:
        bucket:
        key:
        accessKeySecret:
          name:
          key:
        secretKeySecret:
          name:
          key:
    ...
status: # Filled in by the operator
  triggers:
    - calendar:
        lastActivated: # timestamp
  destinations:
    - name:
      lastSync: # timestamp
  ...
```
The operator may (a far-fetched roadmap):
- schedule the initial sync using a MinIO client mirror job
- schedule a sync pipeline to keep the datasets up to date using different backends (Argo Workflows, etc.)
- trigger individual preprocessing pipelines for the destinations via different backends (Argo Workflows, Airflow, ...)
- delete data on DataSet resource delete
- fire events when a dataset is updated
- if deployed cluster-wide, monitor DataSet instances in distinct namespaces and coordinate the sync jobs, so we're not pulling the same source for many data scientists at similar times; instead, aggregate the destinations and sync them all together. This also opens the door to optimizations: a sync within the same S3 cluster can use the object `copy` method, which is much faster than copying byte by byte.
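The cluster-wide coordination idea above can be sketched as a simple aggregation step: group DataSet specs by source identity so each origin is pulled once and fanned out to every requested destination. This is an illustrative sketch, not part of solgate; the function name is hypothetical and the source identity is simplified to endpoint and bucket:

```python
from collections import defaultdict

def aggregate_syncs(datasets):
    """Group DataSet specs by S3 source so each origin is pulled only once,
    fanning out to all requested destinations in a single sync job.
    Illustrative only: a real operator might normalize endpoint/bucket/key
    and handle non-S3 sources as well."""
    jobs = defaultdict(list)
    for ds in datasets:
        # Hashable identity for the source, simplified to (endpoint, bucket).
        src = (ds["source"]["s3"]["endpoint"], ds["source"]["s3"]["bucket"])
        jobs[src].extend(dest["name"] for dest in ds["destinations"])
    return dict(jobs)
```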
As a result, adding a new dataset to the cluster and keeping it up to date is just a matter of deploying a single DataSet resource (compared to the many manifests the current solgate requires to spin up a new sync pipeline instance). It can also serve as a base for a "local" dataset catalogue (aggregated from the annotations).