From d827bf02a4cf43bebf3f21263493b4536f48a073 Mon Sep 17 00:00:00 2001
From: "Jinjing.Zhou"
Date: Fri, 22 Jul 2022 13:23:53 +0800
Subject: [PATCH 1/2] add rfc

---
 docs/proposals/data.md | 0
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 docs/proposals/data.md

diff --git a/docs/proposals/data.md b/docs/proposals/data.md
new file mode 100644
index 000000000..e69de29bb
From b0c46779f4acd03ab9f7db2634e13de00f191247 Mon Sep 17 00:00:00 2001
From: "Jinjing.Zhou"
Date: Fri, 22 Jul 2022 13:33:43 +0800
Subject: [PATCH 2/2] add

Signed-off-by: Jinjing.Zhou
---
 docs/proposals/data.md | 82 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 82 insertions(+)

diff --git a/docs/proposals/data.md b/docs/proposals/data.md
index e69de29bb..68852ebea 100644
--- a/docs/proposals/data.md
+++ b/docs/proposals/data.md
@@ -0,0 +1,82 @@
+# Data function support
+
+## Summary
+
+Provide data mounting support for envd.
+
+
+## Goals
+
+Design a *unified, declarative* interface and underlying architecture to provide datasets in the development environment in a *scalable way*.
+
+
+Non-goals:
+- Support Git-like version control for data
+
+## Common Scenarios
+
+### Possible sources
+- Local files
+- Object storage (AWS S3)
+- NFS-like system (AWS EFS, AWS FSx for OpenZFS)
+- Block storage (Ceph)
+- HDFS
+- Lustre
+- API endpoint (HTTP path)
+- SQL results
+- Other distributed file systems (Alluxio, JuiceFS)
+- Python SDK
+
+### Possible forms
+- Images
+- Text
+- Embedding binaries
+- CSV
+
+### Access Pattern
+
+Most datasets are written once and then read many times, often concurrently. Therefore
+
+### Possible versions/tags
+- Version by number: V1, V2, V3
+- Version by scale: sample dataset vs. full dataset
+- Version by time: query range of user activity (7d, 30d) from a feature store
+
+We could adopt a new standard for versioning data, similar to semver.
+
+## Proposal
+
+Each version of a dataset is immutable.
Because the data is assumed to be immutable, we can cache it and replicate it easily, which increases read throughput in multiple ways.
+
+
+### Usage
+
+Users need to create the dataset beforehand, then declare the mount in the build.envd file.
+
+```
+envd data add -f mnist.yaml
+```
+
+Users can create multiple datasets with the same name, but they need to have different versions.
+
+mnist.yaml
+```yaml
+ApiVersion: V1alpha
+name: mnist
+version: "0.0.1-sample"
+sources:
+  - type: local # The first source is considered the major source; the others are replicas of it
+    path: ~/.torch/mnist
+  - type: s3
+    path: xxx
+validation:
+  checksum:
+    - name: MD5
+      value: xxxx
+```
+
+build.envd
+```
+def data():
+    return [d.mount("mnist", target="./data")] # Users can mount multiple datasets
+```
\ No newline at end of file
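
The `validation.checksum` entries in `mnist.yaml` suggest that a source or replica can be checked against a recorded digest before it is mounted. The proposal does not say how a digest over a directory would be computed, so the following is only a minimal sketch under one assumption: files are hashed in sorted order, and each file's relative path is mixed into the digest. The names `dataset_digest` and `validate` are hypothetical and are not part of envd.

```python
import hashlib
from pathlib import Path


def dataset_digest(root, algorithm="md5"):
    """Return a hex digest for a single file or for every file under a directory.

    Assumption (not from the proposal): directories are hashed by walking
    files in sorted order and mixing each relative path into the digest.
    """
    base = Path(root).expanduser()
    digest = hashlib.new(algorithm)
    single_file = base.is_file()
    # Sort so the result does not depend on directory listing order.
    files = [base] if single_file else sorted(p for p in base.rglob("*") if p.is_file())
    for path in files:
        if not single_file:
            # Include the relative path so renaming a file changes the digest.
            digest.update(path.relative_to(base).as_posix().encode("utf-8"))
        with path.open("rb") as handle:
            # Read in 1 MiB chunks to avoid loading large files into memory.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
    return digest.hexdigest()


def validate(root, checksums):
    """Check `root` against a list shaped like the `validation.checksum` entries."""
    return all(
        dataset_digest(root, entry["name"].lower()) == entry["value"].lower()
        for entry in checksums
    )
```

With this sketch, `validate("~/.torch/mnist", [{"name": "MD5", "value": "xxxx"}])` would return `True` only when the local copy matches the recorded digest. Because each dataset version is immutable, such a check would presumably need to run only once per source or replica rather than on every mount.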