Docs: How to Write a Data Pipeline? (#247)
iesahin authored Dec 4, 2023
1 parent 413758a commit e1eaade
Showing 32 changed files with 1,309 additions and 153 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/rust.yml
@@ -48,8 +48,8 @@ jobs:
MINIO_ACCESS_KEY_ID: ${{ secrets.MINIO_ACCESS_KEY_ID }}
MINIO_SECRET_ACCESS_KEY: ${{ secrets.MINIO_SECRET_ACCESS_KEY }}
XVC_TEST_ONE_EMRESULT_COM_KEY: ${{ secrets.XVC_TEST_ONE_EMRESULT_COM_KEY }}
# We don't run xvc-storage tests here
XVC_TRYCMD_TESTS: core,file,pipeline,intro,howto,start
# We don't run xvc-storage and how-to tests here
XVC_TRYCMD_TESTS: core,file,pipeline,intro,start
steps:
- name: Checkout
uses: actions/checkout@v1
1 change: 1 addition & 0 deletions book/src/SUMMARY.md
@@ -16,6 +16,7 @@
- [How to Compile Xvc](./how-to/compile.md)
- [Xvc with Git Branches](./how-to/git-branches.md)
- [Turn off Git Integration](./how-to/turn-off-git-automation.md)
- [Create a Data Pipeline](./how-to/create-a-data-pipeline.md)

- [Command Reference](./ref/xvc.md)
- [`xvc init`](./ref/xvc-init.md)
650 changes: 650 additions & 0 deletions book/src/how-to/create-a-data-pipeline.md

Large diffs are not rendered by default.

32 changes: 32 additions & 0 deletions book/src/images/xvc-pipeline-dag-pipeline-1.svg
9 changes: 6 additions & 3 deletions book/src/ref/xvc-file-copy.md
@@ -35,6 +35,9 @@ Options:
--no-recheck
Do not recheck the destination files This is useful when you want to copy only records, without updating the workspace

--name-only
When copying multiple files, by default whole path is copied to the destination. This option sets the destination to be created with the file name only

-h, --help
Print help (see a summary with '-h')

@@ -117,7 +120,7 @@ Total #: 3 Workspace Size: 57 Cached Size: 19

```

If the targets you specify are changed, Xvc cancels the copy operation. Please either recheck old versions or carry in new versions.
If the source files you specify are changed, Xvc cancels the copy operation. Please either recheck old versions or carry in new versions.

```console
$ perl -i -pe 's/a/ee/g' data.txt
@@ -164,8 +167,8 @@ FH 19 [..] c85f3e81 c85f3e81 another-set/data2.txt
FH 19 [..] c85f3e81 c85f3e81 another-set/data.txt
DX 160 [..] another-set
FX 141 [..] 3054b812 .xvcignore
FX 529 [..] [..] .gitignore
Total #: 11 Workspace Size: 1105 Cached Size: 19
FX [..] [..] [..] .gitignore
Total #: 11 Workspace Size: [..] Cached Size: 19


```
17 changes: 7 additions & 10 deletions book/src/ref/xvc-file-move.md
@@ -3,35 +3,32 @@
## Synopsis

```console
$ xvc file copy --help
Copy from source to another location in the workspace
$ xvc file move --help
Move files to another location in the workspace

Usage: xvc file copy [OPTIONS] <SOURCE> <DESTINATION>
Usage: xvc file move [OPTIONS] <SOURCE> <DESTINATION>

Arguments:
<SOURCE>
Source file, glob or directory within the workspace.
If the source ends with a slash, it's considered a directory and all files in that directory are copied.
If the number of source files is more than one, the destination must be a directory.
If there are multiple source files, the destination must be a directory.

<DESTINATION>
Location we copy file(s) to within the workspace.
Location we move file(s) to within the workspace.
If the target ends with a slash, it's considered a directory and created if it doesn't exist.
If this ends with a slash, it's considered a directory and created if it doesn't exist.
If the number of source files is more than one, the destination must be a directory.

Options:
--recheck-method <RECHECK_METHOD>
How the targets should be rechecked: One of copy, symlink, hardlink, reflink.
How the destination should be rechecked: One of copy, symlink, hardlink, reflink.
Note: Reflink uses copy if the underlying file system doesn't support it.

--force
Force even if target exists

--no-recheck
Do not recheck the destination files This is useful when you want to copy only records, without updating the workspace

94 changes: 72 additions & 22 deletions book/src/ref/xvc-file-recheck.md
@@ -39,75 +39,123 @@ This command has an alias [`xvc file checkout`](/ref/xvc-file-checkout.md) if yo
Rechecking is analogous to [git checkout](https://git-scm.com/docs/git-checkout).
It copies or links a cached file to the workspace.

Start by tracking a file.
Let's create an example directory hierarchy as a showcase.

```console
$ xvc-test-helper create-directory-tree --directories 2 --files 3 --seed 231123
$ tree
.
├── dir-0001
│   ├── file-0001.bin
│   ├── file-0002.bin
│   └── file-0003.bin
└── dir-0002
    ├── file-0001.bin
    ├── file-0002.bin
    └── file-0003.bin

3 directories, 6 files

```

Start by tracking files.

```console
$ git init
...
$ xvc init

$ xvc file track data.txt

$ lsd -l
.rw-rw-rw- [..] data.txt
$ xvc file track dir-*

```

Once you have added the file to the cache, you can delete the workspace copy.

```console
$ rm data.txt
$ ls -l
$ rm dir-0001/file-0001.bin
$ lsd -l dir-0001/file-*
total[..]
drwxr-xr-x [..] dir-0001
drwxr-xr-x [..] dir-0002

```

Then, recheck the file. By default, it makes a copy of the file.

```console
$ xvc file recheck data.txt
$ xvc file recheck dir-0001/file-0001.bin

$ lsd -l
.rw-rw-rw- [..] data.txt

```

Xvc only updates the recheck method if the file is not changed.
You can track and recheck complete directories.

```console
$ xvc file recheck data.txt --as symlink
$ xvc file track dir-0002/
$ rm -rf dir-0002/
$ xvc -v file recheck dir-0002/
$ ls -l dir-0002/
total 24
-rw-rw-rw- 1 [..] file-0001.bin
-rw-rw-rw- 1 [..] file-0002.bin
-rw-rw-rw- 1 [..] file-0003.bin

$ ls -l data.txt
l[..] data.txt -> [CWD]/.xvc/b3/c85/f3e/8108a0d53da6b4869e5532a3b72301ed58d5824ed1394d52dbcabe9496/0.txt
```
You can use glob patterns to recheck files.

```console
$ xvc file track 'dir-*'

```

You can update the recheck method of a file. Otherwise, the previous method is kept.

```console
$ rm -rf dir-0002/
$ xvc -v file recheck dir-0002/ --as symlink
$ ls -l dir-0002/
total 0
lrwxr-xr-x 1 [..] file-0001.bin -> [CWD]/.xvc/b3/3c9/255/424e13d9c38a37c5ddd376e1070cdd5de66996fbc82194c462f653856d/0.bin
lrwxr-xr-x 1 [..] file-0002.bin -> [CWD]/.xvc/b3/6bc/65f/581e3a03edb127b63b71c5690be176e2fe265266f70abc65f72613f62e/0.bin
lrwxr-xr-x 1 [..] file-0003.bin -> [CWD]/.xvc/b3/804/fb8/edbb122e735facd7f943c1bbe754e939a968f385c12f56b10411a4a015/0.bin

$ rm -rf dir-0002/
$ xvc -v file recheck dir-0002/

$ ls -l dir-0002/
total 0
lrwxr-xr-x 1 [..] file-0001.bin -> [CWD]/.xvc/b3/3c9/255/424e13d9c38a37c5ddd376e1070cdd5de66996fbc82194c462f653856d/0.bin
lrwxr-xr-x 1 [..] file-0002.bin -> [CWD]/.xvc/b3/6bc/65f/581e3a03edb127b63b71c5690be176e2fe265266f70abc65f72613f62e/0.bin
lrwxr-xr-x 1 [..] file-0003.bin -> [CWD]/.xvc/b3/804/fb8/edbb122e735facd7f943c1bbe754e939a968f385c12f56b10411a4a015/0.bin

```

Symlinks and hardlinks are read-only.
You can delete the symlink, and replace with an updated copy.
(As `perl -i` does below.)
You can recheck as copy to update.

```console
$ perl -i -pe 's/a/ee/g' data.txt
$ zsh -c 'echo "120912" >> dir-0002/file-0001.bin'
? 1
zsh:1: permission denied: dir-0002/file-0001.bin

$ xvc file recheck data.txt --as copy
[ERROR] data.txt has changed on disk. Either carry in, force, or delete the target to recheck.
$ xvc file recheck dir-0002/file-0001.bin --as copy

$ rm data.txt
$ zsh -c 'echo "120912" >> dir-0002/file-0001.bin'

```
Note that, as files in the cache are kept read-only, hardlinks and symlinks are also read only. Files rechecked as copy are made read-write explicitly.

```console
$ xvc -vv file recheck data.txt --as hardlink
[INFO] [HARDLINK] [CWD]/.xvc/b3/c85/f3e/8108a0d53da6b4869e5532a3b72301ed58d5824ed1394d52dbcabe9496/0.txt -> [CWD]/data.txt

$ ls -l
total[..]
-r--r--r--[..] data.txt
drwxr-xr-x [..] dir-0001
drwxr-xr-x [..] dir-0002

```

Note that, as files in the cache are kept read-only, hardlinks and symlinks are also read only. Files rechecked as copy are made read-write explicitly.

Xvc supports reflinks, but the underlying file system must also support them.
Otherwise, Xvc falls back to `copy`.

@@ -118,3 +166,5 @@ $ xvc file recheck data.txt --as reflink
```

The above command will create a read-only link on macOS APFS and a copy on ext4 or NTFS file systems.


4 changes: 2 additions & 2 deletions book/src/ref/xvc-file-remove.md
@@ -282,8 +282,8 @@ $ xvc file list
SS [..] [..] 4a2e9d7c data2.txt
FC 1024 [..] 4a2e9d7c 4a2e9d7c data.txt
FX 141 [..] 3054b812 .xvcignore
FX 274 [..] [..] .gitignore
Total #: 4 Workspace Size: 1621 Cached Size: 1024
FX [..] [..] [..] .gitignore
Total #: 4 Workspace Size: [..] Cached Size: 1024


$ xvc file remove --from-cache data.txt
15 changes: 9 additions & 6 deletions book/src/ref/xvc-pipeline-dag.md
@@ -42,23 +42,26 @@ $ xvc pipeline step new --step-name train --command "echo 'train'"
$ xvc pipeline step dependency --step-name train --step preprocess

```
The raw output is not very readable, but you can pipe it directly to `dot` to get a more useful rendering.

```console
$ xvc pipeline dag
digraph pipeline{n0[shape=box;label="preprocess";];n1[shape=box;label="train";];n0[shape=box;label="preprocess";];n1->n0;}
digraph pipeline{n0[shape=box;label="preprocess";];n1[shape=box;label="train";];n0[shape=box;label="preprocess";];n0->n1;}

```

When you add a dependency between two steps, the graph shows it as a node.
The output after `dot -Tsvg` is:

```console
$ xvc pipeline step dependency --step-name preprocess --glob 'data/*'
![pipeline-1](/images/xvc-pipeline-dag-pipeline-1.svg)

$ xvc pipeline dag
digraph pipeline{n0[shape=box;label="preprocess";];n1[shape=folder;label="data/*";];n0->n1;n2[shape=box;label="train";];n0[shape=box;label="preprocess";];n2->n0;}
When you add a dependency to a step, the graph shows the dependency as a node. For example,

```console
$ xvc pipeline step dependency --step-name preprocess --glob 'data/*'
```

![pipeline-2](/images/xvc-pipeline-dag-pipeline-2.svg)
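
As a rough illustration of the kind of DOT text shown above, here is a minimal sketch that renders steps and their step dependencies as a Graphviz `digraph`. The `Step` type and the `to_dot` function are hypothetical names invented for this example; this is not Xvc's implementation, and the output is formatted for readability rather than matching `xvc pipeline dag` byte for byte. Edges point from a dependency to the step that needs it, as in the `n0->n1` edge of the updated output.

```rust
/// A pipeline step and the names of the steps it depends on.
/// Hypothetical type for illustration; not part of Xvc.
struct Step {
    name: &'static str,
    depends_on: Vec<&'static str>,
}

/// Renders the steps as a Graphviz `digraph`: one box node per step and
/// one edge from each dependency to the step that depends on it.
fn to_dot(steps: &[Step]) -> String {
    let mut out = String::from("digraph pipeline {\n");

    // Emit one node per step, named n0, n1, ... by position.
    for (i, step) in steps.iter().enumerate() {
        out.push_str(&format!("  n{} [shape=box, label=\"{}\"];\n", i, step.name));
    }

    // Emit one edge per dependency that refers to a known step.
    for (i, step) in steps.iter().enumerate() {
        for dep in &step.depends_on {
            if let Some(j) = steps.iter().position(|s| s.name == *dep) {
                out.push_str(&format!("  n{} -> n{};\n", j, i));
            }
        }
    }

    out.push_str("}\n");
    out
}
```

Calling `to_dot` with a `preprocess` step and a `train` step that depends on it yields text that can be passed to `dot -Tsvg` in the same way as the command output.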

You can use the `--mermaid` option to get a [mermaid.js](https://mermaid.js.org) diagram.

```
23 changes: 23 additions & 0 deletions core/src/types/xvcpath.rs
@@ -157,6 +157,29 @@ impl XvcPath {
    pub fn join(&self, other: &XvcPath) -> Result<XvcPath> {
        Ok(XvcPath(self.0.join(&other.0)))
    }

    /// Join only the file name portion of the other XvcPath
    /// ```
    /// use xvc_core::XvcPath;
    /// use relative_path::RelativePathBuf;
    ///
    /// let path = XvcPath::from(RelativePathBuf::from("a/b/c"));
    /// let other = XvcPath::from(RelativePathBuf::from("d/e/f"));
    /// let joined = path.join_file_name(&other).unwrap();
    /// assert_eq!(joined, XvcPath::from(RelativePathBuf::from("a/b/c/f")));
    /// ```
    pub fn join_file_name(&self, other: &XvcPath) -> Result<XvcPath> {
        let other_name = other
            .file_name()
            .ok_or_else(|| anyhow::anyhow!("other path doesn't have a file name"))?;

        Ok(XvcPath(self.0.join(other_name)))
    }

    /// Returns the file name of the path
    pub fn file_name(&self) -> Option<&str> {
        self.0.file_name()
    }
}

/// Represents whether a file is a text file or not
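
The difference between joining a whole path and joining only its file name can be sketched with the standard library alone. This is a minimal standalone illustration that uses `std::path` instead of `XvcPath` or `relative_path`; the function names `join_name_only` and `join_full` are hypothetical and not part of Xvc. It mirrors the distinction described in the `--name-only` help text in this commit: by default the whole path is appended to the destination, otherwise only the final component is.

```rust
use std::path::{Path, PathBuf};

/// Joins only the final component of `other` onto `base`.
/// Returns `None` when `other` has no file name, for example `..` or an empty path.
/// Hypothetical helper for illustration; not Xvc's `XvcPath::join_file_name`.
fn join_name_only(base: &Path, other: &Path) -> Option<PathBuf> {
    let name = other.file_name()?;
    Some(base.join(name))
}

/// Joins the whole of `other` onto `base`, shown for comparison.
fn join_full(base: &Path, other: &Path) -> PathBuf {
    base.join(other)
}
```

With `base = a/b/c` and `other = d/e/f`, `join_name_only` gives `a/b/c/f`, the same result as the doc test above, while `join_full` gives `a/b/c/d/e/f`.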
7 changes: 6 additions & 1 deletion file/src/bring/mod.rs
@@ -21,6 +21,7 @@ use xvc_logging::{debug, uwr, warn, watch, XvcOutputSender};

use xvc_storage::XvcStorageEvent;
use xvc_storage::{storage::get_storage_record, StorageIdentifier, XvcStorageOperations};
use xvc_walker::PathSync;

/// Bring (download, pull, fetch) files from storage.
///
@@ -132,11 +133,15 @@ pub fn fetch(output_snd: &XvcOutputSender, xvc_root: &XvcRoot, opts: &BringCLI)
watch!(temp_dir);
watch!(event);

let path_sync = PathSync::new();
// Move the files from temp dir to cache
for (_, cp) in cache_paths {
let cache_path = cp.to_absolute_path(xvc_root);
let temp_path = temp_dir.temp_cache_path(&cp)?;
uwr!(move_to_cache(&temp_path, &cache_path), output_snd);
uwr!(
move_to_cache(&temp_path, &cache_path, &path_sync),
output_snd
);
}

xvc_root.with_store_mut(|store: &mut XvcStore<XvcStorageEvent>| {