dvc add --to-remote breaks tracked directories #10581

HuhtaLauri opened this issue Oct 5, 2024 · 0 comments
Bug Report

Issue name

dvc add --to-remote -r my-remote breaks tracked directories

Description

I have nightly runs in CI (GitHub Actions in my case) that collect new data, which I want to add to a tracked directory. On the CI runner, the cache I used during development is not available, because it is .gitignored and I don't want any of the actual data in my Git repository.

I managed to work around the missing cache by using the --to-remote flag, but that produces unexpected results in my tracked directory: the new file is not added to the tracked directory, and the directory's .dvc file is overwritten so that it tracks a single "object" instead of a directory.

Reproduce

  1. Create an Azure storage container
  2. Track a directory
  3. Add a file to the tracked directory using the --to-remote flag

Set up an Azure storage container and initialize the environment

pip install "dvc[azure]"

dvc init
dvc config core.autostage true

dvc remote add my-remote azure://testcontainer/datadir
export AZURE_STORAGE_CONNECTION_STRING='my-connection-string'

Adding the initial data and tracking the directory

mkdir -p data/raw/testdata
printf "foo,bar\n1,2\n3,4" >> data/raw/testdata/file1.csv

dvc add data/raw/testdata
dvc push -r my-remote

The .dvc file for the tracked directory should look like this

cat data/raw/testdata.dvc
# outs:
# - md5: 70abec330da9b503272a8d45546f2e28.dir
#   size: 15
#   nfiles: 1
#   hash: md5
#   path: testdata

Let's add more data and simulate the run in the CI

# To emulate the CI run, remove the cache from its original location
mv .dvc/cache /tmp/

printf "foo,bar\n1,2\n3,4" >> data/raw/testdata/file2.csv

# Then circumvent the cache and push directly to remote
dvc add data/raw/testdata/file2.csv --to-remote -r my-remote

After this, the tracked directory appears to be broken.
The tracked size has not doubled as I would expect, and the tracked object is no longer a directory (the .dir suffix on the md5 and the nfiles field are gone)

cat data/raw/testdata.dvc
# outs:
# - md5: 7fa6745d19ebfd9864be1b9b543640c5
#   size: 15
#   hash: md5
#   path: testdata
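
A check like the following could catch this in CI before the broken .dvc file gets committed. This is only a minimal sketch: it assumes that a `.dir` suffix on the `md5` value is what marks a directory entry, which is what the two outputs above suggest. The helper name `is_dir_tracked` is hypothetical, and the script recreates the two .dvc contents from this report in a scratch directory instead of calling dvc.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the given .dvc file records a directory
# hash (an md5 value ending in ".dir"), fails otherwise.
is_dir_tracked() {
    grep -Eq '^[[:space:]]*-?[[:space:]]*md5:[[:space:]]*[0-9a-f]+\.dir[[:space:]]*$' "$1"
}

# Recreate the two .dvc contents shown in this report in a scratch directory.
tmp=$(mktemp -d)
cat > "$tmp/before.dvc" <<'EOF'
outs:
- md5: 70abec330da9b503272a8d45546f2e28.dir
  size: 15
  nfiles: 1
  hash: md5
  path: testdata
EOF
cat > "$tmp/after.dvc" <<'EOF'
outs:
- md5: 7fa6745d19ebfd9864be1b9b543640c5
  size: 15
  hash: md5
  path: testdata
EOF

# Classify each file as tracking a directory or a single file.
if is_dir_tracked "$tmp/before.dvc"; then before=dir; else before=file; fi
if is_dir_tracked "$tmp/after.dvc"; then after=dir; else after=file; fi
echo "before: $before, after: $after"
```

In a real pipeline the same function could be pointed at data/raw/testdata.dvc right after the dvc add step, failing the job if it returns non-zero.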

Expected

I expect the tracked data size to double and data/raw/testdata.dvc to still track a directory, now containing 2 files
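
To make the expected numbers concrete, the sketch below recreates the two files from the reproduction steps in a scratch directory and computes the values I would expect to see in the .dvc file. It assumes that `size` for a tracked directory is the sum of the file sizes in bytes (consistent with `size: 15` for the single 15-byte file above); it does not call dvc.

```shell
#!/usr/bin/env bash
# Recreate both CSV files with the same content as in the reproduction steps.
tmp=$(mktemp -d)
mkdir -p "$tmp/testdata"
printf "foo,bar\n1,2\n3,4" > "$tmp/testdata/file1.csv"
printf "foo,bar\n1,2\n3,4" > "$tmp/testdata/file2.csv"

# Total size in bytes and number of files in the directory.
total=$(cat "$tmp"/testdata/*.csv | wc -c | tr -d ' ')
nfiles=$(find "$tmp/testdata" -type f | wc -l | tr -d ' ')

echo "size: $total, nfiles: $nfiles"
# prints "size: 30, nfiles: 2"
```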

Environment information

Azure Storage Account
Linux
dvc==3.55.2

Output of dvc doctor:

$ dvc doctor
DVC version: 3.55.2 (pip)
-------------------------
Platform: Python 3.10.12 on Linux-6.8.0-45-generic-x86_64-with-glibc2.35
Subprojects:
        dvc_data = 3.16.6
        dvc_objects = 5.1.0
        dvc_render = 1.0.2
        dvc_task = 0.40.1
        scmrepo = 3.3.8
Supports:
        azure (adlfs = 2024.7.0, knack = 0.12.0, azure-identity = 1.18.0),
        http (aiohttp = 3.10.9, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.10.9, aiohttp-retry = 2.8.3)
Config:
        Global: /home/lauri/.config/dvc
        System: /etc/xdg/xdg-ubuntu/dvc
Cache types: <https://error.dvc.org/no-dvc-cache>
Caches: local
Remotes: azure
Workspace directory: ext4 on /dev/nvme0n1p5
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/72164a4bfb5dd93ccd0746df2a24c25b

Additional Information (if any):
