Container cache download #3163

muffato · 2024-09-07T12:21:08Z

First #3162. The code has to deal with an out_path (always defined) and sometimes a cache_path too. The implementation was slightly convoluted as it was doing fresh downloads into the cache_path or out_path (first decision point) and then doing an extra copy if needed (second decision point). Because of that confusion, symlinks across container registries were not all created across both locations.
I propose to reverse the logic to make it more straightforward:

Always download into out_path and create its symlinks.
Then, optionally copy to cache_path (and create symlinks there).

Then, #3019: I propose to handle the singularity "library" directory this way:

Just like in Nextflow itself, it's considered a read-only location. It means that containers can only be copied from it, not to it, and that we shouldn't be even creating symlinks there.
There is no point in having a --container-library-utilisation parameter for the library because i) remote would be redundant with the --container-cache-utilisation's remote mode, ii) amend is not possible as per the read-only rule, so iii) copy is the only possible mode.
Therefore, the most natural place to use the library is as a source of containers, in parallel of https downloads and singularity pull. When NXF_SINGULARITY_LIBRARYDIR is set and the container exists in the library, it is copied to the target directories (out_path and possibly cache_path too)

PR checklist

This comment contains a description of changes (with reason)
CHANGELOG.md is updated
Unit-tests
Documentation in README.md

… and are then copied to the cache if needed This allows getting rid of the confusing `output_path` variable.

codecov · 2024-09-07T12:37:56Z

Codecov Report

Attention: Patch coverage is 0% with 16 lines in your changes missing coverage. Please review.

Project coverage is 75.52%. Comparing base (8e47a33) to head (78650af).

Files with missing lines	Patch %	Lines
nf_core/pipelines/download.py	0.00%	16 Missing ⚠️

Additional details and impacted files

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

muffato · 2024-09-10T21:02:27Z

The code is there, but I still need to add unit-tests and documentation

MatthiasZepper

Thanks a lot for your contribution, and sorry for the delayed review. I was on holiday. I think, your general approach is suitable, but there are unfortunately a few details that I am not yet happy with.

MatthiasZepper · 2024-09-20T11:50:21Z

nf_core/pipelines/download.py

@@ -1086,12 +1086,13 @@ def get_singularity_images(self, current_revision: str = "") -> None:

                # Organise containers based on what we need to do with them
                containers_exist: List[str] = []
-                containers_cache: List[Tuple[str, str, Optional[str]]] = []
+                containers_cache: List[Tuple[str, str, str]] = []


True, the cache is never optional, so the type List[Tuple[str, str, Optional[str]]] can be simplified to List[Tuple[str, str, str]] for containers_cache.

My rationale was that I wanted an identical type for containers_cache, containers_download and containers_pull to somewhat standardize the functions consuming them. But with the new containers_library variable, the heterogeneity is anyway given, so I have no real objections against.

Ultimately, this comment is just to explain why I initially had done it differently.

MatthiasZepper · 2024-09-20T11:55:06Z

nf_core/pipelines/download.py

-        return (out_path, cache_path)
+        library_path = None
+        if os.environ.get("NXF_SINGULARITY_LIBRARYDIR"):
+            library_path = os.path.join(os.environ["NXF_SINGULARITY_LIBRARYDIR"], out_name)


That will not work. The registries are stripped from the out_name and without symlinks, I doubt that there will be a single appropriately named container in the NXF_SINGULARITY_LIBRARYDIR.

You will have to use the name prior to trimming the registries.

MatthiasZepper · 2024-09-20T13:21:44Z

nf_core/pipelines/download.py

+        """Copy Singularity image from NXF_SINGULARITY_LIBRARYDIR to target folder, and possibly NXF_SINGULARITY_CACHEDIR."""
+        self.singularity_copy_image(container, library_path, out_path)
+        if cache_path:
+            self.singularity_copy_image(container, library_path, cache_path)


I have not used the $NXF_SINGULARITY_LIBRARYDIR so far, so I struggle to conceptualize corresponding setups. Intuitively, however, I question that copying the image to the cache is desirable? That will probably lead to a lot of data duplication?

Also mind that in the case of self.container_cache_utilisation == "amend", the cache_path is assigned to the out_path in singularity_image_filenames() function. So if you retain this copy step, you should at least not copy the same image twice to cache, but making the self.singularity_copy_image(container, library_path, cache_path) conditional depending on the chosen cache utilisation.

MatthiasZepper · 2024-09-20T14:05:57Z

nf_core/pipelines/download.py

+        """Copy Singularity image between folders. This function is used seamlessly
+        across the target directory, NXF_SINGULARITY_CACHEDIR, and NXF_SINGULARITY_LIBRARYDIR."""
+        log.debug(f"Copying {container} to cache: '{os.path.basename(from_path)}'")
+        shutil.copyfile(from_path, to_path)


Sorry, but this function misses important functionality. If you factor out the copying process, you need to consider that copies may be interrupted by exceptions or by the user using SIGINT (CTRL+C).

Previously, in case of incomplete/corrupted downloads, the local files were deleted by the except and finally branches within the singularity_download_image() function. (Lines 1356-1369 on dev). Additionally, it did not matter too much, since the copy happened from the cache to the output_path, typically a folder the user would delete in case of download failures.

But now that you changed the logic, it seems more likely to me that a user could amass corrupted images in the cache folder persistently. Therefore, it is important to ensure that the copy process removes partially copied files etc. in case of exceptions or SIGINT, e.g. with a cleanup_temp_files() function.

Something along this line:

import signal import sys # Example def cleanup_temp_files(): if os.path.exists('temp_file'): os.remove('temp_file') # Define a signal handler for SIGINT (CTRL+C) def abort_download(sig, frame): cleanup_temp_files() raise DownloadError("Aborting pipeline download due to user interruption.") signal.signal(signal.SIGINT, abort_download) # Example of using try-finally for cleanup in case of exceptions try: # File copy code here # ... except Exception as e: cleanup_temp_files() raise DownloadError(e) from e finally: cleanup_temp_files()

MatthiasZepper · 2024-09-20T14:06:49Z

nf_core/pipelines/download.py

+        log.debug(f"Copying {container} to cache: '{os.path.basename(from_path)}'")
+        shutil.copyfile(from_path, to_path)
+        # Create symlinks to ensure that the images are found even with different registries being used.
+        self.symlink_singularity_images(to_path)


Mind that the cleanup function should also cover the symlinks in the cache.

MatthiasZepper · 2024-09-20T14:08:29Z

nf_core/pipelines/download.py

@@ -1361,8 +1389,8 @@ def singularity_download_image(
            log.debug(f"Deleting incompleted singularity image download:\n'{output_path_tmp}'")
            if output_path_tmp and os.path.exists(output_path_tmp):


This is the cleanup function I mentioned. Now, it is not sufficient anymore, because the cache is not considered.

MatthiasZepper · 2024-09-20T14:18:15Z

nf_core/pipelines/download.py

-                progress.update(task, description="Copying from cache to target directory")
-                shutil.copyfile(cache_path, out_path)
+                progress.update(task, description="Copying from target directory to cache")
+                self.singularity_copy_image(container, out_path, cache_path)


You need to check now, if the cache_path actually exists on the file system:
os.path.exists(cache_path).

Previously, the get_singularity_images() function ensured, that an image was actually present in cache before adding it to the list of images that are already cached.

if cache_path and os.path.exists(cache_path): containers_cache.append((container, out_path, cache_path))

Hence, it was safe to copy it without further checks. Now, the cache_path is just created from the environment variable without further checks if the defined directory actually exists in the file system and is writable. Both needs to be done either here just for the cache path or preferably inside the generic copy function - better safe than sorry.

ewels · 2024-11-21T09:52:16Z

@muffato - do you think you'll have time to take a look at this PR in the near(ish) future? It'd be great to get it moving and merged if possible..

muffato · 2024-11-26T11:36:08Z

Hi @ewels . Unfortunately there's too much work left to do on this PR. I can't see myself getting to the bottom of it any time soon.

ewels · 2024-11-28T08:29:44Z

No problem - hopefully @mirpedrol can take it over then when she has time, if that's ok 🙏🏻

MatthiasZepper · 2024-11-28T11:36:37Z

No problem - hopefully @mirpedrol can take it over then when she has time, if that's ok 🙏🏻

Frankly, I consider this feature expendable. It is nice to save some download bandwidth and storage space, but the poor Seqera Containers support is haunting me much more (and also wasting significant storage).

Considering that Júlia has the most extensive knowledge of the nf-core tools codebase, I would rather suggest / entreat / beg her for working on tooling around the container.yml creation, linting and release logic. As soon as each new pipeline release ships with it, we can adopt it on the download side as well.

muffato added 2 commits September 7, 2024 13:10

Inverted the data movements: downloads happen in the output directory…

4f7a3c6

… and are then copied to the cache if needed This allows getting rid of the confusing `output_path` variable.

Create symlinks in the cache *and* output directory

78650af

muffato added the download nf-core download label Sep 7, 2024

muffato self-assigned this Sep 7, 2024

muffato added the WIP Work in progress label Sep 7, 2024

This was linked to issues Sep 7, 2024

Support NXF_SINGULARITY_LIBRARYDIR in nf-core download #3019

Open

Singularity container symlinks not consistently created #3162

Closed

muffato force-pushed the container_cache_download branch 4 times, most recently from 593ffef to 6293dd0 Compare September 8, 2024 22:38

muffato added 3 commits September 9, 2024 16:08

cache_path is actually always defined

113430d

Factored out a function to copy a container

ac771d8

Copy from the library directory if set

6293dd0

muffato requested a review from MatthiasZepper September 10, 2024 21:02

MatthiasZepper added this to the 3.0 milestone Sep 19, 2024

MatthiasZepper requested changes Sep 20, 2024

View reviewed changes

ewels modified the milestones: 3.0, 3.1 Sep 26, 2024

ewels modified the milestones: 3.1, 3.2 Nov 21, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Container cache download #3163

Container cache download #3163

muffato commented Sep 7, 2024 •

edited

Loading

codecov bot commented Sep 7, 2024

muffato commented Sep 10, 2024

MatthiasZepper left a comment

MatthiasZepper Sep 20, 2024

MatthiasZepper Sep 20, 2024

MatthiasZepper Sep 20, 2024

MatthiasZepper Sep 20, 2024

MatthiasZepper Sep 20, 2024

MatthiasZepper Sep 20, 2024

MatthiasZepper Sep 20, 2024

ewels commented Nov 21, 2024

muffato commented Nov 26, 2024

ewels commented Nov 28, 2024

MatthiasZepper commented Nov 28, 2024

		@@ -1361,8 +1389,8 @@ def singularity_download_image(
		log.debug(f"Deleting incompleted singularity image download:\n'{output_path_tmp}'")
		if output_path_tmp and os.path.exists(output_path_tmp):

Container cache download #3163

Are you sure you want to change the base?

Container cache download #3163

Conversation

muffato commented Sep 7, 2024 • edited Loading

PR checklist

codecov bot commented Sep 7, 2024

Codecov Report

muffato commented Sep 10, 2024

MatthiasZepper left a comment

Choose a reason for hiding this comment

MatthiasZepper Sep 20, 2024

Choose a reason for hiding this comment

MatthiasZepper Sep 20, 2024

Choose a reason for hiding this comment

MatthiasZepper Sep 20, 2024

Choose a reason for hiding this comment

MatthiasZepper Sep 20, 2024

Choose a reason for hiding this comment

MatthiasZepper Sep 20, 2024

Choose a reason for hiding this comment

MatthiasZepper Sep 20, 2024

Choose a reason for hiding this comment

MatthiasZepper Sep 20, 2024

Choose a reason for hiding this comment

ewels commented Nov 21, 2024

muffato commented Nov 26, 2024

ewels commented Nov 28, 2024

MatthiasZepper commented Nov 28, 2024

muffato commented Sep 7, 2024 •

edited

Loading