Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Container cache download #3163

Draft
wants to merge 5 commits into
base: dev
Choose a base branch
from
Draft

Container cache download #3163

wants to merge 5 commits into from

Conversation

muffato
Copy link
Member

@muffato muffato commented Sep 7, 2024

Fixes #3019 and #3162.

First #3162. The code has to deal with an out_path (always defined) and sometimes a cache_path too. The implementation was slightly convoluted as it was doing fresh downloads into the cache_path or out_path (first decision point) and then doing an extra copy if needed (second decision point). Because of that confusion, symlinks across container registries were not all created across both locations.
I propose to reverse the logic to make it more straightforward:

  1. Always download into out_path and create its symlinks.
  2. Then, optionally copy to cache_path (and create symlinks there).

Then, #3019: I propose to handle the singularity "library" directory this way:

  1. Just like in Nextflow itself, it's considered a read-only location. It means that containers can only be copied from it, not to it, and that we shouldn't be even creating symlinks there.
  2. There is no point in having a --container-library-utilisation parameter for the library because i) remote would be redundant with the --container-cache-utilisation's remote mode, ii) amend is not possible as per the read-only rule, so iii) copy is the only possible mode.
  3. Therefore, the most natural place to use the library is as a source of containers, in parallel of https downloads and singularity pull. When NXF_SINGULARITY_LIBRARYDIR is set and the container exists in the library, it is copied to the target directories (out_path and possibly cache_path too)

PR checklist

  • This comment contains a description of changes (with reason)
  • CHANGELOG.md is updated
  • Unit-tests
  • Documentation in README.md

… and are then copied to the cache if needed

This allows getting rid of the confusing `output_path` variable.
@muffato muffato added the download nf-core download label Sep 7, 2024
@muffato muffato self-assigned this Sep 7, 2024
@muffato muffato added the WIP Work in progress label Sep 7, 2024
Copy link

codecov bot commented Sep 7, 2024

Codecov Report

Attention: Patch coverage is 0% with 16 lines in your changes missing coverage. Please review.

Project coverage is 75.52%. Comparing base (8e47a33) to head (78650af).

Files with missing lines Patch % Lines
nf_core/pipelines/download.py 0.00% 16 Missing ⚠️
Additional details and impacted files

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@muffato
Copy link
Member Author

muffato commented Sep 10, 2024

The code is there, but I still need to add unit-tests and documentation

@MatthiasZepper MatthiasZepper added this to the 3.0 milestone Sep 19, 2024
Copy link
Member

@MatthiasZepper MatthiasZepper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for your contribution, and sorry for the delayed review. I was on holiday. I think, your general approach is suitable, but there are unfortunately a few details that I am not yet happy with.

@@ -1086,12 +1086,13 @@ def get_singularity_images(self, current_revision: str = "") -> None:

# Organise containers based on what we need to do with them
containers_exist: List[str] = []
containers_cache: List[Tuple[str, str, Optional[str]]] = []
containers_cache: List[Tuple[str, str, str]] = []
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, the cache is never optional, so the type List[Tuple[str, str, Optional[str]]] can be simplified to List[Tuple[str, str, str]] for containers_cache.

My rationale was that I wanted an identical type for containers_cache, containers_download and containers_pull to somewhat standardize the functions consuming them. But with the new containers_library variable, the heterogeneity is anyway given, so I have no real objections against.

Ultimately, this comment is just to explain why I initially had done it differently.

return (out_path, cache_path)
library_path = None
if os.environ.get("NXF_SINGULARITY_LIBRARYDIR"):
library_path = os.path.join(os.environ["NXF_SINGULARITY_LIBRARYDIR"], out_name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That will not work. The registries are stripped from the out_name and without symlinks, I doubt that there will be a single appropriately named container in the NXF_SINGULARITY_LIBRARYDIR.

You will have to use the name prior to trimming the registries.

"""Copy Singularity image from NXF_SINGULARITY_LIBRARYDIR to target folder, and possibly NXF_SINGULARITY_CACHEDIR."""
self.singularity_copy_image(container, library_path, out_path)
if cache_path:
self.singularity_copy_image(container, library_path, cache_path)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have not used the $NXF_SINGULARITY_LIBRARYDIR so far, so I struggle to conceptualize corresponding setups. Intuitively, however, I question that copying the image to the cache is desirable? That will probably lead to a lot of data duplication?

Also mind that in the case of self.container_cache_utilisation == "amend", the cache_path is assigned to the out_path in singularity_image_filenames() function. So if you retain this copy step, you should at least not copy the same image twice to cache, but making the self.singularity_copy_image(container, library_path, cache_path) conditional depending on the chosen cache utilisation.

"""Copy Singularity image between folders. This function is used seamlessly
across the target directory, NXF_SINGULARITY_CACHEDIR, and NXF_SINGULARITY_LIBRARYDIR."""
log.debug(f"Copying {container} to cache: '{os.path.basename(from_path)}'")
shutil.copyfile(from_path, to_path)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, but this function misses important functionality. If you factor out the copying process, you need to consider that copies may be interrupted by exceptions or by the user using SIGINT (CTRL+C).

Previously, in case of incomplete/corrupted downloads, the local files were deleted by the except and finally branches within the singularity_download_image() function. (Lines 1356-1369 on dev). Additionally, it did not matter too much, since the copy happened from the cache to the output_path, typically a folder the user would delete in case of download failures.

But now that you changed the logic, it seems more likely to me that a user could amass corrupted images in the cache folder persistently. Therefore, it is important to ensure that the copy process removes partially copied files etc. in case of exceptions or SIGINT, e.g. with a cleanup_temp_files() function.

Something along this line:

import signal
import sys

# Example
def cleanup_temp_files():
    if os.path.exists('temp_file'):
        os.remove('temp_file')

# Define a signal handler for SIGINT (CTRL+C)
def abort_download(sig, frame):
    cleanup_temp_files()
    raise DownloadError("Aborting pipeline download due to user interruption.")

signal.signal(signal.SIGINT, abort_download)

# Example of using try-finally for cleanup in case of exceptions
try:
    # File copy code here
    # ...
except Exception as e:
    cleanup_temp_files()
      raise DownloadError(e) from e
finally:
    cleanup_temp_files()

log.debug(f"Copying {container} to cache: '{os.path.basename(from_path)}'")
shutil.copyfile(from_path, to_path)
# Create symlinks to ensure that the images are found even with different registries being used.
self.symlink_singularity_images(to_path)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mind that the cleanup function should also cover the symlinks in the cache.

@@ -1361,8 +1389,8 @@ def singularity_download_image(
log.debug(f"Deleting incompleted singularity image download:\n'{output_path_tmp}'")
if output_path_tmp and os.path.exists(output_path_tmp):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the cleanup function I mentioned. Now, it is not sufficient anymore, because the cache is not considered.

progress.update(task, description="Copying from cache to target directory")
shutil.copyfile(cache_path, out_path)
progress.update(task, description="Copying from target directory to cache")
self.singularity_copy_image(container, out_path, cache_path)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to check now, if the cache_path actually exists on the file system:
os.path.exists(cache_path).

Previously, the get_singularity_images() function ensured, that an image was actually present in cache before adding it to the list of images that are already cached.

  if cache_path and os.path.exists(cache_path):
      containers_cache.append((container, out_path, cache_path))

Hence, it was safe to copy it without further checks. Now, the cache_path is just created from the environment variable without further checks if the defined directory actually exists in the file system and is writable. Both needs to be done either here just for the cache path or preferably inside the generic copy function - better safe than sorry.

@ewels ewels modified the milestones: 3.0, 3.1 Sep 26, 2024
@ewels
Copy link
Member

ewels commented Nov 21, 2024

@muffato - do you think you'll have time to take a look at this PR in the near(ish) future? It'd be great to get it moving and merged if possible..

@ewels ewels modified the milestones: 3.1, 3.2 Nov 21, 2024
@muffato
Copy link
Member Author

muffato commented Nov 26, 2024

Hi @ewels . Unfortunately there's too much work left to do on this PR. I can't see myself getting to the bottom of it any time soon.

@ewels
Copy link
Member

ewels commented Nov 28, 2024

No problem - hopefully @mirpedrol can take it over then when she has time, if that's ok 🙏🏻

@MatthiasZepper
Copy link
Member

No problem - hopefully @mirpedrol can take it over then when she has time, if that's ok 🙏🏻

Frankly, I consider this feature expendable. It is nice to save some download bandwidth and storage space, but the poor Seqera Containers support is haunting me much more (and also wasting significant storage).

Considering that Júlia has the most extensive knowledge of the nf-core tools codebase, I would rather suggest / entreat / beg her for working on tooling around the container.yml creation, linting and release logic. As soon as each new pipeline release ships with it, we can adopt it on the download side as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
download nf-core download WIP Work in progress
Projects
None yet
3 participants