Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: get ensembl-reference wrapper to download more than one chromosome #3432

Open
wants to merge 6 commits into
base: master
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
30 changes: 21 additions & 9 deletions bio/reference/ensembl-sequence/wrapper.py
Original file line number Diff line number Diff line change
Expand Up @@ -47,33 +47,45 @@
if chromosome:
if not datatype == "dna":
raise ValueError(
"invalid datatype, to select a single chromosome the datatype must be dna"
"Invalid datatype. To select individual chromosomes, the datatype must be dna."
)

url = snakemake.params.get("url", "ftp://ftp.ensembl.org/pub")
spec = spec.format(build=build, release=release)
url_prefix = f"{url}/{branch}release-{release}/fasta/{species}/{datatype}/{species.capitalize()}.{spec}"

success = False
for suffix in suffixes:
success = False
Copy link
Collaborator

@fgvieira fgvieira Nov 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be outside the loop to check if at least one suffix was successful? This way it will only check the last suffix, no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it is outside of the loop, and we are requesting multiple chromosomes, this will turn true on any working chromosome, and then stay that way. So we will not get any debugging output and error thrown, in case any of the chromosomes is not available. So for the chromosomes case, we should reset this for every suffix in suffixes. For the other case, checking whether "dna.primary_assembly.fa.gz" or "dna.toplevel.fa.gz" is available, it will break out of the suffix in suffixes loop right after setting success = True and will otherwise be left with success = False after the last suffix that runs into the except:.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But this way it only checks if the last chromosome was available, no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, you are right. Very good catch. Let me think about what the best solution is...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe moving the error checking directly to the try/except?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Track download success for each chromosome separately.

The current implementation resets the success flag for each suffix, which could mask failures of individual chromosome downloads. Consider tracking success per chromosome:

-success = False
+successes = set()  # Track successful downloads

Then update the success tracking after download:

-success = True
+successes.add(suffix)  # Record successful download

And modify the final check:

-if not success:
+if not successes:

This change will help identify which specific chromosomes failed to download.

Committable suggestion skipped: line range outside the PR's diff.

url = f"{url_prefix}.{suffix}"

try:
shell("curl -sSf {url} > /dev/null 2> /dev/null")
except sp.CalledProcessError:
continue
if chromosome:
print(
f"Unable to download the requested chromosome sequence from Ensembl at: {url_prefix}.{suffix}.",
file=sys.stderr,
)
break
else:
Comment on lines +64 to +70
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Critical: Remove break statement and use proper logging.

The current implementation has two issues:

  1. The break statement prevents downloading multiple chromosomes
  2. Direct print statements should use Snakemake's logging mechanism

Apply these changes:

 if chromosome:
-    print(
-        f"Unable to download the requested chromosome sequence from Ensembl at: {url_prefix}.{suffix}.",
-        file=sys.stderr,
-    )
-    break
+    shell.logger.error(
+        f"Unable to download chromosome sequence from: {url_prefix}.{suffix}"
+    )
+    continue
 else:
     continue
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if chromosome:
print(
f"Unable to download the requested chromosome sequence from Ensembl at: {url_prefix}.{suffix}.",
file=sys.stderr,
)
break
else:
if chromosome:
shell.logger.error(
f"Unable to download chromosome sequence from: {url_prefix}.{suffix}"
)
continue
else:

continue

shell("(curl -L {url} | gzip -d >> {snakemake.output[0]}) {log}")
success = True
break
if not chromosome:
break

if not success:
if len(suffixes) > 1:
url = f"{url_prefix}.[{'|'.join(suffixes)}]"
else:
url = f"{url_prefix}.{suffixes[0]}"
if not chromosome:
if len(suffixes) > 1:
url = f"{url_prefix}.[{'|'.join(suffixes)}]"
else:
url = f"{url_prefix}.{suffixes[0]}"
print(
f"Unable to download the requested reference sequence data from Ensembl at: {url}.",
file=sys.stderr,
)
Comment on lines +79 to +87
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Improve error reporting with Snakemake logging.

The error reporting can be improved by:

  1. Using Snakemake's logging mechanism
  2. Providing more detailed feedback about failed chromosomes
-if not success:
+if not successes:
     if not chromosome:
         if len(suffixes) > 1:
             url = f"{url_prefix}.[{'|'.join(suffixes)}]"
         else:
             url = f"{url_prefix}.{suffixes[0]}"
-        print(
+        shell.logger.error(
             f"Unable to download the requested reference sequence data from Ensembl at: {url}.",
-            file=sys.stderr,
         )
+else:
+    failed = set(suffixes) - successes
+    if failed:
+        shell.logger.error(
+            f"Failed to download the following chromosomes: {', '.join(failed)}"
+        )

-print(
+shell.logger.error(
     "Please check whether above URL is currently available (might be a temporal server issue). "
     "Apart from that, did you check that this combination of species, build, and release is actually provided?",
-    file=sys.stderr,
)

Committable suggestion skipped: line range outside the PR's diff.

print(
f"Unable to download requested sequence data from Ensembl ({url}). "
"Please check whether above URL is currently available (might be a temporal server issue). "
"Apart from that, did you check that this combination of species, build, and release is actually provided?",
file=sys.stderr,
Expand Down
Loading