-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make sure we are handling CRAM files correctly #111
Comments
@reichan1998 As discussed in the last meeting, I have moved some discussion here on how we are handling CRAM files in the pipeline |
Thank you for moving this discussion over @tkchafin. The current approach seems solid, especially with the focus on avoiding embedded references and using local paths or cached data. I agree that ensuring Samtools doesn’t attempt to access reference data from external servers is critical for maintaining efficiency and reliability. Regarding the next steps, I’ll double-check that no operations are falling back to server lookups. If there’s a way to explicitly disable this, we should definitely implement it. I’ll also investigate exactly what steps Samtools takes in the case of |
cram_tests.tar.gz |
Thank you very much, Tyler! I will take a look at the script and run some tests on my end as well. I will get back to you with any findings or if I have any questions. |
Hi @reichan1998 just wanted to confirm, you determined that no changes are needed here right? |
Yes I believe our approach to handling CRAM files is fine now, as no operations are likely to fall back to server lookups. Since the operations we're performing likely do not benefit from using the cache, it might be better to pass the reference directly using the |
Excellent work, thanks for that (and for catching my mistake with mpileup!) |
You mentioned that you have tried turning off the remote lookup behaviour by spoofing REF_PATH? I wonder if we can set this somehow in |
It's not your mistake at all, as |
Description of feature
For any steps requiring a reference, we should be passing it or using REF_CACHE.
---- Some notes from JIRA ticket TOLIT-1916 on this topic ----
For using CRAM with Samtools there are a number of options:
embed_ref
) -- with also a few "sub-options" therein on how much information is embedded-T/--reference
)REF_CACHE
Following some conversation over in Jira (TOLIT-1916), we decided to forgo option 1 as it means storing redundant reference information across files (in part defeating the purpose of using CRAM over BAM), and that we also do not like option 4 because (1) it can take a very long time and (2) server reliability issues.
So, we did some benchmarking of simple commands (
sort
,view
, etc) and we found that for many operations, option 2 and 3 have nearly identical runtimes, with exception of some cases which presumably involve a LOT of reference lookups:"I have done some tests comparing several samtools operations under the cases of a pre-computed cached reference vs. supplying a reference at runtime. For most common/simple operations (e.g., sort, view), it didn't really seem to matter... However for mpileup (which presumably involved a lot of repeated lookups against the reference?), it can make a MASSIVE difference (>100X faster to use a cache). I'm not certain why, but it seems that for most of samtools use cases we can likely get away with just passing the reference and ignoring the cache.
—
For the initial questions:
"Do we really need to pre-compute the .dict file and pass it ? Why is samtools not doing it by itself ?" – No need to precompute
"Do we need to run seq_cache_populate.pl ?" – If we want to use the cached/pre-processed reference option then yes, we could wrap this script in a nextflow module (would be a good candidate to add to nf-core/samtools if not there already), or check a given path (passed in config or taken from $REF_PATH/$REF_CACHE) to see if this has already been done
"What REF_PATH do we need to set ? Note: in my experience, the automatic pulling from ENA is unreliable: either slow or down" – IMO we should mostly just ignore the remote download option and either use the REF_CACHE or pass the genome explicitly with -T/-reference. Samtools will also (I think) try to look up the genome if a local path is present in the header before falling back to a remote lookup, but we can't use Nextflow's file staging if we only pass the reference this way so it is better to pass it explicitly. For most of the tests I did, setting the local cache as REF_CACHE was equally efficient to passing the genome with -T/ -reference
—
In terms of "best practice" for using CRAM, I think the only reasonable options are to use the cache with REF_PATH/CACHE or pass the reference alongside each call on a CRAM file (for operations needing the reference). Whether or not these two options make a difference depends on the exact operation, which simple operations at least seeming not to be affected.
As for the alternatives, I think allowing samtools to check the server is a bad option, and we also don't want to use the embed_ref options because this results in (1) redundant disc usage across files and (2) doesn't have any useful support in ENA for upload purposes anyways.
So, most cases we probably can pass the reference and not bother with cache, but for instances where this is actually faster, we can build the cache in a module (if genome not already cached; otherwise just grab its path) and pass that (like is done [here|https://github.com/epi2me-labs/wf-basecalling/blob/master/main.nf]) "
--
SO, remaining things to do/ questions are:
REF_CACHE
vs. passing directly? It would be good to get a better idea of what steps exactly Samtools is doing when you pass a reference (with-T/--reference
) in the case ofsort
(same runtime asREF_CACHE
) vs.mpileup
(massively worse runtime).Given the operations done here likely have no advantage of implementing the cache, I suspect that we will be fine to pass the reference directly with
-T/--reference
, but it might be worth repeating the benchmarking tests on a larger genome (and also digging a bit deeper into what exactly Samtools is doing). If that ends up being the case, we already havech_fasta
so will just need to check we are passing this where neededThe text was updated successfully, but these errors were encountered: