Better to have multiple smaller Docker images or one larger Docker image? #138
Comments
I don't think this makes much of a difference unless the Docker images become really large once you include all the tools. Even then, it shouldn't be a big deal (Docker images are pulled on each EC2 instance that's brought up).
The more interesting point to consider is how you organize your execs. If you have an exec which produces a large amount of output and another exec which processes it, it would be worthwhile to merge them into the same exec. Also, if there are multiple execs that process the same input, it might make sense to do that work in one step. For example, let's say you are doing ML and generating a bunch of features for each input; if you have to process all the input to generate each feature, then instead of having each feature be computed in its own exec, it may be better to compute all the features in a single exec.
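As an uncompiled sketch of that last point (the image and the compute_feature_* tools are hypothetical), an exec can declare more than one output, so both features can come out of a single exec:

```
// One exec, two outputs: a single container is scheduled and the input is
// localized once, instead of once per feature.
func features(input file) =
    exec(image := "hypothetical/feature-tools", mem := 4*GiB, cpu := 2) (featA file, featB file) {"
        compute_feature_a {{input}} > {{featA}}
        compute_feature_b {{input}} > {{featB}}
    "}

// Usage: val (a, b) = features(myInput)
```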
We could perhaps provide better guidance if you share a simplified version of what your pipeline looks like (e.g., share a sample reflow program after stripping out potentially sensitive steps).
This is super helpful; thanks!! This is a viral amplicon sequencing analysis pipeline, so the files are quite small per sample (we just have a ton of samples we want to process simultaneously). Here's an example of how the pipeline used to look for one sample:
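Schematically (uncompiled, with hypothetical images, commands, and resource hints rather than the actual program), one exec per step looks something like this:

```
// Each step runs in its own minimal image as its own exec.
func align(reads file, ref file) =
    exec(image := "hypothetical/aligner", mem := 4*GiB, cpu := 2) (sam file) {"
        aligner -a {{ref}} {{reads}} > {{sam}}
    "}

func trim(sam file) =
    exec(image := "hypothetical/trimmer", mem := 2*GiB) (bam file) {"
        trim_primers -i {{sam}} -o {{bam}}
    "}

func sortBam(bam file) =
    exec(image := "hypothetical/samtools", mem := 2*GiB) (out file) {"
        samtools sort -o {{out}} {{bam}}
    "}

func consensus(bam file) =
    exec(image := "hypothetical/consensus", mem := 2*GiB) (fa file) {"
        call_consensus {{bam}} > {{fa}}
    "}

// One sample = four small execs chained together.
func sample(reads file, ref file) = consensus(sortBam(trim(align(reads, ref))))
```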
This took ~8 minutes to run on 1, 10, and 100 simultaneous samples, but when I hit 1K samples, it got bogged down and took about 1.5 hours. I've now merged everything into a single exec.
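For comparison, a merged version of the same sketch, where all the tools live in one image and the whole chain runs in one exec (again hypothetical names, uncompiled):

```
// The whole per-sample chain in a single exec: one container to schedule per
// sample, and only the final consensus leaves the exec.
func sample(reads file, ref file) =
    exec(image := "hypothetical/all-tools", mem := 4*GiB, cpu := 2) (fa file) {"
        aligner -a {{ref}} {{reads}} > aligned.sam
        trim_primers -i aligned.sam -o trimmed.bam
        samtools sort -o sorted.bam trimmed.bam
        call_consensus sorted.bam > {{fa}}
    "}
```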
I'm going to try running larger numbers of samples to see how things scale.
A random list of suggestions (note, uncompiled code, so you might have to fix syntax errors, if any):
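For example, something along these lines (uncompiled; the bucket paths, image, and process() body are placeholders, and the stdlib calls are from memory) processes every sample inside one reflow program and collects the results into a single output directory keyed by sample name:

```
val dirs = make("$/dirs")

// Hypothetical sample names; these could instead come from a param or from
// listing an input directory.
val samples = ["sampleA", "sampleB"]

// Hypothetical per-sample processing; in practice this would be the merged
// exec (or the chain of execs) discussed above.
func process(name string) = {
    val reads = file("s3://my-bucket/input/" + name + ".fastq.gz")
    exec(image := "hypothetical/amplicon-pipeline", mem := 4*GiB) (out file) {"
        run_pipeline {{reads}} > {{out}}
    "}
}

// One directory with each sample's result under its own base path.
// map() turns the [(key, value)] list into the map that dirs.Make expects.
val output = dirs.Make(map([(name + "/consensus.fa", process(name)) | name <- samples]))

// Copy everything to S3 in one step.
val Main = dirs.Copy(output, "s3://my-bucket/output/")
```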
That way, you avoid having to do the work over again for samples that have already been processed: with caching enabled, their cached results are reused and only new samples are actually computed.
The above will produce a directory "output" with all the output files under each sample's base path. If you feel that you want to recompute everything anyway (say you made a mistake and want to regenerate for old samples as well), you can run reflow with caching turned off for that run.
Similarly, a step that writes an intermediate "trimmed.bam" to disk and then sorts it in a separate step can be written as a single piped command:
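A minimal sketch of the piped form (hypothetical trimmer and image; this assumes the trimmer can write to stdout):

```
// Trim and sort in one exec; the unsorted BAM flows through a pipe instead of disk.
func trimAndSort(bam file) =
    exec(image := "hypothetical/trim-plus-samtools", mem := 4*GiB) (out file) {"
        trim_primers -i {{bam}} -o /dev/stdout | samtools sort -o {{out}} -
    "}
```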
This avoids having to delete the intermediate "trimmed.bam" files.
Also, a general comment: if you haven't already done so, you can enable caching (see the README section "Next, we'll set up a cache.").
Thank you, this was incredibly helpful! I do indeed have caching enabled, but I think the primary issue with the workflows I was running that caused the slowdown with separate execs was simply the number of small steps being scheduled per sample.

Regarding avoiding writing files, I am considering adding optional flags to avoid writing intermediate files in the future, but some users (including us) want to hold onto these intermediate files for debugging/investigative purposes. Thanks for the tip, though! Once we have things stabilized, I'll start looking for places where I can pipe to avoid writing to disk.

EDIT: Ah, so the deletion at the end is for the unsorted trimmed BAMs, but the issue is that the tool I'm using writes to disk and doesn't (seem to) have an option for outputting to stdout instead (to then pipe into samtools sort). I could get around this using named pipes, though, so perhaps I'll consider doing that instead in the future 😄
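That named-pipe workaround could look roughly like this inside an exec (hypothetical trimmer and image; it assumes the trimmer writes its output BAM sequentially, since a FIFO can't be seeked):

```
// The FIFO stands in for trimmed.bam, so the unsorted output never lands on disk.
func trimAndSortFifo(bam file) =
    exec(image := "hypothetical/trim-plus-samtools", mem := 4*GiB) (out file) {"
        mkfifo trimmed.bam
        trim_primers -i {{bam}} -o trimmed.bam &
        samtools sort -o {{out}} trimmed.bam
        rm trimmed.bam
    "}
```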
I am creating a workflow that has many steps that use different tools, and I'm creating minimal Docker images for each of the tools. I was wondering which of the following approaches would be better for Reflow's scalability/performance/etc.:

1. Multiple smaller Docker images, one per tool/step, or
2. One larger Docker image that contains all of the tools.
Any guidance would be greatly appreciated (and ideally any information as to why exactly one may be better than the other, so I can get a better understanding of how Reflow works)
EDIT: And note that, for a single sample, each individual step of the workflow is actually quite fast (e.g. minutes of runtime). The scalability issue we're facing is that we have thousands of samples being run in parallel (so we have a lot of small pipeline step executions)