🚧 WIP: Batch run #125

Open
wants to merge 3 commits into
base: master

Conversation

itzamna314
Collaborator

This is the beginning of a container that can run the serratus pipeline end-to-end. I'm not sure what settings I need for bowtie2 though, so I haven't been able to get past those runs.

I'm also not sure if it's appropriate to write to all 3 pipes and then try both flavors of bowtie (paired and unpaired), or if we need to figure out which scenario we're in and run only the one bowtie process.

I think the input to the container is good though, so hopefully we're on the right track.

@itzamna314 itzamna314 requested review from ababaian and brietaylor May 24, 2020 23:09
@rcedgar
Collaborator

rcedgar commented May 24, 2020

I believe we can and should simplify by always using the unpaired mode of bowtie2 and not using the --split-files option of fastq-dump. That way, the same command line should work for any SRA dataset AFAIK, and we only need one pipe, with no need for named pipes. Artem can correct me if I'm wrong here.

@rcedgar
Collaborator

rcedgar commented May 24, 2020

I believe the only option we need for bowtie2 is --very-sensitive-local with /dev/stdin for unpaired fastq input.

@rcedgar
Collaborator

rcedgar commented May 24, 2020

I would suggest the following simplification & optimization. Combine the bowtie2, prefetch, fastq-dump and samtools binary files, summarizer.py and the bowtie2 index files into one tarball on S3. When the container starts, install aws cli and python3 base only. Then copy the tarball and decompress it. At that point the container is ready to do

prefetch SRA12345

fastq-dump SRA12345 | bowtie2 | summarizer.py | samtools > output.bam # single pipe

aws s3 cp output.bam s3://serratus-public/out/...
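
The middle step above could be sketched as a shell function along these lines. This is only a rough sketch under stated assumptions: the function name `align_accession` and the `FASTQ_DUMP`, `BOWTIE2`, `SUMMARIZER` and `SAMTOOLS` override variables are made up here for illustration, and because summarizer.py's arguments are not given in this thread, the summarizer stage defaults to a plain `cat` pass-through.

```shell
# Sketch of the single-pipe step. Each tool can be swapped through an
# environment variable (an assumption of this sketch, handy for dry runs).

# Enable pipefail where the shell supports it, so a failure in any stage
# fails the whole pipe.
(set -o pipefail) 2>/dev/null && set -o pipefail

align_accession() {
    local accession="$1" index="$2" out_bam="$3"

    # fastq-dump -Z writes FASTQ to stdout. Without --split-files the
    # output is a single stream, which bowtie2 reads as unpaired (-U).
    # samtools view -b converts the SAM stream on stdin ("-") to BAM.
    "${FASTQ_DUMP:-fastq-dump}" -Z "$accession" \
        | "${BOWTIE2:-bowtie2}" -x "$index" --very-sensitive-local -U /dev/stdin \
        | "${SUMMARIZER:-cat}" \
        | "${SAMTOOLS:-samtools}" view -b - > "$out_bam"
}
```

A run would then look something like `prefetch SRA12345 && align_accession SRA12345 INDEXNAME output.bam`, followed by the `aws s3 cp` upload.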

@itzamna314
Collaborator Author

> I would suggest the following simplification & optimization. Combine the bowtie2, prefetch, fastq-dump and samtools binary files, summarizer.py and the bowtie2 index files into one tarball on S3. When the container starts, install aws cli and python3 base only. Then copy the tarball and decompress it. At that point the container is ready to do

That actually adds a lot of complexity. Using Docker, we simply build an image with all of those executables installed. Then when we create a container, they're ready to go instantly.

I'll see if I can guess the right parameters for bowtie2. I've never used it before, though, and I have no background in biology. I know how to get the executables where they need to be, but not so much what they do or how to run them.

@rcedgar
Collaborator

rcedgar commented May 24, 2020

I have no background in Docker, so my bad on that -- I'm trying to learn but am struggling so far.

I think this command-line for bowtie2 should work with unpaired FASTQ from a pipe, sending SAM output to a pipe:

bowtie2 -x INDEXNAME --very-sensitive-local -U /dev/stdin

Contact me by email [email protected] or the serratus-bioinformatics slack channel if you need help with the informatics pipe.

@itzamna314
Collaborator Author

All good, happy to help clear 🐳 stuff up 👍

Where can I find the value for INDEXNAME in this scenario? In the full serratus pipeline it comes from a JOB_JSON file.

I think that's the piece I'm missing to get this running. I'll ping you over on the serratus slack 👍. Thanks!

@ababaian
Owner

You'll need a genome/sequence file and an index of that genome to run bowtie. In essence it takes short little bits of DNA and tries to place them in a big piece of DNA. Kind of like a fuzzy regex.

Genome + Bowtie2 Index Files : aws s3 sync s3://serratus-public/seq/cov3a/ ./

As long as those files are in the directory you run bowtie2 from, you can pass -x cov3a (or whatever the prefix of the .bt2 files is).
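
If it helps, the prefix can also be read off the synced files instead of being hard-coded. The helper below is a hypothetical sketch (the name `bt2_index_prefix` is made up here), and it assumes a small-format index whose files end in `.1.bt2`, `.rev.1.bt2` and so on; large indexes use a `.bt2l` extension and are not handled.

```shell
# Print the bowtie2 index prefix found in a directory (default: current one).
# Returns non-zero if no forward index file (<prefix>.1.bt2) is present.
bt2_index_prefix() {
    local dir="${1:-.}" f name
    for f in "$dir"/*.1.bt2; do
        [ -e "$f" ] || continue          # the glob matched nothing
        case "$f" in
            *.rev.1.bt2) continue ;;     # skip the reverse-index file
        esac
        name="${f##*/}"                  # drop the directory part
        printf '%s\n' "${name%.1.bt2}"   # drop the .1.bt2 suffix
        return 0
    done
    return 1
}
```

After the `aws s3 sync` above, `bowtie2 -x "$(bt2_index_prefix .)" --very-sensitive-local -U /dev/stdin` should then pick up the synced index.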

@itzamna314 itzamna314 force-pushed the run-batch branch 2 times, most recently from a448262 to 6bec9d0 Compare May 25, 2020 01:14
@itzamna314 itzamna314 marked this pull request as ready for review May 26, 2020 16:21
@brietaylor brietaylor removed their request for review August 31, 2022 16:07
3 participants