Losing header info from fastq files #40

Open
prmunn opened this issue Jul 26, 2024 · 5 comments

prmunn commented Jul 26, 2024

I'm running Pheniqs with paired-end fastq files as input and creating bam files as output. The first field in the bam file retains the info from the fastq header row (the instrument type, run id, flowcell id, etc.), but it drops the "member of a pair", "read filtered", "control bits on", and index sequence info.

e.g. if my fastq header contains "@LH00497:14:22KTM5LT3:8:1101:2036:1064 1:N:0:ACTTCTGC+CCGGGACT" then only the "LH00497:14:22KTM5LT3:8:1101:2036:1064" part is retained.

Is there a way to move the rest of the header (the "1:N:0:ACTTCTGC+CCGGGACT" part) to the bam file, say as a tag field?

@moonwatcher
Contributor

Hi @prmunn

Thank you for using Pheniqs :)

So... the fastq format only really defines the read ID as the part that comes after the @ sign on the first line of each 4-line record. Anything after the first space is considered a "comment" and can technically be anything, as long as the whole thing is less than 254 characters (I think; at least HTSLib assumes it to be). The ID itself (the part immediately after the @ and before any whitespace) is unique and also shared between segments in different fastq files, so it's kind of the glue that sticks all the segments together. The comment is more specific to a segment.
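To make the ID/comment split concrete, here is a minimal Python sketch. It is not Pheniqs code; the function name `split_fastq_header` is made up for illustration, and it only assumes the convention described above (ID up to the first whitespace, comment after it).

```python
from typing import Tuple


def split_fastq_header(line: str) -> Tuple[str, str]:
    """Split a FASTQ header line into (read_id, comment).

    Hypothetical helper for illustration only, not part of Pheniqs.
    """
    if not line.startswith("@"):
        raise ValueError("a FASTQ header line starts with '@'")
    # Everything up to the first run of whitespace is the read ID;
    # whatever follows is the free-form comment.
    parts = line[1:].rstrip("\r\n").split(None, 1)
    if not parts:
        raise ValueError("empty read ID")
    read_id = parts[0]
    comment = parts[1] if len(parts) > 1 else ""
    return read_id, comment


read_id, comment = split_fastq_header(
    "@LH00497:14:22KTM5LT3:8:1101:2036:1064 1:N:0:ACTTCTGC+CCGGGACT"
)
```

With the header from the original report, `read_id` is the part that survives into the bam file and `comment` is the part that gets dropped.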

Illumina has a specific syntax for the comment, which is not really part of the fastq format, but Pheniqs still parses it. You can see the specific code here:

pheniqs/fastq.h

Line 102 in d4bd514

case Platform::ILLUMINA: {

Some of this info does make it to the designated fields in a SAM file, like the pass-QC flag or the segment number. The rest is very Illumina-specific and not really used downstream by anything I know of.
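As a rough illustration of how the Illumina comment fields could map onto SAM flag bits, here is a small Python sketch. It is not the code in fastq.h; the names `IlluminaComment`, `parse_illumina_comment` and `unaligned_sam_flag` are hypothetical, and it assumes plain paired-end input where the read number is 1 or 2. The flag bits themselves (0x1 paired, 0x4 unmapped, 0x8 mate unmapped, 0x40 first segment, 0x80 last segment, 0x200 failed QC) come from the SAM specification.

```python
from dataclasses import dataclass


@dataclass
class IlluminaComment:
    read_number: int   # which member of the pair
    filtered: bool     # True when the read failed the chastity filter ("Y")
    control_bits: int  # 0 when no control bits are on
    index: str         # index sequence(s), e.g. "ACTTCTGC+CCGGGACT"


def parse_illumina_comment(comment: str) -> IlluminaComment:
    """Parse '<read>:<filtered>:<control>:<index>' (hypothetical helper)."""
    fields = comment.split(":", 3)
    if len(fields) != 4:
        raise ValueError("expected 4 colon-separated fields: %r" % comment)
    read_number, filtered, control, index = fields
    if filtered not in ("Y", "N"):
        raise ValueError("filter field must be Y or N: %r" % filtered)
    return IlluminaComment(int(read_number), filtered == "Y", int(control), index)


def unaligned_sam_flag(c: IlluminaComment) -> int:
    """SAM flag for an unaligned paired-end segment (assumes read 1 or 2)."""
    if c.read_number not in (1, 2):
        raise ValueError("this sketch only handles paired-end reads")
    flag = 0x1 | 0x4 | 0x8                        # paired, unmapped, mate unmapped
    flag |= 0x40 if c.read_number == 1 else 0x80  # first or last segment
    if c.filtered:
        flag |= 0x200                             # not passing quality controls
    return flag


parsed = parse_illumina_comment("1:N:0:ACTTCTGC+CCGGGACT")
flag = unaligned_sam_flag(parsed)
```

The control bits and the index sequence have no flag bit of their own, which is why they are the pieces that would need an auxiliary field.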

BUT if you really really really need it I can try and add a flag to move that to some field.

The trick is that Pheniqs is in the business of manipulating the topology of a read. It may get a read with 4 segments and produce a read with 2 segments, or any rearrangement you can think of, really. It's not always clear which metadata from which input segment ends up on which output segment. Defining this for the general case is much more involved than it sounds at first.

What did you have in mind? What exactly are you trying to achieve?

moonwatcher self-assigned this Aug 7, 2024

prmunn commented Aug 7, 2024

Hi @moonwatcher

What I'm ultimately trying to achieve is to move the barcode segments (I have four segments defined in a "cellular" template) to the header of a fastq file, similar to the way the "sample" template works. I noticed that when I use a "sample" template the comment section of the header is also retained.

What I've been doing up until now is creating a bam file as the output of Pheniqs, which gives me a CB tag that holds the "cellular" barcode segments, and then having an awk script convert this to a fastq file with the contents of the CB tag moved to the fastq header. I'm also able to use the sam flag to parse out the appropriate bit for the read number (for paired-end reads) and add that back into the header. This approach is working (minus the rest of the comment section), but it would be better (at least for me) if I could skip the conversion step and have Pheniqs create the fastq with the "cellular" segments moved to the header. Since the "sample" template already moves one segment to the header, maybe it could be modified to allow multiple templates for the "sample" section.
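The awk script itself is not shown in this thread. As a sketch of the same idea in Python, the following converts one SAM record into a FASTQ record with the CB tag placed in the header comment. The function names and the comment layout (`<read>:<filtered>:0:<barcode>`) are assumptions for illustration; in particular the control field is hardcoded to 0 because that information is not available in the SAM record.

```python
def sam_record_to_fastq(line: str) -> str:
    """Turn one SAM alignment line into a 4-line FASTQ record.

    Hypothetical helper: the CB tag value goes where the Illumina
    index sequence normally sits in the header comment.
    """
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM record has at least 11 mandatory fields")
    qname, flag, seq, qual = fields[0], int(fields[1]), fields[9], fields[10]

    # Optional fields look like TAG:TYPE:VALUE; the value may contain ':'.
    tags = {}
    for field in fields[11:]:
        tag, _type, value = field.split(":", 2)
        tags[tag] = value

    read_number = 2 if flag & 0x80 else 1   # 0x80 marks the last segment
    filtered = "Y" if flag & 0x200 else "N"  # 0x200 marks a QC failure
    barcode = tags.get("CB", "")
    # Control field is an assumption: always written as 0 here.
    return "@%s %d:%s:0:%s\n%s\n+\n%s\n" % (
        qname, read_number, filtered, barcode, seq, qual)


def convert(sam_lines, out) -> None:
    """Write FASTQ for every record, skipping '@' SAM header lines."""
    for line in sam_lines:
        if line.startswith("@"):
            continue
        out.write(sam_record_to_fastq(line))
```

Calling `convert(sys.stdin, sys.stdout)` from a small script would let this sit at the end of a pipe, much like the awk version. It assumes unaligned output, so no reverse-complementing is done.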

moonwatcher commented Aug 22, 2024

Sorry for the slow reply, personal issues and a deadline at my actual job. I'll get around soon to adding a flag for writing the header comment to an auxiliary field.

Like I said, because of the nature of what Pheniqs is doing, that might raise some questions about more complicated cases or require a default behavior.

moonwatcher commented Aug 22, 2024

Sorry just read your second reply again.

So what you ultimately want is to write the cell barcode to the fastq comment, same as the sample barcode. Which is working for sample barcodes...

There might be a way to do this with a few pipes. Pheniqs can write SAM (uncompressed, plain-text BAM) to stdout; you can then use sed to switch the tag from cellular to sample and pipe that into a second Pheniqs that converts it to fastq. Am I getting this right? That should be very fast.
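The tag-switching step in the middle of that pipe could look roughly like this in Python instead of sed. This is a sketch under assumptions: the function name `retag` is made up, the cellular barcode is taken to be in the CB tag (as mentioned earlier in the thread), and the sample barcode is assumed to live in the BC tag, which is what the SAM specification reserves for it. Whether the second Pheniqs then writes it to the fastq comment is not verified here.

```python
def retag(line: str, src: str = "CB", dst: str = "BC") -> str:
    """Rename the src:Z: tag to dst:Z: on one SAM line.

    Hypothetical filter. Header lines and records without the
    src tag pass through unchanged.
    """
    if line.startswith("@"):
        return line
    fields = line.rstrip("\r\n").split("\t")
    src_prefix = src + ":Z:"
    if not any(f.startswith(src_prefix) for f in fields[11:]):
        return line
    out = fields[:11]
    for field in fields[11:]:
        if field.startswith(dst + ":"):
            # Drop any existing dst tag so the record never carries it twice.
            continue
        if field.startswith(src_prefix):
            field = dst + field[len(src):]
        out.append(field)
    return "\t".join(out) + "\n"
```

Only the optional fields (column 12 onward) are touched, so a read sequence or quality string that happens to contain "CB:Z:" is left alone, which a naive sed substitution would not guarantee.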

Alternatively, if there was a flag to write cell barcodes to the fastq comment, the same way the sample barcode is written now, would that satisfy your needs?

prmunn commented Aug 28, 2024

Yes, if there was a flag to write cell barcodes to the fastq comment, the same way the sample barcode is written now, that would work.

That said, I have written an awk script that converts the sam to a fastq with the barcodes in the comment section, and it runs pretty quickly, so I can make do with things as they stand. However, other people might find such a flag useful.
