-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
demultiplexing based on primer #31
Comments
Hi Fabian I am not sure I understand what you are trying to achieve. Are you trying to remove the primer from the biological read or classify by it? Maybe you can provide an example? |
classify by primer (removing would be the next step although it could be done in one). So after demultiplexing, my files look like this file_index1_R1: PRIMER_F_1_READ file_index1_R2: PRIMER_R_1_READ what I want is file_index1_PRIMER1_R1, file_index1_PRIMER1_R2 and file_index1_PRIMER2_R1 file_index1_PRIMER2_R2 Right now, we (and most people I assume), use cutadapt for this and it works well. However it doesn't take into account the quality scores but just allows for a certain number of miss-matches (10 % be default). I think that the PAMLD decoder could be a brilliant alternative. |
So, if I understand this correctly: you have two possible primers and 91 sample barcodes. The "libraries" are characterized by the dot product, so 2x91 libraries. Pheniqs only allows one sample barcode decoder (since sample barcodes represent read groups and those have their own rules and life cycle) BUT every decoder in pheniqs (sample, cellular or molecular) can have an unlimited number of segments. That means you can treat your sample barcode as a barcode with two segments and encode the full dot product (so you will declare 91x2 barcodes). for instance "primer_1_library_1": {
"barcode": [
"<your barcode sequence for library 1>",
"<your first primer sequence>"
]
},
"primer_2_library_1": {
"barcode": [
"<your barcode sequence for library 1>",
"<your second primer sequence>"
]
}, You will, of course, need to define tokens for locating the primer sequence. If this sounds confusing just upload a sample config file you made and an example read with the location of the primers and the two sequences and I will help you with the config but it sounds doable. P.S. Since primers are usually longer than barcodes, you can expect posterior decoding to have significantly less false negatives (loosing reads) without risking more false positives (misclassifying reads). From a probabilistic approach telling two long sequences apart with high confidence is possible even with a lot of mismatches. would be interesting though to see what the noise filter thinks about it, but I suspect that won't be an issue either. P.S.2 |
Fabian Could you please also post your two primers? do they have ambiguous nucleotides? the statistical core that computes the posterior will have to be extended to support those. |
in my case I use the following primer pairs: PRIMER 1 PRIMER 2 Note that the first forward primer does indeed contain ambiguous nucleotides (but in my case no Iosine - which could be teh case for other primers) |
So those primers show up on a fixed offset from some edge of the read segments? Is the only thing you are missing is support for ambiguity codes or do they also shift in position? Do they have indels? in which case this becomes more of an alignment issue... I think adding support for ambiguity codes in the barcode declaration is possible, the statistical model can definitely apply to it. Basically a W means that both A and T should be considered a positive match. Have you tried what I suggested? does it work for you other than the ambiguity code? (you can try and do a run with just an A or just a T to see that it otherwise works). What are you r thoughts? |
sorry I just saw you suggestion, I had only seen your question about the barcode sequence. I will try it out later this week and come back to you. I am not sure there can be indels in the primer binding site but primer slippage might lead to that (see this paper ). I'll get back to you! |
Hi,
I just discovered the tool and used it for de multiplexing of an Illumina MiSeq run, it seemed to work well! (only 6% of reads with unclear barcodes for 91 pooled samples).
On the same run however, I also pooled to differnt primer pairs. The Primer sequences are part of the biological read.
I was wondering if pheniqs could be used for that? I use cutadapt but the PHRED error model would be a nicer way than just allowing for a certain number of mismatches.
edit: for primer search, IUPAC wildcard characters would ideally need to be supported
The text was updated successfully, but these errors were encountered: