Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"0 in the intersection of the posfile and the reference haplotypes" #14

Open
karstyl opened this issue Mar 16, 2022 · 4 comments
Open

Comments

@karstyl
Copy link

karstyl commented Mar 16, 2022

Hello,

I am running quilt on a set of data where I have ~1300 F2's with shallow sequencing and high quality phased genotypes from the F0's.

After running most of my data it all looked good except one chromosome, and I realized that I had the position data from an old version of the genome, so I updated the portions of the variants in the impute.legend file.

When I tried to re-run the chromosome, I am getting this in the output:
[2022-03-16 10:06:12] In the region to be imputed plus buffer there are the following number of SNPs:
[2022-03-16 10:06:12] 505727 from the posfile
[2022-03-16 10:06:12] 505727 from the reference haplotypes
[2022-03-16 10:06:12] 0 in the intersection of the posfile and the reference haplotypes
Error in print_and_validate_reference_snp_stats(pos_snps, legend_snps, :
There are 0 SNPs that intersect between the posfile and the reference haplotypes. Please troubleshoot to see if there is an error, or re-run without reference haplotypes

This is the command I am running:
/home/jjohnso/project-aces/apps/QUILT/QUILT/QUILT.R
--outputdir=${dirname}
--chr=${chr}
--tempdir=temp.${dirname}
--output_filename=quilt.chr${chr}a.set${set}.vcf
--bamlist=shortbamlist${chr}.txt
--sampleNames_file=shortbamlist${chr}.names
--reference_haplotype_file=gpinfo/chr${chr}.impute.hap.gz
--reference_legend_file=gpinfo/chr${chr}.impute.legend.gz
--genetic_map_file=gpinfo/chr${chr}.geneticmap.txt
--nGen=${nGenval}
--nCores=11
--save_prepared_reference=TRUE

I am not providing a posfile, so I assume that it previously made one with the other positions that it has stored in some other location.

I have removed the old directories, changed the temp directory, changed the file names, and double checked my impute.legend files.

-Jen Jhohnson

@rwdavies
Copy link
Owner

Hi Jen,

That is very weird. Indeed there is no "posfile" here, it's just the way I'm re-using code from STITCH, both the "posfile" and the legend file are derived from the legend. Somewhat suspicious that there are exactly the same number of SNPs in both, but no overlap.

If it's not too much trouble, are you able to add some print statements to the code, re-install QUILT, then try re-running? For instance the problematic line is line 895 with "print_and_validate_reference_snp_stats", but if you put some print statements just above it starting from current line 893, like
print(head(pos_snps))
print(head(legend_snps))
maybe that will give some insight into why the code is now running into problems.
You can then re-install QUILT using something like

cd /home/jjohnso/project-aces/apps/QUILT/QUILT/
./scripts/build-and-install.R

Sorry I can't immediately see the problem,
Best wishes,
Robbie

@karstyl
Copy link
Author

karstyl commented Mar 18, 2022

Robbie,

Sure, I can do that.
Which code file should I edit?

-Jen

@rwdavies
Copy link
Owner

Right, sorry, it would be this file
https://github.com/rwdavies/STITCH/blob/master/STITCH/R/reference.R
so you'd actually need to download and compile STITCH if not already done
then re-build and install using

cd ~/proj/STITCH ## or similar
./scripts/build-and-install.R

@karstyl
Copy link
Author

karstyl commented Mar 21, 2022

I figured out the issue. When you mentioned that it was not the issue of stored data, I knew it had something to do with my files. I have run the rest of my chromosomes without this issue.
It turns out that the issue was the program that converted the positions added a carriage return instead of a newline. I thought this would have been fixed when I re-wrote the file using awk and printing it column by column, but the carriage return stayed in the file.

For others who find this same issue:
To find it"
cat -A filename
will show '^M$' instead of '$'
tr -d $'\r' < filename
will fix it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants