Publish Version III of the SNP+TR reference haplotype panel #27

Open
wants to merge 32 commits into
base: main

Conversation

gymreklab
Collaborator

@gymreklab gymreklab commented Sep 10, 2024

The Version III files are the same as Version II (VII), with the following updates to facilitate use in downstream imputation pipelines:

  1. Remove TRs for which the REF allele does not match the expected sequence based on CHR:POS
  2. For each TR, remove alleles with 0 count (with the exception of the REF allele which should always be present even if it has count 0)
  3. Remove TRs which have more than 100 alleles.
  4. Remove TRs which have fewer than 2 alleles.
  5. Remove the DS/GP fields which are large and not used by downstream steps.
  6. Add unique IDs for each TR in the format EnsTR:CHROM:POS. For TRs that share the same CHROM:POS, append a duplicate number: EnsTR:CHROM:POS:Duplicate_num.
  7. Add VT field, set to VT=TR for TRs and VT=OTHER for other variant types
  8. Add the .bref format files which have the same information as the VCFs but can improve Beagle imputation performance.

The script scripts/fix-ref/fix_ensembletr_snpstr_reference.py makes these changes.
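
For reviewers, here is a rough sketch of how the ID and VT rules (items 6 and 7) could behave. It is not taken from fix_ensembletr_snpstr_reference.py: the function name `annotate_variants`, the dict-based record layout, and the `is_tr` flag are all hypothetical. It also assumes that the first TR at a position keeps the plain ID and later ones get `:1`, `:2`, and so on, which the description above does not spell out.

```python
from collections import Counter


def annotate_variants(records):
    """Assign EnsTR IDs and a VT value to each record.

    `records` is an iterable of dicts with keys "chrom", "pos", "id" and
    "is_tr" (hypothetical layout, used only for illustration).
    """
    seen = Counter()  # number of TRs already seen at each (chrom, pos)
    annotated = []
    for rec in records:
        rec = dict(rec)  # copy so the caller's data is not modified
        if rec["is_tr"]:
            key = (rec["chrom"], rec["pos"])
            dup = seen[key]
            seen[key] += 1
            base = f"EnsTR:{rec['chrom']}:{rec['pos']}"
            # Assumption: only repeated positions get a numeric suffix.
            rec["id"] = base if dup == 0 else f"{base}:{dup}"
            rec["vt"] = "TR"
        else:
            # Non-TR variants keep their existing ID.
            rec["vt"] = "OTHER"
        annotated.append(rec)
    return annotated


example = [
    {"chrom": "chr21", "pos": 100, "id": ".", "is_tr": True},
    {"chrom": "chr21", "pos": 100, "id": ".", "is_tr": True},
    {"chrom": "chr21", "pos": 150, "id": "rs1", "is_tr": False},
]
for rec in annotate_variants(example):
    print(rec["id"], rec["vt"])
```

Tracking a per-position counter keeps the pass single-scan, which matters if the real input is a sorted, streamed VCF.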

Version II files (and V1 genotype files) have been moved to archive_ensembletr_datasets.md so the main README doesn't get too cluttered.

Melissa Gymrek and others added 30 commits August 5, 2024 11:02
update the main README.md and make some changes on the fix_ensembletr…
fix file name, remove duplicate loci and update README
fix issue with REF count 0 in fix ref script
@gymreklab gymreklab marked this pull request as ready for review November 5, 2024 19:30
@gymreklab
Collaborator Author

@heliziii you can review but let's hold off on merging until @yli091230 updates links to the new ref files

@yli091230

@heliziii I just updated the links and the README file. It should be good to go.

@@ -125,76 +129,86 @@ Chromosome 21 [VCF file](https://ensemble-tr.s3.us-east-2.amazonaws.com/add-vntr

Chromosome 22 [VCF file](https://ensemble-tr.s3.us-east-2.amazonaws.com/add-vntrs/ensemble_chr22_filtered.vcf.gz) and [tbi file](https://ensemble-tr.s3.us-east-2.amazonaws.com/add-vntrs/ensemble_chr22_filtered.vcf.gz.tbi)

-## Version II of reference SNP+TR haplotype panel for imputation of TR variants
+## Version IV of reference SNP+TR haplotype panel for imputation of TR variants
Collaborator

Are we publishing version 3 or 4?

Collaborator

ah I saw that we archived version 3, can you remind me what changed from version 3 to 4?

I'm not sure. Version 3 has some issues with missing reference alleles. @gymreklab, which version number should we use?

In version 3, some REF alleles are missing because no REF allele was detected (count 0), which causes errors in downstream analysis. To fix this, version 4 always keeps the REF allele.
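
The REF-retention rule can be sketched roughly as follows. This is an illustration only, not the code in the fix-ref script: `prune_alleles` and its arguments are hypothetical names, genotypes are assumed to be fully called (no missing alleles), and the allele-count limits are assumed to apply after zero-count alleles have been dropped.

```python
def prune_alleles(alleles, genotypes, min_alleles=2, max_alleles=100):
    """Drop unobserved ALT alleles but always keep REF (index 0).

    `alleles` is a list of sequences with REF first; `genotypes` is a list
    of tuples of allele indices, one tuple per sample.
    Returns (alleles, genotypes) with indices remapped, or None when the
    TR should be removed from the panel.
    """
    counts = [0] * len(alleles)
    for gt in genotypes:
        for idx in gt:
            counts[idx] += 1

    # Keep REF unconditionally; keep an ALT only if it was observed.
    keep = [i for i, c in enumerate(counts) if i == 0 or c > 0]

    # Fewer than 2 alleles means no ALT survived; more than 100 is too many.
    if len(keep) < min_alleles or len(keep) > max_alleles:
        return None

    remap = {old: new for new, old in enumerate(keep)}
    new_alleles = [alleles[i] for i in keep]
    new_genotypes = [tuple(remap[i] for i in gt) for gt in genotypes]
    return new_alleles, new_genotypes


# REF ("ACAC") has count 0 here but is still retained at index 0.
result = prune_alleles(
    ["ACAC", "ACACAC", "AC", "ACACACAC"],
    [(1, 3), (3, 3)],
)
print(result)
```

Because REF always stays at index 0, only the ALT indices shift, so genotype remapping never needs a special case for the reference.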

2. For each TR, remove alleles with 0 count.
* If the reference allele has 0 count, keep the reference allele.
3. Remove TRs which have more than 100 alleles.
4. Remove TRs which have fewer than 2 alleles.
Collaborator

Does this mean at least one alternative allele?

Yes.

4 participants