Roadmap #1

Open
4 of 6 tasks
Hua-Zhou opened this issue Aug 29, 2017 · 6 comments

Comments

@Hua-Zhou
Member

Hua-Zhou commented Aug 29, 2017

This issue documents the implementation roadmap for VCF data handling (subsetting, conversion, and copying).

  • Convert GT data to Plink format (SnpData)
  • Subset VCF records according to chromosome and position range
  • Subset VCF records according to marker index or IDs
  • Subset VCF samples according to individual index or names
  • Convert methods for VCF.Reader, e.g., convert(Matrix{T}, reader::VCF.Reader) will convert all records from the current position of reader to a matrix of type Matrix{T}
  • Copy methods for VCF.Reader, e.g., copy(A::AbstractMatrix{T}, reader::VCF.Reader) will fill the columns of A with the GT data from the current position of reader
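
To make the subsetting and conversion items above concrete, here is a minimal, language-neutral sketch in Python (not Julia, and not the VCFTools.jl API). The function names `gt_to_dosage` and `subset_and_convert` are hypothetical. The sketch assumes tab-separated VCF text lines, and it lays the result out as records-by-samples, which may differ from the column layout the roadmap has in mind.

```python
from typing import Iterable, List, Optional


def gt_to_dosage(gt: str) -> Optional[int]:
    """Count non-reference alleles in a GT string such as '0|1' or '1/1'.

    Returns None when any allele is missing ('.').
    """
    alleles = gt.replace("|", "/").split("/")
    if any(a == "." for a in alleles):
        return None
    return sum(1 for a in alleles if a != "0")


def subset_and_convert(
    lines: Iterable[str], chrom: str, start: int, end: int
) -> List[List[Optional[int]]]:
    """Keep records on `chrom` with start <= POS <= end and return a
    records-by-samples matrix of non-reference allele counts."""
    matrix = []
    for line in lines:
        if line.startswith("#"):  # skip meta and header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if fields[0] != chrom or not (start <= int(fields[1]) <= end):
            continue
        gt_index = fields[8].split(":").index("GT")  # FORMAT column
        matrix.append(
            [gt_to_dosage(sample.split(":")[gt_index]) for sample in fields[9:]]
        )
    return matrix


# Made-up example data, for illustration only.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2",
    "1\t100\trs1\tA\tG\t50\tPASS\t.\tGT\t0|1\t1|1",
    "1\t200\trs2\tC\tT\t50\tPASS\t.\tGT:DP\t0/0:10\t./.:0",
    "2\t150\trs3\tG\tA\t50\tPASS\t.\tGT\t1|0\t0|0",
]
print(subset_and_convert(example, "1", 50, 250))
```

Subsetting by marker ID or by sample index would follow the same pattern, filtering on the ID column or slicing the sample columns instead of testing chromosome and position.
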
@biona001
Member

biona001 commented Nov 25, 2019

Below is a (tentative) list of features I need for MendelImpute:

  • filter function based on sample and record index
  • convert_ht function to import VCF files into a numeric matrix where columns are haplotypes
  • convert_ds function to read dosage into a numeric matrix
  • convert_vcf function to convert a phased genotype matrix back to a VCF file. This is not implemented explicitly but is supported by the general write methods. See MendelImpute's impute.jl
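
As a rough illustration of what the haplotype and dosage imports involve, here is a Python sketch (language-neutral, not the Julia implementation). The helper names `haplotype_row` and `dosage_row` are hypothetical. It assumes diploid, phased genotypes in the GT field and dosages stored in a `DS` FORMAT field, as some imputation tools write them.

```python
from typing import List, Optional


def haplotype_row(format_field: str, sample_fields: List[str]) -> List[Optional[int]]:
    """Turn one record's phased GT entries into haplotype alleles.

    Each sample contributes two adjacent entries, so stacking rows gives a
    matrix whose columns are haplotypes. Missing alleles become None.
    """
    gt_index = format_field.split(":").index("GT")
    row: List[Optional[int]] = []
    for sample in sample_fields:
        gt = sample.split(":")[gt_index]
        alleles = gt.split("|")
        if len(alleles) != 2:  # unphased ('/') or non-diploid genotype
            raise ValueError(f"expected a phased diploid genotype, got {gt!r}")
        row.extend(None if a == "." else int(a) for a in alleles)
    return row


def dosage_row(format_field: str, sample_fields: List[str]) -> List[Optional[float]]:
    """Read the DS (dosage) entry of each sample for one record."""
    ds_index = format_field.split(":").index("DS")
    row: List[Optional[float]] = []
    for sample in sample_fields:
        parts = sample.split(":")
        value = parts[ds_index] if ds_index < len(parts) else "."
        row.append(None if value == "." else float(value))
    return row
```

For a record with FORMAT `GT:DS` and samples `0|1:1.0` and `1|1:1.95`, `haplotype_row` yields four haplotype entries (two per sample) and `dosage_row` yields two dosages.
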

I will try to implement them in the next 1-2 weeks. Are there any caveats I need to be aware of? In particular, I'm not sure what the best way is to filter by sample index, given the data structure of VCF.Record.

@biona001
Member

biona001 commented May 20, 2020

Here are a few more desired routines typically needed for quality control:

  • splitting multi-allelic calls into different records
  • left-aligning indels
  • removing 'junk' variants that have a high statistical probability of being false positives, based on the QUAL score, which is the Phred-scaled quality: -10 log10 of the probability that the call is wrong.
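
Two of these routines can be sketched briefly in Python (language-neutral, not an existing VCFTools.jl function). The names `phred_to_error_prob` and `split_multiallelic` are hypothetical. The genotype recoding used when splitting (the selected ALT allele becomes 1, every other allele becomes 0) is one possible convention and is an assumption here. Left-aligning indels needs the reference sequence and is not shown.

```python
from typing import List, Optional, Sequence, Tuple

Genotype = Tuple[Optional[int], ...]


def phred_to_error_prob(qual: float) -> float:
    """Invert QUAL = -10 * log10(P(call is wrong))."""
    return 10 ** (-qual / 10)


def split_multiallelic(
    ref: str, alts: Sequence[str], genotypes: Sequence[Genotype]
) -> List[Tuple[str, str, List[Genotype]]]:
    """Split one multi-allelic site into one biallelic record per ALT allele.

    Genotypes are tuples of allele indices (0 = REF, k = k-th ALT, None =
    missing). In the record for the k-th ALT, allele k is recoded to 1 and
    every other non-missing allele to 0 (an assumed convention).
    """
    records = []
    for k, alt in enumerate(alts, start=1):
        recoded = [
            tuple(None if a is None else int(a == k) for a in genotype)
            for genotype in genotypes
        ]
        records.append((ref, alt, recoded))
    return records
```

A QUAL threshold then becomes a bound on the error probability; for example, keeping records with QUAL of at least 30 corresponds to an error probability of at most about 0.001.
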

They are mentioned here

@kose-y
Member

kose-y commented Jul 30, 2020

@kose-y
Member

kose-y commented Aug 2, 2020

VCF to PLINK is implemented in SnpArrays.jl.

@biona001
Member

biona001 commented Sep 2, 2020

In the next few days, I will add:

  • Calculation of the Hardy-Weinberg equilibrium p-value using Fisher's exact test (reference)
  • GRM calculation, with emphasis on treating missing data (reference)
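
For the first item, one common formulation of the exact test conditions on the observed allele counts and sums the probabilities of all heterozygote counts that are no more likely than the observed one. The Python sketch below follows that formulation; it is an illustration under that assumption, not necessarily the method in the linked reference or the eventual Julia code, and the function name `hwe_exact_pvalue` is hypothetical.

```python
import math


def hwe_exact_pvalue(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Exact Hardy-Weinberg test p-value from genotype counts.

    Conditional on the allele counts, the probability of seeing `het`
    heterozygotes among n individuals is
        2^het * n! / (hom_a! het! hom_b!) * n_a! n_b! / (2n)!
    The p-value sums this over every feasible `het` whose probability does
    not exceed that of the observed count.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        return 1.0
    n_a = 2 * n_aa + n_ab  # copies of allele A
    n_b = 2 * n_bb + n_ab  # copies of allele B

    def log_prob(het: int) -> float:
        hom_a = (n_a - het) // 2
        hom_b = (n_b - het) // 2
        return (
            het * math.log(2)
            + math.lgamma(n + 1)
            - math.lgamma(hom_a + 1)
            - math.lgamma(het + 1)
            - math.lgamma(hom_b + 1)
            + math.lgamma(n_a + 1)
            + math.lgamma(n_b + 1)
            - math.lgamma(2 * n + 1)
        )

    # Feasible heterozygote counts share the parity of the allele counts.
    probs = {
        het: math.exp(log_prob(het))
        for het in range(n_a % 2, min(n_a, n_b) + 1, 2)
    }
    observed = probs[n_ab]
    # Small relative tolerance so ties with the observed value are included.
    total = sum(p for p in probs.values() if p <= observed * (1 + 1e-9))
    return min(1.0, total)
```

Working in log space with `math.lgamma` avoids overflow in the factorials for large samples. Samples with missing genotypes would simply be left out of the three counts.
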

@Hua-Zhou
Member Author

Hua-Zhou commented Sep 2, 2020

👍
