Add Filtering options #27

Open
dridk opened this issue Feb 6, 2017 · 9 comments
@dridk
Member

dridk commented Feb 6, 2017

You should be able to filter based on column name.

@dridk dridk added this to the 0.3 milestone Feb 6, 2017
@dridk
Member Author

dridk commented Feb 6, 2017

@Arkanosis How can I perform sqlite-like filtering on a C++ list?

@Arkanosis
Member

You mean like a SQL WHERE? You can't do that very efficiently (there's a reason why people use sqlite), but assuming you're working on small lists, std::copy_if() / std::remove_copy_if() and a custom predicate might do the trick (that's somewhat expensive, but compared to the cost of displaying the result in Qt, not that much).
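
A minimal sketch of that linear approach, assuming the variants already sit in a `std::vector`. The `Variant` struct, its field names and `filterVariants` are hypothetical and only serve the illustration; they are not taken from the CuteVCF code base.

```cpp
#include <algorithm>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical in-memory record; the field names are illustrative only.
struct Variant {
    std::string chrom;
    long pos;
    std::string ref;
    std::string alt;
    double qual;
};

// Linear "WHERE"-style filter: copies every record accepted by the predicate.
// Each call scans the whole list (O(n)) because no index is involved.
template <typename Predicate>
std::vector<Variant> filterVariants(const std::vector<Variant>& rows, Predicate keep) {
    std::vector<Variant> out;
    std::copy_if(rows.begin(), rows.end(), std::back_inserter(out), keep);
    return out;
}
```

A call such as `filterVariants(rows, [](const Variant& v) { return v.chrom == "chr1" && v.qual >= 30.0; })` would then play the role of `WHERE chrom = 'chr1' AND qual >= 30`.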

@dridk
Member Author

dridk commented Feb 6, 2017

Hmm...
I mean a simple filter like the one Excel offers.
I think I would lose all the benefit of htslib by using sqlite, no?

Hmm...
Open a file, import it into sqlite, run a query on a region...
What do you suggest?

@dridk
Member Author

dridk commented Feb 6, 2017

I think I like the idea of saving all variants as a sqlite file!

@Arkanosis
Member

It all depends on how big you expect the VCF files to be. For small files, linear filtering is probably cheap enough, but on big ones, I'm afraid it's going to be noticeably slow. sqlite with proper indexes might scale much better but there's an overhead at startup.

I'd suggest linear filtering for typical Excel-sized VCF files and indexed filtering for anything larger than that (sqlite being the most convenient approach I can think of).

Now, given it displays every single row of the VCF, I assume CuteVCF is more small-files oriented, isn't it?

@dridk
Member Author

dridk commented Feb 6, 2017

CuteVCF should be able to handle big files. The Qt model system is really robust and can support a huge number of rows. If I exceed my memory, I can use pagination.
So I will probably make CuteVCF a strong VCF viewer/filtering application that supports different kinds of annotation definitions. I think this will be really useful. Too many people use Excel for filtering.

By the way, @Arkanosis, how many annotation specifications do you know of?
I only know snpEff, which puts annotations in the INFO field as follows: ANN=A|324|234

@Arkanosis
Member

In that case, you'll probably want to use some indexed backend like sqlite (which handles offsets and limits for pagination, btw).
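
To make the pagination point concrete, here is a small sketch that builds a paginated query using sqlite's `LIMIT ... OFFSET ...` clause. The `variants` table, its columns and the `pageQuery` helper are assumptions made for the example, not an existing schema.

```cpp
#include <string>

// Build a paginated SELECT for a hypothetical `variants` table.
// pageIndex is zero-based; pageSize is the number of rows per page.
std::string pageQuery(long pageIndex, long pageSize) {
    const long offset = pageIndex * pageSize;
    return "SELECT chrom, pos, ref, alt FROM variants ORDER BY chrom, pos LIMIT "
           + std::to_string(pageSize) + " OFFSET " + std::to_string(offset);
}
```

The `ORDER BY` is there so that consecutive pages come back in a stable order; the resulting string would be handed to whatever sqlite binding the application uses.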

As for annotation specs, I'm only aware of those of SnpEff (e.g. EFF=A|324|234 and ANN=A|324|234) and VEP (e.g. CSQ=A|324|234). I've never heard of any other widely used composite INFO field.
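
As a rough illustration of how such a composite field could be unpacked, the sketch below looks up one key in a `;`-separated INFO string and splits its value on `|`. The helper names `split` and `annotationFields` are hypothetical, and the sketch assumes a single annotation per record (it does not handle several comma-separated annotations under the same key).

```cpp
#include <string>
#include <vector>

// Split `text` on `sep`, keeping empty fields (annotation subfields are often empty).
std::vector<std::string> split(const std::string& text, char sep) {
    std::vector<std::string> parts;
    std::string::size_type start = 0;
    while (true) {
        const std::string::size_type end = text.find(sep, start);
        if (end == std::string::npos) {
            parts.push_back(text.substr(start));
            return parts;
        }
        parts.push_back(text.substr(start, end - start));
        start = end + 1;
    }
}

// Return the '|'-separated subfields stored under `key` (e.g. "ANN", "EFF", "CSQ")
// in an INFO string, or an empty vector when the key is absent.
std::vector<std::string> annotationFields(const std::string& info, const std::string& key) {
    const std::string prefix = key + "=";
    for (const std::string& entry : split(info, ';')) {
        if (entry.compare(0, prefix.size(), prefix) == 0) {
            return split(entry.substr(prefix.size()), '|');
        }
    }
    return {};
}
```

With the example above, `annotationFields("DP=10;ANN=A|324|234", "ANN")` yields the three subfields `A`, `324` and `234`.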

@dridk
Member Author

dridk commented Feb 6, 2017

OK, I have two options.

  • 1 Save the raw VCF line in sqlite (chrom, pos, ref, alt, info, sample) and do the parsing from C++.
    Or
  • 2 Parse the raw VCF before insertion into the sqlite table.

In either case, I think I can avoid table joins by saving all variant data in one table.
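
For option 2, the parsing step before insertion could look roughly like the sketch below. It relies on the standard tab-separated VCF column order (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO); the `VcfRecord` struct and `parseVcfLine` function are hypothetical names, and sample columns are ignored to keep the example short.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical row layout for a single-table design.
struct VcfRecord {
    std::string chrom;
    long pos = 0;
    std::string ref;
    std::string alt;
    std::string info;
};

// Parse one tab-separated VCF data line into `out`.
// Returns false for header lines, lines with fewer than 8 columns,
// or a POS column that does not start with a number.
bool parseVcfLine(const std::string& line, VcfRecord& out) {
    if (line.empty() || line[0] == '#') return false;

    std::vector<std::string> cols;
    std::istringstream stream(line);
    std::string col;
    while (std::getline(stream, col, '\t')) cols.push_back(col);
    if (cols.size() < 8) return false;

    long pos = 0;
    try {
        pos = std::stol(cols[1]);
    } catch (const std::exception&) {
        return false;
    }

    out.chrom = cols[0];
    out.pos = pos;
    out.ref = cols[3];
    out.alt = cols[4];
    out.info = cols[7];
    return true;
}
```

Each successfully parsed record would then be bound to an `INSERT` statement; with option 1, the same function would instead run at display time on the stored raw line.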

@dridk
Member Author

dridk commented Feb 7, 2017

After some reflection overnight, I propose the following idea:
For each .vcf.gz, create a sqlite clone, .vcf.db.
sqlite supports queries across different databases, so I can easily imagine intersecting two databases if necessary.
