Towards an Internal Kmer Search Engine

@epruesse epruesse released this 07 Feb 02:59

Internal Kmer Search Update

With this release, the internal kmer search is nearing completion. The kmer index is now persisted to disk, computed in parallel, and uses a presence/absence optimization to reduce its total size and speed up searches. It's many times faster than the original PT server based search. (You still need to use --num-pts to make it use multiple threads, though.) Tweaks to the way SINA interacts with ARB and caches sequences internally have reduced the memory usage of the kmer search's indexing and use stages enough to allow working with the current SILVA Ref NR on a 16GB machine.
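
For readers unfamiliar with the idea, here is a minimal Python sketch of one way a presence/absence optimization can work in a kmer index. It is not SINA's actual C++ implementation. It assumes that "presence/absence" means: when a kmer occurs in more than half of the reference sequences, the index stores the (shorter) list of sequences that lack it instead. All names (kmers, build_index, search) are hypothetical and chosen for illustration only.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(seqs, k):
    """Map each k-mer to (inverted, ids).

    inverted=False: ids lists the sequences that contain the k-mer.
    inverted=True:  ids lists the sequences that do NOT contain it
                    (assumed optimization for very common k-mers).
    """
    postings = defaultdict(set)
    for sid, seq in enumerate(seqs):
        for km in kmers(seq, k):
            postings[km].add(sid)

    n = len(seqs)
    all_ids = set(range(n))
    index = {}
    for km, ids in postings.items():
        if len(ids) > n // 2:
            # Common k-mer: the absence list is the shorter one.
            index[km] = (True, sorted(all_ids - ids))
        else:
            index[km] = (False, sorted(ids))
    return index


def search(index, n, query, k):
    """Count, for each of the n sequences, the k-mers shared with query."""
    scores = [0] * n
    base = 0  # k-mers credited to every sequence via inverted entries
    for km in kmers(query, k):
        entry = index.get(km)
        if entry is None:
            continue
        inverted, ids = entry
        if inverted:
            base += 1
            for sid in ids:
                scores[sid] -= 1  # take the credit back where absent
        else:
            for sid in ids:
                scores[sid] += 1
    return [s + base for s in scores]
```

Under this assumption, both the stored lists and the per-query work for common kmers shrink, because only the minority side (present or absent) is ever enumerated.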

Documentation Update

The documentation is now up to date with the current features. A man file is distributed with SINA and available via man sina from conda environments. Text-file versions are shipped in share/doc/sina, and a pretty html version rendered by sphinx is available at https://sina.readthedocs.io.

Evaluation Options Reinstated

The options --show-dist and --fs-msc-max have been reinstated to allow evaluating the accuracy of SINA. New unit tests are in place to verify that the accuracy doesn't accidentally drop. These will help make the switch to the internal kmer search without risking significant changes to the overall accuracy.

Changelog

  • update documentation (#20)
  • reinstate --show-dist
  • reinstate --fs-msc-max
  • add choice "exact" to --search-iupac
  • change default for --search-kmer-len to match --fs-kmer-len
  • parallelize launch of background PT servers
  • lower memory usage:
    • avoid redundant sequence caching by libARBDB
    • use compact aligned base (saves 50% on internal sequence cache)
  • improve internal kmer search performance
    • add caching of kmer index on disk
    • parallelize kmer index construction
    • add presence/absence optimization
  • fix field align_ident_slv added for 100% matches even when
    not enabled
  • fix crash on overhang past alignment edge
  • fix libARBDB writing to stdout, clobbering sequence output
  • fix out-of-bounds access on iterator in NAST implementation
  • remove dependency on boost serialization library
  • build release binaries with GCC 7 and C++11 ABI
  • add integration tests watching for accuracy regressions (#25)

Full Changelog on ReadTheDocs