TODO.org

Issues & ideas

Done

  • [characters] Add contains-XXX and contains-XXX-only functions consistently
  • [transformers] Move split-tokens-with-space to parsers
  • [tagging] Add some kind of gen-dict-from-file. Check how with-open works. Is it reasonable to take a reader as a parameter?
  • [tagging] Add defaulting to merge-tokens-with-space in tags algebra.
  • [tagging] Add API DOCS to tagging module.
  • [tagging] Add function to convert fdict -> dict.
  • [tagging] Refactor tagging module so it works with langlab
  • [ngrams] Correct ngrams module so it uses partition
  • [stopwords] Verify Norwegian stopwords (no-sw). I had to do some uppercase conversions, so something might be wrong there. Checked by comparing the intersection with Lucene stopwords (70 out of 119 are common).
  • [stopwords] Refactor constants into functions
  • [stopwords] Add docs to stopwords functions
  • [stopwords] Convert en-drop-articles so it uses stopwords
  • [stopwords] Add basic unit tests to stopwords functions
  • [detectors] Create wrappers for language detection module in Apache Tika
  • [stopwords] Add constants containing articles
  • [stopwords] Create a module and add general stopwords filter
  • [general] Refactor langlab-base -> langlab
  • [readability] Correct test to the infix notation
  • [readability] Correct counting characters to bi version
  • [characters] Add functions detecting/removing non-BMP characters
  • [transformers] Add tests to transformers
  • [parsers] Add tests to sentence splitters
  • [parsers] Add a simple tokenizer based on Analyzer from Lucene. See http://stackoverflow.com/questions/6334692/how-to-use-a-lucene-analyzer-to-tokenize-a-string (post by Ben McCann for Lucene 4.1).
  • [characters] Make use of the punctuation classes from Unicode:
      • [Pc] Punctuation, Connector
      • [Pd] Punctuation, Dash
      • [Pe] Punctuation, Close
      • [Pf] Punctuation, Final quote (may behave like Ps or Pe depending on usage)
      • [Pi] Punctuation, Initial quote (may behave like Ps or Pe depending on usage)
      • [Po] Punctuation, Other
      • [Ps] Punctuation, Open
    See http://www.fileformat.info/info/unicode/category/index.htm
  • [characters] Add a tokenizer based on the ICU4J module and its break iterator. It implements the Unicode segmentation rules.
      • Rules: http://www.unicode.org/reports/tr29/
      • API: http://icu-project.org/apiref/icu4j/com/ibm/icu/text/BreakIterator.html
      • More: http://site.icu-project.org/
  • [characters] Add Java functions to the characters module:
      • containsPunctuation(String s)
      • containsPunctuationOnly(String s)
      • containsWhitespace(String s)
      • containsWhitespaceOnly(String s)
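
For reference, the four Java functions named in the last item could be built on the Unicode punctuation classes listed earlier (Pc, Pd, Ps, Pe, Pi, Pf, Po). The following is only a sketch, not the module's actual code: the class name `PunctuationChecks` and the helper `isPunctuation` are hypothetical, an empty string is assumed to fail the "only" checks, and whitespace is assumed to mean whatever `Character.isWhitespace` accepts (which excludes non-breaking spaces).

```java
// Hypothetical sketch; class name and helper are illustrative only.
final class PunctuationChecks {
    private PunctuationChecks() {}

    // True if the code point belongs to one of the seven Unicode
    // punctuation general categories.
    static boolean isPunctuation(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.CONNECTOR_PUNCTUATION:     // Pc
            case Character.DASH_PUNCTUATION:          // Pd
            case Character.START_PUNCTUATION:         // Ps
            case Character.END_PUNCTUATION:           // Pe
            case Character.INITIAL_QUOTE_PUNCTUATION: // Pi
            case Character.FINAL_QUOTE_PUNCTUATION:   // Pf
            case Character.OTHER_PUNCTUATION:         // Po
                return true;
            default:
                return false;
        }
    }

    // At least one code point is punctuation.
    static boolean containsPunctuation(String s) {
        return s.codePoints().anyMatch(PunctuationChecks::isPunctuation);
    }

    // Every code point is punctuation; assumption: "" yields false.
    static boolean containsPunctuationOnly(String s) {
        return !s.isEmpty()
                && s.codePoints().allMatch(PunctuationChecks::isPunctuation);
    }

    // At least one code point is whitespace per Character.isWhitespace.
    static boolean containsWhitespace(String s) {
        return s.codePoints().anyMatch(Character::isWhitespace);
    }

    // Every code point is whitespace; assumption: "" yields false.
    static boolean containsWhitespaceOnly(String s) {
        return !s.isEmpty()
                && s.codePoints().allMatch(Character::isWhitespace);
    }
}
```

Iterating over code points rather than `char` values keeps the checks correct for characters outside the Basic Multilingual Plane, which ties in with the non-BMP item above.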