- [general] Correct the apidocs so they render properly with the new codox markdown support
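  A minimal project.clj sketch for this, assuming codox's :metadata {:doc/format :markdown} option described in the codox README; the plugin version below is illustrative:

  ```clojure
  ;; Renders all docstrings as markdown project-wide; adjust the codox version as needed.
  (defproject langlab "0.0.0-SNAPSHOT"
    :plugins [[codox "0.8.10"]]
    :codox {:metadata {:doc/format :markdown}})
  ```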
- [general] Add a string similarity library based either on simmetrics or on SecondString: http://web.archive.org/web/20081224065617/http://www.dcs.shef.ac.uk/~sam/simmetrics.html http://web.archive.org/web/20081224234350/http://www.dcs.shef.ac.uk/~sam/stringmetrics.html http://sourceforge.net/projects/secondstring/ http://secondstring.sourceforge.net/
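  Whichever backend is picked, the Clojure-facing API could look roughly like the sketch below; a plain Levenshtein distance stands in for the library call, and all names are illustrative:

  ```clojure
  (defn levenshtein-distance
    "Classic dynamic-programming edit distance between two strings."
    [^String a ^String b]
    (let [row0 (vec (range (inc (count b))))]
      (last
        (reduce
          (fn [prev-row i]
            (reduce
              (fn [row j]
                (conj row
                      (min (inc (nth prev-row j))              ; deletion
                           (inc (peek row))                    ; insertion
                           (+ (nth prev-row (dec j))           ; substitution
                              (if (= (nth a (dec i)) (nth b (dec j))) 0 1)))))
              [i]
              (range 1 (inc (count b)))))
          row0
          (range 1 (inc (count a)))))))

  (defn similarity
    "Normalized similarity in [0,1]; 1.0 means identical strings."
    [a b]
    (if (and (empty? a) (empty? b))
      1.0
      (- 1.0 (/ (levenshtein-distance a b)
                (double (max (count a) (count b)))))))

  ;; (similarity "kitten" "sitting") ;=> ~0.571 (edit distance 3, max length 7)
  ```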
- [general] Consider using the perforate lein plugin to add benchmarks: https://github.com/davidsantiago/perforate Note, however, that it targets microbenchmarks.
- [tagging] Review the tests for get-min-tokens and get-max-tokens.
- [tagging] Should get-min/max-tokens really return nil on empty dict?
- [detectors] Add tests for language detection.
- [detectors] Add tests for encoding detection.
- [detectors] Add an option for seed initialization of the Cybozu lang detector
- [detectors] Investigate the ICU language detection problems. Why does it work only for texts with diacritics removed? Does manual removal affect detection of non-Latin alphabet texts?
- [general] Check the ICU automatic charset detection + decoding facilities. This may solve the problems with different encodings in files.
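  A minimal sketch using ICU4j's CharsetDetector, which both guesses the charset and decodes the bytes; assumes icu4j on the classpath, and the function name is illustrative:

  ```clojure
  (import '(com.ibm.icu.text CharsetDetector))

  (defn detect-and-decode
    "Guesses the charset of raw bytes and decodes them with the best match;
     returns {:charset name :text decoded-string}, or nil when nothing matches."
    [^bytes raw]
    (when-let [match (.detect (.setText (CharsetDetector.) raw))]
      {:charset (.getName match)
       :text    (.getString match)}))
  ```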
- [characters] Add a function to detect/normalize some UTF-8 characters, for example (see the sketch after this list):
  - “…” -> “...”
  - “„” -> “"”
  - “”” -> “"”
  - “—” -> “-”
  - hard (non-breaking) spaces -> regular spaces
  - what else?
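  A minimal sketch of such a normalizer using clojure.string/escape with a character map; the function name and the exact replacement table are illustrative:

  ```clojure
  (require '[clojure.string :as str])

  (def ^:private fancy-char-map
    {\u2026 "..."   ; horizontal ellipsis
     \u201E "\""    ; double low-9 quotation mark
     \u201D "\""    ; right double quotation mark
     \u2014 "-"     ; em dash
     \u00A0 " "})   ; no-break (hard) space

  (defn normalize-special-chars
    "Replaces selected 'fancy' Unicode characters with plain ASCII equivalents."
    ([s] (normalize-special-chars s fancy-char-map))
    ([s char-map] (str/escape s char-map)))
  ```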
- [parsers] There is a (closed source?) sentence parser library from Scott Piao http://text0.mib.man.ac.uk:8080/scottpiao/apidocs/api_sentpar_breaker.jsp https://sites.google.com/site/scottpiaosite/ http://www.research.lancs.ac.uk/portal/en/people/Scott-Piao/
- [stemmers] Should I rewrite the stemmers in the spirit of create and apply? It should yield a performance improvement, but what about concurrency? An alternative is to keep all the stemmers in delays.
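  A minimal sketch of the create/apply split, assuming the Snowball Java stemmers (org.tartarus.snowball.ext.*) are on the classpath; the names are illustrative and the locking is just one possible answer to the concurrency question:

  ```clojure
  (import '(org.tartarus.snowball.ext EnglishStemmer))

  (defn make-en-stemmer
    "Creates the (stateful, non-thread-safe) Snowball stemmer once."
    []
    (EnglishStemmer.))

  (defn apply-stemmer
    "Applies a previously created stemmer to a single token; the lock serializes
     access because Snowball stemmers keep mutable state."
    [^EnglishStemmer stemmer ^String token]
    (locking stemmer
      (.setCurrent stemmer token)
      (.stem stemmer)
      (.getCurrent stemmer)))

  ;; Keeping the shared instance in a delay postpones construction until first use:
  (def en-stemmer (delay (make-en-stemmer)))
  ;; (apply-stemmer @en-stemmer "running") ;=> "run"
  ```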
- [stemmers] Should I rename multi-stemmers to lemmatizers and move the morpha stemmer (a lemmatizer) there?
- [detectors] Look up the new Cybozu method for short texts (tweets): http://shuyo.wordpress.com/2012/05/17/short-text-language-detection-with-infinity-gram/ Difficult because a maximal substring implementation is needed: http://code.google.com/p/esaxx/
- [ngrams] Efficient computation of all n-grams using suffix arrays: http://dl.acm.org/citation.cfm?id=972779 http://www.mibel.cs.tsukuba.ac.jp/~myama/tfdf/index.html http://cm.bell-labs.com/cm/cs/who/doug/ssort.c
- [characters] Consistently add contains-XXX and contains-XXX-only functions
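  A minimal sketch of the naming pattern, shown for punctuation and whitespace; the function names are illustrative:

  ```clojure
  (defn contains-punctuation? [s]
    (boolean (re-find #"\p{P}" s)))

  (defn contains-punctuation-only? [s]
    (boolean (re-matches #"\p{P}+" s)))

  (defn contains-whitespace? [s]
    (boolean (re-find #"\s" s)))

  (defn contains-whitespace-only? [s]
    (boolean (re-matches #"\s+" s)))
  ```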
- [transformers] Move split-tokens-with-space to parsers
- [tagging] Add some kind of gen-dict-from-file. Check how with-open works. Is it reasonable to take a reader as a parameter?
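  A minimal sketch answering both questions: with-open closes the reader even when an exception is thrown, and taking an already-open reader is fine as long as the result is fully realized before returning. The names are illustrative and a dict is assumed to be a set of lines:

  ```clojure
  (require '[clojure.java.io :as io])

  (defn gen-dict-from-reader
    "Builds a dictionary from an already-open reader; the caller owns (and closes)
     the reader, so the result is forced with into before returning."
    [rdr]
    (into #{} (line-seq rdr)))

  (defn gen-dict-from-file
    "Opens the file itself; with-open guarantees the reader is closed afterwards."
    [f]
    (with-open [rdr (io/reader f)]
      (gen-dict-from-reader rdr)))
  ```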
- [tagging] Add defaulting to merge-tokens-with-space in tags algebra.
- [tagging] Add API DOCS to tagging module.
- [tagging] Add function to convert fdict -> dict.
- [tagging] Refactor tagging module so it works with langlab
- [ngrams] Correct ngrams module so it uses partition
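  A minimal sketch of what the partition-based version could look like; the function name is illustrative:

  ```clojure
  (defn gen-ngrams
    "Returns all contiguous n-grams of coll (step 1, no padding)."
    [n coll]
    (map vec (partition n 1 coll)))

  ;; (gen-ngrams 2 ["the" "quick" "brown" "fox"])
  ;; => (["the" "quick"] ["quick" "brown"] ["brown" "fox"])
  ```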
- [stopwords] Verify the Norwegian stopwords (no-sw). I had to do some uppercase conversions, so something might be wrong there. Checked by comparing the intersection with the Lucene stopwords (70 out of 119 are common).
- [stopwords] Refactor constants into functions
- [stopwords] Add docs to stopwords functions
- [stopwords] Convert en-drop-articles so it uses stopwords
- [stopwords] Add basic unit tests to stopwords functions
- [detectors] Create wrappers for language detection module in Apache Tika
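  A minimal wrapper sketch, assuming Tika's org.apache.tika.language.LanguageIdentifier (deprecated in newer Tika releases); the function name is illustrative:

  ```clojure
  (import '(org.apache.tika.language LanguageIdentifier))

  (defn detect-lang-tika
    "Returns the ISO 639 code Tika guesses for text, or nil when Tika is not
     reasonably certain about the guess."
    [^String text]
    (let [ident (LanguageIdentifier. text)]
      (when (.isReasonablyCertain ident)
        (.getLanguage ident))))
  ```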
- [stopwords] Add constants containing articles
- [stopwords] Create a module and add general stopwords filter
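  A minimal sketch covering the general filter together with the articles constant and the en-drop-articles conversion mentioned above; all names are illustrative:

  ```clojure
  (require '[clojure.string :as str])

  (def en-articles #{"a" "an" "the"})

  (defn drop-stopwords
    "Removes tokens that appear in stopword-set (compared case-insensitively)."
    [stopword-set tokens]
    (remove #(contains? stopword-set (str/lower-case %)) tokens))

  (defn en-drop-articles [tokens]
    (drop-stopwords en-articles tokens))

  ;; (en-drop-articles ["The" "cat" "sat" "on" "a" "mat"]) ;=> ("cat" "sat" "on" "mat")
  ```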
- [general] Refactor langlab-base -> langlab
- [readability] Correct test to the infix notation
- [readability] Correct counting characters to bi version
- [characters] Add functions for detecting/removing non-BMP characters
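  A minimal sketch working on code points (a non-BMP character is any code point above U+FFFF); the function names are illustrative:

  ```clojure
  (defn remove-non-bmp
    "Removes all code points outside the Basic Multilingual Plane."
    [^String s]
    (let [sb (StringBuilder.)]
      (loop [i 0]
        (if (< i (.length s))
          (let [cp (.codePointAt s i)]
            (when (<= cp 0xFFFF)
              (.appendCodePoint sb cp))
            (recur (+ i (Character/charCount cp))))
          (str sb)))))

  (defn contains-non-bmp?
    "True when s contains at least one character outside the BMP."
    [^String s]
    (not= s (remove-non-bmp s)))
  ```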
- [transformers] Add tests to transformers
- [parsers] Add tests to sentence splitters
- [parsers] Add simple tokenizer based on Analyzer from Lucene http://stackoverflow.com/questions/6334692/how-to-use-a-lucene-analyzer-to-tokenize-a-string See post by Ben McCann for Lucene 4.1
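  A sketch along the lines of the linked answer (Ben McCann's, for Lucene 4.1); assumes lucene-core and lucene-analyzers-common 4.1 on the classpath, and the function name is illustrative:

  ```clojure
  (import '(org.apache.lucene.analysis.standard StandardAnalyzer)
          '(org.apache.lucene.analysis.tokenattributes CharTermAttribute)
          '(org.apache.lucene.util Version)
          '(java.io StringReader))

  (defn lucene-tokenize
    "Tokenizes text with Lucene's StandardAnalyzer and returns the terms as a vector.
     Note that StandardAnalyzer also lowercases and drops English stopwords by default."
    [^String text]
    (let [analyzer (StandardAnalyzer. Version/LUCENE_41)
          stream   (.tokenStream analyzer "" (StringReader. text))
          term     (.addAttribute stream CharTermAttribute)]
      (.reset stream)
      (loop [tokens []]
        (if (.incrementToken stream)
          (recur (conj tokens (.toString term)))
          (do (.end stream)
              (.close stream)
              tokens)))))
  ```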
- [characters] Make use of the punctuation classes from Unicode (see http://www.fileformat.info/info/unicode/category/index.htm), e.g. via the regexes sketched below:
  - [Pc] Punctuation, Connector
  - [Pd] Punctuation, Dash
  - [Pe] Punctuation, Close
  - [Pf] Punctuation, Final quote (may behave like Ps or Pe depending on usage)
  - [Pi] Punctuation, Initial quote (may behave like Ps or Pe depending on usage)
  - [Po] Punctuation, Other
  - [Ps] Punctuation, Open
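  These categories map directly onto Java regex Unicode property classes, for example (the var names are illustrative):

  ```clojure
  (def dash-punct-re          #"\p{Pd}")  ; Punctuation, Dash
  (def open-punct-re          #"\p{Ps}")  ; Punctuation, Open
  (def close-punct-re         #"\p{Pe}")  ; Punctuation, Close
  (def initial-quote-punct-re #"\p{Pi}")  ; Punctuation, Initial quote
  (def final-quote-punct-re   #"\p{Pf}")  ; Punctuation, Final quote
  (def any-punct-re           #"\p{P}")   ; any punctuation category

  ;; (re-find dash-punct-re "well-known") ;=> "-"
  ```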
- [characters] Add a tokenizer based on ICU4j and its BreakIterator, which implements the Unicode segmentation rules: http://www.unicode.org/reports/tr29/ http://icu-project.org/apiref/icu4j/com/ibm/icu/text/BreakIterator.html More: http://site.icu-project.org/
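  A minimal sketch using ICU4j's BreakIterator word instance (UAX #29 segmentation); assumes icu4j on the classpath, and the function name is illustrative:

  ```clojure
  (require '[clojure.string :as str])
  (import '(com.ibm.icu.text BreakIterator)
          '(com.ibm.icu.util ULocale))

  (defn icu-tokenize
    "Splits text into word tokens according to the UAX #29 rules;
     whitespace-only segments are dropped."
    [^String text ^ULocale locale]
    (let [bi (BreakIterator/getWordInstance locale)]
      (.setText bi text)
      (loop [start (.first bi), end (.next bi), tokens []]
        (if (= end BreakIterator/DONE)
          tokens
          (let [segment (subs text start end)]
            (recur end (.next bi)
                   (if (str/blank? segment) tokens (conj tokens segment))))))))

  ;; (icu-tokenize "Dogs can't fly." ULocale/ENGLISH) ;=> ["Dogs" "can't" "fly" "."]
  ```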
- [characters] Add Java functions to the characters module:
  - containsPunctuation(String s)
  - containsPunctuationOnly(String s)
  - containsWhitespace(String s)
  - containsWhitespaceOnly(String s)