- [general] Correct the apidocs so they render properly with the new codox markdown support
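  A minimal project.clj sketch for this, assuming codox's :metadata {:doc/format :markdown} option described in the codox README; the plugin version below is illustrative:

  ```clojure
  ;; Renders all docstrings as markdown project-wide; adjust the codox version as needed.
  (defproject langlab "0.0.0-SNAPSHOT"
    :plugins [[codox "0.8.10"]]
    :codox {:metadata {:doc/format :markdown}})
  ```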
- [general] Add a string similarity library based either on simmetrics or on SecondString: http://web.archive.org/web/20081224065617/http://www.dcs.shef.ac.uk/~sam/simmetrics.html http://web.archive.org/web/20081224234350/http://www.dcs.shef.ac.uk/~sam/stringmetrics.html http://sourceforge.net/projects/secondstring/ http://secondstring.sourceforge.net/
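  Whichever backend is picked, the Clojure-facing API could look roughly like the sketch below; a plain Levenshtein distance stands in for the library call, and all names are illustrative:

  ```clojure
  (defn levenshtein-distance
    "Classic dynamic-programming edit distance between two strings."
    [^String a ^String b]
    (let [row0 (vec (range (inc (count b))))]
      (last
        (reduce
          (fn [prev-row i]
            (reduce
              (fn [row j]
                (conj row
                      (min (inc (nth prev-row j))              ; deletion
                           (inc (peek row))                    ; insertion
                           (+ (nth prev-row (dec j))           ; substitution
                              (if (= (nth a (dec i)) (nth b (dec j))) 0 1)))))
              [i]
              (range 1 (inc (count b)))))
          row0
          (range 1 (inc (count a)))))))

  (defn similarity
    "Normalized similarity in [0,1]; 1.0 means identical strings."
    [a b]
    (if (and (empty? a) (empty? b))
      1.0
      (- 1.0 (/ (levenshtein-distance a b)
                (double (max (count a) (count b)))))))

  ;; (similarity "kitten" "sitting") ;=> ~0.571 (edit distance 3, max length 7)
  ```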
- [general] Consider using the perforate lein plugin to add benchmarks: https://github.com/davidsantiago/perforate Note, however, that it targets microbenchmarks.
- [tagging] Review the tests for get-min-tokens and get-max-tokens.
- [tagging] Should get-min/max-tokens really return nil on empty dict?
- [detectors] Add tests for language detection.
- [detectors] Add tests for encoding detection.
- [detectors] Add an option for seed initialization of the Cybozu lang detector
- [detectors] Investigate the ICU language detection problems. Why does it work only for texts with diacritics removed? Does manual removal affect detection of non-Latin alphabet texts?
- [general] Check the ICU automatic charset detection + decoding facilities. This may solve the problems with different encodings in files.
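  A minimal sketch using ICU4j's CharsetDetector, which both guesses the charset and decodes the bytes; assumes icu4j on the classpath, and the function name is illustrative:

  ```clojure
  (import '(com.ibm.icu.text CharsetDetector))

  (defn detect-and-decode
    "Guesses the charset of raw bytes and decodes them with the best match;
     returns {:charset name :text decoded-string}, or nil when nothing matches."
    [^bytes raw]
    (when-let [match (.detect (.setText (CharsetDetector.) raw))]
      {:charset (.getName match)
       :text    (.getString match)}))
  ```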
- [characters] Add a function to detect/normalize some UTF-8 characters, for example (see the sketch after this list):
  - “…” -> “...”
  - “„” -> “"”
  - “”” -> “"”
  - “—” -> “-”
  - hard (non-breaking) spaces -> regular spaces
  - what else?
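  A minimal sketch of such a normalizer using clojure.string/escape with a character map; the function name and the exact replacement table are illustrative:

  ```clojure
  (require '[clojure.string :as str])

  (def ^:private fancy-char-map
    {\u2026 "..."   ; horizontal ellipsis
     \u201E "\""    ; double low-9 quotation mark
     \u201D "\""    ; right double quotation mark
     \u2014 "-"     ; em dash
     \u00A0 " "})   ; no-break (hard) space

  (defn normalize-special-chars
    "Replaces selected 'fancy' Unicode characters with plain ASCII equivalents."
    ([s] (normalize-special-chars s fancy-char-map))
    ([s char-map] (str/escape s char-map)))
  ```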
- [parsers] There is a (closed source?) sentence parser library from Scott Piao http://text0.mib.man.ac.uk:8080/scottpiao/apidocs/api_sentpar_breaker.jsp https://sites.google.com/site/scottpiaosite/ http://www.research.lancs.ac.uk/portal/en/people/Scott-Piao/
- [stemmers] Should I rewrite the stemmers in the spirit of create and apply? It should yield a performance improvement, but what about concurrency? An alternative is to keep all the stemmers in delays.
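  A minimal sketch of the create/apply split, assuming the Snowball Java stemmers (org.tartarus.snowball.ext.*) are on the classpath; the names are illustrative and the locking is just one possible answer to the concurrency question:

  ```clojure
  (import '(org.tartarus.snowball.ext EnglishStemmer))

  (defn make-en-stemmer
    "Creates the (stateful, non-thread-safe) Snowball stemmer once."
    []
    (EnglishStemmer.))

  (defn apply-stemmer
    "Applies a previously created stemmer to a single token; the lock serializes
     access because Snowball stemmers keep mutable state."
    [^EnglishStemmer stemmer ^String token]
    (locking stemmer
      (.setCurrent stemmer token)
      (.stem stemmer)
      (.getCurrent stemmer)))

  ;; Keeping the shared instance in a delay postpones construction until first use:
  (def en-stemmer (delay (make-en-stemmer)))
  ;; (apply-stemmer @en-stemmer "running") ;=> "run"
  ```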
- [stemmers] Should I rename multi-stemmers to lemmatizers and move the morpha stemmer (a lemmatizer) there?
- [detectors] Look up the new Cybozu method for short texts (tweets): http://shuyo.wordpress.com/2012/05/17/short-text-language-detection-with-infinity-gram/ Difficult because a maximal substring implementation is needed: http://code.google.com/p/esaxx/
- [ngrams] Efficient computation of all n-grams using suffix arrays: http://dl.acm.org/citation.cfm?id=972779 http://www.mibel.cs.tsukuba.ac.jp/~myama/tfdf/index.html http://cm.bell-labs.com/cm/cs/who/doug/ssort.c
- [characters] Consistently add contains-XXX and contains-XXX-only functions
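  A minimal sketch of the naming pattern, shown for punctuation and whitespace; the function names are illustrative:

  ```clojure
  (defn contains-punctuation? [s]
    (boolean (re-find #"\p{P}" s)))

  (defn contains-punctuation-only? [s]
    (boolean (re-matches #"\p{P}+" s)))

  (defn contains-whitespace? [s]
    (boolean (re-find #"\s" s)))

  (defn contains-whitespace-only? [s]
    (boolean (re-matches #"\s+" s)))
  ```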
- [transformers] Move split-tokens-with-space to parsers
- [tagging] Add some kind of gen-dict-from-file. Check how with-open works. Is it reasonable to take a reader as a parameter?
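  A minimal sketch answering both questions: with-open closes the reader even when an exception is thrown, and taking an already-open reader is fine as long as the result is fully realized before returning. The names are illustrative and a dict is assumed to be a set of lines:

  ```clojure
  (require '[clojure.java.io :as io])

  (defn gen-dict-from-reader
    "Builds a dictionary from an already-open reader; the caller owns (and closes)
     the reader, so the result is forced with into before returning."
    [rdr]
    (into #{} (line-seq rdr)))

  (defn gen-dict-from-file
    "Opens the file itself; with-open guarantees the reader is closed afterwards."
    [f]
    (with-open [rdr (io/reader f)]
      (gen-dict-from-reader rdr)))
  ```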
- [tagging] Add defaulting to merge-tokens-with-space in tags algebra.
- [tagging] Add API DOCS to tagging module.
- [tagging] Add function to convert fdict -> dict.
- [tagging] Refactor tagging module so it works with langlab
- [ngrams] Correct ngrams module so it uses partition
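  A minimal sketch of what the partition-based version could look like; the function name is illustrative:

  ```clojure
  (defn gen-ngrams
    "Returns all contiguous n-grams of coll (step 1, no padding)."
    [n coll]
    (map vec (partition n 1 coll)))

  ;; (gen-ngrams 2 ["the" "quick" "brown" "fox"])
  ;; => (["the" "quick"] ["quick" "brown"] ["brown" "fox"])
  ```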
- [stopwords] Verify the Norwegian stopwords (no-sw). I had to do some uppercase conversions, so something might be wrong there. Checked by comparing the intersection with the Lucene stopwords (70 out of 119 are common).
- [stopwords] Refactor constants into functions
- [stopwords] Add docs to stopwords functions
- [stopwords] Convert en-drop-articles so it uses stopwords
- [stopwords] Add basic unit tests to stopwords functions
- [detectors] Create wrappers for language detection module in Apache Tika
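  A minimal wrapper sketch, assuming Tika's org.apache.tika.language.LanguageIdentifier (deprecated in newer Tika releases); the function name is illustrative:

  ```clojure
  (import '(org.apache.tika.language LanguageIdentifier))

  (defn detect-lang-tika
    "Returns the ISO 639 code Tika guesses for text, or nil when Tika is not
     reasonably certain about the guess."
    [^String text]
    (let [ident (LanguageIdentifier. text)]
      (when (.isReasonablyCertain ident)
        (.getLanguage ident))))
  ```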
- [stopwords] Add constants containing articles
- [stopwords] Create a module and add general stopwords filter
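  A minimal sketch covering the general filter together with the articles constant and the en-drop-articles conversion mentioned above; all names are illustrative:

  ```clojure
  (require '[clojure.string :as str])

  (def en-articles #{"a" "an" "the"})

  (defn drop-stopwords
    "Removes tokens that appear in stopword-set (compared case-insensitively)."
    [stopword-set tokens]
    (remove #(contains? stopword-set (str/lower-case %)) tokens))

  (defn en-drop-articles [tokens]
    (drop-stopwords en-articles tokens))

  ;; (en-drop-articles ["The" "cat" "sat" "on" "a" "mat"]) ;=> ("cat" "sat" "on" "mat")
  ```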
- [general] Refactor langlab-base -> langlab
- [readability] Correct test to the infix notation
- [readability] Correct counting characters to bi version
- [characters] Add functions for detecting/removing non-BMP characters
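  A minimal sketch working on code points (a non-BMP character is any code point above U+FFFF); the function names are illustrative:

  ```clojure
  (defn remove-non-bmp
    "Removes all code points outside the Basic Multilingual Plane."
    [^String s]
    (let [sb (StringBuilder.)]
      (loop [i 0]
        (if (< i (.length s))
          (let [cp (.codePointAt s i)]
            (when (<= cp 0xFFFF)
              (.appendCodePoint sb cp))
            (recur (+ i (Character/charCount cp))))
          (str sb)))))

  (defn contains-non-bmp?
    "True when s contains at least one character outside the BMP."
    [^String s]
    (not= s (remove-non-bmp s)))
  ```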
- [transformers] Add tests to transformers
- [parsers] Add tests to sentence splitters
- [parsers] Add simple tokenizer based on Analyzer from Lucene http://stackoverflow.com/questions/6334692/how-to-use-a-lucene-analyzer-to-tokenize-a-string See post by Ben McCann for Lucene 4.1
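  A sketch along the lines of the linked answer (Ben McCann's, for Lucene 4.1); assumes lucene-core and lucene-analyzers-common 4.1 on the classpath, and the function name is illustrative:

  ```clojure
  (import '(org.apache.lucene.analysis.standard StandardAnalyzer)
          '(org.apache.lucene.analysis.tokenattributes CharTermAttribute)
          '(org.apache.lucene.util Version)
          '(java.io StringReader))

  (defn lucene-tokenize
    "Tokenizes text with Lucene's StandardAnalyzer and returns the terms as a vector.
     Note that StandardAnalyzer also lowercases and drops English stopwords by default."
    [^String text]
    (let [analyzer (StandardAnalyzer. Version/LUCENE_41)
          stream   (.tokenStream analyzer "" (StringReader. text))
          term     (.addAttribute stream CharTermAttribute)]
      (.reset stream)
      (loop [tokens []]
        (if (.incrementToken stream)
          (recur (conj tokens (.toString term)))
          (do (.end stream)
              (.close stream)
              tokens)))))
  ```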
- [characters] Make use of the punctuation classes from Unicode (see http://www.fileformat.info/info/unicode/category/index.htm), e.g. via the regexes sketched below:
  - [Pc] Punctuation, Connector
  - [Pd] Punctuation, Dash
  - [Pe] Punctuation, Close
  - [Pf] Punctuation, Final quote (may behave like Ps or Pe depending on usage)
  - [Pi] Punctuation, Initial quote (may behave like Ps or Pe depending on usage)
  - [Po] Punctuation, Other
  - [Ps] Punctuation, Open
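  These categories map directly onto Java regex Unicode property classes, for example (the var names are illustrative):

  ```clojure
  (def dash-punct-re          #"\p{Pd}")  ; Punctuation, Dash
  (def open-punct-re          #"\p{Ps}")  ; Punctuation, Open
  (def close-punct-re         #"\p{Pe}")  ; Punctuation, Close
  (def initial-quote-punct-re #"\p{Pi}")  ; Punctuation, Initial quote
  (def final-quote-punct-re   #"\p{Pf}")  ; Punctuation, Final quote
  (def any-punct-re           #"\p{P}")   ; any punctuation category

  ;; (re-find dash-punct-re "well-known") ;=> "-"
  ```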
- [characters] Add a tokenizer based on ICU4j and its BreakIterator, which implements the Unicode segmentation rules: http://www.unicode.org/reports/tr29/ http://icu-project.org/apiref/icu4j/com/ibm/icu/text/BreakIterator.html More: http://site.icu-project.org/
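  A minimal sketch using ICU4j's BreakIterator word instance (UAX #29 segmentation); assumes icu4j on the classpath, and the function name is illustrative:

  ```clojure
  (require '[clojure.string :as str])
  (import '(com.ibm.icu.text BreakIterator)
          '(com.ibm.icu.util ULocale))

  (defn icu-tokenize
    "Splits text into word tokens according to the UAX #29 rules;
     whitespace-only segments are dropped."
    [^String text ^ULocale locale]
    (let [bi (BreakIterator/getWordInstance locale)]
      (.setText bi text)
      (loop [start (.first bi), end (.next bi), tokens []]
        (if (= end BreakIterator/DONE)
          tokens
          (let [segment (subs text start end)]
            (recur end (.next bi)
                   (if (str/blank? segment) tokens (conj tokens segment))))))))

  ;; (icu-tokenize "Dogs can't fly." ULocale/ENGLISH) ;=> ["Dogs" "can't" "fly" "."]
  ```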
- [characters] Add Java functions to the characters module:
  - containsPunctuation(String s)
  - containsPunctuationOnly(String s)
  - containsWhitespace(String s)
  - containsWhitespaceOnly(String s)