ticcltools 0.11 2023-10-28
[Ko van der Sloot]
* Now we use NFC-encoded Unicode strings everywhere (requires ticcutils >= 0.34)
* testrank script results were outdated since 0.10
* removed dependency on libtar
* added --follow option to TiCCL-indexer(NT)
* some code refactoring and cleanup
ticcltools 0.10 2023-02-28
[Ko van der Sloot]
* LDcalc:
- No longer filter out n-grams with common parts; this was too aggressive
- Removed some more commented-out old code
* chainclean: added a --caseless option. (Default is true)
* Removed Roaring versions of the code. Lacked maintenance for years.
* internally shifting towards UnicodeString in general
* a lot of C++ cleanup, with some refactoring, splitting up long blobs of code
ticcltools 0.9 2022-09-14
[Maarten van Gompel]
* Updated README
* added Docker stuff
ticcltools 0.8 2021-12-15
[Ko van der Sloot]
* using more recent functions from ticcutils
* use more code from ticcl_common
* attempt to solve https://github.com/LanguageMachines/ticcltools/issues/42
* some small code refactoring
ticcltools 0.7 2020-04-15
[Martin Reynaert]
* updated man pages
* updated README.md
[Ko van der Sloot]
Numerous bug fixes and additions. Added a .so for common functions.
The bitType is changed to uint64_t (the biggest int possible), which
triggered some code adaptations (values < 0 are not possible).
* TICCL-unk:
- some changes in UNK detection
- added a --hemp option
- create a .fore.clean file when a background corpus is merged in
* TICCL-stats:
- added a -n option to use a newline as delimiter
* TICCL-indexer(NT):
- better and faster implementation
- added --confstats option
* TICCL-LDcalc:
- added a --follow option for debugging purposes
- fix for https://github.com/LanguageMachines/ticcltools/issues/30
- added --low and --high parameters
* TICCL-rank:
- added a --follow option for debugging purposes
- added --subtractartifrqfeature1 and --subtractartifrqfeature2 options
- replaced pairs_combined ranking by median ranking
- added an n-gram filter
* TICCL-chain:
- added --nounk option
- fix for https://github.com/LanguageMachines/ticcltools/issues/38
- fix for https://github.com/LanguageMachines/ticcltools/issues/37
- use the alphabet file too with --alph
* TICCL-chainclean: new module to clean chain ranked files
* TICCL-anahash:
- accept lexicons without frequencies too. (also simple word lists)
- added a -o option
ticcltools 0.6 2018-06-05
[Ko van der Sloot]
Intermediate release, with a lot of new code to handle n-grams.
Also a lot of refactoring is done, for clearer and more maintainable code.
This is still work in progress.
* TICCL-unk:
- more extensive acronym detection
- fixed artifreq problems in 'clean' punctuated words
- added filters for 'unwanted' characters
- added a ligature filter to convert evil ligatures
- normalize all hyphens to a 'normal' one (-)
- use a better definition of punctuation (the Unicode character class is not
good enough to decide)
* TICCL-lexstat:
- the 'separator' symbol should get freq=0, so it isn't counted
- the clip value is added to the output filename
* TICCL-indexer:
- indexer and indexerNT now produce the same output, using different
strategies when a --foci file is used.
* TICCL-LDcalc:
major overhaul for n-grams
- added an ngram point column to the output (so NOT backward compatible!)
- produce a '.short' list for short word corrections
- produce a '.ambi' file with a list of n-grams related to short words
- prune a lot of ngrams from the output
* TICCL-rank:
- output is sorted now
- honor the ngram-points from the new LDcalc. (so NOT backward compatible!)
* TICCL-chain: new module to chain ranked files
* TICCL-lexclean:
- added a -x option for 'inverse' alphabet
* TICCL-anahash:
- added a --list option to produce a list of words and anagram values
[Maarten van Gompel]
* added metadata file: codemeta.json
ticcltools 0.5 2018-02-19
[Ko van der Sloot]
* updated configuration, also for Mac OSX
* use of more ticcutils stuff: diacritics filter
* added a TICCL-mergelex program
* the OMP_THREAD_LIMIT environment variable was ignored sometimes
* TICCL-unk:
- fixed a problem in artifreq handling
- changed acronym detection (work in progress)
- added -o option
* TICCL-lexstat:
- added TTR output
- added -o option
* TICCL-indexer:
- now also handles a --foci file, with some speed-up
- added a -t option
* TICCL-LDcalc:
- be less picky on a few wrong lines in the data
* added some tests
* when libroaring is installed we build roaring versions of some modules (experimental)
* updated man pages
ticcltools 0.4 2017-04-04
[Ko van der Sloot]
* first official release.
- added functions to test on Word2Vec datafiles
- refactoring and modernizing stuff all around
ticcltools 0.3 2016-01-21
* upload ready
ticcltools 0.2 2016-01-14
[Ko van der Sloot]
* repository moved to GitHub
* Travis support
* first 'distributable' version
* added TICCL-stats program
* added W2V-near and W2V-dist programs based on the new libticcl lib
[0.1] Ko vd Sloot
started autoconfiscating