NEWS
====
Versioning
----------
Releases will be numbered with the following semantic versioning format:
<major>.<minor>.<patch>
And constructed with the following guidelines:
* Breaking backward compatibility bumps the major (and resets the minor
and patch)
* New additions without breaking backward compatibility bump the minor
(and reset the patch)
* Bug fixes and misc changes bump the patch
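As a minimal sketch of the bump rules above (written in Python purely for illustration; the `bump` helper is hypothetical and not part of **termco**):

```python
def bump(version: str, change: str) -> str:
    """Return the next version for a 'major', 'minor', or 'patch' change."""
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "major":   # breaking change: reset minor and patch
        return f"{major + 1}.0.0"
    if change == "minor":   # backward-compatible addition: reset patch
        return f"{major}.{minor + 1}.0"
    if change == "patch":   # bug fix or misc change
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change!r}")


print(bump("0.4.3", "minor"))
print(bump("0.4.3", "patch"))
```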
termco 0.5.0 -
----------------------------------------------------------------
BUG FIXES
* `ngram_collocations` did not properly merge the **quanteda** outputs resulting
in the `length` column being replicated multiple times. Additionally, length
was integer whereas the other ngram measures are numeric resulting in a
**data.table** warning in `melt`. Both of these issues have been addressed.
* `colo` did not copy a single term to the clipboard with quotes. See
issue #50.
NEW FEATURES
* `plot_upset` added to enable exploration of overlapping intersections between
`term_count` categories: http://caleydo.org/tools/upset.
* `get_text` added to extract the original text associated with particular tags.
* `frequent_terms_co_occurrence` added to view the co-occurrence between frequent
terms. A combination of `frequent_terms` and `tag_co_occurrence`.
* `term_before`, `term_after`, & `term_first` added to get frequencies of terms
relative to other terms or specific locations.
* `token_count` added to count the occurrence of tokens within a vector of
strings. This function differs from `term_count` in that `term_count` is
regex based, allowing for fuzzy matching. This function only searches for
lower cased tokens (words, number sequences, or punctuation) providing a well
defined counting function that is faster than `term_count` but less flexible.
* `as_term_list` added. This is a convenience function to convert a vector of
terms or a quanteda `dictionary` into a named list.
* `combine_counts` added to enable combining `term_count` and `token_count`
objects.
* `match_word` added to match words to regular expressions. Roughly equivalent
to **qdap**'s `term_match`.
* `read_term_list`/`write_term_list` added to aid in the reading in/writing out
and formatting of term list files.
* `classification_template` added to manually add a classification script
template. This template has a suggested **termco** based workflow that may be
useful for classification projects.
* `test_regex` added to test an atomic vector, list, or term list of regexes for
validity.
* `mutate_counts` added to apply a normalizing function to all the term columns
of a `term_count`/`token_count` object without stripping the attributes and
class.
* `drop_terms` added to allow the user to explore/iterate on a term list and
drop terms prior to `term_count` use without manually editing an
external term list file.
* `tidy_counts` added to convert a wide matrix of counts to tidy form (tags are
stretched long-wise with corresponding counts of tags).
* `set_meta_tags` added for setting the `metatags` attribute on a
`term_count`/`token_count` object. This can also be controlled by separators
in the term/token list passed to `term_count`/`token_count`.
* `select_counts` added for safely selecting `term_count`/`token_count` object
columns without stripping attributes. Works like `?dplyr::select`.
MINOR FEATURES
* `important_terms` picks up a plot method corresponding to the `frequent_terms`
plot method.
* `term_count` checks for duplicate categories within tiers for hierarchical
term lists.
* `read_term_list` checks for valid regex.
IMPROVEMENTS
* `validate_model` now uses `classify` before validating to assign tags.
* `tag_co_occurrence` used a grid + base plotting approach that required
restarting the graphics device between plots. This dependency has been
replaced with a dependency on **ggraph** for plotting networks as grid
objects.
* `plot.validate_model` now shows tag counts in the sample to provide a relative
importance of the accuracy in making decisions.
* Open, unescaped "or" regexes (i.e., `|)`, an unescaped pipe followed by a
closing group character) are now caught, with a warning, by `read_term_list`
and thus `term_count`.
* `metatags` is an official attribute that can be used to group common tags
together. This is common in qualitative coding where one tags text and then
groups these subtags together into coherent metatags. This is used by
`tidy_counts` and can be used by other future features.
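To show why an open "or" such as `|)` is worth a warning, here is a rough Python sketch. The `has_open_or` helper and its pattern are assumptions made for illustration, not the check that `read_term_list` actually performs. The empty alternative left by the dangling pipe matches the empty string, so the regex matches every text.

```python
import re

# Hypothetical detector: a pipe that is not backslash-escaped, directly
# followed by ")". Simplifications: pipes inside character classes such as
# "[|)]" and pipes preceded by an escaped backslash are not handled.
OPEN_OR = re.compile(r"(?<!\\)\|\)")


def has_open_or(pattern: str) -> bool:
    """Return True if the pattern contains an unescaped '|' right before ')'."""
    return OPEN_OR.search(pattern) is not None


texts = ["a cat sat", "nothing relevant here"]

# The trailing empty alternative matches the empty string, so every text "hits".
open_hits = [t for t in texts if re.search(r"(cat|dog|)", t)]
closed_hits = [t for t in texts if re.search(r"(cat|dog)", t)]

print(has_open_or(r"(cat|dog|)"), len(open_hits))
print(has_open_or(r"(cat|dog)"), len(closed_hits))
```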
CHANGES
* The **stopwords** package replaces the **tm** package for providing default
stopword lists. The **stopwords** package is more comprehensive and lighter
weight. This change allows the removal of the **tm** package as a dependency.
Suggested by Ken Benoit issue #69.
* `important_terms` now uses `quanteda::dfm_tfidf` rather than `tm::weightTfIdf`.
This means the tf-idf weighting is done in base 10 log rather than base 2 as
done with the **tm** package. Suggested by Ken Benoit issue #69.
* `as_dtm` & `as_tdm` moved to the **gofastr** package where they can be used by
other packages and their classed objects. **termco** re-exports the two
functions.
* `summary.validate_model` used to return `n` which was the number of tags from
the `termco` object. It now gives n.tags and n.classified to be more explicit
about counts of potential tags and tags actually assigned by `classify`.
* `colo` no longer uses non-standard evaluation; terms must be quoted.
* `ngram_collocations` has been renamed to `frequent_ngrams` for better clarity
in what the function does and as a counterpart to `frequent_terms`.
* `update_names` renamed to `rename_tags` to be consistent with naming
conventions.
* `term_cols` renamed to `tag_cols` to be consistent with naming
conventions.
* `token_count` has no print method of its own anymore. The `print` method
for `term_count` was made more generic and works for both since `token_count`
inherits from `term_count`. This is easier to maintain.
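To illustrate the log-base change noted for `important_terms` above, the Python sketch below compares base 2 and base 10 weights. It assumes the simplest form of tf-idf, raw term frequency times log(N / df), and uses made-up counts. Any other differences between the default weighting schemes of **tm** and **quanteda** are not modeled here. Under that assumption the two bases differ only by the constant factor log2(10), so the ordering of terms is the same.

```python
import math


def tfidf(tf: int, df: int, n_docs: int, base: float) -> float:
    """Raw term frequency times log(n_docs / df) in the given base."""
    return tf * math.log(n_docs / df, base)


n_docs = 100
# term -> (term frequency, document frequency); illustrative numbers only
terms = {"budget": (12, 5), "teacher": (30, 40), "the": (90, 100)}

base2 = {t: tfidf(tf, df, n_docs, 2) for t, (tf, df) in terms.items()}
base10 = {t: tfidf(tf, df, n_docs, 10) for t, (tf, df) in terms.items()}

rank2 = sorted(terms, key=base2.get, reverse=True)
rank10 = sorted(terms, key=base10.get, reverse=True)

print(rank2 == rank10)                      # same ordering under both bases
print(base2["budget"] / base10["budget"])   # constant factor, log2(10)
```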
termco 0.4.0 - 0.4.3
----------------------------------------------------------------
NEW FEATURES
* `term_cols` & `group_cols` added to quickly grab just term or grouping
variable columns.
* `as_dtm` & `as_tdm` added to convert a `term_count` object into a
`tm::DocumentTermMatrix` or `tm::TermDocumentMatrix` object.
* `update_names` added to allow for safe renaming of a `term_count` object's
columns while also updating its attributes.
* `term_list_template` added for generating and writing term list templates.
IMPROVEMENTS
* `classify` picks up a new default `ties.method` type of `"probabilities"`.
This uses the probability distribution from all tags assigned to randomly
break ties based on that distribution.
* `term_count` gets an auto-collapse feature for hierarchical `term.list`s with
duplicate names. A message is printed telling the user this is happening. To
get the hierarchical coverage use `attributes(x2)[['pre_collapse_coverage']]`.
* `accuracy` now uses standard model evaluation measures of macro/micro averaged
accuracy, precision, and recall as outlined by Dan Jurafsky & Chris Manning.
See https://www.youtube.com/watch?v=OwwdYHWRB5E&index=31&list=PL6397E4B26D00A269
for details on the methods.
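For readers unfamiliar with the macro/micro distinction mentioned above, here is a small Python sketch of averaged precision and recall. It is not the code behind `accuracy`, and the per-tag counts are invented. Macro averaging takes the mean of the per-tag scores, so rare tags weigh as much as common ones. Micro averaging pools the counts first, so common tags dominate.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple:
    """Precision and recall from true positive, false positive, false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def macro_micro(counts: dict) -> dict:
    """counts maps tag -> (tp, fp, fn); returns macro and micro (precision, recall)."""
    per_tag = [precision_recall(*c) for c in counts.values()]
    macro = (
        sum(p for p, _ in per_tag) / len(per_tag),
        sum(r for _, r in per_tag) / len(per_tag),
    )
    pooled = [sum(c[i] for c in counts.values()) for i in range(3)]
    micro = precision_recall(*pooled)
    return {"macro": macro, "micro": micro}


# Hypothetical counts: one frequent, well-classified tag and one rare, poorly classified tag
scores = macro_micro({"common_tag": (90, 10, 10), "rare_tag": (1, 1, 9)})
print(scores)
```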
CHANGES
* `plot.tag_co_occurrence` uses a bubble-dotplot for the right hand graph rather
than the older bar plot. This allows for tag size to be displayed in addition
to average number of other tags to determine if the tag co-occurrence is a
meaningful number of tags to give additional attention to. Use `tag = TRUE`
for the old behavior.
* `accuracy` was renamed to `evaluate` to be more informative as well as a verb.
termco 0.3.0 - 0.3.6
----------------------------------------------------------------
BUG FIXES
* `colo` returned list rather than string if a single term was passed. Spotted
by Steve Simpson. See issue #12.
* `term_count` did not handle hierarchical `term.list` correctly due to a
reordering done by **data.table** (when `group.vars` not `= TRUE`). This
has been corrected.
* Column ordering was not respected by `print.term_count`.
* `colo` did not copy to the clip board when `copy2clip` was `TRUE` and a single
expression was passed to `...`.
NEW FEATURES
* `important_terms` added to complement `frequent_terms` allowing tf-idf
weighted terms to rise to the top.
* `collapse_tags` added to combine tags/columns from `term_count` object without
stripping the `term_count` class and attributes.
MINOR FEATURES
* `plot_counts` picks up a `drop` argument to enable terms not found (if `x` is
a `as_terms` object created from a `term_count` object) to be retained in the
bar plot. Suggested by Steve Simpson. See issue #18.
IMPROVEMENTS
* `colo` automatically adds a group parenthesis around `...` regexes to protect
the grouping explicitly. This is useful when a regex uses or pipes (`|`),
which would otherwise create an unintended, overly aggressive expression
(see #20).
termco 0.2.0
----------------------------------------------------------------
NEW FEATURES
* `validate_model` and `assign_validation_task` added to allow for human
assessment of how accurately a model is functioning.
CHANGES
* `probe_colo_list`, `probe_colo_plot_list`, & `probe_colo_plot` all use
`search_term_collocations` under the hood rather than `search_term` +
`frequent_terms`.
termco 0.1.0
----------------------------------------------------------------
BUG FIXES
* `plot.term_count` did not properly handle weighting. This has been fixed and
allows for `"count"` as a choice.
* `search_term_which` (also `search_term`) did not treat the `and` argument
correctly. `and` was treated identically to the `not` argument.
NEW FEATURES
* `split_data` added for easy creation of training and testing data.
* `classification_project` added to make a classification modeling project
template.
* `plot_cum_percent` added for cumulative percent plot of frequent terms.
* `probe_` family of functions added to easily make lists of function calls for
exploration of the frequent terms in the context of the data. Functions include:
`probe_list`, `probe_colo_list`, `probe_colo_plot_list`, & `probe_colo_plot`.
* `hierarchical_coverage` added to allow exploration of the unique coverage of a
text vector by a term after partitioning out the elements matched by previous
terms.
* `tag_co_occurrence` added to explore tag co-occurrences.
* `search_term_collocations` added as a convenience wrapper for `search_term`
+ `frequent_terms`. (Thanks to Steve Simpson)
MINOR FEATURES
* `plot_freq` picks up a `size` argument.
IMPROVEMENTS
* `term_count` now can be used in a hierarchical fashion. A list of regexes can
be passed and counted and then a second (or more) pass can be taken with a new
set of regexes on only those rows/text elements that were left untagged
(count `rowSums` is zero). This is accomplished by passing a `list` of
`list`s of regexes. Thanks to Steve Simpson for suggesting this feature.
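The hierarchical counting described above can be sketched in Python as follows. This illustrates the idea only and is not **termco**'s implementation; the function names, example texts, and regexes are all hypothetical. Each tier is a mapping of tag to regex, and a later tier is applied only to texts whose counts so far sum to zero.

```python
import re


def count_tier(texts, tier):
    """Count case-insensitive regex matches per tag for each text."""
    return [
        {tag: len(re.findall(rx, text, flags=re.IGNORECASE)) for tag, rx in tier.items()}
        for text in texts
    ]


def hierarchical_count(texts, tiers):
    """Apply each tier only to texts left untagged by the earlier tiers."""
    counts = [dict() for _ in texts]
    untagged = list(range(len(texts)))
    for tier in tiers:
        for row in counts:              # every row gets a column for every tag
            for tag in tier:
                row.setdefault(tag, 0)
        tier_counts = count_tier([texts[i] for i in untagged], tier)
        for i, tier_row in zip(untagged, tier_counts):
            counts[i].update(tier_row)
        # keep only rows whose counts still sum to zero for the next pass
        untagged = [i for i in untagged if sum(counts[i].values()) == 0]
    return counts


texts = ["The teacher read a book", "Students like the teacher", "It rained all day"]
tiers = [{"reading": r"\bread|\bbook"}, {"school": r"teacher|student"}]
result = hierarchical_count(texts, tiers)
for text, row in zip(texts, result):
    print(text, row)
```

Note that the first text is never counted for the second tier even though it contains "teacher", because it was already tagged by the first tier.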
termco 0.0.1
----------------------------------------------------------------
This package is a small suite of functions used to count terms and substrings
in strings.