- `reorder-columns` can work on grouped dataset now
- arrays of 2 element arrays behave as expected on dataset creation (#142)
Deps updated
Documentation changed to be generated by Clay instead of RMarkdown
- semi and anti joins fail on table containing missing values, multi columns and duplicated rows
Deps updated to fix `j/left-join` issue.
- join columns should consider `nil` as missing value only, discussion
- `:nil-missing?` in more places needed (group-by operations), discussion
- changes to the `group-by` documentation PR115, thanks to Marshall
- reflection warning for `Collections/shuffle` removed
- Extended documentation for `dataset` (copied from TMD), #112
- `rows` accepts `:nil-missing?` (default: true) and `copying?` (default: false) options.
Deps updated
- `:hashing` is available for single column joins too
- `:hashing` option determines method of creating an index for multicolumn joins (was `hash`, is `identity`)
- #108 - hashing replaced with packing data into the sequence
Deps updated
- dataset from singleton creation generated from wrong structure
- `map-rows` to map each row and produce new columns
- `rows` can return sequence of vectors (`:as-vecs`)
- balanced k-fold partitioning as proposed in #92 by @behrica
Updated to TMD v7
Differences:
- the order of columns is persisted in more cases
- the order of groups in grouped dataset can be random
- doc strings for every function, #87, #88
- aggregate-columns should default to all columns when called without a column selector #91
- create functions for packing / unpacking columns to arrays #82
- [breaking] when a dataset file does not exist, throw an exception #84, #85
Clojure upgraded to 1.11.1
- `separate-column` infers column names when function is used and `target-columns` is `nil`, #78
- [breaking][minor] `separate-column` replaces source column with target in every case
- replace `clojure.core/pmap` with `dtype-next` version (related to #325)
- `get-entry` introduced
- [#77] `anti-join` and `semi-join` bugs when tables contain missing values
- `crosstab` - cross tabulation
- `pivot->longer` `:coerce-to-number` option added
- [breaking] `pivot->wider` no longer coerces column names to strings, it's up to user
- predicates should behave as in Clojure (discussion)
TMD version bump
- [breaking] `replace-missing` up/down strategies clarified. `:down` is replaced by `:downup` and `:up` is replaced by `:updown`. `:down` and `:up` work only in one direction now. techascent/tech.ml.dataset#305
- Wrong way of selecting columns for joins (shouldn't be a set). https://clojurians.zulipchat.com/#narrow/stream/151924-data-science/topic/complete.20ala.20R/near/286277344
- data frame term in the title of docs (discussion)
- joins can accept different names for left/right datasets
- `cross-join`, `expand` and `complete` introduced
- removed setting `*warn-on-reflection*`
- [breaking] creation of singleton dataset adds an error message as a column by default (discussion)
Version bump
- docstring for `unroll` and `fold-by` by @holyjak (#60 and #61)
- [#58] - editor friendly api file
- #57 - InputStream should be dispatched first (the flow now: tries to create a dataset and, if it fails, packs an object as a singleton)
- `select-rows` accepts `IFn` for row selection.
- [breaking] #54, #56 - `pipeline` namespace is stripped, all functions are moved to metamorph library. This is a temporary solution before removing this namespace completely. Pipelined versions of functions will be moved to metamorph as well later.
- [#49] added docstring to `add-column`
- [#53] summary prefix ignored for aggregate (when fn[ds] is passed)
- Documented columns / rows functions PR52
- Reference to original to lifted functions metadata for pipelines PR51
- alias for api functions in reference (was: `api`, is: `tc`)
- `replace-missing` on grouped dataset has swapped arguments
- `update-columns` on grouped dataset
- [#43] Align with TMD for dataset creation from a map of sequences.
- [breaking] creation from tensor is `:as-rows` now
- [#42] [breaking] `add-column` default strategy is `:strict` now.
- [#41] dataset name not set on tensor path
TMD upgrade, no changes in TC
TMD upgrade
- [#36] `reorder-columns` on empty dataset returns nil
- `aggregate-columns` didn't keep column order (#35)
- `pipeline` functions have `doc` copied from original ones
- `split` can turn off shuffling now (`:shuffle?` option)
- `split` `:holdouts` - sequence of consecutive holdouts
tech.ml.dataset version bump, this introduces the change of the order of the groups after `group-by` operation
- `split` `:holdout` supports any number of splits (minimum 2) [#28]
- `split` supports `split-names` to provide custom names for subdatasets
- `concat` and `concat-copying` are working with grouped datasets
- `kfold` split failed on small number of rows (due to `partition-all` behaviour)
- `split->seq` to return train/test splits as a sequence of datasets or as map of sequences for grouped datasets
- [breaking] `tablecloth.pipeline` returns a map with dataset under `:metamorph/data` key (see metamorph)
- [breaking] `split` now returns a dataset or grouped dataset with two new columns indicating train/test and split id. See `split->seq` for previous behaviour.
- `without-grouping->` threading macro which allows operations on a grouped dataset treated as a regular one.
- `group-by` accepts any java.util.Map for a collection of indexes (use LinkedHashMap to persist an order)
- some `tablecloth.api.group-by` functions moved to `tablecloth.api.utils`, no changes to API
- `add-or-replace-column(s)` replaced by `add-column(s)` (`add-or-replace-column(s)` is marked as deprecated) (#16)
- `mark-as-group` wasn't visible in API (#18)
- `map-columns` didn't propagate `new-type` for grouped case (#20)
- broken links (#14) in readme
- `let-dataset` - to simulate `tibble` from R
- Adding a column to an empty dataset returned empty dataset
- re-implementation of numerical arrays path dataset creation
- `rows` and `columns` new result: `:as-double-arrays` - convert rows to 2d double array
- dataset can be created from numerical arrays (discussion)
- column from single value should create valid datatype (#10)
- `tablecloth.pipeline` for pipeline operations
- `concat-copying` exposed.
- `split` function for splitting into train-test pairs with `:kfold`, `:bootstrap`, `:loo` and `holdout` strategies + stratified versions
- `replace-missing` with new strategy `:midpoint`
- column names should keep order for provided names (#9)
t.m.d update
t.m.d update
- contribution guide in readme
t.m.d update
- `write-nippy!` and `read-nippy` are deprecated, replaced by `write!` and `dataset`
- `tech.ml.dataset` version 5.0-alpha*
- `map-columns` accepts optional target datatype
- `ds/column->dataset` functionality introduced in `separate-column`
- more datatypes included for conversion (`:text` among others)
- `write-csv!` replaced by `write!` (`write-csv!` is marked as deprecated)
- `info` field `:size` is replaced by `:n-elems`
- [breaking] `separate-column` 3-arity version accepts `separator` instead of `target-columns` now
- do not skip 1-row DS when folding
- do not attempt to fold empty dataset
- `tech.ml.dataset` version 4.04
- tests: dataset
- version number to match t.m.dataset version
- documentation:
- gfm renderer for markdown
- code block language alignment fix in css
- `tech.ml.dataset` version 4.03
- some operations on grouped dataset can be parallel (`parallel?` option set to `true`). These are: `aggregate`, `unique-by`, `order-by`, `join-columns`, `separate-columns`, `ungroup`
- #2 - docs typo
- #3 - recover datatypes after ungrouping
- `aggregation` now uses in-place ungrouping, which is much faster
- `tech.ml.dataset` version 3.06
- `fill-range-replace` to inject data to make continuous sequence in column
- `write-nippy!` and `read-nippy`
- `tech.ml.dataset` version 2.13
- `replace-missing` new strategies: `:mid` and `:lerp`, working also for dates.
- [breaking] `replace-missing` has different contract and default strategy `:mid`. `value` argument is the last argument now.
- [breaking] `replace-missing` `:up` and `:down` strategies, when `value` is `nil`, fill border missing values with nearest value.
- `tech.ml.dataset` version 2.06
- `asof-join` added
- `reshape` tests
- `pivot->wider` accepts `:drop-missing?` option (default: `true`)
- `pivot->wider` drops missing rows by default
- `pivot->wider` order of concatenated column names is reversed (first: colnames, last: value), was opposite.
- `pivot->longer` `:splitter` accepts string used for splitting column name