What to port from StatsBase #87
Comments
Let me ask top-down, as I was not involved much in earlier discussions. Is the plan the following:
Also, the question is: when we move things from StatsBase.jl to Statistics.jl, should we allow (and thus discuss) API changes, or do we only port things? If we allow API changes, then we probably need to keep some deprecations for the old API in StatsBase.jl. Finally (again, I do not want to mess things up too much), it would be great to start moving things to Statistics.jl with a clear statement of how we want to handle:
in a consistent manner across all functions (it might have been discussed, but I think it would be beneficial for many, and for sure for me, to have it clearly stated, so that when I review the PR I can check whether every single function conforms to the assumed API). The point is that once we move things to Statistics.jl there will be no way to fix things before the Julia 2.0 release. |
Yes. Though for 3 we will have to carefully assess how we can reexport functions from Statistics to avoid conflicts on recent Julia versions.
I'm afraid we'll have to apply some API changes (as noted above for some cases). So yeah adding deprecations to StatsBase would be nice for users, though not absolutely required in the short term as nobody will be forced to switch to Statistics (as long as we avoid/fix conflicts with StatsBase).
Good point. For weights, I think the majority want to use the Missing values are a more tricky issue. For unweighted single-argument functions, |
|
What would you suggest then? Have Applying a binning map is what histograms do, right? |
That reads right.
The actual work in histograms is to find the binning map, right? Otherwise it could really just be |
OK so basically we could also add a |
One more thing
I think the keyword is "average run length", which is important for statistics and the reason why |
I tend to think in general it would be better to move things from Statistics to StatsBase (or some other package). For posterity a copy of the discussion on Slack:
|
I think starting with the My only additional suggestion is to figure out a standard interface for multivariate stats (e.g. I would still love |
I generally agree with @devmotion's point on keeping this out of stdlib to make it easier to maintain it. One concrete point regarding mean_and_var, this is very commonly used for computational efficiency reasons in GPs, see for example https://github.com/JuliaGaussianProcesses/AbstractGPs.jl/blob/master/src/exact_gpr_posterior.jl, so would be good to keep the compound forms! |
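To illustrate why a compound form can be cheaper than two separate calls, here is a minimal sketch. The name `mean_and_var_sketch` is hypothetical and this is not the StatsBase implementation; it only relies on the `mean` keyword that `var` accepts in the Statistics stdlib.

```julia
using Statistics

# Hypothetical helper named after the function under discussion;
# an illustration only, not the StatsBase implementation.
function mean_and_var_sketch(x; corrected::Bool=true)
    m = mean(x)
    # Passing the precomputed mean lets `var` skip recomputing it.
    v = var(x; corrected=corrected, mean=m)
    return m, v
end

m, v = mean_and_var_sketch([1.0, 2.0, 3.0, 4.0])
```

The same pattern applies to a `mean_and_std`-style helper by swapping `var` for `std`, which accepts the same `mean` keyword.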
@nalimilan - I have checked what would be a problem when merging StatsBase.jl into Statistics.jl because of external dependencies: DataAPI:
DataStructures:
LogExpFunctions:
Missings:
SortingAlgorithms:
StatsAPI: a massive dependency (I did not perform a check; probably we would need to merge StatsAPI into StatsBase?) |
Thanks. DataAPI and StatsAPI are the most problematic, as even if Statistics is part of the stdlib, some packages may not want to depend on it as loading it may take some time (in particular if we move lots of StatsBase code there). So making these depend on Statistics could be a problem.
|
An idea what to do:
|
Yeah we would need to check the load time of Statistics and see whether it affects that of e.g. Distances or not. I'm not sure how a conditional dependency would work, given that packages that currently overload functions defined in DataAPI or StatsAPI do not necessarily load Statistics, so that would be breaking (and if we break we may as well require such packages to overload the method defined in Statistics). |
Why would that be breaking? We do not need to load |
I mean that if a package wants to overload e.g. Or do you have a trick in mind which would allow merging |
You are right. I thought that this is highly unlikely, but maybe you are right that we should also safeguard ourselves against this.
I do not think this is possible. But then the question is - maybe Statistics.jl could depend on StatsAPI.jl in the end? Do you think it would be a problem? |
I guess that would imply adding it to the stdlib. It wouldn't add significant overhead, but it would be the only stdlib which doesn't actually provide any features usable on their own. It would also make it harder to tag StatsAPI 2.0 if needed at some point. Hopefully we can simply make StatsAPI depend on Statistics. |
This issue is to discuss what functions should be ported from StatsBase to Statistics (#2). Some functions would better move to a separate package:
Most APIs have passed the test of time so they are probably good enough, but I find some of them are not completely satisfying:
- `sum` cannot be implemented via `aweights` keyword arguments like other functions since the function lives in Base (RFC: Add weights argument to sum JuliaLang/julia#33310). We could either export `wsum` or keep it internal and do not support it for now.
- `counts` sounds a bit too generic of a term for a function that only allows counting integer values. `countmap` is more general and its name is explicit. That said, `counts` could easily be extended to allow any type of levels -- its limitation is just that it returns a vector without names so the mapping to the levels has to be done by hand, which isn't user-friendly. APIs provided by FreqTables.jl are nicer to use, but they need NamedArrays.jl (or a similar package). Then there's the issue that `countmap` uses radix sort for performance with some types, but this needs SortingAlgorithms.jl, which isn't a stdlib (yet?).
- `counteq` and `countne` don't really sound like statistical functions and I'm not sure how commonly they are used.
- `sqL2dist`, `L2dist`, `L1dist`, `Linfdist` have an uppercase in their name; these and remaining functions are redundant with functions provided in Distances.jl. That only leaves `psnr`.
- `indexmap` is just `indexin` so remove it.
- `levelsmap` and `indicatormat` sound a bit limited compared with what StatsModels provides.
- `rle` and `inverse_rle` are not really related to statistics.
- `mean_and_var` and `mean_and_std` have weird names so I'm not sure we should keep them or not.
- `zscore` and `zscore!` are convenient but redundant with (more general and more verbose) functions in transformations.jl.
- `transform` and `transform!` are too generic names, I propose overloading LinearAlgebra's `normalize` and `normalize!`, since that name is actually the commonly used term for such transformations. I wonder whether we really need `reconstruct` and `reconstruct!` (which could be called `unnormalize` if we keep them). I'm also not sure what's the use of allowing a separate `fit` operation before actually applying the transformation (I'd imagine one would always normalize the data immediately).
- `moment` is redundant with specific functions so I'd drop it.
- `trimvar(x)` could be `var(trim(x))` if `trim(x)` returned a special iterator type to dispatch on

See also my previous notes at JuliaLang/julia#27152 (comment).
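The `var(trim(x))` idea from the list above could look roughly like the following sketch. The names `Trimmed` and `trim_sketch` are hypothetical, and the formula in the specialized `var` method (winsorized variance scaled by the retained fraction) is an assumption about what such a method would compute, not necessarily StatsBase's exact definition.

```julia
using Statistics

# Hypothetical lazy wrapper: `Trimmed` and `trim_sketch` are illustrative
# names, not part of StatsBase or Statistics.
struct Trimmed{T<:AbstractVector}
    sorted::T   # sorted copy of the input (assumed 1-based indexing)
    count::Int  # number of elements dropped at each end
end

function trim_sketch(x::AbstractVector; prop::Real=0.0)
    0 <= prop < 0.5 || throw(ArgumentError("prop must satisfy 0 <= prop < 0.5"))
    return Trimmed(sort!(collect(x)), floor(Int, length(x) * prop))
end

# Iterator interface, so generic functions such as `mean` work unchanged.
Base.length(t::Trimmed) = length(t.sorted) - 2 * t.count
Base.eltype(::Type{Trimmed{T}}) where {T} = eltype(T)
function Base.iterate(t::Trimmed, i::Int=t.count + 1)
    i > length(t.sorted) - t.count && return nothing
    return t.sorted[i], i + 1
end

# Specialized method selected by dispatch on the wrapper type.
# Assumption: estimate the variance of the trimmed mean from the
# winsorized sample.
function Statistics.var(t::Trimmed)
    n, k = length(t.sorted), t.count
    lo, hi = t.sorted[k + 1], t.sorted[n - k]
    winsorized = clamp.(t.sorted, lo, hi)
    return var(winsorized) / (n * (1 - 2k / n)^2)
end

x = [5.0, 1.0, 100.0, 3.0, 2.0, 4.0, -50.0, 6.0, 7.0, 8.0]
t = trim_sketch(x; prop=0.2)
trimmed_mean = mean(t)       # generic iterator method
trimmed_mean_var = var(t)    # specialized method via dispatch
```

With this design `mean(trim_sketch(x))` needs no dedicated method, while `var` picks up the specialized behaviour only because the argument has the wrapper type.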