Add support for padding with sentinel value for roll* #24
Comments
What would the sentinel value do, simply provide the value with which to pad? |
This is a new idea. The point is that e.g. |
I can see how |
This is how pandas does it in https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rolling.html: In short:
|
Introducing |
I prefer keeping advances to that which can be stated simply, and having specific requests gel together. Is the discourse item replaced by this refinement, or is there a distinction I may have forgotten? |
So my original proposal was an essential one. Currently the user needs to do the padding manually:
or
This is quite inconvenient, especially for new users of Julia. What I would think makes sense:
This is not a restatement of https://discourse.julialang.org/t/rollingfunctions-with-a-variable-window-width/87601. |
I don't know how to read "specify a different window for each observation." Windows span multiple observations by definition. Does this mean at each observation use however many earlier observations to determine the current function value and that however many may be unique to each observation [index]? If so, "no". Your request above is much more widely useful. |
Is there a reason to actually pad rather than know where to begin/end? The data sequence belongs to the client. I would be more comfortable providing a very easy, perhaps nearly invisible way to copy with padding inserted rather than grab the information and alter its e.g. type. A keyword or two would allow this. Another function that actually modifies the source data sequence could be exported for advanced users.
then currying or structifying (fn, window) is possible |
This is what OP on Discourse wanted. |
This is also OK, but would require an additional step later. The point is often users want to do something like:
and now, as you see, they get an error. They could do:
but often they want the first observation that does not have a full window size to be some sentinel (like |
ahh. ic |
@bkamins |
Thank you for this work. @JeffreySarnoff - what is the status of your efforts towards 1.0 release of RollingFunctions.jl? |
reasonably good -- external constraints put the prerelease at Saturday Jan 21 and the release (or release candidate) at Sunday Feb 5th. |
@JeffreySarnoff - I saw a mention in this issue, but now I do not see it, so I am not sure what it was exactly. However, let me re-state the original request: when using rolling functions, I would like to get a result of the same length as the input vector, padded with some sentinel (`missing` is a reasonable default). |
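To make the request concrete, here is a minimal sketch in plain Julia. The function name `roll_padded`, the helper `mean_`, and the keyword `pad` are hypothetical and chosen only for illustration (the keyword follows the spelling suggested in the issue text); this is not the RollingFunctions.jl API. It assumes a 1-based vector and trailing windows.

```julia
# Hypothetical helper, not part of RollingFunctions.jl.
# Applies `fn` over trailing windows of length `span` and pads the front
# with a sentinel so the result has the same length as `data`.
function roll_padded(fn, data::AbstractVector, span::Integer; pad=missing)
    n = length(data)
    1 <= span <= n || throw(ArgumentError("span must be in 1:length(data)"))
    # Complete windows end at positions span, span+1, ..., n.
    rolled = [fn(view(data, (i - span + 1):i)) for i in span:n]
    # The first span-1 positions have no complete window; fill them with `pad`.
    return vcat(fill(pad, span - 1), rolled)
end

mean_(xs) = sum(xs) / length(xs)   # avoids depending on Statistics

result = roll_padded(mean_, 1:5, 3)
# result is [missing, missing, 2.0, 3.0, 4.0]
```

Because the result has as many elements as the input, it can be stored next to the source column without any manual concatenation.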
Got that.
What is bugging me is that I cannot process your views without including all of DataFrames.jl. Would you consider a future release that splits out a minimal package, `DataFramesCore.jl`, that would do just enough so I could use it and define `expansive_view = Union{AbstractArray, AbstractDataFrame}`?
|
Can you please explain the problem? I would assume that RollingFunctions.jl does not need to be aware of DataFrames.jl at all. It takes a vector as input and should produce a vector as output. |
That is true. People have requested that it work with DataFrame column views, and just for general symmetry -- that we have two well developed sorts of |
Indeed you can get a view as an input to your rolling function. But it should not matter I thought. I assumed that I would have a signature that restricts input type to |
Yes, I use AbstractVector{T} and also AbstractMatrix{T} (for running some function over multiple columns at one go). |
And why is this not enough? |
these are vector examples (fyi)
|
it is -- really it all goes well. |
Thank you, https://github.com/JeffreySarnoff/RollingFunctions.jl/blob/v1/docs/src/intro/padding.md seems to be exactly what is needed (in other frameworks there is also an option to pad in the middle, so that half of the padding is in front and half is at the end). Conceptually, as we move the window, this determines the observation to which we attach the result:
Padding to the center can be added later if it is a problem (but it would be good to keep in mind that people sometimes want to say that Tuesday is a mean of Monday, Tuesday and Wednesday). Regarding the types challenge - I discussed the issue with Jeff Bezanson yesterday - and it is a tough thing. |
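The alignment question can be sketched as follows. The function `roll_aligned` and its `align` and `pad` keywords are hypothetical names used only for this illustration, and the tie-breaking rule for an odd number of pads (the extra pad goes in front) is an assumption of the sketch, not a decision taken in this thread.

```julia
# Hypothetical sketch of window alignment; not the package's API.
# `align` selects which observation of the window receives the result:
#   :right  -> last observation   (all padding goes in front)
#   :center -> middle observation (padding is split between both ends)
#   :left   -> first observation  (all padding goes at the end)
function roll_aligned(fn, data::AbstractVector, span::Integer;
                      align::Symbol=:right, pad=missing)
    n = length(data)
    1 <= span <= n || throw(ArgumentError("span must be in 1:length(data)"))
    npad = span - 1
    if align === :right
        nfront = npad
    elseif align === :left
        nfront = 0
    elseif align === :center
        nfront = cld(npad, 2)   # assumption: an odd leftover pad goes in front
    else
        throw(ArgumentError("align must be :left, :center or :right"))
    end
    rolled = [fn(view(data, i:(i + span - 1))) for i in 1:(n - span + 1)]
    return vcat(fill(pad, nfront), rolled, fill(pad, npad - nfront))
end

daily = [1.0, 2.0, 3.0, 4.0, 5.0]   # Monday through Friday
centered = roll_aligned(xs -> sum(xs) / length(xs), daily, 3; align=:center)
# centered is [missing, 2.0, 3.0, 4.0, missing]
```

With `align=:center` and a span of 3, the second element (Tuesday) holds the mean of Monday, Tuesday and Wednesday.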
With padding in the middle and an odd number of pads, which end gets more, or is that another settable choice? |
I had considered letting people specify n1 pads at the start and n2 pads at the end -- that seemed an opportunity for people to get it wrong. |
😄. We seem to think alike. I was copying the functionality that people asked for, but when I did this I started thinking about exactly the same problem. Maybe a more general solution would be to add a numeric (now I see it is similar to what you said with |
At the next-to-last minute, I decided to keep compatibility with current code. |
Here be something (unregistered while I heal).
The entire edifice has been reworked. The current docs reflect the tests and other working details in need of tests. IncrementalAccumulators exists to be incorporated, and will replace the older implementations ( this would add substantive performance I have not been able to figure out how to get
fails with
|
Thank you for working on this. I have checked the code and have the following comments:
Sentinel value in padding
You use
structure and set The reason is that
In some cases
|
supplemental information attached (to be updated on occasion)

RollingFunctions.jl v1
notes for JeffreySarnoff, bkamins 2023-02-15

Looking Back

tapering
When I researched rolling functions over windowed data initially (2017), there were packages and products that provided padding with a given value. I saw nothing about tapering to provide values for incomplete window spans. To the best of my knowledge (without claiming anything more) RollingFunctions.jl introduced efficient, well-structured, and functionally broad tapering to many. Now, this capability appears in many packages and in multiple languages.

genericity
The implementation followed a few stylized approaches, allowing reasonably concise and compact code to cover many sorts of windowed summarization.

omissions
most often noted
others of import

Looking Forward

design goals
api consistency
argument order
Many clients will use some preferred set of windowspans with some selected set of summarization functions, each specialized to the nature of the datasequence. This suggests that the datasequence appear after the windowspan and the summarizer args. These two orderings meet that constraint. Which is preferable in use with respect to currying and maybe some use of closures?
performance

Technical Notes
taper
pad
pad then taper
consistent relationships
|
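For readers unfamiliar with the term, the tapering mentioned in the notes above can be illustrated with a small sketch. The function `roll_tapered` is a hypothetical name and this is not the package's implementation; it assumes a 1-based vector and trailing windows.

```julia
# Hypothetical sketch of tapering; not the package's implementation.
# Positions without a complete window use a shorter window made of the
# observations seen so far, so no sentinel value is needed.
function roll_tapered(fn, data::AbstractVector, span::Integer)
    return [fn(view(data, max(1, i - span + 1):i)) for i in 1:length(data)]
end

tapered = roll_tapered(sum, [1, 2, 3, 4, 5], 3)
# tapered is [1, 3, 6, 9, 12]
```

The first two results come from windows of length 1 and 2. Padding would instead replace them with a sentinel, which is the trade-off between the two approaches.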
re: padding is supported only for rolling They all are of a consistent, flexible design and share a nice api (thank you). |
The new, more capable api is call signature compatible with release v0.7.x -- that is, call signatures used with high level exports of v0.7.x (rolling and running) yield very nearly the same results when used with the new api as it is specified at present. We could support greater generality with fewer lines of code by changing from distinguishing functions of arity 1, 2, 3, and 4 by dispatch over 1, 2, 3, or 4 vectors to a Tuple form and capturing the number of args (as given by N, the number of vectors):
We would code unary window functions explicitly: (a) providing well-grounded reference implementations code (b) providing evaluative optimizations and applicative speed-ups.
And we could do the same with window functions of 2 args, as e.g. Gaining performance (2x-3x in throughput generally, 5x-10x in recurring client situations) is available (sometimes not straightforward, sometimes requiring additional dev work with third party packages). Work applying e.g. LoopVectorization, SIMD, and the like to our task is necessary. |
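As a rough illustration of the Tuple form described above, the following sketch uses one method for window functions of any arity. The name `rolling_tuple` and the helper `dot_` are hypothetical, and the argument order (function, span, data) is only one of the orderings discussed; this is not the code from the package.

```julia
# Hypothetical signature sketch; names are illustrative, not the package API.
# One method covers window functions of any arity: N is the number of data
# vectors, and `fn` receives one window view per vector.
function rolling_tuple(fn, span::Integer, data::NTuple{N,AbstractVector}) where {N}
    n = length(first(data))
    all(v -> length(v) == n, data) ||
        throw(DimensionMismatch("data vectors differ in length"))
    return [fn(map(v -> view(v, (i - span + 1):i), data)...) for i in span:n]
end

dot_(a, b) = sum(a .* b)   # a two-argument window function

unary  = rolling_tuple(sum, 2, ([1, 2, 3],))
binary = rolling_tuple(dot_, 2, ([1, 2, 3], [4, 5, 6]))
# unary is [3, 5] and binary is [14, 28]
```

The cost of this form is visible in the unary call: a single vector has to be wrapped as `(v,)`, which is the convenience concern raised later in the thread.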
@bkamins The previous note explains my remaining high-level API decision for v1. (+) I prefer to support rolling/running multi-arity functions that way.
What do you see? |
I am not fully clear what the benefit of Two considerations:
For example, looking at the signature of
also with this approach the pre-1.0 release signatures could be kept as deprecated |
Tuple wrapping the two data vectors provided for two-arg window_fns is less compelling than is tuple wrapping for 3..12 arg window_fns. Clients working with timestamped and time-aggregated (the reading at 5min intervals or the trimmed mean taken through 5min intervals) financial market data often obtain many field values for every stamped or marked time. These 6 data items provide market information about the current trading day as it concludes: PriorClose, Open, High, Low, Close, Volume. Estimators, Indicators, and Accuracy/Loss Measures are developed through their Arithmetic and Algebraic interconnections. The more compelling use provides clients with a semantically consistent, syntactically uniform, generalized, and robust mechanism for working with the whole of it through any parts partitioning deemed worthwhile. |
Can you please explain with an example what you mean exactly? If we discuss convenience, then I believe that not requiring values to be wrapped in a tuple is more convenient. |
Why imo There may be more than one data vector and there may be one weighting for each data vector [multiple weight vecs is a feature that seems would simplify some windowed summarizations]. Only one sort can be a Vararg. The data vecs are likely to change more often than the weight vecs. However weight vecs are not always used, while data vecs are. Having the weight vec[s] available as a positional arg simplifies dispatch somewhat -- so perhaps rolling(window_fn<:Function, window_span::Int, data::AbstractVector...; |
OK - I see your point. Let me ask on Slack on #data what users would think. Thank you! |
Ah - and also we discussed that the |
Just to bring over what I said on Slack: I voted for positional arg originally, as I thought footguns of the type I then wondered whether there are any constraints about the arguments which could be used to error early and prevent users from silently getting wrong results, e.g. if we require My assumption is that the vast majority of use cases is still the simple |
Would the integration of the equivalent of R's
    > roll_sum(1:3, width = 2)
    [1] NA 3 5
    > roll_sum(1:3, width = 2, min_obs = 1)
    [1] 1 3 5
    > roll_sum(1:3, width = 4, min_obs = 2)
    [1] NA 3 6
Note that in the last case, although it may seem wrong to specify a width longer than the actual data, such a situation may legitimately happen in cases where such a rolling function is passed to a grouped DataFrame whose length isn't guaranteed to be >= |
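The behaviour of the three R calls quoted above can be mimicked in plain Julia. The function `roll_min_obs` and its keywords are hypothetical names for this sketch and not an existing RollingFunctions.jl API; `missing` stands in for R's `NA`.

```julia
# Hypothetical Julia analogue of the R calls quoted above; not an existing API.
# The window ending at position i covers max(first index, i-width+1):i.
# The result is `fn` of that window when it holds at least `min_obs`
# observations, and `missing` otherwise.
function roll_min_obs(fn, data::AbstractVector, width::Integer;
                      min_obs::Integer=width)
    return map(eachindex(data)) do i
        lo = max(firstindex(data), i - width + 1)
        (i - lo + 1) >= min_obs ? fn(view(data, lo:i)) : missing
    end
end

a = roll_min_obs(sum, 1:3, 2)             # [missing, 3, 5]
b = roll_min_obs(sum, 1:3, 2; min_obs=1)  # [1, 3, 5]
c = roll_min_obs(sum, 1:3, 4; min_obs=2)  # [missing, 3, 6]
```

In this sketch `min_obs = width` gives padding with `missing`, and `min_obs = 1` gives tapering, so a single keyword covers both behaviours as well as the case where the width exceeds the data length.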
To roll a function F over a data vector of length N using a window_span of W where W > N is the same as applying F to the (entire) data vector. That is reasonable iff N is large enough so F(data) is numeric (e.g. N > k for some k that depends on the nature of F .. which of F([]), F([x]), F([x1,x2]) is legit). The responsibility would be on the caller. Note that there are analytic issues with doing this; consider situations where the result is compared or combined with other results that have been applied to data of length N >= W. This "feature" should not be used in such situations. I think it may be overkill to provide a keyword arg to permit this -- so it would be all on the user. For less experienced users, that is problematic. What do you think? |
Functions will ignore any extra spans (spans in a vector of window spans beyond where cumsum(spans) exceeds length(data)). If the last useful span extends beyond length(data), either the prior span could be extended or that last span could be shrunk or that last span could be trimmed. Preferences? |
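One of the options listed above, trimming the last span, can be sketched like this. The helper `span_ranges` is a hypothetical name, and the sketch assumes the spans are consecutive and non-overlapping (the first 4 observations, then the next 4, and so on); it does not represent a decision made in this thread.

```julia
# Hypothetical sketch; shows only the "trim the last span" option.
# Converts consecutive window spans into index ranges over data of length n,
# ignoring any spans that would start beyond the end of the data.
function span_ranges(spans::AbstractVector{<:Integer}, n::Integer)
    ranges = UnitRange{Int}[]
    start = 1
    for span in spans
        start > n && break                 # remaining spans are ignored
        stop = min(start + span - 1, n)    # trim a span that overruns the data
        push!(ranges, start:stop)
        start = stop + 1
    end
    return ranges
end

data = collect(1:10)
ranges = span_ranges([4, 4, 4, 4], length(data))
totals = [sum(view(data, r)) for r in ranges]
# ranges is [1:4, 5:8, 9:10] and totals is [10, 26, 19]
```

Here the third span is trimmed from four observations to two and the fourth span is dropped entirely.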
I'm not strongly opinionated on this, just thought that |
There was no post giving an example for wanting to process distinct window_spans while traversing a vector (e.g. the first 12 obs, then the next 18 obs ..). My guess is that the request was to process a sequence of window_spans, one-at-a-time, without explicitly coding the loop. |
If it is not clear how |
One way to allow support for that: there would be a kwarg |
Otherwise, it becomes difficult to know the data vectors from the span vector (without always making the span a kwarg -- which we discussed above) |
This is the reason I preferred kwarg over positional arg as I believed it is more flexible for future development. |
I will allow an Int and allow NTuple{N,Int} for |
In the future, will the keyword of
But despite now needing to manually add padding values, your package has already given me a 20x+ speedup over Python's rolling.apply() solution, so thank you very much for your work! |
@BeitianMa yes and it is working in the dev v1 |
@bkamins I have settled on an implementation pattern
|
On development version 0.9.75 v1 adjacent Next is to comb the docs into relevance. meanwhile:
|
Early access to RollingFunctions.jl v1 is given as WindowedFunctions.jl.
This version is radically redesigned. The new approach offers the most requested capabilities. It would be a great help if you would take this prerelease for a walk. |
Example:
It would be nice to add a kwarg to `roll*` to allow such padding, e.g. by writing `rollmean(1:5, 3; pad=missing)`.