[MRG] add TimeSeriesScaleMeanMaxVariance() #333

Open · wants to merge 5 commits into main from TimeSeriesScaleMeanMaxVariance
Conversation


@tonylee2016 commented Jan 7, 2021

When I was using the package for k-Shape, one issue I found is that the original TimeSeriesScaleMeanVariance may amplify small changes (noise) in the signal, because it normalizes each signal by its own variance.

Here I am adding a new scaling class that preserves the amplitude relation across signals; this scaling method may be useful for some applications.

TimeSeriesScaleMeanMaxVariance is similar to the existing TimeSeriesScaleMeanVariance with one exception: the maximum standard deviation across all signals is used for normalization. As a result, the amplitude relation across signals is preserved.
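
For context, here is a minimal sketch of the idea (not the exact PR code; it assumes tslearn's usual (n_ts, sz, d) array layout):

import numpy

def scale_mean_max_variance(X):
    # center each series by its own mean
    mean_t = numpy.nanmean(X, axis=1, keepdims=True)
    # per-series standard deviations
    std_t = numpy.nanstd(X, axis=1, keepdims=True)
    # divide every series by the same global maximum std, so the
    # amplitude relation across series is preserved
    max_std = float(numpy.nanmax(std_t))
    if max_std == 0.:
        max_std = 1.
    return (X - mean_t) / max_std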

@tonylee2016 tonylee2016 changed the title add TimeSeriesScaleMeanMaxVariance() [WIP] add TimeSeriesScaleMeanMaxVariance() Jan 7, 2021
@tonylee2016 (Author)

The method won't pass check_methods_subset_invariance. This is expected; I've excluded it from the tests.

@tonylee2016 tonylee2016 changed the title [WIP] add TimeSeriesScaleMeanMaxVariance() [MRG] add TimeSeriesScaleMeanMaxVariance() Jan 8, 2021
# retain the scaling relation cross the signals,
# the max std_t is set to self.std
max_std = max(std_t)
if max_std ==0.: max_std = 1

Contributor:
Split up over multiple lines.

Contributor:
And add spaces to the left and right of ==.
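
For example (just restating the two remarks on the snippet above):

max_std = max(std_t)
if max_std == 0.:
    max_std = 1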

X_ = check_dims(X_, X_fit_dims=self._X_fit_dims, extend=False)
mean_t = numpy.nanmean(X_, axis=1, keepdims=True)
std_t = numpy.nanstd(X_, axis=1, keepdims=True)
# retain the scaling relation cross the signals,

Contributor:
Are these comments still relevant? Does it need to be stateful? I.e., should there be a fit() and a transform(), where fit() stores self.std and transform() uses it?
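
For illustration, a rough sketch of the stateful variant being suggested (the class and attribute names here, e.g. GlobalMeanMaxVarianceScaler and std_, are made up for this example and are not part of the PR):

import numpy
from sklearn.base import BaseEstimator, TransformerMixin

class GlobalMeanMaxVarianceScaler(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # remember the global maximum of the per-series standard deviations
        std_t = numpy.nanstd(X, axis=1, keepdims=True)
        max_std = float(numpy.nanmax(std_t))
        self.std_ = max_std if max_std > 0. else 1.
        return self

    def transform(self, X):
        # center each series by its own mean, scale by the stored global std
        mean_t = numpy.nanmean(X, axis=1, keepdims=True)
        return (X - mean_t) / self.std_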

@GillesVandewiele (Contributor)

Hi,

Thank you for this PR! I think this could indeed be useful for some use cases, but I will let @rtavenar confirm this.

I left 2 very minor remarks. I checked out check_methods_subset_invariance from sklearn and it indeed makes sense that the results change depending on the sub-batch you provide to the method (as it uses the max std of that batch to normalise). However, we must ensure that only the relevant tests are skipped and that as many tests as possible are still performed. Are we sure that adding _skip_test as a tag does not skip more tests than needed?

@GillesVandewiele (Contributor)

One other thing we could consider, instead of making this part of the main codebase, is to create an example/tutorial on "defining a custom preprocessor" and then use this specific preprocessing technique in it. I can imagine that there are more normalisation alternatives out there with only minimal differences between them, and including all of them in tslearn could quickly result in a bloated codebase.

@rtavenar (Member) commented Jan 8, 2021

Hi @tonylee2016,
I think this is closely related to #280 and #285.

The scalers we have implemented so far rely only on per-time-series information, and we seem to get more and more requests like this one.

I would be OK with having such scalers in tslearn, but I strongly believe that we should put effort into clearly documenting the differences between scalers.

Also, maybe it would be worth having names that explicitly state that these scalers rely on global information, or having a subpackage dedicated to "global" scalers.

I am open to suggestions on these points and would like to clarify them before considering the addition of these preprocessors to tslearn.

@tonylee2016 (Author)

Hi, @GillesVandewiele and @rtavenar

Thanks for the comments and extra info. After reading #280 and #285, here are my thoughts on a more generic scaler framework:

  1. Move the std and mean calculation to fit(), as #285 ([WIP] Fit method at preprocessing file #280) is already doing.
  2. Add optional scaler info gathered during fit(), so that users know what data the scaler is based on.
  3. Better documentation and an example for TimeSeriesScaleMeanMaxVariance() to show the different effects.

I can spend some time on 3 in this PR, but 1 and 2 may be handled by #285?

@GillesVandewiele _skip_test will skip all estimator checks; however, I did not find a way to skip only check_methods_subset_invariance.
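
One possibility (a rough sketch; this assumes a scikit-learn version that supports the _xfail_checks estimator tag, and that tslearn's test patches honour it) would be to mark only that single check as an expected failure instead of skipping everything:

from sklearn.base import BaseEstimator, TransformerMixin

class TimeSeriesScaleMeanMaxVariance(TransformerMixin, BaseEstimator):
    ...

    def _more_tags(self):
        # only check_methods_subset_invariance is marked as an expected
        # failure; all other common estimator checks still run
        return {
            "_xfail_checks": {
                "check_methods_subset_invariance":
                    "scaling uses the max std of the whole batch",
            }
        }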

Base automatically changed from master to main January 26, 2021 12:41
use inheritance and skip check_methods_subset_invariance
@tonylee2016 tonylee2016 force-pushed the TimeSeriesScaleMeanMaxVariance branch from 28ea436 to e7eaea5 on March 16, 2021 22:42
@codecov-commenter

Codecov Report

Merging #333 (28ea436) into main (87495c1) will decrease coverage by 0.16%.
The diff coverage is 28.57%.

❗ Current head 28ea436 differs from the pull request's most recent head 2819398. Consider uploading reports for the commit 2819398 to get more accurate results.

@@            Coverage Diff             @@
##             main     #333      +/-   ##
==========================================
- Coverage   94.70%   94.54%   -0.17%     
==========================================
  Files          59       59              
  Lines        4511     4525      +14     
==========================================
+ Hits         4272     4278       +6     
- Misses        239      247       +8     
Impacted Files                             Coverage          Δ
tslearn/preprocessing/__init__.py          100.00% <ø>       (ø)
tslearn/preprocessing/preprocessing.py     89.10%  <28.57%>  (-9.75%) ⬇️
tslearn/tests/sklearn_patches.py           92.59%  <0.00%>   (+0.52%) ⬆️

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 87495c1...2819398. Read the comment docs.
