Adds fit_params support for stacking classifiers #255
Conversation
Hello @jrbourbeau! Thanks for updating the PR. Cheers! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on October 20, 2017 at 03:42 UTC
@rasbt so far I've only added fit_params support for StackingClassifier. Here's a usage example:

```python
import numpy as np
from mlxtend.classifier import StackingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_val_score

# Generate some data
X, y = make_blobs(random_state=2)

# Build a StackingClassifier
classifiers = [RandomForestClassifier(random_state=2),
               SGDClassifier(random_state=2)]
meta_classifier = RandomForestClassifier(random_state=2)
sclf = StackingClassifier(classifiers=classifiers,
                          meta_classifier=meta_classifier)

# Define some fit_params
fit_params = {'randomforestclassifier__sample_weight': np.arange(X.shape[0]),
              'sgdclassifier__intercept_init': np.unique(y),
              'meta-randomforestclassifier__sample_weight': np.full(X.shape[0], 7)}

# Pass fit_params to StackingClassifier and cross_val_score
sclf.fit(X, y, **fit_params)
print('predictions = {}'.format(sclf.predict(X)))
print('scores = {}'.format(cross_val_score(sclf, X, y, fit_params=fit_params)))
```
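The key naming follows scikit-learn's Pipeline convention: the lowercase estimator name (prefixed with meta- for the meta-classifier), two underscores, then the parameter name. As a rough sketch of how such a dict can be split up per estimator (illustrative only; the helper name is made up and this is not necessarily the PR's exact implementation):

```python
def extract_fit_params(fit_params, name):
    """Collect the fit params belonging to the estimator `name`,
    stripping the 'name__' prefix from the keys."""
    prefix = name + '__'
    return {key[len(prefix):]: value
            for key, value in fit_params.items()
            if key.startswith(prefix)}

# e.g., extract_fit_params(fit_params, 'sgdclassifier')
# returns {'intercept_init': array([0, 1, 2])}
```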
Wow, this is awesome -- it even includes support for both the first-level classifiers as well as the meta-classifier. Again, thanks so much for the PR. I am happy to help with extending this to the other ensemble/stacking classifiers (and regressors) :)
Great! I can work on adding fit parameter support to the other estimators. Here are the ones I had in mind: StackingCVClassifier and EnsembleVoteClassifier. Any others you can think of?
I think that includes all of the ensemble methods I can currently think of as well :). I was/am a bit busy due to a paper deadline at the end of the week, but I am happy to take care of a few of them as well so that you don't have to work on it all alone -- and of course, there's really no hurry :)
Just a very minor suggestion: could you change the docstring for the fit_params in …? (The …)

For sure, I definitely think the …
Yeah, I think that "fit_params : dict of string -> object, optional" would only be minimally more helpful compared to "dict" (I found it in the GridSearchCV docs: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html).
Do you think adding something like a documentation example for this would be useful?

Yeah, I think this would be very useful! Maybe "Example 3 - Stacked Regression with sample_weight using …"
@rasbt sorry for the delay, I've been out of town at a conference. I ran into an issue when trying to add fit_params support to StackingCVClassifier (see mlxtend/classifier/stacking_cv_classification.py, lines 168 to 175 in 5735e00).

Some classifier fit parameters are arrays that have a value for each sample being trained on (e.g. the sample_weight arrays in the example above); those can be sliced with the same indices as X and y for each cross-validation fold. However, it's less clear how to deal with fit parameters that aren't of shape (n_samples,). Any suggestions on how to get around this issue?
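To make the problem concrete, here is a minimal sketch (with made-up data): a per-sample array such as sample_weight can be indexed with each fold's training indices, but something like intercept_init has one entry per class rather than per sample, so there is nothing meaningful to slice:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.random.RandomState(0).rand(100, 4)
y = np.tile([0, 1], 50)
sample_weight = np.ones(100)   # shape (n_samples,): sliceable per fold
intercept_init = np.zeros(2)   # shape (n_classes,): not tied to samples

for train_index, test_index in StratifiedKFold(n_splits=3).split(X, y):
    fold_weight = sample_weight[train_index]  # fine: same indexing as X and y
    # intercept_init[train_index] would be meaningless (and raise an IndexError)
```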
No need to apologize, hope you enjoyed the conference and the trip!

Hm, that's a good question ...

```python
if not self.use_features_in_secondary:
    # meta-features are just the first-level predictions
    n_features_in_2nd = len(self.classifiers)
else:
    # ... plus the original input features
    n_features_in_2nd = len(self.classifiers) + X.shape[1]

if ('coef_init' in meta_fit_params and
        meta_fit_params['coef_init'].shape[1] != n_features_in_2nd):
    raise AttributeError('Number of features in the `fit_params` `coef_init` array'
                         ' of the meta-classifier must be equal to'
                         ' the number of features this classifier expects based on'
                         ' the number of first-level classifiers and whether'
                         ' `use_features_in_secondary` is set to `True` or `False`.'
                         ' Expected: %d Got: %d'
                         % (n_features_in_2nd, meta_fit_params['coef_init'].shape[1]))
```

On the other hand, there are probably multiple parameters that use shape …
I agree with you that automated checking based on the shape is quite unstable and might break in certain edge cases. Hm, I currently don't have a good idea for how to handle that. For now, maybe we should just handle …
(Edit: Please ignore the passage above, I misread the code.) So, I was thinking that before we come up with some hacky work-arounds for those parameters that rely on the shape of the training data, we could first support the simpler, shape-independent kind, e.g.:

```python
{'xgbclassifier__eval_metric': 'mlogloss',
 'xgbclassifier__eval_set': [(X_test, y_test)],
 'xgbclassifier__early_stopping_rounds': 100,
 'xgbclassifier__verbose': False}
```

So, I think the best way would be to add similar exceptions for … to get at least some working versions that generally support fit_params …
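One way to implement such an exception, sketched here with a hypothetical name-based check (the helper and the parameter list are made up; a shape-based check was already deemed unstable above):

```python
# Per-sample fit parameters that cannot be sliced consistently across the
# internal cross-validation folds are rejected explicitly by name:
UNSUPPORTED_CV_FIT_PARAMS = {'sample_weight'}

def check_cv_fit_params(fit_params):
    for key in fit_params:
        param_name = key.split('__')[-1]
        if param_name in UNSUPPORTED_CV_FIT_PARAMS:
            raise NotImplementedError(
                "Fit parameter '%s' is per-sample and is not supported by "
                "the cross-validation-based stacking classifier." % key)
```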
That sounds like a good plan to me! So just to be clear, right now we'll only support fit parameters that don't need to be sliced along with the training samples. That way, something like

```python
for num, (train_index, test_index) in enumerate(skf):
    if self.verbose > 0:
        print("Training and fitting fold %d of %d..." %
              ((num + 1), final_cv.get_n_splits()))
    # pass the (un-sliced) fit params straight through
    model.fit(X[train_index], y[train_index], **clf_fit_params)
```

will work because we won't have to worry about having different slices of fit_params for each fold. And obviously this will need to be clarified in the documentation 😃
Oh yeah, that's probably the best way to handle this for now :)
Thanks for the PR! Hm, it's weird that the unit tests on Windows for Py 2.7 fail. My first thought was that it could be due to banker's rounding in Python 3, i.e., in Python 2.7 round(2.5) gives 3.0, whereas Python 3 gives 2. But then, it wouldn't explain why the Py27 tests pass in Travis. Any idea about what might be going on?
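For reference, the difference is easy to reproduce in an interpreter (Python 3 shown; Python 2.7 rounds halves away from zero instead):

```python
# Python 3 uses banker's rounding (round half to even):
assert round(0.5) == 0
assert round(1.5) == 2
assert round(2.5) == 2  # Python 2.7 gives 3.0 here
```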
Yeah, it is really weird that the tests fail for 2.7 on AppVeyor, but not on Travis — not sure what's going on there. I have seen seemingly random test failures on some CI services before. I'll try adding a small commit to see if re-running the tests fails again.
Sorry about the trouble with AppVeyor! I see something like …

I am not sure why it suddenly occurs; it seems like some rounding error somewhere. I can't see anything in your new code additions (like integer division) that could cause this, as those are "old" unit tests that fail. This is really weird!
Arg, I just found an issue with Travis CI: the unit tests were not properly executed in Python 2.7. It's fixed in the master branch now. Maybe try to sync the master branch of your fork (e.g., as described here: http://rasbt.github.io/mlxtend/contributing/#syncing-an-existing-fork) and then rebase your feature branch on top of the master branch. I think the following should do it:

```
$ git checkout add_fit_params_for_cross_val_score
$ git rebase master
$ git push origin add_fit_params_for_cross_val_score --force
```
Alternatively, instead of rebasing, you could execute the following four commands, which should also fix the problem:

```
$ git checkout origin/master ci/.travis_install.sh
$ git checkout origin/master ci/.travis_test.sh
$ git checkout origin/master .travis.yml
$ git checkout origin/master mlxtend/plotting/tests/test_ecdf.py
```
(The last one is due to a bug in Python 2.7 that went unnoticed because the Travis py27 tests didn't run recently and AppVeyor does not run the plotting functions.) This will probably not solve the unit test issue, but at least we would know whether there's something odd about Python 2.7 on Windows or in general (once more, Python 2.7 turns out to be an annoyance; time to replace it completely with Python 3.6 :))
Awesome, good to know! I'll update and see if we at least get consistent test failures.

👍
(force-pushed from 7ac7887 to 2372300)
Thanks! I am curious to see how that will turn out.
I believe the rebase worked fine; my commits for this PR are now on top of your commits related to Python 2.7 on Travis. But it looks like the tests are still passing on Travis and still failing on AppVeyor 😕
Oh hm, that's weird. It seems like the files are still the old ones and Travis is still not running those Py27 tests. Sorry about the inconvenience, but maybe this could be resolved via the following (after you've synced your master branch with the upstream one):

```
$ git checkout origin/master ci/.travis_install.sh
$ git checkout origin/master ci/.travis_test.sh
$ git checkout origin/master .travis.yml
$ git checkout origin/master mlxtend/plotting/tests/test_ecdf.py
```
Thanks for helping out with this! Just to be clear, I've done

```
# Switch to master branch from feature branch
$ git checkout master

# Update my local master branch to be up-to-date with upstream
$ git fetch upstream
$ git merge upstream/master

# Update origin to also be synced up with upstream
$ git push origin master
```

which has synced up my master branch with upstream (your master branch). I then switched to my feature branch and did the checkouts you've provided:

```
# Switch back to feature branch
$ git checkout add_fit_params_for_cross_val_score

# Checkout files from my updated origin/master
$ git checkout origin/master ci/.travis_install.sh
$ git checkout origin/master ci/.travis_test.sh
$ git checkout origin/master .travis.yml
$ git checkout origin/master mlxtend/plotting/tests/test_ecdf.py
```

This didn't seem to update any of my local files (at least …)
Actually, one thing I did notice related to the Travis tests (and my previous comment) is that .travis.yml (lines 11 to 15 in 922f44f) is pointing to scripts in the wrong location.
Oh, there's definitely something funny going on. I think I screwed up a rebase after adjusting those Travis files, which reverted them back to the 4-month-old version. Let me address this ... thanks for mentioning it!

Okay, like you said, the issue was that I left the wrong link after playing around with Travis a few days ago. I moved them into the right place now and, based on the logs, it should all be fine now :)
(force-pushed from 2372300 to ab389b1)
Hi @rasbt, how is the progress on this enhancement? As a workaround, I implemented a class inheriting from StackingClassifier, but I'd like to see official support for the fitting parameters :). Anything left I can help with here?
Hi @kingychiu, thanks for commenting! I totally lost track of this PR. I'm busy for the remainder of this week and into next week, but I'd like to finish up my work on this PR later next week.
Review comment on:

```python
self.clfs_ = [clone(clf) for clf in self.classifiers]
self.named_clfs_ = {key: value for key, value in
                    _name_estimators(self.clfs_)}
```
dict(_name_estimators(self.clfs_)) might work, but no guarantee.
Hm yeah, I think it should be equivalent indeed.
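For what it's worth, the equivalence is easy to check, since dict() accepts any iterable of (key, value) pairs (toy pairs for illustration):

```python
pairs = [('randomforestclassifier', 'clf1'), ('sgdclassifier', 'clf2')]
assert dict(pairs) == {key: value for key, value in pairs}
```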
Review comment on:

```python
clf_fit_params = {}
for key, value in six.iteritems(fit_params):
    if name in key and 'meta-' not in key:
        clf_fit_params[key.replace(name + '__', '')] = value
```
```python
for key in list(six.iterkeys(fit_params)):
    if name in key and 'meta-' not in key:
        clf_fit_params[key[len(name) + 2:]] = fit_params.pop(key)
```

might be more efficient, since fit_params will be iterated over again (popping the matched keys shrinks the dict for the later passes).
I suppose key.startswith(name) is True. Is that the case?
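The distinction matters because a plain substring test also matches the meta- prefixed keys, which is why the extra 'meta-' not in key guard is needed; startswith excludes them on its own. A quick illustration using the key style from the example above:

```python
name = 'randomforestclassifier'
keys = ['randomforestclassifier__sample_weight',
        'meta-randomforestclassifier__sample_weight']

print([k for k in keys if name in k and 'meta-' not in k])
# ['randomforestclassifier__sample_weight']
print([k for k in keys if k.startswith(name)])
# ['randomforestclassifier__sample_weight']
```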
Review comment on:

```python
clf_fit_params = {}
for key, value in six.iteritems(fit_params):
    if name in key and 'meta-' not in key:
        clf_fit_params[key.replace(name + '__', '')] = value
```
pop can help here as well.
Solid enhancement. I look forward to this PR being merged!

Thanks for the code review, @qiagu!
Description

This PR aims to add fit parameter support for StackingClassifier, StackingCVClassifier, and EnsembleVoteClassifier.

Related issues or pull requests

Fixes #177, fixes #178, fixes #179
Pull Request requirements

- Added appropriate unit test functions in the ./mlxtend/*/tests directories
- Ran nosetests ./mlxtend -sv and made sure that all unit tests pass
- Checked the test coverage by running nosetests ./mlxtend --with-coverage
- Checked for style issues by running flake8 ./mlxtend
- Added a note about the modification/contribution to the ./docs/sources/CHANGELOG.md file
- Modified the documentation in the appropriate location under mlxtend/docs/sources/ (@rasbt will take care of that)