Adds fit_params support for stacking classifiers #255
Conversation
Hello @jrbourbeau! Thanks for updating the PR. Cheers! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on October 20, 2017 at 03:42 UTC
@rasbt so far I've only added fit_params support for StackingClassifier. Here's a usage example:

```python
import numpy as np
from mlxtend.classifier import StackingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_val_score

# Generate some data
X, y = make_blobs(random_state=2)

# Build a StackingClassifier
classifiers = [RandomForestClassifier(random_state=2),
               SGDClassifier(random_state=2)]
meta_classifier = RandomForestClassifier(random_state=2)
sclf = StackingClassifier(classifiers=classifiers,
                          meta_classifier=meta_classifier)

# Define some fit_params
fit_params = {'randomforestclassifier__sample_weight': np.arange(X.shape[0]),
              'sgdclassifier__intercept_init': np.unique(y),
              'meta-randomforestclassifier__sample_weight': np.full(X.shape[0], 7)}

# Pass fit_params to StackingClassifier and cross_val_score
sclf.fit(X, y, **fit_params)
print('predictions = {}'.format(sclf.predict(X)))
print('scores = {}'.format(cross_val_score(sclf, X, y, fit_params=fit_params)))
```
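The key naming follows scikit-learn's Pipeline convention: the lowercase estimator name (prefixed with meta- for the meta-classifier), two underscores, then the parameter name. As a rough sketch of how such a dict can be split up per estimator (illustrative only; the helper name is made up and this is not necessarily the PR's exact implementation):

```python
def extract_fit_params(fit_params, name):
    """Collect the fit params belonging to the estimator `name`,
    stripping the 'name__' prefix from the keys."""
    prefix = name + '__'
    return {key[len(prefix):]: value
            for key, value in fit_params.items()
            if key.startswith(prefix)}

# e.g., extract_fit_params(fit_params, 'sgdclassifier')
# returns {'intercept_init': array([0, 1, 2])}
```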
Wow, this is awesome -- it even includes support for both the first-level classifiers as well as the meta-classifier. Again, thanks so much for the PR. I am happy to help with extending this to the other ensemble/stacking classifiers (and regressors) :)
Great! I can work on adding fit parameter support to the other estimators. Here are the ones I had in mind: StackingCVClassifier and EnsembleVoteClassifier. Any others you can think of?
I think that includes all of the ensemble methods I can currently think of as well :). I was/am a bit busy due to a paper deadline at the end of the week, but I am happy to take care of a few of them as well so that you don't have to work on it all alone -- and of course, there's really no hurry :)
Just a very minor suggestion: could you change the docstring for the fit_params in …? (The …)

For sure, I definitely think the …
Yeah, I think that "fit_params : dict of string -> object, optional" would only be minimally more helpful compared to "dict" (I found it in the GridSearchCV docs: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html).
Do you think adding something like a documentation example for this would be useful?

Yeah, I think this would be very useful! Maybe "Example 3 - Stacked Regression with sample_weight using …"
@rasbt sorry for the delay, I've been out of town at a conference. I ran into an issue when trying to add fit_params support to StackingCVClassifier (see mlxtend/classifier/stacking_cv_classification.py, lines 168 to 175 in 5735e00).

Some classifier fit parameters are arrays that have a value for each sample being trained on (e.g. the sample_weight arrays in the example above); those can be sliced with the same indices as X and y for each cross-validation fold. However, it's less clear how to deal with fit parameters that aren't of shape (n_samples,). Any suggestions on how to get around this issue?
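To make the problem concrete, here is a minimal sketch (with made-up data): a per-sample array such as sample_weight can be indexed with each fold's training indices, but something like intercept_init has one entry per class rather than per sample, so there is nothing meaningful to slice:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.random.RandomState(0).rand(100, 4)
y = np.tile([0, 1], 50)
sample_weight = np.ones(100)   # shape (n_samples,): sliceable per fold
intercept_init = np.zeros(2)   # shape (n_classes,): not tied to samples

for train_index, test_index in StratifiedKFold(n_splits=3).split(X, y):
    fold_weight = sample_weight[train_index]  # fine: same indexing as X and y
    # intercept_init[train_index] would be meaningless (and raise an IndexError)
```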
No need to apologize, hope you enjoyed the conference and the trip!

Hm, that's a good question ...

```python
if not self.use_features_in_secondary:
    # meta-features are just the first-level predictions
    n_features_in_2nd = len(self.classifiers)
else:
    # ... plus the original input features
    n_features_in_2nd = len(self.classifiers) + X.shape[1]

if ('coef_init' in meta_fit_params and
        meta_fit_params['coef_init'].shape[1] != n_features_in_2nd):
    raise AttributeError('Number of features in the `fit_params` `coef_init` array'
                         ' of the meta-classifier must be equal to'
                         ' the number of features this classifier expects based on'
                         ' the number of first-level classifiers and whether'
                         ' `use_features_in_secondary` is set to `True` or `False`.'
                         ' Expected: %d Got: %d'
                         % (n_features_in_2nd, meta_fit_params['coef_init'].shape[1]))
```

On the other hand, there are probably multiple parameters that use shape …
I agree with you that automated checking based on the shape is quite unstable and might break in certain edge cases. Hm, I currently don't have a good idea for how to handle that. For now, maybe we should just handle …
(Edit: Please ignore the passage above, I misread the code.) So, I was thinking that before we come up with some hacky work-arounds for those parameters that rely on the shape of the training data, we could first support the simpler, shape-independent kind, e.g.:

```python
{'xgbclassifier__eval_metric': 'mlogloss',
 'xgbclassifier__eval_set': [(X_test, y_test)],
 'xgbclassifier__early_stopping_rounds': 100,
 'xgbclassifier__verbose': False}
```

So, I think the best way would be to add similar exceptions for … to get at least some working versions that generally support fit_params …
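One way to implement such an exception, sketched here with a hypothetical name-based check (the helper and the parameter list are made up; a shape-based check was already deemed unstable above):

```python
# Per-sample fit parameters that cannot be sliced consistently across the
# internal cross-validation folds are rejected explicitly by name:
UNSUPPORTED_CV_FIT_PARAMS = {'sample_weight'}

def check_cv_fit_params(fit_params):
    for key in fit_params:
        param_name = key.split('__')[-1]
        if param_name in UNSUPPORTED_CV_FIT_PARAMS:
            raise NotImplementedError(
                "Fit parameter '%s' is per-sample and is not supported by "
                "the cross-validation-based stacking classifier." % key)
```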
That sounds like a good plan to me! So just to be clear, right now we'll only support fit parameters that don't need to be sliced along with the training samples. That way, something like

```python
for num, (train_index, test_index) in enumerate(skf):
    if self.verbose > 0:
        print("Training and fitting fold %d of %d..." %
              ((num + 1), final_cv.get_n_splits()))
    # pass the (un-sliced) fit params straight through
    model.fit(X[train_index], y[train_index], **clf_fit_params)
```

will work because we won't have to worry about having different slices of fit_params for each fold. And obviously this will need to be clarified in the documentation 😃
Oh yeah, that's probably the best way to handle this for now :)
Thanks for the PR! Hm, it's weird that the unit tests on Windows for Py 2.7 fail. My first thought was that it could be due to banker's rounding in Python 3, i.e., in Python 2.7 round(2.5) gives 3.0, whereas Python 3 gives 2. But then, it wouldn't explain why the Py27 tests pass in Travis. Any idea about what might be going on?
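For reference, the difference is easy to reproduce in an interpreter (Python 3 shown; Python 2.7 rounds halves away from zero instead):

```python
# Python 3 uses banker's rounding (round half to even):
assert round(0.5) == 0
assert round(1.5) == 2
assert round(2.5) == 2  # Python 2.7 gives 3.0 here
```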
Yeah, it is really weird that the tests fail for 2.7 on AppVeyor, but not on Travis — not sure what's going on there. I have seen seemingly random test failures on some CI services before. I'll try adding a small commit to see if re-running the tests fails again.
Sorry about the trouble with AppVeyor! I see something like …

I am not sure why it suddenly occurs; it seems like some rounding error somewhere. I can't see anything in your new code additions (like integer division) that could cause this, as those are "old" unit tests that fail. This is really weird!
Arg, I just found an issue with Travis CI: the unit tests were not properly executed in Python 2.7. It's fixed in the master branch now. Maybe try to sync the master branch of your fork (e.g., as described here: http://rasbt.github.io/mlxtend/contributing/#syncing-an-existing-fork) and then rebase your feature branch on top of the master branch. I think the following should do it:

```
$ git checkout add_fit_params_for_cross_val_score
$ git rebase master
$ git push origin add_fit_params_for_cross_val_score --force
```
Alternatively, instead of rebasing, you could execute the following four commands, which should also fix the problem:

```
$ git checkout origin/master ci/.travis_install.sh
$ git checkout origin/master ci/.travis_test.sh
$ git checkout origin/master .travis.yml
$ git checkout origin/master mlxtend/plotting/tests/test_ecdf.py
```
(The last one is due to a bug in Python 2.7 that went unnoticed because the Travis py27 tests didn't run recently and AppVeyor does not run the plotting functions.) This will probably not solve the unit test issue, but at least we would know whether there's something odd about Python 2.7 on Windows or in general (once more, Python 2.7 turns out to be an annoyance; time to replace it completely with Python 3.6 :))
Awesome, good to know! I'll update and see if we at least get consistent test failures.

👍
(force-pushed from 7ac7887 to 2372300)
Thanks! I am curious to see how that will turn out.
I believe the rebase worked fine; my commits for this PR are now on top of your commits related to Python 2.7 on Travis. But it looks like the tests are still passing on Travis and still failing on AppVeyor 😕
Oh hm, that's weird. It seems like the files are still the old ones and Travis is still not running those Py27 tests. Sorry about the inconvenience, but maybe this could be resolved via the following (after you've synced your master branch with the upstream one):

```
$ git checkout origin/master ci/.travis_install.sh
$ git checkout origin/master ci/.travis_test.sh
$ git checkout origin/master .travis.yml
$ git checkout origin/master mlxtend/plotting/tests/test_ecdf.py
```
Thanks for helping out with this! Just to be clear, I've done

```
# Switch to master branch from feature branch
$ git checkout master

# Update my local master branch to be up-to-date with upstream
$ git fetch upstream
$ git merge upstream/master

# Update origin to also be synced up with upstream
$ git push origin master
```

which has synced up my master branch with upstream (your master branch). I then switched to my feature branch and did the checkouts you've provided:

```
# Switch back to feature branch
$ git checkout add_fit_params_for_cross_val_score

# Checkout files from my updated origin/master
$ git checkout origin/master ci/.travis_install.sh
$ git checkout origin/master ci/.travis_test.sh
$ git checkout origin/master .travis.yml
$ git checkout origin/master mlxtend/plotting/tests/test_ecdf.py
```

This didn't seem to update any of my local files (at least …)
Actually, one thing I did notice related to the Travis tests (and my previous comment) is that .travis.yml (lines 11 to 15 in 922f44f) is pointing to scripts in the wrong location.
Oh, there's definitely something funny going on. I think I screwed up a rebase after adjusting those Travis files, which reverted them back to the 4-month-old version. Let me address this ... thanks for mentioning it!

Okay, like you said, the issue was that I left the wrong link after playing around with Travis a few days ago. I moved them into the right place now and, based on the logs, it should all be fine now :)
(force-pushed from 2372300 to ab389b1)
Hi @rasbt, how is the progress on this enhancement? As a workaround, I implemented a class inheriting from StackingClassifier, but I'd like to see official support for the fitting parameters :). Anything left I can help with here?
Hi @kingychiu, thanks for commenting! I totally lost track of this PR. I'm busy for the remainder of this week and into next week, but I'd like to finish up my work on this PR later next week.
Review comment on:

```python
self.clfs_ = [clone(clf) for clf in self.classifiers]
self.named_clfs_ = {key: value for key, value in
                    _name_estimators(self.clfs_)}
```
dict(_name_estimators(self.clfs_)) might work, but no guarantee.
Hm yeah, I think it should be equivalent indeed.
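For what it's worth, the equivalence is easy to check, since dict() accepts any iterable of (key, value) pairs (toy pairs for illustration):

```python
pairs = [('randomforestclassifier', 'clf1'), ('sgdclassifier', 'clf2')]
assert dict(pairs) == {key: value for key, value in pairs}
```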
Review comment on:

```python
clf_fit_params = {}
for key, value in six.iteritems(fit_params):
    if name in key and 'meta-' not in key:
        clf_fit_params[key.replace(name + '__', '')] = value
```
```python
for key in list(six.iterkeys(fit_params)):
    if name in key and 'meta-' not in key:
        clf_fit_params[key[len(name) + 2:]] = fit_params.pop(key)
```

might be more efficient, since fit_params will be iterated over again (popping the matched keys shrinks the dict for the later passes).
I suppose key.startswith(name) is True. Is that the case?
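The distinction matters because a plain substring test also matches the meta- prefixed keys, which is why the extra 'meta-' not in key guard is needed; startswith excludes them on its own. A quick illustration using the key style from the example above:

```python
name = 'randomforestclassifier'
keys = ['randomforestclassifier__sample_weight',
        'meta-randomforestclassifier__sample_weight']

print([k for k in keys if name in k and 'meta-' not in k])
# ['randomforestclassifier__sample_weight']
print([k for k in keys if k.startswith(name)])
# ['randomforestclassifier__sample_weight']
```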
Review comment on:

```python
clf_fit_params = {}
for key, value in six.iteritems(fit_params):
    if name in key and 'meta-' not in key:
        clf_fit_params[key.replace(name + '__', '')] = value
```
pop can help here as well.
Solid enhancement. I look forward to this PR being merged!

Thanks for the code review, @qiagu!
Description

This PR aims to add fit parameter support for StackingClassifier, StackingCVClassifier, and EnsembleVoteClassifier.

Related issues or pull requests

Fixes #177, fixes #178, fixes #179
Pull Request requirements

- Added appropriate unit test functions in the ./mlxtend/*/tests directories
- Ran nosetests ./mlxtend -sv and made sure that all unit tests pass
- Checked the test coverage by running nosetests ./mlxtend --with-coverage
- Checked for style issues by running flake8 ./mlxtend
- Added a note about the modification/contribution to the ./docs/sources/CHANGELOG.md file
- Modified the documentation in the appropriate location under mlxtend/docs/sources/ (@rasbt will take care of that)