added option to use sklearn's OneHotEncoder to handle unknown categories #174

Vaseekaran-V · 2024-09-03T15:25:50Z

This library is amazing. I noticed a small issue when using Multiple Correspondence Analysis: because it uses pd.get_dummies internally to one-hot encode the data, I got an error when my test set contained categories, in certain categorical features, that were not present in the training set.

Therefore, I have added a OneHotEncoder object from sklearn.preprocessing to encode the data when the user opts out of the get_dummies function.

These are the three attributes that I have specified:

get_dummies: if True, uses the original pd.get_dummies method (default is False)
one_hot_encoder: the OneHotEncoder object
is_one_hot_fitted: boolean indicating whether one_hot_encoder has been fitted

I have updated the _prepare function as well:

def _prepare(self, X):
    if self.one_hot:
        if self.get_dummies:
            # Original behaviour: encode with pandas
            X = pd.get_dummies(X, columns=X.columns)
            return X
        else:
            if not self.is_one_hot_fitted:
                # First call: fit the encoder and remember that it is fitted
                X_enc = self.one_hot_encoder.fit_transform(X)
                X_enc = pd.DataFrame(X_enc, columns=self.one_hot_encoder.get_feature_names_out(X.columns))
                self.is_one_hot_fitted = True
                return X_enc
            else:
                # Later calls: reuse the categories learned during fit
                X_enc = self.one_hot_encoder.transform(X)
                X_enc = pd.DataFrame(X_enc, columns=self.one_hot_encoder.get_feature_names_out(X.columns))
                return X_enc
    return X

Let me know if there is anything else I can do, and whether the approach looks correct.

Thanks again for this great library <3

MaxHalford · 2024-09-07T18:31:18Z

Thanks for starting this PR! This is a tricky topic. Have you tried running the unit tests? I think this will fail due to supplementary columns... I have booked some time on my calendar to look into this. I'll let you know.

MaxHalford · 2024-09-07T18:31:27Z

And thanks for the appreciation :)

Vaseekaran-V · 2024-09-08T07:33:42Z

> Thanks for starting this PR! This is a tricky topic. Have you tried running the unit tests? I think this will fail due to supplementary columns... I have booked some time on my calendar to look into this. I'll let you know.

Hi, thank you. I hadn't run the unit tests and, as you said, they are failing. Please let me know if there is anything that I can do. Also, could you explain the reason for having supplementary columns?

Vaseekaran-V · 2024-09-08T07:37:58Z

I modified the MCA file to handle unknown features. The unit test error occurs because the features seen during fit are not the ones seen during transform, so I modified the _prepare function in mca.py:

def _prepare(self, X):
    if self.one_hot:
        if self.get_dummies:
            X = pd.get_dummies(X, columns=X.columns)
            return X
        else:
            if not self.is_one_hot_fitted:
                # The encoder is not fitted yet: fit it and set is_one_hot_fitted to True
                X_enc = self.one_hot_encoder.fit_transform(X)
                X_enc = pd.DataFrame(X_enc, columns=self.one_hot_encoder.get_feature_names_out(X.columns))
                self.is_one_hot_fitted = True
                return X_enc
            else:
                # Check whether the incoming columns are the ones the encoder was fitted on
                oh_cols = set(self.one_hot_encoder.feature_names_in_.tolist())
                X_cols = set(X.columns.tolist())

                if oh_cols == X_cols:
                    # Same columns as at fit time: transform with the fitted encoder
                    X_enc = self.one_hot_encoder.transform(X)
                    X_enc = pd.DataFrame(X_enc, columns=self.one_hot_encoder.get_feature_names_out(X.columns))
                    return X_enc
                else:
                    # Different columns (as in the unit tests): fit the encoder again
                    X_enc = self.one_hot_encoder.fit_transform(X)
                    X_enc = pd.DataFrame(X_enc, columns=self.one_hot_encoder.get_feature_names_out(X.columns))
                    return X_enc
    return X

I ran the unit tests and didn't have issues on my side. Please let me know if this works.

MaxHalford · 2024-09-08T15:42:59Z

OK, thanks for looking into it. I will take a good look! I also want to make sure the change you're bringing resolves this issue.

Vaseekaran-V · 2024-09-08T16:00:10Z

> OK, thanks for looking into it. I will take a good look! I also want to make sure the change you're bringing resolves this issue.

Sure, thank you. I saw the error from the clean code test and made a change.

Vaseekaran-V · 2024-09-22T14:29:43Z

Hi @MaxHalford, is there any update on this?

Vaseekaran-V added 5 commits September 3, 2024 09:27

added onehot encoder to handle unknown categorical values (which pd get dummies fail)

cc46520

modified code to support one_hot attribute and original get_dummies method | added description

09d92d2

fixed issue to get column names after using OneHotEncoder

37e0f59

small issue in _prepare (didn't return the one-hot encoded values if the one_hot_encoder is fitted)

77e0603

updated the mca notebook in docs/content

9166071

fixed an issue to handle unknown columns during one hot encoding for MCA analysis

f87c843

Vaseekaran-V added 5 commits September 8, 2024 13:15

fixed merging conflicts

af04b67

Merge branch 'MaxHalford-master' | fixing issues during merge

1255661

fixing merge issue in mca notebook in docs

e63d74b

removed code lines kept for debugging

3bcff9c

2 errors caused by print code for logging

bfc9179

fixed a clean code issue

ef68f6b
