FeatureSet: Controlling Training Rows
The traditional 80%/20% split is great for quick model construction, but when building your models it's nice to have more control over exactly which rows are used for training. Right now the SageWorks API supports three options:
- 80/20 Split: Default, just use the classes as they are.
- Train on 100%: Useful for deploying a model to production.
- Hold Out Set: Specify an explicit set of rows to NOT be used for training.
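To make the three options concrete, here is a minimal plain-Python sketch of how each one decides which rows are trained on. It does not use or reproduce the SageWorks implementation: the function name `select_training_ids`, the `mode` strings, and the seeded shuffle used for the 80/20 case are all assumptions made for illustration.

```python
import random


def select_training_ids(ids, mode="split", hold_out_ids=None, seed=42):
    """Return the set of row ids used for training (illustrative only)."""
    ids = list(ids)
    if mode == "all":
        # Train on 100%: every row is used for training.
        return set(ids)
    if mode == "hold_out":
        # Hold Out Set: every row EXCEPT the listed ids is used for training.
        held_out = set(hold_out_ids or [])
        return {i for i in ids if i not in held_out}
    if mode == "split":
        # 80/20 Split: a seeded shuffle (an assumption) keeps the first 80%.
        shuffled = ids[:]
        random.Random(seed).shuffle(shuffled)
        cutoff = (len(shuffled) * 80) // 100
        return set(shuffled[:cutoff])
    raise ValueError(f"Unknown mode: {mode!r}")


row_ids = range(100)
split_ids = select_training_ids(row_ids)                      # 80 rows
all_ids = select_training_ids(row_ids, mode="all")            # 100 rows
kept_ids = select_training_ids(row_ids, mode="hold_out",
                               hold_out_ids=[1, 5, 18, 26])   # 96 rows
```

The only difference between the modes is which rows end up excluded from training: a 20% sample, none, or exactly the ids you name.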
To train on 100% of the rows, set the train_all_data argument to True when calling transform:
features_to_model = FeaturesToModel("abalone_feature_set", "abalone-regression",
                                    model_type=ModelType.REGRESSOR)
features_to_model.set_output_tags(["abalone", "regression"])
features_to_model.transform(target_column="class_number_of_rings",
                            description="Abalone Regression Model", train_all_data=True)
A hold out set gives you fine-grained control over exactly which rows are NOT used for training (all other rows will be trained on):
# Specify the column name and a list of items to 'Hold Out'
fs = FeatureSet("abalone_feature_set")
fs.create_training_view("id", hold_out_ids=[1, 5, 18, 26, ...])
# Another example
fs = FeatureSet("test_feature_set")
fs.create_training_view("name", hold_out_ids=["Bob", "Sue", "Tim", ...])
This creates a 'training_view' for the FeatureSet, which is automatically used when creating a model:
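features_to_model = FeaturesToModel("abalone_feature_set", "abalone-regression",
                                    model_type=ModelType.REGRESSOR)
features_to_model.transform(target_column="class_number_of_rings",
                            description="Abalone Regression Model")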
The model above will be trained only on the rows selected by the training view.
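Conceptually, a training view can be thought of as the FeatureSet rows plus a flag that says whether each row is used for training. The sketch below mimics that idea in plain Python. The helper `build_training_view` and the sample rows are hypothetical, and the boolean `training` key only echoes the column name that appears in the error message at the end of this page; the real view's schema may differ.

```python
def build_training_view(rows, id_column, hold_out_ids):
    """Return copies of the rows with an added 'training' flag (illustrative only)."""
    held_out = set(hold_out_ids)
    # A row is flagged for training unless its id is in the hold out list.
    return [{**row, "training": row[id_column] not in held_out} for row in rows]


# Hypothetical sample data keyed by a "name" column
rows = [
    {"name": "Bob", "age": 41},
    {"name": "Ann", "age": 35},
    {"name": "Sue", "age": 29},
]
view = build_training_view(rows, "name", hold_out_ids=["Bob", "Sue"])

training_rows = [r for r in view if r["training"]]
held_out_rows = [r for r in view if not r["training"]]
```

Here only the "Ann" row would be trained on, while "Bob" and "Sue" stay available for evaluating the model afterwards.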
If you want to inspect/use/view the training view, you can do that either through the Athena console or through the SageWorks API.
fs = FeatureSet("test_feature_set")
ds = fs.training_view
table = ds.table_name
df = ds.query(f"select * from {table}")
If you get an error like the one below, it means that your FeatureSet already has a 'training' column. Unfortunately, since FeatureSets are 'append only', the only way to solve this is to delete your FeatureSet and recreate it.
ERROR Failed to execute statement: line 1:1: Column name 'training' specified more than once