FeatureSet: Controlling Training Rows

Brian Wylie edited this page Nov 2, 2023 · 4 revisions

The traditional 80%/20% split is great for quick model construction, but it's nice to have more control over exactly which rows get used for training when building your models. Right now the SageWorks API supports three options:

  • 80/20 Split: Default, just use the classes as they are.
  • Train on 100%: Useful for deploying a model to production.
  • Hold Out Set: Specify an explicit set of rows to NOT be used for training.
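As a rough illustration of what the first two options mean at the row level, here is a minimal, library-free sketch. It is not how SageWorks performs the split internally; the helper name `split_flags`, the `train_fraction` parameter, and the seeded shuffle are all assumptions made for this example, and row IDs are assumed to be unique.

```python
import random


def split_flags(row_ids, train_fraction=0.8, seed=42):
    """Return {row_id: bool}, where True marks a row used for training.

    Hypothetical helper for illustration only (not part of SageWorks).
    Assumes row IDs are unique.
    """
    ids = list(row_ids)
    shuffled = ids[:]
    # A seeded generator keeps the split reproducible between runs
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    train_ids = set(shuffled[:n_train])
    return {row_id: row_id in train_ids for row_id in ids}


# 80/20 split over ten rows
flags = split_flags(range(10))
print(sum(flags.values()), "training rows,", len(flags) - sum(flags.values()), "held out")

# "Train on 100%" is the same idea with every row flagged for training
all_rows = split_flags(range(10), train_fraction=1.0)
print(all(all_rows.values()))
```

The Hold Out Set option replaces the random choice with an explicit list of IDs that you supply.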

Train on 100%

Just set the train_all_data argument to True when calling transform:

features_to_model = FeaturesToModel("abalone_feature_set", "abalone-regression",
                                     model_type=ModelType.REGRESSOR)
features_to_model.set_output_tags(["abalone", "regression"])
features_to_model.transform(target_column="class_number_of_rings", 
                            description="Abalone Regression Model", train_all_data=True)

Hold Out Set

This gives you fine-grained control over exactly which rows are NOT used for training (all other rows will be trained on).

# Specify the column name and a list of items to 'Hold Out'
fs = FeatureSet("abalone_feature_set")
fs.create_training_view("id", hold_out_ids=[1, 5, 18, 26, ...])

# Another example
fs = FeatureSet("test_feature_set")
fs.create_training_view("name", hold_out_ids=["Bob", "Sue", "Tim", ...])

This will create a 'training_view' for the feature set, which will automatically be used when creating a model:

features_to_model = FeaturesToModel("abalone_feature_set", "abalone-regression",
                                     model_type=ModelType.REGRESSOR)

The model above will be trained on the rows that are specified by the training view.
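To make the hold-out semantics concrete, the sketch below flags rows in plain Python. It is only an illustration of the idea, not the SageWorks implementation: `mark_training_rows` is a hypothetical helper, the `training` column name is borrowed from the error message shown in the Issues section, and the boolean encoding is an assumption.

```python
def mark_training_rows(rows, id_column, hold_out_ids):
    """Return copies of rows with a boolean 'training' flag added.

    Hypothetical helper for illustration only (not part of SageWorks).
    Rows whose id_column value appears in hold_out_ids get training=False;
    every other row gets training=True.
    """
    hold_out = set(hold_out_ids)  # set lookup keeps membership checks fast
    return [{**row, "training": row[id_column] not in hold_out} for row in rows]


# Made-up example data
rows = [
    {"name": "Bob", "age": 41},
    {"name": "Ann", "age": 35},
    {"name": "Sue", "age": 29},
]

marked = mark_training_rows(rows, "name", hold_out_ids=["Bob", "Sue"])
training_rows = [r for r in marked if r["training"]]
print([r["name"] for r in training_rows])  # ['Ann']
```

Any ID in the hold-out list that matches no row is simply ignored in this sketch; whether SageWorks behaves the same way is not stated here.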

Inspecting the training view

If you want to inspect/use/view the training view, you can do that either through the Athena console or through the SageWorks API.

fs = FeatureSet("test_feature_set")
ds = fs.training_view
table = ds.table_name
df = ds.query(f"select * from {table}")

Issues

If you get an error like this, it means that your FeatureSet already has a training column. Unfortunately, since FeatureSets are 'append only', the only way to solve this is to delete your FeatureSet and recreate it.

ERROR Failed to execute statement: line 1:1: Column name 'training' specified more than once