Continuize: Update documentation
janezd committed Dec 23, 2022
1 parent 63f4b63 commit c4b8438
Showing 4 changed files with 29 additions and 15 deletions.
44 changes: 29 additions & 15 deletions doc/visual-programming/source/widgets/data/continuize.md
@@ -11,45 +11,59 @@ Turns discrete variables (attributes) into numeric ("continuous") dummy variable

- Data: transformed data set

The **Continuize** widget receives a data set in the input and outputs the same data set in which the discrete variables (including binary variables) are replaced with continuous ones.
The **Continuize** widget receives a data set in the input and outputs the same data set in which some or all categorical variables are replaced with continuous ones and numeric variables are scaled.

![](images/Continuize-stamped.png)

1. Define the treatment of non-binary categorical variables.
1. Select a categorical attribute to define its specific treatment, or click the "Default" option above to set the default treatment for all categorical attributes without specific settings.

Examples in this section will assume that we have a discrete attribute status with the values low, middle and high, listed in that order. Options for their transformation are:
Multiple attributes can be chosen.

2. Define the treatment of categorical variables.

Examples in this section will assume that we have a categorical attribute *status* with values *low*, *middle* and *high*, listed in that order. Options for their transformation are:

- **Use default setting**: use the default treatment.

- **Leave categorical**: leave the attribute as it is.

   - **First value as base**: an N-valued categorical variable will be transformed into N-1 numeric variables, each serving as an indicator for one of the original values except for the base value. The base value is the first value in the list. By default, the values are ordered alphabetically; their order can be changed in [Edit Domain](../data/editdomain).

In the above case, the three-valued variable *status* is transformed into two numeric variables, *status=middle* with values 0 or 1 indicating whether the original variable had value *middle* on a particular example, and similarly, *status=high*.

- **Most frequent value as base**: similar to the above, except that the most frequent value is used as a base. So, if the most frequent value in the above example is *middle*, then *middle* is considered as the base and the two newly constructed variables are *status=low* and *status=high*.

   - **One attribute per value**: this option constructs one numeric variable for each value of the original variable. In the above case, we would get variables *status=low*, *status=middle* and *status=high*.
   - **One-hot encoding**: this option constructs one numeric variable for each value of the original variable. In the above case, we would get variables *status=low*, *status=middle* and *status=high*.

- **Ignore multinomial attributes**: removes non-binary categorical variables from the data.
- **Remove if more than 3 values**: removes non-binary categorical variables from the data.

- **Treat as ordinal**: converts the variable into a single numeric variable enumerating the original values. In the above case, the new variable would have the value of 0 for *low*, 1 for *middle* and 2 for *high*. Again note that the order of values can be set in [Edit Domain](../data/editdomain).

- **Divide by number of values**: same as above, except that values are normalized into range 0-1. In our example, the values of the new variable would be 0, 0.5 and 1.
- **Remove**: removes the attribute.

2. Define the treatment of continuous attributes. Besides the option to *Leave them as they are*, we can *Normalize by span*, which will subtract the lowest value found in the data and divide by the span, so all values will fit into [0, 1]. Option *Normalize by standard deviation* subtracts the average and divides by the standard deviation.
- **Treat as ordinal**: converts the variable into a single numeric variable enumerating the original values. In the above case, the new variable would have the value of 0 for *low*, 1 for *middle* and 2 for *high*. Again note that the order of values can be set in [Edit Domain](../data/editdomain).

3. Define the treatment of class attributes (outcomes, targets). Besides leaving it as it is, the available options mirror those for multinomial attributes, except for those that would split the outcome into multiple outcome variables.
- **Treat as normalized ordinal**: same as above, except that values are normalized into range 0-1. In our example, the values of the new variable would be 0, 0.5 and 1.

4. This option defines the ranges of new variables. In the above text, we supposed the range *from 0 to 1*.
3. Select attributes to set individual treatments or click "Default" to set the default treatment for numeric attributes.

5. Produce a report.
4. Define the treatment of numeric attributes.

6. If *Apply automatically* is ticked, changes are committed automatically. Otherwise, you have to press *Apply* after each change.
- **Use default setting**: use the general default.
- **Leave as it is**: do not change anything.
- **Standardize**: subtract the mean and divide by the standard deviation (not available for sparse data).
- **Center**: subtract the mean (not available for sparse data).
   - **Scale**: divide by the standard deviation.
   - **Normalize to interval [-1, 1]**: linearly scale the values into interval [-1, 1] (not available for sparse data).
   - **Normalize to interval [0, 1]**: linearly scale the values into interval [0, 1] (not available for sparse data).

5. If checked, the class attribute is converted in the same fashion as categorical attributes that are treated as ordinal (see above).
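
The treatments listed above can be illustrated with a minimal sketch in plain Python. This is not the widget's implementation and does not use Orange's API. The function names, the `name` parameter and the sample data are hypothetical. The use of the population standard deviation in `standardize` is an assumption, since the text does not say which estimator the widget uses.

```python
from collections import Counter
from statistics import mean, pstdev

# Values of the example attribute *status*, in their listed order.
VALUES = ["low", "middle", "high"]


def indicators(column, values, name="status", base=None):
    """One 0/1 indicator column per value, skipping the base value (if any)."""
    return {
        f"{name}={v}": [int(x == v) for x in column]
        for v in values
        if v != base
    }


def first_value_as_base(column, values):
    return indicators(column, values, base=values[0])


def most_frequent_as_base(column, values):
    base = Counter(column).most_common(1)[0][0]
    return indicators(column, values, base=base)


def one_hot(column, values):
    return indicators(column, values)


def ordinal(column, values, normalized=False):
    """Enumerate values as 0, 1, 2, ...; optionally rescale them into [0, 1]."""
    divisor = max(len(values) - 1, 1) if normalized else 1
    return [values.index(x) / divisor for x in column]


def standardize(xs):
    """Subtract the mean and divide by the (population) standard deviation."""
    m, s = mean(xs), pstdev(xs)
    return [(x - m) / s for x in xs]


def normalize(xs, lo=0.0, hi=1.0):
    """Linearly scale values into the interval [lo, hi]."""
    xmin, xmax = min(xs), max(xs)
    return [lo + (x - xmin) * (hi - lo) / (xmax - xmin) for x in xs]


status = ["low", "middle", "high", "middle"]
print(first_value_as_base(status, VALUES))       # status=middle, status=high
print(most_frequent_as_base(status, VALUES))     # status=low, status=high
print(ordinal(status, VALUES, normalized=True))  # [0.0, 0.5, 1.0, 0.5]
print(normalize([2, 4, 6], lo=-1.0, hi=1.0))     # [-1.0, 0.0, 1.0]
```

Both base-value options produce N-1 indicator columns and differ only in which value is dropped, while one-hot encoding keeps all N. The sketch does not handle constant columns, where the standard deviation or the span is zero.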

Examples
--------

First, let's see what the output of the **Continuize** widget is. We feed the original data (the *Heart disease* data set) into the [Data Table](../data/datatable) and see what they look like. Then we continuize the discrete values and observe them in another [Data Table](../data/datatable).
First, let's see what the output of the **Continuize** widget is. We feed the original data (the *Heart disease* data set) into the [Data Table](../data/datatable) and see what they look like. Then we continuize the discrete values using various options and observe them in another [Data Table](../data/datatable).

![](images/Continuize-Example1.png)

In the second example, we show a typical use of this widget: in order to properly plot the linear projection of the data, discrete attributes need to be converted to continuous ones, and that is why we put the data through the **Continuize** widget before drawing it. The attribute "*chest pain*" originally had four values and was transformed into three continuous attributes; something similar happened to gender, which was transformed into a single attribute "*gender=female*".
In the second example, we show a typical use of this widget: in order to properly plot the linear projection of the data, discrete attributes need to be converted to continuous ones, and that is why we put the data through the **Continuize** widget before drawing it. Gender, for instance, is transformed into two attributes, "*gender=female*" and "*gender=male*".

![](images/Continuize-Example2.png)