ml

The ml command runs a train, predict, or trainandpredict action with any algorithm in the ml-commons plugin, using the search result returned by a PPL command as input.

RCF syntax for time-series data:

ml action='train' algorithm='rcf' <number_of_trees> <shingle_size> <sample_size> <output_after> <time_decay> <anomaly_rate> <time_field> <date_format> <time_zone> <category_field>

  • number_of_trees(integer): optional. Number of trees in the forest. The default value is 30.
  • shingle_size(integer): optional. The number of records in a shingle, where a shingle is a consecutive sequence of the most recent records. The default value is 8.
  • sample_size(integer): optional. The sample size used by stream samplers in this forest. The default value is 256.
  • output_after(integer): optional. The number of points required by stream samplers before results are returned. The default value is 32.
  • time_decay(double): optional. The decay factor used by stream samplers in this forest. The default value is 0.0001.
  • anomaly_rate(double): optional. The anomaly rate. The default value is 0.005.
  • time_field(string): mandatory. It specifies the time field for RCF to use as time-series data.
  • date_format(string): optional. The format of the values in the time_field field. The default format is "yyyy-MM-dd HH:mm:ss".
  • time_zone(string): optional. The time zone of the values in the time_field field. The default time zone is UTC.
  • category_field(string): optional. It specifies the category field used to group inputs. Each category will be independently predicted.
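
To make the shingle_size parameter concrete, here is a minimal Python sketch of shingling as described above: each shingle is a window of consecutive records. The function name `shingles` and the sample readings are hypothetical, and this illustrates the concept only; it is not how ml-commons builds shingles internally.

```python
from typing import List, Sequence, Tuple


def shingles(values: Sequence[float], shingle_size: int = 8) -> List[Tuple[float, ...]]:
    """Return every window of `shingle_size` consecutive values, oldest first."""
    if shingle_size < 1:
        raise ValueError("shingle_size must be at least 1")
    # A series shorter than shingle_size yields no complete shingle.
    return [
        tuple(values[i:i + shingle_size])
        for i in range(len(values) - shingle_size + 1)
    ]


# Made-up readings, for illustration only.
readings = [10.0, 11.0, 12.0, 13.0, 14.0]
windows = shingles(readings, shingle_size=3)
print(windows)
```

With five readings and a shingle size of 3, three overlapping windows are produced, each one shifted by a single record.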

RCF syntax for non-time-series data:

ml action='train' algorithm='rcf' <number_of_trees> <sample_size> <output_after> <training_data_size> <anomaly_score_threshold> <category_field>

  • number_of_trees(integer): optional. Number of trees in the forest. The default value is 30.
  • sample_size(integer): optional. Number of random samples given to each tree from the training data set. The default value is 256.
  • output_after(integer): optional. The number of points required by stream samplers before results are returned. The default value is 32.
  • training_data_size(integer): optional. The default value is the size of your training data set.
  • anomaly_score_threshold(double): optional. The threshold of anomaly score. The default value is 1.0.
  • category_field(string): optional. It specifies the category field used to group inputs. Each category will be independently predicted.
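
When the optional parameters are set from a script, it can help to assemble the command text programmatically. The sketch below is a hypothetical Python helper (`build_ml_command` is not part of any OpenSearch client). It assumes the name=value style visible in the queries in this document, with string values in single quotes and numeric values written bare, and it does not escape quotes inside values.

```python
def build_ml_command(action: str, algorithm: str, **params) -> str:
    """Assemble an `ml` command string from keyword parameters.

    Hypothetical helper: string values are single-quoted and other values
    are written bare. Quotes inside values are not escaped.
    """
    parts = [f"ml action='{action}'", f"algorithm='{algorithm}'"]
    for name, value in params.items():
        if isinstance(value, str):
            parts.append(f"{name}='{value}'")
        else:
            parts.append(f"{name}={value}")
    return " ".join(parts)


# Time-series form: time_field is supplied.
time_series = build_ml_command(
    "train", "rcf", time_field="timestamp", category_field="category"
)
# Non-time-series form: no time_field.
batch = build_ml_command("train", "rcf", number_of_trees=30, sample_size=256)

print(time_series)
print(batch)
```

The resulting string is appended to a PPL query after a pipe, for example after `source=nyc_taxi | fields value, timestamp | `.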

This example trains an RCF model and uses the model to detect anomalies in the time-series ridership data.

PPL query:

os> source=nyc_taxi | fields value, timestamp | ml action='train' algorithm='rcf' time_field='timestamp' | where value=10844.0
fetched rows / total rows = 1/1
+---------+---------------------+-------+---------------+
| value   | timestamp           | score | anomaly_grade |
|---------+---------------------+-------+---------------|
| 10844.0 | 2014-07-01 00:00:00 | 0.0   | 0.0           |
+---------+---------------------+-------+---------------+
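
The timestamp in the result above matches the default date_format, "yyyy-MM-dd HH:mm:ss", read in the default time zone, UTC. As a rough Python illustration of how such a value is interpreted, the sketch below parses it with the equivalent `strptime` pattern. The mapping from the date_format pattern to `%Y-%m-%d %H:%M:%S` is an assumption based on Java-style pattern letters, and `parse_time_field` is a hypothetical name.

```python
from datetime import datetime, timezone

# Assumed Python equivalent of the default "yyyy-MM-dd HH:mm:ss" pattern.
PYTHON_EQUIVALENT = "%Y-%m-%d %H:%M:%S"


def parse_time_field(text: str) -> datetime:
    """Parse a timestamp in the default format and mark it as UTC."""
    naive = datetime.strptime(text, PYTHON_EQUIVALENT)
    return naive.replace(tzinfo=timezone.utc)


parsed = parse_time_field("2014-07-01 00:00:00")
print(parsed.isoformat())  # 2014-07-01T00:00:00+00:00
```

If the data uses another layout or a local time zone, date_format and time_zone would need to be set accordingly in the ml command.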

This example trains an RCF model and uses the model to detect anomalies in the time-series ridership data with multiple category values.

PPL query:

os> source=nyc_taxi | fields category, value, timestamp | ml action='train' algorithm='rcf' time_field='timestamp' category_field='category' | where value=10844.0 or value=6526.0
fetched rows / total rows = 2/2
+----------+---------+---------------------+-------+---------------+
| category | value   | timestamp           | score | anomaly_grade |
|----------+---------+---------------------+-------+---------------|
| night    | 10844.0 | 2014-07-01 00:00:00 | 0.0   | 0.0           |
| day      | 6526.0  | 2014-07-01 06:00:00 | 0.0   | 0.0           |
+----------+---------+---------------------+-------+---------------+

This example trains an RCF model and uses the model to detect anomalies in the non-time-series ridership data.

PPL query:

os> source=nyc_taxi | fields value | ml action='train' algorithm='rcf' | where value=10844.0
fetched rows / total rows = 1/1
+---------+-------+-----------+
| value   | score | anomalous |
|---------+-------+-----------|
| 10844.0 | 0.0   | False     |
+---------+-------+-----------+

This example trains an RCF model and uses the model to detect anomalies in the non-time-series ridership data with multiple category values.

PPL query:

os> source=nyc_taxi | fields category, value | ml action='train' algorithm='rcf' category_field='category' | where value=10844.0 or value=6526.0
fetched rows / total rows = 2/2
+----------+---------+-------+-----------+
| category | value   | score | anomalous |
|----------+---------+-------+-----------|
| night    | 10844.0 | 0.0   | False     |
| day      | 6526.0  | 0.0   | False     |
+----------+---------+-------+-----------+

ml action='train' algorithm='kmeans' <centroids> <iterations> <distance_type>

  • centroids: optional. The number of clusters you want to group your data points into. The default value is 2.
  • iterations: optional. Number of iterations. The default value is 10.
  • distance_type: optional. The distance type can be COSINE, L1, or EUCLIDEAN. The default type is EUCLIDEAN.
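
For readers unfamiliar with the three distance types, the Python sketch below computes each one for a pair of points. These are the textbook definitions; cosine distance is taken here as 1 minus the cosine similarity, which is an assumption about how the plugin defines COSINE. The function names are illustrative.

```python
import math
from typing import Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def l1(a: Sequence[float], b: Sequence[float]) -> float:
    """L1 (Manhattan) distance: sum of the absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine distance, assumed here to be 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


print(euclidean((0.0, 0.0), (3.0, 4.0)))  # 5.0
print(l1((0.0, 0.0), (3.0, 4.0)))         # 7.0
print(cosine((1.0, 0.0), (0.0, 1.0)))     # 1.0
```

Euclidean and L1 depend on the magnitude of the feature values, while cosine distance depends only on the angle between the two vectors.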

This example clusters samples of three Iris species (Iris setosa, Iris virginica, and Iris versicolor) using four features measured from each sample: the length and width of the sepals and petals.

PPL query:

os> source=iris_data | fields sepal_length_in_cm, sepal_width_in_cm, petal_length_in_cm, petal_width_in_cm | ml action='train' algorithm='kmeans' centroids=3
+--------------------+-------------------+--------------------+-------------------+-----------+
| sepal_length_in_cm | sepal_width_in_cm | petal_length_in_cm | petal_width_in_cm | ClusterID |
|--------------------+-------------------+--------------------+-------------------+-----------|
| 5.1                | 3.5               | 1.4                | 0.2               | 1         |
| 5.6                | 3.0               | 4.1                | 1.3               | 0         |
| 6.7                | 2.5               | 5.8                | 1.8               | 2         |
+--------------------+-------------------+--------------------+-------------------+-----------+
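
To show how the centroids and iterations parameters shape a result like the one above, here is a minimal Python sketch of Lloyd's k-means algorithm with Euclidean distance. It is a teaching sketch, not the ml-commons implementation: the initialization (the first k points) and the made-up sample points are assumptions chosen to keep the run deterministic.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def kmeans(points: Sequence[Point], centroids: int = 2, iterations: int = 10) -> List[int]:
    """Return one cluster ID per point using Lloyd's algorithm."""
    if not 1 <= centroids <= len(points):
        raise ValueError("centroids must be between 1 and the number of points")
    # Assumed initialization: use the first `centroids` points as centers.
    centers = [tuple(p) for p in points[:centroids]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        labels = [
            min(range(centroids), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(centroids):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels


# Made-up two-dimensional points forming two well-separated groups.
sample = [(1.0, 1.0), (9.0, 9.0), (1.5, 1.0), (9.0, 8.5)]
labels = kmeans(sample, centroids=2, iterations=10)
print(labels)
```

The cluster IDs are arbitrary labels: they identify which points belong together, not which species or category a group represents.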