To set up the data generator on OpenShift, simply use s2i by running:
oc new-app centos/python-36-centos7~https://github.com/ruivieira/timeseries-mock \
-e KAFKA_BROKERS=kafka:9092 \
-e KAFKA_TOPIC=example \
-e CONF=examples/mean_continuous.yml \
--name=emitter
This will deploy the data generator, which will emit data (modelled as defined by the configuration file examples/mean_continuous.yml) into the Kafka topic example.
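To check that the emitter is running, you can follow its logs. Note that depending on your OpenShift version, oc new-app may create a DeploymentConfig or a Deployment, so the resource below might be deployment/emitter rather than dc/emitter:

# follow the emitter's logs to verify that data is being generated
oc logs -f dc/emitter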
To configure a data stream, you must specify both the structure of the time-series and the type of observation in a .yml file.
The structure can be specified by declaring several fundamental components, which are ultimately composed to create a single structure. Some core components are:
The mean component specifies an underlying mean. Using a single mean component will result in a random-walk type time-series:
structure:
  - type: mean
    start: 0.0
    noise: 1.5
All components need a start and a noise value. start specifies the approximate region where the series begins and noise specifies how much the component will vary over time.
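As an illustration (a hypothetical configuration, not one of the bundled examples), a mean component with a very small noise produces an almost flat series, while a larger noise yields a more erratic random walk:

structure:
  - type: mean
    start: 10.0
    noise: 0.01  # stays close to 10 over time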
The season component represents a seasonal (cyclic) pattern:
structure:
  - type: season
    period: 200
    start: 0.0
    noise: 0.7
The period represents how often the season repeats. Note that this is relative to your specified rate.
That is, if your rate is 0.1 (the generator will emit new data every 0.1 seconds), then a period of 20 means that the season will last rate * period = 2 seconds. But if your rate is 100 (one value every 100 seconds), the same period of 20 means the season repeats every 2000 seconds, that is, every 33.33 minutes.
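For example, to obtain a season repeating every minute while emitting a value every 0.5 seconds, pick the values so that rate * period = 60 (an illustrative configuration):

rate: 0.5
structure:
  - type: season
    period: 120  # 0.5 * 120 = 60 seconds per cycle
    start: 0.0
    noise: 0.7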
The seasonal component's Fourier representation consists of n harmonics, which can either be specified in the configuration as:
structure:
  - type: season
    # ...
    harmonics: 6
    # ...
or just default to n=3 harmonics if not specified.
Structures can be composed simply by listing them under structure in the .yml file.
For instance, composing the above mean and seasonality examples would simply be:
structure:
  - type: mean    # component 1
    start: 0.0
    noise: 1.5
  - type: season  # component 2
    period: 200
    start: 0.0
    noise: 0.7
The observation type can be configured using the observations key.
The main supported observation types are detailed below.
Continuous observations allow us to model any floating-point measure. Note that this is not bound by upper or lower limits (range ]-inf, +inf[ [1]). If you use continuous to simulate, say, temperature readings from a sensor, keep in mind that the simulated readings might drift to very high or very low values depending on the structure. [2]
observations:
  - type: continuous
    noise: 1.5
Discrete observations allow us to model any integer measure in the range [0, +inf[.
The notes about drift in the continuous section also apply.
observations:
  - type: discrete
Please note that (at the moment) the discrete case only allows the noise to be specified at the structure level, since the observations are based on a Poisson model.
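A minimal discrete stream would therefore look like the following sketch, with the noise set on the structure component only:

structure:
  - type: mean
    start: 2.0
    noise: 0.3  # noise is specified here, not on the observation
observations:
  - type: discrete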
Categorical observations allow us to model any set of categories, represented by an integer.
observations:
  - type: categorical
    categories: 16
The typical example would be setting categories to 1. This would simulate a stream of "binary" values 0 and 1. In the above example, setting categories to 16 would output a stream taking any value from [0, 1, 2, ..., 16].
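For instance, the binary case mentioned above would simply be:

observations:
  - type: categorical
    categories: 1  # emits a stream of 0s and 1s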
A variant of this generator consists in passing a list of values directly.
Let's assume we wanted to generate a stream of random DNA nucleotides, that is, C, T, A, G. This corresponds to four categories, which we can specify in the values field:
observations:
  - type: categorical
    values: C,T,A,G
Comma-separated values are taken as the categories, without needing to specify anything else.
The output is a random element of values at each timepoint; in this case the time-series would be:
G -> T -> G -> T -> A -> A -> A -> C -> ...
A full configuration file would look something like this:
name: "status"
rate: 0.1
structure:
  - type: mean
    start: 0.0
    noise: 0.5
  - type: season
    period: 600
    start: 0.0
    noise: 1.7
observations:
  - type: categorical
    values: pass,fail
This configuration generates a stream of values pass or fail at a rate of one value every 0.1 seconds, with a random-walk-like mean and a cyclic pattern repeating every minute.
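The emitted stream would then look something like this (illustrative output):

pass -> pass -> fail -> pass -> fail -> fail -> pass -> ...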
The previous example was for a univariate observation; however, in a real-world application it is very likely we will need multivariate data.
To compose a multivariate data model we simply use the model specification above and add as many models together as we want.
To define a multivariate model we declare the individual components inside a compose clause. For instance, a minimal example of a bivariate continuous stream would be:
name: "bivariate"
rate: 0.5
compose:
  - structure:        # component 1
      - type: mean
        start: 0.0
        noise: 0.5
    observations:
      - type: continuous
        noise: 0.5
  - structure:        # component 2
      - type: mean
        start: 5.0
        noise: 3.7
    observations:
      - type: continuous
        noise: 1.5
This would output a stream of bivariate observations such as
[-0.6159691811574524, 6.70524660538598]
[0.09028869591370958, 6.519194818247104]
[-0.1980867909796035, 6.503466768530726]
[0.0063771543008148135, 5.2229932206447405]
...
In the specific case where you wish to simulate a multivariate observation with components that follow the same structure, you can use the shorthand replicate. The observation is then replicated n times.
For example, to simulate bivariate samples where the components have the same underlying structure, we could write:
name: "bivariate"
rate: 0.5
compose:
  - replicate: 2
    structure:
      - type: mean
        start: 0.0
        noise: 0.5
    observations:
      - type: continuous
        noise: 0.5
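Assuming replicate simply duplicates the component, this should be equivalent to writing the same structure and observations twice under compose:

compose:
  - structure:
      - type: mean
        start: 0.0
        noise: 0.5
    observations:
      - type: continuous
        noise: 0.5
  - structure:
      - type: mean
        start: 0.0
        noise: 0.5
    observations:
      - type: continuous
        noise: 0.5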
In this example we will generate a fake HTTP log stream. The multivariate data will contain a request type (GET, POST or PUT), a URL from a provided list and a random IP address.
We want the URL to have seasonality, that is, over time users will cyclically tend towards certain URLs more than others.
We can define this model as:
name: "HTTP log"
rate: 0.1
compose:
  - structure:
      - type: mean
        start: 0.0
        noise: 0.01
    observations:
      - type: categorical
        values: GET,POST,PUT
  - structure:
      - type: mean
        start: 0.0
        noise: 0.01
      - type: season
        start: 1.0
        period: 15
        noise: 0.2
    observations:
      - type: categorical
        values: /site/page.htm,/site/index.htm,/internal/example.htm
  - replicate: 4
    structure:
      - type: mean
        start: 0.0
        noise: 2.1
    observations:
      - type: categorical
        categories: 255
An example output would be:
["PUT", "/internal/example.htm", 171, 158, 59, 89]
["GET", "/internal/example.htm", 171, 253, 71, 146]
["PUT", "/internal/example.htm", 224, 252, 9, 156]
["POST", "/site/index.htm", 143, 253, 6, 126]
["POST", "/site/page.htm", 238, 254, 2, 48]
["GET", "/site/page.htm", 228, 252, 52, 126]
["POST", "/internal/example.htm", 229, 234, 103, 233]
["GET", "/internal/example.htm", 185, 221, 109, 195]
...
This project is based on elmiko's Kafka OpenShift Python Emitter, a part of the Bones Brigade set of OpenShift application skeletons.
[1] - whatever the minimum and maximum double/float values are, of course
[2] - In this case I suggest using some auto-regressive component in the structure.