Example usage #15

Merged · 3 commits · Jun 3, 2024
110 changes: 102 additions & 8 deletions README.md
SigMahaKNN (`signature_mahalanobis_knn`) combines the variance norm (a
generalisation of the Mahalanobis distance) with path signatures for anomaly
detection for multivariate streams. The `signature_mahalanobis_knn` library is a
Python implementation of the SigMahaKNN method described in
[_Dimensionless Anomaly Detection on Multivariate Streams with Variance Norm and Path Signature_](https://arxiv.org/abs/2006.03487).

To find the examples from the paper, please see the
[paper-examples](paper-examples) folder which includes notebooks for downloading
and running the experiments.

The key contributions of this library are:

- A simple and efficient implementation of the variance norm distance as
provided by the `signature_mahalanobis_knn.Mahalanobis` class. The class has
two main methods:
- The `fit` method to fit the class to a corpus of training data
- The `distance` method to compute the distance between two `numpy` arrays
`x1` and `x2`
- A simple and efficient implementation of the SigMahaKNN method as provided by
the `signature_mahalanobis_knn.SignatureMahalanobisKNN` class. The class has two main
methods:
- The `fit` method to fit a model to a training dataset
- The `fit` method can take in a corpus of streams as its input (where we
  use the `sktime` library to compute path signatures of the streams)
## Installation

```bash
pip install signature_mahalanobis_knn
```

## Usage

As noted above, the `signature_mahalanobis_knn` library has two main classes:
`Mahalanobis`, a class for computing the variance norm distance, and
`SignatureMahalanobisKNN`, a class for computing the conformance score for a set of new data
points.

### Computing the variance norm distance

To compute the variance norm (a generalisation of the Mahalanobis distance) for a
pair of data points `x1` and `x2` given a corpus of training data `X` (a two-dimensional
`numpy` array), you can use the `Mahalanobis` class as follows:

```python
import numpy as np
from signature_mahalanobis_knn import Mahalanobis

# create a corpus of training data
X = np.random.rand(100, 10)

# initialise the Mahalanobis class
mahalanobis = Mahalanobis()
mahalanobis.fit(X)

# compute the variance norm distance between two data points
x1 = np.random.rand(10)
x2 = np.random.rand(10)
distance = mahalanobis.distance(x1, x2)
```

The example above uses the default initialisation of the `Mahalanobis` class.
There are also a few parameters that can be set when initialising the class
(see details in [_Dimensionless Anomaly Detection on Multivariate Streams with Variance Norm and Path Signature_](https://arxiv.org/abs/2006.03487)):
- `subspace_thres`: (float) threshold for deciding whether or not a point is in the subspace, default is 1e-3
- `svd_thres`: (float) threshold for deciding the numerical rank of the data matrix, default is 1e-12
- `zero_thres`: (float) threshold for deciding whether the distance should be set to zero, default is 1e-12
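For intuition, the classical Mahalanobis distance that the variance norm generalises can be sketched in plain `numpy`. This is an illustration only, not the library's implementation; the SVD truncation here mirrors the role of the `svd_thres` parameter described above:

```python
import numpy as np

# plain-numpy sketch of the classical Mahalanobis distance: singular values
# of the covariance below svd_thres are treated as zero, so the inverse is a
# truncated pseudo-inverse of the corpus covariance
def mahalanobis_distance(x1, x2, X, svd_thres=1e-12):
    cov = np.cov(X, rowvar=False)  # covariance of the training corpus
    U, s, Vt = np.linalg.svd(cov)
    s_inv = np.where(s > svd_thres, 1.0 / s, 0.0)  # drop tiny singular values
    cov_pinv = (Vt.T * s_inv) @ U.T
    diff = x1 - x2
    return float(np.sqrt(diff @ cov_pinv @ diff))

rng = np.random.default_rng(0)
X = rng.random((100, 10))
x1, x2 = rng.random(10), rng.random(10)
d = mahalanobis_distance(x1, x2, X)
```

The distance of a point to itself is zero, and the function is symmetric in its two arguments, as expected of a (pseudo-)metric.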

### Using the SigMahaKNN method for anomaly detection

To use the SigMahaKNN method for anomaly detection on multivariate streams,
first initialise the `SignatureMahalanobisKNN` class, then use the `fit`
method to fit a model to a training dataset of streams and the `conformance`
method to compute the conformance score for a set of new data streams:

```python
import numpy as np
from signature_mahalanobis_knn import SignatureMahalanobisKNN

# create a corpus of training data
# X is a three-dimensional numpy array with shape (n_samples, length, channels)
X = np.random.rand(100, 10, 3)

# initialise the SignatureMahalanobisKNN class
sig_maha_knn = SignatureMahalanobisKNN()
sig_maha_knn.fit(
    knn_library="sklearn",
    X_train=X,
    signature_kwargs={"depth": 3},
)

# create a set of test data streams
Y = np.random.rand(10, 10, 3)

# compute the conformance score for the test data streams
conformance_scores = sig_maha_knn.conformance(X_test=Y, n_neighbors=5)
```

Note that in this example we pass a corpus of streams to fit the model and
compute the conformance scores; the `sktime` library is used to compute the
path signatures of the streams.
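For intuition, the lowest levels of a path signature are straightforward to compute by hand. The sketch below is plain `numpy` for illustration only (the library itself delegates signature computation to `sktime`): it accumulates the depth-1 and depth-2 signature terms of a piecewise-linear stream, one increment at a time, using Chen's identity:

```python
import numpy as np

# illustration: depth-1 and depth-2 terms of the path signature of a
# piecewise-linear stream, accumulated via Chen's identity
def signature_depth2(stream):
    n_channels = stream.shape[1]
    S1 = np.zeros(n_channels)                # level 1: total increment
    S2 = np.zeros((n_channels, n_channels))  # level 2: iterated integrals
    for delta in np.diff(stream, axis=0):
        # Chen's identity for appending one linear segment to the path
        S2 += np.outer(S1, delta) + 0.5 * np.outer(delta, delta)
        S1 = S1 + delta
    return S1, S2

rng = np.random.default_rng(1)
stream = rng.random((10, 3))  # one stream of shape (length, channels)
S1, S2 = signature_depth2(stream)
```

A quick sanity check is the shuffle identity relating the first two levels: `S2 + S2.T` equals the outer product of `S1` with itself.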

However, if you have already computed signatures, or you are using another
feature representation method, you can pass a corpus of signatures to the
`fit` and `conformance` methods instead of the streams, using the
`signatures_train` and `signatures_test` arguments, respectively:

```python
import numpy as np
from signature_mahalanobis_knn import SignatureMahalanobisKNN

# create a corpus of training data (signatures or other feature representations)
# features is a two-dimensional numpy array with shape (n_samples, n_features)
features = np.random.rand(100, 10)

# initialise the SignatureMahalanobisKNN class
sig_maha_knn = SignatureMahalanobisKNN()
sig_maha_knn.fit(
    knn_library="sklearn",
    signatures_train=features,
)

# create a set of test features
features_y = np.random.rand(10, 10)

# compute the conformance score for the test features
conformance_scores = sig_maha_knn.conformance(signatures_test=features_y, n_neighbors=5)
```
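To illustrate the idea behind the conformance score, the following sketch (plain `numpy`, not the library's API; plain Euclidean distance stands in for the variance norm distance) scores each test point by the distance to its nearest neighbours in the training corpus:

```python
import numpy as np

# illustration of the kNN idea behind the conformance score: a test point's
# score is its distance to its n_neighbors-th nearest training point, so
# points far from the whole corpus receive high scores
def knn_conformance(features_train, features_test, n_neighbors=5):
    # pairwise Euclidean distances, shape (n_test, n_train)
    dists = np.linalg.norm(
        features_test[:, None, :] - features_train[None, :, :], axis=-1
    )
    # distance to the n_neighbors-th nearest training point
    return np.sort(dists, axis=1)[:, n_neighbors - 1]

rng = np.random.default_rng(42)
features = rng.random((100, 10))   # training corpus of feature vectors
features_y = rng.random((10, 10))  # test feature vectors
scores = knn_conformance(features, features_y, n_neighbors=5)
```

A point that already belongs to the training corpus gets a score of zero with `n_neighbors=1`, since its nearest neighbour is itself.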

## Repo structure

The core implementation of the SigMahaKNN method is in the
`src/signature_mahalanobis_knn` folder:

- `mahal_distance.py` contains the implementation of the `Mahalanobis` class to
compute the variance norm distance
- `sig_maha_knn.py` contains the implementation of the `SignatureMahalanobisKNN` class to
compute the conformance scores for a set of new data points against a corpus
of training data
- `utils.py` contains some utility functions that are useful for the library
There are various examples in the `examples` and `paper-examples` folders:

- `examples` contains small examples using randomly generated data for
illustration purposes
- `paper-examples` contains the examples used in our paper
[_Dimensionless Anomaly Detection on Multivariate Streams with Variance Norm and Path Signature_](https://arxiv.org/abs/2006.03487)
where we compare the SigMahaKNN method to other baseline approaches (e.g.
Isolation Forest and Local Outlier Factor) on real-world datasets
- There are notebooks for downloading and preprocessing the datasets for the
examples - see [paper-examples/README.md](paper-examples/README.md) for more
details
8 changes: 4 additions & 4 deletions paper-examples/README.md
Prior to running this experiment notebook, you will need to run the
[ship_movement_anomalies_data.ipynb](ship_movement_anomalies_data.ipynb)
notebook to download and pre-process the data.

## [Univariate time series: ucr_anomalies.ipynb](ucr_anomalies.ipynb)

Prior to running this experiment notebook, you will need to run the
[ucr_anomalies_data.ipynb](ucr_anomalies_data.ipynb) notebook
to download and pre-process the data.

## [Language dataset: language_dataset_anomalies.ipynb](language_dataset_anomalies.ipynb)

Prior to running this experiment notebook, you will need to run the
[language_dataset_anomalies_data.ipynb](language_dataset_anomalies_data.ipynb)
notebook to download and pre-process the data.

There is some data provided for this in the `data` folder, but the notebook will
Expand Down
221 changes: 0 additions & 221 deletions paper-examples/example2.ipynb

This file was deleted.
