Baselines #14

Merged
merged 25 commits into from
Dec 5, 2023
e770707
first attempt at I.F. and L.O.F. baselines
rchan26 Nov 27, 2023
c035ea1
add helper function to compute sigs and doc strings for baselines
rchan26 Nov 27, 2023
66ba947
pen-digit with baselines and download notebook
rchan26 Nov 29, 2023
dd7566c
pen-digit and ship-movement results
rchan26 Nov 29, 2023
586077d
ucr results
rchan26 Nov 29, 2023
ba13f9b
add ucr data downloader notebook
rchan26 Dec 1, 2023
e96607e
add lang data results
rchan26 Dec 1, 2023
c25b05f
ship-movement data processing notebook
rchan26 Dec 1, 2023
d814f81
data for language dataset example
rchan26 Dec 1, 2023
af3a07a
update language data loading
rchan26 Dec 3, 2023
1b98658
have common moments computing function for baselines
rchan26 Dec 3, 2023
b9e3e8f
add line to save some indices to pkl in ship mov
rchan26 Dec 4, 2023
29d6fb5
remove the examples that need data
rchan26 Dec 4, 2023
f7603c4
add dependency on pandas
rchan26 Dec 4, 2023
11af5f3
add requirements
rchan26 Dec 4, 2023
db7f60d
first readme attempt
rchan26 Dec 4, 2023
ab11853
save in data dir
rchan26 Dec 4, 2023
fcea6d3
upload data necessary for ucr
rchan26 Dec 4, 2023
db98e64
fix donwload link
rchan26 Dec 4, 2023
3814f09
start the paper-examples readme
rchan26 Dec 4, 2023
7d6da86
remove the language pkl files
rchan26 Dec 4, 2023
8fa0644
not completed lang data
rchan26 Dec 4, 2023
94b3a2e
add n_jobs as optional argument in paper-methods
rchan26 Dec 5, 2023
30b98a2
final language data anomalies
rchan26 Dec 5, 2023
60dd6e9
bump version
rchan26 Dec 5, 2023
2 changes: 2 additions & 0 deletions .gitignore
@@ -157,4 +157,6 @@ Thumbs.db
*~
*.swp

# miscellaneous
.DS_Store
paper-examples/data
99 changes: 98 additions & 1 deletion README.md
@@ -1,4 +1,6 @@
# signature_mahalanobis_knn
# SigMahaKNN - Signature Mahalanobis KNN method

## Anomaly detection on multivariate streams with Variance Norm and Path Signature

[![Actions Status][actions-badge]][actions-link]
[![Documentation Status][rtd-badge]][rtd-link]
@@ -22,3 +24,98 @@
[rtd-link]: https://signature_mahalanobis_knn.readthedocs.io/en/latest/?badge=latest

<!-- prettier-ignore-end -->

SigMahaKNN (`signature_mahalanobis_knn`) combines the variance norm (a
generalisation of the Mahalanobis distance) with path signatures for anomaly
detection on multivariate streams. The `signature_mahalanobis_knn` library is a
Python implementation of the SigMahaKNN method. The key contributions of this
library are:

- A simple and efficient implementation of the variance norm distance as
  provided by the `signature_mahalanobis_knn.Mahalanobis` class. The class has
  two main methods:
  - The `fit` method to fit the variance norm distance to a training dataset
  - The `distance` method to compute the distance between two `numpy` arrays
    `x1` and `x2`
- A simple and efficient implementation of the SigMahaKNN method as provided by
  the `signature_mahalanobis_knn.SigMahaKNN` class. The class has two main
  methods:
  - The `fit` method to fit a model to a training dataset
    - The `fit` method can take in a corpus of streams as its input (whose path
      signatures are computed using the `sktime` library with `esig` or
      `iisignature`) _or_ a corpus of path signatures as its input. This also
      opens up the possibility of using other feature representations and
      other applications of the variance norm distance for anomaly detection
    - Currently, the library uses either `sklearn`'s `NearestNeighbors` class or
      `pynndescent`'s `NNDescent` class to efficiently compute the nearest
      neighbour distances of a new data point to the corpus training data
  - The `conformance` method to compute the conformance score for a set of new
    data points
    - Similarly to the `fit` method, the `conformance` method can take in a
      corpus of streams as its input (whose path signatures are computed using
      the `sktime` library with `esig` or `iisignature`) _or_ a corpus of path
      signatures as its input
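
To give a feel for the path signature features mentioned above, here is a
minimal NumPy sketch of the signature of a piecewise-linear stream truncated at
depth 2. It is an illustration only: the function name `signature_depth2` is
hypothetical, and the library itself relies on `sktime` with `esig` or
`iisignature` rather than on code like this.

```python
import numpy as np


def signature_depth2(path):
    """Depth-2 signature of a piecewise-linear path (hypothetical helper).

    `path` has shape (n_points, n_channels). The result concatenates the
    level-1 terms (n_channels values) and the flattened level-2 terms
    (n_channels ** 2 values).
    """
    path = np.asarray(path, dtype=float)
    increments = np.diff(path, axis=0)  # one row per linear segment

    # Level 1: total displacement of the path.
    level1 = increments.sum(axis=0)

    # Level 2: iterated integrals. For each segment, combine the displacement
    # accumulated *before* the segment with the segment's own increment, and
    # add the contribution of the segment itself (half its outer product).
    before = np.cumsum(increments, axis=0) - increments
    level2 = before.T @ increments + 0.5 * increments.T @ increments

    return np.concatenate([level1, level2.ravel()])


# A path that moves one unit right and then one unit up.
sig = signature_depth2([[0, 0], [1, 0], [1, 1]])
# Level 1 is (1, 1); level 2 is [[0.5, 1.0], [0.0, 0.5]].
print(sig)
```

The difference between the two off-diagonal level-2 terms is twice the signed
area enclosed by the path and its chord, which is the kind of order-sensitive
information that makes signatures useful as features for streams.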

## Installation

The SigMahaKNN library is available on PyPI and can be installed with `pip`:

```bash
pip install signature_mahalanobis_knn
```

## Usage

As noted above, the `signature_mahalanobis_knn` library has two main classes:
`Mahalanobis`, a class for computing the variance norm distance, and
`SigMahaKNN`, a class for computing the conformance score for a set of new data
points.

### Computing the variance norm distance
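
As a rough, library-independent illustration of the idea, the sketch below
computes a variance norm distance with plain NumPy. The class name
`VarianceNormSketch`, its `rtol` parameter and its method signatures are
assumptions made for this example; they are not the signatures of the
`Mahalanobis` class. The sketch treats directions in which the training data
has no variance as infinitely far away, which is where the variance norm
differs from the usual Mahalanobis distance.

```python
import numpy as np


class VarianceNormSketch:
    """Illustrative variance norm distance (not the library's implementation)."""

    def __init__(self, rtol=1e-10):
        # Relative threshold below which an eigenvalue is treated as zero.
        self.rtol = rtol

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        centred = X - X.mean(axis=0)
        cov = centred.T @ centred / (len(X) - 1)
        eigvals, eigvecs = np.linalg.eigh(cov)
        keep = eigvals > self.rtol * eigvals.max()
        self.eigvals_ = eigvals[keep]
        self.basis_ = eigvecs[:, keep]  # columns span the range of cov
        return self

    def distance(self, x1, x2):
        diff = np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)
        coords = self.basis_.T @ diff
        # Any component outside the span of the training data has zero
        # variance, so the distance is infinite.
        residual = diff - self.basis_ @ coords
        if np.linalg.norm(residual) > 1e-8 * max(1.0, np.linalg.norm(diff)):
            return np.inf
        return float(np.sqrt(np.sum(coords**2 / self.eigvals_)))


# Training data that only varies along the first axis.
model = VarianceNormSketch().fit([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
print(model.distance([1.0, 0.0], [0.0, 0.0]))  # finite: along the data
print(model.distance([0.0, 1.0], [0.0, 0.0]))  # infinite: off the data
```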

### Using the SigMahaKNN method for anomaly detection
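
The conformance idea can be sketched without the library as follows: score each
new feature vector by its distance to the nearest corpus point, measured in the
metric induced by the corpus covariance. The function name
`conformance_scores` is hypothetical, the sketch assumes at least two feature
dimensions and a full-rank corpus covariance, and it uses a brute-force search
where the library uses `sklearn`'s `NearestNeighbors` or `pynndescent`'s
`NNDescent`. The inputs here stand in for path signatures or any other feature
representation.

```python
import numpy as np


def conformance_scores(corpus, new_points):
    """Nearest-neighbour Mahalanobis distance of each new point to the corpus.

    Illustrative only: assumes the corpus covariance is full rank.
    """
    corpus = np.asarray(corpus, dtype=float)
    new_points = np.asarray(new_points, dtype=float)

    # Whitening: with cov = L @ L.T, the Euclidean norm of solve(L, v)
    # equals sqrt(v.T @ inv(cov) @ v).
    cov = np.cov(corpus, rowvar=False)
    chol = np.linalg.cholesky(cov)
    white_corpus = np.linalg.solve(chol, corpus.T).T
    white_new = np.linalg.solve(chol, new_points.T).T

    # Brute-force pairwise distances, then keep the nearest corpus point.
    pairwise = np.linalg.norm(
        white_new[:, None, :] - white_corpus[None, :, :], axis=-1
    )
    return pairwise.min(axis=1)


corpus = [[1, 0], [-1, 0], [0, 2], [0, -2]]
new_points = [[1, 0], [0, 0], [10, 10]]
# A point already in the corpus scores 0; the far-away point scores highest.
print(conformance_scores(corpus, new_points))
```

A larger score means the new point sits further from everything seen in the
corpus, so thresholding the scores gives a simple anomaly detector.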

## Repo structure

The core implementation of the SigMahaKNN method is in the
`src/signature_mahalanobis_knn` folder:

- `mahal_distance.py` contains the implementation of the `Mahalanobis` class to
  compute the variance norm distance
- `sig_maha_knn.py` contains the implementation of the `SigMahaKNN` class to
  compute the conformance scores for a set of new data points against a corpus
  of training data
- `utils.py` contains some utility functions that are useful for the library
- `baselines/` is a folder containing some of the baseline methods we look at in
  the paper - see [paper-examples/README.md](paper-examples/README.md) for more
  details

## Examples

There are various examples in the `examples` and `paper-examples` folders:

- `examples` contains small examples using randomly generated data for
  illustration purposes
- `paper-examples` contains the examples used in the paper (link available
  soon!) where we compare the SigMahaKNN method to other baseline approaches
  (e.g. Isolation Forest and Local Outlier Factor) on real-world datasets
  - There are notebooks for downloading and preprocessing the datasets for the
    examples - see [paper-examples/README.md](paper-examples/README.md) for more
    details

## Contributing

To take advantage of `pre-commit`, which will automatically format your code and
run some basic checks before you commit:

```bash
pip install pre-commit # or brew install pre-commit on macOS
pre-commit install # will install a pre-commit hook into the git repo
```

After doing this, each time you commit, some linters will be applied to format
the codebase. You can also/alternatively run `pre-commit run --all-files` to run
the checks.

See [CONTRIBUTING.md](CONTRIBUTING.md) for more information on running the test
suite using `nox`.
Binary file removed examples/data/pen_digit_test.pkl
Binary file not shown.
Binary file removed examples/data/pen_digit_train.pkl
Binary file not shown.