Add OneClassClassifier model #3501
base: master
Conversation
@jeffpicard thanks for the PR! @elenamer can you take a look?
(force-pushed from 5a4204c to 65459d3)
Many thanks for the review! I've squashed in a commit with your requested changes (Implement mini_batch_size and verbose; Rename loss). @elenamer would you be willing to take another look please?
Thanks for adding the OneClassClassifier, looks good to me now!
(force-pushed from 65459d3 to 5ab2bfc)
(CI was failing with errors that looked unrelated to this branch, so I clicked the "rebase" button in the UI)
Thanks for adding this!
I tested with your script and also the following script to see if I can find some outliers in the TREC_6 dataset:
import json

from flair.data import Corpus
from flair.datasets import TREC_6
from flair.embeddings import TransformerDocumentEmbeddings
from flair.models import OneClassClassifier
from flair.trainers import ModelTrainer

embeddings = TransformerDocumentEmbeddings(
    model="distilbert-base-uncased",
)

# Train on TREC_6
trec = TREC_6()
label_type = "question_class"

# Identify outliers of the class ENTY
corpus = Corpus(
    train=[x for x in trec.train if x.get_label(label_type).value == "ENTY"],
    test=[x for x in trec.test if x.get_label(label_type).value == "ENTY"],
)
print(corpus)

label_dictionary = corpus.make_label_dictionary(label_type)
model = OneClassClassifier(embeddings, label_dictionary, label_type=label_type)
trainer = ModelTrainer(model, corpus)
trainer.fine_tune("resources/taggers/outlier", mini_batch_size=8)

# Corpus samples a dev split from the training data by default, so corpus.dev is populated
threshold = model.calculate_threshold(corpus.dev, quantile=0.95)
model.threshold = threshold

result = model.evaluate(corpus.test, gold_label_type=label_type, out_path="predictions.txt")
print(json.dumps(result.classification_report, indent=2))
Some thoughts:
- Having to set a threshold afterwards breaks with the default way of training/testing models in Flair. The trainer by default produces a dev.tsv and a test.tsv and prints evaluation metrics during/after training. However, without the threshold set during training, these outputs make no sense, which might confuse users.
- Have you considered modeling this as a regression task instead of a classification task? Then users would not need to set a threshold. Also, this would allow users to do things like print out the 10 biggest outliers, which could be more useful than experimenting with different thresholds (see the sketch after this list).
- The name of the class (OneClassClassifier) currently does not explain what it does. How about something like OutlierDetection?
- The model currently is limited to datasets consisting of a single class. This means users will always need to first create a dataset like in your example snippet. Is it possible to do outlier detection for multiple classes at once? Or is this problematic since each class would need a separate encoder/decoder network?
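A regression-style interface could rank sentences by anomaly score instead of thresholding. Here is a minimal sketch of the "10 biggest outliers" idea, assuming a hypothetical score() method that returns the per-sentence reconstruction error (the PR as written does not expose such a method):

# Hypothetical: rank test sentences by reconstruction error, highest first.
scored = [(model.score(sentence), sentence) for sentence in corpus.test]
scored.sort(key=lambda pair: pair[0], reverse=True)
for error, sentence in scored[:10]:
    print(f"{error:.4f}\t{sentence.to_plain_string()}")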
Hi @alanakbik. I'm extremely sorry I took so long to reply, and many thanks for your thoughts. To your points:
Two more points I'm wondering what you (or others) think about:
Altogether this could look like
I'm sorry this got so long! Focusing in on some yes/no questions that I think can be decided independently:
This PR adds OneClassClassifier to flair.models for #3496. The task, usage, and architecture are described in the class docstring.
The architecture is inspired by papers such as Anomaly Detection Using Autoencoders with Nonlinear Dimensionality Reduction. While this doesn't achieve state-of-the-art results or implement improvements like adding noise, I thought I'd see if you're interested, as it's a new task formulation that works and might be useful to others.
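For context, a minimal sketch of the autoencoder idea behind this style of anomaly detection (illustrative only; the class name, layer sizes, and bottleneck width here are assumptions, not this PR's actual implementation):

import torch
from torch import nn

# Illustrative autoencoder: compress the document embedding, reconstruct it,
# and use the reconstruction error as the anomaly score. Sizes are assumptions.
class DocAutoencoder(nn.Module):
    def __init__(self, embedding_dim: int = 768, bottleneck_dim: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(embedding_dim, 128), nn.ReLU(), nn.Linear(128, bottleneck_dim)
        )
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck_dim, 128), nn.ReLU(), nn.Linear(128, embedding_dim)
        )

    def forward(self, embedding: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(embedding))

# Training minimizes MSE between input and reconstruction on in-class data;
# at inference, a high reconstruction error marks a sentence as an outlier.
model = DocAutoencoder()
loss_fn = nn.MSELoss()
x = torch.randn(8, 768)  # a batch of document embeddings
loss = loss_fn(model(x), x)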
The interface requires users to set the threshold explicitly... I'm not sure if there's a cleaner way to hook that in so it happens automatically after training completes.
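One possible workaround (a sketch, not a proposal for the final API): wrap fine-tuning and threshold calibration in a small helper so callers get a ready-to-evaluate model in one step. The helper name and quantile default are assumptions.

# Sketch: run fine-tuning, then calibrate the threshold on the dev split.
def fine_tune_with_threshold(trainer, corpus, path, quantile=0.95, **kwargs):
    trainer.fine_tune(path, **kwargs)
    trainer.model.threshold = trainer.model.calculate_threshold(corpus.dev, quantile=quantile)
    return trainer.model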
Here's a short script demonstrating its usage separating IMDB from STACKOVERFLOW:
prints
Thanks for any time you're willing to put into considering this :) !