Add CleanCoNLL object #3557
base: master
Thanks a lot for adding this and implementing the patching in python! :)
I'm getting this error: TypeError: CLEANCONLL.download_and_prepare_data() takes 1 positional argument but 2 were given. I saw that the same error appears when the tests are run.
Also, for all checks to pass, you need to run mypy, ruff and black (type checking, linting and formatting).
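For context, this kind of TypeError usually means a method is defined with fewer parameters than the call site passes. A minimal sketch of how it arises, using a hypothetical class (ExampleDataset and the path argument are illustrative assumptions, not the actual Flair code):

```python
class ExampleDataset:
    """Hypothetical stand-in; not the real CLEANCONLL class."""

    def download_and_prepare_data(self):
        # The method only accepts `self` ...
        return "prepared"


def call_with_extra_argument() -> str:
    """Call the method with one argument too many and return the error text."""
    try:
        # ... but the caller passes an extra positional argument.
        ExampleDataset().download_and_prepare_data("some/base/path")
    except TypeError as err:
        return str(err)
    return ""


error_message = call_with_extra_argument()
```

The usual fix is to make the definition and the call agree, either by adding the parameter to the method or by dropping the argument at the call site.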
flair/datasets/sequence_labeling.py
changes = []
current_change = None

with open(patch_file_path, 'r') as patch_file:
It's better to also specify an encoding when reading/writing files, as different operating systems have different default encodings.
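A minimal sketch of what that looks like, assuming UTF-8 is the intended encoding (the file name and helper function here are made up for illustration):

```python
import tempfile
from pathlib import Path


def roundtrip_utf8(text: str) -> str:
    """Write text to a temporary file and read it back, both explicitly as UTF-8."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = Path(tmp_dir) / "example.patch"  # hypothetical file name
        # Without encoding=..., open() falls back to a platform-dependent default.
        with open(path, "w", encoding="utf-8") as out_file:
            out_file.write(text)
        with open(path, "r", encoding="utf-8") as in_file:
            return in_file.read()


restored = roundtrip_utf8("Zürich B-LOC\n")
```

Passing the encoding on both the write and the read side keeps non-ASCII tokens intact regardless of the machine's locale settings.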
Thanks! I fixed the deprecated argument and added the encoding everywhere.
Now the checks pass, after some formatting fixes.
Hi @susannaruecker, many thanks for adding this! I have trained FLERT models on the CleanCoNLL dataset with very good results. I have one question about further experiments with that dataset: could the SpanTagger training also be officially integrated into Flair? I opened issue #3457 about this some time ago, and it would be awesome to have support in Flair for this approach as well :)
Here's the PR for adding a CleanCoNLL object. Simple usage:
When called for the first time, this will download the necessary files, so
It then applies the patch files to the original CoNLL-03 tokens (for our new line breaks etc.) and then merges those new tokens with our new annotations.
Note: As requested, I replaced all previous calls to bash scripts with pure Python. Especially the patching process which until now was
is now done with our own methods, but it is unfortunately rather lengthy now...
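To illustrate the idea of patching in pure Python, here is a rough sketch that applies a patch in the classic "normal" diff format (hunk headers such as 3c2, 1d0 or 5a5,6). That the CleanCoNLL patch files use this format is an assumption, and apply_normal_diff is a hypothetical helper, not the implementation in this PR:

```python
import re

# Hunk header of a "normal" diff, e.g. "3c2", "1,2d0", "5a5,6".
HUNK_RE = re.compile(r"^(\d+)(?:,(\d+))?([acd])(\d+)(?:,(\d+))?$")


def apply_normal_diff(original_lines, patch_lines):
    """Apply normal-format diff hunks (in ascending order) to a list of lines."""
    result = []
    pos = 0  # 0-based index of the next original line not yet handled
    i = 0
    while i < len(patch_lines):
        match = HUNK_RE.match(patch_lines[i])
        if match is None:
            raise ValueError(f"Expected a hunk header, got: {patch_lines[i]!r}")
        start = int(match.group(1))
        end = int(match.group(2) or start)
        operation = match.group(3)
        i += 1

        # Collect the added lines ("> ...") that belong to this hunk;
        # removed lines ("< ...") and the "---" separator are skipped.
        new_lines = []
        while i < len(patch_lines) and not HUNK_RE.match(patch_lines[i]):
            if patch_lines[i].startswith(">"):
                new_lines.append(patch_lines[i][2:])
            i += 1

        if operation == "a":
            # Append: keep everything up to and including line `start`.
            result.extend(original_lines[pos:start])
            pos = start
        else:
            # Change or delete: keep lines before `start`, drop start..end.
            result.extend(original_lines[pos:start - 1])
            pos = end
        result.extend(new_lines)

    # Copy whatever follows the last hunk.
    result.extend(original_lines[pos:])
    return result


original = ["-DOCSTART-", "EU", "rejects", "German", "call"]
patch = ["1d0", "< -DOCSTART-", "3c2", "< rejects", "---", "> refuses",
         "5a5,6", "> to", "> boycott"]
patched = apply_normal_diff(original, patch)
```

Processing the hunks in order while tracking a position in the original file avoids having to renumber lines after each change, which is one reason such a routine stays fairly compact.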
Please check if it works for you! 🙂 (If you already have the CleanCoNLL files in your .flair/datasets folder, you should delete them first; otherwise they will simply be read and the new reconstruction will not be tested.)