diff --git a/site/GERMS-AT_AnnotationGuidelines_and_AnnotatorAgreement_English_version.pdf b/site/GERMS-AT_AnnotationGuidelines_and_AnnotatorAgreement_English_version.pdf
new file mode 100644
index 0000000..2c9437d
Binary files /dev/null and b/site/GERMS-AT_AnnotationGuidelines_and_AnnotatorAgreement_English_version.pdf differ
diff --git a/site/GERMS-AT_Annotierrichtlinien.pdf b/site/GERMS-AT_Annotierrichtlinien.pdf
new file mode 100644
index 0000000..2804ba3
Binary files /dev/null and b/site/GERMS-AT_Annotierrichtlinien.pdf differ
diff --git a/site/closed-track.md b/site/closed-track.md
index c58f456..2aec1fd 100644
--- a/site/closed-track.md
+++ b/site/closed-track.md
@@ -1,4 +1,4 @@
-# GermEval2024 GerMS - Closed Track Competitions
+# Closed Track Competitions
 
 There is a _Closed Track_ competition for each of the two subtasks. Please note the following:
 
diff --git a/site/download.md b/site/download.md
index 06cfef8..e9c068c 100644
--- a/site/download.md
+++ b/site/download.md
@@ -1,13 +1,12 @@
-# GermEval2024 GerMS - Download
+# Downloads
 
 On this page, the files for training and labeling can be downloaded for each of the phases of the GermEval2024 GerMS competition.
+All files are made available under a [CC BY-NC-SA 4.0 License](https://creativecommons.org/licenses/by-nc-sa/4.0/deed.en).
 
-## Trial Phase
-* [Training Data](data/germeval-trial-train.jsonl)
-* [Testing Data](data/germeval-trial-test.jsonl)
+## Trial Phase
 
 ## Development Phase
diff --git a/site/guidelines.md b/site/guidelines.md
new file mode 100644
index 0000000..494b59c
--- /dev/null
+++ b/site/guidelines.md
@@ -0,0 +1,4 @@
+# Annotation Guidelines
+
+* [German Language Original](GERMS-AT_Annotierrichtlinien.pdf)
+* [English Translation](GERMS-AT_AnnotationGuidelines_and_AnnotatorAgreement_English_version.pdf)
diff --git a/site/index.md b/site/index.md
index 2d42fa6..edaec5d 100644
--- a/site/index.md
+++ b/site/index.md
@@ -20,8 +20,7 @@ are forum moderators. The main aim of annotating the presence and strength of
 sexism/misogyny in the corpus was to identify comments which make it less
 welcoming to women to participate in the conversation.
 
-The full annotator guidelines with examples are available in an
-[English translation of the German original](guidelines.pdf)
+The full annotator guidelines with examples are [available](guidelines.html).
 
 Since the sexism/misogyny present in this corpus is often present in a
 subtle form that avoids outright offensiveness or curse words,
@@ -66,8 +65,9 @@ become available.
 * **Competition phase**: June 7 - June 25, 2024
   * During this phase, the training data will consist of the training data of
     the previous phase plus the labeled test data of the previous phase and a new unlabeled test set is made available. You can upload the labels for
-    the test set to the competition site and your most last submission will be the one that will be used for ranking. During that phase, the leaderboard is not shown.
-    The final leaderboard/ranking is shown after the end of the competition phase.
+    the test set to the competition site and your last submission will be the one that will be used for ranking. During that phase, the leaderboard is not shown. Please note that only submissions which adhere to all terms and rules are considered for the final ranking.
+  * A preliminary leaderboard/ranking is shown after the end of the competition phase.
+  * The final leaderboard/ranking will be available once paper review and final submission are completed, on July 21, 2024.
 * **Paper submission due**: July 1, 2024
 * **Camera ready due**: July 20, 2024
 * **Shared Task @KONVENS**: 10 September, 2024
diff --git a/site/open-track.md b/site/open-track.md
index 9833d33..11611e3 100644
--- a/site/open-track.md
+++ b/site/open-track.md
@@ -1,4 +1,4 @@
-# GermEval2024 GerMS - Open Track Competitions
+# Open Track Competitions
 
 There is an _Open Track_ competition for each of the two subtasks. Please note the following:
 
diff --git a/site/report.md b/site/report.md
index 658cf05..7225b4b 100644
--- a/site/report.md
+++ b/site/report.md
@@ -1,3 +1,4 @@
-# GermEval2024 GerMS - Report
+# Report
 
-TBD
+This page will contain the report about the shared task after completion
+and publication.
diff --git a/site/subtask1.md b/site/subtask1.md
index a7c1d35..22282d4 100644
--- a/site/subtask1.md
+++ b/site/subtask1.md
@@ -1,4 +1,4 @@
-# GermEval2024 GerMS - Subtask 1
+# Subtask 1
 
 IMPORTANT: please note that there is a [closed](closed-track.md) and an [open](open-track.md) track for this subtask!
 
diff --git a/site/subtask2.md b/site/subtask2.md
index cd7b85a..32facd1 100644
--- a/site/subtask2.md
+++ b/site/subtask2.md
@@ -1,107 +1,107 @@
-# GermEval2024 GerMS - Subtask 2
-
-IMPORTANT: please note that there is a [closed](closed-track.md) and an [open](open-track.md) track for this subtask!
-
-**Only submissions to the closed track which follow the rules for the closed track qualify for a paper submission and only an accepted paper qualifies for the
-inclusion of your results in the final competition ranking.**
-
-In subtask 2 the goal is to predict the distribution for each text in a dataset where the distribution is derived from the original distribution of labels assigned by several human annotators.
-
-The human annotators assigned (according to the [annotation guidelines](guidelines.md) )
-the strength of misogyny/sexism present in the given text via the following labels:
-
-* `0-Kein`: no sexism/misogyny present
-* `1-Gering`: mild sexism/misogyny
-* `2-Vorhanden`: sexism/misogyny present
-* `3-Stark`: strong sexism/misogyny
-* `4-Extrem`: extreme sexism/misogyny
-
-While the annotation guidelines define what kind of sexism/misogyny should get annotated, there has been made no attempt to give rules about how to decide on the strength. For this reason, if an annotator decided that sexism/misogyny is present in a text, the strength assigned is a matter of personal judgement.
-
-The distributions to predict in subtask 2 are
-* the binary distribution ('dist_bin'): two values are predicted, which add up to 1.
-  * `dist_bin_0`: refers to the portion of annotators labeling the text as 'not-sexist' (`0-Kein`)
-  * `dist_bin_1`: refers to the portion of annotators labeling the text as 'sexist' (`1-Gering`, `2-Vorhanden`, `3-Stark`, or `4-Extrem`).
-* the multi score distribution ('dist_multi'): five values are predicted, which add up to 1.
-  * `dist_multi_0`: predict the portion of annotators labeling the text as `0-Kein`.
-  * `dist_multi_1`: predict the portion of annotators labeling the text as `1-Gering`.
-  * `dist_multi_2`: predict the portion of annotators labeling the text as `2-Vorhanden`.
-  * `dist_multi_3`: predict the portion of annotators labeling the text as `3-Stark`.
-  * `dist_multi_4`: predict the portion of annotators labeling the text as `4-Extrem`.
-
-## Data
-
-For the *trial phase* of subtask 1, we provide a small dataset, containing
-* a small labeled dataset containing 'id', 'text', and 'annotations' (annotator ids and the label assigned by them)
-* a small unlabeled dataset containing 'id', 'text' and 'annotators' (annotator ids)
-
-For the *development phase* of subtask 1, we provide all participants with the following data:
-* the labeled training set containing 'id', 'text', and 'annotations' (annotator ids and the label assigned by them)
-* the unlabeled dev set containing 'id', 'text' and 'annotators' (annotator ids)
-
-For the *competition phase* of subtask 1, we provide
-* the unlabeled test set containing 'id', 'text' and 'annotators' (annotator ids)
-
-All of the five files are in JSONL format (one JSON-serialized object per line) where each object is a dictionary with the following
-fields:
-
-* `id`: a hash that identifies the example
-* `text`: the text to classify. The text can contain arbitrary Unicode and new lines
-* `annotations` (only in the labeled dataset): an array of dictionaries which contain the following key/value pairs:
-  * `user`: a string in the form "A003" which is an anonymized id for the annotator who assigned the label
-  * `label`: the label assigned by the annotator
-  * Note that the number of annotations and the specific annotators who assigned labels vary between examples
-* `annotators` (only in the unlabeled dataset): an array of annotator ids who labeled the example
-
-You can [download](download.md) the data for each phase as soon as the corresponding phase starts.
-
-## Submission
-
-Your submission must be a file in TSV (tab separated values) format which contains the following columns in any order:
-
-* `id`: the id of the example in the unlabeled dataset for which the predictions are submitted
-* `dist_bin_0`: prediction of one value between 0 and 1 (all `dist_bin` values need to add up to 1).
-* `dist_bin_1`: prediction of one value between 0 and 1 (all `dist_bin` values need to add up to 1).
-* `dist_multi_0`: prediction of one value between 0 and 1 (all `dist_multi` values need to add up to 1).
-* `dist_multi_1`: prediction of one value between 0 and 1 (all `dist_multi` values need to add up to 1).
-* `dist_multi_2`: prediction of one value between 0 and 1 (all `dist_multi` values need to add up to 1).
-* `dist_multi_3`: prediction of one value between 0 and 1 (all `dist_multi` values need to add up to 1).
-* `dist_multi_4`: prediction of one value between 0 and 1 (all `dist_multi` values need to add up to 1).
-
-Note that the way how you derive those values is up to you (as long as the rules for the closed or open tracks are followed):
-
-* you can train several models or a single model to get the predicted distribution
-* you can derive the mode-specific training set in any way from the labeled training data
-* you can use the information of which annotator assigned the label or ignore that
-
-To submit your predictions to the competition:
-
-* the file MUST have the file name extension `.tsv`
-* the TSV file must get compressed into a ZIP file with extension `.zip`
-* the ZIP file should then get uploaded as a submission to the correct competition
-* !! Please make sure you submit to the competition that corresponds to the correct subtask (1 or 2) and correct track (Open or Closed)!
-* under "My Submissions" make sure to fill out the form and: - * enter the name of your team which has been registered for the competition - * give a name to your method - * confirm that you have checked that you are indeed submitting to the correct competition for the subtask and track desired - -**Submission errors and warnings** - -* Always make sure a phase is selected before trying to upload your submission. -* A submission is successful, if it has the submission status 'finished'. 'Failed' submissions can be investigated for error sources by clicking at '?' next to 'failed' and looking at LOGS > scoring logs > stderr. -* If you experience any issue such as a submission file stuck with a "scoring" status, please cancel the submission and try again. In case the problem persists you can contact us using the Forum. -* Following a successful submission, you need to refresh the submission page in order to see your score and your result on the leaderboard. - -## Phases - -* For the *trial phase*, multiple submissions are allowed for getting to know the problem and the subtask. -* For the *development phase*, multiple submissions are allowed and they serve the purpose of developing and improving the model(s). -* For the *competition phase*, participants may only submit a limited number of times. Please note that only the latest valid submission determines the final task ranking. - -## Evaluation - -System performance on subtask 2 is evaluated using the Jensen-Shannon distance for both (i) the prediction of the binary distribution, and (ii) the prediction of the multi score distribution. We chose the Jensen-Shannon distance as it is a standard method for measuring the similarity between two probability distributions and it is a proper -distance metric which is between 0 and 1. It is the square root of the Jensen-Shannon divergence, which is based on the Kullback-Leibler divergence. - -The overall score which is used for ranking the submissions is calculated as the unweighted average between the two JS-distances. - +# Subtask 2 + +IMPORTANT: please note that there is a [closed](closed-track.md) and an [open](open-track.md) track for this subtask! + +**Only submissions to the closed track which follow the rules for the closed track qualify for a paper submission and only an accepted paper qualifies for the +inclusion of your results in the final competition ranking.** + +In subtask 2 the goal is to predict the distribution for each text in a dataset where the distribution is derived from the original distribution of labels assigned by several human annotators. + +The human annotators assigned (according to the [annotation guidelines](guidelines.md) ) +the strength of misogyny/sexism present in the given text via the following labels: + +* `0-Kein`: no sexism/misogyny present +* `1-Gering`: mild sexism/misogyny +* `2-Vorhanden`: sexism/misogyny present +* `3-Stark`: strong sexism/misogyny +* `4-Extrem`: extreme sexism/misogyny + +While the annotation guidelines define what kind of sexism/misogyny should get annotated, there has been made no attempt to give rules about how to decide on the strength. For this reason, if an annotator decided that sexism/misogyny is present in a text, the strength assigned is a matter of personal judgement. + +The distributions to predict in subtask 2 are +* the binary distribution ('dist_bin'): two values are predicted, which add up to 1. 
+  * `dist_bin_0`: refers to the portion of annotators labeling the text as 'not-sexist' (`0-Kein`)
+  * `dist_bin_1`: refers to the portion of annotators labeling the text as 'sexist' (`1-Gering`, `2-Vorhanden`, `3-Stark`, or `4-Extrem`).
+* the multi score distribution ('dist_multi'): five values are predicted, which add up to 1.
+  * `dist_multi_0`: predict the portion of annotators labeling the text as `0-Kein`.
+  * `dist_multi_1`: predict the portion of annotators labeling the text as `1-Gering`.
+  * `dist_multi_2`: predict the portion of annotators labeling the text as `2-Vorhanden`.
+  * `dist_multi_3`: predict the portion of annotators labeling the text as `3-Stark`.
+  * `dist_multi_4`: predict the portion of annotators labeling the text as `4-Extrem`.
+
+## Data
+
+For the *trial phase* of subtask 2, we provide a small dataset containing
+* a small labeled dataset containing 'id', 'text', and 'annotations' (annotator ids and the label assigned by them)
+* a small unlabeled dataset containing 'id', 'text', and 'annotators' (annotator ids)
+
+For the *development phase* of subtask 2, we provide all participants with the following data:
+* the labeled training set containing 'id', 'text', and 'annotations' (annotator ids and the label assigned by them)
+* the unlabeled dev set containing 'id', 'text', and 'annotators' (annotator ids)
+
+For the *competition phase* of subtask 2, we provide
+* the unlabeled test set containing 'id', 'text', and 'annotators' (annotator ids)
+
+All five files are in JSONL format (one JSON-serialized object per line) where each object is a dictionary with the following
+fields:
+
+* `id`: a hash that identifies the example
+* `text`: the text to classify. The text can contain arbitrary Unicode and newlines
+* `annotations` (only in the labeled dataset): an array of dictionaries which contain the following key/value pairs:
+  * `user`: a string in the form "A003" which is an anonymized id for the annotator who assigned the label
+  * `label`: the label assigned by the annotator
+  * Note that the number of annotations and the specific annotators who assigned labels vary between examples
+* `annotators` (only in the unlabeled dataset): an array of annotator ids who labeled the example
+
+You can [download](download.md) the data for each phase as soon as the corresponding phase starts.
+
+## Submission
+
+Your submission must be a file in TSV (tab-separated values) format which contains the following columns, in any order (an illustrative sketch follows the list):
+
+* `id`: the id of the example in the unlabeled dataset for which the predictions are submitted
+* `dist_bin_0`: prediction of one value between 0 and 1 (all `dist_bin` values need to add up to 1).
+* `dist_bin_1`: prediction of one value between 0 and 1 (all `dist_bin` values need to add up to 1).
+* `dist_multi_0`: prediction of one value between 0 and 1 (all `dist_multi` values need to add up to 1).
+* `dist_multi_1`: prediction of one value between 0 and 1 (all `dist_multi` values need to add up to 1).
+* `dist_multi_2`: prediction of one value between 0 and 1 (all `dist_multi` values need to add up to 1).
+* `dist_multi_3`: prediction of one value between 0 and 1 (all `dist_multi` values need to add up to 1).
+* `dist_multi_4`: prediction of one value between 0 and 1 (all `dist_multi` values need to add up to 1).
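+
+The following sketch is an unofficial illustration of how these columns relate to the training data; the record id and annotations in it are made up. It derives the gold `dist_bin` and `dist_multi` values from the `annotations` field of one labeled JSONL record and formats them as a TSV row:
+
+```python
+import json
+
+LABELS = ["0-Kein", "1-Gering", "2-Vorhanden", "3-Stark", "4-Extrem"]
+
+# A made-up labeled record in the format described under "Data".
+line = ('{"id": "c0ffee", "text": "...", '
+        '"annotations": [{"user": "A003", "label": "0-Kein"}, '
+        '{"user": "A007", "label": "2-Vorhanden"}]}')
+record = json.loads(line)
+
+labels = [a["label"] for a in record["annotations"]]
+n = len(labels)
+
+# dist_multi: share of annotators per label; the five values add up to 1.
+dist_multi = [labels.count(lab) / n for lab in LABELS]
+# dist_bin: `0-Kein` vs. any of the four sexist labels; the two values add up to 1.
+dist_bin = [dist_multi[0], sum(dist_multi[1:])]
+
+header = ["id", "dist_bin_0", "dist_bin_1"] + [f"dist_multi_{i}" for i in range(5)]
+row = [record["id"]] + [f"{v:.4f}" for v in dist_bin + dist_multi]
+print("\t".join(header))
+print("\t".join(row))
+```
+
+For an actual submission, the values come from your system's predictions for the unlabeled test set rather than from gold annotations; the snippet only shows the target quantities and the expected TSV layout.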
+
+Note that how you derive these values is up to you (as long as the rules for the closed or open tracks are followed):
+
+* you can train several models or a single model to get the predicted distribution
+* you can derive the model-specific training set in any way from the labeled training data
+* you can use the information about which annotator assigned which label, or ignore it
+
+To submit your predictions to the competition:
+
+* the file MUST have the file name extension `.tsv`
+* the TSV file must be compressed into a ZIP file with extension `.zip`
+* the ZIP file should then be uploaded as a submission to the correct competition
+* !! Please make sure you submit to the competition that corresponds to the correct subtask (1 or 2) and the correct track (Open or Closed)!
+* under "My Submissions" make sure to fill out the form and:
+  * enter the name of your team which has been registered for the competition
+  * give a name to your method
+  * confirm that you have checked that you are indeed submitting to the correct competition for the desired subtask and track
+
+**Submission errors and warnings**
+
+* Always make sure a phase is selected before trying to upload your submission.
+* A submission is successful if it has the submission status 'finished'. 'Failed' submissions can be investigated for error sources by clicking the '?' next to 'failed' and looking at LOGS > scoring logs > stderr.
+* If you experience any issue, such as a submission file stuck with a "scoring" status, please cancel the submission and try again. In case the problem persists, you can contact us using the Forum.
+* Following a successful submission, you need to refresh the submission page in order to see your score and your result on the leaderboard.
+
+## Phases
+
+* For the *trial phase*, multiple submissions are allowed for getting to know the problem and the subtask.
+* For the *development phase*, multiple submissions are allowed and they serve the purpose of developing and improving the model(s).
+* For the *competition phase*, participants may only submit a limited number of times. Please note that only the latest valid submission determines the final task ranking.
+
+## Evaluation
+
+System performance on subtask 2 is evaluated using the Jensen-Shannon distance for both (i) the prediction of the binary distribution, and (ii) the prediction of the multi score distribution. We chose the Jensen-Shannon distance as it is a standard method for measuring the similarity between two probability distributions and it is a proper
+distance metric bounded between 0 and 1. It is the square root of the Jensen-Shannon divergence, which is based on the Kullback-Leibler divergence.
+
+The overall score used for ranking the submissions is calculated as the unweighted average of the two JS distances.
+
diff --git a/site/terms.md b/site/terms.md
index 37aeff4..37111a3 100644
--- a/site/terms.md
+++ b/site/terms.md
@@ -2,14 +2,20 @@
 
 **Participation in the competition**: Any interested person may participate in the competition. By your participation, you agree to the terms and conditions in their entirety, without amendment or provision. By participating in the competition, you consent to the public release of your scores and submissions at the GermEval-2024 workshop and in the associated proceedings.
 Participation is understood as any direct or indirect contributions to this site or the shared task organizers, such as, but not limited to: results of automatic scoring programs; manual, qualitative and quantitative assessments of the data submitted; task and systems papers submitted.
 
-**Individual and team participation**: Participants can participate as individuals or as part of one team. Teams and individual participants must create exactly one account to participate in the Codabench competition. Team composition may not be changed once the Test Phase starts. Your system is named according to the team name provided at the time of submission, or to a suitable shorthand as determined by the task organizers.
+**Individual and team participation**: Participants can participate as individuals or as part of one team. Teams and individual participants must create exactly one account to participate in the Codabench competition and both teams and individuals must register a team name. Team name and user name must be supplied during registration. Team composition may not be changed once the Test Phase starts. Your system is named according to the team name provided at the time of submission, or to a suitable shorthand as determined by the task organizers.
 
 **Scoring of submissions**: Submissions are evaluated with automatic and manual quantitative judgements, qualitative judgements, and any other metrics as the task organizers see fit. You accept that the ultimate decision of metric choice and score value is that of the task organizers. Organizers are under no obligation to release scores. Official scores may be withheld if organizers judge the submission incomplete, erroneous, deceptive, or violating the letter or spirit of the competition's rules. Inclusion of a submission's scores is not an endorsement of a team or individual's submission. If multiple submission files are uploaded during the Test Phase, the last submission file per group will be understood as the team's or participant's definitive submission and ranked as such in the task description paper.
 
+Important: submissions will only be considered for the final ranking if the team name specified for the submission matches a team that has been registered for participation
+and the Codabench user who made the submission also matches the Codabench user name that has been specified in the registration for participation.
+
 **Data usage**: The provided data should be used responsibly and ethically. Do not attempt to misuse it in any way, including, but not limited to, reconstructing test sets, any non-scientific use of the data, or any other unconscionable usage of the data. You may not redistribute the task data except in the manner prescribed by its licence.
 
 **Specific conditions for closed and open tracks**: Participants agree to follow the specific conditions for [closed tracks](link-tbd) and [open tracks](link-tbd), which specify the type of data allowed for pretraining the model.
 
 **Submission of systems description papers**: Participants having made at least one submission for a closed track during the Test Phase will be invited to submit a paper describing their system. Participants having made only submissions for open tracks will not be invited to submit a paper describing their system (see the specific conditions for closed and open tracks). For both tracks, we strongly encourage participants to provide a link to the code of their system(s) to organizers or the public at large.
 We also encourage you to upload any systems and models to an open-source repository such as the HuggingFace Hub.
 
+**Final Competition Ranking**: the final rankings for the closed tracks of subtask 1 and subtask 2 are determined after the paper review and camera-ready submission have ended.
+Only submissions for which a paper describing the solution has been submitted and accepted will be included in the final ranking.
+
 **Acknowledgements**: This shared task was created by OFAI with funding from the FFG project EKIP.
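
As a reference for the Evaluation section of `site/subtask2.md` above, here is a minimal sketch of how the Jensen-Shannon distance between a gold and a predicted distribution can be computed with SciPy. The distributions below are made up, and `base=2` is an assumption made here so that the distance is bounded by 1 as described; the organizers' exact scoring configuration is not part of this diff.

```python
from scipy.spatial.distance import jensenshannon

# Made-up gold and predicted multi-score distributions (five values each, summing to 1).
gold_multi = [0.50, 0.00, 0.50, 0.00, 0.00]
pred_multi = [0.40, 0.10, 0.40, 0.05, 0.05]

# jensenshannon() returns the JS distance, i.e. the square root of the
# JS divergence, which in turn is built from Kullback-Leibler divergences
# against the mixture of the two distributions.
d_multi = jensenshannon(gold_multi, pred_multi, base=2)

# Made-up binary distributions (two values each, summing to 1).
gold_bin = [0.50, 0.50]
pred_bin = [0.40, 0.60]
d_bin = jensenshannon(gold_bin, pred_bin, base=2)

# The overall score is the unweighted average of the two JS distances.
print(f"dist_multi JS distance: {d_multi:.4f}")
print(f"dist_bin JS distance:   {d_bin:.4f}")
print(f"overall score:          {(d_bin + d_multi) / 2:.4f}")
```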