This repository contains the code and resources for my master's thesis, "The Impact of Tokenization on Gender Bias in NMT." The thesis analyzes how the choice of tokenization method influences gender bias in Neural Machine Translation (NMT).
For evaluation, I created a Catalan version of the MuST-SHE corpus, originally developed by Bentivogli et al. (2020), and modified the evaluation script distributed with the original corpus. Both the Catalan corpus and the modified evaluation script are included in this repository, so you can replicate and build upon this research. The original evaluation script is included as well, for reference. If you use these resources in your work, please cite Bentivogli et al. (2020).
- MuST-SHE corpus: Link to the original paper
This repository provides scripts for preprocessing the data and training four models, each using a different tokenization method, along with the scripts used to compute BLEU scores for these models. The scripts are tailored to the High-Performance Computing (HPC) environment in which they were run, so running them in a different environment may require adjustments.
The experimental results are also included in this repository, with detailed information about the performance of the model trained with each tokenization method.
If you have any questions, suggestions, or feedback regarding this research or the repository, please feel free to contact me at [email protected].