The Impact of Tokenization on Gender Bias in NMT

Introduction

This repository contains the code and resources for my master's thesis, "The Impact of Tokenization on Gender Bias in NMT." The research analyzes how tokenization methods influence gender bias in Neural Machine Translation (NMT). For evaluation, I also created a Catalan version of the MuST-SHE corpus.

Catalan Version and Modified Evaluation Script

I created a Catalan version of the MuST-SHE corpus, originally developed by Bentivogli et al. (2020), and modified the MuST-SHE evaluation script distributed with the original corpus. The Catalan corpus and the modified evaluation script are included in this repository so that you can replicate and build upon the research. For reference, the original evaluation script is also included. If you use these resources in your work, please cite the paper by Bentivogli et al.

MuST-SHE corpus: Link to the original paper
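
To give a feel for what a MuST-SHE-style evaluation measures, here is a rough sketch in Python. It assumes each example pairs a system output with a list of gender-marked terms, each given in its correct form and its wrong (opposite-gender) form, and it reports term coverage (how many annotated terms appear in either form) and gender accuracy (how many of the found terms are in the correct form). The function name `gender_scores`, the input layout, and the whitespace tokenization are assumptions made for illustration. This is not the code in `mustshe_acc_v1.1.py` or `mustshe_acc_v1.2.py`, which may differ in important details.

```python
from typing import Iterable, List, Tuple

# Hypothetical input layout: (system output, [(correct_form, wrong_form), ...])
Example = Tuple[str, List[Tuple[str, str]]]


def gender_scores(examples: Iterable[Example]) -> Tuple[float, float]:
    """Return (term coverage, gender accuracy) for a set of examples.

    Simplified illustration only: matching is done on lowercased,
    whitespace-separated tokens.
    """
    total = correct = wrong = 0
    for hypothesis, term_pairs in examples:
        tokens = set(hypothesis.lower().split())
        for correct_form, wrong_form in term_pairs:
            total += 1
            if correct_form.lower() in tokens:
                correct += 1
            elif wrong_form.lower() in tokens:
                wrong += 1
    found = correct + wrong
    coverage = found / total if total else 0.0
    accuracy = correct / found if found else 0.0
    return coverage, accuracy


if __name__ == "__main__":
    demo = [
        ("ella és una bona metgessa", [("metgessa", "metge"), ("bona", "bon")]),
        ("ell és un bon metge", [("metgessa", "metge")]),
    ]
    print(gender_scores(demo))
```

Separating coverage from accuracy keeps a system from being rewarded for gender correctness on terms it never produced at all.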

Data Preprocessing and Model Training

This repository provides scripts for data preprocessing and for training four models, each using a different tokenization method (as the script names suggest: BPE, character-level, Morfessor, and unigram segmentation). It also contains the scripts used to generate BLEU scores for these models. The scripts are tailored to the High-Performance Computing (HPC) environment in which they were executed, so running them in a different environment may require adjustments.
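
As an illustration of the simplest of these tokenization methods, the sketch below splits a sentence into individual characters and marks the original word boundaries so the text can be restored after translation. It is a minimal example under stated assumptions: the function names `char_split` and `char_merge` and the `▁` boundary marker (borrowed from the SentencePiece convention) are hypothetical, and the repository's own `char_split.py` may work differently.

```python
def char_split(line: str, space_symbol: str = "▁") -> str:
    """Split a sentence into space-separated characters.

    Original spaces are replaced by `space_symbol` (an assumed marker)
    so that word boundaries survive the split.
    """
    words = line.strip().split()
    # Join words with the marker, then put a space between every character.
    return " ".join(space_symbol.join(words))


def char_merge(tokens: str, space_symbol: str = "▁") -> str:
    """Invert char_split: drop token separators and restore word boundaries."""
    return "".join(tokens.split()).replace(space_symbol, " ")


if __name__ == "__main__":
    segmented = char_split("la metgessa")
    print(segmented)              # l a ▁ m e t g e s s a
    print(char_merge(segmented))  # la metgessa
```

Character-level segmentation needs no trained vocabulary, which makes it a useful contrast to the subword methods when comparing how gendered word endings are represented.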

Results

The results of the experiments conducted in this research are also included in this repository, with detailed performance information for the model trained with each tokenization method.

Contact Information

If you have any questions, suggestions, or feedback regarding this research or the repository, please feel free to contact me at [email protected].

Repository Files

must_she_data
results
README.md
char_split.py
generate_all_tests.bpe.sh
generate_all_tests.char.sh
generate_all_tests.morf.sh
generate_all_tests.uni.sh
get_vocabulary.py
morfessor_lines.py
mustshe_acc_v1.1.py
mustshe_acc_v1.2.py
preprocess-tests_bpe32.sh
preprocess-tests_char.sh
preprocess-tests_morf.sh
preprocess-tests_uni32.sh
preprocess_bpe_32.sh
preprocess_char.sh
preprocess_morfessor_joint_sample.sh
preprocess_uni_32.sh
spm_encode.py
train_bpe_32.sh
train_char.sh
train_morf.sh
train_uni_32.sh

audreyvm/tfm_gender_bias