audreyvm/tfm_gender_bias

The Impact of Tokenization on Gender Bias in NMT

Introduction

This repository contains the code and resources for my master's thesis, "The Impact of Tokenization on Gender Bias in NMT." The research analyzes how tokenization methods influence gender bias in Neural Machine Translation (NMT). For evaluation, I also created a Catalan version of the MuST-SHE corpus.

Catalan Version and Modified Evaluation Script

I created a Catalan version of the MuST-SHE corpus, originally developed by Bentivogli et al. (2020), and modified the MuST-SHE evaluation script distributed with the original corpus. The Catalan corpus and the modified evaluation script are both included in this repository, so you can replicate and build upon the research. The original evaluation script is also included for reference. If you use these resources in your work, please cite Bentivogli et al. (2020).
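To give a rough idea of what a MuST-SHE-style evaluation measures, here is a minimal sketch in Python. It is not the evaluation script shipped in this repository. It assumes the two metrics usually reported with MuST-SHE: term coverage (the share of annotated gender-marked words that appear in the output in either gender form) and gender accuracy (the share of those found words that carry the correct gender). The names `GenderTerm`, `score_sentence`, and `coverage_and_accuracy` are hypothetical, and matching on lowercased whitespace tokens is a simplification.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class GenderTerm:
    """One annotated gender-marked word (hypothetical structure)."""
    correct: str  # form with the expected gender, e.g. "cansada"
    wrong: str    # opposite-gender form, e.g. "cansat"


def score_sentence(output: str, terms: list[GenderTerm]) -> tuple[int, int]:
    """Return (correct_hits, wrong_hits) for one translated sentence."""
    # Simplification: lowercase and split on whitespace only.
    tokens = output.lower().split()
    correct = wrong = 0
    for term in terms:
        if term.correct.lower() in tokens:
            correct += 1
        elif term.wrong.lower() in tokens:
            wrong += 1
        # Terms found in neither form count against coverage only.
    return correct, wrong


def coverage_and_accuracy(
    outputs: list[str], annotations: list[list[GenderTerm]]
) -> tuple[float, float]:
    """Aggregate term coverage and gender accuracy over a test set."""
    total = correct = wrong = 0
    for output, terms in zip(outputs, annotations):
        c, w = score_sentence(output, terms)
        total += len(terms)
        correct += c
        wrong += w
    found = correct + wrong
    coverage = found / total if total else 0.0
    accuracy = correct / found if found else 0.0
    return coverage, accuracy
```

Calling `coverage_and_accuracy(outputs, annotations)` with one output string and one list of `GenderTerm` entries per sentence returns the pair `(coverage, accuracy)`. Keeping the two numbers separate avoids penalizing gender accuracy for words the system never produced in either form.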

Data Preprocessing and Model Training

This repository provides scripts for data preprocessing and for training four models, each using a different tokenization method. It also contains the scripts used to generate BLEU scores for these models. The scripts are tailored to the High-Performance Computing (HPC) environment in which they were executed, so running them in a different environment may require adjustments.

Results

The results of the experiments conducted in this research are also included in this repository, with detailed performance information for the model trained with each tokenization method.

Contact Information

If you have any questions, suggestions, or feedback regarding this research or the repository, please feel free to contact me at [email protected].

About

Scripts and Datasets for my TFM
