📘 Introduction | 🛠️ Preparation | 🚀 Get on with it! | 💕 How can I help? | 🦒 Database Zoo | 📖 Citation | 📝 License
- Motivation. More than 7,000 known languages are spoken around the world, yet only a small fraction of them are currently covered by speech technologies. Moreover, many of these languages are considered endangered, and the lack of support from current technologies may worsen this situation. For this reason, and to avoid any type of discrimination or exclusion, we encourage people to contribute to this research project, whose purpose is to cover as many languages as possible in the context of audio-visual speech technologies. 🌟 Wouldn't you like to make the table shown below larger? Find more information in our 💕 How can I help? section.
- The AnnoTheia Toolkit. We present AnnoTheia, a semi-automatic toolkit that detects when a person speaks on the scene and provides the corresponding transcription. One of its most notable aspects is its flexibility: any module can be replaced with another of your choice or, where appropriate, with one adapted to your language of interest. To show the complete process of preparing AnnoTheia for a language of interest, we also describe in this tutorial 📜 the adaptation of a pre-trained TalkNet-ASD model to a specific language, using a database not initially conceived for this type of task.
- The User Interface. The labels refer to the image shown above. A: Video display of the scene that is a candidate to become a new sample of the future database. An overlaid green bounding box highlights the speaker detected by the toolkit. B: Keyword legend to control the video display. C: Transcription automatically generated by the toolkit, which the annotator can edit. D: Buttons that allow the annotator to accept or discard the candidate scene sample. E: Buttons to navigate through the candidate scenes, which can be useful for correcting annotation mistakes.
- Clone the repository:
git clone https://github.com/joactr/AnnoTheia.git
- Create and activate a new conda environment:
cd AnnoTheia
conda create -y -n annotheia python=3.10
conda activate annotheia
- Install all requirements to prepare the environment:
python ./prepare_environment.py
The AnnoTheia toolkit is divided into two stages:
- Detect Candidate Scenes to compile the new audio-visual database from long videos:
python main_scenes.py \
--video_dir ${PATH_TO_VIDEO_DIR} \
--config-file ${PATH_TO_CONFIG_FILE} \
--output-dir ${PATH_TO_OUTPUT_DIR}
- Supervise & Annotate the candidate scenes detected by the toolkit. Once the script above reports that a video has been fully processed, you can run the following command:
python main_gui.py --scenes-info-path ${PATH_TO_SCENES_INFO_CSV}
🌟 We plan to unify both stages. Any comments or suggestions in this regard will be of great help!
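Until both stages are unified, the two commands above can be chained from a small helper script. The following is only a minimal sketch, not part of the toolkit: the function names, the placeholder paths (`videos/`, `config.yaml`, `output/`) and, in particular, the assumption that stage 1 leaves its scenes-info CSV files somewhere under the output directory are hypothetical. The command-line flags are copied verbatim from the commands above.

```python
import shlex
import sys
from pathlib import Path


def scene_detection_cmd(video_dir, config_file, output_dir):
    """Stage 1: flags copied verbatim from the main_scenes.py command above."""
    return [
        sys.executable, "main_scenes.py",
        "--video_dir", str(video_dir),
        "--config-file", str(config_file),
        "--output-dir", str(output_dir),
    ]


def annotation_cmd(scenes_info_csv):
    """Stage 2: flag copied verbatim from the main_gui.py command above."""
    return [sys.executable, "main_gui.py", "--scenes-info-path", str(scenes_info_csv)]


def find_scenes_info(output_dir, pattern="*.csv"):
    """ASSUMPTION: stage 1 writes its scenes-info CSV files under output_dir.

    Adjust `pattern` to match the files your run actually produces.
    """
    root = Path(output_dir)
    if not root.is_dir():
        return []
    return sorted(root.rglob(pattern))


# Dry run: print the commands instead of executing them.
# All three paths below are placeholders.
print(shlex.join(scene_detection_cmd("videos/", "config.yaml", "output/")))
for csv_path in find_scenes_info("output/"):
    print(shlex.join(annotation_cmd(csv_path)))
```

To actually run a command rather than print it, pass the returned list to `subprocess.run(cmd, check=True)` from inside the activated `annotheia` environment. Using `sys.executable` keeps both stages on the same Python interpreter.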
Many of the world’s languages are in danger of disappearing. Our hope is to encourage the community to promote research in the field of audiovisual speech technologies for low-resource languages. It will be a long road, but here we explain different ways to help our project:
- Fine-Tuning TalkNet-ASD for another Language. Take a look at our tutorial 📜!
- Collecting New Audio-Visual Databases. Once a database is compiled, it will be welcome in our 🦒 Database Zoo!
- Adding new alternative models to the pipeline's modules. We prepared another tutorial 📜 for you. You will also discover that it might even be possible to add new modules, such as a body landmarker!
- Sharing the AnnoTheia toolkit with your colleagues. The more, the merrier 💫!
- Any other comment or suggestion? You are more than welcome to create an issue :)
✅ English 🇬🇧 ✅ Spanish 🇪🇸 ⬜ Czech 🇨🇿 ⬜ Kalanga 🇿🇼 ⬜ Polish 🇵🇱
⬜ Turkish 🇹🇷 ⬜ Japanese 🇯🇵 ⬜ Fijian 🇫🇯 ⬜ Malay 🇲🇾 ⬜ Somali 🇸🇴
⬜ Romanian 🇷🇴 ⬜ Vietnamese 🇻🇳 ⬜ Berber 🇲🇦 ⬜ Quechua 🇵🇪 ⬜ Māori 🇳🇿
⬜ Norwegian 🇳🇴 ⬜ Hindi 🇮🇳 ⬜ Swahili 🇹🇿 ⬜ Urdu 🇵🇰 ⬜ and so on ... 🏳️
🌟 Help us cover languages around the world 🗺️! It will be a great contribution to the research community to move towards a fairer development of speech technologies.
- 🇪🇸 LIP-RTVE: An Audio-Visual Database for Continuous Spanish in the Wild
- Place for your future database :)
If you found our work useful, please cite our paper:
AnnoTheia: A Semi-Automatic Annotation Toolkit for Audio-Visual Speech Technologies
@inproceedings{acosta24annotheia,
author={Acosta-Triana, José-M. and Gimeno-Gómez, David and Martínez-Hinarejos, Carlos-D.},
title={{AnnoTheia: A Semi-Automatic Annotation Toolkit for Audio-Visual Speech Technologies}},
booktitle={{The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)}},
pages={1260--1269},
year={2024},
}
- License. This work is licensed under the Apache License 2.0.
⚠️ HOWEVER, please note that use of the toolkit will always be limited by the license associated with the modules used in it.
- Ethical Considerations. The toolkit we present can be used to collect and annotate audio-visual databases from social media where a person's face and voice are recorded. Both cues could be considered biometrics, raising privacy concerns regarding a person's identity. Therefore, any data collection must protect privacy rights, taking the necessary measures and always asking everyone involved for permission beforehand.