Update Readme.md #1

Open
wants to merge 25 commits into main
6 changes: 3 additions & 3 deletions README.md
@@ -1,5 +1,5 @@
Pipeline to process YouTube auto-generated captions in multiple languages
For a given collection of auto-captions and json metadata, the pipeline produces a CWB-compatible corpus in CONLL format with tokenisation, POS tagging, lemmatisation and further token-level features as created by UDPipe.
For a given collection of auto-captions and json metadata, the pipeline produces a CWB-compatible corpus in CONLL format with tokenization, POS tagging, lemmatization and further token-level features as created by UDPipe.

Scripts are written in bash and Python 3.

@@ -10,7 +10,7 @@ You will need the auto-generated subtitles (.vtt files) along with accompanying
## Prerequisites ##
- You will need an installation of UDPipe 1, along with the relevant model for the language in question ([https://ufal.mff.cuni.cz/udpipe/1](https://ufal.mff.cuni.cz/udpipe/1); English model: [https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-3131/english-ewt-ud-2.5-191206.udpipe](https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-3131/english-ewt-ud-2.5-191206.udpipe)).
- You will need an installation of our fork of Alam et al. (2020)'s punctuation restoration tool ([https://github.com/RedHenLab/punctuation-restoration](https://github.com/RedHenLab/punctuation-restoration)) and [our weights file](http://go.redhenlab.org/pgu/punctuation_restoration/) (1.4 GB)
- You will need an installation of SoMaJo for tokenisation (https://github.com/tsproisl/SoMaJo/tree/master/somajo)
- You will need an installation of SoMaJo for tokenization (https://github.com/tsproisl/SoMaJo/tree/master/somajo)

## Download ##
To avoid problems with strange characters in filenames, we recommend using the YouTube video ID as filename. The following command will download the auto-generated subtitles and the info json file, but will not download the video:
@@ -53,7 +53,7 @@ CORPUS_NAME specifies the CWB ID for your corpus
1. `convert_vtt_auto_to_conll-u.sh` Convert your .vtt files to CONLL
This script assumes the existence of a directory called `conll_input` and takes as input the .vtt file that you would like to convert to CONLL format.

It then calls `vtt_auto_to_conll-u.py` on the specified .vtt file and produces a corresponding `.conll_input`file, which consists of a tab-separated line number, the "token", several "empty" columns with underscores and the start and end time for each word. Tokenisation is done with the help of SoMaJo -- this also means that we do not retain the multi-word units in the vtt files, such as "a little". Instead in such cases, each individual token is set to the same start and end time.
It then calls `vtt_auto_to_conll-u.py` on the specified .vtt file and produces a corresponding `.conll_input` file, which consists of a tab-separated line number, the "token", several "empty" columns with underscores, and the start and end time for each word. Tokenization is done with the help of SoMaJo -- this also means that we do not retain the multi-word units in the vtt files, such as "a little". Instead, in such cases, each individual token is set to the same start and end time.

2. `extract_text_connl.py`
This script takes as input the path of the non-annotated ConLL-files from their directory. It writes the content of the "token" column to a raw-text file, which can then be processed by NLP tools.
108 changes: 108 additions & 0 deletions farsi/Farsi_readme.md
@@ -0,0 +1,108 @@

This repository provides a comprehensive pipeline to create a Farsi-language corpus from YouTube subtitles. The process includes downloading subtitles, generating subtitles if needed, performing linguistic annotation, and formatting the corpus for CQPweb integration.

# Prerequisites

Ensure you have the following tools and libraries installed:

- [yt-dlp](https://github.com/yt-dlp/yt-dlp) for downloading videos and subtitles.
- [OpenAI Whisper](https://github.com/openai/whisper) for generating subtitles.
- [Python](https://www.python.org/) with the following libraries:
- `pandas`
- `nltk`
- `hazm`
  - `argparse` (standard library)
  - `json` (standard library)
- [UDPipe](https://ufal.mff.cuni.cz/udpipe) for linguistic annotation.

You also need the **Persian (Farsi) Seraji model** for UDPipe: `persian-seraji-ud-2.5-191206.udpipe`.
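
If the third-party Python packages and command-line tools are not yet installed, they can typically be obtained from PyPI as sketched below (OpenAI Whisper additionally needs `ffmpeg` available on the system):

```
pip install yt-dlp openai-whisper pandas nltk hazm
```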

# Pipeline Overview

1. **Download Videos and Subtitles**: Use `yt-dlp` to download YouTube videos and subtitles.
2. **Handle Subtitles**:
- If Farsi subtitles are available, run punctuation restoration.
- If Farsi subtitles are not available, generate them using OpenAI Whisper.
3. **Process Subtitles**: Use `time_frame.py` to convert `.vtt` files to `.conllu` format.
4. **Annotate Text**: Use UDPipe to annotate the `.conllu` file and produce an annotated file.
5. **Post-process Annotation**: Convert the annotated file to `.fa.txt` using `final.py`.
6. **Convert to XML Format**: Use `convert_to_xml.py` to convert `.fa.txt` into `.vrt` format.
7. **Concatenate `.vrt` Files**: Merge multiple `.vrt` files if necessary.
8. **Upload to CQPweb**: Upload the final `.vrt` file to CQPweb.
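
The following is a minimal sketch of how these stages can be chained for a single video, assuming Farsi auto-subtitles exist and using the file names hard-coded in `final.py` and `convert_to_xml.py` (`annotated_pos_sent.txt`, `output.vrt`, `corpus.vrt`). `VIDEO_ID`, the `./vtt/` and `./conllu/` folders, and the tool/model paths are placeholders, not fixed names from this repository:

```
#!/usr/bin/env bash
set -euo pipefail

URL="https://www.youtube.com/watch?v=VIDEO_ID"   # placeholder

# Step 1: download the video, its Farsi auto-subtitles and the info JSON.
yt-dlp -i -o "%(id)s.%(ext)s" "$URL" --write-info-json --write-auto-sub --sub-lang fa

# Step 2: restore punctuation in the downloaded subtitles
# (if no Farsi subtitles were found, transcribe with Whisper instead; see Step 2 below).
python predict_punctuation.py VIDEO_ID.fa.vtt "models/xlm-roberta-large-fa-1-task2/final/" 2 VIDEO_ID.punct.vtt

# Step 3: convert the (punctuation-restored) .vtt files in ./vtt/ to .conllu files in ./conllu/.
python time_frame.py ./vtt/ ./conllu/

# Step 4: tag and parse with UDPipe; final.py expects the result as annotated_pos_sent.txt.
/path/to/udpipe --input=conllu --tag --parse \
    --outfile=annotated_pos_sent.txt \
    /path/to/persian-seraji-ud-2.5-191206.udpipe ./conllu/VIDEO_ID.conllu

# Steps 5-6: post-process the annotation and wrap it in XML (writes output.vrt, then corpus.vrt).
python final.py
python convert_to_xml.py --json_file VIDEO_ID.info.json --annotated output.vrt

# Step 7: repeat per video and concatenate the per-video corpus files for CQPweb.
```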

# Detailed Steps

## Step 1: Download YouTube Videos and Subtitles

Use the following command to download videos and Farsi subtitles:

```
yt-dlp -i -o "%(id)s.%(ext)s" "URL_of_the_video" --write-info-json --write-auto-sub --sub-lang fa --verbose
```

This command downloads the video along with its Farsi subtitles (`.vtt` format) and the accompanying info JSON.

## Step 2: Handle Subtitles

- If Farsi subtitles exist, run the punctuation restoration model.
```
python predict_punctuation.py input_file.vtt "models/xlm-roberta-large-fa-1-task2/final/" 2 output_file.vtt
```
- If Farsi subtitles do not exist, generate subtitles using OpenAI Whisper:

```
whisper "path_to_video_file" --model large --language Persian -f 'vtt'
```
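
A small shell helper along these lines can automate the choice between the two cases (a sketch; `VIDEO_ID`, the file names and the model path are placeholders):

```
# Use the punctuation-restoration model when yt-dlp found Farsi auto-subtitles,
# otherwise fall back to transcribing the downloaded video with Whisper.
if [ -f "${VIDEO_ID}.fa.vtt" ]; then
    python predict_punctuation.py "${VIDEO_ID}.fa.vtt" "models/xlm-roberta-large-fa-1-task2/final/" 2 "${VIDEO_ID}.punct.vtt"
else
    whisper "${VIDEO_ID}.mp4" --model large --language Persian -f vtt
fi
```
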
## Step 3: Process Subtitles with time_frame.py

Run the time_frame.py script to process .vtt files and convert them to .conllu format:

```
python time_frame.py path_to_vtt_folder path_to_conllu_output_folder
```

## Step 4: Annotate Text with UDPipe

Use UDPipe for tokenization, POS tagging, and dependency parsing:

```
/path/to/udpipe --input=conllu --tag --parse --outfile=/path/to/output/annotated.txt /path/to/persian-seraji-ud-2.5-191206.udpipe /path/to/conllu/file
```

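## Step 5: Post-process the Annotation with final.py

Run `final.py` to rearrange the UDPipe output into the tab-separated token table used in the next step. As currently written (see `farsi/final.py`), the script reads `annotated_pos_sent.txt` from the working directory and writes `output.vrt` (the file referred to as `.fa.txt` elsewhere in this README), so name or copy the UDPipe output accordingly:

```
python final.py
```
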
## Step 6: Convert to XML with convert_to_xml.py

Run `convert_to_xml.py` to convert the post-processed `.fa.txt` file into XML (`.vrt`) format for CQPweb:

```
python convert_to_xml.py --json_file "info.json" --annotated "output.fa.txt"
```
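
For orientation, the resulting `corpus.vrt` has roughly this shape. The values below are purely illustrative: the metadata attributes are taken from the info JSON and each token line carries the tab-separated columns written by `final.py`:

```
<text id="y__VIDEOID" video_id="VIDEOID" uploader="Channel name" upload_year="2024" upload_date="20240115" title="Video title">
<s id="1">
...	tab-separated token annotations, one token per line	...
</s>
</text>
```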

## Step 7: Concatenate .vrt Files

If processing multiple videos, concatenate all .vrt files:

```
cat *.vrt > combined_corpus.vrt
```

## Step 8: Upload to CQPweb

Upload the concatenated .vrt file to CQPweb to query and analyze the corpus.


# Folder Structure
```
├── time_frame.py # Processes .vtt subtitles and generates .conllu files
├── final.py # Post-processes annotated text into .fa.txt format
├── convert_to_xml.py # Converts .fa.txt into .vrt format for CQPweb
├── README.md # This file
├── info.json # Metadata for the video from yt-dlp
├── annotated_pos_sent.txt # Annotated file from UDPipe
└── corpus.vrt # Final .vrt file for CQPweb

```

# Additional Notes

- Use correct paths for input/output files in each script.
- The pipeline handles cases with missing Farsi subtitles by generating them with Whisper.
64 changes: 64 additions & 0 deletions farsi/convert_to_xml.py
@@ -0,0 +1,64 @@
# -*- coding: utf-8 -*-
import argparse, re, sys
import json
from pathlib import Path

def xmlescape_mysql(x):
    # Escape characters so that the output is a valid XML file AND DOES NOT CONTAIN ANY
    # CHARACTERS BEYOND THE BMP (Basic Multilingual Plane), which is required by MySQL's
    # utf8 collation (which is not utf8mb4) as of CQPweb 3.2.31.
    x = re.sub(r'&', '&amp;', x)
    x = re.sub(r'"', '&quot;', x)
    x = re.sub(r'\'', '&apos;', x)
    x = re.sub(r'>', '&gt;', x)
x = re.sub(r'<', '&lt;', x)
# THIS ONE SHOULD TAKE CARE OF THE MYSQL PROBLEM (taken from here https://stackoverflow.com/questions/13729638/how-can-i-filter-emoji-characters-from-my-input-so-i-can-save-in-mysql-5-5/13752628#13752628):
try:
# UCS-4
highpoints = re.compile(u'[\U00010000-\U0010ffff]')
except re.error:
# UCS-2
highpoints = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')
x = highpoints.sub(u'\u25FD', x)
return x

parser = argparse.ArgumentParser()
parser.add_argument("-j", "--json_file", type=str, help="Name of JSON file written by youtube-dl")
parser.add_argument("-a", "--annotated", type=str, default="output.vrt", help="Name of annotated file (default is output.vrt)")
args = parser.parse_args()

video_id = Path(args.annotated).stem
# Drop any residual extension left in the stem (e.g. the ".fa" in "VIDEO_ID.fa.txt").
video_id = re.sub(r"(?<!^)\..*$", "", video_id)

# Fields from JSON we are interested in:
jsonfields = ["uploader", "channel_id", "full_title", "upload_date", "uploader_id", "title", "duration", "webpage_url"]
jsonfields_dict = {}

# Replace hyphens in the video ID to ensure uniqueness
video_id_for_cqpweb = re.sub("-", "___hyphen___", video_id)

with open("corpus.vrt", "w", encoding="utf-8") as outfile:
outfile.write(f'<text id="y__{video_id_for_cqpweb}" video_id="{video_id}"')

with open(args.json_file, encoding="utf-8") as infile:
x = json.load(infile)
for field in jsonfields:
if field in x:
if field == "upload_date":
outfile.write(f' upload_year="{xmlescape_mysql(str(x[field][:4]))}"')
outfile.write(f' {field}="{xmlescape_mysql(str(x[field]))}"')
outfile.write(">\n")

    with open(args.annotated, encoding="utf-8") as parsedfile:
        s_num = 0
        for parsedline in parsedfile:
            line = parsedline.strip()
            if line.startswith("# sent_id"):
                # Close the previous sentence (if any) and open the next one.
                if s_num > 0:
                    outfile.write('</s>\n')
                s_num += 1
                outfile.write(f'<s id="{s_num}">\n')
            elif line.startswith("#") or not line:
                # Skip other CoNLL-U comments and blank lines.
                continue
            else:
                outfile.write(xmlescape_mysql(line) + "\n")
        if s_num > 0:
            outfile.write('</s>\n')
        outfile.write('</text>\n')

46 changes: 46 additions & 0 deletions farsi/final.py
@@ -0,0 +1,46 @@
# Specify the file name
file_name = 'annotated_pos_sent.txt'

# Open the file and read its contents
with open(file_name, 'r', encoding='utf-8') as file:
file_contents = file.read()

# Print the contents to verify
#print(file_contents)
def convert_dependency_tree(vrt_content):
result = []
for line in vrt_content.strip().split('\n'):
parts = line.split()

if len(parts) < 8: # Skip lines that don't have enough parts
continue
token_index = int(parts[0])
parent_index = int(parts[6])
        # Signed offset from the token to its dependency head; the second value
        # is a constant 0 placeholder kept in the appended "(x, y)" string.
        x = parent_index - token_index
        y = 0
        parts.append(str((x, y)))
result.append(parts)
return result

result = convert_dependency_tree(file_contents)
with open('output.vrt', 'w', encoding='utf-8') as f:
# print(result)
for item in result:
item.pop(0)
#print(f"{item[0]}\t{item[1]}\t{item[2]}\t{item[3]}\t{item[4]}\t{item[5]}\t{item[6]}\t{item[7]}\t{item[8]}\t{item[9]}\n")
#print(item[0],item[1],item[2],item[3])
#print(item)
        # If the lemma column is empty ("_"), fall back to the word form.
        if item[1] == "_":
            item[1] = item[0]

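        # item[7] and item[8] each hold a "start__end" timing string; split them into
        # seconds/centiseconds components and full timestamps for the output columns.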
start, end = item[7].split("__")
start_secs, start_centisecs = start.split(":")
end_secs, end_centisecs = end.split(":")
start_time, end_time = item[8].split("__")
f.write(f"{item[0]}\t{item[3]}\t{item[1]}\t{item[2]}\t{item[1]}_{item[2]}\t{item[0]}\t_\t_\t{item[4]}\t_\t_\t_\t{item[9]}\t{item[6]}\t{start_secs}\t{start_centisecs}\t{end_secs}\t{end_centisecs}\t{start_time}\t{end_time}\n")

# Print the result

#print(output_line[0])