Update Readme.md #1

Open
wants to merge 25 commits into main
6 changes: 3 additions & 3 deletions README.md
@@ -1,5 +1,5 @@
Pipeline to process YouTube auto-generated captions in multiple languages
For a given collection of auto-captions and json metadata, the pipeline produces a CWB-compatible corpus in CONLL format with tokenisation, POS tagging, lemmatisation and further token-level features as created by UDPipe.
For a given collection of auto-captions and json metadata, the pipeline produces a CWB-compatible corpus in CONLL format with tokenization, POS tagging, lemmatization and further token-level features as created by UDPipe.

Scripts are written in bash and Python 3.

@@ -10,7 +10,7 @@ You will need the auto-generated subtitles (.vtt files) along with accompanying
## Prerequisites ##
- You will need an installation of UDPipe 1, along with the relevant model for the language in question ([https://ufal.mff.cuni.cz/udpipe/1](https://ufal.mff.cuni.cz/udpipe/1); English model: [https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-3131/english-ewt-ud-2.5-191206.udpipe](https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-3131/english-ewt-ud-2.5-191206.udpipe)).
- You will need an installation of our fork of Alam et al. (2020)'s punctuation restoration tool ([https://github.com/RedHenLab/punctuation-restoration](https://github.com/RedHenLab/punctuation-restoration)) and [our weights file](http://go.redhenlab.org/pgu/punctuation_restoration/) (1.4 GB)
- You will need an installation of SoMaJo for tokenisation (https://github.com/tsproisl/SoMaJo/tree/master/somajo)
- You will need an installation of SoMaJo for tokenization (https://github.com/tsproisl/SoMaJo/tree/master/somajo)

## Download ##
To avoid problems with strange characters in filenames, we recommend using the YouTube video ID as filename. The following command will download the auto-generated subtitles and the info json file, but will not download the video:
@@ -53,7 +53,7 @@ CORPUS_NAME specifies the CWB ID for your corpus
1. `convert_vtt_auto_to_conll-u.sh` Convert your .vtt files to CONLL
This script assumes the existence of a directory called `conll_input` and takes as input the .vtt file that you would like to convert to CONLL format.

It then calls `vtt_auto_to_conll-u.py` on the specified .vtt file and produces a corresponding `.conll_input`file, which consists of a tab-separated line number, the "token", several "empty" columns with underscores and the start and end time for each word. Tokenisation is done with the help of SoMaJo -- this also means that we do not retain the multi-word units in the vtt files, such as "a little". Instead in such cases, each individual token is set to the same start and end time.
It then calls `vtt_auto_to_conll-u.py` on the specified .vtt file and produces a corresponding `.conll_input` file, which consists of a tab-separated line number, the "token", several "empty" columns with underscores, and the start and end time for each word. Tokenization is done with the help of SoMaJo -- this also means that we do not retain the multi-word units in the vtt files, such as "a little". Instead, in such cases, each individual token is set to the same start and end time.

2. `extract_text_connl.py`
This script takes as input the path of the non-annotated ConLL-files from their directory. It writes the content of the "token" column to a raw-text file, which can then be processed by NLP tools.
108 changes: 108 additions & 0 deletions farsi/Farsi_readme.md
@@ -0,0 +1,108 @@

This repository provides a comprehensive pipeline to create a Farsi-language corpus from YouTube subtitles. The process includes downloading subtitles, generating subtitles if needed, performing linguistic annotation, and formatting the corpus for CQPweb integration.

# Prerequisites

Ensure you have the following tools and libraries installed:

- [yt-dlp](https://github.com/yt-dlp/yt-dlp) for downloading videos and subtitles.
- [OpenAI Whisper](https://github.com/openai/whisper) for generating subtitles.
- [Python](https://www.python.org/) with the following libraries:
- `pandas`
- `nltk`
- `hazm`
  - `argparse` (standard library)
  - `json` (standard library)
- [UDPipe](https://ufal.mff.cuni.cz/udpipe) for linguistic annotation.

You also need the **Persian (Farsi) Seraji model** for UDPipe: `persian-seraji-ud-2.5-191206.udpipe`.
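
If the third-party Python packages and command-line tools are not yet installed, they can typically be obtained from PyPI as sketched below (OpenAI Whisper additionally needs `ffmpeg` available on the system):

```
pip install yt-dlp openai-whisper pandas nltk hazm
```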

# Pipeline Overview

1. **Download Videos and Subtitles**: Use `yt-dlp` to download YouTube videos and subtitles.
2. **Handle Subtitles**:
- If Farsi subtitles are available, run punctuation restoration.
- If Farsi subtitles are not available, generate them using OpenAI Whisper.
3. **Process Subtitles**: Use `time_frame.py` to convert `.vtt` files to `.conllu` format.
4. **Annotate Text**: Use UDPipe to annotate the `.conllu` file and produce an annotated file.
5. **Post-process Annotation**: Convert the annotated file to `.fa.txt` using `final.py`.
6. **Convert to XML Format**: Use `convert_to_xml.py` to convert `.fa.txt` into `.vrt` format.
7. **Concatenate `.vrt` Files**: Merge multiple `.vrt` files if necessary.
8. **Upload to CQPweb**: Upload the final `.vrt` file to CQPweb.
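
The following is a minimal sketch of how these stages can be chained for a single video, assuming Farsi auto-subtitles exist and using the file names hard-coded in `final.py` and `convert_to_xml.py` (`annotated_pos_sent.txt`, `output.vrt`, `corpus.vrt`). `VIDEO_ID`, the `./vtt/` and `./conllu/` folders, and the tool/model paths are placeholders, not fixed names from this repository:

```
#!/usr/bin/env bash
set -euo pipefail

URL="https://www.youtube.com/watch?v=VIDEO_ID"   # placeholder

# Step 1: download the video, its Farsi auto-subtitles and the info JSON.
yt-dlp -i -o "%(id)s.%(ext)s" "$URL" --write-info-json --write-auto-sub --sub-lang fa

# Step 2: restore punctuation in the downloaded subtitles
# (if no Farsi subtitles were found, transcribe with Whisper instead; see Step 2 below).
python predict_punctuation.py VIDEO_ID.fa.vtt "models/xlm-roberta-large-fa-1-task2/final/" 2 VIDEO_ID.punct.vtt

# Step 3: convert the (punctuation-restored) .vtt files in ./vtt/ to .conllu files in ./conllu/.
python time_frame.py ./vtt/ ./conllu/

# Step 4: tag and parse with UDPipe; final.py expects the result as annotated_pos_sent.txt.
/path/to/udpipe --input=conllu --tag --parse \
    --outfile=annotated_pos_sent.txt \
    /path/to/persian-seraji-ud-2.5-191206.udpipe ./conllu/VIDEO_ID.conllu

# Steps 5-6: post-process the annotation and wrap it in XML (writes output.vrt, then corpus.vrt).
python final.py
python convert_to_xml.py --json_file VIDEO_ID.info.json --annotated output.vrt

# Step 7: repeat per video and concatenate the per-video corpus files for CQPweb.
```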

# Detailed Steps

## Step 1: Download YouTube Videos and Subtitles

Use the following command to download videos and Farsi subtitles:

```
yt-dlp -i -o "%(id)s.%(ext)s" "URL_of_the_video" --write-info-json --write-auto-sub --sub-lang fa --verbose
```

This command downloads the video along with its Farsi subtitles (`.vtt` format) and the accompanying info JSON.

## Step 2: Handle Subtitles

- If Farsi subtitles exist, run the punctuation restoration model.
```
python predict_punctuation.py input_file.vtt "models/xlm-roberta-large-fa-1-task2/final/" 2 output_file.vtt
```
- If Farsi subtitles do not exist, generate subtitles using OpenAI Whisper:

```
whisper "path_to_video_file" --model large --language Persian -f 'vtt'
```
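
A small shell helper along these lines can automate the choice between the two cases (a sketch; `VIDEO_ID`, the file names and the model path are placeholders):

```
# Use the punctuation-restoration model when yt-dlp found Farsi auto-subtitles,
# otherwise fall back to transcribing the downloaded video with Whisper.
if [ -f "${VIDEO_ID}.fa.vtt" ]; then
    python predict_punctuation.py "${VIDEO_ID}.fa.vtt" "models/xlm-roberta-large-fa-1-task2/final/" 2 "${VIDEO_ID}.punct.vtt"
else
    whisper "${VIDEO_ID}.mp4" --model large --language Persian -f vtt
fi
```
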
## Step 3: Process Subtitles with time_frame.py

Run the time_frame.py script to process .vtt files and convert them to .conllu format:

```
python time_frame.py path_to_vtt_folder path_to_conllu_output_folder
```

## Step 4: Annotate Text with UDPipe

Use UDPipe for tokenization, POS tagging, and dependency parsing:

```
/path/to/udpipe --input=conllu --tag --parse --outfile=/path/to/output/annotated.txt /path/to/persian-seraji-ud-2.5-191206.udpipe /path/to/conllu/file
```

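## Step 5: Post-process the Annotation with final.py

Run `final.py` to rearrange the UDPipe output into the tab-separated token table used in the next step. As currently written (see `farsi/final.py`), the script reads `annotated_pos_sent.txt` from the working directory and writes `output.vrt` (the file referred to as `.fa.txt` elsewhere in this README), so name or copy the UDPipe output accordingly:

```
python final.py
```
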
## Step 6: Convert to XML with convert_to_xml.py

Run `convert_to_xml.py` to convert the post-processed `.fa.txt` file into XML (`.vrt`) format for CQPweb:

```
python convert_to_xml.py --json_file "info.json" --annotated "output.fa.txt"
```
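
For orientation, the resulting `corpus.vrt` has roughly this shape. The values below are purely illustrative: the metadata attributes are taken from the info JSON and each token line carries the tab-separated columns written by `final.py`:

```
<text id="y__VIDEOID" video_id="VIDEOID" uploader="Channel name" upload_year="2024" upload_date="20240115" title="Video title">
<s id="1">
...	tab-separated token annotations, one token per line	...
</s>
</text>
```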

## Step 7: Concatenate .vrt Files

If processing multiple videos, concatenate all .vrt files:

```
cat *.vrt > combined_corpus.vrt
```

## Step 8: Upload to CQPweb

Upload the concatenated .vrt file to CQPweb to query and analyze the corpus.


# Folder Structure
```
├── time_frame.py # Processes .vtt subtitles and generates .conllu files
├── final.py # Post-processes annotated text into .fa.txt format
├── convert_to_xml.py # Converts .fa.txt into .vrt format for CQPweb
├── README.md # This file
├── info.json # Metadata for the video from yt-dlp
├── annotated_pos_sent.txt # Annotated file from UDPipe
└── corpus.vrt # Final .vrt file for CQPweb

```

# Additional Notes

- Use correct paths for input/output files in each script.
- The pipeline handles cases with missing Farsi subtitles by generating them with Whisper.
64 changes: 64 additions & 0 deletions farsi/convert_to_xml.py
@@ -0,0 +1,64 @@
# -*- coding: utf-8 -*-
import argparse, re, sys
import json
from pathlib import Path

def xmlescape_mysql(x):
    # Escape characters so that the output is a valid XML file AND DOES NOT CONTAIN ANY
    # CHARACTERS BEYOND THE BMP (Basic Multilingual Plane), which is required by MySQL's
    # utf8 collation (which is not utf8mb4) as of CQPweb 3.2.31.
    x = re.sub(r'&', '&amp;', x)
    x = re.sub(r'"', '&quot;', x)
    x = re.sub(r'\'', '&apos;', x)
    x = re.sub(r'>', '&gt;', x)
x = re.sub(r'<', '&lt;', x)
# THIS ONE SHOULD TAKE CARE OF THE MYSQL PROBLEM (taken from here https://stackoverflow.com/questions/13729638/how-can-i-filter-emoji-characters-from-my-input-so-i-can-save-in-mysql-5-5/13752628#13752628):
try:
# UCS-4
highpoints = re.compile(u'[\U00010000-\U0010ffff]')
except re.error:
# UCS-2
highpoints = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')
x = highpoints.sub(u'\u25FD', x)
return x

parser = argparse.ArgumentParser()
parser.add_argument("-j", "--json_file", type=str, help="Name of JSON file written by youtube-dl")
parser.add_argument("-a", "--annotated", type=str, default="output.vrt", help="Name of annotated file (default is output.vrt)")
args = parser.parse_args()

video_id = Path(args.annotated).stem
# Drop any residual extension left in the stem (e.g. the ".fa" in "VIDEO_ID.fa.txt").
video_id = re.sub(r"(?<!^)\..*$", "", video_id)

# Fields from JSON we are interested in:
jsonfields = ["uploader", "channel_id", "full_title", "upload_date", "uploader_id", "title", "duration", "webpage_url"]
jsonfields_dict = {}

# Replace hyphens in the video ID to ensure uniqueness
video_id_for_cqpweb = re.sub("-", "___hyphen___", video_id)

with open("corpus.vrt", "w", encoding="utf-8") as outfile:
outfile.write(f'<text id="y__{video_id_for_cqpweb}" video_id="{video_id}"')

with open(args.json_file, encoding="utf-8") as infile:
x = json.load(infile)
for field in jsonfields:
if field in x:
if field == "upload_date":
outfile.write(f' upload_year="{xmlescape_mysql(str(x[field][:4]))}"')
outfile.write(f' {field}="{xmlescape_mysql(str(x[field]))}"')
outfile.write(">\n")

    with open(args.annotated, encoding="utf-8") as parsedfile:
        s_num = 0
        for parsedline in parsedfile:
            line = parsedline.strip()
            if line.startswith("# sent_id"):
                # Close the previous sentence (if any) and open the next one.
                if s_num > 0:
                    outfile.write('</s>\n')
                s_num += 1
                outfile.write(f'<s id="{s_num}">\n')
            elif line.startswith("#") or not line:
                # Skip other CoNLL-U comments and blank lines.
                continue
            else:
                outfile.write(xmlescape_mysql(line) + "\n")
        if s_num > 0:
            outfile.write('</s>\n')
        outfile.write('</text>\n')

46 changes: 46 additions & 0 deletions farsi/final.py
@@ -0,0 +1,46 @@
# Specify the file name
file_name = 'annotated_pos_sent.txt'

# Open the file and read its contents
with open(file_name, 'r', encoding='utf-8') as file:
file_contents = file.read()

# Print the contents to verify
#print(file_contents)
def convert_dependency_tree(vrt_content):
result = []
for line in vrt_content.strip().split('\n'):
parts = line.split()

if len(parts) < 8: # Skip lines that don't have enough parts
continue
token_index = int(parts[0])
parent_index = int(parts[6])
        # Signed offset from the token to its dependency head; the second value
        # is a constant 0 placeholder kept in the appended "(x, y)" string.
        x = parent_index - token_index
        y = 0
        parts.append(str((x, y)))
result.append(parts)
return result

result = convert_dependency_tree(file_contents)
with open('output.vrt', 'w', encoding='utf-8') as f:
# print(result)
for item in result:
item.pop(0)
#print(f"{item[0]}\t{item[1]}\t{item[2]}\t{item[3]}\t{item[4]}\t{item[5]}\t{item[6]}\t{item[7]}\t{item[8]}\t{item[9]}\n")
#print(item[0],item[1],item[2],item[3])
#print(item)
        # If the lemma column is empty ("_"), fall back to the word form.
        if item[1] == "_":
            item[1] = item[0]

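        # item[7] and item[8] each hold a "start__end" timing string; split them into
        # seconds/centiseconds components and full timestamps for the output columns.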
start, end = item[7].split("__")
start_secs, start_centisecs = start.split(":")
end_secs, end_centisecs = end.split(":")
start_time, end_time = item[8].split("__")
f.write(f"{item[0]}\t{item[3]}\t{item[1]}\t{item[2]}\t{item[1]}_{item[2]}\t{item[0]}\t_\t_\t{item[4]}\t_\t_\t_\t{item[9]}\t{item[6]}\t{start_secs}\t{start_centisecs}\t{end_secs}\t{end_centisecs}\t{start_time}\t{end_time}\n")

# Print the result

#print(output_line[0])