- I uploaded 24 texts written by 8 students as my data: four intermediate-level and four advanced-level students. With the semester ending, I ran out of time to figure out how to anonymize my original files, that is, the 300 texts I had planned to use. For this final project, I created a dataframe with the 24 texts and their proficiency levels.
- I calculated all three lexical complexity measures for the 24 texts and was able to create dataframes for two of them. I am not yet sure how to convert the MTLD results into a tibble with 24 slots.
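One possible way to get the remaining measure into the right shape is sketched below, assuming the 24 MTLD scores are sitting in a plain R list; the object name `mtld_list` and the placeholder values are hypothetical.

```r
library(tibble)

# stand-in for the real results: a list with one MTLD score per text
mtld_list <- as.list(runif(24, min = 40, max = 120))

mtld_df <- tibble(
  text_id = seq_along(mtld_list),   # 1..24, matching the order of the texts
  mtld    = unlist(mtld_list)       # flatten the list into a numeric column
)
```

If the real results are stored inside more complex objects, the `unlist()` call would be replaced by extracting the numeric score from each element first.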
- I joined my original dataframe with the dataframes containing lexical density and lexical sophistication. Once I figure out what to do with MTLD, my dataframe will be ready for the cluster analysis (see Draft_3_NEW_REPLACEMENT).
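The join step might look something like this with dplyr, assuming each dataframe shares a common key; all object and column names here (`text_id`, `lex_density`, `lex_sophist`) are assumptions, not the actual ones.

```r
library(dplyr)

# toy versions of the three dataframes, keyed by a shared text_id
texts_df   <- tibble::tibble(text_id = 1:3, level = c("int", "adv", "int"))
density_df <- tibble::tibble(text_id = 1:3, lex_density = c(0.52, 0.61, 0.47))
sophist_df <- tibble::tibble(text_id = 1:3, lex_sophist = c(0.18, 0.25, 0.12))

# chain two left joins so every text keeps one row with all measures
full_df <- texts_df %>%
  left_join(density_df, by = "text_id") %>%
  left_join(sophist_df, by = "text_id")
```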
- I started reading about how to perform the cluster analysis.
- The most important part of my progress was figuring out how to stop the console output from coming out as gibberish. Dan helped me with that: Sys.setlocale("LC_CTYPE", "Russian"). Overall, I always had to take additional steps to download Russian-language models, because with many packages everything defaults to English texts.
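For reference, the fix is a one-liner; note that the locale name is platform-dependent (a fact about R locales, not something from this project's files).

```r
# Switch the character-type locale so Cyrillic prints correctly.
# "Russian" is the Windows-style name; on Linux/macOS the equivalent
# is usually "ru_RU.UTF-8".
Sys.setlocale(category = "LC_CTYPE", locale = "Russian")
```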
- I was making careless mistakes a lot of the time. Many of my functions would refuse to work because I would leave show() at the end of my commands.
- I think I read and understand package documentation better now. Several times I was able to figure out on my own why one or another function was not working.
- I was able to calculate three of the four lexical complexity measures (see Draft_2_Soyan). To calculate one of them, I created five .txt files listing non-lexical words.
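A hypothetical sketch of how such a file could feed into lexical density (content words divided by all words); the file name `non_lexical_words.txt` and the sample sentence are assumptions, not the project's actual files.

```r
library(quanteda)

# one function word per line, saved as UTF-8
fun_words <- readLines("non_lexical_words.txt", encoding = "UTF-8")

toks <- tokens("Это пример короткого текста.", remove_punct = TRUE)

n_total   <- ntoken(toks)
n_lexical <- sum(!tolower(unlist(as.list(toks))) %in% fun_words)

lex_density <- n_lexical / n_total   # proportion of content words
```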
- I was able to calculate only one syntactic complexity measure. I found that spacyr does not tag POS for Russian texts correctly. The udpipe package does a better job, but I still don't know how to get from its output to counting the number of clauses.
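A sketch of the udpipe route, assuming one of its published Russian models; the final line is only a rough clause-counting heuristic based on dependency relations, not an established measure.

```r
library(udpipe)

# download and load a Russian model (download is needed only once)
m  <- udpipe_download_model(language = "russian-syntagrus")
ud <- udpipe_load_model(m$file_model)

ann <- as.data.frame(udpipe_annotate(ud, x = "Я знаю, что он придёт."))

ann$upos   # universal POS tags, one per token

# one possible clause proxy: count clause-introducing dependency relations
sum(ann$dep_rel %in% c("root", "ccomp", "advcl", "acl", "csubj", "xcomp"))
```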
I cannot share the corpus, since it does not belong to me. I have access to it only as an RA: I can manipulate data from the corpus, but I cannot make the texts publicly available.
I can make my code publicly accessible, but I cannot make the corpus accessible. I do not think my code as it stands is very useful. Still, I want it to be reusable by other people, but only for non-commercial purposes. That is why I chose the Attribution-NonCommercial-ShareAlike license.
- I decided to work with one text for the moment. Once I figure out how to compute the complexity measures for one text, I think I can venture to work with the whole corpus.
- I figured out how to load a Russian text into R (using the readtext package), but there were problems with setting the working directory: it would work one day, and when I came back the next day it would stop working.
- I figured out how to remove unnecessary words from the text (with the help of an online search).
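One common way to do this with quanteda is sketched below, assuming its built-in Russian stopword list is close to what "unnecessary words" means here; the sample sentence is made up.

```r
library(quanteda)

toks <- tokens("Это просто пример текста для очистки.", remove_punct = TRUE)

# drop Russian stopwords (Snowball list via the stopwords package)
toks_clean <- tokens_remove(toks, stopwords("ru"))
```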
- I figured out how to count the number of sentences in R (using the quanteda package).
- I figured out how to count the number of tokens in R (using the quanteda package).
- I computed my first complexity measure, mean sentence length, for one text.
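The steps above could be combined roughly like this with quanteda (word tokens divided by sentences); the sample text is made up, and splitting via corpus_reshape() is one option among several.

```r
library(quanteda)

corp <- corpus("Первое предложение здесь. Второе предложение немного длиннее.")

# count word tokens, excluding punctuation
n_words <- ntoken(tokens(corp, remove_punct = TRUE))

# reshape the corpus so each sentence becomes its own document, then count
n_sents <- ndoc(corpus_reshape(corp, to = "sentences"))

mean_sentence_length <- as.numeric(n_words) / n_sents
```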
- Now I am thinking about how to count T-units in a text.
- I couldn't knit the document. I get this message:

  ```
  Quitting from lines 17-18 (Draft_1_Soyan.Rmd)
  Error in setwd(dir) : cannot change working directory
  Calls: ... process_group.inline -> call_inline -> in_dir -> setwd
  Execution halted
  ```
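Two standard ways around setwd() failing inside an Rmd are sketched below: knitr evaluates each chunk in the document's own directory, so the knit root can be set once in the setup chunk instead; the path and file names shown are placeholders.

```r
# in the setup chunk, instead of setwd():
knitr::opts_knit$set(root.dir = "path/to/my/project")

# or avoid directory changes entirely by building paths from the
# project root with the here package
my_file <- here::here("data", "text_01.txt")
```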
I do not think I can share the texts in the corpus. The corpus does not belong to me; I am just an RA on this project, and it would not be okay for me to start sharing the corpus on my own.
So far I have my data ready. Next I need to choose which complexity measures to extract and figure out the code for them.