Rossina Soyan, Fall 2021, [email protected]
The goal of the project was to determine which lexical complexity (LC) measures correspond to intermediate and advanced proficiency levels in L2 Russian texts. To answer this question, I calculated three lexical complexity measures on a small sub-corpus of L2 Russian texts, performed a hierarchical cluster analysis, and compared the resulting clusters with the students' original proficiency ratings. The goodness-of-clusters analysis showed that 5 out of 8 students (62.5%) were categorized correctly. One reason for this subpar performance may be the size of the corpus, which included only 8 students and 24 texts. Another, more serious, reason may be the lexical complexity measures themselves: they were chosen based on studies of L2 English texts and may not reflect proficiency levels in L2 Russian. This would be a fruitful area for further work.
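The pipeline described above (per-student LC measures → hierarchical clustering → comparison with rated levels) can be sketched as follows. This is a hedged illustration in Python rather than the project's actual R code; the feature values, the three measure columns, and the choice of Ward linkage are all assumptions for the sake of the example, not the project's real numbers:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical feature matrix: one row per student, one column per
# lexical complexity measure (e.g. density, diversity, sophistication).
# First 4 rows imitate intermediate students, last 4 advanced ones.
features = np.array([
    [0.45, 0.62, 0.10],
    [0.48, 0.60, 0.12],
    [0.47, 0.58, 0.09],
    [0.44, 0.65, 0.11],
    [0.55, 0.75, 0.22],
    [0.57, 0.78, 0.25],
    [0.56, 0.74, 0.21],
    [0.58, 0.80, 0.24],
])
levels = ["intermediate"] * 4 + ["advanced"] * 4

# Standardize each measure so no single scale dominates the distances,
# then build a hierarchical clustering with Ward linkage and cut the
# tree into two clusters to compare against the two proficiency levels.
z = (features - features.mean(axis=0)) / features.std(axis=0)
tree = linkage(z, method="ward")
clusters = fcluster(tree, t=2, criterion="maxclust")
print(list(zip(levels, clusters)))
```

Agreement between `clusters` and `levels` is then a matter of counting how many students land in the cluster dominated by their rated level.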
The dataset for this project was randomly selected from the Middlebury corpus of L2 Russian texts. The Middlebury corpus is not currently publicly accessible, but I worked as an RA during its compilation, and the PI, Dr. Olesya Kisselev, gave me permission to explore the corpus in my statistics courses as part of my coursework. The corpus consists of essays written by students as part of the placement (pre-test) and final (post-test) examinations in the summer of 2019. The original corpus includes 601 essays (103,150 words total) by 133 L2 Russian learners at different levels of proficiency. The sub-corpus for this project includes 24 essays (4,854 words total) by 8 L2 Russian learners, with 4 students rated as intermediate and 4 rated as advanced.
- `final_report.md`: overall description of the project, theoretical contextualization, analysis, and the story behind the final product
- `README.md`: you are here
- `LICENSE.md`: licensing terms
- `project_plan.md`: my original plan before I started coding in R
- `progress_report.md`: four progress reports written throughout the semester
- `presentation.pdf`: the PDF of the final presentation I gave at the end of the semester
- `final_code.Rmd`: loading of the sub-corpus, calculation of the LC measures, and the cluster analysis
- `final_code.md`: the same code and output, rendered to Markdown
- `data_sample`: an example of a text in the sub-corpus
- `non_lexical_items`: .txt files I created in order to calculate the lexical density of the corpus texts
- `images`: images of the cluster analysis
- `scratchpad`: all the drafts I went through on my way to the final project
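The `non_lexical_items` stoplists feed a lexical-density calculation: the share of content (lexical) tokens among all tokens, with function words filtered out. A minimal sketch of that idea, using a tiny made-up stoplist in place of the project's actual .txt files:

```python
def lexical_density(tokens, non_lexical):
    """Fraction of tokens that are lexical (not in the stoplist)."""
    lexical = [t for t in tokens if t.lower() not in non_lexical]
    return len(lexical) / len(tokens)

# Toy stoplist of Russian function words; the real project reads
# these from the .txt files in non_lexical_items.
non_lexical = {"и", "в", "на", "не", "что"}

text = "я живу в москве и учусь в университете".split()
print(lexical_density(text, non_lexical))  # 5 lexical tokens out of 8
```

In the actual analysis this ratio would be computed per essay (or per student) and used as one of the three LC measures.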
Link to the guestbook