This is Rossina Soyan's project repo for Data Science (LING 2340). The goal of the project is to extract syntactic and lexical measures from a corpus of L2 Russian learner texts and to cluster features that correspond to specific proficiency levels.

Data-Sci-2021/Complexity-measures-and-proficiency

Lexical-complexity-measures-and-proficiency

Rossina Soyan, Fall 2021, [email protected]

Description

The goal of the project was to determine which lexical complexity measures correspond to intermediate and advanced proficiency levels in L2 Russian texts. To answer this question, I calculated three lexical complexity measures on a small sub-corpus of L2 Russian texts, performed a hierarchical cluster analysis, and compared the resulting clusters with the original proficiency levels. The cluster goodness analysis showed that 5 out of 8 students (62.5%) were categorized correctly. One reason for this subpar performance may be the size of the corpus, which included only 8 students and 24 texts. Another, more serious reason may be the lexical complexity measures themselves, which were chosen based on studies of L2 English texts and may not reflect proficiency levels in L2 Russian texts. This would be a fruitful area for further work.
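As a rough illustration of the pipeline described above (cluster students on their lexical complexity scores, then compare clusters with the rated proficiency levels), here is a minimal Python sketch. It is not the code from final_code.Rmd, which is written in R. The average-linkage rule, the function names `agglomerative` and `agreement`, and all feature values and cluster assignments below are assumptions made up for the example, not data from the corpus.

```python
from itertools import permutations


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def agglomerative(points, n_clusters):
    """Average-linkage agglomerative clustering (linkage choice is an assumption).

    Returns one cluster id per input point.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Mean pairwise distance between members of the two clusters.
                total = sum(
                    euclidean(points[a], points[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                dist = total / (len(clusters[i]) * len(clusters[j]))
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    labels = [0] * len(points)
    for cluster_id, members in enumerate(clusters):
        for member in members:
            labels[member] = cluster_id
    return labels


def agreement(true_levels, cluster_ids):
    """Share of students whose cluster matches their rated level.

    Cluster ids are arbitrary, so every mapping of clusters to levels is
    tried and the best one is kept. Assumes no more clusters than levels.
    """
    levels = sorted(set(true_levels))
    ids = sorted(set(cluster_ids))
    best_hits = 0
    for perm in permutations(levels, len(ids)):
        mapping = dict(zip(ids, perm))
        hits = sum(mapping[c] == t for c, t in zip(cluster_ids, true_levels))
        best_hits = max(best_hits, hits)
    return best_hits / len(true_levels)


# Hypothetical scores on three measures for four students (invented values).
toy_scores = [(0.40, 0.50, 0.10), (0.42, 0.52, 0.12),
              (0.70, 0.80, 0.30), (0.72, 0.78, 0.31)]
print(agglomerative(toy_scores, n_clusters=2))

# Hypothetical cluster assignment for 8 students, 4 per level.
rated = ["intermediate"] * 4 + ["advanced"] * 4
assigned = [0, 0, 0, 1, 0, 0, 1, 1]
print(agreement(rated, assigned))  # 0.625, i.e. 5 of 8 students
```

The invented assignment was chosen so that the agreement score reproduces the 5 out of 8 figure; it says nothing about which students were actually misclassified.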

Dataset

The dataset for this project was randomly sampled from the Middlebury corpus of L2 Russian texts. The Middlebury corpus is not currently publicly accessible, but I worked as an RA during its compilation, and the PI, Dr. Olesya Kisselev, gave me permission to explore the corpus as part of the coursework in my statistics courses. The Middlebury corpus consists of essays written by students for the placement test (pre-test) and the final examination (post-test) in the summer of 2019. The original corpus includes 601 essays (103,150 words total) by 133 L2 Russian learners at different levels of proficiency. The sub-corpus for this project includes 24 essays (4,854 words total) by 8 L2 Russian learners, with 4 students rated as intermediate and 4 students rated as advanced.
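To put the sub-corpus size in perspective, the counts above can be turned into a few ratios. The short Python sketch below uses only the figures stated in this section; the helper names `mean_essay_length` and `share` are made up for the example.

```python
# Counts taken from the dataset description above.
full_corpus = {"essays": 601, "words": 103_150, "learners": 133}
sub_corpus = {"essays": 24, "words": 4_854, "learners": 8}


def mean_essay_length(corpus):
    """Average number of words per essay."""
    return corpus["words"] / corpus["essays"]


def share(part, whole, key):
    """Fraction of the full corpus covered by the sub-corpus for one count."""
    return part[key] / whole[key]


print(mean_essay_length(sub_corpus))                   # 202.25
print(round(mean_essay_length(full_corpus), 2))
print(sub_corpus["essays"] / sub_corpus["learners"])   # 3.0 essays per learner
print(f"{share(sub_corpus, full_corpus, 'essays'):.2%} of essays")
print(f"{share(sub_corpus, full_corpus, 'words'):.2%} of words")
```

The sub-corpus covers roughly 4% of the essays and just under 5% of the words in the full corpus.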

Repo directory

  • final_report.md overall description of the project, theoretical contextualization, analysis and the story behind the final product

  • README.md you are here

  • LICENSE.md licensing terms

  • project_plan.md my original plan before I started coding in R

  • progress_report.md four progress reports throughout the semester

  • presentation.pdf this is the pdf of the final presentation I gave at the end of the semester

  • final_code.Rmd the loading of the sub-corpus, calculation of LC measures and cluster analysis

  • final_code.md the same output but in the HTML format

  • data_sample this is an example of a text in the sub-corpus

  • non_lexical_items these are .txt files that I created to be able to calculate lexical density of the corpus texts

  • images images of the cluster analysis

  • scratchpad these are all the drafts I went through on my way to the final project
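The non_lexical_items files listed above support the lexical density calculation. As a hedged illustration of the idea, the Python sketch below computes lexical density as the share of tokens that are not on a non-lexical (function word) list. The actual calculation lives in final_code.Rmd and is written in R. The one-word-per-line file format, the tokenizing regex, the function names, and the tiny word list here are all assumptions for the example.

```python
import re


def parse_word_list(raw):
    """Parse a word list, assuming one lowercase word per line (hypothetical format)."""
    return {line.strip().lower() for line in raw.splitlines() if line.strip()}


def tokenize(text):
    """Lowercase the text and extract Cyrillic word tokens, keeping hyphenated words whole."""
    return re.findall(r"[а-яё]+(?:-[а-яё]+)*", text.lower())


def lexical_density(text, non_lexical):
    """Share of tokens that are lexical, i.e. not in the non-lexical list."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    lexical = [t for t in tokens if t not in non_lexical]
    return len(lexical) / len(tokens)


# A tiny invented list of function words standing in for the .txt files.
non_lexical = parse_word_list("я\nв\nи\nна\nне\nчто\nэто\n")

# Invented sample sentence: "I live in Moscow and study the Russian language."
sample = "Я живу в Москве и учу русский язык."
print(tokenize(sample))
print(lexical_density(sample, non_lexical))  # 5 lexical tokens out of 8 -> 0.625
```

A real list would need to cover pronouns, prepositions, conjunctions, and particles far more thoroughly than this seven-word stand-in.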

Guestbook

Link to the guestbook
