-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathREADME.Rmd
125 lines (85 loc) · 6.07 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, echo = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "README-"
)
```
# tidylex <img src="man/figures/tidylex-logo.png" align="right" />
[![Travis-CI Build Status](https://travis-ci.org/CoEDL/tidylex.svg?branch=master)](https://travis-ci.org/CoEDL/tidylex)
[![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/tidylex)](https://cran.r-project.org/package=tidylex)
[![Coverage Status](https://img.shields.io/codecov/c/github/CoEDL/tidylex/master.svg)](https://codecov.io/github/CoEDL/tidylex?branch=master)
## Overview
The purpose of tidylex is to provide a collaborative, open-source, cross-platform tool for tidying dictionary data stored as Toolbox-style backslash-coded data (a broad convention for serializing lexicographic data in a human-readable and -editable manner).
This format is commonly used in the description of under-documented languages, many of which are also highly endangered.
The example below shows a toy French-to-English dictionary with 3 entries *rouge*, *bonjour*, and *parler* with various lexicographic information about these 3 entries (`lx`: lexeme, `ps`: part of speech, `de`: definition, `xv`: example, vernacular in source language, `xe`: translation, English). Tidylex makes it easy to make assertions of how these entries *should* be structured, test whether or not they are well-structured ([examples](#examples) provided below), and, most importantly, communicate the results of these tests with relevant parties.
<pre style="background:#f1f1f1">
<span class="bscode">\lx</span> <span class="bsval">rouge</span>
<span class="bscode">\ps</span> <span class="bsval">adjective</span>
<span class="bscode">\de</span> <span class="bsval">red</span>
<span class="bscode">\xv</span> <span class="bsval">La chaise est rouge</span>
<span class="bscode">\xe</span> <span class="bsval">The chair is red</span>
<span class="bscode">\lx</span> <span class="bsval">bonjour</span>
<span class="bscode">\de</span> <span class="bsval">hello</span>
<span class="bscode">\ps</span> <span class="bsval">exclamation</span>
<span class="bscode">\lx</span> <span class="bsval">parler</span>
<span class="bscode">\ps</span> <span class="bsval">verb</span>
<span class="bscode">\de</span> <span class="bsval">speak</span>
<span class="bscode">\xv</span> <span class="bsval">Parlez-vous français?</span>
</pre>
## Why is tidylex needed?
Owing to the dictionary data having been hand-edited over many years, often by multiple contributors, there is often a lot of structural inconsistency in these plain-text files.
Given the structural variation, the knowledge about these languages are effectively 'locked up' in terms of machine-processability.
Tidylex provides a set of functions to iteratively work towards a well-structured, or 'tidy', lexicon, and maintain the tidiness of the lexicon when used within a Continuous Testing setting (e.g. with Travis CI, or GitLab pipelines).
## Installation
You can install tidylex from github with:
```{r gh-installation, eval = FALSE}
# install.packages("devtools")
devtools::install_github("CoEDL/tidylex")
```
## Examples
### Read a lexicon line-by-line into a data frame
As there is project-to-project variation in coding convention (for `\lx`, some others use `\me` for main entry, or '`.`' in place of backslashes, e.g. `.i bonjour`), the `read_lexicon` function provides a quick way to specify a regular expression to parse each line in the dictionary into its various components.
```{r read-lexicon, message=FALSE}
library(tidylex)
# The path to the 'rouge, bonjour, parler' dictionary shown in the example above
lexicon_file <- system.file("extdata", "error-french.txt", package = "tidylex")
lexicon_df <- read_lexicon(
file = lexicon_file,
regex = "\\\\?([a-z]*)\\s?(.*)", # Note two capture groups, in parentheses
into = c("code", "value") # Captured data placed, respectively, in 'code' and 'value' columns
)
lexicon_df
```
### Assign groups to rows based on properties of the rows
A common pre-processing routine that must be done is to group the lines into various subgroups, e.g. within some given entry, or within some given sense of the entry, etc. We can easily do this with the `add_group_col` function.
```{r verify-unique}
grouped_lxdf <-
lexicon_df %>%
add_group_col(
name = lx_group, # Name of the new grouping column
where = code == "lx", # When to fill with a value, i.e. when *not* to inherit value
value = paste0(line, ": ", value) # What the value should be when above condition is true
)
grouped_lxdf
```
### Formally define a well-formed entry and test entries against definition
Tidylex lets you define and use basic [Nearley](https://nearley.js.org/) grammars within R to test for well-formedness of sequence of backslash codes.
For such sequences above, we can define a context-free grammar (equivalent to phrase structure rules) within the Nearley notation below (`:?`, `:+` are quantifiers indicating, respectively, 'zero or one' and 'one or more' of the preceding entity).
We use the `compile_grammar` function to generate code that can be used to test whether a series of values (e.g. those within the the `code` column) conform to a sequence expected by some grammar.
```{r nearley-grammar}
entry_parser <- compile_grammar('
headword -> "lx" "ps" "de" examples:?
examples -> ("xv" "xe"):+
')
grouped_lxdf %>%
mutate(code_ok = entry_parser$parse_str(code, return_labels = TRUE))
```
We can see from the data frame above that the sequence of codes for entry group `1:rouge` (`lx ps de xv xe`) conforms to the grammar, while the group `7: bonjour` does not. We can see that there is a value `FALSE` for `code_ok` for the `de` line (line 8).
<hr>
<b>Footnotes</b>
1. <a name="nearley"></a> At the moment tidylex can't work with *all* Nearley grammars since the R V8 package which uses an older version of the V8 engine for cross-compatibility requirements. So, Nearley grammars that make use of ES6 features won't compile in V8 3.14.