-
Notifications
You must be signed in to change notification settings - Fork 6
/
Copy pathtlgu.1
272 lines (261 loc) · 11.9 KB
/
tlgu.1
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
.\" Copyright (C) 2004, 2005, 2011, 2013, 2020 Dimitri Marinakis (dm, ssa gr).
.\"
.\" This file is part of tlgu which is free software; you can redistribute it and/or modify
.\" it under the terms of the GNU General Public License (version 2)
.\" as published by the Free Software Foundation.
.\"
.\" tlgu is distributed in the hope that it will be useful,
.\" but WITHOUT ANY WARRANTY; without even the implied warranty of
.\" MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
.\" GNU General Public License for more details.
.\"
.\" You should have received a copy of the GNU General Public License
.\" along with GNU Emacs; see the file COPYING. If not, write to the
.\" Free Software Foundation, Inc., 51 Franklin St, Fifth Floor,
.\" Boston, MA 02110-1301 USA.
.\"
.TH tlgu 1 "27-July-2021" "Version 1.9" "TLG to Unicode Converter"
.SH NAME
.B tlgu
\- convert beta code TLG and PHI CD-ROM txt files to Unicode
.SH SYNOPSIS
.B tlgu
[
.I options
] [
.I input_file
] [
.I output_file
]
.SH DESCRIPTION
.B tlgu
will convert an \fIinput_file\fP from Thesaurus Linguae Graeca (TLG) and Packard Humanities Institute (PHI) representation
to a Unicode (UTF-8) \fIoutput_file\fP.
If \fIinput_file\fP is not specified, standard input will be read.
If \fIoutput_file\fP is not specified, the Unicode text is directed to standard output.
The TLG/PHI representation consists of \fBbeta-code\fP text and \fBcitation\fP information.
.SH OPTIONS
.TP
.B \-b
inserts a form feed and citation information (levels a, b, c, d) on every "book" citation
change. By default the program will output line feeds only (see also \fB\-p\fP).
.TP
.B \-p
observes paging instructions.
By default the program will output line feeds only.
.TP
.B \-r
primarily Roman text (PHI). Some TLG texts, notably doccan1.txt and doccan2.txt are mostly
roman texts lacking explicit language change codes. Setting this option will force
a change to roman text after each citation block is encountered.
.TP
.B \-v
highest-level reference citation is included before each text line (v-level)
.TP
.B \-w
reference citation is included before each text line (w-level)
.TP
.B \-x
reference citation is included before each text line (x-level)
.TP
.B \-y
reference citation is included before each text line (y-level)
.TP
.B \-z
lowest-level reference citation is included before each text line (z-level).
.TP
.B \-Z <custom_citation_format_string>
an arbitrary combination of citation information is included before each text line;
see also -e option e.g. "%A/%B/%x/%y/%z\\t" will output the contents of the
A, B \fBcitation description\fP levels, followed by x, y, z \fBcitation reference\fP levels,
followed by a TAB character.
.TP
.B \-e <custom_blank_citation_string>
if there is no citation information for a citation level defined with the -Z option above,
a single right-hand slash is substituted by default; you may define any string with this option
e.g. "-" or "[NONE]" are valid inputs
.sp 1
.TP
.B \-B
inserts blank space (a tab) before each and every line.
.TP
.B \-X
compact format; v, w, x citations are inserted as they change at the beginning of each section.
.TP
.B \-Y
compact format; w, x, y citations are inserted as they change at the beginning of each section.
.TP
.B \-N
no spaces; line ends and hyphens before an ID code are removed while hyphens and spaces before page
and column ends are (still) retained.
.sp 1
.TP
.B \-C
citation debug information is output.
.TP
.B \-S
special code debug information is output.
.TP
.B \-V
block processing information is output (verbose).
.TP
.B \-U
vowels with acute accent are output using the Unicode 0x0370 codes rather than the 0x1F00 ones for compatibility with most current (as of 2020) keyboard encoders.
.TP
.B \-W
each work (book) is output as a separate file in the form output_file-xxx.txt;
if an output file is not specified, this option has no effect.
.SH HISTORY AND INTENDED USE
The purpose of \fBtlgu\fP is to translate binary TLG/PHI-format files into readable and editable text.
It is based on an earlier program written in 80x86 assembly language (1996) outputting codes for
a home-made font which used the prevalent hellenic font encodings of that time complemented
by dead accent characters - not very attractive, but readable.
.sp 1
Then came Unicode and a plethora of accented character glyphs;
Polytonic fonts are already available (Cardo, Gentium, Athena, Athenian, Porson); new fonts
are being created and older fonts are being expanded as special-use code points are included
in the Unicode definition (musical symbols, other special symbols).
A notable effort since this note was originally drafted is that of the Greek Font Society,
now featuring a great, and expanding, selection of open polytonic fonts.
.sp 1
So, at this point in time, \fBtlgu\fP will crunch a file which has been formatted
according to the published TLG/PHI format and produce codes for most glyphs
generally available. No attempt has been made to introduce multi-character sequences
or formatting codes (font changes). If a code has not been defined, the program will output
the respective "code family" glyph. You may use the \fB\-S\fP option to check such codes
against the published beta code definition.
.sp 1
July 2005 - Troy A. Griffitts (scribe, crosswire org) contributed the arbitrary citation output code and added per-line processing of the input file.
.sp 1
April 2006 - Final sigma will now be output at end-of-line (!) from free-form input text (thank you Jan).
.sp 1
October 2011 - stdout is used if output_file is not specified.
.sp 1
November 2011 - citations (v, w, x) at the start of section changes (e-book option)
.sp 1
May 2012 - Nick White (nick white, durham ac uk) revised the input arguments to use tlgu as a filter; stdin is used if input_file is not specified
.sp 1
May 2020 - Alternate output codes for vowels with acute accent (-U option)
.sp 1
July 2021 - Corrections to citation code
.SH EXAMPLES
.B ./tlgu -r DOCCAN2.TXT doccanu.txt
Translate the TLG canon to a unicode text file. Note the use of the \fB-r\fP option (this file
expects Roman as the default font).
.TP
.B ./tlgu -x -y -z TLG1799.TXT tlg1799u.txt
Generate a continuous file with the texts of granpa Euclides. Available citations (-x -y -z)
are Book//demonstratio/line as shown in the respective "cit" field of doccan2.txt.
.TP
.B ./tlgu -b -B TLG1799.TXT tlg1799u.txt
Generate the same texts, this time with a page feed and book citation information on the first
page of each book and a tab before each line (use with OOo versions earlier than 1.1.4).
.TP
.B ./tlgu -C TLG1799.TXT tlg1799u.txt
See how the citation information changes within each TLG block.
.TP
.B ./tlgu -S TLG1799.TXT tlg1799u.txt | sort > symbols1799.txt
Check out the symbols used in a work. Book and x, y, z references are printed on a separate
line for each symbol. Sort / grep the output to locate specific symbols of interest; save in
a file for later use.
.TP
.B ./tlgu -W TLG0006.TXT tlg0006u
Will produce separate files for each work, named tlg006u-001.txt etc.
.TP
.B ./tlgu -Z \N'34'%A/%B/%D/%c/%d/%Z/%x/%y/%z\et\N'34' -e \N'34'-\N'34' chr0010.txt chr0010u.txt
Will generate a file with citation description (A, B, D, Z) and citation reference (c, d, x, y, z)
levels, separated by "/" followed by a TAB character and the respective text.
Blank citation elements will be filled with a single "-"
e.g. Asia/Smyrna/1222-1223 ac/IGChAs/Asia Min [Chr]/88/-/2A/7p1 [TAB] inscription text etc.
.TP
.B ./tlgu -r -N -X LAT0448.TXT LAT0448.xx.TXT
will produce a compact version of the Gaius Iulius Caesar texts with v and x citations printed
as they change; similarly,
.B ./tlgu -r -N -Y LAT2150.TXT LAT2150.yy.TXT
will produce a compact version of Zeno's texts.
.SH POST-PROCESSING EXAMPLES
I use the OpenOffice/LibreOffice suite for most of my work. This example shows one of many possible
ways of using the search and replace facility to create a readable version of the Suda lexicon.
.TP
.B ./tlgu -B TLG4085.TXT tlg4085u.txt
A Unicode file with the text is created
.TP
.B Open the generated file with Openoffice/LibreOffice:
File | Open | Filename: tlg4085u.txt,
File Type: Text Encoded \-\- Press Open
.sp 1
The ASCII Filter Options window appears. Select the Unicode (UTF-8) character set and
a proper Unicode font installed in your machine (e.g. Cardo). Press OK.
.TP
.B Replace angle brackets with expanded text
Lexicon terms are enclosed in <angle brackets>. The actual beta codes indicate the use of
expanded text for emphasis. Select Edit | Find & Replace. The \fBFind & Replace\fP window appears.
.sp 1
In the \fBSearch For\fP field, type the following expression: \fB<[^<>]*>\fP
This means "find any characters between angle brackets, not including angle brackets".
.sp 1
In the \fBReplace With\fP window insert a single ampersand: \fB&\fP
This means that we need to \fBadd\fP formatting information (this case) or additional text to
the text found. Press \fBMore Options\fP, \fBFormat...\fP and select the \fBPosition\fP tab; select Spacing
Expanded by 2.0 points. Press OK.
.sp 1
Check the \fBRegular Expressions\fP box and press \fBReplace All\fP.
.sp 1
You may now replace the angle brackets with nothings.
.sp 1
Repeat the above procedure for titles enclosed in {braces}. Write a macro...
.TP
.B Other useful information
If you are using your wordprocessor with a locale setting other than Hellenic (el_GR), the following
invocation with the desired character classification may prove useful for the occasional polytonic editing:
.br
.sp 1
\fBLC_CTYPE=el_GR.UTF-8 /usr/bin/soffice\fP (or \fB/opt/libreoffice3.4/program/soffice\fP ).
.br
.sp 1
I put my default locale and keyboard definitions in my \fB.bashrc\fP or \fB.profile\fP:
.br
.sp 1
.na
.B export LC_CTYPE=el_GR.UTF-8
.br
.na
.B setxkbmap us,el ,polytonic -option grp:ctrl_shift_toggle -option grp_led:scroll
.br
.sp 1
This way multi-lingual text can be entered; keyboard layout switching is done by pressing Ctrl/Shift;
alternate keyboard layout is indicated by the Scroll Lock light on the keyboard.
.SH FURTHER DEVELOPMENT
You may not like the character output for a specific code. Check out the \fBtlgcodes.h\fP file
containing the special symbol and punctuation codes and select one to suit you better. It will
probably be a while before the beta to Unicode correspondence settles down.
.sp 1
Drop me a line, if you need a new feature; let me know if you do find
an interesting applications that others can profit from.
.SH REFERENCES
There are several texts describing the internal representation of \fBPHI\fP and
\fBTLG\fP text, ID data, citation data and index files. The originator of this
format is the Packard Humanities Institute. The TLG is maintained by UCI \- see
\fBwww.tlg.uci.edu\fP \- where you may find the latest versions of the \fBTLG Beta Code Manual\fP and the
\fBTLG Beta Code Quick Reference Guide\fP.
.sp 1
Unicode consortium (\fBwww.unicode.org\fP) publications pertaining to the codification
of characters used in Hellenic literature, scientific and musical texts.
.sp 1
The OpenOffice/Libreoffice suite in its various editions
(\fBwww.openoffice.org\fP - apache.org, \fBwww.libreoffice.org\fP, \fBwww.neooffice.org\fP)
includes a word processor that you can use to load, process and create new polytonic texts.
.sp 1
Greek Font Society: \fBwww.greekfontsociety.gr\fP
.SH COPYRIGHT
Copyright (C) 2004, 2005, 2011, 2013, 2020, 2021 Dimitri Marinakis (dm, ssa gr).
This file is part of tlgu which is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License (version 2) as published by
the Free Software Foundation.
tlgu is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA