<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type='text/xsl' href='rfc2629.xslt' ?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
<!ENTITY rfc2045 PUBLIC "" "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2045.xml">
<!ENTITY rfc2119 PUBLIC "" "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml">
<!ENTITY rfc3339 PUBLIC "" "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3339.xml">
<!ENTITY rfc3688 PUBLIC "" "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3688.xml">
<!ENTITY rfc3743 PUBLIC "" "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3743.xml">
<!ENTITY rfc5646 PUBLIC "" "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5646.xml">
<!ENTITY rfc4290 PUBLIC "" "http://xml.resource.org/public/rfc/bibxml/reference.RFC.4290.xml">
<!ENTITY rfc5226 PUBLIC "" "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5226.xml">
<!ENTITY rfc5564 PUBLIC "" "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5564.xml">
<!ENTITY rfc5891 PUBLIC "" "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5891.xml">
<!ENTITY rfc5892 PUBLIC "" "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5892.xml">
<!ENTITY rfc6838 PUBLIC "" "http://xml.resource.org/public/rfc/bibxml/reference.RFC.6838.xml">
<!ENTITY rfc7303 PUBLIC "" "http://xml.resource.org/public/rfc/bibxml/reference.RFC.7303.xml">
]>
<rfc category="std" ipr="trust200902" docName="draft-ietf-lager-specification-13">
<?rfc toc="yes" ?>
<?rfc symrefs="yes" ?>
<?rfc sortrefs="yes"?>
<?rfc iprnotified="no" ?>
<?rfc strict="no" ?>
<front>
<title abbrev="Label Generation Rulesets in XML">Representing Label Generation Rulesets
using XML</title>
<author initials="K" surname="Davies" fullname="Kim Davies">
<organization abbrev="ICANN">Internet Corporation for Assigned Names and
Numbers</organization>
<address>
<postal>
<street>12025 Waterfront Drive</street>
<city>Los Angeles</city>
<region>CA</region>
<code>90094</code>
<country>US</country>
</postal>
<phone>+1 310 301 5800</phone>
<email>[email protected]</email>
<uri>http://www.icann.org/</uri>
</address>
</author>
<author initials="A" surname="Freytag" fullname="Asmus Freytag">
<organization>ASMUS Inc.</organization>
<address>
<email>[email protected]</email>
</address>
</author>
<date/>
<area>Applications and Real-Time Area</area>
<workgroup>Label Generation Rules (lager)</workgroup>
<abstract>
<t>This document describes a method of representing rules for validating identifier
labels and alternate representations of those labels using Extensible Markup
Language (XML). These policies, known as "Label Generation Rulesets" (LGRs), are
used for the implementation of Internationalized Domain Names (IDNs), for example.
The rulesets are used to implement and share that aspect of policy defining which
labels and Unicode code points are permitted for registrations, which
alternative code points are considered variants, and what actions may be performed
on labels containing those variants.</t>
</abstract>
</front>
<middle>
<section title="Introduction">
<t>This document specifies a method of using Extensible Markup Language
(XML) to describe Label Generation Rulesets (LGRs). LGRs are
algorithms used to determine whether, and under what conditions, a
given identifier label is permitted, based on the code points it
contains and their context. These algorithms comprise a list of
permissible code points, variant code point mappings, and a set of
rules that act on the code points and mappings. LGRs form part of
an administrator's policies. In deploying internationalized domain
names (IDNs), they have also been known as IDN tables or variant
tables.</t>
<t>There are other kinds of policies relating to labels which are not normally covered by
Label Generation Rulesets and are therefore not necessarily representable by the XML
format described here. These include, but are not limited to policies around
trademarks, or prohibition of fraudulent or objectionable words.</t>
<t>Administrators of the zones for top-level domain registries have historically
published their LGRs using ASCII text or HTML. The formatting of these documents has
been loosely based on the format used for the Language Variant Table described
in <xref target="RFC3743"/>. <xref target="RFC4290"/> also provides a "model table
format" that describes a similar set of functionality. Common to these formats is
that the algorithms used to evaluate the data therein are implicit or specified
elsewhere.</t>
<t>Through the first decade of IDN deployment, experience has shown that LGRs derived
from these formats are difficult to consistently implement and compare due to their
differing formats. A universal format, such as one using a structured XML format,
will assist by improving machine-readability, consistency, reusability and
maintainability of LGRs.</t>
<t>When used to represent a simple list of permitted code points, the format is quite
straightforward. At the cost of some complexity in the resulting file, it also
allows for an implementation of more sophisticated handling of conditional variants
that reflects the known requirements of current zone administrator policies.</t>
<t>Another feature of this format is that it allows many of the algorithms to be made
explicit and machine implementable. A remaining small set of implicit algorithms is
described in this document to allow commonality in implementation.</t>
<t>While the predominant usage of this specification is to represent IDN label policy,
the format is not limited to IDN usage and may also be used for describing ASCII domain
name label rulesets, or other types of identifier labels beyond those used for
domain names.</t>
</section>
<section title="Design Goals">
<t>The following goals informed the design of this format:</t>
<t>
<list style="symbols">
<t>The format needs to be implementable in a reasonably straightforward manner
in software.</t>
<t>The format should be able to be automatically checked for formatting errors,
so that common mistakes can be caught.</t>
<t>An LGR needs to be able to express the set of valid code points that are
allowed for registration under a specific administrator's policies.</t>
<t>An LGR needs to be able to express computed alternatives to a given identifier
based on mapping relationships between code points, whether one-to-one or
many-to-many. These computed alternatives are commonly known as "variants".</t>
<t>Variant code points should be able to be tagged with explicit dispositions or
categories that can be used to support registry policy (such as whether to
allocate the computed variant, or to merely block it from usage or
registration).</t>
<t>Variants and code points must be able to be stipulated based on contextual
information. For example, some variants may only be applicable when they
follow a certain code point, or when the code point is displayed in a
specific presentation form.</t>
<t>The data contained within an LGR must be able to be interpreted
unambiguously, so that independent implementations that utilize the
contents will arrive at the same results.</t>
<t>To the largest extent possible, policy rules should be able to be specified
in the XML format without relying on hidden or built-in algorithms in
implementations.</t>
<t>LGRs should be suitable for comparison and re-use, such that one could easily
compare the contents of two or more to see the differences, to merge them,
and so on.</t>
<t>As many existing IDN tables as practicable should be able to be migrated to
the LGR format with all applicable interpretation logic retained.</t>
</list>
</t>
<t>These requirements are partly derived from reviewing the existing corpus of published
IDN tables, plus the requirements of ICANN's work to implement an LGR for the DNS
Root Zone <xref target="LGR-PROCEDURE"/>. In particular, Section B of that document
identifies five specific requirements for an LGR methodology.</t>
<t>The syntax and rules in <xref target="RFC5892"/> and <xref target="RFC3743"/> were
also reviewed.</t>
<t>It is explicitly not the goal of this format to stipulate what code points should be
listed in an LGR by a zone administrator. Which registration policies are used for a
particular zone is outside the scope of this memo.</t>
</section>
<section title="Normative Language">
<t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in <xref target="RFC2119"/>.
</t>
</section>
<section title="LGR Format">
<t>An LGR is expressed as a well-formed XML Document <xref target="XML"/>
that conforms to the schema defined in <xref target="schema"/>.</t>
<t>As XML is case-sensitive, an LGR must be authored with the correct
casing. For example, the XML element names MUST be in lower
case as described in this specification, and attribute values
are matched only in a case-sensitive manner.</t>
<t>A document that is not well-formed, does not conform to the schema, or violates other
constraints specified in this specification MUST be rejected.</t>
<section title="Namespace">
<t>The XML Namespace URI is "urn:ietf:params:xml:ns:lgr-1.0".</t>
<t>See <xref target="urn_reg"/> for more information.</t>
</section>
<section title="Basic Structure">
<t>The basic XML framework of the document is as follows:</t>
<figure>
<artwork><![CDATA[
<?xml version="1.0"?>
<lgr xmlns="urn:ietf:params:xml:ns:lgr-1.0">
...
</lgr>]]></artwork>
</figure>
<t>The "lgr" element contains up to three sub-elements. First is an optional "meta"
element that contains all meta-data associated with the LGR, such as its
authorship, what it is used for, implementation notes and references. This is
followed by a required "data" element that contains the substantive code point data.
Finally, an optional "rules" element contains information on contextual and
whole-label evaluation rules, if any, along with "action" elements
providing for the disposition of labels and computed variant labels.</t>
<figure>
<artwork><![CDATA[
<?xml version="1.0"?>
<lgr xmlns="urn:ietf:params:xml:ns:lgr-1.0">
<meta>
...
</meta>
<data>
...
</data>
<rules>
...
</rules>
</lgr>
]]></artwork>
</figure>
<t>A document MUST contain exactly one "lgr" element. Each "lgr" element MUST
contain zero or one "meta" element, exactly one "data" element, and
zero or one "rules" element; and these three elements MUST be in that order.</t>
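<t>As a minimal sketch of these constraints, the following hypothetical LGR omits the
optional "meta" and "rules" elements and consists of a "data" element only. The single
code point declared in it is chosen purely for illustration:</t>
<figure>
<artwork><![CDATA[
<?xml version="1.0"?>
<lgr xmlns="urn:ietf:params:xml:ns:lgr-1.0">
<data>
<char cp="002D"/>
</data>
</lgr>
]]></artwork>
</figure>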
<t>Some elements that are direct or nested child elements of the "rules" element
MUST be placed in a specific relative order to other elements for the LGR to be valid.
An LGR that violates these constraints MUST be rejected. In other cases, changing
the ordering would result in a valid, but different specification.</t>
<t>In the following descriptions, required, non-repeating elements or attributes are
generally not called out explicitly, in contrast to "OPTIONAL" ones,
or those that "MAY" be repeated. For attributes that take lists as values, the list items MUST be
space-separated.</t>
</section>
<section title="Metadata">
<t>The "meta" element expresses metadata associated with the LGR,
and the element SHOULD be included so that the associated
metadata are available as part of the LGR and cannot become disassociated.
The following subsections describe elements that may appear within the "meta" element.</t>
<t>The "meta" element can be used to identify the author or relevant contact person, explain
the intended usage of the LGR, and provide implementation notes as well as
references. Detailed metadata allow the LGR document to become self-documenting,
for example, if rendered in a human-readable format by an appropriate tool.</t>
<t>Providing metadata pertaining to the date and version of the
LGR is particularly encouraged to make it easier for interoperating
consumers to ensure that they are using the correct LGR.</t>
<t>With the exception of the "unicode-version" element, the data contained
within is not required by software consuming the LGR in order to calculate valid
labels, or to calculate variants. If present, the "unicode-version" element MUST
be used by a consumer of the table to identify that it has the correct Unicode
property data to perform operations on the table. This ensures that possible
differences in code point properties between editions of the Unicode standard
do not impact the product of calculations utilizing an LGR.</t>
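<t>As an illustrative sketch only, a "meta" element combining several of the elements
described in the following subsections might look as shown here. All values are
hypothetical placeholders, and the order of the child elements simply follows the
order of the subsections; it is an assumption of this sketch rather than a stated
requirement:</t>
<figure>
<artwork><![CDATA[
<meta>
<version comment="draft">1</version>
<date>2009-11-01</date>
<language>und-Cyrl</language>
<scope type="domain">example.com</scope>
<description type="text/plain">Hypothetical LGR, for
illustration only.</description>
<unicode-version>6.2.0</unicode-version>
</meta>
]]></artwork>
</figure>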
<section title="The version Element">
<t>The "version" element is OPTIONAL. It is used to uniquely identify each
version of the LGR. No specific format is required, but it is RECOMMENDED
that it be the decimal representation of a single positive
integer, which is incremented with each revision of the file.</t>
<t>An example of a typical first edition of a document:</t>
<t>
<figure>
<artwork><![CDATA[
<version>1</version>
]]></artwork>
</figure>
</t>
<t>The "version" element may have an OPTIONAL "comment" attribute.</t>
<t>
<figure>
<artwork><![CDATA[
<version comment="draft">1</version>
]]></artwork>
</figure>
</t>
</section>
<section title="The date Element">
<t>The OPTIONAL "date" element is used to identify the date the LGR was posted.
The contents of this element MUST be a valid ISO 8601 "full-date" string as
described in <xref target="RFC3339"/>.</t>
<figure>
<preamble>Example of a date:</preamble>
<artwork><![CDATA[
<date>2009-11-01</date>
]]></artwork>
</figure>
</section>
<section title="The language Element">
<t>Each OPTIONAL "language" element identifies a language or script for which the LGR
is intended. The value of the "language" element MUST be a
valid language tag as described in <xref target="RFC5646"/>. The tag may
refer to a script plus undefined language if the LGR is not intended for a
specific language.</t>
<t>Example of an LGR for the English language:</t>
<figure>
<artwork><![CDATA[
<language>en</language>
]]></artwork>
</figure>
<t>If the LGR applies to a script, rather than a specific language, the "und"
language tag SHOULD be used followed by the relevant <xref target="RFC5646"
/> script subtag. For example, for a Cyrillic script LGR:</t>
<figure>
<artwork><![CDATA[
<language>und-Cyrl</language>
]]></artwork>
</figure>
<t>If the LGR covers a set of multiple languages or scripts, the
"language" element MAY be repeated. However, for cases of a script-specific
LGR exhibiting insignificant admixture of code points from other scripts, it
is RECOMMENDED to use a single "language" element identifying the
predominant script. In the exceptional case of a multi-script LGR where no
script is predominant, use Zyyy (Common):</t>
<figure>
<artwork><![CDATA[
<language>und-Zyyy</language>
]]></artwork>
</figure>
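<t>As a sketch of the repeated form, a hypothetical LGR intended for both English and
German could carry one "language" element per language. The choice of languages here
is purely illustrative:</t>
<figure>
<artwork><![CDATA[
<language>en</language>
<language>de</language>
]]></artwork>
</figure>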
</section>
<section title="The scope Element">
<t>This OPTIONAL element refers to a scope, such as a domain, to which this
policy is applied. The "type" attribute specifies the type of scope being
defined. A type of "domain" means that the scope is a domain that represents
the apex of the DNS zone to which the LGR is applied. For that type, the content
of the "scope" element MUST be a domain name written relative to the root
zone, in presentation format with no trailing dot. However, in the unique case of
the DNS root zone, it is represented as ".".
</t>
<t>
<figure>
<artwork><![CDATA[ <scope type="domain">example.com</scope>]]></artwork>
</figure>
</t>
<t>There may be multiple "scope" tags used, for example to reflect a list of domains
to which the LGR is applied.</t>
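<t>For instance, a hypothetical LGR applied to two zones could list both of them. The
domain names are placeholders chosen for illustration:</t>
<figure>
<artwork><![CDATA[
<scope type="domain">example.com</scope>
<scope type="domain">example.net</scope>
]]></artwork>
</figure>
<t>Following the rule given above for the DNS root zone, an LGR applied to the root
would instead be expected to carry a scope of this form:</t>
<figure>
<artwork><![CDATA[
<scope type="domain">.</scope>
]]></artwork>
</figure>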
<t>No other values of the "type" attribute are defined by
this specification; however, this specification can be used for
applications other than domain names. Implementers of LGRs for applications other
than domain names SHOULD define the scope extension grammar in an IETF Specification,
or use XML Namespaces to distinguish their scoping mechanism distinctly from the
base LGR namespace. An explanation of any custom usage of the scope in the
"description" element is RECOMMENDED.</t>
<t><figure>
<artwork><![CDATA[ <scope xmlns="http://example.com/ns/scope/1.0">
... content per alternate namespace ...
</scope>]]></artwork></figure>
</t>
</section>
<section title="The description Element">
<t>The "description" element is an OPTIONAL, free-form element that contains any
additional relevant description that is useful for the user in its
interpretation. Typically, this field contains authorship information, as
well as additional context on how the LGR was formulated and how it applies,
such as citations and references that apply to the LGR as a whole.</t>
<t>This field should not be relied upon for providing instructions on how to
parse or utilize the data contained elsewhere in the
specification. Authors of tables should expect that
software applications that parse and use LGRs will not use the
description field to condition the application of the
LGR's data and rules.</t>
<t>The element has an OPTIONAL "type" attribute, which refers to the internet
media type <xref target="RFC2045"/> of the enclosed data. Typical types would be "text/plain" or
"text/html". The attribute SHOULD be a valid media type. If supplied, it will
be assumed that the contents are of that media type. If the description
lacks a type field, it will be assumed to be plain text ("text/plain").</t>
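<t>A minimal sketch of a "description" element with an explicit media type follows. The
text inside the element is a hypothetical placeholder; omitting the "type" attribute
would leave the content to be treated as "text/plain" as well:</t>
<figure>
<artwork><![CDATA[
<description type="text/plain">
Hypothetical example LGR, prepared for illustration only.
</description>
]]></artwork>
</figure>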
</section>
<section title="The validity-start and validity-end Elements">
<t>The "validity-start" and "validity-end" elements are OPTIONAL
elements that describe the time from which the contents of the
LGR become valid (are used in registry policy), and the time at which the
contents of the LGR cease to be used, respectively.</t>
<t>The dates MUST conform to the "full-date" format described in section 5.6 of
<xref target="RFC3339"/>.</t>
<t>
<figure>
<artwork><![CDATA[ <validity-start>2014-03-12</validity-start>]]></artwork>
</figure>
</t>
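<t>As a sketch, an LGR with a bounded period of validity could carry both elements
together. The dates shown are hypothetical:</t>
<figure>
<artwork><![CDATA[
<validity-start>2014-03-12</validity-start>
<validity-end>2015-03-12</validity-end>
]]></artwork>
</figure>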
</section>
<section title="The unicode-version Element">
<t>Whenever an LGR depends on character properties from a given version of the
Unicode standard, the version number used in creating the LGR MUST be listed
in the form x.y.z, where x, y, and z are non-negative decimal integers (see
<xref target="Unicode-Versions"/>). If any software processing the table
does not have access to character property data of the requisite version, it
MUST NOT perform any operations relating to whole-label evaluation relying
on Unicode character properties (<xref target="property"/>).</t>
<t> The value of a given Unicode character property may change
between versions of the Unicode Character Database <xref target="UAX44"/>,
unless such change has been explicitly disallowed in <xref
target="Unicode-Stability"/>. It is RECOMMENDED to only reference properties
defined as stable or immutable. As an alternative to referencing the property,
the information can be presented explicitly in the LGR.</t>
<t>
<figure>
<artwork><![CDATA[ <unicode-version>6.2.0</unicode-version>
]]></artwork>
</figure>
</t>
<t>It is not necessary to include a "unicode-version" element for LGRs that do
not make use of Unicode character properties; however, it is RECOMMENDED.</t>
</section>
<section title="The references Element" anchor="references">
<t>A Label Generation Ruleset may define a list of references which are used to
associate various individual elements in the LGR to one or more normative
references. A common use for references is to annotate that code points belong
to an externally defined collection or standard, or to give normative references
for rules.</t>
<t>References are specified in an OPTIONAL "references" element, containing
one or more "reference" elements, each with a unique "id" attribute. It is
RECOMMENDED that the "id" attribute be a zero-based integer; however,
in addition to digits 0-9, it MAY contain uppercase letters A-Z, as well as
period, hyphen, colon or underscore. The value of
each "reference" element SHOULD be the citation of a standard, dictionary or
other specification in any suitable format. In addition to an "id"
attribute, a "reference" element MAY have a "comment" attribute for an
optional free-form annotation.</t>
<t>
<figure>
<artwork><![CDATA[ <references>
<reference id="0">The Unicode Consortium. The Unicode
Standard, Version 8.0.0, (Mountain View, CA: The Unicode
Consortium, 2015. ISBN 978-1-936213-10-8)
http://www.unicode.org/versions/Unicode8.0.0/</reference>
<reference id="1">Big-5: Computer Chinese Glyph and Character
Code Mapping Table, Technical Report C-26, 1984</reference>
<reference id="2" comment="synchronized with Unicode 6.1">
ISO/IEC
10646:2012 3rd edition</reference>
...
</references>
...
<data>
<char cp="0620" ref="0 2" />
...
</data>]]></artwork>
</figure>
</t>
<t> A reference is associated with an element by using its id as part of an optional "ref" attribute
(see <xref target="ref"/>). The "ref" attribute may be used with many
kinds of elements in the "data" or "rules" sections of the LGR, most notably
those defining code points, variants and rules. However, a "ref" attribute may not occur on certain
kinds of elements, including references to named character classes or rules.
See the descriptions of these elements below.</t>
</section>
</section>
</section>
<section title="Code Points and Variants">
<t>The bulk of a label generation ruleset is a description of which set of code points
are eligible for a given label. For rulesets that perform operations that result in
potential variants, the code point-level relationships between variants need to also
be described.</t>
<t>The code point data is collected within the "data" element. Within this element, a
series of "char" and "range" elements describe eligible code points, or ranges of
code points, respectively. Collectively, these are known as the repertoire.</t>
<t>Discrete permissible code points or code point sequences (see
<xref target="sequences" />) are declared with a "char"
element. Here is a minimal example declaration for a single code point,
with the code point value given in the "cp" attribute:</t>
<t>
<figure>
<artwork><![CDATA[ <char cp="002D"/>]]></artwork>
</figure>
</t>
<t>As described below, a full declaration for a "char" element, whether or
not it is used for a single code point, or for a sequence (see <xref target="sequences" />),
may have optional child elements defining variants. Both the "char" and "range" elements can
take a number of optional attributes for conditional inclusion, commenting, cross referencing and
character tagging, as described below.</t>
<t>Ranges of permissible code points may be declared with a "range" element, as in this minimal example:</t>
<t>
<figure>
<artwork><![CDATA[ <range first-cp="0030" last-cp="0039"/>]]></artwork>
</figure>
</t>
<t>The range is inclusive of the first and last code points. Any additional attributes defined
for a "range" element act as if applied to each code point within. A "range" element
has no child elements.</t>
<t>It is always possible to substitute a list of individually specified code points for
a range element. The reverse is not necessarily the case. Whenever such a
substitution is possible, it makes no difference in processing the data. Tools
reading or writing the LGR format are free to aggregate
sequences of consecutive code points of the same properties into range elements.</t>
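<t>To illustrate the substitution, the two hypothetical declarations below are
alternatives that describe the same repertoire. They could not both appear in the
same LGR, because each code point may be declared only once.</t>
<figure>
<preamble>Declared as a range:</preamble>
<artwork><![CDATA[
<range first-cp="0030" last-cp="0032"/>
]]></artwork>
</figure>
<figure>
<preamble>Declared as individual code points:</preamble>
<artwork><![CDATA[
<char cp="0030"/>
<char cp="0031"/>
<char cp="0032"/>
]]></artwork>
</figure>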
<t>Code points MUST be represented according to the standard
Unicode convention but without the prefix "U+": they are
expressed in uppercase hexadecimal, and are zero-padded
to a minimum of 4 digits. </t>
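<t>As a short illustration of this convention, the first declaration below pads
LATIN SMALL LETTER A (U+0061) to four digits, while the second gives a five-digit
supplementary code point (U+20000) without further padding. Both code points are
chosen only as examples. Under the rule above, spellings such as "61", "006c" or
"U+006C" would not be conforming values:</t>
<figure>
<artwork><![CDATA[
<char cp="0061"/>
<char cp="20000"/>
]]></artwork>
</figure>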
<t>The rationale for not allowing other encoding formats,
including native Unicode encoding in XML, is explored in <xref target="UAX42"/>. The
XML conventions used in this format, such as element and attribute names,
mirror that document where practical and reasonable to do so. It is RECOMMENDED to
list all "char" elements in ascending order of the "cp" attribute. Not doing so makes it
unnecessarily difficult for authors and reviewers to check for errors, such as duplications,
or to review and compare against listings of code points in other documents and specifications.</t>
<t>All "char" elements in the data section MUST have distinct "cp" attributes. The
"range" elements MUST NOT specify code point ranges that overlap either another range
or any single-code-point "char" element. An LGR that defines the same code point more than
once by any combination of "char" or "range" elements MUST be rejected.</t>
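<t>As a sketch of a violation, the following hypothetical "data" element would cause
the LGR to be rejected, because U+0035 is declared both by the "range" element and
by the "char" element:</t>
<figure>
<artwork><![CDATA[
<data>
<range first-cp="0030" last-cp="0039"/>
<char cp="0035"/>
</data>
]]></artwork>
</figure>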
<section title="Sequences" anchor="sequences">
<t>A sequence of two or more code points may be specified in an LGR, for example,
when defining the source for n:m variant mappings. Another use of sequences
would be in cases when the exact sequence of code points is required to occur in
order for the constituent elements to be eligible, such as when some code
point is only eligible when preceded or followed by a certain code point. The
following would define the eligibility of the MIDDLE DOT (U+00B7) only when both
preceded and followed by the LATIN SMALL LETTER L (U+006C):</t>
<t>
<figure>
<artwork><![CDATA[ <char cp="006C 00B7 006C" comment="Catalan middle dot"/>]]></artwork>
</figure>
</t>
<t>All sequences defined this way must be distinct, but sub-sequences may be defined.
Thus, the sequence defined here may coexist with single code point definitions
such as:</t>
<t><figure>
<artwork><![CDATA[ <char cp="006C" />]]></artwork>
</figure></t>
<t>As an alternative to using sequences to define a required context, a "char" or
"range" element may specify conditional context using an optional "when"
attribute as described below in <xref target="contexts" />. Using a conditional context
is more flexible because a context is not limited to a specific sequence of code points.
In addition, using a context allows the choice of specifying either a prohibited or a required context.</t>
</section>
<section title="Conditional Contexts" anchor="contexts">
<t>A conditional context is specified by a rule that must be satisfied (or alternatively,
must not be satisfied) for a code point in a given label, often at a particular location in a label.</t>
<t>To specify a conditional context either a "when" or "not-when" attribute
may be used. The value of each "when" or "not-when" attribute is a whole label or
parameterized context rule as described below in <xref target="whole_label"/>.
The context condition is met when the rule specified in the "when"
attribute is matched or when the rule in the "not-when" attribute fails to match.
It is an error to reference a rule that is not actually defined in the "rules" element.</t>
<t>A parameterized context rule (see <xref target="parameterized_context_rule"/>)
defines the context immediately surrounding a given code point; unlike a sequence, the context
is not limited to a specific fixed code point, but for example may designate any member
of a certain character class or a code point that has a certain Unicode character property.</t>
<t>Given a suitable definition of a parameterized context rule named "follows-virama", this
example specifies that a ZERO WIDTH JOINER (U+200D) is restricted
to immediately follow any of several code points classified as virama:</t>
<t>
<figure>
<artwork><![CDATA[ <char cp="200D" when="follows-virama" />]]></artwork>
</figure>
</t>
<t>
For a complete example, see <xref target="example_tables" />.
</t>
<t>In contrast, a whole label rule (see <xref target="whole_label"/>) specifies a condition
to be met by the entire label, for example that it must contain at least one code point
from a given script anywhere in the label. In the following example, no digit from either
range may occur in a label that mixes digits from both ranges:</t>
<t>
<figure>
<artwork><![CDATA[ <data>
<range first-cp="0660" last-cp="0669" not-when="mixed-digits"
tag="arabic-indic-digits" />
<range first-cp="06F0" last-cp="06F9" not-when="mixed-digits"
tag="extended-arabic-indic-digits" />
</data>]]></artwork>
</figure></t>
<t>
(See <xref target="IDNA2008_example"/> for an example of the "mixed-digits" rule.)</t>
<t>The OPTIONAL "when" or "not-when" attributes are mutually exclusive. They MAY be
applied to both "char" and "range" elements in the "data" element, including "char" elements
defining sequences of code points, as well as to "var" elements (see <xref target="conditional_variants"/>).</t>
<t>If a label contains one or more code points that fail to satisfy a conditional context, the label is invalid (see <xref
target="implied_actions"/>). For variants, the conditional context restricts the definition of the variant
to the case where the condition is met. Outside the specified context, a variant is not defined.</t>
</section>
<section title="Variants" anchor="variants">
<t>Most LGRs typically only determine simple code point eligibility, and for them,
the elements described so far would be the only ones required for their "data"
section. Others additionally specify a mapping of code points to other code
points, known as "variants". What constitutes a variant code point is a matter
of policy, and varies for each implementation. The following examples are
intended to demonstrate the syntax; they are not necessarily typical.</t>
<section title="Basic Variants" anchor="basic_variants">
<t>Variant code points are specified using one or more "var" elements as
children of a "char" element. The target mapping is specified using the "cp"
attribute. Other, optional attributes for the "var" element are described
below.</t>
<t>For example, to map LATIN SMALL LETTER V (U+0076) as a variant of LATIN SMALL
LETTER U (U+0075):</t>
<t>
<figure>
<artwork><![CDATA[ <char cp="0075">
<var cp="0076"/>
</char>]]></artwork>
</figure>
</t>
<t>A sequence of multiple code points can be specified as a variant of a single
code point. For example, the sequence of LATIN SMALL LETTER O (U+006F) then
LATIN SMALL LETTER E (U+0065) might hypothetically be specified as a variant
for LATIN SMALL LETTER O WITH DIAERESIS (U+00F6) as follows:</t>
<t>
<figure>
<artwork><![CDATA[ <char cp="00F6">
<var cp="006F 0065"/>
</char>]]></artwork>
</figure>
</t>
<t>The source and target of a variant mapping may both be sequences, but not
ranges.</t>
<t>If the source of one mapping is a prefix sequence of the source for another,
both variant mappings will be considered at the same location in the input label
when generating permuted variant labels. If poorly designed, an LGR containing
such an instance of a prefix relation could generate multiple instances of the same
variant label for the same original label, but with potentially different dispositions.
Any duplicate variant labels encountered MUST be treated as an error (see
<xref target="duplicate_variants" />).</t>
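<t>As an illustrative sketch only (the mappings below are hypothetical and not a
recommendation), the following fragment contains such a prefix relation: the source
U+006F is a prefix of the source sequence U+006F U+0065. For an input label
"U+006F U+0065", both the mapping for the single code point and the mapping for the
sequence would be considered at the first position of the label:</t>
<t>
<figure>
<artwork><![CDATA[    <char cp="0065"/>
    <char cp="006F">
      <var cp="00F6"/>
    </char>
    <char cp="006F 0065">
      <var cp="00F6"/>
    </char>]]></artwork>
</figure>
</t>
<t>Under these assumptions, the variant labels would be "U+00F6 U+0065" (via the single
code point mapping) and "U+00F6" (via the sequence mapping). They happen to be distinct
here; a designer adding further mappings would need to check that no two such paths
produce the same variant label.</t>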
<t>The "var" element specifies variant mappings in only one direction, even
though the variant relation is usually considered symmetric, that is, if A
is a variant of B then B should also be a variant of A. The format requires
that the inverse of the variant be given explicitly to fully specify
symmetric variant relations in the LGR. This has the beneficial side effect
of making the symmetry explicit:</t>
<t>
<figure>
<artwork><![CDATA[ <char cp="006F 0065">
<var cp="00F6"/>
</char>]]></artwork>
</figure>
</t>
<t>Variant relations are normally not only symmetric, but also transitive.
If A is a variant of B and B is a variant of C, then A is also a variant of C.
As with symmetry, these transitive relations are only part of the LGR if
spelled out explicitly. Implementations that require an LGR to be symmetric
and transitive should verify this mechanically.</t>
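<t>As a minimal sketch, a set of three mutually variant code points that is both symmetric
and transitive requires all six mappings to be listed. The code points U+0061, U+0062 and
U+0063 are used here purely as placeholders for A, B and C; they are not meant to suggest
an actual variant relation:</t>
<t>
<figure>
<artwork><![CDATA[    <char cp="0061">
      <var cp="0062"/>
      <var cp="0063"/>
    </char>
    <char cp="0062">
      <var cp="0061"/>
      <var cp="0063"/>
    </char>
    <char cp="0063">
      <var cp="0061"/>
      <var cp="0062"/>
    </char>]]></artwork>
</figure>
</t>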
<t>All variant mappings are unique. For a given "char" element all "var" elements
MUST have a unique combination of "cp", "when" and "not-when" attributes.
It is RECOMMENDED to list the "var" elements in ascending order of their
target code point sequence. (For "when" and "not-when" attributes, see
<xref target="conditional_variants" />).</t>
</section>
<section title="The type attribute" anchor="var_type">
<t>Variants may be tagged with an OPTIONAL "type" attribute. The value of the
"type" attribute may be any non-empty value not starting with an underscore
and not containing spaces. This value is used to resolve the disposition of
any variant labels created using a given variant. (See
<xref target="variants_actions" />.)</t>
<t>By default, the values of the "type" attribute directly describe the target
policy status (disposition) for a variant label that was generated using
a particular variant, with any variant label being assigned a disposition
corresponding to the most restrictive variant type. Several conventional
disposition values are predefined below in <xref target="actions"/>. Whenever
these values can represent the desired policy, they SHOULD be used.
</t>
<t><figure>
<artwork><![CDATA[ <char cp="767C">
<var cp="53D1" type="allocatable"/>
<var cp="5F42" type="blocked"/>
<var cp="9AEA" type="blocked"/>
<var cp="9AEE" type="blocked"/>
</char>]]></artwork>
</figure></t>
<t>By default, if a variant label contains any instance of one of the variants of
type "blocked" the label would be blocked, but if it contained only instances
of variants to be allocated it could be allocated. See the discussion about
implied actions in <xref target="default_actions"/>.
</t>
<t>The XML format for the LGR makes the relation between the values of the "type"
attribute on variants and the resulting disposition of variant labels fully
explicit. See the discussion in <xref target="variants_actions" />. Making
this relation explicit allows a generalization of the "type" attribute from
directly reflecting dispositions to a more differentiated intermediate value
that is then used in the resolution of label disposition. Instead of the default
action of applying the most restrictive disposition to the entire label, such
a generalized resolution can be used to achieve additional goals, such as
limiting the set of allocatable variant labels, or to implement other policies
found in existing LGRs (see for example <xref target="translate_rfc3743" />).
</t>
<t>Because variant mappings MUST be unique, it is not possible to define the same
variant for the same "char" element with different type attributes (see however
<xref target="conditional_variants" />).</t>
</section>
<section title="Null Variants">
<t>A null variant is a variant string that maps to no code point. This is used when
a particular code point sequence is considered discretionary in the context of
a whole label. To specify a null variant, use an empty cp attribute. For example,
to map a string with a ZERO WIDTH NON-JOINER (U+200C) to the same string
without the ZERO WIDTH NON-JOINER:</t>
<t>
<figure>
<artwork><![CDATA[ <char cp="200C">
<var cp=""/>
</char>]]></artwork>
</figure>
</t>
<t>This is useful in expressing the intent that some code points in a label are
to be mapped away when generating a canonical variant of the label. However,
in tables that are designed to have symmetric variant mappings, this could
lead to combinatorial explosion, if not handled carefully.</t>
<t>The symmetric form of a null variant is expressed as follows:</t>
<t>
<figure>
<artwork><![CDATA[ <char cp="">
<var cp="200C" type="invalid" />
</char>]]></artwork>
</figure>
</t>
<t>A "char" element with an empty "cp" attribute MUST specify at least one variant
mapping. It is strongly RECOMMENDED to use a type of "invalid" or equivalent
when defining variant mappings from null sequences, so that variant mappings from
null sequences are removed in variant label generation (see <xref
target="var_type" />).</t>
</section>
<section title="Variants with Reflexive Mapping" anchor="reflexive_mapping">
<t>At first sight there seems to be no call for adding variant mappings for which
source and target code points are the same, that is for which the mapping is
reflexive, or, in other words, an identity mapping. Yet such reflexive mappings
occur frequently in LGRs that follow <xref target="RFC3743"/>.</t>
<t>Adding a "var" element allows both a type and a reference id to be specified for
it. While the reference id is not used in processing, the type of the variant
can be used to trigger actions. In permuting the label to generate all possible
variants, the type associated with a reflexive variant mapping is applied to any
of the permuted labels containing the original code point.</t>
<t>In the following example, let's assume the goal is to allocate only those labels
that contain a variant that is considered "preferred" in some way.
As defined in the example, the code point U+3473 exists both as a variant of U+3447
and as a variant of itself (reflexive mapping). Assuming an original label of
"U+3473 U+3447", the permuted variant "U+3473 U+3473" would consist of the reflexive
variant of U+3473 followed by a variant of U+3447. Given the variant mappings as
defined here, the types for both of the variant mappings used to generate that particular
permutation would have the value "preferred":</t>
<t><figure>
<artwork><![CDATA[ <char cp="3447" ref="0">
<var cp="3473" type="preferred" ref="1 3" />
</char>
<char cp="3473" ref="0">
<var cp="3447" type="blocked" ref="1 3" />
<var cp="3473" type="preferred" ref="0" />
</char>]]></artwork>
</figure></t>
<t>Having established the variant types in this way, a set of actions could be defined
that return a disposition of "allocatable" or "activated" for a label consisting
exclusively of variants with type "preferred" for example. (For details on how to
define actions based on variant types see <xref target="variant_triggers"/>.)</t>
<t>In general, using reflexive variant mappings in this manner makes it possible to
calculate disposition values using a uniform approach for all labels, whether they
consist of mapped variant code points, original code points, or a mixture of both.
In particular, the dispositions for two otherwise identical labels may differ based
on which variant mappings were executed in order to generate each of them. (For
details on how to generate variants and evaluate dispositions, see
<xref target="processing" />.)</t>
<t>Another useful convention that uses reflexive variants is described below in
<xref target="variant_triggers" />.</t>
</section>
<section title="Conditional Variants" anchor="conditional_variants">
<t>Fundamentally, variants are mappings between two sequences of code points.
However, in some instances for a variant relationship to exist, some context
external to the code point sequence must also be considered. For example, a
positional context may determine whether two code point sequences are
variants of each other.</t>
<t>One example is Arabic code points, which can have different forms
based on position, with some code points sharing forms, thus making them
variants in the positions corresponding to those forms. Such positional
context cannot be solely derived from the code point by itself, as the code
point would be the same for the various forms.</t>
<t>As described in <xref target="contexts"/> an OPTIONAL "when" or "not-when" attribute
may be given for any "var" element to specify
required or prohibited contextual conditions under which the variant is defined.</t>
<t>Assuming the "rules" element contains suitably defined rules for
"arabic-isolated" and "arabic-final", the following example shows how to
mark ARABIC LETTER ALEF WITH WAVY HAMZA BELOW (U+0673) as a variant of
ARABIC LETTER ALEF WITH HAMZA BELOW (U+0625), but only when it appears in
its isolated or final forms:</t>
<t>
<figure>
<artwork><![CDATA[ <char cp="0625">
<var cp="0673" when="arabic-isolated"/>
<var cp="0673" when="arabic-final"/>
</char>
]]></artwork>
</figure>
</t>
<t> While a "var" element MUST NOT contain multiple conditions (it
is only allowed a single "when" or "not-when" attribute), multiple
"var" elements using the same mapping MAY be specified with
different "when" or "not-when" attributes. The combination of
mapping and conditional context defines a unique variant.</t>
<t>Care must be taken to ensure that
for each variant label at most one of the contextual conditions is met for variants
with the same mapping; otherwise duplicate variant labels would be created for the
same input label. Any such duplicate variant labels MUST be treated as an error, see
<xref target="duplicate_variants" />.
</t>
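<t>As a hypothetical counter-example, the following definition would be problematic,
assuming the rules "arabic-final" and "arabic-isolated" are defined so that a code point
in final position is never also in isolated position. In final position, the "when"
condition of the first variant is met and the "not-when" condition of the second is met
as well, so the same mapping would apply twice and yield duplicate variant labels:</t>
<t>
<figure>
<artwork><![CDATA[    <!-- problematic: both conditions hold in final position -->
    <char cp="0625">
      <var cp="0673" when="arabic-final"/>
      <var cp="0673" not-when="arabic-isolated"/>
    </char>
]]></artwork>
</figure>
</t>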
<t>Two contexts may be complementary, as in the following example, which shows
ARABIC LETTER TEH MARBUTA (U+0629) as a variant of ARABIC LETTER HEH
(U+0647), but with two different types.</t>
<t>
<figure>
<artwork><![CDATA[ <char cp="0647" >
<var cp="0629" not-when="arabic-final" type="blocked" />
<var cp="0629" when="arabic-final" type="allocatable" />
</char>
]]></artwork>
</figure>
</t>
<t>The intent is that in final position a label that uses U+0629 instead of U+0647
should be considered essentially the same label and therefore allocatable to the
same entity, while the same substitution in non-final context leads to labels that
are different, but considered confusable so that either one, but not both, should
be delegatable.</t>
<t>For symmetry, the reverse mappings must exist, and must agree in their
"when" or "not-when" attributes. However, symmetry does not apply to the other
attributes. For example, these are potential reverse mappings for the above:</t>
<t>
<figure>
<artwork><![CDATA[ <char cp="0629" >
<var cp="0647" not-when="arabic-final" type="allocatable" />
<var cp="0647" when="arabic-final" type="allocatable" />
</char>
]]></artwork>
</figure>
</t>
<t>Here, both variants have the same "type" attribute. While it is tempting to
recognize that in this instance the "when" and "not-when" attributes are complementary
and therefore between them cover every single possible context, it is strongly RECOMMENDED
to use the format shown in the example that makes the symmetry easily verifiable
by parsers and tools. (The same applies to entries created for transitivity.)</t>
<t>Arabic is an example of a script for which such conditional variants
have been implemented based on the joining contexts for Arabic code points.
The mechanism defined here supports other forms of conditional variants that may
be required by other scripts.</t>
</section>
</section>
<section title="Annotations">
<t>Two attributes, the "ref" and "comment" attributes, can be used to annotate
individual elements in the LGR. They are ignored in machine-processing of
the LGR. The "ref" attribute is intended for formal annotations and the
"comment" attribute for free form annotations. The latter can be applied
more widely.</t>
<section title="The ref Attribute" anchor="ref">
<t>Reference information MAY optionally be specified by a "ref" attribute, consisting of a space
delimited sequence of reference identifiers (see <xref target="references" />).
</t>
<t><figure>
<artwork><![CDATA[ <char cp="5220" ref="0">
<var cp="5220" ref="5"/>
<var cp="522A" ref="2 3"/>
</char>]]></artwork>
</figure></t>
<t>This facility is typically used to give source information for code points
or variant relations. This information is ignored when machine-processing an
LGR. If applied to a range the "ref" attribute applies to every code point
in the range. All reference identifiers MUST be from the set
declared in the "references" element (see <xref target="references"/>). It
is an error to repeat a reference identifier in the same "ref" attribute.
It is RECOMMENDED that identifiers be listed in ascending order.</t>
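<t>The following sketch shows how these constraints might fit together. It assumes that
the "references" element declares identifiers through "reference" child elements carrying
an "id" attribute, as described in the referenced section; the source descriptions are
placeholders. The "ref" attribute on the range applies to every code point from U+0061 to
U+007A, uses only declared identifiers, repeats none, and lists them in ascending order:</t>
<t>
<figure>
<artwork><![CDATA[    <references>
      <reference id="0">Placeholder description of source A</reference>
      <reference id="1">Placeholder description of source B</reference>
    </references>

    <data>
      <range first-cp="0061" last-cp="007A" ref="0 1" />
    </data>]]></artwork>
</figure>
</t>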
<t>In addition to "char", "range" and "var" elements in the data section, a "ref"
attribute may be present for a number of element types contained in the "rules"
element as described below: actions, literals ("char" inside a rule), as well as
for definitions of rules and classes, but not for references to named character
classes or rules using
the "by-ref" attribute defined below. (The use of the
"by-ref" and "ref" attributes is mutually exclusive.) None of the elements
in the metadata take a "ref" attribute; to provide additional information
use the "description" element instead.</t>
</section>
<section title="The comment Attribute" anchor="comment">
<t>Any "char", "range" or "var" element in the data section may contain an
OPTIONAL "comment" attribute. The contents of a "comment" attribute are
free-form plain text. Comments are ignored in machine processing of the
table. Comment attributes MAY also be placed on all elements in the "rules"
section of the document, including actions and match operators such as
literals ("char"), as well as definitions of classes and rules, but not on
child elements of the "class" element. Finally, in the metadata, only the
"version" and "reference" elements MAY have "comment" attributes (to match the
syntax in <xref target="RFC3743"/>).</t>
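<t>For illustration, a data section might carry comments as follows; the comment texts are
invented for this sketch and have no effect on processing:</t>
<t>
<figure>
<artwork><![CDATA[    <char cp="0075" comment="example note on a code point">
      <var cp="0076" comment="example note on a variant mapping"/>
    </char>
    <range first-cp="0030" last-cp="0039" comment="ASCII digits"/>]]></artwork>
</figure>
</t>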
</section>
</section>
<section title="Code Point Tagging" anchor="tagging">
<t>Typically, LGRs are used to explicitly designate allowable code points, where any
label that contains a code point not explicitly listed in the LGR is considered
an ineligible label according to the ruleset.
</t>
<t>For more complex registry rules, there may be a need to discern one or more
subsets of code points. This can be accomplished by applying an OPTIONAL "tag"
attribute to "char" or "range" elements that are child elements of the "data"
element. By collecting code points that share the same tag value, character classes
may be defined (see <xref target="tag_based_classes" />) which can then be used
in whole label evaluation rules (see <xref target="match_operators" />).
</t>
<t>Each "tag" attribute MAY contain multiple values separated by white space. A tag
value is an identifier, which may also include certain punctuation marks, such
as colon. Formally, it MUST correspond to the XML 1.0 Nmtoken (Name token)
production (see <xref target="XML"/> Section 2.3). It is an error to duplicate a
value within the same "tag" attribute. A "tag" attribute for a "range" element
applies to all code points in the range. Because code point sequences are not
proper members of a set of code points, a "tag" attribute MUST NOT be present
in a "char" element defining a code point sequence.
</t>
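<t>
A minimal sketch of these rules follows; the tag values are hypothetical. The range
carries two tag values separated by white space, one of which contains a colon, and the
code point sequence carries no "tag" attribute:
</t>
<t>
<figure>
<artwork><![CDATA[    <data>
      <char cp="002D" tag="punctuation" />
      <range first-cp="0030" last-cp="0039" tag="digit ascii:digit" />
      <!-- a sequence must not carry a tag attribute -->
      <char cp="006F 0065" />
    </data>]]></artwork>
</figure>
</t>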
</section>
</section>
<section title="Whole Label and Context Evaluation">
<section title="Basic Concepts">
<t>The "rules" element contains the specification of both context-based and
Whole Label Evaluation (WLE) rules (<xref target="whole_label" />), the character
classes (<xref target="character_classes" />) that they depend on
and any actions (<xref target="actions"/>) that assign dispositions to labels
based on rules or variant mappings.</t>
<t>A Whole Label Evaluation rule (WLE) is applied to the whole label. It is used to
validate both original labels and any variant labels computed from them. </t>
<t>A conditional context rule does not necessarily
apply to the whole label, but may be specific to the context around a single code
point or code point sequence. Certain code points in a label sometimes need to
satisfy context-based rules, for example for the label to be considered valid, or
to satisfy the context for a variant mapping (see the description of the "when"
attribute in <xref target="parameterized_context_rule"/>). </t>