Allow default inline lexicon, --ssml reads entire input

rhasspy · Nov 10, 2021 · cd4f7e3
1 parent c742711
Showing 7 changed files with 273 additions and 55 deletions.
diff --git a/CHANGELOG b/CHANGELOG
@@ -7,6 +7,7 @@
 ### Changed
 
 - Moved English data files to separate Python package so core can be updated without large download
+- With --ssml, input from stdin is assumed to be one document instead of lines (override with --stdin-format lines)
 
 ### Fixed
 

diff --git a/README.md b/README.md
@@ -321,11 +321,105 @@ A subset of [SSML](https://www.w3.org/TR/speech-synthesis11/) is supported:
 * `<phoneme ph="...">` - supply phonemes for inner text
     * `ph` - phonemes for each word of inner text, separated by whitespace
     * `alphabet` - if "ipa", phonemes are intelligently split ("aːˈb" -> "aː", "ˈb")
+* `<lexicon id="...">` - inline pronunciation lexicon
+    * `id` - unique id of lexicon (used in `<lookup ref="...">`)
+    * One or more `<lexeme>` child elements with:
+        * `<grapheme role="...">WORD</grapheme>` - word text (optional [role](#word-roles))
+        * `<phoneme>P H O N E M E S</phoneme>` - word pronunciation (phonemes separated by whitespace)
+* `<lookup ref="...">` - use inline pronunciation lexicon for child elements
+    * `ref` - id from a `<lexicon id="...">`
 
 #### Word Roles
 
 During phonemization, word roles are used to disambiguate pronunciations. Unless manually specified, a word's role is derived from its part of speech tag as `gruut:<TAG>`. For initialisms and `spell-out`, the role `gruut:letter` is used to indicate that e.g., "a" should be spoken as `/eɪ/` instead of `/ə/`.
 
+For `en-us`, the following additional roles are available from the part-of-speech tagger:
+
+* `gruut:CD` - number
+* `gruut:DT` - determiner
+* `gruut:IN` - preposition or subordinating conjunction 
+* `gruut:JJ` - adjective
+* `gruut:NN` - noun
+* `gruut:PRP` - personal pronoun
+* `gruut:RB` - adverb
+* `gruut:VB` - verb
+* `gruut:VBD` - verb (past tense)
+
+#### Inline Lexicons
+
+Inline [pronunciation lexicons](https://www.w3.org/TR/2008/REC-pronunciation-lexicon-20081014/) are supported via the `<lexicon>` and `<lookup>` tags. gruut diverges slightly from the [SSML standard](https://www.w3.org/TR/speech-synthesis11/) here by only allowing lexicons to be defined within the SSML document itself. Additionally, the `id` attribute of the `<lexicon>` element can be left off to indicate a "default" inline lexicon that does not require a corresponding `<lookup>` tag.
+
+For example, the following document will yield three different pronunciations for the word "tomato":
+
+``` xml
+<?xml version="1.0"?>
+<speak version="1.1"
+       xmlns="http://www.w3.org/2001/10/synthesis"
+       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
+                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
+       xml:lang="en-US">
+
+  <lexicon xml:id="test" alphabet="ipa">
+    <lexeme>
+      <grapheme>
+        tomato
+      </grapheme>
+      <phoneme>
+        <!-- Individual phonemes are separated by whitespace -->
+        t ə m ˈɑ t oʊ
+      </phoneme>
+    </lexeme>
+    <lexeme>
+      <grapheme role="fake-role">
+        tomato
+      </grapheme>
+      <phoneme>
+        <!-- Made up pronunciation for fake word role -->
+        t ə m ˈi t oʊ
+      </phoneme>
+    </lexeme>
+  </lexicon>
+
+  <w>tomato</w>
+  <lookup ref="test">
+    <w>tomato</w>
+    <w role="fake-role">tomato</w>
+  </lookup>
+</speak>
+```
+
+The first "tomato" will be looked up in the U.S. English lexicon (`/t ə m ˈeɪ t oʊ/`). Within the `<lookup>` tag's scope, the second and third "tomato" words will be looked up in the inline lexicon. The third "tomato" word has a [role](#word-roles) attached (selecting a made-up pronunciation in this case).
+
+Diverging further from the SSML standard, gruut allows you to omit the `<lexicon>` `id` entirely. With no `id`, a `<lookup>` tag is no longer needed, allowing you to override the pronunciation of any word in the document:
+
+``` xml
+<?xml version="1.0"?>
+<speak version="1.1"
+       xmlns="http://www.w3.org/2001/10/synthesis"
+       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
+                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
+       xml:lang="en-US">
+
+  <!-- No id means change all words without a lookup -->
+  <lexicon>
+    <lexeme>
+      <grapheme>
+        tomato
+      </grapheme>
+      <phoneme>
+        t ə m ˈɑ t oʊ
+      </phoneme>
+    </lexeme>
+  </lexicon>
+
+  <w>tomato</w>
+</speak>
+```
+
+This will yield a pronunciation of `/t ə m ˈɑ t oʊ/` for all instances of "tomato" in the document (unless they are inside a `<lookup>`).
+
 ## Intended Audience
 
 gruut is useful for transforming raw text into phonetic pronunciations, similar to [phonemizer](https://github.com/bootphon/phonemizer). Unlike phonemizer, gruut looks up words in a pre-built lexicon (pronunciation dictionary) or guesses word pronunciations with a pre-trained grapheme-to-phoneme model. Phonemes for each language come from a [carefully chosen inventory](https://en.wikipedia.org/wiki/Template:Language_phonologies).

diff --git a/gruut/__main__.py b/gruut/__main__.py
@@ -6,6 +6,7 @@
 import logging
 import os
 import sys
+from enum import Enum
 from pathlib import Path
 
 import jsonlines
@@ -21,6 +22,20 @@
 # Path to gruut base directory
 _DIR = Path(__file__).parent
 
+
+class StdinFormat(str, Enum):
+    """Format of standard input"""
+
+    AUTO = "auto"
+    """Choose based on SSML state"""
+
+    LINES = "lines"
+    """Each line is a separate sentence/document"""
+
+    DOCUMENT = "document"
+    """Entire input is one document"""
+
+
 # -----------------------------------------------------------------------------
 
 
@@ -62,7 +77,18 @@ def main():
         lines = args.text
     else:
         # Use stdin
-        lines = sys.stdin
+        stdin_format = StdinFormat.LINES
+
+        if (args.stdin_format == StdinFormat.AUTO) and args.ssml:
+            # Assume SSML input is entire document
+            stdin_format = StdinFormat.DOCUMENT
+
+        if stdin_format == StdinFormat.DOCUMENT:
+            # One big line
+            lines = [sys.stdin.read()]
+        else:
+            # Multiple lines
+            lines = sys.stdin
 
         if os.isatty(sys.stdin.fileno()):
             print("Reading input from stdin...", file=sys.stderr)
@@ -175,6 +201,12 @@ def get_args() -> argparse.Namespace:
     parser.add_argument(
         "--ssml", action="store_true", help="Input text is SSML",
     )
+    parser.add_argument(
+        "--stdin-format",
+        choices=[str(v.value) for v in StdinFormat],
+        default=StdinFormat.AUTO,
+        help="Format of stdin text (default: auto)",
+    )
 
     # Disable features
     parser.add_argument(
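
Review note on the `main()` hunk above: as written, `stdin_format` starts as `LINES` and only becomes `DOCUMENT` when `--stdin-format` is `auto` and `--ssml` is set, so an explicit `--stdin-format document` appears to fall through to line mode. The sketch below shows one possible way to resolve the effective format so that an explicit choice always wins. The helper name `resolve_stdin_format` is hypothetical and not part of this commit; the enum is repeated only to keep the example self-contained.

```python
from enum import Enum


class StdinFormat(str, Enum):
    """Format of standard input (same values as the enum added in this commit)."""

    AUTO = "auto"
    LINES = "lines"
    DOCUMENT = "document"


def resolve_stdin_format(requested: str, ssml: bool) -> StdinFormat:
    """Hypothetical helper: pick the effective stdin format.

    An explicit "lines" or "document" is returned unchanged.
    "auto" becomes DOCUMENT when SSML input is expected, LINES otherwise.
    """
    requested_format = StdinFormat(requested)

    if requested_format is StdinFormat.AUTO:
        return StdinFormat.DOCUMENT if ssml else StdinFormat.LINES

    return requested_format
```

Under this assumption, `main()` could call `resolve_stdin_format(args.stdin_format, args.ssml)` and branch on the result exactly as the `if stdin_format == StdinFormat.DOCUMENT` block already does.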

diff --git a/gruut/const.py b/gruut/const.py
@@ -152,6 +152,9 @@ class InterpretAs(str, Enum):
     TIME = "time"
     """Word should be interpreted as a time on the clock"""
 
+    WORD = "word"
+    """Interpret as regular word"""
+
 
 class InterpretAsFormat(str, Enum):
     """Supported options for format attribute of <say-as>"""