Deal with wrong normalizations #124

Open
pfefferniels opened this issue Feb 26, 2022 · 4 comments

@pfefferniels
Owner

Text normalization using DTA's CAB service generally works fine, but it introduces some mistakes, e.g. "Schleiffer" gets normalized to "Schluffer", among others. We should probably collect these and correct them again after normalization.
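
A minimal sketch of what such a post-normalization correction step could look like. The names `KNOWN_BAD` and `post_correct` are hypothetical, and the input is assumed to be a list of (original, normalized) token pairs. Because the intended modern form is not always obvious, the sketch falls back to the original spelling instead of guessing one.

```python
# Hypothetical table of known-bad normalizations, keyed by
# (original spelling, normalization returned by CAB).
KNOWN_BAD = {
    ("Schleiffer", "Schluffer"),
}


def post_correct(pairs):
    """Return the final tokens for a list of (original, normalized) pairs.

    A pair listed in KNOWN_BAD falls back to the original spelling;
    every other pair keeps the normalized form.
    """
    return [orig if (orig, norm) in KNOWN_BAD else norm for orig, norm in pairs]


# ("seyn", "sein") is an illustrative pair, not taken from this thread.
tokens = [("Schleiffer", "Schluffer"), ("seyn", "sein")]
print(post_correct(tokens))  # ['Schleiffer', 'sein']
```

Keying on the pair rather than on the normalized form alone means a normalization is only reverted where it is known to be wrong for that particular original.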

@pfefferniels
Owner Author

pfefferniels commented Feb 27, 2022

Further mistakes:

  • PS 20: Palmulen => Palmölen (it seems to be possible to restrict the rewrite model to the 18th century, which may make this choice less likely?)
  • PS 20: Lambertischen => Lombertischen
  • PS 20: the quote by St. Lambert is correctly recognized as French, but still wrongly normalized (for the word "on" it results in <w id="w2a2" t="on" b="4849 2"><xlit isLatinExt="1" isLatin1="1" latin1Text="on"/><moot tag="FM.fr" word="ohne" lemma="ohne"/>on</w>)
  • PS 1: Zieffern => Ziefern
  • passim: the rendering of "Clavier" as "Klavier" is somewhat problematic.
  • PS 24: Auszierungen => Aussierungen
  • PS 19: an Statt => an Stadt
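
For the French quote, one possible workaround is to keep the original token whenever CAB itself tags it as foreign material. This is only a sketch: it assumes, based on the single `<w>` element quoted above, that `@t` holds the original token, that `moot/@word` holds the normalized form, and that tags starting with "FM" mark foreign material. The function name `choose_form` is hypothetical.

```python
import xml.etree.ElementTree as ET


def choose_form(w):
    """Pick the surface form for a single <w> element.

    Assumes (from the example in this thread) that @t is the original
    token and moot/@word is the normalized form.
    """
    original = w.get("t")
    moot = w.find("moot")
    if moot is None:
        return original
    # Assumption: tags starting with "FM" mark foreign material,
    # so the normalization is ignored for these tokens.
    if moot.get("tag", "").startswith("FM"):
        return original
    return moot.get("word", original)


sample = (
    '<w id="w2a2" t="on" b="4849 2">'
    '<xlit isLatinExt="1" isLatin1="1" latin1Text="on"/>'
    '<moot tag="FM.fr" word="ohne" lemma="ohne"/>on</w>'
)
print(choose_form(ET.fromstring(sample)))  # on
```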

@pfefferniels
Owner Author

Not sure if this is relevant, but CAB seems to make use of an "exception lexicon".

@rettinghaus
Contributor

Could we report those errors to help improve the language model?

@pfefferniels
Owner Author

Yes, absolutely. I'd suggest collecting as many as possible and reporting them in one go. Not sure where exactly to report them, though … (Bryan Jurish perhaps?)
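
To report everything in one go, the collected cases could be kept in a simple tab-separated list. The sketch below uses the word-level cases listed earlier in this thread. The column names and the `to_tsv` helper are hypothetical, and nothing is known here about the format the CAB maintainers would actually prefer.

```python
import csv
import io

# (location, original, CAB output), as listed in this thread.
REPORTED = [
    ("PS 20", "Palmulen", "Palmölen"),
    ("PS 20", "Lambertischen", "Lombertischen"),
    ("PS 1", "Zieffern", "Ziefern"),
    ("PS 24", "Auszierungen", "Aussierungen"),
    ("PS 19", "an Statt", "an Stadt"),
]


def to_tsv(rows):
    """Serialize the collected cases as tab-separated text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(["location", "original", "cab_output"])
    writer.writerows(rows)
    return buf.getvalue()


print(to_tsv(REPORTED), end="")
```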

@pfefferniels changed the title from "Deal with wrong normilzations" to "Deal with wrong normalizations" on Feb 27, 2022

2 participants