Rasa X Decoding error with German umlauts #4151

Closed
kristiankolthoff opened this issue Aug 1, 2019 · 10 comments
Labels
type:bug 🐛 Inconsistencies or issues which will cause an issue or problem for users or implementors.

Comments

@kristiankolthoff

kristiankolthoff commented Aug 1, 2019

Rasa version: 1.1.7

Rasa X version (if used & relevant): 0.19.5

Python version: 3.7.0

Operating system (windows, osx, ...): windows

Issue:

I am getting a decoding error when I start rasa x with German umlauts in the domain.yml. If I remove the special characters, I can start rasa x without problems. The same issue has already been reported on the rasa-x-demo repository: RasaHQ/rasa-x-demo#16

After testing, this error also occurs when running rasa train.

Error (including full traceback):

(base) C:\Users\Documents\workspace_python\FuBo\bot>rasa x
Starting Rasa X in local mode... 🚀
Traceback (most recent call last):
  File "c:\users\appdata\local\continuum\anaconda3\lib\site-packages\rasa\cli\x.py", line 322, in run_locally
    local.main(args, project_path, args.data, token=rasa_x_token)
  File "c:\users\appdata\local\continuum\anaconda3\lib\site-packages\rasax\community\local.py", line 190, in main
    project_path, data_path, session, args.port
  File "c:\users\appdata\local\continuum\anaconda3\lib\site-packages\rasax\community\local.py", line 139, in _initialize_with_local_data
    domain_path, domain_service, COMMUNITY_PROJECT_NAME, COMMUNITY_USERNAME
  File "c:\users\appdata\local\continuum\anaconda3\lib\site-packages\rasax\community\initialise.py", line 136, in inject_domain
    domain_yaml=read_file(domain_path),
  File "c:\users\appdata\local\continuum\anaconda3\lib\site-packages\rasa\utils\io.py", line 130, in read_file
    return f.read()
  File "c:\users\appdata\local\continuum\anaconda3\lib\codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 127: invalid start byte

Command or request that led to error:

rasa x

Content of configuration file (config.yml) (if relevant):

# Configuration for Rasa NLU.
# https://rasa.com/docs/rasa/nlu/components/
language: de
pipeline: pretrained_embeddings_spacy

# Configuration for Rasa Core.
# https://rasa.com/docs/rasa/core/policies/
policies:
  - name: MemoizationPolicy
  - name: KerasPolicy
  - name: MappingPolicy

Content of domain file (domain.yml) (if relevant):

intents:
- affirm
- deny
- goodbye
- greet
templates:
  utter_greet:
  - text: Hallo ich bin dein persönlicher Assistent. Wie kann ich Dir helfen?
  utter_did_that_help:
  - text: Konnte ich Dir damit weiterhelfen?
  utter_goodbye:
  - text: Ich wünsche Dir noch einen schönen Tag!
actions:
- utter_did_that_help
- utter_goodbye
- utter_greet
@kristiankolthoff kristiankolthoff added the type:bug 🐛 Inconsistencies or issues which will cause an issue or problem for users or implementors. label Aug 1, 2019
@msamogh
Contributor

msamogh commented Aug 1, 2019

Thanks for raising this issue, @gausie will get back to you about it soon.

@wochinge
Contributor

@kristiankolthoff Which encoding is your file in? Can you please save it as utf-8?
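One way to answer this question is to check whether the raw bytes of the file decode as UTF-8. The following is only a minimal sketch: `find_invalid_utf8` is a hypothetical helper written for illustration and is not part of Rasa or Rasa X.

```python
from typing import Optional, Tuple


def find_invalid_utf8(raw: bytes) -> Optional[Tuple[int, int]]:
    """Return (position, byte value) of the first byte that is not valid UTF-8.

    Returns None when the whole input decodes cleanly as UTF-8.
    """
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start, raw[err.start]
    return None


# Usage sketch: open the file in binary mode so no decoding happens on read.
# with open("domain.yml", "rb") as f:
#     print(find_invalid_utf8(f.read()))
```

If the helper reports byte 0xf6, as in the traceback above, the file was most likely written in a single-byte encoding such as cp1252 or ISO 8859-2, where 0xf6 is the letter ö.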

@wochinge wochinge added the status:more-details-needed Waiting for the user to provide more details / stacktraces / answer a question label Aug 15, 2019
@kristiankolthoff
Author

I saved it as utf-8 explicitly and the error still remains.

@no-response no-response bot removed the status:more-details-needed Waiting for the user to provide more details / stacktraces / answer a question label Aug 21, 2019
@wochinge
Contributor

@erohmensing Can you please check whether you can reproduce that when I'm gone? Thanks!

@erohmensing
Contributor

erohmensing commented Aug 23, 2019

Hm, I added ö to my domain.yml, nlu.md and stories.md and it loaded up with no problem. I can see the umlauts in all 3 of these places on the UI too. Of course I am also running the latest version, so you might want to try updating.

Can you run this in the console for me?

❯ python
Python 3.7.3 (default, Mar 27 2019, 16:54:48)
[Clang 4.0.1 (tags/RELEASE_401/final)] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.stdout.encoding
'UTF-8'
>>> sys.stdin.encoding
'UTF-8'

I think the user in the post you mentioned is probably right: "Rasa-X opens the domain.yml file, modifies it, stores it with system default encoding (ISO 8859-2 for me on Windows)".

@daniel-eder

daniel-eder commented Sep 11, 2019

@erohmensing This error still reproduces for me on the latest version, using Windows in German locale. The stdin and stdout streams show UTF-8, but they are not the root cause here.

The underlying issue is that Python by default writes files using the system code page unless an encoding is passed when opening the file, and Rasa does not specify UTF-8. Additionally, when loading the domain.yml file, Rasa first reformats and saves it before actually loading and parsing it. The encoding is lost during that first step, so the file is no longer UTF-8 when it is read back, which causes the error.
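The mechanism can be reproduced outside of Rasa. The sketch below assumes the system code page is cp1252 (a common default on Western European Windows installations; the actual code page on the affected machines is not confirmed in this thread) and passes it explicitly so that the example behaves the same on any platform.

```python
import os
import tempfile

text = "pers\u00f6nlicher Assistent"  # contains the umlaut "ö" (U+00F6)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "domain.yml")

    # Simulate a write that falls back to a Windows code page (cp1252 assumed).
    with open(path, "w", encoding="cp1252") as f:
        f.write(text)

    # In cp1252 the umlaut is stored as the single byte 0xf6.
    with open(path, "rb") as f:
        raw = f.read()

    # Reading the same file back as UTF-8 fails on that byte.
    try:
        with open(path, encoding="utf-8") as f:
            f.read()
        error = None
    except UnicodeDecodeError as err:
        error = err

print(hex(raw[4]))
print(error)
```

The failing byte matches the one in the traceback from the original report, which is consistent with the file having been rewritten in a non-UTF-8 encoding before being read.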

Workaround (Python 3.7+ only): set the environment variable PYTHONUTF8 to 1 before running rasa. This forces Python to use UTF-8 as the default encoding. On Windows: set PYTHONUTF8=1
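To confirm that the workaround took effect, the interpreter can be asked which mode and default encoding it is using. This is a small sketch using only the standard library; the two function names are made up for this example.

```python
import locale
import sys


def utf8_mode_enabled() -> bool:
    """True when the interpreter runs in UTF-8 mode (available from Python 3.7)."""
    return bool(sys.flags.utf8_mode)


def default_text_encoding() -> str:
    """Encoding that open() falls back to when no encoding argument is given."""
    return locale.getpreferredencoding(False)


print("UTF-8 mode:", utf8_mode_enabled())
print("Default encoding for open():", default_text_encoding())
```

Run in the same console session in which PYTHONUTF8 was set, the first line should report True. Without the variable, a Windows machine typically reports a code page such as cp1252 as the default encoding.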

Solution for rasa/rasa x: When saving the domain file (and other files as well), specify UTF-8 explicitly as the encoding. Python 3.7+ only: enable UTF-8 mode in code.

@erohmensing
Contributor

Thanks for the very descriptive info @taotsetung. I've tracked down the part where the domain gets written in Rasa X and you're right, the encoding isn't specified:

def dump_yaml_to_file(filename: Text, content: Any) -> Optional[str]:
    """Dump content to yaml."""
    with open(filename, "w") as f:
        f.write(dump_yaml(content))

I assume that with open(filename, "w", encoding="utf-8") as f: should do the job, but we'll check it out.
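The proposed change can be sketched independently of Rasa X. The example below does not use the real `dump_yaml` helper, which is not shown in this thread; `write_text_utf8` and `read_text_utf8` are hypothetical stand-ins that take an already serialized string.

```python
import os
import tempfile


def write_text_utf8(filename: str, text: str) -> None:
    """Write text with an explicit encoding instead of the platform default."""
    with open(filename, "w", encoding="utf-8") as f:
        f.write(text)


def read_text_utf8(filename: str) -> str:
    """Read text back with the same explicit encoding."""
    with open(filename, encoding="utf-8") as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "domain.yml")
    original = "- text: Ich w\u00fcnsche Dir noch einen sch\u00f6nen Tag!\n"
    write_text_utf8(path, original)
    round_trip = read_text_utf8(path)

print(round_trip == original)
```

Because both the write and the read name the encoding, the result no longer depends on the system code page or on whether UTF-8 mode is enabled.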

@JStumpp

JStumpp commented Sep 12, 2019

@erohmensing yes, adding the encoding to the open call fixed the error. Rasa-X is not open source so we can't make a PR?

@erohmensing
Contributor

Yes, unfortunately. But we actually already merged the PR to fix this issue :) will close it when the fix is released.

@tmbo
Member

tmbo commented Sep 19, 2019

fix is part of Rasa X 0.20.3
