Strange behaviour of command-line detokenizer #43
Comments
I can't get the command-line detokenizer to work properly. I have tried this:
and I get
What am I doing wrong? |
Sorry about that. I think it was caused by a mistake in a previous version, which was patched in #36. Could you try the latest version? It should work now:
[out]:
Seems like this https://github.com/alvations/sacremoses/blob/master/sacremoses/tokenize.py#L481 isn't used when iterating... I'll check that first thing tomorrow morning =) |
Hmmm, it seems the apostrophe for French isn't working as expected, though. From the original Moses:
|
Wow, that was fast. Yes, apostrophes don't look good when detokenized (they are separated with spaces). |
I'll be grateful if you let me know of any progress. |
(1) Have you had a chance to solve the problem with spaces when detokenizing, @alvations ? (2) Also, apparently, there is a way to specify the language when creating the tokenizer.
It would be nice to document this in the README.md. By the way, when the language is not "en", "it" or "fr", but you specify it, the apostrophes are doubly escaped with backslashes for a reason that escapes me:
produces
where it would be more appropriate to have (as with lang="en")
By the way, I could consider offering my help with Catalan ("ca") tokenization. The current French and Italian model partly works, but Catalan has post-verbal pronouns such as
which should be tokenized as
But first I'd have to get better acquainted with your code. Cheers, |
@mlforcada Sorry for the delay! The latest version should now have the French apostrophes patched:
from sacremoses.tokenize import MosesTokenizer, MosesDetokenizer
mt = MosesTokenizer(lang='fr')
md = MosesDetokenizer(lang='fr')
md.detokenize(mt.tokenize("L'amitié nous a fait forts d'esprit")) == "L'amitié nous a fait forts d'esprit"
I was catching the end-of-string symbol in the token after the apostrophes' clitics, so that was wrong. |
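To make the expected French behaviour concrete, here is a minimal sketch of the joining step in plain Python. It is not the sacremoses implementation: the helper name `join_french_tokens` is hypothetical, and it assumes unescaped tokens in which the clitic keeps its apostrophe (for example `L'` and `d'`).

```python
def join_french_tokens(tokens):
    """Join tokens with spaces, but attach a token directly to a
    preceding token that ends in an apostrophe (e.g. "L'" + "amitié").

    Assumption: apostrophes are plain "'" characters, not "&apos;".
    A bare quote token "'" would also be glued, so this sketch only
    covers the clitic case discussed in this thread.
    """
    out = []
    glue_next = False  # True when the previous token ended in an apostrophe
    for tok in tokens:
        if out and glue_next:
            out[-1] += tok   # no space after the clitic
        else:
            out.append(tok)
        glue_next = tok.endswith("'")
    return " ".join(out)


tokens = ["L'", "amitié", "nous", "a", "fait", "forts", "d'", "esprit"]
print(join_french_tokens(tokens))
```

With the token list above, the clitics are reattached without the extra spaces reported earlier in the thread.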
Regarding the Spanish escaping of the ampersand, I'm not able to reproduce it, so it shouldn't be a problem with the version shown below. Which version of sacremoses are you using?
>>> import sacremoses
>>> sacremoses.__version__
0.0.19
>>> from sacremoses.tokenize import MosesTokenizer, MosesDetokenizer
>>> mt=MosesTokenizer(lang="es")
>>> print(mt.tokenize("Un texto con 'comillas' para probar"))
['Un', 'texto', 'con', '&apos;', 'comillas', '&apos;', 'para', 'probar']
Also, having Catalan-specific rules would be awesome! I have a vested interest in Catalan text processing =) Do you have a list of rules and words that should prevent weird splitting for Catalan? |
Thanks a million, @alvations ! Catalan rules for apostrophes and hyphens with pronouns, articles and prepositions: Work as in French and Italian:
Single pronoun after verb, apostrophe.
Single pronoun, after verb, with hyphen:
VERB-(me|te|se|lo|la|li|nos|us|vos|los|les)-(em|el|la|li|en|ens|us|els|les|hi|ho) →
VERB'(ns|ls)-(el|la|els|les|li|ho|hi|en) → VERB '(ns|ls) -(el|la|els|les|li|ho|hi|en)
VERB'(me|te|se|li|)-(m|t|s|l|ns|ls) → VERB '(me|te|se|li) -(m|t|s|l|ns|ls)
|
Aggh, the last one is wrong.
Sorry about that! |
Thanks @mlforcada! Let me see how I could convert the rules above =) |
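As a rough illustration of how one of the rules above might be converted, here is a minimal sketch using only Python's `re` module. It covers just the rule `VERB'(ns|ls)-(el|la|els|les|li|ho|hi|en)`, since that is the one whose output is fully spelled out and not flagged as wrong. The function name `split_apostrophe_hyphen_clitics` and the treatment of `VERB` as any run of word characters are assumptions, not sacremoses code.

```python
import re

# Rule from the thread:
#   VERB'(ns|ls)-(el|la|els|les|li|ho|hi|en)
#     -> VERB '(ns|ls) -(el|la|els|les|li|ho|hi|en)
# Assumption: VERB is approximated by any run of word characters.
CLITIC_PAIR = re.compile(
    r"(\w+)'(ns|ls)-(el|la|els|les|li|ho|hi|en)\b",
    flags=re.IGNORECASE,
)


def split_apostrophe_hyphen_clitics(text):
    """Insert spaces before the apostrophe clitic and the hyphen clitic,
    keeping the apostrophe and hyphen attached to the pronouns."""
    return CLITIC_PAIR.sub(r"\1 '\2 -\3", text)


print(split_apostrophe_hyphen_clitics("dona'ns-ho ara"))
```

The trailing `\b` keeps the shorter alternatives (`el`, `la`) from matching the start of a longer pronoun such as `els` or `les`. The other rules could follow the same pattern once their right-hand sides are settled.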