Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update japan_dict.txt #13142

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

Update japan_dict.txt #13142

wants to merge 1 commit into from

Conversation

madmalkav
Copy link

Update japan_dict.txt to include missing jouyou kanji ( #12940 )

Update japan_dict.txt to include missing jouyou kanji
@CLAassistant
Copy link

CLAassistant commented Jun 20, 2024

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@GreatV
Copy link
Collaborator

GreatV commented Jun 20, 2024

Will this affect the previous jp model?

@madmalkav
Copy link
Author

It is my understanding this file is only used when training a new model, so I don't think it would have any effect until a new model is trained, but there is a chance I misunderstood everything wrong and it works in a different way 😅

Copy link
Collaborator

@GreatV GreatV left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since they use the same character dictionary, using the previous model may result in inconsistent output, and I think it would be better to use a different filename (e.g., ja_ext_dict.txt).

@GreatV GreatV requested a review from jzhang533 June 20, 2024 14:02
@madmalkav
Copy link
Author

If this file can indeed affect the current model, I would propose to hold this PR until a new version of the japanese model is going to be trained. I don't like too much the idea of creating a new file and called it "extended" because this is not really extending the dictionary, is fixing a fault in it.

@GreatV
Copy link
Collaborator

GreatV commented Jun 20, 2024

@madmalkav, That makes sense.

@madmalkav
Copy link
Author

Can I ask why this similar PR for another language was already merged? What is the difference with my PR? I want to understand to see if I can do something else to move this forward.

@GreatV
Copy link
Collaborator

GreatV commented Aug 18, 2024

I don't think that PR being merged will work properly with the previous model.

@GreatV
Copy link
Collaborator

GreatV commented Aug 19, 2024

Maybe we could try to keep the version information of the character dictionary (a new column in the model list is used to show the dictionary used by the model), so that new character dictionaries can be merged in, and old models use old character dictionaries.

model code description character dictionary model size download Update Date
ch Chinese and English https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.7/ppocr/utils/ppocr_keys_v1.txt 3.71M inference model/
trained model
2020.9.22
ch Chinese and English https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.8/ppocr/utils/ppocr_keys_v1.txt 3M inference model/
trained model
2024.9.22

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
contributor language requests Multilingual language requests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants