How do you train with those NOT-TEXT elements. #21

Open
linan142857 opened this issue Nov 9, 2020 · 4 comments

Comments

@linan142857

Dear author,
Some documents contain a massive number of non-text elements, for example hundreds of thousands of "##LTLine##" entries. How do you actually deal with them?
For example, do you train and predict on all of those elements using the text '##LTLine##'?

Thank you!

@liminghao1630
Collaborator

Yes. We treat '##LTLine##' as a special token during training and prediction.

@NandreyN

> Yes. We regard '##LTLine##' as a special token during train and predict.

Hi! Could you please tell me the integer IDs of the ##LTLine## and ##LTFigure## tokens in LayoutLM's vocabulary?

Thanks

@liminghao1630
Collaborator

In fact, we did not add them to the vocabulary. They are split into ordinary tokens by the tokenizer, like any other text, and labeled in the way I described in #25.
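
To illustrate what "not in the vocabulary" means in practice, here is a minimal sketch of a BERT-style tokenizer (punctuation splitting followed by greedy WordPiece matching) applied to a placeholder string. This is not the authors' code: `TOY_VOCAB`, the helper names, and the bounding boxes are all hypothetical, and the real LayoutLM vocabulary will produce different pieces. Repeating the element's bounding box for every sub-token is an assumption of this sketch, and the labeling scheme from #25 is not reproduced here.

```python
import string

# Toy vocabulary (hypothetical, NOT LayoutLM's real vocabulary).
TOY_VOCAB = {"#", "lt", "##line", "##figure", "total", "[UNK]"}


def split_punctuation(text):
    """Lowercase the text and split off each punctuation character."""
    pieces, current = [], ""
    for ch in text.lower():
        if ch in string.punctuation:
            if current:
                pieces.append(current)
                current = ""
            pieces.append(ch)
        else:
            current += ch
    if current:
        pieces.append(current)
    return pieces


def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into sub-tokens."""
    tokens, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-piece prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece matched: whole word is unknown
        tokens.append(match)
        start = end
    return tokens


def tokenize_with_boxes(words, boxes, vocab=TOY_VOCAB):
    """Tokenize each element and repeat its box for every sub-token."""
    tokens, token_boxes = [], []
    for word, box in zip(words, boxes):
        for piece in split_punctuation(word):
            for tok in wordpiece(piece, vocab):
                tokens.append(tok)
                token_boxes.append(box)
    return tokens, token_boxes


tokens, token_boxes = tokenize_with_boxes(
    ["##LTLine##", "Total"],
    [[10, 20, 500, 22], [30, 40, 90, 60]],  # made-up coordinates
)
print(tokens)
# ['#', '#', 'lt', '##line', '#', '#', 'total']
```

With this toy vocabulary, the placeholder has no single integer ID of its own. It becomes six ordinary sub-tokens, each carrying the same bounding box as the line element it came from.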

@NandreyN

Thanks
