target vocab size #35

Open
mily33 opened this issue Apr 27, 2020 · 5 comments

Comments

mily33 commented Apr 27, 2020

The provided pretrained model has a vocabulary size of 525, but after following the preprocessing steps I get a vocabulary of size 496.

Collaborator

da03 commented Apr 27, 2020

Hmm, did you preprocess the full training set or the provided sample set?

Author

mily33 commented Apr 27, 2020

@da03 I preprocessed the full training set.

Collaborator

da03 commented Apr 27, 2020

That's weird. Can you load the provided pretrained model, find the vocab (https://github.com/harvardnlp/im2markup/blob/master/src/model/model.lua#L64), and then compare it to your vocabulary to see where they differ?
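The comparison suggested above could be sketched as follows, assuming both vocabularies have already been exported to plain text files with one token per line. The helper names `load_vocab` and `diff_vocab` and the file names in the usage note are hypothetical, and exporting the vocab from the pretrained Lua model is not shown here.

```python
def load_vocab(path):
    """Read a vocabulary file with one token per line (assumed format)."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f if line.strip()]


def diff_vocab(reference_tokens, own_tokens):
    """Return the tokens that appear in only one of the two vocabularies."""
    reference, own = set(reference_tokens), set(own_tokens)
    return {
        "only_in_reference": sorted(reference - own),
        "only_in_own": sorted(own - reference),
    }
```

For example, `diff_vocab(load_vocab("pretrained_vocab.txt"), load_vocab("my_vocab.txt"))` would list the tokens missing on each side, which is usually more informative than comparing the two sizes alone.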

Author

mily33 commented Apr 27, 2020

@da03 It's really weird. I downloaded both the processed files and the raw files from http://lstm.seas.harvard.edu/latex/data/. I got a vocabulary of size 496 from both sets of files (using "train_filter.lst"). When I used "train.lst" (without the filter), I got a size of 519, which still does not match yours. I also compared the vocabulary in the provided pretrained model with mine, and I could not find the exact cause of the difference.

Author

mily33 commented Apr 27, 2020

It seems to be caused by the different sizes of the training sets. The vocab size in this repo is 499, and its training set has 76444 examples: https://github.com/ritheshkumar95/im2latex-tensorflow/tree/master/im2markup
My training set has 75275 examples after filtering, the same as your provided processed training set.
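The dependence of vocabulary size on the training list can be illustrated with a minimal sketch. It assumes that each list line has the form `<image name> <formula index>`, that formulas are whitespace-tokenized, and that a token is kept only when its count exceeds a threshold. The function `build_vocab` and its `unk_threshold` parameter are illustrative names, not a confirmed copy of the repository's preprocessing script.

```python
from collections import Counter


def build_vocab(formulas, list_lines, unk_threshold=1):
    """Collect tokens from the formulas referenced by list_lines.

    formulas:      list of tokenized formula strings, indexed by line number
    list_lines:    lines of the assumed form "<image name> <formula index>"
    unk_threshold: a token is kept only if its count is strictly greater
    """
    counts = Counter()
    for line in list_lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        counts.update(formulas[int(parts[1])].split())
    return sorted(tok for tok, n in counts.items() if n > unk_threshold)


formulas = ["a b", "a c", "a b d"]
full_list = ["x.png 0", "y.png 1", "z.png 2"]
filtered_list = ["x.png 0", "y.png 1"]

vocab_full = build_vocab(formulas, full_list)
vocab_filtered = build_vocab(formulas, filtered_list)
```

In this toy data, dropping one example from the list pushes the token `b` below the threshold, so the filtered vocabulary is smaller than the full one. The same mechanism would make a 75275-example list and a 76444-example list yield different vocabulary sizes.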
