Found some data label inconsistencies #23

Zhang-O · 2019-10-10T09:08:30Z

The entry "51238 1a00a76d4e basic" appears in im2latex_train.lst.
The formulas around line 51238 in im2latex_formulas.lst do not match the LaTeX content of image 1a00a76d4e.
Image 1a00a76d4e should instead point to line 51729 in im2latex_formulas.lst.
I have found several cases like this, but I am not sure how many there are.
I downloaded the data from https://zenodo.org/record/56198#.XZ7yK_n_yHt.
Is something wrong?

Miffyli · 2019-10-10T09:09:42Z

Hey, did you open the files correctly? See this quote from the Zenodo webpage:

Newlines used in formulas_im2latex.lst are UNIX-style newlines (\n). Reading file with other type of newlines results to slightly wrong amount of lines (104563 instead of 103558), and thus breaks the structure used by this dataset. Python 3.x reads files using newlines of the running system by default, and to avoid this file must be opened with newlines="\n" (eg. open("formulas_im2latex.lst", newline="\n")).
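As a minimal sketch of what the quote describes, the snippet below reads a formulas file while splitting on "\n" only, then looks up the formula for one split-file entry. The helper names `load_formulas` and `parse_split_line`, the tiny synthetic file, and the index `2` are all hypothetical and used only for illustration. Treating the first column as a 0-based line index is an assumption, and the ISO-8859-1 encoding is borrowed from later in this thread.

```python
import os
import tempfile


def load_formulas(path):
    # newline="\n" turns off universal-newline handling, so a stray "\r"
    # inside a formula stays part of that formula instead of ending the line.
    with open(path, encoding="ISO-8859-1", newline="\n") as f:
        return [line.rstrip("\n") for line in f]


def parse_split_line(line):
    # Assumed layout of an im2latex_train.lst entry:
    # "<formula_idx> <image_name> <render_type>"
    formula_idx, image_name, render_type = line.split()
    return int(formula_idx), image_name, render_type


# Tiny synthetic stand-in for im2latex_formulas.lst (not real dataset content).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "formulas.lst")
    with open(path, "wb") as f:
        f.write(b"a + b\nx \r y\nc = d\n")
    formulas = load_formulas(path)

# Hypothetical entry; the index 2 is made up for this example.
idx, image_name, render_type = parse_split_line("2 1a00a76d4e basic")
formula = formulas[idx]  # assumption: formula_idx is a 0-based line index
print(len(formulas), image_name, repr(formula))
```

On the real data, the same `load_formulas` call would be pointed at the downloaded im2latex_formulas.lst, and the resulting list length can be compared against the line count stated on the Zenodo page.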

Zhang-O · 2019-10-10T11:55:51Z

Sorry to waste your time. I read the web page again and checked what you said.
I found that formulas_im2latex.lst has 104564 lines. I opened it in Notepad++ with the line ending set to \n.
What is wrong?
Thanks very much.

Zhang-O · 2019-10-10T12:30:56Z

f = open("./im2latex_formulas.lst", encoding="ISO-8859-1", newline="\n")
len(f.readlines())  # 103359
When opening the file with Notepad++, changing the encoding does not change the number of lines.
It took me almost an hour to check this.
Thanks again.

Miffyli · 2019-10-10T13:32:07Z

Hmm, that is peculiar: I downloaded im2latex_formulas.lst from Zenodo and ran the following (Windows 10, Python 3.6):

f = open("./im2latex_formulas.lst", newline="\n")
len(f.readlines())
Out[11]: 103559

f = open("./im2latex_formulas.lst", encoding="ISO-8859-1",newline="\n")
len(f.readlines())
Out[13]: 103559

I do not think changing the encoding helps; the issue is that newlines are handled differently on different OSes.
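To make the newline point concrete, here is a small sketch that counts lines in the same bytes three ways. The data is synthetic (three "\n"-terminated records, one containing a stray "\r") and the file name is hypothetical. It only illustrates how Python's `newline` argument changes the count; it is not a measurement of the real dataset file.

```python
import os
import tempfile

# Synthetic data: three "\n"-terminated records, one with a stray "\r" inside.
data = b"first\nsec\rond\nthird\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.lst")
    with open(path, "wb") as f:
        f.write(data)

    # Default newline=None: universal newlines, so "\r" also ends a line.
    with open(path, encoding="ISO-8859-1") as f:
        universal_count = len(f.readlines())

    # newline="\n": only "\n" ends a line.
    with open(path, encoding="ISO-8859-1", newline="\n") as f:
        strict_count = len(f.readlines())

    # Encoding-independent cross-check: count "\n" bytes directly.
    with open(path, "rb") as f:
        byte_count = f.read().count(b"\n")

print(universal_count, strict_count, byte_count)
```

The byte-level count does not depend on the text encoding at all, which fits the observation that switching the encoding leaves the number of lines unchanged.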

kim-yhow · 2019-11-22T13:35:51Z

Excuse me, I am also interested in this project. Are you still working on formula recognition? Have you successfully reproduced the EM results from the paper?

Zhang-O closed this as completed Oct 10, 2019

Zhang-O reopened this Oct 10, 2019
