Fix `parsing fail` error for Parse phase #22

rael346 · 2024-05-16T21:04:52Z

This pull request mainly fixes most of the common errors in the scraper's Parse phase (parsing fail, unexpected end of token, ambiguous grammar) by doing the following

Refactor the snapshot tests to output into individual files to be more readable (This is mainly to avoid large changes that affect multiple majors like the missing sections error near the end of the semester)
Tokenize Section Info only if it is after a header/subheader/subsubheader or it is the first element in the section. This is because the Section Info phrases ("Complete one of the following" for example) can also be tokenized as a XOM
Add a new Token type SUBSUBHEADER (basically a SUBHEADER but with indentation). This header sometimes doesn't have any meaning (see CS and English's 19th century section for an example), but other times it does. This is mainly here so that we have more options to parse later on (in this PR the Parse Phase just ignore the subsubheader token and the comment).
Tokenize Phase now aggregates adjacent course lists table. This is because some majors decided to separate two tables in a section (see this major for an example), which the Tokenize Phase would have understood as two different sections.
Rewrite the grammar for the parse phase to be more succinct and contains less recursion (since it is easier to debug and less chance of having ambiguous grammar error)
Refactor the Parse Phase code to write snapshot tests

With this PR, most of the errors in the Parse Phase are gone. If you run the scraper for 2023, there are only 13 Parse Phase errors, and all of them are specific to one major.

For future works

Address comments that are not tokenized as XOM #16 is the main focus now since a lot of majors has errors relating to this
Unknown catalog type - some majors just don't contain the word "major", "minor" or "concentration" in their name
Potentially removing the Section Info type - This type is basically just XOM but at the beginning of a section, so we can move the tokenize logic to the grammar instead

… in tokenize phase

…phase

… ambiguous grammar)

AlpacaFur

Looks awesome, thank you so much for working on this!

test/tokenize.test.ts

src/parse/parse.ts

src/runtime/index.ts

rael346 added 6 commits May 10, 2024 13:14

test: refactor tokenize snapshot test to individual file for readability

3cd686c

fix: enforce that section info should always be in a certain position…

e5468e9

… in tokenize phase

fix: add subsubheader token type

73c687d

fix: aggregate adjacent course lists in the same section in tokenize …

85336c3

…phase

fix: revamp grammar to use less recursion (mainly to reduce chance of…

8051c2c

… ambiguous grammar)

refactor: shorten token parsing for readability

6287b90

rael346 added the Scraper label May 16, 2024

rael346 requested a review from AlpacaFur May 16, 2024 21:04

rael346 self-assigned this May 16, 2024

AlpacaFur approved these changes Jun 4, 2024

View reviewed changes

test/tokenize.test.ts Show resolved Hide resolved

src/parse/parse.ts Show resolved Hide resolved

src/runtime/index.ts Show resolved Hide resolved

rael346 merged commit d61b9d0 into main Jun 5, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix `parsing fail` error for Parse phase #22

Fix `parsing fail` error for Parse phase #22

rael346 commented May 16, 2024

AlpacaFur left a comment

Fix parsing fail error for Parse phase #22

Fix parsing fail error for Parse phase #22

Conversation

rael346 commented May 16, 2024

AlpacaFur left a comment

Choose a reason for hiding this comment

Fix `parsing fail` error for Parse phase #22

Fix `parsing fail` error for Parse phase #22