Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix parsing fail error for Parse phase #22

Merged
merged 6 commits into from
Jun 5, 2024
Merged

Fix parsing fail error for Parse phase #22

merged 6 commits into from
Jun 5, 2024

Conversation

rael346
Copy link
Contributor

@rael346 rael346 commented May 16, 2024

closes #14 #15

This pull request mainly fixes most of the common errors in the scraper's Parse phase (parsing fail, unexpected end of token, ambiguous grammar) by doing the following

  • Refactor the snapshot tests to output into individual files to be more readable (This is mainly to avoid large changes that affect multiple majors like the missing sections error near the end of the semester)
  • Tokenize Section Info only if it is after a header/subheader/subsubheader or it is the first element in the section. This is because the Section Info phrases ("Complete one of the following" for example) can also be tokenized as a XOM
  • Add a new Token type SUBSUBHEADER (basically a SUBHEADER but with indentation). This header sometimes doesn't have any meaning (see CS and English's 19th century section for an example), but other times it does. This is mainly here so that we have more options to parse later on (in this PR the Parse Phase just ignore the subsubheader token and the comment).
  • Tokenize Phase now aggregates adjacent course lists table. This is because some majors decided to separate two tables in a section (see this major for an example), which the Tokenize Phase would have understood as two different sections.
  • Rewrite the grammar for the parse phase to be more succinct and contains less recursion (since it is easier to debug and less chance of having ambiguous grammar error)
  • Refactor the Parse Phase code to write snapshot tests

With this PR, most of the errors in the Parse Phase are gone. If you run the scraper for 2023, there are only 13 Parse Phase errors, and all of them are specific to one major.

For future works

  • Address comments that are not tokenized as XOM #16 is the main focus now since a lot of majors has errors relating to this
  • Unknown catalog type - some majors just don't contain the word "major", "minor" or "concentration" in their name
  • Potentially removing the Section Info type - This type is basically just XOM but at the beginning of a section, so we can move the tokenize logic to the grammar instead

@rael346 rael346 requested a review from AlpacaFur May 16, 2024 21:04
@rael346 rael346 self-assigned this May 16, 2024
Copy link
Member

@AlpacaFur AlpacaFur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks awesome, thank you so much for working on this!

test/tokenize.test.ts Show resolved Hide resolved
src/parse/parse.ts Show resolved Hide resolved
src/runtime/index.ts Show resolved Hide resolved
@rael346 rael346 merged commit d61b9d0 into main Jun 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fix parsing fail for section without headers
2 participants