Tech: Scraping
Scraping is the core of SearchNEU, and of this API. "Scraping" refers to the process of fetching employee and course data, parsing it, and storing it.
flowchart TD
A[Run scrapers] --> B[Determine which term IDs we want to scrape]
B --> B2[Start scraping!]
B2 --> C["Scrape publicly accessible Northeastern employee data"]
B2 --> D[Scrape subjects, sections, and courses for the given terms]
C --> E("Parse scraped data")
D --> E
E -.-> cache["If running locally, the results will be cached to speed up development"]
E --> dump[Courses, sections, and employees are inserted into our database]
dump --> DONE
DONE["Inserts the courses into Elasticsearch, which allows for easy & efficient searching"]
In depth diagram
flowchart TD
A[yarn scrape] --> termIds
subgraph termIds
TERM_A[Determine which term IDs to scrape]
TERM_A -.->|OR| TERM_F[User inputs the specific terms via CLI]
TERM_F ---> TERM_DONE
TERM_A -.->|OR| TERM_B(Query term IDs from Banner)
TERM_B --> TERM_C("Filter the list of term IDs")
TERM_C -.->|OR| TERM_D[Default - use the first 12]
TERM_C -.->|OR| TERM_D2[User inputs the # to use via CLI]
TERM_DONE(Returns list of terms to scrape)
TERM_D --> TERM_DONE
TERM_D2 --> TERM_DONE
end
subgraph employees
EMP_F(Scrape HTML of college-specific websites)
EMP_F -->|COE, CCIS, CAMD, CSSH| EMP_F2[Parse HTML, extract employee data]
EMP_G("Query NEU Faculty directory API (REST)")
EMP_F2 --> EMP_H(Merges employee results based on name & email)
EMP_G -->|All NEU Employees| EMP_H
end
subgraph courses
I("Query subject abbreviations from Banner [eg. HIST -> History]")
I --> J(Query all sections from Banner)
J --> J2("Map the sections to a list of courses (ie. subject + code)")
J2 --> course
subgraph course[For each course in the list...]
direction TB
K[Query course information from Banner]
K --> L[Parse into a course object]
end
end
subgraph dumpProcessor
DMP_K[Courses, sections, and employees are inserted into our Prisma database]
end
termIds -->|Runs once, doesn't care about specific terms| employees
termIds -->|Runs for each term in the terms list| courses
employees --> dumpProcessor
courses --> dumpProcessor
dumpProcessor --> ES["Index data from Postgres into Elasticsearch, which allows for easy searching"]
ES --> DONE["When users search on the website, the searching is handled by Elasticsearch"]
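The `termIds` step in the diagram above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `TermSelectionOptions`, `selectTermIds`, and the `term-N` sample IDs are hypothetical names, and the list of term IDs from Banner is passed in as a plain array rather than queried.

```typescript
// Hypothetical options object; the real CLI flag names are not given in this document.
interface TermSelectionOptions {
  // Specific term IDs supplied by the user, if any.
  termIds?: string[];
  // How many of the term IDs queried from Banner to keep.
  numberOfTerms?: number;
}

// Default taken from the diagram: "use the first 12".
const DEFAULT_NUMBER_OF_TERMS = 12;

// Returns the list of terms to scrape.
// bannerTermIds stands in for the result of querying term IDs from Banner.
function selectTermIds(
  options: TermSelectionOptions,
  bannerTermIds: string[],
): string[] {
  // Branch 1: the user named specific terms, so use them as-is.
  if (options.termIds !== undefined && options.termIds.length > 0) {
    return options.termIds;
  }
  // Branch 2: filter the queried list down to the first N terms.
  const limit = options.numberOfTerms ?? DEFAULT_NUMBER_OF_TERMS;
  return bannerTermIds.slice(0, limit);
}

// Made-up sample data for illustration only.
const bannerTermIds = Array.from({ length: 20 }, (_, i) => `term-${i + 1}`);

console.log(selectTermIds({}, bannerTermIds).length); // 12
console.log(selectTermIds({ numberOfTerms: 3 }, bannerTermIds));
console.log(selectTermIds({ termIds: ["term-7"] }, bannerTermIds));
```

Keeping the Banner query outside the function makes the selection logic easy to check in isolation.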
Scraping course data for multiple terms can take quite a bit of time. Caching scrapes is fantastic for quickly initializing local databases, but for scraper-related work we might need to run real scrapes often.
To speed up scraper-related dev work, we can specify custom scraping filters so that we only fetch data for a subset of the total courses for the given terms. Filters are specified in `scrapers/filters.ts` in the following format:
const filters = {
  campus: (campus) => true,
  subject: (subject) => ["CS", "MATH"].includes(subject),
  courseNumber: (courseNumber) => courseNumber >= 3000,
  truncate: true,
};
Each field has the following type:
type Filters = {
  campus: (campus: string) => boolean;
  subject: (subject: string) => boolean;
  courseNumber: (courseNumber: number) => boolean;
  truncate: boolean;
};
The custom scrape will only scrape courses that fulfill all filters, so the above can be read as: "Scrape all courses from all campuses that have subject "CS" or "MATH" AND have a course number 3000 or higher. Clear out my local database before inserting the custom scrape data."
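As a rough sketch of the "fulfill all filters" rule, the predicates can be combined with a logical AND. The `CourseStub` record, the `applyFilters` helper, and the sample courses below are assumptions made for illustration; the scrapers' real data structures may differ.

```typescript
// Shape of the filters, mirroring the format described above.
interface Filters {
  campus: (campus: string) => boolean;
  subject: (subject: string) => boolean;
  courseNumber: (courseNumber: number) => boolean;
  truncate: boolean;
}

// Hypothetical minimal course record, for illustration only.
interface CourseStub {
  campus: string;
  subject: string;
  courseNumber: number;
}

// Keep a course only when every predicate accepts it (logical AND).
// The truncate flag is not a predicate, so it plays no part here.
function applyFilters<T extends CourseStub>(courses: T[], filters: Filters): T[] {
  return courses.filter(
    (course) =>
      filters.campus(course.campus) &&
      filters.subject(course.subject) &&
      filters.courseNumber(course.courseNumber),
  );
}

const exampleFilters: Filters = {
  campus: () => true,
  subject: (subject) => ["CS", "MATH"].includes(subject),
  courseNumber: (courseNumber) => courseNumber >= 3000,
  truncate: true,
};

// Made-up sample data.
const sampleCourses: CourseStub[] = [
  { campus: "Boston", subject: "CS", courseNumber: 3500 },
  { campus: "Boston", subject: "CS", courseNumber: 2500 },
  { campus: "Boston", subject: "HIST", courseNumber: 4000 },
  { campus: "Online", subject: "MATH", courseNumber: 3081 },
];

const kept = applyFilters(sampleCourses, exampleFilters);
console.log(kept.map((c) => `${c.subject} ${c.courseNumber}`)); // [ 'CS 3500', 'MATH 3081' ]
```

CS 2500 is dropped by the course-number filter and HIST 4000 by the subject filter, which matches the reading given above.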
?> The custom scrape will not overwrite the cache, and therefore it will also never read from the cache.
- `truncate`
  - If `truncate` is set to true, then the `courses` and `sections` tables in your local database will be cleared before they are re-populated with the scraped data. The `classes` Elasticsearch index will also be cleared before being re-populated with scraped data.
- If
There are a number of course-to-course relations that we store: coreqs of a course, prereqs of a course, courses that the given course is a prereq of, and courses that the given course is an optional prereq of. It's important to note how the custom scrape will behave in these cases, for example if `Course A` is a prereq of `Course B`, and the filters include `Course B` but not `Course A`.
Assuming the filters include `B` but not `A`:
- If `B` has `A` as a `prereq` or `coreq`, then in your local database `B` will know that `A` exists as a `prereq` or `coreq`, but `A` will not have been scraped so `A` will be marked as `missing` in the `prereq` or `coreq` field.
- If `A` has `B` as a `prereq` or `optionalPrereq`, then `B` will not know anything about `A`, meaning `B`'s `prereqs_for` or `opt_prereqs_for` fields will not include `A`.
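The first case can be illustrated with a simplified sketch. It assumes a flat list of course references and a hypothetical `markMissing` helper; the `CourseRef` shape and the course numbers are invented for the example, and the scrapers' real prereq structure may well be nested rather than flat.

```typescript
// Hypothetical, flattened reference from one course to another.
interface CourseRef {
  subject: string;
  classId: string;
  missing?: boolean;
}

// Build a lookup key such as "CS/2510" (the key format is an assumption).
const keyOf = (ref: { subject: string; classId: string }): string =>
  `${ref.subject}/${ref.classId}`;

// Flag every referenced course that was not part of the scrape.
function markMissing(refs: CourseRef[], scrapedKeys: Set<string>): CourseRef[] {
  return refs.map((ref) =>
    scrapedKeys.has(keyOf(ref)) ? { ...ref } : { ...ref, missing: true },
  );
}

// Only course B (here CS 3500) passed the filters;
// its prereq, course A (here CS 2510), was filtered out.
const scrapedKeys = new Set<string>(["CS/3500"]);
const prereqsOfB = markMissing([{ subject: "CS", classId: "2510" }], scrapedKeys);

console.log(prereqsOfB[0].missing); // true
```

The second case needs no code: since `A` is never scraped, nothing ever adds `A` to `B`'s `prereqs_for` or `opt_prereqs_for` fields.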
In summary, if you're looking at any course-to-course info while using a custom scrape then pay extra attention to what exactly you scraped. When in doubt, do a full scrape.
The command to run the custom scrape is: `yarn scrape:custom`
A Sandbox Project