Another project in To-Do

Work on this project is being discontinued. It was started as an attempt to complete Taiyo's assessment, but unfortunately it is being discontinued for the following reasons:

The assessment asked for an LLM to drive smart scraping, that is, scraping a website without changing the source code even if the website's HTML changes or the target website itself changes.
As of now, even after thorough research, no free LLM was found that can pinpoint the selectors needed for extraction from a website.

Please note that this project may or may not be continued. Updates will be posted as they become available.

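Some related articles, snippets, and notebooks that I found useful:

Most successful attempt at the time
Running Streamlit on Colab + QA with websites
LangChain QA on data
Selenium on Colab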

Web-Scraper-with--LLM

[general description]

This project comprises the following steps:

Step 1) Making a web scraper that supports dynamic websites and scheduling

[More details to be added]
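
Until more details are added, the scheduling half of Step 1 could look roughly like the sketch below. It is only an illustration built on the Python standard library, not this project's implementation: `fetch_static` and `run_scheduled` are hypothetical names, and `fetch_static` does not execute JavaScript. For dynamic websites, a browser-driven fetcher (for example one built on Selenium) would be passed in place of `fetch_static`.

```python
import sched
import time
from typing import Callable, List
from urllib.request import urlopen


def fetch_static(url: str, timeout: float = 10.0) -> str:
    """Fetch raw HTML over HTTP. This does NOT execute JavaScript."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def run_scheduled(
    fetch: Callable[[str], str],
    url: str,
    interval_seconds: float,
    runs: int,
) -> List[str]:
    """Call fetch(url) `runs` times, `interval_seconds` apart, and collect the pages."""
    scheduler = sched.scheduler(time.monotonic, time.sleep)
    pages: List[str] = []

    for i in range(runs):
        # Queue every run up front; run i fires i * interval_seconds from now.
        scheduler.enter(i * interval_seconds, 1, lambda: pages.append(fetch(url)))

    scheduler.run()  # Blocks until all queued runs have finished.
    return pages
```

For example, `run_scheduled(fetch_static, "https://example.com", interval_seconds=3600, runs=24)` would fetch a placeholder URL once an hour for a day. Because the fetcher is a parameter, the scheduling logic stays the same whichever way the page is actually loaded.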

Step 2) Making an extractor that takes in .html files or screenshots of a website and parses them into plain text

[More details to be added]
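
As a rough idea of the .html half of Step 2, the sketch below turns an HTML string into plain text using only the standard library's `html.parser`. The names `TextExtractor` and `html_to_text` are hypothetical, and this is a minimal sketch rather than the project's extractor. Screenshots are not covered here, since they would need an OCR step.

```python
from html.parser import HTMLParser
from typing import List


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep non-empty text that is not inside a skipped tag.
        if text and self._skip_depth == 0:
            self._chunks.append(text)

    def text(self) -> str:
        return "\n".join(self._chunks)


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML document, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Reading a saved page would then be something like `html_to_text(open("page.html", encoding="utf-8").read())`, where `page.html` is a placeholder file name.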

Step 3) Making an LLM that takes in the plain text and constructs a summary and a list of the information required to be extracted

[More details to be added]
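
One possible shape for Step 3 is sketched below. It assumes the LLM is available as a plain Python callable that takes a prompt string and returns a string; no specific model or provider is implied. `build_prompt`, `summarize_and_extract`, the prompt wording, and the JSON reply format are all hypothetical choices made for illustration.

```python
import json
from typing import Callable, Dict, List


def build_prompt(page_text: str, fields: List[str]) -> str:
    """Build a prompt asking for a summary plus the requested fields as JSON."""
    field_list = ", ".join(fields)
    return (
        "Summarize the page below and extract these fields: "
        + field_list
        + '.\nReply with JSON only, shaped as {"summary": "...", "fields": {...}}.\n\n'
        + "PAGE TEXT:\n"
        + page_text
    )


def summarize_and_extract(
    llm: Callable[[str], str], page_text: str, fields: List[str]
) -> Dict:
    """Send the prompt to `llm` and normalize its reply into a fixed shape."""
    raw = llm(build_prompt(page_text, fields))
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        parsed = None

    if not isinstance(parsed, dict):
        # Models do not always return valid JSON; keep the raw reply as the summary.
        parsed = {"summary": raw, "fields": {}}

    extracted = parsed.get("fields", {})
    if not isinstance(extracted, dict):
        extracted = {}

    return {
        "summary": parsed.get("summary", ""),
        # Fields the model did not return are reported as None.
        "fields": {name: extracted.get(name) for name in fields},
    }
```

Keeping the model behind a callable means a free local model, a hosted API, or a test stub can be swapped in without touching the parsing logic.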

sid-am-ahd935/Web-Scraper-with--LLM
