sid-am-ahd935/Web-Scraper-with--LLM

Another project in To-Do

Work on this project is being discontinued. It was started as an attempt to complete Taiyo's assessment, but it is being put aside for the following reasons:

  • The assessment asked for an LLM-driven "smart" scraper: one that keeps scraping without source-code changes even if the website's HTML changes or the target website itself changes
  • As of now, even after thorough research, no free LLM was found that can pinpoint the selector needed for extraction from a website

Please note that this project may or may not be continued. Updates will be posted as they become available.

Some related articles/snippets/notebooks that I found useful:

Most successful attempt at the time

https://colab.research.google.com/drive/1Ctkuh0Aq6sgJTV2vWfZ_N06eO6dJAPKm?authuser=1

Running Streamlit on Colab + QA with websites

https://colab.research.google.com/github/mrm8488/shared_colab_notebooks/blob/master/Create_streamlit_app.ipynb#scrollTo=vWmc_s2ezvU0

Langchain QA on data

https://colab.research.google.com/drive/1NaEyuFWCkDtkufHIsWQhfgPLHybXeYA1?usp=sharing#scrollTo=SVOXzjaNsVb8

https://medium.com/@onkarmishra/using-langchain-for-question-answering-on-own-data-3af0a82789ed

Selenium on Colab

https://colab.research.google.com/drive/1MX3xY23Go1STe7LbDMvwf2KaqHpbrVhC?usp=sharing#scrollTo=xvawzPAvm5Nq

Web-Scraper-with--LLM

[general description]

This project consists of the following steps:

Step 1) Making a web scraper that supports dynamic websites and scheduling

[More details to be added]
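
As a rough illustration of the scheduling half of Step 1, here is a minimal sketch using only Python's standard `sched` module. The names `run_scheduled_scrapes` and `fetch_page` are hypothetical and not part of this repository. The fetcher is passed in as a plain callable on the assumption that a browser-driven fetcher (for example one built on Selenium, as in the notebook linked above) would be needed for dynamic websites; that part is not shown here.

```python
import sched
import time
from typing import Callable, List


def run_scheduled_scrapes(
    fetch_page: Callable[[str], str],
    url: str,
    runs: int,
    interval_seconds: float,
) -> List[str]:
    """Call fetch_page(url) `runs` times, spaced `interval_seconds` apart.

    fetch_page is a hypothetical, caller-supplied function that returns the
    page's HTML. For dynamic websites it would have to drive a real browser.
    """
    scheduler = sched.scheduler(time.monotonic, time.sleep)
    results: List[str] = []

    def job() -> None:
        # Store each snapshot so later steps can parse it.
        results.append(fetch_page(url))

    # Queue every run up front, each one offset from the start time.
    for run_index in range(runs):
        scheduler.enter(run_index * interval_seconds, 1, job)

    scheduler.run()  # Blocks until all queued runs have finished.
    return results
```

A long-running deployment would more likely rely on cron or a job queue; `sched` is used here only to keep the sketch self-contained.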

Step 2) Making an extractor that takes in .html files or screenshots of a website and parses them into plain text

[More details to be added]
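
For the .html half of Step 2, a minimal sketch with the standard-library `html.parser` module could look like the following. The names `html_to_text` and `_TextExtractor` are hypothetical, and the choice to drop `<script>` and `<style>` contents is an assumption about what "plain text" should mean here. Screenshots would need OCR, which this sketch does not cover.

```python
from html.parser import HTMLParser
from typing import List


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped tags.
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self) -> str:
        return "\n".join(self._chunks)


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML document, one chunk per line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

For a saved page, something like `html_to_text(open("page.html", encoding="utf-8").read())` would produce the text that Step 3 consumes; the file name is only an example.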

Step 3) Making an LLM that takes in plain text and constructs a summary and a list of the info required to be extracted

[More details to be added]
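
Since no suitable free LLM was found, the sketch below leaves the model out entirely and only shows one possible shape for Step 3: build a prompt from the plain text, send it to a caller-supplied `llm` callable, and parse a JSON reply. Every name here (`build_extraction_prompt`, `summarize_and_extract`, the `llm` parameter, the JSON layout) is an assumption made for illustration, not something this project implemented.

```python
import json
from typing import Any, Callable, Dict, List


def build_extraction_prompt(page_text: str, fields: List[str]) -> str:
    """Build a prompt asking for a summary plus the requested fields."""
    field_list = ", ".join(fields)
    return (
        "Summarize the page text below and extract the requested fields.\n"
        f"Fields: {field_list}\n"
        'Reply with JSON only, shaped like {"summary": "...", "fields": {...}}.\n\n'
        f"Page text:\n{page_text}"
    )


def summarize_and_extract(
    page_text: str,
    fields: List[str],
    llm: Callable[[str], str],
) -> Dict[str, Any]:
    """Ask the model for a summary and field values, then normalise the reply.

    llm is a hypothetical callable (prompt in, raw text out); any hosted or
    local model could sit behind it. The reply is assumed to be valid JSON.
    """
    reply = llm(build_extraction_prompt(page_text, fields))
    parsed = json.loads(reply)
    extracted = parsed.get("fields", {})
    return {
        "summary": parsed.get("summary", ""),
        # Fields the model did not return are reported as None.
        "fields": {name: extracted.get(name) for name in fields},
    }
```

Keeping the model behind a plain callable means the surrounding code can be tested with a stub and the model swapped later without touching the parsing logic.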
