Work on this project has been discontinued. It was started as an attempt to complete Taiyo's assessment, but unfortunately it is being discontinued for the following reasons:
- The assessment asked for "smart scraping" with an LLM: scraping a website without changing the source code, even if the website's HTML changes or the website itself changes.
- As of now, even after thorough research, no free LLM was found that can pinpoint the selector needed for extraction from a website.
Please note that this project may or may not be continued. Updates will be posted as they become available.
Some related articles/snippets/notebooks that I found useful:
- https://scrapingant.com/blog/web-scraping-without-getting-blocked
- https://scrapingant.com/blog/how-to-crawl-website-without-getting-blocked
- https://www.fahdmirza.com/2024/05/how-to-scrape-websites-for-free-with-ai.html?m=1
- https://eyurtsev.github.io/kor/tutorial.html
- https://github.com/sid-am-ahd935/entities-extraction-web-scraper/blob/main/ai_extractor.py
- https://colab.research.google.com/drive/1vzZAL2Zy6NS_0LzexhzsJQaxaexZWFjO?authuser=1 [from a YouTube video of the same name]
- https://colab.research.google.com/drive/1oDOVdRXl4_DWH56VKwOKKCA5JxBAMYsL?authuser=1 [from a certain article]
- https://huggingface.co/LLukas22/gpt4all-lora-quantized-ggjt
- https://github.com/nomic-ai/gpt4all
- https://colab.research.google.com/drive/1Ctkuh0Aq6sgJTV2vWfZ_N06eO6dJAPKm?authuser=1
- https://colab.research.google.com/drive/1NaEyuFWCkDtkufHIsWQhfgPLHybXeYA1?usp=sharing#scrollTo=SVOXzjaNsVb8
- https://medium.com/@onkarmishra/using-langchain-for-question-answering-on-own-data-3af0a82789ed
[general description]
This project comprises the following steps:
[More details to be added]
Step 2) Making an extractor that takes in .html files or screenshots of a website and parses them into plain text
[More details to be added]
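The .html half of Step 2 can be sketched with Python's standard library alone. This is a minimal illustration, not the project's actual extractor: the names `TextExtractor` and `html_to_text` are hypothetical, and the screenshot case (which would need OCR) is not covered.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped tag
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped tags.
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def get_text(self):
        return "\n".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML string, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.get_text()
```

To use it on a saved page, read the file into a string first and pass that to `html_to_text`. Because the output depends only on text nodes and not on tag names or classes, it is less sensitive to layout changes than a selector-based scraper.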
Step 3) Making an LLM that takes in plain text and constructs a summary and a list of the information required to be extracted
[More details to be added]
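One way Step 3 could be wired up is sketched below. It only builds the prompt and parses the reply; the call to the model itself is left out because no suitable free LLM was settled on. The function names, the prompt wording, the JSON reply format, and the `max_chars` cutoff are all assumptions made for illustration.

```python
import json


def build_extraction_prompt(page_text, fields, max_chars=4000):
    """Build a prompt asking for a summary plus the requested fields (hypothetical format)."""
    field_list = "\n".join(f"- {name}" for name in fields)
    return (
        "Summarize the page text below in one or two sentences, then extract "
        "the requested fields.\n"
        'Reply with a single JSON object with keys "summary" and "fields".\n'
        f"Requested fields:\n{field_list}\n"
        # Truncate long pages so the prompt stays within a small context window.
        f"Page text:\n{page_text[:max_chars]}"
    )


def parse_llm_reply(reply, fields):
    """Pull the JSON object out of a model reply; missing fields become None."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in reply")
    data = json.loads(reply[start:end + 1])
    extracted = data.get("fields", {})
    return {
        "summary": data.get("summary", ""),
        "fields": {name: extracted.get(name) for name in fields},
    }
```

The prompt string would be sent to whichever model is available, and the raw reply passed to `parse_llm_reply`. Slicing from the first `{` to the last `}` tolerates models that wrap the JSON in extra chatter, which small local models often do.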