The goal of this module is to allow Botpress to answer questions from textual documents in an unsupervised way.
Eventually, we should be able to give it a website URL, PDFs, or other documents, and it will act like a custom search engine.
It is written in Python and will run in a Docker container, which will then be integrated with a Botpress module calling the API.
- Precomputed datasets are in the dataset folder.
- The embedder folder is a wrapper for the deep learning models (embedding and QA).
- The indexer folder is responsible for all the preprocessing.
- The qa folder is responsible for all the retrieval / inference.
- `config.py` stores all the useful global variables, like model names.
- `datatypes.py` stores all datatypes used in the functions for type hints.
- `utils.py` provides some global standalone functions, like sanitizing/hashing text, or math helpers such as cosine similarity.
- `test.py` is made for development only, to be sure that the code runs without going through all the interactive Streamlit parts.
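To make the role of `utils.py` more concrete, here is a minimal sketch of what such standalone helpers could look like. The function names (`sanitize_text`, `hash_text`, `cosine_similarity`), the normalization rules, and the choice of SHA-256 are assumptions made for illustration; the actual helpers in the module may differ.

```python
import hashlib
import math
import re
import unicodedata
from typing import Sequence


def sanitize_text(text: str) -> str:
    """Normalize Unicode, collapse runs of whitespace, and lowercase (assumed rules)."""
    normalized = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", normalized).strip().lower()


def hash_text(text: str) -> str:
    """Return a stable hex digest of the sanitized text, usable as a document ID."""
    return hashlib.sha256(sanitize_text(text).encode("utf-8")).hexdigest()


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Hashing the sanitized text rather than the raw text means that two chunks differing only in case or spacing get the same identifier, which can help avoid indexing duplicates.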
- Make sure you have Python 3.8 or later.
- Optional but preferred: create a virtual environment.
- Install all the dependencies with `pip install -r requirements.txt`.
- Run the code with `streamlit run pipeline.py`. It will open a tab in your default browser.
- Make sure Elasticsearch is running; the Python code will connect to it with the defaults (localhost:9200).
- The first time you run it, the database will be computed when you click on the `ask` button, which takes time (more than 5 min on CPU). Subsequent questions will use the same database, so they will be fast.

N.B.: Each time you ask the first question (after the database has been created), all the models need to be loaded, so expect ~50 s of overhead. Subsequent questions are then fast.
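The "slow first question, fast afterwards" behaviour is typical of lazily loaded models that are cached after the first use. The sketch below illustrates that pattern with `functools.lru_cache`; it is not necessarily how this module does it. `load_model`, `answer`, and the model name `"qa-model"` are hypothetical, and the `time.sleep` call only stands in for the real loading cost.

```python
import time
from functools import lru_cache


@lru_cache(maxsize=None)
def load_model(model_name: str) -> dict:
    """Hypothetical loader: stands in for the slow, one-time model loading."""
    time.sleep(0.05)  # placeholder for the real loading cost (tens of seconds)
    return {"name": model_name}


def answer(question: str) -> str:
    """Hypothetical entry point: reuses the cached model after the first call."""
    model = load_model("qa-model")  # "qa-model" is a made-up name
    return f"[{model['name']}] {question}"
```

Only the first call to `answer` pays the loading cost; later calls within the same process reuse the cached object.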
- Preprocess
  - Clean the HTML data (already parsed into a tree with children; a Scrapy-based parser may be added later)
  - Chunk the documents into pieces
  - Compute useful metadata
  - Index these chunks with their metadata in a database
- Retrieving
  - Query the database with information from Botpress (such as topics) to retrieve the X most relevant documents
  - Among those documents, select the best sentence to answer the question
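The two stages above can be sketched end to end as follows. This is a toy illustration under strong assumptions: a bag-of-words count vector stands in for the real embedding model, a plain Python list stands in for the Elasticsearch index, and all names (`chunk_document`, `retrieve`, `best_sentence`, the `topic` field) are hypothetical rather than taken from the module.

```python
import hashlib
import math
import re
from collections import Counter
from typing import Dict, List


def embed(text: str) -> Counter:
    """Toy stand-in for a real embedding: a bag-of-words count vector."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter on ., ! and ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def chunk_document(text: str, topic: str, sentences_per_chunk: int = 2) -> List[Dict]:
    """Preprocess: split a cleaned document into chunks and attach simple metadata."""
    sentences = split_sentences(text)
    chunks = []
    for start in range(0, len(sentences), sentences_per_chunk):
        content = " ".join(sentences[start:start + sentences_per_chunk])
        chunks.append({
            "id": hashlib.sha256(content.encode("utf-8")).hexdigest()[:12],
            "topic": topic,  # metadata later used to filter the query
            "position": start // sentences_per_chunk,
            "content": content,
        })
    return chunks


def retrieve(index: List[Dict], question: str, topic: str, top_x: int = 2) -> List[Dict]:
    """Retrieving, step 1: filter by topic, keep the top_x most similar chunks."""
    query = embed(question)
    candidates = [c for c in index if c["topic"] == topic]
    ranked = sorted(candidates, key=lambda c: cosine(query, embed(c["content"])), reverse=True)
    return ranked[:top_x]


def best_sentence(chunks: List[Dict], question: str) -> str:
    """Retrieving, step 2: pick the sentence most similar to the question."""
    query = embed(question)
    sentences = [s for c in chunks for s in split_sentences(c["content"])]
    return max(sentences, key=lambda s: cosine(query, embed(s)), default="")


# Made-up example document, used only to exercise the sketch.
doc = ("Wash your hands often with soap. Masks are mandatory in public transport. "
       "Schools reopen in September. Call your doctor if you have a fever.")
question = "Are masks mandatory in public transport?"
index = chunk_document(doc, topic="covid")
print(best_sentence(retrieve(index, question, topic="covid"), question))
```

In the real pipeline, the list would be replaced by the Elasticsearch index and the sentence selection by the QA model; the sketch only shows how the stages hand data to each other.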
- We assume the query is a short question
- We assume only one language (French for now)
- We assume the query is in the scope of the documents (COVID-related)
- We assume there's always a relevant document for the query
- Full pipeline along with interfaces between components
- Retriever yields decent results when querying manually
- Build a retriever dataset & measure retriever performance
- TBD