BirdiDQ is an intuitive, user-friendly data quality application that lets you run data quality checks with natural language queries, built on top of the open source Python library Great Expectations (GE). Type in your request, and BirdiDQ generates the appropriate GE method, runs the quality check, and returns the results along with the data docs you need. Demo Video
BirdiDQ is an open source project and is still under development. Contributions are welcome!
- 🔍 Data Exploration Made Easy: Quickly and interactively explore your data using a range of features like filters, comparisons, and more. Uncover hidden insights and make informed decisions with confidence.
- 🎯 Natural Language Processing: Speak BirdiDQ's language! No technical expertise required. Simply type in your queries, and BirdiDQ converts them into Great Expectations methods using a fine-tuned Large Language Model, saving you time and effort.
- ⚡ Instant Results: Run comprehensive data quality checks on your selected data sources, and get instant feedback on data inconsistencies. BirdiDQ ensures that your data is reliable and trustworthy.
- 📧 Automated Email Alerts: Reach out to the data owner directly through the app by emailing them the detailed data quality report generated by Great Expectations.
- Generative AI Models: Uses LLMs fine-tuned on custom expectations data.
BirdiDQ is an LLM-powered app built using:
- Streamlit
- Great Expectations
- Fine-tuned LLMs:
  - Falcon-7B: a 7-billion-parameter causal decoder-only model, fine-tuned on custom data with the QLoRA approach.
  - OpenAI GPT-3: also fine-tuned on the same data.
To run BirdiDQ, you need to perform the following steps:
- Clone the repository locally:
  git clone https://github.com/BirdiD/BirdiDQ.git
- (Recommended) Create a virtual environment and activate it:
  python3 -m venv bir_env
  source bir_env/bin/activate
- Install the required dependencies:
  pip install -r requirements.txt
- Run the app:
  streamlit run great_expectations/app.py
Note: BirdiDQ can use OpenAI's ChatGPT or the Falcon LLM to convert natural language descriptions into expectations. If you plan to use Falcon, consider using PyTorch with GPU support for better performance. To install PyTorch with CUDA support, follow the instructions for your operating system on the PyTorch website.
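To confirm that PyTorch can actually see a CUDA GPU before choosing the Falcon option, a quick check along these lines may help. This is a minimal sketch, not code from the BirdiDQ repository: the helper name `pick_device` is hypothetical, and it falls back to the CPU when PyTorch is not installed.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch can see a CUDA GPU, otherwise "cpu".

    Hypothetical helper for illustration; not part of BirdiDQ.
    """
    try:
        import torch  # optional dependency, only needed for the Falcon path
    except ImportError:
        # PyTorch is not installed, so only CPU execution is possible.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print(f"Falcon inference would run on: {pick_device()}")
```

If this reports `cpu` on a machine that has an NVIDIA GPU, the installed PyTorch build most likely lacks CUDA support.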
Falcon-7B is an open source large language model (LLM) that BirdiDQ can use to convert natural language descriptions into Great Expectations expectations. To use the current fine-tuned Falcon-7B model, your system must meet one of the following minimum requirements:
- Without a GPU: at least 16 GB of RAM to load the model into memory. Inference will be very slow.
- With a GPU: at least 16 GB of VRAM to load the model into memory. Inference will be faster.
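A rough back-of-the-envelope calculation shows where a figure like 16 GB can come from. The sketch below assumes about 7 billion parameters and counts only the memory needed for the weights; the precision actually used by the fine-tuned model is not stated here, so the 16-bit case is an assumption, and activations and runtime overhead come on top.

```python
def weights_size_gb(n_params: float, bytes_per_param: float) -> float:
    """Approximate memory needed to hold the model weights, in gigabytes."""
    return n_params * bytes_per_param / 1e9


# Assumption: roughly 7 billion parameters.
FALCON_7B_PARAMS = 7e9

if __name__ == "__main__":
    # Bytes per parameter for a few common storage formats.
    for label, nbytes in [("float32", 4), ("float16/bfloat16", 2), ("4-bit", 0.5)]:
        size = weights_size_gb(FALCON_7B_PARAMS, nbytes)
        print(f"{label:>17}: about {size:.1f} GB for weights alone")
```

Under these assumptions, 16-bit weights take about 14 GB, which leaves little headroom on a 16 GB device.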
Here are some example queries you can try with BirdiDQ:
- Ensure that at least 80% of the values in the country column are not null.
- Check that none of the values in the address column match the pattern for an address starting with a digit.
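To make the two example queries concrete, the sketch below shows the Great Expectations methods they plausibly correspond to, along with a plain-Python imitation of what each check means. The exact method the fine-tuned model generates may differ; `validator` is a hypothetical variable name, and the helper functions are illustrative stand-ins rather than the Great Expectations implementation.

```python
import re

# Plausible Great Expectations equivalents of the two queries (assumed, not
# taken from BirdiDQ's actual output):
#   validator.expect_column_values_to_not_be_null(column="country", mostly=0.8)
#   validator.expect_column_values_to_not_match_regex(column="address", regex=r"^\d")


def expect_not_null(values, mostly=1.0):
    """Pass when at least a `mostly` fraction of the values are not None."""
    non_null = sum(v is not None for v in values)
    return non_null / len(values) >= mostly


def expect_not_match_regex(values, regex):
    """Pass when no non-null value matches the pattern; nulls are skipped."""
    pattern = re.compile(regex)
    return all(pattern.search(v) is None for v in values if v is not None)


# Made-up sample data for illustration only.
countries = ["FR", "SN", None, "US", "DE", "ML"]
addresses = ["12 Main St", "Rue de Paris 4", None]

if __name__ == "__main__":
    print("country not null (mostly=0.8):", expect_not_null(countries, mostly=0.8))
    print("address does not start with a digit:", expect_not_match_regex(addresses, r"^\d"))
```

The `mostly` argument is what turns "at least 80%" into a threshold, so a column with a few missing values can still pass.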
BirdiDQ connects to a range of tools and services:
- Filesystem
  - Support Local Filesystem with Pandas
  - Support Local Filesystem with Spark
- Database
  - Support PostgreSQL
  - Support BigQuery
  - Support Snowflake
  - Support Amazon Athena
  - Support AWS Redshift
- Cloud
  - Connect to data on Amazon S3 using Pandas
  - Connect to data on Azure Blob storage using Pandas
  - Connect to data on GCS using Pandas