This app/package has been created by the Data Science Campus as a proof of concept to evaluate the potential of Large Language Models (LLMs) to assist with classification coding. It uses the LangChain library to perform Retrieval Augmented Generation (RAG) based on the provided classification index. Standard Industrial Classification (SIC) coding has been used as the primary test case; see the method explanation. An example deployment using Streamlit allows for interactive exploration of the model's capabilities.
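To make the RAG idea concrete, here is a minimal, dependency-free sketch of the retrieve-then-prompt pattern. It is not the package's implementation: the package uses LangChain with an embedding model, whereas this toy swaps in bag-of-words cosine similarity, and `INDEX`, `retrieve` and `build_prompt` are hypothetical names with a hand-picked, illustrative index.

```python
import math
import re
from collections import Counter

# Illustrative index entries (code, description); an assumption, not the example_data files.
INDEX = [
    ("01", "Crop and animal production, hunting and related service activities"),
    ("56", "Food and beverage service activities"),
    ("62", "Computer programming, consultancy and related activities"),
]


def tokens(text: str) -> Counter:
    """Lowercase bag-of-words counts (a stand-in for a real embedding)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, k: int = 2) -> list[tuple[str, str]]:
    """Return the k index entries most similar to the query."""
    q = tokens(query)
    ranked = sorted(INDEX, key=lambda entry: cosine(q, tokens(entry[1])), reverse=True)
    return ranked[:k]


def build_prompt(query: str, k: int = 2) -> str:
    """Place the retrieved candidates in the prompt that would be sent to an LLM."""
    candidates = "\n".join(f"{code}: {desc}" for code, desc in retrieve(query, k))
    return (
        "Choose the best matching code for the description below.\n"
        f"Candidates:\n{candidates}\n"
        f"Description: {query}\n"
    )


print(build_prompt("software programming consultancy"))
```

The point of the pattern is that the LLM only sees a short list of retrieved candidates rather than the whole index, which keeps the prompt small and grounds the answer in the provided classification.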
Examples of simplified SIC, Standard Occupational Classification (SOC) and Classification of Individual Consumption According to Purpose (COICOP) are included in the example_data folder. These condensed indices are flattened subsets of more detailed indices officially published online, such as the UK SIC 2007, UK SOC 2020, and COICOP 2018 (pdf).
⚠️ Warning: The example data is provided for demonstration purposes only. No guarantee is given that it is accurate or up to date.
In this project, we focused on the SIC. A flexible representation of this hierarchical index (including metadata) has been implemented within the data_models submodule, enabling enhanced context for RAG/LLM. This representation can be used independently for other SIC coding tasks or easily extended to accommodate different classification indices.
The SIC index hierarchy object is built using three data sources provided by ONS:
- Published UK SIC summary of structure worksheet (xlsx) - location needs to be specified in config
- UK SIC2007 indexes with addendum December 2022 (xlsx) - location needs to be specified in config
- SIC resource file by ONSdigital/dp-classification-tools (js) - included inside the package
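As a rough illustration of what a hierarchical index representation can look like, the sketch below builds a small tree and collects the chain of ancestors for a code, which is the kind of context that can be passed to an LLM. `ClassificationNode` and its methods are hypothetical names chosen for this example; they are not the classes in the data_models submodule, and the codes and titles are illustrative only.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ClassificationNode:
    """One level of a hierarchical index (hypothetical structure)."""

    code: str
    title: str
    children: list[ClassificationNode] = field(default_factory=list)

    def add(self, child: ClassificationNode) -> ClassificationNode:
        """Attach a child node and return it so calls can be chained."""
        self.children.append(child)
        return child

    def path_to(self, code: str) -> list[ClassificationNode]:
        """Return the nodes from this node down to `code`, or [] if absent."""
        if self.code == code:
            return [self]
        for child in self.children:
            sub_path = child.path_to(code)
            if sub_path:
                return [self] + sub_path
        return []


# Illustrative fragment: section > division > group > class.
section = ClassificationNode("A", "Agriculture, forestry and fishing")
division = section.add(
    ClassificationNode("01", "Crop and animal production, hunting and related service activities")
)
group = division.add(ClassificationNode("011", "Growing of non-perennial crops"))
group.add(
    ClassificationNode("0111", "Growing of cereals (except rice), leguminous crops and oil seeds")
)

# Flatten the ancestor chain into one line of context.
context = " > ".join(f"{node.code} {node.title}" for node in section.path_to("0111"))
print(context)
```

Keeping parent titles alongside a leaf code is one way to give a model more context than the leaf description alone.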
It is recommended that you install the project with its required dependencies in a virtual environment. When the virtual environment is activated, any subsequent Python commands will use the Python interpreter and libraries specific to that isolated environment. This ensures that the project uses the correct versions of the dependencies specified in its requirements.
Create and activate a new virtual environment on Linux/OS X:
python3.10 -m venv .venv
source .venv/bin/activate
Update pip and install requirements:
python -m pip install --upgrade pip
python -m pip install -e ".[dev]"
The -e flag installs the project in "editable" mode, which means that any changes made to the project code will be reflected immediately without the need to reinstall. The ".[dev]" part specifies that both the regular requirements and the development requirements should be installed.
The package provides code to use popular LLMs; access to these LLMs is a prerequisite for use. Depending on your choice, keys/credentials may need to be added, for example:
- Include a personal OpenAI API key in .env as
OPENAI_API_KEY="<your key>"
- Authenticate for Vertex AI:
gcloud config set project "<PROJECT_ID>"
gcloud auth application-default login
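A quick pre-flight check can catch a missing key before any model call is attempted. The sketch below assumes the key has already been exported into the process environment; how the package itself loads .env is not shown here, and `missing_credentials` is a hypothetical helper, not part of sic-soc-llm.

```python
import os


def missing_credentials(required=("OPENAI_API_KEY",), env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_credentials()
if missing:
    print("Missing credentials:", ", ".join(missing))
else:
    print("All required credentials are set.")
```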
Examples of how to use the sic-soc-llm package can be found in Tutorials and References.
The sic-soc-llm package uses a configuration file in TOML format to specify the paths to the data files and the names of the models to use. An example configuration file is provided in sic_soc_llm_config.toml and is read by the get_config function. The following fields are required:
Field | Type | Default value |
---|---|---|
[lookups] | ||
sic_structure | str | "data/sic-index/publisheduksicsummaryofstructureworksheet.xlsx" |
sic_index | str | "data/sic-index/uksic2007indexeswithaddendumdecember2022.xlsx" |
sic_condensed | str | "sic_2d_condensed.txt" |
soc_condensed | str | "soc_4d_condensed.txt" |
coicop_condensed | str | "coicop_5d_condensed.txt" |
[llm] | ||
db_dir | str | "data/sic-index/db" |
embedding_model_name | str | "all-MiniLM-L6-v2" |
llm_model_name | str | "gemini-pro" |
Make sure to update the file paths and model names according to your specific setup. While the condensed indexes (.txt) are included in the package, the .xlsx files need to be downloaded from the ONS website (mentioned above) and placed in the specified locations.
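For reference, a configuration with the default values from the table can be parsed and checked for completeness as sketched below. This is not the package's get_config function; `EXAMPLE_CONFIG`, `REQUIRED_FIELDS` and `missing_fields` are hypothetical names. The standard-library tomllib module needs Python 3.11+, so the sketch falls back to the tomli backport on Python 3.10, which is assumed to be installed in that case.

```python
try:
    import tomllib  # standard library from Python 3.11
except ModuleNotFoundError:
    import tomli as tomllib  # assumption: tomli backport installed on Python 3.10

# Default values copied from the table above.
EXAMPLE_CONFIG = """
[lookups]
sic_structure = "data/sic-index/publisheduksicsummaryofstructureworksheet.xlsx"
sic_index = "data/sic-index/uksic2007indexeswithaddendumdecember2022.xlsx"
sic_condensed = "sic_2d_condensed.txt"
soc_condensed = "soc_4d_condensed.txt"
coicop_condensed = "coicop_5d_condensed.txt"

[llm]
db_dir = "data/sic-index/db"
embedding_model_name = "all-MiniLM-L6-v2"
llm_model_name = "gemini-pro"
"""

REQUIRED_FIELDS = {
    "lookups": [
        "sic_structure",
        "sic_index",
        "sic_condensed",
        "soc_condensed",
        "coicop_condensed",
    ],
    "llm": ["db_dir", "embedding_model_name", "llm_model_name"],
}


def missing_fields(config: dict) -> list[str]:
    """List required 'section.field' entries that are absent from the config."""
    return [
        f"{section}.{key}"
        for section, keys in REQUIRED_FIELDS.items()
        for key in keys
        if key not in config.get(section, {})
    ]


config = tomllib.loads(EXAMPLE_CONFIG)
print("Missing fields:", missing_fields(config))
print("LLM model:", config["llm"]["llm_model_name"])
```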
To run the Streamlit app, use the following command:
streamlit run app/Welcome.py --server.port 8500
The app will be available at http://localhost:8500/.
Example commands used to build and deploy the app as a GCP Cloud Run service are provided in cloud_deploy.sh (which references Dockerfile and app.yaml). The Dockerfile contains a set of instructions for building a Docker image. It specifies the base image to use, the files and directories to include, the dependencies and the commands to run. The app.yaml file is used to specify the configuration of the Cloud Run service, including the container image to deploy, the service name, and the port to expose.
This repository contains a configuration of pre-commit hooks. If approaching this project as a developer, you are encouraged to install and enable the pre-commit hooks by running the following in your shell:
pip install pre-commit
pre-commit install
To run the unit tests, use the following command:
python -m pytest
- Build (Quarto markdown) reference files from docstrings:
cd docs
python -m quartodoc build
- Render webpage from Quarto markdowns in docs dir (including reference files):
quarto render
The code, unless otherwise stated, is released under the MIT Licence. The documentation for this work is subject to © 2024 Crown Copyright (Office for National Statistics) and is available under the terms of the Open Government Licence v3.0.
At the Data Science Campus we apply data science, and build skills, for public good across the UK and internationally. Get in touch with the Campus at [email protected].