Skip to content

Latest commit

 

History

History
248 lines (195 loc) · 10.5 KB

README.md

File metadata and controls

248 lines (195 loc) · 10.5 KB











GitHub Actions Workflow Status GitHub top language PyPI - Version PyPI - Status Code style: black codecov

Sieve your data: zero-shot NLP made easy.

sieves enables zero- and few-shot NLP tasks with structured generation. With no training required, you can quickly prototype NLP tasks while ensuring reliable, unified output formats.

sieves allows running NLP tasks off the bat with zero- and few-shot models, leveraging structured generation methods.

  • No training needed
  • Consistent structured output
  • Immediate productivity using built-in tasks
  • Flexible customization for your own tasks

A simple example, using outlines for classification:

import outlines

from sieves import Pipeline, engines, tasks, Doc

# 1. Define documents by text or URI.
docs = [Doc(text="Special relativity applies to all physical phenomena in the absence of gravity.")]

# 2. Create engine responsible for generating structured output.
model_name = "HuggingFaceTB/SmolLM-135M-Instruct"
engine = engines.outlines_.Outlines(model=outlines.models.transformers(model_name))

# 3. Create pipeline with tasks.
pipe = Pipeline(
    [
        # 4. Run classification on provided document.
        tasks.predictive.Classification(labels=["science", "politics"], engine=engine),
    ]
)

# 5. Run pipe and output results.
docs = list(pipe(docs))
print(docs[0].results["Classification"])
A more involved example

Here we parse a PDF with docling, chunk it with chonkie, and classify it with gliclass:

import os
import pickle

import transformers
import gliclass
import chonkie
import tokenizers
import dspy

from sieves import Pipeline, engines, tasks, Doc

# 1. Define documents by text or URI.
docs = [Doc(uri="https://arxiv.org/pdf/2408.09869")]

# 2. Create engine responsible for generating structured output.
model_name = "knowledgator/gliclass-small-v1.0"
pipeline = gliclass.ZeroShotClassificationPipeline(
    gliclass.GLiClassModel.from_pretrained(model_name),
    transformers.AutoTokenizer.from_pretrained(model_name),
    classification_type="multi-label",
)
engine = engines.glix_.GliX(model=pipeline)
    
# 3. Create pipeline with tasks.
pipe = Pipeline(
    [
        # 4. Add document parsing task.
        tasks.parsers.Docling(),
        # 5. Add chunking task to ensure we don't exceed our model's context window.
        tasks.chunkers.Chonkie(chonkie.TokenChunker(tokenizers.Tokenizer.from_pretrained(model_name))),
        # 6. Run classification on provided document.
        tasks.predictive.Classification(task_id="classifier", labels=["science", "politics"], engine=engine),
    ]
)

# 7. Run pipe and output results.
docs = list(pipe(docs))
print(docs[0].results["classifier"])

# 8. Serialize pipeline and docs.
pipe.dump("pipeline.yml")
with open("docs.pkl", "wb") as f:
    pickle.dump(docs, f)

# 9. To load a pipeline and docs from disk:
loaded_pipe = Pipeline.load(
    "pipeline.yml",
    (
        {},
        {"chunker": chonkie.TokenChunker(tokenizers.Tokenizer.from_pretrained("gpt2"))},
        {"engine": {"model": dspy.LM("claude-3-haiku-20240307", api_key=os.environ["ANTHROPIC_API_KEY"])}},
    ),
)
with open("docs.pkl", "rb") as f:
    loaded_docs = pickle.load(f)

Features

  • 🎯 No training: All tasks run with structured zero-/few-shot models
  • 🤖 Integration and unified usage of multiple structured generation libraries:
  • ▶️ Pipeline-based system for easy observability and debugging
  • 🛠️ Integrated utilities convenient in an NLP pipeline
  • 🏷️ Prebuilt tasks ready to use out-of-the-box
    • Classification
    • Information extraction
    • TBD: NER, entity linking, summarization, translation, ...
  • 💾 Serializability (TBD)

Why sieves?

Even with generative AI, structured outputs and observability remain essential.

In many real-life use-cases, there is a need for prototyping solutions that work with very little data. Generative language models fill this gap nicely. Getting them to output clean, structured data can be tricky though. This is addressed by tooling for structured/steered language model output: outlines, dspy, ollama and others. Each of these libraries has its own patterns, pros and cons. sieves offers a unified interface for usage, in- and output.

Beyond that, we often encounter the same set of steps to get a prototype off the ground - such as parsing and chunking documents, exporting results in a dataset format that can be used to fine-tune models, and experimenting with different ways of prompting/encouraging the model to structure its output. All of these things are solved by existing libraries in the NLP ecosystem (think docling or Hugging Face's datasets.

What sieves aims to do is to bundle all of this into one library that enables you to quickly and painlessly build prototypes of NLP-heavy applications. Built on zero- and few-shot models and a number of pre-implemented tasks you can use right without having to build your own.

sieves has been partially inspired by spacy, especially spacy-llm.

Core Abstractions

We intend to keep the API is minimalist and modular as possible. When using sieves, you'll need to know about the following abstrations:

Pipeline

Responsible for coordinating the execution of tasks. Typical interactions with a Pipeline instance:

  • Instatiate pipeline
  • Add tasks to pipeline
  • Run pipeline
  • Serialize pipeline (config)

Doc

A Doc instance represents one document and contains information about its content, URI (if available) and potential meta-information. The sieves workflow pipeline revolves around Doc instances - information is propagated throughout the pipeline via Doc instances being passed from task to task.

Important

The length of your Docs might exceed your models context length. Make sure you run a chunking task in your pipeline to avoid running into problems with the context length.

Engine

An Engine wraps an existing library designed for structured generation with zero-/few-shot models. It coordinates setting up executing prompts and fetching output from the model and passes it back to be converted into a unified data format.

Warning

Engines might fail in producing structured output. Validation ensures you won't end up with improperly structured results, but you might end up with a pipeline failure. The risk for this correlates positively with the complexity of the expected response type and negatively with the capability of the used model.

Task

A Task implements a given NLP task (such as classification, NER, information extraction, ...) for at least one engine. In itself a Task implementation is usually very concise, as most of the logic will be implemented in Bridge classes, which implement task logic with respect to a certain engine. I.e.: one Task provides one or more Bridge classes, which implement a task for one Engine.

Tip

Some tasks might work better with one engine than the other. When implementing your own tasks, make sure to experiment to figure out which engine delivers the best performance.

Bridge

A Bridge class connects one Task with one Engine. E.g.: a classification task will require a different implementation when run against outlines then when run against dspy or ollama. This is the responsibility of a Bridge implementation. Bridge implements functionality such as specifying a prompt template or the prompt's expected output type.

Note that the Bridge abstraction is not relevant if you're just using sieves, as it's a part of a Tasks composition. It is however relevant when implementing your own task, as you will need at least one Bridge for any given Task.

Frequently Asked Questions

Show FAQs

What's the meaning behind the name?

Originally, sieves was intended for information extraction. The name comes from gold panning: run your raw data through a sieve to obtain structured, refined “gold.”

Why not just prompt an LLM directly?

You can - but sieves offers: - Structured data output. Zero-/few-shot LLMs can be finicky without guardrails or parsing. - A step-by-step pipeline, making it easier to debug and track each stage.

Why use sieves and not a structured generation library like outlines?

This is not either-or - sieves includes outlines (among others), plus: - A uniform input/output format - Prebuilt NLP tasks - Pipeline logic with chunking and chunk consolidation - Easy switching between structured generation engines - Tools for file parsing, chunking, exporting data for fine-tuning


Source for sieves icon: Sieve icons created by Freepik - Flaticon.