Releases: deepset-ai/haystack
v1.8.0
⭐ Highlights
This release comes with a bunch of new features, improvements and bug fixes. Let us know how you like it on our brand new Haystack Discord server! Here are the highlights of the release:
Pipeline Evaluation in Batch Mode #2942
The evaluation of pipelines often uses large datasets, and with this new feature, batches of queries can be processed at the same time on a GPU. This decreases the time needed for an evaluation run, and we are working on further speed improvements. To try it out, you only need to replace the call to `pipeline.eval()` with `pipeline.eval_batch()` when you evaluate your question answering pipeline:
...
pipeline = ExtractiveQAPipeline(reader=reader, retriever=retriever)
eval_result = pipeline.eval_batch(labels=eval_labels, params={"Retriever": {"top_k": 5}})
Early Stopping in Reader and Retriever Training #3071
When training a reader or retriever model, you need to specify the number of training epochs. If the model doesn't improve any further after the first few epochs, the training usually still continues for the rest of the specified number of epochs. Early Stopping can now automatically monitor how much the model improves during training and stop the process when there is no significant improvement. Various metrics can be monitored, including `loss`, `EM`, `f1`, and `top_n_accuracy` for `FARMReader`, or `loss`, `acc`, `f1`, and `average_rank` for `DensePassageRetriever`. For example, reader training can be stopped when `loss` doesn't decrease by at least 0.001 compared to the previous epoch:
from haystack.nodes import FARMReader
from haystack.utils.early_stopping import EarlyStopping
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2-distilled")
reader.train(data_dir="data/squad20", train_filename="dev-v2.0.json", early_stopping=EarlyStopping(min_delta=0.001), use_gpu=True, n_epochs=8, save_dir="my_model")
PineconeDocumentStore Without SQL Database #2749
Thanks to @jamescalam, the `PineconeDocumentStore` no longer depends on a local SQL database. So when you initialize a `PineconeDocumentStore` from now on, all you need to provide is a Pinecone API key:
from haystack import Document
from haystack.document_stores import PineconeDocumentStore
document_store = PineconeDocumentStore(api_key="...")
docs = [Document(content="...")]
document_store.write_documents(docs)
FAISS in OpenSearchDocumentStore #3101 #3029
OpenSearch supports different approximate k-NN libraries for indexing and search. In Haystack's `OpenSearchDocumentStore`, you can now set the `knn_engine` parameter to choose between `nmslib` and `faiss`. When loading an existing index, you can also specify a `knn_engine`, and Haystack checks whether the same engine was used to create the index. If not, it falls back to slow exact vector calculation.
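For example, selecting the FAISS engine when creating the document store could look like this (a minimal sketch; the connection settings are placeholders, not from the release notes):
from haystack.document_stores import OpenSearchDocumentStore
# knn_engine selects the approximate k-NN library OpenSearch uses for this index
document_store = OpenSearchDocumentStore(
    host="localhost",    # placeholder connection settings
    index="my_index",
    knn_engine="faiss",  # or "nmslib"
)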
Highlighted Bug Fixes
A bug was fixed that prevented users from loading private models in some components because the authentication token wasn't passed on correctly. A second bug was fixed in the schema files affecting parameters of type `Optional[List[]]`, where validation failed if the parameter was explicitly set to `None`.
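For example, loading a private model now works consistently when you pass the token (a sketch; the model name is a placeholder for any private Hub model):
from haystack.nodes import FARMReader
# use_auth_token is now passed on in all cases when loading from the HF Hub
reader = FARMReader(model_name_or_path="your-org/private-qa-model", use_auth_token=True)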
- fix: Use use_auth_token in all cases when loading from the HF Hub by @sjrl in #3094
- bug: handle `Optional` params in schema validation by @anakin87 in #2980
Other Changes
DocumentStores
Documentation
- refactor: rename `master` into `main` in documentation and links by @ZanSara in #3063
- docs: fixed typo (or old documentation) in ipynb tutorial 3 by @DavidGerva in #3033
- docs: Add OpenAI Answer Generator API by @brandenchan in #3050
Crawler
- fix: update ChromeDriver options on restricted environments and add ChromeDriver options as function parameter by @danielbichuetti in #3043
- fix: Crawler quits ChromeDriver on destruction by @danielbichuetti in #3070
Other Changes
- fix(translator): write translated text to output documents, while keeping input untouched by @danielbichuetti in #3077
- test: Use `random_sample` instead of `ndarray` for random array in `OpenSearchDocumentStore` test by @bogdankostic in #3083
- feat: add progressbar to upload_files() for deepset Cloud client by @tholor in #3069
- refactor: update package metadata by @ofek in #3079
New Contributors
- @DavidGerva made their first contribution in #3033
- @ofek made their first contribution in #3079
❤️ Big thanks to all contributors and the whole community!
Full Changelog: v1.7.1...v1.8.0
v1.7.1
Patch Release
Main Changes
Other Changes
- fix: pin version of pyworld to `0.2.12` by @sjrl in #3047
- test: update filtering of Pinecone mock to imitate doc store by @jamescalam in #3020
Full Changelog: v1.7.0...v1.7.1
v1.7.0
⭐ Highlights
This time we have a couple of smaller yet important feature highlights: lots of them coming from you, our amazing community!
🥂 Alongside that, as we notice more frequent and great contributions from our community, we are also announcing our brand new Haystack Discord server to help us interact better with the people that make Haystack what it is! 🥳
Here's what you'll find in Haystack 1.7:
Support for OpenAI GPT-3
If you always wanted to know how OpenAI's famous GPT-3 model compares to other models, now your time has come. It's been fully integrated into Haystack, so you can use it like any other model. Just sign up to OpenAI, copy your API key from here, and run the following code. To compare it to other models, check out our evaluation guide.
from haystack.nodes import OpenAIAnswerGenerator
from haystack import Document
reader = OpenAIAnswerGenerator(api_key="<your-api-token>", max_tokens=15, temperature=0.3)
docs = [Document(content="""The Big Bang Theory is an American sitcom.
The four main characters are all avid fans of nerd culture.
Among their shared interests are science fiction, fantasy, comic books and collecting memorabilia.
Star Trek in particular is frequently referenced""")]
res = reader.predict(query="Do the main characters of big bang theory like Star Trek?", documents=docs)
print(res)
Zero-Shot Query Classification
Until now, `TransformersQueryClassifier` was built very closely around the excellent binary query-type classifier model of shahrukhx01. Although it was already possible to use other Transformer models, the choice was restricted to models that output binary labels. One of our amazing community contributions has now lifted this restriction.
But that's not all: @anakin87 added support for zero-shot classification models as well!
So now that you're completely free to choose the classification categories you want, you can let your creativity run wild. One thing you could do is customize the behavior of your pipeline based on the semantic category of the query, like this:
from haystack.nodes import TransformersQueryClassifier
# In zero-shot-classification, you are free to choose the labels
labels = ["music", "cinema", "food"]
query_classifier = TransformersQueryClassifier(
    model_name_or_path="typeform/distilbert-base-uncased-mnli",
    use_gpu=True,
    task="zero-shot-classification",
    labels=labels,
)
queries = [
    "In which films does John Travolta appear?",  # query about cinema
    "What is the Rolling Stones first album?",  # query about music
    "Who was Sergio Leone?",  # query about cinema
]
for query in queries:
    result = query_classifier.run(query=query)
    print(f'Query "{query}" was sent to {result[1]}')
Adding Page Numbers to Document Meta
Sometimes it's not enough to find the right answer or paragraph inside a document and just print it on the screen. Context matters, and thus, for search applications, it's essential to send the user exactly to the place where the information came from. For huge documents, we're only halfway there if the user clicks a result and the document opens. To get to the right position, they still need to search the document using the document viewer. To make this easier, we added the parameter `add_page_number` to `ParsrConverter`, `AzureConverter`, and `PreProcessor`. If you set it to `True`, it adds a meta field `"page"` to documents containing the page number of the text snippet or a table within the original file.
from haystack import Pipeline
from haystack.nodes import PDFToTextConverter, PreProcessor
from haystack.document_stores import InMemoryDocumentStore
converter = PDFToTextConverter()
preprocessor = PreProcessor(add_page_number=True)
document_store = InMemoryDocumentStore()
pipeline = Pipeline()
pipeline.add_node(component=converter, name="Converter", inputs=["File"])
pipeline.add_node(component=preprocessor, name="PreProcessor", inputs=["Converter"])
pipeline.add_node(component=document_store, name="DocumentStore", inputs=["PreProcessor"])
Gradient Accumulation for FARMReader
Training big Transformer models in low-resource environments is hard. Batch size plays a significant role when it comes to hyper-parameter tuning during the training process. The batch size you can run on your machine is restricted by the amount of memory that fits into your GPUs. Gradient accumulation is a well-known technique to work around that restriction: adding up the gradients across iterations and running the backward pass only once after a certain number of iterations.
We tested it when we fine-tuned roberta-base on SQuAD, which led to nearly the same results as using a higher batch size. We also used it for training deepset/deberta-v3-large, which significantly outperformed its predecessors (see Question Answering on SQuAD).
from haystack.nodes import FARMReader
reader = FARMReader(model_name_or_path="distilbert-base-uncased-distilled-squad", use_gpu=True)
data_dir = "data/squad20"
reader.train(
    data_dir=data_dir,
    train_filename="dev-v2.0.json",
    use_gpu=True,
    n_epochs=1,
    save_dir="my_model",
    grad_acc_steps=8,
)
Extended Ray Support
Another great contribution from our community comes from @zoltan-fedor: it's now possible to run more complex pipelines with a dual-retriever setup on Ray. Also, we now support Ray Serve deployment arguments in Pipeline YAMLs so that you can fully control your Ray deployments.
pipelines:
  - name: ray_query_pipeline
    nodes:
      - name: EmbeddingRetriever
        replicas: 2
        inputs: [ Query ]
        serve_deployment_kwargs:
          num_replicas: 2
          version: Twenty
          ray_actor_options:
            num_gpus: 0.25
            num_cpus: 0.5
          max_concurrent_queries: 17
      - name: Reader
        inputs: [ EmbeddingRetriever ]
Support for Custom Sentence Tokenizers in Preprocessor
In some specific domains (for example, legal, with lots of custom abbreviations), the default sentence tokenizer can be improved by some extra training on the domain data. To support a custom model for sentence splitting, @danielbichuetti added the `tokenizer_model_folder` parameter to `PreProcessor`.
from haystack.nodes import PreProcessor
preprocessor = PreProcessor(
    split_length=10,
    split_overlap=0,
    split_by="sentence",
    split_respect_sentence_boundary=False,
    language="pt",
    tokenizer_model_folder="/home/user/custom_tokenizer_models",
)
Making it Easier to Switch Document Stores
We had yet another amazing community contribution by @zoltan-fedor about the support for BM25 with the Weaviate document store.
Besides that, we streamlined methods of `BaseDocumentStore` and added `update_document_meta()` to `InMemoryDocumentStore`. These are all steps to make it easier for you to run the same pipeline with different document stores (for example, use in-memory for quick prototyping, then head to something more production-ready). #2860 #2689
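A rough sketch of the new method (the meta values here are illustrative):
from haystack import Document
from haystack.document_stores import InMemoryDocumentStore
document_store = InMemoryDocumentStore()
doc = Document(content="sample content")
document_store.write_documents([doc])
# Update the meta data of an already indexed document by its ID
document_store.update_document_meta(id=doc.id, meta={"category": "news"})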
Almost 2x Performance Gain for Electra Reader Models
We did a major refactoring of our language_modeling module, resolving a bug that caused Electra models to execute the forward pass twice. #2703
⚠️ Breaking Changes
- Add `update_document_meta` to `InMemoryDocumentStore` by @bogdankostic in #2689
- Add support for BM25 with the Weaviate document store by @zoltan-fedor in #2860
- Extending the Ray Serve integration to allow attributes for Serve deployments by @zoltan-fedor in #2918
- bug: make `MultiLabel` ids consistent across python interpreters by @camillepradel in #2998
⚠️ Breaking Changes for Contributors
Default Branch will be Renamed to main on Tuesday, 16th of August
We will rename the default branch from `master` to `main` after this release. For a nice recap of good reasons for doing this, have a look at the Software Freedom Conservancy's blog.
Whether coming from this repository or from a fork, local clones of the Haystack repository will need to be updated by running the following commands:
git branch -m master main
git fetch origin
git branch -u origin/main main
git remote set-head origin -a
Pre-Commit Hooks Instead of CI Jobs
To give you full control over your changes, we switched from CI jobs that automatically reformat files, generate schemas, and so on, to pre-commit hooks. To install them, run:
pre-commit install
For more information, check our contributing guidelines.
#2819
Other Changes
Pipelin...
v1.6.0
⭐ Highlights
Make Your QA Pipelines Talk with Audio Nodes! (#2584)
Indexing pipelines can use a new `DocumentToSpeech` node, which generates an audio file for each indexed document and stores it alongside the text content in a `SpeechDocument`. A GPU is recommended for this step to increase indexing speed. During querying, `SpeechDocument`s allow accessing the stored audio version of the documents the answers are extracted from. There is also a new `AnswerToSpeech` node that can be used in QA pipelines to generate the audio of an answer on the fly. See the new tutorial for a step-by-step guide on how to make your QA pipelines talk!
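A minimal sketch of plugging `AnswerToSpeech` into a query pipeline (the model name and audio directory follow the tutorial and are assumptions, not taken from these notes):
from pathlib import Path
from haystack import Pipeline
from haystack.nodes import AnswerToSpeech
...
# `retriever` and `reader` are created as in the other examples
pipeline = Pipeline()
pipeline.add_node(component=retriever, name="Retriever", inputs=["Query"])
pipeline.add_node(component=reader, name="Reader", inputs=["Retriever"])
answer_to_speech = AnswerToSpeech(
    model_name_or_path="espnet/kan-bayashi_ljspeech_vits",  # assumed TTS model
    generated_audio_dir=Path("./audio_answers"),            # placeholder output directory
)
pipeline.add_node(component=answer_to_speech, name="AnswerToSpeech", inputs=["Reader"])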
Save Models to Remote (#2618)
A new `save_to_remote` method was introduced to the `FARMReader`, so that you can easily upload a trained model to the Hugging Face Model Hub. More of this to come in the following releases!
from haystack.nodes import FARMReader
reader = FARMReader(model_name_or_path="roberta-base")
reader.train(data_dir="my_squad_data", train_filename="squad2.json", n_epochs=1, save_dir="my_model")
reader.save_to_remote(repo_id="your-user-name/roberta-base-squad2", private=True, commit_message="First version of my qa model trained with Haystack")
Note that you need to be logged in with transformers-cli login. Otherwise, there will be an error message with instructions on how to log in. Further, if you make your model private by setting `private=True`, others won't be able to use it, and you will need to pass an authentication token (also created via transformers-cli login) when you reload the model from the Model Hub.
new_reader = FARMReader(model_name_or_path="your-user-name/roberta-base-squad2", use_auth_token=True)
Multi-Hop Dense Retrieval (#2571)
There is a new `MultihopEmbeddingRetriever` node that applies iterative retrieval steps and a shared encoder for the query and the documents. Used together with a reader node in a QA pipeline, it is suited for answering complex open-domain questions that require "hopping" across multiple relevant documents. See the original paper by Xiong et al. for more details: "Answering complex open-domain questions with multi-hop dense retrieval".
from haystack.nodes import MultihopEmbeddingRetriever
from haystack.document_stores import InMemoryDocumentStore
document_store = InMemoryDocumentStore()
retriever = MultihopEmbeddingRetriever(
    document_store=document_store,
    embedding_model="deutschmann/mdr_roberta_q_encoder",
)
Big thanks to our community member @deutschmn for the PR!
InMemoryKnowledgeGraph (#2678)
Besides querying texts and tables, Haystack also allows querying knowledge graphs with the help of pre-trained models that translate text queries to graph queries. The latest Haystack release adds an `InMemoryKnowledgeGraph`, allowing you to store knowledge graphs without setting up complex graph databases. Try out the tutorial as a notebook on Colab!
from pathlib import Path
from haystack.nodes import Text2SparqlRetriever
from haystack.document_stores import InMemoryKnowledgeGraph
from haystack.utils import fetch_archive_from_http
# Fetch data represented as triples of subject, predicate, and object statements
fetch_archive_from_http(url="https://fandom-qa.s3-eu-west-1.amazonaws.com/triples_and_config.zip", output_dir="data/tutorial10")
# Fetch a pre-trained BART model that translates text queries to SPARQL queries
fetch_archive_from_http(url="https://fandom-qa.s3-eu-west-1.amazonaws.com/saved_models/hp_v3.4.zip", output_dir="../saved_models/tutorial10/")
# Initialize knowledge graph and import triples from a ttl file
kg = InMemoryKnowledgeGraph(index="tutorial10")
kg.create_index()
kg.import_from_ttl_file(index="tutorial10", path=Path("data/tutorial10/triples.ttl"))
# Initialize retriever from pre-trained model
kgqa_retriever = Text2SparqlRetriever(knowledge_graph=kg, model_name_or_path=Path("../saved_models/tutorial10/hp_v3.4"))
# Translate a text query to a SPARQL query and execute it on the knowledge graph
print(kgqa_retriever.retrieve(query="In which house is Harry Potter?"))
Big thanks to our community member @anakin87 for the PR!
Torch 1.12 and Transformers 4.20.1 Support
Haystack is now compatible with last week's PyTorch v1.12 release so that you can take advantage of Apple silicon GPUs (Apple M1) for accelerated training and evaluation. PyTorch shared an impressive analysis of speedups over CPU-only here.
Haystack is also compatible with the latest Transformers v4.20.1 release and we will continuously ensure that you can benefit from the latest features in Haystack!
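As a quick check that the Apple silicon backend is usable (plain PyTorch, not a Haystack API):
import torch
# PyTorch 1.12 exposes Apple silicon GPUs via the "mps" backend
device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")
print(f"Using device: {device}")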
Other Changes
Pipeline
- Fix JoinAnswer/JoinNode by @MichelBartels in #2612
- Reduce logging messages and simplify logging by @julian-risch in #2682
- Correct docstring parameter name by @julian-risch in #2757
- Add `AnswerToSpeech` by @ZanSara in #2584
- Fix params being changed during pipeline.eval() by @tstadel in #2638
- Make crawler extract also hidden text by @anakin87 in #2642
- Update document scores based on ranker node by @mathislucka in #2048
- Improved crawler support for dynamically loaded pages by @danielbichuetti in #2710
- Replace deprecated Selenium methods by @ZanSara in #2724
- Fix EvaluationSetClient.get_labels() by @tstadel in #2690
- Show warning in reader.eval() about differences compared to pipeline.eval() by @tstadel in #2477
- Fix using id_hash_keys as pipeline params by @tstadel in #2717
- Fix loading of tokenizers in DPR by @bogdankostic in #2755
- Add support for Multi-Hop Dense Retrieval by @deutschmn in #2571
- Create target folder if not exists in EvalResult.save() by @tstadel in #2647
- Validate `max_seq_length` in `SquadProcessor` by @francescocastelli in #2740
Models
- Use AutoTokenizer by default, to easily adapt to new models and token… by @apohllo in #1902
- first version of save_to_remote for HF from FarmReader by @TuanaCelik in #2618
DocumentStores
- Move Opensearch document store in its own module by @masci in #2603
- Extract common code for ES and OS into a base class by @masci in #2664
- Fix bugs in loading code from yaml by @masci in #2705
- fix error in log message by @anakin87 in #2719
- Pin es client to include bugfixes by @masci in #2735
- Make check of document & embedding count optional in FAISS and Pinecone by @julian-risch in #2677
- In memory knowledge graph by @anakin87 in #2678
- Pinecone unary queries upgrade by @jamescalam in #2657
- wait for postgres to be ready before data migrations by @masci in #2654
Documentation & Tutorials
- Update docstrings for GPL by @agnieszka-m in #2633
- Add GPL API docs, unit tests update by @vblagoje in #2634
- Add GPL adaptation tutorial by @vblagoje in #2632
- GPL tutorial - add GPU header and open in colab button by @vblagoje in #2736
- Add execute_eval_run example to Tutorial 5 by @tstadel in #2459
- Tutorial 14 edit by @robpasternak in #2663
Misc
- Replace question issue with link to discussions by @masci in #2697
- Upgrade transformers to 4.20.1 by @julian-risch in #2702
- Upgrade torch to 1.12 by @julian-risch in #2741
- Remove rapidfuzz version pin by @tstadel in #2730
New Contributors
- @ryanrussell made their first contribution in #2617
- @apohllo made their first contribution in #1902
- @robpasternak made their first contribution in #2663
- @danielbichuetti made their first contribution in #2710
- @francescocastelli made their first contribution in #2740
- @deutschmn made their first contribution in https://github.com/deepset-...
v1.5.0
⭐ Highlights
Generative Pseudo Labeling
Dense retrievers excel when finetuned on a labeled dataset of the target domain. However, such datasets rarely exist and are costly to create from scratch with human annotators. Generative Pseudo Labeling solves this dilemma by creating labels automatically for you, which makes it a super fast and low-cost alternative to manual annotation. Technically speaking, it is an unsupervised approach for domain adaptation of dense retrieval models. Given a corpus of unlabeled documents from that domain, it automatically generates queries on that corpus and then uses a cross-encoder model to create pseudo labels for these queries. The pseudo labels can be used to adapt retriever models to that domain. Here is a code example that shows how to do that in Haystack:
from haystack.nodes.retriever import EmbeddingRetriever
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes.question_generator.question_generator import QuestionGenerator
from haystack.nodes.label_generator.pseudo_label_generator import PseudoLabelGenerator
# Initialize any document store and fill it with documents from your domain - no labels needed.
document_store = InMemoryDocumentStore()
document_store.write_documents(...)
# Calculate and store a dense embedding for each document
retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="sentence-transformers/msmarco-distilbert-base-tas-b",
    max_seq_len=200,
)
document_store.update_embeddings(retriever)
# Use the new PseudoLabelGenerator to automatically generate labels and train the retriever on them
qg = QuestionGenerator(model_name_or_path="doc2query/msmarco-t5-base-v1", max_length=64, split_length=200, batch_size=12)
psg = PseudoLabelGenerator(qg, retriever)
output, _ = psg.run(documents=document_store.get_all_documents())
retriever.train(output["gpl_labels"])
Batch Processing with Query Pipelines
Every query pipeline now has a `run_batch()` method, which allows you to pass multiple queries to the pipeline at once.
Together with a list of queries, you can provide either a single list of documents or a list of lists of documents. In the first case, answers are returned for each query-document pair. In the second case, each query is applied to its corresponding list of documents based on the same index in the list. A third option is to have a list containing a single query, which is then applied to each list of documents separately.
Here is an example with a pipeline:
from haystack.pipelines import ExtractiveQAPipeline
...
pipe = ExtractiveQAPipeline(reader, retriever)
predictions = pipe.pipeline.run_batch(
    queries=["Who is the father of Arya Stark?", "Who is the mother of Arya Stark?"],
    params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}},
)
And here is an example with a single reader node:
from haystack.nodes import FARMReader
from haystack.schema import Document
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2")
result = reader.predict_batch(
    queries=["1st sample query", "2nd sample query"],
    documents=[[Document(content="sample doc1"), Document(content="sample doc2")], [Document(content="sample doc3"), Document(content="sample doc4")]],
)
# result: {"queries": ["1st sample query", "2nd sample query"], "answers": [[<answers from doc1 and doc2>], [<answers from doc3 and doc4>]], ...}
Pipeline Evaluation with Advanced Label Scopes
Typically, a predicted answer is considered correct if it matches the gold answer in the set of evaluation labels. Similarly, a retrieved document is considered correct if its ID matches the gold document ID in the labels. Sometimes however, these simple definitions of "correctness" are not sufficient and you want to further specify the "scope" within which an answer or a document is considered correct.
For this reason, `EvaluationResult.calculate_metrics()` accepts the parameters `answer_scope` and `document_scope`.
As an example, you might consider an answer to be correct only if it stems from a specific context of surrounding words. In that case, you can specify `answer_scope="context"` in `calculate_metrics()`. See the updated docstrings with a description of the different label scopes, or the updated tutorial on evaluation.
...
document_store.add_eval_data(
    filename="data/tutorial5/nq_dev_subset_v2.json",
    preprocessor=preprocessor,
)
...
eval_labels = document_store.get_all_labels_aggregated(drop_negative_labels=True, drop_no_answers=True)
eval_result = pipeline.eval(labels=eval_labels, params={"Retriever": {"top_k": 5}})
metrics = eval_result.calculate_metrics(answer_scope="context")
print(f'Reader - F1-Score: {metrics["Reader"]["f1"]}')
Support of DeBERTa Models
Haystack now supports DeBERTa models! These kinds of models come with some smart architectural improvements over BERT and RoBERTa, such as encoding the relative and absolute position of a token in the input sequence. Only the following three lines are needed to train a DeBERTa reader model on the SQuAD 2.0 dataset. And compared to a RoBERTa model trained on that dataset, you can expect a boost in F1-score from ~84% to ~88% ("microsoft/deberta-v3-large" even gets you to an F1-score as high as ~92%).
from haystack.nodes import FARMReader
reader = FARMReader(model_name_or_path="microsoft/deberta-v3-base")
reader.train(data_dir="data/squad20", train_filename="train-v2.0.json", dev_filename="dev-v2.0.json", save_dir="my_model")
⚠️ Breaking Changes
- Validation for Ray pipelines by @ZanSara in #2545
- Add `run_batch` method to all nodes and `Pipeline` to allow batch querying by @bogdankostic in #2481
- Support context matching in `pipeline.eval()` by @tstadel in #2482
Other Changes
Pipeline
- Add sort arg to JoinAnswers by @brandenchan in #2436
- Update run() and run_batch() params descriptions in API by @agnieszka-m in #2542
- [CI refactoring] Avoid `ray==1.12.0` on Windows by @ZanSara in #2562
- Prevent losing names of utilized components when loaded from config by @tstadel in #2525
- Do not copy `_component_config` in `get_components_definitions` by @ZanSara in #2574
- Add `run_batch` for standard pipelines by @bogdankostic in #2595
- Fix Pipeline.get_config() for forked pipelines by @tstadel in #2616
- Remove wrong retriever top_1 metrics from `print_eval_report` by @tstadel in #2510
- Handle transformers pipeline flattening lists of length 1 by @MichelBartels in #2531
- Fix `pipeline.eval` with context matching for Table-QA by @tstadel in #2597
- set top_k to 5 in SAS to be consistent by @ClaMnc in #2550
DocumentStores
- Make `DeepsetCloudDocumentStore` work with non-existing index by @bogdankostic in #2513
- [Weaviate] Exit the while loop when we query less documents than available by @masci in #2537
- Fix knn params for aws managed opensearch by @tstadel in #2581
- Fix number of returned values in `get_metadata_values_by_key` by @bogdankostic in #2614
Retriever
- Simplify loading of `EmbeddingRetriever` by @bogdankostic in #2619
- Add training checkpoint in retriever trainer by @dimitrisna in #2543
- Include meta data when computing embeddings in EmbeddingRetriever by @MichelBartels in #2559
Documentation
- fix small typo in Document doc string by @galtay in #2520
- rearrange contributing guidelines by @masci in #2515
- Documenting output score of JoinDocuments when using concatenation by @MichelBartels in #2561
- Minor lg updates to doc strings by @agnieszka-m in #2585
- Adjust pydoc markdown config so methods shown with classes by @brandenchan in #2511
- Update Ray pipeline docs with validation info by @agnieszka-m in #2590
Other Changes
- Upgrade transformers version to 4.18.0 by @bogdankostic in #2514
- Upgrade torch version to 1.11 by @bogdankostic in #2538
- Fix tutorials 4, 7 and 8 by @bogdankostic in #2526
- Tutorial1: `convert_files_to_dicts` --> `convert_files_to_docs` by @ZanSara in #2546
- Fix docker image tag with semantic version for releases by @askainet in https://github.com/deepset-ai/haystack/pull/...
v1.4.0
⭐ Highlights
Logging Evaluation Results to MLflow
Logging and comparing the evaluation results of multiple different pipeline configurations is much easier now thanks to the newly implemented `MLflowTrackingHead`. With our public MLflow instance, you can log evaluation metrics and metadata about the pipeline, evaluation set, and corpus. Here is an example log file. If you have your own MLflow instance, you can even store the pipeline YAML file and the evaluation set as artifacts. In Haystack, all you need is the `execute_eval_run()` method:
eval_result = Pipeline.execute_eval_run(
    index_pipeline=index_pipeline,
    query_pipeline=query_pipeline,
    evaluation_set_labels=labels,
    corpus_file_paths=file_paths,
    corpus_file_metas=file_metas,
    experiment_tracking_tool="mlflow",
    experiment_tracking_uri="http://localhost:5000",
    experiment_name="my-query-pipeline-experiment",
    experiment_run_name="run_1",
    pipeline_meta={"name": "my-pipeline-1"},
    evaluation_set_meta={"name": "my-evalset"},
    corpus_meta={"name": "my-corpus"},
    add_isolated_node_eval=True,
    reuse_index=False,
)
Filtering Answers by Confidence in FARMReader
The FARMReader got a parameter `confidence_threshold` to filter out predictions below this threshold. The threshold is disabled by default but can be set to a value between 0 and 1 when initializing the FARMReader:
from haystack.nodes import FARMReader
model = "deepset/roberta-base-squad2"
reader = FARMReader(model, confidence_threshold=0.5)
Deprecating Milvus1DocumentStore & Renaming ElasticsearchRetriever
The Milvus1DocumentStore is deprecated in favor of the newer Milvus2DocumentStore. Besides big architectural changes that impact performance and reliability, Milvus version 2.0 supports filtering by scalar data types.
For Haystack users, this means you can now run a query using vector similarity and filter for some metadata at the same time! See the Milvus documentation for more details if you need to migrate from Milvus1DocumentStore to Milvus2DocumentStore. #2495
The ElasticsearchRetriever node works not only with the ElasticsearchDocumentStore but also with the OpenSearchDocumentStore, so it is only logical to rename it. It is now called BM25Retriever, after the underlying BM25 ranking function. For the same reason, ElasticsearchFilterOnlyRetriever is now called FilterRetriever. The deprecated names and the new names both work, but we will drop support for the deprecated names in a future release. An overview of the different DocumentStores in Haystack can be found here. #2423 #2461
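Switching to the new name is a one-line change (a sketch; the Elasticsearch setup is just for illustration):
from haystack.document_stores import ElasticsearchDocumentStore
from haystack.nodes import BM25Retriever  # formerly ElasticsearchRetriever
document_store = ElasticsearchDocumentStore()
retriever = BM25Retriever(document_store=document_store)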
Fixing Evaluation Discrepancies
The evaluation of pipeline nodes with `pipeline.eval(add_isolated_node_eval=True)` and, alternatively, with `retriever.eval()` and `reader.eval()` gave slightly different results due to a bug in handling `no_answers`. This bug is fixed now, and all the different ways to run the evaluation give the same results. #2381
⚠️ Breaking Changes
- Change return types of indexing pipeline nodes by @bogdankostic in #2342
- Upgrade `weaviate-client` to `3.3.3` and fix `get_all_documents` by @ZanSara in #1895
- Align TransformersReader defaults with FARMReader by @julian-risch in #2490
- Change default encoding for `PDFToTextConverter` from `Latin 1` to `UTF-8` by @ZanSara in #2420
- Validate YAML files without loading the nodes by @ZanSara in #2438
Other Changes
Pipeline
- Add tests for missing `__init__` and `super().__init__()` in custom nodes by @ZanSara in #2350
- Forbid usage of `*args` and `**kwargs` in any node's `__init__` by @ZanSara in #2362
- Change YAML version exception into a warning by @ZanSara in #2385
- Make sure that `debug=True` and `params={'debug': True}` behaves the same way by @ZanSara in #2442
- Add support for positional args in pipeline.get_config() by @tstadel in #2478
- enforce same index values before and after saving/loading eval dataframes by @tstadel in #2398
DocumentStores
- Fix sparse retrieval with filters returns results without any text-match by @tstadel in #2359
- EvaluationSetClient for deepset cloud to fetch evaluation sets and la… by @FHardow in #2345
- Update launch script for Milvus from 1.x to 2.x by @ZanSara in #2378
- Use `ElasticsearchDocumentStore.get_all_documents` in `ElasticsearchFilterOnlyRetriever.retrieve` by @adri1wald in #2151
- Fix and use delete_index instead of delete_documents in tests by @tstadel in #2453
- Update docs of DeepsetCloudDocumentStore by @tholor in #2460
- Add support for aliases in elasticsearch document store by @ZeJ0hn in #2448
- fix dot_product metric by @jamescalam in #2494
- Deprecate `Milvus1DocumentStore` by @bogdankostic in #2495
- Fix `OpenSearchDocumentStore`'s `__init__` by @ZanSara in #2498
Retriever
- Rename dataset to evaluation_set when logging to mlflow by @tstadel in #2457
- Linearize tables in EmbeddingRetriever by @MichelBartels in #2462
- Print warning in `EmbeddingRetriever` if sentence-transformers model used with different model format by @mpangrazzi in #2377
- Add flag to disable scaling scores to probabilities by @tstadel in #2454
- changing the name of the retrievers from es_retriever to retriever by @TuanaCelik in #2487
- Replace dpr with embeddingretriever tut14 by @mkkuemmel in #2336
- Support conjunctive queries in sparse retrieval by @tstadel in #2361
- Fix: Auth token not passed for EmbeddingRetriever by @mathislucka in #2404
- Pass `use_auth_token` to sentence transformers EmbeddingRetriever by @MichelBartels in #2284
Reader
- Fix `TableReader` for tables without rows by @bogdankostic in #2369
- Match answer sorting in `QuestionAnsweringHead` with `FARMReader` by @tstadel in #2414
- Fix reader.eval() and reader.eval_on_file() output by @tstadel in #2476
- Raise error if torch-scatter is not installed or wrong version is installed by @MichelBartels in #2486
Documentation
- Fix link to squad_to_dpr.py in DPR train tutorial by @raphaelmerx in #2334
- Add evaluation and document conversion to tutorial 15 by @MichelBartels in #2325
- Replace TableTextRetriever with EmbeddingRetriever in Tutorial 15 by @MichelBartels in #2479
- Fix RouteDocuments documentation by @MichelBartels in #2380
Other Changes
- extract extension based on file's content by @GiannisKitsos in #2330
- Reduce num REST API workers to accommodate smaller machines by @brandenchan in #2400
- Add `devices` alongside `use_gpu` in `FARMReader` by @ZanSara in #2294
- Delete files in docs/_src by @brandenchan in #2322
- Add `apt update` in Linux CI by @ZanSara in #2415
- Exclude `beir` from Windows install by @ZanSara in #2419
- Added macos version of xpdf in tutorial 8 by @seduerr91 in #2424
- Make `python-magic` fully optional by @ZanSara in #2412
- Upgrade xpdf to 4.0.4 by @tholor in #2443
- Update `xpdfreader` package installation by @AI-Ahmed in #2491
New Contributors
- @raphaelmerx made their first contribution in #2334
- @FHardow made their first contribution in #2345
- @GiannisKitsos made their first contribution in #2330
- @mpangrazzi made their first contribution in #2377
- @seduerr91 made their first contribution i...
v1.3.0
⭐ Highlights
Pipeline YAML Syntax Validation
The syntax of pipeline configurations as defined in YAML files can now be validated. If the validation fails, erroneous components/parameters are identified to make it simple to fix them. Here is a code snippet to manually validate a file:
from pathlib import Path
from haystack.pipelines.config import validate_yaml
validate_yaml(Path("rest_api/pipeline/pipelines.haystack-pipeline.yml"))
Your IDE can also take care of the validation when you edit a pipeline YAML file. The suffix `*.haystack-pipeline.yml` tells your IDE that this YAML contains a Haystack pipeline configuration and enables some checks and autocompletion features if the IDE is configured that way (YAML plugin for VSCode, Configuration Guide for PyCharm). The schema used for validation can be found in SchemaStore, pointing to the schema files for the different Haystack versions. Note that an update of the Haystack version might sometimes require small changes to the pipeline YAML files. You can set `version: 'unstable'` in the pipeline YAML to circumvent the validation, or set it to the latest Haystack version if the components and parameters that you use are compatible with the latest version. #2226
Pinecone DocumentStore
We added another DocumentStore to Haystack: PineconeDocumentStore! 🎉 Pinecone is a fully managed service for very large scale dense retrieval. To this end, embeddings and metadata are stored in a hosted Pinecone vector database while the document content is stored in a local SQL database. This separation simplifies infrastructure setup and maintenance. In order to use this new document store, all you need is an API key, which you can obtain by creating an account on the Pinecone website. #2254
import os
from haystack.document_stores import PineconeDocumentStore
document_store = PineconeDocumentStore(api_key=os.environ["PINECONE_API_KEY"])
BEIR Integration
Fresh from the 🍻 cellar, Haystack now has an integration with our favorite BEnchmarking Information Retrieval tool BEIR. It contains preprocessed datasets for zero-shot evaluation of retrieval models in 17 different languages, which you can use to benchmark your pipelines. For example, a DocumentSearchPipeline can now be evaluated by calling `Pipeline.eval_beir()` after having installed Haystack with the BEIR dependency via `pip install farm-haystack[beir]`. Cheers! #2333
from haystack.pipelines import DocumentSearchPipeline, Pipeline
from haystack.nodes import TextConverter, ElasticsearchRetriever
from haystack.document_stores.elasticsearch import ElasticsearchDocumentStore
text_converter = TextConverter()
document_store = ElasticsearchDocumentStore(search_fields=["content", "name"], index="scifact_beir")
retriever = ElasticsearchRetriever(document_store=document_store, top_k=1000)
index_pipeline = Pipeline()
index_pipeline.add_node(text_converter, name="TextConverter", inputs=["File"])
index_pipeline.add_node(document_store, name="DocumentStore", inputs=["TextConverter"])
query_pipeline = DocumentSearchPipeline(retriever=retriever)
ndcg, _map, recall, precision = Pipeline.eval_beir(
    index_pipeline=index_pipeline, query_pipeline=query_pipeline, dataset="scifact"
)
⚠️ Breaking Changes
- Make Milvus2DocumentStore compatible with pymilvus>=2.0.0 by @MichelBartels in #2126
- Set provider parameter when instantiating onnxruntime.InferenceSession and make `device` a torch.device in internal methods by @cjb06776 in #1976
Pipeline
- Generate `haystack-pipeline-1.2.0.schema.json` by @ZanSara in #2239
- Add `RouteDocuments` and `JoinAnswers` nodes by @bogdankostic in #2256
- Refactor Pipeline peripherals by @tstadel in #2253
- Allow to deploy and undeploy Pipelines on Deepset Cloud by @tstadel in #2285
- Reintroduce `debug` as a valid global key for Pipeline's `params` by @ZanSara in #2298
- Replace dpr with embeddingretriever tut11 by @mkkuemmel in #2287
- Package JSON schemas properly in Haystack by @ZanSara in #2316
- Fix dependency graph for indexing pipelines during codegen by @tstadel in #2311
- Fix YAML pipeline paths in `docker-compose.yml` by @ZanSara in #2335
- Improve error message for nodes failing validation by @ZanSara in #2313
- Fix `Pipeline.print_eval_report` by @tstadel in #2271
- save_to_deepset_cloud: automatically convert document stores by @tstadel in #2283
- Sas gpu additions by @thimo72 in #2308
Models
DocumentStores
- Bulk insert in sql document stores by @OmniscienceAcademy in #2264
- 'os' wrapper to function for brownfield support by @TuanaCelik in #2282
- Using default OpenSearch parameters by @TuanaCelik in #2327
- Fix docker launch scripts by @tstadel in #2341
- Fix `normalize_embedding` using numba by @tstadel in #2347
Documentation
- Update other.yml with new node names by @agnieszka-m in #2286
- Bring back init defs to api in v1.2 and latest by @brandenchan in #2296
- Remove unneeded files in docs directory by @brandenchan in #2237
- change old text to content argument for translator examples by @ju-gu in #2240
Tutorials
- Fix tutorial dataset paths by @julian-risch in #2340
- Polish Evaluation Tutorial by @brandenchan in #2212
- Comment out Milvus cell on Tutorial6 by @ZanSara in #2243
- Change document attribute from text to content by @julian-risch in #2352
- Replace dpr with embeddingretriever tut5 by @mkkuemmel in #2274
- ipynb: inserted links to graph images by @mkkuemmel in #2309
Other Changes
- Implement Context Matching by @tstadel in #2293
- Fix surrounding context extraction in `ParsrConverter` by @bogdankostic in #2162
- Fix table extraction in `ParsrConverter` by @bogdankostic in #2262
- Api pages by @brandenchan in #2248
- fix pip backtracking issue by @tstadel in #2281
- Update reader/base.py to fix UnboundLocalError in #2273 by @thimo72 in #2275
- Remove substrings basic implementation by @dmigo in #2152
- adding quotes for zsh shell issue by @TuanaCelik in #2289
- Prevent Preprocessor from changing existing documents by @tstadel in #2297
- Fix install because of missing jsonschema dependency by @tstadel in #2315
- Add basic telemetry features by @julian-risch in #2314
- Let SquadData support data from Annotation Tool by @brandenchan in #2329
New Contributors
- @thimo72 made their first contribution in #2275
- @agnieszka-m made their first contribution in #2286
- @TuanaCelik made their first contribution in #2289
- @OmniscienceAcademy made their first contribution in #2264
- @jamescalam made their first contribution in #2254
- @cjb06776 made their first contribution in #1976
❤️ Big thanks to all contributors and the whole community!
v1.2.0
⭐ Highlights
Brownfield Support of Existing Elasticsearch Indices
You have an existing Elasticsearch index from other projects and now want to try out Haystack? The newly added method `es_index_to_document_store` provides brownfield support of existing Elasticsearch indices by converting each of the records in the provided index to Haystack `Document` objects and writing them to the specified `DocumentStore`.
document_store = es_index_to_document_store(
    document_store=InMemoryDocumentStore(),  # or any other Haystack DocumentStore
    original_index_name="existing_index",
    original_content_field="content",
    original_name_field="name",
    included_metadata_fields=["date_field"],
    index="new_index",
)
It can even be used on a regular basis to add new records from the Elasticsearch index to the `DocumentStore`! #2229
Tapas Reader With Scores
The new model class `TapasForScoredQA` introduced in #1997 supports Tapas Reader models that return confidence scores. When you load a Tapas Reader model, Haystack automatically infers whether the model supports confidence scores and chooses the correct model class under the hood. The returned answers are sorted first by a general table score and then by answer span scores. To try it out, just use one of the new TableReader models:
reader = TableReader(model_name_or_path="deepset/tapas-large-nq-reader", max_seq_len=512) #or
reader = TableReader(model_name_or_path="deepset/tapas-large-nq-hn-reader", max_seq_len=512)
Extended Meta Data Filtering
We extended the filter capabilities of all(*) document stores to support more complex filter expressions than previously. Besides simple selections on multiple fields, you can now use more complex comparison expressions and connect them using boolean operators. For people who have used MongoDB, the new syntax should look familiar. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte"), or a metadata field name.
Logical operator keys take a dictionary of metadata field names and/or logical operators as their value. Metadata field names take a dictionary of comparison operators as their value. Comparison operator keys take a single value or (in the case of "$in") a list of values.
If no logical operator is provided, "$and" is used as the default operation.
If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as the default operation.
Therefore, there are no breaking changes, and you can keep on using your existing filter expressions.
Example:
filters = {
    "$and": {
        "type": {"$eq": "article"},
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": {"$in": ["economy", "politics"]},
            "publisher": {"$eq": "nytimes"}
        }
    }
}
# or simpler using default operators
filters = {
    "type": "article",
    "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
    "rating": {"$gte": 3},
    "$or": {
        "genre": ["economy", "politics"],
        "publisher": "nytimes"
    }
}
(*) FAISSDocumentStore and MilvusDocumentStore currently do not support filters during search.
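To apply such a filter, pass it to a document store method (a sketch reusing the filters dictionary defined above; the document store choice is just for illustration):
from haystack.document_stores import ElasticsearchDocumentStore
document_store = ElasticsearchDocumentStore()
# Return only the documents that match the nested filter expression above
documents = document_store.get_all_documents(filters=filters)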
Code Style and Linting
In addition to mypy, which we already had for static type checking, we now use pylint for linting, and the Haystack code base now complies with Black formatting standards. As a result, the code is formatted in a consistent way and easier to read. If you would like to contribute to Haystack, you don't need to worry about that, though - our CI will automatically format your code changes correctly. Our contributor guidelines give more details in case you would like to run the checks locally. #2115 #2130
Installation with fewer dependencies
Installing Haystack has become easier and faster thanks to optional dependencies. From now on, there is no need to install all dependencies if you don't need them. For example, `pip3 install farm-haystack` will install the latest release together with only a small subset of packages required for basic Pipelines with an ElasticsearchDocumentStore. As another example, if you are experimenting with FAISSDocumentStore in a Colab notebook, you can install Haystack from the master branch together with the FAISS dependency by running: `!pip install git+https://github.com/deepset-ai/haystack.git#egg=farm-haystack[colab,faiss]`. The installation guide reflects these updates, and the full list of dependency subsets can be found here. Keep in mind, though, that this system works best with pip versions above 22. #1994
⚠️ Known Issues
Installing haystack with all dependencies results in heavy pip backtracking that might never finish.
This is due to a dependency conflict that was introduced by a new release of one of our sub dependencies.
To circumvent this problem install haystack like this:
pip install farm-haystack[all] "azure-core<1.23"
This might also be needed for other non-default dependencies (e.g. `farm-haystack[dev] "azure-core<1.23"`).
See #2280 for more information.
⚠️ Breaking Changes
- Improve dependency management by @ZanSara in #1994
- Make `ui` and `rest` proper packages by @ZanSara in #2098
- Add aiorwlock to 'ray' extra & fix maximum version for some dependencies by @ZanSara in #2140
🤓 Detailed Changes
Pipeline
- Add `top_k_join` parameter to `JoinDocuments.run` by @adri1wald in #2065
- ✨ Add JSON Schema autogeneration for Pipeline YAML files by @tiangolo in #2020
- Make FileTypeClassifier more flexible by @ZanSara in #2101
- Query response without answers by @ZanSara in #2161
- Generate JSON schema index for Schemastore by @ZanSara in #2225
- Fix Pipeline.components by @tstadel in #2215
- Join node should allow reciprocal rank fusion as additional merging method by @mathislucka in #2133
- Apply filter in eval only if no gold docs are given as input by @julian-risch in #2154
- pipeline.save_to_deepset_cloud() by @tstadel in #2145
- Fix typo in save_to_deepset_cloud() by @tstadel in #2189
- Generate code from pipeline (pipeline.to_code()) by @tstadel in #2214
- Allow different filters per query in pipeline evaluation by @julian-risch in #2068
- List all pipeline(_configs) on Deepset Cloud by @tstadel in #2102
- Evaluating a pipeline consisting only of a reader node by @julian-risch in #2132
- DC SDK - load pipeline from deepset cloud by @ArzelaAscoIi in #2013
- YAML versioning by @ZanSara in #2209
Models
- Add Tapas reader with scores by @bogdankostic in #1997
- Fix finetuning notebook augmentation by @MichelBartels in #2071
- Fix Seq2SeqGenerator return type by @tstadel in #2099
- Distribute intermediate layer distillation loss calculation over multiple GPUs by @MichelBartels in #2090
- Do not apply DataParallel twice by @MichelBartels in #2095
DocumentStores
- Pin Milvus to <2.0.0 by @ZanSara in #2063
- fix: get_documents_by_id should return docs for all passed ids by @mathislucka in #2064
- Supported Highlighting in Elasticsearch by @SjSnowball in #1930
- pass faiss batch_size to sqldocumentstore by @AhmedIdr in #2061
- Fixed the Search Field mapping in ElasticSearch DocumentStore by @SjSnowball in #2080
- Provide option to recreate es doc store on initialization by @mathislucka in #2084
- Fixed performance bug. Using a list where a set is needed. by @baregawi in #2125
- Extend metadata filtering support in `ElasticsearchDocumentStore` by @bogdankostic in #2108
- OpenSearchDocumentStore: Extend similarity support by @tstadel in #2070
- Speed up query_by_embedding in InMemoryDocumentStore. by @baregawi in https://github.com/deepset-ai/haystack/p...
v1.1.0
⭐ Highlights
Model Distillation for Reader Models
With the new model distillation features, you don't need to choose between accuracy and speed! Now you can compress a large reader model (teacher) into a smaller model (student) while retaining most of the teacher's performance. For example, deepset/tinybert-6l-768d-squad2 is twice as fast as bert-base with an F1 reduction of only 2%.
To distil your own model, just follow these steps:
- Call `python augment_squad.py --squad_path <your dataset> --output_path <output> --multiplication_factor 20`, where augment_squad.py is our data augmentation script.
- Run `student.distil_intermediate_layers_from(teacher, data_dir="dataset", train_filename="augmented_dataset.json")`, where `student` is a small model and `teacher` is a highly accurate, larger reader model.
- Run `student.distil_prediction_layer_from(teacher, data_dir="dataset", train_filename="dataset.json")` with the same teacher and student.
For more information on what kinds of students and teachers you can use and on model distillation in general, just take a look at this guide.
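Put together in Python, the steps could look like this (a sketch; the teacher and student model choices are illustrative, not prescribed by these notes):
from haystack.nodes import FARMReader
# A large, accurate teacher and a small, fast student (illustrative model choices)
teacher = FARMReader(model_name_or_path="deepset/roberta-base-squad2")
student = FARMReader(model_name_or_path="huawei-noah/TinyBERT_General_6L_768D")
# Step 2: intermediate layer distillation on the augmented dataset
student.distil_intermediate_layers_from(teacher, data_dir="dataset", train_filename="augmented_dataset.json")
# Step 3: prediction layer distillation on the original dataset
student.distil_prediction_layer_from(teacher, data_dir="dataset", train_filename="dataset.json")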
Integrated vs. Isolated Pipeline Evaluation Modes
When you evaluate a pipeline, you can now use two different evaluation modes and create an automatic report that shows the results of both. The integrated evaluation (default) shows what result quality users will experience when running the pipeline. The isolated evaluation mode additionally shows what the maximum result quality of a node could be if it received perfect input from the preceding node. Thereby, you can find out whether the retriever or the reader in an `ExtractiveQAPipeline` is the bottleneck.
eval_result = pipeline.eval(labels=eval_labels, add_isolated_node_eval=True)
pipeline.print_eval_report(eval_result)
================== Evaluation Report ==================
=======================================================
                      Query
                        |
                    Retriever
                        |
                        | recall_single_hit: ...
                        |
                      Reader
                        |
                        | f1 upper bound: 0.78
                        | f1: 0.65
                        |
                      Output
As the upper bound F1-score of the reader differs a lot from its actual F1-score in this report, you would need to improve the predictions of the retriever node to achieve the full performance of this pipeline. Our updated evaluation tutorial lists all the steps to generate an evaluation report with all the metrics you need and their upper bounds for each individual node. The guide explains the two evaluation modes in detail.
Row-Column-Intersection Model for TableQA
Now you can use a Row-Column-Intersection model on your own tabular data. To try it out, just replace the declaration of your TableReader:
from haystack.nodes import RCIReader
reader = RCIReader(
    row_model_name_or_path="michaelrglass/albert-base-rci-wikisql-row",
    column_model_name_or_path="michaelrglass/albert-base-rci-wikisql-col",
)
The RCIReader requires two separate models: one for rows and one for columns. Working on each column and row separately allows it to be used on much larger tables. It is also able to return meaningful confidence scores, unlike the `TableReader`.
Please note, however, that it currently does not support aggregations over multiple cells and that it is a bit slower than other approaches.
Advanced File Converters
Given a file (PDF or DOCX), there are now two file converters in Haystack to extract text and tables: the `ParsrConverter`, based on the open-source Parsr tool by axa-group and introduced into Haystack in this release, and the `AzureConverter`, which we improved on. Both of them return a list of dictionaries containing one dictionary for each table detected in the file and one dictionary containing the text of the file. This format matches the document format and can be used right away for TableQA (see the guide).
converter = ParsrConverter()
docs = converter.convert(file_path="samples/pdf/sample_pdf_1.pdf")
⚠️ Breaking Changes
- Custom id hashing on documentstore level by @ArzelaAscoIi in #1910
- Implement proper FK in `MetaDocumentORM` and `MetaLabelORM` to work on PostgreSQL by @ZanSara in #1990
🤓 Detailed Changes
Pipeline
- Extend TranslationWrapper to work with QA Generation by @julian-risch in #1905
- Add nDCG to `pipeline.eval()`'s document metrics by @tstadel in #2008
- change column order for evaluation dataframe by @ju-gu in #1957
- Add isolated node eval mode in pipeline eval by @julian-risch in #1962
- introduce node_input param by @tstadel in #1854
- Add ParsrConverter by @bogdankostic in #1931
- Add improvements to AzureConverter by @bogdankostic in #1896
Models
- Prevent wrapping DataParallel in second DataParallel by @bogdankostic in #1855
- Enable batch mode for SAS cross encoders by @tstadel in #1987
- Add RCIReader for TableQA by @bogdankostic in #1909
- distinguish intermediate layer & prediction layer distillation phases with different parameters by @MichelBartels in #2001
- Add TinyBERT data augmentation by @MichelBartels in #1923
- Adding distillation loss functions from TinyBERT by @MichelBartels in #1879
DocumentStores
- Raise exception if Elasticsearch search_fields have wrong datatype by @tstadel in #1913
- Support custom headers per request in pipeline by @tstadel in #1861
- Fix retrieving documents in `WeaviateDocumentStore` with `content_type=None` by @bogdankostic in #1938
- Fix Numba TypingError in `normalize_embedding` for cosine similarity by @bogdankostic in #1933
- Fix loading a saved `FAISSDocumentStore` by @bogdankostic in #1937
- Propagate duplicate_documents to base class initialization by @yorickvanzweeden in #1936
- Fix vector_id collision in FAISS by @yorickvanzweeden in #1961
- Unify vector_dim and embedding_dim parameter in Document Store by @mathew55 in #1922
- Align similarity scores across document stores by @MichelBartels in #1967
- Bugfix - save_to_yaml for OpenSearchDocumentStore by @ArzelaAscoIi in #2017
- Fix elasticsearch scores if they are 0.0 by @tstadel in #1980
REST API
- Rely api healthcheck on status code rather than json decoding by @fabiolab in #1871
- Bump version in REST api by @tholor in #1875
UI / Demo
- Replace SessionState with Streamlit built-in by @yorickvanzweeden in #2006
- Fix demo deployment by @askainet in #1877
- Add models to demo docker image by @ZanSara in #1978
Documentation
- Update pydoc-markdown-file-classifier.yml by @brandenchan in #1856
- Create v1.0 docs by @brandenchan in #1862
- Fix typo by @amotl in #1869
- Correct bug with encoding when generating Markdown documentation issue #1880 by @albertovilla in #1881
- Minor typo by @javier in #1900
- Fixed the grammatical issue in optimization guides #1940 by @eldhoittangeorge in #1941
- update link to annotation tool docu by @julian-risch in #2005
- Extend Tutorial 5 with Upper Bound Reader Eval Metrics by @julian-risch in #1995
- Add distillation to finetuning tutorial by @MichelBartels in #2025
- Add ndcg and eval_mode to docs by @tstadel in #2038
- Remove hard-coded variables from the Tutorial 15 by @dmigo in #1984
Other Changes
- upgrade transformers to 4.13.0 by @julian-risch in #1659
- Fix typo in the Windows CI UI deps by @ZanSara in #1876
- Exchanged min...
1.0.0
🎁 Haystack 1.0
We worked hard to bring you an early Christmas present: 1.0 is out! In the last months, we re-designed many essential parts of Haystack, introduced new features, and simplified many user-facing methods. We believe Haystack is now much easier to use and a solid base for many exciting upcoming features that we plan. This release is a major milestone on our journey with you, the community, and we want to thank you again for all the great contributions, discussions, questions, and bug reports that helped us to build a better Haystack. This journey has just started 🚀
⭐ Highlights
Improved Evaluation of Pipelines
Evaluation helps you find out how well your system is doing on your data. This includes Pipeline level evaluation to ensure that the system's output is really what you're after, but also Node level evaluation so that you can figure out whether it's your Reader or Retriever that is holding back the performance.
In this release, evaluation is much simpler and cleaner to perform. All the functionality is now baked into the Pipeline class, and you can kick off the process by providing Label or MultiLabel objects to the Pipeline.eval() method.
eval_result = pipeline.eval(
labels=labels,
params={"Retriever": {"top_k": 5}},
)
The output is an EvaluationResult object which stores each Node's prediction for each sample in a Pandas DataFrame - so you can easily inspect granular predictions and potential mistakes without re-running the whole thing. There is an EvaluationResult.calculate_metrics() method which will return the relevant metrics for your evaluation, and you can print a convenient summary report via the new print_eval_report() method.
metrics = eval_result.calculate_metrics()
pipeline.print_eval_report(eval_result)
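For a closer look at individual predictions, the per-node results can be pulled out as DataFrames. A minimal sketch, assuming the indexing access used in the Evaluation Tutorial (the "exact_match" column name is an assumption and may vary across versions):
reader_df = eval_result["Reader"]                      # per-sample Reader predictions as a Pandas DataFrame
print(reader_df.head())                                # inspect granular predictions
mistakes = reader_df[reader_df["exact_match"] == 0]    # assumed column: filter answers with no exact match
print(mistakes.head())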
If you'd like to start evaluating your own systems on your own data, check out our Evaluation Tutorial!
Table QA
A lot of valuable information is stored in tables - we've heard this again and again from the community. While they are an efficient structured data format, it hasn't been possible to search for table contents using traditional NLP techniques. But now, with the new TableTextRetriever and TableReader, our users have all the tools they need to query for relevant tables and perform Question Answering.
The TableTextRetriever is the result of our team's research into table retrieval methods, which you can read about in this paper that was presented at EMNLP 2021. Behind the scenes, it uses three transformer-based encoders - one for text passages, one for tables, and one for the query. However, in Haystack, you can swap it out for any other dense retrieval model and start working with tables. The TableReader is built upon the TAPAS model and, when handed table-containing Documents, it can return a single cell as an answer or perform an aggregation operation on a set of cells to form a final answer.
retriever = TableTextRetriever(
document_store=document_store,
query_embedding_model="deepset/bert-small-mm_retrieval-question_encoder",
passage_embedding_model="deepset/bert-small-mm_retrieval-passage_encoder",
table_embedding_model="deepset/bert-small-mm_retrieval-table_encoder",
embed_meta_fields=["title", "section_title"]
)
reader = TableReader(
model_name_or_path="google/tapas-base-finetuned-wtq",
max_seq_len=512
)
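Putting the two together, a minimal usage sketch (the query and top_k values are illustrative, and the documents are assumed to be indexed and embedded already):
query = "Who won the most gold medals?"                              # illustrative query
tables = retriever.retrieve(query=query, top_k=5)                    # fetch candidate tables
prediction = reader.predict(query=query, documents=tables, top_k=1)  # pick a cell or aggregate
print(prediction["answers"][0].answer)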
Have a look at the Table QA documentation if you'd like to learn more or dive into the Table QA tutorial to start unlocking the information in your table data.
Improved Debugging of Pipelines & Nodes
We've made debugging much simpler and also more informative! As long as your node receives a boolean debug argument, it can propagate its input, output, or even some custom information to the output of the pipeline. It is now a built-in feature of all existing nodes and can also easily be inherited by your custom nodes.
result = pipeline.run(
query="Who is the father of Arya Stark?",
params={
"debug": True
}
)
{'ESRetriever': {'input': {'debug': True,
'query': 'Who is the father of Arya Stark?',
'root_node': 'Query',
'top_k': 1},
'output': {'documents': [<Document: {'content': "\n===In the Riverlands===\nThe Stark army reaches the Twins, a bridge strong", ...}>]
...}
To find out more about this feature, check out debugging. To learn how to define custom debug information, have a look at custom debugging.
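As an illustration, a custom node might surface its own debug information roughly like this - a hedged sketch in which MyCustomNode is hypothetical and the _debug output key follows the custom debugging docs linked above:
from haystack.nodes import BaseComponent

class MyCustomNode(BaseComponent):  # hypothetical example node
    outgoing_edges = 1

    def run(self, query: str):
        # ... real node logic would go here ...
        output = {
            "query": query,
            # custom info propagated to the pipeline output when debug=True
            "_debug": {"received_query_length": len(query)},
        }
        return output, "output_1"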
FARM Migration
Those of you following Haystack from its first days will know that Haystack first evolved out of the FARM framework. While FARM is designed to handle diverse NLP models and tasks, Haystack gives full end-to-end support to search and question answering use cases with a focus on coordinating all components that take a proof-of-concept into production.
Haystack has always relied on FARM for much lower-level processing and modeling. To reduce the implementation overhead and simplify debugging, we have migrated the relevant parts of FARM into the new haystack/modeling
package.
⚠️ Breaking Changes & Migration Guide
Migration to v1.0
With the release of v1.0, we decided to make some bold changes.
We believe this has brought a significant improvement in usability and makes the project more future-proof.
While this does come with a few breaking changes, we do our best to guide you on how to go from v0.x to v1.0.
For more details see the Migration Guide and if you need more guidance, just reach out via Slack.
New Package Structure & Changed Imports
Due to the ever-increasing number of Nodes and Document Stores being integrated into Haystack,
we felt the need to implement a repository structure that makes it easier to navigate to what you're looking for. We've also shortened the length of the imports.
haystack.document_stores
- All Document Stores can now be directly accessed from here
- Note the pluralization of document_store to document_stores
haystack.nodes
- This directory directly contains any class that can be used as a node
- This includes File Converters and PreProcessors
haystack.pipelines
- This contains all the base, custom and pre-made pipeline classes
- Note the pluralization of pipeline to pipelines
haystack.utils
- Any utility functions
➡️ For the large majority of imports, the old style still works but this will be deprecated in future releases!
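To give a quick sketch of the new import style (the class names are just common examples):
from haystack.document_stores import ElasticsearchDocumentStore
from haystack.nodes import EmbeddingRetriever, FARMReader
from haystack.pipelines import ExtractiveQAPipeline
from haystack.utils import print_answers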
Primitive Objects
Instead of relying on dictionaries, Haystack now standardizes more of the inputs and outputs of Nodes using primitive classes such as Document, Answer, Label, and MultiLabel.
With these, there is now support for data structures beyond text and the REST API schema is built around their structure.
Using these classes also allows for the autocompletion of fields in your IDE.
Tip: To see examples of these primitive classes being returned, have a look at Ready-Made Pipelines.
Many of the fields in these classes have also been renamed or removed.
You can see a more comprehensive list of them in this Github issue.
Below, we will go through a few cases that are likely to impact established workflows.
Input Document Format
This dictionary schema used to be the recommended way to prepare your data to be indexed.
Now we strongly recommend using our dedicated Document class as a replacement.
The text field has been renamed to content to accommodate cases where it is used for another data format, for example in Table QA.
v0.x:
doc = {
'text': 'DOCUMENT_TEXT_HERE',
'meta': {'name': DOCUMENT_NAME, ...}
}
v1.0:
doc = Document(
content='DOCUMENT_TEXT_HERE',
meta={'name': DOCUMENT_NAME, ...}
)
From here, you can take the same steps to write Documents into your Document Store.
document_store.write_documents([doc])
Response format of Reader
All Reader Nodes now return Answer objects instead of dictionaries.
v0.x:
[
{
'answer': 'Fang',
'score': 13.26807975769043,
'probability': 0.9657130837440491,
'context': """Криволапик (Kryvolapyk, kryvi lapy "crooked paws")
===Fang (Hagrid's dog)===
*Chinese (PRC): 牙牙 (ya2 ya) (from 牙 "tooth", 牙,"""
}
]
v1.0:
[
<Answer {'answer': 'Eddard', 'type': 'extractive', 'score': 0.9946763813495636, 'co...