v1.8.0

@julian-risch released this 26 Aug 16:08
4e518cd

⭐ Highlights

This release comes with a bunch of new features, improvements and bug fixes. Let us know how you like it on our brand new Haystack Discord server! Here are the highlights of the release:

Pipeline Evaluation in Batch Mode #2942

Pipeline evaluation often runs on large datasets. With this new feature, batches of queries can be processed at the same time on a GPU, which decreases the time needed for an evaluation run. We are working on further speed improvements. To try it out, you only need to replace the call to pipeline.eval() with pipeline.eval_batch() when you evaluate your question answering pipeline:

...
pipeline = ExtractiveQAPipeline(reader=reader, retriever=retriever)
eval_result = pipeline.eval_batch(labels=eval_labels, params={"Retriever": {"top_k": 5}})
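To make the idea of processing queries in batches more concrete, here is a small conceptual sketch in plain Python. It is not how eval_batch() is implemented in Haystack; the helper name iter_batches and the batch size are assumptions chosen for illustration only.

```python
from typing import Iterator, List, Sequence


def iter_batches(queries: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most batch_size queries.

    Hypothetical helper: it only illustrates grouping queries so that a
    model could process each group in one forward pass.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(queries), batch_size):
        yield list(queries[start:start + batch_size])


queries = [f"question {i}" for i in range(5)]
batched = list(iter_batches(queries, batch_size=2))
print([len(batch) for batch in batched])
```

With five queries and a batch size of two, the model would be called three times instead of five, which is where the time savings on a GPU come from.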

Early Stopping in Reader and Retriever Training #3071

When training a reader or retriever model, you need to specify the number of training epochs. If the model doesn't improve any further after the first few epochs, training usually still continues for the rest of the specified number of epochs. Early Stopping can now automatically monitor how much the model improves during training and stop the process when there is no significant improvement. Various metrics can be monitored, including loss, EM, f1, and top_n_accuracy for FARMReader or loss, acc, f1, and average_rank for DensePassageRetriever. For example, reader training can be stopped when the loss doesn't decrease by at least 0.001 compared to the previous epoch:

from haystack.nodes import FARMReader
from haystack.utils.early_stopping import EarlyStopping
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2-distilled")
reader.train(
    data_dir="data/squad20", train_filename="dev-v2.0.json",
    early_stopping=EarlyStopping(min_delta=0.001),
    use_gpu=True, n_epochs=8, save_dir="my_model",
)
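The stopping rule described above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the EarlyStopping class from Haystack: the function name epoch_to_stop and the loss values are made up, and the rule simply compares each epoch's loss with the previous one.

```python
from typing import Optional, Sequence


def epoch_to_stop(losses: Sequence[float], min_delta: float = 0.001) -> Optional[int]:
    """Return the index of the first epoch whose loss does not improve on the
    previous epoch by at least min_delta, or None if every epoch improves.

    Hypothetical helper for illustration; epochs are counted from 0.
    """
    for epoch in range(1, len(losses)):
        improvement = losses[epoch - 1] - losses[epoch]
        if improvement < min_delta:
            return epoch
    return None


# Made-up loss curve: large gains at first, then a plateau.
losses = [0.90, 0.60, 0.45, 0.4495, 0.4494]
print(epoch_to_stop(losses, min_delta=0.001))
```

In this made-up run, the loss only drops by about 0.0005 between the third and fourth epoch, so training would stop there instead of continuing for all specified epochs.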

PineconeDocumentStore Without SQL Database #2749

Thanks to @jamescalam, the PineconeDocumentStore no longer depends on a local SQL database. From now on, all you need to provide when you initialize a PineconeDocumentStore is a Pinecone API key:

from haystack import Document
from haystack.document_stores import PineconeDocumentStore
document_store = PineconeDocumentStore(api_key="...")
docs = [Document(content="...")]
document_store.write_documents(docs)

FAISS in OpenSearchDocumentStore #3101 #3029

OpenSearch supports different approximate k-NN libraries for indexing and search. In Haystack's OpenSearchDocumentStore you can now set the knn_engine parameter to choose between nmslib and faiss. When loading an existing index, you can also specify a knn_engine, and Haystack checks whether the same engine was used to create the index. If not, it falls back to slow exact vector calculation.
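The engine check and fallback described above boil down to a simple decision, sketched here in plain Python. This is an illustration only and not Haystack's code: the function resolve_search_mode and the returned labels are hypothetical names.

```python
SUPPORTED_ENGINES = ("nmslib", "faiss")


def resolve_search_mode(requested_engine: str, index_engine: str) -> str:
    """Decide how vector search would run against an existing index.

    Hypothetical helper: returns "approximate" when the requested engine
    matches the one the index was created with, and "exact" (the slow
    fallback) when the two differ.
    """
    if requested_engine not in SUPPORTED_ENGINES:
        raise ValueError(f"knn_engine must be one of {SUPPORTED_ENGINES}")
    if requested_engine == index_engine:
        return "approximate"
    return "exact"


print(resolve_search_mode("faiss", index_engine="faiss"))
print(resolve_search_mode("faiss", index_engine="nmslib"))
```

The takeaway is to pass the same knn_engine that was used when the index was created, because a mismatch still works but is slower.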

Highlighted Bug Fixes

A bug was fixed that prevented users from loading private models in some components because the authentication token wasn't passed on correctly. A second bug was fixed in the schema files: for parameters of type Optional[List[]], validation failed if the parameter was explicitly set to None.

  • fix: Use use_auth_token in all cases when loading from the HF Hub by @sjrl in #3094
  • bug: handle Optional params in schema validation by @anakin87 in #2980

Other Changes

DocumentStores

  • feat: Allow exact list matching with field in Elasticsearch filtering by @masci in #2988

Documentation

Crawler

  • fix: update ChromeDriver options on restricted environments and add ChromeDriver options as function parameter by @danielbichuetti in #3043
  • fix: Crawler quits ChromeDriver on destruction by @danielbichuetti in #3070

Other Changes

  • fix(translator): write translated text to output documents, while keeping input untouched by @danielbichuetti in #3077
  • test: Use random_sample instead of ndarray for random array in OpenSearchDocumentStore test by @bogdankostic in #3083
  • feat: add progressbar to upload_files() for deepset Cloud client by @tholor in #3069
  • refactor: update package metadata by @ofek in #3079

New Contributors

❤️ Big thanks to all contributors and the whole community!

Full Changelog: v1.7.1...v1.8.0