v1.3.0
⭐ Highlights
Pipeline YAML Syntax Validation
The syntax of pipeline configurations as defined in YAML files can now be validated. If the validation fails, erroneous components/parameters are identified to make it simple to fix them. Here is a code snippet to manually validate a file:
from pathlib import Path
from haystack.pipelines.config import validate_yaml
validate_yaml(Path("rest_api/pipeline/pipelines.haystack-pipeline.yml"))
Your IDE can also take care of the validation when you edit a pipeline YAML file. The suffix *.haystack-pipeline.yml tells your IDE that this YAML file contains a Haystack pipeline configuration and enables checks and autocompletion features if the IDE is configured accordingly (YAML plugin for VSCode, Configuration Guide for PyCharm). The schema used for validation can be found in SchemaStore, which points to the schema files for the different Haystack versions. Note that updating the Haystack version might sometimes require small changes to the pipeline YAML files. You can set version: 'unstable' in the pipeline YAML to circumvent the validation, or set it to the latest Haystack version if the components and parameters that you use are compatible with that version. #2226
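To illustrate the naming convention and the version key described above, here is a minimal sketch that writes a small pipeline configuration to a file carrying the *.haystack-pipeline.yml suffix. The component names, the file name demo.haystack-pipeline.yml, and the helpers has_pipeline_suffix and write_pipeline_config are assumptions made for illustration only; check the exact layout against the schema for your Haystack version.

```python
import tempfile
from pathlib import Path

# Illustrative pipeline configuration. Component names and parameters are
# assumptions; version: 'unstable' circumvents the version-specific validation.
PIPELINE_YAML = """\
version: 'unstable'

components:
  - name: DocumentStore
    type: ElasticsearchDocumentStore
  - name: Retriever
    type: ElasticsearchRetriever
    params:
      document_store: DocumentStore
      top_k: 10

pipelines:
  - name: query
    nodes:
      - name: Retriever
        inputs: [Query]
"""

# Suffix that tells a suitably configured IDE to apply the Haystack schema.
PIPELINE_SUFFIX = ".haystack-pipeline.yml"


def has_pipeline_suffix(path: Path) -> bool:
    """Hypothetical helper: True if the file name ends with the pipeline suffix."""
    return path.name.endswith(PIPELINE_SUFFIX)


def write_pipeline_config(directory: Path, stem: str = "demo") -> Path:
    """Hypothetical helper: write the example config to <stem>.haystack-pipeline.yml."""
    target = directory / f"{stem}{PIPELINE_SUFFIX}"
    target.write_text(PIPELINE_YAML, encoding="utf-8")
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        config_path = write_pipeline_config(Path(tmp))
        print(config_path.name, has_pipeline_suffix(config_path))
```

A file written this way can be passed to validate_yaml as in the snippet at the top of this section. Replacing 'unstable' with a concrete Haystack version should make the checks apply to that version's schema.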
Pinecone DocumentStore
We added another DocumentStore to Haystack: PineconeDocumentStore! 🎉 Pinecone is a fully managed service for very large-scale dense retrieval. Embeddings and metadata are stored in a hosted Pinecone vector database, while the document content is stored in a local SQL database. This separation simplifies infrastructure setup and maintenance. To use this new document store, all you need is an API key, which you can obtain by creating an account on the Pinecone website. #2254
import os
from haystack.document_stores import PineconeDocumentStore
document_store = PineconeDocumentStore(api_key=os.environ["PINECONE_API_KEY"])
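The split between vector storage and content storage can be pictured with a small toy model. The sketch below is not the PineconeDocumentStore implementation and does not call the Pinecone client: an in-memory dict stands in for the hosted vector index, SQLite stands in for the local SQL database, and the class name SplitStoreSketch and its methods are hypothetical.

```python
import sqlite3
from typing import Dict, List, Tuple


class SplitStoreSketch:
    """Toy model: embeddings and metadata in one store, document text in SQL.

    The dict below only imitates a hosted vector index; it is NOT Pinecone.
    """

    def __init__(self) -> None:
        # Local SQL database holding the document content.
        self._sql = sqlite3.connect(":memory:")
        self._sql.execute("CREATE TABLE document (id TEXT PRIMARY KEY, content TEXT)")
        # Stand-in for the vector database: id -> (embedding, metadata).
        self._vectors: Dict[str, Tuple[List[float], dict]] = {}

    def write(self, doc_id: str, content: str, embedding: List[float], meta: dict) -> None:
        """Store the text in SQL and the embedding plus metadata in the vector index."""
        self._sql.execute("INSERT OR REPLACE INTO document VALUES (?, ?)", (doc_id, content))
        self._sql.commit()
        self._vectors[doc_id] = (embedding, meta)

    def query(self, query_embedding: List[float], top_k: int = 1) -> List[dict]:
        """Rank ids by dot product in the vector index, then fetch the text from SQL."""
        scored = sorted(
            (
                (sum(q * v for q, v in zip(query_embedding, emb)), doc_id)
                for doc_id, (emb, _meta) in self._vectors.items()
            ),
            reverse=True,
        )
        results = []
        for score, doc_id in scored[:top_k]:
            row = self._sql.execute(
                "SELECT content FROM document WHERE id = ?", (doc_id,)
            ).fetchone()
            results.append(
                {"id": doc_id, "score": score, "content": row[0], "meta": self._vectors[doc_id][1]}
            )
        return results


if __name__ == "__main__":
    store = SplitStoreSketch()
    store.write("a", "Pinecone stores vectors.", [1.0, 0.0], {"source": "notes"})
    store.write("b", "SQL stores text.", [0.0, 1.0], {"source": "notes"})
    print(store.query([0.9, 0.1], top_k=1))
```

The point of the design is that similarity search only touches ids, vectors, and metadata, while the potentially large document text is looked up by id afterwards.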
BEIR Integration
Fresh from the 🍻 cellar, Haystack now has an integration with our favorite BEnchmarking Information Retrieval tool, BEIR. It contains preprocessed datasets for zero-shot evaluation of retrieval models in 17 different languages, which you can use to benchmark your pipelines. For example, a DocumentSearchPipeline can now be evaluated by calling Pipeline.eval_beir() after installing Haystack with the BEIR dependency via pip install farm-haystack[beir]. Cheers! #2333
from haystack.pipelines import DocumentSearchPipeline, Pipeline
from haystack.nodes import TextConverter, ElasticsearchRetriever
from haystack.document_stores.elasticsearch import ElasticsearchDocumentStore
text_converter = TextConverter()
document_store = ElasticsearchDocumentStore(search_fields=["content", "name"], index="scifact_beir")
retriever = ElasticsearchRetriever(document_store=document_store, top_k=1000)
index_pipeline = Pipeline()
index_pipeline.add_node(text_converter, name="TextConverter", inputs=["File"])
index_pipeline.add_node(document_store, name="DocumentStore", inputs=["TextConverter"])
query_pipeline = DocumentSearchPipeline(retriever=retriever)
ndcg, _map, recall, precision = Pipeline.eval_beir(
    index_pipeline=index_pipeline, query_pipeline=query_pipeline, dataset="scifact"
)
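For readers unfamiliar with the returned metrics, the following sketch computes precision, recall, and nDCG at a cutoff k for a single query with binary relevance labels. It is a simplified illustration, not the implementation used by BEIR or Haystack, and the function names and toy document ids are assumptions. Real benchmark scores are additionally averaged over all queries, and MAP is omitted here for brevity.

```python
import math
from typing import List, Set


def precision_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance nDCG: discounted gain divided by the ideal discounted gain."""
    dcg = sum(
        1.0 / math.log2(position + 2)
        for position, doc_id in enumerate(ranked[:k])
        if doc_id in relevant
    )
    ideal = sum(1.0 / math.log2(position + 2) for position in range(min(len(relevant), k)))
    return dcg / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    ranked_ids = ["d3", "d1", "d7", "d2"]  # toy retriever output, best first
    relevant_ids = {"d1", "d2"}            # toy ground-truth labels
    print(precision_at_k(ranked_ids, relevant_ids, k=3))
    print(recall_at_k(ranked_ids, relevant_ids, k=3))
    print(ndcg_at_k(ranked_ids, relevant_ids, k=3))
```

In this toy case only one of the two relevant documents appears in the top three, and it sits at rank two, so nDCG is penalized both for the missing document and for the position of the one that was found.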
Breaking Changes
- Make Milvus2DocumentStore compatible with pymilvus>=2.0.0 by @MichelBartels in #2126
- Set provider parameter when instantiating onnxruntime.InferenceSession and make device a torch.device in internal methods by @cjb06776 in #1976
Pipeline
- Generate haystack-pipeline-1.2.0.schema.json by @ZanSara in #2239
- Add RouteDocuments and JoinAnswers nodes by @bogdankostic in #2256
- Refactor Pipeline peripherals by @tstadel in #2253
- Allow to deploy and undeploy Pipelines on Deepset Cloud by @tstadel in #2285
- Reintroduce debug as a valid global key for Pipeline's params by @ZanSara in #2298
- Replace dpr with embeddingretriever tut11 by @mkkuemmel in #2287
- Package JSON schemas properly in Haystack by @ZanSara in #2316
- Fix dependency graph for indexing pipelines during codegen by @tstadel in #2311
- Fix YAML pipeline paths in docker-compose.yml by @ZanSara in #2335
- Improve error message for nodes failing validation by @ZanSara in #2313
- Fix Pipeline.print_eval_report by @tstadel in #2271
- save_to_deepset_cloud: automatically convert document stores by @tstadel in #2283
- Sas gpu additions by @thimo72 in #2308
Models
DocumentStores
- Bulk insert in sql document stores by @OmniscienceAcademy in #2264
- 'os' wrapper to function for brownfield support by @TuanaCelik in #2282
- Using default OpenSearch parameters by @TuanaCelik in #2327
- Fix docker launch scripts by @tstadel in #2341
- Fix normalize_embedding using numba by @tstadel in #2347
Documentation
- Update other.yml with new node names by @agnieszka-m in #2286
- Bring back init defs to api in v1.2 and latest by @brandenchan in #2296
- Remove unneeded files in docs directory by @brandenchan in #2237
- change old text to content argument for translator examples by @ju-gu in #2240
Tutorials
- Fix tutorial dataset paths by @julian-risch in #2340
- Polish Evaluation Tutorial by @brandenchan in #2212
- Comment out Milvus cell on Tutorial6 by @ZanSara in #2243
- Change document attribute from text to content by @julian-risch in #2352
- Replace dpr with embeddingretriever tut5 by @mkkuemmel in #2274
- ipynb: inserted links to graph images by @mkkuemmel in #2309
Other Changes
- Implement Context Matching by @tstadel in #2293
- Fix surrounding context extraction in ParsrConverter by @bogdankostic in #2162
- Fix table extraction in ParsrConverter by @bogdankostic in #2262
- Api pages by @brandenchan in #2248
- fix pip backtracking issue by @tstadel in #2281
- Update reader/base.py to fix UnboundLocalError in #2273 by @thimo72 in #2275
- Remove substrings basic implementation by @dmigo in #2152
- adding quotes for zsh shell issue by @TuanaCelik in #2289
- Prevent Preprocessor from changing existing documents by @tstadel in #2297
- Fix install because of missing jsonschema dependency by @tstadel in #2315
- Add basic telemetry features by @julian-risch in #2314
- Let SquadData support data from Annotation Tool by @brandenchan in #2329
New Contributors
- @thimo72 made their first contribution in #2275
- @agnieszka-m made their first contribution in #2286
- @TuanaCelik made their first contribution in #2289
- @OmniscienceAcademy made their first contribution in #2264
- @jamescalam made their first contribution in #2254
- @cjb06776 made their first contribution in #1976
❤️ Big thanks to all contributors and the whole community!