v1.4.0
⭐ Highlights
Logging Evaluation Results to MLflow
Logging and comparing the evaluation results of multiple different pipeline configurations is much easier now thanks to the newly implemented MLflowTrackingHead. With our public MLflow instance you can log evaluation metrics and metadata about pipeline, evaluation set and corpus. Here is an example log file. If you have your own MLflow instance you can even store the pipeline YAML file and the evaluation set as artifacts. In Haystack, all you need is the execute_eval_run() method:
from haystack import Pipeline

eval_result = Pipeline.execute_eval_run(
    index_pipeline=index_pipeline,
    query_pipeline=query_pipeline,
    evaluation_set_labels=labels,
    corpus_file_paths=file_paths,
    corpus_file_metas=file_metas,
    experiment_tracking_tool="mlflow",
    experiment_tracking_uri="http://localhost:5000",
    experiment_name="my-query-pipeline-experiment",
    experiment_run_name="run_1",
    pipeline_meta={"name": "my-pipeline-1"},
    evaluation_set_meta={"name": "my-evalset"},
    corpus_meta={"name": "my-corpus"},
    add_isolated_node_eval=True,
    reuse_index=False
)
Filtering Answers by Confidence in FARMReader
The FARMReader has a new parameter, confidence_threshold, that filters out predictions whose confidence is below the threshold. The threshold is disabled by default but can be set to a value between 0 and 1 when initializing the FARMReader:
from haystack.nodes import FARMReader
model = "deepset/roberta-base-squad2"
reader = FARMReader(model, confidence_threshold=0.5)
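To illustrate what such a threshold does conceptually, here is a minimal, library-independent sketch. The Prediction class and the filter_by_confidence helper are hypothetical names made up for this illustration and are not part of Haystack. Keeping a score that is exactly equal to the threshold is also an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Prediction:
    """Hypothetical stand-in for a reader answer with a confidence score."""
    answer: str
    confidence: float


def filter_by_confidence(
    predictions: List[Prediction], threshold: Optional[float] = None
) -> List[Prediction]:
    """Keep only predictions whose confidence reaches the threshold.

    A threshold of None mirrors the "disabled by default" behaviour:
    every prediction is returned.
    """
    if threshold is None:
        return list(predictions)
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    # Assumption of this sketch: scores equal to the threshold are kept.
    return [p for p in predictions if p.confidence >= threshold]


predictions = [
    Prediction("Berlin", 0.92),
    Prediction("Munich", 0.41),
    Prediction("Hamburg", 0.07),
]
kept = filter_by_confidence(predictions, threshold=0.5)
print([p.answer for p in kept])
```

With a threshold of 0.5 only the first prediction survives, while calling the helper without a threshold returns all three.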
Deprecating Milvus1DocumentStore & Renaming ElasticsearchRetriever
The Milvus1DocumentStore is deprecated in favor of the newer Milvus2DocumentStore. Besides big architectural changes that impact performance and reliability, Milvus version 2.0 supports filtering by scalar data types.
For Haystack users this means you can now run a query using vector similarity and filter by metadata at the same time! See the Milvus documentation for more details if you need to migrate from Milvus1DocumentStore to Milvus2DocumentStore. #2495
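The idea of combining a metadata filter with vector similarity can be sketched without any database. The snippet below is only a conceptual illustration: the in-memory DOCS list, the dot helper and filtered_search are hypothetical and do not reflect the Milvus or Haystack APIs.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory corpus: (document id, embedding, metadata).
DOCS: List[Tuple[str, List[float], Dict[str, str]]] = [
    ("doc-a", [1.0, 0.0], {"lang": "en"}),
    ("doc-b", [0.9, 0.1], {"lang": "de"}),
    ("doc-c", [0.0, 1.0], {"lang": "en"}),
]


def dot(u: List[float], v: List[float]) -> float:
    """Dot-product similarity between two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def filtered_search(
    query: List[float], filters: Dict[str, str], top_k: int = 2
) -> List[str]:
    """Apply the scalar metadata filter, then rank the rest by similarity."""
    candidates = [
        (doc_id, emb)
        for doc_id, emb, meta in DOCS
        if all(meta.get(key) == value for key, value in filters.items())
    ]
    ranked = sorted(candidates, key=lambda c: dot(query, c[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:top_k]]


print(filtered_search([1.0, 0.0], {"lang": "en"}))
```

Without the filter, doc-b would rank second for this query; with the filter on lang it is excluded before ranking.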
The ElasticsearchRetriever node works not only with the ElasticsearchDocumentStore but also with the OpenSearchDocumentStore, so we renamed it. It is now called BM25Retriever after the underlying BM25 ranking function. For the same reason, ElasticsearchFilterOnlyRetriever is now called FilterRetriever. Both the deprecated names and the new names currently work, but we will drop support for the deprecated names in a future release. An overview of the different DocumentStores in Haystack can be found here. #2423 #2461
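For readers unfamiliar with how an old class name can keep working during a deprecation period, here is a generic sketch of the pattern. It is not Haystack's actual implementation: both classes below are simplified, hypothetical stand-ins that only borrow the names from this release.

```python
import warnings


class BM25Retriever:
    """Hypothetical stand-in for the class under its new name."""

    def __init__(self, top_k: int = 10):
        self.top_k = top_k


class ElasticsearchRetriever(BM25Retriever):
    """Deprecated alias: behaves like BM25Retriever but warns when created."""

    def __init__(self, top_k: int = 10):
        warnings.warn(
            "ElasticsearchRetriever is deprecated, use BM25Retriever instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        super().__init__(top_k=top_k)


# Record warnings so the deprecation message can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy = ElasticsearchRetriever(top_k=5)

print(type(legacy).__name__, legacy.top_k, len(caught))
```

Code written against the old name keeps running, and the warning gives users time to switch before the alias is removed.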
Fixing Evaluation Discrepancies
The evaluation of pipeline nodes with pipeline.eval(add_isolated_node_eval=True) and alternatively with retriever.eval() and reader.eval() gave slightly different results due to a bug in handling no_answers. This bug is fixed now and all different ways to run the evaluation give the same results. #2381
⚠️ Breaking Changes
- Change return types of indexing pipeline nodes by @bogdankostic in #2342
- Upgrade weaviate-client to 3.3.3 and fix get_all_documents by @ZanSara in #1895
- Align TransformersReader defaults with FARMReader by @julian-risch in #2490
- Change default encoding for PDFToTextConverter from Latin 1 to UTF-8 by @ZanSara in #2420
- Validate YAML files without loading the nodes by @ZanSara in #2438
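The encoding change above matters because the same bytes decode to different text under Latin 1 and UTF-8. The following sketch uses only the Python standard library and an invented sample string; it does not call PDFToTextConverter.

```python
# Sample text with non-ASCII characters (invented for this illustration).
text = "Café Müller"

# Bytes as a UTF-8 encoded source would provide them.
raw = text.encode("utf-8")

# Decoding with the matching encoding restores the original text.
as_utf8 = raw.decode("utf-8")

# Decoding the same bytes as Latin 1 never fails, but turns every
# two-byte UTF-8 sequence into two unrelated characters.
as_latin1 = raw.decode("latin-1")

print(as_utf8)
print(as_latin1)
```

If your documents relied on the previous Latin 1 default, check the converted text for this kind of garbling after upgrading.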
Other Changes
Pipeline
- Add tests for missing __init__ and super().__init__() in custom nodes by @ZanSara in #2350
- Forbid usage of *args and **kwargs in any node's __init__ by @ZanSara in #2362
- Change YAML version exception into a warning by @ZanSara in #2385
- Make sure that debug=True and params={'debug': True} behaves the same way by @ZanSara in #2442
- Add support for positional args in pipeline.get_config() by @tstadel in #2478
- enforce same index values before and after saving/loading eval dataframes by @tstadel in #2398
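Custom node authors who want to check their own classes against the *args and **kwargs restriction listed above could use a check like the one sketched below. This is an assumption-laden illustration built on the standard inspect module, not the validation Haystack itself performs. The helper has_variadic_init and both example classes are hypothetical.

```python
import inspect


def has_variadic_init(cls: type) -> bool:
    """Return True if cls.__init__ accepts *args or **kwargs.

    Only meaningful for classes that define their own __init__: the
    inherited object.__init__ reports variadic parameters itself.
    """
    variadic_kinds = (
        inspect.Parameter.VAR_POSITIONAL,  # *args
        inspect.Parameter.VAR_KEYWORD,     # **kwargs
    )
    parameters = inspect.signature(cls.__init__).parameters.values()
    return any(p.kind in variadic_kinds for p in parameters)


class ExplicitNode:
    """Hypothetical node with fully named init parameters."""

    def __init__(self, top_k: int = 10):
        self.top_k = top_k


class VariadicNode:
    """Hypothetical node that would violate the restriction."""

    def __init__(self, *args, **kwargs):
        self.args = args
        self.kwargs = kwargs


print(has_variadic_init(ExplicitNode), has_variadic_init(VariadicNode))
```

Explicit parameter names make it possible to record exactly which arguments a node was created with, which is what a generated pipeline configuration needs.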
DocumentStores
- Fix sparse retrieval with filters returns results without any text-match by @tstadel in #2359
- EvaluationSetClient for deepset cloud to fetch evaluation sets and la… by @FHardow in #2345
- Update launch script for Milvus from 1.x to 2.x by @ZanSara in #2378
- Use ElasticsearchDocumentStore.get_all_documents in ElasticsearchFilterOnlyRetriever.retrieve by @adri1wald in #2151
- Fix and use delete_index instead of delete_documents in tests by @tstadel in #2453
- Update docs of DeepsetCloudDocumentStore by @tholor in #2460
- Add support for aliases in elasticsearch document store by @ZeJ0hn in #2448
- fix dot_product metric by @jamescalam in #2494
- Deprecate Milvus1DocumentStore by @bogdankostic in #2495
- Fix OpenSearchDocumentStore's __init__ by @ZanSara in #2498
Retriever
- Rename dataset to evaluation_set when logging to mlflow by @tstadel in #2457
- Linearize tables in EmbeddingRetriever by @MichelBartels in #2462
- Print warning in EmbeddingRetriever if sentence-transformers model used with different model format by @mpangrazzi in #2377
- Add flag to disable scaling scores to probabilities by @tstadel in #2454
- changing the name of the retrievers from es_retriever to retriever by @TuanaCelik in #2487
- Replace dpr with embeddingretriever tut14 by @mkkuemmel in #2336
- Support conjunctive queries in sparse retrieval by @tstadel in #2361
- Fix: Auth token not passed for EmbeddingRetriever by @mathislucka in #2404
- Pass use_auth_token to sentence transformers EmbeddingRetriever by @MichelBartels in #2284
Reader
- Fix TableReader for tables without rows by @bogdankostic in #2369
- Match answer sorting in QuestionAnsweringHead with FARMReader by @tstadel in #2414
- Fix reader.eval() and reader.eval_on_file() output by @tstadel in #2476
- Raise error if torch-scatter is not installed or wrong version is installed by @MichelBartels in #2486
Documentation
- Fix link to squad_to_dpr.py in DPR train tutorial by @raphaelmerx in #2334
- Add evaluation and document conversion to tutorial 15 by @MichelBartels in #2325
- Replace TableTextRetriever with EmbeddingRetriever in Tutorial 15 by @MichelBartels in #2479
- Fix RouteDocuments documentation by @MichelBartels in #2380
Other Changes
- extract extension based on file's content by @GiannisKitsos in #2330
- Reduce num REST API workers to accommodate smaller machines by @brandenchan in #2400
- Add devices alongside use_gpu in FARMReader by @ZanSara in #2294
- Delete files in docs/_src by @brandenchan in #2322
- Add apt update in Linux CI by @ZanSara in #2415
- Exclude beir from Windows install by @ZanSara in #2419
- Added macos version of xpdf in tutorial 8 by @seduerr91 in #2424
- Make python-magic fully optional by @ZanSara in #2412
- Upgrade xpdf to 4.0.4 by @tholor in #2443
- Update xpdfreader package installation by @AI-Ahmed in #2491
New Contributors
- @raphaelmerx made their first contribution in #2334
- @FHardow made their first contribution in #2345
- @GiannisKitsos made their first contribution in #2330
- @mpangrazzi made their first contribution in #2377
- @seduerr91 made their first contribution in #2424
- @ZeJ0hn made their first contribution in #2448
- @AI-Ahmed made their first contribution in #2491
❤️ Big thanks to all contributors and the whole community!