v1.5.0
⭐ Highlights
Generative Pseudo Labeling
Dense retrievers excel when they are fine-tuned on a labeled dataset from the target domain. However, such datasets rarely exist and are costly to create from scratch with human annotators. Generative Pseudo Labeling solves this dilemma by creating the labels automatically for you, which makes it a super fast and low-cost alternative to manual annotation. Technically speaking, it is an unsupervised approach for domain adaptation of dense retrieval models. Given a corpus of unlabeled documents from that domain, it automatically generates queries on that corpus and then uses a cross-encoder model to create pseudo labels for these queries. The pseudo labels can then be used to adapt retriever models to that domain. Here is a code example that shows how to do that in Haystack:
from haystack.nodes.retriever import EmbeddingRetriever
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes.question_generator.question_generator import QuestionGenerator
from haystack.nodes.label_generator.pseudo_label_generator import PseudoLabelGenerator
# Initialize any document store and fill it with documents from your domain - no labels needed.
document_store = InMemoryDocumentStore()
document_store.write_documents(...)
# Calculate and store a dense embedding for each document
retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="sentence-transformers/msmarco-distilbert-base-tas-b",
    max_seq_len=200,
)
document_store.update_embeddings(retriever)
# Use the new PseudoLabelGenerator to automatically generate labels and train the retriever on them
qg = QuestionGenerator(model_name_or_path="doc2query/msmarco-t5-base-v1", max_length=64, split_length=200, batch_size=12)
psg = PseudoLabelGenerator(qg, retriever)
output, _ = psg.run(documents=document_store.get_all_documents())
retriever.train(output["gpl_labels"])
Batch Processing with Query Pipelines
Every query pipeline now has a run_batch() method, which lets you pass multiple queries to the pipeline at once.
Together with the list of queries, you can provide either a single list of documents or a list of lists of documents. In the first case, answers are returned for each query-document pair. In the second case, each query is applied to its corresponding list of documents, matched by position in the two lists. A third option is to pass a list containing a single query, which is then applied to each list of documents separately.
Here is an example with a pipeline:
from haystack.pipelines import ExtractiveQAPipeline
...
pipe = ExtractiveQAPipeline(reader, retriever)
predictions = pipe.pipeline.run_batch(
    queries=["Who is the father of Arya Stark?", "Who is the mother of Arya Stark?"],
    params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}},
)
And here is an example with a single reader node:
from haystack.nodes import FARMReader
from haystack.schema import Document
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2")  # any extractive QA model works here
result = reader.predict_batch(
    queries=["1st sample query", "2nd sample query"],
    documents=[[Document(content="sample doc1"), Document(content="sample doc2")],
               [Document(content="sample doc3"), Document(content="sample doc4")]],
)
# result has the form:
# {"queries": ["1st sample query", "2nd sample query"],
#  "answers": [[<answers from doc1 and doc2>], [<answers from doc3 and doc4>]], ...}
Pipeline Evaluation with Advanced Label Scopes
Typically, a predicted answer is considered correct if it matches the gold answer in the set of evaluation labels. Similarly, a retrieved document is considered correct if its ID matches the gold document ID in the labels. Sometimes however, these simple definitions of "correctness" are not sufficient and you want to further specify the "scope" within which an answer or a document is considered correct.
For this reason, EvaluationResult.calculate_metrics() accepts the parameters answer_scope and document_scope.
As an example, you might consider an answer to be correct only if it stems from a specific context of surrounding words. In that case, specify answer_scope="context" in calculate_metrics(). See the updated docstrings for a description of the different label scopes, or the updated tutorial on evaluation.
...
document_store.add_eval_data(
    filename="data/tutorial5/nq_dev_subset_v2.json",
    preprocessor=preprocessor,
)
...
eval_labels = document_store.get_all_labels_aggregated(drop_negative_labels=True, drop_no_answers=True)
eval_result = pipeline.eval(labels=eval_labels, params={"Retriever": {"top_k": 5}})
metrics = eval_result.calculate_metrics(answer_scope="context")
print(f'Reader - F1-Score: {metrics["Reader"]["f1"]}')
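The same mechanism works on the document level. The sketch below is hedged: it assumes that "context" is also a supported value for document_scope (see the docstrings for the full list of scopes), so that a retrieved document counts as correct when its content matches the gold context rather than only when its ID matches.
# Count a retrieved document as correct if its content matches the gold context,
# instead of requiring an exact document ID match.
document_metrics = eval_result.calculate_metrics(document_scope="context")
print(document_metrics["Retriever"])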
Support of DeBERTa Models
Haystack now supports DeBERTa models! These models come with some smart architectural improvements over BERT and RoBERTa, such as encoding the relative and absolute position of a token in the input sequence. Only the following three lines are needed to train a DeBERTa reader model on the SQuAD 2.0 dataset. Compared to a RoBERTa model trained on the same dataset, you can expect a boost in F1-score from ~84% to ~88% ("microsoft/deberta-v3-large" even gets you to an F1-score as high as ~92%).
from haystack.nodes import FARMReader
reader = FARMReader(model_name_or_path="microsoft/deberta-v3-base")
reader.train(data_dir="data/squad20", train_filename="train-v2.0.json", dev_filename="dev-v2.0.json", save_dir="my_model")
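Once training has finished, the fine-tuned reader can be loaded back from the save directory and queried like any other reader. This is a minimal sketch that assumes the save_dir="my_model" from the training call above; the query and document are made-up examples:
from haystack.schema import Document
# Load the fine-tuned DeBERTa reader from the local save directory
trained_reader = FARMReader(model_name_or_path="my_model")
prediction = trained_reader.predict(
    query="Who is the father of Arya Stark?",
    documents=[Document(content="Eddard Stark, Lord of Winterfell, is the father of Arya Stark.")],
    top_k=1,
)
print(prediction["answers"][0].answer)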
⚠️ Breaking Changes
- Validation for Ray pipelines by @ZanSara in #2545
- Add run_batch method to all nodes and Pipeline to allow batch querying by @bogdankostic in #2481
- Support context matching in pipeline.eval() by @tstadel in #2482
Other Changes
Pipeline
- Add sort arg to JoinAnswers by @brandenchan in #2436
- Update run() and run_batch() params descriptions in API by @agnieszka-m in #2542
- [CI refactoring] Avoid ray==1.12.0 on Windows by @ZanSara in #2562
- Prevent losing names of utilized components when loaded from config by @tstadel in #2525
- Do not copy _component_config in get_components_definitions by @ZanSara in #2574
- Add run_batch for standard pipelines by @bogdankostic in #2595
- Fix Pipeline.get_config() for forked pipelines by @tstadel in #2616
- Remove wrong retriever top_1 metrics from print_eval_report by @tstadel in #2510
- Handle transformers pipeline flattening lists of length 1 by @MichelBartels in #2531
- Fix pipeline.eval with context matching for Table-QA by @tstadel in #2597
- set top_k to 5 in SAS to be consistent by @ClaMnc in #2550
DocumentStores
- Make DeepsetCloudDocumentStore work with non-existing index by @bogdankostic in #2513
- [Weaviate] Exit the while loop when we query less documents than available by @masci in #2537
- Fix knn params for aws managed opensearch by @tstadel in #2581
- Fix number of returned values in get_metadata_values_by_key by @bogdankostic in #2614
Retriever
- Simplify loading of EmbeddingRetriever by @bogdankostic in #2619
- Add training checkpoint in retriever trainer by @dimitrisna in #2543
- Include meta data when computing embeddings in EmbeddingRetriever by @MichelBartels in #2559
Documentation
- fix small typo in Document doc string by @galtay in #2520
- rearrange contributing guidelines by @masci in #2515
- Documenting output score of JoinDocuments when using concatenation by @MichelBartels in #2561
- Minor lg updates to doc strings by @agnieszka-m in #2585
- Adjust pydoc markdown config so methods shown with classes by @brandenchan in #2511
- Update Ray pipeline docs with validation info by @agnieszka-m in #2590
Other Changes
- Upgrade transformers version to 4.18.0 by @bogdankostic in #2514
- Upgrade torch version to 1.11 by @bogdankostic in #2538
- Fix tutorials 4, 7 and 8 by @bogdankostic in #2526
- Tutorial1: convert_files_to_dicts --> convert_files_to_docs by @ZanSara in #2546
- Fix docker image tag with semantic version for releases by @askainet in #2548
- added launch_tika method by @anakin87 in #2567
- Remove encoding option from PDFToTextOCRConverter by @julian-risch in #2553
- Fix StaleElementReferenceException in Crawler by @bogdankostic in #2591
New Contributors
- @galtay made their first contribution in #2520
- @masci made their first contribution in #2515
- @ClaMnc made their first contribution in #2550
- @anakin87 made their first contribution in #2567
- @dimitrisna made their first contribution in #2543
❤️ Big thanks to all contributors and the whole community!