v1.7.0
⭐ Highlights
This time we have a couple of smaller yet important feature highlights: lots of them coming from you, our amazing community!
🥂 Alongside that, as we notice more frequent and great contributions from our community, we are also announcing our brand new Haystack Discord server to help us interact better with the people that make Haystack what it is! 🥳
Here's what you'll find in Haystack 1.7:
Support for OpenAI GPT-3
If you always wanted to know how OpenAI's famous GPT-3 model compares to other models, now your time has come. It's been fully integrated into Haystack, so you can use it like any other model. Just sign up to OpenAI, copy your API key, and run the following code. To compare it to other models, check out our evaluation guide.
from haystack.nodes import OpenAIAnswerGenerator
from haystack import Document
reader = OpenAIAnswerGenerator(api_key="<your-api-token>", max_tokens=15, temperature=0.3)
docs = [Document(content="""The Big Bang Theory is an American sitcom.
The four main characters are all avid fans of nerd culture.
Among their shared interests are science fiction, fantasy, comic books and collecting memorabilia.
Star Trek in particular is frequently referenced""")]
res = reader.predict(query="Do the main characters of big bang theory like Star Trek?", documents=docs)
print(res)
Zero-Shot Query Classification
Until now, TransformersQueryClassifier was built closely around the excellent binary query-type classifier model of shahrukhx01. Although it was already possible to use other Transformer models, the choice was restricted to models that output binary labels. One of our amazing community contributions has now lifted this restriction.
But that's not all: @anakin87 added support for zero-shot classification models as well!
So now that you're completely free to choose the classification categories you want, you can let your creativity run wild. One thing you could do is customize the behavior of your pipeline based on the semantic category of the query, like this:
from haystack.nodes import TransformersQueryClassifier
# In zero-shot-classification, you are free to choose the labels
labels = ["music", "cinema", "food"]
query_classifier = TransformersQueryClassifier(
    model_name_or_path="typeform/distilbert-base-uncased-mnli",
    use_gpu=True,
    task="zero-shot-classification",
    labels=labels,
)
queries = [
    "In which films does John Travolta appear?",  # query about cinema
    "What is the Rolling Stones first album?",  # query about music
    "Who was Sergio Leone?",  # query about cinema
]
for query in queries:
    result = query_classifier.run(query=query)
    print(f'Query "{query}" was sent to {result[1]}')
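The second element of the result is the name of an output edge, not the label itself. The following is a minimal sketch of how you could translate that edge name back into a category. It assumes that edges are named output_1, output_2, and so on, in the same order as the labels list; both this naming scheme and the helper edge_to_label are assumptions made for illustration and are not part of the Haystack API.

```python
from typing import List

labels = ["music", "cinema", "food"]


def edge_to_label(edge: str, labels: List[str]) -> str:
    """Map an output edge name such as "output_2" back to its label.

    Assumption: edges are 1-indexed and follow the order of `labels`.
    """
    prefix = "output_"
    if not edge.startswith(prefix):
        raise ValueError(f"Unexpected edge name: {edge!r}")
    index = int(edge[len(prefix):]) - 1
    if not 0 <= index < len(labels):
        raise ValueError(f"Edge {edge!r} does not match any of the {len(labels)} labels")
    return labels[index]


print(edge_to_label("output_2", labels))
```

With a helper like this, the downstream nodes you attach to each edge can be documented or logged by category name instead of by edge number.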
Adding Page Numbers to Document Meta
Sometimes it's not enough to find the right answer or paragraph inside a document and just print it on the screen. Context matters and thus, for search applications, it's essential to send the user exactly to the place where the information came from. For huge documents, we're just halfway there if the user clicks a result and the document opens. To get to the right position, they still need to search the document using the document viewer. To make it easier, we added the parameter add_page_number to ParsrConverter, AzureConverter and PreProcessor. If you set it to True, it adds a meta field "page" to documents containing the page number of the text snippet or a table within the original file.
from haystack import Pipeline
from haystack.nodes import PDFToTextConverter, PreProcessor
from haystack.document_stores import InMemoryDocumentStore
converter = PDFToTextConverter()
preprocessor = PreProcessor(add_page_number=True)
document_store = InMemoryDocumentStore()
pipeline = Pipeline()
pipeline.add_node(component=converter, name="Converter", inputs=["File"])
pipeline.add_node(component=preprocessor, name="Preprocessor", inputs=["Converter"])
# The input name must match the name given to the preprocessor node above
pipeline.add_node(component=document_store, name="DocumentStore", inputs=["Preprocessor"])
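To illustrate the idea behind the "page" meta field, here is a rough, framework-free sketch of how a page number could be derived for each text split. It assumes that page breaks in the converted text are marked with form feed characters ("\f") and that the splits are consecutive pieces of that text; the function assign_page_numbers is hypothetical and is not the Haystack implementation.

```python
from typing import List, Tuple


def assign_page_numbers(splits: List[str]) -> List[Tuple[str, int]]:
    """Pair each split with the 1-based page on which it starts.

    Assumption: pages are separated by form feed characters ("\f")
    and `splits` are consecutive pieces of the same text.
    """
    page = 1
    result = []
    for split in splits:
        result.append((split, page))
        # Every form feed inside this split moves the next split one page further
        page += split.count("\f")
    return result


splits = [
    "Intro on page one.\f",
    "Details on page two. ",
    "More details.\f",
    "Summary on page three.",
]
for text, page in assign_page_numbers(splits):
    print(page, repr(text))
```

A search UI could then use the stored page number to open the document viewer directly at the right position.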
Gradient Accumulation for FARMReader
Training big Transformer models in low-resource environments is hard. Batch size plays a significant role in hyper-parameter tuning during training, but the batch size you can use on your machine is limited by the amount of memory available on your GPUs. Gradient accumulation is a well-known technique to work around that restriction: the gradients are added up across several iterations, and the optimizer updates the weights only once after a certain number of iterations.
We tested it when we fine-tuned roberta-base on SQuAD, which led to nearly the same results as using a higher batch size. We also used it for training deepset/deberta-v3-large, which significantly outperformed its predecessors (see Question Answering on SQuAD).
from haystack.nodes import FARMReader
reader = FARMReader(model_name_or_path="distilbert-base-uncased-distilled-squad", use_gpu=True)
data_dir = "data/squad20"
reader.train(
    data_dir=data_dir,
    train_filename="dev-v2.0.json",
    use_gpu=True,
    n_epochs=1,
    save_dir="my_model",
    grad_acc_steps=8,
)
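To make the idea behind grad_acc_steps more concrete, here is a minimal, framework-free sketch of gradient accumulation on a toy one-parameter model. It is not how FARMReader implements it; the model, the data, and all function names are made up for illustration. It only shows that accumulating the scaled gradients of several small batches and updating once gives the same weight as a single update on one large batch, assuming equally sized micro-batches.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (x, y) pairs for the toy model y = w * x


def batch_gradient(w: float, batch: List[Sample]) -> float:
    """Gradient of the mean squared error of y = w * x with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def large_batch_step(w: float, data: List[Sample], lr: float) -> float:
    """One weight update computed from the whole batch at once."""
    return w - lr * batch_gradient(w, data)


def accumulated_step(w: float, data: List[Sample], lr: float, acc_steps: int) -> float:
    """One weight update computed from `acc_steps` equally sized micro-batches."""
    size = len(data) // acc_steps  # assumes len(data) is divisible by acc_steps
    accumulated = 0.0
    for i in range(acc_steps):
        micro_batch = data[i * size:(i + 1) * size]
        # Scale each micro-batch gradient so the sum equals the large-batch mean
        accumulated += batch_gradient(w, micro_batch) / acc_steps
    # The weights are updated only once, after all micro-batches were processed
    return w - lr * accumulated


data = [(float(x), 3.0 * x) for x in range(1, 9)]
w_large = large_batch_step(0.0, data, lr=0.01)
w_accumulated = accumulated_step(0.0, data, lr=0.01, acc_steps=4)
print(abs(w_large - w_accumulated) < 1e-9)
```

Only one micro-batch has to fit into memory at a time, which is why the technique helps on smaller GPUs.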
Extended Ray Support
Another great contribution from our community comes from @zoltan-fedor: it's now possible to run more complex pipelines with a dual-retriever setup on Ray. We also now support Ray Serve deployment arguments in Pipeline YAMLs so that you can fully control your Ray deployments.
pipelines:
  - name: ray_query_pipeline
    nodes:
      - name: EmbeddingRetriever
        replicas: 2
        inputs: [ Query ]
        serve_deployment_kwargs:
          num_replicas: 2
          version: Twenty
          ray_actor_options:
            num_gpus: 0.25
            num_cpus: 0.5
          max_concurrent_queries: 17
      - name: Reader
        inputs: [ EmbeddingRetriever ]
Support for Custom Sentence Tokenizers in PreProcessor
In some specific domains (for example, legal, with lots of custom abbreviations), the default sentence tokenizer can be improved by some extra training on the domain data. To support a custom model for sentence splitting, @danielbichuetti added the tokenizer_model_folder parameter to PreProcessor.
from haystack.nodes import PreProcessor
preprocessor = PreProcessor(
    split_length=10,
    split_overlap=0,
    split_by="sentence",
    split_respect_sentence_boundary=False,
    language="pt",
    tokenizer_model_folder="/home/user/custom_tokenizer_models",
)
Making it Easier to Switch Document Stores
We had yet another amazing community contribution by @zoltan-fedor, who added support for BM25 with the Weaviate document store.
Besides that, we streamlined the methods of BaseDocumentStore and added update_document_meta() to InMemoryDocumentStore. These are all steps to make it easier for you to run the same pipeline with different document stores (for example, for quick prototyping, use in-memory, then head to something more production-ready).
#2860
#2689
Almost 2x Performance Gain for Electra Reader Models
We did a major refactoring of our language_modeling module resolving a bug that caused Electra models to execute the forward pass twice.
#2703.
⚠️ Breaking Changes
- Add update_document_meta to InMemoryDocumentStore by @bogdankostic in #2689
- Add support for BM25 with the Weaviate document store by @zoltan-fedor in #2860
- Extending the Ray Serve integration to allow attributes for Serve deployments by @zoltan-fedor in #2918
- bug: make MultiLabel ids consistent across python interpreters by @camillepradel in #2998
⚠️ Breaking Changes for Contributors
Default Branch will be Renamed to main on Tuesday, 16th of August
We will rename the default branch from master to main after this release. For a nice recap about good reasons for doing this, have a look at the Software Freedom Conservancy's blog.
Whether coming from this repository or from a fork, local clones of the Haystack repository will need to be updated by running the following commands:
git branch -m master main
git fetch origin
git branch -u origin/main main
git remote set-head origin -a
Pre-Commit Hooks Instead of CI Jobs
To give you full control over your changes, we switched from CI jobs that automatically reformat files, generate schemas, and so on, to pre-commit hooks. To install them, run:
pre-commit install
For more information, check our contributing guidelines.
#2819
Other Changes
Pipeline
- Fix _debug info getting lost for previous nodes when using join nodes by @tstadel in #2776
- fix pipeline run loop on joined pipelines without debug flag by @tstadel in #2777
- Fix crawler long file names by @danielbichuetti in #2723
- Prevent PDFToTextConverter from failing on PDFs with spaces in their names by @danielbichuetti in #2786
- Passing the meta-data in the summarizer response by @SjSnowball in #2179
- Fix YAML validation for ElasticsearchDocumentStore.custom_query by @ZanSara in #2789
- Fix gold_contexts_similarity for table retrieval evaluation by @tstadel in #2815
- Fix validation for dynamic outgoing edges by @tstadel in #2850
- Print eval reports improvements by @vblagoje in #2941
- Add progress bar to batch run component ops by @vblagoje in #2864
- feat: warn users if they're calling get_all_labels on a document index and vice-versa (Elasticsearch & Opensearch only) by @ZanSara in #2990
- Make MultiLabel preserve order by @anakin87 in #2956
- bug: fix UnboundLocalError in Pipeline.run_batch() by @anakin87 in #3016
- feat: enable the JoinDocuments node to work with documents with score=None by @zoltan-fedor in #2984
- Resolving issue 2853: no answer logic in FARMReader by @sjrl in #2856
- bug: Make TranslationWrapperPipeline work with QuestionAnswerGenerationPipeline by @bogdankostic in #3034
Models
- Simplify language_modeling.py and tokenization.py by @ZanSara in #2703
- Validate OpenAI response by @anakin87 in #2844
- remove unnecessary if else block #2835 by @kekayan in #2842
- Explicitly specify all parameters to forward call by @vblagoje in #2886
- Use batch_size in QuestionGenerator by @GianiStatie in #2870
- Generalize , and tokens of QuestionGenerator node by @francescocastelli in #2769
- Component batch_size should be defined rather than Optional by @vblagoje in #2958
- Better check for "DebertaV2" architecture in Trainer.train by @sjrl in #2966
DocumentStores
- Fix confusing elasticsearch exception by @tstadel in #2763
- added mock pinecone client by @jamescalam in #2770
- changed mock pinecone to use dict rather than list index by @jamescalam in #2845
- Handle invalid metadata for SQLDocumentStore by @anakin87 in #2868
- Use opensearch-py in OpenSearchDocumentStore by @masci in #2691
- Wrap opensearch imports into safe_import by @ZanSara in #2907
- Bug fix Weaviate document deletion by @stevenhaley in #2899
- switch label variables in test_labels by @jamescalam in #3011
- Adding support for additional distance/similarity metrics for Weaviate by @zoltan-fedor in #3001
- test: add meta fields for meta_config to be used during testing by @jamescalam in #3021
- Fix embeddings_field_supports_similarity of OpenSearchDocumentStore when creating index by @tstadel in #3030
- Forbid the key id from Documents to be written in WeaviateDocumentStore by @thenewera-ru in #2846
Documentation
- Trying out some smaller images for docs by @TuanaCelik in #2772
- Clean OpenAIAnswerGenerator docstrings by @brandenchan in #2797
- Add a custom pydoc renderer for Readme.io by @masci in #2825
- Typo README.md by @danielfleischer in #2895
- Fix typos in Contributing.md by @stevenhaley in #2897
- Fix docs code format for sentence transformers by @bilgeyucel in #2957
- Update Seq2SeqGenerator API documentation by @vblagoje in #2970
- Add API page for util functions by @brandenchan in #2863
- docs: update File Classifier Docstring by @brandenchan in #3018
Tutorials
- Fix load_from_yaml example in the Pipelines tutorial by @agnieszka-m in #2774
- Tutorial 12: add introduction by @vblagoje in #2798
- Exclude docker from Tutorial 15 by @anakin87 in #2861
- Remove logging config from Haystack by @julian-risch in #2848
- docs: extend tutorial14 about query classification by @anakin87 in #3013
- Tutorial 06: Replace DPR with EmbeddingRetriever by @bglearning in #2910
Other Changes
- API key check in OpenAIAnswerGenerator by @ZanSara in #2791
- API tests by @masci in #2738
- Allow values that are not dictionaries in the request params in the /search endpoint by @masci in #2720
- fix healthcheck cmds for annotation tool postgres by @tstadel in #2840
- Remove deprecated method prepare_seq2seq_batch by @anakin87 in #2852
- Fix corrupted csv from EvaluationResult.save() by @tstadel in #2854
- Fix audio dependency chain issue on Python 3.10 by @danielbichuetti in #2900
- Add switch for BiAdaptive and TriAdaptiveModel in Evaluator by @ZanSara in #2908
- Fix serialization of numpy arrays and pandas dataframes in REST API by @tstadel in #2838
- Update minimum selenium version supported for crawler by @sjrl in #2921
- Enable Opensearch unit tests in Windows CI by @masci in #2936
- Remove unused variable by @sjrl in #2974
- Bump streamlit version to latest by @masci in #3002
- Testing order in test_multilabel by @jamescalam in #3015
- fix: move azure-core pin into the dev dependency list by @ZanSara in #3022
- Fix broken MultiLabel serialization by @tstadel in #3037
New Contributors
- @kekayan made their first contribution in #2842
- @sjrl made their first contribution in #2884
- @zoltan-fedor made their first contribution in #2860
- @danielfleischer made their first contribution in #2895
- @stevenhaley made their first contribution in #2897
- @GianiStatie made their first contribution in #2870
- @bglearning made their first contribution in #2910
- @bilgeyucel made their first contribution in #2957
- @wochinge made their first contribution in #2883
- @camillepradel made their first contribution in #2998
- @thenewera-ru made their first contribution in #2846
❤️ Big thanks to all contributors and the whole community!
Full Changelog: v1.6.0...v1.7.0