v1.2.0
⭐ Highlights
Brownfield Support of Existing Elasticsearch Indices
You have an existing Elasticsearch index from other projects and now want to try out Haystack? The newly added method es_index_to_document_store provides brownfield support of existing Elasticsearch indices by converting each of the records in the provided index to Haystack Document objects and writing them to the specified DocumentStore.
document_store = es_index_to_document_store(
    document_store=InMemoryDocumentStore(),  # or any other Haystack DocumentStore
    original_index_name="existing_index",
    original_content_field="content",
    original_name_field="name",
    included_metadata_fields=["date_field"],
    index="new_index",
)
It can even be run on a regular basis to add new records from the Elasticsearch index to the DocumentStore! #2229
Tapas Reader With Scores
The new model class TapasForScoredQA introduced in #1997 supports Tapas Reader models that return confidence scores. When you load a Tapas Reader model, Haystack automatically infers whether the model supports confidence scores and chooses the correct model class under the hood. The returned answers are sorted first by a general table score and then by answer span scores. To try it out, just use one of the new TableReader models:
reader = TableReader(model_name_or_path="deepset/tapas-large-nq-reader", max_seq_len=512)
# or
reader = TableReader(model_name_or_path="deepset/tapas-large-nq-hn-reader", max_seq_len=512)
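The two-level ordering described above can be pictured with a small, library-independent sketch. The ScoredSpan class, its field names, and the assumption that higher scores rank first are hypothetical and chosen only for illustration; they are not Haystack's internal data structures.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScoredSpan:
    """Hypothetical container for one candidate answer (not a Haystack class)."""
    answer: str
    table_score: float  # assumed: confidence that the table holds the answer
    span_score: float   # assumed: confidence in this particular answer span


def rank(spans: List[ScoredSpan]) -> List[ScoredSpan]:
    # Sort by table score first; span score only breaks ties between
    # candidates with the same table score. Higher scores come first.
    return sorted(spans, key=lambda s: (s.table_score, s.span_score), reverse=True)


candidates = [
    ScoredSpan("Paris", table_score=0.9, span_score=0.4),
    ScoredSpan("Berlin", table_score=0.9, span_score=0.7),
    ScoredSpan("Rome", table_score=0.5, span_score=0.99),
]
ranked = rank(candidates)
ranked_answers = [s.answer for s in ranked]
```

Note that a span with a very high span score still ranks below every span from a higher-scoring table, which is the practical consequence of sorting on the table score first.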
Extended Meta Data Filtering
We extended the filter capabilities of all(*) document stores to support more complex filter expressions. Besides simple selections on multiple fields, you can now use comparison expressions and connect them with boolean operators. If you have used MongoDB, the new syntax should look familiar. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte"), or a metadata field name.
Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value.
If no logical operator is provided, "$and" is used as default operation.
If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation.
Because of these defaults, there are no breaking changes and you can keep using your existing filter expressions.
Example:
filters = {
    "$and": {
        "type": {"$eq": "article"},
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": {"$in": ["economy", "politics"]},
            "publisher": {"$eq": "nytimes"}
        }
    }
}
# or simpler using default operators
filters = {
    "type": "article",
    "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
    "rating": {"$gte": 3},
    "$or": {
        "genre": ["economy", "politics"],
        "publisher": "nytimes"
    }
}
(*) FAISSDocumentStore and MilvusDocumentStore currently do not support filters during search.
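To make the filter rules above concrete, here is a minimal pure-Python sketch that evaluates such a filter dictionary against a single metadata dictionary. It is not Haystack's implementation: the function names are hypothetical, "$not" is assumed to mean "none of the nested conditions hold", a missing field is assumed never to match, and dates are assumed to be ISO-formatted strings so that plain string comparison orders them correctly.

```python
from typing import Any, Dict, List

LOGICAL_OPERATORS = {"$and", "$or", "$not"}


def _compare(op: str, actual: Any, expected: Any) -> bool:
    """Apply one comparison operator to a metadata value."""
    if actual is None:  # assumption: a missing field never matches
        return False
    if op == "$eq":
        return actual == expected
    if op == "$in":
        return actual in expected
    if op == "$gt":
        return actual > expected
    if op == "$gte":
        return actual >= expected
    if op == "$lt":
        return actual < expected
    if op == "$lte":
        return actual <= expected
    raise ValueError(f"Unknown comparison operator: {op}")


def _field_matches(field: str, condition: Any, meta: Dict[str, Any]) -> bool:
    # Default comparison operators: a list means "$in", a bare value means "$eq".
    if isinstance(condition, list):
        condition = {"$in": condition}
    elif not isinstance(condition, dict):
        condition = {"$eq": condition}
    return all(_compare(op, meta.get(field), expected) for op, expected in condition.items())


def _evaluate_all(filters: Dict[str, Any], meta: Dict[str, Any]) -> List[bool]:
    """Evaluate every key of one filter dictionary and return the results."""
    results = []
    for key, value in filters.items():
        if key in LOGICAL_OPERATORS:
            nested = _evaluate_all(value, meta)
            if key == "$and":
                results.append(all(nested))
            elif key == "$or":
                results.append(any(nested))
            else:  # "$not": assumed to mean that no nested condition holds
                results.append(not any(nested))
        else:
            results.append(_field_matches(key, value, meta))
    return results


def matches(filters: Dict[str, Any], meta: Dict[str, Any]) -> bool:
    # Default logical operator: top-level conditions are combined with "$and".
    return all(_evaluate_all(filters, meta))


explicit = {
    "$and": {
        "type": {"$eq": "article"},
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": {"$in": ["economy", "politics"]},
            "publisher": {"$eq": "nytimes"},
        },
    }
}
simple = {
    "type": "article",
    "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
    "rating": {"$gte": 3},
    "$or": {"genre": ["economy", "politics"], "publisher": "nytimes"},
}
doc_meta = {"type": "article", "date": "2018-05-01", "rating": 4,
            "genre": "economy", "publisher": "guardian"}

explicit_result = matches(explicit, doc_meta)
simple_result = matches(simple, doc_meta)
```

Both the explicit and the simplified filter accept the sample metadata, which illustrates why existing filter expressions keep working: the defaults expand the short form into the same conditions as the long form.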
Code Style and Linting
In addition to mypy, which we already used for static type checking, we now use pylint for linting, and the Haystack code base now complies with Black formatting standards. As a result, the code is formatted consistently and is easier to read. If you would like to contribute to Haystack, you don't need to worry about that, though: our CI will automatically format your code changes correctly. Our contributor guidelines give more details in case you would like to run the checks locally. #2115 #2130
Installation with fewer dependencies
Installing Haystack has become easier and faster thanks to optional dependencies. From now on, there is no need to install all dependencies if you don't need them. For example, pip3 install farm-haystack will install the latest release together with only a small subset of packages required for basic Pipelines with an ElasticsearchDocumentStore. As another example, if you are experimenting with FAISSDocumentStore in a Colab notebook, you can install Haystack from the master branch together with the FAISS dependency by running: !pip install git+https://github.com/deepset-ai/haystack.git#egg=farm-haystack[colab,faiss]. The installation guide reflects these updates and the full list of subsets of dependencies can be found here. Keep in mind, though, that this system works best with pip versions above 22. #1994
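Since the optional-dependency setup depends on a recent pip, it can help to check the installed pip version before installing extras. The following is a small sketch using only the Python standard library; the helper names are hypothetical, and "above 22" is read here as "major version 22 or newer", which is an assumption.

```python
from importlib import metadata


def major_version(version: str) -> int:
    """Return the leading numeric component of a version string like '22.0.4'."""
    return int(version.split(".")[0])


def pip_is_recent(version: str, minimum_major: int = 22) -> bool:
    # Assumption: "recent enough" means the major version is at least 22.
    return major_version(version) >= minimum_major


if __name__ == "__main__":
    try:
        installed = metadata.version("pip")
    except metadata.PackageNotFoundError:
        print("pip is not installed in this environment")
    else:
        if pip_is_recent(installed):
            print(f"pip {installed} looks recent enough")
        else:
            print(f"pip {installed} is older; consider running: python -m pip install --upgrade pip")
```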
⚠️ Known Issues
Installing Haystack with all dependencies results in heavy pip backtracking that might never finish.
This is due to a dependency conflict that was introduced by a new release of one of our sub-dependencies.
To work around this problem, install Haystack like this:
pip install farm-haystack[all] "azure-core<1.23"
This might also be needed for other non-default dependencies (e.g. farm-haystack[dev] "azure-core<1.23").
See #2280 for more information.
⚠️ Breaking Changes
- Improve dependency management by @ZanSara in #1994
- Make ui and rest proper packages by @ZanSara in #2098
- Add aiorwlock to 'ray' extra & fix maximum version for some dependencies by @ZanSara in #2140
🤓 Detailed Changes
Pipeline
- Add top_k_join parameter to JoinDocuments.run by @adri1wald in #2065
- ✨ Add JSON Schema autogeneration for Pipeline YAML files by @tiangolo in #2020
- Make FileTypeClassifier more flexible by @ZanSara in #2101
- Query response without answers by @ZanSara in #2161
- Generate JSON schema index for Schemastore by @ZanSara in #2225
- Fix Pipeline.components by @tstadel in #2215
- Join node should allow reciprocal rank fusion as additional merging method by @mathislucka in #2133
- Apply filter in eval only if no gold docs are given as input by @julian-risch in #2154
- pipeline.save_to_deepset_cloud() by @tstadel in #2145
- Fix typo in save_to_deepset_cloud() by @tstadel in #2189
- Generate code from pipeline (pipeline.to_code()) by @tstadel in #2214
- Allow different filters per query in pipeline evaluation by @julian-risch in #2068
- List all pipeline(_configs) on Deepset Cloud by @tstadel in #2102
- Evaluating a pipeline consisting only of a reader node by @julian-risch in #2132
- DC SDK - load pipeline from deepset cloud by @ArzelaAscoIi in #2013
- YAML versioning by @ZanSara in #2209
Models
- Add Tapas reader with scores by @bogdankostic in #1997
- Fix finetuning notebook augmentation by @MichelBartels in #2071
- Fix Seq2SeqGenerator return type by @tstadel in #2099
- Distribute intermediate layer distillation loss calculation over multiple GPUs by @MichelBartels in #2090
- Do not apply DataParallel twice by @MichelBartels in #2095
DocumentStores
- Pin Milvus to <2.0.0 by @ZanSara in #2063
- fix: get_documents_by_id should return docs for all passed ids by @mathislucka in #2064
- Supported Highlighting in Elasticsearch by @SjSnowball in #1930
- pass faiss batch_size to sqldocumentstore by @AhmedIdr in #2061
- Fixed the Search Field mapping in ElasticSearch DocumentStore by @SjSnowball in #2080
- Provide option to recreate es doc store on initialization by @mathislucka in #2084
- Fixed performance bug. Using a list where a set is needed. by @baregawi in #2125
- Extend metadata filtering support in ElasticsearchDocumentStore by @bogdankostic in #2108
- OpenSearchDocumentStore: Extend similarity support by @tstadel in #2070
- Speed up query_by_embedding in InMemoryDocumentStore. by @baregawi in #2091
- Fix dependency management in Tutorial 6 by @ZanSara in #2148
- Enable use of dot_product OpenSearch Script Scoring by @tstadel in #2168
- Changed document_store to ElasticsearchDocumentStore by @mkkuemmel in #2192
- Support more data types and extended filters in WeaviateDocStore by @bogdankostic in #2143
- Adding extended meta data filtering support for InMemoryDocumentStore by @MichelBartels in #2120
- Fix ef_search param for hnsw in OpenSearchDocumentStore by @tstadel in #2227
- Add Brownfield Support of existing Elasticsearch indices by @bogdankostic in #2229
- Introduce readonly DCDocumentStore (without labels support) by @tstadel in #1991
- Extend meta data support for SQLDocumentStore by @MichelBartels in #2199
- Fix missing embeddings not skipped if filters are used by @MichelBartels in #2230
REST API
- Convert doc embedding from ndarray to list of float for REST API by @julian-risch in #1901
- Autogenerate OpenAPI specs file by @ZanSara in #2047
- Make openapi.json multiline so the diff is parsable by @ZanSara in #2163
- Align REST API and Haystack versions by @ZanSara in #2164
- Add DELETE /feedback for testing and make the label's id generate server-side by @ZanSara in #2159
- Add type check for meta & add tests by @ZanSara in #2184
- Update url in POST /file-upload by @ZanSara in #2193
- Versioning openapi.json by @ZanSara in #2228
Docker
- Change docstores_gpu into docstores-gpu in Dockerfile-GPU by @ZanSara in #2129
- Remove run_docker_gpu.sh by @ZanSara in #2003
- Remove rest extra from Dockerfile-GPU by @ZanSara in #2122
- Fix dependency related build issues in Dockerfiles by @ZanSara in #2135
- Add docker-compose override file for Traffic Monitoring by @tstadel in #2224
- Adding a minimal haystack gpu build by @ArzelaAscoIi in #2185
Documentation
- Remove stray requirements.txt files and update README.md by @ZanSara in #2075
- Make the docstring bot work only on master by @ZanSara in #2078
- Add who uses Haystack section by @dmigo in #1975
- Rename image to fix link in CONTRIBUTING.md by @ZanSara in #2211
- Add ADR template for transparent architecture decisions by @tholor in #2072
- Update Readme to reflect changes to installation procedure by @brandenchan in #2157
- Add REST API and UI installation info to readme by @brandenchan in #2160
- Upgrade pydoc-markdown by @ZanSara in #2117
CI
- Introduce pylint & other improvements on the CI by @ZanSara in #2130
- Apply black formatting by @ZanSara in #2115
- Pylint: solve or silence locally rare warnings by @ZanSara in #2170
- Revert "Make the docstring bot work only on master" by @ZanSara in #2114
- Fix CI build-cache issue causing code changes to take no effect by @tstadel in #2082
- Disable cache on the CI by @ZanSara in #2083
- Reintroduce push on master trigger for Linux CI by @ZanSara in #2127
- Allow Linux CI to push changes to forks by @ZanSara in #2182
- Fix windows ci tests by @tstadel in #2144
- Disable autoformat.yml on master by @ZanSara in #2198
- Testing actions (@ZanSara) by @hegyibalint in #2200
Other Changes
- Add UnlabeledTextProcessor by @MichelBartels in #2054
- fix answer is not subscriptable error by @julian-risch in #2069
- Add faiss dependency to tutorial 12 by @julian-risch in #2109
- Simplify SQuAD data to df conversion by @mathislucka in #2124
- Remove requirements for json schema by @ZanSara in #2128
- Move pytest configuration into pyproject.toml by @ZanSara in #2141
- Fix MultiLabel creation with aggregate_by_meta by @tstadel in #2165
- Add tests on MultiLabel's meta and filter aggregation by @tstadel in #2169
- Improve Label and MultiLabel __str__ and __repr__ by @ZanSara in #2202
New Contributors
- @adri1wald made their first contribution in #2065
- @tiangolo made their first contribution in #2020
- @baregawi made their first contribution in #2125
- @mkkuemmel made their first contribution in #2192
- @hegyibalint made their first contribution in #2200
❤️ Big thanks to all contributors and the whole community!