v2.8.0-rc3
Pre-releaseRelease Notes
⬆️ Upgrade Notes
- Remove
is_greedy
deprecated argument from@component
decorator. Change theVariadic
input of your Component toGreedyVariadic
instead.
🚀 New Features
- We've added a new
DALLEImageGenerator
component, bringing image generation with OpenAI's DALL-E to the Haystack- Easy to Use: Just a few lines of code to get started:
`python from haystack.components.generators import DALLEImageGenerator image_generator = DALLEImageGenerator() response = image_generator.run("Show me a picture of a black cat.") print(response)
`
- Easy to Use: Just a few lines of code to get started:
- Add warning logs to the
PDFMinerToDocument
andPyPDFToDocument
to indicate when a processed PDF file has no content. This can happen if the PDF file is a scanned image. Also added an explicit check and warning message to theDocumentSplitter
that warns the user that empty Documents are skipped. This behavior was already occurring, but now its clearer through logs that this is happening. - We have added a new
MetaFieldGroupingRanker
component that reorders documents by grouping them based on metadata keys. This can be useful for pre-processing Documents before feeding them to an LLM. - Added a new
store_full_path
parameter to the__init__
methods of the following converters:
JSONConverter
,CSVToDocument
,DOCXToDocument
,HTMLToDocument
MarkdownToDocument
,PDFMinerToDocument
,PPTXToDocument
,TikaDocumentConverter
,PyPDFToDocument
,AzureOCRDocumentConverter
andTextFileToDocument
. The default value isTrue
, which stores full file path in the metadata of the output documents. When set toFalse
, only the file name is stored. - When making function calls via
OpenAPI
, allow both switching SSL verification off and specifying a certificate authority to use for it. - Add TTFT (Time-to-First-Token) support for OpenAI generators. This captures the time taken to generate the first token from the model and can be used to analyze the latency of the application.
- Added a new option to the required_variables parameter to the
PromptBuilder
andChatPromptBuilder
. By passingrequired_variables="*"
you can automatically set all variables in the prompt to be required.
⚡️ Enhancement Notes
- Across Haystack codebase, we have replaced the use of
ChatMessage
data class constructor with specific class methods (ChatMessage.from_user
,ChatMessage.from_assistant
, etc.). - Added the Maximum Margin Relevance (MMR) strategy to the
SentenceTransformersDiversityRanker
. MMR scores are calculated for each document based on their relevance to the query and diversity from already selected documents. - Introduces optional parameters in the
ConditionalRouter
component, enabling default/fallback routing behavior when certain inputs are not provided at runtime. This enhancement allows for more flexible pipeline configurations with graceful handling of missing parameters. - Added split by line to
DocumentSplitter
, which will split the document at n. - Change
OpenAIDocumentEmbedder
to keep running if a batch fails embedding. Now OpenAI returns an error we log that error and keep processing following batches. - Added new initialization parameters to the
PyPDFToDocument
component to customize the text extraction process from PDF files. - Replace usage of
ChatMessage.content
withChatMessage.text
across the codebase. This is done in preparation for the removal ofcontent
in Haystack 2.9.0.
⚠️ Deprecation Notes
- The default value of the
store_full_path
parameter in converters will change toFalse
in Haysatck 2.9.0 to enhance privacy. - In Haystack 2.9.0, the
ChatMessage
data class will be refactored to make it more flexible and future-proof. As part of this change, the content attribute will be removed. A newtext
property has been introduced to provide access to the textual value of theChatMessage
. To ensure a smooth transition, start using thetext
property now in place ofcontent
. - The
converter
parameter in thePyPDFToDocument
component is deprecated and will be removed in Haystack 2.9.0. For in-depth customization of the conversion process, consider implementing a custom component. Additional high-level customization options will be added in the future. - The output of
context_documents
will change in the next release. Instead of a List[List[Document]], the output will be a List[Document], where the documents are ordered bysplit_idx_start
.
🐛 Bug Fixes
-
Fix
DocumentCleaner
not preserving allDocument
fields when run -
Fix
DocumentJoiner
failing when ran with an empty list of Documents -
For the
NLTKDocumentSplitter
we are updating how chunks are made when splitting by word and sentence boundary is respected. Namely, to avoid fully subsuming the previous chunk into the next one, we ignore the first sentence from that chunk when calculating sentence overlap. i.e. we want to avoid cases ofDoc1 = [s1, s2], Doc2 = [s1, s2, s3]
. -
Finished adding function support for this component by updating the
_split_into_units
function and added thesplitting_function
init
parameter. -
Add specific
to_dict
method to overwrite the underlying one fromDocumentSplitter
. This is needed to properly save the settings of the component to yaml. -
Fix
OpenAIChatGenerator
andOpenAIGenerator
crashing when using a streaming_callback andgeneration_kwargs
contain{"stream_options": {"include_usage": True}}
. -
Fix tracing
Pipeline
with cycles to correctly track components execution -
When meta is passed into
AnswerBuilder.run()
, it is now merged intoGeneratedAnswer
meta -
Fix
DocumentSplitter
to handle customsplitting_function
without requiringsplit_length.
Previously thesplitting_function
provided would not override other settings.