Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add multi vector support #11

Merged
merged 5 commits into from
Oct 9, 2024

Conversation

omriel1
Copy link
Contributor

@omriel1 omriel1 commented Oct 2, 2024

Description

This PR introduces multi-vector support!

@efriis @zc277584121 I've moved and modified the PR previously submitted to the main langhcain repo: langchain-ai/langchain#26500

Milvus 2.4 introduced the option for multi-vector support, which is becoming increasingly popular, especially for use cases like hybrid search (dense + sparse embeddings).

Lately, @ohadeytan introduced the option to use sparse embeddings in this PR.

Additionally, @zc277584121 already introduced the MilvusCollectionHybridSearchRetriever, which enables hybrid search against pre-defined collections directly via pymilvus.

However, this method doesn't take full advantage of the many useful features offered by langchain_milvus when building a collection: automatic schema creation, indexing and search parameter creation etc.

This PR intend to make developers life easier, by allowing them to use single-vector or multi-vector with a single langchain interface, that create, connect, and search Milvus. For example, at IBM and IBM Research this feature is requested by many of our developers and researchers and will be very useful for us.

Changes

This PR addresses the limitations described above by introducing the following changes:

  1. Allows passing multiple embedding functions with optional matching indexing parameters, search parameters, and vector field names.
  2. Dynamically creates the collection using these functions, similar to how it's done for a single embedding function.
  3. Adds multiple tests to validate this new feature.

We are eager to have this merged into langchain-milvus as we utilize many of langchain features, particularly langchain-milvus. We want to continue benefiting from the many valuable features this package provides. We'll make any required changes and will be glad to get any guidance to make it happen!

Twitter handle: @EliyahuOmri, @ohadeytan

@codingjaguar
Copy link
Collaborator

Hi @omriel1 thanks for the PR! FYI, the reviewer @zc277584121 is out for a week. Is this PR urgent? Would prefer to wait till the reviewer is back, or we could ask @efriis to take a look first.

@omriel1
Copy link
Contributor Author

omriel1 commented Oct 2, 2024

Hi! Thanks for the reply. This is important for us (it was delayed due to the repo separation), but I completely understand and agree that we should wait for @zc277584121. But in the meantime, if you or @efriis have any comments, I'd be happy to start addressing them!

Thanks!

dense_embeddings_func_2,
],
texts=fake_texts,
connection_args={"uri": "./milvus_demo.db"},
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this uri be passed in the same way as ut above using temp_milvus_db? Just for the code consistency

embedding=[FakeEmbeddings(), FakeEmbeddings()],
index_params=[index_param_1, index_param_2],
vector_field=["vec_field_1", "vec_field_2"],
connection_args={"uri": "./milvus_demo.db"},
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

embedding=[embedding_1, embedding_2],
texts=fake_texts,
index_params=[index_param_1, index_param_2],
connection_args={"uri": "./milvus_demo.db"},
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

for i, embeddings_func in enumerate(embeddings_functions):
if not self._get_index(vector_fields[i]):
try:
# If no index params, use a default HNSW based one
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we should keep the comment If no index params, use a default *AutoIndex* based one unchanged

using=self.alias,
)

# If default did not work, most likely on Zilliz Cloud
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto


# If default did not work, most likely on Zilliz Cloud
except MilvusException:
# Use AUTOINDEX based index
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

@zc277584121
Copy link
Collaborator

@omriel1 sorry for delay. thanks for your great contributing. and i have left some small comments here. all others LGTM

@omriel1 omriel1 requested a review from zc277584121 October 8, 2024 11:24
@zc277584121 zc277584121 merged commit 02eb7be into langchain-ai:main Oct 9, 2024
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants