Why nearest k = 5 sometimes only return 2 results? #3306

hongbo-miao · 2024-12-27T09:14:02Z


I am trying to understand how IVF_PQ works based on https://lancedb.github.io/lancedb/concepts/index_ivfpq/

I created the script shown below. Here is my uv pyproject.toml, followed by the script:

[project]
name = "hm-lancedb"
version = "1.0.0"
requires-python = "~=3.12.0"
dependencies = [
  "pandas==2.2.3",
  "polars==1.18.0",
  "pylance==0.21.0",
  "tqdm==4.67.1",
]

import logging

import lance
import numpy as np
import pandas as pd
from lance.vector import vec_to_table


def main() -> None:
    # Create sample vectors (minimum 5000 recommended for meaningful indexing)
    num_vectors = 5000  # Increased from 1000 to meet minimum recommendation
    vector_dim = 128  # Dimension of each vector (common for embeddings)
    vectors = np.random.randn(num_vectors, vector_dim)

    # Create some distinct vectors at the beginning for demonstration
    # Make the first vector have a clear pattern
    vectors[0] = np.array(
        [1.0] * 32 + [2.0] * 32 + [3.0] * 32 + [4.0] * 32
    )
    # Make the second vector similar to the first but with some variation
    vectors[1] = vectors[0] + np.random.randn(vector_dim) * 0.1

    # Convert to Lance table
    vector_table = vec_to_table(vectors)

    # Save to Lance dataset
    uri = "/tmp/lancedb/vectors.lance"
    dataset = lance.write_dataset(vector_table, uri, mode="overwrite")
    logging.info(
        "Dataset saved to %s with %d vectors of dimension %d",
        uri,
        num_vectors,
        vector_dim,
    )

    # https://lancedb.github.io/lancedb/concepts/index_ivfpq/
    # Create an index for vector similarity search
    # IVF-PQ is a composite index that combines inverted file index (IVF) and product quantization (PQ)
    # - IVF divides the vector space into Voronoi cells using K-means clustering
    # - PQ reduces dimensionality by dividing vectors into sub-vectors and quantizing them
    dataset.create_index(
        "vector",
        index_type="IVF_PQ",
        # num_partitions: The number of partitions (Voronoi cells) in the IVF portion
        # - Controls how the vector space is divided
        # - Higher values increase query throughput but may reduce recall
        # - Should be chosen to target a particular number of vectors per partition
        # - For 5000 vectors, we use 64 partitions (~78 vectors per partition)
        num_partitions=64,
        # num_sub_vectors: The number of sub-vectors created during Product Quantization (PQ)
        # - Controls the compression level and search accuracy
        # - Chosen based on desired recall and vector dimensionality
        # - Trade-off: more sub-vectors = higher accuracy but less compression (larger index)
        num_sub_vectors=16,
    )
    logging.info("Created vector similarity index")

    # Read back the dataset
    dataset = lance.dataset(uri)

    # Perform vector similarity search using the second vector as query
    query_vector = vectors[1]

    # Find 5 nearest neighbors
    # Note: For better accuracy, you can use nprobes (5-10% of dataset) and refine_factor
    k = 5  # I want exactly 5 results
    results = dataset.to_table(
        nearest={
            "column": "vector",
            "k": k,
            "q": query_vector,
        }
    ).to_pandas()

    logging.info(
        "\nNearest neighbors (distances show similarity, lower = more similar):"
    )
    if len(results) != k:
        logging.warning(f"Expected {k} results but got {len(results)} results!")
    
    for idx, row in results.iterrows():
        vector_preview = np.array(row["vector"])
        logging.info(
            f"Result {idx + 1}/{k}: Distance: {row['_distance']:.4f}, Vector preview: {vector_preview[:8]}..."
        )


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    main()

However, I found that it only sometimes prints the top 5 results as expected. This happens roughly 80% of the time.

INFO:root:Dataset saved to /tmp/lancedb/vectors.lance with 5000 vectors of dimension 128
INFO:pylance:Final create_index rust time: 0.2940809726715088s
INFO:root:Created vector similarity index
INFO:root:
Performing similarity search for vector with pattern [1.0]*32 + [2.0]*32 + [3.0]*32 + [4.0]*32
INFO:root:
Nearest neighbors (distances show similarity, lower = more similar):
INFO:root:Result 1/5: Distance: 23.9152, Vector preview: [0.807779   1.028215   0.9340216  1.1076214  1.0949317  1.0049326
 0.90235597 1.0152307 ]...
INFO:root:Result 2/5: Distance: 23.9152, Vector preview: [1. 1. 1. 1. 1. 1. 1. 1.]...
INFO:root:Result 3/5: Distance: 868.6546, Vector preview: [-0.6312011   0.43899867 -0.28580767 -0.66803545  1.5501056  -0.42998713
 -0.07740911  0.82827485]...
INFO:root:Result 4/5: Distance: 890.7475, Vector preview: [ 1.0464841  -0.9530895  -0.99942106  0.17338635 -0.32746494 -0.67426693
  0.19943988 -0.699121  ]...
INFO:root:Result 5/5: Distance: 899.7136, Vector preview: [-0.79945636 -0.7129127   0.27931052  1.6528097   1.0595443   0.13149944
 -1.9304385   0.8099807 ]...

Sometimes it prints only 2 results, roughly 20% of the time.
Based on my understanding, it should always print the top 5. Could someone help explain why? Thanks!

INFO:root:Dataset saved to /tmp/lancedb/vectors.lance with 5000 vectors of dimension 128
INFO:pylance:Final create_index rust time: 0.25128984451293945s
INFO:root:Created vector similarity index
INFO:root:
Performing similarity search for vector with pattern [1.0]*32 + [2.0]*32 + [3.0]*32 + [4.0]*32
INFO:root:
Nearest neighbors (distances show similarity, lower = more similar):
WARNING:root:Expected 5 results but got 2 results!
INFO:root:Result 1/5: Distance: 10.4091, Vector preview: [0.9139447  1.0198995  0.9886301  0.8659454  1.1905029  0.94092387
 0.89365345 1.0127491 ]...
INFO:root:Result 2/5: Distance: 10.5039, Vector preview: [1. 1. 1. 1. 1. 1. 1. 1.]...

Answered by westonpace

westonpace · 2024-12-27T20:30:13Z

Maintainer

In lance, the default is "postfiltering". This is different from lancedb, where the default is "prefiltering". Both can return fewer results than asked for, but it is more common with postfiltering.

With postfiltering, we first perform the vector search to compute k * refine_factor results. We then filter these and rank the remaining results. Since you are not setting refine_factor, it defaults to None, which means you will get fewer than k results if the filter eliminates any of the top k.
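The order of operations described above can be illustrated with a small pure-Python sketch. It does not call lance at all: the `post_filter_search` helper, the one-dimensional `value` field, and the `keep` flag are hypothetical stand-ins, and the model simply follows the description (search for k * refine_factor candidates first, filter second).

```python
def post_filter_search(rows, query, k, predicate, refine_factor=None):
    """Toy model of postfiltering: rank by distance first, filter afterwards."""
    # The search keeps k * refine_factor candidates (just k when it is unset).
    num_candidates = k * (refine_factor or 1)
    ranked = sorted(rows, key=lambda row: abs(row["value"] - query))
    candidates = ranked[:num_candidates]
    # The filter only sees those candidates, so rejected rows are not replaced.
    survivors = [row for row in candidates if predicate(row)]
    return survivors[:k]


# 20 rows on a line; the hypothetical filter keeps rows with an even id.
rows = [{"id": i, "value": float(i), "keep": i % 2 == 0} for i in range(20)]

without_refine = post_filter_search(
    rows, query=0.0, k=5, predicate=lambda row: row["keep"]
)
with_refine = post_filter_search(
    rows, query=0.0, k=5, predicate=lambda row: row["keep"], refine_factor=4
)
print(len(without_refine), len(with_refine))  # prints: 3 5
```

In this toy data the filter rejects two of the five nearest rows, so only three results come back unless the candidate pool is widened.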

To get prefiltering, add prefilter=True to your to_table call.

With prefiltering, we first calculate which row ids match the filter. We then perform a vector search and filter the results during the search. It is still possible (but less likely) to get fewer results than desired. This happens with highly selective filters (e.g. filters that only match a few rows). The first part of the vector search picks the closest nprobes partitions. If those partitions do not contain k matching rows, you will get fewer than k results. For example, suppose a filter matches only 10 rows and you search with k=10 and nprobes=3. If the closest 3 partitions contain only 7 of the 10 possible rows, you will get only 7 results.
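The k=10, nprobes=3 example above can be reproduced with a toy model in plain Python. Nothing here is lance code: the `prefilter_search` helper, the one-dimensional centroids, and the row layout are assumptions chosen only to mirror the 7-of-10 scenario.

```python
def prefilter_search(partitions, centroids, query, k, nprobes, allowed_ids):
    """Toy model of prefiltering on an IVF index with one-dimensional vectors."""
    # Step 1: pick the nprobes partitions whose centroids are closest to the query.
    closest = sorted(centroids, key=lambda pid: abs(centroids[pid] - query))[:nprobes]
    # Step 2: only rows in those partitions that pass the filter are candidates.
    candidates = [
        (abs(value - query), row_id)
        for pid in closest
        for row_id, value in partitions[pid]
        if row_id in allowed_ids
    ]
    # Step 3: return up to k of the nearest candidates.
    return [row_id for _, row_id in sorted(candidates)[:k]]


centroids = {0: 0.0, 1: 10.0, 2: 20.0, 3: 30.0, 4: 40.0}
partitions = {
    0: [(0, 1.0), (1, 2.0), (2, 3.0), (100, 0.5)],
    1: [(3, 9.0), (4, 11.0), (101, 10.0)],
    2: [(5, 19.0), (6, 21.0), (102, 20.0)],
    3: [(7, 29.0), (8, 31.0)],
    4: [(9, 41.0)],
}
allowed_ids = set(range(10))  # the filter matches exactly 10 rows

few_probes = prefilter_search(partitions, centroids, 5.0, k=10, nprobes=3, allowed_ids=allowed_ids)
all_probes = prefilter_search(partitions, centroids, 5.0, k=10, nprobes=5, allowed_ids=allowed_ids)
print(len(few_probes), len(all_probes))  # prints: 7 10
```

Three of the ten matching rows live in partitions that are never probed, so they cannot appear in the result no matter how large k is.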

If you know your filter only matches a few rows, the best thing to do is skip the vector index entirely and do a flat search of the remaining rows. You can add use_index=False to your nearest dict to accomplish this.
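As a rough sketch of that suggestion, the helper below only builds the keyword arguments for to_table, so it runs without a dataset. The `flat_search_kwargs` name, the `category` column, and the filter string are hypothetical; combining prefilter=True with use_index=False is an assumption about how the two options would be used together, not something confirmed in this thread.

```python
def flat_search_kwargs(query_vector, k, row_filter, column="vector"):
    """Build to_table() keyword arguments that bypass the vector index."""
    return {
        "filter": row_filter,  # SQL-style filter string (hypothetical column below)
        "prefilter": True,     # assumption: filter rows before the nearest search
        "nearest": {
            "column": column,
            "q": query_vector,
            "k": k,
            "use_index": False,  # flat (exhaustive) search over the matching rows
        },
    }


kwargs = flat_search_kwargs([0.0] * 128, k=5, row_filter="category = 'rare'")
print(kwargs["nearest"]["use_index"])  # prints: False

# With a dataset opened as in the script above, this would be used as:
# results = dataset.to_table(**kwargs).to_pandas()
```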

1 reply

hongbo-miao Dec 28, 2024
Author

Thank you @westonpace for the explanation!
After I changed the query to:

    # Find 5 nearest neighbors
    k = 5

    # https://lancedb.github.io/lancedb/concepts/index_ivfpq/#query-the-index
    # nprobes:
    #   The number of probes determines the distribution of vector space.
    #   While a higher number enhances search accuracy, it also results in slower performance.
    #   Typically, setting nprobes to cover 5–10% of the dataset proves effective in achieving high recall with minimal latency.
    #
    # refine_factor:
    #   Refine the results by reading extra elements and re-ranking them in memory.
    #   A higher number makes the search more accurate but also slower.
    results = dataset.to_table(
        prefilter=True,
        nearest={
            "column": "vector",
            "k": k,
            "q": query_vector,
            "nprobes": 500,
            "refine_factor": 10,
        },
    ).to_pandas()

I get quite reliable results now 😃
