Why does nearest k = 5 sometimes return only 2 results? #3306
I am trying to understand the behavior of this script that I created.

My uv pyproject.toml:

[project]
name = "hm-lancedb"
version = "1.0.0"
requires-python = "~=3.12.0"
dependencies = [
    "pandas==2.2.3",
    "polars==1.18.0",
    "pylance==0.21.0",
    "tqdm==4.67.1",
]

import logging
import lance
import numpy as np
import pandas as pd
from lance.vector import vec_to_table
def main() -> None:
    # Create sample vectors (minimum 5000 recommended for meaningful indexing)
    num_vectors = 5000  # Increased from 1000 to meet minimum recommendation
    vector_dim = 128  # Dimension of each vector (common for embeddings)
    vectors = np.random.randn(num_vectors, vector_dim)
    # Create some distinct vectors at the beginning for demonstration
    # Make the first vector have a clear pattern
    vectors[0] = np.array(
        [1.0] * 32 + [2.0] * 32 + [3.0] * 32 + [4.0] * 32
    )
    # Make the second vector similar to the first but with some variation
    vectors[1] = vectors[0] + np.random.randn(vector_dim) * 0.1
    # Convert to Lance table
    vector_table = vec_to_table(vectors)
    # Save to Lance dataset
    uri = "/tmp/lancedb/vectors.lance"
    dataset = lance.write_dataset(vector_table, uri, mode="overwrite")
    logging.info(
        "Dataset saved to %s with %d vectors of dimension %d",
        uri,
        num_vectors,
        vector_dim,
    )
    # https://lancedb.github.io/lancedb/concepts/index_ivfpq/
    # Create an index for vector similarity search
    # IVF-PQ is a composite index that combines inverted file index (IVF) and product quantization (PQ)
    # - IVF divides the vector space into Voronoi cells using K-means clustering
    # - PQ compresses vectors by dividing them into sub-vectors and quantizing each one
    dataset.create_index(
        "vector",
        index_type="IVF_PQ",
        # num_partitions: The number of partitions (Voronoi cells) in the IVF portion
        # - Controls how the vector space is divided
        # - Higher values increase query throughput but may reduce recall
        # - Should be chosen to target a particular number of vectors per partition
        # - For 5000 vectors, we use 64 partitions (~78 vectors per partition)
        num_partitions=64,
        # num_sub_vectors: The number of sub-vectors created during Product Quantization (PQ)
        # - Controls the compression level and search accuracy
        # - Chosen based on desired recall and vector dimensionality
        # - Trade-off: more sub-vectors = less compression (larger index) but typically higher accuracy
        num_sub_vectors=16,
    )
    logging.info("Created vector similarity index")
    # Read back the dataset
    dataset = lance.dataset(uri)
    # Perform vector similarity search using the second vector as query
    query_vector = vectors[1]
    # Find 5 nearest neighbors
    # Note: For better accuracy, you can use nprobes (5-10% of dataset) and refine_factor
    k = 5  # I want exactly 5 results
    results = dataset.to_table(
        nearest={
            "column": "vector",
            "k": k,
            "q": query_vector,
        }
    ).to_pandas()
    logging.info(
        "\nNearest neighbors (distances show similarity, lower = more similar):"
    )
    if len(results) != k:
        logging.warning(f"Expected {k} results but got {len(results)} results!")
    for idx, row in results.iterrows():
        vector_preview = np.array(row["vector"])
        logging.info(
            f"Result {idx + 1}/{k}: Distance: {row['_distance']:.4f}, Vector preview: {vector_preview[:8]}..."
        )


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    main()

However, I found that it sometimes prints the top 5 results, which is good. This happens roughly 80% of the time.

INFO:root:Dataset saved to /tmp/lancedb/vectors.lance with 5000 vectors of dimension 128
INFO:pylance:Final create_index rust time: 0.2940809726715088s
INFO:root:Created vector similarity index
INFO:root:
Performing similarity search for vector with pattern [1.0]*32 + [2.0]*32 + [3.0]*32 + [4.0]*32
INFO:root:
Nearest neighbors (distances show similarity, lower = more similar):
INFO:root:Result 1/5: Distance: 23.9152, Vector preview: [0.807779 1.028215 0.9340216 1.1076214 1.0949317 1.0049326
0.90235597 1.0152307 ]...
INFO:root:Result 2/5: Distance: 23.9152, Vector preview: [1. 1. 1. 1. 1. 1. 1. 1.]...
INFO:root:Result 3/5: Distance: 868.6546, Vector preview: [-0.6312011 0.43899867 -0.28580767 -0.66803545 1.5501056 -0.42998713
-0.07740911 0.82827485]...
INFO:root:Result 4/5: Distance: 890.7475, Vector preview: [ 1.0464841 -0.9530895 -0.99942106 0.17338635 -0.32746494 -0.67426693
0.19943988 -0.699121 ]...
INFO:root:Result 5/5: Distance: 899.7136, Vector preview: [-0.79945636 -0.7129127 0.27931052 1.6528097 1.0595443 0.13149944
-1.9304385 0.8099807 ]...

Sometimes it only prints 2 results. This happens roughly 20% of the time.

INFO:root:Dataset saved to /tmp/lancedb/vectors.lance with 5000 vectors of dimension 128
INFO:pylance:Final create_index rust time: 0.25128984451293945s
INFO:root:Created vector similarity index
INFO:root:
Performing similarity search for vector with pattern [1.0]*32 + [2.0]*32 + [3.0]*32 + [4.0]*32
INFO:root:
Nearest neighbors (distances show similarity, lower = more similar):
WARNING:root:Expected 5 results but got 2 results!
INFO:root:Result 1/5: Distance: 10.4091, Vector preview: [0.9139447 1.0198995 0.9886301 0.8659454 1.1905029 0.94092387
0.89365345 1.0127491 ]...
INFO:root:Result 2/5: Distance: 10.5039, Vector preview: [1. 1. 1. 1. 1. 1. 1. 1.]...
Replies: 1 comment 1 reply
In `lance` the default is "postfiltering". This is different from `lancedb`, where the default is "prefiltering". Both are capable of returning fewer results than asked for, but it is more common with postfiltering.

With post-filtering we first perform the vector search to calculate `k * refine_factor` results. We then filter these and rank the remaining results. Since you are not setting `refine_factor`, it is defaulting to `None`, which means you will get fewer than `k` results if the filter eliminates any of the top `k`.

To get prefiltering, add `prefilter=True` to your `to_table` call.

With prefiltering we first calculate which row ids match the filter. We then perform a vector search and filter the results during the search. It is possible (but less likely) to also get fewer results than desired. This happens with highly selective filters (e.g. filters that only match a few rows). What happens is the first part of the vector search picks the closest … If you know your filter only matches a few rows then the best thing to do is skip the vector index entirely and just do a flat search of the remaining rows. You can add …
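
To make the difference between the two filtering orders concrete, here is a minimal pure-Python sketch. It is not Lance's implementation: it uses a brute-force search over ten one-dimensional points, a hypothetical "even row id" filter, and models `refine_factor` only as a multiplier on the number of candidates fetched before filtering, as described above. The function names are made up for illustration.

```python
from math import dist


def top_k(query, rows, k):
    """Brute-force search: the k rows closest to query by Euclidean distance.

    Each row is a (row_id, vector) pair.
    """
    return sorted(rows, key=lambda row: dist(query, row[1]))[:k]


def postfilter_search(query, rows, k, keep, refine_factor=1):
    """Search first (k * refine_factor candidates), then apply the filter."""
    candidates = top_k(query, rows, k * refine_factor)
    return [row for row in candidates if keep(row)][:k]


def prefilter_search(query, rows, k, keep):
    """Apply the filter first, then search only the matching rows."""
    return top_k(query, [row for row in rows if keep(row)], k)


def is_even(row):
    """Hypothetical filter: keep rows whose id is even."""
    return row[0] % 2 == 0


# Ten 1-D "vectors" at 0.0, 1.0, ..., 9.0; the row id equals the coordinate.
rows = [(i, (float(i),)) for i in range(10)]
query = (0.0,)

post = postfilter_search(query, rows, 5, is_even)
post_refined = postfilter_search(query, rows, 5, is_even, refine_factor=2)
pre = prefilter_search(query, rows, 5, is_even)

print([row_id for row_id, _ in post])          # [0, 2, 4]
print([row_id for row_id, _ in post_refined])  # [0, 2, 4, 6, 8]
print([row_id for row_id, _ in pre])           # [0, 2, 4, 6, 8]
```

With post-filtering and no extra candidates, the filter removes rows 1 and 3 from the top 5, so only 3 results come back. Fetching more candidates before filtering, or filtering before the search, recovers all 5 in this toy setup.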