-
Notifications
You must be signed in to change notification settings - Fork 73
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding semantic search workload that includes vector and bm25 search #342
base: main
Are you sure you want to change the base?
Adding semantic search workload that includes vector and bm25 search #342
Conversation
875dd1d
to
ef84aa8
Compare
ef84aa8
to
43bc7ff
Compare
@VijayanB can I get initial feedback for this PR while we're waiting for feedback on opensearch-project/opensearch-benchmark#591 from other folks? |
6cb13b7
to
42ddc44
Compare
Signed-off-by: Martin Gaievski <[email protected]>
Signed-off-by: Martin Gaievski <[email protected]>
Signed-off-by: Martin Gaievski <[email protected]>
Signed-off-by: Martin Gaievski <[email protected]>
42ddc44
to
79bc1f3
Compare
I'm going to restore works for this PR and I want to summarize the feedback I got last time:
I put together main points from this list in a private branch: https://github.com/martin-gaievski/opensearch-benchmark-workloads/tree/adding_vectordataset_for_semantic_search/treccovid_semantic_search @VijayanB please let me know if I've missed any important point from your comments |
@@ -0,0 +1,46 @@ | |||
{ | |||
"settings": { | |||
"index.number_of_shards": {{number_of_shards | default(1)}}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, we can leave defaults to cluster. in other words, lets set this value if number_of_shards are provided in params
"default_pipeline": "nlp-ingest-pipeline" | ||
}, | ||
"mappings": { | ||
"dynamic": "true", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any reason to allow dynamic? why not strict?
"dimension": 768, | ||
"method": { | ||
"name": "hnsw", | ||
"space_type": "innerproduct", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be outside method.
class UpdateConcurrentSegmentSearchSettings(Runner): | ||
|
||
RUNNER_NAME = "update-concurrent-segment-search-settings" | ||
|
||
async def __call__(self, opensearch, params): | ||
enable_setting = params.get("enable", "false") | ||
max_slice_count = params.get("max_slice_count", None) | ||
body = { | ||
"persistent": { | ||
"search.concurrent_segment_search.enabled": enable_setting | ||
} | ||
} | ||
if max_slice_count is not None: | ||
body["persistent"]["search.concurrent.max_slice_count"] = max_slice_count | ||
request_context_holder.on_client_request_start() | ||
await opensearch.cluster.put_settings(body=body) | ||
request_context_holder.on_client_request_end() | ||
|
||
def __repr__(self, *args, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this be moved to opensearch-benchmarks?
"corpora": [ | ||
{ | ||
"name": "trec-covid", | ||
"base-url": "https://github.com/martin-gaievski/neural-search/releases/download/trec_covid_dataset_1M_v1", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i believe this is temporary. You must be using generic repository to fetch corpus
], | ||
"corpora": [ | ||
{ | ||
"name": "trec-covid", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we add smaller data set as well? something like 1k documents?
{ | ||
"base-url": "https://github.com/martin-gaievski/neural-search/releases/download/trec_covid_queries_knn", | ||
"source-file": "queries.json.zip", | ||
"compressed-bytes" : 193612, | ||
"uncompressed-bytes": 519763 | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure whether we discussed this or not, can this be converted to corpus?
if self._params['variable-queries'] > 0: | ||
with open(script_dir + os.sep + 'workload_queries.json', 'r') as f: | ||
d = json.loads(f.read()) | ||
source_file = d['source-file'] | ||
base_url = d['base-url'] | ||
compressed_bytes = d['compressed-bytes'] | ||
uncompressed_bytes = d['uncompressed-bytes'] | ||
compressed_path = script_dir + os.sep + source_file | ||
uncompressed_path = script_dir + os.sep + Path(source_file).stem | ||
if not os.path.exists(compressed_path): | ||
downloader = Downloader(False, False) | ||
downloader.download(base_url, None, compressed_path, compressed_bytes) | ||
if not os.path.exists(uncompressed_path): | ||
decompressor = Decompressor() | ||
decompressor.decompress(compressed_path, uncompressed_path, uncompressed_bytes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you don't need this if it is corpus
|
||
self._params = params | ||
self._params['index'] = index | ||
self._params['type'] = type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need type?
Description
Adding workload for semantic search that is based on customized trec-covid dataset and includes vector, text and integer fields. That allows to run queries like neural search and hybrid query where neural is one of sub queries.
Modified version of trec-covid dataset has ~1M documents, 6 copies of each document from original dataset.
Issues Resolved
#341
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.