
[RFC] Implement register custom sparse tokenizer from local files #3170

Open
zhichao-aws opened this issue Oct 25, 2024 · 13 comments

Comments

@zhichao-aws
Member

zhichao-aws commented Oct 25, 2024

Background

Neural Sparse is a semantic search method built on the native Lucene inverted index. Documents and queries are encoded into sparse vectors, where each entry represents a token and its corresponding semantic weight.

For the neural sparse doc-only mode, we need a sparse encoding neural model for ingestion and a tokenizer for queries. We use tokenizer construction consistent with Hugging Face, i.e. the tokenizer is determined by a tokenizer.json config file (example). For the models pre-trained by OpenSearch, we also need an idf.json file for token weights (example).
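As an illustration, idf.json is essentially a map from token to weight, and the encoded sparse vectors share the same token-to-weight shape. The snippet below is a hedged sketch with made-up tokens and weights, not the authoritative schema:

{
    "hello": 5.0327,
    "world": 3.4859,
    "##ing": 1.9741
}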

In OpenSearch ml-commons, the tokenizer is wrapped as a SPARSE_TOKENIZE model. This provides a consistent API experience across the doc-only and bi-encoder modes. To register a sparse tokenizer, users can choose a pre-trained tokenizer provided by OpenSearch, or build a zip file containing tokenizer.json and idf.json and register it from a URL.
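For reference, URL-based registration of such a zip currently looks roughly like the following (the version, hash, and url values below are placeholders):

POST /_plugins/_ml/models/_register
{
    "name": "my custom tokenizer",
    "version": "1.0.0",
    "function_name": "SPARSE_TOKENIZE",
    "model_format": "TORCH_SCRIPT",
    "model_content_hash_value": "<sha256 checksum of the zip>",
    "url": "https://example.com/my-tokenizer.zip"
}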

What are we going to do?

Implement a customized sparse tokenizer that reads its config files from the local file system. This makes tokenizer registration more flexible. Some service providers place strict restrictions on customized torchscript files, while the sparse tokenizer doesn't need to interact with torchscript at run time. We need a registration option that explicitly excludes torchscript resources, to achieve more fine-grained security control for different model types. Besides, we no longer need to upload a zipped file, since these config files are much smaller than model weights.

User Experience

Here are different ways to implement the API. The tricky part is that this feature currently applies only to tokenizer registration; the new fields do not generalize to other model types.

Option 1: register API + put new fields at model config (preferred)

POST /_plugins/_ml/models/_register
{
    "name": "my custom tokenizer",
    "function_name": "SPARSE_TOKENIZE",
    "model_group_id": "Z1eQf4oB5Vm0Tdw8EIP2",
    "model_config": {
        "tokenizer_config_file": "/path/to/tokenizer.json",
        "idf_file": "/path/to/idf.json"
    }
}

For this option we implement a new class named SparseTokenizerModelConfig and put these fields inside model_config. This way we don't need to alter the register logic of other model types.

Option 2: register API + put new fields at top-level request body

POST /_plugins/_ml/models/_register
{
    "name": "my custom tokenizer",
    "function_name": "SPARSE_TOKENIZE",
    "model_group_id": "Z1eQf4oB5Vm0Tdw8EIP2",
    "tokenizer_config_file": "/path/to/tokenizer.json",
    "idf_file": "/path/to/idf.json"
}

For this option we put the new fields in the top-level request body of the register model API. The downside is that these fields would be redundant in the register request objects of other model types.

Option 3: register the tokenizer from train model API

POST /_plugins/_ml/_train/sparse_tokenize
{
    "parameters": {
        "tokenizer_config_file": "/path/to/tokenizer.json",
        "idf_file": "/path/to/idf.json"
    }
}

For this option we initialize the tokenizer object through the train API. This option requires more code changes and special-case logic.
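Whichever option we choose, the registered tokenizer would then be invoked through the existing predict API. A sketch, assuming the function-name-based predict route and a placeholder model ID:

POST /_plugins/_ml/_predict/sparse_tokenize/<model_id>
{
    "text_docs": ["hello world"]
}

The response would contain one sparse vector per input document, i.e. a token-to-weight map as described in the background section.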

@xinyual
Collaborator

xinyual commented Oct 25, 2024

For options 1 and 2, this will only work when function_name is "SPARSE_TOKENIZE", right?

@zhichao-aws
Member Author

For options 1 and 2, this will only work when function_name is "SPARSE_TOKENIZE", right?

Yes, we should also include the function name field. Edited.

@ylwu-amzn
Collaborator

This will be challenging for security. I suggest consulting with security experts first.

@zane-neo
Collaborator

Option 1 seems better. Also a question: do we need to support network file system registration? Placing files on production machines could be a maintenance overhead.

@zhichao-aws
Member Author

Option 1 seems better. Also a question: do we need to support network file system registration? Placing files on production machines could be a maintenance overhead.

I think supporting a network file system will bring more security challenges. I know some service providers implement mechanisms to upload files to clusters, e.g. custom packages on AOS. It is a common use case for customizing analyzers with synonym files.

@yuye-aws
Member

Good to see you creating this RFC! Can we also support a pre-trained tokenizer, just like pre-trained models? This would benefit users of both dense and sparse models.

@zhichao-aws
Member Author

Good to see you creating this RFC! Can we also support a pre-trained tokenizer, just like pre-trained models? This would benefit users of both dense and sparse models.

Sorry, I don't get your point. I think our pre-trained tokenizer can currently be used just like other DL models. What else do we need to do?

@yuye-aws
Member

Good to see you creating this RFC! Can we also support a pre-trained tokenizer, just like pre-trained models? This would benefit users of both dense and sparse models.

Sorry, I don't get your point. I think our pre-trained tokenizer can currently be used just like other DL models. What else do we need to do?

Wait a bit. Do you mean we can currently use the pre-trained tokenizer without specifying tokenizer_config_file and idf_file?

@zane-neo
Collaborator

Option 1 seems better. Also a question: do we need to support network file system registration? Placing files on production machines could be a maintenance overhead.

I think supporting a network file system will bring more security challenges. I know some service providers implement mechanisms to upload files to clusters, e.g. custom packages on AOS. It is a common use case for customizing analyzers with synonym files.

Agreed, this will bring more security challenges, but from the open-source perspective this might be a valid use case; a configuration option can be introduced to control the behavior in open source and AOS separately. This is not high priority; it's fine to implement it in a future release.

@zhichao-aws
Member Author

Good to see you creating this RFC! Can we also support a pre-trained tokenizer, just like pre-trained models? This would benefit users of both dense and sparse models.

Sorry, I don't get your point. I think our pre-trained tokenizer can currently be used just like other DL models. What else do we need to do?

Wait a bit. Do you mean we can currently use the pre-trained tokenizer without specifying tokenizer_config_file and idf_file?

Yes, we can use the model API to register & deploy the pretrained tokenizer:

POST /_plugins/_ml/models/_register?deploy=true
{
  "name": "amazon/neural-sparse/opensearch-neural-sparse-tokenizer-v1",
  "version": "1.0.1",
  "model_format": "TORCH_SCRIPT"
}

The tokenizer_config_file and idf_file approach is the new mechanism we want to introduce.
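For completeness: like other model registrations, this call is asynchronous and returns a task_id; the resulting model ID can be fetched from the standard tasks API once the task completes:

GET /_plugins/_ml/tasks/<task_id>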

@yuye-aws
Member

Yes, we can use the model API to register & deploy the pretrained tokenizer:

POST /_plugins/_ml/models/_register?deploy=true
{
  "name": "amazon/neural-sparse/opensearch-neural-sparse-tokenizer-v1",
  "version": "1.0.1",
  "model_format": "TORCH_SCRIPT"
}

The tokenizer_config_file and idf_file approach is the new mechanism we want to introduce.

From the documentation, this API registers a model. I think the customer also needs a lightweight method to register a pre-trained tokenizer, where only a tokenizer is needed.

To register a sparse tokenizer, users can choose a pre-trained tokenizer provided by OpenSearch, or build a zip file containing tokenizer.json and idf.json and register it from a URL.

I am assuming that this RFC and all three options are about the tokenizer, not about models in general.

@zhichao-aws
Member Author

Yes, we can use the model API to register & deploy the pretrained tokenizer:

POST /_plugins/_ml/models/_register?deploy=true
{
  "name": "amazon/neural-sparse/opensearch-neural-sparse-tokenizer-v1",
  "version": "1.0.1",
  "model_format": "TORCH_SCRIPT"
}

The tokenizer_config_file and idf_file approach is the new mechanism we want to introduce.

From the documentation, this API registers a model. I think the customer also needs a lightweight method to register a pre-trained tokenizer, where only a tokenizer is needed.

To register a sparse tokenizer, users can choose a pre-trained tokenizer provided by OpenSearch, or build a zip file containing tokenizer.json and idf.json and register it from a URL.

I am assuming that this RFC and all three options are about the tokenizer, not about models in general.

Do you mean the inner Hugging Face tokenizer instead of the SPARSE_TOKENIZE model? Currently users don't interact with it directly in neural sparse search, so it's out of scope for this RFC. Please feel free to create a dedicated RFC to support it if needed.

@yuye-aws
Member

Do you mean the inner Hugging Face tokenizer instead of the SPARSE_TOKENIZE model? Currently users don't interact with it directly in neural sparse search, so it's out of scope for this RFC. Please feel free to create a dedicated RFC to support it if needed.

Thanks for clarifying the distinction between the Hugging Face tokenizer and the SPARSE_TOKENIZE model. As for the third option, the train API does not make sense to me. Creating a tokenizer does not follow the current documentation of the train API: https://opensearch.org/docs/latest/ml-commons-plugin/api/train-predict/train/.
