
Add Microservice Embedding Endpoints and embedding model setup #4558

Open
albertisfu opened this issue Oct 11, 2024 · 15 comments

Comments

@albertisfu
Contributor

albertisfu commented Oct 11, 2024

After #4557 is done, we should continue with this one.

1.2 Add Microservice Embedding Endpoints and embedding model setup

  • Implement API endpoint to:

    a. Split texts into chunks

    b. Request embeddings for the chunks

  • Consider:

    • Add and set up the library and model for embedding processing
    • Ensure the model can run on either GPU or CPU, depending on the available resources in the pod
    • Use a single worker to process HTTP requests
    • Validate the request body and handle it according to the request type. I'd suggest:

    Query embeddings request body:

    {
      "type": "single_text",
      "content": "This is a query"
    }

    Opinion lists request:

    {
      "type": "opinions_list",
      "content": [
        {
          "opinion_id": 1,
          "text": "Lorem"
        },
        ...
      ]
    }

  • For the response I'd suggest:

    Query embeddings response:

    {
      "type": "single_text",
      "embedding": [12123, 23232, 43545]
    }

    Opinion lists response:

    {
      "type": "opinions_list",
      "embeddings": [
        {
          "opinion_id": 1,
          "chunks": [
            {
              "chunk": "Lorem",
              "embedding": [12445, ...]
            }
          ]
        },
        ...
      ]
    }
  • Error Handling

    Review and handle the error types the embedding endpoint can return, using appropriate HTTP status codes to differentiate transient errors (e.g., ConnectionError) from bad requests, so the client can decide whether to retry the request. For example (see the sketch after this list):

    • Invalid request body format: 400 Bad Request
    • Unable to process embedding due to content issues: 422 Unprocessable Content
    • Set up a reasonable timeout for embedding processing.
  • Authentication
    Determine whether the microservice requires authentication. If it's for internal use only, authentication can be omitted (similar to Doctor).
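
A minimal sketch of that error-handling idea (the endpoint path, the embed_chunks() helper, and EmptyContentError are hypothetical placeholders, not the final design):

from fastapi import FastAPI, HTTPException, Request

app = FastAPI()


class EmptyContentError(Exception):
    """Raised when a request is well-formed but its content cannot be embedded."""


def embed_chunks(body: dict) -> dict:
    # Placeholder for the real chunking + model inference step.
    if not body.get("content"):
        raise EmptyContentError()
    return {"type": body["type"], "embedding": []}


@app.post("/embed")
async def embed(request: Request):
    try:
        body = await request.json()
    except ValueError:
        # Malformed JSON -> 400 Bad Request; the client should not retry as-is.
        raise HTTPException(status_code=400, detail="Invalid request body format.")

    if body.get("type") not in ("single_text", "opinions_list"):
        raise HTTPException(status_code=400, detail="Unknown request type.")

    try:
        return embed_chunks(body)
    except EmptyContentError:
        # Well-formed request, but the content cannot be embedded -> 422.
        raise HTTPException(status_code=422, detail="Unable to process content.")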

The output for this issue would be:

  • Create a PR to the microservice repository that includes:
    • The endpoint to receive embedding requests as described above (for both single text queries and multiple texts for opinions) and perform the following tasks:
      • Split the text or texts into chunks.
      • Pass the chunks to the model to get their embeddings.
      • Once all texts in the request are processed, build a response as described in the specs above and send it to the client.
    • Add endpoint tests that cover the scenarios described in previous points.

A question here @mlissner to determine the type of authentication to use:
Will this microservice be offered as a service to other customers, requiring it to be exposed outside our internal network?


@mlissner
Member

For now, no auth is needed.

The other tweak I'd make is to design this with a bit more abstraction. So instead of opinion_id, just use id, and instead of single_text and opinions_list, use simple and objects. I also think you don't need the type in the response (right?), so I removed it.

I think that gets you to:

{
  "type": "simple",
  "text": "This is a query"
}

Which returns:

 {
     "embedding": [12123, 23232, 43545]
 }

And for objects:

{
 "type": "objects",
 "content":[
   {
     "id": 1,
     "text": "Lorem"
   },
   ...
 ]
}

Which responds:

{
 "embeddings": [
   {
     "id": 1,
     "chunks": [
       {
         "chunk": "Lorem",
         "embedding": [12445, ...]
       }
     ]
   },
   ...
 ]
}

The nice thing about this is if we want to embed something else later, we can, without having to tweak the API.
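
For illustration, those shapes could be modeled with Pydantic along these lines (class and field names are just a sketch, not a settled design):

from typing import List, Literal, Union

from pydantic import BaseModel


class SimpleRequest(BaseModel):
    type: Literal["simple"]
    text: str


class ObjectItem(BaseModel):
    id: int
    text: str


class ObjectsRequest(BaseModel):
    type: Literal["objects"]
    content: List[ObjectItem]


# FastAPI can accept either shape and dispatch on the "type" field.
EmbeddingRequest = Union[SimpleRequest, ObjectsRequest]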

Otherwise looks good to me.

@legaltextai
Contributor

Since opinions will need to be split, we will need to find a way to link them to the parent opinion. I suggest doing so by adding a chunk number to opinion_id. E.g., if we split opinion_id 123 into three chunks, each chunk will have an opinion_id of 123_1, 123_2, and 123_3, respectively.

If you agree with this approach, do we still want the embedding API to be responsible for splitting opinions? If yes, we'll need to send the opinion_id too.

My suggestion would be to have the FastAPI endpoint do only the embedding work, with the logic: if 'query' -> CPU, if 'opinion' -> GPU. The splitting and sending of the embeddings + chunks to S3 would be handled by a separate instance/script.
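
To make that routing concrete, something along these lines (the model name is just a placeholder):

import torch
from sentence_transformers import SentenceTransformer

MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"  # placeholder model

# Load one copy per device; fall back to CPU if no GPU is available.
cpu_model = SentenceTransformer(MODEL_NAME, device="cpu")
gpu_model = (
    SentenceTransformer(MODEL_NAME, device="cuda")
    if torch.cuda.is_available()
    else cpu_model
)


def pick_model(request_type: str) -> SentenceTransformer:
    """Route short queries to the CPU model and opinion batches to the GPU model."""
    return gpu_model if request_type == "opinion" else cpu_model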

@albertisfu
Contributor Author

Since opinions will need to be split, we will need to find a way to link them to the parent opinion. I suggest doing so by adding a chunk number to opinion_id. E.g., if we split opinion_id 123 into three chunks, each chunk will have an opinion_id of 123_1, 123_2, and 123_3, respectively.

@legaltextai I was imagining that we could request all chunk embeddings for a single opinion at a time, so there’s no need to identify which opinion they belong to.

I was reviewing the Sentence Transformers encode_multi_process method, whose documentation states:

Encodes a list of sentences using multiple processes and GPUs via SentenceTransformer.encode. The sentences are chunked into smaller packages and sent to individual processes, which encode them on different GPUs or CPUs. This method is only suitable for encoding large sets of sentences.

So the idea for this endpoint is that it can receive a simple query

 {
     "type": "simple",
     "text": "This is a query" 
 }

Or multiple opinion texts:

{
 "type": "objects",
 "content":[
   {
     "id": 1,
     "text": "Lorem"
   },
   ...
 ]
}

For simple queries, it would be as easy as just requesting one embedding for it and returning it.

For multiple opinion texts, I was thinking that for each opinion text in the request, we could do something like:

  • Split the opinion text into chunks (sentences).
  • Send the list of chunks to encode_multi_process().
  • Get the chunk embeddings and append them to the opinion response.
{
 "embeddings": [
   {
     "id": 1,
     "chunks": [
       {
         "chunk": "Lorem",
         "embedding": [12445, ...]
       }
     ]
   },
   ...
 ]
}

This will only be possible if the order of the embeddings from the model response is the same as the list of sentences in the input, so we can map the chunks to their embeddings.

["sentence 1", "sentence 2", "sentence 3"]

->

[[12445, ...], [12446, ...], [12447, ...]]

However, if the order of embeddings is not guaranteed, we won’t be able to map the chunks and their embeddings correctly, and we would need to look for a different strategy. Do you know how that works? Is the order of the sentence input the same as the embeddings?
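
If the order is preserved, the mapping step would be as simple as something like this (the model name is a placeholder):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # placeholder

chunks = ["sentence 1", "sentence 2", "sentence 3"]
pool = model.start_multi_process_pool()
embeddings = model.encode_multi_process(chunks, pool)  # one row per input chunk
model.stop_multi_process_pool(pool)

opinion_entry = {
    "id": 1,
    "chunks": [
        {"chunk": chunk, "embedding": embedding.tolist()}
        for chunk, embedding in zip(chunks, embeddings)
    ],
}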

My suggestion would be to have the FastAPI endpoint do only the embedding work, with the logic: if 'query' -> CPU, if 'opinion' -> GPU. The splitting and sending of the embeddings + chunks to S3 would be handled by a separate instance/script.

Yes, we could split opinion texts before sending them to the embedding API. However, the request body would be a bit more complex. I was thinking of abstracting that process and handling it within the embedding API as a preliminary step before requesting embeddings from the model. This way, if we need to adjust chunk sizes or anything else, we can tweak it within the microservice and won't have to worry about it in the Django command that requests embeddings.

@legaltextai
Contributor

I know I am the one to blame for dragging this issue out; my apologies. I just want to make sure we don't overload a simple embedding API endpoint with tasks that can be handled more easily outside of it.

I think we are on the same page about the importance of splitting and chunking the opinions in the right sequential order: 123_1, 123_2, etc. Just to remind us, we will be splitting opinion texts into chunks of several sentences each, with approximately 350 words per chunk. So each chunk will contain ~12-15 sentences, plus some overlap.

I have not used encode_multi_process() before, so I don't know whether it preserves the correct order of the chunks sent for embedding.

Here are some relevant parts of the code I used to split and send each chunk for embedding in batches (I worked with the opinions table as a DataFrame):


import pandas as pd


def split_text_into_chunks(text, max_word_count=350, overlap=20):
    """Yield ~350-word chunks, keeping a small word overlap between chunks."""
    words = text.split()
    current_chunk = []
    current_word_count = 0
    for word in words:
        current_chunk.append(word)
        current_word_count += 1

        if current_word_count >= max_word_count:
            yield ' '.join(current_chunk)
            # Start the next chunk with the last `overlap` words of this one.
            current_chunk = current_chunk[-overlap:]
            current_word_count = len(current_chunk)

    if current_chunk:
        yield ' '.join(current_chunk)


def process_df(df):
    """Explode each opinion row into one row per chunk, suffixing opinions_id."""
    new_rows = []
    for _, row in df.iterrows():
        chunks = list(split_text_into_chunks(row['opinion_text']))
        for i, chunk in enumerate(chunks):
            new_row = row.copy()
            new_row['opinion_text'] = chunk
            new_row['opinions_id'] = f"{row['opinions_id']}_{i + 1}"
            new_rows.append(new_row)
    return pd.DataFrame(new_rows)


# filtered_df is the opinions DataFrame being processed.
new_df = process_df(filtered_df)
new_df.reset_index(drop=True, inplace=True)

You then get the opinions split into something like this:
[Image: the resulting DataFrame, one row per chunk, with opinions_id values such as 123_1, 123_2, ...]

We then send the chunks to the FastAPI embedding endpoint in batches, get the results back, and save them to S3 as {opinion_id, embedding, chunk text} in files named like 123_1_model_name, 123_2_model_name, etc., where 123 is the parent opinion_id, _1 and _2 are the chunks in the correct order, and model_name is the embedding model we used. We then send those files to Elastic under the correct parent opinion_id.
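
As a sketch of that layout (the bucket name and helper function are hypothetical):

import json

import boto3

s3 = boto3.client("s3")


def save_chunk_embedding(opinion_id, chunk_no, chunk_text, embedding, model_name):
    """Write one chunk's embedding to a key like '123_1_model_name'."""
    key = f"{opinion_id}_{chunk_no}_{model_name}"
    body = json.dumps({
        "opinion_id": f"{opinion_id}_{chunk_no}",
        "chunk": chunk_text,
        "embedding": embedding,
    })
    s3.put_object(Bucket="cl-opinion-embeddings", Key=key, Body=body)  # hypothetical bucket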

Bottom line: the only thing the FastAPI endpoint would do is embed whatever is sent its way.

@albertisfu
Contributor Author

@legaltextai some questions about your approach, please.

We then send the chunks to the FastAPI embedding endpoint in batches

Do you mean that within a single embedding request, you send multiple chunks in the body, and the embedding API returns a response with the embeddings for all the chunks?

Or do you send multiple API requests, each with a single chunk?

If it's the former, in your approach, how would the API handle embedding multiple chunks? Would it send one chunk at a time to the model, or send all the chunks at once?

You shared some code that includes:

embedding = model.encode(text).tolist()

If this only supports one chunk at a time, I think we should explore encode_multi_process to see if we can improve performance by embedding multiple chunks simultaneously, while ensuring that the returned embeddings preserve the order of the chunks in the input.

@mlissner Regarding the proposed architecture: my idea was that the embedding endpoint would also split opinion texts into chunks, but it's possible it should only perform embeddings, as @legaltextai described. In that case, we'd need to delegate chunk splitting to two different processes in CL: one for the initial batch process and another for real-time indexing via the ES signal processor, which is also a valid approach. My initial thought was that text splitting could be tied to the embedding endpoint settings, like chunk length. However, I'm open to your thoughts so we can decide whether we want to handle text splitting outside the embedding endpoint.

@legaltextai
Contributor

legaltextai commented Oct 16, 2024

This is for a different embedding model (UAE), but this is how I used to embed those chunks in batches:

from typing import Dict, List, Tuple

import numpy as np
from angle_emb import AnglE
from flupy import flu
from tqdm import tqdm

# Load UAE-Large-V1 onto the GPU with CLS pooling.
angle = AnglE.from_pretrained('WhereIsAI/UAE-Large-V1', pooling_strategy='cls').cuda()
angle.set_prompt(prompt=None)

batch_size = 50
start_index = 0

# One tuple per chunk: (id, content, metadata); resume from start_index if needed.
data = list(new_df[['id', 'content', 'metadata']].itertuples(index=False, name=None))
data = data[start_index:]

for chunk_ix, batch in enumerate(tqdm(flu(data).chunk(batch_size))):
    records: List[Tuple[str, np.ndarray, Dict]] = []

    ids, contents, metadatas = zip(*batch)

    # Embed the whole batch of chunk texts in one call.
    embedding_chunk = angle.encode(contents, to_numpy=True)

    for id, embedding, metadata in zip(ids, embedding_chunk, metadatas):
        records.append((id, embedding, metadata))

... then I would upsert the records to my Postgres database. We can replace the metadata with the chunk text and save to S3 instead.

I'm OK with any solution you decide to implement. I just thought having a single endpoint that only does embeddings would be a simpler approach.

@mlissner
Member

mlissner commented Oct 17, 2024

Sorry to take a minute to catch up on this conversation.

I'm with Alberto that we should at least check out encode_multi_process. I think it will give us better performance.

I also think it's better to do text splitting in the microservice for a few reasons:

  1. It should scale better. If we do chunking in the microservice, our batch script will only need to send opinion text to the microservice, making it do less work. The microservice can then do the work of chunking and embedding. A nice thing about microservices is that they scale horizontally, so if the microservice is a bottleneck, we can easily scale it. If chunking is a bottleneck in our batch script, we'll have to think about ways to make it multi-process, use threads, etc.

  2. From an API perspective, the one I sketched above is simpler and more general than one that takes and returns chunks.

  3. The other nice thing about microservices is encapsulation of logic and the creation of clean boundaries. The job of this microservice is to make good embeddings. It should encapsulate things like chunk size rather than splitting the embedding logic across the microservice (doing the vectorization) and the batch script (doing the chunking). From a caller's perspective, it's better to just send a blob of text and get back the useful thing, without having to even know what a chunk is (more or less).

I think the opposing view is simplicity in the microservice, but we can't get rid of the complexity of chunking — we can only decide where the complexity lives. If the microservice doesn't do chunking, then we do it in our batch script, which makes it part of CL itself. I'd much rather that complexity live outside of CL. The simpler we can make the monster that is CourtListener, the better.
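
To make the caller's side concrete: with chunking inside the microservice, the batch script reduces to roughly this (the URL, path, and payload shape are illustrative, not final):

import requests

payload = {
    "type": "objects",
    "content": [{"id": 123, "text": "Full opinion text ..."}],
}
response = requests.post("http://inception:8000/embed", json=payload, timeout=120)
response.raise_for_status()

for item in response.json()["embeddings"]:
    # The caller never needs to know how the text was chunked.
    print(item["id"], len(item["chunks"]))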

Thank you both!

@legaltextai
Contributor

Thank you, Mike. I will test encode_multi_process to see whether it keeps the chunks in the correct order. Another angle to consider: is it only us who will be using this API endpoint? No external access?

@mlissner
Member

No external access?

At least not for now. Others can download the microservice and run it in their infra, but we won't make ours accessible.

@legaltextai
Contributor

legaltextai commented Oct 22, 2024

Here is my first take on the FastAPI service for queries and opinion batches. You may ignore the 'text' endpoint for now. The 'batch' endpoint breaks opinions down by sentences, not exceeding 350 words in total (approximately 512 tokens, the context limit for this embedding model). Please let me know what you think.

Text pre-processing: since JSON is sensitive to text formatting, we need to decide whether to handle pre-processing within the microservice or in the client-side script. I've been testing with text from the CL website; we may need to add additional preprocessing steps.
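
For example, something as small as this could live on either side (purely illustrative):

import re


def clean_text(text: str) -> str:
    """Strip control characters and collapse whitespace before embedding."""
    text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", " ", text)
    return re.sub(r"\s+", " ", text).strip()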

@mlissner
Member

Hm, that link doesn't work, but where's the code for the microservice?

@legaltextai
Contributor

Can you please try again? Also, since the model is loaded on my GPU, can you please let me know when you are done testing?
The code is here.

@mlissner
Member

I took a quick look. I don't know FastAPI, but the Swagger interface looked about right and the code seemed fine at a glance. I think you need to get that into a pull request format so that we can merge it into our own repo.

To that end, I created freelawproject/inception. Some things to make it good:

  • A few words in the readme
  • It should use poetry or uv for package management (not pip, sorry)
  • FastAPI is fine, but I don't know if what you have is good, so we can wait on Alberto for that.

Can you work on that? When that's done, I think what we can do while Alberto is out is create PRs for the many pieces of this puzzle, and when he gets back he can review them all.

Sound good?

@legaltextai
Contributor

Sounds good. Is Doctor still a good template to follow?

@mlissner
Member

Yeah, it should be great, thank you!
