Generating garbage output #521

Open
shreyansh26 opened this issue Jun 19, 2024 · 2 comments

shreyansh26 commented Jun 19, 2024

System Info

Using Docker server

model=mistralai/Mistral-7B-Instruct-v0.1
volume=$PWD/data

docker run --gpus '"device=3"' --shm-size 1g -p 8080:80 -v $volume:/data \
    -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
    ghcr.io/predibase/lorax:main --model-id $model

Running on a node with 8x H100 80GB GPUs. Device 3 is completely idle, with no processes running on it.
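
As a sanity check on the claim above, here is a minimal sketch for confirming that device 3 is idle before launching the container (it assumes the pynvml bindings from the nvidia-ml-py package; this snippet is not part of the original report):

import pynvml  # from nvidia-ml-py (an assumption; not used in the original report)

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(3)  # the device passed via --gpus above

# No compute processes and near-zero memory usage means the GPU is idle.
procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"GPU 3 compute processes: {len(procs)}")
print(f"GPU 3 memory used: {mem.used / 1024**2:.0f} MiB of {mem.total / 1024**2:.0f} MiB")

pynvml.nvmlShutdown()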

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Launch the Lorax server:

model=mistralai/Mistral-7B-Instruct-v0.1
volume=$PWD/data

docker run --gpus '"device=3"' --shm-size 1g -p 8080:80 -v $volume:/data \
    -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
    ghcr.io/predibase/lorax:main --model-id $model

Use lorax-client with Python to query the server.

from lorax import Client

client = Client("http://127.0.0.1:8080")

# Prompt the base LLM
prompt = "[INST] What is the capital of Portugal? [/INST]"
print(client.generate(prompt, max_new_tokens=64).generated_text)

This generates garbage output

Theaot of the9
 the00-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

And on the server side:

2024-06-19T15:19:59.434864Z  INFO HTTP request{otel.name=POST / http.client_ip= http.flavor=1.1 http.host=127.0.0.1:8080 http.method=POST http.route=/ http.scheme=HTTP http.target=/ http.user_agent=python-requests/2.31.0 otel.kind=server trace_id=a884cc60137b84e19b63b832ca233d42}:compat_generate{default_return_full_text=Extension(false) info=Extension(Info { model_id: "mistralai/Mistral-7B-Instruct-v0.1", model_sha: Some("73068f3702d050a2fd5aa2ca1e612e5036429398"), model_dtype: "torch.float16", model_device_type: "cuda", model_pipeline_tag: Some("text-generation"), max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_total_tokens: 453184, max_waiting_tokens: 20, validation_workers: 2, version: "0.1.0", sha: None, docker_label: None, request_logger_url: None, embedding_model: false }) request_logger_sender=Extension(Sender { chan: Tx { inner: Chan { tx: Tx { block_tail: 0x560439960630, tail_position: 0 }, semaphore: Semaphore { semaphore: Semaphore { permits: 32 }, bound: 32 }, rx_waker: AtomicWaker, tx_count: 1, rx_fields: "..." } } }) req_headers={"host": "127.0.0.1:8080", "user-agent": "python-requests/2.31.0", "accept-encoding": "gzip, deflate", "accept": "*/*", "connection": "keep-alive", "content-length": "562", "content-type": "application/json"}}:generate{parameters=GenerateParameters { adapter_id: None, adapter_source: None, adapter_parameters: None, api_token: None, best_of: None, temperature: None, repetition_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(64), ignore_eos_token: false, return_full_text: Some(false), stop: [], truncate: None, watermark: false, details: true, decoder_input_details: false, return_k_alternatives: None, apply_chat_template: false, seed: None, response_format: None } total_time="706.83088ms" validation_time="302.036µs" queue_time="48.49µs" inference_time="706.480604ms" time_per_token="11.038759ms" seed="None"}: lorax_router::server: router/src/server.rs:590: Success

And this is not input-related: garbage output is generated with pretty much every prompt I tried.
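
As a further sanity check on what the container is actually serving, the router's info can be inspected directly. This is a sketch that assumes a TGI-style GET /info route, whose fields match the Info struct printed in the log above:

import requests

# Assumes a TGI-style GET /info route; the field names mirror the Info
# struct printed in the router log (model_id, model_dtype, ...).
info = requests.get("http://127.0.0.1:8080/info", timeout=10).json()
print(info.get("model_id"), info.get("model_sha"))
print(info.get("model_dtype"), info.get("model_device_type"))
print(info.get("version"), info.get("sha"), info.get("docker_label"))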

Expected behavior

Using a simple HF inference script gives the expected output.

from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" # the device to load the model onto

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

prompt = "[INST] What is the capital of Portugal? [/INST]"

encodeds = tokenizer.encode(prompt, return_tensors="pt")
model_inputs = encodeds.to(device)

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
model.to(device)

generated_ids = model.generate(model_inputs, max_new_tokens=1000, do_sample=True)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])

Output

<s> [INST] What is the capital of Portugal? [/INST] The capital city of Portugal is Lisbon.</s>
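
For a closer apples-to-apples comparison with the server, which the router log shows decoding greedily in float16 (do_sample: false, max_new_tokens: Some(64), model_dtype: torch.float16), a greedy float16 variant of the same script can be used. This is a sketch, not part of the original report:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Greedy decode in float16 to mirror the server-side parameters from the log.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.1", torch_dtype=torch.float16
).to("cuda")

prompt = "[INST] What is the capital of Portugal? [/INST]"
model_inputs = tokenizer.encode(prompt, return_tensors="pt").to("cuda")

generated_ids = model.generate(model_inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.batch_decode(generated_ids)[0])
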
shreyansh26 (Author) commented:

Okay, so it looks like using ghcr.io/predibase/lorax:latest fixes it. It's probably an issue with the current main branch.

GirinMan (Contributor) commented:

I'm facing a similar problem. When using the ghcr.io/predibase/lorax:main image, I see garbage output. Older versions of lorax, and HF TGI, do not produce this issue.

curl -X 'POST' \
  'http://127.0.0.1:50710/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Who are you?"
      }],
    "model": "",
    "seed": 42,
    "max_tokens": 256, "temperature": 0.1
  }'
{
  "id":"null",
  "object":"text_completion",
  "created":0,
  "model":"null",
  "choices":[
    {"index":0,"message": 
      {
        "role":"assistant",
        "content":"I \n ____ \n\n코_김 \n코 a a \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n1\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n1 and the and the and the and the a and the the a a and the a a a a a a a a a a a a a a a a a a a a the the a and 2 and the a a a the the the a the the the the the a a 2 and I a\n\n\n1\n\n\n\n\n\n\n\n\n\n\n\n\n\ns and /******/ and the and the and the and the and the the the the the the the the a and /******/ and the the the the the a and /******/ and /******/ 1 and the, a a a a and the a.\n\n\n /******/ and /******/ and /******/ and /******/ and /******/ and /******/ and the a a.jpg the the the a the the and a and /******/ and the a a the the the a a the a a a a a to the a a a a. /******/ a"
      },
      "finish_reason":"length"
    }
  ],
  "usage":{
    "prompt_tokens":25,
    "total_tokens":281,
    "completion_tokens":256
  }
}
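
For completeness, the same request as a Python snippet (a direct translation of the curl call above, using the requests library):

import requests

# Same payload as the curl call above, sent to the OpenAI-compatible
# /v1/chat/completions route on port 50710.
payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who are you?"},
    ],
    "model": "",
    "seed": 42,
    "max_tokens": 256,
    "temperature": 0.1,
}
resp = requests.post("http://127.0.0.1:50710/v1/chat/completions", json=payload, timeout=60)
print(resp.json()["choices"][0]["message"]["content"])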
