load_in_4bit does not work #1263
Unanswered
PatchouliPatch asked this question in CATCH-ALL: alpha testing the `multi-backend-refactor`
The following is a modified version of the code from the Phi-3 Hugging Face model page.
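Roughly, the setup looks like this (a minimal sketch of the usual `transformers` + `BitsAndBytesConfig` pattern, not the exact script; the model id and generation settings here are placeholders):

```python
# Minimal sketch: Phi-3 loaded through transformers with bitsandbytes 4-bit quantization.
# The checkpoint name and generation parameters are illustrative, not the exact script.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "microsoft/Phi-3-mini-4k-instruct"  # assumed checkpoint

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # this path fails; load_in_8bit=True works
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

prompt = "Explain quantization in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)  # never returns in the 4-bit case
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```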
For some reason, load_in_4bit does not work but load_in_8bit does; with 4-bit, the inference never seems to finish.
At full precision the output is:
And at 8-bit quant:
I'm using ROCm 6.0.3 and the stable branch of torch, after finding out that the latest nightly builds for 6.1.x have problems with quantization.
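For reference, a quick sanity check of the environment (a sketch; `torch.version.hip` is only populated on ROCm builds of torch):

```python
# Environment sanity check (sketch): confirm the ROCm build of torch,
# GPU visibility, and the installed bitsandbytes version.
import torch
import bitsandbytes as bnb

print("torch:", torch.__version__)
print("HIP runtime:", torch.version.hip)        # None on non-ROCm builds
print("GPU available:", torch.cuda.is_available())
print("bitsandbytes:", bnb.__version__)
```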
Any possible debugging steps?