
use with llama.cpp #8

Closed
scalar27 opened this issue Sep 17, 2024 · 8 comments
Labels: documentation (Improvements or additions to documentation)

Comments

@scalar27

I'm trying to understand whether this could be used with a local LLM via llama.cpp in interactive mode. Is this possible? I'd very much like to try this out.

@codelion
Owner

You can use it with local LLMs by just setting the base_url. For example, I use it with ollama:

python optillm.py --base_url http://localhost:11434/v1

or with llama_cpp.server (which starts an OpenAI API compatible chat server on port 8080):

python optillm.py --base_url http://localhost:8080/v1

To interact with the resulting proxy, you need a client that can chat with an OpenAI API compatible endpoint. For example, it works with https://github.com/oobabooga/text-generation-webui , and someone on Reddit was able to set it up with oobabooga easily (see here).
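As a quick sanity check without any GUI, you can also hit the proxy directly with a standard OpenAI-style curl call (this assumes optillm's default port 8000; the model name below is only a placeholder, see the note about prepending the technique slug further down in this thread):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer no_key" \
  -d '{"model": "moa-my-local-model", "messages": [{"role": "user", "content": "Hello"}]}'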

I checked llama.cpp's -i interactive mode, but that only works for models loaded directly into llama.cpp for inference, so I'm afraid that won't work here.

It should not be hard to create a GUI for comparing the different approaches. I have added it as item #9.

@scalar27
Author

What if I run llama-server instead, at port 8080? When I start that and then run optillm, I get an API_KEY error: openai.OpenAIError: The api_key client option must be set either by passing api_key to the client or by setting the OPENAI_API_KEY environment variable

It's probably obvious, but what do I do to fix that? Thanks.

@codelion
Owner

When using a local model, just put any value for the key; something like export OPENAI_API_KEY=no_key should do it. The OpenAI client simply expects the variable to be set.
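For example, with llama-server already running on port 8080 as above, the full sequence is just:

export OPENAI_API_KEY=no_key
python optillm.py --base_url http://localhost:8080/v1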

@scalar27
Author

Thanks. It runs now, but as you mentioned above it doesn't do anything, since llama-server ignores it. Looking forward to further development.

@codelion
Owner

Thanks for trying it out.

Once the llama.cpp server is set up, we still need to prepend the technique slug to the model name to make use of the proxy.

We can use the model_alias for that.

python -m llama_cpp.server --hf_model_repo_id bullerwins/Meta-Llama-3.1-8B-Instruct-GGUF --model 'Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf' --chat_format chatml --model_alias 'my-local-model'

Above, we are creating an OpenAI compatible chat completions endpoint that serves the model bullerwins/Meta-Llama-3.1-8B-Instruct-GGUF from HF locally, but under the alias my-local-model. Then, when we start optillm, we set the base_url to http://localhost:8080/v1 so that it uses this endpoint. optillm itself will run on http://localhost:8000/v1 by default.
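Concretely, that start command is the same one as earlier in this thread:

python optillm.py --base_url http://localhost:8080/v1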

Now, to use optillm, you call it from your code as shown in the README, using the model name slug-my-local-model, where slug is the technique you want to apply.

import os
from openai import OpenAI

OPENAI_KEY = os.environ.get("OPENAI_API_KEY")  # any dummy value works for local models
OPENAI_BASE_URL = "http://localhost:8000/v1"   # the optillm proxy, not llama_cpp.server
client = OpenAI(api_key=OPENAI_KEY, base_url=OPENAI_BASE_URL)

response = client.chat.completions.create(
  model="moa-my-local-model",  # technique slug + model_alias
  messages=[
    {
      "role": "user",
      "content": "Write a Python program to build an RL model to recite text from any position that the user provides, using only numpy."
    }
  ],
  temperature=0.2
)

print(response)

This request will first hit the optillm proxy at http://localhost:8000, where optillm parses the moa-my-local-model name from the request to detect moa as the technique and my-local-model as the base model (you can see how it is done in the code here).

Then optillm applies moa and sends the underlying calls to the base model at http://localhost:8080, which is the base_url we gave when starting optillm. This way, the calls to the base model go to llama_cpp.server.
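For intuition, the name splitting is roughly equivalent to the small sketch below (illustrative only, not optillm's actual code; the slug list is partial and the function name is made up):

# Sketch of the idea: split "moa-my-local-model" into ("moa", "my-local-model").
KNOWN_SLUGS = {"bon", "moa", "mcts", "cot_reflection", "leap",
               "plansearch", "rstar", "rto", "self_consistency", "z3"}

def split_model_name(model):
    for slug in KNOWN_SLUGS:
        if model.startswith(slug + "-"):
            return slug, model[len(slug) + 1:]
    return None, model  # no known slug: pass the request through unchanged

print(split_model_name("moa-my-local-model"))  # ('moa', 'my-local-model')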

Does that help? If you are still unable to get it working, I can do a detailed guide with screenshots or a video. Let me know.

@0xcoolio

I was able to get optillm working with the llama.cpp server and SillyTavern on my M1 MacBook Pro. I'm not sure I did it correctly, though, as the results were... meh. But here's what I did:

  1. Start llama-server normally, i.e. ./llama-server -m models/Meta-Llama-3.1-70B-Instruct-Q5_K_M-00001-of-00002.gguf
  2. Start optillm with OPENAI_API_KEY=no_key python optillm.py --base_url http://localhost:8080/v1 --port 8001 --best_of_n 1. The --best_of_n 1 is the only way I could prevent the llama.cpp server from throwing the error "Only one completion choice is allowed". I'm sure this is also why the results of, say, the mcts approach were pretty bad (subjectively), and it's the real reason I'm posting this: to see if there's a way around it! I set optillm to port 8001 because SillyTavern runs on port 8000.
  3. In SillyTavern, select the 'chat completion' API with Chat Completion Source = Custom (OpenAI compatible) and point it to optillm, i.e. http://127.0.0.1:8001 (or possibly http://127.0.0.1:8001/v1, I can't quite remember). It connects, and you can send a test message, etc.

By that point it all works, but again for my tests the results were worse than just using the model directly.

@codelion
Owner

@0xcoolio Glad that you were able to get it running. Regarding the error Only one completion choice is allowed, it looks like llama-server doesn't support sampling n responses from the model (see abetlen/llama-cpp-python#1130). Sampling multiple responses is the required first step in some of the techniques (e.g. bon, moa, mcts), so without it they will not work.

The best_of_n parameter is only used by the bon approach. When set to 1, it won't do anything, as there is only one completion to choose from.

You can try some of the other approaches and they should work, e.g. cot_reflection, leap, plansearch, rstar, rto, self_consistency and z3, as they do not require sampling multiple responses from the model. You should also set n_ctx to at least 4096, since most of the approaches have max_tokens set to 4096 and llama-server's default context length is 2048.
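For example, reusing the model path from your first step, llama-server's context window can be raised with its -c (--ctx-size) flag:

./llama-server -m models/Meta-Llama-3.1-70B-Instruct-Q5_K_M-00001-of-00002.gguf -c 4096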

Or you could try using ollama to run the model locally; its OpenAI API compatible endpoint does allow sampling multiple responses via the n parameter.
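For example (the model you serve is up to you), once ollama is running you would point optillm at its endpoint exactly as earlier in this thread:

ollama serve
python optillm.py --base_url http://localhost:11434/v1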

@LuMarans30
Contributor

I've summarized this discussion into a small section in the README (#27) so that other people can try out this project easily. I hope it's all correct.
