Alpine LLaMA is an ultra-compact Docker image (less than 10 MB), providing a LLaMA.cpp HTTP server for language model inference.
This Docker image is particularly suited for:
- Environments with limited disk space or low bandwidth.
- Servers that cannot do GPU-accelerated inference, e.g. a CPU-only VPS or a Raspberry Pi.
You can start a local standalone HTTP inference server that listens on port 50000 and uses the quantized Qwen2.5-Coder 1.5B model available on Hugging Face (HF) with:
```bash
docker run --name alpine-llama --publish 50000:8080 \
  --env LLAMA_ARG_HF_REPO=bartowski/Qwen2.5-Coder-1.5B-Instruct-GGUF \
  --env LLAMA_ARG_HF_FILE=Qwen2.5-Coder-1.5B-Instruct-Q4_K_M.gguf \
  --env LLAMA_API_KEY=sk-xxxx \
  --env LLAMA_ARG_ALIAS=qwen2.5-coder-1.5b \
  samueltallet/alpine-llama-cpp-server
```
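The first start can take a while because the model file has to be downloaded. To check that the server is ready, you can poll llama.cpp's built-in health route (a quick sketch; adjust the port if you changed the mapping):

```bash
# Expected to return {"status":"ok"} once the model has finished loading
# (a 503 response usually means the model is still being loaded).
curl -i http://127.0.0.1:50000/health
```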
Once the GGUF model file is downloaded from HF (and cached in the Docker container filesystem), you can query your local endpoint using the official OpenAI TS & JS API library.
To check this model's structured output capabilities, run the following Node.js script, which extracts metadata from a product description according to a predefined JSON schema:
```js
import OpenAI from "openai";

const inferenceClient = new OpenAI({
  // In a real project, you should use environment variables
  // instead of these hardcoded values:
  apiKey: "sk-xxxx",
  baseURL: "http://127.0.0.1:50000/v1",
});

const productDescription = `UrbanShoes 3.0: These brown and green shoes,
suitable for casual wear, are made of apple leather and recycled rubber.
They are priced at only €654.90.`; // 👈😄 This is not a typo.

const productSchema = {
  type: "object",
  properties: {
    name: { type: "string" },
    materials: { type: "array", items: { type: "string" } },
    colors: { type: "array", items: { type: "string" } },
    price: { type: "number" },
    currency: { type: "string", enum: ["USD", "EUR", "GBP"] },
  },
  required: ["name", "materials", "colors", "price", "currency"],
};

async function extractProductMeta() {
  const response = await inferenceClient.chat.completions.create({
    messages: [
      { role: "user", content: productDescription }
    ],
    model: "qwen2.5-coder-1.5b",
    temperature: 0.2,
    response_format: {
      type: "json_schema",
      json_schema: {
        name: "product_metadata",
        strict: true,
        schema: productSchema,
      },
    },
  });

  console.log(response.choices[0].message.content);
}

extractProductMeta();

// > { "name": "UrbanShoes 3.0", "materials": ["apple leather", "recycled rubber"], "colors": ["brown", "green"], "price": 654.90, "currency": "EUR" }
```
If you want a fully-featured AI chat GUI, you can use this docker-compose.yml file, which combines the Alpine LLaMA server with the LobeChat interface:
```yaml
services:

  alpine-llama:
    image: samueltallet/alpine-llama-cpp-server
    container_name: alpine-llama
    volumes:
      - ./models/HuggingFaceTB/smollm2-1.7b-instruct-q4_k_m.gguf:/opt/smollm2-1.7b.gguf:ro
    environment:
      - LLAMA_ARG_MODEL=/opt/smollm2-1.7b.gguf
      - LLAMA_ARG_ALIAS=smollm2-1.7b
      - LLAMA_API_KEY=sk-xxxx # In production, be sure to use your own strong secret key.

  lobe-chat:
    image: lobehub/lobe-chat
    container_name: lobe-chat
    depends_on:
      - alpine-llama
    environment:
      - OPENAI_PROXY_URL=http://alpine-llama:8080/v1
      - OPENAI_API_KEY=sk-xxxx
      - OPENAI_MODEL_LIST=smollm2-1.7b
    ports:
      - "3210:3210"
```
Before running `docker compose up`, you will need to download the smollm2-1.7b-instruct-q4_k_m.gguf model file and put it in your models/HuggingFaceTB directory (next to the docker-compose.yml file).
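One way to fetch it is the Hugging Face CLI (a sketch; it assumes the GGUF file is published in the `HuggingFaceTB/SmolLM2-1.7B-Instruct-GGUF` repository):

```bash
# Download the quantized model straight into ./models/HuggingFaceTB/.
# The repository name below is an assumption: adjust it if the file lives elsewhere.
huggingface-cli download HuggingFaceTB/SmolLM2-1.7B-Instruct-GGUF \
  smollm2-1.7b-instruct-q4_k_m.gguf \
  --local-dir ./models/HuggingFaceTB
```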
Once both services are started, you can optionally configure an AI assistant at http://localhost:3210 and start chatting with the SmolLM2-1.7B model.
You can pass environment variables to the Docker container to configure the Alpine LLaMA server:
| Environment Variable | Description | Example Value |
|---|---|---|
| `LLAMA_ARG_HF_REPO` | Hugging Face (HF) repository of a model | `bartowski/Llama-3.2-1B-Instruct-GGUF` |
| `LLAMA_ARG_HF_FILE` | Model file to use in this HF repository | `Llama-3.2-1B-Instruct-Q4_K_M.gguf` |
| `LLAMA_ARG_MODEL` | Alternatively, path to a model file on your hard disk | `/home/you/LLMs/Llama-3.2-1B.gguf` |
| `LLAMA_ARG_MODEL_URL` | Alternatively, URL to download the model file from | `https://your.host/Llama-3.2-1B.gguf` |
| `LLAMA_API_KEY` | Key for authenticating HTTP API requests | `sk-n5V9UAJt6wRFfZQ4eDYk37uGzbKXdpNj` |
| `LLAMA_ARG_ALIAS` | Alias of the model in HTTP API requests | `Llama-3.2-1B` |
An exhaustive list of these variables can be found in the official LLaMA.cpp server documentation.
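For instance, combining the variables above, you could load a model from an arbitrary URL instead of Hugging Face (a sketch; `https://your.host/Llama-3.2-1B.gguf` is a placeholder to replace with your own host):

```bash
# Download the GGUF file from a custom URL at startup and expose it under the alias Llama-3.2-1B.
docker run --name alpine-llama --publish 50000:8080 \
  --env LLAMA_ARG_MODEL_URL=https://your.host/Llama-3.2-1B.gguf \
  --env LLAMA_ARG_ALIAS=Llama-3.2-1B \
  --env LLAMA_API_KEY=sk-xxxx \
  samueltallet/alpine-llama-cpp-server
```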
This project is licensed under the MIT License. See the LICENSE file for details.
© 2024 Samuel Tallet