
Alpine LLaMA is an ultra-compact Docker image (less than 10 MB), providing a LLaMA.cpp HTTP server for language model inference.

The image is available on Docker Hub as samueltallet/alpine-llama-cpp-server.

Use cases

This Docker image is particularly suited for:

  • Environments with limited disk space or low bandwidth.
  • Servers that cannot do GPU-accelerated inference, e.g. a CPU-only VPS or a Raspberry Pi.

Examples

Standalone

You can start a local standalone HTTP inference server that listens on port 50000 and uses the Qwen2.5-Coder 1.5B quantized model available on Hugging Face (HF) with:

docker run --name alpine-llama --publish 50000:8080 \
  --env LLAMA_ARG_HF_REPO=bartowski/Qwen2.5-Coder-1.5B-Instruct-GGUF \
  --env LLAMA_ARG_HF_FILE=Qwen2.5-Coder-1.5B-Instruct-Q4_K_M.gguf \
  --env LLAMA_API_KEY=sk-xxxx \
  --env LLAMA_ARG_ALIAS=qwen2.5-coder-1.5b \
  samueltallet/alpine-llama-cpp-server
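
Downloading the model can take a moment, so it is handy to know when the server is ready. As a quick optional check (this sketch assumes the llama.cpp server's built-in /health endpoint and the OpenAI-compatible /v1/models route, exposed here on the published port 50000):

# Returns a "loading" status while the model is being fetched, then
# something like {"status":"ok"} once the server is ready
# (the exact response shape depends on your llama.cpp version):
curl http://127.0.0.1:50000/health

# List the served model under its configured alias (pass the API key set above):
curl --header "Authorization: Bearer sk-xxxx" http://127.0.0.1:50000/v1/models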

Once the GGUF model file is downloaded from HF (and cached in the Docker container filesystem), you can query your local endpoint using the official OpenAI TS & JS API library.

To check this model's structured-output capabilities, run the following Node.js script, which extracts metadata from a product description according to a predefined JSON schema:

import OpenAI from "openai";

const inferenceClient = new OpenAI({
  // In a real project, you should use environment variables
  // instead of these hardcoded values:
  apiKey: "sk-xxxx",
  baseURL: "http://127.0.0.1:50000/v1",
});

const productDescription = `UrbanShoes 3.0: These brown and green shoes,
suitable for casual wear, are made of apple leather and recycled rubber.
They are priced at only €654.90.`; // 👈😄 This is not a typo.

const productSchema = {
  properties: {
    name: { type: "string" },
    materials: { type: "array", items: { type: "string" } },
    colors: { type: "array", items: { type: "string" } },
    price: { type: "number" },
    currency: { type: "string", enum: ["USD", "EUR", "GBP"] },
  },
  required: ["name", "materials", "colors", "price", "currency"],
};

async function extractProductMeta() {
  const response = await inferenceClient.chat.completions.create({
    messages: [
      { role: "user", content: productDescription }
    ],
    model: "qwen2.5-coder-1.5b",
    temperature: 0.2,
    response_format: {
      type: "json_schema",
      json_schema: {
        strict: true,
        schema: productSchema,
      },
    },
  });

  console.log(response.choices[0].message.content);
}

extractProductMeta();
// > { "name": "UrbanShoes 3.0", "materials": ["apple leather", "recycled rubber"], "colors": ["brown", "green"], "price": 654.90, "currency": "EUR" }

With a GUI

If you want a fully-featured AI chat GUI, you can use this docker-compose.yml file, which combines the Alpine LLaMA server with the LobeChat interface:

services:
  alpine-llama:
    image: samueltallet/alpine-llama-cpp-server
    container_name: alpine-llama
    volumes:
      - ./models/HuggingFaceTB/smollm2-1.7b-instruct-q4_k_m.gguf:/opt/smollm2-1.7b.gguf:ro
    environment:
      - LLAMA_ARG_MODEL=/opt/smollm2-1.7b.gguf
      - LLAMA_ARG_ALIAS=smollm2-1.7b
      - LLAMA_API_KEY=sk-xxxx # In production, be sure to use your own strong secret key.

  lobe-chat:
    image: lobehub/lobe-chat
    container_name: lobe-chat
    depends_on:
      - alpine-llama
    environment:
      - OPENAI_PROXY_URL=http://alpine-llama:8080/v1
      - OPENAI_API_KEY=sk-xxxx
      - OPENAI_MODEL_LIST=smollm2-1.7b
    ports:
      - "3210:3210"

Before running docker compose up, you will need to download the smollm2-1.7b-instruct-q4_k_m.gguf model file and put it in your models/HuggingFaceTB directory (next to the docker-compose.yml file), for example as sketched below.
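
For illustration, the file can be fetched in one command. The Hugging Face repository path below (HuggingFaceTB/SmolLM2-1.7B-Instruct-GGUF) is an assumption, so adapt the URL if you get the GGUF file from somewhere else:

# Assumed download source; adjust the URL if you host the GGUF file elsewhere.
mkdir -p models/HuggingFaceTB
wget -O models/HuggingFaceTB/smollm2-1.7b-instruct-q4_k_m.gguf \
  https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B-Instruct-GGUF/resolve/main/smollm2-1.7b-instruct-q4_k_m.gguf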

Once the two services are started, you can optionally configure an AI assistant at http://localhost:3210 and begin to chat with the SmolLM2-1.7B model.

Configuration

You can pass environment variables to the Docker container to configure the Alpine LLaMA server:

| Environment Variable | Description | Example Value |
| --- | --- | --- |
| LLAMA_ARG_HF_REPO | Hugging Face (HF) repository of a model. | bartowski/Llama-3.2-1B-Instruct-GGUF |
| LLAMA_ARG_HF_FILE | Model file to use in this HF repository. | Llama-3.2-1B-Instruct-Q4_K_M.gguf |
| LLAMA_ARG_MODEL | Alternatively, path to a model file on your local disk. | /home/you/LLMs/Llama-3.2-1B.gguf |
| LLAMA_ARG_MODEL_URL | Alternatively, URL to download the model file from. | https://your.host/Llama-3.2-1B.gguf |
| LLAMA_API_KEY | Key for authenticating HTTP API requests. | sk-n5V9UAJt6wRFfZQ4eDYk37uGzbKXdpNj |
| LLAMA_ARG_ALIAS | Alias of the model in HTTP API requests. | Llama-3.2-1B |

An exhaustive list of these variables can be found in the official LLaMA.cpp server documentation.
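
As a further example of these variables, here is a sketch that serves a model file already present on your disk by mounting it into the container; the host path and alias below are placeholders to adapt:

docker run --name alpine-llama --publish 50000:8080 \
  --volume /home/you/LLMs/Llama-3.2-1B-Instruct-Q4_K_M.gguf:/opt/llama-3.2-1b.gguf:ro \
  --env LLAMA_ARG_MODEL=/opt/llama-3.2-1b.gguf \
  --env LLAMA_ARG_ALIAS=Llama-3.2-1B \
  --env LLAMA_API_KEY=sk-xxxx \
  samueltallet/alpine-llama-cpp-server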

License

Project licensed under MIT. See the LICENSE file for details.

Copyright

© 2024 Samuel Tallet
