
Warning

Starting with version 0.1.79, the model format has changed from GGML to GGUF. Existing GGML models can be converted with the convert-llama-ggmlv3-to-gguf.py script in llama.cpp (or you can often find ready-made GGUF conversions on HuggingFace Hub).
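
For example, converting an existing GGML checkpoint might look like the following (a minimal sketch; the filenames are placeholders, and the flag names can vary between llama.cpp revisions, so check the script's --help):

# minimal sketch -- run from a llama.cpp checkout; filenames are placeholders
python3 convert-llama-ggmlv3-to-gguf.py \
    --input  llama-2-7b.ggmlv3.q4_K_S.bin \
    --output llama-2-7b.Q4_K_S.gguf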

There are two branches of this container for backwards compatibility (see below the list for an example of selecting one):

  • llama_cpp:gguf (the default, which tracks upstream master)
  • llama_cpp:ggml (which still supports GGML model format)
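
For example, to start the legacy GGML branch instead of the default GGUF one (assuming autotag accepts the branch-qualified package name):

# default branch (GGUF)
./run.sh $(./autotag llama_cpp)

# legacy branch (GGML)
./run.sh $(./autotag llama_cpp:ggml)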

There are a couple of patches applied to the legacy GGML fork:

  • fixed the __fp16 typedef in llama.h on ARM64 (using half when compiling with NVCC)
  • parsing of BOS/EOS tokens (see ggerganov/llama.cpp#1931)

Inference Benchmark

You can use llama.cpp's built-in main tool to run GGUF models (from HuggingFace Hub or elsewhere):

./run.sh --workdir=/opt/llama.cpp/bin $(./autotag llama_cpp) /bin/bash -c \
 './main --model $(huggingface-downloader TheBloke/Llama-2-7B-GGUF/llama-2-7b.Q4_K_S.gguf) \
         --prompt "Once upon a time," \
         --n-predict 128 --ctx-size 192 --batch-size 192 \
         --n-gpu-layers 999 --threads $(nproc)'

> the --model argument expects a .gguf filename (typically the Q4_K_S quantization is used)
> if you're trying to load Llama-2-70B, add the --gqa 8 flag (see the example below)
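
For example, a Llama-2-70B run might look like the following (a sketch adapted from the 7B command above, using the 70B model listed in the memory table below):

./run.sh --workdir=/opt/llama.cpp/bin $(./autotag llama_cpp) /bin/bash -c \
 './main --model $(huggingface-downloader TheBloke/Llama-2-70B-GGUF/llama-2-70b.Q4_K_S.gguf) \
         --prompt "Once upon a time," \
         --n-predict 128 --ctx-size 192 --batch-size 192 \
         --gqa 8 --n-gpu-layers 999 --threads $(nproc)'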

To use the Python API and benchmark.py instead:

./run.sh --workdir=/opt/llama.cpp/bin $(./autotag llama_cpp) /bin/bash -c \
 'python3 benchmark.py --model $(huggingface-downloader TheBloke/Llama-2-7B-GGUF/llama-2-7b.Q4_K_S.gguf) \
            --prompt "Once upon a time," \
            --n-predict 128 --ctx-size 192 --batch-size 192 \
            --n-gpu-layers 999 --threads $(nproc)'
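
You can also call the Python bindings directly, for example by piping a short script into python3 inside the container (a minimal sketch; the model path is a placeholder for wherever huggingface-downloader saved the GGUF file, and the API shown is llama-cpp-python's Llama class):

./run.sh $(./autotag llama_cpp) /bin/bash -c 'python3 <<EOF
from llama_cpp import Llama

# placeholder path -- point this at your downloaded GGUF model
llm = Llama(model_path="/data/models/llama-2-7b.Q4_K_S.gguf",
            n_gpu_layers=999, n_ctx=192)

out = llm("Once upon a time,", max_tokens=128)
print(out["choices"][0]["text"])
EOF'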

Memory Usage

  Model                        Quantization               Memory (MB)
  TheBloke/Llama-2-7B-GGUF     llama-2-7b.Q4_K_S.gguf     5,268
  TheBloke/Llama-2-13B-GGUF    llama-2-13b.Q4_K_S.gguf    8,609
  TheBloke/LLaMA-30b-GGUF      llama-30b.Q4_K_S.gguf      19,045
  TheBloke/Llama-2-70B-GGUF    llama-2-70b.Q4_K_S.gguf    37,655