We are unlocking the power of large language models. Our latest version of Llama is now accessible to individuals, creators, researchers, and businesses of all sizes so that they can experiment, innovate, and scale their ideas responsibly.
This release includes model weights and starting code for pre-trained and instruction-tuned Llama 3 language models, including sizes of 8B to 70B parameters.
This repository is intended as a minimal example to load Llama 3 models and run inference. For more detailed examples, see llama-recipes.
In order to download the model weights and tokenizer, please visit the Meta Llama website and accept our License.
Once your request is approved, you will receive a signed URL over email. Then run the download.sh script, passing the URL provided when prompted to start the download.
Pre-requisites: Make sure you have wget and md5sum installed. Then run the script: ./download.sh.
Keep in mind that the links expire after 24 hours and a certain number of downloads. If you start seeing errors such as 403: Forbidden, you can always re-request a link.
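Since md5sum is listed as a prerequisite, the script presumably verifies the downloaded files against checksums. As a rough illustration of that kind of check (not the script's actual logic), the sketch below computes an MD5 digest in Python and compares it with an expected value. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large weight files never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: Path, expected_md5: str) -> bool:
    """Compare a file's digest with an expected hex string, ignoring case."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

A mismatch usually points to an interrupted or corrupted download, in which case re-running the download for that file is the simplest fix.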
We are also providing downloads on Hugging Face, in both transformers and native llama3 formats. To download the weights from Hugging Face, please follow these steps:
- Visit one of the repos, for example meta-llama/Meta-Llama-3-8B-Instruct.
- Read and accept the license. Once your request is approved, you'll be granted access to all the Llama 3 models. Note that requests used to take up to one hour to be processed.
- To download the original native weights to use with this repo, click on the "Files and versions" tab and download the contents of the original folder. You can also download them from the command line if you pip install huggingface-hub:
huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct --include "original/*" --local-dir meta-llama/Meta-Llama-3-8B-Instruct
- To use with transformers, the following pipeline snippet will download and cache the weights:
import torch
import transformers

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

# Downloads and caches the weights on first use.
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cuda",
)
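The huggingface-cli download command in the steps above can also be expressed in Python. The sketch below is one possible equivalent, assuming the huggingface-hub package is installed and that you are already logged in with an account that has been granted access to the gated repo. The two helper function names are hypothetical; snapshot_download and its allow_patterns and local_dir parameters come from huggingface_hub.

```python
def original_weights_request(repo_id: str) -> dict:
    """Build keyword arguments that mirror the CLI call in the steps above."""
    return {
        "repo_id": repo_id,
        "allow_patterns": ["original/*"],  # same filter as --include "original/*"
        "local_dir": repo_id,              # same target as --local-dir
    }


def download_original_weights(repo_id: str) -> str:
    """Download only the native-format weights and return the local folder path."""
    # Imported lazily so this module can be loaded without huggingface-hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(**original_weights_request(repo_id))
```

Calling download_original_weights("meta-llama/Meta-Llama-3-8B-Instruct") would then fetch the contents of the original folder, provided the license has been accepted for that repo.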
You can follow the steps below to quickly get up and running with Llama 3 models. These steps will let you run quick inference locally. For more examples, see the Llama recipes repository.
- In a conda environment with PyTorch and CUDA available, clone and download this repository.
- In the top-level directory, run:
  pip install -e .
- Visit the Meta Llama website and register to download the model(s).
- Once registered, you will get an email with a URL to download the models. You will need this URL when you run the download.sh script.
- Once you get the email, navigate to your downloaded llama repository and run the download.sh script.
  - Make sure to grant execution permissions to the download.sh script.
  - During this process, you will be prompted to enter the URL from the email.
  - Do not use the “Copy Link” option; copy the link manually from the email instead.
- Once the model(s) you want have been downloaded, you can run the model locally using the command below:
torchrun --nproc_per_node 1 example_chat_completion.py \
--ckpt_dir Meta-Llama-3-8B-Instruct/ \
--tokenizer_path Meta-Llama-3-8B-Instruct/tokenizer.model \
--max_seq_len 512 --max_batch_size 6
Note
- Replace Meta-Llama-3-8B-Instruct/ with the path to your checkpoint directory and Meta-Llama-3-8B-Instruct/tokenizer.model with the path to your tokenizer model.
- Set --nproc_per_node to the MP value for the model you are using.
- Adjust the max_seq_len and max_batch_size parameters as needed.
- This example runs the example_chat_completion.py found in this repository, but you can change that to a different .py file.
Different models require different model-parallel (MP) values:
| Model | MP |
|---|---|
| 8B | 1 |
| 70B | 8 |
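To make the link between the MP value and --nproc_per_node concrete, here is a small sketch that assembles the torchrun command from the table above. The helper name, the dictionary, and the 70B checkpoint paths are illustrative assumptions; only the flags and MP values come from this README.

```python
import shlex

# MP values copied from the table above.
MODEL_PARALLEL = {"8B": 1, "70B": 8}


def build_torchrun_command(model_size, script, ckpt_dir, tokenizer_path,
                           max_seq_len=512, max_batch_size=6):
    """Return the torchrun invocation as a list of arguments."""
    nproc = MODEL_PARALLEL[model_size]  # one process per model-parallel shard
    return [
        "torchrun", "--nproc_per_node", str(nproc), script,
        "--ckpt_dir", ckpt_dir,
        "--tokenizer_path", tokenizer_path,
        "--max_seq_len", str(max_seq_len),
        "--max_batch_size", str(max_batch_size),
    ]


# Hypothetical local paths for a 70B checkpoint.
command = build_torchrun_command(
    "70B",
    "example_chat_completion.py",
    "Meta-Llama-3-70B-Instruct/",
    "Meta-Llama-3-70B-Instruct/tokenizer.model",
)
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run, or printed and pasted into a shell.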
All models support sequence lengths up to 8192 tokens, but we pre-allocate the cache according to the max_seq_len and max_batch_size values, so set those according to your hardware.
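To get a feel for how these two parameters drive memory use, the sketch below estimates the size of the pre-allocated cache. It rests on several assumptions that are not stated in this README: one key tensor and one value tensor per layer, each of shape (max_batch_size, max_seq_len, n_kv_heads, head_dim), stored in 2-byte bfloat16, and an 8B configuration of 32 layers, 8 key/value heads, and a head dimension of 128. Check the params.json shipped with your checkpoint before relying on the numbers.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, max_seq_len, max_batch_size,
                   bytes_per_value=2):
    """Estimate total bytes for the pre-allocated key and value caches."""
    per_layer = max_batch_size * max_seq_len * n_kv_heads * head_dim
    # Factor 2 accounts for storing both keys and values in every layer.
    return 2 * n_layers * per_layer * bytes_per_value


# Assumed 8B configuration with the settings from the chat example above.
size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                      max_seq_len=512, max_batch_size=6)
print(f"{size / 2**20:.0f} MiB")  # prints "384 MiB"
```

Under the same assumptions, raising max_seq_len to the full 8192 tokens multiplies the estimate by 16, to roughly 6 GiB, on top of the model weights themselves.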
Llama 2 fork for running inference on Mac M1/M2 (MPS) devices
This is a fork of https://github.com/facebookresearch/llama that runs on Apple M2 (MPS - Metal Performance Shaders).
Note: you need to set the PYTORCH_ENABLE_MPS_FALLBACK=1 environment variable to run this code. This is a workaround for the unsupported 'aten::polar.out' operator.
So the command to run example_text_completion.py will look like this:
PYTORCH_ENABLE_MPS_FALLBACK=1 torchrun --nproc_per_node 1 example_text_completion.py \
--ckpt_dir Meta-Llama-3-8B/ \
--tokenizer_path Meta-Llama-3-8B/tokenizer.model \
--max_seq_len 128 --max_batch_size 4
This now uses the gloo backend instead of nccl when the nccl backend is not available.
You can also check which device is available:
import torch

# Prefer Apple's MPS backend, then CUDA, then fall back to the CPU.
if torch.backends.mps.is_available():
    device = torch.device('mps')
elif torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')
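The device check and the backend choice can be combined into one helper. The sketch below is an assumption about how that might look, not the fork's actual code: the function name is hypothetical, and it takes plain booleans so it can be read and tried without PyTorch installed. It also sets the fallback variable from Python, on the assumption that this happens before torch is imported.

```python
import os
from typing import Tuple

# Assumption: set this before importing torch so the MPS fallback is picked up.
os.environ.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")


def pick_device_and_backend(mps_available: bool, cuda_available: bool) -> Tuple[str, str]:
    """Return a (device, distributed backend) pair in the same priority order as above."""
    if mps_available:
        return "mps", "gloo"   # nccl is CUDA-only, so use gloo on Apple hardware
    if cuda_available:
        return "cuda", "nccl"
    return "cpu", "gloo"
```

With PyTorch installed, the arguments would come from torch.backends.mps.is_available() and torch.cuda.is_available(), and the returned backend name could be passed to torch.distributed.init_process_group.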
The pre-trained models are not fine-tuned for chat or Q&A. They should be prompted so that the expected answer is the natural continuation of the prompt.
See example_text_completion.py for some examples. To illustrate, see the command below to run it with the llama-3-8b model (nproc_per_node needs to be set to the MP value):
torchrun --nproc_per_node 1 example_text_completion.py \
--ckpt_dir Meta-Llama-3-8B/ \
--tokenizer_path Meta-Llama-3-8B/tokenizer.model \
--max_seq_len 128 --max_batch_size 4
The fine-tuned models were trained for dialogue applications. To get the expected features and performance for them, a specific formatting defined in ChatFormat needs to be followed: The prompt begins with a <|begin_of_text|> special token, after which one or more messages follow. Each message starts with the <|start_header_id|> tag, the role system, user or assistant, and the <|end_header_id|> tag. After a double newline \n\n the contents of the message follow. The end of each message is marked by the <|eot_id|> token.
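The format described above can be sketched as plain string assembly. This is only an illustration: the function name is hypothetical, and the real ChatFormat works on token IDs rather than text. The trailing open assistant header, which cues the model to write its reply, is an assumption beyond what the paragraph above states.

```python
from typing import Dict, List


def format_dialog(messages: List[Dict[str, str]]) -> str:
    """Render a list of {'role', 'content'} messages as a Llama 3 chat prompt string."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        # Header with the role, then a double newline, then the message body.
        parts.append(f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n")
        parts.append(f"{message['content']}<|eot_id|>")
    # Assumption: leave an assistant header open so the model continues from here.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


prompt = format_dialog([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
])
```

When using example_chat_completion.py or the transformers pipeline, this formatting is normally applied for you, so building the string by hand is mainly useful for understanding or debugging prompts.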
You can also deploy additional classifiers for filtering out inputs and outputs that are deemed unsafe. See the llama-recipes repo for an example of how to add a safety checker to the inputs and outputs of your inference code.
Examples using llama-3-8b-chat:
torchrun --nproc_per_node 1 example_chat_completion.py \
--ckpt_dir Meta-Llama-3-8B-Instruct/ \
--tokenizer_path Meta-Llama-3-8B-Instruct/tokenizer.model \
--max_seq_len 512 --max_batch_size 6
Llama 3 is a new technology that carries potential risks with use. Testing conducted to date has not — and could not — cover all scenarios. In order to help developers address these risks, we have created the Responsible Use Guide.
Please report any software bugs or other problems with the models through one of the following means:
- Reporting issues with the model: https://github.com/meta-llama/llama3/issues
- Reporting risky content generated by the model: developers.facebook.com/llama_output_feedback
- Reporting bugs and security concerns: facebook.com/whitehat/info
See MODEL_CARD.md.
Our model and weights are licensed for both researchers and commercial entities, upholding the principles of openness. Our mission is to empower individuals and industry through this opportunity, while fostering an environment of discovery and ethical AI advancements.
See the LICENSE file, as well as our accompanying Acceptable Use Policy.
For common questions, see the FAQ, which will be kept up to date as new questions arise.