Fix package naming (#861)
init27 authored Jan 22, 2025
2 parents aea548c + 8e92023 commit 69b9ca1
Showing 72 changed files with 229 additions and 232 deletions.
2 changes: 1 addition & 1 deletion getting-started/README.md
@@ -6,4 +6,4 @@ If you are new to developing with Meta Llama models, this is where you should st
* The [Prompt_Engineering_with_Llama](./Prompt_Engineering_with_Llama.ipynb) notebook showcases the various ways to elicit appropriate outputs from Llama. Take this notebook for a spin to get a feel for how Llama responds to different inputs and generation parameters.
* The [inference](./inference/) folder contains scripts to deploy Llama for inference on server and mobile. See also [3p_integrations/vllm](../3p-integrations/vllm/) and [3p_integrations/tgi](../3p-integrations/tgi/) for hosting Llama on open-source model servers.
* The [RAG](./RAG/) folder contains a simple Retrieval-Augmented Generation application using Llama.
* The [finetuning](./finetuning/) folder contains resources to help you finetune Llama on your custom datasets, for both single- and multi-GPU setups. The scripts use the native llama-recipes finetuning code found in [finetuning.py](../src/llama_recipes/finetuning.py) which supports these features:
* The [finetuning](./finetuning/) folder contains resources to help you finetune Llama on your custom datasets, for both single- and multi-GPU setups. The scripts use the native llama-cookbook finetuning code found in [finetuning.py](../src/llama_cookbook/finetuning.py) which supports these features:
14 changes: 7 additions & 7 deletions getting-started/finetuning/README.md
@@ -6,7 +6,7 @@ This folder contains instructions to fine-tune Meta Llama 3 on a
* [single-GPU setup](./singlegpu_finetuning.md)
* [multi-GPU setup](./multigpu_finetuning.md)

using the canonical [finetuning script](../../src/llama_recipes/finetuning.py) in the llama-recipes package.
using the canonical [finetuning script](../../src/llama_cookbook/finetuning.py) in the llama-cookbook package.

If you are new to fine-tuning techniques, check out [an overview](./LLM_finetuning_overview.md).

@@ -17,10 +17,10 @@ If you are new to fine-tuning techniques, check out [an overview](./LLM_finetuni
## How to configure finetuning settings?

> [!TIP]
> All the settings defined in [config files](../../src/llama_recipes/configs/) can be passed as args through the CLI when running the script; there is no need to change the config files directly.
> All the settings defined in [config files](../../src/llama_cookbook/configs/) can be passed as args through the CLI when running the script; there is no need to change the config files directly.

* The [Training config file](../../src/llama_recipes/configs/training.py) is the main config file that specifies the settings for our run; it can be found in the [configs folder](../../src/llama_recipes/configs/)
* The [Training config file](../../src/llama_cookbook/configs/training.py) is the main config file that specifies the settings for our run; it can be found in the [configs folder](../../src/llama_cookbook/configs/)

It lets us specify the training settings for everything from `model_name` to `dataset_name`, `batch_size` and so on. Below is the list of supported settings:

@@ -71,11 +71,11 @@ It lets us specify the training settings for everything from `model_name` to `da


* [Datasets config file](../../src/llama_recipes/configs/datasets.py) provides the available options for datasets.
* [Datasets config file](../../src/llama_cookbook/configs/datasets.py) provides the available options for datasets.

* The [peft config file](../../src/llama_recipes/configs/peft.py) provides the supported PEFT methods and their respective settings, which can be modified. We currently support LoRA and Llama-Adapter. Please note that LoRA is the only technique supported in combination with FSDP.
* The [peft config file](../../src/llama_cookbook/configs/peft.py) provides the supported PEFT methods and their respective settings, which can be modified. We currently support LoRA and Llama-Adapter. Please note that LoRA is the only technique supported in combination with FSDP.

* [FSDP config file](../../src/llama_recipes/configs/fsdp.py) provides FSDP settings such as:
* [FSDP config file](../../src/llama_cookbook/configs/fsdp.py) provides FSDP settings such as:

* `mixed_precision` boolean flag to specify using mixed precision, defaults to true.

@@ -102,7 +102,7 @@ It lets us specify the training settings for everything from `model_name` to `da
You can enable [W&B](https://wandb.ai/) experiment tracking by using the `use_wandb` flag as shown below. You can change the project name, entity, and other `wandb.init` arguments in `wandb_config`.

```bash
python -m llama_recipes.finetuning --use_peft --peft_method lora --quantization 8bit --model_name /path_of_model_folder/8B --output_dir Path/to/save/PEFT/model --use_wandb
python -m llama_cookbook.finetuning --use_peft --peft_method lora --quantization 8bit --model_name /path_of_model_folder/8B --output_dir Path/to/save/PEFT/model --use_wandb
```
You'll be able to access a dedicated project or run link on [wandb.ai](https://wandb.ai) and see your dashboard like the one below.
<div style="display: flex;">
28 changes: 14 additions & 14 deletions getting-started/finetuning/datasets/README.md
@@ -1,14 +1,14 @@
# Datasets and Evaluation Metrics

The provided fine-tuning scripts allow you to select between three datasets by passing the `dataset` arg to the `llama_recipes.finetuning` module or the [`recipes/quickstart/finetuning/finetuning.py`](../finetuning.py) script. The current options are `grammar_dataset`, `alpaca_dataset` and `samsum_dataset`. Additionally, we integrate the OpenAssistant/oasst1 dataset as an [example for a custom dataset](custom_dataset.py). Note: use of any of these datasets should be in compliance with the dataset's underlying license (including but not limited to non-commercial uses).
The provided fine-tuning scripts allow you to select between three datasets by passing the `dataset` arg to the `llama_cookbook.finetuning` module or the [`recipes/quickstart/finetuning/finetuning.py`](../finetuning.py) script. The current options are `grammar_dataset`, `alpaca_dataset` and `samsum_dataset`. Additionally, we integrate the OpenAssistant/oasst1 dataset as an [example for a custom dataset](custom_dataset.py). Note: use of any of these datasets should be in compliance with the dataset's underlying license (including but not limited to non-commercial uses).

* [grammar_dataset](https://huggingface.co/datasets/jfleg) contains 150K pairs of English sentences and possible corrections.
* [alpaca_dataset](https://github.com/tatsu-lab/stanford_alpaca) provides 52K instruction-response pairs as generated by `text-davinci-003`.
* [samsum_dataset](https://huggingface.co/datasets/samsum) contains about 16k messenger-like conversations with summaries.
* [OpenAssistant/oasst1](https://huggingface.co/datasets/OpenAssistant/oasst1/) contains about 88k messages from assistant-style conversations.

## Batching Strategies
Llama-recipes supports two strategies to batch requests together.
Llama-cookbook supports two strategies to batch requests together.
The default setting is `packing`, which concatenates the tokenized samples into long sequences filling up the context length of the model.
This is the most compute-efficient variant as it avoids any padding and all sequences have the same length.
Samples at the boundary of the context length are truncated, and the remainder of the cut sequence is used as the start of the next long sequence.
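A minimal sketch of the packing idea is shown below, for illustration only; the function and variable names are hypothetical and this is not the llama-cookbook implementation.

```python
# Illustrative sketch of packing: concatenate token id lists and cut them into
# fixed-length blocks. A sample cut at a block boundary simply continues at the
# start of the next block, so no padding tokens are needed.
def pack_sequences(tokenized_samples, context_length):
    buffer = [tok for sample in tokenized_samples for tok in sample]
    n_blocks = len(buffer) // context_length
    return [
        buffer[i * context_length:(i + 1) * context_length]
        for i in range(n_blocks)
    ]

# Example: three short "samples" packed into blocks of length 4.
blocks = pack_sequences([[1, 2, 3], [4, 5], [6, 7, 8, 9]], context_length=4)
print(blocks)  # [[1, 2, 3, 4], [5, 6, 7, 8]] -- the trailing 9 is left over
```

Because every block holds exactly `context_length` tokens, no padding is needed and every batch does the same amount of work.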
@@ -21,45 +21,45 @@ The batching strategy can be selected through the command line parameter `--batch

## Using custom datasets

The list of available datasets in llama-recipes is intended to give users a quick start on training their Llama model.
The list of available datasets in llama-cookbook is intended to give users a quick start on training their Llama model.
There are two ways to use a custom dataset.
The first is to provide a function that returns the dataset in a .py file, which can be given to the command line tool.
This does not involve changing the source code of llama-recipes.
The second way targets contributions that extend llama-recipes, as it involves changing the source code.
This does not involve changing the source code of llama-cookbook.
The second way targets contributions that extend llama-cookbook, as it involves changing the source code.

### Training on custom data
To supply a custom dataset you need to provide a single .py file that contains a function with the following signature:
```python
def get_custom_dataset(dataset_config, tokenizer, split: str):
```
For an example of `get_custom_dataset` you can look at the provided datasets in llama_recipes.datasets or [custom_dataset.py](./custom_dataset.py).
The `dataset_config` in the above signature will be an instance of llama_recipes.configs.dataset.custom_dataset with the modifications made through the command line.
For an example of `get_custom_dataset` you can look at the provided datasets in llama_cookbook.datasets or [custom_dataset.py](./custom_dataset.py).
The `dataset_config` in the above signature will be an instance of llama_cookbook.configs.dataset.custom_dataset with the modifications made through the command line.
The split signals whether to return the training or validation dataset.
The default function name is `get_custom_dataset` but this can be changed as described below.
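As a rough illustration only (the dataset name and the `text` column below are placeholders, and this is not the bundled [custom_dataset.py](./custom_dataset.py)), a minimal `get_custom_dataset` could look like this:

```python
# Minimal sketch of a custom dataset function. The dataset name and the "text"
# column are placeholders; adapt them to your own data.
from datasets import load_dataset

def get_custom_dataset(dataset_config, tokenizer, split: str):
    # `split` is the train or validation split name passed in by the trainer.
    dataset = load_dataset("yelp_review_full", split=split)

    def tokenize(sample):
        tokens = tokenizer(sample["text"], truncation=True, max_length=512)
        # For causal LM finetuning the labels typically mirror the input ids.
        tokens["labels"] = tokens["input_ids"].copy()
        return tokens

    return dataset.map(tokenize, remove_columns=dataset.column_names)
```

Returning tokenized samples that already carry a `labels` field keeps them directly consumable by the training loop.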

To start a training run with the custom dataset we need to set the `--dataset` as well as the `--custom_dataset.file` parameter.
```
python -m llama_recipes.finetuning --dataset "custom_dataset" --custom_dataset.file "custom_dataset.py" [TRAINING PARAMETERS]
python -m llama_cookbook.finetuning --dataset "custom_dataset" --custom_dataset.file "custom_dataset.py" [TRAINING PARAMETERS]
```
To change the function name that is used in the .py file you can append the name following a `:` like this:
```
python -m llama_recipes.finetuning --dataset "custom_dataset" --custom_dataset.file "custom_dataset.py:get_foo" [TRAINING PARAMETERS]
python -m llama_cookbook.finetuning --dataset "custom_dataset" --custom_dataset.file "custom_dataset.py:get_foo" [TRAINING PARAMETERS]
```
This will call the function `get_foo` instead of `get_custom_dataset` when retrieving the dataset.

### Adding a new dataset
Each dataset has a corresponding configuration (dataclass) in [configs/datasets.py](../../../src/llama_recipes/configs/datasets.py) which contains the dataset name, training/validation split names, as well as optional parameters like datafiles etc.
Each dataset has a corresponding configuration (dataclass) in [configs/datasets.py](../../../src/llama_cookbook/configs/datasets.py) which contains the dataset name, training/validation split names, as well as optional parameters like datafiles etc.

Additionally, there is a preprocessing function for each dataset in the [datasets](../../../src/llama_recipes/datasets) folder.
Additionally, there is a preprocessing function for each dataset in the [datasets](../../../src/llama_cookbook/datasets) folder.
The data returned by the dataset needs to be consumable by the forward method of the fine-tuned model when calling ```model(**data)```.
For CausalLM models this usually means that the data needs to be in the form of a dictionary with "input_ids", "attention_mask" and "labels" fields.
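The snippet below sketches that batch format and the `model(**data)` call; it is illustrative only and uses `sshleifer/tiny-gpt2`, a small public model that is not part of llama-cookbook, just to keep the example lightweight.

```python
# Hedged sketch of the expected batch format for a CausalLM model.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sshleifer/tiny-gpt2")
model = AutoModelForCausalLM.from_pretrained("sshleifer/tiny-gpt2")

# Produces "input_ids" and "attention_mask"; we add "labels" ourselves.
data = tokenizer("Summarize: the meeting is at noon.", return_tensors="pt")
data["labels"] = data["input_ids"].clone()  # labels mirror input_ids for the causal LM loss

loss = model(**data).loss  # the call the finetuning loop relies on
print(loss)
```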

To add a custom dataset, the following steps need to be performed (a minimal sketch of these steps follows the list).

1. Create a dataset configuration following the schema described above. Examples can be found in [configs/datasets.py](../../../src/llama_recipes/configs/datasets.py).
1. Create a dataset configuration following the schema described above. Examples can be found in [configs/datasets.py](../../../src/llama_cookbook/configs/datasets.py).
2. Create a preprocessing routine that loads the data and returns a PyTorch-style dataset. The signature of the preprocessing function needs to be (dataset_config, tokenizer, split_name), where split_name will be the string for the train/validation split as defined in the dataclass.
3. Register the dataset name and preprocessing function by inserting them as key and value into the DATASET_PREPROC dictionary in [datasets/__init__.py](../../../src/llama_recipes/datasets/__init__.py)
4. Set the dataset field in the training config to the dataset name, or use the --dataset option of the `llama_recipes.finetuning` module or the examples/finetuning.py training script.
3. Register the dataset name and preprocessing function by inserting them as key and value into the DATASET_PREPROC dictionary in [datasets/__init__.py](../../../src/llama_cookbook/datasets/__init__.py)
4. Set the dataset field in the training config to the dataset name, or use the --dataset option of the `llama_cookbook.finetuning` module or the examples/finetuning.py training script.
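The sketch below walks through steps 1 to 3 with a hypothetical `my_dataset`; the field names follow the pattern of the existing configs but should be verified against [configs/datasets.py](../../../src/llama_cookbook/configs/datasets.py).

```python
# Hedged sketch of steps 1-3 for a hypothetical dataset named "my_dataset".
from dataclasses import dataclass

# 1. Dataset configuration (would live in configs/datasets.py).
@dataclass
class my_dataset:
    dataset: str = "my_dataset"
    train_split: str = "train"
    test_split: str = "validation"
    data_path: str = "data/my_dataset.json"  # optional extra parameter

# 2. Preprocessing routine (would live in the datasets folder).
def get_my_dataset(dataset_config, tokenizer, split_name):
    ...  # load dataset_config.data_path, tokenize, return a PyTorch-style dataset

# 3. Registration (the DATASET_PREPROC dictionary in datasets/__init__.py).
DATASET_PREPROC = {
    # ...existing entries...
    "my_dataset": get_my_dataset,
}
```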

## Application
Below we list other datasets that can be used for fine-tuning, along with their main use cases.
12 changes: 6 additions & 6 deletions getting-started/finetuning/multigpu_finetuning.md
@@ -3,14 +3,14 @@ This recipe steps you through how to finetune a Meta Llama 3 model on the text s


## Requirements
Ensure that you have installed the llama-recipes package ([details](../../README.md#installing)).
Ensure that you have installed the llama-cookbook package ([details](../../README.md#installing)).

We will also need 2 packages:
1. [PEFT](https://github.com/huggingface/peft) to use parameter-efficient finetuning.
2. [FSDP](https://pytorch.org/tutorials/intermediate/FSDP_adavnced_tutorial.html) which helps us parallelize the training over multiple GPUs. [More details](./LLM_finetuning_overview.md#2-full-partial-parameter-finetuning).

> [!NOTE]
> The llama-recipes package will install PyTorch 2.0.1. In case you want to use FSDP with PEFT for multi-GPU finetuning, please install the PyTorch nightlies ([details](../../README.md#pytorch-nightlies))
> The llama-cookbook package will install PyTorch 2.0.1. In case you want to use FSDP with PEFT for multi-GPU finetuning, please install the PyTorch nightlies ([details](../../README.md#pytorch-nightlies))
>
> INT8 quantization is not currently supported in FSDP
@@ -96,14 +96,14 @@ srun torchrun --nproc_per_node 8 --rdzv_id $RANDOM --rdzv_backend c10d --rdzv_e
Do not forget to adjust the number of nodes, ntasks, and gpus-per-task at the top.

## Running with different datasets
Currently, 3 open-source datasets are supported; they can be found in the [Datasets config file](../../src/llama_recipes/configs/datasets.py). You can also use your custom dataset (more info [here](./datasets/README.md)).
Currently, 3 open-source datasets are supported; they can be found in the [Datasets config file](../../src/llama_cookbook/configs/datasets.py). You can also use your custom dataset (more info [here](./datasets/README.md)).

* `grammar_dataset` : use this [notebook](../../src/llama_recipes/datasets/grammar_dataset/grammar_dataset_process.ipynb) to pull and process the Jfleg and C4 200M datasets for grammar checking.
* `grammar_dataset` : use this [notebook](../../src/llama_cookbook/datasets/grammar_dataset/grammar_dataset_process.ipynb) to pull and process the Jfleg and C4 200M datasets for grammar checking.

* `alpaca_dataset` : to get this open-source data, please download `alpaca.json` to the `dataset` folder.

```bash
wget -P ../../src/llama_recipes/datasets https://raw.githubusercontent.com/tatsu-lab/stanford_alpaca/main/alpaca_data.json
wget -P ../../src/llama_cookbook/datasets https://raw.githubusercontent.com/tatsu-lab/stanford_alpaca/main/alpaca_data.json
```

* `samsum_dataset`
@@ -132,7 +132,7 @@ In case you are dealing with slower interconnect network between nodes, to reduc

HSDP (Hybrid Sharded Data Parallel) helps define a hybrid sharding strategy in which FSDP is applied within a `sharding_group_size` (which can be the minimum number of GPUs your model fits on) and DDP is used between the replicas of the model, as specified by `replica_group_size`.

This requires setting the sharding strategy in the [fsdp config](../../src/llama_recipes/configs/fsdp.py) to `ShardingStrategy.HYBRID_SHARD` and specifying two additional settings, `sharding_group_size` and `replica_group_size`: the former is the sharding group size, the number of GPUs your model must be sharded across to form one replica, and the latter is the replica group size, which is world_size/sharding_group_size.
This requires setting the sharding strategy in the [fsdp config](../../src/llama_cookbook/configs/fsdp.py) to `ShardingStrategy.HYBRID_SHARD` and specifying two additional settings, `sharding_group_size` and `replica_group_size`: the former is the sharding group size, the number of GPUs your model must be sharded across to form one replica, and the latter is the replica group size, which is world_size/sharding_group_size.
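For example, on 16 GPUs a `sharding_group_size` of 8 gives a `replica_group_size` of 2 (world_size/sharding_group_size). Below is a hedged sketch of the corresponding config fields; the names are taken from this README and should be verified against the actual [fsdp config](../../src/llama_cookbook/configs/fsdp.py).

```python
# Hedged sketch of the relevant FSDP config fields for HSDP on 16 GPUs.
from dataclasses import dataclass
from torch.distributed.fsdp import ShardingStrategy

@dataclass
class fsdp_config:
    sharding_strategy: ShardingStrategy = ShardingStrategy.HYBRID_SHARD
    sharding_group_size: int = 8   # GPUs across which one model replica is sharded
    replica_group_size: int = 2    # number of replicas, i.e. world_size // sharding_group_size
```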

10 changes: 5 additions & 5 deletions getting-started/finetuning/singlegpu_finetuning.md
@@ -1,12 +1,12 @@
# Fine-tuning with Single GPU
This recipe steps you through how to finetune a Meta Llama 3 model on the text summarization task using the [samsum](https://huggingface.co/datasets/samsum) dataset on a single GPU.

These are the instructions for using the canonical [finetuning script](../../src/llama_recipes/finetuning.py) in the llama-recipes package.
These are the instructions for using the canonical [finetuning script](../../src/llama_cookbook/finetuning.py) in the llama-cookbook package.


## Requirements

Ensure that you have installed the llama-recipes package.
Ensure that you have installed the llama-cookbook package.

To run fine-tuning on a single GPU, we will make use of two packages:
1. [PEFT](https://github.com/huggingface/peft) to use parameter-efficient finetuning.
@@ -33,15 +33,15 @@ The args used in the command above are:

### How to run with different datasets?

Currently, 3 open-source datasets are supported; they can be found in the [Datasets config file](../../src/llama_recipes/configs/datasets.py). You can also use your custom dataset (more info [here](./datasets/README.md)).
Currently, 3 open-source datasets are supported; they can be found in the [Datasets config file](../../src/llama_cookbook/configs/datasets.py). You can also use your custom dataset (more info [here](./datasets/README.md)).

* `grammar_dataset` : use this [notebook](../../src/llama_recipes/datasets/grammar_dataset/grammar_dataset_process.ipynb) to pull and process the Jfleg and C4 200M datasets for grammar checking.
* `grammar_dataset` : use this [notebook](../../src/llama_cookbook/datasets/grammar_dataset/grammar_dataset_process.ipynb) to pull and process the Jfleg and C4 200M datasets for grammar checking.

* `alpaca_dataset` : to get this open-source data, please download `alpaca.json` to the `dataset` folder.


```bash
wget -P ../../src/llama_recipes/datasets https://raw.githubusercontent.com/tatsu-lab/stanford_alpaca/main/alpaca_data.json
wget -P ../../src/llama_cookbook/datasets https://raw.githubusercontent.com/tatsu-lab/stanford_alpaca/main/alpaca_data.json
```

* `samsum_dataset`