Update axolotl image and other dependencies (#28)
* Remove environment key from CI yaml

* Update base image spec to axolotl 0.4.0

* Update deepspeed config location

* Remove redundant configuration flags from merge cmdline

* Disable debug mode in codellama config

* Try re-enabling mistral flash attention

* Revert some of the CI training overrides

* Don't truncate data

* Try a config without sample packing

* Don't pad to sequence length

* Reinstate CI data truncation

* Set base GPU config to use A100-40GB

* Remove sample packing and standardize batch / LR params for all models

* Standardize sequence_len for mistral

* Use consistent fractional val_set_size

* Disable quantization in llama config

* Fix CI val_set_size

* Try simple torch optimizer

* Try reverting deepspeed workaround

* Fix type annotation

* Add a step to assert that the evaluation loss is reasonable

* Fix run name

* Improve results table extraction

* Fix direction of loss assertion

* Don't call the remote data my_data

* Remove huggingface secret (it's not needed for these models)

* Bump huggingface util pins

* Update README
mwaskom authored Feb 9, 2024
1 parent 3442b1f commit 62cfb65
Showing 10 changed files with 116 additions and 89 deletions.
9 changes: 6 additions & 3 deletions .github/workflows/ci-cd.yml
@@ -4,7 +4,6 @@ on: pull_request

jobs:
  test:
-    environment: CI
    name: Test
    runs-on: ubuntu-latest
    strategy:
@@ -28,7 +27,7 @@ jobs:
      - name: Install Modal
        run: |
          python -m pip install --upgrade pip
-          pip install modal pyyaml
+          pip install modal pyyaml pandas
      - name: Prep config and data for CI
        run: |
@@ -39,4 +38,8 @@ jobs:
      - name: Run training job on Modal
        run: |
-          GPU_MEM=40 modal run src.train --config=config/${{ matrix.config }}.yml --data=data/sqlqa.jsonl
+          modal run src.train --config=config/${{ matrix.config }}.yml --data=data/sqlqa.jsonl
+      - name: Check training results
+        run: |
+          python ci/check_loss.py
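Side note on the dropped `GPU_MEM=40` prefix: since this commit makes A100-40GB the base GPU config, the CI-level override became redundant. A hypothetical sketch of how `src/train.py` might consume that variable (the file itself is not shown in this diff):

```python
import os

import modal

# Hypothetical sketch only: read the GPU memory override from the environment,
# defaulting to the A100-40GB base config set by this commit.
N_GPUS = int(os.environ.get("N_GPUS", 2))
GPU_MEM = int(os.environ.get("GPU_MEM", 40))
GPU_CONFIG = modal.gpu.A100(count=N_GPUS, memory=GPU_MEM)
```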
4 changes: 4 additions & 0 deletions .gitignore
@@ -158,3 +158,7 @@ cython_debug/
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/
+
+
+# Local file written by the training script
+.last_run_name
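The newly ignored `.last_run_name` file is the handoff between the training script and follow-up tooling such as `ci/check_loss.py` below. A minimal sketch of the write side (hypothetical; the training-script code is not part of this diff):

```python
from pathlib import Path

# Hypothetical sketch: record the most recent run name locally so that
# ci/check_loss.py can locate the run's output folder on the Volume.
run_name = "axo-2024-02-09-12-00-abcd"  # illustrative run tag
Path(".last_run_name").write_text(run_name)
```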
31 changes: 15 additions & 16 deletions README.md
@@ -39,7 +39,7 @@ cd llm-finetuning
```
3. Launch a training job:
```bash
-modal run --detach src.train --config=config/codellama.yml --data=data/sqlqa.jsonl
+modal run --detach src.train --config=config/mistral.yml --data=data/sqlqa.jsonl
```

4. Try the model from a completed training run. You can select a folder via `modal volume ls example-runs-vol`, and then specify the training folder with the `--run-folder` flag (something like `/runs/axo-2023-11-24-17-26-66e8`) for inference:
@@ -48,7 +48,7 @@ modal run --detach src.train --config=config/codellama.yml --data=data/sqlqa.jso
modal run -q src.inference --run-folder /runs/<run_tag>
```

-The default configuration fine-tunes CodeLlama Instruct 7B on a text-to-SQL dataset for five epochs (takes a few minutes) as a proof of concept. It uses DeepSpeed ZeRO-3 to shard the model state across 2 A100s. Inference on the fine-tuned model displays conformity to the output structure (`[SQL] ... [/SQL]`). To achieve better results, you would need to use more data! Refer to the full development section below.
+Our quickstart example trains a 7B model on a text-to-SQL dataset as a proof of concept (it takes just a few minutes). It uses DeepSpeed ZeRO-3 to shard the model state across 2 A100s. Inference on the fine-tuned model displays conformity to the output structure (`[SQL] ... [/SQL]`). To achieve better results, you would need to use more data! Refer to the full development section below.

5. (Optional) Launch the GUI for easy observability of training status.

@@ -76,18 +76,18 @@ The rest of the code are helpers for _calling_ these three functions. There are

### Config

-You can `example_configs` for quick start with different models. We recommend duplicating one to `src/config.yml` and modifying as you need. See an overview of Axolotl's config options [here](https://github.com/OpenAccess-AI-Collective/axolotl#config). The most important options to consider are:
+You can view some example configurations in `config` for a quick start with different models. See an overview of Axolotl's config options [here](https://github.com/OpenAccess-AI-Collective/axolotl#config). The most important options to consider are:

**Model**
```yaml
-base_model: codellama/CodeLlama-7b-Instruct-hf
+base_model: mistralai/Mistral-7B-v0.1
```
-**Dataset** (by default we upload a local .jsonl file from the `src` folder, but you can see all dataset options [here](https://github.com/OpenAccess-AI-Collective/axolotl#dataset))
+**Dataset** (You can see all dataset options [here](https://github.com/OpenAccess-AI-Collective/axolotl#dataset))
```yaml
datasets:
  # This will be the path used for the data when it is saved to the Volume in the cloud.
-  - path: my_data.jsonl
+  - path: data.jsonl
    ds_type: json
    type:
      # JSONL file contains question, context, answer fields per line.
@@ -104,31 +104,31 @@
**LoRA**
```yaml
-adapter: lora # for qlora, or leave blank for full finetune
+adapter: lora # for qlora, or leave blank for full finetune (requires much more GPU memory!)
lora_r: 16
lora_alpha: 32 # alpha = 2 x rank is a good rule of thumb.
lora_dropout: 0.05
lora_target_linear: true # target all linear layers
```
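To make the `alpha = 2 x rank` rule of thumb concrete: the LoRA update added to a weight matrix is scaled by `lora_alpha / lora_r`, so these settings keep that multiplier at 2 even if the rank changes. An illustrative sketch (not code from this repo):

```python
# LoRA applies W_eff = W + (lora_alpha / lora_r) * (B @ A),
# where A is (r x k) and B is (d x r). With alpha = 2 x rank,
# the scaling factor stays constant when you retune the rank.
lora_r, lora_alpha = 16, 32
scaling = lora_alpha / lora_r
assert scaling == 2.0
```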
### Custom Dataset
-Axolotl supports many dataset formats ([see more](https://github.com/OpenAccess-AI-Collective/axolotl#dataset)). We recommend adding your custom dataset as a .jsonl file in the `src` folder and making the appropriate modifications to your config.
+Axolotl supports many dataset formats ([see more](https://github.com/OpenAccess-AI-Collective/axolotl#dataset)). We recommend adding your custom dataset as a .jsonl file in the `data` folder and making the appropriate modifications to your config.
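For illustration, one line of such a .jsonl file might look like this (field contents are hypothetical; only the question/context/answer field names come from the configs above):

```json
{"question": "How many users signed up in 2023?", "context": "CREATE TABLE users (id INT, signup_date DATE)", "answer": "SELECT COUNT(*) FROM users WHERE YEAR(signup_date) = 2023;"}
```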

**Multi-GPU training**

We recommend [DeepSpeed](https://github.com/microsoft/DeepSpeed) for multi-GPU training, which is easy to set up. Axolotl provides several default deepspeed JSON [configurations](https://github.com/OpenAccess-AI-Collective/axolotl/tree/main/deepspeed) and Modal makes it easy to [attach multiple GPUs](https://modal.com/docs/guide/gpu#gpu-acceleration) of any type in code, so all you need to do is specify which of these configs you'd like to use.

In your `config.yml`:
```yaml
-deepspeed: /root/axolotl/deepspeed/zero3.json
+deepspeed: /root/axolotl/deepspeed_configs/zero3_bf16.json
```

In `train.py`:
```python
N_GPUS = 2
-GPU_MEM = 80
+GPU_MEM = 40
GPU_CONFIG = modal.gpu.A100(count=N_GPUS, memory=GPU_MEM) # you can also change this to use A10Gs or T4s
```
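As the comment suggests, swapping GPU types is a one-line change, e.g. (a sketch using Modal's `modal.gpu.A10G` class):

```python
GPU_CONFIG = modal.gpu.A10G(count=N_GPUS)  # cheaper, but slower than A100s
```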

**Logging with Weights and Biases**
@@ -161,16 +161,15 @@ The script reads two local files containing the config information and the datas

When you make local changes to either your config or data, they will be used for your next training run.

-The default configuration fine-tunes CodeLlama Instruct 7B to understand Modal documentation for five epochs as a proof of concept. It uses DeepSpeed ZeRO-3 to shard the model state across 2 A100s. To achieve better results, you would need to use more data and train for more epochs.

**Inference**

To try a model from a completed run, you can select a folder via `modal volume ls examples-runs-vol`, and then specify the training folder for inference:

```bash
-modal run -q src.inference::inference_main --run-folder /runs/axo-2023-11-24-17-26-66e8
+modal run -q src.inference::inference_main --run-folder=...
```

+The training script writes the most recent run name to a local file, `.last_run_name`, for easy access.

## Using the GUI

31 changes: 31 additions & 0 deletions ci/check_loss.py
@@ -0,0 +1,31 @@
+from io import StringIO
+import re
+import sys
+
+import pandas as pd
+
+from modal import Volume
+
+
+if __name__ == "__main__":
+
+    with open(".last_run_name", "r") as f:
+        run_name = f.read().strip()
+
+    vol = Volume.lookup("example-runs-vol")
+    contents = b""
+    for chunk in vol.read_file(f"{run_name}/lora-out/README.md"):
+        contents += chunk
+
+    m = re.search(r"### Training results\n\n(.+?)#", contents.decode(), flags=re.DOTALL)
+    if m is None:
+        sys.exit("Could not parse training results from model card")
+    else:
+        results_text = m.group(1).strip().replace(" ", "")
+
+    results = pd.read_table(StringIO(results_text), sep="|")
+    train_loss = float(results["TrainingLoss"].iloc[-1])
+    val_loss = float(results["ValidationLoss"].iloc[-1])
+
+    print(f"Loss: {train_loss:.2f} (training), {val_loss:.2f} (validation)")
+    sys.exit(val_loss > 0.25)  # Arbitrary threshold
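For reference, the regex and pandas parsing above expect the auto-generated model card to contain a pipe table under a `### Training results` heading, roughly like this (values are illustrative; the exact column set, e.g. Epoch and Step, follows the usual Hugging Face trainer model card and is an assumption here):

```markdown
### Training results

| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:-----:|:----:|:---------------:|
| 0.3021        | 1.0   | 29   | 0.2146          |
```

After `.replace(" ", "")`, the headers collapse to `TrainingLoss` and `ValidationLoss`, which is exactly what the column lookups assume.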
15 changes: 7 additions & 8 deletions ci/prep_for_ci.py
@@ -6,20 +6,19 @@
@click.option("--config")
@click.option("--data")
def main(config: str, data: str):
-    """Set the config for lighter-weight training and truncate the dataset."""
+    """Set the config to train for only one epoch and truncate the dataset."""
+    train_set_size = 1000
+    val_set_size = 64
    with open(config) as f:
        cfg = yaml.safe_load(f.read())
-    cfg["sequence_len"] = 1024
-    cfg["val_set_size"] = 100
-    cfg["eval_batch_size"] = 2
-    cfg["micro_batch_size"] = 2
-    cfg["num_epochs"] = 2
-    cfg.pop("eval_steps", None)
+    cfg["val_set_size"] = val_set_size
+    cfg["num_epochs"] = 1
+    cfg.pop("eval_steps", None)  # Evaluate once at the end of the epoch
    with open(config, "w") as f:
        yaml.dump(cfg, f)

    with open(data) as f:
-        data_truncated = f.readlines()[:1000]
+        data_truncated = f.readlines()[: train_set_size + val_set_size]
    with open(data, "w") as f:
        f.writelines(data_truncated)
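The "Prep config and data for CI" workflow step (elided in the diff above) presumably invokes this script along these lines (hypothetical command; the paths match those used in the training step):

```bash
python ci/prep_for_ci.py --config=config/mistral.yml --data=data/sqlqa.jsonl
```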

20 changes: 10 additions & 10 deletions config/codellama.yml
@@ -9,7 +9,7 @@ strict: false

datasets:
  # This will be the path used for the data when it is saved to the Volume in the cloud.
-  - path: my_data.jsonl
+  - path: data.jsonl
    ds_type: json
    type:
      # JSONL file contains question, context, answer fields per line.
@@ -28,9 +28,9 @@ val_set_size: 0.05
output_dir: ./lora-out

sequence_len: 4096
-sample_packing: true
+sample_packing: false
eval_sample_packing: false
-pad_to_sequence_len: true
+pad_to_sequence_len: false

adapter: lora
lora_model_dir:
@@ -46,15 +46,15 @@ wandb_watch:
wandb_run_id:

gradient_accumulation_steps: 1
-micro_batch_size: 16
-num_epochs: 5
-optimizer: adamw_bnb_8bit
+micro_batch_size: 32
+num_epochs: 4
+optimizer: adamw_torch
lr_scheduler: cosine
-learning_rate: 0.0002
+learning_rate: 0.0001

train_on_inputs: false
group_by_length: false
-bf16: true
+bf16: auto
fp16: false
tf32: false

@@ -70,8 +70,8 @@ flash_attention: true
warmup_steps: 10
eval_steps: 0.05
save_steps:
-debug: True
-deepspeed: /root/axolotl/deepspeed/zero3.json
+debug:
+deepspeed: /root/axolotl/deepspeed_configs/zero3_bf16.json
weight_decay: 0.0
fsdp:
fsdp_config:
21 changes: 10 additions & 11 deletions config/llama-2.yml
@@ -3,13 +3,13 @@ model_type: LlamaForCausalLM
tokenizer_type: LlamaTokenizer
is_llama_derived_model: true

-load_in_8bit: true
+load_in_8bit: false
load_in_4bit: false
strict: false

datasets:
  # This will be the path used for the data when it is saved to the Volume in the cloud.
-  - path: my_data.jsonl
+  - path: data.jsonl
    ds_type: json
    type:
      # JSONL file contains question, context, answer fields per line.
@@ -28,14 +28,14 @@ val_set_size: 0.05
output_dir: ./lora-out

sequence_len: 4096
-sample_packing: true
+sample_packing: false
eval_sample_packing: false
-pad_to_sequence_len: true
+pad_to_sequence_len: false

adapter: lora
lora_model_dir:
-lora_r: 32
-lora_alpha: 16
+lora_r: 16
+lora_alpha: 32
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:
@@ -44,14 +44,13 @@ wandb_project:
wandb_entity:
wandb_watch:
wandb_run_id:
-wandb_log_model:

-gradient_accumulation_steps: 4
-micro_batch_size: 2
+gradient_accumulation_steps: 1
+micro_batch_size: 32
num_epochs: 4
-optimizer: adamw_bnb_8bit
+optimizer: adamw_torch
lr_scheduler: cosine
-learning_rate: 0.0002
+learning_rate: 0.0001

train_on_inputs: false
group_by_length: false
24 changes: 12 additions & 12 deletions config/mistral.yml
@@ -9,7 +9,7 @@ strict: false

datasets:
  # This will be the path used for the data when it is saved to the Volume in the cloud.
-  - path: my_data.jsonl
+  - path: data.jsonl
    ds_type: json
    type:
      # JSONL file contains question, context, answer fields per line.
@@ -24,13 +24,13 @@ datasets:
{instruction} [/INST]
dataset_prepared_path:
-val_set_size: 32
+val_set_size: 0.05
output_dir: ./lora-out

-sequence_len: 2048
-sample_packing: true
+sequence_len: 4096
+sample_packing: false
eval_sample_packing: false
-pad_to_sequence_len: true
+pad_to_sequence_len: false

adapter: lora
lora_model_dir:
@@ -46,13 +46,13 @@ wandb_watch:
wandb_run_id:

gradient_accumulation_steps: 1
-micro_batch_size: 16
-num_epochs: 1
-optimizer: adamw_bnb_8bit
+micro_batch_size: 32
+num_epochs: 4
+optimizer: adamw_torch
lr_scheduler: cosine
-learning_rate: 0.0002
+learning_rate: 0.0001

-bf16: true
+bf16: auto
fp16: false
tf32: false
train_on_inputs: false
@@ -64,12 +64,12 @@ resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
-flash_attention: false
+flash_attention: true

warmup_steps: 10
save_steps:
debug:
-deepspeed: /root/axolotl/deepspeed/zero3.json
+deepspeed: /root/axolotl/deepspeed_configs/zero3_bf16.json
weight_decay: 0.0
fsdp:
fsdp_config:
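One consequence of the standardized parameters worth spelling out: with `gradient_accumulation_steps: 1` and `micro_batch_size: 32` on the two-GPU setup from the README, the effective global batch size is 64 (illustrative arithmetic; the GPU count comes from the README's `N_GPUS = 2`):

```python
micro_batch_size = 32            # per-GPU, from the updated configs
gradient_accumulation_steps = 1  # from the updated configs
n_gpus = 2                       # N_GPUS in the README's train.py snippet
effective_batch_size = micro_batch_size * gradient_accumulation_steps * n_gpus
assert effective_batch_size == 64
```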