train rework, introduce --backend and --dtype flags #1157

Open

cdoern wants to merge 1 commit into main from train

Conversation

Contributor

@cdoern cdoern commented May 13, 2024

Changes

Added a --backend flag for ilab train, allowing users to switch their training backend between pytorch and mlx.

Changed what was linux_train.py to work for both macOS and Linux, depending on some new flags like fp16, bf16, etc.

Which issue is resolved by this Pull Request:

resolves #1108

Description of your changes:

ilab train on macOS and Linux are really different, solely because of MLX support.

It turns out PyTorch supports an mps device, enabling hardware acceleration on macOS. This allows us to use 99% of the code path from Linux train with less storage and memory used. The MLX method requires adapter files and a whole -fused directory that almost doubles the storage used in this process.

This backend is also more maintainable, as MLX is written in a way that led us to need some heavy infrastructure to run it.

In testing, the PyTorch train code takes roughly the same amount of time on comparable hardware and yields a better result, with a model that appears to learn the new skill/knowledge better and with fewer hallucinations.

I will note that 18 GB of RAM is the minimum to run this backend. With 32/38 GB, train takes less than a quarter of the time it does with 18 GB, and produces better results.

ilab train now defaults to --backend=pytorch but --backend=mlx is still available

added a few more enhancements:

  • the --backend flag lets users choose pytorch or mlx
  • I got rid of the multiple flags referencing "model" and just made a --model-repo flag, since train should always pull the full safetensors from HF
  • I got rid of a few bad flags people should probably never use, like gguf-model-path; GGUF models for training yield horrible results most of the time
  • --dtype takes fp16, bf16, fp32, and auto. This sets the dtype to use in PyTorch; bf16 and fp16 were hardcoded in the past. (A sketch of how these flags might look is below.)
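
To make the new options concrete, here is a minimal sketch of how --backend and --dtype could be declared with Click and mapped to torch dtypes. This is illustrative only, not the PR's actual code; the option names follow the description above and the DTYPE_MAP helper is an assumption.

import click
import torch

# Hypothetical mapping from the --dtype choices above to torch dtypes;
# "auto" defers the decision to the backend/device at runtime.
DTYPE_MAP = {
    "fp16": torch.float16,
    "bf16": torch.bfloat16,
    "fp32": torch.float32,
    "auto": None,
}

@click.command()
@click.option("--backend", type=click.Choice(["pytorch", "mlx"]), default="pytorch")
@click.option("--dtype", type=click.Choice(list(DTYPE_MAP)), default="auto")
@click.option("--model-repo", default=None, help="HF repo to pull safetensors from.")
def train(backend, dtype, model_repo):
    torch_dtype = DTYPE_MAP[dtype]
    click.echo(f"backend={backend} dtype={torch_dtype} model_repo={model_repo}")

if __name__ == "__main__":
    train()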

@mergify mergify bot added the testing Relates to testing label May 13, 2024
Contributor

mergify bot commented May 13, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. @cdoern please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase This Pull Request needs to be rebased label May 13, 2024
@cdoern cdoern force-pushed the train branch 2 times, most recently from cc9c6de to 86678e6 Compare May 13, 2024 18:30
@mergify mergify bot removed the needs-rebase This Pull Request needs to be rebased label May 13, 2024
@cdoern cdoern force-pushed the train branch 4 times, most recently from 6b10f71 to c8dc6eb Compare May 13, 2024 21:22
Contributor

@tiran tiran left a comment


Nice work, Charlie! I have left a few inline comments.

The current state of the PR breaks Intel Gaudi support. It may also cause problems for AMD ROCm. We don't have CI runners for these accelerators, so you probably haven't noticed it.

I also suggest renaming the module to match the --backend argument. from instructlab.train.pytorch import pytorch_train is more descriptive than linux_macos_train.

requirements.txt Outdated
Comment on lines 34 to 44
torch>=2.2.0a0,<3.0.0 ; python_version == '3.10'
torch>=2.2.1,<3.0.0 ; python_version != '3.10'
torch>=2.3.0,<3.0.0
tqdm>=4.66.2,<5.0.0
transformers>=4.30.0,<=4.38.2
transformers==4.38.1
trl>=0.7.11,<0.8.0
wandb>=0.16.4,<0.17.0
langchain-text-splitters
# the below library should NOT be imported into any python files
# it is for CLI usage ONLY
yamllint>=1.35.1,<1.36.0
bitsandbytes
accelerate
Contributor

The changes are breaking Intel Gaudi support and may cause issues for AMD ROCm support.

  • Intel Gaudi stack has PyTorch 2.2.0a0 on Python 3.10. The special case is required.
  • Intel Gaudi uses optimum-habana 1.10.3 from a fork, which requires transformers<4.38.0,>=4.37.0
  • bitsandbytes upstream only supports Nvidia CUDA and CPU. There is a work-in-progress fork for AMD ROCm. Intel Gaudi does not use it.

I guess we have to introduce more optional dependencies: instructlab[cuda], instructlab[rocm], instructlab[cpu], and instructlab[mps].

Contributor Author

I see there is a requirements-hpu.txt. I don't think we should have niche stacks in our main requirements.txt.

Would having that other requirements.txt contain all gaudi reqs be ok?

Contributor

Yes, there is a separate requirements file that is used to generate dependencies for instructlab[hpu]. AFAIK optional dependencies cannot override base requirements. If instructlab base requires torch>=2.3.0, then instructlab[hpu] cannot change it to 2.2.0a.

Contributor Author

@tiran ok, here are the requirements we have overlapping:

Gaudi: Python 3.10, torch 2.2.0a
MPS and all other Devices: 3.10+, torch 2.3.0+

Can you please advise the best way to merge these two requirements?

I was thinking pip install -r requirements-hpu.txt -e . would allow us to override the torch version in requirements.txt.

One thing I was thinking of would be to have make hpu, make mps, or make cuda targets that also build ilab to handle these backends.

I am open to whatever way is best! Though, I don't think having this stuff in the base requirements file makes sense since conflicts like this are bound to happen a lot.

Contributor

Optional dependencies cannot introduce conflicting dependencies. They are limited to additional dependencies or more restrictive dependencies. Pip flat-out refuses to resolve conflicts:

ERROR: Cannot install instructlab[hpu]==0.14.1.dev183+gdeee169.d20240514 because these package versions have conflicting dependencies.

The conflict is caused by:
    instructlab[hpu] 0.14.1.dev183+gdeee169.d20240514 depends on torch<3.0.0 and >=2.3.0
    instructlab[hpu] 0.14.1.dev183+gdeee169.d20240514 depends on torch==2.2.0a0; extra == "hpu"

Why do we require PyTorch 2.3.0 anyway? Is there a specific feature that is required?

Contributor Author

The mps backend is broken below 2.3. I can test it out to give you the specific reason, but I don't think it can compile with MPS below 2.3.
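
For reference, a minimal sketch (not part of this PR) of the kind of runtime guard such a backend choice relies on; torch.backends.mps reports whether MPS is compiled into the build and usable on the current machine:

import torch

def pick_device() -> torch.device:
    # Prefer Apple's Metal Performance Shaders backend when it is both
    # compiled into this torch build and usable on the current machine.
    if torch.backends.mps.is_built() and torch.backends.mps.is_available():
        return torch.device("mps")
    if torch.cuda.is_available():
        return torch.device("cuda")
    return torch.device("cpu")

print(pick_device())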

src/instructlab/lab.py Outdated (4 resolved review comments)
@@ -25,6 +25,8 @@
# Local
from ..chat.chat import CONTEXTS

torch.set_autocast_enabled(False)
Contributor

Isn't autocast disabled by default?

Contributor Author

apparently not, everything was broken until I added this :)

Contributor

That's really strange. Some 3rd party extension must mess with the settings. Torch 2.3.0 returns False for is_autocast_enabled.
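
As a quick sanity check (a sketch, not from the PR): on a stock install the global autocast flag should report False, so a True here would point at a third-party import flipping the state.

import torch

# Autocast is normally only enabled inside a torch.autocast(...) context
# manager, so on an unmodified PyTorch install this prints False.
print(torch.is_autocast_enabled())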

src/instructlab/train/linux_macos_train.py Outdated (resolved)
src/instructlab/lab.py Outdated (resolved)
@cdoern
Contributor Author

cdoern commented May 14, 2024

Nice work, Charlie! I have left a few inline comments.

The current state of the PR breaks Intel Gaudi support. It may also cause problems for AMD ROCm. We don't have CI runners for these accelerators, so you probably haven't noticed it.

I also suggest renaming the module to match the --backend argument. from instructlab.train.pytorch import pytorch_train is more descriptive than linux_macos_train.

Thanks for the review @tiran, I had a feeling Gaudi support might be impacted by this. I will try to clean everything up.

RobotSail
RobotSail previously approved these changes May 14, 2024
Member

@RobotSail RobotSail left a comment


LGTM. Witnessed this PR work end-to-end.

@tiran
Contributor

tiran commented May 14, 2024

@RobotSail Could you please undo your approval? mergify is going to merge the PR automatically but there are still issues with the PR.

@cdoern cdoern force-pushed the train branch 6 times, most recently from 9e6bbbd to aa0e813 Compare May 14, 2024 18:25
pyproject.toml Outdated (resolved)
@cdoern cdoern force-pushed the train branch 5 times, most recently from 201ae25 to b906411 Compare May 15, 2024 02:25
@mergify mergify bot added the ci-failure PR has at least one CI failure label May 29, 2024
@mergify mergify bot added ci-failure PR has at least one CI failure and removed ci-failure PR has at least one CI failure labels May 29, 2024
@russellb russellb self-requested a review May 29, 2024 14:48
Member

@russellb russellb left a comment


There is a design doc aiming to gather consensus on how to move training forward to support multiple modes of operation with varying levels of cost, complexity, and quality of output. This PR does some of the same. I would prefer we reach agreement on where we are headed before considering merging this.

instructlab/dev-docs#52

@cdoern
Contributor Author

cdoern commented May 29, 2024

There is a design doc aiming to gather consensus on how to move training forward to support multiple modes of operation with varying levels of cost, complexity, and quality of output. This PR does some of the same. I would prefer we reach agreement on where we are headed before considering merging this.

instructlab/dev-docs#52

This PR is unrelated to the dev doc; MPS is a functional change in that the backend is needed for macOS training support. MLX has been shown to produce degraded results compared to MPS, and the team has been using this code to train models on Macs.

I want to make sure we do not conflate the two designs here.

Comment on lines +305 to +312
# TODO: Why would we call convert without a model? this seems broken
# convert_llama_to_gguf_mock.assert_called_once()
# assert (
# convert_llama_to_gguf_mock.call_args[1]["model"]
# == "./training_results/final"
# )
# assert convert_llama_to_gguf_mock.call_args[1]["pad_vocab"] is True
# assert len(convert_llama_to_gguf_mock.call_args[1]) == 2
Member

I'd like to see all the commented-out code removed throughout

Member

... or at least the newly introduced commented-out code; in some places I see it's moving some lines that were already commented out

@mergify mergify bot added ci-failure PR has at least one CI failure and removed ci-failure PR has at least one CI failure labels May 29, 2024
@cdoern cdoern changed the title train rework, introduce --backend, --train-style, --dtype flags train rework, introduce --backend and --dtype flags May 29, 2024
@mergify mergify bot added ci-failure PR has at least one CI failure and removed ci-failure PR has at least one CI failure labels May 29, 2024
russellb

This comment was marked as duplicate.

@russellb russellb dismissed their stale review May 31, 2024 00:25

My previous request for changes was about the UX-related changes and the fact that alignment with the proposed training commands design was not clear. The new options have been hidden now, making it easier to continue changing them to get to the proposed design once it's agreed on. I'm not blocking on that point anymore.

@russellb russellb self-requested a review May 31, 2024 00:35
@@ -50,6 +52,7 @@
htcore = None
hpu = None
hpu_backends = None
# 'fork' incompatible with some hardware accelerator
Member

what is this referring to?

Contributor Author

apologies, will remove this
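
For context on what a comment like this usually refers to (illustrative only, not this PR's code): the 'fork' start method copies the parent process, which can break once an accelerator runtime such as CUDA or Habana/HPU has already been initialized there, so 'spawn' is often forced instead.

import multiprocessing

if __name__ == "__main__":
    # 'spawn' starts a fresh interpreter for each worker instead of
    # fork()ing the parent, avoiding reuse of an already-initialized
    # accelerator runtime in the child process.
    multiprocessing.set_start_method("spawn", force=True)
    print(multiprocessing.get_start_method())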

src/instructlab/train/pytorch_train.py Outdated (2 resolved review comments)
Comment on lines -903 to -908
@click.option(
"--input-dir",
type=click.Path(),
show_default=True, # TODO: set to None and change help message
help="Path to generated files to use as input.",
)
Member

just as some general feedback, including the various refactoring of options in the same commit makes this harder to review because refactoring is mixed up with the core changes you were making. When doing refactoring in a change like this, I would suggest at least keeping refactoring isolated to its own commit (and probably multiple commits that explain logical stages of refactoring). A structure like that goes a LONG way in helping people understand your changes as a logical progression without different things mixed up in the same view.

Comment on lines +1084 to +1086
print(
f"Did not get best checkpoint, choosing most recent which is {best_checkpoint}"
)
Member

other logging at this level is done using click.secho(), though this seems more like a debug message than something actionable to a user.
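
To illustrate the distinction being drawn (a sketch with a hypothetical report_checkpoint helper, not the PR's code):

import logging

import click

logger = logging.getLogger(__name__)

def report_checkpoint(best_checkpoint: str, fell_back: bool) -> None:
    # Diagnostic detail about the fallback stays at debug level...
    if fell_back:
        logger.debug(
            "Did not get best checkpoint, choosing most recent: %s", best_checkpoint
        )
    # ...while the user-facing, actionable message follows the CLI's
    # existing click.secho() convention.
    click.secho(f"Using checkpoint {best_checkpoint}", fg="cyan")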

# torch compile only for MPS
if device.type == "mps":
    torch_compile = True
    torch_backend = "aot_eager"
Member

and why is this different for mps?

Contributor Author

Added another comment, but the backend needs to be rebuilt in order to get MPS compiled, I believe. I can retry it without this to get the exact error if you want!
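
For reference, a minimal sketch (not the PR's code; the TrainingArguments usage is an assumption about how such a flag is typically consumed) of compiling with the aot_eager backend rather than the default inductor backend, which did not target MPS in the torch versions discussed here:

import torch
from transformers import TrainingArguments

model = torch.nn.Linear(8, 8)

# Compile eagerly traced graphs without inductor's codegen.
compiled = torch.compile(model, backend="aot_eager")
print(compiled(torch.randn(2, 8)).shape)

# The same choice expressed through Hugging Face TrainingArguments:
args = TrainingArguments(
    output_dir="training_results",  # directory name borrowed from this PR's tests
    torch_compile=True,
    torch_compile_backend="aot_eager",
)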

src/instructlab/train/pytorch_train.py Outdated (resolved)
max_seq_length = 300

# TODO: This to @cdoern smells like it needs to be a different function, file, etc. Too different from the rest of train. Or at least, add flags for a bunch of this.
Member

probably doesn't need to be in the code ...

Contributor Author

why? imo leaving a TODO is better practice than forgetting about this entirely.

Comment on lines +323 to +324
# TODO: why did this ever work??? linux train will create the gguf file unless we exit 0 earlier
# assert not os.path.isfile(LINUX_GGUF_FILE)
Member

would prefer this gets resolved -- did this start failing with your PR and that's why you commented it out?

Contributor Author

This and the above one, I think, were tests that were written with the wrong functionality in mind; they passed by luck. This one seems to be saying "assert the GGUF final file is not created on a successful train", which makes no sense?

I could totally be misinterpreting here. But it is possible the testing suite set this up differently and I somehow broke it?

ilab train on macos and linux are really different just for MLX support.

It turns out pytorch supports an mps device enabling hardware acceleration on macos. This allows us to use 99% of the codepath for linux train
with less storage and memory used. The mlx method requires adapter files, and a whole -fused dir that almost doubles the storage used in this whole process

This backend is also more maintainable as mlx is written in a way that led us to need some heavy infrastructure to run it.

ilab train now defaults to --backend=pytorch but --backend=mlx is still available

added a few more enhancements:
- the --backend flag lets users choose pytorch or mlx
- I got rid of the multiple flags referencing "model" and just made a --model-repo flag since train should always pull the full safetensors from HF
- I got rid of a few bad flags people should probably never use, like gguf-model-path; gguf models for training yield horrible results most of the time
- --dtype takes fp16, bf16, fp32, and auto. This sets the dtype to use in pytorch. bf16 and fp16 were hardcoded in the past.

Signed-off-by: Charlie Doern <[email protected]>
@mergify mergify bot added the ci-failure PR has at least one CI failure label Jun 3, 2024
Contributor

mergify bot commented Jun 4, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. @cdoern please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase This Pull Request needs to be rebased label Jun 4, 2024
@mergify mergify bot removed the one-approval PR has one approval from a maintainer label Jun 4, 2024
@leseb
Contributor

leseb commented Jun 19, 2024

@cdoern Is this work tracked in a design document? What's the status? Thanks!

@JamesKunstle
Contributor

@cdoern do you still want to keep this open?

Labels
ci-failure (PR has at least one CI failure), hold (In-progress PR; tag should be removed before merge), needs-rebase (This Pull Request needs to be rebased), testing (Relates to testing)

Development
Successfully merging this pull request may close these issues: ilab train pytorch support

10 participants