train rework, introduce --backend and --dtype flags #1157

Open

cdoern wants to merge 1 commit into main from train

Conversation

Contributor

@cdoern cdoern commented May 13, 2024

Changes

Added a --backend flag for ilab train, allowing users to switch their training backend between pytorch and mlx.

Changed what was linux_train.py to work for both macOS and Linux, depending on some new flags like fp16, bf16, etc.

Which issue is resolved by this Pull Request:

resolves #1108

Description of your changes:

ilab train on macOS and Linux are really different, solely because of MLX support.

It turns out PyTorch supports an mps device, enabling hardware acceleration on macOS. This allows us to use 99% of the code path from Linux train with less storage and memory used. The MLX method requires adapter files and a whole -fused directory that almost doubles the storage used in this process.

This backend is also more maintainable, as MLX is written in a way that led us to need some heavy infrastructure to run it.

In testing, the PyTorch train code takes roughly the same amount of time on comparable hardware and yields a better result, with a model that appears to learn the new skill/knowledge better and with fewer hallucinations.

I will note that 18 GB of RAM is the minimum to run this backend. With 32/38 GB, train takes less than a quarter of the time it does with 18 GB, and produces better results.

ilab train now defaults to --backend=pytorch but --backend=mlx is still available

added a few more enhancements:

  • the --backend flag lets users choose pytorch or mlx
  • I got rid of the multiple flags referencing "model" and just made a --model-repo flag, since train should always pull the full safetensors from HF
  • I got rid of a few bad flags people should probably never use, like gguf-model-path; GGUF models for training yield horrible results most of the time
  • --dtype takes fp16, bf16, fp32, and auto. This sets the dtype to use in PyTorch; bf16 and fp16 were hardcoded in the past. (A sketch of how these flags might look is below.)
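
To make the new options concrete, here is a minimal sketch of how --backend and --dtype could be declared with Click and mapped to torch dtypes. This is illustrative only, not the PR's actual code; the option names follow the description above and the DTYPE_MAP helper is an assumption.

import click
import torch

# Hypothetical mapping from the --dtype choices above to torch dtypes;
# "auto" defers the decision to the backend/device at runtime.
DTYPE_MAP = {
    "fp16": torch.float16,
    "bf16": torch.bfloat16,
    "fp32": torch.float32,
    "auto": None,
}

@click.command()
@click.option("--backend", type=click.Choice(["pytorch", "mlx"]), default="pytorch")
@click.option("--dtype", type=click.Choice(list(DTYPE_MAP)), default="auto")
@click.option("--model-repo", default=None, help="HF repo to pull safetensors from.")
def train(backend, dtype, model_repo):
    torch_dtype = DTYPE_MAP[dtype]
    click.echo(f"backend={backend} dtype={torch_dtype} model_repo={model_repo}")

if __name__ == "__main__":
    train()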

@mergify mergify bot added the testing Relates to testing label May 13, 2024
Contributor

mergify bot commented May 13, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. @cdoern please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase This Pull Request needs to be rebased label May 13, 2024
@cdoern cdoern force-pushed the train branch 2 times, most recently from cc9c6de to 86678e6 Compare May 13, 2024 18:30
@mergify mergify bot removed the needs-rebase This Pull Request needs to be rebased label May 13, 2024
@cdoern cdoern force-pushed the train branch 4 times, most recently from 6b10f71 to c8dc6eb Compare May 13, 2024 21:22
Contributor

@tiran tiran left a comment


Nice work, Charlie! I have left a few inline comments.

The current state of the PR breaks Intel Gaudi support. It may also cause problems for AMD ROCm. We don't have CI runners for these accelerators, so you probably haven't noticed it.

I also suggest renaming the module to match the --backend argument. from instructlab.train.pytorch import pytorch_train is more descriptive than linux_macos_train.

requirements.txt Outdated
Comment on lines 34 to 44
torch>=2.2.0a0,<3.0.0 ; python_version == '3.10'
torch>=2.2.1,<3.0.0 ; python_version != '3.10'
torch>=2.3.0,<3.0.0
tqdm>=4.66.2,<5.0.0
transformers>=4.30.0,<=4.38.2
transformers==4.38.1
trl>=0.7.11,<0.8.0
wandb>=0.16.4,<0.17.0
langchain-text-splitters
# the below library should NOT be imported into any python files
# it is for CLI usage ONLY
yamllint>=1.35.1,<1.36.0
bitsandbytes
accelerate
Contributor

The changes are breaking Intel Gaudi support and may cause issues for AMD ROCm support.

  • Intel Gaudi stack has PyTorch 2.2.0a0 on Python 3.10. The special case is required.
  • Intel Gaudi uses optimum-habana 1.10.3 from a fork, which requires transformers<4.38.0,>=4.37.0
  • bitsandbytes upstream only supports Nvidia CUDA and CPU. There is a work-in-progress fork for AMD ROCm. Intel Gaudi does not use it.

I guess we have to introduce more optional dependencies: instructlab[cuda], instructlab[rocm], instructlab[cpu], and instructlab[mps].

Contributor Author

I see there is a requirements-hpu.txt. I don't think we should have niche stacks in our main requirements.txt.

Would having that other requirements.txt contain all gaudi reqs be ok?

Contributor

Yes, there is a separate requirements file that is used to generate dependencies for instructlab[hpu]. AFAIK optional dependencies cannot override base requirements. If instructlab base requires torch>=2.3.0, then instructlab[hpu] cannot change it to 2.2.0a.

Contributor Author

@tiran ok, here are the requirements we have overlapping:

Gaudi: Python 3.10, torch 2.2.0a
MPS and all other Devices: 3.10+, torch 2.3.0+

Can you please advise the best way to merge these two requirements?

I was thinking pip install -r requirements-hpu.txt -e . would allow us to override the torch version in requirements.txt.

One thing I was thinking of would be to have make hpu, make mps, or make cuda targets that also build ilab to handle these backends.

I am open to whatever way is best! Though, I don't think having this stuff in the base requirements file makes sense since conflicts like this are bound to happen a lot.

Contributor

Optional dependencies cannot introduce conflicting dependencies. They are limited to additional dependencies or more restrictive dependencies. Pip flat-out refuses to resolve conflicts:

ERROR: Cannot install instructlab[hpu]==0.14.1.dev183+gdeee169.d20240514 because these package versions have conflicting dependencies.

The conflict is caused by:
    instructlab[hpu] 0.14.1.dev183+gdeee169.d20240514 depends on torch<3.0.0 and >=2.3.0
    instructlab[hpu] 0.14.1.dev183+gdeee169.d20240514 depends on torch==2.2.0a0; extra == "hpu"

Why do we require PyTorch 2.3.0 anyway? Is there a specific feature that is required?

Contributor Author

The mps backend is broken below 2.3. I can test it out to give you the specific reason, but I don't think it can compile with MPS below 2.3.
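
For reference, a minimal sketch (not part of this PR) of the kind of runtime guard such a backend choice relies on; torch.backends.mps reports whether MPS is compiled into the build and usable on the current machine:

import torch

def pick_device() -> torch.device:
    # Prefer Apple's Metal Performance Shaders backend when it is both
    # compiled into this torch build and usable on the current machine.
    if torch.backends.mps.is_built() and torch.backends.mps.is_available():
        return torch.device("mps")
    if torch.cuda.is_available():
        return torch.device("cuda")
    return torch.device("cpu")

print(pick_device())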

src/instructlab/lab.py Outdated (4 resolved review comments)
@@ -25,6 +25,8 @@
# Local
from ..chat.chat import CONTEXTS

torch.set_autocast_enabled(False)
Contributor

Isn't autocast disabled by default?

Contributor Author

apparently not, everything was broken until I added this :)

Contributor

That's really strange. Some 3rd party extension must mess with the settings. Torch 2.3.0 returns False for is_autocast_enabled.
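
As a quick sanity check (a sketch, not from the PR): on a stock install the global autocast flag should report False, so a True here would point at a third-party import flipping the state.

import torch

# Autocast is normally only enabled inside a torch.autocast(...) context
# manager, so on an unmodified PyTorch install this prints False.
print(torch.is_autocast_enabled())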

src/instructlab/train/linux_macos_train.py Outdated (resolved)
src/instructlab/lab.py Outdated (resolved)
@cdoern
Contributor Author

cdoern commented May 14, 2024

Nice work, Charlie! I have left a few inline comments.

The current state of the PR breaks Intel Gaudi support. It may also cause problems for AMD ROCm. We don't have CI runners for these accelerators, so you probably haven't noticed it.

I also suggest renaming the module to match the --backend argument. from instructlab.train.pytorch import pytorch_train is more descriptive than linux_macos_train.

Thanks for the review @tiran, I had a feeling Gaudi support might be impacted by this. I will try to clean everything up.

RobotSail
RobotSail previously approved these changes May 14, 2024
Member

@RobotSail RobotSail left a comment


LGTM. Witnessed this PR work end-to-end.

@tiran
Contributor

tiran commented May 14, 2024

@RobotSail Could you please undo your approval? mergify is going to merge the PR automatically but there are still issues with the PR.

@cdoern cdoern force-pushed the train branch 6 times, most recently from 9e6bbbd to aa0e813 Compare May 14, 2024 18:25
pyproject.toml Outdated (resolved)
@cdoern cdoern force-pushed the train branch 5 times, most recently from 201ae25 to b906411 Compare May 15, 2024 02:25
@mergify mergify bot added the ci-failure PR has at least one CI failure label May 29, 2024
@mergify mergify bot added ci-failure PR has at least one CI failure and removed ci-failure PR has at least one CI failure labels May 29, 2024
@russellb russellb self-requested a review May 29, 2024 14:48
Member

@russellb russellb left a comment


There is a design doc aiming to gather consensus on how to move training forward to support multiple modes of operation with varying levels of cost, complexity, and quality of output. This PR does some of the same. I would prefer we reach agreement on where we are headed before considering merging this.

instructlab/dev-docs#52

@cdoern
Contributor Author

cdoern commented May 29, 2024

There is a design doc aiming to gather consensus on how to move training forward to support multiple modes of operation with varying levels of cost, complexity, and quality of output. This PR does some of the same. I would prefer we reach agreement on where we are headed before considering merging this.

instructlab/dev-docs#52

This PR is unrelated to the dev doc; MPS is a functional change in that the backend is needed for macOS training support. MLX has been shown to produce degraded results compared to MPS, and the team has been using this code to train models on Macs.

I want to make sure we do not conflate the two designs here.

Comment on lines +305 to +312
# TODO: Why would we call convert without a model? this seems broken
# convert_llama_to_gguf_mock.assert_called_once()
# assert (
# convert_llama_to_gguf_mock.call_args[1]["model"]
# == "./training_results/final"
# )
# assert convert_llama_to_gguf_mock.call_args[1]["pad_vocab"] is True
# assert len(convert_llama_to_gguf_mock.call_args[1]) == 2
Member

I'd like to see all the commented-out code removed throughout

Member

... or at least the newly introduced commented-out code; in some places I see it's moving some lines that were already commented out

@mergify mergify bot added ci-failure PR has at least one CI failure and removed ci-failure PR has at least one CI failure labels May 29, 2024
@cdoern cdoern changed the title train rework, introduce --backend, --train-style, --dtype flags train rework, introduce --backend and --dtype flags May 29, 2024
@mergify mergify bot added ci-failure PR has at least one CI failure and removed ci-failure PR has at least one CI failure labels May 29, 2024
russellb

This comment was marked as duplicate.

@russellb russellb dismissed their stale review May 31, 2024 00:25

My previous request for changes was about the UX-related changes and the fact that alignment with the proposed training commands design was not clear. The new options have been hidden now, making it easier to continue changing them to get to the proposed design once it's agreed on. I'm not blocking on that point anymore.

@russellb russellb self-requested a review May 31, 2024 00:35
@@ -50,6 +52,7 @@
htcore = None
hpu = None
hpu_backends = None
# 'fork' incompatible with some hardware accelerator
Member

what is this referring to?

Contributor Author

apologies, will remove this
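
For context on what a comment like this usually refers to (illustrative only, not this PR's code): the 'fork' start method copies the parent process, which can break once an accelerator runtime such as CUDA or Habana/HPU has already been initialized there, so 'spawn' is often forced instead.

import multiprocessing

if __name__ == "__main__":
    # 'spawn' starts a fresh interpreter for each worker instead of
    # fork()ing the parent, avoiding reuse of an already-initialized
    # accelerator runtime in the child process.
    multiprocessing.set_start_method("spawn", force=True)
    print(multiprocessing.get_start_method())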

src/instructlab/train/pytorch_train.py Outdated (2 resolved review comments)
Comment on lines -903 to -908
@click.option(
"--input-dir",
type=click.Path(),
show_default=True, # TODO: set to None and change help message
help="Path to generated files to use as input.",
)
Member

just as some general feedback, including the various refactoring of options in the same commit makes this harder to review because refactoring is mixed up with the core changes you were making. When doing refactoring in a change like this, I would suggest at least keeping refactoring isolated to its own commit (and probably multiple commits that explain logical stages of refactoring). A structure like that goes a LONG way in helping people understand your changes as a logical progression without different things mixed up in the same view.

Comment on lines +1084 to +1086
print(
f"Did not get best checkpoint, choosing most recent which is {best_checkpoint}"
)
Member

other logging at this level is done using click.secho(), though this seems more like a debug message than something actionable to a user.
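
To illustrate the distinction being drawn (a sketch with a hypothetical report_checkpoint helper, not the PR's code):

import logging

import click

logger = logging.getLogger(__name__)

def report_checkpoint(best_checkpoint: str, fell_back: bool) -> None:
    # Diagnostic detail about the fallback stays at debug level...
    if fell_back:
        logger.debug(
            "Did not get best checkpoint, choosing most recent: %s", best_checkpoint
        )
    # ...while the user-facing, actionable message follows the CLI's
    # existing click.secho() convention.
    click.secho(f"Using checkpoint {best_checkpoint}", fg="cyan")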

# torch compile only for MPS
if device.type == "mps":
    torch_compile = True
    torch_backend = "aot_eager"
Member

and why is this different for mps?

Contributor Author

Added another comment, but the backend needs to be rebuilt in order to get MPS compiled, I believe. I can retry it without this to get the exact error if you want!
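
For reference, a minimal sketch (not the PR's code; the TrainingArguments usage is an assumption about how such a flag is typically consumed) of compiling with the aot_eager backend rather than the default inductor backend, which did not target MPS in the torch versions discussed here:

import torch
from transformers import TrainingArguments

model = torch.nn.Linear(8, 8)

# Compile eagerly traced graphs without inductor's codegen.
compiled = torch.compile(model, backend="aot_eager")
print(compiled(torch.randn(2, 8)).shape)

# The same choice expressed through Hugging Face TrainingArguments:
args = TrainingArguments(
    output_dir="training_results",  # directory name borrowed from this PR's tests
    torch_compile=True,
    torch_compile_backend="aot_eager",
)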

src/instructlab/train/pytorch_train.py Outdated (resolved)
max_seq_length = 300

# TODO: This to @cdoern smells like it needs to be a different function, file, etc. Too different from the rest of train. Or at least, add flags for a bunch of this.
Member

probably doesn't need to be in the code ...

Contributor Author

why? imo leaving a TODO is better practice than forgetting about this entirely.

Comment on lines +323 to +324
# TODO: why did this ever work??? linux train will create the gguf file unless we exit 0 earlier
# assert not os.path.isfile(LINUX_GGUF_FILE)
Member

would prefer this gets resolved -- did this start failing with your PR and that's why you commented it out?

Contributor Author

This and the above one, I think, were tests that were written with the wrong functionality in mind; they passed by luck. This one seems to be saying "assert the GGUF final file is not created on a successful train", which makes no sense?

I could totally be misinterpreting here. But it is possible the testing suite set this up differently and I somehow broke it?

ilab train on macos and linux are really different just for MLX support.

It turns out pytorch supports an mps device enabling hardware acceleration on macos. This allows us to use 99% of the codepath for linux train
with less storage and memory used. The mlx method requires adapter files, and a whole -fused dir that almost doubles the storage used in this whole process

This backend is also more maintainable as mlx is written in a way that led us to need some heavy infrastructure to run it.

ilab train now defaults to --backend=pytorch but --backend=mlx is still available

added a few more enhancements:
- the --backend flag lets users choose pytorch or mlx
- I got rid of the multiple flags referencing "model" and just made a --model-repo flag since train should always pull the full safetensors from HF
- I got rid of a few bad flags people should probably never use, like gguf-model-path; gguf models for training yield horrible results most of the time
- --dtype takes fp16, bf16, fp32, and auto. This sets the dtype to use in pytorch. bf16 and fp16 were hardcoded in the past.

Signed-off-by: Charlie Doern <[email protected]>
@mergify mergify bot added the ci-failure PR has at least one CI failure label Jun 3, 2024
Contributor

mergify bot commented Jun 4, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. @cdoern please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase This Pull Request needs to be rebased label Jun 4, 2024
@mergify mergify bot removed the one-approval PR has one approval from a maintainer label Jun 4, 2024
@leseb
Contributor

leseb commented Jun 19, 2024

@cdoern Is this work tracked in a design document? What's the status? Thanks!

@JamesKunstle
Contributor

@cdoern do you still want to keep this open?

Labels
ci-failure (PR has at least one CI failure), hold (In-progress PR; tag should be removed before merge), needs-rebase (This Pull Request needs to be rebased), testing (Relates to testing)

Development
Successfully merging this pull request may close these issues: ilab train pytorch support

10 participants