Preliminary Vast AI support #4365

Open
wants to merge 93 commits into
base: master
from

Conversation

kristopolous

@kristopolous kristopolous commented Nov 15, 2024

This is preliminary support for Vast. It currently works on an unreleased version of the SDK, which we will soon publish to PyPI.

I followed the document https://docs.google.com/document/d/1oWox3qb3Kz3wXXSGg9ZJWwijoa99a3PIQUHBR8UgEGs/edit?pli=1&tab=t.0 and all the testing passed.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

I'm pretty sure there will need to be edits, and I'm fine with that. This is attempt 1. The outstanding work:

We need to

  • tidy up our Docker Hub and get a better image to launch.
  • release the updates to the SDK and come up with a pip name for it.
  • get our catalog to update in the git hook flow as described (my goal is every 6 hours).

@Michaelvll Michaelvll requested a review from cblmemo November 16, 2024 02:46
Collaborator

@cblmemo cblmemo left a comment


Thanks for contributing to this @kristopolous ! This is really exciting. Left some discussions. One main confusion I have: is Vast AI, like RunPod, a cloud that provides pods to users as their "VM"s? Asking because I'm seeing a lot of Docker-related code, and just want to confirm :)

Review threads:

sky/adaptors/vast.py
sky/clouds/vast.py
sky/clouds/vast.py (outdated)
sky/clouds/vast.py
sky/clouds/vast.py (outdated)
sky/provision/vast/utils.py
sky/provision/vast/utils.py
sky/provision/vast/instance.py (outdated)
sky/provision/vast/instance.py
sky/provision/vast/instance.py
@kristopolous
Author

Thanks for contributing to this @kristopolous ! This is really exciting. Left some discussions. One main confusion I have: is Vast AI, like RunPod, a cloud that provides pods to users as their "VM"s? Asking because I'm seeing a lot of Docker-related code, and just want to confirm :)

Historically, RunPod was a clone of Vast. We currently offer Docker-style containers and will be providing VMs soonish (probably before the end of the year).

@kristopolous kristopolous force-pushed the vast.ai-support branch 3 times, most recently from e9e922a to 4c9aff9 Compare November 21, 2024 22:28
@kristopolous
Author

Getting these tests to pass is blocked by https://github.com/skypilot-org/skypilot-catalog/pull/100/commits

@cblmemo
Collaborator

cblmemo commented Jan 14, 2025

Also, feel free to separate the open ports in another PR if you find this PR's scope is too big :)

Yes that would be nice.

Many of these tests appear to be extremely broken.

Some, such as example_app, multiple_accelerators_unordered and, say, sky_bench, require AWS and fail to observe the "no_vast" decorator. Others, such as the ones that use nginx, want to run a number of commands that are not included in the Docker image and then fail when they're unable to run the tools that are not there.

Also, even though I had a --vast option on the run, a number of machines on different providers got started up and I had to go through and manually shut them down. I had to manually disable all the other clouds so as not to incur large expenses and leave unstopped instances hanging around.

Also, we have to enforce an -n 1 on our tests, otherwise you'll quickly get 429s from the API endpoints.
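
As an aside on the 429s: assuming the -n flag here is the pytest-xdist worker count, running serially is the blunt fix. A gentler option might be for the client to back off and retry when it is rate limited. The sketch below is only an illustration of that idea; RateLimited and call_with_backoff are hypothetical names, not part of SkyPilot or the Vast SDK.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical error raised when an API call comes back with HTTP 429."""


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying with exponential backoff while it raises RateLimited.

    `sleep` is injectable so the helper can be exercised without real waiting.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the 429 to the caller
            # Wait 1x, 2x, 4x ... the base delay, plus jitter so that
            # parallel test workers do not all retry at the same moment.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
```

Wrapping each API request this way would let several workers share one rate limit, at the cost of slower tests when the limit is hit.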

I tried running the tests on just AWS and GCP, but they extensively failed there as well.

Unless I modify the tests extensively, there's really no way for these to pass.

cc @Michaelvll

@cblmemo
Collaborator

cblmemo commented Jan 15, 2025

Is there an update on the failing CI tests? I see they are still failing.

Also, if passing the smoke tests is too hard, how about listing some basic test results in the PR description? Including but not limited to:

  • Launch CPU only instance
  • Launch GPU instance
  • Stop & Re-launch, check if the disk is persistent (write some content before stop, and cat it after re-launch)
  • Autostop & Autodown
  • Launch on existing cluster
  • SSH to the cluster
  • Failover: make sure it can failover from lambda to other clouds and the exceptions are printed correctly
  • Launch on other clouds without vast dependencies installed (make sure it does not introduce unnecessary dependencies when vast is not enabled)
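
Part of the checklist above (launch, the stop and re-launch persistence check, autostop) could be walked through by hand with a short list of sky commands. The following is a rough sketch, not a replacement for the smoke tests: the cluster name, the task.yaml file (assumed to request a GPU on Vast) and the marker file path are all made up for illustration, and the script only prints the commands rather than running them.

```python
import shlex


def checklist_commands(cluster="vast-check", task_yaml="task.yaml"):
    """Return sky CLI invocations covering part of the manual checklist.

    `cluster`, `task_yaml` and the marker path are illustrative assumptions.
    """
    marker = "~/persist_check.txt"
    return [
        # Launch a GPU instance from a task file.
        ["sky", "launch", "-y", "-c", cluster, task_yaml],
        # Write a marker file before stopping.
        ["sky", "exec", cluster, f"echo persisted > {marker}"],
        # Stop and re-launch, then read the marker back.
        ["sky", "stop", "-y", cluster],
        ["sky", "start", "-y", cluster],
        ["sky", "exec", cluster, f"cat {marker}"],
        # Ask for autostop after 1 idle minute, then inspect the state.
        ["sky", "autostop", "-y", cluster, "-i", "1"],
        ["sky", "status", "--refresh"],
        # Clean up.
        ["sky", "down", "-y", cluster],
    ]


if __name__ == "__main__":
    for argv in checklist_commands():
        print(shlex.join(argv))
```

Printing the commands first keeps the run reviewable; each line can then be pasted into a shell one at a time, with the output copied into the PR description.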

@kristopolous
Author

kristopolous commented Jan 15, 2025

Sure, I can do this.
I'll find appropriate tests and run them one-off. Here are some comments:

Is there an update on the failing CI tests? I see they are still failing.

Also, if passing the smoke tests is too hard, how about listing some basic test results in the PR description? Including but not limited to:

  • Launch CPU only instance

We only offer GPU instances ... you are free to use the CPU if you'd like, but we're a GPU shop.

  • Launch GPU instance
  • Stop & Re-launch, check if the disk is persistent (write some content before stop, and cat it after re-launch)
  • Autostop & Autodown
  • Launch on existing cluster
  • SSH to the cluster
  • Failover: make sure it can failover from lambda to other clouds and the exceptions are printed correctly
  • Launch on other clouds without vast dependencies installed (make sure it does not introduce unnecessary dependencies when vast is not enabled)

Sure, all of these are fine. I'll find tests that can do this.

@cblmemo
Collaborator

cblmemo commented Jan 15, 2025

We only offer GPU instances ... you are free to use the CPU if you'd like, but we're a GPU shop.

Got it. Feel free to ignore this one.

@kristopolous
Author

kristopolous commented Jan 15, 2025

So how is this "Autodowning" supposed to work? It used to work and now it puts a running machine into an "INIT" state which sounds completely wrong. It also doesn't attempt to stop or terminate any instance.

You can do

sky autostop cluster

and it succeeds, every time

you can do

sky autodown cluster

and it succeeds, every time

you do

sky launch -i 1 ....
and it fails, every time.
sky status --refresh is of no use.

If you then run autostop, it succeeds.

You can do

sky launch -i 1 --down ...
and it fails, every time
sky status --refresh is of no use.

If you then run autostop, it succeeds.

In December, all of these things worked. I've spent the past 5 or so hours trying to work through this code. Is there something special about the base image? Are you doing some action at a distance?
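
For anyone trying to reproduce this, a small polling helper can make the "stuck in INIT" behaviour easier to pin down than repeated manual refreshes. This is a sketch under assumptions: wait_for_status and its get_status callback are hypothetical, and how the callback obtains the status (for example by wrapping sky status --refresh) is deliberately left out.

```python
import time


def wait_for_status(get_status, expected, timeout_s=600.0, interval_s=15.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll get_status() until it returns `expected` or the timeout passes.

    Returns (reached, seen), where `seen` is the sequence of distinct
    consecutive statuses observed, e.g. ["UP", "INIT"]. `clock` and `sleep`
    are injectable so the helper can be tested without waiting.
    """
    deadline = clock() + timeout_s
    seen = []
    while True:
        status = get_status()
        # Record only transitions, not every identical poll result.
        if not seen or seen[-1] != status:
            seen.append(status)
        if status == expected:
            return True, seen
        if clock() >= deadline:
            return False, seen
        sleep(interval_s)
```

With sky launch -i 1, a healthy run would be expected to go from UP to STOPPED within a few minutes; a `seen` list that ends at INIT after the timeout would capture the failure described above.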

@kristopolous
Author

So how is this "Autodowning" supposed to work? It used to work and now it puts a running machine into an "INIT" state which sounds completely wrong. It also doesn't attempt to stop or terminate any instance.

You can do

sky autostop cluster

and it succeeds, every time

you can do

sky autodown cluster

and it succeeds, every time

you do

sky launch -i 1 .... and it fails, every time. sky status --refresh is of no use.

If you then run autostop, it succeeds.

You can do

sky launch -i 1 --down ... and it fails, every time sky status --refresh is of no use.

If you then run autostop, it succeeds.

In December, all of these things worked. I've spent the past 5 or so hours trying to work through this code. Is there something special about the base image? Are you doing some action at a distance?

I may have finally found the issue, let me look
