Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Tests] Fix SkyServe Smoke Test #4566

Merged
merged 9 commits into from
Jan 17, 2025
Merged

[Tests] Fix SkyServe Smoke Test #4566

merged 9 commits into from
Jan 17, 2025

Conversation

cblmemo
Copy link
Collaborator

@cblmemo cblmemo commented Jan 15, 2025

This PR fixes several broken SkyServe smoke tests:

  1. LLM tests failed cuz this breaking change in transformers: Breaking change in v4.48.0 and Python 3.9 huggingface/transformers#35639. Fix transformer version before this change.
  2. test_skyserve_fast_update's port has been accidentally changed by [DigitalOcean] droplet integration #3832. Revert it back.
  3. Replacing several workdir with wget to avoid storage creation.
  4. Fixes [Tests][Serve] Smoke test on SkyServe has a flacky output check #4565 by waiting all provisioning resources has a vCPU=2 representation.
  5. When the controller is put in INIT status by other services, the sky serve status --endpoint is likely to return a - due to the following check. This PR do retry on getting a - when fetching the endpoint.

try:
endpoint = backend_utils.get_endpoints(handle.cluster_name,
load_balancer_port).get(
load_balancer_port, None)
except exceptions.ClusterNotUpError:
return '-'

An example:

+ export ORIGIN_SKYPILOT_DEBUG=$SKYPILOT_DEBUG; export SKYPILOT_DEBUG=0; endpoint=$(sky serve status --endpoint t-ss-new-autosca-6c-402b-64-rolling); until ! echo "$endpoint" | grep "Controller is initializing"; do echo "Waiting for serve endpoint to be ready..."; sleep 5; endpoint=$(sky serve status --endpoint t-ss-new-autosca-6c-402b-64-rolling); done; export SKYPILOT_DEBUG=$ORIGIN_SKYPILOT_DEBUG; echo "$endpoint"; s=$(curl http://$endpoint); echo "$s"; echo "$s" | grep "Hi, SkyPilot here"
-
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (6) Could not resolve host: -

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests:
    • pytest tests/test_smoke.py::test_skyserve_fast_update
    • pytest tests/test_smoke.py::test_skyserve_base_ondemand_fallback
    • pytest tests/test_smoke.py::test_skyserve_llm
    • pytest tests/test_smoke.py::test_skyserve_new_autoscaler_update
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

@Michaelvll
Copy link
Collaborator

cc'ing @zpoint

@Michaelvll
Copy link
Collaborator

Michaelvll commented Jan 15, 2025

Note that there might be some fix attempts in #4548. Just want to make sure that we are not having conflicted fixes.

@cblmemo
Copy link
Collaborator Author

cblmemo commented Jan 15, 2025

Note that there might be some fix attempts in #4548. Just want to make sure that we are not having conflicted fixes.

I took a quick look and it seems like this PR is just wrapping the tests with some wrappers? Should be good IIUC

@cblmemo cblmemo marked this pull request as ready for review January 15, 2025 19:40
@cblmemo
Copy link
Collaborator Author

cblmemo commented Jan 15, 2025

/smoke-test serve

@cblmemo
Copy link
Collaborator Author

cblmemo commented Jan 15, 2025

Local smoke test on relevant one passed. Running on buildkite now

@cblmemo
Copy link
Collaborator Author

cblmemo commented Jan 16, 2025

@Michaelvll Seems like the smoke test is not passed on the buildkite due to this issue:

#3951 (comment)

I manually run the relevant smoke test and they passed. Could you help review this?

@zpoint
Copy link
Collaborator

zpoint commented Jan 16, 2025

Should be good to go, no conflicts right now.

Buildkite is experiencing some weird failures due to not separating different run environments and not using a clean environment to run jobs. I am working on fixing this.

Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @cblmemo ! Left a comment below. mostly looks good to me

tests/smoke_tests/test_sky_serve.py Outdated Show resolved Hide resolved
tests/smoke_tests/test_sky_serve.py Outdated Show resolved Hide resolved
' sleep 5; '
' s=$(sky serve status {name}); '
'done; '
# 2. Once controller is ready, check provisioning vs. vCPU=2
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

explain why we wait for the vCPU count int the comment? IIRC, if the replica is doing failover, the vCPU will show and disappear alternatively?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a good catch! The replica is unlikely to do failover for vCPU=2 resources. But i just realized that this only works for cpu replicas while the _SERVE_STATUS_WAIT is used for LLM replica as well (the llm test). Refactored to only include this in _check_replica_in_status. Re-running all smoke tests now.

@zpoint
Copy link
Collaborator

zpoint commented Jan 16, 2025

/smoke-test serve

@cblmemo
Copy link
Collaborator Author

cblmemo commented Jan 17, 2025

Fixing another flaky bug:

When the controller is put in INIT status by other services, the sky serve status --endpoint is likely to return a - due to the following check. This PR do retry on getting a - when fetching the endpoint.

try:
endpoint = backend_utils.get_endpoints(handle.cluster_name,
load_balancer_port).get(
load_balancer_port, None)
except exceptions.ClusterNotUpError:
return '-'

An example:

+ export ORIGIN_SKYPILOT_DEBUG=$SKYPILOT_DEBUG; export SKYPILOT_DEBUG=0; endpoint=$(sky serve status --endpoint t-ss-new-autosca-6c-402b-64-rolling); until ! echo "$endpoint" | grep "Controller is initializing"; do echo "Waiting for serve endpoint to be ready..."; sleep 5; endpoint=$(sky serve status --endpoint t-ss-new-autosca-6c-402b-64-rolling); done; export SKYPILOT_DEBUG=$ORIGIN_SKYPILOT_DEBUG; echo "$endpoint"; s=$(curl http://$endpoint); echo "$s"; echo "$s" | grep "Hi, SkyPilot here"
-
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (6) Could not resolve host: -

@zpoint
Copy link
Collaborator

zpoint commented Jan 17, 2025

The buildkite is working now.

Failure on this build should reflect the real failure. Could u help confirm? @cblmemo

Feel free to comment /smoke-test serve trigger furthur build.

@zpoint
Copy link
Collaborator

zpoint commented Jan 17, 2025

/smoke-test serve

@cblmemo
Copy link
Collaborator Author

cblmemo commented Jan 17, 2025

The buildkite is working now.

Failure on this build should reflect the real failure. Could u help confirm? @cblmemo

Feel free to comment /smoke-test serve trigger furthur build.

I found that a parameterized test function (e.g. test_skyserve_new_autoscaler_update) will run all of its tests at once, which will probably report partially correct information. Do you think we should separate them?

image

@cblmemo
Copy link
Collaborator Author

cblmemo commented Jan 17, 2025

Also, seems like the buildkite is running on Azure which requires a little bit more initial delay. Just added.

@zpoint
Copy link
Collaborator

zpoint commented Jan 17, 2025

I will separate all parameter tests in #4548. That can also help separate test_skyserve_new_autoscaler_update to reduce flakiness.

However, currently, for test_skyserve_new_autoscaler_update to pass, both rolling and blue_green must pass. Separating them increases the chance of passing during a rerun, but it doesn’t help with the flakiness. It’s better if we find the root cause and fix it.😄

@zpoint
Copy link
Collaborator

zpoint commented Jan 17, 2025

/smoke-test serve

@zpoint zpoint mentioned this pull request Jan 17, 2025
10 tasks
Copy link
Collaborator

@zpoint zpoint left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @cblmemo This helps a lot in making buildkite 100% reliable !!!

@cblmemo
Copy link
Collaborator Author

cblmemo commented Jan 17, 2025

/smoke-test serve

@cblmemo
Copy link
Collaborator Author

cblmemo commented Jan 17, 2025

It seems like basically all test passed, except for:

cc @Michaelvll for a look - i think it is ready to be merged ;)

Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @cblmemo! LGTM.

@cblmemo cblmemo merged commit 6b23582 into master Jan 17, 2025
18 of 19 checks passed
@cblmemo cblmemo deleted the fix-serve-smoke branch January 17, 2025 23:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Tests][Serve] Smoke test on SkyServe has a flacky output check
3 participants