[Tests] Fix SkyServe Smoke Test #4566
cc'ing @zpoint
Note that there might be some fix attempts in #4548. Just want to make sure that we do not have conflicting fixes.
I took a quick look and it seems like this PR is just wrapping the tests with some wrappers? Should be good IIUC.
/smoke-test serve
Local smoke test on the relevant one passed. Running on Buildkite now.
@Michaelvll It seems the smoke test did not pass on Buildkite due to this issue: I manually ran the relevant smoke tests and they passed. Could you help review this?
Should be good to go, no conflicts right now. Buildkite is experiencing some weird failures because it does not separate different run environments and does not use a clean environment to run jobs. I am working on fixing this.
Thanks @cblmemo! Left a comment below. Mostly looks good to me.
tests/smoke_tests/test_sky_serve.py
Outdated
' sleep 5; '
' s=$(sky serve status {name}); '
'done; '
# 2. Once controller is ready, check provisioning vs. vCPU=2
Could you explain in the comment why we wait for the vCPU count? IIRC, if the replica is doing failover, the vCPU will appear and disappear alternately?
That is a good catch! The replica is unlikely to do failover for vCPU=2 resources. But I just realized that this only works for CPU replicas, while _SERVE_STATUS_WAIT is used for the LLM replica as well (the LLM test). Refactored to only include this in _check_replica_in_status. Re-running all smoke tests now.
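The refactor described above can be sketched roughly as follows. This is a minimal illustration, not the code in tests/smoke_tests/test_sky_serve.py: the function names serve_status_wait and check_cpu_replica_in_status, the "Waiting for serve status" message, and the grep-based count are assumptions made for the example. The point is only that the generic wait stays free of any vCPU=2 check so the LLM test can share it, while the CPU-specific check lives in the replica helper.

```python
# Hypothetical helper names; a sketch of the structure discussed in this
# thread, not the repository's actual implementation.


def serve_status_wait(name: str) -> str:
    """Generic wait shared by CPU and LLM tests.

    Polls `sky serve status` until the controller stops initializing.
    It deliberately contains no vCPU=2 check.
    """
    return (
        f's=$(sky serve status {name}); '
        'until ! echo "$s" | grep "Controller is initializing"; do '
        'echo "Waiting for serve status to be ready..."; '
        'sleep 5; '
        f's=$(sky serve status {name}); '
        'done; '
    )


def check_cpu_replica_in_status(name: str, expected_count: int) -> str:
    """CPU-only check (assumed form): count status lines showing vCPU=2."""
    return (
        serve_status_wait(name)
        + 'echo "$s"; '
        + f'[ "$(echo "$s" | grep -c "vCPU=2")" -eq {expected_count} ]; '
    )
```

With this split, an LLM test would call only the generic wait, and CPU tests would call the replica helper, which appends the vCPU=2 count check after the wait.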
/smoke-test serve
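Fixing another flaky bug: When the controller is put in INIT status by other services, sky serve status --endpoint is likely to return a - due to the following check: skypilot/sky/serve/serve_utils.py Lines 793 to 798 in 0a810ee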
An example: + export ORIGIN_SKYPILOT_DEBUG=$SKYPILOT_DEBUG; export SKYPILOT_DEBUG=0; endpoint=$(sky serve status --endpoint t-ss-new-autosca-6c-402b-64-rolling); until ! echo "$endpoint" | grep "Controller is initializing"; do echo "Waiting for serve endpoint to be ready..."; sleep 5; endpoint=$(sky serve status --endpoint t-ss-new-autosca-6c-402b-64-rolling); done; export SKYPILOT_DEBUG=$ORIGIN_SKYPILOT_DEBUG; echo "$endpoint"; s=$(curl http://$endpoint); echo "$s"; echo "$s" | grep "Hi, SkyPilot here"
-
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
curl: (6) Could not resolve host: -
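The failure in the log above happens because the wait loop only retries on "Controller is initializing", so a returned - is passed straight to curl. A minimal sketch of the retry logic in plain Python follows, assuming the placeholder is exactly - as in the log. The function wait_for_endpoint and its parameters are hypothetical names for illustration; the actual fix is in the shell command built by the smoke test.

```python
import time
from typing import Callable


def wait_for_endpoint(
    fetch: Callable[[], str],
    interval: float = 5.0,
    max_attempts: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll `fetch` until it returns a usable endpoint.

    `fetch` stands in for running `sky serve status --endpoint <name>`.
    An empty result, the "-" placeholder, and the "Controller is
    initializing" message are all treated as "not ready yet".
    """
    for _ in range(max_attempts):
        endpoint = fetch().strip()
        not_ready = (
            not endpoint
            or endpoint == '-'
            or 'Controller is initializing' in endpoint
        )
        if not not_ready:
            return endpoint
        sleep(interval)
    raise TimeoutError('serve endpoint was not ready in time')
```

Passing the fetch and sleep functions in as arguments keeps the sketch testable without a running controller.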
/smoke-test serve
I found that a parameterized test function (e.g. |
Also, it seems Buildkite is running on Azure, which requires a bit more initial delay. Just added it.
I will separate all parameter tests in #4548. That can also help separate However, currently, for |
/smoke-test serve
Thanks @cblmemo! This helps a lot in making Buildkite 100% reliable!
/smoke-test serve
It seems like basically all tests passed, except for:
cc @Michaelvll for a look - I think it is ready to be merged ;)
Thanks @cblmemo! LGTM.
This PR fixes several broken SkyServe smoke tests:
- test_skyserve_fast_update's port was accidentally changed by [DigitalOcean] droplet integration #3832. Revert it back.
- [...] workdir with wget to avoid storage creation.
- [...] vCPU=2 representation.
- When the controller is put in INIT status by other services, sky serve status --endpoint is likely to return a - due to the following check. This PR retries on getting a - when fetching the endpoint.
skypilot/sky/serve/serve_utils.py
Lines 793 to 798 in 0a810ee
An example:
Tested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_skyserve_fast_update
pytest tests/test_smoke.py::test_skyserve_base_ondemand_fallback
pytest tests/test_smoke.py::test_skyserve_llm
pytest tests/test_smoke.py::test_skyserve_new_autoscaler_update
conda deactivate; bash -i tests/backward_compatibility_tests.sh