[Tests] Fix SkyServe Smoke Test #4566

Merged
merged 9 commits into from
Jan 17, 2025
Changes from 5 commits
5 changes: 4 additions & 1 deletion tests/skyserve/llm/service.yaml
@@ -28,7 +28,10 @@ setup: |
fi

# Install dependencies
-pip install "fschat[model_worker,webui]==0.2.24"
+# TODO(tian): transformers<4.48.0 is a temporary solution for breaking
+# change in transformers 4.48.0. Update to latest version when the issue
+# is fixed. Ref: https://github.com/huggingface/transformers/issues/35639
+pip install "fschat[model_worker,webui]==0.2.24" "transformers<4.48.0"
pip install sentencepiece protobuf

run: |
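
The `transformers<4.48.0` pin in this hunk is an exclusive upper bound. Below is a minimal sketch of what that bound accepts and rejects, assuming plain dotted release versions. `release_tuple` and `satisfies_upper_bound` are hypothetical helpers for illustration, not part of SkyPilot, pip, or transformers, and they are coarser than pip's real version handling.

```python
import re


def release_tuple(version: str) -> tuple:
    """Return the leading numeric release segment of a version string.

    Hypothetical helper: suffixes such as 'rc1' or '.dev0' are dropped,
    so this is only an approximation of pip's version ordering.
    """
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def satisfies_upper_bound(installed: str, bound: str = "4.48.0") -> bool:
    """True if `installed` is strictly below `bound`, as in "transformers<4.48.0"."""
    a, b = release_tuple(installed), release_tuple(bound)
    # Pad with zeros so that "4.48" and "4.48.0" compare as equal.
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a < b
```

In a real check, the installed version string could be read with `importlib.metadata.version("transformers")` and passed to `satisfies_upper_bound`.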
3 changes: 2 additions & 1 deletion tests/skyserve/spot/base_ondemand_fallback.yaml
@@ -14,7 +14,8 @@ resources:
cpus: 2+
use_spot: true

-workdir: examples/serve/http_server
+setup: |
+wget https://raw.githubusercontent.com/skypilot-org/skypilot/refs/heads/master/examples/serve/http_server/server.py

# Use 8080 to test jupyter service is terminated
run: python3 server.py --port 8080
2 changes: 1 addition & 1 deletion tests/skyserve/update/bump_version_after.yaml
@@ -16,7 +16,7 @@ service:
replicas: 3

resources:
-ports: 8080
+ports: 8081
cpus: 2+

setup: |
2 changes: 1 addition & 1 deletion tests/skyserve/update/bump_version_before.yaml
@@ -16,7 +16,7 @@ service:
replicas: 2

resources:
-ports: 8080
+ports: 8081
cpus: 2+

setup: |
3 changes: 2 additions & 1 deletion tests/skyserve/update/new_autoscaler_after.yaml
@@ -12,7 +12,8 @@ resources:
use_spot: true
cpus: 2+

-workdir: examples/serve/http_server
+setup: |
+wget https://raw.githubusercontent.com/skypilot-org/skypilot/refs/heads/master/examples/serve/http_server/server.py

run: |
if [ $SKYPILOT_SERVE_REPLICA_ID -eq 7 ]; then
3 changes: 2 additions & 1 deletion tests/skyserve/update/new_autoscaler_before.yaml
@@ -8,6 +8,7 @@ resources:
ports: 8081
cpus: 2+

-workdir: examples/serve/http_server
+setup: |
+wget https://raw.githubusercontent.com/skypilot-org/skypilot/refs/heads/master/examples/serve/http_server/server.py

run: python3 server.py --port 8081
28 changes: 24 additions & 4 deletions tests/smoke_tests/test_sky_serve.py
@@ -89,10 +89,30 @@ def _get_service_name() -> str:
'sleep 5; endpoint=$(sky serve status --endpoint {name}); done; '
'export SKYPILOT_DEBUG=$ORIGIN_SKYPILOT_DEBUG; echo "$endpoint"')
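
The shell snippets in this file are built from adjacent Python string literals carrying a `{name}` placeholder. Below is a minimal sketch of how such a template expands, assuming it is filled with `str.format`. `STATUS_TEMPLATE` and `render` are hypothetical names for illustration and do not appear in the test file.

```python
# Hypothetical template in the style of the test file's shell snippets.
STATUS_TEMPLATE = (
    's=$(sky serve status {name}); '
    # Comments may sit between adjacent literals; they are not part of the string.
    'echo "$s"')


def render(template: str, name: str) -> str:
    """Fill the {name} placeholder with a service name."""
    return template.format(name=name)
```

Shell constructs such as `$(...)` and `"$s"` pass through `str.format` unchanged because they contain no braces. A shell expansion written as `${var}` would need its braces doubled (`${{var}}`) to survive formatting.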

-_SERVE_STATUS_WAIT = ('s=$(sky serve status {name}); '
-'until ! echo "$s" | grep "Controller is initializing."; '
-'do echo "Waiting for serve status to be ready..."; '
-'sleep 5; s=$(sky serve status {name}); done; echo "$s"')
+_SERVE_STATUS_WAIT = (
+'s=$(sky serve status {name}); '
+# 1. Wait for "Controller is initializing." to disappear
+'until ! echo "$s" | grep "Controller is initializing."; '
+'do '
+' echo "Waiting for serve status to be ready..."; '
+' sleep 5; '
+' s=$(sky serve status {name}); '
+'done; '
+# 2. Once controller is ready, check provisioning vs. vCPU=2
Collaborator

Could you explain in the comment why we wait for the vCPU count? IIRC, if the replica is doing failover, the vCPU will appear and disappear alternately?

Collaborator Author

That is a good catch! The replica is unlikely to do failover for vCPU=2 resources. But I just realized that this only works for CPU replicas, while _SERVE_STATUS_WAIT is also used for LLM replicas (the LLM test). Refactored to only include this in _check_replica_in_status. Re-running all smoke tests now.

+'provisioning_count=$(echo "$s" | grep "PROVISIONING" | wc -l); '
cblmemo marked this conversation as resolved.
+'vcpu_in_provision=$(echo "$s" | grep "PROVISIONING" | grep "vCPU=2" | wc -l); '
+'until [ "$provisioning_count" -eq "$vcpu_in_provision" ]; '
+'do '
+' echo "Waiting for provisioning resource repr ready..."; '
+' echo "PROVISIONING: $provisioning_count, vCPU: $vcpu_in_provision"; '
+' sleep 5; '
+' s=$(sky serve status {name}); '
+' provisioning_count=$(echo "$s" | grep "PROVISIONING" | wc -l); '
+' vcpu_in_provision=$(echo "$s" | grep "PROVISIONING" | grep "vCPU=2" | wc -l); '
+'done; '
+# 3. Provisioning is complete
+'echo "Provisioning complete. PROVISIONING: $provisioning_count, vCPU=2: $vcpu_in_provision"; '
+'echo "$s"')
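
The readiness condition that the shell loop polls for can be restated as pure functions over the status text. This is a sketch in Python rather than the PR's shell, and the function names are hypothetical. It mirrors the two `grep` checks: the controller banner must be gone, and every `PROVISIONING` line must also show `vCPU=2`. As noted in the review thread, the second check only makes sense for CPU replicas.

```python
def controller_ready(status: str) -> bool:
    """Phase 1: the controller-initializing banner is no longer printed."""
    return "Controller is initializing." not in status


def provisioning_counts(status: str, marker: str = "vCPU=2") -> tuple:
    """Phase 2: count PROVISIONING lines, and those that also contain `marker`.

    Uses substring matching, like the grep pipeline in the shell version.
    """
    rows = [line for line in status.splitlines() if "PROVISIONING" in line]
    with_marker = [line for line in rows if marker in line]
    return len(rows), len(with_marker)


def status_ready(status: str) -> bool:
    """True once the banner is gone and all PROVISIONING lines show the marker."""
    if not controller_ready(status):
        return False
    total, with_marker = provisioning_counts(status)
    return total == with_marker
```

A caller would fetch the output of `sky serve status` repeatedly, sleeping between attempts, until `status_ready` returns true. Keeping the predicate free of I/O makes it testable against canned text.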


def _get_replica_ip(name: str, replica_id: int) -> str: