Conversation
Co-authored-by: Alexey Kondratiev <[email protected]>
Allow dummy load format for fp8; torch.uniform_ doesn't support FP8 at the moment. Co-authored-by: Mor Zusman <[email protected]>
Signed-off-by: kerthcet <[email protected]>
Pass the CUDA stream into the CUTLASS GEMMs, to avoid future issues with CUDA graphs
…ct#4893) The 2nd PR for vllm-project#4532. This PR supports loading FP8 kv-cache scaling factors from an FP8 checkpoint (with .kv_scale parameter).
Signed-off-by: Muralidhar Andoorveedu <[email protected]>
Co-authored-by: Varun Sundar Rabindranath <[email protected]> Co-authored-by: Varun Sundar Rabindranath <[email protected]>
…project#4985) Co-authored-by: Elisei Smirnov <[email protected]>
@@ -147,7 +147,7 @@ def __init__(
        self,
        model_name: str,
        dtype: str = "half",
-       access_token: Optional[str] = None,
+       **kwargs,
- access token not needed (we set HF_TOKEN in automation)
- kwargs enables us to pass whatever we want for the hf runner (see the sketch below)
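A minimal sketch of what the **kwargs pass-through could look like; the HfRunner name, the from_pretrained call site, and the trust_remote_code example are assumptions for illustration, not the actual test harness code:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


class HfRunner:
    """Hypothetical test wrapper around a HF model (illustration only)."""

    def __init__(self, model_name: str, dtype: str = "half", **kwargs) -> None:
        # Auth comes from the HF_TOKEN env var set by automation,
        # so no explicit access_token parameter is needed.
        torch_dtype = torch.float16 if dtype == "half" else "auto"
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        # Extra options (e.g. trust_remote_code=True) flow straight
        # through to from_pretrained via **kwargs.
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name, torch_dtype=torch_dtype, **kwargs)


# Example: pass arbitrary loading options without widening the signature.
runner = HfRunner("facebook/opt-125m", trust_remote_code=True)
```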
@@ -472,21 +472,21 @@ def _decode_token_by_position_index(

    def generate_greedy_logprobs_nm_use_tokens(
        self,
        prompts: List[str],
        input_ids_lst: List[torch.Tensor],
- previously, we were passing a prompt formatted with a chat template (which appends a bos token)
- then, we tokenized it here, which appended another bos token

This change permanently prevents that double bos token by forcing the test to fully tokenize everything up front (see the sketch below).
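A short sketch of the double-BOS issue being avoided; the model name and prompt are placeholders, not the test's actual inputs:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

# Applying the chat template already inserts the BOS token into the text.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Hello"}], tokenize=False)

# Re-tokenizing that string with default settings prepends a second BOS.
double_bos_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Tokenizing once up front (and passing input_ids into the test) avoids it;
# add_special_tokens=False stops the tokenizer from adding another BOS.
input_ids = tokenizer(prompt, add_special_tokens=False,
                      return_tensors="pt").input_ids
```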
…/nm-vllm into remote-push-refactor
@@ -37,7 +37,7 @@ jobs:
    test_label_solo: gcp-k8s-l4-solo
    test_label_multi: ignore
    test_timeout: 480
-   test_skip_list: neuralmagic/tests/skip-for-remote-push-tmp.txt
+   test_skip_env_vars: neuralmagic/tests/test_skip_env_vars/smoke.txt
For remote-push, maybe we don't need to run all 4 Python versions, and only 3.8/3.11 is good enough?
yeah, let's get to that after this is merged
I'd prefer not having a file as input. Instead, it'd be more flexible to just have a string listing each env var with its setting, something like:
TEST_ACCURACY=DISABLE,TEST_CORE=ENABLE
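A rough sketch of how such a string could be parsed and exported on the CI side; the function and variable names are made up for illustration, not the actual workflow code:

```python
import os
from typing import Dict


def parse_test_env_vars(spec: str) -> Dict[str, str]:
    """Parse a spec like 'TEST_ACCURACY=DISABLE,TEST_CORE=ENABLE' into a dict."""
    result: Dict[str, str] = {}
    for item in spec.split(","):
        if not item.strip():
            continue
        name, _, value = item.partition("=")
        result[name.strip()] = value.strip()
    return result


# The CI job could export each entry before invoking pytest.
for name, value in parse_test_env_vars(
        "TEST_ACCURACY=DISABLE,TEST_CORE=ENABLE").items():
    os.environ[name] = value
```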
thanks! we can adjust calling parameters and style later.
SUMMARY:
* updated model test structure to focus on core models
* refactored tests to use environment variables (currently at "test group" level, so each folder has an env variable); all tests are off by default and must be explicitly enabled
* refactored the build-test workflow to use a list of env variables rather than a skip-test list

WHY:
* this enables us to be more sane about what is and is not on, as opposed to a long list of files
* this enables us to actually track what is run and what is not run (via testmo, which tracks skipped tests)
* this enables us to have more fine-grained control over what is run vs. not run (we can add more env vars at the sub-group level to turn off more tests)

---------

Signed-off-by: kerthcet <[email protected]> Signed-off-by: Muralidhar Andoorveedu <[email protected]> Signed-off-by: pandyamarut <[email protected]> Co-authored-by: Alexander Matveev <[email protected]> Co-authored-by: Woosuk Kwon <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: Wenwei Zhang <[email protected]> Co-authored-by: Alexei-V-Ivanov-AMD <[email protected]> Co-authored-by: Alexey Kondratiev <[email protected]> Co-authored-by: Mor Zusman <[email protected]> Co-authored-by: Mor Zusman <[email protected]> Co-authored-by: Aurick Qiao <[email protected]> Co-authored-by: Kuntai Du <[email protected]> Co-authored-by: Antoni Baum <[email protected]> Co-authored-by: HUANG Fei <[email protected]> Co-authored-by: Isotr0py <[email protected]> Co-authored-by: Simon Mo <[email protected]> Co-authored-by: Michael Goin <[email protected]> Co-authored-by: Kante Yin <[email protected]> Co-authored-by: sasha0552 <[email protected]> Co-authored-by: SangBin Cho <[email protected]> Co-authored-by: Tyler Michael Smith <[email protected]> Co-authored-by: Cody Yu <[email protected]> Co-authored-by: raywanb <[email protected]> Co-authored-by: Nick Hill <[email protected]> Co-authored-by: Philipp Moritz <[email protected]> Co-authored-by: Letian Li <[email protected]> Co-authored-by: Murali Andoorveedu <[email protected]> Co-authored-by: Dipika Sikka <[email protected]> Co-authored-by: Varun Sundar Rabindranath <[email protected]> Co-authored-by: Varun Sundar Rabindranath <[email protected]> Co-authored-by: Elisei Smirnov <[email protected]> Co-authored-by: Elisei Smirnov <[email protected]> Co-authored-by: youkaichao <[email protected]> Co-authored-by: leiwen83 <[email protected]> Co-authored-by: Lei Wen <[email protected]> Co-authored-by: Eric Xihui Lin <[email protected]> Co-authored-by: beagleski <[email protected]> Co-authored-by: bapatra <[email protected]> Co-authored-by: Barun Patra <[email protected]> Co-authored-by: Lily Liu <[email protected]> Co-authored-by: Roger Wang <[email protected]> Co-authored-by: Zhuohan Li <[email protected]> Co-authored-by: Isotr0py <[email protected]> Co-authored-by: Michał Moskal <[email protected]> Co-authored-by: Ruth Evans <[email protected]> Co-authored-by: Divakar Verma <[email protected]> Co-authored-by: Roger Wang <[email protected]> Co-authored-by: Junichi Sato <[email protected]> Co-authored-by: Marut Pandya <[email protected]> Co-authored-by: afeldman-nm <[email protected]> Co-authored-by: Ronen Schaffer <[email protected]> Co-authored-by: Itay Etelis <[email protected]> Co-authored-by: omkar kakarparthi <[email protected]> Co-authored-by: Alexei V. Ivanov <[email protected]> Co-authored-by: Breno Faria <[email protected]> Co-authored-by: Breno Faria <[email protected]> Co-authored-by: Hyunsung Lee <[email protected]> Co-authored-by: Chansung Park <[email protected]> Co-authored-by: SnowDist <[email protected]> Co-authored-by: functionxu123 <[email protected]> Co-authored-by: xuhao <[email protected]> Co-authored-by: Domenic Barbuzzi <[email protected]>
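As a hedged illustration of the env-var gating described above (TEST_ACCURACY and the marker name are assumptions, not necessarily the repo's exact variable names or helpers):

```python
import os

import pytest

# Tests in a group are off by default and run only when the group's
# env var is explicitly set to ENABLE.
requires_accuracy_group = pytest.mark.skipif(
    os.getenv("TEST_ACCURACY", "DISABLE") != "ENABLE",
    reason="TEST_ACCURACY is not set to ENABLE",
)


@requires_accuracy_group
def test_accuracy_smoke():
    assert True  # placeholder for a real accuracy check
```

Because testmo records skipped tests, groups that are turned off still show up as skipped rather than silently absent.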