Different results for the same model on different hardware / compute device #260

vade opened this issue Nov 12, 2024 · 2 comments

vade commented Nov 12, 2024

Hi there,

First, thank you for WhisperKit. It's awesome and nice to work with; y'all have done a fantastic job.

Question: Should I expect to see different transcription results from WhisperKit depending on the compute hardware used? I'm aware that the ANE is float16-only while the GPU can run float32, and that in theory this could produce different output predictions for the same input. Based on your testing, should that be expected?
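(To make the precision point concrete, here's a toy example, not WhisperKit code: two logits that are distinct in float32 can round to the same float16 value, so a greedy argmax over them could flip depending on the compute precision.)

```swift
// Toy illustration (not WhisperKit code), runs on Apple silicon where
// Float16 is available: two float32 logits that differ become equal after
// rounding to float16, so a greedy argmax could pick a different token on
// the ANE (float16) than on the GPU (float32).
let logitA: Float = 1.0001
let logitB: Float = 1.0002

print(logitA < logitB)                   // true: distinct in float32
print(Float16(logitA) < Float16(logitB)) // false: both round to 1.0 in float16
```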

Secondly, across different hardware revisions (M1 vs. M2, for example), should one expect numerical stability assuming the same compute hardware is chosen? I.e., do the M1 GPU and M2 GPU produce the same result? Same for the ANE on M1 vs. M2?

I ask because I'm seeing some confusing transcription results while trying to choose a model / device config for a shipping app.

If the above is expected, can you kindly let me know what sort of variance to expect?

atiorh (Contributor) commented Nov 12, 2024

@vade Great question!

One major reason for variance in transcription results is the original Whisper decoding algorithm's temperature fallback mechanism (described in Section 4.5 of the Whisper paper). If an audio input triggers the temperature fallback conditions, all correct Whisper implementations that follow the OpenAI reference algorithm are expected to return non-deterministic results for that input, because decoding switches from greedy decoding to sampling at a non-zero temperature.
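For reference, here's a minimal sketch of that fallback loop. The temperature schedule and thresholds match the openai/whisper defaults; the `decode` stub and `DecodingResult` type are hypothetical stand-ins, not WhisperKit API:

```swift
// Sketch of the OpenAI reference temperature-fallback decoding loop.
struct DecodingResult {
    let text: String
    let avgLogProb: Double
    let compressionRatio: Double
}

func decode(_ audioFeatures: [Float], temperature: Double) -> DecodingResult {
    // Placeholder: a real implementation runs the Whisper text decoder here.
    // At temperature == 0 it decodes greedily; at temperature > 0 it samples
    // tokens, which is where the non-determinism enters.
    return DecodingResult(text: "", avgLogProb: 0.0, compressionRatio: 1.0)
}

func decodeWithFallback(
    _ audioFeatures: [Float],
    temperatures: [Double] = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0],
    compressionRatioThreshold: Double = 2.4,
    logProbThreshold: Double = -1.0
) -> DecodingResult {
    var result = decode(audioFeatures, temperature: temperatures[0])
    for temperature in temperatures.dropFirst() {
        let tooRepetitive = result.compressionRatio > compressionRatioThreshold
        let lowConfidence = result.avgLogProb < logProbThreshold
        if !tooRepetitive && !lowConfidence { break } // accept this pass
        // Fallback triggered: retry at the next, higher temperature.
        result = decode(audioFeatures, temperature: temperature)
    }
    // If all temperatures fail the checks, the last result is used anyway.
    return result
}
```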

atiorh (Contributor) commented Nov 12, 2024

Notes on hardware variance (which I don't think is a major factor in this case):

  • In an internal version of WhisperKit Benchmarks, we test and monitor the cross-hardware variance of WER for the same model and WhisperKit config, and we cap the disparity at 20% relative WER, e.g. <0.12 is a pass for a reference WER of 0.1 (see the sketch after this list).

  • Most implementations of parallel floating-point computation are not fully deterministic unless specifically configured for determinism (i.e., avoiding certain non-deterministic kernels). There is no Apple-hardware-specific reason I'm aware of for WhisperKit to have additional non-determinism.
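To make the first bullet concrete, here is a minimal sketch of that pass criterion (the function name is just for illustration; the actual benchmark harness is internal):

```swift
// Sketch of the 20% relative-WER pass criterion described above.
// A candidate hardware/compute-unit combo passes if its WER is within
// 20% (relative) of the reference WER for the same model and config.
func passesCrossHardwareCheck(
    candidateWER: Double,
    referenceWER: Double,
    maxRelativeDisparity: Double = 0.2
) -> Bool {
    candidateWER <= referenceWER * (1.0 + maxRelativeDisparity)
}

// Example from above: with a reference WER of 0.1, anything up to 0.12 passes.
print(passesCrossHardwareCheck(candidateWER: 0.115, referenceWER: 0.1)) // true
print(passesCrossHardwareCheck(candidateWER: 0.125, referenceWER: 0.1)) // false
```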
