First, thank you for WhisperKit, it's awesome and nice to work with. Y'all have done a fantastic job.
Question: Should I expect to see different transcription results from WhisperKit depending on the compute hardware used? I'm aware that the ANE is 16-bit only while the GPU can run 32-bit, which could in theory produce different output predictions for the same input. In your testing, is that expected?
Secondly, across different hardware revisions (M1 vs. M2, for example), should one expect numerical stability assuming the same compute unit is chosen? I.e., does the M1 GPU produce the same result as the M2 GPU? Same for the ANE on M1 vs. M2?
I ask because I'm seeing some confusing transcription results while trying to choose a model/device config for a shipping app.
If the above is expected, can you kindly let me know what sort of variance to expect?
One major reason for variance in transcription results is the original Whisper decoding algorithm when temperature fallbacks are engaged (described in Section 4.5 of the Whisper paper). If an audio input triggers the temperature fallback conditions, all correct Whisper implementations that follow the OpenAI reference algorithm are expected to return non-deterministic results for that input.
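For example, to rule out fallback-driven variance when comparing devices, you can pin decoding to greedy sampling and disable fallbacks entirely. A minimal sketch, assuming the `DecodingOptions` fields (`temperature`, `temperatureFallbackCount`) and the `transcribe(audioPath:decodeOptions:)` call in your installed WhisperKit version; exact names may differ across releases:

```swift
import WhisperKit

// Sketch: greedy decoding with no temperature fallbacks, so repeated runs on
// the same hardware should produce (near-)identical token sequences.
// Option names are assumptions based on current WhisperKit; verify locally.
let greedyOptions = DecodingOptions(
    task: .transcribe,
    temperature: 0.0,               // greedy decoding only
    temperatureFallbackCount: 0     // never retry at higher temperatures
)

let pipe = try await WhisperKit(model: "base")
let results = try await pipe.transcribe(
    audioPath: "sample.wav",
    decodeOptions: greedyOptions
)
print(results.map(\.text).joined(separator: " "))
```

If outputs still differ run to run with fallbacks disabled, the remaining variance is more likely precision/kernel-level rather than the decoding algorithm.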
Some notes on hardware variance (which I don't think is a major factor in this case):
In an internal version of WhisperKit Benchmarks, we test and monitor cross-hardware variance of WER for the same model and WhisperKit config and cap the disparity at 20% relative WER, e.g. a measured WER below 0.12 passes against a reference WER of 0.1.
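For reference, that pass/fail check is just a relative comparison; an illustrative version:

```swift
// Illustrative only: a candidate WER passes if it is within 20% relative
// of the reference WER for the same model and WhisperKit config.
func passesRelativeWER(candidate: Double, reference: Double, tolerance: Double = 0.20) -> Bool {
    candidate <= reference * (1.0 + tolerance)
}

passesRelativeWER(candidate: 0.11, reference: 0.10) // true  (threshold is 0.12)
passesRelativeWER(candidate: 0.13, reference: 0.10) // false
```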
Most implementations of parallel floating-point computation are not fully deterministic unless specifically set to a deterministic state (to avoid certain non-deterministic kernels). There is no Apple hardware-specific reason I'm aware of that WhisperKit would have additional non-determinism.
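If you want to isolate hardware effects on your own devices, one approach is to pin the encoder and decoder to specific compute units and compare outputs across machines. A rough sketch, assuming WhisperKit's `ModelComputeOptions` (which wraps Core ML's `MLComputeUnits`) and a `computeOptions:` init parameter; check against the API in your version:

```swift
import CoreML
import WhisperKit

// Sketch: force the audio encoder and text decoder onto the GPU so you can
// compare, e.g., GPU-only runs on M1 vs. M2, or GPU vs. ANE on one machine.
// Parameter names are assumptions based on current WhisperKit; verify locally.
let gpuOnly = ModelComputeOptions(
    audioEncoderCompute: .cpuAndGPU,
    textDecoderCompute: .cpuAndGPU
)

let pipe = try await WhisperKit(model: "base", computeOptions: gpuOnly)
let results = try await pipe.transcribe(audioPath: "sample.wav")
```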