
New PR: Allow users to add their own customised model without editing existing faster-whisper code #1054

Closed

Conversation

blackpolarz

Added a user-friendly function to allow users to add their own Hugging Face ct2 models. This lets others use any newly converted faster-whisper model directly, without waiting for the existing faster-whisper code to be updated.

… existing faster-whisper code

Added a function to allow users to add their own Hugging Face ct2 models.
Provides a user-friendly way to test other models.
@jordimas
Contributor

jordimas commented Oct 11, 2024

Hello.

You can do this very easily already with the current stable version. For example:

WhisperModel("deepdml/faster-whisper-large-v3-turbo-ct2", device="cuda", compute_type="float16")

@blackpolarz
Author

Hi jordimas,

I do understand that it can be done easily with the current stable version, but I am looking at it from the perspective of someone building projects on top of faster-whisper.
Assuming there are multiple new models, in a GUI setting the programmer might want options like custom_model1, custom_model2, and so on. They would then have to map each custom name to its Hugging Face repository ID with an if-else chain or a match-case statement. Adding a pre-specified dictionary eliminates the need for downstream programmers to write that extra mapping code, providing additional flexibility within faster-whisper without additional code on their side.
I do agree that this is not useful if there is only one additional model. Feel free to give me further comments; I would be more than happy to hear them.

@BBC-Esq
Contributor

BBC-Esq commented Oct 11, 2024

I don't understand what this pull request does. Are you referring to adding a list of custom links to ct2 models above and beyond the Systran ones on Hugging Face, essentially?

@blackpolarz
Author

Yes. Basically, this PR allows users to modify the _MODELS dictionary in utils.py directly.
For example, the user has a dictionary of other faster-whisper models.

new_models = {"custom_model_1": "abc/custom-faster-whisper-1", 
              "custom_model_2": "abc/custom-faster-whisper-2",
              "custom_model_3": "abc/custom-faster-whisper-3"}

Users can add to the existing list of models simply by calling the function, then load the custom model:

utils.add_model(new_models)
WhisperModel(model_size_or_path="custom_model_1", device="cuda", compute_type="float16")

This essentially allows users to use any custom model under any custom name, as long as it is hosted on Hugging Face.

It also allows users to overwrite any existing Systran models if necessary.
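
For reference, a minimal sketch of what such a helper could look like inside faster_whisper/utils.py. This is a sketch under assumptions, not the PR's actual diff; _MODELS is assumed to be the existing module-level mapping of model names to Hugging Face repo IDs.

# Sketch only -- the function name matches the usage above, but the body is assumed.
# _MODELS is faster-whisper's existing dict mapping model names to Hugging Face repo IDs.
def add_model(new_models: dict) -> None:
    # Register user-supplied {name: repo_id} pairs; existing entries
    # (e.g. the Systran ones) are overwritten on name collision.
    _MODELS.update(new_models)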

@BBC-Esq
Contributor

BBC-Esq commented Oct 11, 2024

I like the general proposition. I think it's pure laziness for Systran to only provide select model precisions such as float16...yet ctranslate2 supports float32, bfloat16, int8, int8_float16, and so on...

To work around this, I had to create and upload all of my own precisions and/or quantizations here:

https://huggingface.co/ctranslate2-4you

The way I handle it in my program is to simply use a dictionary. For example, if an option within a GUI pulldown menu is Distil Whisper large-v3 - float32, this is what the relevant dictionary entry would look like:

    'Distil Whisper large-v3 - float32': {
        'name': 'Distil Whisper large-v3',
        'precision': 'float32',
        'repo_id': 'ctranslate2-4you/distil-whisper-large-v3-ct2-float32',
        'tokens_per_second': 160,
        'optimal_batch_size': 4,
        'avg_vram_usage': '3.0 GB'
    },

In other words: 1) the user selects an option from the pulldown menu; 2) the user clicks the load button or whatever it's called; 3) when the button is clicked, the script takes whatever item is selected in the pulldown menu and gets the relevant dictionary entry; 4) from that entry it specifically returns the "repo_id" child key, which is the Hugging Face repo ID; 5) this Hugging Face repo ID is what's actually used in the logic that runs the whisper model.

Then you can simply add new items to the dictionary!

HOWEVER, you'll also need to dynamically change the compute_type parameter in the logic that runs the model - e.g. float16, int8, and so on.

Again...in my dictionary you see a "precision" child key...simply return this child key value (just like you returned the one for "repo_id") and use it for the compute_type parameter's value!
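
Putting those steps together, here is a minimal sketch of the load-button handler, assuming the WHISPER_MODELS dictionary shown below (function and variable names are illustrative):

from faster_whisper import WhisperModel

# WHISPER_MODELS is the dictionary shown below (or imported from constants.py).
def load_selected_model(selection: str) -> WhisperModel:
    # Look up the dictionary entry for whatever is selected in the pulldown menu.
    entry = WHISPER_MODELS[selection]
    # The repo_id goes to faster-whisper and the precision drives compute_type.
    return WhisperModel(
        entry["repo_id"],
        device="cuda",
        compute_type=entry["precision"],
    )

# Example: model = load_selected_model("Distil Whisper large-v3 - float32")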

Here's my entire dictionary, for example:

WHISPER_MODELS = {
    # LARGE-V3
    'Distil Whisper large-v3 - float32': {
        'name': 'Distil Whisper large-v3',
        'precision': 'float32',
        'repo_id': 'ctranslate2-4you/distil-whisper-large-v3-ct2-float32',
        'tokens_per_second': 160,
        'optimal_batch_size': 4,
        'avg_vram_usage': '3.0 GB'
    },
    'Distil Whisper large-v3 - bfloat16': {
        'name': 'Distil Whisper large-v3',
        'precision': 'bfloat16',
        'repo_id': 'ctranslate2-4you/distil-whisper-large-v3-ct2-bfloat16',
        'tokens_per_second': 160,
        'optimal_batch_size': 4,
        'avg_vram_usage': '3.0 GB'
    },
    'Distil Whisper large-v3 - float16': {
        'name': 'Distil Whisper large-v3',
        'precision': 'float16',
        'repo_id': 'ctranslate2-4you/distil-whisper-large-v3-ct2-float16',
        'tokens_per_second': 160,
        'optimal_batch_size': 4,
        'avg_vram_usage': '3.0 GB'
    },
    'Whisper large-v3 - float32': {
        'name': 'Whisper large-v3',
        'precision': 'float32',
        'repo_id': 'ctranslate2-4you/whisper-large-v3-ct2-float32',
        'tokens_per_second': 85,
        'optimal_batch_size': 2,
        'avg_vram_usage': '5.5 GB'
    },
    'Whisper large-v3 - bfloat16': {
        'name': 'Whisper large-v3',
        'precision': 'bfloat16',
        'repo_id': 'ctranslate2-4you/whisper-large-v3-ct2-bfloat16',
        'tokens_per_second': 95,
        'optimal_batch_size': 3,
        'avg_vram_usage': '3.8 GB'
    },
    'Whisper large-v3 - float16': {
        'name': 'Whisper large-v3',
        'precision': 'float16',
        'repo_id': 'ctranslate2-4you/whisper-large-v3-ct2-float16',
        'tokens_per_second': 100,
        'optimal_batch_size': 3,
        'avg_vram_usage': '3.3 GB'
    },
    # MEDIUM.EN
    'Distil Whisper medium.en - float32': {
        'name': 'Distil Whisper medium.en',
        'precision': 'float32',
        'repo_id': 'ctranslate2-4you/distil-whisper-medium.en-ct2-float32',
        'tokens_per_second': 160,
        'optimal_batch_size': 4,
        'avg_vram_usage': '3.0 GB'
    },
    'Distil Whisper medium.en - bfloat16': {
        'name': 'Distil Whisper medium.en',
        'precision': 'bfloat16',
        'repo_id': 'ctranslate2-4you/distil-whisper-medium.en-ct2-bfloat16',
        'tokens_per_second': 160,
        'optimal_batch_size': 4,
        'avg_vram_usage': '3.0 GB'
    },
    'Distil Whisper medium.en - float16': {
        'name': 'Distil Whisper medium.en',
        'precision': 'float16',
        'repo_id': 'ctranslate2-4you/distil-whisper-medium.en-ct2-float16',
        'tokens_per_second': 160,
        'optimal_batch_size': 4,
        'avg_vram_usage': '3.0 GB'
    },
    'Whisper medium.en - float32': {
        'name': 'Whisper medium.en',
        'precision': 'float32',
        'repo_id': 'ctranslate2-4you/whisper-medium.en-ct2-float32',
        'tokens_per_second': 130,
        'optimal_batch_size': 6,
        'avg_vram_usage': '2.5 GB'
    },
    'Whisper medium.en - bfloat16': {
        'name': 'Whisper medium.en',
        'precision': 'bfloat16',
        'repo_id': 'ctranslate2-4you/whisper-medium.en-ct2-bfloat16',
        'tokens_per_second': 140,
        'optimal_batch_size': 7,
        'avg_vram_usage': '2.0 GB'
    },
    'Whisper medium.en - float16': {
        'name': 'Whisper medium.en',
        'precision': 'float16',
        'repo_id': 'ctranslate2-4you/whisper-medium.en-ct2-float16',
        'tokens_per_second': 145,
        'optimal_batch_size': 7,
        'avg_vram_usage': '1.8 GB'
    },
    # SMALL.EN
    'Distil Whisper small.en - float32': {
        'name': 'Distil Whisper small.en',
        'precision': 'float32',
        'repo_id': 'ctranslate2-4you/distil-whisper-small.en-ct2-float32',
        'tokens_per_second': 160,
        'optimal_batch_size': 4,
        'avg_vram_usage': '3.0 GB'
    },
    'Distil Whisper small.en - bfloat16': {
        'name': 'Distil Whisper small.en',
        'precision': 'bfloat16',
        'repo_id': 'ctranslate2-4you/distil-whisper-small.en-ct2-bfloat16',
        'tokens_per_second': 160,
        'optimal_batch_size': 4,
        'avg_vram_usage': '3.0 GB'
    },
    'Distil Whisper small.en - float16': {
        'name': 'Distil Whisper small.en',
        'precision': 'float16',
        'repo_id': 'ctranslate2-4you/distil-whisper-small.en-ct2-float16',
        'tokens_per_second': 160,
        'optimal_batch_size': 4,
        'avg_vram_usage': '3.0 GB'
    },
    'Whisper small.en - float32': {
        'name': 'Whisper small.en',
        'precision': 'float32',
        'repo_id': 'ctranslate2-4you/whisper-small.en-ct2-float32',
        'tokens_per_second': 180,
        'optimal_batch_size': 14,
        'avg_vram_usage': '1.5 GB'
    },
    'Whisper small.en - bfloat16': {
        'name': 'Whisper small.en',
        'precision': 'bfloat16',
        'repo_id': 'ctranslate2-4you/whisper-small.en-ct2-bfloat16',
        'tokens_per_second': 190,
        'optimal_batch_size': 15,
        'avg_vram_usage': '1.2 GB'
    },
    'Whisper small.en - float16': {
        'name': 'Whisper small.en',
        'precision': 'float16',
        'repo_id': 'ctranslate2-4you/whisper-small.en-ct2-float16',
        'tokens_per_second': 195,
        'optimal_batch_size': 15,
        'avg_vram_usage': '1.1 GB'
    },
    # BASE.EN
    'Whisper base.en - float32': {
        'name': 'Whisper base.en',
        'precision': 'float32',
        'repo_id': 'ctranslate2-4you/whisper-base.en-ct2-float32',
        'tokens_per_second': 230,
        'optimal_batch_size': 22,
        'avg_vram_usage': '1.0 GB'
    },
    'Whisper base.en - bfloat16': {
        'name': 'Whisper base.en',
        'precision': 'bfloat16',
        'repo_id': 'ctranslate2-4you/whisper-base.en-ct2-bfloat16',
        'tokens_per_second': 240,
        'optimal_batch_size': 23,
        'avg_vram_usage': '0.85 GB'
    },
    'Whisper base.en - float16': {
        'name': 'Whisper base.en',
        'precision': 'float16',
        'repo_id': 'ctranslate2-4you/whisper-base.en-ct2-float16',
        'tokens_per_second': 245,
        'optimal_batch_size': 23,
        'avg_vram_usage': '0.8 GB'
    },
    # TINY.EN
    'Whisper tiny.en - float32': {
        'name': 'Whisper tiny.en',
        'precision': 'float32',
        'repo_id': 'ctranslate2-4you/whisper-tiny.en-ct2-float32',
        'tokens_per_second': 280,
        'optimal_batch_size': 30,
        'avg_vram_usage': '0.7 GB'
    },
    'Whisper tiny.en - bfloat16': {
        'name': 'Whisper tiny.en',
        'precision': 'bfloat16',
        'repo_id': 'ctranslate2-4you/whisper-tiny.en-ct2-bfloat16',
        'tokens_per_second': 290,
        'optimal_batch_size': 31,
        'avg_vram_usage': '0.6 GB'
    },
    'Whisper tiny.en - float16': {
        'name': 'Whisper tiny.en',
        'precision': 'float16',
        'repo_id': 'ctranslate2-4you/whisper-tiny.en-ct2-float16',
        'tokens_per_second': 295,
        'optimal_batch_size': 31,
        'avg_vram_usage': '0.55 GB'
    },
}

Hope it helps!

@BBC-Esq
Contributor

BBC-Esq commented Oct 11, 2024

@blackpolarz So technically @jordimas is correct, but you need to implement a solution somewhat like mine if you want to let a user dynamically select the size or precision of the whisper model...you don't want a crap ton of if/elif/else branches, one for each permutation of model size and precision...that's ridiculous, and I used to do that. You can also use string manipulation or other mappings...but I like the dictionary approach because:

- It allows me to put the dictionary into my constants.py script and simply import it.
- It reduces the code in the actual script that performs the transcription.
- It's damn reliable, and you don't have to scrounge for errors in string manipulation.
- You can easily add/remove models, comment out portions of the dictionary, etc.

@blackpolarz
Author

@BBC-Esq Thanks for the recommendation, and I do agree that your method works.
However, if I am not wrong, we can simply specify the compute_type in WhisperModel and ctranslate2 will do the quantization when loading the model (https://opennmt.net/CTranslate2/quantization.html). As such, there is no need to create numerous versions of the same model with different compute types.
When comparing various compute types, what I would do is write a simple loop and rely on ctranslate2's built-in quantization to do its work (a rough loop is sketched below).
For a GUI, my approach would be to have a separate dropdown/pulldown or combobox to select the compute_type.

Either way, this is beyond the scope of this PR, which only aims to build on the existing faster-whisper structure by providing a simple way to register other custom models.
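
As a rough sketch of that comparison loop, relying on CTranslate2's quantization on load (the repo ID and audio file are just examples):

from faster_whisper import WhisperModel

repo_id = "deepdml/faster-whisper-large-v3-turbo-ct2"  # example repo mentioned earlier in the thread

# CTranslate2 converts the stored weights to each target compute type when the model
# is loaded, so one repo can be compared across several compute types.
for compute_type in ("float16", "bfloat16", "int8", "int8_float16"):
    model = WhisperModel(repo_id, device="cuda", compute_type=compute_type)
    segments, info = model.transcribe("audio.wav")
    text = " ".join(segment.text for segment in segments)
    print(compute_type, info.language, text[:80])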

@BBC-Esq
Contributor

BBC-Esq commented Oct 11, 2024

@blackpolarz You're technically correct. It's basically a tradeoff, and I had a brief conversation with @guillaumekln about this a while ago.

  1. When you quantize from float16 to int8, for example, the result is slightly lower quality than quantizing from float32 to int8. Likewise, if you quantize from float32 to float16 and then to int8, you cannot regain the precision lost across the two quantizations. In other words, quantizing from float32 to float16 loses some information, and then quantizing from float16 to int8 loses additional information; the total information lost is slightly MORE than if you had quantized directly from float32 to int8.

  2. Conversion takes time, albeit it's not horrendous.

Thus, it's a tradeoff...Yes, CT2 can convert at runtime...heck, you can even convert from int8 back up to float16 (though the quality lost at int8 isn't recovered)...So 1) to save compute time, and 2) to improve accuracy a little bit, I've chosen to upload the various permutations of quantizations that ct2 supports.

It's totally a personal choice, but all I'm saying is that it's good to make an informed decision once you fully understand the tradeoff I've described.
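
For illustration, converting once from the upstream checkpoint straight to the target precision (rather than re-quantizing an already-float16 repo) looks roughly like this with CTranslate2's Python converter API; the output directory and options are examples only:

from ctranslate2.converters import TransformersConverter

# Convert the upstream openai/whisper-large-v3 checkpoint directly to bfloat16 in one
# step, instead of re-quantizing a repo that was already reduced to float16.
# copy_files keeps the tokenizer files that faster-whisper expects alongside the model.
converter = TransformersConverter(
    "openai/whisper-large-v3",
    copy_files=["tokenizer.json", "preprocessor_config.json"],
)
converter.convert("whisper-large-v3-ct2-bfloat16", quantization="bfloat16")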

@BBC-Esq
Contributor

BBC-Esq commented Oct 11, 2024

To give another example...if you take the Systran float16 model and want to use bfloat16, it will not be as accurate as taking the float32 version and doing a single conversion to bfloat16. I prefer the quality, and I like knowing exactly which quantization I'm getting.

I'm surprised that they only distribute float16 versions. The last two generations of Nvidia cards support bfloat16, so why not? Also, if they're going to put defaults into their code, why not include all the quantizations?
