
model loading inference API #141

Open
clmnt opened this issue Nov 16, 2022 · 2 comments
Labels
bug Something isn't working

Comments

@clmnt
Member

clmnt commented Nov 16, 2022

Describe the bug

it gets stuck at model loading

Reproduction

go to https://huggingface.co/nitrosocke/classic-anim-diffusion and prompt for the first time

https://www.loom.com/share/10fdb5920e0248cc8162e145f8957d77

Logs

No response

System info

chrome
@clmnt clmnt added the bug Something isn't working label Nov 16, 2022
@osanseviero
Contributor

The first time, it says the model is loading. After a refresh, it turned out the model had loaded, so inference was fast that time. Moving to the community repo.

@osanseviero osanseviero transferred this issue from huggingface/huggingface_hub Nov 16, 2022
@Narsil
Contributor

Narsil commented Nov 17, 2022

Multiple known issues are at play:

  • Model loading does not use correct information: api-inference doesn't know how to "guess" a model's size properly, so the loading bar is inaccurate. It can never be fully accurate, but even a simple rule of thumb would make the bar larger and more representative.
  • First loads are always much longer due to downloading the weights
  • Sometimes, depending on cluster conditions, creating the Docker container is slower than usual (it depends on how many GPUs are in use, how many nodes are available, etc.; creating a new node on demand is much slower than just launching the pod).
  • Inference still takes 5-6s, which feels very "slow" to us humans. Using xformers and fast attention should help a bit (expected to go down to ~3s).

Here I'm thinking 1/ and 4/ are the items we can most effectively act on.
We're also working on adding tracing to the cluster so we have a better picture of 2/ and 3/.
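For reference, the loading behavior described above is visible to API clients as a 503 response whose JSON body includes an `estimated_time` field while the model warms up. Below is a minimal sketch of a client that retries on that signal; the `query_with_retry` helper and its parameters are our own illustration, not part of any official SDK (the `post` callable is injected so it can be swapped for `requests.post`):

```python
# Sketch: poll the hosted Inference API while a model is loading.
# Assumes the documented 503 + {"estimated_time": ...} loading response;
# helper names here are illustrative only.
import time

API_URL = "https://api-inference.huggingface.co/models/{model_id}"

def query_with_retry(model_id, payload, post, max_wait=300.0, sleep=time.sleep):
    """POST `payload` to the model endpoint; while the API answers 503
    (model loading), wait the server-suggested `estimated_time` seconds
    and retry, giving up after `max_wait` total seconds of waiting."""
    url = API_URL.format(model_id=model_id)
    waited = 0.0
    while True:
        resp = post(url, json=payload)
        if resp.status_code != 503:
            return resp
        # 503 body looks like {"error": "... is currently loading",
        #                      "estimated_time": 20.0}
        delay = float(resp.json().get("estimated_time", 10.0))
        if waited + delay > max_wait:
            raise TimeoutError(
                f"model {model_id} still loading after {waited:.0f}s"
            )
        sleep(delay)
        waited += delay
```

Usage would look like `query_with_retry("nitrosocke/classic-anim-diffusion", {"inputs": "a castle"}, requests.post)`; injecting `post` and `sleep` also keeps the retry logic easy to test without network access.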

