
Not an issue - richer datasets #6

Open
johndpope opened this issue Jul 25, 2021 · 7 comments

@johndpope

Are you familiar with this: https://twitter.com/e08477/status/1418440857578098691?s=21 ?

I want to do cityscape shots. Are you familiar with any relevant datasets?
Can this repo help output higher-quality images, or does it help with the prompting?

@mehdidc (Owner) commented Jul 26, 2021

Hi, I was not aware of these; they are very beautiful!

The repo is not meant to output higher-quality images (quality should be the same as in the VQGAN-CLIP examples) or to help with prompting. It is meant to do the same thing without needing an optimization loop for each prompt, and it can also generalize to new prompts that were not seen in the training set. All you need is to collect or build a dataset of prompts and train the model on it; once that is done, you can generate images from new prompts in a single step (so no optimization loop). I will also shortly upload pre-trained model(s) based on Conceptual Captions 12M prompts (https://github.com/google-research-datasets/conceptual-12m), if you would like to give it a try without re-training from scratch. Also, since you obtain a model at the end, you can additionally interpolate between the generated images of different prompts. I hope the goal of the repo is clearer.
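
To make the "no optimization loop" point concrete, here is a rough, self-contained PyTorch sketch of the two workflows. The linear layers and dimensions below are dummy stand-ins for the VQGAN decoder, the CLIP similarity, and the trained prompt-to-latent network; they are not this repo's actual API.

```python
# Rough sketch: classic VQGAN-CLIP optimizes a latent for every prompt, while
# the feed-forward model maps a prompt embedding to a latent in a single pass.
# The nn.Linear modules are dummy stand-ins, NOT this repo's actual API.
import torch
import torch.nn as nn

latent_dim, embed_dim, image_dim = 256, 512, 3 * 64 * 64
decoder = nn.Linear(latent_dim, image_dim)            # stand-in for the VQGAN decoder
scorer = nn.Linear(image_dim + embed_dim, 1)          # stand-in for CLIP image/text similarity
prompt_to_latent = nn.Linear(embed_dim, latent_dim)   # stand-in for the trained feed-forward model

def optimize_per_prompt(text_embed, steps=500, lr=0.1):
    """Classic VQGAN-CLIP: hundreds of gradient steps for each prompt."""
    z = torch.randn(1, latent_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        image = decoder(z)
        loss = -scorer(torch.cat([image, text_embed], dim=-1)).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return decoder(z)

def feed_forward_generate(text_embed):
    """This repo's idea: one forward pass of a trained network, no per-prompt loop."""
    with torch.no_grad():
        return decoder(prompt_to_latent(text_embed))

text_embed = torch.randn(1, embed_dim)                # stand-in for a CLIP text embedding
slow = optimize_per_prompt(text_embed, steps=5)       # per-prompt optimization (5 steps for demo)
fast = feed_forward_generate(text_embed)              # single forward pass
```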

@johndpope (Author)

"so no optimization loop" -
does that mean there's no 500x iterations to get a good looking image?

FYI: @nerdyrodent

@mehdidc (Owner) commented Jul 26, 2021

" does that mean there's no 500x iterations to get a good looking image?" Yes

@mehdidc (Owner) commented Jul 26, 2021

Following the tweet you mentioned above, here is an example with "deviantart, volcano": https://imgur.com/a/cYMsNo5, generated with a model currently being trained on Conceptual Captions 12M.

@mehdidc (Owner) commented Jul 27, 2021

@johndpope I added a bunch of pre-trained models if you want to give them a try.

@johndpope (Author)

I had a play with the 1.7 GB cc12m_32x1024 model. I couldn't get the high quality I was getting with VQGAN-CLIP; I'll keep trying and bumping up the dimensions. The docs could maybe use some pointers on output sizes (256x256, 512x512, etc.).
One thing is clear: this runs very quickly. Perhaps there could be an effort to serve it "hot", so you could give it a new prompt as a running service, almost in real time, without turning off the engine so to speak. We talk about FPS (frames per second); could we see a VQPS?

Here are some images I turned out over the weekend:
nerdyrodent/VQGAN-CLIP#13

Observations
When I threw in a parameter, it was clearly identifiable.
"Los Angeles | 35mm"
E.g. https://twitter.com/johndpope/status/1419352229031518209/photo/1

Los Angeles Album Cover
https://twitter.com/johndpope/status/1419354082192412679/photo/1

This one didn't quite cut it:
python -u main.py test pretrained_models/cc12m_32x1024/model.th "los angeles album cover"

Another improvement for newcomers: you could consider integrating these model downloads into the README:
https://github.com/nerdyrodent/VQGAN-CLIP/blob/5edb6a133944ee735025b8a92f6432d6c5fbf5eb/download_models.sh

@afiaka87 (Contributor) commented Jul 29, 2021

@johndpope have you considered re-embedding the outputs from the trained VitGAN as CLIP image embeddings, and then using those as prompts for a "normal" VQGAN-CLIP optimization with a much higher learning rate than usual and fewer steps? That would allow you to use non-square dimensions.
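
A minimal sketch of that re-embedding step, assuming the OpenAI clip package. "generated.png" is a placeholder for an image produced by the trained model, and how the resulting embedding is then passed to a particular VQGAN-CLIP script depends on that script.

```python
# Minimal sketch of re-embedding a generated image with CLIP (OpenAI clip package).
# "generated.png" is a placeholder for an output of the trained feed-forward model.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("generated.png")).unsqueeze(0).to(device)
with torch.no_grad():
    image_embed = model.encode_image(image)
    image_embed = image_embed / image_embed.norm(dim=-1, keepdim=True)  # normalize like CLIP text embeds

# image_embed can now serve as the optimization target in place of (or alongside)
# a text prompt, typically with a higher learning rate and fewer steps.
```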

Also, one of the other primary benefits of this approach is that if you'd like to fine-tune from one of the checkpoints, or even train your own model from scratch, this can be relatively simple: all you need are some captions, which can be generated or typed out. You'll want to cover a large-ish corpus, but using something like the provided MIT States captions as a base should be a good start.
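
As an illustration of how such a caption corpus could be assembled, here is a hypothetical sketch that expands a few templates into a plain-text file with one caption per line. The exact format expected by the training script may differ, so check the repo's configs.

```python
# Hypothetical example of building a small caption corpus from templates,
# written out one caption per line. The exact format expected by the training
# script may differ; check the repo's configs before using this.
import itertools

subjects = ["volcano", "city skyline", "forest", "ocean at sunset"]
styles = ["deviantart", "35mm photograph", "album cover", "oil painting"]

captions = [f"{style}, {subject}" for subject, style in itertools.product(subjects, styles)]

with open("my_captions.txt", "w") as f:
    f.write("\n".join(captions))
```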

Thanks for the extra info. I'm a little busy today, but I think the README might need one or two more things, and possibly a Colab notebook specific to training (if we don't have one already) that would make it easy to customize MIT States.

Edit: real-time updates to your captions / display of the rate of generation, etc., may be outside the scope of the project.
