LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image Generation

PyTorch code of the ACM MM'23 paper: LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image Generation.

Introduction

In text-to-image generation, recent progress in Stable Diffusion has made it possible to generate a wide variety of novel photorealistic images. However, current models still suffer from misalignment issues (e.g., faulty understanding of spatial relations and numeration failures) in complex natural scenes, which impede high-faithfulness text-to-image generation. Although recent efforts have improved controllability through fine-grained guidance (e.g., sketches and scribbles), the issue has not been fundamentally resolved, because users must provide such guidance manually. In this work, we strive to synthesize high-fidelity images that are semantically aligned with a given textual prompt without any manually provided guidance. To this end, we propose a coarse-to-fine paradigm for layout planning and image generation. Concretely, we first generate a coarse-grained layout conditioned on the given textual prompt via in-context learning with Large Language Models. We then propose a fine-grained object-interaction diffusion method to synthesize high-faithfulness images conditioned on the prompt and the automatically generated layout. Extensive experiments demonstrate that the proposed method outperforms state-of-the-art models in both layout and image generation.
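
The first stage described above, layout planning via in-context learning, can be illustrated with a minimal sketch. The prompt format, the `Box` type, and all function names below are assumptions made for illustration only; they are not taken from this repository, and the prompt actually used by the policy model may differ.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Box:
    """A labeled bounding box in normalized [0, 1] coordinates (hypothetical type)."""
    label: str
    x0: float
    y0: float
    x1: float
    y1: float


def format_layout(boxes: List[Box]) -> str:
    """Serialize boxes as 'label: (x0, y0, x1, y1)' entries, one per line."""
    return "\n".join(
        f"{b.label}: ({b.x0:.2f}, {b.y0:.2f}, {b.x1:.2f}, {b.y1:.2f})" for b in boxes
    )


def build_layout_prompt(examples: List[Tuple[str, List[Box]]], query: str) -> str:
    """Assemble a few-shot prompt: solved (caption, layout) pairs, then the query caption."""
    parts = []
    for caption, boxes in examples:
        parts.append(f"Caption: {caption}\nLayout:\n{format_layout(boxes)}")
    # The final block leaves the layout empty for the LLM to complete.
    parts.append(f"Caption: {query}\nLayout:")
    return "\n\n".join(parts)


def parse_layout(text: str) -> List[Box]:
    """Parse lines in the format produced by format_layout; skip malformed lines."""
    boxes = []
    for line in text.strip().splitlines():
        label, sep, coords = line.rpartition(":")
        if not sep:
            continue  # no 'label: coords' separator on this line
        try:
            x0, y0, x1, y1 = (float(v) for v in coords.strip(" ()").split(","))
        except ValueError:
            continue  # wrong number of values or a non-numeric value
        boxes.append(Box(label.strip(), x0, y0, x1, y1))
    return boxes
```

In this sketch, the text returned by the LLM would be passed to `parse_layout`, and the resulting boxes would serve as the layout condition for the diffusion stage.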

[Figure: model overview]

Dataset: COCO-NSS1K

By filtering, scrutinizing, and sampling from captions of COCO 2014, we built a new benchmark called COCO-NSS1K to evaluate the Numerical reasoning, Spatial and Semantic Relation understanding of text-to-image generative models.

| Category | #Num | #Avg.bbox | #Avg.Cap.Len | Caption Examples |
|---|---|---|---|---|
| Numerical | 155 | 6.23 | 9.55 | • two old cell phones and a wooden table. • two plates some food and a fork knife and spoon. |
| Spatial | 200 | 5.35 | 10.25 | • a large clock tower next to a small white church. • a bowl with some noodles inside of it. |
| Semantic | 200 | 7.10 | 10.62 | • a train on a track traveling through a countryside. • a living room filled with couches, chairs, TV, and windows. |
| Mixed | 188 | 6.94 | 10.76 | • one motorcycle rider riding going up the mountain, two going down. • a group of three bathtubs sitting next to each other. |
| Null | 200 | 6.17 | 9.62 | • a kitchen scene complete with a dishwasher, sink, and an oven. • a person with a hat and some ski poles. |
| Total | 943 | 6.35 | 10.18 | - |
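
As a quick sanity check on the table, the per-category rows can be aggregated and compared with the Total row. The sketch below assumes the totals are count-weighted averages; since the per-category values are already rounded to two decimals, the recomputed averages may differ from the reported ones in the last digit.

```python
# Per-category statistics copied from the table above:
# (number of captions, average boxes per caption, average caption length).
STATS = {
    "Numerical": (155, 6.23, 9.55),
    "Spatial": (200, 5.35, 10.25),
    "Semantic": (200, 7.10, 10.62),
    "Mixed": (188, 6.94, 10.76),
    "Null": (200, 6.17, 9.62),
}

# Values reported in the Total row of the table.
REPORTED_TOTAL = (943, 6.35, 10.18)


def aggregate(stats):
    """Return (total count, count-weighted avg boxes, count-weighted avg caption length)."""
    total = sum(n for n, _, _ in stats.values())
    avg_bbox = sum(n * bbox for n, bbox, _ in stats.values()) / total
    avg_len = sum(n * length for n, _, length in stats.values()) / total
    return total, avg_bbox, avg_len


if __name__ == "__main__":
    total, avg_bbox, avg_len = aggregate(STATS)
    print(f"recomputed: {total} captions, {avg_bbox:.2f} boxes, {avg_len:.2f} words")
    print(f"reported:   {REPORTED_TOTAL}")
```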

Getting Started

Installation

1. Download repo and create environment

git clone https://github.com/LayoutLLM-T2I/LayoutLLM-T2I.git
cd LayoutLLM-T2I
conda create -n layoutllm_t2i python=3.8
conda activate layoutllm_t2i
pip install -r requirements.txt

2. Download and prepare the pretrained weights

The model consists of a policy model and a GLIGEN-based relation-aware diffusion model. The policy weights can be downloaded here and saved in POLICY_CKPT. The diffusion model weights can be downloaded here (Baidu) or here (Hugging Face) and saved in DIFFUSION_CKPT.

Text-to-Image Generation

Download the candidate example file, in which each instance was randomly sampled from COCO2014, and save it in CANDIDATE_PATH. Obtain an OPENAI_API_KEY from the OpenAI platform.

Run the generation code:

export OPENAI_API_KEY=OPENAI_API_KEY

python txt2img.py --folder generation_samples \
    --prompt "PROMPT" \
    --policy_ckpt_path POLICY_CKPT \
    --diff_ckpt_path DIFFUSION_CKPT \
    --cand_path CANDIDATE_PATH \
    --num_per_prompt 1

PROMPT denotes your prompt.
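
To generate images for several prompts, the command above can be assembled programmatically. The following is a minimal sketch that uses only the flags shown above; the helper name `build_txt2img_command` and the checkpoint and candidate file names in the usage section are hypothetical placeholders, not files shipped with this repository.

```python
import shlex
from typing import List


def build_txt2img_command(
    prompt: str,
    policy_ckpt: str,
    diff_ckpt: str,
    cand_path: str,
    folder: str = "generation_samples",
    num_per_prompt: int = 1,
) -> List[str]:
    """Build the argument list for txt2img.py using the flags from the command above."""
    return [
        "python", "txt2img.py",
        "--folder", folder,
        "--prompt", prompt,
        "--policy_ckpt_path", policy_ckpt,
        "--diff_ckpt_path", diff_ckpt,
        "--cand_path", cand_path,
        "--num_per_prompt", str(num_per_prompt),
    ]


if __name__ == "__main__":
    prompts = [
        "two old cell phones and a wooden table.",
        "a large clock tower next to a small white church.",
    ]
    for prompt in prompts:
        # Placeholder paths; replace them with your POLICY_CKPT, DIFFUSION_CKPT, and CANDIDATE_PATH.
        cmd = build_txt2img_command(prompt, "policy.ckpt", "diffusion.ckpt", "candidates.json")
        # Print a shell-safe command line; pass cmd to subprocess.run(cmd, check=True) to execute it.
        print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell-quoting problems when a prompt contains spaces, commas, or quotes.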

Training

To train the policy network, first download the images from COCO2014 and the sampled training examples. Alternatively, you may sample examples yourself. In addition, download the weights of the aesthetic predictor pre-trained on the LAION dataset.

Run the training code:

export OPENAI_API_KEY=OPENAI_API_KEY

python -u train_rl.py \
    --gpu GPU_ID \
    --exp EXPERIMENT_NAME \
    --img_dir IMAGE_DIR \
    --sampled_data_dir DATA_PATH \
    --diff_ckpt_path DIFFUSION_CKPT \
    --aesthetic_ckpt AESTHETIC_CKPT
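
Because training depends on several downloaded inputs and on an API key, it can help to verify them before launching a run. The sketch below is an optional pre-flight check and is not part of the repository; the function name `find_missing_inputs` and the example paths are hypothetical.

```python
import os
from typing import List, Mapping


def find_missing_inputs(
    paths: Mapping[str, str],
    env: Mapping[str, str] = os.environ,
) -> List[str]:
    """Return a list of human-readable problems; an empty list means all inputs were found."""
    problems = []
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    for name, path in paths.items():
        if not os.path.exists(path):
            problems.append(f"{name} not found: {path}")
    return problems


if __name__ == "__main__":
    # Placeholder paths; replace them with the values passed to train_rl.py.
    issues = find_missing_inputs({
        "IMAGE_DIR": "data/coco2014/images",
        "DATA_PATH": "data/sampled",
        "DIFFUSION_CKPT": "checkpoints/diffusion.ckpt",
        "AESTHETIC_CKPT": "checkpoints/aesthetic.pth",
    })
    for issue in issues:
        print(issue)
```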

Reference

@inproceedings{qu2023layoutllm,
  title={LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image Generation},
  author={Qu, Leigang and Wu, Shengqiong and Fei, Hao and Nie, Liqiang and Chua, Tat-Seng},
  booktitle={Proceedings of the 31st ACM International Conference on Multimedia},
  pages={643--654},
  year={2023}
}
