
ShowHowTo: Generating Scene-Conditioned Step-by-Step Visual Instructions

This repository contains code for the paper ShowHowTo: Generating Scene-Conditioned Step-by-Step Visual Instructions.

Run the model on your images and prompts

  1. Environment setup

    • Use the provided Dockerfile to build the environment, or install the packages manually.
      docker build -t showhowto .
      docker run -it --rm -v $(pwd):$(pwd) -w $(pwd) --gpus=1 showhowto:latest bash
      
    • The code, as written, requires a GPU.
  2. Download ShowHowTo model weights

  3. Get predictions

    • Run the following command to get example predictions.
      python predict.py --ckpt_path ./weights/showhowto_2to8steps.pt \
                        --prompt_file ./test_data/prompt_file.txt \
                        --unconditional_guidance_scale 7.5
      
    • To run the model on your images and prompts, replace ./test_data/prompt_file.txt with your prompt file.
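
The --unconditional_guidance_scale flag controls classifier-free guidance. As a generic sketch of what this value does (this is the standard formulation, not the repository's exact implementation, and the function name is illustrative):

      def apply_cfg(noise_uncond, noise_cond, scale):
          # Blend the unconditional and conditional noise predictions; a scale of 1.0
          # disables guidance, while larger values (e.g. 7.5) make the generated frames
          # follow the prompt and input image conditioning more strongly.
          return noise_uncond + scale * (noise_cond - noise_uncond)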

Training

  1. Environment setup

    • Use the same environment as for the prediction (see above).
  2. Download DynamiCrafter model weights

  3. Get the dataset

    • To replicate our experiments, use the ShowHowTo dataset (see the Dataset section below) or your own dataset.
    • The dataset must have the following directory structure.
      dataset_root
      ├── prompts.json
      └── imgseqs
          ├── <sequenceid>.jpg
          │   ...
          └── ...
      
      There can be multiple directories with names starting with imgseqs.
    • The prompts.json file must have the following structure.
      {
        "<sequenceid>": ["prompt for the 1st frame", "prompt for the 2nd frame", ...],
        ...
      }
      
    • The sequence image <sequenceid>.jpg must have width N*W, where W is the width of each individual frame in the sequence; the height H is arbitrary. The number of frames N must match the length of the prompt list for that sequence in the prompts.json file. A sketch of how such a sequence image can be loaded and split back into frames is shown after the training steps below.
  4. Train

    • Run the training code.
      python train.py --local_batch_size 2 \
                      --dataset_root /path/to/ShowHowToTrain \
                      --ckpt_path weights/dynamicrafter_256_v1.ckpt
      
    • We trained on a single node with 8 GPUs and a batch size of 2 videos per GPU. Be advised that more than 40 GB of VRAM per GPU may be required to train with a batch size larger than 1.
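
For reference, below is a minimal sketch of how a dataset following the structure above can be checked and split back into individual frames and prompts. The function name is hypothetical and this is not the training data loader from train.py; it only illustrates the expected format.

      import glob
      import json
      import os
      from PIL import Image

      def load_sequence(dataset_root, sequence_id):
          # Read the per-frame prompts; their number defines N, the number of frames.
          with open(os.path.join(dataset_root, "prompts.json")) as f:
              prompts = json.load(f)[sequence_id]
          n_frames = len(prompts)

          # The sequence image may live in any directory whose name starts with "imgseqs".
          path = glob.glob(os.path.join(dataset_root, "imgseqs*", f"{sequence_id}.jpg"))[0]
          image = Image.open(path)

          # The image width must be N*W, i.e. divisible by the number of prompts.
          assert image.width % n_frames == 0, "width is not a multiple of the frame count"
          frame_width = image.width // n_frames

          # Split the horizontally concatenated strip back into N individual frames.
          frames = [image.crop((i * frame_width, 0, (i + 1) * frame_width, image.height))
                    for i in range(n_frames)]
          return frames, prompts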

Dataset

You can download the ShowHowTo dataset using the download_dataset.sh script. To also download the image sequences from our servers, you need a username and password, which you can obtain by sending an email to tomas.soucek at cvut dot cz specifying your name and affiliation. Please use your institutional email (i.e., not Gmail or similar).

You can also extract the dataset from the raw original videos with the following steps.

  1. Download the HowTo100M videos and the ShowHowTo prompts
    • The list of all video ids for both the train set and test set can be found here.
    • For each video, the keyframes.json file contains information on which video frames are part of the dataset.
    • The prompts for each video can also be found there, in the prompts.json file.
  2. Extract the video frames of the ShowHowTo dataset
    • To extract the frames from the videos, we used ffmpeg v7.0.1 with the following function.
      import subprocess
      import numpy as np

      def extract_frame(video, start_sec, frame_idx, width, height):
          # Decode at 5 fps, keep frames from start_sec onward, and select the frame_idx-th of those.
          ffmpeg_args = ['ffmpeg', '-i', video, '-f', 'rawvideo', '-pix_fmt', 'rgb24',
                         '-vf', f'fps=5,select=gte(t\\,{start_sec}),select=eq(n\\,{frame_idx})',
                         '-s', f'{width}x{height}', '-vframes', '1', 'pipe:']
          video_stream = subprocess.Popen(ffmpeg_args, stdout=subprocess.PIPE, stderr=subprocess.DEVNULL)

          # Read exactly one raw RGB frame from ffmpeg's stdout and reshape it into an image array.
          in_bytes = video_stream.stdout.read(width * height * 3)
          return np.frombuffer(in_bytes, np.uint8).reshape([height, width, 3])
      The function arguments are: video is the path to the video, start_sec and frame_idx are the values from keyframes.json, and width and height specify the output image size (we used the native video resolution here).
  3. Prepare the image sequences
    • Concatenate all frames from a video along the horizontal dimension and place the resulting image into dataset_root/imgseqs/<sequenceid>.jpg. The <sequenceid> is the YouTube video ID. A minimal sketch of this step is shown below.
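
A minimal sketch of the concatenation step, assuming the extracted frames of one sequence are available as a list of equally sized numpy arrays (e.g. produced by extract_frame above); the function name is hypothetical:

      import os
      import numpy as np
      from PIL import Image

      def save_sequence(frames, dataset_root, sequence_id):
          # Stack the frames side by side along the width axis into one strip of width N*W.
          strip = np.concatenate(frames, axis=1)
          out_path = os.path.join(dataset_root, "imgseqs", f"{sequence_id}.jpg")
          os.makedirs(os.path.dirname(out_path), exist_ok=True)
          Image.fromarray(strip).save(out_path)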

Citation

@article{soucek2024showhowto,
    title={ShowHowTo: Generating Scene-Conditioned Step-by-Step Visual Instructions},
    author={Sou\v{c}ek, Tom\'{a}\v{s} and Gatti, Prajwal and Wray, Michael and Laptev, Ivan and Damen, Dima and Sivic, Josef},
    month = {December},
    year = {2024}
}