We provide benchmark results of spatiotemporal predictive learning (STL) methods on various video prediction datasets. More STL methods will be supported in the future; issues and PRs are welcome! Currently we only provide the benchmark results; trained models and logs will be released soon (contact us if you need these files in the meantime). Model files can be downloaded from Baidu Cloud (access code: tgr6).
Video Prediction Benchmarks

The currently supported spatiotemporal prediction methods are those evaluated in the tables below. The currently supported MetaFormer models for SimVP are:
- ViT (ICLR'2021)
- Swin-Transformer (ICCV'2021)
- MLP-Mixer (NeurIPS'2021)
- ConvMixer (OpenReview'2021)
- UniFormer (ICLR'2022)
- PoolFormer (CVPR'2022)
- ConvNeXt (CVPR'2022)
- VAN (ArXiv'2022)
- IncepU (SimVP.V1) (CVPR'2022)
- gSTA (SimVP.V2) (ArXiv'2022)
- HorNet (NeurIPS'2022)
- MogaNet (ArXiv'2022)
We provide benchmark results on the popular Moving MNIST dataset using the standard 10 → 10 frame prediction setting. For a fair comparison of different methods, we report the best results when models are trained to convergence. We provide config files in configs/mmnist.
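The MSE, MAE, SSIM, and PSNR columns in the tables below follow the usual per-frame definitions. As a rough guide, here is a minimal sketch in NumPy/scikit-image, assuming frames normalized to [0, 1] with shape (T, H, W); note that reported MSE/MAE are often summed over pixels per frame rather than averaged, so check the config files for the exact reduction.

```python
# Minimal sketch of per-frame video prediction metrics (illustrative only;
# the exact reductions used for the tables may differ).
import numpy as np
from skimage.metrics import structural_similarity

def frame_metrics(pred, true):
    """pred, true: float arrays in [0, 1] with shape (T, H, W)."""
    mse = np.mean((pred - true) ** 2, axis=(1, 2))        # per-frame MSE
    mae = np.mean(np.abs(pred - true), axis=(1, 2))       # per-frame MAE
    psnr = 10 * np.log10(1.0 / np.maximum(mse, 1e-12))    # peak value = 1.0
    ssim = np.array([structural_similarity(p, t, data_range=1.0)
                     for p, t in zip(pred, true)])
    return mse.mean(), mae.mean(), ssim.mean(), psnr.mean()
```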
Method | Setting | Params | FLOPs | FPS | MSE | MAE | SSIM | PSNR | Download |
---|---|---|---|---|---|---|---|---|---|
ConvLSTM-S | 200 epoch | 15.0M | 56.8G | 113 | 29.80 | 90.64 | 0.9288 | 22.10 | model | log |
ConvLSTM-L | 200 epoch | 33.8M | 127.0G | 50 | 27.78 | 86.14 | 0.9343 | 22.44 | model | log |
PredNet | 200 epoch | 12.5M | 8.6G | 659 | 161.38 | 201.16 | 0.7783 | 14.33 | model | log |
PhyDNet | 200 epoch | 3.1M | 15.3G | 182 | 28.19 | 78.64 | 0.9374 | 22.62 | model | log |
PredRNN | 200 epoch | 23.8M | 116.0G | 54 | 23.97 | 72.82 | 0.9462 | 23.28 | model | log |
PredRNN++ | 200 epoch | 38.6M | 171.7G | 38 | 22.06 | 69.58 | 0.9509 | 23.65 | model | log |
MIM | 200 epoch | 38.0M | 179.2G | 37 | 22.55 | 69.97 | 0.9498 | 23.56 | model | log |
MAU | 200 epoch | 4.5M | 17.8G | 201 | 26.86 | 78.22 | 0.9398 | 22.76 | model | log |
E3D-LSTM | 200 epoch | 51.0M | 298.9G | 18 | 35.97 | 78.28 | 0.9320 | 21.11 | model | log |
CrevNet | 200 epoch | 5.0M | 270.7G | 10 | 30.15 | 86.28 | 0.9350 | - | model | log |
PredRNN.V2 | 200 epoch | 23.9M | 116.6G | 52 | 24.13 | 73.73 | 0.9453 | 23.21 | model | log |
DMVFN | 200 epoch | 3.5M | 0.2G | 1145 | 123.67 | 179.96 | 0.8140 | 16.15 | model | log |
SimVP+IncepU | 200 epoch | 58.0M | 19.4G | 209 | 32.15 | 89.05 | 0.9268 | 21.84 | model | log |
SimVP+gSTA-S | 200 epoch | 46.8M | 16.5G | 282 | 26.69 | 77.19 | 0.9402 | 22.78 | model | log |
TAU | 200 epoch | 44.7M | 16.0G | 283 | 24.60 | 71.93 | 0.9454 | 23.19 | model | log |
ConvLSTM-S | 2000 epoch | 15.0M | 56.8G | 113 | 22.41 | 73.07 | 0.9480 | 23.54 | model | log |
PredNet | 2000 epoch | 12.5M | 8.6G | 659 | 31.85 | 90.01 | 0.9273 | 21.85 | model | log |
PhyDNet | 2000 epoch | 3.1M | 15.3G | 182 | 20.35 | 61.47 | 0.9559 | 24.21 | model | log |
PredRNN | 2000 epoch | 23.8M | 116.0G | 54 | 26.43 | 77.52 | 0.9411 | 22.90 | model | log |
PredRNN++ | 2000 epoch | 38.6M | 171.7G | 38 | 14.07 | 48.91 | 0.9698 | 26.37 | model | log |
MIM | 2000 epoch | 38.0M | 179.2G | 37 | 14.73 | 52.31 | 0.9678 | 25.99 | model | log |
MAU | 2000 epoch | 4.5M | 17.8G | 201 | 22.25 | 67.96 | 0.9511 | 23.68 | model | log |
E3D-LSTM | 2000 epoch | 51.0M | 298.9G | 18 | 24.07 | 77.49 | 0.9436 | 23.19 | model | log |
PredRNN.V2 | 2000 epoch | 23.9M | 116.6G | 52 | 17.26 | 57.22 | 0.9624 | 25.01 | model | log |
SimVP+IncepU | 2000 epoch | 58.0M | 19.4G | 209 | 21.15 | 64.15 | 0.9536 | 23.99 | model | log |
SimVP+gSTA-S | 2000 epoch | 46.8M | 16.5G | 282 | 15.05 | 49.80 | 0.9675 | 25.97 | model | log |
TAU | 2000 epoch | 44.7M | 16.0G | 283 | 15.69 | 51.46 | 0.9661 | 25.71 | model | log |
Since the hidden Translator in SimVP can be replaced by any MetaFormer block that performs token mixing and channel mixing, we benchmark popular MetaFormer architectures on SimVP with 200-epoch and 2000-epoch training. We provide config files in configs/mmnist/simvp.
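For intuition, a minimal sketch of the MetaFormer abstraction is shown below: a token mixer and a channel MLP, each behind a norm with a residual connection. The class and argument names here are hypothetical; the actual translator blocks live in the repository's model code.

```python
import torch.nn as nn

class MetaFormerBlock(nn.Module):
    """Generic MetaFormer block: token mixing, then channel mixing (MLP)."""
    def __init__(self, dim, token_mixer, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mixer = token_mixer          # attention, pooling, conv, ...
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(               # channel mixing
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, x):                       # x: (B, num_tokens, dim)
        x = x + self.token_mixer(self.norm1(x))
        x = x + self.mlp(self.norm2(x))
        return x
```

Swapping the token mixer (self-attention for ViT, pooling for PoolFormer, large-kernel convolution for VAN, and so on) yields the variants in the table below.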
MetaFormer | Setting | Params | FLOPs | FPS | MSE | MAE | SSIM | PSNR | Download |
---|---|---|---|---|---|---|---|---|---|
IncepU (SimVPv1) | 200 epoch | 58.0M | 19.4G | 209 | 32.15 | 89.05 | 0.9268 | 21.84 | model | log |
gSTA (SimVPv2) | 200 epoch | 46.8M | 16.5G | 282 | 26.69 | 77.19 | 0.9402 | 22.78 | model | log |
ViT | 200 epoch | 46.1M | 16.9G | 290 | 35.15 | 95.87 | 0.9139 | 21.67 | model | log |
Swin Transformer | 200 epoch | 46.1M | 16.4G | 294 | 29.70 | 84.05 | 0.9331 | 22.22 | model | log |
Uniformer | 200 epoch | 44.8M | 16.5G | 296 | 30.38 | 85.87 | 0.9308 | 22.13 | model | log |
MLP-Mixer | 200 epoch | 38.2M | 14.7G | 334 | 29.52 | 83.36 | 0.9338 | 22.22 | model | log |
ConvMixer | 200 epoch | 3.9M | 5.5G | 658 | 32.09 | 88.93 | 0.9259 | 21.93 | model | log |
Poolformer | 200 epoch | 37.1M | 14.1G | 341 | 31.79 | 88.48 | 0.9271 | 22.03 | model | log |
ConvNeXt | 200 epoch | 37.3M | 14.1G | 344 | 26.94 | 77.23 | 0.9397 | 22.74 | model | log |
VAN | 200 epoch | 44.5M | 16.0G | 288 | 26.10 | 76.11 | 0.9417 | 22.89 | model | log |
HorNet | 200 epoch | 45.7M | 16.3G | 287 | 29.64 | 83.26 | 0.9331 | 22.26 | model | log |
MogaNet | 200 epoch | 46.8M | 16.5G | 255 | 25.57 | 75.19 | 0.9429 | 22.99 | model | log |
TAU | 200 epoch | 44.7M | 16.0G | 283 | 24.60 | 71.93 | 0.9454 | 23.19 | model | log |
IncepU (SimVPv1) | 2000 epoch | 58.0M | 19.4G | 209 | 21.15 | 64.15 | 0.9536 | 23.99 | model | log |
gSTA (SimVPv2) | 2000 epoch | 46.8M | 16.5G | 282 | 15.05 | 49.80 | 0.9675 | 25.97 | model | log |
ViT | 2000 epoch | 46.1M | 16.9G | 290 | 19.74 | 61.65 | 0.9539 | 24.59 | model | log |
Swin Transformer | 2000 epoch | 46.1M | 16.4G | 294 | 19.11 | 59.84 | 0.9584 | 24.53 | model | log |
Uniformer | 2000 epoch | 44.8M | 16.5G | 296 | 18.01 | 57.52 | 0.9609 | 24.92 | model | log |
MLP-Mixer | 2000 epoch | 38.2M | 14.7G | 334 | 18.85 | 59.86 | 0.9589 | 24.58 | model | log |
ConvMixer | 2000 epoch | 3.9M | 5.5G | 658 | 22.30 | 67.37 | 0.9507 | 23.73 | model | log |
Poolformer | 2000 epoch | 37.1M | 14.1G | 341 | 20.96 | 64.31 | 0.9539 | 24.15 | model | log |
ConvNeXt | 2000 epoch | 37.3M | 14.1G | 344 | 17.58 | 55.76 | 0.9617 | 25.06 | model | log |
VAN | 2000 epoch | 44.5M | 16.0G | 288 | 16.21 | 53.57 | 0.9646 | 25.49 | model | log |
HorNet | 2000 epoch | 45.7M | 16.3G | 287 | 17.40 | 55.70 | 0.9624 | 25.14 | model | log |
MogaNet | 2000 epoch | 46.8M | 16.5G | 255 | 15.67 | 51.84 | 0.9661 | 25.70 | model | log |
TAU | 2000 epoch | 44.7M | 16.0G | 283 | 15.69 | 51.46 | 0.9661 | 25.71 | model | log |
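Params and FLOPs in these tables are profiled on a single input clip. Below is a sketch of one way to obtain such counts with the thop profiler (an assumption; the repository may use a different counter, and note that "FLOPs" is often reported as MACs, which differ by a factor of two). The stand-in model here is hypothetical.

```python
import torch
from thop import profile  # pip install thop

# Hypothetical stand-in for a video model: flattens (T, C) and applies a conv.
model = torch.nn.Sequential(torch.nn.Flatten(1, 2),
                            torch.nn.Conv2d(10, 10, 3, padding=1))
clip = torch.randn(1, 10, 1, 64, 64)   # (B, T, C, H, W), Moving MNIST size
macs, params = profile(model, inputs=(clip,))
print(f"Params: {params / 1e6:.2f}M  MACs: {macs / 1e9:.2f}G")
```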
Similar to Moving MNIST, we also provide benchmark results on its advanced version, the MFMNIST (Moving FashionMNIST) benchmark, using the standard 10 → 10 frame prediction setting. For a fair comparison of different methods, we report the best results when models are trained to convergence. We provide config files in configs/mfmnist.
Method | Setting | Params | FLOPs | FPS | MSE | MAE | SSIM | PSNR | Download |
---|---|---|---|---|---|---|---|---|---|
ConvLSTM-S | 200 epoch | 15.0M | 56.8G | 113 | 28.87 | 113.20 | 0.8793 | 22.07 | model | log |
ConvLSTM-L | 200 epoch | 33.8M | 127.0G | 50 | 25.51 | 104.85 | 0.8928 | 22.67 | model | log |
PredNet | 200 epoch | 12.5M | 8.6G | 659 | 185.94 | 318.30 | 0.6713 | 14.83 | model | log |
PhyDNet | 200 epoch | 3.1M | 15.3G | 182 | 34.75 | 125.66 | 0.8567 | 22.03 | model | log |
PredRNN | 200 epoch | 23.8M | 116.0G | 54 | 22.01 | 91.74 | 0.9091 | 23.42 | model | log |
PredRNN++ | 200 epoch | 38.6M | 171.7G | 38 | 21.71 | 91.97 | 0.9097 | 23.45 | model | log |
MIM | 200 epoch | 38.0M | 179.2G | 37 | 23.09 | 96.37 | 0.9043 | 23.13 | model | log |
MAU | 200 epoch | 4.5M | 17.8G | 201 | 26.56 | 104.39 | 0.8916 | 22.51 | model | log |
E3D-LSTM | 200 epoch | 51.0M | 298.9G | 18 | 35.35 | 110.09 | 0.8722 | 21.27 | model | log |
PredRNN.V2 | 200 epoch | 23.9M | 116.6G | 52 | 24.13 | 97.46 | 0.9004 | 22.96 | model | log |
DMVFN | 200 epoch | 3.5M | 0.2G | 1145 | 118.32 | 220.02 | 0.7572 | 16.76 | model | log |
SimVP+IncepU | 200 epoch | 58.0M | 19.4G | 209 | 30.77 | 113.94 | 0.8740 | 21.81 | model | log |
SimVP+gSTA-S | 200 epoch | 46.8M | 16.5G | 282 | 25.86 | 101.22 | 0.8933 | 22.61 | model | log |
TAU | 200 epoch | 44.7M | 16.0G | 283 | 24.24 | 96.72 | 0.8995 | 22.87 | model | log |
Since the hidden Translator in SimVP can be replaced by any MetaFormer block that performs token mixing and channel mixing, we benchmark popular MetaFormer architectures on SimVP with 200-epoch training. We provide config files in configs/mfmnist/simvp.
MetaFormer | Setting | Params | FLOPs | FPS | MSE | MAE | SSIM | PSNR | Download |
---|---|---|---|---|---|---|---|---|---|
IncepU (SimVPv1) | 200 epoch | 58.0M | 19.4G | 209 | 30.77 | 113.94 | 0.8740 | 21.81 | model | log |
gSTA (SimVPv2) | 200 epoch | 46.8M | 16.5G | 282 | 25.86 | 101.22 | 0.8933 | 22.61 | model | log |
ViT | 200 epoch | 46.1M | 16.9G | 290 | 31.05 | 115.59 | 0.8712 | 21.83 | model | log |
Swin Transformer | 200 epoch | 46.1M | 16.4G | 294 | 28.66 | 108.93 | 0.8815 | 22.08 | model | log |
Uniformer | 200 epoch | 44.8M | 16.5G | 296 | 29.56 | 111.72 | 0.8779 | 21.97 | model | log |
MLP-Mixer | 200 epoch | 38.2M | 14.7G | 334 | 28.83 | 109.51 | 0.8803 | 22.01 | model | log |
ConvMixer | 200 epoch | 3.9M | 5.5G | 658 | 31.21 | 115.74 | 0.8709 | 21.71 | model | log |
Poolformer | 200 epoch | 37.1M | 14.1G | 341 | 30.02 | 113.07 | 0.8750 | 21.95 | model | log |
ConvNeXt | 200 epoch | 37.3M | 14.1G | 344 | 26.41 | 102.56 | 0.8908 | 22.49 | model | log |
VAN | 200 epoch | 44.5M | 16.0G | 288 | 31.39 | 116.28 | 0.8703 | 22.82 | model | log |
HorNet | 200 epoch | 45.7M | 16.3G | 287 | 29.19 | 110.17 | 0.8796 | 22.03 | model | log |
MogaNet | 200 epoch | 46.8M | 16.5G | 255 | 25.14 | 99.69 | 0.8960 | 22.73 | model | log |
TAU | 200 epoch | 44.7M | 16.0G | 283 | 24.24 | 96.72 | 0.8995 | 22.87 | model | log |
Similar to Moving MNIST, we further design an advanced version with complex backgrounds from CIFAR-10, i.e., the MMNIST-CIFAR benchmark, using the standard 10 → 10 frame prediction setting. For a fair comparison of different methods, we report the best results when models are trained to convergence. We provide config files in configs/mmnist_cifar.
Method | Setting | Params | FLOPs | FPS | MSE | MAE | SSIM | PSNR | Download |
---|---|---|---|---|---|---|---|---|---|
ConvLSTM-S | 200 epoch | 15.5M | 58.8G | 113 | 73.31 | 338.56 | 0.9204 | 23.09 | model | log |
ConvLSTM-L | 200 epoch | 34.4M | 130.0G | 50 | 62.86 | 291.05 | 0.9337 | 23.83 | model | log |
PredNet | 200 epoch | 12.5M | 8.6G | 945 | 286.70 | 514.14 | 0.8139 | 17.49 | model | log |
PhyDNet | 200 epoch | 3.1M | 15.3G | 182 | 142.54 | 700.37 | 0.8276 | 19.92 | model | log |
PredRNN | 200 epoch | 23.8M | 116.0G | 54 | 50.09 | 225.04 | 0.9499 | 24.90 | model | log |
PredRNN++ | 200 epoch | 38.6M | 171.7G | 38 | 44.19 | 198.27 | 0.9567 | 25.60 | model | log |
MIM | 200 epoch | 38.8M | 183.0G | 37 | 48.63 | 213.44 | 0.9521 | 25.08 | model | log |
MAU | 200 epoch | 4.5M | 17.8G | 201 | 58.84 | 255.76 | 0.9408 | 24.19 | model | log |
E3D-LSTM | 200 epoch | 52.8M | 306.0G | 18 | 80.79 | 214.86 | 0.9314 | 22.89 | model | log |
PredRNN.V2 | 200 epoch | 23.9M | 116.6G | 52 | 57.27 | 252.29 | 0.9419 | 24.24 | model | log |
DMVFN | 200 epoch | 3.6M | 0.2G | 960 | 298.73 | 606.92 | 0.7765 | 17.07 | model | log |
SimVP+IncepU | 200 epoch | 58.0M | 19.4G | 209 | 59.83 | 214.54 | 0.9414 | 24.15 | model | log |
SimVP+gSTA-S | 200 epoch | 46.8M | 16.5G | 282 | 51.13 | 185.13 | 0.9512 | 24.93 | model | log |
TAU | 200 epoch | 44.7M | 16.0G | 275 | 48.17 | 177.35 | 0.9539 | 25.21 | model | log |
Since the hidden Translator in SimVP can be replaced by any MetaFormer block that performs token mixing and channel mixing, we benchmark popular MetaFormer architectures on SimVP with 200-epoch training. We provide config files in configs/mmnist_cifar/simvp.
MetaFormer | Setting | Params | FLOPs | FPS | MSE | MAE | SSIM | PSNR | Download |
---|---|---|---|---|---|---|---|---|---|
IncepU (SimVPv1) | 200 epoch | 58.0M | 19.4G | 209 | 59.83 | 214.54 | 0.9414 | 24.15 | model | log |
gSTA (SimVPv2) | 200 epoch | 46.8M | 16.5G | 282 | 51.13 | 185.13 | 0.9512 | 24.93 | model | log |
ViT | 200 epoch | 46.1M | 16.9G | 290 | 64.94 | 234.01 | 0.9354 | 23.90 | model | log |
Swin Transformer | 200 epoch | 46.1M | 16.4G | 294 | 57.11 | 207.45 | 0.9443 | 24.34 | model | log |
Uniformer | 200 epoch | 44.8M | 16.5G | 296 | 56.96 | 207.51 | 0.9442 | 24.38 | model | log |
MLP-Mixer | 200 epoch | 38.2M | 14.7G | 334 | 57.03 | 206.46 | 0.9446 | 24.34 | model | log |
ConvMixer | 200 epoch | 3.9M | 5.5G | 658 | 59.29 | 219.76 | 0.9403 | 24.17 | model | log |
Poolformer | 200 epoch | 37.1M | 14.1G | 341 | 60.98 | 219.50 | 0.9399 | 24.16 | model | log |
ConvNeXt | 200 epoch | 37.3M | 14.1G | 344 | 51.39 | 187.17 | 0.9503 | 24.89 | model | log |
VAN | 200 epoch | 44.5M | 16.0G | 288 | 59.59 | 221.32 | 0.9398 | 25.20 | model | log |
HorNet | 200 epoch | 45.7M | 16.3G | 287 | 55.79 | 202.73 | 0.9456 | 24.49 | model | log |
MogaNet | 200 epoch | 46.8M | 16.5G | 255 | 49.48 | 184.11 | 0.9521 | 25.07 | model | log |
TAU | 200 epoch | 44.7M | 16.0G | 275 | 48.17 | 177.35 | 0.9539 | 25.21 | model | log |
We provide benchmark results on the KittiCaltech Pedestrian dataset. For a fair comparison of different methods, we report the best results when models are trained to convergence. We provide config files in configs/kitticaltech.
Method | Setting | Params | FLOPs | FPS | MSE | MAE | SSIM | PSNR | LPIPS | Download |
---|---|---|---|---|---|---|---|---|---|---|
ConvLSTM-S | 100 epoch | 15.0M | 595.0G | 33 | 139.6 | 1583.3 | 0.9345 | 27.46 | 0.08575 | model | log |
E3D-LSTM* | 100 epoch | 54.9M | 1004G | 10 | 200.6 | 1946.2 | 0.9047 | 25.45 | 0.12602 | model | log |
PredNet | 100 epoch | 12.5M | 42.8G | 94 | 159.8 | 1568.9 | 0.9286 | 27.21 | 0.11289 | model | log |
PhyDNet | 100 epoch | 3.1M | 40.4G | 117 | 312.2 | 2754.8 | 0.8615 | 23.26 | 0.32194 | model | log |
MAU | 100 epoch | 24.3M | 172.0G | 16 | 177.8 | 1800.4 | 0.9176 | 26.14 | 0.09673 | model | log |
MIM | 100 epoch | 49.2M | 1858G | 39 | 125.1 | 1464.0 | 0.9409 | 28.10 | 0.06353 | model | log |
PredRNN | 100 epoch | 23.7M | 1216G | 17 | 130.4 | 1525.5 | 0.9374 | 27.81 | 0.07395 | model | log |
PredRNN++ | 100 epoch | 38.5M | 1803G | 12 | 125.5 | 1453.2 | 0.9433 | 28.02 | 0.13210 | model | log |
PredRNN.V2 | 100 epoch | 23.8M | 1223G | 52 | 147.8 | 1610.5 | 0.9330 | 27.12 | 0.08920 | model | log |
DMVFN | 100 epoch | 3.6M | 1.2G | 557 | 183.9 | 1531.1 | 0.9314 | 26.95 | 0.04942 | model | log |
SimVP+IncepU | 100 epoch | 8.6M | 60.6G | 57 | 160.2 | 1690.8 | 0.9338 | 26.81 | 0.06755 | model | log |
SimVP+gSTA-S | 100 epoch | 15.6M | 96.3G | 40 | 129.7 | 1507.7 | 0.9454 | 27.89 | 0.05736 | model | log |
TAU | 100 epoch | 44.7M | 80.0G | 55 | 131.1 | 1507.8 | 0.9456 | 27.83 | 0.05494 | model | log |
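The LPIPS column above measures perceptual distance with a learned network (lower is better). A minimal sketch with the lpips package follows; the AlexNet backbone is the package default, and whether the benchmark uses this backbone is an assumption.

```python
import torch
import lpips  # pip install lpips

loss_fn = lpips.LPIPS(net='alex')   # AlexNet backbone (package default; assumed here)
# LPIPS expects RGB tensors in [-1, 1] with shape (N, 3, H, W).
pred = torch.rand(1, 3, 128, 160) * 2 - 1
target = torch.rand(1, 3, 128, 160) * 2 - 1
distance = loss_fn(pred, target)    # lower means perceptually closer
print(distance.item())
```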
Since the hidden Translator in SimVP can be replaced by any MetaFormer block that performs token mixing and channel mixing, we benchmark popular MetaFormer architectures on SimVP with 100-epoch training. We provide config files in configs/kitticaltech/simvp.
MetaFormer | Setting | Params | FLOPs | FPS | MSE | MAE | SSIM | PSNR | LPIPS | Download |
---|---|---|---|---|---|---|---|---|---|---|
IncepU (SimVPv1) | 100 epoch | 8.6M | 60.6G | 57 | 160.2 | 1690.8 | 0.9338 | 26.81 | 0.06755 | model | log |
gSTA (SimVPv2) | 100 epoch | 15.6M | 96.3G | 40 | 129.7 | 1507.7 | 0.9454 | 27.89 | 0.05736 | model | log |
ViT* | 100 epoch | 12.7M | 155.0G | 25 | 146.4 | 1615.8 | 0.9379 | 27.43 | 0.06659 | model | log |
Swin Transformer | 100 epoch | 15.3M | 95.2G | 49 | 155.2 | 1588.9 | 0.9299 | 27.25 | 0.08113 | model | log |
Uniformer* | 100 epoch | 11.8M | 104.0G | 28 | 135.9 | 1534.2 | 0.9393 | 27.66 | 0.06867 | model | log |
MLP-Mixer | 100 epoch | 22.2M | 83.5G | 60 | 207.9 | 1835.9 | 0.9133 | 26.29 | 0.07750 | model | log |
ConvMixer | 100 epoch | 1.5M | 23.1G | 129 | 174.7 | 1854.3 | 0.9232 | 26.23 | 0.07758 | model | log |
Poolformer | 100 epoch | 12.4M | 79.8G | 51 | 153.4 | 1613.5 | 0.9334 | 27.38 | 0.07000 | model | log |
ConvNeXt | 100 epoch | 12.5M | 80.2G | 54 | 146.8 | 1630.0 | 0.9336 | 27.19 | 0.06987 | model | log |
VAN | 100 epoch | 14.9M | 92.5G | 41 | 127.5 | 1476.5 | 0.9462 | 27.98 | 0.05500 | model | log |
HorNet | 100 epoch | 15.3M | 94.4G | 43 | 152.8 | 1637.9 | 0.9365 | 27.09 | 0.06004 | model | log |
MogaNet | 100 epoch | 15.6M | 96.2G | 36 | 131.4 | 1512.1 | 0.9442 | 27.79 | 0.05394 | model | log |
TAU | 100 epoch | 44.7M | 80.0G | 55 | 131.1 | 1507.8 | 0.9456 | 27.83 | 0.05494 | model | log |
We provide long-term prediction benchmark results on the KTH Action dataset. For a fair comparison of different methods, we report the best results when models are trained to convergence. We provide config files in configs/kth. Note that 4xbs4 denotes DDP training on 4 GPUs with a batch size of 4 per GPU; a sketch of this setup follows.
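For reference, a minimal PyTorch DDP skeleton corresponding to the 4xbs4 setting is sketched below (hypothetical model and address; the repository's actual launcher may differ). The effective batch size is the per-GPU batch size times the number of GPUs, so 4xbs4 matches 1xbs16 and 2xbs8.

```python
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank, world_size):
    dist.init_process_group("nccl", init_method="tcp://127.0.0.1:29500",
                            rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    model = torch.nn.Conv2d(1, 16, 3).cuda(rank)   # stand-in for the video model
    model = DDP(model, device_ids=[rank])
    # ... build a per-rank DataLoader with batch_size=4 (effective batch 4 x 4 = 16),
    # run the training loop, then tear down the process group.
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(4,), nprocs=4)   # 4 GPUs, i.e. the "4xbs4" setting
```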
Method | Setting | GPUs | Params | FLOPs | FPS | MSE | MAE | SSIM | PSNR | LPIPS | Download |
---|---|---|---|---|---|---|---|---|---|---|---|
ConvLSTM-S | 100 epoch | 1xbs16 | 14.9M | 1368.0G | 16 | 47.65 | 445.5 | 0.8977 | 26.99 | 0.26686 | model | log |
E3D-LSTM | 100 epoch | 2xbs8 | 53.5M | 217.0G | 17 | 136.40 | 892.7 | 0.8153 | 21.78 | 0.48358 | model | log |
PredNet | 100 epoch | 1xbs16 | 12.5M | 3.4G | 399 | 152.11 | 783.1 | 0.8094 | 22.45 | 0.32159 | model | log |
PhyDNet | 100 epoch | 1xbs16 | 3.1M | 93.6G | 58 | 91.12 | 765.6 | 0.8322 | 23.41 | 0.50155 | model | log |
MAU | 100 epoch | 1xbs16 | 20.1M | 399.0G | 8 | 51.02 | 471.2 | 0.8945 | 26.73 | 0.25442 | model | log |
MIM | 100 epoch | 1xbs16 | 39.8M | 1099.0G | 17 | 40.73 | 380.8 | 0.9025 | 27.78 | 0.18808 | model | log |
PredRNN | 100 epoch | 1xbs16 | 23.6M | 2800.0G | 7 | 41.07 | 380.6 | 0.9097 | 27.95 | 0.21892 | model | log |
PredRNN++ | 100 epoch | 1xbs16 | 38.3M | 4162.0G | 5 | 39.84 | 370.4 | 0.9124 | 28.13 | 0.19871 | model | log |
PredRNN.V2 | 100 epoch | 1xbs16 | 23.6M | 2815.0G | 7 | 39.57 | 368.8 | 0.9099 | 28.01 | 0.21478 | model | log |
DMVFN | 100 epoch | 1xbs16 | 3.5M | 0.88G | 727 | 59.61 | 413.2 | 0.8976 | 26.65 | 0.12842 | model | log |
SimVP+IncepU | 100 epoch | 2xbs8 | 12.2M | 62.8G | 77 | 41.11 | 397.1 | 0.9065 | 27.46 | 0.26496 | model | log |
SimVP+gSTA-S | 100 epoch | 4xbs4 | 15.6M | 76.8G | 53 | 45.02 | 417.8 | 0.9049 | 27.04 | 0.25240 | model | log |
TAU | 100 epoch | 4xbs4 | 15.0M | 73.8G | 55 | 45.32 | 421.7 | 0.9086 | 27.10 | 0.22856 | model | log |
Since the hidden Translator in SimVP can be replaced by any MetaFormer block that performs token mixing and channel mixing, we benchmark popular MetaFormer architectures on SimVP with 100-epoch training. We provide config files in configs/kth/simvp.
MetaFormer | Setting | GPUs | Params | FLOPs | FPS | MSE | MAE | SSIM | PSNR | LPIPS | Download |
---|---|---|---|---|---|---|---|---|---|---|---|
IncepU (SimVPv1) | 100 epoch | 2xbs8 | 12.2M | 62.8G | 77 | 41.11 | 397.1 | 0.9065 | 27.46 | 0.26496 | model | log |
gSTA (SimVPv2) | 100 epoch | 2xbs8 | 15.6M | 76.8G | 53 | 45.02 | 417.8 | 0.9049 | 27.04 | 0.25240 | model | log |
ViT | 100 epoch | 2xbs8 | 12.7M | 112.0G | 28 | 56.57 | 459.3 | 0.8947 | 26.19 | 0.27494 | model | log |
Swin Transformer | 100 epoch | 2xbs8 | 15.3M | 75.9G | 65 | 45.72 | 405.7 | 0.9039 | 27.01 | 0.25178 | model | log |
Uniformer | 100 epoch | 2xbs8 | 11.8M | 78.3G | 43 | 44.71 | 404.6 | 0.9058 | 27.16 | 0.24174 | model | log |
MLP-Mixer | 100 epoch | 2xbs8 | 20.3M | 66.6G | 34 | 57.74 | 517.4 | 0.8886 | 25.72 | 0.28799 | model | log |
ConvMixer | 100 epoch | 2xbs8 | 1.5M | 18.3G | 175 | 47.31 | 446.1 | 0.8993 | 26.66 | 0.28149 | model | log |
Poolformer | 100 epoch | 2xbs8 | 12.4M | 63.6G | 67 | 45.44 | 400.9 | 0.9065 | 27.22 | 0.24763 | model | log |
ConvNeXt | 100 epoch | 2xbs8 | 12.5M | 63.9G | 72 | 45.48 | 428.3 | 0.9037 | 26.96 | 0.26253 | model | log |
VAN | 100 epoch | 2xbs8 | 14.9M | 73.8G | 55 | 45.05 | 409.1 | 0.9074 | 27.07 | 0.23116 | model | log |
HorNet | 100 epoch | 2xbs8 | 15.3M | 75.3G | 58 | 46.84 | 421.2 | 0.9005 | 26.80 | 0.26921 | model | log |
MogaNet | 100 epoch | 2xbs8 | 15.6M | 76.7G | 48 | 42.98 | 418.7 | 0.9065 | 27.16 | 0.25146 | model | log |
TAU | 100 epoch | 2xbs8 | 15.0M | 73.8G | 55 | 45.32 | 421.7 | 0.9086 | 27.10 | 0.22856 | model | log |
We further provide high-resolution benchmark results on the Human3.6M dataset. For a fair comparison of different methods, we report the best results when models are trained to convergence. We provide config files in configs/human.
Method | Setting | GPUs | Params | FLOPs (G) | FPS | MSE | MAE | SSIM | PSNR | LPIPS | Download |
---|---|---|---|---|---|---|---|---|---|---|---|
ConvLSTM-S | 50 epoch | 1xbs16 | 15.5M | 347.0 | 52 | 125.5 | 1566.7 | 0.9813 | 33.40 | 0.03557 | model | log |
E3D-LSTM | 50 epoch | 4xbs4 | 60.9M | 542.0 | 7 | 143.3 | 1442.5 | 0.9803 | 32.52 | 0.04133 | model | log |
PredNet | 50 epoch | 1xbs16 | 12.5M | 13.7 | 176 | 261.9 | 1625.3 | 0.9786 | 31.76 | 0.03264 | model | log |
PhyDNet | 50 epoch | 1xbs16 | 4.2M | 19.1 | 57 | 125.7 | 1614.7 | 0.9804 | 39.84 | 0.03709 | model | log |
MAU | 50 epoch | 1xbs16 | 20.2M | 105.0 | 6 | 127.3 | 1577.0 | 0.9812 | 33.33 | 0.03561 | model | log |
MIM | 50 epoch | 4xbs4 | 47.6M | 1051.0 | 17 | 112.1 | 1467.1 | 0.9829 | 33.97 | 0.03338 | model | log |
PredRNN | 50 epoch | 1xbs16 | 24.6M | 704.0 | 25 | 113.2 | 1458.3 | 0.9831 | 33.94 | 0.03245 | model | log |
PredRNN++ | 50 epoch | 1xbs16 | 39.3M | 1033.0 | 18 | 110.0 | 1452.2 | 0.9832 | 34.02 | 0.03196 | model | log |
PredRNN.V2 | 50 epoch | 1xbs16 | 24.6M | 708.0 | 24 | 114.9 | 1484.7 | 0.9827 | 33.84 | 0.03334 | model | log |
SimVP+IncepU | 50 epoch | 1xbs16 | 41.2M | 197.0 | 26 | 115.8 | 1511.5 | 0.9822 | 33.73 | 0.03467 | model | log |
SimVP+gSTA-S | 50 epoch | 1xbs16 | 11.3M | 74.6 | 52 | 108.4 | 1441.0 | 0.9834 | 34.08 | 0.03224 | model | log |
TAU | 50 epoch | 1xbs16 | 37.6M | 182.0 | 26 | 113.3 | 1390.7 | 0.9839 | 34.03 | 0.02783 | model | log |
Since the hidden Translator in SimVP can be replaced by any MetaFormer block that performs token mixing and channel mixing, we benchmark popular MetaFormer architectures on SimVP with 50-epoch training. We provide config files in configs/human/simvp.
MetaFormer | Setting | GPUs | Params | FLOPs (G) | FPS | MSE | MAE | SSIM | PSNR | LPIPS | Download |
---|---|---|---|---|---|---|---|---|---|---|---|
IncepU (SimVPv1) | 50 epoch | 1xbs16 | 41.2M | 197.0 | 26 | 115.8 | 1511.5 | 0.9822 | 33.73 | 0.03467 | model | log |
gSTA (SimVPv2) | 50 epoch | 1xbs16 | 11.3M | 74.6 | 52 | 108.4 | 1441.0 | 0.9834 | 34.08 | 0.03224 | model | log |
ViT | 50 epoch | 4xbs4 | 28.3M | 239.0 | 17 | 136.3 | 1603.5 | 0.9796 | 33.10 | 0.03729 | model | log |
Swin Transformer | 50 epoch | 1xbs16 | 38.8M | 188.0 | 28 | 133.2 | 1599.7 | 0.9799 | 33.16 | 0.03766 | model | log |
Uniformer | 50 epoch | 4xbs4 | 27.7M | 211.0 | 14 | 116.3 | 1497.7 | 0.9824 | 33.76 | 0.03385 | model | log |
MLP-Mixer | 50 epoch | 1xbs16 | 47.0M | 164.0 | 34 | 125.7 | 1511.9 | 0.9819 | 33.49 | 0.03417 | model | log |
ConvMixer | 50 epoch | 1xbs16 | 3.1M | 39.4 | 84 | 115.8 | 1527.4 | 0.9822 | 33.67 | 0.03436 | model | log |
Poolformer | 50 epoch | 1xbs16 | 31.2M | 156.0 | 30 | 118.4 | 1484.1 | 0.9827 | 33.78 | 0.03313 | model | log |
ConvNeXt | 50 epoch | 1xbs16 | 31.4M | 157.0 | 33 | 113.4 | 1469.7 | 0.9828 | 33.86 | 0.03305 | model | log |
VAN | 50 epoch | 1xbs16 | 37.5M | 182.0 | 24 | 111.4 | 1454.5 | 0.9831 | 33.93 | 0.03335 | model | log |
HorNet | 50 epoch | 1xbs16 | 28.1M | 143.0 | 33 | 118.1 | 1481.1 | 0.9824 | 33.73 | 0.03333 | model | log |
MogaNet | 50 epoch | 1xbs16 | 8.6M | 63.6 | 56 | 109.1 | 1446.4 | 0.9834 | 34.05 | 0.03163 | model | log |
TAU | 50 epoch | 1xbs16 | 37.6M | 182.0 | 26 | 113.3 | 1390.7 | 0.9839 | 34.03 | 0.02783 | model | log |