
Recorded FPS vs actual FPS #56

Closed

sambo55 opened this issue Jun 23, 2020 · 15 comments


sambo55 commented Jun 23, 2020

I'm struggling to achieve the FPS reported in the command line.

For example, when I run inference on a 10-minute 30 fps video, the reported inference FPS is 300+.

I would expect the time taken to run inference on the entire video to be: 30 fps × 60 s × 10 min = 18,000 frames, and 18,000 frames / 300 FPS = 60 s = 1 min.

Yet the code takes at least 3 minutes to run. Is there something wrong with my calculation? Why would the reported FPS not match the actual FPS?


ceccocats commented Jun 23, 2020

Hi sambo,
"Inference time" is only inference time, it doesn't count preprocessing, postprocessing and visualisation.
Still the demo is only an example on how to use tkdnn is not the most optimized solution in term of preprocessing, postprocessing and visualisation.


sambo55 commented Jun 24, 2020

Thanks. Any pointers on how to optimise those aspects?

@ceccocats

OpenCV is comfortable but slow, especially for visualization. Find an OpenGL viewer that fits your needs.
For preprocessing and postprocessing, ensure that you are compiling OpenCV with CUDA and cudacodec.
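For context, here is a minimal sketch of what an OpenCV build with CUDA and cudacodec makes possible: decoding and preprocessing frames directly on the GPU. This is not code from the tkDNN demo; the file path and the 416×416 input size are placeholders.

```cpp
// Sketch only: GPU video decode + GPU preprocessing with OpenCV's CUDA modules.
#include <opencv2/core/cuda.hpp>
#include <opencv2/cudacodec.hpp>     // needs OpenCV built with the NVIDIA Video Codec SDK
#include <opencv2/cudaimgproc.hpp>   // cv::cuda::cvtColor
#include <opencv2/cudawarping.hpp>   // cv::cuda::resize

int main() {
    // Hardware-accelerated decode: frames land directly in GPU memory.
    cv::Ptr<cv::cudacodec::VideoReader> reader =
        cv::cudacodec::createVideoReader("video.mp4");        // placeholder path

    cv::cuda::GpuMat frame, bgr, resized, input;
    while (reader->nextFrame(frame)) {
        cv::cuda::cvtColor(frame, bgr, cv::COLOR_BGRA2BGR);   // decoder outputs BGRA
        cv::cuda::resize(bgr, resized, cv::Size(416, 416));   // example network input size
        resized.convertTo(input, CV_32F, 1.0 / 255.0);        // normalize on the GPU
        // hand `input` to the network here
    }
    return 0;
}
```

With this kind of pipeline the frame never takes a round trip through host memory before inference.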


mive93 commented Jun 29, 2020

Hi @sambo55,

If you have OpenCV (4.x) with contrib compiled for CUDA, then you can uncomment line 17 here and the preprocessing will be optimized on the GPU.
Another thing you could do is decouple inference and visualization using different threads.
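A rough, hypothetical sketch of that producer/consumer split follows; none of these names come from the tkDNN demo, and `runInference`/`drawAndShow` are placeholders for the detector call and the display code.

```cpp
// Sketch: one thread runs inference, another visualizes, linked by a queue.
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <opencv2/opencv.hpp>

struct Result { cv::Mat frame; /* plus the detected boxes */ };

std::queue<Result> results;
std::mutex mtx;
std::condition_variable ready;
bool done = false;

void inferenceLoop(cv::VideoCapture& cap) {
    cv::Mat frame;
    while (cap.read(frame)) {
        Result r;
        r.frame = frame.clone();
        // runInference(frame, r);              // placeholder: detector call
        {
            std::lock_guard<std::mutex> lk(mtx);
            results.push(std::move(r));
        }
        ready.notify_one();
    }
    { std::lock_guard<std::mutex> lk(mtx); done = true; }
    ready.notify_one();
}

void visualizationLoop() {
    for (;;) {
        std::unique_lock<std::mutex> lk(mtx);
        ready.wait(lk, [] { return !results.empty() || done; });
        if (results.empty() && done) break;
        Result r = std::move(results.front());
        results.pop();
        lk.unlock();
        // drawAndShow(r);                       // placeholder: boxes + cv::imshow
    }
}

int main() {
    cv::VideoCapture cap("video.mp4");           // placeholder path
    std::thread producer(inferenceLoop, std::ref(cap));
    std::thread consumer(visualizationLoop);
    producer.join();
    consumer.join();
    return 0;
}
```

In a real application you would bound the queue (or drop stale frames) so visualization cannot fall arbitrarily far behind inference.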

@rod-hendricks

I tried running batch size 1 vs 4 and I noticed that the inference speed is not as fast as I thought it would be on an RTX 2070. For size 1 I get ~6.6 ms inference time, while for size 4 I get ~17.7 ms (more than 2.6× slower). Are these numbers correct? I am only using ~1.6 GB of GPU memory and about ~40% processing power even when running size 4.

Would you guys know of a way to optimize this by utilizing more of the GPU power?

On a side note, my pre- and post-processing times are awful when I do size 4, taking a total of ~9 ms. I will check OpenCV again and make sure it is compiled with CUDA to see if that significantly improves things.

@rod-hendricks

I built opencv-4.2.0 with CUDA and cuDNN enabled and found that there was no substantial improvement in the pre- and post-processing portion of Yolo3Detection.cpp. In fact, on my end it was slower with OPENCV_CUDACONTRIB enabled (uncommenting `#define OPENCV_CUDACONTRIB // if OPENCV has been compiled with CUDA and contrib`) than with it disabled. Am I doing this right?

Also, I guess the inference speed I am getting cannot be optimized any further on my hardware?


mive93 commented Jul 15, 2020

Hi @rod-hendricks,
I checked the problem on both the Xavier and the RTX 2080 Ti, and it is actually true that enabling that define is worse for Yolo detectors (while it is better for other models such as MobileNet and CenterNet). The problem with Yolo is that there are too many unnecessary transfers between host and device.

If you really want to improve pre- and post-processing, you could implement CUDA kernels for those phases and keep everything on the device: copy the frame in at the beginning and copy the bounding boxes back at the end. We have not tried this solution yet, but it's on my list of things to improve.
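As a rough illustration of the kind of device-side preprocessing kernel described above (this is not code from the repository; the memory layout, names and normalization are assumptions): convert a packed BGR uint8 frame into a normalized planar float tensor entirely on the GPU, so the only host/device copies are the input frame and the output boxes.

```cuda
// Sketch: BGR interleaved (HWC, uint8) -> RGB planar (CHW, float in [0,1]) on the device.
#include <cuda_runtime.h>

__global__ void preprocessBGR2Planar(const unsigned char* __restrict__ bgr,
                                      float* __restrict__ chw,
                                      int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    int pix   = y * width + x;    // pixel index
    int plane = width * height;   // size of one channel plane

    chw[0 * plane + pix] = bgr[3 * pix + 2] / 255.0f;  // R
    chw[1 * plane + pix] = bgr[3 * pix + 1] / 255.0f;  // G
    chw[2 * plane + pix] = bgr[3 * pix + 0] / 255.0f;  // B
}

// Host-side launcher: d_bgr and d_chw are already device pointers.
void preprocessOnDevice(const unsigned char* d_bgr, float* d_chw,
                        int width, int height, cudaStream_t stream) {
    dim3 block(32, 8);
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    preprocessBGR2Planar<<<grid, block, 0, stream>>>(d_bgr, d_chw, width, height);
}
```

A matching device-side postprocessing step (decoding and thresholding the network output into boxes) would then leave only the final box list to copy back to the host.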

@rod-hendricks

Thanks for the response and advice @mive93! I am not sure yet about the effort vs. gain of doing this, so I'll see if I can work on it later.

I would like to ask further, though, about what you meant by Yolo having too many unnecessary transfers between host and device. Do you mean that within the Yolo inference task there are still data transfers between host and device before the network output is produced?


mive93 commented Jul 16, 2020

@rod-hendricks if I have updates on any improvements, I will let you know.

No, I meant only our preprocessing. The code should be cleaned up and fixed to remove an unnecessary transfer between host and device. When I have time, I'll fix that :)

@rod-hendricks

Thanks @mive93! Much appreciated. Good work on this repo. It's amazing!

mive93 closed this as completed Sep 11, 2020

m-kzein commented Aug 13, 2021

Hello @mive93, any updates on removing the unnecessary transfer between host and device?
Thanks.


mive93 commented Aug 19, 2021

Hi @MohammadKassemZein,
no updates yet, I'm sorry.


mkzein commented Nov 9, 2021

@mive93 Sorry to bother :) any updates on this?

mive93 pinned this issue Jan 19, 2022

mive93 commented Jan 19, 2022

Not yet, but maybe soon.
We already have the code in an internal project; we just need to merge it here.


mkzein commented Jan 20, 2022

Sounds great!
Thank you @mive93

mive93 added a commit that referenced this issue Jan 27, 2022