
accelerating OF execution #101

Open
jpiesing opened this issue Dec 11, 2024 · 11 comments
@jpiesing

The time taken to run the OF seems to be a growing barrier.

Are there any opportunities to accelerate it?

Is the QR code detection running on the main CPU? Does it exploit a GPU at all?
If it's running on the main CPU, is it limited to one core or does it use more than one?

In 10 minutes with Google, I found this reference to a (partially) GPU accelerated version of OpenCV - https://opencv.org/platforms/cuda/. Have you looked at this previously?
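For reference, whether a given OpenCV build actually includes the CUDA module can be checked at runtime. A minimal sketch (standard pip wheels of opencv-python are built without CUDA, so this typically prints 0):

```python
# Hypothetical helper: report how many CUDA devices the installed
# OpenCV build can see (0 if OpenCV is missing or built without CUDA).
def cuda_device_count():
    try:
        import cv2
        return cv2.cuda.getCudaEnabledDeviceCount()
    except (ImportError, AttributeError):
        return 0  # no OpenCV at all, or a build without the cuda module

print(cuda_device_count())
```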

@yanj-github
Collaborator

Hi @jpiesing, what is the maximum acceptable time to execute the OF? I normally run the OF on native Windows with all 6 Berlin tests at the same time overnight; they usually finish within a couple of hours, and by the next morning they have all completed. Is this considered too long?
At the moment, the quickest way to improve OF execution time is to run it outside of docker.

One OF process is single-threaded, so it only uses one CPU core.
Multiple OF processes can be invoked if required.
The OF does not use the GPU at the moment. GPU acceleration could be a way to speed it up; we have not looked into this yet, but I will check whether it is something that can be done.

@jpiesing
Author

@louaybassbouss Did I hear you mention 40 hours yesterday? What was that for?

@yanj-github
Collaborator

40 hours is really, really long to me. @louaybassbouss, you can run all 6 processes at the same time; did you run them one after the other, or all at once?

@louaybassbouss

@jpiesing On our side, OF time within docker is around 13x the recording time. For the Berlin plugfest recordings that is around 4-5 hours. But we ran a full set of tests in our lab: recording time is around 3h, so OF time is around 40h. The ratio of OF time to recording time is the better metric to measure performance.

@yanj-github
Collaborator

Thanks @louaybassbouss. I do think 40+ hours for a 3-hour recording is too long. To help me understand the issue correctly, can you kindly share your PC spec (number of CPUs and amount of memory)? Also, which OS are you running this on, and is it docker within VirtualBox or a native Linux system? And were any other processes/containers running at the same time?

@jpiesing
Author

jpiesing commented Dec 11, 2024

> Hi @jpiesing, what is the maximum acceptable time to execute the OF? I normally run the OF on native Windows with all 6 Berlin tests at the same time overnight; they usually finish within a couple of hours, and by the next morning they have all completed. Is this considered too long? At the moment, the quickest way to improve OF execution time is to run it outside of docker.

IMHO it's important to continue being able to run all validated tests overnight.

We currently have 26 validated tests, which are a mixture of 30s and 1 minute. Suppose 20 are 30s and 6 are 1 minute (I've not checked; that's just a guess). That would be 16 minutes of tests, plus the gaps between tests in the recording: if each gap is 30s (say), that's a further 13 minutes, for a total of 29 minutes. Let's work with a round 30 minutes.

Fraunhofer are quoting 13 times the recording time, so 13 times 30 minutes (6.5 hours) fits in overnight. Good.

How would the 30 minutes increase when;

  • We have validated the beta tests? If all 7 are 30s running time and we're allowing 30s between tests, that's another 7 minutes in the recording, so roughly another 90 minutes of OF run-time, and 6.5 hours goes up to 8 hours.
  • What happens when we have validated audio tests? What would the 30 minutes increase to?
  • What happens when we have validated HEVC video tests?

Thirteen times the recording doesn't scale.
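The projections above can be sketched as a quick back-of-the-envelope model. The 13x factor, the 30s gap, and the test mix are the guesses from this thread, not measured values:

```python
# Back-of-the-envelope model of OF run-time using the numbers above:
# a guessed mix of validated tests, 30s gaps, and the quoted 13x factor.
OF_FACTOR = 13  # OF run-time / recording time (Fraunhofer's figure)
GAP_S = 30      # assumed gap between tests in the recording

def of_hours(test_durations_s):
    """Projected OF run-time in hours for a list of test durations (s)."""
    recording_s = sum(test_durations_s) + GAP_S * len(test_durations_s)
    return OF_FACTOR * recording_s / 3600

current = [30] * 20 + [60] * 6   # 26 validated tests (guessed mix)
with_beta = current + [30] * 7   # plus the 7 beta tests at 30s each

print(f"current:   {of_hours(current):.1f} h")    # ~6.3 h
print(f"with beta: {of_hours(with_beta):.1f} h")  # ~7.8 h
```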

@jpiesing
Author

Independent of GPU acceleration, I found the following two threads searching for "python, opencv, multiple cores".
https://stackoverflow.com/questions/32775555/how-to-use-python-and-opencv-with-multiprocessing
https://www.reddit.com/r/opencv/comments/40be62/if_i_want_opencv_to_use_all_the_cores_in_my/

Just a thought, would it be reasonably practical to modify the OF to divide a recording into N pieces, process each one separately and then join up the results? Would that be easier / safer / less work than a GPU?
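For what it's worth, the split-and-join idea could look something like the sketch below. `process_range` is a placeholder for the real per-frame QR analysis (which would open the recording with `cv2.VideoCapture` and seek to the start frame); only the chunking and the ordered merge are shown here:

```python
# Sketch: split a recording's frames into N contiguous ranges, process
# each range in its own (single-threaded) worker process, then join
# the per-chunk results back in order.
from multiprocessing import Pool

def frame_ranges(total_frames, n_workers):
    """Split [0, total_frames) into n_workers contiguous ranges."""
    step = -(-total_frames // n_workers)  # ceiling division
    return [(start, min(start + step, total_frames))
            for start in range(0, total_frames, step)]

def process_range(rng):
    start, end = rng
    # Placeholder: a real worker would seek to `start` in the video
    # and run QR detection on frames up to `end`.
    return [f"frame {i}: <detections>" for i in range(start, end)]

def run(total_frames, n_workers):
    with Pool(n_workers) as pool:
        chunks = pool.map(process_range, frame_ranges(total_frames, n_workers))
    # pool.map preserves input order, so a flat concatenation is enough.
    return [line for chunk in chunks for line in chunk]

if __name__ == "__main__":
    print(len(run(total_frames=100, n_workers=4)))
```

One caveat with this approach: any OF observation that spans a chunk boundary (e.g. a gap or transition straddling two ranges) would need overlap or a stitching pass at the joins.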

@louaybassbouss

louaybassbouss commented Dec 12, 2024

> Thanks @louaybassbouss. I do think 40+ hours for a 3-hour recording is too long. To help me understand the issue correctly, can you kindly share your PC spec (number of CPUs and amount of memory)? Also, which OS are you running this on, and is it docker within VirtualBox or a native Linux system? And were any other processes/containers running at the same time?

We use a powerful server with the following spec:

  • Dell PowerEdge R7615
  • Bare metal Server running Ubuntu 24.04.1 LTS Server without virtualisation (native linux)
    • OF runs within Docker (we prefer to run the OF in docker, even though running without docker may bring some performance improvement)
  • RAM: 512 GB RDIMM 4800MT/s
  • CPU: AMD EPYC 9124
    • CPU Cores: 16
    • Threads: 32
  • GPUs: 3x NVIDIA L40 (these are really powerful GPUs; if GPU acceleration is supported in the future, we can make use of them).

If we run multiple OF sessions, we still get the same 13x factor (we tested with 5 OF sessions in parallel).

@jpiesing
Author

My ancient desktop PC, running overnight, took the following time to process the Philips TV recording from the Berlin plugfest, as measured using the 'time' feature built into the shell.

real 496m26,285s
user 650m40,723s
sys 33m48,755s

mediainfo reports that recording as 33 min, 37s = 2017 seconds.
650m40.723s = 39040.723s

39040/2017=19.3

This was running the OF outside docker.
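As a cross-check, the arithmetic above can be automated. This helper (an illustration, not part of the OF) converts the shell's `time` output, with either a comma or a dot as the decimal separator, to seconds and computes the factor:

```python
import re

# Convert a shell `time` value like "650m40,723s" or "3h36m18.479s"
# to seconds; both comma and dot decimal separators appear above.
def to_seconds(t):
    m = re.fullmatch(r"(?:(\d+)h)?(?:(\d+)m)?(\d+)[.,](\d+)s", t)
    h, mins, s, frac = m.groups()
    return (int(h or 0) * 3600 + int(mins or 0) * 60
            + int(s) + float("0." + frac))

recording_s = 33 * 60 + 37                 # 2017 s per mediainfo
factor = to_seconds("650m40,723s") / recording_s
print(f"{factor:.2f}")                     # ~19.36, i.e. the 19.3 above
```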

@jpiesing
Author

Here are the results from running the Philips TV recording from the Berlin plugfest on my work laptop, also measured using the 'time' feature built into the shell.

real 161m37.586s
user 186m32.423s
sys 9m25.564s

It's the same file as the desktop PC in the previous comment, 33min, 37s = 2017 seconds.

186m32.423s = 11192.423s.

11192/2017 = 5.55.

The processor is an Intel Core i7-1365U - https://www.techpowerup.com/cpu-specs/core-i7-1365u.c3070

The results of 'lscpu' are as follows.

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 12
On-line CPU(s) list: 0-11
Vendor ID: GenuineIntel
Model name: 13th Gen Intel(R) Core(TM) i7-1365U
CPU family: 6
Model: 186
Thread(s) per core: 2
Core(s) per socket: 10
Socket(s): 1
Stepping: 3
CPU max MHz: 5200.0000
CPU min MHz: 400.0000
BogoMIPS: 5376.00
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves split_lock_detect user_shstk avx_vnni dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req hfi vnmi umip pku ospke waitpkg gfni vaes vpclmulqdq tme rdpid movdiri movdir64b fsrm md_clear serialize pconfig arch_lbr ibt flush_l1d arch_capabilities
Virtualisation features:
Virtualisation: VT-x
Caches (sum of all):
L1d: 352 KiB (10 instances)
L1i: 576 KiB (10 instances)
L2: 6.5 MiB (4 instances)
L3: 12 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-11
Vulnerabilities:
Gather data sampling: Not affected
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Reg file data sampling: Mitigation; Clear Register File
Retbleed: Not affected
Spec rstack overflow: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI BHI_DIS_S
Srbds: Not affected
Tsx async abort: Not affected

@jpiesing
Author

@nicholas-fr reports that running the OF on the Philips TV recording from the Berlin plugfest on his work laptop takes 3h36m18.479s = 12978s.
12978/2017 = 6.43.
Not as good as the 5.55 of my laptop, but a lot better than the 11-13x reported above.
