
Determine Feasibility of Koala-36M method for detector #441

Open
Breakthrough opened this issue Oct 17, 2024 · 4 comments · May be fixed by #459

Comments

@Breakthrough
Owner

Koala-36M proposes a significantly improved model for scene transition detection (paper: HTML or PDF)

See section 4.1, which uses an SVM classifier. The performance cost is just over a 2x slowdown; however, accuracy, precision, and recall all show marked improvements across the board that likely warrant this change for the majority of users.

@wjs018
Collaborator

wjs018 commented Oct 17, 2024

The table of results looks promising and the approach is interesting. They haven't published their code yet (coming soon according to their homepage). Ideally, they could also release a pre-trained SVM model that could simply be imported and used.

Detection Algorithm

The bit about this that I don't understand is how they are using temporal information. Looking through this, here is my understanding of how this algorithm works:

  • There are 2 metrics calculated for each frame:
    • The d_color metric is simply the correlation between sequential frames' RGB histograms. This is the exact same metric used by HistogramDetector in PySceneDetect.
    • The d_struct metric is calculated by taking, for each pixel of the current frame, the maximum of the grayscale image and the Canny edge-filtered image (equation 3 in section 4.1). The result is then compared to the previous frame's equation 3 output by way of SSIM (I think skimage has an implementation of this).
  • A trained SVM classifier looks at these two parameters for the frame and determines whether it should be classified as a scene transition (see the sketch after this list).
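
To make the two metrics concrete, here is a minimal Python sketch of how d_color and d_struct could be computed with OpenCV and skimage. This is only my reading of section 4.1, not the authors' code; the histogram bin counts, Canny thresholds, and function names are assumptions.

```python
# Hypothetical per-frame features; names and parameters are my own, not from the paper.
import cv2
import numpy as np
from skimage.metrics import structural_similarity


def frame_features(prev_bgr: np.ndarray, curr_bgr: np.ndarray) -> tuple[float, float]:
    # d_color: correlation between the two frames' RGB histograms
    # (same idea as HistogramDetector).
    hist_prev = cv2.calcHist([prev_bgr], [0, 1, 2], None, [16, 16, 16], [0, 256] * 3)
    hist_curr = cv2.calcHist([curr_bgr], [0, 1, 2], None, [16, 16, 16], [0, 256] * 3)
    d_color = cv2.compareHist(hist_prev, hist_curr, cv2.HISTCMP_CORREL)

    # d_struct: per-pixel max of grayscale and Canny edges (my reading of
    # equation 3), compared across frames with SSIM.
    def structure_map(bgr: np.ndarray) -> np.ndarray:
        gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
        edges = cv2.Canny(gray, 100, 200)
        return np.maximum(gray, edges)

    d_struct = structural_similarity(
        structure_map(prev_bgr), structure_map(curr_bgr), data_range=255
    )
    return float(d_color), float(d_struct)
```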

What I don't understand is the temporal information. They write:

> Regarding temporal information, we hypothesize that video changes are relatively stable over time. By estimating a Gaussian distribution of changes from past frames, if the current frame’s change exceeds the 3σ confidence interval, we consider it a significant transition.

So, I get that they are looking backwards X number of frames, finding the standard deviation, and calculating the current frame's Z-score. However, are they calculating the Z-score for both of the above metrics? An average of the two? Some other metric? It isn't clear. Additionally, it doesn't seem like this temporal information is used by the SVM since they explicitly say that the SVM only takes the two parameters calculated above. So, is something marked as a scene if the SVM classifies it as such or the frame's Z-score exceeds 3? I am not sure how to incorporate the temporal information here.
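
For reference, here is a rough sketch of how that 3σ temporal check might work, assuming a single per-frame "change" value is tracked; the window size and which metric the Z-score applies to are guesses, since the paper doesn't say.

```python
# Hypothetical rolling 3-sigma gate; window size and inputs are assumptions.
from collections import deque

import numpy as np


class TemporalGate:
    def __init__(self, window: int = 30, z_thresh: float = 3.0):
        self.history = deque(maxlen=window)
        self.z_thresh = z_thresh

    def is_transition(self, change: float) -> bool:
        # Estimate a Gaussian from past changes and flag the current frame if
        # its change falls outside the z_thresh * sigma interval.
        significant = False
        if len(self.history) >= 2:
            mu = float(np.mean(self.history))
            sigma = float(np.std(self.history))
            if sigma > 0 and abs(change - mu) > self.z_thresh * sigma:
                significant = True
        self.history.append(change)
        return significant
```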

SVM Training

This is an area that would be quite onerous to do on our own, so if they do release a model, it would be a tremendous help. When describing how their SVM is trained, they write (from 4.1):

> We treat image pairs from the same video source as negative examples and pairs from different video sources as positive examples.

I am assuming here that they are using their giant dataset for this. If I am trying to extract their data generation method from their very brief description, then it would be something like this:

  • To get training data showing a scene change:
    • Pick two different videos from your dataset
    • Pick a random frame from each
    • Calculate both features (d_color and d_struct) for the pair
    • Use these as training data showing a scene change
  • To get training data showing no scene change:
    • Pick a single, random video from your dataset
    • Get two frames from the video
    • Calculate both features (d_color and d_struct) for the pair
    • Use these as training data showing no scene change

The same method could be used to generate your test data as well. This would only be possible with a curated dataset like theirs that consists of single-scene videos.

Their dataset seems to be a giant list of YouTube videos with timestamps denoting the start and stop points of each clip, i.e. which part of each video is included in the dataset.

If we wanted to train our own SVM based on this dataset, it would be a huge task to reconstruct their dataset from the YouTube URLs and timestamps. Additionally, they don't really give any insight into the SVM parameters. I am far from an ML expert, so having some additional information on any of the options used for the SVM would be helpful for replication.
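
For illustration, here is a hypothetical sketch of the pair-sampling and SVM training described above, reusing the frame_features sketch from earlier in this thread. The kernel choice and all hyperparameters are pure assumptions, since the paper gives none.

```python
# Hypothetical training-data generation and SVM fit; not the authors' method.
import random

import numpy as np
from sklearn.svm import SVC


def sample_pair(videos: list[list[np.ndarray]], scene_change: bool) -> tuple[float, float]:
    if scene_change:
        # Positive example: one frame from each of two different single-scene videos.
        vid_a, vid_b = random.sample(videos, 2)
        f1, f2 = random.choice(vid_a), random.choice(vid_b)
    else:
        # Negative example: two frames from the same single-scene video.
        vid = random.choice(videos)
        f1, f2 = random.sample(vid, 2)
    return frame_features(f1, f2)  # (d_color, d_struct) from the earlier sketch


def train_svm(videos: list[list[np.ndarray]], n_pairs: int = 10000) -> SVC:
    X, y = [], []
    for _ in range(n_pairs):
        label = random.random() < 0.5
        X.append(sample_pair(videos, label))
        y.append(int(label))
    clf = SVC(kernel="rbf")  # kernel and parameters are a guess, not from the paper
    clf.fit(np.asarray(X), np.asarray(y))
    return clf
```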

@Breakthrough
Owner Author

> The table of results looks promising and the approach is interesting. They haven't published their code yet (coming soon according to their homepage). Ideally, they could also release a pre-trained SVM model that could simply be imported and used.

I'm curious what it would look like if we plotted the values for d_color and d_struct on a few videos to see if any obvious patterns emerge. If the data can be fitted by a typical kernel function, we may just need to get the coefficients correct. I also ran across the additive chi-squared kernel, which seems like a feature map that can be trained in linear time, but it does require training. The scikit-image SSIM function is skimage.metrics.structural_similarity.
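
For what it's worth, a minimal scikit-learn sketch of the additive chi-squared feature map might look like the following; the placeholder data, sample_steps value, and linear SVM on top are all assumptions.

```python
# Hypothetical use of an approximate additive chi-squared feature map + linear SVM.
import numpy as np
from sklearn.kernel_approximation import AdditiveChi2Sampler
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Placeholder (d_color, d_struct) pairs and cut / no-cut labels; the chi-squared
# sampler requires non-negative features, so real d_color values may need clamping.
X = np.random.rand(200, 2)
y = np.random.randint(0, 2, size=200)

model = make_pipeline(AdditiveChi2Sampler(sample_steps=2), LinearSVC())
model.fit(X, y)
```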

> I am not sure how to incorporate the temporal information here.

This section is very unclear to me as well, and you raise some good questions. Hopefully they will publish more information soon.

@wjs018
Collaborator

wjs018 commented Oct 18, 2024

> I'm curious what it would look like if we plotted the values for d_color and d_struct on a few videos to see if any obvious patterns emerge.

I am pretty sure that d_color is literally the same metric that HistogramDetector already uses. Calculating d_struct doesn't look too bad, but it would probably require adding a scikit-image dependency.

In doing some searching around about this, I also ran across CLIP. This is a pre-trained deep-learning transformer whose image embeddings can be used to measure similarity between two images. I found a Stack Overflow answer that has a great explanation with some examples. This might be an alternative similarity metric to SSIM. It would also require new dependencies, though, and I have no idea what the computational efficiency would be.
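
In case it helps, here is a hedged sketch of measuring image-image similarity with CLIP embeddings via the Hugging Face transformers package; the model checkpoint and preprocessing are assumptions, not a recommendation.

```python
# Hypothetical CLIP-based image similarity via cosine distance of embeddings.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def clip_similarity(img_a: Image.Image, img_b: Image.Image) -> float:
    inputs = processor(images=[img_a, img_b], return_tensors="pt")
    with torch.no_grad():
        emb = model.get_image_features(**inputs)
    emb = emb / emb.norm(dim=-1, keepdim=True)  # L2-normalize embeddings
    return float((emb[0] @ emb[1]).item())       # cosine similarity
```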

Breakthrough added commits that referenced this issue Nov 19–23, 2024:
Add `KoalaDetector` and `detect-koala` command. #441

@Breakthrough Breakthrough linked a pull request Nov 21, 2024 that will close this issue
@Breakthrough
Owner Author

Breakthrough commented Nov 24, 2024

I have something up and running in #459 if folks are interested in testing it out. You can find Python packages and Windows builds in that PR. The new detector is called detect-koala. Note that it will say there are 0 scenes detected until the very end, but it should still provide results.

Anyone willing to test this out is encouraged to do so and provide feedback here. You can use the new detector the same as any other one, e.g. `scenedetect -i video.mp4 detect-koala`.

On most of our test videos it provides identical results to the other fast-cut detectors:

[PySceneDetect] PySceneDetect 0.6.5-dev1
[PySceneDetect] Downscale factor set to 5, effective resolution: 256 x 108
[PySceneDetect] Detecting scenes...
  Detected: 0 | Progress: 100%|████████████████████████████████████████████████████████████████| 1980/1980 [00:03<00:00, 630.55frames/s] 
[PySceneDetect] Processed 1980 frames in 3.1 seconds (average 629.75 FPS).
[PySceneDetect] Detected 22 scenes, average shot length 3.8 seconds.
[PySceneDetect] Scene List:
-----------------------------------------------------------------------
 | Scene # | Start Frame |  Start Time  |  End Frame  |   End Time   |
-----------------------------------------------------------------------
 |      1  |           1 | 00:00:00.000 |          90 | 00:00:03.754 |
 |      2  |          91 | 00:00:03.754 |         210 | 00:00:08.759 |
 |      3  |         211 | 00:00:08.759 |         259 | 00:00:10.802 |
 |      4  |         260 | 00:00:10.802 |         374 | 00:00:15.599 |
 |      5  |         375 | 00:00:15.599 |         650 | 00:00:27.110 |
 |      6  |         651 | 00:00:27.110 |         818 | 00:00:34.117 |
 |      7  |         819 | 00:00:34.117 |         876 | 00:00:36.536 |
 |      8  |         877 | 00:00:36.536 |        1019 | 00:00:42.501 |
 |      9  |        1020 | 00:00:42.501 |        1055 | 00:00:44.002 |
 |     10  |        1056 | 00:00:44.002 |        1099 | 00:00:45.837 |
 |     11  |        1100 | 00:00:45.837 |        1176 | 00:00:49.049 |
 |     12  |        1177 | 00:00:49.049 |        1226 | 00:00:51.134 |
 |     13  |        1227 | 00:00:51.134 |        1260 | 00:00:52.552 |
 |     14  |        1261 | 00:00:52.552 |        1281 | 00:00:53.428 |
 |     15  |        1282 | 00:00:53.428 |        1334 | 00:00:55.639 |
 |     16  |        1335 | 00:00:55.639 |        1365 | 00:00:56.932 |
 |     17  |        1366 | 00:00:56.932 |        1590 | 00:01:06.316 |
 |     18  |        1591 | 00:01:06.316 |        1697 | 00:01:10.779 |
 |     19  |        1698 | 00:01:10.779 |        1871 | 00:01:18.036 |
 |     20  |        1872 | 00:01:18.036 |        1916 | 00:01:19.913 |
 |     21  |        1917 | 00:01:19.913 |        1966 | 00:01:21.999 |
 |     22  |        1967 | 00:01:21.999 |        1980 | 00:01:22.582 |
-----------------------------------------------------------------------
[PySceneDetect] Comma-separated timecode list:
  00:00:03.754,00:00:08.759,00:00:10.802,00:00:15.599,00:00:27.110,00:00:34.117,00:00:36.536,00:00:42.501,00:00:44.002,00:00:45.837,00:00:49.049,00:00:51.134,00:00:52.552,00:00:53.428,00:00:55.639,00:00:56.932,00:01:06.316,00:01:10.779,00:01:18.036,00:01:19.913,00:01:21.999

Note that via the Python API the detector is called `KoalaDetector`.
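
For anyone testing via the API, a usage sketch might look like the following; the import path and constructor arguments for `KoalaDetector` are assumptions based on how the other detectors are used, so check #459 for the actual signature.

```python
# Hypothetical Python API usage, mirroring other PySceneDetect detectors.
from scenedetect import detect
from scenedetect.detectors import KoalaDetector  # import path assumed from the PR

scene_list = detect("video.mp4", KoalaDetector())
for i, (start, end) in enumerate(scene_list, start=1):
    print(f"Scene {i}: {start.get_timecode()} - {end.get_timecode()}")
```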

Breakthrough added a commit that referenced this issue Nov 24, 2024
Implement algorithm similar to that described in Koala-36M.

Add `KoalaDetector` and `detect-koala` command. #441