Replace HoG with other features #64
I wrote the feature interface as a single function: features = esvm_features(I, sbin). To get the dimensionality of the features, call the function with no arguments: dim = esvm_features(). If you are working with a dense SIFT descriptor, think about what the output dimensionality will be in your case. If you associate a 128-D SIFT vector with every pixel in the image, then you are working with sbin=1. Here is some pseudocode of what you need to do to integrate SIFT into this framework:
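For instance, a minimal sketch of such a wrapper could look like the following (the function name esvm_features_dsift, the use of VLFeat's vl_dsift, and the way descriptors are scattered back into a cell grid are my assumptions, not code from this repository):

function out = esvm_features_dsift(I, sbin)
% Hypothetical dense-SIFT stand-in for esvm_features (illustrative only).
if nargin == 0
    out = 128;   % report feature dimensionality, mirroring dim = esvm_features()
    return;
end
if size(I, 3) == 3
    I = rgb2gray(I);
end
gray = im2single(I);
% one 128-D descriptor every sbin pixels
[frames, descrs] = vl_dsift(gray, 'Step', sbin, 'Size', sbin);
% scatter the 128 x N descriptors into an H x W x 128 cell grid so the
% sliding-window machinery can treat it like a HOG map
xs = frames(1, :); ys = frames(2, :);
xi = round((xs - min(xs)) / sbin) + 1;
yi = round((ys - min(ys)) / sbin) + 1;
out = zeros(max(yi), max(xi), 128, 'single');
for k = 1:numel(xi)
    out(yi(k), xi(k), :) = single(descrs(:, k));
end
end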
Hi, Tomasz, one more question about "M".
You'll have to look closely at the code: by default I was employing a method which is more powerful than Platt's calibration method. This method involves a "boosting" matrix which couples the activations of the different exemplars. It is a very simple equation, but I pulled it out of thin air. I never got learning to improve the result over my own simple heuristic. To learn more, I refer you to my doctoral dissertation, Section 3.3 Exemplar Co-occurrence Matrices. I'm not sure simply setting M.C = [] is enough; there is a mechanism for disabling the matrix part, but it will involve you poking around the code.
Thanks! I will read your doctoral dissertation. Another problem about replacing the features:
Since HMP features only return a 2-dimensional matrix, different from block-based SIFT and HoG, I found errors during
HOG is not typically used for image classification tasks, but instead for object detection tasks. When applying dense SIFT or GIST (or engineering your own feature) for image classification, you only have to think about a single function mapping the whole image to a feature vector. In object detection, you want to use something like a pyramid of cell-based features. HOG is the prime example of such a cell-based feature. Let a cell be a small square patch of sbin x sbin pixels. If I is an 80x80x3 image, the feature map is a grid of cells with F channels per cell. You can then slide templates of arbitrary sizes such as 4x4xF, 2x6xF, etc. over that grid. HOG computation is performed at the image pyramid level, while older vision systems would typically enumerate regions first, then compute f(subI) for each sub-image. It's not easy to just throw any feature you want at the problem. There is quite a bit of engineering in any reasonably sophisticated vision system. I would suggest just talking to a few vision PhD engineers, they should be able to help you! Good luck!
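As a concrete illustration of the cell-grid bookkeeping above (a sketch assuming esvm_features returns an H x W x F map; the numbers are only examples):

I = rand(80, 80, 3);                  % an 80x80x3 image
sbin = 8;                             % cell size in pixels
feat = esvm_features(I, sbin);        % roughly (80/sbin) x (80/sbin) x F cells
% slide a 4x4xF template over the cell grid by correlating each channel
w = randn(4, 4, size(feat, 3));
resp = zeros(size(feat, 1) - 3, size(feat, 2) - 3);
for c = 1:size(feat, 3)
    resp = resp + conv2(feat(:, :, c), rot90(w(:, :, c), 2), 'valid');
end
% resp(i, j) is the score of the 4x4 template anchored at cell (i, j)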
Hi, we came up with an idea to use HMP features with ESVM for object classification. We don't need sliding window detection in object recognition.
One question about this. Thanks a lot!
Hi Angela, In the exemplarSVM case I tried making templates which have roughly 100 bins.
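For reference, one way to pick such a template size is to fix a bin budget and the exemplar's aspect ratio (an illustrative computation, not the repository's exact initialization; exemplar_h and exemplar_w are assumed known):

aspect = exemplar_h / exemplar_w;       % aspect ratio of the exemplar box
target_bins = 100;                      % rough bin budget mentioned above
tw = max(1, round(sqrt(target_bins / aspect)));
th = max(1, round(aspect * tw));
fprintf('template: %d x %d cells (%d bins total)\n', th, tw, th * tw);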
In your single-feature-vector-per-image scenario, you should consider whether your images all have the same size or not. The notion of initializing a template is a bit different than in the object detection case (where objects can have dramatically different aspect ratios), so you'll have to think a bit about your problem formulation and modify the code accordingly. Cheers!
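If every image yields exactly one fixed-length descriptor, one low-effort way to fit it into the cell-based interface is to expose it as a 1x1xD feature map (a sketch; the helper hmp_extract and the wrapper name are assumptions):

function out = esvm_features_hmp(I, sbin)
% Hypothetical wrapper exposing a global HMP descriptor as a 1x1xD grid.
% sbin is unused here, since the descriptor covers the whole image.
d = hmp_extract(I);                       % assumed helper returning a D x 1 vector
out = reshape(single(d), [1, 1, numel(d)]);
end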
Hi Tomasz, since the HMP feature extractor is time-consuming, I am curious about how many times the feature extraction gets called. I traced the code to figure out the number of feature-extraction calls.
Hi Angela,
However, most people like to use heavy-weight computer vision features which might take 1-10 seconds (or 1-10 minutes) per image. In that case, it makes sense to store them on disk. I think you are venturing into territory which is outside the domain of object detection, and you might need to perform some serious surgery on the ExemplarSVM codebase to get the effect you want. Here are some details on hard-negative mining: the idea is to maintain in memory the examples which incur a non-zero loss for the SVM objective function. These are positive examples which score below +1 and negative examples which score above -1. In the ExemplarSVM case, there is only one positive example, which is fixed after initialization, and the negative examples are automatically mined. Let's say we start with 1 positive and 0 negatives, but we have a negative cache which can hold at most MAX_NEG negatives:
% some pseudocode for hard-negative mining
esvm = init_esvm();
for i = 1:length(images)
    % compute detections in this image
    dets = esvm_detect(esvm, images{i});
    % add negatives to the cache
    esvm = add_negatives(esvm, dets);
    % update the SVM using liblinear, libsvm, or your very own SVM library
    esvm = update_svm(esvm);
    % keep at most MAX_NEG negatives in the cache
    esvm = prune_negatives(esvm, MAX_NEG);
end
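A minimal sketch of what prune_negatives could do, assuming the cache stores the negatives' feature vectors as columns and that w and b are available (field names are illustrative, not the repository's):

function esvm = prune_negatives(esvm, MAX_NEG)
% Keep only negatives that still violate the margin (score > -1),
% then cap the cache at MAX_NEG entries, highest-scoring first.
scores = esvm.w(:)' * esvm.neg_features + esvm.b;   % one score per cached negative
keep = find(scores > -1);
[~, order] = sort(scores(keep), 'descend');
keep = keep(order(1:min(MAX_NEG, numel(keep))));
esvm.neg_features = esvm.neg_features(:, keep);
end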
Hi Tomasz, I read the code, and have a few questions:
I incorporated HMP features (1x1x112000) in esvm, and removed the pyramid by setting MAXLEVEL to 1 (so that there is no sliding window). However, the accuracy is lower than expected, and lower than with other features (HoG & SIFT). I want to figure out whether I am misusing the code or whether HMP features are just not appropriate for this algorithm.
I don't know what HMP is, nor what it captures. But please take a detailed look at your code and make sure the image being fed into the features is the full-sized image and not a down-sampled version. In other words, please make sure that whatever you did to disable the sliding window operation, you are keeping the highest resolution image and not the lowest resolution image. In my experience, the performance of a recognition/detection algorithm depends on 2 things: the interplay of features / learning algorithm, and how many years you spend working on the problem. There's something really nice about HOG+LinearSVM, probably because Navneet Dalal optimized HOG for use in this scenario. Don't be surprised that changing one of these components drastically reduced performance. If you really feel you need to use HMP, consider designing your own learning algorithm. Cheers and good luck! |
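One quick sanity check along these lines (illustrative; it only prints the sizes seen by the feature function):

fprintf('feature input : %d x %d pixels\n', size(I, 1), size(I, 2));
feat = esvm_features(I, sbin);
fprintf('feature output: %d x %d x %d\n', size(feat, 1), size(feat, 2), size(feat, 3));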
Hi, Tomasz, there are two details I would like to know more about:
Since each iteration of hard-negative mining keeps support vectors whose scores are greater than -1, will the initialized w and b influence which support vectors are chosen?
Your explanation above helps me understand this paper a lot!
Hi Angela,

1.) The initialization is just a simple mechanism for creating a mean-zero vector which can be used to detect the exemplar in its originating image. I made this initialization after I observed that all learned Exemplar-SVMs resembled the raw HOG features of the positive, but the learned hyperplane was mean-zero. It's just one of those tricks (among others) which I must have pulled out of thin air after endless nights of hacking at CMU. Yes, the initialization does affect the support vectors. In theory, if you make multiple passes over the data, the initialization will not matter. If you remember from ML101, a linear SVM is a convex problem, and if you remember from Felzenszwalb et al.'s DPM work, under a reasonable mining strategy, hard-negative mining will give you the same answer as loading all the data into memory.

2.) The boosting matrix does give a reasonable boost over raw ExemplarSVMs, even over per-exemplar calibrated ones. I do have to admit that I spent months trying to "learn" this boosting matrix, but I could never get the overfitting under control. This was yet another heuristic I pulled out of thin air (and after ~30 nights of trying it the ML way). The heuristic really does have a simple form. In the case of false positives, there are scenarios where a single detection window will have a large score because some image gradients accidentally lined up to look like the object of interest. In these scenarios, there is a single large "max score" but nearby windows score below -1. This intuition is not handled by non-maximum suppression, at least not by the Felzenszwalb et al. non-max suppression. That is why I favored ExemplarSVMs which have many high-scoring windows. You'll have to look at the code in detail to see exactly what I did. I hope this helps. Good luck! --Tomasz
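To make the initialization in 1.) concrete, here is a minimal sketch of a mean-zero initialization of that kind (illustrative, not the repository's exact code; I_exemplar denotes the exemplar's image window):

x = esvm_features(I_exemplar, sbin);   % features of the exemplar window
w = x - mean(x(:));                    % zero-mean template resembling the raw features
b = 0;                                 % bias, later refined by hard-negative mining
score = sum(w(:) .* x(:)) + b;         % the exemplar scores well against itself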
Hi,
I was trying to replace HoG features with dense SIFT features.
However, I found that esvm_detect.m changes the feature dimension, especially for HoG.
Simply replacing model.x with a SIFT feature will not work.
Is there any suggestion to help me develop the code (replace HoG with other features)?
Thanks.