Potential issue in the evaluation code #48

Open
Uio96 opened this issue Jul 26, 2021 · 2 comments

Uio96 commented Jul 26, 2021

Thanks a lot for this great dataset.

My colleague @swtyree and I have taken a close look at your evaluation code and found some potential issues in it.

  1. The first issue is about the visibility label.

Although the dataset provides a visibility value per instance, the code computes the visibility index but never uses it to filter the visibility entries themselves. As a result, the lengths may not match: label[VISIBILITY] still contains entries for objects below the visibility threshold, while label[LABEL_INSTANCE] does not.

Before:

```python
label[VISIBILITY] = visibilities
index = visibilities > self._vis_thresh
```

I think it should be:

```python
index = visibilities > self._vis_thresh
label[VISIBILITY] = visibilities[index]
```

Here is the corresponding part from the evaluation code:
https://github.com/google-research-datasets/Objectron/blob/master/objectron/dataset/parser.py#L50-L53
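
For illustration, here is a minimal, self-contained sketch of how the length mismatch shows up (the arrays and names are made up; this is not the actual parser code):

```python
import numpy as np

# Hypothetical example: five annotated instances, two of them below the
# visibility threshold used by the parser.
visibilities = np.array([1.0, 0.0, 1.0, 1.0, 0.0])
instance_names = np.array(["chair_0", "chair_1", "chair_2", "chair_3", "chair_4"])
vis_thresh = 0.5

index = visibilities > vis_thresh

# Current behaviour: VISIBILITY keeps all 5 entries, LABEL_INSTANCE keeps 3.
label = {
    "VISIBILITY": visibilities,               # length 5
    "LABEL_INSTANCE": instance_names[index],  # length 3
}

# Proposed change: filter the visibilities with the same index so the
# per-instance lists stay aligned.
label["VISIBILITY"] = visibilities[index]     # length 3
assert len(label["VISIBILITY"]) == len(label["LABEL_INSTANCE"])
```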

  2. The second issue is with the calculation of average precision.

I found that the order in which the test images are processed affects the final result.

The standard procedure in classification/segmentation works includes an important step that sorts the predictions by their predicted confidence; see https://github.com/ShawnNew/Detectron2-CenterNet/blob/master/detectron2/evaluation/pascal_voc_evaluation.py#L243-L246. However, I did not find that step in your code. I am not sure whether you assumed the tested methods would do this somewhere else, or whether you simply fixed the order of the test images.

Here is the corresponding part from the evaluation code: https://github.com/google-research-datasets/Objectron/blob/master/objectron/dataset/metrics.py#L86-L98

It is similar to the procedure used in pascal_voc_evaluation: https://github.com/ShawnNew/Detectron2-CenterNet/blob/master/detectron2/evaluation/pascal_voc_evaluation.py#L290-L299
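
To make the suggestion concrete, here is a rough sketch of the sorting step I have in mind, loosely following the pascal_voc_evaluation snippet above (the function and variable names are placeholders, not the Objectron metrics code):

```python
import numpy as np

def precision_recall(confidences, is_true_positive, num_gt):
    """Accumulate precision/recall after sorting detections by confidence.

    confidences:      (N,) predicted score per detection, across all images.
    is_true_positive: (N,) 1 if the detection matched an unmatched GT instance.
    num_gt:           total number of ground truth instances.
    """
    order = np.argsort(-np.asarray(confidences))        # highest confidence first
    tp = np.asarray(is_true_positive, dtype=float)[order]
    fp = 1.0 - tp

    tp_cum = np.cumsum(tp)
    fp_cum = np.cumsum(fp)
    recall = tp_cum / max(num_gt, 1)
    precision = tp_cum / np.maximum(tp_cum + fp_cum, np.finfo(float).eps)
    return precision, recall
```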

I am looking forward to your reply. Thank you so much.

@ahmadyan
Collaborator

Hi Yunzhi,
Thanks for the detailed feedback. I appreciate it.

  1. Regarding visibility, we set the visibility mostly to 1.0 for all instances, so you can pretty much ignore it (there are a few instances of 0.0 where the object is outside the frame; if you want to see how it is actually calculated, see Visibility calculation #37). So the threshold is not used in 3D object detection.

Furthermore, it doesn't make sense to me for the length of label[VISIBILITY] to match the other labels (such as label[LABEL_INSTANCE_3D], etc.). It is the list of visibility values for all instances, so we don't want to drop the entries that are set to 0.0. Later, when we actually check for visibility, we are careful to skip those invisible object instances. Let me know what you think and what the effect of applying your change on the output would be.
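
Roughly, the later check amounts to something like this (paraphrasing with made-up values; not the exact eval code):

```python
import numpy as np

# label[VISIBILITY] as the parser currently returns it, unfiltered.
visibilities = np.array([1.0, 0.0, 1.0])
vis_thresh = 0.5

# Only the visible instances ([0, 2] here) are evaluated; the 0.0 entries
# stay in the list but are skipped, so their presence is harmless.
visible_ids = [i for i, v in enumerate(visibilities) if v > vis_thresh]
print(visible_ids)
```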

  2. As you linked, the eval code is based on the reference MATLAB code of PASCAL VOC. However, our model is a two-stage network: first we detect the object's crop (based on confidence, using MobileNet or other detectors), and then we pass the crop to a second network that estimates the pose (the Objectron models). In the second stage, we assume the object exists in the crop with probability 1.0 and the network estimates the keypoints. Here we do not predict any confidence (the network assumes the object is within the bounding box), so we do not need to sort the predictions.
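
Concretely, since every second-stage prediction carries an implicit score of 1.0, sorting by confidence would be a no-op anyway (a small illustrative sketch, not the actual pipeline code):

```python
import numpy as np

# Every crop passed to the second stage is treated as containing the object
# with probability 1.0, so all detection "scores" are identical.
num_predictions = 8                  # illustrative
scores = np.ones(num_predictions)

order = np.argsort(-scores, kind="stable")
# With equal scores, a stable sort keeps the original order, so sorting by
# confidence cannot change the accumulated precision/recall.
assert np.array_equal(order, np.arange(num_predictions))
```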

Now, if the above evaluation doesn't work for you, or your model predicts confidences and you need them included in the evaluation, let me know and we can accommodate it, or you can create a pull request.


guthasaibharathchandra commented Feb 3, 2023

Hi, I have recently noticed that in the evaluation code, when calculating recall, you directly divide the number of true positives (over all predictions) by the total number of ground truth instances in all images. That is fine in itself, but when you count true positives, if my model predicts multiple bounding boxes that match the same ground truth instance, all of those predictions are counted as true positives according to the code. Shouldn't the number of true positives be 1, with the remaining predictions matching that same ground truth instance counted as false positives? Otherwise my model could predict literally 100 boxes that all match one ground truth instance, and if the number of ground truth instances is smaller, the recall would be greater than 1, which makes no sense. I hope my question is clear. I'm looking forward to your answer! Thanks!
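
To make the concern concrete, here is a minimal sketch of the one-to-one matching I would expect, where each ground truth instance can produce at most one true positive (the IoU matrix and threshold are made up; this is not the Objectron metrics code):

```python
import numpy as np

def count_true_positives(ious, iou_thresh=0.5):
    """Greedy one-to-one matching: each ground truth instance yields at most
    one true positive, so duplicate detections become false positives.

    ious: (num_predictions, num_gt) IoU between every prediction and every
          ground truth instance, with predictions ordered by confidence.
    """
    num_predictions, num_gt = ious.shape
    gt_matched = np.zeros(num_gt, dtype=bool)
    tp = 0
    for pred_idx in range(num_predictions):
        gt_idx = int(np.argmax(ious[pred_idx]))
        if ious[pred_idx, gt_idx] >= iou_thresh and not gt_matched[gt_idx]:
            gt_matched[gt_idx] = True
            tp += 1      # first match with this ground truth instance
        # otherwise the prediction is a false positive (duplicate or low IoU)
    return tp

# 100 duplicate predictions all overlapping a single ground truth instance:
ious = np.full((100, 1), 0.9)
assert count_true_positives(ious) == 1   # recall can no longer exceed 1
```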
