TensorFlow Object Detection API Weird Behavior
How many images are there in the dataset? The more training data you have, the better the API performs. I tried training it on about 20 images per class and the accuracy was pretty bad; I ran into pretty much all of the problems you mentioned above. When I generated more data, the accuracy improved considerably.
PS: Sorry, I couldn't comment since I don't have enough reputation.
So I think I figured out what is going on. I did some analysis on the dataset and found that it is heavily skewed towards objects of category 1.
This is the frequency distribution of each of the 11 categories (shown with 0-based indexing):
     0  10440
     1    304
     2    998
     3     67
     4    412
     5    114
     6    190
     7    311
     8    195
     9     78
    10     75
I guess the model is hitting a local minimum where just labelling everything as category 1 is good enough.
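In case anyone wants to reproduce this analysis, here is roughly how the labels can be counted; a minimal sketch assuming TF 1.x and the standard TF Object Detection API TFRecord format ('train.record' is a placeholder for your own file):

    import collections
    import tensorflow as tf

    counts = collections.Counter()
    for record in tf.python_io.tf_record_iterator('train.record'):
        example = tf.train.Example()
        example.ParseFromString(record)
        # Standard label key written by the API's dataset conversion tools.
        labels = example.features.feature['image/object/class/label'].int64_list.value
        counts.update(labels)

    for label, count in sorted(counts.items()):
        print(label, count)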
About the problem of not detecting some boxes: I tried training again, but this time I didn't differentiate between brands. Instead, I tried to teach the model what a cigarette box is in general. It still wasn't detecting all the boxes.
Then I decided to crop the input image and provide that as the input, just to see if the results would improve, and they did!
It turns out that the input images were much larger than the 600 x 1024 accepted by the model, so the API was scaling them down to 600 x 1024, which meant the cigarette boxes were losing their details :)
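For reference, this resizing behaviour comes from the image_resizer block in the pipeline config; the values below are the ones shipped in the stock Faster R-CNN sample configs, and raising them (at the cost of memory and speed) would be an alternative to cropping:

    image_resizer {
      keep_aspect_ratio_resizer {
        min_dimension: 600
        max_dimension: 1024
      }
    }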
So I decided to test the original model, the one trained on all classes, on cropped images, and it works like a charm :)
This is the output of the model on the original image.
And this is the output of the model when I crop out the top-left quarter and provide it as the input.
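For anyone who wants to try the same experiment, cropping out a quarter is a couple of lines with PIL (the filenames here are made up):

    from PIL import Image

    # Crop boxes in PIL are (left, upper, right, lower).
    image = Image.open('shelf.jpg')
    width, height = image.size
    image.crop((0, 0, width // 2, height // 2)).save('shelf_top_left.jpg')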
Thanks to everyone who helped! And congrats to the TensorFlow team for doing an amazing job on the API :) Now everybody can train object detection models!
Maybe it is too late now, but I wanted to post this comment in case anyone struggles with this in the future:
Unfortunately, the TF documentation is not the best, and I struggled with this a lot before finding the reason. The way the model is set up, it allows a MAXIMUM of a fixed number of predictions per single image; in your case I think it is 20. You can easily test my hypothesis by editing the original code.
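Assuming you are running the stock demo notebook, the relevant call is visualize_boxes_and_labels_on_image_array from object_detection.utils.visualization_utils, whose max_boxes_to_draw argument defaults to 20; raise it explicitly:

    vis_util.visualize_boxes_and_labels_on_image_array(
        image_np,
        np.squeeze(boxes),
        np.squeeze(classes).astype(np.int32),
        np.squeeze(scores),
        category_index,
        use_normalized_coordinates=True,
        max_boxes_to_draw=100,  # default is 20
        line_thickness=8)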
Obviously, the change has to go in before the boxes are actually drawn, and you should then see some better results.
Pretty nasty limitation.
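And if the cap turns out to come from the model itself rather than from the drawing code, the corresponding knobs live in the post_processing block of the pipeline config (the values below are from the stock Faster R-CNN samples):

    post_processing {
      batch_non_max_suppression {
        score_threshold: 0.0
        iou_threshold: 0.6
        max_detections_per_class: 100
        max_total_detections: 300
      }
      score_converter: SOFTMAX
    }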