Counting Vehicles - Model Improvements - Part 5
Published on 01/25/2019
10 min read
We will look at some techniques we can use to improve our model, both in terms of precision and recall and in terms of actual inference performance. We will also explore some oddities of the model.
One of the downfalls of our original model was its inference time. Due to the size of the model it only averaged about 1-2 frames per second, so if a vehicle was driving by quickly we would only get one or two chances to classify it. This was also an issue because the sweet spot for the model was right in front of the camera, and at such a low FPS the probability of capturing a vehicle in that sweet spot was low. If you look at the TensorFlow Object Detection API model zoo you can see that the base model we chose, ssd_mobilenet_v1_fpn_coco, had a reported speed of 56 ms on a powerful Nvidia GeForce Titan GPU.
Since the DeepLens has a much less capable GPU, the Intel Gen9, you could hypothesize it would be roughly 10x slower. That matches what we observed: getting 1-2 FPS equates to an inference time of roughly 500-1000 ms per frame, as a very rough measure.
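As a sanity check on numbers like these, here is a minimal sketch of how you might measure per-frame latency and FPS. The `measure_fps` helper and the sleeping stand-in for the model call are hypothetical, not part of the actual DeepLens code; in practice you would pass in the real inference function.

```python
import time

def measure_fps(infer, frames, warmup=2):
    """Return (average per-frame latency in seconds, FPS) for an inference callable."""
    for frame in frames[:warmup]:
        infer(frame)  # warm-up runs are excluded from timing
    start = time.perf_counter()
    for frame in frames[warmup:]:
        infer(frame)
    elapsed = time.perf_counter() - start
    latency = elapsed / max(len(frames) - warmup, 1)
    return latency, 1.0 / latency

# Stand-in for the real model call: sleeps ~50 ms per frame,
# simulating a hypothetical 50 ms inference time.
latency, fps = measure_fps(lambda f: time.sleep(0.05), list(range(12)))
```

With a real model you would feed actual camera frames; the warm-up runs matter because the first inference often includes one-time graph setup costs.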
With that in mind, one way to improve the performance of the model would have been to opt for smaller input images. We chose images that were 429x240 in RGB; we could have used smaller images, for example 356x200 in grayscale. This would have reduced the number of computations the GPU has to perform. Note, though, that it may also have reduced the model's precision and recall.
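For reference, shrinking the input is just a change to the image resizer in the pipeline config. The fragment below follows the Object Detection API's pipeline format, with only the relevant fields shown; the 356x200 grayscale values are the hypothetical ones discussed above, not what I actually used.

```
model {
  ssd {
    image_resizer {
      fixed_shape_resizer {
        height: 200
        width: 356
        # convert_to_grayscale is available in newer versions of the API
        convert_to_grayscale: true
      }
    }
  }
}
```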
Another option is to use a different model pipeline with a different base model. This is the option I chose to implement: I opted for the ssdlite_mobilenet_v2_coco base model, which had a reported speed of 27 ms, so we could expect it to be roughly twice as fast. When I implemented it I was able to get 4-6 FPS; while still slower than the camera's input rate of 10 FPS, it allowed for capturing more instances of a vehicle driving by.
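Swapping the base model mostly amounts to starting from the zoo's SSDLite pipeline config and pointing training at its checkpoint. Roughly, the relevant pieces look like the fragment below; the checkpoint path is illustrative and depends on which zoo archive you downloaded.

```
model {
  ssd {
    feature_extractor {
      type: "ssd_mobilenet_v2"
      # SSDLite replaces regular convolutions in the prediction
      # layers with depthwise-separable ones, which is where most
      # of the speedup comes from
    }
  }
}
train_config {
  fine_tune_checkpoint: "ssdlite_mobilenet_v2_coco/model.ckpt"
}
```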
Another area I wanted to improve was the actual performance of the model, in terms of its precision and, more importantly, its recall. As always, collecting and labeling more training data would likely have accomplished this, but I opted not to perform that time-consuming step.
Instead I took the lazy route and explored more data augmentation options. One of the areas where I noticed lower performance was when the sun was shining directly into the window. When I first collected training data in September the sun was at a higher angle in the sky, so it didn't really shine into the window, and most of my training data was centered around mid-day. Now that it is November the sun sits at a lower angle, shining into the window and causing different lighting conditions in general. One way I opted to get around this was to introduce some data augmentation. The options I chose to implement were random brightness and contrast adjustments, random jitter boxes, and ssd_random_crop. The brightness and contrast adjustments were to take care of the different lighting that may occur within the image. If you look below, these are two captures about 40 days apart, taken at the same time of day.
I opted to use the random jitter boxes to account for any inaccuracies in labeling the bounding boxes, and the random crop in case I ever changed the location of the camera. I'm not sure how much, if at all, these data augmentation options actually improved the model; for anything you want to improve, you really need to measure and evaluate to determine its effectiveness while keeping everything else the same.
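These augmentations are enabled in the train_config section of the pipeline config. This fragment assumes the API's built-in random_adjust_brightness, random_adjust_contrast, random_jitter_boxes, and ssd_random_crop options with their default parameters; each option gets its own data_augmentation_options entry.

```
train_config {
  data_augmentation_options {
    random_adjust_brightness {
    }
  }
  data_augmentation_options {
    random_adjust_contrast {
    }
  }
  data_augmentation_options {
    random_jitter_boxes {
    }
  }
  data_augmentation_options {
    ssd_random_crop {
    }
  }
}
```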
Here are the final results of each model, trained to 50k steps on the training data and evaluated on the evaluation data. As you can see, MobileNet V2 achieves lower average precision and lower average recall. This is expected since the model is smaller, but remember that MobileNet V2 is also much faster when running on the DeepLens. Keep in mind, too, that the MobileNet V2 model used the additional data augmentation options; I am unsure whether these actually helped or hurt the model, and to evaluate that I would have to train a third model without them to compare against.
MobileNet V1 SSD final results 50000 steps
Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.823
Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 1.000
Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 1.000
Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.813
Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.863
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.834
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.834
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.834
Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.826
Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.863
MobileNet V2 SSDLite final results 50000 steps
Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.658
Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.818
Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.802
Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.623
Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.819
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.678
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.678
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.678
Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.643
Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.825
Here are some sample classification examples on data outside of the training and evaluation sets, to compare and contrast the models. As you can see, the MobileNet V2 SSDLite does have some classification issues: it has a hard time judging between SUVs and cars, but gets the direction right every time. There is one example, image 6, that the MobileNet V2 fails to classify at all; I imagine this is due to the brightness of the image and the scenery behind the car. Overall, neither model does too badly. They can handle the season changing from summer to fall to winter, showing how much the model really focuses on the road. This is also evident in image 10, where my neighbor's SUV is parked in their driveway but fails to get detected, despite appearing in the same orientation as cars on the road.
One of the downfalls of machine learning in general is that a model is only as good as the data it is trained with. Also, with some ingenuity, it is possible to construct data that fools the model.
In this case the probability was only 84%, so we could easily have thrown this detection away by increasing our detection threshold to something like 90%. By increasing the detection threshold, though, we would potentially miss more vehicles, while having better precision on the vehicles we did detect. There is always a trade-off with these detection thresholds, and it will be project specific. The model likely mispredicts this example because the SUV does somewhat have the shape of a van; I think the windows over the back seat give it away as an SUV. Another reason for the misprediction could be the color of the SUV: the majority of the vans in the training data are white, while for SUVs white is probably evenly distributed among the other colors. In that case the model is actually relying on the color of the vehicle to aid classification, and one way to avoid this would be to convert the images to grayscale.
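The threshold trade-off can be sketched in a few lines. The detection list below is hypothetical, with the 0.84 "van left" standing in for the misprediction above; raising the threshold drops the mistake, but also drops a weaker correct detection.

```python
def filter_detections(detections, threshold):
    """Keep only detections whose confidence meets the threshold."""
    return [d for d in detections if d["score"] >= threshold]

# Hypothetical per-frame model output.
detections = [
    {"label": "van_left", "score": 0.84},   # the misclassified SUV
    {"label": "car_right", "score": 0.93},
    {"label": "car_left", "score": 0.88},   # a correct but lower-confidence hit
]

at_80 = filter_detections(detections, 0.80)  # keeps all three, mistake included
at_90 = filter_detections(detections, 0.90)  # drops the mistake and a true positive
```

At 0.80 the misclassified van survives; at 0.90 it is gone, but so is the legitimate 0.88 detection, which is exactly the precision-versus-recall trade-off described above.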
This example was hand constructed: I had a pink toy car that I placed in front of the camera to see if I could get it to detect anything. In this case it labeled the toy car as car left with a probability of roughly 76%. While the toy car is indeed a car, it isn't actually going left; it is going right. The key to fooling the model was using something that looked like the training data, the toy car, and making it appear at the same size in the scene as the training data. If you look above at the misclassified van left, you can see the general area where vehicles drive and their scale. I tried to replicate this with the toy car, placing it near the center of the scene at roughly the same size. In the model's defense, it only assigned a probability of 76%, so it didn't get fooled that badly. In addition, since the training data never contained toy cars, the model didn't really know any better that it wasn't a real car, further emphasizing the importance of training data that accurately depicts what you want to detect.
Thank you for reading this series of blog posts on counting vehicles. Feel free to add comments below or email me if you have questions or comments. I plan on adding more content in the coming months, so be sure to subscribe to the RSS feed.