You only look once (YOLO), or do we need to? Once again, a new version of the YOLO object detection algorithm is taking the computer vision and automation industry by storm. This time, however, it comes not from Joseph Redmon but from Meituan, a Chinese shopping platform!
YOLOv6 is a new and improved version of the years-old YOLO architecture with more effective and efficient training processes, hardware-friendly design for the backbone and neck, and better performance on mobile devices.
The field of computer vision is in full bloom thanks to the data and compute revolution of the past 7-8 years. Ever since Joseph Redmon introduced the YOLOv3 model, real-time object detection has reached new heights.
Unlike prior detection systems, which predominantly repurposed older classification and segmentation methods, YOLO applies a single neural network to the entire image. This network divides the image into regions and predicts bounding boxes and probabilities for each region; the predicted class probabilities then weight these bounding boxes.
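The grid-based prediction scheme described above can be sketched in a few lines of NumPy. This is an illustrative YOLOv1-style layout (grid size, box count, and class count are assumptions for the example, not YOLOv6's actual configuration), showing how per-box confidences weight the per-cell class probabilities:

```python
import numpy as np

# Hypothetical YOLOv1-style settings: 7x7 grid, 2 boxes per cell, 20 classes.
S, B, C = 7, 2, 20
rng = np.random.default_rng(0)

# Network output per cell: B boxes of (x, y, w, h, confidence) + C class probs.
pred = rng.random((S, S, B * 5 + C))

boxes = pred[..., :B * 5].reshape(S, S, B, 5)
class_probs = pred[..., B * 5:]        # Pr(class | object), one set per cell
confidence = boxes[..., 4]             # Pr(object) * IoU estimate, one per box

# Final detection score: class probability weighted by box confidence.
scores = class_probs[:, :, None, :] * confidence[..., None]   # shape (S, S, B, C)
```

In a real detector these scores would then be thresholded and passed through non-maximum suppression to produce the final detections.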
Even the original parent network of YOLOv6 achieved results roughly 1000x faster than the R-CNN object detection method, but where does the current network stand, and more importantly, why?
Meituan was not only able to deliver a faster and more flexible network; it also comes in three variants, each smaller in parameter count and size than the last.
YOLOv6 introduces two significant upgrades over the model that inspired it: Ultralytics' YOLOv5. That architecture, with its army of variants, has been the standing champion in object detection since 2020.
Let us look at the two upgrades YOLOv6 brings to the table to outperform its predecessor.
The neck and backbone of YOLOv6 are inspired by hardware-aware neural network design. For efficient inference, hardware characteristics like processing power, memory bandwidth, etc., must be taken into account. Therefore, YOLOv6's neck and backbone have been redesigned using the Rep-PAN and EfficientRep structures.
The neck feeds directly into a decoupled head, which handles regression, objectness, and classification as separate tasks; this separation has helped YOLOv6 increase both its speed and its detection accuracy over its predecessors.
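The "Rep" in EfficientRep and Rep-PAN refers to structural re-parameterization, in the spirit of RepVGG: at training time a block uses parallel 3x3, 1x1, and identity branches, which are fused into a single 3x3 convolution for inference. Below is a minimal single-channel NumPy sketch of that fusion idea (not YOLOv6's actual implementation, which uses multi-channel convolutions with batch normalization):

```python
import numpy as np

def conv2d(x, k):
    # 'Same' zero-padded cross-correlation, single channel, stride 1.
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 5))
k3 = rng.normal(size=(3, 3))   # 3x3 branch
k1 = rng.normal(size=(1, 1))   # 1x1 branch

# Training-time multi-branch output: 3x3 conv + 1x1 conv + identity.
multi_branch = conv2d(x, k3) + conv2d(x, k1) + x

# Inference-time fusion: fold the 1x1 kernel and the identity
# (a delta at the centre tap) into a single 3x3 kernel.
fused = k3.copy()
fused[1, 1] += k1[0, 0] + 1.0
single_branch = conv2d(x, fused)

assert np.allclose(multi_branch, single_branch)
```

The multi-branch form gives richer gradients during training, while the fused single 3x3 convolution maps efficiently onto GPU and mobile hardware at inference time, which is exactly the hardware-friendliness the YOLOv6 authors emphasize.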
YOLOv6 has been tested against the same benchmarks as its predecessors to represent its abilities accurately.
The key observations from the benchmark results are:
YOLOv6-nano achieved 35% AP accuracy on COCO val and 1242 FPS on a T4 using TRT FP16 with batchsize=32. Compared to YOLOv5-nano, this represents a gain in AP accuracy and an 85% increase in speed.
Using TRT FP16 with batchsize=32 for inference on a T4, YOLOv6-tiny obtained 41.3% AP accuracy on COCO val at 602 FPS. Compared to YOLOv5-s, accuracy has risen by 3.9% AP, while speed has increased by 29.4%.
COCO (Common Objects in Context) is the largest-scale object detection, segmentation, and captioning dataset. With over 330,000 images, it serves as the standard benchmark for architectures like these.
YOLOv6 raises the question: what is next in computer vision and object detection? With the network's original author, Joseph Redmon, quitting computer vision over growing concerns about what his work had enabled, should corporations keep investing in these advances?
"I stopped doing CV research because I saw the impact my work was having. I loved the work, but the military applications and privacy concerns eventually became impossible to ignore." https://t.co/DMa6evaQZr — Joseph Redmon (@pjreddie), February 20, 2020
However, the fantastic support the network is receiving from open-source developer communities shows excellent potential, especially for deployment on mobile devices. So let us watch keenly as computer vision reaches its full potential.