From YOLO v1 to the latest version, YOLO v3

"Introduction: Target detection based on deep learning has gradually become a core technology in fields such as autonomous driving, video surveillance, machining, and intelligent robotics. However, most existing high-precision detection algorithms are slow and cannot meet the industry's real-time requirements. The YOLO algorithm was created to close this gap, and it has been widely praised for its very high speed and good accuracy. For this reason we choose YOLO to implement target detection. YOLO has so far gone through three versions, each improving on speed and accuracy. We start with YOLO v1 and continue to the latest version, YOLO v3.

1. YOLO v1: the pioneering work of one-stage detection

Compared with traditional classification, target detection is closer to practical needs, because a real scene rarely contains only one object. Detection is therefore more demanding: the algorithm must determine not only what each object is but also where it is in the image. Early approaches solved this in a highly intuitive way: to locate the target, divide the image into small patches and feed each patch to the algorithm. When the algorithm decides that a patch contains an object, detection is complete and the object is taken to lie in that patch. This is the idea behind early detectors such as R-CNN. The later Fast R-CNN and Faster R-CNN [16] improved on it, for example by running the CNN once over the whole image to extract a feature map and processing that further instead of sending patches through the CNN one by one, but the overall pipeline is still divided into two stages: region extraction and target classification.
As a result, accuracy is ensured but speed is very low. This motivated one-stage, end-to-end target detection algorithms, represented mainly by YOLO (You Only Look Once).

1.1 YOLO v1 basic idea

The core idea of YOLO v1 is to treat target detection as a regression problem. YOLO v1 first resizes the input image to 448 × 448, a size chosen to make the later grid division convenient. It then divides the image into S × S cells. Note that these cells differ from the patches described above: there, the image was actually cropped and local pixels were fed to the algorithm, whereas here the division is only logical. If the center of an object falls in a cell, that cell is responsible for predicting the object. Each cell predicts B bounding boxes (bbox; each consists of center coordinates, width, and height) and a confidence score for each box. Prediction is then analyzed cell by cell.

The confidence is not just the probability that the bounding box contains a target. It is that probability multiplied by the IOU between the predicted box and the ground-truth box (the area of their intersection divided by the area of their union), so it also reflects how accurately the box is positioned:

$$\text{confidence} = \Pr(\text{Object}) \times \text{IOU}^{\text{truth}}_{\text{pred}}$$

Each bounding box therefore corresponds to five outputs: x, y, w, h, and confidence. Here x and y are the offset of the box center from the boundary of its grid cell, and w and h are the ratios of the box width and height to those of the whole image. All of x, y, w, and h are limited to the interval [0, 1]. In addition, each cell produces C conditional class probabilities, $\Pr(\text{Class}_i \mid \text{Object})$.
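As a rough illustration of the quantities described above, the following Python sketch computes the IOU of two boxes, the confidence target (object probability times IOU), and the (x, y, w, h) encoding of a ground-truth box relative to its responsible cell. The function names, the corner-format box layout (x_min, y_min, x_max, y_max), and the example numbers are assumptions made for illustration; they are not taken from the YOLO code.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def confidence_target(pr_object, pred_box, true_box):
    """Confidence as described in the text: Pr(Object) times IOU(pred, truth)."""
    return pr_object * iou(pred_box, true_box)


def encode_box(box, image_w, image_h, s=7):
    """Encode a ground-truth box for an S x S grid (hypothetical helper).

    Returns (row, col, x, y, w, h): the responsible cell, the center offset
    inside that cell, and the box size relative to the whole image.
    """
    cx = (box[0] + box[2]) / 2.0 / image_w   # center, normalized to [0, 1]
    cy = (box[1] + box[3]) / 2.0 / image_h
    col = min(int(cx * s), s - 1)            # cell that contains the center
    row = min(int(cy * s), s - 1)
    x = cx * s - col                         # offset from the cell boundary
    y = cy * s - row
    w = (box[2] - box[0]) / image_w          # size relative to the image
    h = (box[3] - box[1]) / image_h
    return row, col, x, y, w, h


# Example on a 448 x 448 image with the default 7 x 7 grid.
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
print(encode_box((100, 150, 300, 350), 448, 448))
```

In this example the box center (200, 250) falls in row 3, column 3 of the grid, so that cell would be the one responsible for the object.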
Regardless of the value of B, each cell produces only one set of these class probabilities.

Figure 1: YOLO prediction diagram

In the non-maximum suppression phase at test time, each bounding box is scored with the following formula to decide whether it should be kept:

$$\Pr(\text{Class}_i \mid \text{Object}) \times \Pr(\text{Object}) \times \text{IOU}^{\text{truth}}_{\text{pred}} = \Pr(\text{Class}_i) \times \text{IOU}^{\text{truth}}_{\text{pred}}$$

This is the class-specific confidence score of each box, which encodes both the predicted class and the accuracy of the box. We can set a threshold, discard the boxes with low class-specific confidence scores, and apply non-maximum suppression to the rest to obtain the final detection boxes.

On Pascal VOC, S = 7 and B = 2 are used. There are 20 classes in total, so C = 20. The network output therefore has size S × S × (B × 5 + C) = 7 × 7 × 30.

1.2 Network model structure

Figure 2: network framework

The network has 24 convolutional layers followed by two fully connected layers. The Darknet [13] network borrows ideas from GoogLeNet, but replaces GoogLeNet's Inception modules with 1 × 1 convolutional layers, each followed by a 3 × 3 convolutional layer. The paper also describes a faster version of YOLO, which has only 9 convolutional layers and is otherwise the same.

1.3 Loss function

YOLO v1 uses sum-squared error as the loss function throughout. The loss has three parts: coordinate error, IOU (confidence) error, and classification error. To balance the contribution of each part, YOLO v1 gives the coordinate error (coordErr) a weight of $\lambda_{\text{coord}} = 5$. For the IOU error, cells that contain an object and cells that do not should contribute differently to the network loss. Most cells in an image contain no object, and their confidence targets are approximately 0. If both kinds of cell had the same weight, the confidence error of these empty cells would dominate the gradient and swamp the contribution of the cells that do contain objects.
To solve this problem, YOLO scales the confidence error (iouErr) of cells without objects by $\lambda_{\text{noobj}} = 0.5$. (A cell "contains" an object when the center of the object falls inside that cell.) For equal error values, an error on a large object should affect detection less than the same error on a small object, because the same position deviation is a much smaller fraction of a large object than of a small one. YOLO mitigates this by taking the square root of the size terms (w and h), but this does not solve the problem completely. In summary, the training loss of YOLO v1 is:

$$
\begin{aligned}
\text{loss} ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{\text{obj}}_{ij} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{\text{obj}}_{ij} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{\text{obj}}_{ij} (C_i - \hat{C}_i)^2 \\
&+ \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{\text{noobj}}_{ij} (C_i - \hat{C}_i)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}^{\text{obj}}_{i} \sum_{c \in \text{classes}} (p_i(c) - \hat{p}_i(c))^2
\end{aligned}
$$

Here $\mathbb{1}^{\text{obj}}_{ij}$ is 1 when box j of cell i is responsible for an object and 0 otherwise, $\mathbb{1}^{\text{noobj}}_{ij}$ is its complement, $\mathbb{1}^{\text{obj}}_{i}$ is 1 when an object's center falls in cell i, C is the confidence, and p(c) is the class probability. Each squared difference compares a predicted value with its target.

Activation function: the last layer uses a linear activation function, and all other layers use the leaky rectified linear activation function.

1.4 Summary

YOLO v1, as the pioneering work of one-stage detection, is characterized by high speed. It treats object detection as a regression problem and uses a single network for the whole detection pipeline, which greatly increases speed over comparable detection algorithms and yields a low background false-detection rate, at the cost of relatively low recall. YOLO v1 sees the whole image and so has a broader "field of view" than region-proposal methods. For the object classes it is trained on, recognition is very good, and it generalizes well. However, its accuracy and recall are relatively poor compared with Fast R-CNN, while its background misjudgment rate is much lower than that of Fast R-CNN. This shows that treating object detection as a regression problem classifies well but localizes bounding boxes poorly.

2. YOLO v2 / YOLO9000: more accurate, faster, and stronger

YOLO v1 localizes bounding boxes poorly and still trails comparable networks in accuracy. YOLO v2 therefore improves both speed and accuracy substantially, absorbing the strengths of comparable networks one step at a time. YOLO v2 builds on v1. Inspired by Faster R-CNN, it introduces anchors, and it uses k-means to choose the number of anchors as a compromise between accuracy and speed. The network structure is modified: the fully connected layers are removed and the network becomes fully convolutional. For training, the WordTree structure places detection and classification in a unified framework, and a hierarchical joint training method trains the model on the ImageNet classification data set and the COCO detection data set at the same time.

2.1 More accurate

YOLO v2 normalizes each batch of data. Adding batch normalization after every convolutional layer greatly speeds up convergence and reduces the dependence on other regularization methods (dropout can be removed without overfitting), and it improves mAP by 2% (mAP: mean average precision).

YOLO v1 pre-trains on 224 × 224 images and then raises the resolution to 448 × 448 for detection training, which forces the model to adapt to the new resolution. YOLO v2 instead fine-tunes the classification network directly on 448 × 448 input before detection training. The higher input resolution improves mAP by 4%.

For the number of predicted boxes, YOLO v2 changes the network input resolution to 416 × 416. After several convolutional and pooling layers with a total downsampling factor of 32, this gives a 13 × 13 feature map. With 9 anchor boxes [7], this yields 13 × 13 × 9 = 1521 boxes, far more than the 7 × 7 × 2 = 98 boxes of YOLO v1.
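The box counts above follow from simple arithmetic, which the short Python sketch below reproduces. The helper names `grid_size` and `box_count` are hypothetical and exist only for this illustration.

```python
def grid_size(input_size, stride):
    """Side length of the output feature map for a given total downsampling factor."""
    if input_size % stride != 0:
        raise ValueError("input size must be a multiple of the stride")
    return input_size // stride


def box_count(grid, boxes_per_cell):
    """Total number of predicted boxes on a grid x grid feature map."""
    return grid * grid * boxes_per_cell


v1_boxes = box_count(7, 2)          # YOLO v1: S = 7, B = 2
v2_grid = grid_size(416, 32)        # YOLO v2: 416 x 416 input, stride 32
v2_boxes = box_count(v2_grid, 9)    # 9 anchor boxes per cell, as in the text

print(v1_boxes)  # 98
print(v2_grid)   # 13
print(v2_boxes)  # 1521
```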
YOLO v1 predicts boxes from the output of the fully connected layers, which loses much spatial information and makes localization inaccurate. In YOLO v2 the author borrows the anchor idea from Faster R-CNN to remove this dependence on fully connected layers. Anchors are a key part of the RPN (region proposal network) in Faster R-CNN: a sliding window moves over the convolutional feature map, and each window center predicts 9 candidate boxes of different sizes.

To predict candidate boxes with anchor boxes, the author removes the fully connected layers from the network. One pooling layer is also removed so that the output convolutional feature map has higher resolution. The network is then shrunk so that the input resolution is 416 × 416, which makes the width and height of the resulting feature map odd, so that there is a single center cell. The author observed that large objects usually occupy the middle of the image. With a single center cell, one cell can predict these objects, whereas otherwise the four middle cells would be needed. This trick improves efficiency slightly. Finally, YOLO v2's convolutional layers downsample by a factor of 32, so a 416 × 416 input produces a 13 × 13 convolutional feature map (416 / 32 = 13).

Without anchor boxes, the model's recall was 81% and its mAP was 69.5%. With anchor boxes, recall was 88% and mAP was 69.2%. Accuracy thus drops only slightly, while recall increases by 7 percentage points.

When using anchors, the author ran into two problems. The first is that the width and height of the anchor boxes are usually hand-picked priors.
The network does learn to adjust the width and height of the boxes during training and eventually produces accurate bounding boxes. However, if better, more representative prior box dimensions are chosen at the start, the network learns accurate predictions more easily. The author therefore runs k-means clustering on the bounding boxes of the training set to find good box widths and heights automatically. Standard k-means uses Euclidean distance, which means that larger boxes generate more error than smaller boxes and the clustering result may be skewed. The author instead uses the IOU score as the criterion, so that the error is independent of box scale. The final distance function is:

$$d(\text{box}, \text{centroid}) = 1 - \text{IOU}(\text{box}, \text{centroid})$$

The clustering results on the data sets are as follows:

Figure 3: relationship between the number of clusters and average IOU (on the VOC 2007 and COCO data sets)

It can be seen that k = 5 is a good compromise between model complexity and recall.

The second problem with anchors is that the model becomes unstable once anchor boxes are added. The author attributes the instability to the prediction of the box center (x, y). In the region proposal network of Faster R-CNN, the center is predicted as an offset from the anchor center $(x_a, y_a)$, scaled by the anchor width $w_a$ and height $h_a$:

$$x = t_x w_a + x_a, \qquad y = t_y h_a + y_a$$

Because the offsets $t_x$ and $t_y$ are unconstrained, a predicted box can end up anywhere in the image and convergence is slow. Each prediction should instead stay near its own grid cell. The paper therefore predicts the center directly, as in YOLO v1, and uses the sigmoid function to limit the offset to between 0 and 1 (in units of one grid cell). The calculation is:

$$
\begin{aligned}
b_x &= \sigma(t_x) + c_x \\
b_y &= \sigma(t_y) + c_y \\
b_w &= p_w e^{t_w} \\
b_h &= p_h e^{t_h}
\end{aligned}
$$

Here $b_x$, $b_y$, $b_w$, and $b_h$ are the center coordinates, width, and height of the predicted box, with the center expressed relative to the grid. $(c_x, c_y)$ is the offset of the cell from the top-left corner of the image, and $(p_w, p_h)$ are the width and height of the prior box.
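The two fixes described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes the distance 1 − IOU between boxes that share a center, updates each centroid with the mean of its members (the text does not specify the update rule), and uses made-up toy boxes. All function names are hypothetical.

```python
import math
import random


def iou_wh(wh_a, wh_b):
    """IOU of two boxes that share the same center, given as (width, height)."""
    inter = min(wh_a[0], wh_b[0]) * min(wh_a[1], wh_b[1])
    union = wh_a[0] * wh_a[1] + wh_b[0] * wh_b[1] - inter
    return inter / union


def kmeans_iou(boxes, k, iterations=50, seed=0):
    """Cluster (width, height) pairs with d(box, centroid) = 1 - IOU."""
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)  # assumption: random boxes as initial centroids
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            distances = [1.0 - iou_wh(box, c) for c in centroids]
            clusters[distances.index(min(distances))].append(box)
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if not members:            # keep an empty cluster's centroid unchanged
                new_centroids.append(old)
                continue
            mean_w = sum(w for w, _ in members) / len(members)
            mean_h = sum(h for _, h in members) / len(members)
            new_centroids.append((mean_w, mean_h))
        if new_centroids == centroids:  # assignments are stable, stop early
            break
        centroids = new_centroids
    return sorted(centroids)


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def decode_box(t, cell, prior):
    """Direct location prediction: b_x = sigmoid(t_x) + c_x, b_w = p_w * exp(t_w), etc."""
    t_x, t_y, t_w, t_h = t
    c_x, c_y = cell      # cell offset from the top-left corner, in grid units
    p_w, p_h = prior     # prior (anchor) width and height, in grid units
    return (sigmoid(t_x) + c_x, sigmoid(t_y) + c_y,
            p_w * math.exp(t_w), p_h * math.exp(t_h))


# Toy data: three small and three large boxes, clustered into k = 2 priors.
toy_boxes = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.2), (8.0, 8.0), (9.0, 7.0), (7.0, 9.0)]
priors = kmeans_iou(toy_boxes, k=2)
print(priors)
print(decode_box((0.0, 0.0, 0.0, 0.0), cell=(6, 6), prior=(3.0, 2.0)))
```

Because the sigmoid keeps the center offset strictly between 0 and 1, a decoded center cannot leave the cell that predicts it, which is the stabilizing property the text describes.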
The position of each parameter is shown in Figure 4.

Figure 4: position diagram of each parameter

With dimension clustering and direct location prediction, mAP improves by 5% over the original anchor-box version. YOLO v1 works well for detecting large targets,