FMUSER Wirless Transmit Video And Audio More Easier !

[email protected] WhatsApp +8618078869184
Demystified: "Something out of Nothing" Computer Vision

     

Computer Vision (CV) is the study of how to make machines "see". Larry Roberts of MIT published the field's first doctoral thesis, "Machine Perception of Three-Dimensional Solids", in 1963, marking the beginning of CV as a branch of artificial intelligence. Now, more than fifty years on, let's look at a few interesting attempts to give computer vision the ability to create "something out of nothing": super-resolution reconstruction; image colorization; image captioning; portrait restoration; and automatic image generation. As we will see, these five attempts build on one another, and both the difficulty and the fun increase step by step. For reasons of space, this article discusses only the vision problems themselves without going into much technical detail; any part that interests you deserves a separate write-up.

Image Super-Resolution

Last summer an application from Japan called "Waifu2x" became popular in anime and computer-graphics circles. Waifu2x uses a deep Convolutional Neural Network (CNN) to double the resolution of an image while also removing noise. Put simply, the computer fills in pixels the original picture never had, "out of nothing", so that comics look sharper. Take a look at Figures 1 and 2; I would love to watch a Dragon Ball this high-definition!

Figure 1: Super-resolution reconstruction of "Dragon Ball". The right side is the original frame; the left side is Waifu2x's super-resolution reconstruction of the same frame.

Figure 2: Waifu2x super-resolution comparison. Top: a low-resolution, noisy anime image; bottom left: the result of direct enlargement; bottom right: Waifu2x's denoised, super-resolved result.

It should be pointed out, however, that image super-resolution has been studied for many years; only with the development of "deep learning" could Waifu2x achieve results this good.
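To make "filling in pixels" concrete, here is a minimal pure-Python sketch of the naive 2x enlargement baseline (the "direct amplification" of Figure 2) that super-resolution models such as Waifu2x improve upon. This is illustrative only, not how Waifu2x itself works.

```python
def upscale_nearest(img, factor=2):
    """Naive enlargement: copy each source pixel into a factor x factor
    block. This is the blocky baseline that learned super-resolution
    models improve upon."""
    out = []
    for row in img:
        big_row = []
        for px in row:
            big_row.extend([px] * factor)
        out.extend([big_row[:] for _ in range(factor)])
    return out

# A tiny 2x2 grayscale "image"
small = [[10, 20],
         [30, 40]]
print(upscale_nearest(small))
# -> [[10, 10, 20, 20], [10, 10, 20, 20], [30, 30, 40, 40], [30, 30, 40, 40]]
```

A trained CNN replaces these copied blocks with pixel values inferred from the surrounding context, which is why its output looks sharp rather than blocky.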
When training the CNN, a low-resolution image serves as the input and the corresponding high-resolution image as the target; training on such "image pairs" yields a super-resolution reconstruction model. The prototype of Waifu2x's deep network is the work of Professor Xiaoou Tang's group at the Chinese University of Hong Kong (shown in Figure 3). Interestingly, that study points out that this deep model can be given an intuitive explanation. In Figure 3, the low-resolution image is turned into abstract feature maps by the CNN's convolution operations. Convolutions then implement a non-linear mapping from the low-resolution features to high-resolution features. The final step reconstructs the high-resolution image from the high-resolution features. These three steps correspond exactly to the three stages of traditional super-resolution methods.

Figure 3: The super-resolution reconstruction pipeline. From left to right: the low-resolution input image; the low-resolution feature maps obtained after several convolution operations; the high-resolution features obtained from them by non-linear mapping; and the reconstructed high-resolution output image.

Image Colorization

As the name suggests, image colorization fills a black-and-white image with color it originally did not have, "out of nothing". Colorization also uses a convolutional neural network, taking "image pairs" of black-and-white images and their color counterparts as input. But deciding the fill color merely by comparing grayscale pixels with RGB pixels does not work well, because the result has to match our cognitive habits: color a dog bright green, for example, and it will look very strange.
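All three stages of the super-resolution pipeline above (feature extraction, non-linear mapping, reconstruction) are built from the same basic convolution operation. A minimal pure-Python sketch of a "valid" 2-D convolution, for illustration only:

```python
def conv2d(img, kernel):
    """Valid 2-D cross-correlation: slide the kernel over the image and
    sum elementwise products. This is the basic operation behind each
    stage of a convolutional super-resolution network."""
    kh, kw = len(kernel), len(kernel[0])
    oh = len(img) - kh + 1
    ow = len(img[0]) - kw + 1
    out = [[0.0] * ow for _ in range(oh)]
    for i in range(oh):
        for j in range(ow):
            s = 0.0
            for u in range(kh):
                for v in range(kw):
                    s += img[i + u][j + v] * kernel[u][v]
            out[i][j] = s
    return out

# A 3x3 averaging kernel applied to a 4x4 image gives a 2x2 feature map.
img = [[1, 2, 3, 4],
       [5, 6, 7, 8],
       [9, 10, 11, 12],
       [13, 14, 15, 16]]
kernel = [[1 / 9] * 3 for _ in range(3)]
features = conv2d(img, kernel)
print([[round(x) for x in row] for row in features])  # -> [[6, 7], [10, 11]]
```

A real network stacks many such filters, with learned kernel weights rather than this fixed averaging kernel.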
To address this, researchers at Waseda University published work at SIGGRAPH 2016, the top international computer-graphics conference, that adds a "classification network" on top of the basic deep model. The classification branch first predicts the category of the objects in the image, and the color fill is then carried out with that prediction as its "basis". Figure 4 shows the model structure and some colorization examples; the restored colors are quite realistic. The same approach can also restore color to black-and-white films, simply by colorizing the video frame by frame.

Figure 4: The network structure and results of joint learning for image colorization. After the black-and-white image enters the network, the path splits in two: the upper branch performs colorization and the lower branch performs image classification. In the red part of the figure, the deep features of the two branches are fused; because the classification features are included, the classification result serves as auxiliary evidence for the colorization.

Image Captioning

People often say that pictures and words complement each other; text is the other medium, besides images, in which we describe the world. Recently, research on "Image Captioning" has been heating up. Its goal is to use computer vision and machine learning to automatically generate a natural-language description of an image, in other words to "speak from a picture". In general, an Image Caption model uses a CNN to extract image features, feeds those features into a language model, typically an LSTM (a kind of RNN), and trains the whole pipeline end to end to produce a language description of the image (shown in Figure 5).

Figure 5: Image Caption network structure.
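To illustrate how the language-model half of an Image Caption system turns features into a sentence, here is a toy greedy decoder. The `next_word_scores` table is a hypothetical stand-in for a trained LSTM conditioned on CNN image features; real models score a full vocabulary at every step.

```python
# Hypothetical toy "language model": a lookup table from the previous
# word to scores for the next word. A real captioner replaces this with
# an LSTM whose state is initialized from CNN image features.
def next_word_scores(image_feat, prev_word):
    table = {
        "<start>": {"a": 2.0, "dog": 0.5},
        "a":       {"dog": 2.0, "runs": 0.1},
        "dog":     {"runs": 2.0, "<end>": 0.5},
        "runs":    {"<end>": 2.0},
    }
    return table[prev_word]

def greedy_caption(image_feat, max_len=10):
    """Greedy decoding: at each step pick the highest-scoring next word,
    starting from <start> and stopping at <end>."""
    words, prev = [], "<start>"
    for _ in range(max_len):
        scores = next_word_scores(image_feat, prev)
        prev = max(scores, key=scores.get)
        if prev == "<end>":
            break
        words.append(prev)
    return " ".join(words)

print(greedy_caption(image_feat=None))  # -> "a dog runs"
```

End-to-end training adjusts both the CNN features and the LSTM so that the highest-scoring word sequence describes the image well.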
The image is the input: it first passes through a multi-label classification network to obtain predicted category tags, and the deep image features are then fed into the LSTM language model that follows, with the whole model trained jointly. The figure below shows completed image-captioning tasks: second from the left is single-word image question answering, and first from the right is sentence-level image description.

Portrait Restoration (Sketch Inversion)

In early June, Dutch scientists released their latest results on arXiv: "restoring" a face photo from a sketch with a deep network. As shown in Figure 6, in the training stage a traditional edge-extraction method is first used to obtain the contour sketch of each real face image, and the "image pair" consisting of the original photo and its contour sketch is fed in for model training, much as in super-resolution reconstruction. In the prediction stage the input is a face sketch (second from the left); through the convolutional network's layer-by-layer abstraction and subsequent "restoration" operations, a photo-like face image (right) is gradually recovered, realistic enough to pass for genuine next to the real face photo on the far left. Below the model flowchart are further restoration examples: the left column shows real portraits, the middle column shows face sketches drawn by hand by an artist, and feeding these into the network yields the restored results in the right column. If this becomes practical, forensic sketch artists may never need to practice art again.

Figure 6: The portrait-restoration pipeline and its results.

Automatic Image Generation

Looking back at the four tasks above, each still relies on some "material" from which to conjure "something out of nothing"; "portrait restoration", for instance, needs a contour sketch before it can restore the portrait. The natural next step is to generate an image approximating a real scene from nothing but a random vector. Such "unsupervised learning" is one of the holy grails of computer vision.
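The photo/sketch training pairs for sketch inversion are manufactured by running a classical edge extractor over real photos. Here is a crude gradient-threshold stand-in (not the specific detector the paper used) to show the idea:

```python
def edge_map(img, thresh=10):
    """Crude contour extraction, an illustrative stand-in for the
    classical edge detector used to build photo/sketch training pairs:
    mark a pixel 1 where the horizontal or vertical intensity jump
    exceeds `thresh`."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for i in range(h - 1):
        for j in range(w - 1):
            gx = abs(img[i][j + 1] - img[i][j])   # horizontal jump
            gy = abs(img[i + 1][j] - img[i][j])   # vertical jump
            if max(gx, gy) > thresh:
                out[i][j] = 1
    return out

# A dark square on a bright background yields a contour around its edge.
img = [[100, 100, 100, 100],
       [100, 0,   0,   100],
       [100, 0,   0,   100],
       [100, 100, 100, 100]]
for row in edge_map(img):
    print(row)
```

The sketch-inversion network then learns the reverse mapping, from contour back to photo, by training on many such pairs.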
A creative piece of work in this direction is the "Generative Adversarial Nets" (GAN) proposed by Ian Goodfellow, Yoshua Bengio, and colleagues. The work is inspired by the zero-sum game of game theory: in a two-player zero-sum game, the players' payoffs sum to zero or a constant, so if one side gains, the other must lose. The two players in a GAN are a "discriminator network" and a "generator network", as shown in Figure 7.

Figure 7: The generator network and the discriminator network.

The discriminator's input is an image, and its job is to judge whether that image is real or computer-generated. The generator's input is a random vector, from which the network "generates" a synthetic image. This synthetic image can in turn be fed to the discriminator, only now, ideally, the discriminator should judge it to be computer-generated. The zero-sum game then plays out between the two: the generator tries to make its images approximate real images ever more closely so as to "deceive" the discriminator, while the discriminator keeps raising its guard against being fooled by the generator. Back and forth the iteration goes, rather like the art of fighting one's left hand with one's right. The ultimate goal of the whole process is to learn a generator network that approximates the true data distribution; once it has mastered that distribution, it lives up to the name "Generative Adversarial Network". It is worth emphasizing that, unlike traditional supervised deep learning, a GAN requires no category labels whatsoever: it is deep learning under unsupervised conditions.
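The zero-sum objective the two networks fight over can be sketched numerically. In this toy version, two scalars stand in for the expectations over real and generated images:

```python
import math

def gan_value(d_real, d_fake):
    """The GAN objective V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))],
    with scalar stand-ins for the two expectations. The discriminator
    tries to maximize this value; the generator tries to minimize it."""
    return math.log(d_real) + math.log(1.0 - d_fake)

# When the generator perfectly matches the data distribution, the best
# the discriminator can do is output 1/2 everywhere, giving -2*log(2).
print(round(gan_value(0.5, 0.5), 4))  # -> -1.3863
```

A confident discriminator (high `d_real`, low `d_fake`) pushes the value up; training the generator drives the value back down toward this equilibrium.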
At the beginning of 2016, building on GAN, Indico Research and the Facebook AI lab implemented GAN with deep convolutional neural networks (DCGAN, Deep Convolutional GAN). The work was published at the international conference ICLR 2016 and achieved the best results to date among unsupervised deep-learning models. Figure 8 shows some bedroom images generated by DCGAN.

Figure 8: Bedroom images generated by DCGAN.

More interesting still, DCGAN supports addition and subtraction at the "semantic" level of images, much as Word2Vec does for words (shown in Figure 9).

Figure 9: DCGAN's "semantic arithmetic".

In addition, Professor Song-Chun Zhu of UCLA, a leading figure in generative computer vision, recently published work based on generative convolutional networks that not only synthesizes dynamic textures automatically but synthesizes sound as well, pushing unsupervised computer vision another big step forward.

Conclusion

Today, riding the tailwind of "deep learning", the performance of most computer-vision tasks keeps being pushed to new highs; even "something-out-of-nothing" feats such as portrait restoration and image generation can now be realized with high quality, which is genuinely exciting. Even so, the supposedly disruptive AI "singularity" remains quite distant: it is foreseeable that at this stage, and probably for a long time to come, computer vision and artificial intelligence will be unable to create the one thing that truly counts as "something out of nothing", namely self-awareness. Still, I am glad that we can witness and take part in this revolutionary wave in computer vision and in artificial intelligence as a whole, and I believe many more "something-out-of-nothing" feats will arrive in the future. Standing in the middle of this trend, I am too excited to sleep.
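The "semantic arithmetic" of Figure 9 works like Word2Vec's word arithmetic. A toy sketch with hypothetical 3-dimensional embeddings (in DCGAN the vectors are latent codes and the result is decoded into an image rather than looked up):

```python
import math

# Hypothetical embedding table illustrating Figure 9's arithmetic:
# smiling_woman - woman + man should land near smiling_man.
emb = {
    "man":           [1.0, 0.0, 0.0],
    "woman":         [0.0, 1.0, 0.0],
    "smiling_woman": [0.0, 1.0, 1.0],
    "smiling_man":   [1.0, 0.0, 1.0],
}

def add_sub(a, b, c):
    """Componentwise emb[a] - emb[b] + emb[c]."""
    return [x - y + z for x, y, z in zip(emb[a], emb[b], emb[c])]

def nearest(vec):
    """Return the vocabulary item closest to vec (Euclidean distance)."""
    return min(emb, key=lambda w: math.dist(emb[w], vec))

print(nearest(add_sub("smiling_woman", "woman", "man")))  # -> smiling_man
```

That directions in the learned latent space correspond to semantic attributes is exactly what makes this arithmetic work in DCGAN.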