A ResNet with one neuron per hidden layer is a universal function approximator

     

MIT CSAIL researchers have found that a ResNet with only one neuron per hidden layer is a universal function approximator, and that the identity mapping genuinely strengthens the expressive power of deep networks. The researchers say this finding also fills a theoretical gap in our understanding of the expressive power of fully connected networks.

Deep neural networks are key to the success of many machine learning applications, and a major trend in deep learning is that networks keep getting deeper. Taking computer vision as an example, from the early AlexNet, through VGGNet, to the more recent ResNet, performance has indeed improved as the number of layers grew. The intuition is that as depth increases, so does the capacity of the network, making it easier to approximate a target function. From the theoretical side, more and more people have therefore begun to ask: can a sufficiently deep network approximate any function?

In a paper recently uploaded to arXiv, two MIT CSAIL researchers start from the ResNet architecture and study exactly this question. They found that a ResNet with only one neuron in each hidden layer is a universal approximator, and this holds no matter how deep the network is, even as the depth tends to infinity. One neuron is enough. Isn't that exciting?

Understanding the universal approximation theorem from the depth perspective

The representational power of neural networks has been studied before. Work in the late 1980s showed that, given enough hidden neurons, a network with a single hidden layer can approximate any continuous function to arbitrary precision. This is known as the universal approximation theorem. That result, however, takes the perspective of "width" rather than "depth": it keeps adding hidden neurons and widening the network, whereas practical experience tells us that deep networks are best at learning the functions that solve real-world problems. This naturally leads to a question: if the number of neurons per layer is fixed, does the universal approximation theorem still hold as the network depth grows to infinity?

Zhou Lu et al. of Peking University, in the NIPS 2017 paper "The Expressive Power of Neural Networks: A View from the Width", found that for fully connected networks with ReLU activations, the universal approximation theorem holds when every hidden layer has at least d + 4 neurons (where d is the input dimension), but fails when every hidden layer has at most d neurons. So what happens if this condition is changed, and what does that tell us about the expressive power of deep networks?

The two MIT CSAIL researchers turned to ResNet. Since Kaiming He et al. introduced it, ResNet has even been considered the best network architecture. Its success comes from the shortcut connection, and from the identity mapping built on top of it, which lets data flow across layers: the original problem is converted into making the residual function f(x) = H(x) - x approach zero, rather than fitting the underlying mapping H(x) directly. Because of the identity mapping, the width of a ResNet always matches the input dimension. The authors therefore took this structure and kept shrinking the hidden layer of each residual block to see where the limit lies. The answer, as stated above: a single neuron per hidden layer is enough.
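To make this construction concrete, here is a minimal PyTorch sketch of a ResNet whose residual branch contains a single ReLU unit per block, i.e. each block maps x to x + V·relu(u·x + b), with u of shape 1×d and V of shape d×1. This is not the authors' code; the module names, the final linear readout, and the chosen depth are illustrative assumptions.

    import torch
    import torch.nn as nn

    class OneNeuronBlock(nn.Module):
        # One residual block: x -> x + V * relu(u . x + b), with a single hidden unit.
        def __init__(self, d):
            super().__init__()
            self.u = nn.Linear(d, 1)              # d-dimensional input -> 1 hidden unit
            self.v = nn.Linear(1, d, bias=False)  # 1 hidden unit -> d-dimensional update
        def forward(self, x):
            return x + self.v(torch.relu(self.u(x)))  # identity skip plus residual branch

    class OneNeuronResNet(nn.Module):
        # Stack of one-neuron blocks followed by a linear readout.
        def __init__(self, d, depth):
            super().__init__()
            self.blocks = nn.Sequential(*[OneNeuronBlock(d) for _ in range(depth)])
            self.head = nn.Linear(d, 1)           # scalar output, e.g. for binary classification
        def forward(self, x):
            return self.head(self.blocks(x))

    # The width never exceeds the input dimension d; only the depth grows.
    model = OneNeuronResNet(d=2, depth=10)

Each block adds only about 2d + 1 trainable parameters (the weights and bias of u plus the weights of V), so in this sketch all expressive power comes from stacking blocks rather than widening them.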
The authors say this theoretically shows that ResNet's identity mapping does enhance the expressive power of deep networks.

Illustration: the difference between a fully connected network and a ResNet

The authors give a toy example. We first use a simple example to explore the difference between a fully connected network and a ResNet, where each hidden layer of the fully connected network has d neurons. The task is to classify the unit ball in the plane. The training set consists of randomly generated samples: positive samples lie inside the unit ball and negative samples lie in a surrounding region, with a margin left between the two classes to make the classification task easier. We use the logistic loss, summing log(1 + exp(-y_i · ŷ_i)) over the samples, where ŷ_i is the network's output on the i-th sample. After training, we plot the decision boundaries learned by networks of various depths. Ideally, we want the model's decision boundary to be close to the true one.

Figure 2: On the unit-ball classification problem, decision boundaries learned by fully connected networks of width d = 2 (top row) and by ResNets with only one neuron per hidden layer (bottom row), trained at various depths. The fully connected network cannot capture the true function, consistent with the theory that width d is too narrow for universal approximation. The ResNet, by contrast, approximates the function well, supporting our theoretical result.

Figure 2 shows the result. For the fully connected network (top row), the learned decision boundary has essentially the same shape at every depth: approximation quality does not seem to improve as depth increases. Although one might be tempted to attribute this to local optima, the result is consistent with [19]:

Proposition 2.1. Let f be the function computed by a fully connected network N with ReLU activations, and let P = {x : f(x) > 0} denote its positive level set. If each hidden layer of N has at most d neurons, then λ(P) = 0 or λ(P) = +∞, where λ denotes the Lebesgue measure.

In other words, the positive level set of a "narrow" fully connected network is either of zero measure or unbounded. Therefore, even as the depth tends to infinity, a narrow fully connected network cannot approximate a bounded decision region. We only show the case d = 2 here because it is easy to visualize; the same observation holds in higher dimensions.

The ResNet's decision boundary looks markedly different: even though its width is narrower, the ResNet does represent a bounded region, and as the depth increases its decision boundary appears to converge to the true one. Hence Proposition 2.1 cannot apply to ResNets. These observations motivated the universal approximation theorem proved in the paper.
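To get a feel for this experiment, here is a rough sketch of how one might set it up, reusing the OneNeuronResNet class from the sketch above. The radii, sample count, optimizer, and learning rate are illustrative assumptions, not the paper's settings.

    import math
    import torch

    def make_data(n=512):
        # Positive class inside the unit ball, negative class in a surrounding
        # ring, with a gap between them so the two classes are well separated.
        r = torch.cat([torch.rand(n // 2),            # radii in [0, 1): positive samples
                       2.0 + torch.rand(n // 2)])     # radii in [2, 3): negative samples
        theta = torch.rand(n) * 2 * math.pi
        x = torch.stack([r * torch.cos(theta), r * torch.sin(theta)], dim=1)
        y = torch.cat([torch.ones(n // 2), -torch.ones(n // 2)])
        return x, y

    x, y = make_data()
    model = OneNeuronResNet(d=2, depth=10)            # from the sketch above
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for step in range(2000):
        out = model(x).squeeze(1)
        loss = torch.log1p(torch.exp(-y * out)).mean()  # logistic loss, as in the article
        opt.zero_grad()
        loss.backward()
        opt.step()

Plotting the sign of the trained model over a grid of points would then reproduce the kind of decision-boundary pictures described for Figure 2.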
Discussion

In this paper, we show a universal approximation theorem for the ResNet architecture with one neuron per hidden layer. This stands in contrast to recent results on fully connected networks, for which universal approximation fails when the width is at most d.

ResNet vs. fully connected networks: Although we achieve universal approximation with only one hidden neuron per basic residual block, one might object that the ResNet architecture still passes the identity through to the next layer. That identity map could be counted as d hidden units, giving a total of d + 1 hidden units per residual block and making the network comparable to a fully connected network of width d + 1. Even from this perspective, however, the ResNet is a compressed, sparse version of such a fully connected network. In particular, a fully connected network of width d + 1 has on the order of d^2 connections per layer, whereas the ResNet needs only on the order of d, because the identity map requires no parameters. This over-parameterization of fully connected networks may explain why dropout is useful for them. By the same reasoning, our results imply that a fully connected network of width d + 1 is a universal approximator, which is a new finding. The construction in [19] requires d + 4 units per layer, leaving a gap between the upper and lower bounds; our result narrows that gap: a fully connected network of width d + 1 is a universal approximator, while one of width d is not.

Why universal approximation matters: As we mention in Section 2 of the paper, a fully connected network of width d can never approximate a compact decision boundary, even if we allow unlimited depth. In high-dimensional spaces, however, it is difficult to visualize and inspect the learned decision boundary. The universal approximation theorem provides a sanity check and guarantees that, in principle, any desired decision boundary can be captured.

Training efficiency: The universal approximation theorem only guarantees that a good approximation of the desired function exists; it does not guarantee that we can actually find it by running SGD or any other optimization algorithm. Understanding training efficiency probably requires a better understanding of the optimization landscape, a topic that has recently attracted much attention. Here we offer a slightly different angle. According to our theory, a ResNet with one-neuron hidden layers is already a universal approximator. In other words, a ResNet with multiple units per layer is an over-parameterization of the model, and over-parameterization has been observed to make optimization easier. This may be one of the reasons why training a very deep ResNet is "easier" than training a fully connected network. Future work could analyze this more rigorously.

Generalization: Since a universal approximator can fit any function, one might expect it to overfit easily. Yet deep networks are often observed to generalize remarkably well on test sets. Explaining this phenomenon is beyond the scope of our paper, but understanding universal approximation capability is an important part of any such theory. Moreover, our results suggest that the over-parameterization mentioned above may also play a role here.

Summary: In summary, we give a universal approximation theorem for ResNets with one-neuron hidden layers. This theoretically distinguishes ResNets from fully connected networks, and our results fill a gap in the understanding of the representational power of fully connected networks. To some extent, they also provide theoretical motivation for the widespread practice of going deeper with the ResNet architecture.

Read the full article for details. Original title: [One neuron rules them all] A theoretical proof of ResNet's power. Source: WeChat ID AI_ERA, WeChat public account Xin Zhiyuan. Welcome to follow the account, and please credit the source when reprinting.