Volume 2022, Issue 1, 1713912
Research Article
Open Access

The Construction and Approximation of ReLU Neural Network Operators

Hengjie Chen
Department of Mathematical Sciences, School of Science, Zhejiang Sci-Tech University, Hangzhou 310018, China

Dansheng Yu
School of Mathematics, Hangzhou Normal University, Hangzhou 310036, China

Zhong Li (Corresponding Author)
Department of Mathematical Sciences, School of Science, Zhejiang Sci-Tech University, Hangzhou 310018, China
School of Information Engineering, Huzhou University, Huzhou 313000, China

First published: 28 September 2022
Academic Editor: Yoshihiro Sawano

Abstract

In the present paper, we construct a new type of two-hidden-layer feedforward neural network operator with the ReLU activation function. We estimate the rate of approximation by the new operators using the modulus of continuity of the target function. Furthermore, we analyze features of this network structure such as parameter sharing and local connectivity.

1. Introduction

Artificial neural networks are a fundamental tool in machine learning and have been applied in many fields, such as pattern recognition, automatic control, signal processing, decision support, and artificial intelligence. In particular, the successes of deep (multi-hidden-layer) neural networks in image recognition, natural language processing, computer vision, and related areas in recent years have attracted great attention to neural networks. Historically, the XOR function was first realized by adding one hidden layer to the simplest perceptron, which led to the single-hidden-layer feedforward neural network.

A single-hidden-layer feedforward neural network has the form
(1)  N(x) = ∑_{i=1}^{n1} ci φ(ωi · x + θi),
where ci, θi (i = 1, 2, ⋯, n1) are called the output weights and thresholds, the dimension of the input weights ωi (i = 1, 2, ⋯, n1) is that of the input x, φ is called the activation function of the network, and n1 is the number of neurons in the hidden layer. If A1 = (ω1, ω2, ⋯, ωn1)^T (T denotes the transpose) is the input weight matrix of size n1 × d, where d is the dimension of the input x, and Θ1 = (θ1, θ2, ⋯, θn1)^T and O = (c1, c2, ⋯, cn1)^T are the vectors of thresholds and output weights, respectively, then (1) can be written as
(2)  N(x) = O^T φ(A1x + Θ1),
where φ(A1x + Θ1) means that φ acts on each component of A1x + Θ1. The architecture of a neural network with two hidden layers is now easy to describe. If the second hidden layer contains n2 neurons, the input weight matrix A2 of the second hidden layer has size n2 × n1, the vector of thresholds is Θ2, and the output weight vector is O, then the two-hidden-layer feedforward neural network can be mathematically expressed as
(3)  N(x) = O^T φ(A2φ(A1x + Θ1) + Θ2).

We call w = max{n1, n2} the width of the network (3), and its depth is naturally 2.
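To make the matrix forms (2) and (3) concrete, the following is a minimal NumPy sketch; the activation is taken to be ReLU only for illustration, and all parameter values (A1, Θ1, A2, Θ2, O) are random placeholders rather than the operators constructed in Section 2.

import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def two_hidden_layer_net(x, A1, Theta1, A2, Theta2, O, phi=relu):
    # Evaluates O^T phi(A2 phi(A1 x + Theta1) + Theta2) as in (3);
    # phi acts componentwise, as stated after (2).
    h1 = phi(A1 @ x + Theta1)   # first hidden layer, n1 neurons
    h2 = phi(A2 @ h1 + Theta2)  # second hidden layer, n2 neurons
    return O @ h2               # scalar output

# Example with placeholder random parameters: input dimension 2, n1 = 5, n2 = 3.
rng = np.random.default_rng(0)
A1, Theta1 = rng.normal(size=(5, 2)), rng.normal(size=5)
A2, Theta2 = rng.normal(size=(3, 5)), rng.normal(size=3)
O = rng.normal(size=3)
print(two_hidden_layer_net(np.array([0.3, -0.7]), A1, Theta1, A2, Theta2, O))

The width of this example network is w = max{5, 3} = 5 and its depth is 2.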

The theory and applications of the single-hidden-layer neural network model developed rapidly in the 1980s and 1990s, and there were already some results on neural networks with several hidden layers at that time. Indeed, in [1], Pinkus pointed out that "Nonetheless there seems to be reason to conjecture that the two-hidden-layer model may be significantly more promising than the single layer model, at least from a purely approximation-theoretical point of view. This problem certainly warrants further study." However, whether for a single-hidden-layer or a multi-hidden-layer neural network, three fundamental issues are always involved: density, complexity, and algorithms.

The so-called density, or universal approximation, of a neural network structure means that for any prescribed accuracy and any target function in a function space equipped with some metric, there is a specific neural network model (with all parameters other than the input x determined) such that the error between the output and the target is less than the prescribed accuracy. In the 1980s and 1990s, research on the density of feedforward neural networks produced many satisfactory results [2–9]. Since the single-hidden-layer neural network is an extreme case of multilayer neural networks, the current focus of neural network research is still on complexity and algorithms. The complexity of a neural network refers to the number of structural parameters that a model requires in order to guarantee a prescribed degree of approximation, including the number of layers (the depth), the number of neurons in each layer (often measured by the width), the number of link weights, and the number of thresholds. In particular, it is of interest to have as many equal weights and thresholds as possible, which is called parameter sharing, since this reduces computational complexity. The representation ability that has attracted much attention in deep neural networks is in fact a complexity problem, and it needs to be investigated extensively.

The constructive method is an important approach to the study of complexity, applicable to both single- and multi-hidden-layer neural networks. There are two cases: in the first, the depth, width, and approximation degree are given while the weights and thresholds remain undetermined; in the second, all of these are given, so the neural network model is completely determined. To determine the weights and thresholds in the first kind of network, one uses samples to learn or train. In theory, the second kind of network can be applied directly, although in practice its parameters are often fine-tuned with a small number of samples before use. There have been many results on the construction of network operators [10–26], and these results play an important guiding role in the construction and design of neural networks. The purpose of this paper is therefore to construct a kind of two-hidden-layer feedforward neural network operator with the ReLU activation function and to give an upper bound estimate of the approximation (or regression) ability of this network for continuous functions of two variables defined on [−1, 1]^2.

The rest of the paper is organized as follows: in Section 2, we introduce the new two-hidden-layer neural network operators with the ReLU activation function and establish the rate of approximation by the new operators. In Section 3, we give the proof of the main result. Finally, in Section 4, we present some numerical experiments and discussions.

2. Construction of ReLU Neural Network Operators and Their Approximation Properties

Let r : ℝ → ℝ denote the rectified linear unit (ReLU), i.e., r(x) = max{0, x}. For any (x1, x2) ∈ ℝ^2, we define
(4)
Obviously, σ is a continuous function of two variables supported on [−1, 1]^2. By using the fact that |x| = r(x) + r(−x), σ can be rewritten as follows:
(5)

From the above representation, we see that σ(x1, x2) can be interpreted as the output of a two-hidden-layer feedforward neural network. It is obvious that σ possesses the following important properties:

(A1) σ(−x1, x2) = σ(x1, x2), σ(x1, −x2) = σ(x1, x2)

(A2) For any fixed x1, σ(x1, x2) is nondecreasing in x2 for x2 ≤ 0 and nonincreasing in x2 for x2 ≥ 0; similarly, for any fixed x2, σ(x1, x2) is nondecreasing in x1 for x1 ≤ 0 and nonincreasing in x1 for x1 ≥ 0

(A3) 0 ≤ σ(x1, x2) ≤ 3/4

(A4)
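Since the explicit formulas (4) and (5) are not reproduced above, the following sketch only illustrates the underlying idea: using |x| = r(x) + r(−x), a continuous, compactly supported, even, decreasing bump of two variables can be realized as the output of a two-hidden-layer ReLU network. The function sigma_tilde below is an assumed stand-in (its support is the diamond |x1| + |x2| ≤ 1 and its maximum is 1), not the σ of (4).

import numpy as np

def r(z):
    # rectified linear unit
    return np.maximum(z, 0.0)

def sigma_tilde(x1, x2):
    # First hidden layer: the four units r(x1), r(-x1), r(x2), r(-x2),
    # so that |xi| = r(xi) + r(-xi).
    abs_x1 = r(x1) + r(-x1)
    abs_x2 = r(x2) + r(-x2)
    # Second hidden layer: a single ReLU unit producing the bump.
    return r(1.0 - abs_x1 - abs_x2)

# Analogues of (A1)-(A3): evenness in each variable, monotone decay away
# from the origin, and support contained in [-1, 1]^2.
print(sigma_tilde(0.3, -0.2) == sigma_tilde(-0.3, 0.2))  # True (evenness)
print(sigma_tilde(0.9, 0.5))                              # 0.0 (outside the support)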

For any continuous function f(x1, x2) on [−1, 1]^2, we define the following neural network operator:
(6)
where ⌊x⌋ is the largest integer not greater than x, and ⌈x⌉ denotes the smallest integer not less than x.

We prove that the rate of approximation by these operators can be estimated by using the modulus of continuity of the target function. In fact, we have

Theorem 1. Let f(x1, x2) be a continuous function defined on [−1, 1]^2. Then,

(7)

where the moduli of continuity of f appearing in (7) are defined by

(8)

Remark 2. For 0 < α < 1, we define the following neural network operators:

(9)

Using an argument similar to the proof of Theorem 1, we can obtain

(10)

Remark 3. Let β (0 < β ≤ 1) be a fixed number. If there is a constant L > 0 such that

(11)
for any (x1, x2), (x1′, x2′) ∈ [−1, 1]^2, we say that f is a Lipschitz function of order β. Obviously, for such f, each modulus of continuity in (8) is bounded by Lδ^β. Consequently, it follows from (7) that
(12)
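As a worked step (writing, for convenience only, F_n for the operators (6), and under the assumption that the right-hand side of (7), not reproduced above, is of the order of the moduli of continuity evaluated at scale 1/n), the Lipschitz condition yields the rate

\omega\Bigl(f,\tfrac{1}{n}\Bigr) \le L\,n^{-\beta}
\quad\Longrightarrow\quad
\bigl|F_n(f;x_1,x_2) - f(x_1,x_2)\bigr| = O\bigl(n^{-\beta}\bigr)
\quad\text{uniformly on } [-1,1]^2 .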

Remark 4. Now, we describe the structure of the network operators (6) by using the form (3).

The input matrix of the first hidden layer is

(13)

and its size is . The bias vector of the first hidden layer is

(14)

and the dimension is . The input matrix of the second hidden layer is

(15)

and its size is . Θ2 is a constant vector with all entries equal to 1, of dimension . The output weight vector is

(16)
Its general term and dimension are and , respectively.

We can see that each of the weight matrices A1 and A2 contains only two distinct values. That is, the neural network operators (6) have a strong weight-sharing feature. There are some results about constructions of this kind of neural network [14, 27–29]. Moreover, the structure of A2 shows that this neural network is locally connected. Finally, the simplicity of the bias vector Θ2 also greatly reduces the complexity of the neural network.
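As a small illustration of these two features (the explicit A1 and A2 of (13) and (15) are not reproduced above, so the banded two-valued matrix below is an assumed stand-in, not the paper's A2), one can count the nonzero links and the distinct weight values:

import numpy as np

def banded_two_value_matrix(rows, cols, a, b, bandwidth=1):
    # Nonzero entries only within `bandwidth` of the diagonal (local
    # connectivity), taking just the two values a and b (weight sharing).
    M = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(max(0, i - bandwidth), min(cols, i + bandwidth + 1)):
            M[i, j] = a if i == j else b
    return M

A = banded_two_value_matrix(6, 8, a=2.0, b=-1.0)
print("nonzero links:", np.count_nonzero(A))           # far fewer than 6 * 8
print("distinct weights:", np.unique(A[A != 0]).size)  # only 2

In a fully connected layer with arbitrary weights, both counts would equal the full number of entries; local connectivity shrinks the first count and weight sharing shrinks the second.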

3. Proof of the Main Result

To prove Theorem 1, we need the following auxiliary lemma.

Lemma 5. For the function σ(x1, x2), we have

(17)

Proof. We only prove (1) and (2); (3) and (4) can be proved similarly.

  • (1)

When ki − 1 < ki < ki + 1 ≤ ⌊nxi⌋ − 1 (i = 1, 2), we have

(18)

Considering the monotonicity of σ(x1, x2), we have

(19)
(20)

Combining (19) and (20) leads to

(21)

Similarly, we have

(22)

By (21), (22), and summation from to ⌊nxi⌋ − 1 (i = 1, 2), we obtain (1) of Lemma 5.

  • (2)

    When k2 + 1 > k2 > k2 − 1 ≥ ⌈nx2⌉ + 1, we have

(23)

From (18) and (23), and arguing in a similar way to the proof of (1), we get

(24)

By summation for and , we obtain (2) of Lemma 5.

Proof of Theorem 1. Let

(25)

Then,

(26)

We now estimate I1 and I2, respectively.

Set

(27)

Since in ∑2 + ∑3 + ∑4 at least one of the two inequalities holds, either

(28)
or
(29)
is valid. Therefore,
(30)
which implies that
(31)

For ∑1, by the facts that , for (x1, x2) ∈ [−1, 1]^2, we obtain that

(32)

Hence,

(33)

where we have used the inequality 0 ≤ σ ≤ 3/4 and the fact that the number of terms in ∑1 is no more than . From (27)–(33), it follows that

(34)

Set

(35)

Then,

(36)

Firstly, we have

(37)

Noting that , we get Δ1 = 0 by arguments similar to those used for estimating ∑2 + ∑3 + ∑4 in (27). Therefore,

(38)

Consequently,

(39)

Similarly, we have

(40)

Combining the above estimates, we have

(41)

Now, let us estimate I21. By

(42)
we deduce that
(43)
where we used the fact that the support of σ(t1, t2) is [−1, 1]^2.

Similarly, by

(44)
we have
(45)

By (1) of Lemma 5, (43), and (45), we have

(46)

By (2)–(4) of Lemma 5 and arguments similar to (43) and (45), we obtain that

(47)
(48)
(49)

By (46)-(49) and the identity , we have

(50)

It follows from (26), (34)-(41), and (50) that

(51)

which completes the proof of Theorem 1.

4. Numerical Experiments and Some Discussions

In this section, we give some numerical experiments to illustrate the theoretical results. We take as the target function.

Set
(52)

Figures 1–3 show the results for e100(x1, x2), e1000(x1, x2), and e10000(x1, x2), respectively. When n equals 10^6, the amount of calculation is large. Therefore, we choose 6 specific points and list the corresponding values of en(x1, x2) in Table 1.

Figure 1: Errors of approximation of network operators (6) with n = 100.
Figure 2: Errors of approximation of network operators (6) with n = 1000.
Figure 3: Errors of approximation of network operators (6) with n = 10000.
Table 1. The error values of en(x1, x2) at 6 specific points with n = 1000000.
(x1, x2)      (0, −1)   (−1, 1)   (0.5, 0.5)   (0, 0)      (0.5, −0.6)   (0.25, 0.8)
en(x1, x2)    0.0020    0.0040    -9.982e-04   3.992e-07   0.0012        -0.0014

From the experimental results, we see that the approximation improves as the parameter n of the neural network operators increases; a simple calculation based on the estimate (7) confirms the validity of the obtained result.
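The error surfaces in Figures 1–3 and the values in Table 1 can be tabulated along the following lines. Here F_n is a placeholder callable standing for an implementation of the operators (6), whose explicit formula is not reproduced in this text; f and the grid size are likewise placeholders.

import numpy as np

def error_surface(F_n, f, n, grid_points=41):
    # Tabulate e_n(x1, x2) = F_n(f; x1, x2) - f(x1, x2) on a uniform grid
    # over [-1, 1]^2.
    xs = np.linspace(-1.0, 1.0, grid_points)
    e = np.empty((grid_points, grid_points))
    for i, x1 in enumerate(xs):
        for j, x2 in enumerate(xs):
            e[i, j] = F_n(f, n, x1, x2) - f(x1, x2)
    return xs, e

# Usage with a hypothetical F_n and target f:
#   xs, e = error_surface(F_n, f, n=1000)
#   print(np.abs(e).max())   # worst-case error over the grid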

If we examine the network operators (6) carefully, a natural question is why we use modified sample values instead of f((k1/n), (k2/n)) in (6), since ((k1/n), (k2/n)) are the conventional grid points on [−1, 1]^2 and using them would reduce the amount of calculation. To answer this, we introduce the following network operators:
(53)
Then, from the proof of Theorem 1, we have
(54)
It is not difficult to obtain the same estimate as for I1, but it is not convenient to estimate the remaining term. In fact, if we set
(55)

Figure 4 shows the errors of approximation of the network operators (53) with n = 1000. We can see that near the border of [−1, 1]^2 the approximation of f is not satisfactory; this phenomenon can also be seen in Table 2 below. This is why we modified f((k1/n), (k2/n)) to construct the operators (6), for which we then obtained the error estimate and carried out the numerical experiments above.

Figure 4: Errors of approximation of network operators (53) with n = 1000.
Table 2. The error values of the operators (53) and of en(x1, x2) at 6 specific points with n = 10000.
(x1, x2)         (0, −1)   (−1, 1)   (0.5, 0.5)   (0, 0)      (0.5, −0.6)   (0.25, 0.8)
error for (53)   -0.5000   -1.4862   2.749e-05    3.999e-05   2.47e-05      2.24e-05
en(x1, x2)       -0.0197   -0.0394   -0.0098      3.921e-05   -0.0120       -0.0138

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research was supported by the National Natural Science Foundation of China under Grant No. 12171434 and Zhejiang Provincial Natural Science Foundation of China under Grant No. LZ19A010002.

Data Availability

Data are available on request from the authors.
