Sequential Fitting: A Different Perspective on the Spectral Bias of Neural Networks

by Conor Rowan and Finn Murphy-Blanchard

Introduction

As demonstrated by their success on complex tasks such as image classification [1], autonomy [2], and language modeling [3], neural networks are spectacularly good at fitting high-dimensional, nonlinear functions from data. In fact, neural networks have such robust representational capabilities that they can achieve zero training error on images with randomized class labels, meaning there is no structure in the training data which the network can exploit [4]. Despite this flexibility, the neural network model class appears to provide useful inductive bias for many real-world tasks, as neural networks often generalize to unseen test data better than other model types [5]. Yet, regression with neural networks suffers a serious drawback, which has become known as the “spectral bias” in the literature.

Popularized in 2019, the spectral bias states that neural networks fit regression targets from low to high frequencies [6]. As shown in Figure 1, the neural network first learns the low-frequency content of the function, before refining the fit to capture the higher frequencies. As is standard in this literature, we understand the “frequency content” of the regression target to be provided by its Fourier transform.

Figure 1: Rahaman et al. showed empirically that a neural network (green) fits its regression target (blue) in order of increasing frequency. In practice, this means that neural networks are slow to fit high-frequency functions. Image adapted from [6].

Because networks fit the target function in order of increasing frequency, learning high-frequency functions is often quite slow, requiring a large number of training epochs. Subsequent works have corroborated the difficulties networks face in fitting high-frequency functions, and have offered explanations for this intriguing phenomenon. Some authors have explained the spectral bias by studying the Fourier spectrum of popular activation functions (e.g., ReLU, hyperbolic tangent, sigmoid, etc.), noting that their spectra decay rapidly at high frequencies, and thus the network is inherently biased toward learning low frequencies [7,8].

An influential approach called the Neural Tangent Kernel (NTK) offers an elegant explanation of the spectral bias by showing that, in the limit of an infinite-width network, the network output evolves according to a linear dynamical system. Using the theory of linear dynamical systems to decompose the network output into orthogonal modes, the authors in [9] show that the convergence rate is inversely proportional to the frequency content of the mode. This work offered a compelling theoretical explanation for the spectral bias of neural networks.

A number of other works have explored the spectral bias across different network architectures and optimization algorithms. For example, one work showed that for wide two-layer networks with ReLU activation, the training process can be interpreted as a constrained optimization problem in which high-frequency components of the solution are more heavily penalized [10]. In [11], noting that the original NTK analysis assumes training is carried out using gradient descent, the authors clarify that the spectral bias is observed with other optimizers as well.

More recently, it was shown from both an empirical and theoretical perspective that second-order quasi-Newton optimization strategies—meaning strategies which rely on an approximation of the Hessian of the loss—can mitigate the spectral bias for neural networks used in scientific machine learning applications [12]. Here, building on the NTK analysis, it is shown that pre-conditioning with the Hessian matrix helps equalize the convergence rate of modes of different frequencies, thus expediting the training process.

While much attention has been paid to understanding the origins of the spectral bias, a number of researchers have also proposed strategies to remedy it. Second-order optimization is one such strategy; others involve modifications to the architecture of the network. One architectural modification, known as a SIREN network, replaces standard activations with periodic functions such as sinusoids [13]. Another popular architecture is the Fourier feature network, which, instead of modifying the activation functions, lifts the input to a higher-dimensional space with periodic embeddings at random frequencies [14,15]. In the context of scientific machine learning, Fourier features have been shown to improve performance for multi-scale partial differential equations [16].
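To make the idea of a Fourier feature embedding concrete, here is a minimal NumPy sketch of one common form of the lifting step. The number of frequencies `n_freq`, the frequency scale `sigma`, and the Gaussian sampling of the frequency matrix `B` are illustrative assumptions, not settings taken from [14,15].

```python
import numpy as np

def fourier_features(x, B):
    """Lift inputs x of shape (N, d) to shape (N, 2 * n_freq) with sinusoids.

    B has shape (n_freq, d) and holds the (random) embedding frequencies.
    """
    proj = 2.0 * np.pi * x @ B.T                      # (N, n_freq)
    return np.concatenate([np.cos(proj), np.sin(proj)], axis=-1)

rng = np.random.default_rng(0)
n_freq, sigma = 64, 10.0                              # hypothetical choices
B = rng.normal(0.0, sigma, size=(n_freq, 1))          # random frequencies, 1D input
x = np.linspace(0.0, 1.0, 500)[:, None]               # (500, 1)
phi = fourier_features(x, B)                          # (500, 128)
```

The embedded coordinates `phi` would then be passed to an ordinary MLP in place of `x`, so the first layer already sees oscillatory functions of the input.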

The success of standard neural network architectures (multi-layer perceptrons, convolutional networks, etc.) in mainstream machine learning suggests that fitting high frequencies is not a bottleneck for many application areas. However, an inability to robustly or efficiently fit high-frequency functions can be a problem in scientific applications, where multi-scale and wave propagation problems rely heavily on oscillatory solution fields. While second-order optimization, SIREN networks, and Fourier features all represent successful remedies to the spectral bias, we believe the spectral bias to be an interesting problem in its own right.

Though the Fourier spectrum of the activation function offers some insight into the origin of the spectral bias for general neural network training problems, and the NTK provides an explanation in the case of infinite-width networks, we believe a more intuitive understanding of the spectral bias is possible. In this article, we argue that, in many cases, the spectral bias of multilayer perceptron (MLP) networks with hyperbolic tangent activations can be understood from the perspective of what we call “sequential fitting.” We define sequential fitting to mean that neural networks fit their target function beginning from the boundary and then progressing into the domain, building one oscillation of the target function at a time. We show that this behavior holds on a number of example problems in one and two spatial dimensions, and also find evidence for a “boundary effect,” whereby the training process is not only influenced by the frequency content of the target function, but also its behavior near boundaries.

Finally, we interpret these results using the “basis” learned by the neural network, namely, the set of functions defined by the last layer of the network. We show that in fitting high-frequency functions, these networks iteratively build a step function-type basis, which we believe offers additional insight into the spectral bias.

One-dimensional regression

Sequential fitting

In the following examples, we work with two hidden-layer MLP neural networks with hyperbolic tangent activation functions. We can write the network explicitly as

\[ u(\mathbf x; \boldsymbol \theta ) = \mathbf w^3 \cdot \tanh( \mathbf w^2 ( \tanh(\mathbf w^1 \mathbf x + \mathbf b^1)) + \mathbf b^2 ), \quad \boldsymbol \theta=[ \mathbf w^3 , \mathbf w^2 , \mathbf b^2 , \mathbf w^1 , \mathbf b^1] ,\]

where \(\boldsymbol \theta\) is the collection of all the trainable parameters (weights and biases) of the network, \(\mathbf x \in \Omega\) is the spatial coordinate (either one- or two-dimensional), and \(\Omega\) is the computational domain. The two hidden layers are taken to have equal width, which we denote by \( H \). We call the target function \(v(\mathbf x)\), and define the training objective as

\[ \underset{\boldsymbol \theta}{\text{argmin }} \frac{1}{2} \int \Big( u(\mathbf x ; \boldsymbol \theta) - v(\mathbf x) \Big)^2 d\Omega. \]

To demonstrate the phenomenon that we call sequential fitting, we begin with a one-dimensional regression problem on the unit domain, i.e., \( \Omega =[0,1] \), with a target function given by \( v(x) = \sin(26 \pi x) \). The width of the network is \( H=100 \), and the regression problem is solved with ADAM optimization using a learning rate of \( 5 \times 10^{-3}\). The integral in the objective is approximated using mid-point quadrature on a uniform grid with \(500\) points. Unless otherwise specified, all subsequent one-dimensional examples will be solved with this network architecture and integration rule, and these optimization settings. The number of training epochs will be displayed on plots showing the progress of the fit, and thus specified on a case-by-case basis. See Figure 2 for the results of this first example problem. The network initializes the fitting process near the boundaries, and then iteratively works its way toward the center of the domain, fitting one oscillation of the high-frequency target at a time. This is the phenomenon we call sequential fitting. We note that this Figure, and all subsequent Figures, were created by the authors.
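As a rough illustration of this setup, the following NumPy sketch evaluates the two hidden-layer network and the midpoint-rule objective for \(v(x) = \sin(26 \pi x)\). The initialization scales and the helper names (`init_params`, `network`, `objective`) are assumptions made for the sketch, and the ADAM training loop itself is omitted.

```python
import numpy as np

def init_params(H=100, d=1, seed=0):
    """Random parameters for the two hidden-layer tanh MLP.

    The Gaussian scales used here are an assumption, not the authors' scheme.
    """
    rng = np.random.default_rng(seed)
    return {
        "w1": rng.normal(0.0, 1.0, size=(H, d)),
        "b1": np.zeros(H),
        "w2": rng.normal(0.0, 1.0 / np.sqrt(H), size=(H, H)),
        "b2": np.zeros(H),
        "w3": rng.normal(0.0, 1.0 / np.sqrt(H), size=H),
    }

def network(x, p):
    """u(x; theta) = w3 . tanh(w2 tanh(w1 x + b1) + b2), for x of shape (N, d)."""
    h1 = np.tanh(x @ p["w1"].T + p["b1"])    # first hidden layer, (N, H)
    h2 = np.tanh(h1 @ p["w2"].T + p["b2"])   # second hidden layer, (N, H)
    return h2 @ p["w3"]                      # network output, (N,)

def objective(u_vals, v_vals, dx):
    """Midpoint-rule approximation of 0.5 * integral of (u - v)^2 over the domain."""
    return 0.5 * np.sum((u_vals - v_vals) ** 2) * dx

N = 500
dx = 1.0 / N
x = (np.arange(N) + 0.5) * dx                # midpoints of 500 uniform cells on [0, 1]
v = np.sin(26 * np.pi * x)                   # target function

params = init_params()
u = network(x[:, None], params)
loss_init = objective(u, v, dx)              # objective at the random initialization
loss_zero = objective(np.zeros(N), v, dx)    # objective of the constant-zero function
```

The value `loss_zero` is a useful reference level: a network that has only captured the (zero) mean of the target sits at this objective value until it starts resolving oscillations.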

Figure 2: Illustration of sequential fitting for a sinusoidal target function. The target is shown in orange, and the regression fit at the specified epoch is shown in blue. The network begins fitting from the boundaries of the domain, and sequentially works its way toward the center.

A second example shows how the envelope of the oscillatory function influences the training process. If sequential fitting begins at the boundaries, we hypothesize that the behavior of the function near the boundaries of the domain may have an effect on training. In particular, we test the case where an envelope function drives the amplitude of the oscillations to zero on one end of the domain. Our target function is \(v(x)=\sqrt{x} \sin(26 \pi x)\), where the \(\sqrt{x}\) envelope suppresses oscillations at the left end of the domain. See Figure 3 for the results. The sequential fitting process begins on the right side of the domain, where the oscillations have larger amplitude. As before, the network fits one oscillation at a time, except that the process is now one-sided, as a result of suppressed oscillations at the left boundary. This example motivates further investigation into the influence of the target function’s behavior near the boundary, which is the focus of the following section.

Figure 3: When an envelope function introduces asymmetry to the amplitude of oscillations near the boundary, the sequential fitting process also becomes asymmetric. In this case, fitting begins at the right boundary where the amplitude is larger, and works from right to left until the training objective is approximately zero.

Boundary effects

The previous example illustrated that not only the frequency content, but also the behavior of the target function near the boundary, can influence the fitting process. A striking demonstration of the influence of boundary behavior is seen when fitting the target function \(v(x)=4x(1-x)\sin(26 \pi x)\). Here, the parabolic envelope function causes the oscillations to decay to zero amplitude on both ends of the domain. Figure 4 shows the neural network fit over \(7500\) training epochs. Surprisingly, the network makes no progress toward representing the target, evidently due to the small amplitude oscillations at both boundaries. Compare this with the fitting of \(v(x)=4(x-1/2)^2 \sin(26 \pi x)\), a very similar target function, but one which suppresses oscillations in the center of the domain rather than at the ends. Figure 5 shows that the sequential fitting now behaves as expected: the network begins at the two ends of the domain, and then works symmetrically inward, building one oscillation at a time until the training objective is approximately zero.

Figure 4: When the amplitude of oscillations near the boundary is small, the network fails to initiate the fitting process in the allotted \(7500\) epochs. Here, an envelope function is used to suppress oscillations near the boundary.

Figure 5: When the envelope function is changed to suppress oscillations in the center of the domain, the sequential fitting process behaves as expected. The network works from the boundary into the domain, fitting one oscillation at a time.

Standard accounts of the spectral bias in the literature treat the difficulty of the regression problem as primarily dependent on the target’s frequency content, and not on other features such as its behavior near the boundary of the domain. The two examples given above suggest that the target’s boundary behavior does meaningfully influence the difficulty of the regression problem, as measured by the number of epochs to obtain small training error. However, the reader may object to this claim, arguing that the two envelope functions meaningfully change the frequency content of the target, even though both multiply the same oscillatory function (\(\sin(26 \pi x)\)). To show that this is not the case, we use the discrete Fourier transform (DFT) to compute the Fourier spectra of the two target functions. The DFT of the target function is

\[ F[m] = \sum_{j=0}^{499} v(x_j) \exp( -i 2\pi m j/500), \]

where \(F[m]\) are the complex Fourier coefficients and the \(x_j\) are the integration points. Note that, because the target function is real, the Fourier coefficients have Hermitian symmetry, meaning that \(F[m] = \overline{F[500-m]}\) for \(m = 1, \dots, 499\). Recall that the magnitude of the Fourier coefficient gives the contribution of a phase-shifted sinusoid of angular frequency \(2\pi m\) to the Fourier decomposition of the target function. We use this magnitude as our measure of the frequency content of the target. Because a complex number and its conjugate have the same magnitude, the Hermitian symmetry implies that only half of the magnitude spectrum is independent.
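The DFT sum and its Hermitian symmetry can be checked with a few lines of NumPy. This is a sketch that assumes the \(x_j\) are the midpoints of the 500-cell uniform grid and uses the pure sinusoid \(\sin(26 \pi x)\) as the sampled function; the variable names are illustrative.

```python
import numpy as np

N = 500
idx = np.arange(N)
x = (idx + 0.5) / N                          # midpoint integration points (assumed)
v = np.sin(26 * np.pi * x)

# Direct evaluation of F[m] = sum_j v(x_j) exp(-i 2 pi m j / N)
m = np.arange(N)
F_direct = np.exp(-2j * np.pi * np.outer(m, idx) / N) @ v

# The same quantity from the FFT routine
F_fft = np.fft.fft(v)

# Hermitian symmetry: F[m] equals the conjugate of F[N - m] for m = 1, ..., N - 1
hermitian_ok = np.allclose(F_direct[1:], np.conj(F_direct[:0:-1]))

# Location of the dominant coefficient in the independent half of the spectrum
mag = np.abs(F_direct)
peak = int(np.argmax(mag[: N // 2]))
```

For this target, `peak` lands on the mode with thirteen full periods over the unit domain, which is the index singled out in the comparison of the two spectra.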

Thus, Figure 6 compares the magnitude of the first half of the DFT spectrum for the two target functions. Their spectra differ only in the magnitude of the Fourier coefficient at \(m=13\). This reflects the fact that, up to sign, the two functions differ only by an additive \(\sin( 26 \pi x)\) term, which can be seen by expanding the square in the second of the two envelope functions: \(4(x-1/2)^2 = 1 - 4x(1-x)\). This example shows that the frequency content of the regression target may not be the only factor in determining the difficulty of the fitting process. We call this phenomenon the “boundary effect,” indicating that two functions with similar Fourier spectra can behave differently as regression targets due to their behavior near domain boundaries.
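The relationship between the two targets can be verified numerically. Since \(4(x-1/2)^2 = 1 - 4x(1-x)\), the second target equals \(\sin(26 \pi x)\) minus the first, so their DFT magnitudes should agree everywhere except at the mode carrying the pure sinusoid. The sketch below assumes midpoint sampling on the 500-point grid; the variable names are illustrative.

```python
import numpy as np

N = 500
x = (np.arange(N) + 0.5) / N                      # midpoint grid (assumed)
carrier = np.sin(26 * np.pi * x)

v_edges_small = 4 * x * (1 - x) * carrier         # oscillations suppressed at both ends
v_center_small = 4 * (x - 0.5) ** 2 * carrier     # oscillations suppressed in the center

# Algebraic identity: the second target is the carrier minus the first target
identity_ok = np.allclose(v_center_small, carrier - v_edges_small)

# Magnitudes of the independent half of each DFT spectrum
half = N // 2
mag_a = np.abs(np.fft.fft(v_edges_small))[:half]
mag_b = np.abs(np.fft.fft(v_center_small))[:half]

gap = np.abs(mag_a - mag_b)                       # per-mode difference in magnitude
others = np.delete(gap, 13)                       # every mode except m = 13
```

Inspecting `gap` shows a single isolated entry at `m = 13`, with the remaining entries in `others` at the level of floating-point round-off.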

Figure 6: Two target functions with nearly identical Fourier spectra can behave very differently as regression objectives. In this case, one target function is large near the boundaries and the other small. The boundary effect suggests that information of this sort—information beyond the Fourier spectra, that is—also has bearing on the success of training.

Basis function perspective

Another perspective on the spectral bias which, to the best of our knowledge, has not been explored in the literature, involves the basis functions built by the network. Referring back to the two hidden-layer network, we take the basis \(\mathbf h(\mathbf x) = \{ h_i(\mathbf x) \}_{i=1}^{H}\) to be the functions defined by the final hidden layer of the network:

\[ \mathbf h(\mathbf x ) = \tanh( \mathbf w^2 ( \tanh(\mathbf w^1 \mathbf x + \mathbf b^1)) + \mathbf b^2 ). \]

We are interested in how these functions evolve over the course of training. Returning to the example shown in Figure 2 (target function of \(v(x) = \sin(26 \pi x)\)), we plot the set of basis functions at discrete training epochs, with the opacity proportional to the coefficient corresponding to each basis function. In other words, the network representation is given by \(u(x) = \sum_{i=1}^{H} w^3_i h_i(x)\), so when plotting \(h_i(x)\), we set its opacity proportional to \(|w^3_i|\). Our goal with these plots is to visualize the basis functions that contribute most to the network output, and we refer to these as the relevant basis functions. The evolution of the relevant set of basis functions is shown in Figure 7.
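A minimal sketch of how the basis functions and their plotting opacities could be extracted from a set of network parameters is given below. The random parameters stand in for a trained network, and the normalization of the opacity by the largest coefficient is an assumption about how "proportional to \(|w^3_i|\)" might be implemented.

```python
import numpy as np

def basis(x, w1, b1, w2, b2):
    """h(x) = tanh(w2 tanh(w1 x + b1) + b2): one column per basis function."""
    h1 = np.tanh(x @ w1.T + b1)
    return np.tanh(h1 @ w2.T + b2)

rng = np.random.default_rng(0)
H, N = 100, 500

# Placeholder parameters; in practice these would come from the trained network
w1 = rng.normal(size=(H, 1))
b1 = rng.normal(size=H)
w2 = rng.normal(size=(H, H)) / np.sqrt(H)
b2 = rng.normal(size=H)
w3 = rng.normal(size=H) / np.sqrt(H)

x = ((np.arange(N) + 0.5) / N)[:, None]     # (500, 1)
h = basis(x, w1, b1, w2, b2)                # (500, 100), column i is h_i(x)

# The network output is the linear combination of the basis with coefficients w3
u = h @ w3

# Opacity proportional to |w3_i|, scaled so the largest coefficient is fully opaque
opacity = np.abs(w3) / np.abs(w3).max()
```

Each column `h[:, i]` could then be drawn with transparency `opacity[i]`, for instance through the `alpha` argument of a plotting call, so that basis functions with small coefficients fade out.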

In this plot, we see that each basis function is a smoothed step function, representing one oscillation in the target function. In fact, this provides insight into the sequential fitting process: the network builds step-like basis functions starting first from the boundaries, and then shifting and steepening further basis functions as required to represent oscillations inside the domain. It is interesting that the basis functions do not themselves have any oscillatory behavior, despite the fact that a two hidden-layer network is capable of representing this.

To appreciate this point, we remark that the basis functions of a two hidden-layer network are defined by a one hidden-layer network, which is known to be a universal approximator. We believe the basis function perspective offers the following insight into the spectral bias: if the basis learned by the network comprises smoothed step functions, then every oscillation in the target needs to be represented as a combination of two basis functions. And, as the sequential fitting phenomenon suggests, if this happens iteratively, meaning the network can only work on one oscillation at a time, it is not surprising that standard MLP networks are extremely slow to fit high-frequency functions.
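The claim that one oscillation requires two step-like basis functions can be illustrated directly: the difference of two smoothed steps with shifted centers produces a single localized bump. The centers `a` and `b` and the steepness `k` below are hypothetical values chosen for the sketch.

```python
import numpy as np

def smoothed_step(x, center, steepness):
    """A tanh step that rises from -1 to +1 around `center`."""
    return np.tanh(steepness * (x - center))

x = np.linspace(0.0, 1.0, 1001)
a, b, k = 0.4, 0.6, 200.0                    # hypothetical step locations and steepness

step_up = smoothed_step(x, a, k)
step_down = smoothed_step(x, b, k)

# Two monotone steps combine into one bump that is ~1 on (a, b) and ~0 elsewhere
bump = 0.5 * (step_up - step_down)
```

Neither step oscillates on its own, so under this picture each additional half-period of the target calls for another step to be positioned and steepened.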

Figure 7: The relevant basis functions are determined by the magnitudes of the coefficients scaling them. The two hidden-layer network builds a step-like basis to represent the oscillatory target function. Per the sequential fitting process, the relevant basis functions are constructed first at the boundaries, then move into the domain as training progresses.

As a final remark on the one-dimensional regression example, we note that architectural modifications to the network eliminate the sequential fitting process. For example, if we switch to a SIREN network, replacing the \(\tanh(\cdot)\) activation with \(\sin(2(\cdot))\), the basis functions themselves have oscillatory behavior, and there is no sequential fitting. See Figure 8 for the basis functions obtained from a SIREN network with the same regression target.
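The contrast between the two activations can be seen even at the level of a single unit. The sketch below is a simplification: it looks at one first-layer unit with hypothetical weight `w` and bias `b`, rather than the second-layer basis functions discussed in the text, and counts interior extrema as a crude proxy for oscillatory behavior.

```python
import numpy as np

def count_extrema(y, tol=1e-12):
    """Count sign changes in the slope of y, ignoring numerically flat segments."""
    d = np.diff(y)
    s = np.sign(d[np.abs(d) > tol])
    return int(np.sum(s[:-1] != s[1:]))

x = np.linspace(0.0, 1.0, 2001)
w, b = 20.0, 0.3                              # hypothetical weight and bias

tanh_unit = np.tanh(w * x + b)                # monotone, step-like
sine_unit = np.sin(2.0 * (w * x + b))         # SIREN-style unit with sin(2(.))

n_tanh = count_extrema(tanh_unit)
n_sine = count_extrema(sine_unit)
```

A single tanh unit has no interior extrema regardless of its weight, whereas the sine unit gains more of them as the weight grows, so oscillations are available to the network without being assembled from steps.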

Figure 8: Switching to a periodic activation function introduces high-frequency behavior into the basis, which eliminates the sequential fitting phenomenon. We propose this is because it is no longer necessary to learn oscillations in the target function one at a time.

Two-dimensional regression

We now extend our study of the spectral bias to two-dimensional regression problems. In particular, we investigate whether the sequential fitting phenomenon also holds in two spatial dimensions. To do this, we set our computational domain as the unit square \(\Omega=[0,1]^2\), and perform midpoint integration with \(2500\) evenly spaced integration points. The network architecture is equivalent, with hyperbolic tangent activation functions and two hidden layers of width \(H=100\), except that the input layer is modified to accept an input \(\mathbf x \in \mathbb R^2\). We choose our regression target as \(v(x_1,x_2) = \sin( 10 \pi x_1 ) \sin(10 \pi x_2)\). Again using ADAM optimization with a learning rate of \(5 \times 10^{-3}\), we train the network by minimizing the squared error with the target. See Figure 9 to visualize the target function and the evolution of the network output over the course of training.
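The two-dimensional quadrature grid and target can be set up as follows. This sketch assumes the \(2500\) integration points form a \(50 \times 50\) grid of cell midpoints, which is one natural reading of "evenly spaced"; the variable names are illustrative.

```python
import numpy as np

n = 50                                            # points per direction (assumed 50 x 50)
pts = (np.arange(n) + 0.5) / n                    # cell midpoints in one direction
x1, x2 = np.meshgrid(pts, pts, indexing="ij")
X = np.stack([x1.ravel(), x2.ravel()], axis=1)    # (2500, 2) network inputs
dA = 1.0 / n**2                                   # area of each quadrature cell

# Target v(x1, x2) = sin(10 pi x1) sin(10 pi x2)
v = np.sin(10 * np.pi * X[:, 0]) * np.sin(10 * np.pi * X[:, 1])

target_mean = np.sum(v) * dA                      # midpoint-rule mean of the target
loss_zero = 0.5 * np.sum(v**2) * dA               # objective of the constant-zero function
```

Because the target has zero mean, `loss_zero` gives the objective value at which training would plateau for as long as the network represents only that mean.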

First, we note that the network spends more than \(3000\) epochs representing only the mean of the target, which is zero in this case. Once fitting begins, we notice similar sequential fitting behavior to the one-dimensional examples, where the network first becomes non-zero near a boundary, and then iteratively works inward. Interestingly, the sequential fitting process moves diagonally across the domain, and first represents one-dimensional oscillations in this diagonal coordinate direction. At around epoch \(5000\), and beginning at the boundary, the network starts sweeping the domain in a perpendicular diagonal direction, correcting the one-dimensional oscillations to be two-dimensional, as the training objective suggests. This example shows that sequential fitting can occur in higher spatial dimensions as well.

Figure 9: After spending more than \(3000\) epochs representing only the mean of the target function, the network begins sweeping across the domain in a diagonal direction, representing one-dimensional oscillations first. Then, a similar process initiates in a perpendicular direction, refining the one-dimensional oscillations in agreement with the target function. We take this to be a higher-dimensional version of the sequential fitting process observed previously.

As before, we investigate the behavior of the set of basis functions built by the network. In this case, we cannot overlay multiple basis functions on a single plot, so we choose to plot only the converged basis functions. In particular, we find the \(25\) largest entries from the coefficient vector \(|\mathbf w^3|\) and plot the corresponding basis functions. Figure 10 shows that, like the one-dimensional case, the basis functions are smoothed step functions.
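Selecting the basis functions with the largest coefficients amounts to a sort on \(|\mathbf w^3|\). The helper name `top_k_basis` and the random coefficient vector below are placeholders for illustration only.

```python
import numpy as np

def top_k_basis(w3, k=25):
    """Indices of the k output-layer coefficients with the largest magnitude."""
    order = np.argsort(np.abs(w3))[::-1]     # sort by |w3_i|, largest first
    return order[:k]

rng = np.random.default_rng(0)
w3 = rng.normal(size=100)                    # placeholder for trained coefficients
top = top_k_basis(w3, k=25)
```

The returned indices pick out the columns of the final hidden layer to plot, one panel per basis function.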

Figure 10: The top \(25\) basis functions used in representing the high-frequency two-dimensional target function. As before, the basis formed by the two hidden-layer network with hyperbolic tangent activations comprises step-like functions.

Conclusion

After reviewing some of the standard perspectives on the spectral bias from the literature, we have offered an alternative understanding of this phenomenon. Our argument is that MLP neural networks fit high-frequency functions from the boundaries in, learning to represent one oscillation at a time. We showed that the behavior of the target function near the boundary can have a significant effect on the training process, independent of the frequency content of the target, which we believe to be a novel insight. Furthermore, we showed that the MLP networks we studied iteratively built step-like basis functions, which contrasted dramatically with the oscillatory behavior of the basis built by a SIREN network. The basis function perspective is interesting, as it shows that even the training of relatively wide networks (\(H=100\)) is in the “feature learning” regime. In other words, these step-like basis functions are not present at the initialization of the network—network parameters need to be tuned to carefully position and steepen the step-like basis functions.

Finally, we showed that sequential fitting behavior is also observed for two-dimensional regression problems, where the network now sweeps the domain in two perpendicular directions to obtain the fit. Our visualization of the converged basis functions suggested that the sequential behavior was again the consequence of iteratively fitting step-like bases.

Preliminary work suggests that deeper networks may build oscillatory basis functions, even with hyperbolic tangent activation. Future studies might investigate the effect of network depth and the activation function on sequential fitting, and on the robustness of the observed boundary effect. Our example of the boundary effect here shows that two functions with nearly identical Fourier spectra behave very differently as regression targets, evidently as a result of their behavior near domain boundaries. We believe that a worthwhile future contribution to the scientific machine learning literature would be to demonstrate more conclusively that the frequency content of a target function is not the only determinant of successful regression with neural networks.