## Preface

Recording the derivations of the backpropagation, Levenberg–Marquardt, and Bayesian regularization algorithms, to build a deeper understanding of neural networks.

## BackPropagation

### Differentiation in Computational Graph

In fact, there are two modes of differentiation in neural-network computation: forward-mode differentiation and backward-mode (reverse-mode) differentiation.
Consider the computational graph below; we need to compute the gradient of the output with respect to the input, namely $\frac{\partial L}{\partial x}$, where $x, y, z$ can be multi-dimensional vectors.

• For forward-mode differentiation:
Each step of the iteration starts from the input and accumulates derivatives upward. For example, at the step of $z$ we apply the chain rule:
$$\frac{\partial z}{\partial x}=\frac{\partial z}{\partial y} \frac{\partial y}{\partial x}$$
• For backward-mode differentiation:
Each step of the iteration starts from the output and accumulates derivatives downward. For example, at the step of $y$ we apply the chain rule:
$$\frac{\partial L}{\partial y}=\frac{\partial L}{\partial z} \frac{\partial z}{\partial y}$$

Why do we choose backpropagation? Let’s compare the two modes:

• Space Complexity:
In forward mode, we do NOT need to record the values of previous layers; only the current layer’s value and the current derivative are needed, so SC(Forward) = $O(1)$.
Backward mode, on the other hand, needs to record the values of all layers, so SC(Backward) = $O(n)$.

• Time Complexity:
In practical neural networks each layer (such as $x, y, z$ here) is generally a multi-dimensional vector, which means $\frac{\partial L}{\partial z}$ is a $1\times m$ vector while $\frac{\partial z}{\partial y}$ and $\frac{\partial y}{\partial x}$ are $m\times m$ matrices.
Therefore TC(Forward) = $O(m^3)$ per step, because forward mode accumulates matrix–matrix products, but TC(Backward) = $O(m^2)$ per step, because backward mode only ever multiplies a $1\times m$ row vector by an $m\times m$ matrix.

It’s clear that we should choose backward-mode differentiation (backpropagation) to accelerate neural-network training.
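The cost comparison is easy to see in code. The sketch below uses random placeholder matrices standing in for real layer Jacobians and multiplies the same chain $\frac{\partial L}{\partial z}\frac{\partial z}{\partial y}\frac{\partial y}{\partial x}$ in both orders: forward mode pays for an $m\times m$ matrix product, backward mode only for vector–matrix products.

```python
import numpy as np

# Toy chain x -> y -> z -> L with m-dimensional intermediate layers.
# All matrices below are random placeholders standing in for real layer Jacobians.
m = 4
rng = np.random.default_rng(0)
dy_dx = rng.standard_normal((m, m))   # dy/dx, m x m
dz_dy = rng.standard_normal((m, m))   # dz/dy, m x m
dL_dz = rng.standard_normal((1, m))   # dL/dz, 1 x m

# Forward mode: accumulate from the input side -> matrix-matrix product, O(m^3).
dz_dx = dz_dy @ dy_dx                 # m x m intermediate
grad_forward = dL_dz @ dz_dx

# Backward mode: accumulate from the output side -> vector-matrix products, O(m^2).
dL_dy = dL_dz @ dz_dy                 # 1 x m intermediate
grad_backward = dL_dy @ dy_dx

# Both orders yield the same gradient dL/dx; only the cost differs.
assert np.allclose(grad_forward, grad_backward)
```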

### BackPropagation Process

BackPropagation algorithm can be divided into 3 steps:

1. Forward-propagate the Input Signal
To illustrate the backpropagation process, we use a three-layer neural network with two inputs and one output, as shown below. The difference between the output signal and the target is called the error signal $\delta$ of the output-layer neuron.
2. Back-propagate the Error Signal
Propagate the error signal $\delta$ (computed in a single teaching step) back to all neurons whose output signals were inputs to the neuron under discussion. (Some computation is omitted here.)

3. Update the Parameters
When the error signal of each neuron has been computed, the weight coefficients of each neuron’s input connections can be modified. In the formulas below, $\frac{d f(e)}{d e}$ represents the derivative of the activation function of the neuron whose weights are modified. The coefficient $\eta$ controls the network teaching speed. (Some computation is omitted here.)
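These steps can be sketched on a toy network. The weights, inputs, and target below are made up for illustration and this is a plain gradient-descent step on $\frac{1}{2}(t-y)^2$ with sigmoid neurons, not the exact network of the figure.

```python
import numpy as np

# Minimal sketch: 2 inputs, 2 hidden sigmoid neurons, 1 sigmoid output.
def sigmoid(e):
    return 1.0 / (1.0 + np.exp(-e))

rng = np.random.default_rng(1)
W1 = rng.standard_normal((2, 2))   # input -> hidden weights
W2 = rng.standard_normal((1, 2))   # hidden -> output weights
x = np.array([0.5, -0.3])
target = 1.0
eta = 0.1                          # teaching (learning) speed

# Step 1: forward-propagate the input signal.
h = sigmoid(W1 @ x)
y = sigmoid(W2 @ h)[0]
error_before = abs(target - y)

# Error signal of the output neuron, including df(e)/de = f(1 - f) for the sigmoid.
delta_out = (target - y) * y * (1.0 - y)

# Step 2: back-propagate the error signal through the output weights.
delta_hidden = (W2[0] * delta_out) * h * (1.0 - h)

# Step 3: update each neuron's input weights.
W2 = W2 + eta * delta_out * h[np.newaxis, :]
W1 = W1 + eta * np.outer(delta_hidden, x)

# One teaching step moves the output closer to the target.
y_new = sigmoid(W2 @ sigmoid(W1 @ x))[0]
assert abs(target - y_new) < error_before
```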

## Levenberg–Marquardt

The Levenberg–Marquardt algorithm, which was independently developed by Kenneth Levenberg and Donald Marquardt, provides a numerical solution to the problem of minimizing a nonlinear function. It is fast and has stable convergence. In the artificial neural-networks field, this algorithm is suitable for training small- and medium-sized problems.

### Jacobian matrix & Hessian matrix

Jacobian matrix
Suppose $F : \mathbb{R}^{n} \rightarrow \mathbb{R}^{m}$ is a function of the form below, w.r.t. $\mathbf{x}=\left(x_{1}, \cdots, x_{n}\right)^{T}$:

$$F(\mathbf{x})=\left(f_{1}(\mathbf{x}), f_{2}(\mathbf{x}), \cdots, f_{m}(\mathbf{x})\right)^{T}$$

Then the Jacobian matrix $\mathbf{J}$ of $F$ is an $m \times n$ matrix, usually defined and arranged as follows:

$$\mathbf{J}=\begin{bmatrix}\frac{\partial f_{1}}{\partial x_{1}} & \cdots & \frac{\partial f_{1}}{\partial x_{n}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_{m}}{\partial x_{1}} & \cdots & \frac{\partial f_{m}}{\partial x_{n}}\end{bmatrix}$$

where, component-wise,

$$J_{i j}=\frac{\partial f_{i}}{\partial x_{j}}$$

Jacobian determinant
Assume $x=X(u, v), y=Y(u, v)$; then we have

$$\frac{\partial(x, y)}{\partial(u, v)}=\begin{vmatrix}\frac{\partial x}{\partial u} & \frac{\partial x}{\partial v} \\ \frac{\partial y}{\partial u} & \frac{\partial y}{\partial v}\end{vmatrix}=\frac{\partial x}{\partial u} \frac{\partial y}{\partial v}-\frac{\partial x}{\partial v} \frac{\partial y}{\partial u}$$

Hessian matrix
Suppose $f : \mathbb{R}^{n} \rightarrow \mathbb{R}$ is a twice-differentiable function, $\mathbf{x}=\left(x_{1}, \cdots, x_{n}\right)^{T} \in \mathbb{R}^{n}$;
define the $n \times n$ real symmetric matrix $H(\mathbf{x})=\left[h_{i j}(\mathbf{x})\right]$, the Hessian matrix, as below:

$$h_{i j}(\mathbf{x})=\frac{\partial^{2} f}{\partial x_{i} \partial x_{j}}(\mathbf{x})$$

The Jacobian of $\nabla f$ is the Hessian.
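This fact is easy to verify numerically. The sketch below uses the illustrative function $f(x, y)=x^{2} y+y^{3}$ and checks that a finite-difference Jacobian of $\nabla f$ matches the analytic Hessian.

```python
import numpy as np

# Check that the Jacobian of the gradient equals the Hessian,
# for the example f(x, y) = x^2 * y + y^3 (chosen purely for illustration).
def grad(p):
    x, y = p
    return np.array([2 * x * y, x**2 + 3 * y**2])

def hessian(p):
    x, y = p
    return np.array([[2 * y, 2 * x],
                     [2 * x, 6 * y]])

p0 = np.array([1.0, 2.0])
eps = 1e-6

# Central-difference Jacobian of the gradient, column by column.
J = np.zeros((2, 2))
for j in range(2):
    step = np.zeros(2)
    step[j] = eps
    J[:, j] = (grad(p0 + step) - grad(p0 - step)) / (2 * eps)

assert np.allclose(J, hessian(p0), atol=1e-4)
```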

### Steepest Descent Algorithm

Sum square error (SSE) is defined to evaluate the training process. Over all training patterns $p$ and network outputs $m$, it is calculated by

$$E(\boldsymbol{x}, \boldsymbol{w})=\frac{1}{2} \sum_{p=1}^{P} \sum_{m=1}^{M} e_{p, m}^{2}$$

where $\boldsymbol{x}$ is the input vector, $\boldsymbol{w}$ is the weight vector, and $e_{p, m}$ is the training error at output $m$ for pattern $p$.
Normally, the gradient $\boldsymbol{g}$ is defined as the first-order derivative of the total error function:

$$\boldsymbol{g}=\frac{\partial E(\boldsymbol{x}, \boldsymbol{w})}{\partial \boldsymbol{w}}=\left(\frac{\partial E}{\partial w_{1}}, \frac{\partial E}{\partial w_{2}}, \cdots, \frac{\partial E}{\partial w_{N}}\right)^{T}$$

Then the update rule of the steepest descent algorithm can be written as

$$\boldsymbol{w}_{k+1}=\boldsymbol{w}_{k}-\alpha \boldsymbol{g}_{k}$$

where $\alpha$ is the learning constant (step size).
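As a minimal illustration, the steepest descent update $\boldsymbol{w}_{k+1}=\boldsymbol{w}_{k}-\alpha \boldsymbol{g}_{k}$ can be run on an assumed quadratic error surface (the matrix $A$ below is arbitrary, chosen only to give a well-behaved bowl):

```python
import numpy as np

# Steepest descent on the illustrative error surface E(w) = 1/2 * w^T A w,
# whose gradient is g = A w and whose minimum is w = 0.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])

def gradient(w):
    return A @ w          # g = dE/dw

w = np.array([1.0, -1.0])
alpha = 0.1               # learning constant
for _ in range(200):
    w = w - alpha * gradient(w)

assert np.linalg.norm(w) < 1e-3   # converged to the minimum w = 0
```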

### Newton’s Method

Newton’s method assumes that all the gradient components $g_{1}, g_{2}, \dots, g_{N}$ are functions of the weights and that all weights are linearly independent:

$$g_{i}=g_{i}\left(w_{1}, w_{2}, \cdots, w_{N}\right), \quad i=1, \cdots, N$$

Expand each $g_{i}$ in a Taylor series and take the first-order approximation:

$$g_{i} \approx g_{i, 0}+\frac{\partial g_{i}}{\partial w_{1}} \Delta w_{1}+\frac{\partial g_{i}}{\partial w_{2}} \Delta w_{2}+\cdots+\frac{\partial g_{i}}{\partial w_{N}} \Delta w_{N}$$

where $\frac{\partial g_{i}}{\partial w_{j}}=\frac{\partial\left(\frac{\partial E}{\partial w_{i}}\right)}{\partial w_{j}}=\frac{\partial^{2} E}{\partial w_{i} \partial w_{j}}$. Setting every $g_{i}$ to zero to locate the minimum, we have, in matrix form,

$$-\boldsymbol{g}=\boldsymbol{H} \Delta \boldsymbol{w}$$

so $\Delta \boldsymbol{w}=-\boldsymbol{H}^{-1} \boldsymbol{g}$.
Therefore, the update rule for Newton’s method is

$$\boldsymbol{w}_{k+1}=\boldsymbol{w}_{k}-\boldsymbol{H}_{k}^{-1} \boldsymbol{g}_{k}$$
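On a quadratic error surface the Newton update $\boldsymbol{w}_{k+1}=\boldsymbol{w}_{k}-\boldsymbol{H}^{-1} \boldsymbol{g}$ reaches the minimum in a single step, which the sketch below demonstrates (the surface is illustrative, not from the text):

```python
import numpy as np

# Newton's method on the illustrative quadratic E = 1/2 w^T A w - b^T w,
# with gradient g = A w - b and constant Hessian H = A.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 2.0])

def gradient(w):
    return A @ w - b

H = A                                     # Hessian of the quadratic
w = np.array([10.0, -7.0])                # arbitrary starting point
w = w - np.linalg.solve(H, gradient(w))   # one Newton step

assert np.allclose(gradient(w), 0.0, atol=1e-8)   # at the minimum
```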

### Gauss–Newton Algorithm

Since the second-order derivatives of the total error function have to be calculated and this can be very complicated, the Jacobian matrix $\mathbf{J}$ is introduced to simplify the calculation:

$$\mathbf{J}=\begin{bmatrix}\frac{\partial e_{1,1}}{\partial w_{1}} & \frac{\partial e_{1,1}}{\partial w_{2}} & \cdots & \frac{\partial e_{1,1}}{\partial w_{N}} \\ \vdots & \vdots & & \vdots \\ \frac{\partial e_{P, M}}{\partial w_{1}} & \frac{\partial e_{P, M}}{\partial w_{2}} & \cdots & \frac{\partial e_{P, M}}{\partial w_{N}}\end{bmatrix}$$

The elements of the gradient vector can be calculated as

$$g_{i}=\frac{\partial E}{\partial w_{i}}=\frac{\partial\left(\frac{1}{2} \sum_{p=1}^{P} \sum_{m=1}^{M} e_{p, m}^{2}\right)}{\partial w_{i}}=\sum_{p=1}^{P} \sum_{m=1}^{M} \frac{\partial e_{p, m}}{\partial w_{i}} e_{p, m}$$

Thus, the relationship between the Jacobian matrix $\mathbf{J}$ and the gradient vector $\boldsymbol{g}$ is

$$\boldsymbol{g}=\boldsymbol{J}^{T} \boldsymbol{e}$$

where $\boldsymbol{e}$ is the error vector. The element at the $i$th row and $j$th column of the Hessian matrix can be calculated as

$$h_{i, j}=\frac{\partial^{2} E}{\partial w_{i} \partial w_{j}}=\sum_{p=1}^{P} \sum_{m=1}^{M} \frac{\partial e_{p, m}}{\partial w_{i}} \frac{\partial e_{p, m}}{\partial w_{j}}+S_{i, j}$$

where $S_{i, j}$ is equal to

$$S_{i, j}=\sum_{p=1}^{P} \sum_{m=1}^{M} \frac{\partial^{2} e_{p, m}}{\partial w_{i} \partial w_{j}} e_{p, m}$$

The basic assumption of the Gauss–Newton method is that $S_{i, j}$ is close to zero, thus

$$\boldsymbol{H} \approx \boldsymbol{J}^{T} \boldsymbol{J}$$

and the update rule becomes

$$\boldsymbol{w}_{k+1}=\boldsymbol{w}_{k}-\left(\boldsymbol{J}_{k}^{T} \boldsymbol{J}_{k}\right)^{-1} \boldsymbol{J}_{k}^{T} \boldsymbol{e}_{k}$$
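The Gauss–Newton update $\boldsymbol{w}_{k+1}=\boldsymbol{w}_{k}-\left(\boldsymbol{J}^{T} \boldsymbol{J}\right)^{-1} \boldsymbol{J}^{T} \boldsymbol{e}$ can be sketched on an assumed curve-fitting problem; the model $o(t)=w_{1}\left(1-e^{-w_{2} t}\right)$ and its data are illustrative, not from the text.

```python
import numpy as np

# Gauss-Newton sketch: fit the illustrative model o(t) = w1 * (1 - exp(-w2 * t))
# to noise-free synthetic data, with errors e = d - o and J[p, i] = de_p / dw_i.
t = np.linspace(0.1, 4.0, 20)
w_true = np.array([2.0, 1.5])
d = w_true[0] * (1.0 - np.exp(-w_true[1] * t))   # targets

def residual_and_jacobian(w):
    ex = np.exp(-w[1] * t)
    e = d - w[0] * (1.0 - ex)                    # error vector
    # de/dw1 = -(1 - exp(-w2 t)),  de/dw2 = -w1 * t * exp(-w2 t)
    J = np.column_stack([-(1.0 - ex), -w[0] * t * ex])
    return e, J

w = np.array([1.0, 1.0])                         # initial guess
for _ in range(30):
    e, J = residual_and_jacobian(w)
    w = w - np.linalg.solve(J.T @ J, J.T @ e)    # Gauss-Newton step

assert np.allclose(w, w_true, atol=1e-4)
```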

### Levenberg–Marquardt Algorithm

To make sure that the approximated Hessian matrix $\boldsymbol{J}^{T} \boldsymbol{J}$ is invertible, the Levenberg–Marquardt algorithm introduces another approximation to the Hessian matrix:

$$\boldsymbol{H} \approx \boldsymbol{J}^{T} \boldsymbol{J}+\mu \boldsymbol{I}$$

where $\mu$ is always positive and is called the combination coefficient, and $\boldsymbol{I}$ is the identity matrix.
Therefore, the update rule of the Levenberg–Marquardt algorithm can be written as

$$\boldsymbol{w}_{k+1}=\boldsymbol{w}_{k}-\left(\boldsymbol{J}_{k}^{T} \boldsymbol{J}_{k}+\mu \boldsymbol{I}\right)^{-1} \boldsymbol{J}_{k}^{T} \boldsymbol{e}_{k}$$

When the combination coefficient $\mu$ is very small (nearly zero), L-M approaches the Gauss–Newton algorithm.
When the combination coefficient $\mu$ is very large, it can be interpreted as the learning coefficient of the steepest descent method: $\alpha=\frac{1}{\mu}$.
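A minimal Levenberg–Marquardt loop then only adds the damping term and a rule for adapting $\mu$; the halving/doubling schedule and the curve-fit problem below are illustrative choices, not prescribed by the text.

```python
import numpy as np

# Levenberg-Marquardt sketch on an illustrative curve fit:
# damped step (J^T J + mu*I)^{-1} J^T e, shrinking mu after a successful
# step (toward Gauss-Newton) and growing it otherwise (toward steepest descent).
t = np.linspace(0.1, 4.0, 20)
w_true = np.array([2.0, 1.5])
d = w_true[0] * (1.0 - np.exp(-w_true[1] * t))

def residual_and_jacobian(w):
    ex = np.exp(-w[1] * t)
    e = d - w[0] * (1.0 - ex)
    J = np.column_stack([-(1.0 - ex), -w[0] * t * ex])
    return e, J

w = np.array([0.5, 0.5])
mu = 0.01
for _ in range(50):
    e, J = residual_and_jacobian(w)
    step = np.linalg.solve(J.T @ J + mu * np.eye(2), J.T @ e)
    w_new = w - step
    e_new, _ = residual_and_jacobian(w_new)
    if e_new @ e_new < e @ e:
        w, mu = w_new, mu * 0.5    # success: behave more like Gauss-Newton
    else:
        mu *= 2.0                  # failure: behave more like steepest descent

assert np.allclose(w, w_true, atol=1e-4)
```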

## Bayesian Regularization

Bayesian regularized artificial neural networks (BRANNs) are more robust than standard back-propagation nets and can reduce or eliminate the need for lengthy cross-validation. They are difficult to overtrain, since evidence procedures provide an objective Bayesian criterion for stopping training. They are also difficult to overfit, because the BRANN calculates and trains on a number of effective network parameters or weights, effectively turning off those that are not relevant.

### Inference

Typically, training aims to reduce the sum of squared errors $F=E_{D}$. However, regularization adds an additional term, so the objective function becomes

$$F=\beta E_{D}+\alpha E_{W}$$

where the simplest choice of regularizer is the weight-decay regularizer $E_{W}(\mathbf{w})=\sum_{i} w_{i}^{2}$ (this normalization matches the constants $Z_{D}$ and $Z_{W}$ below).
In the Bayesian framework the weights of the network are considered random variables. After the data are observed, the density function for the weights can be updated according to Bayes’ rule:

$$P(\mathbf{w} | D, \alpha, \beta, M)=\frac{P(D | \mathbf{w}, \beta, M) P(\mathbf{w} | \alpha, M)}{P(D | \alpha, \beta, M)}$$

where $D$ represents the data set and $M$ is the particular neural network model used.
Assuming that the noise in the training data is Gaussian and that the prior distribution for the weights is Gaussian, the probability densities can be written as

$$P(D | \mathbf{w}, \beta, M)=\frac{1}{Z_{D}(\beta)} \exp \left(-\beta E_{D}\right), \quad P(\mathbf{w} | \alpha, M)=\frac{1}{Z_{W}(\alpha)} \exp \left(-\alpha E_{W}\right)$$

where $Z_{D}(\beta)=\left(\frac{\pi}{\beta}\right)^{\frac{n}{2}}$ and $Z_{W}(\alpha)=\left(\frac{\pi}{\alpha}\right)^{\frac{N}{2}}$, with $n$ the number of data points and $N$ the number of weights.

Then the posterior density of the parameters $\mathbf{w}$ is:

$$P(\mathbf{w} | D, \alpha, \beta, M)=\frac{1}{Z_{F}(\alpha, \beta)} \exp (-F(\mathbf{w}))$$

where $Z_{F}(\alpha, \beta)=\int \exp (-F(\mathbf{w})) d \mathbf{w}$.

### Optimizing $\alpha$ and $\beta$

Applying Bayes’ rule to optimize the objective-function parameters $\alpha$ and $\beta$, we have:

$$P(\alpha, \beta | D, M)=\frac{P(D | \alpha, \beta, M) P(\alpha, \beta | M)}{P(D | M)}$$

Assuming $P(\alpha, \beta | M)$ is a flat prior (uninformative about $\alpha$ and $\beta$), maximizing the posterior is achieved by maximizing the likelihood function, which can be obtained from the previous Bayesian formulas:

$$P(D | \alpha, \beta, M)=\frac{P(D | \mathbf{w}, \beta, M) P(\mathbf{w} | \alpha, M)}{P(\mathbf{w} | D, \alpha, \beta, M)}=\frac{Z_{F}(\alpha, \beta)}{Z_{D}(\beta) Z_{W}(\alpha)}$$

We can evaluate $Z_{F}(\alpha, \beta)$ by a Taylor-series expansion around the minimum point $\mathbf{w}_{\mathrm{MP}}$ of the posterior density, since $E_{D}$ and $E_{W}$ are quadratic functions there. Let $\mathbf{H}=\beta \nabla^{2} E_{D}+\alpha \nabla^{2} E_{W}$ be the Hessian matrix of the objective function. Then we have:

$$F(\mathbf{w}) \approx F\left(\mathbf{w}_{\mathrm{MP}}\right)+\frac{1}{2}\left(\mathbf{w}-\mathbf{w}_{\mathrm{MP}}\right)^{T} \mathbf{H}\left(\mathbf{w}-\mathbf{w}_{\mathrm{MP}}\right)$$

Since $\mathbf{H}$ is symmetric positive definite at the minimum, it can be written as $\mathbf{H}=\left(\mathbf{H}^{\frac{1}{2}}\right)^{T}\mathbf{H}^{\frac{1}{2}}$. Introducing the variable $\mathbf{u}=\mathbf{H}^{\frac{1}{2}}\left(\mathbf{w}-\mathbf{w}_{\mathrm{MP}}\right)$, substituting into $\frac{1}{Z_{F}(\alpha, \beta)} \exp (-F(\mathbf{w}))$, and integrating both sides, we have:

$$Z_{F}(\alpha, \beta) \approx \exp \left(-F\left(\mathbf{w}_{\mathrm{MP}}\right)\right)(\operatorname{det} \mathbf{H})^{-\frac{1}{2}} \int \exp \left(-\frac{1}{2} \mathbf{u}^{T} \mathbf{u}\right) d \mathbf{u}$$

Therefore, we have:

$$Z_{F}(\alpha, \beta) \approx(2 \pi)^{\frac{N}{2}}(\operatorname{det} \mathbf{H})^{-\frac{1}{2}} \exp \left(-F\left(\mathbf{w}_{\mathrm{MP}}\right)\right)$$

Then we substitute it into $\frac{Z_{F}(\alpha, \beta)}{Z_{D}(\beta) Z_{W}(\alpha)}$ and take the log of the evidence:

$$\log P(D | \alpha, \beta, M)=-\alpha E_{W}^{\mathrm{MP}}-\beta E_{D}^{\mathrm{MP}}-\frac{1}{2} \log \operatorname{det} \mathbf{H}+\frac{N}{2} \log \alpha+\frac{n}{2} \log \beta+\text { const }$$

We can solve the optimal values of $\alpha \text { and } \beta$ at the minimum point by taking the derivative with respect to $\alpha \text { and } \beta$ and set them equal to zero.
First, differentiating with respect to $\alpha$, we need to evaluate $\frac{1}{2} \log \operatorname{det} \mathbf{H}$:

Then we obtain the following condition for the most probable value of $\alpha$:

The quantity on the right of equation above is called the number of good parameter measurements $\gamma$.
Similarlly, we differentiate the log evidence with respect to $\beta$ and obtain:

Thus, we solve the optimal values of $\alpha \text { and } \beta$ :

where $\gamma=N-2 \alpha^{\mathrm{MP}} \operatorname{Trace}\left(\mathbf{H}\right)^{-1}$
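The whole re-estimation loop can be sketched on an assumed linear model $o=X \mathbf{w}$, for which the Hessian of $F$ is available exactly ($\nabla^{2} E_{D}=2 X^{T} X$ and $\nabla^{2} E_{W}=2 \mathbf{I}$ with the conventions above); all data below are synthetic and purely illustrative.

```python
import numpy as np

# Sketch of the alpha/beta re-estimation loop on an illustrative linear model.
rng = np.random.default_rng(2)
n, N = 50, 3                        # n data points, N parameters
X = rng.standard_normal((n, N))
w_gen = np.array([1.0, -2.0, 0.5])
d = X @ w_gen + 0.1 * rng.standard_normal(n)

alpha, beta = 1.0, 1.0
for _ in range(30):
    # Minimize F = beta*E_D + alpha*E_W (closed form for a linear model).
    w_mp = np.linalg.solve(beta * X.T @ X + alpha * np.eye(N), beta * X.T @ d)
    E_D = np.sum((d - X @ w_mp) ** 2)          # sum of squared errors
    E_W = np.sum(w_mp ** 2)                    # weight-decay term
    H = 2.0 * (beta * X.T @ X + alpha * np.eye(N))   # Hessian of F
    gamma = N - 2.0 * alpha * np.trace(np.linalg.inv(H))
    alpha = gamma / (2.0 * E_W)                # most probable alpha
    beta = (n - gamma) / (2.0 * E_D)           # most probable beta

# gamma counts the effective (well-determined) parameters: 0 < gamma <= N.
assert 0.0 < gamma <= N
assert np.allclose(w_mp, w_gen, atol=0.2)
```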