## Preface

These notes record the derivation of the backpropagation, Levenberg–Marquardt, and Bayesian regularization algorithms, to build a deeper understanding of neural networks.

## BackPropagation

### Differentiation in Computational Graph

In fact, there are two modes of automatic differentiation in neural-network computing: **forward-mode differentiation** and **backward-mode differentiation** (reverse mode, which is what backpropagation implements).

Consider the computational graph below; we need to compute the gradient of the output with respect to the input, namely $\partial y / \partial x$, where $x$ and $y$ can be multi-dimensional vectors.

- For **forward-mode differentiation**:

  Each step of the iteration starts from the input and calculates upward. For example, consider the step at an intermediate node $v$ on the path from $x$ to $y$: we accumulate $\dfrac{\partial v}{\partial x}$ by the chain rule,

  $$\frac{\partial v}{\partial x} = \frac{\partial v}{\partial u} \cdot \frac{\partial u}{\partial x}$$

  where $u$ is the node that feeds into $v$.

- For **backward-mode differentiation**:

  Each step of the iteration starts from the output and calculates downward. For example, consider the same step at node $v$: we accumulate $\dfrac{\partial y}{\partial v}$ by the chain rule,

  $$\frac{\partial y}{\partial v} = \frac{\partial y}{\partial u} \cdot \frac{\partial u}{\partial v}$$

  where $u$ is the node that $v$ feeds into.

Why do we choose backpropagation? Let’s compare the two methods:

**Space Complexity**:

In the **forward-mode**, we do NOT need to record the values of previous layers; we only keep the current layer’s value and the current layer’s derivative, thus **SC(Forward) = $O(1)$** per layer, independent of the network depth.

On the other hand, the **backward-mode** needs to record the values of all layers from the forward pass, thus **SC(Backward) = $O(L)$** for a network with $L$ layers.

**Time Complexity**:

Generally, each layer (such as $v$ here) is a multi-dimensional vector of size $n$ in practical neural networks, which means $\dfrac{\partial y}{\partial v}$ is a vector (the output $y$ is a scalar loss), while each local derivative $\dfrac{\partial v}{\partial u}$ is an $n \times n$ matrix.

Therefore the forward mode must propagate full Jacobian matrices through matrix–matrix products, giving **TC(Forward) = $O(Ln^3)$**, but **TC(Backward) = $O(Ln^2)$**, since the backward mode only needs vector–matrix products against the stored values of each step.

It’s obvious that we should choose backward-mode differentiation (backpropagation) to accelerate the training of neural networks.
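The two modes above can be sketched on a tiny computational graph. The chain $x \to u = x^2 \to v = \sin u \to y = 3v$ and the evaluation point are an assumed toy example, not taken from the original text:

```python
import math

# Toy chain: x -> u = x^2 -> v = sin(u) -> y = 3v.

# Forward mode: sweep from input to output, carrying d(node)/dx
# alongside each value; nothing needs to be stored afterwards.
def forward_mode(x):
    u, du_dx = x ** 2, 2 * x
    v, dv_dx = math.sin(u), math.cos(u) * du_dx
    y, dy_dx = 3 * v, 3 * dv_dx
    return y, dy_dx

# Backward mode: a forward pass stores every intermediate value,
# then a backward sweep carries dy/d(node) from output to input.
def backward_mode(x):
    u = x ** 2          # forward pass: values must be kept
    v = math.sin(u)
    y = 3 * v
    dy_dv = 3.0         # backward pass, chain rule at each step
    dy_du = dy_dv * math.cos(u)
    dy_dx = dy_du * 2 * x
    return y, dy_dx
```

Both sweeps produce the same derivative; the difference is only in the traversal order and in what must be stored.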

### BackPropagation Process

BackPropagation algorithm can be divided into **4 steps**:

**1. Forward-propagate the Input Signal**

To illustrate the backpropagation process, we use a three-layer neural network with two inputs and one output, as shown below. The difference between the output signal $y$ and the target $t$ is called the error signal $\delta$ of the output-layer neuron.

**2. Back-propagate the Error Signal**

Propagate the error signal $\delta$ (computed in a single teaching step) back to **all neurons** whose output signals were inputs for the neuron in question. (Omit some computation process here.)

**3. Calculate Parameter Gradients**

**4. Update Parameters**

When the error signal for each neuron is computed, the weight coefficients of each neuron’s input node may be modified. In the formulas below, $f'(\cdot)$ represents the derivative of the activation function of the neuron whose weights are modified, and the learning coefficient $\eta$ affects the network teaching speed. (Omit some computation process here.)
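The four steps can be sketched for a network like the one described above (two inputs, one output). The hidden-layer size of 3, the sigmoid activation, the sample values, and the teaching coefficient are assumptions for illustration, not values from the original text:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = np.array([0.5, -0.3])          # two inputs (assumed sample)
t = np.array([1.0])                # target for the single output
W1 = rng.standard_normal((3, 2))   # hidden-layer weights (3 neurons assumed)
W2 = rng.standard_normal((1, 3))   # output-layer weights
eta = 0.5                          # teaching (learning) coefficient

for _ in range(2000):
    # 1. forward-propagate the input signal
    h = sigmoid(W1 @ x)
    y = sigmoid(W2 @ h)
    # 2. back-propagate the error signal: delta = error * f'(net)
    delta2 = (y - t) * y * (1 - y)
    delta1 = (W2.T @ delta2) * h * (1 - h)
    # 3. calculate the parameter gradients
    gW2 = np.outer(delta2, h)
    gW1 = np.outer(delta1, x)
    # 4. update the parameters
    W2 -= eta * gW2
    W1 -= eta * gW1
```

After repeated teaching steps on this single pattern, the network output approaches the target.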

## Levenberg–Marquardt

The Levenberg–Marquardt algorithm, which was independently developed by Kenneth Levenberg and Donald Marquardt, provides a numerical solution to the problem of minimizing a nonlinear function. It is fast and has stable convergence. In the artificial neural-networks field, this algorithm is suitable for training small- and medium-sized problems.

### Jacobian matrix & Hessian matrix

**Jacobian matrix**

Suppose $\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m$ is a function of the form below, w.r.t. $\mathbf{x} = (x_1, \dots, x_n)^T$:

$$\mathbf{f}(\mathbf{x}) = \big(f_1(\mathbf{x}), f_2(\mathbf{x}), \dots, f_m(\mathbf{x})\big)^T$$

Then the Jacobian matrix of $\mathbf{f}$ is an $m \times n$ matrix, usually defined and arranged as follows:

$$\mathbf{J} = \begin{bmatrix} \dfrac{\partial f_1}{\partial x_1} & \cdots & \dfrac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial f_m}{\partial x_1} & \cdots & \dfrac{\partial f_m}{\partial x_n} \end{bmatrix}$$

where, component-wise,

$$\mathbf{J}_{ij} = \frac{\partial f_i}{\partial x_j}$$

**Jacobian determinant**

Assume $m = n$, so the Jacobian matrix is square; then we have the Jacobian determinant

$$\det \mathbf{J} = \begin{vmatrix} \dfrac{\partial f_1}{\partial x_1} & \cdots & \dfrac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial f_n}{\partial x_1} & \cdots & \dfrac{\partial f_n}{\partial x_n} \end{vmatrix}$$

**Hessian matrix**

Suppose $f: \mathbb{R}^n \to \mathbb{R}$ is a twice-differentiable function, $\mathbf{x} = (x_1, \dots, x_n)^T$;

define the real symmetric matrix $\mathbf{H}$, the Hessian matrix, as below:

$$\mathbf{H} = \begin{bmatrix} \dfrac{\partial^2 f}{\partial x_1^2} & \cdots & \dfrac{\partial^2 f}{\partial x_1 \partial x_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial^2 f}{\partial x_n \partial x_1} & \cdots & \dfrac{\partial^2 f}{\partial x_n^2} \end{bmatrix}$$

The Jacobian of the gradient $\nabla f$ is the Hessian.

### Steepest Descent Algorithm

The sum squared error (SSE) is defined to evaluate the training process. Over all training patterns $p = 1, \dots, P$ and network outputs $m = 1, \dots, M$, it is calculated by

$$E(\mathbf{x}, \mathbf{w}) = \frac{1}{2} \sum_{p=1}^{P} \sum_{m=1}^{M} e_{p,m}^2$$

where $\mathbf{x}$ is the input vector, $\mathbf{w}$ is the weight vector, and $e_{p,m}$ is the training error at output $m$ for pattern $p$.

Normally, the gradient $\mathbf{g}$ is defined as the first-order derivative of the total error function:

$$\mathbf{g} = \frac{\partial E(\mathbf{x}, \mathbf{w})}{\partial \mathbf{w}} = \left[\frac{\partial E}{\partial w_1}, \frac{\partial E}{\partial w_2}, \dots, \frac{\partial E}{\partial w_N}\right]^T$$

Then the update rule of the steepest descent algorithm can be written as

$$\mathbf{w}_{k+1} = \mathbf{w}_k - \alpha \mathbf{g}_k$$

where $\alpha$ is the learning constant (step size).
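As a minimal sketch of the update rule, take the one-dimensional error surface $E(w) = w^2$ (an assumed toy example), whose gradient is $g = 2w$:

```python
# Steepest descent on E(w) = w^2, so g = dE/dw = 2w.
w = 5.0        # initial weight (assumed)
alpha = 0.1    # learning constant (assumed)
for _ in range(100):
    g = 2 * w          # first-order derivative of the error
    w = w - alpha * g  # update rule: w_{k+1} = w_k - alpha * g_k
```

Each step multiplies $w$ by $(1 - 2\alpha)$, so the iterate decays geometrically toward the minimum at $w = 0$.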

### Newton’s Method

Newton’s method assumes that all the gradient components $g_1, g_2, \dots, g_N$ are functions of the weights and all weights are linearly independent:

$$g_i = g_i(w_1, w_2, \dots, w_N), \quad i = 1, \dots, N$$

Unfold each $g_i$ by its Taylor series around the current point and take the first-order approximation:

$$g_i \approx g_{i,0} + \sum_{j=1}^{N} \frac{\partial g_i}{\partial w_j}\, \Delta w_j$$

where $\dfrac{\partial g_i}{\partial w_j} = \dfrac{\partial^2 E}{\partial w_i \partial w_j}$, thus we have, in matrix form,

$$\mathbf{g} \approx \mathbf{g}_0 + \mathbf{H}\, \Delta \mathbf{w}$$

Setting $\mathbf{g} = \mathbf{0}$ to seek the minimum of the error function, we get

$$\Delta \mathbf{w} = -\mathbf{H}^{-1} \mathbf{g}_0$$

Therefore, the update rule for Newton’s method is

$$\mathbf{w}_{k+1} = \mathbf{w}_k - \mathbf{H}_k^{-1} \mathbf{g}_k$$
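A minimal sketch on a quadratic error surface, where Newton’s method is exact in one step; the matrix $A$ and vector $b$ are assumed toy values:

```python
import numpy as np

# Quadratic error E(w) = 0.5 * w^T A w - b^T w,
# so the gradient is g = A w - b and the Hessian is H = A.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])

w = np.zeros(2)                  # starting point
g = A @ w - b                    # gradient at w
H = A                            # Hessian (constant for a quadratic)
w = w - np.linalg.solve(H, g)    # w_{k+1} = w_k - H^{-1} g_k
```

Because the error is exactly quadratic, the first-order Taylor expansion of the gradient is exact, and a single update lands on the minimum where $A\mathbf{w} = \mathbf{b}$.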

### Gauss–Newton Algorithm

Since the second-order derivatives of the total error function have to be calculated, which can be very complicated, the Jacobian matrix is introduced to simplify the calculating process:

$$\mathbf{J} = \begin{bmatrix} \dfrac{\partial e_{1,1}}{\partial w_1} & \cdots & \dfrac{\partial e_{1,1}}{\partial w_N} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial e_{P,M}}{\partial w_1} & \cdots & \dfrac{\partial e_{P,M}}{\partial w_N} \end{bmatrix}$$

The elements of the gradient vector can be calculated as

$$g_i = \frac{\partial E}{\partial w_i} = \sum_{p=1}^{P} \sum_{m=1}^{M} e_{p,m} \frac{\partial e_{p,m}}{\partial w_i}$$

Thus, the relationship between the Jacobian matrix and the gradient vector is

$$\mathbf{g} = \mathbf{J}^T \mathbf{e}$$

where $\mathbf{e}$ is the vector of all training errors $e_{p,m}$.

And the element at the $i$th row and $j$th column of the Hessian matrix can be calculated as

$$h_{i,j} = \frac{\partial^2 E}{\partial w_i \partial w_j} = \sum_{p=1}^{P} \sum_{m=1}^{M} \frac{\partial e_{p,m}}{\partial w_i} \frac{\partial e_{p,m}}{\partial w_j} + S_{i,j}$$

where $S_{i,j}$ is equal to

$$S_{i,j} = \sum_{p=1}^{P} \sum_{m=1}^{M} e_{p,m} \frac{\partial^2 e_{p,m}}{\partial w_i \partial w_j}$$

As the basic assumption of the Gauss–Newton method is that **$S_{i,j}$ is close to zero** (the residuals are small near the solution), we have

$$\mathbf{H} \approx \mathbf{J}^T \mathbf{J}$$

and the update rule becomes

$$\mathbf{w}_{k+1} = \mathbf{w}_k - (\mathbf{J}_k^T \mathbf{J}_k)^{-1} \mathbf{J}_k^T \mathbf{e}_k$$
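A minimal one-parameter sketch of the Gauss–Newton update; the model $y = e^{wx}$ and the data are assumed for illustration:

```python
import numpy as np

# Fit y = exp(w * x) by least squares, E = 0.5 * sum(e^2).
x = np.array([0.0, 0.5, 1.0, 1.5])
y = np.exp(0.7 * x)                          # synthetic data, true w = 0.7
w = 0.0

for _ in range(20):
    e = y - np.exp(w * x)                    # error vector
    J = (-x * np.exp(w * x)).reshape(-1, 1)  # J_p = d e_p / d w
    g = J.T @ e                              # gradient  g = J^T e
    H = J.T @ J                              # Hessian approximation J^T J
    w = w - np.linalg.solve(H, g)[0]         # Gauss-Newton update
```

Since the data are noise-free, the residuals (and hence $S_{i,j}$) vanish at the solution, and the iteration converges rapidly to $w = 0.7$.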

### Levenberg–Marquardt Algorithm

In order to make sure that the approximated Hessian matrix $\mathbf{J}^T \mathbf{J}$ is invertible, the Levenberg–Marquardt algorithm introduces another approximation to the Hessian matrix:

$$\mathbf{H} \approx \mathbf{J}^T \mathbf{J} + \mu \mathbf{I}$$

where $\mu$ is always positive and called the combination coefficient, and $\mathbf{I}$ is the identity matrix.

Therefore, the update rule of the Levenberg–Marquardt algorithm can be presented as

$$\mathbf{w}_{k+1} = \mathbf{w}_k - (\mathbf{J}_k^T \mathbf{J}_k + \mu \mathbf{I})^{-1} \mathbf{J}_k^T \mathbf{e}_k$$

When the combination coefficient $\mu$ is **very small (nearly zero)**, L-M approaches the Gauss–Newton algorithm.

When the combination coefficient $\mu$ is **very big**, $1/\mu$ can be interpreted as the learning coefficient in the steepest descent method: $\alpha = 1/\mu$.
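A minimal sketch of the Levenberg–Marquardt rule on the same style of one-parameter fit; the data, the initial $\mu = 0.01$, and the $\times 10$ / $\times 0.1$ adjustment schedule are assumptions commonly used in practice, not values from the original text:

```python
import numpy as np

# Fit y = exp(w * x); mu blends Gauss-Newton and steepest descent.
x = np.array([0.0, 0.5, 1.0, 1.5])
y = np.exp(0.7 * x)                          # synthetic data, true w = 0.7

def sse(w):
    e = y - np.exp(w * x)
    return e @ e

w, mu = 0.0, 0.01
for _ in range(50):
    e = y - np.exp(w * x)
    J = (-x * np.exp(w * x)).reshape(-1, 1)
    H = J.T @ J + mu * np.eye(1)             # J^T J + mu * I
    step = np.linalg.solve(H, J.T @ e)[0]
    if sse(w - step) < sse(w):
        w, mu = w - step, mu * 0.1           # success: move toward Gauss-Newton
    else:
        mu *= 10.0                           # failure: move toward steepest descent
```

Accepted steps shrink $\mu$ (fast Gauss–Newton behavior near the solution); rejected steps grow it (small, safe steepest-descent-like steps).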

## Bayesian Regularization

Bayesian regularized artificial neural networks (BRANNs) are more robust than standard back-propagation nets and can reduce or eliminate the need for lengthy cross-validation. They are difficult to overtrain, since evidence procedures provide an objective Bayesian criterion for stopping training. They are also difficult to overfit, because the BRANN calculates and trains on a number of effective network parameters or weights, effectively turning off those that are not relevant.

### Inference

Typically, training aims to reduce the sum of squared errors $E_D = \sum_i e_i^2$. However, regularization adds a penalty term, so the objective function becomes

$$F = \beta E_D + \alpha E_W$$

where the simplest choice of regularizer is the weight-decay regularizer

$$E_W = \sum_{j=1}^{n} w_j^2$$

In the Bayesian framework, the weights of the network are considered random variables. After the data is taken, the density function for the weights can be updated according to Bayes’ rule:

$$P(\mathbf{w} \mid D, \alpha, \beta, M) = \frac{P(D \mid \mathbf{w}, \beta, M)\, P(\mathbf{w} \mid \alpha, M)}{P(D \mid \alpha, \beta, M)}$$

where $D$ represents the data set and $M$ is the particular neural-network model used.

Assuming that the noise in the training-set data is Gaussian and that the prior distribution for the weights is Gaussian, the probability densities can be written as:

$$P(D \mid \mathbf{w}, \beta, M) = \frac{1}{Z_D(\beta)} \exp(-\beta E_D), \qquad P(\mathbf{w} \mid \alpha, M) = \frac{1}{Z_W(\alpha)} \exp(-\alpha E_W)$$

where

$$Z_D(\beta) = \left(\frac{\pi}{\beta}\right)^{N/2}, \qquad Z_W(\alpha) = \left(\frac{\pi}{\alpha}\right)^{n/2}$$

with $N$ training errors and $n$ network weights.

Then the posterior probability of the parameters is:

$$P(\mathbf{w} \mid D, \alpha, \beta, M) = \frac{1}{Z_F(\alpha, \beta)} \exp(-F(\mathbf{w}))$$

where $Z_F(\alpha, \beta) = \int \exp(-F(\mathbf{w}))\, d\mathbf{w}$.

### Optimizing

Applying Bayes’ rule to optimize the objective-function parameters $\alpha$ and $\beta$, we have:

$$P(\alpha, \beta \mid D, M) = \frac{P(D \mid \alpha, \beta, M)\, P(\alpha, \beta \mid M)}{P(D \mid M)}$$

Assume $P(\alpha, \beta \mid M)$ is a flat prior (which is uninformative about $\alpha$ and $\beta$); then maximizing the posterior is achieved by maximizing the likelihood (evidence) $P(D \mid \alpha, \beta, M)$, which can be obtained from the previous Bayesian formula:

$$P(D \mid \alpha, \beta, M) = \frac{Z_F(\alpha, \beta)}{Z_D(\beta)\, Z_W(\alpha)}$$

We can evaluate $Z_F(\alpha, \beta)$ by a Taylor-series expansion of $F(\mathbf{w})$ around the minimum point $\mathbf{w}_{MP}$ of the posterior density, where the gradient vanishes and $E_D$, $E_W$ are approximately quadratic. Let $\mathbf{H} = \beta \nabla^2 E_D + \alpha \nabla^2 E_W$ be the Hessian matrix of the objective function. Then we have:

$$F(\mathbf{w}) \approx F(\mathbf{w}_{MP}) + \frac{1}{2} (\mathbf{w} - \mathbf{w}_{MP})^T \mathbf{H} (\mathbf{w} - \mathbf{w}_{MP})$$

Since $\mathbf{H}$ is symmetric positive definite, it can be written as $\mathbf{H} = \mathbf{Q} \boldsymbol{\Lambda} \mathbf{Q}^T$. Introducing the variable $\mathbf{u} = \mathbf{Q}^T (\mathbf{w} - \mathbf{w}_{MP})$, substituting into the expansion, and integrating both sides, we have:

$$\int \exp(-F(\mathbf{w}))\, d\mathbf{w} \approx \exp(-F(\mathbf{w}_{MP})) \int \exp\!\Big(-\tfrac{1}{2} \mathbf{u}^T \boldsymbol{\Lambda} \mathbf{u}\Big)\, d\mathbf{u} = \exp(-F(\mathbf{w}_{MP}))\, (2\pi)^{n/2} \det(\mathbf{H})^{-1/2}$$

Therefore, we have:

$$Z_F(\alpha, \beta) \approx (2\pi)^{n/2} \det(\mathbf{H})^{-1/2} \exp(-F(\mathbf{w}_{MP}))$$

Then we substitute it into the evidence and take the log:

$$\log P(D \mid \alpha, \beta, M) = -F(\mathbf{w}_{MP}) - \frac{1}{2} \log \det \mathbf{H} + \frac{n}{2} \log(2\pi) + \frac{N}{2} \log \frac{\beta}{\pi} + \frac{n}{2} \log \frac{\alpha}{\pi}$$

We can solve for the optimal values of $\alpha$ and $\beta$ at the maximum point of the log evidence by taking the derivatives with respect to $\alpha$ and $\beta$ and setting them equal to zero.

First, differentiating with respect to $\alpha$, we need to evaluate $\dfrac{\partial}{\partial \alpha} \log \det \mathbf{H}$. Since $\nabla^2 E_W = 2\mathbf{I}$, we get:

$$\frac{\partial}{\partial \alpha} \log \det \mathbf{H} = \operatorname{tr}\!\left(\mathbf{H}^{-1} \frac{\partial \mathbf{H}}{\partial \alpha}\right) = 2 \operatorname{tr}(\mathbf{H}^{-1})$$

Then we obtain the following condition for the most probable value of $\alpha$:

$$2 \alpha_{MP} E_W(\mathbf{w}_{MP}) = n - 2 \alpha_{MP} \operatorname{tr}(\mathbf{H}^{-1}) = \gamma$$

The quantity $\gamma$ on the right of the equation above is called the number of good parameter measurements (the effective number of parameters).

Similarly, we differentiate the log evidence with respect to $\beta$ and obtain:

$$2 \beta_{MP} E_D(\mathbf{w}_{MP}) = N - \gamma$$

Thus, we solve for the optimal values of $\alpha$ and $\beta$:

$$\alpha_{MP} = \frac{\gamma}{2 E_W(\mathbf{w}_{MP})}, \qquad \beta_{MP} = \frac{N - \gamma}{2 E_D(\mathbf{w}_{MP})}$$

where

$$\gamma = n - 2 \alpha_{MP} \operatorname{tr}(\mathbf{H}^{-1})$$
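The evidence-maximization loop above can be sketched end-to-end for a model where the minimizer of $F$ has a closed form, namely a linear model $y = X\mathbf{w}$. This is an assumed toy setup: for an actual neural network, the inner minimization would instead be a Levenberg–Marquardt training step with $\mathbf{H} \approx 2\beta \mathbf{J}^T \mathbf{J} + 2\alpha \mathbf{I}$:

```python
import numpy as np

# Linear-Gaussian toy problem: y = X w_true + noise.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.standard_normal(50)
N, n = X.shape

alpha, beta = 1.0, 1.0
for _ in range(20):
    # Minimize F = beta*E_D + alpha*E_W (closed form for a linear model);
    # H = 2*beta*X^T X + 2*alpha*I is the Hessian of F.
    H = 2 * beta * X.T @ X + 2 * alpha * np.eye(n)
    w = np.linalg.solve(H, 2 * beta * X.T @ y)
    E_D = np.sum((y - X @ w) ** 2)
    E_W = np.sum(w ** 2)
    # Effective number of parameters, then re-estimate alpha and beta.
    gamma = n - 2 * alpha * np.trace(np.linalg.inv(H))
    alpha = gamma / (2 * E_W)
    beta = (N - gamma) / (2 * E_D)
```

Here $\gamma$ climbs toward $n = 3$ because all three weights are well determined by the data, and $\beta$ settles near $1/(2\sigma^2)$ for the injected noise level.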

## Reference

- Principles of Training Multi-layer Neural Network Using Backpropagation
- Automatic Differentiation
- Levenberg–Marquardt Training
- Artificial Neural Networks: Methods and Applications
- Bayesian Interpolation
- Information Theory, Inference and Learning Algorithms
- Gauss-Newton Approximation to Bayesian Learning