Notes on BackPropagation & Levenberg–Marquardt & Bayesian Regularization Algorithm

Preface

These notes record the derivations of the BackPropagation, Levenberg–Marquardt, and Bayesian Regularization algorithms, to build a deeper understanding of neural networks.

BackPropagation

Differentiation in Computational Graph

In fact, there are two modes of differentiation in neural-network computing: forward-mode differentiation and backward-mode differentiation (reverse-mode automatic differentiation).
Consider the computational graph below: we need to compute the gradient of the output with respect to the input, namely $\frac{\partial Z}{\partial X}$, where $X$ and $Z$ can be multi-dimensional vectors.
(figure: backpropagation0)

  • For forward-mode differentiation:
    Each iteration step starts from the input and works upward. Writing the graph as $X \to Y_1 \to Y_2 \to \cdots \to Z$, at the step of $Y_k$ we calculate the derivative with respect to the input by the chain rule:
    $$\frac{\partial Y_k}{\partial X} = \frac{\partial Y_k}{\partial Y_{k-1}} \cdot \frac{\partial Y_{k-1}}{\partial X}$$
  • For backward-mode differentiation:
    Each iteration step starts from the output and works downward. For example, at the step of $Y_k$ we calculate the derivative of the output by the chain rule:
    $$\frac{\partial Z}{\partial Y_k} = \frac{\partial Z}{\partial Y_{k+1}} \cdot \frac{\partial Y_{k+1}}{\partial Y_k}$$

Why do we choose backpropagation? Let's compare the two methods for a graph with $n$ layers, each an $m$-dimensional vector:

  • Space Complexity:
    In forward mode we do NOT need to record the values of previous layers; only the current layer value and the current layer derivative value are kept, thus $SC(\text{Forward}) = O(m)$.
    On the other hand, backward mode needs to record the values of all layers, thus $SC(\text{Backward}) = O(nm)$.

  • Time Complexity:
    Generally, each layer $Y_k$ is a multi-dimensional vector in practical neural networks, which means $\frac{\partial Z}{\partial Y_k}$ is a vector (for a scalar output $Z$), while each local derivative $\frac{\partial Y_{k+1}}{\partial Y_k}$ is an $m \times m$ matrix.
    Therefore $TC(\text{Forward}) = O(nm^3)$, since forward mode accumulates matrix–matrix products, but $TC(\text{Backward}) = O(nm^2)$, since backward mode only accumulates vector–matrix products, reusing the stored values of each step.

It's obvious that we should choose backward-mode (reverse) automatic differentiation to accelerate neural-network training.
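The backward-mode bookkeeping described above can be sketched in a few lines of Python. This is a minimal illustrative implementation (the `Value` class and its method names are my own, not from any particular library): each node stores its value, its parents, and a closure that applies the chain rule from the output downward.

```python
# Minimal scalar reverse-mode automatic differentiation sketch.
class Value:
    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None   # leaves have nothing to propagate

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def backward():
            self.grad += other.data * out.grad   # d(xy)/dx = y
            other.grad += self.data * out.grad   # d(xy)/dy = x
        out._backward = backward
        return out

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def backward():
            self.grad += out.grad                # d(x+y)/dx = 1
            other.grad += out.grad
        out._backward = backward
        return out

    def backprop(self):
        # Topologically order the graph, then apply the chain rule
        # from the output node back toward the inputs.
        order, seen = [], set()
        def visit(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = 1.0                          # dZ/dZ = 1
        for v in reversed(order):
            v._backward()

x, y = Value(3.0), Value(4.0)
z = x * y + x                                    # z = xy + x
z.backprop()
print(x.grad, y.grad)                            # dz/dx = y + 1 = 5, dz/dy = x = 3
```

Each node is visited once, which is exactly the $O(nm^2)$ vector–Jacobian accumulation argued for above.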

BackPropagation Process

BackPropagation algorithm can be divided into 4 steps:

  1. Forward-propagate Input Signal
    To illustrate the backpropagation process, we use the three-layer neural network with two inputs and one output shown below. The difference between the output signal and the target is called the error signal of the output-layer neuron.
    (figure: backpropagation1)
  2. Back-propagate Error Signal
    Propagate the error signal (computed in a single teaching step) back to all neurons whose output signals were inputs to the neuron under discussion. (Some computation steps are omitted here.)
    (figure: backpropagation2)
    (figure: backpropagation3)
  3. Calculate Parameters Gradients
  4. Update Parameters
    When the error signal of each neuron has been computed, the weight coefficients of each neuron's input connections may be modified. In the formulas below, $\frac{df(e)}{de}$ represents the derivative of the activation function of the neuron whose weights are being modified, and the learning coefficient $\eta$ affects the network teaching speed. (Some computation steps are omitted here.)
    (figure: backpropagation4)
    (figure: backpropagation5)
    (figure: backpropagation6)
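The four steps above can be sketched for a tiny 2–2–1 sigmoid network in plain Python. This is a minimal illustration only: the weight shapes, the learning coefficient $\eta = 0.5$, and the single training pattern are arbitrary choices, not taken from the figures.

```python
import math, random

def sigmoid(e):
    return 1.0 / (1.0 + math.exp(-e))

# Tiny 2-input, 2-hidden, 1-output network with random initial weights.
random.seed(0)
w1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(2)]  # hidden
w2 = [random.uniform(-1, 1) for _ in range(2)]                      # output
x, target, eta = [1.0, -1.0], 1.0, 0.5

# 1. Forward-propagate the input signal.
h = [sigmoid(sum(w1[j][i] * x[i] for i in range(2))) for j in range(2)]
y = sigmoid(sum(w2[j] * h[j] for j in range(2)))

# 2. Error signal at the output, then back-propagate it to the hidden layer.
#    For the sigmoid, f'(e) = y(1 - y).
delta_out = (target - y) * y * (1.0 - y)
delta_hid = [delta_out * w2[j] * h[j] * (1.0 - h[j]) for j in range(2)]

# 3.-4. Gradients and weight updates with learning coefficient eta.
for j in range(2):
    w2[j] += eta * delta_out * h[j]
    for i in range(2):
        w1[j][i] += eta * delta_hid[j] * x[i]

# One teaching step should reduce the output error on this pattern.
y_new = sigmoid(sum(w2[j] * sigmoid(sum(w1[j][i] * x[i] for i in range(2)))
                    for j in range(2)))
print(abs(target - y), "->", abs(target - y_new))   # error decreases
```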

Levenberg–Marquardt

The Levenberg–Marquardt algorithm, which was independently developed by Kenneth Levenberg and Donald Marquardt, provides a numerical solution to the problem of minimizing a nonlinear function. It is fast and has stable convergence. In the artificial neural-networks field, this algorithm is suitable for training small- and medium-sized problems.

Jacobian matrix & Hessian matrix

Jacobian matrix
Suppose $\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m$ is a function of the form below, w.r.t. $\mathbf{x} = (x_1, \dots, x_n)$:

$$\mathbf{f}(\mathbf{x}) = \begin{bmatrix} f_1(x_1, \dots, x_n) \\ \vdots \\ f_m(x_1, \dots, x_n) \end{bmatrix}$$

Then the Jacobian matrix of $\mathbf{f}$ is an $m \times n$ matrix, usually defined and arranged as follows:

$$\mathbf{J} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}$$

where, component-wise,

$$\mathbf{J}_{ij} = \frac{\partial f_i}{\partial x_j}$$

Jacobian determinant
Assume $m = n$, so that $\mathbf{J}$ is square; then we have the Jacobian determinant

$$\det \mathbf{J} = \begin{vmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_n}{\partial x_1} & \cdots & \frac{\partial f_n}{\partial x_n} \end{vmatrix}$$

Hessian matrix
Suppose $f: \mathbb{R}^n \to \mathbb{R}$ is a twice-differentiable function of $\mathbf{x} = (x_1, \dots, x_n)$;
define the real symmetric Hessian matrix as below:

$$\mathbf{H} = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix}, \qquad \mathbf{H}_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$$

The Jacobian of the gradient $\nabla f$ is the Hessian.
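As a quick numeric check of that last statement (an illustrative example; the function $f(x, y) = x^2 y + y^3$ is my own choice), a finite-difference Jacobian of the gradient field should match the analytic Hessian:

```python
# For f(x, y) = x^2 y + y^3:
#   grad f = (2xy, x^2 + 3y^2),  Hessian = [[2y, 2x], [2x, 6y]].
def grad(x, y):
    return [2 * x * y, x * x + 3 * y * y]

def hessian(x, y):                        # analytic, symmetric
    return [[2 * y, 2 * x], [2 * x, 6 * y]]

def jacobian_fd(fun, x, y, h=1e-6):       # forward-difference Jacobian of fun
    g0, gx, gy = fun(x, y), fun(x + h, y), fun(x, y + h)
    return [[(gx[i] - g0[i]) / h, (gy[i] - g0[i]) / h] for i in range(2)]

H, J = hessian(1.0, 2.0), jacobian_fd(grad, 1.0, 2.0)
err = max(abs(H[i][j] - J[i][j]) for i in range(2) for j in range(2))
print(err)                                # tiny (on the order of h)
```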

Steepest Descent Algorithm

Sum square error (SSE) is defined to evaluate the training process. Over all training patterns $p = 1, \dots, P$ and network outputs $m = 1, \dots, M$, it is calculated by

$$E(\mathbf{x}, \mathbf{w}) = \frac{1}{2} \sum_{p=1}^{P} \sum_{m=1}^{M} e_{p,m}^2$$

where $\mathbf{x}$ is the input vector, $\mathbf{w}$ is the weight vector, and $e_{p,m}$ is the training error at output $m$ for pattern $p$.
Normally, the gradient $\mathbf{g}$ is defined as the first-order derivative of the total error function:

$$\mathbf{g} = \frac{\partial E(\mathbf{x}, \mathbf{w})}{\partial \mathbf{w}} = \left[ \frac{\partial E}{\partial w_1}, \frac{\partial E}{\partial w_2}, \dots, \frac{\partial E}{\partial w_n} \right]^{T}$$

Then the update rule of the steepest descent algorithm, with learning constant $\alpha$, can be written as

$$\mathbf{w}_{k+1} = \mathbf{w}_k - \alpha \mathbf{g}_k$$

Newton’s Method

Newton's method assumes that all the gradient components $g_1, g_2, \dots, g_n$ are functions of the weights and all weights are linearly independent:

$$g_i = g_i(w_1, w_2, \dots, w_n), \qquad i = 1, \dots, n$$

Unfold each $g_i$ by Taylor series and take the first-order approximation:

$$g_i \approx g_{i,0} + \sum_{j=1}^{n} \frac{\partial g_i}{\partial w_j} \Delta w_j$$

where $\frac{\partial g_i}{\partial w_j} = \frac{\partial^2 E}{\partial w_i \partial w_j} = \mathbf{H}_{ij}$; thus, setting all $g_i = 0$ to seek the minimum, we have

$$-g_{i,0} = \sum_{j=1}^{n} \mathbf{H}_{ij} \Delta w_j$$

so, in matrix form, $\Delta \mathbf{w} = -\mathbf{H}^{-1} \mathbf{g}$.
Therefore, the update rule for Newton's method is

$$\mathbf{w}_{k+1} = \mathbf{w}_k - \mathbf{H}_k^{-1} \mathbf{g}_k$$
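On a quadratic error surface the Newton update reaches the minimum in a single step, which a tiny worked example can verify (the matrix $A$, vector $b$, and starting point are arbitrary illustrative choices, not from the notes):

```python
# E(w) = 1/2 w^T A w - b^T w, so g = A w - b and H = A.
# One Newton step w <- w - H^{-1} g lands exactly on the minimizer A^{-1} b.
A = [[3.0, 1.0], [1.0, 2.0]]
b = [1.0, 1.0]
w = [5.0, -7.0]                                   # arbitrary starting point

g = [A[0][0]*w[0] + A[0][1]*w[1] - b[0],          # gradient at w
     A[1][0]*w[0] + A[1][1]*w[1] - b[1]]

det = A[0][0]*A[1][1] - A[0][1]*A[1][0]           # invert the 2x2 Hessian
Hinv = [[ A[1][1]/det, -A[0][1]/det],
        [-A[1][0]/det,  A[0][0]/det]]

w = [w[0] - (Hinv[0][0]*g[0] + Hinv[0][1]*g[1]),  # single Newton update
     w[1] - (Hinv[1][0]*g[0] + Hinv[1][1]*g[1])]
print(w)                                          # approximately [0.2, 0.4] = A^{-1} b
```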

Gauss–Newton Algorithm

Since the second-order derivatives of the total error function would have to be calculated, which could be very complicated, the Jacobian matrix is introduced to simplify the calculation:

$$\mathbf{J} = \begin{bmatrix} \frac{\partial e_{1,1}}{\partial w_1} & \cdots & \frac{\partial e_{1,1}}{\partial w_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial e_{P,M}}{\partial w_1} & \cdots & \frac{\partial e_{P,M}}{\partial w_n} \end{bmatrix}$$

The elements of the gradient vector can be calculated as

$$g_i = \frac{\partial E}{\partial w_i} = \sum_{p=1}^{P} \sum_{m=1}^{M} e_{p,m} \frac{\partial e_{p,m}}{\partial w_i}$$

Thus, the relationship between the Jacobian matrix and the gradient vector is

$$\mathbf{g} = \mathbf{J}^{T} \mathbf{e}$$

where $\mathbf{e}$ is the vector of all training errors $e_{p,m}$.
And the element at the $i$th row and $j$th column of the Hessian matrix can be calculated as

$$\mathbf{H}_{ij} = \frac{\partial^2 E}{\partial w_i \partial w_j} = \sum_{p=1}^{P} \sum_{m=1}^{M} \frac{\partial e_{p,m}}{\partial w_i} \frac{\partial e_{p,m}}{\partial w_j} + S_{ij}$$

where $S_{ij}$ is equal to

$$S_{ij} = \sum_{p=1}^{P} \sum_{m=1}^{M} e_{p,m} \frac{\partial^2 e_{p,m}}{\partial w_i \partial w_j}$$

As the basic assumption is that the training errors $e_{p,m}$ are small near the solution, $S_{ij}$ is close to zero, thus

$$\mathbf{H} \approx \mathbf{J}^{T} \mathbf{J}$$

and the update rule becomes

$$\mathbf{w}_{k+1} = \mathbf{w}_k - (\mathbf{J}_k^{T} \mathbf{J}_k)^{-1} \mathbf{J}_k^{T} \mathbf{e}_k$$

Levenberg–Marquardt Algorithm

In order to make sure that the approximated Hessian matrix is invertible, the Levenberg–Marquardt algorithm introduces another approximation to the Hessian matrix:

$$\mathbf{H} \approx \mathbf{J}^{T} \mathbf{J} + \mu \mathbf{I}$$

where $\mu$ is always positive, called the combination coefficient, and $\mathbf{I}$ is the identity matrix.
Therefore, the update rule of the Levenberg–Marquardt algorithm can be presented as

$$\mathbf{w}_{k+1} = \mathbf{w}_k - (\mathbf{J}_k^{T} \mathbf{J}_k + \mu \mathbf{I})^{-1} \mathbf{J}_k^{T} \mathbf{e}_k$$

When the combination coefficient $\mu$ is very small (nearly zero), L-M approaches the Gauss–Newton algorithm.
When the combination coefficient $\mu$ is very big, L-M can be interpreted as the steepest descent method with learning coefficient $\alpha \approx \frac{1}{\mu}$.
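The full loop (compute $\mathbf{J}$ and $\mathbf{e}$, solve the damped normal equations, and adjust $\mu$ according to whether the SSE improved) can be sketched in pure Python for a toy curve fit. This is illustrative only: the model $y = a\,e^{bx}$, the data, and the factor-of-10 $\mu$ schedule are my own choices, not from the notes.

```python
import math

# Fit y = a * exp(b * x) to noiseless data generated with a = 2, b = -1.
# Residuals e_p = y_p - a*exp(b*x_p), so the Jacobian row for pattern p is
# J[p] = [de_p/da, de_p/db] = [-exp(b*x_p), -a*x_p*exp(b*x_p)].
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [2.0 * math.exp(-1.0 * x) for x in xs]

def sse(a, b):
    return sum((y - a * math.exp(b * x)) ** 2 for x, y in zip(xs, ys))

a, b, mu = 1.0, 0.0, 0.01
for _ in range(50):
    e = [y - a * math.exp(b * x) for x, y in zip(xs, ys)]
    J = [[-math.exp(b * x), -a * x * math.exp(b * x)] for x in xs]
    # Damped normal equations (J^T J + mu*I) dw = -J^T e, 2x2 case.
    JtJ = [[sum(r[i] * r[j] for r in J) for j in range(2)] for i in range(2)]
    Jte = [sum(r[i] * ei for r, ei in zip(J, e)) for i in range(2)]
    A00, A11, A01 = JtJ[0][0] + mu, JtJ[1][1] + mu, JtJ[0][1]
    det = A00 * A11 - A01 * A01
    dw = [-( A11 * Jte[0] - A01 * Jte[1]) / det,
          -(-A01 * Jte[0] + A00 * Jte[1]) / det]
    if sse(a + dw[0], b + dw[1]) < sse(a, b):
        a, b, mu = a + dw[0], b + dw[1], mu / 10  # accept: toward Gauss-Newton
    else:
        mu *= 10                                  # reject: toward steepest descent
print(round(a, 4), round(b, 4))                   # approximately 2.0 and -1.0
```

Shrinking $\mu$ on success and growing it on failure is exactly the small-$\mu$/big-$\mu$ trade-off described above.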

Bayesian Regularization

Bayesian regularized artificial neural networks (BRANNs) are more robust than standard back-propagation nets and can reduce or eliminate the need for lengthy cross-validation. They are difficult to overtrain, since evidence procedures provide an objective Bayesian criterion for stopping training. They are also difficult to overfit, because the BRANN calculates and trains on a number of effective network parameters or weights, effectively turning off those that are not relevant.

Inference

Typically, training aims to reduce the sum of squared errors $E_D = \sum_{i=1}^{N} e_i^2$. However, regularization adds an additional term, so the objective function becomes

$$F = \beta E_D + \alpha E_W$$

where the simplest choice of regularizer is the weight-decay regularizer $E_W = \sum_{j=1}^{n} w_j^2$ (note that in this section $E_D$ and $E_W$ carry no factor of $\frac{1}{2}$).
In the Bayesian framework the weights of the network are considered random variables. After the data is taken, the density function for the weights can be updated according to Bayes' rule:

$$P(\mathbf{w} \mid D, \alpha, \beta, M) = \frac{P(D \mid \mathbf{w}, \beta, M)\, P(\mathbf{w} \mid \alpha, M)}{P(D \mid \alpha, \beta, M)}$$

where $D$ represents the data set and $M$ is the particular neural-network model used.
Assuming that the noise in the training-set data is Gaussian and that the prior distribution for the weights is Gaussian, the probability densities can be written as:

$$P(D \mid \mathbf{w}, \beta, M) = \frac{1}{Z_D(\beta)} \exp(-\beta E_D), \qquad P(\mathbf{w} \mid \alpha, M) = \frac{1}{Z_W(\alpha)} \exp(-\alpha E_W)$$

where

$$Z_D(\beta) = \left( \frac{\pi}{\beta} \right)^{N/2}, \qquad Z_W(\alpha) = \left( \frac{\pi}{\alpha} \right)^{n/2}$$

with $N$ the number of training errors and $n$ the total number of weights.
Then the posterior probability of the parameters is:

$$P(\mathbf{w} \mid D, \alpha, \beta, M) = \frac{1}{Z_F(\alpha, \beta)} \exp(-F(\mathbf{w})), \qquad Z_F(\alpha, \beta) = \int \exp(-F(\mathbf{w}))\, d\mathbf{w}$$

Optimizing

Applying Bayes' rule to optimize the objective-function parameters $\alpha$ and $\beta$, we have:

$$P(\alpha, \beta \mid D, M) = \frac{P(D \mid \alpha, \beta, M)\, P(\alpha, \beta \mid M)}{P(D \mid M)}$$

Assume $P(\alpha, \beta \mid M)$ is a flat prior (uninformative for selecting $\alpha$ and $\beta$); then maximizing the posterior is achieved by maximizing the likelihood (evidence) term, which can be obtained from the previous Bayesian formula as a ratio of normalizing constants:

$$P(D \mid \alpha, \beta, M) = \frac{P(D \mid \mathbf{w}, \beta, M)\, P(\mathbf{w} \mid \alpha, M)}{P(\mathbf{w} \mid D, \alpha, \beta, M)} = \frac{Z_F(\alpha, \beta)}{Z_D(\beta)\, Z_W(\alpha)}$$
We can evaluate $Z_F$ by Taylor-series expansion around the minimum point $\mathbf{w}_{MP}$ of the posterior density, where $F$ is approximately quadratic. Let $\mathbf{H} = \beta \nabla^2 E_D + \alpha \nabla^2 E_W$ be the Hessian matrix of the objective function. Then we have:

$$F(\mathbf{w}) \approx F(\mathbf{w}_{MP}) + \frac{1}{2} \Delta\mathbf{w}^{T} \mathbf{H}\, \Delta\mathbf{w}, \qquad \Delta\mathbf{w} = \mathbf{w} - \mathbf{w}_{MP}$$

Since $\mathbf{H}$ is symmetric positive semi-definite, it can be written as $\mathbf{H} = \mathbf{Q} \Lambda \mathbf{Q}^{T}$ with $\mathbf{Q}$ orthogonal. We introduce the variable $\mathbf{u} = \mathbf{Q}^{T} \Delta\mathbf{w}$, substitute it into the integral, and integrate both sides (a standard Gaussian integral), obtaining:

$$\int \exp\!\left( -\frac{1}{2} \Delta\mathbf{w}^{T} \mathbf{H}\, \Delta\mathbf{w} \right) d\Delta\mathbf{w} = (2\pi)^{n/2} (\det \mathbf{H})^{-1/2}$$

Therefore, we have:

$$Z_F(\alpha, \beta) \approx (2\pi)^{n/2} (\det \mathbf{H})^{-1/2} \exp(-F(\mathbf{w}_{MP}))$$

Then we substitute it into the evidence and take the log:

$$\ln P(D \mid \alpha, \beta, M) = -\alpha E_W(\mathbf{w}_{MP}) - \beta E_D(\mathbf{w}_{MP}) - \frac{1}{2} \ln \det \mathbf{H} + \frac{n}{2} \ln \alpha + \frac{N}{2} \ln \beta + \frac{n}{2} \ln 2 - \frac{N}{2} \ln \pi$$
We can solve for the optimal values of $\alpha$ and $\beta$ at the minimum point by taking the derivative of the log evidence with respect to each and setting it equal to zero.
First, differentiating with respect to $\alpha$, we need to evaluate $\frac{d}{d\alpha} \ln \det \mathbf{H}$. Since $\frac{\partial \mathbf{H}}{\partial \alpha} = \nabla^2 (\alpha E_W) / \alpha \cdot \alpha = 2\mathbf{I}$:

$$\frac{d}{d\alpha} \ln \det \mathbf{H} = \mathrm{tr}\!\left( \mathbf{H}^{-1} \frac{\partial \mathbf{H}}{\partial \alpha} \right) = 2\, \mathrm{tr}(\mathbf{H}^{-1})$$

Then we obtain the following condition for the most probable value of $\alpha$:

$$2 \alpha E_W(\mathbf{w}_{MP}) = n - 2\alpha\, \mathrm{tr}(\mathbf{H}^{-1})$$

The quantity on the right of the equation above is called the number of good parameter measurements $\gamma$.
Similarly, we differentiate the log evidence with respect to $\beta$ and obtain:

$$2 \beta E_D(\mathbf{w}_{MP}) = N - \gamma$$

Thus, we solve the optimal values of $\alpha$ and $\beta$:

$$\alpha_{MP} = \frac{\gamma}{2 E_W(\mathbf{w}_{MP})}, \qquad \beta_{MP} = \frac{N - \gamma}{2 E_D(\mathbf{w}_{MP})}$$

where $\gamma = n - 2\alpha\, \mathrm{tr}(\mathbf{H}^{-1})$ is called the effective number of parameters, and $n$ is the total number of parameters in the network.

Training Steps

  1. Initialize $\alpha$, $\beta$ and the weights. We choose to set $\alpha = 0$, $\beta = 1$ and use the Nguyen–Widrow method of initializing the weights. After the first training step, the objective-function parameters will recover from the initial setting.
  2. Take one step of the Levenberg–Marquardt algorithm to minimize the objective function $F = \beta E_D + \alpha E_W$.
  3. Compute the effective number of parameters $\gamma = n - 2\alpha\, \mathrm{tr}(\mathbf{H}^{-1})$, making use of the Gauss–Newton approximation to the Hessian available in the Levenberg–Marquardt training algorithm: $\mathbf{H} = \nabla^2 F \approx 2\beta \mathbf{J}^{T}\mathbf{J} + 2\alpha \mathbf{I}_n$.
  4. Compute new estimates for the objective-function parameters: $\alpha = \frac{\gamma}{2 E_W(\mathbf{w})}$, $\beta = \frac{N - \gamma}{2 E_D(\mathbf{w})}$.
  5. Now iterate steps 2 through 4 until convergence.
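Steps 3 and 4 can be illustrated with hypothetical numbers (none of these values come from the notes; they only show how $\gamma$, $\alpha$ and $\beta$ are recomputed once $E_W$, $E_D$ and $\mathrm{tr}(\mathbf{H}^{-1})$ are available from the L-M step):

```python
# Hypothetical quantities after one L-M step (illustrative values only):
n, N = 10, 100                  # total parameters, training targets
alpha, beta = 0.01, 1.0         # current objective-function parameters
E_W, E_D = 4.0, 0.5             # sum of squared weights / errors at w_MP
trace_H_inv = 30.0              # tr(H^{-1}) from the Gauss-Newton Hessian

gamma = n - 2.0 * alpha * trace_H_inv        # step 3: effective parameters
alpha = gamma / (2.0 * E_W)                  # step 4: re-estimate alpha
beta = (N - gamma) / (2.0 * E_D)             #         and beta
print(gamma, alpha, beta)                    # approximately 9.4, 1.175, 90.6
```

With these numbers most of the 10 weights are "well determined" ($\gamma \approx 9.4$), so the weight-decay pressure $\alpha$ rises only modestly while $\beta$ keeps the data term dominant.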

Reference

  1. Principles of training multi-layer neural network using backpropagation
  2. Automatic Differentiation
  3. Levenberg–Marquardt Training
  4. Artificial Neural Networks: Methods and Applications
  5. Bayesian Interpolation
  6. Information Theory, Inference and Learning Algorithms
  7. Gauss–Newton Approximation to Bayesian Learning