Recording the derivation of BackPropagation & Levenberg–Marquardt & Bayesian Regularization Algorithm, to have a deeper understanding of the Neural Networks.
In fact, there are two types of differential methods in nerual networks computing: forward-mode differentiation and backward-mode differentiation(Automatic Differentiation).
Consider the computayional graph below, we need to compute the gradient of the output with respect to the input, namely , where can be multi-dimensional vectors.
- For forward-mode differentiation:
Each step of the iteration starts from the input cacluate upward, for example we consider the step of , thus we need to calculate derivative by chain rule:
- For backward-mode differentiation:
Each step of the iteration starts from the output calcuate downward, for example, we consider the step of , thus we need to calculate derivative by chain rule:
Why we choose backpropagation? Let’s compare two methods:
In the forward-mode, we do NOT need to record the values of previous layers, and only need to consider the current layer value and current layer derivative value, thus SC(Forward)=
On the other hand, backward-mode need to record the values of all layers, thus SC(Backward)=
Generally, each layers (such as here ) are multi-dimensional vectors in practical nerual networks, which means should be vector, and & should be matrix.
Therefore, TC(Forward)=, but TC(Backward)= since backward-differentiation method store the values of each step.
It’s obvious that we should choose Automatic Differentiation to accelerate neural networks.
BackPropagation algorithm can be divided into 4 steps:
- Forward-propagate Input Singal
To illustrate the backpropagation process, we use the three layers neural network with two inputs and one output as below shows, the difference btween the output signal and the target is called error signal of output layer neuron.
- Back-propagate Error Singal
Propagate error signal (computed in single teaching step) back to all neurons, which output signals were input for discussed neuron. (Omit some computation process here)
- Calculate Parameters Gradients
- Update Parameters
When the error signal for each neuron is computed, the weights coefficients of each neuron input node may be modified. In formulas below represents derivative of neuron activation function (which weights are modified). Coefficient affects network teaching speed. (Omit some computation process here)
The Levenberg–Marquardt algorithm, which was independently developed by Kenneth Levenberg and Donald Marquardt, provides a numerical solution to the problem of minimizing a nonlinear function. It is fast and has stable convergence. In the artificial neural-networks field, this algorithm is suitable for training small- and medium-sized problems.
Suppose is a function which has form below, w.r.t
Then the Jacobian matrix of is an matrix, usually defined and arranged as follows:
Assume , we have
Suppose is quadratic differentiable function， ,
define real symmetric matrix is Hessian matrix as below shows:
The Jacobian of the is the Hessian.
Sum square error (SSE) is defined to evaluate the training process. For all training patterns and network outputs, it is calculated by
where is the input vector, is the weight vector, is the training error.
Normally, gradient is defined as the first-order derivative of total error function:
Then the update rule of the steepest descent algorithm could be written as
Newton’s method assumes that all the gradient components are functions of weights and all weights are linearly independent:
Unfold each by Taylor series and take the first-order approximation:
where , thus we have
Therefore, the update rule for Newton’s method is
Since the second-order derivatives of total error function have to be calculated and it could be very complicated, in order to simplify the calculating process, Jacobian matrix is introduced as
The elements of gradient vector can be calculated as
Thus, the relationship between Jacobian matrix and gradient vector would be
And the element at ith row and jth column of Hessian matrix can be calculated as
where is equal to
As the basic assumption of Newton’s method is that is closed to zero, thus
and the update rule become
In order to make sure that the approximated Hessian matrix is invertible, Levenberg–Marquardt algorithm introduces another approximation to Hessian matrix:
where is always positive, called combination coefficient, is the identity matrix.
Therefore, the update rule of Levenberg–Marquardt algorithm can be presented as
When the combination coefficient is very small (nearly zero), L-M is approaching to Gauss–Newton algorithm.
When the combination coefficient is very big, it can be interpreted as the learning coefficient in the steepest descent method: .
Bayesian regularized artificial neural networks (BRANNs) are more robust than standard back-propagation nets and can reduce or eliminate the need for lengthy cross-validation. They are difficult to overtrain, since evidence procedures provide an objective Bayesian criterion for stopping training. They are also difficult to overfit, because the BRANN calculates and trains on a number of effective network parameters or weights, effectively turning off those that are not relevant.
Typically, training aims to reduce the sum of squared errors . However, regularization adds an additional term, thus the objective function becomes
where the simplest choice of regularizer is the weight decay regularizer
In the Bayesian framework the weights of the network are considered random variables. After the data is taken, the density function for the weights can be updated according to Bayes’ rule:
where represents the data set, is the particular neural network model used.
Assume that the noise in the training set data is Gaussian and that the prior distribution for the weights is Gaussian, the probability densities can be written as:
Then the posterior probability of the parameters is:
Applying Bayesian rule to optimize the objective function parameters , we have:
Assume is flat prior (which is uninformative to select ), then maximizing the posterior is achieved by maximizing the likelihood function, which can obtain from previous baysian formular:
We can exactly evaluate by Taylor series expansion around the minimum point of the posterior density , since are quadratic functions and let is the Hessian matrix of the objective function. Then we have:
Since is semi-symmetry, it can be written as , we introduce a variable and substitute into and made both sides integral, we have:
Therefore, we have:
Then we substitute it into and take the log of the evidence:
We can solve the optimal values of at the minimum point by taking the derivative with respect to and set them equal to zero.
First, differentiating with respect to , we need to evaluate :
Then we obtain the following condition for the most probable value of :
The quantity on the right of equation above is called the number of good parameter measurements .
Similarlly, we differentiate the log evidence with respect to and obtain:
Thus, we solve the optimal values of :
- Principles of training multi-layer neural network using backpropagation;
- Automatic Differentiation;
- Levenberg–Marquardt Training;
- Artificial Neural Networks Methods and Applications
- Bayesian interpolation
- Information theory, inference and learning algorithms
- Gauss-Newton approximation to Bayesian learning