Mahatma Gandhi said: “Live as if you were to die tomorrow. Learn as if you were to live forever.” Indeed, human learning is the basis for deep learning,^{2} which can be defined as a type of machine learning. It is used to solve complex problems, such as recognizing objects and understanding speech, by developing algorithms that learn to solve problems from examples. A key difficulty in these problems is feature extraction, or knowing which components of the problem to focus on. The basic unit of this technique is the neuron, and the model is a neural network. Each neuron has a set of inputs, each with a specific weight. There are various types of neurons, distinguished by the function each applies to its weighted inputs. The sigmoidal neuron has been found to be particularly useful in learning algorithms. It applies a logistic function to the weighted sum of its inputs and outputs a value between 0 and 1. For very negative weighted sums, the output is near 0, and for very large positive weighted sums, the output is near 1 (Figure 1).
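The behavior of a sigmoidal neuron can be sketched in a few lines of Python. This is a minimal illustration rather than any particular library's implementation; the function names and the sample inputs and weights are hypothetical.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def sigmoid_neuron(inputs, weights):
    """Apply the logistic function to the weighted sum of the inputs."""
    z = sum(w * x for w, x in zip(weights, inputs))
    return sigmoid(z)

# Illustrative values only: the same inputs with three different weight sets.
near_zero = sigmoid_neuron([1.0, 2.0], [-10.0, -10.0])  # weighted sum = -30
midpoint = sigmoid_neuron([1.0, 2.0], [2.0, -1.0])      # weighted sum = 0
near_one = sigmoid_neuron([1.0, 2.0], [10.0, 10.0])     # weighted sum = 30
```

A weighted sum of zero gives an output of exactly 0.5, while strongly negative and strongly positive sums push the output toward 0 and 1 respectively.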

Many neurons can be connected into a neural network in which the output of one neuron is the input to another. Typically, neurons are organized in layers, often with no connections between neurons in the same layer. The bottom layer of neurons receives the inputs and the top layer produces the outputs. The layers in between are called the hidden layers and are where most of the problem solving occurs. When information flows in one direction only, from the bottom layer to the top layer, the network is called a feed-forward neural network. If information also flows back from higher layers to lower ones, the network is called a recurrent neural network. The number of connections between the neurons in one layer and those in an adjoining layer is important in creating a suitable model.
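A feed-forward pass through such layers might look like the following sketch. It assumes every neuron is sigmoidal and that each layer is fully connected to the next; the layer sizes and weight values are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(inputs, weight_rows):
    """Each row holds the weights of one neuron in the layer."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)))
            for row in weight_rows]

def feed_forward(inputs, layers):
    """Pass activations from the bottom layer to the top, one layer at a time."""
    activations = inputs
    for weight_rows in layers:
        activations = layer_forward(activations, weight_rows)
    return activations

# Hypothetical network: 2 inputs -> 3 hidden neurons -> 1 output neuron.
hidden_weights = [[0.5, -0.5], [1.0, 1.0], [-1.0, 0.5]]
output_weights = [[1.0, -1.0, 0.5]]
result = feed_forward([1.0, 0.0], [hidden_weights, output_weights])
```

Because the output of each layer becomes the input of the next and nothing flows backward, this is the feed-forward case described above.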

Too many connections result in “overfitting.” This is a type of modeling error that occurs when a model is so complex that it fits a particular data set too closely, thereby reducing its predictive power on other data sets. The model ends up fitting random noise or other unaccounted-for errors in the data rather than the underlying relationship. To obtain answers from the neural network, we need values for the weights. These are found by presenting a large number of training examples and adjusting the weights to minimize the error. Although many neurons are sigmoidal, we can start from the error equation shown in Figure 2 for linear neurons and then generalize it to sigmoidal neurons.
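As a rough illustration of what "minimizing the error" means for a linear neuron, the sketch below assumes the commonly used squared-error measure, half the sum of squared differences between target and output over all training examples. The figure itself is not reproduced here, so this form and the sample data are assumptions.

```python
def linear_neuron(inputs, weights):
    """A linear neuron outputs the weighted sum of its inputs."""
    return sum(w * x for w, x in zip(weights, inputs))

def squared_error(examples, weights):
    """Assumed error measure: E = 1/2 * sum of (target - output)^2."""
    return 0.5 * sum((target - linear_neuron(inputs, weights)) ** 2
                     for inputs, target in examples)

# Made-up training examples generated by target = 1*x1 + 2*x2.
examples = [([1.0, 2.0], 5.0), ([2.0, 1.0], 4.0), ([3.0, 3.0], 9.0)]

error_at_true_weights = squared_error(examples, [1.0, 2.0])
error_at_zero_weights = squared_error(examples, [0.0, 0.0])
```

Training amounts to searching for the weights that make this number as small as possible.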

By moving perpendicular to the contour lines of the error surface, along the path of steepest descent, we can arrive at the point of minimum error. This training strategy is called gradient descent. An additional consideration is the distance moved at each step. If we are far from the minimum, we want to move more; if we are close, we want to move less. The steepness of the descent is a good indicator of this distance. Multiplying the steepness by a constant e, the learning rate, gives the delta rule for training (Figure 3).
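The sketch below applies gradient descent to a single linear neuron. It assumes the squared-error measure E = 1/2 Σ (t − y)², for which the weight change is the learning rate times the input times (t − y), summed over the training examples. The data, learning rate, and iteration count are illustrative choices, not values from the article.

```python
def train_linear_neuron(examples, weights, learning_rate=0.05, iterations=500):
    """Batch gradient descent for one linear neuron (assumed squared error)."""
    weights = list(weights)
    for _ in range(iterations):
        changes = [0.0] * len(weights)
        for inputs, target in examples:
            output = sum(w * x for w, x in zip(weights, inputs))
            for k, x_k in enumerate(inputs):
                # Steeper error surface (larger target - output) means a larger step.
                changes[k] += learning_rate * x_k * (target - output)
        weights = [w + c for w, c in zip(weights, changes)]
    return weights

# Made-up training examples generated by target = 1*x1 + 2*x2.
examples = [([1.0, 2.0], 5.0), ([2.0, 1.0], 4.0), ([3.0, 3.0], 9.0)]
learned = train_linear_neuron(examples, [0.0, 0.0])
```

With these examples the learned weights approach 1 and 2, the values used to generate the targets. A learning rate that is too large for the data would make the steps overshoot and diverge.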

The delta rule states how much we should change each weight at each training iteration. Gradient descent is often modified to stochastic gradient descent, in which each update is computed from a single training example or a small random subset rather than from the whole training set. The resulting noise can help the search escape shallow local minima on a nonlinear error surface, although it does not guarantee reaching the global error minimum. For the sigmoidal neuron, the delta rule gains an extra term that accounts for the logistic function, obtained through the chain rule for derivatives. Through dynamic programming, we can calculate how fast the error changes as we change the activity of a hidden layer and then calculate error derivatives for adjacent layers. From these, we can calculate how the error changes with respect to the weights. Finally, to complete the backpropagation algorithm, we sum the partial derivatives over the training examples (Figure 4).
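The chain-rule bookkeeping can be sketched for a tiny network with one hidden layer of sigmoidal neurons and a single sigmoidal output. The code assumes the error for one training example is E = 1/2 (t − y)²; the network size and all numeric values are hypothetical. A finite-difference estimate is included as a sanity check on one derivative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_hidden, w_out):
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    y = sigmoid(sum(w * h for w, h in zip(w_out, hidden)))
    return hidden, y

def error(x, t, w_hidden, w_out):
    _, y = forward(x, w_hidden, w_out)
    return 0.5 * (t - y) ** 2

def backprop(x, t, w_hidden, w_out):
    """Return dE/dw for the hidden and output weights of one example."""
    hidden, y = forward(x, w_hidden, w_out)
    # Output neuron: dE/dy = -(t - y), and the logistic term dy/dz = y(1 - y).
    delta_out = -(t - y) * y * (1.0 - y)
    grad_out = [delta_out * h for h in hidden]
    # Hidden neurons: reuse delta_out instead of recomputing it (dynamic programming).
    delta_hidden = [delta_out * w * h * (1.0 - h) for w, h in zip(w_out, hidden)]
    grad_hidden = [[d * xi for xi in x] for d in delta_hidden]
    return grad_hidden, grad_out

x, t = [1.0, 0.5], 1.0
w_hidden = [[0.1, -0.2], [0.4, 0.3]]
w_out = [0.5, -0.6]
grad_hidden, grad_out = backprop(x, t, w_hidden, w_out)

# Central finite-difference estimate of dE/dw for one hidden weight.
step = 1e-6
plus = [row[:] for row in w_hidden]
minus = [row[:] for row in w_hidden]
plus[0][1] += step
minus[0][1] -= step
numeric = (error(x, t, plus, w_out) - error(x, t, minus, w_out)) / (2 * step)
```

For a full training set, the per-example derivatives returned by `backprop` would be summed before the weights are updated, or applied one example at a time in the stochastic variant.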

To create better and more accurate models, we need to add more training examples and judiciously limit the number of neurons. As a word of caution, a recent paper^{1} has posited that although combinations of the hidden units in the final layer are indistinguishable from the units themselves, specific non-random perturbations of the input can result in misclassification.

Software for deep learning includes cuda-convnet, a fast C++ implementation of convolutional neural networks; a Matlab/Octave toolbox for deep learning; and pylearn2, which is available on GitHub.

Based on multi-layer feed-forward artificial neural networks, deep learning shows potential for solving problems in machine learning and artificial intelligence.

**References**

1. Christian Szegedy et al., Google, Inc. “Intriguing properties of neural networks.” http://cs.nyu.edu/~zaremba/docs/understanding.pdf
2. Nikhil Buduma. “Deep Learning in a Nutshell – what it is, how it works, why care?” http://www.kdnuggets.com/2015/01/deep-learning-explanation-what-how-why.html

*Mark Anawis is a Principal Scientist and ASQ Six Sigma Black Belt at Abbott. He may be reached at editor@ScientificComputing.com.*