Monday, January 5, 2009

neuralnetwork

Artificial Neural Networks

In this note we provide an overview of the key concepts that have led to the emergence of Artificial Neural Networks as a major paradigm for Data Mining applications. Neural nets have gone through two major development periods: the early 60's and the mid 80's. They were a key development in the field of machine learning. Artificial Neural Networks were inspired by biological findings relating to the behavior of the brain as a network of units called neurons. The human brain is estimated to have around 10 billion neurons, each connected on average to 10,000 other neurons. Each neuron receives signals through synapses that control the effects of the signal on the neuron. These synaptic connections are believed to play a key role in the behavior of the brain.

The fundamental building block in an Artificial Neural Network is the mathematical model of a neuron, as shown in Figure 1. The three basic components of the (artificial) neuron are:

1. The synapses, or connecting links, that provide weights $w_j$ to the input values $x_j$ for $j = 1, \ldots, m$;

2. An adder that sums the weighted input values to compute the input to the activation function, $v = w_0 + \sum_{j=1}^{m} w_j x_j$, where $w_0$, called the bias (not to be confused with statistical bias in prediction or estimation), is a numerical value associated with the neuron. It is convenient to think of the bias as the weight for an input $x_0$ whose value is always equal to one, so that $v = \sum_{j=0}^{m} w_j x_j$;

3. An activation function $g$ (also called a squashing function) that maps $v$ to $g(v)$, the output value of the neuron. This function is a monotone function.
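These three components translate directly into a few lines of code. The sketch below is our own minimal illustration (the function and variable names are not from the note): it forms $v = w_0 + \sum_{j=1}^{m} w_j x_j$ and applies an activation function g. The logistic function used in the example is one common squashing function, discussed later in this note.

```python
import math

def neuron_output(weights, bias, inputs, g=lambda v: v):
    """Output of a single artificial neuron.

    weights : synaptic weights w_1, ..., w_m
    bias    : the bias w_0 (the weight on the constant input x_0 = 1)
    inputs  : input values x_1, ..., x_m
    g       : activation ("squashing") function; the identity by default
    """
    v = bias + sum(w * x for w, x in zip(weights, inputs))
    return g(v)

# Example with three inputs and a logistic squashing function
logistic = lambda v: 1.0 / (1.0 + math.exp(-v))
print(neuron_output([0.5, -0.2, 0.1], bias=0.3, inputs=[1.0, 2.0, 3.0], g=logistic))
```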
[Figure 1: mathematical model of an artificial neuron]

While there are numerous different (artificial) neural network architectures that have been studied by researchers, the most successful applications in data mining of neural networks have been multilayer feedforward networks. These are networks in which there is an input layer consisting of nodes that simply accept the input values, and successive layers of nodes that are neurons as depicted in Figure 1. The outputs of neurons in a layer are inputs to neurons in the next layer. The last layer is called the output layer. Layers between the input and output layers are known as hidden layers. Figure 2 is a diagram for this architecture.

[Figure 2: a multilayer feedforward network with an input layer, hidden layers, and an output layer]

In a supervised setting where a neural net is used to predict a numerical quantity, there is one neuron in the output layer and its output is the prediction. When the network is used for classification, the output layer typically has as many nodes as the number of classes, and the output layer node with the largest output value gives the network's estimate of the class for a given input. In the special case of two classes it is common to have just one node in the output layer, the classification between the two classes being made by applying a cut-off to the output value at the node.
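As an illustration of how values flow through such a network, here is a minimal sketch in Python (our own, with made-up layer sizes and weights): each layer of neurons takes the previous layer's outputs as its inputs, and the final layer's outputs are the network's outputs.

```python
import math

def logistic(v):
    # one common choice of activation function; any squashing function could be used
    return 1.0 / (1.0 + math.exp(-v))

def layer_output(inputs, weights, biases, g):
    """Outputs of one layer: each neuron forms its weighted sum plus bias and applies g."""
    return [g(b + sum(w * x for w, x in zip(ws, inputs)))
            for ws, b in zip(weights, biases)]

def feedforward(x, layers, g=logistic):
    """layers is a list of (weights, biases) pairs, from the first hidden layer
    to the output layer; the outputs of each layer are the inputs to the next."""
    out = x
    for weights, biases in layers:
        out = layer_output(out, weights, biases, g)
    return out

# A tiny 2-3-2 network with arbitrary weights, purely for illustration
hidden = ([[0.1, -0.4], [0.3, 0.2], [-0.5, 0.6]], [0.0, 0.1, -0.1])
output = ([[0.7, -0.2, 0.4], [0.1, 0.9, -0.3]], [0.05, -0.05])
print(feedforward([1.0, 2.0], [hidden, output]))
```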
1.1 Single layer networks

Let us begin by examining neural networks with just one layer of neurons (output layer only, no hidden layers). The simplest network consists of just one neuron with the function $g$ chosen to be the identity function, $g(v) = v$ for all $v$. In this case notice that the output of the network is $\sum_{j=0}^{m} w_j x_j$, a linear function of the input vector $x$ with components $x_j$. If we are modeling the dependent variable $y$ using multiple linear regression, we can interpret the neural network as a structure that predicts a value $\hat{y}$ for a given input vector $x$, with the weights being the coefficients. If we choose these weights to minimize the mean square error using observations in a training set, these weights would simply be the least squares estimates of the coefficients.

The weights in neural nets are also often designed to minimize mean square error in a training data set. There is, however, a different orientation in the case of neural nets: the weights are "learned". The network is presented with cases from the training data one at a time and the weights are revised after each case in an attempt to minimize the mean square error. This process of incremental adjustment of weights, based on the error made on training cases, is known as "training" the neural net. The almost universally used dynamic updating algorithm for the neural net version of linear regression is known as the Widrow-Hoff rule or the least-mean-square (LMS) algorithm. It is simply stated. Let $x(i)$ denote the input vector $x$ for the $i$th case used to train the network, and let $w(i)$ denote the vector of weights before this case is presented to the net. The updating rule is

$w(i+1) = w(i) + \eta \, (y(i) - \hat{y}(i)) \, x(i)$, with $w(0) = 0$.

It can be shown that if the network is trained in this manner by repeatedly presenting the training observations one at a time, then for suitably small (absolute) values of $\eta$ the network will learn (converge to) the optimal values of $w$. Note that the training data may have to be presented several times for $w(i)$ to be close to the optimal $w$. The advantage of dynamic updating is that the network tracks moderate time trends in the underlying linear model quite effectively.
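The LMS rule is just as compact in code. The following is a minimal sketch of ours (the data, learning rate and epoch count are arbitrary choices): the cases are presented one at a time, each presentation applying $w(i+1) = w(i) + \eta (y(i) - \hat{y}(i)) x(i)$, with the bias handled through a constant input $x_0 = 1$.

```python
def lms_train(cases, eta=0.05, epochs=200):
    """Widrow-Hoff (LMS) training of a single linear neuron.

    cases : list of (x, y) pairs, x a list of inputs and y the target value
    eta   : learning rate (must be suitably small for convergence)
    Returns the weight vector [w_0, w_1, ..., w_m], starting from all zeros.
    """
    m = len(cases[0][0])
    w = [0.0] * (m + 1)                      # w[0] is the bias weight
    for _ in range(epochs):                  # the data may need to be presented several times
        for x, y in cases:
            xb = [1.0] + list(x)             # prepend the constant input x_0 = 1
            y_hat = sum(wj * xj for wj, xj in zip(w, xb))
            w = [wj + eta * (y - y_hat) * xj for wj, xj in zip(w, xb)]
    return w

# Example: data generated (without noise) by y = 2 + 3x; the weights approach [2, 3]
cases = [([x], 2.0 + 3.0 * x) for x in [0.0, 0.5, 1.0, 1.5, 2.0]]
print(lms_train(cases))
```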
If we consider using the single layer neural net for classification into c classes, we would use c nodes in the output layer. If we think of classical discriminant analysis in neural network terms, the coefficients in Fisher's classification functions give us weights for the network that are optimal if the input vectors come from Multivariate Normal distributions with a common covariance matrix.

For classification into two classes, the linear optimization approach that we examined in class can be viewed as choosing optimal weights in a single layer neural network using the appropriate objective function.

Maximum likelihood coefficients for logistic regression can also be considered as weights in a neural network, chosen to minimize a function of the residuals called the deviance. In this case the logistic function $g(v) = \frac{e^v}{1 + e^v}$ is the activation function for the output node.

1.2 Multilayer neural networks

Multilayer neural networks are undoubtedly the most popular networks used in applications. While it is possible to consider many activation functions, in practice it has been found that using the logistic (also called the sigmoid) function $g(v) = \frac{e^v}{1 + e^v}$ as the activation function (or minor variants such as the tanh function) works best. In fact the revival of interest in neural nets was sparked by successes in training neural networks using this function in place of the historical (biologically inspired) step function (the "perceptron"). Notice that using a linear activation function does not achieve anything in multilayer networks beyond what can be done with single layer networks with linear activation functions. The practical value of the logistic function arises from the fact that it is almost linear in the range where $g$ is between 0.1 and 0.9, but has a squashing effect on very small or very large values of $v$.

In theory it is sufficient to consider networks with two layers of neurons (one hidden and one output layer), and this is certainly the case for most applications. There are, however, a number of situations where three, and sometimes four or five, layers have been more effective. For prediction, the output node is often given a linear activation function to provide forecasts that are not limited to the zero-to-one range. An alternative is to scale the output to the linear part (0.1 to 0.9) of the logistic function.
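A few evaluations of $g(v) = \frac{e^v}{1 + e^v}$, sketched below purely for illustration, show the behaviour just described: the output is roughly linear for moderate v and is squashed towards 0 or 1 for very small or very large v. The tanh variant is shown alongside for comparison.

```python
import math

def logistic(v):
    # g(v) = e^v / (1 + e^v), written in a numerically stable equivalent form
    return 1.0 / (1.0 + math.exp(-v))

for v in [-6, -3, -1, 0, 1, 3, 6]:
    print(f"v = {v:+d}   logistic(v) = {logistic(v):.3f}   tanh(v) = {math.tanh(v):+.3f}")
```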
Unfortunately there is no clear theory to guide us on choosing the number of nodes in each hidden layer or indeed the number of layers. The common practice is to use trial and error, although there are schemes for combining optimization methods such as genetic algorithms with network training for these parameters.

Since trial and error is a necessary part of neural net applications, it is important to have an understanding of the standard method used to train a multilayered network: backpropagation. It is no exaggeration to say that the speed of the backprop algorithm made neural nets a practical tool in the manner that the simplex method made linear optimization a practical tool. The revival of strong interest in neural nets in the mid 80's was in large measure due to the efficiency of the backprop algorithm.

1.3 Example 1: Fisher's Iris data

Let us look at the Iris data that Fisher analyzed using Discriminant Analysis. Recall that the data consist of four measurements on three types of iris flowers. There are 50 observations for each class of iris. A part of the data is reproduced below.
OBS#   SPECIES           CLASSCODE   SEPLEN   SEPW   PETLEN   PETW
1      Iris-setosa       1           5.1      3.5    1.4      0.2
2      Iris-setosa       1           4.9      3.0    1.4      0.2
3      Iris-setosa       1           4.7      3.2    1.3      0.2
4      Iris-setosa       1           4.6      3.1    1.5      0.2
5      Iris-setosa       1           5.0      3.6    1.4      0.2
6      Iris-setosa       1           5.4      3.9    1.7      0.4
7      Iris-setosa       1           4.6      3.4    1.4      0.3
8      Iris-setosa       1           5.0      3.4    1.5      0.2
9      Iris-setosa       1           4.4      2.9    1.4      0.2
10     Iris-setosa       1           4.9      3.1    1.5      0.1
...    ...               ...         ...      ...    ...      ...
51     Iris-versicolor   2           7.0      3.2    4.7      1.4
52     Iris-versicolor   2           6.4      3.2    4.5      1.5
53     Iris-versicolor   2           6.9      3.1    4.9      1.5
54     Iris-versicolor   2           5.5      2.3    4.0      1.3
55     Iris-versicolor   2           6.5      2.8    4.6      1.5
56     Iris-versicolor   2           5.7      2.8    4.5      1.3
57     Iris-versicolor   2           6.3      3.3    4.7      1.6
58     Iris-versicolor   2           4.9      2.4    3.3      1.0
59     Iris-versicolor   2           6.6      2.9    4.6      1.3
60     Iris-versicolor   2           5.2      2.7    3.9      1.4
...    ...               ...         ...      ...    ...      ...
101    Iris-virginica    3           6.3      3.3    6.0      2.5
102    Iris-virginica    3           5.8      2.7    5.1      1.9
103    Iris-virginica    3           7.1      3.0    5.9      2.1
104    Iris-virginica    3           6.3      2.9    5.6      1.8
105    Iris-virginica    3           6.5      3.0    5.8      2.2
106    Iris-virginica    3           7.6      3.0    6.6      2.1
107    Iris-virginica    3           4.9      2.5    4.5      1.7
108    Iris-virginica    3           7.3      2.9    6.3      1.8
109    Iris-virginica    3           6.7      2.5    5.8      1.8
110    Iris-virginica    3           7.2      3.6    6.1      2.5

If we use a neural net architecture for this classification problem, we will need 4 nodes (not counting the bias node) in the input layer, one for each of the 4 independent variables, and 3 neurons (one for each class) in the output layer. Let us select one hidden layer with 25 neurons. Notice that there will be a total of 25 connections from each node in the input layer to nodes in the hidden layer. This makes a total of 4 x 25 = 100 connections between the input layer and the hidden layer. In addition there will be a total of 3 connections from each node in the hidden layer to nodes in the output layer. This makes a total of 25 x 3 = 75 connections between the hidden layer and the output layer.
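The connection counts can be checked with a couple of lines (a small sketch of ours; note that the 100 + 75 connections above do not include the bias weights, which are counted separately below).

```python
layers = [4, 25, 3]   # input nodes, hidden neurons, output neurons

connections = sum(a * b for a, b in zip(layers, layers[1:]))   # 4*25 + 25*3 = 175
bias_weights = sum(layers[1:])                                 # one per neuron: 25 + 3 = 28
print(connections, bias_weights, connections + bias_weights)   # 175 28 203
```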
Using the standard logistic activation functions, the network was trained with a run consisting of 60,000 iterations. Each iteration consists of presentation to the input layer of the independent variables in a case, followed by successive computations of the outputs of the neurons of the hidden layer and the output layer using the appropriate weights. The output values of neurons in the output layer are used to compute the error. This error is used to adjust the weights of all the connections in the network using backward propagation ("backprop"), which completes the iteration. Since the training data has 150 cases, each case was presented to the network 400 times. Another way of stating this is to say the network was trained for 400 epochs, where an epoch consists of one sweep through the entire training data. The results for the last epoch of training the neural net on this data are shown below:

Iris Output 1

Classification Confusion Matrix

Desired      Computed Class
Class        1      2      3      Total
1            50     0      0      50
2            0      49     1      50
3            0      1      49     50
Total        50     50     50     150

Error Report

Class      Patterns   # Errors   % Errors   StdDev
1          50         0          0.00       (0.00)
2          50         1          2.00       (1.98)
3          50         1          2.00       (1.98)
Overall    150        2          1.3        (0.92)

The classification error of 1.3% is better than the error using discriminant analysis, which was 2% (see the lecture note on Discriminant Analysis).
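The training procedure just described (present one case, compute the outputs layer by layer, compare them with the desired outputs, and adjust every connection weight by backpropagation) can be sketched in plain Python. This is our own minimal illustration under the assumptions noted in the comments (squared error, logistic units, a fixed learning rate eta), not the software actually used for the runs reported here.

```python
import math, random

def logistic(v):
    return 1.0 / (1.0 + math.exp(-v))

def train_backprop(cases, n_hidden=25, eta=0.1, epochs=400, seed=0):
    """Train a one-hidden-layer network case by case with backpropagation.

    cases : list of (x, target) pairs; x is a list of inputs and target is a
            one-hot list with a 1 in the position of the desired class.
    Uses squared error at logistic output units and a fixed learning rate eta.
    """
    rng = random.Random(seed)
    n_in, n_out = len(cases[0][0]), len(cases[0][1])
    # weight vectors: index 0 is the bias weight, the rest multiply the inputs
    W_h = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    W_o = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)] for _ in range(n_out)]

    for _ in range(epochs):                        # an epoch is one sweep through the data
        for x, t in cases:                         # an iteration is one case presented
            # forward pass: hidden layer outputs, then output layer outputs
            xb = [1.0] + list(x)
            h = [logistic(sum(w * xi for w, xi in zip(wj, xb))) for wj in W_h]
            hb = [1.0] + h
            o = [logistic(sum(w * hi for w, hi in zip(wk, hb))) for wk in W_o]
            # backward pass: error signals at the output, propagated to the hidden layer
            d_o = [(tk - ok) * ok * (1.0 - ok) for tk, ok in zip(t, o)]
            d_h = [hj * (1.0 - hj) * sum(d_o[k] * W_o[k][j + 1] for k in range(n_out))
                   for j, hj in enumerate(h)]
            # adjust every connection weight (including the bias weights)
            for k in range(n_out):
                W_o[k] = [w + eta * d_o[k] * hi for w, hi in zip(W_o[k], hb)]
            for j in range(n_hidden):
                W_h[j] = [w + eta * d_h[j] * xi for w, xi in zip(W_h[j], xb)]
    return W_h, W_o

def predict(x, W_h, W_o):
    """Class index of the output node with the largest output value."""
    xb = [1.0] + list(x)
    hb = [1.0] + [logistic(sum(w * xi for w, xi in zip(wj, xb))) for wj in W_h]
    o = [logistic(sum(w * hi for w, hi in zip(wk, hb))) for wk in W_o]
    return max(range(len(o)), key=o.__getitem__)
```

With the 150 iris cases supplied as (measurements, one-hot class) pairs, 400 epochs corresponds to the 60,000 case-by-case weight updates described above.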
Notice that had we stopped after only one pass of the data (150 iterations), the error is much worse (75%), as shown below:

Iris Output 2

Classification Confusion Matrix

Desired      Computed Class
Class        1      2      3      Total
1            10     7      2      19
2            13     1      6      20
3            12     5      4      21
Total        35     13     12     60

The classification error rate of 1.3% was obtained by careful choice of