This paper describes and evaluates several global optimization issues of Artificial Neural Networks (ANN) and their applications. The authors examine the properties of feed-forward neural networks and the process of determining appropriate network inputs and architecture, and build a short-term gas load forecasting system, the Tell Future system. The system, which is built on several Back-Propagation (BP) algorithms, performs very well for short-term gas load forecasting. The standard BP algorithm for training feed-forward neural networks has proven robust even for difficult problems. To forecast future load from the trained networks, historical loads, temperature, wind velocity, and calendar information are used in addition to the predicted future temperature and wind velocity. Compared with other regression methods, neural networks allow more flexible relationships between temperature, wind, calendar information, and load pattern. Feed-forward neural networks can be used for many kinds of forecasting in different industrial areas. Similar models can be built for electric load forecasting, daily water consumption forecasting, stock and market forecasting, traffic flow forecasting, and product sales forecasting.
Neural networks, more accurately called Artificial Neural Networks (ANN), are computational models that consist of a number of simple processing units that communicate by sending signals to each other over a large number of weighted connections. They were originally inspired by the human brain. In the brain, a biological neuron collects signals from other neurons through a host of fine structures called dendrites. The neuron sends out spikes of electrical activity through a long, thin strand known as an axon, which splits into thousands of branches. At the end of each branch, a structure called a synapse converts the activity from the axon into electrical effects that inhibit or excite activity in the connected neurons. When a neuron receives excitatory input that is sufficiently large compared with its inhibitory input, it sends a spike of electrical activity down its axon. Learning occurs by changing the effectiveness of the synapses, so that the influence of one neuron on another changes.
Like human brains, neural networks consist of processing units (artificial neurons) and connections (weights) between them. The processing units transport incoming information on their outgoing connections to other units. The "electrical" information is simulated with specific values stored in the weights, which give these networks the capacity to learn, memorize, and create relationships among the data.
A very important feature of these networks is their adaptive nature, where "learning by example" replaces "programming" in solving problems. This feature makes these computational models very appealing in application domains where one has little or incomplete understanding of the problems to be solved, but where training data are available.
There are many different types of neural networks, and they are used in many fields. New uses for neural networks are devised daily by researchers. Some of the most traditional applications include [1], [2], [17].
Neural networks, sometimes referred to as connectionist models or parallel distributed models, have several distinguishing features [3], [18]:
A processing unit (Figure 1), also called a neuron or node, performs a relatively simple job; it receives input from the neighbors or external sources and uses them to compute an output signal that is propagated to other units.
Within neural systems, there are three types of units: input units, hidden units, and output units.
Each unit j can have one or more inputs x1 , x2 , x3 , … xn , but only one output zj . An input to a unit is either the data outside the network, or the output of another unit, or its own output.
Each non-input unit in a neural network combines the values fed into it via synaptic connections from other units, producing a single value called the net input. The function that combines the values is called the combination function, which is defined by a certain propagation rule. In most neural networks, each unit is assumed to provide an additive contribution to the input of the unit to which it is connected. The total input to unit j is simply the weighted sum of the separate outputs from the connected units plus a threshold or bias term $\theta_j$:
$$a_j = \sum_i w_{ji} x_i + \theta_j \qquad (1)$$
A positive $w_{ji}$ is considered an excitation and a negative $w_{ji}$ an inhibition. Units with the above propagation rule are termed sigma units.
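In some cases, more complex rules for combining inputs are used. One such propagation rule, known as sigma-pi, multiplies groups of inputs together before weighting and summing them, and has the following format [3]:
$$a_j = \sum_i w_{ji} \prod_k x_{ik} + \theta_j \qquad (2)$$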
Many combination functions use a "bias" or "threshold" term in computing the net input to the unit. For a linear output unit, a bias term is equivalent to an intercept in a regression model. It is needed in much the same way as the constant polynomial '1' is required for approximation by polynomials.
Most units in a neural network transform their net inputs by using a scalar-to-scalar function called an activation function, yielding a value called the unit's activation. Except possibly for output units, the activation value is fed to one or more other units. Activation functions with a bounded range are often called squashing functions. Some of the most commonly used activation functions are [4], [18]:
The identity function, g(x) = x, is the function used by the input units. Sometimes the net input is multiplied by a constant to form a linear function.
The binary step function, also known as the threshold function or Heaviside function, limits its output to one of two values, depending on whether the net input exceeds a threshold. This kind of function is often used in single-layer networks.
The sigmoid function, g(x) = 1/(1 + e^(-x)), is especially advantageous for use in neural networks trained by Back-Propagation, because it is easy to differentiate and can thus dramatically reduce the computational burden of training. It suits applications whose desired output values lie between 0 and 1.
A sigmoid-shaped function with a symmetric range, such as the hyperbolic tangent, has properties similar to those of the sigmoid function. It works well for applications that yield output values in the range [-1, 1].
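The activation functions discussed above can be sketched in a few lines of Java. This is only an illustration: the class and method names are hypothetical, the step function's behavior exactly at the threshold is an assumed convention, and the hyperbolic tangent is used here as a stand-in for the function with range [-1, 1].

```java
// Illustrative sketch; names are assumptions, not taken from the Tell Future code.
final class ActivationSketch {
    // Identity: passes the net input through unchanged.
    static double identity(double a) {
        return a;
    }

    // Binary step: 1 when the net input reaches the threshold, else 0
    // (the choice of output values and of >= is an assumed convention).
    static double step(double a, double threshold) {
        return a >= threshold ? 1.0 : 0.0;
    }

    // Logistic sigmoid: output in (0, 1).
    static double sigmoid(double a) {
        return 1.0 / (1.0 + Math.exp(-a));
    }

    // Hyperbolic tangent: sigmoid-shaped, output in (-1, 1).
    static double tanh(double a) {
        return Math.tanh(a);
    }

    public static void main(String[] args) {
        double[] netInputs = {-2.0, 0.0, 2.0};
        for (double a : netInputs) {
            System.out.printf("a=%5.1f identity=%5.1f step=%3.1f sigmoid=%.4f tanh=%.4f%n",
                    a, identity(a), step(a, 0.0), sigmoid(a), tanh(a));
        }
    }
}
```

Both squashing functions saturate for large positive or negative net inputs, which is why targets are usually scaled into the range of the chosen output function.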
Activation functions for the hidden units are needed to introduce non-linearity into the networks, because a composition of linear functions is again a linear function. It is the non-linearity (i.e., the capability to represent nonlinear functions) that makes multi-layer networks so powerful. Almost any nonlinear function does the job, although for Back-Propagation learning it must be differentiable, and it helps if the function is bounded. The sigmoid functions are the most common choices [5].
For the output units, activation functions should be chosen to suit the distribution of the target values. As noted above, for binary [0,1] outputs the sigmoid function is an excellent choice. For continuous-valued targets with a bounded range, the sigmoid functions are again useful, provided that either the outputs or the targets are scaled to the range of the output activation function. If the target values have no known bounded range, it is better to use an unbounded activation function, most often the identity function (which amounts to no activation function). If the target values are positive but have no known upper bound, an exponential output activation function can be used [5].
The number of layers, the number of units per layer, and the interconnection patterns between layers define the topology of a network. Networks are generally divided into two categories based on the pattern of connections:
1. Feed-forward Networks: The data flow from input units to output units is strictly feed-forward. The data processing can extend over multiple layers of units, but no feedback connections are present. That is, connections extending from the outputs of units to inputs of units in the same layer or previous layers are not permitted. Feed-forward networks are the main focus of this paper.
2. Recurrent Networks: These networks contain feedback connections. In contrast to feed-forward networks, the dynamical properties of the network are important. In some cases, the activation values of the units undergo a relaxation process such that the network evolves to a stable state in which the activations no longer change. In other applications, where the dynamical behavior constitutes the output of the network, the changes of the activation values of the output units are significant (Figure 6).
The functionality of a neural network is determined by the combination of the topology (number of layers, number of units per layer, and the interconnection pattern between the layers) and the weights of the connections within the network. The topology is usually held fixed, and a training algorithm determines the weights. The process of adjusting the weights to make the network learn the relationship between the inputs and targets is called learning, or training. Many learning algorithms have been invented to help find an optimum set of weights that solves the problem. They can roughly be divided into two main groups:
1. Supervised Learning: The network is trained by providing it with inputs and desired outputs (target values). These input-output pairs are provided by an external teacher, or by the system containing the network. The difference between the real outputs and the desired outputs is used by the algorithm to adapt the weights in the network (Figure 7). It is often posed as a function approximation problem - given training data consisting of pairs of input patterns x, and corresponding target t, the goal is to find a function f(x) that matches the desired response for each training input.
2. Unsupervised Learning: With unsupervised learning, there is no feedback from the environment to indicate whether the outputs of the network are correct. The network must discover features, regularities, correlations, or categories in the input data automatically. In fact, for most varieties of unsupervised learning, the targets are the same as the inputs. In other words, unsupervised learning usually performs the same task as an auto-associative network, compressing the information from the inputs.
To train a network and measure how well it performs, an objective function (or cost function) must be defined to provide an unambiguous numerical rating of system performance. Selection of an objective function is very important, because the function represents the design goals and decides which training algorithm can be used. Developing an objective function that measures exactly what is wanted is not an easy task. A few basic functions are very commonly used. One of them is the sum-of-squares error function,
$$E = \frac{1}{2} \sum_p \sum_i (t_{pi} - y_{pi})^2 \qquad (7)$$
where, p indexes the patterns in the training set, i indexes the output nodes, and tpi and ypi are the target and the actual network output for the ith output unit on the pth pattern respectively. In real world applications, it may be necessary to complicate the function with additional terms to control the complexity of the model.
A layered feed-forward network consists of a certain number of layers, and each layer contains a certain number of units. There is an input layer, an output layer, and one or more hidden layers between the input and the output layer. Each unit receives its inputs directly from the previous layer (except for input units) and sends its output directly to units in the next layer (except for output units). Unlike a recurrent network, which contains feedback connections, a layered network has no connections from any unit back to units in previous layers, to other units in the same layer, or to units more than one layer ahead. Every unit acts only as an input to the immediately following layer. This class of networks is easier to analyze theoretically than other general topologies because the outputs can be represented by explicit functions of the inputs and the weights.
An example of a layered network with one hidden layer is shown in Figure 8. In this network there are l inputs, m hidden units, and n output units. The output of the jth hidden unit is obtained by first forming a weighted linear combination of the l input values, then adding a bias,
$$a_j = \sum_{i=1}^{l} w^{(1)}_{ji} x_i + w^{(1)}_{j0} \qquad (8)$$
where $w^{(1)}_{ji}$ is the weight from input i to hidden unit j in the first layer and $w^{(1)}_{j0}$ is the bias for hidden unit j. If the bias terms are considered as weights from an extra input $x_0 = 1$, (8) can be rewritten in the form,
$$a_j = \sum_{i=0}^{l} w^{(1)}_{ji} x_i \qquad (9)$$
The activation of hidden unit j can then be obtained by transforming the linear sum using an activation function g(x):
$$z_j = g(a_j) \qquad (10)$$
The outputs of the network are obtained by transforming the activations of the hidden units using a second layer of processing units. For each output unit k, the linear combination of the outputs of the hidden units is first obtained as,
$$a_k = \sum_{j=1}^{m} w^{(2)}_{kj} z_j + w^{(2)}_{k0} \qquad (11)$$
Again, the bias is absorbed into the weights by introducing an extra hidden unit $z_0 = 1$, and the above equation is rewritten as,
$$a_k = \sum_{j=0}^{m} w^{(2)}_{kj} z_j \qquad (12)$$
Applying the activation function $g_2(x)$ to equation (12) then gives the kth output,
$$y_k = g_2(a_k) \qquad (13)$$
Combining equations (9), (10), (12) and (13), the complete representation of the network is,
$$y_k = g_2\left( \sum_{j=0}^{m} w^{(2)}_{kj} \, g\left( \sum_{i=0}^{l} w^{(1)}_{ji} x_i \right) \right) \qquad (14)$$
The network shown in Figure 8 has one hidden layer. It can easily be extended to two or more hidden layers by carrying the above transformation further.
Note that the input units are special: they are hypothetical units that produce outputs equal to their supposed inputs, and they do no processing.
Back-Propagation is the most commonly used method for training multi-layer feed-forward networks. It can be applied to any feed-forward network with differentiable activation functions. This technique was popularized by Rumelhart, Hinton and Williams [6].
For most networks, the learning process is based on a suitable error function, which is then minimized with respect to the weights and biases. If a network has differentiable activation functions, then the activations of the output units are differentiable functions of the input variables, the weights, and the biases. If a differentiable error function of the network outputs, such as the sum-of-squares error function, is also defined, then the error function itself is a differentiable function of the weights. Therefore, the derivatives of the error with respect to the weights can be evaluated, and these derivatives can then be used to find the weights that minimize the error function, using either the popular gradient descent or other optimization methods. The algorithm for evaluating the derivatives of the error function is known as back-propagation, because it propagates the errors backward through the network.
Consider a general feed-forward network with arbitrary differentiable non-linear activation functions and a differentiable error function.
Each unit j first forms a weighted sum of its inputs of the form,
$$a_j = \sum_i w_{ji} z_i \qquad (15)$$
where $z_i$ is the activation of a unit, or an input, that sends a connection to unit j. The activation function is then applied,
$$z_j = g(a_j) \qquad (16)$$
Note that one or more of the variables $z_i$ in equation (15) could be an input, in which case it is denoted by $x_i$. Similarly, the unit j in equation (16) could be an output unit, which is denoted by $y_k$.
The error function is written as a sum, over all patterns in the training set, of an error defined for each pattern separately,
$$E = \sum_p E_p, \qquad E_p = E_p(Y; W) \qquad (17)$$
where p indexes the patterns, Y is the vector of outputs, and W is the vector of all weights. $E_p$ can be expressed as a differentiable function of the output variables $y_k$.
The goal is to find a way to evaluate the derivatives of the error function E with respect to the weights and biases. Using equation (17), these derivatives are expressed as sums over the training set patterns of the derivatives for each pattern separately. From here on, one pattern at a time is considered.
For each pattern, given all the inputs, the activations of all hidden and output units in the network are obtained by successive application of equations (15) and (16). This process is called forward propagation or the forward pass. Once the activations of all the outputs, together with the target values, are available, the error function Ep can be evaluated in full.
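Now, consider the evaluation of the derivative of $E_p$ with respect to some weight $w_{ji}$. Applying the chain rule gives the partial derivative,
$$\frac{\partial E_p}{\partial w_{ji}} = \frac{\partial E_p}{\partial a_j} \frac{\partial a_j}{\partial w_{ji}} = \delta_j z_i \qquad (18)$$
where,
$$\delta_j \equiv \frac{\partial E_p}{\partial a_j} \qquad (19)$$
From equation (18), it is easy to see that the derivative can be obtained by multiplying the value of δ for the unit at the output end of the weight by the value of z for the unit at the input end. Thus, the task becomes finding $\delta_j$ for each hidden and output unit in the network.
For an output unit, $\delta_k$ is straightforward,
$$\delta_k = \frac{\partial E_p}{\partial a_k} = g'(a_k) \frac{\partial E_p}{\partial y_k} \qquad (20)$$
For a hidden unit, $\delta_j$ is obtained indirectly. Hidden units can influence the error only through their effects on the units k to which they send output connections. So,
$$\delta_j = \frac{\partial E_p}{\partial a_j} = \sum_k \frac{\partial E_p}{\partial a_k} \frac{\partial a_k}{\partial a_j} \qquad (21)$$
The first factor is just the $\delta_k$ of unit k. So,
$$\delta_j = \sum_k \delta_k \frac{\partial a_k}{\partial a_j} \qquad (22)$$
For the second factor, if unit j connects directly to unit k then $\partial a_k / \partial a_j = g'(a_j) w_{kj}$; otherwise it is zero. So the following back-propagation formula is obtained,
$$\delta_j = g'(a_j) \sum_k w_{kj} \delta_k \qquad (23)$$
which means that the value of δ for a particular hidden unit can be obtained by propagating the δ's backwards from units later in the network, as shown in Figure 9.
Recursively applying this equation yields the δ's for all of the hidden units in a feed-forward network, no matter how many layers it has.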
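One way to gain confidence in the δ recursion is to compare its result with a finite-difference estimate of the same derivative. The sketch below does this for a deliberately tiny network (one input, one sigmoid hidden unit with a bias, one linear output) and the sum-of-squares error for a single pattern. The network shape, the class name, and all numeric values are assumptions chosen for illustration; note that it uses the sign convention of the derivation (δ is the derivative of the error), not that of any particular implementation.

```java
// Illustrative gradient check; the network and all names are assumptions.
final class BackpropCheckSketch {
    static double sigmoid(double a) {
        return 1.0 / (1.0 + Math.exp(-a));
    }

    // Error for one pattern: E = 0.5 * (y - t)^2, with y = w2 * sigmoid(w1 * x + b1).
    static double error(double w1, double b1, double w2, double x, double t) {
        double y = w2 * sigmoid(w1 * x + b1);
        return 0.5 * (y - t) * (y - t);
    }

    // Returns {dE/dw1, dE/db1, dE/dw2}, computed by propagating deltas backward.
    static double[] backprop(double w1, double b1, double w2, double x, double t) {
        double z = sigmoid(w1 * x + b1);                     // forward pass: hidden activation
        double y = w2 * z;                                   // forward pass: linear output
        double deltaOut = y - t;                             // output delta, g'(a_k) = 1 for a linear unit
        double deltaHidden = z * (1.0 - z) * w2 * deltaOut;  // g'(a_j) * w_kj * delta_k
        return new double[] {deltaHidden * x, deltaHidden, deltaOut * z};
    }

    // Central finite difference of E with respect to w1, used only as a reference value.
    static double numericDw1(double w1, double b1, double w2, double x, double t) {
        double h = 1e-6;
        return (error(w1 + h, b1, w2, x, t) - error(w1 - h, b1, w2, x, t)) / (2.0 * h);
    }

    public static void main(String[] args) {
        double[] grad = backprop(0.5, -0.2, 1.5, 0.8, 0.3);
        System.out.println("back-propagated dE/dw1 = " + grad[0]);
        System.out.println("finite-difference dE/dw1 = " + numericDw1(0.5, -0.2, 1.5, 0.8, 0.3));
    }
}
```

The two printed values should agree to several decimal places; a large discrepancy in a real implementation usually points to an error in the δ computation.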
Once the derivatives of the error function with respect to the weights are obtained, they can be used to update the weights so as to decrease the error. There are many varieties of gradient-based optimization algorithms based on these derivatives. One of the simplest is called gradient descent or steepest descent. With this algorithm, the weights are updated in the direction in which the error E decreases most rapidly, i.e., along the negative gradient. The weight updating process begins with an initial guess for the weights (which might be chosen randomly), and then generates a sequence of weights using the following formula,
$$w_{ji}(t+1) = w_{ji}(t) + \Delta w_{ji}(t), \qquad \Delta w_{ji}(t) = -\eta \frac{\partial E}{\partial w_{ji}} \qquad (24)$$
where η is a small positive number called the learning rate, which is the step size to be taken for the next step.
Gradient descent gives only the direction in which to move; the step size or learning rate needs to be decided as well. Too low a learning rate makes the network learn very slowly, while too high a learning rate leads to oscillation. One way to avoid oscillation for large η is to make the weight change dependent on the past weight change by adding a momentum term,
$$\Delta w_{ji}(t) = -\eta \frac{\partial E}{\partial w_{ji}} + \alpha \, \Delta w_{ji}(t-1) \qquad (25)$$
That is, the weight change is a combination of a step down the negative gradient plus a fraction α of the previous weight change, where 0 ≤ α < 1 and typically 0 ≤ α < 0.9 [6].
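The effect of the momentum term can be seen on a toy problem. The following sketch applies the update rule with and without momentum to the one-dimensional error E(w) = 0.5 w², whose gradient is simply w. The error surface, the starting point w = 1, and the class and method names are assumptions made purely for illustration.

```java
// Illustrative sketch of gradient descent with a momentum term; names are assumptions.
final class MomentumSketch {
    // Gradient of the illustrative error E(w) = 0.5 * w^2.
    static double gradient(double w) {
        return w;
    }

    // Applies `steps` updates of dw(t) = -eta * dE/dw + alpha * dw(t-1), starting at w = 1.
    static double descend(double eta, double alpha, int steps) {
        double w = 1.0;
        double previousChange = 0.0;
        for (int t = 0; t < steps; t++) {
            double change = -eta * gradient(w) + alpha * previousChange;
            w += change;
            previousChange = change;
        }
        return w;
    }

    public static void main(String[] args) {
        // Distance from the minimum (w = 0) after 50 steps with a small learning rate.
        System.out.println("no momentum:   |w| = " + Math.abs(descend(0.05, 0.0, 50)));
        System.out.println("with momentum: |w| = " + Math.abs(descend(0.05, 0.8, 50)));
        // A learning rate that is too large overshoots the minimum on every step.
        System.out.println("eta too large: |w| = " + Math.abs(descend(2.5, 0.0, 10)));
    }
}
```

On this surface the run with momentum ends closer to the minimum than the plain run for the same small learning rate, while the run with the oversized learning rate moves away from the minimum, alternating in sign.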
The roles of the learning rate and the momentum term are shown in Figure 10 [3]. When no momentum term is used, it typically takes a long time before the minimum is reached with a low learning rate (a), whereas for large learning rates the minimum may never be reached because of oscillation (b). When a momentum term is added, the minimum is reached faster (c).
There are two basic weight-update variations: batch learning and incremental learning. With batch learning, the weights are updated once after all the training data have been processed. It repeats the following loop: a) Process all the training data; b) Update the weights. Each such loop through the training set is called an epoch. With incremental learning, the weights are updated for each sample separately. It repeats the following loop: a) Process one sample from the training data; b) Update the weights.
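The difference between the two loops can be sketched for the simplest possible model, a single weight w with output y = w · x and the sum-of-squares error. The model, the class name, and the method names are assumptions for illustration only.

```java
// Illustrative sketch of batch versus incremental updates; names are assumptions.
final class UpdateModeSketch {
    // One epoch of batch learning: accumulate the gradient over all samples, then update once.
    static double batchEpoch(double w, double[] xs, double[] ts, double eta) {
        double gradient = 0.0;
        for (int p = 0; p < xs.length; p++) {
            gradient += (w * xs[p] - ts[p]) * xs[p];
        }
        return w - eta * gradient;
    }

    // One epoch of incremental learning: update the weight after every sample.
    static double incrementalEpoch(double w, double[] xs, double[] ts, double eta) {
        for (int p = 0; p < xs.length; p++) {
            w -= eta * (w * xs[p] - ts[p]) * xs[p];
        }
        return w;
    }

    public static void main(String[] args) {
        double[] xs = {1.0, 2.0};
        double[] ts = {2.0, 4.0}; // generated by the target weight w = 2
        System.out.println("batch:       " + batchEpoch(0.0, xs, ts, 0.1));
        System.out.println("incremental: " + incrementalEpoch(0.0, xs, ts, 0.1));
    }
}
```

After one pass over the same data the two variants generally hold different weights, because the incremental loop evaluates each later sample with an already updated weight.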
Table 1 provides the input data and Table 2 the data processing table. Given these input data, the model can be set up to reflect up to six effects: temperature, wind velocity, hour-of-day, weekday, weekend, and month-of-year.
The Dataset factory class also acts as a preliminary data filter to eliminate any outliers or bad data that are present. All inputs to the model are linearly scaled between 0 and 1, using the minimum and maximum values of the corresponding input vector.
The project is implemented with a Java front end and an Oracle server back end; the user interface was designed with Swing.
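The linear scaling step can be sketched as follows. This is not the Dataset factory class itself, whose code is not shown in the paper; the class name, the method name, and the handling of a constant input (mapped to 0 to avoid division by zero) are assumptions.

```java
// Illustrative min-max scaling sketch; not the paper's Dataset factory code.
final class MinMaxScalingSketch {
    // Linearly maps the values onto [0, 1] using their own minimum and maximum.
    static double[] scale(double[] values) {
        double min = values[0];
        double max = values[0];
        for (double v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double range = max - min;
        double[] scaled = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            // A constant input carries no information; map it to 0 (assumed convention).
            scaled[i] = (range == 0.0) ? 0.0 : (values[i] - min) / range;
        }
        return scaled;
    }

    public static void main(String[] args) {
        double[] temperatures = {10.0, 20.0, 30.0};
        for (double s : scale(temperatures)) {
            System.out.println(s);
        }
    }
}
```

When forecasting, the minimum and maximum found on the training data would normally be reused, so a future value outside that range scales to a number outside [0, 1].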
The most difficult part of building a good model is to choose and collect the training and testing input data. A number of research papers [7] [8] [9] [10] [11] [12] [13] show that the following factors influence the demand of the load:
Weather effects include temperature, wind velocity, cloud cover, dew point, rainfall, and snowfall. It has been widely observed that, in most cases, there is a strong correlation between weather (especially temperature and wind velocity) and load demand. In most situations, as temperature goes down, demand for gas goes up, and vice versa. However, this relation is highly non-linear. Other weather effects influence the load to a lesser extent.
Calendar effects include hour-of-day, day-of-week, month-of-year, weekend, and holiday effects. Most gas load patterns show a very consistent dependence on the calendar. For example, assuming all other factors remain constant, the demand for energy at 1:00 AM, when most people are sleeping, is expected to differ from that at 6:00 AM, when most people are getting up. Similar observations hold for the day-of-week. Although this cannot be generalized, the middle days of the week (Tuesdays, Wednesdays, Thursdays and Fridays) behave differently from the remaining days. The month-of-year captures the seasonal effect. Holidays are again special days; they tend to produce behavior that is more like that of a weekend day.
Economic effects include the market gas price, the price differential between gas and oil, and the price differential relative to competitors' prices. In many situations, the effect of economic factors on gas demand is non-trivial. A direct influence of economic factors on gas demand can be observed in some instances, such as when the customer has storage fields for injection or withdrawal. If the market gas price is low, there can be a high demand for gas even if the temperature is high, because the customer is injecting gas into the storage field. Similarly, if the gas price is high, customers may use the gas in storage instead of buying new gas from pipeline companies even if the temperature is low, thus decreasing the load demand. The price differential between gas and oil plays an important role in the demand when the customer is a dual-fuel power plant. Here, depending on the price differential between gas and oil, the customer can increase or reduce gas consumption.
The above effects are those that can be quantified and hence are possible candidates for inputs to neural network training and forecasting. There are other factors, such as contractual obligations, that definitely influence gas demand but are too difficult to quantify and therefore cannot be included as influencing variables. In addition, a number of other factors, such as maintenance or accidents on competitors' lines, influence the gas load demand but can at best be explained qualitatively.
The input data used in this model came from one of the authors' clients, a pipeline company. The company stored all of its historical data in different tables in an Oracle database, with weather information in one table and hourly load history in another. The authors retrieved these data and stored them as text.
The network consists of one input layer, one output layer and one hidden layer. There is only one output unit, the load. The number of input units is also fixed, depending on how many factors are included in the model and how the factors are encoded. The number of hidden units needs to be decided by training with some test sets. Figure 11 shows the architecture of the load forecast model including all six effects mentioned before.
The network requires enough hidden units to learn the general features of the relationship. Too many hidden units cause overfitting, while too few lead to underfitting. The goal is to use as few units in the hidden layer as possible, while still retaining the network's ability to learn the relationships among the data. As mentioned earlier, including more than a single middle layer does not significantly improve the accuracy of the predictions.
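The paper does not spell out how each factor is encoded, so the following is only a hypothetical illustration of how the encoding determines the number of input units. It assumes a one-of-N encoding for the day of the week (seven binary units) next to a single temperature value already scaled to [0, 1]; the class name, method names, and the day numbering (0 = Monday) are all assumptions.

```java
// Hypothetical input-encoding sketch; the paper's actual encoding is not given.
final class InputEncodingSketch {
    // One-of-N encoding: a 7-element vector with a single 1 at the given day (0 = Monday).
    static double[] encodeDayOfWeek(int dayIndex) {
        double[] units = new double[7];
        units[dayIndex] = 1.0;
        return units;
    }

    // Concatenates a scaled temperature (assumed to lie in [0, 1]) with the day encoding.
    static double[] buildInputs(double scaledTemperature, int dayIndex) {
        double[] day = encodeDayOfWeek(dayIndex);
        double[] inputs = new double[1 + day.length];
        inputs[0] = scaledTemperature;
        System.arraycopy(day, 0, inputs, 1, day.length);
        return inputs;
    }

    public static void main(String[] args) {
        double[] inputs = buildInputs(0.4, 2); // a Wednesday under the assumed numbering
        System.out.println("number of input units: " + inputs.length);
    }
}
```

Under this assumed scheme, two factors already require eight input units; each further effect adds one unit if it is a scaled number, or one unit per category if it is encoded one-of-N.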
The activation functions of the hidden units are sigmoid functions, while the output activation function can be either a sigmoid function or a linear function, which can be selected by the users.
The network is trained using the back-propagation algorithm. The standard sum-of-squares error function is used.
Here is the Java code for the error function, which is one of the methods within the Neural Network class:
public double errorFunction(double[] x, double[] y) {
    // Sum-of-squares error between two vectors of equal length
    double sum = 0.0;
    for (int i = 0; i < x.length; i++) {
        sum += (x[i] - y[i]) * (x[i] - y[i]);
    }
    return 0.5 * sum;
}
As mentioned above, the activation function for the hidden units is the sigmoid function:
$$g(x) = \frac{1}{1 + e^{-x}}$$
This function has a very useful feature: its derivative can be expressed in terms of the function value itself,
$$g'(x) = g(x)\,(1 - g(x))$$
The above two equations can be easily coded:
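public double sigmoid(double x) {
    // Return the limiting values directly for extreme arguments
    if (x > 50.) return 1.0;
    if (x < -50.) return 0.0;
    return 1.0 / (1.0 + Math.exp(-x));
}
// Note: the argument x is the sigmoid output g(a), not the net input a
public double sigmoidDerivative(double x) {
    return x * (1.0 - x);
}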
The first step of back-propagation is forward propagation:
void feedForward() {
    // For hidden units
    for (int i = 0; i < numberOfHiddenUnit; i++) {
        double sum = 0.0;
        for (int j = 0; j < numberOfInputUnit + 1; j++) {
            if (j == numberOfInputUnit)
                sum += weightLayer1[j][i]; // Include the bias term
            else
                sum += weightLayer1[j][i] * inputs[j];
        }
        hiddens[i] = sigmoid(sum);
    }
    // For output units (no bias term is used in the second layer)
    for (int i = 0; i < numberOfOutputUnit; i++) {
        double sum = 0.0;
        for (int j = 0; j < numberOfHiddenUnit; j++) {
            sum += weightLayer2[j][i] * hiddens[j];
        }
        // Sigmoid output when outputActivationType == 1, linear otherwise,
        // matching the test used in the back-propagation step
        outputs[i] = (outputActivationType == 1) ? sigmoid(sum) : sum;
    }
}
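The second step is error back-propagation. Using the expressions derived in equations (20) and (23) with the sum-of-squares error, the following results are obtained. For the output units, the δ's are given by,
$$\delta_k = g_2'(a_k)\,(y_k - t_k)$$
where $g_2'(a_k) = 1$ for a linear output unit and $g_2'(a_k) = y_k (1 - y_k)$ for a sigmoid output unit, while for units in the hidden layer, the δ's are found using,
$$\delta_j = z_j (1 - z_j) \sum_k w^{(2)}_{kj} \delta_k$$
Derivatives with respect to the first-layer and second-layer weights are then given by,
$$\frac{\partial E_p}{\partial w^{(1)}_{ji}} = \delta_j x_i, \qquad \frac{\partial E_p}{\partial w^{(2)}_{kj}} = \delta_k z_j$$
The gradient descent algorithm with the momentum term of equation (25) is used to update the weights: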
void backPropagation(double rate, double alpha) {
    double[] delta1 = new double[numberOfHiddenUnit];
    double[] delta2 = new double[numberOfOutputUnit];
    // Delta for second layer (target minus output, so weight changes are added)
    for (int j = 0; j < numberOfOutputUnit; j++) {
        delta2[j] = targets[j] - outputs[j];
    }
    // Delta for first layer
    for (int j = 0; j < numberOfHiddenUnit; j++) {
        double sum = 0.0;
        for (int k = 0; k < numberOfOutputUnit; k++) {
            double term = delta2[k] * weightLayer2[j][k];
            if (outputActivationType == 1) term *= sigmoidDerivative(outputs[k]);
            sum += term;
        }
        delta1[j] = sum;
    }
    // Update the second layer weights
    for (int i = 0; i < numberOfHiddenUnit; i++) {
        for (int j = 0; j < numberOfOutputUnit; j++) {
            double delta = delta2[j] * hiddens[i];
            if (outputActivationType == 1) delta *= sigmoidDerivative(outputs[j]);
            double weightChange = rate * delta + alpha * momentum2[i][j];
            weightLayer2[i][j] += weightChange;
            momentum2[i][j] = weightChange;
        }
    }
    // Update the first layer weights
    for (int i = 0; i < numberOfInputUnit + 1; i++) {
        for (int j = 0; j < numberOfHiddenUnit; j++) {
            if (i != numberOfInputUnit && inputs[i] == 0) {
                // Zero input: the gradient is zero, so skip the update and reset the momentum
                momentum1[i][j] = 0.;
            } else {
                double delta = delta1[j] * sigmoidDerivative(hiddens[j]);
                if (i != numberOfInputUnit) delta *= inputs[i]; // the bias weight has input 1
                double weightChange = rate * delta + alpha * momentum1[i][j];
                weightLayer1[i][j] += weightChange;
                momentum1[i][j] = weightChange;
            }
        }
    }
}
The batch learning method was adopted to train the networks.
The split-sample (or hold-out validation) method [5] is used to estimate the generalization error. With this method, part of the data is reserved as a test set that is not used in training. After training, the network is run on the test set. The error on the test set provides an unbiased estimate of the generalization error, from which it can be decided whether the model is sufficiently general.
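A hold-out split can be sketched as below. The paper does not state how the test set was chosen, so the random shuffle, the fixed seed, the test fraction, and all names here are assumptions; for load data, a chronological split (the most recent samples held out) would be an equally plausible choice.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Illustrative hold-out split sketch; the selection rule is an assumption.
final class HoldOutSplitSketch {
    // Returns the sample indices 0..sampleCount-1 in a reproducible random order.
    static List<Integer> shuffledIndices(int sampleCount, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < sampleCount; i++) {
            indices.add(i);
        }
        Collections.shuffle(indices, new Random(seed));
        return indices;
    }

    // Number of samples left for training after reserving the given fraction for testing.
    static int trainCount(int sampleCount, double testFraction) {
        return sampleCount - (int) Math.round(sampleCount * testFraction);
    }

    public static void main(String[] args) {
        int sampleCount = 10;
        List<Integer> order = shuffledIndices(sampleCount, 42L);
        int cut = trainCount(sampleCount, 0.2);
        System.out.println("training indices: " + order.subList(0, cut));
        System.out.println("test indices:     " + order.subList(cut, sampleCount));
    }
}
```

The test indices must never be presented to the training loop; otherwise the error measured on them no longer estimates the generalization error.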
The Tell Future load forecast system has several useful features:
Figure 12 shows the home page of the load forecast system.
Figure 13 shows the effects setup, with Temperature, Wind Velocity, Hour-of-day, Weekday, Weekend, and Month-of-year. Here all the effects are selected.
Figure 14 shows the network setup, where the user enters the number of hidden units, the learning rate, alpha, and the number of epochs, selects the linear output option, and clicks OK.
Figure 15 shows the training result as a graphic display for the given input. Here the Tabular View button is clicked.
Figure 16 displays the training results in tabular form for the given input.
Figure 17 displays the load forecast graphic for the input given by the user. Here the Tabular View button is clicked in order to display the results as a table.
Figure 18 displays the graph of average squared error vs. number of hidden units.
Table 3 shows the number of epochs that the training process took to meet the error tolerance (average squared error of 0.0005) or to reach the epoch limit (9999) for different values of learning rate and momentum, with 10 tests for each pair. Learning rates that are too large or too small converge slowly, while high momentum helps a small learning rate converge faster. The best learning rate and momentum term for this model are 0.8 and 0.1, respectively. There is no big difference between using a sigmoid activation function and a linear activation function for the output unit.
Neural networks can learn to approximate any function and behave like associative memories using only example data that are representative of the desired task. They are model-free estimators and are capable of solving complex problems based on the presentation of a large number of training data. This gives them a key advantage over traditional approaches to function estimation, such as statistical methods. Neural networks estimate a function without a mathematical description of how the outputs functionally depend on the inputs; they represent a good approach that is potentially robust and fault tolerant.
In this paper, the authors have examined the properties of feed-forward neural networks and the process of determining appropriate network inputs and architecture, and have built a short-term gas load forecasting system, the Tell Future system. This system performs very well for short-term gas load forecasting. The forecast accuracy has been in excess of 90%.
To forecast the future load from the trained networks, historical loads, temperature, wind velocity, and calendar information are needed in addition to the predicted future temperature and wind velocity. Compared with other regression methods, neural networks allow more flexible relationships between temperature, wind, calendar information, and load pattern. It has also been shown by other researchers that multi-layer feed-forward neural networks perform best for short-term load forecasting [7], [14].
The authors have utilized only temperature, wind, and calendar information, since this was the only information available to them. Use of additional variables such as cloud coverage and economic information should yield better results [7]. Since neural networks simply interpolate between the training data, they give high errors on test data that are not close enough to any training data.
Feed-forward neural networks can be used for many kinds of forecasting in different industrial areas. Similar models can be built for electric load forecasting, daily water consumption forecasting, stock and market forecasting, traffic flow forecasting, and product sales forecasting [15], [16], as long as correct relationships between the inputs and the outputs can be captured and put into the models. However, there is no universal network paradigm suitable for all kinds of forecasting problems. For each problem, a detailed analysis of the domain data and the acquisition of prior knowledge are necessary to find a suitable model. The introduction of prior knowledge in input selection, input encoding, or architecture determination is often very useful, especially when the available domain data are limited.
The standard Back-propagation algorithm for training feed-forward neural networks has proven robust even for difficult problems. However, its high performance is attained at the expense of a long training time to adjust the network parameters, which can be discouraging in many real-world applications. Even on relatively simple problems, it often requires a lengthy training process in which the complete set of training examples is processed hundreds or thousands of times. Thus, accelerating techniques or advanced training algorithms can be applied to improve the performance of the networks.