
Prediction, forecasting and neural networks


Introduction

For a customer case we developed a fully connected RNN model with 4 layers using the pyrenn package; it performed much better than SARIMA and a bit better than the LightGBM gradient booster. Unfortunately, pyrenn is GPL'd, and we need an implementation with a less restrictive license.

Using RNNs for modeling time series data is supported by a wealth of papers dealing with time series regression, forecasting and classification.

Basically, an RNN layer consists of a set of cells where the hidden state of one cell is fed to the next cell along with the input data. Each cell implements the following two formulas:

    # compute the next hidden state from the previous state a_prev and the current input xt
    a = activation_function(np.dot(Waa, a_prev) + np.dot(Wax, xt) + ba)
    # compute the output of the current cell from the new hidden state
    yt_pred = g(np.dot(Wya, a) + by)

with the three weight matrices Waa, Wax and Wya, the bias vectors ba and by, a_prev the hidden state from the previous cell, xt the current input value, a the current cell's hidden state and finally yt_pred the RNN cell's output. As far as I can tell from the Keras GitHub repository, the Keras SimpleRNN implementation simplifies the last line to yt_pred = a.
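
To make the recurrence explicit, here is a minimal NumPy sketch that unrolls such a cell over a whole sequence; rnn_forward and the identity output function g are just names chosen for this illustration:

    import numpy as np

    def rnn_forward(x, a0, Waa, Wax, Wya, ba, by,
                    activation=np.tanh, g=lambda z: z):
        # x has shape (T, n_x): one input vector per time step
        a = a0
        outputs = []
        for t in range(x.shape[0]):
            # the previous hidden state is combined with the current input
            a = activation(np.dot(Waa, a) + np.dot(Wax, x[t]) + ba)
            # per-step output; Keras' SimpleRNN would simply return a here
            outputs.append(g(np.dot(Wya, a) + by))
        return np.array(outputs), a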

pyrenn by default employs real-time recurrent learning (RTRL): all training data is passed to the recurrent neural net in a single pass, then the best combination of weight matrices and bias vectors is computed by optimizing against the L2-norm of the difference between the actual values and the network output. See here for some of the math. Optimization in pyrenn is based on second-order methods related to the classical Newton method; the default is Levenberg-Marquardt, which turned out to be quite efficient with fast training times. Obviously this works for small networks only.

In short, for a least squares problem like

$$\min_w \sum_k f_k(w)^2$$

with $f_k$ depending on the weights and biases of the network.

With $J_k$ denoting the Jacobian of the $f_k$, Levenberg-Marquardt searches in the direction given by the solution $p$ of

$$\left(J_k^\top J_k + \lambda_k I\right) p = -J_k^\top f_k$$

$\lambda_k$ is an adjustable parameter for the step size (comparable to the learning rate in ADAM), $I$ the identity matrix. The Fletcher approach substitutes $I$ with

$$\operatorname{diag}\left(J_k^\top J_k\right)$$
For a deeper dive into that topic consult "The Levenberg-Marquardt algorithm for nonlinear least squares curve-fitting problems". This article also deals with the numerical implementation of this optimization method.
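
To make the update concrete, here is a minimal NumPy sketch of one such search direction; the function name lm_direction and the fletcher switch are just for this illustration:

    import numpy as np

    def lm_direction(J, f, lam, fletcher=False):
        # damped normal equations: (J^T J + lam * D) p = -J^T f
        JTJ = J.T @ J
        # Fletcher's variant replaces the identity with diag(J^T J)
        D = np.diag(np.diag(JTJ)) if fletcher else np.eye(JTJ.shape[0])
        return np.linalg.solve(JTJ + lam * D, -J.T @ f)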

The well-known open source deep learning frameworks do not support RTRL out of the box*, but truncated backpropagation through time (TBPTT) should do as well. In effect you train to predict future values from snippets of a given length determined by the model's batch shape; in the bike sharing example it is (None, 10, 4). So the model only uses the last 10 values for a prediction and ignores the rest, hence backpropagation through time is truncated. See d2l.ai on RNN and BPTT and also the blog entry RNN behind the scene for more on that topic.
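
As a rough sketch of how such snippets can be cut from a time series (plain NumPy; make_windows and the choice of target column are assumptions of this illustration):

    import numpy as np

    def make_windows(series, window=10):
        # series has shape (T, n_features); each sample holds the last `window` steps
        X = np.stack([series[t:t + window] for t in range(len(series) - window)])
        # here the target is the first feature one step after each window (an assumption)
        y = series[window:, 0]
        return X, y  # X has shape (samples, window, n_features), e.g. matching (None, 10, 4)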

Hands-On

Since second-order optimization worked well in our customer case, I looked around for examples on top of one of the deep learning frameworks and found Fabio's example code with a permissive license.

Code like the following example is definitely easier to read and understand than having to compute Jacobians layer by layer with hard-coded activation functions like np.dot(np.dot(S[q,u,l],LW[l,m,0]),np.diag(1-(np.tanh(n[q,m]))**2)):

        # Jacobian of the residual vector with respect to every trainable variable
        jacobians = tape.jacobian(
            residuals,
            self.model.trainable_variables)

See Fabio's repository for more background and an example notebook where he shows that this second-order optimizer beats the first-order Adam optimizer on curve fitting a damped sine curve.
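
This is not Fabio's actual implementation, but a rough sketch of how such a Jacobian could drive a full Levenberg-Marquardt update on a Keras model; lm_step, lam and the eager-mode flattening are assumptions of this illustration:

    import tensorflow as tf

    def lm_step(model, x, y, lam=1e-2):
        with tf.GradientTape() as tape:
            # residuals between network output and targets, flattened to one vector
            residuals = tf.reshape(model(x, training=True) - y, [-1])
        # one Jacobian block per trainable variable, as returned by tape.jacobian
        jacobians = tape.jacobian(residuals, model.trainable_variables)
        n_res = residuals.shape[0]
        # flatten every block to (n_res, n_params_i) and concatenate to the full Jacobian
        J = tf.concat([tf.reshape(j, (n_res, -1)) for j in jacobians], axis=1)
        # damped normal equations (J^T J + lam * I) p = -J^T r
        JTJ = tf.matmul(J, J, transpose_a=True)
        rhs = -tf.linalg.matvec(J, residuals, transpose_a=True)
        p = tf.linalg.solve(JTJ + lam * tf.eye(JTJ.shape[0]), rhs[:, tf.newaxis])[:, 0]
        # distribute the flat update vector back onto the individual weight tensors
        offset = 0
        for v in model.trainable_variables:
            size = int(tf.size(v))
            v.assign_add(tf.reshape(p[offset:offset + size], v.shape))
            offset += size
        return tf.reduce_sum(residuals ** 2)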

So I wanted to apply this approach to a typical prediction case. I came across Haydar Özler and Tankut Tekeli's example, which deals with a well-known prediction problem often used as a data science exercise.

Their notebook has been made available on GitHub in the repository bike sharing prediction with RNN. I started from that because they did all the heavy lifting of exploring and preparing the data, from cutting out an outlier through scaling to one-hot encoding of day-of-week etc.
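
Just to illustrate the kind of preprocessing involved (this is not their notebook code, and the column names are made up):

    import pandas as pd

    def prepare(df):
        # one-hot encode calendar features such as day-of-week
        df = pd.get_dummies(df, columns=["weekday", "season"])
        # min-max scale the continuous columns to [0, 1]
        for col in ["temp", "humidity", "windspeed", "count"]:
            df[col] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())
        return df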

Second-order optimization tends to save training time; this is quite apparent in Fabio's damped sine curve example. However, his second example is a tie between Adam and Levenberg-Marquardt, and it appears to be the same for the bike sharing example: here second-order training is considerably faster, at the expense of some accuracy.

See here for the slightly modified bike sharing demand prediction example.

Footnotes

  • It can be done, though.

See Approximating Real-Time Recurrent Learning with Random Kronecker Factors with code in this github repo.

References