
Prediction, forecasting and neural networks


Introduction

For a customer case we developed a fully connected RNN model with 4 layers using the pyrenn package; it performed much better than SARIMA and a bit better than the LightGBM gradient booster. Unfortunately, pyrenn is GPL'd, and we need an implementation with a less restrictive license.

Using RNNs for modeling time series data is supported by a wealth of papers dealing with time series regression, forecasting and classification.

Basically, an RNN layer consists of a set of cells where the hidden state of one cell is fed to the next cell along with the input data. Each cell implements the following two formulas:

    # compute the next hidden state from the previous state a_prev and the current input xt
    a = activation_function(np.dot(Waa, a_prev) + np.dot(Wax, xt) + ba)
    # compute the output of the current cell from the new hidden state
    yt_pred = g(np.dot(Wya, a) + by)

with the three weight matrices Waa, Wax and Wya, the bias vectors ba and by, a_prev the hidden state from the previous cell, xt the current input value, a the current cell's hidden state and finally yt_pred the RNN cell's output. As far as I can tell from the Keras GitHub repository, the Keras SimpleRNN implementation simplifies the last line to yt_pred = a.
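
To make the recurrence explicit, here is a minimal NumPy sketch that unrolls such a cell over a whole sequence; rnn_forward and the identity output function g are just names chosen for this illustration:

    import numpy as np

    def rnn_forward(x, a0, Waa, Wax, Wya, ba, by,
                    activation=np.tanh, g=lambda z: z):
        # x has shape (T, n_x): one input vector per time step
        a = a0
        outputs = []
        for t in range(x.shape[0]):
            # the previous hidden state is combined with the current input
            a = activation(np.dot(Waa, a) + np.dot(Wax, x[t]) + ba)
            # per-step output; Keras' SimpleRNN would simply return a here
            outputs.append(g(np.dot(Wya, a) + by))
        return np.array(outputs), a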

pyrenn by default employs real-time recurrent learning (RTRL): all training data is passed to the recurrent neural net in a single pass, then the best combination of weight matrices and bias vectors is computed by optimizing against the L2-norm of the difference between the actual values and the network output. See here for some of the math. Optimization in pyrenn is based on second-order methods related to the classical Newton method; the default is Levenberg-Marquardt, which turned out to be quite efficient with fast training times. Obviously this works for small networks only.

In short, for a least squares problem like

$$\min_w \sum_k f_k(w)^2$$

with $f_k$ depending on the weights and biases of the network.

With $J_k$ denoting the Jacobian of the $f_k$, Levenberg-Marquardt searches in the direction given by the solution $p$ of

$$\left(J_k^\top J_k + \lambda_k I\right) p = -J_k^\top f_k$$

$\lambda_k$ is an adjustable parameter for the step size (comparable to the learning rate in ADAM), $I$ the identity matrix. The Fletcher approach substitutes $I$ with

$$\operatorname{diag}\left(J_k^\top J_k\right)$$
For a deeper dive into that topic consult "The Levenberg-Marquardt algorithm for nonlinear least squares curve-fitting problems". This article also deals with the numerical implementation of this optimization method.
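
To make the update concrete, here is a minimal NumPy sketch of one such search direction; the function name lm_direction and the fletcher switch are just for this illustration:

    import numpy as np

    def lm_direction(J, f, lam, fletcher=False):
        # damped normal equations: (J^T J + lam * D) p = -J^T f
        JTJ = J.T @ J
        # Fletcher's variant replaces the identity with diag(J^T J)
        D = np.diag(np.diag(JTJ)) if fletcher else np.eye(JTJ.shape[0])
        return np.linalg.solve(JTJ + lam * D, -J.T @ f)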

The well-known open source deep learning frameworks do not support RTRL out of the box*, but truncated backpropagation through time (TBPTT) should do as well. In effect you train to predict future values from snippets of a given length determined by the model's batch shape; in the bike sharing example it is (None, 10, 4). So the model only uses the last 10 values for a prediction and ignores the rest, hence backpropagation through time is truncated. See d2l.ai on RNN and BPTT and also the blog entry RNN behind the scene for more on that topic.
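
As a rough sketch of how such snippets can be cut from a time series (plain NumPy; make_windows and the choice of target column are assumptions of this illustration):

    import numpy as np

    def make_windows(series, window=10):
        # series has shape (T, n_features); each sample holds the last `window` steps
        X = np.stack([series[t:t + window] for t in range(len(series) - window)])
        # here the target is the first feature one step after each window (an assumption)
        y = series[window:, 0]
        return X, y  # X has shape (samples, window, n_features), e.g. matching (None, 10, 4)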

Hands-On

Since second-order optimization worked well in our customer case, I looked around for examples on top of one of the deep learning frameworks and found Fabio's example code with a permissive license.

Code like the following example is definitely easier to read and understand than having to compute Jacobians layer by layer with hard-coded activation functions like np.dot(np.dot(S[q,u,l],LW[l,m,0]),np.diag(1-(np.tanh(n[q,m]))**2)):

        # Jacobian of the residual vector with respect to every trainable variable
        jacobians = tape.jacobian(
            residuals,
            self.model.trainable_variables)

See Fabio's repository for more background and an example notebook where he shows that this second-order optimizer beats the first-order Adam optimizer on curve fitting a damped sine curve.
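
This is not Fabio's actual implementation, but a rough sketch of how such a Jacobian could drive a full Levenberg-Marquardt update on a Keras model; lm_step, lam and the eager-mode flattening are assumptions of this illustration:

    import tensorflow as tf

    def lm_step(model, x, y, lam=1e-2):
        with tf.GradientTape() as tape:
            # residuals between network output and targets, flattened to one vector
            residuals = tf.reshape(model(x, training=True) - y, [-1])
        # one Jacobian block per trainable variable, as returned by tape.jacobian
        jacobians = tape.jacobian(residuals, model.trainable_variables)
        n_res = residuals.shape[0]
        # flatten every block to (n_res, n_params_i) and concatenate to the full Jacobian
        J = tf.concat([tf.reshape(j, (n_res, -1)) for j in jacobians], axis=1)
        # damped normal equations (J^T J + lam * I) p = -J^T r
        JTJ = tf.matmul(J, J, transpose_a=True)
        rhs = -tf.linalg.matvec(J, residuals, transpose_a=True)
        p = tf.linalg.solve(JTJ + lam * tf.eye(JTJ.shape[0]), rhs[:, tf.newaxis])[:, 0]
        # distribute the flat update vector back onto the individual weight tensors
        offset = 0
        for v in model.trainable_variables:
            size = int(tf.size(v))
            v.assign_add(tf.reshape(p[offset:offset + size], v.shape))
            offset += size
        return tf.reduce_sum(residuals ** 2)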

So I wanted to apply this approach to a typical prediction case. I came across Haydar Özler and Tankut Tekeli's example, which deals with a well-known prediction problem often used as a data science exercise.

Their notebook has been made available on GitHub in the repository bike sharing prediction with RNN. I started from that because they did all the heavy lifting of exploring and preparing the data, from cutting out an outlier through scaling to one-hot encoding of day-of-week etc.
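
Just to illustrate the kind of preprocessing involved (this is not their notebook code, and the column names are made up):

    import pandas as pd

    def prepare(df):
        # one-hot encode calendar features such as day-of-week
        df = pd.get_dummies(df, columns=["weekday", "season"])
        # min-max scale the continuous columns to [0, 1]
        for col in ["temp", "humidity", "windspeed", "count"]:
            df[col] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())
        return df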

Second-order optimization tends to save training time; this is quite apparent in Fabio's damped sine curve example. However, his second example is a tie between Adam and Levenberg-Marquardt, and it appears to be the same for the bike sharing example: here second-order training is considerably faster, at the expense of some accuracy.

See here for the slightly modified bike sharing demand prediction example.

Footnotes

  • It can be done, though.

See Approximating Real-Time Recurrent Learning with Random Kronecker Factors with code in this github repo.

References