from typing import Tuple, List, Callable
from functools import partial

import matplotlib
import matplotlib.pyplot as plt
import numpy as np
from scipy import sparse

import cvxpy as cp
import torch
import torch.nn.functional as torch_functional
from cvxpylayers.torch import CvxpyLayer

torch.set_default_dtype(torch.float64)

Learning a robust Kalman smoother for vehicle tracking¶
We will try to recover the state history (i.e., location and velocity trajectories) of a moving vehicle from noisy sensor data. To do this, we'll model the vehicle state as a discrete-time linear dynamical system and use standard and robust Kalman smoothers to estimate the state history of this system from the noisy sensor measurements. To construct our smoothers, we'll depart from typical Kalman methodologies in favor of an optimization-based approach. Considering these optimization problems as mutable objects, we then take a machine learning approach to evaluating and tuning the smoothers.
+Background¶
The primary aim of this notebook is demonstrating how to use cvxpylayers
to auto-tune parameters in a Kalman smoother. That said, it also contains a broad introduction to Kalman smoothing and a somewhat opinionated tutorial on mathematical-optimization-based Kalman smoothing. If you're already familiar with Kalman smoothers and/or just want to focus on the auto-tuning problem, that material starts in the Background/Learning section. The entire Implementation section is then focused on using cvxpylayers
to auto-tune vehicle tracking Kalman smoothers.
Motivating work¶
We also want to emphasize that this notebook is rather unoriginal. It is an extension of the robust Kalman filtering for vehicle tracking example found on the CVXPY website, and it is a watered-down, slight variation of the work in [BB20] and in [BV18 sec17.2]. All three of these sources are highly recommended for anyone wanting to learn more about Kalman smoothers and the Kalman smoother auto-tuning problem.
+System Model¶
We consider a linear, dynamical system (LDS) governed by
$$
\begin{equation}
x_{t+1} = Ax_t + Bw_t, \quad t = 1, \ldots, T-1,
\tag{1}
\end{equation}
$$
+and output or sensor measurements
$$
\begin{align}
y_t = Cx_t + v_t, \quad t=1, \ldots, T.
\tag{2}
\end{align}
$$
+Here $x_t \in \mathbf{R}^{n}$ is the state, $w_t \in \mathbf{R}^{m}$ is an input to the dynamical system (e.g., a drive force on the vehicle), $y_t \in \mathbf{R}^{p}$ is the output or sensor measurement, and $v_t \in \mathbf{R}^{p}$ is the sensor noise, at time $t$. The matrix $A \in \mathbf{R}^{n \times n}$ is the state dynamics matrix, $B ∈ \mathbf{R}^{n \times m}$ is the input matrix, and $C \in \mathbf{R}^{p \times n}$ is the observation (or output) matrix.
In state estimation, the goal is to guess the state sequence, $x_1, \ldots, x_T$, using the observations $y_1, \ldots, y_T$. In vehicle tracking, $x_t$ could be a four-vector (i.e., $x_t \in \mathbf{R}^{4}$), where $\left(x_t \right)_{1:2}$ is the position of the vehicle in two dimensions, $\left(x_t \right)_{3:4}$ is the vehicle velocity, and $y_t$ could be a five-vector containing measurements of latitude, longitude, heading, speed, and altitude of an iPhone mounted on the vehicle (this is an example in [BB20]).
+Traditionally, the matrices $A$, $B$, and $C$ are assumed fixed, while statistical assumptions are placed on $w_1, \ldots , w_{T-1}$ and $v_1, \ldots, v_{T}$. Importantly, whether the lack of knowledge is encoded statistically or not, we do not exactly know these noises. If $w_1, \ldots, w_{T-1}$ and $v_1, \ldots, v_T$ were known to us, we would be solving for $x_t$, not estimating $x_t$. (Sometimes, when arriving at an estimation, $x_t$ is replaced with $\hat{x}_t$ to clearly mark this distinction.)
+Smoothing¶
Overview¶
Smoothing is an offline (or batch) approach to approximately reconstructing the state sequence, i.e., the measurements $y_1, \ldots, y_T$ are collected, and then smoothing is performed to estimate $x_1, \ldots, x_T$. There are many approaches to constructing smoothers. The most widely used approach is an extension of Kalman filtering. (See [Mur23 sec8.2] for an overview of the classical Kalman filtering and smoothing methodologies based on the original work proposed in [Kal60]. But briefly, while smoothing is an offline algorithm used to reconstruct a state trajectory, filtering is an online algorithm used to estimate the current state of the LDS as new measurements arrive. With that being said, the first step in the classical approach to constructing a smoother is to construct a filter.)
+In this notebook, however, we consider constructing smoothers by (explicitly) formulating optimization problems. Therefore, we perform the act of smoothing, obtaining $\hat{x}_1, \ldots, \hat{x}_T$, by solving a smoother's optimization problem. A major advantage to this mathematical optimization approach is the ability to formulate smoothers that handle missing measurements. If a smoother can handle missing measurements, we can take a machine learning approach to evaluating it. That is, to assess how well a smoother generalizes to unseen data, we can purposefully withhold measurements when creating the smoother and then compare the smoother's predicted outputs with this unseen (test) data. Additionally, because smoothers are characterized (or flavored) by parameters (hyper-parameters, in machine learning dialect), comparing the prediction error between smoothers is one way of determining whether one smoother is "better" than another.
+There are also a couple, less technically deep, advantages to this mathematical optimization approach to smoothing. Firstly, smoothers formulated as optimization problems are incredibly easy to interpret. In fact, the idea of a smoother "object" only arises in an optimization context. In typical literature, Kalman filtering (recalling that smoothing is filtering with some additional adjustments) is just a clever method for computing estimates of the LDS state and state uncertainty recursively. Consequently, understanding what a Kalman filter does in a traditional sense requires understanding some ideas from probability and statistics which are not usually seen in first courses on the subjects. In comparison, even if we were to incorporate statistical assumptions into our optimization models, such as those considered in [BB20], a smoother packaged as an optimization problem can just be "read off" as a state estimator.
It is also far easier to extend and solve smoothers in a (convex) optimization approach than in a statistical approach. For instance, in the sequel, we'll construct a robust smoother by swapping the quadratic penalty terms in the non-robust smoother's objective function with Huber penalty terms. This replacement is obvious to anyone familiar with convex optimization (particularly the ideas in [BV04 ch6]), and because the resulting problem is convex, it is readily solved. However, making such a swap in the traditional statistical approach would be less trivial.
+Using an out-of-the-box Kalman filter or smoother is certainly not difficult. Because the update equations which define a filter/smoother are already derived, Kalman filtering/smoothing in practice is simply an exercise in sparse linear algebra. However, because the traditional filter is constructed as a mean squared error minimizer (or through maximum likelihood statistics), adjusting the filter to be more robust would require re-constructing the filter to be the minimizer of the expectation of some other error function. Considering the effort required to derive the typical Kalman filter, this approach is not too appealing. This work would only be further complicated if the typical Gaussian assumptions were replaced with fatter-tailed distributions (an idea proposed in [BB20]). However, if we were to add statistical assumptions to our approach, such a distribution swap would be trivial so long as it maintained problem convexity. We could even add probabilistic restrictions or desired attributes to the filter/smoother via problem constraints.
+All of this is to say that an optimization approach to smoothing is declarative. Put glibly, you hand some observed measurements to a smoother problem and say (through a .solve()
call) "find a state sequence that is consistent with these measurements and some a priori knowledge of the system dynamics." The solver will then return to you (assuming feasibility) the best possible state sequence matching the smoother's requirements, and it will do so reliably and efficiently. (The idea of optimization being a declarative manner of solving problems is explored here in the context of control.)
Before detailing our optimization approach, it is worth noting that the typical Kalman filter and smoother are still terrific approaches to state estimation. The traditional approach to Kalman filtering actually has a major advantage over this optimization approach: once a state estimate is computed, all observations can be thrown away. Therefore, it's less memory intensive than an optimization approach, and thus more desirable in an online setting (one might argue that an optimization approach is also too computationally expensive in an online setting, but this is a non-issue. See [01 p4]).
+Approach¶
We'll consider the main problem formulation ideas here. The conclusion of these ideas will be a basic smoother problem, but the optimization problems used to construct the smoothers that we'll consider in the vehicle tracking example will be stated later in this notebook.
+Looking again at (1) and (2), the components of our optimization problem arise quite naturally. Because we are trying to determine the state sequence $x_1, \ldots, x_T$, these are optimization variables; they need to be decided upon. However, these variables are subject to the requirement that they satisfy the dynamic system model (1), so
$$
x_{t+1} = Ax_t + Bw_t, \quad t = 1, \ldots, T-1,
$$
+must be constraints in the formulation.
A fundamental assumption in Kalman filtering and smoothing is that the input and output noises are not too large. Or, to rely on (basic) statistics just briefly, that the means of both the input noise and measurement noise are known. This assumption can be understood intuitively by thinking through what a Kalman smoother (or filter) is doing.
We are assuming that we have two (informal) sources of information about the state of some system of interest. The first is the mathematical model, (1), which dictates how the state should evolve in time. The second is sensor measurements of the state as it does evolve in time. However, both sources include some form of randomness. The purpose of the Kalman smoother (or filter) is to merge these two somewhat reliable state estimators into a far more reliable state estimator. Of course, if $w_t$ is not small (or rather, if $w_t$ less the mean input noise is not small), then the first source of information becomes unreliable. Likewise, if the sensor noise is not "small" (there's no structure to its randomness), then there's no reason to believe that the measurements $y_t$ provide any useful information about the state.
+This fundamental assumption leads us to a bi-criterion optimization problem where we wish to minimize both the sensor noise (or equivalently, the size of the discrepancy between sensor measurements and predicted outputs) and the input noise. Mathematically, a good (and pervasive across Kalman smoothing/filtering) characterization of the total sensor noise is the sum of squares of the norms of the measurement residuals,
$$
J_{\text{meas}} = \left\lVert v_1 \right\rVert_{2}^2 + \cdots + \left\lVert v_T \right\rVert_{2}^2 = \left\lVert Cx_1 - y_1 \right\rVert_{2}^2 + \cdots + \left\lVert Cx_T - y_T \right\rVert_{2}^2.
$$
+Likewise, we'll take the total process noise to be the sum of squares of the norms of the process noise,
$$
J_{\text{proc}} = \left\lVert w_1 \right\rVert_{2}^2 + \cdots + \left\lVert w_T \right\rVert_{2}^2.
$$
+Combining this bi-criterion objective with the aforementioned constraints leads us to the simple smoother formulation
$$
\begin{equation}
\begin{array}{lll}
\text{minimize} \; & \tau J_{\text{proc}} + J_{\text{meas}} & \\
\text{subject to} & x_{t+1} = Ax_t + Bw_t, \; & t=1, \ldots, T-1,
\end{array}
\tag{3}
\end{equation}
$$
where $x_t \in \mathbf{R}^{n}$ and $w_t \in \mathbf{R}^{m}, \, t=1, \ldots, T,$ are optimization variables and $\tau > 0$ is a tuning parameter in the problem which (roughly) determines whether we trust the measurements more or the dynamics more. (Throughout this notebook, we'll refer to parameters in a smoother's defining optimization problem as the "smoother's parameters." We'll also refer to a smoother as its defining optimization problem, e.g., "consider the smoother (3).")
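To make (3) concrete, the following is a minimal CVXPY sketch of the simple smoother for a fixed, hand-picked value of tau; it assumes the problem data A, B, C, the p x T measurement array y, and the dimensions n, m, T are already defined (the parameterized, layer-based versions used for auto-tuning appear later in this notebook).

# Minimal sketch of the simple smoother (3) for a fixed tau (assumes A, B, C, y, n, m, T exist).
tau = 0.1  # hand-picked value, for illustration only
x = cp.Variable((n, T + 1))  # state trajectory
w = cp.Variable((m, T))      # process noise trajectory

J_proc = cp.sum_squares(w)
J_meas = cp.sum_squares(C @ x[:, :T] - y)

constraints = [x[:, t + 1] == A @ x[:, t] + B @ w[:, t] for t in range(T)]
cp.Problem(cp.Minimize(tau * J_proc + J_meas), constraints).solve()
x_hat = x.value  # smoothed state estimate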
+Clearly, the choice of $\tau$ can drastically affect the estimated state that the smoother produces. The focus of this notebook is demonstrating how to use cvxpylayers
to auto-tune smoother parameters to construct "good" smoothers.
Missing Measurements¶
As the LDS evolves, all the information we receive about the system comes from the sequence $y_1, \ldots, y_T$. However, it may be the case that the full output sequence is not available to us. Or put another way, the output sequence could have missing measurements (e.g., over a $T=10$ horizon where $y_t \in \mathbf{R}^{2}$, perhaps at the second time step we are missing measurement one, and then at the ninth time step we are missing measurement two). To model this, we modify the output equation (2) so that $y_t \in (\mathbf{R} ∪ \left\{ ? \right\})^p$, where ? denotes a missing value. So (2) becomes
$$
(y_t)_i = (Cx_t + v_t)_i, \quad (t, i) \in \mathcal{K},
$$
+where $\mathcal{K} \subseteq \left\{ 1, \ldots, T \right\} \times \left\{ 1, \ldots, p \right\}$ is the set of (scalar) outputs that are available. For $(t, i) \not \in \mathcal{K}$, we take $(y_t)_i = \,?$. We refer to the real entries of $y_t$ as known measurements and the entries of $y_t$ that have the value $?$ as missing measurements.
+Importantly, for the remainder of this notebook, we assume that the full output sequence, $y_1, \ldots, y_T$, is collected. That is, $\mathcal{K} = \left\{ 1, \ldots, T \right\} \times \left\{ 1, \ldots, p \right\}$. Nonetheless, we'll need the machinery presented here when we separate the measurements into training and testing datasets (or cross-validation folds), masking measurements $\mathcal{M} \subset \mathcal{K}$ and constructing smoothers with $\mathcal{K} \setminus \mathcal{M}$.
+The smoothing problems¶
Initial formulations¶
Consider again the first proposed smoother, (3). Making the typical statistical assumptions about the noises $w_1, \ldots, w_T$ and $v_1, \ldots, v_T$ and considering the problem as $T → ∞$, it turns out that (3) characterizes the Kalman filter. Or rather, the optimal solution to (3) is the recursive, closed-form update equations that appear in Kalman filtering. This observation is useful because it gives insight into a potential failure of this smoother. Namely, it performs well when $w_t$ and $v_t$ are Gaussian, but large outliers in the measurements will significantly influence the smoother's quadratic objective, thus degrading the accuracy of the smoother's state recovery.
+To improve estimation in the presence of outliers, we propose a robust Kalman smoother, which is formulated by simply replacing
$$
J_{\text{meas}} = \left\lVert Cx_1 - y_1 \right\rVert_{2}^2 + \cdots + \left\lVert Cx_T - y_T \right\rVert_{2}^2
$$
+in (3) with
$$
J_{\text{meas}} = \phi_{\rho}\left(Cx_1 - y_1 \right) + \cdots + \phi_{\rho}\left(Cx_T - y_T \right),
$$
where $\phi_{\rho}: \mathbf{R}^{n} \to \mathbf{R}$, defined as
$$
\phi_{\rho}(u) = \begin{cases}
\left\lVert u \right\rVert_{2}^2 & \left\lVert u \right\rVert_{2} \le \rho \\
\rho(2 \left\lVert u \right\rVert_{2} - \rho) & \left\lVert u \right\rVert_{2} > \rho,
\end{cases}
$$
+is the Huber penalty function. This function is "robust" as it penalizes estimation errors linearly outside of a ball of radius $ρ$. Consequently, large measurement outliers will only (roughly) influence the objective function linearly. (See [BV04 p298-300] for intuition-building visualizations of the Huber function and the induced residual behavior.)
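As a quick sanity check on this definition, here is a small NumPy sketch of $\phi_{\rho}$ (for illustration only; the smoothers below encode the Huber penalty through the reformulation derived later):

def huber_penalty(u: np.ndarray, rho: float) -> float:
    """Huber penalty phi_rho applied to the vector u (illustrative only)."""
    norm_u = np.linalg.norm(u)
    if norm_u <= rho:
        return norm_u**2               # quadratic inside the ball of radius rho
    return rho * (2 * norm_u - rho)    # linear growth outside the ball

print(huber_penalty(np.array([0.1, 0.1]), rho=1.0))    # small residual: ~0.02
print(huber_penalty(np.array([10.0, 10.0]), rho=1.0))  # large outlier: ~27.3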
+Making this measurement error term replacement (and dropping the $J\text{s}$ for clarity), the robust smoother problem is
$$
\begin{equation}
\begin{array}{lll}
\text{minimize} \; & \sum_{t=1}^{T} \left(\tau \left\lVert w_t \right\rVert_{2}^2 + \phi_{\rho}(Cx_t - y_t) \right) & \\
\text{subject to} & x_{t+1} = Ax_t + Bw_t, \; & t = 1, \ldots, T-1,
\end{array}
\tag{4}
\end{equation}
$$
+where again $x_t$ and $w_t, \, t=1, \ldots, T$, are optimization variables, but now both $\tau >0$ and $\rho>0$ are tuning parameters. Specifically, $\tau$ still is a measure of whether we trust the dynamics or the observations more, while $\rho$ (roughly) determines the size of the ball we believe reasonable observations should lie in.
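For reference, (4) can also be written directly with CVXPY's built-in Huber atom, in the spirit of the robust Kalman filtering example on the CVXPY website; the sketch below assumes the same problem data as before and fixed (non-parameter) values of tau and rho. This form is fine for a one-off solve, but, as discussed below, it is not the parameterized, DPP-compliant form needed for cvxpylayers.

# Minimal sketch of the robust smoother (4) with fixed tau and rho (assumes A, B, C, y, n, m, T exist).
tau, rho = 0.1, 2.0  # hand-picked values, for illustration only
x = cp.Variable((n, T + 1))
w = cp.Variable((m, T))

J_proc = cp.sum_squares(w)
# Huber penalty on each measurement residual, phi_rho(C x_t - y_t)
J_meas = cp.sum([cp.huber(cp.norm(C @ x[:, t] - y[:, t]), rho) for t in range(T)])

constraints = [x[:, t + 1] == A @ x[:, t] + B @ w[:, t] for t in range(T)]
cp.Problem(cp.Minimize(tau * J_proc + J_meas), constraints).solve()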
+Handling missing measurements¶
Consider masking some fraction (e.g., 20%) of the measurements, denoted by the set $\mathcal{M} \subset \mathcal{K}$, resulting in a masked trajectory $\tilde{y}_1, \ldots, \tilde{y}_T$. That is, we let $(\tilde{y}_t)_i = \, ?$ for $(t, i) \in \mathcal{M}$ and $(\tilde{y}_t)_i = (y_t)_i$ for $(t, i) \in \mathcal{K} \setminus \mathcal{M}$.
To construct smoothers with this masked trajectory, i.e., (in ML dialect) smoothers trained on the non-masked measurements, (3) and (4) must be reformulated. To do this, we introduce $T$ output optimization variables, $\hat{y}_1, \ldots, \hat{y}_T \in \mathbf{R}^{p}$, which take the place of $y_t$ in (3) and (4)'s objective functions, subject to the constraint that the outputs match the non-masked observations, i.e.,
+$$\left(\hat{y}_t \right)_i = \left(y_t \right)_i, \quad (t, i) \in \mathcal{K} \setminus \mathcal{M}.$$
+Let $N=Tp$ and define the vector $z \in \mathbf{R}^{N}$ as $z = (\hat{y}_1, \ldots, \hat{y}_T)$. Using $z$, (3) and (4) become
$$
\begin{equation}
\begin{array}{lll}
\text{minimize} \; & \sum_{t=1}^{T} \left(\tau \left\lVert w_t \right\rVert_{2}^2 + \left\lVert Cx_t - \hat{y}_t \right\rVert_{2}^{2} \right) & \\
\text{subject to} & x_{t+1} = Ax_t + Bw_t, \; & t = 1, \ldots, T-1 \\
& S z = c &
\end{array}
\tag{5}
\end{equation}
$$
+and
$$
\begin{equation}
\begin{array}{lll}
\text{minimize} \; & \sum_{t=1}^{T} \left(\tau \left\lVert w_t \right\rVert_{2}^2 + \phi_{\rho}(Cx_t - \hat{y}_t) \right) & \\
\text{subject to} & x_{t+1} = Ax_t + Bw_t, \; & t = 1, \ldots, T-1 \\
& S z = c, &
\end{array}
\tag{6}
\end{equation}
$$
where $S \in \mathbf{R}^{ \left| \mathcal{K} ∖ \mathcal{M} \right| \times N}$ is a selector matrix and $c ∈ \mathbf{R}^{\left| \mathcal{K} \setminus \mathcal{M} \right|}$ contains the corresponding entries of $y_t$. Concretely, if we denote the elements in $\mathcal{K} ∖ \mathcal{M}$ as $\left(\mathcal{K} ∖ \mathcal{M} \right)_1, \ldots, \left(\mathcal{K} ∖ \mathcal{M} \right)_{\left| \mathcal{K} ∖ \mathcal{M} \right|}$, then if $\left(\mathcal{K} ∖ \mathcal{M} \right)_j = (t, i)$, the $j\text{th}$ row of $S$ is $e_{(t-1)p + i}^T$ and the $j\text{th}$ entry of $c$ is $(y_t)_i$. Evidently, $S$ is very sparse.
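As a concrete illustration of this construction (the notebook builds S and c this way later, in create_problem_vectorized_data), a hypothetical T x p boolean mask of known measurements can be turned into the selector matrix with scipy.sparse:

# Illustrative sketch: build S and c from a T x p boolean mask of known measurements.
# `known_mask` is a hypothetical stand-in for K \ M; `y` is the p x T measurement array.
def build_selector(known_mask: np.ndarray, y: np.ndarray) -> Tuple[sparse.csc_matrix, np.ndarray]:
    T_, p_ = known_mask.shape
    t_idx, i_idx = known_mask.nonzero()   # (t, i) pairs in K \ M (0-indexed)
    rows = np.arange(len(t_idx))          # one row of S per known measurement
    cols = t_idx * p_ + i_idx             # position of (y_t)_i in z = (y_1, ..., y_T)
    S = sparse.csc_matrix((np.ones(len(rows)), (rows, cols)), shape=(len(rows), T_ * p_))
    c = y.T[known_mask]                   # corresponding measurement values
    return S, c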
+Solving the Kalman smoothing problems¶
Since we wish to use cvxpylayers
to learn smoother parameters, our smoothers must be DPP-compliant. (It is worth noting that more efficient, non-cvxpylayer
solutions can be developed by exploiting a particular smoother's problem structure. Again, refer to [BB20] for an example of this.)
In the above formulations, (5) can immediately be used to construct a CvxpyLayer
object. However, (6) is not DPP-compliant. Furthermore, to construct robust smoothers, we reformulate (6) as
$$
\begin{equation}
\begin{array}{lll}
\text{minimize} \; & h & \\
\text{subject to} & \tau\left( \sum_{t=1}^{T} \left\lVert w_t \right\rVert_{2}^2 \right) +
\left( \sum_{t=1}^{T} \left( u_t^2 + 2\rho q_t \right) \right) \le h & \\
& \left\lVert Cx_t - \hat{y}_t \right\rVert_{2} \le u_t + q_t & t=1, \ldots, T \\
& 0 \preceq u \preceq \rho \mathbf{1} & \\
& q \succeq 0 & \\
& x_{t+1} = Ax_t + Bw_t, & t = 1, \ldots, T-1 \\
& S z = c,
\end{array}
\tag{7}
\end{equation}
$$
where $u, q \in \mathbf{R}^{T}$ and $h \in \mathbf{R}$ are auxiliary optimization variables. The equivalence of (6) and (7) is based on a result shown in the appendix (bottom of the notebook). To see the reformulation (roughly, and implicitly using (6) & (7)'s, not (A2)'s symbols), simply separate the summand in the objective function of (6), replace the summation of Huber penalties with the objective function in (A2), and then append (A2)'s constraints to this new problem. Because (A2) is shown equivalent to (A1) by fixing the $x_t$ variables, the constraints in (6) and total process noise term in (6)'s objective function can be ignored when reformulating the Huber measurement penalties since fixing $x_t$ decouples $\hat{y}_t$ and $w_t$.
+Finally, (7) is obtained by placing this transformed problem into epigraph form. This last step is not required for DPP-compliance, but if it is not done, diffcp
is unable to differentiate through the problem.
Evaluating a Kalman smoother¶
The first step in judging a Kalman smoother is to mask some fraction (e.g., 10% or 20%) of the observations, denoted by the set $\mathcal{M} ⊂ \left\{1, \ldots, T \right\} \times \left\{ 1, \ldots, p \right\}$. We then use the known set $\mathcal{K} ∖ \mathcal{M}$ to construct the selector matrix $S$ and the non-masked observation vector $c$.
We then solve that smoother problem with $S$ and $c$, resulting in a predicted output trajectory $\hat{y}_1, \ldots, \hat{y}_T$.
+To judge the Kalman smoother, we calculate the prediction error, which we'll take to be the mean squared difference between the predicted output trajectory and the actual trajectory in the entries that we masked,
$$
\begin{equation}
L(\theta) = \frac{1}{\left| \mathcal{M} \right|} \sum_{(t, i) \in \mathcal{M}} \left( \left( \hat{y}_t \right)_i - \left( y_t \right)_i \right)^2.
\tag{8}
\end{equation}
$$
+We'll refer to this prediction error as the mean squared error (MSE) of this smoother (i.e. the smoother trained on $\mathcal{K} ∖ \mathcal{M}$ parameterized by $\theta$). Importantly, because a smoother's predicted output trajectory is a function of the smoother's parameters, the prediction error $L$ is also a function of the parameters.
+The goal in the sequel will be to adjust the parameters to minimize this error.
+Learning¶
In this section we consider the Kalman smoother automatic parameter-tuning problem. We then present how to use cvxpylayers
to solve this auto-tuning problem for (5) and (7), as well as the use of cross-validation to assess the generalizability of the tuned smoothers.
Before moving forward however, a disclaimer. This is an explanatory notebook. Consequently, barely any effort was spent considering how to solve the parameter-tuning problem (or equivalently, the tune_kalman_smoother
function). This is for a reader to consider in the context of their own smoother problem.
Auto-tuning (learning) problem¶
To tune a Kalman smoother parameterized by $\theta$, we wish to minimize the prediction error, $L(\theta)$, on some held out measurements, $\mathcal{M}$. That is, we wish to solve the problem
$$
\text{minimize} \quad L(\theta),
$$
+with the variable $\theta$. This is actually a difficult problem since $L$ is not necessarily convex (and almost certainly isn't) and while the problem appears unconstrained, there are implicit problem constraints.
+Specifically, define $\Theta$ to be the set of parameters that the smoother optimization problem is defined for (e.g., (7) is parameterized by $\theta = (\tau, \rho) ∈ \mathbf{R} × \mathbf{R}$ but only defined for $\Theta = \mathbf{R}_{++} \times \mathbf{R}_{++}$). Since evaluating the prediction loss at some point $\theta$ implicitly requires solving a smoother problem parameterized by $\theta$, the prediction error function is only defined for $\theta \in \Theta$. Concretely, $\textbf{dom} \, L = \Theta$. Furthermore, $\theta \in \Theta$ is an implicit constraint in the above optimization problem (equivalently, the problem's domain is $\Theta$).
+The arising complication is that $\textbf{dom} \, L = \Theta$ does not constrain $\theta \in \Theta$. Because $L$ does not have a natural extended-value extension (convexity would be a sufficient condition for defining such an extension), $\theta \to \textbf{bd}\,\Theta$ does not imply that $L \to \infty$, so iterative optimization algorithms can converge to $\theta \not \in \Theta$. This is problematic as evaluating $L$ with an invalid parameter will result in a .solve()
call to an optimization problem which has "been told" it will only need to solve problems for $\theta \in \Theta$. The iterative method finding a locally optimal $\theta$ will then terminate prematurely with a nasty error message from the smoother problem.
To address this issue, we propose
$$
\begin{equation}
\text{minimize} \quad F(\theta) = L(\theta) + r(\theta)
\tag{9}
\end{equation}
$$
+with variable $\theta$ as the Kalman smoother auto-tuning problem, where $r$ is a regularization function such that $\textbf{dom} \, r = \Theta$ and $r(\theta) = +\infty$ for $\theta \not \in \Theta$.
+Note that $\Theta$ can also include constraints on what parameters are allowed to change or constraints on what parameters are desirable. Making these restrictions adds more nuance to the above solver-error-throw argument, but the formulation of (9) does not change.
+Solving the auto-tuning problem for non-robust and robust smoothing¶
The process of using cvxpylayers
to construct Kalman smoothers and auto-tune their parameters simply requires:
1. Formulating the smoother problems with cvxpy.
2. Using these smoother objects to construct CvxpyLayer objects.
3. Defining a regularization function for each smoother to concretely define $F(\theta)$.
4. Constructing a tune_kalman_smoother function to solve (9) using whatever underlying framework was chosen when the layer was constructed (i.e., PyTorch, TensorFlow, or JAX).
Most of this notebook so far has been concerned with items (1) and (2), and item (4) (as mentioned at the beginning of this Background/Learning section) is beyond the scope of this notebook (or rather, is best considered for specific, non-toy smoothers). That leaves item (3), which we consider now.
+The non-robust Kalman smoother, (5), has one parameter, which we denote by $\theta = \tau \in \mathbf{R}$. The robust smoother, (7), has two parameters, which we denote by $\theta = \left(\tau, \rho \right) \in \mathbf{R} \times \mathbf{R}$. Both smoothers have the allowable set
$$
\Theta = \left\{ \theta \in \mathbf{R}^{k} \mid \theta \succ 0 \right\},
$$
+where $k=1$ and $k=2$, respectively. (Note that for our parameter/allowable parameter notation we're adopting the function notation style found in [BV04], where a function may be declared over some space, but only defined for a subset of that space.) While there are many possible regularization functions we could choose to define $F(\theta)$, we'll auto-tune our smoothers using the logarithmic barrier function. That is, we'll define
$$
r(\theta) = \sum_{i=1}^{k} - \log \theta_i.
$$
+Judging tuned smoothers¶
Update to notation for out-of-sample validation¶
Once a $\theta$ is obtained via auto-tuning (i.e., once a smoother is tuned), the prediction loss (8) is no longer a valid measurement of that smoother's performance. To evaluate this tuned smoother, we need to test it on another (unseen) output sequence. To reserve a final output prediction error test dataset, we still mask some fraction of the measurements (now more likely 20% than 10%) and then split that mask (roughly) in half to obtain $\mathcal{M} \subset \mathcal{K}$ and $\mathcal{T} ⊂ \mathcal{K}$ such that $\mathcal{M} ∩ \mathcal{T} = ∅$. The Kalman smoother is then formulated (or "trained") on $\mathcal{K} ∖ \left( \mathcal{M} ∪ \mathcal{T} \right)$, tuned on $\mathcal{M}$, and tested on $\mathcal{T}$. (The selector matrix, measurement vector, etc. are all updated accordingly.)
+Cross-validation¶
We perform $\mathcal{N}\text{-fold}$ cross-validation by dividing $\mathcal{K}$ into $\mathcal{N}$ sets (folds), and then constructing $\mathcal{N}$ smoothers each with $\mathcal{N}-1$ folds, using half of the measurements of the withheld fold to tune the smoother's parameters and the other half to evaluate the performance of the smoother. Concretely, we generate
$$
\mathcal{M}_n ⊂ \mathcal{K} \quad \text{and} \quad \mathcal{T}_n ⊂ \mathcal{K}, \quad \text{such that} \quad \mathcal{M}_n ∩ \mathcal{T}_n = ∅, \quad n = 1, \ldots, \mathcal{N},
$$
+where for each $n$
$$
\left| \mathcal{M}_n \right| + \left| \mathcal{T}_n \right| ≈ \frac{1}{\mathcal{N}} \left| \mathcal{K} \right|.
$$
+Without loss of generality, the first of the $\mathcal{N}$ smoothers is constructed using the observations in $\mathcal{K} ∖ \left( \mathcal{M}_{1} ∪ \mathcal{T}_{1} \right)$, tuned by solving (9) where $\mathcal{M}_{1}$ are the observed outputs in $L$, and evaluated with (8) where $\mathcal{T}_{1}$ are the observed outputs in $L$.
+We'll define the $\mathcal{N}\text{-fold}$ cross-validation error as
$$
\text{CV}_{\mathcal{N}} = \sum_{n=1}^{\mathcal{N}} \frac{\left| \mathcal{T}_n \right|}{ \left| \mathcal{T} \right| } L_n(\hat{\theta}_n)
$$
+where $L_n(\hat{\theta}_n)$ is the prediction error between the output observations in $\mathcal{T}_n$ and the output predictions of the smoother constructed with the observations $\mathcal{K} ∖ \left(\mathcal{M}_n ∪ \mathcal{T}_n \right)$ and defined with the parameters $\hat{\theta}_n$, which were found by tuning the smoother with observations $\mathcal{M}_n$. Finally, $\mathcal{T} =⋃_{n=1}^{\mathcal{N}}\mathcal{T}_n$.
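A small sketch of this aggregation, assuming each fold's test-set size and prediction error have already been collected (the hypothetical fold_results list below stands in for the per-fold records kept by the cross-validation code later in the notebook):

# Illustrative aggregation of the N-fold CV error defined above.
# `fold_results` is a hypothetical list of (|T_n|, L_n(theta_hat_n)) pairs.
def cv_error(fold_results: List[Tuple[int, float]]) -> float:
    total_test_pts = sum(size for size, _ in fold_results)
    return sum(size / total_test_pts * loss for size, loss in fold_results)

print(cv_error([(100, 2.0), (100, 4.0), (200, 1.0)]))  # 0.25*2 + 0.25*4 + 0.5*1 = 2.0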
+Implementation (vehicle tracking example)¶
We'll apply standard and robust Kalman smoothing to recover the state history, $x_1, \ldots, x_T$, of a vehicle. The state, $x_t \in \mathbf{R}^{4}$, is as described in Background/System Model: the first two components of the state vector are the position of the vehicle in two dimensions, and the last two components are the vehicle velocity. The vehicle has unknown drive force $w_t$, and we observe noisy measurements of the vehicle position, $y_t \in \mathbf{R}^{2}$.
The matrices for the system dynamics (and output) are
$$
A = \begin{bmatrix}
1 & 0 & (1- \frac{\gamma}{2}\Delta t)\Delta t & 0 \\
0 & 1 & 0 & (1- \frac{\gamma}{2}\Delta t)\Delta t \\
0 & 0 & 1 - \gamma \Delta t & 0 \\
0 & 0 & 0 & 1 - \gamma \Delta t
\end{bmatrix},
$$
$$
B = \begin{bmatrix}
\frac{1}{2}\Delta t^2 & 0 \\
0 & \frac{1}{2}\Delta t^2 \\
\Delta t & 0 \\
0 & \Delta t
\end{bmatrix},
$$
and
$$
C = \begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0
\end{bmatrix},
$$
where $\gamma$ is a velocity damping parameter.
+Plotting helper functions¶
+def plot_state(t, actual, estimated=None):
+ '''
+ plot position, speed, and input in the x and y coordinates for
+ the actual data, and optionally for the estimated data
+ '''
+ trajectories = [actual]
+ labels = ['Actual']
+ colors = ['#1f77b4']
+
+ if estimated is not None:
+ trajectories.append(estimated)
+ labels.append('Estimated')
+ colors.append('red')
+
+ fig, ax = plt.subplots(3, 2, sharex='col', sharey='row', figsize=(12,8))
+
+ for idx, (data, label) in enumerate(zip(trajectories, labels)):
+ x, w = data
+ ax[0,0].plot(t, x[0,:-1], color=colors[idx], label=label)
+ ax[0,1].plot(t, x[1,:-1], color=colors[idx])
+ ax[1,0].plot(t, x[2,:-1], color=colors[idx])
+ ax[1,1].plot(t, x[3,:-1], color=colors[idx])
+ ax[2,0].plot(t, w[0,:], color=colors[idx])
+ ax[2,1].plot(t, w[1,:], color=colors[idx])
+
+ ax[0,0].set_ylabel('$x$ position')
+ ax[1,0].set_ylabel('$x$ velocity')
+ ax[2,0].set_ylabel('$x$ input')
+
+ ax[0,1].set_ylabel('$y$ position')
+ ax[1,1].set_ylabel('$y$ velocity')
+ ax[2,1].set_ylabel('$y$ input')
+
+ ax[0,1].yaxis.tick_right()
+ ax[1,1].yaxis.tick_right()
+ ax[2,1].yaxis.tick_right()
+
+ ax[0,1].yaxis.set_label_position("right")
+ ax[1,1].yaxis.set_label_position("right")
+ ax[2,1].yaxis.set_label_position("right")
+
+ ax[2,0].set_xlabel('$t$')
+ ax[2,1].set_xlabel('$t$')
+
+ # Add a single legend
+ fig.legend(labels, loc='upper right', ncol=2)
+
+
+def plot_positions(traj: List[List[np.ndarray]],
+ titles: List[str],
+ legends: List[List[str]],
+ plot_args: List[List[str]],
+ axis=None,
+ filename=None):
+ matplotlib.rcParams.update({'font.size': 14})
+ n = len(traj)
+
+ assert len(traj) == len(titles) == len(legends) == len(plot_args), "Input lists must have the same length."
+
+ rows = int(np.ceil(n / 2))
+ cols = 2 if n > 1 else 1
+
+ fig, ax = plt.subplots(rows, cols, sharex=True, sharey=True, figsize=(12, 5 * rows))
+
+ # Ensure ax is a 2D array
+ if rows == 1:
+ ax = np.array([ax])
+ if cols == 1:
+ ax = ax.reshape(-1, 1)
+
+ ax = ax.flatten()
+
+ for i, xs in enumerate(traj):
+ for j, x in enumerate(xs):
+ typ = plot_args[i][j]
+ alpha = 0.1 if 'o' in typ else 1
+ ax[i].plot(x[0, :], x[1, :], typ, alpha=alpha)
+ ax[i].set_title(titles[i])
+ ax[i].legend(legends[i], loc='upper right')
+ if axis:
+ ax[i].axis(axis)
+
+ # Hide unused subplots if n is odd
+ for j in range(n, rows * cols):
+ fig.delaxes(ax[j])
+
+ plt.tight_layout()
+
+ if filename:
+ fig.savefig(filename, bbox_inches='tight')
+
Problem data¶
We generate the data for the vehicle tracking problem. We'll have $T=1000$, $w_t$ a standard Gaussian, and $v_t$ a standard Gaussian, except $20\%$ of the points will be outliers with $σ=20$.
+Below, we set the problem parameters and define the matrices $A$, $B$, and $C$.
+n = 4 # number of entries in the state vector
+m = 2 # number of entries in the input vector
+p = 2 # number of entries in the measurement vector
+gamma = 0.05 # velocity damping; 0 is no damping
+
+num_steps = 1000
+T = num_steps
+end_time = 50 # time will move from 0 to end_time with step delt
+times, delt = np.linspace(0, end_time, num_steps, endpoint=True, retstep=True)
+N = T*p # number of total possible measurements
+
+A = np.zeros((n, n))
+B = np.zeros((n, m))
+C = np.zeros((p, n))
+
+# Dynamics for this particular problem
+A[0,0] = 1
+A[1,1] = 1
+A[0,2] = (1-gamma*delt/2)*delt
+A[1,3] = (1-gamma*delt/2)*delt
+A[2,2] = 1 - gamma*delt
+A[3,3] = 1 - gamma*delt
+
+B[0,0] = delt**2/2
+B[1,1] = delt**2/2
+B[2,0] = delt
+B[3,1] = delt
+
+C[0,0] = 1
+C[1,1] = 1
+
+C_tch = torch.tensor(C) # used in loss functions
+
Simulation¶
We set the initial state to zero (starting at the origin with zero velocity) and simulate the system forward in time. The results are the true vehicle states x_true
(which we'll use to judge our recovery) and the observed positions y
. (Note that $C$ only "observes" the first two components of a state vector.)
We plot the position, velocity, and system input $w$ in both dimensions as a function of time. We also plot the sets of true and observed vehicle positions.
+sigma = 20
+num_outliers = .20
+
+x = np.zeros((n,T+1))
+x[:,0] = [0,0,0,0]
+y = np.zeros((p,T))
+
+# generate random input and noise vectors
+np.random.seed(6)
+w = np.random.randn(m,T)
+v = np.random.randn(p,T)
+
+# add outliers to v
+np.random.seed(0)
+inds = np.random.rand(T) <= num_outliers
+v[:,inds] = sigma*np.random.randn(p, T)[:,inds]
+
+# simulate the system forward in time
+for t in range(T):
+ x[:, t+1] = A@x[:, t] + B@w[:, t]
+ y[:, t] = C@x[:, t] + v[:, t]
+
+x_true = x.copy()
+w_true = w.copy()
+
+plot_state(times,(x_true,w_true))
+plot_positions(traj=[[y, x_true]],
+ titles=['True vs. Observations'],
+ legends=[['Measurements', 'True']],
+ plot_args=[['ro', 'k']],
+ axis=[-4,14,-5,20])
+
Smoother Instantiation¶
Before moving into the main functionality of this vehicle path reconstruction example, we define some additional notation and comment on how this notation and the notation from the Background section are adapted to the code. (Note that we do not walk through all of the code here; anything not explained can be extrapolated from the following main ideas.)
+Firstly,
$$
\mathcal{F} = \mathcal{K} ∖ \left( \mathcal{M} ∪ \mathcal{T} \right)
$$
are the observations that a smoother is trained on in an out-of-sample validation setting. Similarly, when using $\mathcal{N}\text{-fold}$ cross-validation to evaluate smoothers,
$$
\mathcal{F}_n = \mathcal{K} ∖ \left( \mathcal{M}_n ∪ \mathcal{T}_n \right)
$$
+are the observations that the $n\text{th}$ smoother is trained on. Except in the cross-validation related functionality, the non-subscripted notation will be used.
The code is very consistent with this set notation. The set of all observations, $\mathcal{K}$, is contained in y
, a $p \times T$ np.ndarray
whose $(i, t)\text{th}$ entry is the $i\text{th}$ component of the observation at time $\text{t}$. F
and M
will (almost always) be used to represent $\mathcal{F}$ and $\mathcal{M}$. However, these will be $T \times p$ mask matrices whose $(t, i)\text{th}$ entry is a boolean representing whether the $(i, t)\text{th}$ entry in y
is contained in that respective set. (The transposition is required for ease of indexing. See the Appendix for more on this.) Because $T$ is used pervasively throughout the Kalman smoothing mathematics, the set $\mathcal{T}$ will be represented in the code more bluntly by the $T \times p$ mask matrix test_mask
.
Keeping with this set notation, in the code's docstrings and comments, the concatenated symbol of absolute value bars around a mask matrix refers to the cardinality of the set that matrix represents (e.g., |F|
should be understood as $\left| \mathcal{F} \right|$).
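As a tiny illustration of this convention (using the y, T, and p defined above and a hypothetical mask), a T x p boolean mask selects entries of the p x T array y by indexing its transpose:

# Illustration of the mask convention: a T x p boolean mask selects entries of y via y.T.
example_mask = np.zeros((T, p), dtype=bool)  # hypothetical mask, not used elsewhere
example_mask[1, 0] = True                    # measurement one at the second time step
example_mask[8, 1] = True                    # measurement two at the ninth time step
print(y.T[example_mask])                     # the two selected measurements, as a 2-vector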
As mentioned above, F
and M
will not always represent $\mathcal{F}$ and $\mathcal{M}$. Specifically, because the problem (9) is concerned with minimizing $F(\theta)$ with $\theta$, whenever the set $\mathcal{F}$ needs to exist in the same scope as this tuning function, $\mathcal{F}$ and $\mathcal{M}$ will be represented by the mask matrices train_mask
and tune_mask
, respectively.
Finally, these prediction error and tuning functions (i.e., $L$ and $F$) will be defined in the code as L
and F
. However, like in the Background/Cross-validation section, when in the cross-validation setting and referring to a specific smoother, $L_n$ and $F_n$ will be used mathematically with the corresponding variable names Ln
and Fn
.
def create_kalman_smoother(S: sparse.csc_matrix,
+ c: np.ndarray) -> CvxpyLayer:
+ """
+ Construct a standard Kalman smoother with
+ measurements F.
+
+ Arguments
+ ---------
+ S : sparse.csc_matrix
+ |F| x N selection matrix.
+ c : np.ndarray
+ |F|-vector of measurements.
+
+ Returns
+ --------
+ CvxpyLayer
+ Solves a (or a batch of) standard Kalman smoothing
+ problem(s) upon receiving a (or a batch of) tuning
+ parameter(s), tau.
+ """
+ x = cp.Variable(shape=(n, T+1))
+ w = cp.Variable(shape=(m, T))
+ # Missing data constraints
+ y_hat = cp.Variable(shape=(p, T))
+ z = cp.vec(y_hat, order='F')
+
+ tau = cp.Parameter(pos=True)
+
+ f0 = tau*cp.sum_squares(w) + cp.sum_squares((C@x)[:, :T] - y_hat)
+ obj = cp.Minimize(f0)
+
+ constr = [ x[:, t+1] == A@x[:, t] + B@w[:, t] for t in range(T) ]
+ constr += [ S@z == c ] # missing data constraint
+
+ smoother_problem = cp.Problem(obj, constr)
+
+ layer = CvxpyLayer(
+ smoother_problem,
+ parameters=[tau],
+ variables=[x, y_hat, w]
+ )
+ return layer
+
+
+def create_robust_kalman_smoother(S: sparse.csc_matrix,
+ c: np.ndarray) -> CvxpyLayer:
+ """
+ Construct a robust Kalman smoother with
+ measurements F.
+
+ Arguments
+ ---------
+ S : sparse.csc_matrix
+ |F| x N selection matrix.
+ c : np.ndarray
+ |F|-vector of measurements.
+
+ Returns
+ --------
+ CvxpyLayer
+ Solves a (or a batch of) robust Kalman smoothing
+ problem(s) upon receiving a (or a batch of) tuning
        parameters, (tau, rho).
+ """
+ x = cp.Variable(shape=(n, T+1))
+ w = cp.Variable(shape=(m, T))
+ h = cp.Variable()
+ # Missing data constraints
+ y_hat = cp.Variable(shape=(p, T))
+ z = cp.vec(y_hat, order='F')
    # Huber auxiliary variables
+ u = cp.Variable(T)
+ q = cp.Variable(T)
+
+ tau = cp.Parameter(pos=True)
+ rho = cp.Parameter(pos=True)
+
+ f0 = tau*cp.sum_squares(w)
+ f0 += cp.sum([ cp.square(u[t]) + 2*rho*q[t] for t in range(T)])
+ constr = [f0 <= h]
+ obj = cp.Minimize(h) # epigraph form
+
+ # Huber constraints
+ constr += [cp.norm(C@x[:, t] - y_hat[:, t], 2) <= u[t] + q[t] for t in range(T)]
+ constr += [u[t] <= rho for t in range(T) ]
+ constr += [0 <= u, 0 <= q]
+
+ # smoother constraints
+ constr += [ x[:, t+1] == A@x[:, t] + B@w[:, t] for t in range(T) ]
+ constr += [ S@z == c ] # missing data constraint
+
+ smoother_problem = cp.Problem(obj, constr)
+
+ layer = CvxpyLayer(
+ smoother_problem,
+ parameters=[tau, rho],
+ variables=[x, y_hat, w]
+ )
+ return layer
+
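As a quick usage sketch (assuming S and c have been built for some training mask, as is done later in this notebook), the layers are called with torch tensors holding their parameters and return the optimal variable values:

# Usage sketch (illustrative): solve both smoothers for fixed parameter values.
# Assumes S (an |F| x N sparse selector) and c (an |F|-vector) have already been built.
standard_layer = create_kalman_smoother(S, c)
robust_layer = create_robust_kalman_smoother(S, c)

tau = torch.tensor(0.1, requires_grad=True)
rho = torch.tensor(2.0, requires_grad=True)

x_hat, y_hat, w_hat = standard_layer(tau)           # shapes: (n, T+1), (p, T), (m, T)
x_hat_r, y_hat_r, w_hat_r = robust_layer(tau, rho)  # same shapes, robust estimates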
Learning (Auto-Tuning) Functionality¶
Before implementing the tune_kalman_smoother
and cross_validate
functions, we implement the following boilerplate code. That is, the functionality required to split measurements into training, tuning, and validation sets (or into training, tuning, and validation cross-validation folds), and the penalty functions $L$ and $F$.
Data separation functions¶
+def split_data(fraction: float=0.8) -> Tuple[int,
+ np.ndarray,
+ np.ndarray,
+ np.ndarray]:
+ """
+ Split data into a training set,
+ tuning set, and test set.
+
+ Arguments
+ ---------
+ fraction : float, optional
+ Fraction of the measurement dataset included
+ in the training dataset.
+ Default fraction is 0.8.
+
+ Returns
+ -------
+ |F| : int
+ Number of measurements in the training dataset
+ F : np.ndarray
+ T x p mask matrix whose True entries are the
+ indices of the measurements used for model
+ training.
+ M : np.ndarray
+ T x p mask matrix whose True entries are the
+ indices of the measurements used for model
+ tuning.
+ test_mask : np.ndarray
+ T x p mask matrix whose True entries are the
+ indices of the measurements used for model
+ validation.
+ """
+ num_train_pts = int( N*fraction )
+ num_tune_pts = int( (N - num_train_pts)/2 )
+
+ # generate same out-of-sample validation split every time
+ np.random.seed(1)
+
+ indices = np.arange(N)
+ np.random.shuffle(indices)
+
+ train_indices = indices[:num_train_pts]
+ tune_indices = indices[num_train_pts:(num_train_pts+num_tune_pts)]
+ test_indices = indices[(num_train_pts+num_tune_pts):]
+
+ F = np.zeros((T, p), dtype=bool)
+ M = np.zeros((T, p), dtype=bool)
+ test_mask = np.zeros((T, p), dtype=bool)
+
+ F.flat[train_indices] = True
+ M.flat[tune_indices] = True
+ test_mask.flat[test_indices] = True
+
+ return num_train_pts, F, M, test_mask
+
def cv_split(num_folds: int=5) -> List[Tuple
+ [int,
+ np.ndarray,
+ np.ndarray,
+ np.ndarray]
+ ]:
+ """
+ Split data into folds for cross-validation.
+
+ Arguments
+ ---------
+ num_folds : int, optional
+ The number of folds desired for
+ cross-validation.
+ Default number of folds is 5.
+
+ Returns
+ -------
+ List
+ The nth entry of the list is the four-
+ tuple `(|Fn|, Fn, Mn, test_mask_n)`.
+ `Fn`, `Mn`, and `test_mask_n` are T x p mask matrices,
+ whose true entries are the indices of the
+ measurements used for model training, tuning, and
+ validation for the nth fold, respectively.
+ The list has length equal to `num_folds`.
+ """
+ np.random.seed(1) # generate same CV split every time
+
+ indices = np.arange(N)
+ np.random.shuffle(indices)
+
    # List of Fn^c arrays; each Fn^c contains indices (it is not a mask matrix).
    # Specifically, Fn^c contains the measurement indices in Mn U test_mask_n.
+ Fn_c_s = np.array_split(indices, num_folds)
+
+ to_return = []
+
+ for n in range(num_folds):
+ Fn_c_indices = Fn_c_s[n]
+
+ num_tune_pts_n = int( len(Fn_c_indices)/2 )
+ tune_indices_n = Fn_c_indices[:num_tune_pts_n]
+ test_indices_n = Fn_c_indices[num_tune_pts_n:]
+
+ Mn = np.zeros((T, p), dtype=bool)
+ test_mask_n = np.zeros((T, p), dtype=bool)
+
+ Mn.flat[tune_indices_n] = True
+ test_mask_n.flat[test_indices_n] = True
+
+ Fn = ~(Mn + test_mask_n)
+ num_train_pts_n = np.sum(Fn)
+
+ to_return.append((num_train_pts_n, Fn, Mn, test_mask_n))
+
+ return to_return
+
def create_problem_vectorized_data(F: np.ndarray,
+ M: np.ndarray,
+ test_mask: np.ndarray,
+ num_measurements: int=None) -> Tuple[sparse.csc_matrix,
+ np.ndarray,
+ np.ndarray,
+ np.ndarray]:
+ """
+ Turns a mask matrix of known measurement indices
+ into the sparse selection matrix needed in the
+ standard and robust Kalman smoothing problems.
+ Also produces the corresponding measurement
+ vector, and the tuning and testing measurement
+ vectors.
+
+ Arguments
+ ---------
+ F : np.ndarray
+ T x p mask matrix of measurement indices.
+ M : np.ndarray
+ T x p mask matrix of tuning indices.
+ test_mask : np.ndarray
+ T x p mask matrix of validation indices.
+ num_measurements : int, optional
+ The number of known measurement indices in
+ the selected numpy array. Mathematically,
+ |F|.
+
+ Returns
+ -------
+ S : sparse.csc_matrix
+ |F| x N sparse selection matrix.
+ c : np.ndarray
+ |F|-vector containing the output
+ measurements in the training dataset
+ c_tune : np.ndarray
+ |M|-vector containing the output
+ measurements in the tuning dataset.
+ c_test : np.ndarray
+ |test_mask|-vector containing the output
+ measurements in the validation dataset.
+ """
    num_measurements = np.sum(F) if num_measurements is None else num_measurements
+
+ time_indices, measurement_indices = F.nonzero()
+ S = sparse.csc_matrix(
+ (
+ np.ones(num_measurements), # data filled into the sparse array
+ (
+ np.arange(num_measurements),
+ time_indices*p + measurement_indices
+ )
+ ),
+ shape=(num_measurements, N)
+ )
+ c = y.T[F]
+ c_tune = y.T[M]
+ c_test = y.T[test_mask]
+ return (S, c, c_tune, c_test)
+
Error Functions¶
The following are the implementations of $L$ and $F$, respectively.
+def L(y: torch.Tensor,
+ measurement_mask: np.ndarray,
+ y_hat: torch.Tensor) -> torch.Tensor:
+ """
+ Prediction error used to evaluate a
+ Kalman smoother.
+
+ Specifically, this function computes (8),
+ i.e., L(theta),
+ for a provided y, y_hat, and measurement
+ selection set.
+
+ Recall that y_hat implicitly depends on theta.
+
+ Arguments
+ ---------
+ y : torch.Tensor
+ An observed output subset to judge smoother
+ output against. Should satisfy
+ y.shape == (|measurement_mask|,).
+ measurement_mask : np.ndarray
        T x p mask matrix used to select the subset
        of y_hat to be judged against y.
    y_hat : torch.Tensor
+ p x T output matrix "predicted" by a smoother.
+
+ Returns
+ -------
+ torch.Tensor
+ containing L(theta)
+ """
+
+ return torch_functional.mse_loss(y_hat.t()[measurement_mask], y)
+
+def F(y: torch.Tensor,
+ measurement_mask: np.ndarray,
+ y_hat: torch.Tensor,
+ params: List[torch.Tensor]) -> torch.Tensor:
+ """
+ The learning function used to auto-tune
+ a Kalman smoother.
+
+ Specifically, this function computes F(theta)
+ found in (9) by adding
+
+ r(theta) = sum_i -log(theta_i)
+
+ to the prediction error computed by `L`.
+
+ Arguments
+ ---------
+ y : torch.Tensor
+ An observed output subset to judge smoother output
+ against. Should satisfy
+ y.shape == (|measurement_mask|,)
+ measurement_mask : np.ndarray
        T x p mask matrix used to select the subset of y_hat
+ to be judged against y.
+ y_hat : torch.Tensor
+ p x T output matrix "predicted" by a smoother.
+ params : List[torch.Tensor]
+ A list of the Tensor hyper-parameters needed to
+ solve the smoother defining the layer object.
+
+ Returns
+ -------
+ torch.Tensor
+ containing F(theta)
+ """
+
+ mse_loss = L(y, measurement_mask, y_hat)
+ r_theta = sum(-torch.log(param) for param in params)
+ return mse_loss + r_theta
+
Parameter finding functions¶
+def parameter_sweep(layer: CvxpyLayer,
+ num_params: int) -> Tuple[List[torch.Tensor],
+ torch.Tensor]:
+ """
+ Solve a batch of smoother problems with
+ log-spaced (or Cartesian product of log-spaced)
+ parameter values.
+
+ Used in the `cross_validate` function to find
+ initial theta values to use in `tune_kalman_smoother`.
+
+ Arguments
+ ---------
+ layer : CvxpyLayer
+ The Kalman smoother to perform the batch
+ solving with.
+ num_params : int
+ Number of parameters needed as arguments
+ for the provided layer.
+
+ Returns
+ -------
+ theta_init : List[torch.Tensor]
+ Contains a tensor for each parameter
+ needed for the smoother
+ (i.e., `len(theta_init) == num_params`).
+ Taking the jth entry of each tensor yields
+ the jth theta the smoother was solved with.
+ ys : torch.Tensor
+ A 3-D tensor containing the predicted
+ outputs that correspond to solving
+ a smoother with each combination of
+ log-spaced parameters in a batched fashion.
+ More specifically, the tensor has shape
        theta_init[0].shape[0] x p x T
+ """
+ steps = 10 // num_params
+
+ logspace_values = [torch.logspace(start=-4, end=7, steps=steps, base=2)
+ for _ in range(num_params)]
+ meshgrids = torch.meshgrid(*logspace_values)
+ theta_product = torch.stack(meshgrids,
+ dim=-1).reshape(-1,len(logspace_values))
+ theta_init = [theta_product[:, i] for i in range(num_params)]
+
+ # batch compute solutions
+ ys = layer(*theta_init)[1]
+
+ return (theta_init, ys)
+
def tune_kalman_smoother(smoother_layer: CvxpyLayer,
+ Fn: Callable[[torch.Tensor,
+ List[torch.Tensor]],
+ torch.Tensor],
+ Ln: Callable[[torch.Tensor],
+ torch.Tensor],
+ params: List[torch.Tensor],
+ num_iters: int=25,
+ lr: float=2e-2,
+ verbose: bool=False,
+ print_every: int=5) -> Tuple[int,
+ float,
+ List[List[float]],
+ List[float],
+ List[float]]:
+ """
+ Auto-tunes a Kalman smoother.
+
+ Specifically, attempts to tune the provided Kalman
+ smoother by solving (9) using PyTorch's
+ implementation of Adam.
+
+ Along with recording the minimizing sequence for
+ solving
+
+ minimize F(theta)
+
+ with theta, this function records the prediction
+ error for each parameter element in the sequence.
+
+ (Note that "prediction error" is the value of Ln(theta)
+ over the validation dataset.)
+
+ Auto-tuning will terminate early if an invalid theta
+ is produced as a minimizing iterate. See the above
+ background section for information on what makes a theta
    invalid. Also note that this can occur despite the use
+ of the regularization function, r(theta), since
+ no care has been taken in choosing a solution method.
+
+ Arguments
+ ---------
+ smoother_layer : CvxpyLayer
+ The Kalman smoother to tune.
+ (Recall that the smoother opt. problem
+ is embedded in a CvxpyLayer so that auto-
+ tuning can be performed.)
+ Fn : Callable[[torch.Tensor, List[torch.Tensor]], torch.Tensor]
+ The learning function F in (9) defined over some
+ tuning dataset. This Fn is the `F` above
+ with a `y` and `measurement_mask` already provided,
+ so the only arguments it needs to be provided during
+ runtime are the p x T y_hat output matrix returned
+ from passing the provided layer a parameter iterate,
+ and that parameter iterate.
+ Ln : Callable[[torch.Tensor], torch.Tensor]
+ The prediction error L from (8) defined over
+ some testing dataset. This Ln is the `L`
+ above with a `y` and `measurement_mask`
+ already provided, so the only argument it needs
+ to be provided during runtime is the p x T
+ y_hat output matrix returned from passing
+ the provided layer a parameter iterate.
+ params : List[torch.Tensor]
+ The scalar parameter(s) needed to define
+ a standard or robust Kalman smoother.
+ Used as the starting point for
+ solving min. F(theta).
+ num_iters : int, optional
+ Number of iterations to run Adam on
+ min. F(theta).
+ Default is 25 iterations.
+ lr : float, optional
        Learning rate to instantiate Adam with.
+ Default is 2e-2.
+ verbose : bool, optional
+ Default is False.
+ print_every : int, optional
+ Default is every 5 iterations.
+
+ Returns
+ -------
+ best_param_index : int
+ The index in `param_history` corresponding
+ to the value of theta that produced the
+ smallest (test) prediction error.
+ best_loss : float
+ The smallest prediction error seen during
+ the attempt to minimize F(theta).
+ param_history : List[List[float]]
        The minimizing sequence of F(theta).
+ train_losses : List[float]
+ F(theta_i) for theta_i in param_history.
+ test_losses : List[float]
+ L(theta_i) for theta_i in param_history.
+ """
+ opt = torch.optim.Adam(params, lr=lr)
+
+ best_param_index, best_loss = None, torch.inf
+ param_history: List[List[float]] = []
+ train_losses, test_losses = [], []
+
+ for k in range(num_iters):
+
+ param_history.append([p.item() for p in params])
+
+ y_hat_k = smoother_layer(*params)[1]
+
+ with torch.no_grad():
            curr_test_loss = Ln(y_hat_k).item()
+ if curr_test_loss < best_loss:
+ best_loss = curr_test_loss
+ best_param_index = k
+ test_losses.append(curr_test_loss)
+ if verbose and k % print_every == 0:
+ print("test loss %03d | %3.5f" % (k + 1, test_losses[-1]))
+
+ opt.zero_grad()
+ curr_train_loss = Fn(y_hat=y_hat_k, params=params)
+ curr_train_loss.backward()
+ opt.step()
+
+ neg_theta = any([p.item() < 0 for p in params])
+ if neg_theta:
+ # despite r(theta), out-of-box opt. algo.
+ # could result in theta_i < 0
+ break
+
+ train_losses.append(curr_train_loss.item())
+ if verbose and k % print_every == 0:
+ print("train loss %03d | %3.5f" % (k+1, train_losses[-1]))
+
+ return (best_param_index, best_loss,
+ param_history, train_losses, test_losses)
+
Helper data objects¶
+class AutoTune_DataObject:
+
+ def __init__(self, num_epochs, best_theta_index, best_loss,
+ theta_history, train_losses, test_losses):
        self.tune_succeeded: bool = best_theta_index > 0
        self.quit_early: bool = len(theta_history) != num_epochs
+ self.best_theta_index: int = best_theta_index
+ self.best_loss: float = best_loss
+ self.theta_history: List[List[float]] = theta_history
+ self.train_losses: List[float] = train_losses
+ self.test_losses: List[float] = test_losses
+
+class Fold_DataObject:
+ """
+
+ Contains
+ - losses and param values from sweep
+ - theta corresponding to lowest test loss
+
+ Can determine if autotune failed by observing size of parameter history list
+
+ Keep track of which tune object contained the best param?
+
+ """
+ def __init__(self, num_train_pts, layer):
+ self.num_train_pts: int = num_train_pts
+ self.layer: CvxpyLayer = layer
+
+ self.param_sweep_result = None
+
+ self.auto_tunes = []
+ self.theta: List[float] = None
+ self.auto_tune_succeeded = False
+ self.train_mse = torch.inf
+ self.test_mse = torch.inf
+
+ def add_tune_result(self, to_add: AutoTune_DataObject,
                        train_loss: Callable[..., torch.Tensor]):
+ self.auto_tunes.append(to_add)
+
+ best_theta_index = to_add.best_theta_index
+
+ if to_add.tune_succeeded:
+ self.auto_tune_succeeded = True
+
        if to_add.best_loss < self.test_mse:
            self.test_mse = to_add.best_loss
            # If tuning quit before recording a training loss for the best theta,
            # recompute it by re-solving the layer at that theta.
            if best_theta_index < len(to_add.train_losses):
                self.train_mse = to_add.train_losses[best_theta_index]
            else:
                theta_best = [torch.tensor(v) for v in to_add.theta_history[best_theta_index]]
                y_hat_best = self.layer(*theta_best)[1]
                self.train_mse = train_loss(y_hat=y_hat_best, params=theta_best).item()
            self.theta = to_add.theta_history[best_theta_index]
+
+
+
class CV_DataObject:
    """
    Aggregate cross-validation results.

    Contains
    - the number of folds
    - the CV-MSE error
    - each fold's Fold_DataObject

    Once the fold list reaches num_folds, the final CV error is
    computed automatically.

    TODO: serialize/save the results after running cross_validate.
    """
+ def __init__(self, num_params, num_folds):
+ self.num_params = num_params
+ self.num_folds = num_folds
+ self.folds: List[Fold_DataObject] = []
+
+ self.mse_cv_error = torch.inf
+
    def add_fold(self, to_add: Fold_DataObject):
        self.folds.append(to_add)

        if len(self.folds) == self.num_folds:
            self.mse_cv_error = self.compute_cv_error()

    def compute_cv_error(self):
        # Approximates the N-fold CV error defined in the Background section
        # with equal fold weights, since the folds' test sets are (nearly)
        # equal in size.
        total = 0.0
        for fold in self.folds:
            total += fold.test_mse
        return total / len(self.folds)
+
Cross-validation instantiation¶
+def cross_validate(create_smoother: Callable[[sparse.csc_matrix, np.ndarray],
+ CvxpyLayer],
+ num_params: int,
+ num_folds: int=5,
+ num_st_pts: int=3,
+ num_iters: int=25,
+ verbose: bool=False,
+ double_verbose: bool=False) -> CV_DataObject:
+ """
    Creates a Kalman smoother for each cross-validation fold and
    auto-tunes it.

    For each fold, the smoother is trained on the fold's training
    measurements, its parameters are tuned on the fold's tuning
    measurements (starting Adam from the best points of a parameter
    sweep), and its prediction error is evaluated on the fold's
    test measurements.

+ Arguments
+ ---------
+ create_smoother : Callable[[sparse.csc_matrix, np.ndarray], CvxpyLayer]
+ Pass in `create_kalman_smoother` to perform cross-validation
+ on a standard Kalman smoother or `create_robust_kalman_smoother`
+ to perform cross-validation on a robust Kalman smoother.
+ num_params : int
+ Number of parameters needed as arguments in the smoother layer
        returned by `create_smoother`.
+ num_folds : int, optional
        The number of folds used for cross-validation.
+ Default is 5 folds.
+ num_st_pts : int, optional
        Number of best parameter-sweep results used as starting
        points for auto-tuning.
        Default is three starting points.
+ num_iters: int, optional
+ Number of iterations to run Adam
+ on min. F(theta).
+ Default is 25 iterations.
+ verbose : bool, optional
+ Default is False.
+ double_verbose : bool, optional
+ Default is False.
+
+ Returns
+ -------
+ CV_DataObject
+ The aggregate and granular results of running this function.
+ See the docstrings of the above helper data functions for
+ more information on what results are captured.
+ """
+
+ cv_result: CV_DataObject = CV_DataObject(num_params, num_folds)
+
+ data_folds = cv_split(num_folds)
+
+ for n, fold in enumerate(data_folds):
+ if verbose:
+ print("=== Starting fold %02d | %02d ===" % (n, num_folds))
+
+ # === initialize fold's layer & data ===
+ num_train_pts, train_mask, tune_mask, test_mask = fold
+
+ S, c, c_tune, c_test = create_problem_vectorized_data(train_mask,
+ tune_mask,
+ test_mask,
+ num_train_pts)
+
+ c_tune_tch, c_test_tch = torch.tensor(c_tune), torch.tensor(c_test)
+
+ smoother_layer = create_smoother(S, c)
+
+ fold_to_add: Fold_DataObject = Fold_DataObject(num_train_pts,
+ smoother_layer)
+
+ Fn = partial(F, y=c_tune_tch,
+ measurement_mask=tune_mask)
+ Ln = lambda y_hat : L(y=c_test_tch,
+ measurement_mask=test_mask,
+ y_hat=y_hat)
+
+ # === parameter sweep ===
+ if verbose:
+ print("starting parameter sweep")
+
+ theta_inits, ys = parameter_sweep(smoother_layer,
+ num_params)
+
+ losses: List[Tuple[
+ List[torch.Tensor],
+ float
+ ]] = [
+ (
+ [theta_inits[j][i] for j in range(num_params)],
+ Ln(ys[i]).item()
+ )
+ for i in range(ys.shape[0])
+ ]
+ losses.sort(key=lambda x: x[1])
+
+ fold_to_add.param_sweep_result = losses
+
+ # === auto-tuning ===
+ if verbose:
+ print("starting auto_tuning")
+
+ starting_thetas = [theta for theta, _ in losses[0:num_st_pts]]
+
+ for j, theta_0 in enumerate(starting_thetas):
+            # theta_0 is a list of torch tensors taken from the parameter sweep
+            # (they are created there without requires_grad).
+ if verbose:
+ print("auto-tuning starting point %02d | %02d" % (j, num_st_pts))
+
+ auto_tune_result = tune_kalman_smoother(smoother_layer,
+ Fn, Ln,
+ params=theta_0,
+ num_iters=num_iters,
+ verbose=double_verbose)
+
+            # train_loss is only consulted by add_tune_result when no gradient
+            # step improved on the starting theta; it maps a list of parameter
+            # values to the training objective F(theta).
+            def fold_train_loss(theta: List[float]) -> torch.Tensor:
+                theta_tch = [torch.tensor(p) for p in theta]
+                _, y_hat, _ = smoother_layer(*theta_tch)
+                return Fn(y_hat=y_hat, params=theta_tch)
+
+            fold_to_add.add_tune_result(AutoTune_DataObject(num_iters,
+                                                            *auto_tune_result),
+                                        fold_train_loss)
+
+        cv_result.add_fold(fold_to_add)
+
+ return cv_result
+
Cross-validation helper functions¶
+# print results
+
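The cell above is only a stub, so here is a hedged sketch of what a results-printing helper and a cross-validation run could look like, using only the objects defined in this notebook. The function name print_cv_results is ours, and the invocations at the bottom are left commented out because a full cross-validation run is compute-heavy.

def print_cv_results(cv: CV_DataObject) -> None:
    """Summarize a CV_DataObject returned by `cross_validate`."""
    print("CV-MSE error over %d folds: %s" % (cv.num_folds, cv.mse_cv_error))
    for k, fold in enumerate(cv.folds):
        print("fold %02d | auto-tune succeeded: %s | test MSE %s | theta %s"
              % (k + 1, fold.auto_tune_succeeded, fold.test_mse, fold.theta))

# Example invocations (as in the cells below, the standard smoother takes a
# single parameter, tau, and the robust smoother takes two, tau and rho):
# ls_cv = cross_validate(create_kalman_smoother, num_params=1, verbose=True)
# robust_cv = cross_validate(create_robust_kalman_smoother, num_params=2, verbose=True)
# print_cv_results(ls_cv); print_cv_results(robust_cv)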
Least Squares Smoothing¶
+Test basic functionality¶
+num_train_pts, train_mask, tune_mask, test_mask = split_data() # will still use this after CV
+S, c, c_tune, c_test = create_problem_vectorized_data(train_mask, tune_mask, test_mask, num_train_pts)
+c_test_tch = torch.tensor(c_test)
+c_tune_tch = torch.tensor(c_tune)
+
layer = create_kalman_smoother(S, c)
+tau_tch = torch.tensor(.08, requires_grad=True)
+x_hat_orig, y_hat_orig, w_hat_orig = layer(tau_tch)
+
Fn = partial(F, y=c_tune_tch, measurement_mask=tune_mask)
+Ln = lambda y_hat : L(y=c_test_tch, measurement_mask=test_mask, y_hat=y_hat)
+
Ln(y_hat_orig)
+
tensor(54.9676, grad_fn=<MseLossBackward0>)+
Fn(y_hat=y_hat_orig, params=[tau_tch])
+
tensor(97.2082, grad_fn=<AddBackward0>)+
result = tune_kalman_smoother(layer, Fn, Ln, [tau_tch], num_iters=15, verbose=True, print_every=1)
+
test loss 001 | 54.96757 +train loss 001 | 97.20817 +test loss 002 | 54.94265 +train loss 002 | 96.94404 +test loss 003 | 55.01819 +train loss 003 | 96.85116 +test loss 004 | 54.98556 +train loss 004 | 96.65761 +test loss 005 | 54.86169 +train loss 005 | 96.31496 +test loss 006 | 54.89240 +train loss 006 | 96.26523 +test loss 007 | 54.89424 +train loss 007 | 96.15142 +test loss 008 | 54.87469 +train loss 008 | 96.03847 +test loss 009 | 54.86414 +train loss 009 | 95.93140 +test loss 010 | 54.87093 +train loss 010 | 95.86843 +test loss 011 | 54.85492 +train loss 011 | 95.77033 +test loss 012 | 54.82499 +train loss 012 | 95.64594 +test loss 013 | 54.81648 +train loss 013 | 95.56851 +test loss 014 | 54.80896 +train loss 014 | 95.49646 +test loss 015 | 54.79921 +train loss 015 | 95.42338 ++
result[0]
+
14+
theta_star = result[2][result[0]]   # parameter history indexed at the best-iterate index
+
theta_star
+
[0.328879177216338]+
theta_star_tch = torch.tensor(theta_star[0])
+
x_hat_star, y_hat_star, w_hat_star = layer(theta_star_tch)
+
plot_positions(traj=[[y, x_true, x_hat_orig.detach().numpy(), x_hat_star.numpy()]],
+ titles=['Observations vs. True vs. Recovery vs. Tuned Recovery'],
+ legends=[['Measurements', 'True', 'Recovery', 'Tuned Recovery']],
+ plot_args=[['ro', 'g', 'k', 'b']],
+ axis=[-4,14,-5,20])
+
Robust Smoothing¶
+num_train_pts, train_mask, tune_mask, test_mask = split_data() # will still use this after CV
+S, c, c_tune, c_test = create_problem_vectorized_data(train_mask, tune_mask, test_mask, num_train_pts)
+c_test_tch = torch.tensor(c_test)
+c_tune_tch = torch.tensor(c_tune)
+
robust_layer = create_robust_kalman_smoother(S, c)
+tau_tch = torch.tensor(2.0, requires_grad=True)
+rho_tch = torch.tensor(2.0, requires_grad=True)
+robust_params = [tau_tch, rho_tch]
+robust_x_hat_orig, robust_y_hat_orig, robust_w_hat_orig = robust_layer(*robust_params)
+
Fn = partial(F, y=c_tune_tch, measurement_mask=tune_mask)
+Ln = lambda y_hat : L(y=c_test_tch, measurement_mask=test_mask, y_hat=y_hat)
+
Ln(robust_y_hat_orig)
+
tensor(53.8178, grad_fn=<MseLossBackward0>)+
Fn(y_hat=robust_y_hat_orig, params=robust_params)
+
tensor(90.4367, grad_fn=<AddBackward0>)+
robust_result = tune_kalman_smoother(robust_layer, Fn, Ln, robust_params, num_iters=15, verbose=True, print_every=1)
+
test loss 001 | 53.81776 +train loss 001 | 90.43666 +test loss 002 | 53.81879 +train loss 002 | 90.41793 +test loss 003 | 53.81986 +train loss 003 | 90.40053 +test loss 004 | 53.82071 +train loss 004 | 90.38434 +test loss 005 | 53.82154 +train loss 005 | 90.36844 +test loss 006 | 53.82253 +train loss 006 | 90.35218 +test loss 007 | 53.82326 +train loss 007 | 90.33639 +test loss 008 | 53.82419 +train loss 008 | 90.32039 +test loss 009 | 53.82505 +train loss 009 | 90.30495 +test loss 010 | 53.82577 +train loss 010 | 90.28987 +test loss 011 | 53.82650 +train loss 011 | 90.27464 +test loss 012 | 53.82714 +train loss 012 | 90.25910 +test loss 013 | 53.82776 +train loss 013 | 90.24343 +test loss 014 | 53.82843 +train loss 014 | 90.22831 +test loss 015 | 53.82891 +train loss 015 | 90.21360 ++
last_theta = robust_result[2][-1]   # the final iterate of the parameter history
+last_theta_tch = [torch.tensor(p) for p in last_theta]
+
related_x = robust_layer(*last_theta_tch)[0]
+
plot_positions(traj=[[y, x_true, robust_x_hat_orig.detach().numpy(), related_x.numpy()]],
+ titles=['Observations vs. True vs. Recovery vs. Tuned Recovery'],
+ legends=[['Measurements', 'True', 'Recovery', 'Tuned Recovery']],
+ plot_args=[['ro', 'g', 'k', 'b']],
+ axis=[-4,14,-5,20])
+
References¶
- [BB20] S. Barratt and S. Boyd. "Fitting a Kalman smoother to data." 2019.
- [BV04] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
- [BV18] S. Boyd and L. Vandenberghe. Introduction to Applied Linear Algebra: Vectors, Matrices, and Least Squares. Cambridge University Press, 2018.
- [Kal60] R. Kalman. "A new approach to linear filtering and prediction problems." Journal of Basic Engineering, 82(1):35-45, 1960.
- [Mur23] K. Murphy. Probabilistic Machine Learning: Advanced Topics. MIT Press, 2023.
- [01] A. Agrawal, S. Barratt, S. Boyd, and B. Stellato. "Learning convex optimization control policies." In Proc. 2nd Annual Conference on Learning for Dynamics and Control, 2020.
Appendix¶
Huber reformulation¶
(The following is a variation of [BV04 ex4.8].)
+Consider the unconstrained optimization problem
+$$ +\begin{equation} +\text{minimize} \quad \sum_{t = 1}^{T} \phi_M(Ax_t - b_t), +\tag{A1} +\end{equation} +$$
+where $x_1, \ldots, x_T \in \mathbf{R}^{n}$ are the optimization variables, and +$\phi_M : \mathbf{R}^{n} \to \mathbf{R}$ given by
+$$ +\phi_M(u) = \begin{cases} +\left\lVert u \right\rVert_{2}^2 & \left\lVert u \right\rVert_{2} \le M \\ +M(2\left\lVert u \right\rVert_{2} - M) & \left\lVert u \right\rVert_{2} > M. +\end{cases} +$$ +is the Huber penalty function.
+The reformulation of (6) to (7) requires that we show equivalence between (A1) and
+$$ +\begin{equation} +\begin{array}{lll} +\text{minimize} \; & \sum_{t = 1}^{T} \left(u_t^2 + 2Mv_t \right) & \\ +\text{subject to} & \left\lVert Ax_t - b_t \right\rVert_{2} \le u_t + v_t, \; & t = 1, \ldots, T \\ +& 0 \preceq u \preceq M \mathbf{1} \; & \\ +& v \succeq 0, +\end{array} +\tag{A2} +\end{equation} +$$ +where $x_1, \ldots, x_T \in \mathbf{R}^{n}$, $u \in \mathbf{R}^{T}$, +and $v \in \mathbf{R}^{T}$ are the optimization variables.
+We make the (reasonable) assumption that there exists a $t \in \left\{ 1, \ldots, T \right\}$ such that $Ax_t \not = b_t$. +Now, consider (A2) with $x_1, \ldots, x_T$ fixed. At the optimum it is necessary that $u_t + v_t = \left\lVert Ax_t - b_t \right\rVert_{2}$ for every $t$: if instead $u_t + v_t > \left\lVert Ax_t - b_t \right\rVert_{2}$, then at least one of $u_t$ and $v_t$ is positive, and slightly decreasing it would remain feasible while reducing the objective.
+Therefore, we can eliminate $v_t$ from the original problem, (A2), using the substitution
+$$ +v_t = \left\lVert Ax_t - b_t \right\rVert_{2} - u_t, +$$
+with the additional constraints
+$$ +u_t \le \left\lVert Ax_t - b_t \right\rVert_{2}, +$$ +which ensures the substitution does not violate $v \succeq 0$, to form the equivalent problem
+$$ +\begin{array}{lll} +\text{minimize} \; & f_0(u) = \sum_{t=1}^{T} \left(u_t^2 - 2Mu_t + 2M\left\lVert Ax_t - b_t \right\rVert_{2} \right) & \\ +\text{subject to} & 0 \le u_t \le \min \left\{ M, \left\lVert Ax_t - b_t \right\rVert_{2}\right\}, \quad t=1, \ldots, T. & \\ +\end{array} +$$
+Considering solely the objective function, a sum of (convex) quadratic functions, calculus tells us that $u_t = M$ minimizes each summand of $f_0$: setting the derivative $\frac{d}{du_t}\left(u_t^2 - 2Mu_t\right) = 2u_t - 2M$ to zero gives $u_t = M$. However, if $\left\lVert Ax_t - b_t \right\rVert_{2} < M$ for some $t \in \left\{1, \ldots, T \right\}$, the constraint $u_t \le \left\lVert Ax_t - b_t \right\rVert_{2}$ binds, and $u_t = \left\lVert Ax_t - b_t \right\rVert_{2}$ becomes the minimizer of that particular summand (the quadratic is decreasing on $\left[0, M\right]$).
+Therefore, when $\left\lVert Ax_t - b_t \right\rVert_{2} \le M$, we choose $u_t = \left\lVert Ax_t - b_t \right\rVert_{2}$ and the $t\text{th}$ summand simplifies to +$$ +\left\lVert Ax_t - b_t \right\rVert_{2}^{2} - 2M\left\lVert Ax_t - b_t \right\rVert_{2} + 2M \left\lVert Ax_t - b_t \right\rVert_{2} = \left\lVert Ax_t - b_t \right\rVert_{2}^{2}. +$$
+Otherwise, if $\left\lVert Ax_t - b_t \right\rVert_{2} > M$, we choose $u_t = M$ and the $t\text{th}$ summand becomes
+$$ +M^2 - 2M^2 + 2M \left\lVert Ax_t - b_t \right\rVert_{2} = M(2\left\lVert Ax_t - b_t \right\rVert_{2} - M). +$$
+We conclude that, for a fixed $x$, the optimal value of (A2) (minimized over $u$ and $v$) is given by
+$$ +\sum_{t=1}^{T} \phi_M(Ax_t - b_t), +$$ +which is exactly the objective of (A1). Hence (A1) and (A2) are equivalent.
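As a quick numerical sanity check of this conclusion (not part of the original derivation), the sketch below fixes a random $x$, solves (A2) over $u$ and $v$ alone with CVXPY, and compares the optimal value against $\sum_{t} \phi_M(Ax_t - b_t)$ evaluated directly; all names (A_chk, b_chk, and so on) are local to the check.

np.random.seed(0)
T_chk, n_chk, M = 5, 3, 1.0
A_chk = np.random.randn(n_chk, n_chk)
b_chk = np.random.randn(n_chk, T_chk)
x_chk = np.random.randn(n_chk, T_chk)                  # x is held fixed

r = np.linalg.norm(A_chk @ x_chk - b_chk, axis=0)      # ||A x_t - b_t||_2 for each t
phi = np.where(r <= M, r**2, M * (2 * r - M))          # Huber penalty, summand by summand

u = cp.Variable(T_chk)
v = cp.Variable(T_chk)
prob = cp.Problem(cp.Minimize(cp.sum_squares(u) + 2 * M * cp.sum(v)),
                  [u + v >= r, u >= 0, u <= M, v >= 0])
prob.solve()
print(prob.value, phi.sum())    # the two values should agree up to solver tolerance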
+Indexing and vectorization¶
Consider the matrix containing observations, $y \in \mathbf{R}^{p \times T}$, found in the smoother implementation section where $p = 2$ and $T = 1000$, and expressed as
+$$ +y = \begin{bmatrix} +y_{11} & y_{12} & \cdots & y_{1,1000} \\ +y_{21} & y_{22} & \cdots & y_{2, 1000} \\ +\end{bmatrix}. +$$
+More intuitively, the columns of $y$ are the measurement vectors $y_t$ over the $T$ time periods.
+Let $N = Tp$ ($= 2000$, in this instance). Flattening $y$ in row-major order yields the vector $\textbf{vec}_{\text{R}} \, y \in \mathbf{R}^{N}$, defined as
+$$ +\textbf{vec}_{\text{R}} \, y = (y_{11}, y_{12}, \ldots, y_{1,1000}, y_{21}, y_{22}, \ldots, y_{2, 1000}). +$$
+In words, $\textbf{vec}_{\text{R}} \, y$ is obtained by taking the first row of $y$ (the first component of every measurement) and appending to it the second row of $y$.
+Flattening $y$ in column-major order yields the vector $\textbf{vec}_{\text{F}} \, y \in \mathbf{R}^{N}$, defined as
+$$ +\textbf{vec}_{\text{F}} \, y = (y_{11}, y_{21}, y_{12}, y_{22}, \ldots, y_{1, 1000}, y_{2, 1000}). +$$
+In words, $\textbf{vec}_{\text{F}} \, y$ is obtained by concatenating all measurement vectors, starting with $y_1$ and ending with $y_T$.
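For concreteness, here is a small NumPy illustration of the two orderings on a toy $2 \times 3$ array (rather than the $2 \times 1000$ matrix $y$):

y_small = np.arange(1, 7).reshape(2, 3)    # [[1, 2, 3], [4, 5, 6]]
print(y_small.flatten(order='C'))          # row-major, i.e. vec_R: [1 2 3 4 5 6]
print(y_small.flatten(order='F'))          # column-major, i.e. vec_F: [1 4 2 5 3 6]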
+Related NumPy (and PyTorch, loosely)¶
Let `y` be a `np.ndarray` with shape `(p, T)` representing $y$.
- `y.flat` produces a 1-D iterator over the array. The iterator moves through `y` in a $\textbf{vec}_{\text{R}} \, y$ fashion.
- Let `F` be a `np.ndarray` with shape `(T, p)` and `dtype == bool` (a $T \times p$ mask matrix) such that only `F[0, 0]`, `F[0, 1]`, `F[999, 0]`, and `F[999, 1]` are `True`. Taking `c = y.T[F]`, `c` is the implementation version of +$$ +c = (y_{11}, y_{21}, y_{1, 1000}, y_{2, 1000}). +$$ +(A short numerical illustration of both bullets follows below.)
+
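A short illustration of both bullets on a toy array with $p = 2$ and $T = 3$ (the names below are local to this demo):

p_demo, T_demo = 2, 3
y_demo = np.arange(1, p_demo * T_demo + 1).reshape(p_demo, T_demo)   # [[1, 2, 3], [4, 5, 6]]
F_demo = np.zeros((T_demo, p_demo), dtype=bool)
F_demo[0, :] = True           # keep both components of the first measurement ...
F_demo[-1, :] = True          # ... and of the last measurement
print(list(y_demo.flat))      # row-major iteration: [1, 2, 3, 4, 5, 6]
print(y_demo.T[F_demo])       # (y_11, y_21, y_1T, y_2T): [1 4 3 6]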
Related CVXPY¶
Let `y_hat` be a `cp.Variable` representing $\hat{y}$. Taking `z = cp.vec(y_hat, order='F')`, `z` is an `Expression` object that can be understood (and used) as $\textbf{vec}_{\text{F}} \, \hat{y}$.
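A minimal sketch of the same idea on a toy $2 \times 3$ variable (assuming a CVXPY version that accepts the order argument used above):

y_hat_demo = cp.Variable((2, 3))
z_demo = cp.vec(y_hat_demo, order='F')    # plays the role of vec_F(y_hat)
print(z_demo.shape)                       # (6,)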
Creating the selector matrix¶
Consider the two smoother formulations (and related code). When `c` is passed to either smoother-layer creation function, it is some subset of the output measurements flattened in column-major order (e.g., the `c` created above). The vector of output optimization variables, `z`, is likewise flattened in column-major order. To implement the equality constraint $Sz = c$, we therefore just need to build the (sparse) selector matrix $S$ exactly as it exists mathematically. The following code chunk, taken from `create_problem_vectorized_data`, does exactly that:
time_indices, measurement_indices = F.nonzero()
+S = sparse.csc_matrix(
+ (
+ np.ones(num_measurements), # data filled into the sparse array
+ (
+ np.arange(num_measurements),
+ time_indices*p + measurement_indices
+ )
+ ),
+ shape=(num_measurements, N)
+)
+
The variables `time_indices` and `measurement_indices` are 1-D arrays containing the row and column indices of `F` that are nonzero (or rather, that are `True`). The indices are obtained by iterating over `F` in a row-major fashion. (Note that for our vehicle tracking example, `time_indices` will contain values ranging from $0$ to $999$ and `measurement_indices` can only contain the values $0$ or $1$.) In other words, the $j\text{th}$ entries of `time_indices` and `measurement_indices` together give the pair $(t, i)$, i.e., the time index and measurement component, of the $j\text{th}$ entry of the vector `c`.
Because the set $\mathcal{F}$ is implemented as the mask matrix `F` rather than as an explicit list of index pairs, $S$ is built from `F.nonzero()` as shown above rather than from $\mathcal{F}$ directly. To see what $S$ looks like, picture the matrix-vector product $Sz$: row $j$ of $S$ contains a single $1$, placed in column `time_indices[j]*p + measurement_indices[j]`, which is precisely the position of the $j\text{th}$ retained measurement in $\textbf{vec}_{\text{F}} \, \hat{y}$. Multiplying $S$ by $z$ therefore selects exactly those entries, giving $Sz = c$.
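A small numerical check of this construction on toy sizes (all names below are local to the check): multiplying $S$ by the column-major flattening of $y$ should reproduce the masked selection y.T[F].

p_sel, T_sel = 2, 4
N_sel = p_sel * T_sel
y_sel = np.arange(N_sel, dtype=float).reshape(p_sel, T_sel)
F_sel = np.zeros((T_sel, p_sel), dtype=bool)
F_sel[0, 0] = F_sel[0, 1] = F_sel[3, 1] = True           # keep three measurements
num_meas = int(F_sel.sum())

t_idx, i_idx = F_sel.nonzero()                           # row-major over F_sel
S_sel = sparse.csc_matrix(
    (np.ones(num_meas), (np.arange(num_meas), t_idx * p_sel + i_idx)),
    shape=(num_meas, N_sel))

print(y_sel.T[F_sel])                     # direct mask selection
print(S_sel @ y_sel.flatten(order='F'))   # the same entries, selected via S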
(informal) Testing¶
+indices = np.arange(12)
+np.random.shuffle(indices)
+splits = np.array_split(indices, 3)
+print(splits[1])
+
+mask_matrix = np.zeros((4, 3))
+mask_matrix.flat[splits[1]] = True
+mask_matrix = mask_matrix.astype(bool)
+
+y_random = np.random.randn(4, 3)
+print(y_random[mask_matrix])
+print(y_random.flatten())
+
[8 0 1 2] +[-0.60331112 -0.74586404 0.61854093 -0.74426384] +[-0.60331112 -0.74586404 0.61854093 1.18430381 1.96604923 0.98671515 + 0.28121689 -0.95360892 -0.74426384 -1.18353915 -0.94735021 -0.22503192] ++
test_mask = np.zeros(N, dtype=bool)
+test_mask = test_mask.reshape((T, p))
+
test_mask
+
array([[False, False], + [False, False], + [False, False], + ..., + [False, False], + [False, False], + [False, False]])+
test_mask.flat
+
<numpy.flatiter at 0x7fb497ed5200>+
arr1 = np.array([1, 2, 3])
+
arr1.ndim
+
1+
+