采取Informer模型进行ETTh数据集中指标的预测,并且在Informer模型的基础上加入了滑动时间窗口改进。Informer模型的实质是注意力机制+Transformer模型,Informer模型的核心思想是将输入序列进行自注意力机制的处理,以捕捉序列中的长期依赖关系,并利用Transformer的编码器-解码器结构进行预测。
Informer是一种用于长序列时间序列预测的Transformer模型,但是它与传统的Transformer模型又有些不同点,与传统的Transformer模型相比,Informer具有以下几个独特的特点:
- ProbSparse自注意力机制:Informer引入了ProbSparse自注意力机制,该机制在时间复杂度和内存使用方面达到了O(Llog L)的水平,能够有效地捕捉序列之间的长期依赖关系。
- 自注意力蒸馏:通过减少级联层的输入,自注意力蒸馏技术可以有效处理极长的输入序列,提高了模型处理长序列的能力。
- 生成式解码器:Informer采用生成式解码器,可以一次性预测整个长时间序列,而不是逐步进行预测。这种方式大大提高了长序列预测的推理速度。
Informer模型概述见下图,左侧:编码器接收大规模的长序列输入(绿色序列),使用提出的ProbSparse自注意力替代传统的自注意力。蓝色梯形表示自注意力蒸馏操作,用于提取主导的注意力,大幅减小网络大小。层堆叠的副本增加了模型的稳健性。右侧:解码器接收长序列输入,将目标元素填充为零,测量特征图的加权注意力组合,并以生成式风格即时预测输出元素(橙色序列)。
Informer模型编码器结构的视觉表示如下图,以下是对其内容的解释:
-
编码器堆栈:图像中的水平堆栈代表Informer编码器结构中的一个编码器副本。每个堆栈都是一个独立单元,处理部分或全部输入序列。
-
主堆栈:图中显示的主堆栈处理整个输入序列。主堆栈之后,第二个堆栈处理输入序列的一半,以此类推,每个后续的堆栈都处理上一个堆栈输入的一半。
-
点积矩阵:堆栈内的红色层是点积矩阵,它们是自注意力机制的一部分。通过在每一层应用自注意力蒸馏,这些矩阵的大小逐层递减,可能降低了计算复杂度,并集中于序列中最相关的信息。
-
输出的拼接:通过自注意力机制处理后,所有堆栈的特征图被拼接起来,形成编码器的最终输出。然后,模型的后续部分(如解码器)通常使用这个输出,基于输入序列中学习到的特征和关系生成预测。
总体来说,Informer模型,成功提高了在LSTF问题中的预测能力,验证了类似Transformer的模型在捕捉长序列时间序列输出和输入之间的个体长期依赖关系方面的潜在价值。
提出了ProbSparse自注意力机制,以高效地替代传统的自注意力机制, 提出了自注意力蒸馏操作,可优化J个堆叠层中主导的注意力得分,并将总空间复杂度大幅降低。 提出了生成式风格的解码器,只需要一步前向传播即可获得长序列输出,同时避免在推理阶段累积误差的传播。
在给出的数据集中一共有8列,第一列日期(date)会处理成4维向量(freq='h'的情况下)作为时间戳输入到Encoder和Decoder中进行embedding。如果进行多变量预测任务,则预测为后7列变量的值,如果进行的是单变量预测任务,则预测最后一列变量的值。Encoder和Decoder输入的embedding包括三个部分:数据的embedding(下图中Scalar部分)、位置编码(下图中Local Time Stamp部分)以及时间戳编码(下图中Global Time Stamp部分)。
对输入的原始数据进行一个1维卷积得到,将输入数据从$C_{in}(=7)$维映射为$d_{model}(=512)$维。
class TokenEmbedding(nn.Module):
def __init__(self, c_in, d_model):
super(TokenEmbedding, self).__init__()
padding = 1 if torch.__version__>='1.5.0' else 2
self.tokenConv = nn.Conv1d(in_channels=c_in, out_channels=d_model,
kernel_size=3, padding=padding, padding_mode='circular')
for m in self.modules():
if isinstance(m, nn.Conv1d):
nn.init.kaiming_normal_(m.weight,mode='fan_in',nonlinearity='leaky_relu')
def forward(self, x):
x = self.tokenConv(x.permute(0, 2, 1)).transpose(1,2)
return x
输入序列中的元素是一起处理的,而不是像RNN一样是一个一个处理的,这样做加快了速度但是忽略了序列中元素的先后关系,这个时候就需要位置编码。
位置编码的公式为: $$ PE_{(pos,2i)}=\sin(pos/10000^{2i/d_{model}})\ PE_{(pos,2i+1)}=\cos(pos/10000^{2i/d_{model}}) $$
class PositionalEmbedding(nn.Module):
def __init__(self, d_model, max_len=5000):
super(PositionalEmbedding, self).__init__()
# Compute the positional encodings once in log space.
pe = torch.zeros(max_len, d_model).float()
pe.require_grad = False
position = torch.arange(0, max_len).float().unsqueeze(1)
div_term = (torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model)).exp()
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)
pe = pe.unsqueeze(0)
self.register_buffer('pe', pe)
def forward(self, x):
return self.pe[:, :x.size(1)]
对时间戳的编码主要分为TemporalEmbedding和TimeFeatureEmbedding这两种方式,前者使用month_embed、day_embed、weekday_embed、hour_embed和minute_embed(可选)多个embedding层处理输入的时间戳,将结果相加;后者直接使用一个全连接层将输入的时间戳映射到512维的embedding。
TemporalEmbedding中的embedding层可以使用Pytorch自带的embedding层,再训练参数,也可以使用定义的FixedEmbedding,它使用位置编码作为embedding的参数,不需要训练参数。
class FixedEmbedding(nn.Module):
def __init__(self, c_in, d_model):
super(FixedEmbedding, self).__init__()
w = torch.zeros(c_in, d_model).float()
w.require_grad = False
position = torch.arange(0, c_in).float().unsqueeze(1)
div_term = (torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model)).exp()
w[:, 0::2] = torch.sin(position * div_term)
w[:, 1::2] = torch.cos(position * div_term)
self.emb = nn.Embedding(c_in, d_model)
self.emb.weight = nn.Parameter(w, requires_grad=False)
def forward(self, x):
return self.emb(x).detach()
class TemporalEmbedding(nn.Module):
def __init__(self, d_model, embed_type='fixed', freq='h'):
super(TemporalEmbedding, self).__init__()
minute_size = 4; hour_size = 24
weekday_size = 7; day_size = 32; month_size = 13
Embed = FixedEmbedding if embed_type=='fixed' else nn.Embedding
if freq=='t':
self.minute_embed = Embed(minute_size, d_model)
self.hour_embed = Embed(hour_size, d_model)
self.weekday_embed = Embed(weekday_size, d_model)
self.day_embed = Embed(day_size, d_model)
self.month_embed = Embed(month_size, d_model)
def forward(self, x):
x = x.long()
minute_x = self.minute_embed(x[:,:,4]) if hasattr(self, 'minute_embed') else 0.
hour_x = self.hour_embed(x[:,:,3])
weekday_x = self.weekday_embed(x[:,:,2])
day_x = self.day_embed(x[:,:,1])
month_x = self.month_embed(x[:,:,0])
return hour_x + weekday_x + day_x + month_x + minute_x
class TimeFeatureEmbedding(nn.Module):
def __init__(self, d_model, embed_type='timeF', freq='h'):
super(TimeFeatureEmbedding, self).__init__()
freq_map = {'h':4, 't':5, 's':6, 'm':1, 'a':1, 'w':2, 'd':3, 'b':3}
d_inp = freq_map[freq]
self.embed = nn.Linear(d_inp, d_model)
def forward(self, x):
return self.embed(x)
最后将这三部分的embedding加起来,就得到了最终的embedding。
class DataEmbedding(nn.Module):
def __init__(self, c_in, d_model, embed_type='fixed', freq='h', dropout=0.1):
super(DataEmbedding, self).__init__()
self.value_embedding = TokenEmbedding(c_in=c_in, d_model=d_model)
self.position_embedding = PositionalEmbedding(d_model=d_model)
self.temporal_embedding = TemporalEmbedding(d_model=d_model, embed_type=embed_type, freq=freq) if embed_type!='timeF' else TimeFeatureEmbedding(d_model=d_model, embed_type=embed_type, freq=freq)
self.dropout = nn.Dropout(p=dropout)
def forward(self, x, x_mark):
x = self.value_embedding(x) + self.position_embedding(x) + self.temporal_embedding(x_mark)
return self.dropout(x)
使用了两种attention,一种是普通的多头自注意力层(FullAttention),一种是Informer新提出来的ProbSparse self-attention层(ProbAttention)。
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
from math import sqrt
from utils.masking import TriangularCausalMask, ProbMask
class FullAttention(nn.Module):
def __init__(self, mask_flag=True, factor=5, scale=None, attention_dropout=0.1, output_attention=False):
super(FullAttention, self).__init__()
self.scale = scale
self.mask_flag = mask_flag
self.output_attention = output_attention
self.dropout = nn.Dropout(attention_dropout)
def forward(self, queries, keys, values, attn_mask):
B, L, H, E = queries.shape
_, S, _, D = values.shape
scale = self.scale or 1./sqrt(E)
scores = torch.einsum("blhe,bshe->bhls", queries, keys)
if self.mask_flag:
if attn_mask is None:
attn_mask = TriangularCausalMask(B, L, device=queries.device)
scores.masked_fill_(attn_mask.mask, -np.inf)
A = self.dropout(torch.softmax(scale * scores, dim=-1))
V = torch.einsum("bhls,bshd->blhd", A, values)
if self.output_attention:
return (V.contiguous(), A)
else:
return (V.contiguous(), None)
Informer模型中提出了一种新的注意力层——ProbSparse Self-Attention,Q,K,V为输入的embedding分别乘上一个权重矩阵得到的query、key、value。ProbSparse Self-Attention首先对K进行采样,得到K_sample,对每个$q_i\in Q$关于K_sample求M值:
$$
M(q_i,K)=\max_j{\frac{q_ik_k^\top}{\sqrt{d}}}-\frac{1}{L_K}\sum_{j=1}^{L_K}\frac{q_ik_k^\top}{\sqrt{d}}
$$
找到M值最大的u个
class ProbAttention(nn.Module):
def __init__(self, mask_flag=True, factor=5, scale=None, attention_dropout=0.1, output_attention=False):
super(ProbAttention, self).__init__()
self.factor = factor
self.scale = scale
self.mask_flag = mask_flag
self.output_attention = output_attention
self.dropout = nn.Dropout(attention_dropout)
def _prob_QK(self, Q, K, sample_k, n_top): # n_top: c*ln(L_q)
# Q [B, H, L, D]
B, H, L_K, E = K.shape
_, _, L_Q, _ = Q.shape
# calculate the sampled Q_K
K_expand = K.unsqueeze(-3).expand(B, H, L_Q, L_K, E)
index_sample = torch.randint(L_K, (L_Q, sample_k)) # real U = U_part(factor*ln(L_k))*L_q
K_sample = K_expand[:, :, torch.arange(L_Q).unsqueeze(1), index_sample, :]
Q_K_sample = torch.matmul(Q.unsqueeze(-2), K_sample.transpose(-2, -1)).squeeze()
# find the Top_k query with sparisty measurement
M = Q_K_sample.max(-1)[0] - torch.div(Q_K_sample.sum(-1), L_K)
M_top = M.topk(n_top, sorted=False)[1]
# use the reduced Q to calculate Q_K
Q_reduce = Q[torch.arange(B)[:, None, None],
torch.arange(H)[None, :, None],
M_top, :] # factor*ln
Q_K = torch.matmul(Q_reduce, K.transpose(-2, -1)) # factor*ln(L_q)*L_k
return Q_K, M_top
def _get_initial_context(self, V, L_Q):
B, H, L_V, D = V.shape
if not self.mask_flag:
# V_sum = V.sum(dim=-2)
V_sum = V.mean(dim=-2)
contex = V_sum.unsqueeze(-2).expand(B, H, L_Q, V_sum.shape[-1]).clone()
else: # use mask
assert(L_Q == L_V) # requires that L_Q == L_V, i.e. for self-attention only
contex = V.cumsum(dim=-2)
return contex
def _update_context(self, context_in, V, scores, index, L_Q, attn_mask):
B, H, L_V, D = V.shape
if self.mask_flag:
attn_mask = ProbMask(B, H, L_Q, index, scores, device=V.device)
scores.masked_fill_(attn_mask.mask, -np.inf)
attn = torch.softmax(scores, dim=-1) # nn.Softmax(dim=-1)(scores)
context_in[torch.arange(B)[:, None, None],
torch.arange(H)[None, :, None],
index, :] = torch.matmul(attn, V).type_as(context_in)
if self.output_attention:
attns = (torch.ones([B, H, L_V, L_V])/L_V).type_as(attn).to(attn.device)
attns[torch.arange(B)[:, None, None], torch.arange(H)[None, :, None], index, :] = attn
return (context_in, attns)
else:
return (context_in, None)
def forward(self, queries, keys, values, attn_mask):
B, L_Q, H, D = queries.shape
_, L_K, _, _ = keys.shape
queries = queries.transpose(2,1)
keys = keys.transpose(2,1)
values = values.transpose(2,1)
U_part = self.factor * np.ceil(np.log(L_K)).astype('int').item() # c*ln(L_k)
u = self.factor * np.ceil(np.log(L_Q)).astype('int').item() # c*ln(L_q)
U_part = U_part if U_part<L_K else L_K
u = u if u<L_Q else L_Q
scores_top, index = self._prob_QK(queries, keys, sample_k=U_part, n_top=u)
# add scale factor
scale = self.scale or 1./sqrt(D)
if scale is not None:
scores_top = scores_top * scale
# get the context
context = self._get_initial_context(values, L_Q)
# update the context with selected top_k queries
context, attn = self._update_context(context, values, scores_top, index, L_Q, attn_mask)
return context.contiguous(), attn
AttentionLayer是定义的attention层,会先将输入的embedding分别通过线性映射得到query、key、value。还将输入维度$d_{model}$划分为多头,接着就执行前面定义的attention操作,最后经过一个线性映射得到输出。
class AttentionLayer(nn.Module):
def __init__(self, attention, d_model, n_heads, d_keys=None,
d_values=None):
super(AttentionLayer, self).__init__()
d_keys = d_keys or (d_model//n_heads)
d_values = d_values or (d_model//n_heads)
self.inner_attention = attention
self.query_projection = nn.Linear(d_model, d_keys * n_heads)
self.key_projection = nn.Linear(d_model, d_keys * n_heads)
self.value_projection = nn.Linear(d_model, d_values * n_heads)
self.out_projection = nn.Linear(d_values * n_heads, d_model)
self.n_heads = n_heads
def forward(self, queries, keys, values, attn_mask):
B, L, _ = queries.shape
_, S, _ = keys.shape
H = self.n_heads
queries = self.query_projection(queries).view(B, L, H, -1)
keys = self.key_projection(keys).view(B, S, H, -1)
values = self.value_projection(values).view(B, S, H, -1)
out, attn = self.inner_attention(
queries,
keys,
values,
attn_mask
)
out = out.view(B, L, -1)
return self.out_projection(out), attn
ConvLayer类实现的是Informer中的Distilling操作,本质上就是一个1维卷积+ELU激活函数+最大池化。公式如下: $$ X_{j+1}^t=MaxPool(ELU(Conv1d[X_j^t]_{AB})) $$
import torch
import torch.nn as nn
import torch.nn.functional as F
class ConvLayer(nn.Module):
def __init__(self, c_in):
super(ConvLayer, self).__init__()
self.downConv = nn.Conv1d(in_channels=c_in,
out_channels=c_in,
kernel_size=3,
padding=2,
padding_mode='circular')
self.norm = nn.BatchNorm1d(c_in)
self.activation = nn.ELU()
self.maxPool = nn.MaxPool1d(kernel_size=3, stride=2, padding=1)
def forward(self, x):
x = self.downConv(x.permute(0, 2, 1))
x = self.norm(x)
x = self.activation(x)
x = self.maxPool(x)
x = x.transpose(1,2)
return x
EncoderLayer类实现的是一个Encoder层,整体架构和Transformer是大致相同的,主要包含两个子层:多头注意力层(Informer中改为提出的ProbSparse Self-Attention层)和两个线性映射组成的前馈层(Feed Forward),两个子层后都带有一个批量归一化层,子层之间有跳跃连接。
class EncoderLayer(nn.Module):
def __init__(self, attention, d_model, d_ff=None, dropout=0.1, activation="relu"):
super(EncoderLayer, self).__init__()
d_ff = d_ff or 4*d_model
self.attention = attention
self.conv1 = nn.Conv1d(in_channels=d_model, out_channels=d_ff, kernel_size=1)
self.conv2 = nn.Conv1d(in_channels=d_ff, out_channels=d_model, kernel_size=1)
self.norm1 = nn.LayerNorm(d_model)
self.norm2 = nn.LayerNorm(d_model)
self.dropout = nn.Dropout(dropout)
self.activation = F.relu if activation == "relu" else F.gelu
def forward(self, x, attn_mask=None):
# x [B, L, D]
# x = x + self.dropout(self.attention(
# x, x, x,
# attn_mask = attn_mask
# ))
new_x, attn = self.attention(
x, x, x,
attn_mask = attn_mask
)
x = x + self.dropout(new_x)
y = x = self.norm1(x)
y = self.dropout(self.activation(self.conv1(y.transpose(-1,1))))
y = self.dropout(self.conv2(y).transpose(-1,1))
return self.norm2(x+y), attn
Encoder类是将前面定义的Encoder层和Distilling操作组织起来,形成一个Encoder模块。其中distilling层总比EncoderLayer少一层,即最后一层EncoderLayer后不再做distilling操作。
class Encoder(nn.Module):
def __init__(self, attn_layers, conv_layers=None, norm_layer=None):
super(Encoder, self).__init__()
self.attn_layers = nn.ModuleList(attn_layers)
self.conv_layers = nn.ModuleList(conv_layers) if conv_layers is not None else None
self.norm = norm_layer
def forward(self, x, attn_mask=None):
# x [B, L, D]
attns = []
if self.conv_layers is not None:
for attn_layer, conv_layer in zip(self.attn_layers, self.conv_layers):
x, attn = attn_layer(x, attn_mask=attn_mask)
x = conv_layer(x)
attns.append(attn)
x, attn = self.attn_layers[-1](x)
attns.append(attn)
else:
for attn_layer in self.attn_layers:
x, attn = attn_layer(x, attn_mask=attn_mask)
attns.append(attn)
if self.norm is not None:
x = self.norm(x)
return x, attns
为了增强distilling操作的鲁棒性,文章中提到可以采用多个replicas并行执行,不同replicas采用不同长度的embedding(L、L/2、L/4、...),embedding长度减半对应的attention层也减少一层,distilling层也会随之减少一层,最终得到的结果拼接起来作为输出。
class EncoderStack(nn.Module):
def __init__(self, encoders, inp_lens):
super(EncoderStack, self).__init__()
self.encoders = nn.ModuleList(encoders)
self.inp_lens = inp_lens
def forward(self, x, attn_mask=None):
# x [B, L, D]
x_stack = []; attns = []
for i_len, encoder in zip(self.inp_lens, self.encoders):
inp_len = x.shape[1]//(2**i_len)
x_s, attn = encoder(x[:, -inp_len:, :])
x_stack.append(x_s); attns.append(attn)
x_stack = torch.cat(x_stack, -2)
return x_stack, attns
Decoder部分结构可以参考Transformer中的Decoder结构,包括两层attention层和一个两层线性映射的Feed Forward部分。
需要注意的是,第一个attention层中的query、key、value都是根据Decoder输入的embedding乘上权重矩阵得到的,而第二个attention层中的query是根据前面attention层的输出乘上权重矩阵得到的,key和value是根据Encoder的输出乘上权重矩阵得到的。
import torch
import torch.nn as nn
import torch.nn.functional as F
class DecoderLayer(nn.Module):
def __init__(self, self_attention, cross_attention, d_model, d_ff=None,
dropout=0.1, activation="relu"):
super(DecoderLayer, self).__init__()
d_ff = d_ff or 4*d_model
self.self_attention = self_attention
self.cross_attention = cross_attention
self.conv1 = nn.Conv1d(in_channels=d_model, out_channels=d_ff, kernel_size=1)
self.conv2 = nn.Conv1d(in_channels=d_ff, out_channels=d_model, kernel_size=1)
self.norm1 = nn.LayerNorm(d_model)
self.norm2 = nn.LayerNorm(d_model)
self.norm3 = nn.LayerNorm(d_model)
self.dropout = nn.Dropout(dropout)
self.activation = F.relu if activation == "relu" else F.gelu
def forward(self, x, cross, x_mask=None, cross_mask=None):
x = x + self.dropout(self.self_attention(
x, x, x,
attn_mask=x_mask
)[0])
x = self.norm1(x)
x = x + self.dropout(self.cross_attention(
x, cross, cross,
attn_mask=cross_mask
)[0])
y = x = self.norm2(x)
y = self.dropout(self.activation(self.conv1(y.transpose(-1,1))))
y = self.dropout(self.conv2(y).transpose(-1,1))
return self.norm3(x+y)
class Decoder(nn.Module):
def __init__(self, layers, norm_layer=None):
super(Decoder, self).__init__()
self.layers = nn.ModuleList(layers)
self.norm = norm_layer
def forward(self, x, cross, x_mask=None, cross_mask=None):
for layer in self.layers:
x = layer(x, cross, x_mask=x_mask, cross_mask=cross_mask)
if self.norm is not None:
x = self.norm(x)
return x
mask机制用在Decoder的第一个attention层中,目的是为了保证t时刻解码的输出只依赖于t时刻之前的输出。
生成的mask矩阵右上角部分为1(不包括对角线),将mask矩阵作用到score矩阵上会使得mask矩阵中为1的位置在score矩阵中为 −∞ ,这样softmax后就为0。TriangularCausalMask是用在Fullattention层上的,ProbMask是用在ProbSparseAttention层上的。
import torch
class TriangularCausalMask():
def __init__(self, B, L, device="cpu"):
mask_shape = [B, 1, L, L]
with torch.no_grad():
self._mask = torch.triu(torch.ones(mask_shape, dtype=torch.bool), diagonal=1).to(device)
@property
def mask(self):
return self._mask
class ProbMask():
def __init__(self, B, H, L, index, scores, device="cpu"):
_mask = torch.ones(L, scores.shape[-1], dtype=torch.bool).to(device).triu(1)
_mask_ex = _mask[None, None, :].expand(B, H, L, scores.shape[-1])
indicator = _mask_ex[torch.arange(B)[:, None, None],
torch.arange(H)[None, :, None],
index, :].to(device)
self._mask = indicator.view(scores.shape).to(device)
@property
def mask(self):
return self._mask
更详细的结果放在了压缩包中。总体预测结果如下所示,其中每一列都代表了一个独立的指标。
各个指标的预测结果如下所示,其中对于每一个预测结果,我们以2000作为一个数据的批次,将17500个数据点分为了9张图展示。
Mean Squared Error (MSE) 和 Mean Absolute Error (MAE) 是用来评估估计器、预测或模型质量的统计度量。MSE 是错误的平方的平均值,即预估值和实际值之间平方差的平均值。而 MAE 则是预测值和实际观察值之间绝对差异的平均值,其计算公式如下: $$ MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}i)^2,\ MAE = \frac{1}{n}\sum{i=1}^{n}|y_i - \hat{y}_i|. $$ 下图展示了部分滚动时间内的MAE和MSE数值: