%%%%%%%%%%%%%%%%%%%%%%% file typeinst.tex %%%%%%%%%%%%%%%%%%%%%%%%%
%
% This is the LaTeX source for the instructions to authors using
% the LaTeX document class 'llncs.cls' for contributions to
% the Lecture Notes in Computer Sciences series.
% http://www.springer.com/lncs Springer Heidelberg 2006/05/04
%
% It may be used as a template for your own input - copy it
% to a new file with a new name and use it as the basis
% for your article.
%
% NB: the document class 'llncs' has its own and detailed documentation, see
% ftp://ftp.springer.de/data/pubftp/pub/tex/latex/llncs/latex2e/llncsdoc.pdf
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\documentclass[runningheads,a4paper]{llncs}
\usepackage{graphicx}
\usepackage{amssymb}
\setcounter{tocdepth}{3}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{color}
\usepackage{tikz}
\usepackage{pgfplots}
\usepackage{caption}
\usepackage{subcaption}
\usepackage{comment}
\usepackage{booktabs}
\usepackage{nopageno}
\usepackage[algo2e, noend, noline, linesnumbered]{algorithm2e}
%\DontPrintSemicolon
\makeatletter
\newcommand{\pushline}{\Indp}% Indent
\newcommand{\popline}{\Indm}
\makeatother
\DeclareMathOperator{\pess}{pess}
\DeclareMathOperator{\opti}{opti}
\newcommand{\argmax}{\operatornamewithlimits{argmax}}
\captionsetup{compatibility=false}
%\pgfplotsset{compat=newest}
\usetikzlibrary{arrows,shapes,petri}
\newcommand{\bE}{\mathbb{E}}
\newcommand{\bI}{\mathbb{I}}
\newcommand{\cA}{\mathcal{A}}
\newcommand{\cC}{\mathcal{C}}
\newcommand{\cD}{\mathcal{D}}
\newcommand{\cI}{\mathcal{I}}
\newcommand{\cN}{\mathcal{N}}
\newcommand{\cO}{\mathcal{O}}
\newcommand{\cS}{\mathcal{S}}
\newcommand{\cT}{\mathcal{T}}
\newcommand{\cZ}{\mathcal{Z}}
%\newcommand{\Pr}{\mathbf{Pr}}
\newcommand{\eg}{{\it e.g.,}~}
\newcommand{\ie}{{\it i.e.,}~}
\newcommand{\redbold}[1]{\textbf{\color{red}#1}}
\newcommand{\markw}[1]{\textbf{\color{red} /* #1 (markw) */}}
\newcommand{\marcl}[1]{\textbf{\color{red} /* #1 (marcl) */}}
\newcommand{\toexpand}[1]{{\it $\ll$ #1 ... $\gg$ }}
\usepackage{url}
\urldef{\emails}\path|{marc.lanctot,m.winands}@maastrichtuniversity.nl,[email protected]|
%\urldef{\mailsb}\path|anna.kramer, leonie.kunz, christine.reiss, nicole.sator,|
%\urldef{\mailsc}\path|erika.siebert-cole, peter.strasser, lncs}@springer.com|
\newcommand{\keywords}[1]{\par\addvspace\baselineskip
\noindent\keywordname\enspace\ignorespaces#1}
\begin{document}
\mainmatter % start of an individual contribution
% first the title is needed
\title{Modeling Extreme Values in Monte Carlo Tree Search}
% a short form should be given in case it is too long for the running head
%\titlerunning{Monte Carlo Tree Search in Simultaneous Move Games with Applications to Goofspiel}
\titlerunning{ }
% the name(s) of the author(s) follow(s) next
%
% NB: Chinese authors should write their first names(s) in front of
% their surnames. This ensures that the names appear correctly in
% the running heads and the author index.
%
\author{ }
%
\authorrunning{ }
% (feature abused for this document to repeat the title also on left hand pages)
% the affiliations are given next; don't give your e-mail address
% unless you accept that it will be published
% Agent Technology Center, Dept. of Computer Science, FEE, Czech Technical University in Prague
%
\institute{ }
%\institute{$^1$Department of Knowledge Engineering, \hspace{1cm} $^2$Department of Computer Science\\
%\hspace{0.5cm}Maastricht University, Netherlands \hspace{1.1cm} Czech Technical University in Prague \\
%\emails\\
%}
%Viliam Lis{\' y}
%Agent Technology Center, Dept. of Computer Science and Engineering, FEE, Czech Technical University in Prague
%
% NB: a more complex sample for affiliations and the mapping to the
% corresponding authors can be found in the file "llncs.dem"
% (search for the string "\mainmatter" where a contribution starts).
% "llncs.dem" accompanies the document class "llncs.cls".
%
%\toctitle{}
%\tocauthor{Authors' Instructions}
\maketitle
\begin{abstract}
Abstract
\end{abstract}
\section{Introduction}
Monte Carlo Tree Search (MCTS)~\cite{Coulom06,Kocsis06Bandit} is a simulation-based search technique that has received much attention in the search community due to its success in two-player turn-taking games, such as Go~\cite{Gelly12Grand}.
\toexpand{more about mcts}
During a simulation, actions are selected using Upper Confidence Bound for Trees (UCT)~\cite{Kocsis06Bandit}. This selection policy is based on a regret minimization method for bandit problems, Upper Confidence Bound (UCB)~\cite{Auer02Finite}, which provides a good balance between exploration and exploitation.
Using the UCB selection policy, the strategy computed by MCTS provably converges to the optimal strategy as the number of simulations approaches
infinity in two-player zero-sum games.
In this setting, every state has a unique optimal value and the optimal strategy is one that chooses the action leading to a new state with the highest such value.
The proof uses Hoeffding's inequality to show that the difference between the empirical mean of a set of sampled payoffs and the true mean of the underlying distribution is bounded and approaches zero as the number of samples grows.
Therefore, if the empirical mean of the samples can be shown to converge to the optimal value, the optimal strategy simply chooses the action with the highest value.
%This way of decision-making is a good fit when the goal of the algorithm is to simply maximize expected utility.
Suppose now that we have a sequential problem where the goal is not simply to maximize expected utility but to find (or avoid)
strategies that admit extreme values.
For example, there may be hard constraints, such as avoiding any strategy that could lead to the death of a patient.
As another example, the underlying problem may lack inherent stochasticity, with randomness introduced only by the sampling
algorithm itself. In these situations, selection policies that model the statistical behavior of extreme values
may be preferred.
In this work, we develop an MCTS-style simulation-based algorithm for sequential decision-making problems based on extreme value
theory (EVT)~\cite{Coles01Introduction}. We show that, in certain situations, using extreme value theory to guide simulations
is preferable to using bandit theory. We prove the consistency of MCTS in single-player games using this selection mechanism.
Finally, we thoroughly compare EVT-based MCTS with bandit-based MCTS on several complex optimization problems as well as a spectrum
of randomly generated sequential problems.
\section{Background}
\toexpand{formalize basics of sequential decision-making problems}
\subsection{Bandit-Based Monte Carlo Tree Search}
\label{sec:mcts}
A stochastic bandit problem is one where a player is faced with $K$ actions. When an action $i$ is taken, the player receives a random
payoff $X_i$ generated from some fixed distribution.
The player can repeat this process $n$ times, receiving payoff $X_{i,t}$ for choosing action $i$ at time $t$. The goal
is to maximize the expected payoff $\bE[\sum_{t=1}^n X_{I_t,t}]$ where $I_t \in \{ 1, \ldots, K \}$ is the (generally randomized) choice
made by the player at time $t$.
The standard way to express the quality of a bandit algorithm is to measure its expected regret
\[
R_n = \max_{1 \le i \le K} \left( \bE \left[ \sum_{t=1}^n X_{i,t} \right] - \bE \left[ \sum_{t=1}^n X_{I_t,t} \right] \right),
\]
that is, how much better the player could have done by always choosing the single best action in hindsight.
Let $T_i(n) = \sum_{t = 1}^n \bI(I_t = i)$ be the number of times action $i$ was chosen up to time $n$, and
let $\bar{X}_{i,n} = \sum_{t = 1}^n \bI(I_t = i) X_{i,t} / T_i(n)$ be the empirical mean of the payoffs of action $i$
after $n$ plays, where the indicator function $\bI(\mbox{cond}) = 1$ if cond is true and 0 otherwise.
Algorithm UCB1 chooses
\[
I_t = \argmax_{i \in \{1, \ldots, K\}} \{\bar{X}_{i,n} + B_i(t)\},
\]
where $B_i(t)$ is an exploration bias parameter that is generally a function of the observed sequence up to time $t$.
Hoeffding's inequality, as adapted from \cite[Equations 3-4]{Kocsis06Bandit}, states that
\[
\Pr(|\bar{X}_{i,n} - \mu_i| \ge B_i(t)) \le n^{-4},
\]
and if $\lim_{t \rightarrow \infty}B_i(t) = 0$, then eventually the empirical mean will match the true mean $\mu_i$.
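A standard instantiation of the bias, used by UCB1~\cite{Auer02Finite} and (up to a tunable constant) by UCT~\cite{Kocsis06Bandit}, is
\[
B_i(t) = \sqrt{\frac{2 \ln t}{T_i(t)}}.
\]
For payoffs bounded in $[0,1]$, Hoeffding's inequality gives $\Pr(|\bar{X}_{i,n} - \mu_i| \ge \epsilon) \le 2\exp(-2 T_i(n) \epsilon^2)$, and choosing $\epsilon = B_i(n)$ bounds each tail by $n^{-4}$, the polynomial rate quoted above. Moreover, $B_i(t) \rightarrow 0$ as long as action $i$ continues to be sampled, so the convergence condition is satisfied.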
In Monte Carlo Tree Search, UCB is applied recursively, as if treating the decision-making process at
every state as its own bandit problem.
The true value of action $i$ is then the minimax value of the child state and the empirical mean $\bar{X}_{i,n}$
converges to this value in two-player zero-sum games with perfect information.
The proof is non-trivial because in this setting the payoffs are not simply random variables generated from some fixed
distribution, but rather the rewards received from an arbitrarily complex stochastic (non-stationary) process which
is itself another ``recursive sub-bandit problem''. Nonetheless, under certain conditions ensuring that the payoffs
and the process are well behaved, convergence can still be guaranteed.
\subsection{Extreme Value Theory}
\label{sec:evt}
In subsection~\ref{sec:mcts} we outlined how to compute the value of an action in UCT.
One can relabel the time steps $1 \le s \le T_i(n)$ and consider, for some action $i$, the sequence of payoffs $\mathbf{X}_i = (X_1, X_2, \ldots, X_{T_i(n)})$,
guiding the simulations by analyzing their empirical means $f(\mathbf{X}_i) = \bar{X}_{i,n} = \sum_{s = 1}^{T_i(n)} X_s / T_i(n)$.
Extreme value theory concerns the statistical behavior of a different quantity
\[ f(\mathbf{X}_i) = M_{i,n} = \max\{ X_1, X_2, \ldots, X_{T_i(n)} \}. \]
In contrast to the bandit setting, $M_{i,n}$ is not necessarily an estimator of the mean of the
underlying payoff distribution, but rather models extreme values of the payoff sequence.
Define $M^*_{i,n} = (M_{i,n} - b_n)/a_n$ for some constants $a_n$ and $b_n$.
By \cite[Theorem 3.1.1]{Coles01Introduction}, the cumulative distribution function
\[
\Pr(M^*_{i,n} \le z) \rightarrow G(z), \mbox{ where }
G(z) = \exp \left\{ - \left[ 1 + \xi \left(\frac{z - \mu}{\sigma}\right) \right]^{-1/\xi} \right\}
\]
is known as the {\bf generalized extreme value} (GEV) family of distributions, where $\mu$, $\sigma$, and $\xi$ are location, scale, and shape parameters. Therefore, the foundation of extreme value theory
is an analog of the central limit theorem for the maxima $M_{i,n}$.
Furthermore, the GEV distribution $G(z)$ can be inferred from observed data using standard techniques from
statistical machine learning, such as maximum likelihood estimation.
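For concreteness, given observed maxima $z_1, \ldots, z_m$ (e.g., per-action payoff maxima collected over successive blocks of simulations, one possible choice), the GEV parameters can be estimated by maximizing the log-likelihood~\cite{Coles01Introduction}
\[
\ell(\mu, \sigma, \xi) = -m \log \sigma - \left(1 + \frac{1}{\xi}\right) \sum_{k=1}^{m} \log\left[ 1 + \xi \left(\frac{z_k - \mu}{\sigma}\right) \right] - \sum_{k=1}^{m} \left[ 1 + \xi \left(\frac{z_k - \mu}{\sigma}\right) \right]^{-1/\xi},
\]
valid for $\xi \neq 0$ provided $1 + \xi (z_k - \mu)/\sigma > 0$ for all $k$; the case $\xi = 0$ corresponds to the Gumbel limit and is handled separately.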
\section{MCTS Based on Extreme Value Theory}
In this section, we develop our MCTS algorithm based on extreme value theory. In essence, at each state we replace the bandit-inspired regret
minimizers (UCB) with GEV prediction machines. Each one observes the payoff data produced by each of its actions and builds
a separate GEV distribution per action. In turn, this informs the algorithm how to select actions on successive simulations.
As in standard MCTS, the problem is defined recursively, so that the payoffs generated are
obtained as a function of many decision-making steps rather than a single step.
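As an illustration only (a sketch of one possible instantiation; the specific rule is a design choice), suppose the GEV distribution fitted to the payoffs of action $i$ has maximum likelihood estimates $(\hat{\mu}_i, \hat{\sigma}_i, \hat{\xi}_i)$. The search could then favor the action whose fitted distribution promises the largest upper quantile,
\[
I_t = \argmax_{i \in \{1, \ldots, K\}} \hat{z}_i(q), \qquad
\hat{z}_i(q) = \hat{\mu}_i + \frac{\hat{\sigma}_i}{\hat{\xi}_i}\left[ (-\log q)^{-\hat{\xi}_i} - 1 \right] \mbox{ for } \hat{\xi}_i \neq 0,
\]
for a fixed quantile level $q$ close to 1, possibly augmented with an exploration term analogous to $B_i(t)$ so that every action continues to be sampled.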
\section{Experiments}
\section{Conclusion}
\bibliographystyle{splncs03}
\bibliography{evt-mcts}
\end{document}