We develop a fast numerical algorithm for large-scale zero-sum stochastic games with perfect information, which combines policy iteration and algebraic multigrid methods.
Consider a game on a finite state space with a discounted infinite-horizon payoff. Each pair of strategies of the two players determines a Markov chain on the state space. The value of the game satisfies a dynamic programming equation.
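For instance, denoting by $\mathcal{X}$ the state space, by $A(x)$ and $B(x,a)$ the action sets of the maximizing and the minimizing player, by $r(x,a,b)$ the running payoff, by $p(y\mid x,a,b)$ the transition probabilities and by $0<\alpha<1$ the discount factor (a generic notation chosen here only for illustration), this equation can be written in the standard Shapley form
\[
v(x)=\max_{a\in A(x)}\;\min_{b\in B(x,a)}\Big(r(x,a,b)+\alpha\sum_{y\in\mathcal{X}}p(y\mid x,a,b)\,v(y)\Big),\qquad x\in\mathcal{X}.
\]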
This equation may also be obtained after a suitable discretization of Hamilton-Jacobi-Bellman or Isaacs partial differential equations.
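In the diffusion setting, a typical instance of such an Isaacs equation (written here in a schematic form, with generic drift $f$, diffusion matrix $\sigma$, payoff $r$ and discount rate $\lambda>0$, all introduced only for illustration) is
\[
\lambda v(x)-\max_{a}\,\min_{b}\Big(\tfrac12\,\operatorname{tr}\big(\sigma\sigma^{\mathsf T}(x,a,b)\,D^{2}v(x)\big)+f(x,a,b)\cdot Dv(x)+r(x,a,b)\Big)=0,\qquad x\in\Omega,
\]
together with suitable boundary conditions on $\partial\Omega$; a monotone finite-difference discretization of such an equation yields a problem of the above Shapley form.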
Classically, one can solve this dynamic programming equation by applying the fixed point method, known as the value iteration algorithm. The iterations are cheap, but their convergence slows considerably as the discount factor approaches one, which is the case when the discretization step of the partial differential equation is small, since the discount factor of the discretized problem then tends to one with the step. Another approach consists in the following algorithm, called policy iteration, initially introduced by Howard (60) for one-player games.
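Indeed, the dynamic programming operator $T$ defined by the right-hand side of the Shapley equation above is an $\alpha$-contraction in the sup-norm, so that the value iterates $v_{k+1}=T(v_{k})$ satisfy
\[
\|v_{k}-v\|_{\infty}\le \alpha^{k}\,\|v_{0}-v\|_{\infty},
\]
hence reaching an accuracy $\varepsilon$ requires of the order of $\log(1/\varepsilon)/(1-\alpha)$ iterations, a number which blows up as $\alpha$ tends to one; for discretizations of the partial differential equations above, $1-\alpha$ is typically of the order of the (time) discretization step.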
A policy for the first player maps each state to an action. Given an initial policy, the policy iteration applies successively the two following steps: (i) compute the value of the one-player game in which the first player is restricted to play the current policy, so that only the second player optimizes; (ii) improve the policy of the first player by choosing, in each state, an action attaining the maximum in the dynamic programming equation, the value computed in step (i) being substituted for the unknown.
The first step is itself performed using the policy iteration algorithm for a one-player game. The sequence of values of the external loop (resp. of the internal loop) is nondecreasing (resp. nonincreasing), and the algorithm stops after finitely many iterations when the action sets are finite. Under regularity assumptions, the policy iteration algorithm for a one-player game with infinite action spaces is equivalent to Newton's method, and can thus have super-linear convergence in a neighborhood of the solution.
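To fix ideas, the following self-contained sketch in C illustrates this two-level structure on randomly generated data: an external policy iteration for the maximizing player, whose first step is carried out by an internal policy iteration for the minimizing player. All names, sizes and tolerances are illustrative, and the policy-evaluation linear systems are solved by plain Gauss-Seidel sweeps, as a mere stand-in for the algebraic multigrid solver discussed below.

```c
/* Minimal sketch of policy iteration for a discounted zero-sum game with
 * finite state and action spaces (all names, sizes and data illustrative).
 * The external loop iterates on the policy of the maximizing player; its
 * first step is carried out by an internal policy iteration for the
 * minimizing player, whose policy-evaluation linear systems are solved by
 * Gauss-Seidel sweeps, standing in for an algebraic multigrid solver. */
#include <stdio.h>
#include <stdlib.h>

#define N     20       /* number of states                   */
#define MA    3        /* number of actions of the maximizer */
#define MB    3        /* number of actions of the minimizer */
#define ALPHA 0.95     /* discount factor                    */

static double P[N][MA][MB][N];   /* transition probabilities P(y | x,a,b) */
static double R[N][MA][MB];      /* running payoff r(x,a,b)               */

/* r(x,a,b) + alpha * sum_y P(y|x,a,b) v(y) */
static double qvalue(int x, int a, int b, const double *v) {
    double s = R[x][a][b];
    for (int y = 0; y < N; ++y) s += ALPHA * P[x][a][b][y] * v[y];
    return s;
}

/* min over the minimizer's actions b of qvalue(x,a,b,v) */
static double qmin(int x, int a, const double *v) {
    double m = qvalue(x, a, 0, v);
    for (int b = 1; b < MB; ++b) {
        double q = qvalue(x, a, b, v);
        if (q < m) m = q;
    }
    return m;
}

/* Policy evaluation: solve (I - alpha P) v = r for the fixed pair of
 * policies (sa, sb), approximately, by Gauss-Seidel sweeps. */
static void evaluate(const int *sa, const int *sb, double *v, int sweeps) {
    for (int k = 0; k < sweeps; ++k)
        for (int x = 0; x < N; ++x) {
            double s = R[x][sa[x]][sb[x]];
            for (int y = 0; y < N; ++y)
                if (y != x) s += ALPHA * P[x][sa[x]][sb[x]][y] * v[y];
            v[x] = s / (1.0 - ALPHA * P[x][sa[x]][sb[x]][x]);
        }
}

/* Internal policy iteration: one-player game of the minimizer, the policy
 * sa of the maximizer being fixed. */
static void solve_min(const int *sa, int *sb, double *v) {
    int changed = 1;
    while (changed) {
        evaluate(sa, sb, v, 1000);            /* policy evaluation        */
        changed = 0;
        for (int x = 0; x < N; ++x) {         /* policy improvement (min) */
            int best = sb[x];
            for (int b = 0; b < MB; ++b)
                if (qvalue(x, sa[x], b, v) < qvalue(x, sa[x], best, v) - 1e-12)
                    best = b;
            if (best != sb[x]) { sb[x] = best; changed = 1; }
        }
    }
}

int main(void) {
    srand(1);                                 /* random illustrative game */
    for (int x = 0; x < N; ++x)
        for (int a = 0; a < MA; ++a)
            for (int b = 0; b < MB; ++b) {
                double tot = 0.0;
                R[x][a][b] = (double)rand() / RAND_MAX;
                for (int y = 0; y < N; ++y) {
                    P[x][a][b][y] = (double)rand() / RAND_MAX;
                    tot += P[x][a][b][y];
                }
                for (int y = 0; y < N; ++y) P[x][a][b][y] /= tot;
            }

    int sa[N] = {0}, sb[N] = {0};
    double v[N] = {0.0};
    int changed = 1;
    for (int it = 0; changed; ++it) {         /* external policy iteration        */
        solve_min(sa, sb, v);                 /* step (i): value when MAX plays sa */
        changed = 0;
        for (int x = 0; x < N; ++x) {         /* step (ii): improve the MAX policy */
            int best = sa[x];
            for (int a = 0; a < MA; ++a)
                if (qmin(x, a, v) > qmin(x, best, v) + 1e-12) best = a;
            if (best != sa[x]) { sa[x] = best; changed = 1; }
        }
        printf("external iteration %d: v(0) = %.6f\n", it, v[0]);
    }
    return 0;
}
```

With sufficiently accurate policy evaluation, the external values are nondecreasing and the loop stops after finitely many policy changes, in accordance with the monotonicity property recalled above.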
In all cases, this method converges faster than value iteration and in practice it ends in a few steps (see for instance the large-scale random examples for deterministic games in Dhingra and Gaubert, 2006). In each internal iteration of the policy iteration, one needs to solve a linear system of equations whose dimension is equal to the cardinality of the state space. When the dynamic programming equation comes from the discretization of an Isaacs partial differential equation, these linear systems correspond to discretizations of linear elliptic equations, hence may be solved, in the best case, in a time of the order of the number of discretization points by using multigrid methods. In general, using the nice monotonicity properties of these linear systems, one may expect the same complexity when solving them by an algebraic multigrid method.
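More precisely, for a fixed pair of policies $\sigma$ and $\tau$ of the two players (in an ad hoc notation), writing $P^{\sigma\tau}$ for the transition matrix of the induced Markov chain and $r^{\sigma\tau}$ for the corresponding payoff vector, each policy evaluation amounts to solving
\[
(I-\alpha P^{\sigma\tau})\,v=r^{\sigma\tau},
\]
whose matrix has nonpositive off-diagonal entries and is strictly diagonally dominant (a nonsingular M-matrix), which is the monotonicity property that algebraic multigrid methods can exploit.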
We have implemented (in C) the policy iteration algorithm in which the linear systems are solved using a fixed or adaptive number of iterations of the algebraic multigrid method of Ruge and Stüben (86). This algorithm can be applied either to a genuine finite state space zero-sum two-player game or to the discretization of an Isaacs equation. Such a combination of multigrid methods with policy iteration has already been used and studied in the one-player case, that is for discounted stochastic control problems (see the early works of Hoppe (86, 87) and Akian (88, 90) on Hamilton-Jacobi-Bellman equations, and the recent work of Ziv and Shimkin (05) on algebraic multigrid methods combined with learning methods). However, the combination with policy iteration for games is new. We shall present numerical tests on discretizations of Isaacs or Hamilton-Jacobi-Bellman equations and of variational inequalities.
The complexity of policy iteration algorithms is still unsettled. Recall that the number of iterations is bounded by the number of possible strategies, which is exponential in the cardinality of the state space. Moreover, in some reachability (or pursuit-evasion) games, the number of iterations is typically of the order of the diameter of the domain. As with Newton's algorithm, convergence can be improved by starting the policy iteration with a good initial guess, close to the solution. With this in mind, we developed a full multi-level scheme, similar to the full multigrid (FMG) method. It consists in solving the problem at each grid level by performing policy iterations (combined with the algebraic multigrid method) until a convergence criterion is satisfied, and then interpolating the strategies and the value to the next level in order to initialize the policy iterations at that level, until the finest level is reached. Numerical examples on variational inequalities show that the execution time can be much improved by using this full multi-level scheme.
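As an illustration of the coarse-to-fine part of such a scheme, the following minimal sketch in C (in a hypothetical one-dimensional setting, with illustrative grid sizes and data) interpolates the value linearly and injects the discrete policies piecewise-constantly from one level to the next, where they would serve as the initial guess of the policy iterations of that level.

```c
/* Sketch of the coarse-to-fine initialization of a full multi-level scheme,
 * in a hypothetical 1-D setting (grid sizes, data and names illustrative).
 * The value function is interpolated linearly and the discrete policy is
 * injected piecewise-constantly from a grid with n points to the refined
 * grid with 2n-1 points. */
#include <stdio.h>
#include <stdlib.h>

/* linear interpolation of the value function onto the refined grid */
static void prolong_value(const double *vc, int n, double *vf) {
    for (int i = 0; i < n - 1; ++i) {
        vf[2 * i]     = vc[i];
        vf[2 * i + 1] = 0.5 * (vc[i] + vc[i + 1]);
    }
    vf[2 * (n - 1)] = vc[n - 1];
}

/* piecewise-constant injection of a discrete policy (action indices) */
static void prolong_policy(const int *sc, int n, int *sf) {
    for (int i = 0; i < n - 1; ++i) {
        sf[2 * i]     = sc[i];
        sf[2 * i + 1] = sc[i];   /* copy the action of the left coarse node */
    }
    sf[2 * (n - 1)] = sc[n - 1];
}

int main(void) {
    int n = 5;                               /* coarsest grid */
    double *v = malloc(n * sizeof *v);
    int    *s = malloc(n * sizeof *s);
    for (int i = 0; i < n; ++i) { v[i] = (double)(i * i); s[i] = i % 2; }
    /* ... solve the coarsest problem by policy iteration here ... */

    for (int level = 1; level <= 3; ++level) {   /* coarse-to-fine sweep */
        int nf = 2 * n - 1;
        double *vf = malloc(nf * sizeof *vf);
        int    *sf = malloc(nf * sizeof *sf);
        prolong_value(v, n, vf);
        prolong_policy(s, n, sf);
        /* ... run policy iteration combined with AMG on this level,
         *     starting from the initial guess (vf, sf) ... */
        printf("level %d: interpolated from %d to %d grid points\n", level, n, nf);
        free(v); free(s);
        v = vf; s = sf; n = nf;
    }
    free(v); free(s);
    return 0;
}
```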