We develop a fast numerical algorithm for large-scale zero-sum two-player stochastic games with perfect information and mean reward, which combines policy iteration and algebraic multigrid methods.
Consider a finite state space $\mathcal{X}$. The stochastic game is played in stages as follows. The initial state $x_0 \in \mathcal{X}$ is given and known by the two players. The player who plays first, say MAX, chooses an action in a set of possible actions. Then the second player, called MIN, chooses an action in another set of possible actions. The actions of both players and the current state determine the payment made by MIN to MAX at stage $0$, as well as the probability distribution of the new state $x_1$. Then the game continues in the same way with state $x_1$, and so on. A strategy, or policy, for a player is a rule which tells him which action to choose in any situation. A Markovian strategy depends only, possibly randomly, on the current state, and not on the past history or the stage. Each pair of Markovian stationary strategies of the two players determines a Markov chain on $\mathcal{X}$. We study the value of the game with mean reward, which is defined as the mean expected payment per stage, made by MIN to MAX, when each player chooses a strategy maximizing his own reward.
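To fix ideas (the symbols $r$ for the stage payment and $a_k$, $b_k$ for the actions chosen at stage $k$ are ours, introduced only for illustration), one standard way to write this value is
\[
\chi(x_0) \;=\; \max_{\sigma}\,\min_{\tau}\;\liminf_{N\to\infty}\;\frac{1}{N}\,
\mathbb{E}_{x_0}^{\sigma,\tau}\Big[\sum_{k=0}^{N-1} r(x_k,a_k,b_k)\Big],
\]
where $\sigma$ and $\tau$ range over the strategies of MAX and MIN, respectively.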
The value of the game is the solution of a dynamic programming equation. This nonlinear equation can be solved by the policy iteration algorithm for zero-sum stochastic games of Hoffman and Karp (66) when the Markov transition matrices of the game are all irreducible.
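For illustration, with $A$ and $B$ denoting the action sets of MAX and MIN, $r(x,a,b)$ the stage payment and $P_{xy}(a,b)$ the transition probabilities (notation of ours, not fixed above), the ergodic form of this equation in the irreducible case reads
\[
\chi + v(x) \;=\; \max_{a\in A}\,\min_{b\in B}\Big( r(x,a,b) + \sum_{y\in\mathcal{X}} P_{xy}(a,b)\,v(y) \Big),
\qquad x\in\mathcal{X},
\]
where the scalar $\chi$ is the mean payoff, constant over states by irreducibility, and $v$ is a relative value (bias) vector, determined up to an additive constant.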
The principle of this algorithm consists in applying the following two steps successively: first, compute the value of the game when the strategy of the first player is fixed; then, improve this strategy. The first step is solved by applying policy iteration for one-player games, i.e., stochastic control problems. Cochet-Terrasson and Gaubert (06) proposed a version of policy iteration for two-player games in the general multichain case, which is based on the algorithm for multichain Markov decision processes of Denardo and Fox (68). Each iteration of the Denardo and Fox algorithm requires the computation of stationary probabilities of irreducible Markov chains, as well as the solution of linear systems of the type $(I - P)u = g$, where $P$ is a sub-Markovian matrix.
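For an irreducible chain with transition matrix $Q$ (again our notation), the stationary probability vector is the solution of the singular system
\[
\pi^{\mathsf T} Q = \pi^{\mathsf T}, \qquad \pi^{\mathsf T}\mathbf{1} = 1,
\]
which is the other linear-algebra kernel arising at each iteration.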
We propose an algorithm based on the policy iteration algorithm of Cochet-Terrasson and Gaubert, in which we use the multigrid methods for Markov chains of Horton (94) and De Sterck et al. (08) to find the stationary probabilities, and the algebraic multigrid algorithm of Ruge and Stüben (86) for the above linear systems. We present numerical results obtained with this algorithm (implemented in C) for large-scale zero-sum two-player games.
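As a minimal illustration of the inner linear-algebra step that the multigrid solvers accelerate, the following C sketch performs value determination for a fixed pair of stationary strategies in the irreducible case: it computes the mean payoff $\eta$ and a bias vector $v$ satisfying $\eta + v = r + Pv$ with the normalization $v_0 = 0$. It uses dense Gaussian elimination as a stand-in for the algebraic multigrid solver; all function names and the toy data are ours, not taken from the implementation described above.
\begin{verbatim}
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

/* Dense Gaussian elimination with partial pivoting: solves A x = b.
 * A is n*n, row-major; A and b are modified in place.
 * Returns 0 on success, -1 if the system is (numerically) singular. */
static int solve_dense(int n, double *A, double *b, double *x)
{
    for (int k = 0; k < n; k++) {
        int piv = k;
        for (int i = k + 1; i < n; i++)
            if (fabs(A[i*n + k]) > fabs(A[piv*n + k])) piv = i;
        if (fabs(A[piv*n + k]) < 1e-12) return -1;
        if (piv != k) {
            for (int j = 0; j < n; j++) {
                double t = A[k*n + j];
                A[k*n + j] = A[piv*n + j];
                A[piv*n + j] = t;
            }
            double t = b[k]; b[k] = b[piv]; b[piv] = t;
        }
        for (int i = k + 1; i < n; i++) {
            double m = A[i*n + k] / A[k*n + k];
            for (int j = k; j < n; j++) A[i*n + j] -= m * A[k*n + j];
            b[i] -= m * b[k];
        }
    }
    for (int i = n - 1; i >= 0; i--) {
        double s = b[i];
        for (int j = i + 1; j < n; j++) s -= A[i*n + j] * x[j];
        x[i] = s / A[i*n + i];
    }
    return 0;
}

/* Value determination for a fixed pair of stationary strategies.
 * P: irreducible n*n stochastic matrix of the induced chain (row-major),
 * r: per-stage payment vector. Computes the mean payoff *eta and a bias
 * vector v (normalized by v[0] = 0) such that eta + v = r + P v.
 * This dense solve is the step a multigrid method would accelerate. */
static int value_determination(int n, const double *P, const double *r,
                               double *eta, double *v)
{
    double *A = malloc((size_t)n * n * sizeof *A);
    double *b = malloc((size_t)n * sizeof *b);
    double *x = malloc((size_t)n * sizeof *x);
    if (!A || !b || !x) { free(A); free(b); free(x); return -1; }

    /* Unknowns: x = (v[1], ..., v[n-1], eta).  Row i encodes
     * sum_y (delta_iy - P[i][y]) v[y] + eta = r[i], with v[0] = 0,
     * so the column for v[0] drops out. */
    for (int i = 0; i < n; i++) {
        for (int y = 1; y < n; y++)
            A[i*n + (y-1)] = (i == y ? 1.0 : 0.0) - P[i*n + y];
        A[i*n + (n-1)] = 1.0;   /* coefficient of eta */
        b[i] = r[i];
    }
    int rc = solve_dense(n, A, b, x);
    if (rc == 0) {
        v[0] = 0.0;
        for (int y = 1; y < n; y++) v[y] = x[y-1];
        *eta = x[n-1];
    }
    free(A); free(b); free(x);
    return rc;
}

int main(void)
{
    /* Toy 3-state irreducible chain as a usage example. */
    double P[9] = { 0.1, 0.6, 0.3,
                    0.5, 0.2, 0.3,
                    0.4, 0.4, 0.2 };
    double r[3] = { 2.0, 0.0, 1.0 };
    double eta, v[3];
    if (value_determination(3, P, r, &eta, v) == 0)
        printf("mean payoff = %g, bias = (%g, %g, %g)\n",
               eta, v[0], v[1], v[2]);
    return 0;
}
\end{verbatim}
At large scale, dense elimination is of course not viable; in the algorithm described above this solve is precisely where the multigrid method for Markov chains and the algebraic multigrid solver are plugged in.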