\documentclass{article}
% Packages
\usepackage[margin=1.5cm, includefoot, footskip=30pt]{geometry}
\usepackage{hyperref}
\usepackage{multicol}
\usepackage{booktabs}
\usepackage{xstring}
\usepackage{subcaption}
\usepackage{graphicx}
\usepackage{amsmath}
\usepackage{standalone}
\usepackage{fancyvrb}
\usepackage{authblk}
\usepackage{pdflscape}
\usepackage{tikz}
\usetikzlibrary{calc, shapes, patterns}
% Title
\title{Reinforcement Learning Produces Dominant Strategies for the
Iterated Prisoner's Dilemma}
\author[1]{Marc Harper}
\author[2]{Vincent Knight}
\author[3]{Martin Jones}
\author[4]{Georgios Koutsovoulos}
\author[2]{Nikoleta E. Glynatsi}
\author[3]{Owen Campbell}
\affil[1]{Google Inc., Mountain View, CA, USA}
\affil[2]{Cardiff University, School of Mathematics, UK}
\affil[3]{Not affiliated}
\affil[4]{INRA, Université Côte d'Azur, CNRS, ISA, France}
\date{}
\begin{document}
\maketitle
\begin{abstract}
We present tournament results and several powerful strategies for the Iterated
Prisoner's Dilemma created using reinforcement learning techniques
(evolutionary and particle swarm algorithms). These strategies are
trained to perform well against a corpus of over 170 distinct
opponents, including many well-known and classic strategies. All
the trained strategies win standard tournaments against the total collection
of other opponents. The trained strategies and one particular human-designed
strategy are also the top performers in noisy tournaments.
\end{abstract}
\section{Introduction}\label{sec:introduction}
The Iterated Prisoner's Dilemma (IPD) is a common model in game theory,
frequently used to understand the evolution of cooperative behaviour from complex
dynamics \cite{Axelrod1984}.
This manuscript uses the Axelrod library \cite{knight2016open, axelrodproject},
open source software for conducting IPD research with
reproducibility as a principal goal. Written in the Python programming language,
the library to date contains source code contributed by over 50 individuals
from a variety of geographic locations and technical backgrounds. The library
is supported by a comprehensive test suite that covers all the intended
behaviors of all of the strategies in the library, as well as the features that
conduct matches, tournaments, and population dynamics.
The library is continuously developed and, as of version 3.0.0, contains
over 200 strategies, many from the scientific literature, including
classic strategies like Win Stay Lose Shift \cite{nowak1993strategy} and
previous tournament winners such as OmegaTFT \cite{slany2007some}, Adaptive
Pavlov \cite{li2007design}, and ZDGTFT2 \cite{Stewart2012}.
Since Robert Axelrod's seminal tournament \cite{Axelrod1980a}, a number of
IPD\@ tournaments have been undertaken and are summarised in
Table~\ref{tbl:tournaments}. Further to the work described in
\cite{knight2016open} a regular set of standard, noisy~\cite{Bendor1991} and
probabilistic ending~\cite{Axelrod1980b} tournaments are carried out as more
strategies are added to the Axelrod library.
Details and results are available here:
\url{http://axelrod-tournament.readthedocs.io}. This work presents a detailed
analysis of a tournament with 176 strategies (details given
in Section~\ref{sec:results}).
\begin{table}[!hbtp]
\begin{center}
\begin{tabular}{ccccc}
\toprule
Year & Reference & Number of Strategies & Type & Source Code\\
\midrule
1979 & \cite{Axelrod1980a} & 13 & Standard & Not immediately available\\
1979 & \cite{Axelrod1980b} & 64 & Standard & Available in FORTRAN\\
1991 & \cite{Bendor1991} & 13 & Noisy & Not immediately available\\
2002 & \cite{Stephens2002} & 16 & Wildlife & Not applicable\\
2005 & \cite{kendall2007iterated} & 223 & Varied & Not available \\
2012 & \cite{Stewart2012} & 13 & Standard & Not fully available \\
2016 & \cite{knight2016open} & 129 & Standard & Fully available \\
\bottomrule
\end{tabular}
\end{center}
\caption{An overview of a selection of published tournaments. Not all
tournaments were `standard' round robins; for more details
see the indicated references.}\label{tbl:tournaments}
\end{table}
In this work we describe how collections of strategies in the Axelrod library
have been used to train new strategies specifically to win IPD tournaments.
These strategies are trained using generic strategy archetypes based on e.g.\
finite state
machines, arriving at particularly effective parameter choices through
evolutionary or particle swarm algorithms. There are several
previous publications that use evolutionary algorithms to
evolve IPD strategies in various circumstances
\cite{ashlock2006training, Ashlock2015, Ashlock2006,
ashlock2014shaped, Ashlock2014, barlow2015varying,
fogel1993evolving, marks1989niche, sudo2015effects,
vassiliades2010multiagent}. See also \cite{Gaudesi2016} for a
strategy trained to win against a collection of well-known IPD opponents and see
\cite{franken2005particle} for a prior use of particle swarm algorithms. Our
results are unique in that we are able to train against a large and diverse
collection of strategies available from the scientific literature.
Crucially, the
software used in this work is openly available and can be used to train strategies
in the future in a reliable manner, with confidence that the opponent strategies
are correctly implemented, tested and documented.
Moreover, as of the time of writing, we claim that this work contains the best
performing strategies for the Iterated Prisoner's Dilemma.
\section{The Strategy Archetypes}
The Axelrod library now contains many parametrised strategies trained using
machine learning
methods. Most are deterministic, use many rounds of memory, and perform
extremely well in tournaments as will be discussed in Section~\ref{sec:results}.
Training of these strategies will be discussed in Section~\ref{sec:methods}.
These strategies can encode a variety
of other strategies, including classic strategies like Tit For Tat
\cite{Axelrod1980},
handshake strategies, and grudging strategies that always defect after
an opponent defection.
\subsection{LookerUp}\label{sec:lookerup}
The LookerUp strategy is based on a lookup table and encodes a set of
deterministic responses based on the opponent's first $n_1$ moves, the
opponent's last $m_1$ moves, and the player's last $m_2$ moves. If $n_1 > 0$ then
the player has infinite memory depth, otherwise it has depth $\max(m_1, m_2)$.
This is illustrated diagrammatically in Figure~\ref{fig:lookerup}.
\begin{figure}[!hbtp]
\centering
\includestandalone[height=.3\textheight]{./assets/lookerup}
\caption{Diagrammatic representation of the LookerUp Archetype.}
\label{fig:lookerup}
\end{figure}
Training of this strategy corresponds to finding maps from partial histories to
actions, either a cooperation or a defection. Although various
combinations of $n_1, m_1,$ and $m_2$ have been tried, the best performance at
the time of
training was obtained for $n_1 = m_1 = m_2 = 2$ and generally for $n_1 > 0$.
A strategy
called EvolvedLookerUp2\_2\_2 is among the top strategies in the library.
This archetype can be used to train deterministic memory-$n$ strategies with the
parameters $n_1=0$ and $m_1=m_2=n$. For $n=1$, the resulting strategy cooperates
if the last round was mutual cooperation and defects otherwise, known as Grim or
Grudger.
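
To make the mechanism concrete, the following minimal Python sketch (not the
library's implementation) shows how such a lookup table is consulted each
round; the example table fragment and the default of cooperating on unseen
keys are hypothetical choices made for the sketch.
\begin{verbatim}
# Minimal illustrative sketch of the LookerUp mechanism; not the library's
# implementation, and the example table entries are hypothetical.
C, D = "C", "D"

def lookerup_move(table, my_history, opp_history, n1=2, m1=2, m2=2):
    """Look up an action from the opponent's opening moves and both players'
    recent moves; cooperate until enough history is available."""
    if len(my_history) < max(n1, m1, m2):
        return C
    key = (
        "".join(opp_history[:n1]),   # opponent's first n1 moves
        "".join(opp_history[-m1:]),  # opponent's last m1 moves
        "".join(my_history[-m2:]),   # player's last m2 moves
    )
    return table.get(key, C)  # cooperate on unseen keys (sketch assumption)

example_table = {("CC", "CC", "CC"): C, ("CC", "DD", "CC"): D}
print(lookerup_move(example_table, [C, C], [C, C]))  # prints "C"
\end{verbatim}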
Two strategies in the library, Winner12 and Winner21, from \cite{Mathieu2015},
are based on lookup tables for $n_1 = 0$, $m_1 = 1$, and $m_2=2$. The strategy
Winner12 emerged in less than 10 generations of training in our framework using
a score maximizing objective. Strategies nearly identical to Winner21 arise
from training with a Moran process objective.
\subsection{Gambler}\label{sec:gambler}
Gambler is a stochastic variant of LookerUp. Instead of deterministically
encoded moves the lookup table emits probabilities which are
used to choose cooperation or defection.
This is illustrated diagrammatically in Figure~\ref{fig:gambler}.
\begin{figure}[!hbtp]
\centering
\includestandalone[height=.3\textheight]{./assets/gambler}
\caption{Diagrammatic representation of the Gambler Archetype.}
\label{fig:gambler}
\end{figure}
Training of this strategy corresponds to finding maps from histories to
a probability of cooperation. The library includes a strategy trained
with $n_1 = m_1 = m_2 = 2$ that is \emph{mostly deterministic}, with 52 of the 64
probabilities being 0 or 1. At one time this strategy outperformed
EvolvedLookerUp2\_2\_2.
This strategy type can be used to train arbitrary memory-$n$ strategies. A
memory one strategy called PSOGamblerMem1 was trained, with
probabilities $(\text{Pr}(\text{C}\;|\;\text{CC}),
\text{Pr}(\text{C}\;|\;\text{CD}),
\text{Pr}(\text{C}\;|\;\text{DC}),
\text{Pr}(\text{C}\;|\;\text{DD})) = (1, 0.5217, 0, 0.121)$.
Though it performs well in standard tournaments (see
Table~\ref{tbl:standard_score})
it does not outperform the longer memory strategies, and is bested by a similar
strategy that also uses the first round of play: PSOGambler\_1\_1\_1.
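
As an illustration, the following Python sketch (not the library's code) plays
a memory-one Gambler-style strategy using the PSOGamblerMem1 probabilities
quoted above; the keys follow the usual (player's last move, opponent's last
move) convention, and cooperation on the first round is an assumption of the
sketch.
\begin{verbatim}
import random

# Illustrative memory-one Gambler player using the PSOGamblerMem1
# cooperation probabilities quoted above; not the library's implementation.
C, D = "C", "D"
PSO_GAMBLER_MEM1 = {
    (C, C): 1.0,     # Pr(C | CC)
    (C, D): 0.5217,  # Pr(C | CD)
    (D, C): 0.0,     # Pr(C | DC)
    (D, D): 0.121,   # Pr(C | DD)
}

def gambler_mem1_move(my_last, opp_last, table=PSO_GAMBLER_MEM1):
    """Cooperate with the probability assigned to the last round's outcome."""
    if my_last is None:  # first round (sketch assumption)
        return C
    return C if random.random() < table[(my_last, opp_last)] else D
\end{verbatim}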
These strategies are trained with a particle swarm algorithm rather than an
evolutionary algorithm (though the latter would also suffice). Particle swarm
algorithms have been used to train IPD strategies previously
\cite{franken2005particle}.
\subsection{ANN: Single Hidden Layer Artificial Neural Network}\label{sec:ann}
Strategies based on artificial neural networks use a variety of features
computed from the history of play:
\begin{multicols}{2}
\begin{itemize}
\item Opponent's first move is C
\item Opponent's first move is D
\item Opponent's second move is C
\item Opponent's second move is D
\item Player's previous move is C
\item Player's previous move is D
\item Player's second previous move is C
\item Player's second previous move is D
\item Opponent's previous move is C
\item Opponent's previous move is D
\item Opponent's second previous move is C
\item Opponent's second previous move is D
\item Total opponent cooperations
\item Total opponent defections
\item Total player cooperations
\item Total player defections
\item Round number
\end{itemize}
\end{multicols}
These are then input into a feed-forward neural network with a single hidden
layer of user-supplied width. This is illustrated diagrammatically in
Figure~\ref{fig:ann}.
\begin{figure}[!hbtp]
\centering
\includestandalone[height=.3\textheight]{./assets/ann}
\caption{Diagrammatic representation of the ANN Archetype.}
\label{fig:ann}
\end{figure}
Training of this strategy corresponds to finding parameters of the neural
network. A hidden layer with just five nodes performs quite well in both deterministic and
noisy tournaments. The output of the ANN used in this work is deterministic;
a stochastic variant that outputs probabilities rather than exact moves could
be easily created.
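
The feature construction and forward pass can be sketched as follows; this is
an illustrative reconstruction rather than the library's code, and the choice
of a ReLU hidden layer with a sign-based output rule is an assumption of the
sketch.
\begin{verbatim}
import numpy as np

# Illustrative reconstruction of the feature vector and forward pass; the
# ReLU hidden layer and sign-based decision are assumptions of the sketch.
C, D = "C", "D"

def is_move(history, index, action):
    """1.0 if history[index] exists and equals action, otherwise 0.0."""
    try:
        return 1.0 if history[index] == action else 0.0
    except IndexError:
        return 0.0

def ann_features(my_history, opp_history):
    """Assemble the 17 features listed above into a vector."""
    return np.array([
        is_move(opp_history, 0, C), is_move(opp_history, 0, D),
        is_move(opp_history, 1, C), is_move(opp_history, 1, D),
        is_move(my_history, -1, C), is_move(my_history, -1, D),
        is_move(my_history, -2, C), is_move(my_history, -2, D),
        is_move(opp_history, -1, C), is_move(opp_history, -1, D),
        is_move(opp_history, -2, C), is_move(opp_history, -2, D),
        float(opp_history.count(C)), float(opp_history.count(D)),
        float(my_history.count(C)), float(my_history.count(D)),
        float(len(my_history) + 1),  # round number
    ])

def ann_move(weights_in, weights_out, my_history, opp_history):
    """One forward pass: hidden = relu(W_in x); cooperate if the output >= 0."""
    hidden = np.maximum(0.0, weights_in @ ann_features(my_history, opp_history))
    return C if float(weights_out @ hidden) >= 0.0 else D

rng = np.random.default_rng(0)
w_in, w_out = rng.normal(size=(5, 17)), rng.normal(size=5)  # untrained weights
print(ann_move(w_in, w_out, [C, D], [D, C]))
\end{verbatim}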
\subsection{Finite State Machines}\label{sec:fsm}
Strategies based on finite state machines are deterministic and computationally efficient.
In each round of play the strategy selects an action based on the current state
and the opponent's last action, transitioning to a new state for the next round.
This is illustrated diagrammatically in Figure~\ref{fig:fsm}.
\begin{figure}[!hbtp]
\centering
\includestandalone[height=.3\textheight]{./assets/fsm}
\caption{Diagrammatic representation of the Finite State Machine Archetype.}
\label{fig:fsm}
\end{figure}
Training this strategy corresponds to finding mappings of states and histories
to an action and a state. Figure~\ref{fig:fsm_images} shows two of the trained
finite state machines. The layout of state nodes is kept the same between
Figure~\ref{fig:fsm16} and~\ref{fig:fsm16noise} to highlight the effect of
different training environments. Note also that two of the 16 states are not
used; this is also an outcome of the training process.
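
A minimal Python sketch of this mechanism is given below; it is not the
library's implementation, and the two-state transition table shown is a
hypothetical example that reproduces Tit For Tat-like behaviour.
\begin{verbatim}
# Illustrative finite state machine player: a transition table maps
# (state, opponent's last action) to (next state, our action).
C, D = "C", "D"

class FSMPlayer:
    def __init__(self, transitions, initial_state, initial_action):
        self.transitions = transitions  # {(state, opp_action): (state, action)}
        self.state = initial_state
        self.initial_action = initial_action

    def move(self, opp_last):
        """Play one round given the opponent's last action (None on round one)."""
        if opp_last is None:
            return self.initial_action
        self.state, action = self.transitions[(self.state, opp_last)]
        return action

# A hypothetical two-state table encoding Tit For Tat-like behaviour.
tft_like = FSMPlayer({(0, C): (0, C), (0, D): (1, D),
                      (1, C): (0, C), (1, D): (1, D)}, 0, C)
print(tft_like.move(None), tft_like.move(D), tft_like.move(C))  # C D C
\end{verbatim}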
\begin{figure}[!hbtp]
\centering
\begin{subfigure}[t]{.5\textwidth}
\includegraphics[height=.3\textheight]{./assets/FSM16.pdf}
\caption{Evolved\_FSM\_16: trained to maximise score in a standard
tournament}
\label{fig:fsm16}
\end{subfigure}%
~
\begin{subfigure}[t]{.5\textwidth}
\centering
\includegraphics[height=.35\textheight]{./assets/FSM16noise.pdf}
\caption{Evolved\_FSM\_16\_Noise\_05: trained to maximise score in a
noisy tournament}
\label{fig:fsm16noise}
\end{subfigure}%
\caption{Trained sixteen state Finite State Machine players.}
\label{fig:fsm_images}
\end{figure}
\subsection{Hidden Markov Models}\label{sec:hmm}
A variant of the finite state machine strategies is the hidden Markov model
(HMM). Like the strategies based on finite state machines, these strategies
also encode an internal state. However, they use probabilistic transitions based on the
prior round of play to other states and cooperate or defect with various
probabilities at each state. This is
shown diagrammatically in Figure~\ref{fig:hmm}. Training this strategy
corresponds to finding mappings of states and histories to probabilities of
cooperating as well as probabilities of the next internal state.
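
For illustration, a minimal Python sketch of this archetype follows; it is not
the library's implementation, and the two-state parameter values shown are
hypothetical.
\begin{verbatim}
import random

# Illustrative hidden Markov model player: cooperate with a state-dependent
# probability, then transition stochastically based on the opponent's move.
C, D = "C", "D"

class HMMPlayer:
    def __init__(self, trans_C, trans_D, emission, initial_state):
        # trans_C[s][t] / trans_D[s][t]: Pr(next state t | state s and the
        # opponent played C / D); emission[s]: Pr(cooperate | state s).
        self.trans = {C: trans_C, D: trans_D}
        self.emission = emission
        self.state = initial_state

    def move(self, opp_last):
        if opp_last is not None:
            weights = self.trans[opp_last][self.state]
            self.state = random.choices(range(len(weights)), weights=weights)[0]
        return C if random.random() < self.emission[self.state] else D

# Hypothetical two-state example: state 0 is cooperative, state 1 defects.
player = HMMPlayer(trans_C=[[1.0, 0.0], [0.5, 0.5]],
                   trans_D=[[0.2, 0.8], [0.0, 1.0]],
                   emission=[0.95, 0.05], initial_state=0)
print(player.move(None), player.move(D), player.move(C))
\end{verbatim}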
\begin{figure}[!hbtp]
\centering
\includestandalone[height=.3\textheight]{./assets/hmm}
\caption{Diagrammatic representation of the Hidden Markov Model Archetype.}
\label{fig:hmm}
\end{figure}
\subsection{Meta Strategies}
There are several strategies, called Meta strategies, that are based on
ensemble methods common in machine learning. These strategies are
composed of a team of other strategies. In each round, each member of the team
is polled for its desired next
move. The ensemble then selects the next move based on a rule, such as the
consensus vote in the case of MetaMajority or the best individual performance
in the case of MetaWinner. These strategies were among the best in the library
before the inclusion of those trained by reinforcement learning. The library
contains Meta strategies built from teams of all the deterministic players,
all the memory-one players, and some others.
Because these strategies inherit many of the properties of the strategies
on which they are based, including using knowledge of the match length to defect
on the last round(s) of play, not all of these
strategies were included in results of this
paper. These strategies do not typically outperform the trained strategies
described above.
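
The ensemble mechanism described above can be illustrated with the following
Python sketch (not the library's implementation); the three team members shown
are simple stand-ins for strategies in the library.
\begin{verbatim}
from collections import Counter

# Illustrative MetaMajority-style ensemble: poll every team member and play
# the most common choice.
C, D = "C", "D"

def meta_majority_move(team, my_history, opp_history):
    """Each member is a function (my_history, opp_history) -> C or D."""
    votes = Counter(member(my_history, opp_history) for member in team)
    return votes.most_common(1)[0][0]

# Hypothetical team: Cooperator, Defector and Tit For Tat as plain functions.
team = [
    lambda mine, opp: C,                      # Cooperator
    lambda mine, opp: D,                      # Defector
    lambda mine, opp: opp[-1] if opp else C,  # Tit For Tat
]
print(meta_majority_move(team, [], []))  # -> "C" (two of three vote C)
\end{verbatim}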
\section{Results}\label{sec:results}
This section presents the results of a large IPD tournament with
strategies from the Axelrod library, including some additional parametrized
strategies (e.g.\ various parameter choices for Generous Tit For Tat
\cite{Gaudesi2016}). These are
listed in Appendix~\ref{app:list_of_players}.
All strategies in the tournament follow a simple set of
rules in accordance with earlier tournaments:
\begin{itemize}
\item Players are unaware of the number of turns in a match.
\item Players carry no acquired state between matches.
\item Players cannot observe the outcome of other matches.
\item Players cannot identify their opponent by any label or identifier.
\item Players cannot manipulate or inspect their opponents in any way.
\end{itemize}
Any strategy that does not follow these rules, such as a strategy that defects
on the last round of play, was omitted from the tournament presented here (but
not necessarily from the training pool).
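
For readers wishing to run a tournament of this kind, a small example using
the Axelrod library is shown below; the four players listed and the repetition
count are placeholders for the full setup used in this work.
\begin{verbatim}
import axelrod as axl

# Small illustrative example of standard and noisy tournaments with the
# Axelrod library; players and repetitions are placeholders.
players = [axl.TitForTat(), axl.Defector(), axl.Cooperator(), axl.Grudger()]

standard = axl.Tournament(players, turns=200, repetitions=10)
results = standard.play()
print(results.ranked_names)  # strategies ordered by median payoff

noisy = axl.Tournament(players, turns=200, repetitions=10, noise=0.05)
noisy_results = noisy.play()
\end{verbatim}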
A total of \input{./assets/number_of_players.tex}are included, of which
\input{./assets/number_of_stochastic_players.tex}are stochastic.
Section~\ref{sec:standard} is concerned with the standard tournament with 200
turns whereas in Section~\ref{sec:noise} a tournament with 5\% noise is
discussed. Due to the inherent stochasticity of these IPD tournaments, these
tournaments were repeated \input{./assets/standard_number_of_repetitions.tex}
times. This allows for a detailed and confident analysis of the performance of
strategies. To illustrate the results considered,
Figure~\ref{fig:tit_for_tat_scores} shows the distribution of the mean score per
turn of Tit For Tat over all the repetitions. Similarly,
Figure~\ref{fig:tit_for_tat_ranks} shows the ranks of Tit For Tat for each
repetition. (We note that it never wins a tournament.) Finally,
Figure~\ref{fig:tit_for_tat_wins} shows the number of opponents beaten in any given
tournament: Tit For Tat does not win any match. (This is because it either
draws through mutual cooperation or is the second to defect.)
\begin{figure}[!hbtp]
\centering
\begin{subfigure}[t]{.3\textwidth}
\centering
\includegraphics[width=\textwidth]{./assets/standard_tft_scores.pdf}
\caption{Scores}
\label{fig:tit_for_tat_scores}
\end{subfigure}%
~
\begin{subfigure}[t]{.3\textwidth}
\centering
\includegraphics[width=\textwidth]{./assets/standard_tft_ranks.pdf}
\caption{Ranks}
\label{fig:tit_for_tat_ranks}
\end{subfigure}%
~
\begin{subfigure}[t]{.3\textwidth}
\centering
\includegraphics[width=\textwidth]{./assets/standard_tft_wins.pdf}
\caption{Wins}
\label{fig:tit_for_tat_wins}
\end{subfigure}%
\caption{Results for Tit For Tat over
\protect\input{./assets/standard_number_of_repetitions.tex}tournaments.}
\end{figure}
The utilities used are \((R, P, T, S)=(3, 1, 5, 0)\); thus the specific
Prisoner's Dilemma being played is:
\begin{equation}\label{equ:pd}
\begin{pmatrix}
(3, 3) & (0, 5)\\
(5, 0) & (1, 1)
\end{pmatrix}
\end{equation}
All data generated for this work is archived and available at~\cite{data}.
\subsection{Standard Tournament}\label{sec:standard}
The top 11 performing strategies by median payoff are all strategies trained to maximize
total payoff against a subset of the strategies (Table~\ref{tbl:standard_score}).
The next strategy is the Derived Belief Strategy (DBS) \cite{Au2006},
which actively analyzes the opponent and responds
accordingly. The next three strategies are Winner12, based on a lookup table;
Fool Me Once \cite{axelrodproject}, a grudging strategy that defects indefinitely after
the opponent's second defection; and Omega Tit For Tat \cite{kendall2007iterated}.
\begin{table}[!hbtp]
\centering
\input{./assets/standard_top_15_scores_summary.tex}
\caption{Standard Tournament: Mean score per turn of top 15 strategies
(ranked by median over
\protect\input{./assets/standard_number_of_repetitions.tex}tournaments).
The leaderboard is dominated by the trained strategies (indicated by a
$^{*}$).}
\label{tbl:standard_score}
\end{table}
For completeness, violin plots showing the distribution of the scores of each
strategy (again ranked by median score) are shown in
Figure~\ref{fig:standard_boxplot}.
\begin{landscape}
\begin{figure}[!hbtp]
\centering
\includegraphics[width=\paperwidth]{./assets/standard_scores_boxplots.pdf}
\caption{Standard Tournament: Mean score per turn (ranked by median
over
\protect\input{./assets/standard_number_of_repetitions.tex}tournaments).}
\label{fig:standard_boxplot}
\end{figure}
\end{landscape}
Pairwise payoff results are given as a heatmap (Figure~\ref{fig:standard_heatmap})
which shows that many strategies achieve mutual cooperation (obtaining a score
of 3). The top performing
strategies never defect first yet are able to exploit weaker strategies that
attempt to defect.
\begin{figure}[!hbtp]
\centering
\includegraphics[width=\textwidth]{./assets/standard_scores_heatmap.pdf}
\caption{Standard Tournament: Mean score per turn of row players against
column players (ranked by median over
\protect\input{./assets/standard_number_of_repetitions.tex}tournaments).}
\label{fig:standard_heatmap}
\end{figure}
The strategies that win the most matches
(Table~\ref{tbl:standard_wins_top_winners}) are Defector~\cite{Axelrod1984} and Aggravater~\cite{axelrodproject}, followed
by handshaking and zero determinant strategies~\cite{Press2012}.
This includes two handshaking
strategies that were the result of training to maximize Moran process fixation
(TF1 and TF2). No strategies were trained specifically to win matches. None of
the top scoring strategies appear in the top 15 list of strategies ranked by
match wins. This can be seen in Figure~\ref{fig:standard_winplot} where the
distribution of the number of wins of each strategy is shown.
\begin{table}[!hbtp]
\centering
\input{./assets/standard_top_15_winners_wins_summary.tex}
\caption{Standard Tournament: Number of wins per tournament
of top 15 strategies (ranked by median wins over
\protect\input{./assets/standard_number_of_repetitions.tex}tournaments).}
\label{tbl:standard_wins_top_winners}
\end{table}
\begin{landscape}
\begin{figure}[!hbtp]
\centering
\includegraphics[width=\paperwidth]{./assets/standard_wins_boxplots.pdf}
\caption{Standard Tournament: number of wins per tournament (ranked by
median over
\protect\input{./assets/standard_number_of_repetitions.tex}tournaments).}
\label{fig:standard_winplot}
\end{figure}
\end{landscape}
The number of wins of the top strategies of Table~\ref{tbl:standard_score} is
shown in Table~\ref{tbl:standard_wins}. It is evident that although these
strategies score highly, they do not win many matches: the strategy with the
most wins is Evolved FSM 16, which won at most 60
(\(60/175\approx34\%\)) matches in any given tournament.
\begin{table}[!hbtp]
\centering
\input{./assets/standard_top_15_wins_summary.tex}
\caption{Standard Tournament: Number of wins per tournament
of top 15 strategies (ranked by median score over
\protect\input{./assets/standard_number_of_repetitions.tex}tournaments).}
\label{tbl:standard_wins}
\end{table}
Finally, Table~\ref{tbl:standard_ranks} and
Figure~\ref{fig:standard_ranks_boxplot} show the ranks (based on median score)
of each strategy over the repeated tournaments. Whilst there is some
stochasticity, the top three strategies almost always rank in the top three. For
example, the worst rank achieved by Evolved LookerUp 2 2 2 in any tournament
is 8th.
\begin{table}[!hbtp]
\centering
\input{./assets/standard_top_15_ranks_summary.tex}
\caption{Standard Tournament: Rank in each tournament
of top 15 strategies (ranked by median over
\protect\input{./assets/standard_number_of_repetitions.tex}tournaments).}
\label{tbl:standard_ranks}
\end{table}
\begin{landscape}
\begin{figure}[!hbtp]
\centering
\includegraphics[width=\paperwidth]{./assets/standard_ranks_boxplots.pdf}
\caption{Standard Tournament: rank in each tournament (ranked by
median over
\protect\input{./assets/standard_number_of_repetitions.tex}tournaments).}
\label{fig:standard_ranks_boxplot}
\end{figure}
\end{landscape}
Figure~\ref{fig:comparison_cooperation_heatmaps_standard} shows the rate of
cooperation in each round for the top three strategies. The opponents in these
figures are ordered according to performance by median score. It is evident that
the high performing strategies share a common thread against the top strategies:
they do not defect first and achieve mutual cooperation. Against the lower-ranked
strategies they also do not defect first (a mean cooperation rate of 1 in the
first round) but quickly retaliate.
\begin{figure}[!hbtp]
\centering
\begin{subfigure}[t]{.3\textwidth}
\centering
\includegraphics[width=\textwidth]{./assets/cooperation_0_0_10000_EvolvedLookerUp2_2_2_array.pdf}
\caption{EvolvedLookerUp2\_2\_2}
\end{subfigure}%
~
\begin{subfigure}[t]{.3\textwidth}
\centering
\includegraphics[width=\textwidth]{./assets/cooperation_0_0_10000_Evolved_HMM_5_array.pdf}
\caption{Evolved\_HMM\_5}
\end{subfigure}%
~
\begin{subfigure}[t]{.3\textwidth}
\centering
\includegraphics[width=\textwidth]{./assets/cooperation_0_0_10000_Evolved_FSM_16_array.pdf}
\caption{Evolved\_FSM\_16}
\end{subfigure}%
\caption{Comparison of cooperation rates for Standard Tournament Top 3
(over 10000 repetitions).}
\label{fig:comparison_cooperation_heatmaps_standard}
\end{figure}
\subsection{Noisy Tournament}\label{sec:noise}
Noisy tournaments in which there is a 5\% chance that an action is
flipped are now described. As shown in Table~\ref{tbl:noisy_score} and
Figure~\ref{fig:noisy_score}, the best performing strategies in median payoff
are DBS, designed to account for noise, followed by two strategies trained in
the presence of noise and three strategies trained without noise. One of
the strategies trained with noise (PSO Gambler) actually performs less well than
some of the other high-ranking strategies, including
Spiteful TFT (TFT, but defecting indefinitely if the opponent defects twice
consecutively) and OmegaTFT (also designed to handle noise). While DBS is the clear
winner, it comes at roughly six times the run time of Evolved FSM 16 Noise 05.
% Table of best strategies
\begin{table}[!hbtp]
\centering
\input{./assets/noisy_top_15_scores_summary.tex}
\caption{Noisy (5\%) Tournament: Mean score per turn of top 15 strategies
(ranked by median over
\protect\input{./assets/noisy_number_of_repetitions.tex}tournaments)
~$^{*}$ indicates that the strategy was trained.}
\label{tbl:noisy_score}
\end{table}
\begin{landscape}
\begin{figure}[!hbtp]
\centering
\includegraphics[width=\paperwidth]{./assets/noisy_scores_boxplots.pdf}
\caption{Noisy (5\%) Tournament: Mean score per turn (ranked by median
over
\protect\input{./assets/noisy_number_of_repetitions.tex}tournaments).}
\label{fig:noisy_score}
\end{figure}
\end{landscape}
Recalling Table~\ref{tbl:standard_score}, the strategies trained in the presence
of noise are also among the best performers in the absence of noise. As shown in
Figure~\ref{fig:noisy_heatmap} the cluster of mutually cooperative strategies is
broken by the noise at 5\%. A similar collection of players excels at winning
matches but again achieves a poor total payoff.
\begin{figure}[!hbtp]
\centering
\includegraphics[width=\textwidth]{./assets/noisy_scores_heatmap.pdf}
\caption{Noisy (5\%) Tournament: Mean score per turn of row players against
column players (ranked by median over
\protect\input{./assets/noisy_number_of_repetitions.tex}tournaments).}
\label{fig:noisy_heatmap}
\end{figure}
As shown in Table~\ref{tbl:noisy_wins_top_winners} and
Figure~\ref{fig:noisy_winplot}, the strategies tallying the most wins are
somewhat similar to those in the standard tournaments, with Defector, the handshaking
CollectiveStrategy~\cite{Li2009}, and Aggravater appearing as the top three again.
\begin{table}[!hbtp]
\centering
\input{./assets/noisy_top_15_winners_wins_summary.tex}
\caption{Noisy (5\%) Tournament: Number of wins per tournament
of top 15 strategies (ranked by median wins over
\protect\input{./assets/noisy_number_of_repetitions.tex}tournaments).}
\label{tbl:noisy_wins_top_winners}
\end{table}
\begin{landscape}
\begin{figure}[!hbtp]
\centering
\includegraphics[width=\paperwidth]{./assets/noisy_wins_boxplots.pdf}
\caption{Noisy (5\%) Tournament: number of wins per tournament (ranked by
median over
\protect\input{./assets/noisy_number_of_repetitions.tex}tournaments).}
\label{fig:noisy_winplot}
\end{figure}
\end{landscape}
As shown in Table~\ref{tbl:noisy_wins}, the top ranking strategies win a larger
number of matches in the presence of noise. For example, Spiteful Tit For
Tat~\cite{Prison1998} won almost all of its matches (167) in one tournament.
\begin{table}[!hbtp]
\centering
\input{./assets/noisy_top_15_wins_summary.tex}
\caption{Noisy (5\%) Tournament: Number of wins per tournament
of top 15 strategies (ranked by median score over
\protect\input{./assets/noisy_number_of_repetitions.tex}tournaments).}
\label{tbl:noisy_wins}
\end{table}
Finally, Table~\ref{tbl:noisy_ranks} and
Figure~\ref{fig:noisy_ranks_boxplot} show the ranks (based on median score)
of each strategy over the repeated tournaments. We see that the stochasticity
of the ranks understandably increases relative to the standard tournament. An
exception is the top three strategies: for example, the DBS strategy never ranks
lower than second and wins 75\% of the tournaments. The two strategies trained
for noisy tournaments rank in the top three 95\% of the time.
\begin{table}[!hbtp]
\centering
\input{./assets/noisy_top_15_ranks_summary.tex}
\caption{Noisy (5\%) Tournament: Rank in each tournament
of top 15 strategies (ranked by median over
\protect\input{./assets/noisy_number_of_repetitions.tex}tournaments).}
\label{tbl:noisy_ranks}
\end{table}
\begin{landscape}
\begin{figure}[!hbtp]
\centering
\includegraphics[width=\paperwidth]{./assets/noisy_ranks_boxplots.pdf}
\caption{Noisy (5\%) Tournament: rank in each tournament (ranked by
median over
\protect\input{./assets/noisy_number_of_repetitions.tex}tournaments).}
\label{fig:noisy_ranks_boxplot}
\end{figure}
\end{landscape}
Figure~\ref{fig:comparison_cooperation_heatmaps_noisy} shows the rate of
cooperation in each round for the top three strategies (in the absence of noise).
Just as for the top performing strategies in the standard tournament
(Figure~\ref{fig:comparison_cooperation_heatmaps_standard}), it is evident that
the strategies never defect first and quickly punish poorer strategies.
\begin{figure}[!hbtp]
\centering
\begin{subfigure}[t]{.3\textwidth}
\centering
\includegraphics[width=\textwidth]{./assets/cooperation_0_0_10000_DBS_0-75_3_4_3_5_array.pdf}
\caption{DBS}
\end{subfigure}%
~
\begin{subfigure}[t]{.3\textwidth}
\centering
\includegraphics[width=\textwidth]{./assets/cooperation_0_0_10000_Evolved_ANN_5_Noise_05_array.pdf}
\caption{Evolved\_ANN\_5\_Noise\_05}
\end{subfigure}%
~
\begin{subfigure}[t]{.3\textwidth}
\centering
\includegraphics[width=\textwidth]{./assets/cooperation_0_0_10000_Evolved_FSM_16_Noise_05_array.pdf}
\caption{Evolved\_FSM\_16\_Noise\_05}
\end{subfigure}%
\caption{Comparison of cooperation rates for the Noisy (5\%) Tournament Top 3
(over 10000 standard tournaments).}
\label{fig:comparison_cooperation_heatmaps_noisy}
\end{figure}
\section{Methods}\label{sec:methods}
The trained strategies (denoted by a \(^{*}\) in
Appendix~\ref{app:list_of_players}) were trained using reinforcement
learning algorithms. The ideas of reinforcement learning can be traced back to
the original work of \cite{turing1950computing}, which proposed that
computers could learn by taking random actions according to a distribution
that favours actions with high rewards. The two particular algorithms
used here are:
\begin{itemize}
\item Particle Swarm Algorithm: \cite{imran2013overview}.
\item Evolutionary algorithm: \cite{moriarty1999evolutionary}.
\end{itemize}
The Particle Swarm Algorithm is implemented using the pyswarm library:
\url{https://pypi.python.org/pypi/pyswarm}. This algorithm was used only to
train the Gambler archetype.
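
A hedged sketch of how pyswarm can be applied to this training problem is
shown below; in the real training the objective plays the candidate
probability vector against the opponent pool, whereas here a trivial
placeholder objective is used so that the snippet is self-contained.
\begin{verbatim}
from pyswarm import pso

# Hedged sketch of the particle swarm step for the Gambler archetype; the
# objective below is a placeholder, not the tournament-based objective.
N_PARAMS = 64  # one cooperation probability per lookup key when n1=m1=m2=2

def objective(probabilities):
    # Placeholder: reward vectors whose entries are close to 0 or 1, loosely
    # mimicking the mostly deterministic tables found by training.
    # pso minimises this value.
    return sum(min(p, 1 - p) for p in probabilities)

best_probs, best_value = pso(objective, lb=[0.0] * N_PARAMS,
                             ub=[1.0] * N_PARAMS, swarmsize=50, maxiter=100)
\end{verbatim}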
All other strategies were trained using evolutionary algorithms. The
evolutionary algorithms used standard techniques, varying strategies by
mutation and crossover, and evaluating the performance against each opponent
for many repetitions. The best performing strategies in each generation are
persisted, variants are created, and the objective functions are computed again;
a minimal sketch of this loop is given after the list below.
The default parameters for this procedure are:
\begin{itemize}
\item A population size of 40 individuals (kept constant across the
generations);
\item A mutation rate of 10\%;
\item 10 individuals kept from one generation to the next;
\item A total of 500 generations.
\end{itemize}
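
The sketch below outlines this loop in generic Python under the default
parameters above; it is a simplified illustration, not the axelrod-dojo
implementation, and the genome representation, crossover, mutation, and
objective are left as user-supplied functions.
\begin{verbatim}
import random

# Generic, simplified evolutionary loop; the axelrod-dojo implementation
# differs in its genome representations and objective functions.
POP, KEEP, MUT, GENS = 40, 10, 0.10, 500

def evolve(random_genome, crossover, mutate, objective):
    """random_genome() -> genome; crossover(a, b) -> genome;
    mutate(genome) -> genome; objective(genome) -> score (higher is better)."""
    population = [random_genome() for _ in range(POP)]
    for _ in range(GENS):
        ranked = sorted(population, key=objective, reverse=True)
        survivors = ranked[:KEEP]                  # keep the best 10
        children = []
        while len(survivors) + len(children) < POP:
            parent_a, parent_b = random.sample(survivors, 2)
            child = crossover(parent_a, parent_b)
            if random.random() < MUT:              # 10% mutation rate
                child = mutate(child)
            children.append(child)
        population = survivors + children
    return max(population, key=objective)
\end{verbatim}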
All implementations of these algorithms are archived at
\cite{dojo}. This software is (similarly to the Axelrod
library) available on GitHub:
\url{https://github.com/Axelrod-Python/axelrod-dojo}. There are objective
functions for:
\begin{itemize}
\item total or mean payoff,
\item total or mean payoff difference (unused in this work),
\item total Moran process wins (fixation probability). This led to the
strategies named TF1, TF2, TF3 listed in
Appendix~\ref{app:list_of_players}.
\end{itemize}
These can be used in noisy or standard environments (as evidenced by
Sections~\ref{sec:standard} and~\ref{sec:noise}). These objectives can be
further modified to suit other purposes. New strategies could be trained with
variations including spatial structure and probabilistically ending matches.
\section{Discussion}
The tournament results indicate that pre-trained strategies are generally better
than human-designed strategies at maximizing payoff against a diverse set of
opponents. An evolutionary algorithm produces strategies based on multiple
generic archetypes that are able to achieve a higher average
score than any other known opponent in a standard tournament. Most of the trained
strategies use multiple rounds of the history of play (some using all of it) and
outperform memory-one strategies from the literature. Interestingly, a trained
memory-one strategy produced by a particle swarm algorithm performs well, better
than human-designed strategies such as Win Stay Lose Shift and zero determinant
strategies (which enforce a payoff difference rather than maximize total payoff).
The generic structure of the trained strategies did not appear to be
critical for the standard tournament -- strategies based on lookup tables,
finite state machines, neural networks, and stochastic variants all performed well.
Single hidden layer neural networks (Section~\ref{sec:ann}) performed well in both
noisy and standard tournaments,
though these involved some human input in the selection of features.
This is in line with the other archetypes, where some human decisions are made
regarding the structure. For the LookerUp and Gambler archetypes
(Sections~\ref{sec:lookerup} and~\ref{sec:gambler})
a decision has to be made
regarding the number of rounds of history and initial play that are to be used.
In contrast, the finite state machines and hidden Markov models
(Sections~\ref{sec:fsm} and~\ref{sec:hmm}) required only a choice of the number
of states, and the training algorithm can eliminate unneeded states in the case
of finite state machines (evidenced by the unconnected nodes in the diagrams
for the included representations).
Many strategies can be represented by multiple archetypes; however, some
archetypes will be more efficient in encoding the patterns present in the data.
The fact that the LookerUp strategy performs best in the standard
tournament indicates that it represents an efficient reduction of
dimension, which in turn makes its training more efficient. In particular, the
first rounds of play were valuable bits of information. For the noisy
tournament, however, the dimension reduction achieved by the lookup tables
appears to miss some features of the data that are captured by the neural
networks and the finite state machines, allowing the latter to adapt better
to the noisy environment. Intuitively, a noisy
environment can significantly affect a lookup table based on the last two rounds
of play since these action pairs compete with probing defections, apologies, and
retaliations. Accordingly, it is not surprising that additional parameter space
is needed to adapt to a noisy environment.
In opposition to historical tournament results and community folklore,
our results show that complex strategies can be very effective for the
IPD\@. Designing complex strategies for the prisoner's dilemma appears to be
difficult for humans. Of all the human-designed
strategies in the library, only DBS consistently performs well, and it is
substantially more complex than traditional tournament winners like TFT, OmegaTFT,
and zero determinant strategies. Furthermore, dealing with noise is difficult
for most strategies. Two strategies designed specifically to account for noise,
DBS and OmegaTFT, perform well; only DBS performs better than the trained
strategies, and \textbf{only} in some noisy contexts. Empirically we find that
DBS (with its default parameters) does not win tournaments at 1\% noise.
However, DBS has a parameter that accounts for the expected amount of noise, and a
follow-up study could provide a more complete comparison of
the performance of DBS and of strategies trained at various noise levels.
The strategies trained to maximize their average score are generally
cooperative and do not defect first. Maximizing for individual
performance across a collection of opponents leads to mutual cooperation despite
the fact that mutual cooperation is an unstable evolutionary equilibrium for the prisoner's
dilemma. Specifically, we note that the reinforcement learning process for maximizing
payoff does not lead to exploitative zero determinant strategies, which may also
be a result of the collection of training strategies, several of which retaliate
harshly. Training with the objective of maximizing payoff difference may
produce strategies more like zero determinant strategies.
For the trained
strategies utilizing lookup tables, we generally found that those incorporating
one or more of the initial rounds of play outperformed those that did not. The
strategies based on neural networks and finite state machines are also able to
condition on the first rounds of play throughout a match. Accordingly, we conclude
that first impressions matter in the IPD\@. The best strategies are nice (never
defecting first) and the impact of the first rounds of play could be further
investigated with the Axelrod library
in future work by e.g.\ forcing all strategies to defect on the first round.
Finally, we note that as the library grows, the rankings of the top performing
strategies sometimes shuffle, and the strategies are not retrained automatically.
Most of the strategies were
trained on an earlier version of the library (v2.2.0: \cite{axelrodproject2.2})
that did not include DBS and several other opponents. The precise parameters
that are optimal will depend on the pool of opponents. Moreover we have not
extensively trained strategies to determine the minimum parameter spaces that are
sufficient -- neural networks with fewer nodes and features and finite state
machines with fewer states may suffice. See \cite{ashlock2013impact} for
discussion of resource availability for IPD strategies.
% TODO: author contributions
\section*{Acknowledgements}
This work was performed using the computational facilities of the Advanced
Research Computing @ Cardiff (ARCCA) Division, Cardiff University.
A variety of software libraries have been used in this work:
\begin{itemize}
\item The Axelrod library (IPD strategies and Tournaments)
\cite{axelrodproject}.
\item The matplotlib library (visualisation) \cite{hunter2007matplotlib}.
\item The pandas and numpy libraries (data manipulation)
\cite{mckinney2010data, walt2011numpy}.
\end{itemize}
\bibliographystyle{plain}
\bibliography{bibliography.bib}
\appendix
\section{List of players}\label{app:list_of_players}
The players used for this study are from Axelrod version 2.13.0
\cite{axelrodproject}.
\begin{multicols}{2}
\begin{enumerate}
\input{./assets/list_of_players.tex}
\end{enumerate}
\end{multicols}
\end{document}