\documentclass{article}
% Packages
\usepackage[margin=1.5cm, includefoot, footskip=30pt]{geometry}
\usepackage{hyperref}
\usepackage{multicol}
\usepackage{booktabs}
\usepackage{xstring}
\usepackage{subcaption}
\usepackage{graphicx}
\usepackage{amsmath}
\usepackage{standalone}
\usepackage{fancyvrb}
\usepackage{authblk}
\usepackage{pdflscape}
\usepackage{tikz}
\usetikzlibrary{calc, shapes, patterns}
% Title
\title{Reinforcement Learning Produces Dominant Strategies for the
Iterated Prisoner's Dilemma}
\author[1]{Marc Harper}
\author[2]{Vincent Knight}
\author[3]{Martin Jones}
\author[4]{Georgios Koutsovoulos}
\author[2]{Nikoleta E. Glynatsi}
\author[3]{Owen Campbell}
\affil[1]{Google Inc., Mountain View, CA, USA}
\affil[2]{Cardiff University, School of Mathematics, UK}
\affil[3]{Not affiliated}
\affil[4]{INRA, Université Côte d'Azur, CNRS, ISA, France}
\date{}
\begin{document}
\maketitle
\begin{abstract}
We present tournament results and several powerful strategies for the Iterated
Prisoner's Dilemma created using reinforcement learning techniques
(evolutionary and particle swarm algorithms). These strategies are
trained to perform well against a corpus of over 170 distinct
opponents, including many well-known and classic strategies. All
the trained strategies win standard tournaments against the total collection
of other opponents. The trained strategies and one particular human-designed
strategy are also the top performers in noisy tournaments.
\end{abstract}
\section{Introduction}\label{sec:introduction}
The Iterated Prisoner's Dilemma (IPD) is a common model in game theory,
frequently used to understand the evolution of cooperative behaviour from complex
dynamics \cite{Axelrod1984}.
This manuscript uses the Axelrod library \cite{knight2016open, axelrodproject},
open source software for conducting IPD research with
reproducibility as a principal goal. Written in the Python programming language,
the library to date contains source code contributed by over 50 individuals
from a variety of geographic locations and technical backgrounds. The library
is supported by a comprehensive test suite that covers all the intended
behaviors of all of the strategies in the library, as well as the features that
conduct matches, tournaments, and population dynamics.
The library is continuously developed and, as of version 3.0.0, contains
over 200 strategies, many from the scientific literature, including
classic strategies like Win Stay Lose Shift \cite{nowak1993strategy} and
previous tournament winners such as OmegaTFT \cite{slany2007some}, Adaptive
Pavlov \cite{li2007design}, and ZDGTFT2 \cite{Stewart2012}.
Since Robert Axelrod's seminal tournament \cite{Axelrod1980a}, a number of
IPD\@ tournaments have been undertaken and are summarised in
Table~\ref{tbl:tournaments}. Further to the work described in
\cite{knight2016open} a regular set of standard, noisy~\cite{Bendor1991} and
probabilistic ending~\cite{Axelrod1980b} tournaments are carried out as more
strategies are added to the Axelrod library.
Details and results are available here:
\url{http://axelrod-tournament.readthedocs.io}. This work presents a detailed
analysis of a tournament with 176 strategies (details given
in Section~\ref{sec:results}).
\begin{table}[!hbtp]
\begin{center}
\begin{tabular}{ccccc}
\toprule
Year & Reference & Number of Strategies & Type & Source Code\\
\midrule
1979 & \cite{Axelrod1980a} & 13 & Standard & Not immediately available\\
1979 & \cite{Axelrod1980b} & 64 & Standard & Available in FORTRAN\\
1991 & \cite{Bendor1991} & 13 & Noisy & Not immediately available\\
2002 & \cite{Stephens2002} & 16 & Wildlife & Not applicable\\
2005 & \cite{kendall2007iterated} & 223 & Varied & Not available \\
2012 & \cite{Stewart2012} & 13 & Standard & Not fully available \\
2016 & \cite{knight2016open} & 129 & Standard & Fully available \\
\bottomrule
\end{tabular}
\end{center}
\caption{An overview of a selection of published tournaments. Not all
tournaments were `standard' round robins; for more details
see the indicated references.}\label{tbl:tournaments}
\end{table}
In this work we describe how collections of strategies in the Axelrod library
have been used to train new strategies specifically to win IPD tournaments.
These strategies are trained using generic strategy archetypes based on e.g.\
finite state
machines, arriving at particularly effective parameter choices through
evolutionary or particle swarm algorithms. There are several
previous publications that use evolutionary algorithms to
evolve IPD strategies in various circumstances
\cite{ashlock2006training, Ashlock2015, Ashlock2006,
ashlock2014shaped, Ashlock2014, barlow2015varying,
fogel1993evolving, marks1989niche, sudo2015effects,
vassiliades2010multiagent}. See also \cite{Gaudesi2016} for a
strategy trained to win against a collection of well-known IPD opponents and see
\cite{franken2005particle} for a prior use of particle swarm algorithms. Our
results are unique in that we are able to train against a large and diverse
collection of strategies available from the scientific literature.
Crucially, the
software used in this work is openly available and can be used to train strategies
in the future in a reliable manner, with confidence that the opponent strategies
are correctly implemented, tested and documented.
Moreover, as of the time of writing, we claim that this work contains the best
performing strategies for the Iterated Prisoner's Dilemma.
\section{The Strategy Archetypes}
The Axelrod library now contains many parametrised strategies trained using
machine learning
methods. Most are deterministic, use many rounds of memory, and perform
extremely well in tournaments as will be discussed in Section~\ref{sec:results}.
Training of these strategies will be discussed in Section~\ref{sec:methods}.
These strategies can encode a variety
of other strategies, including classic strategies like Tit For Tat
\cite{Axelrod1980},
handshake strategies, and grudging strategies that always defect after
an opponent defection.
\subsection{LookerUp}\label{sec:lookerup}
The LookerUp strategy is based on a lookup table and encodes a set of
deterministic responses based on the opponent's first $n_1$ moves, the
opponent's last $m_1$ moves, and the player's last $m_2$ moves. If $n_1 > 0$ then
the player has infinite memory depth, otherwise it has depth $\max(m_1, m_2)$.
This is illustrated diagrammatically in Figure~\ref{fig:lookerup}.
\begin{figure}[!hbtp]
\centering
\includestandalone[height=.3\textheight]{./assets/lookerup}
\caption{Diagrammatic representation of the LookerUp Archetype.}
\label{fig:lookerup}
\end{figure}
Training of this strategy corresponds to finding maps from partial histories to
actions, either a cooperation or a defection. Although various
combinations of $n_1, m_1,$ and $m_2$ have been tried, the best performance at
the time of
training was obtained for $n_1 = m_1 = m_2 = 2$ and generally for $n_1 > 0$.
A strategy
called EvolvedLookerUp2\_2\_2 is among the top strategies in the library.
This archetype can be used to train deterministic memory-$n$ strategies with the
parameters $n_1=0$ and $m_1=m_2=n$. For $n=1$, the resulting strategy cooperates
if the last round was mutual cooperation and defects otherwise, known as Grim or
Grudger.
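
To make the mechanism concrete, the following minimal Python sketch (not the
library's implementation) shows how such a lookup table is consulted each
round; the example table fragment and the default of cooperating on unseen
keys are hypothetical choices made for the sketch.
\begin{verbatim}
# Minimal illustrative sketch of the LookerUp mechanism; not the library's
# implementation, and the example table entries are hypothetical.
C, D = "C", "D"

def lookerup_move(table, my_history, opp_history, n1=2, m1=2, m2=2):
    """Look up an action from the opponent's opening moves and both players'
    recent moves; cooperate until enough history is available."""
    if len(my_history) < max(n1, m1, m2):
        return C
    key = (
        "".join(opp_history[:n1]),   # opponent's first n1 moves
        "".join(opp_history[-m1:]),  # opponent's last m1 moves
        "".join(my_history[-m2:]),   # player's last m2 moves
    )
    return table.get(key, C)  # cooperate on unseen keys (sketch assumption)

example_table = {("CC", "CC", "CC"): C, ("CC", "DD", "CC"): D}
print(lookerup_move(example_table, [C, C], [C, C]))  # prints "C"
\end{verbatim}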
Two strategies in the library, Winner12 and Winner21, from \cite{Mathieu2015},
are based on lookup tables for $n_1 = 0$, $m_1 = 1$, and $m_2=2$. The strategy
Winner12 emerged in less than 10 generations of training in our framework using
a score maximizing objective. Strategies nearly identical to Winner21 arise
from training with a Moran process objective.
\subsection{Gambler}\label{sec:gambler}
Gambler is a stochastic variant of LookerUp. Instead of deterministically
encoded moves the lookup table emits probabilities which are
used to choose cooperation or defection.
This is illustrated diagrammatically in Figure~\ref{fig:gambler}.
\begin{figure}[!hbtp]
\centering
\includestandalone[height=.3\textheight]{./assets/gambler}
\caption{Diagrammatic representation of the Gambler Archetype.}
\label{fig:gambler}
\end{figure}
Training of this strategy corresponds to finding maps from histories to
a probability of cooperation. The library includes a strategy trained
with $n_1 = m_1 = m_2 = 2$ that is \emph{mostly deterministic}, with 52 of the 64
probabilities being 0 or 1. At one time this strategy outperformed
EvolvedLookerUp2\_2\_2.
This strategy type can be used to train arbitrary memory-$n$ strategies. A
memory one strategy called PSOGamblerMem1 was trained, with
probabilities $(\text{Pr}(\text{C}\;|\;\text{CC}),
\text{Pr}(\text{C}\;|\;\text{CD}),
\text{Pr}(\text{C}\;|\;\text{DC}),
\text{Pr}(\text{C}\;|\;\text{DD})) = (1, 0.5217, 0, 0.121)$.
Though it performs well in standard tournaments (see
Table~\ref{tbl:standard_score})
it does not outperform the longer memory strategies, and is bested by a similar
strategy that also uses the first round of play: PSOGambler\_1\_1\_1.
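
As an illustration, the following Python sketch (not the library's code) plays
a memory-one Gambler-style strategy using the PSOGamblerMem1 probabilities
quoted above; the keys follow the usual (player's last move, opponent's last
move) convention, and cooperation on the first round is an assumption of the
sketch.
\begin{verbatim}
import random

# Illustrative memory-one Gambler player using the PSOGamblerMem1
# cooperation probabilities quoted above; not the library's implementation.
C, D = "C", "D"
PSO_GAMBLER_MEM1 = {
    (C, C): 1.0,     # Pr(C | CC)
    (C, D): 0.5217,  # Pr(C | CD)
    (D, C): 0.0,     # Pr(C | DC)
    (D, D): 0.121,   # Pr(C | DD)
}

def gambler_mem1_move(my_last, opp_last, table=PSO_GAMBLER_MEM1):
    """Cooperate with the probability assigned to the last round's outcome."""
    if my_last is None:  # first round (sketch assumption)
        return C
    return C if random.random() < table[(my_last, opp_last)] else D
\end{verbatim}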
These strategies are trained with a particle swarm algorithm rather than an
evolutionary algorithm (though the latter would also suffice). Particle swarm
algorithms have been used to train IPD strategies previously
\cite{franken2005particle}.
\subsection{ANN: Single Hidden Layer Artificial Neural Network}\label{sec:ann}
Strategies based on artificial neural networks use a variety of features
computed from the history of play:
\begin{multicols}{2}
\begin{itemize}
\item Opponent's first move is C
\item Opponent's first move is D
\item Opponent's second move is C
\item Opponent's second move is D
\item Player's previous move is C
\item Player's previous move is D
\item Player's second previous move is C
\item Player's second previous move is D
\item Opponent's previous move is C
\item Opponent's previous move is D
\item Opponent's second previous move is C
\item Opponent's second previous move is D
\item Total opponent cooperations
\item Total opponent defections
\item Total player cooperations
\item Total player defections
\item Round number
\end{itemize}
\end{multicols}
These are then input into a feed-forward neural network with a single hidden
layer of user-supplied width. This is illustrated diagrammatically in
Figure~\ref{fig:ann}.
\begin{figure}[!hbtp]
\centering
\includestandalone[height=.3\textheight]{./assets/ann}
\caption{Diagrammatic representation of the ANN Archetype.}
\label{fig:ann}
\end{figure}
Training of this strategy corresponds to finding parameters of the neural
network. A hidden layer with just five nodes performs quite well in both deterministic and
noisy tournaments. The output of the ANN used in this work is deterministic;
a stochastic variant that outputs probabilities rather than exact moves could
be easily created.
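
The feature construction and forward pass can be sketched as follows; this is
an illustrative reconstruction rather than the library's code, and the choice
of a ReLU hidden layer with a sign-based output rule is an assumption of the
sketch.
\begin{verbatim}
import numpy as np

# Illustrative reconstruction of the feature vector and forward pass; the
# ReLU hidden layer and sign-based decision are assumptions of the sketch.
C, D = "C", "D"

def is_move(history, index, action):
    """1.0 if history[index] exists and equals action, otherwise 0.0."""
    try:
        return 1.0 if history[index] == action else 0.0
    except IndexError:
        return 0.0

def ann_features(my_history, opp_history):
    """Assemble the 17 features listed above into a vector."""
    return np.array([
        is_move(opp_history, 0, C), is_move(opp_history, 0, D),
        is_move(opp_history, 1, C), is_move(opp_history, 1, D),
        is_move(my_history, -1, C), is_move(my_history, -1, D),
        is_move(my_history, -2, C), is_move(my_history, -2, D),
        is_move(opp_history, -1, C), is_move(opp_history, -1, D),
        is_move(opp_history, -2, C), is_move(opp_history, -2, D),
        float(opp_history.count(C)), float(opp_history.count(D)),
        float(my_history.count(C)), float(my_history.count(D)),
        float(len(my_history) + 1),  # round number
    ])

def ann_move(weights_in, weights_out, my_history, opp_history):
    """One forward pass: hidden = relu(W_in x); cooperate if the output >= 0."""
    hidden = np.maximum(0.0, weights_in @ ann_features(my_history, opp_history))
    return C if float(weights_out @ hidden) >= 0.0 else D

rng = np.random.default_rng(0)
w_in, w_out = rng.normal(size=(5, 17)), rng.normal(size=5)  # untrained weights
print(ann_move(w_in, w_out, [C, D], [D, C]))
\end{verbatim}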
\subsection{Finite State Machines}\label{sec:fsm}
Strategies based on finite state machines are deterministic and computationally efficient.
In each round of play the strategy selects an action based on the current state
and the opponent's last action, transitioning to a new state for the next round.
This is illustrated diagrammatically in Figure~\ref{fig:fsm}.
\begin{figure}[!hbtp]
\centering
\includestandalone[height=.3\textheight]{./assets/fsm}
\caption{Diagrammatic representation of the Finite State Machine Archetype.}
\label{fig:fsm}
\end{figure}
Training this strategy corresponds to finding mappings of states and histories
to an action and a state. Figure~\ref{fig:fsm_images} shows two of the trained
finite state machines. The layout of state nodes is kept the same between
Figure~\ref{fig:fsm16} and~\ref{fig:fsm16noise} to highlight the effect of
different training environments. Note also that two of the 16 states are not
used; this is also an outcome of the training process.
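
A minimal Python sketch of this mechanism is given below; it is not the
library's implementation, and the two-state transition table shown is a
hypothetical example that reproduces Tit For Tat-like behaviour.
\begin{verbatim}
# Illustrative finite state machine player: a transition table maps
# (state, opponent's last action) to (next state, our action).
C, D = "C", "D"

class FSMPlayer:
    def __init__(self, transitions, initial_state, initial_action):
        self.transitions = transitions  # {(state, opp_action): (state, action)}
        self.state = initial_state
        self.initial_action = initial_action

    def move(self, opp_last):
        """Play one round given the opponent's last action (None on round one)."""
        if opp_last is None:
            return self.initial_action
        self.state, action = self.transitions[(self.state, opp_last)]
        return action

# A hypothetical two-state table encoding Tit For Tat-like behaviour.
tft_like = FSMPlayer({(0, C): (0, C), (0, D): (1, D),
                      (1, C): (0, C), (1, D): (1, D)}, 0, C)
print(tft_like.move(None), tft_like.move(D), tft_like.move(C))  # C D C
\end{verbatim}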
\begin{figure}[!hbtp]
\centering
\begin{subfigure}[t]{.5\textwidth}
\includegraphics[height=.3\textheight]{./assets/FSM16.pdf}
\caption{Evolved\_FSM\_16: trained to maximise score in a standard
tournament}
\label{fig:fsm16}
\end{subfigure}%
~
\begin{subfigure}[t]{.5\textwidth}
\centering
\includegraphics[height=.35\textheight]{./assets/FSM16noise.pdf}
\caption{Evolved\_FSM\_16\_Noise\_05: trained to maximise score in a
noisy tournament}
\label{fig:fsm16noise}
\end{subfigure}%
\caption{Trained sixteen state Finite State Machine players.}
\label{fig:fsm_images}
\end{figure}
\subsection{Hidden Markov Models}\label{sec:hmm}
A variant of the finite state machine strategies is the hidden Markov model
(HMM). Like the strategies based on finite state machines, these strategies
also encode an internal state. However, they use probabilistic transitions based on the
prior round of play to other states and cooperate or defect with various
probabilities at each state. This is
shown diagrammatically in Figure~\ref{fig:hmm}. Training this strategy
corresponds to finding mappings of states and histories to probabilities of
cooperating as well as probabilities of the next internal state.
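
For illustration, a minimal Python sketch of this archetype follows; it is not
the library's implementation, and the two-state parameter values shown are
hypothetical.
\begin{verbatim}
import random

# Illustrative hidden Markov model player: cooperate with a state-dependent
# probability, then transition stochastically based on the opponent's move.
C, D = "C", "D"

class HMMPlayer:
    def __init__(self, trans_C, trans_D, emission, initial_state):
        # trans_C[s][t] / trans_D[s][t]: Pr(next state t | state s and the
        # opponent played C / D); emission[s]: Pr(cooperate | state s).
        self.trans = {C: trans_C, D: trans_D}
        self.emission = emission
        self.state = initial_state

    def move(self, opp_last):
        if opp_last is not None:
            weights = self.trans[opp_last][self.state]
            self.state = random.choices(range(len(weights)), weights=weights)[0]
        return C if random.random() < self.emission[self.state] else D

# Hypothetical two-state example: state 0 is cooperative, state 1 defects.
player = HMMPlayer(trans_C=[[1.0, 0.0], [0.5, 0.5]],
                   trans_D=[[0.2, 0.8], [0.0, 1.0]],
                   emission=[0.95, 0.05], initial_state=0)
print(player.move(None), player.move(D), player.move(C))
\end{verbatim}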
\begin{figure}[!hbtp]
\centering
\includestandalone[height=.3\textheight]{./assets/hmm}
\caption{Diagrammatic representation of the Hidden Markov Model Archetype.}
\label{fig:hmm}
\end{figure}
\subsection{Meta Strategies}
There are several strategies, called Meta strategies, that are based on
ensemble methods common in machine learning. These strategies are
composed of a team of other strategies. In each round, each member of the team
is polled for its desired next
move. The ensemble then selects the next move based on a rule, such as the
consensus vote in the case of MetaMajority or the best individual performance
in the case of MetaWinner. These strategies were among the best in the library
before the inclusion of those trained by reinforcement learning. The library
contains Meta strategies built from teams of all the deterministic players,
all the memory-one players, and some others.
Because these strategies inherit many of the properties of the strategies
on which they are based, including using knowledge of the match length to defect
on the last round(s) of play, not all of these
strategies were included in results of this
paper. These strategies do not typically outperform the trained strategies
described above.
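
The ensemble mechanism described above can be illustrated with the following
Python sketch (not the library's implementation); the three team members shown
are simple stand-ins for strategies in the library.
\begin{verbatim}
from collections import Counter

# Illustrative MetaMajority-style ensemble: poll every team member and play
# the most common choice.
C, D = "C", "D"

def meta_majority_move(team, my_history, opp_history):
    """Each member is a function (my_history, opp_history) -> C or D."""
    votes = Counter(member(my_history, opp_history) for member in team)
    return votes.most_common(1)[0][0]

# Hypothetical team: Cooperator, Defector and Tit For Tat as plain functions.
team = [
    lambda mine, opp: C,                      # Cooperator
    lambda mine, opp: D,                      # Defector
    lambda mine, opp: opp[-1] if opp else C,  # Tit For Tat
]
print(meta_majority_move(team, [], []))  # -> "C" (two of three vote C)
\end{verbatim}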
\section{Results}\label{sec:results}
This section presents the results of a large IPD tournament with
strategies from the Axelrod library, including some additional parametrized
strategies (e.g.\ various parameter choices for Generous Tit For Tat
\cite{Gaudesi2016}). These are
listed in Appendix~\ref{app:list_of_players}.
All strategies in the tournament follow a simple set of
rules in accordance with earlier tournaments:
\begin{itemize}
\item Players are unaware of the number of turns in a match.
\item Players carry no acquired state between matches.
\item Players cannot observe the outcome of other matches.
\item Players cannot identify their opponent by any label or identifier.
\item Players cannot manipulate or inspect their opponents in any way.
\end{itemize}
Any strategy that does not follow these rules, such as a strategy that defects
on the last round of play, was omitted from the tournament presented here (but
not necessarily from the training pool).
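
For readers wishing to run a tournament of this kind, a small example using
the Axelrod library is shown below; the four players listed and the repetition
count are placeholders for the full setup used in this work.
\begin{verbatim}
import axelrod as axl

# Small illustrative example of standard and noisy tournaments with the
# Axelrod library; players and repetitions are placeholders.
players = [axl.TitForTat(), axl.Defector(), axl.Cooperator(), axl.Grudger()]

standard = axl.Tournament(players, turns=200, repetitions=10)
results = standard.play()
print(results.ranked_names)  # strategies ordered by median payoff

noisy = axl.Tournament(players, turns=200, repetitions=10, noise=0.05)
noisy_results = noisy.play()
\end{verbatim}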
A total of \input{./assets/number_of_players.tex}are included, of which
\input{./assets/number_of_stochastic_players.tex}are stochastic.
Section~\ref{sec:standard} is concerned with the standard tournament with 200
turns whereas in Section~\ref{sec:noise} a tournament with 5\% noise is
discussed. Due to the inherent stochasticity of these IPD tournaments, these
tournaments were repeated \input{./assets/standard_number_of_repetitions.tex}
times. This allows for a detailed and confident analysis of the performance of
strategies. To illustrate the results considered,
Figure~\ref{fig:tit_for_tat_scores} shows the distribution of the mean score per
turn of Tit For Tat over all the repetitions. Similarly,
Figure~\ref{fig:tit_for_tat_ranks} shows the ranks of Tit For Tat for each
repetition. (We note that it never wins a tournament.) Finally,
Figure~\ref{fig:tit_for_tat_wins} shows the number of opponents beaten in any given
tournament: Tit For Tat does not win any match. (This is because it either
draws through mutual cooperation or is the second to defect.)
\begin{figure}[!hbtp]
\centering
\begin{subfigure}[t]{.3\textwidth}
\centering
\includegraphics[width=\textwidth]{./assets/standard_tft_scores.pdf}
\caption{Scores}
\label{fig:tit_for_tat_scores}
\end{subfigure}%
~
\begin{subfigure}[t]{.3\textwidth}
\centering
\includegraphics[width=\textwidth]{./assets/standard_tft_ranks.pdf}
\caption{Ranks}
\label{fig:tit_for_tat_ranks}
\end{subfigure}%
~
\begin{subfigure}[t]{.3\textwidth}
\centering
\includegraphics[width=\textwidth]{./assets/standard_tft_wins.pdf}
\caption{Wins}
\label{fig:tit_for_tat_wins}
\end{subfigure}%
\caption{Results for Tit For Tat over
\protect\input{./assets/standard_number_of_repetitions.tex}tournaments.}
\end{figure}
The utilities used are \((R, P, T, S)=(3, 1, 5, 0)\); thus the specific
Prisoner's Dilemma being played is:
\begin{equation}\label{equ:pd}
\begin{pmatrix}
(3, 3) & (0, 5)\\
(5, 0) & (1, 1)
\end{pmatrix}
\end{equation}
All data generated for this work is archived and available at~\cite{data}.
\subsection{Standard Tournament}\label{sec:standard}
The top 11 performing strategies by median payoff are all strategies trained to maximize
total payoff against a subset of the strategies (Table~\ref{tbl:standard_score}).
The next strategy is the Derived Belief Strategy (DBS) \cite{Au2006},
which actively analyzes the opponent and responds
accordingly. The next three strategies are Winner12, based on a lookup table;
Fool Me Once \cite{axelrodproject}, a grudging strategy that defects indefinitely after
the opponent's second defection; and Omega Tit For Tat \cite{kendall2007iterated}.
\begin{table}[!hbtp]
\centering
\input{./assets/standard_top_15_scores_summary.tex}
\caption{Standard Tournament: Mean score per turn of top 15 strategies
(ranked by median over
\protect\input{./assets/standard_number_of_repetitions.tex}tournaments).
The leaderboard is dominated by the trained strategies (indicated by a
$^{*}$).}
\label{tbl:standard_score}
\end{table}
For completeness, violin plots showing the distribution of the scores of each
strategy (again ranked by median score) are shown in
Figure~\ref{fig:standard_boxplot}.
\begin{landscape}
\begin{figure}[!hbtp]
\centering
\includegraphics[width=\paperwidth]{./assets/standard_scores_boxplots.pdf}
\caption{Standard Tournament: Mean score per turn (ranked by median
over
\protect\input{./assets/standard_number_of_repetitions.tex}tournaments).}
\label{fig:standard_boxplot}
\end{figure}
\end{landscape}
Pairwise payoff results are given as a heatmap (Figure~\ref{fig:standard_heatmap})
which shows that many strategies achieve mutual cooperation (obtaining a score
of 3). The top performing
strategies never defect first yet are able to exploit weaker strategies that
attempt to defect.
\begin{figure}[!hbtp]
\centering
\includegraphics[width=\textwidth]{./assets/standard_scores_heatmap.pdf}
\caption{Standard Tournament: Mean score per turn of row players against
column players (ranked by median over
\protect\input{./assets/standard_number_of_repetitions.tex}tournaments).}
\label{fig:standard_heatmap}
\end{figure}
The strategies that win the most matches
(Table~\ref{tbl:standard_wins_top_winners}) are Defector~\cite{Axelrod1984} and Aggravater~\cite{axelrodproject}, followed
by handshaking and zero determinant strategies~\cite{Press2012}.
This includes two handshaking
strategies that were the result of training to maximize Moran process fixation
(TF1 and TF2). No strategies were trained specifically to win matches. None of
the top scoring strategies appear in the top 15 list of strategies ranked by
match wins. This can be seen in Figure~\ref{fig:standard_winplot} where the
distribution of the number of wins of each strategy is shown.
\begin{table}[!hbtp]
\centering
\input{./assets/standard_top_15_winners_wins_summary.tex}
\caption{Standard Tournament: Number of wins per tournament
of top 15 strategies (ranked by median wins over
\protect\input{./assets/standard_number_of_repetitions.tex}tournaments).}
\label{tbl:standard_wins_top_winners}
\end{table}
\begin{landscape}
\begin{figure}[!hbtp]
\centering
\includegraphics[width=\paperwidth]{./assets/standard_wins_boxplots.pdf}
\caption{Standard Tournament: number of wins per tournament (ranked by
median over
\protect\input{./assets/standard_number_of_repetitions.tex}tournaments).}
\label{fig:standard_winplot}
\end{figure}
\end{landscape}
The number of wins of the top strategies of Table~\ref{tbl:standard_score} is
shown in Table~\ref{tbl:standard_wins}. It is evident that although these
strategies score highly, they do not win many matches: the strategy with the
most wins is Evolved FSM 16, which won at most 60
(\(60/175\approx34\%\)) matches in any given tournament.
\begin{table}[!hbtp]
\centering
\input{./assets/standard_top_15_wins_summary.tex}
\caption{Standard Tournament: Number of wins per tournament
of top 15 strategies (ranked by median score over
\protect\input{./assets/standard_number_of_repetitions.tex}tournaments).}
\label{tbl:standard_wins}
\end{table}
Finally, Table~\ref{tbl:standard_ranks} and
Figure~\ref{fig:standard_ranks_boxplot} show the ranks (based on median score)
of each strategy over the repeated tournaments. Whilst there is some
stochasticity, the top three strategies almost always rank in the top three. For
example, the worst rank achieved by Evolved LookerUp 2 2 2 in any tournament
is 8th.
\begin{table}[!hbtp]
\centering
\input{./assets/standard_top_15_ranks_summary.tex}
\caption{Standard Tournament: Rank in each tournament
of top 15 strategies (ranked by median over
\protect\input{./assets/standard_number_of_repetitions.tex}tournaments).}
\label{tbl:standard_ranks}
\end{table}
\begin{landscape}
\begin{figure}[!hbtp]
\centering
\includegraphics[width=\paperwidth]{./assets/standard_ranks_boxplots.pdf}
\caption{Standard Tournament: rank in each tournament (ranked by
median over
\protect\input{./assets/standard_number_of_repetitions.tex}tournaments).}
\label{fig:standard_ranks_boxplot}
\end{figure}
\end{landscape}
Figure~\ref{fig:comparison_cooperation_heatmaps_standard} shows the rate of
cooperation in each round for the top three strategies. The opponents in these
figures are ordered according to performance by median score. It is evident that
the high performing strategies share a common thread against the top strategies:
they do not defect first and achieve mutual cooperation. Against the lower-ranked
strategies they also do not defect first (a mean cooperation rate of 1 in the
first round) but quickly retaliate.
\begin{figure}[!hbtp]
\centering
\begin{subfigure}[t]{.3\textwidth}
\centering
\includegraphics[width=\textwidth]{./assets/cooperation_0_0_10000_EvolvedLookerUp2_2_2_array.pdf}
\caption{EvolvedLookerUp2\_2\_2}
\end{subfigure}%
~
\begin{subfigure}[t]{.3\textwidth}
\centering
\includegraphics[width=\textwidth]{./assets/cooperation_0_0_10000_Evolved_HMM_5_array.pdf}
\caption{Evolved\_HMM\_5}
\end{subfigure}%
~
\begin{subfigure}[t]{.3\textwidth}
\centering
\includegraphics[width=\textwidth]{./assets/cooperation_0_0_10000_Evolved_FSM_16_array.pdf}
\caption{Evolved\_FSM\_16}
\end{subfigure}%
\caption{Comparison of cooperation rates for Standard Tournament Top 3
(over 10000 repetitions).}
\label{fig:comparison_cooperation_heatmaps_standard}
\end{figure}
\subsection{Noisy Tournament}\label{sec:noise}
Noisy tournaments in which there is a 5\% chance that an action is
flipped are now described. As shown in Table~\ref{tbl:noisy_score} and
Figure~\ref{fig:noisy_score}, the best performing strategies in median payoff
are DBS, designed to account for noise, followed by two strategies trained in
the presence of noise and three strategies trained without noise. One of
the strategies trained with noise (PSO Gambler) actually performs less well than
some of the other high-ranking strategies, including
Spiteful TFT (TFT, but defecting indefinitely if the opponent defects twice
consecutively) and OmegaTFT (also designed to handle noise). While DBS is the clear
winner, it comes at roughly six times the run time of Evolved FSM 16 Noise 05.
% Table of best strategies
\begin{table}[!hbtp]
\centering
\input{./assets/noisy_top_15_scores_summary.tex}
\caption{Noisy (5\%) Tournament: Mean score per turn of top 15 strategies
(ranked by median over
\protect\input{./assets/noisy_number_of_repetitions.tex}tournaments)
~$^{*}$ indicates that the strategy was trained.}
\label{tbl:noisy_score}
\end{table}
\begin{landscape}
\begin{figure}[!hbtp]
\centering
\includegraphics[width=\paperwidth]{./assets/noisy_scores_boxplots.pdf}
\caption{Noisy (5\%) Tournament: Mean score per turn (ranked by median
over
\protect\input{./assets/noisy_number_of_repetitions.tex}tournaments).}
\label{fig:noisy_score}
\end{figure}
\end{landscape}
Recalling Table~\ref{tbl:standard_score}, the strategies trained in the presence
of noise are also among the best performers in the absence of noise. As shown in
Figure~\ref{fig:noisy_heatmap} the cluster of mutually cooperative strategies is
broken by the noise at 5\%. A similar collection of players excels at winning
matches but again achieves a poor total payoff.
\begin{figure}[!hbtp]
\centering
\includegraphics[width=\textwidth]{./assets/noisy_scores_heatmap.pdf}
\caption{Noisy (5\%) Tournament: Mean score per turn of row players against
column players (ranked by median over
\protect\input{./assets/noisy_number_of_repetitions.tex}tournaments).}
\label{fig:noisy_heatmap}
\end{figure}
As shown in Table~\ref{tbl:noisy_wins_top_winners} and
Figure~\ref{fig:noisy_winplot}, the strategies tallying the most wins are
somewhat similar to those in the standard tournaments, with Defector, the handshaking
CollectiveStrategy~\cite{Li2009}, and Aggravater appearing as the top three again.
\begin{table}[!hbtp]
\centering
\input{./assets/noisy_top_15_winners_wins_summary.tex}
\caption{Noisy (5\%) Tournament: Number of wins per tournament
of top 15 strategies (ranked by median wins over
\protect\input{./assets/noisy_number_of_repetitions.tex}tournaments).}
\label{tbl:noisy_wins_top_winners}
\end{table}
\begin{landscape}
\begin{figure}[!hbtp]
\centering
\includegraphics[width=\paperwidth]{./assets/noisy_wins_boxplots.pdf}
\caption{Noisy (5\%) Tournament: number of wins per tournament (ranked by
median over
\protect\input{./assets/noisy_number_of_repetitions.tex}tournaments).}
\label{fig:noisy_winplot}
\end{figure}
\end{landscape}
As shown in Table~\ref{tbl:noisy_wins}, the top ranking strategies win a larger
number of matches in the presence of noise. For example, Spiteful Tit For
Tat~\cite{Prison1998} won almost all of its matches (167) in one tournament.
\begin{table}[!hbtp]
\centering
\input{./assets/noisy_top_15_wins_summary.tex}
\caption{Noisy (5\%) Tournament: Number of wins per tournament
of top 15 strategies (ranked by median score over
\protect\input{./assets/noisy_number_of_repetitions.tex}tournaments).}
\label{tbl:noisy_wins}
\end{table}
Finally, Table~\ref{tbl:noisy_ranks} and
Figure~\ref{fig:noisy_ranks_boxplot} show the ranks (based on median score)
of each strategy over the repeated tournaments. We see that the stochasticity
of the ranks understandably increases relative to the standard tournament. An
exception is the top three strategies: for example, the DBS strategy never ranks
lower than second and wins 75\% of the tournaments. The two strategies trained
for noisy tournaments rank in the top three 95\% of the time.
\begin{table}[!hbtp]
\centering
\input{./assets/noisy_top_15_ranks_summary.tex}
\caption{Noisy (5\%) Tournament: Rank in each tournament
of top 15 strategies (ranked by median over
\protect\input{./assets/noisy_number_of_repetitions.tex}tournaments).}
\label{tbl:noisy_ranks}
\end{table}
\begin{landscape}
\begin{figure}[!hbtp]
\centering
\includegraphics[width=\paperwidth]{./assets/noisy_ranks_boxplots.pdf}
\caption{Noisy (5\%) Tournament: rank in each tournament (ranked by
median over
\protect\input{./assets/noisy_number_of_repetitions.tex}tournaments).}
\label{fig:noisy_ranks_boxplot}
\end{figure}
\end{landscape}
Figure~\ref{fig:comparison_cooperation_heatmaps_noisy} shows the rate of
cooperation in each round for the top three strategies (in the absence of noise).
Just as for the top performing strategies in the standard tournament
(Figure~\ref{fig:comparison_cooperation_heatmaps_standard}), it is evident that
the strategies never defect first and quickly punish poorer strategies.
\begin{figure}[!hbtp]
\centering
\begin{subfigure}[t]{.3\textwidth}
\centering
\includegraphics[width=\textwidth]{./assets/cooperation_0_0_10000_DBS_0-75_3_4_3_5_array.pdf}
\caption{DBS}
\end{subfigure}%
~
\begin{subfigure}[t]{.3\textwidth}
\centering
\includegraphics[width=\textwidth]{./assets/cooperation_0_0_10000_Evolved_ANN_5_Noise_05_array.pdf}
\caption{Evolved\_ANN\_5\_Noise\_05}
\end{subfigure}%
~
\begin{subfigure}[t]{.3\textwidth}
\centering
\includegraphics[width=\textwidth]{./assets/cooperation_0_0_10000_Evolved_FSM_16_Noise_05_array.pdf}
\caption{Evolved\_FSM\_16\_Noise\_05}
\end{subfigure}%
\caption{Comparison of cooperation rates for the Noisy (5\%) Tournament Top 3
(over 10000 standard tournaments).}
\label{fig:comparison_cooperation_heatmaps_noisy}
\end{figure}
\section{Methods}\label{sec:methods}
The trained strategies (denoted by a \(^{*}\) in
Appendix~\ref{app:list_of_players}) were trained using reinforcement
learning algorithms. The ideas of reinforcement learning can be traced back to
the original work of \cite{turing1950computing}, which proposed that
computers could learn by taking random actions according to a distribution
that favours actions with high rewards. The two particular algorithms
used here are:
\begin{itemize}
\item Particle Swarm Algorithm: \cite{imran2013overview}.
\item Evolutionary algorithm: \cite{moriarty1999evolutionary}.
\end{itemize}
The Particle Swarm Algorithm is implemented using the pyswarm library:
\url{https://pypi.python.org/pypi/pyswarm}. This algorithm was used only to
train the Gambler archetype.
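
A hedged sketch of how pyswarm can be applied to this training problem is
shown below; in the real training the objective plays the candidate
probability vector against the opponent pool, whereas here a trivial
placeholder objective is used so that the snippet is self-contained.
\begin{verbatim}
from pyswarm import pso

# Hedged sketch of the particle swarm step for the Gambler archetype; the
# objective below is a placeholder, not the tournament-based objective.
N_PARAMS = 64  # one cooperation probability per lookup key when n1=m1=m2=2

def objective(probabilities):
    # Placeholder: reward vectors whose entries are close to 0 or 1, loosely
    # mimicking the mostly deterministic tables found by training.
    # pso minimises this value.
    return sum(min(p, 1 - p) for p in probabilities)

best_probs, best_value = pso(objective, lb=[0.0] * N_PARAMS,
                             ub=[1.0] * N_PARAMS, swarmsize=50, maxiter=100)
\end{verbatim}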
All other strategies were trained using evolutionary algorithms. The
evolutionary algorithms used standard techniques, varying strategies by
mutation and crossover, and evaluating the performance against each opponent
for many repetitions. The best performing strategies in each generation are
persisted, variants are created, and the objective functions are computed again;
a minimal sketch of this loop is given after the list below.
The default parameters for this procedure are:
\begin{itemize}
\item A population size of 40 individuals (kept constant across the
generations);
\item A mutation rate of 10\%;
\item 10 individuals kept from one generation to the next;
\item A total of 500 generations.
\end{itemize}
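
The sketch below outlines this loop in generic Python under the default
parameters above; it is a simplified illustration, not the axelrod-dojo
implementation, and the genome representation, crossover, mutation, and
objective are left as user-supplied functions.
\begin{verbatim}
import random

# Generic, simplified evolutionary loop; the axelrod-dojo implementation
# differs in its genome representations and objective functions.
POP, KEEP, MUT, GENS = 40, 10, 0.10, 500

def evolve(random_genome, crossover, mutate, objective):
    """random_genome() -> genome; crossover(a, b) -> genome;
    mutate(genome) -> genome; objective(genome) -> score (higher is better)."""
    population = [random_genome() for _ in range(POP)]
    for _ in range(GENS):
        ranked = sorted(population, key=objective, reverse=True)
        survivors = ranked[:KEEP]                  # keep the best 10
        children = []
        while len(survivors) + len(children) < POP:
            parent_a, parent_b = random.sample(survivors, 2)
            child = crossover(parent_a, parent_b)
            if random.random() < MUT:              # 10% mutation rate
                child = mutate(child)
            children.append(child)
        population = survivors + children
    return max(population, key=objective)
\end{verbatim}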
All implementations of these algorithms are archived at
\cite{dojo}. This software is (similarly to the Axelrod
library) available on GitHub:
\url{https://github.com/Axelrod-Python/axelrod-dojo}. There are objective
functions for:
\begin{itemize}
\item total or mean payoff,
\item total or mean payoff difference (unused in this work),
\item total Moran process wins (fixation probability). This led to the
strategies named TF1, TF2, TF3 listed in
Appendix~\ref{app:list_of_players}.
\end{itemize}
These can be used in noisy or standard environments (as evidenced by
Sections~\ref{sec:standard} and~\ref{sec:noise}). These objectives can be
further modified to suit other purposes. New strategies could be trained with
variations including spatial structure and probabilistically ending matches.
\section{Discussion}
The tournament results indicate that pre-trained strategies are generally better
than human-designed strategies at maximizing payoff against a diverse set of
opponents. An evolutionary algorithm produces strategies based on multiple
generic archetypes that are able to achieve a higher average
score than any other known opponent in a standard tournament. Most of the trained
strategies use multiple rounds of the history of play (some using all of it) and
outperform memory-one strategies from the literature. Interestingly, a trained
memory-one strategy produced by a particle swarm algorithm performs well, better
than human-designed strategies such as Win Stay Lose Shift and zero determinant
strategies (which enforce a payoff difference rather than maximize total payoff).
The generic structure of the trained strategies did not appear to be
critical for the standard tournament -- strategies based on lookup tables,
finite state machines, neural networks, and stochastic variants all performed well.
Single hidden layer neural networks (Section~\ref{sec:ann}) performed well in both
noisy and standard tournaments,
though these involved some human input in the selection of features.
This is in line with the other archetypes, where some human decisions are made
regarding the structure. For the LookerUp and Gambler archetypes
(Sections~\ref{sec:lookerup} and~\ref{sec:gambler})
a decision has to be made
regarding the number of rounds of history and initial play that are to be used.
In contrast, the finite state machines and hidden Markov models
(Sections~\ref{sec:fsm} and~\ref{sec:hmm}) required only a choice of the number
of states, and the training algorithm can eliminate unneeded states in the case
of finite state machines (evidenced by the unconnected nodes in the diagrams
for the included representations).
Many strategies can be represented by multiple archetypes; however, some
archetypes will be more efficient in encoding the patterns present in the data.
The fact that the LookerUp strategy performs best in the standard
tournament indicates that it represents an efficient reduction of
dimension, which in turn makes its training more efficient. In particular, the
first rounds of play were valuable bits of information. For the noisy
tournament, however, the dimension reduction achieved by the lookup tables
appears to miss some features of the data that are captured by the neural
networks and the finite state machines, allowing the latter to adapt better
to the noisy environment. Intuitively, a noisy
environment can significantly affect a lookup table based on the last two rounds
of play since these action pairs compete with probing defections, apologies, and
retaliations. Accordingly, it is not surprising that additional parameter space
is needed to adapt to a noisy environment.
In opposition to historical tournament results and community folklore,
our results show that complex strategies can be very effective for the
IPD\@. Designing complex strategies for the prisoner's dilemma appears to be
difficult for humans. Of all the human-designed
strategies in the library, only DBS consistently performs well, and it is
substantially more complex than traditional tournament winners like TFT, OmegaTFT,
and zero determinant strategies. Furthermore, dealing with noise is difficult
for most strategies. Two strategies designed specifically to account for noise,
DBS and OmegaTFT, perform well; only DBS performs better than the trained
strategies, and \textbf{only} in some noisy contexts. Empirically we find that
DBS (with its default parameters) does not win tournaments at 1\% noise.
However, DBS has a parameter that accounts for the expected amount of noise, and a
follow-up study could provide a more complete comparison of
the performance of DBS and of strategies trained at various noise levels.
The strategies trained to maximize their average score are generally
cooperative and do not defect first. Maximizing for individual
performance across a collection of opponents leads to mutual cooperation despite
the fact that mutual cooperation is an unstable evolutionary equilibrium for the prisoner's
dilemma. Specifically, we note that the reinforcement learning process for maximizing
payoff does not lead to exploitative zero determinant strategies, which may also
be a result of the collection of training strategies, several of which retaliate
harshly. Training with the objective of maximizing payoff difference may
produce strategies more like zero determinant strategies.
For the trained
strategies utilizing lookup tables, we generally found that those incorporating
one or more of the initial rounds of play outperformed those that did not. The
strategies based on neural networks and finite state machines are also able to
condition on the first rounds of play throughout a match. Accordingly, we conclude
that first impressions matter in the IPD\@. The best strategies are nice (never
defecting first) and the impact of the first rounds of play could be further
investigated with the Axelrod library
in future work by e.g.\ forcing all strategies to defect on the first round.
Finally, we note that as the library grows, the rankings of the top performing
strategies sometimes shuffle, and the strategies are not retrained automatically.
Most of the strategies were
trained on an earlier version of the library (v2.2.0: \cite{axelrodproject2.2})
that did not include DBS and several other opponents. The precise parameters
that are optimal will depend on the pool of opponents. Moreover we have not
extensively trained strategies to determine the minimum parameter spaces that are
sufficient -- neural networks with fewer nodes and features and finite state
machines with fewer states may suffice. See \cite{ashlock2013impact} for
discussion of resource availability for IPD strategies.
% TODO: author contributions
\section*{Acknowledgements}
This work was performed using the computational facilities of the Advanced
Research Computing @ Cardiff (ARCCA) Division, Cardiff University.
A variety of software libraries have been used in this work:
\begin{itemize}
\item The Axelrod library (IPD strategies and Tournaments)
\cite{axelrodproject}.
\item The matplotlib library (visualisation) \cite{hunter2007matplotlib}.
\item The pandas and numpy libraries (data manipulation)
\cite{mckinney2010data, walt2011numpy}.
\end{itemize}
\bibliographystyle{plain}
\bibliography{bibliography.bib}
\appendix
\section{List of players}\label{app:list_of_players}
The players used for this study are from Axelrod version 2.13.0
\cite{axelrodproject}.
\begin{multicols}{2}
\begin{enumerate}
\input{./assets/list_of_players.tex}
\end{enumerate}
\end{multicols}
\end{document}