<!-- doc/src/sgml/high-availability.sgml -->
<chapter id="high-availability">
<title>High Availability, Load Balancing, and Replication</title>
<indexterm><primary>high availability</></>
<indexterm><primary>failover</></>
<indexterm><primary>replication</></>
<indexterm><primary>load balancing</></>
<indexterm><primary>clustering</></>
<indexterm><primary>data partitioning</></>
<para>
Database servers can work together to allow a second server to
take over quickly if the primary server fails (high
availability), or to allow several computers to serve the same
data (load balancing). Ideally, database servers could work
together seamlessly. Web servers serving static web pages can
be combined quite easily by merely load-balancing web requests
to multiple machines. In fact, read-only database servers can
be combined relatively easily too. Unfortunately, most database
servers have a read/write mix of requests, and read/write servers
are much harder to combine. This is because, although read-only
data needs to be placed on each server only once, a write to any
server must be propagated to all servers so that future read
requests to those servers return consistent results.
</para>
<para>
This synchronization problem is the fundamental difficulty for
servers working together. Because there is no single solution
that eliminates the impact of the sync problem for all use cases,
there are multiple solutions. Each solution addresses this
problem in a different way, and minimizes its impact for a specific
workload.
</para>
<para>
Some solutions deal with synchronization by allowing only one
server to modify the data. Servers that can modify data are
called read/write, <firstterm>master</> or <firstterm>primary</> servers.
Servers that track changes in the master are called <firstterm>standby</>
or <firstterm>slave</> servers. A standby server that cannot be connected
to until it is promoted to a master server is called a <firstterm>warm
standby</> server, and one that can accept connections and serve read-only
queries is called a <firstterm>hot standby</> server.
</para>
<para>
Some solutions are synchronous,
meaning that a data-modifying transaction is not considered
committed until all servers have committed the transaction. This
guarantees that a failover will not lose any data and that all
load-balanced servers will return consistent results no matter
which server is queried. In contrast, asynchronous solutions allow some
delay between the time of a commit and its propagation to the other servers,
opening the possibility that some transactions might be lost in
the switch to a backup server, and that load balanced servers
might return slightly stale results. Asynchronous communication
is used when synchronous would be too slow.
</para>
<para>
Solutions can also be categorized by their granularity. Some solutions
can deal only with an entire database server, while others allow control
at the per-table or per-database level.
</para>
<para>
Performance must be considered in any choice. There is usually a
trade-off between functionality and
performance. For example, a fully synchronous solution over a slow
network might cut performance by more than half, while an asynchronous
one might have a minimal performance impact.
</para>
<para>
The remainder of this section outlines various failover, replication,
and load balancing solutions. A <ulink
url="http://www.postgres-r.org/documentation/terms">glossary</ulink> is
also available.
</para>
<sect1 id="different-replication-solutions">
<title>Comparison of Different Solutions</title>
<variablelist>
<varlistentry>
<term>Shared Disk Failover</term>
<listitem>
<para>
Shared disk failover avoids synchronization overhead by having only one
copy of the database. It uses a single disk array that is shared by
multiple servers. If the main database server fails, the standby server
is able to mount and start the database as though it were recovering from
a database crash. This allows rapid failover with no data loss.
</para>
<para>
Shared hardware functionality is common in network storage devices.
Using a network file system is also possible, though care must be
taken that the file system has full <acronym>POSIX</> behavior (see <xref
linkend="creating-cluster-nfs">). One significant limitation of this
method is that if the shared disk array fails or becomes corrupt, the
primary and standby servers are both nonfunctional. Another issue is
that the standby server should never access the shared storage while
the primary server is running.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>File System (Block-Device) Replication</term>
<listitem>
<para>
A modified version of shared hardware functionality is file system
replication, where all changes to a file system are mirrored to a file
system residing on another computer. The only restriction is that
the mirroring must be done in a way that ensures the standby server
has a consistent copy of the file system — specifically, writes
to the standby must be done in the same order as those on the master.
<productname>DRBD</> is a popular file system replication solution
for Linux.
</para>
<!--
https://forge.continuent.org/pipermail/sequoia/2006-November/004070.html
Oracle RAC is a shared disk approach and just send cache invalidations
to other nodes but not actual data. As the disk is shared, data is
only committed once to disk and there is a distributed locking
protocol to make nodes agree on a serializable transactional order.
-->
</listitem>
</varlistentry>
<varlistentry>
<term>Warm and Hot Standby Using Point-In-Time Recovery (<acronym>PITR</>)</term>
<listitem>
<para>
Warm and hot standby servers can be kept current by reading a
stream of write-ahead log (<acronym>WAL</>)
records. If the main server fails, the standby contains
almost all of the data of the main server, and can be quickly
made the new master database server. This is asynchronous and
can only be done for the entire database server.
</para>
<para>
A PITR standby server can be implemented using file-based log shipping
(<xref linkend="warm-standby">) or streaming replication (see
<xref linkend="streaming-replication">), or a combination of both. For
information on hot standby, see <xref linkend="hot-standby">.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Trigger-Based Master-Standby Replication</term>
<listitem>
<para>
A master-standby replication setup sends all data modification
queries to the master server. The master server asynchronously
sends data changes to the standby server. The standby can answer
read-only queries while the master server is running. The
standby server is ideal for data warehouse queries.
</para>
<para>
<productname>Slony-I</> is an example of this type of replication, with per-table
granularity, and support for multiple standby servers. Because it
updates the standby server asynchronously (in batches), there is
possible data loss during failover.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Statement-Based Replication Middleware</term>
<listitem>
<para>
With statement-based replication middleware, a program intercepts
every SQL query and sends it to one or all servers. Each server
operates independently. Read-write queries must be sent to all servers,
so that every server receives any changes. But read-only queries can be
sent to just one server, allowing the read workload to be distributed
among them.
</para>
<para>
If queries are simply broadcast unmodified, functions like
<function>random()</>, <function>CURRENT_TIMESTAMP</>, and
sequences can have different values on different servers.
This is because each server operates independently, and because
SQL queries are broadcast (and not actual modified rows). If
this is unacceptable, either the middleware or the application
must query such values from a single server and then use those
values in write queries. Another option is to use this replication
option with a traditional master-standby setup, i.e. data modification
queries are sent only to the master and are propagated to the
standby servers via master-standby replication, not by the replication
middleware. Care must also be taken that all
transactions either commit or abort on all servers, perhaps
using two-phase commit (<xref linkend="sql-prepare-transaction">
and <xref linkend="sql-commit-prepared">).
<productname>Pgpool-II</> and <productname>Continuent Tungsten</>
are examples of this type of replication.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Asynchronous Multimaster Replication</term>
<listitem>
<para>
For servers that are not regularly connected, like laptops or
remote servers, keeping data consistent among servers is a
challenge. Using asynchronous multimaster replication, each
server works independently, and periodically communicates with
the other servers to identify conflicting transactions. The
conflicts can be resolved by users or conflict resolution rules.
Bucardo is an example of this type of replication.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Synchronous Multimaster Replication</term>
<listitem>
<para>
In synchronous multimaster replication, each server can accept
write requests, and modified data is transmitted from the
original server to every other server before each transaction
commits. Heavy write activity can cause excessive locking,
leading to poor performance. In fact, write performance is
often worse than that of a single server. Read requests can
be sent to any server. Some implementations use shared disk
to reduce the communication overhead. Synchronous multimaster
replication is best for mostly read workloads, though its big
advantage is that any server can accept write requests —
there is no need to partition workloads between master and
standby servers, and because the data changes are sent from one
server to another, there is no problem with non-deterministic
functions like <function>random()</>.
</para>
<para>
<productname>PostgreSQL</> does not offer this type of replication,
though <productname>PostgreSQL</> two-phase commit (<xref
linkend="sql-prepare-transaction"> and <xref
linkend="sql-commit-prepared">)
can be used to implement this in application code or middleware.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Commercial Solutions</term>
<listitem>
<para>
Because <productname>PostgreSQL</> is open source and easily
extended, a number of companies have taken <productname>PostgreSQL</>
and created commercial closed-source solutions with unique
failover, replication, and load balancing capabilities.
</para>
</listitem>
</varlistentry>
</variablelist>
<para>
<xref linkend="high-availability-matrix"> summarizes
the capabilities of the various solutions listed above.
</para>
<table id="high-availability-matrix">
<title>High Availability, Load Balancing, and Replication Feature Matrix</title>
<tgroup cols="8">
<thead>
<row>
<entry>Feature</entry>
<entry>Shared Disk Failover</entry>
<entry>File System Replication</entry>
<entry>Hot/Warm Standby Using PITR</entry>
<entry>Trigger-Based Master-Standby Replication</entry>
<entry>Statement-Based Replication Middleware</entry>
<entry>Asynchronous Multimaster Replication</entry>
<entry>Synchronous Multimaster Replication</entry>
</row>
</thead>
<tbody>
<row>
<entry>Most Common Implementation</entry>
<entry align="center">NAS</entry>
<entry align="center">DRBD</entry>
<entry align="center">PITR</entry>
<entry align="center">Slony</entry>
<entry align="center">pgpool-II</entry>
<entry align="center">Bucardo</entry>
<entry align="center"></entry>
</row>
<row>
<entry>Communication Method</entry>
<entry align="center">shared disk</entry>
<entry align="center">disk blocks</entry>
<entry align="center">WAL</entry>
<entry align="center">table rows</entry>
<entry align="center">SQL</entry>
<entry align="center">table rows</entry>
<entry align="center">table rows and row locks</entry>
</row>
<row>
<entry>No special hardware required</entry>
<entry align="center"></entry>
<entry align="center">•</entry>
<entry align="center">•</entry>
<entry align="center">•</entry>
<entry align="center">•</entry>
<entry align="center">•</entry>
<entry align="center">•</entry>
</row>
<row>
<entry>Allows multiple master servers</entry>
<entry align="center"></entry>
<entry align="center"></entry>
<entry align="center"></entry>
<entry align="center"></entry>
<entry align="center">•</entry>
<entry align="center">•</entry>
<entry align="center">•</entry>
</row>
<row>
<entry>No master server overhead</entry>
<entry align="center">•</entry>
<entry align="center"></entry>
<entry align="center">•</entry>
<entry align="center"></entry>
<entry align="center">•</entry>
<entry align="center"></entry>
<entry align="center"></entry>
</row>
<row>
<entry>No waiting for multiple servers</entry>
<entry align="center">•</entry>
<entry align="center"></entry>
<entry align="center">•</entry>
<entry align="center">•</entry>
<entry align="center"></entry>
<entry align="center">•</entry>
<entry align="center"></entry>
</row>
<row>
<entry>Master failure will never lose data</entry>
<entry align="center">•</entry>
<entry align="center">•</entry>
<entry align="center"></entry>
<entry align="center"></entry>
<entry align="center">•</entry>
<entry align="center"></entry>
<entry align="center">•</entry>
</row>
<row>
<entry>Standbys accept read-only queries</entry>
<entry align="center"></entry>
<entry align="center"></entry>
<entry align="center">Hot only</entry>
<entry align="center">•</entry>
<entry align="center">•</entry>
<entry align="center">•</entry>
<entry align="center">•</entry>
</row>
<row>
<entry>Per-table granularity</entry>
<entry align="center"></entry>
<entry align="center"></entry>
<entry align="center"></entry>
<entry align="center">•</entry>
<entry align="center"></entry>
<entry align="center">•</entry>
<entry align="center">•</entry>
</row>
<row>
<entry>No conflict resolution necessary</entry>
<entry align="center">•</entry>
<entry align="center">•</entry>
<entry align="center">•</entry>
<entry align="center">•</entry>
<entry align="center"></entry>
<entry align="center"></entry>
<entry align="center">•</entry>
</row>
</tbody>
</tgroup>
</table>
<para>
There are a few solutions that do not fit into the above categories:
</para>
<variablelist>
<varlistentry>
<term>Data Partitioning</term>
<listitem>
<para>
Data partitioning splits tables into data sets. Each set can
be modified by only one server. For example, data can be
partitioned by offices, e.g., London and Paris, with a server
in each office. If queries combining London and Paris data
are necessary, an application can query both servers, or
master/standby replication can be used to keep a read-only copy
of the other office's data on each server.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Multiple-Server Parallel Query Execution</term>
<listitem>
<para>
Many of the above solutions allow multiple servers to handle multiple
queries, but none allow a single query to use multiple servers to
complete faster. This solution allows multiple servers to work
concurrently on a single query. It is usually accomplished by
splitting the data among servers and having each server execute its
part of the query and return results to a central server where they
are combined and returned to the user. <productname>Pgpool-II</>
has this capability. Also, this can be implemented using the
<productname>PL/Proxy</> tool set.
</para>
</listitem>
</varlistentry>
</variablelist>
</sect1>
<sect1 id="warm-standby">
<title>Log-Shipping Standby Servers</title>
<para>
Continuous archiving can be used to create a <firstterm>high
availability</> (HA) cluster configuration with one or more
<firstterm>standby servers</> ready to take over operations if the
primary server fails. This capability is widely referred to as
<firstterm>warm standby</> or <firstterm>log shipping</>.
</para>
<para>
The primary and standby server work together to provide this capability,
though the servers are only loosely coupled. The primary server operates
in continuous archiving mode, while each standby server operates in
continuous recovery mode, reading the WAL files from the primary. No
changes to the database tables are required to enable this capability,
so it offers low administration overhead compared to some other
replication solutions. This configuration also has relatively low
performance impact on the primary server.
</para>
<para>
Directly moving WAL records from one database server to another
is typically described as log shipping. <productname>PostgreSQL</>
implements file-based log shipping by transferring WAL records
one file (WAL segment) at a time. WAL files (16MB) can be
shipped easily and cheaply over any distance, whether it be to an
adjacent system, another system at the same site, or another system on
the far side of the globe. The bandwidth required for this technique
varies according to the transaction rate of the primary server.
Record-based log shipping is more granular and streams WAL changes
incrementally over a network connection (see <xref
linkend="streaming-replication">).
</para>
<para>
It should be noted that log shipping is asynchronous, i.e., the WAL
records are shipped after transaction commit. As a result, there is a
window for data loss should the primary server suffer a catastrophic
failure; transactions not yet shipped will be lost. The size of the
data loss window in file-based log shipping can be limited by use of the
<varname>archive_timeout</varname> parameter, which can be set as low
as a few seconds. However, such a low setting will
substantially increase the bandwidth required for file shipping.
Streaming replication (see <xref linkend="streaming-replication">)
allows a much smaller window of data loss.
</para>
<para>
Recovery performance is sufficiently good that the standby will
typically be only moments away from full
availability once it has been activated. As a result, this is called
a warm standby configuration, which offers high
availability. Restoring a server from an archived base backup and
rollforward will take considerably longer, so that technique only
offers a solution for disaster recovery, not high availability.
A standby server can also be used for read-only queries, in which case
it is called a Hot Standby server. See <xref linkend="hot-standby"> for
more information.
</para>
<indexterm zone="high-availability">
<primary>warm standby</primary>
</indexterm>
<indexterm zone="high-availability">
<primary>PITR standby</primary>
</indexterm>
<indexterm zone="high-availability">
<primary>standby server</primary>
</indexterm>
<indexterm zone="high-availability">
<primary>log shipping</primary>
</indexterm>
<indexterm zone="high-availability">
<primary>witness server</primary>
</indexterm>
<indexterm zone="high-availability">
<primary>STONITH</primary>
</indexterm>
<sect2 id="standby-planning">
<title>Planning</title>
<para>
It is usually wise to create the primary and standby servers
so that they are as similar as possible, at least from the
perspective of the database server. In particular, the path names
associated with tablespaces will be passed across unmodified, so both
primary and standby servers must have the same mount paths for
tablespaces if that feature is used. Keep in mind that if
<xref linkend="sql-createtablespace">
is executed on the primary, any new mount point needed for it must
be created on the primary and all standby servers before the command
is executed. Hardware need not be exactly the same, but experience shows
that maintaining two identical systems is easier than maintaining two
dissimilar ones over the lifetime of the application and system.
In any case the hardware architecture must be the same — shipping
from, say, a 32-bit to a 64-bit system will not work.
</para>
<para>
In general, log shipping between servers running different major
<productname>PostgreSQL</> release
levels is not possible. It is the policy of the PostgreSQL Global
Development Group not to make changes to disk formats during minor release
upgrades, so it is likely that running different minor release levels
on primary and standby servers will work successfully. However, no
formal support for that is offered and you are advised to keep primary
and standby servers at the same release level as much as possible.
When updating to a new minor release, the safest policy is to update
the standby servers first — a new minor release is more likely
to be able to read WAL files from a previous minor release than vice
versa.
</para>
</sect2>
<sect2 id="standby-server-operation">
<title>Standby Server Operation</title>
<para>
In standby mode, the server continuously applies WAL received from the
master server. The standby server can read WAL from a WAL archive
(see <xref linkend="restore-command">) or directly from the master
over a TCP connection (streaming replication). The standby server will
also attempt to restore any WAL found in the standby cluster's
<filename>pg_xlog</> directory. That typically happens after a server
restart, when the standby again replays WAL that was streamed from the
master before the restart, but you can also manually copy files to
<filename>pg_xlog</> at any time to have them replayed.
</para>
<para>
At startup, the standby begins by restoring all WAL available in the
archive location, calling <varname>restore_command</>. Once it
reaches the end of WAL available there and <varname>restore_command</>
fails, it tries to restore any WAL available in the <filename>pg_xlog</> directory.
If that fails, and streaming replication has been configured, the
standby tries to connect to the primary server and start streaming WAL
from the last valid record found in archive or <filename>pg_xlog</>. If that fails
or streaming replication is not configured, or if the connection is
later disconnected, the standby starts the sequence over and tries to
restore the file from the archive again. This loop of retries from the
archive, <filename>pg_xlog</>, and via streaming replication goes on until the server
is stopped or failover is triggered by a trigger file.
</para>
<para>
Standby mode is exited and the server switches to normal operation
when <command>pg_ctl promote</> is run or a trigger file is found
(<varname>trigger_file</>). Before failover,
any WAL immediately available in the archive or in <filename>pg_xlog</> will be
restored, but no attempt is made to connect to the master.
</para>
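<para>
For example, failover can be triggered manually like this (a minimal
sketch; the data directory path is a placeholder):
<programlisting>
pg_ctl promote -D /path/to/standby/data
</programlisting>
</para>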
</sect2>
<sect2 id="preparing-master-for-standby">
<title>Preparing the Master for Standby Servers</title>
<para>
Set up continuous archiving on the primary to an archive directory
accessible from the standby, as described
in <xref linkend="continuous-archiving">. The archive location should be
accessible from the standby even when the master is down, i.e. it should
reside on the standby server itself or another trusted server, not on
the master server.
</para>
<para>
If you want to use streaming replication, set up authentication on the
primary server to allow replication connections from the standby
server(s); that is, create a role and provide a suitable entry or
entries in <filename>pg_hba.conf</> with the database field set to
<literal>replication</>. Also ensure <varname>max_wal_senders</> is set
to a sufficiently large value in the configuration file of the primary
server.
</para>
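<para>
Putting these pieces together, the relevant primary-server settings might
look like the following (a minimal sketch; the archive path is a
placeholder, and the <varname>wal_level</> value assumes a release that
accepts <literal>hot_standby</>):
<programlisting>
# postgresql.conf on the primary
wal_level = hot_standby                          # WAL detail sufficient for a standby
archive_mode = on
archive_command = 'cp %p /path/to/archive/%f'    # archive accessible from the standby
max_wal_senders = 3                              # at least one per standby
</programlisting>
The replication role used in the examples below can be created with:
<programlisting>
CREATE ROLE foo REPLICATION LOGIN PASSWORD 'foopass';
</programlisting>
</para>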
<para>
Take a base backup as described in <xref linkend="backup-base-backup">
to bootstrap the standby server.
</para>
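<para>
For example, with <application>pg_basebackup</> (a minimal sketch; the
host is taken from the examples in this chapter and the target directory
is a placeholder):
<programlisting>
pg_basebackup -h 192.168.1.50 -U foo -D /path/to/standby/data -P
</programlisting>
</para>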
</sect2>
<sect2 id="standby-server-setup">
<title>Setting Up a Standby Server</title>
<para>
To set up the standby server, restore the base backup taken from primary
server (see <xref linkend="backup-pitr-recovery">). Create a recovery
command file <filename>recovery.conf</> in the standby's cluster data
directory, and turn on <varname>standby_mode</>. Set
<varname>restore_command</> to a simple command to copy files from
the WAL archive. If you plan to have multiple standby servers for high
availability purposes, set <varname>recovery_target_timeline</> to
<literal>latest</>, to make the standby server follow the timeline change
that occurs at failover to another standby.
</para>
<note>
<para>
Do not use <application>pg_standby</> or similar tools with the built-in standby mode
described here. <varname>restore_command</> should return immediately
if the file does not exist; the server will retry the command again if
necessary. See <xref linkend="log-shipping-alternative">
for using tools like <application>pg_standby</>.
</para>
</note>
<para>
If you want to use streaming replication, fill in
<varname>primary_conninfo</> with a libpq connection string, including
the host name (or IP address) and any additional details needed to
connect to the primary server. If the primary needs a password for
authentication, the password needs to be specified in
<varname>primary_conninfo</> as well.
</para>
<para>
If you're setting up the standby server for high availability purposes,
set up WAL archiving, connections and authentication like the primary
server, because the standby server will work as a primary server after
failover.
</para>
<para>
If you're using a WAL archive, its size can be minimized using the <xref
linkend="archive-cleanup-command"> parameter to remove files that are no
longer required by the standby server.
The <application>pg_archivecleanup</> utility is designed specifically to
be used with <varname>archive_cleanup_command</> in typical single-standby
configurations; see <xref linkend="pgarchivecleanup">.
Note however, that if you're using the archive for backup purposes, you
need to retain files needed to recover from at least the latest base
backup, even if they're no longer needed by the standby.
</para>
<para>
A simple example of a <filename>recovery.conf</> is:
<programlisting>
standby_mode = 'on'
primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
restore_command = 'cp /path/to/archive/%f %p'
archive_cleanup_command = 'pg_archivecleanup /path/to/archive %r'
</programlisting>
</para>
<para>
You can have any number of standby servers, but if you use streaming
replication, make sure you set <varname>max_wal_senders</> high enough in
the primary to allow them to be connected simultaneously.
</para>
</sect2>
<sect2 id="streaming-replication">
<title>Streaming Replication</title>
<indexterm zone="high-availability">
<primary>Streaming Replication</primary>
</indexterm>
<para>
Streaming replication allows a standby server to stay more up-to-date
than is possible with file-based log shipping. The standby connects
to the primary, which streams WAL records to the standby as they're
generated, without waiting for the WAL file to be filled.
</para>
<para>
Streaming replication is asynchronous, so there is still a small delay
between committing a transaction in the primary and the changes
becoming visible in the standby. The delay is, however, much smaller than with
file-based log shipping, typically under one second assuming the standby
is powerful enough to keep up with the load. With streaming replication,
<varname>archive_timeout</> is not required to reduce the data loss
window.
</para>
<para>
If you use streaming replication without file-based continuous
archiving, you have to set <varname>wal_keep_segments</> in the master
to a value high enough to ensure that old WAL segments are not recycled
too early, while the standby might still need them to catch up. If the
standby falls behind too much, it needs to be reinitialized from a new
base backup. If you set up a WAL archive that's accessible from the
standby, <varname>wal_keep_segments</> is not required as the standby can always
use the archive to catch up.
</para>
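<para>
For example (a minimal sketch; the value is a placeholder to be tuned to
your workload, noting that each segment is 16MB):
<programlisting>
# postgresql.conf on the primary
wal_keep_segments = 32   # retain at least 32 * 16MB = 512MB of WAL
</programlisting>
</para>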
<para>
To use streaming replication, set up a file-based log-shipping standby
server as described in <xref linkend="warm-standby">. The step that
turns a file-based log-shipping standby into a streaming-replication
standby is setting <varname>primary_conninfo</> in the
<filename>recovery.conf</> file to point to the primary server. Set
<xref linkend="guc-listen-addresses"> and authentication options
(see <filename>pg_hba.conf</>) on the primary so that the standby server
can connect to the <literal>replication</> pseudo-database on the primary
server (see <xref linkend="streaming-replication-authentication">).
</para>
<para>
On systems that support the keepalive socket option, setting
<xref linkend="guc-tcp-keepalives-idle">,
<xref linkend="guc-tcp-keepalives-interval"> and
<xref linkend="guc-tcp-keepalives-count"> helps the primary promptly
notice a broken connection.
</para>
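<para>
For example (a minimal sketch; the timings, in seconds, are placeholders
to be tuned to your network):
<programlisting>
# postgresql.conf on the primary
tcp_keepalives_idle = 60        # probe an idle replication connection after 60s
tcp_keepalives_interval = 10    # retry unanswered probes every 10s
tcp_keepalives_count = 5        # declare the connection dead after 5 failed probes
</programlisting>
</para>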
<para>
Set the maximum number of concurrent connections from the standby servers
(see <xref linkend="guc-max-wal-senders"> for details).
</para>
<para>
When the standby is started and <varname>primary_conninfo</> is set
correctly, the standby will connect to the primary after replaying all
WAL files available in the archive. If the connection is established
successfully, you will see a walreceiver process in the standby, and
a corresponding walsender process in the primary.
</para>
<sect3 id="streaming-replication-authentication">
<title>Authentication</title>
<para>
It is very important that the access privileges for replication be set up
so that only trusted users can read the WAL stream, because it is
easy to extract privileged information from it. Standby servers must
authenticate to the primary as an account that has the
<literal>REPLICATION</> privilege. So a role with the
<literal>REPLICATION</> and <literal>LOGIN</> privileges needs to be
created on the primary.
</para>
<note>
<para>
It is recommended that a dedicated user account be used for replication.
Although the <literal>REPLICATION</> privilege is granted to superuser
accounts by default, using superuser accounts for replication is not
recommended. While the <literal>REPLICATION</> privilege gives very high
permissions, it does not allow the user to modify any data on the
primary system, which the <literal>SUPERUSER</> privilege does.
</para>
</note>
<para>
Client authentication for replication is controlled by a
<filename>pg_hba.conf</> record specifying <literal>replication</> in the
<replaceable>database</> field. For example, if the standby is running on
host IP <literal>192.168.1.100</> and the account name for replication
is <literal>foo</>, the administrator can add the following line to the
<filename>pg_hba.conf</> file on the primary:
<programlisting>
# Allow the user "foo" from host 192.168.1.100 to connect to the primary
# as a replication standby if the user's password is correctly supplied.
#
# TYPE DATABASE USER ADDRESS METHOD
host replication foo 192.168.1.100/32 md5
</programlisting>
</para>
<para>
The host name and port number of the primary, connection user name,
and password are specified in the <filename>recovery.conf</> file.
The password can also be set in the <filename>~/.pgpass</> file on the
standby (specify <literal>replication</> in the <replaceable>database</>
field).
For example, if the primary is running on host IP <literal>192.168.1.50</>,
port <literal>5432</literal>, the account name for replication is
<literal>foo</>, and the password is <literal>foopass</>, the administrator
can add the following line to the <filename>recovery.conf</> file on the
standby:
<programlisting>
# The standby connects to the primary that is running on host 192.168.1.50
# and port 5432 as the user "foo" whose password is "foopass".
primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
</programlisting>
</para>
</sect3>
<sect3 id="streaming-replication-monitoring">
<title>Monitoring</title>
<para>
An important health indicator of streaming replication is the amount
of WAL generated on the primary but not yet applied on the
standby. You can calculate this lag by comparing the current WAL write
location on the primary with the last WAL location received by the
standby. They can be retrieved using
<function>pg_current_xlog_location</> on the primary and the
<function>pg_last_xlog_receive_location</> on the standby,
respectively (see <xref linkend="functions-admin-backup-table"> and
<xref linkend="functions-recovery-info-table"> for details).
The last WAL receive location in the standby is also displayed in the
process status of the WAL receiver process, displayed using the
<command>ps</> command (see <xref linkend="monitoring-ps"> for details).
</para>
<para>
You can retrieve a list of WAL sender processes via the
<link linkend="monitoring-stats-views-table">
<literal>pg_stat_replication</></link> view. Large differences between
<function>pg_current_xlog_location</> and the <literal>sent_location</> field
might indicate that the master server is under heavy load, while
differences between <literal>sent_location</> and
<function>pg_last_xlog_receive_location</> on the standby might indicate
network delay, or that the standby is under heavy load.
</para>
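<para>
For example, replication lag can be inspected with queries like the
following (a minimal sketch using the functions and view described above):
<programlisting>
-- On the primary: current WAL write location and per-standby send progress
SELECT pg_current_xlog_location();
SELECT application_name, sent_location FROM pg_stat_replication;

-- On the standby: last WAL location received
SELECT pg_last_xlog_receive_location();
</programlisting>
</para>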
</sect3>
</sect2>
<sect2 id="synchronous-replication">
<title>Synchronous Replication</title>
<indexterm zone="high-availability">
<primary>Synchronous Replication</primary>
</indexterm>
<para>
<productname>PostgreSQL</> streaming replication is asynchronous by
default. If the primary server
crashes then some transactions that were committed may not have been
replicated to the standby server, causing data loss. The amount
of data loss is proportional to the replication delay at the time of
failover.
</para>
<para>
Synchronous replication offers the ability to confirm that all changes
made by a transaction have been transferred to one synchronous standby
server. This extends the standard level of durability
offered by a transaction commit. This level of protection is referred
to as 2-safe replication in computer science theory.
</para>
<para>
When requesting synchronous replication, each commit of a
write transaction will wait until confirmation is
received that the commit has been written to the transaction log on disk
of both the primary and standby server. The only way data
can be lost is if both the primary and the standby suffer crashes at the
same time. This can provide a much higher level of durability, though only
if the sysadmin is cautious about the placement and management of the two
servers. Waiting for confirmation increases the user's confidence that the
changes will not be lost in the event of server crashes but it also
necessarily increases the response time for the requesting transaction.
The minimum wait time is the round-trip time between the primary and the standby.
</para>
<para>
Read-only transactions and transaction rollbacks need not wait for
replies from standby servers. Subtransaction commits do not wait for
responses from standby servers; only top-level commits do. Long-running
actions such as data loading or index building do not wait
until the very final commit message. All two-phase commit actions
require commit waits, including both the prepare and the commit.
</para>
<sect3 id="synchronous-replication-config">
<title>Basic Configuration</title>
<para>
Once streaming replication has been configured, configuring synchronous
replication requires only one additional configuration step:
<xref linkend="guc-synchronous-standby-names"> must be set to
a non-empty value. <varname>synchronous_commit</> must also be set to
<literal>on</>, but since this is the default value, typically no change is
required. This configuration will cause each commit to wait for
confirmation that the standby has written the commit record to durable
storage, even if that takes a very long time.
<varname>synchronous_commit</> can be set by individual
users, so it can be configured in the configuration file, for particular
users or databases, or dynamically by applications, in order to control
the durability guarantee on a per-transaction basis.
</para>
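<para>
For example (a minimal sketch; <literal>standby1</> is a placeholder that
must match the <literal>application_name</> supplied by the standby in its
<varname>primary_conninfo</>):
<programlisting>
# postgresql.conf on the primary
synchronous_standby_names = 'standby1'
synchronous_commit = on    # the default
</programlisting>
</para>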
<para>
After a commit record has been written to disk on the primary, the
WAL record is then sent to the standby. The standby sends reply
messages each time a new batch of WAL data is written to disk, unless
<varname>wal_receiver_status_interval</> is set to zero on the standby.
If the standby is the first matching standby, as specified in
<varname>synchronous_standby_names</> on the primary, the reply
messages from that standby will be used to wake users waiting for
confirmation that the commit record has been received. These parameters
allow the administrator to specify which standby servers should be
synchronous standbys. Note that the configuration of synchronous
replication is mainly on the master.
</para>
<para>
Users will stop waiting if a fast shutdown is requested. However, as
when using asynchronous replication, the server will not fully
shut down until all outstanding WAL records are transferred to the currently
connected standby servers.
</para>
</sect3>
<sect3 id="synchronous-replication-performance">
<title>Planning for Performance</title>
<para>
Synchronous replication usually requires carefully planned and placed
standby servers to ensure applications perform acceptably. Waiting
does not utilize system resources, but transaction locks continue to be
held until the transfer is confirmed. As a result, incautious use of
synchronous replication will reduce performance for database
applications because of increased response times and higher contention.
</para>
<para>
<productname>PostgreSQL</> allows the application developer
to specify the durability level required via replication. This can be
specified for the system overall, though it can also be specified for
specific users or connections, or even individual transactions.
</para>
<para>
For example, an application workload might consist of
10% changes to important customer details and
90% changes to less important data that the business can more
easily survive losing, such as chat messages between users.
</para>
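<para>
The less important changes could then skip the synchronous wait on a
per-transaction basis, a minimal sketch assuming synchronous replication
has been configured as described above:
<programlisting>
BEGIN;
SET LOCAL synchronous_commit TO OFF;
-- ... changes to less important data, e.g., chat messages ...
COMMIT;
</programlisting>
</para>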
<para>
With synchronous replication options specified at the application level
(on the primary), we can offer synchronous replication for the most
important changes, without slowing down the bulk of the total workload.
Application level options are an important and practical tool for allowing