-
Notifications
You must be signed in to change notification settings - Fork 21
/
Copy pathbootstrapping.html.md.erb
340 lines (252 loc) · 19.3 KB
/
bootstrapping.html.md.erb
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
---
title: Bootstrapping a Galera Cluster
owner: MySQL
---
<strong><%= modified_date %></strong>
This topic describes the procedure for recovering a terminated Elastic Runtime cluster using a process known as bootstrapping.
## <a id='when'></a>When to Bootstrap
You must bootstrap a cluster that loses quorum. A cluster loses quorum when less than half of the nodes can communicate with each other for longer than the configured grace period. If a cluster does not lose quorum, individual unhealthy nodes automatically rejoin the cluster after resolving the error, restarting the node, or restoring connectivity.
You can detect lost quorum through the following symptoms:
* All nodes appear "Unhealthy" on the proxy dashboard, viewable at `proxy-BOSH-JOB-INDEX.p-mysql.YOUR-SYSTEM-DOMAIN`:
<img src="https://docs.pivotal.io/edge/p-mysql/quorum-lost.png">
* All responsive nodes report the value of `wsrep_cluster_status` as `non-Primary`:
<pre class="terminal">
mysql> SHOW STATUS LIKE 'wsrep_cluster_status';
+----------------------+-------------+
| Variable_name | Value |
+----------------------+-------------+
| wsrep_cluster_status | non-Primary |
+----------------------+-------------+
</pre>
* All responsive nodes respond with `ERROR 1047` when queried with most statement types:
<pre class="terminal">
mysql> select * from mysql.user;
ERROR 1047 (08S01) at line 1: WSREP has not yet prepared node for application use
</pre>
See the [Cluster Scaling, Node Failure, and Quorum](../../edge/p-mysql/cluster-behavior.html) topic for more details about determining cluster state.
Follow the steps below to recover a cluster that has lost quorum.
## <a id='manifest'></a>Step 1: Choose the Correct Manifest
<p class='note'><strong>Note</strong>: This topic requires you to run commands from the <a href="http://docs.pivotal.io/pivotalcf/customizing/index.html">Ops Manager Director</a> using the BOSH CLI. Refer to the <a href="http://docs.pivotal.io/pivotalcf/customizing/trouble-advanced.html#prepare">Advanced Troubleshooting with the BOSH CLI</a> topic for more information.</p>
1. Log into the BOSH director by running `bosh target DIRECTOR-URL` followed by `bosh login USERNAME PASSWORD`.
1. Run `bosh deployments`.
<pre class="terminal">
$ bosh deployments
Acting as user 'director' on 'p-bosh-30c19bdd43c55c627d70'
+-------------------------+-------------------------------+----------------------------------------------+--------------+
| Name | Release(s) | Stemcell(s) | Cloud Config |
+-------------------------+-------------------------------+----------------------------------------------+--------------+
| cf-e82cbf44613594d8a155 | cf-autoscaling/28 | bosh-aws-xen-hvm-ubuntu-trusty-go_agent/3140 | none |
| | cf-mysql/23 | | |
| | cf/225 | | |
| | diego/0.1441.0 | | |
| | etcd/18 | | |
| | garden-linux/0.327.0 | | |
| | notifications-ui/10 | | |
| | notifications/19 | | |
| | push-apps-manager-release/397 | | |
+-------------------------+-------------------------------+----------------------------------------------+--------------+
| p-mysql | p-mysql | | |
+-----------------------------------------------------------------------------------------------------------------------+
</pre>
1. Download the manifest.
<pre class="terminal">
$ bosh download manifest p-mysql /tmp/p-mysql.yml
Acting as user 'director' on deployment [...]
Deployment manifest saved to `/tmp/p-mysql.yml'
</pre>
1. Set BOSH to use the deployment manifest you downloaded.
<pre class="terminal">
$ bosh deployment /tmp/p-mysql.yml
</pre>
## <a id="assisted-bootstrap"></a>Step 2: Run the Bootstrap Errand ##
Elastic Runtime versions 1.7.0 and later include a [BOSH errand](http://bosh.io/docs/jobs.html#jobs-vs-errands) to automate the process of bootstrapping. The bootstrap errand automates the steps described in the [Manual Bootstrapping](#manual-bootstrap) section below. It finds the node with the highest transaction sequence number and asks it to start up by itself in bootstrap mode. Finally, it asks the remaining nodes to join the cluster.
In most cases, running the errand will recover your cluster. However, certain scenarios require additional steps. To determine which set of instructions to follow, you must determine the state of your Virtual Machines (VMs).
1. Run `bosh instances` and examine the output.
* If the output of `bosh instances` shows the state of the jobs as `failing`, proceed to Scenario 1.
<pre class="terminal">
$ bosh instances
[...]
+--------------------------------------------------+---------+------------------------------------------------+------------+
| Instance | State | Resource Pool | IPs |
+--------------------------------------------------+---------+------------------------------------------------+------------+
| mysql-partition-a813339fde9330e9b905/0 | failing | mysql-partition-a813339fde9330e9b905 | 192.0.2.55 |
| mysql-partition-a813339fde9330e9b905/1 | failing | mysql-partition-a813339fde9330e9b905 | 192.0.2.56 |
| mysql-partition-a813339fde9330e9b905/2 | failing | mysql-partition-a813339fde9330e9b905 | 192.0.2.57 |
+--------------------------------------------------+---------+------------------------------------------------+------------+
</pre>
* If the output of `bosh instances` shows the state of jobs as `unknown/unknown`, proceed to Scenario 2.
<pre class="terminal">
$ bosh instances
+--------------------------------------------------+--------------------+------------------------------------------------+------------+
| Instance | State | Resource Pool | IPs |
+--------------------------------------------------+--------------------+------------------------------------------------+------------+
| unknown/unknown | unresponsive agent | | |
+--------------------------------------------------+--------------------+------------------------------------------------+------------+
| unknown/unknown | unresponsive agent | | |
+--------------------------------------------------+--------------------+------------------------------------------------+------------+
| unknown/unknown | unresponsive agent | | |
+--------------------------------------------------+--------------------+------------------------------------------------+------------+
</pre>
### <a id="cluster-disrupted"></a>Scenario 1: Virtual Machines Running, Cluster Disrupted ###
In this scenario, nodes are up and running, but the cluster has been disrupted. You can run the bootstrap errand without recreating the VMs.
1. Run `bosh run errand bootstrap`. The errand command prints the following message when finished running:
<pre class="terminal">
Bootstrap errand completed
[stderr]
\+ echo 'Started bootstrap errand ...'
\+ JOB\_DIR=/var/vcap/jobs/bootstrap
\+ CONFIG\_PATH=/var/vcap/jobs/bootstrap/config/config.yml
\+ /var/vcap/packages/bootstrap/bin/cf-mysql-bootstrap -configPath=/var/vcap/jobs/bootstrap/config/config.yml
\+ echo 'Bootstrap errand completed'
\+ exit 0
Errand `bootstrap' completed successfully (exit code 0)
</pre>
<p class="note"><strong>Note</strong>: Sometimes the bootstrap errand fails on the first try. If this happens, run the command again in a few minutes.</p>
1. If the errand fails, try performing the steps automated by the errand manually by following the [Manual Bootstrapping](#manual-bootstrap) procedure.
### <a id="vms-terminated"></a>Scenario 2: Virtual Machines Terminated or Lost ###
In this scenario, severe circumstances such as power failure have terminated all of your VMs. You need to recreate the VMs before you can recover the cluster.
1. If you enabled the [VM Resurrector](http://docs.pivotal.io/pivotalcf/customizing/resurrector.html#enabling) in Ops Manager, the system detects the terminated VMs and automatically attempts to recreate them. Run `bosh tasks recent --no-filter` to see the `scan and fix` job run by the VM Resurrector.
<pre class="terminal">
$ bosh tasks recent --no-filter
+-----+------------+-------------------------+----------+--------------------------------------------+---------------------------------------------------+
| # | State | Timestamp | User | Description | Result |
+-----+------------+-------------------------+----------+--------------------------------------------+---------------------------------------------------+
| 123 | queued | 2016-01-08 00:18:07 UTC | director | scan and fix | |
</pre>
If you have not enabled the VM Resurrector, run the BOSH cloud check command `bosh cck` to delete any placeholder VMs. When prompted, choose `Delete VM reference` by entering `3`.
<pre class="terminal">
$ bosh cck
Acting as user 'director' on deployment 'cf-e82cbf44613594d8a155' on 'p-bosh-30c19bdd43c55c627d70'
Performing cloud check...
Director task 34
Started scanning 22 vms
Started scanning 22 vms > Checking VM states. Done (00:00:10)
Started scanning 22 vms > 19 OK, 0 unresponsive, 3 missing, 0 unbound, 0 out of sync. Done (00:00:00)
Done scanning 22 vms (00:00:10)
Started scanning 10 persistent disks
Started scanning 10 persistent disks > Looking for inactive disks. Done (00:00:02)
Started scanning 10 persistent disks > 10 OK, 0 missing, 0 inactive, 0 mount-info mismatch. Done (00:00:00)
Done scanning 10 persistent disks (00:00:02)
Task 34 done
Started 2015-11-26 01:42:42 UTC
Finished 2015-11-26 01:42:54 UTC
Duration 00:00:12
Scan is complete, checking if any problems found.
Found 3 problems
Problem 1 of 3: VM with cloud ID `i-afe2801f' missing.
1. Skip for now
2. Recreate VM
3. Delete VM reference
Please choose a resolution [1 - 3]: 3
Problem 2 of 3: VM with cloud ID `i-36741a86' missing.
1. Skip for now
2. Recreate VM
3. Delete VM reference
Please choose a resolution [1 - 3]: 3
Problem 3 of 3: VM with cloud ID `i-ce751b7e' missing.
1. Skip for now
2. Recreate VM
3. Delete VM reference
Please choose a resolution [1 - 3]: 3
Below is the list of resolutions you've provided
Please make sure everything is fine and confirm your changes
1. VM with cloud ID `i-afe2801' missing.
Delete VM reference
2. VM with cloud ID `i-36741a86' missing.
Delete VM reference
3. VM with cloud ID `i-ce751b7e' missing.
Delete VM reference
Apply resolutions? (type 'yes' to continue): yes
Applying resolutions...
Director task 35
Started applying problem resolutions
Started applying problem resolutions > missing_vm 11: Delete VM reference. Done (00:00:00)
Started applying problem resolutions > missing_vm 27: Delete VM reference. Done (00:00:00)
Started applying problem resolutions > missing_vm 26: Delete VM reference. Done (00:00:00)
Done applying problem resolutions (00:00:00)
Task 35 done
Started 2015-11-26 01:47:08 UTC
Finished 2015-11-26 01:47:08 UTC
Duration 00:00:00
Cloudcheck is finished
</pre>
1. Run `bosh instances` and examine the output. The VMs transition from `unresponsive agent` to `starting`. Ultimately, two appear as `failing`. Do not proceed to the next step until all three VMs are in the `starting` or `failing` state.
<pre class="terminal">
$ bosh instances
[...]
+--------------------------------------------------+----------+------------------------------------------------+------------+
| mysql-partition-e97dae91e44681e0b543/0 | starting | mysql-partition-e97dae91e44681e0b543 | 192.0.2.60 |
| mysql-partition-e97dae91e44681e0b543/1 | failing | mysql-partition-e97dae91e44681e0b543 | 192.0.2.61 |
| mysql-partition-e97dae91e44681e0b543/2 | failing | mysql-partition-e97dae91e44681e0b543 | 192.0.2.62 |
+--------------------------------------------------+----------+------------------------------------------------+------------+
</pre>
1. Complete the following steps to prepare your deployment for the bootstrap errand:
1. Run `bosh edit deployment` to launch a `vi` editor and modify the deployment.
1. Search for the jobs section: `jobs`.
1. Search for the mysql-partition: `mysql-partition`.
1. Search for the update section: `update`.
1. Change `max_in_flight` to `3`.
1. Below the `max_in_flight` line, add a new line: `canaries: 0`.
1. Set `update.serial` to `false`.
1. Run `bosh deploy`.
1. Run `bosh run errand bootstrap`.
1. Run `bosh instances` and examine the output to confirm that the errand completes successfully. Some instances may still appear as `failing`.
1. Complete the following steps to restore the BOSH configuration:
1. Run `bosh edit deployment`.
1. Re-set `canaries` to 1, `max_in_flight` to 1, and `serial` to true in the same manner as above.
1. Run `bosh deploy`.
1. Validate that all mysql instances are in `running` state.
<p class="note"><strong>Note:</strong> You must reset the values in the BOSH manifest to ensure successful future deployments and accurate reporting of the status of your jobs.</p>
1. If this procedure fails, try performing the steps automated by the errand manually by following the [Manual Bootstrapping](#manual-bootstrap) procedure.
## <a id="manual-bootstrap"></a>Manual Bootstrapping ##
<p class="note"><strong>Note</strong>: The following steps are prone to user error and can result in lost data if followed incorrectly. Please follow the <a href="#assisted-bootstrap">Run the Bootstrap Errand</a> instructions above first, and only resort to the manual process if the errand fails to repair the cluster.</p>
If the bootstrap errand cannot recover the cluster, you need to perform the steps automated by the errand manually.
- If the output of `bosh instances` shows the state of the jobs as `failing` ([Scenario 1](#cluster-disrupted)), proceed directly to the manual steps below.
- If the output of `bosh instances` shows the state of the jobs as `unknown/unknown`, perform Steps 1-3 of [Scenario 2](#vms-terminated), substitute the manual steps below for Step 4, and then perform Steps 5-6 of [Scenario 2](#vms-terminated).
1. SSH to each node in the cluster and, as root, shut down the `mariadb` process.
<pre class="terminal">
$ monit stop mariadb_ctrl
</pre>
Re-bootstrapping the cluster will not be successful unless all other nodes have been shut down.
1. Choose a node to bootstrap by locating the node with the highest transaction sequence number (`seqno`). You can obtain the `seqno` of a stopped node in one of two ways:
- If a node shut down gracefully, the `seqno` is in the Galera state file of the node.
<pre class="terminal">
$ cat /var/vcap/store/mysql/grastate.dat | grep 'seqno:'
</pre>
- If the node crashed or was killed, the `seqno` in the Galera state file of the node is `-1`. In this case, the `seqno` may be recoverable from the database.
1. Run the following command to start up the database, log the recovered sequence number, and exit.
<pre class="terminal">
$ /var/vcap/packages/mariadb/bin/mysqld --wsrep-recover
</pre>
1. Scan the error log for the recovered sequence number. The last number after the group id (`uuid`) is the recovered `seqno`:
<pre class="terminal">
$ grep "Recovered position" /var/vcap/sys/log/mysql/mysql.err.log | tail -1
150225 18:09:42 mysqld_safe WSREP: Recovered position e93955c7-b797-11e4-9faa-9a6f0b73eb46:15
</pre>
If the node never connected to the cluster before crashing, it may not have a group id (`uuid` in `grastate.dat`). In this case, you cannot recover the `seqno`. Unless all nodes crashed this way, do not choose this node for bootstrapping.
1. Choose the node with the highest `seqno` value as the bootstrap node. If all nodes have the same `seqno`, you can choose any node as the bootstrap node.
<p class="note"><strong>Note:</strong> Only perform these bootstrap commands on the node with the highest <code>seqno</code>. Otherwise, the node with the highest <code>seqno</code> will be unable to join the new cluster unless its data is abandoned. Its <code>mariadb</code> process will exit with an error. See the <a href="../../edge/p-mysql/cluster-behavior.html">Cluster Scaling, Node Failure, and Quorum</a> topic for more details on intentionally abandoning data.</p>
1. On the bootstrap node, update the state file and restart the `mariadb` process.
<pre class="terminal">
$ echo -n "NEEDS_BOOTSTRAP" > /var/vcap/store/mysql/state.txt
$ monit start mariadb_ctrl
</pre>
1. Check that the `mariadb` process has started successfully.
<pre class="terminal">
$ watch monit summary
</pre>
It can take up to ten minutes for `monit` to start the `mariadb` process.
1. Once the bootstrapped node is running, start the `mariadb` process on the remaining nodes using `monit`.
<pre class="terminal">
$ monit start mariadb_ctrl
</pre>
1. Verify that the new nodes have successfully joined the cluster. The following command displays the total number of nodes in the cluster:
<pre class="terminal">
mysql> SHOW STATUS LIKE 'wsrep_cluster_size';
</pre>
1. Complete the following steps to restore the BOSH configuration:
1. Run `bosh edit deployment`.
1. Re-set `canaries` to 1, `max_in_flight` to 1, and `serial` to true in the same manner as above.
1. Run `bosh deploy`.
1. Validate that all mysql instances are in `running` state.
<p class="note"><strong>Note:</strong> You must reset the values in the BOSH manifest to ensure successful future deployments and accurate reporting of the status of your jobs.</p>