Lecture 14 Optimistic Concurrency Control.srt
1
00:00:00,060 --> 00:00:06,150
我想谈谈农场日和乐观的并发控制,
I'd like to talk about FaRM today
and optimistic concurrency control, which
2
00:00:06,150 --> 00:00:10,860
是主要有趣的技术,它利用了我们谈论农场的原因
is the main interesting technique that
it uses. The reason we're talking about FaRM
3
00:00:10,860 --> 00:00:16,350
这是该系列中有关事务和复制的最后一篇文章,
is that it's the last paper in the series
about transactions, replication, and
4
00:00:16,350 --> 00:00:21,720
分片,这仍然是一个开放的研究领域,人们完全可以
sharding and this is still an open
research area where people are totally
5
00:00:21,720 --> 00:00:30,359
对绩效或绩效与一致性不满意
not satisfied with performance, or with the
kind of performance-versus-consistency
6
00:00:30,359 --> 00:00:32,668
可以进行权衡,他们仍在尝试做得更好
trade-offs that are available and
they're still trying to do better
7
00:00:32,668 --> 00:00:36,780
特别是这份特殊的论文是由出色的表现所激发的
and in particular this particular paper
is motivated by the huge performance
8
00:00:36,780 --> 00:00:43,800
这些新的RDMA NIC的潜力,所以您可能想知道,因为我们只是
potential of these new RDMA NICs
so you may be wondering since we just
9
00:00:43,800 --> 00:00:48,000
阅读有关扳手的信息,农场与其他扳手有何不同
read about Spanner,
how FaRM differs from Spanner. Both of
10
00:00:48,000 --> 00:00:51,660
他们毕竟复制,他们使用两阶段提交的交易
them, after all, replicate, and they use
two-phase commit for transactions. At
11
00:00:51,660 --> 00:00:56,850
在那个级别上,它们看起来就像是已部署系统中的扳手
that level they seem pretty similar.
Spanner is a deployed system that has been
12
00:00:56,850 --> 00:01:02,640
很长时间以来,它的主要重点是地理复制
used a lot for a long time. Its main
focus is on geographic replication, that
13
00:01:02,640 --> 00:01:06,930
能够在东海岸和西海岸以及其他地方复制
is, to be able to have copies on, say,
the east and west coasts and in different
14
00:01:06,930 --> 00:01:11,189
数据中心并能够进行合理有效的交易
data centers and be able to have
reasonably efficient transactions that
15
00:01:11,189 --> 00:01:16,590
涉及许多不同地方的数据片段,最具创新性
involve pieces of data in lots of
different places and the most innovative
16
00:01:16,590 --> 00:01:20,640
关于它的事情,因为为了尝试解决它多长时间的问题
thing about it, in order to try
to solve the problem of how long it
17
00:01:20,640 --> 00:01:26,119
长距离进行两阶段提交是因为它有一个特殊的
takes to do two-phase commit over long
distances is that it has a special
18
00:01:26,119 --> 00:01:32,189
使用同步时间和时间的只读事务的优化路径
optimization path for read-only
transactions using synchronized time and
19
00:01:32,189 --> 00:01:36,680
如果记得,您会从扳手获得的性能是读/写
the performance you get out of Spanner,
if you remember, is that a read/write
20
00:01:36,680 --> 00:01:42,509
交易需要10到100毫秒,具体取决于
transaction takes 10 to 100 milliseconds
depending on how close together the
21
00:01:42,509 --> 00:01:49,079
农场中不同的数据中心做出了非常不同的设计决策
different data centers are. FaRM makes a
very different set of design decisions
22
00:01:49,079 --> 00:01:52,710
并针对不同类型的工作负载首先是研究原型
and targets a different kind of workload.
First of all, it's a research prototype,
23
00:01:52,710 --> 00:01:57,390
因此它绝不是成品,而目标是探索
so it's not by any means a finished
product and the goal is to explore the
24
00:01:57,390 --> 00:02:04,259
这些新的RDMA高速网络硬件的潜力,因此它实际上仍然
potential of this new RDMA high-speed
networking hardware so it's really still
25
00:02:04,259 --> 00:02:09,619
探索性系统,它假定所有副本都在同一个数据中心中
an exploratory system it assumes that
all replicas are in the same data center
26
00:02:09,619 --> 00:02:11,880
绝对没有任何意义
it absolutely wouldn't make
sense
27
00:02:11,880 --> 00:02:15,600
这些副本甚至位于不同的数据中心,更不用说在东海岸了
if the replicas were even in different
data centers, let alone on the East Coast
28
00:02:15,600 --> 00:02:20,640
与西海岸之战,因此它并不是要解决扳手问题
versus the West Coast. So it's not trying to
solve the problem that Spanner addresses:
29
00:02:20,640 --> 00:02:23,790
如果整个数据中心宕机了怎么办,我能取出数据吗?
what happens if an entire data center
goes down, can I still get at my data?
30
00:02:23,790 --> 00:02:27,540
确实,它具有容错能力的程度是针对个人的
Really, to the extent that it
has fault tolerance, it is for individual
31
00:02:27,540 --> 00:02:33,720
崩溃,或者在整个数据中心断电并尝试恢复后尝试恢复
crashes, or for trying to recover after a
whole data center loses power and gets
32
00:02:33,720 --> 00:02:39,210
再次恢复它使用这种RDMA技术,我将谈论但
restored again. It uses this RDMA
technique, which I'll talk about, but which
33
00:02:39,210 --> 00:02:43,650
可能已经严重限制了设计选项,并且
turns out to seriously
restrict the design options. Because
34
00:02:43,650 --> 00:02:49,980
另一方面,该农场中的一部分被迫使用乐观并发控制
of this, FaRM is forced to use optimistic
concurrency control. On the other hand,
35
00:02:49,980 --> 00:02:56,300
他们获得的性能远远高于扳手农场所能做到的
the performance they get is far, far
higher than Spanner's. FaRM can do a
36
00:02:56,300 --> 00:03:00,900
在58微秒内传输一个简单的事务,这来自图7
simple transaction in 58
microseconds; this is from Figure 7
37
00:03:00,900 --> 00:03:06,780
和第6.3节,所以这是58微秒,而到10毫秒
and Section 6.3. So this is 58
microseconds versus the 10 milliseconds
38
00:03:06,780 --> 00:03:12,960
扳手所需的时间比扳手快一百倍,
that Spanner takes. That's about a
hundred times faster than Spanner, so
39
00:03:12,960 --> 00:03:17,730
那也许是我们最大的巨大差异
that's maybe the main difference:
FaRM has much higher performance
40
00:03:17,730 --> 00:03:26,640
但不是针对地理复制的,所以您知道这个农场
but is not aimed at geographic
replication. So FaRM's
41
00:03:26,640 --> 00:03:31,790
性能令人印象深刻,比其他任何东西都快
performance is extremely impressive,
much faster than anything else.
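As a rough check on the "hundred times faster" figure, the two latencies quoted above can be compared directly. This is only back-of-the-envelope arithmetic on the lecture's numbers (58 microseconds for FaRM, 10 to 100 milliseconds for Spanner); the name `speedup` is made up for illustration.

```python
# Figures quoted in the lecture: FaRM takes about 58 microseconds for a simple
# transaction; Spanner takes 10 to 100 milliseconds for a read/write transaction.
FARM_TXN_US = 58
SPANNER_TXN_MS_RANGE = (10, 100)


def speedup(spanner_ms, farm_us=FARM_TXN_US):
    """Return how many times shorter the FaRM latency is (ms versus us)."""
    return (spanner_ms * 1000) / farm_us


low, high = (speedup(ms) for ms in SPANNER_TXN_MS_RANGE)
print(f"FaRM is roughly {low:.0f}x to {high:.0f}x faster")
```

Even against Spanner's best case of 10 milliseconds, the ratio comes out above one hundred, so "about a hundred times" is a conservative, order-of-magnitude statement.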
42
00:03:31,790 --> 00:03:35,370
另一种看待方式是扳手和农场目标不同
Another way to look at it is that
Spanner and FaRM target different
43
00:03:35,370 --> 00:03:39,150
瓶颈和跨度是人们担心的主要瓶颈
bottlenecks. In Spanner, the main
bottleneck that people worried about is
44
00:03:39,150 --> 00:03:42,900
光速和网络光速之间的延迟和网络离开
speed-of-light delays
and network delays between
45
00:03:42,900 --> 00:03:50,130
数据中心,而在农场中,设计担心的主要瓶颈
data centers, whereas in FaRM the main
bottlenecks that the design is worried
46
00:03:50,130 --> 00:03:54,209
大约是服务器上的CPU时间,因为他们希望
about is CPU time on the servers,
because they kind of wished away the
47
00:03:54,209 --> 00:03:57,180
通过将所有副本置于同一位置来加快光速和网络延迟
speed of light and network delays by
putting all the replicas in the same
48
00:03:57,180 --> 00:04:01,070
数据中心没事
data center. All right,
49
00:04:01,220 --> 00:04:08,459
这样的背景如何使其适合684序列
so that's sort of the background of how this
fits into the 6.824 sequence. The setup in
50
00:04:08,459 --> 00:04:15,860
服务器场是您将它们全部运行在一个数据中心中,
FaRM is that it's all running
in one data center. There's a sort of
51
00:04:16,040 --> 00:04:21,120
配置管理器这是我们之前所见的配置
configuration manager, which we've
seen before, and the configuration
52
00:04:21,120 --> 00:04:25,110
经理决定哪个代表
manager is in charge of deciding
which
53
00:04:25,110 --> 00:04:30,479
在每个数据分片之前,服务器应是备份中的主要服务器,如果
servers should be the primary and the
backup for each shard of data. If
54
00:04:30,479 --> 00:04:37,110
您仔细阅读后会看到他们使用Zookeeper来帮助他们
you read carefully you'll see that they
use ZooKeeper in order to help them
55
00:04:37,110 --> 00:04:40,139
实现此配置管理器,但这不是本文的重点
implement this configuration manager but
it's not the focus of the paper at
56
00:04:40,139 --> 00:04:42,539
所有相反,有趣的是,
all.
Instead, the interesting thing is that
57
00:04:42,539 --> 00:04:47,490
数据通过一堆主要备用付款人的密钥分片
the data is sharded, split up by key
across a bunch of primary-backup pairs.
58
00:04:47,490 --> 00:04:52,919
所以我的意思是说一个分片继续存在,您知道主一台服务器主一台备份
So one shard goes on, say,
server primary one and backup
59
00:04:52,919 --> 00:04:59,879
一个又一个短的主数据库来备份两个,依此类推,这意味着
one, another shard on primary two and backup
two, and so forth. That means that
60
00:04:59,879 --> 00:05:03,870
每当您更新数据时,都需要在主数据库和
anytime you update data you need to
update it both on the primary and on the
61
00:05:03,870 --> 00:05:08,250
备份,这些不是这些主副本,这些副本不由维护
backup. These
replicas are not maintained by
62
00:05:08,250 --> 00:05:14,969
PAC或类似的东西,而是更新所有数据副本
Paxos or anything like it. Instead, all the
replicas of the data are updated
63
00:05:14,969 --> 00:05:18,180
每当有变化时,如果您阅读,则必须始终阅读
whenever there's a change, and if you
read, you always have to read from the
64
00:05:18,180 --> 00:05:24,210
当然,这种复制的主要原因是容错和
primary. The reason for this replication,
of course, is fault tolerance, and the
65
00:05:24,210 --> 00:05:29,039
他们得到的一种容错能力是,只要给定碎片的一个副本
kind of fault tolerance they get is that
as long as one replica of a given shard
66
00:05:29,039 --> 00:05:33,409
可用,那么该分片将可用,因此它们只需要一个
is available, then that shard will be
available. So they only require one
67
00:05:33,409 --> 00:05:39,900
如果有数据的话,活的副本不是多数,也不是整个系统
living replica, not a majority. And the
system as a whole, if there's, say, a data
68
00:05:39,900 --> 00:05:43,889
中心白色电源故障,只要至少有一个,它就可以恢复
center-wide power failure, can
recover as long as there's at least one
69
00:05:43,889 --> 00:05:49,560
系统中每个分片的副本的另一种放置方式是
replica of every shard in the system.
Another way of putting that is: if
70
00:05:49,560 --> 00:05:54,330
他们有F加一个副本,那么他们可以忍受F次失败
they have F+1 replicas, then they
can tolerate up to F failures for that
71
00:05:54,330 --> 00:06:00,860
碎片,除了每种数据的主要备份副本外,
shard. In addition to the primary and backup
copies of each shard of data, there's
72
00:06:00,860 --> 00:06:06,479
运行它的事务代码可能是最方便的
transaction code that runs. It's maybe
most convenient to think of the
73
00:06:06,479 --> 00:06:11,789
交易代码作为独立的客户端运行,实际上它们在运行交易
transaction code as running on separate
clients. In fact, they run the transaction
74
00:06:11,789 --> 00:06:16,560
在与实际服务器场存储相同的机器上进行实验的代码
code in their experiments on the same
machines as the actual farm storage
75
00:06:16,560 --> 00:06:24,270
服务器,但我通常会认为它们是一组单独的客户端,
servers, but I'll mostly think of them
as being a separate set of clients, and
76
00:06:24,270 --> 00:06:29,960
客户端正在运行事务,并且该事务需要读写
the clients are running transactions and
the transactions need to read and write
77
00:06:29,960 --> 00:06:38,020
除这些外,还存储在分片服务器中的数据对象
data objects that are stored
in the sharded servers. In addition,
78
00:06:38,020 --> 00:06:42,849
交易这些客户每个客户不仅运行交易,而且还运行
each client
not only runs the transactions but also
79
00:06:42,849 --> 00:06:48,330
充当两阶段提交的事务协调器
acts as the transaction coordinator for
two-phase commit.
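The replication layout described above (data sharded by key across primary-backup pairs, every replica updated on a write, reads served by the primary, one surviving replica being enough) can be sketched as a toy in-memory model. This is a minimal illustration only, not FaRM's actual code: the class names, the `shard_for` helper, and the CRC-based placement rule are all assumptions introduced here.

```python
import zlib
from dataclasses import dataclass, field


@dataclass
class Replica:
    name: str
    alive: bool = True
    ram: dict = field(default_factory=dict)  # all data is held in RAM


@dataclass
class Shard:
    replicas: list  # replicas[0] is the primary, the rest are backups

    def write(self, key, value):
        # Every live replica is updated on each change (no Paxos-style majority).
        for replica in self.replicas:
            if replica.alive:
                replica.ram[key] = value

    def read(self, key):
        # Reads always go to the primary.
        primary = self.replicas[0]
        if not primary.alive:
            raise RuntimeError("primary is down; a new primary must be chosen")
        return primary.ram[key]

    def available(self):
        # With F+1 replicas the shard survives F failures: one survivor is enough.
        return any(replica.alive for replica in self.replicas)


def shard_for(key, shards):
    # Hypothetical placement rule: hash the key to choose a shard.
    return shards[zlib.crc32(key.encode()) % len(shards)]


# Three shards, each a primary-backup pair (F = 1).
shards = [Shard([Replica(f"primary{i}"), Replica(f"backup{i}")]) for i in range(3)]
shard = shard_for("x", shards)
shard.write("x", 1)
shard.replicas[1].alive = False  # lose the backup
print(shard.read("x"), shard.available())
```

In the real system the configuration manager decides which server is primary for each shard; the sketch simply treats the first replica in the list as the primary and raises an error if it is down.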
80
00:06:48,330 --> 00:06:53,589
好吧,这是他们获得性能的基本方法,因为这真的
Okay, so that's the basic setup. As for the way
they get performance, because really
81
00:06:53,589 --> 00:06:57,669
这是一篇有关如何获得高性能并仍然拥有
this is a paper all about how you can
get high performance and still have
82
00:06:57,669 --> 00:07:04,150
通过分片获得高性能的一种方式是
transactions, one way they get high
performance is with sharding. These are the
83
00:07:04,150 --> 00:07:12,219
从某种意义上讲,主要成分是通过在实验中分片
ingredients, in a sense. The main way is
through sharding. In the experiments, they
84
00:07:12,219 --> 00:07:17,050
通过90种方式将其数据分摊到90台服务器上,也许是45种方式,而不是
shard their data 90 ways over 90
servers, or maybe it's 45 ways, and
85
00:07:17,050 --> 00:07:21,279
只要操作和不同的碎片差不多
as long as the operations on
different shards are more or less
86
00:07:21,279 --> 00:07:25,899
彼此独立,可自动为您加速90倍
independent of each other, that just gets
you an automatic 90-times speedup,
87
00:07:25,899 --> 00:07:30,699
因为您可以运行90糖浆中的任何糖浆
because you can run whatever it is
you're running in parallel on 90 servers.
88
00:07:30,699 --> 00:07:36,520
这个巨大的钱来自较短的分片,还有他们为了获得的另一个技巧
That's a huge win from sharding.
Another trick they play in order to get
89
00:07:36,520 --> 00:07:41,919
良好的性能,因为所有数据都必须容纳在它们不需要的服务器的RAM中
good performance is that the data all has to
fit in the RAM of the servers. They don't
90
00:07:41,919 --> 00:07:46,089
真正将数据存储在磁盘上,所有这些都必须放入RAM中,这意味着
really store the data on disk;
it all has to fit in RAM, and that means
91
00:07:46,089 --> 00:07:50,620
当然,您可以很快摆脱另一种方式,即它们变得很高
of course you can get at it pretty
quickly. Another way that they get high
92
00:07:50,620 --> 00:07:56,199
性能是他们需要容忍掉电,这意味着他们
performance: they need to tolerate
power failures, which means that they
93
00:07:56,199 --> 00:07:59,499
不能只是使用RAM,因为它们需要在上电后恢复数据
can't just be using RAM because they
need to recover the data after a power
94
00:07:59,499 --> 00:08:04,589
发生故障,RAM在断电时会丢失内容,因此它们有一个聪明的选择
failure, and RAM loses its contents on a
power failure. So they have a clever
95
00:08:04,589 --> 00:08:11,020
非易失性Ram方案,用于让RAM的内容在电源中存活
non-volatile RAM scheme for having the
contents of RAM, the data, survive power
96
00:08:11,020 --> 00:08:16,360
失败,这与将数据持久存储在磁盘上相反
failures. This is in contrast to storing
the data persistently on disk; it is
97
00:08:16,360 --> 00:08:22,180
比磁盘快得多他们使用的另一个技巧是使用此RDMA
much faster than disk. Another trick
they play is they use this RDMA
98
00:08:22,180 --> 00:08:31,289
实质上是聪明的网络接口卡技术
technique, which is essentially clever
network interface cards that
99
00:08:31,289 --> 00:08:35,919
接受指示我们直接进入接口卡的数据包
accept packets that instruct the
network interface card to directly
100
00:08:35,919 --> 00:08:42,698
读写服务器的内存,而不会中断服务器
read and write the memory of the server
without interrupting the server. Another
101
00:08:42,698 --> 00:08:48,569
他们玩的技巧就是您通常所说的内核旁路
trick they play is what's often
called kernel bypass,
102
00:08:48,680 --> 00:08:58,790
这意味着应用程序级代码可以直接访问网络
which means that the application level
code can directly access the network
103
00:08:58,790 --> 00:09:02,540
接口卡,而无需涉及内核,所以这些都是
interface card without getting the
kernel involved okay so these are all
104
00:09:02,540 --> 00:09:07,520
我们正在寻找的那种巧妙的技巧倒掉了
the sort of clever tricks to look
out for that they use to get
105
00:09:07,520 --> 00:09:11,270
高性能,我将谈论我们已经讨论过分片
high performance.
We've already talked about sharding a
106
00:09:11,270 --> 00:09:15,430
很多,但我将在本讲座中讨论其余内容
lot but I'll talk about the rest in this
lecture
107
00:09:15,430 --> 00:09:21,560
好吧,首先我要谈谈非易失性Ram,这是真的
Okay, so first I'll talk about
non-volatile RAM. This is really a
108
00:09:21,560 --> 00:09:31,310
不会真正影响设计其余部分的话题
topic that doesn't really affect
the rest of the design directly. As I
109
00:09:31,310 --> 00:09:37,190
当客户端更新时,所有数据和用于场的数据都存储在RAM中
said, all the data for FaRM is stored
in RAM. When a client
110
00:09:37,190 --> 00:09:39,800
交易更新了一条数据,这实际上意味着它可以到达
transaction updates a piece of data what
that really means is it reaches out to
111
00:09:39,800 --> 00:09:43,760
存储数据并导致这些服务器修改
the relevant servers that store the data
and causes those servers to modify the
112
00:09:43,760 --> 00:09:49,370
事务正在修改的任何对象都可以在对象中对其进行修改
whatever object the transaction is
modifying, and to modify it right in
113
00:09:49,370 --> 00:09:53,540
RAM,就写入而言,它们不会进入磁盘,这就是您
RAM, and that's as far as the writes get;
they don't go to disk. This is, you
114
00:09:53,540 --> 00:09:56,959
知道与您的筏实施的对比,例如花了多少钱
know, in contrast to your Raft
implementations, for example, which spend
115
00:09:56,959 --> 00:10:04,310
很多时间将数据持久保存到磁盘上没有持久性,在服务器场中
a lot of time persisting data to disk.
There's no persisting in FaRM. This
116
00:10:04,310 --> 00:10:07,880
在RAM中写东西是一件大事,写ram大约需要200
is a big win:
a write to RAM takes about 200
117
00:10:07,880 --> 00:10:11,930
十亿分之一秒,而对固态硬盘的袭击甚至是非常快的
nanoseconds, whereas even for a
solid-state drive, which is pretty fast, a
118
00:10:11,930 --> 00:10:17,330
失速寻道驱动器的权利大约需要一百微秒,然后写入
write to a solid-state drive takes about
a hundred microseconds, and a write to
119
00:10:17,330 --> 00:10:21,080
我们的硬盘驱动器大约需要十毫秒,因此能够写入
a hard drive takes about ten
milliseconds. So being able to write to
120
00:10:21,080 --> 00:10:26,270
ram值得许多数量级和速度的交易
RAM is worth many, many orders of
magnitude in speed for transactions
121
00:10:26,270 --> 00:10:30,770
会改变事物,但伊朗当然会丢失其内容和电源故障,因此
that modify things. But of course RAM
loses its contents in a power failure, so
122
00:10:30,770 --> 00:10:37,990
它本身并不是持久的,您可能会认为写作
it's not persistent by itself. As an aside,
you might think that writing
123
00:10:37,990 --> 00:10:43,610
如果您有副本服务器,则对多个服务器的RAM进行修改
modifications to the RAM of multiple
servers, that is, if you have replica servers
124
00:10:43,610 --> 00:10:48,080
然后您更新所有可能具有足够持久性的副本,因此
and you update all the replicas,
might be persistent enough.
125
00:10:48,080 --> 00:10:53,150
毕竟,如果您有F 1 F +1个副本,则最多可以容忍F个故障,并且
After all, if you have F+1 replicas,
you can tolerate up to F failures. And
126
00:10:53,150 --> 00:10:57,350
仅在多个服务器上写入Ram的原因不好
the reason why just simply writing to
Ram on multiple servers is not good
127
00:10:57,350 --> 00:11:00,870
一个站点范围内的电源故障将足以破坏
enough is that a site-wide power failure
will destroy
128
00:11:00,870 --> 00:11:09,029
您所有的服务器,从而违反了故障发生的假设
all of your servers, thus violating
the assumption that the failures of
129
00:11:09,029 --> 00:11:12,810
不同的服务器是独立的,所以我们需要一个即使它也可以工作的方案
different servers are independent. So we
need a scheme that's going to work even
130
00:11:12,810 --> 00:11:24,210
如果整个数据中心的电源故障,那么论坛会做什么呢?
if power fails to the entire data center.
So what FaRM does is put a
131
00:11:24,210 --> 00:11:28,230
为每个机架中的一块大电池供电,并通过
battery a big battery in every rack and
runs the power supply system through the
132
00:11:28,230 --> 00:11:32,970
电池,因此如果发生电源故障,电池会自动接管电池,
batteries so the batteries automatically
take over if there's a power failure and
133
00:11:32,970 --> 00:11:37,830
保持所有机器运行至少到电池出现故障为止,但是当然
keep all their machines running at least
until the battery fails but of course
134
00:11:37,830 --> 00:11:41,490
你知道电池不是很大,可能只能用自己的电池
the battery is not very big; it
may only be able to run their
135
00:11:41,490 --> 00:11:45,900
机器说了10分钟之类的时间,因此电池本身不足
machines for say 10 minutes or something
so the battery by itself is not enough
136
00:11:45,900 --> 00:11:50,760
为了使系统能够承受较长的电源故障,因此
to make the system able to
withstand a lengthy power failure. So
137
00:11:50,760 --> 00:11:56,460
而是当电池系统看到主电源出现故障时,
instead, when the battery system sees
that the main power has failed, the
138
00:11:56,460 --> 00:12:00,330
电池系统在保持服务器的Marling的同时也会提醒
battery system, while it keeps the
servers running, also alerts the
139
00:12:00,330 --> 00:12:04,500
服务器的所有服务器,并带有某种中断或消息告诉
servers, all the servers, with some
kind of interrupt or message telling
140
00:12:04,500 --> 00:12:09,020
他们看起来力量刚刚失败,您知道您只剩10分钟了
them: look, the power's just failed,
you've only got 10 minutes left before
141
00:12:09,020 --> 00:12:16,170
电池也会发生故障,因此此时场服务器上的软件会复制所有
the batteries fail too. So at that point
the software on FaRM's servers
142
00:12:16,170 --> 00:12:21,630
的雨水活动会先停止对农场的所有处理,然后再复制每个
first stops all processing
for FaRM, and then each
143
00:12:21,630 --> 00:12:25,650
服务器将其所有RAM复制到与之相连的固态驱动器
server copies all of its RAM to a
solid-state drive attached to that
144
00:12:25,650 --> 00:12:30,089
服务器是我所希望的,可能需要几分钟,而一旦所有的RAM被
server, which could take a
couple of minutes. Once all the RAM is
145
00:12:30,089 --> 00:12:33,600
复制到固态驱动器,然后机器自行关闭并转动
copied to the solid-state drive then the
machine shuts itself down and turns
146
00:12:33,600 --> 00:12:39,870
本身关闭,所以如果一切顺利,所有机器都将发生站点范围的电源故障
itself off. So if all goes well and there's a
site-wide power failure, all the machines
147
00:12:39,870 --> 00:12:45,180
当数据中心恢复供电时,将其RAM保存到磁盘
save their RAM to disk. When the power
comes back up in the data center, all the
148
00:12:45,180 --> 00:12:51,540
机器重启时将读取保存在磁盘上的内存映像
machines, when they reboot, will read
the memory image that was saved on disk
149
00:12:51,540 --> 00:12:57,420
恢复到RAM中,但必须进行一些恢复,但基本上
and restore it into RAM. There's some
recovery that has to go on, but basically
150
00:12:57,420 --> 00:13:00,570
他们不会因为力量而失去任何持久状态
they won't have lost any of their
persistent state due to the power
151
00:13:00,570 --> 00:13:07,310
失败,这实际上意味着该农场正在使用常规Ram
failure. So what that really means is
that FaRM is using conventional RAM
152
00:13:07,310 --> 00:13:13,200
但本质上使RAM非易失性能够承受电源
but it has essentially made the RAM
non-volatile, able to survive power
153
00:13:13,200 --> 00:13:17,339
使用这种电池的技巧而导致故障
failures, with
this trick of using a battery, having the
154
00:13:17,339 --> 00:13:21,660
电池警报服务器使服务器固态存储RAM内容
battery alert the server, and having the
server store the RAM contents on solid-state
155
00:13:21,660 --> 00:13:33,509
驱动有关nvram方案的任何问题,这是一个有用的
drives. Any questions about the NVRAM
scheme? All right, this is a useful
156
00:13:33,509 --> 00:13:40,559
技巧,但值得记住的是,它只有在有
trick, but it is worthwhile keeping in mind
that it really only helps if there are
157
00:13:40,559 --> 00:13:46,499
停电就是如果您只知道整个事件序列,
power failures. That is,
the whole sequence of events only
158
00:13:46,499 --> 00:13:50,639
如果电池发现主电源出现故障,则将其设置为火车
gets set in train when the battery
notices that the main power has failed. If
159
00:13:50,639 --> 00:13:53,749
还有其他导致服务器故障的原因,例如
there's some other reason
causing the server to fail like
160
00:13:53,749 --> 00:13:57,120
硬件出现问题或软件中存在错误
something goes wrong with the hardware
or there's a bug in the software that
161
00:13:57,120 --> 00:14:02,670
导致崩溃那些崩溃非易失性Ram系统只是一无所有
causes a crash, the
non-volatile RAM system has nothing
162
00:14:02,670 --> 00:14:06,809
与这些崩溃有关的那些崩溃将导致计算机重新启动并
to do with those crashes. Those crashes
will cause the machine to reboot and
163
00:14:06,809 --> 00:14:10,470
丢失其RAM中的内容,它将无法恢复它们,因此
lose the contents of its RAM and it
won't be able to recover them so this
164
00:14:10,470 --> 00:14:15,660
NVRAM方案适用于电源故障,但不适用于其他崩溃,所以这就是为什么
NVRAM scheme is good for power failures
but not other crashes and so that's why
165
00:14:15,660 --> 00:14:21,720
除了NVRAM场外,还具有多个副本的多个副本
in addition to the NVRAM, FaRM also has
multiple copies, multiple replicas, of
166
00:14:21,720 --> 00:14:28,730
每个分片都可以,因此该NVRAM方案从根本上消除了
each shard. All right, so this NVRAM
scheme essentially eliminates
167
00:14:28,730 --> 00:14:35,160
持久性成为系统性能的瓶颈
persistence writes as a bottleneck in the
performance of the system, leaving only