Lecture 03 GFS.srt
1
00:00:00,600 --> 00:00:09,389
I'd like to get started. Today we're
gonna talk about GFS, the Google File

2
00:00:09,389 --> 00:00:12,660
System paper we read for today,
and this will be the first of a number

3
00:00:12,660 --> 00:00:17,160
of different sort of case studies we'll
talk about in this course about how to

4
00:00:17,160 --> 00:00:29,310
build big storage systems. So the
larger topic is big storage. The reason

5
00:00:29,310 --> 00:00:34,260
is that storage has turned out to be a key
abstraction. You might, you know, if you

6
00:00:34,260 --> 00:00:37,230
didn't know already, you might imagine
that there could be all kinds of

7
00:00:37,230 --> 00:00:42,030
different, you know, important
abstractions you might want to use for

8
00:00:42,030 --> 00:00:47,730
distributed systems, but it's turned out
that a simple storage interface is just

9
00:00:47,730 --> 00:00:51,480
incredibly useful and extremely general,
and so a lot of the thought that's gone

10
00:00:51,480 --> 00:00:55,170
into building distributed systems has
either gone into designing storage

11
00:00:55,170 --> 00:01:00,180
systems or designing other systems that
assume underneath them some sort of

12
00:01:00,180 --> 00:01:05,519
reasonably well-behaved big distributed
storage system. So we're

13
00:01:05,519 --> 00:01:09,360
going to care a lot about, you
know, how to design a good interface to a

14
00:01:09,360 --> 00:01:14,159
big storage system and how to design the
innards of the storage system so it has

15
00:01:14,159 --> 00:01:19,229
good behavior. You know, of course, that's
why we're reading this paper: just to get

16
00:01:19,229 --> 00:01:22,530
a start on that. This paper also
touches on a lot of themes that will

17
00:01:22,530 --> 00:01:27,060
come up a lot in 6.824: parallel
performance, fault tolerance, replication,

18
00:01:27,060 --> 00:01:34,140
and consistency. And this paper is, as
such things go, reasonably

19
00:01:34,140 --> 00:01:38,670
straightforward and easy to understand.
It's also a good systems paper; it sort

20
00:01:38,670 --> 00:01:43,229
of talks about issues all the way from
the hardware to the software that

21
00:01:43,229 --> 00:01:49,320
ultimately uses the system. And it's a
successful real-world design. So it's,

22
00:01:49,320 --> 00:01:53,189
you know, an academic paper published in an
academic conference, but it describes

23
00:01:53,189 --> 00:01:57,030
something that really was successful and
used for a long time in the real world.

24
00:01:57,030 --> 00:02:02,340
So we sort of know that we're talking
about something that is a good,

25
00:02:02,340 --> 00:02:09,149
useful design. Okay, so before I
talk about GFS, I want to sort of

26
00:02:09,149 --> 00:02:13,030
talk about the space of distributed
storage systems a little bit

27
00:02:13,030 --> 00:02:18,810
to set the scene. So first, why is it hard?

28
00:02:19,920 --> 00:02:25,900
It's actually a lot to get right, but for
6.824 there's a particular sort of

29
00:02:25,900 --> 00:02:32,140
narrative that's gonna come up quite a
lot for many systems. Often the starting

30
00:02:32,140 --> 00:02:35,890
point for people designing these sort of
big distributed systems or big storage

31
00:02:35,890 --> 00:02:39,340
systems is they want to get huge
aggregate performance, be able to harness

32
00:02:39,340 --> 00:02:44,620
the resources of hundreds of machines in
order to get a huge amount of work done.

33
00:02:44,620 --> 00:02:54,430
So the sort of starting point is often
performance. And, you know, if you start

34
00:02:54,430 --> 00:02:59,019
there, a natural next thought is, well,
we're gonna split our data over a huge

35
00:02:59,019 --> 00:03:04,420
number of servers in order to be able to
read many servers in parallel. So we're

36
00:03:04,420 --> 00:03:11,160
gonna get, and that's often called
sharding. If you shard over many servers,

37
00:03:11,160 --> 00:03:15,970
hundreds or thousands of servers, you're
just gonna see constant faults, right? If

38
00:03:15,970 --> 00:03:20,680
you have thousands of servers, there's
just always gonna be one down. So

39
00:03:20,680 --> 00:03:27,250
faults are just every-day, every-hour
occurrences, and we need automatic,

40
00:03:27,250 --> 00:03:31,890
without humans involved, fixing of
this fault. We need automatic

41
00:03:31,890 --> 00:03:43,090
fault-tolerant systems. So that leads to
fault tolerance. Among the most

42
00:03:43,090 --> 00:03:46,630
powerful ways to get fault tolerance is
with replication: just keep two or three

43
00:03:46,630 --> 00:03:52,390
or whatever copies of data; if one of them
fails, you can use another one. So we want

44
00:03:52,390 --> 00:04:03,100
to have fault tolerance, and that leads to
replication. If you have replication, two

45
00:04:03,100 --> 00:04:07,329
copies of the data, then you know for sure,
if you're not careful, they're gonna get

46
00:04:07,329 --> 00:04:10,750
out of sync. And so what you thought was
two replicas of the data, where you could

47
00:04:10,750 --> 00:04:14,170
use either one interchangeably to
tolerate faults, if you're not careful,

48
00:04:14,170 --> 00:04:18,640
what you end up with is two almost
identical replicas of the data that are

49
00:04:18,640 --> 00:04:22,180
not exactly replicas at all, and
what you get back depends on which one

50
00:04:22,180 --> 00:04:25,240
you talk to. So that's starting to maybe
look a little bit

51
00:04:25,240 --> 00:04:34,330
tricky for applications to use. So if we
have replication, we risk weird

52
00:04:34,330 --> 00:04:45,400
inconsistencies. Of course, with clever design
you can get rid of inconsistency and

53
00:04:45,400 --> 00:04:49,450
make the data look very well-behaved, but
if you do that, it almost always requires

54
00:04:49,450 --> 00:04:53,140
extra work and extra sort of chitchat
between all the different servers and

55
00:04:53,140 --> 00:04:58,470
clients in the network, and that reduces
performance. So if you want consistency,

56
00:04:59,550 --> 00:05:11,740
you pay for it with low performance, which
is of course not what we were originally

57
00:05:11,740 --> 00:05:14,650
hoping for. Of course, this isn't absolute:
you can build very high-performance

58
00:05:14,650 --> 00:05:19,480
systems, but nevertheless there's this
sort of inevitable way that the design

59
00:05:19,480 --> 00:05:24,670
of these systems plays out, and it results
in a tension between the original goals

60
00:05:24,670 --> 00:05:29,020
of performance and the sort of
realization that if you want good

61
00:05:29,020 --> 00:05:33,730
consistency, you're gonna pay for it, and
if you don't want to pay for it, then you

62
00:05:33,730 --> 00:05:37,930
have to suffer with sort of anomalous
behavior sometimes. I'm putting this up

63
00:05:37,930 --> 00:05:42,310
because we're gonna see this loop
many times for many of the systems we

64
00:05:42,310 --> 00:05:48,070
look at. People are rarely
willing to or happy about paying the

65
00:05:48,070 --> 00:05:57,520
full cost of very good consistency. Okay, so
you know, regarding consistency, I'll

66
00:05:57,520 --> 00:06:04,000
talk more later in the course about
exactly what I mean by good consistency,

67
00:06:04,000 --> 00:06:09,280
but you can think of strong consistency
or good consistency as being: we want to

68
00:06:09,280 --> 00:06:13,930
build a system whose behavior to
applications or clients looks just like

69
00:06:13,930 --> 00:06:18,760
you'd expect from talking to a single
server. All right, we're gonna build, you

70
00:06:18,760 --> 00:06:23,170
know, systems out of hundreds of machines,
but a kind of ideal strong consistency

71
00:06:23,170 --> 00:06:26,560
model would be what you'd get if there
was just one server with one copy of the

72
00:06:26,560 --> 00:06:34,349
data doing one thing at a time. So this
is kind of a strong

73
00:06:34,349 --> 00:06:42,789
consistency, kind of intuitive way to
think about strong consistency. So you

74
00:06:42,789 --> 00:06:47,020
might think you have one server. We'll
assume that's a single-threaded server

75
00:06:47,020 --> 00:06:50,919
and that it processes requests from
clients one at a time. And that's

76
00:06:50,919 --> 00:06:55,509
important because there may be lots of
clients sending concurrent requests

77
00:06:55,509 --> 00:06:59,020
into the server, and if it sees concurrent
requests, it picks one or the other to go

78
00:06:59,020 --> 00:07:04,090
first and executes that request to
completion, then executes the next. So for

79
00:07:04,090 --> 00:07:07,629
storage servers, or, you know, the server's
got a disk on it, and what it means to

80
00:07:07,629 --> 00:07:12,610
process a request is, if it's a write
request, you know, which might be writing

81
00:07:12,610 --> 00:07:17,979
an item or maybe incrementing
an item, if it's a mutation,

82
00:07:17,979 --> 00:07:23,680
then we're gonna go, and we have some
table of data, you know, maybe indexed

83
00:07:23,680 --> 00:07:27,039
by keys and values, and we're gonna
update this table. And if the request

84
00:07:27,039 --> 00:07:30,099
comes in to read, we're just gonna,
you know, pull the right data out of the

85
00:07:30,099 --> 00:07:39,580
table. One of the rules here that sort of
makes this well-behaved is

86
00:07:39,580 --> 00:07:44,710
that the server really does, in
our simplified model, execute requests

87
00:07:44,710 --> 00:07:49,990
one at a time, and that requests see data
that reflects all the previous

88
00:07:49,990 --> 00:07:53,560
operations in order. So if a sequence of
writes comes in and the server processes

89
00:07:53,560 --> 00:07:58,060
them in some order, then when you read
you see the sort of, you know, value you

90
00:07:58,060 --> 00:08:05,169
would expect if those writes
occurred one at a time. The behavior here

91
00:08:05,169 --> 00:08:09,659
is still not completely straightforward;
there are some, you know, there are some

92
00:08:09,659 --> 00:08:13,629
things that you have to spend at least a
second thinking about. So for example, if

93
00:08:13,629 --> 00:08:25,180
we have a bunch of clients, and client
one issues a write of X and wants

94
00:08:25,180 --> 00:08:30,460
to set it to 1, and at the same time
client two issues a write of the same

95
00:08:30,460 --> 00:08:34,360
key, but wants to set it to a

96
00:08:34,360 --> 00:08:38,409
different value, right? Something happens.
Let's say client three

97
00:08:38,409 --> 00:08:44,020
reads and gets some result, or client
three, after these writes complete, reads

98
00:08:44,020 --> 00:08:50,290
and gets some result; client four
reads X and also gets a result.

99
00:08:50,290 --> 00:09:00,959
So what results should the two clients
see? Yeah?

100
00:09:04,700 --> 00:09:09,060
Well, that's a good question. So
what I'm assuming here is that client

101
00:09:09,060 --> 00:09:12,720
one and client two launch these requests at
the same time, so if we were monitoring

102
00:09:12,720 --> 00:09:16,500
the network, we'd see two requests
heading to the server at the same time,

103
00:09:16,500 --> 00:09:20,520
and then sometime later the server would
respond to them.

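The single-server model of strong consistency from this part of the lecture can be sketched in Python. This is a minimal illustration, not code from the lecture or the GFS paper: it assumes a dict-backed key/value table, and the names `SingleServer` and `legal_read_results` are hypothetical. It tries every order in which a single-threaded server might execute the two concurrent writes and collects the results that reads issued afterwards are allowed to return.

```python
from itertools import permutations


class SingleServer:
    """Single-threaded key/value server: it executes one request at a time,
    so every read reflects all writes executed before it."""

    def __init__(self):
        self.table = {}

    def execute(self, op):
        kind, key, *rest = op
        if kind == "write":
            self.table[key] = rest[0]
            return None
        return self.table.get(key)  # a "read" request


def legal_read_results(concurrent_writes, key, n_reads):
    """Collect the read results the model allows when `n_reads` reads of
    `key` are issued after all the concurrent writes have completed."""
    outcomes = set()
    for order in permutations(concurrent_writes):  # server may pick any order
        server = SingleServer()
        for op in order:
            server.execute(op)
        reads = tuple(server.execute(("read", key)) for _ in range(n_reads))
        outcomes.add(reads)
    return outcomes


writes = [("write", "x", 1), ("write", "x", 2)]       # client 1 and client 2
allowed = legal_read_results(writes, "x", n_reads=2)  # client 3 and client 4
print(sorted(allowed))    # [(1, 1), (2, 2)]
print((2, 1) in allowed)  # False
```

Either write may win, so a trace alone cannot say whether the final value is 1 or 2; what the model rules out is client 3 and client 4 disagreeing.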
104
00:09:20,520 --> 00:09:26,070
So there's actually not enough here to
be able to say whether the server would

105
00:09:26,070 --> 00:09:30,780
process the first request
first, or in which order. There's not enough

106
00:09:30,780 --> 00:09:35,460
here to tell which order the server
processes them in. And of course, if it

107
00:09:35,460 --> 00:09:41,760
processes this request first, then that
means, or if it processes the write with

108
00:09:41,760 --> 00:09:46,350
value 2 second, that means that
subsequent reads have to see 2; whereas if

109
00:09:46,350 --> 00:09:50,250
the server happened to process this
request first and this one second, that

110
00:09:50,250 --> 00:09:53,760
means the resulting value had better be 1,
and these two reads would see 1.

111
00:09:53,760 --> 00:09:58,950
So I'm just putting this up to sort
of illustrate that even in a simple

112
00:09:58,950 --> 00:10:04,020
system there's ambiguity. You can't
necessarily tell from a trace of what went

113
00:10:04,020 --> 00:10:08,820
into the server what should come out.
All you can tell is whether some set of

114
00:10:08,820 --> 00:10:13,470
results is consistent or not consistent
with a possible execution. So certainly

115
00:10:13,470 --> 00:10:21,060
there are some completely wrong results we
can see go by. You know, if client 3

116
00:10:21,060 --> 00:10:27,210
sees a 2, then client 4 had better
see it too, because our model is,

117
00:10:27,210 --> 00:10:30,750
well, after the second write, you know,
client 3 sees a 2; that means

118
00:10:30,750 --> 00:10:35,700
this write must have been second, and it
still had better be, it still has to have

119
00:10:35,700 --> 00:10:41,220
been the second write when client 4 goes
to read the data. So hopefully all this is

120
00:10:41,220 --> 00:10:47,790
just completely straightforward and just
as expected, because it's supposed

121
00:10:47,790 --> 00:10:53,190
to be the intuitive model of strong
consistency. Okay, and so the problem with

122
00:10:53,190 --> 00:10:56,370
this, of course, is that a single server
has poor fault tolerance, right? If it

123
00:10:56,370 --> 00:11:00,870
crashes or its disk dies or something,
we're left with nothing. And so in the

124
00:11:00,870 --> 00:11:05,430
real world of distributed systems we
actually build replicated systems, and

125
00:11:05,430 --> 00:11:08,220
that's where all the problems start
leaking in: when we have a second

126
00:11:08,220 --> 00:11:16,180
copy of the data. So here is what must be
close to the worst replication design,

127
00:11:16,180 --> 00:11:20,810
and I'm doing this to warn you of the
problems that we will then be looking

128
00:11:20,810 --> 00:11:30,380
for in GFS. All right, so here's a bad
replication design. We're gonna have two

129
00:11:30,380 --> 00:11:38,510
servers now, each with a complete copy of
the data, and so on disk they're both

130
00:11:38,510 --> 00:11:44,810
gonna have this table of keys and
values. The intuition, of course, is that

131
00:11:44,810 --> 00:11:49,880
we want to keep these tables, we hope to
keep these tables identical, so that if

132
00:11:49,880 --> 00:11:53,720
one server fails we can read or write
from the other server. And so that means

133
00:11:53,720 --> 00:11:59,210
that somehow every write must be
processed by both servers, and reads have

134
00:11:59,210 --> 00:12:02,570
to be able to be processed by a single
server; otherwise it's not fault tolerant,

135
00:12:02,570 --> 00:12:07,940
all right? If reads have to consult both,
then we can't survive the loss of one of

136
00:12:07,940 --> 00:12:17,030
the servers. Okay, so the problem is gonna
come up. Well, suppose we have client 1

137
00:12:17,030 --> 00:12:20,570
and client 2, and they both want to do
these writes; say one of them is gonna write

138
00:12:20,570 --> 00:12:25,790
1 and the other is going to write 2.
So client 1 is gonna launch its write

139
00:12:25,790 --> 00:12:32,600
of X = 1 to both, because we want to update both
of them, and client 2 is gonna launch its

140
00:12:32,600 --> 00:12:46,280
write of X. So what's gonna go wrong here?
Yeah. Yeah, we haven't done anything here

141
00:12:46,280 --> 00:12:51,590
to ensure that the two servers process
the two requests in the same order, right?

142
00:12:51,590 --> 00:12:57,800
That's a bad design.
So if server 1 processes client 1's

143
00:12:57,800 --> 00:13:02,600
request first, it'll start
with a value of 1, and then it'll see

144
00:13:02,600 --> 00:13:07,610
client 2's request and overwrite that
with 2. If server 2 just happens to

145
00:13:07,610 --> 00:13:11,020
receive the packets over the network in
a different order, it's going to execute

146
00:13:11,020 --> 00:13:15,350
client 2's request and set the value to
2, and then it will see client 1's

147
00:13:15,350 --> 00:13:20,450
request and set the value to 1. And now what
a later reading client sees, you

148
00:13:20,450 --> 00:13:25,520
know, if client 3 happens to read from
this server and client 4 happens to

149
00:13:25,520 --> 00:13:28,610
read from the other server, then we get
into this terrible situation where

150
00:13:28,610 --> 00:13:33,410
they're gonna read different values, even
though our intuitive model of a correct

151
00:13:33,410 --> 00:13:39,589
service says both subsequent reads
had better see the same value. And this can

152
00:13:39,589 --> 00:13:43,579
arise in other ways. You know, suppose we
try to fix this by making the clients

153
00:13:43,579 --> 00:13:48,829
always read from server 1 if it's up,
and otherwise server 2. If we do that,

154
00:13:48,829 --> 00:13:53,089
then if this situation happened,
everybody's reads might

155
00:13:53,089 --> 00:13:57,649
see value 2, but if server 1
suddenly fails, then even

156
00:13:57,649 --> 00:14:02,050
though there was no write, suddenly the
value for X will switch from 2 to 1,

157
00:14:02,050 --> 00:14:07,130
because if server 1 died, all the
clients switch to server 2. It's just

158
00:14:07,130 --> 00:14:11,570
this mysterious change in the data that
doesn't correspond to any write, which is

159
00:14:11,570 --> 00:14:15,680
also totally not something that could
have happened in this simple

160
00:14:15,680 --> 00:14:25,940
single-server model. All right, so of course this
can be fixed. The fix requires more

161
00:14:25,940 --> 00:14:33,529
communication, usually between the
servers or somewhere, and more complexity. And

162
00:14:33,529 --> 00:14:37,820
because of the inevitable cost
and complexity to get strong

163
00:14:37,820 --> 00:14:43,610
consistency, there's a whole range of
different solutions to get better

164
00:14:43,610 --> 00:14:48,350
consistency, and a whole range of what
people feel is an acceptable level of

165
00:14:48,350 --> 00:14:54,890
consistency, an acceptable sort of
set of anomalous behaviors that might be

166
00:14:54,890 --> 00:15:03,910
revealed. All right, any questions about
this disastrous model here?

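The bad two-replica design and the attempted "read from server 1 if it's up" fix can be reproduced in a few lines of Python. This is a hypothetical sketch: the `Replica` class and the `read_with_failover` helper are invented for illustration, each server is assumed to be just an in-memory dict, and the network is assumed to deliver the two writes to the two servers in opposite orders.

```python
class Replica:
    """One server in the deliberately bad two-replica design. It applies
    writes in whatever order the network happens to deliver them."""

    def __init__(self):
        self.table = {}
        self.up = True

    def apply(self, key, value):
        self.table[key] = value

    def read(self, key):
        return self.table.get(key)


s1, s2 = Replica(), Replica()

# Client 1 sends "x = 1" and client 2 sends "x = 2" to BOTH servers, but
# nothing forces the two servers to process them in the same order.
for key, value in [("x", 1), ("x", 2)]:  # S1 receives client 1's write first
    s1.apply(key, value)
for key, value in [("x", 2), ("x", 1)]:  # S2 receives client 2's write first
    s2.apply(key, value)

# Client 3 happens to read from S1 and client 4 from S2: they disagree.
print(s1.read("x"), s2.read("x"))  # 2 1


def read_with_failover(key):
    """Attempted fix: always read from S1 if it is up, otherwise from S2."""
    return s1.read(key) if s1.up else s2.read(key)


print(read_with_failover("x"))  # 2
s1.up = False                   # S1 crashes; no client has written anything new
print(read_with_failover("x"))  # 1: the value changed without any write
```

Routing reads to one server hides the divergence only until that server fails. As the lecture says, the real fix needs more communication, so that both servers process the writes in the same order, and that costs performance.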
167
00:15:04,649 --> 00:15:13,209
Okay, so, talking about GFS, a lot of
thought about doing GFS was