<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
<title>腾讯云容器团队</title>
<link href="/atom.xml" rel="self"/>
<link href="https://TencentCloudContainerTeam.github.io/"/>
<updated>2020-06-16T01:53:49.339Z</updated>
<id>https://TencentCloudContainerTeam.github.io/</id>
<generator uri="http://hexo.io/">Hexo</generator>
<entry>
<title>Kubernetes 服务部署最佳实践(一)</title>
<link href="https://TencentCloudContainerTeam.github.io/2020/06/16/kubernetes-app-deployment-best-practice-1/"/>
<id>https://TencentCloudContainerTeam.github.io/2020/06/16/kubernetes-app-deployment-best-practice-1/</id>
<published>2020-06-16T01:50:00.000Z</published>
<updated>2020-06-16T01:53:49.339Z</updated>
<content type="html"><![CDATA[<h2 id="作者-陈鹏"><a href="#作者-陈鹏" class="headerlink" title="作者: 陈鹏"></a>作者: <a href="https://imroc.io/" target="_blank" rel="noopener">陈鹏</a></h2><h2 id="引言"><a href="#引言" class="headerlink" title="引言"></a>引言</h2><p>业务容器化后,如何将其部署在 K8S 上?如果仅仅是将它跑起来,很简单,但如果是上生产,我们有许多地方是需要结合业务场景和部署环境进行方案选型和配置调优的。比如,如何设置容器的 Request 与 Limit、如何让部署的服务做到高可用、如何配置健康检查、如何进行弹性伸缩、如何更好的进行资源调度、如何选择持久化存储、如何对外暴露服务等。</p><p>对于这一系列高频问题,这里将会出一个 Kubernetes 服务部署最佳实践的系列的文章来为大家一一作答,本文将先围绕如何合理利用资源的主题来进行探讨。</p><h2 id="Request-与-Limit-怎么设置才好"><a href="#Request-与-Limit-怎么设置才好" class="headerlink" title="Request 与 Limit 怎么设置才好"></a>Request 与 Limit 怎么设置才好</h2><p>如何为容器配置 Request 与 Limit? 这是一个即常见又棘手的问题,这个根据服务类型,需求与场景的不同而不同,没有固定的答案,这里结合生产经验总结了一些最佳实践,可以作为参考。</p><h3 id="所有容器都应该设置-request"><a href="#所有容器都应该设置-request" class="headerlink" title="所有容器都应该设置 request"></a>所有容器都应该设置 request</h3><p>request 的值并不是指给容器实际分配的资源大小,它仅仅是给调度器看的,调度器会 “观察” 每个节点可以用于分配的资源有多少,也知道每个节点已经被分配了多少资源。被分配资源的大小就是节点上所有 Pod 中定义的容器 request 之和,它可以计算出节点剩余多少资源可以被分配(可分配资源减去已分配的 request 之和)。如果发现节点剩余可分配资源大小比当前要被调度的 Pod 的 reuqest 还小,那么就不会考虑调度到这个节点,反之,才可能调度。所以,如果不配置 request,那么调度器就不能知道节点大概被分配了多少资源出去,调度器得不到准确信息,也就无法做出合理的调度决策,很容易造成调度不合理,有些节点可能很闲,而有些节点可能很忙,甚至 NotReady。</p><p>所以,建议是给所有容器都设置 request,让调度器感知节点有多少资源被分配了,以便做出合理的调度决策,让集群节点的资源能够被合理的分配使用,避免陷入资源分配不均导致一些意外发生。</p><h3 id="老是忘记设置怎么办"><a href="#老是忘记设置怎么办" class="headerlink" title="老是忘记设置怎么办"></a>老是忘记设置怎么办</h3><p>有时候我们会忘记给部分容器设置 request 与 limit,其实我们可以使用 LimitRange 来设置 namespace 的默认 request 与 limit 值,同时它也可以用来限制最小和最大的 request 与 limit。<br>示例:</p><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br></pre></td><td class="code"><pre><span class="line"><span class="attr">apiVersion:</span> <span class="string">v1</span></span><br><span class="line"><span class="attr">kind:</span> <span class="string">LimitRange</span></span><br><span class="line"><span class="attr">metadata:</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">mem-limit-range</span></span><br><span class="line"><span class="attr"> namespace:</span> <span class="string">test</span></span><br><span class="line"><span class="attr">spec:</span></span><br><span class="line"><span class="attr"> limits:</span></span><br><span class="line"><span class="attr"> - default:</span></span><br><span class="line"><span class="attr"> memory:</span> <span class="number">512</span><span class="string">Mi</span></span><br><span class="line"> <span class="attr">cpu:</span> <span class="number">500</span><span class="string">m</span></span><br><span class="line"><span class="attr"> defaultRequest:</span></span><br><span class="line"><span class="attr"> memory:</span> <span class="number">256</span><span class="string">Mi</span></span><br><span class="line"> <span class="attr">cpu:</span> <span class="number">100</span><span class="string">m</span></span><br><span class="line"><span class="attr"> type:</span> <span class="string">Container</span></span><br></pre></td></tr></table></figure><h3 id="重要的线上应用改如何设置"><a href="#重要的线上应用改如何设置" class="headerlink" 
title="重要的线上应用改如何设置"></a>重要的线上应用改如何设置</h3><p>节点资源不足时,会触发自动驱逐,将一些低优先级的 Pod 删除掉以释放资源让节点自愈。没有设置 request,limit 的 Pod 优先级最低,容易被驱逐;request 不等于 limit 的其次; request 等于 limit 的 Pod 优先级较高,不容易被驱逐。所以如果是重要的线上应用,不希望在节点故障时被驱逐导致线上业务受影响,就建议将 request 和 limit 设成一致。</p><h3 id="怎样设置才能提高资源利用率"><a href="#怎样设置才能提高资源利用率" class="headerlink" title="怎样设置才能提高资源利用率"></a>怎样设置才能提高资源利用率</h3><p>如果给给你的应用设置较高的 request 值,而实际占用资源长期远小于它的 request 值,导致节点整体的资源利用率较低。当然这对时延非常敏感的业务除外,因为敏感的业务本身不期望节点利用率过高,影响网络包收发速度。所以对一些非核心,并且资源不长期占用的应用,可以适当减少 request 以提高资源利用率。</p><p>如果你的服务支持水平扩容,单副本的 request 值一般可以设置到不大于 1 核,CPU 密集型应用除外。比如 coredns,设置到 0.1 核就可以,即 100m。</p><h3 id="尽量避免使用过大的-request-与-limit"><a href="#尽量避免使用过大的-request-与-limit" class="headerlink" title="尽量避免使用过大的 request 与 limit"></a>尽量避免使用过大的 request 与 limit</h3><p>如果你的服务使用单副本或者少量副本,给很大的 request 与 limit,让它分配到足够多的资源来支撑业务,那么某个副本故障对业务带来的影响可能就比较大,并且由于 request 较大,当集群内资源分配比较碎片化,如果这个 Pod 所在节点挂了,其它节点又没有一个有足够的剩余可分配资源能够满足这个 Pod 的 request 时,这个 Pod 就无法实现漂移,也就不能自愈,加重对业务的影响。</p><p>相反,建议尽量减小 request 与 limit,通过增加副本的方式来对你的服务支撑能力进行水平扩容,让你的系统更加灵活可靠。</p><h3 id="避免测试-namespace-消耗过多资源影响生产业务"><a href="#避免测试-namespace-消耗过多资源影响生产业务" class="headerlink" title="避免测试 namespace 消耗过多资源影响生产业务"></a>避免测试 namespace 消耗过多资源影响生产业务</h3><p>若生产集群有用于测试的 namespace,如果不加以限制,可能导致集群负载过高,从而影响生产业务。可以使用 ResourceQuota 来限制测试 namespace 的 request 与 limit 的总大小。<br>示例:</p><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><span class="line"><span class="attr">apiVersion:</span> <span class="string">v1</span></span><br><span class="line"><span class="attr">kind:</span> <span class="string">ResourceQuota</span></span><br><span class="line"><span class="attr">metadata:</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">quota-test</span></span><br><span class="line"><span class="attr"> namespace:</span> <span class="string">test</span></span><br><span class="line"><span class="attr">spec:</span></span><br><span class="line"><span class="attr"> hard:</span></span><br><span class="line"> <span class="string">requests.cpu:</span> <span class="string">"1"</span></span><br><span class="line"> <span class="string">requests.memory:</span> <span class="number">1</span><span class="string">Gi</span></span><br><span class="line"> <span class="string">limits.cpu:</span> <span class="string">"2"</span></span><br><span class="line"> <span class="string">limits.memory:</span> <span class="number">2</span><span class="string">Gi</span></span><br></pre></td></tr></table></figure><h2 id="如何让资源得到更合理的分配"><a href="#如何让资源得到更合理的分配" class="headerlink" title="如何让资源得到更合理的分配"></a>如何让资源得到更合理的分配</h2><p>设置 Request 能够解决让 Pod 调度到有足够资源的节点上,但无法做到更细致的控制。如何进一步让资源得到合理的使用?我们可以结合亲和性、污点与容忍等高级调度技巧,让 Pod 能够被合理调度到合适的节点上,让资源得到充分的利用。</p><h3 id="使用亲和性"><a href="#使用亲和性" class="headerlink" title="使用亲和性"></a>使用亲和性</h3><ul><li>对节点有特殊要求的服务可以用节点亲和性 (Node Affinity) 部署,以便调度到符合要求的节点,比如让 MySQL 调度到高 IO 的机型以提升数据读写效率。</li><li>可以将需要离得比较近的有关联的服务用 Pod 亲和性 (Pod Affinity) 部署,比如让 Web 服务跟它的 Redis 缓存服务都部署在同一可用区,实现低延时。</li><li>也可使用 Pod 反亲和 (Pod AntiAffinity) 将 Pod 进行打散调度,避免单点故障或者流量过于集中导致的一些问题。</li></ul><h3 id="使用污点与容忍"><a href="#使用污点与容忍" class="headerlink" 
title="使用污点与容忍"></a>使用污点与容忍</h3><p>使用污点 (Taint) 与容忍 (Toleration) 可优化集群资源调度:</p><ul><li>通过给节点打污点来给某些应用预留资源,避免其它 Pod 调度上来。</li><li>需要使用这些资源的 Pod 加上容忍,结合节点亲和性让它调度到预留节点,即可使用预留的资源。</li></ul><h2 id="弹性伸缩"><a href="#弹性伸缩" class="headerlink" title="弹性伸缩"></a>弹性伸缩</h2><h3 id="如何支持流量突发型业务"><a href="#如何支持流量突发型业务" class="headerlink" title="如何支持流量突发型业务"></a>如何支持流量突发型业务</h3><p>通常业务都会有高峰和低谷,为了更合理的利用资源,我们为服务定义 HPA,实现根据 Pod 的资源实际使用情况来对服务进行自动扩缩容,在业务高峰时自动扩容 Pod 数量来支撑服务,在业务低谷时,自动缩容 Pod 释放资源,以供其它服务使用(比如在夜间,线上业务低峰,自动缩容释放资源以供大数据之类的离线任务运行) 。</p><p>使用 HPA 前提是让 K8S 得知道你服务的实际资源占用情况(指标数据),需要安装 resource metrics (metrics.k8s.io) 或 custom metrics (custom.metrics.k8s.io) 的实现,好让 hpa controller 查询这些 API 来获取到服务的资源占用情况。早期 HPA 用 resource metrics 获取指标数据,后来推出 custom metrics,可以实现更灵活的指标来控制扩缩容。官方有个叫 <a href="https://github.com/kubernetes-sigs/metrics-server" target="_blank" rel="noopener">metrics-server</a> 的实现,通常社区使用的更多的是基于 prometheus 的 实现 <a href="https://github.com/DirectXMan12/k8s-prometheus-adapter" target="_blank" rel="noopener">prometheus-adapter</a>,而云厂商托管的 K8S 集群通常集成了自己的实现,比如 TKE,实现了 CPU、内存、硬盘、网络等维度的指标,可以在网页控制台可视化创建 HPA,但最终都会转成 K8S 的 yaml,示例:</p><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br></pre></td><td class="code"><pre><span class="line"><span class="attr">apiVersion:</span> <span class="string">autoscaling/v2beta2</span></span><br><span class="line"><span class="attr">kind:</span> <span class="string">HorizontalPodAutoscaler</span></span><br><span class="line"><span class="attr">metadata:</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">nginx</span></span><br><span class="line"><span class="attr">spec:</span></span><br><span class="line"><span class="attr"> scaleTargetRef:</span></span><br><span class="line"><span class="attr"> apiVersion:</span> <span class="string">apps/v1beta2</span></span><br><span class="line"><span class="attr"> kind:</span> <span class="string">Deployment</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">nginx</span></span><br><span class="line"><span class="attr"> minReplicas:</span> <span class="number">1</span></span><br><span class="line"><span class="attr"> maxReplicas:</span> <span class="number">10</span></span><br><span class="line"><span class="attr"> metrics:</span></span><br><span class="line"><span class="attr"> - type:</span> <span class="string">Pods</span></span><br><span class="line"><span class="attr"> pods:</span></span><br><span class="line"><span class="attr"> metric:</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">k8s_pod_rate_cpu_core_used_request</span></span><br><span class="line"><span class="attr"> target:</span></span><br><span class="line"><span class="attr"> averageValue:</span> <span class="string">"100"</span></span><br><span class="line"><span class="attr"> type:</span> <span 
class="string">AverageValue</span></span><br></pre></td></tr></table></figure><h3 id="如何节约成本"><a href="#如何节约成本" class="headerlink" title="如何节约成本"></a>如何节约成本</h3><p>HPA 能实现 Pod 水平扩缩容,但如果节点资源不够用了,Pod 扩容出来还是会 Pending。如果我们提前准备好大量节点,做好资源冗余,提前准备好大量节点,通常不会有 Pod Pending 的问题,但也意味着需要付出更高的成本。通常云厂商托管的 K8S 集群都会实现 <a href="https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler" target="_blank" rel="noopener">cluster-autoscaler</a>,即根据资源使用情况,动态增删节点,让计算资源能够被最大化的弹性使用,按量付费,以节约成本。在 TKE 上的实现叫做伸缩组,以及一个包含伸缩功能组但更高级的特性:节点池(正在灰度)</p><h3 id="无法水平扩容的服务怎么办"><a href="#无法水平扩容的服务怎么办" class="headerlink" title="无法水平扩容的服务怎么办"></a>无法水平扩容的服务怎么办</h3><p>对于无法适配水平伸缩的单体应用,或者不确定最佳 request 与 limit 超卖比的应用,可以尝用 <a href="https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler" target="_blank" rel="noopener">VPA</a> 来进行垂直伸缩,即自动更新 request 与 limit,然后重启 pod。不过这个特性容易导致你的服务出现短暂的不可用,不建议在生产环境中大规模使用。</p><h2 id="参考资料"><a href="#参考资料" class="headerlink" title="参考资料"></a>参考资料</h2><ul><li>Understanding Kubernetes limits and requests by example: <a href="https://sysdig.com/blog/kubernetes-limits-requests/" target="_blank" rel="noopener">https://sysdig.com/blog/kubernetes-limits-requests/</a></li><li>Understanding resource limits in kubernetes: cpu time: <a href="https://medium.com/@betz.mark/understanding-resource-limits-in-kubernetes-cpu-time-9eff74d3161b" target="_blank" rel="noopener">https://medium.com/@betz.mark/understanding-resource-limits-in-kubernetes-cpu-time-9eff74d3161b</a></li><li>Understanding resource limits in kubernetes: memory: <a href="https://medium.com/@betz.mark/understanding-resource-limits-in-kubernetes-memory-6b41e9a955f9" target="_blank" rel="noopener">https://medium.com/@betz.mark/understanding-resource-limits-in-kubernetes-memory-6b41e9a955f9</a></li><li>Kubernetes best practices: Resource requests and limits: <a href="https://cloud.google.com/blog/products/gcp/kubernetes-best-practices-resource-requests-and-limits" target="_blank" rel="noopener">https://cloud.google.com/blog/products/gcp/kubernetes-best-practices-resource-requests-and-limits</a></li><li>Kubernetes 资源分配之 Request 和 Limit 解析: <a href="https://cloud.tencent.com/developer/article/1004976" target="_blank" rel="noopener">https://cloud.tencent.com/developer/article/1004976</a></li><li>Assign Pods to Nodes using Node Affinity: <a href="https://kubernetes.io/docs/tasks/configure-pod-container/assign-pods-nodes-using-node-affinity/" target="_blank" rel="noopener">https://kubernetes.io/docs/tasks/configure-pod-container/assign-pods-nodes-using-node-affinity/</a></li><li>Taints and Tolerations: <a href="https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/" target="_blank" rel="noopener">https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/</a></li><li>metrics-server: <a href="https://github.com/kubernetes-sigs/metrics-server" target="_blank" rel="noopener">https://github.com/kubernetes-sigs/metrics-server</a></li><li>prometheus-adapter: <a href="https://github.com/DirectXMan12/k8s-prometheus-adapter" target="_blank" rel="noopener">https://github.com/DirectXMan12/k8s-prometheus-adapter</a></li><li>cluster-autoscaler: <a href="https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler" target="_blank" rel="noopener">https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler</a></li><li>VPA: <a href="https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler" target="_blank" 
rel="noopener">https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler</a></li></ul>]]></content>
<summary type="html">
<h2 id="作者-陈鹏"><a href="#作者-陈鹏" class="headerlink" title="作者: 陈鹏"></a>作者: <a href="https://imroc.io/" target="_blank" rel="noopener">陈鹏</a><
</summary>
</entry>
<entry>
<title>TKE 集群组建最佳实践</title>
<link href="https://TencentCloudContainerTeam.github.io/2020/06/16/tke-cluster-setup-best-practice/"/>
<id>https://TencentCloudContainerTeam.github.io/2020/06/16/tke-cluster-setup-best-practice/</id>
<published>2020-06-16T01:00:00.000Z</published>
<updated>2020-06-16T01:53:49.351Z</updated>
<content type="html"><![CDATA[<p>作者: <a href="https://imroc.io/" target="_blank" rel="noopener">陈鹏</a></p><h2 id="Kubernetes-版本"><a href="#Kubernetes-版本" class="headerlink" title="Kubernetes 版本"></a>Kubernetes 版本</h2><p>K8S 版本迭代比较快,新版本通常包含许多 bug 修复和新功能,旧版本逐渐淘汰,建议创建集群时选择当前 TKE 支持的最新版本,后续出新版本后也是可以支持 Master 和 节点的版本升级的。</p><h2 id="网络模式-GlobalRouter-vs-VPC-CNI"><a href="#网络模式-GlobalRouter-vs-VPC-CNI" class="headerlink" title="网络模式: GlobalRouter vs VPC-CNI"></a>网络模式: GlobalRouter vs VPC-CNI</h2><p><strong>GlobalRouter 模式架构:</strong></p><p><img src="https://imroc.io/assets/blog/tke-best-practices-and-troubleshooting/global-router.jpg" alt=""></p><ul><li>基于 CNI 和 网桥实现的容器网络能力,容器路由直接通过 VPC 底层实现</li><li>容器与节点在同一网络平面,但网段不与 VPC 网段重叠,容器网段地址充裕 </li></ul><p><strong>VPC-CNI 模式架构:</strong></p><p><img src="https://imroc.io/assets/blog/tke-best-practices-and-troubleshooting/vpc-cni.jpg" alt=""></p><ul><li>基于 CNI 和 VPC 弹性网卡实现的容器网络能力,容器路由通过弹性网卡,性能相比 Global Router 约提高 10%</li><li>容器与节点在同一网络平面,网段在 VPC 网段内</li><li>支持 Pod 固定 IP</li></ul><p><strong>对比:</strong></p><p><img src="https://imroc.io/assets/blog/tke-best-practices-and-troubleshooting/global-router-vs-vpc-cni.jpg" alt=""></p><p><strong>支持三种使用方式:</strong></p><ul><li>创建集群时指定 GlobalRouter 模式</li><li>创建集群时指定 VPC-CNI 模式,后续所有 Pod 都必须使用 VPC-CNI 模式创建</li><li>创建集群时指定 GlobalRouter 模式,在需要使用 VPC-CNI 模式时为集群启用 VPC-CNI 的支持,即两种模式混用</li></ul><p><strong>选型建议:</strong></p><ul><li>绝大多数情况下应该选择 GlobalRouter,容器网段地址充裕,扩展性强,能适应规模较大的业务</li><li>如果后期部分业务需要用到 VPC-CNI 模式,可以在 GlobalRouter 集群再开启 VPC-CNI 支持,也就是 GlobalRouter 与 VPC-CNI 混用,仅对部分业务使用 VPC-CNI 模式</li><li>如果完全了解并接受 VPC-CNI 的各种限制,并且需要集群内所有 Pod 都用 VPC-CNI 模式,可以创建集群时选择 VPC-CNI 网络插件</li></ul><blockquote><p>参考官方文档 《如何选择容器服务网络模式》: <a href="https://cloud.tencent.com/document/product/457/41636" target="_blank" rel="noopener">https://cloud.tencent.com/document/product/457/41636</a></p></blockquote><h2 id="运行时-Docker-vs-Containerd"><a href="#运行时-Docker-vs-Containerd" class="headerlink" title="运行时: Docker vs Containerd"></a>运行时: Docker vs Containerd</h2><p><strong>Docker 作为运行时的架构:</strong></p><p><img src="https://imroc.io/assets/blog/tke-best-practices-and-troubleshooting/docker-as-runtime.jpg" alt=""></p><ul><li>kubelet 内置的 dockershim 模块帮傲娇的 docker 适配了 CRI 接口,然后 kubelet 自己调自己的 dockershim (通过 socket 文件),然后 dockershim 再调 dockerd 接口 (Docker HTTP API),接着 dockerd 还要再调 docker-containerd (gRPC) 来实现容器的创建与销毁等。</li><li>为什么调用链这么长? 
K8S 一开始支持的就只是 Docker,后来引入了 CRI,将运行时抽象以支持多种运行时,而 Docker 跟 K8S 在一些方面有一定的竞争,不甘做小弟,也就没在 dockerd 层面实现 CRI 接口,所以 kubelet 为了让 dockerd 支持 CRI,就自己为 dockerd 实现了 CRI。docker 本身内部组件也模块化了,再加上一层 CRI 适配,调用链肯定就长了。</li></ul><p><strong>Containerd 作为运行时的架构:</strong></p><p><img src="https://imroc.io/assets/blog/tke-best-practices-and-troubleshooting/containerd-as-runtime.jpg" alt=""></p><ul><li>containerd 1.1 之后,支持 CRI Plugin,即 containerd 自身这里就可以适配 CRI 接口。</li><li>相比 Docker 方案,调用链少了 dockershim 和 dockerd。</li></ul><p><strong>对比:</strong></p><ul><li>containerd 方案由于绕过了 dockerd,调用链更短,组件更少,占用节点资源更少,绕过了 dockerd 本身的一些 bug,但 containerd 自身也还存在一些 bug (已修复一些,灰度中)。</li><li>docker 方案历史比较悠久,相对更成熟,支持 docker api,功能丰富,符合大多数人的使用习惯。</li></ul><p><strong>选型建议:</strong></p><ul><li>Docker 方案 相比 containerd 更成熟,如果对稳定性要求很高,建议 docker 方案</li><li>以下场景只能使用 docker: <ul><li>Docker in docker (通常在 CI 场景)</li><li>节点上使用 docker 命令</li><li>调用 docker API</li></ul></li><li>没有以上场景建议使用 containerd</li></ul><blockquote><p>参考官方文档 《如何选择 Containerd 和 Docker》: <a href="https://cloud.tencent.com/document/product/457/35747" target="_blank" rel="noopener">https://cloud.tencent.com/document/product/457/35747</a></p></blockquote><h2 id="Service-转发模式-iptables-vs-ipvs"><a href="#Service-转发模式-iptables-vs-ipvs" class="headerlink" title="Service 转发模式: iptables vs ipvs"></a>Service 转发模式: iptables vs ipvs</h2><p>先看看 Service 的转发原理:</p><p><img src="https://imroc.io/assets/blog/tke-best-practices-and-troubleshooting/service.jpg" alt=""></p><ul><li>节点上的 kube-proxy 组件 watch apiserver,获取 Service 与 Endpoint,转化成 iptables 或 ipvs 规则并写到节点上</li><li>集群内的 client 去访问 Service (Cluster IP),会被 iptable/ipvs 规则负载均衡到 Service 对应的后端 pod</li></ul><p><strong>对比:</strong></p><ul><li>ipvs 模式性能更高,但也存在一些已知未解决的 bug</li><li>iptables 模式更成熟稳定</li></ul><p><strong>选型建议:</strong></p><ul><li>对稳定性要求极高且 service 数量小于 2000,选 iptables</li><li>其余场景首选 ipvs</li></ul><h2 id="集群类型-托管集群-vs-独立集群"><a href="#集群类型-托管集群-vs-独立集群" class="headerlink" title="集群类型: 托管集群 vs 独立集群"></a>集群类型: 托管集群 vs 独立集群</h2><p><strong>托管集群:</strong></p><ul><li>Master 组件用户不可见,由腾讯云托管</li><li>很多新功能也是会率先支持托管的集群</li><li>Master 的计算资源会根据集群规模自动扩容</li><li>用户不需要为 Master 付费</li></ul><p><strong>独立集群:</strong></p><ul><li>Master 组件用户可以完全掌控</li><li>用户需要为 Master 付费购买机器</li></ul><p><strong>选型建议:</strong></p><ul><li>一般推荐托管集群</li><li>如果希望能能够对 Master 完全掌控,可以使用独立集群 (比如对 Master 进行个性化定制实现高级功能)</li></ul><h2 id="节点操作系统"><a href="#节点操作系统" class="headerlink" title="节点操作系统"></a>节点操作系统</h2><p>TKE 主要支持 Ubuntu 和 CentOS 两类发行版,带 “TKE-Optimized” 后缀用的是 TKE 定制优化版的内核,其它的是 linux 社区官方开源内核:</p><p><img src="https://imroc.io/assets/blog/tke-best-practices-and-troubleshooting/ubuntu.jpg" alt=""></p><p><img src="https://imroc.io/assets/blog/tke-best-practices-and-troubleshooting/centos.jpg" alt=""></p><p><strong>TKE-Optimized 的优势:</strong></p><ul><li>基于内核社区长期支持的 4.14.105 版本定制</li><li>针对容器和云场景进行优化</li><li>计算、存储和网络子系统均经过性能优化</li><li>对内核缺陷修复支持较好</li><li>完全开源: <a href="https://github.com/Tencent/TencentOS-kernel" target="_blank" rel="noopener">https://github.com/Tencent/TencentOS-kernel</a></li></ul><p><strong>选型建议:</strong></p><ul><li>推荐 “TKE-Optimized”,稳定性和技术支持都比较好</li><li>如果需要更高版本内核,选非 “TKE-Optimized” 版本的操作系统</li></ul><h2 id="节点池"><a href="#节点池" class="headerlink" title="节点池"></a>节点池</h2><p>此特性当前正在灰度中,可申请开白名单使用。主要可用于批量管理节点:</p><ul><li>节点 Label 与 Taint</li><li>节点组件启动参数</li><li>节点自定义启动脚本</li><li>操作系统与运行时 (暂未支持)</li></ul><blockquote><p>产品文档:<a href="https://cloud.tencent.com/document/product/457/43719" target="_blank" 
rel="noopener">https://cloud.tencent.com/document/product/457/43719</a></p></blockquote><p><strong>适用场景:</strong></p><ul><li>异构节点分组管理,减少管理成本</li><li>让集群更好支持复杂的调度规则 (Label, Taint)</li><li>频繁扩缩容节点,减少操作成本</li><li>节点日常维护(版本升级)</li></ul><p><strong>用法举例:</strong></p><p>部分IO密集型业务需要高IO机型,为其创建一个节点池,配置机型并统一设置节点 Label 与 Taint,然后将 IO 密集型业务配置亲和性,选中 Label,使其调度到高 IO 机型的节点 (Taint 可以避免其它业务 Pod 调度上来)。</p><p>随着时间的推移,业务量快速上升,该 IO 密集型业务也需要更多的计算资源,在业务高峰时段,HPA 功能自动为该业务扩容了 Pod,而节点计算资源不够用,这时节点池的自动伸缩功能自动扩容了节点,扛住了流量高峰。</p><h2 id="启动脚本"><a href="#启动脚本" class="headerlink" title="启动脚本"></a>启动脚本</h2><p>添加节点时通过自定义数据配置节点启动脚本 (可用于修改组件启动参数、内核参数等):</p><p><img src="https://imroc.io/assets/blog/tke-best-practices-and-troubleshooting/custom-script.jpg" alt=""></p><h2 id="组件自定义参数"><a href="#组件自定义参数" class="headerlink" title="组件自定义参数"></a>组件自定义参数</h2><p>此特性当前也正在灰度中,可申请开白名单使用。</p><p>创建集群时可自定义 Master 组件部分启动参数:</p><p><img src="https://imroc.io/assets/blog/tke-best-practices-and-troubleshooting/custom-master-parameter.jpg" alt=""></p><p>添加节点时可自定义 kubelet 部分启动参数:</p><p><img src="https://imroc.io/assets/blog/tke-best-practices-and-troubleshooting/custom-kubelet-parameter.jpg" alt=""></p>]]></content>
<summary type="html">
<p>作者: <a href="https://imroc.io/" target="_blank" rel="noopener">陈鹏</a></p>
<h2 id="Kubernetes-版本"><a href="#Kubernetes-版本" class="headerli
</summary>
</entry>
<entry>
<title>揭秘 Kubernetes attach/detach controller 逻辑漏洞致使 pod 启动失败</title>
<link href="https://TencentCloudContainerTeam.github.io/2020/05/13/K8s-ad-controller-bug/"/>
<id>https://TencentCloudContainerTeam.github.io/2020/05/13/K8s-ad-controller-bug/</id>
<published>2020-05-13T13:00:00.000Z</published>
<updated>2020-06-16T01:53:49.335Z</updated>
<content type="html"><![CDATA[<p>作者: <a href="https://github.com/ivan-cai" target="_blank" rel="noopener">蔡靖</a></p><h3 id="前言"><a href="#前言" class="headerlink" title="前言"></a>前言</h3><p>本文主要通过深入学习k8s attach/detach controller源码,了解现网案例发现的attach/detach controller bug发生的原委,并给出解决方案。</p><p>看完本文你也将学习到:</p><ul><li>attach/detach controller的主要数据结构有哪些,保存什么数据,数据从哪来,到哪去等等;</li><li>k8s attach/detach volume的详细流程,如何判断volume是否需要attach/detach,attach/detach controller和kubelet(volume manager)如何协同工作等等。</li></ul><h3 id="现网案例现象"><a href="#现网案例现象" class="headerlink" title="现网案例现象"></a>现网案例现象</h3><p>我们首先了解下现网案例的问题和现象;然后去深入理解ad controller维护的数据结构;之后根据数据结构与ad controller的代码逻辑,再来详细分析现网案例出现的原因和解决方案。从而深入理解整个ad controller。</p><h4 id="问题描述"><a href="#问题描述" class="headerlink" title="问题描述"></a>问题描述</h4><ul><li>一个statefulsets(sts)引用了多个pvc cbs,我们更新sts时,删除旧pod,创建新pod,此时如果删除旧pod时cbs detach失败,且创建的新pod调度到和旧pod相同的节点,就可能会让这些pod一直处于<code>ContainerCreating</code> 。</li></ul><h4 id="现象"><a href="#现象" class="headerlink" title="现象"></a>现象</h4><ul><li><code>kubectl describe pod</code></li></ul><p><img src="https://main.qcloudimg.com/raw/e502a73a01436daa323e6747ae151b67.png" alt="enter image description here"></p><ul><li>kubelet log</li></ul><p><img src="https://main.qcloudimg.com/raw/376e655665a8738810a78370f9bd3bee.png" alt="enter image description here"></p><ul><li><code>kubectl get node xxx -oyaml</code> 的<code>volumesAttached</code>和<code>volumesInUse</code></li></ul><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line">volumesAttached:</span><br><span class="line"> - devicePath: /dev/disk/by-id/virtio-disk-6w87j3wv</span><br><span class="line"> name: kubernetes.io/qcloud-cbs/disk-6w87j3wv</span><br><span class="line">volumesInUse:</span><br><span class="line"> - kubernetes.io/qcloud-cbs/disk-6w87j3wv</span><br><span class="line"> - kubernetes.io/qcloud-cbs/disk-7bfqsft5</span><br></pre></td></tr></table></figure><h3 id="k8s存储简述"><a href="#k8s存储简述" class="headerlink" title="k8s存储简述"></a>k8s存储简述</h3><p>k8s中attach/detach controller负责存储插件的attach/detach。本文结合现网出现的一个案例来分析ad controller的源码逻辑,该案例是因k8s的ad controller bug导致的pod创建失败。</p><p>k8s中涉及存储的组件主要有:attach/detach controller、pv controller、volume manager、volume plugins、scheduler。每个组件分工明确:</p><ul><li><strong>attach/detach controller</strong>:负责对volume进行attach/detach</li><li><strong>pv controller</strong>:负责处理pv/pvc对象,包括pv的provision/delete(cbs intree的provisioner设计成了external provisioner,独立的cbs-provisioner来负责cbs pv的provision/delete)</li><li><strong>volume manager</strong>:主要负责对volume进行mount/unmount</li><li><strong>volume plugins</strong>:包含k8s原生的和各厂商的的存储插件<ul><li>原生的包括:emptydir、hostpath、flexvolume、csi等</li><li>各厂商的包括:aws-ebs、azure、我们的cbs等</li></ul></li><li><strong>scheduler</strong>:涉及到volume的调度。比如对ebs、csi等的单node最大可attach磁盘数量的predicate策略</li></ul><p><img src="https://main.qcloudimg.com/raw/9acf4b3722150ca3c315e19cd0b17132.png" alt="enter image description here"></p><p>控制器模式是k8s非常重要的概念,一般一个controller会去管理一个或多个API对象,以让对象从实际状态/当前状态趋近于期望状态。</p><p>所以attach/detach controller的作用其实就是去attach期望被attach的volume,detach期望被detach的volume。</p><p>后续attach/detach controller简称ad controller。</p><h3 id="ad-controller数据结构"><a href="#ad-controller数据结构" class="headerlink" title="ad controller数据结构"></a>ad controller数据结构</h3><p>对于ad controller来说,理解了其内部的数据结构,再去理解逻辑就事半功倍。ad 
controller在内存中维护2个数据结构:</p><ol><li><code>actualStateOfWorld</code> —— 表征实际状态(后面简称asw)</li><li><code>desiredStateOfWorld</code> —— 表征期望状态(后面简称dsw)</li></ol><p>很明显,对于声明式API来说,是需要随时比对实际状态和期望状态的,所以ad controller中就用了2个数据结构来分别表征实际状态和期望状态。</p><h4 id="actualStateOfWorld"><a href="#actualStateOfWorld" class="headerlink" title="actualStateOfWorld"></a>actualStateOfWorld</h4><p><code>actualStateOfWorld</code> 包含2个map:</p><ul><li><code>attachedVolumes</code>: 包含了那些ad controller认为被成功attach到nodes上的volumes</li><li><code>nodesToUpdateStatusFor</code>: 包含要更新<code>node.Status.VolumesAttached</code> 的nodes</li></ul><h5 id="attachedVolumes"><a href="#attachedVolumes" class="headerlink" title="attachedVolumes"></a>attachedVolumes</h5><h6 id="如何填充数据?"><a href="#如何填充数据?" class="headerlink" title="如何填充数据?"></a>如何填充数据?</h6><p>1、在启动ad controller时,会populate asw,此时会list集群内所有node对象,然后用这些node对象的<code>node.Status.VolumesAttached</code> 去填充<code>attachedVolumes</code>。</p><p>2、之后只要有需要attach的volume被成功attach了,就会调用<code>MarkVolumeAsAttached</code>(<code>GenerateAttachVolumeFunc</code> 中)来填充到<code>attachedVolumes中</code>。</p><h6 id="如何删除数据?"><a href="#如何删除数据?" class="headerlink" title="如何删除数据?"></a>如何删除数据?</h6><p>1、只有在volume被detach成功后,才会把相关的volume从<code>attachedVolumes</code>中删掉。(<code>GenerateDetachVolumeFunc</code> 中调用<code>MarkVolumeDetached</code>)</p><h5 id="nodesToUpdateStatusFor"><a href="#nodesToUpdateStatusFor" class="headerlink" title="nodesToUpdateStatusFor"></a>nodesToUpdateStatusFor</h5><h6 id="如何填充数据?-1"><a href="#如何填充数据?-1" class="headerlink" title="如何填充数据?"></a>如何填充数据?</h6><p>1、detach volume失败后,将volume add back到<code>nodesToUpdateStatusFor</code></p><p> - <code>GenerateDetachVolumeFunc</code> 中调用<code>AddVolumeToReportAsAttached</code></p><h6 id="如何删除数据?-1"><a href="#如何删除数据?-1" class="headerlink" title="如何删除数据?"></a>如何删除数据?</h6><p>1、在detach volume之前会先调用<code>RemoveVolumeFromReportAsAttached</code> 从<code>nodesToUpdateStatusFor</code>中先删除该volume相关信息</p><h4 id="desiredStateOfWorld"><a href="#desiredStateOfWorld" class="headerlink" title="desiredStateOfWorld"></a>desiredStateOfWorld</h4><p><code>desiredStateOfWorld</code> 中维护了一个map:</p><p><code>nodesManaged</code>:包含被ad controller管理的nodes,以及期望attach到这些node上的volumes。</p><h5 id="nodesManaged"><a href="#nodesManaged" class="headerlink" title="nodesManaged"></a>nodesManaged</h5><h6 id="如何填充数据?-2"><a href="#如何填充数据?-2" class="headerlink" title="如何填充数据?"></a>如何填充数据?</h6><p>1、在启动ad controller时,会populate asw,list集群内所有node对象,然后把由ad controller管理的node填充到<code>nodesManaged</code></p><p>2、ad controller的<code>nodeInformer</code> watch到node有更新也会把node填充到<code>nodesManaged</code></p><p>3、另外在populate dsw和<code>podInformer</code> watch到pod有变化(add, update)时,往<code>nodesManaged</code> 中填充volume和pod的信息</p><p>4、<code>desiredStateOfWorldPopulator</code> 中也会周期性地去找出需要被add的pod,此时也会把相应的volume和pod填充到<code>nodesManaged</code> </p><h6 id="如何删除数据?-2"><a href="#如何删除数据?-2" class="headerlink" title="如何删除数据?"></a>如何删除数据?</h6><p>1、当删除node时,ad controller中的<code>nodeInformer</code> watch到变化会从dsw的<code>nodesManaged</code> 中删除相应的node</p><p>2、当ad controller中的<code>podInformer</code> watch到pod的删除时,会从<code>nodesManaged</code> 中删除相应的volume和pod</p><p>3、<code>desiredStateOfWorldPopulator</code> 中也会周期性地去找出需要被删除的pod,此时也会从<code>nodesManaged</code> 中删除相应的volume和pod</p><h3 id="ad-controller流程简述"><a href="#ad-controller流程简述" class="headerlink" title="ad controller流程简述"></a>ad controller流程简述</h3><p>ad controller的逻辑比较简单:</p><p>1、首先,list集群内所有的node和pod,来populate <code>actualStateOfWorld</code> 
(<code>attachedVolumes</code> )和<code>desiredStateOfWorld</code> (<code>nodesManaged</code>)</p><p>2、然后,单独开个goroutine运行<code>reconciler</code>,通过触发attach, detach操作周期性地去reconcile asw(实际状态)和dws(期望状态)</p><ul><li>触发attach,detach操作也就是,detach该被detach的volume,attach该被attach的volume</li></ul><p>3、之后,又单独开个goroutine运行<code>DesiredStateOfWorldPopulator</code> ,定期去验证dsw中的pods是否依然存在,如果不存在就从dsw中删除</p><h3 id="现网案例"><a href="#现网案例" class="headerlink" title="现网案例"></a>现网案例</h3><p>接下来结合上面所说的现网案例,来详细看看<code>reconciler</code>的逻辑。</p><h4 id="案例初步分析"><a href="#案例初步分析" class="headerlink" title="案例初步分析"></a>案例初步分析</h4><ul><li>从pod的事件可以看出来:ad controller认为cbs attach成功了,然后kubelet没有mount成功。</li><li><strong><em>但是</em></strong>从kubelet日志却发现<code>Volume not attached according to node status</code> ,也就是说kubelet认为cbs没有按照node的状态去挂载。这个从node info也可以得到证实:<code>volumesAttached</code> 中的确没有这个cbs盘(disk-7bfqsft5)。</li><li>node info中还有个现象:<code>volumesInUse</code> 中还有这个cbs。说明没有unmount成功</li></ul><p>很明显,cbs要能被pod成功使用,需要ad controller和volume manager的协同工作。所以这个问题的定位首先要明确:</p><ol><li>volume manager为什么认为volume没有按照node状态挂载,ad controller却认为volume attch成功了?</li><li><code>volumesAttached</code>和<code>volumesInUse</code> 在ad controller和kubelet之间充当什么角色?</li></ol><p>这里只对分析volume manager做简要分析。</p><ul><li>根据<code>Volume not attached according to node status</code> 在代码中找到对应的位置,发现在<code>GenerateVerifyControllerAttachedVolumeFunc</code> 中。仔细看代码逻辑,会发现<ul><li>volume manager的reconciler会先确认该被unmount的volume被unmount掉</li><li>然后确认该被mount的volume被mount<ul><li>此时会先从volume manager的dsw缓存中获取要被mount的volumes(<code>volumesToMount</code>的<code>podsToMount</code> )</li><li>然后遍历,验证每个<code>volumeToMount</code>是否已经attach了<ul><li><code>这个volumeToMount</code>是由<code>podManager</code>中的<code>podInformer</code>加入到相应内存中,然后<code>desiredStateOfWorldPopulator</code>周期性同步到dsw中的</li></ul></li><li>验证逻辑中,在<code>GenerateVerifyControllerAttachedVolumeFunc</code>中会去遍历本节点的<code>node.Status.VolumesAttached</code>,如果没有找到就报错(<code>Volume not attached according to node status</code>)</li></ul></li></ul></li><li>所以可以看出来,<strong><em>volume manager就是根据volume是否存在于<code>node.Status.VolumesAttached</code> 中来判断volume有无被attach成功</em></strong>。</li><li>那谁去填充<code>node.Status.VolumesAttached</code> ?ad controller的数据结构<code>nodesToUpdateStatusFor</code> 就是用来存储要更新到<code>node.Status.VolumesAttached</code> 上的数据的。</li><li>所以,<strong><em>如果ad controller那边没有更新<code>node.Status.VolumesAttached</code>,而又新建了pod,<code>desiredStateOfWorldPopulator</code> 从podManager中的内存把新建pod引用的volume同步到了<code>volumesToMount</code>中,在验证volume是否attach时,就会报错(Volume not attached according to node status)</em></strong><ul><li>当然,之后由于kublet的syncLoop里面会调用<code>WaitForAttachAndMount</code> 去等待volumeattach和mount成功,由于前面一直无法成功,等待超时,才会有会面<code>timeout expired</code> 的报错</li></ul></li></ul><p>所以接下来主要需要看为什么ad controller那边没有更新<code>node.Status.VolumesAttached</code>。</p><h4 id="ad-controller的reconciler详解"><a href="#ad-controller的reconciler详解" class="headerlink" title="ad controller的reconciler详解"></a>ad controller的<code>reconciler</code>详解</h4><p>接下来详细分析下ad controller的逻辑,看看为什么会没有更新<code>node.Status.VolumesAttached</code>,但从事件看ad controller却又认为volume已经挂载成功。</p><p>从<a href="#流程简述">流程简述</a>中表述可见,ad 
controller主要逻辑是在<code>reconciler</code>中。</p><ul><li><p><code>reconciler</code>定时去运行<code>reconciliationLoopFunc</code>,周期为100ms。</p></li><li><p><code>reconciliationLoopFunc</code>的主要逻辑在<code>reconcile()</code>中:</p><ol><li><p>首先,确保该被detach的volume被detach掉</p><ul><li>遍历asw中的<code>attachedVolumes</code>,对于每个volume,判断其是否存在于dsw中<ul><li>根据nodeName去dsw.nodesManaged中判断node是否存在</li><li>存在的话,再根据volumeName判断volume是否存在</li></ul></li><li>如果volume存在于asw,且不存在于dsw,则意味着需要进行detach</li><li>之后,根据<code>node.Status.VolumesInUse</code>来判断volume是否已经unmount完成,unmount完成或者等待6min timeout时间到后,会继续detach逻辑</li><li>在执行detach volume之前,会先调用<code>RemoveVolumeFromReportAsAttached</code>从asw的<code>nodesToUpdateStatusFor</code>中去删除要detach的volume</li><li>然后patch node,也就等于从<code>node.status.VolumesAttached</code>删除这个volume</li><li>之后进行detach,detach失败主要分2种<ul><li>如果真正执行了<code>volumePlugin</code>的具体实现<code>DetachVolume</code>失败,会把volume add back到<code>nodesToUpdateStatusFor</code>(之后在attach逻辑结束后,会再次patch node)</li><li>如果是operator_excutor判断还没到backoff周期,就会返回<code>backoffError</code>,直接跳过<code>DetachVolume</code></li></ul></li><li>backoff周期起始为500ms,之后指数递增至2min2s。已经detach失败了的volume,在每个周期期间进入detach逻辑都会直接返回<code>backoffError</code></li></ul></li><li><p>之后,确保该被attach的volume被attach成功</p><ul><li><p>遍历dsw的<code>nodesManaged</code>,判断volume是否已经被attach到该node,如果已经被attach到该node,则跳过attach操作</p></li><li><p>去asw.attachedVolumes中判断是否存在,若不存在就认为没有attach到node</p><ul><li>若存在,再判断node,node也匹配就返回<code>attachedConfirmed</code></li></ul></li></ul></li></ol></li></ul><pre><code> - 而`attachedConfirmed`是由asw中`AddVolumeNode`去设置的,`MarkVolumeAsAttached`设置为true。(true即代表该volume已经被attach到该node了) - 之后判断是否禁止多挂载,再由operator_excutor去执行attach3. 最后,`UpdateNodeStatuses`去更新node status</code></pre><h4 id="案例详细分析"><a href="#案例详细分析" class="headerlink" title="案例详细分析"></a>案例详细分析</h4><ul><li>前提<ul><li>volume detach失败</li><li>sts+cbs(pvc),pod recreate前后调度到相同的node</li></ul></li><li>涉及k8s组件<ul><li>ad controller</li><li>kubelet(volume namager)</li></ul></li><li>ad controller和kubelet(volume namager)通过字段<code>node.status.VolumesAttached</code>交互。<ul><li>ad controller为<code>node.status.VolumesAttached</code>新增或删除volume,新增表明已挂载,删除表明已删除</li><li>kubelet(volume manager)需要验证新建pod中的(pvc的)volume是否挂载成功,存在于<code>node.status.VolumesAttached</code>中,则表明验证volume已挂载成功;不存在,则表明还未挂载成功。</li></ul></li><li>以下是整个过程:</li></ul><ol><li>首先,删除pod时,由于某种原因cbs detach失败,失败后就会backoff重试。<ol><li>由于detach失败,该volume也不会从asw的<code>attachedVolumes</code>中删除</li></ol></li><li>由于detach时,<ol><li>先从<code>node.status.VolumesAttached</code>中删除volume,之后才去执行detach</li><li>detach时返回<code>backoffError</code>不会把该volumeadd back <code>node.status.VolumesAttached</code></li></ol></li><li>之后,我们在backoff周期中(假如就为第一个周期的500ms中间)再次创建sts,pod被调度到之前的node</li><li>而pod一旦被创建,就会被添加到dsw的<code>nodesManaged</code>(nodeName和volumeName都没变)</li><li>reconcile()中的第2步,会去判断volume是否被attach,此时发现该volume同时存在于asw和dws中,并且由于detach失败,也会在检测时发现还是attach,从而设置<code>attachedConfirmed</code>为true</li><li>ad controller就认为该volume被attach成功了</li><li>reconcile()中第1步的detach逻辑进行判断时,发现要detach的volume已经存在于<code>dsw.nodesManaged</code>了(由于nodeName和volumeName都没变),这样volume同时存在于asw和dsw中了,实际状态和期望状态一致,被认为就不需要进行detach了。</li><li>这样,该volume之后就再也不会被add back到<code>node.status.VolumesAttached</code>。所以就出现了现象中的node info中没有该volume,而ad controller又认为该volume被attach成功了</li><li>由于kubelet(volume manager)与controller manager是异步的,而它们之间交互是依据<code>node.status.VolumesAttached</code> ,所以volume manager在验证volume是否attach成功,发现<code>node.status.VolumesAttached</code>中没有这个voume,也就认为没有attach成功,所以就有了现象中的报错<code>Volume 
not attached according to node status</code></li><li>之后kubelet的<code>syncPod</code>在等待pod所有的volume attach和mount成功时,就超时了(现象中的另一个报错<code>timeout expired wating...</code>)。</li><li>所以pod一直处于<code>ContainerCreating</code></li></ol><h4 id="小结"><a href="#小结" class="headerlink" title="小结"></a>小结</h4><ul><li>所以,该案例出现的原因是:<ul><li>sts+cbs,pod recreate时间被调度到相同的node</li><li>由于detach失败,backoff期间创建sts/pod,致使ad controller中的dsw和asw数据一致(此时该volume由于没有被detach成功而确实处于attach状态),从而导致ad controller认为不再需要去detach该volume。</li><li>又由于detach时,是先从<code>node.status.VolumesAttached</code>中删除该volume,再去执行真正的<code>DetachVolume</code>。backoff期间直接返回<code>backoffError</code>,跳过<code>DetachVolume</code>,不会add back</li><li>之后,ad controller因volume已经处于attach状态,认为不再需要被attach,就不会再向<code>node.status.VolumesAttached</code>中添加该volume</li><li>最后,kubelet与ad controller交互就通过<code>node.status.VolumesAttached</code>,所以kubelet认为没有attach成功,新创建的pod就一直处于<code>ContianerCreating</code>了</li></ul></li><li>据此,我们可以发现关键点在于<code>node.status.VolumesAttached</code>和以下两个逻辑:<ol><li>detach时backoffError,不会add back</li><li>detach是先删除,失败再add back</li></ol></li><li>所以只要想办法能在任何情况下add back就不会有问题了。根据以上两个逻辑就对应有以下2种解决方案,<strong>推荐使用方案2</strong>:<ol><li>backoffError时,也add back<ul><li><a href="https://github.com/kubernetes/kubernetes/pull/72914" target="_blank" rel="noopener">pr #72914</a><ul><li>但这种方式有个缺点:patch node的请求数增加了10+次/(s * volume)</li></ul></li></ul></li><li>一进入detach逻辑就判断是否backoffError(处于backoff周期中),是就跳过之后所有detach逻辑,不删除就不需要add back了。<ul><li><a href="https://github.com/kubernetes/kubernetes/pull/88572" target="_blank" rel="noopener">pr #88572</a><ul><li>这个方案能避免方案1的问题,且会进一步减少请求apiserver的次数,且改动也不多</li></ul></li></ul></li></ol></li></ul><h3 id="总结"><a href="#总结" class="headerlink" title="总结"></a>总结</h3><ul><li>AD Controller负责存储的Attach、Detach。通过比较asw和dsw来判断是否需要attach/detach。最终attach和detach结果会体现在<code>node.status.VolumesAttached</code>。</li><li>以上现网案例出现的现象,是k8s ad controller的bug导致,目前社区并未修复。<ul><li>现象出现的原因主要是:<ul><li>先删除旧pod过程中detach失败,而在detach失败的backoff周期中创建新pod,此时由于ad controller逻辑bug,导致volume被从<code>node.status.VolumesAttached</code>中删除,从而导致创建新pod时,kubelet检查时认为该volume没有attach成功,致使pod就一直处于<code>ContianerCreating</code>。</li></ul></li><li>而现象的解决方案,推荐使用<a href="https://github.com/kubernetes/kubernetes/pull/88572" target="_blank" rel="noopener">pr #88572</a>。目前TKE已经有该方案的稳定运行版本,在灰度中。</li></ul></li></ul>]]></content>
<summary type="html">
<p>作者: <a href="https://github.com/ivan-cai" target="_blank" rel="noopener">蔡靖</a></p>
<h3 id="前言"><a href="#前言" class="headerlink" title="前
</summary>
</entry>
<entry>
<title>揭秘!containerd 镜像文件丢失问题,竟是镜像生成惹得祸</title>
<link href="https://TencentCloudContainerTeam.github.io/2020/05/08/containerd-image-file-loss/"/>
<id>https://TencentCloudContainerTeam.github.io/2020/05/08/containerd-image-file-loss/</id>
<published>2020-05-08T10:00:00.000Z</published>
<updated>2020-06-16T01:53:49.335Z</updated>
<content type="html"><![CDATA[<p>作者: <a href="https://github.com/payall4u" target="_blank" rel="noopener">李志宇</a></p><h4 id="containerd-镜像丢失文件问题说明"><a href="#containerd-镜像丢失文件问题说明" class="headerlink" title="containerd 镜像丢失文件问题说明"></a>containerd 镜像丢失文件问题说明</h4><p>近期有客户反映某些容器镜像出现了文件丢失的奇怪现象,经过模拟复现汇总出丢失情况如下:</p><p>某些特定的镜像会稳定丢失文件;</p><p>“丢失”在某些发行版稳定复现,但在 ubuntu 上不会出现;</p><p>v1.2 版本的 containerd 会文件丢失,而 v1.3 不会。</p><p>通过阅读源码和文档,最终解决了这个 containerd 镜像丢失问题,并写下了这篇文章,希望和大家分享下解决问题的经历和镜像生成的原理。为了方便某些心急的同学,本文接下来将首先揭晓该问题的答案~</p><h4 id="根因和解决方案"><a href="#根因和解决方案" class="headerlink" title="根因和解决方案"></a>根因和解决方案</h4><p>由于内核 overlay 模块 Bug,当 containerd 从镜像仓库下载镜像的“压缩包”生成镜像的“层”时,overlay 错误地把trusted.overlay.opaque=y这个 xattrs 从下层传递到了上层。如果某个目录设置了这个属性,overlay 则会认为这个目录是不透明的,以至于在进行联合挂载时该目录将会把下面的目录覆盖掉,进而导致镜像文件丢失的问题。</p><p>这个问题的解决方案可以有两种,一种简单粗暴,直接升级内核中 overlay 模块即可。</p><p>另外一种可以考虑把 containerd 从 v1.2 版本升级到 v1.3,原因在于 containerd v1.3 中会主动设置上述 opaque 属性,该版本 containerd 不会触发 overlayfs 的 bug。当然,这种方式是规避而非彻底解决 Bug。</p><h4 id="snapshotter-生成镜像原理分析"><a href="#snapshotter-生成镜像原理分析" class="headerlink" title="snapshotter 生成镜像原理分析"></a>snapshotter 生成镜像原理分析</h4><p>虽然根本原因看起来比较简单,但分析的过程还是比较曲折的。在分享下这个问题的排查过程和收获之前,为了方便大家理解,本小节将集中讲解问题排查过程涉及到的 containerd 和 overlayfs 的知识,比较了解或者不感兴趣的同学可以直接跳过。</p><p>与 docker daemon 一开始的设计不同,为了减少耦合性,containerd 通过插件的方式由多个模块组成。结合下图可以看出,其中与镜像相关的模块包含以下几种:</p><p><img src="https://main.qcloudimg.com/raw/da8285980cbdafa55dfdc5719e920e96.png" alt="enter image description here"></p><ul><li>metadata 是 containerd 通过 bbolt 实现的 kv 存储模块,用来保存镜像、容器或者层等元信息。比如命令行 ctr 列出所有 snapshot 或 kubelet 获取所有 pod 都是通过 metadata 模块查询的数据。</li><li><p>content 是负责保存 blob 的模块,其保存的关于镜像的内容一般分为三种:</p><ol><li>镜像的 manifest(一个普通的 json,其中指定了镜像的 config 和镜像的 layers 数组)</li><li>镜像的 config(同样是个 json,其中指定镜像的元信息,比如启动命令、环境变量等)</li><li>镜像的 layer(tar 包,解压、处理后会生成镜像的层)</li></ol></li><li><p>snapshots 是快照模块总称,可以设置使用不同的快照模块,常见的模块有 overlayfs、aufs 或 native。在 unpack 时 snapshots 会把生成镜像层并保存到文件系统;当运行容器时,可以调用 snapshots 模块给容器提供 rootfs 。</p></li></ul><p>容器镜像规范主要有 docker 和 oci v1、v2 三种,考虑到这三种规范在原理上大同小异,可以参考以下示例,将 manifest 当作是每个镜像只有一份的元信息,用于指向镜像的 config 和每层 layer。其中,config 即为镜像配置,把镜像作为容器运行时需要;layer 即为镜像的每一层。</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">type</span> manifest <span class="keyword">struct</span> {</span><br><span class="line"> c config</span><br><span class="line"> layers []layer</span><br><span class="line">}</span><br></pre></td></tr></table></figure><p>镜像下载流程与图 1 中数字标注出来的顺序一致,每个步骤作用总结如下:</p><p>首先在 metadata 模块中添加一个 image,这样我们在执行 list image 时可看到这个 image。</p><p>其次是需要下载镜像,因为镜像是有 manifest、config、layers 等多个部分组成,所以先下载镜像的 manifest 并保存到 content 模块,再解析 manifest 获取 config 的地址和 layers 的地址。接下来分别把 config 和每个 layer 下载并保存到 content 模块,这里需要强调镜像的 layer 本来应该是目录,当创建容器时联合挂载到 root 下,但是为了方便网络传输和存储,这里会用 tar + 压缩的方式保存。这里保存到 content 也是不解压的。</p><p>③、④、⑤的作用关联性比较强,此处放在一起解释。snapshot 模块去 content 模块读取 manifest,找到镜像的所有层,再去 content 模块把这些层自“下”而“上”读取出来,逐一解压并加工,最后放到 snapshot 模块的目录下,像图 1 中的 1001/fs、1002/fs 这些都是镜像的层。(当创建容器时,需要把这些层联合挂载生成容器的 rootfs,可以理解成1001/fs + 1002/fs + … => 1008/work)。</p><p>整个流程的函数调用关系如下图 2,喜欢阅读源码的同学可以照着这个去看下。<br><img src="https://main.qcloudimg.com/raw/7c7768df98a9f0e5a56646cb70b0b9ec.png" alt="enter image description here"></p><p>为了方便理解,接下来用 layer 表示 snapshot 中的层,把刚下载未经过加工的“层”称之为镜像层的 tar 包或者是 tar 包。</p><p>下载镜像保存入 content 的流程比较简单,直接跳过就好。而通过镜像的 tar 包生成 snapshot 中的 layer 
这个过程比较巧妙,甚至 bug 也是出现在这里,接下来进行重点描述。</p><p>首先通过 content 拿到了镜像的 manifest,这样我们得知镜像是有哪些层组成的。最下面一层镜像比较简单,直接解压到 snapshot 提供的目录就可以了,比如 10/fs。假设接下来要在 11/fs 生成第二层(此时 11/fs 还是空的),snapshot 会使用mount -t overlay overlay -o lowerdir=10/fs,upperdir=11/fs,workdir=11/work tmp把已经生成好的 layer 10 和还未生成的 layer 11 挂载到一个 tmp 目录上,其中写入层是 11/fs 也就是我们想要生成的 layer。去 content 中拿到 layer 11 对应的 tar 包,遍历这个 tar 包,根据 tar 包中不同的文件对挂载点 tmp 进行写入或者删除文件的操作(因为是联合挂载,所以对于挂载点的操作都会变成对写入层的操作)。把 tar 包转化成 layer 的具体逻辑和下面经过简化的源码一致,可以看到如果 tar 包中存在 whiteout 文件或者当前的层比如 11/fs 和之前的层有冲突比如 10/fs,会把底层目录删掉。在把 tar 包的文件写入到目录后,会根据 tar 包中记录的 PAXRecords 给文件添加 xattr,PAXRecords 可以看做是 tar 中每个文件都带有的 kv 数组,可以用来映射文件系统中文件属性。<br><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">// 这里的tmp就是overlay的挂载点</span></span><br><span class="line">applyNaive(tar, tmp) {</span><br><span class="line"> <span class="keyword">for</span> tar.hashNext() {</span><br><span class="line"> tar_file := tar.Next()<span class="comment">// tar包中的文件</span></span><br><span class="line"> real_file := path.Join(root, file.base)<span class="comment">// 现实世界的文件</span></span><br><span class="line"> <span class="comment">// 按照规则删除文件</span></span><br><span class="line"> <span class="keyword">if</span> isWhiteout(info) {</span><br><span class="line"> whiteRM(real_file)</span><br><span class="line"> }</span><br><span class="line"> <span class="keyword">if</span> !(file.IsDir() && IsDir(real_file)) {</span><br><span class="line"> rm(real_file)</span><br><span class="line"> } </span><br><span class="line"> <span class="comment">// 把tar包的文件写入到layer中</span></span><br><span class="line"> createFileOrDir(tar_file, real_file)</span><br><span class="line"> <span class="keyword">for</span> k, v := <span class="keyword">range</span> tar_file.PAXRecords {</span><br><span class="line"> setxattr(real_file, k, v)</span><br><span class="line"> }</span><br><span class="line"> }</span><br><span class="line">}</span><br></pre></td></tr></table></figure></p><p>需要删除的这些情况总结如下:</p><p>如果存在同名目录,两者进行 merge</p><p>如果存在同名但不都是目录,需要删除掉下层目录(上文件下目录、上目录下文件、上文件下文件)</p><p>如果存在 .wh. 
文件,需要移除底层应该被覆盖掉的目录,比如目录下存在 .wh..wh.opaque 文件,就需要删除 lowerdir 中的对应目录。</p><p> <img src="https://main.qcloudimg.com/raw/f6757b2c3f36744914ca0fe6040441e6.png" alt="enter image description here"></p><p>当然这里的删除也没那么简单,还记得当前的操作都是通过挂载点来删除底层的文件么?在 overlay 中,如果通过挂载点删除 lower 层的内容,不会把文件真的从 lower 的文件目录中干掉,而是会在 upper 层中添加 whiteout,添加 whiteout 的其中一种方式就是设置上层目录的 xattr trusted.overlay.opaque=y。</p><p>当 tar 包遍历结束以后,对 tmp 做个 umount,得到的 11/fs 就是我们想要的 layer,当我们想要生成 12/fs 这个 layer 时,只需要把 10/fs,11/fs 作为 lowerdir,把 12/fs 作为 upperdir 联合挂载就可以。也就是说,之后镜像的每一个 layer 生成都是需要把之前的 layer 挂载,下面图说明了整个流程。</p><p><img src="https://main.qcloudimg.com/raw/b6894ff43fda3eb5e1ff7ed6121e893c.png" alt="enter image description here"></p><p>可以考虑下为什么要这么大费周章?关键有两点。</p><p>一是镜像中的删除下层文件是要遵循 image-spec 中对于 whiteout 文件的定义(<a href="https://github.com/opencontainers/image-spec/blob/9f4348abedbe4415e6db1f08689fa7588045d982/layer.md" target="_blank" rel="noopener">image-spec</a>),这个文件只会在 tar 包中作为标识,并不会产生真正的影响。而起到真正作用的是在 applyNaive 碰到了 whiteout 文件,会调用联合文件系统对底层目录进行删除,当然这个删除对于 overlay 就是标记 opaque。</p><p>二是因为存在文件和目录相互覆盖的现象,每一个 tar 包中的文件都需要和之前所有 tar包 中的内容进行比对,如果不借用联合文件系统的“超能力”,我们就只能拿着 tar 中的每一个文件对之前的层遍历。</p><h4 id="问题排查过程"><a href="#问题排查过程" class="headerlink" title="问题排查过程"></a>问题排查过程</h4><p>了解了镜像相关的知识,我们来看看这个问题的排查过程。首先我们观察用户的容器,经过简化和打码目录结构如下,其中目录 modules 就是事故多发地。</p><figure class="highlight sh"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line">/data</span><br><span class="line">└── prom</span><br><span class="line"> ├── bin</span><br><span class="line"> └── modules</span><br><span class="line"> ├── file</span><br><span class="line"> └── lib/</span><br></pre></td></tr></table></figure><p>再观察下用户的镜像的各个层。我们把镜像的层按照从下往上用递增的 ID 来标注,对这个目录有修改的有 5099、5101、5102、5103、5104 这几层。把容器运行起来后,看到的 modules 目录和 <strong>5104</strong> 提供的一样。并没有把 5103 等“下面”的镜像合并起来,相当于 <strong>5104</strong> 把下面的目录都覆盖掉了(当然,<strong>5104</strong> 和 <strong>5103</strong> 文件是有区别的)。</p><h5 id="5104-下层目录为何被覆盖?"><a href="#5104-下层目录为何被覆盖?" class="headerlink" title="5104 下层目录为何被覆盖?"></a>5104 下层目录为何被覆盖?</h5><p>看到这里,首先想到是不是创建容器的 rootfs 时参数出现了问题,导致少 mount 了一些层?于是模拟手动挂载mount -t overlay overlay -o lowerdir=5104:5103 point把最上两层挂载,结果 <strong>5104</strong> 依然把 <strong>5103</strong> 覆盖了。这里推断可能是存在 overlay 的 .wh. 文件,于是尝试在这两层中搜 .wh. 文件,无果。于是去查 overlayfs 的文档:</p><blockquote><p>A directory is made opaque by setting the xattr “trusted.overlay.opaque”<br>to “y”. 
Where the upper filesystem contains an opaque directory, any<br>directory in the lower filesystem with the same name is ignored.</p></blockquote><p>设置了属性 trusted.overlay.opaque=y 的目录会变成“不透明”的,当上层文件系统被设置为“不透明”时,下层中同名的目录会被忽略。overlay 如果想要在上层把下层覆盖掉,就需要设置这个属性。</p><p>通过命令getfattr -n “trusted.overlay.opaque” dir查看发现,<strong>5104</strong> 下面的 /data/asr_offline/modules 果然带有这个属性,这一现象也进而导致了下层目录被“覆盖”。</p><figure class="highlight sh"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">[root@]$ getfattr -n <span class="string">"trusted.overlay.opaque"</span> 5104/fs/data/asr_offline/modules</span><br><span class="line"><span class="comment"># file: 5102/fs/data/asr_offline/modules</span></span><br><span class="line">trusted.overlay.opaque=<span class="string">"y"</span></span><br></pre></td></tr></table></figure><p>一波多折,层层追究<br>那么问题来了,为什么只有特定的发行版会出现这个现象?我们尝试在 ubuntu 拉下镜像,发现“同源”目录居然没有设置 opaque!由于镜像的层通过把源文件解压和解包生成的,我们决定在确保不同操作系统中的“镜像源文件”的 md5 相同之后,在各个操作系统上把镜像源文件通过tar -zxf进行解包并重新手动挂载,发现 <strong>5104</strong> 均不会把 <strong>5103</strong> 覆盖。</p><p>根据以上现象推断,可能是某些发行版下的 containerd 从 content 读取 tar 包并解压制作 snapshot 的 layer 时出现问题,错误地把 snapshot 的目录设置上了这个属性。</p><p>为验证该推断,决定进行源代码梳理,由此发现了其中的疑点(相关代码如下)——生成 layers 时遍历 tar 包会读取每个文件的 PAXRecords 并且把这个设置在文件的 xattr 上( tar 包给每个文件都准备了 PAXRecords,和 Pod 的 labels 等价)。</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">func</span> <span class="title">applyNaive</span><span class="params">()</span></span> {</span><br><span class="line"> <span class="comment">// ...</span></span><br><span class="line"> <span class="keyword">for</span> k, v := <span class="keyword">range</span> tar_file.PAXRecords {</span><br><span class="line">setxattr(real_file, k, v)</span><br><span class="line"> }</span><br><span class="line">}</span><br><span class="line"></span><br><span class="line"><span class="function"><span class="keyword">func</span> <span class="title">setxattr</span><span class="params">(path, key, value <span class="keyword">string</span>)</span> <span class="title">error</span></span> {</span><br><span class="line"><span class="keyword">return</span> unix.Lsetxattr(path, key, []<span class="keyword">byte</span>(value), <span class="number">0</span>)</span><br><span class="line">}</span><br></pre></td></tr></table></figure><p>因为之前实验过 v1.3 的 containerd 不会出现这个问题,所以对照了下两者的代码,发现两者从 tar 包中抽取 PAXRecords 设置 xattr 的逻辑两者是不一样的。v1.3 的代码如下:</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">func</span> <span class="title">setxattr</span><span class="params">(path, key, value <span class="keyword">string</span>)</span> <span class="title">error</span></span> {</span><br><span class="line"><span class="comment">// Do not set trusted 
attributes</span></span><br><span class="line"><span class="keyword">if</span> strings.HasPrefix(key, <span class="string">"trusted."</span>) {</span><br><span class="line"><span class="keyword">return</span> errors.Wrap(unix.ENOTSUP, <span class="string">"admin attributes from archive not supported"</span>)</span><br><span class="line">}</span><br><span class="line"><span class="keyword">return</span> unix.Lsetxattr(path, key, []<span class="keyword">byte</span>(value), <span class="number">0</span>)</span><br><span class="line">}</span><br></pre></td></tr></table></figure><p>也就是说 v1.3.0 中不会设置以trusted.开头的 xattr!如果 tar 包中某目录带有trusted.overlay.opaque=y这个 PAX,低版本的 containerd 可能就会把这些属性设置到 snapshot 的目录上,而高版本的却不会。那么,当用户在打包时,如果把 opaque 也打到 tar 包中,解压得到的 layer 对应目录也就会带有这个属性。<strong>5104</strong> 这个目录可能就是这个原因才变成 opaque 的。</p><p>为了验证这个观点,我写了一段简单的程序来扫描与 layer 对应的 content 来寻找这个属性,结果发现 <strong>5102</strong>、<strong>5103</strong>、<strong>5104</strong> 几个层都没有这个属性。这时我也开始怀疑这个观点了,毕竟如果只是 tar 包中有特别的标识,应该不会在不同的操作系统表现不同。</p><p>抱着最后一丝希望扫描了 <strong>5099</strong> 和 <strong>5101</strong>,果然也并没有这个属性。但在扫描的过程中,注意到 <strong>5101</strong> 的 tar 包里存在 /data/asr_offline/modules/.wh..wh.opq 这个文件。记得当时看代码 applyNaive 时如果遇到了 .wh..wh.opq 对应的操作应该是在挂载点删除 /data/asr_offline/modules,而在 overlay 中删除 lower 目录会给 upper 同名目录加上trusted.overlay.opaque=y。也就是说,在生成 layer <strong>5101</strong> 时(需要提前挂载好 <strong>5100</strong> 和 <strong>5099</strong>),遍历 tar 包遇到了这个 wh 文件,应该先在挂载点删除 modules,也就是会在 <strong>5101</strong> 对应目录加上 opaque=y。</p><p>再次以验证源代码成果的心态,去 snapshot 的 5101/fs 下查看目录 modules 的 opaque,果然和想象的一样。这些文件应该都是在 lower层,所以对应的 overlayfs 的操作应该是在 upper 也就是 <strong>5101</strong> 层的 /data/asr_offline/modules 目录设置trusted.overlay.opaque=y。去查看 <strong>5101</strong> 的这个目录,果然带有这个属性,好奇心驱使着我继续查看了 <strong>5102</strong>、<strong>5103</strong>、<strong>5104</strong> 这几层的目录,发现居然都有这个属性。</p><p>也就是这些 layer 每个都会把下面的覆盖掉?这好像不符合常理。于是,去表现正常的 ubuntu 中查看,发现只有 <strong>5101</strong> 有这个属性。经过反复确认 <strong>5102</strong>、<strong>5103</strong>、<strong>5104</strong> 的 tar 包中的确没有目录 modules 的 whiteout 文件,也就是说镜像原本的意图就是让 <strong>5101</strong> 把下面的层覆盖掉,再把 <strong>5101</strong>、<strong>5102</strong>、<strong>5103</strong>、<strong>5104</strong> 这几层的 modules 目录 merge 起来。整个生成镜像的流程里,只有“借用”overlay 生成 snapshot 的 layer 会涉及到操作系统。</p><h5 id="云开雾散,大胆猜探"><a href="#云开雾散,大胆猜探" class="headerlink" title="云开雾散,大胆猜探"></a>云开雾散,大胆猜探</h5><p>我们不妨大胆猜测一下,会不会像下图这样,在生成 layer <strong>5102</strong> 时,因为内核或 overlay 的 bug 把 modules 也添加了不透明的属性?</p><p><img src="https://main.qcloudimg.com/raw/98362e7c7e20199f0a15ed99c0b86a1b.png" alt="enter image description here"></p><p>为了对这个特性做单独的测试,写了个简单的脚本。运行脚本之后,果然发现在这个发行版中,如果 overlay 的低层目录有这个属性并且在 upper 层中创建了同样的目录,会把这个 opaque“传播”到 upper 层的目录中。如果像 containerd 那样递推生成镜像,肯定从有 whiteout 层开始上面的每一层都会具有这个属性,也就导致了最终容器在某些特定的目录只能看到最上面一层。</p><figure class="highlight sh"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br></pre></td><td class="code"><pre><span class="line">`<span class="comment">#!/bin/bash</span></span><br><span class="line"></span><br><span 
class="line">mkdir 1 2 work p</span><br><span class="line">mkdir 1/func</span><br><span class="line">touch 1/func/min</span><br><span class="line"></span><br><span class="line">mount -t overlay overlay p -o lowerdir=1,upperdir=2,workdir=work</span><br><span class="line">rm -rf p/func</span><br><span class="line">mkdir -p p/func</span><br><span class="line">touch p/func/max</span><br><span class="line">umount p</span><br><span class="line">getfattr -n <span class="string">"trusted.overlay.opaque"</span> 2/func</span><br><span class="line"></span><br><span class="line">mkdir 3</span><br><span class="line">mount -t overlay overlay p -o lowerdir=2:1,upperdir=3,workdir=work</span><br><span class="line">touch p/func/sqrt</span><br><span class="line">umount p</span><br><span class="line">getfattr -n <span class="string">"trusted.overlay.opaque"</span> 3/func`</span><br></pre></td></tr></table></figure><h3 id="最终总结"><a href="#最终总结" class="headerlink" title="最终总结"></a>最终总结</h3><p>在几个内核大佬的帮助下,确认了是内核 overlayfs 模块的 bug。在 lower 层调用 copy_up 时并没有检测 xattr,从而导致 opaque 这个 xattr 传播到了 upper 层。做联合挂载时,如果上层的文件得到了这个属性,自然会把下层文件覆盖掉,也就出现了镜像中丢失文件的现象。反思整个排查过程,其实很难在一开始就把问题定位到内核的某个模块上,好在可以另辟蹊径通过测试和阅读源码逐步逼近“真相”,成功寻得解决方案。</p>]]></content>
<summary type="html">
<p>作者: <a href="https://github.com/payall4u" target="_blank" rel="noopener">李志宇</a></p>
<h4 id="containerd-镜像丢失文件问题说明"><a href="#containerd-
</summary>
</entry>
<entry>
<title>大规模使用ConfigMap卷的负载分析及缓解方案</title>
<link href="https://TencentCloudContainerTeam.github.io/2020/04/24/k8s-configmap-volume/"/>
<id>https://TencentCloudContainerTeam.github.io/2020/04/24/k8s-configmap-volume/</id>
<published>2020-04-24T07:00:00.000Z</published>
<updated>2020-06-16T01:53:49.339Z</updated>
<content type="html"><![CDATA[<p>作者: <a href="https://github.com/borgerli" target="_blank" rel="noopener">李波</a></p><h2 id="简介"><a href="#简介" class="headerlink" title="简介"></a>简介</h2><p>有客户反馈在大集群(几千节点)中大量使用ConfigMap卷时,会给集群带来很大负载和压力,这里我们分析下原因以及缓解方案。</p><h2 id="Kubelet如何管理ConfigMap"><a href="#Kubelet如何管理ConfigMap" class="headerlink" title="Kubelet如何管理ConfigMap"></a>Kubelet如何管理ConfigMap</h2><p>我们先来看下Kubelet是如何管理ConfigMap的。</p><p>Kubelet在启动的时候,会创建ConfigMapManager(以及SecretManager),用来管理本机运行的Pod用到的ConfigMap(及Secret,下面只讨论ConfigMap)对象,功能包括获取及更新这些对象的内容,以及为其他组件比如VolumeManager提供获取这些对象内容的服务。</p><p>那Kubelet是如何获取和更新ConfigMap呢? k8s提供了三种检测资源更新的策略(<code>ResourceChangeDetectionStrategy</code>)</p><h3 id="WatchChangeDetectionStrategy-Watch"><a href="#WatchChangeDetectionStrategy-Watch" class="headerlink" title="WatchChangeDetectionStrategy(Watch)"></a>WatchChangeDetectionStrategy(Watch)</h3><p>这是<code>1.12+</code>的默认策略。</p><p>看名字,这个策略使用K8s经典的ListWatch模式。在Pod创建时,对每个引用到的ConfigMap,都会先从ApiServer缓存(指定ResourceVersion=”0”)获取,然后对后续变化进行Watch。</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">// pkg/kubelet/util/manager/watch_based_manager.go</span></span><br><span class="line"><span class="function"><span class="keyword">func</span> <span class="params">(c *objectCache)</span> <span class="title">newReflector</span><span class="params">(namespace, name <span class="keyword">string</span>)</span> *<span class="title">objectCacheItem</span></span> {</span><br><span class="line">fieldSelector := fields.Set{<span class="string">"metadata.name"</span>: name}.AsSelector().String()</span><br><span class="line">listFunc := <span class="function"><span class="keyword">func</span><span class="params">(options metav1.ListOptions)</span> <span class="params">(runtime.Object, error)</span></span> {</span><br><span class="line">options.FieldSelector = fieldSelector</span><br><span class="line"><span class="keyword">return</span> c.listObject(namespace, options)</span><br><span class="line">}</span><br><span class="line">watchFunc := <span class="function"><span class="keyword">func</span><span class="params">(options metav1.ListOptions)</span> <span class="params">(watch.Interface, error)</span></span> {</span><br><span class="line">options.FieldSelector = fieldSelector</span><br><span class="line"><span class="keyword">return</span> c.watchObject(namespace, options)</span><br><span class="line">}</span><br><span class="line">store := c.newStore()</span><br><span class="line">reflector := cache.NewNamedReflector(</span><br><span class="line">fmt.Sprintf(<span class="string">"object-%q/%q"</span>, namespace, name),</span><br><span class="line">&cache.ListWatch{ListFunc: listFunc, WatchFunc: watchFunc},</span><br><span class="line">c.newObject(),</span><br><span 
class="line">store,</span><br><span class="line"><span class="number">0</span>,</span><br><span class="line">)</span><br><span class="line">...</span><br><span class="line">...</span><br></pre></td></tr></table></figure><p>重点强调下,是<strong>对每一个ConfigMap都会创建一个Watch</strong>。如果大量使用CongiMap,并且集群规模很大,假设平均每个节点有100个ConfigMap,集群有2000个节点,就会创建20w个watch。经过测试(测试结果如下图,20w个watch),单纯大量的watch会对ApiServer造成一定的内存压力,对Etcd则基本没有压力。</p><h4 id="ListWatch压力测试"><a href="#ListWatch压力测试" class="headerlink" title="ListWatch压力测试"></a>ListWatch压力测试</h4><p><img src="https://github.com/TencentCloudContainerTeam/TencentCloudContainerTeam.github.io/raw/develop/source/_posts/res/images/configmap-5node-20w-watch.png" alt="'ListWatch压力测试结果'"></p><p>测试采用单节点的ApiServer(16核32G)和单节点的Etcd,并停止所有(共5个)节点kubelet服务以及删除所有非kube-system的负载,并把5个节点作为客户端,每个有间隔的发起4w个ListWatch。<br>从上图的测试结果,可以看到在20w ListWatch创建期间,ApiServer的内存增长到20G左右,CPU使用率在25%左右(创建完成后,使用率降回原来水平),连接数增持长并稳定到965个左右,而Etcd的内存,CPU核连接数无明显变化。粗略计算,每个watch占用<strong>100KB</strong>左右的内存。</p><h3 id="TTLCacheChangeDetectionStrategy-Cache"><a href="#TTLCacheChangeDetectionStrategy-Cache" class="headerlink" title="TTLCacheChangeDetectionStrategy(Cache)"></a>TTLCacheChangeDetectionStrategy(Cache)</h3><p>这是<code>1.10</code>及<code>1.11</code>版本的默认策略,且不可通过参数或者配置文件修改。</p><p>看名字,这是带TTL的缓存方式。第一次获取时,从ApiServer获取最新内容,超过TTL后,如果读取ConfigMap,会从ApiServer缓存获取(Get请求指定ResouceVersion=0)进行刷新,以减小对ApiServer和Etcd的压力。<br><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">// pkg/kubelet/util/manager/cache_based_manager.go</span></span><br><span class="line"><span class="function"><span class="keyword">func</span> <span class="params">(s *objectStore)</span> <span class="title">Get</span><span class="params">(namespace, name <span class="keyword">string</span>)</span> <span class="params">(runtime.Object, error)</span></span> {</span><br><span class="line">...</span><br><span class="line"><span class="keyword">if</span> data.err != <span class="literal">nil</span> || !fresh {</span><br><span class="line">klog.V(<span class="number">1</span>).Infof(<span class="string">"data is null or object is not fresh: err=%v, fresh=%v"</span>, fresh)</span><br><span class="line">opts := metav1.GetOptions{}</span><br><span class="line"><span class="keyword">if</span> data.object != <span class="literal">nil</span> && data.err == <span class="literal">nil</span> {</span><br><span class="line">util.FromApiserverCache(&opts) <span class="comment">//opts.ResourceVersion = "0"</span></span><br><span class="line">}</span><br><span class="line"></span><br><span class="line">object, err := s.getObject(namespace, name, opts)</span><br><span class="line">...</span><br></pre></td></tr></table></figure></p><p>TTL时间首先会从节点的<code>Annotation["node.alpha.kubernetes.io/ttl"]</code>获取,如果节点没有设置,那么会使用默认值1分钟。</p><p><code>node.alpha.kubernetes.io/ttl</code>由kube-controller-manager中的TTLController根据集群节点数自动设置,具体规则如下(例如100个节点及以下规模的集群,ttl是0s;随着集群规模变大,节点数大于100小于500时,节点ttl变为15s;当集群规模超过100又减小,少于90个节点时,节点的ttl又变回0s):</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span 
class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">// pkg/controller/ttl/ttl_controller.go</span></span><br><span class="line">ttlBoundaries = []ttlBoundary{</span><br><span class="line">{sizeMin: <span class="number">0</span>, sizeMax: <span class="number">100</span>, ttlSeconds: <span class="number">0</span>},</span><br><span class="line">{sizeMin: <span class="number">90</span>, sizeMax: <span class="number">500</span>, ttlSeconds: <span class="number">15</span>},</span><br><span class="line">{sizeMin: <span class="number">450</span>, sizeMax: <span class="number">1000</span>, ttlSeconds: <span class="number">30</span>},</span><br><span class="line">{sizeMin: <span class="number">900</span>, sizeMax: <span class="number">2000</span>, ttlSeconds: <span class="number">60</span>},</span><br><span class="line">{sizeMin: <span class="number">1800</span>, sizeMax: <span class="number">10000</span>, ttlSeconds: <span class="number">300</span>},</span><br><span class="line">{sizeMin: <span class="number">9000</span>, sizeMax: math.MaxInt32, ttlSeconds: <span class="number">600</span>},</span><br><span class="line">}</span><br></pre></td></tr></table></figure><h3 id="GetChangeDetectionStrategy-Get"><a href="#GetChangeDetectionStrategy-Get" class="headerlink" title="GetChangeDetectionStrategy(Get)"></a>GetChangeDetectionStrategy(Get)</h3><p>这是最简单直接粗暴的方式,每次获取ConfigMap时,都访问ApiServer从Etcd读取最新版本。</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">// pkg/kubelet/configmap/configmap_manager.go</span></span><br><span class="line"><span class="function"><span class="keyword">func</span> <span class="params">(s *simpleConfigMapManager)</span> <span class="title">GetConfigMap</span><span class="params">(namespace, name <span class="keyword">string</span>)</span> <span class="params">(*v1.ConfigMap, error)</span></span> {</span><br><span class="line"><span class="keyword">return</span> s.kubeClient.CoreV1().ConfigMaps(namespace).Get(name, metav1.GetOptions{})</span><br><span class="line">}</span><br></pre></td></tr></table></figure><h2 id="ConfigMap卷的自动更新机制"><a href="#ConfigMap卷的自动更新机制" class="headerlink" title="ConfigMap卷的自动更新机制"></a>ConfigMap卷的自动更新机制</h2><p>Kubelet在Pod创建成功后,会把Pod放到podWorker的工作队列,并指定延迟1分钟(<code>--sync-frequency</code>,默认1m)才能出队列被获取。<br>Kubelet在sync逻辑中,会在延迟过后取到Pod进行同步,包括同步Volume状态。VolumeManager在同步Volume时会看它的类型是否需要重新挂载(<code>RequiresRemount() bool</code>),<code>ConfigMap</code>、<code>Secret</code>、<code>downwardAPI</code>及<code>Projected</code>四种VolumePlugin,这个方法都返回<code>true</code>,需要重新挂载。</p><p>因此每隔1分钟多,Kubelet都会访问ConfigMapManager,去获取本机Pod使用的ConfigMap的最新内容。这个操作对于Watch类型的策略,没有影响,不会对ApiServer及Etcd带来额外的压力;对于ttl很小的Cache及Get类型的策略,会给ApiServer及Etcd带来压力。</p><h2 id="大集群方案"><a href="#大集群方案" class="headerlink" 
title="大集群方案"></a>大集群方案</h2><p>从上面的分析看,一般小规模的集群或者ConfigMap(及Secret)用量不大的集群,可以使用默认的Watch策略。如果集群规模比较大,并且大量使用ConfigMap,默认的Watch策略会对ApiServer带来内存压力。在实际生产集群,ApiServer除了处理这些watch,还会执行很多其他任务,相互之间共享抢占系统资源,会加重和放大对ApiServer的负载,影响服务。</p><p>同时,实际上我们很多应用并不需要通过修改ConfigMap动态更新配置的功能,一方面在大集群时会带来不必要的压力,另一方面,如1.18的这个<a href="https://github.com/kubernetes/enhancements/blob/master/keps/sig-storage/20191117-immutable-secrets-configmaps.md" target="_blank" rel="noopener">KEP</a>所考虑的,实时更新ConfigMap或者Secret,如果内容出现错误,会导致应用异常,在配置发生变化时,更推荐采用滚动更新的方式来更新应用。</p><p>在大集群时,我们可以怎么使用和管理ConfigMap,来减轻对集群的负载压力呢?</p><h3 id="1-18版本"><a href="#1-18版本" class="headerlink" title="1.18版本"></a>1.18版本</h3><p>社区也注意到了这个问题(刚才提到的<a href="https://github.com/kubernetes/enhancements/blob/master/keps/sig-storage/20191117-immutable-secrets-configmaps.md" target="_blank" rel="noopener">KEP</a>),增加了一个新的特性<code>ImmutableEphemeralVolumes</code>,允许用户设置ConfigMap(及Secrets)为不可变(<code>immutable: true</code>),这样Kubelet就不会去Watch这些ConfigMap的变化了。</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line">...</span><br><span class="line"><span class="keyword">if</span> utilfeature.DefaultFeatureGate.Enabled(features.ImmutableEphemeralVolumes) && c.isImmutable(object) {</span><br><span class="line"><span class="keyword">if</span> item.stop() {</span><br><span class="line">klog.V(<span class="number">4</span>).Infof(<span class="string">"Stopped watching for changes of %q/%q - object is immutable"</span>, namespace, name)</span><br><span class="line">}</span><br><span class="line">}</span><br><span class="line">...</span><br></pre></td></tr></table></figure><h4 id="开启ImmutableEphemeralVolumes"><a href="#开启ImmutableEphemeralVolumes" class="headerlink" title="开启ImmutableEphemeralVolumes"></a>开启ImmutableEphemeralVolumes</h4><p>ImmutableEphemeralVolumes是alpha特性,需要设置kubelet参数开启它:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">--feature-gates=ImmutableEphemeralVolumes=<span class="literal">true</span></span><br></pre></td></tr></table></figure><h4 id="ConfigMap设置为不可变"><a href="#ConfigMap设置为不可变" class="headerlink" title="ConfigMap设置为不可变"></a>ConfigMap设置为不可变</h4><p>ConfigMap设置<code>immutable</code>为true</p><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line"><span class="attr">apiVersion:</span> <span class="string">v1</span></span><br><span class="line"><span class="attr">kind:</span> <span class="string">ConfigMap</span></span><br><span class="line"><span class="attr">metadata:</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">immutable-cm</span></span><br><span class="line"><span class="attr">data:</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">tencent</span></span><br><span class="line"><span class="attr">immutable:</span> <span class="literal">true</span></span><br></pre></td></tr></table></figure><h3 id="之前版本"><a href="#之前版本" 
class="headerlink" title="之前版本"></a>之前版本</h3><p>在1.18之前的版本,我们可以使用<code>Cache</code>策略来代替<code>Watch</code>。</p><ol><li><p>关闭<code>TTLController</code>: kube-controller-manager启动参数增加 <code>--controllers=-ttl,*</code>,重启。</p></li><li><p>配置所有节点Kubelet使用<code>Cache</code>策略: </p><ul><li>创建<code>/etc/kubernetes/kubelet.conf</code>,内容如下:</li></ul></li></ol><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line"><span class="attr">apiVersion:</span> <span class="string">kubelet.config.k8s.io/v1beta1</span></span><br><span class="line"><span class="attr">kind:</span> <span class="string">KubeletConfiguration</span></span><br><span class="line"><span class="attr">configMapAndSecretChangeDetectionStrategy:</span> <span class="string">Cache</span></span><br></pre></td></tr></table></figure><ul><li>kubelet增加参数: <code>--config=/etc/kubernetes/kubelet.conf</code>,重启<ol start="3"><li>设置所有节点的ttl为期望值,比如1000天: <code>kubectl annotate node <node> node.alpha.kubernetes.io/ttl=86400000 --overwrite</code><br>。设置1000天并不是1000天内真的不更新。在Kubelet新建Pod时,它所引用的ConfigMap的cache都会被重置和更新。</li></ol></li></ul>]]></content>
<summary type="html">
<p>作者: <a href="https://github.com/borgerli" target="_blank" rel="noopener">李波</a></p>
<h2 id="简介"><a href="#简介" class="headerlink" title="简
</summary>
</entry>
<entry>
<title>打造云原生大型分布式监控系统(三): Thanos 部署与实践</title>
<link href="https://TencentCloudContainerTeam.github.io/2020/04/20/build-cloud-native-large-scale-distributed-monitoring-system-3/"/>
<id>https://TencentCloudContainerTeam.github.io/2020/04/20/build-cloud-native-large-scale-distributed-monitoring-system-3/</id>
<published>2020-04-20T05:00:00.000Z</published>
<updated>2020-06-16T01:53:49.335Z</updated>
<content type="html"><![CDATA[<p>作者: <a href="https://imroc.io/" target="_blank" rel="noopener">陈鹏</a></p><h2 id="视频"><a href="#视频" class="headerlink" title="视频"></a>视频</h2><p>附上本系列完整视频</p><ul><li>打造云原生大型分布式监控系统(一): 大规模场景下 Prometheus 的优化手段 <a href="https://www.bilibili.com/video/BV17C4y1x7HE" target="_blank" rel="noopener">https://www.bilibili.com/video/BV17C4y1x7HE</a></li><li>打造云原生大型分布式监控系统(二): Thanos 架构详解 <a href="https://www.bilibili.com/video/BV1Vk4y1R7S9" target="_blank" rel="noopener">https://www.bilibili.com/video/BV1Vk4y1R7S9</a></li><li>打造云原生大型分布式监控系统(三): Thanos 部署与实践 <a href="https://www.bilibili.com/video/BV16g4y187HD" target="_blank" rel="noopener">https://www.bilibili.com/video/BV16g4y187HD</a></li></ul><h2 id="概述"><a href="#概述" class="headerlink" title="概述"></a>概述</h2><p>上一篇 <a href="https://tencentcloudcontainerteam.github.io/2020/04/06/build-cloud-native-large-scale-distributed-monitoring-system-2/">Thanos 架构详解</a> 我们深入理解了 thanos 的架构设计与实现原理,现在我们来聊聊实战,分享一下如何部署和使用 Thanos。</p><h2 id="部署方式"><a href="#部署方式" class="headerlink" title="部署方式"></a>部署方式</h2><p>本文聚焦 Thanos 的云原生部署方式,充分利用 Kubernetes 的资源调度与动态扩容能力。从官方 <a href="https://thanos.io/getting-started.md/#community-thanos-kubernetes-applications" target="_blank" rel="noopener">这里</a> 可以看到,当前 thanos 在 Kubernetes 上部署有以下三种:</p><ul><li><a href="https://github.com/coreos/prometheus-operator" target="_blank" rel="noopener">prometheus-operator</a>: 集群中安装了 prometheus-operator 后,就可以通过创建 CRD 对象来部署 Thanos 了。</li><li><a href="https://hub.helm.sh/charts?q=thanos" target="_blank" rel="noopener">社区贡献的一些 helm charts</a>: 很多个版本,目标都是能够使用 helm 来一键部署 thanos。</li><li><a href="https://github.com/thanos-io/kube-thanos" target="_blank" rel="noopener">kube-thanos</a>: Thanos 官方的开源项目,包含部署 thanos 到 kubernetes 的 jsonnet 模板与 yaml 示例。</li></ul><p>本文将使用基于 kube-thanos 提供的 yaml 示例 (<code>examples/all/manifests</code>) 来部署,原因是 prometheus-operator 与社区的 helm chart 方式部署多了一层封装,屏蔽了许多细节,并且它们的实现都还不太成熟;直接使用 kubernetes 的 yaml 资源文件部署更直观,也更容易做自定义,而且我相信使用 thanos 的用户通常都是高玩了,也有必要对 thanos 理解透彻,日后才好根据实际场景做架构和配置的调整,直接使用 yaml 部署能够让我们看清细节。</p><h2 id="方案选型"><a href="#方案选型" class="headerlink" title="方案选型"></a>方案选型</h2><h3 id="Sidecar-or-Receiver"><a href="#Sidecar-or-Receiver" class="headerlink" title="Sidecar or Receiver"></a>Sidecar or Receiver</h3><p>看了上一篇文章的同学应该知道,目前官方的架构图用的 Sidecar 方案,Receiver 是一个暂时还没有完全发布的组件。通常来说,Sidecar 方案相对成熟一些,最新的数据存储和计算 (比如聚合函数) 比较 “分布式”,更加高效也更容易扩展。</p><p><img src="https://imroc.io/assets/blog/thanos-sidecar.png" alt=""></p><p>Receiver 方案是让 Prometheus 通过 remote wirte API 将数据 push 到 Receiver 集中存储 (同样会清理过期数据):</p><p><img src="https://imroc.io/assets/blog/thanos-receiver-without-objectstore.png" alt=""></p><p>那么该选哪种方案呢?我的建议是:</p><ol><li>如果你的 Query 跟 Sidecar 离的比较远,比如 Sidecar 分布在多个数据中心,Query 向所有 Sidecar 查数据,速度会很慢,这种情况可以考虑用 Receiver,将数据集中吐到 Receiver,然后 Receiver 与 Query 部署在一起,Query 直接向 Receiver 查最新数据,提升查询性能。</li><li>如果你的使用场景只允许 Prometheus 将数据 push 到远程,可以考虑使用 Receiver。比如 IoT 设备没有持久化存储,只能将数据 push 到远程。</li></ol><p>此外的场景应该都尽量使用 Sidecar 方案。</p><h3 id="评估是否需要-Ruler"><a href="#评估是否需要-Ruler" class="headerlink" title="评估是否需要 Ruler"></a>评估是否需要 Ruler</h3><p>Ruler 是一个可选组件,原则上推荐尽量使用 Prometheus 自带的 rule 功能 (生成新指标+告警),这个功能需要一些 Prometheus 最新数据,直接使用 Prometheus 本机 rule 功能和数据,性能开销相比 Thanos Ruler 这种分布式方案小得多,并且几乎不会出错,Thanos Ruler 由于是分布式,所以更容易出错一些。</p><p>如果某些有关联的数据分散在多个不同 Prometheus 上,比如对某个大规模服务采集做了分片,每个 Prometheus 仅采集一部分 endpoint 的数据,对于 <code>record</code> 类型的 rule (生成的新指标),还是可以使用 Prometheus 自带的 rule 功能,在查询时再聚合一下就可以(如果可以接受的话);对于 <code>alert</code> 类型的 rule,就需要用 Thanos Ruler 
来做了,因为有关联的数据分散在多个 Prometheus 上,用单机数据去做 alert 计算是不准确的,就可能会造成误告警或不告警。</p><h3 id="评估是否需要-Store-Gateway-与-Compact"><a href="#评估是否需要-Store-Gateway-与-Compact" class="headerlink" title="评估是否需要 Store Gateway 与 Compact"></a>评估是否需要 Store Gateway 与 Compact</h3><p>Store 也是一个可选组件,也是 Thanos 的一大亮点的关键:数据长期保存。</p><p>评估是否需要 Store 组件实际就是评估一下自己是否有数据长期存储的需求,比如查看一两个月前的监控数据。如果有,那么 Thanos 可以将数据上传到对象存储保存。Thanos 支持以下对象存储: </p><ul><li>Google Cloud Storage</li><li>AWS/S3</li><li>Azure Storage Account</li><li>OpenStack Swift</li><li>Tencent COS</li><li>AliYun OSS</li></ul><p>在国内,最方便还是使用腾讯云 COS 或者阿里云 OSS 这样的公有云对象存储服务。如果你的服务没有跑在公有云上,也可以通过跟云服务厂商拉专线的方式来走内网使用对象存储,这样速度通常也是可以满足需求的;如果实在用不了公有云的对象存储服务,也可以自己安装 <a href="https://github.com/minio/minio" target="_blank" rel="noopener">minio</a> 来搭建兼容 AWS 的 S3 对象存储服务。</p><p>搞定了对象存储,还需要给 Thanos 多个组件配置对象存储相关的信息,以便能够上传与读取监控数据。除 Query 以外的所有 Thanos 组件 (Sidecar、Receiver、Ruler、Store Gateway、Compact) 都需要配置对象存储信息,使用 <code>--objstore.config</code> 直接配置内容或 <code>--objstore.config-file</code> 引用对象存储配置文件,不同对象存储配置方式不一样,参考官方文档: <a href="https://thanos.io/storage.md" target="_blank" rel="noopener">https://thanos.io/storage.md</a></p><p>通常使用了对象存储来长期保存数据不止要安装 Store Gateway,还需要安装 Compact 来对对象存储里的数据进行压缩与降采样,这样可以提升查询大时间范围监控数据的性能。注意:Compact 并不会减少对象存储的使用空间,而是会增加,增加更长采样间隔的监控数据,这样当查询大时间范围的数据时,就自动拉取更长时间间隔采样的数据以减少查询数据的总量,从而加快查询速度 (大时间范围的数据不需要那么精细),当放大查看时 (选择其中一小段时间),又自动选择拉取更短采样间隔的数据,从而也能显示出小时间范围的监控细节。</p><h2 id="部署实践"><a href="#部署实践" class="headerlink" title="部署实践"></a>部署实践</h2><p>这里以 Thanos 最新版本为例,选择 Sidecar 方案,介绍各个组件的 k8s yaml 定义方式并解释一些重要细节 (根据自身需求,参考上一节的方案选型,自行评估需要安装哪些组件)。</p><h3 id="准备对象存储配置"><a href="#准备对象存储配置" class="headerlink" title="准备对象存储配置"></a>准备对象存储配置</h3><p>如果我们要使用对象存储来长期保存数据,那么就要准备下对象存储的配置信息 (<code>thanos-objectstorage-secret.yaml</code>),比如使用腾讯云 COS 来存储:</p><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br></pre></td><td class="code"><pre><span class="line"><span class="attr">apiVersion:</span> <span class="string">v1</span></span><br><span class="line"><span class="attr">kind:</span> <span class="string">Secret</span></span><br><span class="line"><span class="attr">metadata:</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">thanos-objectstorage</span></span><br><span class="line"><span class="attr"> namespace:</span> <span class="string">thanos</span></span><br><span class="line"><span class="attr">type:</span> <span class="string">Opaque</span></span><br><span class="line"><span class="attr">stringData:</span></span><br><span class="line"> <span class="string">objectstorage.yaml:</span> <span class="string">|</span></span><br><span class="line"><span class="string"></span><span class="attr"> type:</span> <span class="string">COS</span></span><br><span class="line"><span class="attr"> config:</span></span><br><span class="line"><span class="attr"> bucket:</span> <span class="string">"thanos"</span></span><br><span class="line"><span class="attr"> region:</span> <span class="string">"ap-singapore"</span></span><br><span class="line"><span class="attr"> app_id:</span> <span 
class="string">"12*******5"</span></span><br><span class="line"><span class="attr"> secret_key:</span> <span class="string">"tsY***************************Edm"</span></span><br><span class="line"><span class="attr"> secret_id:</span> <span class="string">"AKI******************************gEY"</span></span><br></pre></td></tr></table></figure><p>或者使用阿里云 OSS 存储:</p><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br></pre></td><td class="code"><pre><span class="line"><span class="attr">apiVersion:</span> <span class="string">v1</span></span><br><span class="line"><span class="attr">kind:</span> <span class="string">Secret</span></span><br><span class="line"><span class="attr">metadata:</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">thanos-objectstorage</span></span><br><span class="line"><span class="attr"> namespace:</span> <span class="string">thanos</span></span><br><span class="line"><span class="attr">type:</span> <span class="string">Opaque</span></span><br><span class="line"><span class="attr">stringData:</span></span><br><span class="line"> <span class="string">objectstorage.yaml:</span> <span class="string">|</span></span><br><span class="line"><span class="string"></span><span class="attr"> type:</span> <span class="string">ALIYUNOSS</span></span><br><span class="line"><span class="attr"> config:</span></span><br><span class="line"><span class="attr"> endpoint:</span> <span class="string">"oss-cn-hangzhou-internal.aliyuncs.com"</span></span><br><span class="line"><span class="attr"> bucket:</span> <span class="string">"thanos"</span></span><br><span class="line"><span class="attr"> access_key_id:</span> <span class="string">"LTA******************KBu"</span></span><br><span class="line"><span class="attr"> access_key_secret:</span> <span class="string">"oki************************2HQ"</span></span><br></pre></td></tr></table></figure><blockquote><p>注: 对敏感信息打码了</p></blockquote><h3 id="给-Prometheus-加上-Sidecar"><a href="#给-Prometheus-加上-Sidecar" class="headerlink" title="给 Prometheus 加上 Sidecar"></a>给 Prometheus 加上 Sidecar</h3><p>如果选用 Sidecar 方案,就需要给 Prometheus 加上 Thanos Sidecar,准备 <code>prometheus.yaml</code>:</p><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span 
class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span class="line">67</span><br><span class="line">68</span><br><span class="line">69</span><br><span class="line">70</span><br><span class="line">71</span><br><span class="line">72</span><br><span class="line">73</span><br><span class="line">74</span><br><span class="line">75</span><br><span class="line">76</span><br><span class="line">77</span><br><span class="line">78</span><br><span class="line">79</span><br><span class="line">80</span><br><span class="line">81</span><br><span class="line">82</span><br><span class="line">83</span><br><span class="line">84</span><br><span class="line">85</span><br><span class="line">86</span><br><span class="line">87</span><br><span class="line">88</span><br><span class="line">89</span><br><span class="line">90</span><br><span class="line">91</span><br><span class="line">92</span><br><span class="line">93</span><br><span class="line">94</span><br><span class="line">95</span><br><span class="line">96</span><br><span class="line">97</span><br><span class="line">98</span><br><span class="line">99</span><br><span class="line">100</span><br><span class="line">101</span><br><span class="line">102</span><br><span class="line">103</span><br><span class="line">104</span><br><span class="line">105</span><br><span class="line">106</span><br><span class="line">107</span><br><span class="line">108</span><br><span class="line">109</span><br><span class="line">110</span><br><span class="line">111</span><br><span class="line">112</span><br><span class="line">113</span><br><span class="line">114</span><br><span class="line">115</span><br><span class="line">116</span><br><span class="line">117</span><br><span class="line">118</span><br><span class="line">119</span><br><span class="line">120</span><br><span class="line">121</span><br><span class="line">122</span><br><span class="line">123</span><br><span class="line">124</span><br><span class="line">125</span><br><span class="line">126</span><br><span class="line">127</span><br><span class="line">128</span><br><span class="line">129</span><br><span class="line">130</span><br><span class="line">131</span><br><span class="line">132</span><br><span class="line">133</span><br><span class="line">134</span><br><span class="line">135</span><br><span 
class="line">136</span><br><span class="line">137</span><br><span class="line">138</span><br><span class="line">139</span><br><span class="line">140</span><br><span class="line">141</span><br><span class="line">142</span><br><span class="line">143</span><br><span class="line">144</span><br><span class="line">145</span><br><span class="line">146</span><br><span class="line">147</span><br><span class="line">148</span><br><span class="line">149</span><br><span class="line">150</span><br><span class="line">151</span><br><span class="line">152</span><br><span class="line">153</span><br><span class="line">154</span><br><span class="line">155</span><br><span class="line">156</span><br><span class="line">157</span><br><span class="line">158</span><br><span class="line">159</span><br><span class="line">160</span><br><span class="line">161</span><br><span class="line">162</span><br><span class="line">163</span><br><span class="line">164</span><br><span class="line">165</span><br><span class="line">166</span><br><span class="line">167</span><br><span class="line">168</span><br><span class="line">169</span><br><span class="line">170</span><br><span class="line">171</span><br><span class="line">172</span><br><span class="line">173</span><br><span class="line">174</span><br><span class="line">175</span><br><span class="line">176</span><br><span class="line">177</span><br><span class="line">178</span><br><span class="line">179</span><br><span class="line">180</span><br><span class="line">181</span><br><span class="line">182</span><br><span class="line">183</span><br><span class="line">184</span><br><span class="line">185</span><br><span class="line">186</span><br><span class="line">187</span><br><span class="line">188</span><br><span class="line">189</span><br><span class="line">190</span><br><span class="line">191</span><br><span class="line">192</span><br><span class="line">193</span><br><span class="line">194</span><br><span class="line">195</span><br><span class="line">196</span><br><span class="line">197</span><br><span class="line">198</span><br><span class="line">199</span><br><span class="line">200</span><br><span class="line">201</span><br><span class="line">202</span><br><span class="line">203</span><br><span class="line">204</span><br><span class="line">205</span><br><span class="line">206</span><br><span class="line">207</span><br><span class="line">208</span><br><span class="line">209</span><br></pre></td><td class="code"><pre><span class="line"><span class="attr">kind:</span> <span class="string">Service</span></span><br><span class="line"><span class="attr">apiVersion:</span> <span class="string">v1</span></span><br><span class="line"><span class="attr">metadata:</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">prometheus-headless</span></span><br><span class="line"><span class="attr"> namespace:</span> <span class="string">thanos</span></span><br><span class="line"><span class="attr"> labels:</span></span><br><span class="line"> <span class="string">app.kubernetes.io/name:</span> <span class="string">prometheus</span></span><br><span class="line"><span class="attr">spec:</span></span><br><span class="line"><span class="attr"> type:</span> <span class="string">ClusterIP</span></span><br><span class="line"><span class="attr"> clusterIP:</span> <span class="string">None</span></span><br><span class="line"><span class="attr"> selector:</span></span><br><span class="line"> <span class="string">app.kubernetes.io/name:</span> <span 
class="string">prometheus</span></span><br><span class="line"><span class="attr"> ports:</span></span><br><span class="line"><span class="attr"> - name:</span> <span class="string">web</span></span><br><span class="line"><span class="attr"> protocol:</span> <span class="string">TCP</span></span><br><span class="line"><span class="attr"> port:</span> <span class="number">9090</span></span><br><span class="line"><span class="attr"> targetPort:</span> <span class="string">web</span></span><br><span class="line"><span class="attr"> - name:</span> <span class="string">grpc</span></span><br><span class="line"><span class="attr"> port:</span> <span class="number">10901</span></span><br><span class="line"><span class="attr"> targetPort:</span> <span class="string">grpc</span></span><br><span class="line"><span class="meta">---</span></span><br><span class="line"></span><br><span class="line"><span class="attr">apiVersion:</span> <span class="string">v1</span></span><br><span class="line"><span class="attr">kind:</span> <span class="string">ServiceAccount</span></span><br><span class="line"><span class="attr">metadata:</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">prometheus</span></span><br><span class="line"><span class="attr"> namespace:</span> <span class="string">thanos</span></span><br><span class="line"></span><br><span class="line"><span class="meta">---</span></span><br><span class="line"></span><br><span class="line"><span class="attr">apiVersion:</span> <span class="string">rbac.authorization.k8s.io/v1beta1</span></span><br><span class="line"><span class="attr">kind:</span> <span class="string">ClusterRole</span></span><br><span class="line"><span class="attr">metadata:</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">prometheus</span></span><br><span class="line"><span class="attr"> namespace:</span> <span class="string">thanos</span></span><br><span class="line"><span class="attr">rules:</span></span><br><span class="line"><span class="attr">- apiGroups:</span> <span class="string">[""]</span></span><br><span class="line"><span class="attr"> resources:</span></span><br><span class="line"><span class="bullet"> -</span> <span class="string">nodes</span></span><br><span class="line"><span class="bullet"> -</span> <span class="string">nodes/proxy</span></span><br><span class="line"><span class="bullet"> -</span> <span class="string">nodes/metrics</span></span><br><span class="line"><span class="bullet"> -</span> <span class="string">services</span></span><br><span class="line"><span class="bullet"> -</span> <span class="string">endpoints</span></span><br><span class="line"><span class="bullet"> -</span> <span class="string">pods</span></span><br><span class="line"><span class="attr"> verbs:</span> <span class="string">["get",</span> <span class="string">"list"</span><span class="string">,</span> <span class="string">"watch"</span><span class="string">]</span></span><br><span class="line"><span class="attr">- apiGroups:</span> <span class="string">[""]</span></span><br><span class="line"><span class="attr"> resources:</span> <span class="string">["configmaps"]</span></span><br><span class="line"><span class="attr"> verbs:</span> <span class="string">["get"]</span></span><br><span class="line"><span class="attr">- nonResourceURLs:</span> <span class="string">["/metrics"]</span></span><br><span class="line"><span class="attr"> verbs:</span> <span class="string">["get"]</span></span><br><span 
class="line"></span><br><span class="line"><span class="meta">---</span></span><br><span class="line"><span class="attr">apiVersion:</span> <span class="string">rbac.authorization.k8s.io/v1beta1</span></span><br><span class="line"><span class="attr">kind:</span> <span class="string">ClusterRoleBinding</span></span><br><span class="line"><span class="attr">metadata:</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">prometheus</span></span><br><span class="line"><span class="attr">subjects:</span></span><br><span class="line"><span class="attr"> - kind:</span> <span class="string">ServiceAccount</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">prometheus</span></span><br><span class="line"><span class="attr"> namespace:</span> <span class="string">thanos</span></span><br><span class="line"><span class="attr">roleRef:</span></span><br><span class="line"><span class="attr"> kind:</span> <span class="string">ClusterRole</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">prometheus</span></span><br><span class="line"><span class="attr"> apiGroup:</span> <span class="string">rbac.authorization.k8s.io</span></span><br><span class="line"><span class="meta">---</span></span><br><span class="line"></span><br><span class="line"><span class="attr">apiVersion:</span> <span class="string">apps/v1</span></span><br><span class="line"><span class="attr">kind:</span> <span class="string">StatefulSet</span></span><br><span class="line"><span class="attr">metadata:</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">prometheus</span></span><br><span class="line"><span class="attr"> namespace:</span> <span class="string">thanos</span></span><br><span class="line"><span class="attr"> labels:</span></span><br><span class="line"> <span class="string">app.kubernetes.io/name:</span> <span class="string">thanos-query</span></span><br><span class="line"><span class="attr">spec:</span></span><br><span class="line"><span class="attr"> serviceName:</span> <span class="string">prometheus-headless</span></span><br><span class="line"><span class="attr"> podManagementPolicy:</span> <span class="string">Parallel</span></span><br><span class="line"><span class="attr"> replicas:</span> <span class="number">2</span></span><br><span class="line"><span class="attr"> selector:</span></span><br><span class="line"><span class="attr"> matchLabels:</span></span><br><span class="line"> <span class="string">app.kubernetes.io/name:</span> <span class="string">prometheus</span></span><br><span class="line"><span class="attr"> template:</span></span><br><span class="line"><span class="attr"> metadata:</span></span><br><span class="line"><span class="attr"> labels:</span></span><br><span class="line"> <span class="string">app.kubernetes.io/name:</span> <span class="string">prometheus</span></span><br><span class="line"><span class="attr"> spec:</span></span><br><span class="line"><span class="attr"> serviceAccountName:</span> <span class="string">prometheus</span></span><br><span class="line"><span class="attr"> securityContext:</span></span><br><span class="line"><span class="attr"> fsGroup:</span> <span class="number">2000</span></span><br><span class="line"><span class="attr"> runAsNonRoot:</span> <span class="literal">true</span></span><br><span class="line"><span class="attr"> runAsUser:</span> <span class="number">1000</span></span><br><span class="line"><span class="attr"> 
affinity:</span></span><br><span class="line"><span class="attr"> podAntiAffinity:</span></span><br><span class="line"><span class="attr"> requiredDuringSchedulingIgnoredDuringExecution:</span></span><br><span class="line"><span class="attr"> - labelSelector:</span></span><br><span class="line"><span class="attr"> matchExpressions:</span></span><br><span class="line"><span class="attr"> - key:</span> <span class="string">app.kubernetes.io/name</span></span><br><span class="line"><span class="attr"> operator:</span> <span class="string">In</span></span><br><span class="line"><span class="attr"> values:</span></span><br><span class="line"><span class="bullet"> -</span> <span class="string">prometheus</span></span><br><span class="line"><span class="attr"> topologyKey:</span> <span class="string">kubernetes.io/hostname</span></span><br><span class="line"><span class="attr"> containers:</span></span><br><span class="line"><span class="attr"> - name:</span> <span class="string">prometheus</span></span><br><span class="line"><span class="attr"> image:</span> <span class="string">quay.io/prometheus/prometheus:v2.15.2</span></span><br><span class="line"><span class="attr"> args:</span></span><br><span class="line"><span class="bullet"> -</span> <span class="bullet">--config.file=/etc/prometheus/config_out/prometheus.yaml</span></span><br><span class="line"><span class="bullet"> -</span> <span class="bullet">--storage.tsdb.path=/prometheus</span></span><br><span class="line"><span class="bullet"> -</span> <span class="bullet">--storage.tsdb.retention.time=10d</span></span><br><span class="line"><span class="bullet"> -</span> <span class="bullet">--web.route-prefix=/</span></span><br><span class="line"><span class="bullet"> -</span> <span class="bullet">--web.enable-lifecycle</span></span><br><span class="line"><span class="bullet"> -</span> <span class="bullet">--storage.tsdb.no-lockfile</span></span><br><span class="line"><span class="bullet"> -</span> <span class="bullet">--storage.tsdb.min-block-duration=2h</span></span><br><span class="line"><span class="bullet"> -</span> <span class="bullet">--storage.tsdb.max-block-duration=2h</span></span><br><span class="line"><span class="bullet"> -</span> <span class="bullet">--log.level=debug</span></span><br><span class="line"><span class="attr"> ports:</span></span><br><span class="line"><span class="attr"> - containerPort:</span> <span class="number">9090</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">web</span></span><br><span class="line"><span class="attr"> protocol:</span> <span class="string">TCP</span></span><br><span class="line"><span class="attr"> livenessProbe:</span></span><br><span class="line"><span class="attr"> failureThreshold:</span> <span class="number">6</span></span><br><span class="line"><span class="attr"> httpGet:</span></span><br><span class="line"><span class="attr"> path:</span> <span class="string">/-/healthy</span></span><br><span class="line"><span class="attr"> port:</span> <span class="string">web</span></span><br><span class="line"><span class="attr"> scheme:</span> <span class="string">HTTP</span></span><br><span class="line"><span class="attr"> periodSeconds:</span> <span class="number">5</span></span><br><span class="line"><span class="attr"> successThreshold:</span> <span class="number">1</span></span><br><span class="line"><span class="attr"> timeoutSeconds:</span> <span class="number">3</span></span><br><span class="line"><span class="attr"> 
readinessProbe:</span></span><br><span class="line"><span class="attr"> failureThreshold:</span> <span class="number">120</span></span><br><span class="line"><span class="attr"> httpGet:</span></span><br><span class="line"><span class="attr"> path:</span> <span class="string">/-/ready</span></span><br><span class="line"><span class="attr"> port:</span> <span class="string">web</span></span><br><span class="line"><span class="attr"> scheme:</span> <span class="string">HTTP</span></span><br><span class="line"><span class="attr"> periodSeconds:</span> <span class="number">5</span></span><br><span class="line"><span class="attr"> successThreshold:</span> <span class="number">1</span></span><br><span class="line"><span class="attr"> timeoutSeconds:</span> <span class="number">3</span></span><br><span class="line"><span class="attr"> volumeMounts:</span></span><br><span class="line"><span class="attr"> - mountPath:</span> <span class="string">/etc/prometheus/config_out</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">prometheus-config-out</span></span><br><span class="line"><span class="attr"> readOnly:</span> <span class="literal">true</span></span><br><span class="line"><span class="attr"> - mountPath:</span> <span class="string">/prometheus</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">prometheus-storage</span></span><br><span class="line"><span class="attr"> - mountPath:</span> <span class="string">/etc/prometheus/rules</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">prometheus-rules</span></span><br><span class="line"><span class="attr"> - name:</span> <span class="string">thanos</span></span><br><span class="line"><span class="attr"> image:</span> <span class="string">quay.io/thanos/thanos:v0.11.0</span></span><br><span class="line"><span class="attr"> args:</span></span><br><span class="line"><span class="bullet"> -</span> <span class="string">sidecar</span></span><br><span class="line"><span class="bullet"> -</span> <span class="bullet">--log.level=debug</span></span><br><span class="line"><span class="bullet"> -</span> <span class="bullet">--tsdb.path=/prometheus</span></span><br><span class="line"><span class="bullet"> -</span> <span class="bullet">--prometheus.url=http://127.0.0.1:9090</span></span><br><span class="line"><span class="bullet"> -</span> <span class="bullet">--objstore.config-file=/etc/thanos/objectstorage.yaml</span></span><br><span class="line"><span class="bullet"> -</span> <span class="bullet">--reloader.config-file=/etc/prometheus/config/prometheus.yaml.tmpl</span></span><br><span class="line"><span class="bullet"> -</span> <span class="bullet">--reloader.config-envsubst-file=/etc/prometheus/config_out/prometheus.yaml</span></span><br><span class="line"><span class="bullet"> -</span> <span class="bullet">--reloader.rule-dir=/etc/prometheus/rules/</span></span><br><span class="line"><span class="attr"> env:</span></span><br><span class="line"><span class="attr"> - name:</span> <span class="string">POD_NAME</span></span><br><span class="line"><span class="attr"> valueFrom:</span></span><br><span class="line"><span class="attr"> fieldRef:</span></span><br><span class="line"><span class="attr"> fieldPath:</span> <span class="string">metadata.name</span></span><br><span class="line"><span class="attr"> ports:</span></span><br><span class="line"><span class="attr"> - name:</span> <span class="string">http-sidecar</span></span><br><span 
class="line"><span class="attr"> containerPort:</span> <span class="number">10902</span></span><br><span class="line"><span class="attr"> - name:</span> <span class="string">grpc</span></span><br><span class="line"><span class="attr"> containerPort:</span> <span class="number">10901</span></span><br><span class="line"><span class="attr"> livenessProbe:</span></span><br><span class="line"><span class="attr"> httpGet:</span></span><br><span class="line"><span class="attr"> port:</span> <span class="number">10902</span></span><br><span class="line"><span class="attr"> path:</span> <span class="string">/-/healthy</span></span><br><span class="line"><span class="attr"> readinessProbe:</span></span><br><span class="line"><span class="attr"> httpGet:</span></span><br><span class="line"><span class="attr"> port:</span> <span class="number">10902</span></span><br><span class="line"><span class="attr"> path:</span> <span class="string">/-/ready</span></span><br><span class="line"><span class="attr"> volumeMounts:</span></span><br><span class="line"><span class="attr"> - name:</span> <span class="string">prometheus-config-tmpl</span></span><br><span class="line"><span class="attr"> mountPath:</span> <span class="string">/etc/prometheus/config</span></span><br><span class="line"><span class="attr"> - name:</span> <span class="string">prometheus-config-out</span></span><br><span class="line"><span class="attr"> mountPath:</span> <span class="string">/etc/prometheus/config_out</span></span><br><span class="line"><span class="attr"> - name:</span> <span class="string">prometheus-rules</span></span><br><span class="line"><span class="attr"> mountPath:</span> <span class="string">/etc/prometheus/rules</span></span><br><span class="line"><span class="attr"> - name:</span> <span class="string">prometheus-storage</span></span><br><span class="line"><span class="attr"> mountPath:</span> <span class="string">/prometheus</span></span><br><span class="line"><span class="attr"> - name:</span> <span class="string">thanos-objectstorage</span></span><br><span class="line"><span class="attr"> subPath:</span> <span class="string">objectstorage.yaml</span></span><br><span class="line"><span class="attr"> mountPath:</span> <span class="string">/etc/thanos/objectstorage.yaml</span></span><br><span class="line"><span class="attr"> volumes:</span></span><br><span class="line"><span class="attr"> - name:</span> <span class="string">prometheus-config-tmpl</span></span><br><span class="line"><span class="attr"> configMap:</span></span><br><span class="line"><span class="attr"> defaultMode:</span> <span class="number">420</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">prometheus-config-tmpl</span></span><br><span class="line"><span class="attr"> - name:</span> <span class="string">prometheus-config-out</span></span><br><span class="line"><span class="attr"> emptyDir:</span> <span class="string">{}</span></span><br><span class="line"><span class="attr"> - name:</span> <span class="string">prometheus-rules</span></span><br><span class="line"><span class="attr"> configMap:</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">prometheus-rules</span></span><br><span class="line"><span class="attr"> - name:</span> <span class="string">thanos-objectstorage</span></span><br><span class="line"><span class="attr"> secret:</span></span><br><span class="line"><span class="attr"> secretName:</span> <span class="string">thanos-objectstorage</span></span><br><span 
class="line"><span class="attr"> volumeClaimTemplates:</span></span><br><span class="line"><span class="attr"> - metadata:</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">prometheus-storage</span></span><br><span class="line"><span class="attr"> labels:</span></span><br><span class="line"> <span class="string">app.kubernetes.io/name:</span> <span class="string">prometheus</span></span><br><span class="line"><span class="attr"> spec:</span></span><br><span class="line"><span class="attr"> accessModes:</span></span><br><span class="line"><span class="bullet"> -</span> <span class="string">ReadWriteOnce</span></span><br><span class="line"><span class="attr"> resources:</span></span><br><span class="line"><span class="attr"> requests:</span></span><br><span class="line"><span class="attr"> storage:</span> <span class="number">200</span><span class="string">Gi</span></span><br><span class="line"><span class="attr"> volumeMode:</span> <span class="string">Filesystem</span></span><br></pre></td></tr></table></figure><ul><li>Prometheus 使用 StatefulSet 方式部署,挂载数据盘以便存储最新监控数据。</li><li>由于 Prometheus 副本之间没有启动顺序的依赖,所以 podManagementPolicy 指定为 Parallel,加快启动速度。</li><li>为 Prometheus 绑定足够的 RBAC 权限,以便后续配置使用 k8s 的服务发现 (<code>kubernetes_sd_configs</code>) 时能够正常工作。</li><li>为 Prometheus 创建 headless 类型 service,为后续 Thanos Query 通过 DNS SRV 记录来动态发现 Sidecar 的 gRPC 端点做准备 (使用 headless service 才能让 DNS SRV 正确返回所有端点)。</li><li>使用两个 Prometheus 副本,用于实现高可用。</li><li>使用硬反亲和,避免 Prometheus 部署在同一节点,既可以分散压力也可以避免单点故障。</li><li>Prometheus 使用 <code>--storage.tsdb.retention.time</code> 指定数据保留时长,默认15天,可以根据数据增长速度和数据盘大小做适当调整(数据增长取决于采集的指标和目标端点的数量和采集频率)。</li><li>Sidecar 使用 <code>--objstore.config-file</code> 引用我们刚刚创建并挂载的对象存储配置文件,用于上传数据到对象存储。</li><li>通常会给 Prometheus 附带一个 quay.io/coreos/prometheus-config-reloader 来监听配置变更并动态加载,但 thanos sidecar 也为我们提供了这个功能,所以可以直接用 thanos sidecar 来实现此功能,也支持配置文件根据模板动态生成:<code>--reloader.config-file</code> 指定 Prometheus 配置文件模板,<code>--reloader.config-envsubst-file</code> 指定生成配置文件的存放路径,假设是 <code>/etc/prometheus/config_out/prometheus.yaml</code> ,那么 <code>/etc/prometheus/config_out</code> 这个路径使用 emptyDir 让 Prometheus 与 Sidecar 实现配置文件共享挂载,Prometheus 再通过 <code>--config.file</code> 指定生成出来的配置文件,当配置有更新时,挂载的配置文件也会同步更新,Sidecar 也会通知 Prometheus 重新加载配置。另外,Sidecar 与 Prometheus 也挂载同一份 rules 配置文件,配置更新后 Sidecar 仅通知 Prometheus 加载配置,不支持模板,因为 rules 配置不需要模板来动态生成。</li></ul><p>然后再给 Prometheus 准备配置 (<code>prometheus-config.yaml</code>):</p><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span 
class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br></pre></td><td class="code"><pre><span class="line"><span class="attr">apiVersion:</span> <span class="string">v1</span></span><br><span class="line"><span class="attr">kind:</span> <span class="string">ConfigMap</span></span><br><span class="line"><span class="attr">metadata:</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">prometheus-config-tmpl</span></span><br><span class="line"><span class="attr"> namespace:</span> <span class="string">thanos</span></span><br><span class="line"><span class="attr">data:</span></span><br><span class="line"> <span class="string">prometheus.yaml.tmpl:</span> <span class="string">|-</span></span><br><span class="line"><span class="attr"> global:</span></span><br><span class="line"><span class="attr"> scrape_interval:</span> <span class="number">5</span><span class="string">s</span></span><br><span class="line"><span class="attr"> evaluation_interval:</span> <span class="number">5</span><span class="string">s</span></span><br><span class="line"><span class="attr"> external_labels:</span></span><br><span class="line"><span class="attr"> cluster:</span> <span class="string">prometheus-ha</span></span><br><span class="line"><span class="attr"> prometheus_replica:</span> <span class="string">$(POD_NAME)</span></span><br><span class="line"><span class="attr"> rule_files:</span></span><br><span class="line"><span class="bullet"> -</span> <span class="string">/etc/prometheus/rules/*rules.yaml</span></span><br><span class="line"><span class="attr"> scrape_configs:</span></span><br><span class="line"><span class="attr"> - job_name:</span> <span class="string">cadvisor</span></span><br><span class="line"><span class="attr"> metrics_path:</span> <span class="string">/metrics/cadvisor</span></span><br><span class="line"><span class="attr"> scrape_interval:</span> <span class="number">10</span><span class="string">s</span></span><br><span class="line"><span class="attr"> scrape_timeout:</span> <span class="number">10</span><span class="string">s</span></span><br><span class="line"><span class="attr"> scheme:</span> <span class="string">https</span></span><br><span class="line"><span class="attr"> tls_config:</span></span><br><span class="line"><span class="attr"> insecure_skip_verify:</span> <span class="literal">true</span></span><br><span class="line"><span class="attr"> bearer_token_file:</span> <span class="string">/var/run/secrets/kubernetes.io/serviceaccount/token</span></span><br><span class="line"><span class="attr"> kubernetes_sd_configs:</span></span><br><span class="line"><span class="attr"> - role:</span> <span class="string">node</span></span><br><span class="line"><span class="attr"> relabel_configs:</span></span><br><span class="line"><span class="attr"> - action:</span> <span 
class="string">labelmap</span></span><br><span class="line"><span class="attr"> regex:</span> <span class="string">__meta_kubernetes_node_label_(.+)</span></span><br><span class="line"><span class="meta">---</span></span><br><span class="line"></span><br><span class="line"><span class="attr">apiVersion:</span> <span class="string">v1</span></span><br><span class="line"><span class="attr">kind:</span> <span class="string">ConfigMap</span></span><br><span class="line"><span class="attr">metadata:</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">prometheus-rules</span></span><br><span class="line"><span class="attr"> labels:</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">prometheus-rules</span></span><br><span class="line"><span class="attr"> namespace:</span> <span class="string">thanos</span></span><br><span class="line"><span class="attr">data:</span></span><br><span class="line"> <span class="string">alert-rules.yaml:</span> <span class="string">|-</span></span><br><span class="line"><span class="attr"> groups:</span></span><br><span class="line"><span class="attr"> - name:</span> <span class="string">k8s.rules</span></span><br><span class="line"><span class="attr"> rules:</span></span><br><span class="line"><span class="attr"> - expr:</span> <span class="string">|</span></span><br><span class="line"><span class="string"> sum(rate(container_cpu_usage_seconds_total{job="cadvisor", image!="", container!=""}[5m])) by (namespace)</span></span><br><span class="line"><span class="string"></span><span class="attr"> record:</span> <span class="attr">namespace:container_cpu_usage_seconds_total:sum_rate</span></span><br><span class="line"><span class="attr"> - expr:</span> <span class="string">|</span></span><br><span class="line"><span class="string"> sum(container_memory_usage_bytes{job="cadvisor", image!="", container!=""}) by (namespace)</span></span><br><span class="line"><span class="string"></span><span class="attr"> record:</span> <span class="attr">namespace:container_memory_usage_bytes:sum</span></span><br><span class="line"><span class="attr"> - expr:</span> <span class="string">|</span></span><br><span class="line"><span class="string"> sum by (namespace, pod, container) (</span></span><br><span class="line"><span class="string"> rate(container_cpu_usage_seconds_total{job="cadvisor", image!="", container!=""}[5m])</span></span><br><span class="line"><span class="string"> )</span></span><br><span class="line"><span class="string"></span><span class="attr"> record:</span> <span class="attr">namespace_pod_container:container_cpu_usage_seconds_total:sum_rate</span></span><br></pre></td></tr></table></figure><ul><li>本文重点不在 prometheus 的配置文件,所以这里仅以采集 kubelet 所暴露的 cadvisor 容器指标的简单配置为例。</li><li>Prometheus 实例采集的所有指标数据里都会额外加上 <code>external_labels</code> 里指定的 label,通常用 <code>cluster</code> 区分当前 Prometheus 所在集群的名称,我们再加了个 <code>prometheus_replica</code>,用于区分相同 Prometheus 副本(这些副本所采集的数据除了 <code>prometheus_replica</code> 的值不一样,其它几乎一致,这个值会被 Thanos Sidecar 替换成 Pod 副本的名称,用于 Thanos 实现 Prometheus 高可用)</li></ul><h3 id="安装-Query"><a href="#安装-Query" class="headerlink" title="安装 Query"></a>安装 Query</h3><p>准备 <code>thanos-query.yaml</code>:</p><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span 
class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span class="line">67</span><br><span class="line">68</span><br><span class="line">69</span><br><span class="line">70</span><br><span class="line">71</span><br><span class="line">72</span><br><span class="line">73</span><br><span class="line">74</span><br><span class="line">75</span><br><span class="line">76</span><br><span class="line">77</span><br><span class="line">78</span><br><span class="line">79</span><br><span class="line">80</span><br><span class="line">81</span><br><span class="line">82</span><br><span class="line">83</span><br><span class="line">84</span><br></pre></td><td class="code"><pre><span class="line"><span class="attr">apiVersion:</span> <span class="string">v1</span></span><br><span class="line"><span class="attr">kind:</span> <span class="string">Service</span></span><br><span class="line"><span class="attr">metadata:</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">thanos-query</span></span><br><span class="line"><span class="attr"> namespace:</span> <span class="string">thanos</span></span><br><span class="line"><span class="attr"> labels:</span></span><br><span class="line"> <span class="string">app.kubernetes.io/name:</span> <span class="string">thanos-query</span></span><br><span class="line"><span class="attr">spec:</span></span><br><span class="line"><span class="attr"> ports:</span></span><br><span class="line"><span class="attr"> - name:</span> <span class="string">grpc</span></span><br><span class="line"><span class="attr"> port:</span> <span class="number">10901</span></span><br><span class="line"><span class="attr"> targetPort:</span> <span 
class="string">grpc</span></span><br><span class="line"><span class="attr"> - name:</span> <span class="string">http</span></span><br><span class="line"><span class="attr"> port:</span> <span class="number">9090</span></span><br><span class="line"><span class="attr"> targetPort:</span> <span class="string">http</span></span><br><span class="line"><span class="attr"> selector:</span></span><br><span class="line"> <span class="string">app.kubernetes.io/name:</span> <span class="string">thanos-query</span></span><br><span class="line"><span class="meta">---</span></span><br><span class="line"></span><br><span class="line"><span class="attr">apiVersion:</span> <span class="string">apps/v1</span></span><br><span class="line"><span class="attr">kind:</span> <span class="string">Deployment</span></span><br><span class="line"><span class="attr">metadata:</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">thanos-query</span></span><br><span class="line"><span class="attr"> namespace:</span> <span class="string">thanos</span></span><br><span class="line"><span class="attr"> labels:</span></span><br><span class="line"> <span class="string">app.kubernetes.io/name:</span> <span class="string">thanos-query</span></span><br><span class="line"><span class="attr">spec:</span></span><br><span class="line"><span class="attr"> replicas:</span> <span class="number">3</span></span><br><span class="line"><span class="attr"> selector:</span></span><br><span class="line"><span class="attr"> matchLabels:</span></span><br><span class="line"> <span class="string">app.kubernetes.io/name:</span> <span class="string">thanos-query</span></span><br><span class="line"><span class="attr"> template:</span></span><br><span class="line"><span class="attr"> metadata:</span></span><br><span class="line"><span class="attr"> labels:</span></span><br><span class="line"> <span class="string">app.kubernetes.io/name:</span> <span class="string">thanos-query</span></span><br><span class="line"><span class="attr"> spec:</span></span><br><span class="line"><span class="attr"> affinity:</span></span><br><span class="line"><span class="attr"> podAntiAffinity:</span></span><br><span class="line"><span class="attr"> preferredDuringSchedulingIgnoredDuringExecution:</span></span><br><span class="line"><span class="attr"> - podAffinityTerm:</span></span><br><span class="line"><span class="attr"> labelSelector:</span></span><br><span class="line"><span class="attr"> matchExpressions:</span></span><br><span class="line"><span class="attr"> - key:</span> <span class="string">app.kubernetes.io/name</span></span><br><span class="line"><span class="attr"> operator:</span> <span class="string">In</span></span><br><span class="line"><span class="attr"> values:</span></span><br><span class="line"><span class="bullet"> -</span> <span class="string">thanos-query</span></span><br><span class="line"><span class="attr"> topologyKey:</span> <span class="string">kubernetes.io/hostname</span></span><br><span class="line"><span class="attr"> weight:</span> <span class="number">100</span></span><br><span class="line"><span class="attr"> containers:</span></span><br><span class="line"><span class="attr"> - args:</span></span><br><span class="line"><span class="bullet"> -</span> <span class="string">query</span></span><br><span class="line"><span class="bullet"> -</span> <span class="bullet">--log.level=debug</span></span><br><span class="line"><span class="bullet"> -</span> <span 
class="bullet">--query.auto-downsampling</span></span><br><span class="line"><span class="bullet"> -</span> <span class="bullet">--grpc-address=0.0.0.0:10901</span></span><br><span class="line"><span class="bullet"> -</span> <span class="bullet">--http-address=0.0.0.0:9090</span></span><br><span class="line"><span class="bullet"> -</span> <span class="bullet">--query.partial-response</span></span><br><span class="line"><span class="bullet"> -</span> <span class="bullet">--query.replica-label=prometheus_replica</span></span><br><span class="line"><span class="bullet"> -</span> <span class="bullet">--query.replica-label=rule_replica</span></span><br><span class="line"><span class="bullet"> -</span> <span class="bullet">--store=dnssrv+_grpc._tcp.prometheus-headless.thanos.svc.cluster.local</span></span><br><span class="line"><span class="bullet"> -</span> <span class="bullet">--store=dnssrv+_grpc._tcp.thanos-rule.thanos.svc.cluster.local</span></span><br><span class="line"><span class="bullet"> -</span> <span class="bullet">--store=dnssrv+_grpc._tcp.thanos-store.thanos.svc.cluster.local</span></span><br><span class="line"><span class="attr"> image:</span> <span class="string">thanosio/thanos:v0.11.0</span></span><br><span class="line"><span class="attr"> livenessProbe:</span></span><br><span class="line"><span class="attr"> failureThreshold:</span> <span class="number">4</span></span><br><span class="line"><span class="attr"> httpGet:</span></span><br><span class="line"><span class="attr"> path:</span> <span class="string">/-/healthy</span></span><br><span class="line"><span class="attr"> port:</span> <span class="number">9090</span></span><br><span class="line"><span class="attr"> scheme:</span> <span class="string">HTTP</span></span><br><span class="line"><span class="attr"> periodSeconds:</span> <span class="number">30</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">thanos-query</span></span><br><span class="line"><span class="attr"> ports:</span></span><br><span class="line"><span class="attr"> - containerPort:</span> <span class="number">10901</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">grpc</span></span><br><span class="line"><span class="attr"> - containerPort:</span> <span class="number">9090</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">http</span></span><br><span class="line"><span class="attr"> readinessProbe:</span></span><br><span class="line"><span class="attr"> failureThreshold:</span> <span class="number">20</span></span><br><span class="line"><span class="attr"> httpGet:</span></span><br><span class="line"><span class="attr"> path:</span> <span class="string">/-/ready</span></span><br><span class="line"><span class="attr"> port:</span> <span class="number">9090</span></span><br><span class="line"><span class="attr"> scheme:</span> <span class="string">HTTP</span></span><br><span class="line"><span class="attr"> periodSeconds:</span> <span class="number">5</span></span><br><span class="line"><span class="attr"> terminationMessagePolicy:</span> <span class="string">FallbackToLogsOnError</span></span><br><span class="line"><span class="attr"> terminationGracePeriodSeconds:</span> <span class="number">120</span></span><br></pre></td></tr></table></figure><ul><li>因为 Query 是无状态的,使用 Deployment 部署,也不需要 headless service,直接创建普通的 service。</li><li>使用软反亲和,尽量不让 Query 调度到同一节点。</li><li>部署多个副本,实现 Query 的高可用。</li><li><code>--query.partial-response</code> 启用 <a 
href="https://thanos.io/components/query.md/#partial-response" target="_blank" rel="noopener">Partial Response</a>,这样可以在部分后端 Store API 返回错误或超时的情况下也能看到正确的监控数据(如果后端 Store API 做了高可用,挂掉一个副本,Query 访问挂掉的副本超时,但由于还有没挂掉的副本,还是能正确返回结果;如果挂掉的某个后端本身就不存在我们需要的数据,挂掉也不影响结果的正确性;总之如果各个组件都做了高可用,想获得错误的结果都难,所以我们有信心启用 Partial Response 这个功能)。</li><li><code>--query.auto-downsampling</code> 查询时自动降采样,提升查询效率。</li><li><code>--query.replica-label</code> 指定我们刚刚给 Prometheus 配置的 <code>prometheus_replica</code> 这个 external label,Query 向 Sidecar 拉取 Prometheus 数据时会识别这个 label 并自动去重,这样即使挂掉一个副本,只要至少有一个副本正常也不会影响查询结果,也就是可以实现 Prometheus 的高可用。同理,再指定一个 <code>rule_replica</code> 用于给 Ruler 做高可用。</li><li><code>--store</code> 指定实现了 Store API 的地址(Sidecar, Ruler, Store Gateway, Receiver),通常不建议写静态地址,而是使用服务发现机制自动发现 Store API 地址,如果是部署在同一个集群,可以用 DNS SRV 记录来做服务发现,比如 <code>dnssrv+_grpc._tcp.prometheus-headless.thanos.svc.cluster.local</code>,也就是我们刚刚为包含 Sidecar 的 Prometheus 创建的 headless service (使用 headless service 才能正确实现服务发现),并且指定了名为 grpc 的 tcp 端口,同理,其它组件也可以按照这样加到 <code>--store</code> 参数里;如果是其它有些组件部署在集群外,无法通过集群 dns 解析 DNS SRV 记录,可以使用配置文件来做服务发现,也就是指定 <code>--store.sd-files</code> 参数,将其它 Store API 地址写在配置文件里 (挂载 ConfigMap),需要增加地址时直接更新 ConfigMap (不需要重启 Query)。</li></ul><h3 id="安装-Store-Gateway"><a href="#安装-Store-Gateway" class="headerlink" title="安装 Store Gateway"></a>安装 Store Gateway</h3><p>准备 <code>thanos-store.yaml</code>:</p><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span 
class="line">66</span><br><span class="line">67</span><br><span class="line">68</span><br><span class="line">69</span><br><span class="line">70</span><br><span class="line">71</span><br><span class="line">72</span><br><span class="line">73</span><br><span class="line">74</span><br><span class="line">75</span><br><span class="line">76</span><br><span class="line">77</span><br><span class="line">78</span><br><span class="line">79</span><br><span class="line">80</span><br><span class="line">81</span><br><span class="line">82</span><br><span class="line">83</span><br><span class="line">84</span><br><span class="line">85</span><br><span class="line">86</span><br><span class="line">87</span><br><span class="line">88</span><br><span class="line">89</span><br><span class="line">90</span><br><span class="line">91</span><br><span class="line">92</span><br><span class="line">93</span><br></pre></td><td class="code"><pre><span class="line"><span class="attr">apiVersion:</span> <span class="string">v1</span></span><br><span class="line"><span class="attr">kind:</span> <span class="string">Service</span></span><br><span class="line"><span class="attr">metadata:</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">thanos-store</span></span><br><span class="line"><span class="attr"> namespace:</span> <span class="string">thanos</span></span><br><span class="line"><span class="attr"> labels:</span></span><br><span class="line"> <span class="string">app.kubernetes.io/name:</span> <span class="string">thanos-store</span></span><br><span class="line"><span class="attr">spec:</span></span><br><span class="line"><span class="attr"> clusterIP:</span> <span class="string">None</span></span><br><span class="line"><span class="attr"> ports:</span></span><br><span class="line"><span class="attr"> - name:</span> <span class="string">grpc</span></span><br><span class="line"><span class="attr"> port:</span> <span class="number">10901</span></span><br><span class="line"><span class="attr"> targetPort:</span> <span class="number">10901</span></span><br><span class="line"><span class="attr"> - name:</span> <span class="string">http</span></span><br><span class="line"><span class="attr"> port:</span> <span class="number">10902</span></span><br><span class="line"><span class="attr"> targetPort:</span> <span class="number">10902</span></span><br><span class="line"><span class="attr"> selector:</span></span><br><span class="line"> <span class="string">app.kubernetes.io/name:</span> <span class="string">thanos-store</span></span><br><span class="line"><span class="meta">---</span></span><br><span class="line"></span><br><span class="line"><span class="attr">apiVersion:</span> <span class="string">apps/v1</span></span><br><span class="line"><span class="attr">kind:</span> <span class="string">StatefulSet</span></span><br><span class="line"><span class="attr">metadata:</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">thanos-store</span></span><br><span class="line"><span class="attr"> namespace:</span> <span class="string">thanos</span></span><br><span class="line"><span class="attr"> labels:</span></span><br><span class="line"> <span class="string">app.kubernetes.io/name:</span> <span class="string">thanos-store</span></span><br><span class="line"><span class="attr">spec:</span></span><br><span class="line"><span class="attr"> replicas:</span> <span class="number">2</span></span><br><span class="line"><span class="attr"> 
selector:</span></span><br><span class="line"><span class="attr"> matchLabels:</span></span><br><span class="line"> <span class="string">app.kubernetes.io/name:</span> <span class="string">thanos-store</span></span><br><span class="line"><span class="attr"> serviceName:</span> <span class="string">thanos-store</span></span><br><span class="line"><span class="attr"> podManagementPolicy:</span> <span class="string">Parallel</span></span><br><span class="line"><span class="attr"> template:</span></span><br><span class="line"><span class="attr"> metadata:</span></span><br><span class="line"><span class="attr"> labels:</span></span><br><span class="line"> <span class="string">app.kubernetes.io/name:</span> <span class="string">thanos-store</span></span><br><span class="line"><span class="attr"> spec:</span></span><br><span class="line"><span class="attr"> containers:</span></span><br><span class="line"><span class="attr"> - args:</span></span><br><span class="line"><span class="bullet"> -</span> <span class="string">store</span></span><br><span class="line"><span class="bullet"> -</span> <span class="bullet">--log.level=debug</span></span><br><span class="line"><span class="bullet"> -</span> <span class="bullet">--data-dir=/var/thanos/store</span></span><br><span class="line"><span class="bullet"> -</span> <span class="bullet">--grpc-address=0.0.0.0:10901</span></span><br><span class="line"><span class="bullet"> -</span> <span class="bullet">--http-address=0.0.0.0:10902</span></span><br><span class="line"><span class="bullet"> -</span> <span class="bullet">--objstore.config-file=/etc/thanos/objectstorage.yaml</span></span><br><span class="line"><span class="bullet"> -</span> <span class="bullet">--experimental.enable-index-header</span></span><br><span class="line"><span class="attr"> image:</span> <span class="string">thanosio/thanos:v0.11.0</span></span><br><span class="line"><span class="attr"> livenessProbe:</span></span><br><span class="line"><span class="attr"> failureThreshold:</span> <span class="number">8</span></span><br><span class="line"><span class="attr"> httpGet:</span></span><br><span class="line"><span class="attr"> path:</span> <span class="string">/-/healthy</span></span><br><span class="line"><span class="attr"> port:</span> <span class="number">10902</span></span><br><span class="line"><span class="attr"> scheme:</span> <span class="string">HTTP</span></span><br><span class="line"><span class="attr"> periodSeconds:</span> <span class="number">30</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">thanos-store</span></span><br><span class="line"><span class="attr"> ports:</span></span><br><span class="line"><span class="attr"> - containerPort:</span> <span class="number">10901</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">grpc</span></span><br><span class="line"><span class="attr"> - containerPort:</span> <span class="number">10902</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">http</span></span><br><span class="line"><span class="attr"> readinessProbe:</span></span><br><span class="line"><span class="attr"> failureThreshold:</span> <span class="number">20</span></span><br><span class="line"><span class="attr"> httpGet:</span></span><br><span class="line"><span class="attr"> path:</span> <span class="string">/-/ready</span></span><br><span class="line"><span class="attr"> port:</span> <span class="number">10902</span></span><br><span class="line"><span 
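class="comment">            # 补充说明: Store Gateway 启动时一般要先同步对象存储中的 block 元信息，就绪可能较慢，</span></span><br><span class="line"><span class="comment">            # 所以这里的就绪探针给了比较宽松的失败阈值 (20 次 × 5s)</span></span><br><span class="line"><span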
class="attr"> scheme:</span> <span class="string">HTTP</span></span><br><span class="line"><span class="attr"> periodSeconds:</span> <span class="number">5</span></span><br><span class="line"><span class="attr"> terminationMessagePolicy:</span> <span class="string">FallbackToLogsOnError</span></span><br><span class="line"><span class="attr"> volumeMounts:</span></span><br><span class="line"><span class="attr"> - mountPath:</span> <span class="string">/var/thanos/store</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">data</span></span><br><span class="line"><span class="attr"> readOnly:</span> <span class="literal">false</span></span><br><span class="line"><span class="attr"> - name:</span> <span class="string">thanos-objectstorage</span></span><br><span class="line"><span class="attr"> subPath:</span> <span class="string">objectstorage.yaml</span></span><br><span class="line"><span class="attr"> mountPath:</span> <span class="string">/etc/thanos/objectstorage.yaml</span></span><br><span class="line"><span class="attr"> terminationGracePeriodSeconds:</span> <span class="number">120</span></span><br><span class="line"><span class="attr"> volumes:</span></span><br><span class="line"><span class="attr"> - name:</span> <span class="string">thanos-objectstorage</span></span><br><span class="line"><span class="attr"> secret:</span></span><br><span class="line"><span class="attr"> secretName:</span> <span class="string">thanos-objectstorage</span></span><br><span class="line"><span class="attr"> volumeClaimTemplates:</span></span><br><span class="line"><span class="attr"> - metadata:</span></span><br><span class="line"><span class="attr"> labels:</span></span><br><span class="line"> <span class="string">app.kubernetes.io/name:</span> <span class="string">thanos-store</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">data</span></span><br><span class="line"><span class="attr"> spec:</span></span><br><span class="line"><span class="attr"> accessModes:</span></span><br><span class="line"><span class="bullet"> -</span> <span class="string">ReadWriteOnce</span></span><br><span class="line"><span class="attr"> resources:</span></span><br><span class="line"><span class="attr"> requests:</span></span><br><span class="line"><span class="attr"> storage:</span> <span class="number">10</span><span class="string">Gi</span></span><br></pre></td></tr></table></figure><ul><li>Store Gateway 实际也可以做到一定程度的无状态,它会需要一点磁盘空间来对对象存储做索引以加速查询,但数据不那么重要,是可以删除的,删除后会自动去拉对象存储查数据重新建立索引。这里我们避免每次重启都重新建立索引,所以用 StatefulSet 部署 Store Gateway,挂载一块小容量的磁盘(索引占用不到多大空间)。</li><li>同样创建 headless service,用于 Query 对 Store Gateway 进行服务发现。</li><li>部署两个副本,实现 Store Gateway 的高可用。</li><li>Store Gateway 也需要对象存储的配置,用于读取对象存储的数据,所以要挂载对象存储的配置文件。</li></ul><h3 id="安装-Ruler"><a href="#安装-Ruler" class="headerlink" title="安装 Ruler"></a>安装 Ruler</h3><p>准备 Ruler 部署配置 <code>thanos-ruler.yaml</code>:</p><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span 
class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span class="line">67</span><br><span class="line">68</span><br><span class="line">69</span><br><span class="line">70</span><br><span class="line">71</span><br><span class="line">72</span><br><span class="line">73</span><br><span class="line">74</span><br><span class="line">75</span><br><span class="line">76</span><br><span class="line">77</span><br><span class="line">78</span><br><span class="line">79</span><br><span class="line">80</span><br><span class="line">81</span><br><span class="line">82</span><br><span class="line">83</span><br><span class="line">84</span><br><span class="line">85</span><br><span class="line">86</span><br><span class="line">87</span><br><span class="line">88</span><br><span class="line">89</span><br><span class="line">90</span><br><span class="line">91</span><br><span class="line">92</span><br><span class="line">93</span><br><span class="line">94</span><br><span class="line">95</span><br><span class="line">96</span><br><span class="line">97</span><br><span class="line">98</span><br><span class="line">99</span><br><span class="line">100</span><br><span class="line">101</span><br><span class="line">102</span><br><span class="line">103</span><br><span class="line">104</span><br><span class="line">105</span><br></pre></td><td class="code"><pre><span class="line"><span class="attr">apiVersion:</span> <span class="string">v1</span></span><br><span class="line"><span class="attr">kind:</span> <span class="string">Service</span></span><br><span class="line"><span class="attr">metadata:</span></span><br><span class="line"><span class="attr"> labels:</span></span><br><span class="line"> <span class="string">app.kubernetes.io/name:</span> <span class="string">thanos-rule</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">thanos-rule</span></span><br><span class="line"><span class="attr"> namespace:</span> <span class="string">thanos</span></span><br><span 
class="line"><span class="attr">spec:</span></span><br><span class="line"><span class="attr"> clusterIP:</span> <span class="string">None</span></span><br><span class="line"><span class="attr"> ports:</span></span><br><span class="line"><span class="attr"> - name:</span> <span class="string">grpc</span></span><br><span class="line"><span class="attr"> port:</span> <span class="number">10901</span></span><br><span class="line"><span class="attr"> targetPort:</span> <span class="string">grpc</span></span><br><span class="line"><span class="attr"> - name:</span> <span class="string">http</span></span><br><span class="line"><span class="attr"> port:</span> <span class="number">10902</span></span><br><span class="line"><span class="attr"> targetPort:</span> <span class="string">http</span></span><br><span class="line"><span class="attr"> selector:</span></span><br><span class="line"> <span class="string">app.kubernetes.io/name:</span> <span class="string">thanos-rule</span></span><br><span class="line"><span class="meta">---</span></span><br><span class="line"></span><br><span class="line"><span class="attr">apiVersion:</span> <span class="string">apps/v1</span></span><br><span class="line"><span class="attr">kind:</span> <span class="string">StatefulSet</span></span><br><span class="line"><span class="attr">metadata:</span></span><br><span class="line"><span class="attr"> labels:</span></span><br><span class="line"> <span class="string">app.kubernetes.io/name:</span> <span class="string">thanos-rule</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">thanos-rule</span></span><br><span class="line"><span class="attr"> namespace:</span> <span class="string">thanos</span></span><br><span class="line"><span class="attr">spec:</span></span><br><span class="line"><span class="attr"> replicas:</span> <span class="number">2</span></span><br><span class="line"><span class="attr"> selector:</span></span><br><span class="line"><span class="attr"> matchLabels:</span></span><br><span class="line"> <span class="string">app.kubernetes.io/name:</span> <span class="string">thanos-rule</span></span><br><span class="line"><span class="attr"> serviceName:</span> <span class="string">thanos-rule</span></span><br><span class="line"><span class="attr"> podManagementPolicy:</span> <span class="string">Parallel</span></span><br><span class="line"><span class="attr"> template:</span></span><br><span class="line"><span class="attr"> metadata:</span></span><br><span class="line"><span class="attr"> labels:</span></span><br><span class="line"> <span class="string">app.kubernetes.io/name:</span> <span class="string">thanos-rule</span></span><br><span class="line"><span class="attr"> spec:</span></span><br><span class="line"><span class="attr"> containers:</span></span><br><span class="line"><span class="attr"> - args:</span></span><br><span class="line"><span class="bullet"> -</span> <span class="string">rule</span></span><br><span class="line"><span class="bullet"> -</span> <span class="bullet">--grpc-address=0.0.0.0:10901</span></span><br><span class="line"><span class="bullet"> -</span> <span class="bullet">--http-address=0.0.0.0:10902</span></span><br><span class="line"><span class="bullet"> -</span> <span class="bullet">--rule-file=/etc/thanos/rules/*rules.yaml</span></span><br><span class="line"><span class="bullet"> -</span> <span class="bullet">--objstore.config-file=/etc/thanos/objectstorage.yaml</span></span><br><span class="line"><span class="bullet"> -</span> <span 
class="bullet">--data-dir=/var/thanos/rule</span></span><br><span class="line"><span class="bullet"> -</span> <span class="bullet">--label=rule_replica="$(NAME)"</span></span><br><span class="line"><span class="bullet"> -</span> <span class="bullet">--alert.label-drop="rule_replica"</span></span><br><span class="line"><span class="bullet"> -</span> <span class="bullet">--query=dnssrv+_http._tcp.thanos-query.thanos.svc.cluster.local</span></span><br><span class="line"><span class="attr"> env:</span></span><br><span class="line"><span class="attr"> - name:</span> <span class="string">NAME</span></span><br><span class="line"><span class="attr"> valueFrom:</span></span><br><span class="line"><span class="attr"> fieldRef:</span></span><br><span class="line"><span class="attr"> fieldPath:</span> <span class="string">metadata.name</span></span><br><span class="line"><span class="attr"> image:</span> <span class="string">thanosio/thanos:v0.11.0</span></span><br><span class="line"><span class="attr"> livenessProbe:</span></span><br><span class="line"><span class="attr"> failureThreshold:</span> <span class="number">24</span></span><br><span class="line"><span class="attr"> httpGet:</span></span><br><span class="line"><span class="attr"> path:</span> <span class="string">/-/healthy</span></span><br><span class="line"><span class="attr"> port:</span> <span class="number">10902</span></span><br><span class="line"><span class="attr"> scheme:</span> <span class="string">HTTP</span></span><br><span class="line"><span class="attr"> periodSeconds:</span> <span class="number">5</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">thanos-rule</span></span><br><span class="line"><span class="attr"> ports:</span></span><br><span class="line"><span class="attr"> - containerPort:</span> <span class="number">10901</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">grpc</span></span><br><span class="line"><span class="attr"> - containerPort:</span> <span class="number">10902</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">http</span></span><br><span class="line"><span class="attr"> readinessProbe:</span></span><br><span class="line"><span class="attr"> failureThreshold:</span> <span class="number">18</span></span><br><span class="line"><span class="attr"> httpGet:</span></span><br><span class="line"><span class="attr"> path:</span> <span class="string">/-/ready</span></span><br><span class="line"><span class="attr"> port:</span> <span class="number">10902</span></span><br><span class="line"><span class="attr"> scheme:</span> <span class="string">HTTP</span></span><br><span class="line"><span class="attr"> initialDelaySeconds:</span> <span class="number">10</span></span><br><span class="line"><span class="attr"> periodSeconds:</span> <span class="number">5</span></span><br><span class="line"><span class="attr"> terminationMessagePolicy:</span> <span class="string">FallbackToLogsOnError</span></span><br><span class="line"><span class="attr"> volumeMounts:</span></span><br><span class="line"><span class="attr"> - mountPath:</span> <span class="string">/var/thanos/rule</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">data</span></span><br><span class="line"><span class="attr"> readOnly:</span> <span class="literal">false</span></span><br><span class="line"><span class="attr"> - name:</span> <span class="string">thanos-objectstorage</span></span><br><span 
class="line"><span class="attr"> subPath:</span> <span class="string">objectstorage.yaml</span></span><br><span class="line"><span class="attr"> mountPath:</span> <span class="string">/etc/thanos/objectstorage.yaml</span></span><br><span class="line"><span class="attr"> - name:</span> <span class="string">thanos-rules</span></span><br><span class="line"><span class="attr"> mountPath:</span> <span class="string">/etc/thanos/rules</span></span><br><span class="line"><span class="attr"> volumes:</span></span><br><span class="line"><span class="attr"> - name:</span> <span class="string">thanos-objectstorage</span></span><br><span class="line"><span class="attr"> secret:</span></span><br><span class="line"><span class="attr"> secretName:</span> <span class="string">thanos-objectstorage</span></span><br><span class="line"><span class="attr"> - name:</span> <span class="string">thanos-rules</span></span><br><span class="line"><span class="attr"> configMap:</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">thanos-rules</span></span><br><span class="line"><span class="attr"> volumeClaimTemplates:</span></span><br><span class="line"><span class="attr"> - metadata:</span></span><br><span class="line"><span class="attr"> labels:</span></span><br><span class="line"> <span class="string">app.kubernetes.io/name:</span> <span class="string">thanos-rule</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">data</span></span><br><span class="line"><span class="attr"> spec:</span></span><br><span class="line"><span class="attr"> accessModes:</span></span><br><span class="line"><span class="bullet"> -</span> <span class="string">ReadWriteOnce</span></span><br><span class="line"><span class="attr"> resources:</span></span><br><span class="line"><span class="attr"> requests:</span></span><br><span class="line"><span class="attr"> storage:</span> <span class="number">100</span><span class="string">Gi</span></span><br></pre></td></tr></table></figure><ul><li>Ruler 是有状态服务,使用 Statefulset 部署,挂载磁盘以便存储根据 rule 配置计算出的新数据。</li><li>同样创建 headless service,用于 Query 对 Ruler 进行服务发现。</li><li>部署两个副本,且使用 <code>--label=rule_replica=</code> 给所有数据添加 <code>rule_replica</code> 的 label (与 Query 配置的 <code>replica_label</code> 相呼应),用于实现 Ruler 高可用。同时指定 <code>--alert.label-drop</code> 为 <code>rule_replica</code>,在触发告警发送通知给 AlertManager 时,去掉这个 label,以便让 AlertManager 自动去重 (避免重复告警)。</li><li>使用 <code>--query</code> 指定 Query 地址,这里还是用 DNS SRV 来做服务发现,但效果跟配 <code>dns+thanos-query.thanos.svc.cluster.local:9090</code> 是一样的,最终都是通过 Query 的 ClusterIP (VIP) 访问,因为它是无状态的,可以直接由 K8S 来给我们做负载均衡。</li><li>Ruler 也需要对象存储的配置,用于上传计算出的数据到对象存储,所以要挂载对象存储的配置文件。</li><li><code>--rule-file</code> 指定挂载的 rule 配置,Ruler 根据配置来生成数据和触发告警。</li></ul><p>再准备 Ruler 配置文件 <code>thanos-ruler-config.yaml</code>:</p><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span 
class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br></pre></td><td class="code"><pre><span class="line"><span class="attr">apiVersion:</span> <span class="string">v1</span></span><br><span class="line"><span class="attr">kind:</span> <span class="string">ConfigMap</span></span><br><span class="line"><span class="attr">metadata:</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">thanos-rules</span></span><br><span class="line"><span class="attr"> labels:</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">thanos-rules</span></span><br><span class="line"><span class="attr"> namespace:</span> <span class="string">thanos</span></span><br><span class="line"><span class="attr">data:</span></span><br><span class="line"> <span class="string">record.rules.yaml:</span> <span class="string">|-</span></span><br><span class="line"><span class="attr"> groups:</span></span><br><span class="line"><span class="attr"> - name:</span> <span class="string">k8s.rules</span></span><br><span class="line"><span class="attr"> rules:</span></span><br><span class="line"><span class="attr"> - expr:</span> <span class="string">|</span></span><br><span class="line"><span class="string"> sum(rate(container_cpu_usage_seconds_total{job="cadvisor", image!="", container!=""}[5m])) by (namespace)</span></span><br><span class="line"><span class="string"></span><span class="attr"> record:</span> <span class="attr">namespace:container_cpu_usage_seconds_total:sum_rate</span></span><br><span class="line"><span class="attr"> - expr:</span> <span class="string">|</span></span><br><span class="line"><span class="string"> sum(container_memory_usage_bytes{job="cadvisor", image!="", container!=""}) by (namespace)</span></span><br><span class="line"><span class="string"></span><span class="attr"> record:</span> <span class="attr">namespace:container_memory_usage_bytes:sum</span></span><br><span class="line"><span class="attr"> - expr:</span> <span class="string">|</span></span><br><span class="line"><span class="string"> sum by (namespace, pod, container) (</span></span><br><span class="line"><span class="string"> rate(container_cpu_usage_seconds_total{job="cadvisor", image!="", container!=""}[5m])</span></span><br><span class="line"><span class="string"> )</span></span><br><span class="line"><span class="string"></span><span class="attr"> record:</span> <span class="attr">namespace_pod_container:container_cpu_usage_seconds_total:sum_rate</span></span><br></pre></td></tr></table></figure><ul><li>配置内容仅为示例,根据自身情况来配置,格式基本兼容 Prometheus 的 rule 配置格式,参考: <a href="https://thanos.io/components/rule.md/#configuring-rules" target="_blank" rel="noopener">https://thanos.io/components/rule.md/#configuring-rules</a></li></ul><h3 id="安装-Compact"><a href="#安装-Compact" class="headerlink" title="安装 Compact"></a>安装 Compact</h3><p>准备 Compact 部署配置 <code>thanos-compact.yaml</code>:</p><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span 
class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span class="line">67</span><br><span class="line">68</span><br><span class="line">69</span><br><span class="line">70</span><br><span class="line">71</span><br><span class="line">72</span><br><span class="line">73</span><br><span class="line">74</span><br><span class="line">75</span><br><span class="line">76</span><br><span class="line">77</span><br><span class="line">78</span><br><span class="line">79</span><br><span class="line">80</span><br><span class="line">81</span><br><span class="line">82</span><br><span class="line">83</span><br><span class="line">84</span><br><span class="line">85</span><br><span class="line">86</span><br><span class="line">87</span><br><span class="line">88</span><br></pre></td><td class="code"><pre><span class="line"><span class="attr">apiVersion:</span> <span class="string">v1</span></span><br><span class="line"><span class="attr">kind:</span> <span class="string">Service</span></span><br><span class="line"><span class="attr">metadata:</span></span><br><span class="line"><span class="attr"> labels:</span></span><br><span class="line"> <span class="string">app.kubernetes.io/name:</span> <span class="string">thanos-compact</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">thanos-compact</span></span><br><span class="line"><span class="attr"> namespace:</span> <span class="string">thanos</span></span><br><span class="line"><span class="attr">spec:</span></span><br><span class="line"><span class="attr"> ports:</span></span><br><span class="line"><span class="attr"> - name:</span> <span class="string">http</span></span><br><span class="line"><span class="attr"> port:</span> <span class="number">10902</span></span><br><span class="line"><span class="attr"> targetPort:</span> <span class="string">http</span></span><br><span class="line"><span class="attr"> selector:</span></span><br><span class="line"> <span 
class="string">app.kubernetes.io/name:</span> <span class="string">thanos-compact</span></span><br><span class="line"><span class="meta">---</span></span><br><span class="line"></span><br><span class="line"><span class="attr">apiVersion:</span> <span class="string">apps/v1</span></span><br><span class="line"><span class="attr">kind:</span> <span class="string">StatefulSet</span></span><br><span class="line"><span class="attr">metadata:</span></span><br><span class="line"><span class="attr"> labels:</span></span><br><span class="line"> <span class="string">app.kubernetes.io/name:</span> <span class="string">thanos-compact</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">thanos-compact</span></span><br><span class="line"><span class="attr"> namespace:</span> <span class="string">thanos</span></span><br><span class="line"><span class="attr">spec:</span></span><br><span class="line"><span class="attr"> replicas:</span> <span class="number">1</span></span><br><span class="line"><span class="attr"> selector:</span></span><br><span class="line"><span class="attr"> matchLabels:</span></span><br><span class="line"> <span class="string">app.kubernetes.io/name:</span> <span class="string">thanos-compact</span></span><br><span class="line"><span class="attr"> serviceName:</span> <span class="string">thanos-compact</span></span><br><span class="line"><span class="attr"> template:</span></span><br><span class="line"><span class="attr"> metadata:</span></span><br><span class="line"><span class="attr"> labels:</span></span><br><span class="line"> <span class="string">app.kubernetes.io/name:</span> <span class="string">thanos-compact</span></span><br><span class="line"><span class="attr"> spec:</span></span><br><span class="line"><span class="attr"> containers:</span></span><br><span class="line"><span class="attr"> - args:</span></span><br><span class="line"><span class="bullet"> -</span> <span class="string">compact</span></span><br><span class="line"><span class="bullet"> -</span> <span class="bullet">--wait</span></span><br><span class="line"><span class="bullet"> -</span> <span class="bullet">--objstore.config-file=/etc/thanos/objectstorage.yaml</span></span><br><span class="line"><span class="bullet"> -</span> <span class="bullet">--data-dir=/var/thanos/compact</span></span><br><span class="line"><span class="bullet"> -</span> <span class="bullet">--debug.accept-malformed-index</span></span><br><span class="line"><span class="bullet"> -</span> <span class="bullet">--log.level=debug</span></span><br><span class="line"><span class="bullet"> -</span> <span class="bullet">--retention.resolution-raw=90d</span></span><br><span class="line"><span class="bullet"> -</span> <span class="bullet">--retention.resolution-5m=180d</span></span><br><span class="line"><span class="bullet"> -</span> <span class="bullet">--retention.resolution-1h=360d</span></span><br><span class="line"><span class="attr"> image:</span> <span class="string">thanosio/thanos:v0.11.0</span></span><br><span class="line"><span class="attr"> livenessProbe:</span></span><br><span class="line"><span class="attr"> failureThreshold:</span> <span class="number">4</span></span><br><span class="line"><span class="attr"> httpGet:</span></span><br><span class="line"><span class="attr"> path:</span> <span class="string">/-/healthy</span></span><br><span class="line"><span class="attr"> port:</span> <span class="number">10902</span></span><br><span class="line"><span class="attr"> scheme:</span> <span 
class="string">HTTP</span></span><br><span class="line"><span class="attr"> periodSeconds:</span> <span class="number">30</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">thanos-compact</span></span><br><span class="line"><span class="attr"> ports:</span></span><br><span class="line"><span class="attr"> - containerPort:</span> <span class="number">10902</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">http</span></span><br><span class="line"><span class="attr"> readinessProbe:</span></span><br><span class="line"><span class="attr"> failureThreshold:</span> <span class="number">20</span></span><br><span class="line"><span class="attr"> httpGet:</span></span><br><span class="line"><span class="attr"> path:</span> <span class="string">/-/ready</span></span><br><span class="line"><span class="attr"> port:</span> <span class="number">10902</span></span><br><span class="line"><span class="attr"> scheme:</span> <span class="string">HTTP</span></span><br><span class="line"><span class="attr"> periodSeconds:</span> <span class="number">5</span></span><br><span class="line"><span class="attr"> terminationMessagePolicy:</span> <span class="string">FallbackToLogsOnError</span></span><br><span class="line"><span class="attr"> volumeMounts:</span></span><br><span class="line"><span class="attr"> - mountPath:</span> <span class="string">/var/thanos/compact</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">data</span></span><br><span class="line"><span class="attr"> readOnly:</span> <span class="literal">false</span></span><br><span class="line"><span class="attr"> - name:</span> <span class="string">thanos-objectstorage</span></span><br><span class="line"><span class="attr"> subPath:</span> <span class="string">objectstorage.yaml</span></span><br><span class="line"><span class="attr"> mountPath:</span> <span class="string">/etc/thanos/objectstorage.yaml</span></span><br><span class="line"><span class="attr"> terminationGracePeriodSeconds:</span> <span class="number">120</span></span><br><span class="line"><span class="attr"> volumes:</span></span><br><span class="line"><span class="attr"> - name:</span> <span class="string">thanos-objectstorage</span></span><br><span class="line"><span class="attr"> secret:</span></span><br><span class="line"><span class="attr"> secretName:</span> <span class="string">thanos-objectstorage</span></span><br><span class="line"><span class="attr"> volumeClaimTemplates:</span></span><br><span class="line"><span class="attr"> - metadata:</span></span><br><span class="line"><span class="attr"> labels:</span></span><br><span class="line"> <span class="string">app.kubernetes.io/name:</span> <span class="string">thanos-compact</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">data</span></span><br><span class="line"><span class="attr"> spec:</span></span><br><span class="line"><span class="attr"> accessModes:</span></span><br><span class="line"><span class="bullet"> -</span> <span class="string">ReadWriteOnce</span></span><br><span class="line"><span class="attr"> resources:</span></span><br><span class="line"><span class="attr"> requests:</span></span><br><span class="line"><span class="attr"> storage:</span> <span class="number">100</span><span class="string">Gi</span></span><br></pre></td></tr></table></figure><ul><li>Compact 只能部署单个副本,因为如果多个副本都去对对象存储的数据做压缩和降采样的话,会造成冲突。</li><li>使用 StatefulSet 
部署,方便自动创建和挂载磁盘。磁盘用于存放临时数据,因为 Compact 需要一些磁盘空间来存放数据处理过程中产生的中间数据。</li><li><code>--wait</code> 让 Compact 一直运行,轮询新数据来做压缩和降采样。</li><li>Compact 也需要对象存储的配置,用于读取对象存储数据以及上传压缩和降采样后的数据到对象存储。</li><li>创建一个普通 service,主要用于被 Prometheus 使用 kubernetes 的 endpoints 服务发现来采集指标(其它组件的 service 也一样有这个用途)。</li><li><code>--retention.resolution-raw</code> 指定原始数据存放时长,<code>--retention.resolution-5m</code> 指定降采样到数据点 5 分钟间隔的数据存放时长,<code>--retention.resolution-1h</code> 指定降采样到数据点 1 小时间隔的数据存放时长,它们的数据精细程度递减,占用的存储空间也是递减,通常建议它们的存放时间递增配置 (一般只有比较新的数据才会放大看,久远的数据通常只会使用大时间范围查询来看个大致,所以建议将精细程度低的数据存放更长时间)</li></ul><h3 id="安装-Receiver"><a href="#安装-Receiver" class="headerlink" title="安装 Receiver"></a>安装 Receiver</h3><p>该组件处于试验阶段,慎用。准备 Receiver 部署配置 <code>thanos-receiver.yaml</code>:</p><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span class="line">67</span><br><span class="line">68</span><br><span class="line">69</span><br><span class="line">70</span><br><span class="line">71</span><br><span class="line">72</span><br><span class="line">73</span><br><span class="line">74</span><br><span class="line">75</span><br><span class="line">76</span><br><span class="line">77</span><br><span class="line">78</span><br><span class="line">79</span><br><span class="line">80</span><br><span class="line">81</span><br><span class="line">82</span><br><span class="line">83</span><br><span class="line">84</span><br><span class="line">85</span><br><span 
class="line">86</span><br><span class="line">87</span><br><span class="line">88</span><br><span class="line">89</span><br><span class="line">90</span><br><span class="line">91</span><br><span class="line">92</span><br><span class="line">93</span><br><span class="line">94</span><br><span class="line">95</span><br><span class="line">96</span><br><span class="line">97</span><br><span class="line">98</span><br><span class="line">99</span><br><span class="line">100</span><br><span class="line">101</span><br><span class="line">102</span><br><span class="line">103</span><br><span class="line">104</span><br><span class="line">105</span><br><span class="line">106</span><br><span class="line">107</span><br><span class="line">108</span><br><span class="line">109</span><br><span class="line">110</span><br><span class="line">111</span><br><span class="line">112</span><br><span class="line">113</span><br><span class="line">114</span><br><span class="line">115</span><br><span class="line">116</span><br><span class="line">117</span><br><span class="line">118</span><br><span class="line">119</span><br><span class="line">120</span><br><span class="line">121</span><br><span class="line">122</span><br><span class="line">123</span><br><span class="line">124</span><br><span class="line">125</span><br><span class="line">126</span><br><span class="line">127</span><br><span class="line">128</span><br><span class="line">129</span><br><span class="line">130</span><br><span class="line">131</span><br><span class="line">132</span><br><span class="line">133</span><br><span class="line">134</span><br><span class="line">135</span><br><span class="line">136</span><br><span class="line">137</span><br><span class="line">138</span><br><span class="line">139</span><br><span class="line">140</span><br><span class="line">141</span><br><span class="line">142</span><br></pre></td><td class="code"><pre><span class="line"><span class="attr">apiVersion:</span> <span class="string">v1</span></span><br><span class="line"><span class="attr">kind:</span> <span class="string">ConfigMap</span></span><br><span class="line"><span class="attr">metadata:</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">thanos-receive-hashrings</span></span><br><span class="line"><span class="attr"> namespace:</span> <span class="string">thanos</span></span><br><span class="line"><span class="attr">data:</span></span><br><span class="line"> <span class="string">thanos-receive-hashrings.json:</span> <span class="string">|</span></span><br><span class="line"><span class="string"> [</span></span><br><span class="line"><span class="string"> {</span></span><br><span class="line"><span class="string"> "hashring": "soft-tenants",</span></span><br><span class="line"><span class="string"> "endpoints":</span></span><br><span class="line"><span class="string"> [</span></span><br><span class="line"><span class="string"> "thanos-receive-0.thanos-receive.kube-system.svc.cluster.local:10901",</span></span><br><span class="line"><span class="string"> "thanos-receive-1.thanos-receive.kube-system.svc.cluster.local:10901",</span></span><br><span class="line"><span class="string"> "thanos-receive-2.thanos-receive.kube-system.svc.cluster.local:10901"</span></span><br><span class="line"><span class="string"> ]</span></span><br><span class="line"><span class="string"> }</span></span><br><span class="line"><span class="string"> ]</span></span><br><span class="line"><span class="string">---</span></span><br><span class="line"><span 
class="string"></span></span><br><span class="line"><span class="string"></span><span class="attr">apiVersion:</span> <span class="string">v1</span></span><br><span class="line"><span class="attr">kind:</span> <span class="string">Service</span></span><br><span class="line"><span class="attr">metadata:</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">thanos-receive</span></span><br><span class="line"><span class="attr"> namespace:</span> <span class="string">thanos</span></span><br><span class="line"><span class="attr"> labels:</span></span><br><span class="line"> <span class="string">kubernetes.io/name:</span> <span class="string">thanos-receive</span></span><br><span class="line"><span class="attr">spec:</span></span><br><span class="line"><span class="attr"> ports:</span></span><br><span class="line"><span class="attr"> - name:</span> <span class="string">http</span></span><br><span class="line"><span class="attr"> port:</span> <span class="number">10902</span></span><br><span class="line"><span class="attr"> protocol:</span> <span class="string">TCP</span></span><br><span class="line"><span class="attr"> targetPort:</span> <span class="number">10902</span></span><br><span class="line"><span class="attr"> - name:</span> <span class="string">remote-write</span></span><br><span class="line"><span class="attr"> port:</span> <span class="number">19291</span></span><br><span class="line"><span class="attr"> protocol:</span> <span class="string">TCP</span></span><br><span class="line"><span class="attr"> targetPort:</span> <span class="number">19291</span></span><br><span class="line"><span class="attr"> - name:</span> <span class="string">grpc</span></span><br><span class="line"><span class="attr"> port:</span> <span class="number">10901</span></span><br><span class="line"><span class="attr"> protocol:</span> <span class="string">TCP</span></span><br><span class="line"><span class="attr"> targetPort:</span> <span class="number">10901</span></span><br><span class="line"><span class="attr"> selector:</span></span><br><span class="line"> <span class="string">kubernetes.io/name:</span> <span class="string">thanos-receive</span></span><br><span class="line"><span class="attr"> clusterIP:</span> <span class="string">None</span></span><br><span class="line"><span class="meta">---</span></span><br><span class="line"></span><br><span class="line"><span class="attr">apiVersion:</span> <span class="string">apps/v1</span></span><br><span class="line"><span class="attr">kind:</span> <span class="string">StatefulSet</span></span><br><span class="line"><span class="attr">metadata:</span></span><br><span class="line"><span class="attr"> labels:</span></span><br><span class="line"> <span class="string">kubernetes.io/name:</span> <span class="string">thanos-receive</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">thanos-receive</span></span><br><span class="line"><span class="attr"> namespace:</span> <span class="string">thanos</span></span><br><span class="line"><span class="attr">spec:</span></span><br><span class="line"><span class="attr"> replicas:</span> <span class="number">3</span></span><br><span class="line"><span class="attr"> selector:</span></span><br><span class="line"><span class="attr"> matchLabels:</span></span><br><span class="line"> <span class="string">kubernetes.io/name:</span> <span class="string">thanos-receive</span></span><br><span class="line"><span class="attr"> serviceName:</span> <span 
class="string">thanos-receive</span></span><br><span class="line"><span class="attr"> template:</span></span><br><span class="line"><span class="attr"> metadata:</span></span><br><span class="line"><span class="attr"> labels:</span></span><br><span class="line"> <span class="string">kubernetes.io/name:</span> <span class="string">thanos-receive</span></span><br><span class="line"><span class="attr"> spec:</span></span><br><span class="line"><span class="attr"> containers:</span></span><br><span class="line"><span class="attr"> - args:</span></span><br><span class="line"><span class="bullet"> -</span> <span class="string">receive</span></span><br><span class="line"><span class="bullet"> -</span> <span class="bullet">--grpc-address=0.0.0.0:10901</span></span><br><span class="line"><span class="bullet"> -</span> <span class="bullet">--http-address=0.0.0.0:10902</span></span><br><span class="line"><span class="bullet"> -</span> <span class="bullet">--remote-write.address=0.0.0.0:19291</span></span><br><span class="line"><span class="bullet"> -</span> <span class="bullet">--objstore.config-file=/etc/thanos/objectstorage.yaml</span></span><br><span class="line"><span class="bullet"> -</span> <span class="bullet">--tsdb.path=/var/thanos/receive</span></span><br><span class="line"><span class="bullet"> -</span> <span class="bullet">--tsdb.retention=12h</span></span><br><span class="line"><span class="bullet"> -</span> <span class="bullet">--label=receive_replica="$(NAME)"</span></span><br><span class="line"><span class="bullet"> -</span> <span class="bullet">--label=receive="true"</span></span><br><span class="line"><span class="bullet"> -</span> <span class="bullet">--receive.hashrings-file=/etc/thanos/thanos-receive-hashrings.json</span></span><br><span class="line"><span class="bullet"> -</span> <span class="bullet">--receive.local-endpoint=$(NAME).thanos-receive.thanos.svc.cluster.local:10901</span></span><br><span class="line"><span class="attr"> env:</span></span><br><span class="line"><span class="attr"> - name:</span> <span class="string">NAME</span></span><br><span class="line"><span class="attr"> valueFrom:</span></span><br><span class="line"><span class="attr"> fieldRef:</span></span><br><span class="line"><span class="attr"> fieldPath:</span> <span class="string">metadata.name</span></span><br><span class="line"><span class="attr"> image:</span> <span class="string">thanosio/thanos:v0.11.0</span></span><br><span class="line"><span class="attr"> livenessProbe:</span></span><br><span class="line"><span class="attr"> failureThreshold:</span> <span class="number">4</span></span><br><span class="line"><span class="attr"> httpGet:</span></span><br><span class="line"><span class="attr"> path:</span> <span class="string">/-/healthy</span></span><br><span class="line"><span class="attr"> port:</span> <span class="number">10902</span></span><br><span class="line"><span class="attr"> scheme:</span> <span class="string">HTTP</span></span><br><span class="line"><span class="attr"> periodSeconds:</span> <span class="number">30</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">thanos-receive</span></span><br><span class="line"><span class="attr"> ports:</span></span><br><span class="line"><span class="attr"> - containerPort:</span> <span class="number">10901</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">grpc</span></span><br><span class="line"><span class="attr"> - containerPort:</span> <span 
class="number">10902</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">http</span></span><br><span class="line"><span class="attr"> - containerPort:</span> <span class="number">19291</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">remote-write</span></span><br><span class="line"><span class="attr"> readinessProbe:</span></span><br><span class="line"><span class="attr"> httpGet:</span></span><br><span class="line"><span class="attr"> path:</span> <span class="string">/-/ready</span></span><br><span class="line"><span class="attr"> port:</span> <span class="number">10902</span></span><br><span class="line"><span class="attr"> scheme:</span> <span class="string">HTTP</span></span><br><span class="line"><span class="attr"> initialDelaySeconds:</span> <span class="number">10</span></span><br><span class="line"><span class="attr"> periodSeconds:</span> <span class="number">30</span></span><br><span class="line"><span class="attr"> resources:</span></span><br><span class="line"><span class="attr"> limits:</span></span><br><span class="line"><span class="attr"> cpu:</span> <span class="string">"4"</span></span><br><span class="line"><span class="attr"> memory:</span> <span class="number">8</span><span class="string">Gi</span></span><br><span class="line"><span class="attr"> requests:</span></span><br><span class="line"><span class="attr"> cpu:</span> <span class="string">"2"</span></span><br><span class="line"><span class="attr"> memory:</span> <span class="number">4</span><span class="string">Gi</span></span><br><span class="line"><span class="attr"> volumeMounts:</span></span><br><span class="line"><span class="attr"> - mountPath:</span> <span class="string">/var/thanos/receive</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">data</span></span><br><span class="line"><span class="attr"> readOnly:</span> <span class="literal">false</span></span><br><span class="line"><span class="attr"> - mountPath:</span> <span class="string">/etc/thanos/thanos-receive-hashrings.json</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">thanos-receive-hashrings</span></span><br><span class="line"><span class="attr"> subPath:</span> <span class="string">thanos-receive-hashrings.json</span></span><br><span class="line"><span class="attr"> - mountPath:</span> <span class="string">/etc/thanos/objectstorage.yaml</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">thanos-objectstorage</span></span><br><span class="line"><span class="attr"> subPath:</span> <span class="string">objectstorage.yaml</span></span><br><span class="line"><span class="attr"> terminationGracePeriodSeconds:</span> <span class="number">120</span></span><br><span class="line"><span class="attr"> volumes:</span></span><br><span class="line"><span class="attr"> - configMap:</span></span><br><span class="line"><span class="attr"> defaultMode:</span> <span class="number">420</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">thanos-receive-hashrings</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">thanos-receive-hashrings</span></span><br><span class="line"><span class="attr"> - name:</span> <span class="string">thanos-objectstorage</span></span><br><span class="line"><span class="attr"> secret:</span></span><br><span class="line"><span class="attr"> secretName:</span> <span 
class="string">thanos-objectstorage</span></span><br><span class="line"><span class="attr"> volumeClaimTemplates:</span></span><br><span class="line"><span class="attr"> - metadata:</span></span><br><span class="line"><span class="attr"> labels:</span></span><br><span class="line"> <span class="string">app.kubernetes.io/name:</span> <span class="string">thanos-receive</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">data</span></span><br><span class="line"><span class="attr"> spec:</span></span><br><span class="line"><span class="attr"> accessModes:</span></span><br><span class="line"><span class="bullet"> -</span> <span class="string">ReadWriteOnce</span></span><br><span class="line"><span class="attr"> resources:</span></span><br><span class="line"><span class="attr"> requests:</span></span><br><span class="line"><span class="attr"> storage:</span> <span class="number">200</span><span class="string">Gi</span></span><br></pre></td></tr></table></figure><ul><li>部署 3 个副本, 配置 hashring, <code>--label=receive_replica</code> 为数据添加 <code>receive_replica</code> 这个 label (Query 的 <code>--query.replica-label</code> 也要加上这个) 来实现 Receiver 的高可用。</li><li>Query 要指定 Receiver 后端地址: <code>--store=dnssrv+_grpc._tcp.thanos-receive.thanos.svc.cluster.local</code></li><li>request, limit 根据自身规模情况自行做适当调整。</li><li><code>--tsdb.retention</code> 根据自身需求调整最新数据的保留时间。</li><li>如果改命名空间,记得把 Receiver 的 <code>--receive.local-endpoint</code> 参数也改下,不然会疯狂报错直至 OOMKilled。</li></ul><p>因为使用了 Receiver 来统一接收 Prometheus 的数据,所以 Prometheus 也不需要 Sidecar 了,但需要给 Prometheus 配置文件里加下 <code>remote_write</code>,让 Prometheus 将数据 push 给 Receiver:</p><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line"><span class="attr">remote_write:</span></span><br><span class="line"><span class="attr">- url:</span> <span class="attr">http://thanos-receive.thanos.svc.cluster.local:19291/api/v1/receive</span></span><br></pre></td></tr></table></figure><h3 id="指定-Query-为数据源"><a href="#指定-Query-为数据源" class="headerlink" title="指定 Query 为数据源"></a>指定 Query 为数据源</h3><p>查询监控数据时需要指定 Prometheus 数据源地址,由于我们使用了 Thanos 来做分布式,而 Thanos 关键查询入口就是 Query,所以我们需要将数据源地址指定为 Query 的地址,假如使用 Grafana 查询,进入 <code>Configuration</code>-<code>Data Sources</code>-<code>Add data source</code>,选择 Prometheus,指定 thanos query 的地址: <code>http://thanos-query.thanos.svc.cluster.local:9090</code></p><h2 id="总结"><a href="#总结" class="headerlink" title="总结"></a>总结</h2><p>本文教了大家如何选型 Thanos 部署方案并详细讲解了各个组件的安装方法,如果仔细阅读完本系列文章,我相信你已经有能力搭建并运维一套大型监控系统了。</p>]]></content>
<summary type="html">
<p>作者: <a href="https://imroc.io/" target="_blank" rel="noopener">陈鹏</a></p>
<h2 id="视频"><a href="#视频" class="headerlink" title="视频"></a>视频<
</summary>
</entry>
<entry>
<title>打造云原生大型分布式监控系统(二): Thanos 架构详解</title>
<link href="https://TencentCloudContainerTeam.github.io/2020/04/06/build-cloud-native-large-scale-distributed-monitoring-system-2/"/>
<id>https://TencentCloudContainerTeam.github.io/2020/04/06/build-cloud-native-large-scale-distributed-monitoring-system-2/</id>
<published>2020-04-06T05:00:00.000Z</published>
<updated>2020-06-16T01:53:49.335Z</updated>
<content type="html"><![CDATA[<p>作者: <a href="https://imroc.io/" target="_blank" rel="noopener">陈鹏</a></p><h2 id="概述"><a href="#概述" class="headerlink" title="概述"></a>概述</h2><p>之前在 <a href="https://tencentcloudcontainerteam.github.io/2020/03/27/build-cloud-native-large-scale-distributed-monitoring-system-1/">大规模场景下 Prometheus 的优化手段</a> 中,我们想尽 “千方百计” 才好不容易把 Prometheus 优化到适配大规模场景,部署和后期维护麻烦且复杂不说,还有很多不完美的地方,并且还无法满足一些更高级的诉求,比如查看时间久远的监控数据,对于一些时间久远不常用的 “冷数据”,最理想的方式就是存到廉价的对象存储中,等需要查询的时候能够自动加载出来。</p><p>Thanos (没错,就是灭霸) 可以帮我们简化分布式 Prometheus 的部署与管理,并提供了一些的高级特性:<strong>全局视图</strong>,<strong>长期存储</strong>,<strong>高可用</strong>。下面我们来详细讲解一下。</p><h2 id="Thanos-架构"><a href="#Thanos-架构" class="headerlink" title="Thanos 架构"></a>Thanos 架构</h2><p>这是官方给出的架构图:</p><p><img src="https://imroc.io/assets/blog/thanos-arch.jpg" alt=""></p><p>这张图中包含了 Thanos 的几个核心组件,但并不包括所有组件,为了便于理解,我们先不细讲,简单介绍下图中这几个组件的作用:</p><ul><li>Thanos Query: 实现了 Prometheus API,将来自下游组件提供的数据进行聚合最终返回给查询数据的 client (如 grafana),类似数据库中间件。</li><li>Thanos Sidecar: 连接 Prometheus,将其数据提供给 Thanos Query 查询,并且/或者将其上传到对象存储,以供长期存储。</li><li>Thanos Store Gateway: 将对象存储的数据暴露给 Thanos Query 去查询。</li><li>Thanos Ruler: 对监控数据进行评估和告警,还可以计算出新的监控数据,将这些新数据提供给 Thanos Query 查询并且/或者上传到对象存储,以供长期存储。</li><li>Thanos Compact: 将对象存储中的数据进行压缩和降低采样率,加速大时间区间监控数据查询的速度。</li></ul><h2 id="架构设计剖析"><a href="#架构设计剖析" class="headerlink" title="架构设计剖析"></a>架构设计剖析</h2><p>如何理解 Thanos 的架构设计的?我们可以自己先 YY 一下,要是自己来设计一个分布式 Prometheus 管理应用,会怎么做?</p><h3 id="Query-与-Sidecar"><a href="#Query-与-Sidecar" class="headerlink" title="Query 与 Sidecar"></a>Query 与 Sidecar</h3><p>首先,监控数据的查询肯定不能直接查 Prometheus 了,因为会存在许多个 Prometheus 实例,每个 Prometheus 实例只能感知它自己所采集的数据。我们可以比较容易联想到数据库中间件,每个数据库都只存了一部分数据,中间件能感知到所有数据库,数据查询都经过数据库中间件来查,这个中间件收到查询请求再去查下游各个数据库中的数据,最后将这些数据聚合汇总返回给查询的客户端,这样就实现了将分布式存储的数据集中查询。</p><p>实际上,Thanos 也是使用了类似的设计思想,Thanos Query 就是这个 “中间件” 的关键入口。它实现了 Prometheus 的 HTTP API,能够 “看懂” PromQL。这样,查询 Prometheus 监控数据的 client 就不直接查询 Prometheus 本身了,而是去查询 Thanos Query,Thanos Query 再去下游多个存储了数据的地方查数据,最后将这些数据聚合去重后返回给 client,也就实现了分布式 Prometheus 的数据查询。</p><p>那么 Thanos Query 又如何去查下游分散的数据呢?Thanos 为此抽象了一套叫 Store API 的内部 gRPC 接口,其它一些组件通过这个接口来暴露数据给 Thanos Query,它自身也就可以做到完全无状态部署,实现高可用与动态扩展。</p><p><img src="https://imroc.io/assets/blog/thanos-querier.svg" alt=""></p><p>这些分散的数据可能来自哪些地方呢?首先,Prometheus 会将采集的数据存到本机磁盘上,如果我们直接用这些分散在各个磁盘上的数据,可以给每个 Prometheus 附带部署一个 Sidecar,这个 Sidecar 实现 Thanos Store API,当 Thanos Query 对其发起查询时,Sidecar 就读取跟它绑定部署的 Prometheus 实例上的监控数据返回给 Thanos Query。</p><p><img src="https://imroc.io/assets/blog/thanos-sidecar.png" alt=""></p><p>由于 Thanos Query 可以对数据进行聚合与去重,所以可以很轻松实现高可用:相同的 Prometheus 部署多个副本(都附带 Sidecar),然后 Thanos Query 去所有 Sidecar 查数据,即便有一个 Prometheus 实例挂掉过一段时间,数据聚合与去重后仍然能得到完整数据。</p><p>这种高可用做法还弥补了我们上篇文章中用负载均衡去实现 Prometheus 高可用方法的缺陷:如果其中一个 Prometheus 实例挂了一段时间然后又恢复了,它的数据就不完整,当负载均衡转发到它上面去查数据时,返回的结果就可能会有部分缺失。</p><p>不过因为磁盘空间有限,所以 Prometheus 存储监控数据的能力也是有限的,通常会给 Prometheus 设置一个数据过期时间 (默认15天) 或者最大数据量大小,不断清理旧数据以保证磁盘不被撑爆。因此,我们无法看到时间比较久远的监控数据,有时候这也给我们的问题排查和数据统计造成一些困难。</p><p>对于需要长期存储的数据,并且使用频率不那么高,最理想的方式是存进对象存储,各大云厂商都有对象存储服务,特点是不限制容量,价格非常便宜。</p><p>Thanos 有几个组件都支持将数据上传到各种对象存储以供长期保存 (Prometheus TSDB 数据格式),比如我们刚刚说的 Sidecar:</p><p><img src="https://imroc.io/assets/blog/thanos-sidecar-with-objectstore.png" alt=""></p><h3 id="Store-Gateway"><a href="#Store-Gateway" class="headerlink" title="Store Gateway"></a>Store Gateway</h3><p>那么这些被上传到了对象存储里的监控数据该如何查询呢?理论上 Thanos Query 也可以直接去对象存储查,但会让 Thanos Query 的逻辑变的很重。我们刚才也看到了,Thanos 抽象出了 Store API,只要实现了该接口的组件都可以作为 Thanos Query 查询的数据源,Thanos Store Gateway 这个组件也实现了 Store API,向 Thanos Query 
暴露对象存储的数据。Thanos Store Gateway 内部还做了一些加速数据获取的优化逻辑,一是缓存了 TSDB 索引,二是优化了对象存储的请求 (用尽可能少的请求量拿到所有需要的数据)。</p><p><img src="https://imroc.io/assets/blog/thanos-store-gateway.png" alt=""></p><p>这样就实现了监控数据的长期储存,由于对象存储容量无限,所以理论上我们可以存任意时长的数据,监控历史数据也就变得可追溯查询,便于问题排查与统计分析。</p><h3 id="Ruler"><a href="#Ruler" class="headerlink" title="Ruler"></a>Ruler</h3><p>有一个问题,Prometheus 不仅仅只支持将采集的数据进行存储和查询的功能,还可以配置一些 rules:</p><ol><li>根据配置不断计算出新指标数据并存储,后续查询时直接使用计算好的新指标,这样可以减轻查询时的计算压力,加快查询速度。</li><li>不断计算和评估是否达到告警阀值,当达到阀值时就通知 AlertManager 来触发告警。</li></ol><p>由于我们将 Prometheus 进行分布式部署,每个 Prometheus 实例本地并没有完整数据,有些有关联的数据可能存在多个 Prometheus 实例中,单机 Prometheus 看不到数据的全局视图,这种情况我们就不能依赖 Prometheus 来做这些工作,Thanos Ruler 应运而生,它通过查询 Thanos Query 获取全局数据,然后根据 rules 配置计算新指标并存储,同时也通过 Store API 将数据暴露给 Thanos Query,同样还可以将数据上传到对象存储以供长期保存 (这里上传到对象存储中的数据一样也是通过 Thanos Store Gateway 暴露给 Thanos Query)。</p><p><img src="https://imroc.io/assets/blog/thanos-ruler.png" alt=""></p><p>看起来 Thanos Query 跟 Thanos Ruler 之间会相互查询,不过这个不冲突,Thanos Ruler 为 Thanos Query 提供计算出的新指标数据,而 Thanos Query 为 Thanos Ruler 提供计算新指标所需要的全局原始指标数据。</p><p>至此,Thanos 的核心能力基本实现了,完全兼容 Prometheus 的情况下提供数据查询的全局视图,高可用以及数据的长期保存。</p><p>看下还可以怎么进一步做下优化呢?</p><h3 id="Compact"><a href="#Compact" class="headerlink" title="Compact"></a>Compact</h3><p>由于我们有数据长期存储的能力,也就可以实现查询较大时间范围的监控数据,当时间范围很大时,查询的数据量也会很大,这会导致查询速度非常慢。通常在查看较大时间范围的监控数据时,我们并不需要那么详细的数据,只需要看到大致就行。Thanos Compact 这个组件应运而生,它读取对象存储的数据,对其进行压缩以及降采样再上传到对象存储,这样在查询大时间范围数据时就可以只读取压缩和降采样后的数据,极大地减少了查询的数据量,从而加速查询。</p><p><img src="https://imroc.io/assets/blog/thanos-compact.png" alt=""></p><h3 id="再看架构图"><a href="#再看架构图" class="headerlink" title="再看架构图"></a>再看架构图</h3><p>上面我们剖析了官方架构图中各个组件的设计,现在再来回味一下这张图:</p><p><img src="https://imroc.io/assets/blog/thanos-arch.jpg" alt=""></p><p>理解是否更加深刻了?</p><p>另外还有 Thanos Bucket 和 Thanos Checker 两个辅助性的工具组件没画出来,它们不是核心组件,这里也就不再赘述。</p><h2 id="Sidecar-模式与-Receiver-模式"><a href="#Sidecar-模式与-Receiver-模式" class="headerlink" title="Sidecar 模式与 Receiver 模式"></a>Sidecar 模式与 Receiver 模式</h2><p>前面我们理解了官方的架构图,但其中还缺失一个核心组件 Thanos Receiver,因为它是一个还未完全发布的组件。这是它的设计文档: <a href="https://thanos.io/proposals/201812_thanos-remote-receive.md/" target="_blank" rel="noopener">https://thanos.io/proposals/201812_thanos-remote-receive.md/</a></p><p>这个组件可以完全消除 Sidecar,所以 Thanos 实际有两种架构图,只是因为没有完全发布,官方的架构图只给的 Sidecar 模式。</p><p>Receiver 是做什么的呢?为什么需要 Receiver?它跟 Sidecar 有什么区别?</p><p>它们都可以将数据上传到对象存储以供长期保存,区别在于最新数据的存储。</p><p>由于数据上传不可能实时,Sidecar 模式将最新的监控数据存到 Prometheus 本机,Query 通过调所有 Sidecar 的 Store API 来获取最新数据,这就成一个问题:如果 Sidecar 数量非常多或者 Sidecar 跟 Query 离的比较远,每次查询 Query 都调所有 Sidecar 会消耗很多资源,并且速度很慢,而我们查看监控大多数情况都是看的最新数据。</p><p>为了解决这个问题,Thanos Receiver 组件被提出,它适配了 Prometheus 的 remote write API,也就是所有 Prometheus 实例可以实时将数据 push 到 Thanos Receiver,最新数据也得以集中起来,然后 Thanos Query 也不用去所有 Sidecar 查最新数据了,直接查 Thanos Receiver 即可。另外,Thanos Receiver 也将数据上传到对象存储以供长期保存,当然,对象存储中的数据同样由 Thanos Store Gateway 暴露给 Thanos Query。</p><p><img src="https://imroc.io/assets/blog/thanos-receiver.png" alt=""></p><p>有同学可能会问:如果规模很大,Receiver 压力会不会很大,成为性能瓶颈?当然设计这个组件时肯定会考虑这个问题,Receiver 实现了一致性哈希,支持集群部署,所以即使规模很大也不会成为性能瓶颈。</p><h2 id="总结"><a href="#总结" class="headerlink" title="总结"></a>总结</h2><p>本文详细讲解了 Thanos 的架构设计,各个组件的作用以及为什么要这么设计。如果仔细看完,我相信你已经 get 到了 Thanos 的精髓,不过我们还没开始讲如何部署与实践,实际上在腾讯云容器服务的多个产品的内部监控已经在使用 Thanos 了,比如 <a href="https://cloud.tencent.com/product/tke" target="_blank" rel="noopener">TKE</a> (公有云 k8s)、<a href="https://github.com/tkestack/tke" target="_blank" rel="noopener">TKEStack</a> (私有云 k8s)、<a href="https://console.cloud.tencent.com/tke2/ecluster" target="_blank" 
rel="noopener">EKS</a> (Serverless k8s)。 下一篇我们将介绍 Thanos 的部署与最佳实践,敬请期待。</p>]]></content>
<summary type="html">
<p>作者: <a href="https://imroc.io/" target="_blank" rel="noopener">陈鹏</a></p>
<h2 id="概述"><a href="#概述" class="headerlink" title="概述"></a>概述<
</summary>
</entry>
<entry>
<title>打造云原生大型分布式监控系统(一): 大规模场景下 Prometheus 的优化手段</title>
<link href="https://TencentCloudContainerTeam.github.io/2020/03/27/build-cloud-native-large-scale-distributed-monitoring-system-1/"/>
<id>https://TencentCloudContainerTeam.github.io/2020/03/27/build-cloud-native-large-scale-distributed-monitoring-system-1/</id>
<published>2020-03-27T04:00:00.000Z</published>
<updated>2020-06-16T01:53:49.335Z</updated>
<content type="html"><![CDATA[<p>作者: <a href="https://imroc.io/" target="_blank" rel="noopener">陈鹏</a></p><h2 id="概述"><a href="#概述" class="headerlink" title="概述"></a>概述</h2><p>Prometheus 几乎已成为监控领域的事实标准,它自带高效的时序数据库存储,可以让单台 Prometheus 能够高效的处理大量的数据,还有友好并且强大的 PromQL 语法,可以用来灵活的查询各种监控数据以及配置告警规则,同时它的 pull 模型指标采集方式被广泛采纳,非常多的应用都实现了 Prometheus 的 metrics 接口以暴露自身各项数据指标让 Prometheus 去采集,很多没有适配的应用也会有第三方 exporter 帮它去适配 Prometheus,所以监控系统我们通常首选用 Prometheus,本系列文章也将基于 Prometheus 来打造云原生环境下的大型分布式监控系统。</p><h2 id="大规模场景下-Prometheus-的痛点"><a href="#大规模场景下-Prometheus-的痛点" class="headerlink" title="大规模场景下 Prometheus 的痛点"></a>大规模场景下 Prometheus 的痛点</h2><p>Prometheus 本身只支持单机部署,没有自带支持集群部署,也就不支持高可用以及水平扩容,在大规模场景下,最让人关心的问题是它的存储空间也受限于单机磁盘容量,磁盘容量决定了单个 Prometheus 所能存储的数据量,数据量大小又取决于被采集服务的指标数量、服务数量、采集速率以及数据过期时间。在数据量大的情况下,我们可能就需要做很多取舍,比如丢弃不重要的指标、降低采集速率、设置较短的数据过期时间(默认只保留15天的数据,看不到比较久远的监控数据)。</p><p>这些痛点实际也是可以通过一些优化手段来改善的,下面我们来细讲一下。</p><h2 id="从服务维度拆分-Prometheus"><a href="#从服务维度拆分-Prometheus" class="headerlink" title="从服务维度拆分 Prometheus"></a>从服务维度拆分 Prometheus</h2><p>Prometheus 主张根据功能或服务维度进行拆分,即如果要采集的服务比较多,一个 Prometheus 实例就配置成仅采集和存储某一个或某一部分服务的指标,这样根据要采集的服务将 Prometheus 拆分成多个实例分别去采集,也能一定程度上达到水平扩容的目的。</p><p><img src="https://imroc.io/assets/blog/prometheus-divide.png" alt=""></p><p>通常这样的扩容方式已经能满足大部分场景的需求了,毕竟单机 Prometheus 就能采集和处理很多数据了,很少有 Prometheus 撑不住单个服务的场景。不过在超大规模集群下,有些单个服务的体量也很大,就需要进一步拆分了,我们下面来继续讲下如何再拆分。</p><h2 id="对超大规模的服务做分片"><a href="#对超大规模的服务做分片" class="headerlink" title="对超大规模的服务做分片"></a>对超大规模的服务做分片</h2><p>想象一下,如果集群节点数量达到上千甚至几千的规模,对于一些节点级服务暴露的指标,比如 kubelet 内置的 cadvisor 暴露的容器相关的指标,又或者部署的 DeamonSet <code>node-exporter</code> 暴露的节点相关的指标,在集群规模大的情况下,它们这种单个服务背后的指标数据体量就非常大;还有一些用户量超大的业务,单个服务的 pod 副本数就可能过千,这种服务背后的指标数据也非常大,当然这是最罕见的场景,对于绝大多数的人来说这种场景都只敢 YY 一下,实际很少有单个服务就达到这么大规模的业务。</p><p>针对上面这些大规模场景,一个 Prometheus 实例可能连这单个服务的采集任务都扛不住。Prometheus 需要向这个服务所有后端实例发请求采集数据,由于后端实例数量规模太大,采集并发量就会很高,一方面对节点的带宽、CPU、磁盘 IO 都有一定的压力,另一方面 Prometheus 使用的磁盘空间有限,采集的数据量过大很容易就将磁盘塞满了,通常要做一些取舍才能将数据量控制在一定范围,但这种取舍也会降低数据完整和精确程度,不推荐这样做。</p><p>那么如何优化呢?我们可以给这种大规模类型的服务做一下分片(Sharding),将其拆分成多个 group,让一个 Prometheus 实例仅采集这个服务背后的某一个 group 的数据,这样就可以将这个大体量服务的监控数据拆分到多个 Prometheus 实例上。</p><p><img src="https://imroc.io/assets/blog/prometheus-sharding.png" alt=""></p><p>如何将一个服务拆成多个 group 呢?下面介绍两种方案,以对 kubelet cadvisor 数据做分片为例。</p><p>第一,我们可以不用 Kubernetes 的服务发现,自行实现一下 sharding 算法,比如针对节点级的服务,可以将某个节点 shard 到某个 group 里,然后再将其注册到 Prometheus 所支持的服务发现注册中心,推荐 consul,最后在 Prometheus 配置文件加上 <a href="https://prometheus.io/docs/prometheus/latest/configuration/configuration/#consul_sd_config" target="_blank" rel="noopener">consul_sd_config</a> 的配置,指定每个 Prometheus 实例要采集的 group。</p><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line"><span class="attr">- job_name:</span> <span class="string">'cadvisor-1'</span></span><br><span class="line"><span class="attr"> consul_sd_configs:</span></span><br><span class="line"><span class="attr"> - server:</span> <span class="number">10.0</span><span class="number">.0</span><span class="number">.3</span><span class="string">:8500</span></span><br><span class="line"><span class="attr"> services:</span></span><br><span class="line"><span class="bullet"> -</span> <span class="string">cadvisor-1</span> <span class="comment"># This is the 2nd slave</span></span><br></pre></td></tr></table></figure><p>在未来,你甚至可以直接利用 Kubernetes 的 <a 
href="https://kubernetes.io/docs/concepts/services-networking/endpoint-slices/" target="_blank" rel="noopener">EndpointSlice</a> 特性来做服务发现和分片处理,在超大规模服务场景下就可以不需要其它的服务发现和分片机制。不过暂时此特性还不够成熟,没有默认启用,不推荐用(当前 Kubernentes 最新版本为 1.18)。</p><p>第二,用 Kubernetes 的 node 服务发现,再利用 Prometheus relabel 配置的 hashmod 来对 node 做分片,每个 Prometheus 实例仅抓其中一个分片中的数据:</p><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br></pre></td><td class="code"><pre><span class="line"><span class="attr">- job_name:</span> <span class="string">'cadvisor-1'</span></span><br><span class="line"><span class="attr"> metrics_path:</span> <span class="string">/metrics/cadvisor</span></span><br><span class="line"><span class="attr"> scheme:</span> <span class="string">https</span></span><br><span class="line"></span><br><span class="line"> <span class="comment"># 请求 kubelet metrics 接口也需要认证和授权,通常会用 webhook 方式让 apiserver 代理进行 RBAC 校验,所以还是用 ServiceAccount 的 token</span></span><br><span class="line"><span class="attr"> bearer_token_file:</span> <span class="string">/var/run/secrets/kubernetes.io/serviceaccount/token</span></span><br><span class="line"></span><br><span class="line"><span class="attr"> kubernetes_sd_configs:</span></span><br><span class="line"><span class="attr"> - role:</span> <span class="string">node</span></span><br><span class="line"></span><br><span class="line"> <span class="comment"># 通常不校验 kubelet 的 server 证书,避免报 x509: certificate signed by unknown authority</span></span><br><span class="line"><span class="attr"> tls_config:</span></span><br><span class="line"><span class="attr"> insecure_skip_verify:</span> <span class="literal">true</span></span><br><span class="line"></span><br><span class="line"><span class="attr"> relabel_configs:</span></span><br><span class="line"><span class="attr"> - source_labels:</span> <span class="string">[__address__]</span></span><br><span class="line"><span class="attr"> modulus:</span> <span class="number">4</span> <span class="comment"># 将节点分片成 4 个 group</span></span><br><span class="line"><span class="attr"> target_label:</span> <span class="string">__tmp_hash</span></span><br><span class="line"><span class="attr"> action:</span> <span class="string">hashmod</span></span><br><span class="line"><span class="attr"> - source_labels:</span> <span class="string">[__tmp_hash]</span></span><br><span class="line"><span class="attr"> regex:</span> <span class="string">^1$</span> <span class="comment"># 只抓第 2 个 group 中节点的数据(序号 0 为第 1 个 group)</span></span><br><span class="line"><span class="attr"> action:</span> <span class="string">keep</span></span><br></pre></td></tr></table></figure><h2 id="拆分引入的新问题"><a href="#拆分引入的新问题" class="headerlink" title="拆分引入的新问题"></a>拆分引入的新问题</h2><p>前面我们通过不通层面对 Prometheus 进行了拆分部署,一方面使得 Prometheus 能够实现水平扩容,另一方面也加剧了监控数据落盘的分散程度,使用 Grafana 
查询监控数据时我们也需要添加许多数据源,而且不同数据源之间的数据还不能聚合查询,监控页面也看不到全局的视图,造成查询混乱的局面。</p><p><img src="https://imroc.io/assets/blog/prometheus-chaos.png" alt=""></p><p>要解决这个问题,我们可以从下面的两方面入手,任选其中一种方案。</p><h2 id="集中数据存储"><a href="#集中数据存储" class="headerlink" title="集中数据存储"></a>集中数据存储</h2><p>我们可以让 Prometheus 不负责存储,仅采集数据并通过 remote write 方式写入远程存储的 adapter,远程存储使用 OpenTSDB 或 InfluxDB 这些支持集群部署的时序数据库,Prometheus 配置:</p><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line"><span class="attr">remote_write:</span></span><br><span class="line"><span class="attr">- url:</span> <span class="attr">http://10.0.0.2:8888/write</span></span><br></pre></td></tr></table></figure><p>然后 Grafana 添加我们使用的时序数据库作为数据源来查询监控数据来展示,架构图:</p><p><img src="https://imroc.io/assets/blog/prometheus-remotewirte.png" alt=""></p><p>这种方式相当于更换了存储引擎,由其它支持存储水平扩容的时序数据库来存储庞大的数据量,这样我们就可以将数据集中到一起。OpenTSDB 支持 HBase, BigTable 作为存储后端,InfluxDB 企业版支持集群部署和水平扩容(开源版不支持)。不过这样的话,我们就无法使用友好且强大的 PromQL 来查询监控数据了,必须使用我们存储数据的时序数据库所支持的语法来查询。</p><h2 id="Prometheus-联邦"><a href="#Prometheus-联邦" class="headerlink" title="Prometheus 联邦"></a>Prometheus 联邦</h2><p>除了上面更换存储引擎的方式,还可以将 Prometheus 进行联邦部署。</p><p><img src="https://imroc.io/assets/blog/prometheus-federation.png" alt=""></p><p>简单来说,就是将多个 Prometheus 实例采集的数据再用另一个 Prometheus 采集汇总到一起,这样也意味着需要消耗更多的资源。通常我们只把需要聚合的数据或者需要在一个地方展示的数据用这种方式采集汇总到一起,比如 Kubernetes 节点数过多,cadvisor 的数据分散在多个 Prometheus 实例上,我们就可以用这种方式将 cadvisor 暴露的容器指标汇总起来,以便于在一个地方就能查询到集群中任意一个容器的监控数据或者某个服务背后所有容器的监控数据的聚合汇总以及配置告警;又或者多个服务有关联,比如通常应用只暴露了它应用相关的指标,但它的资源使用情况(比如 cpu 和 内存) 由 cadvisor 来感知和暴露,这两部分指标由不同的 Prometheus 实例所采集,这时我们也可以用这种方式将数据汇总,在一个地方展示和配置告警。</p><p>更多说明和配置示例请参考官方文档: <a href="https://prometheus.io/docs/prometheus/latest/federation/" target="_blank" rel="noopener">https://prometheus.io/docs/prometheus/latest/federation/</a></p><h2 id="Prometheus-高可用"><a href="#Prometheus-高可用" class="headerlink" title="Prometheus 高可用"></a>Prometheus 高可用</h2><p>虽然上面我们通过一些列操作将 Prometheus 进行了分布式改造,但并没有解决 Prometheus 本身的高可用问题,即如果其中一个实例挂了,数据的查询和完整性都将受到影响。</p><p>我们可以将所有 Prometheus 实例都使用两个相同副本,分别挂载数据盘,它们都采集相同的服务,所以它们的数据是一致的,查询它们之中任意一个都可以,所以可以在它们前面再挂一层负载均衡,所有查询都经过这个负载均衡分流到其中一台 Prometheus,如果其中一台挂掉就从负载列表里踢掉不再转发。</p><p>这里的负载均衡可以根据实际环境选择合适的方案,可以用 Nginx 或 HAProxy,在 Kubernetes 环境,通常使用 Kubernentes 的 Service,由 kube-proxy 生成的 iptables/ipvs 规则转发,如果使用 Istio,还可以用 VirtualService,由 envoy sidecar 去转发。</p><p><img src="https://imroc.io/assets/blog/prometheus-ha.png" alt=""></p><p>这样就实现了 Prometheus 的高可用,简单起见,上面的图仅展示单个 Prometheus 的高可用,当你可以将其拓展,代入应用到上面其它的优化手段中,实现整体的高可用。</p><h2 id="总结"><a href="#总结" class="headerlink" title="总结"></a>总结</h2><p>通过本文一系列对 Prometheus 的优化手段,我们在一定程度上解决了单机 Prometheus 在大规模场景下的痛点,但操作和运维复杂度比较高,并且不能够很好的支持数据的长期存储(long term storage)。对于一些时间比较久远的监控数据,我们通常查看的频率很低,但也希望能够低成本的保留足够长的时间,数据如果全部落盘到磁盘成本是很高的,并且容量有限,即便利用水平扩容可以增加存储容量,但同时也增大了资源成本,不可能无限扩容,所以需要设置一个数据过期策略,也就会丢失时间比较久远的监控数据。</p><p>对于这种不常用的冷数据,最理想的方式就是存到廉价的对象存储中,等需要查询的时候能够自动加载出来。Thanos 可以帮我们解决这些问题,它完全兼容 Prometheus API,提供统一查询聚合分布式部署的 Prometheus 数据的能力,同时也支持数据长期存储到各种对象存储(无限存储能力)以及降低采样率来加速大时间范围的数据查询。</p><p>下一篇我们将会介绍 Thanos 的架构详解,敬请期待。</p>]]></content>
<summary type="html">
<p>作者: <a href="https://imroc.io/" target="_blank" rel="noopener">陈鹏</a></p>
<h2 id="概述"><a href="#概述" class="headerlink" title="概述"></a>概述<
</summary>
</entry>
<entry>
<title>Kubernetes 疑难杂症排查分享:神秘的溢出与丢包</title>
<link href="https://TencentCloudContainerTeam.github.io/2020/01/13/kubernetes-overflow-and-drop/"/>
<id>https://TencentCloudContainerTeam.github.io/2020/01/13/kubernetes-overflow-and-drop/</id>
<published>2020-01-13T06:40:00.000Z</published>
<updated>2020-06-16T01:53:49.339Z</updated>
<content type="html"><![CDATA[<p>作者: <a href="https://imroc.io/" target="_blank" rel="noopener">陈鹏</a></p><blockquote><p>上一篇 <a href="https://tencentcloudcontainerteam.github.io/2019/12/15/no-route-to-host/">Kubernetes 疑难杂症排查分享: 诡异的 No route to host</a> 不小心又爆火,这次继续带来干货,看之前请提前泡好茶,避免口干。</p></blockquote><h2 id="问题描述"><a href="#问题描述" class="headerlink" title="问题描述"></a>问题描述</h2><p>有用户反馈大量图片加载不出来。</p><p>图片下载走的 k8s ingress,这个 ingress 路径对应后端 service 是一个代理静态图片文件的 nginx deployment,这个 deployment 只有一个副本,静态文件存储在 nfs 上,nginx 通过挂载 nfs 来读取静态文件来提供图片下载服务,所以调用链是:client –> k8s ingress –> nginx –> nfs。</p><h2 id="猜测"><a href="#猜测" class="headerlink" title="猜测"></a>猜测</h2><p>猜测: ingress 图片下载路径对应的后端服务出问题了。</p><p>验证:在 k8s 集群直接 curl nginx 的 pod ip,发现不通,果然是后端服务的问题!</p><h2 id="抓包"><a href="#抓包" class="headerlink" title="抓包"></a>抓包</h2><p>继续抓包测试观察,登上 nginx pod 所在节点,进入容器的 netns 中:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># 拿到 pod 中 nginx 的容器 id</span></span><br><span class="line">$ kubectl describe pod tcpbench-6484d4b457-847gl | grep -A10 <span class="string">"^Containers:"</span> | grep -Eo <span class="string">'docker://.*$'</span> | head -n 1 | sed <span class="string">'s/docker:\/\/\(.*\)$/\1/'</span></span><br><span class="line">49b4135534dae77ce5151c6c7db4d528f05b69b0c6f8b9dd037ec4e7043c113e</span><br><span class="line"></span><br><span class="line"><span class="comment"># 通过容器 id 拿到 nginx 进程 pid</span></span><br><span class="line">$ docker inspect -f {{.State.Pid}} 49b4135534dae77ce5151c6c7db4d528f05b69b0c6f8b9dd037ec4e7043c113e</span><br><span class="line">3985</span><br><span class="line"></span><br><span class="line"><span class="comment"># 进入 nginx 进程所在的 netns</span></span><br><span class="line">$ nsenter -n -t 3985</span><br><span class="line"></span><br><span class="line"><span class="comment"># 查看容器 netns 中的网卡信息,确认下</span></span><br><span class="line">$ ip a</span><br><span class="line">1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000</span><br><span class="line"> link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00</span><br><span class="line"> inet 127.0.0.1/8 scope host lo</span><br><span class="line"> valid_lft forever preferred_lft forever</span><br><span class="line">3: eth0@if11: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default</span><br><span class="line"> link/ether 56:04:c7:28:b0:3c brd ff:ff:ff:ff:ff:ff link-netnsid 0</span><br><span class="line"> inet 172.26.0.8/26 scope global eth0</span><br><span class="line"> valid_lft forever preferred_lft forever</span><br></pre></td></tr></table></figure><p>使用 tcpdump 指定端口 24568 抓容器 netns 中 eth0 网卡的包:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td 
class="code"><pre><span class="line">tcpdump -i eth0 -nnnn -ttt port 24568</span><br></pre></td></tr></table></figure><p>在其它节点准备使用 nc 指定源端口为 24568 向容器发包:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">nc -u 24568 172.16.1.21 80</span><br></pre></td></tr></table></figure><p>观察抓包结果:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line">00:00:00.000000 IP 10.0.0.3.24568 > 172.16.1.21.80: Flags [S], seq 416500297, win 29200, options [mss 1424,sackOK,TS val 3000206334 ecr 0,nop,wscale 9], length 0</span><br><span class="line">00:00:01.032218 IP 10.0.0.3.24568 > 172.16.1.21.80: Flags [S], seq 416500297, win 29200, options [mss 1424,sackOK,TS val 3000207366 ecr 0,nop,wscale 9], length 0</span><br><span class="line">00:00:02.011962 IP 10.0.0.3.24568 > 172.16.1.21.80: Flags [S], seq 416500297, win 29200, options [mss 1424,sackOK,TS val 3000209378 ecr 0,nop,wscale 9], length 0</span><br><span class="line">00:00:04.127943 IP 10.0.0.3.24568 > 172.16.1.21.80: Flags [S], seq 416500297, win 29200, options [mss 1424,sackOK,TS val 3000213506 ecr 0,nop,wscale 9], length 0</span><br><span class="line">00:00:08.192056 IP 10.0.0.3.24568 > 172.16.1.21.80: Flags [S], seq 416500297, win 29200, options [mss 1424,sackOK,TS val 3000221698 ecr 0,nop,wscale 9], length 0</span><br><span class="line">00:00:16.127983 IP 10.0.0.3.24568 > 172.16.1.21.80: Flags [S], seq 416500297, win 29200, options [mss 1424,sackOK,TS val 3000237826 ecr 0,nop,wscale 9], length 0</span><br><span class="line">00:00:33.791988 IP 10.0.0.3.24568 > 172.16.1.21.80: Flags [S], seq 416500297, win 29200, options [mss 1424,sackOK,TS val 3000271618 ecr 0,nop,wscale 9], length 0</span><br></pre></td></tr></table></figure><p>SYN 包到容器内网卡了,但容器没回 ACK,像是报文到达容器内的网卡后就被丢了。看样子跟防火墙应该也没什么关系,也检查了容器 netns 内的 iptables 规则,是空的,没问题。</p><p>排除是 iptables 规则问题,在容器 netns 中使用 <code>netstat -s</code> 检查下是否有丢包统计:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">$ netstat -s | grep -E <span class="string">'overflow|drop'</span></span><br><span class="line"> 12178939 <span class="built_in">times</span> the listen queue of a socket overflowed</span><br><span class="line"> 12247395 SYNs to LISTEN sockets dropped</span><br></pre></td></tr></table></figure><p>果然有丢包,为了理解这里的丢包统计,我深入研究了一下,下面插播一些相关知识。</p><h2 id="syn-queue-与-accept-queue"><a href="#syn-queue-与-accept-queue" class="headerlink" title="syn queue 与 accept queue"></a>syn queue 与 accept queue</h2><p>Linux 进程监听端口时,内核会给它对应的 socket 分配两个队列:</p><ul><li>syn queue: 半连接队列。server 收到 SYN 后,连接会先进入 <code>SYN_RCVD</code> 状态,并放入 syn queue,此队列的包对应还没有完全建立好的连接(TCP 三次握手还没完成)。</li><li>accept queue: 全连接队列。当 TCP 三次握手完成之后,连接会进入 <code>ESTABELISHED</code> 状态并从 syn queue 移到 accept queue,等待被进程调用 <code>accept()</code> 系统调用 “拿走”。</li></ul><blockquote><p>注意:这两个队列的连接都还没有真正被应用层接收到,当进程调用 <code>accept()</code> 后,连接才会被应用层处理,具体到我们这个问题的场景就是 nginx 处理 HTTP 请求。</p></blockquote><p>为了更好理解,可以看下这张 TCP 连接建立过程的示意图:</p><p><img src="https://imroc.io/assets/blog/troubleshooting-k8s-network/backlog.png" alt=""></p><h2 
id="listen-与-accept"><a href="#listen-与-accept" class="headerlink" title="listen 与 accept"></a>listen 与 accept</h2><p>不管使用什么语言和框架,在写 server 端应用时,它们的底层在监听端口时最终都会调用 <code>listen()</code> 系统调用,处理新请求时都会先调用 <code>accept()</code> 系统调用来获取新的连接,然后再处理请求,只是有各自不同的封装而已,以 go 语言为例:</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">// 调用 listen 监听端口</span></span><br><span class="line">l, err := net.Listen(<span class="string">"tcp"</span>, <span class="string">":80"</span>)</span><br><span class="line"><span class="keyword">if</span> err != <span class="literal">nil</span> {</span><br><span class="line"><span class="built_in">panic</span>(err)</span><br><span class="line">}</span><br><span class="line"><span class="keyword">for</span> {</span><br><span class="line"><span class="comment">// 不断调用 accept 获取新连接,如果 accept queue 为空就一直阻塞</span></span><br><span class="line">conn, err := l.Accept()</span><br><span class="line"><span class="keyword">if</span> err != <span class="literal">nil</span> {</span><br><span class="line">log.Println(<span class="string">"accept error:"</span>, err)</span><br><span class="line"><span class="keyword">continue</span></span><br><span class="line"> }</span><br><span class="line"><span class="comment">// 每来一个新连接意味着一个新请求,启动协程处理请求</span></span><br><span class="line"><span class="keyword">go</span> handle(conn)</span><br><span class="line">}</span><br></pre></td></tr></table></figure><h2 id="Linux-的-backlog"><a href="#Linux-的-backlog" class="headerlink" title="Linux 的 backlog"></a>Linux 的 backlog</h2><p>内核既然给监听端口的 socket 分配了 syn queue 与 accept queue 两个队列,那它们有大小限制吗?可以无限往里面塞数据吗?当然不行! 资源是有限的,尤其是在内核态,所以需要限制一下这两个队列的大小。那么它们的大小是如何确定的呢?我们先来看下 listen 这个系统调用:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">int listen(int sockfd, int backlog)</span><br></pre></td></tr></table></figure><p>可以看到,能够传入一个整数类型的 <code>backlog</code> 参数,我们再通过 <code>man listen</code> 看下解释:</p><p><code>The behavior of the backlog argument on TCP sockets changed with Linux 2.2. Now it specifies the queue length for completely established sockets waiting to be accepted, instead of the number of incomplete connection requests. The maximum length of the queue for incomplete sockets can be set using /proc/sys/net/ipv4/tcp_max_syn_backlog. When syncookies are enabled there is no logical maximum length and this setting is ignored. See tcp(7) for more information.</code></p><p><code>If the backlog argument is greater than the value in /proc/sys/net/core/somaxconn, then it is silently truncated to that value; the default value in this file is 128. 
In kernels before 2.4.25, this limit was a hard coded value, SOMAXCONN, with the value 128.</code></p><p>继续深挖了一下源码,结合这里的解释提炼一下:</p><ul><li>listen 的 backlog 参数同时指定了 socket 的 syn queue 与 accept queue 大小。</li><li><p>accept queue 最大不能超过 <code>net.core.somaxconn</code> 的值,即: </p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">max accept queue size = min(backlog, net.core.somaxconn)</span><br></pre></td></tr></table></figure></li><li><p>如果启用了 syncookies (net.ipv4.tcp_syncookies=1),当 syn queue 满了,server 还是可以继续接收 <code>SYN</code> 包并回复 <code>SYN+ACK</code> 给 client,只是不会存入 syn queue 了。因为会利用一套巧妙的 syncookies 算法机制生成隐藏信息写入响应的 <code>SYN+ACK</code> 包中,等 client 回 <code>ACK</code> 时,server 再利用 syncookies 算法校验报文,校验通过后三次握手就顺利完成了。所以如果启用了 syncookies,syn queue 的逻辑大小是没有限制的,</p></li><li>syncookies 通常都是启用了的,所以一般不用担心 syn queue 满了导致丢包。syncookies 是为了防止 SYN Flood 攻击 (一种常见的 DDoS 方式),攻击原理就是 client 不断发 SYN 包但不回最后的 ACK,填满 server 的 syn queue 从而无法建立新连接,导致 server 拒绝服务。</li><li>如果 syncookies 没有启用,syn queue 的大小就有限制,除了跟 accept queue 一样受 <code>net.core.somaxconn</code> 大小限制之外,还会受到 <code>net.ipv4.tcp_max_syn_backlog</code> 的限制,即:<figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">max syn queue size = min(backlog, net.core.somaxconn, net.ipv4.tcp_max_syn_backlog)</span><br></pre></td></tr></table></figure></li></ul><p>4.3 及其之前版本的内核,syn queue 的大小计算方式跟现在新版内核这里还不一样,详细请参考 commit <a href="https://github.com/torvalds/linux/commit/ef547f2ac16bd9d77a780a0e7c70857e69e8f23f#diff-56ecfd3cd70d57cde321f395f0d8d743L43" target="_blank" rel="noopener">ef547f2ac16b</a></p><h2 id="队列溢出"><a href="#队列溢出" class="headerlink" title="队列溢出"></a>队列溢出</h2><p>毫无疑问,在队列大小有限制的情况下,如果队列满了,再有新连接过来肯定就有问题。</p><p>翻下 linux 源码,看下处理 SYN 包的部分,在 <code>net/ipv4/tcp_input.c</code> 的 <code>tcp_conn_request</code> 函数:</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">if</span> ((net->ipv4.sysctl_tcp_syncookies == <span class="number">2</span> ||</span><br><span class="line"> inet_csk_reqsk_queue_is_full(sk)) && !isn) {</span><br><span class="line">want_cookie = tcp_syn_flood_action(sk, rsk_ops->slab_name);</span><br><span class="line"><span class="keyword">if</span> (!want_cookie)</span><br><span class="line"><span class="keyword">goto</span> drop;</span><br><span class="line">}</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (sk_acceptq_is_full(sk)) {</span><br><span class="line">NET_INC_STATS(sock_net(sk), LINUX_MIB_LISTENOVERFLOWS);</span><br><span class="line"><span class="keyword">goto</span> drop;</span><br><span class="line">}</span><br></pre></td></tr></table></figure><p><code>goto drop</code> 最终会走到 <code>tcp_listendrop</code> 函数,实际上就是将 <code>ListenDrops</code> 计数器 +1:</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td 
class="code"><pre><span class="line"><span class="function"><span class="keyword">static</span> <span class="keyword">inline</span> <span class="keyword">void</span> <span class="title">tcp_listendrop</span><span class="params">(<span class="keyword">const</span> struct sock *sk)</span></span></span><br><span class="line"><span class="function"></span>{</span><br><span class="line">atomic_inc(&((struct sock *)sk)->sk_drops);</span><br><span class="line">__NET_INC_STATS(sock_net(sk), LINUX_MIB_LISTENDROPS);</span><br><span class="line">}</span><br></pre></td></tr></table></figure><p>大致可以看出来,对于 SYN 包:</p><ul><li>如果 syn queue 满了并且没有开启 syncookies 就丢包,并将 <code>ListenDrops</code> 计数器 +1。</li><li>如果 accept queue 满了也会丢包,并将 <code>ListenOverflows</code> 和 <code>ListenDrops</code> 计数器 +1。</li></ul><p>而我们前面排查问题通过 <code>netstat -s</code> 看到的丢包统计,其实就是对应的 <code>ListenOverflows</code> 和 <code>ListenDrops</code> 这两个计数器。</p><p>除了用 <code>netstat -s</code>,还可以使用 <code>nstat -az</code> 直接看系统内各个计数器的值:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">$ nstat -az | grep -E <span class="string">'TcpExtListenOverflows|TcpExtListenDrops'</span></span><br><span class="line">TcpExtListenOverflows 12178939 0.0</span><br><span class="line">TcpExtListenDrops 12247395 0.0</span><br></pre></td></tr></table></figure><p>另外,对于低版本内核,当 accept queue 满了,并不会完全丢弃 SYN 包,而是对 SYN 限速。把内核源码切到 3.10 版本,看 <code>net/ipv4/tcp_ipv4.c</code> 中 <code>tcp_v4_conn_request</code> 函数:</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">/* Accept backlog is full. If we have already queued enough</span></span><br><span class="line"><span class="comment"> * of warm entries in syn queue, drop request. 
It is better than</span></span><br><span class="line"><span class="comment"> * clogging syn queue with openreqs with exponentially increasing</span></span><br><span class="line"><span class="comment"> * timeout.</span></span><br><span class="line"><span class="comment"> */</span></span><br><span class="line"><span class="keyword">if</span> (sk_acceptq_is_full(sk) && inet_csk_reqsk_queue_young(sk) > <span class="number">1</span>) {</span><br><span class="line"> NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENOVERFLOWS);</span><br><span class="line"> <span class="keyword">goto</span> drop;</span><br><span class="line">}</span><br></pre></td></tr></table></figure><p>其中 <code>inet_csk_reqsk_queue_young(sk) > 1</code> 的条件实际就是用于限速,仿佛在对 client 说: 哥们,你慢点!我的 accept queue 都满了,即便咱们握手成功,连接也可能放不进去呀。</p><h2 id="回到问题上来"><a href="#回到问题上来" class="headerlink" title="回到问题上来"></a>回到问题上来</h2><p>总结之前观察到两个现象:</p><ul><li>容器内抓包发现收到 client 的 SYN,但 nginx 没回包。</li><li>通过 <code>netstat -s</code> 发现有溢出和丢包的统计 (<code>ListenOverflows</code> 与 <code>ListenDrops</code>)。</li></ul><p>根据之前的分析,我们可以推测是 syn queue 或 accept queue 满了。</p><p>先检查下 syncookies 配置:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">$ cat /proc/sys/net/ipv4/tcp_syncookies</span><br><span class="line">1</span><br></pre></td></tr></table></figure><p>确认启用了 <code>syncookies</code>,所以 syn queue 大小没有限制,不会因为 syn queue 满而丢包,并且即便没开启 <code>syncookies</code>,syn queue 有大小限制,队列满了也不会使 <code>ListenOverflows</code> 计数器 +1。</p><p>从计数器结果来看,<code>ListenOverflows</code> 和 <code>ListenDrops</code> 的值差别不大,所以推测很有可能是 accept queue 满了,因为当 accept queue 满了会丢 SYN 包,并且同时将 <code>ListenOverflows</code> 与 <code>ListenDrops</code> 计数器分别 +1。</p><p>如何验证 accept queue 满了呢?可以在容器的 netns 中执行 <code>ss -lnt</code> 看下:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">$ ss -lnt</span><br><span class="line">State Recv-Q Send-Q Local Address:Port Peer Address:Port</span><br><span class="line">LISTEN 129 128 *:80 *:*</span><br></pre></td></tr></table></figure><p>通过这条命令我们可以看到当前 netns 中监听 tcp 80 端口的 socket,<code>Send-Q</code> 为 128,<code>Recv-Q</code> 为 129。</p><p>什么意思呢?通过调研得知:</p><ul><li>对于 <code>LISTEN</code> 状态,<code>Send-Q</code> 表示 accept queue 的最大限制大小,<code>Recv-Q</code> 表示其实际大小。</li><li>对于 <code>ESTABELISHED</code> 状态,<code>Send-Q</code> 和 <code>Recv-Q</code> 分别表示发送和接收数据包的 buffer。</li></ul><p>所以,看这里输出结果可以得知 accept queue 满了,当 <code>Recv-Q</code> 的值比 <code>Send-Q</code> 大 1 时表明 accept queue 溢出了,如果再收到 SYN 包就会丢弃掉。</p><p>导致 accept queue 满的原因一般都是因为进程调用 <code>accept()</code> 太慢了,导致大量连接不能被及时 “拿走”。</p><p>那么什么情况下进程调用 <code>accept()</code> 会很慢呢?猜测可能是进程连接负载高,处理不过来。</p><p>而负载高不仅可能是 CPU 繁忙导致,还可能是 IO 慢导致,当文件 IO 慢时就会有很多 IO WAIT,在 IO WAIT 时虽然 CPU 不怎么干活,但也会占据 CPU 时间片,影响 CPU 干其它活。</p><p>最终进一步定位发现是 nginx pod 挂载的 nfs 服务对应的 nfs server 负载较高,导致 IO 延时较大,从而使 nginx 调用 <code>accept()</code> 变慢,accept queue 溢出,使得大量代理静态图片文件的请求被丢弃,也就导致很多图片加载不出来。</p><p>虽然根因不是 k8s 导致的问题,但也从中挖出一些在高并发场景下值得优化的点,请继续往下看。</p><h2 id="somaxconn-的默认值很小"><a href="#somaxconn-的默认值很小" class="headerlink" title="somaxconn 的默认值很小"></a>somaxconn 的默认值很小</h2><p>我们再看下之前 <code>ss -lnt</code> 的输出:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td 
class="code"><pre><span class="line">$ ss -lnt</span><br><span class="line">State Recv-Q Send-Q Local Address:Port Peer Address:Port</span><br><span class="line">LISTEN 129 128 *:80 *:*</span><br></pre></td></tr></table></figure><p>仔细一看,<code>Send-Q</code> 表示 accept queue 最大的大小,才 128 ?也太小了吧!</p><p>根据前面的介绍我们知道,accept queue 的最大大小会受 <code>net.core.somaxconn</code> 内核参数的限制,我们看下 pod 所在节点上这个内核参数的大小:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">$ cat /proc/sys/net/core/somaxconn</span><br><span class="line">32768</span><br></pre></td></tr></table></figure><p>是 32768,挺大的,为什么这里 accept queue 最大大小就只有 128 了呢?</p><p><code>net.core.somaxconn</code> 这个内核参数是 namespace 隔离了的,我们在容器 netns 中再确认了下:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">$ cat /proc/sys/net/core/somaxconn</span><br><span class="line">128</span><br></pre></td></tr></table></figure><p>为什么只有 128?看下 stackoverflow <a href="https://stackoverflow.com/questions/26177059/refresh-net-core-somaxcomm-or-any-sysctl-property-for-docker-containers/26197875#26197875" target="_blank" rel="noopener">这里</a> 的讨论: </p><p><code>The "net/core" subsys is registered per network namespace. And the initial value for somaxconn is set to 128.</code></p><p>原来新建的 netns 中 somaxconn 默认就为 128,在 <code>include/linux/socket.h</code> 中可以看到这个常量的定义:</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">/* Maximum queue length specifiable by listen. 
*/</span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">define</span> SOMAXCONN128</span></span><br></pre></td></tr></table></figure><p>很多人在使用 k8s 时都没太在意这个参数,为什么大家平常在较高并发下也没发现有问题呢?</p><p>因为通常进程 <code>accept()</code> 都是很快的,所以一般 accept queue 基本都没什么积压的数据,也就不会溢出导致丢包了。</p><p>对于并发量很高的应用,还是建议将 somaxconn 调高。虽然可以进入容器 netns 后使用 <code>sysctl -w net.core.somaxconn=1024</code> 或 <code>echo 1024 > /proc/sys/net/core/somaxconn</code> 临时调整,但调整的意义不大,因为容器内的进程一般在启动的时候才会调用 <code>listen()</code>,然后 accept queue 的大小就被决定了,并且不再改变。</p><p>下面介绍几种调整方式:</p><h3 id="方式一-使用-k8s-sysctls-特性直接给-pod-指定内核参数"><a href="#方式一-使用-k8s-sysctls-特性直接给-pod-指定内核参数" class="headerlink" title="方式一: 使用 k8s sysctls 特性直接给 pod 指定内核参数"></a>方式一: 使用 k8s sysctls 特性直接给 pod 指定内核参数</h3><p>示例 yaml:</p><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line"><span class="attr">apiVersion:</span> <span class="string">v1</span></span><br><span class="line"><span class="attr">kind:</span> <span class="string">Pod</span></span><br><span class="line"><span class="attr">metadata:</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">sysctl-example</span></span><br><span class="line"><span class="attr">spec:</span></span><br><span class="line"><span class="attr"> securityContext:</span></span><br><span class="line"><span class="attr"> sysctls:</span></span><br><span class="line"><span class="attr"> - name:</span> <span class="string">net.core.somaxconn</span></span><br><span class="line"><span class="attr"> value:</span> <span class="string">"8096"</span></span><br></pre></td></tr></table></figure><p>有些参数是 <code>unsafe</code> 类型的,不同环境不一样,我的环境里是可以直接设置 pod 的 <code>net.core.somaxconn</code> 这个 sysctl 的。如果你的环境不行,请参考官方文档 <a href="https://kubernetes-io-vnext-staging.netlify.com/docs/tasks/administer-cluster/sysctl-cluster/#enabling-unsafe-sysctls" target="_blank" rel="noopener">Using sysctls in a Kubernetes Cluster</a> 启用 <code>unsafe</code> 类型的 sysctl。</p><blockquote><p>注:此特性在 k8s v1.12 beta,默认开启。</p></blockquote><h3 id="方式二-使用-initContainers-设置内核参数"><a href="#方式二-使用-initContainers-设置内核参数" class="headerlink" title="方式二: 使用 initContainers 设置内核参数"></a>方式二: 使用 initContainers 设置内核参数</h3><p>示例 yaml:</p><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br></pre></td><td class="code"><pre><span class="line"><span class="attr">apiVersion:</span> <span class="string">v1</span></span><br><span class="line"><span class="attr">kind:</span> <span class="string">Pod</span></span><br><span class="line"><span class="attr">metadata:</span></span><br><span class="line"><span class="attr"> name:</span> <span 
class="string">sysctl-example-init</span></span><br><span class="line"><span class="attr">spec:</span></span><br><span class="line"><span class="attr"> initContainers:</span></span><br><span class="line"><span class="attr"> - image:</span> <span class="string">busybox</span></span><br><span class="line"><span class="attr"> command:</span></span><br><span class="line"><span class="bullet"> -</span> <span class="string">sh</span></span><br><span class="line"><span class="bullet"> -</span> <span class="bullet">-c</span></span><br><span class="line"><span class="bullet"> -</span> <span class="string">echo</span> <span class="number">1024</span> <span class="string">> /proc/sys/net/core/somaxconn</span></span><br><span class="line"><span class="string"></span><span class="attr"> imagePullPolicy:</span> <span class="string">Always</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">setsysctl</span></span><br><span class="line"><span class="attr"> securityContext:</span></span><br><span class="line"><span class="attr"> privileged:</span> <span class="literal">true</span></span><br><span class="line"><span class="attr"> Containers:</span></span><br><span class="line"> <span class="string">...</span></span><br></pre></td></tr></table></figure><blockquote><p>注: init container 需要 privileged 权限。</p></blockquote><h3 id="方式三-安装-tuning-CNI-插件统一设置-sysctl"><a href="#方式三-安装-tuning-CNI-插件统一设置-sysctl" class="headerlink" title="方式三: 安装 tuning CNI 插件统一设置 sysctl"></a>方式三: 安装 tuning CNI 插件统一设置 sysctl</h3><p>tuning plugin 地址: <a href="https://github.com/containernetworking/plugins/tree/master/plugins/meta/tuning" target="_blank" rel="noopener">https://github.com/containernetworking/plugins/tree/master/plugins/meta/tuning</a></p><p>CNI 配置示例:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line">{</span><br><span class="line"> <span class="string">"name"</span>: <span class="string">"mytuning"</span>,</span><br><span class="line"> <span class="string">"type"</span>: <span class="string">"tuning"</span>,</span><br><span class="line"> <span class="string">"sysctl"</span>: {</span><br><span class="line"> <span class="string">"net.core.somaxconn"</span>: <span class="string">"1024"</span></span><br><span class="line"> }</span><br><span class="line">}</span><br></pre></td></tr></table></figure><h2 id="nginx-的-backlog"><a href="#nginx-的-backlog" class="headerlink" title="nginx 的 backlog"></a>nginx 的 backlog</h2><p>我们使用方式一尝试给 nginx pod 的 somaxconn 调高到 8096 后观察:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">$ ss -lnt</span><br><span class="line">State Recv-Q Send-Q Local Address:Port Peer Address:Port</span><br><span class="line">LISTEN 512 511 *:80 *:*</span><br></pre></td></tr></table></figure><p>WTF? 
还是溢出了,而且调高了 somaxconn 之后虽然 accept queue 的最大大小 (<code>Send-Q</code>) 变大了,但跟 8096 还差很远呀!</p><p>在经过一番研究,发现 nginx 在 <code>listen()</code> 时并没有读取 somaxconn 作为 backlog 默认值传入,它有自己的默认值,也支持在配置里改。通过 <a href="http://nginx.org/en/docs/http/ngx_http_core_module.html" target="_blank" rel="noopener">ngx_http_core_module</a> 的官方文档我们可以看到它在 linux 下的默认值就是 511:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">backlog=number</span><br><span class="line"> sets the backlog parameter in the listen() call that limits the maximum length for the queue of pending connections. By default, backlog is set to -1 on FreeBSD, DragonFly BSD, and macOS, and to 511 on other platforms.</span><br></pre></td></tr></table></figure><p>配置示例:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">listen 80 default backlog=1024;</span><br></pre></td></tr></table></figure><p>所以,在容器中使用 nginx 来支撑高并发的业务时,记得要同时调整下 <code>net.core.somaxconn</code> 内核参数和 <code>nginx.conf</code> 中的 backlog 配置。</p><h2 id="参考资料"><a href="#参考资料" class="headerlink" title="参考资料"></a>参考资料</h2><ul><li>Using sysctls in a Kubernetes Cluster: <a href="https://kubernetes-io-vnext-staging.netlify.com/docs/tasks/administer-cluster/sysctl-cluster/" target="_blank" rel="noopener">https://kubernetes-io-vnext-staging.netlify.com/docs/tasks/administer-cluster/sysctl-cluster/</a></li><li>SYN packet handling in the wild: <a href="https://blog.cloudflare.com/syn-packet-handling-in-the-wild/" target="_blank" rel="noopener">https://blog.cloudflare.com/syn-packet-handling-in-the-wild/</a></li></ul>]]></content>
<summary type="html">
<p>作者: <a href="https://imroc.io/" target="_blank" rel="noopener">陈鹏</a></p>
<blockquote>
<p>上一篇 <a href="https://tencentcloudcontainerteam.
</summary>
</entry>
<entry>
<title>Kubernetes 疑难杂症排查分享: 诡异的 No route to host</title>
<link href="https://TencentCloudContainerTeam.github.io/2019/12/15/no-route-to-host/"/>
<id>https://TencentCloudContainerTeam.github.io/2019/12/15/no-route-to-host/</id>
<published>2019-12-15T04:03:00.000Z</published>
<updated>2020-06-16T01:53:49.339Z</updated>
<content type="html"><![CDATA[<p>作者: <a href="https://imroc.io/" target="_blank" rel="noopener">陈鹏</a></p><p>之前发过一篇干货满满的爆火文章 <a href="https://tencentcloudcontainerteam.github.io/2019/08/12/troubleshooting-with-kubernetes-network/">Kubernetes 网络疑难杂症排查分享</a>,包含多个疑难杂症的排查案例分享,信息量巨大。这次我又带来了续集,只讲一个案例,但信息量也不小,Are you ready ?</p><h2 id="问题反馈"><a href="#问题反馈" class="headerlink" title="问题反馈"></a>问题反馈</h2><p>有用户反馈 Deployment 滚动更新的时候,业务日志偶尔会报 “No route to host” 的错误。</p><h2 id="分析"><a href="#分析" class="headerlink" title="分析"></a>分析</h2><p>之前没遇到滚动更新会报 “No route to host” 的问题,我们先看下滚动更新导致连接异常有哪些常见的报错:</p><ul><li><code>Connection reset by peer</code>: 连接被重置。通常是连接建立过,但 server 端发现 client 发的包不对劲就返回 RST,应用层就报错连接被重置。比如在 server 滚动更新过程中,client 给 server 发的请求还没完全结束,或者本身是一个类似 grpc 的多路复用长连接,当 server 对应的旧 Pod 删除(没有做优雅结束,停止时没有关闭连接),新 Pod 很快创建启动并且刚好有跟之前旧 Pod 一样的 IP,这时 kube-proxy 也没感知到这个 IP 其实已经被删除然后又被重建了,针对这个 IP 的规则就不会更新,旧的连接依然发往这个 IP,但旧 Pod 已经不在了,后面继续发包时依然转发给这个 Pod IP,最终会被转发到这个有相同 IP 的新 Pod 上,而新 Pod 收到此包时检查报文发现不对劲,就返回 RST 给 client 告知将连接重置。针对这种情况,建议应用自身处理好优雅结束:Pod 进入 Terminating 状态后会发送 <code>SIGTERM</code> 信号给业务进程,业务进程的代码需处理这个信号,在进程退出前关闭所有连接。</li><li><p><code>Connection refused</code>: 连接被拒绝。通常是连接还没建立,client 正在发 SYN 包请求建立连接,但到了 server 之后发现端口没监听,内核就返回 RST 包,然后应用层就报错连接被拒绝。比如在 server 滚动更新过程中,旧的 Pod 中的进程很快就停止了(网卡还未完全销毁),但 client 所在节点的 iptables/ipvs 规则还没更新,包就可能会被转发到了这个停止的 Pod (由于 k8s 的 controller 模式,从 Pod 删除到 service 的 endpoint 更新,再到 kube-proxy watch 到更新并更新 节点上的 iptables/ipvs 规则,这个过程是异步的,中间存在一点时间差,所以有可能存在 Pod 中的进程已经没有监听,但 iptables/ipvs 规则还没更新的情况)。针对这种情况,建议给容器加一个 preStop,在真正销毁 Pod 之前等待一段时间,留时间给 kube-proxy 更新转发规则,更新完之后就不会再有新连接往这个旧 Pod 转发了,preStop 示例:</p><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line"><span class="attr">lifecycle:</span></span><br><span class="line"><span class="attr"> preStop:</span></span><br><span class="line"><span class="attr"> exec:</span></span><br><span class="line"><span class="attr"> command:</span></span><br><span class="line"><span class="bullet"> -</span> <span class="string">/bin/bash</span></span><br><span class="line"><span class="bullet"> -</span> <span class="bullet">-c</span></span><br><span class="line"><span class="bullet"> -</span> <span class="string">sleep</span> <span class="number">30</span></span><br></pre></td></tr></table></figure><p>另外,还可能是新的 Pod 启动比较慢,虽然状态已经 Ready,但实际上可能端口还没监听,新的请求被转发到这个还没完全启动的 Pod 就会报错连接被拒绝。针对这种情况,建议给容器加就绪检查 (readinessProbe),让容器真正启动完之后才将其状态置为 Ready,然后 kube-proxy 才会更新转发规则,这样就能保证新的请求只被转发到完全启动的 Pod,readinessProbe 示例:</p><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line"><span class="attr">readinessProbe:</span></span><br><span class="line"><span class="attr"> httpGet:</span></span><br><span class="line"><span class="attr"> path:</span> <span class="string">/healthz</span></span><br><span class="line"><span class="attr"> port:</span> <span class="number">80</span></span><br><span class="line"><span class="attr"> httpHeaders:</span></span><br><span 
class="line"><span class="attr"> - name:</span> <span class="string">X-Custom-Header</span></span><br><span class="line"><span class="attr"> value:</span> <span class="string">Awesome</span></span><br><span class="line"><span class="attr"> initialDelaySeconds:</span> <span class="number">15</span></span><br><span class="line"><span class="attr"> timeoutSeconds:</span> <span class="number">1</span></span><br></pre></td></tr></table></figure></li><li><p><code>Connection timed out</code>: 连接超时。通常是连接还没建立,client 发 SYN 请求建立连接一直等到超时时间都没有收到 ACK,然后就报错连接超时。这个可能场景跟前面 <code>Connection refused</code> 可能的场景类似,不同点在于端口有监听,但进程无法正常响应了: 转发规则还没更新,旧 Pod 的进程正在停止过程中,虽然端口有监听,但已经不响应了;或者转发规则更新了,新 Pod 端口也监听了,但还没有真正就绪,还没有能力处理新请求。针对这些情况的建议跟前面一样:加 preStop 和 readinessProbe。</p></li></ul><p>下面我们来继续分析下滚动更新时发生 <code>No route to host</code> 的可能情况。</p><p>这个报错很明显,IP 无法路由,通常是将报文发到了一个已经彻底销毁的 Pod (网卡已经不在)。不可能发到一个网卡还没创建好的 Pod,因为即便不加存活检查,也是要等到 Pod 网络初始化完后才可能 Ready,然后 kube-proxy 才会更新转发规则。</p><p>什么情况下会转发到一个已经彻底销毁的 Pod? 借鉴前面几种滚动更新的报错分析,我们推测应该是 Pod 很快销毁了但转发规则还没更新,从而新的请求被转发了这个已经销毁的 Pod,最终报文到达这个 Pod 所在 PodCIDR 的 Node 上时,Node 发现本机已经没有这个 IP 的容器,然后 Node 就返回 ICMP 包告知 client 这个 IP 不可达,client 收到 ICMP 后,应用层就会报错 “No route to host”。</p><p>所以根据我们的分析,关键点在于 Pod 销毁太快,转发规则还没来得及更新,导致后来的请求被转发到已销毁的 Pod。针对这种情况,我们可以给容器加一个 preStop,留时间给 kube-proxy 更新转发规则来解决,参考 《Kubernetes实践指南》中的部分章节: <a href="https://k8s.imroc.io/best-practice/high-availability-deployment-of-applications#smooth-update-using-prestophook-and-readinessprobe" target="_blank" rel="noopener">https://k8s.imroc.io/best-practice/high-availability-deployment-of-applications#smooth-update-using-prestophook-and-readinessprobe</a></p><h2 id="问题没有解决"><a href="#问题没有解决" class="headerlink" title="问题没有解决"></a>问题没有解决</h2><p>我们自己没有复现用户的 “No route to host” 的问题,可能是复现条件比较苛刻,最后将我们上面理论上的分析结论作为解决方案给到了用户。</p><p>但用户尝试加了 preStop 之后,问题依然存在,服务滚动更新时偶尔还是会出现 “No route to host”。</p><h2 id="深入分析"><a href="#深入分析" class="headerlink" title="深入分析"></a>深入分析</h2><p>为了弄清楚根本原因,我们请求用户协助搭建了一个可以复现问题的测试环境,最终这个问题在测试环境中可以稳定复现。</p><p>仔细观察,实际是部署两个服务:ServiceA 和 ServiceB。使用 ab 压测工具去压测 ServiceA (短连接),然后 ServiceA 会通过 RPC 调用 ServiceB (短连接),滚动更新的是 ServiceB,报错发生在 ServiceA 调用 ServiceB 这条链路。</p><p>在 ServiceB 滚动更新期间,新的 Pod Ready 了之后会被添加到 IPVS 规则的 RS 列表,但旧的 Pod 不会立即被踢掉,而是将新的 Pod 权重置为1,旧的置为 0,通过在 client 所在节点查看 IPVS 规则可以看出来:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line">root@VM-0-3-ubuntu:~# ipvsadm -ln -t 172.16.255.241:80</span><br><span class="line">Prot LocalAddress:Port Scheduler Flags</span><br><span class="line"> -> RemoteAddress:Port Forward Weight ActiveConn InActConn</span><br><span class="line">TCP 172.16.255.241:80 rr</span><br><span class="line"> -> 172.16.8.106:80 Masq 0 5 14048</span><br><span class="line"> -> 172.16.8.107:80 Masq 1 2 243</span><br></pre></td></tr></table></figure><p>为什么不立即踢掉旧的 Pod 呢?因为要支持优雅结束,让存量的连接处理完,等存量连接全部结束了再踢掉它(ActiveConn+InactiveConn=0),这个逻辑可以通过这里的代码确认:<a href="https://github.com/kubernetes/kubernetes/blob/v1.17.0/pkg/proxy/ipvs/graceful_termination.go#L170" target="_blank" rel="noopener">https://github.com/kubernetes/kubernetes/blob/v1.17.0/pkg/proxy/ipvs/graceful_termination.go#L170</a></p><p>然后再通过 <code>ipvsadm -lnc | grep 172.16.8.106</code> 发现旧 Pod 上的连接大多是 <code>TIME_WAIT</code> 状态,这个也容易理解:因为 ServiceA 作为 client 发起短连接请求调用 ServiceB,调用完成就会关闭连接,TCP 三次挥手后进入 
<code>TIME_WAIT</code> 状态,等待 2*MSL (2 分钟) 的时长再清理连接。</p><p>经过上面的分析,看起来都是符合预期的,那为什么还会出现 “No route to host” 呢?难道权重被置为 0 之后还有新连接往这个旧 Pod 转发?我们来抓包看下:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line">root@VM-0-3-ubuntu:~# tcpdump -i eth0 host 172.16.8.106 -n -tttt</span><br><span class="line">tcpdump: verbose output suppressed, use -v or -vv for full protocol decode</span><br><span class="line">listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes</span><br><span class="line">2019-12-13 11:49:47.319093 IP 10.0.0.3.36708 > 172.16.8.106.80: Flags [S], seq 3988339656, win 29200, options [mss 1460,sackOK,TS val 3751111666 ecr 0,nop,wscale 9], length 0</span><br><span class="line">2019-12-13 11:49:47.319133 IP 10.0.0.3.36706 > 172.16.8.106.80: Flags [S], seq 109196945, win 29200, options [mss 1460,sackOK,TS val 3751111666 ecr 0,nop,wscale 9], length 0</span><br><span class="line">2019-12-13 11:49:47.319144 IP 10.0.0.3.36704 > 172.16.8.106.80: Flags [S], seq 1838682063, win 29200, options [mss 1460,sackOK,TS val 3751111666 ecr 0,nop,wscale 9], length 0</span><br><span class="line">2019-12-13 11:49:47.319153 IP 10.0.0.3.36702 > 172.16.8.106.80: Flags [S], seq 1591982963, win 29200, options [mss 1460,sackOK,TS val 3751111666 ecr 0,nop,wscale 9], length 0</span><br></pre></td></tr></table></figure><p>果然是!即使权重为 0,仍然会尝试发 SYN 包跟这个旧 Pod 建立连接,但永远无法收到 ACK,因为旧 Pod 已经销毁了。为什么会这样呢?难道是 IPVS 内核模块的调度算法有问题?尝试去看了下 linux 内核源码,并没有发现哪个调度策略的实现函数会将新连接调度到权重为 0 的 rs 上。</p><p>这就奇怪了,可能不是调度算法的问题?继续尝试看更多的代码,主要是 <code>net/netfilter/ipvs/ip_vs_core.c</code> 中的 <code>ip_vs_in</code> 函数,也就是 IPVS 模块处理报文的主要入口,发现它会先在本地连接转发表看这个包是否已经有对应的连接了(匹配五元组),如果有就说明它不是新连接也就不会调度,直接发给这个连接对应的之前已经调度过的 rs (也不会判断权重);如果没匹配到说明这个包是新的连接,就会走到调度这里 (rr, wrr 等调度策略),这个逻辑看起来也没问题。</p><p>那为什么会转发到权重为 0 的 rs ?难道是匹配连接这里出问题了?新的连接匹配到了旧的连接?我开始做实验验证这个猜想,修改一下这里的逻辑:检查匹配到的连接对应的 rs 如果权重为 0,则重新调度。然后重新编译和加载 IPVS 内核模块,再重新压测一下,发现问题解决了!没有报 “No route to host” 了。</p><p>虽然通过改内核源码解决了,但我知道这不是一个好的解决方案,它会导致 IPVS 不支持连接的优雅结束,因为不再转发包给权重为 0 的 rs,存量的连接就会立即中断。</p><p>继续陷入深思……</p><p>这个实验只是证明了猜想:新连接匹配到了旧连接。那为什么会这样呢?难道新连接报文的五元组跟旧连接的相同了?</p><p>经过一番思考,发现这个是有可能的。因为 ServiceA 作为 client 请求 ServiceB,不同请求的源 IP 始终是相同的,关键点在于源端口是否可能相同。由于 ServiceA 向 ServiceB 发起大量短连接,ServiceA 所在节点就会有大量 <code>TIME_WAIT</code> 状态的连接,需要等 2 分钟 (2*MSL) 才会清理,而由于连接量太大,每次发起的连接都会占用一个源端口,当源端口不够用了,就会重用 <code>TIME_WAIT</code> 状态连接的源端口,这个时候当报文进入 IPVS 模块,检测到它的五元组跟本地连接转发表中的某个连接一致(<code>TIME_WAIT</code> 状态),就以为它是一个存量连接,然后直接将报文转发给这个连接之前对应的 rs 上,然而这个 rs 对应的 Pod 早已销毁,所以抓包看到的现象是将 SYN 发给了旧 Pod,并且无法收到 ACK,伴随着返回 ICMP 告知这个 IP 不可达,也被应用解释为 “No route to host”。</p><p>后来无意间又发现一个还在 open 状态的 issue,虽然还没提到 “No route to host” 关键字,但讨论的跟我们这个其实是同一个问题。我也参与了讨论,有兴趣的同学可以看下:<a href="https://github.com/kubernetes/kubernetes/issues/81775" target="_blank" rel="noopener">https://github.com/kubernetes/kubernetes/issues/81775</a></p><h2 id="总结"><a href="#总结" class="headerlink" title="总结"></a>总结</h2><p>这个问题通常发生的场景就是类似于我们测试环境这种:ServiceA 对外提供服务,当外部发起请求,ServiceA 会通过 rpc 或 http 调用 ServiceB,如果外部请求量变大,ServiceA 调用 ServiceB 的量也会跟着变大,大到一定程度,ServiceA 所在节点源端口不够用,复用 <code>TIME_WAIT</code> 状态连接的源端口,导致五元组跟 IPVS 里连接转发表中的 <code>TIME_WAIT</code> 连接相同,IPVS 就认为这是一个存量连接的报文,就不判断权重直接转发给之前的 rs,导致转发到已销毁的 Pod,从而发生 “No route to host”。</p><p>如何规避?集群规模小可以使用 iptables 模式,如果需要使用 ipvs 模式,可以增加 ServiceA 
的副本,并且配置反亲和性 (podAntiAffinity),让 ServiceA 的 Pod 部署到不同节点,分摊流量,避免流量集中到某一个节点,导致调用 ServiceB 时源端口复用 (文末附了一个简单的 podAntiAffinity 配置示意)。</p><p>如何彻底解决?暂时还没有一个完美的方案。</p><p>Issue 85517 讨论让 kube-proxy 支持自定义配置几种连接状态的超时时间,但这对 <code>TIME_WAIT</code> 状态无效。</p><p>Issue 81308 讨论 IPVS 的优雅结束是否不考虑不活跃的连接 (包括 <code>TIME_WAIT</code> 状态的连接),也就是只考虑活跃连接,当活跃连接数为 0 之后立即踢掉 rs。这个确实可以更快地踢掉 rs,但无法让优雅结束做到那么优雅了,并且有人测试了,即便是不考虑不活跃连接,当请求量很大时,还是不能很快踢掉 rs,因为源端口复用还是会导致新连接不断匹配到旧连接的 conntrack 表项,而且在较新的内核版本中,<code>SYN_RECV</code> 状态也被视为活跃连接,所以活跃连接数还是不会很快降到 0。</p><p>这个问题的终极解决方案该走向何方,我们拭目以待,感兴趣的同学可以持续关注 issue 81775 并参与讨论。想学习更多 K8S 知识,可以关注本人的开源书《Kubernetes实践指南》: <a href="https://k8s.imroc.io" target="_blank" rel="noopener">https://k8s.imroc.io</a></p>
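<p>最后补充一个前面提到的规避方式的配置示意:假设 ServiceA 对应的 Deployment 名为 service-a,Pod 使用 <code>app: service-a</code> 这个 label(名称、副本数与镜像均为假设,请按实际情况调整),增加副本并配置 podAntiAffinity 大致可以这样写:</p>
<figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br></pre></td><td class="code"><pre><span class="line">apiVersion: apps/v1</span><br><span class="line">kind: Deployment</span><br><span class="line">metadata:</span><br><span class="line">  name: service-a</span><br><span class="line">spec:</span><br><span class="line">  replicas: 3                  # 适当增加副本数,分摊流量</span><br><span class="line">  selector:</span><br><span class="line">    matchLabels:</span><br><span class="line">      app: service-a</span><br><span class="line">  template:</span><br><span class="line">    metadata:</span><br><span class="line">      labels:</span><br><span class="line">        app: service-a</span><br><span class="line">    spec:</span><br><span class="line">      affinity:</span><br><span class="line">        podAntiAffinity:</span><br><span class="line">          preferredDuringSchedulingIgnoredDuringExecution:</span><br><span class="line">          - weight: 100</span><br><span class="line">            podAffinityTerm:</span><br><span class="line">              labelSelector:</span><br><span class="line">                matchLabels:</span><br><span class="line">                  app: service-a</span><br><span class="line">              topologyKey: kubernetes.io/hostname  # 尽量把副本调度到不同节点</span><br><span class="line">      containers:</span><br><span class="line">      - name: service-a</span><br><span class="line">        image: service-a:latest    # 镜像仅为示意</span><br></pre></td></tr></table></figure>
<p>这里用的是 preferred 软性反亲和,调度器会尽量把副本打散到不同节点;如果希望强制打散,可以改用 <code>requiredDuringSchedulingIgnoredDuringExecution</code>(写法略有不同,required 下直接写 PodAffinityTerm,不带 weight),但要注意可调度的节点数不能少于副本数,否则会有 Pod 一直 Pending。</p>]]></content>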
<summary type="html">
<p>作者: <a href="https://imroc.io/" target="_blank" rel="noopener">陈鹏</a></p>
<p>之前发过一篇干货满满的爆火文章 <a href="https://tencentcloudcontainerteam.g
</summary>
</entry>
<entry>
<title>k8s v1.17 新特性预告: 拓扑感知服务路由</title>
<link href="https://TencentCloudContainerTeam.github.io/2019/11/26/service-topology/"/>
<id>https://TencentCloudContainerTeam.github.io/2019/11/26/service-topology/</id>
<published>2019-11-26T08:18:00.000Z</published>
<updated>2020-06-16T01:53:49.351Z</updated>
<content type="html"><![CDATA[<p>作者: <a href="https://imroc.io/" target="_blank" rel="noopener">陈鹏</a></p><p>大家好,我是 roc,来自腾讯云容器服务(TKE)团队,今天给大家介绍下我参与开发的一个 k8s v1.17 新特性: 拓扑感知服务路由。</p><h2 id="名词解释"><a href="#名词解释" class="headerlink" title="名词解释"></a>名词解释</h2><ul><li>拓扑域: 表示在集群中的某一类 “地方”,比如某节点、某机架、某可用区或某地域等,这些都可以作为某种拓扑域。</li><li>endpoint: k8s 某个服务的某个 ip+port,通常是 pod 的 ip+port。</li><li>service: k8s 的 service 资源(服务),关联一组 endpoint ,访问 service 会被转发到关联的某个 endpoint 上。</li></ul><h2 id="背景"><a href="#背景" class="headerlink" title="背景"></a>背景</h2><p>拓扑感知服务路由,此特性最初由杜军大佬提出并设计。为什么要设计此特性呢?想象一下,k8s 集群节点分布在不同的地方,service 对应的 endpoints 分布在不同节点,传统转发策略会对所有 endpoint 做负载均衡,通常会等概率转发,当访问 service 时,流量就可能被分散打到这些不同的地方。虽然 service 转发做了负载均衡,但如果 endpoint 距离比较远,流量转发过去网络时延就相对比较高,会影响网络性能,在某些情况下甚至还可能会付出额外的流量费用。要是如能实现 service 就近转发 endpoint,是不是就可以实现降低网络时延,提升网络性能了呢?是的!这也正是该特性所提出的目的和意义。</p><h2 id="k8s-亲和性"><a href="#k8s-亲和性" class="headerlink" title="k8s 亲和性"></a>k8s 亲和性</h2><p>service 的就近转发实际就是一种网络的亲和性,倾向于转发到离自己比较近的 endpoint。在此特性之前,已经在调度和存储方面有一些亲和性的设计与实现:</p><ul><li>节点亲和性 (Node Affinity): 让 Pod 被调度到符合一些期望条件的 Node 上,比如限制调度到某一可用区,或者要求节点支持 GPU,这算是调度亲和,调度结果取决于节点属性。</li><li>Pod 亲和性与反亲和性 (Pod Affinity/AntiAffinity): 让一组 Pod 调度到同一拓扑域的节点上,或者打散到不同拓扑域的节点, 这也算是调度亲和,调度结果取决于其它 Pod。</li><li>数据卷拓扑感知调度 (Volume Topology-aware Scheduling): 让 Pod 只被调度到符合其绑定的存储所在拓扑域的节点上,这算是调度与存储的亲和,调度结果取决于存储的拓扑域。</li><li>本地数据卷 (Local Persistent Volume): 让 Pod 使用本地数据卷,比如高性能 SSD,在某些需要高 IOPS 低时延的场景很有用,它还会保证 Pod 始终被调度到同一节点,数据就不会不丢失,这也算是调度与存储的亲和,调度结果取决于存储所在节点。</li><li>数据卷拓扑感知动态创建 (Topology-Aware Volume Dynamic Provisioning): 先调度 Pod,再根据 Pod 所在节点的拓扑域来创建存储,这算是存储与调度的亲和,存储的创建取决于调度的结果。</li></ul><p>而 k8s 目前在网络方面还没有亲和性能力,拓扑感知服务路由这个新特性恰好可以补齐这个的空缺,此特性使得 service 可以实现就近转发而不是所有 endpoint 等概率转发。</p><h2 id="如何实现"><a href="#如何实现" class="headerlink" title="如何实现"></a>如何实现</h2><p>我们知道,service 转发主要是 node 上的 kube-proxy 进程通过 watch apiserver 获取 service 对应的 endpoint,再写入 iptables 或 ipvs 规则来实现的; 对于 headless service,主要是通过 kube-dns 或 coredns 动态解析到不同 endpoint ip 来实现的。实现 service 就近转发的关键点就在于如何将流量转发到跟当前节点在同一拓扑域的 endpoint 上,也就是会进行一次 endpoint 筛选,选出一部分符合当前节点拓扑域的 endpoint 进行转发。</p><p>那么如何判断 endpoint 跟当前节点是否在同一拓扑域里呢?只要能获取到 endpoint 的拓扑信息,用它跟当前节点拓扑对比下就可以知道了。那又如何获取 endpoint 的拓扑信息呢?答案是通过 endpoint 所在节点的 label,我们可以使用 node label 来描述拓扑域。</p><p>通常在节点初始化的时候,controller-manager 就会为节点打上许多 label,比如 <code>kubernetes.io/hostname</code> 表示节点的 hostname 来区分节点;另外,在云厂商提供的 k8s 服务,或者使用 cloud-controller-manager 的自建集群,通常还会给节点打上 <code>failure-domain.beta.kubernetes.io/zone</code> 和 <code>failure-domain.beta.kubernetes.io/region</code> 以区分节点所在可用区和所在地域,但自 v1.17 开始将会改名成 <code>topology.kubernetes.io/zone</code> 和 <code>topology.kubernetes.io/region</code>,参见 <a href="https://github.com/kubernetes/kubernetes/pull/81431" target="_blank" rel="noopener">PR #81431</a>。</p><p>如何根据 endpoint 查到它所在节点的这些 label 呢?答案是通过 <code>Endpoint Slice</code>,该特性在 v1.16 发布了 alpha,在 v1.17 将会进入 beta,它相当于 Endpoint API 增强版,通过将 endpoint 做数据分片来解决大规模 endpoint 的性能问题,并且可以携带更多的信息,包括 endpoint 所在节点的拓扑信息,拓扑感知服务路由特性会通过 <code>Endpoint Slice</code> 获取这些拓扑信息实现 endpoint 筛选 (过滤出在同一拓扑域的 endpoint),然后再转换为 iptables 或 ipvs 规则写入节点以实现拓扑感知的路由转发。</p><p>细心的你可能已经发现,之前每个节点上转发 service 的 iptables/ipvs 规则基本是一样的,但启用了拓扑感知服务路由特性之后,每个节点上的转发规则就可能不一样了,因为不同节点的拓扑信息不一样,导致过滤出的 endpoint 就不一样,也正是因为这样,service 转发变得不再等概率,灵活的就近转发才得以实现。</p><p>当前还不支持 headless service 的拓扑路由,计划在 beta 阶段支持。由于 headless service 不是通过 kube-proxy 生成转发规则,而是通过 dns 动态解析实现的,所以需要改 kube-dns/coredns 来支持这个特性。</p><h2 id="前提条件"><a href="#前提条件" class="headerlink" title="前提条件"></a>前提条件</h2><p>启用当前 alpha 实现的拓扑感知服务路由特性需要满足以下前提条件:</p><ul><li>集群版本在 
v1.17 及其以上。</li><li>Kube-proxy 以 iptables 或 IPVS 模式运行 (alpha 阶段暂时只实现了这两种模式)。</li><li>启用了 <a href="https://kubernetes.io/docs/concepts/services-networking/endpoint-slices/" target="_blank" rel="noopener">Endpoint Slices</a> (此特性虽然在 v1.17 进入 beta,但没有默认开启)。</li></ul><h2 id="如何启用此特性"><a href="#如何启用此特性" class="headerlink" title="如何启用此特性"></a>如何启用此特性</h2><p>给所有 k8s 组件打开 <code>ServiceTopology</code> 和 <code>EndpointSlice</code> 这两个 feature:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">--feature-gates="ServiceTopology=true,EndpointSlice=true"</span><br></pre></td></tr></table></figure><h2 id="如何使用"><a href="#如何使用" class="headerlink" title="如何使用"></a>如何使用</h2><p>在 Service spec 里加上 <code>topologyKeys</code> 字段,表示该 Service 优先顺序选用的拓扑域列表,对应节点标签的 key;当访问此 Service 时,会找是否有 endpoint 有对应 topology key 的拓扑信息并且 value 跟当前节点也一样,如果是,那就选定此 topology key 作为当前转发的拓扑域,并且筛选出其余所有在这个拓扑域的 endpoint 来进行转发;如果没有找到任何 endpoint 在当前 topology key 对应拓扑域,就会尝试第二个 topology key,依此类推;如果遍历完所有 topology key 也没有匹配到 endpoint 就会拒绝转发,就像此 service 没有后端 endpoint 一样。</p><p>有一个特殊的 topology key “<code>*</code>“,它可以匹配所有 endpoint,如果 <code>topologyKeys</code> 包含了 <code>*</code>,它必须在列表末尾,通常是在没有匹配到合适的拓扑域来实现就近转发时,就打消就近转发的念头,可以转发到任意 endpoint 上。</p><p>当前 topology key 支持以下可能的值(未来会增加更多):</p><ul><li><code>kubernetes.io/hostname</code>: 节点的 hostname,通常将它放列表中第一个,表示如果本机有 endpoint 就直接转发到本机的 endpoint。</li><li><code>topology.kubernetes.io/zone</code>: 节点所在的可用区,通常将它放在 <code>kubernetes.io/hostname</code> 后面,表示如果本机没有对应 endpoint,就转发到当前可用区其它节点上的 endpoint(部分云厂商跨可用区通信会收取额外的流量费用)。</li><li><code>topology.kubernetes.io/region</code>: 表示节点所在的地域,表示转发到当前地域的 endpoint,这个用的应该会比较少,因为通常集群所有节点都只会在同一个地域,如果节点跨地域了,节点之间通信延时将会很高。</li><li><code>*</code>: 忽略拓扑域,匹配所有 endpoint,相当于一个保底策略,避免丢包,只能放在列表末尾。</li></ul><p>除此之外,还有以下约束:</p><ul><li><code>topologyKeys</code> 与 <code>externalTrafficPolicy=Local</code> 不兼容,是互斥的,如果 <code>externalTrafficPolicy</code> 为 <code>Local</code>,就不能定义 <code>topologyKeys</code>,反之亦然。</li><li>topology key 必须是合法的 label 格式,并且最多定义 16 个 key。</li></ul><p>这里给出一个简单的 Service 示例:</p><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br></pre></td><td class="code"><pre><span class="line"><span class="attr">apiVersion:</span> <span class="string">v1</span></span><br><span class="line"><span class="attr">kind:</span> <span class="string">Service</span></span><br><span class="line"><span class="attr">metadata:</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">nginx</span></span><br><span class="line"><span class="attr">spec:</span></span><br><span class="line"><span class="attr"> type:</span> <span class="string">ClusterIP</span></span><br><span class="line"><span class="attr"> ports:</span></span><br><span class="line"><span class="attr"> - name:</span> <span class="string">http</span></span><br><span class="line"><span class="attr"> port:</span> <span class="number">80</span></span><br><span class="line"><span class="attr"> protocol:</span> <span class="string">TCP</span></span><br><span 
class="line"><span class="attr"> targetPort:</span> <span class="number">80</span></span><br><span class="line"><span class="attr"> selector:</span></span><br><span class="line"><span class="attr"> app:</span> <span class="string">nginx</span></span><br><span class="line"><span class="attr"> topologyKeys:</span> <span class="string">["kubernetes.io/hostname",</span> <span class="string">"topology.kubernetes.io/zone"</span><span class="string">,</span> <span class="string">"*"</span><span class="string">]</span></span><br></pre></td></tr></table></figure><p>解释: 当访问 nginx 服务时,首先看本机是否有这个服务的 endpoint,如果有就直接本机路由过去;如果没有,就看是否有 endpoint 位于当前节点所在可用区,如果有,就转发过去,如果还是没有,就转发给任意 endpoint。</p><p><img src="https://imroc.io/assets/blog/service-topology.png" alt=""></p><p>上图就是其中一次转发的例子:Pod 访问 nginx 这个 service 时,发现本机没有 endpoint,就找当前可用区的,找到了就转发过去,也就不会考虑转发给另一可用区的 endpoint。</p><h2 id="背后小故事"><a href="#背后小故事" class="headerlink" title="背后小故事"></a>背后小故事</h2><p>此特性的 KEP Proposal 最终被认可(合并)时的设计与当前最终的代码实现已经有一些差别,实现方案历经一变再变,但同时也推动了其它特性的发展,我来讲下这其中的故事。</p><p>一开始设计是在 alpha 时,让 kube-proxy 直接暴力 watch node,每个节点都有一份全局的 node 的缓存,通过 endpoint 的 <code>nodeName</code> 字段找到对应的 node 缓存,再查 node 包含的 label 就可以知道该 endpoint 的拓扑域了,但在集群节点数量多的情况下,kube-proxy 将会消耗大量资源,不过优点是实现上很简单,可以作为 alpha 阶段的实现,beta 时再从 watch node 切换到 watch 一个新设计的 PodLocator API,作为拓扑信息存储的中介,避免 watch 庞大的 node。</p><p>实际上一开始我也是按照 watch node 的方式,花了九牛二虎之力终于实现了这个特性,后来 v1.15 时 k8s 又支持了 metadata-only watch,参见 <a href="https://github.com/kubernetes/kubernetes/pull/71548" target="_blank" rel="noopener">PR 71548</a>,利用此特性可以仅仅 watch node 的 metadata,而不用 watch 整个 node,可以极大减小传输和缓存的数据量,然后我就将实现切成了 watch node metadata; 即便如此,metadata 还是会更新比较频繁,主要是 <code>resourceVersion</code> 会经常变 (kubelet 经常上报 node 状态),所以虽然 watch node metadata 比 watch node 要好,但也还是可能会造成大量不必要的网络流量,但作为 alpha 实现是可以接受的。</p><p>可惜在 v1.16 code freeze 之前没能将此特性合进去,只因有一点小细节还没讨论清楚。 实际在实现 watch node 方案期间,Endpoint Slice 特性就提出来了,在这个特性讨论的阶段,我们就想到了可以利用它来携带拓扑信息,以便让拓扑感知服务路由这个特性后续可以直接利用 Endpoint Slice 来获取拓扑信息,也就可以替代之前设计的 PodLocator API,但由于它还处于很早期阶段,并且代码还未合并进去,所以 alpha 阶段先不考虑 watch Endpint Slice。后来,Endpoint Slice 特性在 v1.16 发布了 alpha。</p><p>由于 v1.16 没能将拓扑感知服务路由特性合进去,在 v1.17 周期开始后,有更多时间来讨论小细节,并且 Endpoint Slice 代码已经合并,我就干脆直接又将实现从 watch node metadata 切成了 watch Endpint Slice,在 alpha 阶段就做了打算在 beta 阶段做的事情,终于,此特性实现代码最终合进了主干。</p><h2 id="结尾"><a href="#结尾" class="headerlink" title="结尾"></a>结尾</h2><p>拓扑感知服务路由可以实现 service 就近转发,减少网络延时,进一步提升 k8s 的网络性能,此特性将于 k8s v1.17 发布 alpha,时间是 12 月上旬,让我们一起期待吧!k8s 网络是块难啃的硬骨头,感兴趣的同学可以看下杜军的新书 <a href="https://item.jd.com/12724298.html" target="_blank" rel="noopener">《Kubernetes 网络权威指南》</a>,整理巩固一下 k8s 的网络知识。</p><h2 id="参考资料"><a href="#参考资料" class="headerlink" title="参考资料"></a>参考资料</h2><ul><li>KEP: EndpintSlice - <a href="https://github.com/kubernetes/enhancements/blob/master/keps/sig-network/20190603-EndpointSlice-API.md" target="_blank" rel="noopener">https://github.com/kubernetes/enhancements/blob/master/keps/sig-network/20190603-EndpointSlice-API.md</a></li><li>Proposal: Volume Topology-aware Scheduling - <a href="https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/volume-topology-scheduling.md" target="_blank" rel="noopener">https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/volume-topology-scheduling.md</a></li><li>PR: Service Topology implementation for Kubernetes - <a href="https://github.com/kubernetes/kubernetes/pull/72046" target="_blank" rel="noopener">https://github.com/kubernetes/kubernetes/pull/72046</a></li><li>Proposal: Inter-pod 
topological affinity and anti-affinity - <a href="https://github.com/kubernetes/community/blob/master/contributors/design-proposals/scheduling/podaffinity.md" target="_blank" rel="noopener">https://github.com/kubernetes/community/blob/master/contributors/design-proposals/scheduling/podaffinity.md</a></li><li>Topology-Aware Volume Provisioning in Kubernetes - <a href="https://kubernetes.io/blog/2018/10/11/topology-aware-volume-provisioning-in-kubernetes/" target="_blank" rel="noopener">https://kubernetes.io/blog/2018/10/11/topology-aware-volume-provisioning-in-kubernetes/</a></li><li>Kubernetes 1.14: Local Persistent Volumes GA - <a href="https://kubernetes.io/blog/2019/04/04/kubernetes-1.14-local-persistent-volumes-ga/" target="_blank" rel="noopener">https://kubernetes.io/blog/2019/04/04/kubernetes-1.14-local-persistent-volumes-ga/</a></li><li>KubeCon 演讲: 面向 k8s 的拓扑感知服务路由即将推出! - <a href="https://v.qq.com/x/page/t0893nn9zqa.html" target="_blank" rel="noopener">https://v.qq.com/x/page/t0893nn9zqa.html</a></li><li>拓扑感知服务路由官方文档(等v1.17发布后才能看到) - <a href="https://kubernetes.io/docs/concepts/services-networking/service-topology/" target="_blank" rel="noopener">https://kubernetes.io/docs/concepts/services-networking/service-topology/</a></li><li>KEP: Topology-aware service routing - <a href="https://github.com/kubernetes/enhancements/blob/master/keps/sig-network/20181024-service-topology.md" target="_blank" rel="noopener">https://github.com/kubernetes/enhancements/blob/master/keps/sig-network/20181024-service-topology.md</a> (此文档后续会更新,因为实现跟设计已经不一样了)</li></ul>]]></content>
<summary type="html">
<p>作者: <a href="https://imroc.io/" target="_blank" rel="noopener">陈鹏</a></p>
<p>大家好,我是 roc,来自腾讯云容器服务(TKE)团队,今天给大家介绍下我参与开发的一个 k8s v1.17 新特性:
</summary>
</entry>
<entry>
<title>Kubernetes 网络疑难杂症排查分享</title>
<link href="https://TencentCloudContainerTeam.github.io/2019/08/12/troubleshooting-with-kubernetes-network/"/>
<id>https://TencentCloudContainerTeam.github.io/2019/08/12/troubleshooting-with-kubernetes-network/</id>
<published>2019-08-12T09:03:00.000Z</published>
<updated>2020-06-16T01:53:49.351Z</updated>
<content type="html"><![CDATA[<p>作者: <a href="https://imroc.io/" target="_blank" rel="noopener">陈鹏</a></p><p>大家好,我是 roc,来自腾讯云容器服务(TKE)团队,经常帮助用户解决各种 K8S 的疑难杂症,积累了比较丰富的经验,本文分享几个比较复杂的网络方面的问题排查和解决思路,深入分析并展开相关知识,信息量巨大,相关经验不足的同学可能需要细细品味才能消化,我建议收藏本文反复研读,当完全看懂后我相信你的功底会更加扎实,解决问题的能力会大大提升。</p><blockquote><p>本文发现的问题是在使用 TKE 时遇到的,不同厂商的网络环境可能不一样,文中会对不同的问题的网络环境进行说明</p></blockquote><p><img src="https://imroc.io/assets/meme/dengguangshi.png" alt=""></p><h2 id="跨-VPC-访问-NodePort-经常超时"><a href="#跨-VPC-访问-NodePort-经常超时" class="headerlink" title="跨 VPC 访问 NodePort 经常超时"></a>跨 VPC 访问 NodePort 经常超时</h2><p>现象: 从 VPC a 访问 VPC b 的 TKE 集群的某个节点的 NodePort,有时候正常,有时候会卡住直到超时。</p><p>原因怎么查?</p><p><img src="https://imroc.io/assets/meme/emoji_analysis.png" alt=""></p><p>当然是先抓包看看啦,抓 server 端 NodePort 的包,发现异常时 server 能收到 SYN,但没响应 ACK:</p><p><img src="https://imroc.io/assets/blog/troubleshooting-k8s-network/no_ack.png" alt=""></p><p>反复执行 <code>netstat -s | grep LISTEN</code> 发现 SYN 被丢弃数量不断增加:</p><p><img src="https://imroc.io/assets/blog/troubleshooting-k8s-network/drop_syn.png" alt=""></p><p>分析:</p><ul><li>两个VPC之间使用对等连接打通的,CVM 之间通信应该就跟在一个内网一样可以互通。</li><li>为什么同一 VPC 下访问没问题,跨 VPC 有问题? 两者访问的区别是什么?</li></ul><p><img src="https://imroc.io/assets/meme/man_need_think.png" alt=""></p><p>再仔细看下 client 所在环境,发现 client 是 VPC a 的 TKE 集群节点,捋一下:</p><ul><li>client 在 VPC a 的 TKE 集群的节点</li><li>server 在 VPC b 的 TKE 集群的节点</li></ul><p>因为 TKE 集群中有个叫 <code>ip-masq-agent</code> 的 daemonset,它会给 node 写 iptables 规则,默认 SNAT 目的 IP 是 VPC 之外的报文,所以 client 访问 server 会做 SNAT,也就是这里跨 VPC 相比同 VPC 访问 NodePort 多了一次 SNAT,如果是因为多了一次 SNAT 导致的这个问题,直觉告诉我这个应该跟内核参数有关,因为是 server 收到包没回包,所以应该是 server 所在 node 的内核参数问题,对比这个 node 和 普通 TKE node 的默认内核参数,发现这个 node <code>net.ipv4.tcp_tw_recycle = 1</code>,这个参数默认是关闭的,跟用户沟通后发现这个内核参数确实在做压测的时候调整过。</p><p><img src="https://imroc.io/assets/meme/chijing2.png" alt=""></p><p>解释一下,TCP 主动关闭连接的一方在发送最后一个 ACK 会进入 <code>TIME_AWAIT</code> 状态,再等待 2 个 MSL 时间后才会关闭(因为如果 server 没收到 client 第四次挥手确认报文,server 会重发第三次挥手 FIN 报文,所以 client 需要停留 2 MSL的时长来处理可能会重复收到的报文段;同时等待 2 MSL 也可以让由于网络不通畅产生的滞留报文失效,避免新建立的连接收到之前旧连接的报文),了解更详细的过程请参考 TCP 四次挥手。</p><p>参数 <code>tcp_tw_recycle</code> 用于快速回收 <code>TIME_AWAIT</code> 连接,通常在增加连接并发能力的场景会开启,比如发起大量短连接,快速回收可避免 <code>tw_buckets</code> 资源耗尽导致无法建立新连接 (<code>time wait bucket table overflow</code>)</p><p>查得 <code>tcp_tw_recycle</code> 有个坑,在 RFC1323 有段描述:</p><p><code>An additional mechanism could be added to the TCP, a per-host cache of the last timestamp received from any connection. This value could then be used in the PAWS mechanism to reject old duplicate segments from earlier incarnations of the connection, if the timestamp clock can be guaranteed to have ticked at least once since the old connection was open. This would require that the TIME-WAIT delay plus the RTT together must be at least one tick of the sender’s timestamp clock. 
Such an extension is not part of the proposal of this RFC.</code></p><p>大概意思是说 TCP 有一种行为,可以缓存每个连接最新的时间戳,后续请求中如果时间戳小于缓存的时间戳,即视为无效,相应的数据包会被丢弃。 </p><p>Linux 是否启用这种行为取决于 <code>tcp_timestamps</code> 和 <code>tcp_tw_recycle</code>,因为 <code>tcp_timestamps</code> 缺省开启,所以当 <code>tcp_tw_recycle</code> 被开启后,实际上这种行为就被激活了,当客户端或服务端以 <code>NAT</code> 方式构建的时候就可能出现问题。</p><p>当多个客户端通过 NAT 方式联网并与服务端交互时,服务端看到的是同一个 IP,也就是说对服务端而言这些客户端实际上等同于一个,可惜由于这些客户端的时间戳可能存在差异,于是乎从服务端的视角看,便可能出现时间戳错乱的现象,进而直接导致时间戳小的数据包被丢弃。如果发生了此类问题,具体的表现通常是是客户端明明发送的 SYN,但服务端就是不响应 ACK。</p><p>回到我们的问题上,client 所在节点上可能也会有其它 pod 访问到 server 所在节点,而它们都被 SNAT 成了 client 所在节点的 NODE IP,但时间戳存在差异,server 就会看到时间戳错乱,因为开启了 <code>tcp_tw_recycle</code> 和 <code>tcp_timestamps</code> 激活了上述行为,就丢掉了比缓存时间戳小的报文,导致部分 SYN 被丢弃,这也解释了为什么之前我们抓包发现异常时 server 收到了 SYN,但没有响应 ACK,进而说明为什么 client 的请求部分会卡住直到超时。</p><p>由于 <code>tcp_tw_recycle</code> 坑太多,在内核 4.12 之后已移除: <a href="https://github.com/torvalds/linux/commit/4396e46187ca5070219b81773c4e65088dac50cc" target="_blank" rel="noopener">remove tcp_tw_recycle</a></p><p><img src="https://imroc.io/assets/meme/laugh1.png" alt=""></p><h2 id="LB-压测-CPS-低"><a href="#LB-压测-CPS-低" class="headerlink" title="LB 压测 CPS 低"></a>LB 压测 CPS 低</h2><p>现象: LoadBalancer 类型的 Service,直接压测 NodePort CPS 比较高,但如果压测 LB CPS 就很低。</p><p>环境说明: 用户使用的黑石TKE,不是公有云TKE,黑石的机器是物理机,LB的实现也跟公有云不一样,但 LoadBalancer 类型的 Service 的实现同样也是 LB 绑定各节点的 NodePort,报文发到 LB 后转到节点的 NodePort, 然后再路由到对应 pod,而测试在公有云 TKE 环境下没有这个问题。</p><p>client 抓包: 大量SYN重传。</p><p>server 抓包: 抓 NodePort 的包,发现当 client SYN 重传时 server 能收到 SYN 包但没有响应。</p><p><img src="https://imroc.io/assets/meme/emoji_analysis.png" alt=""></p><p>又是 SYN 收到但没响应,难道又是开启 <code>tcp_tw_recycle</code> 导致的?检查节点的内核参数发现并没有开启,除了这个原因,还会有什么情况能导致被丢弃?</p><p><code>conntrack -S</code> 看到 <code>insert_failed</code> 数量在不断增加,也就是 conntrack 在插入很多新连接的时候失败了,为什么会插入失败?什么情况下会插入失败?</p><p><img src="https://imroc.io/assets/meme/analysis_forever.png" alt=""></p><p>挖内核源码: netfilter conntrack 模块为每个连接创建 conntrack 表项时,表项的创建和最终插入之间还有一段逻辑,没有加锁,是一种乐观锁的过程。conntrack 表项并发刚创建时五元组不冲突的话可以创建成功,但中间经过 NAT 转换之后五元组就可能变成相同,第一个可以插入成功,后面的就会插入失败,因为已经有相同的表项存在。比如一个 SYN 已经做了 NAT 但是还没到最终插入的时候,另一个 SYN 也在做 NAT,因为之前那个 SYN 还没插入,这个 SYN 做 NAT 的时候就认为这个五元组没有被占用,那么它 NAT 之后的五元组就可能跟那个还没插入的包相同。</p><p>在我们这个问题里实际就是 netfilter 做 SNAT 时源端口选举冲突了,黑石 LB 会做 SNAT,SNAT 时使用了 16 个不同 IP 做源,但是短时间内源 Port 却是集中一致的,并发两个 SYN a 和SYN b,被 LB SNAT 后源 IP 不同但源 Port 很可能相同,这里就假设两个报文被 LB SNAT 之后它们源 IP 不同源 Port 相同,报文同时到了节点的 NodePort 会再次做 SNAT 再转发到对应的 Pod,当报文到了 NodePort 时,这时它们五元组不冲突,netfilter 为它们分别创建了 conntrack 表项,SYN a 被节点 SNAT 时默认行为是 从 port_range 范围的当前源 Port 作为起始位置开始循环遍历,选举出没有被占用的作为源 Port,因为这两个 SYN 源 Port 相同,所以它们源 Port 选举的起始位置相同,当 SYN a 选出源 Port 但还没将 conntrack 表项插入时,netfilter 认为这个 Port 没被占用就很可能给 SYN b 也选了相同的源 Port,这时他们五元组就相同了,当 SYN a 的 conntrack 表项插入后再插入 SYN b 的 conntrack 表项时,发现已经有相同的记录就将 SYN b 的 conntrack 表项丢弃了。</p><p>解决方法探索: 不使用源端口选举,在 iptables 的 MASQUERADE 规则如果加 <code>--random-fully</code> 这个 flag 可以让端口选举完全随机,基本上能避免绝大多数的冲突,但也无法完全杜绝。最终决定开发 LB 直接绑 Pod IP,不基于 NodePort,从而避免 netfilter 的 SNAT 源端口冲突问题。</p><p><img src="https://imroc.io/assets/meme/emoji_jizhi.png" alt=""></p><h2 id="DNS-解析偶尔-5S-延时"><a href="#DNS-解析偶尔-5S-延时" class="headerlink" title="DNS 解析偶尔 5S 延时"></a>DNS 解析偶尔 5S 延时</h2><p>网上一搜,是已知问题,仔细分析,实际跟之前黑石 TKE 压测 LB CPS 低的根因是同一个,都是因为 netfilter conntrack 模块的设计问题,只不过之前发生在 SNAT,这个发生在 DNAT,这里用我的语言来总结下原因:</p><p>DNS client (glibc 或 musl libc) 会并发请求 A 和 AAAA 记录,跟 DNS Server 通信自然会先 connect (建立fd),后面请求报文使用这个 fd 来发送,由于 UDP 是无状态协议, connect 时并不会创建 conntrack 表项, 而并发请求的 A 和 AAAA 记录默认使用同一个 fd 发包,这时它们源 Port 相同,当并发发包时,两个包都还没有被插入 conntrack 表项,所以 
netfilter 会为它们分别创建 conntrack 表项,而集群内请求 kube-dns 或 coredns 都是访问的CLUSTER-IP,报文最终会被 DNAT 成一个 endpoint 的 POD IP,当两个包被 DNAT 成同一个 IP,最终它们的五元组就相同了,在最终插入的时候后面那个包就会被丢掉,如果 dns 的 pod 副本只有一个实例的情况就很容易发生,现象就是 dns 请求超时,client 默认策略是等待 5s 自动重试,如果重试成功,我们看到的现象就是 dns 请求有 5s 的延时。</p><p>参考 weave works 工程师总结的文章: <a href="https://www.weave.works/blog/racy-conntrack-and-dns-lookup-timeouts" target="_blank" rel="noopener">Racy conntrack and DNS lookup timeouts</a></p><p>解决方案一: 使用 TCP 发送 DNS 请求</p><p>如果使用 TCP 发 DNS 请求,connect 时就会插入 conntrack 表项,而并发的 A 和 AAAA 请求使用同一个 fd,所以只会有一次 connect,也就只会尝试创建一个 conntrack 表项,也就避免插入时冲突。</p><p><code>resolv.conf</code> 可以加 <code>options use-vc</code> 强制 glibc 使用 TCP 协议发送 DNS query。下面是这个 <code>man resolv.conf</code>中关于这个选项的说明:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">use-vc (since glibc 2.14)</span><br><span class="line"> Sets RES_USEVC <span class="keyword">in</span> _res.options. This option forces the</span><br><span class="line"> use of TCP <span class="keyword">for</span> DNS resolutions.</span><br></pre></td></tr></table></figure><p>解决方案二: 避免相同五元组 DNS 请求的并发</p><p><code>resolv.conf</code> 还有另外两个相关的参数:</p><ul><li>single-request-reopen (since glibc 2.9): A 和 AAAA 请求使用不同的 socket 来发送,这样它们的源 Port 就不同,五元组也就不同,避免了使用同一个 conntrack 表项。</li><li>single-request (since glibc 2.10): A 和 AAAA 请求改成串行,没有并发,从而也避免了冲突。</li></ul><p><code>man resolv.conf</code> 中解释如下:<br><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br></pre></td><td class="code"><pre><span class="line">single-request-reopen (since glibc 2.9)</span><br><span class="line"> Sets RES_SNGLKUPREOP <span class="keyword">in</span> _res.options. The resolver</span><br><span class="line"> uses the same socket <span class="keyword">for</span> the A and AAAA requests. Some</span><br><span class="line"> hardware mistakenly sends back only one reply. When</span><br><span class="line"> that happens the client system will sit and <span class="built_in">wait</span> <span class="keyword">for</span></span><br><span class="line"> the second reply. Turning this option on changes this</span><br><span class="line"> behavior so that <span class="keyword">if</span> two requests from the same port are</span><br><span class="line"> not handled correctly it will close the socket and open</span><br><span class="line"> a new one before sending the second request.</span><br><span class="line"></span><br><span class="line">single-request (since glibc 2.10)</span><br><span class="line"> Sets RES_SNGLKUP <span class="keyword">in</span> _res.options. By default, glibc</span><br><span class="line"> performs IPv4 and IPv6 lookups <span class="keyword">in</span> parallel since</span><br><span class="line"> version 2.9. 
Some appliance DNS servers cannot handle</span><br><span class="line"> these queries properly and make the requests time out.</span><br><span class="line"> This option disables the behavior and makes glibc</span><br><span class="line"> perform the IPv6 and IPv4 requests sequentially (at the</span><br><span class="line"> cost of some slowdown of the resolving process).</span><br></pre></td></tr></table></figure></p><p>要给容器的 <code>resolv.conf</code> 加上 options 参数,最方便的是直接在 Pod Spec 里面的 dnsConfig 加 (k8s v1.9 及以上才支持)</p><p><img src="https://imroc.io/assets/meme/jugelizi.jpeg" alt=""></p><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"><span class="attr">spec:</span></span><br><span class="line"><span class="attr"> dnsConfig:</span></span><br><span class="line"><span class="attr"> options:</span></span><br><span class="line"><span class="attr"> - name:</span> <span class="string">single-request-reopen</span></span><br></pre></td></tr></table></figure><p>加 options 还有其它一些方法:</p><ul><li>在容器的 <code>ENTRYPOINT</code> 或者 <code>CMD</code> 脚本中,执行 <code>/bin/echo 'options single-request-reopen' >> /etc/resolv.conf</code></li><li><p>在 postStart hook 里加:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line">lifecycle:</span><br><span class="line"> postStart:</span><br><span class="line"> <span class="built_in">exec</span>:</span><br><span class="line"> <span class="built_in">command</span>:</span><br><span class="line"> - /bin/sh</span><br><span class="line"> - -c </span><br><span class="line"> - <span class="string">"/bin/echo 'options single-request-reopen' >> /etc/resolv.conf"</span></span><br></pre></td></tr></table></figure></li><li><p>使用 <a href="https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/#mutatingadmissionwebhook-beta-in-1-9" target="_blank" rel="noopener">MutatingAdmissionWebhook</a>,这是 1.9 引入的 Controller,用于对一个指定的资源的操作之前,对这个资源进行变更。 istio 的自动 sidecar 注入就是用这个功能来实现的,我们也可以通过 <code>MutatingAdmissionWebhook</code> 来自动给所有 Pod 注入 <code>resolv.conf</code> 文件,不过需要一定的开发量。</p></li></ul><p>解决方案三: 使用本地 DNS 缓存</p><p>仔细观察可以看到前面两种方案是 glibc 支持的,而基于 alpine 的镜像底层库是 musl libc 不是 glibc,所以即使加了这些 options 也没用,这种情况可以考虑使用本地 DNS 缓存来解决,容器的 DNS 请求都发往本地的 DNS 缓存服务(dnsmasq, nscd等),不需要走 DNAT,也不会发生 conntrack 冲突。另外还有个好处,就是避免 DNS 服务成为性能瓶颈。</p><p>使用本地DNS缓存有两种方式:</p><ul><li>每个容器自带一个 DNS 缓存服务</li><li>每个节点运行一个 DNS 缓存服务,所有容器都把本节点的 DNS 缓存作为自己的 nameserver</li></ul><p>从资源效率的角度来考虑的话,推荐后一种方式。</p><h2 id="Pod-访问另一个集群的-apiserver-有延时"><a href="#Pod-访问另一个集群的-apiserver-有延时" class="headerlink" title="Pod 访问另一个集群的 apiserver 有延时"></a>Pod 访问另一个集群的 apiserver 有延时</h2><p>现象:集群 a 的 Pod 内通过 kubectl 访问集群 b 的内网地址,偶尔出现延时的情况,但直接在宿主机上用同样的方法却没有这个问题。</p><p>提炼环境和现象精髓:</p><ol><li>在 pod 内将另一个集群 apiserver 的 ip 写到了 hosts,因为 TKE apiserver 开启内网集群外内网访问创建的内网 LB 暂时没有支持自动绑内网 DNS 域名解析,所以集群外的内网访问 apiserver 需要加 hosts</li><li>pod 内执行 kubectl 访问另一个集群偶尔延迟 5s,有时甚至10s</li></ol><p>观察到 5s 延时,感觉跟之前 conntrack 的丢包导致 dns 解析 5s 延时有关,但是加了 hosts 呀,怎么还去解析域名?</p><p><img src="https://imroc.io/assets/meme/emoji_analysis.png" alt=""></p><p>进入 pod netns 抓包: 执行 kubectl 时确实有 dns 解析,并且发生延时的时候 dns 
请求没有响应然后做了重试。</p><p>看起来延时应该就是之前已知 conntrack 丢包导致 dns 5s 超时重试导致的。但是为什么会去解析域名? 明明配了 hosts 啊,正常情况应该是优先查找 hosts,没找到才去请求 dns 呀,有什么配置可以控制查找顺序?</p><p>搜了一下发现: <code>/etc/nsswitch.conf</code> 可以控制,但看有问题的 pod 里没有这个文件。然后观察到有问题的 pod 用的 alpine 镜像,试试其它镜像后发现只有基于 alpine 的镜像才会有这个问题。</p><p>再一搜发现: musl libc 并不会使用 <code>/etc/nsswitch.conf</code> ,也就是说 alpine 镜像并没有实现用这个文件控制域名查找优先顺序,瞥了一眼 musl libc 的 <code>gethostbyname</code> 和 <code>getaddrinfo</code> 的实现,看起来也没有读这个文件来控制查找顺序,写死了先查 hosts,没找到再查 dns。</p><p>这么说,那还是该先查 hosts 再查 dns 呀,为什么这里抓包看到是先查的 dns? (如果是先查 hosts 就能命中查询,不会再发起dns请求)</p><p>访问 apiserver 的 client 是 kubectl,用 go 写的,会不会是 go 程序解析域名时压根没调底层 c 库的 <code>gethostbyname</code> 或 <code>getaddrinfo</code>?</p><p><img src="https://imroc.io/assets/meme/physical_analysis.png" alt=""></p><p>搜一下发现果然是这样: go runtime 用 go 实现了 glibc 的 <code>getaddrinfo</code> 的行为来解析域名,减少了 c 库调用 (应该是考虑到减少 cgo 调用带来的的性能损耗)</p><p>issue: <a href="https://github.com/golang/go/issues/18518" target="_blank" rel="noopener">net: replicate DNS resolution behaviour of getaddrinfo(glibc) in the go dns resolver</a></p><p>翻源码验证下:</p><p>Unix 系的 OS 下,除了 openbsd, go runtime 会读取 <code>/etc/nsswitch.conf</code> (<code>net/conf.go</code>):</p><p><img src="https://imroc.io/assets/blog/troubleshooting-k8s-network/nsswitch.png" alt=""></p><p><code>hostLookupOrder</code> 函数决定域名解析顺序的策略,Linux 下,如果没有 <code>nsswitch.conf</code> 文件就 dns 比 hosts 文件优先 (<code>net/conf.go</code>):</p><p><img src="https://imroc.io/assets/blog/troubleshooting-k8s-network/hostLookupOrder.png" alt=""></p><p>可以看到 <code>hostLookupDNSFiles</code> 的意思是 dns first (<code>net/dnsclient_unix.go</code>):</p><p><img src="https://imroc.io/assets/blog/troubleshooting-k8s-network/hostLookupDNSFiles.png" alt=""></p><p>所以虽然 alpine 用的 musl libc 不是 glibc,但 go 程序解析域名还是一样走的 glibc 的逻辑,而 alpine 没有 <code>/etc/nsswitch.conf</code> 文件,也就解释了为什么 kubectl 访问 apiserver 先做 dns 解析,没解析到再查的 hosts,导致每次访问都去请求 dns,恰好又碰到 conntrack 那个丢包问题导致 dns 5s 延时,在用户这里表现就是 pod 内用 kubectl 访问 apiserver 偶尔出现 5s 延时,有时出现 10s 是因为重试的那次 dns 请求刚好也遇到 conntrack 丢包导致延时又叠加了 5s 。</p><p><img src="https://imroc.io/assets/meme/emoji_jizhi.png" alt=""></p><p>解决方案:</p><ol><li>换基础镜像,不用 alpine</li><li>挂载 <code>nsswitch.conf</code> 文件 (可以用 hostPath)</li></ol><h2 id="DNS-解析异常"><a href="#DNS-解析异常" class="headerlink" title="DNS 解析异常"></a>DNS 解析异常</h2><p>现象: 有个用户反馈域名解析有时有问题,看报错是解析超时。</p><p><img src="https://imroc.io/assets/meme/emoji_analysis.png" alt=""></p><p>第一反应当然是看 coredns 的 log:<br><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">[ERROR] 2 loginspub.xxxxmobile-inc.net. </span><br><span class="line">A: unreachable backend: <span class="built_in">read</span> udp 172.16.0.230:43742->10.225.30.181:53: i/o timeout</span><br></pre></td></tr></table></figure></p><p>这是上游 DNS 解析异常了,因为解析外部域名 coredns 默认会请求上游 DNS 来查询,这里的上游 DNS 默认是 coredns pod 所在宿主机的 <code>resolv.conf</code> 里面的 nameserver (coredns pod 的 dnsPolicy 为 “Default”,也就是会将宿主机里的 <code>resolv.conf</code> 里的 nameserver 加到容器里的 <code>resolv.conf</code>, coredns 默认配置 <code>proxy . 
/etc/resolv.conf</code>, 意思是非 service 域名会使用 coredns 容器中 <code>resolv.conf</code> 文件里的 nameserver 来解析)</p><p>确认了下,超时的上游 DNS 10.225.30.181 并不是期望的 nameserver,VPC 默认 DNS 应该是 180 开头的。看了 coredns 所在节点的 <code>resolv.conf</code>,发现确实多出了这个非期望的 nameserver,跟用户确认了下,这个 DNS 不是用户自己加上去的,添加节点时这个 nameserver 本身就在 <code>resolv.conf</code> 中。</p><p>根据内部同学反馈, 10.225.30.181 是广州一台年久失修将被撤裁的 DNS,物理网络,没有 VIP,撤掉就没有了,所以如果 coredns 用到了这台 DNS 解析时就可能 timeout。后面我们自己测试,某些 VPC 的集群确实会有这个 nameserver,奇了怪了,哪里冒出来的?</p><p><img src="https://imroc.io/assets/meme/cooldown_analysis.png" alt=""></p><p>又试了下直接创建 CVM,不加进 TKE 节点发现没有这个 nameserver,只要一加进 TKE 节点就有了 !!!</p><p>看起来是 TKE 的问题,将 CVM 添加到 TKE 集群会自动重装系统,初始化并加进集群成为 K8S 的 node,确认了初始化过程并不会写 <code>resolv.conf</code>,会不会是 TKE 的 OS 镜像问题?尝试搜一下除了 <code>/etc/resolv.conf</code> 之外哪里还有这个 nameserver 的 IP,最后发现 <code>/etc/resolvconf/resolv.conf.d/base</code> 这里面有。</p><p>看下 <code>/etc/resolvconf/resolv.conf.d/base</code> 的作用:Ubuntu 的 <code>/etc/resolv.conf</code> 是动态生成的,每次重启都会将 <code>/etc/resolvconf/resolv.conf.d/base</code> 里面的内容加到 <code>/etc/resolv.conf</code> 里。</p><p>经确认: 这个文件确实是 TKE 的 Ubuntu OS 镜像里自带的,可能发布 OS 镜像时不小心加进去的。</p><p>那为什么有些 VPC 的集群的节点 <code>/etc/resolv.conf</code> 里面没那个 IP 呢?它们的 OS 镜像里也都有那个文件那个 IP 呀。</p><p>请教其它部门同学发现:</p><ul><li>非 dhcp 子机,cvm 的 cloud-init 会覆盖 <code>/etc/resolv.conf</code> 来设置 dns</li><li>dhcp 子机,cloud-init 不会设置,而是通过 dhcp 动态下发</li><li>2018 年 4 月 之后创建的 VPC 就都是 dhcp 类型了的,比较新的 VPC 都是 dhcp 类型的</li></ul><p>真相大白:<code>/etc/resolv.conf</code> 一开始内容都包含 <code>/etc/resolvconf/resolv.conf.d/base</code> 的内容,也就是都有那个不期望的 nameserver,但老的 VPC 由于不是 dhcp 类型,所以 cloud-init 会覆盖 <code>/etc/resolv.conf</code>,抹掉了不被期望的 nameserver,而新创建的 VPC 都是 dhcp 类型,cloud-init 不会覆盖 <code>/etc/resolv.conf</code>,导致不被期望的 nameserver 残留在了 <code>/etc/resolv.conf</code>,而 coredns pod 的 dnsPolicy 为 “Default”,也就是会将宿主机的 <code>/etc/resolv.conf</code> 中的 nameserver 加到容器里,coredns 解析集群外的域名默认使用这些 nameserver 来解析,当用到那个将被撤裁的 nameserver 就可能 timeout。</p><p><img src="https://imroc.io/assets/meme/emoji_jizhi.png" alt=""></p><p>临时解决: 删掉 <code>/etc/resolvconf/resolv.conf.d/base</code> 重启</p><p>长期解决: 我们重新制作 TKE Ubuntu OS 镜像然后发布更新</p><p>这下应该没问题了吧,But, 用户反馈还是会偶尔解析有问题,但现象不一样了,这次并不是 dns timeout。</p><p><img src="https://imroc.io/assets/meme/chijing1.png" alt=""></p><p>用脚本跑测试仔细分析现象:</p><ul><li>请求 <code>loginspub.xxxxmobile-inc.net</code> 时,偶尔提示域名无法解析</li><li>请求 <code>accounts.google.com</code> 时,偶尔提示连接失败</li></ul><p>进入 dns 解析偶尔异常的容器的 netns 抓包:</p><ul><li>dns 请求会并发请求 A 和 AAAA 记录</li><li>测试脚本发请求打印序号,抓包然后 wireshark 分析对比异常时请求序号偏移量,找到异常时的 dns 请求报文,发现异常时 A 和 AAAA 记录的请求 id 冲突,并且 AAAA 响应先返回</li></ul><p><img src="https://imroc.io/assets/blog/troubleshooting-k8s-network/dns-id-conflict.png" alt=""></p><p>正常情况下id不会冲突,这里冲突了也就能解释这个 dns 解析异常的现象了:</p><ul><li><code>loginspub.xxxxmobile-inc.net</code> 没有 AAAA (ipv6) 记录,它的响应先返回告知 client 不存在此记录,由于请求 id 跟 A 记录请求冲突,后面 A 记录响应返回了 client 发现 id 重复就忽略了,然后认为这个域名无法解析</li><li><code>accounts.google.com</code> 有 AAAA 记录,响应先返回了,client 就拿这个记录去尝试请求,但当前容器环境不支持 ipv6,所以会连接失败</li></ul><p>那为什么 dns 请求 id 会冲突?</p><p><img src="https://imroc.io/assets/meme/chengsi.png" alt=""></p><p>继续观察发现: 其它节点上的 pod 不会复现这个问题,有问题这个节点上也不是所有 pod 都有这个问题,只有基于 alpine 镜像的容器才有这个问题,在此节点新起一个测试的 <code>alpine:latest</code> 的容器也一样有这个问题。</p><p>为什么 alpine 镜像的容器在这个节点上有问题在其它节点上没问题? 
为什么其他镜像的容器都没问题?它们跟 alpine 的区别是什么?</p><p>发现一点区别: alpine 使用的底层 c 库是 musl libc,其它镜像基本都是 glibc</p><p>翻 musl libc 源码, 构造 dns 请求时,请求 id 的生成没加锁,而且跟当前时间戳有关:</p><p><img src="https://imroc.io/assets/blog/troubleshooting-k8s-network/musl-libc-make-dns-query.png" alt=""></p><p>看注释,作者应该认为这样id基本不会冲突,事实证明,绝大多数情况确实不会冲突,我在网上搜了很久没有搜到任何关于 musl libc 的 dns 请求 id 冲突的情况。这个看起来取决于硬件,可能在某种类型硬件的机器上运行,短时间内生成的 id 就可能冲突。我尝试跟用户在相同地域的集群,添加相同配置相同机型的节点,也复现了这个问题,但后来删除再添加时又不能复现了,看起来后面新建的 cvm 又跑在了另一种硬件的母机上了。</p><p>OK,能解释通了,再底层的细节就不清楚了,我们来看下解决方案:</p><ul><li>换基础镜像 (不用alpine)</li><li>完全静态编译业务程序(不依赖底层c库),比如go语言程序编译时可以关闭 cgo (CGO_ENABLED=0),并告诉链接器要静态链接 (<code>go build</code> 后面加 <code>-ldflags '-d'</code>),但这需要语言和编译工具支持才可以</li></ul><p>最终建议用户基础镜像换成另一个比较小的镜像: <code>debian:stretch-slim</code>。</p><p>问题解决,但用户后面觉得 <code>debian:stretch-slim</code> 做出来的镜像太大了,有 6MB 多,而之前基于 alpine 做出来只有 1MB 多,最后使用了一个非官方的修改过 musl libc 的 alpine 镜像作为基础镜像,里面禁止了 AAAA 请求从而避免这个问题。</p><h2 id="Pod-偶尔存活检查失败"><a href="#Pod-偶尔存活检查失败" class="headerlink" title="Pod 偶尔存活检查失败"></a>Pod 偶尔存活检查失败</h2><p>现象: Pod 偶尔会存活检查失败,导致 Pod 重启,业务偶尔连接异常。</p><p>之前从未遇到这种情况,在自己测试环境尝试复现也没有成功,只有在用户这个环境才可以复现。这个用户环境流量较大,感觉跟连接数或并发量有关。</p><p>用户反馈说在友商的环境里没这个问题。</p><p><img src="https://imroc.io/assets/meme/emoji_analysis.png" alt=""></p><p>对比友商的内核参数发现有些区别,尝试将节点内核参数改成跟友商的一样,发现问题没有复现了。</p><p>再对比分析下内核参数差异,最后发现是 backlog 太小导致的,节点的 <code>net.ipv4.tcp_max_syn_backlog</code> 默认是 1024,如果短时间内并发新建 TCP 连接太多,SYN 队列就可能溢出,导致部分新连接无法建立。</p><p>解释一下:</p><p><img src="https://imroc.io/assets/blog/troubleshooting-k8s-network/backlog.png" alt=""></p><p>TCP 连接建立会经过三次握手,server 收到 SYN 后会将连接加入 SYN 队列,当收到最后一个 ACK 后连接建立,这时会将连接从 SYN 队列中移动到 ACCEPT 队列。在 SYN 队列中的连接都是没有建立完全的连接,处于半连接状态。如果 SYN 队列比较小,而短时间内并发新建的连接比较多,同时处于半连接状态的连接就多,SYN 队列就可能溢出,<code>tcp_max_syn_backlog</code> 可以控制 SYN 队列大小,用户节点的 backlog 大小默认是 1024,改成 8096 后就可以解决问题。</p><h2 id="访问-externalTrafficPolicy-为-Local-的-Service-对应-LB-有时超时"><a href="#访问-externalTrafficPolicy-为-Local-的-Service-对应-LB-有时超时" class="headerlink" title="访问 externalTrafficPolicy 为 Local 的 Service 对应 LB 有时超时"></a>访问 externalTrafficPolicy 为 Local 的 Service 对应 LB 有时超时</h2><p>现象:用户在 TKE 创建了公网 LoadBalancer 类型的 Service,externalTrafficPolicy 设为了 Local,访问这个 Service 对应的公网 LB 有时会超时。</p><p>externalTrafficPolicy 为 Local 的 Service 用于在四层获取客户端真实源 IP,官方参考文档:<a href="https://kubernetes.io/docs/tutorials/services/source-ip/#source-ip-for-services-with-type-loadbalancer" target="_blank" rel="noopener">Source IP for Services with Type=LoadBalancer</a></p><p>TKE 的 LoadBalancer 类型 Service 实现是使用 CLB 绑定所有节点对应 Service 的 NodePort,CLB 不做 SNAT,报文转发到 NodePort 时源 IP 还是真实的客户端 IP,如果 NodePort 对应 Service 的 externalTrafficPolicy 不是 Local 的就会做 SNAT,到 pod 时就看不到客户端真实源 IP 了,但如果是 Local 的话就不做 SNAT,如果本机 node 有这个 Service 的 endpoint 就转到对应 pod,如果没有就直接丢掉,因为如果转到其它 node 上的 pod 就必须要做 SNAT,不然无法回包,而 SNAT 之后就无法获取真实源 IP 了。</p><p>LB 会对绑定节点的 NodePort 做健康检查探测,检查 LB 的健康检查状态: 发现这个 NodePort 的所有节点都不健康 !!!</p><p><img src="https://imroc.io/assets/meme/chijing1.png" alt=""></p><p>那么问题来了:</p><ol><li>为什么会全不健康,这个 Service 有对应的 pod 实例,有些节点上是有 endpoint 的,为什么它们也不健康?</li><li>LB 健康检查全不健康,但是为什么有时还是可以访问后端服务?</li></ol><p><img src="https://imroc.io/assets/meme/smoke_cooldown.png" alt=""></p><p>跟 LB 的同学确认: 如果后端 rs 全不健康会激活 LB 的全死全活逻辑,也就是所有后端 rs 都可以转发。</p><p>那么有 endpoint 的 node 也是不健康这个怎么解释?</p><p>在有 endpoint 的 node 上抓 NodePort 的包: 发现很多来自 LB 的 SYN,但是没有响应 ACK。</p><p>看起来报文在哪被丢了,继续抓下 cbr0 看下: 发现没有来自 LB 的包,说明报文在 cbr0 之前被丢了。</p><p>再观察用户集群环境信息:</p><ol><li>k8s 版本1.12</li><li>启用了 ipvs</li><li>只有 local 的 service 才有异常</li></ol><p>尝试新建一个 1.12 启用 ipvs 和一个没启用 ipvs 的测试集群。也都创建 Local 的 LoadBalancer 
Service,发现启用 ipvs 的测试集群复现了那个问题,没启用 ipvs 的集群没这个问题。</p><p>再尝试创建 1.10 的集群,也启用 ipvs,发现没这个问题。</p><p>看起来跟集群版本和是否启用 ipvs 有关。</p><p>1.12 对比 1.10 启用 ipvs 的集群: 1.12 的会将 LB 的 <code>EXTERNAL-IP</code> 绑到 <code>kube-ipvs0</code> 上,而 1.10 的不会:<br><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">$ ip a show kube-ipvs0 | grep -A2 170.106.134.124</span><br><span class="line"> inet 170.106.134.124/32 brd 170.106.134.124 scope global kube-ipvs0</span><br><span class="line"> valid_lft forever preferred_lft forever</span><br></pre></td></tr></table></figure></p><ul><li>170.106.134.124 是 LB 的公网 IP</li><li>1.12 启用 ipvs 的集群将 LB 的公网 IP 绑到了 <code>kube-ipvs0</code> 网卡上</li></ul><p><code>kube-ipvs0</code> 是一个 dummy interface,实际不会接收报文,可以看到它的网卡状态是 DOWN,主要用于绑 ipvs 规则的 VIP,因为 ipvs 主要工作在 netfilter 的 INPUT 链,报文通过 PREROUTING 链之后需要决定下一步该进入 INPUT 还是 FORWARD 链,如果是本机 IP 就会进入 INPUT,如果不是就会进入 FORWARD 转发到其它机器。所以 k8s 利用 <code>kube-ipvs0</code> 这个网卡将 service 相关的 VIP 绑在上面以便让报文进入 INPUT 进而被 ipvs 转发。</p><p>当 IP 被绑到 <code>kube-ipvs0</code> 上,内核会自动将上面的 IP 写入 local 路由:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">$ ip route show table <span class="built_in">local</span> | grep 170.106.134.124</span><br><span class="line"><span class="built_in">local</span> 170.106.134.124 dev kube-ipvs0 proto kernel scope host src 170.106.134.124</span><br></pre></td></tr></table></figure><p>内核认为在 local 路由里的 IP 是本机 IP,而 linux 默认有个行为: 忽略任何来自非回环网卡并且源 IP 是本机 IP 的报文。而 LB 的探测报文源 IP 就是 LB IP,也就是 Service 的 <code>EXTERNAL-IP</code>,猜想就是因为这个 IP 被绑到 <code>kube-ipvs0</code>,自动加进 local 路由导致内核直接忽略了 LB 的探测报文。</p><p>带着猜想做实验,试一下将 LB IP 从 local 路由中删除:<br><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">ip route del table <span class="built_in">local</span> <span class="built_in">local</span> 170.106.134.124 dev kube-ipvs0 proto kernel scope host src 170.106.134.124</span><br></pre></td></tr></table></figure></p><p>发现这个 node 在 LB 的健康检查状态变成健康了! 看来就是因为这个 LB IP 被绑到 <code>kube-ipvs0</code> 导致内核忽略了来自 LB 的探测报文,然后 LB 收不到回包认为不健康。</p><p>那为什么其它厂商没反馈这个问题?应该是 LB 的实现问题,腾讯云的公网 CLB 的健康探测报文源 IP 就是 LB 的公网 IP,而大多数厂商的 LB 探测报文源 IP 是保留 IP 并非 LB 自身的 VIP。</p><p>如何解决呢? 发现一个内核参数: <a href="https://github.com/torvalds/linux/commit/8153a10c08f1312af563bb92532002e46d3f504a" target="_blank" rel="noopener">accept_local</a> 可以让 linux 接收源 IP 是本机 IP 的报文。</p><p>试了开启这个参数,确实在 cbr0 收到来自 LB 的探测报文了,说明报文能被 pod 收到,但抓 eth0 还是没有给 LB 回包。</p><p><img src="https://imroc.io/assets/meme/physical_analysis.png" alt=""></p>
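<p>上面的验证过程大致类似下面的操作(仅为示意,accept_local 是按网卡生效的内核参数,这里假设容器网桥为 cbr0、节点主网卡为 eth0,LB IP 沿用前文示例中的 170.106.134.124,实际名称以具体环境为准):</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># 开启 accept_local,允许接收源 IP 是本机 IP 的报文(示意)</span></span><br><span class="line">sysctl -w net.ipv4.conf.all.accept_local=1</span><br><span class="line">sysctl -w net.ipv4.conf.eth0.accept_local=1</span><br><span class="line"><span class="comment"># 分别在 cbr0 和 eth0 上抓 LB 的探测报文,对比回包在哪一步丢失</span></span><br><span class="line">tcpdump -ni cbr0 host 170.106.134.124</span><br><span class="line">tcpdump -ni eth0 host 170.106.134.124</span><br></pre></td></tr></table></figure>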
<p>为什么没有回包? 分析下五元组,要给 LB 回包,那么 <code>目的IP:目的Port</code> 必须是探测报文的 <code>源IP:源Port</code>,所以目的 IP 就是 LB IP,由于容器不在主 netns,发包经过 veth pair 到 cbr0 之后需要再经过 netfilter 处理,报文进入 PREROUTING 链然后发现目的 IP 是本机 IP,进入 INPUT 链,所以报文就出不去了。再分析下进入 INPUT 后会怎样,因为目的 Port 跟 LB 探测报文源 Port 相同,是一个随机端口,不在 Service 的端口列表,所以没有对应的 IPVS 规则,IPVS 也就不会转发它,而 <code>kube-ipvs0</code> 上虽然绑了这个 IP,但它是一个 dummy interface,不会收包,所以报文最后又被忽略了。</p><p>再看看为什么 1.12 启用 ipvs 会绑 <code>EXTERNAL-IP</code> 到 <code>kube-ipvs0</code>,翻翻 k8s 的 kube-proxy 支持 ipvs 的 <a href="https://github.com/kubernetes/enhancements/blob/baca87088480254b26d0fdeb26303d7c51a20fbd/keps/sig-network/0011-ipvs-proxier.md#support-loadbalancer-service" target="_blank" rel="noopener">proposal</a>,发现有个地方说法有点漏洞:</p><p><img src="https://imroc.io/assets/blog/troubleshooting-k8s-network/ipvs-proposal.png" alt=""></p><p>LB 类型 Service 的 status 里有 ingress IP,实际就是 <code>kubectl get service</code> 看到的 <code>EXTERNAL-IP</code>,这里说不会绑定这个 IP 到 kube-ipvs0,但后面又说会给它创建 ipvs 规则,既然没有绑到 <code>kube-ipvs0</code>,那么这个 IP 的报文根本不会进入 INPUT 被 ipvs 模块转发,创建的 ipvs 规则也是没用的。</p><p>后来找到作者私聊,对方思考了下,确认设计上确实有这个问题。</p><p>看了下 1.10 确实也是这么实现的,但是为什么 1.12 又绑了这个 IP 呢? 调研后发现是因为 <a href="https://github.com/kubernetes/kubernetes/issues/59976" target="_blank" rel="noopener">#59976</a> 这个 issue 发现一个问题,后来引入 <a href="https://github.com/kubernetes/kubernetes/pull/63066" target="_blank" rel="noopener">#63066</a> 这个 PR 修复的,而这个 PR 的行为就是让 LB IP 绑到 <code>kube-ipvs0</code>,这个提交影响 1.11 及其之后的版本。</p><p><a href="https://github.com/kubernetes/kubernetes/issues/59976" target="_blank" rel="noopener">#59976</a> 的问题是因为没绑 LB IP 到 <code>kube-ipvs0</code> 上,在自建集群使用 <code>MetalLB</code> 来实现 LoadBalancer 类型的 Service,而有些网络环境下,pod 是无法直接访问 LB 的,导致 pod 访问 LB IP 时访问不了,而如果将 LB IP 绑到 <code>kube-ipvs0</code> 上就可以通过 ipvs 转发到 LB 类型 Service 对应的 pod 去,而不需要真正经过 LB,所以引入了 <a href="https://github.com/kubernetes/kubernetes/pull/63066" target="_blank" rel="noopener">#63066</a> 这个 PR。</p><p>临时方案: 将 <a href="https://github.com/kubernetes/kubernetes/pull/63066" target="_blank" rel="noopener">#63066</a> 这个 PR 的更改回滚下,重新编译 kube-proxy,提供升级脚本升级存量 kube-proxy。</p><p>如果是让 LB 健康检查探测支持用保留 IP 而不是自身的公网 IP,也是可以解决,但需要跨团队合作,而且如果多个厂商都遇到这个问题,每家都需要为解决这个问题而做开发调整,代价较高,所以长期方案需要跟社区沟通一起推进,所以我提了 issue,将问题描述得很清楚: <a href="https://github.com/kubernetes/kubernetes/issues/79783" target="_blank" rel="noopener">#79783</a></p><p>小思考: 为什么 CLB 可以不做 SNAT ? 
回包目的 IP 就是真实客户端 IP,但客户端是直接跟 LB IP 建立的连接,如果回包不经过 LB 是不可能发送成功的呀。</p><p>是因为 CLB 的实现是在母机上通过隧道跟 CVM 互联的,多了一层封装,回包始终会经过 LB。</p><p>就是因为 CLB 不做 SNAT,正常来自客户端的报文是可以发送到 nodeport,但健康检查探测报文由于源 IP 是 LB IP 被绑到 <code>kube-ipvs0</code> 导致被忽略,也就解释了为什么健康检查失败,但通过LB能访问后端服务,只是有时会超时。那么如果要做 SNAT 的 LB 岂不是更糟糕,所有报文都变成 LB IP,所有报文都会被忽略?</p><p>我提的 issue 有回复指出,AWS 的 LB 会做 SNAT,但它们不将 LB 的 IP 写到 Service 的 Status 里,只写了 hostname,所以也不会绑 LB IP 到 <code>kube-ipvs0</code>:</p><p><img src="https://imroc.io/assets/blog/troubleshooting-k8s-network/aws-lb-snat.png" alt=""></p><p>但是只写 hostname 也得 LB 支持自动绑域名解析,并且个人觉得只写 hostname 很别扭,通过 <code>kubectl get svc</code> 或者其它 k8s 管理系统无法直接获取 LB IP,这不是一个好的解决方法。</p><p>我提了 <a href="https://github.com/kubernetes/kubernetes/pull/79976" target="_blank" rel="noopener">#79976</a> 这个 PR 可以解决问题: 给 kube-proxy 加 <code>--exclude-external-ip</code> 这个 flag 控制是否为 LB IP<br>创建 ipvs 规则和绑定 <code>kube-ipvs0</code>。</p><p>但有人担心增加 kube-proxy flag 会增加 kube-proxy 的调试复杂度,看能否在 iptables 层面解决:<br><img src="https://imroc.io/assets/blog/troubleshooting-k8s-network/solve-in-iptables.png" alt=""></p><p>仔细一想,确实可行,打算有空实现下,重新提个 PR:<br><img src="https://imroc.io/assets/blog/troubleshooting-k8s-network/solve-in-prerouting.png" alt=""></p><h2 id="结语"><a href="#结语" class="headerlink" title="结语"></a>结语</h2><p>至此,我们一起完成了一段奇妙的问题排查之旅,信息量很大并且比较复杂,有些没看懂很正常,但我希望你可以收藏起来反复阅读,一起在技术的道路上打怪升级。</p>]]></content>
<summary type="html">
<p>作者: <a href="https://imroc.io/" target="_blank" rel="noopener">陈鹏</a></p>
<p>大家好,我是 roc,来自腾讯云容器服务(TKE)团队,经常帮助用户解决各种 K8S 的疑难杂症,积累了比较丰富的经验,
</summary>
</entry>
<entry>
<title>Kubernetes 问题排查:Pod 状态一直 Terminating</title>
<link href="https://TencentCloudContainerTeam.github.io/2019/06/20/pod-terminating-forever/"/>
<id>https://TencentCloudContainerTeam.github.io/2019/06/20/pod-terminating-forever/</id>
<published>2019-06-20T12:35:00.000Z</published>
<updated>2020-06-16T01:53:49.339Z</updated>
<content type="html"><![CDATA[<p>作者: <a href="https://imroc.io/" target="_blank" rel="noopener">陈鹏</a></p><p>查看 Pod 事件:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">$ kubectl describe pod/apigateway-6dc48bf8b6-clcwk -n cn-staging</span><br></pre></td></tr></table></figure><h3 id="Need-to-kill-Pod"><a href="#Need-to-kill-Pod" class="headerlink" title="Need to kill Pod"></a>Need to kill Pod</h3><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">Normal Killing 39s (x735 over 15h) kubelet, 10.179.80.31 Killing container with id docker://apigateway:Need to <span class="built_in">kill</span> Pod</span><br></pre></td></tr></table></figure><p>可能是磁盘满了,无法创建和删除 pod。</p><p>处理建议是参考 Kubernetes 最佳实践:<a href="https://tencentcloudcontainerteam.github.io/2019/06/08/kubernetes-best-practice-handle-disk-full/">处理容器数据磁盘被写满</a></p><h3 id="DeadlineExceeded"><a href="#DeadlineExceeded" class="headerlink" title="DeadlineExceeded"></a>DeadlineExceeded</h3><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">Warning FailedSync 3m (x408 over 1h) kubelet, 10.179.80.31 error determining status: rpc error: code = DeadlineExceeded desc = context deadline exceeded</span><br></pre></td></tr></table></figure><p>怀疑是 17 版本 dockerd 的 BUG。可通过 <code>kubectl -n cn-staging delete pod apigateway-6dc48bf8b6-clcwk --force --grace-period=0</code> 强制删除 pod,但 <code>docker ps</code> 仍看得到这个容器。</p><p>处置建议:</p><ul><li>升级到 docker 18,该版本使用了新的 containerd,针对很多 bug 进行了修复。</li><li>如果出现 terminating 状态的话,可以先交给容器专家进行排查,不建议直接强行删除,否则可能会导致一些业务上的问题。</li></ul><h3 id="存在-Finalizers"><a href="#存在-Finalizers" class="headerlink" title="存在 Finalizers"></a>存在 Finalizers</h3><p>k8s 资源的 metadata 里如果存在 <code>finalizers</code>,那么该资源一般是由某程序创建的,并且它会在所创建资源的 metadata 的 <code>finalizers</code> 里加上一个自己的标识,这意味着这个资源被删除时需要由创建它的程序来做删除前的清理,清理完之后再将标识从该资源的 <code>finalizers</code> 中移除,资源才会被最终彻底删除。比如 Rancher 创建的一些资源就会写入 <code>finalizers</code> 标识。</p><p>处理建议:用 <code>kubectl edit</code> 手动编辑资源定义,删掉 <code>finalizers</code>,这时再看下资源,就会发现已经删掉了。</p>]]></content>
<summary type="html">
<p>作者: <a href="https://imroc.io/" target="_blank" rel="noopener">陈鹏</a></p>
<p>查看 Pod 事件:</p>
<figure class="highlight bash"><table><tr><td
</summary>
</entry>
<entry>
<title>Kubernetes 踩坑分享:开启tcp_tw_recycle内核参数在NAT环境会丢包</title>
<link href="https://TencentCloudContainerTeam.github.io/2019/06/09/lost-packets-once-enable-tcp-tw-recycle/"/>
<id>https://TencentCloudContainerTeam.github.io/2019/06/09/lost-packets-once-enable-tcp-tw-recycle/</id>
<published>2019-06-09T14:00:00.000Z</published>
<updated>2020-06-16T01:53:49.339Z</updated>
<content type="html"><![CDATA[<p>作者: <a href="https://imroc.io/" target="_blank" rel="noopener">陈鹏</a></p><h2 id="原因"><a href="#原因" class="headerlink" title="原因"></a>原因</h2><p>tcp_tw_recycle 参数用来快速回收 TIME_WAIT 连接,不过如果在 NAT 环境下会引发问题。RFC1323 中有如下一段描述:</p><p><code>An additional mechanism could be added to the TCP, a per-host cache of the last timestamp received from any connection. This value could then be used in the PAWS mechanism to reject old duplicate segments from earlier incarnations of the connection, if the timestamp clock can be guaranteed to have ticked at least once since the old connection was open. This would require that the TIME-WAIT delay plus the RTT together must be at least one tick of the sender’s timestamp clock. Such an extension is not part of the proposal of this RFC.</code></p><ul><li><p>大概意思是说TCP有一种行为,可以缓存每个连接最新的时间戳,后续请求中如果时间戳小于缓存的时间戳,即视为无效,相应的数据包会被丢弃。</p></li><li><p>Linux是否启用这种行为取决于tcp_timestamps和tcp_tw_recycle,因为tcp_timestamps缺省就是开启的,所以当tcp_tw_recycle被开启后,实际上这种行为就被激活了,当客户端或服务端以NAT方式构建的时候就可能出现问题,下面以客户端NAT为例来说明:</p></li><li><p>当多个客户端通过NAT方式联网并与服务端交互时,服务端看到的是同一个IP,也就是说对服务端而言这些客户端实际上等同于一个,可惜由于这些客户端的时间戳可能存在差异,于是乎从服务端的视角看,便可能出现时间戳错乱的现象,进而直接导致时间戳小的数据包被丢弃。如果发生了此类问题,具体的表现通常是客户端明明发送了 SYN,但服务端就是不响应 ACK。</p></li><li><p>在4.12之后的内核已移除tcp_tw_recycle内核参数: <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4396e46187ca5070219b81773c4e65088dac50cc" target="_blank" rel="noopener">https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4396e46187ca5070219b81773c4e65088dac50cc</a> <a href="https://github.com/torvalds/linux/commit/4396e46187ca5070219b81773c4e65088dac50cc" target="_blank" rel="noopener">https://github.com/torvalds/linux/commit/4396e46187ca5070219b81773c4e65088dac50cc</a></p></li></ul><h2 id="TKE中-使用-NAT-的场景"><a href="#TKE中-使用-NAT-的场景" class="headerlink" title="TKE中 使用 NAT 的场景"></a>TKE中 使用 NAT 的场景</h2><ul><li>跨 VPC 访问(通过对等连接、云联网、专线等方式打通),会做 SNAT</li></ul>]]></content>
<summary type="html">
<p>作者: <a href="https://imroc.io/" target="_blank" rel="noopener">陈鹏</a></p>
<h2 id="原因"><a href="#原因" class="headerlink" title="原因"></a>原因<
</summary>
</entry>
<entry>
<title>kubernetes 最佳实践:处理容器数据磁盘被写满</title>
<link href="https://TencentCloudContainerTeam.github.io/2019/06/08/kubernetes-best-practice-handle-disk-full/"/>
<id>https://TencentCloudContainerTeam.github.io/2019/06/08/kubernetes-best-practice-handle-disk-full/</id>
<published>2019-06-08T14:07:00.000Z</published>
<updated>2020-06-16T01:53:49.339Z</updated>
<content type="html"><![CDATA[<p>作者: <a href="https://imroc.io/" target="_blank" rel="noopener">陈鹏</a></p><p>容器数据磁盘被写满造成的危害:</p><ul><li>不能创建 Pod (一直 ContainerCreating)</li><li>不能删除 Pod (一直 Terminating)</li><li>无法 exec 到容器</li></ul><p>判断是否被写满:</p><p>容器数据目录大多会单独挂数据盘,路径一般是 <code>/var/lib/docker</code>,也可能是 <code>/data/docker</code> 或 <code>/opt/docker</code>,取决于节点被添加时的配置:</p><p><img src="https://imroc.io/assets/blog/tke-select-data-disk.png" alt=""></p><p>可通过 <code>docker info</code> 确定:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line">$ docker info</span><br><span class="line">...</span><br><span class="line">Docker Root Dir: /var/lib/docker</span><br><span class="line">...</span><br></pre></td></tr></table></figure><p>如果没有单独挂数据盘,则会使用系统盘存储。判断是否被写满:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line">$ df</span><br><span class="line">Filesystem 1K-blocks Used Available Use% Mounted on</span><br><span class="line">...</span><br><span class="line">/dev/vda1 51474044 4619112 44233548 10% /</span><br><span class="line">...</span><br><span class="line">/dev/vdb 20511356 20511356 0 100% /var/lib/docker</span><br></pre></td></tr></table></figure><h2 id="解决方法"><a href="#解决方法" class="headerlink" title="解决方法"></a>解决方法</h2><h3 id="先恢复业务,清理磁盘空间"><a href="#先恢复业务,清理磁盘空间" class="headerlink" title="先恢复业务,清理磁盘空间"></a>先恢复业务,清理磁盘空间</h3><p>重启 dockerd (清理容器日志输出和可写层文件)</p><ul><li>重启前需要稍微腾出一点空间,不然重启 docker 会失败,可以手动删除一些docker的log文件或可写层文件,通常删除log:</li></ul><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line">$ <span class="built_in">cd</span> /var/lib/docker/containers</span><br><span class="line">$ du -sh * <span class="comment"># 找到比较大的目录</span></span><br><span class="line">$ <span class="built_in">cd</span> dda02c9a7491fa797ab730c1568ba06cba74cecd4e4a82e9d90d00fa11de743c</span><br><span class="line">$ cat /dev/null > dda02c9a7491fa797ab730c1568ba06cba74cecd4e4a82e9d90d00fa11de743c-json.log.9 <span class="comment"># 删除log文件</span></span><br></pre></td></tr></table></figure><p><strong>注意:</strong> 使用 <code>cat /dev/null ></code> 方式删除而不用 <code>rm</code>,因为用 rm 删除的文件,docker 进程可能不会释放文件,空间也就不会释放;log 的后缀数字越大表示越久远,先删除旧日志。</p><ul><li>将该 node 标记不可调度,并将其已有的 pod 驱逐到其它节点,这样重启dockerd就会让该节点的pod对应的容器删掉,容器相关的日志(标准输出)与容器内产生的数据文件(可写层)也会被清理:</li></ul><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">kubectl drain 10.179.80.31</span><br></pre></td></tr></table></figure><ul><li>重启 dockerd:</li></ul><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">systemctl restart dockerd</span><br></pre></td></tr></table></figure><ul><li>取消不可调度的标记:</li></ul><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">kubectl uncordon 
10.179.80.31</span><br></pre></td></tr></table></figure><h3 id="定位根因,彻底解决"><a href="#定位根因,彻底解决" class="headerlink" title="定位根因,彻底解决"></a>定位根因,彻底解决</h3><p>问题定位方法见附录,这里列举根因对应的解决方法:</p><ul><li>日志输出量大导致磁盘写满:<ul><li>减少日志输出</li><li>增大磁盘空间</li><li>减小单机可调度的pod数量</li></ul></li><li>可写层量大导致磁盘写满: 优化程序逻辑,不写文件到容器内或控制写入文件的大小与数量</li><li>镜像占用空间大导致磁盘写满:<ul><li>增大磁盘空间</li><li>删除不需要的镜像</li></ul></li></ul><h2 id="附录"><a href="#附录" class="headerlink" title="附录"></a>附录</h2><h3 id="查看docker的磁盘空间占用情况"><a href="#查看docker的磁盘空间占用情况" class="headerlink" title="查看docker的磁盘空间占用情况"></a>查看docker的磁盘空间占用情况</h3><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">$ docker system df -v</span><br></pre></td></tr></table></figure><p><img src="https://imroc.io/assets/blog/docker-system-df.png" alt=""></p><h3 id="定位容器写满磁盘的原因"><a href="#定位容器写满磁盘的原因" class="headerlink" title="定位容器写满磁盘的原因"></a>定位容器写满磁盘的原因</h3><p>进入容器数据目录(假设是 <code>/var/lib/docker</code>,并且存储驱动是 aufs):</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">$ <span class="built_in">cd</span> /var/lib/docker</span><br><span class="line">$ du -sh *</span><br></pre></td></tr></table></figure><p><img src="https://imroc.io/assets/blog/docker-sh-dockerlib.png" alt=""></p><ul><li><code>containers</code> 目录: 体积大说明日志输出量大</li><li><code>aufs</code> 目录</li></ul><p><img src="https://imroc.io/assets/blog/docker-sh-aufs.png" alt=""></p><ul><li><code>diff</code> 子目录: 容器可写层,体积大说明可写层数据量大(程序在容器里写入文件)</li><li><code>mnt</code> 子目录: 联合挂载点,内容为容器里看到的内容,即包含镜像本身内容以及可写层内容</li></ul><h3 id="找出日志输出量大的-pod"><a href="#找出日志输出量大的-pod" class="headerlink" title="找出日志输出量大的 pod"></a>找出日志输出量大的 pod</h3><p>TKE 的 pod 中每个容器输出的日志最大存储 1G (日志轮转,最大10个文件,每个文件最大100m,可用 <code>docker inpect</code> 查看):</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br></pre></td><td class="code"><pre><span class="line">$ docker inspect fef835ebfc88</span><br><span class="line">[</span><br><span class="line"> {</span><br><span class="line"> ...</span><br><span class="line"> <span class="string">"HostConfig"</span>: {</span><br><span class="line"> ...</span><br><span class="line"> <span class="string">"LogConfig"</span>: {</span><br><span class="line"> <span class="string">"Type"</span>: <span class="string">"json-file"</span>,</span><br><span class="line"> <span class="string">"Config"</span>: {</span><br><span class="line"> <span class="string">"max-file"</span>: <span class="string">"10"</span>,</span><br><span class="line"> <span class="string">"max-size"</span>: <span class="string">"100m"</span></span><br><span class="line"> }</span><br><span class="line"> },</span><br><span class="line">...</span><br></pre></td></tr></table></figure><p>查看哪些容器日志输出量大:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">$ <span 
class="built_in">cd</span> /var/lib/docker/containers</span><br><span class="line">$ du -sh *</span><br></pre></td></tr></table></figure><p><img src="https://imroc.io/assets/blog/du-sh-containers.png" alt=""></p><p>目录名即为容器id,使用前几位与 <code>docker ps</code> 结果匹配可找出对应容器,最后就可以推算出是哪些 pod 搞的鬼</p><h3 id="找出可写层数据量大的-pod"><a href="#找出可写层数据量大的-pod" class="headerlink" title="找出可写层数据量大的 pod"></a>找出可写层数据量大的 pod</h3><p>可写层的数据主要是容器内程序自身写入的,无法控制大小,可写层越大说明容器写入的文件越多或越大,通常是容器内程序将log写到文件里了,查看一下哪个容器的可写层数据量大:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">$ <span class="built_in">cd</span> /var/lib/docker/aufs/diff</span><br><span class="line">$ du -sh *</span><br></pre></td></tr></table></figure><p><img src="https://imroc.io/assets/blog/du-sh-diff.png" alt=""><br>通过可写层目录(<code>diff</code>的子目录)反查容器id:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">$ grep 834d97500892f56b24c6e63ffd4e520fc29c6c0d809a3472055116f59fb1d2be /var/lib/docker/image/aufs/layerdb/mounts/*/mount-id</span><br><span class="line">/var/lib/docker/image/aufs/layerdb/mounts/eb76fcd31dfbe5fc949b67e4ad717e002847d15334791715ff7d96bb2c8785f9/mount-id:834d97500892f56b24c6e63ffd4e520fc29c6c0d809a3472055116f59fb1d2be</span><br></pre></td></tr></table></figure><p><code>mounts</code> 后面一级的id即为容器id: <code>eb76fcd31dfbe5fc949b67e4ad717e002847d15334791715ff7d96bb2c8785f9</code>,使用前几位与 <code>docker ps</code> 结果匹配可找出对应容器,最后就可以推算出是哪些 pod 搞的鬼</p><h3 id="找出体积大的镜像"><a href="#找出体积大的镜像" class="headerlink" title="找出体积大的镜像"></a>找出体积大的镜像</h3><p>看看哪些镜像比较占空间</p><p><img src="https://imroc.io/assets/blog/docker-images.png" alt=""></p>]]></content>
<summary type="html">
<p>作者: <a href="https://imroc.io/" target="_blank" rel="noopener">陈鹏</a></p>
<p>容器数据磁盘被写满造成的危害:</p>
<ul>
<li>不能创建 Pod (一直 ContainerCreating)
</summary>
</entry>
<entry>
<title>Kubernetes 最佳实践:处理内存碎片化</title>
<link href="https://TencentCloudContainerTeam.github.io/2019/06/06/handle-memory-fragmentation/"/>
<id>https://TencentCloudContainerTeam.github.io/2019/06/06/handle-memory-fragmentation/</id>
<published>2019-06-06T14:01:00.000Z</published>
<updated>2020-06-16T01:53:49.335Z</updated>
<content type="html"><![CDATA[<p>作者: <a href="https://imroc.io/" target="_blank" rel="noopener">陈鹏</a></p><h2 id="内存碎片化造成的危害"><a href="#内存碎片化造成的危害" class="headerlink" title="内存碎片化造成的危害"></a>内存碎片化造成的危害</h2><p>节点的内存碎片化严重,导致docker运行容器时,无法分到大的内存块,导致start docker失败。最终导致服务更新时,状态一直都是启动中</p><h2 id="判断是否内存碎片化严重"><a href="#判断是否内存碎片化严重" class="headerlink" title="判断是否内存碎片化严重"></a>判断是否内存碎片化严重</h2><p>内核日志显示:</p><p><img src="https://imroc.io/assets/blog/handle-memory-fragmentation-1.png" alt=""></p><p><img src="https://imroc.io/assets/blog/handle-memory-fragmentation-2.png" alt=""></p><p>进一步查看的系统内存(cache多可能是io导致的,为了提高io效率留下的缓存,这部分内存实际是可以释放的):</p><p><img src="https://imroc.io/assets/blog/handle-memory-fragmentation-3.png" alt=""></p><p>查看slab (后面的0多表示伙伴系统没有大块内存了):</p><p><img src="https://imroc.io/assets/blog/handle-memory-fragmentation-4.png" alt=""></p><h2 id="解决方法"><a href="#解决方法" class="headerlink" title="解决方法"></a>解决方法</h2><ul><li><p>周期性地或者在发现大块内存不足时,先进行drop_cache操作:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="built_in">echo</span> 3 > /proc/sys/vm/drop_caches</span><br></pre></td></tr></table></figure></li><li><p>必要时候进行内存整理,开销会比较大,会造成业务卡住一段时间(慎用):</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="built_in">echo</span> 1 > /proc/sys/vm/compact_memory</span><br></pre></td></tr></table></figure></li></ul><h2 id="附录"><a href="#附录" class="headerlink" title="附录"></a>附录</h2><p>相关链接:</p><ul><li><a href="https://www.lijiaocn.com/%E9%97%AE%E9%A2%98/2017/11/13/problem-unable-create-nf-conn.html" target="_blank" rel="noopener">https://www.lijiaocn.com/%E9%97%AE%E9%A2%98/2017/11/13/problem-unable-create-nf-conn.html</a></li><li><a href="https://blog.csdn.net/wqhlmark64/article/details/79143975" target="_blank" rel="noopener">https://blog.csdn.net/wqhlmark64/article/details/79143975</a></li><li><a href="https://huataihuang.gitbooks.io/cloud-atlas/content/os/linux/kernel/memory/drop_caches_and_compact_memory.html" target="_blank" rel="noopener">https://huataihuang.gitbooks.io/cloud-atlas/content/os/linux/kernel/memory/drop_caches_and_compact_memory.html</a></li></ul>]]></content>
<summary type="html">
<p>作者: <a href="https://imroc.io/" target="_blank" rel="noopener">陈鹏</a></p>
<h2 id="内存碎片化造成的危害"><a href="#内存碎片化造成的危害" class="headerlink" ti
</summary>
</entry>
<entry>
<title>Kubernetes 最佳实践:解决长连接服务扩容失效</title>
<link href="https://TencentCloudContainerTeam.github.io/2019/06/06/scale-keepalive-service/"/>
<id>https://TencentCloudContainerTeam.github.io/2019/06/06/scale-keepalive-service/</id>
<published>2019-06-06T13:59:00.000Z</published>
<updated>2020-06-16T01:53:49.351Z</updated>
<content type="html"><![CDATA[<p>作者: <a href="https://imroc.io/" target="_blank" rel="noopener">陈鹏</a></p><p>在现网运营中,有很多场景为了提高效率,一般都采用建立长连接的方式来请求。我们发现在客户端以长连接请求服务端的场景下,K8S的自动扩容会失效。原因是客户端长连接一直保留在老的Pod容器中,新扩容的Pod没有新的连接过来,导致K8S按照步长扩容第一批Pod之后就停止了扩容操作,而且新扩容的Pod没能承载请求,进而出现服务过载的情况,自动扩容失去了意义。</p><p>对长连接扩容失效的问题,我们的解决方法是将长连接转换为短连接。我们参考了 nginx keepalive 的设计,nginx 中 keepalive_requests 这个配置项设定了一个TCP连接能处理的最大请求数,达到设定值(比如1000)之后服务端会在 http 的 Header 头标记 “<code>Connection:close</code>”,通知客户端处理完当前的请求后关闭连接,新的请求需要重新建立TCP连接,所以这个过程中不会出现请求失败,同时又达到了将长连接按需转换为短连接的目的。通过这个办法客户端和云K8S服务端处理完一批请求后不断的更新TCP连接,自动扩容的新Pod能接收到新的连接请求,从而解决了自动扩容失效的问题。</p><p>由于Golang并没有提供方法可以获取到每个连接处理过的请求数,我们重写了 <code>net.Listener</code> 和 <code>net.Conn</code>,注入请求计数器,对每个连接处理的请求做计数,并通过 <code>net.Conn.LocalAddr()</code> 获得计数值,判断达到阈值 1000 后在返回的 Header 中插入 “<code>Connection:close</code>” 通知客户端关闭连接,重新建立连接来发起请求。以上处理逻辑用 Golang 实现示例代码如下:</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span class="line">67</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">package</span> main</span><br><span class="line"></span><br><span class="line"><span class="keyword">import</span> (</span><br><span class="line"> <span class="string">"net"</span></span><br><span class="line"> <span class="string">"github.com/gin-gonic/gin"</span></span><br><span class="line"> <span class="string">"net/http"</span></span><br><span class="line">)</span><br><span 
class="line"></span><br><span class="line"><span class="comment">//重新定义net.Listener</span></span><br><span class="line"><span class="keyword">type</span> counterListener <span class="keyword">struct</span> {</span><br><span class="line"> net.Listener</span><br><span class="line">}</span><br><span class="line"><span class="comment">//重写net.Listener.Accept(),对接收到的连接注入请求计数器</span></span><br><span class="line"><span class="function"><span class="keyword">func</span> <span class="params">(c *counterListener)</span> <span class="title">Accept</span><span class="params">()</span> <span class="params">(net.Conn, error)</span></span> {</span><br><span class="line"> conn, err := c.Listener.Accept()</span><br><span class="line"> <span class="keyword">if</span> err != <span class="literal">nil</span> {</span><br><span class="line"> <span class="keyword">return</span> <span class="literal">nil</span>, err</span><br><span class="line"> }</span><br><span class="line"> <span class="keyword">return</span> &counterConn{Conn: conn}, <span class="literal">nil</span></span><br><span class="line">}</span><br><span class="line"><span class="comment">//定义计数器counter和计数方法Increment()</span></span><br><span class="line"><span class="keyword">type</span> counter <span class="keyword">int</span></span><br><span class="line"><span class="function"><span class="keyword">func</span> <span class="params">(c *counter)</span> <span class="title">Increment</span><span class="params">()</span> <span class="title">int</span></span> {</span><br><span class="line"> *c++</span><br><span class="line"> <span class="keyword">return</span> <span class="keyword">int</span>(*c)</span><br><span class="line">}</span><br><span class="line"></span><br><span class="line"><span class="comment">//重新定义net.Conn,注入计数器ct</span></span><br><span class="line"><span class="keyword">type</span> counterConn <span class="keyword">struct</span> {</span><br><span class="line"> net.Conn</span><br><span class="line"> ct counter</span><br><span class="line">}</span><br><span class="line"></span><br><span class="line"><span class="comment">//重写net.Conn.LocalAddr(),返回本地网络地址的同时返回该连接累计处理过的请求数</span></span><br><span class="line"><span class="function"><span class="keyword">func</span> <span class="params">(c *counterConn)</span> <span class="title">LocalAddr</span><span class="params">()</span> <span class="title">net</span>.<span class="title">Addr</span></span> {</span><br><span class="line"> <span class="keyword">return</span> &counterAddr{c.Conn.LocalAddr(), &c.ct}</span><br><span class="line">}</span><br><span class="line"></span><br><span class="line"><span class="comment">//定义TCP连接计数器,指向连接累计请求的计数器</span></span><br><span class="line"><span class="keyword">type</span> counterAddr <span class="keyword">struct</span> {</span><br><span class="line"> net.Addr</span><br><span class="line"> *counter</span><br><span class="line">}</span><br><span class="line"></span><br><span class="line"><span class="function"><span class="keyword">func</span> <span class="title">main</span><span class="params">()</span></span> {</span><br><span class="line"> r := gin.New()</span><br><span class="line"> r.Use(<span class="function"><span class="keyword">func</span><span class="params">(c *gin.Context)</span></span> {</span><br><span class="line"> localAddr := c.Request.Context().Value(http.LocalAddrContextKey)</span><br><span class="line"> <span class="keyword">if</span> ct, ok := localAddr.(<span class="keyword">interface</span>{ Increment() <span class="keyword">int</span> }); ok 
{</span><br><span class="line"> <span class="keyword">if</span> ct.Increment() >= <span class="number">1000</span> {</span><br><span class="line"> c.Header(<span class="string">"Connection"</span>, <span class="string">"close"</span>)</span><br><span class="line"> }</span><br><span class="line"> }</span><br><span class="line"> c.Next()</span><br><span class="line"> })</span><br><span class="line"> r.GET(<span class="string">"/"</span>, <span class="function"><span class="keyword">func</span><span class="params">(c *gin.Context)</span></span> {</span><br><span class="line"> c.String(<span class="number">200</span>, <span class="string">"plain/text"</span>, <span class="string">"hello"</span>)</span><br><span class="line"> })</span><br><span class="line"> l, err := net.Listen(<span class="string">"tcp"</span>, <span class="string">":8080"</span>)</span><br><span class="line"> <span class="keyword">if</span> err != <span class="literal">nil</span> {</span><br><span class="line"> <span class="built_in">panic</span>(err)</span><br><span class="line"> }</span><br><span class="line"> err = http.Serve(&counterListener{l}, r)</span><br><span class="line"> <span class="keyword">if</span> err != <span class="literal">nil</span> {</span><br><span class="line"> <span class="built_in">panic</span>(err)</span><br><span class="line"> }</span><br><span class="line">}</span><br></pre></td></tr></table></figure>]]></content>
<summary type="html">
<p>作者: <a href="https://imroc.io/" target="_blank" rel="noopener">陈鹏</a></p>
<p>在现网运营中,有很多场景为了提高效率,一般都采用建立长连接的方式来请求。我们发现在客户端以长连接请求服务端的场景下,K8
</summary>
</entry>
<entry>
<title>Kubernetes 问题定位技巧:容器内抓包</title>
<link href="https://TencentCloudContainerTeam.github.io/2019/05/19/capture-packets-in-container/"/>
<id>https://TencentCloudContainerTeam.github.io/2019/05/19/capture-packets-in-container/</id>
<published>2019-05-19T05:04:00.000Z</published>
<updated>2020-06-16T01:53:49.335Z</updated>
<content type="html"><![CDATA[<p>作者: <a href="https://imroc.io/" target="_blank" rel="noopener">陈鹏</a></p><p>在使用 kubernetes 跑应用的时候,可能会遇到一些网络问题,比较常见的是服务端无响应(超时)或回包内容不正常,如果没找出各种配置上有问题,这时我们需要确认数据包到底有没有最终被路由到容器里,或者报文到达容器的内容和出容器的内容符不符合预期,通过分析报文可以进一步缩小问题范围。那么如何在容器内抓包呢?本文提供实用的脚本一键进入容器网络命名空间(netns),使用宿主机上的tcpdump进行抓包。</p><h2 id="使用脚本一键进入-pod-netns-抓包"><a href="#使用脚本一键进入-pod-netns-抓包" class="headerlink" title="使用脚本一键进入 pod netns 抓包"></a>使用脚本一键进入 pod netns 抓包</h2><ul><li><p>发现某个服务不通,最好将其副本数调为1,并找到这个副本 pod 所在节点和 pod 名称</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">kubectl get pod -o wide</span><br></pre></td></tr></table></figure></li><li><p>登录 pod 所在节点,将如下脚本粘贴到 shell (注册函数到当前登录的 shell,我们后面用)</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">function</span> <span class="function"><span class="title">e</span></span>() {</span><br><span class="line"> <span class="built_in">set</span> -eu</span><br><span class="line"> ns=<span class="variable">${2-"default"}</span></span><br><span class="line"> pod=`kubectl -n <span class="variable">$ns</span> describe pod <span class="variable">$1</span> | grep -Eo <span class="string">'docker://.*$'</span> | head -n 1 | sed <span class="string">'s/docker:\/\/\(.*\)$/\1/'</span>`</span><br><span class="line"> pid=`docker inspect -f {{.State.Pid}} <span class="variable">$pod</span>`</span><br><span class="line"> <span class="built_in">echo</span> <span class="string">"enter pod netns successfully for <span class="variable">$ns</span>/<span class="variable">$1</span>"</span></span><br><span class="line"> nsenter -n --target <span class="variable">$pid</span></span><br><span class="line">}</span><br></pre></td></tr></table></figure></li><li><p>一键进入 pod 所在的 netns,格式:<code>e POD_NAME NAMESPACE</code>,示例:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">e istio-galley-58c7c7c646-m6568 istio-system</span><br><span class="line">e proxy-5546768954-9rxg6 <span class="comment"># 省略 NAMESPACE 默认为 default</span></span><br></pre></td></tr></table></figure></li><li><p>这时已经进入 pod 的 netns,可以执行宿主机上的 <code>ip a</code> 或 <code>ifconfig</code> 来查看容器的网卡,执行 <code>netstat -tunlp</code> 查看当前容器监听了哪些端口,再通过 <code>tcpdump</code> 抓包:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">tcpdump -i eth0 -w test.pcap port 80</span><br></pre></td></tr></table></figure></li><li><p><code>ctrl-c</code> 停止抓包,再用 <code>scp</code> 或 <code>sz</code> 将抓下来的包下载到本地使用 <code>wireshark</code> 分析,提供一些常用的 <code>wireshark</code> 过滤语法:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># 使用 telnet 连上并发送一些测试文本,比如 "lbtest",</span></span><br><span class="line"><span class="comment"># 
用下面语句可以看发送的测试报文有没有到容器</span></span><br><span class="line">tcp contains <span class="string">"lbtest"</span></span><br><span class="line"><span class="comment"># 如果容器提供的是http服务,可以使用 curl 发送一些测试路径的请求,</span></span><br><span class="line"><span class="comment"># 通过下面语句过滤 uri 看报文有没有到容器</span></span><br><span class="line">http.request.uri==<span class="string">"/mytest"</span></span><br></pre></td></tr></table></figure></li></ul><h3 id="脚本原理"><a href="#脚本原理" class="headerlink" title="脚本原理"></a>脚本原理</h3><p>我们解释下步骤二中用到的脚本的原理:</p><ul><li><p>查看指定 pod 运行的容器 ID</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">kubectl describe pod <pod> -n mservice</span><br></pre></td></tr></table></figure></li><li><p>获得容器进程的 pid</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">docker inspect -f {{.State.Pid}} <container></span><br></pre></td></tr></table></figure></li><li><p>进入该容器的 network namespace</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">nsenter -n --target <PID></span><br></pre></td></tr></table></figure></li></ul><p>依赖宿主机的命令:<code>kubectl</code>, <code>docker</code>, <code>nsenter</code>, <code>grep</code>, <code>head</code>, <code>sed</code></p>]]></content>
<summary type="html">
<p>作者: <a href="https://imroc.io/" target="_blank" rel="noopener">陈鹏</a></p>
<p>在使用 kubernetes 跑应用的时候,可能会遇到一些网络问题,比较常见的是服务端无响应(超时)或回包内容不正常,如
</summary>
</entry>
<entry>
<title>kubernetes 最佳实践:优雅热更新</title>
<link href="https://TencentCloudContainerTeam.github.io/2019/05/08/kubernetes-best-practice-grace-update/"/>
<id>https://TencentCloudContainerTeam.github.io/2019/05/08/kubernetes-best-practice-grace-update/</id>
<published>2019-05-08T12:48:00.000Z</published>
<updated>2020-06-16T01:53:49.339Z</updated>
<content type="html"><![CDATA[<p>作者: <a href="https://imroc.io/" target="_blank" rel="noopener">陈鹏</a></p><p>当kubernetes对服务滚动更新的期间,默认配置的情况下可能会让部分连接异常(比如连接被拒绝),我们来分析下原因并给出最佳实践</p><h2 id="滚动更新场景"><a href="#滚动更新场景" class="headerlink" title="滚动更新场景"></a>滚动更新场景</h2><p>使用 deployment 部署服务并关联 service</p><ul><li>修改 deployment 的 replica 调整副本数量来滚动更新</li><li>升级程序版本(修改镜像tag)触发 deployment 新建 replicaset 启动新版本的 pod</li><li>使用 HPA (HorizontalPodAutoscaler) 来对 deployment 自动扩缩容</li></ul><h2 id="更新过程连接异常的原因"><a href="#更新过程连接异常的原因" class="headerlink" title="更新过程连接异常的原因"></a>更新过程连接异常的原因</h2><p>滚动更新时,service 对应的 pod 会被创建或销毁,也就是 service 对应的 endpoint 列表会新增或移除endpoint,更新期间可能让部分连接异常,主要原因是:</p><ol><li>pod 被创建,还没完全启动就被 endpoint controller 加入到 service 的 endpoint 列表,然后 kube-proxy 配置对应的路由规则(iptables/ipvs),如果请求被路由到还没完全启动完成的 pod,这时 pod 还不能正常处理请求,就会导致连接异常</li><li>pod 被销毁,但是从 endpoint controller watch 到变化并更新 service 的 endpoint 列表到 kube-proxy 更新路由规则这期间有个时间差,pod可能已经完全被销毁了,但是路由规则还没来得及更新,造成请求依旧还能被转发到已经销毁的 pod ip,导致连接异常</li></ol><h2 id="最佳实践"><a href="#最佳实践" class="headerlink" title="最佳实践"></a>最佳实践</h2><ul><li>针对第一种情况,可以给 pod 里的 container 加 readinessProbe (就绪检查),这样可以让容器完全启动了才被endpoint controller加进 service 的 endpoint 列表,然后 kube-proxy 再更新路由规则,这时请求被转发到的所有后端 pod 都是正常运行,避免了连接异常</li><li>针对第二种情况,可以给 pod 里的 container 加 preStop hook,让 pod 真正销毁前先 sleep 等待一段时间,留点时间给 endpoint controller 和 kube-proxy 清理 endpoint 和路由规则,这段时间 pod 处于 Terminating 状态,在路由规则更新完全之前如果有请求转发到这个被销毁的 pod,请求依然可以被正常处理,因为它还没有被真正销毁</li></ul><p>最佳实践 yaml 示例:<br><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br></pre></td><td class="code"><pre><span class="line"><span class="attr">apiVersion:</span> <span class="string">extensions/v1beta1</span></span><br><span class="line"><span class="attr">kind:</span> <span class="string">Deployment</span></span><br><span class="line"><span class="attr">metadata:</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">nginx</span></span><br><span class="line"><span class="attr">spec:</span></span><br><span class="line"><span class="attr"> replicas:</span> <span class="number">1</span></span><br><span class="line"><span class="attr"> selector:</span></span><br><span class="line"><span class="attr"> matchLabels:</span></span><br><span class="line"><span class="attr"> component:</span> <span class="string">nginx</span></span><br><span class="line"><span class="attr"> template:</span></span><br><span 
class="line"><span class="attr"> metadata:</span></span><br><span class="line"><span class="attr"> labels:</span></span><br><span class="line"><span class="attr"> component:</span> <span class="string">nginx</span></span><br><span class="line"><span class="attr"> spec:</span></span><br><span class="line"><span class="attr"> containers:</span></span><br><span class="line"><span class="attr"> - name:</span> <span class="string">nginx</span></span><br><span class="line"><span class="attr"> image:</span> <span class="string">"nginx"</span></span><br><span class="line"><span class="attr"> ports:</span></span><br><span class="line"><span class="attr"> - name:</span> <span class="string">http</span></span><br><span class="line"><span class="attr"> hostPort:</span> <span class="number">80</span></span><br><span class="line"><span class="attr"> containerPort:</span> <span class="number">80</span></span><br><span class="line"><span class="attr"> protocol:</span> <span class="string">TCP</span></span><br><span class="line"><span class="attr"> readinessProbe:</span></span><br><span class="line"><span class="attr"> httpGet:</span></span><br><span class="line"><span class="attr"> path:</span> <span class="string">/healthz</span></span><br><span class="line"><span class="attr"> port:</span> <span class="number">80</span></span><br><span class="line"><span class="attr"> httpHeaders:</span></span><br><span class="line"><span class="attr"> - name:</span> <span class="string">X-Custom-Header</span></span><br><span class="line"><span class="attr"> value:</span> <span class="string">Awesome</span></span><br><span class="line"><span class="attr"> initialDelaySeconds:</span> <span class="number">15</span></span><br><span class="line"><span class="attr"> timeoutSeconds:</span> <span class="number">1</span></span><br><span class="line"><span class="attr"> lifecycle:</span></span><br><span class="line"><span class="attr"> preStop:</span></span><br><span class="line"><span class="attr"> exec:</span></span><br><span class="line"><span class="attr"> command:</span> <span class="string">["/bin/bash",</span> <span class="string">"-c"</span><span class="string">,</span> <span class="string">"sleep 30"</span><span class="string">]</span></span><br></pre></td></tr></table></figure></p><h2 id="参考资料"><a href="#参考资料" class="headerlink" title="参考资料"></a>参考资料</h2><ul><li>Container probes: <a href="https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#container-probes" target="_blank" rel="noopener">https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#container-probes</a></li><li>Container Lifecycle Hooks: <a href="https://kubernetes.io/docs/concepts/containers/container-lifecycle-hooks/" target="_blank" rel="noopener">https://kubernetes.io/docs/concepts/containers/container-lifecycle-hooks/</a></li></ul>]]></content>
<summary type="html">
<p>作者: <a href="https://imroc.io/" target="_blank" rel="noopener">陈鹏</a></p>
<p>当kubernetes对服务滚动更新的期间,默认配置的情况下可能会让部分连接异常(比如连接被拒绝),我们来分析下原因并给
</summary>
</entry>
<entry>
<title>如何使用 Kubernetes VPA 实现资源动态扩展和回收</title>
<link href="https://TencentCloudContainerTeam.github.io/2019/04/30/kubernetes-vpa/"/>
<id>https://TencentCloudContainerTeam.github.io/2019/04/30/kubernetes-vpa/</id>
<published>2019-04-30T08:00:00.000Z</published>
<updated>2020-06-16T01:53:49.339Z</updated>
<content type="html"><![CDATA[<p>作者: <a href="https://github.com/xiaoxubeii" target="_blank" rel="noopener">徐蓓</a></p><h2 id="简述"><a href="#简述" class="headerlink" title="简述"></a>简述</h2><p>最近一段时间在研究和设计集群资源混合部署方案,以提高资源使用率。这其中一个重要的功能是资源动态扩展和回收。虽然方案是针对通用型集群管理软件,但由于 Kubernetes 目前是事实标准,所以先使用它来检验理论成果。</p><h2 id="资源动态扩展"><a href="#资源动态扩展" class="headerlink" title="资源动态扩展"></a>资源动态扩展</h2><p>资源动态扩展按照类型分为两种:纵向和横向。纵向指的是对资源的配置进行扩展,比如增加或减少 CPU 个数和内存大小等。横向扩展则是增加资源的数量,比如服务器个数。笔者研究方案的目的是为了提升集群资源使用率,所以这里单讨论资源纵向扩展。</p><p>不过坦白来讲,资源纵向扩展首要目标并不是为了提高集群利用率,而是为了优化集群资源、提高资源可用性和性能。</p><p>在 Kubernetes 中 <a href="https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler" target="_blank" rel="noopener">VPA</a> 项目主要是完成这项工作(主要针对 Pod)。</p><h3 id="Kubernetes-VPA"><a href="#Kubernetes-VPA" class="headerlink" title="Kubernetes VPA"></a>Kubernetes VPA</h3><blockquote><p>Vertical Pod Autoscaler (VPA) frees the users from necessity of setting up-to-date resource requests for the containers in their pods. When configured, it will set the requests automatically based on usage and thus allow proper scheduling onto nodes so that appropriate resource amount is available for each pod.</p></blockquote><p>以上是官方定义。简单来说是 Kubernetes VPA 可以根据实际负载动态设置 pod resource requests。</p><p>Kubernetes VPA 包含以下组件:</p><ul><li>Recommender:用于根据监控指标结合内置机制给出资源建议值</li><li>Updater:用于实时更新 pod resource requests</li><li>History Storage:用于采集和存储监控数据</li><li>Admission Controller: 用于在 pod 创建时修改 resource requests</li></ul><p>以下是架构图:</p><p><img src="/images/15550662807076.jpg" alt=""></p><p>主要流程是:<code>Recommender</code>在启动时从<code>History Storage</code>获取历史数据,根据内置机制修改<code>VPA API object</code>资源建议值。<code>Updater</code>监听<code>VPA API object</code>,依据建议值动态修改 pod resource requests。<code>VPA Admission Controller</code>则是用于 pod 创建时修改 pod resource requests。<code>History Storage</code>则是通过<code>Kubernetes Metrics API</code>采集和存储监控数据。</p><p>Kubernetes VPA 的整体架构比较简单,流程也很清晰,理解起来并不困难。但里面隐藏的几个功能点,却是方案的核心所在。它们的质量直接影响了方案的成熟度和评价效果:</p><p>1、如何设计 Recommendation model<br><code>Recommendation model</code>是集群优化的重中之重,它的好坏直接影响了集群资源优化的效果。就笔者目前了解,在 Kubernetes VPA 中这个模型是固定的,用户能做的是配置参数和数据源。</p><p>从官方描述看:</p><blockquote><p>The request is calculated based on analysis of the current and previous runs of the container and other containers with similar properties (name, image, command, args). The recommendation model (MVP) assumes that the memory and CPU consumption are independent random variables with distribution equal to the one observed in the last N days (recommended value is N=8 to capture weekly peaks). A more advanced model in future could attempt to detect trends, periodicity and other time-related patterns.</p></blockquote><p>CPU 和内存的建议值均是依据<strong>历史数据+固定机制</strong>计算而成,并没有一套解释引擎能让用户自定义规则。这在一定程度上影响了<code>Recommendation model</code>的准确性。就笔者理解,集群优化和混合部署的核心难点在于寻找能准确描述集群负载的指标,建立指标模型,并最终通过优化模型而达到最终目的 - 不论是为了优化集群或提高集群使用率。这个过程类似机器学习:先依旧经验或特征工程寻找特征变量,建立模型后使用数据不断优化参数,最后得到可用模型。所以仅靠单一指标 - 比如 CPU 或内存使用率 - 所建立的固定模型并不能准确描述集群状态和资源瓶颈。不管是从指标的颗粒度或固定模型上来看,最终效果都不会太好。</p><p>2、Pod 是否支持热更新<br>在 Kubernetes 中,pod resource requests 会影响 pod QoS 和容器的限制状态,比如驱逐策略、<code>OOM Score</code>和 cgroup 的限制参数等。如果不重建的话,单纯的修改 pod spec 只会影响调度策略。重建的话会导致 pod 重新调度,同时也在一定程度上降低了应用的可用性。官网列出一个更新策略<code>auto</code>,是可以<code>in-place</code>重建:</p><blockquote><p>“Auto”: VPA assigns resource requests on pod creation as well as updates them on existing pods using the preferred update mechanism. Currently this is equivalent to “Recreate” (see below). 
Once restart free (“in-place”) update of pod requests is available, it may be used as the preferred update mechanism by the “Auto” mode. NOTE: This feature of VPA is experimental and may cause downtime for your applications.</p></blockquote><p>目前应该没有完全实现。不过无论哪种方式,pod 重建貌似不可避免。</p><p>3、Pod 实时更新是否支持模糊控制<br>由于 Pod 更新会涉及重建,那么实时更新的触发条件就不应依据一个固定的值,比如值的变化触发更新重建(显然不可取)、依据逻辑表达式触发更新重建(也不可取,极端情况下会在设定值上下不断触发)。此时就需要在离散的值之间加入缓冲范围。而这个范围的设置高度依赖经验和实际集群情况,不然的话又会影响方案的最终效果。</p><p>总的来说,Kubernetes VPA 解决了资源纵向扩展的大部分工程问题。若应用于生产,还需做很多的个性化工作。</p><h2 id="资源回收"><a href="#资源回收" class="headerlink" title="资源回收"></a>资源回收</h2><p>既然 Kubernetes VPA 主要目标不是提升资源使用率,那它和混合部署又有何关系?别急,我们先来回顾下集群混合部署中提升资源使用率的关键是什么。</p><p>提升资源使用率最直观的方式,是在保证服务可用性的前提下尽量多的分配集群资源。我们知道在一般的集群管理软件中,调度器会为应用分配集群的可用资源。分配给应用的是逻辑资源,无需要和物理资源一一对应,比如可以超卖。并且应用持有的资源,一般情况下也不会全时段占用。在这种情况下,可将分配资源分为闲时和忙时。应用按照优先级区分,为高优先级的应用分配较多的资源。动态回收高优先级应用的闲时资源分配给低优先级应用使用,在高优先级应用负载升高时驱逐低优先级应用,从而达到提升资源使用率的目的。</p><p>在 Kubernetes VPA 中缺少资源回收的机制,但<code>Recommender</code>却可以配合<code>Updater</code>动态修改 pod resource requests 的值。也就是说 <strong>pod resource requests - 推荐值 = 资源回收值</strong>。这间接实现了资源回收的功能。那么 Kubernetes 调度器就可将这部分资源分配给其他应用使用。当然实际方案不会这么简单。比如<code>Recommender</code>就不需要使用<code>History Storage</code>中的历史数据和计算规则。初始值设为 pod resource requests,实时获取监控数据,加个 buffer 即可。这可以算是 Kubernetes 简陋版的资源回收功能。至于回收后,资源再分配和资源峰值驱逐等又是另一套流程了。</p><h2 id="总结"><a href="#总结" class="headerlink" title="总结"></a>总结</h2><p>暂时还是打算基于 Kubernetes VPA 实现资源回收和混合部署功能,毕竟现成的轮子。至于集群负载指标和模型,就完全是一套经验工程了。只能在实际生产中慢慢积累,别无他法。</p>]]></content>
<summary type="html">
<p>作者: <a href="https://github.com/xiaoxubeii" target="_blank" rel="noopener">徐蓓</a></p>
<h2 id="简述"><a href="#简述" class="headerlink" title=
</summary>
</entry>
<entry>
<title>Google Borg 浅析</title>
<link href="https://TencentCloudContainerTeam.github.io/2019/04/17/google-borg/"/>
<id>https://TencentCloudContainerTeam.github.io/2019/04/17/google-borg/</id>
<published>2019-04-17T07:00:00.000Z</published>
<updated>2020-06-16T01:53:49.335Z</updated>
<content type="html"><![CDATA[<p>作者: <a href="https://github.com/xiaoxubeii" target="_blank" rel="noopener">徐蓓</a></p><h1 id="Google-Borg-浅析"><a href="#Google-Borg-浅析" class="headerlink" title="Google Borg 浅析"></a>Google Borg 浅析</h1><p>笔者的工作主要涉及集群资源调度和混合部署,对相关技术和论文有所研究,包括 Google Borg、Kubernetes、Firmament 和 Kubernetes Poseidon 等。尤其是这篇《Large-scale cluster management at Google with Borg》令笔者受益匪浅。下面本人就结合生产场景,尝试对 Google Borg 做些分析和延展。</p><h2 id="Google-Borg-简介"><a href="#Google-Borg-简介" class="headerlink" title="Google Borg 简介"></a>Google Borg 简介</h2><p>Google Borg 是一套资源管理系统,可用于管理和调度资源。在 Borg 中,资源的单位是 <strong>Job</strong> 和 <strong>Task</strong>。<strong>Job</strong> 包含一组 <strong>Task</strong>。<strong>Task</strong> 是 Borg 管理和调度的最小单元,它对应一组 Linux 进程。熟悉 Kubernetes 的读者,可以将 <strong>Job</strong> 和 <strong>Task</strong> 大致对应为 Kubernetes 的 <strong>Service</strong> 和 <strong>Pod</strong>。</p><p>在架构上,Borg 和 Kubernetes 类似,由 BorgMaster、Scheduler 和 Borglet 组成。</p><p><img src="/images/15414871166556.jpg" alt=""></p><h2 id="Allocs"><a href="#Allocs" class="headerlink" title="Allocs"></a>Allocs</h2><p>Borg Alloc 代表一组可用于运行 Task 的资源,如 CPU、内存、IO 和磁盘空间。它实际上是集群对物理资源的抽象。Alloc set 类似 Job,是一堆 Alloc 的集合。当一个 Alloc set 被创建时,一个或多个 Job 就可以运行在上面了。</p><h2 id="Priority-和-Quota"><a href="#Priority-和-Quota" class="headerlink" title="Priority 和 Quota"></a>Priority 和 Quota</h2><p>每个 Job 都可以设置 Priority。Priority 可用于标识 Job 的重要程度,并影响一些资源分配、调度和 Preemption 策略。比如在生产中,我们会将作业分为 Routine Job 和 Batch Job。Routine Job 为生产级的例行作业,优先级最高,它占用对应实际物理资源的 Alloc set。Batch Job 代表一些临时作业,优先级最低。当资源紧张时,集群会优先 Preempt Batch Job,将资源提供给 Routine Job 使用。这时 Preempted Batch Job 会回到调度队列等待重新调度。</p><p>Quota 代表资源配额,它约束 Job 的可用资源,比如 CPU、内存或磁盘。Quota 一般在调度之前进行检查。Job 若不满足,会立即在提交时被拒绝。生产中,我们一般依据实际物理资源配置 Routine Job Quota。这种方式可以确保 Routine Job 在 Quota 内一定有可用的资源。为了充分提升集群资源使用率,我们会将 Batch Job Quota 设置为无限,让它尽量去占用 Routine Job 的闲置资源,从而实现超卖。这方面内容后面会在再次详述。</p><h2 id="Schedule"><a href="#Schedule" class="headerlink" title="Schedule"></a>Schedule</h2><p>调度是资源管理系统的核心功能,它直接决定了系统的“好坏”。在 Borg 中,Job 被提交后,Borgmaster 会将其放入一个 Pending Queue。Scheduler 异步地扫描队列,将 Task 调度到有充足资源的机器上。通常情况下,调度过程分为两个步骤:Filter 和 Score。Filter,或是 Feasibility Checking,用于判断机器是否满足 Task 的约束和限制,比如 Schedule Preference、Affinity 或 Resource Limit。Filter 结束后,就需要 Score 符合要求的机器,或称为 Weight。上述两个步骤完成后,Scheduler 就会挑选相应数量的机器调度给 Task 运行。实际上,选择合适的调度策略尤为重要。</p><p>这里可以拿一个生产集群举例。在初期,我们的调度系统采用的 Score 策略类似 Borg E-PVM,它的作用是将 Task 尽量均匀的调度到整个集群上。从正面效果上讲,这种策略分散了 Task 负载,并在一定程度上缩小了故障域。但从反面看,它也引发了资源碎片化的问题。由于我们底层环境是异构的,机器配置并不统一,并且 Task 配置和物理配置并无对应关系。这就造成一些配置过大的 Task 无法运行,由此在一定程度上降低了资源的分配率和使用率。为了应付此类问题,我们自研了新的 Score 策略,称之为 “Best Fillup”。它的原理是在调度 Task 时选择可用资源最少的机器,也就是尽量填满。不过这种策略的缺点显而易见:单台机器的负载会升高,从而增加 Bursty Load 的风险;不利于 Batch Job 运行;故障域会增加。</p><p>这篇论文,作者采用了一种被称为 hybrid 的方式,据说比第一种策略增加 3-5% 的效率。具体实现方式还有待后续研究。</p><h2 id="Utilization"><a href="#Utilization" class="headerlink" title="Utilization"></a>Utilization</h2><p>资源管理系统的首要目标是提高资源使用率,Borg 亦是如此。不过由于过多的前置条件,诸如 Job 放置约束、负载尖峰、多样的机器配置和 Batch Job,导致不能仅选择 “average utilization” 作为策略指标。在 Borg 中,使用 <strong>Cell Compaction</strong> 作为评判基准。简述之就是:能承载给定负载的最小 Cell。</p><p>Borg 提供了一些提高 utilization 的思路和实践方法,有些是我们在生产中已经采用的,有些则非常值得我们学习和借鉴。</p><h3 id="Cell-Sharing"><a href="#Cell-Sharing" class="headerlink" title="Cell Sharing"></a>Cell Sharing</h3><p>Borg 发现,将各种优先级的 Task,比如 prod 和 non-prod 运行在共享的 Cell 中可以大幅度的提升资源利用率。</p><p><img src="/images/15414743848812.jpg" alt=""></p><p>上面(a)图表明,采用 Task 隔离的部署方式会增加对机器的需求。图(b)是对额外机器需求的分布函数。图(a)和图(b)都清楚的表明了将 prod job 和 non-prod job 分开部署会消耗更多的物理资源。Borg 的经验是大约会新增 20-30% 左右。</p><p>个中原理也很好理解:prod 
job 通常会为应对负载尖峰申请较大资源,实际上这部分资源在多数时间里是闲置的。Borg 会定时回收这部分资源,并将之分配给 non-prod job 使用。在 Kubernetes 中,对应的概念是 request limit 和 limit。我们在生产中,一般设置 Prod job 的 Request limit 等于 limit,这样它就具有了最高的 Guaranteed Qos。该 QoS 使得 pod 在机器负载高时不至于被驱逐和 OOM。non-prod job 则不设置 request limit 和 limit,这使得它具有 BestEffort 级别的 QoS。kubelet 会在资源负载高时优先驱逐此类 Pod。这样也达到了和 Borg 类似的效果。</p><h3 id="Large-cells"><a href="#Large-cells" class="headerlink" title="Large cells"></a>Large cells</h3><p>Borg 通过实验数据表明,小容量的 cell 通常比大容量的更占用物理资源。<br><img src="/images/15414759002584.jpg" alt=""></p><p>这点对我们有和很重要的指导意义。通常情况下,我们会在设计集群时对容量问题感到犹豫不决。显而易见,小集群可以带来更高的隔离性、更小的故障域以及潜在风险。但随之带来的则是管理和架构复杂度的增加,以及更多的故障点。大集群的优缺点正好相反。在资源利用率这个指标上,我们凭直觉认为是大集群更优,但苦于无坚实的理论依据。Borg 的研究表明,大集群有利于增加资源利用率,这点对我们的决策很有帮助。</p><h3 id="Fine-grained-resource-requests"><a href="#Fine-grained-resource-requests" class="headerlink" title="Fine-grained resource requests"></a>Fine-grained resource requests</h3><p>Borg 对资源细粒度分配的方法,目前已是主流,在此我就不再赘述。</p><h3 id="Resource-reclamation"><a href="#Resource-reclamation" class="headerlink" title="Resource reclamation"></a>Resource reclamation</h3><p>笔者感觉这部分内容帮助最大。熟悉 Kubernetes 的读者,应该对类似的概念很熟悉,也就是所谓的 request limit。job 在提交时需要指定 resource limit,它能确保内部的 task 有足够资源可以运行。有些用户会为 task 申请过大的资源,以应对可能的请求或计算的突增。但实际上,部分资源在多数时间内是闲置的。与其资源浪费,不如利用起来。这需要系统有较精确的预测机制,可以评估 task 对实际资源的需求,并将闲置资源回收以分配给低 priority 的任务,比如 batch job。上述过程在 Borg 中被称为 <strong>resource reclamation</strong>,对使用资源的评估则被称为 <strong>reservation</strong>。Borgmaster 会定期从 Borglet 收集 resource consumption,并执行 <strong>reservation</strong>。在初始阶段,reservation 等于 resource limit。随着 task 的运行,reservation 就变为了资源的实际使用量,外加 safety margin。</p><p>在 Borg 调度时,Scheduler 使用 resource limit 为 prod task 过滤和选择主机,这个过程并不依赖 reclaimed resource。从这个角度看,并不支持对 prod task 的资源超卖。但 non-prod task 则不同,它是占用已有 task 的 resource reservation。所以 non-prod task 会被调度到拥有 reclaimed resource 的机器上。</p><p>这种做法当然也是有一定风险的。若资源评估出现偏差,机器上的可用资源可能会被耗尽。在这种情况下,Borg 会杀死或者降级 non-prod task,prod task 则不会受到半分任何影响。</p><p><img src="/images/15414862899318.jpg" alt=""></p><p>上图证实了这种策略的有效性。参照 Week 1 和 4 的 baseline,Week 2 和 3 在调整了 estimation algorithm 后,实际资源的 usage 与 reservation 的 gap 在显著缩小。在 Borg 的一个 median cell 中,有 20% 的负载是运行在 reclaimed resource 上。</p><p>相较于 Borg,Kubernetes 虽然有 resource limit 和 capacity 的概念,但却缺少动态 reclaim 机制。这会使得系统对低 priority task 的资源缺少行之有效的评估机制,从而引发系统负载问题。个人感觉这个功能对资源调度和提升资源使用率影响巨大,这部分内容也是笔者的工作重心</p><h2 id="Isolation"><a href="#Isolation" class="headerlink" title="Isolation"></a>Isolation</h2><p>这部分内容虽十分重要,但对于我们的生产集群优先级不是很高,在此先略过。有兴趣的读者可以自行研究。</p><h2 id="参考资料"><a href="#参考资料" class="headerlink" title="参考资料"></a>参考资料</h2><ul><li><a href="">Large-scale cluster management at Google with Borg</a></li><li><a href="http://www.firmament.io/blog/scheduler-architectures.html" target="_blank" rel="noopener">The evolution of cluster scheduler architectures</a></li><li><a href="https://github.com/kubernetes-sigs/poseidon" target="_blank" rel="noopener">poseidon</a></li><li><a href="https://docs.google.com/document/d/1VNoaw1GoRK-yop_Oqzn7wZhxMxvN3pdNjuaICjXLarA/edit?usp=sharing" target="_blank" rel="noopener">Poseidon design</a></li><li><a href="https://github.com/camsas/firmament" target="_blank" rel="noopener">firemament</a></li></ul>]]></content>
<summary type="html">
<p>作者: <a href="https://github.com/xiaoxubeii" target="_blank" rel="noopener">徐蓓</a></p>
<h1 id="Google-Borg-浅析"><a href="#Google-Borg-浅析" c
</summary>
</entry>
<entry>
<title>Istio 学习笔记:Istio CNI 插件</title>
<link href="https://TencentCloudContainerTeam.github.io/2019/04/07/istio-cni/"/>
<id>https://TencentCloudContainerTeam.github.io/2019/04/07/istio-cni/</id>
<published>2019-04-07T04:20:00.000Z</published>
<updated>2020-06-16T01:53:49.339Z</updated>
<content type="html"><![CDATA[<p>作者: <a href="https://imroc.io/" target="_blank" rel="noopener">陈鹏</a></p><h2 id="设计目标"><a href="#设计目标" class="headerlink" title="设计目标"></a>设计目标</h2><p>当前实现将用户 pod 流量转发到 proxy 的默认方式是使用 privileged 权限的 istio-init 这个 init container 来做的(运行脚本写入 iptables),Istio CNI 插件的主要设计目标是消除这个 privileged 权限的 init container,换成利用 k8s CNI 机制来实现相同功能的替代方案</p><h2 id="原理"><a href="#原理" class="headerlink" title="原理"></a>原理</h2><ul><li>Istio CNI Plugin 不是 istio 提出类似 k8s CNI 的插件扩展机制,而是 k8s CNI 的一个具体实现</li><li>k8s CNI 插件是一条链,在创建和销毁pod的时候会调用链上所有插件来安装和卸载容器的网络,istio CNI Plugin 即为 CNI 插件的一个实现,相当于在创建销毁pod这些hook点来针对istio的pod做网络配置:写入iptables,让该 pod 所在的 network namespace 的网络流量转发到 proxy 进程</li><li>当然也就要求集群启用 CNI,kubelet 启动参数: <code>--network-plugin=cni</code> (该参数只有两个可选项:<code>kubenet</code>, <code>cni</code>)</li></ul><h2 id="实现方式"><a href="#实现方式" class="headerlink" title="实现方式"></a>实现方式</h2><ul><li>运行一个名为 istio-cni-node 的 daemonset 运行在每个节点,用于安装 istio CNI 插件</li><li>该 CNI 插件负责写入 iptables 规则,让用户 pod 所在 netns 的流量都转发到这个 pod 中 proxy 的进程</li><li>当启用 istio cni 后,sidecar 的自动注入或<code>istioctl kube-inject</code>将不再注入 initContainers (istio-init)</li></ul><h2 id="istio-cni-node-工作流程"><a href="#istio-cni-node-工作流程" class="headerlink" title="istio-cni-node 工作流程"></a>istio-cni-node 工作流程</h2><ul><li>复制 Istio CNI 插件二进制程序到CNI的bin目录(即kubelet启动参数<code>--cni-bin-dir</code>指定的路径,默认是<code>/opt/cni/bin</code>)</li><li>使用istio-cni-node自己的ServiceAccount信息为CNI插件生成kubeconfig,让插件能与apiserver通信(ServiceAccount信息会被自动挂载到<code>/var/run/secrets/kubernetes.io/serviceaccount</code>)</li><li>生成CNI插件的配置并将其插入CNI配置插件链末尾(CNI的配置文件路径是kubelet启动参数<code>--cni-conf-dir</code>所指定的目录,默认是<code>/etc/cni/net.d</code>)</li><li>watch CNI 配置(<code>cni-conf-dir</code>),如果检测到被修改就重新改回来</li><li>watch istio-cni-node 自身的配置(configmap),检测到有修改就重新执行CNI配置生成与下发流程(当前写这篇文章的时候是istio 1.1.1,还没实现此功能)</li></ul><h2 id="设计提案"><a href="#设计提案" class="headerlink" title="设计提案"></a>设计提案</h2><ul><li>Istio CNI Plugin 提案创建时间:2018-09-28</li><li>Istio CNI Plugin 提案文档存放在:Istio 的 Google Team Drive<ul><li>Istio TeamDrive 地址:<a href="https://drive.google.com/corp/drive/u/0/folders/0AIS5p3eW9BCtUk9PVA" target="_blank" rel="noopener">https://drive.google.com/corp/drive/u/0/folders/0AIS5p3eW9BCtUk9PVA</a></li><li>Istio CNI Plugin 提案文档路径:<code>Working Groups/Networking/Istio CNI Plugin</code></li><li>查看文件需要申请权限,申请方法:加入istio-team-drive-access这个google网上论坛group</li><li>istio-team-drive-access group 地址: <a href="https://groups.google.com/forum/#!forum/istio-team-drive-access" target="_blank" rel="noopener">https://groups.google.com/forum/#!forum/istio-team-drive-access</a></li></ul></li></ul><h2 id="参考资料"><a href="#参考资料" class="headerlink" title="参考资料"></a>参考资料</h2><ul><li>Install Istio with the Istio CNI plugin: <a href="https://istio.io/docs/setup/kubernetes/additional-setup/cni/" target="_blank" rel="noopener">https://istio.io/docs/setup/kubernetes/additional-setup/cni/</a></li><li>istio-cni 项目地址:<a href="https://github.com/istio/cni" target="_blank" rel="noopener">https://github.com/istio/cni</a></li></ul>]]></content>
<summary type="html">
<p>作者: <a href="https://imroc.io/" target="_blank" rel="noopener">陈鹏</a></p>
<h2 id="设计目标"><a href="#设计目标" class="headerlink" title="设计目标"><
</summary>
</entry>
<entry>
<title>istio 庖丁解牛(三) galley</title>
<link href="https://TencentCloudContainerTeam.github.io/2019/04/01/istio-analysis-3/"/>
<id>https://TencentCloudContainerTeam.github.io/2019/04/01/istio-analysis-3/</id>
<published>2019-04-01T07:30:00.000Z</published>
<updated>2020-06-16T01:53:49.339Z</updated>
<content type="html"><![CDATA[<p>作者: <a href="https://imfox.io/" target="_blank" rel="noopener">钟华</a></p><p>今天我们来解析istio控制面组件Galley. Galley Pod是一个单容器单进程组件, 没有sidecar, 结构独立, 职责明确.</p><p><img src="https://ws4.sinaimg.cn/large/006tKfTcgy1g1maoldl74j31850u049x.jpg" referrerpolicy="no-referrer"></p><p><a href="https://ws4.sinaimg.cn/large/006tKfTcgy1g187dn7s1tj315m0u0x6t.jpg" target="_blank" referrerpolicy="no-referrer">查看高清原图</a></p><p>前不久istio 1.1 版本正式发布, 其中istio的配置管理机制有较大的改进, 以下是<a href="https://istio.io/about/notes/1.1/" target="_blank" rel="noopener">1.1 release note</a> 中部分说明:</p><blockquote><p>Added <a href="https://istio.io/docs/concepts/what-is-istio/#galley" target="_blank" rel="noopener">Galley</a> as the primary configuration ingestion and distribution mechanism within Istio. It provides a robust model to validate, transform, and distribute configuration states to Istio components insulating the Istio components from Kubernetes details. Galley uses the <a href="https://github.com/istio/api/tree/release-1.1/mcp" target="_blank" rel="noopener">Mesh Configuration Protocol (MCP)</a> to interact with components</p></blockquote><p>Galley 原来仅负责进行配置验证, 1.1 后升级为整个控制面的配置管理中心, 除了继续提供配置验证功能外, Galley还负责配置的管理和分发, Galley 使用 <strong>网格配置协议</strong>(Mesh Configuration Protocol) 和其他组件进行配置的交互.</p><p>今天对Galley的剖析大概有以下方面:</p><ul><li>Galley 演进的背景</li><li>Galley 配置验证功能</li><li>MCP 协议</li><li>Galley 配置管理实现浅析</li></ul><hr><h2 id="Galley-演进的背景"><a href="#Galley-演进的背景" class="headerlink" title="Galley 演进的背景"></a>Galley 演进的背景</h2><p>在 k8s 场景下, 「配置(Configuration)」一词主要指yaml编写的Resource Definition, 如service、pod, 以及扩展的CRD( Custom Resource Definition), 如 istio的 VirtualService、DestinationRule 等.</p><p><strong>本文中「配置」一词可以等同于 k8s Resource Definition + istio CRD</strong></p><p>声明式 API 是 Kubernetes 项目编排能力“赖以生存”的核心所在, 而「配置」是声明式 API的承载方式.</p><blockquote><p>Istio 项目的设计与实现,其实都依托于 Kubernetes 的声明式 API 和它所提供的各种编排能力。可以说,Istio 是在 Kubernetes 项目使用上的一位“集大成者”</p><p>Istio 项目有多火热,就说明 Kubernetes 这套“声明式 API”有多成功</p></blockquote><p>k8s 内置了几十个Resources, istio 创造了50多个CRD, 其复杂度可见一斑, 所以有人说面向k8s编程近似于面向yaml编程.</p><p>早期的Galley 仅仅负责对「配置」进行运行时验证, istio 控制面各个组件各自去list/watch 各自关注的「配置」, 以下是istio早期的Configuration flow:</p><p><img src="https://ws3.sinaimg.cn/large/006tKfTcgy1g1mbphtde5j31d20swae2.jpg" referrerpolicy="no-referrer"></p><p>越来越多且复杂的「配置」给istio 用户带来了诸多不便, 主要体现在:</p><ul><li>「配置」的缺乏统一管理, 组件各自订阅, 缺乏统一回滚机制, 配置问题难以定位</li><li>「配置」可复用度低, 比如在1.1之前, 每个mixer adpater 就需要定义个新的CRD.</li><li>另外「配置」的隔离, ACL 控制, 一致性, 抽象程度, 序列化等等问题都还不太令人满意.</li></ul><p>随着istio功能的演进, 可预见的istio CRD数量还会继续增加, 社区计划将Galley 强化为istio 「配置」控制层, Galley 除了继续提供「配置」验证功能外, 还将提供配置管理流水线, 包括输入, 转换, 分发, 以及适合istio控制面的「配置」分发协议(MCP).</p><p>本文对Galley的分析基于istio tag 1.1.1 (commit 2b13318)</p><hr><h2 id="Galley-配置验证功能"><a href="#Galley-配置验证功能" class="headerlink" title="Galley 配置验证功能"></a>Galley 配置验证功能</h2><p>在<a href="https://imfox.io/2019/03/19/istio-analysis-2/" target="_blank" rel="noopener">istio 庖丁解牛(二) sidecar injector</a>中我分析了istio-sidecar-injector 如何利用 MutatingWebhook 来实现sidecar注入, Galley 使用了k8s提供的另一个Admission Webhooks: ValidatingWebhook, 来做配置的验证:</p><p><img src="https://ws1.sinaimg.cn/large/006tKfTcgy1g1mcwsf5ggj30sz0ecjt4.jpg" referrerpolicy="no-referrer"></p><p>istio 需要一个关于ValidatingWebhook的配置项, 用于告诉k8s api server, 哪些CRD应该发往哪个服务的哪个接口去做验证, 该配置名为istio-galley, 简化的内容如下:</p><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span 
class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br></pre></td><td class="code"><pre><span class="line"><span class="string">%kubectl</span> <span class="string">get</span> <span class="string">ValidatingWebhookConfiguration</span> <span class="string">istio-galley</span> <span class="bullet">-oyaml</span></span><br><span class="line"></span><br><span class="line"><span class="attr">apiVersion:</span> <span class="string">admissionregistration.k8s.io/v1beta1</span></span><br><span class="line"><span class="attr">kind:</span> <span class="string">ValidatingWebhookConfiguration</span></span><br><span class="line"><span class="attr">metadata:</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">istio-galley</span></span><br><span class="line"><span class="attr">webhooks:</span></span><br><span class="line"><span class="attr">- clientConfig:</span></span><br><span class="line"> <span class="string">......</span></span><br><span class="line"><span class="attr"> service:</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">istio-galley</span></span><br><span class="line"><span class="attr"> namespace:</span> <span class="string">istio-system</span></span><br><span class="line"><span class="attr"> path:</span> <span class="string">/admitpilot</span></span><br><span class="line"><span class="attr"> failurePolicy:</span> <span class="string">Fail</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">pilot.validation.istio.io</span></span><br><span class="line"><span class="attr"> rules:</span></span><br><span class="line"> <span class="string">...pilot关注的CRD...</span></span><br><span class="line"><span class="bullet"> -</span> <span class="string">gateways</span></span><br><span class="line"><span class="bullet"> -</span> <span class="string">virtualservices</span></span><br><span class="line"> <span class="string">......</span></span><br><span class="line"><span class="attr">- clientConfig:</span></span><br><span class="line"> <span class="string">......</span></span><br><span class="line"><span class="attr"> service:</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">istio-galley</span></span><br><span class="line"><span class="attr"> namespace:</span> <span class="string">istio-system</span></span><br><span class="line"><span class="attr"> path:</span> <span class="string">/admitmixer</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">mixer.validation.istio.io</span></span><br><span class="line"><span class="attr"> rules:</span></span><br><span class="line"> <span 
class="string">...mixer关注的CRD...</span></span><br><span class="line"><span class="bullet"> -</span> <span class="string">rules</span></span><br><span class="line"><span class="bullet"> -</span> <span class="string">metrics</span></span><br><span class="line"> <span class="string">......</span></span><br></pre></td></tr></table></figure><p>可以看到, 该配置将pilot和mixer关注的CRD, 分别发到了服务istio-galley的<code>/admitpilot</code>和<code>/admitmixer</code>, 在Galley 源码中可以很容易找到这2个path Handler的入口:</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">h.HandleFunc(<span class="string">"/admitpilot"</span>, wh.serveAdmitPilot)</span><br><span class="line">h.HandleFunc(<span class="string">"/admitmixer"</span>, wh.serveAdmitMixer)</span><br></pre></td></tr></table></figure><hr><h2 id="MCP协议"><a href="#MCP协议" class="headerlink" title="MCP协议"></a>MCP协议</h2><p>MCP 提供了一套配置订阅和分发的API, 在MCP中, 可以抽象为以下模型:</p><ul><li>source: 「配置」的提供端, 在Istio中Galley 即是source</li><li>sink: 「配置」的消费端, 在isito中典型的sink包括Pilot和Mixer组件</li><li>resource: source和sink关注的资源体, 也就是isito中的「配置」</li></ul><p>当sink和source之间建立了对某些resource的订阅和分发关系后, source 会将指定resource的变化信息推送给sink, sink端可以选择接受或者不接受resource更新(比如格式错误的情况), 并对应返回ACK/NACK 给source端.</p><p>MCP 提供了gRPC 的实现, 实现代码参见: <a href="https://github.com/istio/api/tree/master/mcp/v1alpha1" target="_blank" rel="noopener">https://github.com/istio/api/tree/master/mcp/v1alpha1</a>,</p><p>其中包括2个services: <code>ResourceSource</code> 和 <code>ResourceSink</code>, 通常情况下, source 会作为 gRPC的server 端, 提供<code>ResourceSource</code>服务, sink 作为 gRPC的客户端, sink主动发起请求连接source; 不过有的场景下, source 会作为gRPC的client端, sink作为gRPC的server端提供<code>ResourceSink</code>服务, source主动发起请求连接sink.</p><p>以上2个服务, 内部功能逻辑都是一致的, 都是sink需要订阅source管理的resource, 区别仅仅是哪端主动发起的连接请求.</p><p>具体到istio的场景中:</p><ul><li>在单k8s集群的istio mesh中, Galley默认实现了<code>ResourceSource</code> service, Pilot和Mixer会作为该service的client主动连接Galley进行配置订阅.</li><li>Galley 可以配置去主动连接远程的其他sink, 比如说在多k8s集群的mesh中, 主集群中的Galley可以为多个集群的Pilot/Mixer提供配置管理, 跨集群的Pilot/Mixer无法主动连接主集群Galley, 这时候Galley就可以作为gRPC的client 主动发起连接, 跨集群的Pilot/Mixer作为gRPC server 实现<code>ResourceSink</code>服务,</li></ul><p>两种模式的示意图如下:</p><p><img src="https://ws2.sinaimg.cn/large/006tKfTcgy1g1n7omb7vrj30uk0u0452.jpg" referrerpolicy="no-referrer"></p><hr><h2 id="Galley-配置管理实现浅析"><a href="#Galley-配置管理实现浅析" class="headerlink" title="Galley 配置管理实现浅析"></a>Galley 配置管理实现浅析</h2><p>galley 进程对外暴露了若干服务, 最重要的就是基于gRPC的mcp服务, 以及http的验证服务, 除此之外还提供了 prometheus exporter接口以及Profiling接口:</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">if</span> serverArgs.EnableServer { <span class="comment">// 配置管理服务</span></span><br><span class="line"><span class="keyword">go</span> server.RunServer(serverArgs, livenessProbeController, readinessProbeController)</span><br><span class="line">}</span><br><span class="line"><span 
class="keyword">if</span> validationArgs.EnableValidation { <span class="comment">// 验证服务</span></span><br><span class="line"><span class="keyword">go</span> validation.RunValidation(validationArgs, kubeConfig, livenessProbeController, readinessProbeController)</span><br><span class="line">}</span><br><span class="line"></span><br><span class="line"><span class="comment">// 提供 prometheus exporter</span></span><br><span class="line"><span class="keyword">go</span> server.StartSelfMonitoring(galleyStop, monitoringPort)</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> enableProfiling {</span><br><span class="line"> <span class="comment">// 使用包net/http/pprof</span></span><br><span class="line"> <span class="comment">// 通过http server提供runtime profiling数据</span></span><br><span class="line"><span class="keyword">go</span> server.StartProfiling(galleyStop, pprofPort)</span><br><span class="line">}</span><br><span class="line"><span class="comment">// 开始探针更新</span></span><br><span class="line"><span class="keyword">go</span> server.StartProbeCheck(livenessProbeController, readinessProbeController, galleyStop)</span><br></pre></td></tr></table></figure><p>接下来主要分析下「配置」管理服务的实现:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">go server.RunServer(serverArgs, livenessProbeController, readinessProbeController)</span><br></pre></td></tr></table></figure><p>下面是Galley 配置服务结构示意图:</p><p><img src="https://ws1.sinaimg.cn/large/006tKfTcgy1g1mzi3oe9xj31r10u0qgp.jpg" referrerpolicy="no-referrer"></p><p><a href="https://ws2.sinaimg.cn/large/006tKfTcgy1g1n8o76s8yj31r10u0trx.jpg" target="_blank" referrerpolicy="no-referrer">查看高清原图</a></p><p>从上图可以看到, Galley 配置服务主要包括 Processor 和 负责mcp通信的grpc Server.</p><p>其中 Processor 又由以下部分组成:</p><ul><li>Source: 代表Galley管理的配置的来源</li><li>Handler: 对「配置」事件的处理器</li><li>State: Galley管理的「配置」在内存中状态</li></ul><hr><h3 id="Source"><a href="#Source" class="headerlink" title="Source"></a>Source</h3><p>interface Source 代表istio关注的配置的来源, 其<code>Start</code>方法需要实现对特定资源的变化监听.</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">// Source to be implemented by a source configuration provider.</span></span><br><span class="line"><span class="keyword">type</span> Source <span class="keyword">interface</span> {</span><br><span class="line"><span class="comment">// Start the source interface, provided the EventHandler. The initial state of the underlying</span></span><br><span class="line"><span class="comment">// config store should be reflected as a series of Added events, followed by a FullSync event.</span></span><br><span class="line">Start(handler resource.EventHandler) error</span><br><span class="line"></span><br><span class="line"><span class="comment">// Stop the source interface. 
Upon return from this method, the channel should not be accumulating any</span></span><br><span class="line"><span class="comment">// more events.</span></span><br><span class="line">Stop()</span><br><span class="line">}</span><br></pre></td></tr></table></figure><p>在Galley中, 有多个Source的实现, 主要包括</p><ul><li><p><code>source/fs.source</code></p></li><li><p><code>source/kube/builtin.source</code></p></li><li><code>source/kube/dynamic.source</code></li><li><code>source/kube.aggregate</code></li></ul><p>其中<code>source/fs</code>代表从文件系统中获取配置, 这种形式常用于开发和测试过程中, 不需要创建实际的k8s CRD, 只需要CRD文件即可, 同时<code>source/fs</code>也是实现了更新watch(使用<a href="https://github.com/howeyc/fsnotify" target="_blank" rel="noopener">https://github.com/howeyc/fsnotify</a>)</p><p><code>source/kube/builtin.source</code>处理k8s 内置的配置来源, 包括<code>Service</code>, <code>Node</code>, <code>Pod</code>, <code>Endpoints</code>等, <code>source/kube/dynamic.source</code>处理其他的istio 关注的CRD, <code>source/kube.aggregate</code>是多个Source 的聚合, 其本身也实现了Source interface:</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">type</span> aggregate <span class="keyword">struct</span> {</span><br><span class="line">mu sync.Mutex</span><br><span class="line">sources []runtime.Source</span><br><span class="line">}</span><br><span class="line"></span><br><span class="line"><span class="function"><span class="keyword">func</span> <span class="params">(s *aggregate)</span> <span class="title">Start</span><span class="params">(handler resource.EventHandler)</span> <span class="title">error</span></span> {</span><br><span class="line">......</span><br><span class="line"><span class="keyword">for</span> _, source := <span class="keyword">range</span> s.sources {</span><br><span class="line"><span class="keyword">if</span> err := source.Start(syncHandler); err != <span class="literal">nil</span> {</span><br><span class="line"><span class="keyword">return</span> err</span><br><span class="line">}</span><br><span class="line">}</span><br><span class="line">......</span><br></pre></td></tr></table></figure><p><code>source/kube/builtin.source</code>、<code>source/kube/dynamic.source</code>本身都包含一个k8s SharedIndexInformer:</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">// source is a simplified client interface for listening/getting Kubernetes resources in an unstructured way.</span></span><br><span class="line"><span class="keyword">type</span> source <span class="keyword">struct</span> {</span><br><span class="line">......</span><br><span class="line"><span class="comment">// SharedIndexInformer for watching/caching resources</span></span><br><span class="line">informer cache.SharedIndexInformer</span><br><span class="line"></span><br><span 
class="line">handler resource.EventHandler</span><br><span class="line">}</span><br></pre></td></tr></table></figure><p>二者的<code>Start</code>方法的实现, 正是用到了k8s典型的 Informer+list/watch 模式, 获取关注「配置」的变化事件, 在此不再赘述.</p><p>Source 获得「配置」更新事件后, 会将其推送到Processor 的events chan 中, events 长度为1024, 通过<code>go p.process()</code>, <code>Proccesor</code>的<code>handler</code>会对事件进行异步处理.</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">func</span> <span class="params">(p *Processor)</span> <span class="title">Start</span><span class="params">()</span> <span class="title">error</span></span> {</span><br><span class="line">......</span><br><span class="line"> events := <span class="built_in">make</span>(<span class="keyword">chan</span> resource.Event, <span class="number">1024</span>)</span><br><span class="line">err := p.source.Start(<span class="function"><span class="keyword">func</span><span class="params">(e resource.Event)</span></span> {</span><br><span class="line">events <- e</span><br><span class="line">})</span><br><span class="line"> ......</span><br><span class="line">p.events = events</span><br><span class="line"></span><br><span class="line"><span class="keyword">go</span> p.process()</span><br><span class="line"><span class="keyword">return</span> <span class="literal">nil</span></span><br><span class="line">}</span><br><span class="line"></span><br><span class="line"><span class="function"><span class="keyword">func</span> <span class="params">(p *Processor)</span> <span class="title">process</span><span class="params">()</span></span> {</span><br><span class="line">loop:</span><br><span class="line"><span class="keyword">for</span> {</span><br><span class="line"><span class="keyword">select</span> {</span><br><span class="line"><span class="comment">// Incoming events are received through p.events</span></span><br><span class="line"><span class="keyword">case</span> e := <-p.events:</span><br><span class="line">p.processEvent(e)</span><br><span class="line"></span><br><span class="line"><span class="keyword">case</span> <-p.state.strategy.Publish:</span><br><span class="line">scope.Debug(<span class="string">"Processor.process: publish"</span>)</span><br><span class="line">p.state.publish()</span><br><span 
class="line"></span><br><span class="line"><span class="comment">// p.done signals the graceful Shutdown of the processor.</span></span><br><span class="line"><span class="keyword">case</span> <-p.done:</span><br><span class="line">scope.Debug(<span class="string">"Processor.process: done"</span>)</span><br><span class="line"><span class="keyword">break</span> loop</span><br><span class="line">}</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> p.postProcessHook != <span class="literal">nil</span> {</span><br><span class="line">p.postProcessHook()</span><br><span class="line">}</span><br><span class="line">}</span><br><span class="line"> ......</span><br><span class="line">}</span><br></pre></td></tr></table></figure><hr><h3 id="Handler-和-State"><a href="#Handler-和-State" class="headerlink" title="Handler 和 State"></a>Handler 和 State</h3><p>interface Handler 代表对「配置」变化事件的处理器:</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">// Handler handles an incoming resource event.</span></span><br><span class="line"><span class="keyword">type</span> Handler <span class="keyword">interface</span> {</span><br><span class="line">Handle(e resource.Event)</span><br><span class="line">}</span><br></pre></td></tr></table></figure><p>在istio中有多个Handler的实现, 典型的有:</p><ul><li>Dispatcher</li><li>State</li></ul><p>Dispatcher 是多个Handler的集合:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">type Dispatcher struct {</span><br><span class="line">handlers map[resource.Collection][]Handler</span><br><span class="line">}</span><br></pre></td></tr></table></figure><p>State 是对Galley的内存中的状态, 包括了Galley 当前持有「配置」的schema、发布策略以及内容快照等:</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">// State is the in-memory state of Galley.</span></span><br><span class="line"><span class="keyword">type</span> State <span class="keyword">struct</span> {</span><br><span class="line">name <span class="keyword">string</span></span><br><span class="line">schema *resource.Schema</span><br><span class="line"></span><br><span class="line">distribute <span class="keyword">bool</span></span><br><span class="line">strategy 
*publish.Strategy</span><br><span class="line">distributor publish.Distributor</span><br><span class="line"></span><br><span class="line">config *Config</span><br><span class="line"></span><br><span class="line"><span class="comment">// version counter is a nonce that generates unique ids for each updated view of State.</span></span><br><span class="line">versionCounter <span class="keyword">int64</span></span><br><span class="line"></span><br><span class="line"><span class="comment">// entries for per-message-type State.</span></span><br><span class="line">entriesLock sync.Mutex</span><br><span class="line">entries <span class="keyword">map</span>[resource.Collection]*resourceTypeState</span><br><span class="line"></span><br><span class="line"><span class="comment">// Virtual version numbers for Gateways & VirtualServices for Ingress projected ones</span></span><br><span class="line">ingressGWVersion <span class="keyword">int64</span></span><br><span class="line">ingressVSVersion <span class="keyword">int64</span></span><br><span class="line">lastIngressVersion <span class="keyword">int64</span></span><br><span class="line"></span><br><span class="line"><span class="comment">// pendingEvents counts the number of events awaiting publishing.</span></span><br><span class="line">pendingEvents <span class="keyword">int64</span></span><br><span class="line"></span><br><span class="line"><span class="comment">// lastSnapshotTime records the last time a snapshot was published.</span></span><br><span class="line">lastSnapshotTime time.Time</span><br><span class="line">}</span><br></pre></td></tr></table></figure><p>同时State 也实现了interface <code>Handler</code>, 最终「配置」资源将会作为快照存储到State的<code>distributor</code>中, <code>distributor</code>实际的实现是mcp包中的<code>Cache</code>, 实际会调用mcp中的<code>Cache#SetSnapshot</code>.</p><hr><h3 id="Distributor-、Watcher-和-Cache"><a href="#Distributor-、Watcher-和-Cache" class="headerlink" title="Distributor 、Watcher 和 Cache"></a>Distributor 、Watcher 和 Cache</h3><p>在mcp包中, 有2个interface 值得特别关注: Distributor 和 Watcher</p><p>interface Distributor 定义了「配置」快照存储需要实现的接口, State 最终会调用<code>SetSnapshot</code>将配置存储到快照中.</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">// Distributor interface allows processor to distribute snapshots of configuration.</span></span><br><span class="line"><span class="keyword">type</span> Distributor <span class="keyword">interface</span> {</span><br><span class="line">SetSnapshot(name <span class="keyword">string</span>, snapshot sn.Snapshot)</span><br><span class="line"></span><br><span class="line">ClearSnapshot(name <span class="keyword">string</span>)</span><br><span class="line">}</span><br></pre></td></tr></table></figure><p>interface Watcher 功能有点类似k8s的 list/watch, Watch方法会注册 mcp sink 的watch 请求和处理函数:</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">// Watcher requests watches for 
configuration resources by node, last</span></span><br><span class="line"><span class="comment">// applied version, and type. The watch should send the responses when</span></span><br><span class="line"><span class="comment">// they are ready. The watch can be canceled by the consumer.</span></span><br><span class="line"><span class="keyword">type</span> Watcher <span class="keyword">interface</span> {</span><br><span class="line"><span class="comment">// Watch returns a new open watch for a non-empty request.</span></span><br><span class="line"><span class="comment">//</span></span><br><span class="line"><span class="comment">// Cancel is an optional function to release resources in the</span></span><br><span class="line"><span class="comment">// producer. It can be called idempotently to cancel and release resources.</span></span><br><span class="line">Watch(*Request, PushResponseFunc) CancelWatchFunc</span><br><span class="line">}</span><br></pre></td></tr></table></figure><p>struct <code>mcp/snapshot.Cache</code> 同时实现了Distributor 和 Watcher interface:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line">type Cache struct {</span><br><span class="line">mu sync.RWMutex</span><br><span class="line">snapshots map[string]Snapshot</span><br><span class="line">status map[string]*StatusInfo</span><br><span class="line">watchCount int64</span><br><span class="line"></span><br><span class="line">groupIndex GroupIndexFn</span><br><span class="line">}</span><br></pre></td></tr></table></figure><p>mcp 服务端在接口 <code>StreamAggregatedResources</code>和<code>EstablishResourceStream</code>中, 会调用Watch方法, 注册sink连接的watch请求:</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line">sr := &source.Request{</span><br><span class="line">SinkNode: req.SinkNode,</span><br><span class="line">Collection: collection,</span><br><span class="line">VersionInfo: req.VersionInfo,</span><br><span class="line">}</span><br><span class="line">w.cancel = con.watcher.Watch(sr, con.queueResponse)</span><br></pre></td></tr></table></figure><p> <code>mcp/snapshot.Cache</code> 实现了interface Distributor 的<code>SetSnapshot</code>方法, 该方法在State状态变化后会被调用, 该方法会遍历之前watch注册的responseWatch, 并将WatchResponse传递给各个处理方法.</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span 
class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br></pre></td><td class="code"><pre><span class="line">// SetSnapshot updates a snapshot for a group.</span><br><span class="line">func (c *Cache) SetSnapshot(group string, snapshot Snapshot) {</span><br><span class="line">c.mu.Lock()</span><br><span class="line">defer c.mu.Unlock()</span><br><span class="line"></span><br><span class="line">// update the existing entry</span><br><span class="line">c.snapshots[group] = snapshot</span><br><span class="line"></span><br><span class="line">// trigger existing watches for which version changed</span><br><span class="line">if info, ok := c.status[group]; ok {</span><br><span class="line">info.mu.Lock()</span><br><span class="line">defer info.mu.Unlock()</span><br><span class="line"></span><br><span class="line">for id, watch := range info.watches {</span><br><span class="line">version := snapshot.Version(watch.request.Collection)</span><br><span class="line">if version != watch.request.VersionInfo {</span><br><span class="line">scope.Infof("SetSnapshot(): respond to watch %d for %v @ version %q",</span><br><span class="line">id, watch.request.Collection, version)</span><br><span class="line"></span><br><span class="line">response := &source.WatchResponse{</span><br><span class="line">Collection: watch.request.Collection,</span><br><span class="line">Version: version,</span><br><span class="line">Resources: snapshot.Resources(watch.request.Collection),</span><br><span class="line">Request: watch.request,</span><br><span class="line">}</span><br><span class="line">watch.pushResponse(response)</span><br><span class="line"></span><br><span class="line">// discard the responseWatch</span><br><span class="line">delete(info.watches, id)</span><br><span class="line"></span><br><span class="line">scope.Debugf("SetSnapshot(): watch %d for %v @ version %q complete",</span><br><span class="line">id, watch.request.Collection, version)</span><br><span class="line">}</span><br><span class="line">}</span><br><span class="line">}</span><br><span class="line">}</span><br></pre></td></tr></table></figure><p>提供给Watch的处理函数<code>queueResponse</code>会将WatchResponse放入连接的响应队列, 最终会推送给mcp sink端.</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line">// Queue the response for sending in the dispatch loop. 
The caller may provide</span><br><span class="line">// a nil response to indicate that the watch should be closed.</span><br><span class="line">func (con *connection) queueResponse(resp *WatchResponse) {</span><br><span class="line">if resp == nil {</span><br><span class="line">con.queue.Close()</span><br><span class="line">} else {</span><br><span class="line">con.queue.Enqueue(resp.Collection, resp)</span><br><span class="line">}</span><br><span class="line">}</span><br></pre></td></tr></table></figure><hr><p>最后上一张Galley mcp 服务相关模型UML:</p><p><img src="https://imfox.io/assets/images/istio-a/galley_uml.png" alt=""></p><p><a href="https://imfox.io/assets/images/istio-a/galley_uml.png" target="_blank">查看高清原图</a></p><p>Galley 源代码展示了面向抽象(interface)编程的好处, Source 是对「配置」数据源的抽象, Distributor 是「配置」快照存储的抽象, Watcher 是对「配置」订阅端的抽象. 抽象的具体实现可以组合起来使用. 另外Galley组件之间也充分解耦, 组件之间的数据通过chan/watcher等流转.</p><p>关于早期 istio 配置管理的演进计划, 可以参考2018年5月 CNCF KubeCon talk <a href="https://www.youtube.com/watch?v=x1Tyw8dFKjI&index=2&t=0s&list=LLQ2StCCdx81xHxHxBO0foGA" target="_blank" rel="noopener">Introduction to Istio Configuration - Joy Zhang</a> (需.翻.墙), 1.1 版本中Galley 也还未完全实现该文中的roadmap, 如 configuration pipeline 等. 未来Galley 还会继续演进.</p><blockquote><p>版权归作者所有, 欢迎转载, 转载请注明出处</p></blockquote><hr><h2 id="参考资料"><a href="#参考资料" class="headerlink" title="参考资料"></a>参考资料</h2><ul><li><a href="https://www.youtube.com/watch?v=x1Tyw8dFKjI&index=2&t=0s&list=LLQ2StCCdx81xHxHxBO0foGA" target="_blank" rel="noopener">Introduction to Istio Configuration </a></li><li><a href="https://docs.google.com/document/d/1o2-V4TLJ8fJACXdlsnxKxDv2Luryo48bAhR8ShxE5-k/edit#heading=h.qex63c29z2to" target="_blank" rel="noopener">google doc Mesh Configuration Protocol (MCP)</a></li><li><a href="https://github.com/istio/api/tree/master/mcp" target="_blank" rel="noopener">github Mesh Configuration Protocol (MCP)</a></li></ul>]]></content>
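<p>为帮助理解上面 Distributor、Watcher 与 Cache 的协作方式,这里给出一段自包含的简化示意代码(其中的类型均为本文虚构,并非 istio mcp 包的真实定义):SetSnapshot 更新某个 group 的快照后,回调所有版本落后的 watch,对应上文 Cache#SetSnapshot 的核心逻辑。</p>
<pre><code>
// 简化示意:快照存储 + watch 回调
package main

import (
	"fmt"
	"sync"
)

type Snapshot struct {
	Version   string
	Resources []string // 简化:用字符串代表序列化后的「配置」
}

type WatchResponse struct {
	Group    string
	Snapshot Snapshot
}

type watch struct {
	lastVersion string
	push        func(WatchResponse)
}

type Cache struct {
	mu        sync.Mutex
	snapshots map[string]Snapshot
	watches   map[string][]watch
}

func NewCache() *Cache {
	return &Cache{snapshots: map[string]Snapshot{}, watches: map[string][]watch{}}
}

// Watch 注册一个订阅:若已有更新的快照则立即推送,否则挂起等待 SetSnapshot 触发
func (c *Cache) Watch(group, lastVersion string, push func(WatchResponse)) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if s, ok := c.snapshots[group]; ok && s.Version != lastVersion {
		push(WatchResponse{Group: group, Snapshot: s})
		return
	}
	c.watches[group] = append(c.watches[group], watch{lastVersion: lastVersion, push: push})
}

// SetSnapshot 更新快照,并回调所有版本已变化的 watch(回调后即丢弃,保持一次性 watch 语义)
func (c *Cache) SetSnapshot(group string, s Snapshot) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.snapshots[group] = s

	pending := c.watches[group][:0]
	for _, w := range c.watches[group] {
		if w.lastVersion != s.Version {
			w.push(WatchResponse{Group: group, Snapshot: s})
		} else {
			pending = append(pending, w)
		}
	}
	c.watches[group] = pending
}

func main() {
	c := NewCache()
	c.Watch("default", "", func(r WatchResponse) {
		fmt.Printf("sink 收到 %s 的快照 v%s: %v\n", r.Group, r.Snapshot.Version, r.Snapshot.Resources)
	})
	c.SetSnapshot("default", Snapshot{Version: "1", Resources: []string{"VirtualService/bookinfo"}})
}
</code></pre>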
<summary type="html">
<p>作者: <a href="https://imfox.io/" target="_blank" rel="noopener">钟华</a></p>
<p>今天我们来解析istio控制面组件Galley. Galley Pod是一个单容器单进程组件, 没有sidecar, 结
</summary>
</entry>
<entry>
<title>istio 庖丁解牛(二) sidecar injector</title>
<link href="https://TencentCloudContainerTeam.github.io/2019/03/19/istio-analysis-2/"/>
<id>https://TencentCloudContainerTeam.github.io/2019/03/19/istio-analysis-2/</id>
<published>2019-03-19T07:30:00.000Z</published>
<updated>2020-06-16T01:53:49.339Z</updated>
<content type="html"><![CDATA[<p>作者: <a href="https://imfox.io/" target="_blank" rel="noopener">钟华</a></p><p>今天我们分析下istio-sidecar-injector 组件:</p><p><img src="https://ws1.sinaimg.cn/large/006tKfTcgy1g187i5bzkpj315q0u07dq.jpg" referrerpolicy="no-referrer"></p><p><a href="https://ws4.sinaimg.cn/large/006tKfTcgy1g187dn7s1tj315m0u0x6t.jpg" referrerpolicy="no-referrer" target="_blank">查看高清原图</a></p><p>用户空间的Pod要想加入mesh, 首先需要注入sidecar 容器, istio 提供了2种方式实现注入:</p><ul><li>自动注入: 利用 <a href="https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/" target="_blank" rel="noopener">Kubernetes Dynamic Admission Webhooks</a> 对 新建的pod 进行注入: initContainer + sidecar</li><li>手动注入: 使用命令<code>istioctl kube-inject</code></li></ul><p>「注入」本质上就是修改Pod的资源定义, 添加相应的sidecar容器定义, 内容包括2个新容器:</p><ul><li>名为<code>istio-init</code>的initContainer: 通过配置iptables来劫持Pod中的流量</li><li>名为<code>istio-proxy</code>的sidecar容器: 两个进程pilot-agent和envoy, pilot-agent 进行初始化并启动envoy</li></ul><p><img src="https://ws4.sinaimg.cn/large/006tKfTcgy1g187flw0dmj30wq0grn0b.jpg" referrerpolicy="no-referrer"></p><hr><h2 id="1-Dynamic-Admission-Control"><a href="#1-Dynamic-Admission-Control" class="headerlink" title="1. Dynamic Admission Control"></a>1. Dynamic Admission Control</h2><p>kubernetes 的准入控制(Admission Control)有2种:</p><ul><li>Built in Admission Control: 这些Admission模块可以选择性地编译进api server, 因此需要修改和重启kube-apiserver</li><li>Dynamic Admission Control: 可以部署在kube-apiserver之外, 同时无需修改或重启kube-apiserver.</li></ul><p>其中, Dynamic Admission Control 包含2种形式:</p><ul><li>Admission Webhooks: 该controller 提供http server, 被动接受kube-apiserver分发的准入请求.</li><li><p>Initializers: 该controller主动list and watch 关注的资源对象, 对watch到的未初始化对象进行相应的改造.</p><p>其中, Admission Webhooks 又包含2种准入控制:</p></li><li><p>ValidatingAdmissionWebhook</p></li><li>MutatingAdmissionWebhook</li></ul><p>istio 使用了MutatingAdmissionWebhook来实现对用户Pod的注入, 首先需要保证以下条件满足:</p><ul><li>确保 kube-apiserver 启动参数 开启了 MutatingAdmissionWebhook</li><li>给namespace 增加 label: <code>kubectl label namespace default istio-injection=enabled</code></li><li>同时还要保证 kube-apiserver 的 aggregator layer 开启: <code>--enable-aggregator-routing=true</code> 且证书和api server连通性正确设置.</li></ul><p>另外还需要一个配置对象, 来告诉kube-apiserver istio关心的资源对象类型, 以及webhook的服务地址. 
如果使用helm安装istio, 配置对象已经添加好了, 查阅MutatingWebhookConfiguration:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br></pre></td><td class="code"><pre><span class="line">% kubectl get mutatingWebhookConfiguration -oyaml</span><br><span class="line">- apiVersion: admissionregistration.k8s.io/v1beta1</span><br><span class="line"> kind: MutatingWebhookConfiguration</span><br><span class="line"> metadata:</span><br><span class="line"> name: istio-sidecar-injector</span><br><span class="line"> webhooks:</span><br><span class="line"> - clientConfig:</span><br><span class="line"> service:</span><br><span class="line"> name: istio-sidecar-injector</span><br><span class="line"> namespace: istio-system</span><br><span class="line"> path: /inject</span><br><span class="line"> name: sidecar-injector.istio.io</span><br><span class="line"> namespaceSelector:</span><br><span class="line"> matchLabels:</span><br><span class="line"> istio-injection: enabled</span><br><span class="line"> rules:</span><br><span class="line"> - apiGroups:</span><br><span class="line"> - ""</span><br><span class="line"> apiVersions:</span><br><span class="line"> - v1</span><br><span class="line"> operations:</span><br><span class="line"> - CREATE</span><br><span class="line"> resources:</span><br><span class="line"> - pods</span><br></pre></td></tr></table></figure><p>该配置告诉kube-apiserver: 命名空间istio-system 中的服务 <code>istio-sidecar-injector</code>(默认443端口), 通过路由<code>/inject</code>, 处理<code>v1/pods</code>的CREATE, 同时pod需要满足命名空间<code>istio-injection: enabled</code>, 当有符合条件的pod被创建时, kube-apiserver就会对该服务发起调用, 服务返回的内容正是添加了sidecar注入的pod定义.</p><hr><h2 id="2-Sidecar-注入内容分析"><a href="#2-Sidecar-注入内容分析" class="headerlink" title="2. Sidecar 注入内容分析"></a>2. 
Sidecar 注入内容分析</h2><p>查看Pod <code>istio-sidecar-injector</code>的yaml定义:</p><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br></pre></td><td class="code"><pre><span class="line"><span class="string">%kubectl</span> <span class="bullet">-n</span> <span class="string">istio-system</span> <span class="string">get</span> <span class="string">pod</span> <span class="string">istio-sidecar-injector-5f7894f54f-w7f9v</span> <span class="bullet">-oyaml</span></span><br><span class="line"><span class="string">......</span></span><br><span class="line"><span class="attr"> volumeMounts:</span></span><br><span class="line"><span class="attr"> - mountPath:</span> <span class="string">/etc/istio/inject</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">inject-config</span></span><br><span class="line"><span class="attr"> readOnly:</span> <span class="literal">true</span></span><br><span class="line"></span><br><span class="line"><span class="attr"> volumes:</span></span><br><span class="line"><span class="attr"> - configMap:</span></span><br><span class="line"><span class="attr"> items:</span></span><br><span class="line"><span class="attr"> - key:</span> <span class="string">config</span></span><br><span class="line"><span class="attr"> path:</span> <span class="string">config</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">istio-sidecar-injector</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">inject-config</span></span><br></pre></td></tr></table></figure><p>可以看到该Pod利用<a href="https://kubernetes.io/docs/concepts/storage/volumes/#projected" target="_blank" rel="noopener">projected volume</a>将<code>istio-sidecar-injector</code>这个config map 的config挂到了自己容器路径<code>/etc/istio/inject/config</code>, 该config map 内容正是注入用户空间pod所需的模板.</p><p>如果使用helm安装istio, 该 configMap 模板源码位于: <a href="https://github.com/istio/istio/blob/master/install/kubernetes/helm/istio/templates/sidecar-injector-configmap.yaml" target="_blank" rel="noopener">https://github.com/istio/istio/blob/master/install/kubernetes/helm/istio/templates/sidecar-injector-configmap.yaml</a>.</p><p>该config map 是在安装istio时添加的, kubernetes 会自动维护 projected volume的更新, 因此 容器 <code>sidecar-injector</code>只需要从本地文件直接读取所需配置.</p><p>高级用户可以按需修改这个模板内容.</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">kubectl -n istio-system get configmap istio-sidecar-injector -o=jsonpath='{.data.config}'</span><br></pre></td></tr></table></figure><p>查看该configMap, <code>data.config</code>包含以下内容(简化):</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span 
class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br></pre></td><td class="code"><pre><span class="line">policy: enabled // 是否开启自动注入</span><br><span class="line">template: |- // 使用go template 定义的pod patch</span><br><span class="line"> initContainers:</span><br><span class="line"> [[ if ne (annotation .ObjectMeta `sidecar.istio.io/interceptionMode` .ProxyConfig.InterceptionMode) "NONE" ]]</span><br><span class="line"> - name: istio-init</span><br><span class="line"> image: "docker.io/istio/proxy_init:1.1.0"</span><br><span class="line"> ......</span><br><span class="line"> securityContext:</span><br><span class="line"> capabilities:</span><br><span class="line"> add:</span><br><span class="line"> - NET_ADMIN</span><br><span class="line"> ......</span><br><span class="line"> containers:</span><br><span class="line"> - name: istio-proxy</span><br><span class="line"> args:</span><br><span class="line"> - proxy</span><br><span class="line"> - sidecar</span><br><span class="line"> ......</span><br><span class="line"> image: [[ annotation .ObjectMeta `sidecar.istio.io/proxyImage` "docker.io/istio/proxyv2:1.1.0" ]]</span><br><span class="line"> ......</span><br><span class="line"> readinessProbe:</span><br><span class="line"> httpGet:</span><br><span class="line"> path: /healthz/ready</span><br><span class="line"> port: [[ annotation .ObjectMeta `status.sidecar.istio.io/port` 0 ]]</span><br><span class="line"> ......</span><br><span class="line"> securityContext:</span><br><span class="line"> capabilities:</span><br><span class="line"> add:</span><br><span class="line"> - NET_ADMIN</span><br><span class="line"> runAsGroup: 1337</span><br><span class="line"> ......</span><br><span class="line"> volumeMounts:</span><br><span class="line"> ......</span><br><span class="line"> - mountPath: /etc/istio/proxy</span><br><span class="line"> name: istio-envoy</span><br><span class="line"> - mountPath: /etc/certs/</span><br><span class="line"> name: istio-certs</span><br><span class="line"> readOnly: true</span><br><span class="line"> ......</span><br><span class="line"> volumes:</span><br><span class="line"> ......</span><br><span class="line"> - emptyDir:</span><br><span class="line"> medium: Memory</span><br><span class="line"> name: istio-envoy</span><br><span class="line"> - 
name: istio-certs</span><br><span class="line"> secret:</span><br><span class="line"> optional: true</span><br><span class="line"> [[ if eq .Spec.ServiceAccountName "" -]]</span><br><span class="line"> secretName: istio.default</span><br><span class="line"> [[ else -]]</span><br><span class="line"> secretName: [[ printf "istio.%s" .Spec.ServiceAccountName ]]</span><br><span class="line"> ......</span><br></pre></td></tr></table></figure><p>对istio-init生成的部分参数分析:</p><ul><li><code>-u 1337</code> 排除用户ID为1337,即Envoy自身的流量</li><li>解析用户容器<code>.Spec.Containers</code>, 获得容器的端口列表, 传入<code>-b</code>参数(入站端口控制)</li><li>指定要从重定向到 Envoy 中排除(可选)的入站端口列表, 默认写入<code>-d 15020</code>, 此端口是sidecar的status server</li><li>赋予该容器<code>NET_ADMIN</code> 能力, 允许容器istio-init进行网络管理操作</li></ul><p>对istio-proxy 生成的部分参数分析:</p><ul><li>启动参数<code>proxy sidecar xxx</code> 用以定义该节点的代理类型(NodeType)</li><li>默认的status server 端口<code>--statusPort=15020</code></li><li>解析用户容器<code>.Spec.Containers</code>, 获取用户容器的application Ports, 然后设置到sidecar的启动参数<code>--applicationPorts</code>中, 该参数会最终传递给envoy, 用以确定哪些端口流量属于该业务容器.</li><li>设置<code>/healthz/ready</code> 作为该代理的readinessProbe</li><li>同样赋予该容器<code>NET_ADMIN</code>能力</li></ul><p>另外<code>istio-sidecar-injector</code>还给容器<code>istio-proxy</code>挂了2个volumes:</p><ul><li><p>名为<code>istio-envoy</code>的emptydir volume, 挂载到容器目录<code>/etc/istio/proxy</code>, 作为envoy的配置文件目录</p></li><li><p>名为<code>istio-certs</code>的secret volume, 默认secret名为<code>istio.default</code>, 挂载到容器目录<code>/etc/certs/</code>, 存放相关的证书, 包括服务端证书, 和可能的mtls客户端证书</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line">% kubectl exec productpage-v1-6597cb5df9-xlndw -c istio-proxy -- ls /etc/certs/</span><br><span class="line">cert-chain.pem</span><br><span class="line">key.pem</span><br><span class="line">root-cert.pem</span><br></pre></td></tr></table></figure></li></ul><p>后续文章探究sidecar <code>istio-proxy</code>会对其进一步分析.</p><hr><h2 id="3-istio-sidecar-injector-webhook-源码分析"><a href="#3-istio-sidecar-injector-webhook-源码分析" class="headerlink" title="3. istio-sidecar-injector-webhook 源码分析"></a>3. 
istio-sidecar-injector-webhook 源码分析</h2><ul><li>镜像Dockerfile: <code>istio/pilot/docker/Dockerfile.sidecar_injector</code></li><li>启动命令: <code>/sidecar-injector</code></li><li>命令源码: <code>istio/pilot/cmd/sidecar-injector</code></li></ul><p>容器中命令/sidecar-injector启动参数如下:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line">- args:</span><br><span class="line"> - --caCertFile=/etc/istio/certs/root-cert.pem</span><br><span class="line"> - --tlsCertFile=/etc/istio/certs/cert-chain.pem</span><br><span class="line"> - --tlsKeyFile=/etc/istio/certs/key.pem</span><br><span class="line"> - --injectConfig=/etc/istio/inject/config</span><br><span class="line"> - --meshConfig=/etc/istio/config/mesh</span><br><span class="line"> - --healthCheckInterval=2s</span><br><span class="line"> - --healthCheckFile=/health</span><br></pre></td></tr></table></figure><p><code>sidecar-injector</code> 的核心数据模型是 <code>Webhook</code>struct, 注入配置sidecarConfig包括注入模板以及注入开关和规则:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br></pre></td><td class="code"><pre><span class="line">type Webhook struct {</span><br><span class="line">mu sync.RWMutex</span><br><span class="line">sidecarConfig *Config // 注入配置: 模板,开关,规则</span><br><span class="line">sidecarTemplateVersion string</span><br><span class="line">meshConfig *meshconfig.MeshConfig</span><br><span class="line"></span><br><span class="line">healthCheckInterval time.Duration</span><br><span class="line">healthCheckFile string</span><br><span class="line"></span><br><span class="line">server *http.Server</span><br><span class="line">meshFile string</span><br><span class="line">configFile string // 注入内容路径, 从启动参数injectConfig中获取</span><br><span class="line">watcher *fsnotify.Watcher // 基于文件系统的notifications</span><br><span class="line">certFile string</span><br><span class="line">keyFile string</span><br><span class="line">cert *tls.Certificate</span><br><span class="line">}</span><br><span class="line"></span><br><span class="line">type Config struct {</span><br><span class="line">Policy InjectionPolicy `json:"policy"`</span><br><span class="line">Template string `json:"template"`</span><br><span class="line">NeverInjectSelector []metav1.LabelSelector `json:"neverInjectSelector"`</span><br><span class="line">AlwaysInjectSelector []metav1.LabelSelector `json:"alwaysInjectSelector"`</span><br><span class="line">}</span><br></pre></td></tr></table></figure><p><code>sidecar-injector</code> 的root cmd 
会创建一个<code>Webhook</code>, 该struct包含一个http server, 并将路由<code>/inject</code>注册到处理器函数<code>serveInject</code></p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br></pre></td><td class="code"><pre><span class="line">RunE: func(c *cobra.Command, _ []string) error {</span><br><span class="line"> ......</span><br><span class="line"> wh, err := inject.NewWebhook(parameters)</span><br><span class="line"> ......</span><br><span class="line"> go wh.Run(stop)</span><br><span class="line"> ......</span><br><span class="line">}</span><br><span class="line"></span><br><span class="line">func NewWebhook(p WebhookParameters) (*Webhook, error) {</span><br><span class="line"> ......</span><br><span class="line">watcher, err := fsnotify.NewWatcher()</span><br><span class="line">// watch the parent directory of the target files so we can catch</span><br><span class="line">// symlink updates of k8s ConfigMaps volumes.</span><br><span class="line">for _, file := range []string{p.ConfigFile, p.MeshFile, p.CertFile, p.KeyFile} {</span><br><span class="line">watchDir, _ := filepath.Split(file)</span><br><span class="line">if err := watcher.Watch(watchDir); err != nil {</span><br><span class="line">return nil, fmt.Errorf("could not watch %v: %v", file, err)</span><br><span class="line">}</span><br><span class="line">}</span><br><span class="line">......</span><br><span class="line">h := http.NewServeMux()</span><br><span class="line">h.HandleFunc("/inject", wh.serveInject)</span><br><span class="line">wh.server.Handler = h</span><br><span class="line">......</span><br><span class="line">}</span><br></pre></td></tr></table></figure><p><code>Webhook#Run</code>方法会启动该http server, 并负责响应配置文件的更新:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br></pre></td><td class="code"><pre><span class="line">func (wh *Webhook) Run(stop <-chan struct{}) {</span><br><span class="line">go func() {</span><br><span class="line">wh.server.ListenAndServeTLS("", "")</span><br><span class="line">......</span><br><span 
class="line">}()</span><br><span class="line">......</span><br><span class="line">var timerC <-chan time.Time</span><br><span class="line">for {</span><br><span class="line">select {</span><br><span class="line">case <-timerC:</span><br><span class="line">timerC = nil</span><br><span class="line">sidecarConfig, meshConfig, err := loadConfig(wh.configFile, wh.meshFile)</span><br><span class="line">......</span><br><span class="line">case event := <-wh.watcher.Event:</span><br><span class="line">// use a timer to debounce configuration updates</span><br><span class="line">if (event.IsModify() || event.IsCreate()) && timerC == nil {</span><br><span class="line">timerC = time.After(watchDebounceDelay)</span><br><span class="line">}</span><br><span class="line">case ......</span><br><span class="line">}</span><br><span class="line">}</span><br><span class="line">}</span><br></pre></td></tr></table></figure><p><code>Webhook#Run</code>首先会启动处理注入请求的http server, 下面的for循环主要是处理2个配置文件的更新操作, select 里使用了一个timer(并不是ticker), 咋一看像是简单的定时更新配置文件, 其实不然. 配置文件更新事件由<code>wh.watcher</code>进行接收, 然后才会启动timer, 这里用到了第三方库<a href="https://github.com/howeyc/fsnotify" target="_blank" rel="noopener">https://github.com/howeyc/fsnotify</a>, 这是一个基于文件系统的notification. 这里使用timer限制在一个周期(watchDebounceDelay)里面最多重新加载一次配置文件, 避免在配置文件频繁变化的情况下多次触发不必要的loadConfig</p><blockquote><p>use a timer to debounce configuration updates</p></blockquote><p><code>Webhook.serveInject</code> 会调用<code>Webhook#inject</code>, 最终的模板处理函数是<code>injectionData</code>.</p>]]></content>
<summary type="html">
<p>作者: <a href="https://imfox.io/" target="_blank" rel="noopener">钟华</a></p>
<p>今天我们分析下istio-sidecar-injector 组件:</p>
<p><img src="https://w
</summary>
</entry>
<entry>
<title>istio 庖丁解牛(一) 组件概览</title>
<link href="https://TencentCloudContainerTeam.github.io/2019/03/11/istio-analysis-1/"/>
<id>https://TencentCloudContainerTeam.github.io/2019/03/11/istio-analysis-1/</id>
<published>2019-03-11T07:30:00.000Z</published>
<updated>2020-06-16T01:53:49.335Z</updated>
<content type="html"><![CDATA[<p>作者: <a href="https://imfox.me/" target="_blank" rel="noopener">钟华</a></p><p>Istio 作为 Service Mesh 领域的集大成者, 提供了流控, 安全, 遥测等模型, 其功能复杂, 模块众多, 有较高的学习和使用门槛, 本文会对istio 1.1 的各组件进行分析, 希望能帮助读者了解istio各组件的职责、以及相互的协作关系.</p><meta name="referrer" content="no-referrer"><h2 id="1-istio-组件构成"><a href="#1-istio-组件构成" class="headerlink" title="1. istio 组件构成"></a>1. istio 组件构成</h2><p>以下是istio 1.1 官方架构图:</p><p><img src="https://preliminary.istio.io/docs/concepts/what-is-istio/arch.svg" width="80%"></p><p>虽然Istio 支持多个平台, 但将其与 Kubernetes 结合使用,其优势会更大, Istio 对Kubernetes 平台支持也是最完善的, 本文将基于Istio + Kubernetes 进行展开.</p><p>如果安装了grafana, prometheus, kiali, jaeger等组件的情况下, 一个完整的控制面组件包括以下pod:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br></pre></td><td class="code"><pre><span class="line">% kubectl -n istio-system get pod</span><br><span class="line">NAME READY STATUS</span><br><span class="line">grafana-5f54556df5-s4xr4 1/1 Running</span><br><span class="line">istio-citadel-775c6cfd6b-8h5gt 1/1 Running</span><br><span class="line">istio-galley-675d75c954-kjcsg 1/1 Running</span><br><span class="line">istio-ingressgateway-6f7b477cdd-d8zpv 1/1 Running</span><br><span class="line">istio-pilot-7dfdb48fd8-92xgt 2/2 Running</span><br><span class="line">istio-policy-544967d75b-p6qkk 2/2 Running</span><br><span class="line">istio-sidecar-injector-5f7894f54f-w7f9v 1/1 Running</span><br><span class="line">istio-telemetry-777876dc5d-msclx 2/2 Running</span><br><span class="line">istio-tracing-5fbc94c494-558fp 1/1 Running</span><br><span class="line">kiali-7c6f4c9874-vzb4t 1/1 Running</span><br><span class="line">prometheus-66b7689b97-w9glt 1/1 Running</span><br></pre></td></tr></table></figure><p>将istio系统组件细化到进程级别, 大概是这个样子:</p><p><img src="https://ws3.sinaimg.cn/large/006tKfTcgy1g187gshs79j315m0u0qct.jpg" referrerpolicy="no-referrer"></p><p><a href="https://ws4.sinaimg.cn/large/006tKfTcgy1g187dn7s1tj315m0u0x6t.jpg" referrerpolicy="no-referrer" target="_blank">查看高清原图</a></p><p>Service Mesh 的Sidecar 模式要求对数据面的用户Pod进行代理的注入, 注入的代理容器会去处理服务治理领域的各种「脏活累活」, 使得用户容器可以专心处理业务逻辑.</p><p>从上图可以看出, Istio 控制面本身就是一个复杂的微服务系统, 该系统包含多个组件Pod, 每个组件 各司其职, 既有单容器Pod, 也有多容器Pod, 既有单进程容器, 也有多进程容器, 每个组件会调用不同的命令, 各组件之间会通过RPC进行协作, 共同完成对数据面用户服务的管控.</p><hr><h2 id="2-Istio-源码-镜像和命令"><a href="#2-Istio-源码-镜像和命令" class="headerlink" title="2. Istio 源码, 镜像和命令"></a>2. 
Istio 源码, 镜像和命令</h2><p>Isito 项目代码主要由以下2个git 仓库组成:</p><table><thead><tr><th>仓库地址</th><th>语言</th><th>模块</th></tr></thead><tbody><tr><td><a href="https://github.com/istio/istio" target="_blank" rel="noopener">https://github.com/istio/istio</a></td><td>Go</td><td>包含istio控制面的大部分组件: pilot, mixer, citadel, galley, sidecar-injector等,</td></tr><tr><td><a href="https://github.com/istio/proxy" target="_blank" rel="noopener">https://github.com/istio/proxy</a></td><td>C++</td><td>包含 istio 使用的边车代理, 这个边车代理包含envoy和mixer client两块功能</td></tr></tbody></table><h3 id="2-1-istio-istio"><a href="#2-1-istio-istio" class="headerlink" title="2.1 istio/istio"></a>2.1 istio/istio</h3><p><a href="https://github.com/istio/istio" target="_blank" rel="noopener">https://github.com/istio/istio</a> 包含的主要的镜像和命令:</p><table><thead><tr><th>容器名</th><th>镜像名</th><th>启动命令</th><th>源码入口</th></tr></thead><tbody><tr><td>Istio_init</td><td>istio/proxy_init</td><td>istio-iptables.sh</td><td>istio/tools/deb/istio-iptables.sh</td></tr><tr><td>istio-proxy</td><td>istio/proxyv2</td><td>pilot-agent</td><td>istio/pilot/cmd/pilot-agent</td></tr><tr><td>sidecar-injector-webhook</td><td>istio/sidecar_injector</td><td>sidecar-injector</td><td>istio/pilot/cmd/sidecar-injector</td></tr><tr><td>discovery</td><td>istio/pilot</td><td>pilot-discovery</td><td>istio/pilot/cmd/pilot-discovery</td></tr><tr><td>galley</td><td>istio/galley</td><td>galley</td><td>istio/galley/cmd/galley</td></tr><tr><td>mixer</td><td>istio/mixer</td><td>mixs</td><td>istio/mixer/cmd/mixs</td></tr><tr><td>citadel</td><td>istio/citadel</td><td>istio_ca</td><td>istio/security/cmd/istio_ca</td></tr></tbody></table><p>另外还有2个命令不在上图中使用:</p><table><thead><tr><th>命令</th><th>源码入口</th><th>作用</th></tr></thead><tbody><tr><td>mixc</td><td>istio/mixer/cmd/mixc</td><td>用于和Mixer server 交互的客户端</td></tr><tr><td>node_agent</td><td>istio/security/cmd/node_agent</td><td>用于node上安装安全代理, 这在Mesh Expansion特性中会用到, 即k8s和vm打通.</td></tr></tbody></table><h3 id="2-2-istio-proxy"><a href="#2-2-istio-proxy" class="headerlink" title="2.2 istio/proxy"></a>2.2 istio/proxy</h3><p><a href="https://github.com/istio/proxy" target="_blank" rel="noopener">https://github.com/istio/proxy</a> 该项目本身不会产出镜像, 它可以编译出一个<code>name = "Envoy"</code>的二进制程序, 该二进制程序会被ADD到istio的边车容器镜像<code>istio/proxyv2</code>中.</p><p>istio proxy 项目使用的编译方式是Google出品的bazel, bazel可以直接在编译中引入第三方库,加载第三方源码.</p><p>这个项目包含了对Envoy源码的引用,还在此基础上进行了扩展,这些扩展是通过Envoy filter(过滤器)的形式来提供,这样做的目的是让边车代理将策略执行决策委托给Mixer,因此可以理解istio proxy 这个项目有2大功能模块:</p><ol><li>Envoy: 使用到Envoy的全部功能</li><li>mixer client: 测量和遥测相关的客户端实现, 基于Envoy做扩展,通过RPC和Mixer server 进行交互, 实现策略管控和遥测</li></ol><p>后续我将对以上各个模块、命令以及它们之间的协作进行探究.</p><hr><h2 id="3-Istio-Pod-概述"><a href="#3-Istio-Pod-概述" class="headerlink" title="3. Istio Pod 概述"></a>3. 
Istio Pod 概述</h2><h3 id="3-1-数据面用户Pod"><a href="#3-1-数据面用户Pod" class="headerlink" title="3.1 数据面用户Pod"></a>3.1 数据面用户Pod</h3><p>数据面用户Pod注入的内容包括:</p><ol><li><p>initContainer <code>istio-init</code>: 通过配置iptables来劫持Pod中的流量, 转发给envoy</p></li><li><p>sidecar container <code>istio-proxy</code>: 包含2个进程, 父进程pliot-agent 初始化并管控envoy, 子进程envoy除了包含原生envoy的功能外, 还加入了mixer client的逻辑.</p><p>主要端口:</p><ul><li><code>--statusPort</code> status server 端口, 默认为0, 表示不启动, istio启动时通常传递为15020, 由pliot-agent监听</li><li><code>--proxyAdminPort</code> 代理管理端口, 默认 15000, 由子进程envoy监听.</li></ul></li></ol><h3 id="3-2-istio-sidecar-injector"><a href="#3-2-istio-sidecar-injector" class="headerlink" title="3.2 istio-sidecar-injector"></a>3.2 istio-sidecar-injector</h3><p>包含一个单容器, <code>sidecar-injector-webhook</code>: 启动一个http server, 接受kube api server 的Admission Webhook 请求, 对用户pod进行sidecar注入.</p><p>进程为<code>sidecar-injector</code>, 主要监听端口:</p><ul><li><code>--port</code> Webhook服务端口, 默认443, 通过k8s service<code>istio-sidecar-injector</code> 对外提供服务.</li></ul><h3 id="3-3-istio-galley"><a href="#3-3-istio-galley" class="headerlink" title="3.3 istio-galley"></a>3.3 istio-galley</h3><p>包含一个单容器 <code>galley</code>: 提供 istio 中的配置管理服务, 验证Istio的CRD 资源的合法性.</p><p>进程为<code>galley server ......</code>, 主要监听端口:</p><ul><li><code>--server-address</code> galley gRPC 地址, 默认是tcp://0.0.0.0:9901</li><li><p><code>--validation-port</code> https端口, 提供验证crd合法性服务的端口, 默认443.</p></li><li><p><code>--monitoringPort</code> http 端口, self-monitoring 端口, 默认 15014</p></li></ul><p>以上端口通过k8s service<code>istio-galley</code>对外提供服务</p><h3 id="3-4-istio-pilot"><a href="#3-4-istio-pilot" class="headerlink" title="3.4 istio-pilot"></a>3.4 istio-pilot</h3><p>pilot组件核心Pod, 对接平台适配层, 抽象服务注册信息、流量控制模型等, 封装统一的 API,供 Envoy 调用获取.</p><p>包含以下容器:</p><ol><li><p>sidecar container <code>istio-proxy</code></p></li><li><p>container <code>discovery</code>: 进程为<code>pilot-discovery discovery ......</code></p><p>主要监听端口:</p><ul><li>15010: 通过grpc 提供的 xds 获取接口</li><li><p>15011: 通过https 提供的 xds 获取接口</p></li><li><p>8080: 通过http 提供的 xds 获取接口, 兼容v1版本, 另外 http readiness 探针 <code>/ready</code>也在该端口</p></li><li><code>--monitoringPort</code> http self-monitoring 端口, 默认 15014</li></ul><p>以上端口通过k8s service<code>istio-pilot</code>对外提供服务</p></li></ol><h3 id="3-5-istio-telemetry-和istio-policy"><a href="#3-5-istio-telemetry-和istio-policy" class="headerlink" title="3.5 istio-telemetry 和istio-policy"></a>3.5 istio-telemetry 和istio-policy</h3><p>mixer 组件包含2个pod, istio-telemetry 和 istio-policy, istio-telemetry负责遥测功能, istio-policy 负责策略控制, 它们分别包含2个容器:</p><ol><li><p>sidecar container<code>istio-proxy</code></p></li><li><p><code>mixer</code>: 进程为 <code>mixs server ……</code></p><p>主要监听端口:</p><ul><li>9091: grpc-mixer</li><li><p>15004: grpc-mixer-mtls</p></li><li><p><code>--monitoring-port</code>: http self-monitoring 端口, 默认 15014, liveness 探针<code>/version</code></p></li></ul></li></ol><h3 id="3-7-istio-citadel"><a href="#3-7-istio-citadel" class="headerlink" title="3.7 istio-citadel"></a>3.7 istio-citadel</h3><p>负责安全和证书管理的Pod, 包含一个单容器 <code>citadel</code></p><p>启动命令<code>/usr/local/bin/istio_ca --self-signed-ca ......</code> 主要监听端口:</p><ul><li><p><code>--grpc-port</code> citadel grpc 端口, 默认8060</p></li><li><p><code>--monitoring-port</code>: http self-monitoring 端口, 默认 15014, liveness 探针<code>/version</code></p></li></ul><p>以上端口通过k8s service<code>istio-citadel</code>对外提供服务</p><hr><p>后续将对各组件逐一进行分析.</p>]]></content>
<summary type="html">
<p>作者: <a href="https://imfox.me/" target="_blank" rel="noopener">钟华</a></p>
<p>Istio 作为 Service Mesh 领域的集大成者, 提供了流控, 安全, 遥测等模型, 其功能复杂, 模块众多
</summary>
</entry>
<entry>
<title>Istio 服务网格领域的新王者</title>
<link href="https://TencentCloudContainerTeam.github.io/2019/01/31/servicemesh-istio/"/>
<id>https://TencentCloudContainerTeam.github.io/2019/01/31/servicemesh-istio/</id>
<published>2019-01-31T07:30:00.000Z</published>
<updated>2020-06-16T01:53:49.351Z</updated>
<content type="html"><![CDATA[<p>作者: <a href="https://github.com/zhongfox" target="_blank" rel="noopener">钟华</a></p><p><img src="https://github.com/TencentCloudContainerTeam/TencentCloudContainerTeam.github.io/raw/develop/source/_posts/res/istio/title.png" alt="istio"></p><p>今天分享的内容主要包括以下4个话题:</p><ul><li>1 Service Mesh: 下一代微服务</li><li>2 Istio: 第二代 Service Mesh</li><li>3 Istio 数据面</li><li>4 Istio 控制面</li></ul><p>首先我会和大家一起过一下 Service Mesh的发展历程, 并看看Istio 为 Service Mesh 带来了什么, 这部分相对比较轻松. 接下来我将和大家分析一下Istio的主要架构, 重点是数据面和控制面的实现, 包括sidecar的注入, 流量拦截, xDS介绍, Istio流量模型, 分布式跟踪, Mixer 的适配器模型等等, 中间也会穿插着 istio的现场使用demo.</p><hr><h1 id="1-Service-Mesh-下一代微服务"><a href="#1-Service-Mesh-下一代微服务" class="headerlink" title="1. Service Mesh: 下一代微服务"></a>1. Service Mesh: 下一代微服务</h1><ul><li>应用通信模式演进</li><li>Service Mesh(服务网格)的出现</li><li>第二代 Service Mesh</li><li>Service Mesh 的定义</li><li>Service Mesh 产品简史</li><li>国内Service Mesh 发展情况</li></ul><hr><h2 id="1-1-应用通信模式演进-网络流控进入操作系统"><a href="#1-1-应用通信模式演进-网络流控进入操作系统" class="headerlink" title="1.1 应用通信模式演进: 网络流控进入操作系统"></a>1.1 应用通信模式演进: 网络流控进入操作系统</h2><p><img src="https://zhongfox.github.io/assets/images/istio/1.1.png"></p><p>在计算机网络发展的初期, 开发人员需要在自己的代码中处理服务器之间的网络连接问题, 包括流量控制, 缓存队列, 数据加密等. 在这段时间内底层网络逻辑和业务逻辑是混杂在一起.</p><p>随着技术的发展,TCP/IP 等网络标准的出现解决了流量控制等问题。尽管网络逻辑代码依然存在,但已经从应用程序里抽离出来,成为操作系统网络层的一部分, 形成了经典的网络分层模式.</p><hr><h2 id="1-2-应用通信模式演进-微服务架构的出现"><a href="#1-2-应用通信模式演进-微服务架构的出现" class="headerlink" title="1.2 应用通信模式演进: 微服务架构的出现"></a>1.2 应用通信模式演进: 微服务架构的出现</h2><p><img src="https://zhongfox.github.io/assets/images/istio/1.2.png"></p><p>微服务架构是更为复杂的分布式系统,它给运维带来了更多挑战, 这些挑战主要包括资源的有效管理和服务之间的治理, 如:</p><ul><li>服务注册, 服务发现</li><li>服务伸缩</li><li>健康检查</li><li>快速部署</li><li>服务容错: 断路器, 限流, 隔离舱, 熔断保护, 服务降级等等</li><li>认证和授权</li><li>灰度发布方案</li><li>服务调用可观测性, 指标收集</li><li>配置管理</li></ul><p>在微服务架构的实现中,为提升效率和降低门槛,应用开发者会基于微服务框架来实现微服务。微服务框架一定程度上为使用者屏蔽了底层网络的复杂性及分布式场景下的不确定性。通过API/SDK的方式提供服务注册发现、服务RPC通信、服务配置管理、服务负载均衡、路由限流、容错、服务监控及治理、服务发布及升级等通用能力, 比较典型的产品有:</p><ul><li>分布式RPC通信框架: COBRA, WebServices, Thrift, GRPC 等</li><li>服务治理特定领域的类库和解决方案: Hystrix, Zookeeper, Zipkin, Sentinel 等</li><li>对多种方案进行整合的微服务框架: SpringCloud、Finagle、Dubbox 等</li></ul><p>实施微服务的成本往往会超出企业的预期(内容多, 门槛高), 花在服务治理上的时间成本甚至可能高过进行产品研发的时间. 
另外上述的方案会限制可用的工具、运行时和编程语言。微服务软件库一般专注于某个平台, 这使得异构系统难以兼容, 存在重复的工作, 系统缺乏可移植性.</p><p>Docker 和Kubernetes 技术的流行, 为Pass资源的分配管理和服务的部署提供了新的解决方案, 但是微服务领域的其他服务治理问题仍然存在.</p><hr><h2 id="1-3-Sidecar-模式的兴起"><a href="#1-3-Sidecar-模式的兴起" class="headerlink" title="1.3 Sidecar 模式的兴起"></a>1.3 Sidecar 模式的兴起</h2><p><img src="https://zhongfox.github.io/assets/images/istio/1.3.png"></p><p>Sidecar(有时会叫做agent) 在原有的客户端和服务端之间加多了一个代理, 为应用程序提供的额外的功能, 如服务发现, 路由代理, 认证授权, 链路跟踪 等等.</p><p>业界使用Sidecar 的一些先例:</p><ul><li>2013 年,Airbnb 开发了Synapse 和 Nerve,是sidecar的一种开源实现</li><li>2014 年, Netflix 发布了Prana,它也是一个sidecar,可以让非 JVM 应用接入他们的 NetflixOSS 生态系统</li></ul><hr><h2 id="1-4-Service-Mesh-服务网格-的出现"><a href="#1-4-Service-Mesh-服务网格-的出现" class="headerlink" title="1.4 Service Mesh(服务网格)的出现"></a>1.4 Service Mesh(服务网格)的出现</h2><p><img src="https://zhongfox.github.io/assets/images/istio/1.4.png"></p><p>直观地看, Sidecar 到 Service Mesh 是一个规模的升级, 不过Service Mesh更强调的是:</p><ul><li>不再将Sidecar(代理)视为单独的组件,而是强调由这些代理连接而形成的网络</li><li>基础设施, 对应用程序透明</li></ul><hr><h2 id="1-5-Service-Mesh-定义"><a href="#1-5-Service-Mesh-定义" class="headerlink" title="1.5 Service Mesh 定义"></a>1.5 Service Mesh 定义</h2><p>以下是Linkerd的CEO <a href="https://twitter.com/wm" target="_blank" rel="noopener">Willian Morgan</a>给出的Service Mesh的定义:</p><blockquote><p>A Service Mesh is a dedicated infrastructure layer for handling service-to-service communication. It’s responsible for the reliable delivery of requests through the complex topology of services that comprise a modern, cloud native application. In practice, the Service Mesh is typically implemented as an array of lightweight network proxies that are deployed alongside application code, without the application needing to be aware.</p></blockquote><p>服务网格(Service Mesh)是致力于解决服务间通讯的<strong>基础设施层</strong>。它负责在现代云原生应用程序的复杂服务拓扑来可靠地传递请求。实际上,Service Mesh 通常是通过一组<strong>轻量级网络代理</strong>(Sidecar proxy),与应用程序代码部署在一起来实现,且<strong>对应用程序透明</strong>。</p><hr><h2 id="1-6-第二代-Service-Mesh"><a href="#1-6-第二代-Service-Mesh" class="headerlink" title="1.6 第二代 Service Mesh"></a>1.6 第二代 Service Mesh</h2><p><img src="https://zhongfox.github.io/assets/images/istio/1.6.png"></p><p>控制面板对每一个代理实例了如指掌,通过控制面板可以实现代理的访问控制和度量指标收集, 提升了服务网格的可观测性和管控能力, Istio 正是这类系统最为突出的代表.</p><hr><h2 id="1-7-Service-Mesh-产品简史"><a href="#1-7-Service-Mesh-产品简史" class="headerlink" title="1.7 Service Mesh 产品简史"></a>1.7 Service Mesh 产品简史</h2><p><img src="https://zhongfox.github.io/assets/images/istio/1.7.png"></p><ul><li><p>2016 年 1 月 15 日,前 Twitter 的基础设施工程师 <a href="https://twitter.com/wm" target="_blank" rel="noopener">William Morgan</a> 和 Oliver Gould,在 GitHub 上发布了 Linkerd 0.0.7 版本,采用Scala编写, 他们同时组建了一个创业小公司 Buoyant,这是业界公认的第一个Service Mesh </p></li><li><p>2016 年,<a href="https://twitter.com/mattklein123" target="_blank" rel="noopener">Matt Klein</a>在 Lyft 默默地进行 Envoy 的开发。Envoy 诞生的时间其实要比 Linkerd 更早一些,只是在 Lyft 内部不为人所知</p></li><li><p>2016 年 9 月 29 日在 SF Microservices 上,“Service Mesh”这个词汇第一次在公开场合被使用。这标志着“Service Mesh”这个词,从 Buoyant 公司走向社区.</p></li><li><p>2016 年 9 月 13 日,Matt Klein 宣布 Envoy 在 GitHub 开源,直接发布 1.0.0 版本。</p></li><li><p>2016 年下半年,Linkerd 陆续发布了 0.8 和 0.9 版本,开始支持 HTTP/2 和 gRPC,1.0 发布在即;同时,借助 Service Mesh 在社区的认可度,Linkerd 在年底开始申请加入 CNCF</p></li><li><p>2017 年 1 月 23 日,Linkerd 加入 CNCF。</p></li><li><p>2017 年 3 月 7 日,Linkerd 宣布完成千亿次产品请求</p></li><li><p>2017 年 4 月 25 日,Linkerd 1.0 版本发布</p></li><li><p>2017 年 7 月 11 日,Linkerd 发布版本 1.1.1,宣布和 Istio 项目集成</p></li><li><p>2017 年 9 月, nginx突然宣布要搞出一个Servicemesh来, Nginmesh: <a href="https://github.com/nginxinc/nginmesh" target="_blank" 
rel="noopener">https://github.com/nginxinc/nginmesh</a>, 可以作为istio的数据面, 不过这个项目目前处于不活跃开发(This project is no longer under active development)</p></li><li><p>2017 年 12 月 5 日,Conduit 0.1.0 版本发布</p></li></ul><p>Envoy 和 Linkerd 都是在数据面上的实现, 属于同一个层面的竞争, 前者是用 C++ 语言实现的,在性能和资源消耗上要比采用 Scala 语言实现的 Linkerd 小,这一点对于延迟敏感型和资源敏的服务尤为重要.</p><p>Envoy 对 作为 Istio 的标准数据面实现, 其最主要的贡献是提供了一套<a href="https://github.com/envoyproxy/data-plane-api/blob/master/API_OVERVIEW.md" target="_blank" rel="noopener">标准数据面API</a>, 将服务信息和流量规则下发到数据面的sidecar中, 另外Envoy还支持热重启. Istio早期采用了Envoy v1 API,目前的版本中则使用V2 API,V1已被废弃.</p><p>通过采用该标准API,Istio将控制面和数据面进行了解耦,为多种数据面sidecar实现提供了可能性。事实上基于该标准API已经实现了多种Sidecar代理和Istio的集成,除Istio目前集成的Envoy外,还可以和Linkerd, Nginmesh等第三方通信代理进行集成,也可以基于该API自己编写Sidecar实现.</p><p>将控制面和数据面解耦是Istio后来居上,风头超过Service mesh鼻祖Linkerd的一招妙棋。Istio站在了控制面的高度上,而Linkerd则成为了可选的一种sidecar实现.</p><p>Conduit 的整体架构和 Istio 一致,借鉴了 Istio 数据平面 + 控制平面的设计,而且选择了 Rust 编程语言来实现数据平面,以达成 Conduit 宣称的更轻、更快和超低资源占用.</p><center>(参考: 敖小剑 <a href="https://skyao.io/publication/201801-service-mesh-2017-summary/" target="_blank" rel="noopener">Service Mesh年度总结:群雄逐鹿烽烟起</a>)</center><hr><h2 id="1-8-似曾相识的竞争格局"><a href="#1-8-似曾相识的竞争格局" class="headerlink" title="1.8 似曾相识的竞争格局"></a>1.8 似曾相识的竞争格局</h2><table><thead><tr><th></th><th>Kubernetes</th><th>Istio</th></tr></thead><tbody><tr><td>领域</td><td>容器编排</td><td>服务网格</td></tr><tr><td>主要竞品</td><td>Swarm, Mesos</td><td>Linkerd, Conduit</td></tr><tr><td>主要盟友</td><td>RedHat, CoreOS</td><td>IBM, Lyft</td></tr><tr><td>主要竞争对手</td><td>Docker 公司</td><td>Buoyant 公司</td></tr><tr><td>标准化</td><td>OCI: runtime spec, image spec</td><td>XDS</td></tr><tr><td>插件化</td><td>CNI, CRI</td><td>Istio CNI, Mixer Adapter</td></tr><tr><td>结果</td><td>Kubernetes 成为容器编排事实标准</td><td>?</td></tr></tbody></table><p>google 主导的Kubernetes 在容器编排领域取得了完胜, 目前在服务网格领域的打法如出一辙, 社区对Istio前景也比较看好.</p><p>Istio CNI 计划在1.1 作为实验特性, 用户可以通过扩展方式定制sidecar的网络.</p><hr><h2 id="1-9-国内Service-Mesh-发展情况"><a href="#1-9-国内Service-Mesh-发展情况" class="headerlink" title="1.9 国内Service Mesh 发展情况"></a>1.9 国内Service Mesh 发展情况</h2><ul><li><p>蚂蚁金服开源SOFAMesh: <a href="https://github.com/alipay/sofa-mesh" target="_blank" rel="noopener">https://github.com/alipay/sofa-mesh</a></p><ul><li>从istio fork</li><li>使用Golang语言开发全新的Sidecar,替代Envoy</li><li>为了避免Mixer带来的性能瓶颈,合并Mixer部分功能进入Sidecar</li><li>Pilot和Citadel模块进行了大幅的扩展和增强</li><li>扩展RPC协议: SOFARPC/HSF/Dubbo</li></ul></li><li><p>华为:</p><ul><li>go-chassis: <a href="https://github.com/go-chassis/go-chassis" target="_blank" rel="noopener">https://github.com/go-chassis/go-chassis</a> golang 微服务框架, 支持istio平台</li><li>mesher: <a href="https://github.com/go-mesh/mesher" target="_blank" rel="noopener">https://github.com/go-mesh/mesher</a> mesh 数据面解决方案</li><li>国内首家提供Service Mesh公共服务的云厂商</li><li>目前(2019年1月)公有云Istio 产品线上已经支持申请公测, 产品形态比较完善</li></ul></li><li><p>腾讯云 TSF:</p><ul><li>基于 Istio、envoy 进行改造</li><li>支持 Kubernetes、虚拟机以及裸金属的服务</li><li>对 Istio 的能力进行了扩展和增强, 对 Consul 的完整适配</li><li>对于其他二进制协议进行扩展支持</li></ul></li><li><p>唯品会</p><ul><li>OSP (Open Service Platform)</li></ul></li><li><p>新浪:</p><ul><li>Motan: 是一套基于java开发的RPC框架, Weibo Mesh 是基于Motan</li></ul></li></ul><hr><h1 id="2-Istio-第二代-Service-Mesh"><a href="#2-Istio-第二代-Service-Mesh" class="headerlink" title="2. Istio: 第二代 Service Mesh"></a>2. 
Istio: 第二代 Service Mesh</h1><p>Istio来自希腊语,英文意思是「sail」, 意为「启航」</p><ul><li>2.1 Istio 架构</li><li>2.2 核心功能</li><li>2.3 Istio 演示: BookInfo</li></ul><hr><h2 id="2-1-Istio-架构"><a href="#2-1-Istio-架构" class="headerlink" title="2.1 Istio 架构"></a>2.1 Istio 架构</h2><p><img width="90%" src="https://preliminary.istio.io/docs/concepts/what-is-istio/arch.svg"></p><center>Istio Architecture(图片来自<a href="https://istio.io/docs/concepts/what-is-istio/" target="_blank" rel="noopener">Isio官网文档</a>)</center><ul><li><p>数据面</p><ul><li>Sidecar</li></ul></li><li><p>控制面</p><ul><li>Pilot:服务发现、流量管理</li><li>Mixer:访问控制、遥测</li><li>Citadel:终端用户认证、流量加密</li></ul></li></ul><hr><h3 id="2-2-核心功能"><a href="#2-2-核心功能" class="headerlink" title="2.2 核心功能"></a>2.2 核心功能</h3><ul><li>流量管理</li><li>安全</li><li>可观察性</li><li>多平台支持</li><li>集成和定制</li></ul><p>下面是我对Istio架构总结的思维导图:</p><p><img src="https://zhongfox.github.io/assets/images/istio/naotu2.png"></p><hr><h2 id="2-3-Istio-演示-BookInfo"><a href="#2-3-Istio-演示-BookInfo" class="headerlink" title="2.3 Istio 演示: BookInfo"></a>2.3 Istio 演示: BookInfo</h2><p>以下是Istio官网经典的 BookInfo Demo, 这是一个多语言组成的异构微服务系统:</p><p><img width="80%" src="https://istio.io/docs/examples/bookinfo/withistio.svg"></p><center>Bookinfo Application(图片来自<a href="https://istio.io/docs/examples/bookinfo/" target="_blank" rel="noopener">Isio官网文档</a>)</center><p>下面我将现场给大家进行演示, 从demo安装开始, 并体验一下istio的流控功能:</p><h4 id="使用helm管理istio"><a href="#使用helm管理istio" class="headerlink" title="使用helm管理istio"></a>使用helm管理istio</h4><p>下载istio release: <a href="https://istio.io/docs/setup/kubernetes/download-release/" target="_blank" rel="noopener">https://istio.io/docs/setup/kubernetes/download-release/</a></p><h5 id="安装istio"><a href="#安装istio" class="headerlink" title="安装istio:"></a>安装istio:</h5><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">kubectl apply -f install/kubernetes/helm/istio/templates/crds.yaml</span><br><span class="line">helm install install/kubernetes/helm/istio --name istio --namespace istio-system</span><br></pre></td></tr></table></figure><p>注意事项, 若要开启sidecar自动注入功能, 需要:</p><ul><li>确保 kube-apiserver 启动参数 开启了ValidatingAdmissionWebhook 和 MutatingAdmissionWebhook</li><li>给namespace 增加 label: <code>kubectl label namespace default istio-injection=enabled</code></li><li>同时还要保证 kube-apiserver 的 aggregator layer 开启: <code>--enable-aggregator-routing=true</code> 且证书和api server连通性正确设置.</li></ul><h5 id="如需卸载istio"><a href="#如需卸载istio" class="headerlink" title="如需卸载istio:"></a>如需卸载istio:</h5><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">helm delete --purge istio</span><br><span class="line">kubectl delete -f install/kubernetes/helm/istio/templates/crds.yaml -n istio-system</span><br></pre></td></tr></table></figure><p>更多安装选择请参考: <a href="https://istio.io/docs/setup/kubernetes/helm-install/" target="_blank" rel="noopener">https://istio.io/docs/setup/kubernetes/helm-install/</a></p><h4 id="安装Bookinfo-Demo"><a href="#安装Bookinfo-Demo" class="headerlink" title="安装Bookinfo Demo:"></a>安装Bookinfo Demo:</h4><p>Bookinfo 是一个多语言异构的微服务demo, 其中 productpage 微服务会调用 details 和 reviews 两个微服务, reviews 会调用ratings 微服务, reviews 微服务有 3 个版本. 
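下文的流量控制演示正是围绕这 3 个版本展开, 例如「基于权重的路由」一节会应用 <code>virtual-service-reviews-50-v3.yaml</code>, 它的核心就是一个大致形如下面示意的 VirtualService(这里是按 networking.istio.io/v1alpha3 手写的最小示意, subset 名称对应 <code>destination-rule-reviews.yaml</code> 中定义的子版本, 具体字段请以 istio 官方 samples 文件为准):</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br></pre></td><td class="code"><pre><span class="line">apiVersion: networking.istio.io/v1alpha3</span><br><span class="line">kind: VirtualService</span><br><span class="line">metadata:</span><br><span class="line">  name: reviews</span><br><span class="line">spec:</span><br><span class="line">  hosts:</span><br><span class="line">  - reviews              # 对应 Kubernetes 中的 reviews service</span><br><span class="line">  http:</span><br><span class="line">  - route:</span><br><span class="line">    - destination:</span><br><span class="line">        host: reviews</span><br><span class="line">        subset: v1       # DestinationRule 中定义的子版本</span><br><span class="line">      weight: 50         # 50% 流量到 v1</span><br><span class="line">    - destination:</span><br><span class="line">        host: reviews</span><br><span class="line">        subset: v3</span><br><span class="line">      weight: 50         # 50% 流量到 v3</span><br></pre></td></tr></table></figure><p>其中 weight 字段控制各 subset 的流量占比; 由于该 VirtualService 只列出了 v1 和 v3 两个 subset, 刷新页面时自然就看不到 reviews v2, 这与下文的演示结果一致。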
关于此项目更多细节请参考: <a href="https://istio.io/docs/examples/bookinfo/" target="_blank" rel="noopener">https://istio.io/docs/examples/bookinfo/</a></p><h5 id="部署应用"><a href="#部署应用" class="headerlink" title="部署应用:"></a>部署应用:</h5><p><code>kubectl apply -f samples/bookinfo/platform/kube/bookinfo.yaml</code></p><p>这将创建 productpage, details, ratings, reviews 对应的deployments 和 service, 其中reviews 有三个deployments, 代表三个不同的版本.</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br></pre></td><td class="code"><pre><span class="line"> % kubectl get pod</span><br><span class="line">NAME READY STATUS RESTARTS AGE</span><br><span class="line">details-v1-6865b9b99d-mnxbt 2/2 Running 0 1m</span><br><span class="line">productpage-v1-f8c8fb8-zjbhh 2/2 Running 0 59s</span><br><span class="line">ratings-v1-77f657f55d-95rcz 2/2 Running 0 1m</span><br><span class="line">reviews-v1-6b7f6db5c5-zqvkn 2/2 Running 0 59s</span><br><span class="line">reviews-v2-7ff5966b99-lw72l 2/2 Running 0 59s</span><br><span class="line">reviews-v3-5df889bcff-w9v7g 2/2 Running 0 59s</span><br><span class="line"></span><br><span class="line"> % kubectl get svc</span><br><span class="line">NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE</span><br><span class="line">details ClusterIP 172.18.255.240 <none> 9080/TCP 1m</span><br><span class="line">productpage ClusterIP 172.18.255.137 <none> 9080/TCP 1m</span><br><span class="line">ratings ClusterIP 172.18.255.41 <none> 9080/TCP 1m</span><br><span class="line">reviews ClusterIP 172.18.255.140 <none> 9080/TCP 1m</span><br></pre></td></tr></table></figure><p>对入口流量进行配置:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">kubectl apply -f samples/bookinfo/networking/bookinfo-gateway.yaml</span><br></pre></td></tr></table></figure><p>该操作会创建bookinfo-gateway 的Gateway, 并将流量发送到productpage服务</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">kubectl get gateway</span><br><span class="line">NAME AGE</span><br><span class="line">bookinfo-gateway 1m</span><br></pre></td></tr></table></figure><p>此时通过bookinfo-gateway 对应的LB或者nodeport 访问/productpage 页面, 可以看到三个版本的reviews服务在随机切换</p><h4 id="基于权重的路由"><a href="#基于权重的路由" class="headerlink" title="基于权重的路由"></a>基于权重的路由</h4><p>通过CRD DestinationRule创建3 个reviews 子版本:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">kubectl apply -f samples/bookinfo/networking/destination-rule-reviews.yaml</span><br></pre></td></tr></table></figure><p>通过CRD VirtualService 调整个 reviews 服务子版本的流量比例, 设置 v1 和 v3 各占 50%</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">kubectl apply -f 
samples/bookinfo/networking/virtual-service-reviews-50-v3.yaml</span><br></pre></td></tr></table></figure><p>刷新页面, 可以看到无法再看到reviews v2的内容, 页面在v1和v3之间切换.</p><h4 id="基于内容路由"><a href="#基于内容路由" class="headerlink" title="基于内容路由"></a>基于内容路由</h4><p>修改reviews CRD, 将jason 登录的用户版本路由到v2, 其他用户路由到版本v3.</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">kubectl apply -f samples/bookinfo/networking/virtual-service-reviews-jason-v2-v3.yaml</span><br></pre></td></tr></table></figure><p>刷新页面, 使用jason登录的用户, 将看到v2 黑色星星版本, 其他用户将看到v3 红色星星版本.</p><p>更多BookInfo 示例, 请参阅: <a href="https://istio.io/docs/examples/bookinfo/" target="_blank" rel="noopener">https://istio.io/docs/examples/bookinfo/</a>, 若要删除应用: 执行脚本 <code>./samples/bookinfo/platform/kube/cleanup.sh</code></p><hr><h1 id="3-Istio-数据面"><a href="#3-Istio-数据面" class="headerlink" title="3. Istio 数据面"></a>3. Istio 数据面</h1><ul><li>3.1 数据面组件</li><li>3.2 sidecar 流量劫持原理</li><li>3.3 数据面标准API: xDS</li><li>3.4 分布式跟踪</li></ul><h2 id="3-1-数据面组件"><a href="#3-1-数据面组件" class="headerlink" title="3.1 数据面组件"></a>3.1 数据面组件</h2><p>Istio 注入sidecar实现:</p><ul><li>自动注入: 利用 <a href="https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/" target="_blank" rel="noopener">Kubernetes Dynamic Admission Webhooks</a> 对 新建的pod 进行注入: init container + sidecar</li><li>手动注入: 使用<code>istioctl kube-inject</code></li></ul><p>注入Pod内容:</p><ul><li>istio-init: 通过配置iptables来劫持Pod中的流量</li><li>istio-proxy: 两个进程pilot-agent和envoy, pilot-agent 进行初始化并启动envoy</li></ul><h4 id="Sidecar-自动注入实现"><a href="#Sidecar-自动注入实现" class="headerlink" title="Sidecar 自动注入实现"></a>Sidecar 自动注入实现</h4><p>Istio 利用 <a href="https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/" target="_blank" rel="noopener">Kubernetes Dynamic Admission Webhooks</a> 对pod 进行sidecar注入</p><p>查看istio 对这2个Webhooks 的配置 ValidatingWebhookConfiguration 和 MutatingWebhookConfiguration:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">% kubectl get ValidatingWebhookConfiguration -oyaml</span><br><span class="line">% kubectl get mutatingWebhookConfiguration -oyaml</span><br></pre></td></tr></table></figure><p>可以看出:</p><ul><li>命名空间<code>istio-system</code> 中的服务 <code>istio-galley</code>, 通过路由<code>/admitpilot</code>, 处理config.istio.io部分, rbac.istio.io, authentication.istio.io, networking.istio.io等资源的Validating 工作</li><li>命名空间istio-system 中的服务 <code>istio-galley</code>, 通过路由<code>/admitmixer</code>, 处理其他config.istio.io资源的Validating 工作</li><li>命名空间istio-system 中的服务 <code>istio-sidecar-injector</code>, 通过路由<code>/inject</code>, 处理其他<code>v1/pods</code>的CREATE, 同时需要满足命名空间<code>istio-injection: enabled</code></li></ul><h4 id="istio-init"><a href="#istio-init" class="headerlink" title="istio-init"></a>istio-init</h4><p>数据面的每个Pod会被注入一个名为<code>istio-init</code> 的initContainer, initContrainer是K8S提供的机制,用于在Pod中执行一些初始化任务.在Initialcontainer执行完毕并退出后,才会启动Pod中的其它container.</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span 
class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br></pre></td><td class="code"><pre><span class="line">initContainers:</span><br><span class="line">- image: docker.io/istio/proxy_init:1.0.5</span><br><span class="line"> args:</span><br><span class="line"> - -p</span><br><span class="line"> - "15001"</span><br><span class="line"> - -u</span><br><span class="line"> - "1337"</span><br><span class="line"> - -m</span><br><span class="line"> - REDIRECT</span><br><span class="line"> - -i</span><br><span class="line"> - '*'</span><br><span class="line"> - -x</span><br><span class="line"> - ""</span><br><span class="line"> - -b</span><br><span class="line"> - "9080"</span><br><span class="line"> - -d</span><br><span class="line"> - ""</span><br></pre></td></tr></table></figure><p>istio-init ENTRYPOINT 和 args 组合的启动命令:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">/usr/local/bin/istio-iptables.sh -p 15001 -u 1337 -m REDIRECT -i '*' -x "" -b 9080 -d ""</span><br></pre></td></tr></table></figure><p>istio-iptables.sh 源码地址为 <a href="https://github.com/istio/istio/blob/master/tools/deb/istio-iptables.sh" target="_blank" rel="noopener">https://github.com/istio/istio/blob/master/tools/deb/istio-iptables.sh</a></p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><span class="line">$ istio-iptables.sh -p PORT -u UID -g GID [-m mode] [-b ports] [-d ports] [-i CIDR] [-x CIDR] [-h]</span><br><span class="line"> -p: 指定重定向所有 TCP 流量的 Envoy 端口(默认为 $ENVOY_PORT = 15001)</span><br><span class="line"> -u: 指定未应用重定向的用户的 UID。通常,这是代理容器的 UID(默认为 $ENVOY_USER 的 uid,istio_proxy 的 uid 或 1337)</span><br><span class="line"> -g: 指定未应用重定向的用户的 GID。(与 -u param 相同的默认值)</span><br><span class="line"> -m: 指定入站连接重定向到 Envoy 的模式,“REDIRECT” 或 “TPROXY”(默认为 $ISTIO_INBOUND_INTERCEPTION_MODE)</span><br><span class="line"> -b: 逗号分隔的入站端口列表,其流量将重定向到 Envoy(可选)。使用通配符 “*” 表示重定向所有端口。为空时表示禁用所有入站重定向(默认为 $ISTIO_INBOUND_PORTS)</span><br><span class="line"> -d: 指定要从重定向到 Envoy 中排除(可选)的入站端口列表,以逗号格式分隔。使用通配符“*” 表示重定向所有入站流量(默认为 $ISTIO_LOCAL_EXCLUDE_PORTS)</span><br><span class="line"> -i: 指定重定向到 Envoy(可选)的 IP 地址范围,以逗号分隔的 CIDR 格式列表。使用通配符 “*” 表示重定向所有出站流量。空列表将禁用所有出站重定向(默认为 $ISTIO_SERVICE_CIDR)</span><br><span class="line"> -x: 指定将从重定向中排除的 IP 地址范围,以逗号分隔的 CIDR 格式列表。使用通配符 “*” 表示重定向所有出站流量(默认为 $ISTIO_SERVICE_EXCLUDE_CIDR)。</span><br><span class="line"></span><br><span class="line">环境变量位于 $ISTIO_SIDECAR_CONFIG(默认在:/var/lib/istio/envoy/sidecar.env)</span><br></pre></td></tr></table></figure><p>istio-init 通过配置iptable来劫持Pod中的流量:</p><ul><li>参数<code>-p 15001</code>: Pod中的数据流量被iptable拦截,并发向15001端口, 该端口将由 envoy 监听</li><li>参数<code>-u 1337</code>: 用于排除用户ID为1337,即Envoy自身的流量,以避免Iptable把Envoy发出的数据又重定向到Envoy, UID 为 1337,即 Envoy 所处的用户空间,这也是 istio-proxy 容器默认使用的用户, 见Sidecar <code>istio-proxy</code> 配置参数<code>securityContext.runAsUser</code></li><li>参数<code>-b 9080</code> <code>-d ""</code>: 入站端口控制, 将所有访问 9080 端口(即应用容器的端口)的流量重定向到 
Envoy 代理</li><li>参数<code>-i '*'</code> <code>-x ""</code>: 出站IP控制, 将所有出站流量都重定向到 Envoy 代理</li></ul><p>Init 容器初始化完毕后就会自动终止,但是 Init 容器初始化结果(iptables)会保留到应用容器和 Sidecar 容器中.</p><h4 id="istio-proxy"><a href="#istio-proxy" class="headerlink" title="istio-proxy"></a>istio-proxy</h4><p>istio-proxy 以 sidecar 的形式注入到应用容器所在的pod中, 简化的注入yaml:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br></pre></td><td class="code"><pre><span class="line">- image: docker.io/istio/proxyv2:1.0.5</span><br><span class="line"> name: istio-proxy</span><br><span class="line"> args:</span><br><span class="line"> - proxy</span><br><span class="line"> - sidecar</span><br><span class="line"> - --configPath</span><br><span class="line"> - /etc/istio/proxy</span><br><span class="line"> - --binaryPath</span><br><span class="line"> - /usr/local/bin/envoy</span><br><span class="line"> - --serviceCluster</span><br><span class="line"> - ratings</span><br><span class="line"> - --drainDuration</span><br><span class="line"> - 45s</span><br><span class="line"> - --parentShutdownDuration</span><br><span class="line"> - 1m0s</span><br><span class="line"> - --discoveryAddress</span><br><span class="line"> - istio-pilot.istio-system:15007</span><br><span class="line"> - --discoveryRefreshDelay</span><br><span class="line"> - 1s</span><br><span class="line"> - --zipkinAddress</span><br><span class="line"> - zipkin.istio-system:9411</span><br><span class="line"> - --connectTimeout</span><br><span class="line"> - 10s</span><br><span class="line"> - --proxyAdminPort</span><br><span class="line"> - "15000"</span><br><span class="line"> - --controlPlaneAuthPolicy</span><br><span class="line"> - NONE</span><br><span class="line"> env:</span><br><span class="line"> ......</span><br><span class="line"> ports:</span><br><span class="line"> - containerPort: 15090</span><br><span class="line"> name: http-envoy-prom</span><br><span class="line"> protocol: TCP</span><br><span class="line"> securityContext:</span><br><span class="line"> runAsUser: 1337</span><br><span class="line"> ......</span><br></pre></td></tr></table></figure><p>istio-proxy容器中有两个进程pilot-agent和envoy:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line">~ % 
kubectl exec productpage-v1-f8c8fb8-wgmzk -c istio-proxy -- ps -ef</span><br><span class="line">UID PID PPID C STIME TTY TIME CMD</span><br><span class="line">istio-p+ 1 0 0 Jan03 ? 00:00:27 /usr/local/bin/pilot-agent proxy sidecar --configPath /etc/istio/proxy --binaryPath /usr/local/bin/envoy --serviceCluster productpage --drainDuration 45s --parentShutdownDuration 1m0s --discoveryAddress istio-pilot.istio-system:15007 --discoveryRefreshDelay 1s --zipkinAddress zipkin.istio-system:9411 --connectTimeout 10s --proxyAdminPort 15000 --controlPlaneAuthPolicy NONE</span><br><span class="line">istio-p+ 21 1 0 Jan03 ? 01:26:24 /usr/local/bin/envoy -c /etc/istio/proxy/envoy-rev0.json --restart-epoch 0 --drain-time-s 45 --parent-shutdown-time-s 60 --service-cluster productpage --service-node sidecar~172.18.3.12~productpage-v1-f8c8fb8-wgmzk.default~default.svc.cluster.local --max-obj-name-len 189 --allow-unknown-fields -l warn --v2-config-only</span><br></pre></td></tr></table></figure><p>可以看到:</p><ul><li><code>/usr/local/bin/pilot-agent</code> 是 <code>/usr/local/bin/envoy</code> 的父进程, Pilot-agent进程根据启动参数和K8S API Server中的配置信息生成Envoy的初始配置文件(<code>/etc/istio/proxy/envoy-rev0.json</code>),并负责启动Envoy进程</li><li>pilot-agent 的启动参数里包括: discoveryAddress(pilot服务地址), Envoy 二进制文件的位置, 服务集群名, 监控指标上报地址, Envoy 的管理端口, 热重启时间等</li></ul><p>Envoy配置初始化流程:</p><ol><li>Pilot-agent根据启动参数和K8S API Server中的配置信息生成Envoy的初始配置文件envoy-rev0.json,该文件告诉Envoy从xDS server中获取动态配置信息,并配置了xDS server的地址信息,即控制面的Pilot</li><li>Pilot-agent使用envoy-rev0.json启动Envoy进程</li><li>Envoy根据初始配置获得Pilot地址,采用xDS接口从Pilot获取到Listener,Cluster,Route等d动态配置信息</li><li>Envoy根据获取到的动态配置启动Listener,并根据Listener的配置,结合Route和Cluster对拦截到的流量进行处理</li></ol><p>查看envoy 初始配置文件:</p><p><code>kubectl exec productpage-v1-f8c8fb8-wgmzk -c istio-proxy -- cat /etc/istio/proxy/envoy-rev0.json</code></p><hr><h2 id="3-2-sidecar-流量劫持原理"><a href="#3-2-sidecar-流量劫持原理" class="headerlink" title="3.2 sidecar 流量劫持原理"></a>3.2 sidecar 流量劫持原理</h2><p>sidecar 既要作为服务消费者端的正向代理,又要作为服务提供者端的反向代理, 具体拦截过程如下:</p><ul><li><p>Pod 所在的network namespace内, 除了envoy发出的流量外, iptables规则会对进入和发出的流量都进行拦截,通过nat redirect重定向到Envoy监听的15001端口.</p></li><li><p>envoy 会根据从Pilot拿到的 XDS 规则, 对流量进行转发.</p></li><li><p>envoy 的 listener 0.0.0.0:15001 接收进出 Pod 的所有流量,然后将请求移交给对应的virtual listener</p></li><li><p>对于本pod的服务, 有一个http listener <code>podIP+端口</code> 接受inbound 流量</p></li><li><p>每个service+非http端口, 监听器配对的 Outbound 非 HTTP 流量</p></li><li><p>每个service+http端口, 有一个http listener: <code>0.0.0.0+端口</code> 接受outbound流量</p></li></ul><p>整个拦截转发过程对业务容器是透明的, 业务容器仍然使用 Service 域名和端口进行通信, service 域名仍然会转换为service IP, 但service IP 在sidecar 中会被直接转换为 pod IP, 从容器中出去的流量已经使用了pod IP会直接转发到对应的Pod, 对比传统kubernetes 服务机制, service IP 转换为Pod IP 在node上进行, 由 kube-proxy维护的iptables实现.</p><hr><h2 id="3-3-数据面标准API-xDS"><a href="#3-3-数据面标准API-xDS" class="headerlink" title="3.3 数据面标准API: xDS"></a>3.3 数据面标准API: xDS</h2><p>xDS是一类发现服务的总称,包含LDS,RDS,CDS,EDS以及 SDS。Envoy通过xDS API可以动态获取Listener(监听器), Route(路由),Cluster(集群),Endpoint(集群成员)以 及Secret(证书)配置</p><p>xDS API 涉及的概念:</p><ul><li>Host</li><li>Downstream</li><li>Upstream</li><li>Listener</li><li>Cluster</li></ul><p>Envoy 配置热更新: 配置的动态变更,而不需要重启 Envoy:</p><ol><li>新老进程采用基本的RPC协议使用Unix Domain Socket通讯.</li><li>新进程启动并完成所有初始化工作后,向老进程请求监听套接字的副本.</li><li>新进程接管套接字后,通知老进程关闭套接字.</li><li>通知老进程终止自己.</li></ol><h4 id="xDS-调试"><a href="#xDS-调试" class="headerlink" title="xDS 调试"></a>xDS 调试</h4><p>Pilot在9093端口提供了下述调试接口:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span 
class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line"># What is sent to envoy</span><br><span class="line"># Listeners and routes</span><br><span class="line">curl $PILOT/debug/adsz</span><br><span class="line"></span><br><span class="line"># Endpoints</span><br><span class="line">curl $PILOT/debug/edsz</span><br><span class="line"></span><br><span class="line"># Clusters</span><br><span class="line">curl $PILOT/debug/cdsz</span><br></pre></td></tr></table></figure><p>Sidecar Envoy 也提供了管理接口,缺省为localhost的15000端口,可以获取listener,cluster以及完整的配置数据</p><p>可以通过以下命令查看支持的调试接口:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">kubectl exec productpage-v1-f8c8fb8-zjbhh -c istio-proxy curl http://127.0.0.1:15000/help</span><br></pre></td></tr></table></figure><p>或者forward到本地就行调试</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">kubectl port-forward productpage-v1-f8c8fb8-zjbhh 15000</span><br></pre></td></tr></table></figure><p>相关的调试接口:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line">http://127.0.0.1:15000</span><br><span class="line">http://127.0.0.1:15000/help</span><br><span class="line">http://127.0.0.1:15000/config_dump</span><br><span class="line">http://127.0.0.1:15000/listeners</span><br><span class="line">http://127.0.0.1:15000/clusters</span><br></pre></td></tr></table></figure><p>使用istioctl 查看代理配置:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line">istioctl pc {xDS类型} {POD_NAME} {过滤条件} {-o json/yaml}</span><br><span class="line"></span><br><span class="line">eg:</span><br><span class="line">istioctl pc routes productpage-v1-f8c8fb8-zjbhh --name 9080 -o json</span><br></pre></td></tr></table></figure><p>xDS 类型包括: listener, route, cluster, endpoint</p><h4 id="对xDS-进行分析-productpage-访问-reviews-服务"><a href="#对xDS-进行分析-productpage-访问-reviews-服务" class="headerlink" title="对xDS 进行分析: productpage 访问 reviews 服务"></a>对xDS 进行分析: productpage 访问 reviews 服务</h4><p>查看 product 的所有listener:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span 
class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br></pre></td><td class="code"><pre><span class="line">% istioctl pc listener productpage-v1-f8c8fb8-zjbhh</span><br><span class="line">ADDRESS PORT TYPE</span><br><span class="line">172.18.255.178 15011 TCP</span><br><span class="line">172.18.255.194 44134 TCP</span><br><span class="line">172.18.255.110 443 TCP</span><br><span class="line">172.18.255.190 50000 TCP</span><br><span class="line">172.18.255.203 853 TCP</span><br><span class="line">172.18.255.2 443 TCP</span><br><span class="line">172.18.255.239 16686 TCP</span><br><span class="line">0.0.0.0 80 TCP</span><br><span class="line">172.18.255.215 3306 TCP</span><br><span class="line">172.18.255.203 31400 TCP</span><br><span class="line">172.18.255.111 443 TCP</span><br><span class="line">172.18.255.203 8060 TCP</span><br><span class="line">172.18.255.203 443 TCP</span><br><span class="line">172.18.255.40 443 TCP</span><br><span class="line">172.18.255.1 443 TCP</span><br><span class="line">172.18.255.53 53 TCP</span><br><span class="line">172.18.255.203 15011 TCP</span><br><span class="line">172.18.255.105 14268 TCP</span><br><span class="line">172.18.255.125 42422 TCP</span><br><span class="line">172.18.255.105 14267 TCP</span><br><span class="line">172.18.255.52 80 TCP</span><br><span class="line">0.0.0.0 15010 HTTP</span><br><span class="line">0.0.0.0 9411 HTTP</span><br><span class="line">0.0.0.0 8060 HTTP</span><br><span class="line">0.0.0.0 9080 HTTP</span><br><span class="line">0.0.0.0 15004 HTTP</span><br><span class="line">0.0.0.0 20001 HTTP</span><br><span class="line">0.0.0.0 9093 HTTP</span><br><span class="line">0.0.0.0 8080 HTTP</span><br><span class="line">0.0.0.0 15030 HTTP</span><br><span class="line">0.0.0.0 9091 HTTP</span><br><span class="line">0.0.0.0 9090 HTTP</span><br><span class="line">0.0.0.0 15031 HTTP</span><br><span class="line">0.0.0.0 3000 HTTP</span><br><span class="line">0.0.0.0 15001 TCP</span><br><span class="line">172.18.3.50 9080 HTTP 这是当前pod ip 暴露的服务地址, 会路由到回环地址, 各个pod 会不一样</span><br></pre></td></tr></table></figure><p>envoy 流量入口的listener:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br></pre></td><td class="code"><pre><span class="line">% istioctl pc listener productpage-v1-f8c8fb8-zjbhh --address 
0.0.0.0 --port 15001 -o json</span><br><span class="line">[</span><br><span class="line"> {</span><br><span class="line"> "name": "virtual",</span><br><span class="line"> "address": {</span><br><span class="line"> "socketAddress": {</span><br><span class="line"> "address": "0.0.0.0",</span><br><span class="line"> "portValue": 15001</span><br><span class="line"> }</span><br><span class="line"> },</span><br><span class="line"> "filterChains": [</span><br><span class="line"> {</span><br><span class="line"> "filters": [</span><br><span class="line"> {</span><br><span class="line"> "name": "envoy.tcp_proxy",</span><br><span class="line"> "config": {</span><br><span class="line"> "cluster": "BlackHoleCluster",</span><br><span class="line"> "stat_prefix": "BlackHoleCluster"</span><br><span class="line"> }</span><br><span class="line"> }</span><br><span class="line"> ]</span><br><span class="line"> }</span><br><span class="line"> ],</span><br><span class="line"> "useOriginalDst": true # 这意味着它将请求交给最符合请求原始目标的监听器。如果找不到任何匹配的虚拟监听器,它会将请求发送给返回 404 的 BlackHoleCluster</span><br><span class="line"> }</span><br><span class="line">]</span><br></pre></td></tr></table></figure><p>以下是reviews的所有pod IP</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line"> % kubectl get ep reviews</span><br><span class="line">NAME ENDPOINTS AGE</span><br><span class="line">reviews 172.18.2.35:9080,172.18.3.48:9080,172.18.3.49:9080 1d</span><br></pre></td></tr></table></figure><p>对于目的地址是以上ip的http访问, 这些 ip 并没有对应的listener, 因此会通过端口9080 匹配到listener <code>0.0.0.0 9080</code></p><p>查看listener <code>0.0.0.0 9080</code>:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br></pre></td><td class="code"><pre><span class="line">% istioctl pc listener productpage-v1-f8c8fb8-zjbhh --address 0.0.0.0 --port 9080 -ojson</span><br><span class="line"> {</span><br><span class="line"> "name": "0.0.0.0_9080",</span><br><span class="line"> "address": {</span><br><span class="line"> "socketAddress": {</span><br><span class="line"> "address": "0.0.0.0",</span><br><span class="line"> "portValue": 9080</span><br><span class="line"> }</span><br><span class="line"> },</span><br><span class="line"> ......</span><br><span class="line"></span><br><span class="line"> "rds": {</span><br><span class="line"> "config_source": {</span><br><span class="line"> "ads": {}</span><br><span class="line"> },</span><br><span class="line"> "route_config_name": "9080"</span><br><span class="line"> },</span><br><span class="line"> ......</span><br></pre></td></tr></table></figure><p>查看名为<code>9080</code> 的 route:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span 
class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span class="line">67</span><br><span class="line">68</span><br><span class="line">69</span><br><span class="line">70</span><br><span class="line">71</span><br><span class="line">72</span><br><span class="line">73</span><br><span class="line">74</span><br><span class="line">75</span><br><span class="line">76</span><br><span class="line">77</span><br><span class="line">78</span><br><span class="line">79</span><br><span class="line">80</span><br><span class="line">81</span><br><span class="line">82</span><br><span class="line">83</span><br><span class="line">84</span><br><span class="line">85</span><br><span class="line">86</span><br><span class="line">87</span><br><span class="line">88</span><br><span class="line">89</span><br><span class="line">90</span><br><span class="line">91</span><br><span class="line">92</span><br><span class="line">93</span><br><span class="line">94</span><br><span class="line">95</span><br><span class="line">96</span><br><span class="line">97</span><br><span class="line">98</span><br><span class="line">99</span><br><span class="line">100</span><br><span class="line">101</span><br><span class="line">102</span><br><span class="line">103</span><br><span class="line">104</span><br><span class="line">105</span><br><span class="line">106</span><br><span class="line">107</span><br><span class="line">108</span><br><span class="line">109</span><br><span class="line">110</span><br><span class="line">111</span><br><span class="line">112</span><br><span class="line">113</span><br><span class="line">114</span><br><span class="line">115</span><br><span 
class="line">116</span><br><span class="line">117</span><br><span class="line">118</span><br><span class="line">119</span><br><span class="line">120</span><br></pre></td><td class="code"><pre><span class="line">% istioctl pc routes productpage-v1-f8c8fb8-zjbhh --name 9080 -o json</span><br><span class="line"></span><br><span class="line">[</span><br><span class="line"> {</span><br><span class="line"> "name": "9080",</span><br><span class="line"> "virtualHosts": [</span><br><span class="line"> {</span><br><span class="line"> "name": "details.default.svc.cluster.local:9080",</span><br><span class="line"> "domains": [</span><br><span class="line"> "details.default.svc.cluster.local",</span><br><span class="line"> "details.default.svc.cluster.local:9080",</span><br><span class="line"> "details",</span><br><span class="line"> "details:9080",</span><br><span class="line"> "details.default.svc.cluster",</span><br><span class="line"> "details.default.svc.cluster:9080",</span><br><span class="line"> "details.default.svc",</span><br><span class="line"> "details.default.svc:9080",</span><br><span class="line"> "details.default",</span><br><span class="line"> "details.default:9080",</span><br><span class="line"> "172.18.255.240",</span><br><span class="line"> "172.18.255.240:9080"</span><br><span class="line"> ],</span><br><span class="line"> "routes": [</span><br><span class="line"> {</span><br><span class="line"> "match": {</span><br><span class="line"> "prefix": "/"</span><br><span class="line"> },</span><br><span class="line"> "route": {</span><br><span class="line"> "cluster": "outbound|9080||details.default.svc.cluster.local",</span><br><span class="line"> "timeout": "0.000s",</span><br><span class="line"> "maxGrpcTimeout": "0.000s"</span><br><span class="line"> },</span><br><span class="line"> ......</span><br><span class="line"> {</span><br><span class="line"> "name": "productpage.default.svc.cluster.local:9080",</span><br><span class="line"> "domains": [</span><br><span class="line"> "productpage.default.svc.cluster.local",</span><br><span class="line"> "productpage.default.svc.cluster.local:9080",</span><br><span class="line"> "productpage",</span><br><span class="line"> "productpage:9080",</span><br><span class="line"> "productpage.default.svc.cluster",</span><br><span class="line"> "productpage.default.svc.cluster:9080",</span><br><span class="line"> "productpage.default.svc",</span><br><span class="line"> "productpage.default.svc:9080",</span><br><span class="line"> "productpage.default",</span><br><span class="line"> "productpage.default:9080",</span><br><span class="line"> "172.18.255.137",</span><br><span class="line"> "172.18.255.137:9080"</span><br><span class="line"> ],</span><br><span class="line"> "routes": [ ...... 
]</span><br><span class="line"> },</span><br><span class="line"> {</span><br><span class="line"> "name": "ratings.default.svc.cluster.local:9080",</span><br><span class="line"> "domains": [</span><br><span class="line"> "ratings.default.svc.cluster.local",</span><br><span class="line"> "ratings.default.svc.cluster.local:9080",</span><br><span class="line"> "ratings",</span><br><span class="line"> "ratings:9080",</span><br><span class="line"> "ratings.default.svc.cluster",</span><br><span class="line"> "ratings.default.svc.cluster:9080",</span><br><span class="line"> "ratings.default.svc",</span><br><span class="line"> "ratings.default.svc:9080",</span><br><span class="line"> "ratings.default",</span><br><span class="line"> "ratings.default:9080",</span><br><span class="line"> "172.18.255.41",</span><br><span class="line"> "172.18.255.41:9080"</span><br><span class="line"> ],</span><br><span class="line"> "routes": [ ...... ]</span><br><span class="line"> },</span><br><span class="line"> {</span><br><span class="line"> "name": "reviews.default.svc.cluster.local:9080",</span><br><span class="line"> "domains": [</span><br><span class="line"> "reviews.default.svc.cluster.local",</span><br><span class="line"> "reviews.default.svc.cluster.local:9080",</span><br><span class="line"> "reviews",</span><br><span class="line"> "reviews:9080",</span><br><span class="line"> "reviews.default.svc.cluster",</span><br><span class="line"> "reviews.default.svc.cluster:9080",</span><br><span class="line"> "reviews.default.svc",</span><br><span class="line"> "reviews.default.svc:9080",</span><br><span class="line"> "reviews.default",</span><br><span class="line"> "reviews.default:9080",</span><br><span class="line"> "172.18.255.140",</span><br><span class="line"> "172.18.255.140:9080"</span><br><span class="line"> ],</span><br><span class="line"> "routes": [</span><br><span class="line"> {</span><br><span class="line"> "match": {</span><br><span class="line"> "prefix": "/",</span><br><span class="line"> "headers": [</span><br><span class="line"> {</span><br><span class="line"> "name": "end-user",</span><br><span class="line"> "exactMatch": "jason"</span><br><span class="line"> }</span><br><span class="line"> ]</span><br><span class="line"> },</span><br><span class="line"> "route": {</span><br><span class="line"> "cluster": "outbound|9080|v2|reviews.default.svc.cluster.local",</span><br><span class="line"> "timeout": "0.000s",</span><br><span class="line"> "maxGrpcTimeout": "0.000s"</span><br><span class="line"> },</span><br><span class="line"> ......</span><br><span class="line"> },</span><br><span class="line"> {</span><br><span class="line"> "match": {</span><br><span class="line"> "prefix": "/"</span><br><span class="line"> },</span><br><span class="line"> "route": {</span><br><span class="line"> "cluster": "outbound|9080|v3|reviews.default.svc.cluster.local",</span><br><span class="line"> "timeout": "0.000s",</span><br><span class="line"> "maxGrpcTimeout": "0.000s"</span><br><span class="line"> },</span><br><span class="line"> .......</span><br><span class="line"> }</span><br><span class="line"> ]</span><br><span class="line"> }</span><br><span class="line"> ],</span><br><span class="line"> "validateClusters": false</span><br><span class="line"> }</span><br><span class="line">]</span><br></pre></td></tr></table></figure><p>可以看到, 在9080 这个route 中, 包含所有这个端口的http 路由信息, 通过virtualHosts列表进行服务域名分发到各个cluster.</p><p>查看virtualHosts <code>reviews.default.svc.cluster.local:9080</code> 中的routes信息, 可以看到jason 路由到了cluster 
<code>outbound|9080|v2|reviews.default.svc.cluster.local</code></p><p>查看该cluster:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br></pre></td><td class="code"><pre><span class="line">% istioctl pc cluster productpage-v1-f8c8fb8-zjbhh --fqdn reviews.default.svc.cluster.local --subset v2 -o json</span><br><span class="line">[</span><br><span class="line"> {</span><br><span class="line"> "name": "outbound|9080|v2|reviews.default.svc.cluster.local",</span><br><span class="line"> "type": "EDS",</span><br><span class="line"> "edsClusterConfig": {</span><br><span class="line"> "edsConfig": {</span><br><span class="line"> "ads": {}</span><br><span class="line"> },</span><br><span class="line"> "serviceName": "outbound|9080|v2|reviews.default.svc.cluster.local"</span><br><span class="line"> },</span><br><span class="line"> "connectTimeout": "1.000s",</span><br><span class="line"> "lbPolicy": "RANDOM",</span><br><span class="line"> "circuitBreakers": {</span><br><span class="line"> "thresholds": [</span><br><span class="line"> {}</span><br><span class="line"> ]</span><br><span class="line"> }</span><br><span class="line"> }</span><br><span class="line">]</span><br></pre></td></tr></table></figure><p>查看其对应的endpoint:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line"> % istioctl pc endpoint productpage-v1-f8c8fb8-zjbhh --cluster 'outbound|9080|v2|reviews.default.svc.cluster.local'</span><br><span class="line">ENDPOINT STATUS CLUSTER</span><br><span class="line">172.18.2.35:9080 HEALTHY outbound|9080|v2|reviews.default.svc.cluster.local</span><br></pre></td></tr></table></figure><p>该endpoint 即为 reviews 服务 V2 对应的 pod IP</p><h4 id="XDS服务接口的最终一致性考虑"><a href="#XDS服务接口的最终一致性考虑" class="headerlink" title="XDS服务接口的最终一致性考虑"></a>XDS服务接口的最终一致性考虑</h4><p>遵循 make before break 模型</p><hr><h2 id="3-4-分布式跟踪"><a href="#3-4-分布式跟踪" class="headerlink" title="3.4 分布式跟踪"></a>3.4 分布式跟踪</h2><p>以下是分布式全链路跟踪示意图:</p><p><img width="40%" style="float: left" src="https://zhongfox.github.io/assets/images/hunter/opentracing_1.png"><br><img width="60%" style="float: left;margin-top:2cm;" src="https://zhongfox.github.io/assets/images/hunter/opentracing_2.png"></p><center>一个典型的Trace案例(图片来自<a href="https://wu-sheng.gitbooks.io/opentracing-io/content/" target="_blank" rel="noopener">opentracing文档中文版</a>)</center><hr><p>Jaeger 是Uber 开源的全链路跟踪系统, 符合OpenTracing协议, OpenTracing 和 Jaeger 均是CNCF 成员项目, 以下是Jaeger 架构的示意图:</p><p><img src="https://www.jaegertracing.io/img/architecture.png"></p><center>Jaeger 架构示意图(图片来自<a href="https://www.jaegertracing.io/docs/1.6/architecture/" target="_blank" rel="noopener">Jaeger官方文档</a>)</center><p>分布式跟踪系统让开发者能够得到可视化的调用流程展示。这对复杂的微服务系统进行问题排查和性能优化时至关重要.</p><p>Envoy 原生支持http 链路跟踪:</p><ul><li>生成 Request ID:Envoy 
会在需要的时候生成 UUID,并操作名为 [x-request-id] 的 HTTP Header。应用可以转发这个 Header 用于统一的记录和跟踪.</li><li>支持集成外部跟踪服务:Envoy 支持可插接的外部跟踪可视化服务。目前支持有:<ul><li>LightStep</li><li>Zipkin 或者 Zipkin 兼容的后端(比如说 Jaeger)</li><li>Datadog</li></ul></li><li>客户端跟踪 ID 连接:x-client-trace-id Header 可以用来把不信任的请求 ID 连接到受信的 x-request-id Header 上</li></ul><h4 id="跟踪上下文信息的传播"><a href="#跟踪上下文信息的传播" class="headerlink" title="跟踪上下文信息的传播"></a>跟踪上下文信息的传播</h4><ul><li>不管使用的是哪个跟踪服务,都应该传播 x-request-id,这样在被调用服务中启动相关性的记录</li><li>如果使用的是 Zipkin,Envoy 要传播的是 <a href="https://www.envoyproxy.io/docs/envoy/latest/configuration/http_conn_man/headers#config-http-conn-man-headers-b3" target="_blank" rel="noopener">B3 Header</a>。(x-b3-traceid, x-b3-spanid, x-b3-parentspanid, x-b3-sampled, 以及 x-b3-flags. x-b3-sampled)</li><li>上下文跟踪并非零修改, 在调用下游服务时, 上游应用应该自行传播跟踪相关的 HTTP Header</li></ul><hr><h1 id="4-Istio-控制面"><a href="#4-Istio-控制面" class="headerlink" title="4. Istio 控制面"></a>4. Istio 控制面</h1><ul><li>4.1 Pilot 架构</li><li>4.2 流量管理模型</li><li>4.3 故障处理</li><li>4.4 Mixer 架构</li><li>4.5 Mixer适配器模型</li><li>4.6 Mixer 缓存机制</li></ul><hr><h2 id="4-1-Pilot-架构"><a href="#4-1-Pilot-架构" class="headerlink" title="4.1 Pilot 架构"></a>4.1 Pilot 架构</h2><p><img src="https://preliminary.istio.io/docs/concepts/traffic-management/PilotAdapters.svg"></p><center>Pilot Architecture(图片来自<a href="https://istio.io/docs/concepts/traffic-management/" target="_blank" rel="noopener">Isio官网文档</a>)</center><ul><li>Rules API: 对外封装统一的 API,供服务的开发者或者运维人员调用,可以用于流量控制。</li><li>Envoy API: 对内封装统一的 API,供 Envoy 调用以获取注册信息、流量控制信息等。</li><li>抽象模型层: 对服务的注册信息、流量控制规则等进行抽象,使其描述与平台无关。</li><li>平台适配层: 用于适配各个平台如 Kubernetes、Mesos、Cloud Foundry 等,把平台特定的注册信息、资源信息等转换成抽象模型层定义的平台无关的描述。例如,Pilot 中的 Kubernetes 适配器实现必要的控制器来 watch Kubernetes API server 中 pod 注册信息、ingress 资源以及用于存储流量管理规则的第三方资源的更改</li></ul><hr><h2 id="4-2-流量管理模型"><a href="#4-2-流量管理模型" class="headerlink" title="4.2 流量管理模型"></a>4.2 流量管理模型</h2><ul><li>VirtualService</li><li>DestinationRule</li><li>ServiceEntry</li><li>Gateway</li></ul><h4 id="VirtualService"><a href="#VirtualService" class="headerlink" title="VirtualService"></a>VirtualService</h4><p>VirtualService 中定义了一系列针对指定服务的流量路由规则。每个路由规则都是针对特定协议的匹配规则。如果流量符合这些特征,就会根据规则发送到服务注册表中的目标服务, 或者目标服务的子集或版本, 匹配规则中还包含了对流量发起方的定义,这样一来,规则还可以针对特定客户上下文进行定制.</p><h4 id="Gateway"><a href="#Gateway" class="headerlink" title="Gateway"></a>Gateway</h4><p>Gateway 描述了一个负载均衡器,用于承载网格边缘的进入和发出连接。这一规范中描述了一系列开放端口,以及这些端口所使用的协议、负载均衡的 SNI 配置等内容</p><h4 id="ServiceEntry"><a href="#ServiceEntry" class="headerlink" title="ServiceEntry"></a>ServiceEntry</h4><p>Istio 服务网格内部会维护一个与平台无关的使用通用模型表示的服务注册表,当你的服务网格需要访问外部服务的时候,就需要使用 ServiceEntry 来添加服务注册, 这类服务可能是网格外的 API,或者是处于网格内部但却不存在于平台的服务注册表中的条目(例如需要和 Kubernetes 服务沟通的一组虚拟机服务).</p><h4 id="EnvoyFilter"><a href="#EnvoyFilter" class="headerlink" title="EnvoyFilter"></a>EnvoyFilter</h4><p>EnvoyFilter 描述了针对代理服务的过滤器,用来定制由 Istio Pilot 生成的代理配置.</p><h4 id="Kubernetes-Ingress-vs-Istio-Gateway"><a href="#Kubernetes-Ingress-vs-Istio-Gateway" class="headerlink" title="Kubernetes Ingress vs Istio Gateway"></a>Kubernetes Ingress vs Istio Gateway</h4><p><img src="https://zhongfox.github.io/assets/images/istio/gateway.png"></p><ul><li>合并了L4-6和L7的规范, 对传统技术栈用户的应用迁入不方便</li><li>表现力不足:<ul><li>只能对 service、port、HTTP 路径等有限字段匹配来路由流量</li><li>端口只支持默认80/443</li></ul></li></ul><p>Istio Gateway:·</p><ul><li>定义了四层到六层的负载均衡属性 (通常是SecOps或NetOps关注的内容)<ul><li>端口</li><li>端口所使用的协议(HTTP, HTTPS, GRPC, HTTP2, MONGO, TCP, TLS)</li><li>Hosts</li><li>TLS SNI header 路由支持</li><li>TLS 配置支持(http 自动301, 证书等)</li><li>ip / unix domain socket</li></ul></li></ul><h4 
id="Kubernetes-Istio-Envoy-xDS-模型对比"><a href="#Kubernetes-Istio-Envoy-xDS-模型对比" class="headerlink" title="Kubernetes, Istio, Envoy xDS 模型对比"></a>Kubernetes, Istio, Envoy xDS 模型对比</h4><p>以下是对Kubernetes, Istio, Envoy xDS 模型的不严格对比</p><table><thead><tr><th></th><th>Kubernetes</th><th>Istio</th><th>Envoy xDS</th></tr></thead><tbody><tr><td>入口流量</td><td>Ingress</td><td>GateWay</td><td>Listener</td></tr><tr><td>服务定义</td><td>Service</td><td>-</td><td>Cluster+Listener</td></tr><tr><td>外部服务定义</td><td>-</td><td>ServiceEntry</td><td>Cluster+Listener</td></tr><tr><td>版本定义</td><td>-</td><td>DestinationRule</td><td>Cluster+Listener</td></tr><tr><td>版本路由</td><td>-</td><td>VirtualService</td><td>Route</td></tr><tr><td>实例</td><td>Endpoint</td><td>-</td><td>Endpoint</td></tr></tbody></table><h4 id="Kubernetes-和-Istio-服务寻址的区别"><a href="#Kubernetes-和-Istio-服务寻址的区别" class="headerlink" title="Kubernetes 和 Istio 服务寻址的区别:"></a>Kubernetes 和 Istio 服务寻址的区别:</h4><p><strong>Kubernetes</strong>:</p><ol><li>kube-dns: service domain -> service ip</li><li>kube-proxy(node iptables): service ip -> pod ip</li></ol><p><strong>Istio</strong>:</p><ol><li>kube-dns: service domain -> service ip</li><li>sidecar envoy: service ip -> pod ip</li></ol><hr><h2 id="4-3-故障处理"><a href="#4-3-故障处理" class="headerlink" title="4.3 故障处理"></a>4.3 故障处理</h2><p>随着微服务的拆分粒度增强, 服务调用会增多, 更复杂, 扇入 扇出, 调用失败的风险增加, 以下是常见的服务容错处理方式:</p><table><thead><tr><th></th><th>控制端</th><th>目的</th><th>实现</th><th>Istio</th></tr></thead><tbody><tr><td>超时</td><td>client</td><td>保护client</td><td>请求等待超时/请求运行超时</td><td>timeout</td></tr><tr><td>重试</td><td>client</td><td>容忍server临时错误, 保证业务整体可用性</td><td>重试次数/重试的超时时间</td><td>retries.attempts, retries.perTryTimeout</td></tr><tr><td>熔断</td><td>client</td><td>降低性能差的服务或实例的影响</td><td>通常会结合超时+重试, 动态进行服务状态决策(Open/Closed/Half-Open)</td><td>trafficPolicy.outlierDetection</td></tr><tr><td>降级</td><td>client</td><td>保证业务主要功能可用</td><td>主逻辑失败采用备用逻辑的过程(镜像服务分级, 调用备用服务, 或者返回mock数据)</td><td>暂不支持, 需要业务代码按需实现</td></tr><tr><td>隔离</td><td>client</td><td>防止异常server占用过多client资源</td><td>隔离对不同服务调用的资源依赖: 线程池隔离/信号量隔离</td><td>暂不支持</td></tr><tr><td>幂等</td><td>server</td><td>容忍client重试, 保证数据一致性</td><td>唯一ID/加锁/事务等手段</td><td>暂不支持, 需要业务代码按需实现</td></tr><tr><td>限流</td><td>server</td><td>保护server</td><td>常用算法: 计数器, 漏桶, 令牌桶</td><td>trafficPolicy.connectionPool</td></tr></tbody></table><p>Istio 没有无降级处理支持: Istio可以提高网格中服务的可靠性和可用性。但是,应用程序仍然需要处理故障(错误)并采取适当的回退操作。例如,当负载均衡池中的所有实例都失败时,Envoy 将返回 HTTP 503。应用程序有责任实现必要的逻辑,对这种来自上游服务的 HTTP 503 错误做出合适的响应。</p><hr><h2 id="4-4-Mixer-架构"><a href="#4-4-Mixer-架构" class="headerlink" title="4.4 Mixer 架构"></a>4.4 Mixer 架构</h2><p><img src="https://istio.io/docs/concepts/policies-and-telemetry/topology-without-cache.svg"></p><center>Mixer Topology(图片来自<a href="https://istio.io/docs/concepts/policies-and-telemetry/" target="_blank" rel="noopener">Isio官网文档</a>)</center><p>Istio 的四大功能点连接, 安全, 控制, 观察, 其中「控制」和「观察」的功能主要都是由Mixer组件来提供, Mixer 在Istio中角色:</p><ul><li>功能上: 负责策略控制和遥测收集</li><li>架构上:提供插件模型,可以扩展和定制</li></ul><hr><h2 id="4-5-Mixer-Adapter-模型"><a href="#4-5-Mixer-Adapter-模型" class="headerlink" title="4.5 Mixer Adapter 模型"></a>4.5 Mixer Adapter 模型</h2><ul><li>Attribute</li><li>Template</li><li>Adapter</li><li>Instance</li><li>Handler</li><li>Rule</li></ul><h4 id="Attribute"><a href="#Attribute" class="headerlink" title="Attribute"></a>Attribute</h4><p>Attribute 是策略和遥测功能中有关请求和环境的基本数据, 是用于描述特定服务请求或请求环境的属性的一小段数据。例如,属性可以指定特定请求的大小、操作的响应代码、请求来自的 IP 地址等.</p><ul><li>Istio 中的主要属性生产者是 Envoy,但专用的 Mixer 适配器也可以生成属性</li><li>属性词汇表见: <a 
href="https://istio.io/docs/reference/config/policy-and-telemetry/attribute-vocabulary/" target="_blank" rel="noopener">Attribute Vocabulary</a></li><li>数据流向: envoy -> mixer</li></ul><h4 id="Template"><a href="#Template" class="headerlink" title="Template"></a>Template</h4><p>Template 是对 adapter 的数据格式和处理接口的抽象, Template定义了:</p><ul><li>当处理请求时发送给adapter 的数据格式</li><li>adapter 必须实现的gRPC service 接口</li></ul><p>每个Template 通过 <code>template.proto</code> 进行定义:</p><ul><li>名为<code>Template</code> 的一个message</li><li>Name: 通过template所在的package name自动生成</li><li>template_variety: 可选Check, Report, Quota or AttributeGenerator, 决定了adapter必须实现的方法. 同时决定了在mixer的什么阶段要生成template对应的instance:<ul><li>Check: 在Mixer’s Check API call时创建并发送instance</li><li>Report: 在Mixer’s Report API call时创建并发送instance</li><li>Quota: 在Mixer’s Check API call时创建并发送instance(查询配额时)</li><li>AttributeGenerator: for both Check, Report Mixer API calls</li></ul></li></ul><p>Istio 内置的Templates: <a href="https://istio.io/docs/reference/config/policy-and-telemetry/templates/" target="_blank" rel="noopener">https://istio.io/docs/reference/config/policy-and-telemetry/templates/</a></p><h4 id="Adapter"><a href="#Adapter" class="headerlink" title="Adapter"></a>Adapter</h4><p>封装了 Mixer 和特定外部基础设施后端进行交互的必要接口,例如 Prometheus 或者 Stackdriver</p><ul><li>定义了需要处理的模板(在yaml中配置template)</li><li>定义了处理某个Template数据格式的GRPC接口</li><li>定义 Adapter需要的配置格式(Params)</li><li>可以同时处理多个数据(instance)</li></ul><p>Istio 内置的Adapter: <a href="https://istio.io/docs/reference/config/policy-and-telemetry/adapters/" target="_blank" rel="noopener">https://istio.io/docs/reference/config/policy-and-telemetry/adapters/</a></p><h4 id="Instance"><a href="#Instance" class="headerlink" title="Instance"></a>Instance</h4><p>代表符合某个Template定义的数据格式的具体实现, 该具体实现由用户配置的 CRD, CRD 定义了将Attributes 转换为具体instance 的规则, 支持属性表达式</p><ul><li>Instance CRD 是Template 中定义的数据格式 + 属性转换器</li><li>内置的Instance 类型(其实就是内置 Template): <a href="https://istio.io/docs/reference/config/policy-and-telemetry/templates/" target="_blank" rel="noopener">Templates</a></li><li>属性表达式见: <a href="https://istio.io/docs/reference/config/policy-and-telemetry/expression-language/" target="_blank" rel="noopener">Expression Language</a></li><li>数据流向: mixer -> adapter 实例</li></ul><h4 id="Handler"><a href="#Handler" class="headerlink" title="Handler"></a>Handler</h4><p>用户配置的 CRD, 为具体Adapter提供一个具体配置, 对应Adapter的可运行实例</p><h4 id="Rule"><a href="#Rule" class="headerlink" title="Rule"></a>Rule</h4><p>用户配置的 CRD, 配置一组规则,这些规则描述了何时调用特定(通过Handler对应的)适配器及哪些Instance</p><hr><h2 id="结语"><a href="#结语" class="headerlink" title="结语"></a>结语</h2><blockquote><p>计算机科学中的所有问题,都可以用另一个层来解决,除了层数太多的问题</p></blockquote><p>Kubernetes 本身已经很复杂, Istio 为了更高层控制的抽象, 又增加了很多概念. 
复杂度堪比kubernetes.</p><p>可以看出istio 设计精良, 在处理微服务的复杂场景时有很多优秀之处, 不过目前istio的短板还是很明显, 高度的抽象带来了很多性能的损耗, 社区现在也有很多优化的方向, 像蚂蚁金服开源的SofaMesh 主要是去精简层, 试图在sidecar里去做很多mixer 的事情, 减少sidecar和mixer的同步请求依赖, 而一些其他的sidecar 网络方案, 更多的是考虑去优化层, 优化sidecar 这一层的性能开销.</p><p>在Istio 1.0 之前, 主要还是以功能的实现为主, 不过后面随着社区的积极投入, 相信Istio的性能会有长足的提升.</p><p>笔者之前从事过多年的服务治理相关的工作, 过程中切身体会到微服务治理的痛点, 所以也比较关注 service mesh的发展, 个人对istio也非常看好, 刚好今年我们中心容器产品也有这方面的计划, 期待我们能在这个方向进行一些产品和技术的深耕.</p><hr><p><img src="https://zhongfox.github.io/assets/images/istio/last.png"></p><hr><p>参考资料:</p><ul><li><a href="http://www.servicemesher.com/" target="_blank" rel="noopener">servicemesher 中文社区</a></li><li><a href="https://thenewstack.io/why-you-should-care-about-istio-gateways/" target="_blank" rel="noopener">Why You Should Care About Istio Gateways</a></li><li><a href="http://philcalcado.com/2017/08/03/pattern_service_mesh.html" target="_blank" rel="noopener">Pattern: Service Mesh</a></li><li><a href="https://github.com/istio/istio/wiki/Mixer-Out-Of-Process-Adapter-Dev-Guide" target="_blank" rel="noopener">Mixer Out Of Process Adapter Dev Guide</a></li><li><a href="https://github.com/istio/istio/wiki/Mixer-Out-Of-Process-Adapter-Walkthrough" target="_blank" rel="noopener">Mixer Out of Process Adapter Walkthrough</a></li><li><a href="http://www.servicemesher.com/blog/envoy-xds-protocol" target="_blank" rel="noopener">Envoy 中的 xDS REST 和 gRPC 协议详解</a></li><li><a href="https://preliminary.istio.io/blog/2018/delayering-istio/delayering-istio/" target="_blank" rel="noopener">Delayering Istio with AppSwitch</a></li></ul>]]></content>
<summary type="html">
<p>作者: <a href="https://github.com/zhongfox" target="_blank" rel="noopener">钟华</a></p>
<p><img src="https://github.com/TencentCloudContainer
</summary>
</entry>
<entry>
<title>Kubernetes 流量复制方案</title>
<link href="https://TencentCloudContainerTeam.github.io/2019/01/10/k8s-traffic-copy/"/>
<id>https://TencentCloudContainerTeam.github.io/2019/01/10/k8s-traffic-copy/</id>
<published>2019-01-10T02:17:37.000Z</published>
<updated>2020-06-16T01:53:49.339Z</updated>
<content type="html"><![CDATA[<p>作者:田小康</p><h1 id="背景"><a href="#背景" class="headerlink" title="背景"></a>背景</h1><p>测试环境没有真实的数据, 会导致很多测试工作难以展开, 尤其是一些测试任务需要使用生产环境来做时, 会极大影响现网的稳定性。</p><p>我们需要一个流量复制方案, 将现网流量复制到预发布/测试环境</p><p><img src="https://github.com/TencentCloudContainerTeam/TencentCloudContainerTeam.github.io/raw/develop/source/_posts/res/k8s-traffic-copy/traffic-copy-diagram.png" alt="流量复制示意"></p><h3 id="期望"><a href="#期望" class="headerlink" title="期望"></a>期望</h3><ul><li>将线上请求拷贝一份到预发布/测试环境</li><li>不影响现网请求</li><li>可配置流量复制比例, 毕竟测试环境资源有限</li><li>零代码改动</li></ul><h1 id="方案"><a href="#方案" class="headerlink" title="方案"></a>方案</h1><p><img src="https://github.com/TencentCloudContainerTeam/TencentCloudContainerTeam.github.io/raw/develop/source/_posts/res/k8s-traffic-copy/k8s-traffic-copy-diagram.png" alt="Kubernetes 流量复制方案"></p><ul><li>承载入口流量的 Pod 新增一个 <code>Nginx 容器</code> 接管流量</li><li><a href="http://nginx.org/en/docs/http/ngx_http_mirror_module.html" target="_blank" rel="noopener">Nginx Mirror</a> 模块会将流量复制一份并 proxy 到指定 URL (测试环境)</li><li><code>Nginx mirror</code> 复制流量不会影响正常请求处理流程, 镜像请求的 Resp 会被 Nginx 丢弃</li><li><code>K8s Service</code> 按照 <code>Label Selector</code> 去选择请求分发的 Pod, 意味着不同Pod, 只要有相同 <code>Label</code>, 就可以协同处理请求</li><li>通过控制有 <code>Mirror 功能的 Pod</code> 和 <code>正常的 Pod</code> 的比例, 便可以配置流量复制的比例</li></ul><p>我们的部署环境为 <a href="https://cloud.tencent.com/product/tke" target="_blank" rel="noopener">腾讯云容器服务</a>, 不过所述方案是普适于 <code>Kubernetes</code> 环境的.</p><h1 id="实现"><a href="#实现" class="headerlink" title="实现"></a>实现</h1><p>PS: 下文假定读者了解</p><ul><li><a href="https://kubernetes.io/docs/concepts/" target="_blank" rel="noopener">Kubernetes</a> 以及 YAML</li><li><a href="https://helm.sh/" target="_blank" rel="noopener">Helm</a></li><li><a href="https://www.nginx.com/" target="_blank" rel="noopener">Nginx</a></li></ul><h3 id="Nginx-镜像"><a href="#Nginx-镜像" class="headerlink" title="Nginx 镜像"></a>Nginx 镜像</h3><p>使用 Nginx 官方镜像便已经预装了 Mirror 插件</p><p>即: <code>docker pull nginx</code></p><p><code>yum install nginx</code> 安装的版本貌似没有 Mirror 插件的哦, 需要自己装</p><h3 id="Nginx-ConfigMap"><a href="#Nginx-ConfigMap" class="headerlink" title="Nginx ConfigMap"></a>Nginx ConfigMap</h3><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span 
class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br></pre></td><td class="code"><pre><span class="line"><span class="attr">kind:</span> <span class="string">ConfigMap</span></span><br><span class="line"><span class="attr">metadata:</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">entrance-nginx-config</span></span><br><span class="line"><span class="attr"> namespace:</span> <span class="string">default</span></span><br><span class="line"><span class="attr">apiVersion:</span> <span class="string">v1</span></span><br><span class="line"><span class="attr">data:</span></span><br><span class="line"> <span class="string">nginx.conf:</span> <span class="string">|-</span></span><br><span class="line"> <span class="string">worker_processes</span> <span class="string">auto;</span></span><br><span class="line"></span><br><span class="line"> <span class="string">error_log</span> <span class="string">/data/athena/logs/entrance/nginx-error.log;</span></span><br><span class="line"></span><br><span class="line"> <span class="string">events</span> <span class="string">{</span></span><br><span class="line"> <span class="string">worker_connections</span> <span class="number">1024</span><span class="string">;</span></span><br><span class="line"> <span class="string">}</span></span><br><span class="line"></span><br><span class="line"> <span class="string">http</span> <span class="string">{</span></span><br><span class="line"> <span class="string">default_type</span> <span class="string">application/octet-stream;</span></span><br><span class="line"> <span class="string">sendfile</span> <span class="string">on;</span></span><br><span class="line"> <span class="string">keepalive_timeout</span> <span class="number">65</span><span class="string">;</span></span><br><span class="line"></span><br><span class="line"> <span class="string">server</span> <span class="string">{</span></span><br><span class="line"> <span class="string">access_log</span> <span class="string">/data/athena/logs/entrance/nginx-access.log;</span></span><br><span class="line"></span><br><span class="line"> <span class="string">listen</span> <span class="string">{{</span> <span class="string">.Values.entrance.service.nodePort</span> <span class="string">}};</span></span><br><span class="line"> <span class="string">server_name</span> <span class="string">entrance;</span></span><br><span class="line"></span><br><span class="line"> <span class="string">location</span> <span class="string">/</span> <span class="string">{</span></span><br><span class="line"> <span class="string">root</span> <span class="string">html;</span></span><br><span class="line"> <span class="string">index</span> <span class="string">index.html</span> <span class="string">index.htm;</span></span><br><span class="line"> <span class="string">}</span></span><br><span class="line"></span><br><span class="line"> <span class="string">location</span> <span class="string">/entrance/</span> <span class="string">{</span></span><br><span class="line"> <span class="string">mirror</span> <span class="string">/mirror;</span></span><br><span class="line"> <span class="string">access_log</span> <span class="string">/data/athena/logs/entrance/nginx-entrance-access.log;</span></span><br><span class="line"> <span 
class="string">proxy_pass</span> <span class="attr">http://localhost:{{</span> <span class="string">.Values.entrance.service.nodePortMirror</span> <span class="string">}}/;</span></span><br><span class="line"> <span class="string">}</span></span><br><span class="line"></span><br><span class="line"> <span class="string">location</span> <span class="string">/mirror</span> <span class="string">{</span></span><br><span class="line"> <span class="string">internal;</span></span><br><span class="line"> <span class="string">access_log</span> <span class="string">/data/athena/logs/entrance/nginx-mirror-access.log;</span></span><br><span class="line"> <span class="string">proxy_pass</span> <span class="string">{{</span> <span class="string">.Values.entrance.mirrorProxyPass</span> <span class="string">}};</span></span><br><span class="line"> <span class="string">}</span></span><br><span class="line"></span><br><span class="line"> <span class="string">error_page</span> <span class="number">500</span> <span class="number">502</span> <span class="number">503</span> <span class="number">504</span> <span class="string">/50x.html;</span></span><br><span class="line"> <span class="string">location</span> <span class="string">=</span> <span class="string">/50x.html</span> <span class="string">{</span></span><br><span class="line"> <span class="string">root</span> <span class="string">html;</span></span><br><span class="line"> <span class="string">}</span></span><br><span class="line"> <span class="string">}</span></span><br><span class="line"> <span class="string">}</span></span><br></pre></td></tr></table></figure><p>其中重点部分如下:</p><p><img src="https://github.com/TencentCloudContainerTeam/TencentCloudContainerTeam.github.io/raw/develop/source/_posts/res/k8s-traffic-copy/nginx-config.png" alt=""></p><h3 id="业务方容器-Nginx-Mirror"><a href="#业务方容器-Nginx-Mirror" class="headerlink" title="业务方容器 + Nginx Mirror"></a>业务方容器 + Nginx Mirror</h3><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span 
class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span class="line">67</span><br><span class="line">68</span><br><span class="line">69</span><br><span class="line">70</span><br><span class="line">71</span><br><span class="line">72</span><br><span class="line">73</span><br><span class="line">74</span><br><span class="line">75</span><br><span class="line">76</span><br><span class="line">77</span><br><span class="line">78</span><br><span class="line">79</span><br><span class="line">80</span><br><span class="line">81</span><br><span class="line">82</span><br><span class="line">83</span><br><span class="line">84</span><br><span class="line">85</span><br><span class="line">86</span><br><span class="line">87</span><br><span class="line">88</span><br><span class="line">89</span><br><span class="line">90</span><br><span class="line">91</span><br><span class="line">92</span><br><span class="line">93</span><br><span class="line">94</span><br><span class="line">95</span><br><span class="line">96</span><br><span class="line">97</span><br><span class="line">98</span><br><span class="line">99</span><br><span class="line">100</span><br><span class="line">101</span><br><span class="line">102</span><br><span class="line">103</span><br><span class="line">104</span><br><span class="line">105</span><br><span class="line">106</span><br><span class="line">107</span><br><span class="line">108</span><br><span class="line">109</span><br><span class="line">110</span><br><span class="line">111</span><br><span class="line">112</span><br><span class="line">113</span><br><span class="line">114</span><br><span class="line">115</span><br><span class="line">116</span><br><span class="line">117</span><br><span class="line">118</span><br><span class="line">119</span><br><span class="line">120</span><br><span class="line">121</span><br><span class="line">122</span><br><span class="line">123</span><br><span class="line">124</span><br><span class="line">125</span><br><span class="line">126</span><br><span class="line">127</span><br><span class="line">128</span><br><span class="line">129</span><br><span class="line">130</span><br><span class="line">131</span><br><span class="line">132</span><br><span class="line">133</span><br><span class="line">134</span><br><span class="line">135</span><br><span class="line">136</span><br><span class="line">137</span><br><span class="line">138</span><br><span class="line">139</span><br><span class="line">140</span><br><span class="line">141</span><br><span class="line">142</span><br><span class="line">143</span><br></pre></td><td class="code"><pre><span class="line"><span class="string">{{-</span> <span class="string">if</span> <span class="string">.Values.entrance.mirrorEnable</span> <span class="string">}}</span></span><br><span class="line"><span class="attr">apiVersion:</span> <span class="string">extensions/v1beta1</span></span><br><span class="line"><span class="attr">kind:</span> <span class="string">Deployment</span></span><br><span class="line"><span 
class="attr">metadata:</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">entrance-mirror</span></span><br><span class="line"><span class="attr">spec:</span></span><br><span class="line"><span class="attr"> replicas:</span> <span class="string">{{</span> <span class="string">.Values.entrance.mirrorReplicaCount</span> <span class="string">}}</span></span><br><span class="line"><span class="attr"> template:</span></span><br><span class="line"><span class="attr"> metadata:</span></span><br><span class="line"><span class="attr"> labels:</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">entrance</span></span><br><span class="line"><span class="attr"> spec:</span></span><br><span class="line"><span class="attr"> affinity:</span></span><br><span class="line"><span class="attr"> podAntiAffinity:</span></span><br><span class="line"><span class="attr"> preferredDuringSchedulingIgnoredDuringExecution:</span></span><br><span class="line"><span class="attr"> - weight:</span> <span class="number">1</span></span><br><span class="line"><span class="attr"> podAffinityTerm:</span></span><br><span class="line"><span class="attr"> labelSelector:</span></span><br><span class="line"><span class="attr"> matchExpressions:</span></span><br><span class="line"><span class="attr"> - key:</span> <span class="string">"name"</span></span><br><span class="line"><span class="attr"> operator:</span> <span class="string">In</span></span><br><span class="line"><span class="attr"> values:</span></span><br><span class="line"><span class="bullet"> -</span> <span class="string">entrance</span></span><br><span class="line"><span class="attr"> topologyKey:</span> <span class="string">"kubernetes.io/hostname"</span></span><br><span class="line"><span class="attr"> initContainers:</span></span><br><span class="line"><span class="attr"> - name:</span> <span class="string">init-kafka</span></span><br><span class="line"><span class="attr"> image:</span> <span class="string">"centos-dev"</span></span><br><span class="line"> <span class="string">{{-</span> <span class="string">if</span> <span class="string">.Values.delay</span> <span class="string">}}</span></span><br><span class="line"><span class="attr"> command:</span> <span class="string">['bash',</span> <span class="string">'-c'</span><span class="string">,</span> <span class="string">'sleep 480s; until nslookup athena-cp-kafka; do echo "waiting for athena-cp-kafka"; sleep 2; done;'</span><span class="string">]</span></span><br><span class="line"> <span class="string">{{-</span> <span class="string">else</span> <span class="string">}}</span></span><br><span class="line"><span class="attr"> command:</span> <span class="string">['bash',</span> <span class="string">'-c'</span><span class="string">,</span> <span class="string">'until nslookup athena-cp-kafka; do echo "waiting for athena-cp-kafka"; sleep 2; done;'</span><span class="string">]</span></span><br><span class="line"> <span class="string">{{-</span> <span class="string">end</span> <span class="string">}}</span></span><br><span class="line"></span><br><span class="line"><span class="attr"> containers:</span></span><br><span class="line"><span class="attr"> - image:</span> <span class="string">"<span class="template-variable">{{ .Values.entrance.image.repository }}</span>:<span class="template-variable">{{ .Values.entrance.image.tag }}</span>"</span></span><br><span class="line"><span class="attr"> name:</span> <span 
class="string">entrance</span></span><br><span class="line"><span class="attr"> ports:</span></span><br><span class="line"><span class="attr"> - containerPort:</span> <span class="string">{{</span> <span class="string">.Values.entrance.service.nodePort</span> <span class="string">}}</span></span><br><span class="line"><span class="attr"> env:</span></span><br><span class="line"><span class="attr"> - name:</span> <span class="string">ATHENA_KAFKA_BOOTSTRAP</span></span><br><span class="line"><span class="attr"> value:</span> <span class="string">"<span class="template-variable">{{ .Values.kafka.kafkaBootstrap }}</span>"</span></span><br><span class="line"><span class="attr"> - name:</span> <span class="string">ATHENA_KAFKA_SCHEMA_REGISTRY_URL</span></span><br><span class="line"><span class="attr"> value:</span> <span class="string">"<span class="template-variable">{{ .Values.kafka.kafkaSchemaRegistryUrl }}</span>"</span></span><br><span class="line"><span class="attr"> - name:</span> <span class="string">ATHENA_PG_CONN</span></span><br><span class="line"><span class="attr"> value:</span> <span class="string">"<span class="template-variable">{{ .Values.pg.pgConn }}</span>"</span></span><br><span class="line"><span class="attr"> - name:</span> <span class="string">ATHENA_COS_CONN</span></span><br><span class="line"><span class="attr"> value:</span> <span class="string">"<span class="template-variable">{{ .Values.cos.cosConn }}</span>"</span></span><br><span class="line"><span class="attr"> - name:</span> <span class="string">ATHENA_DEPLOY_TYPE</span></span><br><span class="line"><span class="attr"> value:</span> <span class="string">"<span class="template-variable">{{ .Values.deployType }}</span>"</span></span><br><span class="line"><span class="attr"> - name:</span> <span class="string">ATHENA_TPS_SYS_ID</span></span><br><span class="line"><span class="attr"> value:</span> <span class="string">"<span class="template-variable">{{ .Values.tps.tpsSysId }}</span>"</span></span><br><span class="line"><span class="attr"> - name:</span> <span class="string">ATHENA_TPS_SYS_SECRET</span></span><br><span class="line"><span class="attr"> value:</span> <span class="string">"<span class="template-variable">{{ .Values.tps.tpsSysSecret }}</span>"</span></span><br><span class="line"><span class="attr"> - name:</span> <span class="string">ATHENA_TPS_BASE_URL</span></span><br><span class="line"><span class="attr"> value:</span> <span class="string">"<span class="template-variable">{{ .Values.tps.tpsBaseUrl }}</span>"</span></span><br><span class="line"><span class="attr"> - name:</span> <span class="string">ATHENA_TPS_RESOURCE_FLOW_PERIOD_SEC</span></span><br><span class="line"><span class="attr"> value:</span> <span class="string">"<span class="template-variable">{{ .Values.tps.tpsResourceFlowPeriodSec }}</span>"</span></span><br><span class="line"><span class="attr"> - name:</span> <span class="string">ATHENA_CLUSTER</span></span><br><span class="line"><span class="attr"> value:</span> <span class="string">"<span class="template-variable">{{ .Values.cluster }}</span>"</span></span><br><span class="line"><span class="attr"> - name:</span> <span class="string">ATHENA_POD_NAME</span></span><br><span class="line"><span class="attr"> valueFrom:</span></span><br><span class="line"><span class="attr"> fieldRef:</span></span><br><span class="line"><span class="attr"> fieldPath:</span> <span class="string">metadata.name</span></span><br><span class="line"><span class="attr"> - name:</span> <span 
class="string">ATHENA_HOST_IP</span></span><br><span class="line"><span class="attr"> valueFrom:</span></span><br><span class="line"><span class="attr"> fieldRef:</span></span><br><span class="line"><span class="attr"> fieldPath:</span> <span class="string">status.hostIP</span></span><br><span class="line"><span class="attr"> - name:</span> <span class="string">ATHENA_POD_IP</span></span><br><span class="line"><span class="attr"> valueFrom:</span></span><br><span class="line"><span class="attr"> fieldRef:</span></span><br><span class="line"><span class="attr"> fieldPath:</span> <span class="string">status.podIP</span></span><br><span class="line"></span><br><span class="line"><span class="attr"> command:</span> <span class="string">['/bin/bash',</span> <span class="string">'/data/service/go_workspace/script/start-entrance.sh'</span><span class="string">,</span> <span class="string">'-host 0.0.0.0:<span class="template-variable">{{ .Values.entrance.service.nodePortMirror }}</span>'</span><span class="string">]</span></span><br><span class="line"></span><br><span class="line"><span class="attr"> volumeMounts:</span></span><br><span class="line"><span class="attr"> - mountPath:</span> <span class="string">/data/athena/</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">athena</span></span><br><span class="line"><span class="attr"> readOnly:</span> <span class="literal">false</span></span><br><span class="line"></span><br><span class="line"><span class="attr"> imagePullPolicy:</span> <span class="string">IfNotPresent</span></span><br><span class="line"></span><br><span class="line"><span class="attr"> resources:</span></span><br><span class="line"><span class="attr"> limits:</span></span><br><span class="line"><span class="attr"> cpu:</span> <span class="number">3000</span><span class="string">m</span></span><br><span class="line"><span class="attr"> memory:</span> <span class="number">800</span><span class="string">Mi</span></span><br><span class="line"><span class="attr"> requests:</span></span><br><span class="line"><span class="attr"> cpu:</span> <span class="number">100</span><span class="string">m</span></span><br><span class="line"><span class="attr"> memory:</span> <span class="number">100</span><span class="string">Mi</span></span><br><span class="line"></span><br><span class="line"><span class="attr"> livenessProbe:</span></span><br><span class="line"><span class="attr"> exec:</span></span><br><span class="line"><span class="attr"> command:</span></span><br><span class="line"><span class="bullet"> -</span> <span class="string">bash</span></span><br><span class="line"><span class="bullet"> -</span> <span class="string">/data/service/go_workspace/script/health-check/check-entrance.sh</span></span><br><span class="line"><span class="attr"> initialDelaySeconds:</span> <span class="number">120</span></span><br><span class="line"><span class="attr"> periodSeconds:</span> <span class="number">60</span></span><br><span class="line"></span><br><span class="line"><span class="attr"> - image:</span> <span class="string">"<span class="template-variable">{{ .Values.nginx.image.repository }}</span>:<span class="template-variable">{{ .Values.nginx.image.tag }}</span>"</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">entrance-mirror</span></span><br><span class="line"><span class="attr"> ports:</span></span><br><span class="line"><span class="attr"> - containerPort:</span> <span class="string">{{</span> <span 
class="string">.Values.entrance.service.nodePort</span> <span class="string">}}</span></span><br><span class="line"></span><br><span class="line"><span class="attr"> volumeMounts:</span></span><br><span class="line"><span class="attr"> - mountPath:</span> <span class="string">/data/athena/</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">athena</span></span><br><span class="line"><span class="attr"> readOnly:</span> <span class="literal">false</span></span><br><span class="line"><span class="attr"> - mountPath:</span> <span class="string">/etc/nginx/nginx.conf</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">nginx-config</span></span><br><span class="line"><span class="attr"> subPath:</span> <span class="string">nginx.conf</span></span><br><span class="line"></span><br><span class="line"><span class="attr"> imagePullPolicy:</span> <span class="string">IfNotPresent</span></span><br><span class="line"></span><br><span class="line"><span class="attr"> resources:</span></span><br><span class="line"><span class="attr"> limits:</span></span><br><span class="line"><span class="attr"> cpu:</span> <span class="number">1000</span><span class="string">m</span></span><br><span class="line"><span class="attr"> memory:</span> <span class="number">500</span><span class="string">Mi</span></span><br><span class="line"><span class="attr"> requests:</span></span><br><span class="line"><span class="attr"> cpu:</span> <span class="number">100</span><span class="string">m</span></span><br><span class="line"><span class="attr"> memory:</span> <span class="number">100</span><span class="string">Mi</span></span><br><span class="line"></span><br><span class="line"><span class="attr"> livenessProbe:</span></span><br><span class="line"><span class="attr"> tcpSocket:</span></span><br><span class="line"><span class="attr"> port:</span> <span class="string">{{</span> <span class="string">.Values.entrance.service.nodePort</span> <span class="string">}}</span></span><br><span class="line"><span class="attr"> timeoutSeconds:</span> <span class="number">3</span></span><br><span class="line"><span class="attr"> initialDelaySeconds:</span> <span class="number">60</span></span><br><span class="line"><span class="attr"> periodSeconds:</span> <span class="number">60</span></span><br><span class="line"></span><br><span class="line"><span class="attr"> terminationGracePeriodSeconds:</span> <span class="number">10</span></span><br><span class="line"></span><br><span class="line"><span class="attr"> nodeSelector:</span></span><br><span class="line"><span class="attr"> entrance:</span> <span class="string">"true"</span></span><br><span class="line"></span><br><span class="line"><span class="attr"> volumes:</span></span><br><span class="line"><span class="attr"> - name:</span> <span class="string">athena</span></span><br><span class="line"><span class="attr"> hostPath:</span></span><br><span class="line"><span class="attr"> path:</span> <span class="string">"/data/athena/"</span></span><br><span class="line"><span class="attr"> - name:</span> <span class="string">nginx-config</span></span><br><span class="line"><span class="attr"> configMap:</span></span><br><span class="line"><span class="attr"> name:</span> <span class="string">entrance-nginx-config</span></span><br><span class="line"></span><br><span class="line"><span class="attr"> imagePullSecrets:</span></span><br><span class="line"><span class="attr"> - name:</span> <span class="string">"<span 
class="template-variable">{{ .Values.imagePullSecrets }}</span>"</span></span><br><span class="line"><span class="string">{{-</span> <span class="string">end</span> <span class="string">}}</span></span><br></pre></td></tr></table></figure><p>上面为真实在业务中使用的 Deployment 配置, 有些地方可以参考:</p><ul><li><code>valueFrom.fieldRef.fieldPath</code> 可以取到容器运行时的一些字段, 如 <code>NodeIP</code>, <code>PodIP</code> 这些可以用于全链路监控</li><li><code>ConfigMap</code> 直接 Mount 到文件系统, 覆盖默认配置的例子</li><li><code>affinity.podAntiAffinity</code> 亲和性调度, 使 Pod 在主机间均匀分布</li><li>使用了 <code>tcpSocket</code> 和 <code>exec.command</code> 两种健康检查方式</li></ul><h3 id="Helm-Values"><a href="#Helm-Values" class="headerlink" title="Helm Values"></a>Helm Values</h3><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># entrance, Athena 上报入口模块</span></span><br><span class="line"><span class="attr">entrance:</span></span><br><span class="line"><span class="attr"> enable:</span> <span class="literal">true</span></span><br><span class="line"><span class="attr"> replicaCount:</span> <span class="number">3</span></span><br><span class="line"><span class="attr"> mirrorEnable:</span> <span class="literal">true</span></span><br><span class="line"><span class="attr"> mirrorReplicaCount:</span> <span class="number">1</span></span><br><span class="line"><span class="attr"> mirrorProxyPass:</span> <span class="string">"http://10.16.0.147/entrance/"</span></span><br><span class="line"><span class="attr"> image:</span></span><br><span class="line"><span class="attr"> repository:</span> <span class="string">athena-go</span></span><br><span class="line"><span class="attr"> tag:</span> <span class="string">v1901091026</span></span><br><span class="line"><span class="attr"> service:</span></span><br><span class="line"><span class="attr"> nodePort:</span> <span class="number">30081</span></span><br><span class="line"><span class="attr"> nodePortMirror:</span> <span class="number">30082</span></span><br></pre></td></tr></table></figure><p>如上, <code>replicaCount: 3</code> + <code>mirrorReplicaCount: 1</code> = 4 个容器, 有 1/4 流量复制到 <code>http://10.16.0.147/entrance/</code></p><h3 id="内网负载均衡"><a href="#内网负载均衡" class="headerlink" title="内网负载均衡"></a>内网负载均衡</h3><p>流量复制到测试环境时, 尽量使用内网负载均衡, 为了成本, 安全及性能方面的考虑</p><p><img src="https://github.com/TencentCloudContainerTeam/TencentCloudContainerTeam.github.io/raw/develop/source/_posts/res/k8s-traffic-copy/lb-inner.png" alt="LB-inner-config"></p><h1 id="总结"><a href="#总结" class="headerlink" title="总结"></a>总结</h1><p>通过下面几个步骤, 便可以实现流量复制啦</p><ul><li>建一个内网负载均衡, 暴漏测试环境的 <code>服务入口 Service</code></li><li><code>服务入口 Service</code> 需要有可以更换端口号的能力 (例如命令行参数/环境变量)</li><li>线上环境, 新增一个 Deployment, Label 和之前的 <code>服务入口 Service</code> 一样, 只是端口号分配一个新的</li><li>为新增的 Deployment 增加一个 Nginx 容器, 配置 nginx.conf</li><li>调节有 <code>Nginx Mirror</code> 的 Pod 和 正常的 <code>Pod</code> 比例, 便可以实现<code>按比例流量复制</code></li></ul>]]></content>
<summary type="html">
<p>作者:田小康</p>
<h1 id="背景"><a href="#背景" class="headerlink" title="背景"></a>背景</h1><p>测试环境没有真实的数据, 会导致很多测试工作难以展开, 尤其是一些测试任务需要使用生产环境来做时, 会极大影响现
</summary>
</entry>
<entry>
<title>Cgroup泄漏--潜藏在你的集群中</title>
<link href="https://TencentCloudContainerTeam.github.io/2018/12/29/cgroup-leaking/"/>
<id>https://TencentCloudContainerTeam.github.io/2018/12/29/cgroup-leaking/</id>
<published>2018-12-29T09:00:00.000Z</published>
<updated>2020-06-16T01:53:49.335Z</updated>
<content type="html"><![CDATA[<p>作者: <a href="https://github.com/honkiko" target="_blank" rel="noopener">洪志国</a></p><h2 id="前言"><a href="#前言" class="headerlink" title="前言"></a>前言</h2><p>绝大多数的kubernetes集群都有这个隐患。只不过一般情况下,泄漏得比较慢,还没有表现出来而已。 </p><p>一个pod可能泄漏两个memory cgroup数量配额。即使pod百分之百发生泄漏, 那也需要一个节点销毁过三万多个pod之后,才会造成后续pod创建失败。</p><p>一旦表现出来,这个节点就彻底不可用了,必须重启才能恢复。</p><h2 id="故障表现"><a href="#故障表现" class="headerlink" title="故障表现"></a>故障表现</h2><p>腾讯云SCF(Serverless Cloud Function)底层使用我们的TKE(Tencent Kubernetes Engine),并且会在节点上频繁创建和消耗容器。</p><p>SCF发现很多节点会出现类似以下报错,创建POD总是失败:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">Dec 24 11:54:31 VM_16_11_centos dockerd[11419]: time="2018-12-24T11:54:31.195900301+08:00" level=error msg="Handler for POST /v1.31/containers/b98d4aea818bf9d1d1aa84079e1688cd9b4218e008c58a8ef6d6c3c106403e7b/start returned error: OCI runtime create failed: container_linux.go:348: starting container process caused \"process_linux.go:279: applying cgroup configuration for process caused \\\"mkdir /sys/fs/cgroup/memory/kubepods/burstable/pod79fe803c-072f-11e9-90ca-525400090c71/b98d4aea818bf9d1d1aa84079e1688cd9b4218e008c58a8ef6d6c3c106403e7b: no space left on device\\\"\": unknown"</span><br></pre></td></tr></table></figure><p>这个时候,到节点上尝试创建几十个memory cgroup (以root权限执行 <code>for i in</code>seq 1 20<code>;do mkdir /sys/fs/cgroup/memory/${i}; done</code>),就会碰到失败:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">mkdir: cannot create directory '/sys/fs/cgroup/memory/8': No space left on device</span><br></pre></td></tr></table></figure><p>其实,dockerd出现以上报错时, 手动创建<strong><em>一个</em></strong>memory cgroup都会失败的。 不过有时候随着一些POD的运行结束,可能会多出来一些“配额”,所以这里是尝试创建20个memory cgroup。</p><p>出现这样的故障以后,重启docker,释放内存等措施都没有效果,只有重启节点才能恢复。</p><h2 id="复现条件"><a href="#复现条件" class="headerlink" title="复现条件"></a>复现条件</h2><p>docker和kubernetes社区都有关于这个问题的issue:</p><ul><li><a href="https://github.com/moby/moby/issues/29638" target="_blank" rel="noopener">https://github.com/moby/moby/issues/29638</a></li><li><a href="https://github.com/kubernetes/kubernetes/issues/70324" target="_blank" rel="noopener">https://github.com/kubernetes/kubernetes/issues/70324</a></li></ul><p>网上有文章介绍了类似问题的分析和复现方法。如:<br><a href="http://www.linuxfly.org/kubernetes-19-conflict-with-centos7/?from=groupmessage" target="_blank" rel="noopener">http://www.linuxfly.org/kubernetes-19-conflict-with-centos7/?from=groupmessage</a></p><p>不过按照文中的复现方法,我在<code>3.10.0-862.9.1.el7.x86_64</code>版本内核上并没有复现出来。</p><p>经过反复尝试,总结出了必现的复现条件。 一句话感慨就是,把进程加入到一个开启了kmem accounting的memory cgroup<strong><em>并且执行fork系统调用</em></strong>。</p><ol><li>centos 3.10.0-862.9.1.el7.x86_64及以下内核, 4G以上空闲内存,root权限。</li><li><p>把系统memory cgroup配额占满</p> <figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">for i in `seq 1 65536`;do mkdir /sys/fs/cgroup/memory/${i}; done</span><br></pre></td></tr></table></figure><p> 会看到报错:</p> <figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">mkdir: cannot create directory ‘/sys/fs/cgroup/memory/65530’: No space left on device</span><br></pre></td></tr></table></figure><p> 这是因为这个版本内核写死了,最多只能有65535个memory cgroup共存。 systemd已经创建了一些,所以这里创建不到65535个就会遇到报错。</p><p> 确认删掉一个memory cgroup, 
就能腾出一个“配额”:</p> <figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">rmdir /sys/fs/cgroup/memory/1</span><br><span class="line">mkdir /sys/fs/cgroup/memory/test</span><br></pre></td></tr></table></figure></li><li><p>给一个memory cgroup开启kmem accounting</p> <figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">cd /sys/fs/cgroup/memory/test/</span><br><span class="line">echo 1 > memory.kmem.limit_in_bytes</span><br><span class="line">echo -1 > memory.kmem.limit_in_bytes</span><br></pre></td></tr></table></figure></li><li><p>把一个进程加进某个memory cgroup, 并执行一次fork系统调用</p> <figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line">最简单的就是把当前shell进程加进去:</span><br><span class="line">echo $$ > /sys/fs/cgroup/memory/test/tasks</span><br><span class="line">sleep 100 &</span><br><span class="line">cat /sys/fs/cgroup/memory/test/tasks</span><br></pre></td></tr></table></figure></li><li><p>把该memory cgroup里面的进程都挪走</p> <figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">for p in `cat /sys/fs/cgroup/memory/test/tasks`;do echo ${p} > /sys/fs/cgroup/memory/tasks; done</span><br><span class="line"></span><br><span class="line">cat /sys/fs/cgroup/memory/test/tasks //这时候应该为空</span><br></pre></td></tr></table></figure></li><li><p>删除这个memory cgroup</p> <figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">rmdir /sys/fs/cgroup/memory/test</span><br></pre></td></tr></table></figure></li><li><p>验证刚才删除一个memory cgroup, 所占的配额并没有释放</p> <figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">mkdir /sys/fs/cgroup/memory/xx</span><br></pre></td></tr></table></figure><p> 这时候会报错:<code>mkdir: cannot create directory ‘/sys/fs/cgroup/memory/xx’: No space left on device</code></p></li></ol><h2 id="什么版本的内核有这个问题"><a href="#什么版本的内核有这个问题" class="headerlink" title="什么版本的内核有这个问题"></a>什么版本的内核有这个问题</h2><p>搜索内核commit记录,有一个commit应该是解决类似问题的:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">4bdfc1c4a943: 2015-01-08 memcg: fix destination cgroup leak on task charges migration [Vladimir Davydov]</span><br></pre></td></tr></table></figure><p>这个commit在3.19以及4.x版本的内核中都已经包含。 不过从docker和kubernetes相关issue里面的反馈来看,内核中应该还有其他cgroup泄漏的代码路径, 4.14版本内核都还有cgroup泄漏问题。</p><h2 id="规避办法"><a href="#规避办法" class="headerlink" title="规避办法"></a>规避办法</h2><p>不开启kmem accounting (以上复现步骤的第3步)的话,是不会发生cgroup泄漏的。</p><p>kubelet和runc都会给memory cgroup开启kmem accounting。所以要规避这个问题,就要保证kubelet和runc,都别开启kmem accounting。下面分别进行说明。 </p><h2 id="runc"><a href="#runc" class="headerlink" title="runc"></a>runc</h2><p>查看代码,发现在commit fe898e7 (2017-2-25, PR #1350)以后的runc版本中,都会默认开启kmem accounting。代码在libcontainer/cgroups/fs/kmem.go: (老一点的版本,代码在libcontainer/cgroups/fs/memory.go)<br><figure 
class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br></pre></td><td class="code"><pre><span class="line">const cgroupKernelMemoryLimit = "memory.kmem.limit_in_bytes"</span><br><span class="line"></span><br><span class="line">func EnableKernelMemoryAccounting(path string) error {</span><br><span class="line"> // Ensure that kernel memory is available in this kernel build. If it</span><br><span class="line"> // isn't, we just ignore it because EnableKernelMemoryAccounting is</span><br><span class="line"> // automatically called for all memory limits.</span><br><span class="line"> if !cgroups.PathExists(filepath.Join(path, cgroupKernelMemoryLimit)) {</span><br><span class="line"> return nil</span><br><span class="line"> }</span><br><span class="line"> // We have to limit the kernel memory here as it won't be accounted at all</span><br><span class="line"> // until a limit is set on the cgroup and limit cannot be set once the</span><br><span class="line"> // cgroup has children, or if there are already tasks in the cgroup.</span><br><span class="line"> for _, i := range []int64{1, -1} {</span><br><span class="line"> if err := setKernelMemory(path, i); err != nil {</span><br><span class="line"> return err</span><br><span class="line"> }</span><br><span class="line"> }</span><br><span class="line"> return nil</span><br><span class="line">}</span><br></pre></td></tr></table></figure></p><p>runc社区也注意到这个问题,并做了比较灵活的修复: <a href="https://github.com/opencontainers/runc/pull/1921" target="_blank" rel="noopener">https://github.com/opencontainers/runc/pull/1921</a></p><p>这个修复给runc增加了”nokmem”编译选项。缺省的release版本没有使用这个选项。 自己使用nokmem选项编译runc的方法:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">cd $GO_PATH/src/github.com/opencontainers/runc/</span><br><span class="line">make BUILDTAGS="seccomp nokmem"</span><br></pre></td></tr></table></figure><h2 id="kubelet"><a href="#kubelet" class="headerlink" title="kubelet"></a>kubelet</h2><p>kubelet在创建pod对应的cgroup目录时,也会调用libcontianer中的代码对cgroup做设置。在 <code>pkg/kubelet/cm/cgroup_manager_linux.go</code>的Create方法中,会调用Manager.Apply方法,最终调用<code>vendor/github.com/opencontainers/runc/libcontainer/cgroups/fs/memory.go</code>中的MemoryGroup.Apply方法,开启kmem accounting。</p><p>这里也需要进行处理,可以不开启kmem accounting, 或者通过命令行参数来控制是否开启。</p><p>kubernetes社区也有issue讨论这个问题:<a href="https://github.com/kubernetes/kubernetes/issues/70324" target="_blank" rel="noopener">https://github.com/kubernetes/kubernetes/issues/70324</a></p><p>但是目前还没有结论。我们TKE先直接把这部分代码注释掉了,不开启kmem accounting。</p>]]></content>
<summary type="html">
<p>作者: <a href="https://github.com/honkiko" target="_blank" rel="noopener">洪志国</a></p>
<h2 id="前言"><a href="#前言" class="headerlink" title="前
</summary>
</entry>
<entry>
<title>给容器设置内核参数</title>
<link href="https://TencentCloudContainerTeam.github.io/2018/11/19/kernel-parameters-and-container/"/>
<id>https://TencentCloudContainerTeam.github.io/2018/11/19/kernel-parameters-and-container/</id>
<published>2018-11-19T13:52:00.000Z</published>
<updated>2020-06-16T01:53:49.339Z</updated>
<content type="html"><![CDATA[<p>作者: 洪志国</p><h1 id="sysctl"><a href="#sysctl" class="headerlink" title="sysctl"></a>sysctl</h1><p>/proc/sys/目录下导出了一些可以在运行时修改kernel参数的proc文件。</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line"># ls /proc/sys</span><br><span class="line">abi crypto debug dev fs kernel net vm</span><br></pre></td></tr></table></figure><p>可以通过写proc文件来修改这些内核参数。例如, 要打开ipv4的路由转发功能:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">echo 1 > /proc/sys/net/ipv4/ip_forward</span><br></pre></td></tr></table></figure><p>也可以通过sysctl命令来完成(只是对以上写proc文件操作的简单包装):</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">sysctl -w net.ipv4.ip_forward=1</span><br></pre></td></tr></table></figure><p>其他常用sysctl命令:</p><p>显示本机所有sysctl内核参数及当前值</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">sysctl -a</span><br></pre></td></tr></table></figure><p>从文件(缺省使用/etc/sysctl.conf)加载多个参数和取值,并写入内核</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">sysctl -p [FILE]</span><br></pre></td></tr></table></figure><p>另外, 系统启动的时候, 会自动执行一下”sysctl -p”。 所以,希望重启之后仍然生效的参数值, 应该写到/etc/sysctl.conf文件里面。</p><h1 id="容器与sysctl"><a href="#容器与sysctl" class="headerlink" title="容器与sysctl"></a>容器与sysctl</h1><p>内核方面做了大量的工作,把一部分sysctl内核参数进行了namespace化(namespaced)。 也就是多个容器和主机可以各自独立设置某些内核参数。例如, 可以通过net.ipv4.ip_local_port_range,在不同容器中设置不同的端口范围。</p><p>如何判断一个参数是不是namespaced? 
</p><p>运行一个具有privileged权限的容器(参考下一节内容), 然后在容器中修改该参数,看一下在host上能否看到容器在中所做的修改。如果看不到, 那就是namespaced, 否则不是。</p><p>目前已经namespace化的sysctl内核参数:</p><ul><li>kernel.shm*,</li><li>kernel.msg*,</li><li>kernel.sem,</li><li>fs.mqueue.*,</li><li>net.*.</li></ul><p>注意, vm.*并没有namespace化。 比如vm.max_map_count, 在主机或者一个容器中设置它, 其他所有容器都会受影响,都会看到最新的值。</p><h1 id="在docker容器中修改sysctl内核参数"><a href="#在docker容器中修改sysctl内核参数" class="headerlink" title="在docker容器中修改sysctl内核参数"></a>在docker容器中修改sysctl内核参数</h1><p>正常运行的docker容器中,是不能修改任何sysctl内核参数的。因为/proc/sys是以只读方式挂载到容器里面的。</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">proc on /proc/sys type proc (ro,nosuid,nodev,noexec,relatime)</span><br></pre></td></tr></table></figure><p>要给容器设置不一样的sysctl内核参数,有多种方式。</p><h3 id="方法一-–privileged"><a href="#方法一-–privileged" class="headerlink" title="方法一 –privileged"></a>方法一 –privileged</h3><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"># docker run --privileged -it ubuntu bash</span><br></pre></td></tr></table></figure><p>整个/proc目录都是以”rw”权限挂载的<br><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)</span><br></pre></td></tr></table></figure></p><p>在容器中,可以任意修改sysctl内核参赛。</p><p>注意:<br>如果修改的是namespaced的参数, 则不会影响host和其他容器。反之,则会影响它们。</p><p>如果想在容器中修改主机的net.ipv4.ip_default_ttl参数, 则除了–privileged, 还需要加上 –net=host。</p><h3 id="方法二-把-proc-sys-bind到容器里面"><a href="#方法二-把-proc-sys-bind到容器里面" class="headerlink" title="方法二 把/proc/sys bind到容器里面"></a>方法二 把/proc/sys bind到容器里面</h3><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"># docker run -v /proc/sys:/writable-sys -it ubuntu bash</span><br></pre></td></tr></table></figure><p>然后写bind到容器内的proc文件</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">echo 62 > /writable-sys/net/ipv4/ip_default_ttl</span><br></pre></td></tr></table></figure><p>注意: 这样操作,效果类似于”–privileged”, 对于namespaced的参数,不会影响host和其他容器。</p><h3 id="方法三-–sysctl"><a href="#方法三-–sysctl" class="headerlink" title="方法三 –sysctl"></a>方法三 –sysctl</h3><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line"># docker run -it --sysctl 'net.ipv4.ip_default_ttl=63' ubuntu sysctl net.ipv4.ip_default_ttl</span><br><span class="line">net.ipv4.ip_default_ttl = 63</span><br></pre></td></tr></table></figure><p>注意:</p><ul><li>只有namespaced参数才可以。否则会报错”invalid argument…”</li><li>这种方式只是在容器初始化过程中完成内核参数的修改,容器运行起来以后,/proc/sys仍然是以只读方式挂载的,在容器中不能再次修改sysctl内核参数。</li></ul><h1 id="kubernetes-与-sysctl"><a href="#kubernetes-与-sysctl" class="headerlink" title="kubernetes 与 sysctl"></a>kubernetes 与 sysctl</h1><h3 id="方法一-通过sysctls和unsafe-sysctls-annotation"><a href="#方法一-通过sysctls和unsafe-sysctls-annotation" class="headerlink" title="方法一 通过sysctls和unsafe-sysctls annotation"></a>方法一 通过sysctls和unsafe-sysctls annotation</h3><p>k8s还进一步把syctl参数分为safe和unsafe。 safe的条件:</p><ul><li>must not have any influence on any other pod on the node</li><li>must not allow to harm the node’s health</li><li>must not allow to gain CPU or 
memory resources outside of the resource limits of a pod.</li></ul><p>非namespaced的参数,肯定是unsafe。</p><p>namespaced参数,也只有一部分被认为是safe的。</p><p>在pkg/kubelet/sysctl/whitelist.go中维护了safe sysctl参数的名单。在1.7.8的代码中,只有三个参数被认为是safe的:</p><ul><li>kernel.shm_rmid_forced,</li><li>net.ipv4.ip_local_port_range,</li><li>net.ipv4.tcp_syncookies</li></ul><p>如果要设置一个POD中safe参数,通过security.alpha.kubernetes.io/sysctls这个annotation来传递给kubelet。</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line">metadata:</span><br><span class="line"> name: sysctl-example</span><br><span class="line"> annotations:</span><br><span class="line"> security.alpha.kubernetes.io/sysctls: kernel.shm_rmid_forced=1</span><br></pre></td></tr></table></figure><p>如果要设置一个namespaced, 但是unsafe的参数,要使用另一个annotation: security.alpha.kubernetes.io/unsafe-sysctls, 另外还要给kubelet一个特殊的启动参数。</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line">apiVersion: v1</span><br><span class="line">kind: Pod</span><br><span class="line">metadata:</span><br><span class="line"> name: sysctl-example</span><br><span class="line"> annotations:</span><br><span class="line"> security.alpha.kubernetes.io/sysctls: kernel.shm_rmid_forced=1</span><br><span class="line"> security.alpha.kubernetes.io/unsafe-sysctls: net.ipv4.route.min_pmtu=1000,kernel.msgmax=1 2 3</span><br><span class="line">spec:</span><br><span class="line"> ...</span><br></pre></td></tr></table></figure><p>kubelet 增加–experimental-allowed-unsafe-sysctls启动参数</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">kubelet --experimental-allowed-unsafe-sysctls 'kernel.msg*,net.ipv4.route.min_pmtu'</span><br></pre></td></tr></table></figure><h3 id="方法二-privileged-POD"><a href="#方法二-privileged-POD" class="headerlink" title="方法二 privileged POD"></a>方法二 privileged POD</h3><p>如果要修改的是非namespaced的参数, 如vm.*, 那就没办法使用以上方法。 可以给POD privileged权限,然后在容器的初始化脚本或代码中去修改sysctl参数。</p><p>创建POD/deployment/daemonset等对象时, 给容器的spec指定securityContext.privileged=true<br><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line">spec:</span><br><span class="line"> containers:</span><br><span class="line"> - image: nginx:alpine</span><br><span class="line"> securityContext:</span><br><span class="line"> privileged: true</span><br></pre></td></tr></table></figure></p><p>这样跟”docker run –privileged”效果一样,在POD中/proc是以”rw”权限mount的,可以直接修改相关sysctl内核参数。</p><h1 id="ulimit"><a href="#ulimit" class="headerlink" title="ulimit"></a>ulimit</h1><p>每个进程都有若干操作系统资源的限制, 可以通过 /proc/$PID/limits 来查看。</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span 
class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br></pre></td><td class="code"><pre><span class="line">$ cat /proc/1/limits </span><br><span class="line">Limit Soft Limit Hard Limit Units </span><br><span class="line">Max cpu time unlimited unlimited seconds </span><br><span class="line">Max file size unlimited unlimited bytes </span><br><span class="line">Max data size unlimited unlimited bytes </span><br><span class="line">Max stack size 8388608 unlimited bytes </span><br><span class="line">Max core file size 0 unlimited bytes </span><br><span class="line">Max resident set unlimited unlimited bytes </span><br><span class="line">Max processes 62394 62394 processes </span><br><span class="line">Max open files 1024 4096 files </span><br><span class="line">Max locked memory 65536 65536 bytes </span><br><span class="line">Max address space unlimited unlimited bytes </span><br><span class="line">Max file locks unlimited unlimited locks </span><br><span class="line">Max pending signals 62394 62394 signals </span><br><span class="line">Max msgqueue size 819200 819200 bytes </span><br><span class="line">Max nice priority 0 0 </span><br><span class="line">Max realtime priority 0 0 </span><br><span class="line">Max realtime timeout unlimited unlimited us</span><br></pre></td></tr></table></figure><p>在bash中有个ulimit内部命令,可以查看当前bash进程的这些限制。</p><p>跟ulimit属性相关的配置文件是/etc/security/limits.conf。具体配置项和语法可以通过<code>man limits.conf</code> 命令查看。</p><h3 id="systemd给docker-daemon自身配置ulimit"><a href="#systemd给docker-daemon自身配置ulimit" class="headerlink" title="systemd给docker daemon自身配置ulimit"></a>systemd给docker daemon自身配置ulimit</h3><p>在service文件中(一般是/usr/lib/systemd/system/dockerd.service)中可以配置:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line">[Service]</span><br><span class="line">LimitAS=infinity</span><br><span class="line">LimitRSS=infinity</span><br><span class="line">LimitCORE=infinity</span><br><span class="line">LimitNOFILE=65536</span><br><span class="line">ExecStart=...</span><br><span class="line">WorkingDirectory=...</span><br><span class="line">User=...</span><br><span class="line">Group=...</span><br></pre></td></tr></table></figure><h3 id="dockerd-给容器的-缺省ulimit设置"><a href="#dockerd-给容器的-缺省ulimit设置" class="headerlink" title="dockerd 给容器的 缺省ulimit设置"></a>dockerd 给容器的 缺省ulimit设置</h3><p>dockerd –default-ulimit nofile=65536:65536</p><p>冒号前面是soft limit, 后面是hard limit</p><h3 id="给容器指定ulimit设置"><a href="#给容器指定ulimit设置" class="headerlink" title="给容器指定ulimit设置"></a>给容器指定ulimit设置</h3><p>docker run -d –ulimit nofile=20480:40960 nproc=1024:2048 容器名</p><h3 id="在kubernetes中给pod设置ulimit参数"><a href="#在kubernetes中给pod设置ulimit参数" class="headerlink" title="在kubernetes中给pod设置ulimit参数"></a>在kubernetes中给pod设置ulimit参数</h3><p>有一个issue在讨论这个问题: <a href="https://github.com/kubernetes/kubernetes/issues/3595" target="_blank" 
rel="noopener">https://github.com/kubernetes/kubernetes/issues/3595</a></p><p>目前可行的办法,是在镜像中的初始化程序中调用setrlimit()系统调用来进行设置。子进程会继承父进程的ulimit参数。</p><h1 id="参考文档:"><a href="#参考文档:" class="headerlink" title="参考文档:"></a>参考文档:</h1><p><a href="http://tapd.oa.com/CCCM/prong/stories/view/1010166561060564549" target="_blank" rel="noopener">http://tapd.oa.com/CCCM/prong/stories/view/1010166561060564549</a></p><p><a href="https://kubernetes.io/docs/concepts/cluster-administration/sysctl-cluster/" target="_blank" rel="noopener">https://kubernetes.io/docs/concepts/cluster-administration/sysctl-cluster/</a></p><p><a href="https://docs.docker.com/engine/reference/run/#runtime-privilege-and-linux-capabilities" target="_blank" rel="noopener">https://docs.docker.com/engine/reference/run/#runtime-privilege-and-linux-capabilities</a></p>]]></content>
<summary type="html">
<p>作者: 洪志国</p>
<h1 id="sysctl"><a href="#sysctl" class="headerlink" title="sysctl"></a>sysctl</h1><p>/proc/sys/目录下导出了一些可以在运行时修改kernel参数的proc
</summary>
</entry>
<entry>
<title>NodePort, svc, LB直通Pod性能测试对比</title>
<link href="https://TencentCloudContainerTeam.github.io/2018/11/06/NodePort-SVC-LB%E7%9B%B4%E9%80%9A%E5%AE%B9%E5%99%A8%E6%80%A7%E8%83%BD%E6%B5%8B%E8%AF%95%E5%AF%B9%E6%AF%94/"/>
<id>https://TencentCloudContainerTeam.github.io/2018/11/06/NodePort-SVC-LB直通容器性能测试对比/</id>
<published>2018-11-06T07:45:37.000Z</published>
<updated>2020-06-16T01:53:49.335Z</updated>
<content type="html"><![CDATA[<p>作者:郭志宏 </p><h3 id="1-测试背景:"><a href="#1-测试背景:" class="headerlink" title="1. 测试背景:"></a>1. 测试背景:</h3><p>目前基于k8s 服务的外网访问方式有以下几种:</p><ol><li>NodePort </li><li>svc(通过k8s 的clusterip 访问)</li><li>自研 LB -> Pod (比如pod ip 作为 nginx 的 upstream, 或者社区的nginx-ingress)</li></ol><p>其中第一种和第二种方案都要经过iptables 转发,第三种方案不经过iptables,本测试主要是为了测试这三种方案的性能损耗。</p><h3 id="2-测试方案"><a href="#2-测试方案" class="headerlink" title="2. 测试方案"></a>2. 测试方案</h3><p>为了做到测试的准确性和全面性,我们提供以下测试工具和测试数据:</p><ol><li><p>2核4G 的Pod</p></li><li><p>5个Node 的4核8G 集群</p></li><li><p>16核32G 的Nginx 作为统一的LB</p></li><li><p>一个测试应用,2个静态测试接口,分别对用不同大小的数据包(4k 和 100K)</p></li><li><p>测试1个pod ,10个pod的情况(service/pod 越多,一个机器上的iptables 规则数就越多,关于iptables规则数对转发性能的影响,在“ipvs和iptables模式下性能对⽐比测试报告” 已有结论: Iptables场景下,对应service在总数为2000内时,每个service 两个pod, 性能没有明显下降。当service总数达到3000、4000时,性能下降明显,service个数越多,性能越差。)所以这里就不考虑pod数太多的情况。</p></li><li><p>单独的16核32G 机器作作为压力机,使用wrk 作为压测工具, qps 作为评估标准,</p></li><li><p>那么每种访问方式对应以下4种情况</p></li></ol><table><thead><tr><th>测试用例</th><th>Pod 数</th><th>数据包大小</th><th>平均QPS</th></tr></thead><tbody><tr><td>1</td><td>1</td><td>4k</td><td></td></tr><tr><td>2</td><td>1</td><td>100K</td><td></td></tr><tr><td>3</td><td>10</td><td>4k</td><td></td></tr><tr><td>4</td><td>10</td><td>100k</td></tr></tbody></table><ol start="8"><li>每种情况测试5次,取平均值(qps),完善上表。</li></ol><h3 id="3-测试过程"><a href="#3-测试过程" class="headerlink" title="3. 测试过程"></a>3. 测试过程</h3><ol><li><p>准备一个测试应用(基于nginx),提供两个静态文件接口,分别返回4k的数据和100K 的数据。</p><p>镜像地址:ccr.ccs.tencentyun.com/caryguo/nginx:v0.1</p><p>接口:<a href="http://0.0.0.0/4k.html" target="_blank" rel="noopener">http://0.0.0.0/4k.html</a></p><p> <a href="http://0.0.0.0/100k.htm" target="_blank" rel="noopener">http://0.0.0.0/100k.htm</a></p></li><li><p>部署压测工具。<a href="https://github.com/wg/wrk" target="_blank" rel="noopener">https://github.com/wg/wrk</a></p></li><li><p>部署集群,5台Node来调度测试Pod, 10.0.4.6 这台用来独部署Nginx, 作为统一的LB, 将这台机器加入集群的目的是为了 将ClusterIP 作为nginx 的upstream .</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line">root@VM-4-6-ubuntu:/etc/nginx# kubectl get node</span><br><span class="line">NAME STATUS ROLES AGE VERSION</span><br><span class="line">10.0.4.12 Ready <none> 3d v1.10.5-qcloud-rev1</span><br><span class="line">10.0.4.3 Ready <none> 3d v1.10.5-qcloud-rev1</span><br><span class="line">10.0.4.5 Ready <none> 3d v1.10.5-qcloud-rev1</span><br><span class="line">10.0.4.6 Ready,SchedulingDisabled <none> 12m v1.10.5-qcloud-rev1</span><br><span class="line">10.0.4.7 Ready <none> 3d v1.10.5-qcloud-rev1</span><br><span class="line">10.0.4.9 Ready <none> 3d v1.10.5-qcloud-rev1</span><br></pre></td></tr></table></figure></li><li><p>根据不同的测试场景,调整Nginx 的upstream, 根据不同的Pod, 调整压力,让请求的超时率控制在万分之一以内, 数据如下:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">./wrk -c 200 -d 20 -t 10 http://carytest.pod.com/10k.html 单pod</span><br><span class="line">./wrk -c 1000 -d 20 -t 100 http://carytest.pod.com/4k.html 10 pod</span><br></pre></td></tr></table></figure></li><li><p>测试wrk -> nginx -> Pod 场景,</p></li></ol><table><thead><tr><th>测试用例</th><th>Pod 
数</th><th>数据包大小</th><th>平均QPS</th></tr></thead><tbody><tr><td>1</td><td>1</td><td>4k</td><td>12498</td></tr><tr><td>2</td><td>1</td><td>100K</td><td>2037</td></tr><tr><td>3</td><td>10</td><td>4k</td><td>82752</td></tr><tr><td>4</td><td>10</td><td>100k</td><td>7743</td></tr></tbody></table><ol start="5"><li>wrk -> nginx -> ClusterIP -> Pod</li></ol><table><thead><tr><th>测试用例</th><th>Pod 数</th><th>数据包大小</th><th>平均QPS</th></tr></thead><tbody><tr><td>1</td><td>1</td><td>4k</td><td>12568</td></tr><tr><td>2</td><td>1</td><td>100K</td><td>2040</td></tr><tr><td>3</td><td>10</td><td>4k</td><td>81752</td></tr><tr><td>4</td><td>10</td><td>100k</td><td>7824</td></tr></tbody></table><ol start="6"><li>NodePort 场景,wrk -> nginx -> NodePort -> Pod</li></ol><table><thead><tr><th>测试用例</th><th>Pod 数</th><th>数据包大小</th><th>平均QPS</th></tr></thead><tbody><tr><td>1</td><td>1</td><td>4k</td><td>12332</td></tr><tr><td>2</td><td>1</td><td>100K</td><td>2028</td></tr><tr><td>3</td><td>10</td><td>4k</td><td>76973</td></tr><tr><td>4</td><td>10</td><td>100k</td><td>5676</td></tr></tbody></table><p>压测过程中,4k 数据包的情况下,应用的负载都在80% -100% 之间, 100k 情况下,应用的负载都在20%-30%</p><p>之间,压力都在网络消耗上,没有到达服务后端。</p><h3 id="4-测试结论"><a href="#4-测试结论" class="headerlink" title="4. 测试结论"></a>4. 测试结论</h3><ol><li>在一个pod 的情况下(4k 或者100 数据包),3中网络方案差别不大,QPS 差距在3% 以内。</li><li>在10个pod,4k 数据包情况下,lb->pod 和 svc 差距不大,NodePort 损失近7% 左右。</li><li>10个Pod, 100k 数据包的情况下,lb->pod 和 svc 差距不大,NodePort 损失近 25% </li></ol><h3 id="5-附录"><a href="#5-附录" class="headerlink" title="5. 附录"></a>5. 附录</h3><ol><li>nginx 配置</li></ol><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span 
class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span class="line">67</span><br><span class="line">68</span><br><span class="line">69</span><br><span class="line">70</span><br><span class="line">71</span><br><span class="line">72</span><br><span class="line">73</span><br><span class="line">74</span><br><span class="line">75</span><br><span class="line">76</span><br><span class="line">77</span><br><span class="line">78</span><br><span class="line">79</span><br><span class="line">80</span><br><span class="line">81</span><br><span class="line">82</span><br><span class="line">83</span><br><span class="line">84</span><br><span class="line">85</span><br><span class="line">86</span><br><span class="line">87</span><br><span class="line">88</span><br><span class="line">89</span><br><span class="line">90</span><br></pre></td><td class="code"><pre><span class="line">user nginx;</span><br><span class="line">worker_processes 50;</span><br><span class="line">error_log /var/log/nginx/error.log;</span><br><span class="line">pid /run/nginx.pid;</span><br><span class="line"></span><br><span class="line"># Load dynamic modules. See /usr/share/nginx/README.dynamic.</span><br><span class="line">include /usr/share/nginx/modules/*.conf;</span><br><span class="line"></span><br><span class="line">events {</span><br><span class="line"> worker_connections 100000;</span><br><span class="line">}</span><br><span class="line"></span><br><span class="line">http {</span><br><span class="line"> log_format main '$remote_addr - $remote_user [$time_local] "$request" '</span><br><span class="line"> '$status $body_bytes_sent "$http_referer" '</span><br><span class="line"> '"$http_user_agent" "$http_x_forwarded_for"';</span><br><span class="line"></span><br><span class="line"> access_log /var/log/nginx/access.log main;</span><br><span class="line"></span><br><span class="line"> sendfile on;</span><br><span class="line"> tcp_nopush on;</span><br><span class="line"> tcp_nodelay on;</span><br><span class="line"> keepalive_timeout 65;</span><br><span class="line"> types_hash_max_size 2048;</span><br><span class="line"></span><br><span class="line"> include /etc/nginx/mime.types;</span><br><span class="line"> default_type application/octet-stream;</span><br><span class="line"></span><br><span class="line"> # Load modular configuration files from the /etc/nginx/conf.d directory.</span><br><span class="line"> # See http://nginx.org/en/docs/ngx_core_module.html#include</span><br><span class="line"> # for more information.</span><br><span class="line"> include /etc/nginx/conf.d/*.conf;</span><br><span class="line"> </span><br><span class="line"> # pod ip</span><br><span class="line"> upstream panda-pod {</span><br><span class="line"> #ip_hash;</span><br><span class="line"> # Pod ip</span><br><span class="line"> #server 10.0.4.12:30734 max_fails=2 fail_timeout=30s;</span><br><span class="line"> #server 172.16.1.5:80 max_fails=2 fail_timeout=30s;</span><br><span class="line"> #server 172.16.2.3:80 max_fails=2 fail_timeout=30s;</span><br><span class="line"> #server 172.16.3.5:80 max_fails=2 fail_timeout=30s;</span><br><span class="line"> #server 172.16.4.6:80 max_fails=2 fail_timeout=30s;</span><br><span class="line"> #server 172.16.4.5:80 max_fails=2 fail_timeout=30s;</span><br><span class="line"> #server 172.16.3.6:80 max_fails=2 fail_timeout=30s;</span><br><span class="line"> #server 172.16.1.4:80 
max_fails=2 fail_timeout=30s;</span><br><span class="line"> #server 172.16.0.7:80 max_fails=2 fail_timeout=30s;</span><br><span class="line"> #server 172.16.0.6:80 max_fails=2 fail_timeout=30s;</span><br><span class="line"> #server 172.16.2.2:80 max_fails=2 fail_timeout=30s;</span><br><span class="line"> </span><br><span class="line"> # svc ip</span><br><span class="line"> #server 172.16.255.121:80 max_fails=2 fail_timeout=30s;</span><br><span class="line"> </span><br><span class="line"> # NodePort</span><br><span class="line"> server 10.0.4.12:30734 max_fails=2 fail_timeout=30s;</span><br><span class="line"> server 10.0.4.3:30734 max_fails=2 fail_timeout=30s;</span><br><span class="line"> server 10.0.4.5:30734 max_fails=2 fail_timeout=30s;</span><br><span class="line"> server 10.0.4.7:30734 max_fails=2 fail_timeout=30s;</span><br><span class="line"> server 10.0.4.9:30734 max_fails=2 fail_timeout=30s;</span><br><span class="line"> </span><br><span class="line"> keepalive 256;</span><br><span class="line"> }</span><br><span class="line"></span><br><span class="line"> server {</span><br><span class="line"> listen 80;</span><br><span class="line"> server_name carytest.pod.com;</span><br><span class="line"> # root /usr/share/nginx/html;</span><br><span class="line"> charset utf-8;</span><br><span class="line"></span><br><span class="line"> # Load configuration files for the default server block.</span><br><span class="line"> include /etc/nginx/default.d/*.conf;</span><br><span class="line"> location / {</span><br><span class="line"> proxy_pass http://panda-pod;</span><br><span class="line"> proxy_http_version 1.1;</span><br><span class="line"> proxy_set_header Connection "";</span><br><span class="line"> proxy_redirect off;</span><br><span class="line"> proxy_set_header Host $host;</span><br><span class="line"> proxy_set_header X-Real-IP $remote_addr;</span><br><span class="line"> proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;</span><br><span class="line"> proxy_next_upstream error timeout invalid_header http_500 http_502 http_503 http_504;</span><br><span class="line"></span><br><span class="line"> }</span><br><span class="line"></span><br><span class="line"> error_page 404 /404.html;</span><br><span class="line"> location = /40x.html {</span><br><span class="line"> }</span><br><span class="line"></span><br><span class="line"> error_page 500 502 503 504 /50x.html;</span><br><span class="line"> location = /50x.html {</span><br><span class="line"> }</span><br><span class="line"> }</span><br></pre></td></tr></table></figure>]]></content>
<summary type="html">
<p>作者:郭志宏 </p>
<h3 id="1-测试背景:"><a href="#1-测试背景:" class="headerlink" title="1. 测试背景:"></a>1. 测试背景:</h3><p>目前基于k8s 服务的外网访问方式有以下几种:</p>
<ol>
</summary>
</entry>
<entry>
<title>K8s Network Policy Controller之Kube-router功能介绍</title>
<link href="https://TencentCloudContainerTeam.github.io/2018/10/30/k8s-npc-kr-function/"/>
<id>https://TencentCloudContainerTeam.github.io/2018/10/30/k8s-npc-kr-function/</id>
<published>2018-10-30T09:22:24.000Z</published>
<updated>2020-06-16T01:53:49.339Z</updated>
<content type="html"><![CDATA[<p>Author: <a href="https://github.com/jimmy-zh" target="_blank" rel="noopener">Jimmy Zhang</a> (张浩)</p><h1 id="Network-Policy"><a href="#Network-Policy" class="headerlink" title="Network Policy"></a>Network Policy</h1><p><a href="https://kubernetes.io/docs/concepts/services-networking/network-policies/" target="_blank" rel="noopener">Network Policy</a>是k8s提供的一种资源,用于定义基于pod的网络隔离策略。它描述了一组pod是否可以与其它组pod,以及其它network endpoints进行通信。</p><h1 id="Kube-router"><a href="#Kube-router" class="headerlink" title="Kube-router"></a>Kube-router</h1><ul><li>官网: <a href="https://www.kube-router.io" target="_blank" rel="noopener">https://www.kube-router.io</a></li><li>项目: <a href="https://github.com/cloudnativelabs/kube-router" target="_blank" rel="noopener">https://github.com/cloudnativelabs/kube-router</a></li><li>目前最新版本:<a href="https://github.com/cloudnativelabs/kube-router/releases/tag/v0.2.1" target="_blank" rel="noopener">v0.2.1</a></li></ul><p>kube-router项目的三大功能:</p><ul><li>Pod Networking</li><li>IPVS/LVS based service proxy </li><li>Network Policy Controller </li></ul><p>在腾讯云TKE上,Pod Networking功能由基于IAAS层VPC的高性能容器网络实现,service proxy功能由kube-proxy所支持的ipvs/iptables两种模式实现。建议在TKE上,只使用kube-router的Network Policy功能。</p><h1 id="在TKE上部署kube-router"><a href="#在TKE上部署kube-router" class="headerlink" title="在TKE上部署kube-router"></a>在TKE上部署kube-router</h1><h3 id="腾讯云提供的kube-router版本"><a href="#腾讯云提供的kube-router版本" class="headerlink" title="腾讯云提供的kube-router版本"></a>腾讯云提供的kube-router版本</h3><p>腾讯云PAAS团队提供的镜像”ccr.ccs.tencentyun.com/library/kube-router:v1”基于官方的最新版本:<a href="https://github.com/cloudnativelabs/kube-router/releases/tag/v0.2.1" target="_blank" rel="noopener">v0.2.1</a></p><p>在该项目的开发过程中,腾讯云PAAS团队积极参与社区,持续贡献了一些feature support和bug fix, 列表如下(均已被社区合并):</p><ul><li><a href="https://github.com/cloudnativelabs/kube-router/pull/488" target="_blank" rel="noopener">https://github.com/cloudnativelabs/kube-router/pull/488</a></li><li><a href="https://github.com/cloudnativelabs/kube-router/pull/498" target="_blank" rel="noopener">https://github.com/cloudnativelabs/kube-router/pull/498</a></li><li><a href="https://github.com/cloudnativelabs/kube-router/pull/527" target="_blank" rel="noopener">https://github.com/cloudnativelabs/kube-router/pull/527</a></li><li><a href="https://github.com/cloudnativelabs/kube-router/pull/529" target="_blank" rel="noopener">https://github.com/cloudnativelabs/kube-router/pull/529</a></li><li><a href="https://github.com/cloudnativelabs/kube-router/pull/543" target="_blank" rel="noopener">https://github.com/cloudnativelabs/kube-router/pull/543</a></li></ul><p>我们会继续贡献社区,并提供腾讯云镜像的版本升级。</p><h3 id="部署kube-router"><a href="#部署kube-router" class="headerlink" title="部署kube-router"></a>部署kube-router</h3><p>Daemonset yaml文件:</p><blockquote><p><a href="https://ask.qcloudimg.com/draft/982360/9wn7eu0bek.zip" target="_blank" rel="noopener">#kube-router-firewall-daemonset.yaml.zip#</a></p></blockquote><p>在<strong>能访问公网</strong>,也能访问TKE集群apiserver的机器上,执行以下命令即可完成kube-router部署。</p><p>如果集群节点开通了公网IP,则可以直接在集群节点上执行以下命令。</p><p>如果集群节点没有开通公网IP, 则可以手动下载和粘贴yaml文件内容到节点, 保存为kube-router-firewall-daemonset.yaml,再执行最后的kubectl create命令。</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">wget https://ask.qcloudimg.com/draft/982360/9wn7eu0bek.zip</span><br><span class="line">unzip 9wn7eu0bek.zip</span><br><span class="line">kuebectl create 
-f kube-router-firewall-daemonset.yaml</span><br></pre></td></tr></table></figure><h3 id="yaml文件内容和参数说明"><a href="#yaml文件内容和参数说明" class="headerlink" title="yaml文件内容和参数说明"></a>yaml文件内容和参数说明</h3><p>kube-router-firewall-daemonset.yaml文件内容:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span class="line">67</span><br><span class="line">68</span><br><span class="line">69</span><br><span class="line">70</span><br><span class="line">71</span><br><span class="line">72</span><br><span class="line">73</span><br><span class="line">74</span><br><span class="line">75</span><br><span class="line">76</span><br><span class="line">77</span><br><span class="line">78</span><br><span class="line">79</span><br><span class="line">80</span><br><span class="line">81</span><br><span class="line">82</span><br><span class="line">83</span><br><span class="line">84</span><br><span class="line">85</span><br><span class="line">86</span><br><span class="line">87</span><br><span class="line">88</span><br><span class="line">89</span><br><span class="line">90</span><br><span class="line">91</span><br><span class="line">92</span><br><span class="line">93</span><br><span class="line">94</span><br><span class="line">95</span><br><span class="line">96</span><br><span class="line">97</span><br><span class="line">98</span><br><span class="line">99</span><br><span class="line">100</span><br></pre></td><td class="code"><pre><span 
class="line">apiVersion: v1</span><br><span class="line">kind: ConfigMap</span><br><span class="line">metadata:</span><br><span class="line"> name: kube-router-cfg</span><br><span class="line"> namespace: kube-system</span><br><span class="line"> labels:</span><br><span class="line"> tier: node</span><br><span class="line"> k8s-app: kube-router</span><br><span class="line">data:</span><br><span class="line"> cni-conf.json: |</span><br><span class="line"> {</span><br><span class="line"> "name":"kubernetes",</span><br><span class="line"> "type":"bridge",</span><br><span class="line"> "bridge":"kube-bridge",</span><br><span class="line"> "isDefaultGateway":true,</span><br><span class="line"> "ipam": {</span><br><span class="line"> "type":"host-local"</span><br><span class="line"> }</span><br><span class="line"> }</span><br><span class="line">---</span><br><span class="line">apiVersion: extensions/v1beta1</span><br><span class="line">kind: DaemonSet</span><br><span class="line">metadata:</span><br><span class="line"> name: kube-router</span><br><span class="line"> namespace: kube-system</span><br><span class="line"> labels:</span><br><span class="line"> k8s-app: kube-router</span><br><span class="line">spec:</span><br><span class="line"> template:</span><br><span class="line"> metadata:</span><br><span class="line"> labels:</span><br><span class="line"> k8s-app: kube-router</span><br><span class="line"> annotations:</span><br><span class="line"> scheduler.alpha.kubernetes.io/critical-pod: ''</span><br><span class="line"> spec:</span><br><span class="line"> containers:</span><br><span class="line"> - name: kube-router</span><br><span class="line"> image: ccr.ccs.tencentyun.com/library/kube-router:v1</span><br><span class="line"> args: ["--run-router=false", "--run-firewall=true", "--run-service-proxy=false", "--kubeconfig=/var/lib/kube-router/kubeconfig", "--iptables-sync-period=5m", "--cache-sync-timeout=3m"]</span><br><span class="line"> securityContext:</span><br><span class="line"> privileged: true</span><br><span class="line"> imagePullPolicy: Always</span><br><span class="line"> env:</span><br><span class="line"> - name: NODE_NAME</span><br><span class="line"> valueFrom:</span><br><span class="line"> fieldRef:</span><br><span class="line"> fieldPath: spec.nodeName</span><br><span class="line"> livenessProbe:</span><br><span class="line"> httpGet:</span><br><span class="line"> path: /healthz</span><br><span class="line"> port: 20244</span><br><span class="line"> initialDelaySeconds: 10</span><br><span class="line"> periodSeconds: 3</span><br><span class="line"> volumeMounts:</span><br><span class="line"> - name: lib-modules</span><br><span class="line"> mountPath: /lib/modules</span><br><span class="line"> readOnly: true</span><br><span class="line"> - name: cni-conf-dir</span><br><span class="line"> mountPath: /etc/cni/net.d</span><br><span class="line"> - name: kubeconfig</span><br><span class="line"> mountPath: /var/lib/kube-router/kubeconfig</span><br><span class="line"> readOnly: true</span><br><span class="line"> initContainers:</span><br><span class="line"> - name: install-cni</span><br><span class="line"> image: busybox</span><br><span class="line"> imagePullPolicy: Always</span><br><span class="line"> command:</span><br><span class="line"> - /bin/sh</span><br><span class="line"> - -c</span><br><span class="line"> - set -e -x;</span><br><span class="line"> if [ ! 
-f /etc/cni/net.d/10-kuberouter.conf ]; then</span><br><span class="line"> TMP=/etc/cni/net.d/.tmp-kuberouter-cfg;</span><br><span class="line"> cp /etc/kube-router/cni-conf.json ${TMP};</span><br><span class="line"> mv ${TMP} /etc/cni/net.d/10-kuberouter.conf;</span><br><span class="line"> fi</span><br><span class="line"> volumeMounts:</span><br><span class="line"> - name: cni-conf-dir</span><br><span class="line"> mountPath: /etc/cni/net.d</span><br><span class="line"> - name: kube-router-cfg</span><br><span class="line"> mountPath: /etc/kube-router</span><br><span class="line"> hostNetwork: true</span><br><span class="line"> tolerations:</span><br><span class="line"> - key: CriticalAddonsOnly</span><br><span class="line"> operator: Exists</span><br><span class="line"> - effect: NoSchedule</span><br><span class="line"> key: node-role.kubernetes.io/master</span><br><span class="line"> operator: Exists</span><br><span class="line"> volumes:</span><br><span class="line"> - name: lib-modules</span><br><span class="line"> hostPath:</span><br><span class="line"> path: /lib/modules</span><br><span class="line"> - name: cni-conf-dir</span><br><span class="line"> hostPath:</span><br><span class="line"> path: /etc/cni/net.d</span><br><span class="line"> - name: kube-router-cfg</span><br><span class="line"> configMap:</span><br><span class="line"> name: kube-router-cfg</span><br><span class="line"> - name: kubeconfig</span><br><span class="line"> hostPath:</span><br><span class="line"> path: /root/.kube/config</span><br></pre></td></tr></table></figure><p>args说明:</p><ol><li>“–run-router=false”, “–run-firewall=true”, “–run-service-proxy=false”:只加载firewall模块;</li><li>kubeconfig:用于指定master信息,映射到主机上的kubectl配置目录/root/.kube/config;</li><li>–iptables-sync-period=5m:指定定期同步iptables规则的间隔时间,根据准确性的要求设置,默认5m;</li><li>–cache-sync-timeout=3m:指定启动时将k8s资源做缓存的超时时间,默认1m;</li></ol><h1 id="NetworkPolicy配置示例"><a href="#NetworkPolicy配置示例" class="headerlink" title="NetworkPolicy配置示例"></a>NetworkPolicy配置示例</h1><h3 id="1-nsa-namespace下的pod可互相访问,而不能被其它任何pod访问"><a href="#1-nsa-namespace下的pod可互相访问,而不能被其它任何pod访问" class="headerlink" title="1.nsa namespace下的pod可互相访问,而不能被其它任何pod访问"></a>1.nsa namespace下的pod可互相访问,而不能被其它任何pod访问</h3><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br></pre></td><td class="code"><pre><span class="line">apiVersion: extensions/v1beta1</span><br><span class="line">kind: NetworkPolicy</span><br><span class="line">metadata:</span><br><span class="line"> name: npa</span><br><span class="line"> namespace: nsa</span><br><span class="line">spec:</span><br><span class="line"> ingress: </span><br><span class="line"> - from:</span><br><span class="line"> - podSelector: {} </span><br><span class="line"> podSelector: {} </span><br><span class="line"> policyTypes:</span><br><span class="line"> - Ingress</span><br></pre></td></tr></table></figure><h3 id="2-nsa-namespace下的pod不能被任何pod访问"><a href="#2-nsa-namespace下的pod不能被任何pod访问" class="headerlink" title="2.nsa namespace下的pod不能被任何pod访问"></a>2.nsa namespace下的pod不能被任何pod访问</h3><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span 
class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line">apiVersion: extensions/v1beta1</span><br><span class="line">kind: NetworkPolicy</span><br><span class="line">metadata:</span><br><span class="line"> name: npa</span><br><span class="line"> namespace: nsa</span><br><span class="line">spec:</span><br><span class="line"> podSelector: {}</span><br><span class="line"> policyTypes:</span><br><span class="line"> - Ingress</span><br></pre></td></tr></table></figure><h3 id="3-nsa-namespace下的pod只在6379-TCP端口可以被带有标签app-nsb的namespace下的pod访问,而不能被其它任何pod访问"><a href="#3-nsa-namespace下的pod只在6379-TCP端口可以被带有标签app-nsb的namespace下的pod访问,而不能被其它任何pod访问" class="headerlink" title="3.nsa namespace下的pod只在6379/TCP端口可以被带有标签app: nsb的namespace下的pod访问,而不能被其它任何pod访问"></a>3.nsa namespace下的pod只在6379/TCP端口可以被带有标签app: nsb的namespace下的pod访问,而不能被其它任何pod访问</h3><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br></pre></td><td class="code"><pre><span class="line">apiVersion: extensions/v1beta1</span><br><span class="line">kind: NetworkPolicy</span><br><span class="line">metadata:</span><br><span class="line"> name: npa</span><br><span class="line"> namespace: nsa</span><br><span class="line">spec:</span><br><span class="line"> ingress:</span><br><span class="line"> - from:</span><br><span class="line"> - namespaceSelector:</span><br><span class="line"> matchLabels:</span><br><span class="line"> app: nsb</span><br><span class="line"> ports:</span><br><span class="line"> - protocol: TCP</span><br><span class="line"> port: 6379</span><br><span class="line"> podSelector: {}</span><br><span class="line"> policyTypes:</span><br><span class="line"> - Ingress</span><br></pre></td></tr></table></figure><h3 id="4-nsa-namespace下的pod可以访问CIDR为14-215-0-0-16的network-endpoint的5978-TCP端口,而不能访问其它任何network-endpoints(此方式可以用来为集群内的服务开访问外部network-endpoints的白名单)"><a href="#4-nsa-namespace下的pod可以访问CIDR为14-215-0-0-16的network-endpoint的5978-TCP端口,而不能访问其它任何network-endpoints(此方式可以用来为集群内的服务开访问外部network-endpoints的白名单)" class="headerlink" title="4.nsa namespace下的pod可以访问CIDR为14.215.0.0/16的network endpoint的5978/TCP端口,而不能访问其它任何network endpoints(此方式可以用来为集群内的服务开访问外部network endpoints的白名单)"></a>4.nsa namespace下的pod可以访问CIDR为14.215.0.0/16的network endpoint的5978/TCP端口,而不能访问其它任何network endpoints(此方式可以用来为集群内的服务开访问外部network endpoints的白名单)</h3><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span 
class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br></pre></td><td class="code"><pre><span class="line">apiVersion: extensions/v1beta1</span><br><span class="line">kind: NetworkPolicy</span><br><span class="line">metadata:</span><br><span class="line"> name: npa</span><br><span class="line"> namespace: nsa</span><br><span class="line">spec:</span><br><span class="line"> egress:</span><br><span class="line"> - to:</span><br><span class="line"> - ipBlock:</span><br><span class="line"> cidr: 14.215.0.0/16</span><br><span class="line"> ports:</span><br><span class="line"> - protocol: TCP</span><br><span class="line"> port: 5978</span><br><span class="line"> podSelector: {}</span><br><span class="line"> policyTypes:</span><br><span class="line"> - Egress</span><br></pre></td></tr></table></figure><h3 id="5-default-namespace下的pod只在80-TCP端口可以被CIDR为14-215-0-0-16的network-endpoint访问,而不能被其它任何network-endpoints访问"><a href="#5-default-namespace下的pod只在80-TCP端口可以被CIDR为14-215-0-0-16的network-endpoint访问,而不能被其它任何network-endpoints访问" class="headerlink" title="5.default namespace下的pod只在80/TCP端口可以被CIDR为14.215.0.0/16的network endpoint访问,而不能被其它任何network endpoints访问"></a>5.default namespace下的pod只在80/TCP端口可以被CIDR为14.215.0.0/16的network endpoint访问,而不能被其它任何network endpoints访问</h3><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br></pre></td><td class="code"><pre><span class="line">apiVersion: extensions/v1beta1</span><br><span class="line">kind: NetworkPolicy</span><br><span class="line">metadata:</span><br><span class="line"> name: npd</span><br><span class="line"> namespace: default</span><br><span class="line">spec:</span><br><span class="line"> ingress:</span><br><span class="line"> - from:</span><br><span class="line"> - ipBlock:</span><br><span class="line"> cidr: 14.215.0.0/16</span><br><span class="line"> ports:</span><br><span class="line"> - protocol: TCP</span><br><span class="line"> port: 80</span><br><span class="line"> podSelector: {}</span><br><span class="line"> policyTypes:</span><br><span class="line"> - Ingress</span><br></pre></td></tr></table></figure><h1 id="附-测试情况"><a href="#附-测试情况" class="headerlink" title="附: 测试情况"></a>附: 测试情况</h1><table><thead><tr><th style="text-align:left">用例名称</th><th style="text-align:left">测试结果</th></tr></thead><tbody><tr><td style="text-align:left">不同namespace的pod互相隔离,同一namespace的pod互通</td><td style="text-align:left">通过</td></tr><tr><td style="text-align:left">不同namespace的pod互相隔离,同一namespace的pod隔离</td><td style="text-align:left">通过</td></tr><tr><td style="text-align:left">不同namespace的pod互相隔离,白名单指定B可以访问A</td><td style="text-align:left">通过</td></tr><tr><td style="text-align:left">允许某个namespace访问集群外某个CIDR,其他外部IP全部隔离</td><td style="text-align:left">通过</td></tr><tr><td style="text-align:left">不同namespace的pod互相隔离,白名单指定B可以访问A中对应的pod以及端口</td><td style="text-align:left">通过</td></tr><tr><td style="text-align:left">以上用例,当source pod 和 destination pod在一个node上时,隔离是否生效</td><td 
style="text-align:left">通过</td></tr></tbody></table><p>功能测试用例</p><blockquote><p><a href="https://ask.qcloudimg.com/draft/982360/dgs7x4hcly.zip" target="_blank" rel="noopener">#kube-router测试用例.xlsx.zip#</a></p></blockquote>]]></content>
<summary type="html">
<p>Author: <a href="https://github.com/jimmy-zh" target="_blank" rel="noopener">Jimmy Zhang</a> (张浩)</p>
<h1 id="Network-Policy"><a href="#N
</summary>
</entry>
<entry>
<title>kubernetes集群中夺命的5秒DNS延迟</title>
<link href="https://TencentCloudContainerTeam.github.io/2018/10/26/DNS-5-seconds-delay/"/>
<id>https://TencentCloudContainerTeam.github.io/2018/10/26/DNS-5-seconds-delay/</id>
<published>2018-10-26T07:52:37.000Z</published>
<updated>2020-06-16T01:53:49.335Z</updated>
<content type="html"><![CDATA[<p>作者: 洪志国</p><h2 id="超时问题"><a href="#超时问题" class="headerlink" title="超时问题"></a>超时问题</h2><p>客户反馈从pod中访问服务时,总是有些请求的响应时延会达到5秒。正常的响应只需要毫秒级别的时延。</p><h2 id="DNS-5秒延时"><a href="#DNS-5秒延时" class="headerlink" title="DNS 5秒延时"></a>DNS 5秒延时</h2><p>在pod中(通过nsenter -n tcpdump)抓包,发现是有的DNS请求没有收到响应,超时5秒后,再次发送DNS请求才成功收到响应。</p><p>在kube-dns pod抓包,发现是有DNS请求没有到达kube-dns pod, 在中途被丢弃了。</p><p>为什么是5秒? <code>man resolv.conf</code>可以看到glibc的resolver的缺省超时时间是5s。</p><h2 id="丢包原因"><a href="#丢包原因" class="headerlink" title="丢包原因"></a>丢包原因</h2><p>经过搜索发现这是一个普遍问题。<br>根本原因是内核conntrack模块的bug。</p><p>Weave works的工程师<a href="[email protected]">Martynas Pumputis</a>对这个问题做了很详细的分析:<br><a href="https://www.weave.works/blog/racy-conntrack-and-dns-lookup-timeouts" target="_blank" rel="noopener">https://www.weave.works/blog/racy-conntrack-and-dns-lookup-timeouts</a></p><p>相关结论:</p><ul><li>只有多个线程或进程,并发从同一个socket发送相同五元组的UDP报文时,才有一定概率会发生</li><li>glibc, musl(alpine linux的libc库)都使用”parallel query”, 就是并发发出多个查询请求,因此很容易碰到这样的冲突,造成查询请求被丢弃</li><li>由于ipvs也使用了conntrack, 使用kube-proxy的ipvs模式,并不能避免这个问题</li></ul><h2 id="问题的根本解决"><a href="#问题的根本解决" class="headerlink" title="问题的根本解决"></a>问题的根本解决</h2><p>Martynas向内核提交了两个patch来fix这个问题,不过他说如果集群中有多个DNS server的情况下,问题并没有完全解决。</p><p>其中一个patch已经在2018-7-18被合并到linux内核主线中: <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ed07d9a021df6da53456663a76999189badc432a" target="_blank" rel="noopener">netfilter: nf_conntrack: resolve clash for matching conntracks</a></p><p>目前只有4.19.rc 版本包含这个patch。</p><h2 id="规避办法"><a href="#规避办法" class="headerlink" title="规避办法"></a>规避办法</h2><h4 id="规避方案一:使用TCP发送DNS请求"><a href="#规避方案一:使用TCP发送DNS请求" class="headerlink" title="规避方案一:使用TCP发送DNS请求"></a>规避方案一:使用TCP发送DNS请求</h4><p>由于TCP没有这个问题,有人提出可以在容器的resolv.conf中增加<code>options use-vc</code>, 强制glibc使用TCP协议发送DNS query。下面是这个man resolv.conf中关于这个选项的说明:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">use-vc (since glibc 2.14)</span><br><span class="line"> Sets RES_USEVC in _res.options. This option forces the</span><br><span class="line"> use of TCP for DNS resolutions.</span><br></pre></td></tr></table></figure><p>笔者使用镜像”busybox:1.29.3-glibc” (libc 2.24) 做了试验,并没有见到这样的效果,容器仍然是通过UDP发送DNS请求。</p><h4 id="规避方案二:避免相同五元组DNS请求的并发"><a href="#规避方案二:避免相同五元组DNS请求的并发" class="headerlink" title="规避方案二:避免相同五元组DNS请求的并发"></a>规避方案二:避免相同五元组DNS请求的并发</h4><p>resolv.conf还有另外两个相关的参数: </p><ul><li>single-request-reopen (since glibc 2.9)</li><li>single-request (since glibc 2.10)</li></ul><p>man resolv.conf中解释如下:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br></pre></td><td class="code"><pre><span class="line">single-request-reopen (since glibc 2.9)</span><br><span class="line"> Sets RES_SNGLKUPREOP in _res.options. 
<h4 id="workaround-2-avoid-concurrent-dns-queries-with-the-same-five-tuple"><a href="#workaround-2-avoid-concurrent-dns-queries-with-the-same-five-tuple" class="headerlink" title="Workaround 2: avoid concurrent DNS queries with the same five-tuple"></a>Workaround 2: avoid concurrent DNS queries with the same five-tuple</h4><p>resolv.conf has two other relevant options:</p><ul><li>single-request-reopen (since glibc 2.9)</li><li>single-request (since glibc 2.10)</li></ul><p>man resolv.conf explains them as follows:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br></pre></td><td class="code"><pre><span class="line">single-request-reopen (since glibc 2.9)</span><br><span class="line">       Sets RES_SNGLKUPREOP in _res.options.  The resolver</span><br><span class="line">       uses the same socket for the A and AAAA requests.  Some</span><br><span class="line">       hardware mistakenly sends back only one reply.  When</span><br><span class="line">       that happens the client system will sit and wait for</span><br><span class="line">       the second reply.  Turning this option on changes this</span><br><span class="line">       behavior so that if two requests from the same port are</span><br><span class="line">       not handled correctly it will close the socket and open</span><br><span class="line">       a new one before sending the second request.</span><br><span class="line"> </span><br><span class="line">single-request (since glibc 2.10)</span><br><span class="line">       Sets RES_SNGLKUP in _res.options.  By default, glibc</span><br><span class="line">       performs IPv4 and IPv6 lookups in parallel since</span><br><span class="line">       version 2.9.  Some appliance DNS servers cannot handle</span><br><span class="line">       these queries properly and make the requests time out.</span><br><span class="line">       This option disables the behavior and makes glibc</span><br><span class="line">       perform the IPv6 and IPv4 requests sequentially (at the</span><br><span class="line">       cost of some slowdown of the resolving process).</span><br></pre></td></tr></table></figure><p>In my tests the effect is as follows:</p><ul><li>single-request-reopen<br>The A query and the AAAA query are sent from different source ports, so the two requests occupy different entries in the conntrack table and the clash is avoided.</li><li>single-request<br>Concurrency is avoided altogether: the A and AAAA queries are sent sequentially, and without concurrency there is no clash.</li></ul><p>There are several ways to add these options to the container's resolv.conf:</p>
<h5 id="1-run-echo-options-single-request-reopen-in-the-entrypoint-or-cmd-script"><a href="#1-run-echo-options-single-request-reopen-in-the-entrypoint-or-cmd-script" class="headerlink" title="1) Run the echo command in the container ENTRYPOINT or CMD script"></a>1) In the container's "ENTRYPOINT" or "CMD" script, run <code>/bin/echo 'options single-request-reopen' >> /etc/resolv.conf</code></h5><h5 id="2-in-the-pod-poststart-hook"><a href="#2-in-the-pod-poststart-hook" class="headerlink" title="2) In the pod postStart hook:"></a>2) In the pod's postStart hook:</h5><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line">lifecycle:</span><br><span class="line">  postStart:</span><br><span class="line">    exec:</span><br><span class="line">      command:</span><br><span class="line">        - /bin/sh</span><br><span class="line">        - -c</span><br><span class="line">        - "/bin/echo 'options single-request-reopen' >> /etc/resolv.conf"</span><br></pre></td></tr></table></figure><h5 id="3-use-template-spec-dnsconfig-k8s-v1-9-and-later"><a href="#3-use-template-spec-dnsconfig-k8s-v1-9-and-later" class="headerlink" title="3) Use template.spec.dnsConfig (k8s v1.9 and later):"></a>3) Use template.spec.dnsConfig (supported from k8s v1.9 onwards):</h5><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line">template:</span><br><span class="line">  spec:</span><br><span class="line">    dnsConfig:</span><br><span class="line">      options:</span><br><span class="line">        - name: single-request-reopen</span><br></pre></td></tr></table></figure><h5 id="4-use-a-configmap-to-override-etc-resolv-conf-in-the-pod"><a href="#4-use-a-configmap-to-override-etc-resolv-conf-in-the-pod" class="headerlink" title="4) Use a ConfigMap to override /etc/resolv.conf inside the POD"></a>4) Use a ConfigMap to override /etc/resolv.conf inside the POD</h5><p>configmap:<br><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line">apiVersion: v1</span><br><span class="line">data:</span><br><span class="line">  resolv.conf: |</span><br><span class="line">    nameserver 1.2.3.4</span><br><span class="line">    search default.svc.cluster.local svc.cluster.local cluster.local ec2.internal</span><br><span class="line">    options ndots:5 single-request-reopen timeout:1</span><br><span class="line">kind: ConfigMap</span><br><span class="line">metadata:</span><br><span class="line">  name: resolvconf</span><br></pre></td></tr></table></figure></p><p>POD spec:<br><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br></pre></td><td class="code"><pre><span class="line">    volumeMounts:</span><br><span class="line">    - name: resolv-conf</span><br><span class="line">      mountPath: /etc/resolv.conf</span><br><span class="line">      subPath: resolv.conf</span><br><span class="line">...</span><br><span class="line"></span><br><span class="line">  volumes:</span><br><span class="line">  - name: resolv-conf</span><br><span class="line">    configMap:</span><br><span class="line">      name: resolvconf</span><br><span class="line">      items:</span><br><span class="line">      - key: resolv.conf</span><br><span class="line">        path: resolv.conf</span><br></pre></td></tr></table></figure></p>
<h5 id="5-use-a-mutatingadmissionwebhook"><a href="#5-use-a-mutatingadmissionwebhook" class="headerlink" title="5) Use a MutatingAdmissionWebhook"></a>5) Use a MutatingAdmissionWebhook</h5><p><a href="https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/#mutatingadmissionwebhook-beta-in-1-9" target="_blank" rel="noopener">MutatingAdmissionWebhook</a> is a controller introduced in 1.9 that mutates a given resource before an operation on it is carried out.<br>Istio's automatic sidecar injection is built on this feature. In the same way, a MutatingAdmissionWebhook can automatically inject the content required by 3) or 4) above into every POD, as sketched below.</p>
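<p>As a rough illustration only: the webhook registration object might look like the sketch below, where the service name dns-options-injector, its kube-system namespace, the /mutate path and the empty caBundle are all placeholders, and the mutating server that actually patches dnsConfig into pods still has to be written and deployed separately:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br></pre></td><td class="code"><pre><span class="line">apiVersion: admissionregistration.k8s.io/v1beta1</span><br><span class="line">kind: MutatingWebhookConfiguration</span><br><span class="line">metadata:</span><br><span class="line">  name: dns-options-injector          # hypothetical name</span><br><span class="line">webhooks:</span><br><span class="line">  - name: dns-options-injector.example.com</span><br><span class="line">    failurePolicy: Ignore             # do not block pod creation if the webhook is down</span><br><span class="line">    rules:</span><br><span class="line">      - operations: ["CREATE"]</span><br><span class="line">        apiGroups: [""]</span><br><span class="line">        apiVersions: ["v1"]</span><br><span class="line">        resources: ["pods"]</span><br><span class="line">    clientConfig:</span><br><span class="line">      service:</span><br><span class="line">        namespace: kube-system        # assumed location of the webhook Service</span><br><span class="line">        name: dns-options-injector</span><br><span class="line">        path: /mutate</span><br><span class="line">      caBundle: ""                    # base64-encoded CA cert of the webhook server (placeholder)</span><br></pre></td></tr></table></figure>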
<hr><p>Of the methods above, 1) and 2) require modifying the image, while 3) and 4) only touch the POD spec and therefore work with any image. They are still somewhat inconvenient:</p><ul><li>the yaml of every workload has to be edited, which is tedious</li><li>for workloads created through helm, the helm charts need to be modified</li></ul><p>Method 5) is the least intrusive for cluster users, who can keep submitting workloads exactly as before, but it requires some development effort up front.</p><h4 id="workaround-3-use-a-local-dns-cache"><a href="#workaround-3-use-a-local-dns-cache" class="headerlink" title="Workaround 3: use a local DNS cache"></a>Workaround 3: use a local DNS cache</h4><p>The containers' DNS queries are all sent to a local DNS cache service (dnsmasq, nscd, etc.), so they do not go through DNAT and no conntrack clash can occur. As a bonus, the cluster DNS service is less likely to become a performance bottleneck.</p><p>There are two ways to run a local DNS cache:</p><ul><li>every container ships its own DNS cache service</li><li>every node runs one DNS cache service, and all containers on the node use it as their nameserver</li></ul><p>From the point of view of resource efficiency, the second approach is recommended.</p><h5 id="how-to-implement-it"><a href="#how-to-implement-it" class="headerlink" title="How to implement it"></a>How to implement it</h5><p>All roads lead to Rome; any implementation that achieves the effect described above will do.</p><p>To reach the DNS cache service on its node, a POD can use the node's IP. If all containers on the node are attached to a virtual bridge, the IP of that bridge's layer-3 interface can also be used (in TKE this interface is called cbr0). Make sure the DNS cache service listens on that address.</p><p>How do we point the nameserver in the POD's /etc/resolv.conf at the node IP?</p><p>One way is to set POD.spec.dnsPolicy to "Default", which means the POD's /etc/resolv.conf is taken from the node: by default the node's own /etc/resolv.conf is used (or, if kubelet was started with the --resolv-conf flag, the file that flag points to).</p>
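<p>For instance, a minimal sketch of a POD that inherits the node's resolv.conf this way (the pod name and image below are just placeholders) looks like this:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><span class="line">apiVersion: v1</span><br><span class="line">kind: Pod</span><br><span class="line">metadata:</span><br><span class="line">  name: app-using-node-dns        # hypothetical name</span><br><span class="line">spec:</span><br><span class="line">  dnsPolicy: Default              # use the node's /etc/resolv.conf instead of the cluster DNS</span><br><span class="line">  containers:</span><br><span class="line">    - name: app</span><br><span class="line">      image: busybox:1.29.3-glibc</span><br><span class="line">      command: ["sleep", "3600"]</span><br></pre></td></tr></table></figure>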
<p>Another way is to start each node's kubelet with a different --cluster-dns parameter, set to that node's IP, while keeping POD.spec.dnsPolicy at its default value "ClusterFirst". The kops project even has an issue discussing how to point --cluster-dns at the node IP at cluster deployment time: <a href="https://github.com/kubernetes/kops/issues/5584" target="_blank" rel="noopener">https://github.com/kubernetes/kops/issues/5584</a></p>]]></content>
<summary type="html">
<p>Author: 洪志国</p>
<h2 id="超时问题"><a href="#超时问题" class="headerlink" title="超时问题"></a>超时问题</h2><p>客户反馈从pod中访问服务时,总是有些请求的响应时延会达到5秒。正常的响应只需要毫秒级别的时延
</summary>
</entry>
<entry>
<title>Open Source Components</title>
<link href="https://TencentCloudContainerTeam.github.io/2018/10/21/%E5%BC%80%E6%BA%90%E9%A1%B9%E7%9B%AE/"/>
<id>https://TencentCloudContainerTeam.github.io/2018/10/21/开源项目/</id>
<published>2018-10-21T10:27:56.000Z</published>
<updated>2020-06-16T01:53:49.351Z</updated>
<content type="html"><![CDATA[<p>腾讯云容器团队现有开源组件:</p><ul><li>基于 csi 的 <a href="https://github.com/TencentCloud/kubernetes-csi-tencentcloud" target="_blank" rel="noopener">kubernetes volume 插件</a></li><li>基于 cni 的 <a href="https://github.com/TencentCloud/cni-bridge-networking" target="_blank" rel="noopener">bridge 插件</a></li><li>适配黑石负载均衡的 <a href="https://github.com/TencentCloud/ingress-tke-bm" target="_blank" rel="noopener">ingress 插件</a></li><li>适配腾讯云 cvm/clb/vpc 的 <a href="https://github.com/TencentCloud/tencentcloud-cloud-controller-manager" target="_blank" rel="noopener">kubernetes cloud-controller-manager</a></li></ul>]]></content>
<summary type="html">
<p>Open source components currently published by the Tencent Cloud Container Team:</p>
<ul>
<li>A CSI-based <a href="https://github.com/TencentCloud/kubernetes-csi-tencentcloud" target="_blank" rel="noopener"
</summary>
</entry>
</feed>