yinx0004/tidbcloud-metrics

Function

Run this command-line script to collect TiDBCloud cluster metrics. It provides three functions:

  • Capacity plan: generates a capacity plan from the collected metrics, the estimated traffic increase, and the required resource redundancy, and saves it to a CSV file.
  • Health check: generates a health report.
  • List clusters: lists all clusters for a tenant, including the cluster ID, project ID, and the number of nodes for each TiDB component.
$ python3 main.py --help
Usage: main.py [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  capacity
  health-check
  list-clusters

$ python3 main.py capacity --help
Welcome to TiDBCloud Capacity Planner and Health Checker!
Usage: main.py capacity [OPTIONS]

Options:
  -m, --mode [all|node|cluster]  capacity planner mode
  --help                         Show this message and exit.


$ python3 main.py health-check --help
Welcome to TiDBCloud Capacity Planner and Health Checker!
Usage: main.py health-check [OPTIONS]

Options:
  -t, --type [all|tidb|tikv|pd|tiflash]
                                  health check type
  -r, --report [console]          report channel
  --help                          Show this message and exit.

$ python3 main.py list-clusters --help
Welcome to TiDBCloud Capacity Planner and Health Checker!
Usage: main.py list-clusters [OPTIONS]

Options:
  -w, --write [Yes|No]  Write to spreadsheet
  --help                Show this message and exit.

Requirements

  • Python 3.8+
  • prometheus_api_client: a Python wrapper for the Prometheus HTTP API, with some tools for metrics processing.
pip install prometheus-api-client
  • pyyaml
pip install pyyaml
  • openpyxl
pip install openpyxl
  • click
pip install click

How to use

  1. Configure the YAML file tidbcloud.yaml
  • logging.level: case-insensitive; must be one of "debug", "info", "warning", "error", "critical"
  • logging.to_file: if true, a log file is written under logs; otherwise logs go to the console only
  • prometheus.start_time: Prometheus query start time, format: 'dd-mm-YYYY HH:MM:SS'
  • prometheus.end_time: Prometheus query end time, format: 'dd-mm-YYYY HH:MM:SS'
  • prometheus.step_in_seconds: Prometheus query step (interval) in seconds; typically 60 for a two-day range and 30 for a one-day range
  • prometheus.cluster_prom_id_token: token for the Prometheus API
  • prometheus.cluster_prom_base_url: URL for the Prometheus API; it contains the TiDBCloud cluster information
  • capacity.plan_traffic_x: estimated traffic growth multiplier
  • capacity.plan_resource_redundancy_x: required resource redundancy multiplier
  • lark.user_access_token: required by list-clusters if you want to write the results to a Lark spreadsheet
  2. How to get prometheus.cluster_prom_id_token and prometheus.cluster_prom_base_url
  • Step 1: log in to tidbcloud.com in Chrome
  • Step 2: right-click the page and choose Inspect
  • Step 3: go to the Application tab
  • Step 4: under Local storage -> https://tidbcloud.com
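
The time format and step settings in tidbcloud.yaml can be sanity-checked before a run. The sketch below is not part of the tool: the function names are hypothetical, and the point count assumes the usual Prometheus range-query behavior of returning one sample per step from the start time up to and including the end time. It shows that the suggested steps (60 seconds for two days, 30 seconds for one day) produce the same number of points per series.

```python
from datetime import datetime

# strptime pattern matching the documented 'dd-mm-YYYY HH:MM:SS' format.
TIME_FORMAT = "%d-%m-%Y %H:%M:%S"
# Allowed values for logging.level (compared case-insensitively).
LOG_LEVELS = {"debug", "info", "warning", "error", "critical"}


def is_valid_log_level(level: str) -> bool:
    """Return True if `level` is an accepted logging.level value."""
    return level.lower() in LOG_LEVELS


def expected_points(start_time: str, end_time: str, step_in_seconds: int) -> int:
    """Estimate samples per series for a range query (hypothetical helper)."""
    start = datetime.strptime(start_time, TIME_FORMAT)
    end = datetime.strptime(end_time, TIME_FORMAT)
    if end <= start:
        raise ValueError("prometheus.end_time must be after prometheus.start_time")
    if step_in_seconds <= 0:
        raise ValueError("prometheus.step_in_seconds must be positive")
    # One sample at the start, then one per full step up to the end.
    return int((end - start).total_seconds() // step_in_seconds) + 1


if __name__ == "__main__":
    # Two days at a 60-second step and one day at a 30-second step.
    print(expected_points("01-01-2024 00:00:00", "03-01-2024 00:00:00", 60))
    print(expected_points("01-01-2024 00:00:00", "02-01-2024 00:00:00", 30))
```

A timestamp in another format, such as '2024-01-01 00:00:00', raises ValueError in strptime, which catches the most common configuration mistake early.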

Capacity Plan

A glance at the capacity plan

| component | name | max | average | percentile_50.0 | percentile_75.0 | percentile_80.0 | percentile_85.0 | percentile_90.0 | percentile_95.0 | percentile_99.0 | percentile_99.9 | capacity | instance_cnt | plan_max | plan_average | plan_percentile_50.0 | plan_percentile_75.0 | plan_percentile_80.0 | plan_percentile_85.0 | plan_percentile_90.0 | plan_percentile_95.0 | plan_percentile_99.0 | plan_percentile_99.9 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| tidb | CPU(core) | 23.784666666636863 | 6.891452040305764 | 6.4966666666169965 | 10.47533333338797 | 11.068000000038495 | 11.763333333283663 | 12.577333333343267 | 13.587733333359147 | 15.136666666684672 | 16.846442666622035 | 32 | 64 | 190.2773333330949 | 55.13161632244611 | 51.97333333293597 | 83.80266666710376 | 88.54400000030796 | 94.1066666662693 | 100.61866666674614 | 108.70186666687317 | 121.09333333347737 | 134.77154133297628 |
| tidb | Memory(byte) | 26741944320.0 | 16920691356.352173 | 17130733568.0 | 17670386688.0 | 17807736012.8 | 17977865420.8 | 18201889587.2 | 18560214425.6 | 19336546385.92 | 20367362531.32808 | 66206752768 | 64 | 103.40240926645895 | 65.4268153341575 | 66.23898031633485 | 68.32564357867768 | 68.85672878794647 | 69.5145639275223 | 70.38079258548662 | 71.766318302959 | 74.76814173535634 | 78.75397282042981 |
| tidb | Disk IOPS | 467.56666666666666 | 11.098697260468551 | 10.483333333333333 | 11.5 | 11.891666666666667 | 12.366666666666667 | 12.841666666666667 | 13.416666666666666 | 17.025749999999974 | 110.29567500000084 | 40000 | 64 | 2.992426666666667 | 0.07103166246699873 | 0.06709333333333332 | 0.0736 | 0.07610666666666667 | 0.07914666666666667 | 0.08218666666666667 | 0.08586666666666666 | 0.10896479999999983 | 0.7058923200000053 |
| tidb | Disk Bandwidth(byte) | 79486685.86666666 | 239634.35369328497 | 109324.8 | 141436.8 | 152755.2 | 171619.62666666662 | 216119.0400000008 | 407847.4666666666 | 1418572.4159999895 | 20195406.711474445 | 1245184000 | 64 | 16.341835087719296 | 0.04926693126917865 | 0.022476315789473685 | 0.029078289473684207 | 0.03140526315789474 | 0.035283640350877186 | 0.0444323684210528 | 0.0838502192982456 | 0.29164728947368207 | 4.152016182457739 |
| tidb | NetworkIn Bandwidth(byte) | 721571589.9833333 | 27812681.373172816 | 17402588.81666667 | 39020371.479166664 | 45927429.45666666 | 54322016.234166645 | 65899973.45000005 | 85512609.18749997 | 132907427.57933328 | 239396503.55442265 | 1610612736 | 64 | 114.69071546925439 | 4.420706649330905 | 2.766066998905606 | 6.2021210160520335 | 7.299968315230475 | 8.634251949654681 | 10.474518688519803 | 13.591863185167307 | 21.125066690338972 | 38.051049479551736 |
| tidb | NetworkOut Bandwidth(byte) | 109156186.39999999 | 14940615.595263155 | 14426000.933333334 | 21817573.53333333 | 23120781.346666668 | 24554338.02416666 | 26289229.68166667 | 28942049.474999998 | 40539531.13066665 | 68605984.56516692 | 1610612736 | 64 | 17.349908574422198 | 2.3747468940835246 | 2.2929510964287654 | 3.467809920840793 | 3.6749492242601183 | 3.9028069216675214 | 4.1785605242517265 | 4.600214874744415 | 6.443584939744735 | 10.904627571927216 |
| tikv | CPU(core) | 22.808333333333334 | 8.052314804191548 | 8.275 | 11.416666666666666 | 11.866666666666667 | 12.341666666666667 | 12.866666666666667 | 13.608333333333333 | 14.95 | 16.51653333333355 | 32 | 60 | 171.0625 | 60.39236103143661 | 62.0625 | 85.625 | 89.0 | 92.5625 | 96.5 | 102.0625 | 112.125 | 123.87400000000162 |
| tikv | Memory(byte) | 64757587968.0 | 57881447840.94151 | 63778476032.0 | 64016157696.0 | 64063688704.0 | 64117889433.6 | 64183327129.6 | 64278327705.6 | 64428602900.48 | 64577265741.824005 | 132135682048 | 60 | 117.62016793218831 | 105.13093258775989 | 115.84179239427237 | 116.27349712743657 | 116.35982840255616 | 116.45827399198653 | 116.57712945022902 | 116.74968040608452 | 117.02262747240458 | 117.29264599707227 |
| tikv | Storage(byte) | 535961378515.0 | 361504338289.31384 | 442307236902.0 | 470968634179.0 | 482489093109.0 | 492801961876.0 | 502209251350.0 | 512172191227.0 | 523254653903.0 | 529922350440.00024 | 2113559478272 | 60 | 60.859763903481756 | 41.04972776085263 | 50.22510033323925 | 53.47967415394095 | 54.78785127013951 | 55.95890348301766 | 57.027124887227146 | 58.15844179364083 | 59.4168833324683 | 60.174017061294514 |
| tikv | Disk IOPS | 2731.8 | 1600.941560132902 | 1686.7083333333335 | 1799.4708333333333 | 1827.6350000000002 | 1862.6433333333332 | 1911.71 | 1989.7924999999998 | 2111.456999999999 | 2340.291233333337 | 40000 | 60 | 16.3908 | 9.605649360797411 | 10.120250000000002 | 10.796825 | 10.965810000000001 | 11.175859999999998 | 11.470260000000001 | 11.938754999999999 | 12.668741999999995 | 14.041747400000022 |
| tikv | Disk Bandwidth(byte) | 177578423.46666664 | 52251039.23717693 | 56854656.0 | 70412072.53333333 | 73642921.81333335 | 77325027.84 | 81805142.18666667 | 87842415.36 | 100117152.5973333 | 121224877.07306707 | 1245184000 | 60 | 34.22692680921052 | 10.07100108652413 | 10.958314144736843 | 13.571405838815789 | 14.194128125000004 | 14.903826809210528 | 15.767335690789473 | 16.93097541118421 | 19.29684016447368 | 23.36519783223692 |
| tikv | NetworkIn Bandwidth(byte) | 89017284.35 | 14631376.639736388 | 14266203.9 | 20065022.108333334 | 21117279.176666666 | 22303078.026666667 | 23856447.096666664 | 26557750.6 | 37704011.943333186 | 58618524.251466826 | 1610612736 | 60 | 13.264609031379223 | 2.1802450179660897 | 2.125830039381981 | 2.9899212879439196 | 3.146719809869925 | 3.323417607943217 | 3.5548876369992883 | 3.957413226366043 | 5.618335596223651 | 8.734840788165751 |
| tikv | NetworkOut Bandwidth(byte) | 610273573.65 | 36800260.63627839 | 25660910.583333332 | 48946311.40833333 | 57989962.74 | 69362410.76333334 | 84993119.1433333 | 108843962.25833325 | 157566252.0966662 | 267867539.45127594 | 1610612736 | 60 | 90.9378489330411 | 5.48366615716791 | 3.823773649831613 | 7.293568761398395 | 8.641177824139595 | 10.335804635783036 | 12.664961687227086 | 16.219014265884944 | 23.47920121202859 | 39.915373839628096 |
| tiflash | CPU(core) | 4.416666666666667 | 2.146923225308642 | 2.1083333333333334 | 2.4583333333333335 | 2.558333333333333 | 2.6666666666666665 | 2.825 | 3.0416666666666665 | 3.4583333333333335 | 4.08101666666668 | 32 | 3 | 1.65625 | 0.8050962094907408 | 0.790625 | 0.921875 | 0.9593749999999999 | 1.0 | 1.0593750000000002 | 1.140625 | 1.296875 | 1.530381250000005 |
| tiflash | Memory(byte) | 18783178752.0 | 13776136496.82963 | 13647306752.0 | 14205965312.0 | 14349398016.000002 | 14542488780.8 | 14804665139.2 | 15265175961.599998 | 16246958653.44001 | 17866946875.39206 | 267057463296 | 3 | 0.8440061634756644 | 0.6190189778696653 | 0.6132301228461976 | 0.6383329701407874 | 0.6447779967158069 | 0.6534543660222575 | 0.6652350377247853 | 0.6859276998979257 | 0.7300432702200399 | 0.8028360632897394 |
| tiflash | Storage(byte) | 577336631778.0 | 565327623219.5511 | 564687437622.5 | 569220174264.5 | 570198602871.6 | 571012189417.9 | 571916048319.0 | 572727655758.1 | 574149121927.7 | 575365797786.477 | 1585119989760 | 3 | 4.370672016056628 | 4.279758959863824 | 4.274912495738558 | 4.309227147030184 | 4.31663424766676 | 4.322793427172836 | 4.329636004948188 | 4.33578020181159 | 4.346541276143751 | 4.3557520049211575 |
| tiflash | Disk IOPS | 2432.3999999999996 | 1980.8632012226217 | 2009.8708333333334 | 2105.039583333333 | 2124.6116666666667 | 2147.8479166666666 | 2179.5 | 2226.8875000000003 | 2285.4057500000004 | 2376.089483333339 | 40000 | 3 | 0.7297199999999999 | 0.5942589603667865 | 0.60296125 | 0.631511875 | 0.6373835 | 0.644354375 | 0.65385 | 0.6680662500000001 | 0.6856217250000002 | 0.7128268450000018 |
| tiflash | Disk Bandwidth(byte) | 127711074.13333334 | 46139117.60294602 | 44310118.400000006 | 61630208.00000001 | 66574980.266666666 | 70807158.18666667 | 74966400.42666668 | 81943101.01333332 | 88969590.95466669 | 113195749.41013435 | 1245184000 | 3 | 1.2307682154605264 | 0.44464867138941094 | 0.4270223684210527 | 0.5939383223684211 | 0.6415917351973685 | 0.6823777837171052 | 0.7224609416118422 | 0.7896963116776314 | 0.8574115082236845 | 1.0908821450657993 |
| tiflash | NetworkIn Bandwidth(byte) | 25218719.400000002 | 5479655.264724445 | 4197687.066666666 | 6883541.754166666 | 7875524.610000003 | 8889669.059999999 | 10329853.090000002 | 13345982.555 | 18301372.35050001 | 21961515.843933348 | 1610612736 | 3 | 0.18789410144090654 | 0.04082661319318745 | 0.031275205810864765 | 0.051286382631709176 | 0.05867723084986212 | 0.06623319581151008 | 0.07696340300142766 | 0.0994353186711669 | 0.1363558497317136 | 0.16362604382584503 |
| tiflash | NetworkOut Bandwidth(byte) | 30871977.95 | 4998730.973810626 | 3221239.5166666666 | 6484757.5875 | 7643299.823333334 | 9051939.528333332 | 10739699.711666668 | 13705309.657499999 | 19115213.09033334 | 26575713.058033533 | 1610612736 | 3 | 0.23001415990293025 | 0.037243448002715604 | 0.02400010464092096 | 0.04831520905718208 | 0.05694702136019866 | 0.06744220501432815 | 0.0800169982885321 | 0.10211251420900225 | 0.14241943575690194 | 0.19800449205959988 |
| pd | CPU(core) | 17.123333333308498 | 2.6657532407407603 | 0.15333333333255722 | 6.595666666639348 | 6.871333333353202 | 7.055533333343143 | 7.275400000053148 | 10.008633333342152 | 15.028153333337379 | 16.661665333331644 | 32 | 3 | 6.421249999990687 | 0.9996574652777851 | 0.05749999999970896 | 2.4733749999897556 | 2.5767500000074506 | 2.6458250000036787 | 2.7282750000199307 | 3.7532375000033067 | 5.635557500001517 | 6.248124499999367 |
| pd | Memory(byte) | 2833547264.0 | 1643391174.162963 | 1124302848.0 | 2654711808.0 | 2679431168.0 | 2727464960.0 | 2745749504.0 | 2780340224.0 | 2796290744.32 | 2815262720.0 | 66206752768 | 3 | 0.5135815569621867 | 0.2978652972010318 | 0.2037803337565435 | 0.48116756016762935 | 0.48564795389784976 | 0.4943540976052722 | 0.497668178402572 | 0.5039377600184314 | 0.5068287981049951 | 0.5102674761648869 |
| pd | Disk IOPS | 204.24166666666667 | 16.563721099047033 | 10.55 | 21.041666666666664 | 21.708333333333332 | 22.366666666666667 | 23.575833333333335 | 34.476249999999986 | 68.61750000000004 | 162.06711666666675 | 40000 | 3 | 0.0612725 | 0.00496911632971411 | 0.0031650000000000003 | 0.0063124999999999995 | 0.0065125 | 0.00671 | 0.007072750000000001 | 0.010342874999999994 | 0.02058525000000001 | 0.04862013500000003 |
| pd | Disk Bandwidth(byte) | 2114188.8 | 205138.40760609633 | 72142.93333333333 | 318773.3333333334 | 339725.6533333333 | 362276.26666666666 | 396188.5866666667 | 981537.7066666667 | 1133308.2026666673 | 1841979.5200000098 | 1245184000 | 3 | 0.02037471217105263 | 0.0019769454885969913 | 0.000695250822368421 | 0.003072060032894737 | 0.0032739802631578945 | 0.0034913034539473687 | 0.0038181208881578957 | 0.009459206414473684 | 0.010921838404605269 | 0.01775139597039483 |
| pd | NetworkIn Bandwidth(byte) | 39037441.016666666 | 5637797.400659997 | 600457.975 | 14511753.454166666 | 15485324.030000001 | 15874992.755833333 | 16292030.848333333 | 17209804.14583333 | 24901563.423500016 | 29617194.922616865 | 1610612736 | 3 | 0.29085160059233506 | 0.04200486392274497 | 0.004473760537803173 | 0.1081209887129565 | 0.11537465475499631 | 0.11827791300291816 | 0.1213850889230768 | 0.12822303284580505 | 0.18553110527619732 | 0.22066529782576014 |
| pd | NetworkOut Bandwidth(byte) | 36515445.11666667 | 6083618.387593778 | 267739.7166666667 | 16286233.104166666 | 17456376.450000003 | 17864665.948333334 | 18336149.77 | 19826134.20083333 | 27210716.633666668 | 30449632.54978338 | 1610612736 | 3 | 0.2720612668742736 | 0.04532648911769523 | 0.00199481633802255 | 0.12134189236288269 | 0.13006013967096808 | 0.13310213348517816 | 0.1366149616986513 | 0.14771621078873673 | 0.20273563737918934 | 0.22686744145887627 |

Note: node-level metrics require the office network in global mode.

How to read

For example, plan_max is the number of instances needed, calculated from the max value. The other plan_* columns are calculated the same way from their corresponding statistic.

How capacity plan is calculated

plan_max = max * instance_cnt * plan_traffic_x * plan_resource_redundancy_x / capacity
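
The formula can be checked against the sample table with a short sketch. The function names below are hypothetical and not taken from the tool. The sample rows are consistent with plan_traffic_x * plan_resource_redundancy_x = 4; the individual values of 2 and 2 used here are an assumption, since only their product can be inferred from the table. Rounding up to a whole instance count is also an added convenience, not something the tool is documented to do.

```python
import math


def plan_value(stat: float, instance_cnt: int, capacity: float,
               plan_traffic_x: float, plan_resource_redundancy_x: float) -> float:
    """Apply the capacity formula to one statistic (max, average, percentile)."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return (stat * instance_cnt * plan_traffic_x
            * plan_resource_redundancy_x / capacity)


def instances_needed(plan: float) -> int:
    """Round a plan value up to a whole number of instances (assumed step)."""
    return math.ceil(plan)


if __name__ == "__main__":
    # tidb CPU(core) row from the sample table: max, instance_cnt, capacity.
    plan_max = plan_value(
        stat=23.784666666636863,
        instance_cnt=64,
        capacity=32,
        plan_traffic_x=2,              # assumed value
        plan_resource_redundancy_x=2,  # assumed value
    )
    print(plan_max, instances_needed(plan_max))
```

With these inputs the result matches the plan_max column for tidb CPU(core) in the sample table (about 190.28), which rounds up to 191 instances.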

Health Check

Health check report sample

pd:
        diagnostics:
                '🚨 Critical: pd_tso_wait_duration_999': 0.00280602557503623
        metrics:
                pd_tso_wait_duration_999: 0.00280602557503623
                store_disconnected_count: 0.0
                store_down_count: 0.0
                store_low_space_count: 0.0
                store_offline_count: 0.0
                store_slow_count: 0.0
                store_tombstone_count: 0.0
                store_unhealth_count: 0.0
tidb:
        diagnostics:
                '⚠️ Warning: threshold missing for failed_query_opm': 0.0
                '🚨 Critical: qps': 169871.1166666667
        metrics:
                failed_query_opm: 0.0
                qps: 169871.1166666667
                query_duration_p999: 0.015002920253115134
tiflash:
        diagnostics: '⚠️ Warning: tiflash threshold missing'
        metrics: {}
tikv:
        diagnostics:
                '⚠️ Warning: metric empty for tikv_channel_full_total': null
                '⚠️ Warning: metric empty for tikv_coprocessor_request_error': null
                '⚠️ Warning: metric empty for tikv_scheduler_too_busy_total': null
                '🚨 Critical: kv_request_duration_99_by_store': 0.13708654709296972
                '🚨 Critical: tikv_grpc_msg_duration_999': 0.025472
        metrics:
                down_peer_region_count: 0.0
                empty_region_count: 0.0
                extra_peer_region_count: 0.0
                kv_request_duration_99_by_store: 0.13708654709296972
                miss_peer_region_count: 0.0
                offline_peer_region_count: 0.0
                pending_peer_region_count: 0.0
                region_count: 24811.0
                tikv_channel_full_total: null
                tikv_coprocessor_request_error: null
                tikv_engine_write_stall: 0.0
                tikv_grpc_msg_duration_999: 0.025472
                tikv_raftstore_store_write_msg_block_wait_duration_seconds_count: 0.0
                tikv_scheduler_too_busy_total: null
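
The sample report shows three kinds of diagnostics: a metric with no data, a metric with no configured threshold, and a metric flagged as critical. The sketch below shows one way such entries could be derived from the metrics. It is not the tool's actual logic: the function name, the threshold values, the order of the checks, and the rule that a value above its threshold is critical are all assumptions made for illustration.

```python
from typing import Dict, Optional


def diagnose(metrics: Dict[str, Optional[float]],
             thresholds: Dict[str, float]) -> Dict[str, Optional[float]]:
    """Build a diagnostics mapping from metric values (hypothetical logic)."""
    diagnostics: Dict[str, Optional[float]] = {}
    for name, value in sorted(metrics.items()):
        if value is None:
            # The query returned no data for this metric.
            diagnostics[f"Warning: metric empty for {name}"] = None
        elif name not in thresholds:
            # No threshold is configured, so the value cannot be judged.
            diagnostics[f"Warning: threshold missing for {name}"] = value
        elif value > thresholds[name]:
            # Assumed rule: exceeding the threshold is critical.
            diagnostics[f"Critical: {name}"] = value
    return diagnostics


if __name__ == "__main__":
    # Metric values from the tidb section of the sample report.
    tidb_metrics = {
        "failed_query_opm": 0.0,
        "qps": 169871.1166666667,
        "query_duration_p999": 0.015002920253115134,
    }
    # Hypothetical thresholds; the tool's real thresholds are not shown here.
    tidb_thresholds = {"qps": 100000.0, "query_duration_p999": 0.1}
    for message, value in diagnose(tidb_metrics, tidb_thresholds).items():
        print(f"{message}: {value}")
```

Under these assumed thresholds the sketch yields the same two tidb diagnostics as the sample: a missing threshold for failed_query_opm and a critical qps, while query_duration_p999 stays within its limit and is not reported.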

List clusters

Example

Cluster ID: 1111111111111111111 Project ID: 1333333333333333333 Tenant ID: 1388888888888888888 TiDB: 24 TiKV: 51 TiFlash: 33
Cluster ID: 1222222222222222222 Project ID: 1444444444444444444 Tenant ID: 1388888888888888888 TiDB: 38 TiKV: 36 TiFlash: 25
Cluster ID: 1333333333333333333 Project ID: 1455444444444444444 Tenant ID: 1388888888888888888 TiDB: 32 TiKV: 46 TiFlash: 15

Notice

For on-premises TiDB clusters, please download v0.1.0.
