From 9cf64d65cc34b37730665f28179d021a9b036edd Mon Sep 17 00:00:00 2001 From: mornyx Date: Tue, 25 Mar 2025 16:13:02 +0800 Subject: [PATCH 1/2] byoc: add o11y agent tutorial, configs and refs Signed-off-by: mornyx --- tidb-cloud/byoc-o11y-agent-references.md | 35 + tidb-cloud/byoc-o11y-agent-tutorial.md | 1043 +++++++++++++++++ .../byoc-o11y-agent-vector-config-k8s-aws.md | 525 +++++++++ .../byoc-o11y-agent-vector-config-k8s-gcp.md | 513 ++++++++ .../byoc-o11y-agent-vector-config-tidb.md | 430 +++++++ .../byoc-o11y-agent-vmagent-config-k8s.md | 271 +++++ .../byoc-o11y-agent-vmagent-config-tidb.md | 609 ++++++++++ 7 files changed, 3426 insertions(+) create mode 100644 tidb-cloud/byoc-o11y-agent-references.md create mode 100644 tidb-cloud/byoc-o11y-agent-tutorial.md create mode 100644 tidb-cloud/byoc-o11y-agent-vector-config-k8s-aws.md create mode 100644 tidb-cloud/byoc-o11y-agent-vector-config-k8s-gcp.md create mode 100644 tidb-cloud/byoc-o11y-agent-vector-config-tidb.md create mode 100644 tidb-cloud/byoc-o11y-agent-vmagent-config-k8s.md create mode 100644 tidb-cloud/byoc-o11y-agent-vmagent-config-tidb.md diff --git a/tidb-cloud/byoc-o11y-agent-references.md b/tidb-cloud/byoc-o11y-agent-references.md new file mode 100644 index 0000000000000..4f7ddbfcb3981 --- /dev/null +++ b/tidb-cloud/byoc-o11y-agent-references.md @@ -0,0 +1,35 @@ +# Deploy Kubernetes and TiDB Cluster + +- [Deploy TiDB Cluster on AWS EKS](https://docs.pingcap.com/tidb-in-kubernetes/stable/deploy-on-aws-eks/) +- [Deploy TiDB Cluster on Google Cloud GKE](https://docs.pingcap.com/tidb-in-kubernetes/stable/deploy-on-gcp-gke/) +- [Deploy TiDB Cluster on Azure AKS](https://docs.pingcap.com/tidb-in-kubernetes/stable/deploy-on-azure-aks/) + +# AWS Cloud Resources + +- [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html) +- [AWS IAM Role](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_create.html) +- [AWS IAM Role for CLI](https://docs.aws.amazon.com/cli/latest/reference/iam/) +- [AWS Enable IAM Role for Service Accounts](https://docs.aws.amazon.com/eks/latest/userguide/enable-iam-roles-for-service-accounts.html) +- [AWS Private Link](https://docs.aws.amazon.com/vpc/latest/privatelink/create-interface-endpoint.html) +- [AWS Private Link for CLI](https://docs.aws.amazon.com/cli/latest/reference/ec2/create-vpc-endpoint.html) +- [AWS Node Group](https://docs.aws.amazon.com/eks/latest/userguide/managed-node-groups.html) +- [AWS Node Group for CLI](https://docs.aws.amazon.com/cli/latest/reference/eks/create-nodegroup.html) + +# GCP Cloud Resources + +- [GCloud CLI](https://cloud.google.com/sdk/docs/install) +- [GCP Service Account](https://cloud.google.com/iam/docs/service-account-overview) +- [GCP Private Service Connect](https://cloud.google.com/vpc/docs/private-service-connect) +- [GCP Node Pool](https://cloud.google.com/kubernetes-engine/docs/how-to/node-pools) + +# Agent + +- [Victoria Metrics Operator](https://docs.victoriametrics.com/guides/getting-started-with-vm-operator/) +- [Victoria Metrics Agent](https://docs.victoriametrics.com/vmagent/) +- [Vector Agent](https://vector.dev/) + +# Others + +- [Helm 3](https://helm.sh/docs/intro/install/) +- [kubectl 1.21+](https://kubernetes.io/docs/tasks/tools/) +- [Kubernetes Deployment](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/) diff --git a/tidb-cloud/byoc-o11y-agent-tutorial.md b/tidb-cloud/byoc-o11y-agent-tutorial.md new file mode 100644 index 0000000000000..4ae2b253e6a62 --- /dev/null +++ 
b/tidb-cloud/byoc-o11y-agent-tutorial.md @@ -0,0 +1,1043 @@ +This document explains how to quickly deploy the O11y Agent to collect Metrics/Logs/Diagnosis data and push them to the O11y backend for viewing on the Clinic page. + +The basic deployment steps are: + +1. Prepare a basic Kubernetes cluster with TiDB already deployed. +2. Create cloud resources required by the O11y Agent. +3. Create the O11y backend cluster. +4. Deploy the O11y Agent Operator. +5. Deploy the O11y Agent. +6. Verify O11y data. +7. Maintain the O11y Agent. + +# Step 1: Prepare a Kubernetes Cluster + +If you already have a Kubernetes cluster with TiDB deployed, you can skip this section. + +If you need to create a Kubernetes + TiDB cluster from scratch, please refer to: + +- [Deploy TiDB Cluster on AWS EKS](https://docs.pingcap.com/tidb-in-kubernetes/stable/deploy-on-aws-eks/) +- [Deploy TiDB Cluster on Google Cloud GKE](https://docs.pingcap.com/tidb-in-kubernetes/stable/deploy-on-gcp-gke/) +- [Deploy TiDB Cluster on Azure AKS](https://docs.pingcap.com/tidb-in-kubernetes/stable/deploy-on-azure-aks/) + +Note: Since O11y relies on cloud provider features like IAM and PrivateLink, it cannot work properly on locally created test Kubernetes clusters. For example, O11y is not suitable for Kubernetes + TiDB clusters created following this tutorial: [Quick Start with TiDB on Kubernetes](https://docs.pingcap.com/tidb-in-kubernetes/stable/get-started/). + +# Step 2: Create Required Cloud Resources for O11y Agent + +O11y Agent requires the following two types of cloud resources to ensure secure data transmission and permission management: + +1. **IAM Role**: Provides credentials for the O11y Agent running in Kubernetes Pods, authorizing it to write monitoring data to AWS services (e.g., S3) on the O11y side. +2. **Private Link**: Establishes a private connection with the O11y backend service to avoid transmitting monitoring data over public networks. + +Additionally, as an optional step, you can create a dedicated Kubernetes node group for O11y Agent to isolate it from other Pods. + +## IAM Role + +### AWS + +#### Prerequisites + +Ensure your EKS cluster has IAM OIDC identity provider enabled. If not, please refer to the [AWS official documentation](https://docs.aws.amazon.com/eks/latest/userguide/enable-iam-roles-for-service-accounts.html) to complete the configuration. + +#### Steps + +1. Log in to AWS IAM Console + +Sign in to the [IAM Console](https://console.aws.amazon.com/iam/) using your account, ensuring it has IAM management permissions. + +2. Create IAM Role + +- In the left navigation pane, select **Roles** > **Create Role**. +- **Trusted entity type**: Select **Web identity**. +- **Identity provider**: From the dropdown, select the OIDC provider in the format `oidc.eks..amazonaws.com/id/` (corresponding to your EKS cluster). +- **Audience**: Enter `sts.amazonaws.com` (default value, no modification needed). +- Click **Next**. +- Create IAM Role - Select Web Identity. + +3. Attach AssumeRole Permission Policy + +Click **Create policy (opens in new tab)**, select the **JSON** tab. + +Enter the following policy content: + +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Action": "sts:AssumeRole", + "Resource": "*" + } + ] +} +``` + +Click **Next**, name the policy `O11yAgentAssumeRolePolicy`, and complete the creation. + +Return to the role creation page, refresh and select `O11yAgentAssumeRolePolicy`, then click **Next**. + +4. Set Role Name and Tags + +- Enter a role name (e.g. 
`O11yAgentAssumeRole`)
+- (Optional) Add tags (e.g. `Environment=Prod`)
+
+Click **Create Role**.
+
+5. Bind Kubernetes Service Account (Optional)
+
+After creating the role, record its ARN (format: `arn:aws:iam::<ACCOUNT_ID>:role/O11yAgentAssumeRole`).
+
+When deploying O11y Agent later, you need to specify this ARN in the Kubernetes ServiceAccount annotations. Here is an example:
+
+```yaml
+apiVersion: v1
+kind: ServiceAccount
+metadata:
+  name: o11y-agent
+  annotations:
+    eks.amazonaws.com/role-arn: "arn:aws:iam::<ACCOUNT_ID>:role/O11yAgentAssumeRole"
+```
+
+This AWS IAM Role will be specified when creating the O11y backend cluster to complete the authorization of this IAM Role by the O11y backend.
+
+For more details about creating IAM Roles in AWS Web Console, please refer to the [official documentation](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_create.html).
+
+#### Steps (CLI Version)
+
+1. Install and configure [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html) with administrator privileges.
+
+2. Create AssumeRole policy
+
+```bash
+# Generate AssumeRole policy json file
+cat > assume-role-policy.json <<EOF
+{
+  "Version": "2012-10-17",
+  "Statement": [
+    {
+      "Effect": "Allow",
+      "Action": "sts:AssumeRole",
+      "Resource": "*"
+    }
+  ]
+}
+EOF
+
+# Create IAM policy
+POLICY_ARN=$(aws iam create-policy \
+  --policy-name O11yAgentAssumeRolePolicy \
+  --policy-document file://assume-role-policy.json \
+  --query 'Policy.Arn' \
+  --output text)
+```
+
+3. Create IAM Role
+
+```bash
+# Generate trust policy json file (replace <ACCOUNT_ID>, <REGION>, and <OIDC_ID> with your values)
+cat > trust-policy.json <<EOF
+{
+  "Version": "2012-10-17",
+  "Statement": [
+    {
+      "Effect": "Allow",
+      "Principal": {
+        "Federated": "arn:aws:iam::<ACCOUNT_ID>:oidc-provider/oidc.eks.<REGION>.amazonaws.com/id/<OIDC_ID>"
+      },
+      "Action": "sts:AssumeRoleWithWebIdentity",
+      "Condition": {
+        "StringEquals": {
+          "oidc.eks.<REGION>.amazonaws.com/id/<OIDC_ID>:aud": "sts.amazonaws.com"
+        }
+      }
+    }
+  ]
+}
+EOF
+
+# Create IAM Role
+ROLE_ARN=$(aws iam create-role \
+  --role-name O11yAgentAssumeRole \
+  --assume-role-policy-document file://trust-policy.json \
+  --query 'Role.Arn' \
+  --output text)
+
+# Attach IAM policy to IAM role
+aws iam attach-role-policy \
+  --role-name O11yAgentAssumeRole \
+  --policy-arn $POLICY_ARN
+```
+
+4. Verify IAM Configuration
+
+```bash
+# Check AssumeRolePolicyDocument
+aws iam get-role --role-name O11yAgentAssumeRole --query 'Role.AssumeRolePolicyDocument'
+
+# Check attached policies
+aws iam list-attached-role-policies --role-name O11yAgentAssumeRole
+```
+
+This AWS IAM Role will be configured when creating the O11y backend cluster to grant the O11y backend access permissions for this IAM Role.
+
+For more details about creating IAM Roles using AWS CLI, please refer to the [official documentation](https://docs.aws.amazon.com/cli/latest/reference/iam/).
+
+### GCP
+
+#### Steps
+
+1. Enable Workload Identity
+
+- Go to [GKE Console](https://console.cloud.google.com/kubernetes/)
+- Select target cluster > **Details** > **Edit**
+- Under **Security** section:
+  - Enable **Workload Identity**
+  - Set Identity namespace (default format: `<PROJECT_ID>.svc.id.goog`)
+- Click **Save**
+
+2. Create Project Service Account
+
+- Navigate to [IAM & Admin Console](https://console.cloud.google.com/iam-admin/serviceaccounts)
+- Click **Create Service Account**
+- Enter name: `o11y-agent`
+- Click **Create and Continue** > **Done**
+
+3. Bind Kubernetes Service Account (Optional)
+
+- Create ServiceAccount in GKE cluster:
+
+```yaml
+apiVersion: v1
+kind: ServiceAccount
+metadata:
+  name: o11y-agent
+  annotations:
+    iam.gke.io/gcp-service-account: "o11y-agent@<PROJECT_ID>.iam.gserviceaccount.com"
+```
+
+This GCP Service Account will be configured when creating the O11y backend cluster to authorize the O11y backend's access to this Service Account.
+
+For more details about creating Service Accounts in GCP Web Console, please refer to the [official documentation](https://cloud.google.com/iam/docs/service-account-overview).
+
+#### Steps (CLI Version)
+
+1. 
Install and configure [Google Cloud CLI](https://cloud.google.com/sdk/docs/install) + +2. Ensure you have obtained the following information: + +- **Project ID**: `` +- **GKE Cluster Name**: `` +- **VPC Network Name**: `` +- **Subnet Name**: `` + +3. Enable Workload Identity + +```bash +# Get cluster location information (replace or ) +gcloud container clusters describe \ + --region \ + --format="value(location)" + +# Enable Workload Identity +gcloud container clusters update \ + --region \ + --workload-pool=.svc.id.goog +``` + +4. Create Service Account + +```bash +# Create service account +gcloud iam service-accounts create o11y-agent \ + --project= \ + --display-name="O11y Agent SA" +``` + +This GCP Service Account will be configured when setting up the O11y backend cluster to grant the O11y backend access permissions for this Service Account. + +For more details about creating Service Accounts using gcloud CLI, see the [official documentation](https://cloud.google.com/iam/docs/service-account-overview). + +## Private Link + +### AWS + +#### Steps + +1. Sign in to AWS Management Console + +Log in to the [VPC Console](https://console.aws.amazon.com/vpc/) using your AWS account. + +2. Create an Endpoint + +- In the left navigation pane, select **Endpoints** > **Create Endpoint** +- **Service category**: Choose Other endpoint services +- **Service name**: Enter the service name provided by the O11y team +- **VPC**: Select the VPC to connect to +- **Subnets**: Check at least two private subnets across different Availability Zones +- **Enable DNS name**: Keep enabled +- **Security group**: Select a security group that allows HTTP/HTTPS outbound traffic + +For more details about creating Private Link in AWS Web Console, refer to the [official documentation](https://docs.aws.amazon.com/vpc/latest/privatelink/create-interface-endpoint.html) + +#### Steps (CLI Version) + +1. Get Subnet ID List + +```bash +SUBNET_IDS=$(aws ec2 describe-subnets \ + --filters "Name=vpc-id,Values=" \ + --query 'Subnets[].SubnetId' \ + --output text | tr '\n' ' ') +``` + +2. Create Interface Endpoint + +```bash +ENDPOINT_ID=$(aws ec2 create-vpc-endpoint \ + --vpc-id \ + --service-name $ENDPOINT_SERVICE \ + --vpc-endpoint-type Interface \ + --subnet-ids $SUBNET_IDS \ + --security-group-ids \ + --query 'VpcEndpoint.VpcEndpointId' \ + --output text) +``` + +3. Wait for Endpoint to Become Available + +```bash +aws ec2 wait vpc-endpoint-available --vpc-endpoint-ids $ENDPOINT_ID +``` + +For more details on creating Private Link using AWS CLI, see the [official documentation](https://docs.aws.amazon.com/cli/latest/reference/ec2/create-vpc-endpoint.html). + +### GCP + +#### Steps + +1. Access Private Service Connect + +- Navigate to **Network Services** > **Private Service Connect** +- Click **CONNECTED ENDPOINTS** > **Create Endpoint** + +2. Configure Endpoint Parameters + +- **Endpoint name**: o11y-psc-endpoint +- **Target service**: Enter the service attachment ID provided by O11y team +- **Network**: Select your VPC network +- **Subnet**: Choose the subnet where GKE cluster resides +- **Auto-assign IP**: Enabled +- **Global access**: Select based on cross-region requirements + +3. 
Configure Workload Identity + +- Navigate to **Kubernetes Engine** > **Clusters** +- Select GKE cluster, enable **Workload Identity** +- Record **Workload Pool ID** (format: .svc.id.goog) + +For more details about creating Private Service Connect in GCP Web Console, refer to the [official documentation](https://cloud.google.com/vpc/docs/private-service-connect). + +#### Steps (CLI Version) + +1. Create Private Service Connect Endpoint + +```bash +# Reserve internal IP +gcloud compute addresses create psc-ip \ + --region=${REGION} \ + --subnet=${CLIENT_SUBNET} \ + --purpose=PRIVATE_SERVICE_CONNECT + +# Create PSC endpoint +gcloud compute forwarding-rules create psc-endpoint \ + --region=${REGION} \ + --load-balancing-scheme=EXTERNAL \ + --address=psc-ip \ + --target-service-attachment=projects/${O11Y_PROJECT}/regions/${REGION}/serviceAttachments/${SA_ID} \ + --allow-psc-global-access +``` + +2. Configure Workload Identity + +```bash +# Enable Workload Identity +gcloud container clusters update ${CLUSTER_NAME} \ + --region=${REGION} \ + --workload-pool=${CLIENT_PROJECT}.svc.id.goog +``` + +For more details about creating Private Service Connect using GCP CLI, refer to the [official documentation](https://cloud.google.com/vpc/docs/private-service-connect). + +## Node Group (Optional) + +### AWS + +#### Steps + +1. Access EKS Console + +- Sign in to [AWS Management Console](https://console.aws.amazon.com/) +- Navigate to **Elastic Kubernetes Service** > **Clusters** and select target cluster +- Under Compute tab, click **Add Node Group** + +2. Configure Basic Node Group Parameters + +- **Node group name**: o11y-agent-nodes +- **Node IAM role**: Select existing role or create new (must include EKS worker node policies) +- **Instance type**: c5.xlarge (adjust based on monitoring workload) +- **Scaling policy**: Fixed count of 2 instances (minimum 2 nodes recommended for HA) + +3. Configure Advanced Settings + +Add node labels: + +``` +workload=o11y-agent +component=observability +``` + +Add node taints: + +``` +dedicated=o11y-agent:NoSchedule +``` + +4. Complete Creation + +- Review all configurations and click **Create** +- Wait for node status to change to **Active** (approximately 5-10 minutes) + +5. Configure O11y Agent Scheduling Policy (Example) + +Use the following configuration to schedule O11y Agent on the dedicated node group: + +```yaml +apiVersion: apps/v1 +kind: Deployment +spec: + template: + spec: + tolerations: + - key: "dedicated" + operator: "Equal" + value: "o11y-agent" + effect: "NoSchedule" + nodeSelector: + workload: "o11y-agent" +``` + +For additional details about creating Node Groups in AWS Web Console, see the [official documentation](https://docs.aws.amazon.com/eks/latest/userguide/managed-node-groups.html). + +#### Steps (CLI Version) + +1. Prepare Node Group Configuration + +```bash +# Define environment variables +CLUSTER_NAME="o11y-cluster" +NODEGROUP_NAME="o11y-agent-nodes" +REGION="us-west-2" +SUBNETS="subnet-0123456789abcdef0,subnet-0fedcba9876543210" +INSTANCE_TYPE="c5.xlarge" +MIN_SIZE=2 +MAX_SIZE=5 +DESIRED_SIZE=2 + +# Generate node role ARN (requires pre-created IAM role) +NODE_ROLE_ARN=$(aws iam get-role --role-name AmazonEKSNodeRole --query 'Role.Arn' --output text) +``` + +2. 
Create Node Group with Labels and Taints + +```bash +aws eks create-nodegroup \ + --cluster-name $CLUSTER_NAME \ + --nodegroup-name $NODEGROUP_NAME \ + --subnets $SUBNETS \ + --node-role $NODE_ROLE_ARN \ + --ami-type AL2_x86_64 \ + --instance-types $INSTANCE_TYPE \ + --scaling-config minSize=$MIN_SIZE,maxSize=$MAX_SIZE,desiredSize=$DESIRED_SIZE \ + --labels workload=o11y-agent,component=observability \ + --taints '[{"key":"dedicated","value":"o11y-agent","effect":"NO_SCHEDULE"}]' \ + --region $REGION +``` + +3. Verify Node Group Status + +```bash +# Check creation status +aws eks describe-nodegroup \ + --cluster-name $CLUSTER_NAME \ + --nodegroup-name $NODEGROUP_NAME \ + --query 'nodegroup.status' \ + --region $REGION + +# Wait for status to become ACTIVE (~5-10 minutes) +aws eks wait nodegroup-active \ + --cluster-name $CLUSTER_NAME \ + --nodegroup-name $NODEGROUP_NAME \ + --region $REGION +``` + +4. Configure O11y Agent Scheduling Policy (Example) + +Use the following configuration to schedule O11y Agent on the dedicated node group: + +```yaml +apiVersion: apps/v1 +kind: Deployment +spec: + template: + spec: + tolerations: + - key: "dedicated" + operator: "Equal" + value: "o11y-agent" + effect: "NoSchedule" + nodeSelector: + workload: "o11y-agent" +``` + +For more details about creating Node Groups using AWS CLI, refer to the [official documentation](https://docs.aws.amazon.com/cli/latest/reference/eks/create-nodegroup.html). + +### GCP + +#### Steps + +1. Access GKE Console + +- Sign in to [GCP Console](https://console.cloud.google.com/) +- Navigate to **Kubernetes Engine** > **Clusters** +- Select target cluster, click **Node Pools** tab +- Click **Create Node Pool** + +2. Configure Basic Parameters + +- **Node pool name**: o11y-agent-pool +- **Number of nodes**: 2 +- **Machine type**: e2-standard-4 (adjust based on monitoring workload) +- **Operating system**: Container-Optimized OS +- **Boot disk size**: 100 GB + +3. Set Advanced Configuration + +Node **Metadata** > **Labels**: + +``` +workload: o11y-agent +component: observability +``` + +Node **Taints**: + +``` +Key: dedicated +Value: o11y-agent +Effect: NoSchedule +``` + +Network Configuration: + +- Select same VPC network as GKE cluster +- Enable **Use only private IP** + +4. Complete Creation + +- Click **Create** button +- Wait for node pool status to change to **Ready** (~5-10 minutes) + +5. Configure O11y Agent Scheduling Policy (Example) + +Use the following configuration to schedule O11y Agent on the dedicated node pool: + +```yaml +apiVersion: apps/v1 +kind: Deployment +spec: + template: + spec: + tolerations: + - key: "dedicated" + operator: "Equal" + value: "o11y-agent" + effect: "NoSchedule" + nodeSelector: + workload: "o11y-agent" +``` + +For additional details about creating Node Pools in GCP Web Console, refer to the [official documentation](https://cloud.google.com/kubernetes-engine/docs/how-to/node-pools). + +#### Steps (CLI Version) + +1. Prepare Environment Variables + +```bash +# Basic configuration +CLUSTER_NAME="o11y-cluster" +NODE_POOL_NAME="o11y-agent-pool" +ZONE="us-central1-a" +MACHINE_TYPE="e2-standard-4" +DISK_SIZE="100" +NUM_NODES=2 + +# Network configuration +NETWORK="default" +SUBNETWORK="default" +``` + +2. 
Create Node Pool with Labels and Taints + +```bash +gcloud container node-pools create ${NODE_POOL_NAME} \ + --cluster=${CLUSTER_NAME} \ + --zone=${ZONE} \ + --machine-type=${MACHINE_TYPE} \ + --num-nodes=${NUM_NODES} \ + --disk-size=${DISK_SIZE} \ + --node-labels="workload=o11y-agent,component=observability" \ + --node-taints="dedicated=o11y-agent:NoSchedule" \ + --image-type="COS_CONTAINERD" \ + --network=${NETWORK} \ + --subnetwork=${SUBNETWORK} +``` + +3. Verify Node Pool Status + +```bash +# Check node pool status +gcloud container node-pools describe ${NODE_POOL_NAME} \ + --cluster=${CLUSTER_NAME} \ + --zone=${ZONE} \ + --format="value(status)" + +# Get node instance list +gcloud compute instances list \ + --filter="metadata.items.cluster-name=${CLUSTER_NAME} AND metadata.items.node-pool-name=${NODE_POOL_NAME}" +``` + +4. Configure O11y Agent Scheduling Policy (Example) + +Use the following configuration to schedule O11y Agent on the dedicated node pool: + +```yaml +apiVersion: apps/v1 +kind: Deployment +spec: + template: + spec: + tolerations: + - key: "dedicated" + operator: "Equal" + value: "o11y-agent" + effect: "NoSchedule" + nodeSelector: + workload: "o11y-agent" +``` + +For more details about creating Node Pools using GCP CLI, see the [official documentation](https://cloud.google.com/kubernetes-engine/docs/how-to/node-pools). + +# Step 3: Create the O11y Backend Cluster + +> Follow the Clinic UI Guide to fill required parameters and create O11y backend cluster. + +# Step 4: Deploy the O11y Agent Operator + +The O11y Agent consists of the Victoria Metrics Agent (VMAgent) and the Vector Agent. Since the current Vector Agent Operator is not officially supported and has low maintenance, we deploy the Vector Agent using Helm chart. Therefore, the O11y Agent Operator specifically refers to the Victoria Metrics Operator. + +## Prerequisites + +- Kubernetes cluster 1.20.9 +- [Helm 3](https://helm.sh/docs/intro/install/) +- [kubectl 1.21+](https://kubernetes.io/docs/tasks/tools/) + +## Steps + +### Add VictoriaMetrics Helm Repository + +To install VictoriaMetrics components, first add the VictoriaMetrics Helm repository by executing: + +```bash +helm repo add vm https://victoriametrics.github.io/helm-charts/ +``` + +Update helm repository: + +```bash +helm repo update +``` + +You can run the following command to verify that all contents are set up correctly: + +```bash +helm search repo vm/ +``` + +The expected output would be: + +```bash +NAME CHART VERSION APP VERSION DESCRIPTION +vm/victoria-metrics-agent 0.7.20 v1.62.0 Victoria Metrics Agent - collects metrics from ... +vm/victoria-metrics-alert 0.3.34 v1.62.0 Victoria Metrics Alert - executes a list of giv... +vm/victoria-metrics-auth 0.2.23 1.62.0 Victoria Metrics Auth - is a simple auth proxy ... +vm/victoria-metrics-cluster 0.8.32 1.62.0 Victoria Metrics Cluster version - high-perform... +vm/victoria-metrics-k8s-stack 0.2.9 1.16.0 Kubernetes monitoring on VictoriaMetrics stack.... +vm/victoria-metrics-operator 0.1.17 0.16.0 Victoria Metrics Operator +vm/victoria-metrics-single 0.7.5 1.62.0 Victoria Metrics Single version - high-performa... +``` + +### Install VM Operator from Helm Chart + +```bash +helm install vmoperator vm/victoria-metrics-operator +``` + +The expected output would be: + +```bash +NAME: vmoperator +LAST DEPLOYED: Thu Sep 30 17:30:30 2021 +NAMESPACE: default +STATUS: deployed +REVISION: 1 +TEST SUITE: None +NOTES: +victoria-metrics-operator has been installed. 
Check its status by running: + kubectl --namespace default get pods -l "app.kubernetes.io/instance=vmoperator" + +Get more information on https://github.com/VictoriaMetrics/helm-charts/tree/master/charts/victoria-metrics-operator. +See "Getting started guide for VM Operator" on https://docs.victoriametrics.com/guides/getting-started-with-vm-operator.html. +``` + +Run the following command to check if the VM Operator has started and is running: + +```bash +kubectl --namespace default get pods -l "app.kubernetes.io/instance=vmoperator" +``` + +The expected output would be: + +```bash +NAME READY STATUS RESTARTS AGE +vmoperator-victoria-metrics-operator-67cff44cd6-s47n6 1/1 Running 0 77s +``` + +For more details on using the Victoria Metrics Operator, please refer to the [official documentation](https://docs.victoriametrics.com/guides/getting-started-with-vm-operator/) + +# Step 5: Deploy the O11y Agent + +As mentioned earlier, the O11y Agent consists of two types of agents: the Victoria Metrics Agent and the Vector Agent. Specifically, it can be divided into four deployment types: + +- **TiDB Victoria Metrics Agent**: [Victoria Metrics Agent](https://docs.victoriametrics.com/vmagent/) for collecting TiDB Metrics, deployed once per TiDB Cluster with a Kubernetes resource of `Deployment`. +- **Kubernetes Victoria Metrics Agent**: [Victoria Metrics Agent](https://docs.victoriametrics.com/vmagent/) for collecting Kubernetes Metrics, deployed once per Kubernetes cluster with a Kubernetes resource of `Deployment`. +- **TiDB Vector Agent**: [Vector Agent](https://vector.dev/) for collecting TiDB SQL diagnostic data, deployed once per TiDB Cluster with a Kubernetes resource of `Deployment`. +- **Kubernetes Vector Agent**: [Vector Agent](https://vector.dev/) for collecting all Logs, deployed once per Node with a Kubernetes resource of `DaemonSet`. + +## TiDB Victoria Metrics Agent + +### Functional Description + +**TiDB Victoria Metrics Agent** is specifically designed to collect monitoring metrics from TiDB clusters. Each TiDB Cluster requires one deployed instance. + +### Prepare Resource File + +Deploying VMAgent requires preparing a VMAgent CR resource file. The key parameters are `remoteWrite.url` and `additionalScrapeConfigs` - the former specifies where to push collected metrics data, while the latter describes how metrics should be collected. + +Basic resource template: + +```yaml +apiVersion: operator.victoriametrics.com/v1beta1 +kind: VMAgent +metadata: + # Recommended to distinguish by tidb naming + name: tidb-vmagent + namespace: monitoring +spec: + # Version + image: + repository: victoriametrics/vmagent + tag: v1.102.1 # Keep consistent with TiDB Cloud + pullPolicy: IfNotPresent + # Scrape interval. Recommended 15s for TiDB VMAgent + scrapeInterval: 15s + # Metric labels, can be added as needed + externalLabels: + cluster: "my-k8s-cluster" + region: "us-west-1" + # Remote write target - sending data to O11y backend + remoteWrite: + - url: + # VMAgent replica count + replicaCount: 1 + # Resource limits, adjust according to cluster size + resources: + requests: + memory: "512Mi" + cpu: "250m" + limits: + memory: "1024Mi" + cpu: "500m" + # Scrape configuration + additionalScrapeConfigs: + # ... + # (Example) + - job_name: 'node-exporter' + static_configs: + - targets: ['node-exporter-service.monitoring.svc:9100'] +``` + +The `remoteWrite.url` can be obtained after creating the O11y backend cluster as mentioned earlier - simply fill in the corresponding value. 
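+
+As a concrete illustration, the two fields might be filled in as follows. The endpoint URL and the scrape target below are hypothetical placeholders only; the full scrape configuration for a real cluster is the template referenced next. TiDB's default status port (`10080`) is where the component exposes `/metrics`.
+
+```yaml
+spec:
+  # Hypothetical endpoint -- use the value shown for your O11y backend cluster
+  remoteWrite:
+    - url: "http://o11y-backend.internal.example/api/v1/write"
+  additionalScrapeConfigs:
+    # Example static scrape job against one TiDB server status port (address is an assumption)
+    - job_name: 'tidb'
+      metrics_path: /metrics
+      static_configs:
+        - targets: ['db-tidb-0.db-tidb-peer.tidb1234.svc:10080']
+```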
+
+For standard TiDB clusters, we provide an `additionalScrapeConfigs` template to collect metrics from all TiDB components. Please refer to [TiDB VMAgent Config](./byoc-o11y-agent-vmagent-config-tidb.md).
+
+### Deployment
+
+Deploy using `kubectl` with the prepared resource file:
+
+```bash
+kubectl apply -f tidb-vmagent.yaml
+```
+
+Verify VMAgent is running properly:
+
+```bash
+# Check VMAgent CR status
+kubectl get vmagent -n monitoring
+
+# Check Pod status
+kubectl get pods -n monitoring -l app.kubernetes.io/name=tidb-vmagent
+
+# Check logs
+kubectl logs -n monitoring -l app.kubernetes.io/name=tidb-vmagent --tail=50
+```
+
+## Kubernetes Victoria Metrics Agent
+
+### Functional Description
+
+**Kubernetes Victoria Metrics Agent** is dedicated to collecting the basic monitoring metrics of the Kubernetes cluster itself. One instance needs to be deployed per Kubernetes cluster.
+
+### Prepare Resource File
+
+Similar to the TiDB VMAgent, you need to prepare a VMAgent CR resource file dedicated to the Kubernetes VMAgent.
+
+Basic resource template:
+
+```yaml
+apiVersion: operator.victoriametrics.com/v1beta1
+kind: VMAgent
+metadata:
+  name: k8s-vmagent
+  namespace: monitoring
+spec:
+  # Version
+  image:
+    repository: victoriametrics/vmagent
+    tag: v1.102.1 # Keep consistent with TiDB Cloud
+    pullPolicy: IfNotPresent
+  # Scrape interval. Recommended 30s for Kubernetes VMAgent
+  scrapeInterval: 30s
+  # Metric labels, can be added as needed
+  externalLabels:
+    cluster: "my-k8s-cluster"
+    region: "us-west-1"
+  # Remote write target - sending data to O11y backend.
+  remoteWrite:
+    - url:
+  # VMAgent replica count
+  replicaCount: 1
+  # Resource limits, adjust according to cluster size
+  resources:
+    requests:
+      memory: "512Mi"
+      cpu: "250m"
+    limits:
+      memory: "1024Mi"
+      cpu: "500m"
+  # Scrape configuration
+  additionalScrapeConfigs:
+    # ...
+    # (Example)
+    - job_name: 'node-exporter'
+      static_configs:
+        - targets: ['node-exporter-service.monitoring.svc:9100']
+```
+
+The `remoteWrite.url` can be obtained after creating the O11y backend cluster as mentioned earlier - simply fill in the corresponding value.
+
+For Kubernetes VMAgent, we provide an `additionalScrapeConfigs` template to collect metrics from all Kubernetes components. Please refer to [Kubernetes VMAgent Config](./byoc-o11y-agent-vmagent-config-k8s.md).
+
+### Deployment
+
+Apply the prepared resource file using `kubectl`:
+
+```bash
+kubectl apply -f k8s-vmagent.yaml
+```
+
+Verify VMAgent is running properly:
+
+```bash
+# Check VMAgent CR status
+kubectl get vmagent -n monitoring
+
+# Check Pod status
+kubectl get pods -n monitoring -l app.kubernetes.io/name=k8s-vmagent
+
+# Check logs
+kubectl logs -n monitoring -l app.kubernetes.io/name=k8s-vmagent --tail=50
+```
+
+## Kubernetes Vector Agent
+
+### Functional Description
+
+**Kubernetes Vector Agent** is responsible for collecting Pod logs from node system disks, with one instance deployed per Kubernetes node.
+
+### Prepare Resource File
+
+For Kubernetes Vector Agent, we provide a Helm Values template to collect all necessary logs. Please refer to the following links for configuration:
+
+- [Node Vector Agent for AWS](./byoc-o11y-agent-vector-config-k8s-aws.md)
+- [Node Vector Agent for GCP](./byoc-o11y-agent-vector-config-k8s-gcp.md)
+
+> Note: Please replace the variables in the `${VAR}` format in the configuration appropriately.
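+
+For example, after substitution the Helm values file (`k8s-vector.yaml`, consumed by the `helm install` command in the next section) could contain entries such as the following. All concrete values here are hypothetical and only illustrate which kinds of settings the placeholders control; take the real values from your cloud account and O11y backend cluster:
+
+```yaml
+# Excerpt of a substituted values file -- hypothetical values for illustration only
+image:
+  repository: timberio/vector
+  tag: 0.34.0-distroless-libc
+resources:
+  requests:
+    cpu: 500m
+    memory: 512Mi
+  limits:
+    cpu: "1"
+    memory: 1Gi
+serviceAccount:
+  create: true
+  annotations:
+    # On AWS, bind the ServiceAccount to the IAM Role created in Step 2
+    eks.amazonaws.com/role-arn: "arn:aws:iam::<ACCOUNT_ID>:role/O11yAgentAssumeRole"
+```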
+ +### Deployment + +Add Vector Agent Helm repository: + +```bash +helm repo add vector https://helm.vector.dev +helm repo update +``` + +Install Kubernetes Vector Agent: + +```bash +helm install vector vector/vector \ + --namespace monitoring \ + --create-namespace \ + -f k8s-vector.yaml +``` + +Verify deployment: + +```bash +# Check Pod status +kubectl get pods -n monitoring -l app.kubernetes.io/instance=k8s-vector + +# Check logs +kubectl logs -n monitoring -l app.kubernetes.io/instance=k8s-vector --tail=50 +``` + +## TiDB Vector Agent + +### Functional Description + +**TiDB Vector Agent** is responsible for collecting TiDB SQL diagnostic data, with one instance deployed per TiDB Cluster. + +### Prepare Resource File + +Since TiDB Vector Agent doesn't follow the conventional `DaemonSet` deployment pattern, we deploy it directly using native Kubernetes resources. + +We need to prepare the following resources: + +1. Reference the previously created `ServiceAccount`, which is already bound to the corresponding IAM Role: + +```yaml +# tidb-vector-sa.yaml +apiVersion: v1 +kind: ServiceAccount +metadata: + name: o11y-agent + annotations: + eks.amazonaws.com/role-arn: "arn:aws:iam:::role/O11yAgentAssumeRole" +``` + +2. Configuration file in `ConfigMap`: + +```yaml +# tidb-vector-cm.yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: tidb-vector + namespace: monitoring +data: + config.toml: | + ... +``` + +The content of `config.toml` can be referenced from [TiDB Vector Config](./byoc-o11y-agent-vector-config-tidb.md). + +3. `Deployment` resource definition file: + +```yaml +# tidb-vector-deploy.yaml +apiVersion: apps/v1 +kind: Deployment +metadata: + # Adjust naming according to your TiDB cluster + name: tidb-vector + namespace: monitoring +spec: + replicas: 1 + selector: + matchLabels: + app: tidb-vector + template: + metadata: + labels: + app: tidb-vector + spec: + serviceAccountName: o11y-agent + containers: + - name: vector + image: timberio/vector:0.34.0-distroless-libc + args: ["--config", "/etc/vector/config.toml"] + resources: + requests: + cpu: "1" + memory: 1Gi + volumeMounts: + - name: config + mountPath: /etc/vector + volumes: + - name: config + configMap: + name: tidb-vector +``` + +### Deployment + +Deploy TiDB Vector Agent: + +```bash +kubectl apply -f tidb-vector-sa.yaml +kubectl apply -f tidb-vector-cm.yaml +kubectl apply -f tidb-vector-deploy.yaml +``` + +Verify deployment: + +```bash +# Check Pod status +kubectl get pods -n monitoring -l app.kubernetes.io/instance=tidb-vector + +# Check logs +kubectl logs -n monitoring -l app.kubernetes.io/instance=tidb-vector --tail=50 +``` + +# Step 6: Verify O11y data + +> Follow the Clinic UI Guide to browse and validate O11y data on the Clinic web console. + +# Step 7: Maintain the O11y Agent + +For O11y Agent maintenance operations, refer to the following docs: + +- [Victoria Metrics Operator Docs](https://docs.victoriametrics.com/guides/getting-started-with-vm-operator/) +- [Kubernetes Docs](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/) +- [Helm Docs](https://helm.sh/docs/) diff --git a/tidb-cloud/byoc-o11y-agent-vector-config-k8s-aws.md b/tidb-cloud/byoc-o11y-agent-vector-config-k8s-aws.md new file mode 100644 index 0000000000000..04a1ea436ac19 --- /dev/null +++ b/tidb-cloud/byoc-o11y-agent-vector-config-k8s-aws.md @@ -0,0 +1,525 @@ +```yaml +# Please replace the variables in the ${} format in the configuration appropriately. 
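+# Variables referenced below and what they control:
+#   ${REGION}, ${BUCKET}, ${ROLE_ARN}                                -- AWS region, S3 bucket, and IAM role assumed by the aws_s3 sinks
+#   ${SA_ROLE_ARN}                                                   -- IAM role ARN annotated on the Vector ServiceAccount
+#   ${TENANT_ID}, ${PROJECT_ID}                                      -- O11y tenant and project identifiers used in object key prefixes and log labels
+#   ${LOKI_URL}                                                      -- Loki endpoint of the O11y backend
+#   ${CLUSTER_ENV}, ${PROVIDER}, ${K8S_CLUSTER_NAME}                 -- labels attached to shipped logs
+#   ${IMAGE_REPO}, ${IMAGE_TAG}                                      -- Vector image repository and tag
+#   ${REQUEST_CPU}, ${REQUEST_MEMORY}, ${LIMIT_CPU}, ${LIMIT_MEMORY} -- Pod resource requests and limits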
+ +customConfig: + data_dir: /vector-data-dir + api: + enabled: true + address: 0.0.0.0:8686 + playground: false + + sources: + all_logs: + type: kubernetes_logs + journald: + type: journald + + self_metrics: + type: internal_metrics + + oom_filenames: + type: filename + include: + - /var/tidb_oom_record/**/* + + replayer_filenames: + type: filename + include: + - /var/tidb_plan_replayer/**/* + + self_logs: + type: internal_logs + + transforms: + ensure_component: + type: remap + inputs: + - all_logs + drop_on_error: true + reroute_dropped: true + source: |- + if !is_null(.kubernetes.container_name) && .kubernetes.container_name == "slowlog" { + .o11y.component = "slowlog" + } else if !is_null(.kubernetes.container_name) && .kubernetes.container_name == "statementlog" { + .o11y.component = "statementlog" + } else if !is_null(.kubernetes.pod_labels."app.kubernetes.io/component") { + .o11y.component = .kubernetes.pod_labels."app.kubernetes.io/component" + } else if !is_null(.kubernetes.container_name) { + .o11y.component = .kubernetes.container_name + } else { + .o11y.component = "unknown" + } + route_logs: + type: route + inputs: + - ensure_component + route: + tidb_cluster_general_logs: |- + includes(["tidb","tikv","pd","ticdc","tiflash","tiflash-learner","db-tidb-extra-access","backup","restore","tidb-lightning","dm-master","dm-worker","extra-tidb","statementlog"], .o11y.component) + tidb_cluster_slow_logs: |- + .o11y.component == "slowlog" + infra_logs: |- + !includes(["tidb","tikv","pd","ticdc","tiflash","tiflash-learner","db-tidb-extra-access","backup","restore","tidb-lightning","dm-master","dm-worker","extra-tidb","slowlog","statementlog"], .o11y.component) + multiline_slow_logs: + type: reduce + inputs: + - route_logs.tidb_cluster_slow_logs + starts_when: + type: vrl + source: |- + res, err = starts_with(.message, "# Time: ") + if err != null { + false + } else { + res + } + merge_strategies: + message: concat_newline + + ensure_cluster_id: + type: remap + inputs: + - multiline_slow_logs + - route_logs.tidb_cluster_general_logs + drop_on_error: true + reroute_dropped: true + source: |- + if !is_null(.kubernetes.pod_labels."tags.tidbcloud.com/cluster") { + .o11y.cluster_id = .kubernetes.pod_labels."tags.tidbcloud.com/cluster" + } else { + parsed, err = parse_grok(.kubernetes.pod_namespace, "tidb%{GREEDYDATA:cluster_id}") + if err == null { + .o11y.cluster_id = parsed.cluster_id + } else { + .o11y.cluster_id = "unknown" + } + } + if is_null(.kubernetes.pod_labels.app) { + .kubernetes.pod_labels.app = .kubernetes.pod_name + } + if is_null(.kubernetes.pod_labels.release) { + .kubernetes.pod_labels.release = "unknown" + } + enrich_tidb_cluster_log_message: + type: remap + inputs: + - ensure_cluster_id + drop_on_error: true + reroute_dropped: true + source: |- + t = to_string!(.timestamp) + node = .kubernetes.pod_node_name + pod_namespace = .kubernetes.pod_namespace + pod_name = .kubernetes.pod_name + container_name = .kubernetes.container_name + prefix = join!([t, node, pod_namespace, pod_name, container_name], ";") + new_message, err = prefix + " " + .message + if err == null { + .message = new_message + } + enrich_infra_log_message: + type: remap + inputs: + - route_logs.infra_logs + drop_on_error: true + reroute_dropped: true + source: |- + t = to_string!(.timestamp) + node = .kubernetes.pod_node_name + pod_namespace = .kubernetes.pod_namespace + pod_name = .kubernetes.pod_name + container_name = .kubernetes.container_name + prefix = join!([t, node, pod_namespace, pod_name, 
container_name], ";") + new_message, err = prefix + " " + .message + if err == null { + .message = new_message + } + ensure_general_cluster_id: + type: remap + inputs: + # pod logs except tidb cluster general logs + - route_logs.infra_logs + # unmatched logs + - ensure_component.dropped + - route_logs._unmatched + - ensure_cluster_id.dropped + drop_on_error: true + reroute_dropped: true + source: |- + .o11y.cluster_id = "general_component" + if is_null(.kubernetes.pod_labels.app) { + .kubernetes.pod_labels.app = .kubernetes.pod_name + } + if is_null(.kubernetes.pod_labels.release) { + .kubernetes.pod_labels.release = "unknown" + } + oom_file_key_transform: + type: remap + inputs: + - oom_filenames + source: |- + fields = split!(.message, "/") + # /var/tidb_oom_record/1/9993/337845892/tidb337845892/db-tidb-0/running_sql2022-12-06T10:26:43Z + isV1 = length(fields) == 9 + # /var/tidb_oom_record/1/9993/337845892/tidb337845892/db-tidb-0/record2022-12-06T10:26:43Z/running_sql + isV2 = length(fields) == 10 + assert!(isV1 || isV2) + tenant_id = fields[3] + project_id = fields[4] + cluster_id = fields[5] + module = "oom_record" + component = replace!(fields[7], r'^.+?-(?P.+)-.*', "$$component") + generating_timestr = replace!(fields[8], r'^[^0-9]*', "") + generating_unix = to_string(to_unix_timestamp(parse_timestamp!(generating_timestr, "%+"))) + instance = replace!(fields[7], r'^.+?-', "") + if isV1 { + filename = fields[8] + .key = join!(["0", tenant_id, project_id, cluster_id, module, component, generating_unix, instance, filename], separator: "/") + } else if isV2 { + filename = join!([fields[9], generating_timestr]) + .key = join!(["0", tenant_id, project_id, cluster_id, module, component, generating_unix, instance, filename], separator: "/") + } + replayer_file_key_transform: + type: remap + inputs: + - replayer_filenames + source: |- + # v1: /var/tidb_plan_replayer/1/9993/337845892/tidb337845892/db-tidb-0/replayer_single_3Z90hA56EN0g4_Iav6PcZQ==_1670838592076844774.zip + # v2: /var/tidb_plan_replayer/1/9993/337845892/tidb337845892/db-tidb-0/replayer_r4hyX-MHJmxGo9rXva9fEg==_1670834504569984706.zip + fields = split!(.message, "/") + tenant_id = fields[3] + project_id = fields[4] + cluster_id = fields[5] + module = "plan_replayer" + component = replace!(fields[7], r'^.+?-(?P.+)-.*', "$$component") + generating_timestr = replace!(fields[8], r'^replayer.+_(?P\d+)\.zip', "$$ts") + generating_unix = slice!(generating_timestr, 0, length(generating_timestr) - 9) + instance = replace!(fields[7], r'^.+?-', "") + filename = fields[8] + .key = join!(["0", tenant_id, project_id, cluster_id, module, component, generating_unix, instance, filename], separator: "/") + sinks: + s3_infra_logs: + type: aws_s3 + inputs: + - enrich_infra_log_message + region: ${REGION} + bucket: ${BUCKET} + key_prefix: |- + 1/${TENANT_ID}/${PROJECT_ID}/k8s-infra/logs/{{ `{{ o11y.component }}` }}/ + compression: gzip + batch: + max_bytes: 104857600 # 100MB + timeout_secs: 180 # 3 minutes + buffer: + type: disk + max_size: 1073741824 # 1GB + encoding: + codec: text + auth: + # hostNetwork must be enabled + assume_role: ${ROLE_ARN} + + s3_tidb_cluster_logs: + type: aws_s3 + inputs: + - enrich_tidb_cluster_log_message + region: ${REGION} + bucket: ${BUCKET} + key_prefix: |- + 0/${TENANT_ID}/${PROJECT_ID}/{{ `{{ o11y.cluster_id }}` }}/logs/{{ `{{ o11y.component }}` }}/ + compression: gzip + batch: + max_bytes: 104857600 # 100MB + timeout_secs: 60 # 1 minute + buffer: + type: disk + max_size: 1073741824 # 1GB + encoding: + codec: text 
+ auth: + # hostNetwork must be enabled + assume_role: ${ROLE_ARN} + + s3_unmatched_kubernetes_logs: + type: aws_s3 + inputs: + - ensure_component.dropped + - route_logs._unmatched + - ensure_cluster_id.dropped + - enrich_tidb_cluster_log_message.dropped + - enrich_infra_log_message.dropped + region: ${REGION} + bucket: ${BUCKET} + key_prefix: |- + 1/${TENANT_ID}/${PROJECT_ID}/k8s-infra/logs/_unmatched/ + compression: gzip + batch: + max_bytes: 104857600 # 100MB + timeout_secs: 180 # 3 minutes + buffer: + type: disk + max_size: 1073741824 # 1GB + encoding: + codec: json + auth: + # hostNetwork must be enabled + assume_role: ${ROLE_ARN} + + s3_journald_logs: + type: aws_s3 + inputs: + - journald + region: ${REGION} + bucket: ${BUCKET} + key_prefix: |- + 1/${TENANT_ID}/${PROJECT_ID}/k8s-infra/logs/journald/{{ `{{ host }}` }}/ + compression: gzip + batch: + max_bytes: 104857600 # 100MB + timeout_secs: 180 # 3 minutes + buffer: + type: disk + max_size: 1073741824 # 1GB + encoding: + codec: text + auth: + # hostNetwork must be enabled + assume_role: ${ROLE_ARN} + self_metrics_sink: + type: prometheus_exporter + inputs: + - self_metrics + address: "0.0.0.0:9598" + + loki_pod_logs: + type: loki + inputs: + # tidb cluster general logs + - ensure_cluster_id + # infra component logs + - ensure_general_cluster_id + + encoding: + codec: text + compression: gzip + endpoint: ${LOKI_URL} + labels: + cluster_env: "${CLUSTER_ENV}" + cluster: "${PROVIDER}-dataplane/${K8S_CLUSTER_NAME}" + tenant_id: "${TENANT_ID}" + project_id: "${PROJECT_ID}" + cluster_id: '{{ "{{ o11y.cluster_id }}" }}' + node: '{{ "{{ kubernetes.pod_node_name }}" }}' + namespace: '{{ "{{ kubernetes.pod_namespace }}" }}' + instance: '{{ "{{ kubernetes.pod_name }}" }}' + container: '{{ "{{ kubernetes.container_name }}" }}' + app: '{{ "{{ kubernetes.pod_labels.app }}" }}' + release: '{{ "{{ kubernetes.pod_labels.release }}" }}' + stream: '{{ "{{ stream }}" }}' + out_of_order_action: accept # require Loki >= 2.4.0 + batch: + max_bytes: 10240 # 10KiB + timeout_secs: 1 + buffer: + max_size: 2147483648 # 2GiB + type: disk + loki_sys_logs: + type: loki + inputs: + - journald + encoding: + codec: json + compression: gzip + endpoint: ${LOKI_URL} + labels: + cluster_env: "${CLUSTER_ENV}" + cluster: "${PROVIDER}-dataplane/${K8S_CLUSTER_NAME}" + tenant_id: "${TENANT_ID}" + project_id: "${PROJECT_ID}" + node: '{{ "{{ host }}" }}' + syslog: '{{ "{{ SYSLOG_IDENTIFIER }}" }}' + out_of_order_action: accept # require Loki >= 2.4.0 + batch: + max_bytes: 10240 # 10KiB + timeout_secs: 1 + buffer: + max_size: 2147483648 # 2GiB + type: disk + + + s3_upload_oom_file: + type: aws_s3_upload_file + inputs: + - oom_file_key_transform + region: ${REGION} + bucket: ${BUCKET} + auth: + assume_role: ${ROLE_ARN} + + s3_upload_replayer_file: + type: aws_s3_upload_file + inputs: + - replayer_file_key_transform + region: ${REGION} + bucket: ${BUCKET} + auth: + assume_role: ${ROLE_ARN} + + self_logs_sink: + type: loki + inputs: + - self_logs + encoding: + codec: json + compression: gzip + endpoint: ${LOKI_URL} + labels: + cluster_env: "${CLUSTER_ENV}" + cluster: "${PROVIDER}-dataplane/${K8S_CLUSTER_NAME}" + app: "vector_log_agent" + node: '{{ "{{ host }}" }}' + out_of_order_action: accept # require Loki >= 2.4.0 + batch: + max_bytes: 10240 # 10KiB + timeout_secs: 1 + buffer: + max_size: 1073741824 # 1GiB + type: disk + +# extraVolumes -- Additional Volumes to use with Vector Pods +extraVolumes: + - name: machine-id + hostPath: + path: /etc/machine-id + type: File + - name: 
tidb-oom-record + hostPath: + path: /var/tidb_oom_record + type: DirectoryOrCreate + - name: tidb-plan-replayer + hostPath: + path: /var/tidb_plan_replayer + type: DirectoryOrCreate + +# extraVolumeMounts -- Additional Volume to mount into Vector Containers +extraVolumeMounts: + - name: machine-id + mountPath: /etc/machine-id + readOnly: true + - name: tidb-oom-record + mountPath: /var/tidb_oom_record + - name: tidb-plan-replayer + mountPath: /var/tidb_plan_replayer + +# resources -- Set Vector resource requests and limits. +resources: + requests: + cpu: ${REQUEST_CPU} + memory: ${REQUEST_MEMORY} + limits: + cpu: ${LIMIT_CPU} + memory: ${LIMIT_MEMORY} + +role: "Agent" + +## Configuration for Vector's data persistence +## Persistence always used for Agent role +persistence: + hostPath: + # persistence.hostPath.path -- Override path used for hostPath persistence + ## Valid for Agent role + path: "/var/lib/vector/01" + +# rollWorkload -- Add a checksum of the generated ConfigMap to workload annotations +rollWorkload: true + +## Define the Vector image to use +image: + # image.repository -- Override default registry + name for Vector + repository: ${IMAGE_REPO} + # image.pullPolicy -- Vector image pullPolicy + pullPolicy: IfNotPresent + tag: ${IMAGE_TAG} + +# dnsPolicy -- Specify DNS policy for Vector Pods +## Ref: https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#pod-s-dns-policy +dnsPolicy: ClusterFirst + +## Configure a HorizontalPodAutoscaler for Vector +## Valid for Aggregator and Stateless-Aggregator roles +autoscaling: + # autoscaling.enabled -- Enabled autoscaling for the Aggregator and Stateless-Aggregator + enabled: false + # autoscaling.minReplicas -- Minimum replicas for Vector's HPA + +podDisruptionBudget: + # podDisruptionBudget.enabled -- Enable a PodDisruptionBudget for Vector + enabled: false + +rbac: + # rbac.create -- If true, create and use RBAC resources. Only valid for the Agent role + create: true + +psp: + # psp.create -- If true, create a PodSecurityPolicy resource. 
PodSecurityPolicy is deprecated as of Kubernetes v1.21, and will be removed in v1.25 + ## Intended for use with the Agent role + create: false + +serviceAccount: + # serviceAccount.create -- If true, create ServiceAccount + create: true + annotations: + eks.amazonaws.com/role-arn: ${SA_ROLE_ARN} + # serviceAccount.automountToken -- Automount API credentials for the Vector ServiceAccount + automountToken: true + +# tolerations -- Allow Vector to schedule on tainted nodes +tolerations: + - operator: Exists + +## Configuration for Vector's Service +service: + # service.enabled -- If true, create and use a Service resource + enabled: false + +## Configuration for Vector's Ingress +ingress: + # ingress.enabled -- If true, create and use an Ingress resource + enabled: false + +# livenessProbe -- Override default liveness probe settings, if customConfig is used requires customConfig.api.enabled true +## Requires Vector's API to be enabled +livenessProbe: + httpGet: + path: /health + port: api + initialDelaySeconds: 30 + periodSeconds: 30 + failureThreshold: 10 + timeoutSeconds: 10 + +# readinessProbe -- Override default readiness probe settings, if customConfig is used requires customConfig.api.enabled true +## Requires Vector's API to be enabled +readinessProbe: + httpGet: + path: /health + port: api + +## Configure a PodMonitor for Vector +## Requires the PodMonitor CRD to be installed +podMonitor: + # podMonitor.enabled -- If true, create a PodMonitor for Vector + enabled: false +containerPorts: + - name: prom-exporter + containerPort: 9598 + protocol: TCP + - name: api + containerPort: 8686 + protocol: TCP + +## Optional built-in HAProxy load balancer +haproxy: + # haproxy.enabled -- If true, create a HAProxy load balancer + enabled: false +``` diff --git a/tidb-cloud/byoc-o11y-agent-vector-config-k8s-gcp.md b/tidb-cloud/byoc-o11y-agent-vector-config-k8s-gcp.md new file mode 100644 index 0000000000000..503997e6aa7c0 --- /dev/null +++ b/tidb-cloud/byoc-o11y-agent-vector-config-k8s-gcp.md @@ -0,0 +1,513 @@ +```yaml + +# customConfig -- Override Vector's default configs, if used **all** options need to be specified +## This section supports using helm templates to populate dynamic values +## Ref: https://vector.dev/docs/reference/configuration/ +customConfig: + data_dir: /vector-data-dir + api: + enabled: true + address: 0.0.0.0:8686 + playground: false + + sources: + all_logs: + type: kubernetes_logs + journald: + type: journald + + self_metrics: + type: internal_metrics + + oom_filenames: + type: filename + include: + - /var/tidb_oom_record/**/* + + replayer_filenames: + type: filename + include: + - /var/tidb_plan_replayer/**/* + + self_logs: + type: internal_logs + + transforms: + ensure_component: + type: remap + inputs: + - all_logs + drop_on_error: true + reroute_dropped: true + source: |- + if !is_null(.kubernetes.container_name) && .kubernetes.container_name == "slowlog" { + .o11y.component = "slowlog" + } else if !is_null(.kubernetes.container_name) && .kubernetes.container_name == "statementlog" { + .o11y.component = "statementlog" + } else if !is_null(.kubernetes.pod_labels."app.kubernetes.io/component") { + .o11y.component = .kubernetes.pod_labels."app.kubernetes.io/component" + } else if !is_null(.kubernetes.container_name) { + .o11y.component = .kubernetes.container_name + } else { + .o11y.component = "unknown" + } + route_logs: + type: route + inputs: + - ensure_component + route: + tidb_cluster_general_logs: |- + 
includes(["tidb","tikv","pd","ticdc","tiflash","tiflash-learner","db-tidb-extra-access","backup","restore","tidb-lightning","dm-master","dm-worker","extra-tidb","statementlog"], .o11y.component) + tidb_cluster_slow_logs: |- + .o11y.component == "slowlog" + infra_logs: |- + !includes(["tidb","tikv","pd","ticdc","tiflash","tiflash-learner","db-tidb-extra-access","backup","restore","tidb-lightning","dm-master","dm-worker","extra-tidb","slowlog","statementlog"], .o11y.component) + multiline_slow_logs: + type: reduce + inputs: + - route_logs.tidb_cluster_slow_logs + starts_when: + type: vrl + source: |- + res, err = starts_with(.message, "# Time: ") + if err != null { + false + } else { + res + } + merge_strategies: + message: concat_newline + + ensure_cluster_id: + type: remap + inputs: + - multiline_slow_logs + - route_logs.tidb_cluster_general_logs + drop_on_error: true + reroute_dropped: true + source: |- + if !is_null(.kubernetes.pod_labels."tags.tidbcloud.com/cluster") { + .o11y.cluster_id = .kubernetes.pod_labels."tags.tidbcloud.com/cluster" + } else { + parsed, err = parse_grok(.kubernetes.pod_namespace, "tidb%{GREEDYDATA:cluster_id}") + if err == null { + .o11y.cluster_id = parsed.cluster_id + } else { + .o11y.cluster_id = "unknown" + } + } + if is_null(.kubernetes.pod_labels.app) { + .kubernetes.pod_labels.app = .kubernetes.pod_name + } + if is_null(.kubernetes.pod_labels.release) { + .kubernetes.pod_labels.release = "unknown" + } + enrich_tidb_cluster_log_message: + type: remap + inputs: + - ensure_cluster_id + drop_on_error: true + reroute_dropped: true + source: |- + t = to_string!(.timestamp) + node = .kubernetes.pod_node_name + pod_namespace = .kubernetes.pod_namespace + pod_name = .kubernetes.pod_name + container_name = .kubernetes.container_name + prefix = join!([t, node, pod_namespace, pod_name, container_name], ";") + new_message, err = prefix + " " + .message + if err == null { + .message = new_message + } + enrich_infra_log_message: + type: remap + inputs: + - route_logs.infra_logs + drop_on_error: true + reroute_dropped: true + source: |- + t = to_string!(.timestamp) + node = .kubernetes.pod_node_name + pod_namespace = .kubernetes.pod_namespace + pod_name = .kubernetes.pod_name + container_name = .kubernetes.container_name + prefix = join!([t, node, pod_namespace, pod_name, container_name], ";") + new_message, err = prefix + " " + .message + if err == null { + .message = new_message + } + ensure_general_cluster_id: + type: remap + inputs: + # pod logs except tidb cluster general logs + - route_logs.infra_logs + # unmatched logs + - ensure_component.dropped + - route_logs._unmatched + - ensure_cluster_id.dropped + drop_on_error: true + reroute_dropped: true + source: |- + .o11y.cluster_id = "general_component" + if is_null(.kubernetes.pod_labels.app) { + .kubernetes.pod_labels.app = .kubernetes.pod_name + } + if is_null(.kubernetes.pod_labels.release) { + .kubernetes.pod_labels.release = "unknown" + } + oom_file_key_transform: + type: remap + inputs: + - oom_filenames + source: |- + fields = split!(.message, "/") + # /var/tidb_oom_record/1/9993/337845892/tidb337845892/db-tidb-0/running_sql2022-12-06T10:26:43Z + isV1 = length(fields) == 9 + # /var/tidb_oom_record/1/9993/337845892/tidb337845892/db-tidb-0/record2022-12-06T10:26:43Z/running_sql + isV2 = length(fields) == 10 + assert!(isV1 || isV2) + tenant_id = fields[3] + project_id = fields[4] + cluster_id = fields[5] + module = "oom_record" + component = replace!(fields[7], r'^.+?-(?P.+)-.*', "$$component") + 
generating_timestr = replace!(fields[8], r'^[^0-9]*', "") + generating_unix = to_string(to_unix_timestamp(parse_timestamp!(generating_timestr, "%+"))) + instance = replace!(fields[7], r'^.+?-', "") + if isV1 { + filename = fields[8] + .key = join!(["0", tenant_id, project_id, cluster_id, module, component, generating_unix, instance, filename], separator: "/") + } else if isV2 { + filename = join!([fields[9], generating_timestr]) + .key = join!(["0", tenant_id, project_id, cluster_id, module, component, generating_unix, instance, filename], separator: "/") + } + replayer_file_key_transform: + type: remap + inputs: + - replayer_filenames + source: |- + # v1: /var/tidb_plan_replayer/1/9993/337845892/tidb337845892/db-tidb-0/replayer_single_3Z90hA56EN0g4_Iav6PcZQ==_1670838592076844774.zip + # v2: /var/tidb_plan_replayer/1/9993/337845892/tidb337845892/db-tidb-0/replayer_r4hyX-MHJmxGo9rXva9fEg==_1670834504569984706.zip + fields = split!(.message, "/") + tenant_id = fields[3] + project_id = fields[4] + cluster_id = fields[5] + module = "plan_replayer" + component = replace!(fields[7], r'^.+?-(?P.+)-.*', "$$component") + generating_timestr = replace!(fields[8], r'^replayer.+_(?P\d+)\.zip', "$$ts") + generating_unix = slice!(generating_timestr, 0, length(generating_timestr) - 9) + instance = replace!(fields[7], r'^.+?-', "") + filename = fields[8] + .key = join!(["0", tenant_id, project_id, cluster_id, module, component, generating_unix, instance, filename], separator: "/") + sinks: + gcs_infra_logs: + type: gcp_cloud_storage + inputs: + - enrich_infra_log_message + bucket: ${BUCKET} + key_prefix: |- + 1/${TENANT_ID}/${PROJECT_ID}/k8s-infra/logs/{{ `{{ o11y.component }}` }}/ + compression: gzip + batch: + max_bytes: 104857600 # 100MB + timeout_secs: 180 # 3 minutes + buffer: + type: disk + max_size: 1073741824 # 1GB + encoding: + codec: text + + gcs_tidb_cluster_logs: + type: gcp_cloud_storage + inputs: + - enrich_tidb_cluster_log_message + bucket: ${BUCKET} + key_prefix: |- + 0/${TENANT_ID}/${PROJECT_ID}/{{ `{{ o11y.cluster_id }}` }}/logs/{{ `{{ o11y.component }}` }}/ + compression: gzip + batch: + max_bytes: 104857600 # 100MB + timeout_secs: 60 # 1 minute + buffer: + type: disk + max_size: 1073741824 # 1GB + encoding: + codec: text + + gcs_unmatched_kubernetes_logs: + type: gcp_cloud_storage + inputs: + - ensure_component.dropped + - route_logs._unmatched + - ensure_cluster_id.dropped + - enrich_tidb_cluster_log_message.dropped + - enrich_infra_log_message.dropped + bucket: ${BUCKET} + key_prefix: |- + 1/${TENANT_ID}/${PROJECT_ID}/k8s-infra/logs/_unmatched/ + compression: gzip + batch: + max_bytes: 104857600 # 100MB + timeout_secs: 180 # 3 minutes + buffer: + type: disk + max_size: 1073741824 # 1GB + encoding: + codec: json + + gcs_journald_logs: + type: gcp_cloud_storage + inputs: + - journald + bucket: ${BUCKET} + key_prefix: |- + 1/${TENANT_ID}/${PROJECT_ID}/k8s-infra/logs/journald/{{ `{{ host }}` }}/ + compression: gzip + batch: + max_bytes: 104857600 # 100MB + timeout_secs: 180 # 3 minutes + buffer: + type: disk + max_size: 1073741824 # 1GB + encoding: + codec: text + self_metrics_sink: + type: prometheus_exporter + inputs: + - self_metrics + address: "0.0.0.0:9598" + + loki_pod_logs: + type: loki + inputs: + # tidb cluster general logs + - ensure_cluster_id + # infra component logs + - ensure_general_cluster_id + + encoding: + codec: text + compression: gzip + endpoint: ${LOKI_URL} + labels: + cluster_env: "${CLUSTER_ENV}" + cluster: "${PROVIDER}-dataplane/${K8S_CLUSTER_NAME}" + tenant_id: 
"${TENANT_ID}" + project_id: "${PROJECT_ID}" + cluster_id: '{{ "{{ o11y.cluster_id }}" }}' + node: '{{ "{{ kubernetes.pod_node_name }}" }}' + namespace: '{{ "{{ kubernetes.pod_namespace }}" }}' + instance: '{{ "{{ kubernetes.pod_name }}" }}' + container: '{{ "{{ kubernetes.container_name }}" }}' + app: '{{ "{{ kubernetes.pod_labels.app }}" }}' + release: '{{ "{{ kubernetes.pod_labels.release }}" }}' + stream: '{{ "{{ stream }}" }}' + out_of_order_action: accept # require Loki >= 2.4.0 + batch: + max_bytes: 10240 # 10KiB + timeout_secs: 1 + buffer: + max_size: 2147483648 # 2GiB + type: disk + loki_sys_logs: + type: loki + inputs: + - journald + encoding: + codec: json + compression: gzip + endpoint: ${LOKI_URL} + labels: + cluster_env: "${CLUSTER_ENV}" + cluster: "${PROVIDER}-dataplane/${K8S_CLUSTER_NAME}" + tenant_id: "${TENANT_ID}" + project_id: "${PROJECT_ID}" + node: '{{ "{{ host }}" }}' + syslog: '{{ "{{ SYSLOG_IDENTIFIER }}" }}' + out_of_order_action: accept # require Loki >= 2.4.0 + batch: + max_bytes: 10240 # 10KiB + timeout_secs: 1 + buffer: + max_size: 2147483648 # 2GiB + type: disk + + + gcs_upload_oom_file: + type: gcp_cloud_storage_upload_file + inputs: + - oom_file_key_transform + bucket: ${BUCKET} + + gcs_upload_replayer_file: + type: gcp_cloud_storage_upload_file + inputs: + - replayer_file_key_transform + bucket: ${BUCKET} + + self_logs_sink: + type: loki + inputs: + - self_logs + encoding: + codec: json + compression: gzip + endpoint: ${LOKI_URL} + labels: + cluster_env: "${CLUSTER_ENV}" + cluster: "${PROVIDER}-dataplane/${K8S_CLUSTER_NAME}" + app: "vector_log_agent" + node: '{{ "{{ host }}" }}' + out_of_order_action: accept # require Loki >= 2.4.0 + batch: + max_bytes: 10240 # 10KiB + timeout_secs: 1 + buffer: + max_size: 1073741824 # 1GiB + type: disk + +# extraVolumes -- Additional Volumes to use with Vector Pods +extraVolumes: + + - name: machine-id + hostPath: + path: /etc/machine-id + type: File + + - name: tidb-oom-record + hostPath: + path: /var/tidb_oom_record + type: DirectoryOrCreate + + - name: tidb-plan-replayer + hostPath: + path: /var/tidb_plan_replayer + type: DirectoryOrCreate + + +# extraVolumeMounts -- Additional Volume to mount into Vector Containers +extraVolumeMounts: + + - name: machine-id + mountPath: /etc/machine-id + readOnly: true + + - name: tidb-oom-record + mountPath: /var/tidb_oom_record + + - name: tidb-plan-replayer + mountPath: /var/tidb_plan_replayer + +# resources -- Set Vector resource requests and limits. 
+resources: + requests: + cpu: ${REQUEST_CPU} + memory: ${REQUEST_MEMORY} + limits: + cpu: ${LIMIT_CPU} + memory: ${LIMIT_MEMORY} + +role: "Agent" + +podPriorityClassName: dataplane-node-o11y-critical + +## Configuration for Vector's data persistence +## Persistence always used for Agent role +persistence: + hostPath: + # persistence.hostPath.path -- Override path used for hostPath persistence + ## Valid for Agent role + path: "/var/lib/vector/01" + +# rollWorkload -- Add a checksum of the generated ConfigMap to workload annotations +rollWorkload: true + +## Define the Vector image to use +image: + # image.repository -- Override default registry + name for Vector + repository: ${IMAGE_REPO} + # image.pullPolicy -- Vector image pullPolicy + pullPolicy: IfNotPresent + tag: ${IMAGE_TAG} + +# dnsPolicy -- Specify DNS policy for Vector Pods +## Ref: https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#pod-s-dns-policy +dnsPolicy: ClusterFirst + +## Configure a HorizontalPodAutoscaler for Vector +## Valid for Aggregator and Stateless-Aggregator roles +autoscaling: + # autoscaling.enabled -- Enabled autoscaling for the Aggregator and Stateless-Aggregator + enabled: false + # autoscaling.minReplicas -- Minimum replicas for Vector's HPA + +podDisruptionBudget: + # podDisruptionBudget.enabled -- Enable a PodDisruptionBudget for Vector + enabled: false + +rbac: + # rbac.create -- If true, create and use RBAC resources. Only valid for the Agent role + create: true + +psp: + # psp.create -- If true, create a PodSecurityPolicy resource. PodSecurityPolicy is deprecated as of Kubernetes v1.21, and will be removed in v1.25 + ## Intended for use with the Agent role + create: false + +serviceAccount: + # serviceAccount.create -- If true, create ServiceAccount + create: true + annotations: + iam.gke.io/gcp-service-account: ${SA_ROLE_ARN} + # serviceAccount.automountToken -- Automount API credentials for the Vector ServiceAccount + automountToken: true + +# tolerations -- Allow Vector to schedule on tainted nodes +tolerations: + - operator: Exists + +## Configuration for Vector's Service +service: + # service.enabled -- If true, create and use a Service resource + enabled: false + +## Configuration for Vector's Ingress +ingress: + # ingress.enabled -- If true, create and use an Ingress resource + enabled: false + +# livenessProbe -- Override default liveness probe settings, if customConfig is used requires customConfig.api.enabled true +## Requires Vector's API to be enabled +livenessProbe: + httpGet: + path: /health + port: api + initialDelaySeconds: 30 + periodSeconds: 30 + failureThreshold: 10 + timeoutSeconds: 10 + +# readinessProbe -- Override default readiness probe settings, if customConfig is used requires customConfig.api.enabled true +## Requires Vector's API to be enabled +readinessProbe: + httpGet: + path: /health + port: api + +## Configure a PodMonitor for Vector +## Requires the PodMonitor CRD to be installed +podMonitor: + # podMonitor.enabled -- If true, create a PodMonitor for Vector + enabled: false +containerPorts: + - name: prom-exporter + containerPort: 9598 + protocol: TCP + - name: api + containerPort: 8686 + protocol: TCP +## Optional built-in HAProxy load balancer +haproxy: + # haproxy.enabled -- If true, create a HAProxy load balancer + enabled: false +``` diff --git a/tidb-cloud/byoc-o11y-agent-vector-config-tidb.md b/tidb-cloud/byoc-o11y-agent-vector-config-tidb.md new file mode 100644 index 0000000000000..0fb87e01a9b61 --- /dev/null +++ 
b/tidb-cloud/byoc-o11y-agent-vector-config-tidb.md
@@ -0,0 +1,430 @@
+O11y Vector Agent Configuration for Per-TiDB-Cluster Diagnostic Data Collection.
+
+# Basic Configuration
+
+```toml
+# Replace all ${VAR} placeholders with appropriate values
+
+# Vector Agent data directory
+data_dir = "/vector-data-dir"
+
+[api]
+# Whether to enable Vector Agent's management API
+enabled = false
+
+# Self-monitoring collection
+[sources.self_metrics]
+type = "internal_metrics"
+
+# Expose Vector's own metrics in Prometheus format on the /metrics endpoint
+[sinks.self_metrics_sink]
+type = "prometheus_exporter"
+inputs = ["self_metrics"]
+address = "0.0.0.0:${METRICS_PORT}"
+```
+
+# Feature Configuration
+
+## Top SQL
+
+```toml
+# Top SQL collection configuration
+[sources.topsql]
+type = "topsql"
+# PD endpoint for cluster topology discovery
+pd_address = "db-pd:2379"
+# TLS certificate paths
+tls.ca_path = "${TLS_PATH}/ca.crt"
+tls.crt_path = "${TLS_PATH}/tls.crt"
+tls.key_path = "${TLS_PATH}/tls.key"
+# Number of top SQL statements to retain per minute
+top_n = 10
+
+# Top SQL metadata transformation
+[transforms.topsql_add_meta]
+type = "remap"
+inputs = ["topsql"]
+source = """
+.labels.provider = "${PROVIDER}"
+.labels.region = "${REGION}"
+.labels.k8s_cluster_name = "${K8S_CLUSTER_NAME}"
+.labels.k8s_namespace = "${K8S_NAMESPACE}"
+.labels.project_id = "${PROJECT_ID}"
+.labels.tenant_id = "${TENANT_ID}"
+.labels.cluster_id = "${CLUSTER_ID}"
+"""
+
+# Top SQL export to VictoriaMetrics
+[sinks.topsql_vm]
+type = "vm_import"
+inputs = ["topsql_add_meta"]
+endpoint = "${VM_IMPORT_URL}"
+batch.max_events = 1000
+batch.max_bytes = 1048576 # 1MiB
+batch.timeout_secs = 1
+buffer.type = "disk"
+buffer.max_size = 536870912 # 512MiB
+buffer.when_full = "drop_newest"
+```
+
+## Continuous Profiling
+
+### AWS
+
+```toml
+# Continuous Profiling Collection Configuration
+[sources.conprof]
+type = "conprof"
+# PD endpoint for cluster topology subscription
+pd_address = "db-pd:2379"
+# TLS certificate paths
+tls.ca_path = "${TLS_PATH}/ca.crt"
+tls.crt_path = "${TLS_PATH}/tls.crt"
+tls.key_path = "${TLS_PATH}/tls.key"
+# Enable TiKV heap profiling (older TiKV versions may not support heap profiling)
+enable_tikv_heap_profile = true
+
+# Continuous Profiling Transformation - prepares S3 write path
+[transforms.conprof_add_meta]
+type = "remap"
+inputs = ["conprof"]
+source = """
+.key_prefix = join!(["0", "${TENANT_ID}", "${PROJECT_ID}", "${CLUSTER_ID}", "profiles", .filename], separator: "/")
+"""
+
+# Continuous Profiling Export to S3.
+[sinks.conprof_s3] +type = "aws_s3" +inputs = ["conprof_add_meta"] +encoding.codec = "raw_message" +region = "${REGION}" +bucket = "${BUCKET}" +key_prefix = "{{ key_prefix }}" +filename_time_format = "" +filename_append_uuid = false +batch.max_bytes = 1 # DO NOT BATCH +batch.max_events = 1 +batch.timeout_secs = 1 +auth.assume_role = "${ROLE_ARN}" # IAM role for write permissions +``` + +### GCP + +```toml +# Continuous Profiling Collection Configuration +[sources.conprof] +type = "conprof" +# PD endpoint for cluster topology subscription +pd_address = "db-pd:2379" +# TLS certificate paths +tls.ca_path = "${TLS_PATH}/ca.crt" +tls.crt_path = "${TLS_PATH}/tls.crt" +tls.key_path = "${TLS_PATH}/tls.key" +# Enable TiKV heap profiling (older TiKV versions may not support heap profiling) +enable_tikv_heap_profile = true + +# Continuous Profiling Transformation - prepares GCS write path +[transforms.conprof_add_meta] +type = "remap" +inputs = ["conprof"] +source = """ +.key_prefix = join!(["0", "${TENANT_ID}", "${PROJECT_ID}", "${CLUSTER_ID}", "profiles", .filename], separator: "/") +""" + +# Continuous Profiling Export to GCS. +[sinks.conprof_gcs] +type = "gcp_cloud_storage" +inputs = ["conprof_add_meta"] +encoding.codec = "raw_message" +bucket = "${BUCKET}" +key_prefix = "{{ key_prefix }}" +filename_time_format = "" +filename_append_uuid = false +batch.max_bytes = 1 # DO NOT BATCH +batch.max_events = 1 +batch.timeout_secs = 1 +``` + +## Key Visualizer + +### AWS + +```toml +# Key Visualizer Collection Configuration +[sources.keyviz] +type = "keyviz" +# PD endpoint for cluster topology subscription +pd_address = "db-pd:2379" +# TLS certificate paths +tls.ca_path = "${TLS_PATH}/ca.crt" +tls.crt_path = "${TLS_PATH}/tls.crt" +tls.key_path = "${TLS_PATH}/tls.key" + +# Key Visualizer Transformation - prepares S3 write path +[transforms.keyviz_add_meta] +type = "remap" +inputs = ["keyviz"] +source = """ +.key_prefix = join!(["0", "${TENANT_ID}", "${PROJECT_ID}", "${CLUSTER_ID}", "regions", .filename], separator: "/") +""" + +# Key Visualizer Export to S3. +[sinks.keyviz_s3] +type = "aws_s3" +inputs = ["keyviz_add_meta"] +encoding.codec = "raw_message" +region = "${REGION}" +bucket = "${BUCKET}" +key_prefix = "{{ key_prefix }}" +filename_time_format = "" +filename_append_uuid = false +batch.max_bytes = 1 # DO NOT BATCH +batch.max_events = 1 +batch.timeout_secs = 1 +auth.assume_role = "${ROLE_ARN}" +``` + +### GCP + +```toml +# Key Visualizer Collection Configuration +[sources.keyviz] +type = "keyviz" +# PD endpoint for cluster topology subscription +pd_address = "db-pd:2379" +# TLS certificate paths +tls.ca_path = "${TLS_PATH}/ca.crt" +tls.crt_path = "${TLS_PATH}/tls.crt" +tls.key_path = "${TLS_PATH}/tls.key" + +# Key Visualizer Transformation - prepares GCS write path +[transforms.keyviz_add_meta] +type = "remap" +inputs = ["keyviz"] +source = """ +.key_prefix = join!(["0", "${TENANT_ID}", "${PROJECT_ID}", "${CLUSTER_ID}", "regions", .filename], separator: "/") +""" + +# Key Visualizer Export to GCS. 
+[sinks.keyviz_gcs] +type = "gcp_cloud_storage" +inputs = ["keyviz_add_meta"] +encoding.codec = "raw_message" +bucket = "${BUCKET}" +key_prefix = "{{ key_prefix }}" +filename_time_format = "" +filename_append_uuid = false +batch.max_bytes = 1 # DO NOT BATCH +batch.max_events = 1 +batch.timeout_secs = 1 +``` + +# Full Configuration Example + +## AWS + +```toml +data_dir = "${DATA_PATH}" + +[api] +enabled = false + +[sources.topsql] +type = "topsql" +pd_address = "db-pd:2379" +tls.ca_path = "${TLS_PATH}/ca.crt" +tls.crt_path = "${TLS_PATH}/tls.crt" +tls.key_path = "${TLS_PATH}/tls.key" +top_n = 10 + +[transforms.topsql_add_meta] +type = "remap" +inputs = ["topsql"] +source = """ +.labels.provider = "${PROVIDER}" +.labels.region = "${REGION}" +.labels.k8s_cluster_name = "${K8S_CLUSTER_NAME}" +.labels.k8s_namespace = "${K8S_NAMESPACE}" +.labels.project_id = "${PROJECT_ID}" +.labels.tenant_id = "${TENANT_ID}" +.labels.cluster_id = "${CLUSTER_ID}" +""" + +[sinks.topsql_vm] +type = "vm_import" +inputs = ["topsql_add_meta"] +endpoint = "${VM_IMPORT_URL}" +batch.max_events = 1000 +batch.max_bytes = 1048576 # 1MiB +batch.timeout_secs = 1 +buffer.type = "disk" +buffer.max_size = 536870912 # 512MiB +buffer.when_full = "drop_newest" + +[sources.conprof] +type = "conprof" +pd_address = "db-pd:2379" +tls.ca_path = "${TLS_PATH}/ca.crt" +tls.crt_path = "${TLS_PATH}/tls.crt" +tls.key_path = "${TLS_PATH}/tls.key" +enable_tikv_heap_profile = true + +[transforms.conprof_add_meta] +type = "remap" +inputs = ["conprof"] +source = """ +.key_prefix = join!(["0", "${TENANT_ID}", "${PROJECT_ID}", "${CLUSTER_ID}", "profiles", .filename], separator: "/") +""" + +[sinks.conprof_s3] +type = "aws_s3" +inputs = ["conprof_add_meta"] +encoding.codec = "raw_message" +region = "${REGION}" +bucket = "${BUCKET}" +key_prefix = "{{ key_prefix }}" +filename_time_format = "" +filename_append_uuid = false +batch.max_bytes = 1 # DO NOT BATCH +batch.max_events = 1 +batch.timeout_secs = 1 +auth.assume_role = "${ROLE_ARN}" + +[sources.keyviz] +type = "keyviz" +pd_address = "db-pd:2379" +tls.ca_path = "${TLS_PATH}/ca.crt" +tls.crt_path = "${TLS_PATH}/tls.crt" +tls.key_path = "${TLS_PATH}/tls.key" + +[transforms.keyviz_add_meta] +type = "remap" +inputs = ["keyviz"] +source = """ +.key_prefix = join!(["0", "${TENANT_ID}", "${PROJECT_ID}", "${CLUSTER_ID}", "regions", .filename], separator: "/") +""" + +[sinks.keyviz_s3] +type = "aws_s3" +inputs = ["keyviz_add_meta"] +encoding.codec = "raw_message" +region = "${REGION}" +bucket = "${BUCKET}" +key_prefix = "{{ key_prefix }}" +filename_time_format = "" +filename_append_uuid = false +batch.max_bytes = 1 # DO NOT BATCH +batch.max_events = 1 +batch.timeout_secs = 1 +auth.assume_role = "${ROLE_ARN}" + +[sources.self_metrics] +type = "internal_metrics" + +[sinks.self_metrics_sink] +type = "prometheus_exporter" +inputs = ["self_metrics"] +address = "0.0.0.0:${METRICS_PORT}" +``` + +## GCP + +```toml +data_dir = "${DATA_PATH}" + +[api] +enabled = false + +[sources.topsql] +type = "topsql" +pd_address = "db-pd:2379" +tls.ca_path = "${TLS_PATH}/ca.crt" +tls.crt_path = "${TLS_PATH}/tls.crt" +tls.key_path = "${TLS_PATH}/tls.key" +top_n = 10 + +[transforms.topsql_add_meta] +type = "remap" +inputs = ["topsql"] +source = """ +.labels.provider = "${PROVIDER}" +.labels.region = "${REGION}" +.labels.k8s_cluster_name = "${K8S_CLUSTER_NAME}" +.labels.k8s_namespace = "${K8S_NAMESPACE}" +.labels.project_id = "${PROJECT_ID}" +.labels.tenant_id = "${TENANT_ID}" +.labels.cluster_id = "${CLUSTER_ID}" +""" + 
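+# Export the labelled Top SQL records to VictoriaMetrics through the vm_import
+# endpoint; events are buffered on disk (up to 512 MiB) and the newest events
+# are dropped once the buffer is full.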
+[sinks.topsql_vm] +type = "vm_import" +inputs = ["topsql_add_meta"] +endpoint = "${VM_IMPORT_URL}" +batch.max_events = 1000 +batch.max_bytes = 1048576 # 1MiB +batch.timeout_secs = 1 +buffer.type = "disk" +buffer.max_size = 536870912 # 512MiB +buffer.when_full = "drop_newest" + +[sources.conprof] +type = "conprof" +pd_address = "db-pd:2379" +tls.ca_path = "${TLS_PATH}/ca.crt" +tls.crt_path = "${TLS_PATH}/tls.crt" +tls.key_path = "${TLS_PATH}/tls.key" +enable_tikv_heap_profile = true + +[transforms.conprof_add_meta] +type = "remap" +inputs = ["conprof"] +source = """ +.key_prefix = join!(["0", "${TENANT_ID}", "${PROJECT_ID}", "${CLUSTER_ID}", "profiles", .filename], separator: "/") +""" + +[sinks.conprof_gcs] +type = "gcp_cloud_storage" +inputs = ["conprof_add_meta"] +encoding.codec = "raw_message" +bucket = "${BUCKET}" +key_prefix = "{{ key_prefix }}" +filename_time_format = "" +filename_append_uuid = false +batch.max_bytes = 1 # DO NOT BATCH +batch.max_events = 1 +batch.timeout_secs = 1 + +[sources.keyviz] +type = "keyviz" +pd_address = "db-pd:2379" +tls.ca_path = "${TLS_PATH}/ca.crt" +tls.crt_path = "${TLS_PATH}/tls.crt" +tls.key_path = "${TLS_PATH}/tls.key" + +[transforms.keyviz_add_meta] +type = "remap" +inputs = ["keyviz"] +source = """ +.key_prefix = join!(["0", "${TENANT_ID}", "${PROJECT_ID}", "${CLUSTER_ID}", "regions", .filename], separator: "/") +""" + +[sinks.keyviz_gcs] +type = "gcp_cloud_storage" +inputs = ["keyviz_add_meta"] +encoding.codec = "raw_message" +bucket = "${BUCKET}" +key_prefix = "{{ key_prefix }}" +filename_time_format = "" +filename_append_uuid = false +batch.max_bytes = 1 # DO NOT BATCH +batch.max_events = 1 +batch.timeout_secs = 1 + +[sources.self_metrics] +type = "internal_metrics" + +[sinks.self_metrics_sink] +type = "prometheus_exporter" +inputs = ["self_metrics"] +address = "0.0.0.0:${METRICS_PORT}" +``` diff --git a/tidb-cloud/byoc-o11y-agent-vmagent-config-k8s.md b/tidb-cloud/byoc-o11y-agent-vmagent-config-k8s.md new file mode 100644 index 0000000000000..5ee4775f93a99 --- /dev/null +++ b/tidb-cloud/byoc-o11y-agent-vmagent-config-k8s.md @@ -0,0 +1,271 @@ +```yaml +# Please replace the variables in the ${} format in the configuration appropriately. 
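+#
+# The scrape jobs below cover, in order: kubelet, cadvisor, kube-state-metrics,
+# the EBS CSI node driver, node-exporter, the vmagent and Vector agents
+# themselves, cert-manager, and CoreDNS. All targets are discovered through
+# Kubernetes service discovery (kubernetes_sd_configs), so no static target
+# lists are required.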
+scrape_configs: +- bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token + honor_labels: false + job_name: kubelet + kubernetes_sd_configs: + - role: node + relabel_configs: + - source_labels: + - __meta_kubernetes_node_address_InternalIP + target_label: instance + - action: labelmap + regex: __meta_kubernetes_node_label_(.+) + - replacement: shoot + target_label: type + - action: replace + regex: (.*) + replacement: $1 + separator: ; + source_labels: + - __metrics_path__ + target_label: metrics_path + scheme: https + tls_config: + ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt + insecure_skip_verify: true +- bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token + honor_labels: true + job_name: cadvisor + kubernetes_sd_configs: + - role: node + metrics_path: /metrics/cadvisor + relabel_configs: + - source_labels: + - __meta_kubernetes_node_address_InternalIP + target_label: instance + - action: labelmap + regex: __meta_kubernetes_node_label_(.+) + - replacement: shoot + target_label: type + - action: replace + source_labels: + - cluster + target_label: tidb_cluster + - action: replace + source_labels: + - k8s_cluster_info + target_label: cluster + - action: replace + regex: (.*) + replacement: $1 + separator: ; + source_labels: + - __metrics_path__ + target_label: metrics_path + scheme: https + tls_config: + ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt + insecure_skip_verify: true +- honor_labels: false + job_name: kube-state-metrics + kubernetes_sd_configs: + - namespaces: + names: + - monitoring + role: service + metric_relabel_configs: + - action: drop + regex: ^.+\.tf-pod.+$ + source_labels: + - pod + relabel_configs: + - action: keep + regex: kube-state-metrics + source_labels: + - __meta_kubernetes_service_name + - replacement: kube-state-metrics + target_label: instance +- honor_labels: false + job_name: ebs-csi-node + scrape_interval: 15s + kubernetes_sd_configs: + - role: endpoints + namespaces: + names: + - kube-system + relabel_configs: + - action: keep + regex: ebs-csi-node + source_labels: + - __meta_kubernetes_endpoints_name + - source_labels: [__meta_kubernetes_endpoint_port_name] + regex: metrics + replacement: $1 + action: keep + - source_labels: + - __meta_kubernetes_pod_name + target_label: pod + - action: replace + regex: (.*) + replacement: $1 + source_labels: + - __meta_kubernetes_pod_node_name + target_label: instance + scrape_timeout: 15s +- honor_labels: false + job_name: node-exporter + kubernetes_sd_configs: + - role: endpoints + relabel_configs: + - action: keep + regex: node-exporter-prometheus-node-exporter + source_labels: + - __meta_kubernetes_endpoints_name + - source_labels: + - __meta_kubernetes_pod_name + target_label: pod + - action: labelmap + regex: __meta_kubernetes_node_label_(.+) + - action: replace + regex: (.*) + replacement: $1 + separator: ; + source_labels: + - __meta_kubernetes_pod_node_name + target_label: instance + scrape_timeout: 30s +- honor_labels: false + job_name: metrics-agents + kubernetes_sd_configs: + - role: endpoints + relabel_configs: + - action: keep + regex: (.*)vmagent + source_labels: + - __meta_kubernetes_endpoints_name + - source_labels: + - __meta_kubernetes_namespace + target_label: namespace + - source_labels: + - __meta_kubernetes_pod_name + target_label: pod + - source_labels: + - __meta_kubernetes_service_name + target_label: service + scrape_timeout: 30s +- honor_labels: false + job_name: vector-agents + kubernetes_sd_configs: + - role: pod + 
relabel_configs: + - action: keep + regex: vector + source_labels: + - __meta_kubernetes_pod_container_name + - action: keep + regex: prom-exporter + source_labels: + - __meta_kubernetes_pod_container_port_name + - source_labels: + - __meta_kubernetes_namespace + target_label: namespace + scrape_timeout: 30s +- honor_labels: false + honor_timestamps: true + job_name: cert-manager + kubernetes_sd_configs: + - role: endpoints + metrics_path: /metrics + relabel_configs: + - action: keep + regex: controller + replacement: $1 + separator: ; + source_labels: + - __meta_kubernetes_service_label_app_kubernetes_io_component + - action: keep + regex: cert-manager + replacement: $1 + separator: ; + source_labels: + - __meta_kubernetes_service_label_app_kubernetes_io_name + - action: keep + regex: cert-manager + replacement: $1 + separator: ; + source_labels: + - __meta_kubernetes_service_label_app_kubernetes_io_instance + - action: keep + regex: "9402" + replacement: $1 + separator: ; + source_labels: + - __meta_kubernetes_pod_container_port_number + - action: replace + regex: Node;(.*) + replacement: ${1} + separator: ; + source_labels: + - __meta_kubernetes_endpoint_address_target_kind + - __meta_kubernetes_endpoint_address_target_name + target_label: node + - action: replace + regex: Pod;(.*) + replacement: ${1} + separator: ; + source_labels: + - __meta_kubernetes_endpoint_address_target_kind + - __meta_kubernetes_endpoint_address_target_name + target_label: pod + - action: replace + regex: (.*) + replacement: $1 + separator: ; + source_labels: + - __meta_kubernetes_namespace + target_label: namespace + - action: replace + regex: (.*) + replacement: $1 + separator: ; + source_labels: + - __meta_kubernetes_service_name + target_label: service + - action: replace + regex: (.*) + replacement: $1 + separator: ; + source_labels: + - __meta_kubernetes_pod_name + target_label: pod + - action: replace + regex: (.*) + replacement: ${1} + separator: ; + source_labels: + - __meta_kubernetes_service_name + target_label: job + - action: replace + regex: (.+) + replacement: ${1} + separator: ; + source_labels: + - __meta_kubernetes_service_label_cert_manager + target_label: job + - action: replace + regex: (.*) + replacement: "9402" + separator: ; + target_label: endpoint + scrape_interval: 1m + scrape_timeout: 30s +- honor_labels: false + job_name: 'coredns' + kubernetes_sd_configs: + - role: pod + relabel_configs: + - source_labels: [__meta_kubernetes_pod_label_eks_amazonaws_com_component, __meta_kubernetes_pod_container_port_name] + regex: coredns;metrics + action: keep + - source_labels: [__meta_kubernetes_namespace] + action: replace + target_label: namespace + - source_labels: [__meta_kubernetes_pod_name] + action: replace + target_label: pod + - source_labels: [__meta_kubernetes_pod_node_name] + action: replace + target_label: node +``` diff --git a/tidb-cloud/byoc-o11y-agent-vmagent-config-tidb.md b/tidb-cloud/byoc-o11y-agent-vmagent-config-tidb.md new file mode 100644 index 0000000000000..8c12aadd848ef --- /dev/null +++ b/tidb-cloud/byoc-o11y-agent-vmagent-config-tidb.md @@ -0,0 +1,609 @@ +```yaml +# Please replace the variables in the ${} format in the configuration appropriately. 
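+#
+# ${NAMESPACE} is the Kubernetes namespace in which the TiDB cluster is
+# deployed; ${CLUSTER_ID} is the TiDB cluster identifier matched against the
+# TiFlow executor pod labels. Both must be filled in before applying this
+# configuration.
+# The scrape jobs below cover the core components (PD, TiDB, TiKV, TiFlash) as
+# well as TiProxy, the importer, TiDB Lightning, Dumpling, dwworker, the
+# TiFlash proxy, TiCDC, the TiDB audit log exporter, node-exporter, kubelet
+# volume statistics, and TiFlow executors.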
+scrape_configs: +- job_name: ${NAMESPACE}-core-components + scrape_interval: 15s + honor_labels: true + scheme: https + tls_config: + ca_file: /var/lib/cluster-assets-tls/ca.crt + cert_file: /var/lib/cluster-assets-tls/tls.crt + key_file: /var/lib/cluster-assets-tls/tls.key + relabel_configs: + - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_instance] + regex: db(-.*)? + action: keep + - source_labels: [__meta_kubernetes_namespace] + regex: ${NAMESPACE} + action: keep + - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape] + regex: "true" + action: keep + - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_component] + regex: pd|tidb|tiflash|tikv + action: keep + - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_component] + target_label: job + replacement: ${NAMESPACE}-db-$1 + action: replace + - source_labels: [__meta_kubernetes_pod_name, __meta_kubernetes_pod_label_app_kubernetes_io_instance, + __meta_kubernetes_pod_label_app_kubernetes_io_component, __meta_kubernetes_namespace, + __meta_kubernetes_pod_annotation_prometheus_io_port] + target_label: __address__ + regex: (.+);(.+);(.+);(.+);(.+) + replacement: $1.$2-$3-peer.$4:$5 + action: replace + - source_labels: [__meta_kubernetes_namespace] + target_label: kubernetes_namespace + action: replace + - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_instance] + target_label: cluster + action: replace + - source_labels: [__meta_kubernetes_pod_name] + target_label: instance + action: replace + - source_labels: [instance] + target_label: instance + regex: db-(\d+)-([a-zA-Z0-9]+)-tidb-(\d+) + replacement: db-tidb-$3-ac-$1 + action: replace + - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_component] + target_label: component + action: replace + - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_pod_label_app_kubernetes_io_instance] + separator: '-' + target_label: tidb_cluster + - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path] + target_label: __metrics_path__ + regex: (.+) + action: replace + kubernetes_sd_configs: + - role: pod + kubeconfig_file: "" + namespaces: + own_namespace: false + names: + - ${NAMESPACE} +- job_name: ${NAMESPACE}-tiproxy + scrape_interval: 15s + honor_labels: true + scheme: https + tls_config: + insecure_skip_verify: true + ca_file: /var/lib/cluster-assets-tls/ca.crt + cert_file: /var/lib/cluster-assets-tls/tls.crt + key_file: /var/lib/cluster-assets-tls/tls.key + relabel_configs: + - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_instance] + regex: db + action: keep + - source_labels: [__meta_kubernetes_namespace] + regex: ${NAMESPACE} + action: keep + - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape] + regex: "true" + action: keep + - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_component] + regex: tiproxy + action: keep + - source_labels: [__meta_kubernetes_pod_name, __meta_kubernetes_pod_label_app_kubernetes_io_instance, + __meta_kubernetes_pod_label_app_kubernetes_io_component, __meta_kubernetes_namespace, + __meta_kubernetes_pod_annotation_prometheus_io_port] + target_label: __address__ + regex: (.+);(.+);(.+);(.+);(.+) + replacement: $1.$2-$3-peer.$4:$5 + action: replace + - source_labels: [__meta_kubernetes_namespace] + target_label: kubernetes_namespace + action: replace + - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_instance] + target_label: cluster + action: replace + - source_labels: 
[__meta_kubernetes_pod_name] + target_label: instance + action: replace + - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_component] + target_label: component + action: replace + - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_pod_label_app_kubernetes_io_instance] + separator: '-' + target_label: tidb_cluster + - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path] + target_label: __metrics_path__ + regex: (.+) + action: replace + kubernetes_sd_configs: + - role: pod + kubeconfig_file: "" + namespaces: + own_namespace: false + names: + - ${NAMESPACE} +- job_name: ${NAMESPACE}-importer + scrape_interval: 15s + honor_labels: true + scheme: https + tls_config: + ca_file: /var/lib/cluster-assets-tls/ca.crt + cert_file: /var/lib/cluster-assets-tls/tls.crt + key_file: /var/lib/cluster-assets-tls/tls.key + relabel_configs: + - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_instance] + regex: db + action: keep + - source_labels: [__meta_kubernetes_namespace] + regex: ${NAMESPACE} + action: keep + - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape] + regex: "true" + action: keep + - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_component] + regex: importer + action: keep + - source_labels: [__meta_kubernetes_pod_name, __meta_kubernetes_pod_label_app_kubernetes_io_instance, + __meta_kubernetes_namespace,__meta_kubernetes_pod_annotation_prometheus_io_port] + target_label: __address__ + regex: (.+);(.+);(.+);(.+) + replacement: $1.$2-importer.$3:$4 + action: replace + - source_labels: [__meta_kubernetes_namespace] + target_label: kubernetes_namespace + action: replace + - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_instance] + target_label: cluster + action: replace + - source_labels: [__meta_kubernetes_pod_name] + target_label: instance + action: replace + - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_component] + target_label: component + action: replace + - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_pod_label_app_kubernetes_io_instance] + separator: '-' + target_label: tidb_cluster + - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path] + target_label: __metrics_path__ + regex: (.+) + action: replace + kubernetes_sd_configs: + - role: pod + kubeconfig_file: "" + namespaces: + own_namespace: false + names: + - ${NAMESPACE} +- job_name: ${NAMESPACE}-lightning + scrape_interval: 15s + honor_labels: true + scheme: https + tls_config: + insecure_skip_verify: true + relabel_configs: + - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_instance] + regex: db + action: keep + - source_labels: [__meta_kubernetes_namespace] + regex: ${NAMESPACE} + action: keep + - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape] + regex: "true" + action: keep + - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_component] + regex: tidb-lightning + action: keep + - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name, + __meta_kubernetes_namespace, __meta_kubernetes_pod_annotation_prometheus_io_port] + target_label: __address__ + regex: (.+);(.+);(.+) + replacement: $1.$2:$3 + action: replace + - source_labels: [__meta_kubernetes_namespace] + target_label: kubernetes_namespace + action: replace + - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_instance] + target_label: cluster + action: replace + - source_labels: [__meta_kubernetes_pod_name] + target_label: instance + action: 
replace + - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_component] + target_label: component + action: replace + - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_pod_label_app_kubernetes_io_instance] + separator: '-' + target_label: tidb_cluster + - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path] + target_label: __metrics_path__ + regex: (.+) + action: replace + kubernetes_sd_configs: + - role: pod + kubeconfig_file: "" + namespaces: + own_namespace: false + names: + - ${NAMESPACE} +- job_name: ${NAMESPACE}-dumpling + scrape_interval: 15s + honor_labels: true + scheme: http + kubernetes_sd_configs: + - role: pod + kubeconfig_file: "" + namespaces: + own_namespace: false + names: + - ${NAMESPACE} + tls_config: + insecure_skip_verify: true + relabel_configs: + - source_labels: [__meta_kubernetes_namespace] + regex: ${NAMESPACE} + action: keep + - source_labels: [__meta_kubernetes_pod_name] + regex: dumpling-job-(.+) + action: keep + - source_labels: [__meta_kubernetes_pod_ip] + target_label: __address__ + regex: (.+) + replacement: $1:8281 + action: replace + - source_labels: [__meta_kubernetes_namespace] + target_label: kubernetes_namespace + action: replace + - source_labels: [__meta_kubernetes_pod_label_job_name] + target_label: instance + action: replace +- job_name: ${NAMESPACE}-dwworker + scrape_interval: 15s + honor_labels: true + scheme: http + kubernetes_sd_configs: + - role: pod + kubeconfig_file: "" + namespaces: + own_namespace: false + names: + - ${NAMESPACE} + tls_config: + insecure_skip_verify: true + relabel_configs: + - source_labels: [__meta_kubernetes_namespace] + regex: ${NAMESPACE} + action: keep + - source_labels: [__meta_kubernetes_pod_name] + regex: dwworker-(.+) + action: keep + - source_labels: [__meta_kubernetes_pod_ip] + target_label: __address__ + regex: (.+) + replacement: $1:8185 + action: replace + - source_labels: [__meta_kubernetes_namespace] + target_label: kubernetes_namespace + action: replace + - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_instance] + target_label: instance + action: replace +- job_name: ${NAMESPACE}-tiflash-proxy + scrape_interval: 15s + honor_labels: true + scheme: https + tls_config: + ca_file: /var/lib/cluster-assets-tls/ca.crt + cert_file: /var/lib/cluster-assets-tls/tls.crt + key_file: /var/lib/cluster-assets-tls/tls.key + relabel_configs: + - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_instance] + regex: db + action: keep + - source_labels: [__meta_kubernetes_namespace] + regex: ${NAMESPACE} + action: keep + - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape] + regex: "true" + action: keep + - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_component] + regex: tiflash + action: keep + - source_labels: [__meta_kubernetes_pod_name, __meta_kubernetes_pod_label_app_kubernetes_io_instance, + __meta_kubernetes_pod_label_app_kubernetes_io_component, __meta_kubernetes_namespace, + __meta_kubernetes_pod_annotation_tiflash_proxy_prometheus_io_port] + target_label: __address__ + regex: (.+);(.+);(.+);(.+);(.+) + replacement: $1.$2-$3-peer.$4:$5 + action: replace + - source_labels: [__meta_kubernetes_namespace] + target_label: kubernetes_namespace + action: replace + - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_instance] + target_label: cluster + action: replace + - source_labels: [__meta_kubernetes_pod_name] + target_label: instance + action: replace + - source_labels: 
[__meta_kubernetes_pod_label_app_kubernetes_io_component] + target_label: component + action: replace + - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_pod_label_app_kubernetes_io_instance] + separator: '-' + target_label: tidb_cluster + - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path] + target_label: __metrics_path__ + regex: (.+) + action: replace + kubernetes_sd_configs: + - role: pod + kubeconfig_file: "" + namespaces: + own_namespace: false + names: + - ${NAMESPACE} +- job_name: ${NAMESPACE}-ticdc + scrape_interval: 15s + honor_labels: true + scheme: https + tls_config: + ca_file: /var/lib/cluster-assets-tls/ca.crt + cert_file: /var/lib/cluster-assets-tls/tls.crt + key_file: /var/lib/cluster-assets-tls/tls.key + relabel_configs: + - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_instance] + regex: db|db-cdc|rg\d+-op\d+-c\d+|changefeed-\d+ + action: keep + - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_instance] + target_label: job + replacement: ${NAMESPACE}-$1-ticdc + action: replace + - source_labels: [__meta_kubernetes_namespace] + regex: ${NAMESPACE} + action: keep + - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape] + regex: "true" + action: keep + - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_component] + regex: ticdc + action: keep + - source_labels: [__meta_kubernetes_pod_name, __meta_kubernetes_pod_label_app_kubernetes_io_instance, + __meta_kubernetes_pod_label_app_kubernetes_io_component, __meta_kubernetes_namespace, + __meta_kubernetes_pod_annotation_prometheus_io_port] + target_label: __address__ + regex: (.+);(.+);(.+);(.+);(.+) + replacement: $1.$2-$3-peer.$4:$5 + action: replace + - source_labels: [__meta_kubernetes_namespace] + target_label: kubernetes_namespace + action: replace + - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_instance] + target_label: cluster + action: replace + - source_labels: [__meta_kubernetes_pod_name] + target_label: instance + action: replace + - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_component] + target_label: component + action: replace + - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_pod_label_app_kubernetes_io_instance] + separator: '-' + target_label: tidb_cluster + - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path] + target_label: __metrics_path__ + regex: (.+) + action: replace + kubernetes_sd_configs: + - role: pod + kubeconfig_file: "" + namespaces: + own_namespace: false + names: + - ${NAMESPACE} +- job_name: ${NAMESPACE}-tidb-auditlog + honor_labels: true + scrape_interval: 15s + scheme: http + kubernetes_sd_configs: + - api_server: null + role: pod + namespaces: + names: + - ${NAMESPACE} + relabel_configs: + - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_instance] + regex: db + action: keep + - source_labels: [__meta_kubernetes_namespace] + regex: ${NAMESPACE} + action: keep + - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape] + regex: "true" + action: keep + - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_component] + regex: tidb + action: keep + - source_labels: [__meta_kubernetes_pod_name, __meta_kubernetes_pod_label_app_kubernetes_io_instance, + __meta_kubernetes_namespace, __meta_kubernetes_pod_annotation_tidb_auditlog_prometheus_io_port] + regex: (.+);(.+);(.+);(.+) + target_label: __address__ + replacement: $1.$2-tidb-peer.$3:$4 + action: replace + - source_labels: 
[__meta_kubernetes_pod_annotation_tidb_auditlog_prometheus_io_port] + regex: (.+) + target_label: __address__ + action: keep + - source_labels: [__meta_kubernetes_namespace] + target_label: kubernetes_namespace + action: replace + - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_instance] + target_label: cluster + action: replace + - source_labels: [__meta_kubernetes_pod_name] + target_label: instance + action: replace + - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_component] + target_label: component + replacement: auditlog + action: replace + - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_pod_label_app_kubernetes_io_instance] + separator: '-' + target_label: tidb_cluster + - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path] + regex: (.+) + target_label: __metrics_path__ +- job_name: ${NAMESPACE}-node-exporter + honor_labels: true + scrape_interval: 15s + scheme: http + params: + collect[]: + - meminfo + - cpu + kubernetes_sd_configs: + - api_server: null + role: pod + namespaces: + names: + - ${NAMESPACE} + relabel_configs: + - source_labels: + - __meta_kubernetes_pod_label_app_kubernetes_io_instance + action: keep + regex: db(-.*)? + - source_labels: + - __meta_kubernetes_namespace + action: keep + regex: ${NAMESPACE} + - source_labels: + - __meta_kubernetes_pod_annotation_prometheus_io_scrape + action: keep + regex: "true" + - source_labels: + - __meta_kubernetes_pod_label_app_kubernetes_io_component + action: keep + regex: pd|tikv|tidb|tiflash + - action: replace + regex: (.+) + replacement: $1:9100 + target_label: __address__ + source_labels: + - __meta_kubernetes_pod_host_ip + - source_labels: + - __meta_kubernetes_pod_label_app_kubernetes_io_instance + action: replace + target_label: cluster + - source_labels: + - __meta_kubernetes_pod_name + action: replace + target_label: instance + - source_labels: [instance] + target_label: instance + regex: db-(\d+)-([a-zA-Z0-9]+)-tidb-(\d+) + replacement: db-tidb-$3-ac-$1 + action: replace + - source_labels: + - __meta_kubernetes_pod_label_app_kubernetes_io_component + action: replace + target_label: component + - source_labels: + - __meta_kubernetes_namespace + - __meta_kubernetes_pod_label_app_kubernetes_io_instance + separator: '-' + target_label: tidb_cluster + metric_relabel_configs: + - source_labels: + - __name__ + action: keep + regex: 'node_cpu_seconds_total|node_memory_MemTotal_bytes|node_memory_MemAvailable_bytes|node_memory_MemFree_bytes|node_memory_Buffers_bytes|node_memory_Cached_bytes' +- job_name: ${NAMESPACE}-kubelet + honor_labels: true + scrape_interval: 15s + scheme: https + bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token + kubernetes_sd_configs: + - api_server: null + role: pod + namespaces: + names: + - ${NAMESPACE} + tls_config: + insecure_skip_verify: true + ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt + relabel_configs: + - source_labels: + - __meta_kubernetes_pod_label_app_kubernetes_io_instance + action: keep + regex: db + - source_labels: + - __meta_kubernetes_namespace + action: keep + regex: ${NAMESPACE} + - source_labels: + - __meta_kubernetes_pod_label_app_kubernetes_io_component + action: keep + regex: pd|tikv|tidb|tiflash + - action: replace + regex: (.+) + replacement: $1:10250 + target_label: __address__ + source_labels: + - __meta_kubernetes_pod_host_ip + - source_labels: + - __meta_kubernetes_pod_label_app_kubernetes_io_instance + action: replace + target_label: cluster + - source_labels: + - 
__meta_kubernetes_pod_name + action: replace + target_label: instance + - source_labels: + - __meta_kubernetes_pod_label_app_kubernetes_io_component + action: replace + target_label: component + - source_labels: + - __meta_kubernetes_namespace + - __meta_kubernetes_pod_label_app_kubernetes_io_instance + separator: '-' + target_label: tidb_cluster + metric_relabel_configs: + - source_labels: + - namespace + action: keep + regex: ${NAMESPACE} + - source_labels: + - __name__ + action: keep + regex: 'kubelet_volume_stats_available_bytes|kubelet_volume_stats_capacity_bytes' +- job_name: ${NAMESPACE}-tiflow + honor_labels: true + scrape_interval: 15s + scheme: http + kubernetes_sd_configs: + - api_server: null + role: pod + relabel_configs: + - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape] + regex: "true" + action: keep + - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_component] + action: keep + regex: tiflow-executor + - source_labels: [__meta_kubernetes_pod_label_tiflow_cloud_pingcap_com_tidb_cluster_name] + action: keep + regex: "${CLUSTER_ID}" + - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path] + regex: (.+) + target_label: __metrics_path__ + - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_component] + action: replace + target_label: component + - source_labels: [__meta_kubernetes_namespace] + target_label: kubernetes_namespace + action: replace + - source_labels: [__meta_kubernetes_pod_name] + target_label: instance + action: replace + metric_relabel_configs: + - source_labels: + - __name__ + action: keep + regex: 'dm_mydumper_exit_with_error_count|dm_loader_exit_with_error_count|dm_syncer_exit_with_error_count|dm_worker_task_state|dm_syncer_replication_lag_gauge' +``` From a53825c7b3840a174c0127b3ceb3000523afc6b7 Mon Sep 17 00:00:00 2001 From: mornyx Date: Tue, 25 Mar 2025 16:25:38 +0800 Subject: [PATCH 2/2] fix link slash Signed-off-by: mornyx --- tidb-cloud/byoc-o11y-agent-tutorial.md | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/tidb-cloud/byoc-o11y-agent-tutorial.md b/tidb-cloud/byoc-o11y-agent-tutorial.md index 4ae2b253e6a62..fa32d8e377fa8 100644 --- a/tidb-cloud/byoc-o11y-agent-tutorial.md +++ b/tidb-cloud/byoc-o11y-agent-tutorial.md @@ -789,7 +789,7 @@ spec: The `remoteWrite.url` can be obtained after creating the O11y backend cluster as mentioned earlier - simply fill in the corresponding value. -For standard TiDB clusters, we provide an `additionalScrapeConfigs` template to collect metrics from all TiDB components. Please refer to [TiDB VMAgent Config](./byoc-o11y-agent-vmagent-config-tidb.md). +For standard TiDB clusters, we provide an `additionalScrapeConfigs` template to collect metrics from all TiDB components. Please refer to [TiDB VMAgent Config](/tidb-cloud/byoc-o11y-agent-vmagent-config-tidb.md). ### Deployment @@ -816,13 +816,13 @@ kubectl logs -n monitoring -l app.kubernetes.io/name=tidb-vmagent --tail=50 ### Functional Description -**Kubernetes Victoria Metrics Agent** 专用于采集 Kubernetes 集群基础监控指标,每个 Kubernetes 集群需要部署一个实例。 +**Kubernetes Victoria Metrics Agent** is specifically designed to collect fundamental Kubernetes cluster metrics. One instance should be deployed per Kubernetes cluster. ### Prepare Resource File -与 TiDB VMAgent 类似,需要准备一个专用于 Kubernetes VMAgent 的 CR 资源描述文件。 +Similar to the TiDB VMAgent, you need to prepare a dedicated Custom Resource (CR) definition file for the Kubernetes VMAgent. 
-基本资源描述文件模板如下: +Basic resource template: ```yaml apiVersion: operator.victoriametrics.com/v1beta1 @@ -866,7 +866,7 @@ spec: The `remoteWrite.url` can be obtained after creating the O11y backend cluster as mentioned earlier - simply fill in the corresponding value. -For Kubernetes VMAgent, we provide an `additionalScrapeConfigs` template to collect metrics from all Kubernetes components. Please refer to [Kubernetes VMAgent Config](./byoc-o11y-agent-vmagent-config-k8s.md). +For Kubernetes VMAgent, we provide an `additionalScrapeConfigs` template to collect metrics from all Kubernetes components. Please refer to [Kubernetes VMAgent Config](/tidb-cloud/byoc-o11y-agent-vmagent-config-k8s.md). ### Deployment @@ -899,8 +899,8 @@ kubectl logs -n monitoring -l app.kubernetes.io/name=k8s-vmagent --tail=50 For Kubernetes Vector Agent, we provide a Helm Values template to collect all necessary logs. Please refer to the following links for configuration: -- [Node Vector Agent for AWS](./byoc-o11y-agent-vector-config-k8s-aws.md) -- [Node Vector Agent for GCP](./byoc-o11y-agent-vector-config-k8s-gcp.md) +- [Node Vector Agent for AWS](/tidb-cloud/byoc-o11y-agent-vector-config-k8s-aws.md) +- [Node Vector Agent for GCP](/tidb-cloud/byoc-o11y-agent-vector-config-k8s-gcp.md) > Note: Please replace the variables in the `${VAR}` format in the configuration appropriately. @@ -970,7 +970,7 @@ data: ... ``` -The content of `config.toml` can be referenced from [TiDB Vector Config](./byoc-o11y-agent-vector-config-tidb.md). +The content of `config.toml` can be referenced from [TiDB Vector Config](/tidb-cloud/byoc-o11y-agent-vector-config-tidb.md). 3. `Deployment` resource definition file: