Full-stack k8s observability for the sandbox-east cluster.
Metrics and logs are collected by OpenTelemetry Collector, stored in VictoriaMetrics and VictoriaLogs, and visualised in Grafana.
Currently deployed with Helm CLI.
ArgoCD deployment is planned (see Planned: ArgoCD).
| Component | Role |
|---|---|
| OpenTelemetry Collector | DaemonSet on every node - collects all signals |
| VictoriaMetrics | Metrics storage - 1-month retention |
| VictoriaLogs | Log storage - 30-day retention |
| Grafana | Unified UI - datasources and dashboards auto-provisioned |
Each layer has dedicated receivers, its own processor chain, and a layer label for clean separation in Grafana.
Host-level collection via direct mounts (/proc, /sys, /var/log).
Receivers: hostmetrics (CPU, memory, disk, filesystem, network, processes), filelog/system (syslog).
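A minimal sketch of what the two infra-layer receivers could look like. The collection interval and the syslog path are assumptions for illustration, not values copied from helm/otelcol/templates/configmap.yaml.

```yaml
receivers:
  hostmetrics:
    collection_interval: 30s     # assumed interval
    scrapers:                    # one scraper per signal listed above
      cpu:
      memory:
      disk:
      filesystem:
      network:
      processes:
  filelog/system:
    include:
      - /var/log/syslog          # assumed path; varies by distro
    start_at: end                # skip historical lines on restart
```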
Workload signals enriched with full k8s metadata.
Receivers: kubeletstats (pod CPU/memory/network from kubelet API on port 10250), k8s_events (cluster-wide events).
Processor: k8sattributes adds namespace, pod, deployment, and node labels to every signal.
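The kuber layer could be configured roughly as follows. This is a sketch: the K8S_NODE_NAME environment variable (normally injected through the downward API), the interval, and the TLS setting are assumptions.

```yaml
receivers:
  kubeletstats:
    collection_interval: 30s                      # assumed interval
    auth_type: serviceAccount
    endpoint: https://${env:K8S_NODE_NAME}:10250  # kubelet API on the local node
    insecure_skip_verify: true                    # assumed; kubelet certs are often self-signed
  k8s_events:
    auth_type: serviceAccount                     # no namespace list: cluster-wide events
processors:
  k8sattributes:
    auth_type: serviceAccount
    extract:
      metadata:
        - k8s.namespace.name
        - k8s.pod.name
        - k8s.deployment.name
        - k8s.node.name
```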
Zero-config annotation-driven discovery across all namespaces.
Any pod with prometheus.io/scrape: "true" is scraped automatically.
Receivers: prometheus/app (request rate, latency, custom metrics), filelog/app (CRI-parsed container logs from all pods with namespace/pod/container attribution).
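Annotation-driven discovery is usually expressed as Prometheus relabel rules. The sketch below is one possible shape, not the repo's exact config; it assumes every scraped pod also sets prometheus.io/port. Joining pod IP and port with a ":" separator relies on the default relabel replacement, so no explicit $1 appears in the file.

```yaml
receivers:
  prometheus/app:
    config:
      scrape_configs:
        - job_name: all-namespaces
          kubernetes_sd_configs:
            - role: pod                     # no namespace filter: cluster-wide
          relabel_configs:
            # Keep only pods annotated prometheus.io/scrape: "true"
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
              action: keep
              regex: "true"
            # Build <pod-ip>:<annotated-port> as the scrape address
            - source_labels:
                - __meta_kubernetes_pod_ip
                - __meta_kubernetes_pod_annotation_prometheus_io_port
              separator: ":"
              target_label: __address__
            # Attribution labels
            - source_labels: [__meta_kubernetes_namespace]
              target_label: namespace
            - source_labels: [__meta_kubernetes_pod_name]
              target_label: pod
```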
Service mesh golden metrics with zero application instrumentation. When a workload is meshed, the Linkerd proxy exposes per-workload metrics on :4191/metrics. A dedicated scrape job collects them, keeps only the golden set, and tags everything layer=mesh so mesh-measured signals stay distinct from app-measured ones.
Receiver: prometheus/mesh (Kubernetes pod SD, keeps containers named linkerd-proxy, scrapes :4191).
Processor: filter/mesh keeps response_total, response_latency_ms, tcp_open_connections, tcp_read_bytes_total, and tcp_write_bytes_total; everything else is dropped to control cardinality.
Verified on a single-node K3s v1.34.6 lab with Linkerd enterprise-2.18.10 and emojivoto, not on sandbox-east. See the companion write-up for the full walkthrough.
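The golden-set filter can be sketched with OTTL conditions as below. The filter processor drops a metric when a condition evaluates to true, so the condition describes everything outside the golden set. The exact expression in the repo may differ; error_mode is an assumption.

```yaml
processors:
  filter/mesh:
    error_mode: ignore          # assumed; log and skip on evaluation errors
    metrics:
      metric:
        # Drop any metric that is not one of the five golden families
        - >-
          name != "response_total"
          and not IsMatch(name, "^response_latency_ms")
          and name != "tcp_open_connections"
          and name != "tcp_read_bytes_total"
          and name != "tcp_write_bytes_total"
```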
hostmetrics + kubeletstats → resourcedetection · k8sattributes · attributes · batch → VictoriaMetrics
prometheus/app → resource/app · resourcedetection · k8sattributes · batch → VictoriaMetrics
prometheus/mesh → filter/mesh · resource/mesh · resourcedetection · k8sattributes · batch → VictoriaMetrics
filelog/system → resource/infra · resourcedetection · k8sattributes · batch → VictoriaLogs
k8s_events → resource/kuber · resourcedetection · batch → VictoriaLogs
filelog/app → resource/app · resourcedetection · k8sattributes · batch → VictoriaLogs
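In Collector config, each chain above maps to one entry under service.pipelines. A sketch for two of them follows; the pipeline names and exporter IDs are hypothetical placeholders, and the receivers, processors, and exporters they reference must be defined elsewhere in the same config.

```yaml
service:
  pipelines:
    metrics/mesh:                            # hypothetical pipeline name
      receivers: [prometheus/mesh]
      processors: [filter/mesh, resource/mesh, resourcedetection, k8sattributes, batch]
      exporters: [prometheusremotewrite]     # assumed exporter ID for VictoriaMetrics
    logs/kuber:                              # hypothetical pipeline name
      receivers: [k8s_events]
      processors: [resource/kuber, resourcedetection, batch]
      exporters: [otlphttp/victorialogs]     # assumed exporter ID for VictoriaLogs
```

Processors run in the order listed, which is why filter/mesh comes first: series outside the golden set are dropped before any enrichment work is spent on them.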
Three dashboards ship pre-provisioned as ConfigMaps; no manual import is required.
| Dashboard | Panels |
|---|---|
| Node Infrastructure | CPU, memory, disk, network per node · system logs |
| Kubernetes Infrastructure | Pod CPU/memory/network · k8s events log |
| App in all namespaces | Request rate, p95/p99 latency · application logs |
All log panels filter by layer label: {layer="infra"}, {layer="kuber"}, {layer="app"}.
App metric panels require pods to expose http_request_duration_seconds histogram and carry prometheus.io/scrape: "true" annotation.
CONTEXT=sandbox-east
NS=ad-otel
helm upgrade --install ad-otel-victoriametrics helm/victoriametrics -n $NS --create-namespace --kube-context $CONTEXT
helm upgrade --install ad-otel-victorialogs helm/victorialogs -n $NS --create-namespace --kube-context $CONTEXT
helm upgrade --install ad-otel-otelcol helm/otelcol -n $NS --create-namespace --kube-context $CONTEXT
helm upgrade --install ad-otel-grafana helm/grafana -n $NS --create-namespace --kube-context $CONTEXT
Deploy VictoriaMetrics and VictoriaLogs before otelcol on first install so exporters can connect immediately. Subsequent upgrades can run in any order.
Quick local access:
kubectl port-forward svc/ad-otel-grafana 3000:80 -n ad-otel --context sandbox-east
# → http://localhost:3000
Via ingress: add this line to /etc/hosts:
<nginx-ingress-external-ip> grafana.ad-otel.local
Then open http://grafana.ad-otel.local. Anonymous admin access is enabled, so no login is required.
Add these annotations to any pod spec:
annotations:
  prometheus.io/scrape: "true"
  prometheus.io/port: "8080" # port where /metrics is exposed
The OTel Collector picks up the pod within 30 seconds.
Metrics appear in VictoriaMetrics with job="all-namespaces" and namespace, pod labels.
Logs from the same pod appear automatically via filelog/app.
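The annotations must sit on the pod template, not on the Deployment's own metadata, because discovery works on pods. A minimal sketch with a hypothetical workload name and a placeholder image:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app                    # hypothetical workload
spec:
  replicas: 1
  selector:
    matchLabels:
      app: demo-app
  template:
    metadata:
      labels:
        app: demo-app
      annotations:                  # pod-level annotations read by discovery
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
    spec:
      containers:
        - name: demo-app
          image: registry.example.com/demo-app:latest   # placeholder image
          ports:
            - containerPort: 8080   # must serve /metrics
```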
helm/
grafana/
templates/
configmap.yaml # grafana.ini + datasources.yaml + dashboards.yaml provider
dashboard-infra.yaml # Node Infrastructure dashboard (layer=infra)
dashboard-kuber.yaml # Kubernetes Infrastructure dashboard (layer=kuber)
dashboard-app.yaml # App dashboard (layer=app, all namespaces)
deployment.yaml # Single replica, securityContext uid/gid 472 (Longhorn compat)
ingress.yaml # NGINX ingress on grafana.ad-otel.local
pvc.yaml # 2Gi Longhorn PVC
service.yaml # ClusterIP :80 → :3000
otelcol/
templates/
configmap.yaml # Full OTel config: receivers, processors, exporters, pipelines
daemonset.yaml # Root + privileged, mounts: /proc /sys /var/log
serviceaccount.yaml
clusterrole.yaml # ClusterRole + Binding (kubelet, pods, events, nodes)
victoriametrics/
templates/
deployment.yaml # strategy: Recreate, 1-month retention
pvc.yaml # 10Gi Longhorn PVC
service.yaml # ClusterIP :8428
victorialogs/
templates/
deployment.yaml # strategy: Recreate, 30d retention, --memory.allowedPercent=60
pvc.yaml # 10Gi Longhorn PVC
service.yaml # ClusterIP :9428
argocd/ # Planned — see below
app-of-apps.yaml
apps/
application-grafana.yaml
application-otelcol.yaml
application-victorialogs.yaml
application-victoriametrics.yaml
ArgoCD is available on sandbox-east.
The argocd/ manifests are prepared and point to this GitLab repo.
Once the stack is validated via Helm, the plan is to switch to GitOps.
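For orientation, an ArgoCD Application for one chart might look like the sketch below. The repoURL is a placeholder, and the ArgoCD namespace, project, and sync options are assumptions; the prepared files under argocd/apps/ remain the source of truth.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ad-otel-grafana
  namespace: argocd                  # assumed ArgoCD install namespace
spec:
  project: default                   # assumed project
  source:
    repoURL: https://gitlab.example.com/placeholder/repo.git   # placeholder
    targetRevision: HEAD
    path: helm/grafana
  destination:
    server: https://kubernetes.default.svc
    namespace: ad-otel
  syncPolicy:
    syncOptions:
      - CreateNamespace=true         # mirrors --create-namespace in the Helm commands
```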
Static otelcol config - no Helm {{ }} templating inside the block scalar.
Component addresses are hardcoded Kubernetes service DNS names (ad-otel-*.ad-otel.svc.cluster.local).
ConfigMap-provisioned dashboards - datasources and dashboards survive pod restarts without re-importing. Dashboard JSON is embedded in Helm templates as a raw string block.
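Grafana file provisioning is driven by two small YAML files. The sketch below shows a typical shape; the provider name, datasource name, and service hostname are assumptions (the hostname follows the ad-otel-*.ad-otel.svc.cluster.local pattern), and the VictoriaLogs datasource is omitted.

```yaml
# dashboards.yaml: tells Grafana to load dashboard JSON from a directory
apiVersion: 1
providers:
  - name: default                    # assumed provider name
    type: file
    options:
      path: /etc/grafana/provisioning/dashboards
---
# datasources.yaml: VictoriaMetrics speaks the Prometheus query API
apiVersion: 1
datasources:
  - name: VictoriaMetrics            # assumed datasource name
    type: prometheus
    access: proxy
    url: http://ad-otel-victoriametrics.ad-otel.svc.cluster.local:8428   # assumed service name
    isDefault: true
```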
Projected volume for dashboards - all three dashboard ConfigMaps and the provisioning config are merged into a single directory at /etc/grafana/provisioning/dashboards using a projected volume.
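A projected volume merging several ConfigMaps into one directory can be sketched as follows. The ConfigMap and volume names are hypothetical; only the mount path comes from the text above.

```yaml
spec:
  containers:
    - name: grafana
      volumeMounts:
        - name: dashboards
          mountPath: /etc/grafana/provisioning/dashboards
          readOnly: true
  volumes:
    - name: dashboards
      projected:
        sources:                                  # keys from all sources land in one directory
          - configMap:
              name: grafana-provisioning          # hypothetical: dashboards.yaml provider
          - configMap:
              name: grafana-dashboard-infra       # hypothetical
          - configMap:
              name: grafana-dashboard-kuber       # hypothetical
          - configMap:
              name: grafana-dashboard-app         # hypothetical
```

File names must be unique across the sources, since they share a single target directory.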
Grafana securityContext - uid/gid 472 required for Longhorn PVCs. Without it Grafana cannot write to /var/lib/grafana and crashes on startup.
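In pod-spec terms this is a short block. The fsGroup line is an assumption about how volume ownership is granted; the text above only states uid/gid 472.

```yaml
spec:
  securityContext:
    runAsUser: 472     # grafana user in the official image
    runAsGroup: 472
    fsGroup: 472       # assumed: makes the PVC writable by gid 472
```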
Deployment strategy: Recreate - VictoriaMetrics and VictoriaLogs use RWO PVCs with an exclusive file lock.
RollingUpdate would start the new pod before the old one releases the lock, causing a crash. Recreate terminates the old pod first.
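The relevant part of the Deployment spec, as a sketch:

```yaml
spec:
  replicas: 1
  strategy:
    type: Recreate     # old pod is terminated before the new one starts
```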
Memory sizing - VictoriaLogs limit is 1Gi with --memory.allowedPercent=60.
OtelCol limit is 512Mi. Both use start_at: end for log receivers to avoid reading all historical logs on restart.
If OOM kills reappear (Exit Code 137 in the kubectl describe pod output), increase the limits in values.yaml.
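Rendered into the VictoriaLogs container spec, the sizing above could look like this sketch. The container name and the way the chart maps values.yaml into these fields are assumptions.

```yaml
containers:
  - name: victorialogs                 # assumed container name
    args:
      - --retentionPeriod=30d
      - --memory.allowedPercent=60     # share of the memory limit VictoriaLogs may use for caches
    resources:
      limits:
        memory: 1Gi                    # raise here if Exit Code 137 returns
```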
layer label as routing key - one resource/* processor per pipeline inserts layer=infra|kuber|app.
This single attribute cleanly separates all three signal tiers in every Grafana dashboard. The mesh tier adds a fourth: layer=mesh.
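A sketch of the resource/* processors that set the label. Each pipeline references exactly one of them, so a signal carries one layer value.

```yaml
processors:
  resource/infra:
    attributes:
      - {key: layer, value: infra, action: insert}
  resource/kuber:
    attributes:
      - {key: layer, value: kuber, action: insert}
  resource/app:
    attributes:
      - {key: layer, value: app, action: insert}
  resource/mesh:
    attributes:
      - {key: layer, value: mesh, action: insert}
```

The insert action only adds the attribute when it is absent, so an existing layer value is not overwritten.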
Mesh metrics require collector 0.118.0+ - the prometheus/mesh relabel rule uses replacement: $1:4191 to point the scrape at the pod IP on the proxy admin port. Collector 0.104.0 rejects $1 in replacement fields because the confmap.unifyEnvVarExpansion gate treats it as an env var. The image is pinned to 0.118.0, where the gate is gone, in helm/otelcol/values.yaml. Filtering is done in the filter/mesh processor rather than with metric_relabel_configs in the receiver, because a metric_relabel_configs keep rule is silently ignored on this path.
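The relabel rule in question can be sketched as below. The job name is hypothetical; the keep rule and the replacement value follow the description above.

```yaml
receivers:
  prometheus/mesh:
    config:
      scrape_configs:
        - job_name: linkerd-proxy            # hypothetical job name
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            # Keep only targets belonging to the Linkerd sidecar container
            - source_labels: [__meta_kubernetes_pod_container_name]
              action: keep
              regex: linkerd-proxy
            # Point the scrape at <pod-ip>:4191 (proxy admin port)
            - source_labels: [__meta_kubernetes_pod_ip]
              action: replace
              regex: (.+)
              target_label: __address__
              replacement: $1:4191           # the field rejected by collector 0.104.0
```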
DaemonSet runs as root - required to read /proc, /sys, /var/log/pods from the host.
Cluster-wide monitoring - prometheus/app uses cluster-wide Kubernetes SD (no namespace filter).
filelog/app pattern is /var/log/pods/*/*/*.log.
Both were scoped to a single namespace in the reference repo and were widened here to cover all workloads on sandbox-east.
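A sketch of the widened log receiver. The container operator from the contrib distribution parses CRI-formatted lines and derives namespace, pod, and container attributes from the file path; whether the repo uses this operator or a hand-written parser chain is an assumption.

```yaml
receivers:
  filelog/app:
    include:
      - /var/log/pods/*/*/*.log    # <namespace>_<pod>_<uid>/<container>/<n>.log
    include_file_path: true        # the container operator reads metadata from the path
    start_at: end                  # skip historical lines on restart
    operators:
      - type: container            # assumed parser; handles CRI log format
```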
otel-mesh-collector-config.yaml in this repo is the standalone Collector config for the
layer=mesh pipeline, extracted from helm/otelcol/templates/configmap.yaml and reduced to
the mesh path only. It is the companion artifact for the Buoyant article
"OTel and Mesh-Derived Metrics: A 2026 Reference."
It scrapes linkerd-proxy sidecars at :4191, keeps the 5 golden metric families via the
OTTL filter/mesh processor, tags every series layer=mesh, and writes to a
Prometheus-compatible backend (VictoriaMetrics in this repo).
Two findings from the article lab are captured as comments in the artifact:
- The golden-set filter must use the OTTL filter/mesh processor; a Prometheus metric_relabel_configs keep rule is silently ignored on this path.
- replacement: $1:4191 requires OTel Collector contrib >= 0.118.0; older builds enable the confmap.unifyEnvVarExpansion gate, which rejects the $1 relabel replacement at startup.
Article lab, tested on:
- Linkerd edge-26.5.5 (2.19+)
- K3s v1.34.6, single node
- OTel Collector contrib 0.118.0
- OpenTelemetry Demo (Astronomy Shop) as the meshed workload
- VictoriaMetrics as the metrics backend
This is a separate validation from the Linkerd mesh / layer=mesh section above, which was
verified on Linkerd enterprise-2.18.10 + emojivoto. Same pipeline, two different mesh
distributions and workloads.
The full walkthrough, lab notes, and Grafana dashboard JSON are in the Buoyant article. [Link to be added after publication.]
| Processor | Role |
|---|---|
| memory_limiter | Caps Collector memory (400 MiB soft / 100 MiB spike) |
| filter/mesh | Keeps 5 golden families via OTTL: response_total, response_latency_ms.*, tcp_open_connections, tcp_read_bytes_total, tcp_write_bytes_total. A metric_relabel_configs keep is silently ignored here. |
| resource/mesh | Inserts layer=mesh on every series |
| resourcedetection | Adds host metadata (hostname, OS) |
| k8sattributes | Enriches with pod, namespace, deployment, node labels |
| batch | Batches before export (10s) |
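The first and last rows could translate to the settings sketched below. How the table's figures map onto the processor's fields is an assumption: in memory_limiter the soft limit is limit_mib minus spike_limit_mib, so if 400 MiB is meant as the soft limit, limit_mib would be 500 instead.

```yaml
processors:
  memory_limiter:
    check_interval: 1s       # assumed
    limit_mib: 400           # assumption: the table's 400 MiB used as limit_mib
    spike_limit_mib: 100
  batch:
    timeout: 10s             # flush at least every 10 seconds
```

memory_limiter is conventionally placed first in a pipeline's processor list so that back-pressure is applied before other processors allocate memory.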