@AliAhsan12

🎯 Summary

This PR adds a generic NoSchedule toleration to the vector-agent DaemonSet to ensure Vector Agent can be scheduled on nodes with taints, addressing a critical gap in log collection from tainted nodes in production environments.

🚨 Problem Statement

Currently, the vector-agent chart only includes a specific toleration for node-role.kubernetes.io/master nodes. This creates a significant operational issue:

  • Log Collection Gap: Vector Agent cannot be scheduled on nodes with generic NoSchedule taints
  • Incomplete Observability: Critical workloads running on tainted nodes have no log collection
  • Manual Configuration Required: Users must manually add tolerations for each deployment
  • Production Impact: This affects enterprise environments using node taints for workload isolation

🔧 Solution

Add a generic toleration that allows the DaemonSet to schedule on any node with NoSchedule taints:

tolerations:
  # Generic toleration for NoSchedule taints - allows scheduling on tainted nodes
  - effect: NoSchedule
    operator: Exists
  # Keep existing master node toleration for backward compatibility
  - key: node-role.kubernetes.io/master
    effect: NoSchedule
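
To illustrate what the generic toleration does and does not cover, here is a minimal sketch of a node carrying two taints. The node name, taint keys, and values are hypothetical and not taken from this PR. A toleration with no key and operator: Exists matches every taint key, but only for the effect it names, so the NoSchedule taint below would be tolerated while the NoExecute taint would not.

```yaml
# Hypothetical node used only for illustration; names are assumptions.
apiVersion: v1
kind: Node
metadata:
  name: example-node
spec:
  taints:
    # Tolerated by "effect: NoSchedule, operator: Exists" (any key matches)
    - key: dedicated
      value: gpu
      effect: NoSchedule
    # NOT tolerated by the toleration above: the effect differs, so a
    # separate NoExecute toleration would be needed for this taint
    - key: maintenance
      value: "true"
      effect: NoExecute
```

Nodes with only PreferNoSchedule taints are a soft preference and do not block scheduling, so they need no toleration.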

📊 Impact & Benefits

  • Complete Log Coverage: Ensures Vector Agent can be scheduled on all nodes with NoSchedule taints (NoExecute taints are not covered by this toleration)
  • Zero Configuration: Works out of the box for users with tainted nodes
  • Production Ready: Addresses real-world enterprise deployment scenarios
  • Backward Compatible: Maintains existing functionality while adding new capabilities

✅ Checklist

  • Code follows existing style and patterns
  • Backward compatibility maintained
  • Tested in Kubernetes environment with tainted nodes
  • No breaking changes introduced
  • Follows DaemonSet best practices
  • Addresses real operational need
  • Minimal risk, high value change

- Add toleration with effect: NoSchedule, operator: Exists
- Ensures DaemonSet can schedule on all nodes with NoSchedule taints
- Maintains backward compatibility with existing master node toleration
- Addresses log collection gaps in production environments
Comment on lines +159 to +160
  - effect: NoSchedule
    operator: Exists
Member

Can we make this configurable and preserve existing behavior?

Author

@AliAhsan12 AliAhsan12 Oct 28, 2025

Thanks, @pront, understood. However, since the Vector Agent is a DaemonSet meant to run on every node, making this optional would reintroduce the same scheduling issue on tainted nodes. The generic NoSchedule toleration is safe, backward-compatible, and standard across major DaemonSets such as Cilium, Fluent Bit, and Datadog. I've worked with several open source tools, and this is their default behavior to ensure consistent coverage across all nodes. I believe it should stay enabled by default here as well.
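
For reference, the configurable approach the reviewer asks about could look roughly like the override below. This is only a sketch: it assumes the chart exposes a top-level tolerations value that is rendered into the DaemonSet pod spec, which this PR does not confirm, and the file name is hypothetical.

```yaml
# custom-values.yaml (hypothetical override file; assumes the chart reads a
# top-level `tolerations` value, which is not confirmed by this PR)
tolerations:
  # Opt in to scheduling on any node with a NoSchedule taint
  - effect: NoSchedule
    operator: Exists
  # Repeat the master toleration explicitly: Helm replaces list values
  # wholesale rather than merging them with the chart defaults
  - key: node-role.kubernetes.io/master
    effect: NoSchedule
```

With this shape, the chart default would keep only the master toleration, preserving existing behavior, and users who need coverage on tainted nodes would pass the override file at install time.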
