
@slab-msft slab-msft commented Nov 20, 2025

  • pass the input net_setup into network_init to reuse configured options
  • apply flb_input_upstream_set so the upstream inherits the input context

Related stale PR: #10487

Allow network setup parameters (such as keepalive and timeouts) to be applied to the upstream connection in the Kubernetes events plugin. network_init now applies the input instance's net_setup configuration to the upstream (via flb_input_upstream_set), so the configured TCP keepalive and timeout settings take effect.

Fixes the Kubernetes events plugin failing to reconnect when an API server control plane node fails. The plugin uses long-lived watch streams to receive events, and these streams can become stale when the underlying control plane node stops responding. The fix propagates network configuration settings (TCP keepalive, connection recycling, timeouts) from the input plugin to the upstream connection, so dead connections are detected and the plugin reconnects to a healthy control plane node.
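A simplified, self-contained model of the propagation step follows. The types `net_setup`, `upstream`, and `input_instance` and the function `input_upstream_set()` are hypothetical stand-ins, not Fluent Bit's real structures or `flb_input_upstream_set()`. The sketch only assumes what the review comments describe: binding fails only for a NULL upstream and otherwise copies the input's net_setup onto the upstream.

```c
#include <string.h>

/* HYPOTHETICAL stand-in for the input's parsed net.* options
 * (not Fluent Bit's real struct flb_net_setup). */
struct net_setup {
    int connect_timeout;        /* net.connect_timeout, seconds */
    int keepalive;              /* net.keepalive, 1 = on */
    int keepalive_idle_timeout; /* net.keepalive_idle_timeout, seconds */
    int tcp_keepalive;          /* net.tcp_keepalive, 1 = on */
    int tcp_keepalive_time;     /* net.tcp_keepalive_time, seconds */
    int tcp_keepalive_interval; /* net.tcp_keepalive_interval, seconds */
    int tcp_keepalive_probes;   /* net.tcp_keepalive_probes, count */
};

/* HYPOTHETICAL upstream: only its network settings are modeled. */
struct upstream {
    struct net_setup net;
};

/* HYPOTHETICAL input instance holding the configured options. */
struct input_instance {
    struct net_setup net_setup;
};

/* Models the binding step: reject a NULL upstream, otherwise copy the
 * input's network configuration onto the upstream so connections made
 * through it use the configured timeouts and keepalive settings. */
static int input_upstream_set(struct upstream *u,
                              const struct input_instance *ins)
{
    if (u == NULL || ins == NULL) {
        return -1;
    }
    memcpy(&u->net, &ins->net_setup, sizeof(u->net));
    return 0;
}
```

In this model, calling `input_upstream_set(&up, &ins)` right after the upstream is created is what makes the upstream's keepalive and timeout values match the `net.*` options from the configuration; without it, `up.net` keeps whatever defaults it was created with.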



Testing
Before we can approve your change, please submit the following in a comment:

  • Example configuration file for the change
  • Debug log output from testing the change

Example configuration file for the change:

[INPUT]
    Name                kubernetes_events
    Alias               in.k8s_events
    Tag                 k8s_events
    kube_url            https://kubernetes.default.svc.cluster.local:443
    interval_sec        15
    kube_request_limit  150
    DB                  /fluent-bit/db/events.db
    net.connect_timeout         10
    net.keepalive               on
    net.keepalive_idle_timeout  30
    net.tcp_keepalive           on
    net.tcp_keepalive_time      30
    net.tcp_keepalive_interval  30
    net.tcp_keepalive_probes    3
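At the socket level on Linux, TCP keepalive is controlled by the standard `SO_KEEPALIVE`, `TCP_KEEPIDLE`, `TCP_KEEPINTVL`, and `TCP_KEEPCNT` options. The sketch below assumes that `net.tcp_keepalive_time`, `net.tcp_keepalive_interval`, and `net.tcp_keepalive_probes` end up as those three values on the upstream socket; it is plain POSIX/Linux code, not Fluent Bit's implementation, and the helper name `apply_tcp_keepalive()` is hypothetical.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* HYPOTHETICAL helper: enable TCP keepalive on fd with the given idle
 * time and probe interval (both in seconds) and probe count.
 * Returns 0 on success, -1 on the first failing setsockopt(). */
static int apply_tcp_keepalive(int fd, int idle_s, int interval_s, int probes)
{
    int on = 1;

    /* Turn keepalive on for this socket. */
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) != 0) {
        return -1;
    }
    /* Idle time before the first probe is sent. */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_s, sizeof(idle_s)) != 0) {
        return -1;
    }
    /* Time between consecutive probes. */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &interval_s, sizeof(interval_s)) != 0) {
        return -1;
    }
    /* Unanswered probes before the kernel drops the connection. */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &probes, sizeof(probes)) != 0) {
        return -1;
    }
    return 0;
}
```

For the example configuration, the equivalent call would be `apply_tcp_keepalive(fd, 30, 30, 3)` on the connected socket. `TCP_KEEPIDLE`, `TCP_KEEPINTVL`, and `TCP_KEEPCNT` are Linux options; other platforms name or support them differently.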

Setup with 3 control plane (CP) API server endpoints.

IPv4 addresses of the CPs: 10.0.0.4, 10.0.0.5, 10.0.0.6

Initially, Fluent Bit connects to 10.0.0.6 (via the load balancer). Once that API server fails, crashes, or shuts down, the connection remains ESTABLISHED but is stale.
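Why keepalive helps here: with keepalive enabled on Linux, a connection whose peer has gone silent is dropped roughly `idle + interval × probes` seconds after the last traffic (first probe after the idle time, then one probe per interval, and the connection is killed once that many probes go unanswered; see tcp(7)). A small worked calculation, where the helper name `keepalive_detection_bound()` is hypothetical:

```c
/* HYPOTHETICAL helper: approximate seconds from the last received
 * segment until the kernel gives up on a silent peer, using Linux
 * keepalive semantics: first probe after idle_s, then one probe every
 * interval_s, and the connection is dropped once `probes` consecutive
 * probes go unanswered. */
static long keepalive_detection_bound(long idle_s, long interval_s, long probes)
{
    return idle_s + interval_s * probes;
}
```

With the example configuration (30 s idle, 30 s interval, 3 probes) this gives 120 s. With the common Linux defaults of 7200 s, 75 s, and 9 probes it is 7875 s, a little over two hours, and that only applies if keepalive is enabled on the socket at all.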

Fluent Bit logs

Fluent Bit v4.1.1
* Copyright (C) 2015-2025 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io
[2025/11/20 16:16:28.375325255] [ info] [fluent bit] version=4.1.1, commit=912b7d783a, pid=1
[2025/11/20 16:16:28.375416549] [ info] [storage] ver=1.5.3, type=memory, sync=normal, checksum=off, max_chunks_up=128
[2025/11/20 16:16:28.375423634] [ info] [simd    ] SSE2
[2025/11/20 16:16:28.375426985] [ info] [cmetrics] version=1.0.5
[2025/11/20 16:16:28.375429956] [ info] [ctraces ] version=0.6.6
[2025/11/20 16:16:28.375489181] [ info] [input:kubernetes_events:in.k8s_events] initializing
[2025/11/20 16:16:28.375496505] [ info] [input:kubernetes_events:in.k8s_events] storage_strategy='memory' (memory only)
[2025/11/20 16:16:28.413277761] [ info] [input:kubernetes_events:in.k8s_events] thread instance initialized
[2025/11/20 16:16:28.414020513] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
[2025/11/20 16:16:28.414030781] [ info] [sp] stream processor started
[2025/11/20 16:16:28.414089972] [ info] [engine] Shutdown Grace Period=5, Shutdown Input Grace Period=2
[2025/11/20 16:16:43.56832343] [ info] [input:kubernetes_events:in.k8s_events] Requesting /api/v1/events?watch=1&resourceVersion=2506189
[2025/11/20 16:20:32.773002850] [error] [/build/top/BUILD/fb/src/tls/openssl.c:977 errno=110] Connection timed out
[2025/11/20 16:20:32.773034639] [error] [tls] syscall error: error:00000005:lib(0)::reason(5)
[2025/11/20 16:20:32.773047128] [error] [http_client] broken connection to kubernetes.default.svc.cluster.local:443 ?
[2025/11/20 16:20:32.773052666] [ warn] [input:kubernetes_events:in.k8s_events] kubernetes chunked stream error.
[2025/11/20 16:20:32.773057139] [ info] [input:kubernetes_events:in.k8s_events] kubernetes stream disconnected, ret=-1
[2025/11/20 16:20:35.520307266] [ info] [input:kubernetes_events:in.k8s_events] Requesting /api/v1/events?watch=1&resourceVersion=2506958
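The log shows the failure path: the TLS read fails with errno 110 (`ETIMEDOUT` on Linux, which is what a socket read reports once keepalive probes go unanswered), the plugin reports the stream as disconnected, and a few seconds later it issues a new watch request with a newer `resourceVersion`. A hypothetical sketch of those two decisions; `is_dead_connection_errno()` and `build_watch_path()` are illustrative names, not functions from the plugin:

```c
#include <errno.h>
#include <stdio.h>

/* HYPOTHETICAL: errno values after a failed read that mean the current
 * connection is unusable and a new one should be opened. ETIMEDOUT is
 * what Linux reports when keepalive probes go unanswered. */
static int is_dead_connection_errno(int err)
{
    return err == ETIMEDOUT || err == ECONNRESET || err == EPIPE;
}

/* HYPOTHETICAL: build the watch path used to resume the event stream
 * from a given resourceVersion, in the same shape as the request paths
 * in the log. Returns 0 on success, -1 if buf is too small. */
static int build_watch_path(char *buf, size_t size, const char *resource_version)
{
    int n = snprintf(buf, size, "/api/v1/events?watch=1&resourceVersion=%s",
                     resource_version);
    return (n < 0 || (size_t) n >= size) ? -1 : 0;
}
```

The sketch treats `resourceVersion` as an opaque string rather than parsing it as a number, which follows the Kubernetes API convention that clients should not interpret its value.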

Connection trace

root [ / ]# lsof -i -n
COMMAND   PID USER   FD   TYPE  DEVICE SIZE/OFF NODE NAME
fluent-bi   1 root   63u  IPv4 5529341      0t0  TCP 10.244.0.23:58860->10.0.0.6:6443 (ESTABLISHED)

root [ / ]# tcpdump -i eth0 port 6443
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
# Keepalive probes to the unresponsive/unhealthy apiserver at 10-0-0-6, then RST
16:19:31.332926 IP fluent-bit.58860 > 10-0-0-6.kube-apiserver.6443: Flags [.], ack 3926392123, win 7823, options [nop,nop,TS val 3952409628 ecr 513079258], length 0
16:20:02.052930 IP fluent-bit.58860 > 10-0-0-6.kube-apiserver.6443: Flags [.], ack 1, win 7823, options [nop,nop,TS val 3952440348 ecr 513079258], length 0
16:20:32.772929 IP fluent-bit.58860 > 10-0-0-6.kube-apiserver.6443: Flags [R.], seq 1, ack 1, win 7823, options [nop,nop,TS val 3952471068 ecr 513079258], length 0
# New connection to healthy apiserver at 10-0-0-4
16:20:35.471104 IP fluent-bit.49078 > 10-0-0-4.kube-apiserver.6443: Flags [S], seq 3540454757, win 64770, options [mss 3810,sackOK,TS val 2693763838 ecr 0,nop,wscale 7], length 0
16:20:35.472730 IP 10-0-0-4.kube-apiserver.6443 > fluent-bit.49078: Flags [S.], seq 3369463790, ack 3540454758, win 65416, options [mss 3810,sackOK,TS val 3613053218 ecr 2693763838,nop,wscale 7], length 0
16:20:35.472756 IP fluent-bit.49078 > 10-0-0-4.kube-apiserver.6443: Flags [.], ack 1, win 507, options [nop,nop,TS val 2693763839 ecr 3613053218], length 0
16:20:35.473155 IP fluent-bit.49078 > 10-0-0-4.kube-apiserver.6443: Flags [P.], seq 1:333, ack 1, win 507, options [nop,nop,TS val 2693763840 ecr 3613053218], length 332
16:20:35.473228 IP 10-0-0-4.kube-apiserver.6443 > fluent-bit.49078: Flags [.], ack 333, win 509, options [nop,nop,TS val 3613053219 ecr 2693763840], length 0

root [ / ]# lsof -i -n
COMMAND   PID USER   FD   TYPE  DEVICE SIZE/OFF NODE NAME
fluent-bi   1 root   55u  IPv4 5529629      0t0  TCP *:2020 (LISTEN)
fluent-bi   1 root   62u  IPv4 5540210      0t0  TCP 10.244.0.23:49078->10.0.0.4:6443 (ESTABLISHED)

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

Summary by CodeRabbit

  • Refactor
    • Improved network initialization for the Kubernetes events plugin by adding validation and cleanup when upstream setup fails, leading to more robust behavior and clearer error reporting in failure cases.


coderabbitai bot commented Nov 20, 2025

Walkthrough

Added an upstream-binding validation step in the Kubernetes events plugin's network initialization: after creating the upstream context the code calls flb_input_upstream_set(ctx->upstream, ctx->ins) and, on failure, logs an error, destroys and clears the upstream, and returns -1.

Changes

Cohort / File(s): Upstream validation and cleanup (plugins/in_kubernetes_events/kubernetes_events_conf.c)
Summary: After creating ctx->upstream, call flb_input_upstream_set(ctx->upstream, ctx->ins). On failure, log an error, call flb_upstream_destroy(ctx->upstream), set ctx->upstream = NULL, and return -1. No public API/signature changes.

Sequence Diagram(s)

sequenceDiagram
    participant Input as flb_input
    participant Init as network_init()
    participant Upstr as create_upstream()
    participant Bind as flb_input_upstream_set()

    Input->>Init: call network_init()
    Init->>Upstr: create upstream context
    Upstr-->>Init: upstream created
    Init->>Bind: flb_input_upstream_set(upstream, ins)
    alt bind succeeds
        Bind-->>Init: success
        Init-->>Input: return success
    else bind fails
        Bind-->>Init: error
        Init->>Upstr: flb_upstream_destroy(upstream)
        Init->>Init: ctx->upstream = NULL
        Init-->>Input: return -1 (error)
    end

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

  • Single-file change with localized control-flow and error handling.
  • Pay attention to correct use of flb_input_upstream_set() return semantics and proper cleanup calls (flb_upstream_destroy and clearing ctx->upstream).

Poem

🐇 I bound a stream with careful paw,
If binding failed, I fixed the flaw,
I tore it down and cleared the way,
Then hopped away to code another day. 🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 0.00%, which is below the required threshold of 80.00%. Resolution: you can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Description Check (✅ Passed): Check skipped; CodeRabbit’s high-level summary is enabled.
  • Title check (✅ Passed): The PR title accurately describes the main change: adding support for configuring input upstream network setup in the Kubernetes events input plugin.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dbe85c1 and 6f6498f.

📒 Files selected for processing (1)
  • plugins/in_kubernetes_events/kubernetes_events_conf.c (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
plugins/in_kubernetes_events/kubernetes_events_conf.c (2)
src/flb_input.c (1)
  • flb_input_upstream_set (2207-2226)
src/flb_upstream.c (1)
  • flb_upstream_destroy (656-698)
🔇 Additional comments (3)
plugins/in_kubernetes_events/kubernetes_events_conf.c (3)

92-92: Good fix: removed unused parameter.

The removal of the unused net parameter addresses the previous review comment and is correct since network configuration propagation now happens through flb_input_upstream_set(ctx->upstream, ctx->ins) at line 131, which internally accesses ctx->ins->net_setup.


131-136: Proper implementation of upstream network setup propagation.

The addition correctly implements the PR objective by calling flb_input_upstream_set to propagate the input instance's network configuration (TCP keepalive, timeouts, etc.) to the upstream connection. The error handling is appropriate:

  • Destroys the upstream context on failure
  • Nulls the pointer to prevent double-free
  • Returns failure to trigger full cleanup via k8s_events_conf_destroy

The check is defensive (the only failure mode is a NULL upstream, which is already guarded), but checking return values is good practice.
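The cleanup pattern described here (destroy the upstream on bind failure, clear the pointer, return -1, and let the config destructor run) can be modeled in isolation. Everything in this sketch is hypothetical: `struct conf`, `upstream_create()`, `upstream_destroy()`, `conf_network_init()`, and `conf_destroy()` stand in for the plugin's real functions, the `bind_ok` flag simulates the result of the binding call, and a destroy counter shows that clearing the pointer keeps the destructor from freeing the upstream a second time.

```c
#include <stdlib.h>

/* HYPOTHETICAL upstream type; only its allocation matters here. */
struct upstream {
    int unused;
};

/* HYPOTHETICAL plugin context. */
struct conf {
    struct upstream *upstream;
};

static int destroy_calls = 0;  /* counts upstream destructions */

static struct upstream *upstream_create(void)
{
    return calloc(1, sizeof(struct upstream));
}

static void upstream_destroy(struct upstream *u)
{
    destroy_calls++;
    free(u);
}

/* Models network_init(): create the upstream, bind it, and on bind
 * failure destroy it and clear the pointer before returning -1.
 * bind_ok simulates the result of the binding call. */
static int conf_network_init(struct conf *ctx, int bind_ok)
{
    ctx->upstream = upstream_create();
    if (ctx->upstream == NULL) {
        return -1;
    }
    if (!bind_ok) {
        upstream_destroy(ctx->upstream);
        ctx->upstream = NULL;  /* prevents a second destroy in conf_destroy() */
        return -1;
    }
    return 0;
}

/* Models the full cleanup that runs after init fails or at shutdown. */
static void conf_destroy(struct conf *ctx)
{
    if (ctx->upstream != NULL) {
        upstream_destroy(ctx->upstream);
        ctx->upstream = NULL;
    }
}
```

Because `conf_network_init()` clears `ctx->upstream` after destroying it, the later `conf_destroy()` call sees NULL and skips the upstream, so it is freed exactly once.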


228-228: Call site correctly updated.

The function call properly reflects the signature change.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4f8c50b and dbe85c1.

📒 Files selected for processing (1)
  • plugins/in_kubernetes_events/kubernetes_events_conf.c (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
plugins/in_kubernetes_events/kubernetes_events_conf.c (2)
src/flb_input.c (1)
  • flb_input_upstream_set (2207-2226)
src/flb_upstream.c (1)
  • flb_upstream_destroy (656-698)
🔇 Additional comments (1)
plugins/in_kubernetes_events/kubernetes_events_conf.c (1)

131-136: LGTM! Proper network setup propagation with correct error handling.

The addition of flb_input_upstream_set correctly propagates the input instance's network configuration (net_setup) to the upstream connection, which aligns with the PR objective. The error handling properly cleans up resources by destroying the upstream and setting it to NULL before returning.

…ork setup

- pass the input net_setup into network_init to reuse configured options
- apply flb_input_upstream_set so the upstream inherits the input context
@slab-msft slab-msft force-pushed the k8s_events_input_net branch from dbe85c1 to 6f6498f on November 20, 2025 17:11