1 change: 1 addition & 0 deletions lib/fluent/plugin/kubernetes_metadata_watch_namespaces.rb
@@ -119,6 +119,7 @@ def process_namespace_watcher_notices(watcher)
@stats.bump(:namespace_cache_watch_ignored)
end
end
+ raise
Contributor

Won't this simply fall through all the time and raise an exception on every notification? Should we be evaluating for a specific notice.type, or at least have continue statements above?

Contributor

As I suspected:

Error: test: pod MODIFIED cached when hostname matches(DefaultPodWatchStrategyTest)
: RuntimeError: 
/home/circleci/fluent-plugin-kubernetes_metadata_filter/lib/fluent/plugin/kubernetes_metadata_watch_pods.rb:128:in `process_pod_watcher_notices'
/home/circleci/fluen

Contributor
@jcantrill jcantrill Feb 26, 2020

It's probably something more like:

when 'ERROR'
  message = notice['object']['message'] if notice['object'] && notice['object']['message']
  raise "Error while watching namespaces: #{message}"

Contributor Author

I was thinking about doing it with when 'ERROR', but then it can still exit the watcher.each block without an exception, which can cause an endless loop without backoff.

I don't know how this testing framework works, but in reality it stays inside the watcher.each block as long as the connection is open. Watching opens long-running connections; the API server usually kills the connection after 40-50 minutes.
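
To make the concern concrete, here is a rough sketch of what combining both ideas could look like -- raising on an explicit ERROR notice and also raising when watcher.each returns because the connection was closed. This is a reconstruction for illustration, not the plugin's actual code:

def process_namespace_watcher_notices(watcher)
  watcher.each do |notice|
    case notice.type
    when 'MODIFIED'
      # update the namespace cache here (omitted)
    when 'DELETED'
      # invalidate the cached entry here (omitted)
    when 'ERROR'
      message = notice['object']['message'] if notice['object'] && notice['object']['message']
      raise "Error while watching namespaces: #{message}"
    else
      @stats.bump(:namespace_cache_watch_ignored)
    end
  end
  # If we get here, the API server closed the long-running connection without
  # sending an ERROR notice; raise anyway so the caller's rescue can restart
  # the watch instead of silently reusing a stale resource_version.
  raise 'Namespace watch connection was closed by the API server'
end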

Contributor

Hmm... I thought the notice.type value can only be one of these three: ADDED, MODIFIED, DELETED.

Reference: https://github.com/abonas/kubeclient/blame/28dfc4d538d72127015dc9b90e6b776c4b0fb986/test/test_watch.rb#L8

@jcantrill's suggestion in https://github.com/fabric8io/fluent-plugin-kubernetes_metadata_filter/pull/214/files#r384729998 makes sense to me.

Looking at the issue reported at #213 (comment), it's not an unexpected exception. Besides, if an exception is thrown, https://github.com/fabric8io/fluent-plugin-kubernetes_metadata_filter/blob/9b48b09fa56fcb32c0bcc8e7547bb7589a309125/lib/fluent/plugin/kubernetes_metadata_watch_pods.rb#L42 should have caught it.

Contributor Author

There is no exception thrown when the connection is closed.

When you open the HTTP connection for the watcher you get a 200 OK reply from the server, but the connection stays open and the events are streamed over that same connection. Kubernetes will close it after around 1 hour (at least on our cluster).

Once it's closed there is no exception; it finishes normally.

So the thread loop will execute again, but because there was no exception the rescue block did not run and nothing set the pod_watcher to nil. It will reuse the resource_version from the first run.
The Kubernetes API remembers old resource versions for a while, but after some time it will return ERROR with a "too old resource version" message and close the connection (again, no exception happens).

When this happens the thread loop will start the watcher again with the old resource version, get the error again, and this continues in a loop. It results in high load on the API server and no way to recover except restarting fluentd.

I can rework this PR to use the when 'ERROR' case, but then the issue can still happen if the connection is closed by the API server without returning ERROR or throwing an exception.

I think an exception should be raised whenever the watcher.each block is exited, because it means the long-running connection was closed gracefully.
Unfortunately I am not sure how I can make it work with the tests.
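
To illustrate the difference with a standalone snippet (purely a simulation, not the plugin's code): when the watch returns cleanly the rescue/reset path never runs and the stale resource version is reused, while an explicit raise forces the reset.

# Illustrative simulation only; names and values are made up.
def fake_watch(raise_on_close:)
  %w[ADDED MODIFIED DELETED].each { |type| puts "  notice: #{type}" }
  # The API server closes the connection here.
  raise 'watch connection closed' if raise_on_close
end

def watch_loop(raise_on_close:)
  resource_version = '1000-stale'
  2.times do
    begin
      puts "watching from resource_version=#{resource_version}"
      fake_watch(raise_on_close: raise_on_close)
      # Clean return: nothing below resets resource_version.
    rescue => e
      puts "  rescued: #{e.message}; fetching a fresh resource_version"
      resource_version = '2000-fresh'
    end
  end
end

watch_loop(raise_on_close: false) # keeps re-watching with the stale version
watch_loop(raise_on_close: true)  # rescue path refreshes the version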

Contributor

Re #214 (comment): Ah I see. That sounds good to me. Instead of a simple raise, can we give it some message to help with debugging / troubleshooting when the issue happens?
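
For example, something along the lines of (just an illustration of the idea, not necessarily the final wording):

raise 'Namespace watch connection was closed by the Kubernetes API server; restarting the watch'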

As for the fact that it returned an ERROR notice, it seems like there is some discussion at kubeclient/pull/275. Sounds like it might be fixed in

end
end
end
1 change: 1 addition & 0 deletions lib/fluent/plugin/kubernetes_metadata_watch_pods.rb
@@ -125,6 +125,7 @@ def process_pod_watcher_notices(watcher)
@stats.bump(:pod_cache_watch_ignored)
end
end
+ raise
end
end
end