Skip to content

With flow control, EPP's CPU usage reaches 100% #1847

@delavet

Description

@delavet

When I was using the experimental flow control feature, based on my observation of the pod metrics, I noticed that the EPP's CPU usage would quickly reach nearly 100% after receiving the first request, approaching its limit. This situation will continue even if EPP doesn't receive any requests.

Image

At the same time, I occasionally observe that the EPP restarts due to health check failures, not sure if it is affected by this phenomenon.

By inspecting the code, I found that it might be due to the ShardProcessor continuously invoking runtime.Gosched(). I can understand the design intent, but it seems that calling runtime.Gosched() too frequently could lead to high CPU usage.

I attempted to make the following changes by introducing a ticker with a small time interval to avoid continuous calls.

func (sp *ShardProcessor) Run(ctx context.Context) {
	sp.logger.V(logutil.DEFAULT).Info("Shard processor run loop starting.")
	defer sp.logger.V(logutil.DEFAULT).Info("Shard processor run loop stopped.")

	sp.wg.Add(1)
	go sp.runCleanupSweep(ctx)

	// Create a ticker for periodic dispatch attempts to avoid tight loops
	dispatchTicker := sp.clock.NewTicker(time.Millisecond) // we may adjust interval as needed
	defer dispatchTicker.Stop()

	// ...
	for {
		select {
		case <-ctx.Done():
			sp.shutdown()
			sp.wg.Wait()
			return
		case item, ok := <-sp.enqueueChan:
			if !ok { // Should not happen in practice, but is a clean shutdown signal.
				sp.shutdown()
				sp.wg.Wait()
				return
			}
			// This is a safeguard against logic errors in the distributor.
			if item == nil {
				sp.logger.Error(nil, "Logic error: nil item received on shard processor enqueue channel, ignoring.")
				continue
			}
			sp.enqueue(item)
			sp.dispatchCycle(ctx) // Process immediately when an item arrives
		case <-dispatchTicker.C():
			sp.dispatchCycle(ctx) // Periodically attempt to dispatch from queues
		}
	}
}

This can reduce the CPU usage of EPP to about 2.5% when it is idle.

Image

But this introduces a behavior change: ShardProcessor will not continuously attempt to perform dispatch. However, I think it might be fine, because the saturation detector is unlikely to change its detection result within a very short period.

Some opinions from the maintainer are needed:

  • Should this phenomenon be considered a problem? (since I have not yet tested its impact on EPP performance)
  • Does the aforementioned modification still align with the design intent of Flow control?

cc @LukeAVanDrie

Metadata

Metadata

Assignees

Labels

needs-triageIndicates an issue or PR lacks a `triage/foo` label and requires one.

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions