-
Notifications
You must be signed in to change notification settings - Fork 194
Description
When I was using the experimental flow control feature, based on my observation of the pod metrics, I noticed that the EPP's CPU usage would quickly reach nearly 100% after receiving the first request, approaching its limit. This situation will continue even if EPP doesn't receive any requests.
At the same time, I occasionally observe that the EPP restarts due to health check failures, not sure if it is affected by this phenomenon.
By inspecting the code, I found that it might be due to the ShardProcessor continuously invoking runtime.Gosched(). I can understand the design intent, but it seems that calling runtime.Gosched() too frequently could lead to high CPU usage.
gateway-api-inference-extension/pkg/epp/flowcontrol/controller/internal/processor.go
Line 202 in 4752d66
| runtime.Gosched() |
I attempted to make the following changes by introducing a ticker with a small time interval to avoid continuous calls.
func (sp *ShardProcessor) Run(ctx context.Context) {
sp.logger.V(logutil.DEFAULT).Info("Shard processor run loop starting.")
defer sp.logger.V(logutil.DEFAULT).Info("Shard processor run loop stopped.")
sp.wg.Add(1)
go sp.runCleanupSweep(ctx)
// Create a ticker for periodic dispatch attempts to avoid tight loops
dispatchTicker := sp.clock.NewTicker(time.Millisecond) // we may adjust interval as needed
defer dispatchTicker.Stop()
// ...
for {
select {
case <-ctx.Done():
sp.shutdown()
sp.wg.Wait()
return
case item, ok := <-sp.enqueueChan:
if !ok { // Should not happen in practice, but is a clean shutdown signal.
sp.shutdown()
sp.wg.Wait()
return
}
// This is a safeguard against logic errors in the distributor.
if item == nil {
sp.logger.Error(nil, "Logic error: nil item received on shard processor enqueue channel, ignoring.")
continue
}
sp.enqueue(item)
sp.dispatchCycle(ctx) // Process immediately when an item arrives
case <-dispatchTicker.C():
sp.dispatchCycle(ctx) // Periodically attempt to dispatch from queues
}
}
}This can reduce the CPU usage of EPP to about 2.5% when it is idle.
But this introduces a behavior change: ShardProcessor will not continuously attempt to perform dispatch. However, I think it might be fine, because the saturation detector is unlikely to change its detection result within a very short period.
Some opinions from the maintainer are needed:
- Should this phenomenon be considered a problem? (since I have not yet tested its impact on EPP performance)
- Does the aforementioned modification still align with the design intent of Flow control?