Fix EssPower thread-safety race condition (#3500) #3602

Open
rishabhvaish wants to merge 1 commit into OpenEMS:develop from
rishabhvaish:fix/esspower-race-condition-3500

@rishabhvaish
Contributor

Summary

  • Race condition between OSGi thread (adding ESS) and Solver thread (reading coefficients) causes "Coefficient was not found" errors
  • ESS operates at 0W until manual restart — 40+ hour production downtime reported
  • Root cause: Data.getConstraints* and Coefficients.of() are not synchronized, while Coefficients.initialize() clears-then-rebuilds (non-atomic)

Root cause

  1. OSGi thread: Data.addEss() → updateInverters() → Coefficients.initialize() → clears coefficient list
  2. Solver thread (concurrent): getConstraintsWithoutDisabledInverters() → sees new ESS in CopyOnWriteArrayList → calls Coefficients.of(essId) → coefficient not found (list was cleared)
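The interleaving above can be reproduced deterministically in a minimal sketch. The class below is a hypothetical stand-in, not the OpenEMS `Coefficients` class: `ClearThenRebuildSketch`, `find()` and the `betweenClearAndAdd` hook are assumptions introduced only to pin the reader to the moment right after `clear()`.

```java
import java.util.List;
import java.util.Optional;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical stand-in for the coefficient registry; not the OpenEMS class.
class ClearThenRebuildSketch {
    private final List<String> coefficients = new CopyOnWriteArrayList<>();

    // Clear-then-rebuild. The hook runs while the list is empty and stands in
    // for a Solver thread that happens to be scheduled in the middle of the update.
    synchronized void initialize(List<String> essIds, Runnable betweenClearAndAdd) {
        this.coefficients.clear();
        betweenClearAndAdd.run();
        this.coefficients.addAll(essIds);
    }

    // Unsynchronized reader, as described for of() before this change.
    Optional<String> find(String essId) {
        return this.coefficients.stream().filter(essId::equals).findFirst();
    }
}
```

A lookup issued from the hook misses even an ESS that was registered before the update, because it reads the list between `clear()` and `addAll()`.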

Changes

Coefficients.java (io.openems.edge.ess.api)

  • of() → synchronized: Prevents reading coefficients while initialize() is rebuilding them
  • initialize() → build-then-swap: Coefficients are built in a temporary ArrayList first, then clear() + addAll() happen at the end while holding the monitor lock. No reader (via of()) can observe the empty intermediate state.
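A minimal sketch of the two changes described for Coefficients.java, assuming a simplified model in which a coefficient is just a `String` key. `BuildThenSwapSketch` and the `IllegalArgumentException` are illustrative choices, not the actual OpenEMS types.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Coefficients; a "coefficient" is modelled as a String.
class BuildThenSwapSketch {
    private final List<String> coefficients = new ArrayList<>();

    // Build the new entries in a temporary list first, then swap them in
    // while still holding the monitor of this instance.
    synchronized void initialize(List<String> essIds) {
        List<String> fresh = new ArrayList<>();
        for (String essId : essIds) {
            fresh.add(essId);
        }
        this.coefficients.clear();
        this.coefficients.addAll(fresh);
    }

    // Synchronized on the same monitor, so it cannot run between clear() and addAll().
    synchronized String of(String essId) {
        for (String coefficient : this.coefficients) {
            if (coefficient.equals(essId)) {
                return coefficient;
            }
        }
        throw new IllegalArgumentException("Coefficient was not found: " + essId);
    }
}
```

In this sketch the guarantee comes from both methods sharing one monitor; the temporary list only shortens the time spent with the shared list empty.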

Data.java (io.openems.edge.ess.core)

  • getConstraintsForAllInverters() → synchronized
  • getConstraintsForInverters() → synchronized
  • getConstraintsWithoutDisabledInverters() → synchronized

These methods read esss, inverters, coefficients, and symmetricMode — all of which are mutated by addEss()/removeEss()/updateInverters() (which are already synchronized on the same Data instance). Without synchronization, the Solver thread can observe partially-updated state.
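The idea can be sketched with a hypothetical `DataSketch` class, assuming ESS and coefficients are plain `String` entries and a "constraint" is just a label. Writers and the reader are all synchronized on the same instance, so the reader never sees an ESS without its coefficient.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Data; fields and return types are simplified.
class DataSketch {
    private final List<String> esss = new ArrayList<>();
    private final List<String> coefficients = new ArrayList<>();

    // Both collections change together under the monitor of this instance.
    synchronized void addEss(String essId) {
        this.esss.add(essId);
        this.coefficients.add(essId);
    }

    synchronized void removeEss(String essId) {
        this.esss.remove(essId);
        this.coefficients.remove(essId);
    }

    // Reader on the same monitor: it runs either before or after a writer, never in between.
    synchronized List<String> getConstraintsForAllInverters() {
        List<String> constraints = new ArrayList<>();
        for (String essId : this.esss) {
            if (!this.coefficients.contains(essId)) {
                throw new IllegalStateException("Coefficient was not found: " + essId);
            }
            constraints.add("constraint(" + essId + ")");
        }
        return constraints;
    }
}
```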

Test plan

  • Start OpenEMS with multiple ESS components — verify no "Coefficient was not found" errors in log
  • Add/remove ESS component during active operation — verify no race condition errors
  • Verify ESS power limits are correctly applied after component restart
  • Stress test: rapidly add/remove ESS components while solver is running

Fixes #3500

The Solver thread calls Data.getConstraintsWithoutDisabledInverters() which
reads the esss list and calls Coefficients.of(). Neither method is synchronized.
When the OSGi thread concurrently calls Data.addEss() → updateInverters() →
Coefficients.initialize(), the initialize() call clears the coefficient list
before rebuilding it. The Solver thread can see the new ESS (via CopyOnWriteArrayList)
but find no coefficient for it, causing 'Coefficient was not found' errors.

This leads to the ESS operating at 0W and requires manual intervention to recover.
A production site reported 40+ hours of downtime from this race condition.

Fix:
- Synchronize Data read methods (getConstraintsWithoutDisabledInverters, etc.)
- Synchronize Coefficients.of()
- Change Coefficients.initialize() to build-then-swap instead of clear-then-rebuild

Fixes OpenEMS#3500

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Signed-off-by: Rishabh Vaish <[email protected]>
@codecov

codecov bot commented Mar 3, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files
@@              Coverage Diff              @@
##             develop    #3602      +/-   ##
=============================================
- Coverage      58.60%   58.51%   -0.08%     
+ Complexity       105      104       -1     
=============================================
  Files           3091     3095       +4     
  Lines         134005   134207     +202     
  Branches        9882     9870      -12     
=============================================
  Hits           78516    78516              
- Misses         52590    52772     +182     
- Partials        2899     2919      +20     

@Sn0w3y
Collaborator

Sn0w3y commented Mar 3, 2026

The synchronization is unnecessary:

  • coefficients, esss, inverters, and constraints are all CopyOnWriteArrayList, which is inherently thread-safe for concurrent read/write. Iteration always works on a snapshot of the internal array.
  • initialize() is already synchronized.
  • The brief window between clear() and the add() calls can produce at most a single WARN log entry during startup, which self-corrects on the next cycle.
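The snapshot behaviour mentioned in the first point can be checked with a small standalone sketch. `SnapshotIterationSketch` is a hypothetical name; only `java.util.concurrent.CopyOnWriteArrayList` from the JDK is used.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical demo class; not part of OpenEMS.
class SnapshotIterationSketch {
    static List<String> iterateWhileClearing() {
        List<String> list = new CopyOnWriteArrayList<>(List.of("ess0", "ess1"));
        // The iterator captures the backing array as it is right now.
        Iterator<String> it = list.iterator();
        // Later mutations replace the backing array and are invisible to 'it'.
        list.clear();
        List<String> seen = new ArrayList<>();
        it.forEachRemaining(seen::add);
        return seen; // contains "ess0" and "ess1"
    }
}
```

An iteration that starts before `clear()` still sees every element, while one that starts after `clear()` and before the subsequent `add()` calls sees an empty snapshot. That second case is the brief window described in the last point.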

The "atomic swap" is not actually atomic:

this.coefficients.clear();           // list is empty here
this.coefficients.addAll(newCoefficients);  // list is filled here
The same window exists between clear() and addAll(). It only "works" because of() is now also synchronized - making the temporary list entirely redundant.
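For comparison, a swap with no empty intermediate state could look like the sketch below. This is neither what the PR does nor something proposed in this comment; `VolatileSwapSketch` is a hypothetical illustration that publishes an immutable list through a single volatile reference write.

```java
import java.util.List;

// Hypothetical alternative; not taken from the PR or from OpenEMS.
class VolatileSwapSketch {
    // Readers always see either the old list or the new one, never a half-built one.
    private volatile List<String> coefficients = List.of();

    void initialize(List<String> essIds) {
        // List.copyOf creates an immutable copy; the assignment is one atomic reference write.
        this.coefficients = List.copyOf(essIds);
    }

    boolean contains(String essId) {
        // Read the reference once so the whole lookup uses a single consistent list.
        List<String> snapshot = this.coefficients;
        return snapshot.contains(essId);
    }
}
```

With this shape the reader needs no lock, which also avoids holding a monitor on the per-cycle path.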

The real issue from #3500 is something else entirely.
The evidence posted there shows:
org.osgi.framework.ServiceException: ServiceFactory.getService() resulted in a cycle.
This is an OSGi SCR service cycle error - meaning Felix detected a circular dependency during service resolution and addEss() was never called for that ESS. That's why the "Coefficient not found" error is permanent and never self-corrects. No amount of synchronized keywords will fix a method that was never invoked.

Adding synchronized to the Data getter methods introduces unnecessary lock contention on a hot path - getConstraintsWithoutDisabledInverters() runs every cycle (~1x/second) and performs multiple ConstraintUtil calls. Holding the Data lock during that time blocks addEss()/removeEss() for no benefit.

Development

Successfully merging this pull request may close these issues.

[Bug] EssPower circular dependency fix from #3113 needs to be applied to all ESS implementations

2 participants