-
Notifications
You must be signed in to change notification settings - Fork 198
Module graceful shutdown support #567
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Module graceful shutdown support #567
Conversation
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
sonic_platform_base/module_base.py
Outdated
# gnoi reboot pipe related | ||
GNOI_REBOOT_PIPE_PATH = "/host/gnoi_reboot.pipe" | ||
GNOI_REBOOT_RESPONSE_PIPE_PATH = "/host/gnoi_reboot_response.pipe" | ||
GNOI_PORT = 50052 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Default port is 8080 and please ready from Redis similar to https://github.com/sonic-net/sonic-utilities/blob/c78e0f73fece3fb1c6fb07718a64eddd337dae23/scripts/reboot_smartswitch_helper#L41C1-L45C2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not required anymore. Cleaned it.
sonic_platform_base/module_base.py
Outdated
GNOI_REBOOT_PIPE_PATH = "/host/gnoi_reboot.pipe" | ||
GNOI_REBOOT_RESPONSE_PIPE_PATH = "/host/gnoi_reboot_response.pipe" | ||
GNOI_PORT = 50052 | ||
GNOI_RESPONSE_TIMEOUT = 60 # seconds |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please read the timeout from platform.json similar to https://github.com/sonic-net/sonic-utilities/blob/c78e0f73fece3fb1c6fb07718a64eddd337dae23/scripts/reboot_smartswitch_helper#L109C7-L109C52
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
sonic_platform_base/module_base.py
Outdated
This method performs the following steps: | ||
1. Sends a JSON-formatted reboot request to the gNOI reboot daemon via a named pipe. | ||
2. Waits for a response on a designated response pipe, with a timeout of 60 seconds. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
update the comment accordingly to platform.json timeout
sonic_platform_base/module_base.py
Outdated
""" | ||
raise NotImplementedError | ||
|
||
def pre_shutdown_hook(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where is this invoked?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
refactored and not valid anymore
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
Azure Pipelines will not run the associated pipelines, because the pull request was updated after the run command was issued. Review the pull request again and issue a new run command. |
/azp run |
Azure Pipelines will not run the associated pipelines, because the pull request was updated after the run command was issued. Review the pull request again and issue a new run command. |
sonic_platform_base/module_base.py
Outdated
subtype = device_info.get_device_subtype() | ||
if subtype == "SmartSwitch" and not is_dpu(): | ||
self.graceful_shutdown_handler() | ||
# Proceed to set the admin state using the platform-specific implementation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is this supposed to work? super
here will call set_admin_state
of the base class, not of the derived one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the refactored implementation the platform will graceful_shutdown_handler()
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
This reverts commit c479203.
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
tests/module_base_test.py
Outdated
|
||
@staticmethod | ||
def test_transition_timeouts_platform_missing(): | ||
"""If platfrom is missing, defaults are used.""" |
Copilot
AI
Oct 21, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Corrected spelling of 'platfrom' to 'platform'.
"""If platfrom is missing, defaults are used.""" | |
"""If platform is missing, defaults are used.""" |
Copilot uses AI. Check for mistakes.
assert hasattr(MB, 'is_module_state_transition_timed_out') and callable(getattr(MB, 'is_module_state_transition_timed_out')) | ||
|
||
|
||
class TestModuleBasePCIAndSensors: |
Copilot
AI
Oct 21, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The class TestModuleBasePCIAndSensors
is defined three times in this file (lines 353, 880, 1408), causing duplicate test definitions. This creates confusion and potential test execution issues. Consider consolidating all PCI and sensor tests into a single class definition or renaming them to reflect their specific test scenarios.
class TestModuleBasePCIAndSensors: | |
class TestModuleBasePCIAndSensorsStateDB: |
Copilot uses AI. Check for mistakes.
tests/module_base_test.py
Outdated
class TestStateDbConnectorSwsscommonOnly: | ||
@patch('swsscommon.swsscommon.SonicV2Connector') | ||
def test_initialize_state_db_connector_success(self, mock_connector): | ||
from sonic_platform_base.module_base import ModuleBase | ||
mock_db = MagicMock() | ||
mock_connector.return_value = mock_db | ||
module = ModuleBase() | ||
assert module._state_db_connector == mock_db | ||
mock_db.connect.assert_called_once_with(mock_db.STATE_DB) | ||
|
||
@patch('swsscommon.swsscommon.SonicV2Connector') | ||
def test_initialize_state_db_connector_exception(self, mock_connector): | ||
from sonic_platform_base.module_base import ModuleBase | ||
mock_db = MagicMock() | ||
mock_db.connect.side_effect = RuntimeError("Connection failed") | ||
mock_connector.return_value = mock_db | ||
|
||
with patch('sys.stderr', new_callable=StringIO) as mock_stderr: | ||
module = ModuleBase() | ||
assert module._state_db_connector is None | ||
assert "Failed to connect to STATE_DB" in mock_stderr.getvalue() | ||
|
||
def test_state_db_connector_uses_swsscommon_only(self): | ||
import importlib | ||
import sys | ||
from types import ModuleType | ||
from unittest.mock import patch | ||
|
||
# Fake swsscommon package + swsscommon.swsscommon module | ||
pkg = ModuleType("swsscommon") | ||
pkg.__path__ = [] # mark as package | ||
sub = ModuleType("swsscommon.swsscommon") | ||
|
||
class FakeV2: | ||
def connect(self, *_): | ||
pass | ||
|
||
sub.SonicV2Connector = FakeV2 | ||
|
||
with patch.dict(sys.modules, { | ||
"swsscommon": pkg, | ||
"swsscommon.swsscommon": sub | ||
}, clear=False): | ||
mb = importlib.import_module("sonic_platform_base.module_base") | ||
importlib.reload(mb) | ||
# Since __init__ calls it, we need to patch before creating an instance | ||
with patch.object(mb.ModuleBase, '_initialize_state_db_connector') as mock_init_db: | ||
mock_init_db.return_value = FakeV2() | ||
instance = mb.ModuleBase() | ||
assert isinstance(instance._state_db_connector, FakeV2) | ||
|
||
|
||
# ==== graceful shutdown tests (match timeouts + centralized helpers) ==== | ||
|
||
@patch("sonic_platform_base.module_base.time", create=True) | ||
def test_graceful_shutdown_handler_success(self, mock_time): | ||
dpu_name = "DPU0" | ||
mock_time.time.return_value = 1710000000 | ||
mock_time.sleep.return_value = None | ||
|
||
module = DummyModule(name=dpu_name) | ||
module._state_db_connector.get_all.side_effect = [ | ||
{"state_transition_in_progress": "True"}, | ||
{"state_transition_in_progress": "False"}, | ||
] | ||
|
||
# Mock the race condition protection to allow the transition to be set | ||
with patch.object(module, "get_name", return_value=dpu_name), \ | ||
patch.object(module, "_load_transition_timeouts", return_value={"shutdown": 10}), \ | ||
patch.object(module, "set_module_state_transition", return_value=True), \ | ||
patch.object(module, "is_module_state_transition_timed_out", return_value=False): | ||
result = module.graceful_shutdown_handler() | ||
assert result is True | ||
|
||
@patch("sonic_platform_base.module_base.time", create=True) | ||
def test_graceful_shutdown_handler_timeout(self, mock_time): | ||
dpu_name = "DPU1" | ||
mock_time.time.return_value = 1710000000 | ||
mock_time.sleep.return_value = None | ||
|
||
module = DummyModule(name=dpu_name) | ||
# Keep it perpetually "in progress" so the handler’s wait path runs | ||
module._state_db_connector.get_all.return_value = { | ||
"state_transition_in_progress": "True", | ||
"transition_type": "shutdown", | ||
"transition_start_time": "2024-01-01T00:00:00", | ||
} | ||
|
||
with patch.object(module, "get_name", return_value=dpu_name), \ | ||
patch.object(module, "_load_transition_timeouts", return_value={"shutdown": 5}), \ | ||
patch.object(module, "set_module_state_transition", return_value=True), \ | ||
patch.object(module, "is_module_state_transition_timed_out", return_value=True): | ||
result = module.graceful_shutdown_handler() | ||
assert result is False | ||
|
||
@staticmethod | ||
@patch("sonic_platform_base.module_base.time", create=True) | ||
def test_graceful_shutdown_handler_offline_clear(mock_time): | ||
mock_time.time.return_value = 123456789 | ||
mock_time.sleep.return_value = None | ||
|
||
module = DummyModule(name="DPUX") | ||
module._state_db_connector.get_all.return_value = { | ||
"state_transition_in_progress": "True", | ||
"transition_type": "shutdown", | ||
"transition_start_time": "2024-01-01T00:00:00", | ||
} | ||
|
Copilot
AI
Oct 21, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The class TestStateDbConnectorSwsscommonOnly
is defined three times in this file (lines 568, 1094, 1623), with identical test methods duplicated across all three definitions. This is a clear violation of the DRY (Don't Repeat Yourself) principle and should be consolidated into a single class definition.
class TestStateDbConnectorSwsscommonOnly: | |
@patch('swsscommon.swsscommon.SonicV2Connector') | |
def test_initialize_state_db_connector_success(self, mock_connector): | |
from sonic_platform_base.module_base import ModuleBase | |
mock_db = MagicMock() | |
mock_connector.return_value = mock_db | |
module = ModuleBase() | |
assert module._state_db_connector == mock_db | |
mock_db.connect.assert_called_once_with(mock_db.STATE_DB) | |
@patch('swsscommon.swsscommon.SonicV2Connector') | |
def test_initialize_state_db_connector_exception(self, mock_connector): | |
from sonic_platform_base.module_base import ModuleBase | |
mock_db = MagicMock() | |
mock_db.connect.side_effect = RuntimeError("Connection failed") | |
mock_connector.return_value = mock_db | |
with patch('sys.stderr', new_callable=StringIO) as mock_stderr: | |
module = ModuleBase() | |
assert module._state_db_connector is None | |
assert "Failed to connect to STATE_DB" in mock_stderr.getvalue() | |
def test_state_db_connector_uses_swsscommon_only(self): | |
import importlib | |
import sys | |
from types import ModuleType | |
from unittest.mock import patch | |
# Fake swsscommon package + swsscommon.swsscommon module | |
pkg = ModuleType("swsscommon") | |
pkg.__path__ = [] # mark as package | |
sub = ModuleType("swsscommon.swsscommon") | |
class FakeV2: | |
def connect(self, *_): | |
pass | |
sub.SonicV2Connector = FakeV2 | |
with patch.dict(sys.modules, { | |
"swsscommon": pkg, | |
"swsscommon.swsscommon": sub | |
}, clear=False): | |
mb = importlib.import_module("sonic_platform_base.module_base") | |
importlib.reload(mb) | |
# Since __init__ calls it, we need to patch before creating an instance | |
with patch.object(mb.ModuleBase, '_initialize_state_db_connector') as mock_init_db: | |
mock_init_db.return_value = FakeV2() | |
instance = mb.ModuleBase() | |
assert isinstance(instance._state_db_connector, FakeV2) | |
# ==== graceful shutdown tests (match timeouts + centralized helpers) ==== | |
@patch("sonic_platform_base.module_base.time", create=True) | |
def test_graceful_shutdown_handler_success(self, mock_time): | |
dpu_name = "DPU0" | |
mock_time.time.return_value = 1710000000 | |
mock_time.sleep.return_value = None | |
module = DummyModule(name=dpu_name) | |
module._state_db_connector.get_all.side_effect = [ | |
{"state_transition_in_progress": "True"}, | |
{"state_transition_in_progress": "False"}, | |
] | |
# Mock the race condition protection to allow the transition to be set | |
with patch.object(module, "get_name", return_value=dpu_name), \ | |
patch.object(module, "_load_transition_timeouts", return_value={"shutdown": 10}), \ | |
patch.object(module, "set_module_state_transition", return_value=True), \ | |
patch.object(module, "is_module_state_transition_timed_out", return_value=False): | |
result = module.graceful_shutdown_handler() | |
assert result is True | |
@patch("sonic_platform_base.module_base.time", create=True) | |
def test_graceful_shutdown_handler_timeout(self, mock_time): | |
dpu_name = "DPU1" | |
mock_time.time.return_value = 1710000000 | |
mock_time.sleep.return_value = None | |
module = DummyModule(name=dpu_name) | |
# Keep it perpetually "in progress" so the handler’s wait path runs | |
module._state_db_connector.get_all.return_value = { | |
"state_transition_in_progress": "True", | |
"transition_type": "shutdown", | |
"transition_start_time": "2024-01-01T00:00:00", | |
} | |
with patch.object(module, "get_name", return_value=dpu_name), \ | |
patch.object(module, "_load_transition_timeouts", return_value={"shutdown": 5}), \ | |
patch.object(module, "set_module_state_transition", return_value=True), \ | |
patch.object(module, "is_module_state_transition_timed_out", return_value=True): | |
result = module.graceful_shutdown_handler() | |
assert result is False | |
@staticmethod | |
@patch("sonic_platform_base.module_base.time", create=True) | |
def test_graceful_shutdown_handler_offline_clear(mock_time): | |
mock_time.time.return_value = 123456789 | |
mock_time.sleep.return_value = None | |
module = DummyModule(name="DPUX") | |
module._state_db_connector.get_all.return_value = { | |
"state_transition_in_progress": "True", | |
"transition_type": "shutdown", | |
"transition_start_time": "2024-01-01T00:00:00", | |
} | |
# [REMOVED DUPLICATE] class TestStateDbConnectorSwsscommonOnly | |
"transition_start_time": "2024-01-01T00:00:00", | |
} |
Copilot uses AI. Check for mistakes.
tests/module_base_test.py
Outdated
from sonic_platform_base.module_base import ( | ||
set_module_state_transition, | ||
clear_module_state_transition, | ||
is_module_state_transition_timed_out | ||
) | ||
assert callable(set_module_state_transition) | ||
assert callable(clear_module_state_transition) | ||
assert callable(is_module_state_transition_timed_out) |
Copilot
AI
Oct 21, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test attempts to import set_module_state_transition
, clear_module_state_transition
, and is_module_state_transition_timed_out
as module-level functions, but according to the implementation in module_base.py, these are instance methods of ModuleBase
, not top-level exports. This import will fail at runtime unless these functions are explicitly exported at the module level.
from sonic_platform_base.module_base import ( | |
set_module_state_transition, | |
clear_module_state_transition, | |
is_module_state_transition_timed_out | |
) | |
assert callable(set_module_state_transition) | |
assert callable(clear_module_state_transition) | |
assert callable(is_module_state_transition_timed_out) | |
module = DummyModule() | |
assert callable(getattr(module, "set_module_state_transition", None)) | |
assert callable(getattr(module, "clear_module_state_transition", None)) | |
assert callable(getattr(module, "is_module_state_transition_timed_out", None)) |
Copilot uses AI. Check for mistakes.
|
||
def _initialize_state_db_connector(self): | ||
"""Initialize a STATE_DB connector using swsscommon only.""" | ||
from swsscommon.swsscommon import SonicV2Connector # type: ignore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you please move this to beginning of the file to be clear?
if up: | ||
# Admin UP: Clear any transition state and proceed with admin state change | ||
module_name = self.get_name() | ||
admin_state_success = self.set_admin_state(True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we set the transition_in_progress before calling admin state platform API?
Returns: | ||
bool: True if transition was successfully set, False if already in progress | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before proceeding with set operation, could you add these sanity checks to make sure correct fields are passed??
allowed = {"shutdown", "startup", "reboot"}
ttype = (transition_type or "").strip().lower()
if ttype not in allowed:
sys.stderr.write(f"Invalid transition_type='{transition_type}' for module {module_name}")
return False
module = module_name.strip().upper()
key = f"CHASSIS_MODULE_TABLE|{module}"
sys.stderr.write(f"Failed to clear timed-out transition for module {module_name} before setting new one.\n") | ||
return False | ||
# Set new transition atomically | ||
db.set(db.STATE_DB, key, { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the STATE_DB keys are hashes, should we use hmset/hset instead of set? could you please check if set is working for all types of builds?
}) | ||
# Remove the start timestamp (avoid stale value lingering) | ||
if hasattr(db, 'delete'): | ||
db.delete(db.STATE_DB, key, "transition_start_time") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we use hset first? Additional if no attribute to delete, then we need to set fields to blank to be conservative?
Co-authored-by: Copilot <[email protected]>
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
Fixing test failures
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
Provide support for SmartSwitch DPU module graceful shutdown.
Description
Single source of truth for transitions
All components now use
sonic_platform_base.module_base.ModuleBase
helpers:set_module_state_transition(db, name, transition_type)
clear_module_state_transition(db, name)
get_module_state_transition(db, name) -> dict
is_module_state_transition_timed_out(db, name, timeout_secs) -> bool
Eliminates duplicated logic and race-prone direct Redis writes.
Correct table everywhere
CHASSIS_MODULE_TABLE
(replacesCHASSIS_MODULE_INFO_TABLE
).Ownership & lifecycle
The initiator of an operation (
startup
/shutdown
/reboot
) sets:state_transition_in_progress=True
transition_type=<op>
transition_start_time=<utc-iso8601>
The platform (
set_admin_state()
) is responsible for clearing:state_transition_in_progress=False
transition_end_time=<epoch>
(or similar end stamp).CLI pre-clears only when a prior transition is timed out.
Timeouts & policy
Platform JSON path only:
/usr/share/sonic/device/{plat}/platform.json
; else constants.Typical production values used:
startup: 180s
,shutdown: 180s
(≈graceful_wait 60s + power 120s
),reboot: 120s
.Graceful wait (e.g., waiting for “Graceful shutdown complete”) is a platform policy and implemented inside platform
set_admin_state()
—not in ModuleBase.Boot behavior
chassisd
on start:set_initial_dpu_admin_state()
which marks transitions via ModuleBase before calling platformset_admin_state()
.gNOI shutdown daemon
Listens on
CHASSIS_MODULE_TABLE
and triggers only when:state_transition_in_progress=True
andtransition_type=shutdown
.Never clears the flag (ownership stays with the platform).
Bounded RPC timeouts and robust Redis access (swsssdk/swsscommon).
CLI (
config chassis modules …
)is_module_state_transition_timed_out()
→ auto-clear then proceed.startup
/shutdown
; platform clears on completion.Redis robustness
hset(mapping=...)
usage.Race reduction & consistency
transition_start_time
; clears may add an end stamp.Change scope
HLD: # 1991 sonic-net/SONiC#1991
sonic-host-services: #255 sonic-net/sonic-host-services#255
sonic-platform-daemons: sonic-net/sonic-platform-daemons#667
sonic-utilities: sonic-net/sonic-utilities#4031
How Has This Been Tested?
Issue the "config chassis modules shutdown DPUx" command
Verify the DPU module is gracefully shut by checking the logs in /var/log/syslog on both NPU and DPU