Skip to content

Conversation

rameshraghupathy
Copy link
Contributor

@rameshraghupathy rameshraghupathy commented May 14, 2025

Provide support for SmartSwitch DPU module graceful shutdown.

Description

  • Single source of truth for transitions

    • All components now use sonic_platform_base.module_base.ModuleBase helpers:

      • set_module_state_transition(db, name, transition_type)
      • clear_module_state_transition(db, name)
      • get_module_state_transition(db, name) -> dict
      • is_module_state_transition_timed_out(db, name, timeout_secs) -> bool
    • Eliminates duplicated logic and race-prone direct Redis writes.

  • Correct table everywhere

    • Standardized on CHASSIS_MODULE_TABLE (replaces CHASSIS_MODULE_INFO_TABLE).
    • HLD mismatch addressed in code (HLD fix tracked separately).
  • Ownership & lifecycle

    • The initiator of an operation (startup/shutdown/reboot) sets:

      • state_transition_in_progress=True
      • transition_type=<op>
      • transition_start_time=<utc-iso8601>
    • The platform (set_admin_state()) is responsible for clearing:

      • state_transition_in_progress=False
      • optionally transition_end_time=<epoch> (or similar end stamp).
    • CLI pre-clears only when a prior transition is timed out.

  • Timeouts & policy

    • Platform JSON path only: /usr/share/sonic/device/{plat}/platform.json; else constants.

    • Typical production values used:

      • startup: 180s, shutdown: 180s (≈ graceful_wait 60s + power 120s), reboot: 120s.
    • Graceful wait (e.g., waiting for “Graceful shutdown complete”) is a platform policy and implemented inside platform set_admin_state()—not in ModuleBase.

  • Boot behavior

    • chassisd on start:

      1. Clears stale flags once (centralized sweep).
      2. Runs set_initial_dpu_admin_state() which marks transitions via ModuleBase before calling platform set_admin_state().
      3. Leaves clearing to the platform or to well-defined status transitions (ONLINE/OFFLINE) where appropriate.
  • gNOI shutdown daemon

    • Listens on CHASSIS_MODULE_TABLE and triggers only when:

      • state_transition_in_progress=True and transition_type=shutdown.
    • Never clears the flag (ownership stays with the platform).

    • Bounded RPC timeouts and robust Redis access (swsssdk/swsscommon).

  • CLI (config chassis modules …)

    • Uses ModuleBase APIs for all set/get/timeout checks.
    • If a previous transition is stuck, is_module_state_transition_timed_out() → auto-clear then proceed.
    • Sets transition at the start of startup/shutdown; platform clears on completion.
    • Fabric card flow retained; edits are surgical.
  • Redis robustness

    • Helpers handle both stacks (swsssdk/swsscommon); no hset(mapping=...) usage.
    • Consistent HGETALL/HSET paths; resilient to connector differences.
  • Race reduction & consistency

    • Centralized writes prevent multi-writer races.
    • All transition writes include transition_start_time; clears may add an end stamp.
    • Existing PCI/file-lock logic left intact; unrelated behavior unchanged.
  • Change scope

    • Minimal, targeted diffs.
    • No background tasks added, no broad refactors beyond transition handling.
    • Behavior changes are limited to making transition semantics correct and uniform across repos.

HLD: # 1991 sonic-net/SONiC#1991
sonic-host-services: #255 sonic-net/sonic-host-services#255
sonic-platform-daemons: sonic-net/sonic-platform-daemons#667
sonic-utilities: sonic-net/sonic-utilities#4031

How Has This Been Tested?

Issue the "config chassis modules shutdown DPUx" command
Verify the DPU module is gracefully shut by checking the logs in /var/log/syslog on both NPU and DPU

@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

# gnoi reboot pipe related
GNOI_REBOOT_PIPE_PATH = "/host/gnoi_reboot.pipe"
GNOI_REBOOT_RESPONSE_PIPE_PATH = "/host/gnoi_reboot_response.pipe"
GNOI_PORT = 50052
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not required anymore. Cleaned it.

GNOI_REBOOT_PIPE_PATH = "/host/gnoi_reboot.pipe"
GNOI_REBOOT_RESPONSE_PIPE_PATH = "/host/gnoi_reboot_response.pipe"
GNOI_PORT = 50052
GNOI_RESPONSE_TIMEOUT = 60 # seconds
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

This method performs the following steps:
1. Sends a JSON-formatted reboot request to the gNOI reboot daemon via a named pipe.
2. Waits for a response on a designated response pipe, with a timeout of 60 seconds.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update the comment accordingly to platform.json timeout

"""
raise NotImplementedError

def pre_shutdown_hook(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is this invoked?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

refactored and not valid anymore

@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines will not run the associated pipelines, because the pull request was updated after the run command was issued. Review the pull request again and issue a new run command.

@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines will not run the associated pipelines, because the pull request was updated after the run command was issued. Review the pull request again and issue a new run command.

subtype = device_info.get_device_subtype()
if subtype == "SmartSwitch" and not is_dpu():
self.graceful_shutdown_handler()
# Proceed to set the admin state using the platform-specific implementation
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this supposed to work? super here will call set_admin_state of the base class, not of the derived one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the refactored implementation the platform will graceful_shutdown_handler()

@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@vvolam vvolam requested a review from Copilot October 21, 2025 18:08
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.


Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.


@staticmethod
def test_transition_timeouts_platform_missing():
"""If platfrom is missing, defaults are used."""
Copy link

Copilot AI Oct 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Corrected spelling of 'platfrom' to 'platform'.

Suggested change
"""If platfrom is missing, defaults are used."""
"""If platform is missing, defaults are used."""

Copilot uses AI. Check for mistakes.

assert hasattr(MB, 'is_module_state_transition_timed_out') and callable(getattr(MB, 'is_module_state_transition_timed_out'))


class TestModuleBasePCIAndSensors:
Copy link

Copilot AI Oct 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The class TestModuleBasePCIAndSensors is defined three times in this file (lines 353, 880, 1408), causing duplicate test definitions. This creates confusion and potential test execution issues. Consider consolidating all PCI and sensor tests into a single class definition or renaming them to reflect their specific test scenarios.

Suggested change
class TestModuleBasePCIAndSensors:
class TestModuleBasePCIAndSensorsStateDB:

Copilot uses AI. Check for mistakes.

Comment on lines 568 to 675
class TestStateDbConnectorSwsscommonOnly:
@patch('swsscommon.swsscommon.SonicV2Connector')
def test_initialize_state_db_connector_success(self, mock_connector):
from sonic_platform_base.module_base import ModuleBase
mock_db = MagicMock()
mock_connector.return_value = mock_db
module = ModuleBase()
assert module._state_db_connector == mock_db
mock_db.connect.assert_called_once_with(mock_db.STATE_DB)

@patch('swsscommon.swsscommon.SonicV2Connector')
def test_initialize_state_db_connector_exception(self, mock_connector):
from sonic_platform_base.module_base import ModuleBase
mock_db = MagicMock()
mock_db.connect.side_effect = RuntimeError("Connection failed")
mock_connector.return_value = mock_db

with patch('sys.stderr', new_callable=StringIO) as mock_stderr:
module = ModuleBase()
assert module._state_db_connector is None
assert "Failed to connect to STATE_DB" in mock_stderr.getvalue()

def test_state_db_connector_uses_swsscommon_only(self):
import importlib
import sys
from types import ModuleType
from unittest.mock import patch

# Fake swsscommon package + swsscommon.swsscommon module
pkg = ModuleType("swsscommon")
pkg.__path__ = [] # mark as package
sub = ModuleType("swsscommon.swsscommon")

class FakeV2:
def connect(self, *_):
pass

sub.SonicV2Connector = FakeV2

with patch.dict(sys.modules, {
"swsscommon": pkg,
"swsscommon.swsscommon": sub
}, clear=False):
mb = importlib.import_module("sonic_platform_base.module_base")
importlib.reload(mb)
# Since __init__ calls it, we need to patch before creating an instance
with patch.object(mb.ModuleBase, '_initialize_state_db_connector') as mock_init_db:
mock_init_db.return_value = FakeV2()
instance = mb.ModuleBase()
assert isinstance(instance._state_db_connector, FakeV2)


# ==== graceful shutdown tests (match timeouts + centralized helpers) ====

@patch("sonic_platform_base.module_base.time", create=True)
def test_graceful_shutdown_handler_success(self, mock_time):
dpu_name = "DPU0"
mock_time.time.return_value = 1710000000
mock_time.sleep.return_value = None

module = DummyModule(name=dpu_name)
module._state_db_connector.get_all.side_effect = [
{"state_transition_in_progress": "True"},
{"state_transition_in_progress": "False"},
]

# Mock the race condition protection to allow the transition to be set
with patch.object(module, "get_name", return_value=dpu_name), \
patch.object(module, "_load_transition_timeouts", return_value={"shutdown": 10}), \
patch.object(module, "set_module_state_transition", return_value=True), \
patch.object(module, "is_module_state_transition_timed_out", return_value=False):
result = module.graceful_shutdown_handler()
assert result is True

@patch("sonic_platform_base.module_base.time", create=True)
def test_graceful_shutdown_handler_timeout(self, mock_time):
dpu_name = "DPU1"
mock_time.time.return_value = 1710000000
mock_time.sleep.return_value = None

module = DummyModule(name=dpu_name)
# Keep it perpetually "in progress" so the handler’s wait path runs
module._state_db_connector.get_all.return_value = {
"state_transition_in_progress": "True",
"transition_type": "shutdown",
"transition_start_time": "2024-01-01T00:00:00",
}

with patch.object(module, "get_name", return_value=dpu_name), \
patch.object(module, "_load_transition_timeouts", return_value={"shutdown": 5}), \
patch.object(module, "set_module_state_transition", return_value=True), \
patch.object(module, "is_module_state_transition_timed_out", return_value=True):
result = module.graceful_shutdown_handler()
assert result is False

@staticmethod
@patch("sonic_platform_base.module_base.time", create=True)
def test_graceful_shutdown_handler_offline_clear(mock_time):
mock_time.time.return_value = 123456789
mock_time.sleep.return_value = None

module = DummyModule(name="DPUX")
module._state_db_connector.get_all.return_value = {
"state_transition_in_progress": "True",
"transition_type": "shutdown",
"transition_start_time": "2024-01-01T00:00:00",
}

Copy link

Copilot AI Oct 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The class TestStateDbConnectorSwsscommonOnly is defined three times in this file (lines 568, 1094, 1623), with identical test methods duplicated across all three definitions. This is a clear violation of the DRY (Don't Repeat Yourself) principle and should be consolidated into a single class definition.

Suggested change
class TestStateDbConnectorSwsscommonOnly:
@patch('swsscommon.swsscommon.SonicV2Connector')
def test_initialize_state_db_connector_success(self, mock_connector):
from sonic_platform_base.module_base import ModuleBase
mock_db = MagicMock()
mock_connector.return_value = mock_db
module = ModuleBase()
assert module._state_db_connector == mock_db
mock_db.connect.assert_called_once_with(mock_db.STATE_DB)
@patch('swsscommon.swsscommon.SonicV2Connector')
def test_initialize_state_db_connector_exception(self, mock_connector):
from sonic_platform_base.module_base import ModuleBase
mock_db = MagicMock()
mock_db.connect.side_effect = RuntimeError("Connection failed")
mock_connector.return_value = mock_db
with patch('sys.stderr', new_callable=StringIO) as mock_stderr:
module = ModuleBase()
assert module._state_db_connector is None
assert "Failed to connect to STATE_DB" in mock_stderr.getvalue()
def test_state_db_connector_uses_swsscommon_only(self):
import importlib
import sys
from types import ModuleType
from unittest.mock import patch
# Fake swsscommon package + swsscommon.swsscommon module
pkg = ModuleType("swsscommon")
pkg.__path__ = [] # mark as package
sub = ModuleType("swsscommon.swsscommon")
class FakeV2:
def connect(self, *_):
pass
sub.SonicV2Connector = FakeV2
with patch.dict(sys.modules, {
"swsscommon": pkg,
"swsscommon.swsscommon": sub
}, clear=False):
mb = importlib.import_module("sonic_platform_base.module_base")
importlib.reload(mb)
# Since __init__ calls it, we need to patch before creating an instance
with patch.object(mb.ModuleBase, '_initialize_state_db_connector') as mock_init_db:
mock_init_db.return_value = FakeV2()
instance = mb.ModuleBase()
assert isinstance(instance._state_db_connector, FakeV2)
# ==== graceful shutdown tests (match timeouts + centralized helpers) ====
@patch("sonic_platform_base.module_base.time", create=True)
def test_graceful_shutdown_handler_success(self, mock_time):
dpu_name = "DPU0"
mock_time.time.return_value = 1710000000
mock_time.sleep.return_value = None
module = DummyModule(name=dpu_name)
module._state_db_connector.get_all.side_effect = [
{"state_transition_in_progress": "True"},
{"state_transition_in_progress": "False"},
]
# Mock the race condition protection to allow the transition to be set
with patch.object(module, "get_name", return_value=dpu_name), \
patch.object(module, "_load_transition_timeouts", return_value={"shutdown": 10}), \
patch.object(module, "set_module_state_transition", return_value=True), \
patch.object(module, "is_module_state_transition_timed_out", return_value=False):
result = module.graceful_shutdown_handler()
assert result is True
@patch("sonic_platform_base.module_base.time", create=True)
def test_graceful_shutdown_handler_timeout(self, mock_time):
dpu_name = "DPU1"
mock_time.time.return_value = 1710000000
mock_time.sleep.return_value = None
module = DummyModule(name=dpu_name)
# Keep it perpetually "in progress" so the handler’s wait path runs
module._state_db_connector.get_all.return_value = {
"state_transition_in_progress": "True",
"transition_type": "shutdown",
"transition_start_time": "2024-01-01T00:00:00",
}
with patch.object(module, "get_name", return_value=dpu_name), \
patch.object(module, "_load_transition_timeouts", return_value={"shutdown": 5}), \
patch.object(module, "set_module_state_transition", return_value=True), \
patch.object(module, "is_module_state_transition_timed_out", return_value=True):
result = module.graceful_shutdown_handler()
assert result is False
@staticmethod
@patch("sonic_platform_base.module_base.time", create=True)
def test_graceful_shutdown_handler_offline_clear(mock_time):
mock_time.time.return_value = 123456789
mock_time.sleep.return_value = None
module = DummyModule(name="DPUX")
module._state_db_connector.get_all.return_value = {
"state_transition_in_progress": "True",
"transition_type": "shutdown",
"transition_start_time": "2024-01-01T00:00:00",
}
# [REMOVED DUPLICATE] class TestStateDbConnectorSwsscommonOnly
"transition_start_time": "2024-01-01T00:00:00",
}

Copilot uses AI. Check for mistakes.

Comment on lines 1398 to 1405
from sonic_platform_base.module_base import (
set_module_state_transition,
clear_module_state_transition,
is_module_state_transition_timed_out
)
assert callable(set_module_state_transition)
assert callable(clear_module_state_transition)
assert callable(is_module_state_transition_timed_out)
Copy link

Copilot AI Oct 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test attempts to import set_module_state_transition, clear_module_state_transition, and is_module_state_transition_timed_out as module-level functions, but according to the implementation in module_base.py, these are instance methods of ModuleBase, not top-level exports. This import will fail at runtime unless these functions are explicitly exported at the module level.

Suggested change
from sonic_platform_base.module_base import (
set_module_state_transition,
clear_module_state_transition,
is_module_state_transition_timed_out
)
assert callable(set_module_state_transition)
assert callable(clear_module_state_transition)
assert callable(is_module_state_transition_timed_out)
module = DummyModule()
assert callable(getattr(module, "set_module_state_transition", None))
assert callable(getattr(module, "clear_module_state_transition", None))
assert callable(getattr(module, "is_module_state_transition_timed_out", None))

Copilot uses AI. Check for mistakes.


def _initialize_state_db_connector(self):
"""Initialize a STATE_DB connector using swsscommon only."""
from swsscommon.swsscommon import SonicV2Connector # type: ignore
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you please move this to beginning of the file to be clear?

if up:
# Admin UP: Clear any transition state and proceed with admin state change
module_name = self.get_name()
admin_state_success = self.set_admin_state(True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we set the transition_in_progress before calling admin state platform API?

Returns:
bool: True if transition was successfully set, False if already in progress
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before proceeding with set operation, could you add these sanity checks to make sure correct fields are passed??

    allowed = {"shutdown", "startup", "reboot"}
    ttype = (transition_type or "").strip().lower()
    if ttype not in allowed:
        sys.stderr.write(f"Invalid transition_type='{transition_type}' for module {module_name}")
        return False
        
    module = module_name.strip().upper()
    key = f"CHASSIS_MODULE_TABLE|{module}" 

sys.stderr.write(f"Failed to clear timed-out transition for module {module_name} before setting new one.\n")
return False
# Set new transition atomically
db.set(db.STATE_DB, key, {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the STATE_DB keys are hashes, should we use hmset/hset instead of set? could you please check if set is working for all types of builds?

})
# Remove the start timestamp (avoid stale value lingering)
if hasattr(db, 'delete'):
db.delete(db.STATE_DB, key, "transition_start_time")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we use hset first? Additional if no attribute to delete, then we need to set fields to blank to be conservative?

@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Fixing test failures
@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants