Skip to content

RDKB-63116: Force kill stale dibbler-client process and prevent zombi…#36

Open
S-Parthiban-Selvaraj wants to merge 1 commit intodevelopfrom
bugfix/dibbler-restart-conflict
Open

RDKB-63116: Force kill stale dibbler-client process and prevent zombi…#36
S-Parthiban-Selvaraj wants to merge 1 commit intodevelopfrom
bugfix/dibbler-restart-conflict

Conversation

@S-Parthiban-Selvaraj
Copy link

…e processes

Problem

When dibbler-client fails to terminate gracefully with SIGTERM within the timeout period, the process remains running. Subsequent attempts to start a new dibbler-client instance detect the stale process but only return its PID without killing it. This causes the new instance to fail to start while the old zombie process persists.

Solution

Implemented two-stage process termination strategy:

  1. Graceful Termination (SIGTERM): First attempt to terminate the process gracefully to allow it to clean up resources. Wait for completion with configurable timeout.

  2. Forceful Termination (SIGKILL): If SIGTERM fails to terminate the process within the timeout, send SIGKILL to forcefully terminate it.

  3. Proper Process Reaping: Use collect_waiting_process() after SIGKILL to properly reap the process using waitpid(). This prevents zombie processes from accumulating in the process table.

Changes Made

dhcpv6_client_utils.c (start_dhcpv6_client)

  • When an existing dibbler-client instance is detected, forcefully kill it with SIGKILL
  • Call collect_waiting_process() to properly reap the killed process
  • Reset pid to 0 and continue to start a fresh instance

dibbler_client_utils.c (stop_dibbler)

  • When SIGTERM timeout expires, send SIGKILL as fallback
  • Replace usleep() + stat() check with collect_waiting_process()
  • Proper zombie process handling ensures clean termination

Scenarios Handled

  • Stale dibbler-client process blocking new instance startup
  • Process termination timeout recovery
  • Zombie process prevention
  • Interface state transitions without stale process interference

…e processes

## Problem
When dibbler-client fails to terminate gracefully with SIGTERM within the timeout
period, the process remains running. Subsequent attempts to start a new dibbler-client
instance detect the stale process but only return its PID without killing it. This
causes the new instance to fail to start while the old zombie process persists.

## Solution
Implemented two-stage process termination strategy:

1. **Graceful Termination (SIGTERM)**: First attempt to terminate the process gracefully
   to allow it to clean up resources. Wait for completion with configurable timeout.

2. **Forceful Termination (SIGKILL)**: If SIGTERM fails to terminate the process within
   the timeout, send SIGKILL to forcefully terminate it.

3. **Proper Process Reaping**: Use collect_waiting_process() after SIGKILL to properly
   reap the process using waitpid(). This prevents zombie processes from accumulating
   in the process table.

## Changes Made

### dhcpv6_client_utils.c (start_dhcpv6_client)
- When an existing dibbler-client instance is detected, forcefully kill it with SIGKILL
- Call collect_waiting_process() to properly reap the killed process
- Reset pid to 0 and continue to start a fresh instance

### dibbler_client_utils.c (stop_dibbler)
- When SIGTERM timeout expires, send SIGKILL as fallback
- Replace usleep() + stat() check with collect_waiting_process()
- Proper zombie process handling ensures clean termination

## Scenarios Handled
- Stale dibbler-client process blocking new instance startup
- Process termination timeout recovery
- Zombie process prevention
- Interface state transitions without stale process interference
@S-Parthiban-Selvaraj S-Parthiban-Selvaraj marked this pull request as ready for review January 20, 2026 16:30
@S-Parthiban-Selvaraj S-Parthiban-Selvaraj requested review from a team as code owners January 20, 2026 16:30
Copilot AI review requested due to automatic review settings January 20, 2026 16:30
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR implements a robust two-stage process termination strategy for dibbler-client to prevent stale and zombie processes. When dibbler-client fails to terminate gracefully with SIGTERM, the code now sends SIGKILL and properly reaps the process to prevent zombies. The PR also forces termination of any existing dibbler-client instance before starting a new one.

Changes:

  • Enhanced process termination with SIGKILL fallback when SIGTERM times out
  • Added proper zombie process reaping using collect_waiting_process() after SIGKILL
  • Modified start_dhcpv6_client to forcefully kill existing instances before starting new ones
  • Updated function documentation to reflect new behavior and return values

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
source/dhcp_client_utils/dibbler_client_utils.c Added SIGKILL fallback and zombie reaping when SIGTERM timeout expires in stop_dibbler()
source/dhcp_client_utils/dhcpv6_client_utils.c Modified start_dhcpv6_client() to forcefully kill existing instances and updated documentation for both start/stop functions

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

DBG_PRINT("%s %d: Fail to open sysevent.\n", __FUNCTION__, __LINE__);
/* CID 189993 Improper use of negative value */
return FAILURE;
return 0;
Copy link

Copilot AI Jan 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The return value change from FAILURE to 0 creates an inconsistency with the function's documented behavior and other failure paths. The function documentation states it returns 'pid (>0)...or 0 on failure', but this change makes it inconsistent with line 119 which still returns FAILURE for the same error condition (existing instance detection). Either both failure cases should return 0, or the documentation needs to clarify that different failure modes have different return codes.

Copilot uses AI. Check for mistakes.
/* Collect the killed process to prevent zombie */
if (collect_waiting_process(pid, 1000) != SUCCESS)
{
DBG_PRINT("%s %d: unable to collect pid %d after SIGKILL\n", __FUNCTION__, __LINE__, pid);
Copy link

Copilot AI Jan 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When collect_waiting_process() fails after SIGKILL, the code continues execution and attempts to start a new dibbler-client instance. This could lead to resource conflicts if the old process wasn't properly cleaned up. Consider returning a failure code or adding additional verification that the process is truly terminated before proceeding.

Suggested change
DBG_PRINT("%s %d: unable to collect pid %d after SIGKILL\n", __FUNCTION__, __LINE__, pid);
DBG_PRINT("%s %d: unable to collect pid %d after SIGKILL\n", __FUNCTION__, __LINE__, pid);
/* Failed to confirm process termination; do not start a new client */
return 0;

Copilot uses AI. Check for mistakes.
Comment on lines +431 to +435
if (collect_waiting_process(pid, 1000) != SUCCESS)
{
DBG_PRINT("%s %d: unable to collect pid %d after SIGKILL\n", __FUNCTION__, __LINE__, pid);
return FAILURE;
}
Copy link

Copilot AI Jan 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The timeout value of 1000 milliseconds for collect_waiting_process() is hardcoded in multiple locations. Consider defining this as a constant (e.g., PROCESS_COLLECT_TIMEOUT_MS) to improve maintainability and ensure consistency across the codebase.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants