Skip to content

Comments

RDKB-63377 : Add patch for SH Logic for reap hung child #128

Open
yogeswaransky wants to merge 1 commit intosupport/2025q4from
topic/RDKB-63377_1.3.0
Open

RDKB-63377 : Add patch for SH Logic for reap hung child #128
yogeswaransky wants to merge 1 commit intosupport/2025q4from
topic/RDKB-63377_1.3.0

Conversation

@yogeswaransky
Copy link

No description provided.

Signed-off-by: Yogeswaran K <yogeswaransky@gmail.com>
@yogeswaransky yogeswaransky requested a review from a team as a code owner February 6, 2026 18:53
Copilot AI review requested due to automatic review settings February 6, 2026 18:53
@yogeswaransky yogeswaransky requested a review from a team as a code owner February 6, 2026 18:53
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a self-heal workaround to reap extra telemetry2_0 forked child processes (intended to address hung/long-running children) and expands the existing “hung at rbus queries” detection criteria.

Changes:

  • Introduces detect_and_kill_locked_pids() to identify a “parent” telemetry2_0 PID and SIGTERM/SIGKILL its long-running children.
  • Invokes the new reaping logic from self_heal_t2().
  • Extends the rbus-hung detection to also match "Caching the event to File" in the broker health output.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +317 to +318
# Purpose of this selfheal is to kill t2 telemetry2_0 childs if it is :
# 1] running for more than 120 sec
Copy link

Copilot AI Feb 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment says "childs"; use "children" for correct grammar (also in the following line where the purpose is described).

Suggested change
# Purpose of this selfheal is to kill t2 telemetry2_0 childs if it is :
# 1] running for more than 120 sec
# Purpose of this selfheal is to kill t2 telemetry2_0 children if they are:
# 1] children running for more than 120 sec

Copilot uses AI. Check for mistakes.
Comment on lines +342 to +348
local hz=1000
if [ -r /proc/config.gz ]; then
local detected_hz=$(zcat /proc/config.gz 2>/dev/null | grep "^CONFIG_HZ=" | cut -d= -f2)
if [ -n "$detected_hz" ]; then
hz=$detected_hz
fi
fi
Copy link

Copilot AI Feb 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The clock-tick conversion for /proc//stat starttime looks incorrect: hz defaults to 1000 and is optionally read from CONFIG_HZ, but starttime is in units of sysconf(_SC_CLK_TCK) (often 100) and may not match CONFIG_HZ. With the current fallback, elapsed time can be miscomputed and children may be killed too early. Consider deriving ticks-per-second via getconf CLK_TCK (or equivalent) and using a safe default that matches the platform, rather than parsing /proc/config.gz.

Suggested change
local hz=1000
if [ -r /proc/config.gz ]; then
local detected_hz=$(zcat /proc/config.gz 2>/dev/null | grep "^CONFIG_HZ=" | cut -d= -f2)
if [ -n "$detected_hz" ]; then
hz=$detected_hz
fi
fi
local hz
if command -v getconf >/dev/null 2>&1; then
hz=$(getconf CLK_TCK 2>/dev/null)
fi
# Fallback to a safe default if getconf is unavailable or returns an invalid value
if ! echo "$hz" | grep -Eq '^[0-9]+$'; then
hz=100
fi

Copilot uses AI. Check for mistakes.
Comment on lines +320 to +322
detect_and_kill_locked_pids() {
local name="$1" THRESH="${2:-120}"

Copy link

Copilot AI Feb 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

detect_and_kill_locked_pids kills processes purely based on age, but the name implies the PIDs are "locked"/hung. Consider renaming to reflect the actual criteria (e.g., age-based child reap) to avoid misleading future maintainers.

Copilot uses AI. Check for mistakes.
Comment on lines +432 to +439
kill -15 "$pid" 2>/dev/null
sleep 5

if [ -d "/proc/$pid" ]; then
echo_t " Forcing SIGKILL $pid"
kill -9 "$pid" 2>/dev/null
sleep 1
fi
Copy link

Copilot AI Feb 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The per-PID sleep 5/sleep 1 inside the loop can block the health monitor for a long time if multiple child PIDs exist (e.g., N children => ~6N seconds). Consider a bounded overall timeout, shorter waits, or reaping in a way that doesn't delay the rest of self_heal_t2() for extended periods.

Copilot uses AI. Check for mistakes.
Comment on lines +379 to +386
if [ -z "$parent" ]; then
if [ -d "/proc/$pid/task" ]; then
local task_count=$(ls "/proc/$pid/task" 2>/dev/null | wc -l)
if [ "$task_count" -gt 1 ]; then
parent="$pid"
fi
fi
fi
Copy link

Copilot AI Feb 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "multi-threaded" parent heuristic runs ls | wc -l for each PID, which is relatively expensive and can be avoided by reading the Threads: field from /proc/<pid>/status (or similar) without spawning multiple processes. This will reduce overhead in the periodic self-heal loop.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant