RDKB-63377 : Add patch for SH Logic for reap hung child #128
yogeswaransky wants to merge 1 commit into support/2025q4
Signed-off-by: Yogeswaran K <yogeswaransky@gmail.com>
Pull request overview
Adds a self-heal workaround to reap extra telemetry2_0 forked child processes (intended to address hung/long-running children) and expands the existing “hung at rbus queries” detection criteria.
Changes:
- Introduces detect_and_kill_locked_pids() to identify a “parent” telemetry2_0 PID and SIGTERM/SIGKILL its long-running children.
- Invokes the new reaping logic from self_heal_t2().
- Extends the rbus-hung detection to also match "Caching the event to File" in the broker health output.
| # Purpose of this selfheal is to kill t2 telemetry2_0 childs if it is : | ||
| # 1] running for more than 120 sec |
Comment says "childs"; use "children" for correct grammar (also in the following line where the purpose is described).
| # Purpose of this selfheal is to kill t2 telemetry2_0 childs if it is : | |
| # 1] running for more than 120 sec | |
| # Purpose of this selfheal is to kill t2 telemetry2_0 children if they are: | |
| # 1] children running for more than 120 sec |
| local hz=1000 | ||
| if [ -r /proc/config.gz ]; then | ||
| local detected_hz=$(zcat /proc/config.gz 2>/dev/null | grep "^CONFIG_HZ=" | cut -d= -f2) | ||
| if [ -n "$detected_hz" ]; then | ||
| hz=$detected_hz | ||
| fi | ||
| fi |
The clock-tick conversion for the /proc/<pid>/stat starttime field looks incorrect. hz defaults to 1000 and is optionally read from CONFIG_HZ, but starttime is expressed in units of sysconf(_SC_CLK_TCK) (often 100), which may not match CONFIG_HZ. With the current fallback, elapsed time can be miscomputed and children may be killed too early. Consider deriving ticks per second via getconf CLK_TCK (or equivalent) and using a safe default that matches the platform, rather than parsing /proc/config.gz.
| local hz=1000 | |
| if [ -r /proc/config.gz ]; then | |
| local detected_hz=$(zcat /proc/config.gz 2>/dev/null | grep "^CONFIG_HZ=" | cut -d= -f2) | |
| if [ -n "$detected_hz" ]; then | |
| hz=$detected_hz | |
| fi | |
| fi | |
| local hz | |
| if command -v getconf >/dev/null 2>&1; then | |
| hz=$(getconf CLK_TCK 2>/dev/null) | |
| fi | |
| # Fallback to a safe default if getconf is unavailable or returns an invalid value | |
| if ! echo "$hz" | grep -Eq '^[0-9]+$'; then | |
| hz=100 | |
| fi |
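To make the unit question concrete, here is a minimal sketch of how a process age could be derived once ticks per second is known. It is not the patch's code: stat_starttime and age_from_ticks are hypothetical helper names, and the sketch assumes awk is available on the target.

```shell
# Sketch only: stat_starttime and age_from_ticks are hypothetical helper names.

# Print the starttime value (field 22) from one line of /proc/<pid>/stat.
# The comm field (field 2) may contain spaces, so drop everything up to the
# last ") " first; starttime is then the 20th remaining word.
stat_starttime() {
    printf '%s\n' "${1##*) }" | awk '{ print $20 }'
}

# Print a process age in whole seconds.
#   $1 = system uptime in whole seconds
#   $2 = starttime in clock ticks
#   $3 = ticks per second (falls back to 100 if empty, non-numeric, or zero)
age_from_ticks() {
    local uptime="$1" start="$2" hz="$3"
    case "$hz" in
        ''|*[!0-9]*|0) hz=100 ;;
    esac
    echo $(( uptime - start / hz ))
}

# Possible use on a live system (pid is a placeholder variable):
#   read -r up _ < /proc/uptime
#   age_from_ticks "${up%%.*}" "$(stat_starttime "$(cat "/proc/$pid/stat")")" \
#                  "$(getconf CLK_TCK 2>/dev/null)"
```

With a tick rate of 100, a starttime of 98765 ticks corresponds to 987 seconds after boot; dividing the same value by 1000 would place it at 98 seconds and overstate the age by roughly 15 minutes.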
| detect_and_kill_locked_pids() { | ||
| local name="$1" THRESH="${2:-120}" | ||
|
|
detect_and_kill_locked_pids kills processes purely based on age, but the name implies the PIDs are "locked"/hung. Consider renaming to reflect the actual criteria (e.g., age-based child reap) to avoid misleading future maintainers.
| kill -15 "$pid" 2>/dev/null | ||
| sleep 5 | ||
|
|
||
| if [ -d "/proc/$pid" ]; then | ||
| echo_t " Forcing SIGKILL $pid" | ||
| kill -9 "$pid" 2>/dev/null | ||
| sleep 1 | ||
| fi |
The per-PID sleep 5/sleep 1 inside the loop can block the health monitor for a long time if multiple child PIDs exist (e.g., N children => ~6N seconds). Consider a bounded overall timeout, shorter waits, or reaping in a way that doesn't delay the rest of self_heal_t2() for extended periods.
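One way to bound the total wait is to signal every child first and then share a single grace period across all of them. The sketch below illustrates that idea only; reap_pids_bounded is a hypothetical name, and kill -0 is used here as the liveness probe instead of the /proc/$pid check in the patch.

```shell
# Sketch only: reap_pids_bounded is a hypothetical helper name.
# Usage: reap_pids_bounded <grace_seconds> <pid>...
# Sends SIGTERM to every PID up front, polls once per second for at most
# <grace_seconds> in total, then sends SIGKILL to whatever is still alive.
reap_pids_bounded() {
    local grace="$1" waited=0 pid alive
    shift

    for pid in "$@"; do
        kill -15 "$pid" 2>/dev/null
    done

    while [ "$waited" -lt "$grace" ]; do
        alive=0
        for pid in "$@"; do
            if kill -0 "$pid" 2>/dev/null; then
                alive=1
            fi
        done
        if [ "$alive" -eq 0 ]; then
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done

    for pid in "$@"; do
        if kill -0 "$pid" 2>/dev/null; then
            kill -9 "$pid" 2>/dev/null
        fi
    done
    return 0
}
```

With this shape the worst case is about the grace period regardless of how many children exist, instead of about 6 seconds per child.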
| if [ -z "$parent" ]; then | ||
| if [ -d "/proc/$pid/task" ]; then | ||
| local task_count=$(ls "/proc/$pid/task" 2>/dev/null | wc -l) | ||
| if [ "$task_count" -gt 1 ]; then | ||
| parent="$pid" | ||
| fi | ||
| fi | ||
| fi |
The "multi-threaded" parent heuristic runs ls | wc -l for each PID, which is relatively expensive and can be avoided by reading the Threads: field from /proc/<pid>/status (or similar) without spawning multiple processes. This will reduce overhead in the periodic self-heal loop.
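As a rough illustration of that suggestion, the thread count can be read with shell builtins only, so no extra processes are spawned per PID. threads_in_status_file is a hypothetical helper name; it takes the path of a status file so it can be pointed at /proc/$pid/status.

```shell
# Sketch only: threads_in_status_file is a hypothetical helper name.
# Print the value of the "Threads:" field from a /proc/<pid>/status style file.
# Returns non-zero if the file is unreadable or the field is missing.
threads_in_status_file() {
    local file="$1" key val
    [ -r "$file" ] || return 1
    while read -r key val; do
        if [ "$key" = "Threads:" ]; then
            echo "$val"
            return 0
        fi
    done < "$file"
    return 1
}

# Possible use inside the PID loop (pid and parent are placeholder variables):
#   n=$(threads_in_status_file "/proc/$pid/status") && [ "$n" -gt 1 ] && parent="$pid"
```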
No description provided.