132 changes: 131 additions & 1 deletion scripts/task_health_monitor.sh
@@ -313,12 +313,141 @@ self_heal_meshAgent_hung() {
fi
}

# This is a workaround till fork calls are removed from t2
# Purpose of this selfheal is to kill t2 telemetry2_0 childs if it is :
# 1] running for more than 120 sec
Comment on lines +317 to +318
Copilot AI Feb 6, 2026

Comment says "childs"; use "children" for correct grammar (also in the following line where the purpose is described).

Suggested change
- # Purpose of this selfheal is to kill t2 telemetry2_0 childs if it is :
- # 1] running for more than 120 sec
+ # Purpose of this selfheal is to kill t2 telemetry2_0 children if they are:
+ # 1] children running for more than 120 sec

detect_and_kill_locked_pids() {
local name="$1" THRESH="${2:-120}"

Comment on lines +320 to +322
Copilot AI Feb 6, 2026

detect_and_kill_locked_pids kills processes purely based on age, but the name implies the PIDs are "locked"/hung. Consider renaming to reflect the actual criteria (e.g., age-based child reap) to avoid misleading future maintainers.

if [ -z "$name" ]; then
return 2
fi

local pids=$(pidof "$name")
if [ -z "$pids" ]; then
return 0
fi

local pid_count
pid_count=$(set -- $pids; echo $#)
if [ "$pid_count" -le 1 ]; then
return 0
fi

echo_t "[RDKB_SELFHEAL_T2] Multiple telemetry pids are running $pids"

# 1. CLK_TCK (USER_HZ) & Uptime
# USER_HZ is almost always 100 on Linux regardless of CONFIG_HZ
local hz=1000
if [ -r /proc/config.gz ]; then
local detected_hz=$(zcat /proc/config.gz 2>/dev/null | grep "^CONFIG_HZ=" | cut -d= -f2)
if [ -n "$detected_hz" ]; then
hz=$detected_hz
fi
fi
Comment on lines +342 to +348
Copilot AI Feb 6, 2026

The clock-tick conversion for /proc/<pid>/stat starttime looks incorrect: hz defaults to 1000 and is optionally read from CONFIG_HZ, but starttime is in units of sysconf(_SC_CLK_TCK) (often 100) and may not match CONFIG_HZ. With the current fallback, elapsed time can be miscomputed and children may be killed too early. Consider deriving ticks-per-second via getconf CLK_TCK (or equivalent) and using a safe default that matches the platform, rather than parsing /proc/config.gz.

Suggested change
- local hz=1000
- if [ -r /proc/config.gz ]; then
- local detected_hz=$(zcat /proc/config.gz 2>/dev/null | grep "^CONFIG_HZ=" | cut -d= -f2)
- if [ -n "$detected_hz" ]; then
- hz=$detected_hz
- fi
- fi
+ local hz
+ if command -v getconf >/dev/null 2>&1; then
+ hz=$(getconf CLK_TCK 2>/dev/null)
+ fi
+ # Fallback to a safe default if getconf is unavailable or returns an invalid value
+ if ! echo "$hz" | grep -Eq '^[0-9]+$'; then
+ hz=100
+ fi
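
For reference, the age calculation this comment is about can be sketched as a standalone helper. This is only an illustration under stated assumptions: the function name proc_age_sec and its pa_ variables are hypothetical and not part of the PR, and the default of 100 ticks per second is an assumed fallback for platforms where getconf is unavailable.

```shell
# Hypothetical helper (not in the PR): print the age of a PID in whole seconds.
proc_age_sec() {
    pa_pid="$1"
    # Ticks per second used by /proc/<pid>/stat; assume 100 if getconf is
    # missing or returns something that is not a plain number.
    pa_hz=$(getconf CLK_TCK 2>/dev/null)
    case "$pa_hz" in
        ''|*[!0-9]*) pa_hz=100 ;;
    esac
    # Strip "pid (comm) " so field positions stay stable even if comm has spaces.
    pa_stat=$(sed 's/.*) //' "/proc/$pa_pid/stat" 2>/dev/null) || return 1
    [ -n "$pa_stat" ] || return 1
    # starttime is field 22 of the full line, i.e. field 20 after stripping.
    pa_start=$(echo "$pa_stat" | cut -d' ' -f20)
    pa_uptime=$(awk '{print int($1)}' /proc/uptime)
    echo $(( pa_uptime - pa_start / pa_hz ))
}
```

Calling proc_age_sec "$pid" prints the elapsed seconds, and the function returns non-zero if the process has already exited.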

local uptime_sec=$(awk '{print int($1)}' /proc/uptime)

# 2. Identify Parent (prefer multi-threaded; fallback to oldest)
local parent=""
local oldest_pid=""
local oldest_start_ticks=""

for pid in $pids; do
if [ ! -r "/proc/$pid/stat" ]; then
continue
fi

local stat_data=$(sed 's/.*) //' "/proc/$pid/stat" 2>/dev/null)
if [ -z "$stat_data" ]; then
continue
fi

local start_ticks=$(echo "$stat_data" | cut -d' ' -f20)
if [ -z "$start_ticks" ]; then
continue
fi

# Track oldest (smallest start_ticks)
if [ -z "$oldest_pid" ] || [ "$start_ticks" -lt "$oldest_start_ticks" ]; then
oldest_pid="$pid"
oldest_start_ticks="$start_ticks"
fi

# Prefer "multi-threaded" heuristic if present
if [ -z "$parent" ]; then
if [ -d "/proc/$pid/task" ]; then
local task_count=$(ls "/proc/$pid/task" 2>/dev/null | wc -l)
if [ "$task_count" -gt 1 ]; then
parent="$pid"
fi
fi
fi
Comment on lines +379 to +386
Copilot AI Feb 6, 2026

The "multi-threaded" parent heuristic runs ls | wc -l for each PID, which is relatively expensive and can be avoided by reading the Threads: field from /proc/<pid>/status (or similar) without spawning multiple processes. This will reduce overhead in the periodic self-heal loop.
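
One way to read the thread count without spawning ls or wc is sketched below. It is a minimal example, not the PR's code: the helper name thread_count and its tc_ variables are hypothetical, and it assumes the Linux /proc/<pid>/status layout with a "Threads:" line.

```shell
# Hypothetical helper (not in the PR): print the thread count of a PID
# using only shell builtins, so no extra processes are forked.
thread_count() {
    [ -r "/proc/$1/status" ] || return 1
    while read -r tc_key tc_val tc_rest; do
        if [ "$tc_key" = "Threads:" ]; then
            echo "$tc_val"
            return 0
        fi
    done < "/proc/$1/status"
    # No "Threads:" line found.
    return 1
}
```

The parent heuristic could then test [ "$(thread_count "$pid")" -gt 1 ] in place of counting entries under /proc/$pid/task.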

done

echo_t "[RDKB_SELFHEAL_T2] Received Parent PID: $parent, Oldest PID: $oldest_pid"

if [ -z "$parent" ]; then
parent="$oldest_pid"
fi

if [ -z "$parent" ]; then
return 0
fi

echo_t "[RDKB_SELFHEAL_T2] Selected Parent PID: $parent"

# 3. Loop Children
for pid in $pids; do
# Skip if it is the parent or process is already gone
if [ "$pid" = "$parent" ]; then
continue
fi
if [ ! -d "/proc/$pid" ]; then
continue
fi

local stat_data=$(sed 's/.*) //' "/proc/$pid/stat" 2>/dev/null)
if [ -z "$stat_data" ]; then
continue
fi

local state=$(echo "$stat_data" | cut -d' ' -f1)
local ppid=$(echo "$stat_data" | cut -d' ' -f2)
local start_ticks=$(echo "$stat_data" | cut -d' ' -f20)

if [ "$ppid" != "$parent" ]; then
continue
fi

# Calculate Age
local started_sec=$(( start_ticks / hz ))
local elapsed=$(( uptime_sec - started_sec ))

# 4. Action Logic
if [ "$elapsed" -ge "$THRESH" ]; then
echo_t "[RDKB_SELFHEAL_T2] : PID $pid: State=$state, Age=${elapsed}s (Kill Triggered)"

kill -15 "$pid" 2>/dev/null
sleep 5

if [ -d "/proc/$pid" ]; then
echo_t " Forcing SIGKILL $pid"
kill -9 "$pid" 2>/dev/null
sleep 1
fi
Comment on lines +432 to +439
Copilot AI Feb 6, 2026

The per-PID sleep 5/sleep 1 inside the loop can block the health monitor for a long time if multiple child PIDs exist (e.g., N children => ~6N seconds). Consider a bounded overall timeout, shorter waits, or reaping in a way that doesn't delay the rest of self_heal_t2() for extended periods.
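
A bounded variant could look roughly like the sketch below: send SIGTERM to every stale child first, poll under one shared deadline, then SIGKILL whatever remains. The helper name kill_with_deadline, its kd_ variables, and the one-second polling interval are assumptions for illustration, not something taken from the PR.

```shell
# Hypothetical helper (not in the PR): terminate all given PIDs within one
# shared deadline instead of sleeping per PID.
# Usage: kill_with_deadline <timeout_sec> <pid>...
kill_with_deadline() {
    kd_timeout="$1"; shift
    [ "$#" -gt 0 ] || return 0
    # Ask every process to exit up front so they shut down in parallel.
    kill -15 "$@" 2>/dev/null
    kd_waited=0
    while [ "$kd_waited" -lt "$kd_timeout" ]; do
        kd_alive=0
        for kd_pid in "$@"; do
            kill -0 "$kd_pid" 2>/dev/null && kd_alive=1
        done
        # Everything is gone: return early without using the full timeout.
        [ "$kd_alive" -eq 0 ] && return 0
        sleep 1
        kd_waited=$(( kd_waited + 1 ))
    done
    # Deadline reached: force-kill whatever is still present.
    for kd_pid in "$@"; do
        kill -0 "$kd_pid" 2>/dev/null && kill -9 "$kd_pid" 2>/dev/null
    done
    return 0
}
```

With this shape the worst-case delay is the timeout itself, regardless of how many children are collected, rather than about 6 seconds per child.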

fi
done
}

# This is a workaround to be out of the finger pointing state of telemetry2_0 being in between generic KP monitoring and uncontrolled profile assignments from cloud
# Purpose of this selfheal is to restart telemetry2_0 if it is :
# 1] Consuming more memory than the threshold
# 2] Stops reporting due to issues external to telemetry2_0 causing it to go to hung state
self_heal_t2() {

detect_and_kill_locked_pids "telemetry2_0"
restartNeeded=0

# Floor limit on telemetry2_0 memory usage
@@ -363,9 +492,10 @@ self_heal_t2() {

  # Check for rbus communication failure
  ERROR_STRING="rbus_set Failed for \[Telemetry.ReportProfiles.EventMarker\]"
+ ERROR_STRING_NEW="Caching the event to File"
  telemetry2_0_client "TEST_RT_CONNECTION" "1" > /tmp/t2_test_broker_health 2>&1
  if [ -f /tmp/t2_test_broker_health ]; then
- if [ `grep -c "$ERROR_STRING" /tmp/t2_test_broker_health` -gt 0 ]; then
+ if [ `grep -c "$ERROR_STRING" /tmp/t2_test_broker_health` -gt 0 ] || [ `grep -c "$ERROR_STRING_NEW" /tmp/t2_test_broker_health` -gt 0 ]; then
  echo_t "[RDKB_SELFHEAL] : telemetry2_0 is hung at rbus queries. Set restart flag for telemetry2_0."
  restartNeeded=1
  fi