133 changes: 131 additions & 2 deletions scripts/task_health_monitor.sh
@@ -311,14 +311,142 @@ self_heal_meshAgent_hung() {
fi
}

# This is a workaround till fork calls are removed from t2
# Purpose of this selfheal is to kill t2 telemetry2_0 childs if it is :
# 1] running for more than 120 sec

detect_and_kill_locked_pids() {
local name="$1" THRESH="${2:-120}"

if [ -z "$name" ]; then
return 2
fi

local pids=$(pidof "$name")
if [ -z "$pids" ]; then
return 0
fi

local pid_count
pid_count=$(set -- $pids; echo $#)
if [ "$pid_count" -le 1 ]; then
return 0
fi

echo_t "[RDKB_SELFHEAL_T2] Multiple telemetry pids are running $pids"

# 1. CLK_TCK (USER_HZ) & Uptime
# USER_HZ is almost always 100 on Linux regardless of CONFIG_HZ
local hz=1000
if [ -r /proc/config.gz ]; then
local detected_hz=$(zcat /proc/config.gz 2>/dev/null | grep "^CONFIG_HZ=" | cut -d= -f2)
if [ -n "$detected_hz" ]; then
hz=$detected_hz
Comment on lines +340 to +344
Copilot AI Feb 5, 2026

The comment on line 339 states "USER_HZ is almost always 100 on Linux", but the code initializes hz to 1000. The code conflates USER_HZ (the unit of the clock-tick values in /proc/pid/stat, typically 100 on Linux) with CONFIG_HZ (the kernel's internal tick rate). Converting field 22 (starttime) of /proc/pid/stat to seconds requires USER_HZ, not CONFIG_HZ. The code should initialize hz=100 or, better, read the actual USER_HZ value with getconf CLK_TCK.

Suggested change
-local hz=1000
-if [ -r /proc/config.gz ]; then
-    local detected_hz=$(zcat /proc/config.gz 2>/dev/null | grep "^CONFIG_HZ=" | cut -d= -f2)
-    if [ -n "$detected_hz" ]; then
-        hz=$detected_hz
+# Prefer the actual USER_HZ from getconf; fall back to 100 if unavailable.
+local hz=100
+if command -v getconf >/dev/null 2>&1; then
+    local detected_hz
+    detected_hz=$(getconf CLK_TCK 2>/dev/null || echo "")
+    if [ -n "$detected_hz" ]; then
+        hz=$detected_hz

Copilot uses AI. Check for mistakes.
fi
fi

Comment on lines +339 to +347
Copilot AI Feb 6, 2026

Clock-tick handling looks incorrect: /proc/[pid]/stat starttime is in USER_HZ (sysconf/_SC_CLK_TCK), which is typically 100. Defaulting hz to 1000 and/or reading CONFIG_HZ will skew the age calculation (often by ~10x) and can cause killing healthy children. Prefer getconf CLK_TCK (or equivalent) with a safe fallback (commonly 100) and avoid using CONFIG_HZ as a proxy for USER_HZ.

Suggested change
-# USER_HZ is almost always 100 on Linux regardless of CONFIG_HZ
-local hz=1000
-if [ -r /proc/config.gz ]; then
-    local detected_hz=$(zcat /proc/config.gz 2>/dev/null | grep "^CONFIG_HZ=" | cut -d= -f2)
-    if [ -n "$detected_hz" ]; then
-        hz=$detected_hz
-    fi
-fi
+# USER_HZ is obtained from getconf(_SC_CLK_TCK); default safely to 100 if unavailable
+local hz=""
+if command -v getconf >/dev/null 2>&1; then
+    hz=$(getconf CLK_TCK 2>/dev/null || echo "")
+fi
+case "$hz" in
+    ''|*[!0-9]*)
+        hz=100
+        ;;
+esac

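The age calculation discussed in the two comments above can be sketched as a pair of standalone helpers. This is a minimal illustration, not code from the PR: the names get_user_hz and proc_age_seconds are hypothetical, and the sketch assumes a Linux /proc filesystem and a shell that supports `local` (as the script under review already does).

```shell
#!/bin/sh
# Sketch only: get_user_hz and proc_age_seconds are hypothetical helper names.

# Print USER_HZ (clock ticks per second); fall back to 100 when getconf
# is missing or returns something that is not a plain number.
get_user_hz() {
    local hz
    hz=$(getconf CLK_TCK 2>/dev/null)
    case "$hz" in
        ''|*[!0-9]*) hz=100 ;;
    esac
    echo "$hz"
}

# Print the age of PID $1 in whole seconds; return 1 if its stat file
# cannot be read (for example because the process has already exited).
proc_age_seconds() {
    local pid="$1" stat_tail start_ticks uptime_sec hz
    # Drop everything up to the last ") " so a comm containing spaces cannot
    # shift the fields; starttime (field 22 of the full line) becomes field 20.
    stat_tail=$(sed 's/.*) //' "/proc/$pid/stat" 2>/dev/null) || return 1
    [ -n "$stat_tail" ] || return 1
    start_ticks=$(echo "$stat_tail" | cut -d' ' -f20)
    uptime_sec=$(awk '{print int($1)}' /proc/uptime)
    hz=$(get_user_hz)
    # starttime is expressed in USER_HZ ticks since boot.
    echo $(( uptime_sec - start_ticks / hz ))
}
```

With USER_HZ at 100, dividing by a hard-coded 1000 would make every start time look ten times closer to boot, so a freshly forked child on a long-running device would appear far older than it is.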
local uptime_sec=$(awk '{print int($1)}' /proc/uptime)

# 2. Identify Parent (prefer multi-threaded; fallback to oldest)
local parent=""
local oldest_pid=""
local oldest_start_ticks=""

for pid in $pids; do
if [ ! -r "/proc/$pid/stat" ]; then
continue
fi

local stat_data=$(sed 's/.*) //' "/proc/$pid/stat" 2>/dev/null)
if [ -z "$stat_data" ]; then
continue
fi

local start_ticks=$(echo "$stat_data" | cut -d' ' -f20)
if [ -z "$start_ticks" ]; then
continue
fi

# Track oldest (smallest start_ticks)
if [ -z "$oldest_pid" ] || [ "$start_ticks" -lt "$oldest_start_ticks" ]; then
oldest_pid="$pid"
oldest_start_ticks="$start_ticks"
fi

# Prefer "multi-threaded" heuristic if present
if [ -z "$parent" ]; then
if [ -d "/proc/$pid/task" ]; then
local task_count=$(ls "/proc/$pid/task" 2>/dev/null | wc -l)
if [ "$task_count" -gt 1 ]; then
parent="$pid"
fi
fi
fi
Comment on lines +376 to +384
Copilot AI Feb 5, 2026

The parent identification logic has a potential issue. The code prefers a process with multiple threads (task_count > 1) as the parent, but once a parent is found this way, it never gets reassessed. However, this heuristic may select the wrong process as the parent since having multiple threads doesn't necessarily mean a process is the parent of other processes with the same name. A more reliable approach would be to check the actual parent-child relationship by examining the PPID field in /proc/pid/stat to determine which PID is the parent. The fallback to "oldest_pid" is reasonable, but the multi-threaded heuristic is questionable and could lead to incorrect identification of the parent process.

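One way to act on the PPID suggestion is sketched below. The helper name find_parent_by_ppid is hypothetical, and the sketch assumes the same "strip through the last `) `" parsing of /proc/pid/stat that the PR uses, which leaves PPID as field 2.

```shell
#!/bin/sh
# Sketch only: find_parent_by_ppid is a hypothetical helper, not part of the PR.

# Given a list of PIDs, print the one that is the PPID of at least one other
# PID in the list. Print nothing and return 1 if no such PID exists.
find_parent_by_ppid() {
    local pid ppid candidate
    for pid in "$@"; do
        # After removing "pid (comm) ", field 1 is state and field 2 is PPID.
        ppid=$(sed 's/.*) //' "/proc/$pid/stat" 2>/dev/null | cut -d' ' -f2)
        [ -n "$ppid" ] || continue
        for candidate in "$@"; do
            if [ "$candidate" = "$ppid" ] && [ "$candidate" != "$pid" ]; then
                echo "$candidate"
                return 0
            fi
        done
    done
    return 1
}
```

In the function under review this could be called as `parent=$(find_parent_by_ppid $pids)`, keeping the existing oldest-PID fallback for the case where it prints nothing.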
done

echo_t "[RDKB_SELFHEAL_T2] Received Parent PID: $parent, Oldest PID: $oldest_pid"

if [ -z "$parent" ]; then
parent="$oldest_pid"
fi

if [ -z "$parent" ]; then
return 0
fi

echo_t "[RDKB_SELFHEAL_T2] Selected Parent PID: $parent"

# 3. Loop Children
for pid in $pids; do
# Skip if it is the parent or process is already gone
if [ "$pid" = "$parent" ]; then
continue
fi
if [ ! -d "/proc/$pid" ]; then
continue
fi

local stat_data=$(sed 's/.*) //' "/proc/$pid/stat" 2>/dev/null)
if [ -z "$stat_data" ]; then
continue
fi

local state=$(echo "$stat_data" | cut -d' ' -f1)
Copilot AI Feb 5, 2026

The 'state' variable extracted on line 414 is logged but never actually used in any logic. If the intent was to only kill processes in certain states (e.g., not killing processes in 'R' running state or 'S' sleeping state), this logic is missing. If state checking is not needed, the variable extraction and logging could be removed to simplify the code.

local ppid=$(echo "$stat_data" | cut -d' ' -f2)
local start_ticks=$(echo "$stat_data" | cut -d' ' -f20)

if [ "$ppid" != "$parent" ]; then
continue
fi

# Calculate Age
local started_sec=$(( start_ticks / hz ))
local elapsed=$(( uptime_sec - started_sec ))

# 4. Action Logic
if [ "$elapsed" -ge "$THRESH" ]; then
echo_t "[RDKB_SELFHEAL_T2] : PID $pid: State=$state, Age=${elapsed}s (Kill Triggered)"

kill -15 "$pid" 2>/dev/null
sleep 5

if [ -d "/proc/$pid" ]; then
echo_t " Forcing SIGKILL $pid"
kill -9 "$pid" 2>/dev/null
sleep 1
Comment on lines 429 to 436
Copilot AI Feb 5, 2026

Race condition in process existence checks. The code checks that the process exists with "[ ! -d "/proc/$pid" ]" on line 405, but the process may exit before /proc/$pid/stat is read on line 409. The sed command redirects errors (2>/dev/null), but nothing verifies that the process still exists before the kill attempts on lines 430 and 435. If the process exits between the age calculation and the kill attempt, the kill commands normally fail silently (due to 2>/dev/null); if the PID has been recycled in the meantime, however, the signal reaches a different process. Consider re-validating the process right before each kill command.

Suggested change
-kill -15 "$pid" 2>/dev/null
-sleep 5
-if [ -d "/proc/$pid" ]; then
-    echo " Forcing SIGKILL $pid"
-    kill -9 "$pid" 2>/dev/null
-    sleep 1
+# Re-validate that the PID still refers to the same process before sending SIGTERM
+if [ -d "/proc/$pid" ]; then
+    local current_stat_data=$(sed 's/.*) //' "/proc/$pid/stat" 2>/dev/null)
+    if [ -n "$current_stat_data" ]; then
+        local current_start_ticks=$(echo "$current_stat_data" | cut -d' ' -f20)
+        if [ "$current_start_ticks" = "$start_ticks" ]; then
+            kill -15 "$pid" 2>/dev/null
+        fi
+    fi
+fi
+sleep 5
+# Re-validate again before forcing SIGKILL
+if [ -d "/proc/$pid" ]; then
+    local current_stat_data_kill=$(sed 's/.*) //' "/proc/$pid/stat" 2>/dev/null)
+    if [ -n "$current_stat_data_kill" ]; then
+        local current_start_ticks_kill=$(echo "$current_stat_data_kill" | cut -d' ' -f20)
+        if [ "$current_start_ticks_kill" = "$start_ticks" ]; then
+            echo " Forcing SIGKILL $pid"
+            kill -9 "$pid" 2>/dev/null
+            sleep 1
+        fi
+    fi

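The re-validation in the suggested change is written out twice. As a rough sketch, it could be factored into a helper that signals a PID only when its start time still matches the value recorded earlier. The names proc_start_ticks and kill_if_same_process are hypothetical. A short window between the check and the kill remains, so this narrows the race without closing it.

```shell
#!/bin/sh
# Sketch only: both helper names are hypothetical, not part of the PR.

# Print the starttime (in clock ticks) of PID $1, or nothing if unreadable.
proc_start_ticks() {
    sed 's/.*) //' "/proc/$1/stat" 2>/dev/null | cut -d' ' -f20
}

# Send signal number $1 to PID $2 only if its starttime still equals $3.
# Returns 1 without signalling when the process is gone or the PID now
# belongs to a different process.
kill_if_same_process() {
    local sig="$1" pid="$2" expected="$3" current
    current=$(proc_start_ticks "$pid")
    [ -n "$current" ] && [ "$current" = "$expected" ] || return 1
    kill "-$sig" "$pid" 2>/dev/null
}
```

Inside the child loop this would read `kill_if_same_process 15 "$pid" "$start_ticks"`, followed by the wait and a second call with signal 9.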
fi
fi
done
}
Comment on lines 318 to 440
Copilot AI Feb 5, 2026

Potential performance issue: The function calls pidof which scans the entire process table, then iterates through all matching PIDs twice (once to identify parent, once to check/kill children). For each PID, it performs multiple file reads from /proc. If telemetry2_0 has many child processes, this could cause noticeable overhead. Additionally, the sleep commands (5 seconds on line 431, 1 second on line 436) will block the selfheal script execution. If this function is called from a critical path or frequently, consider optimizing the process detection or making the sleep durations configurable.

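For the blocking sleeps, one possible shape is a bounded poll that returns as soon as the process disappears and takes its timeout as a parameter. The helper name wait_for_exit and its default of 5 seconds are assumptions made for illustration only.

```shell
#!/bin/sh
# Sketch only: wait_for_exit is a hypothetical helper, not part of the PR.

# Wait up to $2 seconds (default 5) for PID $1 to exit, polling once per
# second. Returns 0 as soon as /proc/<pid> is gone, 1 on timeout.
# Note: an exited but unreaped (zombie) process still has a /proc entry,
# so this reports a timeout for it.
wait_for_exit() {
    local pid="$1" timeout="${2:-5}" waited=0
    while [ -d "/proc/$pid" ]; do
        [ "$waited" -ge "$timeout" ] && return 1
        sleep 1
        waited=$(( waited + 1 ))
    done
    return 0
}
```

A caller could then send SIGTERM, run `wait_for_exit "$pid" 5`, and escalate to SIGKILL only when it returns non-zero, so a child that exits promptly costs no extra delay.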

# This is a workaround to be out of the finger pointing state of telemetry2_0 being in between generic KP monitoring and uncontrolled profile assignments from cloud
# Purpose of this selfheal is to restart telemetry2_0 if it is :
# 1] Consuming more memory than the threshold
# 2] Stops reporting due to issues external to telemetry2_0 causing it to go to hung state
self_heal_t2() {

restartNeeded=0

detect_and_kill_locked_pids "telemetry2_0"
# Floor limit on telemetry2_0 memory usage
t2MemMax=30000
# using busybox as different platforms are behaving differently with top command using -mbn1 to get the rss data
@@ -361,9 +489,10 @@ self_heal_t2() {

# Check for rbus communication failure
ERROR_STRING="rbus_set Failed for \[Telemetry.ReportProfiles.EventMarker\]"
ERROR_STRING_NEW="Caching the event to File"
Copilot AI Feb 5, 2026

Mixed indentation detected. Line 494 uses tabs while the surrounding code uses spaces. This violates consistent code formatting standards.

Suggested change
-	ERROR_STRING_NEW="Caching the event to File"
+    ERROR_STRING_NEW="Caching the event to File"

telemetry2_0_client "TEST_RT_CONNECTION" "1" > /tmp/t2_test_broker_health 2>&1
if [ -f /tmp/t2_test_broker_health ]; then
-if [ `grep -c "$ERROR_STRING" /tmp/t2_test_broker_health` -gt 0 ]; then
+if [ `grep -c "$ERROR_STRING" /tmp/t2_test_broker_health` -gt 0 ] || [ `grep -c "$ERROR_STRING_NEW" /tmp/t2_test_broker_health` -gt 0 ]; then
Copilot AI Feb 6, 2026

This condition runs grep twice and uses command substitution inside [ ... ], which is more error-prone and slower than necessary. Consider switching to a single grep -Eq with multiple patterns (or one grep with -e) and checking the exit status instead of counting matches.

Suggested change
-if [ `grep -c "$ERROR_STRING" /tmp/t2_test_broker_health` -gt 0 ] || [ `grep -c "$ERROR_STRING_NEW" /tmp/t2_test_broker_health` -gt 0 ]; then
+if grep -q -e "$ERROR_STRING" -e "$ERROR_STRING_NEW" /tmp/t2_test_broker_health; then

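The single-grep form can be tried in isolation with a small wrapper. The function name has_rbus_error is hypothetical. The two pattern strings are copied from the diff, and the file path is passed as an argument so the sketch does not depend on /tmp/t2_test_broker_health.

```shell
#!/bin/sh
# Sketch only: has_rbus_error is a hypothetical wrapper, not part of the PR.

ERROR_STRING="rbus_set Failed for \[Telemetry.ReportProfiles.EventMarker\]"
ERROR_STRING_NEW="Caching the event to File"

# Return 0 if file $1 contains either error pattern, non-zero otherwise
# (including when the file does not exist).
has_rbus_error() {
    grep -q -e "$ERROR_STRING" -e "$ERROR_STRING_NEW" "$1" 2>/dev/null
}
```

Because grep -q reports the result through its exit status, the caller can write `if has_rbus_error /tmp/t2_test_broker_health; then ...` with no backtick substitution or numeric comparison, and the file is read once instead of twice.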
echo_t "[RDKB_SELFHEAL] : telemetry2_0 is hung at rbus queries. Set restart flag for telemetry2_0."
restartNeeded=1
fi