-
Notifications
You must be signed in to change notification settings - Fork 23
RDKB-63377 : Add patch for SH Logic for reap hung child #127
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: support/1.1.0
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -311,14 +311,142 @@ self_heal_meshAgent_hung() { | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| fi | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| } | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| # This is a workaround till fork calls are removed from t2 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| # Purpose of this selfheal is to kill t2 telemetry2_0 childs if it is : | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
yogeswaransky marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| # 1] running for more than 120 sec | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| detect_and_kill_locked_pids() { | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| local name="$1" THRESH="${2:-120}" | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| if [ -z "$name" ]; then | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| return 2 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| fi | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| local pids=$(pidof "$name") | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| if [ -z "$pids" ]; then | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| return 0 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| fi | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| local pid_count | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| pid_count=$(set -- $pids; echo $#) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| if [ "$pid_count" -le 1 ]; then | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| return 0 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| fi | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| echo_t "[RDKB_SELFHEAL_T2] Multiple telemetry pids are running $pids" | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| # 1. CLK_TCK (USER_HZ) & Uptime | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| # USER_HZ is almost always 100 on Linux regardless of CONFIG_HZ | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| local hz=1000 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| if [ -r /proc/config.gz ]; then | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| local detected_hz=$(zcat /proc/config.gz 2>/dev/null | grep "^CONFIG_HZ=" | cut -d= -f2) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| if [ -n "$detected_hz" ]; then | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| hz=$detected_hz | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Comment on lines
+340
to
+344
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| local hz=1000 | |
| if [ -r /proc/config.gz ]; then | |
| local detected_hz=$(zcat /proc/config.gz 2>/dev/null | grep "^CONFIG_HZ=" | cut -d= -f2) | |
| if [ -n "$detected_hz" ]; then | |
| hz=$detected_hz | |
| # Prefer the actual USER_HZ from getconf; fall back to 100 if unavailable. | |
| local hz=100 | |
| if command -v getconf >/dev/null 2>&1; then | |
| local detected_hz | |
| detected_hz=$(getconf CLK_TCK 2>/dev/null || echo "") | |
| if [ -n "$detected_hz" ]; then | |
| hz=$detected_hz |
Copilot
AI
Feb 6, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clock-tick handling looks incorrect: /proc/[pid]/stat starttime is in USER_HZ (sysconf/_SC_CLK_TCK), which is typically 100. Defaulting hz to 1000 and/or reading CONFIG_HZ will skew the age calculation (often by ~10x) and can cause killing healthy children. Prefer getconf CLK_TCK (or equivalent) with a safe fallback (commonly 100) and avoid using CONFIG_HZ as a proxy for USER_HZ.
| # USER_HZ is almost always 100 on Linux regardless of CONFIG_HZ | |
| local hz=1000 | |
| if [ -r /proc/config.gz ]; then | |
| local detected_hz=$(zcat /proc/config.gz 2>/dev/null | grep "^CONFIG_HZ=" | cut -d= -f2) | |
| if [ -n "$detected_hz" ]; then | |
| hz=$detected_hz | |
| fi | |
| fi | |
| # USER_HZ is obtained from getconf(_SC_CLK_TCK); default safely to 100 if unavailable | |
| local hz="" | |
| if command -v getconf >/dev/null 2>&1; then | |
| hz=$(getconf CLK_TCK 2>/dev/null || echo "") | |
| fi | |
| case "$hz" in | |
| ''|*[!0-9]*) | |
| hz=100 | |
| ;; | |
| esac |
Copilot
AI
Feb 5, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The parent identification logic has a potential issue. The code prefers a process with multiple threads (task_count > 1) as the parent, but once a parent is found this way, it never gets reassessed. However, this heuristic may select the wrong process as the parent since having multiple threads doesn't necessarily mean a process is the parent of other processes with the same name. A more reliable approach would be to check the actual parent-child relationship by examining the PPID field in /proc/pid/stat to determine which PID is the parent. The fallback to "oldest_pid" is reasonable, but the multi-threaded heuristic is questionable and could lead to incorrect identification of the parent process.
Copilot
AI
Feb 5, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The 'state' variable extracted on line 414 is logged but never actually used in any logic. If the intent was to only kill processes in certain states (e.g., not killing processes in 'R' running state or 'S' sleeping state), this logic is missing. If state checking is not needed, the variable extraction and logging could be removed to simplify the code.
Copilot
AI
Feb 5, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Race condition in process state checks. The code checks if a process exists with "[ ! -d "/proc/$pid" ]" on line 405, but by the time the code tries to read from /proc/$pid/stat on line 409, the process might have exited. While the sed command has error redirection (2>/dev/null), there's no check to verify the process still exists before attempting to kill it on lines 430 and 435. If the process exits naturally between the age calculation and the kill attempt, the kill commands will fail silently (due to 2>/dev/null), but this could lead to attempting to kill a PID that has been recycled and assigned to a different process. Consider adding another existence check right before the kill commands.
| kill -15 "$pid" 2>/dev/null | |
| sleep 5 | |
| if [ -d "/proc/$pid" ]; then | |
| echo " Forcing SIGKILL $pid" | |
| kill -9 "$pid" 2>/dev/null | |
| sleep 1 | |
| # Re-validate that the PID still refers to the same process before sending SIGTERM | |
| if [ -d "/proc/$pid" ]; then | |
| local current_stat_data=$(sed 's/.*) //' "/proc/$pid/stat" 2>/dev/null) | |
| if [ -n "$current_stat_data" ]; then | |
| local current_start_ticks=$(echo "$current_stat_data" | cut -d' ' -f20) | |
| if [ "$current_start_ticks" = "$start_ticks" ]; then | |
| kill -15 "$pid" 2>/dev/null | |
| fi | |
| fi | |
| fi | |
| sleep 5 | |
| # Re-validate again before forcing SIGKILL | |
| if [ -d "/proc/$pid" ]; then | |
| local current_stat_data_kill=$(sed 's/.*) //' "/proc/$pid/stat" 2>/dev/null) | |
| if [ -n "$current_stat_data_kill" ]; then | |
| local current_start_ticks_kill=$(echo "$current_stat_data_kill" | cut -d' ' -f20) | |
| if [ "$current_start_ticks_kill" = "$start_ticks" ]; then | |
| echo " Forcing SIGKILL $pid" | |
| kill -9 "$pid" 2>/dev/null | |
| sleep 1 | |
| fi | |
| fi |
Copilot
AI
Feb 5, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential performance issue: The function calls pidof which scans the entire process table, then iterates through all matching PIDs twice (once to identify parent, once to check/kill children). For each PID, it performs multiple file reads from /proc. If telemetry2_0 has many child processes, this could cause noticeable overhead. Additionally, the sleep commands (5 seconds on line 431, 1 second on line 436) will block the selfheal script execution. If this function is called from a critical path or frequently, consider optimizing the process detection or making the sleep durations configurable.
Copilot
AI
Feb 5, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mixed indentation detected. Line 494 uses tabs while the surrounding code uses spaces. This violates consistent code formatting standards.
| ERROR_STRING_NEW="Caching the event to File" | |
| ERROR_STRING_NEW="Caching the event to File" |
Copilot
AI
Feb 6, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This condition runs grep twice and uses command substitution inside [ ... ], which is more error-prone and slower than necessary. Consider switching to a single grep -Eq with multiple patterns (or one grep with -e) and checking the exit status instead of counting matches.
| if [ `grep -c "$ERROR_STRING" /tmp/t2_test_broker_health` -gt 0 ] || [ `grep -c "$ERROR_STRING_NEW" /tmp/t2_test_broker_health` -gt 0 ]; then | |
| if grep -q -e "$ERROR_STRING" -e "$ERROR_STRING_NEW" /tmp/t2_test_broker_health; then |
Uh oh!
There was an error while loading. Please reload this page.