-
Notifications
You must be signed in to change notification settings - Fork 45
Coredump: implementation on MMU targets #649
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello @etiaro, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
Hello team,
Gemini here to provide a summary of this pull request. This PR introduces the core dump functionality into the system. The primary goal is to implement a configurable mechanism to dump serialized memory, registers, and metadata of a process upon encountering a fault. This output is intended to be in a format suitable for post-mortem debugging using tools like GDB, specifically the ELF core dump format. The changes involve adding new core dump specific files, integrating the core dump generation into the exception handling paths for various architectures (ARMv7a, IA32, RISC-V64), and implementing a thread freezing mechanism to ensure a consistent state during the dump.
Highlights
- Core Dump Implementation: Adds the core logic for generating ELF-formatted core dumps (
proc/coredump.c,proc/coredump.h). This includes handling ELF headers, program headers (for notes and memory segments), and note sections (thread status, floating-point registers, auxiliary vector). - Architecture Support: Integrates core dump generation into the exception handling routines for ARMv7a, IA32, and RISC-V64. This involves adding architecture-specific definitions for core dump sizes, ELF machine types, and functions to extract general-purpose and auxiliary register sets (
hal/*/exceptions.c,hal/*/arch/exceptions.h). - Thread Freezing: Introduces a mechanism to freeze threads within a process (
proc/threads.c,proc/threads.h). This is essential to prevent threads from changing state while the core dump is being generated, ensuring a consistent snapshot of the process state. - Context Saving: Adds logic to save the user-space CPU context (
userContextfield inthread_t) upon entering the kernel via exceptions or syscalls (proc/threads.c,syscalls.c,vm/map.c). This saved context is then used when generating the core dump. - Output Encoding: The core dump output is encoded using Base64 with simple Run-Length Encoding (RLE) compression and includes a CRC32 checksum (
proc/coredump.c). This allows the dump to be output via serial console or log. - Configurable Memory Dumping: The core dump can be configured to dump only the crashing thread's stack (default), all threads' stacks (
COREDUMP_MEM_ALL_STACKS), or all readable/writable memory segments (COREDUMP_MEM_ALL).
Changelog
Click here to see the changelog
- hal/armv7a/arch/exceptions.h
- +Added defines for core dump note sizes (
SIZE_COREDUMP_GREGSET,SIZE_COREDUMP_THREADAUX,SIZE_COREDUMP_GENAUX). - +Added define for ARM ELF machine type (
HAL_ELF_MACHINE).
- +Added defines for core dump note sizes (
- hal/armv7a/exceptions.c
- +Included
proc/elf.h. - +Extracted exception mnemonic lookup into a separate function
hal_exceptionMnemonic(lines 53-60). - +Added
hal_excToCpuCtxto get the CPU context from the exception context (lines 244-247). - +Added
hal_coredumpGRegsetto dump general-purpose registers for ARMv7a (lines 250-270). - +Added
hal_coredumpThreadAuxto dump thread auxiliary notes (VFP context) for ARMv7a (lines 271-286). - +Added
hal_coredumpGeneralAuxto dump general auxiliary notes (HWCAP) for ARMv7a (lines 295-314).
- +Included
- hal/exceptions.h
- +Added declaration for
hal_exceptionMnemonic(line 31). - +Added declarations for core dump related HAL functions (
hal_excToCpuCtx,hal_coredumpGRegset,hal_coredumpThreadAux,hal_coredumpGeneralAux) (lines 43-52).
- +Added declaration for
- hal/ia32/arch/exceptions.h
- +Added defines for core dump note sizes (
SIZE_COREDUMP_GREGSET,SIZE_COREDUMP_THREADAUX,SIZE_COREDUMP_GENAUX). - +Added define for IA32 ELF machine type (
HAL_ELF_MACHINE).
- +Added defines for core dump note sizes (
- hal/ia32/exceptions.c
- +Included
proc/elf.h. - +Extracted exception mnemonic lookup into a separate function
hal_exceptionMnemonic(lines 116-126). - +Updated
hal_exceptionsDumpContextto usehal_exceptionMnemonic(line 146). - +Added
hal_excToCpuCtxto get the CPU context from the exception context (lines 314-317). - +Added
hal_coredumpGRegsetto dump general-purpose registers for IA32 (lines 320-342). - +Added
hal_coredumpThreadAuxto dump thread auxiliary notes (FPU context) for IA32 (lines 345-358). - +Added
hal_coredumpGeneralAux(empty implementation) for IA32 (lines 361-363).
- +Included
- hal/riscv64/arch/cpu.h
- +Renamed
pcfield toraincpu_context_tstruct (line 125).
- +Renamed
- hal/riscv64/arch/exceptions.h
- +Added defines for core dump note sizes (
SIZE_COREDUMP_GREGSET,SIZE_COREDUMP_THREADAUX,SIZE_COREDUMP_GENAUX). - +Added define for RISC-V 64-bit ELF machine type (
HAL_ELF_MACHINE).
- +Added defines for core dump note sizes (
- hal/riscv64/cpu.c
- +Updated
hal_cpuCreateContextto initializectx->rainstead ofctx->pc(line 141).
- +Updated
- hal/riscv64/exceptions.c
- +Included
proc/elf.h. - +Extracted exception mnemonic lookup into a separate function
hal_exceptionMnemonic(lines 35-44). - +Updated
hal_exceptionsDumpContextto usehal_exceptionMnemonic(line 54) and dumpctx->rainstead ofctx->pc(line 59). - +Updated
hal_exceptionsPCto returnctx->sepcinstead ofctx->pc(line 179). - +Added
hal_excToCpuCtxto get the CPU context from the exception context (lines 257-260). - +Added
hal_coredumpGRegsetto dump general-purpose registers for RISC-V64 (lines 263-300). - +Added
hal_coredumpThreadAuxto dump thread auxiliary notes (FPU context) for RISC-V64 (lines 303-317). - +Added
hal_coredumpGeneralAux(empty implementation) for RISC-V64 (lines 320-322).
- +Included
- proc/Makefile
- +Added
coredump.oto the list of objects to be built (line 8).
- +Added
- proc/coredump.c
- +New file implementing the core dump generation logic.
- +Includes ELF structure definitions, note types, and functions for writing the ELF header, program headers, and notes.
- +Implements Base64 encoding with RLE compression and CRC32 checksumming for the output.
- +Includes logic to dump memory segments based on configuration (
COREDUMP_MEM_ALL,COREDUMP_MEM_ALL_STACKS). - +Uses thread freezing (
proc_freeze,proc_unfreeze) to ensure a consistent state during the dump.
- proc/coredump.h
- +New file declaring the
coredump_dumpfunction and thecoredump_threadinfo_tstructure used to pass thread information.
- +New file declaring the
- proc/elf.h
- +Added various ELF definitions relevant to core dumps, including
ELFMAG,EI_*constants,ET_CORE,PT_NOTE,NT_PRSTATUS,NT_FPREGSET,NT_AUXV,NT_ARM_VFP(lines 35-108). - +Added
Elf32_NhdrandElf64_Nhdrstructures for note headers (lines 156-160, 216-220). - +Corrected
e_hsizetoe_ehsizeinElf32_Ehdr(line 121).
- +Added various ELF definitions relevant to core dumps, including
- proc/process.c
- +Added calls to
threads_saveUserContextandcoredump_dumpin theprocess_exceptionhandler (lines 277, 279). - +Added a call to
threads_saveUserContextin theprocess_illegalhandler (line 301).
- +Added calls to
- proc/threads.c
- +Included
coredump.h. - +Added logic in
_threads_scheduleto handle threads inFREEZINGorFROZENstates (lines 580-582, 599-602). - +Initialized the new
freezefield toRUNNINGinproc_threadCreate(line 779). - +Initialized the new
userContextfield toNULLinproc_threadCreate(line 792). - +Added logic in
_proc_threadDequeueto setfreezestate toFROZENif the thread wasFREEZING(lines 1070-1073). - +Implemented
threads_saveUserContextto store the user CPU context (lines 2092-2105). - +Implemented
_threads_userContextto retrieve the appropriate user context (lines 2108-2117). - +Implemented
proc_freezeto freeze all threads within a process (lines 2120-2163). - +Implemented
proc_unfreezeto unfreeze all threads within a process (lines 2166-2184). - +Implemented
coredump_threadsInfoto collect information about process threads for the core dump (lines 2187-2239).
- +Included
- proc/threads.h
- +Included
coredump.h. - +Added
freezestate enum (RUNNING,FREEZER,FREEZING,FROZEN) (line 41). - +Added
freezefield tothread_tstruct (line 69). - +Added
userContextfield tothread_tstruct (line 95). - +Added declarations for new thread/process functions (
threads_saveUserContext,_threads_userContext,proc_freeze,proc_unfreeze,coredump_threadsInfo) (lines 226-238).
- +Included
- syscalls.c
- +Added call to
threads_saveUserContextat the beginning ofsyscalls_dispatchto save the context upon entering the kernel (line 1894). - +Added call to
threads_saveUserContext(NULL)before returning fromsyscalls_dispatchto clear the saved context (line 1906).
- +Added call to
- vm/map.c
- +Added call to
threads_saveUserContextat the beginning ofmap_pageFaultto save the context (line 740). - +Added call to
coredump_dumpin thePAGEFAULTSTOPblock (line 747). - +Added call to
coredump_dumpifvm_mapForcefails (line 767).
- +Added call to
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | /gemini review |
Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary |
Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | /gemini help |
Displays a list of available commands. |
Customization
To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
A crash, a sudden halt,
The system finds a fault.
Registers are saved,
Memory is braved,
A core dump tells the tale.
Footnotes
-
Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request implements a configurable coredump feature, which is a significant addition for post-mortem debugging. The implementation includes architecture-specific register and auxiliary data dumping for ARM, IA32, and RISC-V, along with ELF-compatible output formatting, RLE and Base64 encoding for the dump data. The changes also introduce a process freezing mechanism to ensure coredump consistency.
The overall structure and approach seem sound. The code is well-organized with clear separation between generic coredump logic and architecture-specific helpers. The use of conditional compilation for output methods (serial/log) and memory dump scope (all memory, all stacks, or just current stack) provides good flexibility.
I've identified a couple of areas that need attention, detailed in the comments below. Additionally, the PR author has noted that documentation changes are required but not yet updated; this should be addressed before merging.
Summary of Findings
- Hardcoded
AT_HWCAPValue with TODO: Inhal/armv7a/exceptions.c, theAT_HWCAPvalue for the coredump's general auxiliary vector is hardcoded as0x8b8d7with aTODO. This should be clarified, defined as a constant, or dynamically determined to improve clarity and maintainability. - Potential Typo in Coredump Function Call: In
vm/map.c, within thePAGEFAULTSTOPconditional block,proc_coredump(n, ctx)is called. The coredump function introduced in this PR is namedcoredump_dump. This might be a typo that could lead to errors ifPAGEFAULTSTOPis enabled. - Documentation Update Required: The pull request description indicates that documentation changes are required for this new feature, but the corresponding checkbox for updating documentation is unchecked. This should be addressed.
Merge Readiness
This pull request introduces a valuable coredump feature. The implementation is largely solid. However, there are a couple of issues identified (a TODO for a hardcoded value and a potential function call typo) that should be addressed before merging. Additionally, the pending documentation update should be completed.
As an AI, I am not authorized to approve pull requests. Please ensure these changes are reviewed and approved by other maintainers before merging.
agkaminski
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Partial review, will tackle the rest later
81f8c6f to
a3b48ad
Compare
e30f82b to
7b4c0a2
Compare
JIRA: RTOS-1049
JIRA: RTOS-1049
JIRA: RTOS-1049
JIRA: RTOS-1049
9c05223 to
d662685
Compare
|
Long tests with multiple processes and threads unveiled a few problems:
These problems are solved in last two commits. |
How easy is to trigger the error on master branch? If it's a common occurrence - maybe we want to fix it ASAP on |
It's hard to trigger, I was able to do that only by spawning massive amounts of threads of which some were instantly failing, and even in this setup it doesn't always happen. |
| pid_t pr_ppid; | ||
| pid_t pr_pgrp; | ||
| pid_t pr_sid; | ||
| struct timeval { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Name collision with libc name, might cause problems in the future
| char pr_reg[0]; /* actual size depends on architecture */ | ||
| int pr_fpvalid; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does this work? Won't this cause pr_fpvalid offset to be always invalid?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's equivalent to declaring two structures:
elf_prstatus_header,
elf_prstatus_footer,
and dumping in order header, hal_coredumpGRegset(), footer.
I just combined it in single struct and used pr_reg as a marker to put real arch-dependent pr_reg here in actual dump. The intention was to achieve: &prstatus->pr_reg == &prstatus->pr_fpvalid.
After quick read, zero-length arrays is legacy GCC extension and it's rather discouraged, not really intended to use this way, just legacy way for C99's [] array (variable length at the end of struct).
Alternatively, whole struct could be moved to HAL (which will affect buffer size and require filling its entries inside HAL function). This adds some redundancy inside HAL.
Current solution isn't perfect either - it turned out that GDB is not working with SPARC-linux coredumps and require a hack inside its HAL to modify this struct into (luckily larger) Solaris's version of elf_prstatus.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tl;dr
It works only on GCC and it's hacky. I wonder if I should:
- move whole struct into HAL adding redundant code in all architectures except SPARC
- split it into header/footer and keep hacky solution in SPARC's HAL (next PR)
- forget about this struct and only remember its size and pr_pid offset, as that's the only used (most probably probably the only useful) field for now
| void hal_coredumpGRegset(void *buff, cpu_context_t *ctx) | ||
| { | ||
| u32 *regs = (u32 *)buff; | ||
| *(regs++) = ctx->r0; | ||
| *(regs++) = ctx->r1; | ||
| *(regs++) = ctx->r2; | ||
| *(regs++) = ctx->r3; | ||
| *(regs++) = ctx->r4; | ||
| *(regs++) = ctx->r5; | ||
| *(regs++) = ctx->r6; | ||
| *(regs++) = ctx->r7; | ||
| *(regs++) = ctx->r8; | ||
| *(regs++) = ctx->r9; | ||
| *(regs++) = ctx->r10; | ||
| *(regs++) = ctx->fp; | ||
| *(regs++) = ctx->ip; | ||
| *(regs++) = ctx->sp; | ||
| *(regs++) = ctx->lr; | ||
| *(regs++) = ctx->pc; | ||
| *(regs++) = ctx->psr; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is somewhat iffy in scope of C standard and MISRA - yes, we provide aligned buffer, but in pedantic view this is still unwanted method.
It is best to pass the desired type all the way down, in this case it depends on the architecture - perhaps per arch typedef?
On the other hand memcpy can be used instead, most likely less efficient approach.
Not really a merge blocker, I'll leave this for @Darchiv to decide
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding the previous comment, would it be any better to define struct elf_prstatus including pr_regs in HAL which would just be filled here, and make a cast to this new type in caller?
| /* Threads can become frozen when rescheduled from userspace */ | ||
| do { | ||
| while ((thread->freeze != FROZEN) && (thread != current)) { | ||
| if ((thread->state == SLEEP) && (_threads_usesKernelOnlyLock(thread) == 0)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do this stuff with locks type etc? We can allow thread freeze only and only if either:
- threads is not in supervisor mode,
- thread is sleeping.
This assumption would make matter way simpler and we wouldn't need to classify locks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we freeze the thread that is sleeping, and
- it can be waiting for a lock - then will sooner or later get that lock, and if it's still freezed, then that lock can cause deadlock, and will block other waiting threads
- it can't be waiting for any lock - then we have to wait until that thread gets the locks its waiting for - but it can get deadlocked if that lock is "user.mutex" lock.
Another solution would be to only wait to free acquired locks, dequeue freezed threads and put them in queue on unfreeze, but the queue mechanism is used not only for locks and I didn't have an idea how to do that reasonably well that wouldn't be a harsh to maintain. Also, this solution still requires distinguishing "user.mutex", as it can never get freed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're freezing all threads, so there's no risk of deadlock regarding user mutexes.
But it's a valid point, something has to be done regarding queues. Perhaps we could use interruption mechanism (from signals) to force threads to vacate interruptible queues and freeze only in user space or on return to user space?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another solution is to skip frozen threads on dequeue, but yeah, it might get ugly quick in edge cases (e.g. how to proceed when we only have frozen thread on queue and it becomes unfrozen?).
Side note - IMHO freezing is not really needed for coredump mechanism - we can simply kill process on fault and collect thread stack as they die
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're freezing all threads, so there's no risk of deadlock regarding user mutexes.
The exception thread can be holding this lock, if another thread is waiting to acquire it and freezer is waiting for this acquisition then that's a deadlock
Side note - IMHO freezing is not really needed for coredump mechanism - we can simply kill process on fault and collect thread stack as they die
I'm working on dumping running processes on request. That's where freezing will really be required.
Also, the thread that gets a SIGSEGV can handle that signal and keep working (probably not recommended, but possible)
Perhaps we could use interruption mechanism (from signals) to force threads to vacate interruptible queues
I missed that proc_mutexLock is interruptible. That indeed solves the problem. I'll remove the attribute and just interrupt them out of the queues.
e4c5a04 to
4e934b4
Compare
JIRA: RTOS-1049
JIRA: RTOS-1049
JIRA: RTOS-1049
JIRA: RTOS-1049
JIRA: RTOS-1049
JIRA: RTOS-1049
JIRA: RTOS-1049
JIRA: RTOS-1049
Before this commit it was possible for a process to create new thread while dying which was not killed. JIRA: RTOS-1049
JIRA: RTOS-1049
Description
Implement configurable dumping of serialized memory, registers and metadata on a fault that allows for post-mortem debugging of crashed process using GDB after parsing on host machine.
Motivation and Context
RTOS-1043 Research core dump and backtrace techniques
RTOS-1049 Implement core dump on MMU targets
Example coredump log
This is the most minimalistic coredump example (only stack memory and registers, one thread), generated on ia32.
Allows for backtrace and shows all stack variables in GDB.
Types of changes
How Has This Been Tested?
Checklist:
Special treatment