History log of /src/sys/kern/kern_heartbeat.c |
Revision | | Date | Author | Comments |
1.14 |
| 25-Aug-2024 |
riastradh | heartbeat(9): Use the cheaper and equally safe time_uptime32.
Since we cache this every 15sec, and check it within a tick, there's no way for this to wrap around without first triggering a heartbeat panic. So just use time_uptime32, the low 32 bits of the number of seconds of uptime -- cheaper on LP32 platforms.
PR kern/58633: heartbeat(9) makes unnecessary use of time_uptime
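Illustrative note (not part of the commit): the wraparound argument can be sketched in plain C. This is a minimal user-space sketch, assuming the staleness check subtracts a cached 32-bit sample from the current one; the names sketch_elapsed, sketch_stale and SKETCH_MAX_PERIOD are hypothetical and do not come from kern_heartbeat.c.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical limit, standing in for kern.heartbeat.max_period. */
#define SKETCH_MAX_PERIOD 15u

/*
 * Seconds elapsed between a cached 32-bit uptime sample and the
 * current one.  Unsigned subtraction is modulo 2^32, so the result is
 * correct even if the counter wrapped between the two samples, as
 * long as the true gap is below 2^32 seconds.
 */
static uint32_t
sketch_elapsed(uint32_t now32, uint32_t cached32)
{
	return now32 - cached32;
}

/* True if the cached sample is older than the hypothetical limit. */
static bool
sketch_stale(uint32_t now32, uint32_t cached32)
{
	return sketch_elapsed(now32, cached32) > SKETCH_MAX_PERIOD;
}
```

Because the cached value is refreshed far more often than once every 2^32 seconds, the modular difference always equals the true difference in this model, which is why the low 32 bits suffice.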
|
1.13 |
| 08-Mar-2024 |
riastradh | heartbeat(9): Return early if panicstr is set.
This way we avoid doing unnecessary work -- and printing unnecessary messages -- to _not_ trigger another panic anyway.
PR kern/58011
|
1.12 |
| 28-Feb-2024 |
riastradh | heartbeat(9): Restore still-applicable comment nixed in last commit.
The nesting depth is stored in ci_heartbeat_suspend which is 32-bit.
|
1.11 |
| 28-Feb-2024 |
riastradh | heartbeat(9): No kpreempt_disable/enable in heartbeat_suspend/resume.
This causes a leak of l_nopreempt in xc_thread when a CPU is offlined and onlined again, because the offlining heartbeat_suspend and the onlining heartbeat_resume happen in separate xcalls.
No change to callers because they are already bound to the CPU:
1. cnpollc does kpreempt_disable/enable itself around the calls to heartbeat_suspend/resume anyway
2. cpu_xc_offline/online run in the xcall thread, which is always bound to the CPU that is being offlined or onlined
|
1.10 |
| 06-Sep-2023 |
riastradh | heartbeat(9): Make heartbeat_suspend/resume nestable.
And make them bind to the CPU as a side effect, instead of requiring the caller to have already done so.
This lets us eliminate the assertions so we can use them in ddb even when things are going haywire and we just want to get diagnostics.
XXX kernel revbump -- struct cpu_info change
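Illustrative note (not part of the commit): nestable suspend/resume can be modelled with a per-CPU depth counter, where only the outermost suspend and the matching outermost resume change the suspended state. This is a simplified user-space sketch with hypothetical names (struct sketch_cpu, sketch_suspend, sketch_resume); it omits the CPU binding and interrupt masking that the kernel code would need, and ignoring an unbalanced resume is a choice made for this sketch only.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the relevant per-CPU state. */
struct sketch_cpu {
	uint32_t suspend_depth;	/* models the 32-bit nesting depth */
	bool suspended;		/* models the heartbeat-suspended flag */
};

/* Enter a suspended region; only the outermost call sets the flag. */
static void
sketch_suspend(struct sketch_cpu *ci)
{
	if (ci->suspend_depth++ == 0)
		ci->suspended = true;
}

/* Leave a suspended region; only the outermost call clears the flag. */
static void
sketch_resume(struct sketch_cpu *ci)
{
	if (ci->suspend_depth == 0)
		return;		/* unbalanced resume: ignored in this sketch */
	if (--ci->suspend_depth == 0)
		ci->suspended = false;
}
```

With this shape, a caller that suspends inside an already suspended region does not accidentally re-enable heartbeat checks when it resumes.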
|
1.9 |
| 02-Sep-2023 |
riastradh | heartbeat(9): Move panicstr check into the IPI itself.
We can't return early from defibrillate because the IPI may have yet to run -- we can't return until the other CPU is definitely done using the ipi_msg_t we created on the stack.
We should avoid calling panic again on the patient CPU in case it was already in the middle of a panic, so that we don't re-enter panic while, e.g., trying to print a stack trace.
Sprinkle some comments.
|
1.8 |
| 02-Sep-2023 |
riastradh | heartbeat(9): More detail about manual test success criteria.
Changes comments only, no functional change.
|
1.7 |
| 02-Sep-2023 |
riastradh | heartbeat(9): Ignore stale tc if primary CPU heartbeat is suspended.
The timecounter ticks only on the primary CPU, so of course it will go stale if it's suspended.
(It is, perhaps, a mistake that it only ticks on the primary CPU, even if the primary CPU is offlined or in a polled-input console loop, but that's a separate issue.)
|
1.6 |
| 02-Sep-2023 |
riastradh | heartbeat(9): New flag SPCF_HEARTBEATSUSPENDED.
This way we can suspend heartbeats on a single CPU while the console is in polling mode, not just when the CPU is offlined. This should be rare, so it's not _convenient_, but it should enable us to fix polling-mode console input when the hardclock timer is still running on other CPUs.
|
1.5 |
| 16-Jul-2023 |
riastradh | heartbeat(9): For now, use time_uptime without atomic_load_relaxed.
A later commit will change time_uptime to a macro so it is atomic, using atomic_load_relaxed if possible or seqlock if not.
|
1.4 |
| 16-Jul-2023 |
riastradh | heartbeat(9): Avoid xcall(9) while cold.
|
1.3 |
| 08-Jul-2023 |
riastradh | curcpu_stable(9): New function for asserting curcpu() is stable.
|
1.2 |
| 07-Jul-2023 |
riastradh | heartbeat(9): Test whether curcpu is stable, not kpreempt_disabled.
kpreempt_disabled worked for my testing because I tested on aarch64, which doesn't have kpreemption.
XXX Should move curcpu_stable() to somewhere that other things can use it.
|
1.1 |
| 07-Jul-2023 |
riastradh | heartbeat(9): New mechanism to check progress of kernel.
This uses hard interrupts to check progress of low-priority soft interrupts, and one CPU to check progress of another CPU.
If no progress has been made after a configurable number of seconds (kern.heartbeat.max_period, default 15), then the system panics -- preferably on the CPU that is stuck so we get a stack trace in dmesg of where it was stuck, but if the stuckness was detected by another CPU and the stuck CPU doesn't acknowledge the request to panic within one second, the detecting CPU panics instead.
This doesn't supplant hardware watchdog timers. It is possible for hard interrupts to be stuck on all CPUs for some reason too; in that case heartbeat(9) has no opportunity to complete.
Downside: heartbeat(9) relies on hardclock to run at a reasonably consistent rate, which might cause trouble for the glorious tickless future. However, it could be adapted to take a parameter for an approximate number of units that have elapsed since the last call on the current CPU, rather than treating that as a constant 1.
XXX kernel revbump -- changes struct cpu_info layout
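Illustrative note (not part of the commit): the idea of a high-priority context checking the progress of a low-priority one can be sketched as a counter that the low-priority side bumps and the tick handler inspects. This is a user-space model under stated assumptions, not the kernel implementation; struct sketch_hb, sketch_progress and sketch_tick are hypothetical names, and the hz and max_period parameters stand in for the clock rate and kern.heartbeat.max_period.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-CPU heartbeat state for this sketch. */
struct sketch_hb {
	uint32_t count;		/* bumped by the low-priority context */
	uint32_t seen;		/* value of count at the last check */
	uint32_t stalled_ticks;	/* ticks since count last changed */
};

/* Called from the low-priority context each time it gets to run. */
static void
sketch_progress(struct sketch_hb *hb)
{
	hb->count++;
}

/*
 * Called once per timer tick from the high-priority context.
 * Returns true when no progress has been seen for more than
 * max_period seconds, the point at which the real mechanism
 * would panic.
 */
static bool
sketch_tick(struct sketch_hb *hb, uint32_t hz, uint32_t max_period)
{
	if (hb->count != hb->seen) {
		/* Progress was made since the last check. */
		hb->seen = hb->count;
		hb->stalled_ticks = 0;
		return false;
	}
	hb->stalled_ticks++;	/* treats each call as exactly one tick */
	return hb->stalled_ticks > hz * max_period;
}
```

The increment by a constant 1 is the assumption the commit message flags as a downside: a tickless variant would instead add an estimate of the units elapsed since the previous call.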
|