History log of /src/sys/kern/kern_heartbeat.c
Revision  Date  Author  Comments
 1.14  25-Aug-2024  riastradh heartbeat(9): Use the cheaper and equally safe time_uptime32.

Since we cache this every 15sec, and check it within a tick, there's
no way for this to wrap around without first triggering a heartbeat
panic. So just use time_uptime32, the low 32 bits of the number of
seconds of uptime -- cheaper on LP32 platforms.

PR kern/58633: heartbeat(9) makes unnecessary use of time_uptime
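
The wraparound argument above rests on unsigned 32-bit subtraction being modulo 2^32. A minimal sketch of that arithmetic follows; `elapsed32` and `heartbeat_stale` are hypothetical names for illustration, not functions in kern_heartbeat.c, and the strict `>` comparison is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not the kernel's code): seconds elapsed between
 * a cached 32-bit uptime stamp and the current 32-bit uptime.  The cast
 * reduces the difference modulo 2^32, so the result is still correct if
 * the counter wrapped once between the two readings.
 */
static uint32_t
elapsed32(uint32_t now, uint32_t cached)
{
	return (uint32_t)(now - cached);
}

/*
 * Hypothetical staleness test: true if more than max_period seconds
 * have passed since the cached stamp was taken.
 */
static bool
heartbeat_stale(uint32_t now, uint32_t cached, uint32_t max_period)
{
	return elapsed32(now, cached) > max_period;
}
```

With a stamp cached 5 seconds before the counter wraps and read back 5 seconds after, `elapsed32` still reports 10, so a period of 15 is not exceeded.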
 1.13  08-Mar-2024  riastradh heartbeat(9): Return early if panicstr is set.

This way we avoid doing unnecessary work -- and print unnecessary
messages -- to _not_ trigger another panic anyway.

PR kern/58011
 1.12  28-Feb-2024  riastradh heartbeat(9): Restore still-applicable comment nixed in last commit.

The nesting depth is stored in ci_heartbeat_suspend which is 32-bit.
 1.11  28-Feb-2024  riastradh heartbeat(9): No kpreempt_disable/enable in heartbeat_suspend/resume.

This causes a leak of l_nopreempt in xc_thread when a CPU is offlined
and onlined again, because the offlining heartbeat_suspend and the
onlining heartbeat_resume happen in separate xcalls.

No change to callers because they are already bound to the CPU:

1. cnpollc does kpreempt_disable/enable itself around the calls to
heartbeat_suspend/resume anyway

2. cpu_xc_offline/online run in the xcall thread, which is always
bound to the CPU that is being offlined or onlined
 1.10  06-Sep-2023  riastradh heartbeat(9): Make heartbeat_suspend/resume nestable.

And make them bind to the CPU as a side effect, instead of requiring
the caller to have already done so.

This lets us eliminate the assertions so we can use them in ddb even
when things are going haywire and we just want to get diagnostics.

XXX kernel revbump -- struct cpu_info change
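
Nestable suspend/resume can be pictured as a per-CPU depth counter. The sketch below is only an illustration under that assumption: `struct fake_cpu` and the `fake_*` functions are hypothetical stand-ins, not struct cpu_info or the real heartbeat_suspend/resume, and the CPU-binding side effect described above is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-CPU state; not struct cpu_info. */
struct fake_cpu {
	uint32_t suspend_depth;	/* number of outstanding suspends */
};

/* Each suspend increments the depth, so calls may nest freely. */
static void
fake_suspend(struct fake_cpu *ci)
{
	ci->suspend_depth++;
}

/* Each resume undoes exactly one suspend. */
static void
fake_resume(struct fake_cpu *ci)
{
	ci->suspend_depth--;
}

/* Heartbeat checks stay off until every suspend has been resumed. */
static bool
fake_suspended(const struct fake_cpu *ci)
{
	return ci->suspend_depth != 0;
}
```

Two nested suspends followed by one resume leave the CPU suspended; only the second resume re-enables checking.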
 1.9  02-Sep-2023  riastradh heartbeat(9): Move panicstr check into the IPI itself.

We can't return early from defibrillate because the IPI may have yet
to run -- we can't return until the other CPU is definitely done
using the ipi_msg_t we created on the stack.

We should avoid calling panic again on the patient CPU in case it was
already in the middle of a panic, so that we don't re-enter panic
while, e.g., trying to print a stack trace.

Sprinkle some comments.
 1.8  02-Sep-2023  riastradh heartbeat(9): More detail about manual test success criteria.

Changes comments only, no functional change.
 1.7  02-Sep-2023  riastradh heartbeat(9): Ignore stale tc if primary CPU heartbeat is suspended.

The timecounter ticks only on the primary CPU, so of course it will
go stale if it's suspended.

(It is, perhaps, a mistake that it only ticks on the primary CPU,
even if the primary CPU is offlined or in a polled-input console
loop, but that's a separate issue.)
 1.6  02-Sep-2023  riastradh heartbeat(9): New flag SPCF_HEARTBEATSUSPENDED.

This way we can suspend heartbeats on a single CPU while the console
is in polling mode, not just when the CPU is offlined. This should
be rare, so it's not _convenient_, but it should enable us to fix
polling-mode console input when the hardclock timer is still running
on other CPUs.
 1.5  16-Jul-2023  riastradh heartbeat(9): For now, use time_uptime without atomic_load_relaxed.

A later commit will change time_uptime to a macro so it is atomic,
using atomic_load_relaxed if possible or seqlock if not.
 1.4  16-Jul-2023  riastradh heartbeat(9): Avoid xcall(9) while cold.
 1.3  08-Jul-2023  riastradh curcpu_stable(9): New function for asserting curcpu() is stable.
 1.2  07-Jul-2023  riastradh heartbeat(9): Test whether curcpu is stable, not kpreempt_disabled.

kpreempt_disabled worked for my testing because I tested on aarch64,
which doesn't have kpreemption.

XXX Should move curcpu_stable() to somewhere that other things can
use it.
 1.1  07-Jul-2023  riastradh heartbeat(9): New mechanism to check progress of kernel.

This uses hard interrupts to check progress of low-priority soft
interrupts, and one CPU to check progress of another CPU.

If no progress has been made after a configurable number of seconds
(kern.heartbeat.max_period, default 15), then the system panics --
preferably on the CPU that is stuck so we get a stack trace in dmesg
of where it was stuck, but if the stuckness was detected by another
CPU and the stuck CPU doesn't acknowledge the request to panic within
one second, the detecting CPU panics instead.

This doesn't supplant hardware watchdog timers. It is possible for
hard interrupts to be stuck on all CPUs for some reason too; in that
case heartbeat(9) has no opportunity to complete.

Downside: heartbeat(9) relies on hardclock to run at a reasonably
consistent rate, which might cause trouble for the glorious tickless
future. However, it could be adapted to take a parameter for an
approximate number of units that have elapsed since the last call on
the current CPU, rather than treating that as a constant 1.

XXX kernel revbump -- changes struct cpu_info layout
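
The progress check described above can be sketched as a toy model: a monitored party bumps a counter, and a checker panics (here, just returns true) if the counter has not moved for longer than the configured period. All names are hypothetical, the strict `>` comparison is an assumption, and the real code's interrupt levels, cross-CPU IPIs, and one-second acknowledgement timeout are not modeled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical state for one monitored party (e.g. one CPU). */
struct hb_state {
	uint32_t count;		/* bumped whenever progress is made */
	uint32_t seen_count;	/* count value last observed by the checker */
	uint32_t seen_time;	/* uptime in seconds when seen_count last changed */
};

/* Called by the monitored party to record that it made progress. */
static void
hb_progress(struct hb_state *hb)
{
	hb->count++;
}

/*
 * Called periodically by the checker.  Returns true if no progress was
 * observed for more than max_period seconds, i.e. the party looks stuck.
 */
static bool
hb_check(struct hb_state *hb, uint32_t now, uint32_t max_period)
{
	if (hb->count != hb->seen_count) {
		/* Progress since the last check: refresh the cached stamp. */
		hb->seen_count = hb->count;
		hb->seen_time = now;
		return false;
	}
	return (uint32_t)(now - hb->seen_time) > max_period;
}
```

With max_period set to 15, a party that last made progress at second 10 is reported stuck from second 26 onward, and is cleared again as soon as it calls `hb_progress`.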
