History log of /src/sys/kern/subr_pool.c
Revision | Date | Author | Comments
 1.295  26-May-2025  bouyer Never call pr_drain_hook from pool_allocator_alloc().
In the PR_WAITOK case it's called from pool_reclaim().
In the !PR_WAITOK case we're holding the pool lock, and if the drain hook
wants kernel_lock we may deadlock with another thread holding
kernel_lock and calling pool_get().
Fixes PR kern/59411
 1.294  16-May-2025  bouyer Revert previous, requested by riastradh@
One possible fix for kern/59411 makes PR_GROWINGNOWAIT useful again.
 1.293  09-May-2025  bouyer pool_grow(): The thread setting PR_GROWINGNOWAIT holds the pr_lock and
should not release it before clearing PR_GROWINGNOWAIT because it's called
with !PR_WAITOK. No other thread should see PR_GROWINGNOWAIT while holding
pr_lock, so PR_GROWINGNOWAIT looks useless and can probably be removed.
For now, only KASSERT that PR_GROWINGNOWAIT is never seen, to make sure.
Note that in the PR_GROWINGNOWAIT case we would exit/reenter pr_lock
while we don't have PR_WAITOK, which is probably wrong too.
 1.292  07-Dec-2024  chs pool: fix pool_sethiwat() to actually do something

The change that I made to the pool code back in April 2020
("slightly change and fix the semantics of pool_set*wat()" ...)
accidentally broke pool_sethiwat() by making it have no effect.

This was discovered after the crash reported in PR 58666 was fixed.
The same machine (32-bit, with 10GB RAM) would hang due to the buffer
cache causing the system to run out of kernel virtual space. The
buffer cache uses a separate pool for buffer data for each power of 2
between DEV_BSIZE and MAXBSIZE, and if the usage pattern of buffer
sizes changes then memory has to be moved between the different pools
in order to create buffers of the new size. The buffer cache handles
this by using pool_sethiwat() to cause memory freed from the buffer
cache back to the pools to not be cached in the buffer cache pools but
instead be freed back to the pools' back-end allocator (which
allocates from the low-level kva allocator) as soon as possible. But
since pool_sethiwat() wasn't doing anything, memory would stay cached
in some buffer cache pools and starve other buffer cache pools (and a
few other pools that do not use the kmem layer for memory allocation).

Fix pool_sethiwat() to do what it is supposed to do again.
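As a hedged illustration of the restored behaviour (the pool name and water mark are hypothetical, not the actual buffer cache code), a consumer asks that free items above the high-water mark be returned to the back-end allocator instead of staying cached:

    #include <sys/param.h>
    #include <sys/pool.h>

    static struct pool example_pool;        /* hypothetical per-size buffer pool */

    void
    example_pool_init(void)
    {
            /* A NULL allocator selects the default back-end allocator. */
            pool_init(&example_pool, 2048, 0, 0, 0, "example2k", NULL, IPL_NONE);

            /*
             * Cache at most this many free items; items freed beyond the
             * mark go straight back to the back-end allocator, which is
             * the effect pool_sethiwat() is meant to (and again does) have.
             */
            pool_sethiwat(&example_pool, 16);
    }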
 1.291  07-Dec-2024  chs pool: use "big" (ie. > PAGE_SIZE) default allocators for more cases

When I added the default "big" pool allocators back in 2017,
I added them only for pool_caches and not plain pools, and only for
IPL_NONE pool_caches at that. But these allocators work fine
for all pool caches and plain pools as well, so use them automatically
by default when needed for all of those cases.
 1.290  09-Apr-2023  riastradh pool(9): Tweak branch prediction in pool_cache_get_paddr assertion.

No functional change intended.
 1.289  09-Apr-2023  riastradh pool(9): Simplify assertion in pool_update_curpage.

Add message while here.
 1.288  09-Apr-2023  riastradh kern: KASSERT(A && B) -> KASSERT(A); KASSERT(B)
 1.287  24-Feb-2023  riastradh kern: Eliminate most __HAVE_ATOMIC_AS_MEMBAR conditionals.

I'm leaving in the conditional around the legacy membar_enters
(store-before-load, store-before-store) in kern_mutex.c and in
kern_lock.c because they may still matter: store-before-load barriers
tend to be the most expensive kind, so eliding them is probably
worthwhile on x86. (It also may not matter; I just don't care to do
measurements right now, and it's a single valid and potentially
justifiable use case in the whole tree.)

However, membar_release/acquire can be mere instruction barriers on
all TSO platforms including x86, so there's no need to go out of our
way with a bad API to conditionalize them. If the procedure call
overhead is measurable we could just change them to be macros on x86
that expand into __insn_barrier.

Discussed on tech-kern:
https://mail-index.netbsd.org/tech-kern/2023/02/23/msg028729.html
 1.286  17-Feb-2023  skrll Avoid undefined behaviour.
 1.285  16-Jul-2022  simonb branches: 1.285.4;
Use 64-bit math to calculate pool sizes. Fixes overflow errors for
pools larger than 4GB and gives the correct output for kernel pool pages
in "vmstat -s" output.
 1.284  29-May-2022  andvar fix various typos in comments and log messages.
 1.283  24-May-2022  andvar fix various typos in comments, docs and log messages.
 1.282  09-Apr-2022  riastradh pool(9): Convert membar_exit to membar_release.
 1.281  27-Feb-2022  riastradh pool(9): Membar audit.

- Use atomic_store_release and atomic_load_consume for associating a
freshly constructed pool_cache with its underlying pool. The pool
gets published in various ways before the pool cache is fully
constructed.

=> Nix membar_sync -- no store-before-load is needed here.

- Take pool_head_lock around sysctl kern.pool TAILQ_FOREACH. Then take
a reference count, and drop the lock, around copyout.

=> Otherwise, pools could be partially initialized or freed while
we're still trying to read from them -- and in the worst case,
we might see a corrupted view of the tailq.

=> If we kept the lock around copyout, this could deadlock in memory
allocation.

=> If we didn't take a reference count while releasing the lock, the
pool could be destroyed while we're trying to traverse the list,
sending us into oblivion instead of the next element.
 1.280  24-Dec-2021  riastradh pool(9): Fix default PR_NOALIGN for large pool caches.

Was broken in recent change to separate some pool cache flags from
pool flags.

Fixes crash in zfs.
 1.279  22-Dec-2021  thorpej Do the last change differently:

Instead of having a pre-destruct hook, put knowledge of passive
serialization into the pool allocator directly, enabled by PR_PSERIALIZE
when the pool / pool_cache is initialized. This will guarantee that
a passive serialization barrier will be performed before the object's
destructor is called, or before the page containing the object is freed
back to the system (in the case of no destructor). Note that the internal
allocator overhead is different when PR_PSERIALIZE is used (it implies
PR_NOTOUCH, because the objects must remain in a valid state).

In the DRM Linux API shim, this allows us to remove the custom page
allocator for SLAB_TYPESAFE_BY_RCU.
 1.278  21-Dec-2021  thorpej Add pool_cache_setpredestruct(), which allows a pool cache to specify
a function to be called before the destructor for a batch of one or more
objects is called. This can be used as a synchronization point by
subsystems that rely on the type-stable nature of pool cache objects or
subsystems that use other forms of passive serialization.
 1.277  25-Jul-2021  simonb Add accessor functions to get the number of gets and puts on pools and
pool caches.
 1.276  24-Feb-2021  mrg branches: 1.276.4;
skip the redzone on pools where the allocation (including all overhead)
is greater than half the pool pagesize.

this stops 4KiB being used per allocation from the kmem-02048 pool,
and 64KiB per allocation from the buf32k pool.

we're still wasting 1/4 of space for overhead on eg, the buf1k or
kmem-01024 pools. however, including overhead costs, the amount of
useless space (not used by consumer or overhead) reduces from 47%
to 18%, so this is far less bad overall.


there are a couple of ideas for solving this in a less ugly way:

- pool redzones are enabled with DIAGNOSTIC kernels, which is
defined as being "fast, cheap". this is not cheap (though it
is relatively fast if you don't run out of memory) so it does
not really belong here as is, but DEBUG or a special option
would work for it.

- if we increase the "pool page" size for these pools, such that
the overhead over pool page is reduced to 5% or less, we can
have redzones for more allocations without using more space.


also, see this thread:

https://mail-index.netbsd.org/tech-kern/2021/02/23/msg027130.html
 1.275  19-Dec-2020  mrg ddb: add two new modifiers to "show pool" and "show all pools"

- /s shows a short single-line per pool list (the normal output
is about 10 lines per.)
- /S skips pools with zero allocations.
 1.274  05-Sep-2020  riastradh branches: 1.274.2;
Suppress pool redzone message unless booted with debug.
 1.273  19-Jun-2020  jdolecek bump the limit on max item size for pool_init()/pool_cache_init() up
to 1 << 24, so that the pools can be used for ZFS block allocations, which
are up to SPA_MAXBLOCKSHIFT (1 << 24)

part of PR kern/55397 by Frank Kardel
 1.272  14-Jun-2020  ad Arithmetic error in previous.
 1.271  14-Jun-2020  ad pool_cache:

- make all counters per-CPU and make cache layer do its work with atomic ops.
- conserve memory by caching empty groups globally.
 1.270  07-Jun-2020  maxv Add fault(4).
 1.269  06-Jun-2020  maxv kMSan: re-set the orig after pool_cache_get_slow(), using the address of
the caller of pool_cache_get_paddr().

Otherwise the orig is just pool_cache_get_paddr(), and that's not really
useful for debugging.
 1.268  15-Apr-2020  maxv Introduce POOL_NOCACHE, simple option to cancel pool_caches and go directly
to the pool layer. It is taken out of POOL_QUARANTINE.

Advertise POOL_NOCACHE for kMSan rather than POOL_QUARANTINE. With kMSan
we are only interested in the no-caching effect, not the quarantine. This
reduces memory pressure on kMSan kernels.
 1.267  13-Apr-2020  chs slightly change and fix the semantics of pool_set*wat(), pool_sethardlimit()
and pool_prime() (and their pool_cache_* counterparts):

- the pool_set*wat() APIs are supposed to specify thresholds for the count of
free items in the pool before pool pages are automatically allocated or freed
during pool_get() / pool_put(), whereas pool_sethardlimit() and pool_prime()
are supposed to specify minimum and maximum numbers of total items
in the pool (both free and allocated). these were somewhat conflated
in the existing code, so separate them as they were intended.

- change pool_prime() to take an absolute number of items to preallocate
rather than an increment over whatever was done before, and wait for
any memory allocations to succeed. since pool_prime() can no longer fail
after this, change its return value to void and adjust all callers.

- pool_setlowat() is documented as not immediately attempting to allocate
any memory, but it was changed some time ago to immediately try to allocate
up to the lowat level, so just fix the manpage to describe the current
behaviour.

- add a pool_cache_prime() to complete the API set.
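A hedged sketch of the separation described above (the pool name and numbers are made up; the function names are the pool(9)/pool_cache(9) entry points being adjusted):

    #include <sys/pool.h>

    static struct pool frob_pool;           /* hypothetical */

    void
    frob_pool_tune(void)
    {
            /* Thresholds on the number of *free* items kept in the pool. */
            pool_setlowat(&frob_pool, 16);
            pool_sethiwat(&frob_pool, 256);

            /* Limits on the *total* number of items, free and allocated. */
            pool_prime(&frob_pool, 64);     /* absolute count; returns void now */
            pool_sethardlimit(&frob_pool, 1024, "frob_pool limit reached", 60);
    }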
 1.266  08-Feb-2020  maxv branches: 1.266.4;
Retire KLEAK.

KLEAK was a nice feature and served its purpose; it allowed us to detect
dozens of info leaks on the kernel->userland boundary, and thanks to it we
tackled a good part of the infoleak problem 1.5 years ago.

Nowadays however, we have kMSan, which can detect uninitialized memory in
the kernel. kMSan supersedes KLEAK: it can detect what KLEAK was able to
detect, but in addition, (1) it operates in all of the kernel and not just
the kernel->userland boundary, (2) it requires no user interaction, and (3)
it is deterministic and not statistical.

That makes kMSan the feature of choice to detect info leaks nowadays;
people interested in detecting info leaks should boot a kMSan kernel and
just wait for the magic to happen.

KLEAK was a good ride, and a fun project, but now is time for it to go.

Discussed with several people, including Thomas Barabosch.
 1.265  19-Jan-2020  chs fix assertions about when it is ok for pool_get() to return NULL.
 1.264  27-Dec-2019  maxv branches: 1.264.2;
Switch to panic, and make the message more useful.
 1.263  03-Dec-2019  riastradh Use __insn_barrier to enforce ordering in l_ncsw loops.

(Only need ordering observable by interruption, not by other CPUs.)
 1.262  14-Nov-2019  maxv Add support for Kernel Memory Sanitizer (kMSan). It detects uninitialized
memory used by the kernel at run time, and just like kASan and kCSan, it
is an excellent feature. It has already detected 38 uninitialized variables
in the kernel during my testing, which I have since discreetly fixed.

We use two shadows:
- "shad", to track uninitialized memory with a bit granularity (1:1).
Each bit set to 1 in the shad corresponds to one uninitialized bit of
real kernel memory.
- "orig", to track the origin of the memory with a 4-byte granularity
(1:1). Each uint32_t cell in the orig indicates the origin of the
associated uint32_t of real kernel memory.

The memory consumption of these shadows is substantial, so at least 4GB of
RAM is recommended to run kMSan.

The compiler inserts calls to specific __msan_* functions on each memory
access, to manage both the shad and the orig and detect uninitialized
memory accesses that change the execution flow (like an "if" on an
uninitialized variable).

We mark as uninit several types of memory buffers (stack, pools, kmem,
malloc, uvm_km), and check each buffer passed to copyout, copyoutstr,
bwrite, if_transmit_lock and DMA operations, to detect uninitialized memory
that leaves the system. This allows us to detect kernel info leaks in a way
that is more efficient and also more user-friendly than KLEAK.

Unlike kASan, kMSan requires comprehensive coverage, i.e. we cannot
tolerate having one non-instrumented function, because this could cause
false positives. kMSan cannot instrument ASM functions, so I converted
most of them to __asm__ inlines, which kMSan is able to instrument. Those
that remain receive special treatment.

Again unlike kASan, kMSan uses a TLS, so we must context-switch this
TLS during interrupts. We use different contexts depending on the interrupt
level.

The orig tracks precisely the origin of a buffer. We use a special encoding
for the orig values, and pack together in each uint32_t cell of the orig:
- a code designating the type of memory (Stack, Pool, etc), and
- a compressed pointer, which points either (1) to a string containing
the name of the variable associated with the cell, or (2) to an area
in the kernel .text section which we resolve to a symbol name + offset.

This encoding allows us not to consume extra memory for associating
information with each cell, and produces a precise output, that can tell
for example the name of an uninitialized variable on the stack, the
function in which it was pushed on the stack, and the function where we
accessed this uninitialized variable.

kMSan is available with LLVM, but not with GCC.

The code is organized in a way that is similar to kASan and kCSan, so it
means that other architectures than amd64 can be supported.
 1.261  16-Oct-2019  christos Add and use __FPTRCAST, requested by uwe@
 1.260  16-Oct-2019  christos Add void * function pointer casts. There are different ways to "fix" those
warnings:
1. this one: add a void * cast (which I think is the least intrusive)
2. add pragmas to elide the warning
3. add intermediate inline conversion functions
4. change the called function prototypes, adding unused arguments and
converting some of the pointer arguments to void *.
5. make the functions variadic (which defeats the purpose of checking)
6. pass command line flags to elide the warning
I did try 3 and 4 and I was not pleased with the result (sys_ptrace_common.c):
(3) added too much code and defines, and (4) made the regular use clumsy.
 1.259  23-Sep-2019  skrll Enable POOL_REDZONE with DIAGNOSTIC.

The bug in the arm pmap was fixed long ago.
 1.258  06-Sep-2019  maxv Reorder for clarity, and localify pool_allocator_big[], should not be used
outside.
 1.257  26-Aug-2019  maxv Revert r1.254, put back || for KASAN, some destructors like lwp_dtor()
caused false positives. Needs more work.
 1.256  17-Aug-2019  maxv Kernel Heap Hardening: use bitmaps on all off-page pools. This migrates 29
MI pools on amd64 from linked lists to bitmaps, which have higher security
properties.

Then, change the computation of the size of the PH pools: take into account
the bitmap area available by default in the ph_u2 union, and don't go with
&phpool[>0] if &phpool[0] already has enough space to embed a bitmap.

The pools that are migrated in this change all use bitmaps small enough to
fit in &phpool[0], therefore there is no increase in memory consumption.
 1.255  16-Aug-2019  maxv Initialize pp->pr_redzone to false. For some reason with KUBSAN GCC does
not eliminate the unused branch in pr_item_linkedlist_put(), and this
leads to an uninitialized access in that unused branch, which triggers KUBSAN messages.
 1.254  03-Aug-2019  maxv Replace || by && in KASAN, to increase the pool coverage.

Strictly speaking, what we want to avoid is poisoning buffers that were
referenced in a global list as part of the ctor. But, if a buffer indeed
got referenced as part of the ctor, it necessarily has to be unreferenced
in the dtor; which implies it has to have a dtor. So we want both a ctor
and a dtor, and not just one of them.

Note that POOL_QUARANTINE already implicitly provides this increased
coverage.
 1.253  02-Aug-2019  maxv Kernel Heap Hardening: perform certain sanity checks on the pool caches
directly, to immediately detect certain bugs that would otherwise have
been detected only later on the pool layer, if the buffer ever reached
the pool layer.
 1.252  29-Jun-2019  maxv branches: 1.252.2;
The big pool allocators use pool_page_alloc(), which allocates page-aligned
storage. So if we switch to a big pool, set PR_NOALIGN, because the address
of the storage is not aligned to the item size.

Should fix PR/54319.
 1.251  13-Jun-2019  christos make pool assertion messages consistent.
 1.250  09-May-2019  skrll Avoid KASSERT(!cpu_intr_p()) when breaking into ddb and issuing

show uvmexp
 1.249  13-Apr-2019  maxv Introduce POOL_QUARANTINE, a feature that creates a window during which a
freed buffer cannot be reallocated. This greatly helps detecting
use-after-frees, because they are not short-lived anymore.

We maintain a per-pool fifo of 128 buffers. On each pool_put, we do a real
free of the oldest buffer, and insert the new buffer. Before insertion, we
mark the buffer as invalid with KASAN. On each pool_cache_put, we destruct
the object, so it lands in pool_put, and the quarantine is handled there.

POOL_QUARANTINE can be used in conjunction with KASAN to detect more
use-after-free bugs.
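Illustrative pseudocode of the per-pool FIFO described above (not the actual subr_pool.c implementation; the KASAN marking of the incoming buffer is left out):

    #define POOL_QUARANTINE_DEPTH   128     /* 128 quarantined buffers per pool */

    struct pool_quarantine {
            uintptr_t pq_fifo[POOL_QUARANTINE_DEPTH];
            size_t    pq_head;              /* oldest slot, recycled next */
    };

    /*
     * On each pool_put(): park the newly freed buffer in the FIFO and
     * return the oldest quarantined buffer, which is the one that really
     * gets freed.  Returns 0 while the FIFO is still filling up.
     */
    static uintptr_t
    pool_quarantine_rotate(struct pool_quarantine *pq, uintptr_t freed)
    {
            uintptr_t oldest = pq->pq_fifo[pq->pq_head];

            pq->pq_fifo[pq->pq_head] = freed;
            pq->pq_head = (pq->pq_head + 1) % POOL_QUARANTINE_DEPTH;
            return oldest;
    }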
 1.248  07-Apr-2019  maxv Provide a code argument in kasan_mark(), and give a code to each caller.
Five codes used: GenericRedZone, MallocRedZone, KmemRedZone, PoolRedZone,
and PoolUseAfterFree.

This can greatly help debugging complex memory corruptions.
 1.247  07-Apr-2019  maxv Fix tiny race in pool+KASAN, that resulted in occasional false positives.

We were uselessly marking already valid areas as valid. When doing that,
our KASAN code emits two calls to kasan_markmem, and there is a very small
window where the area becomes invalid. So, if the area happens to be
already globally referenced, and if another thread happens to read the
buffer via this reference, we get a false positive.

This happens only with pool_caches that have a pc_ctor that creates a
global reference to the buffer, and there is one single pool_cache that
does that: 'file_cache'.

So now, two changes:

- In pool_cache_get_slow(), the pool_get() has already redzoned the
object, so no need to call pool_redzone_fill().

- In pool_cache_destruct_object1(), don't re-mark the object. If there is
no ctor pool_put is fine with already-invalid objects, if there is a
ctor the object was not marked as invalid in the first place; so in
either case, the re-marking is not needed.

Fixes PR/53674. Although very rare and difficult to reproduce, a local
quarantine patch of mine made the false positives recurrent.
 1.246  28-Mar-2019  maxv Move pnbuf_cache into vfs_init.c, where it belongs.
 1.245  27-Mar-2019  maxv Kernel Heap Hardening: detect frees-in-wrong-pool on on-page pools. The
detection is already implicitly done for off-page pools.

We recycle pr_slack (unused) in struct pool, and make ph_node a union in
order to recycle an unsigned int in struct pool_item_header. Each time a
pool is created we atomically increase a global counter, and register the
current value in pp. We then propagate this value in each ph, and ensure
they match in pool_put.

This can catch several classes of kernel bugs and basically makes them
unexploitable. It comes with no increase in memory usage and no measurable
increase in CPU cost (nonexistent cost actually, just one check predicted
false).
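A hedged sketch of the resulting check (the pr_poolid/ph_poolid field names are assumptions for illustration; only the mechanism comes from the description above):

    #include <sys/atomic.h>

    static unsigned int pool_serial;        /* bumped once per pool created */

    /*
     * pool_init() records pp->pr_poolid = atomic_inc_uint_nv(&pool_serial),
     * and every page header inherits it: ph->ph_poolid = pp->pr_poolid.
     * pool_put() then needs just one predicted-false comparison:
     */
    static void
    pool_put_check_owner(const struct pool *pp, const struct pool_item_header *ph)
    {
            if (__predict_false(ph->ph_poolid != pp->pr_poolid))
                    panic("%s: [%s] item freed to the wrong pool",
                        __func__, pp->pr_wchan);
    }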
 1.244  26-Mar-2019  maxv Remove POOL_SUBPAGE, it is unused, undocumented, and adds confusion.
 1.243  18-Mar-2019  maxv Kernel Heap Hardening: manage freed items with bitmaps rather than linked
lists when we're on-page and the page header is naturally big enough to
contain a bitmap.

This comes with no increase in memory consumption, and similar CPU cost
(maybe it's a little faster actually).

We want to favor bitmaps over linked lists, because linked lists install
kernel pointers inside the items, and this can be too easily exploitable
in use-after-free or double-free conditions, or in item buffer overflows
occurring within a pool page.
 1.242  17-Mar-2019  maxv Introduce a new flag, PR_USEBMAP, that indicates whether the pool uses a
bitmap to manage freed items. It dissociates PR_NOTOUCH from bitmaps, but
for now is set only when PR_NOTOUCH is set, which reproduces the current
behavior. Therefore, no functional change. Also clarify the code.
 1.241  17-Mar-2019  maxv Kernel Heap Hardening: put the pool header at the beginning of the backing
page, not at the end of it.

This makes it harder to exploit buffer overflows, because it eliminates the
certainty that sensitive kernel data is located after the item space and is
therefore overwritable.

The pr_itemoffset field is recycled, and holds the (aligned) offset of the
item space. The pr_phoffset field becomes unused. We align 'itemspace' for
clarity, but it's not strictly necessary.

This comes with no performance cost or increase in memory usage, in
particular the potential padding consumed by roundup(PHSIZE, align) was
already implicitly consumed before, because of the (necessary) truncations
in the divisions. Now it's just more explicit, but not bigger.
 1.240  17-Mar-2019  maxv Move some code into a separate function, and explain a bit. Also define
PHSIZE. No functional change.
 1.239  17-Mar-2019  maxv cosmetic
 1.238  17-Mar-2019  maxv Prepare the removal of the 'ioff' argument: add a KASSERT to ensure it is
zero, and remove the internal logic. The pool code is simpler now.
 1.237  16-Mar-2019  maxv Misc changes:

- Turn two KASSERTs to real panics, they are useful and not expensive.
- Rename a few variables for clarity.
- Add a new panic, to make sure a freed item is in the item space.
 1.236  13-Mar-2019  maxv style
 1.235  11-Mar-2019  maxv Add sanity check: make sure we retrieve a valid item header, by checking
its page address against the one we computed. If there's a mismatch it
means the buffer does not belong to the pool, and we panic.
 1.234  11-Mar-2019  maxv Rename pr_item_notouch_* to pr_item_bitmap_*, and move some code into new
pr_item_linkedlist_* functions. This makes it easier to see that we have
two ways of handling freed items.

No functional change.
 1.233  11-Feb-2019  maxv Fix previous, pr_size includes the KASAN redzone. Repurpose pr_reqsize and
use it for PR_ZERO, it holds the size requested by the user with no padding
or redzone added, and only these bytes should be zeroed.
 1.232  10-Feb-2019  christos Introduce PR_ZERO to avoid open-coding memset()s everywhere. OK riastradh@.
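For example (pool and variable names hypothetical), the flag replaces the open-coded pattern:

    /* Before: zero by hand after pool_get(). */
    obj = pool_get(&frob_pool, PR_WAITOK);
    memset(obj, 0, sizeof(*obj));

    /* After: the pool zeroes the item (only the requested size, see 1.233). */
    obj = pool_get(&frob_pool, PR_WAITOK | PR_ZERO);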
 1.231  23-Dec-2018  maxv Simplify the KASAN API, use only kasan_mark() and explain briefly. The
alloc/free naming was too confusing.
 1.230  23-Dec-2018  maxv Remove useless debugging code, the area is completely filled but it's not
checked afterwards, only pi_magic is.
 1.229  16-Dec-2018  maxv Add support for detecting use-after-frees in KASAN. We poison each freed
buffer, any subsequent read or write will be detected as illegal.

* Add POOL_CHECK_MAGIC, which is disabled under KASAN, because the same
detection is done in a better way.

* Register the size+redzone in the pool structure, to reduce the overhead.

* Fix the CTOR/DTOR check in KLEAK, the fields are never NULL.
 1.228  02-Dec-2018  maxv Introduce KLEAK, a new feature that can detect kernel information leaks.

It works by tainting memory sources with marker values, letting the data
travel through the kernel, and scanning the kernel<->user frontier for
these marker values. Combined with compiler instrumentation and rotation
of the markers, it is able to yield relevant results with little effort.

We taint the pools and the stack, and scan copyout/copyoutstr. KLEAK is
supported on amd64 only for now, but it is not complicated to add more
architectures (just a matter of having the address of .text, and a stack
unwinder).

A userland tool is provided that allows executing a command in rounds
and monitoring the leaks generated all the while.

KLEAK already detected directly 12 kernel info leaks, and prompted changes
that in total fixed 25+ leaks.

Based on an idea developed jointly with Thomas Barabosch (of Fraunhofer
FKIE).
 1.227  10-Sep-2018  maxv Correctly align the size+redzone for KASAN, on amd64 it happens to be
always 8byte-aligned but on other architectures it may not be.
 1.226  25-Aug-2018  maxv Disable POOL_REDZONE until we figure out what's wrong. There must be a dumb
problem, that is not triggerable on amd64.
 1.225  24-Aug-2018  maxv Use __predict_false to optimize, and also replace panic->printf.
 1.224  23-Aug-2018  maxv Add kASan redzones on pools and pool_caches. Also enable POOL_REDZONE
on DIAGNOSTIC.
 1.223  04-Jul-2018  kamil Avoid undefined behavior in pr_item_notouch_put()

Do not left-shift a signed integer into its sign bit.

sys/kern/subr_pool.c:251:30, left shift of 1 by 31 places cannot be represented in type 'int'

Detected with Kernel Undefined Behavior Sanitizer.

Reported by <Harry Pantazis>
 1.222  04-Jul-2018  kamil Avoid Undefined Behavior in pr_item_notouch_get()

Change the type of left shifted integer from signed to unsigned.

sys/kern/subr_pool.c:274:13, left shift of 1 by 31 places cannot be represented in type 'int'

Detected with Kernel Undefined Behavior Sanitizer.

Reported by <Harry Pantazis>
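In both of the shift fixes above the pattern is simply to make the shifted operand unsigned, along these lines (illustrative, not the exact diff):

    /* Undefined: left shift of 1 by 31 cannot be represented in type 'int'. */
    bitmap[idx / 32] |= 1 << (idx % 32);

    /* Defined: the shifted operand is unsigned. */
    bitmap[idx / 32] |= 1U << (idx % 32);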
 1.221  12-Jan-2018  para branches: 1.221.2; 1.221.4;
fix comment

pool stats are listed by 'vmstat -m', not 'vmstat -i'
 1.220  29-Dec-2017  christos Don't release the lock in the PR_NOWAIT allocation. Move the flags setting
to after acquiring the mutex. (from Tobias Nygren)
 1.219  16-Dec-2017  mrg hopefully work around the irregular "fork fails in init" problem.

if a pool is growing, and the grower is PR_NOWAIT, mark this.
if another caller wants to grow the pool and is also PR_NOWAIT,
busy-wait for the original caller, which should either succeed
or hard-fail fairly quickly.

implement the busy-wait by unlocking and relocking this pools
mutex and returning ERESTART. other methods (such as having
the caller do this) were significantly more code and this hack
is fairly localised.

ok chs@ riastradh@
 1.218  04-Dec-2017  mrg properly account PR_RECURSIVE pools like vmstat does.
 1.217  02-Dec-2017  mrg add two new members to uvmexp_sysctl{}: bootpages and poolpages.
bootpages is set to the pages allocated via uvm_pageboot_alloc().
poolpages is calculated from the nr_pages members of the pools on the list.

this brings us closer to having a valid total of pages known by
the system, vs actual pages originally managed.

XXX: poolpages needs some handling for PR_RECURSIVE pools still.
 1.216  14-Nov-2017  christos - fix an assert; we can reach there if we are nowait or limitfail.
- when priming the pool and failing with ERESTART, don't decrement the number
of pages; this avoids the issue of returning an ERESTART when we get to 0,
and is more correct.
- simplify the pool_grow code, and don't wakeup things if we ENOMEM.
 1.215  09-Nov-2017  christos Add assertions that either PR_WAITOK or PR_NOWAIT are set.
 1.214  09-Nov-2017  christos Handle the ERESTART case from pool_grow()
 1.213  09-Nov-2017  christos make the KASSERTMSG/panic strings consistent as '%s: [%s], __func__, wchan'
 1.212  09-Nov-2017  christos Since pr_lock is now used to wait for two things (PR_GROWING and
PR_WANTED) we need to loop for the condition we wanted.
 1.211  06-Nov-2017  riastradh Assert that pool_get failure happens only with PR_NOWAIT.

This would have caught the mistake I made last week leading to null
pointer dereferences all over the place, a mistake which I evidently
poorly scheduled alongside maxv's change to the panic message on x86
for null pointer dereferences.
 1.210  05-Nov-2017  mlelstv pool_grow can now fail even when sleeping is ok. Catch this case in pool_get
and retry.
 1.209  28-Oct-2017  riastradh Allow only one pending call to a pool's backing allocator at a time.

Candidate fix for problems with hanging after kva fragmentation related
to PR kern/45718.

Proposed on tech-kern:

https://mail-index.NetBSD.org/tech-kern/2017/10/23/msg022472.html

Tested by bouyer@ on i386.

This makes one small change to the semantics of pool_prime and
pool_setlowat: they may fail with EWOULDBLOCK instead of ENOMEM, if
there is a pending call to the backing allocator in another thread but
we are not actually out of memory. That is unlikely because nearly
always these are used during initialization, when the pool is not in
use.

XXX pullup-8
XXX pullup-7
XXX pullup-6 (requires tweaking the patch)
XXX pullup-5...
 1.208  08-Jun-2017  chs add some pool_allocators for pool item sizes larger than PAGE_SIZE.
needed by dtrace.
 1.207  14-Mar-2017  riastradh branches: 1.207.6;
#if DIAGNOSTIC panic ---> KASSERT

- Omit mutex_exit before panic. No need.
- Sprinkle some more information into a few messages.
- Prefer __diagused over #if DIAGNOSTIC for declarations,
to reduce conditionals.

ok mrg@
 1.206  05-Feb-2016  knakahara branches: 1.206.2; 1.206.4;
fix: "vmstat -C" CpuLayer showed only the last cpu values.
 1.205  24-Aug-2015  pooka to garnish, dust with _KERNEL_OPT
 1.204  28-Jul-2015  maxv Introduce POOL_REDZONE.
 1.203  13-Jun-2014  joerg branches: 1.203.2; 1.203.4;
Add kern.pool for memory pool stats.
 1.202  26-Apr-2014  abs Ensure pool_head is non-static - for "vmstat -i"
 1.201  17-Feb-2014  para branches: 1.201.2;
replace vmem(9) custom boundary tag allocation with a pool(9)
 1.200  11-Mar-2013  pooka branches: 1.200.6;
In pool_cache_put_slow(), pool_get() can block (it does mutex_enter()),
so we need to retry if curlwp took a context switch during the call.
Otherwise, CPU-local invariants can get screwed up:

panic: kernel diagnostic assertion "cur->pcg_avail == cur->pcg_size" failed

This is (was) very easy to reproduce by just running:

while : ; do RUMP_NCPU=32 ./a.out ; done

where a.out only calls rump_init(). But any situation where there's contention
and a pool doesn't have emptygroups would do.
 1.199  09-Feb-2013  christos printflike maintenance.
 1.198  28-Aug-2012  christos branches: 1.198.2;
proper locking for DEBUG
 1.197  05-Jun-2012  jym Now that pool_cache_invalidate() is synchronous and can handle per-CPU
caches, merge together pool_drain_start() and pool_drain_end() into

bool pool_drain(struct pool **ppp);

"bool" value indicates whether reclaiming was fully done (true) or not (false)
"ppp" will contain a pointer to the pool that was drained (optional).

See http://mail-index.netbsd.org/tech-kern/2012/06/04/msg013287.html
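A hedged usage sketch of the merged interface (the caller shown is hypothetical; the prototype is the one given above):

    struct pool *pp;
    bool done;

    /* Drain one pool; *pp reports which pool was drained (optional). */
    done = pool_drain(&pp);
    if (!done) {
            /* Reclaiming was not complete; the caller may retry later. */
    }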
 1.196  05-Jun-2012  jym As pool reclaiming is unlikely to happen in interrupt or softint
context, re-enable the portion of code that allows invalidation of CPU-bound
pool caches.

Two reasons:
- CPU cached objects being invalidated, the probability of fetching an
obsolete object from the pool_cache(9) is greatly reduced. This speeds up
pool_cache_get() quite a bit as it does not have to keep destroying objects
until it finds an updated one when an invalidation is in progress.

- for situations where we have to ensure that no obsolete object remains
after a state transition (canonical example: pmap mappings between Xen VM
restoration), invalidating all pool_cache(9) is the safest way to go.

As it uses xcall(9) to broadcast the execution of pool_cache_transfer(),
pool_cache_invalidate() cannot be called from interrupt or softint context
(scheduling a xcall(9) can put a LWP to sleep).

pool_cache_xcall() => pool_cache_transfer() to reflect its use.

Invalidation being a costly process (1000s of objects may be destroyed),
all places where pool_cache_invalidate() may be called from
interrupt/softint context will now get caught by the proper KASSERT(), and
fixed. Ping me when you see one.

Tested under i386 and amd64 by running ATF suite within 64MiB HVM
domains (tried triggering pgdaemon a few times).

No objection on tech-kern@.

XXX a similar fix has to be pulled up to NetBSD-6, but with a more
conservative approach.

See http://mail-index.netbsd.org/tech-kern/2012/05/29/msg013245.html
 1.195  05-May-2012  rmind G/C POOL_DIAGNOSTIC option. No objection on tech-kern@.
 1.194  04-Feb-2012  para branches: 1.194.2;
make acorn26 compile by fixing up subpage pool allocations

ok: riz@
 1.193  29-Jan-2012  he Use the same style for initialization of pool_allocator_kmem under
POOL_SUBPAGE as all the other pool_allocator structs. Fixes build
problem for acorn26.
 1.192  28-Jan-2012  rmind pool_page_alloc, pool_page_alloc_meta: avoid extra compare, use const.
ffs_mountfs,sys_swapctl: replace memset with kmem_zalloc.
sys_swapctl: move kmem_free outside the lock path.
uvm_init: fix comment, remove pointless numeration of steps.
uvm_map_enter: remove meflagval variable.
Fix some indentation.
 1.191  27-Jan-2012  para extending vmem(9) to be able to allocate resources for its own needs.
simplifying uvm_map handling (no special kernel entries anymore no relocking)
make malloc(9) a thin wrapper around kmem(9)
(with private interface for interrupt safety reasons)

releng@ acknowledged
 1.190  27-Sep-2011  jym branches: 1.190.2; 1.190.6;
Modify *ASSERTMSG() so they are now used as variadic macros. The main goal
is to provide routines that do as KASSERT(9) says: append a message
to the panic format string when the assertion triggers, with optional
arguments.

Fix call sites to reflect the new definition.

Discussed on tech-kern@. See
http://mail-index.netbsd.org/tech-kern/2011/09/07/msg011427.html
 1.189  22-Mar-2011  pooka pnbuf_cache is used all over the place outside of vfs, so put it
in one place to avoid many definitions.
 1.188  17-Jan-2011  uebayasi Fix a conditional include.
 1.187  17-Jan-2011  uebayasi Include internal definitions (uvm/uvm.h) only where necessary.
 1.186  03-Jun-2010  pooka branches: 1.186.2;
Report result of pool_reclaim() from pool_drain_end().
 1.185  12-May-2010  rmind pool_{cache_}get: improve previous diagnostic by checking for panicstr,
so it won't trigger the assert while trying to dump core on crash.
 1.184  12-May-2010  rmind - Sprinkle asserts to catch calls from interrupt context on IPL_NONE pools.
- Add diagnostic drain attempt.
 1.183  25-Apr-2010  ad MAXCPUS -> __arraycount
 1.182  20-Jan-2010  rmind branches: 1.182.2; 1.182.4;
pool_cache_invalidate: comment out invalidation of per-CPU caches (nobody depends
on it, at the moment) until we decide how to fix it (xcall(9) cannot be used from
interrupt context). XXX: Perhaps implement XC_HIGHPRI.
 1.181  03-Jan-2010  mlelstv drop __predict micro optimization in pool_init for cleaner code.
 1.180  03-Jan-2010  mlelstv Pools are created way before the pool subsystem mutexes are
initialized.

Ignore also pool_allocator_lock while the system is in cold state.

When the system has left cold state, uvm_init() should have
also initialized the pool subsystem and the mutexes are
ready to use.
 1.179  02-Jan-2010  mlelstv Move initialization of pool_allocator_lock before its first use.
This failed on archs where a mutex isn't initialized to a zero
value.

Defer allocation of pool log to the logging action, if allocation
fails, it will be retried the next time something is logged.

Clear pool log on allocation so that ddb doesn't crash when showing
so far unused log entries.
 1.178  30-Dec-2009  elad Turn PA_INITIALIZED to a reference count for the pool allocator, and once
it drops to zero destroy the mutex we initialize. This fixes the problem
mentioned in

http://mail-index.netbsd.org/tech-kern/2009/12/28/msg006727.html

Also remove pa_flags now that it's no longer needed.

Idea from matt@, okay matt@.
 1.177  20-Oct-2009  jym Fix a bug where on MP systems, pool_cache_invalidate(9) could be called
early during boot, just after CPUs are attached but before they are marked
as running.

This will result in a list of CPUs without the SPCF_RUNNING flag set, and
will trigger the 'KASSERT(xc_tailp < xc_headp)' in xc_lowpri() as no cross
call is issued.

Bug reported and patch tested by tron@.

See also http://mail-index.netbsd.org/tech-kern/2009/10/19/msg006293.html
 1.176  15-Oct-2009  thorpej - pool_cache_invalidate(): broadcast a cross-call to drain the per-CPU
caches before draining the global cache.
- pool_cache_invalidate_local(): remove.
 1.175  08-Oct-2009  jym Add pool_cache_invalidate_local() to the pool_cache(9) API, to permit
per-CPU objects invalidation when cached in the pool cache.

See http://mail-index.netbsd.org/tech-kern/2009/10/05/msg006206.html .

Reviewed by bouyer@. Thanks!
 1.174  13-Sep-2009  pooka Wipe out the last vestiges of POOL_INIT with one swift stroke. In
most cases, use a proper constructor. For proplib, give a local
equivalent of POOL_INIT for the kernel object implementation. This
way the code structure can be preserved, and a local link set is
not hazardous anyway (unless proplib is split to several modules,
but that'll be the day).

tested by booting a kernel in qemu and compile-testing i386/ALL
 1.173  29-Aug-2009  rmind Make pool_head static.
 1.172  15-Apr-2009  yamt pool_cache_put_paddr: add an assertion.
 1.171  11-Nov-2008  ad branches: 1.171.4;
Avoid recursive mutex_enter() when the system is low on KVA.
Should fix crash reported by riz on current-users.
 1.170  15-Oct-2008  ad branches: 1.170.2; 1.170.4;
- Rename cpu_lookup_byindex() to cpu_lookup(). The hardware ID isn't of
interest to MI code. No functional change.
- Change /dev/cpu to operate on cpu index, not hardware ID. Now cpuctl
shouldn't print confused output.
 1.169  11-Aug-2008  yamt make pcg_dummy const to catch bugs earlier.
 1.168  11-Aug-2008  yamt add some KASSERTs.
 1.167  08-Aug-2008  skrll Comment whitespace.
 1.166  09-Jul-2008  yamt pool_do_put: fix a pool corruption bug discovered by
the recent exec_pool changes.
 1.165  07-Jul-2008  yamt branches: 1.165.2;
fix pool corruption bugs in subr_pool.c 1.162.
 1.164  04-Jul-2008  ad Move an assignment later.
 1.163  04-Jul-2008  ad - Keep cache locked while allocating a cache group - later we might want
to automatically tune the group sizes at run time.
- Fix broken assertion.
- Avoid another test+branch.
 1.162  04-Jul-2008  ad Remove a bunch of conditional branches from the pool_cache fast path.
 1.161  31-May-2008  ad branches: 1.161.2;
Use __noinline.
 1.160  28-Apr-2008  martin branches: 1.160.2;
Remove clause 3 and 4 from TNF licenses
 1.159  28-Apr-2008  ad Add MI code to support in-kernel preemption. Preemption is deferred by
one of the following:

- Holding kernel_lock (indicating that the code is not MT safe).
- Bracketing critical sections with kpreempt_disable/kpreempt_enable.
- Holding the interrupt priority level above IPL_NONE.

Statistics on kernel preemption are reported via event counters, and
where preemption is deferred for some reason, it's also reported via
lockstat. The LWP priority at which preemption is triggered is tuneable
via sysctl.
 1.158  27-Apr-2008  ad branches: 1.158.2;
- Rename crit_enter/crit_exit to kpreempt_disable/kpreempt_enable.
DragonflyBSD uses the crit names for something quite different.
- Add a kpreempt_disabled function for diagnostic assertions.
- Add inline versions of kpreempt_enable/kpreempt_disable for primitives.
- Make some more changes for preemption safety to the x86 pmap.
 1.157  24-Apr-2008  ad Merge the socket locking patch:

- Socket layer becomes MP safe.
- Unix protocols become MP safe.
- Allows protocol processing interrupts to safely block on locks.
- Fixes a number of race conditions.

With much feedback from matt@ and plunky@.
 1.156  27-Mar-2008  ad branches: 1.156.2;
Replace use of CACHE_LINE_SIZE in some obvious places.
 1.155  17-Mar-2008  ad Make them compile again.
 1.154  17-Mar-2008  yamt - simplify ASSERT_SLEEPABLE.
- move it from proc.h to systm.h.
- add some more checks.
- make it a little more lkm friendly.
 1.153  10-Mar-2008  martin Use cpu index instead of the machine-dependent, not very expressive
cpuid when naming user-visible kernel entities.
 1.152  02-Mar-2008  yamt pool_do_put: remove pa_starved_p check for now as it seems to cause
more problems than it solves. PR/37993 from Greg A. Woods.
 1.151  14-Feb-2008  yamt branches: 1.151.2; 1.151.6;
use time_uptime instead of getmicrotime() for ph_time.
 1.150  05-Feb-2008  skrll Revert previous as requested by yamt.
 1.149  02-Feb-2008  skrll Check alignment against pp->pr_align not pp->pr_alloc->pa_pagesz.

DIAGNOSTIC kernels on hppa boot again.

OK'd by ad.
 1.148  28-Jan-2008  yamt pool_cache_get_paddr: don't bother to clear pcgo_va unless DIAGNOSTIC.
 1.147  04-Jan-2008  ad Start detangling lock.h from intr.h. This is likely to cause short term
breakage, but the mess of dependencies has been regularly breaking the
build recently anyhow.
 1.146  02-Jan-2008  ad Merge vmlocking2 to head.
 1.145  26-Dec-2007  ad Merge more changes from vmlocking2, mainly:

- Locking improvements.
- Use pool_cache for more items.
 1.144  22-Dec-2007  yamt pool_in_cg: don't bother to check slots past pcg_avail.
 1.143  22-Dec-2007  yamt pool_whatis: print cached items as well.
 1.142  20-Dec-2007  ad - Support two different sizes of pool_cache group. The default has 14 or 15
items, and the new large groups (for busy caches) have 62 or 63 items.
- Add PR_LARGECACHE flag as a hint that a pool_cache should use large groups.
This should eventually be tuned at runtime.
- Report group size for vmstat -C.
 1.141  13-Dec-2007  yamt add ddb "whatis" command. inspired from solaris ::whatis dcmd.
 1.140  13-Dec-2007  yamt don't forget to initialize ph_off for PR_NOTOUCH.
 1.139  11-Dec-2007  ad Change the ncpu test to work when a pool_cache or softint is initialized
between mi_cpu_attach() and attachment of the boot CPU. Suggested by mrg@.
 1.138  05-Dec-2007  ad branches: 1.138.2; 1.138.4;
pool_init, pool_cache_init: hack around IP input processing which can
not yet safely block without severely confusing soo_write() and friends.
If the pool's IPL is IPL_SOFTNET, initialize the mutex at IPL_VM so that
it's a spinlock. To be dealt with correctly in the near future.
 1.137  18-Nov-2007  ad branches: 1.137.2;
Work around issues with pool_cache on sparc.
 1.136  14-Nov-2007  yamt fix freecheck.
 1.135  10-Nov-2007  yamt for PR_NOTOUCH pool_item_header, use a bitmap rather than a freelist.
it saves some space and allows more items per page.
 1.134  07-Nov-2007  ad Merge from vmlocking:

- pool_cache changes.
- Debugger/procfs locking fixes.
- Other minor changes.
 1.133  11-Oct-2007  ad branches: 1.133.2; 1.133.4;
Remove LOCK_ASSERT(!simple_lock_held(&foo));
 1.132  11-Oct-2007  ad Merge from vmlocking:

- G/C spinlockmgr() and simple_lock debugging.
- Always include the kernel_lock functions, for LKMs.
- Slightly improved subr_lockdebug code.
- Keep sizeof(struct lock) the same if LOCKDEBUG.
 1.131  18-Aug-2007  ad branches: 1.131.2; 1.131.4;
pool_drain: add a comment.
 1.130  18-Aug-2007  ad pool_do_cache_invalidate_grouplist: drop locks while calling the destructor.
XXX Expensive - to be revisited.
 1.129  12-Mar-2007  ad branches: 1.129.8; 1.129.12;
Pass an ipl argument to pool_init/POOL_INIT to be used when initializing
the pool's lock.
 1.128  04-Mar-2007  christos branches: 1.128.2;
Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.
 1.127  22-Feb-2007  thorpej TRUE -> true, FALSE -> false
 1.126  21-Feb-2007  thorpej Replace the Mach-derived boolean_t type with the C99 bool type. A
future commit will replace use of TRUE and FALSE with true and false.
 1.125  09-Feb-2007  ad branches: 1.125.2;
Merge newlock2 to head.
 1.124  01-Nov-2006  yamt remove some __unused from function parameters.
 1.123  12-Oct-2006  christos - sprinkle __unused on function decls.
- fix a couple of unused bugs
- no more -Wno-unused for i386
 1.122  03-Sep-2006  christos branches: 1.122.2; 1.122.4;
avoid empty else statement
 1.121  20-Aug-2006  yamt implement PR_NOALIGN. (allow unaligned pages)
to be used by vmem quantum cache.
 1.120  19-Aug-2006  yamt pool_init: in the case of PR_NOTOUCH, don't bump item size to
sizeof(struct pool_item).
 1.119  21-Jul-2006  yamt use ASSERT_SLEEPABLE where appropriate.
 1.118  07-Jun-2006  kardel merge FreeBSD timecounters from branch simonb-timecounters
- struct timeval time is gone
time.tv_sec -> time_second
- struct timeval mono_time is gone
mono_time.tv_sec -> time_uptime
- access to time via
{get,}{micro,nano,bin}time()
get* versions are fast but less precise
- support NTP nanokernel implementation (NTP API 4)
- further reading:
Timecounter Paper: http://phk.freebsd.dk/pubs/timecounter.pdf
NTP Nanokernel: http://www.eecis.udel.edu/~mills/ntp/html/kern.html
 1.117  25-May-2006  yamt move wait points for kva from upper layers to vm_map. PR/33185 #1.

XXX there is a concern about interaction with kva fragmentation.
see: http://mail-index.NetBSD.org/tech-kern/2006/05/11/0000.html
 1.116  15-Apr-2006  simonb branches: 1.116.2;
Add a DEBUG check that panics if pool_init() is called more than
once on the same pool.

As discussed on tech-kern a few months ago.
 1.115  15-Apr-2006  christos Coverity CID 760: Protect against NULL deref.
 1.114  02-Apr-2006  yamt pool_grow: don't increase pr_minpages. (fix a mistake in 1.113)
 1.113  17-Mar-2006  yamt make duplicated code fragments into a function, pool_grow.
 1.112  24-Feb-2006  bjh21 branches: 1.112.2; 1.112.4; 1.112.6;
Medium-sized overhaul of POOL_SUBPAGE support so that:
1: I can understand it, and
2: It works.
Notable externally-visible changes are that POOL_SUBPAGE now has to be a
compile-time constant, and that trying to initialise a pool whose objects are
larger than POOL_SUBPAGE automatically generates a pool that doesn't use
subpages.

NetBSD/acorn26 now boots multi-user again.
 1.111  26-Jan-2006  christos branches: 1.111.2; 1.111.4;
PR/32631: Yves-Emmanuel JUTARD: Fix DIAGNOSTIC panic in the pool code. At
the time pool_get() calls pool_catchup(), pp has been free'd but it is still
in the "entered" state. The chain pool_catchup() -> pool_allocator_alloc()
-> pool_reclaim() on pp fails because pp is still in the "entered" state.
Call pr_leave() before calling pool_catchup() to avoid this.

Thanks for the excellent analysis!
 1.110  24-Dec-2005  perry branches: 1.110.2;
Remove leading __ from __(const|inline|signed|volatile) -- it is obsolete.
 1.109  20-Dec-2005  christos Commit temporary fix against kva starvation from yamt:

- pool_allocator_alloc: drain ourselves as well,
so that pool_cache on us is drained as well.
- pool_cache_put_paddr: destruct objects if underlying pool is starved.
- pool_get: on kva starvation, wake up once a second and try again.

Fixes:
PR/32287: Processes hang in "mclpl"
PR/32330: shark kernel hangs under memory load.
 1.108  01-Dec-2005  yamt add "show all pools" command for ddb.
 1.107  02-Nov-2005  yamt pool_printit: don't keep a lock when printing info.
we can't clean it up if the ddb pager is quit.
 1.106  16-Oct-2005  christos Make the grouplist invalidate function take a grouplist instead of a group.
Suggested by yamt.
 1.105  16-Oct-2005  christos This is why I hate gotos: My previous change had different semantics than
the original code since if fullgroups was empty and partgroups wasn't, we
would not clean up partgroups (pointed out by yamt). Well, this one has
different semantics from the original, but they are the correct ones, I think.
 1.104  16-Oct-2005  christos avoid a goto.
 1.103  15-Oct-2005  chs in pool_do_cache_invalidate(), make sure to process both full and partial
group lists even if the first one we look at is empty. fix ddb print routine.
 1.102  02-Oct-2005  chs optimize pool_caches similarly to how I optimized pools before:
split the single list of pool cache groups into three lists:
completely full, partially full, and completely empty.
use LIST instead of TAILQ where appropriate.
 1.101  18-Jun-2005  thorpej branches: 1.101.2;
Fix some locking issues:
- Make the locking rules for pr_rmpage() sane, and don't modify fields
protected by the pool lock without actually holding it.
- Always defer freeing the pool page to the back-end allocator, to avoid
invoking the pool_allocator with the pool locked (which would violate
the pool_allocator -> pool locking order).
- Fix pool_reclaim() to not violate the pool_cache -> pool locking order
by using a trylock.

Reviewed by Chuq Silvers.
 1.100  01-Apr-2005  yamt merge yamt-km branch.
- don't use managed mappings/backing objects for wired memory allocations.
save some resources like pv_entry. also fix (most of) PR/27030.
- simplify kernel memory management API.
- simplify pmap bootstrap of some ports.
- some related cleanups.
 1.99  01-Jan-2005  yamt branches: 1.99.2; 1.99.4; 1.99.8;
PR_NOTOUCH:
- use uint8_t instead of uint16_t for freelist index.
- set ph_off only if PR_NOTOUCH.
- comment.
 1.98  01-Jan-2005  yamt in the case of !PMAP_MAP_POOLPAGE, gather pool backend allocations to
large chunks for kernel_map and kmem_map to ease kva fragmentation.
 1.97  01-Jan-2005  yamt introduce a new flag for pool_init, PR_NOTOUCH.
if it's specified, don't use free items as storage for internal state.
so that we can use pools for non memory backed objects.
inspired from solaris's KMC_NOTOUCH.
 1.96  20-Jun-2004  thorpej Remove PR_IMMEDRELEASE, since setting the high water mark will achieve
the same thing.

Pointed out back in January by YAMAMOTO Takashi.
 1.95  20-May-2004  atatat Add a DIAGNOSTIC check to detect un-initialized pools.
 1.94  25-Apr-2004  simonb Initialise (most) pools from a link set instead of explicit calls
to pool_init. Untouched pools are ones that are either in arch-specific
code, or aren't initialised during initial system startup.

Convert struct session, ucred and lockf to pools.
 1.93  08-Mar-2004  dbj branches: 1.93.2;
add splvm() around a few pa_slock and psppool calls since they
may be shared with pools that can be used in interrupt context.
 1.92  22-Feb-2004  enami Modify pool page header allocation strategy as follows:
In addition to the current one (i.e., don't waste so large a part of the page),
- if the header fits in the page without wasting any items, put it there.
- don't put the header in the page if it may consume rather big item.

For example, on i386, header is now allocated in the page for the pools
like fdescpl or sigapl, and allocated off the page for the pools like
buf1k or buf2k.
 1.91  16-Jan-2004  yamt - fix locking order problem. (pa_slock -> pr_slock)
- protect pr_phtree with pr_slock.
- add some LOCK_ASSERTs.
 1.90  09-Jan-2004  thorpej Add a new pool initialization flag, PR_IMMEDRELEASE. This flag causes
idle pool pages to be returned to the system immediately upon becoming
de-fragmented.

Also, in pool_do_put(), don't free back an idle page unless we are over
our minimum page claim.
 1.89  29-Dec-2003  yamt pool_prime_page: initialize ph_time to mono_time instead of zero
as it's a mono_time relative value.
 1.88  13-Nov-2003  chs two changes to improve scalability:

(1) split the single list of pages allocated to a pool into three lists:
completely full, partially full, and completely empty.
there is no longer any need to traverse any list looking for a
certain type of page.

(2) replace the 8-element hash table for out-of-page page headers
with a splay tree.

these two changes (together with the recent enhancements to the wait code)
give us linear scaling for a fork+exit microbenchmark.
 1.87  09-Apr-2003  thorpej branches: 1.87.2;
Add the ability for pool caches to cache the physical address of
objects. Clients of the pool_cache API must consistently use
the "paddr" variants or not, otherwise behavior is undefined.

Enable this on Alpha, ARM, MIPS, and x86. Other platforms must
define POOL_VTOPHYS() in the appropriate manner in order to enable
the feature.

Part 1 of a series of simple patches contributed by Wasabi Systems
to improve network performance.
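Usage sketch of the paddr variants, written with the modern pool_cache_t spelling (the cache itself is hypothetical); as noted above, a given cache must use either the paddr variants or the plain ones, never a mix:

    pool_cache_t cache;     /* hypothetical, created elsewhere */
    paddr_t pa;
    void *obj;

    obj = pool_cache_get_paddr(cache, PR_NOWAIT, &pa);
    /* ... use obj; pa holds its physical address ... */
    pool_cache_put_paddr(cache, obj, pa);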
 1.86  16-Mar-2003  matt Only define POOL_LOGSIZE/pool_size if POOL_DIAGNOSTIC is defined.
 1.85  23-Feb-2003  pk Use splvm() instead of splhigh() when accessing the internal page header pool.
 1.84  18-Jan-2003  thorpej Merge the nathanw_sa branch.
 1.83  24-Nov-2002  scw Quell uninitialised variable warnings.
 1.82  09-Nov-2002  thorpej Fix signed/unsigned comparison warnings.
 1.81  08-Nov-2002  enami Parse the modifier of ddb command as documented.
 1.80  27-Sep-2002  provos remove trailing \n in panic(). approved perry.
 1.79  25-Aug-2002  thorpej Fix signed/unsigned comparison warnings from GCC 3.3.
 1.78  30-Jul-2002  thorpej Bring down a fix from the "newlock" branch, slightly modified:
* In pool_prime_page(), assert that the object being placed onto the
free list meets the alignment constraints (that "ioff" within the
object is aligned to "align").
* In pool_init(), round up the object size to the alignment value (or
ALIGN(1), if no special alignment is needed) so that the above invariant
holds true.
 1.77  11-Jul-2002  matt Add wchan to a panic (must have NOWAIT).
 1.76  13-Mar-2002  simonb branches: 1.76.4; 1.76.6;
Move 'struct pool_cache_group' definition into <sys/pool.h>
 1.75  13-Mar-2002  simonb Remove two instances of an "error" variable that is only ever assigned to
but not used.
 1.74  09-Mar-2002  thorpej branches: 1.74.2;
Put back pool_prime(); the i386 mp pmap uses it.
 1.73  09-Mar-2002  thorpej Fix a couple of typos in simple_{,un}lock()'s.
 1.72  09-Mar-2002  thorpej Remove pool_prime(). Nothing uses it, and how it should be used is not
really well-defined in the absence of PR_STATIC.
 1.71  09-Mar-2002  thorpej If, when a page becomes idle, the backend allocator is waiting for
resources, release the page immediately, rather than letting it sit
around cached.

From art@openbsd.org.
 1.70  09-Mar-2002  thorpej Remove PR_MALLOCOK and PR_STATIC. The former wasn't actually used,
and the latter, while there was some code tested the bit, was woefully
incomplete and also unused by anything. Besides, PR_STATIC functionality
could be better handled by backend allocators anyhow.

From art@openbsd.org
 1.69  08-Mar-2002  thorpej Add a missing simple_unlock.
 1.68  08-Mar-2002  thorpej Add an optional "drain" client callback, which can be set by the new
pool_set_drain_hook(). This hook is called in three cases:
* When a pool has hit the hard limit, just before either erroring
out or sleeping.
* When a backend allocator fails to allocate memory.
* Just before trying to reclaim pages in pool_reclaim().

This hook requests the client to try and free some items back to
the pool.

From art@openbsd.org.
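Usage sketch (the subsystem and callback names are hypothetical):

    static void
    frob_drain(void *arg, int flags)
    {
            /*
             * Called in the three cases listed above; try to release some
             * cached items so pool_get() can make progress.
             */
    }

    /* Registered once, typically right after pool_init(): */
    pool_set_drain_hook(&frob_pool, frob_drain, NULL);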
 1.67  08-Mar-2002  thorpej Remove PR_FREEHEADER; nothing uses it anymore.

From art@openbsd.org.
 1.66  08-Mar-2002  thorpej Pool deals fairly well with physical memory shortage, but it doesn't
deal with shortages of the VM maps where the backing pages are mapped
(usually kmem_map). Try to deal with this:

* Group all information about the backend allocator for a pool in a
separate structure. The pool references this structure, rather than
the individual fields.
* Change the pool_init() API accordingly, and adjust all callers.
* Link all pools using the same backend allocator on a list.
* The backend allocator is responsible for waiting for physical memory
to become available, but will still fail if it cannot allocate KVA
space for the pages. If this happens, carefully drain all pools using
the same backend allocator, so that some KVA space can be freed.
* Change pool_reclaim() to indicate if it actually succeeded in freeing
some pages, and use that information to make draining easier and more
efficient.
* Get rid of PR_URGENT. There was only one use of it, and it could be
dealt with by the caller.

From art@openbsd.org.
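A hedged sketch of the resulting interface, written against the modern pool_init() prototype (the IPL argument was added later, in rev. 1.129 below); the allocator and its functions are hypothetical:

    static void *frob_page_alloc(struct pool *, int);
    static void  frob_page_free(struct pool *, void *);

    static struct pool_allocator frob_allocator = {
            .pa_alloc  = frob_page_alloc,
            .pa_free   = frob_page_free,
            .pa_pagesz = 0,                 /* 0 selects the default page size */
    };

    static struct pool frob_pool;

    void
    frob_init(void)
    {
            pool_init(&frob_pool, sizeof(struct frob), 0, 0, 0, "frobpl",
                &frob_allocator, IPL_NONE);
    }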
 1.65  20-Nov-2001  enami Call pr_log(PRLOG_GET) when POOL_DIAGNOSTIC is defined instead of DIAGNOSTIC
for consistency.
 1.64  12-Nov-2001  lukem add RCSIDs
 1.63  21-Oct-2001  chs branches: 1.63.2;
in pool_drain(), call pool_reclaim() while we still have interrupts blocked
since the pool in question might be one used in interrupt context.
 1.62  07-Oct-2001  bjh21 Add support for allocating pool memory in units smaller than a whole page.
This is activated by defining POOL_SUBPAGE to the size of the new allocation
unit, and makes pools much more efficient on machines with obscenely large
pages. It might even make four-megabyte arm26 systems usable.
 1.61  26-Sep-2001  chs jump through hoops to avoid calling uvm_km_free_poolpage() while holding
spinlocks, since that function can sleep. (note that there's still one
instance remaining to be fixed.) use TAILQ_FOREACH where appropriate.
 1.60  01-Jul-2001  thorpej branches: 1.60.2; 1.60.4;
Protect the `pool cache group' pool with splvm(), so that pool caches
can be used by code that runs in interrupt context.
 1.59  05-Jun-2001  thorpej Do the reentrancy checking if POOL_DIAGNOSTIC, not DIAGNOSTIC. Prevents
ABI change for diagnostic vs. non-diagnostic kernels.
 1.58  05-Jun-2001  thorpej Assert that no locks are held if we're called with PR_WAITOK.
From Bill Sommerfeld.
 1.57  13-May-2001  sommerfeld Make this build again ifdef DIAGNOSTIC (oops)
 1.56  13-May-2001  sommerfeld Remove pool reentrancy testing overhead unless DIAGNOSTIC is defined.
Previously, we passed __FILE__ and __LINE__ on all pool_get/pool_set calls.

This change results in a measured 1.2% performance improvement in
ping-flood packets-per-second as reported by ping(8).
 1.55  10-May-2001  thorpej Rearrange the code that adds pages of objects to the pool; require
that the caller allocate the pool_item_header when it allocates the
pool page, so we can avoid a locking pitfall (sleeping with a simple
lock held).

Also revive pool_prime(), as there are some legitimate uses of it,
but in doing so, eliminate some of the bogosities of the old version
(i.e. don't do an implicit "setlowat", just prime the pool, and incr
the minpages for each additional page we add, and compute the number
of pages to prime in a way that callers would expect).
 1.54  10-May-2001  thorpej Use POOL_NEEDS_CATCHUP() in one more place.
 1.53  10-May-2001  thorpej Encapsulate the test for a pool needing a pool_catchup() in a macro.
 1.52  09-May-2001  thorpej Remove pool_create() and pool_prime(). Nothing except pool_create()
used pool_prime(), and no one uses pool_create() anymore.

This makes it easier to fix a locking pitfall.
 1.51  04-May-2001  thorpej Add pool_cache_destruct_object(), used to force destruction of
an object and release back into the pool.
 1.50  29-Jan-2001  enami branches: 1.50.2;
Don't use PR_URGENT to allocate the page header. We don't want to just panic
on memory shortage. Instead, use the same wait/nowait condition as the
item requested, and just clean up and return failure if we can't allocate
a page header while we aren't allowed to wait.
 1.49  14-Jan-2001  thorpej Change some low-hanging splimp() calls to splvm().
 1.48  11-Dec-2000  thorpej Add some basic statistics to pool_cache.
 1.47  10-Dec-2000  thorpej Don't hold a pool cache lock across any call to pool_get() or pool_put().
This allows us to change a try-lock into a normal lock in the reclaim
case.
 1.46  07-Dec-2000  thorpej ...and when freeing cache groups, clear `freeto' if that's the one
we're freeing.
 1.45  07-Dec-2000  thorpej When we invalidate a pool cache, make sure to clear `allocfrom' if
we empty out that cache group.
 1.44  07-Dec-2000  thorpej Add a /c modifier to "show pool" to display pool caches.
 1.43  07-Dec-2000  thorpej This is a first-cut implementation of support for caching of
constructed objects in the pool allocator, similar to caching
of constructed objects in the Solaris SLAB allocator.

This implementation is a separate API (pool_cache_*()) layered
on top of pools to keep the caching complexity out of the way
of pools that won't benefit from it.

While we're here, allow pool items to be as large as the pool
page size.
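
A hedged sketch of the usage pattern this introduced, written against the
present-day pool_cache(9) interface (which hands back a dynamically allocated
cache); the widget type, constructor, and destructor are hypothetical:

    #include <sys/param.h>
    #include <sys/pool.h>

    struct widget { int w_state; };             /* hypothetical object type */
    static pool_cache_t widget_cache;

    static int
    widget_ctor(void *arg, void *obj, int flags)
    {
            struct widget *w = obj;

            w->w_state = 0;     /* expensive construction, done once per object */
            return 0;
    }

    static void
    widget_dtor(void *arg, void *obj)
    {
            /* undo whatever widget_ctor() set up */
    }

    void
    widget_init(void)
    {
            widget_cache = pool_cache_init(sizeof(struct widget), 0, 0, 0,
                "widgetpl", NULL, IPL_NONE, widget_ctor, widget_dtor, NULL);
    }

    void
    widget_example(void)
    {
            struct widget *w = pool_cache_get(widget_cache, PR_WAITOK);
            /* ... use w; it comes back already constructed ... */
            pool_cache_put(widget_cache, w);    /* stays constructed in the cache */
    }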
 1.42  06-Dec-2000  thorpej ANSI'ify.
 1.41  19-Nov-2000  sommerfeld In pool_setlowat(), only call pool_catchup() if the pool is under the
low water mark. (Avoids annoying warning when you setlowat a static
pool).
 1.40  12-Aug-2000  sommerfeld Use ltsleep instead of simple_unlock/tsleep/simple_lock
 1.39  27-Jun-2000  mrg remove include of <vm/vm.h>
 1.38  26-Jun-2000  mrg remove/move more mach vm header files:

<vm/pglist.h> -> <uvm/uvm_pglist.h>
<vm/vm_inherit.h> -> <uvm/uvm_inherit.h>
<vm/vm_kern.h> -> into <uvm/uvm_extern.h>
<vm/vm_object.h> -> nothing
<vm/vm_pager.h> -> into <uvm/uvm_pager.h>

also includes a bunch of <vm/vm_page.h> include removals (due to redundancy
with <vm/vm.h>), and a scattering of other similar headers.
 1.37  10-Jun-2000  sommerfeld Fix assorted bugs around shutdown/reboot/panic time.
- add a new global variable, doing_shutdown, which is nonzero if
vfs_shutdown() or panic() have been called.
- in panic, set RB_NOSYNC if doing_shutdown is already set on entry
so we don't reenter vfs_shutdown if we panic'ed there.
- in vfs_shutdown, don't use proc0's process for sys_sync unless
curproc is NULL.
- in lockmgr, attribute successful locks to proc0 if doing_shutdown
&& curproc==NULL, and panic if we can't get the lock right away; avoids the
spurious lockmgr DIAGNOSTIC panic from the ddb reboot command.
- in subr_pool, deal with curproc==NULL in the doing_shutdown case.
- in mfs_strategy, bitbucket writes if doing_shutdown, so we don't
wedge waiting for the mfs process.
- in ltsleep, treat ((curproc == NULL) && doing_shutdown) like the
panicstr case.

Appears to fix: kern/9239, kern/10187, kern/9367.
May also fix kern/10122.
 1.36  31-May-2000  pk Allow a pool's pagesz to be larger than the VM page size.
Enforce the required page alignment restriction in pool_prime_page().
 1.35  31-May-2000  pk Assert that the pool item size does not exceed the page size.
 1.34  08-May-2000  thorpej branches: 1.34.2;
__predict_false() the DIAGNOSTIC and other error condition checks.
 1.33  13-Apr-2000  chs always define PI_MAGIC so this compiles in all cases.
 1.32  10-Apr-2000  chs in pool_put(), fill the entire object with PI_MAGIC instead of just the
first element.
 1.31  14-Feb-2000  thorpej Use ratecheck().
 1.30  29-Aug-1999  thorpej branches: 1.30.2;
In _pool_put(), panic if we're put'ing with nout == 0. This will help us
detect a little earlier if we've dup-put'd. Otherwise, underflow occurs,
and subsequent allocations simply hang or fail (it thinks the hardlimit
has been reached).
 1.29  05-Aug-1999  sommerfeld Create new pool flag PR_LIMITFAIL, indicating that even PR_WAIT
allocations should fail if the pool is at its hard limit.
Document flag in pool(9).
Use it in mbuf.h for the first allocate call for M_GET, M_GETHDR, and
MCLGET, so that m_reclaim gets called even for blocking allocations.
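
The two-pass pattern this enables, sketched with hypothetical names (the real
consumers are the mbuf macros mentioned above):

    #include <sys/pool.h>

    extern struct pool foo_pool;        /* hypothetical pool */
    void foo_reclaim(void);             /* hypothetical reclaim step */

    static struct foo *
    foo_get_blocking(void)
    {
            struct foo *f;

            /*
             * First pass: willing to sleep for memory, but PR_LIMITFAIL makes
             * pool_get() return NULL instead of sleeping when the pool is at
             * its hard limit, so the caller gets a chance to reclaim first.
             */
            f = pool_get(&foo_pool, PR_WAITOK | PR_LIMITFAIL);
            if (f == NULL) {
                    foo_reclaim();
                    f = pool_get(&foo_pool, PR_WAITOK);  /* now block for real */
            }
            return f;
    }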
 1.28  27-Jul-1999  thorpej In _pool_put(), call simple_lock_freecheck() if we're LOCKDEBUG before
we put the item on the free list.
 1.27  06-Jun-1999  pk Guard our global resource `phpool' against all interrupts.
 1.26  10-May-1999  thorpej Make sure page allocations are counted everywhere that they need to be.
 1.25  10-May-1999  thorpej Improve the pool allocator's diagnostic helpers, adding the ability to
log on a per-pool basis, reentrancy checking, and dumping various pool
information from DDB.
 1.24  29-Apr-1999  scottr Pull in opt_poollog.h for POOL_LOGSIZE.
 1.23  06-Apr-1999  thorpej More locking protocol fixes. Protect pool_head with a spin lock (statically
initialized). This lock also protects the "next drain candidate" pointer.

XXX There is still one locking protocol problem, which should not be
a problem in practice, but is still marked as an issue in the code anyhow.
 1.22  04-Apr-1999  chs Undo the part of the last revision about pr_rmpage() referencing
a data structure after it was freed. This wasn't actually a problem,
and the change caused the wrong pool_item_header to be freed
in the non-PR_PHINPAGE case.
 1.21  31-Mar-1999  thorpej branches: 1.21.2;
Yet more fixes to the pool allocator:

- Protect userspace from unnecessary header inclusions (as noted on
current-users).

- Some const poisoning.

- GREATLY simplify the locking protocol, and fix potential deadlock
scenarios. In particular, assume that the back-end page allocator
provides its own locking mechanism (this is currently true for all
such allocators in the NetBSD kernel). Doing so allows us to simply
use one spin lock for serialized access to all r/w members of the pool
descriptor. The spin lock is released before calling the back-end
allocator, and re-acquired upon return from it.

- Fix a problem in pr_rmpage() where a data structure was referenced
after it was freed.

- Minor tweak to page management. Migrate both idle and empty pages
to the end of the page list. As soon as a page becomes un-empty
(by a pool_put()), place it at the head of the page list, and set
curpage to point to it. This reduces fragmentation as well as the
time required to find a non-empty page as soon as curpage becomes
empty again.

- Use mono_time throughout, and protect access to it w/ splclock().

- In pool_reclaim(), if freeing an idle page would reduce the number
of allocatable items to below the low water mark, don't.
 1.20  31-Mar-1999  thorpej Fix several bugs/deficiencies in the pool allocator:

- Add support for hard limits, with optional rate-limited logging of
a warning message when the pool limit is reached. (This will be used
to fix a bug in mbuf cluster allocation on the MIPS and Alpha ports.)

- Fix some locking protocol errors. This required splitting pr_flags
into pr_flags (which is protected by the spin lock) and pr_roflags (which
are `read only' flags, set when the pool is initialized, and never changed
again; these do not need to be protected by a mutex).

- Make the low water support actually mean something. When a low water
mark is set, add free items to the pool until the low water mark is
reached. When an item allocation causes the number of free items to
drop below the low water mark, make the pool catch up to it. This can
make the pool allocator more useful for several applications (e.g.
pmap `pv entry' management) and more robust for others (for e.g. mbuf
and mbuf cluster allocation, so that the pagedaemon can use NFS to clean
pages on diskless systems without completely running dry on buffers to
receive packets in during extreme memory shortages).

- Add a comment where we sleep waiting for more pages for the back-end
page allocator. Specifically, instead of sleeping potentially forever,
perhaps we should just wake up once a second to try allocating a page
again. XXX Revisit this soon.
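
Sketched against the documented pool(9) calls this revision added or made
meaningful; the pool and the numbers are hypothetical:

    #include <sys/pool.h>

    extern struct pool pv_pool;         /* hypothetical pool */

    void
    pv_pool_tune(void)
    {
            /* Keep at least 64 items ready; the pool "catches up" whenever
             * an allocation drops it below this mark. */
            pool_setlowat(&pv_pool, 64);

            /* Refuse to grow past 4096 items, warning at most once a minute. */
            pool_sethardlimit(&pv_pool, 4096,
                "WARNING: pv_pool limit reached", 60);
    }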
 1.19  24-Mar-1999  mrg completely remove Mach VM support. All that is left is the
header files, as UVM still uses (most of) these.
 1.18  23-Mar-1999  thorpej Fix the order of arguments to roundup().
 1.17  27-Dec-1998  thorpej Make this compile with POOL_DIAGNOSTIC, and add a POOL_LOGSIZE option.
Defopt these.
 1.16  16-Dec-1998  briggs Prototype pool_print() and pool_chk() if DEBUG.
Initialize pool hash table with PR_HASHTABSIZE (i.e., 8) LIST_INIT()s
instead of one memset().
Only check for page != ph->ph_page if PR_PHINPAGE is set (in pool_chk()).
Print pool base pointer when reporting page inconsistency in pool_chk().
 1.15  29-Sep-1998  pk In addition to the spinlock, use the lockmgr() to serialize access to
the back-end page allocator. This allows the back-end to sleep since we
now relinquish the spin lock after acquiring the long-term lock.
 1.14  22-Sep-1998  thorpej Make sure the size is large enough to hold a pool_item.
 1.13  12-Sep-1998  christos Make copyrights consistent; fix weird/trailing spaces add missing (c) etc.
 1.12  28-Aug-1998  thorpej Add an alternate pool page allocator that can be used if the pool is
never accessed in interrupt context. In the UVM case, this uses the
kernel_map, to reduce usage of the previous kmem_map resource.
 1.11  28-Aug-1998  thorpej Add a waitok boolean argument to the VM system's pool page allocator backend.
 1.10  13-Aug-1998  eeh Merge paddr_t changes into the main branch.
 1.9  04-Aug-1998  perry Abolition of bcopy, ovbcopy, bcmp, and bzero, phase one.
bcopy(x, y, z) -> memcpy(y, x, z)
ovbcopy(x, y, z) -> memmove(y, x, z)
bcmp(x, y, z) -> memcmp(x, y, z)
bzero(x, y) -> memset(x, 0, y)
 1.8  02-Aug-1998  thorpej Make sure we initialize pr_nidle.
 1.7  02-Aug-1998  thorpej Fix a braino in the idle page instrumentation.
 1.6  01-Aug-1998  thorpej Instrument "idle pages" (i.e. pages which have no items allocated from
them, and could thus be freed back to the system).
 1.5  31-Jul-1998  thorpej Un-static pool_head; vmstat wants to find it.
 1.4  24-Jul-1998  thorpej branches: 1.4.2;
A few small changes to how pool pages are allocated/freed:
- If either an alloc or release function is provided, make sure both are
provided, otherwise panic, as this is a fatal error.
- If using the default allocator, default the pool pagesz to PAGE_SIZE,
since that is the granularity of the default allocator's mechanism.
- In the default allocator, use new functions:
uvm_km_alloc_poolpage()/uvm_km_free_poolpage(), or
kmem_alloc_poolpage()/kmem_free_poolpage()
rather than doing it here. These functions may use pmap hooks to
provide alternate methods of mapping pool pages.
 1.3  23-Jul-1998  pk Re-vamped pool manager.
* support for customized memory supplier
* automatic page reclaim by VM system
* time-based hysteresis
* cache coloring (after Bonwick's "slabs")
 1.2  19-Feb-1998  pk Add option to use "static" storage provided by the caller.
From Matthias Drochner.
 1.1  15-Dec-1997  pk Memory pool resource utility.
 1.4.2.2  08-Aug-1998  eeh Revert cdevsw mmap routines to return int.
 1.4.2.1  30-Jul-1998  eeh Split vm_offset_t and vm_size_t into paddr_t, psize_t, vaddr_t, and vsize_t.
 1.21.2.4  25-Jun-1999  perry somehow, the last commit was botched. fix it
 1.21.2.3  24-Jun-1999  perry pullup 1.26->1.27 (pk): deal with missing "raise interrupt level" code
 1.21.2.2  07-Apr-1999  thorpej branches: 1.21.2.2.2; 1.21.2.2.4;
Pull up 1.22 -> 1.23.
 1.21.2.1  04-Apr-1999  chs pull up rev 1.22. approved by perry.
 1.21.2.2.4.1  30-Nov-1999  itojun bring in latest KAME (as of 19991130, KAME/NetBSD141) into kame branch
just for reference purposes.
This commit includes 1.4 -> 1.4.1 sync for kame branch.

The branch does not compile at all (due to the lack of ALTQ and some other
source code). Please do not try to modify the branch, this is just for
reference purposes.

synchronization to latest KAME will take place on HEAD branch soon.
 1.21.2.2.2.3  02-Aug-1999  thorpej Update from trunk.
 1.21.2.2.2.2  04-Jul-1999  chs in pool_put(), fill the item with a distinctive pattern ifdef DEBUG.
 1.21.2.2.2.1  21-Jun-1999  thorpej Sync w/ -current.
 1.30.2.6  11-Feb-2001  bouyer Sync with HEAD.
 1.30.2.5  18-Jan-2001  bouyer Sync with head (for UBC+NFS fixes, mostly).
 1.30.2.4  13-Dec-2000  bouyer Sync with HEAD (for UBC fixes).
 1.30.2.3  08-Dec-2000  bouyer Sync with HEAD.
 1.30.2.2  22-Nov-2000  bouyer Sync with HEAD.
 1.30.2.1  20-Nov-2000  bouyer Update thorpej_scsipi to -current as of a month ago
 1.34.2.1  22-Jun-2000  minoura Sync w/ netbsd-1-5-base.
 1.50.2.13  11-Dec-2002  thorpej Sync with HEAD.
 1.50.2.12  11-Nov-2002  nathanw Catch up to -current
 1.50.2.11  18-Oct-2002  nathanw Catch up to -current.
 1.50.2.10  27-Aug-2002  nathanw Catch up to -current.
 1.50.2.9  01-Aug-2002  nathanw Catch up to -current.
 1.50.2.8  24-Jun-2002  nathanw Curproc->curlwp renaming.

Change uses of "curproc->l_proc" back to "curproc", which is more like the
original use. Bare uses of "curproc" are now "curlwp".

"curproc" is now #defined in proc.h as ((curlwp) ? (curlwp)->l_proc) : NULL)
so that it is always safe to reference curproc (*de*referencing curproc
is another story, but that's always been true).
 1.50.2.7  01-Apr-2002  nathanw Catch up to -current.
(CVS: It's not just a program. It's an adventure!)
 1.50.2.6  08-Jan-2002  nathanw Catch up to -current.
 1.50.2.5  14-Nov-2001  nathanw Catch up to -current.
 1.50.2.4  22-Oct-2001  nathanw Catch up to -current.
 1.50.2.3  26-Sep-2001  nathanw Catch up to -current.
Again.
 1.50.2.2  24-Aug-2001  nathanw Catch up with -current.
 1.50.2.1  21-Jun-2001  nathanw Catch up to -current.
 1.60.4.2  11-Oct-2001  fvdl Catch up with -current. Fix some bogons in the sparc64 kbd/ms
attach code. cd18xx conversion provided by mrg.
 1.60.4.1  01-Oct-2001  fvdl Catch up with -current.
 1.60.2.4  10-Oct-2002  jdolecek sync kqueue with -current; this includes merge of gehenna-devsw branch,
merge of i386 MP branch, and part of autoconf rototil work
 1.60.2.3  06-Sep-2002  jdolecek sync kqueue branch with HEAD
 1.60.2.2  16-Mar-2002  jdolecek Catch up with -current.
 1.60.2.1  10-Jan-2002  thorpej Sync kqueue branch with -current.
 1.63.2.1  12-Nov-2001  thorpej Sync the thorpej-mips-cache branch with -current.
 1.74.2.2  12-Mar-2002  thorpej Do the previous differently; instead, pad the size of the structure
to the specified alignment, the way we pad to the system's natural
alignment.
 1.74.2.1  12-Mar-2002  thorpej Sprinkle some assertions around that ensures that the returned
object is aligned as requested.

Bug fix: in pool_prime_page(), make sure to account for alignment when
advancing the pointer through the page.
 1.76.6.1  11-Nov-2002  he Pull up revision 1.78 (requested by thorpej in ticket #582):
Bring down a fix from the "newlock" branch, slightly modified:
o In pool_prime_page(), assert that the object being placed
onto the free list meets the alignment constraints (that
"ioff" within the object is aligned to "align").
o In pool_init(), round up the object size to the alignment
value (or ALIGN(1), if no special alignment is needed) so
that the above invariant holds true.
 1.76.4.2  29-Aug-2002  gehenna catch up with -current.
 1.76.4.1  15-Jul-2002  gehenna catch up with -current.
 1.87.2.7  11-Dec-2005  christos Sync with head.
 1.87.2.6  10-Nov-2005  skrll Sync with HEAD. Here we go again...
 1.87.2.5  01-Apr-2005  skrll Sync with HEAD.
 1.87.2.4  17-Jan-2005  skrll Sync with HEAD.
 1.87.2.3  21-Sep-2004  skrll Fix the sync with head I botched.
 1.87.2.2  18-Sep-2004  skrll Sync with HEAD.
 1.87.2.1  03-Aug-2004  skrll Sync with HEAD
 1.93.2.1  22-Jun-2004  tron Pull up revision 1.96 (requested by thorpej in ticket #522):
Remove PR_IMMEDRELEASE, since setting the high water mark will achieve
the same thing.
Pointed out back in January by YAMAMOTO Takashi.
 1.99.8.2  10-Mar-2006  tron Pull up following revision(s) (requested by bjh21 in ticket #1192):
sys/sys/pool.h: revision 1.48
sys/kern/subr_pool.c: revision 1.112
Medium-sized overhaul of POOL_SUBPAGE support so that:
1: I can understand it, and
2: It works.
Notable externally-visible changes are that POOL_SUBPAGE now has to be a
compile-time constant, and that trying to initialise a pool whose objects are
larger than POOL_SUBPAGE automatically generates a pool that doesn't use
subpages.
NetBSD/acorn26 now boots multi-user again.
 1.99.8.1  18-Jun-2005  tron branches: 1.99.8.1.2;
Pull up revision 1.101 (requested by thorpej in ticket #474):
Fix some locking issues:
- Make the locking rules for pr_rmpage() sane, and don't modify fields
protected by the pool lock without actually holding it.
- Always defer freeing the pool page to the back-end allocator, to avoid
invoking the pool_allocator with the pool locked (which would violate
the pool_allocator -> pool locking order).
- Fix pool_reclaim() to not violate the pool_cache -> pool locking order
by using a trylock.
Reviewed by Chuq Silvers.
 1.99.8.1.2.1  10-Mar-2006  tron Pull up following revision(s) (requested by bjh21 in ticket #1192):
sys/sys/pool.h: revision 1.48
sys/kern/subr_pool.c: revision 1.112
Medium-sized overhaul of POOL_SUBPAGE support so that:
1: I can understand it, and
2: It works.
Notable externally-visible changes are that POOL_SUBPAGE now has to be a
compile-time constant, and that trying to initialise a pool whose objects are
larger than POOL_SUBPAGE automatically generates a pool that doesn't use
subpages.
NetBSD/acorn26 now boots multi-user again.
 1.99.4.1  25-Jan-2005  yamt convert to new apis.
 1.99.2.1  29-Apr-2005  kent sync with -current
 1.101.2.13  24-Mar-2008  yamt sync with head.
 1.101.2.12  17-Mar-2008  yamt sync with head.
 1.101.2.11  27-Feb-2008  yamt sync with head.
 1.101.2.10  11-Feb-2008  yamt sync with head.
 1.101.2.9  04-Feb-2008  yamt sync with head.
 1.101.2.8  21-Jan-2008  yamt sync with head
 1.101.2.7  07-Dec-2007  yamt sync with head
 1.101.2.6  15-Nov-2007  yamt sync with head.
 1.101.2.5  27-Oct-2007  yamt sync with head.
 1.101.2.4  03-Sep-2007  yamt sync with head.
 1.101.2.3  26-Feb-2007  yamt sync with head.
 1.101.2.2  30-Dec-2006  yamt sync with head.
 1.101.2.1  21-Jun-2006  yamt sync with head.
 1.110.2.2  01-Mar-2006  yamt sync with head.
 1.110.2.1  01-Feb-2006  yamt sync with head.
 1.111.4.3  01-Jun-2006  kardel Sync with head.
 1.111.4.2  22-Apr-2006  simonb Sync with head.
 1.111.4.1  04-Feb-2006  simonb Adapt for timecounters: mostly use get*time() and use "time_second"
instead of "time.tv_sec".
 1.111.2.1  09-Sep-2006  rpaulo sync with head
 1.112.6.2  24-May-2006  tron Merge 2006-05-24 NetBSD-current into the "peter-altq" branch.
 1.112.6.1  28-Mar-2006  tron Merge 2006-03-28 NetBSD-current into the "peter-altq" branch.
 1.112.4.1  19-Apr-2006  elad sync with head.
 1.112.2.6  03-Sep-2006  yamt sync with head.
 1.112.2.5  11-Aug-2006  yamt sync with head
 1.112.2.4  26-Jun-2006  yamt sync with head.
 1.112.2.3  24-May-2006  yamt sync with head.
 1.112.2.2  11-Apr-2006  yamt sync with head
 1.112.2.1  01-Apr-2006  yamt sync with head.
 1.116.2.1  19-Jun-2006  chap Sync with head.
 1.122.4.2  10-Dec-2006  yamt sync with head.
 1.122.4.1  22-Oct-2006  yamt sync with head
 1.122.2.3  19-Jan-2007  ad Add some DEBUG code to check that items being freed were previously
allocated from the same source. Needs to be enabled via DDB.
 1.122.2.2  20-Oct-2006  ad Remove sched_lock assertion.
 1.122.2.1  11-Sep-2006  ad From the newlock branch: add some KASSERT() verifying correct alignment.
 1.125.2.3  24-Mar-2007  yamt sync with head.
 1.125.2.2  12-Mar-2007  rmind Sync with HEAD.
 1.125.2.1  27-Feb-2007  yamt - sync with head.
- move sched_changepri back to kern_synch.c as it doesn't know PPQ anymore.
 1.128.2.13  01-Nov-2007  ad pool_reclaim: acquire kernel_lock if the pool is at IPL_SOFTCLOCK,
SOFTNET or SOFTSERIAL, as mutexes at these levels must still be
spinlocks. It's not yet safe for e.g. ip_intr() to block as this
upsets code calling up from the socket layer. It can find pcbs
sitting half baked.

pool_cache_xcall: go to splvm to prevent kernel_lock from being
taken, for the reason listed above.

Pointed out by yamt@.
 1.128.2.12  29-Oct-2007  ad pool_drain_start: tweak assertions/comments.
 1.128.2.11  26-Oct-2007  ad - Use a cross call to drain the per-CPU component of pool caches.
- When draining, skip over pools that are completely inactive.
 1.128.2.10  25-Sep-2007  ad If no constructor/destructor are provided for a pool_cache, use nullop.
Remove the tests for pc_ctor/pc_dtor != NULL.
 1.128.2.9  10-Sep-2007  ad Fix a deadlock.
 1.128.2.8  09-Sep-2007  ad - Re-enable pool_cache, since it works on i386 again after today's pmap
change. pool_cache_invalidate() no longer invalidates objects stored
in the per-CPU caches. This needs some thought.
- Remove pcg_get, pcg_put since they are only called from one place each.
- Remove cc_busy assertions, since they don't work correctly. Pointed out
by yamt@.
- Add some more assertions and simplify.
 1.128.2.7  01-Sep-2007  ad - Add a CPU layer to pool caches. In combination with vmem/kmem this
provides CPU-local slab/object and general purpose allocators. The
strategy used is as described in Jeff Bonwick's USENIX paper, except in
at least one place where the described allocation strategy doesn't make
sense. For exclusive access to the CPU layer the IPL is raised or kernel
preemption disabled. Where the interrupt priority levels are software
emulated this is much cheaper than taking a lock, and I think that
writing to a local %pil register is likely to have a similar penalty to
taking a lock.

No tuning of the group sizes is currently done - all groups have 15
items each, but this should be fairly easy to implement. Also, the
reclamation mechanism should probably use a cross-call to drain the
CPU-level caches on remote CPUs.

Currently this causes kernel memory corruption on i386, yet works without
a problem on amd64. The cache layer is disabled for the time being until I
can find the bug.

- Change the pool_cache API so that the caches are themselves dynamically
allocated, and that each cache is tied to a single pool only. Add some
stubs to change pool_cache parameters that call directly through to the
pool layer (e.g. pool_cache_sethiwat). The idea here is that pool_cache
should become the default object allocator (and so LKM friendly), and
that the pool allocator should be for kernel-internal use only. This will
be posted to tech-kern@ for review.
 1.128.2.6  20-Aug-2007  ad Sync with HEAD.
 1.128.2.5  29-Jul-2007  ad Trap free() of areas that contain undestroyed locks. Not a major problem
but it helps to catch bugs.
 1.128.2.4  22-Mar-2007  ad - Remove debugging crud.
- wakeup -> cv_broadcast.
 1.128.2.3  21-Mar-2007  ad GC the simplelock/spinlock debugging stuff.
 1.128.2.2  13-Mar-2007  ad Pull in the initial set of changes for the vmlocking branch.
 1.128.2.1  13-Mar-2007  ad Sync with head.
 1.129.12.6  09-Dec-2007  jmcneill Sync with HEAD.
 1.129.12.5  21-Nov-2007  joerg Sync with HEAD.
 1.129.12.4  14-Nov-2007  joerg Sync with HEAD.
 1.129.12.3  11-Nov-2007  joerg Sync with HEAD.
 1.129.12.2  26-Oct-2007  joerg Sync with HEAD.

Follow the merge of pmap.c on i386 and amd64 and move
pmap_init_tmp_pgtbl into arch/x86/x86/pmap.c. Modify the ACPI wakeup
code to restore CR4 before jumping back into kernel space as the large
page option might cover that.
 1.129.12.1  03-Sep-2007  jmcneill Sync with HEAD.
 1.129.8.1  03-Sep-2007  skrll Sync with HEAD.
 1.131.4.1  14-Oct-2007  yamt sync with head.
 1.131.2.4  23-Mar-2008  matt sync with HEAD
 1.131.2.3  09-Jan-2008  matt sync with HEAD
 1.131.2.2  08-Nov-2007  matt sync with -HEAD
 1.131.2.1  06-Nov-2007  matt sync with HEAD
 1.133.4.4  18-Feb-2008  mjf Sync with HEAD.
 1.133.4.3  27-Dec-2007  mjf Sync with HEAD.
 1.133.4.2  08-Dec-2007  mjf Sync with HEAD.
 1.133.4.1  19-Nov-2007  mjf Sync with HEAD.
 1.133.2.2  18-Nov-2007  bouyer Sync with HEAD
 1.133.2.1  13-Nov-2007  bouyer Sync with HEAD
 1.137.2.7  31-Dec-2007  ad Make pool_cache_disable work again.
 1.137.2.6  28-Dec-2007  ad pool_cache_put_slow: fill cc_previous if empty. Pointed out by yamt@.
 1.137.2.5  26-Dec-2007  ad Sync with head.
 1.137.2.4  26-Dec-2007  ad Need sys/atomic.h here.
 1.137.2.3  15-Dec-2007  ad Sort list of pools/caches to make them easier to find.
 1.137.2.2  12-Dec-2007  ad Add a global 'pool_cache_disable', to be set from the debugger. Helpful
when tracking down leaks.
 1.137.2.1  08-Dec-2007  ad Sync with head.
 1.138.4.3  08-Jan-2008  bouyer Sync with HEAD
 1.138.4.2  02-Jan-2008  bouyer Sync with HEAD
 1.138.4.1  13-Dec-2007  bouyer Sync with HEAD
 1.138.2.3  13-Dec-2007  yamt sync with head.
 1.138.2.2  10-Dec-2007  yamt - separate kernel va allocation (kernel_va_arena) from
in-kernel fault handling (kernel_map).
- add vmem bootstrap code. vmem doesn't rely on malloc anymore.
- make kmem_alloc interrupt-safe.
- kill kmem_map. make malloc a wrapper of kmem_alloc.
 1.138.2.1  10-Dec-2007  yamt add pool_cache_bootstrap_destroy. will be used by vmem.
 1.151.6.4  17-Jan-2009  mjf Sync with HEAD.
 1.151.6.3  28-Sep-2008  mjf Sync with HEAD.
 1.151.6.2  02-Jun-2008  mjf Sync with HEAD.
 1.151.6.1  03-Apr-2008  mjf Sync with HEAD.
 1.151.2.1  24-Mar-2008  keiichi sync with head.
 1.156.2.2  04-Jun-2008  yamt sync with head
 1.156.2.1  18-May-2008  yamt sync with head.
 1.158.2.5  11-Aug-2010  yamt sync with head.
 1.158.2.4  11-Mar-2010  yamt sync with head
 1.158.2.3  16-Sep-2009  yamt sync with head
 1.158.2.2  04-May-2009  yamt sync with head.
 1.158.2.1  16-May-2008  yamt sync with head.
 1.160.2.2  18-Sep-2008  wrstuden Sync with wrstuden-revivesa-base-2.
 1.160.2.1  23-Jun-2008  wrstuden Sync w/ -current. 34 merge conflicts to follow.
 1.161.2.1  18-Jul-2008  simonb Sync with head.
 1.165.2.3  13-Dec-2008  haad Update haad-dm branch to haad-dm-base2.
 1.165.2.2  19-Oct-2008  haad Sync with HEAD.
 1.165.2.1  07-Jul-2008  haad file subr_pool.c was added on branch haad-dm on 2008-10-19 22:17:28 +0000
 1.170.4.1  17-Nov-2008  snj Pull up following revision(s) (requested by ad in ticket #72):
sys/kern/subr_pool.c: revision 1.171
Avoid recursive mutex_enter() when the system is low on KVA.
Should fix crash reported by riz on current-users.
 1.170.2.2  28-Apr-2009  skrll Sync with HEAD.
 1.170.2.1  19-Jan-2009  skrll Sync with HEAD.
 1.171.4.1  13-May-2009  jym Sync with HEAD.

Commit is split, to avoid a "too many arguments" protocol error.
 1.182.4.4  21-Apr-2011  rmind sync with head
 1.182.4.3  05-Mar-2011  rmind sync with head
 1.182.4.2  03-Jul-2010  rmind sync with head
 1.182.4.1  30-May-2010  rmind sync with head
 1.182.2.2  17-Aug-2010  uebayasi Sync with HEAD.
 1.182.2.1  30-Apr-2010  uebayasi Sync with HEAD.
 1.186.2.1  06-Jun-2011  jruoho Sync with HEAD.
 1.190.6.2  02-Jun-2012  mrg sync to latest -current.
 1.190.6.1  18-Feb-2012  mrg merge to -current.
 1.190.2.4  22-May-2014  yamt sync with head.

for a reference, the tree before this commit was tagged
as yamt-pagecache-tag8.

this commit was split into small chunks to avoid
a limitation of cvs. ("Protocol error: too many arguments")
 1.190.2.3  30-Oct-2012  yamt sync with head
 1.190.2.2  23-May-2012  yamt sync with head.
 1.190.2.1  17-Apr-2012  yamt sync with head
 1.194.2.2  21-May-2014  bouyer Pull up following revision(s) (requested by abs in ticket #1054):
sys/kern/subr_pool.c: revision 1.202
Ensure pool_head is non static - for "vmstat -i"
 1.194.2.1  02-Jul-2012  jdc Pull up revisions:
src/sys/kern/subr_pool.c revision 1.196
src/share/man/man9/pool_cache.9 patch
(requested by jym in ticket #366).

As pool reclaiming is unlikely to happen at interrupt or softint
context, re-enable the portion of code that allows invalidation of
CPU-bound pool caches.

Two reasons:
- CPU cached objects being invalidated, the probability of fetching an
obsolete object from the pool_cache(9) is greatly reduced. This speeds
up pool_cache_get() quite a bit as it does not have to keep destroying
objects until it finds an updated one when an invalidation is in progress.

- for situations where we have to ensure that no obsolete object remains
after a state transition (canonical example: pmap mappings between Xen
VM restoration), invalidating all pool_cache(9) is the safest way to go.

As it uses xcall(9) to broadcast the execution of pool_cache_transfer(),
pool_cache_invalidate() cannot be called from interrupt or softint
context (scheduling a xcall(9) can put a LWP to sleep).

pool_cache_xcall() => pool_cache_transfer() to reflect its use.

Invalidation being a costly process (1000s objects may be destroyed),
all places where pool_cache_invalidate() may be called from
interrupt/softint context will now get caught by the proper KASSERT(),
and fixed. Ping me when you see one.

Tested under i386 and amd64 by running ATF suite within 64MiB HVM
domains (tried triggering pgdaemon a few times).

No objection on tech-kern@.

XXX a similar fix has to be pulled up to NetBSD-6, but with a more
conservative approach.

See http://mail-index.netbsd.org/tech-kern/2012/05/29/msg013245.html
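
A hedged sketch of the calling convention this change settles on; widget_cache
is the hypothetical cache from the earlier sketch and the wrapper is
illustrative:

    #include <sys/systm.h>
    #include <sys/cpu.h>
    #include <sys/pool.h>

    extern pool_cache_t widget_cache;   /* hypothetical cache */

    void
    widget_flush_stale(void)
    {
            /*
             * pool_cache_invalidate() broadcasts pool_cache_transfer() with
             * xcall(9), which may sleep, so it must not be called from
             * interrupt or soft-interrupt context.
             */
            KASSERT(!cpu_intr_p() && !cpu_softintr_p());
            pool_cache_invalidate(widget_cache);
    }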
 1.198.2.4  03-Dec-2017  jdolecek update from HEAD
 1.198.2.3  20-Aug-2014  tls Rebase to HEAD as of a few days ago.
 1.198.2.2  23-Jun-2013  tls resync from head
 1.198.2.1  25-Feb-2013  tls resync with head
 1.200.6.1  18-May-2014  rmind sync with head
 1.201.2.1  10-Aug-2014  tls Rebase.
 1.203.4.3  28-Aug-2017  skrll Sync with HEAD
 1.203.4.2  19-Mar-2016  skrll Sync with HEAD
 1.203.4.1  22-Sep-2015  skrll Sync with HEAD
 1.203.2.1  06-Mar-2016  martin Pull up following revision(s) (requested by knakahara in ticket #1103):
sys/kern/subr_pool.c: revision 1.206
fix: "vmstat -C" CpuLayer showed only the last cpu values.
 1.206.4.1  21-Apr-2017  bouyer Sync with HEAD
 1.206.2.1  20-Mar-2017  pgoyette Sync with HEAD
 1.207.6.1  27-Feb-2018  martin Pull up following revision(s) (requested by mrg in ticket #593):
sys/dev/marvell/mvxpsec.c: revision 1.2
sys/arch/m68k/m68k/pmap_motorola.c: revision 1.70
sys/opencrypto/crypto.c: revision 1.102
sys/arch/sparc64/sparc64/pmap.c: revision 1.308
sys/ufs/chfs/chfs_malloc.c: revision 1.5
sys/arch/powerpc/oea/pmap.c: revision 1.95
sys/sys/pool.h: revision 1.80,1.82
sys/kern/subr_pool.c: revision 1.209-1.216,1.219-1.220
sys/arch/alpha/alpha/pmap.c: revision 1.262
sys/kern/uipc_mbuf.c: revision 1.173
sys/uvm/uvm_fault.c: revision 1.202
sys/sys/mbuf.h: revision 1.172
sys/kern/subr_extent.c: revision 1.86
sys/arch/x86/x86/pmap.c: revision 1.266 (via patch)
sys/dev/dtv/dtv_scatter.c: revision 1.4

Allow only one pending call to a pool's backing allocator at a time.
Candidate fix for problems with hanging after kva fragmentation related
to PR kern/45718.

Proposed on tech-kern:
https://mail-index.NetBSD.org/tech-kern/2017/10/23/msg022472.html
Tested by bouyer@ on i386.

This makes one small change to the semantics of pool_prime and
pool_setlowat: they may fail with EWOULDBLOCK instead of ENOMEM, if
there is a pending call to the backing allocator in another thread but
we are not actually out of memory. That is unlikely because nearly
always these are used during initialization, when the pool is not in
use.

Define the new flag too for previous commit.

pool_grow can now fail even when sleeping is ok. Catch this case in pool_get
and retry.

Assert that pool_get failure happens only with PR_NOWAIT.
This would have caught the mistake I made last week leading to null
pointer dereferences all over the place, a mistake which I evidently
poorly scheduled alongside maxv's change to the panic message on x86
for null pointer dereferences.

Since pr_lock is now used to wait for two things (PR_GROWING and
PR_WANTED), we need to loop for the condition we wanted.
make the KASSERTMSG/panic strings consistent as '%s: [%s], __func__, wchan'
Handle the ERESTART case from pool_grow()

don't pass 0 to the pool flags
Guess pool_cache_get(pc, 0) means PR_WAITOK here.
Earlier on in the same context we use kmem_alloc(sz, KM_SLEEP).

use PR_WAITOK everywhere.
use PR_NOWAIT.

Don't use 0 for PR_NOWAIT

use PR_NOWAIT instead of 0

panic ex nihilo -- PR_NOWAITing for zerot

Add assertions that either PR_WAITOK or PR_NOWAIT are set.
- fix an assert; we can reach there if we are nowait or limitfail.
- when priming the pool and failing with ERESTART, don't decrement the number
of pages; this avoids the issue of returning an ERESTART when we get to 0,
and is more correct.
- simplify the pool_grow code, and don't wakeup things if we ENOMEM.

In pmap_enter_ma(), only try to allocate pves if we might need them,
and even if that fails, only fail the operation if we later discover
that we really do need them. This implements the requirement that
pmap_enter(PMAP_CANFAIL) must not fail when replacing an existing
mapping with the first mapping of a new page, which is an unintended
consequence of the changes from the rmind-uvmplock branch in 2011.

The problem arises when pmap_enter(PMAP_CANFAIL) is used to replace an existing
pmap mapping with a mapping of a different page (eg. to resolve a copy-on-write).
If that fails and leaves the old pmap entry in place, then UVM won't hold
the right locks when it eventually retries. This entanglement of the UVM and
pmap locking was done in rmind-uvmplock in order to improve performance,
but it also means that the UVM state and pmap state need to be kept in sync
more than they did before. It would be possible to handle this in the UVM code
instead of in the pmap code, but these pmap changes improve the handling of
low memory situations in general, and handling this in UVM would be clunky,
so this seemed like the better way to go.

This somewhat indirectly fixes PR 52706, as well as the failing assertion
about "uvm_page_locked_p(old_pg)". (but only on x86, various other platforms
will need their own changes to handle this issue.)
In uvm_fault_upper_enter(), if pmap_enter(PMAP_CANFAIL) fails, assert that
the pmap did not leave around a now-stale pmap mapping for an old page.
If such a pmap mapping still existed after we unlocked the vm_map,
the UVM code would not know later that it would need to lock the
lower layer object while calling the pmap to remove or replace that
stale pmap mapping. See PR 52706 for further details.
hopefully work around the irregular "fork fails in init" problem.
if a pool is growing, and the grower is PR_NOWAIT, mark this.
if another caller wants to grow the pool and is also PR_NOWAIT,
busy-wait for the original caller, which should either succeed
or hard-fail fairly quickly.

implement the busy-wait by unlocking and relocking this pool's
mutex and returning ERESTART. other methods (such as having
the caller do this) were significantly more code and this hack
is fairly localised.
ok chs@ riastradh@

Don't release the lock in the PR_NOWAIT allocation. Move flags setting
after the acquiring the mutex. (from Tobias Nygren)
apply the change from arch/x86/x86/pmap.c rev. 1.266 commitid vZRjvmxG7YTHLOfA:

In pmap_enter_ma(), only try to allocate pves if we might need them,
and even if that fails, only fail the operation if we later discover
that we really do need them. If we are replacing an existing mapping,
reuse the pv structure where possible.

This implements the requirement that pmap_enter(PMAP_CANFAIL) must not fail
when replacing an existing mapping with the first mapping of a new page,
which is an unintended consequence of the changes from the rmind-uvmplock
branch in 2011.

The problem arises when pmap_enter(PMAP_CANFAIL) is used to replace an existing
pmap mapping with a mapping of a different page (eg. to resolve a copy-on-write).
If that fails and leaves the old pmap entry in place, then UVM won't hold
the right locks when it eventually retries. This entanglement of the UVM and
pmap locking was done in rmind-uvmplock in order to improve performance,
but it also means that the UVM state and pmap state need to be kept in sync
more than they did before. It would be possible to handle this in the UVM code
instead of in the pmap code, but these pmap changes improve the handling of
low memory situations in general, and handling this in UVM would be clunky,
so this seemed like the better way to go.

This somewhat indirectly fixes PR 52706 on the remaining platforms where
this problem existed.
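
In caller terms, the flag hygiene introduced by this pull-up amounts to the
following; foo_pool and the wrappers are hypothetical:

    #include <sys/param.h>
    #include <sys/errno.h>
    #include <sys/pool.h>

    struct foo;                         /* hypothetical object type */
    extern struct pool foo_pool;        /* hypothetical pool */

    static int
    foo_try_get(struct foo **fp)
    {
            /* Exactly one of PR_WAITOK or PR_NOWAIT must be passed; a bare 0
             * now trips an assertion. Only PR_NOWAIT (or PR_LIMITFAIL) may fail. */
            *fp = pool_get(&foo_pool, PR_NOWAIT);
            return (*fp == NULL) ? ENOMEM : 0;
    }

    static struct foo *
    foo_get(void)
    {
            /* PR_WAITOK without PR_LIMITFAIL never returns NULL; internally
             * pool_get() retries when pool_grow() reports ERESTART. */
            return pool_get(&foo_pool, PR_WAITOK);
    }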
 1.221.4.3  21-Apr-2020  martin Sync with HEAD
 1.221.4.2  13-Apr-2020  martin Mostly merge changes from HEAD upto 20200411
 1.221.4.1  10-Jun-2019  christos Sync with HEAD
 1.221.2.4  26-Dec-2018  pgoyette Sync with HEAD, resolve a few conflicts
 1.221.2.3  30-Sep-2018  pgoyette Sync with HEAD
 1.221.2.2  06-Sep-2018  pgoyette Sync with HEAD

Resolve a couple of conflicts (result of the uimin/uimax changes)
 1.221.2.1  28-Jul-2018  pgoyette Sync with HEAD
 1.252.2.5  29-May-2025  martin Pull up following revision(s) (requested by bouyer in ticket #1956):

sys/kern/subr_pool.c: revision 1.295

Never call pr_drain_hook from pool_allocator_alloc().

In the PR_WAITOK case it's called from pool_reclaim

In the !PR_WAITOK case we're holding the pool lock and if the drain hook
wants kernel_lock we may deadlock with another thread holding
kernel_lock and calling pool_get().

Fixes PR kern/59411
 1.252.2.4  17-Jul-2022  martin Pull up following revision(s) (requested by simonb in ticket #1479):

sys/kern/subr_pool.c: revision 1.285

Use 64-bit math to calculate pool sizes. Fixes overflow errors for
pools larger than 4GB and gives the correct output for kernel pool pages
in "vmstat -s" output.
 1.252.2.3  08-Mar-2020  martin Pull up following revision(s) (requested by chs in ticket #766):

sys/kern/subr_pool.c: revision 1.265

fix assertions about when it is ok for pool_get() to return NULL.
 1.252.2.2  01-Sep-2019  martin Pull up following revision(s) (requested by maxv in ticket #129):

sys/kern/subr_pool.c: revision 1.256
sys/kern/subr_pool.c: revision 1.257

Kernel Heap Hardening: use bitmaps on all off-page pools. This migrates 29
MI pools on amd64 from linked lists to bitmaps, which have higher security
properties.

Then, change the computation of the size of the PH pools: take into account
the bitmap area available by default in the ph_u2 union, and don't go with
&phpool[>0] if &phpool[0] already has enough space to embed a bitmap.

The pools that are migrated in this change all use bitmaps small enough to
fit in &phpool[0], therefore there is no increase in memory consumption.

-

Revert r1.254, put back || for KASAN, some destructors like lwp_dtor()
caused false positives. Needs more work.
 1.252.2.1  18-Aug-2019  martin Pull up following revision(s) (requested by maxv in ticket #81):

sys/kern/subr_pool.c: revision 1.253
sys/kern/subr_pool.c: revision 1.254
sys/kern/subr_pool.c: revision 1.255

Kernel Heap Hardening: perform certain sanity checks on the pool caches
directly, to immediately detect certain bugs that would otherwise have
been detected only later on the pool layer, if the buffer ever reached
the pool layer.

-

Replace || by && in KASAN, to increase the pool coverage.
Strictly speaking, what we want to avoid is poisoning buffers that were
referenced in a global list as part of the ctor. But, if a buffer indeed
got referenced as part of the ctor, it necessarily has to be unreferenced
in the dtor; which implies it has to have a dtor. So we want both a ctor
and a dtor, and not just one of them.

Note that POOL_QUARANTINE already implicitly provides this increased
coverage.

-

Initialize pp->pr_redzone to false. For some reason with KUBSAN GCC does
not eliminate the unused branch in pr_item_linkedlist_put(), and this
leads to an unused uninitialized access which triggers KUBSAN messages.
 1.264.2.2  29-Feb-2020  ad Sync with head.
 1.264.2.1  25-Jan-2020  ad Sync with head.
 1.266.4.1  20-Apr-2020  bouyer Sync with HEAD
 1.274.2.2  03-Apr-2021  thorpej Sync with HEAD.
 1.274.2.1  03-Jan-2021  thorpej Sync w/ HEAD.
 1.276.4.1  01-Aug-2021  thorpej Sync with HEAD.
 1.285.4.3  29-May-2025  martin Pull up following revision(s) (requested by bouyer in ticket #1122):

sys/kern/subr_pool.c: revision 1.295

Never call pr_drain_hook from pool_allocator_alloc().

In the PR_WAITOK case it's called from pool_reclaim

In the !PR_WAITOK case we're holding the pool lock and if the drain hook
wants kernel_lock we may deadlock with another thread holding
kernel_lock and calling pool_get().

Fixes PR kern/59411
 1.285.4.2  15-Dec-2024  martin Pull up following revision(s) (requested by chs in ticket #1028):

sys/kern/subr_pool.c: revision 1.292

pool: fix pool_sethiwat() to actually do something

The change that I made to the pool code back in April 2020
("slightly change and fix the semantics of pool_set*wat()" ...)
accidentally broke pool_sethiwat() by making it have no effect.

This was discovered after the crash reported in PR 58666 was fixed.

The same machine (32-bit, with 10GB RAM) would hang due to the buffer
cache causing the system to run out of kernel virtual space. The
buffer cache uses a separate pool for buffer data for each power of 2
between DEV_BSIZE and MAXBSIZE, and if the usage pattern of buffer
sizes changes then memory has to be moved between the different pools
in order to create buffers of the new size. The buffer cache handles
this by using pool_sethiwat() to cause memory freed from the buffer
cache back to the pools to not be cached in the buffer cache pools but
instead be freed back to the pools' back-end allocator (which
allocates from the low-level kva allocator) as soon as possible. But
since pool_sethiwat() wasn't doing anything, memory would stay cached
in some buffer cache pools and starve other buffer cache pools (and a
few other pools that do not use the kmem layer for memory allocation).

Fix pool_sethiwat() to do what it is supposed to do again.
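
A hedged sketch of the intended use that this restores; the per-size buffer
pool name is hypothetical:

    #include <sys/pool.h>

    extern struct pool buf_data_pool;   /* hypothetical per-size buffer pool */

    void
    buf_pool_tune(void)
    {
            /*
             * With a low high-water mark, memory freed back to this pool is
             * returned promptly to the back-end (KVA) allocator instead of
             * lingering here and starving the other per-size pools.
             */
            pool_sethiwat(&buf_data_pool, 16);
    }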
 1.285.4.1  20-Sep-2024  martin Pull up following revision(s) (requested by rin in ticket #871):

sys/kern/subr_pool.c: revision 1.286

Avoid undefined behaviour.
