History log of /src/sys/kern/kern_descrip.c
Revision  Date  Author  Comments
 1.266  16-Jul-2025  kre Kernel part of O_CLOFORK implementation (plus kernel revbump)

This is Ricardo Branco's implementation of O_CLOFORK (and
associated fcntl, etc) for NetBSD (with a few minor changes
by me).

For now, the header file symbols that should be exposed to
userland are hidden inside temporary #ifdef _KERNEL blocks,
just to keep random userland apps, or config scripts, from
seeing any of this before it is better tested.

Userland parts of this will follow soon.

This also bumps the kernel version to 10.99.15 (changes to
data structs, and the signature of fd_dup()).
 1.265  21-Dec-2024  riastradh closef(9): Assert no ERESTART from struct fileops::fo_close.

This cannot possibly work so make sure we flag it early.

Currently the sys_close wrapper will neuter ERESTART by mapping it to
EINTR, but let's catch this mistake earlier where we have better
diagnostic information available like what the fo_close function is.
(Haven't seen the printf fire in the >decade since I added it, so I
think this KASSERT is unlikely.)
 1.264  10-Nov-2024  kre Make O_CLOEXEC always close specified files on exec

It turns out that close-on-exec doesn't always close on exec.

If all close-on-exec fd's were made close-on-exec via dup3() or
fcntl(F_DUPFD_CLOEXEC) or use of the internal fd_clone() (whose uses
I did not fully investigate but I think is used to create a fd for
the open of a cloner device, and perhaps other things) then none
of the close-on-exec file descriptors will be closed when an exec
happens - but will be passed through to the new process (still marked,
apparently, as close-on-exec - but still won't be closed if another exec
happens) - that is unless...

If at least one fd in the process has close-on-exec set some other way
(fcntl(F_SETFD), open(O_CLOEXEC) (and the similar functions for sockets
and epoll), and perhaps others), then all close-on-exec file descriptors
in the process will be correctly closed when an exec happens (however
they obtained the close-on-exec status).

There are two steps that need to be taken (in the kernel) when turning
on close on exec - the obvious one of setting the ff_exclose field in
the struct fdfile for the fd. And second, marking the file descriptor
table (which holds the fdfile's for one or more processes) as containing
file descriptors with close-on-exec set (it is a simple yes/no, and once
set is never cleared until an actual exec happens). If it is set when
an exec happens, all the file descriptors are examined, and those marked
close-on-exec are closed. If the file descriptor table doesn't indicate
that close-on-exec fds exist in the table, none of that happens.

Several places were setting ff_exclose in the struct fdfile but
not bothering to set the fd_exclose field in the file descriptor table.

There's even a function (fd_set_exclose()) whose whole purpose is to do
this properly - but it wasn't being used.

Now it is, everywhere (I hope).
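
The two-step bookkeeping described in this entry can be sketched as a small userland model. The names ff_exclose, fd_exclose and fd_set_exclose() come from the log message; every toy_* name, the table size, and the ff_open field are hypothetical, and this is not the kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NFILES 8	/* hypothetical table size, for illustration only */

/* Toy stand-ins for struct fdfile and the file descriptor table. */
struct toy_fdfile {
	bool ff_open;		/* slot in use (hypothetical field) */
	bool ff_exclose;	/* per-descriptor close-on-exec mark */
};

struct toy_filedesc {
	struct toy_fdfile fd_ofiles[TOY_NFILES];
	bool fd_exclose;	/* table-wide "some fd is close-on-exec" hint */
};

/* Buggy pattern: step one only; the table-wide hint is left unset. */
static void
toy_mark_only(struct toy_filedesc *fdp, size_t fd)
{
	assert(fd < TOY_NFILES);
	fdp->fd_ofiles[fd].ff_exclose = true;
}

/* Fixed pattern: both steps, as fd_set_exclose() is described as doing. */
static void
toy_set_exclose(struct toy_filedesc *fdp, size_t fd, bool exclose)
{
	assert(fd < TOY_NFILES);
	fdp->fd_ofiles[fd].ff_exclose = exclose;
	if (exclose)
		fdp->fd_exclose = true;	/* cleared only at exec time */
}

/* At exec: scan the table only if the hint says there is work to do. */
static size_t
toy_exec_close(struct toy_filedesc *fdp)
{
	size_t closed = 0;

	if (!fdp->fd_exclose)
		return 0;	/* hint unset: marked fds are never examined */
	for (size_t i = 0; i < TOY_NFILES; i++) {
		struct toy_fdfile *ff = &fdp->fd_ofiles[i];

		if (ff->ff_open && ff->ff_exclose) {
			ff->ff_open = false;
			ff->ff_exclose = false;
			closed++;
		}
	}
	fdp->fd_exclose = false;
	return closed;
}
```

In this model a descriptor marked with toy_mark_only() survives toy_exec_close(), while one marked with toy_set_exclose() is closed, which mirrors the behaviour the commit message reports.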
 1.263  14-Jul-2024  kre PR kern/58425 -- Disallow INT_MIN as a (negative) pid arg.

Since -INT_MIN is undefined, and the point of negative pid args is
to negate them, and use the result as a pgrp id instead, we need
to avoid accidentally negating INT_MIN.

Since pid_t is just an integral type, of unspecified width, when
testing a pid_t value, test for <= INT_MIN (or > INT_MIN sometimes)
rather than == INT_MIN. When testing int values, just == INT_MIN
is all that is needed, < INT_MIN cannot occur.

XXX pullup -9, -10
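
A minimal sketch of the check described in this entry, assuming a hypothetical helper toy_pid_to_pgid(). The choice of EINVAL as the error is an assumption made for illustration; the log does not say which errno the kernel returns.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>
#include <sys/types.h>

/*
 * Hypothetical helper: turn a negative pid argument into a process
 * group id.  Returns 0 and stores the pgrp id on success, or EINVAL
 * (an assumed error code) if the value is not negative or cannot be
 * negated safely.
 */
static int
toy_pid_to_pgid(pid_t pid, pid_t *pgid)
{
	assert(pgid != NULL);
	if (pid <= INT_MIN)	/* <=, not ==: pid_t may be wider than int */
		return EINVAL;
	if (pid >= 0)
		return EINVAL;	/* not a process-group request */
	*pgid = -pid;		/* safe: pid > INT_MIN here */
	return 0;
}
```

The ordering matters: the INT_MIN test has to come before the negation, since evaluating -INT_MIN is exactly the undefined operation being avoided.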
 1.262  04-Oct-2023  ad branches: 1.262.6;
kauth_cred_hold(): return cred verbatim so that donating a reference to
another data structure can be done more elegantly.
 1.261  23-Sep-2023  ad Reapply this change with a couple of bugs fixed:

- Do away with separate pool_cache for some kernel objects that have no special
requirements and use the general purpose allocator instead. On one of my
test systems this makes for a small (~1%) but repeatable reduction in system
time during builds presumably because it decreases the kernel's cache /
memory bandwidth footprint a little.
- vfs_lockf: cache a pointer to the uidinfo and put mutex in the data segment.
 1.260  12-Sep-2023  ad Back out recent change to replace pool_cache with the general allocator.
Will return to this when I have time again.
 1.259  10-Sep-2023  ad - Do away with separate pool_cache for some kernel objects that have no special
requirements and use the general purpose allocator instead. On one of my
test systems this makes for a small (~1%) but repeatable reduction in system
time during builds presumably because it decreases the kernel's cache /
memory bandwidth footprint a little.
- vfs_lockf: cache a pointer to the uidinfo and put mutex in the data segment.
 1.258  10-Sep-2023  ad It's easy to exhaust the open file limit on a system with many CPUs due to
caching. Allow a bit of leeway to reduce the element of surprise.
 1.257  22-Apr-2023  riastradh fcntl(2), flock(2): Assert FHASLOCK is clear if no fo_advlock.
 1.256  22-Apr-2023  riastradh file(9): New fo_advlock operation.

This moves the vnode-specific logic from sys_descrip.c into
vfs_vnode.c, like we did for fo_seek.

XXX kernel revbump -- struct fileops API and ABI change
 1.255  24-Feb-2023  riastradh kern: Eliminate most __HAVE_ATOMIC_AS_MEMBAR conditionals.

I'm leaving in the conditional around the legacy membar_enters
(store-before-load, store-before-store) in kern_mutex.c and in
kern_lock.c because they may still matter: store-before-load barriers
tend to be the most expensive kind, so eliding them is probably
worthwhile on x86. (It also may not matter; I just don't care to do
measurements right now, and it's a single valid and potentially
justifiable use case in the whole tree.)

However, membar_release/acquire can be mere instruction barriers on
all TSO platforms including x86, so there's no need to go out of our
way with a bad API to conditionalize them. If the procedure call
overhead is measurable we could just change them to be macros on x86
that expand into __insn_barrier.

Discussed on tech-kern:
https://mail-index.netbsd.org/tech-kern/2023/02/23/msg028729.html
 1.254  23-Feb-2023  riastradh kern_descrip.c: Change membar_enter to membar_acquire in fd_getfile.

membar_acquire is cheaper on many CPUs, and unlikely to be costlier
on any CPUs, than the legacy membar_enter.

Add a long comment explaining the interaction between fd_getfile and
fd_close and why membar_acquire is safe.

XXX pullup-10
 1.253  23-Feb-2023  riastradh kern_descrip.c: Use atomic_store_relaxed/release for ff->ff_file.

1. atomic_store_relaxed in fd_close avoids the appearance of race in
sanitizers (minor bug).

2. atomic_store_release in fd_affix is necessary because the lock
activity was not, in fact, enough to guarantee ordering (real bug
on some architectures like aarch64).

The premise appears to have been that the mutex_enter/exit earlier
in fd_affix is enough to guarantee that initialization of fp (A)
happens before use of fp by a user once fp is published (B):

fp->f_... = ...; // A

/* fd_affix */
mutex_enter(&fp->f_lock);
fp->f_count++;
mutex_exit(&fp->f_lock);
...
ff->ff_file = fp; // B

But actually mutex_enter/exit allow the following reordering by
the CPU:

mutex_enter(&fp->f_lock);
ff->ff_file = fp; // B
fp->f_count++;
fp->f_... = ...; // A
mutex_exit(&fp->f_lock);

The only constraints they imply are:

1. fp->f_count++ and B cannot precede mutex_enter
2. mutex_exit cannot precede A and fp->f_count++

They imply no constraint on the relative ordering of A, B, and
fp->f_count++ amongst each other, however.

This affects any architecture that has a native load-acquire or
store-release operation in mutex_enter/exit, like aarch64, instead
of explicit load-before-load/store and load/store-before-store
barrier.

No need for atomic_store_* in fd_copy or fd_free because we have
exclusive access to ff as is.

XXX pullup-9
XXX pullup-10
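
The publish step (B) in this entry can be sketched in portable C11 atomics. This is an illustration only: the toy_* types are hypothetical, the mutex around f_count is omitted, and memory_order_acquire is used on the reader side where the kernel is described elsewhere in this log as using atomic_load_consume.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct toy_file {
	int f_flag;	/* stands in for the fields initialized at step A */
	int f_count;
};

struct toy_fdfile {
	_Atomic(struct toy_file *) ff_file;
};

/* Publisher: initialize (A), then publish with release ordering (B). */
static void
toy_affix(struct toy_fdfile *ff, struct toy_file *fp, int flag)
{
	assert(fp != NULL);
	fp->f_flag = flag;	/* A */
	fp->f_count++;
	/* B: the release store keeps A and the count update before it. */
	atomic_store_explicit(&ff->ff_file, fp, memory_order_release);
}

/* Reader: NULL until published; acquire pairs with the release store. */
static struct toy_file *
toy_getfile(struct toy_fdfile *ff)
{
	return atomic_load_explicit(&ff->ff_file, memory_order_acquire);
}
```

A reader that observes a non-NULL pointer from toy_getfile() is then guaranteed to also observe the initialized f_flag, which is the ordering the mutex alone did not provide.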
 1.252  23-Feb-2023  riastradh kern_descrip.c: Fix membars around reference count decrement.

In general, the `last one out hit the lights' style of reference
counting (as opposed to the `whoever's destroying must wait for
pending users to finish' style) requires memory barriers like so:

... usage of resources associated with object ...
membar_release();
if (atomic_dec_uint_nv(&obj->refcnt) != 0)
return;
membar_acquire();
... freeing of resources associated with object ...

This way, all usage happens-before all freeing. This fixes several
errors:

- fd_close failed to ensure whatever its caller did would
happen-before the freeing, in the case where another thread is
concurrently trying to close the fd (ff->ff_file == NULL).

Fix: Add membar_release before atomic_dec_uint(&ff->ff_refcnt) in
that branch.

- fd_close failed to ensure all loads its caller had issued will have
happened-before the freeing, in the case where the fd is still in
use by another thread (fdp->fd_refcnt > 1 and ff->ff_refcnt-- > 0).

Fix: Change membar_producer to membar_release before
atomic_dec_uint(&ff->ff_refcnt).

- fd_close failed to ensure that any usage of fp by other callers
would happen-before any freeing it does.

Fix: Add membar_acquire after atomic_dec_uint_nv(&ff->ff_refcnt).

- fd_free failed to ensure that any usage of fdp by other callers
would happen-before any freeing it does.

Fix: Add membar_acquire after atomic_dec_uint_nv(&fdp->fd_refcnt).

While here, change membar_exit -> membar_release. No semantic
change, just updating away from the legacy API.

XXX pullup-8
XXX pullup-9
XXX pullup-10
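
The release/acquire pattern quoted in this entry translates to C11 atomics roughly as follows. This is a sketch under assumptions: toy_obj, toy_hold() and toy_rele() are hypothetical names, and the freed flag stands in for real teardown.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct toy_obj {
	atomic_uint refcnt;
	bool freed;	/* hypothetical flag standing in for real teardown */
};

static void
toy_hold(struct toy_obj *obj)
{
	atomic_fetch_add_explicit(&obj->refcnt, 1, memory_order_relaxed);
}

/* Returns true if this call dropped the last reference and tore down. */
static bool
toy_rele(struct toy_obj *obj)
{
	assert(atomic_load_explicit(&obj->refcnt, memory_order_relaxed) > 0);
	/* release: our use of obj is ordered before the decrement */
	if (atomic_fetch_sub_explicit(&obj->refcnt, 1,
	    memory_order_release) != 1)
		return false;	/* old value was not 1: others remain */
	/* acquire: every other user's work is ordered before teardown */
	atomic_thread_fence(memory_order_acquire);
	obj->freed = true;
	return true;
}
```

atomic_fetch_sub returns the old value, so comparing against 1 here corresponds to the log's test of the new value against 0.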
 1.251  29-Jun-2021  dholland branches: 1.251.10;
Add containment for the cloning devices hack in vn_open.

Cloning devices (and also things like /dev/stderr) work by allocating
a struct file, stuffing it in the file table (which is a layer
violation), stuffing the file descriptor number for it in a magic
field of struct lwp (which is gross), and then "failing" with one of
two magic errnos, EDUPFD or EMOVEFD.

Before this commit, all callers of vn_open in the kernel (there are
quite a few) were expected to check for these errors and handle the
situation. Needless to say, none of them except for open() itself did,
resulting in internal negative errnos being returned to userspace.

This hack is fairly deeply rooted and cannot be eliminated all at
once. This commit adds logic to handle the magic errnos inside
vn_open; now on success vn_open returns either a vnode or an integer
file descriptor, along with a flag that says whether the underlying
code requested EDUPFD or EMOVEFD. Callers not prepared to cope with
file descriptors can pass NULL for the extra return values, in which
case if a file descriptor would be produced vn_open fails with
EOPNOTSUPP.

Since I'm rearranging vn_open's signature anyway, stop exposing struct
nameidata. Instead, take three arguments: an optional vnode to use as
the starting point (like openat()), the path, and additional namei
flags to use, restricted to NOCHROOT and TRYEMULROOT. (Other namei
behavior, e.g. NOFOLLOW, can be requested via the open flags.)

This change requires a kernel bump. Ride the one an hour ago.
(That was supposed to be coordinated; did not intend to let an hour
slip by. My fault.)
 1.250  24-Dec-2020  nia branches: 1.250.4;
Avoid negating the minimum size of pid_t (this overflows).

Reported-by: syzbot+e2eb02f9dfaf4f2e6626@syzkaller.appspotmail.com
 1.249  28-Aug-2020  christos branches: 1.249.2;
We already zeroed the struct, no point in zeroing things twice.
 1.248  28-Aug-2020  riastradh Just zero out struct file::f_lock when exposed to userland.

Userland has no business examining a snapshot of the lock state, even
if pseudonymized. Should fix hppa build, where kmutex_t is somewhat
larger than anticipated by recent changes.
 1.247  26-Aug-2020  christos Instead of returning 0 when sysctl kern.expose_address=0, return a random
hashed value of the data. This allows sockstat to work without exposing
kernel addresses or being setgid kmem.
 1.246  23-May-2020  ad Move proc_lock into the data segment. It was dynamically allocated because
at the time we had mutex_obj_alloc() but not __cacheline_aligned.
 1.245  01-Feb-2020  riastradh Load struct fdfile::ff_file with atomic_load_consume.

Exceptions: when we're only testing whether it's there, not about to
dereference it.

Note: We do not use atomic_store_release to set it because the
preceding mutex_exit should be enough.

(That said, it's not clear the mutex_enter/exit is needed unless
refcnt > 0 already, in which case maybe it would be a win to switch
from the membar implied by mutex_enter to the membar implied by
atomic_store_release -- which I would generally expect to be much
cheaper. And a little clearer without a long comment.)
 1.244  01-Feb-2020  riastradh Load struct filedesc::fd_dt with atomic_load_consume.

Exceptions: when fd_refcnt <= 1, or when holding fd_lock.

While here:

- Restore KASSERT(mutex_owned(&fdp->fd_lock)) in fd_unused.
=> This is used only in fd_close and fd_abort, where it holds.
- Move bounds check assertion in fd_putfile to where it matters.
- Store fd_dt with atomic_store_release.
- Move load of fd_dt under lock in knote_fdclose.
- Omit membar_consumer in fdesc_readdir.
=> atomic_load_consume serves the same purpose now.
=> Was needed only on alpha anyway.
 1.243  20-Feb-2019  christos branches: 1.243.4; 1.243.6;
handle O_NOSIGPIPE too.
 1.242  03-Jan-2019  maxv Add KASSERT.
 1.241  24-Nov-2018  maxv Fix kernel pointer leaks in the kern.file sysctl, same as kern.file2.
 1.240  24-Nov-2018  maxv Rename fill_file -> fill_file2, since that's the KERN_FILE2 sysctl.
 1.239  02-Nov-2018  maxv Add LIST_INIT for filehead.
 1.238  05-Oct-2018  christos Provide a sysctl kern.expose_address to expose kernel addresses in
sysctl structure returns for non-root. Defaults to off. Turning it
on will restore sockstat/fstat and friends for regular users.
 1.237  13-Sep-2018  maxv Don't leak kernel pointers to userland in kern.file2, same as kern.proc2.
 1.236  03-Sep-2018  riastradh Rename min/max -> uimin/uimax for better honesty.

These functions are defined on unsigned int. The generic name
min/max should not silently truncate to 32 bits on 64-bit systems.
This is purely a name change -- no functional change intended.

HOWEVER! Some subsystems have

#define min(a, b) ((a) < (b) ? (a) : (b))
#define max(a, b) ((a) > (b) ? (a) : (b))

even though our standard name for that is MIN/MAX. Although these
may invite multiple evaluation bugs, these do _not_ cause integer
truncation.

To avoid `fixing' these cases, I first changed the name in libkern,
and then compile-tested every file where min/max occurred in order to
confirm that it failed -- and thus confirm that nothing shadowed
min/max -- before changing it.

I have left a handful of bootloaders that are too annoying to
compile-test, and some dead code:

cobalt ews4800mips hp300 hppa ia64 luna68k vax
acorn32/if_ie.c (not included in any kernels)
macppc/if_gm.c (superseded by gem(4))

It should be easy to fix the fallout once identified -- this way of
doing things fails safe, and the goal here, after all, is to _avoid_
silent integer truncations, not introduce them.

Maybe one day we can reintroduce min/max as type-generic things that
never silently truncate. But we should avoid doing that for a while,
so that existing code has a chance to be detected by the compiler for
conversion to uimin/uimax without changing the semantics until we can
properly audit it all. (Who knows, maybe in some cases integer
truncation is actually intended!)
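
The truncation hazard behind this rename can be shown with a small sketch. toy_uimin() mirrors the unsigned int signature described above; toy_min64() is a hypothetical type-preserving alternative, not a libkern function.

```c
#include <assert.h>
#include <stdint.h>

/* The demonstration assumes unsigned int is narrower than 64 bits. */
static_assert(sizeof(unsigned int) < sizeof(uint64_t),
    "unsigned int must be narrower than uint64_t for this sketch");

/* Same shape as the helper described above: defined on unsigned int. */
static unsigned int
toy_uimin(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/* Hypothetical alternative that keeps all 64 bits of its operands. */
static uint64_t
toy_min64(uint64_t a, uint64_t b)
{
	return a < b ? a : b;
}
```

With 32-bit unsigned int, passing the 64-bit value 2^32 + 5 to toy_uimin() converts it to 5 before the comparison, so the result against 100 is 5, whereas toy_min64() returns 100.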
 1.235  03-Jul-2018  kamil Avoid unportable signed integer left shift in fd_unused()

Detected with Kernel Undefined Behavior Sanitizer.

At least one place was reported; for consistency, fix all the
left bit shift operations.
sys/kern/kern_descrip.c:345:2, left shift of 1 by 31 places cannot be represented in type 'int'
sys/kern/kern_descrip.c:346:28, left shift of 1 by 31 places cannot be represented in type 'int'

Reported by <Harry Pantazis>
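
A sketch of the kind of bitmap update this sanitizer report concerns, assuming 32-bit map words. The toy_* names and constants are hypothetical, and using an unsigned 1U operand is an assumed form of the fix; the log only says the signed shift was avoided.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_NDENTRYSHIFT 5			/* log2 of bits per word */
#define TOY_NDENTRYMASK  ((1U << TOY_NDENTRYSHIFT) - 1)

static bool
toy_fd_isused(const uint32_t *map, unsigned int fd)
{
	return (map[fd >> TOY_NDENTRYSHIFT] &
	    (1U << (fd & TOY_NDENTRYMASK))) != 0;
}

static void
toy_fd_used(uint32_t *map, unsigned int fd)
{
	assert(!toy_fd_isused(map, fd));
	/* 1U, not 1: a signed 1 shifted by 31 does not fit in int */
	map[fd >> TOY_NDENTRYSHIFT] |= 1U << (fd & TOY_NDENTRYMASK);
}

static void
toy_fd_unused(uint32_t *map, unsigned int fd)
{
	assert(toy_fd_isused(map, fd));
	map[fd >> TOY_NDENTRYSHIFT] &= ~(1U << (fd & TOY_NDENTRYMASK));
}
```

Descriptor 31 is the interesting case: it selects the top bit of the first word, which is where the signed form `1 << 31` trips the sanitizer.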
 1.234  03-Jul-2018  kamil Avoid unportable signed integer left shift in fd_copy()

Detected with Kernel Undefined Behavior Sanitizer.

At least one place was reported; for consistency, fix all the
left bit shift operations.
sys/kern/kern_descrip.c:1492:3, left shift of 1 by 31 places cannot be represented in type 'int'
sys/kern/kern_descrip.c:1493:28, left shift of 1 by 31 places cannot be represented in type 'int'

Reported by <Harry Pantazis>
 1.233  03-Jul-2018  kamil Avoid unportable signed integer left shift in fd_isused()

Detected with Kernel Undefined Behavior Sanitizer.

sys/kern/kern_descrip.c:188:34, left shift of 1 by 31 places cannot be represented in type 'int'

Reported by <Harry Pantazis>
 1.232  03-Jul-2018  kamil Avoid unportable signed integer left shift in fd_used()

Detected with Kernel Undefined Behavior Sanitizer.

At least one place was reported; for consistency, fix all the
left bit shift operations.
sys/kern/kern_descrip.c:302:26, left shift of 1 by 31 places cannot be represented in type 'int'

Reported by <Harry Pantazis>
 1.231  01-Jun-2017  chs branches: 1.231.8; 1.231.10;
remove checks for failure after memory allocation calls that cannot fail:

kmem_alloc() with KM_SLEEP
kmem_zalloc() with KM_SLEEP
percpu_alloc()
pserialize_create()
psref_class_create()

all of these paths include an assertion that the allocation has not failed,
so callers should not assert that again.
 1.230  11-May-2017  nat Explicitly set the flags instead of masking set values in.

This fixes FNONBLOCK weirdness seen in audio.c

OK christos@ and martin@.
 1.229  03-Aug-2015  christos branches: 1.229.8;
1. mask fflags so we don't tack on whatever oflags were passed from userland
2. honor O_CLOEXEC, so the children of daemons that use cloning devices don't
end up with the parent's descriptors
fd_clone and in general the fd approach of 'allocate' > 'play with guts' >
'attach' should be converted to be more constructor like.
XXX: pullup-{6,7}
 1.228  21-Sep-2014  christos branches: 1.228.2;
remove casts to the same type.
 1.227  05-Sep-2014  matt Try not to use f_data, use f_{vnode,socket,pipe,mqueue,kqueue,ksem} to get
a correctly typed pointer.
 1.226  05-Sep-2014  matt Don't nest structure and enum definitions.
Don't use C++ keywords new, try, class, private, etc.
 1.225  25-Jul-2014  dholland branches: 1.225.2;
Add d_discard to all struct cdevsw instances I could find.

All have been set to "nodiscard"; some should get a real implementation.
 1.224  16-Mar-2014  dholland branches: 1.224.2;
Change (mostly mechanically) every cdevsw/bdevsw I can find to use
designated initializers.

I have not built every extant kernel so I have probably broken at
least one build; however I've also found and fixed some wrong
cdevsw/bdevsw entries so even if so I think we come out ahead.
 1.223  25-Feb-2014  pooka Ensure that the top level sysctl nodes (kern, vfs, net, ...) exist before
the sysctl link sets are processed, and remove redundancy.

Shaves >13kB off of an amd64 GENERIC, not to mention >1k duplicate
lines of code.
 1.222  15-Sep-2013  martin Remove __CT_LOCAL_.. hack
 1.221  14-Sep-2013  martin Avoid warnings for a local CTASSERT
 1.220  05-Sep-2013  pooka In fd_abort(), reset ff_exclose to preserve invariants expected by fd_free()
 1.219  24-Nov-2012  christos branches: 1.219.2;
Return EOPNOTSUPP for fnullop_kqfilter to prevent registration of unsupported
fds. XXX: We should really fix the fd's to be supported in the future.
Unsupported fd's have a NULL f_event, so registering crashes the kernel with
a NULL function dereference of f_event.
 1.218  25-Jan-2012  christos branches: 1.218.2; 1.218.6; 1.218.8;
As discussed in tech-kern, provide the means to prevent delivery of SIGPIPE
on EPIPE for all file descriptor types:

- provide O_NOSIGPIPE for open,kqueue1,pipe2,dup3,fcntl(F_{G,S}ETFL) [NetBSD]
- provide SOCK_NOSIGPIPE for socket,socketpair [NetBSD]
- provide SO_NOSIGPIPE for {g,s}etsockopt [NetBSD/FreeBSD/MacOSX]
- provide F_{G,S}ETNOSIGPIPE for fcntl [MacOSX]
 1.217  25-Sep-2011  chs branches: 1.217.2; 1.217.6;
in fd_allocfile(), free the fd if we fail to allocate a file.
 1.216  15-Jul-2011  christos fail with EINVAL if flags are not O_CLOEXEC|O_NONBLOCK in pipe2(2) and
dup3(2)
 1.215  26-Jun-2011  christos * Arrange for interfaces that create new file descriptors to be able to
set close-on-exec on creation (http://udrepper.livejournal.com/20407.html).

- Add F_DUPFD_CLOEXEC to fcntl(2).
- Add MSG_CMSG_CLOEXEC to recvmsg(2) for unix file descriptor passing.
- Add dup3(2) syscall with a flags argument for O_CLOEXEC, O_NONBLOCK.
- Add pipe2(2) syscall with a flags argument for O_CLOEXEC, O_NONBLOCK.
- Add flags SOCK_CLOEXEC, SOCK_NONBLOCK to the socket type parameter
for socket(2) and socketpair(2).
- Add new paccept(2) syscall that takes an additional sigset_t to alter
the sigmask temporarily and a flags argument to set SOCK_CLOEXEC,
SOCK_NONBLOCK.
- Add new mode character 'e' to fopen(3) and popen(3) to open pipes
and file descriptors for close on exec.
- Add new kqueue1(2) syscall with a new flags argument to open the
kqueue file descriptor with O_CLOEXEC, O_NONBLOCK.

* Fix the system calls that take socklen_t arguments to actually do so.

* Don't include userland header files (signal.h) from system header files
(rump_syscallargs.h).

* Bump libc version for the new syscalls.
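
From userland, one of the interfaces listed in this entry, fcntl(F_DUPFD_CLOEXEC), can be exercised as below. This is a small sketch: the toy_* wrappers are hypothetical, and the feature-test macro is there only so that strict compilation modes expose F_DUPFD_CLOEXEC.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Duplicate fd with close-on-exec set at creation; new fd or -1. */
static int
toy_dup_cloexec(int fd)
{
	assert(fd >= 0);
	return fcntl(fd, F_DUPFD_CLOEXEC, 0);
}

/* Returns 1 if fd is close-on-exec, 0 if it is not, -1 on error. */
static int
toy_is_cloexec(int fd)
{
	int flags = fcntl(fd, F_GETFD);

	if (flags == -1)
		return -1;
	return (flags & FD_CLOEXEC) != 0;
}
```

Setting the flag at creation closes the window that exists when a descriptor is first duplicated and only later marked with fcntl(F_SETFD), during which another thread could fork and exec.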
 1.214  24-Apr-2011  rmind Drop extern inline for fd_getfile(). Apparently, GCC already ignores it.
 1.213  23-Apr-2011  rmind - Sprinkle __cacheline_aligned and __read_mostly in file descriptor code.
- While here, remove trailing whitespaces, KNF.
 1.212  10-Apr-2011  christos - Add O_CLOEXEC to open(2)
- Add fd_set_exclose() to encapsulate uses of FIO{,N}CLEX, O_CLOEXEC, F{G,S}ETFD
- Add a pipe1() function to allow passing flags to the fd's that pipe(2)
opens to ease implementation of linux pipe2(2)
- Factor out fp handling code from open(2) and fhopen(2)
 1.211  15-Feb-2011  pooka Support FD_CLOEXEC in rump kernels.
 1.210  28-Jan-2011  pooka Move sysctl routines from init_sysctl.c to kern_descrip.c (for
descriptors) and kern_proc.c (for processes). This makes them
usable in a rump kernel, in case somebody was wondering.
 1.209  01-Jan-2011  pooka branches: 1.209.2; 1.209.4;
Update comment and, inspired by that, update variable naming too.
no functional change.
 1.208  17-Dec-2010  yamt update some comments
 1.207  29-Oct-2010  pooka Attach implicit threads to initproc instead of proc0. This way
applications which alter, on purpose or by accident, the uid in an
implicit thread don't affect kernel threads.

from discussion with njoly
 1.206  01-Sep-2010  pooka Actually, the comment probably meant "would be nice to KASSERT here,
but can't". So turn it into a KASSERT now that it's possible.
 1.205  01-Sep-2010  pooka Remove XXX comment. I'm not sure what it precisely means, but I'm
guessing it's from a time when rump used filedesc0 for everything
(and that isn't true anymore).
 1.204  04-Aug-2010  pooka Remove overzealous KASSERT: the refcount can be non-zero if another
thread attempts to use a non-open file descriptor. from ad

fixes PR kern/43694
 1.203  01-Jul-2010  rmind Remove pfind() and pgfind(), fix locking in various broken uses of these.
Rename real routines to proc_find() and pgrp_find(), remove PFIND_* flags
and have consistent behaviour. Provide proc_find_raw() for special cases.
Fix memory leak in sysctl_proc_corename().

COMPAT_LINUX: rework ptrace() locking, minimise differences between
different versions per-arch.

Note: while this change adds some formal cosmetics for COMPAT_DARWIN and
COMPAT_IRIX - locking there is utterly broken (for ages).

Fixes PR/43176.
 1.202  20-Dec-2009  dsl branches: 1.202.2; 1.202.4;
If a multithreaded app closes an fd while another thread is blocked in
read/write/accept, then the expectation is that the blocked thread will
exit and the close complete.
Since only one fd is affected, but many fd can refer to the same file,
the close code can only request the fs code unblock with ERESTART.
Fixed for pipes and sockets, ERESTART will only be generated after such
a close - so there should be no change for other programs.
Also rename fo_abort() to fo_restart() (this used to be fo_drain()).
Fixes PR/26567
 1.201  09-Dec-2009  dsl Rename fo_drain() to fo_abort(), 'drain' is used to mean 'wait for output
to drain' in many places, whereas fo_drain() was called in order to force
blocking read()/write() etc calls to return to userspace so that a close()
call from a different thread can complete.
In the sockets code comment out the broken code in the inner function,
it was being called from compat code.
 1.200  27-Oct-2009  rmind - Amend fd_hold() to take an argument and add assert (reflects two cases,
fork1() and the rest, e.g. kthread_create(), when creating from lwp0).

- lwp_create(): do not touch filedesc internals, use fd_hold().
 1.199  16-Aug-2009  yamt assertion
 1.198  30-Jun-2009  martin Update fd_freefile when kqueue descriptors are not copied from
parent to child. From Wolfgang Solfrank in PR kern/41651.
Approved by Andrew Doran.
 1.197  08-Jun-2009  yamt fd_free: fix posix advisory locks. PR/41549 from HITOSHI OSADA.
 1.196  07-Jun-2009  yamt shut up the following assertion failure and add a comment.

panic: kernel diagnostic assertion "!fd_isused(fdp, fd)" failed: file "/siro/nbsd/src/sys/kern/kern_descrip.c", line 175
 1.195  29-May-2009  yamt fd_free: reset fd_himap/lomap to make fd_checkmaps comfortable. PR/41487.
 1.194  28-May-2009  yamt wrap a long line.
 1.193  26-May-2009  ad PR kern/41487: kern_descrip.c assertion failure

Remove bogus assertion.
 1.192  24-May-2009  ad More changes to improve kern_descrip.c.

- Avoid atomics in more places.
- Remove the per-descriptor mutex, and just use filedesc_t::fd_lock.
It was only being used to synchronize close, and in any case we needed
to take fd_lock to free the descriptor slot.
- Optimize certain paths for the <NDFDFILE case.
- Sprinkle more comments and assertions.
- Cache more stuff in filedesc_t.
- Fix numerous minor bugs spotted along the way.
- Restructure how the open files array is maintained, for clarity and so
that we can eliminate the membar_consumer() call in fd_getfile(). This is
mostly syntactic sugar; the main functional change is that fd_nfiles now
lives alongside the open file array.

Some measurements with libmicro:

- simple file syscalls like close() are between 1 and 10% faster.
- some nice improvements, e.g. poll(1000) which is ~50% faster.
 1.191  23-May-2009  ad Make descriptor access and file allocation cheaper in many cases,
mostly by avoiding a bunch of atomic operations.
 1.190  04-Apr-2009  ad Add fileops::fo_drain(), to be called from fd_close() when there is more
than one active reference to a file descriptor. It should dislodge threads
sleeping while holding a reference to the descriptor. Implemented only for
sockets but should be extended to pipes, fifos, etc.

Fixes the case of a multithreaded process doing something like the
following, which would have hung until the process got a signal.

thr0 accept(fd, ...)
thr1 close(fd)
 1.189  29-Mar-2009  rmind fownsignal: pre-check for zero pgid, avoids locking of proc_lock.
 1.188  11-Mar-2009  mrg completely rework the way that orphaned sockets that are being fdpassed
via SCM_RIGHTS messages are dealt with:

1. unp_gc: make this a kthread.

2. unp_detach: do not call unp_gc directly. instead, wake up unp_gc kthread.

3. unp_scan: do not close files here. instead, put them on a global list
for unp_gc to close, along with a per-file "deferred close count". if
file is already enqueued for close, just increment deferred close count.
this eliminates the recursive calls.

4. unp_gc: scan files on global deferred close list. close each file N
times, as specified by deferred close count in file. continue processing
list until it becomes empty (closing may cause additional files to be
queued for close).

5. unp_gc: add additional bit to mark files we are scanning. set during
initial scan of global file list that currently clears FMARK/FDEFER.
during later scans, never examine / garbage collect descriptors that
we have not marked during the earlier scan. do not proceed with this
initial scan until all deferred closes have been processed. be careful
with locking to ensure no races are introduced between deferred close
and file scan.

6. unp_gc: use dummy file_t to mark position in list when scanning. allows
us to drop filelist_lock. in turn allows us to eliminate kmem_alloc()
and safely close files, etc.

7. prohibit transfer of descriptors within SCM_RIGHTS messages if
(num_files_in_transit > maxfiles / unp_rights_ratio)

8. fd_allocfile: ensure recycled files don't get scanned.


this is 97% work done by andrew doran, with a couple of minor bug fixes
and a lot of testing by yours truly.
 1.187  08-Mar-2009  ad Don't bother with file_t::f_iflags any more, as it's not used.
Noted by mrg@.
 1.186  02-Mar-2009  rmind fd_copy: fix off-by-one bug in a race condition path and assert.
Should fix PR/40625. OK by <ad>.
 1.185  21-Dec-2008  ad branches: 1.185.2;
- Fix a bug where we trashed descriptor zero in the old open files array
while ironically trying to preserve the same during copy. Would only have
occurred if a multithreaded program expanded the descriptor table and,
within a tiny window of exposure, another thread in the program tried to
access descriptor zero.

- Convert to use kmem_alloc/kmem_free.
 1.184  18-Nov-2008  pooka Move fd_closeexec() and fd_checkstd() from kern_descrip to their
own file, subr_exec_fd.c (they're used only by exec).

After this change, the kernel source modules are in a partitioned
enough state to allow building a system without vfs at all.
 1.183  18-Nov-2008  pooka cwd is logically a vfs concept, so take it out from the bosom of
kern_descrip and into vfs_cwd. No functional change.
 1.182  02-Jul-2008  matt branches: 1.182.2; 1.182.4; 1.182.6;
Change {ff,fd}_exclose and ff_allocated to bool. Change exclose arg to
fd_dup to bool. Switch assignments from 1/0 to true/false.

This makes alpha kernels compile. Bump kern to 4.99.69 since structure
changed.
 1.181  02-Jul-2008  matt Switch from KASSERT to CTASSERT for those asserts testing sizes of types.
 1.180  24-Jun-2008  gmcgarry ioctl commands are unsigned long. Changes ABI for fsetown() and fgetown() on 64-bit architectures.
 1.179  05-May-2008  ad branches: 1.179.2; 1.179.4;
- Convert hashinit() to use kmem_alloc(). The hash tables can be large
and it's better to not have them in kmem_map.
- Convert a couple of minor items along the way to kmem_alloc().
- Fix some memory leaks.
 1.178  28-Apr-2008  martin Remove clause 3 and 4 from TNF licenses
 1.177  24-Apr-2008  ad branches: 1.177.2;
Merge proc::p_mutex and proc::p_smutex into a single adaptive mutex, since
we no longer need to guard against access from hardware interrupt handlers.

Additionally, if cloning a process with CLONE_SIGHAND, arrange to have the
child process share the parent's lock so that signal state may be kept in
sync. Partially addresses PR kern/37437.
 1.176  24-Apr-2008  ad Network protocol interrupts can now block on locks, so merge the globals
proclist_mutex and proclist_lock into a single adaptive mutex (proc_lock).
Implications:

- Inspecting process state requires thread context, so signals can no longer
be sent from a hardware interrupt handler. Signal activity must be
deferred to a soft interrupt or kthread.

- As the proc state locking is simplified, it's now safe to take exit()
and wait() out from under kernel_lock.

- The system spends less time at IPL_SCHED, and there is less lock activity.
 1.175  09-Apr-2008  wiz branches: 1.175.2;
Commit fix for the fdfile leak described in PR 38374.

Patch provided by YAMAMOTO Takashi.

Ok ad@
 1.174  27-Mar-2008  ad Replace use of CACHE_LINE_SIZE in some obvious places.
 1.173  21-Mar-2008  ad File descriptor changes, discussed on tech-kern:

- Redo reference counting to be sane. LWPs accessing files take a short
term reference on the local file descriptor. This is the most common
case. While a file is in a process descriptor table, a reference is
held to the file. The file reference count only changes during control
operations like open() or close(). Code that comes at files from an
unusual direction (i.e. foreign to the process) like procfs or sysctl
takes a reference on the file (f_count), and not on a descriptor.

- Remove knowledge of reference counting and locking from most code that
deals with files.

- Make the usual case of file descriptor lookup lockless.

- Make kqueue MP and MT safe. PR kern/38098, PR kern/38137.

- Fix numerous file handling bugs, and bugs in the descriptor code that
affected multithreaded processes.

- Split descriptor system calls out into sys_descrip.c.

- A few stylistic changes: KNF, remove unused casts now that caddr_t is
gone. Replace dumb gotos with loop control in a few places.

- Don't do redundant pointer passing (struct proc, lwp, filedesc *) unless
the routine is likely to be inlined. Most of the time it's about the
current process.
 1.172  06-Feb-2008  ad branches: 1.172.6;
- Shrink 'struct file' to 60 bytes on 32-bit platforms.
- Align 'struct file' and 'struct filedesc' to CACHE_LINE_SIZE.
 1.171  27-Jan-2008  dsl Move the prototype for do_posix_fadvise() somewhere useful.
 1.170  27-Jan-2008  martin Implement new version of posix_fadvise as a stub calling the real
worker function, and compatibility stub doing the same with old argument
structure.
 1.169  05-Jan-2008  ad Add fgetdummy/fputdummy: allocate and free dummy 'struct file' entries
to be used when traversing filehead.
 1.168  05-Jan-2008  dsl Use FILE_LOCK() and FILE_UNLOCK()
 1.167  26-Dec-2007  ad Merge more changes from vmlocking2, mainly:

- Locking improvements.
- Use pool_cache for more items.
 1.166  20-Dec-2007  dsl Convert all the system call entry points from:
int foo(struct lwp *l, void *v, register_t *retval)
to:
int foo(struct lwp *l, const struct foo_args *uap, register_t *retval)
Fixup compat code to not write into 'uap' and (in some cases) to actually
pass a correctly formatted 'uap' structure with the right name to the
next routine.
A few 'compat' routines that just call standard ones have been deleted.
All the 'compat' code compiles (along with the kernels required to test
build it).
98% done by automated scripts.
 1.165  08-Dec-2007  pooka branches: 1.165.4;
Remove cn_lwp from struct componentname. curlwp should be used
from now on. The NDINIT() macro no longer takes the lwp parameter and
associates the credentials of the calling thread with the namei
structure.
 1.164  29-Nov-2007  ad branches: 1.164.2;
Use atomics to adjust filedesc::fd_refcnt.
 1.163  29-Nov-2007  ad Use atomics to adjust cwdi_refcnt.
 1.162  07-Nov-2007  ad Merge from vmlocking:

- pool_cache changes.
- Debugger/procfs locking fixes.
- Other minor changes.
 1.161  08-Oct-2007  ad branches: 1.161.2; 1.161.4;
Merge file descriptor locking, cwdi locking and cross-call changes
from the vmlocking branch.
 1.160  07-Sep-2007  rmind branches: 1.160.2;
Implementation of POSIX message queues.

Reviewed by: <ad>, <tech-kern>
 1.159  09-Jul-2007  ad branches: 1.159.2; 1.159.6; 1.159.8;
Merge some of the less invasive changes from the vmlocking branch:

- kthread, callout, devsw API changes
- select()/poll() improvements
- miscellaneous MT safety improvements
 1.158  12-May-2007  dsl Split the fcntl locking code out from its copyin/out.
Use this to avoid all the stackgap stuff in compat code.
 1.157  22-Apr-2007  dsl I'm not sure why I decided that cwdinit() shouldn't copy cwd_edir.
Since this is called in fork() it does rather need to give the child
process the parent's emulation root.
This means that (for example) an emulated shell will, by default, run
programs from the emulation root.
 1.156  22-Apr-2007  dsl Change the way that emulations locate files within the emulation root to
avoid having to allocate space in the 'stackgap'
- which is very LWP unfriendly.
The additional code for non-emulation namei() is trivial, the reduction for
the emulations is massive.
The vnode for a process's emulation root is saved in the cwdi structure
during process exec.
If the emulation root and the TRYEMULROOT flag are set, namei() will do an initial
search for absolute pathnames in the emulation root, if that fails it will
retry from the normal root.
".." at the emulation root will always go to the real root, even in the middle
of paths and when expanding symlinks.
Absolute symlinks found using absolute paths in the emulation root will be
relative to the emulation root (so /usr/lib/xxx.so -> /lib/xxx.so links
inside the emulation root don't need changing).
If the root of the emulation would be returned (for an emulation lookup), then
the real root is returned instead (matching the behaviour of emul_lookup,
but being a cheap comparison here) so that programs that scan "../.."
looking for the root directory don't loop forever.
The target for symbolic links is no longer mangled (it used to get the
CHECK_ALT_xxx() treatment, so could get /emul/xxx prepended).
CHECK_ALT_xxx() are no more. Most of the change is deleting them, and adding
TRYEMULROOT to the flags to NDINIT().
A lot of the emulation system call stubs could now be deleted.
 1.155  21-Mar-2007  dsl Somehow a single K&R function definition was lurking - nuke it.
 1.154  12-Mar-2007  ad branches: 1.154.2; 1.154.4;
Pass an ipl argument to pool_init/POOL_INIT to be used when initializing
the pool's lock.
 1.153  10-Mar-2007  dsl branches: 1.153.2;
Split the work for sys_stat, sys_lstat, sys_fstat and sys_fhstat out into
separate functions that don't do the copyout.
This allows all the compat_xxx versions to convert the 'struct stat' to
the correct format without using the 'stackgap'.
The stackgap isn't at all LWP friendly, and needs to be removed from
any compat functions that might involve threads (inc. clone()).
The code is still binary compatible with existing LKMs.
 1.152  09-Mar-2007  ad - Make the proclist_lock a mutex. The write:read ratio is unfavourable,
and mutexes are cheaper to use than RW locks.
- LOCK_ASSERT -> KASSERT in some places.
- Hold proclist_lock/kernel_lock longer in a couple of places.
 1.151  17-Feb-2007  pavel Change the process/lwp flags seen by userland via sysctl back to the
P_*/L_* naming convention, and rename the in-kernel flags to avoid
conflict. (P_ -> PK_, L_ -> LW_ ). Add back the (now unused) LSDEAD
constant.

Restores source compatibility with pre-newlock2 tools like ps or top.

Reviewed by Andrew Doran.
 1.150  09-Feb-2007  ad branches: 1.150.2;
Merge newlock2 to head.
 1.149  31-Jan-2007  ad ffree(): don't call kauth_cred_free() with a held simplelock.
 1.148  06-Dec-2006  yamt use KSI_INIT rather than memset. no functional changes.
 1.147  01-Nov-2006  yamt remove some __unused from function parameters.
 1.146  12-Oct-2006  christos - sprinkle __unused on function decls.
- fix a couple of unused bugs
- no more -Wno-unused for i386
 1.145  02-Sep-2006  christos branches: 1.145.2; 1.145.4;
add missing initializer
 1.144  23-Jul-2006  ad Use the LWP cached credentials where sane.
 1.143  14-May-2006  elad integrate kauth.
 1.142  15-Apr-2006  christos Coverity CID 845: Make it clear that devnullfp != NULL.
 1.141  07-Mar-2006  pooka branches: 1.141.2; 1.141.4;
remove the no longer useful fdavail(), as proposed and (thankfully) not
discussed on tech-kern
 1.140  31-Jan-2006  yamt branches: 1.140.2; 1.140.4; 1.140.6;
falloc: grab fd_slock when calling fd_unused.
 1.139  24-Dec-2005  perry branches: 1.139.2;
Remove leading __ from __(const|inline|signed|volatile) -- it is obsolete.
 1.138  11-Dec-2005  christos merge ktrace-lwp.
 1.137  29-Nov-2005  yamt merge yamt-readahead branch.
 1.136  03-Oct-2005  mrg branches: 1.136.6;
fix a bug pointed out by der mouse on tech-kern: in F_GETOWN, use a
pointer to a temporary "int" variable to pass to fo_ioctl(TIOCGPGRP), not
a register_t pointer. (how did F_GETOWN ever work on sparc64 before?)
 1.135  19-Aug-2005  christos 64 bit inode changes.
 1.134  23-Jun-2005  thorpej branches: 1.134.2;
Use ANSI function decls. Apply some static.
 1.133  29-May-2005  christos - add const.
- remove unnecessary casts.
- add __UNCONST casts and mark them with XXXUNCONST as necessary.
 1.132  20-May-2005  wrstuden The file being closed is (fdp->fd_lastfile - i), not i. So compare
(fdp->fd_lastfile - i) against fd_knlistsize. Otherwise we can
call knote_fdclose() on a file descriptor that doesn't have a knote.

This issue explains random panics I have had on process exit over the
past few years.
 1.131  26-Feb-2005  perry branches: 1.131.2;
nuke trailing whitespace
 1.130  12-Feb-2005  christos pass the flag to fdclone.
 1.129  14-Jan-2005  cube branches: 1.129.2; 1.129.4;
As fd_lastfile might be negative, we can't use the (u_int) cast trick to
compare fd and fdp->fd_lastfile in fdrelease(), so change the test to a
more explicit one. Spotted by Matt Thomas.

Should fix the panic reported by Matthias Scheler.
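
The log does not show the expression that fdrelease() used, so the following is only an illustrative sketch of why a (u_int) cast comparison breaks once fd_lastfile can be -1. The function names in_range_cast and in_range_explicit are hypothetical and do not come from the kernel source.

```c
#include <assert.h>

/*
 * Cast trick: one unsigned comparison stands in for
 * "0 <= fd && fd <= lastfile". It is only valid while lastfile >= 0,
 * because (unsigned int)-1 is the largest unsigned value.
 */
static int
in_range_cast(int fd, int lastfile)
{
	return (unsigned int)fd <= (unsigned int)lastfile;
}

/* Explicit test: still correct when lastfile is -1 (nothing open). */
static int
in_range_explicit(int fd, int lastfile)
{
	return fd >= 0 && fd <= lastfile;
}
```

With lastfile set to -1, the cast version accepts every non-negative fd, while the explicit version rejects them all.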
 1.128  12-Jan-2005  cube fd_lastfile should be -1 when there are no opened file descriptors.
Hence, make find_last_set return -1 in such situation, and initialize it
such. Otherwise, with 0 meaning two things, it confused the F_CLOSEM
fcntl which could end up looping indefinitely (PR#28929 by Brian Marcotte).

However, this change exposes another bug in fdcopy(), where more entries
than needed were cleared in the new file descriptor table, so the memset()
call there is fixed too.

Analyzed with the help of Greg Oster.
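
A minimal sketch of the "-1 means no descriptors" convention described above, assuming a plain array of 32-bit bitmap words. The name last_set and the bitmap layout are hypothetical; this is not the kernel's find_last_set().

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return the index of the highest set bit across nwords 32-bit words,
 * or -1 if no bit is set. Returning 0 for the empty case would be
 * ambiguous, since 0 is also a valid descriptor number.
 */
static int
last_set(const uint32_t *bitmap, int nwords)
{
	for (int w = nwords - 1; w >= 0; w--) {
		if (bitmap[w] == 0)
			continue;
		for (int b = 31; b >= 0; b--) {
			if (bitmap[w] & (UINT32_C(1) << b))
				return w * 32 + b;
		}
	}
	return -1;
}
```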
 1.127  30-Nov-2004  christos Cloning cleanup:
1. make fileops const
2. add 2 new negative errno's to `officially' support the cloning hack:
- EDUPFD (used to overload ENODEV)
- EMOVEFD (used to overload ENXIO)
3. Created an fdclone() function to encapsulate the operations needed for
EMOVEFD, and made all cloners use it.
4. Centralize the local noop/badop fileops functions to:
fnullop_fcntl, fnullop_poll, fnullop_kqfilter, fbadop_stat
 1.126  31-May-2004  pk Implement mutexes for file descriptor and current working directory access.
Fix a potential race condition when reallocating storage for file descriptors
(even for non-SMP kernels).
Add missing locks for `struct file' ref count updates.
 1.125  25-Apr-2004  simonb Initialise (most) pools from a link set instead of explicit calls
to pool_init. Untouched pools are ones that are either in arch-specific
code, or aren't initialised during initial system startup.

Convert struct session, ucred and lockf to pools.
 1.124  05-Apr-2004  yamt add assertions related to file descriptor allocation.
 1.123  07-Jan-2004  jdolecek branches: 1.123.2;
fix F_MAXFD fcntl - it returned the value as errno instead
of return value from the syscall
from mouss <usebsd at free dot fr>
 1.122  05-Jan-2004  christos Add F_CLOSEM, F_MAXFD from Matt Thomas.
 1.121  30-Nov-2003  provos fix off by one in find_last_set(); triggered for processes that have no
open file descriptors; found by tim robbins from freebsd
 1.120  26-Nov-2003  yamt fdcopy: copy inline bitmaps properly.
hopefully fixes PR/23469.
 1.119  09-Nov-2003  yamt fix typos in comments.
 1.118  09-Nov-2003  yamt - fix a use-after-free bug in /dev/fd/* handling.
specifically, don't keep a stale pointer in fd_ofiles.
it isn't needed anymore as fd allocation is now done using bitmaps.
- clean up dupfdopen() a little.
- don't call fd_used() unnecessarily.
 1.117  09-Nov-2003  yamt in the non-overwritten case of sys_dup2(),
call fd_used() by itself rather than leaving it to finishdup().
 1.116  01-Nov-2003  provos use fdremove to remove kqueue file descriptor so that bitmap information
is maintained correctly; found by Juergen Hannken-Illjes
 1.115  30-Oct-2003  provos use a two-level bitmap as suggested by mogul and banga for fdalloc;
approved thorpej@
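
The log only names the technique, so the following is a sketch of one possible two-level bitmap allocator, not the code committed here. It assumes 32 low-level words of 32 bits each; struct fdmap, fdmap_alloc and fdmap_free are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define NWORDS 32	/* 32 low words x 32 bits = 1024 slots in this sketch */

struct fdmap {
	uint32_t himap;		/* bit w set => lomap[w] is completely full */
	uint32_t lomap[NWORDS];	/* bit b of word w set => slot w*32+b in use */
};

/* Index of the lowest clear bit in word, or -1 if all bits are set. */
static int
lowest_zero(uint32_t word)
{
	for (int b = 0; b < 32; b++) {
		if ((word & (UINT32_C(1) << b)) == 0)
			return b;
	}
	return -1;
}

/* Allocate the lowest free slot, or return -1 if the map is full. */
static int
fdmap_alloc(struct fdmap *m)
{
	int w = lowest_zero(m->himap);	/* first word with a free bit */
	if (w < 0)
		return -1;
	int b = lowest_zero(m->lomap[w]);
	m->lomap[w] |= UINT32_C(1) << b;
	if (m->lomap[w] == UINT32_MAX)	/* word just became full */
		m->himap |= UINT32_C(1) << w;
	return w * 32 + b;
}

/* Release a slot and record that its word has room again. */
static void
fdmap_free(struct fdmap *m, int slot)
{
	int w = slot / 32, b = slot % 32;

	m->lomap[w] &= ~(UINT32_C(1) << b);
	m->himap &= ~(UINT32_C(1) << w);
}
```

The point of the upper level is that a search skips full words in one step instead of scanning every descriptor slot.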
 1.114  22-Sep-2003  christos - pass signo to fownsignal [ok by jd]
- make urg signal handling use fownsignal
- remove out of band detection in sowakeup
 1.113  21-Sep-2003  jdolecek cleanup & uniform descriptor owner handling:
* introduce fsetown(), fgetown(), fownsignal() - this sets/retrieves/signals
the owner of descriptor, according to appropriate semantics
of TIOCSPGRP/FIOSETOWN/SIOCSPGRP/TIOCGPGRP/FIOGETOWN/SIOCGPGRP ioctl; use
these routines instead of custom code where appropriate
* make every place handling TIOCSPGRP/TIOCGPGRP handle also FIOSETOWN/FIOGETOWN
properly, and remove the translation of FIO[SG]OWN to TIOC[SG]PGRP
in sys_ioctl() & sys_fcntl()
* also remove the socket-specific hack in sys_ioctl()/sys_fcntl() and
pass the ioctls down to soo_ioctl() as any other ioctl

change discussed on tech-kern@
 1.112  13-Sep-2003  jdolecek move dupfd from struct proc to struct lwp - it's per-LWP, not per-process; we
use curlwp where the lwp is not directly available, i.e. in device open
routines

briefly discussed on tech-kern
 1.111  07-Aug-2003  agc Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22364, verified by myself.
 1.110  29-Jun-2003  fvdl branches: 1.110.2;
Back out the lwp/ktrace changes. They contained a lot of collateral damage,
and need to be examined and discussed more.
 1.109  28-Jun-2003  darrenr Pass lwp pointers throughout the kernel, as required, so that the lwpid can
be inserted into ktrace records. The general change has been to replace
"struct proc *" with "struct lwp *" in various function prototypes, pass
the lwp through and use l_proc to get the process pointer when needed.

Bump the kernel rev up to 1.6V
 1.108  16-May-2003  itojun use strlcat
 1.107  22-Mar-2003  dsl Correct rewinding if FIONBIO or FIOASYNC fail in F_SETFL
(code used to always turn off FIONBIO if FIOASYNC fails)
(approved by christos)
 1.106  22-Mar-2003  dsl Change caddr_t to void *
 1.105  17-Mar-2003  martin When being passed bogus file descriptors make close(2) return EBADF.
From Stephen Ma in PR kern/20762.
 1.104  01-Mar-2003  yamt make fdcheckstd f_slock friendly.
 1.103  23-Feb-2003  pk Make updating a file's reference and use count MP-safe.
 1.102  14-Feb-2003  pk Use a mutex to protect the global list of open files.
 1.101  01-Feb-2003  thorpej Add extensible malloc types, adapted from FreeBSD. This turns
malloc types into a structure, a pointer to which is passed around,
instead of an int constant. Allow the limit to be adjusted when the
malloc type is defined, or with a function call, as suggested by
Jonathan Stone.
 1.100  19-Jan-2003  simonb Remove variable that is only assigned to but not referenced.
 1.99  18-Jan-2003  thorpej Merge the nathanw_sa branch.
 1.98  06-Jan-2003  wiz descriptor, not decriptor.
 1.97  24-Nov-2002  scw Quell uninitialised variable warnings.
 1.96  23-Oct-2002  jdolecek merge kqueue branch into -current

kqueue provides a stateful and efficient event notification framework
currently supported events include socket, file, directory, fifo,
pipe, tty and device changes, and monitoring of processes and signals

kqueue is supported by all writable filesystems in NetBSD tree
(with exception of Coda) and all device drivers supporting poll(2)

based on work done by Jonathan Lemon for FreeBSD
initial NetBSD port done by Luke Mewburn and Jason Thorpe
 1.95  23-Sep-2002  simonb fp->f_count is unsigned, don't check if it's less than zero.
 1.94  06-Sep-2002  gehenna Merge the gehenna-devsw branch into the trunk.

This merge changes the device switch tables from static array to
dynamically generated by config(8).

- All device switches are defined as constant structures in device drivers.

- The new grammar ``device-major'' is introduced to ``files''.

device-major <prefix> char <num> [block <num>] [<rules>]

- All device major numbers must be listed up in port dependent majors.<arch>
by using this grammar.

- Added the new naming convention.
The name of the device switch must be <prefix>_[bc]devsw for auto-generation
of device switch tables.

- The backward compatibility of loading block/character device
switch by LKM framework is broken. This is necessary to convert
from block/character device major to device name in runtime and vice versa.

- The restriction to assign device major by LKM is completely removed.
We don't need to reserve LKM entries for dynamic loading of device switch.

- In compile time, device major numbers list is packed into the kernel and
the LKM framework will refer it to assign device major number dynamically.
 1.93  18-Jun-2002  thorpej sys_fpathconf: Don't panic in the default case; just return EOPNOTSUPP.
 1.92  09-May-2002  atatat branches: 1.92.2;
Maintain a short list of the actual descriptors that were closed and
log that instead of being ambiguous about which of 0, 1, and/or 2 it
was that was closed.
 1.91  28-Apr-2002  enami Log who invoked the s[ug]id program. Tested by mozilla.
 1.90  27-Apr-2002  enami A loop to expand file descriptor table and retry is moved from fdalloc()
to caller. So, no longer need to loop in fdalloc().
 1.89  27-Apr-2002  enami KNF.
 1.88  24-Apr-2002  christos Avoid file use underflow; thanks to YAMAMOTO Takashi for noticing.
 1.87  23-Apr-2002  christos Don't forget to set mature and unuse the file.
 1.86  23-Apr-2002  christos From OpenBSD, via FreeBSD: If a set{u,g}id binary is invoked with fd < 3
closed, open those fds to /dev/null.

XXX: This needs to be fixed in a better way. The kernel should not need to
know about /dev/null or special case 0, 1, 2.
 1.85  08-Mar-2002  thorpej Pool deals fairly well with physical memory shortage, but it doesn't
deal with shortages of the VM maps where the backing pages are mapped
(usually kmem_map). Try to deal with this:

* Group all information about the backend allocator for a pool in a
separate structure. The pool references this structure, rather than
the individual fields.
* Change the pool_init() API accordingly, and adjust all callers.
* Link all pools using the same backend allocator on a list.
* The backend allocator is responsible for waiting for physical memory
to become available, but will still fail if it cannot allocate KVA
space for the pages. If this happens, carefully drain all pools using
the same backend allocator, so that some KVA space can be freed.
* Change pool_reclaim() to indicate if it actually succeeded in freeing
some pages, and use that information to make draining easier and more
efficient.
* Get rid of PR_URGENT. There was only one use of it, and it could be
dealt with by the caller.

From art@openbsd.org.
 1.84  31-Jan-2002  kleink fcntl(..., F_GETOWN, ...): fix LP64-BE bug; raised by der Mouse
on tech-kern.
 1.83  07-Dec-2001  jdolecek Back off previous for now, Jason thinks it's not right. Will discuss
on tech-kern@
 1.82  06-Dec-2001  jdolecek replace FIF_WANTCLOSE/FIF_LARVAL with FWANTCLOSE/FLARVAL, which are set
in f_flag of struct file
for now, keep former f_iflags of struct file as _f_spare0, it will be g/c'ed
when struct file will be changed (this will happen soon)
 1.81  12-Nov-2001  lukem add RCSIDs
 1.80  18-Jul-2001  thorpej branches: 1.80.2; 1.80.4;
Unshare the file descriptor table and `cwdinfo' when we exec.
From Matthew Orgass <darkstar@pgh.net>.
 1.79  01-Jul-2001  thorpej branches: 1.79.2;
Duh, use fd_getfile() in sys_close().
 1.78  16-Jun-2001  jdolecek Add DTYPE_PIPE (to be used by new pipe implementation) and handle
it accordingly.
 1.77  14-Jun-2001  thorpej Fix a partial construction problem that can cause race conditions
between creation of a file descriptor and close(2) when using kernel
assisted threads. What we do is stick descriptors in the table, but
mark them as "larval". This causes essentially everything to treat
it as a non-existent descriptor, except for fdalloc(), which sees a
filled slot so that it won't (incorrectly) allocate it again. When
a descriptor is fully constructed, the code that has constructed it
marks it as "mature" (which actually clears the "larval" flag), and
things continue to work as normal.

While here, gather all the code that gets a descriptor from the table
into a fd_getfile() function, and call it, rather than having the
same (sometimes incorrect) code copied all over the place.
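
The larval/mature scheme described above can be sketched roughly as follows. This is a single-threaded toy model with hypothetical names (TOY_LARVAL, toy_getfile, toy_alloc_slot, toy_set_mature); it leaves out the locking and reference counting that the kernel needs.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_LARVAL	0x1	/* slot is reserved but not fully constructed */
#define TOY_NFILES	8

struct toy_file {
	int flags;
};

static struct toy_file *toy_ofiles[TOY_NFILES];

/* Lookup path: a larval entry is treated as if it did not exist. */
static struct toy_file *
toy_getfile(int fd)
{
	if (fd < 0 || fd >= TOY_NFILES)
		return NULL;
	struct toy_file *fp = toy_ofiles[fd];
	if (fp == NULL || (fp->flags & TOY_LARVAL))
		return NULL;
	return fp;
}

/* Allocation path: any non-NULL slot, larval or not, counts as taken. */
static int
toy_alloc_slot(struct toy_file *fp)
{
	for (int fd = 0; fd < TOY_NFILES; fd++) {
		if (toy_ofiles[fd] == NULL) {
			fp->flags |= TOY_LARVAL;
			toy_ofiles[fd] = fp;
			return fd;
		}
	}
	return -1;
}

/* Called once construction is complete: clears the larval flag. */
static void
toy_set_mature(struct toy_file *fp)
{
	fp->flags &= ~TOY_LARVAL;
}
```

A close() racing with the creator goes through the lookup path, so it cannot see the half-built entry, while the allocator still refuses to hand the slot out twice.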
 1.76  07-Jun-2001  thorpej Rework fdalloc() even further: split fdalloc() into fdalloc() and
fdexpand(). The former will return ENOSPC if there is not space
in the current filedesc table. The latter performs the expansion
of the filedesc table. This means that fdalloc() won't ever block,
and it gives callers an opportunity to clean up before the
potentially-blocking fdexpand() call.

Update all fdalloc() callers to deal with the need-to-fdexpand() case.

Rewrite unp_externalize() to use fdalloc() and fdexpand() in a
safe way, using an algorithm suggested by Bill Sommerfeld:
- Use a temporary array of integers to hold the new filedesc table
indexes. This allows us to repeat the loop if necessary.
- Loop through the array of file *'s, assigning them to filedesc table
slots. If fdalloc() indicates expansion is necessary, undo the
assignments we've done so far, expand, and retry the whole process.
- Once all file *'s have been assigned to slots, update the f_msgcount
and unp_rights counters.
- Right before we return, copy the temporary integer array to the message
buffer, and trim the length as before.
Note that once locking is added to the filedesc array, this entire
operation will be `atomic', in that the lock will be held while
file *'s are assigned to embryonic table slots, thus preventing anything
else from using them.
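
The caller pattern implied by the fdalloc()/fdexpand() split might look roughly like this in user-space terms. The toy_ names and the doubling growth policy are assumptions for illustration; the real functions operate on struct filedesc and have different signatures.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

struct toy_table {
	int *slots;	/* 0 = free, 1 = in use */
	int nslots;
};

/* Never blocks: returns 0 and sets *fdp, or ENOSPC if the table is full. */
static int
toy_fdalloc(struct toy_table *t, int *fdp)
{
	for (int i = 0; i < t->nslots; i++) {
		if (t->slots[i] == 0) {
			t->slots[i] = 1;
			*fdp = i;
			return 0;
		}
	}
	return ENOSPC;
}

/* Potentially blocking step in the real kernel: grow the table. */
static int
toy_fdexpand(struct toy_table *t)
{
	int n = t->nslots ? t->nslots * 2 : 4;
	int *p = realloc(t->slots, n * sizeof(*p));

	if (p == NULL)
		return ENOMEM;
	for (int i = t->nslots; i < n; i++)
		p[i] = 0;
	t->slots = p;
	t->nslots = n;
	return 0;
}

/* Caller pattern: on ENOSPC, clean up, expand, then retry the allocation. */
static int
toy_alloc_retry(struct toy_table *t, int *fdp)
{
	int error;

	while ((error = toy_fdalloc(t, fdp)) == ENOSPC) {
		/* A real caller would undo its partial work here first. */
		if ((error = toy_fdexpand(t)) != 0)
			return error;
	}
	return error;
}
```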
 1.75  06-Jun-2001  thorpej Change fdalloc() to return ERESTART if we had to reallocate the
descriptor array, which may have blocked. Change callers of
fdalloc() to restart whatever they're doing if this condition
happens. (XXX unp_externalize() needs some work, but that will
be tackled later.)

Change finishdup() to close the descriptor in the `new' slot if
one exists, and change sys_dup2() accordingly.

Closes a race condition when using kernel-assisted user threads.

While here, garbage-collect UF_MAPPED -- it is not used anywhere.
 1.74  09-Apr-2001  jdolecek Change the first arg to fileops fo_stat routine to struct file *, adjust
callers and appropriate routines to cope. This makes fo_stat more
consistent with rest of fileops routines and also makes the fo_stat
match FreeBSD as an added bonus.
Discussed with Luke Mewburn on tech-kern@.
 1.73  07-Apr-2001  jdolecek Add new 'stat' fileop and call the stat function via f_ops rather
than directly.
For compat syscalls, also add necessary FILE_USE()/FILE_UNUSE().
Now that soo_stat() gets a proc arg, pass it on to usrreq function.
 1.72  26-Feb-2001  lukem branches: 1.72.2;
convert to ANSI KNF
 1.71  15-Aug-2000  fvdl Fix omission in previous.
 1.70  15-Aug-2000  eeh Fix LP64BE bug.
 1.69  04-Jul-2000  jdolecek change tablefull() to accept one more parameter - optional hint

use that to inform about way to raise current limit when we reach maximum
number of processes, descriptors or vnodes

XXX hopefully I caught all users of tablefull()
 1.68  27-Jun-2000  mrg remove include of <vm/vm.h>
 1.67  26-May-2000  sommerfeld branches: 1.67.4;
Eliminate incorrect use of "curproc" in a comment.
 1.66  30-Mar-2000  augustss Get rid of register declarations.
 1.65  23-Mar-2000  thorpej Implement fdremove() which is used in place of all the code that
did the "fdp->fd_ofiles[fd] = 0" assignment; fdremove() make sure
the fd_freefiles hints stay in sync.

From OpenBSD.
 1.64  22-Mar-2000  thorpej Pool'ify filedesc0 allocation.
 1.63  24-Jan-2000  thorpej In cwdinit(), if there isn't a cdir vnode yet, don't VREF() it.
 1.62  08-Dec-1999  sommerfeld Fix bug observed by Perry and myself: when emacs was shut down
uncleanly due to a lost connection, it would hang in closef() waiting
for the usecount to go back to 1.

An audit of FILE_USE() vs FILE_UNUSE() usage led me to discover some
incorrect error-path code.

In sys_fcntl(), avoid leaking a file descriptor usecount in an error
case of F_SETFL; don't return, instead go to "out" to clean up. I
suspect that the F_SETFL would fail because vop_fcntl is not
implemented in deadfs.
 1.61  03-Aug-1999  wrstuden branches: 1.61.2; 1.61.8;
Add support for fcntl(2) to generate VOP_FCNTL calls. Any fcntl
call with F_FSCTL set and F_SETFL calls generate calls to a new
fileop fo_fcntl. Add genfs_fcntl() and soo_fcntl() which return 0
for F_SETFL and EOPNOTSUPP otherwise. Have all leaf filesystems
use genfs_fcntl().

Reviewed by: thorpej
Tested by: wrstuden
 1.60  20-Jun-1999  christos Fix umask inheritance problem introduced by the cwdi changes, whereby
children processes will not inherit the parent's umask but 022.
 1.59  05-May-1999  thorpej Add "use counting" to file entries. When closing a file, and its reference
count is 0, wait for use count to drain before finishing the close.

This is necessary in order for multiple processes to safely share file
descriptor tables.
 1.58  30-Apr-1999  thorpej Break cdir/rdir/cmask info out of struct filedesc, and put it in a new
substructure, `cwdinfo'. Implement optional sharing of this substructure.

This is required for clone(2).
 1.57  24-Mar-1999  mrg branches: 1.57.4;
completely remove Mach VM support. all that is left is all the
header files as UVM still uses (most of) these.
 1.56  22-Mar-1999  sommerfe bug fix to fdavail: be consistent about taking per-process descriptor
limit into account when checking against the limit; fdp->fd_nfiles may
be greater than the current descriptor limit, and there may be space
in fdp->fd_ofiles beyond the limit. If we say it's available,
unp_externalize will get confused and panic when fdalloc fails.
 1.55  31-Aug-1998  thorpej Use the pool allocator and "nointr" pool page allocator for file structures.
 1.54  13-Aug-1998  kleink Per POSIX, fail with EINVAL if advisory locking is attempted on a file type
that doesn't support it, rather than using a homegrown EBADF or EOPNOTSUPP.
 1.53  04-Aug-1998  perry Abolition of bcopy, ovbcopy, bcmp, and bzero, phase one.
bcopy(x, y, z) -> memcpy(y, x, z)
ovbcopy(x, y, z) -> memmove(y, x, z)
bcmp(x, y, z) -> memcmp(x, y, z)
bzero(x, y) -> memset(x, 0, y)
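
The mapping above swaps the first two arguments for the copy routines, since bcopy() takes (src, dst, len) while memcpy() takes (dst, src, len). A small sketch of the correspondence follows; the compat_ wrapper names are hypothetical and exist only to make the argument order visible.

```c
#include <assert.h>
#include <string.h>

/* bcopy(x, y, z) -> memcpy(y, x, z): source first becomes destination first. */
static void
compat_bcopy(const void *src, void *dst, size_t len)
{
	memcpy(dst, src, len);
}

/* ovbcopy(x, y, z) -> memmove(y, x, z): the overlap-safe variant. */
static void
compat_ovbcopy(const void *src, void *dst, size_t len)
{
	memmove(dst, src, len);
}

/* bcmp(x, y, z) -> memcmp(x, y, z): zero means the buffers are equal. */
static int
compat_bcmp(const void *a, const void *b, size_t len)
{
	return memcmp(a, b, len);
}

/* bzero(x, y) -> memset(x, 0, y): the fill value becomes explicit. */
static void
compat_bzero(void *buf, size_t len)
{
	memset(buf, 0, len);
}
```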
 1.52  31-Jul-1998  perry fix sizeofs so they comply with the KNF style guide. yes, it is pedantic.
 1.51  01-Mar-1998  fvdl branches: 1.51.2;
Merge with Lite2 + local changes
 1.50  10-Feb-1998  mrg - add defopt's for UVM, UVMHIST and PMAP_NEW.
- remove unnecessary UVMHIST_DECL's.
 1.49  05-Feb-1998  mrg initial import of the new virtual memory system, UVM, into -current.

UVM was written by chuck cranor <chuck@maria.wustl.edu>, with some
minor portions derived from the old Mach code. i provided some help
getting swap and paging working, and other bug fixes/ideas. chuck
silvers <chuq@chuq.com> also provided some other fixes.

this is the rest of the MI portion changes.

this will be KNF'd shortly. :-)
 1.48  05-Jan-1998  thorpej Implement file descriptor table sharing. Partially from FreeBSD.
 1.47  20-Oct-1997  thorpej Fix the shared library versioning snafu caused by the recent changes
to the stat(2) family and msync(2). This uses a primitive function
versioning scheme.

This reverts the libc shared library major version from 13 to 12, and
adds a few new interfaces to bring us to libc version 12.20.

From Frank van der Linden <fvdl@NetBSD.ORG>.
 1.46  19-Oct-1997  mycroft Minor change; remove unnecessary casts.
 1.45  15-Oct-1997  mycroft Adjust u_int arguments of some system calls to int, to match user-level
prototypes.
 1.44  17-Jul-1997  phil In sys_flock, change EBADF to EINVAL because error was generated by
a bad argument, not a bad file descriptor. (Found in response to
PR 2602.)
 1.43  02-Apr-1997  kleink Like in F_SETLK, check if F_GETLK is actually called with a
valid lock type.
 1.42  30-Mar-1996  christos Eliminate kern_conf.h
 1.41  29-Mar-1996  cgd kill unnecessary (and sometimes dangerous) casts of ioctl commands to int
 1.40  14-Mar-1996  christos - fdopen -> filedescopen
- bring kgdb prototype in scope.
 1.39  09-Feb-1996  christos More proto fixes
 1.38  04-Feb-1996  christos First pass at prototyping
 1.37  07-Oct-1995  mycroft Prefix names of system call implementation functions with `sys_'.
 1.36  19-Sep-1995  thorpej Make system calls conform to a standard prototype and bring those
prototypes into scope.
 1.35  24-Jun-1995  christos Extracted all of the compat_xxx routines, and created a library [libcompat]
for them. There are a few #ifdef COMPAT_XX remaining, but they are not easy
or worth eliminating (yet).
 1.34  10-Apr-1995  mycroft Change `fdclose' to `fdrelease', to avoid confusion with device interfaces.
 1.33  08-Mar-1995  cgd need COMPAT_OSF1 for some things
 1.32  15-Feb-1995  mycroft NULL out file descriptors as they're closed, for the benefit of fstat(8).
 1.31  23-Jan-1995  cgd ooops. forgot to enable fpathconf's use of VOP_PATHCONF!
 1.30  12-Jan-1995  cgd cast pointer to long, not int
 1.29  14-Dec-1994  mycroft Remove old declaration.
 1.28  14-Dec-1994  mycroft Revert dup handling.
 1.27  04-Dec-1994  mycroft Abstract out the code to maintain fd_lastfile. Remove the old dup() compatibility
kluge. Rearrange fdopen() handling. Make a common function to handle closing
a particular file descriptor in a process. Some other cleanup.
 1.26  30-Oct-1994  cgd be more careful with types, also pull in headers where necessary.
 1.25  20-Oct-1994  cgd update for new syscall args description mechanism
 1.24  30-Aug-1994  mycroft Convert process, file, and namei lists and hash tables to use queue.h.
 1.23  15-Aug-1994  mycroft Need ofstat() for iBCS2 syscall conversion.
 1.22  29-Jun-1994  cgd branches: 1.22.2;
New RCS ID's, take two. they're more aesthetically pleasant, and use 'NetBSD'
 1.21  22-Jun-1994  mycroft Make ogetdtablesize if COMPAT_HPUX.
 1.20  16-Jun-1994  glass compat_ultrix
 1.19  14-Jun-1994  chopps getdtablesize used by sunos compat code.
 1.18  14-Jun-1994  cgd make getdtablesize COMPAT_43; should be COMPAT_44 or _09, but that has probs
 1.17  19-May-1994  cgd update to 4.4-Lite, with some local changes
 1.16  17-May-1994  cgd copyright foo
 1.15  07-May-1994  cgd stub fpathconf
 1.14  04-May-1994  cgd Rename a lot of process flags.
 1.13  27-Mar-1994  cgd expand uid_t/gid_t/off_t
 1.12  04-Jan-1994  cgd generalize dupfdopen() to allow dups and moves. from jsp
 1.11  21-Dec-1993  cgd more of the same; gah!
 1.10  21-Dec-1993  cgd kill a billism
 1.9  18-Dec-1993  mycroft Canonicalize all #includes.
 1.8  23-Aug-1993  mycroft branches: 1.8.2;
RLIMIT_OFILE --> RLIMIT_NOFILE
 1.7  13-Jul-1993  cgd break args structs out, into syscallname_args structs, so gcc2 doesn't
whine so much.
 1.6  27-Jun-1993  andrew ANSIfications - removed all implicit function return types and argument
definitions. Ensured that all files include "systm.h" to gain access to
general prototypes. Casts where necessary.
 1.5  22-May-1993  cgd add include of select.h if necessary for protos, or delete if extraneous
 1.4  18-May-1993  cgd make kernel select interface be one-stop shopping & clean it all up.
 1.3  04-Apr-1993  cgd now uses `maxfdescs' to bound `openfiles' resource limit.
 1.2  23-Mar-1993  cgd modified files to support kernfs and fdesc fs
 1.1  21-Mar-1993  cgd branches: 1.1.1;
Initial revision
 1.1.1.3  01-Mar-1998  fvdl Import 4.4BSD-Lite2
 1.1.1.2  01-Mar-1998  fvdl Import 4.4BSD-Lite for reference
 1.1.1.1  21-Mar-1993  cgd initial import of 386bsd-0.1 sources
 1.8.2.2  21-Dec-1993  cgd from trunk
 1.8.2.1  14-Nov-1993  mycroft Canonicalize all #includes.
 1.22.2.1  15-Aug-1994  mycroft update from trunk
 1.51.2.1  08-Aug-1998  eeh Revert cdevsw mmap routines to return int.
 1.57.4.2  01-Jul-1999  thorpej Sync w/ -current.
 1.57.4.1  21-Jun-1999  thorpej Sync w/ -current.
 1.61.8.1  27-Dec-1999  wrstuden Pull up to last week's -current.
 1.61.2.3  21-Apr-2001  bouyer Sync with HEAD
 1.61.2.2  12-Mar-2001  bouyer Sync with HEAD.
 1.61.2.1  20-Nov-2000  bouyer Update thorpej_scsipi to -current as of a month ago
 1.67.4.8  27-Apr-2002  he Apply patch (requested by christos):
Adapt previous pull-up to the branch.
 1.67.4.7  26-Apr-2002  he Pull up revisions 1.86-1.88 (requested by christos):
If a set{u,g}id binary is invoked with fd < 3 closed, open those
file descriptors to /dev/null.
 1.67.4.6  09-Feb-2002  he Apply patch (requested by windsor):
Correct typo in previous pull-up.
 1.67.4.5  09-Feb-2002  he Pull up revision 1.84 (via patch, requested by kleink):
Fix an LP64-BE bug with fcntl(..., F_GETOWN, ...).
 1.67.4.4  29-Jul-2001  he Pull up revision 1.80 (via patch, requested by thorpej):
Unshare the file descriptor table and ``cwdinfo'' when we exec.
 1.67.4.3  10-Jun-2001  he Pull up revision 1.75 (via patch, requested by thorpej):
Change fdalloc() to return ERESTART if reallocation of the
descriptor array was needed, and change uses to handle that
condition. Make finishdup() close the descriptor in the new slot
if it exists, and change sys_dup2() accordingly. Closes a race
condition when using kernel-assisted user threads.
 1.67.4.2  26-Aug-2000  mrg pull up 1.70, 1.71. approved by thorpej:
1.70
>Fix LP64BE bug.
1.71
>Fix omission in previous.
 1.67.4.1  04-Jul-2000  jdolecek Pullup from trunk [approved by thorpej]:

change tablefull() to accept one more parameter - optional hint

use that to inform about way to raise current limit when we reach maximum
number of processes, descriptors or vnodes
 1.72.2.17  07-Jan-2003  thorpej Sync with HEAD.
 1.72.2.16  11-Dec-2002  thorpej Sync with HEAD.
 1.72.2.15  11-Nov-2002  nathanw Catch up to -current
 1.72.2.14  18-Oct-2002  nathanw Catch up to -current.
 1.72.2.13  17-Sep-2002  nathanw Catch up to -current.
 1.72.2.12  12-Jul-2002  nathanw No longer need to pull in lwp.h; proc.h pulls it in for us.
 1.72.2.11  10-Jul-2002  nathanw Whitespace.
 1.72.2.10  20-Jun-2002  nathanw Catch up to -current.
 1.72.2.9  29-May-2002  nathanw #include <sys/sa.h> before <sys/syscallargs.h>, to provide sa_upcall_t
now that <sys/param.h> doesn't include <sys/sa.h>.

(Behold the Power of Ed)
 1.72.2.8  01-Apr-2002  nathanw Catch up to -current.
(CVS: It's not just a program. It's an adventure!)
 1.72.2.7  28-Feb-2002  nathanw Catch up to -current.
 1.72.2.6  08-Jan-2002  nathanw Catch up to -current.
 1.72.2.5  14-Nov-2001  nathanw Catch up to -current.
 1.72.2.4  24-Aug-2001  nathanw Catch up with -current.
 1.72.2.3  21-Jun-2001  nathanw Catch up to -current.
 1.72.2.2  09-Apr-2001  nathanw Catch up with -current.
 1.72.2.1  05-Mar-2001  nathanw Initial commit of scheduler activations and lightweight process support.
 1.79.2.10  12-Oct-2002  jdolecek need knote_fdclose() in finishdup()
 1.79.2.9  10-Oct-2002  jdolecek sync kqueue with -current; this includes merge of gehenna-devsw branch,
merge of i386 MP branch, and part of autoconf rototil work
 1.79.2.8  06-Sep-2002  jdolecek sync kqueue branch with HEAD
 1.79.2.7  23-Jun-2002  jdolecek catch up with -current on kqueue branch
 1.79.2.6  16-Mar-2002  jdolecek Catch up with -current.
 1.79.2.5  15-Mar-2002  jdolecek fdfree(): fix the argument to knote_fdfree() - 'i' is not the descriptor
value, it's index counting from fdp->fd_lastfile down; this fixes
deadlock in closef() when watched descriptor is lower than the
kqueue one and the process has open further descriptors

finishdup(): add a comment that a knote_fdfree() call is needed there; will
address this later
 1.79.2.4  11-Feb-2002  jdolecek Sync w/ -current.
 1.79.2.3  10-Jan-2002  thorpej Sync kqueue branch with -current.
 1.79.2.2  03-Aug-2001  lukem update to -current
 1.79.2.1  10-Jul-2001  lukem create and destroy fd_kn{list,hash} entries as appropriate (for kqueue use)
 1.80.4.1  12-Nov-2001  thorpej Sync the thorpej-mips-cache branch with -current.
 1.80.2.1  07-Sep-2001  thorpej Commit my "devvp" changes to the thorpej-devvp branch. This
replaces the use of dev_t in most places with a struct vnode *.

This will form the basic infrastructure for real cloning device
support (besides being architecturally cleaner -- it'll be good
to get away from using numbers to represent objects).
 1.92.2.2  15-Jul-2002  gehenna catch up with -current.
 1.92.2.1  16-May-2002  gehenna Add the character device switch.
 1.110.2.11  11-Dec-2005  christos Sync with head.
 1.110.2.10  10-Nov-2005  skrll Sync with HEAD. Here we go again...
 1.110.2.9  04-Mar-2005  skrll Sync with HEAD.

Hi Perry!
 1.110.2.8  24-Feb-2005  skrll Reduce diff to HEAD
 1.110.2.7  15-Feb-2005  skrll Sync with HEAD.
 1.110.2.6  17-Jan-2005  skrll Sync with HEAD.
 1.110.2.5  18-Dec-2004  skrll Sync with HEAD.
 1.110.2.4  21-Sep-2004  skrll Fix the sync with head I botched.
 1.110.2.3  18-Sep-2004  skrll Sync with HEAD.
 1.110.2.2  03-Aug-2004  skrll Sync with HEAD
 1.110.2.1  02-Jul-2003  darrenr Apply the aborted ktrace-lwp changes to a specific branch. This is just for
others to review, I'm concerned that patch fuzziness may have resulted in some
errant code being generated but I'll look at that later by comparing the diff
from the base to the branch with the file I attempt to apply to it. This will,
at the very least, put the changes in a better context for others to review
them and attempt to tinker with removing passing of 'struct lwp' through
the kernel.
 1.123.2.3  24-May-2005  riz Pull up revision 1.132 (requested by wrstuden in ticket #1537):
The file being closed is (fdp->fd_lastfile - i), not i. So compare
(fdp->fd_lastfile - i) against fd_knlistsize. Otherwise we can
call knote_fdclose() on a file descriptor that doesn't have a knote.
This issue explains random panics I have had on process exit over the
past few years.
 1.123.2.2  16-Mar-2005  tron Pull up revision 1.128 via patch (requested by cube in ticket #1089):
fd_lastfile should be -1 when there are no opened file descriptors.
Hence, make find_last_set return -1 in such situation, and initialize it
such. Otherwise, with 0 meaning two things, it confused the F_CLOSEM
fcntl which could end up looping indefinitely (PR#28929 by Brian Marcotte).
However, this change brings to light another bug in fdcopy(), where more entries
than needed were cleared in the new file descriptor table, so the memset()
call there is fixed too.
Analyzed with the help of Greg Oster.
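
The ambiguity described in this entry can be illustrated with a small userland model. This is only a sketch under stated assumptions: last_set_model() and close_from_model() are hypothetical names, the table is reduced to a single 32-bit in-use bitmap, and none of this is the kernel's actual find_last_set() or F_CLOSEM code. It shows why "no descriptors open" needs a value distinct from "descriptor 0 is the last one open".

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model: bit n of `bitmap` is set when descriptor n is open.
 * Returns the highest open descriptor, or -1 when none is open.
 */
static int
last_set_model(uint32_t bitmap)
{
	int i;

	for (i = 31; i >= 0; i--) {
		if (bitmap & (UINT32_C(1) << i))
			return i;
	}
	return -1;	/* distinct from 0, which means "fd 0 is open" */
}

/*
 * Hypothetical F_CLOSEM-style loop: close every descriptor >= from.
 * Returns how many were closed.  The loop terminates because an empty
 * table yields -1, which is always below `from`.
 */
static int
close_from_model(uint32_t *bitmap, int from)
{
	int closed = 0;
	int last;

	assert(from >= 0);
	while ((last = last_set_model(*bitmap)) >= from) {
		*bitmap &= ~(UINT32_C(1) << last);
		closed++;
	}
	return closed;
}
```

If last_set_model() returned 0 for an empty bitmap, close_from_model(&bitmap, 0) would keep seeing "descriptor 0" as still open and never leave the loop, which is the kind of endless loop the entry describes.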
 1.123.2.1  10-Jul-2004  tron branches: 1.123.2.1.2;
Pull up revision 1.124 (requested by tls in ticket #634):
add assertions related to file descriptor allocation.
 1.123.2.1.2.2  24-May-2005  riz Pull up revision 1.132 (requested by wrstuden in ticket #1537):
The file being closed is (fdp->fd_lastfile - i), not i. So compare
(fdp->fd_lastfile - i) against fd_knlistsize. Otherwise we can
call knote_fdclose() on a file descriptor that doesn't have a knote.
This issue explains random panics I have had on process exit over the
past few years.
 1.123.2.1.2.1  16-Mar-2005  tron Pull up revision 1.128 via patch (requested by cube in ticket #1089):
fd_lastfile should be -1 when there are no opened file descriptors.
Hence, make find_last_set return -1 in such situation, and initialize it
such. Otherwise, with 0 meaning two things, it confused the F_CLOSEM
fcntl which could end up looping indefinitely (PR#28929 by Brian Marcotte).
However, this change brings to light another bug in fdcopy(), where more entries
than needed were cleared in the new file descriptor table, so the memset()
call there is fixed too.
Analyzed with the help of Greg Oster.
 1.129.4.1  19-Mar-2005  yamt sync with head. xen and whitespace. xen part is not finished.
 1.129.2.1  29-Apr-2005  kent sync with -current
 1.131.2.1  28-May-2005  tron Pull up revision 1.132 (requested by wrstuden in ticket #331):
The file being closed is (fdp->fd_lastfile - i), not i. So compare
(fdp->fd_lastfile - i) against fd_knlistsize. Otherwise we can
call knote_fdclose() on a file descriptor that doesn't have a knote.
This issue explains random panics I have had on process exit over the
past few years.
 1.134.2.11  24-Mar-2008  yamt sync with head.
 1.134.2.10  11-Feb-2008  yamt sync with head.
 1.134.2.9  04-Feb-2008  yamt sync with head.
 1.134.2.8  21-Jan-2008  yamt sync with head
 1.134.2.7  07-Dec-2007  yamt sync with head
 1.134.2.6  15-Nov-2007  yamt sync with head.
 1.134.2.5  27-Oct-2007  yamt sync with head.
 1.134.2.4  03-Sep-2007  yamt sync with head.
 1.134.2.3  26-Feb-2007  yamt sync with head.
 1.134.2.2  30-Dec-2006  yamt sync with head.
 1.134.2.1  21-Jun-2006  yamt sync with head.
 1.136.6.6  18-Nov-2005  yamt - associate read-ahead context to vnode, rather than file.
- revert VOP_READ prototype.
 1.136.6.5  17-Nov-2005  yamt use UVM_ADV_ rather than POSIX_FADV_.
 1.136.6.4  16-Nov-2005  yamt update a comment following posix_fadvise prototype change.
 1.136.6.3  16-Nov-2005  yamt sys_posix_fadvise: correct how to return an error.
 1.136.6.2  15-Nov-2005  yamt add posix_fadvise.
 1.136.6.1  15-Nov-2005  yamt - setup/cleanup readahead context.
- adapt to the new VOP_READ prototype.
 1.139.2.1  01-Feb-2006  yamt sync with head.
 1.140.6.4  03-Sep-2006  yamt sync with head.
 1.140.6.3  11-Aug-2006  yamt sync with head
 1.140.6.2  24-May-2006  yamt sync with head.
 1.140.6.1  13-Mar-2006  yamt sync with head.
 1.140.4.2  01-Jun-2006  kardel Sync with head.
 1.140.4.1  22-Apr-2006  simonb Sync with head.
 1.140.2.1  09-Sep-2006  rpaulo sync with head
 1.141.4.1  24-May-2006  tron Merge 2006-05-24 NetBSD-current into the "peter-altq" branch.
 1.141.2.4  06-May-2006  christos - Move kauth_cred_t declaration to <sys/types.h>
- Cleanup struct ucred; forward declarations that are unused.
- Don't include <sys/kauth.h> in any header, but include it in the c files
that need it.

Approved by core.
 1.141.2.3  19-Apr-2006  elad sync with head.
 1.141.2.2  08-Mar-2006  elad Adapt to kernel authorization KPI.
 1.141.2.1  07-Mar-2006  elad file kern_descrip.c was added on branch elad-kernelauth on 2006-03-08 00:53:40 +0000
 1.145.4.2  10-Dec-2006  yamt sync with head.
 1.145.4.1  22-Oct-2006  yamt sync with head
 1.145.2.6  01-Feb-2007  ad Sync with head.
 1.145.2.5  30-Jan-2007  ad Remove support for SA. Ok core@.
 1.145.2.4  12-Jan-2007  ad Sync with head.
 1.145.2.3  18-Nov-2006  ad Sync with head.
 1.145.2.2  17-Nov-2006  ad Checkpoint work in progress.
 1.145.2.1  11-Sep-2006  ad - Allocate and free turnstiles where needed.
- Split proclist_mutex and alllwp_mutex out of the proclist_lock,
and use in interrupt context.
- Fix an MP race in enterpgrp()/setsid().
- Acquire proclist_lock and p_crmutex in some obvious places.
 1.150.2.5  17-May-2007  yamt sync with head.
 1.150.2.4  07-May-2007  yamt sync with head.
 1.150.2.3  24-Mar-2007  yamt sync with head.
 1.150.2.2  12-Mar-2007  rmind Sync with HEAD.
 1.150.2.1  27-Feb-2007  yamt - sync with head.
- move sched_changepri back to kern_synch.c as it doesn't know PPQ anymore.
 1.153.2.9  09-Oct-2007  ad Sync with head.
 1.153.2.8  09-Oct-2007  ad Sync with head.
 1.153.2.7  01-Sep-2007  ad Use pool_cache for allocating a few more types of objects.
 1.153.2.6  09-Jul-2007  ad closef: restore check for l == NULL removed in revision 1.153.2.4.
Noted by yamt.
 1.153.2.5  08-Jun-2007  ad Sync with head.
 1.153.2.4  13-May-2007  ad - Pass the error number and residual count to biodone(), and let it handle
setting error indicators. Prepare to eliminate B_ERROR.
- Add a flag argument to brelse() to be set into the buf's flags, instead
of doing it directly. Typically used to set B_INVAL.
- Add a "struct cpu_info *" argument to kthread_create(), to be used to
create bound threads. Change "bool mpsafe" to "int flags".
- Allow exit of LWPs in the IDL state when (l != curlwp).
- More locking fixes & conversion to the new API.
 1.153.2.3  12-Apr-2007  ad filedesc::fd_lock a reader/writer lock, for multithreaded processes.
 1.153.2.2  21-Mar-2007  ad - Replace more simple_locks, and fix up in a few places.
- Use condition variables.
- LOCK_ASSERT -> KASSERT.
 1.153.2.1  13-Mar-2007  ad Sync with head.
 1.154.4.1  29-Mar-2007  reinoud Pullup to -current
 1.154.2.1  11-Jul-2007  mjf Sync with head.
 1.159.8.4  23-Mar-2008  matt sync with HEAD
 1.159.8.3  09-Jan-2008  matt sync with HEAD
 1.159.8.2  08-Nov-2007  matt sync with -HEAD
 1.159.8.1  06-Nov-2007  matt sync with HEAD
 1.159.6.5  09-Dec-2007  jmcneill Sync with HEAD.
 1.159.6.4  03-Dec-2007  joerg Sync with HEAD.
 1.159.6.3  11-Nov-2007  joerg Sync with HEAD.
 1.159.6.2  26-Oct-2007  joerg Sync with HEAD.

Follow the merge of pmap.c on i386 and amd64 and move
pmap_init_tmp_pgtbl into arch/x86/x86/pmap.c. Modify the ACPI wakeup
code to restore CR4 before jumping back into kernel space as the large
page option might cover that.
 1.159.6.1  02-Oct-2007  joerg Sync with HEAD.
 1.159.2.1  10-Sep-2007  skrll Sync with HEAD.
 1.160.2.1  14-Oct-2007  yamt sync with head.
 1.161.4.4  18-Feb-2008  mjf Sync with HEAD.
 1.161.4.3  27-Dec-2007  mjf Sync with HEAD.
 1.161.4.2  08-Dec-2007  mjf Sync with HEAD.
 1.161.4.1  19-Nov-2007  mjf Sync with HEAD.
 1.161.2.1  13-Nov-2007  bouyer Sync with HEAD
 1.164.2.3  26-Dec-2007  ad Sync with head.
 1.164.2.2  13-Dec-2007  ad Unused var
 1.164.2.1  13-Dec-2007  ad Eliminate contention on filelist_lock.
 1.165.4.2  08-Jan-2008  bouyer Sync with HEAD
 1.165.4.1  02-Jan-2008  bouyer Sync with HEAD
 1.172.6.5  17-Jan-2009  mjf Sync with HEAD.
 1.172.6.4  02-Jul-2008  mjf Sync with HEAD.
 1.172.6.3  29-Jun-2008  mjf Sync with HEAD.
 1.172.6.2  02-Jun-2008  mjf Sync with HEAD.
 1.172.6.1  03-Apr-2008  mjf Sync with HEAD.
 1.175.2.1  18-May-2008  yamt sync with head.
 1.177.2.8  09-Oct-2010  yamt sync with head
 1.177.2.7  11-Aug-2010  yamt sync with head.
 1.177.2.6  11-Mar-2010  yamt sync with head
 1.177.2.5  19-Aug-2009  yamt sync with head.
 1.177.2.4  18-Jul-2009  yamt sync with head.
 1.177.2.3  20-Jun-2009  yamt sync with head
 1.177.2.2  04-May-2009  yamt sync with head.
 1.177.2.1  16-May-2008  yamt sync with head.
 1.179.4.2  03-Jul-2008  simonb Sync with head.
 1.179.4.1  27-Jun-2008  simonb Sync with head.
 1.179.2.3  18-Sep-2008  wrstuden Sync with wrstuden-revivesa-base-2.
 1.179.2.2  14-May-2008  wrstuden Per discussion with ad, remove most of the #include <sys/sa.h> lines
as they were including sa.h just for the type(s) needed for syscallargs.h.

Instead, create a new file, sys/satypes.h, which contains just the
types needed for syscallargs.h. Yes, there's only one now, but that
may change and it's probably more likely to change if it'd be difficult
to handle. :-)

Per discussion with matt at n dot o, add an include of satypes.h to
sigtypes.h. Upcall handlers are kinda signal handlers, and signalling
is the header file that's already included for syscallargs.h that
closest matches SA.

This shaves about 3000 lines off of the diff of the branch relative
to the base. That also represents about 18% of the total before this
checkin.

I think this reduction is very good thing.
 1.179.2.1  10-May-2008  wrstuden Initial checkin of re-adding SA. Everything except kern_sa.c
compiles in GENERIC for i386. This is still a work-in-progress, but
this checkin covers most of the mechanical work (changing signalling
to be able to accommodate SA's process-wide signalling and re-adding
includes of sys/sa.h and savar.h). Subsequent changes will be much
more interesting.

Also, kern_sa.c has received partial cleanup. There's still more
to do, though.
 1.182.6.6  04-Apr-2009  snj Pull up following revision(s) (requested by ad in ticket #661):
sys/arch/xen/xen/xenevt.c: revision 1.32
sys/compat/svr4/svr4_net.c: revision 1.56
sys/compat/svr4_32/svr4_32_net.c: revision 1.19
sys/dev/dmover/dmover_io.c: revision 1.32
sys/dev/putter/putter.c: revision 1.21
sys/kern/kern_descrip.c: revision 1.190
sys/kern/kern_drvctl.c: revision 1.23
sys/kern/kern_event.c: revision 1.64
sys/kern/sys_mqueue.c: revision 1.14
sys/kern/sys_pipe.c: revision 1.109
sys/kern/sys_socket.c: revision 1.59
sys/kern/uipc_syscalls.c: revision 1.136
sys/kern/vfs_vnops.c: revision 1.164
sys/kern/uipc_socket.c: revision 1.188
sys/net/bpf.c: revision 1.144
sys/net/if_tap.c: revision 1.55
sys/opencrypto/cryptodev.c: revision 1.47
sys/sys/file.h: revision 1.67
sys/sys/param.h: patch
sys/sys/socketvar.h: revision 1.119
Add fileops::fo_drain(), to be called from fd_close() when there is more
than one active reference to a file descriptor. It should dislodge threads
sleeping while holding a reference to the descriptor. Implemented only for
sockets but should be extended to pipes, fifos, etc.
Fixes the case of a multithreaded process doing something like the
following, which would have hung until the process got a signal.
thr0 accept(fd, ...)
thr1 close(fd)
 1.182.6.5  31-Mar-2009  snj Pull up following revision(s) (requested by rmind in ticket #619):
sys/kern/kern_descrip.c: revision 1.189
fownsignal: pre-check for zero pgid, avoids locking of proc_lock.
 1.182.6.4  18-Mar-2009  snj Pull up following revision(s) (requested by mrg in ticket #577):
sys/kern/kern_descrip.c: revision 1.188
sys/kern/uipc_usrreq.c: revision 1.121
sys/sys/fcntl.h: revision 1.35
sys/sys/file.h: revision 1.66
sys/sys/param.h: patch
sys/sys/un.h: revision 1.45
completely rework the way that orphaned sockets that are being fdpassed
via SCM_RIGHTS messages are dealt with:
1. unp_gc: make this a kthread.
2. unp_detach: do not call unp_gc directly. instead, wake up unp_gc kthread.
3. unp_scan: do not close files here. instead, put them on a global list
for unp_gc to close, along with a per-file "deferred close count". if
file is already enqueued for close, just increment deferred close count.
this eliminates the recursive calls.
3. unp_gc: scan files on global deferred close list. close each file N
times, as specified by deferred close count in file. continue processing
list until it becomes empty (closing may cause additional files to be
queued for close).
4. unp_gc: add additional bit to mark files we are scanning. set during
initial scan of global file list that currently clears FMARK/FDEFER.
during later scans, never examine / garbage collect descriptors that
we have not marked during the earlier scan. do not proceed with this
initial scan until all deferred closes have been processed. be careful
with locking to ensure no races are introduced between deferred close
and file scan.
5. unp_gc: use dummy file_t to mark position in list when scanning. allow
us to drop filelist_lock. in turn allows us to eliminate kmem_alloc()
and safely close files, etc.
6. prohibit transfer of descriptors within SCM_RIGHTS messages if
(num_files_in_transit > maxfiles / unp_rights_ratio)
7. fd_allocfile: ensure recycled files don't get scanned.
this is 97% work done by andrew doran, with a couple of minor bug fixes
and a lot of testing by yours truly.
 1.182.6.3  15-Mar-2009  snj Pull up following revision(s) (requested by mrg in ticket #566):
sys/kern/init_sysctl.c: revision 1.157
sys/kern/kern_descrip.c: revision 1.187
usr.sbin/pstat/pstat.c: revision 1.112
Don't bother with file_t::f_iflags any more, as it's not used.
Noted by mrg@.
 1.182.6.2  02-Mar-2009  snj Pull up following revision(s) (requested by rmind in ticket #542):
sys/kern/kern_descrip.c: revision 1.186
fd_copy: fix off-by-one bug in a race condition path and assert.
Should fix PR/40625. OK by <ad>.
 1.182.6.1  02-Feb-2009  snj Pull up following revision(s) (requested by ad in ticket #358):
sys/kern/kern_descrip.c: revision 1.185
- Fix a bug where we trashed descriptor zero in the old open files array
while ironically trying to preserve the same during copy. Would only have
occurred if a multithreaded program expanded the descriptor table and,
within a tiny window of exposure, another thread in the program tried to
access descriptor zero.
- Convert to use kmem_alloc/kmem_free.
 1.182.4.3  28-Apr-2009  skrll Sync with HEAD.
 1.182.4.2  03-Mar-2009  skrll Sync with HEAD.
 1.182.4.1  19-Jan-2009  skrll Sync with HEAD.
 1.182.2.1  13-Dec-2008  haad Update haad-dm branch to haad-dm-base2.
 1.185.2.2  23-Jul-2009  jym Sync with HEAD.
 1.185.2.1  13-May-2009  jym Sync with HEAD.

Commit is split, to avoid a "too many arguments" protocol error.
 1.202.4.4  31-May-2011  rmind sync with head
 1.202.4.3  21-Apr-2011  rmind sync with head
 1.202.4.2  05-Mar-2011  rmind sync with head
 1.202.4.1  03-Jul-2010  rmind sync with head
 1.202.2.3  06-Nov-2010  uebayasi Sync with HEAD.
 1.202.2.2  22-Oct-2010  uebayasi Sync with HEAD (-D20101022).
 1.202.2.1  17-Aug-2010  uebayasi Sync with HEAD.
 1.209.4.2  17-Feb-2011  bouyer Sync with HEAD
 1.209.4.1  08-Feb-2011  bouyer Sync with HEAD
 1.209.2.1  06-Jun-2011  jruoho Sync with HEAD.
 1.217.6.1  18-Feb-2012  mrg merge to -current.
 1.217.2.3  22-May-2014  yamt sync with head.

for a reference, the tree before this commit was tagged
as yamt-pagecache-tag8.

this commit was split into small chunks to avoid
a limitation of cvs. ("Protocol error: too many arguments")
 1.217.2.2  16-Jan-2013  yamt sync with (a bit old) head
 1.217.2.1  17-Apr-2012  yamt sync with head
 1.218.8.1  24-Nov-2012  jdc Pull up revisions:
src/sys/kern/kern_event.c revision 1.79
src/sys/kern/kern_descrip.c revision 1.219
src/lib/libc/sys/kqueue.2 revision 1.33
src/tests/lib/libc/sys/t_kevent.c revision 1.2-1.5
(requested by christos in ticket #716).

- initialize kn_id
- in close, invalidate f_data and f_type early to prevent accidental re-use
- add a DIAGNOSTIC for when we use unsupported fd's and a KASSERT for f_event
being NULL.

Return EOPNOTSUPP for fnullop_kqfilter to prevent registration of unsupported
fds. XXX: We should really fix the fd's to be supported in the future.
Unsupported fd's have a NULL f_event, so registering crashes the kernel with
a NULL function dereference of f_event.

mention that kevent now returns EOPNOTSUPP.

Move the references to PRs from code comments to the test description. Once
ATF has the ability to output the metadata in the HTML reports, it should be
easy to traverse between releng and gnats -reports via links.

Add a (skipped for now) test case for PR 46463

adapt to new reality

Add a test for adding an event to an unsupported fd.
 1.218.6.3  03-Dec-2017  jdolecek update from HEAD
 1.218.6.2  20-Aug-2014  tls Rebase to HEAD as of a few days ago.
 1.218.6.1  25-Feb-2013  tls resync with head
 1.218.2.1  24-Nov-2012  jdc Pull up revisions:
src/sys/kern/kern_event.c revision 1.79
src/sys/kern/kern_descrip.c revision 1.219
src/lib/libc/sys/kqueue.2 revision 1.33
src/tests/lib/libc/sys/t_kevent.c revision 1.2-1.5
(requested by christos in ticket #716).

- initialize kn_id
- in close, invalidate f_data and f_type early to prevent accidental re-use
- add a DIAGNOSTIC for when we use unsupported fd's and a KASSERT for f_event
being NULL.

Return EOPNOTSUPP for fnullop_kqfilter to prevent registration of unsupported
fds. XXX: We should really fix the fd's to be supported in the future.
Unsupported fd's have a NULL f_event, so registering crashes the kernel with
a NULL function dereference of f_event.

mention that kevent now returns EOPNOTSUPP.

Move the references to PRs from code comments to the test description. Once
ATF has the ability to output the metadata in the HTML reports, it should be
easy to traverse between releng and gnats -reports via links.

Add a (skipped for now) test case for PR 46463

adapt to new reality

Add a test for adding an event to an unsupported fd.
 1.219.2.1  18-May-2014  rmind sync with head
 1.224.2.1  10-Aug-2014  tls Rebase.
 1.225.2.2  03-Jun-2017  snj Pull up following revision(s) (requested by riastradh in ticket #1425):
sys/kern/kern_descrip.c: revision 1.230
Explicitly set the flags instead of masking set values in.
This fixes FNONBLOCK weirdness seen in audio.c
OK christos@ and martin@.
 1.225.2.1  04-Aug-2015  snj branches: 1.225.2.1.2; 1.225.2.1.6;
Pull up following revision(s) (requested by christos in ticket #933):
sys/kern/kern_descrip.c: revision 1.229
1. mask fflags so we don't tack on whatever oflags were passed from userland
2. honor O_CLOEXEC, so the children of daemons that use cloning devices don't
end up with the parent's descriptors
fd_clone and in general the fd approach of 'allocate' > 'play with guts' >
'attach' should be converted to be more constructor like.
 1.225.2.1.6.1  03-Jun-2017  snj Pull up following revision(s) (requested by riastradh in ticket #1425):
sys/kern/kern_descrip.c: revision 1.230
Explicitly set the flags instead of masking set values in.
This fixes FNONBLOCK weirdness seen in audio.c
OK christos@ and martin@.
 1.225.2.1.2.1  03-Jun-2017  snj Pull up following revision(s) (requested by riastradh in ticket #1425):
sys/kern/kern_descrip.c: revision 1.230
Explicitly set the flags instead of masking set values in.
This fixes FNONBLOCK weirdness seen in audio.c
OK christos@ and martin@.
 1.228.2.2  28-Aug-2017  skrll Sync with HEAD
 1.228.2.1  22-Sep-2015  skrll Sync with HEAD
 1.229.8.1  19-May-2017  pgoyette Resolve conflicts from previous merge (all resulting from $NetBSD
keyword expansion)
 1.231.10.2  08-Apr-2020  martin Merge changes from current as of 20200406
 1.231.10.1  10-Jun-2019  christos Sync with HEAD
 1.231.8.6  18-Jan-2019  pgoyette Sync with HEAD
 1.231.8.5  26-Nov-2018  pgoyette Sync with HEAD, resolve a couple of conflicts
 1.231.8.4  20-Oct-2018  pgoyette Sync with head
 1.231.8.3  30-Sep-2018  pgoyette Sync with HEAD
 1.231.8.2  06-Sep-2018  pgoyette Sync with HEAD

Resolve a couple of conflicts (result of the uimin/uimax changes)
 1.231.8.1  28-Jul-2018  pgoyette Sync with HEAD
 1.243.6.1  29-Feb-2020  ad Sync with head.
 1.243.4.3  20-Nov-2024  martin Pull up following revision(s) (requested by riastradh in ticket #1921):

sys/kern/kern_event.c: revision 1.106
sys/kern/sys_select.c: revision 1.51
sys/kern/subr_exec_fd.c: revision 1.10
sys/kern/sys_aio.c: revision 1.46
sys/kern/kern_descrip.c: revision 1.244
sys/kern/kern_descrip.c: revision 1.245
sys/ddb/db_xxx.c: revision 1.72
sys/ddb/db_xxx.c: revision 1.73
sys/miscfs/fdesc/fdesc_vnops.c: revision 1.132
sys/kern/uipc_usrreq.c: revision 1.195
sys/kern/sys_descrip.c: revision 1.36
sys/kern/uipc_usrreq.c: revision 1.196
sys/kern/uipc_socket2.c: revision 1.135
sys/kern/uipc_socket2.c: revision 1.136
sys/kern/kern_sig.c: revision 1.383
sys/kern/kern_sig.c: revision 1.384
sys/compat/netbsd32/netbsd32_ioctl.c: revision 1.107
sys/miscfs/procfs/procfs_vnops.c: revision 1.208
sys/kern/subr_exec_fd.c: revision 1.9
sys/kern/kern_descrip.c: revision 1.252
(all via patch)

Load struct filedesc::fd_dt with atomic_load_consume.

Exceptions: when fd_refcnt <= 1, or when holding fd_lock.

While here:
- Restore KASSERT(mutex_owned(&fdp->fd_lock)) in fd_unused.
=> This is used only in fd_close and fd_abort, where it holds.
- Move bounds check assertion in fd_putfile to where it matters.
- Store fd_dt with atomic_store_release.
- Move load of fd_dt under lock in knote_fdclose.
- Omit membar_consumer in fdesc_readdir.
=> atomic_load_consume serves the same purpose now.
=> Was needed only on alpha anyway.

Load struct fdfile::ff_file with atomic_load_consume.
Exceptions: when we're only testing whether it's there, not about to
dereference it.

Note: We do not use atomic_store_release to set it because the
preceding mutex_exit should be enough.

(That said, it's not clear the mutex_enter/exit is needed unless
refcnt > 0 already, in which case maybe it would be a win to switch
from the membar implied by mutex_enter to the membar implied by
atomic_store_release -- which I would generally expect to be much
cheaper. And a little clearer without a long comment.)
kern_descrip.c: Fix membars around reference count decrement.

In general, the `last one out hit the lights' style of reference
counting (as opposed to the `whoever's destroying must wait for
pending users to finish' style) requires memory barriers like so:

... usage of resources associated with object ...
membar_release();
if (atomic_dec_uint_nv(&obj->refcnt) != 0)
return;
membar_acquire();
... freeing of resources associated with object ...

This way, all usage happens-before all freeing. This fixes several
errors:
- fd_close failed to ensure whatever its caller did would
happen-before the freeing, in the case where another thread is
concurrently trying to close the fd (ff->ff_file == NULL).
Fix: Add membar_release before atomic_dec_uint(&ff->ff_refcnt) in
that branch.
- fd_close failed to ensure all loads its caller had issued will have
happened-before the freeing, in the case where the fd is still in
use by another thread (fdp->fd_refcnt > 1 and ff->ff_refcnt-- > 0).
Fix: Change membar_producer to membar_release before
atomic_dec_uint(&ff->ff_refcnt).
- fd_close failed to ensure that any usage of fp by other callers
would happen-before any freeing it does.
Fix: Add membar_acquire after atomic_dec_uint_nv(&ff->ff_refcnt).
- fd_free failed to ensure that any usage of fdp by other callers
would happen-before any freeing it does.
Fix: Add membar_acquire after atomic_dec_uint_nv(&fdp->fd_refcnt).

While here, change membar_exit -> membar_release. No semantic
change, just updating away from the legacy API.
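
The `last one out hit the lights' pattern described in this entry can be sketched in portable C11. This is an illustrative userland analogue only: struct obj and obj_release() are hypothetical, and the C11 fences stand in for the kernel's membar_release()/membar_acquire(), with atomic_fetch_sub_explicit() playing the role of atomic_dec_uint_nv(). It is not the code in kern_descrip.c.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical reference-counted object; not a kernel structure. */
struct obj {
	atomic_uint refcnt;
	bool freed;	/* stands in for "resources have been released" */
};

/*
 * Drop one reference.  Returns true if this caller held the last
 * reference and therefore performed the teardown.
 */
static bool
obj_release(struct obj *o)
{
	assert(atomic_load_explicit(&o->refcnt, memory_order_relaxed) > 0);

	/* Analogue of membar_release(): our earlier uses happen first. */
	atomic_thread_fence(memory_order_release);

	/* fetch_sub returns the old value, so old - 1 is the new count. */
	if (atomic_fetch_sub_explicit(&o->refcnt, 1,
	    memory_order_relaxed) - 1 != 0)
		return false;

	/* Analogue of membar_acquire(): teardown happens after all uses. */
	atomic_thread_fence(memory_order_acquire);
	o->freed = true;
	return true;
}
```

The release fence sits before the decrement on every path, while the acquire fence is needed only on the path that goes on to free, which mirrors the ordering shown in the commit message.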
 1.243.4.2  17-Nov-2024  martin Pull up following revision(s) (requested by kre in ticket #1003):

sys/kern/kern_descrip.c: revision 1.264 (via patch)

Make O_CLOEXEC always close specified files on exec

It turns out that close-on-exec doesn't always close on exec.

If all close-on-exec fd's were made close-on-exec via dup3() or
fcntl(F_DUPFD_CLOEXEC) or use of the internal fd_clone() (whose uses
I did not fully investigate but I think is used to create a fd for
the open of a cloner device, and perhaps other things) then none
of the close-on-exec file descriptors will be closed when an exec
happens - but will be passed through to the new process (still marked,
apparently, as close-on-exec - but still won't be closed if another exec
happens) - that is unless...

If at least one fd in the process has close-on-exec set some other way
(fcntl(F_SETFD), open(O_CLOEXEC) (and the similar functions for sockets,
and epoll) and perhaps others) then all close-on-exec file descriptors
in the process will be correctly closed when an exec happens (however
they obtained the close-on-exec status).

There are two steps that need to be taken (in the kernel) when turning
on close on exec - the obvious one of setting the ff_exclose field in
the struct fdfile for the fd. And second, marking the file descriptor
table (which holds the fdfile's for one or more processes) as containing
file descriptors with close-on-exec set (it is a simple yes/no, and once
set is never cleared until an actual exec happens). If it was set during
an exec, all the file descriptors are examined, and those marked
close-on-exec are closed. If the file descriptor table doesn't indicate
that close-on-exec fds exist in the table, none of that happens.

Several places were setting ff_exclose in the struct fdfile but
not bothering to set the fd_exclose field in the file descriptor table.

There's even a function (fd_set_exclose()) whose whole purpose is to do
this properly - but it wasn't being used.

Now it is, everywhere (I hope).
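
The two-step bookkeeping described in this entry can be modelled in a few lines of C. This is a simplified sketch, not the kernel's data structures: struct fdfile_model, struct fdtab_model, NFD and the three functions are hypothetical, and only the field names ff_exclose and fd_exclose are taken from the commit message.

```c
#include <assert.h>
#include <stdbool.h>

#define NFD 8	/* hypothetical table size */

struct fdfile_model {
	bool in_use;
	bool ff_exclose;	/* per-descriptor close-on-exec flag */
};

struct fdtab_model {
	bool fd_exclose;	/* table-wide "some fd has close-on-exec" */
	struct fdfile_model ff[NFD];
};

/* The bug: only the per-descriptor flag is set. */
static void
model_set_exclose_buggy(struct fdtab_model *t, int fd)
{
	assert(fd >= 0 && fd < NFD);
	t->ff[fd].ff_exclose = true;
}

/* The fix: set both the per-descriptor flag and the table-wide hint. */
static void
model_set_exclose(struct fdtab_model *t, int fd)
{
	assert(fd >= 0 && fd < NFD);
	t->ff[fd].ff_exclose = true;
	t->fd_exclose = true;
}

/* Model of exec: scan only if the table-wide hint is set. */
static int
model_close_on_exec(struct fdtab_model *t)
{
	int closed = 0;

	if (!t->fd_exclose)
		return 0;	/* fast path: nothing is examined */
	for (int fd = 0; fd < NFD; fd++) {
		if (t->ff[fd].in_use && t->ff[fd].ff_exclose) {
			t->ff[fd].in_use = false;
			t->ff[fd].ff_exclose = false;
			closed++;
		}
	}
	t->fd_exclose = false;	/* cleared only once exec has run */
	return closed;
}
```

With model_set_exclose_buggy() the descriptor survives the modelled exec because the scan is skipped entirely; once any caller uses model_set_exclose(), every marked descriptor is closed, however it was marked.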
 1.243.4.1  07-Aug-2024  martin Pull up following revision(s) (requested by kre in ticket #1859):

sys/kern/kern_proc.c: revision 1.276 (via patch)
sys/kern/kern_ktrace.c: revision 1.185 (via patch)
sys/kern/sys_sig.c: revision 1.58 (via patch)
sys/kern/kern_descrip.c: revision 1.263 (via patch)
lib/libc/compat-43/killpg.c: revision 1.10
sys/kern/tty.c: revision 1.313 (via patch)
tests/lib/libc/sys/t_kill.c: revision 1.2

PR kern/58425 -- Disallow INT_MIN as a (negative) pid arg.
Since -INT_MIN is undefined, and the point of negative pid args is
to negate them, and use the result as a pgrp id instead, we need
to avoid accidentally negating INT_MIN.

Since pid_t is just an integral type, of unspecified width, when
testing a pid_t value, test for <= INT_MIN (or > INT_MIN sometimes)
rather than == INT_MIN. When testing int values, just == INT_MIN
is all that is needed, < INT_MIN cannot occur.

tests/lib/libc/sys/t_kill: Test kill(INT_MIN, ...) fails with ESRCH.
PR kern/58425
 1.249.2.1  03-Jan-2021  thorpej Sync w/ HEAD.
 1.250.4.1  01-Aug-2021  thorpej Sync with HEAD.
 1.251.10.3  17-Nov-2024  martin Pull up following revision(s) (requested by kre in ticket #1003):

sys/kern/kern_descrip.c: revision 1.264

Make O_CLOEXEC always close specified files on exec

It turns out that close-on-exec doesn't always close on exec.

If all close-on-exec fd's were made close-on-exec via dup3() or
fcntl(F_DUPFD_CLOEXEC) or use of the internal fd_clone() (whose uses
I did not fully investigate but I think is used to create a fd for
the open of a cloner device, and perhaps other things) then none
of the close-on-exec file descriptors will be closed when an exec
happens - but will be passed through to the new process (still marked,
apparently, as close-on-exec - but still won't be closed if another exec
happens) - that is unless...

If at least one fd in the process has close-on-exec set some other way
(fcntl(F_SETFD), open(O_CLOEXEC) (and the similar functions for sockets,
and epoll), and perhaps others), then all close-on-exec file descriptors
in the process will be correctly closed when an exec happens (however
they obtained the close-on-exec status).

There are two steps that need to be taken (in the kernel) when turning
on close on exec - the obvious one of setting the ff_exclose field in
the struct fdfile for the fd. And second, marking the file descriptor
table (which holds the fdfile's for one or more processes) as containing
file descriptors with close-on-exec set (it is a simple yes/no, and once
set is never cleared until an actual exec happens). If it is set when
an exec happens, all the file descriptors are examined, and those marked
close-on-exec are closed. If the file descriptor table doesn't indicate
that close-on-exec fds exist in the table, none of that happens.

Several places were setting ff_exclose in the struct fdfile but
not bothering to set the fd_exclose field in the file descriptor table.

There's even a function (fd_set_exclose()) whose whole purpose is to do
this properly - but it wasn't being used.

Now it is, everywhere (I hope).
 1.251.10.2  07-Aug-2024  martin Pull up following revision(s) (requested by kre in ticket #773):

sys/kern/kern_proc.c: revision 1.276
sys/kern/kern_ktrace.c: revision 1.185
sys/kern/sys_sig.c: revision 1.58
sys/kern/kern_descrip.c: revision 1.263
lib/libc/compat-43/killpg.c: revision 1.10
sys/kern/tty.c: revision 1.313
tests/lib/libc/sys/t_kill.c: revision 1.2

PR kern/58425 -- Disallow INT_MIN as a (negative) pid arg.

Since -INT_MIN is undefined, and the point of negative pid args is
to negate them, and use the result as a pgrp id instead, we need
to avoid accidentally negating INT_MIN.

Since pid_t is just an integral type, of unspecified width, when
testing a pid_t value, test for <= INT_MIN (or > INT_MIN sometimes)
rather than == INT_MIN. When testing int values, just == INT_MIN
is all that is needed, < INT_MIN cannot occur.

tests/lib/libc/sys/t_kill: Test kill(INT_MIN, ...) fails with ESRCH.
PR kern/58425
 1.251.10.1  30-Jul-2023  martin Pull up following revision(s) (requested by riastradh in ticket #262):

sys/kern/kern_descrip.c: revision 1.252
sys/kern/kern_descrip.c: revision 1.253
sys/kern/kern_descrip.c: revision 1.254

kern_descrip.c: Fix membars around reference count decrement.

In general, the `last one out hit the lights' style of reference
counting (as opposed to the `whoever's destroying must wait for
pending users to finish' style) requires memory barriers like so:

    ... usage of resources associated with object ...
    membar_release();
    if (atomic_dec_uint_nv(&obj->refcnt) != 0)
        return;
    membar_acquire();
    ... freeing of resources associated with object ...

This way, all usage happens-before all freeing. This fixes several
errors:
- fd_close failed to ensure whatever its caller did would
  happen-before the freeing, in the case where another thread is
  concurrently trying to close the fd (ff->ff_file == NULL).
  Fix: Add membar_release before atomic_dec_uint(&ff->ff_refcnt) in
  that branch.
- fd_close failed to ensure all loads its caller had issued will have
  happened-before the freeing, in the case where the fd is still in
  use by another thread (fdp->fd_refcnt > 1 and ff->ff_refcnt-- > 0).
  Fix: Change membar_producer to membar_release before
  atomic_dec_uint(&ff->ff_refcnt).
- fd_close failed to ensure that any usage of fp by other callers
  would happen-before any freeing it does.
  Fix: Add membar_acquire after atomic_dec_uint_nv(&ff->ff_refcnt).
- fd_free failed to ensure that any usage of fdp by other callers
  would happen-before any freeing it does.
  Fix: Add membar_acquire after atomic_dec_uint_nv(&fdp->fd_refcnt).

While here, change membar_exit -> membar_release. No semantic
change, just updating away from the legacy API.
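
For readers without a kernel tree at hand, the release/acquire pattern
above can be approximated in portable C11. This is a userland analogue
under stated assumptions, not the kernel code: atomic_thread_fence() is
used here in place of membar_release()/membar_acquire(),
atomic_fetch_sub_explicit() in place of atomic_dec_uint_nv(), and the
model_obj type and functions are invented for the example.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical reference-counted object. */
struct model_obj {
	atomic_uint refcnt;
	int *payload;		/* resource owned by the object */
};

struct model_obj *
model_obj_create(int value)
{
	struct model_obj *obj = malloc(sizeof(*obj));

	if (obj == NULL)
		return NULL;
	obj->payload = malloc(sizeof(*obj->payload));
	if (obj->payload == NULL) {
		free(obj);
		return NULL;
	}
	*obj->payload = value;
	atomic_init(&obj->refcnt, 1);	/* creator holds one reference */
	return obj;
}

void
model_obj_hold(struct model_obj *obj)
{
	atomic_fetch_add_explicit(&obj->refcnt, 1, memory_order_relaxed);
}

/* Drop one reference; returns true if this call freed the object. */
bool
model_obj_rele(struct model_obj *obj)
{
	/* Order all of our prior use of obj before the decrement. */
	atomic_thread_fence(memory_order_release);
	/* fetch_sub returns the old value; old != 1 means others remain. */
	if (atomic_fetch_sub_explicit(&obj->refcnt, 1,
	    memory_order_relaxed) != 1)
		return false;
	/* Order every other holder's use before the freeing below. */
	atomic_thread_fence(memory_order_acquire);
	free(obj->payload);
	free(obj);
	return true;
}
```

Only the caller that sees the count reach zero frees the object, and the
two fences are what make all earlier usage, by any thread,
happen-before that freeing.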

kern_descrip.c: Use atomic_store_relaxed/release for ff->ff_file.
1. atomic_store_relaxed in fd_close avoids the appearance of race in
   sanitizers (minor bug).
2. atomic_store_release in fd_affix is necessary because the lock
   activity was not, in fact, enough to guarantee ordering (real bug
   on some architectures like aarch64).

   The premise appears to have been that the mutex_enter/exit earlier
   in fd_affix is enough to guarantee that initialization of fp (A)
   happens before use of fp by a user once fp is published (B):

       fp->f_... = ...; // A
       /* fd_affix */
       mutex_enter(&fp->f_lock);
       fp->f_count++;
       mutex_exit(&fp->f_lock);
       ...
       ff->ff_file = fp; // B

   But actually mutex_enter/exit allow the following reordering by
   the CPU:

       mutex_enter(&fp->f_lock);
       ff->ff_file = fp; // B
       fp->f_count++;
       fp->f_... = ...; // A
       mutex_exit(&fp->f_lock);

   The only constraints they imply are:
   1. fp->f_count++ and B cannot precede mutex_enter
   2. mutex_exit cannot precede A and fp->f_count++

   They imply no constraint on the relative ordering of A, B, and
   fp->f_count++ amongst each other, however.

   This affects any architecture that has a native load-acquire or
   store-release operation in mutex_enter/exit, like aarch64, instead
   of explicit load-before-load/store and load/store-before-store
   barrier.

No need for atomic_store_* in fd_copy or fd_free because we have
exclusive access to ff as is.
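
The publication pattern being fixed can be sketched with C11 atomics.
This is a userland analogue only: atomic_store_explicit() with
memory_order_release stands in for the kernel's atomic_store_release(),
the acquire load in the reader is an assumption of this sketch rather
than a description of fd_getfile, and the model_* types and functions
are invented.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical, cut-down stand-ins for struct file and struct fdfile. */
struct model_file {
	int f_flag;
	unsigned int f_count;
};

struct model_fdfile {
	_Atomic(struct model_file *) ff_file;
};

/* Writer: initialize fp (A), then publish it with a store-release (B). */
void
model_affix(struct model_fdfile *ff, struct model_file *fp, int flag)
{
	fp->f_flag = flag;	/* A: initialization */
	fp->f_count++;
	/* B: the release store keeps A and the count update before it. */
	atomic_store_explicit(&ff->ff_file, fp, memory_order_release);
}

/* Reader: a non-NULL result implies the initialization is visible. */
struct model_file *
model_getfile(struct model_fdfile *ff)
{
	return atomic_load_explicit(&ff->ff_file, memory_order_acquire);
}
```

With a plain store at B, nothing in this sketch would stop the CPU from
making the pointer visible before the fields it points to, which is the
reordering described above.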

kern_descrip.c: Change membar_enter to membar_acquire in fd_getfile.
membar_acquire is cheaper on many CPUs, and unlikely to be costlier
on any CPUs, than the legacy membar_enter.
Add a long comment explaining the interaction between fd_getfile and
fd_close and why membar_acquire is safe.
 1.262.6.1  02-Aug-2025  perseant Sync with HEAD
