History log of /src/sys/kern/vfs_vnops.c
Revision  Date  Author  Comments
 1.246  09-Jul-2025  bad release fp->f_lock after reading the offset in vn_read()

Fixes an obvious lock leak introduced in r1.238 and pulled up to netbsd-10.

Pullup to netbsd-10.

Fixes PR kern/59519 vn_read() leaks file* lock
 1.245  08-Jul-2025  mlelstv Access v_rdev only for a device special file.

Pullup-10
 1.244  07-Dec-2024  riastradh vfs(9): Sprinkle SET_ERROR dtrace probes.

PR kern/58378: Kernel error code origination lacks dtrace probes
 1.243  07-Dec-2024  riastradh vfs(9): Sprinkle KNF.

No functional change intended.
 1.242  10-Jul-2023  christos branches: 1.242.6;
Add memfd_create(2) from GSoC 2023 by Theodore Preduta
 1.241  22-Apr-2023  riastradh file(9): New fo_posix_fadvise operation.

XXX kernel revbump -- changes struct fileops API and ABI
 1.240  22-Apr-2023  riastradh file(9): New fo_fpathconf operation.

XXX kernel revbump -- struct fileops API and ABI change
 1.239  22-Apr-2023  riastradh file(9): New fo_advlock operation.

This moves the vnode-specific logic from sys_descrip.c into
vfs_vnode.c, like we did for fo_seek.

XXX kernel revbump -- struct fileops API and ABI change
 1.238  22-Apr-2023  riastradh readdir(2), lseek(2): Fix races in access to struct file::f_offset.

For non-directory vnodes:
- reading f_offset requires a shared or exclusive vnode lock
- writing f_offset requires an exclusive vnode lock

For directory vnodes, access (read or write) requires either:
- a shared vnode lock AND f_lock, or
- an exclusive vnode lock.

This way, two files for the same underlying directory vnode can still
do VOP_READDIR in parallel, but if two readdir(2) or lseek(2) calls
run in parallel on the same file, the load and store of f_offset is
atomic (otherwise, e.g., on 32-bit systems it might be torn and lead
to corrupt offsets).

There is still a potential problem: the _whole transaction_ of
readdir(2) may not be atomic. For example, if thread A and thread B
read n bytes of directory content, thread A might get bytes [0,n) and
thread B might get bytes [n,2n) but f_offset might end up at n
instead of 2n once both operations complete. (However, f_offset
wouldn't be some corrupt garbled number like n & 0xffffffff00000000.)
Fixing this would require either:
(a) using an exclusive vnode lock in vn_readdir,
(b) introducing a new lock that serializes vn_readdir on the same
file (but not necessarily the same vnode), or
(c) proving it is safe to hold f_lock across VOP_READDIR, VOP_SEEK,
and VOP_GETATTR.
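
To illustrate the torn-offset hazard described above, here is a minimal userland sketch, not kernel code. It models a 64-bit offset stored as two 32-bit words, as may happen on a 32-bit CPU without an atomic 64-bit store, and shows what a reader could observe if it ran between the two half-stores. `struct split_off`, `split_store`, `split_load`, and `torn_value` are hypothetical names introduced only for this illustration.

```c
#include <stdint.h>

/* Hypothetical model: a 64-bit offset kept as two 32-bit words. */
struct split_off {
	uint32_t lo;
	uint32_t hi;
};

static void
split_store(struct split_off *o, uint64_t v)
{
	o->hi = (uint32_t)(v >> 32);
	o->lo = (uint32_t)v;
}

static uint64_t
split_load(const struct split_off *o)
{
	return ((uint64_t)o->hi << 32) | o->lo;
}

/*
 * Value seen by a reader that runs after the writer has stored the
 * high word of newv but before it has stored the low word.
 */
static uint64_t
torn_value(uint64_t oldv, uint64_t newv)
{
	struct split_off o;

	split_store(&o, oldv);
	o.hi = (uint32_t)(newv >> 32);	/* writer interrupted here */
	return split_load(&o);
}
```

In this model, an update from 0x00000000ffffffff to 0x0000000100000000 can be observed as 0x00000001ffffffff, which is neither the old nor the new offset. Making the load and store happen under a lock rules this out.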
 1.237  13-Mar-2023  riastradh vn_open(9): Add assertion that vp is locked on return.

Null out vp internally out of paranoia so we'll crash in evaluating
the assertion if we ever reach it via one of the vput paths.
 1.236  13-Mar-2023  riastradh vn_open(9): Clarify that this returns a locked vnode.

Comment only, no functional change intended.
 1.235  06-Aug-2022  riastradh branches: 1.235.4;
vnodeops(9): Take exclusive lock in read/seek for f_offset update.

Otherwise concurrent readers/seekers might clobber it.
 1.234  18-Jul-2022  thorpej Make kqueue event status for vnodes shareable, and for stacked file systems
like nullfs, make the upper vnode share that status with the lower vnode.

And, lo, NetBSD 9.99.99.

Fixes PR kern/56713.
 1.233  06-Jul-2022  riastradh kern: Work around spurious -Wtype-limits warnings.

This useless garbage warning is apparently designed to make it
painful to write portable safe arithmetic and I think we ought to
just disable it.
 1.232  06-Jul-2022  riastradh kern/vfs_vnops.c: Fix missing semicolon in previous.

Neglected to build and amend commit, oops.
 1.231  06-Jul-2022  riastradh kern/vfs_vnops.c: Sprinkle KNF.

No functional change intended.
 1.230  06-Jul-2022  riastradh mmap(2): Avoid overflow in overflow check in vn_mmap.
 1.229  06-Jul-2022  riastradh uvm(9): fo_mmap caller guarantees positive size.

No functional change intended, just sprinkling assertions to make it
clearer.
 1.228  22-May-2022  andvar fix various small typos, mainly in comments.
 1.227  25-Mar-2022  hannken It is impossible for VOP_LOCK() to return ENOENT with LK_RETRY flag.
Remove the second call to VOP_LOCK().

Enable assertion "vrefcnt(vp) > 0" and assert all possible errors
for all LK_RETRY/LK_NOWAIT combinations.
 1.226  19-Mar-2022  hannken Lock vnode across VOP_OPEN.
 1.225  13-Mar-2022  riastradh vfs(9): Avoid arithmetic overflow in vn_seek.

Reported-by: syzbot+b9f9a02148a40675c38a@syzkaller.appspotmail.com
 1.224  20-Oct-2021  thorpej Overhaul of the EVFILT_VNODE kevent(2) filter:

- Centralize vnode kevent handling in the VOP_*() wrappers, rather than
forcing each individual file system to deal with it (except VOP_RENAME(),
because VOP_RENAME() is a mess and we currently have 2 different ways
of handling it; at least it's reasonably well-centralized in the "new"
way).
- Add support for NOTE_OPEN, NOTE_CLOSE, NOTE_CLOSE_WRITE, and NOTE_READ,
compatible with the same events in FreeBSD.
- Track which kevent notifications clients are interested in receiving
to avoid doing work for events no one cares about (avoiding, e.g.
taking locks and traversing the klist to send a NOTE_WRITE when
someone is merely watching for a file to be deleted).

In support of the above:

- Add support in vnode_if.sh for specifying PRE- and POST-op handlers,
to be invoked before and after vop_pre() and vop_post(), respectively.
Basic idea from FreeBSD, but implemented differently.
- Add support in vnode_if.sh for specifying CONTEXT fields in the
vop_*_args structures. These context fields are used to convey information
between the file system VOP function and the VOP wrapper, but do not
occupy an argument slot in the VOP_*() call itself. These context fields
are initialized and subsequently interpreted by PRE- and POST-op handlers.
- Version VOP_REMOVE(), using a context field for the file system to report
back the resulting link count of the target vnode. Return this in tmpfs,
udf, nfs, chfs, ext2fs, lfs, and ufs.

NetBSD 9.99.92.
 1.223  11-Sep-2021  riastradh sys/kern: Avoid fp->f_offset without the object (here, vnode) lock.
 1.222  11-Sep-2021  riastradh sys/kern: Allow custom fileops to specify fo_seek method.

Previously only vnodes allowed lseek/pread[v]/pwrite[v], which meant
converting a regular device to a cloning device doesn't always work.

Semantics is:

(*fp->f_ops->fo_seek)(fp, delta, whence, newoffp, flags)

1. Compute a new offset according to whence + delta -- that is, if
whence is SEEK_CUR, add delta to fp->f_offset; if whence is
SEEK_END, add delta to end of file; if whence is SEEK_SET, use delta
as is.

2. If newoffp is nonnull, return the new offset in *newoffp.

3. If flags & FOF_UPDATE_OFFSET, set fp->f_offset to the new offset.

Access to fp->f_offset, and *newoffp if newoffp = &fp->f_offset, must
happen under the object lock (e.g., vnode lock), in order to
synchronize fp->f_offset reads and writes.
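
The three steps above can be sketched in userland C as follows. This is only an illustration under stated assumptions: `struct toy_file`, `toy_seek`, `TOY_FOF_UPDATE_OFFSET`, and the `size` field are hypothetical stand-ins, not the kernel's struct file or fileops, and the required object lock is not shown. The sketch treats the "use delta as is" case as SEEK_SET, and its overflow and negative-offset checks are an addition of this sketch rather than part of the stated semantics.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>	/* SEEK_SET, SEEK_CUR, SEEK_END */

/* Hypothetical stand-ins; not the kernel's struct file or flags. */
#define TOY_FOF_UPDATE_OFFSET	0x1

struct toy_file {
	int64_t offset;	/* plays the role of fp->f_offset */
	int64_t size;	/* end of file, used for SEEK_END */
};

static int
toy_seek(struct toy_file *fp, int64_t delta, int whence,
    int64_t *newoffp, int flags)
{
	int64_t base, newoff;

	/* Assumption of this sketch: offsets and sizes are nonnegative. */
	assert(fp->offset >= 0 && fp->size >= 0);

	/* 1. Compute the new offset according to whence + delta. */
	switch (whence) {
	case SEEK_CUR:
		base = fp->offset;
		break;
	case SEEK_END:
		base = fp->size;
		break;
	case SEEK_SET:
		base = 0;
		break;
	default:
		return EINVAL;
	}
	/* base >= 0, so only a positive delta can overflow. */
	if (delta > 0 && base > INT64_MAX - delta)
		return EOVERFLOW;
	newoff = base + delta;
	if (newoff < 0)
		return EINVAL;

	/* 2. If newoffp is nonnull, return the new offset in *newoffp. */
	if (newoffp != NULL)
		*newoffp = newoff;

	/* 3. If requested, set the file offset to the new offset. */
	if (flags & TOY_FOF_UPDATE_OFFSET)
		fp->offset = newoff;
	return 0;
}
```

A caller that only wants to validate a position passes flags of 0 and reads the result through `newoffp`; a caller implementing lseek-like behavior passes `TOY_FOF_UPDATE_OFFSET`.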

This change has the side effect that every call to VOP_SEEK happens
under the vnode lock now, when previously it didn't. However, from a
review of all the VOP_SEEK implementations, it does not appear that
any file system even examines the vnode, let alone locks it. So I
think this is safe -- and essentially the only reasonable way to do
things, given that it is used to validate a change from oldoff to
newoff, and oldoff becomes stale the moment we unlock the vnode.

No kernel bump because this reuses a spare entry in struct fileops,
and it is safe for the entry to be null, so all existing fileops will
continue to work as before (rejecting seek).
 1.221  18-Jul-2021  dholland Fix confusion arising from whether FOLLOW or NOFOLLOW is 0.

In vn_open, don't set and then throw away FOLLOW, and clarify the
comment about requesting FOLLOW/NOFOLLOW behavior.

Related to PR 56316.
 1.220  01-Jul-2021  martin gcc (with some options) erroneously claims we would use "vp" uninitialized,
so initialize it as NULL.
 1.219  01-Jul-2021  christos don't clear the error before we use it to determine if we are moving or duping.
 1.218  30-Jun-2021  dholland Improve Christos's vn_open fix.

- assert about api misuse up front (suggested by riastradh)
- restore the behavior of returning EOPNOTSUPP if ret_fd is NULL and we
get a fd back (otherwise things like ktruss -o /dev/stderr panic)
- clear error to 0 for the EDUPFD and EMOVEFD cases so opening a
cloner succeeds
 1.217  30-Jun-2021  christos PR/56286: Martin Husemann: Fix NULL deref on kmod load.
- No need to set ret_domove and ret_fd in the regular case, they are meaningless
- KASSERT instead of setting errno and then doing the NULL deref.
 1.216  29-Jun-2021  dholland Add containment for the cloning devices hack in vn_open.

Cloning devices (and also things like /dev/stderr) work by allocating
a struct file, stuffing it in the file table (which is a layer
violation), stuffing the file descriptor number for it in a magic
field of struct lwp (which is gross), and then "failing" with one of
two magic errnos, EDUPFD or EMOVEFD.

Before this commit, all callers of vn_open in the kernel (there are
quite a few) were expected to check for these errors and handle the
situation. Needless to say, none of them except for open() itself did,
resulting in internal negative errnos being returned to userspace.

This hack is fairly deeply rooted and cannot be eliminated all at
once. This commit adds logic to handle the magic errnos inside
vn_open; now on success vn_open returns either a vnode or an integer
file descriptor, along with a flag that says whether the underlying
code requested EDUPFD or EMOVEFD. Callers not prepared to cope with
file descriptors can pass NULL for the extra return values, in which
case if a file descriptor would be produced vn_open fails with
EOPNOTSUPP.

Since I'm rearranging vn_open's signature anyway, stop exposing struct
nameidata. Instead, take three arguments: an optional vnode to use as
the starting point (like openat()), the path, and additional namei
flags to use, restricted to NOCHROOT and TRYEMULROOT. (Other namei
behavior, e.g. NOFOLLOW, can be requested via the open flags.)

This change requires a kernel bump. Ride the one an hour ago.
(That was supposed to be coordinated; did not intend to let an hour
slip by. My fault.)
 1.215  16-Jun-2021  dholland Add a new namei flag NONEXCLHACK for open with O_CREAT and not O_EXCL.

This case needs to be distinguished from the other CREATE operations
because it is supposed to successfully return (and open) the target if
it exists. In the case where that target is the root, or a mount
point, such that there's no parent dir, "real" CREATE operations fail,
but O_CREAT without O_EXCL needs to succeed.

So (a) add the flag, (b) test for it in namei in the situation
described above, (c) set it in open under the appropriate
circumstances, and (d) because this can result in namei returning
ni_dvp of NULL, cope with that case.

Should get into -9 and maybe even -8, because it was prompted by
issues with 3rd-party code. The use of a flag (vs. adding an
additional nameiop, which would be more appropriate) was deliberate to
make the patch small and noninvasive.
 1.214  09-Nov-2020  chs branches: 1.214.4;
Lock the vnode while calling VOP_BMAP() for FIOGETBMAP.

Reported-by: syzbot+cfa1b773be7337250428@syzkaller.appspotmail.com
 1.213  11-Jun-2020  ad branches: 1.213.2;
Counter tweaks:

- Don't need to count anonpages+filepages any more; clean+unknown+dirty for
each kind of page can be summed to get the totals.

- Track the number of free pages with a counter so that it's one less thing
for the allocator to do, which opens up further options there.

- Remove cpu_count_sync_one(). It has no users and doesn't save a whole lot.
For the cheap option, give cpu_count_sync() a boolean parameter indicating
that a cached value is okay, and rate limit the updates for cached values
to hz.
 1.212  23-May-2020  ad Move proc_lock into the data segment. It was dynamically allocated because
at the time we had mutex_obj_alloc() but not __cacheline_aligned.
 1.211  13-Apr-2020  ad Replace most uses of vp->v_usecount with a call to vrefcnt(vp), a function
that hides the details and does atomic_load_relaxed(). Signature matches
FreeBSD.
 1.210  12-Apr-2020  christos Oops missed one more NULL -> NOCRED
 1.209  12-Apr-2020  christos delete debugging printf.
 1.208  12-Apr-2020  christos Pass NOCRED instead of NULL for credentials. These routines are supposed
to be accessing system ACL's on behalf of the kernel. This code appears
to be copied from FreeBSD, but there it works because in FreeBSD NOCRED
is 0, ours is -1. I guess nobody has used system extended attributes on
NetBSD yet :-)
 1.207  27-Feb-2020  ad branches: 1.207.4;
Tighten up the locking around vp->v_iflag a little more after the recent
split of vmobjlock & v_interlock.
 1.206  23-Feb-2020  ad UVM locking changes, proposed on tech-kern:

- Change the lock on uvm_object, vm_amap and vm_anon to be a RW lock.
- Break v_interlock and vmobjlock apart. v_interlock remains a mutex.
- Do partial PV list locking in the x86 pmap. Others to follow later.
 1.205  12-Jan-2020  ad - Shuffle some items around in struct lwp to save space. Remove an unused
item or two.

- For lockstat, get a useful callsite for vnode locks (caller to vn_lock()).
 1.204  16-Dec-2019  ad branches: 1.204.2;
- Extend the per-CPU counters matt@ did to include all of the hot counters
in UVM, excluding uvmexp.free, which needs special treatment and will be
done with a separate commit. Cuts system time for a build by 20-25% on
a 48 CPU machine w/DIAGNOSTIC.

- Avoid 64-bit integer divide on every fault (for rnd_add_uint32).
 1.203  01-Dec-2019  ad Minor vnode locking changes:

- Stop using atomics to manipulate v_usecount. It was a mistake to begin
with. It doesn't work as intended unless the XLOCK bit is incorporated in
v_usecount and we don't have that any more. When I introduced this 10+
years ago it was to reduce pressure on v_interlock but it doesn't do that,
it just makes stuff disappear from lockstat output and introduces problems
elsewhere. We could do atomic usecounts on vnodes but there has to be a
well thought out scheme.

- Resurrect LK_UPGRADE/LK_DOWNGRADE which will be needed to work effectively
when there is increased use of shared locks on vnodes.

- Allocate the vnode lock using rw_obj_alloc() to reduce false sharing of
struct vnode.

- Put all of the LRU lists into a single cache line, and do not requeue a
vnode if it's already on the correct list and was requeued recently (less
than a second ago).

Kernel build before and after:

119.63s real 1453.16s user 2742.57s system
115.29s real 1401.52s user 2690.94s system
 1.202  10-Nov-2019  mlelstv Add functions to open devices by device number or path.
 1.201  15-Sep-2019  christos set VEXEC if FEXEC is set.
 1.200  07-Mar-2019  hannken branches: 1.200.4;
Change vn_openchk() to fail VNON and VBAD with error ENXIO.

Reported-by: syzbot+d66b1be08516a4d2d2b2@syzkaller.appspotmail.com
Reported-by: syzbot+c5eaef5a8af535c3b217@syzkaller.appspotmail.com
 1.199  04-Feb-2019  mrg s/fall into .../FALLTHROUGH/
 1.198  03-Sep-2018  riastradh Rename min/max -> uimin/uimax for better honesty.

These functions are defined on unsigned int. The generic name
min/max should not silently truncate to 32 bits on 64-bit systems.
This is purely a name change -- no functional change intended.

HOWEVER! Some subsystems have

#define min(a, b) ((a) < (b) ? (a) : (b))
#define max(a, b) ((a) > (b) ? (a) : (b))

even though our standard name for that is MIN/MAX. Although these
may invite multiple evaluation bugs, these do _not_ cause integer
truncation.

To avoid `fixing' these cases, I first changed the name in libkern,
and then compile-tested every file where min/max occurred in order to
confirm that it failed -- and thus confirm that nothing shadowed
min/max -- before changing it.

I have left a handful of bootloaders that are too annoying to
compile-test, and some dead code:

cobalt ews4800mips hp300 hppa ia64 luna68k vax
acorn32/if_ie.c (not included in any kernels)
macppc/if_gm.c (superseded by gem(4))

It should be easy to fix the fallout once identified -- this way of
doing things fails safe, and the goal here, after all, is to _avoid_
silent integer truncations, not introduce them.

Maybe one day we can reintroduce min/max as type-generic things that
never silently truncate. But we should avoid doing that for a while,
so that existing code has a chance to be detected by the compiler for
conversion to uimin/uimax without changing the semantics until we can
properly audit it all. (Who knows, maybe in some cases integer
truncation is actually intended!)
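
The truncation hazard described above can be shown with a small userland sketch. `toy_uimin` and `TOY_MIN` are hypothetical stand-ins that mirror the two shapes discussed (a function on unsigned int versus a type-agnostic macro); they are not the libkern definitions. The sketch assumes unsigned int is 32 bits wide.

```c
#include <stdint.h>

/* Hypothetical stand-in: a minimum defined on unsigned int only. */
static unsigned int
toy_uimin(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/* Hypothetical stand-in: macro form, evaluated in the operands' type. */
#define TOY_MIN(a, b)	((a) < (b) ? (a) : (b))

/*
 * The casts make explicit the narrowing that would otherwise happen
 * silently when 64-bit arguments are passed to toy_uimin().
 */
static uint64_t
clamp_with_function(uint64_t len, uint64_t limit)
{
	return toy_uimin((unsigned int)len, (unsigned int)limit);
}

/* No narrowing: the comparison is done on 64-bit values. */
static uint64_t
clamp_with_macro(uint64_t len, uint64_t limit)
{
	return TOY_MIN(len, limit);
}
```

With a 32-bit unsigned int, clamping a length of 0x100000001 to a limit of 0x10 yields 1 through the function, because the length is narrowed to 1 before the comparison, and 16 through the macro.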
 1.197  30-Nov-2017  christos branches: 1.197.2; 1.197.4;
add fo_name so we can identify the fileops in a simple way.
 1.196  09-Nov-2017  christos Add O_REGULAR to enforce opening of only regular files
(like we have O_DIRECTORY for directories).
This is better than open(, O_NONBLOCK), fstat()+S_ISREG() because opening
devices can have side effects.
 1.195  30-Mar-2017  hannken branches: 1.195.6;
Lock the vnode before changing its writecount.
 1.194  01-Mar-2017  hannken Must always lock the parent -> lock the child -> unlock the parent.
 1.193  04-Feb-2015  msaitoh branches: 1.193.2; 1.193.4;
Remove useless semicolon reported by Henning Petersen in PR#49634.
 1.192  14-Dec-2014  chs add a new "fo_mmap" fileops method to allow use of arbitrary uvm_objects for
mappings of file objects. move vnode-specific details of mmap()ing a vnode
from uvm_mmap() to the new vnode-specific vn_mmap(). add new uvm_mmap_dev()
and uvm_mmap_anon() convenience functions for mapping character devices
and anonymous memory, and replace all other calls to uvm_mmap() with those.
use the new fileop in drm2 so that libdrm can use mmap() to map things
like on other platforms (instead of the ioctl that we have used so far).
 1.191  05-Sep-2014  matt branches: 1.191.2;
Try not to use f_data, use f_{vnode,socket,pipe,mqueue,kqueue,ksem} to get
a correctly typed pointer.
 1.190  22-Jun-2014  maxv branches: 1.190.2;
Fix a NULL pointer dereference after a loooong discussion with dholland@,
hannken@, blymn@ and martin@.

This bug would panic the system when veriexec is set to the VERIEXEC_LOCKDOWN
mode (only settable from root).
 1.189  27-Feb-2014  hannken branches: 1.189.2;
The current implementation of vn_lock() is racy. Modification of
the vnode operations vector for active vnodes is unsafe because it
is not known whether deadfs or the original file system will be
called.

- Pass down LK_RETRY to the lock operation (hint for deadfs only).

- Change deadfs lock operation to return ENOENT if LK_RETRY is unset.

- Change all other lock operations to check for dead vnode once
the vnode is locked and unlock and return ENOENT in this case.

With these changes in place vnode lock operations will never succeed
after vclean() has marked the vnode as VI_XLOCK and before vclean()
has changed the operations vector.

Addresses PR kern/37706 (Forced unmount of file systems is unsafe)

Discussed on tech-kern.

Welcome to 6.99.33
 1.188  23-Jan-2014  hannken Change vnode operations create, mknod, mkdir and symlink to return
the resulting vnode *vpp unlocked.

Discussed on tech-kern@

Welcome to 6.99.30
 1.187  17-Jan-2014  hannken Change vnode operations create, mknod, mkdir and symlink to keep the
directory node dvp locked on return.

Discussed on tech-kern@

Welcome to 6.99.29
 1.186  12-Nov-2012  hannken branches: 1.186.2;
Bring back Manuel Bouyer's patch to resolve races between vget() and vrelel()
resulting in vget() returning dead vnodes.
It is impossible to resolve these races in vn_lock().

Needs pullup to NetBSD-6.
 1.185  24-Aug-2012  dholland branches: 1.185.2;
don't truncate size_t to int
 1.184  05-Apr-2012  hannken Fix vn_lock() to return an invalid (dead, clean) vnode
only if the caller requested it by setting LK_RETRY.

Should fix PR #46221: Kernel panic in NFS server code
 1.183  14-Oct-2011  hannken branches: 1.183.2; 1.183.6; 1.183.8;
Change the vnode locking protocol of VOP_GETATTR() to request at least
a shared lock. Make all calls outside of file systems respect it.

The calls from file systems need review.

No objections from tech-kern.
 1.182  16-Aug-2011  yamt vn_close: add an assertion
 1.181  12-Jun-2011  rmind Welcome to 5.99.53! Merge rmind-uvmplock branch:

- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.

- Simplify locking in some pmap(9) modules by removing P->V locking.

- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).

- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.

- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.

Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.
 1.180  19-Nov-2010  dholland branches: 1.180.6;
Introduce struct pathbuf. This is an abstraction to hold a pathname
and the metadata required to interpret it. Callers of namei must now
create a pathbuf and pass it to NDINIT (instead of a string and a
uio_seg), then destroy the pathbuf after the namei session is
complete.

Update all namei call sites accordingly. Add a pathbuf(9) man page and
update namei(9).

The pathbuf interface also now appears in a couple of related
additional places that were passing string/uio_seg pairs that were
later fed into NDINIT. Update other call sites accordingly.
 1.179  28-Oct-2010  pooka Zero entire stat structure before filling in contents to avoid
leaking kernel memory -- the elements are no longer packed now that
dev_t is 64bit.

from pgoyette
 1.178  21-Sep-2010  chs implement O_DIRECTORY as standardized in POSIX-2008,
for both native and linux emulations.
this fixes the rest of PR 43695.
 1.177  25-Aug-2010  pooka I'm not even going to describe this change. I'll just say that
churn creates interesting code.

Fixes open(O_CREAT|O_TRUNC) on at least tmpfs and nfs to not fail
with ENOENT due to a racy removal of the newly created file.

Caught, as most bugs these days are, by a test run.
 1.176  28-Jul-2010  hannken Modify vn_lock():
- Take v_interlock before examining v_iflag
- Must always be called without v_interlock taken,
LK_INTERLOCK flag is no longer allowed.
 1.175  13-Jul-2010  pooka Don't leak kernel stack into userspace.
 1.174  24-Jun-2010  hannken Clean up vnode lock operations pass 2:

VOP_UNLOCK(vp, flags) -> VOP_UNLOCK(vp): Remove the unneeded flags argument.

Welcome to 5.99.32.

Discussed on tech-kern.
 1.173  18-Jun-2010  hannken Remove the concept of recursive vnode locks by eliminating
vn_setrecurse(), vn_restorerecurse() and LK_CANRECURSE.
Welcome to 5.99.31

Discussed on tech-kern.
 1.172  06-Jun-2010  hannken Change layered file systems to always pass the locking VOP's down to the
leaf file system. Remove now unused member v_vnlock from struct vnode.
Welcome to 5.99.30

Discussed on tech-kern.
 1.171  23-Apr-2010  pooka Enforce RLIMIT_FSIZE before VOP_WRITE. This adds support to file
system drivers where it was missing from and fixes one buggy
implementation. The arguably weird semantics of the check are
maintained (v_size vs. va_bytes, overwrite).
 1.170  29-Mar-2010  pooka Stop exposing fifofs internals and leave only fifo_vnodeop_p visible.
 1.169  08-Jan-2010  pooka branches: 1.169.2; 1.169.4;
The VATTR_NULL/VREF/VHOLD/HOLDRELE() macros lost their will to live
years ago when the kernel was modified to not alter ABI based on
DIAGNOSTIC, and now just call the respective function interfaces
(in lowercase). Plenty of mix'n match upper/lowercase has crept
into the tree since then. Nuke the macros and convert all callsites
to lowercase.

no functional change
 1.168  20-Dec-2009  dsl If a multithreaded app closes an fd while another thread is blocked in
read/write/accept, then the expectation is that the blocked thread will
exit and the close complete.
Since only one fd is affected, but many fd can refer to the same file,
the close code can only request the fs code unblock with ERESTART.
Fixed for pipes and sockets, ERESTART will only be generated after such
a close - so there should be no change for other programs.
Also rename fo_abort() to fo_restart() (this used to be fo_drain()).
Fixes PR/26567
 1.167  09-Dec-2009  dsl Rename fo_drain() to fo_abort(), 'drain' is used to mean 'wait for output
to drain' in many places, whereas fo_drain() was called in order to force
blocking read()/write() etc calls to return to userspace so that a close()
call from a different thread can complete.
In the sockets code comment out the broken code in the inner function,
it was being called from compat code.
 1.166  17-May-2009  yamt remove FILE_LOCK and FILE_UNLOCK.
 1.165  11-Apr-2009  christos Fix locking as Andy explained. Also fill in uid and gid like sys_pipe did.
 1.164  04-Apr-2009  ad Add fileops::fo_drain(), to be called from fd_close() when there is more
than one active reference to a file descriptor. It should dislodge threads
sleeping while holding a reference to the descriptor. Implemented only for
sockets but should be extended to pipes, fifos, etc.

Fixes the case of a multithreaded process doing something like the
following, which would have hung until the process got a signal.

thr0 accept(fd, ...)
thr1 close(fd)
 1.163  11-Feb-2009  enami Make module (auto)loading under chroot environment actually work:
- NOCHROOT flag must be assigned to different bit from TRYEMULROOT
since the code expected to be executed is in the else clause of
if (flags & TRYEMULROOT).
- Necessary variables aren't set.
 1.162  17-Jan-2009  yamt branches: 1.162.2;
malloc -> kmem_alloc.
 1.161  12-Nov-2008  ad Remove LKMs and switch to the module framework, pass 1.

Proposed on tech-kern@.
 1.160  27-Aug-2008  christos branches: 1.160.2; 1.160.4;
Writing 0 bytes on an O_APPEND file should not affect the offset
 1.159  31-Jul-2008  simonb Merge the simonb-wapbl branch. From the original branch commit:

Add Wasabi System's WAPBL (Write Ahead Physical Block Logging)
journaling code. Originally written by Darrin B. Jewell while
at Wasabi and updated to -current by Antti Kantee, Andy Doran,
Greg Oster and Simon Burge.

OK'd by core@, releng@.
 1.158  02-Jun-2008  ad branches: 1.158.2; 1.158.4;
Don't needlessly acquire v_interlock.
 1.157  02-Jun-2008  ad vn_marktext, vn_lock: don't needlessly acquire v_interlock.
 1.156  24-Apr-2008  ad branches: 1.156.2; 1.156.4;
Network protocol interrupts can now block on locks, so merge the globals
proclist_mutex and proclist_lock into a single adaptive mutex (proc_lock).
Implications:

- Inspecting process state requires thread context, so signals can no longer
be sent from a hardware interrupt handler. Signal activity must be
deferred to a soft interrupt or kthread.

- As the proc state locking is simplified, it's now safe to take exit()
and wait() out from under kernel_lock.

- The system spends less time at IPL_SCHED, and there is less lock activity.
 1.155  21-Mar-2008  ad branches: 1.155.2;
Catch up with descriptor handling changes. See kern_descrip.c revision
1.173 for details.
 1.154  30-Jan-2008  ad branches: 1.154.6;
Replace struct lock on vnodes with a simpler lock object built on
krwlock_t. This is a step towards removing lockmgr and simplifying
vnode locking. Discussed on tech-kern.
 1.153  25-Jan-2008  ad vn_setrecurse: if no lock is exported, use v_lock. Works around issue
described in PR kern/37808. The ideal solution here is to kill vnode
lock recursion, which should not be hard once it is understood what
the two remaining callers of vn_setrecurse() are doing.
 1.152  25-Jan-2008  ad Remove VOP_LEASE. Discussed on tech-kern.
 1.151  25-Jan-2008  pooka vn_write: include f_advice in VOP_WRITE
 1.150  05-Jan-2008  dsl Use FILE_LOCK() and FILE_UNLOCK()
 1.149  02-Jan-2008  ad Merge vmlocking2 to head.
 1.148  08-Dec-2007  pooka branches: 1.148.4;
Remove cn_lwp from struct componentname. curlwp should be used
from now on. The NDINIT() macro no longer takes the lwp parameter and
associates the credentials of the calling thread with the namei
structure.
 1.147  02-Dec-2007  hannken branches: 1.147.2;
Fscow_run(): add a flag "bool data_valid" to note still valid data.
Buffers run through copy-on-write are marked B_COWDONE. This condition
is valid until the buffer has run through bwrite() and gets cleared from
biodone().

Welcome to 4.99.39.

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>
 1.146  30-Nov-2007  yamt - reduce the number of VOP_ACCESS calls for O_RDWR. for nfs, it reduces
the number of rpcs.
- reduce code duplication.
 1.145  29-Nov-2007  ad Use atomics to maintain uvmexp.{anon,exec,file}pages.
 1.144  26-Nov-2007  pooka Remove the "struct lwp *" argument from all VFS and VOP interfaces.
The general trend is to remove it from all kernel interfaces and
this is a start. In case the calling lwp is desired, curlwp should
be used.

quick consensus on tech-kern
 1.143  10-Oct-2007  ad branches: 1.143.4;
Merge from vmlocking:

- Split vnode::v_flag into three fields, depending on field locking.
- simple_lock -> kmutex in a few places.
- Fix some simple locking problems.
 1.142  08-Oct-2007  ad Merge file descriptor locking, cwdi locking and cross-call changes
from the vmlocking branch.
 1.141  07-Oct-2007  hannken Update the file system copy-on-write handler.

- Instead of hooking the handler on the specdev of a mounted file system
hook directly on the `struct mount'.

- Rename from `vn_cow_*' to `fscow_*' and move to `kern/vfs_trans.c'. Use
`mount_*specific' instead of clobbering `struct mount' or `struct specinfo'.

- Replace the hand-made reader/writer lock with a krwlock.

- Keep `vn_cow_*' functions and mark as obsolete.

- Welcome to NetBSD 4.99.32 - `struct specinfo' changed size.

Reviewed by: Jason Thorpe <thorpej@netbsd.org>
 1.140  22-Jul-2007  pooka branches: 1.140.4; 1.140.6; 1.140.8; 1.140.10;
Retire uvn_attach() - it abuses VXLOCK and its functionality,
setting vnode sizes, is handled elsewhere: file system vnode creation
or spec_open() for regular files or block special files, respectively.

Add a call to VOP_MMAP() to the pagedvn exec path, since the vnode
is being memory mapped.

reviewed by tech-kern & wrstuden
 1.139  19-May-2007  christos branches: 1.139.2;
- remove pathname_ interface.
- use macros to deal with pathnames in userspace, when veriexec is used.
- reorder the veriexec_ call arguments for consistency.
With help from elad@ finding the last bug.
 1.138  22-Apr-2007  dsl Change the way that emulations locate files within the emulation root to
avoid having to allocate space in the 'stackgap'
- which is very LWP unfriendly.
The additional code for non-emulation namei() is trivial, the reduction for
the emulations is massive.
The vnode for a process's emulation root is saved in the cwdi structure
during process exec.
If the emulation root and the TRYEMULROOT flag are set, namei() will do an initial
search for absolute pathnames in the emulation root, if that fails it will
retry from the normal root.
".." at the emulation root will always go to the real root, even in the middle
of paths and when expanding symlinks.
Absolute symlinks found using absolute paths in the emulation root will be
relative to the emulation root (so /usr/lib/xxx.so -> /lib/xxx.so links
inside the emulation root don't need changing).
If the root of the emulation would be returned (for an emulation lookup), then
the real root is returned instead (matching the behaviour of emul_lookup,
but being a cheap comparison here) so that programs that scan "../.."
looking for the root directory don't loop forever.
The target for symbolic links is no longer mangled (it used to get the
CHECK_ALT_xxx() treatment, so could get /emul/xxx prepended).
CHECK_ALT_xxx() are no more. Most of the change is deleting them, and adding
TRYEMULROOT to the flags to NDINIT().
A lot of the emulation system call stubs could now be deleted.
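The lookup order described in 1.138 can be illustrated with a small userland sketch. This is not the namei() code: the path table, `toy_exists()` and `toy_lookup()` are hypothetical names invented for illustration, and the sketch covers only the "try the emulation root first, then retry from the normal root" rule for absolute pathnames (no "..", symlink, or root-substitution handling).

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the file system: the set of paths that exist. */
static const char *const toy_fs[] = {
    "/emul/linux/lib/libc.so",
    "/lib/libc.so",
    "/etc/passwd",
};

static int toy_exists(const char *path)
{
    for (size_t i = 0; i < sizeof(toy_fs) / sizeof(toy_fs[0]); i++)
        if (strcmp(toy_fs[i], path) == 0)
            return 1;
    return 0;
}

/*
 * Resolve `path` into `out`. Absolute paths are first searched under
 * `emul_root` when `try_emulroot` is set; on failure the lookup is retried
 * from the normal root. Returns 0 on success, -1 if nothing was found or
 * the result does not fit in `out`.
 */
static int toy_lookup(const char *emul_root, int try_emulroot,
    const char *path, char *out, size_t outlen)
{
    if (try_emulroot && emul_root != NULL && path[0] == '/') {
        int n = snprintf(out, outlen, "%s%s", emul_root, path);
        if (n > 0 && (size_t)n < outlen && toy_exists(out))
            return 0;
    }
    if (strlen(path) >= outlen || !toy_exists(path))
        return -1;
    strcpy(out, path);
    return 0;
}
```

With this table, `/lib/libc.so` resolves inside the emulation root when the flag is set, while `/etc/passwd`, which exists only under the real root, falls back to the normal lookup.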
 1.137  08-Apr-2007  hannken Remove now obsolete vn_start_write() and vn_finished_write() and
corresponding flags.

Revert softdep_trackbufs() to its state before vn_start_write() was added.

Remove from struct mount now unneeded flags IMNT_SUSPEND* and
members mnt_writeopcountupper, mnt_writeopcountlower and mnt_leaf.

Welcome to 4.99.17
 1.136  03-Apr-2007  hannken Remove calls to now obsolete vn_start_write() and vn_finished_write().
 1.135  09-Mar-2007  ad branches: 1.135.2; 1.135.4;
- Make the proclist_lock a mutex. The write:read ratio is unfavourable,
and mutexes are cheaper to use than RW locks.
- LOCK_ASSERT -> KASSERT in some places.
- Hold proclist_lock/kernel_lock longer in a couple of places.
 1.134  04-Mar-2007  christos Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.
 1.133  16-Feb-2007  hannken branches: 1.133.2;
Make fstrans(9) the default helper for file system suspension.
Replaces the now obsolete vn_start_write()/vn_finished_write().
 1.132  09-Feb-2007  ad Merge newlock2 to head.
 1.131  19-Jan-2007  hannken New file system suspension API to replace vn_start_write and vn_finished_write.
The suspension helpers are now put into file system specific operations.
This means every file system not supporting these helpers cannot be suspended
and therefore snapshots are no longer possible.

Implemented for file systems of type ffs.

The new API is enabled on a kernel option NEWVNGATE. This option is
not enabled by default in any kernel config.

Presented and discussed on tech-kern with much input from
Bill Studenmund <wrstuden@netbsd.org> and YAMAMOTO Takashi <yamt@netbsd.org>.

Welcome to 4.99.9 (new vfs op vfs_suspendctl).
 1.130  30-Dec-2006  elad Avoid TOCTOU in Veriexec by introducing veriexec_openchk() to enforce
the policy and using a single namei() call in vn_open().
 1.129  30-Nov-2006  elad branches: 1.129.2;
Massive restructuring and cleanup of Veriexec, mainly in preparation
for work on some future functionality.

- Veriexec data-structures are no longer exposed.

- Thanks to using proplib for data passing now, the interface
changes further to accommodate that.

Introduce four new functions. First, veriexec_file_add(), to add
a new file to be monitored by Veriexec, to replace both
veriexec_load() and veriexec_hashadd(). veriexec_table_add(), to
replace veriexec_newtable(), will be used to optimize hash table
size (during preload), and finally, veriexec_convert(), to convert
an internal entry to one userland can read.

- Introduce veriexec_unmountchk(), to enforce Veriexec unmount
policy. This cleans up a bit of code in kern/vfs_syscalls.c.

- Rename veriexec_tblfind() to veriexec_table_lookup(), and make
it static. More functions that became static: veriexec_fp_cmp(),
veriexec_fp_calc().

- veriexec_verify() no longer returns the entry as well, but just
sets a boolean indicating whether an entry was found or not.

- veriexec_purge() now takes a struct vnode *.

- veriexec_add_fp_name() was merged into veriexec_add_fp_ops(), that
changed its name to veriexec_fpops_add(). veriexec_find_ops() was
also renamed to veriexec_fpops_lookup().

Also on the fp-ops front, the three function types used to initialize,
update, and finalize a hash context were renamed to
veriexec_fpop_init_t, veriexec_fpop_update_t, and veriexec_fpop_final_t
respectively.

- Introduce a new malloc(9) type, M_VERIEXEC, and use it instead of
M_TEMP, so we can tell exactly how much memory is used by Veriexec.

- And, most importantly, whitespace and indentation nits.

Built successfully for amd64, i386, sparc, and sparc64. Tested on amd64.
 1.128  01-Nov-2006  elad printf() -> log().
 1.127  28-Oct-2006  elad Adapt to changes suggested by yamt@ to get rid of __UNCONST() stuff.

While here, don't leak pathbuf on success.
 1.126  27-Oct-2006  elad Don't allocate MAXPATHLEN on the stack.

Prompted by and initial diff okay yamt@
 1.125  05-Oct-2006  chs add support for O_DIRECT (I/O directly to application memory,
bypassing any kernel caching for file data).
 1.124  12-Sep-2006  elad branches: 1.124.2;
Fix typo.
 1.123  10-Sep-2006  blymn Prevent a veriexec file from being truncated.
 1.122  26-Jul-2006  dogcow branches: 1.122.4;
at the request of elad, as veriexec.h has returned, revert the changes
from 2006-07-25.
 1.121  25-Jul-2006  dogcow mechanically go through and
s,include "veriexec.h",include <sys/verified_exec.h>,
as the former has apparently gone away.
 1.120  24-Jul-2006  elad replace magic numbers for strict levels (0-3) with defines.
 1.119  24-Jul-2006  elad finally do things properly. veriexec_report() takes flags, not three ints.
 1.118  24-Jul-2006  elad some fixes:
- adapt to NVERIEXEC in init_sysctl.c.
- we now need "veriexec.h" for NVERIEXEC.
- "opt_verified_exec.h" -> "opt_veriexec.h", and include it only where
it is needed.
 1.117  23-Jul-2006  ad Use the LWP cached credentials where sane.
 1.116  22-Jul-2006  elad kill a VOP_GETATTR() we don't need for veriexec.
 1.115  22-Jul-2006  elad deprecate the VERIFIED_EXEC option; now we only need the pseudo-device to
enable it. while here, some config file tweaks.

tons of input from cube@ (thanks!) and okay blymn@.
 1.114  16-Jul-2006  elad oops, forgot to commit that one. thanks Arnaud Lacombe.
 1.113  14-Jul-2006  elad okay, since there was no way to divide this to two commits, here it goes..

introduce fileassoc(9), a kernel interface for associating meta-data with
files using in-kernel memory. this is very similar to what we had in
veriexec till now, only abstracted so it can be used more easily by more
consumers.

this also prompted the redesign of the interface, making it work on vnodes
and mounts and not directly on devices and inodes. internally, we still
use file-id but that's gonna change soon... the interface will remain
consistent.

as a result, veriexec went under some heavy changes to conform to the new
interface. since we no longer use device numbers to identify file-systems,
the veriexec sysctl stuff changed too: kern.veriexec.count.dev_N is now
kern.veriexec.tableN.* where 'N' is NOT the device number but rather a
way to distinguish several mounts.

also worth noting is the plugging of unmount/delete operations
wrt/fileassoc and veriexec.

tons of input from yamt@, wrstuden@, martin@, and christos@.
 1.112  27-May-2006  simonb Limit the size of any kernel buffers allocated by the VOP_READDIR
routines to MAXBSIZE.
 1.111  14-May-2006  elad branches: 1.111.2;
integrate kauth.
 1.110  14-May-2006  christos XXX: GCC uninitialized.
 1.109  04-May-2006  perseant Change VOP_FCNTL to take an unlocked vnode. Approved by wrstuden@.
 1.108  24-Mar-2006  hannken vn_rdwr(): Initialize `mp' to NULL. vn_finished_write() would be called
with uninitialized `mp' if `vp->v_type == VCHR'.

From Coverity CID 2475.
 1.107  10-Mar-2006  yamt branches: 1.107.2;
remove a wrong assertion.
 1.106  01-Mar-2006  yamt branches: 1.106.2; 1.106.4;
merge yamt-uio_vmspace branch.

- use vmspace rather than proc or lwp where appropriate.
the latter is more natural to specify an address space.
(and less likely to be abused for random purposes.)
- fix a swdmover race.
 1.105  04-Feb-2006  yamt vn_read: don't bother to allocate read-ahead context here.
it will be done in uvn_get if necessary.
 1.104  01-Jan-2006  yamt branches: 1.104.2; 1.104.4;
vn_lock: LK_CANRECURSE is used by layered filesystems. pointed out by cube@.
 1.103  31-Dec-2005  yamt vn_lock: assert that only a limited set of LK_* flags is used.
 1.102  12-Dec-2005  elad branches: 1.102.2;
Catch up with ktrace-lwp merge.

While I'm here, stop using cur{lwp,proc}.
 1.101  11-Dec-2005  christos merge ktrace-lwp.
 1.100  29-Nov-2005  yamt merge yamt-readahead branch.
 1.99  08-Nov-2005  hannken branches: 1.99.2;
vput() -> vrele(). Vnode is already unlocked.
With much help from Pavel Cahyna.

Fixes PR 32005.
 1.98  15-Oct-2005  elad copystr and copyinstr return int, not void.
 1.97  14-Oct-2005  christos No need for __UNCONST in previous commit; factor out the function call.
 1.96  14-Oct-2005  elad Copy the path to a kernel buffer before using it from ndp, as it may be a
pointer to userspace.
 1.95  20-Sep-2005  yamt uninline vn_start_write and vn_finished_write as they are big enough.
 1.94  23-Jul-2005  erh Fix a null vp panic when creating a file at veriexec strict level 3.
 1.93  16-Jul-2005  christos defopt verified_exec.
 1.92  19-Jun-2005  elad branches: 1.92.2;
- Avoid pollution of struct vnode. Save the fingerprint evaluation status
in the veriexec table entry; the lookups are very cheap now. Suggested
by Chuq.

- Handle non-regular (!VREG) files correctly.

- Remove (no longer needed) FINGERPRINT_NOENTRY.
 1.91  17-Jun-2005  elad More veriexec changes:

- Better organize strict level. Now we have 4 levels:
- Level 0, learning mode: Warnings only about anything that might've
resulted in 'access denied' or similar in a higher strict level.

- Level 1, IDS mode:
- Deny access on fingerprint mismatch.
- Deny modification of veriexec tables.

- Level 2, IPS mode:
- All implications of strict level 1.
- Deny write access to monitored files.
- Prevent removal of monitored files.
- Enforce access type - 'direct', 'indirect', or 'file'.

- Level 3, lockdown mode:
- All implications of strict level 2.
- Prevent creation of new files.
- Deny access to non-monitored files.

- Update sysctl(3) man-page with above. (date bumped too :)

- Remove FINGERPRINT_INDIRECT from possible fp_status values; it's no
longer needed.

- Simplify veriexec_removechk() in light of new strict level policies.

- Eliminate use of 'securelevel'; veriexec now behaves according to
its strict level only.
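The cumulative strict levels listed in 1.91 can be sketched as a single policy check. This is an illustration under stated assumptions, not the Veriexec implementation: every identifier below (`sketch_level`, `sketch_op`, `sketch_file`, `sketch_allowed`) is hypothetical, and the sketch leaves out table modification and the 'direct'/'indirect'/'file' access types.

```c
/* Hypothetical names; the values mirror strict levels 0-3 in the log. */
enum sketch_level {
    SKETCH_LEARNING = 0,   /* warnings only */
    SKETCH_IDS      = 1,   /* deny access on fingerprint mismatch */
    SKETCH_IPS      = 2,   /* also protect monitored files from change */
    SKETCH_LOCKDOWN = 3    /* also deny creation and non-monitored access */
};

enum sketch_op {
    SKETCH_OP_ACCESS,      /* open for reading or execution */
    SKETCH_OP_WRITE,
    SKETCH_OP_REMOVE,
    SKETCH_OP_CREATE
};

struct sketch_file {
    int monitored;         /* file has a fingerprint entry */
    int fp_matches;        /* fingerprint evaluated and matched */
};

/* Returns 1 if the operation is allowed at `level`, 0 if it is denied. */
static int sketch_allowed(enum sketch_level level, enum sketch_op op,
    const struct sketch_file *f)
{
    /* Level 3: no new files, no access to non-monitored files. */
    if (level >= SKETCH_LOCKDOWN) {
        if (op == SKETCH_OP_CREATE)
            return 0;
        if (!f->monitored)
            return 0;
    }
    /* Level 2: monitored files may be neither written nor removed. */
    if (level >= SKETCH_IPS && f->monitored &&
        (op == SKETCH_OP_WRITE || op == SKETCH_OP_REMOVE))
        return 0;
    /* Level 1: a fingerprint mismatch denies access. */
    if (level >= SKETCH_IDS && f->monitored && !f->fp_matches &&
        op == SKETCH_OP_ACCESS)
        return 0;
    /* Level 0 only warns, so nothing is denied here. */
    return 1;
}
```

Checking the highest level first and falling through keeps the "all implications of strict level N-1" wording from the log visible in the control flow.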
 1.90  11-Jun-2005  elad Work according to veriexec strict level, not securelevel. Also, use the
veriexec_report() routine when possible; and when opening a file for writing,
only invalidate the fingerprint, since the data will not always be changed.
 1.89  05-Jun-2005  thorpej Use ANSI function decls.
 1.88  29-May-2005  christos - add const.
- remove unnecessary casts.
- add __UNCONST casts and mark them with XXXUNCONST as necessary.
 1.87  20-Apr-2005  blymn Rototill of the verified exec functionality.
* We now use hash tables instead of a list to store the in kernel
fingerprints.
* Fingerprint methods handling has been made more flexible, it is now
even simpler to add new methods.
* the loader no longer passes in magic numbers representing the
fingerprint method so veriexecctl is no longer kernel specific.
* fingerprint methods can be tailored out using options in the kernel
config file.
* more fingerprint methods added - rmd160, sha256/384/512
* veriexecctl can now report the fingerprint methods supported by the
running kernel.
* regularised the naming of some portions of veriexec.
 1.86  26-Feb-2005  perry branches: 1.86.2;
nuke trailing whitespace
 1.85  02-Jan-2005  thorpej branches: 1.85.2; 1.85.4;
Add the system call and VFS infrastructure for file system extended
attributes.

From FreeBSD.
 1.84  12-Dec-2004  yamt vn_lock: #if 0 out an assertion for now. (until PR/27021 is fixed)
 1.83  30-Nov-2004  christos Cloning cleanup:
1. make fileops const
2. add 2 new negative errno's to `officially' support the cloning hack:
- EDUPFD (used to overload ENODEV)
- EMOVEFD (used to overload ENXIO)
3. Created an fdclone() function to encapsulate the operations needed for
EMOVEFD, and made all cloners use it.
4. Centralize the local noop/badop fileops functions to:
fnullop_fcntl, fnullop_poll, fnullop_kqfilter, fbadop_stat
 1.82  06-Nov-2004  christos Fix another stupid typo.
 1.81  06-Nov-2004  wrstuden Add support for FIONWRITE and FIONSPACE ioctls. FIONWRITE reports
the number of bytes in the send queue, and FIONSPACE reports the
number of free bytes in the send queue. These ioctls permit applications
to monitor file descriptor transmission dynamics.

In examining prior art, FIONWRITE exists with the semantics given
here. FIONSPACE is provided so that programs may easily determine how
much space is left in the send queue; they do not need to know the
send queue size.

The fact that a write may block even if there is enough space in the
send queue for it is noted in the documentation.

FIONWRITE functionality may be used to implement TIOCOUTQ for Linux
emulation - Linux extended this ioctl to sockets, even though they are
not ttys.
 1.80  31-May-2004  yamt vn_lock: add an assertion about usecount.
 1.79  30-May-2004  yamt vn_lock: don't pass LK_RETRY to VOP_LOCK.
 1.78  25-May-2004  hannken Add ffs internal snapshots. Written by Marshall Kirk McKusick for FreeBSD.

- Not enabled by default. Needs kernel option FFS_SNAPSHOT.
- Change parameters of ffs_blkfree.
- Let the copy-on-write functions return an error so spec_strategy
may fail if the copy-on-write fails.
- Change genfs_*lock*() to use vp->v_vnlock instead of &vp->v_lock.
- Add flag B_METAONLY to VOP_BALLOC to return indirect block buffer.
- Add a function ffs_checkfreefile needed for snapshot creation.
- Add special handling of snapshot files:
Snapshots may not be opened for writing and the attributes are read-only.
Use the mtime as the time this snapshot was taken.
Deny mtime updates for snapshot files.
- Add function transferlockers to transfer any waiting processes from
one lock to another.
- Add vfsop VFS_SNAPSHOT to take a snapshot and make it accessible through
a vnode.
- Add snapshot support to ls, fsck_ffs and dump.

Welcome to 2.0F.

Approved by: Jason R. Thorpe <thorpej@netbsd.org>
 1.77  14-Feb-2004  hannken branches: 1.77.2; 1.77.4; 1.77.6;
Add a generic copy-on-write hook to add/remove functions that will be
called with every buffer written through spec_strategy().

Used by fss(4). Future file-system-internal snapshots will need them too.

Welcome to 1.6ZK

Approved by: Jason R. Thorpe <thorpej@netbsd.org>
 1.76  10-Jan-2004  hannken Allow vfs_write_suspend() to wait if the file system is already
suspending.

Move vfs_write_suspend() and vfs_write_resume() from kern/vfs_vnops.c
to kern/vfs_subr.c.

Change vnode write gating in ufs/ffs/ffs_softdep.c (from FreeBSD).

When vnodes are throttled in softdep_trackbufs() check for
file system suspension every 10 msecs to avoid a deadlock.
 1.75  15-Oct-2003  hannken Add the gating of system calls that cause modifications to the underlying
file system.
The function vfs_write_suspend stops all new write operations to a file
system, allows any file system modifying system calls already in progress
to complete, then sync's the file system to disk and returns. The
function vfs_write_resume allows the suspended write operations to
complete.

From FreeBSD with slight modifications.

Approved by: Frank van der Linden <fvdl@netbsd.org>
 1.74  29-Sep-2003  cb fix O_NOFOLLOW for non-O_CREAT case.

Reviewed by: christos@ (some time ago)
 1.73  07-Aug-2003  agc Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22364, verified by myself.
 1.72  29-Jun-2003  fvdl branches: 1.72.2;
Back out the lwp/ktrace changes. They contained a lot of collateral damage,
and need to be examined and discussed more.
 1.71  29-Jun-2003  thorpej Adjust to ktrace/lwp changes.
 1.70  28-Jun-2003  darrenr Pass lwp pointers throughout the kernel, as required, so that the lwpid can
be inserted into ktrace records. The general change has been to replace
"struct proc *" with "struct lwp *" in various function prototypes, pass
the lwp through and use l_proc to get the process pointer when needed.

Bump the kernel rev up to 1.6V
 1.69  03-Apr-2003  fvdl Copy birthtime in vn_stat.
 1.68  21-Mar-2003  dsl Use 'void *' instead of 'caddr_t' in prototypes of VOP_IOCTL, VOP_FCNTL
and VOP_ADVLOCK, delete casts from callers (and some to copyin/out).
 1.67  21-Mar-2003  dsl Change 'data' argument to fo_ioctl and fo_fcntl from 'caddr_t' to 'void *'.
Avoids a lot of casting and removes the need for some line breaks.
Removed a load of (caddr_t) casts from calls to copyin/copyout as well.
(approved by christos - he has a plan to remove caddr_t...)
 1.66  17-Mar-2003  jdolecek make it possible for UNION fs to be loaded via LKM - instead of
having some #ifdef UNION code in vfs_vnops.c, introduce variable
'vn_union_readdir_hook' which is set to address of appropriate
vn_readdir() hook by union filesystem when it's loaded & mounted
 1.65  16-Mar-2003  jdolecek move union filesystem code from sys/miscfs/union to sys/fs/union
 1.64  03-Mar-2003  jdolecek only pull in/declare veriexec related stuff with VERIFIED_EXEC
 1.63  24-Feb-2003  perseant Allow filesystems' VOP_IOCTL to catch ioctl calls on directories and
regular files. Approved thorpej, fvdl.
 1.62  01-Feb-2003  atatat Check for (and deny) negative values passed to FIOGETBMAP.
 1.61  24-Jan-2003  fvdl Bump daddr_t to 64 bits. Replace it with int32_t in all places where
it was used on-disk, so that on-disk formats remain the same.
Remove ufs_daddr_t and ufs_lbn_t for the time being.
 1.60  11-Dec-2002  atatat Provide a ioctl called FIOGETBMAP (there are some who call
it...FIBMAP) that translates a logical block number to a physical
block number from the underlying device. Via VOP_BMAP().
 1.59  06-Dec-2002  christos s/NOSYMLINK/O_NOFOLLOW/
 1.58  29-Oct-2002  blymn Added support for fingerprinted executables aka verified exec
 1.57  23-Oct-2002  jdolecek merge kqueue branch into -current

kqueue provides a stateful and efficient event notification framework.
currently supported events include socket, file, directory, fifo,
pipe, tty and device changes, and monitoring of processes and signals

kqueue is supported by all writable filesystems in NetBSD tree
(with exception of Coda) and all device drivers supporting poll(2)

based on work done by Jonathan Lemon for FreeBSD
initial NetBSD port done by Luke Mewburn and Jason Thorpe
 1.56  14-Oct-2002  gmcgarry vn_stat() can now take a struct vnode * for consistency. Hide away
the opaque file descriptor operations.
 1.55  05-Oct-2002  chs count executable image pages as executable for vm-usage purposes.
also, always do the VTEXT vs. v_writecount mutual exclusion
(which we previously skipped if the text or data segment was empty).
 1.54  17-Mar-2002  atatat branches: 1.54.6;
Convert ioctl code to use EPASSTHROUGH instead of -1 or ENOTTY for
indicating an unhandled "command". ERESTART is -1, which can lead to
confusion. ERESTART has been moved to -3 and EPASSTHROUGH has been
placed at -4. No ioctl code should now return -1 anywhere. The
ioctl() system call is now properly restartable.
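The convention from 1.54 can be sketched with a dispatcher that walks a chain of handlers. The `SK_`-prefixed constants and all function names are hypothetical and exist only in this example; the values -3 and -4 come from the log entry, while the final translation of an unhandled command to ENOTTY is an assumption about how a top-level caller might report it.

```c
#include <errno.h>
#include <stddef.h>

/* Sketch-only copies of the in-kernel sentinel values named in the log. */
#define SK_ERESTART      (-3)
#define SK_EPASSTHROUGH  (-4)

typedef int (*sk_ioctl_fn)(unsigned long cmd, int *data);

/* First layer: handles command 1 only, passes everything else through. */
static int sk_layer_a(unsigned long cmd, int *data)
{
    if (cmd == 1) {
        *data = 10;
        return 0;
    }
    return SK_EPASSTHROUGH;
}

/* Second layer: handles command 2, rejects command 3 with a real error. */
static int sk_layer_b(unsigned long cmd, int *data)
{
    switch (cmd) {
    case 2:
        *data = 20;
        return 0;
    case 3:
        return EINVAL;
    default:
        return SK_EPASSTHROUGH;
    }
}

/*
 * Try each layer in turn. Any result other than SK_EPASSTHROUGH, whether
 * success or a genuine error, ends the search. Only when every layer
 * declines does the caller see ENOTTY.
 */
static int sk_ioctl(unsigned long cmd, int *data)
{
    static const sk_ioctl_fn layers[] = { sk_layer_a, sk_layer_b };

    for (size_t i = 0; i < sizeof(layers) / sizeof(layers[0]); i++) {
        int error = layers[i](cmd, data);
        if (error != SK_EPASSTHROUGH)
            return error;
    }
    return ENOTTY;
}
```

Because "unhandled" has its own value, it can no longer be confused with a restart request, which is the ambiguity the log entry describes for -1.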
 1.53  09-Dec-2001  chs replace "vnode" and "vtext" with "file" and "exec" in uvmexp field names.
 1.52  12-Nov-2001  lukem add RCSIDs
 1.51  30-Oct-2001  thorpej - Add a new vnode flag VEXECMAP, which indicates that a vnode has
executable mappings. Stop overloading VTEXT for this purpose (VTEXT
also has another meaning).
- Rename vn_marktext() to vn_markexec(), and use it when executable
mappings of a vnode are established.
- In places where we want to set VTEXT, set it in v_flag directly, rather
than making a function call to do this (it no longer makes sense to
use a function call, since we no longer overload VTEXT with VEXECMAP's
meaning).

VEXECMAP suggested by Chuq Silvers.
 1.50  21-Sep-2001  chs branches: 1.50.2;
use shared locks instead of exclusive for VOP_READ() and VOP_READDIR().
 1.49  15-Sep-2001  chs a whole bunch of changes to improve performance and robustness under load:

- remove special treatment of pager_map mappings in pmaps. this is
required now, since I've removed the globals that expose the address range.
pager_map now uses pmap_kenter_pa() instead of pmap_enter(), so there's
no longer any need to special-case it.
- eliminate struct uvm_vnode by moving its fields into struct vnode.
- rewrite the pageout path. the pager is now responsible for handling the
high-level requests instead of only getting control after a bunch of work
has already been done on its behalf. this will allow us to UBCify LFS,
which needs tighter control over its pages than other filesystems do.
writing a page to disk no longer requires making it read-only, which
allows us to write wired pages without causing all kinds of havoc.
- use a new PG_PAGEOUT flag to indicate that a page should be freed
on behalf of the pagedaemon when it's unlocked. this flag is very similar
to PG_RELEASED, but unlike PG_RELEASED, PG_PAGEOUT can be cleared if the
pageout fails due to eg. an indirect-block buffer being locked.
this allows us to remove the "version" field from struct vm_page,
and together with shrinking "loan_count" from 32 bits to 16,
struct vm_page is now 4 bytes smaller.
- no longer use PG_RELEASED for swap-backed pages. if the page is busy
because it's being paged out, we can't release the swap slot to be
reallocated until that write is complete, but unlike with vnodes we
don't keep a count of in-progress writes so there's no good way to
know when the write is done. instead, when we need to free a busy
swap-backed page, just sleep until we can get it busy ourselves.
- implement a fast-path for extending writes which allows us to avoid
zeroing new pages. this substantially reduces cpu usage.
- encapsulate the data used by the genfs code in a struct genfs_node,
which must be the first element of the filesystem-specific vnode data
for filesystems which use genfs_{get,put}pages().
- eliminate many of the UVM pagerops, since they aren't needed anymore
now that the pager "put" operation is a higher-level operation.
- enhance the genfs code to allow NFS to use the genfs_{get,put}pages
instead of a modified copy.
- clean up struct vnode by removing all the fields that used to be used by
the vfs_cluster.c code (which we don't use anymore with UBC).
- remove kmem_object and mb_object since they were useless.
instead of allocating pages to these objects, we now just allocate
pages with no object. such pages are mapped in the kernel until they
are freed, so we can use the mapping to find the page to free it.
this allows us to remove splvm() protection in several places.

The sum of all these changes improves write throughput on my
decstation 5000/200 to within 1% of the rate of NetBSD 1.5
and reduces the elapsed time for "make release" of a NetBSD 1.5
source tree on my 128MB pc to 10% less than a 1.5 kernel took.
 1.48  09-Apr-2001  jdolecek branches: 1.48.2; 1.48.4;
Change the first arg to fileops fo_stat routine to struct file *, adjust
callers and appropriate routines to cope. This makes fo_stat more
consistent with rest of fileops routines and also makes the fo_stat
match FreeBSD as an added bonus.
Discussed with Luke Mewburn on tech-kern@.
 1.47  07-Apr-2001  jdolecek Add new 'stat' fileop and call the stat function via f_ops rather
than directly.
For compat syscalls, also add necessary FILE_USE()/FILE_UNUSE().
Now that soo_stat() gets a proc arg, pass it on to usrreq function.
 1.46  09-Mar-2001  chs add UBC memory-usage balancing. we track the number of pages in use for
each of the basic types (anonymous data, executable image, cached files)
and prevent the pagedaemon from reusing a given page if that would reduce
the count of that type of page below a sysctl-settable minimum threshold.
the thresholds are controlled via three new sysctl tunables:
vm.anonmin, vm.vnodemin, and vm.vtextmin. these tunables are the
percentages of pageable memory reserved for each usage, and we do not allow
the sum of the minimums to be more than 95% so that there's always some
memory that can be reused.
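The two rules in 1.46 (the minimums may not sum to more than 95%, and a page may not be reused if that would drop its type below the minimum) amount to simple integer arithmetic. The sketch below is illustrative only: the function names and the struct are hypothetical, and the actual pagedaemon accounting is not shown in this log.

```c
#include <errno.h>

/* Hypothetical container for the three percentage tunables. */
struct sk_mins {
    unsigned anonmin;   /* anonymous data */
    unsigned vnodemin;  /* cached file data */
    unsigned vtextmin;  /* executable images */
};

/* Accept new minimums only if each is a percentage and the sum is <= 95. */
static int sk_set_mins(struct sk_mins *m, unsigned anon, unsigned vnode,
    unsigned vtext)
{
    if (anon > 100 || vnode > 100 || vtext > 100)
        return EINVAL;
    if (anon + vnode + vtext > 95)
        return EINVAL;
    m->anonmin = anon;
    m->vnodemin = vnode;
    m->vtextmin = vtext;
    return 0;
}

/*
 * May one page of a given type be reused? `count` is the number of pages
 * of that type in use, `total` the amount of pageable memory in pages,
 * and `min_pct` the minimum for the type. Reuse is refused when it would
 * take the count below total * min_pct / 100.
 */
static int sk_may_reuse(unsigned long count, unsigned long total,
    unsigned min_pct)
{
    if (count == 0)
        return 0;
    return (count - 1) * 100 >= total * min_pct;
}
```

Multiplying both sides by 100 avoids the rounding that an integer division of the threshold would introduce.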
 1.45  27-Nov-2000  chs branches: 1.45.2;
Initial integration of the Unified Buffer Cache project.
 1.44  12-Aug-2000  sommerfeld Use ltsleep(...,PNORELOCK..) instead of simple_unlock()/tsleep()
 1.43  27-Jun-2000  mrg remove include of <vm/vm.h>
 1.42  11-Apr-2000  chs add a new function vn_marktext() for exec code to let others know
that the vnode is now being used as process text.
 1.41  30-Mar-2000  augustss Get rid of register declarations.
 1.40  30-Mar-2000  simonb Delete redundant decl of union_vnodeop_p, it's in <miscfs/union/union.h>.
 1.39  14-Feb-2000  fvdl Fixes to the softdep code from Ethan Solomita <ethan@geocast.com>.
* Fix buffer ordering when it has dependencies.
* Alleviate memory problems.
* Deal with some recursive vnode locks (sigh).
* Fix other bugs.
 1.38  31-Aug-1999  bouyer branches: 1.38.2;
Add a new flag, used by vn_open(), which prevents symlinks from being followed
at open time. Use this to prevent coredump from following symlinks when the
kernel opens/creates the file.
 1.37  03-Aug-1999  wrstuden Add support for fcntl(2) to generate VOP_FCNTL calls. Any fcntl
call with F_FSCTL set and F_SETFL calls generate calls to a new
fileop fo_fcntl. Add genfs_fcntl() and soo_fcntl() which return 0
for F_SETFL and EOPNOTSUPP otherwise. Have all leaf filesystems
use genfs_fcntl().

Reviewed by: thorpej
Tested by: wrstuden
 1.36  31-Mar-1999  mycroft branches: 1.36.2; 1.36.4;
Previous change to vn_lock() was bogus. If we got EDEADLK, it was from
lockmgr(), and it already unlocked v_interlock. So, just return in this case.
 1.35  30-Mar-1999  wrstuden The mode for a node is a mode_t in both struct stat and struct vattr -
don't use a u_short for intermediate storage in vn_stat.
 1.34  25-Mar-1999  sommerfe Prevent deadlock cited in PR4629 from crashing the system. (copyout
and system call now just return EFAULT). A complete fix will
presumably have to wait for UBC and/or for vnode locking protocols to
be revamped to allow use of shared locks.
 1.33  24-Mar-1999  mrg completely remove Mach VM support. all that is left is the
header files as UVM still uses (most of) these.
 1.32  26-Feb-1999  wrstuden Modify VOP_CLOSE vnode op to always take a locked vnode. Change vn_close
to pass down a locked node. Modify union_copyup() to call VOP_CLOSE
with locked nodes.

Also fix a bug in union_copyup() where a lock on the lower vnode would
only be released if VOP_OPEN didn't fail.
 1.31  02-Aug-1998  kleink branches: 1.31.2;
Implement support for IEEE Std 1003.1b-1993 synchronized I/O:
* if synchronized I/O file integrity completion of read operations was
requested, set IO_SYNC in the ioflag passed to the read vnode operator.
* if synchronized I/O data integrity completion of write operations was
requested, set IO_DSYNC in the ioflag passed to the write vnode operator.
 1.30  28-Jul-1998  thorpej Change the "aresid" argument of vn_rdwr() from an int * to a size_t *,
to match the new uio_resid type.
 1.29  30-Jun-1998  thorpej Add two additional arguments to the fileops read and write calls, a
pointer to the offset to use, and a flags word. Define a flag that
specifies whether or not to update the offset passed by reference.
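The calling convention introduced in 1.29 can be illustrated with a userland sketch. The names `sk_file`, `sk_read` and `SK_FOF_UPDATE_OFFSET` are hypothetical stand-ins; the log states only that the read and write operations gained an offset pointer and a flags word, with one flag controlling whether the referenced offset is updated.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical flag: advance the offset passed by reference. */
#define SK_FOF_UPDATE_OFFSET 0x01

/* Toy "open file": a byte buffer plus the shared file offset. */
struct sk_file {
    const char *data;
    size_t      size;
    long long   f_offset;
};

/*
 * Read up to `len` bytes starting at `*offset`. The offset is advanced
 * only when SK_FOF_UPDATE_OFFSET is set in `flags`. Returns the number
 * of bytes copied into `buf`.
 */
static size_t sk_read(struct sk_file *fp, long long *offset, char *buf,
    size_t len, int flags)
{
    if (*offset < 0 || (size_t)*offset >= fp->size)
        return 0;

    size_t avail = fp->size - (size_t)*offset;
    if (len > avail)
        len = avail;
    memcpy(buf, fp->data + *offset, len);

    if (flags & SK_FOF_UPDATE_OFFSET)
        *offset += (long long)len;
    return len;
}
```

A read(2)-style caller would pass `&fp->f_offset` together with the flag, whereas a positional read would pass the address of a local offset and no flag, leaving the file's own offset untouched.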
 1.28  01-Mar-1998  fvdl Merge with Lite2 + local changes
 1.27  19-Feb-1998  thorpej Include the UNION option header.
 1.26  10-Feb-1998  mrg - add defopt's for UVM, UVMHIST and PMAP_NEW.
- remove unnecessary UVMHIST_DECL's.
 1.25  05-Feb-1998  mrg initial import of the new virtual memory system, UVM, into -current.

UVM was written by chuck cranor <chuck@maria.wustl.edu>, with some
minor portions derived from the old Mach code. i provided some help
getting swap and paging working, and other bug fixes/ideas. chuck
silvers <chuq@chuq.com> also provided some other fixes.

this is the rest of the MI portion changes.

this will be KNF'd shortly. :-)
 1.24  14-Jan-1998  thorpej Grab a fix from 4.4BSD-Lite2: open(2) with O_FSYNC and MNT_SYNCHRONOUS
had no effect. Fix: check for either of these flags in vn_write(),
and pass IO_SYNC down if they're set.
 1.23  10-Oct-1997  fvdl branches: 1.23.2;
Add vn_readdir function for use in both the old getdirentries and
the new getdents(). Add getdents().
 1.22  24-Mar-1997  mycroft branches: 1.22.4;
Do not return generation counts to the user.
 1.21  07-Sep-1996  mycroft Implement poll(2).
 1.20  04-Feb-1996  christos First pass at prototyping
 1.19  23-May-1995  mycroft Remove gratuitous extra indirections.
 1.18  14-Dec-1994  mycroft Remove extra arg to vn_open().
 1.17  13-Dec-1994  mycroft LEASE_CHECK -> VOP_LEASE
 1.16  14-Nov-1994  christos added extra argument in vn_open and VOP_OPEN to allow cloning devices
 1.15  30-Oct-1994  cgd be more careful with types, also pull in headers where necessary.
 1.14  18-Sep-1994  mycroft Fix space change in last commit.
 1.13  14-Sep-1994  cgd from Kirk McKusick: release old ctty if acquiring a new one.
also: prettiness police!
 1.12  29-Jun-1994  cgd branches: 1.12.2;
New RCS ID's, take two. they're more aesthetically pleasant, and use 'NetBSD'
 1.11  08-Jun-1994  mycroft Update to 4.4-Lite fs code.
 1.10  17-May-1994  cgd copyright foo
 1.9  25-Apr-1994  cgd some prototype cleanup, eliminate/replace bogus types (e.g. quad and
u_quad) -> use better types (e.g. quad_t & u_quad_t in inodes),
some cleanup.
 1.8  12-Apr-1994  chopps FIONREAD returns int not off_t. (ssize_t preferred, but standards may
dictate otherwise)
 1.7  21-Dec-1993  cgd kill two wrong 'case's
 1.6  18-Dec-1993  mycroft Canonicalize all #includes.
 1.5  07-Sep-1993  ws branches: 1.5.2;
Changes to VFS readdir semantics
NFS changes for better cookie support
ISOFS changes for better Rockridge support and support for generation numbers
 1.4  24-Aug-1993  pk Support added for proc filesystem.
 1.3  22-May-1993  cgd add include of select.h if necessary for protos, or delete if extraneous
 1.2  18-May-1993  cgd make kernel select interface be one-stop shopping & clean it all up.
 1.1  21-Mar-1993  cgd branches: 1.1.1;
Initial revision
 1.1.1.3  01-Mar-1998  fvdl Import 4.4BSD-Lite2
 1.1.1.2  01-Mar-1998  fvdl Import 4.4BSD-Lite for reference
 1.1.1.1  21-Mar-1993  cgd initial import of 386bsd-0.1 sources
 1.5.2.1  14-Nov-1993  mycroft Canonicalize all #includes.
 1.12.2.1  14-Sep-1994  cgd from trunk
 1.22.4.1  14-Oct-1997  thorpej Update marc-pcmcia branch from trunk.
 1.23.2.1  29-Jan-1998  mellon Pull up 1.24 (thorpej)
 1.31.2.1  09-Nov-1998  chs initial snapshot. lots left to do.
 1.36.4.2  11-Jul-1999  chs remove uvm_vnp_uncache(), it's no longer needed.
 1.36.4.1  07-Jun-1999  chs merge everything from chs-ubc branch.
 1.36.2.1  03-Sep-1999  he Pull up revision 1.38:
Don't allow coredump to follow symlinks, this has security
implications. (bouyer)
 1.38.2.4  21-Apr-2001  bouyer Sync with HEAD
 1.38.2.3  12-Mar-2001  bouyer Sync with HEAD.
 1.38.2.2  08-Dec-2000  bouyer Sync with HEAD.
 1.38.2.1  20-Nov-2000  bouyer Update thorpej_scsipi to -current as of a month ago
 1.45.2.10  19-Dec-2002  thorpej Sync with HEAD.
 1.45.2.9  11-Dec-2002  thorpej Sync with HEAD.
 1.45.2.8  11-Nov-2002  nathanw Catch up to -current
 1.45.2.7  18-Oct-2002  nathanw Catch up to -current.
 1.45.2.6  01-Apr-2002  nathanw Catch up to -current.
(CVS: It's not just a program. It's an adventure!)
 1.45.2.5  08-Jan-2002  nathanw Catch up to -current.
 1.45.2.4  14-Nov-2001  nathanw Catch up to -current.
 1.45.2.3  21-Sep-2001  nathanw Catch up to -current.
 1.45.2.2  21-Jun-2001  nathanw Catch up to -current.
 1.45.2.1  09-Apr-2001  nathanw Catch up with -current.
 1.48.4.2  01-Oct-2001  fvdl Catch up with -current.
 1.48.4.1  18-Sep-2001  fvdl Various changes to make cloning devices possible:

* Add an extra argument (struct vnode **) to VOP_OPEN. If it is
not NULL, specfs will create a cloned (aliased) vnode during
the call, and return it there. The caller should release and
unlock the original vnode if a new vnode was returned. The
new vnode is returned locked.

* Add a flag field to the cdevsw and bdevsw structures.
DF_CLONING indicates that it wants a new vnode for each
open (XXX is there a better way? devprop?)

* If a device is cloning, always call the close entry
point for a VOP_CLOSE.


Also, rewrite cons.c to do the right thing with vnodes. Use VOPs
rather than direct device entry calls. Suggested by mycroft@

Light to moderate testing done on an i386 system (arch doesn't matter
though, these are MI changes).
 1.48.2.4  11-Oct-2002  jdolecek vn_kqfilter(): g/c local variable vp
 1.48.2.3  23-Jun-2002  jdolecek catch up with -current on kqueue branch
 1.48.2.2  10-Jan-2002  thorpej Sync kqueue branch with -current.
 1.48.2.1  10-Jul-2001  lukem add method vn_kqfilter
 1.50.2.1  12-Nov-2001  thorpej Sync the thorpej-mips-cache branch with -current.
 1.54.6.2  01-Jun-2006  simonb Pull up rev 1.112 from trunk:
Limit the size of any kernel buffers allocated by the VOP_READDIR
routines to MAXBSIZE.

OK'd by tron@
 1.54.6.1  02-Oct-2003  tron Pull up revision 1.55 (requested by junyoung in ticket #1488):
count executable image pages as executable for vm-usage purposes.
also, always do the VTEXT vs. v_writecount mutual exclusion
(which we previously skipped if the text or data segment was empty).
 1.72.2.12  11-Dec-2005  christos Sync with head.
 1.72.2.11  10-Nov-2005  skrll Sync with HEAD. Here we go again...
 1.72.2.10  04-Mar-2005  skrll Sync with HEAD.

Hi Perry!
 1.72.2.9  24-Jan-2005  skrll Adapt to branch.
 1.72.2.8  17-Jan-2005  skrll Sync with HEAD.
 1.72.2.7  18-Dec-2004  skrll Sync with HEAD.
 1.72.2.6  14-Nov-2004  skrll Sync with HEAD.
 1.72.2.5  21-Sep-2004  skrll Fix the sync with head I botched.
 1.72.2.4  18-Sep-2004  skrll Sync with HEAD.
 1.72.2.3  03-Aug-2004  skrll Sync with HEAD
 1.72.2.2  03-Jul-2003  wrstuden LWP-ify union fs.

Note: These changes suffer from the same cnp->cn_lwp issue noted for
ufs. They will need to get fixed at the same time as ufs. The fix is to
add struct lwp * as a parameter to some VOPs.

Note also that most of the cn_lwp references used to be cn_proc references,
so if cnp->cn_lwp is bad to use, unionfs's been naughty for quite some
time.
 1.72.2.1  02-Jul-2003  darrenr Apply the aborted ktrace-lwp changes to a specific branch. This is just for
others to review, I'm concerned that patch fuzziness may have resulted in some
errant code being generated but I'll look at that later by comparing the diff
from the base to the branch with the file I attempt to apply to it. This will,
at the very least, put the changes in a better context for others to review
them and attempt to tinker with removing passing of 'struct lwp' through
the kernel.
 1.77.6.2  31-May-2006  tron Pull up following revision(s) (requested by simonb in ticket #10633):
sys/kern/vfs_vnops.c: revision 1.112
Limit the size of any kernel buffers allocated by the VOP_READDIR
routines to MAXBSIZE.
 1.77.6.1  14-Nov-2005  riz Pull up following revision(s) (requested by hannken in ticket #5983):
sys/kern/vfs_vnops.c: revision 1.99
vput() -> vrele(). Vnode is already unlocked.
With much help from Pavel Cahyna.
Fixes PR 32005.
 1.77.4.2  31-May-2006  tron Pull up following revision(s) (requested by simonb in ticket #10633):
sys/kern/vfs_vnops.c: revision 1.112
Limit the size of any kernel buffers allocated by the VOP_READDIR
routines to MAXBSIZE.
 1.77.4.1  14-Nov-2005  riz Pull up following revision(s) (requested by hannken in ticket #5983):
sys/kern/vfs_vnops.c: revision 1.99
vput() -> vrele(). Vnode is already unlocked.
With much help from Pavel Cahyna.
Fixes PR 32005.
 1.77.2.2  31-May-2006  tron Pull up following revision(s) (requested by simonb in ticket #10633):
sys/kern/vfs_vnops.c: revision 1.112
Limit the size of any kernel buffers allocated by the VOP_READDIR
routines to MAXBSIZE.
 1.77.2.1  14-Nov-2005  riz Pull up following revision(s) (requested by hannken in ticket #5983):
sys/kern/vfs_vnops.c: revision 1.99
vput() -> vrele(). Vnode is already unlocked.
With much help from Pavel Cahyna.
Fixes PR 32005.
 1.85.4.1  19-Mar-2005  yamt sync with head. xen and whitespace. xen part is not finished.
 1.85.2.1  29-Apr-2005  kent sync with -current
 1.86.2.11  31-May-2006  tron Pull up following revision(s) (requested by simonb in ticket #1347):
sys/kern/vfs_vnops.c: revision 1.112
Limit the size of any kernel buffers allocated by the VOP_READDIR
routines to MAXBSIZE.
 1.86.2.10  09-Nov-2005  tron branches: 1.86.2.10.2;
Pull up following revision(s) (requested by hannken in ticket #944):
sys/kern/vfs_vnops.c: revision 1.99
vput() -> vrele(). Vnode is already unlocked.
With much help from Pavel Cahyna.
Fixes PR 32005.
 1.86.2.9  15-Oct-2005  riz Apply patch (requested by elad in ticket #891):
Fix a crash whenever the nameidata has the pathname in userspace.
 1.86.2.8  08-Sep-2005  tron Apply patch (requested by elad in ticket #740):
Defopt VERIFIED_EXEC.
 1.86.2.7  23-Aug-2005  tron Backout ticket 685. It causes build failures.
 1.86.2.6  23-Aug-2005  tron Pull up revision 1.93 (requested by elad in ticket #685):
defopt verified_exec.
 1.86.2.5  24-Jul-2005  tron Pull up revision 1.94 (requested by elad in ticket #613):
Fix a null vp panic when creating a file at veriexec strict level 3.
 1.86.2.4  02-Jul-2005  tron Pull up revision 1.92 (requested by elad in ticket #487):
- Avoid pollution of struct vnode. Save the fingerprint evaluation status
in the veriexec table entry; the lookups are very cheap now. Suggested
by Chuq.
- Handle non-regular (!VREG) files correctly.
- Remove (no longer needed) FINGERPRINT_NOENTRY.
 1.86.2.3  02-Jul-2005  tron Pull up revision 1.91 (requested by elad in ticket #487):
More veriexec changes:
- Better organize strict level. Now we have 4 levels:
  - Level 0, learning mode: Warnings only about anything that might've
    resulted in 'access denied' or similar in a higher strict level.
  - Level 1, IDS mode:
    - Deny access on fingerprint mismatch.
    - Deny modification of veriexec tables.
  - Level 2, IPS mode:
    - All implications of strict level 1.
    - Deny write access to monitored files.
    - Prevent removal of monitored files.
    - Enforce access type - 'direct', 'indirect', or 'file'.
  - Level 3, lockdown mode:
    - All implications of strict level 2.
    - Prevent creation of new files.
    - Deny access to non-monitored files.
- Update sysctl(3) man-page with above. (date bumped too :)
- Remove FINGERPRINT_INDIRECT from possible fp_status values; it's no
  longer needed.
- Simplify veriexec_removechk() in light of new strict level policies.
- Eliminate use of 'securelevel'; veriexec now behaves according to
  its strict level only.
 1.86.2.2  13-Jun-2005  tron Pull up revision 1.90 (requested by elad in ticket #447):
Work according to veriexec strict level, not securelevel. Also, use the
veriexec_report() routine when possible; and when opening a file for
writing, only invalidate the fingerprint, since the data will not always
be changed.
 1.86.2.1  10-Jun-2005  tron Pull up revision 1.87 (requested by elad in ticket #389):
Rototill of the verified exec functionality.
* We now use hash tables instead of a list to store the in kernel
fingerprints.
* Fingerprint methods handling has been made more flexible, it is now
even simpler to add new methods.
* the loader no longer passes in magic numbers representing the
fingerprint method so veriexecctl is no longer kernel specific.
* fingerprint methods can be tailored out using options in the kernel
config file.
* more fingerprint methods added - rmd160, sha256/384/512
* veriexecctl can now report the fingerprint methods supported by the
running kernel.
* regularised the naming of some portions of veriexec.
 1.86.2.10.2.1  31-May-2006  tron Pull up following revision(s) (requested by simonb in ticket #1347):
sys/kern/vfs_vnops.c: revision 1.112
Limit the size of any kernel buffers allocated by the VOP_READDIR
routines to MAXBSIZE.
 1.92.2.9  24-Mar-2008  yamt sync with head.
 1.92.2.8  04-Feb-2008  yamt sync with head.
 1.92.2.7  21-Jan-2008  yamt sync with head
 1.92.2.6  07-Dec-2007  yamt sync with head
 1.92.2.5  27-Oct-2007  yamt sync with head.
 1.92.2.4  03-Sep-2007  yamt sync with head.
 1.92.2.3  26-Feb-2007  yamt sync with head.
 1.92.2.2  30-Dec-2006  yamt sync with head.
 1.92.2.1  21-Jun-2006  yamt sync with head.
 1.99.2.3  18-Nov-2005  yamt - associate read-ahead context to vnode, rather than file.
- revert VOP_READ prototype.
 1.99.2.2  15-Nov-2005  yamt add posix_fadvise.
 1.99.2.1  15-Nov-2005  yamt - setup/cleanup readahead context.
- adapt to the new VOP_READ prototype.
 1.102.2.3  18-Feb-2006  yamt sync with head.
 1.102.2.2  15-Jan-2006  yamt sync with head.
 1.102.2.1  31-Dec-2005  yamt uio_segflg/uio_lwp -> uio_vmspace.
 1.104.4.2  01-Jun-2006  kardel Sync with head.
 1.104.4.1  22-Apr-2006  simonb Sync with head.
 1.104.2.1  09-Sep-2006  rpaulo sync with head
 1.106.4.4  11-May-2006  elad sync with head
 1.106.4.3  06-May-2006  christos - Move kauth_cred_t declaration to <sys/types.h>
- Cleanup struct ucred; forward declarations that are unused.
- Don't include <sys/kauth.h> in any header, but include it in the c files
that need it.

Approved by core.
 1.106.4.2  19-Apr-2006  elad sync with head.
 1.106.4.1  08-Mar-2006  elad Adapt to kernel authorization KPI.
 1.106.2.6  14-Sep-2006  yamt sync with head.
 1.106.2.5  11-Aug-2006  yamt sync with head
 1.106.2.4  26-Jun-2006  yamt sync with head.
 1.106.2.3  24-May-2006  yamt sync with head.
 1.106.2.2  01-Apr-2006  yamt sync with head.
 1.106.2.1  13-Mar-2006  yamt sync with head.
 1.107.2.2  24-May-2006  tron Merge 2006-05-24 NetBSD-current into the "peter-altq" branch.
 1.107.2.1  28-Mar-2006  tron Merge 2006-03-28 NetBSD-current into the "peter-altq" branch.
 1.111.2.1  19-Jun-2006  chap Sync with head.
 1.122.4.4  01-Feb-2007  ad Sync with head.
 1.122.4.3  12-Jan-2007  ad Sync with head.
 1.122.4.2  18-Nov-2006  ad Sync with head.
 1.122.4.1  11-Sep-2006  ad - Convert some locks to mutexes and RW locks.
- Use the proclist_lock to protect pgrps and sessions in some places.
 1.124.2.2  10-Dec-2006  yamt sync with head.
 1.124.2.1  22-Oct-2006  yamt sync with head
 1.129.2.1  06-Jan-2007  bouyer Pull up following revision(s) (requested by elad in ticket #318):
sys/kern/kern_verifiedexec.c: revision 1.88
sys/kern/vfs_vnops.c: revision 1.130
sys/sys/verified_exec.h: revision 1.48
Avoid TOCTOU in Veriexec by introducing veriexec_openchk() to enforce
the policy and using a single namei() call in vn_open().
 1.133.2.3  07-May-2007  yamt sync with head.
 1.133.2.2  15-Apr-2007  yamt sync with head.
 1.133.2.1  12-Mar-2007  rmind Sync with HEAD.
 1.135.4.1  11-Jul-2007  mjf Sync with head.
 1.135.2.9  09-Oct-2007  ad Sync with head.
 1.135.2.8  09-Oct-2007  ad Sync with head.
 1.135.2.7  20-Aug-2007  ad Sync with HEAD.
 1.135.2.6  17-Jun-2007  ad - Increase the number of thread priorities from 128 to 256. How the space
is set up is to be revisited.
- Implement soft interrupts as kernel threads. A generic implementation
is provided, with hooks for fast-path MD code that can run the interrupt
threads over the top of other threads executing in the kernel.
- Split vnode::v_flag into three fields, depending on how the flag is
locked (by the interlock, by the vnode lock, by the file system).
- Miscellaneous locking fixes and improvements.
 1.135.2.5  08-Jun-2007  ad Sync with head.
 1.135.2.4  13-Apr-2007  ad - Fix a (new) bug where vget tries to acquire freed vnodes' interlocks.
- Minor locking fixes.
 1.135.2.3  10-Apr-2007  ad Sync with head.
 1.135.2.2  21-Mar-2007  ad - Replace more simple_locks, and fix up in a few places.
- Use condition variables.
- LOCK_ASSERT -> KASSERT.
 1.135.2.1  13-Mar-2007  ad Pull in the initial set of changes for the vmlocking branch.
 1.139.2.1  15-Aug-2007  skrll Sync with HEAD.
 1.140.10.2  22-Jul-2007  pooka Retire uvn_attach() - it abuses VXLOCK and its functionality,
setting vnode sizes, is handled elsewhere: file system vnode creation
or spec_open() for regular files or block special files, respectively.

Add a call to VOP_MMAP() to the pagedvn exec path, since the vnode
is being memory mapped.

reviewed by tech-kern & wrstuden
 1.140.10.1  22-Jul-2007  pooka file vfs_vnops.c was added on branch matt-mips64 on 2007-07-22 19:16:06 +0000
 1.140.8.1  14-Oct-2007  yamt sync with head.
 1.140.6.3  23-Mar-2008  matt sync with HEAD
 1.140.6.2  09-Jan-2008  matt sync with HEAD
 1.140.6.1  06-Nov-2007  matt sync with HEAD
 1.140.4.4  09-Dec-2007  jmcneill Sync with HEAD.
 1.140.4.3  03-Dec-2007  joerg Sync with HEAD.
 1.140.4.2  27-Nov-2007  joerg Sync with HEAD. amd64 Xen support needs testing.
 1.140.4.1  26-Oct-2007  joerg Sync with HEAD.

Follow the merge of pmap.c on i386 and amd64 and move
pmap_init_tmp_pgtbl into arch/x86/x86/pmap.c. Modify the ACPI wakeup
code to restore CR4 before jumping back into kernel space as the large
page option might cover that.
 1.143.4.3  18-Feb-2008  mjf Sync with HEAD.
 1.143.4.2  27-Dec-2007  mjf Sync with HEAD.
 1.143.4.1  08-Dec-2007  mjf Sync with HEAD.
 1.147.2.5  26-Dec-2007  ad Sync with head.
 1.147.2.4  18-Dec-2007  ad Lock readahead context using the associated object's lock.
 1.147.2.3  10-Dec-2007  ad - Don't drain the vnode lock in vclean(); reference counting and XLOCK
should be enough.
- LK_SETRECURSE is gone.
 1.147.2.2  09-Dec-2007  ad do_sys_mount: use vn_setrecurse(), not LK_SETRECURSE.
 1.147.2.1  04-Dec-2007  ad Pull the vmlocking changes into a new branch.
 1.148.4.2  08-Jan-2008  bouyer Sync with HEAD
 1.148.4.1  02-Jan-2008  bouyer Sync with HEAD
 1.154.6.5  17-Jan-2009  mjf Sync with HEAD.
 1.154.6.4  28-Sep-2008  mjf Sync with HEAD.
 1.154.6.3  05-Jun-2008  mjf Sync with HEAD.

Also fix build.
 1.154.6.2  02-Jun-2008  mjf Sync with HEAD.
 1.154.6.1  03-Apr-2008  mjf Sync with HEAD.
 1.155.2.2  04-Jun-2008  yamt sync with head
 1.155.2.1  18-May-2008  yamt sync with head.
 1.156.4.2  18-Sep-2008  wrstuden Sync with wrstuden-revivesa-base-2.
 1.156.4.1  23-Jun-2008  wrstuden Sync w/ -current. 34 merge conflicts to follow.
 1.156.2.5  09-Oct-2010  yamt sync with head
 1.156.2.4  11-Aug-2010  yamt sync with head.
 1.156.2.3  11-Mar-2010  yamt sync with head
 1.156.2.2  20-Jun-2009  yamt sync with head
 1.156.2.1  04-May-2009  yamt sync with head.
 1.158.4.2  13-Dec-2008  haad Update haad-dm branch to haad-dm-base2.
 1.158.4.1  19-Oct-2008  haad Sync with HEAD.
 1.158.2.1  10-Jun-2008  simonb Initial commit of Wasabi System's WAPBL (Write Ahead Physical Block
Logging) journaling code. Originally written by Darrin B. Jewell
while at Wasabi and updated to -current by Antti Kantee, Andy Doran,
Greg Oster and Simon Burge.

Still a number of issues - look in doc/BRANCHES for "simonb-wapbl"
for more info.
 1.160.4.1  04-Apr-2009  snj Pull up following revision(s) (requested by ad in ticket #661):
sys/arch/xen/xen/xenevt.c: revision 1.32
sys/compat/svr4/svr4_net.c: revision 1.56
sys/compat/svr4_32/svr4_32_net.c: revision 1.19
sys/dev/dmover/dmover_io.c: revision 1.32
sys/dev/putter/putter.c: revision 1.21
sys/kern/kern_descrip.c: revision 1.190
sys/kern/kern_drvctl.c: revision 1.23
sys/kern/kern_event.c: revision 1.64
sys/kern/sys_mqueue.c: revision 1.14
sys/kern/sys_pipe.c: revision 1.109
sys/kern/sys_socket.c: revision 1.59
sys/kern/uipc_syscalls.c: revision 1.136
sys/kern/vfs_vnops.c: revision 1.164
sys/kern/uipc_socket.c: revision 1.188
sys/net/bpf.c: revision 1.144
sys/net/if_tap.c: revision 1.55
sys/opencrypto/cryptodev.c: revision 1.47
sys/sys/file.h: revision 1.67
sys/sys/param.h: patch
sys/sys/socketvar.h: revision 1.119
Add fileops::fo_drain(), to be called from fd_close() when there is more
than one active reference to a file descriptor. It should dislodge threads
sleeping while holding a reference to the descriptor. Implemented only for
sockets but should be extended to pipes, fifos, etc.
Fixes the case of a multithreaded process doing something like the
following, which would have hung until the process got a signal.
thr0 accept(fd, ...)
thr1 close(fd)
 1.160.2.3  28-Apr-2009  skrll Sync with HEAD.
 1.160.2.2  03-Mar-2009  skrll Sync with HEAD.
 1.160.2.1  19-Jan-2009  skrll Sync with HEAD.
 1.162.2.2  23-Jul-2009  jym Sync with HEAD.
 1.162.2.1  13-May-2009  jym Sync with HEAD.

Commit is split, to avoid a "too many arguments" protocol error.
 1.169.4.4  05-Mar-2011  rmind sync with head
 1.169.4.3  03-Jul-2010  rmind sync with head
 1.169.4.2  30-May-2010  rmind sync with head
 1.169.4.1  16-Mar-2010  rmind Change struct uvm_object::vmobjlock to be dynamically allocated with
mutex_obj_alloc(). It allows us to share the locks among UVM objects.
 1.169.2.4  06-Nov-2010  uebayasi Sync with HEAD.
 1.169.2.3  22-Oct-2010  uebayasi Sync with HEAD (-D20101022).
 1.169.2.2  17-Aug-2010  uebayasi Sync with HEAD.
 1.169.2.1  30-Apr-2010  uebayasi Sync with HEAD.
 1.180.6.1  23-Jun-2011  cherry Catchup with rmind-uvmplock merge.
 1.183.8.2  22-Nov-2012  riz Pull up following revision(s) (requested by hannken in ticket #692):
sys/kern/vfs_vnode.c: revision 1.17
sys/kern/vfs_vnops.c: revision 1.186
Bring back Manuel Bouyer's patch to resolve races between vget() and vrelel()
resulting in vget() returning dead vnodes.
It is impossible to resolve these races in vn_lock().
Needs pullup to NetBSD-6.
 1.183.8.1  12-Apr-2012  riz branches: 1.183.8.1.4;
Pull up following revision(s) (requested by hannken in ticket #179):
sys/kern/vfs_vnops.c: revision 1.184
Fix vn_lock() to return an invalid (dead, clean) vnode
only if the caller requested it by setting LK_RETRY.
Should fix PR #46221: Kernel panic in NFS server code
 1.183.8.1.4.1  22-Nov-2012  riz Pull up following revision(s) (requested by hannken in ticket #692):
sys/kern/vfs_vnode.c: revision 1.17
sys/kern/vfs_vnops.c: revision 1.186
Bring back Manuel Bouyer's patch to resolve races between vget() and vrelel()
resulting in vget() returning dead vnodes.
It is impossible to resolve these races in vn_lock().
Needs pullup to NetBSD-6.
 1.183.6.1  05-Apr-2012  mrg sync to latest -current.
 1.183.2.4  22-May-2014  yamt sync with head.

for a reference, the tree before this commit was tagged
as yamt-pagecache-tag8.

this commit was split into small chunks to avoid
a limitation of cvs. ("Protocol error: too many arguments")
 1.183.2.3  16-Jan-2013  yamt sync with (a bit old) head
 1.183.2.2  30-Oct-2012  yamt sync with head
 1.183.2.1  17-Apr-2012  yamt sync with head
 1.185.2.4  03-Dec-2017  jdolecek update from HEAD
 1.185.2.3  20-Aug-2014  tls Rebase to HEAD as of a few days ago.
 1.185.2.2  20-Nov-2012  tls Resync to 2012-11-19 00:00:00 UTC
 1.185.2.1  12-Sep-2012  tls Initial snapshot of work to eliminate 64K MAXPHYS. Basically works for
physio (I/O to raw devices); needs more doing to get it going with the
filesystems, but it shouldn't damage data.

All work's been done on amd64 so far. Not hard to add support to other
ports. If others want to pitch in, one very helpful thing would be to
sort out when and how IDE disks can do 128K or larger transfers, and
adjust the various PCI IDE (or at least ahcisata) drivers and wd.c
accordingly -- it would make testing much easier. Another very helpful
thing would be to implement a smart minphys() for RAIDframe along the
lines detailed in the MAXPHYS-NOTES file.
 1.186.2.1  18-May-2014  rmind sync with head
 1.189.2.1  10-Aug-2014  tls Rebase.
 1.190.2.1  31-Dec-2014  snj Pull up following revision(s) (requested by chs in ticket #363):
common/lib/libprop/prop_kern.c: revision 1.18
sys/arch/mac68k/dev/grf_compat.c: revision 1.27
sys/arch/x68k/dev/grf.c: revision 1.45
sys/external/bsd/drm/dist/bsd-core/drm_bufs.c: revision 1.12
sys/external/bsd/drm2/drm/drm_drv.c: revision 1.12
sys/external/bsd/drm2/drm/drm_vm.c: revision 1.6
sys/external/bsd/drm2/include/linux/mm.h: revision 1.4
sys/kern/vfs_vnops.c: revision 1.192 via patch
sys/rump/librump/rumpkern/vm.c: revision 1.160
sys/sys/file.h: revision 1.78 via patch
sys/uvm/uvm_device.c: revision 1.64
sys/uvm/uvm_device.h: revision 1.13
sys/uvm/uvm_extern.h: revision 1.192
sys/uvm/uvm_mmap.c: revision 1.150 via patch
add a new "fo_mmap" fileops method to allow use of arbitrary uvm_objects for
mappings of file objects. move vnode-specific details of mmap()ing a vnode
from uvm_mmap() to the new vnode-specific vn_mmap(). add new uvm_mmap_dev()
and uvm_mmap_anon() convenience functions for mapping character devices
and anonymous memory, and replace all other calls to uvm_mmap() with those.
use the new fileop in drm2 so that libdrm can use mmap() to map things
like on other platforms (instead of the ioctl that we have used so far).
 1.191.2.2  28-Aug-2017  skrll Sync with HEAD
 1.191.2.1  06-Apr-2015  skrll Sync with HEAD
 1.193.4.1  21-Apr-2017  bouyer Sync with HEAD
 1.193.2.2  26-Apr-2017  pgoyette Sync with HEAD
 1.193.2.1  20-Mar-2017  pgoyette Sync with HEAD
 1.195.6.2  21-Jun-2021  martin Pull up following revision(s) (requested by dholland in ticket #1685):

sys/sys/namei.src: revision 1.59 (via patch)
sys/kern/vfs_vnops.c: revision 1.215
sys/kern/vfs_lookup.c: revision 1.226

Add a new namei flag NONEXCLHACK for open with O_CREAT and not O_EXCL.
This case needs to be distinguished from the other CREATE operations
because it is supposed to successfully return (and open) the target if
it exists. In the case where that target is the root, or a mount
point, such that there's no parent dir, "real" CREATE operations fail,
but O_CREAT without O_EXCL needs to succeed.

So (a) add the flag, (b) test for it in namei in the situation
described above, (c) set it in open under the appropriate
circumstances, and (d) because this can result in namei returning
ni_dvp of NULL, cope with that case.

Should get into -9 and maybe even -8, because it was prompted by
issues with 3rd-party code. The use of a flag (vs. adding an
additional nameiop, which would be more appropriate) was deliberate to
make the patch small and noninvasive.
 1.195.6.1  12-Apr-2018  msaitoh Pull up following revision(s) (requested by christos in ticket #741):
lib/libc/stdio/flags.c: revision 1.19
lib/libc/stdio/fdopen.c: revision 1.18
sys/kern/vfs_vnops.c: revision 1.196
lib/libc/stdio/freopen.c: revision 1.20
lib/libc/stdio/fopen.c: revision 1.17
external/bsd/nvi/dist/common/recover.c: revision 1.10
external/bsd/nvi/dist/common/recover.c: revision 1.11
lib/libc/sys/open.2: revision 1.58
sys/sys/fcntl.h: revision 1.49
make the checkok test stricter to avoid races, and use O_REGULAR.
Instead of opening the file and using popen(3), pass the file descriptor
to sendmail directly. Idea and code from Todd Miller.
Add O_REGULAR to enforce opening of only regular files
(like we have O_DIRECTORY for directories).
This is better than open(, O_NONBLOCK), fstat()+S_ISREG() because opening
devices can have side effects.
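
The difference between the two approaches mentioned above can be sketched in
userland C. This is only an illustration, not code from the commit:
open_regular_file() is a hypothetical helper, and the EINVAL it reports in the
fallback path is an arbitrary choice for this sketch. Where the headers define
O_REGULAR, the kernel rejects non-regular files inside open(2) itself; where
they do not, the sketch falls back to the open(, O_NONBLOCK) + fstat() +
S_ISREG() sequence, which still opens the object (and so may still trigger
device side effects) before it can be rejected.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Open `path` read-only and accept it only if it is a regular file.
 * Hypothetical helper for illustration; returns a descriptor or -1.
 */
static int
open_regular_file(const char *path)
{
#ifdef O_REGULAR
	/* The kernel refuses anything that is not a regular file. */
	return open(path, O_RDONLY | O_REGULAR);
#else
	struct stat st;
	int fd, saved_errno;

	/* Fallback: the object is opened before its type is known. */
	fd = open(path, O_RDONLY | O_NONBLOCK);
	if (fd == -1)
		return -1;
	if (fstat(fd, &st) == -1) {
		saved_errno = errno;
		close(fd);
		errno = saved_errno;
		return -1;
	}
	if (!S_ISREG(st.st_mode)) {
		close(fd);
		errno = EINVAL;	/* arbitrary choice for this sketch */
		return -1;
	}
	return fd;
#endif
}
```

Calling open_regular_file("/") is expected to fail in either branch, since
the root directory is not a regular file.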
 1.197.4.4  21-Apr-2020  martin Sync with HEAD
 1.197.4.3  13-Apr-2020  martin Mostly merge changes from HEAD up to 20200411
 1.197.4.2  08-Apr-2020  martin Merge changes from current as of 20200406
 1.197.4.1  10-Jun-2019  christos Sync with HEAD
 1.197.2.1  06-Sep-2018  pgoyette Sync with HEAD

Resolve a couple of conflicts (result of the uimin/uimax changes)
 1.200.4.1  21-Jun-2021  martin Pull up following revision(s) (requested by dholland in ticket #1296):

sys/sys/namei.src: revision 1.59 (via patch)
sys/kern/vfs_vnops.c: revision 1.215
sys/kern/vfs_lookup.c: revision 1.226

Add a new namei flag NONEXCLHACK for open with O_CREAT and not O_EXCL.
This case needs to be distinguished from the other CREATE operations
because it is supposed to successfully return (and open) the target if
it exists. In the case where that target is the root, or a mount
point, such that there's no parent dir, "real" CREATE operations fail,
but O_CREAT without O_EXCL needs to succeed.

So (a) add the flag, (b) test for it in namei in the situation
described above, (c) set it in open under the appropriate
circumstances, and (d) because this can result in namei returning
ni_dvp of NULL, cope with that case.

Should get into -9 and maybe even -8, because it was prompted by
issues with 3rd-party code. The use of a flag (vs. adding an
additional nameiop, which would be more appropriate) was deliberate to
make the patch small and noninvasive.
 1.204.2.4  29-Feb-2020  ad Back out experimental change - not ready for LK_SHARED on VOP_OPEN() just yet.
 1.204.2.3  29-Feb-2020  ad Sync with head.
 1.204.2.2  19-Jan-2020  ad Use LOCKLEAF in the few cases it's useful for ffs/tmpfs/nullfs. Others need
to be checked.
 1.204.2.1  17-Jan-2020  ad Sync with head.
 1.207.4.1  20-Apr-2020  bouyer Sync with HEAD
 1.213.2.1  14-Dec-2020  thorpej Sync w/ HEAD.
 1.214.4.2  01-Aug-2021  thorpej Sync with HEAD.
 1.214.4.1  17-Jun-2021  thorpej Sync w/ HEAD.
 1.235.4.3  12-Jul-2025  martin Pull up following revision(s) (requested by mlelstv in ticket #1134):

sys/kern/vfs_vnops.c: revision 1.245

Access v_rdev only for a device special file.
 1.235.4.2  12-Jul-2025  martin Pull up following revision(s) (requested by bad in ticket #1133):

sys/kern/vfs_vnops.c: revision 1.246

release fp->f_lock after reading the offset in vn_read()

Fixes an obvious lock leak introduced in r1.238 and pulled up to netbsd-10.

Fixes PR kern/59519 vn_read() leaks file* lock
 1.235.4.1  01-Aug-2023  martin Pull up following revision(s) (requested by riastradh in ticket #287):

sys/kern/vfs_vnops.c: revision 1.238

readdir(2), lseek(2): Fix races in access to struct file::f_offset.

For non-directory vnodes:
- reading f_offset requires a shared or exclusive vnode lock
- writing f_offset requires an exclusive vnode lock

For directory vnodes, access (read or write) requires either:
- a shared vnode lock AND f_lock, or
- an exclusive vnode lock.

This way, two files for the same underlying directory vnode can still
do VOP_READDIR in parallel, but if two readdir(2) or lseek(2) calls
run in parallel on the same file, the load and store of f_offset is
atomic (otherwise, e.g., on 32-bit systems it might be torn and lead
to corrupt offsets).

There is still a potential problem: the _whole transaction_ of
readdir(2) may not be atomic. For example, if thread A and thread B
read n bytes of directory content, thread A might get bytes [0,n) and
thread B might get bytes [n,2n) but f_offset might end up at n
instead of 2n once both operations complete. (However, f_offset
wouldn't be some corrupt garbled number like n & 0xffffffff00000000.)

Fixing this would require either:
(a) using an exclusive vnode lock in vn_readdir,
(b) introducing a new lock that serializes vn_readdir on the same
file (but not necessarily the same vnode), or
(c) proving it is safe to hold f_lock across VOP_READDIR, VOP_SEEK,
and VOP_GETATTR.
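
The "torn" offset mentioned above can be illustrated with a small
deterministic sketch. It does not reproduce the race or any kernel code; it
only computes the value a reader could observe if a 64-bit store were carried
out as two 32-bit halves and only the high half had landed.
torn_store_high_first() is a hypothetical name introduced for this
illustration. With an old offset of 0 and a new offset of n, the result is
exactly the n & 0xffffffff00000000 value the log message describes.

```c
#include <stdint.h>

/*
 * Value seen by a reader when a 64-bit store of new_value over
 * old_value is split into two 32-bit stores and only the high half
 * has been written so far.  Hypothetical helper, not NetBSD code.
 */
static uint64_t
torn_store_high_first(uint64_t old_value, uint64_t new_value)
{
	uint64_t high = new_value & UINT64_C(0xffffffff00000000);
	uint64_t low = old_value & UINT64_C(0x00000000ffffffff);

	return high | low;
}
```

Holding f_lock (or the exclusive vnode lock) around both the load and the
store rules out such intermediate values, which is the guarantee the locking
rules above are meant to provide.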
 1.242.6.1  02-Aug-2025  perseant Sync with HEAD
