History log of /src/sys/kern/sys_descrip.c
Revision  Date  Author  Comments
 1.52  16-Jul-2025  kre Kernel part of O_CLOFORK implementation (plus kernel revbump)

This is Ricardo Branco's implementation of O_CLOFORK (and
associated fcntl, etc) for NetBSD (with a few minor changes
by me).

For now, the header file symbols that should be exposed to
userland are hidden inside temporary #ifdef _KERNEL blocks,
just to avoid random userland apps, or config scripts, from
seeing any of this before it is better tested.

Userland parts of this will follow soon.

This also bumps the kernel version to 10.99.15 (changes to
data structs, and the signature of fd_dup()).
 1.51  20-May-2024  martin branches: 1.51.2;
Fix a few oversights from the renaming of dup3110 to dup3100
 1.50  19-May-2024  christos version dup3
 1.49  19-May-2024  christos PR/58266: Collin Funk: Fail if from == to, like FreeBSD and Linux. The test
is done in dup3 before any other tests, so even if a bad descriptor is
passed we will return EINVAL, not EBADF, like Linux does.
 1.48  10-Jul-2023  christos Add memfd_create(2) from GSoC 2023 by Theodore Preduta
 1.47  14-May-2023  riastradh kern/sys_descrip.c: Nix trailing whitespace.
 1.46  22-Apr-2023  riastradh fcntl(2), flock(2): Assert FHASLOCK is clear if no fo_advlock.
 1.45  22-Apr-2023  riastradh fcntl(2), flock(2): Unify error branches.

Let's make this a bit less error-prone by having everything converge
in the same place instead of multiple returns in different contexts.
 1.44  22-Apr-2023  riastradh fcntl(2), flock(2): Fix missing fd_putfile in error branch.

Oops!
 1.43  22-Apr-2023  riastradh file(9): New fo_posix_fadvise operation.

XXX kernel revbump -- changes struct fileops API and ABI
 1.42  22-Apr-2023  riastradh file(9): New fo_fpathconf operation.

XXX kernel revbump -- struct fileops API and ABI change
 1.41  22-Apr-2023  riastradh file(9): New fo_advlock operation.

This moves the vnode-specific logic from sys_descrip.c into
vfs_vnode.c, like we did for fo_seek.

XXX kernel revbump -- struct fileops API and ABI change
 1.40  16-Apr-2022  hannken Lock vnode for VOP_PATHCONF().
 1.39  15-Mar-2022  riastradh posix_fadvise(2): Detect arithmetic overflow without UB.

Reported-by: syzbot+18f01abff11bd527c464@syzkaller.appspotmail.com
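
A minimal sketch of overflow detection without undefined behaviour, as the
1.39 message describes: the check is done by comparing against the remaining
headroom instead of evaluating a signed sum that might overflow. The helper
name range_end() and the choice of EINVAL as the error are assumptions made
for illustration; this is not the code from do_posix_fadvise.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical helper (not the kernel's code): compute off + len for a
 * byte range without ever evaluating a signed addition that overflows.
 */
int
range_end(int64_t off, int64_t len, int64_t *endp)
{
	/* Negative inputs are rejected up front (assumed policy). */
	if (off < 0 || len < 0)
		return EINVAL;

	/*
	 * off >= 0, so INT64_MAX - off cannot overflow.  If len exceeds
	 * that headroom, off + len would overflow, so bail out first.
	 */
	if (len > INT64_MAX - off)
		return EINVAL;

	*endp = off + len;
	return 0;
}
```

The comparison is rearranged so that every intermediate value stays within
int64_t; testing `off + len < off` after the fact would already have invoked
undefined behaviour.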
 1.38  11-Sep-2021  riastradh sys/kern: Avoid fp->f_offset without the object (here, vnode) lock.
 1.37  23-Feb-2020  ad UVM locking changes, proposed on tech-kern:

- Change the lock on uvm_object, vm_amap and vm_anon to be a RW lock.
- Break v_interlock and vmobjlock apart. v_interlock remains a mutex.
- Do partial PV list locking in the x86 pmap. Others to follow later.
 1.36  01-Feb-2020  riastradh Load struct filedesc::fd_dt with atomic_load_consume.

Exceptions: when fd_refcnt <= 1, or when holding fd_lock.

While here:

- Restore KASSERT(mutex_owned(&fdp->fd_lock)) in fd_unused.
=> This is used only in fd_close and fd_abort, where it holds.
- Move bounds check assertion in fd_putfile to where it matters.
- Store fd_dt with atomic_store_release.
- Move load of fd_dt under lock in knote_fdclose.
- Omit membar_consumer in fdesc_readdir.
=> atomic_load_consume serves the same purpose now.
=> Was needed only on alpha anyway.
 1.35  15-Sep-2019  christos branches: 1.35.2;
Add F_GETPATH, presented to tech-kern.
 1.34  26-Aug-2019  maxv Reject negative offsets, to prevent panics later in genfs_getpages().
 1.33  21-May-2019  christos branches: 1.33.2;
provide more info about who is getting ERESTART.
 1.32  03-Feb-2019  mrg - add or adjust /* FALLTHROUGH */ where appropriate
- add __unreachable() after functions that can return but won't in
this case, and thus can't be marked __dead easily
 1.31  26-Dec-2017  kamil branches: 1.31.4;
Refactor pipe1() and correct a bug in sys_pipe2() (SYS_pipe2)

sys_pipe2() returns two integers (values), the 2nd one is a copy of the 2nd
file descriptor that lands in fildes[2]. This is a side effect of reusing
the code for sys_pipe() (SYS_pipe) and not cleaning it up.

The first returned value is (on success) 0.

Introduced a small refactoring in pipe1() so that it does not operate on
retval[], but on an int[2] array. The caller sets retval[] for pipe() when
desired and needed.

This refactoring touches compat code: netbsd32, linux, linux32.

Before the changes on NetBSD/amd64:

$ ktruss -i ./a.out
[...]
15131 1 a.out pipe2(0x7f7fff2e62b8, 0) = 0, 4
[...]

After the changes:

$ ktruss -i ./a.out
[...]
782 1 a.out pipe2(0x7f7fff97e850, 0) = 0
[...]

There should not be a visible change for current users.

Sponsored by <The NetBSD Foundation>
 1.30  05-Sep-2014  matt Try not to use f_data, use f_{vnode,socket,pipe,mqueue,kqueue,ksem} to get
a correctly typed pointer.
 1.29  05-Sep-2014  matt Don't nest structure and enum definitions.
Don't use C++ keywords new, try, class, private, etc.
 1.28  08-Apr-2013  skrll Remove some set but unused variables
 1.27  05-Aug-2012  riastradh branches: 1.27.2;
Force sys_close not to restart by returning ERESTART.

Print a diagnostic message if we ever get ERESTART out of fd_close
and convert it to EINTR instead.

Even if fd_close fails, it has already closed the file descriptor, so
restarting the system call is a mistake, with dangerous consequences
for multithreaded programs.

Should probably turn the message into a kassert eventually, and maybe
add one deeper in fd_close in order to more easily debug it before
all the data structures are destroyed.
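
The error conversion described in 1.27 can be sketched as below. This is an
illustration under stated assumptions, not the sys_close code: SKETCH_ERESTART
is a stand-in for the kernel-internal ERESTART value, close_fixup_error() is a
hypothetical name, and the diagnostic message mentioned in the commit is
omitted.

```c
#include <errno.h>

/* Assumption: stand-in for the kernel-internal ERESTART value. */
#define SKETCH_ERESTART	(-3)

/*
 * Hypothetical helper: by the time close fails, the descriptor is
 * already gone, so a restartable error must not reach the syscall
 * layer.  Report an interrupted call instead.
 */
int
close_fixup_error(int error)
{
	if (error == SKETCH_ERESTART)
		return EINTR;
	return error;
}
```

Restarting would re-run close() on a descriptor number that another thread
may have reused in the meantime, which is the danger the commit message
points out for multithreaded programs.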
 1.26  11-Feb-2012  martin Add a posix_spawn syscall, as discussed on tech-kern.
Based on the summer of code project by Charles Zhang, heavily reworked
later by me - all bugs are likely mine.
Ok: core, releng.
 1.25  25-Jan-2012  christos Add locking, requested by yamt. Note that locking is not used everywhere
for these.
 1.24  25-Jan-2012  christos As discussed in tech-kern, provide the means to prevent delivery of SIGPIPE
on EPIPE for all file descriptor types:

- provide O_NOSIGPIPE for open,kqueue1,pipe2,dup3,fcntl(F_{G,S}ETFL) [NetBSD]
- provide SOCK_NOSIGPIPE for socket,socketpair [NetBSD]
- provide SO_NOSIGPIPE for {g,s}etsockopt [NetBSD/FreeBSD/MacOSX]
- provide F_{G,S}ETNOSIGPIPE for fcntl [MacOSX]
 1.23  31-Oct-2011  christos branches: 1.23.2; 1.23.6;
PR/45545 Yui NARUSE: pipe2's return value is wrong
 1.22  26-Jun-2011  christos * Arrange for interfaces that create new file descriptors to be able to
set close-on-exec on creation (http://udrepper.livejournal.com/20407.html).

- Add F_DUPFD_CLOEXEC to fcntl(2).
- Add MSG_CMSG_CLOEXEC to recvmsg(2) for unix file descriptor passing.
- Add dup3(2) syscall with a flags argument for O_CLOEXEC, O_NONBLOCK.
- Add pipe2(2) syscall with a flags argument for O_CLOEXEC, O_NONBLOCK.
- Add flags SOCK_CLOEXEC, SOCK_NONBLOCK to the socket type parameter
for socket(2) and socketpair(2).
- Add new paccept(2) syscall that takes an additional sigset_t to alter
the sigmask temporarily and a flags argument to set SOCK_CLOEXEC,
SOCK_NONBLOCK.
- Add new mode character 'e' to fopen(3) and popen(3) to open pipes
and file descriptors for close on exec.
- Add new kqueue1(2) syscall with a new flags argument to open the
kqueue file descriptor with O_CLOEXEC, O_NONBLOCK.

* Fix the system calls that take socklen_t arguments to actually do so.

* Don't include userland header files (signal.h) from system header files
(rump_syscallargs.h).

* Bump libc version for the new syscalls.
 1.21  12-Jun-2011  rmind Welcome to 5.99.53! Merge rmind-uvmplock branch:

- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.

- Simplify locking in some pmap(9) modules by removing P->V locking.

- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).

- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.

- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.

Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.
 1.20  10-Apr-2011  christos branches: 1.20.2;
- Add O_CLOEXEC to open(2)
- Add fd_set_exclose() to encapsulate uses of FIO{,N}CLEX, O_CLOEXEC, F{G,S}ETFD
- Add a pipe1() function to allow passing flags to the fd's that pipe(2)
opens to ease implementation of linux pipe2(2)
- Factor out fp handling code from open(2) and fhopen(2)
 1.19  18-Dec-2010  rmind branches: 1.19.2;
do_posix_fadvise: fix and improve previous change - add a comment with
some rationale and handle a few range overflows.

Per report/discussion with yamt@.
 1.18  27-Oct-2010  rmind do_posix_fadvise: check for a negative length; truncate the offset and
round the end-offset, not vice-versa.

Thanks to jakllsch@ for debug info.
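
A small sketch of the rounding direction named in 1.18: the start of the
range is truncated down to a page boundary and the end is rounded up, so the
page-aligned range covers the whole request. The 4096-byte page size and all
names here are assumptions for illustration; the kernel uses its own
trunc_page()/round_page() macros and page size, and this sketch assumes the
end offset is small enough that rounding up does not wrap.

```c
#include <stdint.h>

/* Assumption: 4 KiB pages, purely for illustration. */
#define SKETCH_PAGE_SIZE	((uint64_t)4096)
#define SKETCH_PAGE_MASK	(SKETCH_PAGE_SIZE - 1)

/* Round x down to the start of its page. */
static uint64_t
sketch_trunc_page(uint64_t x)
{
	return x & ~SKETCH_PAGE_MASK;
}

/* Round x up to the next page boundary (x itself if already aligned). */
static uint64_t
sketch_round_page(uint64_t x)
{
	return (x + SKETCH_PAGE_MASK) & ~SKETCH_PAGE_MASK;
}

/*
 * Hypothetical helper: widen the byte range [off, off + len) to page
 * boundaries by truncating the start and rounding the end.
 */
void
advise_range(uint64_t off, uint64_t len, uint64_t *startp, uint64_t *endp)
{
	*startp = sketch_trunc_page(off);
	*endp = sketch_round_page(off + len);
}
```

Doing it the other way around (rounding the start up and truncating the end)
would shrink the range and could leave it empty for requests smaller than a
page.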
 1.17  28-Oct-2009  njoly branches: 1.17.2; 1.17.4;
Make flock(2) more robust to invalid operation, such as
(LOCK_EX|LOCK_SH).
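
One way to reject combinations such as (LOCK_EX|LOCK_SH) is to mask off
LOCK_NB and accept only an exact match against a single operation. The sketch
below shows that idea; flock_check_op() is a hypothetical name and this is not
claimed to be the code committed in 1.17.

```c
#include <errno.h>
#include <sys/file.h>

/*
 * Hypothetical validator: exactly one of LOCK_SH, LOCK_EX or LOCK_UN
 * must be requested, optionally combined with LOCK_NB.
 */
int
flock_check_op(int how)
{
	switch (how & ~LOCK_NB) {
	case LOCK_SH:
	case LOCK_EX:
	case LOCK_UN:
		return 0;
	default:
		/* No operation, several operations, or unknown bits. */
		return EINVAL;
	}
}
```

Testing individual bits with `how & LOCK_EX` would accept the combined value;
matching the whole masked value is what makes the mixed request fail.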
 1.16  10-Jun-2009  yamt do_posix_fadvise:
- deactivate pages on POSIX_FADV_DONTNEED.
- more sanity checks. fix a panic in genfs_getpages
introduced by the previous (rev.1.15).
 1.15  10-Jun-2009  yamt do_posix_fadvise: on POSIX_FADV_WILLNEED, start prefetching of object's pages.
 1.14  31-May-2009  yamt do_posix_fadvise: turn some KASSERTs into CTASSERTs.
 1.13  24-May-2009  ad More changes to improve kern_descrip.c.

- Avoid atomics in more places.
- Remove the per-descriptor mutex, and just use filedesc_t::fd_lock.
It was only being used to synchronize close, and in any case we needed
to take fd_lock to free the descriptor slot.
- Optimize certain paths for the <NDFDFILE case.
- Sprinkle more comments and assertions.
- Cache more stuff in filedesc_t.
- Fix numerous minor bugs spotted along the way.
- Restructure how the open files array is maintained, for clarity and so
that we can eliminate the membar_consumer() call in fd_getfile(). This is
mostly syntactic sugar; the main functional change is that fd_nfiles now
lives alongside the open file array.

Some measurements with libmicro:

- simple file syscalls like close() are between 1 and 10% faster.
- some nice improvements, e.g. poll(1000) which is ~50% faster.
 1.12  28-Mar-2009  rmind sys_fcntl: use FD_CLOEXEC, instead of magic number '1'.
 1.11  04-Mar-2009  skrll Fix the posix_fadvise return value... finally.

Tested by martin on sparc64/m68k and by me on hppa.
 1.10  22-Jan-2009  yamt branches: 1.10.2;
malloc -> kmem_alloc
 1.9  11-Jan-2009  christos merge christos-time_t
 1.8  21-Dec-2008  ad Prevent a potential deadlock from a multithreaded process doing:

t1 dup2(0, 1)
t2 dup2(1, 0)
 1.7  15-Sep-2008  rmind branches: 1.7.2; 1.7.4;
Replace intptr_t with uintptr_t in a few more places.
OK by <matt>.
 1.6  31-Aug-2008  njoly Make dup(2) return the correct error value, not 0.
 1.5  02-Jul-2008  matt branches: 1.5.2;
Change {ff,fd}_exclose and ff_allocated to bool. Change exclose arg to
fd_dup to bool. Switch assignments from 1/0 to true/false.

This makes alpha kernels compile. Bump kern to 4.99.69 since structure
changed.
 1.4  23-Jun-2008  ad sys_fcntl: use l_fd, not p_fd.
 1.3  28-Apr-2008  martin branches: 1.3.2; 1.3.4;
Remove clause 3 and 4 from TNF licenses
 1.2  24-Apr-2008  ad branches: 1.2.2;
Merge proc::p_mutex and proc::p_smutex into a single adaptive mutex, since
we no longer need to guard against access from hardware interrupt handlers.

Additionally, if cloning a process with CLONE_SIGHAND, arrange to have the
child process share the parent's lock so that signal state may be kept in
sync. Partially addresses PR kern/37437.
 1.1  21-Mar-2008  ad branches: 1.1.2; 1.1.4; 1.1.6; 1.1.8;
File descriptor changes, discussed on tech-kern:

- Redo reference counting to be sane. LWPs accessing files take a short
term reference on the local file descriptor. This is the most common
case. While a file is in a process descriptor table, a reference is
held to the file. The file reference count only changes during control
operations like open() or close(). Code that comes at files from an
unusual direction (i.e. foreign to the process) like procfs or sysctl
takes a reference on the file (f_count), and not on a descriptor.

- Remove knowledge of reference counting and locking from most code that
deals with files.

- Make the usual case of file descriptor lookup lockless.

- Make kqueue MP and MT safe. PR kern/38098, PR kern/38137.

- Fix numerous file handling bugs, and bugs in the descriptor code that
affected multithreaded processes.

- Split descriptor system calls out into sys_descrip.c.

- A few stylistic changes: KNF, remove unused casts now that caddr_t is
gone. Replace dumb gotos with loop control in a few places.

- Don't do redundant pointer passing (struct proc, lwp, filedesc *) unless
the routine is likely to be inlined. Most of the time it's about the
current process.
 1.1.8.1  18-May-2008  yamt sync with head.
 1.1.6.7  17-Jan-2009  mjf Sync with HEAD.
 1.1.6.6  28-Sep-2008  mjf Sync with HEAD.
 1.1.6.5  02-Jul-2008  mjf Sync with HEAD.
 1.1.6.4  29-Jun-2008  mjf Sync with HEAD.
 1.1.6.3  02-Jun-2008  mjf Sync with HEAD.
 1.1.6.2  03-Apr-2008  mjf Sync with HEAD.
 1.1.6.1  21-Mar-2008  mjf file sys_descrip.c was added on branch mjf-devfs2 on 2008-04-03 12:43:04 +0000
 1.1.4.3  27-Dec-2008  christos merge with head.
 1.1.4.2  01-Nov-2008  christos Sync with head.
 1.1.4.1  29-Mar-2008  christos Welcome to the time_t=long long dev_t=uint64_t branch.
 1.1.2.2  24-Mar-2008  yamt sync with head.
 1.1.2.1  21-Mar-2008  yamt file sys_descrip.c was added on branch yamt-lazymbuf on 2008-03-24 09:39:02 +0000
 1.2.2.4  11-Mar-2010  yamt sync with head
 1.2.2.3  20-Jun-2009  yamt sync with head
 1.2.2.2  04-May-2009  yamt sync with head.
 1.2.2.1  16-May-2008  yamt sync with head.
 1.3.4.2  03-Jul-2008  simonb Sync with head.
 1.3.4.1  27-Jun-2008  simonb Sync with head.
 1.3.2.2  24-Sep-2008  wrstuden Merge in changes between wrstuden-revivesa-base-2 and
wrstuden-revivesa-base-3.
 1.3.2.1  18-Sep-2008  wrstuden Sync with wrstuden-revivesa-base-2.
 1.5.2.1  19-Oct-2008  haad Sync with HEAD.
 1.7.4.2  06-Nov-2012  riz Pull up following revision(s) (requested by he in ticket #1815):
sys/kern/sys_descrip.c: revision 1.11
Fix the posix_fadvise return value... finally.
Tested by martin on sparc64/m68k and by me on hppa.
 1.7.4.1  02-Feb-2009  snj Pull up following revision(s) (requested by ad in ticket #341):
sys/kern/sys_descrip.c: revision 1.8
Prevent a potential deadlock from a multithreaded process doing:
t1 dup2(0, 1)
t2 dup2(1, 0)
 1.7.2.3  28-Apr-2009  skrll Sync with HEAD.
 1.7.2.2  03-Mar-2009  skrll Sync with HEAD.
 1.7.2.1  19-Jan-2009  skrll Sync with HEAD.
 1.10.2.2  23-Jul-2009  jym Sync with HEAD.
 1.10.2.1  13-May-2009  jym Sync with HEAD.

Commit is split, to avoid a "too many arguments" protocol error.
 1.17.4.3  21-Apr-2011  rmind sync with head
 1.17.4.2  05-Mar-2011  rmind sync with head
 1.17.4.1  16-Mar-2010  rmind Change struct uvm_object::vmobjlock to be dynamically allocated with
mutex_obj_alloc(). It allows us to share the locks among UVM objects.
 1.17.2.1  06-Nov-2010  uebayasi Sync with HEAD.
 1.19.2.1  06-Jun-2011  jruoho Sync with HEAD.
 1.20.2.1  23-Jun-2011  cherry Catchup with rmind-uvmplock merge.
 1.23.6.1  18-Feb-2012  mrg merge to -current.
 1.23.2.3  22-May-2014  yamt sync with head.

for a reference, the tree before this commit was tagged
as yamt-pagecache-tag8.

this commit was split into small chunks to avoid
a limitation of cvs. ("Protocol error: too many arguments")
 1.23.2.2  30-Oct-2012  yamt sync with head
 1.23.2.1  17-Apr-2012  yamt sync with head
 1.27.2.3  03-Dec-2017  jdolecek update from HEAD
 1.27.2.2  23-Jun-2013  tls resync from head
 1.27.2.1  12-Sep-2012  tls Initial snapshot of work to eliminate 64K MAXPHYS. Basically works for
physio (I/O to raw devices); needs more doing to get it going with the
filesystems, but it shouldn't damage data.

All work's been done on amd64 so far. Not hard to add support to other
ports. If others want to pitch in, one very helpful thing would be to
sort out when and how IDE disks can do 128K or larger transfers, and
adjust the various PCI IDE (or at least ahcisata) drivers and wd.c
accordingly -- it would make testing much easier. Another very helpful
thing would be to implement a smart minphys() for RAIDframe along the
lines detailed in the MAXPHYS-NOTES file.
 1.31.4.3  13-Apr-2020  martin Mostly merge changes from HEAD up to 20200411
 1.31.4.2  08-Apr-2020  martin Merge changes from current as of 20200406
 1.31.4.1  10-Jun-2019  christos Sync with HEAD
 1.33.2.1  20-Nov-2024  martin Pull up following revision(s) (requested by riastradh in ticket #1921):

sys/kern/kern_event.c: revision 1.106
sys/kern/sys_select.c: revision 1.51
sys/kern/subr_exec_fd.c: revision 1.10
sys/kern/sys_aio.c: revision 1.46
sys/kern/kern_descrip.c: revision 1.244
sys/kern/kern_descrip.c: revision 1.245
sys/ddb/db_xxx.c: revision 1.72
sys/ddb/db_xxx.c: revision 1.73
sys/miscfs/fdesc/fdesc_vnops.c: revision 1.132
sys/kern/uipc_usrreq.c: revision 1.195
sys/kern/sys_descrip.c: revision 1.36
sys/kern/uipc_usrreq.c: revision 1.196
sys/kern/uipc_socket2.c: revision 1.135
sys/kern/uipc_socket2.c: revision 1.136
sys/kern/kern_sig.c: revision 1.383
sys/kern/kern_sig.c: revision 1.384
sys/compat/netbsd32/netbsd32_ioctl.c: revision 1.107
sys/miscfs/procfs/procfs_vnops.c: revision 1.208
sys/kern/subr_exec_fd.c: revision 1.9
sys/kern/kern_descrip.c: revision 1.252
(all via patch)

Load struct filedesc::fd_dt with atomic_load_consume.

Exceptions: when fd_refcnt <= 1, or when holding fd_lock.

While here:
- Restore KASSERT(mutex_owned(&fdp->fd_lock)) in fd_unused.
=> This is used only in fd_close and fd_abort, where it holds.
- Move bounds check assertion in fd_putfile to where it matters.
- Store fd_dt with atomic_store_release.
- Move load of fd_dt under lock in knote_fdclose.
- Omit membar_consumer in fdesc_readdir.
=> atomic_load_consume serves the same purpose now.
=> Was needed only on alpha anyway.

Load struct fdfile::ff_file with atomic_load_consume.
Exceptions: when we're only testing whether it's there, not about to
dereference it.

Note: We do not use atomic_store_release to set it because the
preceding mutex_exit should be enough.

(That said, it's not clear the mutex_enter/exit is needed unless
refcnt > 0 already, in which case maybe it would be a win to switch
from the membar implied by mutex_enter to the membar implied by
atomic_store_release -- which I would generally expect to be much
cheaper. And a little clearer without a long comment.)
kern_descrip.c: Fix membars around reference count decrement.

In general, the `last one out hit the lights' style of reference
counting (as opposed to the `whoever's destroying must wait for
pending users to finish' style) requires memory barriers like so:

... usage of resources associated with object ...
membar_release();
if (atomic_dec_uint_nv(&obj->refcnt) != 0)
return;
membar_acquire();
... freeing of resources associated with object ...

This way, all usage happens-before all freeing. This fixes several
errors:
- fd_close failed to ensure whatever its caller did would
happen-before the freeing, in the case where another thread is
concurrently trying to close the fd (ff->ff_file == NULL).
Fix: Add membar_release before atomic_dec_uint(&ff->ff_refcnt) in
that branch.
- fd_close failed to ensure all loads its caller had issued will have
happened-before the freeing, in the case where the fd is still in
use by another thread (fdp->fd_refcnt > 1 and ff->ff_refcnt-- > 0).
Fix: Change membar_producer to membar_release before
atomic_dec_uint(&ff->ff_refcnt).
- fd_close failed to ensure that any usage of fp by other callers
would happen-before any freeing it does.
Fix: Add membar_acquire after atomic_dec_uint_nv(&ff->ff_refcnt).
- fd_free failed to ensure that any usage of fdp by other callers
would happen-before any freeing it does.
Fix: Add membar_acquire after atomic_dec_uint_nv(&fdp->fd_refcnt).

While here, change membar_exit -> membar_release. No semantic
change, just updating away from the legacy API.
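
The `last one out hit the lights' pattern quoted above can be approximated in
portable C11 as follows. This is a sketch under assumptions, not the
kern_descrip.c code: it uses <stdatomic.h> in place of the kernel's
membar_release()/membar_acquire() and atomic_dec_uint_nv(), and struct obj and
obj_release() are hypothetical names.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical reference-counted object. */
struct obj {
	atomic_uint refcnt;
};

/*
 * Drop one reference.  Returns true only for the caller that dropped
 * the last reference, which may then free the object's resources.
 */
bool
obj_release(struct obj *o)
{
	/*
	 * Release ordering: everything this thread did with the object
	 * is ordered before the decrement becomes visible.
	 * atomic_fetch_sub returns the previous value, so 1 means the
	 * count has just reached zero.
	 */
	if (atomic_fetch_sub_explicit(&o->refcnt, 1,
	    memory_order_release) != 1)
		return false;

	/*
	 * Acquire fence: pairs with the release decrements of the other
	 * threads, so all of their usage happens-before the freeing.
	 */
	atomic_thread_fence(memory_order_acquire);
	return true;
}
```

A caller would typically write `if (obj_release(o)) free(o);`. Without the
release half, a thread's last accesses could be reordered after its
decrement; without the acquire half, the freeing thread could act before it
has observed those accesses.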
 1.35.2.1  29-Feb-2020  ad Sync with head.
 1.51.2.1  02-Aug-2025  perseant Sync with HEAD
