History log of /src/sys/kern/subr_disk.c
Revision  Date  Author  Comments
 1.138  13-Apr-2025  jakllsch Add physical sector and alignment info to struct disk_geom and the
geometry plist, and handle in partutil.

Bump version for disk_geom addition.

Collect DIOCGSECTORALIGN handling into one place.
 1.137  09-May-2023  riastradh branches: 1.137.6;
ioctl(DIOCRMWEDGES): Delete only idle wedges.

Don't forcibly delete busy wedges.

Reported-by: syzbot+e46f31fe56e04f567d88@syzkaller.appspotmail.com
https://syzkaller.appspot.com/bug?id=8a00fd7f2e7459748d7a274098180a4708ff0f61

Fixes accidental destruction of the busy wedge that the root file
system is mounted on, triggered by syzbot's ioctl(DIOCRMWEDGES).
 1.136  22-Apr-2023  riastradh disk(9): Fix missing unlock in error branch in previous change.

Reported-by: syzbot+870665adaf8911c0d94d@syzkaller.appspotmail.com
https://syzkaller.appspot.com/bug?id=a4ae17cf66b5bb999182ae77fd3c7ad9ad18c891
 1.135  21-Apr-2023  riastradh disk(9): Fix use-after-free race with concurrent disk_set_info.

This can happen with dk(4), which allows wedges to have their size
increased without destroying and recreating the device instance.

Drivers which allow concurrent disk_set_info and disk_ioctl must
serialize disk_set_info with dk_openlock.
 1.134  28-Mar-2022  riastradh branches: 1.134.4;
disk(9): New function disklabel_dev_unit.

Maps a dev_t like wd3e to an autoconf instance number like 3, with no
partition. Same as DISKUNIT macro, but is a symbol whose pointer can
be taken. Meant for use with struct bdevsw, cdevsw::d_devtounit.
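
As a rough illustration of why a real function is useful here, the sketch below contrasts a macro with a function that can be stored in a switch-table slot. The minor-number encoding, the `TOY_`/`toy_` names, and `struct toy_devsw` are assumptions made for this example only; they are not the kernel's definitions of DISKUNIT, disklabel_dev_unit, or struct bdevsw.

```c
#include <stdint.h>

/* Hypothetical encoding: minor = unit * TOY_MAXPARTITIONS + partition. */
#define TOY_MAXPARTITIONS 16
#define TOY_DISKUNIT(minor) ((int)((minor) / TOY_MAXPARTITIONS))

/* Same mapping as the macro, but a real symbol with an address. */
static int
toy_dev_unit(uint32_t minor)
{
	return TOY_DISKUNIT(minor);
}

/* Stand-in for a device switch entry holding a "dev to unit" callback. */
struct toy_devsw {
	int (*d_devtounit)(uint32_t);
};

static const struct toy_devsw toy_wd_devsw = {
	.d_devtounit = toy_dev_unit,	/* a macro could not be assigned here */
};

/* Build the minor number for unit `unit`, partition letter `part` ('a'..). */
static uint32_t
toy_minor(int unit, char part)
{
	return (uint32_t)unit * TOY_MAXPARTITIONS + (uint32_t)(part - 'a');
}
```

Under these assumptions, `toy_wd_devsw.d_devtounit(toy_minor(3, 'e'))` yields the unit number of a "wd3e"-style device with the partition stripped.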
 1.133  17-May-2021  mrg move bi-endian disklabel support from the kernel and libsa into libkern.

- dkcksum() and dkcksum_sized() move from subr_disk.c and from
libsa into libkern/dkcksum.c (which is missing _sized() version),
using the version from usr.sbin/disklabel.

- swap_disklabel() moves from subr_disk_mbr.c into libkern, now called
disklabel_swap(). (the sh3 version should be updated to use this.)

- DISKLABEL_EI becomes a first-class option with opt_disklabel.h.

- add libkern.h to libsa/disklabel.c.

this enables future work for bi-endian libsa/ufs.c (relevant for ffsv1,
ffsv2, lfsv1, and lfsv2), as well as making it possible for ports not
using subr_disk_mbr.c to include bi-endian disklabel support (which,
afaict, includes any disk on mbr-supporting platforms that do not have
an mbr as well as disklabel.)

builds successfully on: alpha, i386, amd64, sun2, sun3, evbarm64,
evbarm64-eb, sparc, and sparc64. tested in anita on i386 and sparc,
testing in hardware on evbarm64*.
 1.132  17-Oct-2020  mlelstv branches: 1.132.6; 1.132.8;
Attach disk info even for zero sized disks.
Slight refactoring.
 1.131  11-Jun-2020  thorpej Update for proplib(3) API changes.
 1.130  27-Mar-2020  mlelstv Avoid division by zero if label isn't valid.
 1.129  30-Sep-2019  cnst kern/subr_disk: bounds_check_with_label: really protect against div by zero

Solves kernel panic in NetBSD 8.1 amd64 on VirtualBox 6.0.12 r133076.

Triggered with an NVMe controller without any actual discs behind it:

nvme0 at pci0 dev 14 function 0: vendor 80ee product 4e56 (rev. 0x00)
nvme0: NVMe 1.2
nvme0: interrupting at ioapic0 pin 22
nvme0: ORCL-VBOX-NVME-VER12, firmware 1.0, serial VB1234-56789
ld0 at nvme0 nsid 1
ld0: 0, 0 cyl, 16 head, 63 sec, 1 bytes/sect x 0 sectors

Code path is reached 4 times during normal boot, each time after wd0a
is already mounted; this patch avoids a crash with a dirty filesystem.
 1.128  22-May-2019  hannken branches: 1.128.2;
Implement disk_rename()/iostat_rename() to rename a disk.

Use it from zvol_rename_minor() when renaming a ZVOL.
 1.127  04-Apr-2019  christos move setdisklabel(9) into a separate file.
 1.126  04-Apr-2019  christos one more __func__
 1.125  04-Apr-2019  martin Make the DEBUG version compile
 1.124  03-Apr-2019  christos centralize setdisklabel(9)
 1.123  27-Mar-2019  martin Add a disk ioctl DIOCRMWEDGES to remove all wedges of a given disk
(if not busy).
 1.122  07-Mar-2018  kre branches: 1.122.2;

Fix typo in comment (s/is/if/) - NFC.
 1.121  27-Oct-2017  joerg branches: 1.121.2;
Revert printf return value change.
 1.120  27-Oct-2017  utkarsh009 [syzkaller] Cast all the printf's to (void *)
> as a result of new printf(9) declaration.
 1.119  01-Jun-2017  chs branches: 1.119.2;
remove checks for failure after memory allocation calls that cannot fail:

kmem_alloc() with KM_SLEEP
kmem_zalloc() with KM_SLEEP
percpu_alloc()
pserialize_create()
psref_class_create()

all of these paths include an assertion that the allocation has not failed,
so callers should not assert that again.
 1.118  05-Mar-2017  mlelstv Enhance disk metrics by calculating a weighted sum that is incremented
by the number of concurrent I/O requests. Also introduce a new disk_wait()
function to measure requests waiting in a bufq.
iostat -y now reports data about waiting and active requests.

So far only drivers using dksubr and dk, ccd, wd and xbd collect data about
waiting requests.
 1.117  28-Feb-2017  jakllsch pi_bsize must be at least pi_secsize

Allows block device accesses to 4KiB logical sector disks to function on the
vast majority of ports with 2KiB BLKDEV_IOSIZE.
 1.116  06-Jan-2016  christos branches: 1.116.2; 1.116.4;
print the disklabel information on error if DIAGNOSTIC.
 1.115  08-Dec-2015  christos Replace DIOCGPART -> DIOCGPARTINFO which returns the data needed instead of
pointers.
 1.114  28-Nov-2015  mlelstv Handle sector sizes other than DEV_BSIZE when reading labels.
 1.113  14-May-2015  chs in bounds_check_with_*, reject negative block numbers and avoid
a potential overflow in calculating the size of the request.
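
A minimal sketch of the kind of check this describes, assuming a simplified interface: the function name, its parameters, and the choice to reject rather than truncate an oversized request are all hypothetical and do not reproduce the kernel's bounds_check_with_* routines. It shows the two ideas from the log message: negative block numbers are refused, and the size comparison is arranged so that no intermediate sum can overflow.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Check a request of `bytes` bytes starting at sector `blkno` against a
 * partition of `part_sectors` sectors, each `secsize` bytes long.
 * Returns 0 if the request fits, EINVAL otherwise.
 */
static int
toy_bounds_check(int64_t blkno, int64_t bytes, int64_t part_sectors,
    uint32_t secsize)
{
	int64_t nsect;

	if (secsize == 0 || part_sectors < 0)
		return EINVAL;
	if (blkno < 0 || bytes < 0)
		return EINVAL;		/* negative block numbers are rejected */
	if (blkno > part_sectors)
		return EINVAL;		/* starts past the end */

	/* Round up to whole sectors without computing bytes + secsize - 1. */
	nsect = bytes / secsize + (bytes % secsize != 0);

	/* Compare against the remaining room instead of blkno + nsect. */
	if (nsect > part_sectors - blkno)
		return EINVAL;
	return 0;
}
```

Comparing `nsect` with `part_sectors - blkno` is safe because `blkno` has already been shown to lie in the range 0..part_sectors, so the subtraction cannot wrap.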
 1.112  05-May-2015  mlelstv Always fixup zero sector size, even when other geometry values are invalid.
 1.111  02-Jan-2015  christos - Use NODEV instead of 0
- Return EBUSY if there was no label
 1.110  31-Dec-2014  mlelstv Retire disk_blocksize().
 1.109  31-Dec-2014  christos Mention which ioctls need to move to dk_ioctl, and don't allow wedges on
wedges.
 1.108  31-Dec-2014  christos make more drivers use disk_ioctl, and add a dev parameter to it so that
we can merge the "easy" disklabel ioctls to it. Ultimately all this will
go to dk_ioctl once all the drivers have been converted.
 1.107  31-Dec-2014  christos Centralize wedge ioctls in disk_ioctl.
 1.106  31-Dec-2014  mlelstv disk_blocksize and disk_set_info relay the same information
to the disk subsystem.

Make disk_set_info also set blocksize shift values.
Remove every call to disk_blocksize.

Keep disk_blocksize for ABI compatibility, make it also set dg_secsize.
 1.105  29-Dec-2014  mlelstv clear error for new ioctls.
 1.104  29-Dec-2014  mlelstv Implement DIOCGMEDIASIZE and DIOCGSECTORSIZE from FreeBSD.
 1.103  19-Oct-2013  mlelstv branches: 1.103.4; 1.103.6;
use 64bit arithmetic to compute sectors-per-unit
 1.102  29-May-2013  christos branches: 1.102.2;
phase 1 of disk geometry cleanup:
- centralize the geometry -> plist code so that we don't have
n useless copies of it.
 1.101  09-Feb-2013  christos printflike maintenance.
 1.100  14-Oct-2010  mrg branches: 1.100.8; 1.100.18;
add some (uint64_t) casts to avoid 32 bit overflows. this fixes my
3TB disk with 4KB sectors and disklabel (which looks like it would
work up to 16TB.)

idea from mlelstv@.
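
The overflow being avoided can be sketched as follows, assuming a platform where unsigned int is 32 bits wide. The function names are hypothetical and the code is not taken from subr_disk.c; it only shows the effect of widening one operand before multiplying a sector count by a sector size.

```c
#include <stdint.h>

/* Buggy form: the product is computed in 32 bits and wraps. */
static uint64_t
toy_media_bytes_32(uint32_t secperunit, uint32_t secsize)
{
	return secperunit * secsize;
}

/* Fixed form: widen one operand first so the multiply is 64-bit. */
static uint64_t
toy_media_bytes_64(uint32_t secperunit, uint32_t secsize)
{
	return (uint64_t)secperunit * secsize;
}
```

A 3 TB disk with 4096-byte sectors has 732421875 sectors, so the first form cannot return the true byte count while the second can. With a 32-bit sector count and 4096-byte sectors the second form tops out just under 16 TiB, which is consistent with the limit mentioned in the log message.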
 1.99  28-Nov-2009  dsl branches: 1.99.2; 1.99.4;
When truncating a request in bounds_check_with_mediasize() multiply
by the provided sector size instead of 512.
Fixes last bit of PR/31565
 1.98  27-Nov-2009  tsutsui u_short -> uint16_t, some KNF.
 1.97  20-May-2009  dyoung On second thought, let's call disk_predetach() disk_begindetach().
Verbs are good.
 1.96  19-May-2009  dyoung Encapsulate the checks that I do before detaching a disk(9) provider
in a pre-detachment routine, disk_predetach().
 1.95  04-Apr-2009  ad Add disk_isbusy(), iostat_isbusy().
 1.94  22-Jan-2009  yamt branches: 1.94.2;
malloc -> kmem_alloc
 1.93  28-Apr-2008  martin branches: 1.93.8; 1.93.10;
Remove clause 3 and 4 from TNF licenses
 1.92  28-Feb-2008  matt branches: 1.92.2; 1.92.4;
constify dkdriver
 1.91  31-Jan-2008  dyoung branches: 1.91.2; 1.91.6;
Constify both struct disk->dk_name and the `name' argument to
disk_init().
 1.90  02-Jan-2008  ad Merge vmlocking2 to head.
 1.89  08-Oct-2007  ad branches: 1.89.4; 1.89.6; 1.89.10; 1.89.12;
Merge disk init changes from the vmlocking branch. These separate init /
destroy of 'struct disk' from attach / detach.
 1.88  29-Jul-2007  ad branches: 1.88.4; 1.88.6; 1.88.8; 1.88.10;
It's not a good idea for device drivers to modify b_flags, as they don't
need to understand the locking around that field. Instead of setting
B_ERROR, set b_error instead. b_error is 'owned' by whoever completes
the I/O request.
 1.87  21-Jul-2007  ad Replace some uses of lockmgr().
 1.86  24-Jun-2007  dyoung branches: 1.86.2;
Extract common code from i386, xen, and sparc64, creating
config_handle_wedges() and read_disk_sectors(). On x86, handle_wedges()
is a thin wrapper for config_handle_wedges(). Share opendisk()
across architectures.

Add kernel code in support of specifying a root partition by wedge
name. E.g., root specifications "wedge:wd0a", "wedge:David's Root
Volume" are possible. (Patches for config(1) coming soon.)

In support of moving disks between architectures (esp. i386 <->
evbmips), I've written a routine convertdisklabel() that ensures
that the raw partition is at RAW_DISK by following these steps:

0 If we have read a disklabel that has a RAW_PART with
p_offset == 0 and p_size != 0, then use that raw partition.

1 If we have read a disklabel that has both partitions 'c'
and 'd', and RAW_PART has p_offset != 0 or p_size == 0,
but the other partition is suitable for a raw partition
(p_offset == 0, p_size != 0), then swap the two partitions
and use the new raw partition.

2 If the architecture's raw partition is 'd', and if there
is no partition 'd', but there is a partition 'c' that
is suitable for a raw partition, then copy partition 'c'
to partition 'd'.

3 Determine the drive's last sector, using either the
d_secperunit the drive reported, or by guessing (0x1fffffff).
If we cannot read the drive's last sector, then fail.

4 If we have read a disklabel that has no partition slot
RAW_PART, then create a partition RAW_PART. Make it span
the whole drive.

5 If there are fewer than MAXPARTITIONS partitions,
then "slide" the unsuitable raw partition RAW_PART, and
subsequent partitions, into partition slots RAW_PART+1
and subsequent slots. Create a raw partition at RAW_PART.
Make it span the whole drive.

The convertdisklabel() procedure can probably stand to be simplified,
but it ought to deal with all but an extraordinarily broken disklabel,
now.

i386: compiled and tested, sparc64: compiled, evbmips: compiled.
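
Steps 0 and 1 of the procedure above can be sketched roughly as below. This is not convertdisklabel() itself: the structure, the slot indices, the reading of "has both partitions 'c' and 'd'" as "both slots lie below the partition count", and all `toy_` names are assumptions for illustration, and steps 2 through 5 are left out.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_NPART 8	/* hypothetical number of partition slots */
#define TOY_PART_C 2	/* slot index of partition 'c' */
#define TOY_PART_D 3	/* slot index of partition 'd' */

struct toy_part {
	uint32_t p_offset;
	uint32_t p_size;
};

/* A slot can serve as the raw partition if it starts at 0 and is non-empty. */
static bool
toy_raw_suitable(const struct toy_part *p)
{
	return p->p_offset == 0 && p->p_size != 0;
}

/*
 * Apply steps 0 and 1 only. `raw_part` must be TOY_PART_C or TOY_PART_D.
 * Returns true if slot `raw_part` now holds a suitable raw partition,
 * false if the later steps would be needed.
 */
static bool
toy_fix_raw_part(struct toy_part parts[TOY_NPART], int npartitions,
    int raw_part)
{
	int other = (raw_part == TOY_PART_C) ? TOY_PART_D : TOY_PART_C;

	/* Step 0: the existing raw partition is already usable. */
	if (raw_part < npartitions && toy_raw_suitable(&parts[raw_part]))
		return true;

	/* Step 1: both 'c' and 'd' exist and the other one is suitable: swap. */
	if (raw_part < npartitions && other < npartitions &&
	    toy_raw_suitable(&parts[other])) {
		struct toy_part tmp = parts[raw_part];
		parts[raw_part] = parts[other];
		parts[other] = tmp;
		return true;
	}
	return false;
}
```

For a label written on a machine whose raw partition is 'c' and read on one whose raw partition is 'd', the swap in step 1 leaves the whole-disk entry in slot 'd' and moves the former 'd' entry to slot 'c'.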
 1.85  04-Mar-2007  christos branches: 1.85.2; 1.85.4;
Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.
 1.84  01-Mar-2007  martin Split the disklabel checksum function into two, so we can pass the
length separately.
Use this for foreign-endianness labels in wedge autodiscovery, and
calculate the checksum of those before we swap various fields in the
label.
 1.83  25-Nov-2006  scw branches: 1.83.2; 1.83.4;
Replace the myriad copies of bounds_check_with_label() with a single MI
version.

Add disk_blocksize(9) so that disk drivers can record the physical
block size of a disk if it is different to DEV_BSIZE. Right now this
simply initialises dk_blkshift and dk_byteshift according to the
supplied block size. This information is used in the MI version of
bounds_check_with_label().
 1.82  25-Oct-2006  thorpej - Add a new disk ioctl (DIOCGDISKINFO) to get the disk-info dictionary
for the disk.
- Add a new function, disk_ioctl(), that does generic disk ioctl handling.
DIOCGDISKINFO is handled here now, and others will be added in the future.
- In the wd driver, fill in the dk_info member of struct disk and use the
new disk_ioctl() function.
 1.81  22-Sep-2006  thorpej - Define disk information, disk geometry, and disk partition dictionary
schemas. Disk information and disk geometry are designed to replace
information currently conveyed to user space using struct disklabel.
- Add a dk_info member to struct disk; a reference to a disk information
dictionary. This dictionary is to be allocated and the reference stored
in struct disk by individual drivers.
- disk_detach0() will release dk_info if non-NULL.
- Convert the wd(4) driver to stash geometry and other disk properties
as the "disk-info" property in its properties dictionary. This needs
some cleanup, but will serve as an example of what to do with other
disk drivers.
 1.80  23-Aug-2006  christos branches: 1.80.2; 1.80.4;
Change iostat_alloc() to take the parent pointer and the name directly, so
that callers are not responsible for initializing the fields. Store the name
inside the struct instead of maintaining a pointer to external storage, or
leaked memory (nfs case).
 1.79  07-Jun-2006  kardel merge FreeBSD timecounters from branch simonb-timecounters
- struct timeval time is gone
time.tv_sec -> time_second
- struct timeval mono_time is gone
mono_time.tv_sec -> time_uptime
- access to time via
{get,}{micro,nano,bin}time()
get* versions are fast but less precise
- support NTP nanokernel implementation (NTP API 4)
- further reading:
Timecounter Paper: http://phk.freebsd.dk/pubs/timecounter.pdf
NTP Nanokernel: http://www.eecis.udel.edu/~mills/ntp/html/kern.html
 1.78  21-Apr-2006  yamt branches: 1.78.2;
iostat_find/disk_find: constify and simplify.
 1.77  21-Apr-2006  yamt remove some unnecessary #include.
 1.76  21-Apr-2006  yamt whitespace.
 1.75  20-Apr-2006  blymn Prefix iostat structure elements with io_
 1.74  14-Apr-2006  blymn Make i/o statistics collection more generic, include tape drives and
nfs mounts in the set of devices that statistics will be reported on.
 1.73  26-Dec-2005  perry branches: 1.73.4; 1.73.6; 1.73.8; 1.73.10; 1.73.12;
u_intN_t -> uintN_t
 1.72  11-Dec-2005  christos merge ktrace-lwp.
 1.71  15-Oct-2005  yamt - change the way to specify a bufq strategy. (by string rather than by number)
- rather than embedding bufq_state in driver softc,
have a pointer to the former.
- move bufq related functions from kern/subr_disk.c to kern/subr_bufq.c.
- rename method to strategy for consistency.
- move some definitions which don't need to be exposed to the rest of kernel
from sys/bufq.h to sys/bufq_impl.h.
(is it better to move it to kern/ or somewhere?)
- fix some obvious breakage in dev/qbus/ts.c. (not tested)
 1.70  20-Aug-2005  yamt introduce a variant of disk_attach/detach, for pseudo disks
which are opened by the user before being attached.
 1.69  29-May-2005  christos branches: 1.69.2;
- add const.
- remove unnecessary casts.
- add __UNCONST casts and mark them with XXXUNCONST as necessary.
 1.68  31-Mar-2005  yamt introduce a function to drain bufq and use it where appropriate.
 1.67  08-Feb-2005  fvdl branches: 1.67.4;
Change the 'sz' variable in bounds_check_* to int64_t to avoid overflows
when a very large blocknumber is passed in.
 1.66  06-Feb-2005  christos Change an if/panic statement to a KASSERT and disable a chatty printf.
 1.65  25-Nov-2004  yamt branches: 1.65.4; 1.65.6;
lookup bufq using link_set rather than a switch statement.
 1.64  28-Oct-2004  yamt move buffer queue related stuffs from buf.h to their own header, bufq.h.
 1.63  15-Oct-2004  thorpej - Eliminate the need to call disk_init().
- disk_count needs to be protected with disklist_slock, too.
 1.62  14-Oct-2004  yamt move i/o schedulers to their own files.
namely, from kern/subr_disk.c to kern/bufq_{fcfs,disksort,readprio,priocscan}.c.
 1.61  25-Sep-2004  thorpej Work-in-progress implementation of "wedges", a new way to represent
partitions in the NetBSD kernel. See discussion on tech-kern for details.
 1.60  09-Mar-2004  yamt - add a function prototype.
- constify.
 1.59  28-Feb-2004  yamt change the way to handle NEW_BUFQ_STRATEGY option.
instead of putting #ifdefs into each drivers,
use a global variable to indicate default strategy.

XXX should have a way to specify other strategies.
 1.58  10-Jan-2004  yamt add a new bufq strategy, BUFQ_PRIOCSCAN (per-priority CSCAN).
discussed on tech-kern@
 1.57  06-Dec-2003  yamt rev.1.55 didn't handle BUFQ_SORT_CYLINDER case correctly.
pointed by Juergen Hannken-Illjes. patch provided by him.
 1.56  06-Dec-2003  he Make sure buf_inorder() returns a value under all conditions.
 1.55  05-Dec-2003  yamt buf_inorder: deal with 64-bit daddr_t correctly.
 1.54  04-Dec-2003  atatat Dynamic sysctl.

Gone are the old kern_sysctl(), cpu_sysctl(), hw_sysctl(),
vfs_sysctl(), etc, routines, along with sysctl_int() et al. Now all
nodes are registered with the tree, and nodes can be added (or
removed) easily, and I/O to and from the tree is handled generically.

Since the nodes are registered with the tree, the mapping from name to
number (and back again) can now be discovered, instead of having to be
hard coded. Adding new nodes to the tree is likewise much simpler --
the new infrastructure handles almost all the work for simple types,
and just about anything else can be done with a small helper function.

All existing nodes are where they were before (numerically speaking),
so all existing consumers of sysctl information should notice no
difference.

PS - I'm sorry, but there's a distinct lack of documentation at the
moment. I'm working on sysctl(3/8/9) right now, and I promise to
watch out for buses.
 1.53  07-Aug-2003  agc Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22364, verified by myself.
 1.52  13-Apr-2003  dsl branches: 1.52.2;
CONSTCONT should have been CONSTCOND
 1.51  13-Apr-2003  dsl Fix error message for 64bit daddr_t
 1.50  03-Apr-2003  fvdl Add a bounds_check_with_mediasize function, which is intended
for checking RAW_PART transfers (and later raw disk devices).
 1.49  06-Nov-2002  enami Factor out the COMPAT_16 code.
 1.48  05-Nov-2002  mrg - do the COMPAT_16 dance in sysctl_diskstats() for the where == NULL case
as well. pointed out by enami@.
- defflag COMPAT_16.
 1.47  04-Nov-2002  mrg repair backwards compatibility with netbsd 1.6 - if we are not given the
wanted sizeof(struct disk_sysctl), use the old size. for non-COMPAT_16,
however, we return EINVAL so that all future programs are forced into
passing the wanted size. 1.6 iostat(8) works with -current kernel again.

as seen on tech-kern.
 1.46  01-Nov-2002  simonb When calculating the space needed for the data, use the supplied
userland structure size (if passed in).
Use the supplied userland structure size (if passed in) to check if
there is enough room to copyout the next structure.
 1.45  01-Nov-2002  mrg implement separate read/write disk statistics:
- disk_unbusy() gets a new parameter to tell the IO direction.
- struct disk_sysctl gets 4 new members for read/write bytes/transfers.
when processing hw.diskstats, add the read&write bytes/transfers for
the old combined stats to attempt to keep backwards compatibility.

unfortunately, due to multiple bugs, this will cause new kernels and old
vmstat/iostat/systat programs to fail. however, the next time this is
changed, it will not fail again.

this is just the kernel portion.
 1.44  01-Nov-2002  enami Make this work with QUEUEDEBUG defined; don't use queue pointer after
removing an element from queue.
 1.43  01-Nov-2002  enami Cosmetic changes.
 1.42  30-Aug-2002  hannken Remove the old device buffer queue interface.

Approved by: Jason R. Thorpe <thorpej@wasabisystems.com>
 1.41  23-Jul-2002  hannken The buffer returned by BUFQ_PEEK must remain the same until BUFQ_GET is
called. It may be used as the "current" buffer.
 1.40  21-Jul-2002  hannken Rename bufq_init() to bufq_alloc().
Add bufq_free() to remove a buffer queue.
Avoid MALLOC while holding a spinlock.

From Chuck Silvers.
 1.39  16-Jul-2002  hannken Implement a new device buffer queue interface.
One basic struct, a function to setup a queue with a specific strategy and
three macros to put buf's into the queue, get and remove the next buf or
get the next buf without removal.

The BUFQ_XXX interface will be removed in the future.
The B_ORDERED flag is no longer supported.

Approved by: Jason R. Thorpe <thorpej@wasabisystems.com>
 1.38  28-Jun-2002  yamt constify diskerr().
 1.37  16-Feb-2002  enami branches: 1.37.8; 1.37.10;
Use sizeof correctly. Fixes PR#15613.
 1.36  16-Feb-2002  enami - Wrap long line.
- Remove unnecessary semi-colon.
 1.35  28-Jan-2002  simonb Remember to update the "size copied" counter in sysctl_diskstats().
 1.34  28-Jan-2002  simonb Use TAILQ_FOREACH().
 1.33  27-Jan-2002  simonb Implement the hw.disknames and hw.diskstats sysctl's that have been listed
in <sys/sysctl.h> since day one but never implemented.
 1.32  30-Nov-2001  enami Use cached pointer to next buf instead of re-fetching it. GCC actually
generates different code.
 1.31  13-Nov-2001  lukem add RCSID
 1.30  09-Jul-2001  simonb branches: 1.30.2; 1.30.4;
ANSIfy.
 1.29  30-Mar-2000  augustss branches: 1.29.6;
Get rid of register declarations.
 1.28  07-Feb-2000  thorpej Fix a bug in disksort_*() which caused non-optimal ordering when multiple
active partitions were on a single spindle. Add a b_rawblkno member to
struct buf which contains the non-partition-relative block number to sort
by.
 1.27  28-Jan-2000  hannken The decision that `disksort_cylinder' uses to decide if the buffer needs
to go to the inversion list is incomplete. If the cylinders are equal
block numbers must be checked.

This caused lockups if some buffers with the same cylinder were cycling
through the list, as it may happen with softdep enabled.

Fixes PR #9197.
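
The ordering rule described here, where block numbers break ties between equal cylinders, can be sketched as a comparison function. The `toy_buf` structure and `toy_before` are hypothetical stand-ins and do not reproduce disksort_cylinder() or its inversion-list handling.

```c
#include <stdbool.h>
#include <stdint.h>

/* Only the two sort keys of a buffer, for illustration. */
struct toy_buf {
	uint32_t b_cylinder;
	int64_t b_blkno;
};

/* True if request `a` sorts strictly before request `b`. */
static bool
toy_before(const struct toy_buf *a, const struct toy_buf *b)
{
	if (a->b_cylinder != b->b_cylinder)
		return a->b_cylinder < b->b_cylinder;
	/* Equal cylinders: fall back to the block number. */
	return a->b_blkno < b->b_blkno;
}
```

Without the second comparison, two requests on the same cylinder have no defined order relative to each other, which is the gap the log message describes.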
 1.26  21-Jan-2000  thorpej - Add a B_ORDERED flag to communicate to drivers that an I/O request should
be issued/completed in order; that is, provide a barrier for I/O queues.
- Change the buffer driver queue links to a TAILQ, rather than using
a home-grown equivalent. Provide BUFQ_*() macros to manipulate buffer
queues; these deal with the barrier provided by B_ORDERED.
- Update disksort() accordingly, and provide 3 versions:
- disksort_cylinder(): historical disksort(), which keys on
b_cylinder (and b_blkno for the case when b_cylinder matches).
- disksort_blkno(): sorts only on b_blkno. Essentially the
same as disksort_cylinder(), but with fewer comparisons.
- disksort_tail(): requests are simply inserted into the queue
at the tail. This is provided as an option so that drivers
can simply have a pointer to the appropriate sort function.
Note that disksort() now pays attention to B_ORDERED.
 1.25  22-Feb-1999  drochner branches: 1.25.8; 1.25.14;
PR kern/7033 (Izumi Tsutsui <tsutsui@ceres.dti.ne.jp>): use
device minor to unit/partition macros from sys/disklabel.h
 1.24  04-Aug-1998  perry Abolition of bcopy, ovbcopy, bcmp, and bzero, phase one.
bcopy(x, y, z) -> memcpy(y, x, z)
ovbcopy(x, y, z) -> memmove(y, x, z)
bcmp(x, y, z) -> memcmp(x, y, z)
bzero(x, y) -> memset(x, 0, y)
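
The mappings above change argument order, not just names. A small sketch of two of them follows; the `toy_` wrapper names are hypothetical and exist only to keep the old parameter order visible next to the new calls.

```c
#include <string.h>

/* Copy `len` bytes the way bcopy(src, dst, len) did, using memcpy. */
static void
toy_copy(const void *src, void *dst, size_t len)
{
	memcpy(dst, src, len);		/* note: destination comes first */
}

/* Zero `len` bytes the way bzero(buf, len) did, using memset. */
static void
toy_zero(void *buf, size_t len)
{
	memset(buf, 0, len);		/* fill value precedes the length */
}
```

memcpy requires that the two regions do not overlap, which is why the overlapping-copy case in the list maps to memmove instead.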
 1.23  30-Dec-1997  thorpej Rearrange disk_detach() slightly, and make a small run-time cosmetic
change in disk_unbusy().
 1.22  05-Oct-1997  thorpej Copyright assigned to The NetBSD Foundation.
 1.21  17-Oct-1996  perry branches: 1.21.10;
removed #ifdef tahoe
 1.20  13-Oct-1996  christos backout previous kprintf change
 1.19  10-Oct-1996  christos printf -> kprintf, sprintf -> ksprintf
 1.18  12-Jul-1996  thorpej Remove old-style disk instrumentation code.
 1.17  16-Mar-1996  christos Fix printf() formats.
 1.16  09-Feb-1996  christos More proto fixes
 1.15  07-Jan-1996  thorpej New generic disk framework. Highlights:

- New metrics handling. Metrics are now kept in the new
`struct disk'. Busy time is now stored as a timeval, and
transfer count in bytes.

- Storage for disklabels is now dynamically allocated, so that
the size of the disk structure is not machine-dependent.

- Several new functions for attaching and detaching disks, and
handling metrics calculation.

Old-style instrumentation is still supported in drivers that did it before.
However, old-style instrumentation is being deprecated, and will go away
once the userland utilities are updated for the new framework.

For usage and architectural details, see the forthcoming disk(9) manual
page.
 1.14  28-Dec-1995  thorpej Move the old-style disk instrumentation "structures" to a central location
(sys/kern/subr_disk.c) and note that they should/will be deprecated.
 1.13  29-Mar-1995  mycroft Make definition of b_cylinder global.
 1.12  29-Jun-1994  cgd New RCS ID's, take two. they're more aesthetically pleasant, and use 'NetBSD'
 1.11  19-May-1994  mycroft Update to 4.4-Lite.
 1.10  10-Feb-1994  mycroft Don't need back pointers for disksort().
 1.9  06-Feb-1994  mycroft Remove another use of b_actl.
 1.8  06-Feb-1994  mycroft Use b_actf, not av_forw.
 1.7  23-Jan-1994  glass remove warning
 1.6  11-Jan-1994  mycroft Get rid of disklabel indirection functions.
 1.5  17-Dec-1993  mycroft Canonicalize all #includes.
 1.4  05-Sep-1993  mycroft branches: 1.4.2;
Add \n to end of error message.
 1.3  20-May-1993  deraadt more disklabel changes
 1.2  20-May-1993  cgd add rcs ids, and clean up headers where necessary
 1.1  21-Mar-1993  cgd branches: 1.1.1;
Initial revision
 1.1.1.1  21-Mar-1993  cgd initial import of 386bsd-0.1 sources
 1.4.2.3  14-Nov-1993  mycroft Canonicalize all #includes.
 1.4.2.2  30-Sep-1993  deraadt delete hopping-functions cpu_{read,write,set}disklabel()
 1.4.2.1  29-Sep-1993  mycroft Strategy functions return void.
 1.21.10.1  14-Oct-1997  thorpej Update marc-pcmcia branch from trunk.
 1.25.14.1  21-Dec-1999  wrstuden Initial commit of recent changes to make DEV_BSIZE go away.

Runs on i386, needs work on other arch's. Main kernel routines should be
fine, but a number of the stand programs need help.

cd, fd, ccd, wd, and sd have been updated. sd has been tested with non-512
byte block devices. vnd, raidframe, and lfs need work.

Non 2**n block support is automatic for LKM's and conditional for kernels
on "options NON_PO2_BLOCKS".
 1.25.8.1  20-Nov-2000  bouyer Update thorpej_scsipi to -current as of a month ago
 1.29.6.7  11-Nov-2002  nathanw Catch up to -current
 1.29.6.6  17-Sep-2002  nathanw Catch up to -current.
 1.29.6.5  01-Aug-2002  nathanw Catch up to -current.
 1.29.6.4  28-Feb-2002  nathanw Catch up to -current.
 1.29.6.3  08-Jan-2002  nathanw Catch up to -current.
 1.29.6.2  14-Nov-2001  nathanw Catch up to -current.
 1.29.6.1  24-Aug-2001  nathanw Catch up with -current.
 1.30.4.1  07-Sep-2001  thorpej Commit my "devvp" changes to the thorpej-devvp branch. This
replaces the use of dev_t in most places with a struct vnode *.

This will form the basic infrastructure for real cloning device
support (besides being architecturally cleaner -- it'll be good
to get away from using numbers to represent objects).
 1.30.2.4  06-Sep-2002  jdolecek sync kqueue branch with HEAD
 1.30.2.3  16-Mar-2002  jdolecek Catch up with -current.
 1.30.2.2  11-Feb-2002  jdolecek Sync w/ -current.
 1.30.2.1  10-Jan-2002  thorpej Sync kqueue branch with -current.
 1.37.10.1  22-Jul-2002  lukem Pull up revision 1.38 (requested by yamt in ticket #536):
constify diskerr().
 1.37.8.4  31-Aug-2002  gehenna catch up with -current.
 1.37.8.3  29-Aug-2002  gehenna catch up with -current.
 1.37.8.2  20-Jul-2002  gehenna catch up with -current.
 1.37.8.1  15-Jul-2002  gehenna catch up with -current.
 1.52.2.10  10-Nov-2005  skrll Sync with HEAD. Here we go again...
 1.52.2.9  01-Apr-2005  skrll Sync with HEAD.
 1.52.2.8  09-Feb-2005  skrll Sync with HEAD.
 1.52.2.7  07-Feb-2005  skrll Sync with HEAD.
 1.52.2.6  29-Nov-2004  skrll Sync with HEAD.
 1.52.2.5  02-Nov-2004  skrll Sync with HEAD.
 1.52.2.4  19-Oct-2004  skrll Sync with HEAD
 1.52.2.3  21-Sep-2004  skrll Fix the sync with head I botched.
 1.52.2.2  18-Sep-2004  skrll Sync with HEAD.
 1.52.2.1  03-Aug-2004  skrll Sync with HEAD
 1.65.6.1  12-Feb-2005  yamt sync with head.
 1.65.4.1  29-Apr-2005  kent sync with -current
 1.67.4.1  06-Apr-2005  tron Pull up revision 1.68 (requested by yamt in ticket #112):
introduce a function to drain bufq and use it where appropriate.
 1.69.2.7  17-Mar-2008  yamt sync with head.
 1.69.2.6  04-Feb-2008  yamt sync with head.
 1.69.2.5  21-Jan-2008  yamt sync with head
 1.69.2.4  27-Oct-2007  yamt sync with head.
 1.69.2.3  03-Sep-2007  yamt sync with head.
 1.69.2.2  30-Dec-2006  yamt sync with head.
 1.69.2.1  21-Jun-2006  yamt sync with head.
 1.73.12.1  24-May-2006  tron Merge 2006-05-24 NetBSD-current into the "peter-altq" branch.
 1.73.10.2  11-May-2006  elad sync with head
 1.73.10.1  19-Apr-2006  elad sync with head.
 1.73.8.3  03-Sep-2006  yamt sync with head.
 1.73.8.2  26-Jun-2006  yamt sync with head.
 1.73.8.1  24-May-2006  yamt sync with head.
 1.73.6.2  22-Apr-2006  simonb Sync with head.
 1.73.6.1  04-Feb-2006  simonb Adapt for timecounters: mostly use get*time() and use "time_second"
instead of "time.tv_sec".
 1.73.4.1  09-Sep-2006  rpaulo sync with head
 1.78.2.1  19-Jun-2006  chap Sync with head.
 1.80.4.2  10-Dec-2006  yamt sync with head.
 1.80.4.1  22-Oct-2006  yamt sync with head
 1.80.2.2  12-Jan-2007  ad Sync with head.
 1.80.2.1  18-Nov-2006  ad Sync with head.
 1.83.4.1  12-Mar-2007  rmind Sync with HEAD.
 1.83.2.1  21-Nov-2010  riz Pull up following revision(s) (requested by mrg in ticket #1411):
sys/kern/subr_disk.c: revision 1.100
add some (uint64_t) casts to avoid 32 bit overflows. this fixes my
3TB disk with 4KB sectors and disklabel (which looks like it would
work up to 16TB.)
idea from mlelstv@.
 1.85.4.1  11-Jul-2007  mjf Sync with head.
 1.85.2.6  24-Aug-2007  ad Sync with buffer cache locking changes. See buf.h/vfs_bio.c for details.
Some minor portions are incomplete and need to be verified as a whole.
 1.85.2.5  20-Aug-2007  ad Sync with head.
 1.85.2.4  20-Aug-2007  ad Sync with HEAD.
 1.85.2.3  20-Aug-2007  ad - Alter disk attach/detach to fix a panic when closing a vnd device.
- Sync with HEAD.
 1.85.2.2  19-Aug-2007  ad - Back out the biodone() changes.
- Eliminate B_ERROR (from HEAD).
 1.85.2.1  15-Jul-2007  ad Sync with head.
 1.86.2.1  15-Aug-2007  skrll Sync with HEAD.
 1.88.10.2  29-Jul-2007  ad It's not a good idea for device drivers to modify b_flags, as they don't
need to understand the locking around that field. Instead of setting
B_ERROR, set b_error instead. b_error is 'owned' by whoever completes
the I/O request.
 1.88.10.1  29-Jul-2007  ad file subr_disk.c was added on branch matt-mips64 on 2007-07-29 12:15:46 +0000
 1.88.8.1  14-Oct-2007  yamt sync with head.
 1.88.6.3  23-Mar-2008  matt sync with HEAD
 1.88.6.2  09-Jan-2008  matt sync with HEAD
 1.88.6.1  06-Nov-2007  matt sync with HEAD
 1.88.4.1  26-Oct-2007  joerg Sync with HEAD.

Follow the merge of pmap.c on i386 and amd64 and move
pmap_init_tmp_pgtbl into arch/x86/x86/pmap.c. Modify the ACPI wakeup
code to restore CR4 before jumping back into kernel space as the large
page option might cover that.
 1.89.12.1  30-Jan-2008  cube constify disk->dk_name.
 1.89.10.1  02-Jan-2008  bouyer Sync with HEAD
 1.89.6.1  04-Dec-2007  ad Pull the vmlocking changes into a new branch.
 1.89.4.1  18-Feb-2008  mjf Sync with HEAD.
 1.91.6.2  02-Jun-2008  mjf Sync with HEAD.
 1.91.6.1  03-Apr-2008  mjf Sync with HEAD.
 1.91.2.1  24-Mar-2008  keiichi sync with head.
 1.92.4.4  11-Mar-2010  yamt sync with head
 1.92.4.3  20-Jun-2009  yamt sync with head
 1.92.4.2  04-May-2009  yamt sync with head.
 1.92.4.1  16-May-2008  yamt sync with head.
 1.92.2.1  18-May-2008  yamt sync with head.
 1.93.10.3  07-Jan-2011  riz Pull up following revision(s) (requested by mrg in ticket #1520):
sys/sys/device.h: revision 1.133
sys/kern/subr_disk.c: patch
Add helper function that determines the size and block size of a disk device.
For now we query
- the disk label
- the wedge info and data from disk(9)
 1.93.10.2  21-Nov-2010  riz Pull up following revision(s) (requested by mrg in ticket #1463):
sys/kern/subr_disk.c: revision 1.100
add some (uint64_t) casts to avoid 32 bit overflows. this fixes my
3TB disk with 4KB sectors and disklabel (which looks like it would
work up to 16TB.)
idea from mlelstv@.
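The overflow described above can be sketched as follows. This is only an illustration, not the code from revision 1.100: the function names, the SKETCH_DEV_BSIZE constant, and the assumption that the overflowing expression converts a 32-bit sector count into 512-byte blocks are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed block size for this sketch only; not taken from the kernel headers. */
#define SKETCH_DEV_BSIZE 512u

/*
 * Hypothetical helper: the multiplication is carried out in 32 bits,
 * so the result wraps modulo 2^32 for large disks.
 */
static uint32_t
blocks_narrow(uint32_t nsect, uint32_t secsize)
{
	assert(secsize >= SKETCH_DEV_BSIZE);
	return nsect * (secsize / SKETCH_DEV_BSIZE);
}

/*
 * Hypothetical helper: casting one operand to uint64_t first makes
 * the whole multiplication 64 bits wide, so it cannot wrap here.
 */
static uint64_t
blocks_wide(uint32_t nsect, uint32_t secsize)
{
	assert(secsize >= SKETCH_DEV_BSIZE);
	return (uint64_t)nsect * (secsize / SKETCH_DEV_BSIZE);
}
```

With 732421875 sectors of 4096 bytes (about 3TB), the wide version yields 5859375000 blocks, while the narrow version wraps. A 32-bit sector count with 4KB sectors tops out at 2^32 * 4KB = 16TB, which matches the limit mentioned in the log message.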
 1.93.10.1  04-Apr-2009  snj Pull up following revision(s) (requested by ad in ticket #657):
sys/kern/subr_disk.c: revision 1.95
sys/kern/subr_iostat.c: revision 1.17
sys/sys/disk.h: revision 1.52
sys/sys/iostat.h: revision 1.10
Add disk_isbusy(), iostat_isbusy().
 1.93.8.2  28-Apr-2009  skrll Sync with HEAD.
 1.93.8.1  03-Mar-2009  skrll Sync with HEAD.
 1.94.2.2  23-Jul-2009  jym Sync with HEAD.
 1.94.2.1  13-May-2009  jym Sync with HEAD.

Commit is split, to avoid a "too many arguments" protocol error.
 1.99.4.1  05-Mar-2011  rmind sync with head
 1.99.2.1  22-Oct-2010  uebayasi Sync with HEAD (-D20101022).
 1.100.18.7  03-Dec-2017  jdolecek update from HEAD
 1.100.18.6  20-Aug-2014  tls Rebase to HEAD as of a few days ago.
 1.100.18.5  23-Jun-2013  tls resync from head
 1.100.18.4  25-Feb-2013  tls resync with head
 1.100.18.3  10-Feb-2013  tls Add an accessor -- ufs_maxphys() -- to check the maximum transfer size
for a given UFS mountpoint, and move the code from mount that finds
the underlying disk and resets the mountpoint max transfer size into a
utility function, ufs_update_maxphys().

Add a global serial number that counts disk property changes to which
filesystems are meant to accommodate themselves. Make ufs_maxphys()
check it. This is a sort of flag-polling interface that avoids callbacks
into the filesystem code, but will require freezing filesystems and
draining in-flight transactions before a decrease in size that is
mandatory (like attaching a disk with a smaller maximum transfer size
as a spare in a RAIDframe set), rather than "advisory", like finding
out set geometry from a RAID controller long after boot and deciding
a smaller transfer size would be optimal, can be signalled. Still, the
"advisory" case is the common one so this is progress.

Make a bit of an example of RAIDframe by making it bump this new
serial number when disks are added to the subsystem. I will attack
one of the hardware RAID drivers (probably arcmsr) next.
 1.100.18.2  02-Dec-2012  tls Don't pass NULL struct dkdriver to disk_init. That's seriously bogus.
 1.100.18.1  12-Sep-2012  tls Initial snapshot of work to eliminate 64K MAXPHYS. Basically works for
physio (I/O to raw devices); needs more doing to get it going with the
filesystems, but it shouldn't damage data.

All work's been done on amd64 so far. Not hard to add support to other
ports. If others want to pitch in, one very helpful thing would be to
sort out when and how IDE disks can do 128K or larger transfers, and
adjust the various PCI IDE (or at least ahcisata) drivers and wd.c
accordingly -- it would make testing much easier. Another very helpful
thing would be to implement a smart minphys() for RAIDframe along the
lines detailed in the MAXPHYS-NOTES file.
 1.100.8.1  22-May-2014  yamt sync with head.

for reference, the tree before this commit was tagged
as yamt-pagecache-tag8.

this commit was split into small chunks to avoid
a limitation of cvs. ("Protocol error: too many arguments")
 1.102.2.1  18-May-2014  rmind sync with head
 1.103.6.5  28-Aug-2017  skrll Sync with HEAD
 1.103.6.4  19-Mar-2016  skrll Sync with HEAD
 1.103.6.3  27-Dec-2015  skrll Sync with HEAD (as of 26th Dec)
 1.103.6.2  06-Jun-2015  skrll Sync with HEAD
 1.103.6.1  06-Apr-2015  skrll Sync with HEAD
 1.103.4.2  01-Jun-2015  snj Pull up following revision(s) (requested by jnemeth in ticket #775):
share/man/man9/disk.9: revision 1.37
sys/kern/subr_disk.c: revisions 1.104, 1.105
sys/dev/dksubr.c: revision 1.56
sys/sys/dkio.h: revision 1.21
Implement DIOCGMEDIASIZE and DIOCGSECTORSIZE from FreeBSD.
--
clear error for new ioctls.
 1.103.4.1  19-May-2015  snj Pull up following revision(s) (requested by chs in ticket #766):
sys/kern/subr_disk.c: revision 1.113
in bounds_check_with_*, reject negative block numbers and avoid
a potential overflow in calculating the size of the request.
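A minimal sketch of the kind of check described above, assuming a signed block number and unsigned sizes. The helper name and parameters are hypothetical and this is not the kernel's bounds_check_with_* code; in particular, the sketch simply rejects a request that runs past the end, which may differ from how the real routines treat requests at end of media.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper for illustration only.
 *   blkno:   first block of the request (signed, so it can be negative)
 *   nblks:   length of the request in blocks
 *   maxblks: size of the partition or media in blocks
 */
static bool
request_in_bounds(int64_t blkno, uint64_t nblks, uint64_t maxblks)
{
	/* Reject negative block numbers outright. */
	if (blkno < 0)
		return false;

	/* The start must not lie beyond the end. */
	if ((uint64_t)blkno > maxblks)
		return false;

	/*
	 * Compare the length against the space that remains instead of
	 * computing blkno + nblks, which could overflow.
	 */
	if (nblks > maxblks - (uint64_t)blkno)
		return false;

	return true;
}
```

The subtraction is safe because the previous test has already established that blkno does not exceed maxblks.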
 1.116.4.1  21-Apr-2017  bouyer Sync with HEAD
 1.116.2.1  20-Mar-2017  pgoyette Sync with HEAD
 1.119.2.3  29-Mar-2020  martin Pull up following revision(s) (requested by mlelstv in ticket #1527):

sys/dev/scsipi/cd.c: revision 1.343
sys/kern/subr_disk.c: revision 1.130

Avoid division by zero if label isn't valid.
Allow open of RAWPART even when no medium is loaded.
Keep errors silent if no medium is loaded.
Fixes PR kern/55104
 1.119.2.2  01-Nov-2019  martin Pull up following revision(s) (requested by cnst in ticket #1397):

sys/kern/subr_disk.c: revision 1.129

kern/subr_disk: bounds_check_with_label: really protect against div by zero

Solves kernel panic in NetBSD 8.1 amd64 on VirtualBox 6.0.12 r133076.

Triggered with an NVMe controller without any actual discs behind it:

nvme0 at pci0 dev 14 function 0: vendor 80ee product 4e56 (rev. 0x00)
nvme0: NVMe 1.2
nvme0: interrupting at ioapic0 pin 22
nvme0: ORCL-VBOX-NVME-VER12, firmware 1.0, serial VB1234-56789
ld0 at nvme0 nsid 1
ld0: 0, 0 cyl, 16 head, 63 sec, 1 bytes/sect x 0 sectors

Code path is reached 4 times during normal boot, each time after wd0a
is already mounted; this patch avoids a crash with a dirty filesystem.
 1.119.2.1  05-Apr-2019  msaitoh Pull up following revision(s) (requested by martin in ticket #1223):
sys/sys/dkio.h: revision 1.25
sys/kern/subr_disk.c: revision 1.123
sys/dev/dksubr.c: revision 1.107
sys/dev/ccd.c: revision 1.179
sys/dev/ofw/ofdisk.c: revision 1.53
Add a disk ioctl DIOCRMWEDGES to remove all wedges of a given disk
(if not busy).
 1.121.2.1  15-Mar-2018  pgoyette Synch with HEAD
 1.122.2.3  13-Apr-2020  martin Mostly merge changes from HEAD up to 20200411
 1.122.2.2  08-Apr-2020  martin Merge changes from current as of 20200406
 1.122.2.1  10-Jun-2019  christos Sync with HEAD
 1.128.2.1  02-Apr-2020  martin Pull up following revision(s) (requested by mlelstv in ticket #814):

sys/dev/scsipi/cd.c: revision 1.343
sys/kern/subr_disk.c: revision 1.130

Avoid division by zero if label isn't valid.

Allow open of RAWPART even when no medium is loaded.
Keep errors silent if no medium is loaded.

Fixes PR kern/55104
 1.132.8.1  31-May-2021  cjep sync with head
 1.132.6.1  17-Jun-2021  thorpej Sync w/ HEAD.
 1.134.4.1  01-Aug-2023  martin Pull up following revision(s) (requested by riastradh in ticket #284):

sys/dev/dkwedge/dk.c 1.125-1.158
sys/kern/subr_disk.c 1.135-1.137
sys/sys/disk.h 1.78

dk(4): Explain why dk_rawopens can't overflow and assert it.

dk(4): Restore assertions in dklastclose.

We only enter dklastclose if the wedge is open (sc->sc_dk.dk_openmask
!= 0), which can happen only if dkfirstopen has succeeded, in which
case we hold a dk_rawopens reference to the parent that prevents
anyone else from closing it. Hence sc->sc_parent->dk_rawopens > 0.

On open, sc->sc_parent->dk_rawvp is set to nonnull, and it is only
reset to null on close. Hence if the parent is still open, as it
must be here, sc->sc_parent->dk_rawvp must be nonnull.

dk(4): Avoid holding dkwedges_lock while allocating array.

This is not great -- we shouldn't be choosing the unit number here
anyway; we should just let autoconf do it for us -- but it's better
than potentially blocking any dk_openlock or dk_rawlock (which are
sometimes held when waiting for dkwedges_lock) for memory allocation.

dk(4): KNF: return (v) -> return v.
No functional change intended.

dk(4): KNF: Whitespace.
No functional change intended.

dk(4): Omit needless void * cast.
No functional change intended.

dk(4): Fix typo in comment: dkstrategy, not dkstragegy.
No functional change intended.

dk(4): ENXIO, not ENODEV, means no such device.
ENXIO is `device not configured', meaning there is no such device.
ENODEV is `operation not supported by device', meaning the device is
there but refuses the operation, like writing to a read-only medium.

Exception: For undefined ioctl commands, it's not ENODEV _or_ ENXIO,
but rather ENOTTY, because why make any of this obvious when you
could make it obscure Unix lore?

dk(4): KNF: Sort includes.
No functional change intended.

dk(4): <sys/rwlock.h> for rwlock(9).

dk(4): Prevent races in access to struct dkwedge_softc::sc_size.
Rules:
1. Only ever increases, never decreases.
(Decreases require removing and readding the wedge.)
2. Increases are serialized by dk_openlock.
3. Reads can happen unlocked in any context where the softc is valid.

Access is gathered into dkwedge_size* subroutines -- don't touch
sc_size outside these. For now, we use rwlock(9) to keep the
reasoning simple. This should be done with atomics on 64-bit
platforms and a seqlock on 32-bit platforms to avoid contention.

However, we can do that in a later change.

dk(4): Move CFDRIVER_DECL and CFATTACH_DECL3_NEW earlier in file.

Follows the pattern of most drivers, and will be necessary for
referencing dk_cd in dk_bdevsw and dk_cdevsw soon, to prevent
open/detach races.
No functional change intended.

dk(4): Don't touch dkwedges or ndkwedges outside dkwedges_lock.

dk(4): Assert parent vp is nonnull before we stash it away.

Let's enable early attribution if this goes wrong.

If it's not the parent's first open, also assert the parent vp is
already nonnull.

dk(4): Assert dkwedges[unit] is the sc we're about to free.

dk(4): Require dk_openlock in dk_set_geometry.

Not strictly necessary but this makes reasoning easier and documents
with an assertion how disk_set_info is serialized.

disk(9): Fix use-after-free race with concurrent disk_set_info.

This can happen with dk(4), which allows wedges to have their size
increased without destroying and recreating the device instance.

Drivers which allow concurrent disk_set_info and disk_ioctl must
serialize disk_set_info with dk_openlock.

dk(4): Add null d_cancel routine to devsw.

This way, dkclose is guaranteed that dkopen, dkread, dkwrite,
dkioctl, &c., have all returned before it runs. For block opens,
setting d_cancel also guarantees that any buffered writes are flushed
with vinvalbuf before dkclose is called.

dk(4): Fix callout detach race.
1. Set a flag sc_iostop under the lock sc_iolock so dkwedge_detach
and dkstart don't race over it.
2. Decline to schedule the callout if sc_iostop is set. The callout
is already only ever scheduled while the lock is held.
3. Use callout_halt to wait for any concurrent callout to complete.
At this point, it can't reschedule itself.

Without this change, the callout could be concurrently rescheduling
itself as we issue callout_stop, leading to use-after-free later.

dk(4): Use disk_begindetach and rely on vdevgone to close instances.

The first step is to decide whether we can detach (if forced, yes; if
not forced, only if not already open), and prevent new opens if so.

There's no need to start closing open instances at this point --
we're just making a decision to detach, and preventing new opens by
transitioning state that dkopen will respect[*].

The second step is to force all open instances to close. This is
done by vdevgone. By the time vdevgone returns, there can be no open
instances, so if there _were_ any, closing them via vdevgone will
have passed through dklastclose.

After that point, there can be no opens and no I/O operations, so
dk_openmask must already be zero and the bufq must be empty.

Thus, there's no need to have an explicit call to dklastclose (via
dkwedge_cleanup_parent) before or after making the decision to
detach.
[*] Currently access to this state is racy: nothing serializes
dkwedge_detach's state transition with dkopen's test. TBD in a
separate commit shortly.

dk(4): Set .d_cfdriver and .d_devtounit to plug open/detach race.

This way, opening dkN or rdkN will wait if attach or detach is still
in progress, and vdevgone will wake up such pending opens and make
them fail. So it is no longer possible for a wedge to be detached
after dkopen has already started using it.

For now, we use a custom .d_devtounit function that looks up the
autoconf unit number via the dkwedges array, which conceivably may
use an independent unit numbering system -- nothing guarantees they
match up. (In practice they will mostly match up, but concurrent
wedge creation could lead to different numbering.) Eventually this
should be changed so the two numbering systems match, which would let
us delete the new dkunit function and just use dev_minor_unit like
many other drivers can.

dk(4): Take a read-lock on dkwedges_lock if we're only reading.
- dkwedge_find_by_name
- dkwedge_find_by_parent
- dkwedge_print_wnames

dk(4): Omit needless locking in dksize, dkdump.

All the members these use are stable after initialization, except for
the wedge size, which dkwedge_size safely reads a snapshot of without
locking in the caller.

dk(4): dkdump: Simplify. No functional change intended.

dk(4): Narrow the scope of the device numbering lookup on detach.

Just need it for vdevgone, order relative to other things in detach
doesn't matter.
No functional change intended.

disk(9): Fix missing unlock in error branch in previous change.

dk(4): Fix racy access to sc->sc_dk.dk_openmask in dkwedge_delall1.
Need sc->sc_parent->dk_rawlock for this, as used in dkopen/dkclose.

dk(4): Convert tests to assertions in various devsw operations.
.d_cancel, .d_strategy, .d_read, .d_write, .d_ioctl, and .d_discard
are only ever used between successful .d_open return and entry to
.d_close. .d_open doesn't return until sc is nonnull and sc_state is
RUNNING, and dkwedge_detach waits for the last .d_close before
setting sc_state to DEAD. So there is no possibility for sc to be
null or for sc_state to be anything other than RUNNING or DYING.

There is a small functional change here but only in the event of a
race: in the short window between when dkwedge_detach is entered, and
when .d_close runs, any I/O operations (read, write, ioctl, &c.) may
be issued that would have failed with ENXIO before.

This shouldn't matter for anything: disk I/O operations are supposed
to complete reasonably promptly, and these operations _could_ have
begun milliseconds prior, before dkwedge_detach was entered, so it's
not a significant distinction.

Notes:
- .d_open must still contend with trying to open a nonexistent wedge,
of course.
- .d_close must also contend with closing a nonexistent wedge, in
case there were two calls to open in quick succession and the first
failed while the second hadn't yet determined it would fail.
- .d_size and .d_dump are used from ddb without any open/close.

dk(4): Fix lock assertion in size increase: parent's, not wedge's.

dk(4): Rename label for consistency. No functional change intended.

dk(4): dkclose must handle a dying wedge too to close the parent.

Otherwise the parent open leaks on detach (or revoke) when the wedge
was open and had to be forcibly closed.

Fixes assertion sc->sc_dk.dk_openmask == 0.

ioctl(DIOCRMWEDGES): Delete only idle wedges.

Don't forcibly delete busy wedges.

Fixes accidental destruction of the busy wedge that the root file
system is mounted on, triggered by syzbot's ioctl(DIOCRMWEDGES).

dk(4): Omit needless sc_iopend, sc_dkdrn mechanism.
vdevgone guarantees that all instances are closed by the time it
returns, which in turn guarantees all I/O operations (read, write,
ioctl, &c.) have completed, and, if the block device is open,
vinvalbuf(V_SAVE) -> vflushbuf has completed, which forces all
buffered transfers to be issued and waits for them to complete.
So by the time vdevgone returns, no further transfers can be
submitted and the bufq must be empty.

dk(4): Fix typo: sc_state, not sc_satte.

Had tested a patch series, but not every patch in it, and I
inadvertently fixed the typo in a later patch in the series, not in
the one I committed.

dk(4): Make it clearer that dkopen EROFS branch doesn't leak.
It looked like we may need to sometimes call dklastclose in error
branch for the case of (flags & ~sc->sc_mode & FWRITE) != 0, but it
is not actually possible to reach that case: if the caller requested
read/write, and the parent is read-only, and it is the first time
we've opened the parent, then dkfirstopen will fail with EROFS so we
never get there.

But this is confusing and it looked like the error branch is wrong,
so let's rearrange the conditional to make it clearer that we cannot
goto out after dkfirstopen has succeeded. And then assert that the
case cannot happen when we do call dkfirstopen.

dk(4): Need pdk->dk_openlock to read pdk->dk_wedges.
 1.137.6.1  02-Aug-2025  perseant Sync with HEAD
