History log of /src/sys/dev/dkwedge/dk.c
Revision  Date  Author  Comments
 1.173  13-Apr-2025  jakllsch Add physical sector and alignment info to struct disk_geom and the
geometry plist, and handle in partutil.

Bump version for disk_geom addition.

Collect DIOCGSECTORALIGN handling into one place.
 1.172  05-Mar-2025  jakllsch Ensure dsa_firstaligned returned from DIOCGSECTORALIGN is actually the first
 1.171  22-May-2023  riastradh dk(4): Add locking notes.
 1.170  22-May-2023  riastradh dk(4): Explain why no need for device reference in dksize, dkdump.
 1.169  22-May-2023  riastradh dk(4): Strengthen preconditions of various devsw operations.

These can only happen between dkopen and dkclose, so there's no need
to test -- we can assert instead that the wedge exists and is fully
initialized.
 1.168  22-May-2023  riastradh dk(4): Strengthen dkclose preconditions.

Like dkopen, except it is possible for this to be called after the
wedge has transitioned to dying.

XXX sc_state read here races with sc_state write in dkwedge_detach.
Could change this to atomic_load/store.
 1.167  22-May-2023  riastradh dk(4): Strengthen dkopen preconditions.

This cannot be called before dkwedge_attach for the same unit
returns, so sc->sc_dev is guaranteed to be set to a nonnull device_t
and the state is guaranteed not to be larval.

And this cannot be called concurrently with dkwedge_detach, or after
dkwedge_detach does vdevgone until another wedge with the same number
is attached (which can't happen until dkwedge_detach completes), so
the state is guaranteed not to be dying or dead.

Hence sc->sc_dev != NULL and sc->sc_state == DKW_STATE_RUNNING.
 1.166  22-May-2023  riastradh dk(4): Prevent race between dkwedge_get_parent_name and wedge detach.

Still races with parent detach but maybe this is better.

XXX Maybe we should ditch dkwedge_get_parent_name -- it's used only
by rf_containsboot, which kinda suggests it shouldn't exist.
 1.165  22-May-2023  riastradh dk(4): Split unsafe lookups into safe subroutines and unsafe wrappers.

No functional change intended.

Eventually we should adjust the callers to use the safe subroutines
instead and device_release when done.
 1.164  22-May-2023  riastradh dk(4): Don't hold lock around uiomove in dkwedge_list.

Instead, hold a device reference. dkwedge_detach will not run until
the device reference is released.
 1.163  22-May-2023  riastradh dk(4): Skip larval wedges in various lookup routines.

These have not yet finished a concurrent dkwedge_attach, so there's
nothing we can safely do with them. Just pretend they don't exist --
as if we had arrived at the lookup a moment earlier.
 1.162  22-May-2023  riastradh dk(4): Simplify dkwedge_delall by detaching directly.

No need for O(n^2) algorithm and potentially racy lookups -- not that
n is large enough for n^2 to matter, but the mechanism is simpler
this way.
 1.161  22-May-2023  riastradh dk(4): Use device_lookup_private for dkwedge_lookup.

No longer necessary to go through the dkwedges array.

Currently device_lookup_private still involves touching other global
locks, but that will change eventually to a lockless pserialized fast
path.
 1.160  22-May-2023  riastradh dk(4): dkunit is no longer needed; nix it.

dkwedges array indexing now coincides with autoconf device numbering.
 1.159  22-May-2023  riastradh dk(4): Use config_attach_pseudo_acquire to create wedges.

This way, indexing of the dkwedges array coincides with numbering of
autoconf dk(4) instances.

As a side effect, this plugs a race in dkwedge_add with concurrent
drvctl -r. There are a lot of such races in dk(4) left -- to be
addressed with more device references.
 1.158  13-May-2023  riastradh dk(4): Need pdk->dk_openlock to read pdk->dk_wedges.
 1.157  10-May-2023  riastradh dk(4): Make it clearer that dkopen EROFS branch doesn't leak.

It looked like we may need to sometimes call dklastclose in error
branch for the case of (flags & ~sc->sc_mode & FWRITE) != 0, but it
is not actually possible to reach that case: if the caller requested
read/write, and the parent is read-only, and it is the first time
we've opened the parent, then dkfirstopen will fail with EROFS so we
never get there.

But this is confusing and it looked like the error branch is wrong,
so let's rearrange the conditional to make it clearer that we cannot
goto out after dkfirstopen has succeeded. And then assert that the
case cannot happen when we do call dkfirstopen.
 1.156  09-May-2023  riastradh dk(4): Fix typo: sc_state, not sc_satte.

Had tested a patch series, but not every patch in it, and I
inadvertently fixed the typo in a later patch in the series, not in
the one I committed.
 1.155  09-May-2023  riastradh dk(4): Omit needless sc_iopend, sc_dkdrn mechanism.

vdevgone guarantees that all instances are closed by the time it
returns, which in turn guarantees all I/O operations (read, write,
ioctl, &c.) have completed, and, if the block device is open,
vinvalbuf(V_SAVE) -> vflushbuf has completed, which forces all
buffered transfers to be issued and waits for them to complete.

So by the time vdevgone returns, no further transfers can be
submitted and the bufq must be empty.
 1.154  09-May-2023  riastradh ioctl(DIOCRMWEDGES): Delete only idle wedges.

Don't forcibly delete busy wedges.

Reported-by: syzbot+e46f31fe56e04f567d88@syzkaller.appspotmail.com
https://syzkaller.appspot.com/bug?id=8a00fd7f2e7459748d7a274098180a4708ff0f61

Fixes accidental destruction of the busy wedge that the root file
system is mounted on, triggered by syzbot's ioctl(DIOCRMWEDGES).
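
The "delete only idle wedges" rule can be illustrated with a small userland sketch. This is not the driver's code: struct toy_wedge_slot and toy_remove_idle_wedges are hypothetical names, and a nonzero openmask is assumed to be what "busy" means here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one wedge: present or not, and who has it open. */
struct toy_wedge_slot {
	bool present;
	unsigned openmask;	/* nonzero means the wedge is busy */
};

/*
 * Remove every idle wedge and return how many were removed.
 * Busy wedges are skipped rather than forcibly deleted.
 */
size_t
toy_remove_idle_wedges(struct toy_wedge_slot *slots, size_t n)
{
	size_t removed = 0;

	assert(slots != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (!slots[i].present)
			continue;
		if (slots[i].openmask != 0)
			continue;	/* busy: leave it alone */
		slots[i].present = false;
		removed++;
	}
	return removed;
}
```

With three wedges of which the middle one is open, a call removes the two idle ones and leaves the open one in place, which is the behavior the entry describes for a wedge holding the root file system.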
 1.153  09-May-2023  riastradh dk(4): dkclose must handle a dying wedge too to close the parent.

Otherwise the parent open leaks on detach (or revoke) when the wedge
was open and had to be forcibly closed.

Reported-by: syzbot+e46f31fe56e04f567d88@syzkaller.appspotmail.com
https://syzkaller.appspot.com/bug?id=8a00fd7f2e7459748d7a274098180a4708ff0f61

Fixes assertion sc->sc_dk.dk_openmask == 0.
 1.152  29-Apr-2023  riastradh dk(4): Rename label for consistency. No functional change intended.
 1.151  29-Apr-2023  riastradh dk(4): Fix lock assertion in size increase: parent's, not wedge's.

Reported-by: syzbot+d4dc610473cacc5183dd@syzkaller.appspotmail.com
https://syzkaller.appspot.com/bug?id=e18ddae8283d6fab44cfb1ac7e3f8e791f8c0700
 1.150  22-Apr-2023  riastradh dk(4): Convert tests to assertions in various devsw operations.

.d_cancel, .d_strategy, .d_read, .d_write, .d_ioctl, and .d_discard
are only ever used between successful .d_open return and entry to
.d_close. .d_open doesn't return until sc is nonnull and sc_state is
RUNNING, and dkwedge_detach waits for the last .d_close before
setting sc_state to DEAD. So there is no possibility for sc to be
null or for sc_state to be anything other than RUNNING or DYING.

There is a small functional change here but only in the event of a
race: in the short window between when dkwedge_detach is entered, and
when .d_close runs, any I/O operations (read, write, ioctl, &c.) may
be issued that would have failed with ENXIO before.

This shouldn't matter for anything: disk I/O operations are supposed
to complete reasonably promptly, and these operations _could_ have
begun milliseconds prior, before dkwedge_detach was entered, so it's
not a significant distinction.

Notes:

- .d_open must still contend with trying to open a nonexistent wedge,
of course.

- .d_close must also contend with closing a nonexistent wedge, in
case there were two calls to open in quick succession and the first
failed while the second hadn't yet determined it would fail.

- .d_size and .d_dump are used from ddb without any open/close.
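
The state reasoning in this entry can be written down as two predicates. The sketch below is a userland illustration only; the enum and function names are hypothetical, and the four states are taken from the words used in this log (larval, running, dying, dead), not from the driver source.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the wedge states named in the log. */
enum toy_wedge_state {
	TOY_LARVAL,	/* attach still in progress */
	TOY_RUNNING,	/* fully initialized and usable */
	TOY_DYING,	/* detach entered, last close not yet done */
	TOY_DEAD	/* detach has finished with the wedge */
};

/* An open is only expected to succeed on a running wedge. */
bool
toy_open_allowed(enum toy_wedge_state s)
{
	return s == TOY_RUNNING;
}

/* Between open and close, the state can only be running or dying. */
bool
toy_io_state_valid(enum toy_wedge_state s)
{
	return s == TOY_RUNNING || s == TOY_DYING;
}

/* Assert the precondition instead of testing it and failing with ENXIO. */
void
toy_io_precondition(enum toy_wedge_state s)
{
	assert(toy_io_state_valid(s));
}
```

The last function mirrors the change in style: an I/O entry point states what must be true rather than handling a case that cannot occur.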
 1.149  22-Apr-2023  riastradh dk(4): Fix racy access to sc->sc_dk.dk_openmask in dkwedge_delall1.

Need sc->sc_parent->dk_rawlock for this, as used in dkopen/dkclose.
 1.148  21-Apr-2023  riastradh dk(4): Narrow the scope of the device numbering lookup on detach.

Just need it for vdevgone, order relative to other things in detach
doesn't matter.

No functional change intended.
 1.147  21-Apr-2023  riastradh dk(4): dkdump: Simplify. No functional change intended.
 1.146  21-Apr-2023  riastradh dk(4): Omit needless locking in dksize, dkdump.

All the members these use are stable after initialization, except for
the wedge size, which dkwedge_size safely reads a snapshot of without
locking in the caller.
 1.145  21-Apr-2023  riastradh dk(4): Take a read-lock on dkwedges_lock if we're only reading.

- dkwedge_find_by_name
- dkwedge_find_by_parent
- dkwedge_print_wnames
 1.144  21-Apr-2023  riastradh dk(4): Set .d_cfdriver and .d_devtounit to plug open/detach race.

This way, opening dkN or rdkN will wait if attach or detach is still
in progress, and vdevgone will wake up such pending opens and make
them fail. So it is no longer possible for a wedge to be detached
after dkopen has already started using it.

For now, we use a custom .d_devtounit function that looks up the
autoconf unit number via the dkwedges array, which conceivably may
use an independent unit numbering system -- nothing guarantees they
match up. (In practice they will mostly match up, but concurrent
wedge creation could lead to different numbering.) Eventually this
should be changed so the two numbering systems match, which would let
us delete the new dkunit function and just use dev_minor_unit like
many other drivers can.
 1.143  21-Apr-2023  riastradh dk(4): Use disk_begindetach and rely on vdevgone to close instances.

The first step is to decide whether we can detach (if forced, yes; if
not forced, only if not already open), and prevent new opens if so.
There's no need to start closing open instances at this point --
we're just making a decision to detach, and preventing new opens by
transitioning state that dkopen will respect[*].

The second step is to force all open instances to close. This is
done by vdevgone. By the time vdevgone returns, there can be no open
instances, so if there _were_ any, closing them via vdevgone will
have passed through dklastclose.

After that point, there can be no opens and no I/O operations, so
dk_openmask must already be zero and the bufq must be empty.

Thus, there's no need to have an explicit call to dklastclose (via
dkwedge_cleanup_parent) before or after making the decision to
detach.

[*] Currently access to this state is racy: nothing serializes
dkwedge_detach's state transition with dkopen's test. TBD in a
separate commit shortly.
 1.142  21-Apr-2023  riastradh dk(4): Fix callout detach race.

1. Set a flag sc_iostop under the lock sc_iolock so dkwedge_detach
and dkstart don't race over it.

2. Decline to schedule the callout if sc_iostop is set. The callout
is already only ever scheduled while the lock is held.

3. Use callout_halt to wait for any concurrent callout to complete.
At this point, it can't reschedule itself.

Without this change, the callout could be concurrently rescheduling
itself as we issue callout_stop, leading to use-after-free later.
 1.141  21-Apr-2023  riastradh dk(4): Add null d_cancel routine to devsw.

This way, dkclose is guaranteed that dkopen, dkread, dkwrite,
dkioctl, &c., have all returned before it runs. For block opens,
setting d_cancel also guarantees that any buffered writes are flushed
with vinvalbuf before dkclose is called.
 1.140  21-Apr-2023  riastradh dk(4): Require dk_openlock in dk_set_geometry.

Not strictly necessary but this makes reasoning easier and documents
with an assertion how disk_set_info is serialized.
 1.139  21-Apr-2023  riastradh dk(4): Assert dkwedges[unit] is the sc we're about to free.
 1.138  21-Apr-2023  riastradh dk(4): Assert parent vp is nonnull before we stash it away.

Let's enable early attribution if this goes wrong.

If it's not the parent's first open, also assert the parent vp is
already nonnull.
 1.137  21-Apr-2023  riastradh dk(4): Don't touch dkwedges or ndkwedges outside dkwedges_lock.
 1.136  21-Apr-2023  riastradh dk(4): Move CFDRIVER_DECL and CFATTACH_DECL3_NEW earlier in file.

Follows the pattern of most drivers, and will be necessary for
referencing dk_cd in dk_bdevsw and dk_cdevsw soon, to prevent
open/detach races.

No functional change intended.
 1.135  21-Apr-2023  riastradh dk(4): Prevent races in access to struct dkwedge_softc::sc_size.

Rules:

1. Only ever increases, never decreases.

(Decreases require removing and readding the wedge.)

2. Increases are serialized by dk_openlock.

3. Reads can happen unlocked in any context where the softc is valid.

Access is gathered into dkwedge_size* subroutines -- don't touch
sc_size outside these. For now, we use rwlock(9) to keep the
reasoning simple. This should be done with atomics on 64-bit
platforms and a seqlock on 32-bit platforms to avoid contention.
However, we can do that in a later change.
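
The three rules can be sketched in userland C. Note that the commit itself uses rwlock(9); the sketch below instead uses C11 atomics, the direction the message suggests for 64-bit platforms, and assumes 64-bit atomic loads and stores are available. The names toy_sized_wedge, toy_wedge_size and toy_wedge_size_increase are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical wedge with only the size field modeled. */
struct toy_sized_wedge {
	_Atomic uint64_t size;	/* in sectors */
};

/* Rule 3: reads may happen without holding any lock. */
uint64_t
toy_wedge_size(struct toy_sized_wedge *w)
{
	assert(w != NULL);
	return atomic_load_explicit(&w->size, memory_order_relaxed);
}

/*
 * Rules 1 and 2: the size only grows, and the caller is assumed to hold
 * whatever lock serializes increases.  Returns false and leaves the size
 * untouched if the request would shrink the wedge.
 */
bool
toy_wedge_size_increase(struct toy_sized_wedge *w, uint64_t newsize)
{
	uint64_t oldsize = toy_wedge_size(w);

	if (newsize < oldsize)
		return false;
	atomic_store_explicit(&w->size, newsize, memory_order_relaxed);
	return true;
}
```

Keeping every access inside these two helpers is the point of the entry: nothing else touches the field, so the synchronization scheme can change later without touching callers.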
 1.134  21-Apr-2023  riastradh dk(4): <sys/rwlock.h> for rwlock(9).
 1.133  21-Apr-2023  riastradh dk(4): KNF: Sort includes.

No functional change intended.
 1.132  21-Apr-2023  riastradh dk(4): ENXIO, not ENODEV, means no such device.

ENXIO is `device not configured', meaning there is no such device.

ENODEV is `operation not supported by device', meaning the device is
there but refuses the operation, like writing to a read-only medium.

Exception: For undefined ioctl commands, it's not ENODEV _or_ ENXIO,
but rather ENOTTY, because why make any of this obvious when you
could make it obscure Unix lore?
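
As a compact restatement of the three error codes, here is a hypothetical helper. The three boolean inputs are illustrative only and do not correspond to fields of any real driver.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical helper: pick the errno described in the entry above.
 */
int
toy_errno_for(bool device_configured, bool command_defined,
    bool operation_supported)
{
	if (!device_configured)
		return ENXIO;	/* no such device */
	if (!command_defined)
		return ENOTTY;	/* undefined ioctl command */
	if (!operation_supported)
		return ENODEV;	/* device exists but refuses the operation */
	return 0;
}

/* Small usage example. */
void
toy_errno_example(void)
{
	assert(toy_errno_for(false, true, true) == ENXIO);
	assert(toy_errno_for(true, false, true) == ENOTTY);
	assert(toy_errno_for(true, true, false) == ENODEV);
}
```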
 1.131  21-Apr-2023  riastradh dk(4): Fix typo in comment: dkstrategy, not dkstragegy.

No functional change intended.
 1.130  21-Apr-2023  riastradh dk(4): Omit needless void * cast.

No functional change intended.
 1.129  21-Apr-2023  riastradh dk(4): KNF: Whitespace.

No functional change intended.
 1.128  21-Apr-2023  riastradh dk(4): KNF: return (v) -> return v.

No functional change intended.
 1.127  21-Apr-2023  riastradh dk(4): Avoid holding dkwedges_lock while allocating array.

This is not great -- we shouldn't be choosing the unit number here
anyway; we should just let autoconf do it for us -- but it's better
than potentially blocking any dk_openlock or dk_rawlock (which are
sometimes held when waiting for dkwedges_lock) for memory allocation.
 1.126  21-Apr-2023  riastradh dk(4): Restore assertions in dklastclose.

We only enter dklastclose if the wedge is open (sc->sc_dk.dk_openmask
!= 0), which can happen only if dkfirstopen has succeeded, in which
case we hold a dk_rawopens reference to the parent that prevents
anyone else from closing it. Hence sc->sc_parent->dk_rawopens > 0.

On open, sc->sc_parent->dk_rawvp is set to nonnull, and it is only
reset to null on close. Hence if the parent is still open, as it
must be here, sc->sc_parent->dk_rawvp must be nonnull.
 1.125  13-Apr-2023  riastradh dk(4): Explain why dk_rawopens can't overflow and assert it.
 1.124  27-Sep-2022  mlelstv branches: 1.124.4;
Remove bogus assertions.
 1.123  22-Aug-2022  riastradh dk(4): Assert about dk_openmask under the lock.

This serves two purposes:

1. Pacifies data race sanitizers.

2. Ensures that we don't spuriously trip over the assertion if
dkclose happens concurrently with dkopen due to a revoke call.
 1.122  22-Aug-2022  riastradh Revert "dk(4): Narrow scope of dk_rawlock on close to dklastclose."

dkfirstopen relies on reading from dk_openmask of _other_ wedges, so
writes to dk_openmask must be serialized by dk_rawlock in addition to
dk_openlock. (However, reads from dk_openmask only require one or
the other.)
 1.121  22-Aug-2022  riastradh dk(4): dklastclose never fails. Make it return void.
 1.120  22-Aug-2022  riastradh dk(4): Simplify dklastclose.

No functional change intended.
 1.119  22-Aug-2022  riastradh dk(4): Assert parent is open in dklastclose.

It is not possible for us to be closing a wedge whose parent is not
open by at least this wedge.
 1.118  22-Aug-2022  riastradh dk(4): Move first-open logic to new dkfirstopen function.

Makes the logic more clearly pair with dklastclose.
 1.117  22-Aug-2022  riastradh dk(4): Turn locking contract comment into assertions in dklastclose.
 1.116  22-Aug-2022  riastradh dk(4): Narrow scope of dk_rawlock on close to dklastclose.

No need to take it if we're not actually going to close the parent.

No functional change intended; dk_rawlock is only supposed to
serialize dk_rawopens access and open/close of the parent, after all.
 1.115  22-Aug-2022  riastradh dk(4): Factor common mutex_exit out of branches to keep it balanced.

No functional change intended.
 1.114  22-Aug-2022  riastradh dk(4): Move lock release out of dklastclose into caller.

No longer necessary to have this unbalanced logic now that
dk_close_parent correctly happens under the lock in order to
serialize with dk_open_parent.

No functional change intended.
 1.113  22-Aug-2022  riastradh dk(4): Serialize closing parent's dk_rawvp with opening it.

Otherwise, the following events might happen:

- process 123 had /dev/rdkN open, starts close, enters dk_close_parent
- process 456 opens /dev/rdkM (same parent, different wedge), calls
dk_open_parent

At this point, the block device hasn't yet closed, so dk_open_parent
will fail with EBUSY. This is incorrect -- the chardev is never
supposed to fail with EBUSY, and dkopen/dkclose carefully manage
state to avoid opening the block device while it's still open. The
problem is that dkopen in process 456 didn't wait for vn_close
in process 123 to finish before calling VOP_OPEN.

(Note: If it were the _same_ chardev /dev/rdkN in both processes,
then spec_open/close would prevent this. But since it's a
_different_ chardev, spec_open/close assume that concurrency is OK,
and it's the driver's responsibility to serialize access to the
parent disk which, unbeknownst to spec_open/close, is shared between
dkN and dkM.)

It appears that the vn_close call was previously moved outside
dk_rawlock in 2010 to work around an unrelated bug in raidframe that
had already been fixed in HEAD:

Crash pointing to dk_rawlock and raidclose:
https://mail-index.netbsd.org/tech-kern/2010/07/27/msg008612.html

Change working around that crash:
https://mail-index.netbsd.org/source-changes/2010/08/04/msg012270.html

Change removing raidclose -> mutex_destroy(&dk_rawlock) path:
https://mail-index.netbsd.org/source-changes/2009/07/23/msg223381.html
 1.112  11-Jun-2022  martin Since rev 1.101 DIOCAWEDGE could return success without filling in the
wedge device name - which is quite confusing for userland.
Always fill the name if we return success.
 1.111  23-Apr-2022  hannken Need vnode locked for VOP_FDISCARD().
 1.110  15-Jan-2022  riastradh dk(4): Omit redundant microoptimization around cv_broadcast.

cv_broadcast already has a fast path for the no-waiter case.
 1.109  18-Oct-2021  simonb Whitespace nits.
 1.108  16-Oct-2021  simonb Remove funny straggling blank line.
 1.107  21-Aug-2021  andvar fix some more typos in comments/log messages, improve wording as well.
 1.106  04-Aug-2021  mlelstv Swap and Dump uses DEV_BSIZE units. Translate from device sectors like
regular I/O (strategy).
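
The unit translation can be sketched as follows. This is an illustration, not the driver's code: the helper names are hypothetical, DEV_BSIZE is assumed to be 512 bytes, and the sector size is assumed to be a multiple of it.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_DEV_BSIZE 512u	/* assumed value of DEV_BSIZE */

/* Convert a count of device sectors into DEV_BSIZE units. */
uint64_t
toy_sectors_to_devblocks(uint64_t nsectors, uint32_t secsize)
{
	assert(secsize >= TOY_DEV_BSIZE && secsize % TOY_DEV_BSIZE == 0);
	return nsectors * (secsize / TOY_DEV_BSIZE);
}

/* Convert a block number in DEV_BSIZE units into a device sector number. */
uint64_t
toy_devblocks_to_sectors(uint64_t blkno, uint32_t secsize)
{
	assert(secsize >= TOY_DEV_BSIZE && secsize % TOY_DEV_BSIZE == 0);
	return blkno / (secsize / TOY_DEV_BSIZE);
}
```

On a disk with 4096-byte sectors, for example, 10 sectors correspond to 80 units of 512 bytes, so a size or dump offset that skips this step is off by a factor of eight.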
 1.105  02-Jun-2021  mlelstv Clear sc_mode only on last close.
 1.104  02-Jun-2021  mlelstv Copy mode of open wedges with the same parent and validate it.
Remove race on mode value when closing.
 1.103  22-May-2021  mlelstv branches: 1.103.2;
Handle read-only parent devices.

Currently this only affects xbd(4). Other disk drivers succeed opening
read-only disks as read-write and only fail subsequent write requests.
 1.102  06-Oct-2020  mlelstv branches: 1.102.6; 1.102.8;
Check dkdriver before calling a driver function.
 1.101  24-May-2020  jmcneill dkwedge_add: Allow for expanding the size of an existing wedge without
having to delete it first, provided that no other parameters have changed.
 1.100  02-Mar-2020  riastradh New ioctl DIOCGSECTORALIGN returns sector alignment parameters.

struct disk_sectoralign {
/* First aligned sector number. */
uint32_t dsa_firstaligned;

/* Number of sectors per aligned unit. */
uint32_t dsa_alignment;
};

- Teach wd(4) to get it from ATA.
- Teach cgd(4) to pass it through from the underlying disk.
- Teach dk(4) to pass it through with adjustments.
- Teach zpool (zfs) to take advantage of it.
=> XXX zpool doesn't seem to understand when the vdev's starting
sector is misaligned.

Missing:

- ccd(4) and raidframe(4) support -- these should support _using_
DIOCGSECTORALIGN to decide where to start putting ccd or raid
stripes on disk, and these should perhaps _implement_
DIOCGSECTORALIGN by reporting the stripe/interleave factor.

- sd(4) support -- I don't know any obvious way to get it from SCSI,
but if any SCSI wizards know better than I, please feel free to
teach sd(4) about it!

- any ld(4) attachments -- might be worth teaching the ld drivers for
nvme and various raid controllers to get the aligned sector size

There's some duplicate logic here for now. I'm doing it this way,
rather than gathering the logic into a new disklabel_sectoralign
function or something, so that this change is limited to adding a new
ioctl, without any new kernel symbols, in order to make it easy to
pull up to netbsd-9 without worrying about the module ABI.
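
One way to picture "pass it through with adjustments" for a wedge is sketched below. This is an assumption about what the adjustment has to achieve, not the driver's code: it assumes aligned sectors on the parent are those at dsa_firstaligned plus any multiple of dsa_alignment, and that the wedge starts at a known sector offset on the parent. The names toy_sectoralign and toy_wedge_sectoralign are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Same two fields as the struct quoted in the entry above. */
struct toy_sectoralign {
	uint32_t dsa_firstaligned;
	uint32_t dsa_alignment;
};

/*
 * Given the parent's alignment and the wedge's starting sector on the
 * parent, return the alignment as seen from the wedge, where sector 0
 * is the wedge's first sector.
 */
struct toy_sectoralign
toy_wedge_sectoralign(struct toy_sectoralign parent, uint64_t wedge_offset)
{
	struct toy_sectoralign out;
	uint64_t a = parent.dsa_alignment;
	uint64_t first, start;

	assert(a != 0);
	first = parent.dsa_firstaligned % a;	/* phase of aligned sectors */
	start = wedge_offset % a;		/* phase of the wedge start */
	out.dsa_alignment = parent.dsa_alignment;
	/* Smallest w >= 0 such that wedge_offset + w is aligned on the parent. */
	out.dsa_firstaligned = (uint32_t)((first + a - start) % a);
	return out;
}
```

For a parent with 8-sector alignment starting at sector 0 and a wedge starting at sector 34, the first aligned wedge sector is 6, because parent sector 40 is the next multiple of 8. The result is always smaller than the alignment, so it names the first aligned sector rather than a later one.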
 1.99  01-Mar-2020  riastradh Allow dumping to cgd(4) on a dk(4).

(Technically this also allows dumping to a dk(4) on which there
happens to be a cgd(4) configured, but I'm not sure how to
distinguish that case here. So don't do that!)
 1.98  28-Feb-2020  yamaguchi Update sc->sc_parent->dk_rawvp while the lock named dk_rawlock held
to prevent a race condition

Fixes PR kern/55026

OKed by mlelstv@, thanks
 1.97  12-May-2018  mlelstv branches: 1.97.2; 1.97.8; 1.97.10;
Support dump on wedges.
 1.96  05-Mar-2017  mlelstv branches: 1.96.4; 1.96.6; 1.96.12;
Enhance disk metrics by calculating a weighted sum that is incremented
by the number of concurrent I/O requests. Also introduce a new disk_wait()
function to measure requests waiting in a bufq.
iostat -y now reports data about waiting and active requests.

So far only drivers using dksubr and dk, ccd, wd and xbd collect data about
waiting requests.
 1.95  27-Feb-2017  jdolecek pass also DIOCGCACHE to underlying device, so that upper layers would be able
to get the device cache properties without knowing the topology; while here also
pass down DIOCGSTRATEGY for neater dkctl(8) output
 1.94  19-Jan-2017  maya use a bounded copy. NFCI
 1.93  24-Dec-2016  mlelstv branches: 1.93.2;
add missing mutex/cv cleanup to error paths.
 1.92  16-Dec-2016  mlelstv Make dk(4) device mpsafe.
 1.91  29-May-2016  mlelstv branches: 1.91.2;
missed one exit path with the previous change.
 1.90  29-May-2016  mlelstv release openlock mutex before closing parent device.
 1.89  27-Apr-2016  christos Add dkwedge_find_by_parent()
 1.88  15-Jan-2016  mlelstv Allow dump to raidframe component which is a wedge.

N.B. ordinary devices check the partition type only in the xxxsize routine.
 1.87  27-Dec-2015  mlelstv Return error in dkopen when dk_open_parent fails. Also change dk_open_parent
to pass error code to caller.
XXX: Pullups
 1.86  28-Nov-2015  mlelstv sc_size is already measured in sectors.
 1.85  10-Oct-2015  christos remove incorrect comment (from kre)
 1.84  06-Oct-2015  jmcneill print wedge announcement in one line instead of two
 1.83  25-Aug-2015  pooka Rename variable to avoid -Wshadow warnings with some compilers.
 1.82  22-Aug-2015  mlelstv No longer access the disk driver directly.
If there is an open wedge, temporarily reference its vnode.
Otherwise try to open the block device.
 1.81  22-Aug-2015  mlelstv revert the previous
 1.80  20-Aug-2015  mlelstv when scanning for disklabels, close block device only when this was
the first open. The device driver doesn't do reference counting.

This is still subject to race conditions.
 1.79  02-Jan-2015  christos - Use NODEV instead of 0
- Return EBUSY if there was no label
 1.78  31-Dec-2014  christos make more drivers use disk_ioctl, and add a dev parameter to it so that
we can merge the "easy" disklabel ioctls to it. Ultimately all this will
go to dk_ioctl once all the drivers have been converted.
 1.77  31-Dec-2014  mlelstv disk_blocksize and disk_set_info relay the same information
to the disk subsystem.

Make disk_set_info also set blocksize shift values.
Remove every call to disk_blocksize.

Keep disk_blocksize for ABI compatibility, make it also set dg_secsize.
 1.76  08-Dec-2014  mlelstv Really provide disk properties, the old code computed values that were
never attached to the device.
 1.75  22-Nov-2014  mlelstv branches: 1.75.2;
fix iobuf setup, cleanup
 1.74  04-Nov-2014  mlelstv Implement DIOCMWEDGES ioctl that triggers wedge autodiscovery.
Also fix a reference counting bug and clean up some code.
 1.73  28-Aug-2014  riastradh Make dk(4) discard from partition start, not from disk start.

Otherwise, anything mounted with `-o discard' will pretty quickly
munch itself up and barf up an unrecoverably corrupted file system!

XXX pullup to netbsd-7
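
The fix amounts to shifting a wedge-relative range by the wedge's start before handing it to the parent. A minimal userland sketch follows, assuming byte units throughout; the struct, the function name, and the bounds check are additions for illustration and are not taken from the driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical wedge geometry, in bytes, for illustration only. */
struct toy_wedge_geom {
	uint64_t offset;	/* start of the wedge on the parent disk */
	uint64_t size;		/* length of the wedge */
};

/*
 * Translate a wedge-relative discard range into a parent-relative one.
 * Returns false if the range does not fit inside the wedge.
 */
bool
toy_discard_to_parent(const struct toy_wedge_geom *g, uint64_t pos,
    uint64_t len, uint64_t *parent_pos)
{
	assert(g != NULL && parent_pos != NULL);
	if (pos > g->size || len > g->size - pos)
		return false;
	*parent_pos = g->offset + pos;	/* shift by the partition start */
	return true;
}
```

Without the shift, a discard at position 0 of the wedge would land at position 0 of the whole disk, which is how a file system mounted with `-o discard' could destroy data outside its own partition.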
 1.72  25-Jul-2014  dholland branches: 1.72.2;
Implement d_discard for dk. This closes PR 47940.
 1.71  25-Jul-2014  dholland Add d_discard to all struct cdevsw instances I could find.

All have been set to "nodiscard"; some should get a real implementation.
 1.70  25-Jul-2014  dholland Add d_discard to all struct bdevsw instances I could find.

I've set them all to nodiscard. Some of them (wd, dk, vnd, ld,
raidframe, maybe cgd) should be implemented for real.
 1.69  03-Apr-2014  christos branches: 1.69.2;
add dkwedge_get_parent_name().
 1.68  16-Mar-2014  dholland Change (mostly mechanically) every cdevsw/bdevsw I can find to use
designated initializers.

I have not built every extant kernel so I have probably broken at
least one build; however I've also found and fixed some wrong
cdevsw/bdevsw entries so even if so I think we come out ahead.
 1.67  03-Aug-2013  soren Don't complain about not being able to open empty removable media drives.
 1.66  29-May-2013  christos branches: 1.66.2;
phase 1 of disk geometry cleanup:
- centralize the geometry -> plist code so that we don't have
n useless copies of it.
 1.65  27-Oct-2012  chs split device_t/softc for all remaining drivers.
replace "struct device *" with "device_t".
use device_xname(), device_unit(), etc.
 1.64  10-Jun-2012  mlelstv branches: 1.64.2;
Make detection of root on wedges (dk(4)) machine independent. Remove
MD code for x86, xen, sparc64.
 1.63  27-Apr-2012  drochner minor mostly cosmetical fixes: use designated type for device major
numbers, typo in comment, misuse of minor()
(the latter one is not cosmetical, but would only affect systems
with more than 256 disk wedges)
 1.62  30-Jul-2011  jmcneill branches: 1.62.2; 1.62.6; 1.62.8;
Add an FSILENT flag and use it to suppress "Medium Not Present" scsipi
spam when trying to access offline drives at boot.
 1.61  12-Jun-2011  rmind Welcome to 5.99.53! Merge rmind-uvmplock branch:

- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.

- Simplify locking in some pmap(9) modules by removing P->V locking.

- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).

- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.

- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.

Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.
 1.60  03-Mar-2011  christos branches: 1.60.2;
check rawvp before doing ioctl or strategy.
 1.59  28-Feb-2011  christos Make error checking consistent, possibly fixes PR/44652.
 1.58  23-Dec-2010  mlelstv branches: 1.58.2; 1.58.4;
Make wedges aware of underlying physical block size.
 1.57  04-Aug-2010  bouyer Make sure to release sc_parent->dk_rawlock before calling
vn_close(sc->sc_parent->dk_rawvp). Avoids a lockdebug panic:
error: mutex_destroy: assertion failed: !MUTEX_OWNED(mtx->mtx_owner) && !MUTEX_HAS_WAITERS(mtx)
when the parent is a raidframe device.
See also:
http://mail-index.netbsd.org/tech-kern/2010/07/27/msg008612.html
 1.56  24-Jun-2010  hannken Clean up vnode lock operations pass 2:

VOP_UNLOCK(vp, flags) -> VOP_UNLOCK(vp): Remove the unneeded flags argument.

Welcome to 5.99.32.

Discussed on tech-kern.
 1.55  07-Feb-2010  mlelstv branches: 1.55.2; 1.55.4;
d_psize routine returns a number of blocks or -1 on error.
d_dump routine returns 0 or an error code.
 1.54  25-Jan-2010  mlelstv GPTs are defined in terms of physical blocks.
- Fix reading of GPT for devices with non-512byte sectors
- Fix bounds check to use DEV_BSIZE units.
 1.53  23-Jan-2010  bouyer struct buf::b_iodone is not called at splbio() any more.
Make sure non-MPsafe iodone callbacks raise the SPL as appropriate.
Fix buffer corruption issue I noticed in dk(4), and probable similar
issues in vnd(4) and cgd(4).
 1.52  27-Dec-2009  jakllsch Implement and use a dkminphys() that calls the parent device's minphys
function with b_dev temporarily adjusted to the parent device's dev_t.

Fixes PR/37390.
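
The "temporarily adjusted" pattern looks roughly like this in a userland model. Everything here is a stand-in: toy_buf models only two fields of struct buf, toy_dev_t stands in for dev_t, and the 64 KiB limit in the example parent routine is arbitrary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t toy_dev_t;	/* stand-in for dev_t */

/* Minimal stand-in for struct buf: only the fields the sketch needs. */
struct toy_buf {
	toy_dev_t b_dev;
	long b_bcount;
};

typedef void (*toy_minphys_fn)(struct toy_buf *);

/* Example parent minphys: clamp transfers to 64 KiB (arbitrary limit). */
void
toy_parent_minphys(struct toy_buf *bp)
{
	if (bp->b_bcount > 65536)
		bp->b_bcount = 65536;
}

/*
 * Call the parent's minphys with b_dev pointing at the parent, then put
 * the wedge's own device number back.
 */
void
toy_dkminphys(struct toy_buf *bp, toy_dev_t parent_dev,
    toy_minphys_fn parent_minphys)
{
	toy_dev_t saved;

	assert(bp != NULL && parent_minphys != NULL);
	saved = bp->b_dev;
	bp->b_dev = parent_dev;
	parent_minphys(bp);
	bp->b_dev = saved;
}
```

The caller sees the parent's transfer limit applied to b_bcount while b_dev still names the wedge afterwards.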
 1.51  08-Sep-2009  pooka dkwedge_list() is currently called only from ioctl routines where
l == curlwp. Since there is no perceived case where we'd ever want
to copy the list to non-curlwp, simplify the code a bit.
(the struct lwp * argument could probably be dropped too, but
that's another commit)
 1.50  07-Sep-2009  pooka grow some _KERNEL_POT
 1.49  06-Sep-2009  pooka Remove autoconf dependency on vfs and dk:
opendisk() -> kern/subr_disk_open.c
config_handle_wedges -> dev/dkwedge/dk.c
 1.48  06-Aug-2009  haad Add support for DIOCGDISKINFO for wedges. This fixes regression after my
DIOCGDISKINFO commit to fsck/partutil.c.

Tested by me and adegroot@.
 1.47  21-Jul-2009  dyoung Extract a lot of code from dkwedge_del(), and move it to dkwedge_detach()
to create a comprehensive detachment hook. Let that hook run at
shutdown. Now, 'drvctl -d dk0' actually deletes a wedge if it is
not in-use (otherwise fails w/ EBUSY), and wedges are gracefully
detached from their "parent" at shutdown.
 1.46  02-Jul-2009  dyoung Extract subroutine dklastclose(). This is a step toward detachable
dk(4).
 1.45  12-May-2009  cegger struct device * -> device_t, no functional changes intended.
 1.44  12-May-2009  cegger struct cfdata * -> cfdata_t, no functional changes intended.
 1.43  13-Jan-2009  yamt branches: 1.43.2;
g/c BUFQ_FOO() macros and use bufq_foo() directly.
 1.42  17-Jun-2008  reinoud branches: 1.42.4; 1.42.6; 1.42.10; 1.42.12;
Mark a buffer `busy` in getnewbuf() when it came from the pool_cache since
it's not on a free list.

Also change buf_init() to not automatically mark buffers `busy' since this
only makes sense for bufcache buffers.

Mark all buf_init'd buffers 'busy' on the places where they ought to be
flagged as such to not confuse the buffer cache.

Fixes PR 38923.
 1.41  03-Jun-2008  ad branches: 1.41.2;
dkwedge_read: don't place struct buf on the stack.
 1.40  01-Jun-2008  chris Call buf_destroy when finished with an on-stack struct buf.

Spotted by LOCKDEBUG, because the condvars were already initialised.
 1.39  03-May-2008  plunky branches: 1.39.2;
after the "struct disk" is finished with, it should be
destroyed with disk_destroy(9) to stave off LOCKDEBUG
panics.
 1.38  28-Apr-2008  martin Remove clause 3 and 4 from TNF licenses
 1.37  10-Apr-2008  agc branches: 1.37.2; 1.37.4;
Fix a minor nit in a comment
 1.36  06-Apr-2008  cegger use aprint_*_dev and device_xname
 1.35  21-Mar-2008  ad Catch up with descriptor handling changes. See kern_descrip.c revision
1.173 for details.
 1.34  04-Mar-2008  cube Split device_t/softc. Well, there's not much to split there, as the
device_t didn't contain the softc anyway.

This driver should be re-structured so it doesn't have to manage its own
set of softcs.
 1.33  30-Jan-2008  ad branches: 1.33.2; 1.33.6;
Hold v_interlock when adjusting v_writecount.
 1.32  02-Jan-2008  ad Merge vmlocking2 to head.
 1.31  09-Dec-2007  jmcneill branches: 1.31.2;
Merge jmcneill-pm branch.
 1.30  26-Nov-2007  pooka branches: 1.30.2; 1.30.4;
Remove the "struct lwp *" argument from all VFS and VOP interfaces.
The general trend is to remove it from all kernel interfaces and
this is a start. In case the calling lwp is desired, curlwp should
be used.

quick consensus on tech-kern
 1.29  08-Oct-2007  ad branches: 1.29.4;
Merge disk init changes from the vmlocking branch. These separate init /
destroy of 'struct disk' from attach / detach.
 1.28  29-Jul-2007  ad branches: 1.28.4; 1.28.6; 1.28.8; 1.28.10;
It's not a good idea for device drivers to modify b_flags, as they don't
need to understand the locking around that field. Instead of setting
B_ERROR, set b_error instead. b_error is 'owned' by whoever completes
the I/O request.
 1.27  21-Jul-2007  ad Replace some uses of lockmgr().
 1.26  09-Jul-2007  ad branches: 1.26.2;
Merge some of the less invasive changes from the vmlocking branch:

- kthread, callout, devsw API changes
- select()/poll() improvements
- miscellaneous MT safety improvements
 1.25  24-Jun-2007  dyoung Extract common code from i386, xen, and sparc64, creating
config_handle_wedges() and read_disk_sectors(). On x86, handle_wedges()
is a thin wrapper for config_handle_wedges(). Share opendisk()
across architectures.

Add kernel code in support of specifying a root partition by wedge
name. E.g., root specifications "wedge:wd0a", "wedge:David's Root
Volume" are possible. (Patches for config(1) coming soon.)

In support of moving disks between architectures (esp. i386 <->
evbmips), I've written a routine convertdisklabel() that ensures
that the raw partition is at RAW_DISK by following these steps:

0 If we have read a disklabel that has a RAW_PART with
p_offset == 0 and p_size != 0, then use that raw partition.

1 If we have read a disklabel that has both partitions 'c'
and 'd', and RAW_PART has p_offset != 0 or p_size == 0,
but the other partition is suitable for a raw partition
(p_offset == 0, p_size != 0), then swap the two partitions
and use the new raw partition.

2 If the architecture's raw partition is 'd', and if there
is no partition 'd', but there is a partition 'c' that
is suitable for a raw partition, then copy partition 'c'
to partition 'd'.

3 Determine the drive's last sector, using either the
d_secperunit the drive reported, or by guessing (0x1fffffff).
If we cannot read the drive's last sector, then fail.

4 If we have read a disklabel that has no partition slot
RAW_PART, then create a partition RAW_PART. Make it span
the whole drive.

5 If there are fewer than MAXPARTITIONS partitions,
then "slide" the unsuitable raw partition RAW_PART, and
subsequent partitions, into partition slots RAW_PART+1
and subsequent slots. Create a raw partition at RAW_PART.
Make it span the whole drive.

The convertdisklabel() procedure can probably stand to be simplified,
but it ought to deal with all but an extraordinarily broken disklabel,
now.
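
As an illustration of steps 0 and 1 above, here is a minimal sketch. It is not the actual convertdisklabel() code: struct part, raw_suitable() and fix_raw_by_swap() are hypothetical names, and the partition entry is reduced to the two fields the steps mention. Steps 2 through 5 are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a disklabel partition entry. */
struct part {
	uint32_t p_offset;	/* starting sector */
	uint32_t p_size;	/* length in sectors; 0 means unused */
};

/* A slot can serve as the raw partition if it starts at 0 and is nonempty. */
static bool
raw_suitable(const struct part *p)
{
	return p->p_offset == 0 && p->p_size != 0;
}

/*
 * Steps 0 and 1 only: keep a suitable raw partition, or swap it with
 * the other candidate slot ('c' or 'd') if that one is suitable.
 * Returns true if parts[raw] is usable afterwards.
 */
static bool
fix_raw_by_swap(struct part *parts, int raw, int other)
{
	assert(raw != other);
	if (raw_suitable(&parts[raw]))
		return true;			/* step 0 */
	if (raw_suitable(&parts[other])) {	/* step 1 */
		struct part tmp = parts[raw];
		parts[raw] = parts[other];
		parts[other] = tmp;
		return true;
	}
	return false;	/* steps 2-5 would take over here */
}
```

With slots indexed a=0 through d=3, a label whose 'c' starts at sector 63 and whose 'd' spans the disk from sector 0 would have the two entries swapped when 'c' is the raw slot.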

i386: compiled and tested, sparc64: compiled, evbmips: compiled.
 1.24  16-Jun-2007  christos Unwedge the previous change. Always increment the number of rawopens if the
open is successful.
 1.23  09-Jun-2007  dyoung Fix two bugs:

1 In dkopen(), do not leave dk_rawopens > 0 if the open ultimately
failed for some reason.

2 Add a dkdump() implementation by Martin Husemann for writing
system dumps to wedges. Tiny modifications by me. Lightly tested
on an evbmips box.
 1.22  04-Mar-2007  christos branches: 1.22.2; 1.22.4;
Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.
 1.21  15-Feb-2007  yamt branches: 1.21.2;
dkwedge_discover: open a device as read-only.
 1.20  16-Nov-2006  christos __unused removal on arguments; approved by core.
 1.19  12-Oct-2006  christos - sprinkle __unused on function decls.
- fix a couple of unused bugs
- no more -Wno-unused for i386
 1.18  18-Sep-2006  uebayasi branches: 1.18.2;
Typo in comment.
 1.17  24-Aug-2006  dbj branches: 1.17.2;
avoid diagnostic panic if both blk and chr wedge are open at the same time
 1.16  21-Jul-2006  ad - Use the LWP cached credentials where sane.
- Minor cosmetic changes.
 1.15  14-May-2006  elad integrate kauth.
 1.14  06-Apr-2006  thorpej A couple of fixes from dbj@:
- dkwedge_del(): Don't compute a minor number based on partitions, because
wedges don't have partitions. Just provide the unit number to vdevgone().
- dkopen(): Make sure we release all of the locks we've acquired should
opening the parent device fail.
 1.13  06-Apr-2006  thorpej Implement dksize().
 1.12  01-Mar-2006  yamt branches: 1.12.2; 1.12.4; 1.12.6;
merge yamt-uio_vmspace branch.

- use vmspace rather than proc or lwp where appropriate.
the latter is more natural to specify an address space.
(and less likely to be abused for random purposes.)
- fix a swdmover race.
 1.11  04-Jan-2006  yamt branches: 1.11.2; 1.11.4;
- add simple functions to allocate/free a buffer for i/o.
- make bufpool static.
 1.10  11-Dec-2005  christos branches: 1.10.2;
merge ktrace-lwp.
 1.9  15-Oct-2005  yamt - change the way to specify a bufq strategy. (by string rather than by number)
- rather than embedding bufq_state in driver softc,
have a pointer to the former.
- move bufq related functions from kern/subr_disk.c to kern/subr_bufq.c.
- rename method to strategy for consistency.
- move some definitions which don't need to be exposed to the rest of kernel
from sys/bufq.h to sys/bufq_impl.h.
(is it better to move it to kern/ or somewhere?)
- fix some obvious breakage in dev/qbus/ts.c. (not tested)
 1.8  28-Sep-2005  nathanw Set sc->sc_cfdata.cf_fstate to FSTATE_STAR rather than FSTATE_NOTFOUND
so that config_detach() doesn't panic.

(XXX this points to some disagreement between config_attach_pseudo()
and config_detach() over the correct role of pseudo-device cfdata)
 1.7  29-May-2005  christos branches: 1.7.2;
avoid variable shadowing.
 1.6  27-Feb-2005  perry nuke trailing whitespace
 1.5  28-Oct-2004  yamt branches: 1.5.4; 1.5.6;
move buffer queue related stuff from buf.h to its own header, bufq.h.
 1.4  26-Oct-2004  thorpej Implement the DIOCCACHESYNC ioctl; we just pass it along to the parent.
 1.3  23-Oct-2004  thorpej - Adjust minor number usage for wedges; minor number directly maps to
unit now. Don't pretend wedges have "partitions".
- Fix a buglet related to opening char and block devices of a wedge
at the same time.
- Add dkwedge_set_bootwedge(), that MD code can call to set booted_device
and booted_wedge appropriately when MD code knows the parent disk and
the start/size of the wedge that was booted from.
 1.2  15-Oct-2004  thorpej branches: 1.2.2;
Use config_attach_pseudo() to create device instances in the device
tree for created wedges. This is necessary for setroot().
 1.1  04-Oct-2004  thorpej Move wedge code to a subdirectory, as suggested by Christos.
 1.2.2.5  10-Nov-2005  skrll Sync with HEAD. Here we go again...
 1.2.2.4  04-Mar-2005  skrll Sync with HEAD.

Hi Perry!
 1.2.2.3  02-Nov-2004  skrll Sync with HEAD.
 1.2.2.2  19-Oct-2004  skrll Sync with HEAD
 1.2.2.1  15-Oct-2004  skrll file dk.c was added on branch ktrace-lwp on 2004-10-19 15:56:45 +0000
 1.5.6.1  19-Mar-2005  yamt sync with head. xen and whitespace. xen part is not finished.
 1.5.4.1  29-Apr-2005  kent sync with -current
 1.7.2.10  24-Mar-2008  yamt sync with head.
 1.7.2.9  17-Mar-2008  yamt sync with head.
 1.7.2.8  04-Feb-2008  yamt sync with head.
 1.7.2.7  21-Jan-2008  yamt sync with head
 1.7.2.6  07-Dec-2007  yamt sync with head
 1.7.2.5  27-Oct-2007  yamt sync with head.
 1.7.2.4  03-Sep-2007  yamt sync with head.
 1.7.2.3  26-Feb-2007  yamt sync with head.
 1.7.2.2  30-Dec-2006  yamt sync with head.
 1.7.2.1  21-Jun-2006  yamt sync with head.
 1.10.2.2  15-Jan-2006  yamt sync with head.
 1.10.2.1  31-Dec-2005  yamt adapt some random parts of kernel to uio_vmspace.
 1.11.4.2  01-Jun-2006  kardel Sync with head.
 1.11.4.1  22-Apr-2006  simonb Sync with head.
 1.11.2.1  09-Sep-2006  rpaulo sync with head
 1.12.6.1  24-May-2006  tron Merge 2006-05-24 NetBSD-current into the "peter-altq" branch.
 1.12.4.3  06-May-2006  christos - Move kauth_cred_t declaration to <sys/types.h>
- Cleanup struct ucred; forward declarations that are unused.
- Don't include <sys/kauth.h> in any header, but include it in the c files
that need it.

Approved by core.
 1.12.4.2  19-Apr-2006  elad sync with head.
 1.12.4.1  08-Mar-2006  elad Adapt to kernel authorization KPI.
 1.12.2.4  03-Sep-2006  yamt sync with head.
 1.12.2.3  11-Aug-2006  yamt sync with head
 1.12.2.2  24-May-2006  yamt sync with head.
 1.12.2.1  11-Apr-2006  yamt sync with head
 1.17.2.1  18-Nov-2006  ad Sync with head.
 1.18.2.3  10-Dec-2006  yamt sync with head.
 1.18.2.2  22-Oct-2006  yamt sync with head
 1.18.2.1  18-Sep-2006  yamt file dk.c was added on branch yamt-splraiseipl on 2006-10-22 06:05:35 +0000
 1.21.2.1  12-Mar-2007  rmind Sync with HEAD.
 1.22.4.1  11-Jul-2007  mjf Sync with head.
 1.22.2.10  09-Oct-2007  ad Sync with head.
 1.22.2.9  24-Aug-2007  ad Sync with buffer cache locking changes. See buf.h/vfs_bio.c for details.
Some minor portions are incomplete and need to be verified as a whole.
 1.22.2.8  20-Aug-2007  ad - Alter disk attach/detach to fix a panic when closing a vnd device.
- Sync with HEAD.
 1.22.2.7  19-Aug-2007  ad - Back out the biodone() changes.
- Eliminate B_ERROR (from HEAD).
 1.22.2.6  15-Jul-2007  ad Sync with head.
 1.22.2.5  01-Jul-2007  ad Adapt to callout API change.
 1.22.2.4  23-Jun-2007  ad - Lock v_cleanblkhd, v_dirtyblkhd, v_numoutput with the vnode's interlock.
Get rid of global_v_numoutput_lock. Partially incomplete as the buffer
cache locking doesn't work very well and needs an overhaul.
- Some changes to try and make softdep MP safe. Untested.
 1.22.2.3  17-Jun-2007  ad - Increase the number of thread priorities from 128 to 256. How the space
is set up is to be revisited.
- Implement soft interrupts as kernel threads. A generic implementation
is provided, with hooks for fast-path MD code that can run the interrupt
threads over the top of other threads executing in the kernel.
- Split vnode::v_flag into three fields, depending on how the flag is
locked (by the interlock, by the vnode lock, by the file system).
- Miscellaneous locking fixes and improvements.
 1.22.2.2  13-May-2007  ad - Pass the error number and residual count to biodone(), and let it handle
setting error indicators. Prepare to eliminate B_ERROR.
- Add a flag argument to brelse() to be set into the buf's flags, instead
of doing it directly. Typically used to set B_INVAL.
- Add a "struct cpu_info *" argument to kthread_create(), to be used to
create bound threads. Change "bool mpsafe" to "int flags".
- Allow exit of LWPs in the IDL state when (l != curlwp).
- More locking fixes & conversion to the new API.
 1.22.2.1  13-Mar-2007  ad Pull in the initial set of changes for the vmlocking branch.
 1.26.2.1  15-Aug-2007  skrll Sync with HEAD.
 1.28.10.2  29-Jul-2007  ad It's not a good idea for device drivers to modify b_flags, as they don't
need to understand the locking around that field. Instead of setting
B_ERROR, set b_error instead. b_error is 'owned' by whoever completes
the I/O request.
 1.28.10.1  29-Jul-2007  ad file dk.c was added on branch matt-mips64 on 2007-07-29 12:50:21 +0000
 1.28.8.1  14-Oct-2007  yamt sync with head.
 1.28.6.3  23-Mar-2008  matt sync with HEAD
 1.28.6.2  09-Jan-2008  matt sync with HEAD
 1.28.6.1  06-Nov-2007  matt sync with HEAD
 1.28.4.4  08-Dec-2007  jmcneill Rename pnp(9) -> pmf(9), as requested by many.
 1.28.4.3  27-Nov-2007  joerg Sync with HEAD. amd64 Xen support needs testing.
 1.28.4.2  06-Nov-2007  joerg Refactor PNP API:
- Make suspend/resume directly a device functionality. It consists of
three layers (class logic, device logic, bus logic), all of them being
optional. This replaces D0/D3 transitions.
- device_is_active returns true if the device was not disabled and was
not suspended (even partially), device_is_enabled returns true if the
device was enabled.
- Change pnp_global_transition into pnp_system_suspend and
pnp_system_resume. Before running any suspend/resume handlers, check
that all currently attached devices support power management and bail
out otherwise. The latter is not done for the shutdown/panic case.
- Make the former bus-specific generic network handlers a class handler.
- Make PNP message like volume up/down/toggle PNP events. Each device
can register what events they are interested in and whether the handler
should be global or not.
- Introduce device_active API for devices to mark themselves in use from
either the system or the device. Use this to implement the idle handling
for audio and input devices. This is intended to replace most ad-hoc
watchdogs as well.
- Fix some situations in which audio resume would lose mixer settings.
- Make USB host controllers better deal with suspend in the light of
shared interrupts.
- Flush filesystem cache on suspend.
- Flush disk caches on suspend. Put ATA disks into standby on suspend as
well.
- Adapt drivers to use the new PNP API.
- Fix a critical bug in the generic cardbus layer that made D0->D3
break.
- Fix ral(4) to set if_stop.
- Convert cbb(4) to the new PNP API.
- Apply the PCI Express SCI fix on resume again.
 1.28.4.1  26-Oct-2007  joerg Sync with HEAD.

Follow the merge of pmap.c on i386 and amd64 and move
pmap_init_tmp_pgtbl into arch/x86/x86/pmap.c. Modify the ACPI wakeup
code to restore CR4 before jumping back into kernel space as the large
page option might cover that.
 1.29.4.3  18-Feb-2008  mjf Sync with HEAD.
 1.29.4.2  27-Dec-2007  mjf Sync with HEAD.
 1.29.4.1  08-Dec-2007  mjf Sync with HEAD.
 1.30.4.1  11-Dec-2007  yamt sync with head.
 1.30.2.2  26-Dec-2007  ad Sync with head.
 1.30.2.1  04-Dec-2007  ad Pull the vmlocking changes into a new branch.
 1.31.2.1  02-Jan-2008  bouyer Sync with HEAD
 1.33.6.6  17-Jan-2009  mjf Sync with HEAD.
 1.33.6.5  29-Jun-2008  mjf Sync with HEAD.
 1.33.6.4  05-Jun-2008  mjf Sync with HEAD.

Also fix build.
 1.33.6.3  02-Jun-2008  mjf Sync with HEAD.
 1.33.6.2  06-Apr-2008  mjf - after some discussion with agc@ i agreed it would be a good idea to move
device_unregister_* to device_deregister_* to be more like the pmf(9)
functions, especially since a lot of the time the function calls are next
to each other.

- add device_register_name() support for dk(4).
 1.33.6.1  03-Apr-2008  mjf Sync with HEAD.
 1.33.2.1  24-Mar-2008  keiichi sync with head.
 1.37.4.8  11-Aug-2010  yamt sync with head.
 1.37.4.7  11-Mar-2010  yamt sync with head
 1.37.4.6  16-Sep-2009  yamt sync with head
 1.37.4.5  19-Aug-2009  yamt sync with head.
 1.37.4.4  18-Jul-2009  yamt sync with head.
 1.37.4.3  16-May-2009  yamt sync with head
 1.37.4.2  04-May-2009  yamt sync with head.
 1.37.4.1  16-May-2008  yamt sync with head.
 1.37.2.2  04-Jun-2008  yamt sync with head
 1.37.2.1  18-May-2008  yamt sync with head.
 1.39.2.1  23-Jun-2008  wrstuden Sync w/ -current. 34 merge conflicts to follow.
 1.41.2.1  18-Jun-2008  simonb Sync with head.
 1.42.12.1  21-Apr-2010  matt sync to netbsd-5
 1.42.10.1  30-Jan-2010  snj Pull up following revision(s) (requested by bouyer in ticket #1269):
sys/dev/dkwedge/dk.c: revision 1.53
sys/dev/cgd.c: revision 1.69
sys/dev/vnd.c: revision 1.206
struct buf::b_iodone is not called at splbio() any more.
Make sure non-MPsafe iodone callbacks raise the SPL as appropriate.
Fix buffer corruption issue I noticed in dk(4), and probable similar
issues in vnd(4) and cgd(4).
 1.42.6.3  21-Nov-2010  riz Pull up following revision(s) (requested by bouyer in ticket #1435):
sys/dev/dkwedge/dk.c: revision 1.57
Make sure to release sc_parent->dk_rawlock before calling
vn_close(sc->sc_parent->dk_rawvp). Avoids a lockdebug panic:
error: mutex_destroy: assertion failed: !MUTEX_OWNED(mtx->mtx_owner) && !MUTEX_HAS_WAITERS(mtx)
when the parent is a raidframe device.
See also:
http://mail-index.netbsd.org/tech-kern/2010/07/27/msg008612.html
 1.42.6.2  30-Jan-2010  snj Pull up following revision(s) (requested by bouyer in ticket #1269):
sys/dev/cgd.c: revision 1.69
sys/dev/vnd.c: revision 1.206
sys/dev/dkwedge/dk.c: revision 1.53
struct buf::b_iodone is not called at splbio() any more.
Make sure non-MPsafe iodone callbacks raise the SPL as appropriate.
Fix buffer corruption issue I noticed in dk(4), and probable similar
issues in vnd(4) and cgd(4).
 1.42.6.1  09-Jan-2010  snj Pull up following revision(s) (requested by jakllsch in ticket #1213):
sys/dev/dkwedge/dk.c: revision 1.52
Implement and use a dkminphys() that calls the parent device's minphys
function with b_dev temporarily adjusted to the parent device's dev_t.
Fixes PR/37390.
 1.42.4.1  19-Jan-2009  skrll Sync with HEAD.
 1.43.2.2  23-Jul-2009  jym Sync with HEAD.
 1.43.2.1  13-May-2009  jym Sync with HEAD.

Commit is split, to avoid a "too many arguments" protocol error.
 1.55.4.3  05-Mar-2011  rmind sync with head
 1.55.4.2  03-Jul-2010  rmind sync with head
 1.55.4.1  16-Mar-2010  rmind Change struct uvm_object::vmobjlock to be dynamically allocated with
mutex_obj_alloc(). It allows us to share the locks among UVM objects.
 1.55.2.1  17-Aug-2010  uebayasi Sync with HEAD.
 1.58.4.1  05-Mar-2011  bouyer Sync with HEAD
 1.58.2.1  06-Jun-2011  jruoho Sync with HEAD.
 1.60.2.1  23-Jun-2011  cherry Catchup with rmind-uvmplock merge.
 1.62.8.1  05-Jul-2012  riz Pull up following revision(s) (requested by mlelstv in ticket #402):
sys/dev/vnd.c: revision 1.221
sys/kern/init_main.c: revision 1.443
sys/kern/init_main.c: revision 1.444
sys/dev/dkwedge/dk.c: revision 1.64
sys/arch/x86/x86/x86_autoconf.c: revision 1.63
sys/arch/sparc64/sparc64/autoconf.c: revision 1.187
sys/sys/device.h: revision 1.141
sys/dev/dkwedge/dkwedge_bsdlabel.c: revision 1.17
sys/kern/kern_subr.c: revision 1.213
sys/arch/zaurus/zaurus/autoconf.c: revision 1.11
sys/arch/xen/x86/autoconf.c: revision 1.14
sys/sys/disk.h: revision 1.57
Use the label's packname to create wedge names instead of the classic
device names. Fall back to classic device names when the label has an
empty name or the default name 'fictitious'.
autodiscover wedges
Make detection of root on wedges (dk(4)) machine independent. Remove
MD code for x86, xen, sparc64.
Make detection of root on wedges (dk(4)) machine independent. Remove
MD code for zaurus.
Do not try to find the wedge we booted from if opendisk(booted_device)
failed.
 1.62.6.1  29-Apr-2012  mrg sync to latest -current.
 1.62.2.3  22-May-2014  yamt sync with head.

for a reference, the tree before this commit was tagged
as yamt-pagecache-tag8.

this commit was split into small chunks to avoid
a limitation of cvs. ("Protocol error: too many arguments")
 1.62.2.2  30-Oct-2012  yamt sync with head
 1.62.2.1  23-May-2012  yamt sync with head.
 1.64.2.5  03-Dec-2017  jdolecek update from HEAD
 1.64.2.4  20-Aug-2014  tls Rebase to HEAD as of a few days ago.
 1.64.2.3  23-Jun-2013  tls resync from head
 1.64.2.2  02-Dec-2012  tls Don't pass NULL struct dkdriver to disk_init. That's seriously bogus.
 1.64.2.1  20-Nov-2012  tls Resync to 2012-11-19 00:00:00 UTC
 1.66.2.2  18-May-2014  rmind sync with head
 1.66.2.1  28-Aug-2013  rmind sync with head
 1.69.2.1  10-Aug-2014  tls Rebase.
 1.72.2.4  08-Sep-2015  martin Pull up following revision(s) (requested by mlelstv in ticket #967):
sys/dev/dkwedge/dk.c: revision 1.82
No longer access the disk driver directly.
If there is an open wedge, temporarily reference its vnode.
Otherwise try to open the block device.
 1.72.2.3  09-Dec-2014  martin Pull up following revision(s) (requested by mlelstv in ticket #304):
sys/dev/dkwedge/dk.c: revision 1.75
sys/dev/dkwedge/dk.c: revision 1.76
fix iobuf setup, cleanup
Really provide disk properties, the old code computed values that were
never attached to the device.
 1.72.2.2  11-Nov-2014  martin Pull up following revision(s) (requested by mlelstv in ticket #201):
sbin/dkctl/dkctl.8: revision 1.24
sbin/dkctl/dkctl.8: revision 1.25
sys/dev/scsipi/sd.c: revision 1.310
sys/dev/ata/wd.c: revision 1.415
sbin/dkctl/dkctl.c: revision 1.21
sys/dev/raidframe/rf_netbsdkintf.c: revision 1.315
sys/dev/ld.c: revision 1.78
sys/dev/vnd.c: revision 1.234
sys/dev/dksubr.c: revision 1.54
sys/sys/dkio.h: revision 1.20
sys/dev/dkwedge/dk.c: revision 1.74
Add ioctl to autodiscover wedges.
Implement DIOCMWEDGES ioctl that triggers wedge autodiscovery.
Also fix a reference counting bug and clean up some code.
support DIOCMWEDGES ioctl.
Add 'makewedges' option to autodiscover wedges from a changed label.
New sentence, new line. Bump date for previous.
 1.72.2.1  29-Aug-2014  martin Pull up following revision(s) (requested by riastradh in ticket #65):
sys/dev/dkwedge/dk.c: revision 1.73
Make dk(4) discard from partition start, not from disk start.
Otherwise, anything mounted with `-o discard' will pretty quickly
munch itself up and barf up an unrecoverably corrupted file system!
XXX pullup to netbsd-7
 1.75.2.8  28-Aug-2017  skrll Sync with HEAD
 1.75.2.7  05-Feb-2017  skrll Sync with HEAD
 1.75.2.6  09-Jul-2016  skrll Sync with HEAD
 1.75.2.5  29-May-2016  skrll Sync with HEAD
 1.75.2.4  19-Mar-2016  skrll Sync with HEAD
 1.75.2.3  27-Dec-2015  skrll Sync with HEAD (as of 26th Dec)
 1.75.2.2  22-Sep-2015  skrll Sync with HEAD
 1.75.2.1  06-Apr-2015  skrll Sync with HEAD
 1.91.2.3  20-Mar-2017  pgoyette Sync with HEAD
 1.91.2.2  07-Jan-2017  pgoyette Sync with HEAD. (Note that most of these changes are simply $NetBSD$
tag issues.)
 1.91.2.1  20-Jul-2016  pgoyette Adapt machine-independent code to the new {b,c}devsw reference-counting
(using localcount(9)). All callers of {b,c}devsw_lookup() now call
{b,c}devsw_lookup_acquire() which retains a reference on the 'struct
{b,c}devsw'. This reference must be released by the caller once it is
finished with the structure's content (or other data that would disappear
if the 'struct {b,c}devsw' were to disappear).
 1.93.2.1  21-Apr-2017  bouyer Sync with HEAD
 1.96.12.1  21-May-2018  pgoyette Sync with HEAD
 1.96.6.1  24-Apr-2020  martin Pull up following revision(s) (requested by maya in ticket #1541):

sys/dev/dkwedge/dk.c: revision 1.98

Update sc->sc_parent->dk_rawvp while the lock named dk_rawlock held
to prevent a race condition

Fixes PR kern/55026

OKed by mlelstv@, thanks
 1.96.4.2  17-May-2017  pgoyette At suggestion of chuq@, modify config_attach_pseudo() to return with a
reference held on the device.

Adapt callers to expect the reference to exist, and to ensure that the
reference is released.
 1.96.4.1  27-Apr-2017  pgoyette Restore all work from the former pgoyette-localcount branch (which is
now abandoned due to cvs merge botch).

The branch now builds, and installs via anita. There are still some
problems (cgd is non-functional and all atf tests time-out) but they
will get resolved soon.
 1.97.10.1  29-Feb-2020  ad Sync with head.
 1.97.8.4  11-Oct-2020  martin Pull up following revision(s) (requested by mlelstv in ticket #1110):

sys/dev/dkwedge/dk.c: revision 1.102
sys/dev/ccd.c: revision 1.185
sbin/ccdconfig/ccdconfig.c: revision 1.58

Use raw device for configuring units. This is necessary as
having a block device opened prevents autodiscovery of wedges.

Fix ioctl locking. Add dkdriver.

Check dkdriver before calling a driver function.
 1.97.8.3  24-Apr-2020  martin Pull up following revision(s) (requested by maya in ticket #850):

sys/dev/dkwedge/dk.c: revision 1.98

Update sc->sc_parent->dk_rawvp while the lock named dk_rawlock held
to prevent a race condition

Fixes PR kern/55026

OKed by mlelstv@, thanks
 1.97.8.2  06-Apr-2020  martin Pull up following revision(s) (requested by riastradh in ticket #822):

sys/dev/dkwedge/dk.c: revision 1.99

Allow dumping to cgd(4) on a dk(4).

(Technically this also allows dumping to a dk(4) on which there
happens to be a cgd(4) configured, but I'm not sure how to
distinguish that case here. So don't do that!)
 1.97.8.1  21-Mar-2020  martin Pull up following revision(s) (requested by riastradh in ticket #788):

sys/sys/dkio.h: revision 1.26
sys/dev/dkwedge/dk.c: revision 1.100
sys/sys/disk.h: revision 1.75
external/cddl/osnet/dist/uts/common/fs/zfs/vdev_disk.c: revision 1.14
external/cddl/osnet/dist/uts/common/fs/zfs/vdev_disk.c: revision 1.15
sys/dev/cgd.c: revision 1.121
sys/dev/ata/wdvar.h: revision 1.50
sys/kern/subr_disk_open.c: revision 1.15
sys/dev/ata/wd.c: revision 1.459

New ioctl DIOCGSECTORALIGN returns sector alignment parameters.

struct disk_sectoralign {
/* First aligned sector number. */
uint32_t dsa_firstaligned;
/* Number of sectors per aligned unit. */
uint32_t dsa_alignment;
};

- Teach wd(4) to get it from ATA.
- Teach cgd(4) to pass it through from the underlying disk.
- Teach dk(4) to pass it through with adjustments.
- Teach zpool (zfs) to take advantage of it.
=> XXX zpool doesn't seem to understand when the vdev's starting
sector is misaligned.
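
One way to read "pass it through with adjustments" for dk(4) is sketched below. This is an assumption about the arithmetic, not the code in dk.c: wedge_sectoralign() is a hypothetical helper, and it assumes a parent sector p is aligned when p is congruent to dsa_firstaligned modulo dsa_alignment. A wedge sector w maps to parent sector w + offset, so the first aligned wedge sector is (firstaligned - offset) reduced modulo the alignment.

```c
#include <assert.h>
#include <stdint.h>

struct disk_sectoralign {
	/* First aligned sector number. */
	uint32_t dsa_firstaligned;
	/* Number of sectors per aligned unit. */
	uint32_t dsa_alignment;
};

/*
 * Hypothetical helper: translate the parent's alignment parameters
 * into the wedge's own sector numbering, where the wedge starts at
 * parent sector wedge_offset.
 */
static struct disk_sectoralign
wedge_sectoralign(struct disk_sectoralign parent, uint64_t wedge_offset)
{
	struct disk_sectoralign dsa = parent;
	uint64_t a = parent.dsa_alignment;

	if (a == 0)
		return dsa;	/* nothing to adjust; assumed degenerate case */

	uint64_t first = parent.dsa_firstaligned % a;
	uint64_t r = wedge_offset % a;

	/* (first - r) mod a, kept nonnegative by adding a first. */
	dsa.dsa_firstaligned = (uint32_t)((first + a - r) % a);
	assert(dsa.dsa_firstaligned < a);
	return dsa;
}
```

For example, with a parent alignment of 8 sectors starting at sector 0 and a wedge starting at parent sector 63, wedge sector 1 is parent sector 64, which is the first aligned one.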

Missing:
- ccd(4) and raidframe(4) support -- these should support _using_
DIOCGSECTORALIGN to decide where to start putting ccd or raid
stripes on disk, and these should perhaps _implement_
DIOCGSECTORALIGN by reporting the stripe/interleave factor.
- sd(4) support -- I don't know any obvious way to get it from SCSI,
but if any SCSI wizards know better than I, please feel free to
teach sd(4) about it!
- any ld(4) attachments -- might be worth teaching the ld drivers for
nvme and various raid controllers to get the aligned sector size

There's some duplicate logic here for now. I'm doing it this way,
rather than gathering the logic into a new disklabel_sectoralign
function or something, so that this change is limited to adding a new
ioctl, without any new kernel symbols, in order to make it easy to
pull up to netbsd-9 without worrying about the module ABI.

Make getdiskinfo() compatible with a DIOCGWEDGEINFO.

dkw_parent is defined to hold the disk name as used by disk_find(), not
a partition (i.e. no partition letter appended).

Use utility functions to handle disk geometry.
 1.97.2.1  08-Apr-2020  martin Merge changes from current as of 20200406
 1.102.8.1  31-May-2021  cjep sync with head
 1.102.6.1  17-Jun-2021  thorpej Sync w/ HEAD.
 1.103.2.1  06-Jun-2021  cjep sync with head
 1.124.4.1  01-Aug-2023  martin Pull up following revision(s) (requested by riastradh in ticket #284):

sys/dev/dkwedge/dk.c 1.125-1.158
sys/kern/subr_disk.c 1.135-1.137
sys/sys/disk.h 1.78

dk(4): Explain why dk_rawopens can't overflow and assert it.

dk(4): Restore assertions in dklastclose.

We only enter dklastclose if the wedge is open (sc->sc_dk.dk_openmask
!= 0), which can happen only if dkfirstopen has succeeded, in which
case we hold a dk_rawopens reference to the parent that prevents
anyone else from closing it. Hence sc->sc_parent->dk_rawopens > 0.

On open, sc->sc_parent->dk_rawvp is set to nonnull, and it is only
reset to null on close. Hence if the parent is still open, as it
must be here, sc->sc_parent->dk_rawvp must be nonnull.

dk(4): Avoid holding dkwedges_lock while allocating array.

This is not great -- we shouldn't be choosing the unit number here
anyway; we should just let autoconf do it for us -- but it's better
than potentially blocking any dk_openlock or dk_rawlock (which are
sometimes held when waiting for dkwedges_lock) for memory allocation.

dk(4): KNF: return (v) -> return v.
No functional change intended.

dk(4): KNF: Whitespace.
No functional change intended.

dk(4): Omit needless void * cast.
No functional change intended.

dk(4): Fix typo in comment: dkstrategy, not dkstragegy.
No functional change intended.

dk(4): ENXIO, not ENODEV, means no such device.
ENXIO is `device not configured', meaning there is no such device.
ENODEV is `operation not supported by device', meaning the device is
there but refuses the operation, like writing to a read-only medium.

Exception: For undefined ioctl commands, it's not ENODEV _or_ ENXIO,
but rather ENOTTY, because why make any of this obvious when you
could make it obscure Unix lore?
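
The three error codes described above can be summarized in a small sketch. wedge_op_errno() and its boolean parameters are hypothetical and exist only to show which condition maps to which errno value; the driver itself makes these decisions in separate places.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical summary of the convention:
 *   no such device              -> ENXIO  (device not configured)
 *   undefined ioctl command     -> ENOTTY
 *   device refuses an operation -> ENODEV (operation not supported)
 */
static int
wedge_op_errno(bool device_configured, bool cmd_defined, bool op_supported)
{
	if (!device_configured)
		return ENXIO;
	if (!cmd_defined)
		return ENOTTY;
	if (!op_supported)
		return ENODEV;
	return 0;
}
```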

dk(4): KNF: Sort includes.
No functional change intended.

dk(4): <sys/rwlock.h> for rwlock(9).

dk(4): Prevent races in access to struct dkwedge_softc::sc_size.
Rules:
1. Only ever increases, never decreases.
(Decreases require removing and readding the wedge.)
2. Increases are serialized by dk_openlock.
3. Reads can happen unlocked in any context where the softc is valid.

Access is gathered into dkwedge_size* subroutines -- don't touch
sc_size outside these. For now, we use rwlock(9) to keep the
reasoning simple. This should be done with atomics on 64-bit
platforms and a seqlock on 32-bit platforms to avoid contention.

However, we can do that in a later change.
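
The atomics variant mentioned above for 64-bit platforms might look roughly like the following userland sketch. It is not what this change does (the change uses rwlock(9)); it uses C11 <stdatomic.h> rather than the kernel's atomic operations, and struct wedge, wedge_size() and wedge_size_increase() are hypothetical names. The caller of wedge_size_increase() is assumed to hold whatever lock serializes increases (dk_openlock in the rules above).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the part of the softc that holds the size. */
struct wedge {
	_Atomic uint64_t size;	/* in sectors */
};

/* Rule 3: readers take an unlocked snapshot. */
static uint64_t
wedge_size(struct wedge *w)
{
	return atomic_load_explicit(&w->size, memory_order_relaxed);
}

/*
 * Rules 1 and 2: the size never shrinks, and the caller is assumed
 * to hold the lock that serializes increases.
 */
static void
wedge_size_increase(struct wedge *w, uint64_t newsize)
{
	assert(newsize >= wedge_size(w));
	atomic_store_explicit(&w->size, newsize, memory_order_relaxed);
}
```

On a 32-bit platform a 64-bit store may not be a single instruction, which is why the message suggests a seqlock there instead.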

dk(4): Move CFDRIVER_DECL and CFATTACH_DECL3_NEW earlier in file.

Follows the pattern of most drivers, and will be necessary for
referencing dk_cd in dk_bdevsw and dk_cdevsw soon, to prevent
open/detach races.
No functional change intended.

dk(4): Don't touch dkwedges or ndkwedges outside dkwedges_lock.

dk(4): Assert parent vp is nonnull before we stash it away.

Let's enable early attribution if this goes wrong.

If it's not the parent's first open, also assert the parent vp is
already nonnull.

dk(4): Assert dkwedges[unit] is the sc we're about to free.

dk(4): Require dk_openlock in dk_set_geometry.

Not strictly necessary but this makes reasoning easier and documents
with an assertion how disk_set_info is serialized.

disk(9): Fix use-after-free race with concurrent disk_set_info.

This can happen with dk(4), which allows wedges to have their size
increased without destroying and recreating the device instance.

Drivers which allow concurrent disk_set_info and disk_ioctl must
serialize disk_set_info with dk_openlock.

dk(4): Add null d_cancel routine to devsw.

This way, dkclose is guaranteed that dkopen, dkread, dkwrite,
dkioctl, &c., have all returned before it runs. For block opens,
setting d_cancel also guarantees that any buffered writes are flushed
with vinvalbuf before dkclose is called.

dk(4): Fix callout detach race.
1. Set a flag sc_iostop under the lock sc_iolock so dkwedge_detach
and dkstart don't race over it.
2. Decline to schedule the callout if sc_iostop is set. The callout
is already only ever scheduled while the lock is held.
3. Use callout_halt to wait for any concurrent callout to complete.
At this point, it can't reschedule itself.

Without this change, the callout could be concurrently rescheduling
itself as we issue callout_stop, leading to use-after-free later.

dk(4): Use disk_begindetach and rely on vdevgone to close instances.

The first step is to decide whether we can detach (if forced, yes; if
not forced, only if not already open), and prevent new opens if so.

There's no need to start closing open instances at this point --
we're just making a decision to detach, and preventing new opens by
transitioning state that dkopen will respect[*].

The second step is to force all open instances to close. This is
done by vdevgone. By the time vdevgone returns, there can be no open
instances, so if there _were_ any, closing them via vdevgone will
have passed through dklastclose.

After that point, there can be no opens and no I/O operations, so
dk_openmask must already be zero and the bufq must be empty.

Thus, there's no need to have an explicit call to dklastclose (via
dkwedge_cleanup_parent) before or after making the decision to
detach.
[*] Currently access to this state is racy: nothing serializes
dkwedge_detach's state transition with dkopen's test. TBD in a
separate commit shortly.

dk(4): Set .d_cfdriver and .d_devtounit to plug open/detach race.

This way, opening dkN or rdkN will wait if attach or detach is still
in progress, and vdevgone will wake up such pending opens and make
them fail. So it is no longer possible for a wedge to be detached
after dkopen has already started using it.

For now, we use a custom .d_devtounit function that looks up the
autoconf unit number via the dkwedges array, which conceivably may
use an independent unit numbering system -- nothing guarantees they
match up. (In practice they will mostly match up, but concurrent
wedge creation could lead to different numbering.) Eventually this
should be changed so the two numbering systems match, which would let
us delete the new dkunit function and just use dev_minor_unit like
many other drivers can.

dk(4): Take a read-lock on dkwedges_lock if we're only reading.
- dkwedge_find_by_name
- dkwedge_find_by_parent
- dkwedge_print_wnames

dk(4): Omit needless locking in dksize, dkdump.

All the members these use are stable after initialization, except for
the wedge size, which dkwedge_size safely reads a snapshot of without
locking in the caller.

dk(4): dkdump: Simplify. No functional change intended.

dk(4): Narrow the scope of the device numbering lookup on detach.

We need it only for vdevgone; its order relative to other things in
detach doesn't matter.

No functional change intended.

disk(9): Fix missing unlock in error branch in previous change.

dk(4): Fix racy access to sc->sc_dk.dk_openmask in dkwedge_delall1.

Need sc->sc_parent->dk_rawlock for this, as used in dkopen/dkclose.

dk(4): Convert tests to assertions in various devsw operations.

.d_cancel, .d_strategy, .d_read, .d_write, .d_ioctl, and .d_discard
are only ever used between successful .d_open return and entry to
.d_close. .d_open doesn't return until sc is nonnull and sc_state is
RUNNING, and dkwedge_detach waits for the last .d_close before
setting sc_state to DEAD. So there is no possibility for sc to be
null or for sc_state to be anything other than RUNNING or DYING.

There is a small functional change here but only in the event of a
race: in the short window between when dkwedge_detach is entered, and
when .d_close runs, any I/O operations (read, write, ioctl, &c.) may
be issued that would have failed with ENXIO before.

This shouldn't matter for anything: disk I/O operations are supposed
to complete reasonably promptly, and these operations _could_ have
begun milliseconds prior, before dkwedge_detach was entered, so it's
not a significant distinction.

Notes:
- .d_open must still contend with trying to open a nonexistent wedge,
of course.
- .d_close must also contend with closing a nonexistent wedge, in
case there were two calls to open in quick succession and the first
failed while the second hadn't yet determined it would fail.
- .d_size and .d_dump are used from ddb without any open/close.

dk(4): Fix lock assertion in size increase: parent's, not wedge's.

dk(4): Rename label for consistency. No functional change intended.

dk(4): dkclose must handle a dying wedge too to close the parent.

Otherwise the parent open leaks on detach (or revoke) when the wedge
was open and had to be forcibly closed.

Fixes assertion sc->sc_dk.dk_openmask == 0.

ioctl(DIOCRMWEDGES): Delete only idle wedges.

Don't forcibly delete busy wedges.

Fixes accidental destruction of the busy wedge that the root file
system is mounted on, triggered by syzbot's ioctl(DIOCRMWEDGES).
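
The policy can be sketched in userland as follows. This is an
illustration only: struct idle_wedge and delete_idle_wedges are
hypothetical, and a nonzero openmask is assumed to be what marks a
wedge as busy.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a wedge's open state. */
struct idle_wedge {
	unsigned openmask;	/* nonzero while any instance is open */
	int deleted;		/* set once the wedge has been removed */
};

/*
 * Delete every wedge that nobody has open; leave busy ones alone
 * instead of forcing them closed.  Returns the number deleted.
 */
static size_t
delete_idle_wedges(struct idle_wedge *w, size_t n)
{
	size_t i, ndeleted = 0;

	assert(w != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (w[i].deleted || w[i].openmask != 0)
			continue;	/* already gone, or busy: skip */
		w[i].deleted = 1;
		ndeleted++;
	}
	return ndeleted;
}
```

In this model a wedge backing a mounted file system keeps a nonzero
openmask, so the loop passes over it.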

dk(4): Omit needless sc_iopend, sc_dkdrn mechanism.

vdevgone guarantees that all instances are closed by the time it
returns, which in turn guarantees all I/O operations (read, write,
ioctl, &c.) have completed, and, if the block device is open,
vinvalbuf(V_SAVE) -> vflushbuf has completed, which forces all
buffered transfers to be issued and waits for them to complete.

So by the time vdevgone returns, no further transfers can be
submitted and the bufq must be empty.

dk(4): Fix typo: sc_state, not sc_satte.

Had tested a patch series, but not every patch in it, and I
inadvertently fixed the typo in a later patch in the series, not in
the one I committed.

dk(4): Make it clearer that dkopen EROFS branch doesn't leak.

It looked like we may need to sometimes call dklastclose in the error
branch for the case of (flags & ~sc->sc_mode & FWRITE) != 0, but it
is not actually possible to reach that case: if the caller requested
read/write, and the parent is read-only, and it is the first time
we've opened the parent, then dkfirstopen will fail with EROFS so we
never get there.

But this is confusing and it looked like the error branch is wrong,
so let's rearrange the conditional to make it clearer that we cannot
goto out after dkfirstopen has succeeded. And then assert that the
case cannot happen when we do call dkfirstopen.

dk(4): Need pdk->dk_openlock to read pdk->dk_wedges.
