History log of /src/sys/ufs/lfs/lfs_segment.c
Revision  Date  Author  Comments
 1.293  17-Sep-2025  perseant Add working in-kernel roll forward.
 1.292  17-Sep-2025  perseant Use a workqueue to handle the superblock callback.
 1.291  17-Sep-2025  perseant Add routines to check freelist consistency if compiled with DEBUG and
conditional on a kernel variable manipulated via sysctl.
Add checks before and after each routine that modifies the free list.
#if 0 a section of lfs_vfree() that was intended to keep the free list ordered
but instead corrupted it.
 1.290  04-Sep-2025  perseant Copy the flags from a full partial segment to its continuation, if
a continuation is necessary, so that partial-segment collections marked
with SS_DIROP|SS_CONT are properly completed with a partial-segment marked
SS_DIROP (without SS_CONT). Necessary for roll-forward.
 1.289  02-Sep-2025  perseant Use a workqueue to handle cluster iodone, rather than doing it in interrupt context.
 1.288  05-Sep-2020  riastradh Round of uvm.h cleanup.

The poorly named uvm.h is generally supposed to be for uvm-internal
users only.

- Narrow it to files that actually need it -- mostly files that need
to query whether curlwp is the pagedaemon, which should maybe be
exposed by an external header.

- Use uvm_extern.h where feasible and uvm_*.h for things not exposed
by it. We should split up uvm_extern.h but this will serve for now
to reduce the uvm.h dependencies.

- Use uvm_stat.h and #ifdef UVMHIST uvm.h for files that use
UVMHIST(ubchist), since ubchist is declared in uvm.h but the
reference evaporates if UVMHIST is not defined, so we reduce header
file dependencies.

- Make uvm_device.h and uvm_swap.h independently includable while
here.

ok chs@
 1.287  13-Aug-2020  riastradh Skip unlinked inodes.

They no longer matter on disk so we don't need to write anything out
for them.
 1.286  23-Feb-2020  ad UVM locking changes, proposed on tech-kern:

- Change the lock on uvm_object, vm_amap and vm_anon to be a RW lock.
- Break v_interlock and vmobjlock apart. v_interlock remains a mutex.
- Do partial PV list locking in the x86 pmap. Others to follow later.
 1.285  23-Feb-2020  riastradh Break deadlock in PR kern/52301.

The lock order is lfs_writer -> lfs_seglock. The problem in 52301 is
that lfs_segwrite violates this lock order by sometimes doing
lfs_seglock -> lfs_writer, either (a) when doing a checkpoint or (b),
opportunistically, when there are no dirops pending. Both cases can
deadlock, because dirops sometimes take the seglock (lfs_truncate,
lfs_valloc, lfs_vfree):

(a) There may be dirops pending, and they may be waiting for the
seglock, so we can't wait for them to complete while holding the
seglock.

(b) The test for fs->lfs_dirops == 0 happens unlocked, and the state
may change by the time lfs_writer_enter acquires lfs_lock.

To resolve this in each case:

(a) Do lfs_writer_enter before lfs_seglock, since we will need it
unconditionally anyway. The worst performance impact of this should
be that some dirops get delayed a little bit.

(b) Create a new lfs_writer_tryenter to use at this point so that the
test for fs->lfs_dirops == 0 and the acquisition of lfs_writer happen
atomically under lfs_lock.
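
The fix for case (b) can be sketched in userland C. This is an illustration under stated assumptions, not the kernel code: `struct fs_sketch`, `writer_tryenter_sketch()` and `writer_leave_sketch()` are hypothetical names, a pthread mutex stands in for lfs_lock, and the real lfs_writer_tryenter may differ in detail. What it shows is that the dirops test and the writer acquisition share a single lock hold, so the state cannot change between them.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for the relevant fields of struct lfs. */
struct fs_sketch {
	pthread_mutex_t lock;	/* plays the role of lfs_lock */
	int dirops;		/* directory operations in progress */
	int writer;		/* writers currently holding off dirops */
};

/*
 * Try to become a writer without sleeping.  Succeeds only if no dirops
 * are active; the test and the increment share one critical section.
 */
static bool
writer_tryenter_sketch(struct fs_sketch *fs)
{
	bool ok;

	pthread_mutex_lock(&fs->lock);
	ok = (fs->dirops == 0);
	if (ok)
		fs->writer++;
	pthread_mutex_unlock(&fs->lock);
	return ok;
}

/* Drop the writer reference taken by a successful tryenter. */
static void
writer_leave_sketch(struct fs_sketch *fs)
{
	pthread_mutex_lock(&fs->lock);
	fs->writer--;
	pthread_mutex_unlock(&fs->lock);
}
```

A caller that gets `false` back simply skips the opportunistic work instead of waiting while holding another lock, which is what keeps the lfs_writer -> lfs_seglock order intact.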
 1.284  23-Feb-2020  riastradh Change some cheap KDASSERT into KASSERT.
 1.283  22-Feb-2020  ad Make LFS/rump play nice with aiodoned removal.

PR kern/55004 (Hundreds of file system tests now fail on real hardware)
 1.282  18-Feb-2020  chs remove the aiodoned thread. I originally added this to provide a thread context
for doing page cache iodone work, but since then biodone() has changed to
hand off all iodone work to a softint thread, so we no longer need the
special-purpose aiodoned thread.
 1.281  15-Jan-2020  ad Merge from yamt-pagecache (after much testing):

- Reduce unnecessary page scan in putpages esp. when an object has a ton of
pages cached but only a few of them are dirty.

- Reduce the number of pmap operations by tracking page dirtiness more
precisely in uvm layer.
 1.280  08-Dec-2019  ad branches: 1.280.2;
Revert previous. No performance gain worth the potential headaches
with buffers in these contexts.
 1.279  08-Dec-2019  ad Avoid thundering herd: cv_broadcast(&bp->b_busy) -> cv_signal(&bp->b_busy)
 1.278  03-Sep-2018  riastradh branches: 1.278.4;
Rename min/max -> uimin/uimax for better honesty.

These functions are defined on unsigned int. The generic name
min/max should not silently truncate to 32 bits on 64-bit systems.
This is purely a name change -- no functional change intended.

HOWEVER! Some subsystems have

#define min(a, b) ((a) < (b) ? (a) : (b))
#define max(a, b) ((a) > (b) ? (a) : (b))

even though our standard name for that is MIN/MAX. Although these
may invite multiple evaluation bugs, these do _not_ cause integer
truncation.

To avoid `fixing' these cases, I first changed the name in libkern,
and then compile-tested every file where min/max occurred in order to
confirm that it failed -- and thus confirm that nothing shadowed
min/max -- before changing it.

I have left a handful of bootloaders that are too annoying to
compile-test, and some dead code:

cobalt ews4800mips hp300 hppa ia64 luna68k vax
acorn32/if_ie.c (not included in any kernels)
macppc/if_gm.c (superseded by gem(4))

It should be easy to fix the fallout once identified -- this way of
doing things fails safe, and the goal here, after all, is to _avoid_
silent integer truncations, not introduce them.

Maybe one day we can reintroduce min/max as type-generic things that
never silently truncate. But we should avoid doing that for a while,
so that existing code has a chance to be detected by the compiler for
conversion to uimin/uimax without changing the semantics until we can
properly audit it all. (Who knows, maybe in some cases integer
truncation is actually intended!)
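
The truncation hazard described above can be reproduced in a few lines. This is a sketch under assumptions: `uimin_sketch` and `MIN_SKETCH` are hypothetical stand-ins for the libkern function and the per-subsystem macro, and unsigned int is assumed to be 32 bits wide, as on the 64-bit systems the message refers to.

```c
#include <stdint.h>

/*
 * Hypothetical stand-in for the libkern function: it is defined on
 * unsigned int, so wider arguments are converted before the comparison.
 */
static unsigned int
uimin_sketch(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/* Macro form, as some subsystems define it: no conversion takes place. */
#define MIN_SKETCH(a, b)	((a) < (b) ? (a) : (b))

/* Minimum of two 64-bit values via the function: high bits are lost. */
static uint64_t
min64_via_function(uint64_t a, uint64_t b)
{
	/* The casts only make explicit what the call does implicitly. */
	return uimin_sketch((unsigned int)a, (unsigned int)b);
}

/* Minimum of two 64-bit values via the macro: compared at full width. */
static uint64_t
min64_via_macro(uint64_t a, uint64_t b)
{
	return MIN_SKETCH(a, b);
}
```

With a = 2^32 + 5 and b = 7, the function form compares 5 against 7 and returns 5, while the macro form returns 7. The macro still evaluates its arguments more than once, which is the separate bug class the message mentions.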
 1.277  09-Jun-2018  zafer branches: 1.277.2;
Add missing b_cflags and b_oflags.
Ok dholland@
Addresses PR kern/42342 by Yoshihiro Nakajima
 1.276  06-Jun-2018  maya Remove duplicate ;
 1.275  20-Aug-2017  maya branches: 1.275.2;
XXX question our double-flushing of dirops
 1.274  26-Jul-2017  maya change lfs_nextsegsleep and lfs_allclean_wakeup to use condvar

XXX had to use lfs_lock in lfs_segwait, removed kernel_lock, is this
appropriate?
 1.273  26-Jul-2017  maya Revert r1.272 fix to PR kern/52301, the performance hit is making things
unusable.
 1.272  15-Jun-2017  maya It isn't safe to drain dirops with seglock held, it'll deadlock if there
are any dirops. drain before grabbing seglock.

lfs_dirops == 0 is always true (as we already drained dirops), so omit
that part of the comparison.

Fixes a lot of LFS deadlocks. PR kern/52301

Many thanks to dholland for help analyzing coredumps
 1.271  12-Jun-2017  maya Use continue to denote the no-op loop to match netbsd style
newline for extra clarity.
 1.270  10-Jun-2017  maya Rename i_flag to i_state.

The similarity to i_flags has previously caused errors.
 1.269  06-Apr-2017  maya branches: 1.269.6;
don't guard lfs_sbactive or lfs_log with splbio, lfs_lock is plenty.
 1.268  06-Apr-2017  maya remove deprecated comment (and move it below assert)
there's no spl dance for I/O here.
 1.267  06-Apr-2017  maya Provide a LFS_ENTER_LOG (__nothing) in the !DEBUG case.
so I can drop lots of #ifdef DEBUG around this macro. NFCI
 1.266  06-Apr-2017  maya Drop single use macro LFS_BCLEAN_LOG with an inlined implementation.

LFS_ENTER_LOG currently macro grabs lfs_lock, so I'd like to have just one
name for it.
 1.265  01-Apr-2017  riastradh KASSERT(mutex_owned(vp->v_interlock)) in vnode iterator selector.
 1.264  13-Mar-2017  riastradh #if DIAGNOSTIC panic ---> KASSERT

Replace some #if DEBUG by this too. DEBUG is only for expensive
assertions; these are not.
 1.263  19-Oct-2015  dholland branches: 1.263.2; 1.263.4;
improve some panic messages
 1.262  10-Oct-2015  dholland Fix minor bitrot in #if 0 or otherwise disabled code.
 1.261  10-Oct-2015  dholland Use accessors for some more indirect block manipulations.
 1.260  03-Oct-2015  dholland Use IINFO in lfs_writeinode().
(both the kernel and the userland copies)
 1.259  01-Sep-2015  dholland Use the lfs dinode accessors in place of the ufs-derived ones.
(Mostly.)

The ufs-derived ones are fake structure member macros, which are gross
and not very safe. Also, it seems that a lot of places in the lfs code
were using the ffsv1 branch of them unconditionally, and this way it's
guaranteed all those places have been updated.

Found while doing this: for non-devices, have getattr produce NODEV
in the rdev field instead of leaking the address of the first direct
block.
 1.258  21-Aug-2015  hannken lfs_writevnodes: replace mnt_vnodelist traversal with vfs_vnode_iterator.
 1.257  19-Aug-2015  dholland Part two of dinodes; use the same union everywhere.
(previously the ufs-derived code had things set up slightly different)

Remove a bunch of associated mess.
 1.256  12-Aug-2015  dholland Hack up dinode usage to be 64 vs. 32 as needed. Part 1.

(This part changes the native lfs code; the ufs-derived code already
has 64 vs. 32 logic, but as aspects of it are unsafe, and don't
entirely interoperate cleanly with the lfs 64/32 stuff, pass 2 will be
rehashing that.)
 1.255  12-Aug-2015  dholland Provide 32-bit and 64-bit versions of FINFO.

This also entailed sorting out part of struct segment, as that
contains a pointer into the current FINFO data.
 1.254  12-Aug-2015  dholland Make 32-bit and 64-bit versions of SEGSUM.
Also fix some of the FINFO handling as it's closely entangled.
 1.253  12-Aug-2015  dholland Add IFILE32 and IFILE64 structures for the on-disk ifile entries.
Add and use accessors. There are also a bunch of places that cast and
I hope I've found them all...
 1.252  12-Aug-2015  dholland Make 32-bit and 64-bit versions of CLEANERINFO.

XXX: while this is written to disk, it seems like much of it would
XXX: be better set up as a commpage shared with the cleaner.
 1.251  02-Aug-2015  dholland Pass the fs object to LFS_MAX_DADDR so it can check lfs_is64.

Remove some hackish intentional 64->32 truncations next to the checks
using LFS_MAX_DADDR, and tackle the problem they handled in bmap
instead.

The problem: the magic block pointer value UNWRITTEN has magic value
-2, and if it's not handled specifically, uint32 -> uint64 promotion
turns it into 4294967294, which then causes consternation and
monkeyhouse downstream.

What's here is still kind of a hack, but it's a step forward.
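
The promotion problem can be shown in isolation. This is a minimal sketch, not the bmap change itself: `UNWRITTEN_SKETCH`, `widen_naive()` and `widen_signed()` are hypothetical names, and the signed reinterpretation relies on the two's-complement conversion that common compilers provide.

```c
#include <stdint.h>

#define UNWRITTEN_SKETCH	(-2)	/* magic "unwritten" value, per the log */

/* Plain widening: the 32-bit pattern is zero-extended. */
static int64_t
widen_naive(uint32_t ondisk)
{
	return (int64_t)(uint64_t)ondisk;
}

/*
 * Widening that reinterprets the 32-bit pattern as signed first, so the
 * magic value survives.  The uint32_t -> int32_t conversion of values
 * above INT32_MAX is implementation-defined in older C standards;
 * mainstream compilers wrap, which is what this sketch assumes.
 */
static int64_t
widen_signed(uint32_t ondisk)
{
	return (int64_t)(int32_t)ondisk;
}
```

Storing `UNWRITTEN_SKETCH` in a 32-bit field and passing it to `widen_naive()` yields 4294967294, the value quoted above, whereas `widen_signed()` yields -2. Ordinary block addresses below 2^31 come out the same either way.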
 1.250  02-Aug-2015  dholland Add a (draft) 64-bit superblock. Make things build again.

Add pieces of support for using both superblock types where
convenient, and specifically to the superblock accessors, but don't
actually enable it anywhere.

First substantive step on PR 50000.
 1.249  02-Aug-2015  dholland Use accessor functions for the version field of the lfs superblock.
I thought at first maybe the cases that test the version should be
rolled into the accessors, but on the whole I think the conclusion on
that is no.
 1.248  02-Aug-2015  dholland Make i_eff_nblks in the in-memory inode 64 bits wide.
 1.247  02-Aug-2015  dholland Fix catastrophic bug in lfs_rewind() that changed segment numbers
(lfs_curseg/lfs_nextseg in the superblock) using the wrong units.
These fields are for whatever reason the start addresses of segments
(measured in frags) rather than the segment numbers 0..n.

This only apparently affects dumping from a mounted fs; however, it
trashes the fs.

I would really, really like to have a static analysis tool that can
keep track of the units things are measured in, since fs code is full
of conversion macros and the macros are named inscrutable things like
"sntod" whose letters don't necessarily even correspond to the units
they convert. It is surprising that more of these are not wrong.
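
Short of a static analysis tool, one way to get the compiler to track units is to wrap each unit in its own struct type. The sketch below is purely illustrative: all names (`segnum_sketch`, `fragaddr_sketch`, `geom_sketch` and the two converters) are hypothetical, and the layout (segment 0 at a fixed address, equal-sized segments) is an assumption, not the LFS on-disk format.

```c
#include <stdint.h>

struct segnum_sketch { uint32_t v; };	/* segment number 0..n */
struct fragaddr_sketch { uint64_t v; };	/* disk address, in frags */

/* Hypothetical geometry: where segment 0 starts and how big segments are. */
struct geom_sketch {
	uint64_t s0addr;		/* start of segment 0, in frags */
	uint64_t frags_per_seg;		/* segment size, in frags */
};

/* Segment number -> start address of that segment. */
static struct fragaddr_sketch
seg_to_addr(const struct geom_sketch *g, struct segnum_sketch sn)
{
	struct fragaddr_sketch a = {
		g->s0addr + (uint64_t)sn.v * g->frags_per_seg
	};
	return a;
}

/* Address anywhere inside a segment -> that segment's number. */
static struct segnum_sketch
addr_to_seg(const struct geom_sketch *g, struct fragaddr_sketch a)
{
	struct segnum_sketch sn = {
		(uint32_t)((a.v - g->s0addr) / g->frags_per_seg)
	};
	return sn;
}
```

Because the two wrappers are distinct types, assigning a segment number to a field that holds a start address fails to compile instead of silently corrupting the value.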
 1.246  02-Aug-2015  dholland Second batch of 64 -> 32 truncations in lfs, along with more minor
tidyups and corrections in passing.
 1.245  28-Jul-2015  dholland Add a new lfs header file: lfs_accessors.h.

This contains all the accessor functions and macros out of lfs.h.
Add an include of lfs_accessors.h after all uses of lfs.h... except
for code that wants to define its own struct lfs-alike that the
accessors are supposed to play along with. For these, set STRUCT_LFS
and include lfs_accessors.h after the necessary structure has been
defined, so that lfs_accessors.h can emit functions in terms of it.
 1.244  25-Jul-2015  martin Use accessors in DEBUG and DIAGNOSTIC code as well
 1.243  24-Jul-2015  dholland More lfs superblock accessors.
(This changes the rest of the code over; all the accessors were
already added.)

The difference between this commit and the previous one is arbitrary,
but the previous one passed the regression tests on its own so I'm
keeping it separate to help with any bisections that might be needed
in the future.
 1.242  24-Jul-2015  dholland Switch to accessor functions for elements of the LFS on-disk
superblock. This will allow switching between 32/64 bit forms on the
fly; it will also allow handling LFS_EI reasonably tidily. (That
currently doesn't work on the superblock.)

It also gets rid of cpp abuse in the form of fake structure member
macros.

Also, instead of doing sleep/wakeup on &lfs_avail and &lfs_nextseg
inside the on-disk superblock, add extra elements to the in-memory
struct lfs for this. (XXX: these should be changed to condvars, but
not right now)

XXX: this migrates a structure needed by the lfs code in libsa (struct
salfs) into lfs.h, where it doesn't belong, but for the time being
this is necessary in order to allow the accessors (and the various
lfs macros and other goop that relies on them) to compile.
 1.241  07-Jun-2015  hannken Fix copy and paste errors from last commits.
- Kernel i386/ALL and amd64/ALL compile again.
- Resolves CID 1304138 (DEADCODE) and 1304139 (IDENTICAL_BRANCHES).
 1.240  31-May-2015  hannken Change lfs from hash table to vcache.

- Change lfs_valloc() to return an inode number and version instead of
a vnode and move lfs_ialloc() and lfs_vcreate() to new lfs_init_vnode().

- Add lfs_valloc_fixed() to allocate a known inode, used by kernel
roll forward.

- Remove lfs_*ref(), these functions cannot coexist with vcache and
their commented behaviour is far away from their implementation.

- Add the cleaner lwp and blockinfo to struct ulfsmount so lfs_loadvnode()
may use hints from the cleaner.

- Remove vnode locks from ulfs_lookup() like we did with ufs_lookup().
 1.239  31-May-2015  hannken Use VFS_PROTOS() for lfs.
Rename conflicting struct lfs field "lfs_start" to "lfs_s0addr".

No functional change.
 1.238  20-Apr-2015  riastradh Make vget always return vnode unlocked.

Convert callers who want locks to use vn_lock afterward.

Add extra argument so the compiler will report stragglers.
 1.237  28-Mar-2015  maxv Remove the 'cred' argument from bread(). Remove a now unused var in
ffs_snapshot.c. Update the man page accordingly.

ok hannken@
 1.236  24-Mar-2014  hannken branches: 1.236.4; 1.236.6;
- Make VI_XLOCK, VI_CLEAN and VI_LOCKSHARE private to kern/vfs_*.c.
- Make vwait() static.
- Add vdead_check() to check a vnode for being or becoming dead.

Discussed on tech-kern.

Welcome to 6.99.38
 1.235  18-Mar-2014  hannken Operations vmark(), vunmark() and vismarker() have been replaced by
vfs_vnode_iterator_*(), remove them.

Document vfs_vnode_iterator_*().

Make VI_MARKER private to vfs_vnode.c, vfs_mount.c and unfortunately
to ufs/lfs/lfs_segment.c.

Welcome to 6.99.37
 1.234  17-Mar-2014  hannken Change vismarker() to VI_MARKER for lfs_writevnodes().
This operation has to be changed to vfs_vnode_iterator.
 1.233  29-Oct-2013  hannken Vnode API cleanup pass 1.

- Make these defines and functions private to vfs_vnode.c:

VC_MASK, VC_LOCK, DOCLOSE, VI_INACTREDO and VI_INACTNOW
vclean() and vrelel()

- Remove the long time unused lwp argument from vrecycle().

- Remove vtryget(), it is responsible for ugly hacks and doesn't
look that effective.

Presented on tech-kern.

Welcome to 6.99.25
 1.232  17-Oct-2013  christos - remove unused variables
- add debug ifdefs for debugging variables
- __USE() where appropriate.
 1.231  28-Jul-2013  dholland Add lfs_kernel.h for declarations that don't need to be exposed to userland.

lfs currently has the following headers:
lfs.h - on-disk structures and stuff needed for userlevel tools
lfs_inode.h - additional restricted materials for userlevel tools
that operate the fs (newfs_lfs, fsck_lfs, lfs_cleanerd)
lfs_kernel.h - stuff needed only in the kernel

and the following legacy headers that are expected to be mopped up and
folded into one of the above:
lfs_extern.h - function prototypes
ulfs_bswap.h - endian-independent support
ulfs_dinode.h - now contains very little
ulfs_dirhash.h - dirhash support
ulfs_extattr.h - extattr support
ulfs_extern.h - more function prototypes
ulfs_inode.h - assorted kernel-only declarations
ulfs_quota.h - quota support
ulfs_quota1.h - more quota support
ulfs_quota2.h - more quota support
ulfs_quotacommon.h - more quota support
ulfsmount.h - legacy copy of ufsmount material
 1.230  18-Jun-2013  christos branches: 1.230.2;
Prefix most of the cpp macros with lfs_ and LFS_ to avoid conflicts with ffs.
This was done so that boot blocks that want to compile both FFS and LFS in
the same file work.
 1.229  08-Jun-2013  dholland ulfs_dir.h has been emptied; remove it.
 1.228  08-Jun-2013  dholland Stick LFS_ in front of IFMT, IFIFO, IFREG, etc. so as not to conflict
with the UFS copies of these symbols. (Which themselves ought to have
UFS_ stuck on.)
 1.227  06-Jun-2013  dholland Split lfs from ufs step 4:

Massedit all ufs symbols to be "ulfs" instead, to make sure there are
no conflicts with ufs. Confirmed with grep.

(This required changing a few comments that maybe should have been
left alone to say "ulfs", but we'll survive that.)
 1.226  06-Jun-2013  dholland Split lfs from ufs, part 2:

Change all <ufs/ufs/foo.h> includes to <ufs/lfs/ulfs_foo.h>.
 1.225  22-Jan-2013  dholland Stuff UFS_ in front of a few of ufs's symbols to reduce namespace
pollution. Specifically:
ROOTINO -> UFS_ROOTINO
WINO -> UFS_WINO
NXADDR -> UFS_NXADDR
NDADDR -> UFS_NDADDR
NIADDR -> UFS_NIADDR
MAXSYMLINKLEN -> UFS_MAXSYMLINKLEN
MAXSYMLINKLEN_UFS[12] -> UFS[12]_MAXSYMLINKLEN (for consistency)

Sort out ext2fs's misuse of NDADDR and NIADDR; fortunately, these have
the same values in ext2fs and ffs.

No functional change intended.
 1.224  16-Feb-2012  perseant branches: 1.224.2;
Pass t_renamerace and t_rmdirrace tests.

Adapt dholland@'s fix to ufs_rename to fix PR kern/43582. Address several
other MP locking issues discovered during the course of investigating the
same problem.

Removed extraneous vn_lock() calls on the Ifile, since the Ifile writes
are controlled by the segment lock.

Fix PR kern/45982 by deemphasizing the estimate of how much metadata
will fill the empty space on disk when the disk is nearly empty
(t_renamerace creates a lot of inode blocks on a tiny empty disk).
 1.223  02-Jan-2012  perseant branches: 1.223.2;

* Remove PGO_RECLAIM during lfs_putpages()' call to genfs_putpages(),
to avoid a live lock in the latter when reclaiming a vnode with
dirty pages.

* Add a new segment flag, SEGM_RECLAIM, to note when a segment is
being written for vnode reclamation, and record which inode is being
reclaimed, to aid in forensic debugging.

* Add a new segment flag, SEGM_SINGLE, so that opportunistic writes
can write a single segment's worth of blocks and then stop, rather
than writing all the way up to the cleaner's reserved number of
segments.

* Add assert statements to check mutex ownership is the way it ought
to be, mostly in lfs_putpages; fix problems uncovered by this.

* Don't clear VU_DIROP until the inode actually makes its way to disk,
avoiding a problem where dirop inodes could become separated
(uncovered by a modified version of the "ckckp" forensic regression
test).

* Move the vfs_getopsbyname() call into lfs_writerd. Prepare code to
make lfs_writerd notice when there are no more LFSs, and exit losing
the reference, so that, in theory, the module can be unloaded. This
code is not enabled, since it causes a crash on exit.

* Set IN_MODIFIED on inodes flushed by lfs_flush_dirops. Really we
only need to set IN_MODIFIED if we are going to write them again
(e.g., to write pages); need to think about this more.

Finally, several changes to help avoid "no clean segments" panics:

* In lfs_bmapv, note when a vnode is loaded only to discover whether
its blocks are live, so it can immediately be recycled. Since the
cleaner will try to choose ~empty segments over full ones, this
prevents the cleaner from (1) filling the vnode cache with junk, and
(2) squeezing any unwritten writes to disk and running the fs out of
segments.

* Overestimate by half the amount of metadata that will be required
to fill the clean segments. This will make the disk appear smaller,
but should help avoid a "no clean segments" panic.

* Rearrange lfs_writerd. In particular, lfs_writerd now pays
attention to the number of clean segments available, and holds off
writing until there is room.
 1.222  11-Jul-2011  hannken branches: 1.222.2; 1.222.6;
Change VOP_BWRITE() to take a vnode as its first argument like all other
VOPs do. Layered file systems no longer have to modify bp->b_vp and run
into trouble when an async VOP_BWRITE() uses the wrong vnode.

- change all occurrences of VOP_BWRITE(bp) to VOP_BWRITE(bp->b_vp, bp).
- remove layer_bwrite().
- welcome to 5.99.55

Addresses PR kern/38762 panic: vwakeup: neg numoutput

No objections from tech-kern@.
 1.221  12-Jun-2011  rmind Welcome to 5.99.53! Merge rmind-uvmplock branch:

- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.

- Simplify locking in some pmap(9) modules by removing P->V locking.

- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).

- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.

- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.

Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.
 1.220  03-Apr-2011  rmind branches: 1.220.2;
- Use offsetof() in VOPARG_OFFSETOF() instead of re-implementing it.
- Remove VDESC_NOMAP_VPP and VDESC_VPP_WILLRELE.
- Remove VRELEL_NOINACTIVE and VRELEL_ONHEAD.
 1.219  02-Apr-2011  rmind Split off parts of vfs_subr.c into vfs_vnode.c and vfs_mount.c modules.

No functional change. Discussed on tech-kern@.
 1.218  23-Mar-2011  rmind G/C count_lock_queue (unused for 12 years)
 1.217  21-Jul-2010  hannken branches: 1.217.2;
Make holding v_interlock mandatory for callers of vget().

Announced some time ago on tech-kern.
 1.216  24-Jun-2010  hannken Clean up vnode lock operations pass 2:

VOP_UNLOCK(vp, flags) -> VOP_UNLOCK(vp): Remove the unneeded flags argument.

Welcome to 5.99.32.

Discussed on tech-kern.
 1.215  16-Feb-2010  mlelstv branches: 1.215.2;
Three changes in a single commit.

- drop the notion of frags (LFS fragments) vs fsb (FFS fragments)
The code uses a complicated unity function that just makes the
code difficult to understand.

- support larger sector sizes. Fix disk address computations
to use DEV_BSIZE in the kernel as required by device drivers
and to use sector sizes in userland.

- Fix several locking bugs in lfs_bio.c and lfs_subr.c.
 1.214  07-Aug-2009  wiz branches: 1.214.2;
Add missing parenthesis in #ifdef LFS_USE_B_INVAL.
From Henning Petersen in PR 41841.
 1.213  02-Jun-2008  ad branches: 1.213.8; 1.213.18; 1.213.22;
Use atomics to maintain v_usecount.
 1.212  16-May-2008  hannken Make sure all cached buffers with valid, not yet written data have been
run through copy-on-write. Call fscow_run() with valid data where possible.

The LP_UFSCOW hack is no longer needed to protect ffs_copyonwrite() against
endless recursion.

- Add a flag B_MODIFY to bread(), breada() and breadn(). If set the caller
intends to modify the buffer returned.

- Always run copy-on-write on buffers returned from ffs_balloc().

- Add new function ffs_getblk() that gets a buffer, assigns a new blkno,
may clear the buffer and runs copy-on-write. Process possible errors
from getblk() or fscow_run(). Part of PR kern/38664.

Welcome to 4.99.63

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>
 1.211  28-Apr-2008  martin branches: 1.211.2;
Remove clause 3 and 4 from TNF licenses
 1.210  27-Mar-2008  ad branches: 1.210.2; 1.210.4;
Make rusage collection per-LWP and collate in the appropriate places.
cloned threads need a little bit more work but the locking needs to
be fixed first.
 1.209  15-Feb-2008  ad branches: 1.209.6;
The buffer LOCKED flag need not be under the protection of bufcache_lock,
BUSY is enough.
 1.208  27-Jan-2008  pooka Replace vrelel() 010101-mania with a flags parameter. However,
leave flags unimplemented for a while (no change in functionality).
 1.207  02-Jan-2008  ad Merge vmlocking2 to head.
 1.206  10-Oct-2007  ad branches: 1.206.4; 1.206.6; 1.206.10;
Merge from vmlocking:

- Split vnode::v_flag into three fields, depending on field locking.
- simple_lock -> kmutex in a few places.
- Fix some simple locking problems.
 1.205  08-Oct-2007  ad Merge ffs locking & brelse changes from the vmlocking branch.
 1.204  09-Aug-2007  pooka branches: 1.204.2; 1.204.4;
Instead of having lfs muck directly about with vnode free lists,
introduce vrele2(), which allows to release vnodes the way lfs
sometimes wants it:
+ without calling inactive
+ inserting the vnode at the head of the freelist (this is a very
questionable optimization that isn't even enabled by default,
but I went along with the same semantics for now)
 1.203  29-Jul-2007  ad branches: 1.203.4; 1.203.6;
It's not a good idea for device drivers to modify b_flags, as they don't
need to understand the locking around that field. Instead of setting
B_ERROR, set b_error instead. b_error is 'owned' by whoever completes
the I/O request.
 1.202  12-Jul-2007  rmind branches: 1.202.2;
Implementation of per-CPU work-queues support for workqueue(9) interface.
WQ_PERCPU flag for workqueue and additional argument for workqueue_enqueue()
to assign a CPU might be used. Notes:
- For now, the list is used for workqueue_queue, which is non-optimal,
and will be changed with array, where index would be CPU ID.
- The data structures should be changed to be cache-friendly.

Reviewed by: <yamt>, <tech-kern>
 1.201  30-Jun-2007  pooka Using POOL_INIT here makes no sense, since file systems always have
an init method. So get rid of it and #ifdef _LKM and just always
init in the init method. Give malloc types the same treatment.
Makes file systems nicer to work with in linksetless environments
and fixes a few LKM discrepancies.
 1.200  16-May-2007  perseant Change references to SEGM_W_DIROPS to SEGM_CKP, and replace the logic that
formerly used SEGM_W_DIROPS in lfs_segwrite() appropriately. This prevents
a problem in which processes could get stuck in "buffers" sleep forever.
 1.199  17-Apr-2007  perseant Install a new sysctl, vfs.lfs.ignore_lazy_sync, which causes LFS to ignore
the "smooth" syncer, as if vfs.sync.*delay = 0, but only for LFS. The
default is "on", i.e., ignore lazy sync.

Reduce the amount of polling/busy-waiting done by lfs_putpages(). To
accomplish this, copied genfs_putpages() and modified it to indicate which
page it was that caused it to return with EDEADLK. fsync()/fdatasync()
should no longer ever fail with EAGAIN, and should not consume huge
quantities of cpu.

Also, try to make dirops less likely to be written as the result of a
VOP_PUTPAGES(), while ensuring that they are written regularly.
 1.198  04-Mar-2007  christos branches: 1.198.2; 1.198.4;
Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.
 1.197  23-Feb-2007  perseant Reverse the order of searching the vnode list in lfs_writevnodes(). This
should speed up e.g. "chown -R" on LFS filesystems; e.g. it shows a 100%
increase in the 'seq_stat' column of bonnie++.
 1.196  21-Dec-2006  yamt branches: 1.196.2;
merge yamt-splraiseipl branch.

- finish implementing splraiseipl (and makeiplcookie).
http://mail-index.NetBSD.org/tech-kern/2006/07/01/0000.html
- complete workqueue(9) and fix its ipl problem, which is reported
to cause audio skipping.
- fix netbt (at least compilation problems) for some ports.
- fix PR/33218.
 1.195  16-Nov-2006  christos branches: 1.195.2; 1.195.4;
__unused removal on arguments; approved by core.
 1.194  20-Oct-2006  reinoud Replace the LIST structure mp->mnt_vnodelist to a TAILQ structure since all
vnodes were synced and processed backwards. This meant that the last
accessed node was processed first and the earliest last.

An extra benefit is the removal of the ugly hack from the Berkeley days on
LFS.

In the process, I've also replaced the various hand-written loop variations
with the TAILQ_FOREACH() macro.
 1.193  12-Oct-2006  christos - sprinkle __unused on function decls.
- fix a couple of unused bugs
- no more -Wno-unused for i386
 1.192  04-Oct-2006  christos fix empty if
 1.191  28-Sep-2006  perseant Use lockstatus instead of a homebrewed locking system to control
LFCNWRAPSTOP and LFCNWRAPGO.

Be less verbose about the various looping checks: use log() rather than
printf(), and only log anything if we are really looping ("count = 2" is
not an error condition).

Allow dirops sleeping on available space to be interruptible.
 1.190  02-Sep-2006  christos branches: 1.190.2; 1.190.4;
remove impossible test
 1.189  01-Sep-2006  perseant Changes to help the roll-forward agent, to wit:

* Mark being-deleted files in the Ifile so we can finish deleting them
at fs mount time.
* Flag the Ifile with "cleaner must clean" when writers are waiting for
the cleaner, rather than relying solely on the cleaner's estimation of
whether it should clean or not.
* Note partial segments written by a user agent (in particular,
fsck_lfs) so that repeated rolls forward don't interfere with one
another.
* Add a new fcntl, LFCNPASS, that allows the log to wrap exactly once,
for better testing of the validity of checkpoints.
* Keep track of the on-disk nlink count when cleaning, so that we don't
partially complete directory operations while cleaning.
* Ensure that every single Ifile inode write represents a consistent
view of the filesystem. In particular, the accounting for the segment
we are writing the inode into must be correct, and the accounting for
the segment that inode used to reside in must be correct. Rather than
just rewriting the inode if we wrote it wrong, rewrite the necessary
ifile blocks before writing the inode so we never write it wrong.
* Don't unmark any VDIROP vnodes if we haven't written them to disk,
avoiding yet another problem with the "wait for the cleaner" error
return from lfs_putpages().

Also, move the last callback to an aiodone call, so we no longer do any
memory management from interrupt context.
 1.188  20-Jul-2006  perseant Note partial segments that are written by the cleaner, to help out the
roll-forward agent.
 1.187  20-Jul-2006  perseant Loop on the check for lfs_nowrap, so we don't allow a process to squeeze by.
 1.186  20-Jul-2006  perseant Don't try to write all the vnodes, when the cleaner needs a vnode to be
recycled.
 1.185  29-Jun-2006  perseant Don't wake up the cleaner if the filesystem is unwrappable, and fix the
compatibility fcntls.

Also includes one-line fixes for an MP locking bug and a zero-length FINFO
problem that manifested during testing.
 1.184  24-Jun-2006  perseant Change LFCNWRAP{STOP,GO} to make them more suitable for snapshotting; in
particular, the caller can now choose whether to wait for the condition
to be met, and if the caller of LFCNWRAPSTOP dies or otherwise closes
the descriptor, the filesystem is started again. Updated the ckckp
regression test to use the new semantics.

dump_lfs(8) now uses the fcntls to implement LFS-style snapshotting through
the -X flag, addressing PR#33457 albeit not using fss(4). Fixed a couple
other problems with dump_lfs that manifested themselves during testing.
 1.183  23-Jun-2006  yamt fix a simonb-timecounters regression.
the precision of getnanotime() is not suitable for file timestamps.
esp. when it's nfs-exported.

- introduce vfs_timestamp().
(the name is from freebsd. currently merely a wrapper of nanotime())
- for ufs-like filesystems, use it rather than getnanotime().

XXX check other filesystems.
 1.182  07-Jun-2006  kardel branches: 1.182.2;
merge FreeBSD timecounters from branch simonb-timecounters
- struct timeval time is gone
time.tv_sec -> time_second
- struct timeval mono_time is gone
mono_time.tv_sec -> time_uptime
- access to time via
{get,}{micro,nano,bin}time()
get* versions are fast but less precise
- support NTP nanokernel implementation (NTP API 4)
- further reading:
Timecounter Paper: http://phk.freebsd.dk/pubs/timecounter.pdf
NTP Nanokernel: http://www.eecis.udel.edu/~mills/ntp/html/kern.html
 1.181  20-May-2006  perseant Fix a bug in which FINFOs were written with a version number of zero.
Add assertions and add this to the DEBUG fip test in lfs_writeseg.
 1.180  18-May-2006  perseant branches: 1.180.2;
Break out the finfo array manipulation code into two new functions,
lfs_acquire_finfo() and lfs_release_finfo(). Add a debugging check
for zero-length finfo arrays in the segment summary to avoid future
regressions.
 1.179  14-May-2006  elad integrate kauth.
 1.178  12-May-2006  perseant Fixes to address the "vinvalbuf: dirty blocks" panic that can occur when
many inodes are cleaned at once. Make sure that we write all the pages
on vnodes that are being flushed, even if we don't think there's room;
drain v_numoutput before lfs_vflush() completes.

Also, don't allow a vnode that is in the process of being cleaned to be
chosen by getnewvnode(); this avoids a segment accounting panic in the case
that a large number of inodes are fed to lfs_markv() all at once.
 1.177  01-May-2006  perseant Don't ever partially write dirops, even if we need the cleaner to run.
This increases the chances of the "no clean segments" panic slightly,
but allows us to run the ckckp regression test successfully to completion.
 1.176  30-Apr-2006  perseant Postpone the segment accounting changes coming from truncation until the
inode that makes those changes valid is either written to disk by
lfs_writeinode() or discarded by lfs_vfree().

A couple of locking fixes are included as well.
 1.175  22-Apr-2006  perseant Regression test improvements:

Move the stop for LFCNWRAPSTOP to the point at which writing at segment 0
is really about to commence, since this is what the test expects (and
incidentally what a snapshotting utility wants as well).

More correctly reconstruct the on-disk state at every checkpoint, rather
than relying on the entire state at the point of wrapping to be accurate
(that is only true the first time we wrap). Add a "make abort" target to
make rerunning the test more convenient when it has failed and we're done
analyzing the failure.
 1.174  17-Apr-2006  perseant Introduce two fcntl calls that freeze the filesystem right at the point
where segment 0 is being considered for writing. This allows for automated
checkpoint validity scanning, and could be used (in conjunction with the
existing LFCNREWIND) for e.g. snapshot dumps as well.

Include a regression test that does such scanning.

When writing the Ifile, loop through the dirty block list three times to
make sure that the checkpoint is always consistent (the first and second
times the Ifile blocks can cross a segment boundary; not so the third time
unless the segments are very small). Discovered by using the aforementioned
regression test.
 1.173  13-Apr-2006  perseant Make lfs_vref/lfs_vunref not need to know about VXLOCK and VFREEING
explicitly (especially since we didn't know about VFREEING at all before),
but notice the EBUSY return from vget() instead.

Fix some more MP locking protocol issues, most of which were pointed out by
Christian Ehrhardt this morning on tech-kern.
 1.172  07-Apr-2006  perseant Several minor bug fixes:

* Correct (weak) segment lock assertions in lfs_fragextend and lfs_putpages.
* Keep IN_MODIFIED set if we run out of avail in lfs_putpages.
* Don't try to (re)write buffers on a VBLK vnode; fixes a panic I found
while running with an LFS root.
* Raise priority of LFCNSEGWAIT to PVFS; PUSER is way too low for
something the pagedaemon is relying on.
 1.171  24-Mar-2006  perseant Improvements to LFS's paging mechanism, to wit:

* Acknowledge that sometimes there are more dirty pages to be written to
disk than clean segments. When we reach the danger line,
lfs_gop_write() now returns EAGAIN. The caller of VOP_PUTPAGES(), if
it holds the segment lock, drops it and waits for the cleaner to make
room before continuing.

* Note and avoid a three-way deadlock in lfs_putpages (a writer holding
a page busy blocks on the cleaner while the cleaner blocks on the
segment lock while lfs_putpages blocks on the page).
 1.170  17-Mar-2006  tls From Konrad Schroeder, in response to strange df output on anoncvs.netbsd.org:
We were returning the wrong value for free space. Now we're not.
 1.169  04-Jan-2006  yamt branches: 1.169.2; 1.169.4; 1.169.6; 1.169.8; 1.169.10;
- add simple functions to allocate/free a buffer for i/o.
- make bufpool static.
 1.168  11-Dec-2005  christos branches: 1.168.2;
merge ktrace-lwp.
 1.167  26-Sep-2005  yamt always use nanotime rather than time.
it's bad to mix nanotime and time because it sometimes
makes timestamps go backwards.
 1.166  12-Sep-2005  christos Use nanotime() to update the time fields in filesystems. Convert the code
from macros to real functions. Original patch and review from chuq.
Note: ext2fs only keeps seconds in the on-disk inode, and msdosfs does not
have enough precision for all fields, so this is not very useful for those
two.
 1.165  19-Aug-2005  christos 64 bit inode changes.
 1.164  29-May-2005  christos branches: 1.164.2;
- sprinkle const
- avoid shadow variables.
 1.163  23-Apr-2005  perseant Provide a resize_lfs(8), including kernel and cleaner support. The current
implementation requires the fs to be mounted while resizing. Tested in both
directions, and everything appears to work happily, but ymmv.
 1.162  19-Apr-2005  perseant Keep per-inode, per-fs, and subsystem-wide counts of blocks allocated through
lfs_balloc(), and use that to estimate the number of dirty pages belonging
to LFS (subsystem or filesystem). This is almost certainly wrong for
the case of a large mmap()ed region, but the accounting is tighter than
what we had before, and performs much better in the typical case of pages
dirtied through write().
 1.161  18-Apr-2005  perseant Check the to-be-on-disk consistency of directories as well (correct a typo
in an earlier commit).
 1.160  14-Apr-2005  perseant Keep track of the highest block held by an LFS inode, so that we can
be assured that the last byte of a file is always allocated. Previously
a file extension could cause the filesystem to be flushed, writing an
inconsistent inode to disk. Although this condition would be corrected
the next time blocks were written to disk, an intervening crash would leave
the filesystem in an inconsistent state, leaving fsck_lfs to complain
of an inode "partially truncated".
 1.159  01-Apr-2005  perseant Protect various per-fs structures with fs->lfs_interlock simple_lock, to
improve behavior in the multiprocessor case. Add debugging segment-lock
assertion statements.
 1.158  08-Mar-2005  perseant branches: 1.158.2;
Straighten out the maze of ifdefs. Instead, consolidate all the debugging
stuff under '#ifdef DEBUG', and use sysctl knobs to turn on/off particular
parts of the debugging reporting (if DEBUG is enabled). Re-enable the LFS
statistics in sysctl, while I'm there. A bit of a rototill.
 1.157  26-Feb-2005  perry nuke trailing whitespace
 1.156  26-Feb-2005  perseant Various minor LFS improvements:

* Note when lfs_putpages(9) thinks it is not going to be writing any
pages before calling genfs_putpages(9). This prevents a situation in
which blocks can be queued for writing without a segment header.
* Correct computation of NRESERVE(), though it is still a gross
overestimate in most cases. Note that if NRESERVE() is too high, it
may be impossible to create files on the filesystem. We catch this
case on filesystem mount and refuse to mount r/w.
* Allow filesystems to be mounted whose block size is == MAXBSIZE.
* Somewhere along the line, ufs_bmaparray(9) started mangling UNWRITTEN
entries in indirect blocks again, triggering a failed assertion "daddr
<= LFS_MAX_DADDR". Explicitly convert to and from int32_t to correct
this.
* Add a high-water mark for the number of dirty pages any given LFS can
hold before triggering a flush. This is settable by sysctl, but off
(zero) by default.
* Be more careful about the MAX_BYTES and MAX_BUFS computations so we
shouldn't see "please increase to at least zero" messages.
* Note that VBLK and VCHR vnodes can have nonzero values in di_db[0]
even though their v_size == 0. Don't panic when we see this.
* Change lfs_bfree to a signed quantity. The manner in which it is
processed before being passed to the cleaner means that sometimes it
may drop below zero, and the cleaner must be aware of this.
* Never report bfree < 0 (or higher than lfs_dsize) through
lfs_statvfs(9). This prevents df(1) from ever telling us that our full
filesystems have 16TB free.
* Account space allocated through lfs_balloc(9) that does not have
associated buffer headers, so that the pagedaemon doesn't run us out
of segments.
* Return ENOSPC from lfs_balloc(9) when bfree drops to zero.
* Address a deadlock in lfs_bmapv/lfs_markv when the filesystem is being
unmounted. Because vfs_busy() is a shared lock, and
lfs_bmapv/lfs_markv mark the filesystem vfs_busy(), the cleaner can be
holding the lock that umount() is blocking on, then try to vfs_busy()
again in getnewvnode().
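
The bfree clamping described in the list above (never report a value below zero or above lfs_dsize) can be sketched roughly as follows. This is only an illustration and not the code in lfs_statvfs(9); the helper name clamp_bfree and its 64-bit signed parameters are assumptions made for the example.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not taken from the LFS sources: clamp a signed
 * free-block count into the range [0, dsize] before it is reported.
 */
static int64_t
clamp_bfree(int64_t bfree, int64_t dsize)
{
	if (bfree < 0)
		return 0;	/* a transiently negative count is reported as "full" */
	if (bfree > dsize)
		return dsize;	/* never claim more free space than the data area holds */
	return bfree;
}
```

Clamping only at the reporting boundary leaves the internal counter free to dip below zero, which the list above says the cleaner must be prepared for.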
 1.155  18-Sep-2004  yamt branches: 1.155.4; 1.155.6;
change some members of struct buf from long to int.
ride on 2.0H.
 1.154  14-Aug-2004  mycroft Add a new flag, IN_MODIFY. This is like IN_UPDATE|IN_CHANGE, but unlike
setting those flags, it does not cause the inode to be written in the periodic
sync. This is used for writes to special files (devices and named pipes) and
FIFOs.

Do not preemptively sync updates to access times and modification times. They
are now updated in the inode only opportunistically, or when the file or device
is closed. (Really, it should be delayed beyond close, but this is enough to
help substantially with device nodes.)

And the most amusing part:
Trickle sync was broken on both FFS and ext2fs, in different ways. In FFS, the
periodic call to VFS_SYNC(MNT_LAZY) was still causing all file data to be
synced. In ext2fs, it was causing the metadata to *not* be synced. We now
only call VOP_UPDATE() on the node if we're doing MNT_LAZY. I've confirmed
that we do in fact trickle correctly now.
 1.153  19-May-2004  yamt lfs_cluster_aiodone: turn an invariant condition into an assertion.
 1.152  09-Mar-2004  yamt branches: 1.152.4;
calculate data checksum inline.
 1.151  09-Mar-2004  yamt use correct segment size. this fixes memory corruption when using lfsv1.
 1.150  29-Jan-2004  yamt lfs_update_single: add an assertion.
 1.149  28-Jan-2004  yamt eliminate tricky usages of VOP_STRATEGY which are (no longer?) necessary.
 1.148  25-Jan-2004  hannken Make VOP_STRATEGY(bp) a real VOP as discussed on tech-kern.

VOP_STRATEGY(bp) is replaced by one of two new functions:

- VOP_STRATEGY(vp, bp) Call the strategy routine of vp for bp.
- DEV_STRATEGY(bp) Call the d_strategy routine of bp->b_dev for bp.

DEV_STRATEGY(bp) is used only for block-to-block device situations.
 1.147  10-Jan-2004  yamt store a i/o priority hint in struct buf for buffer queue discipline.
 1.146  17-Dec-2003  yamt set VBWAIT when waiting v_numoutput to be drained.
 1.145  17-Dec-2003  yamt remove a redundant substitution.
 1.144  04-Dec-2003  yamt use b_private rather than b_saveaddr.
XXX LFS_USE_B_INVAL
 1.143  07-Nov-2003  yamt - tweak lfs_update_single()'s prototype so that it can be used by
roll-forward code.
- reduce code duplication using the above in update_meta()
this also fixes fragment accounting.
 1.142  25-Oct-2003  christos Fix uninitialized variable warnings.
 1.141  18-Oct-2003  yamt be more strict about sa->vp.
(make sure the last lfs_updatemeta in lfs_putpages takes effect.)
 1.140  18-Oct-2003  simonb Remove assigned-to but otherwise unused variable.
 1.139  17-Oct-2003  yamt add comments and tweak code a little for readability.
(no behaviour changes)
 1.138  14-Oct-2003  yamt remove a redundant definition of LFS_MAX_ACTIVE.
 1.137  08-Oct-2003  yamt - a comment.
- bcopy -> memcpy
- increase 'p' only when needed.
 1.136  03-Oct-2003  yamt assertions.
 1.135  03-Oct-2003  yamt reassignbuf() when lfs_writeseg() takes away B_DELWRI.
 1.134  03-Oct-2003  yamt when inactivating segments, compare segment numbers correctly.
 1.133  29-Sep-2003  yamt remove redundant prototypes.
 1.132  07-Sep-2003  yamt - buffer cache MP locks.
- avoid changing buffer state on the free queue.
 1.131  07-Aug-2003  agc Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22364, verified by myself.
 1.130  30-Jul-2003  yamt using normal bufcache buffer for cluster buffer head.
 1.129  23-Jul-2003  yamt KNF.
 1.128  12-Jul-2003  yamt - wrap long lines.
- remove a mysterious blank line.
 1.127  12-Jul-2003  yamt - protect global resource counts with lfs_subsys_lock.
- clean up scattered externs a little.
 1.126  02-Jul-2003  yamt use queue.h macros.
 1.125  02-Jul-2003  yamt - add new functions, lfs_writer_enter/leave, and use them instead of
duplicated code fragments.
- add an assertion.
 1.124  29-Jun-2003  fvdl branches: 1.124.2;
Back out the lwp/ktrace changes. They contained a lot of collateral damage,
and need to be examined and discussed more.
 1.123  29-Jun-2003  thorpej Undo part of the ktrace/lwp changes. In particular:
* Remove the "lwp *" argument that was added to vget(). Turns out
that nothing actually used it!
* Remove the "lwp *" arguments that were added to VFS_ROOT(), VFS_VGET(),
and VFS_FHTOVP(); all they did was pass it to vget() (which, as noted
above, didn't use it).
* Remove all of the "lwp *" arguments to internal functions that were added
just to appease the above.
 1.122  28-Jun-2003  darrenr Pass lwp pointers throughout the kernel, as required, so that the lwpid can
be inserted into ktrace records. The general change has been to replace
"struct proc *" with "struct lwp *" in various function prototypes, pass
the lwp through and use l_proc to get the process pointer when needed.

Bump the kernel rev up to 1.6V
 1.121  18-May-2003  yamt make is_sequential a callback in order to achieve better lfs write clustering.

since lfs always rewrites blocks into a new segment,
the current on-disk location of a block doesn't affect write clustering.

ok'ed by Konrad Schroder.
 1.120  23-Apr-2003  perseant Make LFS work better (though still not "well") as an NFS-exported
filesystem (and other things that needed to be fixed before the tests
would complete), to wit:

* Include the fs ident in the filehandle; improve stale filehandle checks.

* Change definition of blksize() to use the on-dinode size instead of
the inode's i_size, so that fsck_lfs will work properly again.

* Use b_interlock in lfs_vtruncbuf.

* Postpone dirop reclamation until after the seglock has been released,
so that lfs_truncate is not called with the segment lock held.

* Don't loop in lfs_fsync(), just write everything and wait.

* Be more careful about the interlock/uobjlock in lfs_putpages: when we
lose this lock, we have to resynchronize dirtiness of pages in each
block.

* Be sure to always write indirect blocks and update metadata in
lfs_putpages; fixes a bug that caused blocks to be accounted to the
wrong segment.
 1.119  02-Apr-2003  fvdl Add support for UFS2. UFS2 is an enhanced FFS, adding support for
64 bit block pointers, extended attribute storage, and a few
other things.

This commit does not yet include the code to manipulate the extended
storage (for e.g. ACLs), this will be done later.

Originally written by Kirk McKusick and Network Associates Laboratories for
FreeBSD.
 1.118  01-Apr-2003  yamt add assertions and a debug check.
 1.117  28-Mar-2003  fvdl The checkpoint loop always used (multiples of) lfs_sepb as the number
of segments to mark. However, this may be much more than lfs_nseg.

Originally this wasn't a big problem, since only the structures in the
diskblock were changed, but nowadays there's a mirror of the segflags
in the in-core superblock. This problem caused the code to walk
way past the end of that allocated area, causing memory corruption
in other kernel structures. So, use lfs_nseg as the maximum, as it should be.

While here, simplify the loop; it had become an obfuscated piece of
code over time.
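
The overrun described in this entry can be illustrated with a small sketch: walk the segment table one block's worth of entries at a time, but clip each block at nseg so the in-core flag array is never indexed past its end. The names mark_segments and SEG_MARK, and the flat uint32_t array, are assumptions for illustration and do not reflect the actual kernel structures.

```c
#include <stddef.h>
#include <stdint.h>

#define SEG_MARK 0x1u	/* hypothetical flag bit, for illustration only */

/*
 * Mark every segment, visiting them in groups of sepb entries (one
 * group per table block) but never going past nseg.  Returns the
 * number of entries touched.
 */
static size_t
mark_segments(uint32_t *segflags, size_t nseg, size_t sepb)
{
	size_t marked = 0;

	if (sepb == 0)
		return 0;	/* avoid a non-terminating loop on bad input */

	for (size_t base = 0; base < nseg; base += sepb) {
		size_t end = base + sepb;

		if (end > nseg)
			end = nseg;	/* the last block may be only partly used */
		for (size_t sn = base; sn < end; sn++) {
			segflags[sn] |= SEG_MARK;
			marked++;
		}
	}
	return marked;
}
```

With 10 segments and 4 entries per block, the final group covers only entries 8 and 9; a loop bounded by a multiple of sepb would instead have written to entries 10 and 11.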
 1.116  28-Mar-2003  perseant Add a sleeper count, to prevent the cleaner from panicking the kernel
when the filesystem is unmounted, relocking the Ifile when its lock is
draining. (We can't use vfs_busy() since the process is sleeping for a
good long time.) Clean up / organize lfs.h, while I'm here.

In lfs_update_single, assert that disk addresses are either negative, or
are still positive when converted to int32_t, to prevent recurrence of a
negative/positive block problem.
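
A rough sketch of the kind of check described for lfs_update_single: a 64-bit in-core address is acceptable if it is negative, or if it is non-negative and still fits in an int32_t. The function name daddr_fits_ondisk is hypothetical, and treating zero as acceptable is an assumption of this sketch rather than something the entry states.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical predicate: true if daddr is negative, or is a
 * non-negative value that keeps its sign when stored in an int32_t.
 * The range comparison avoids relying on implementation-defined
 * narrowing conversions.
 */
static bool
daddr_fits_ondisk(int64_t daddr)
{
	if (daddr < 0)
		return true;
	return daddr <= INT32_MAX;
}
```

In kernel code a predicate like this would typically sit inside an assertion macro so that it costs nothing in non-debug builds.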
 1.115  21-Mar-2003  perseant KNF (space after keywords).
 1.114  21-Mar-2003  perseant Use VONWORKLST as a heuristic for vnode emptiness, rather than exhaustively
checking the memq.

Take greater care not to dirty the Ifile vnode when unmounting the filesystem.
This should fix a "(vp->v_flag & VONWORKLST) == 0" assertion panic in vgonel
that could occur when unmounting.

Do not allow the Ifile to be mapped for writing.
 1.113  20-Mar-2003  yamt lfs_writevnodes:
in the case of "starting over", kick lfs_writeseg
in order to avoid deadlock in check_dirty.
 1.112  20-Mar-2003  perseant Don't break out of Ifile-writing loop in lfs_segwrite until nothing is left.
Note however that blocks can be added to the Ifile even when the segment
lock is held because of inodes' atime. Do not panic with "dirty blocks"
if these blocks are present.
 1.111  15-Mar-2003  perseant Add simple_lock protection for lfs_seglock and lfs_subsys_pages; these will
be expanded to cover other per-fs and subsystem-wide data as well.

Fix a case of IN_MODIFIED being set without updating lfs_uinodes, resulting
in a "lfs_uinodes < 0" panic.

Fix a deadlock in lfs_putpages arising from the need to busy all pages in a
block; unbusy any that had already been busied before starting over.
 1.110  15-Mar-2003  kristerw ISO C requires a statement after a label.
 1.109  11-Mar-2003  perseant - Get rid of unused #ifdefs LFS_NO_PAGEMOVE and LFS_MALLOC_SUMMARY (both
always true) and accompanying dead code.

- When constructing write clusters in lfs_writeseg, if the block we are
about to add is itself a cluster from GOP_WRITE, don't put a cluster
in a cluster, just write the GOP_WRITE cluster on its own. This seems
to represent a slight performance gain on my test machine.

- Charge someone's rusage for writes on LFSes. It's difficult to tell
who the "right" process to charge is; just charge whoever triggered
the write.
 1.108  08-Mar-2003  perseant Take away "#ifdef LFS_UBC".
 1.107  08-Mar-2003  perseant Add an lfs_strategy() that checks to make sure we're not trying to read
where the cleaner is trying to write, instead of tying up the "live"
buffers (or pages).

Fix a bug in the LFS_UBC case where oversized buffers would not be
checksummed correctly, causing uncleanable segments.

Make sure that wakeup(fs->lfs_iocount) is done if fs->lfs_iocount is 1
as well as 0, since we wait in some places for it to drop to 1.

Activate all pages that make it into lfs_gop_write without the segment
lock held, since they must have been dirtied very recently, even if
PG_DELWRI is not set.
 1.106  04-Mar-2003  perseant Make sure we hold the uobjlock when checking for dirty pages, in lfs_vflush.
Note that pages can become dirty without our knowing it, anyway; don't
panic if that happens.
 1.105  02-Mar-2003  perseant Account SEGUSE_ACTIVE correctly so that the automatic segment cleaning
actually happens.

Add a new fcntl call that will write the minimum necessary to checkpoint
(i.e., for on-disk directory structure to be consistent, not including
updates to file data) so that the cleaner can clean segments more quickly
without sacrificing three-way commit for cleaning.
 1.104  23-Feb-2003  perseant Fix a buffer overflow bug in the LFS_UBC case that manifested itself
either as a mysterious UVM error or as "panic: dirty bufs". Verify
maximum size in lfs_malloc.

Teach lfs_updatemeta and lfs_shellsort about oversized cluster blocks from
lfs_gop_write.

When unwiring pages in lfs_gop_write, deactivate them, under the theory
that the pagedaemon wanted to free them last we knew.
 1.103  20-Feb-2003  perseant Tabify, and fix some comment alignment problems.
 1.102  19-Feb-2003  yamt acquire v_interlock before calling VOP_PUTPAGES.
 1.101  17-Feb-2003  perseant Add code to UBCify LFS. This is still behind "#ifdef LFS_UBC" for now
(there are still some details to work out) but expect that to go
away soon. To support these basic changes (creation of lfs_putpages,
lfs_gop_write, mods to lfs_balloc) several other changes were made, to
wit:

* Create a writer daemon kernel thread whose purpose is to handle page
writes for the pagedaemon, but which also takes over some of the
functions of lfs_check(). This thread is started the first time an
LFS is mounted.

* Add a "flags" parameter to GOP_SIZE. Current values are
GOP_SIZE_READ, meaning that the call should return the size of the
in-core version of the file, and GOP_SIZE_WRITE, meaning that it
should return the on-disk size. One of GOP_SIZE_READ or
GOP_SIZE_WRITE must be specified.

* Instead of using malloc(...M_WAITOK) for everything, reserve enough
resources to get by and use malloc(...M_NOWAIT), using the reserves if
necessary. Use the pool subsystem for structures small enough that
this is feasible. This also obsoletes LFS_THROTTLE.

And a few that are not strictly necessary:

* Moves the LFS inode extensions off onto a separately allocated
structure; getting closer to LFS as an LKM. "Welcome to 1.6O."

* Unified GOP_ALLOC between FFS and LFS.

* Update LFS copyright headers to correct values.

* Actually cast to unsigned in lfs_shellsort, like the comment says.

* Keep track of which segments were empty before the previous
checkpoint; any segments that pass two checkpoints both dirty and
empty can be summarily cleaned. Do this. Right now lfs_segclean
still works, but this should be turned into an effectless
compatibility syscall.
 1.100  05-Feb-2003  pk Make the buffer cache code MP-safe.
 1.99  01-Feb-2003  thorpej Add extensible malloc types, adapted from FreeBSD. This turns
malloc types into a structure, a pointer to which is passed around,
instead of an int constant. Allow the limit to be adjusted when the
malloc type is defined, or with a function call, as suggested by
Jonathan Stone.
 1.98  29-Jan-2003  yamt don't use daddr_t for segment summary since it's an on-disk structure.
 1.97  29-Jan-2003  simonb Remove variable that is only assigned to but not referenced.
 1.96  27-Jan-2003  yamt make these compilable with lfs debug options.
(follow daddr_t change)

XXX maybe segment number should be 64bit.
 1.95  27-Jan-2003  kleink Further printf format fixes in the wake of daddr_t.

Note that PRI?64 and long long int arguments aren't made for each other,
nor are %lld and int64_t arguments.
 1.94  25-Jan-2003  kleink Fix further printf format warnings for DEBUG, in the wake of daddr_t
having changed.
 1.93  25-Jan-2003  tron Use PRId64 instead of hard coding "%lld" to fix build problems under
LP64 ports.
 1.92  25-Jan-2003  tron Fix printf() format strings problems caused by "daddr_t" change.
 1.91  24-Jan-2003  fvdl Bump daddr_t to 64 bits. Replace it with int32_t in all places where
it was used on-disk, so that on-disk formats remain the same.
Remove ufs_daddr_t and ufs_lbn_t for the time being.
 1.90  08-Jan-2003  yamt backout wrong assertions that i added.
 1.89  08-Jan-2003  yamt add assertions.
 1.88  31-Dec-2002  yamt write ifile only when it has dirty buffers.
 1.87  17-Dec-2002  yamt no need for cleaner to hold vnode locks.
cleaner and normal vnode operations are synchronized enough by
seglock/fraglock and buf's B_BUSY-ness.
 1.86  17-Dec-2002  yamt use ufs_daddr_t instead of int where appropriate.
 1.85  14-Dec-2002  yamt in lfs_writefile, check v_type==VNON earlier.
to avoid null dereference with DEBUG_LFS_VERBOSE.
 1.84  13-Dec-2002  yamt save a segment write when doing checkpoint.
 1.83  12-Dec-2002  yamt correct DIAGNOSTIC code for duplicated inodes in a segment and su_nbytes.
 1.82  27-Sep-2002  provos remove trailing \n in panic(). approved perry.
 1.81  22-Sep-2002  jdolecek don't need <sys/conf.h> here
 1.80  06-Jul-2002  perseant Deal with fragment size changes better. For each fragment that can
exist on an on-disk inode, we keep a record of its size in struct inode,
which is updated when we write the block to disk. The cleaner routines
thus have ready access to what size is the correct size for this block,
on disk.

Fixed a related bug: if a file with fragments is being cleaned
(fragments being cleaned) at the same time it is being extended beyond
NDADDR blocks, we could write a bogus FINFO record that has a frag in the
middle; when it was cleaned this would give back bogus file data. Don't
write the indirect blocks in this case, since there is no need.

lfs_fragextend and lfs_truncate no longer require the seglock, but instead
take a shared lock, which the seglock locks exclusively.
 1.79  16-Jun-2002  perseant For synchronous writes, keep separate i/o counters for each write, so
processes don't have to wait for one another to finish (e.g., nfsd seems
to be a little happier now, though I haven't measured the difference).
Synchronous checkpoints, however, must always wait for all i/o to finish.

Take the contents of the callback functions and have them run in thread
context instead (aiodoned thread). lfs_iocount no longer has to be
protected in splbio(), and quite a bit less of the segment construction
loop needs to be in splbio() as well.

If lfs_markv is handed a block that is not the correct size according to
the inode, refuse to process it. (Formerly it was extended to the "correct"
size.) This is possibly more prone to deadlock, but less prone to corruption.

lfs_segclean now outright refuses to clean segments that appear to have live
bytes in them. Again this may be more prone to deadlock but avoids
corruption.

Replace ufsspec_close and ufsfifo_close with LFS equivalents; this means
that no UFS functions need to know about LFS_ITIMES any more. Remove
the reference from ufs/inode.h.

Tested on i386, test-compiled on alpha.
 1.78  24-May-2002  perseant Fix a couple of instances where reassignbuf() was not done at splbio.

Tested on i386.
 1.77  23-May-2002  perseant Back out rev 1.174 of vfs_subr.c, because the splbio() wasn't protecting
enough to be useful, and broadening it so that it did would have meant
that operations possibly requiring synchronous disk activity would have
to be done in splbio(). This clearly was not going to work.

Worked around this in the LFS case by having lfs_cluster_callback put an
extra hold on the vnode before calling biodone(), and taking the hold
off without HOLDRELE's problematic list swapping. lfs_vunref() will take
care of that---in thread context---on the next write if need be.

Also, ensure that the list walking in lfs_{writevnodes,segunlock,gather}
takes into account the possibility that the list may change
underneath it (possibly because it itself deleted an element).

Tested on i386, test-compiled on alpha.
 1.76  20-May-2002  perseant branches: 1.76.2;
Protect v_freelist with splbio(), since HOLDRELE can be called in
interrupt context (through brelvp). (LFS may be the only subsystem
affected by this problem.)

Tested on i386.
 1.75  17-May-2002  perseant use macros from <sys/queue.h>
 1.74  14-May-2002  perseant branches: 1.74.2;
Phase one of my three-phase plan to make LFS play nice with UBC, and bug-fixes
I found while making sure there weren't any new ones.

* Make the write clusters keep track of the buffers whose blocks they contain.
This should make it possible to (1) write clusters using a page mapping
instead of malloc, if desired, and (2) schedule blocks for rewriting
(somewhere else) if a write error occurs. Code is present to use
pagemove() to construct the clusters but that is untested and will go away
anyway in favor of page mapping.
* DEBUG now keeps a log of Ifile writes, so that any lingering instances of
the "dirty bufs" problem can be properly debugged.
* Keep track of whether the Ifile has been dirtied by various routines that
can be called by lfs_segwrite, and loop on that until it is clean, for
a checkpoint. Checkpoints need to be squeaky clean.
* Warn the user (once) if the Ifile grows larger than is reasonable for their
buffer cache. Both lfs_mountfs and lfs_unmount check since the Ifile can
grow.
* If an inode is not found in a disk block, try rereading the block, under
the assumption that the block was copied to a cluster and then freed.
* Protect WRITEINPROG() with splbio() to fix a hang in lfs_update.
 1.73  23-Nov-2001  chs add spaces for KNF. confirmed to produce identical objects.
 1.72  08-Nov-2001  lukem add RCSID
 1.71  26-Oct-2001  lukem remove #include <ufs/ufs/quota.h> where it was just to appease
<ufs/ufs/inode.h>, since the latter now includes the former. leave the former
in source that obviously uses specific bits of it (for completeness.)
 1.70  26-Jul-2001  jdolecek branches: 1.70.2; 1.70.4;
lfs_writeseg(): make el_size a size_t (cosmetic only, no functional change)
 1.69  13-Jul-2001  perseant Merge the short-lived perseant-lfsv2 branch into the trunk.

Kernels and tools understand both v1 and v2 filesystems; newfs_lfs
generates v2 by default. Changes for the v2 layout include:

- Segments of non-PO2 size and arbitrary block offset, so these can be
matched to convenient physical characteristics of the partition (e.g.,
stripe or track size and offset).

- Address by fragment instead of by disk sector, paving the way for
non-512-byte-sector devices. In theory fragments can be as large
as you like, though in reality they must be smaller than MAXBSIZE in size.

- Use serial number and filesystem identifier to ensure that roll-forward
doesn't get old data and think it's new. Roll-forward is enabled for
v2 filesystems, though not for v1 filesystems by default.

- The inode free list is now a tailq, paving the way for undelete (undelete
is not yet implemented, but can be without further non-backwards-compatible
changes to disk structures).

- Inode atime information is kept in the Ifile, instead of on the inode;
that is, the inode is never written *just* because atime was changed.
Because of this the inodes remain near the file data on the disk, rather
than wandering all over as the disk is read repeatedly. This speeds up
repeated reads by a small but noticeable amount.

Other changes of note include:

- The ifile written by newfs_lfs can now be of arbitrary length; it is no
longer restricted to a single indirect block.

- Fixed an old bug where ctime was changed every time a vnode was created.
I need to look more closely to make sure that the times are only updated
during write(2) and friends, not after-the-fact during a segment write,
and certainly not by the cleaner.
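The non-PO2 segment size and arbitrary segment offset described above imply that the address-to-segment mapping becomes a subtraction and a division rather than a shift. A minimal sketch, assuming a hypothetical `struct seg_layout` (the field and function names below are illustrative and are not the real `struct lfs` members or macros):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout parameters, not the real struct lfs fields. */
struct seg_layout {
	int64_t start;     /* address of the first segment, in fragments */
	int64_t seg_frags; /* segment size in fragments, any positive value */
};

/* Map a fragment address to a segment number by division, not by shift. */
static int64_t
addr_to_segno(const struct seg_layout *l, int64_t addr)
{
	assert(l->seg_frags > 0 && addr >= l->start);
	return (addr - l->start) / l->seg_frags;
}

/* Inverse mapping: first fragment address of a given segment. */
static int64_t
segno_to_addr(const struct seg_layout *l, int64_t segno)
{
	return l->start + segno * l->seg_frags;
}
```

With a start offset of 3 and 100 fragments per segment, address 102 still falls in segment 0 and address 103 is the first fragment of segment 1.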
 1.68  30-May-2001  mrg branches: 1.68.2; 1.68.4;
use _KERNEL_OPT
 1.67  09-Jan-2001  joff branches: 1.67.2;
If DIAGNOSTIC and the segment writer gets a badly sized buffer, panic()
instead of silently corrupting the filesystem.
 1.66  03-Dec-2000  perseant Get rid of some old unnecessary code that cleared B_NEEDCOMMIT from buffers in
lfs_writeseg (possibly after they had been freed).

If MALLOCLOG is defined, make lfs_newbuf and lfs_freebuf pass along the
caller's file and line to _malloc and _free.
 1.65  30-Nov-2000  jdolecek only include opt_ddb.h for !LKM
 1.64  27-Nov-2000  chs Initial integration of the Unified Buffer Cache project.
 1.63  27-Nov-2000  perseant If LFS_DO_ROLLFORWARD is defined, roll forward from the older checkpoint
on mount, through the newer checkpoint and on through any newer
partial-segments that may have been written but not checkpointed because
of an intervening crash.

LFS_DO_ROLLFORWARD is not defined by default.
 1.62  17-Nov-2000  perseant Correct accounting of lfs_avail, locked_queue_count, and locked_queue_bytes.
(PR #11468). In the case of fragment allocation, check to see if enough
space is available before extending a fragment already scheduled for writing.

The locked_queue_* variables indicate the number of buffer headers and bytes,
respectively, that are unavailable to getnewbuf() because they are locked up
waiting for LFS to flush them; make sure that that is actually what we're
counting, i.e., never count malloced buffers, and always use b_bufsize instead
of b_bcount.

If DEBUG is defined, the periodic calls to lfs_countlocked will now complain
if either counter is incorrect. (In the future lfs_countlocked will not need
to be called at all if DEBUG is not defined.)
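The counting rule in this message (skip malloced buffers, sum b_bufsize rather than b_bcount) can be sketched roughly as follows. The `struct xbuf` type, the `XB_*` flags and `count_locked()` are hypothetical stand-ins for illustration, not the kernel's `struct buf` or `lfs_countlocked`:

```c
#include <assert.h>
#include <stddef.h>

#define XB_LOCKED   0x1 /* hypothetical: buffer is held waiting for LFS to flush it */
#define XB_MALLOCED 0x2 /* hypothetical: buffer memory came from malloc, not the cache */

struct xbuf {
	int  flags;
	long b_bcount;  /* valid bytes in the buffer */
	long b_bufsize; /* bytes actually allocated to the buffer */
};

struct locked_totals {
	long count; /* buffer headers unavailable to the cache */
	long bytes; /* allocated bytes unavailable to the cache */
};

/* Recount from scratch; a DEBUG check could compare this with running counters. */
static struct locked_totals
count_locked(const struct xbuf *bufs, size_t n)
{
	struct locked_totals t = { 0, 0 };

	assert(bufs != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (!(bufs[i].flags & XB_LOCKED))
			continue;
		if (bufs[i].flags & XB_MALLOCED)
			continue; /* never count malloced buffers */
		t.count++;
		t.bytes += bufs[i].b_bufsize; /* allocated size, not b_bcount */
	}
	return t;
}
```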
 1.61  12-Nov-2000  perseant Do not needlessly dirty segment table blocks during lfs_segwrite,
preventing needless disk activity when the filesystem is idle. (PR #10979.)
 1.60  12-Nov-2000  toshii Fix obsolete comments in lfs_writeinode since rev. 1.27.
New comments are mostly from perseant, with my additions.
 1.59  09-Sep-2000  perseant oops
 1.58  09-Sep-2000  perseant Various bug-fixes to LFS, to wit:


Kernel:

* Add runtime quantity lfs_ravail, the number of disk-blocks reserved
for writing. Writes to the filesystem first reserve a maximum amount
of blocks before their write is allowed to proceed; after the blocks
are allocated the reserved total is reduced by a corresponding amount.

If the lfs_reserve function cannot immediately reserve the requested
number of blocks, the inode is unlocked, and the thread sleeps until
the cleaner has made enough space available for the blocks to be
reserved. In this way large files can be written to the filesystem
(or, smaller files can be written to a nearly-full but thoroughly
clean filesystem) and the cleaner can still function properly.

* Remove explicit switching on dlfs_minfreeseg from the kernel code; it
is now merely a fs-creation parameter used to compute dlfs_avail and
dlfs_bfree (and used by fsck_lfs(8) to check their accuracy). Its
former role is better assumed by a properly computed dlfs_avail.

* Bounds-check inode numbers submitted through lfs_bmapv and lfs_markv.
This prevents a panic, but, if the cleaner is feeding the filesystem
the wrong data, you are still in a world of hurt.

* Cleanup: remove explicit references of DEV_BSIZE in favor of
btodb()/dbtob().
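
The reservation scheme in the first item above can be illustrated with a small user-space model. This is only a sketch under stated assumptions: `struct space`, `try_reserve()` and `commit_reserved()` are hypothetical names, the sleep is replaced by a false return, and the way `avail` shrinks on allocation is a simplification rather than the actual lfs_reserve logic.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the two counters, not the real superblock fields. */
struct space {
	long avail;  /* blocks available for new writes */
	long ravail; /* blocks reserved by writes still in progress */
};

/*
 * Try to reserve nblocks. A false return stands for the case where the
 * caller would unlock the inode and sleep until the cleaner frees space.
 */
static bool
try_reserve(struct space *s, long nblocks)
{
	assert(nblocks >= 0);
	if (s->ravail + nblocks > s->avail)
		return false;
	s->ravail += nblocks;
	return true;
}

/* After allocation, drop the reservation and charge the blocks really used. */
static void
commit_reserved(struct space *s, long reserved, long allocated)
{
	assert(allocated <= reserved && reserved <= s->ravail);
	s->avail -= allocated;
	s->ravail -= reserved;
}
```

Reserving the maximum up front and releasing the unused part afterwards is what lets a writer be refused before it holds any blocks, which in this model leaves room for the cleaner to run.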

lfs_cleanerd:

* Make -n mean "send N segments' blocks through a single call to
lfs_markv". Previously it had meant "clean N segments though N calls
to lfs_markv, before looking again to see if more need to be cleaned".
The new behavior gives better packing of direct data on disk with as
little metadata as possible, largely alleviating the problem that the
cleaner can consume more disk through inefficient use of metadata than
it frees by moving dirty data away from clean "holes" to produce
entirely clean segments.

* Make -b mean "read as many segments as necessary to write N segments
of dirty data back to disk", rather than its former meaning of "read
as many segments as necessary to free N segments worth of space". The
new meaning, combined with the new -n behavior described above,
further aids in cleaning storage efficiency as entire segments can be
written at once, using as few blocks as possible for segment summaries
and inode blocks.

* Make the cleaner take note of segments which could not be cleaned due
to error, and not attempt to clean them until they are entirely free
of dirty blocks. This prevents the case in which a cleanerd running
with -n 1 and without -b (formerly the default) would spin trying
repeatedly to clean a corrupt segment, while the remaining space
filled and deadlocked the filesystem.

* Update the lfs_cleanerd manual page to describe all the options,
including the changes mentioned here (in particular, the -b and -n
flags were previously undocumented).

fsck_lfs:

* Check, and optionally fix, lfs_avail (to an exact figure) and
lfs_bfree (within a margin of error) in pass 5.

newfs_lfs:

* Reduce the default dlfs_minfreeseg to 1/20 of the total segments.

* Add a warning if the sgs disklabel field is 16 (the default for FFS'
cpg, but not usually desirable for LFS' sgs: 5--8 is a better range).

* Change the calculation of lfs_avail and lfs_bfree, corresponding to
the kernel changes mentioned above.

mount_lfs:

* Add -N and -b options to pass corresponding -n and -b options to
lfs_cleanerd.

* Default to calling lfs_cleanerd with "-b -n 4".


[All of these changes were largely tested in the 1.5 branch, with the
idea that they (along with previous un-pulled-up work) could be applied
to the branch while it was still in ALPHA2; however my test system has
experienced corruption on another filesystem (/dev/console has gone
missing :^), and, while I believe this unrelated to the LFS changes, I
cannot with good conscience request that the changes be pulled up.]
 1.57  09-Sep-2000  perseant Fix a buffer-cache corrupting bug in lfs_writeseg, where brelse could
be improperly used on an already-queued buffer.
 1.56  05-Jul-2000  perseant Clean up accounting of lfs_uinodes (dirty but unwritten inodes).

Make lfs_uinodes a signed quantity for debugging purposes, and set it to
zero at fs mount time.

Enclose setting/clearing of the dirty flags (IN_MODIFIED, IN_ACCESSED,
IN_CLEANING) in macros, and use those macros everywhere. Make
LFS_ITIMES use these macros; updated the ITIMES macro in inode.h to know
about this. Make ufs_getattr use ITIMES instead of FFS_ITIMES.
 1.55  04-Jul-2000  perseant Fix errors observed while trying to fill the filesystem with yesterday's
fixes:

- Write copies of bfree and avail in the CLEANERINFO block, so the
cleaner doesn't have to guess which superblock has the current
information (if indeed any do).

- Tighten up accounting of lfs_avail (more needs to be done).

- When cleansing indirect blocks of UNWRITTEN, make sure not to mark
them clean, since they'll need to be rewritten later.
 1.54  03-Jul-2000  perseant i_lfs_effnblks fixes. Put debugging printfs under #ifdef DEBUG_LFS.
 1.53  03-Jul-2000  perseant Allow the number of free segments reserved for the cleaner to be
parametrized in the filesystem, defaulting to MIN_FREE_SEGS = 2 but set
to something more reasonable at newfs_lfs time.

Note the number of blocks that have been scheduled for writing but which
are not yet on disk in an inode extension, i_lfs_effnblks. Move
i_ffs_effnlink out of the ffs extension and onto the main inode, since
it's used all over the shared code and the lfs extension would clobber
it.

At inode write time, indirect blocks and inode-held blocks of inodes
that have i_lfs_effnblks != i_ffs_blocks are cleansed of UNWRITTEN disk
addresses, so that these never make it to disk.
 1.52  27-Jun-2000  perseant Fixes associated with filling an LFS:

Change the space computation to appear to change the size of the *disk*
rather than the *bytes used* when more segment summaries and inode
blocks are written. Try to estimate the amount of space that these will
take up when more files are written, so the disk size doesn't change too
much.

Regularize error returns from lfs_valloc, lfs_balloc, lfs_truncate: they
now fail entirely, rather than succeeding half-way and leaving the fs in
an inconsistent state.

Rewrite lfs_truncate, mostly stealing from ffs_truncate. The old
lfs_truncate had difficulty truncating a large file to a non-zero size
(indirect blocks were not handled appropriately).

Unmark VDIROP on fvp after ufs_remove, ufs_rmdir, so these can be
reclaimed immediately: this vnode would not be written to disk again
anyway if the removal succeeded, and if it failed, no directory
operation occurred.

ufs_makeinode and ufs_mkdir now remove IN_ADIROP on error.
 1.51  27-Jun-2000  perseant From John Evans <jevans@cray.com>: use datosn() to convert to segment
number, when remarking the current segment ACTIVE. See PR #10463.
 1.50  22-Jun-2000  perseant Update lfs_vunref for the fact that now a vnode can be locked with no
references (locked for VOP_INACTIVE at the end of vrele) and it's okay.
Check the return value of lfs_vref where appropriate.
Fixes PR #s 10285 and 10352.
 1.49  06-Jun-2000  perseant branches: 1.49.2;
Protect inode free list with seglock, instead of separate lock, so that
the head of the inode free list (on the superblock) always matches the
rest of the free list (in the ifile).

Protect lfs_fragextend with seglock, to prevent the segment byte count
fudging from making its way to disk.

Don't try to inactivate dirop vnodes that are still in the middle of
their dirop (may address PR#10285).
 1.48  31-May-2000  fredb Make this build. (Balance parentheses.)
 1.47  31-May-2000  perseant update for IN_ACCESSED changes
 1.46  27-May-2000  perseant branches: 1.46.2;
Prevent dirops from getting around lfs_check and wedging the buffer cache.
All the dirop vnops now mark the inodes with a new flag, IN_ADIROP, which
is removed as soon as the dirop is done (as opposed to VDIROP which stays
until the file is written). To address one issue raised in PR#9357.
 1.45  19-May-2000  thorpej NULL != 0
 1.44  10-May-2000  perseant stop vnode reference leak introduced in patch to PR#9994
 1.43  05-May-2000  perseant Change the way LFS does block accounting, from trying to infer from the
buffer cache flags, to marking the inode and/or indirect blocks with a
special disk address UNWRITTEN==-2 when a block is accounted for. (This
address is never written to disk, but only used in-core. This is essentially
the same method of block accounting as on the UBC branch, where the buffer
headers don't exist.) Make sure that truncation is handled properly,
especially in the case of holey files.

Fixes PR#9994.
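The in-core-only sentinel described above can be sketched as a copy step that strips the marker before block pointers reach the disk image. The value -2 comes from the message; `X_UNWRITTEN`, `cleanse_unwritten()` and the choice of 0 as the replacement (treating the slot as a hole) are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define X_UNWRITTEN ((int32_t)-2) /* in-core marker: accounted for, no disk address yet */

/*
 * Copy in-core block pointers into an on-disk image, replacing the
 * sentinel with 0 so that it never reaches the disk. The in-core
 * array is left untouched. Returns the number of entries cleansed.
 */
static size_t
cleanse_unwritten(int32_t *dst, const int32_t *src, size_t n)
{
	size_t cleansed = 0;

	assert(dst != NULL && src != NULL);
	for (size_t i = 0; i < n; i++) {
		if (src[i] == X_UNWRITTEN) {
			dst[i] = 0;
			cleansed++;
		} else {
			dst[i] = src[i];
		}
	}
	return cleansed;
}
```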
 1.42  30-Mar-2000  augustss Remove register declarations.
 1.41  13-Mar-2000  soren Fix doubled 'the's in comments.
 1.40  19-Jan-2000  perseant Changes to stabilize LFS. The first two of these should also apply to the
1.4 branch.

* Use a separate per-fs lock, instead of ufs_hashlock, to protect the Inode
free list. This seems to prevent the "lockmgr: %d, not exclusive lock holder
%d, unlocking" message I was mis-attributing last night to an unlocked vnode
being passed to vrele.

* Change calling semantics of lfs_ifind, to give better error reporting:
If fed a struct buf, it can report the block number of the offending inode
block as well as the inode number.

* Back out rev 1.10 of lfs_subr.c, since the replacement code was slightly
uglier while being functionally identical.

* Make lfs_vunref use the same free list convention as vrele/vput, so that
vget does not remove vnodes from a hash list they are not on.
 1.39  16-Jan-2000  perseant Fix a problem in my changes of Dec 14th, that prevents removed vnodes
from being inactivated under some conditions. Removed vnodes are now
inactivated when the VDIROP flag is cleared, and to prevent block
accounting problems this clearing has been postponed until
lfs_segunlock.
 1.38  14-Jan-2000  perseant Better handling of various combinations of cleaning, vnode flushing, and
dirop writing. In particular, lfs_writevnodes now writes all buffers from
a flushed vnode whether cleaning or not, and the same with the Ifile; and
lfs_segwrite does not attempt to write data from other non-cleaning vnodes,
even if a vnode is being flushed.
 1.37  03-Dec-1999  perseant Handle the case of a vnode flush while dirops are active correctly in
lfs_segwrite. Also, make sure a flush is called in SET_DIROP before sleeping
on its results. Addresses PR #8863.
 1.36  17-Nov-1999  perseant Fix spllevel problem with superblock exclusion and with segment write throttle.
May address PR#8383.
 1.35  15-Nov-1999  fvdl Add Kirk McKusick's soft updates code to the trunk. Not enabled by
default, as the copyright on the main file (ffs_softdep.c) is such
that it has been put into gnusrc. options SOFTDEP will pull this
in. This code also contains the trickle syncer.

Bump version number to 1.4O
 1.34  12-Nov-1999  perseant Back out my patch of the 8th (to address unreferenced inode problem).
Apparently this needs more thought.
 1.33  09-Nov-1999  perseant If ifile blocks were written before dirops were complete, and then the
system crashed, inodes could be allocated that were not referenced. (Though
not a serious problem, it evidences itself in phase 4 of fsck_lfs.) Fix
this by marking if_daddr with UNASSIGNED before the inodes are actually
written; at mount time the ifile is checked for UNASSIGNED entries and
any that are found are linked back into the free list. (The latter
functionality should move into the roll-forward agent when it materializes.)
 1.32  06-Nov-1999  perseant branches: 1.32.2;
Address ufs_hashlock/ufs_ihashins protocol bug, discovered while doing a
post-mortem of a production machine. Also, take the active dirop
count off of the fs and make it global (since it is measuring a global
resource) and tie the threshold value LFS_MAXDIROP to desiredvnodes.
 1.31  01-Oct-1999  mycroft branches: 1.31.2; 1.31.4; 1.31.6;
Fix printf() formats.
 1.30  03-Sep-1999  perseant Make changes that will allow an LFS filesystem to be used as the root
filesystem. In particular,

- Fix mknod deadlock, described in PR 8172.
- Enable lfs_mountroot.
- Make lfs_writevnodes treat filesystems mounted on lfs device nodes properly,
by flushing that device rather than trying to add blocks to the device inode.

This, in combination with lfs boot blocks, will allow operation of an all-lfs
system.
 1.29  08-Jul-1999  wrstuden Modify file systems to deal with struct lock in struct vnode. All leaf
fs's other than nfs use genfs_lock() for locking.

Modify lookup routines to set PDIRUNLOCK when they unlock the parent.
 1.28  17-Jun-1999  tls squash some compiler warnings on debug printfs by casting to int
 1.27  15-Jun-1999  perseant Minor changes to the segment live bytes calculation. In particular, fixed
a bug in fragment extension that could run the count negative. Also, don't
overcount for inodes, and don't count segment summaries. Thus, for empty
segments the live bytes count should now be exactly zero.
 1.26  12-Apr-1999  perseant Make sure that the wakeup occurs for vnodes that lfs_update might be sleeping
on (nodes which are not marked IN_MODIFIED/IN_CLEANING, but which have dirty
buffers), by marking them with the appropriate flag if dirtybuffers were added
while the write was in progress.
 1.25  12-Apr-1999  perseant Better checking for held inode locks in lfs_fastvget, for a number of error
conditions. Also change the default setting of lfs_clean_vnhead to 0, which
seems to make the locking problems go away (although this is difficult to
test as I can't reliably reproduce them).
 1.24  12-Apr-1999  perseant Fix "lfs_ifind: dinode xxx not found" panic. When inodes were freed,
then immediately reloaded, their dinodes were located in an inode block
which was not on disk at the advertized location, nor in the cache (although
it would be flushed to disk next segment write). Fix this by using getblk()
instead of lfs_newbuf() for inode blocks.
 1.23  30-Mar-1999  perseant branches: 1.23.2;
Add initialization to quell compiler warning (only on some platforms?)
 1.22  30-Mar-1999  perseant Move variable initialization to the top of lfs_vflush
 1.21  29-Mar-1999  perseant lfs_truncate calls vinvalbuf to invalidate all currently-held buffers, which
in turn forces a flush of the vnode, whether or not it is involved in a dirop.
(This can happen during a remove or rmdir, when the directory is shrunk.)
Because of the nature of dirops, however, flushing a vnode involved in a dirop
is disallowed (and was marked with a panic). This patch has lfs_truncate
call a specialized vinvalbuf that only invalidates buffers following the new
end-of-file, and thus does not require a flush. Also the panic is demoted,
in case I missed any other path to lfs_vflush.
 1.20  25-Mar-1999  perseant Make sysctl variable lfs_clean_vnhead do what it was supposed to do,
namely, toggle whether vnodes loaded only for cleaning (as opposed to
normal filesystem use) are freed to the *head* of the vnode free list,
rather than the tail. This should avoid a possible cache flushing
effect, if the cleaner cleans a segment containing a large number of
live inodes.
 1.19  25-Mar-1999  perseant Fixes to make dirops and lfs_vflush play together well. In particular,
if we are short on vnodes, lfs_vflush from another process can grab a
vnode that lfs_markv has already processed but not yet written; but
lfs_markv holds the seglock. When lfs_vflush gets around to writing it,
the context for copyin is gone. So, now lfs_markv calls copyin itself,
rather than having lfs_writeseg do it.
 1.18  25-Mar-1999  perseant Lock buffers with B_BUSY between data checksum calculation and write, so
some other process doesn't change the data after it was checksummed.
 1.17  25-Mar-1999  perseant Change lfs_sb_cksum to use offsetof() instead of an inlined version.

Fix lfs_vref/lfs_vunref to ignore VXLOCKed vnodes that are also being
flushed.

Improve the debugging messages somewhat.
 1.16  25-Mar-1999  perseant clean up unused/required #ifdefs
 1.15  10-Mar-1999  perseant New sources should leave the LFS in a more-or-less working state. Changes
include:

- DIROP segregation is enabled, and greater care is taken
to make sure that a checkpoint completes. Fsck is not
needed to remount the filesystem.
- Several checks to make sure that the LFS subsystem does not
overuse various resources (memory, in particular).
- The cleaner routines, lfs_markv in particular, are completely
rewritten. A buffer overflow is removed. Greater care is taken
to ensure that inodes come from where lfs_cleanerd says they come
from (so we know nothing has changed since lfs_bmapv was called).
- Fragment allocation is fixed, so that writes beyond end-of-file
do the right thing.
 1.14  09-Nov-1998  mycroft GC the B_CACHE bit.
 1.13  23-Oct-1998  thorpej Use DINODE_SIZE rather than sizeof(struct dinode).
 1.12  11-Sep-1998  pk PR#6032: define fixed sized on-disk superblock structure.
 1.11  08-May-1998  kleink Fix some arithmetics lossage on typeless pointers.
 1.10  01-Mar-1998  fvdl Merge with Lite2 + local changes
 1.9  13-Jun-1997  pk TIMESPEC_TO_TIMEVAL => TIMEVAL_TO_TIMESPEC
 1.8  11-Jun-1997  bouyer Add support for ext2fs, this needed a few modifications to ufs/ufs/inode.h:
- added a "union inode_ext" to struct inode, for the per-fs extensions.
For now only ext2fs uses it.
- i_din is now a union:
union {
struct dinode ffs_din; /* 128 bytes of the on-disk dinode. */
struct ext2fs_dinode e2fs_din; /* 128 bytes of the on-disk dinode. */
} i_din
Added a lot of #define i_ffs_* and i_e2fs_* to access the fields.
- Added two macros: FFS_ITIMES and EXT2FS_ITIMES. ITIMES calls the right
macro, depending on the type of the inode. ITIMES is used where necessary,
FFS_ITIMES and EXT2FS_ITIMES in other places.
 1.7  12-Oct-1996  christos revert previous kprintf changes
 1.6  10-Oct-1996  christos printf -> kprintf, sprintf -> ksprintf
 1.5  01-Sep-1996  mycroft Add a set of generic file system operations that most file systems use.
Also, fix some time stamp bogosities.
 1.4  09-Feb-1996  christos lfs prototypes
 1.3  21-Aug-1994  cgd C syntax fix, and syscall args style (For later.)
 1.2  29-Jun-1994  cgd New RCS ID's, take two. they're more aesthetically pleasant, and use 'NetBSD'
 1.1  08-Jun-1994  mycroft branches: 1.1.1;
Update to 4.4-Lite fs code, with local changes.
 1.1.1.2  01-Mar-1998  fvdl Import 4.4BSD-Lite2
 1.1.1.1  01-Mar-1998  fvdl Import 4.4BSD-Lite for reference
 1.23.2.10  20-Jan-2000  he Pull up revision 1.39 (requested by perseant):
Files removed (through unlink, rmdir) are now really removed, though the
removal is postponed until the dirop is complete to ensure validity of
the filesystem through a crash. Use a separate per-fs lock, instead of
ufs_hashlock, to protect the inode free list. Change calling semantics
of lfs_ifind, to give better error reporting: If fed a struct buf, it
can report the block number of the offending inode block as well as the
inode number.
 1.23.2.9  15-Jan-2000  he Pull up revision 1.38 (requested by perseant):
Handle flushing a vnode during cleaning, and cleaning the Ifile,
more correctly, avoiding possible disk corruption in some cases.
 1.23.2.8  15-Jan-2000  he Pull up revision 1.30 (requested by perseant):
Address problems related to using an LFS filesystem as the root
filesystem, including mknod hangs. Fixes PR#8172 and PR#9072.
 1.23.2.7  18-Dec-1999  he Pull up revision 1.37 (requested by perseant):
Handle the case of a vnode flush while dirops are active correctly
in lfs_segwrite. Also, make sure a flush is called in SET_DIROP
before sleeping on its results. Addresses PR#8863.
 1.23.2.6  17-Dec-1999  he Pull up revision 1.32 (requested by perseant):
Address locking protocol error for inode hash, and make the
maximum number of active dirops a global quantity.
 1.23.2.5  16-Dec-1999  he Pull up revision 1.36 (requested by perseant):
Fix spllevel problem with superblock exclusion and with write
throttle. Addresses PR#8383.
 1.23.2.4  10-Oct-1999  cgd pull up rev 1.31 from trunk (requested by mycroft):
Fix potential overflow of v_usecount and v_writecount (and panics
resulting from this) by widening them to `long'. Mostly affects
systems where maxvnodes>=32768.
 1.23.2.3  03-Sep-1999  he Pull up revision 1.28:
Fix a printf format bug that gives compiler warnings/errors on
64-bit platforms, fixing PR#8241. (perseant)
 1.23.2.2  25-Jun-1999  perry pullup 1.26->1.27 (perseant)
 1.23.2.1  13-Apr-1999  perseant branches: 1.23.2.1.2; 1.23.2.1.4;
Pull-up of changes made to the trunk on Sunday [1.23->1.26], to wit:

Take out the `#ifdef USE_UFSHASH'; use ufs_hashlock to lock the inode free
list instead of free_lock.

Fix inode reporting in lfs_statfs (the meaning of f_files and f_ffree was
reversed).

Fix "lfs_ifind: dinode xxx not found" panic. When inodes were freed, then
immediately reloaded, their dinodes were located in an inode block which
was not on disk at the advertized location, nor in the cache (although it
would be flushed to disk next segment write). Fix this by using getblk()
instead of lfs_newbuf() for inode blocks.

Better checking for held inode locks in lfs_fastvget, for a number of
error conditions. Also change the default setting of lfs_clean_vnhead to
0, which seems to make the locking problems go away (although this is
difficult to test as I can't reliably reproduce them).

Make sure that the wakeup occurs for vnodes that lfs_update might be
sleeping on (nodes which are not marked IN_MODIFIED/IN_CLEANING, but which
have dirty buffers), by marking them with the appropriate flag if
dirtybuffers were added while the write was in progress.

Fix block counting during file truncation, if not truncating to zero.

Disallow threshold-initiated cache flush when dirops are active. Also,
make SET_ENDOP use lfs_check instead of inlining most of it.

Improve the debugging printfs in the cleaner syscalls (in particular, make
it obvious that they're coming from lfs).

Check the superblock version field, and refuse to mount the filesystem if
the version number is higher than we know about. This allows, e.g.,
changes in the format of the ifile, segment size restrictions and
boundaries, etc., which would not affect existing fields in the
superblock, but which would drastically affect the filesystem, to be
smoothly integrated at a later date.
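The version check in the last paragraph amounts to a guard at mount time. A minimal sketch, assuming a hypothetical maximum-version constant and an EINVAL return (neither the constant's name and value nor the error code is taken from the source):

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define X_LFS_MAX_VERSION 1u /* hypothetical: newest on-disk format this code understands */

/* Return 0 if the filesystem may be mounted, or an errno value if it is too new. */
static int
check_sb_version(uint32_t version)
{
	if (version > X_LFS_MAX_VERSION)
		return EINVAL; /* refuse rather than misread a newer layout */
	return 0;
}
```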
 1.23.2.1.4.1  30-Nov-1999  itojun bring in latest KAME (as of 19991130, KAME/NetBSD141) into kame branch
just for reference purposes.
This commit includes 1.4 -> 1.4.1 sync for kame branch.

The branch does not compile at all (due to the lack of ALTQ and some other
source code). Please do not try to modify the branch, this is just for
reference purposes.

synchronization to latest KAME will take place on HEAD branch soon.
 1.23.2.1.2.4  31-Aug-1999  perseant Rudimentary support for LFS under UBC:

- LFS-specific VOP_BALLOC and VOP_PUTPAGES vnode ops.

- getblk VREG panic #ifdef'd out (can be reinstated when Ifile is
internalized and Ifile can be made another type from VREG)

- interface to VOP_PUTPAGES changed to pass all pager flags, not
just sync. FS putpages routines must know about the pager flags.

- new LFS magic disk address, -2 ("unwritten"), meaning accounted for
but not assigned to a fixed disk location (since LFS does these two
things separately, and the previous accounting method using buffer
headers no longer will work). Changed references to (foo == (daddr_t)-1)
to (foo < 0). Since disk drivers reject all addresses < 0, this should
not present a problem for other FSs.
 1.23.2.1.2.3  02-Aug-1999  thorpej Update from trunk.
 1.23.2.1.2.2  21-Jun-1999  thorpej Correct a printf format now that vnode flags are an int (in the uvm_vnode
structure).
 1.23.2.1.2.1  21-Jun-1999  thorpej Sync w/ -current.
 1.31.6.2  27-Dec-1999  wrstuden Pull up to last week's -current.
 1.31.6.1  21-Dec-1999  wrstuden Initial commit of recent changes to make DEV_BSIZE go away.

Runs on i386, needs work on other arch's. Main kernel routines should be
fine, but a number of the stand programs need help.

cd, fd, ccd, wd, and sd have been updated. sd has been tested with non-512
byte block devices. vnd, raidframe, and lfs need work.

Non 2**n block support is automatic for LKM's and conditional for kernels
on "options NON_PO2_BLOCKS".
 1.31.4.2  15-Nov-1999  fvdl Sync with -current
 1.31.4.1  19-Oct-1999  fvdl Bring in Kirk McKusick's FFS softdep code on a branch.
 1.31.2.4  18-Jan-2001  bouyer Sync with head (for UBC+NFS fixes, mostly).
 1.31.2.3  08-Dec-2000  bouyer Sync with HEAD.
 1.31.2.2  22-Nov-2000  bouyer Sync with HEAD.
 1.31.2.1  20-Nov-2000  bouyer Update thorpej_scsipi to -current as of a month ago
 1.32.2.2  06-Nov-1999  perseant Address ufs_hashlock/ufs_ihashins protocol bug, discovered while doing a
post-mortem of a production machine. Also, take the active dirop
count off of the fs and make it global (since it is measuring a global
resource) and tie the threshold value LFS_MAXDIROP to desiredvnodes.
 1.32.2.1  06-Nov-1999  perseant file lfs_segment.c was added on branch comdex-fall-1999 on 1999-11-06 20:33:06 +0000
 1.46.2.1  22-Jun-2000  minoura Sync w/ netbsd-1-5-base.
 1.49.2.4  03-Feb-2001  he Pull up revisions 1.60-1.62 (requested by perseant):
o Don't write anything if the filesystem is idle (PR#10979).
o Close up accounting holes in LFS' accounting of immediately-
available-space, number of clean segments, and amount of dirty
space taken up by metadata (PR#11468, PR#11470, PR#11534).
 1.49.2.3  14-Sep-2000  perseant Pull up recent LFS kernel changes (approved by thorpej):

ufs/ufs/inode.h, 1.20--1.22 (add i_lfs_effnblks extension ;
make ITIMES aware of LFS_ITIMES;
_LKM protection so userland progs
compile)
ufs/ufs/ufs_vnops.c, 1.69, 1.71 (remove IN_ADIROP;
use ITIMES instead of FFS_ITIMES)
ufs/ufs/ufs_readwrite.c, 1.27 (use lfs_reserve in lfs_write)
ufs/lfs/lfs.h, 1.26--1.32 (define LFS_EST_* macros ;
change MIN_FREE_SEGS to lfs_minfreesegs ;
add avail and bfree to CLEANERINFO ;
change lfs_uinodes to signed ;
change lfs_dmeta to signed ;
add whitespace to line up structure
members ;
explicit cast to int32_t in LFS_EST_*
macros)
ufs/lfs/lfs_alloc.c, back out 1.34.2.3 (pullups of 1.39, 1.40);
then pull up 1.38 (clean up on error)
1.39--1.43 (restore fvdl's ufs_hashlock fix ;
restore fvdl's ufs_hashlock fix ;
set i_lfs_effnblks ;
use UINO macros ;
add comments and fix long lines)
ufs/lfs/lfs_balloc.c, 1.19 (don't succeed halfway)
1.21--1.25 (use i_lfs_effnblks ;
fix i_lfs_effnblks computation and
quieten ;
fix i_ffs_blocks in unwritten fragment ;
remove useless debugging check ;
add comments and (c) 2000)
ufs/lfs/lfs_bio.c, 1.24--1.30 (cleanup and make lfs_flush_fs take
"struct lfs *" instead of "struct
mount *" ;
use lfs_minfreeseg instead of
MIN_FREE_SEGS ;
use UINO macros, and copy bfree/avail
to CLEANERINFO ;
add lfs_reserve function ;
1.28--1.30 fix printf formatting)
ufs/lfs/lfs_cksum.c, 1.13 (add (c) 2000)
ufs/lfs/lfs_debug.c, 1.11 (use btodb instead of DEV_BSIZE)
ufs/lfs/lfs_extern.h, 1.18, 1.20--1.21 (function prototype changes)
ufs/lfs/lfs_inode.c, 1.38 (rewrite lfs_truncate from
ffs_truncate)
1.40--1.44 (count written and unwritten blocks
separately ;
use disk block units instead of bytes ;
remove unnecessary "mod" variable ;
correct B_DELWRI to avoid bawrite panic ;
use lfs_reserve)
ufs/lfs/lfs_segment.c, 1.52-1.59 (use lfs_dmeta to note used summaries ;
check for UNWRITTEN in indirect blocks ;
more debugging stuff inside #ifdef
DEBUG_LFS ;
use LK_CANRECURSE ;
don't drop dirty indirect blocks ;
use UINO macros ;
don't hose the free list ;
use btodb() instead of DEV_BSIZE ;
make it compile again (oops))
ufs/lfs/lfs_subr.c, 1.16--1.17 (check for locked inodes before
changing ;
use btodb() instead of DEV_BSIZE, (c)
2000)
ufs/lfs/lfs_syscalls.c, back out 1.41.4.2 (fvdl's ufs_hashlock fix);
then pull up 1.43 (use lfs_dmeta)
1.44--1.45 (restore fvdl's ufs_hashlock fix)
1.46--1.47 (fix lfs_avail leakage from sblock
segments ;
use UINO macros)
1.49 (bounds-check inode numbers in
lfs_markv)
ufs/lfs/lfs_vfsops.c, 1.53 (use LFS_EST_* macros in lfs_statfs)
1.56--1.58 (initialize lfs_minfreeseg, lfs_effnblk ;
initialize lfs_uinodes ;
initialize lfs_ravail)
ufs/lfs/lfs_vnops.c, 1.40 (remove VDIROP from removed files)
1.42--1.44 (move SET_ENDOP below the removal of
VDIROP ;
use UINO macros and add lfs_itimes
function ;
use lfs_reserve in dirops)
 1.49.2.2  28-Jun-2000  perseant pull up active current segment patch from trunk
 1.49.2.1  22-Jun-2000  perseant Pull up lfs_vunref fix from the trunk.
 1.67.2.13  08-Jan-2003  thorpej Oh my aching HEAD.
 1.67.2.12  08-Jan-2003  thorpej Sync with HEAD.
 1.67.2.11  03-Jan-2003  thorpej Sync with HEAD.
 1.67.2.10  19-Dec-2002  thorpej Sync with HEAD.
 1.67.2.9  18-Oct-2002  nathanw Catch up to -current.
 1.67.2.8  01-Aug-2002  nathanw Catch up to -current.
 1.67.2.7  24-Jun-2002  nathanw Curproc->curlwp renaming.

Change uses of "curproc->l_proc" back to "curproc", which is more like the
original use. Bare uses of "curproc" are now "curlwp".

"curproc" is now #defined in proc.h as ((curlwp) ? (curlwp)->l_proc : NULL)
so that it is always safe to reference curproc (*de*referencing curproc
is another story, but that's always been true).
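The null-safe macro described above can be tried out in user space. This sketch uses hypothetical `x_`-prefixed stand-ins for the kernel's types and globals; only the shape of the conditional expression follows the message.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct proc, struct lwp and the curlwp global. */
struct xproc { int pid; };
struct xlwp  { struct xproc *l_proc; };

static struct xlwp *x_curlwp; /* NULL when there is no current LWP */

/* Safe to reference even when x_curlwp is NULL; the result may itself be NULL. */
#define x_curproc ((x_curlwp) ? (x_curlwp)->l_proc : NULL)
```

Referencing `x_curproc` with no current LWP yields NULL instead of faulting; dereferencing that result is still the caller's problem, as the message notes.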
 1.67.2.6  20-Jun-2002  nathanw Catch up to -current.
 1.67.2.5  08-Jan-2002  nathanw Catch up to -current.
 1.67.2.4  14-Nov-2001  nathanw Catch up to -current.
 1.67.2.3  24-Aug-2001  nathanw Catch up with -current.
 1.67.2.2  21-Jun-2001  nathanw Catch up to -current.
 1.67.2.1  05-Mar-2001  nathanw Initial commit of scheduler activations and lightweight process support.
 1.68.4.5  10-Oct-2002  jdolecek sync kqueue with -current; this includes merge of gehenna-devsw branch,
merge of i386 MP branch, and part of autoconf rototil work
 1.68.4.4  06-Sep-2002  jdolecek sync kqueue branch with HEAD
 1.68.4.3  23-Jun-2002  jdolecek catch up with -current on kqueue branch
 1.68.4.2  10-Jan-2002  thorpej Sync kqueue branch with -current.
 1.68.4.1  03-Aug-2001  lukem update to -current
 1.68.2.3  02-Jul-2001  perseant Change disk addressing unit to be the fragment, instead of the disk sector.
All quantities in the superblock, inodes, indirect blocks, etc. refer now
to this abstract unit (called "fsb" as it is in FFS) instead of disk sectors;
as a consequence segment summary blocks have to be multiples of a fragment in
size. In v1 filesystems, compatibility code ensures that 1 fsb == 1 sector,
regardless of fragment size.

Fragments can now range in size between 512 and 32k; in the event that
LFS_LABELPAD (8k) is smaller than the disk address unit size, an extra
proto-superblock is kept at 8k from the beginning of the disk, to be used
*only* to locate the real superblocks. (Not all of the userland knows about
this yet.)

Almost all of this was done not by me, but by joff.
 1.68.2.2  29-Jun-2001  perseant Get rid of __P(), protoizing where it had not already been done
 1.68.2.1  27-Jun-2001  perseant Import of what I've been calling "LFSv2", that is, LFS with some features
added that require changes to the on-disk data structures. These include:

- 64-bit time in everything but inodes
- User-specified segment offset, and segment size no longer
restricted to PO2.
- Serial number on segment summaries in addition to timestamp, and
a new volume identifier, to make roll-forward feasible without
fear of finding old data and thinking it was new.

Although I think this version works at least as well as what's on the trunk,
we're not done yet; hence this commit is going in on a branch and not on
the trunk. Enhancements that are not here yet include fragment addressing,
like FFS does, instead of block addressing.
 1.70.4.1  12-Nov-2001  thorpej Sync the thorpej-mips-cache branch with -current.
 1.70.2.1  07-Sep-2001  thorpej Commit my "devvp" changes to the thorpej-devvp branch. This
replaces the use of dev_t in most places with a struct vnode *.

This will form the basic infrastructure for real cloning device
support (besides being architecturally cleaner -- it'll be good
to get away from using numbers to represent objects).
 1.74.2.3  15-Jul-2002  gehenna catch up with -current.
 1.74.2.2  20-Jun-2002  gehenna catch up with -current.
 1.74.2.1  30-May-2002  gehenna Catch up with -current.
 1.76.2.3  20-Jun-2002  lukem Pull up revision 1.79 (requested by perseant in ticket #325):
For synchronous writes, keep separate i/o counters for each write, so
processes don't have to wait for one another to finish (e.g., nfsd seems
to be a little happier now, though I haven't measured the difference).
Synchronous checkpoints, however, must always wait for all i/o to finish.
Take the contents of the callback functions and have them run in thread
context instead (aiodoned thread). lfs_iocount no longer has to be
protected in splbio(), and quite a bit less of the segment construction
loop needs to be in splbio() as well.
If lfs_markv is handed a block that is not the correct size according to
the inode, refuse to process it. (Formerly it was extended to the "correct"
size.) This is possibly more prone to deadlock, but less prone to corruption.
lfs_segclean now outright refuses to clean segments that appear to have live
bytes in them. Again this may be more prone to deadlock but avoids
corruption.
Replace ufsspec_close and ufsfifo_close with LFS equivalents; this means
that no UFS functions need to know about LFS_ITIMES any more. Remove
the reference from ufs/inode.h.
Tested on i386, test-compiled on alpha.
 1.76.2.2  02-Jun-2002  tv Pull up revision 1.78 (requested by perseant in ticket #135):
Fix a couple of instances where reassignbuf() was not done at splbio.
Tested on i386.
 1.76.2.1  02-Jun-2002  tv Pull up revision 1.77 (requested by perseant in ticket #132):
Back out rev 1.174 of vfs_subr.c, because the splbio() wasn't protecting
enough to be useful, and broadening it so that it did would have meant
that operations possibly requiring synchronous disk activity would have
to be done in splbio(). This clearly was not going to work.
Worked around this in the LFS case by having lfs_cluster_callback put an
extra hold on the vnode before calling biodone(), and taking the hold
off without HOLDRELE's problematic list swapping. lfs_vunref() will take
care of that---in thread context---on the next write if need be.
Also, ensure that the list walking in lfs_{writevnodes,segunlock,gather}
takes into account the possibility that the list may change
underneath it (possibly because it itself deleted an element).
Tested on i386, test-compiled on alpha.
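
The list-walking concern in the 1.77 pull-up above (a list that changes underneath the walker, possibly because the walker itself deleted an element) can be illustrated with a minimal userland sketch. This is not the kernel code: struct node, push(), is_odd() and prune() are hypothetical names, and the list is a plain singly linked list rather than the vnode or buffer lists that lfs_writevnodes, lfs_segunlock and lfs_gather traverse. The point shown is only that the walker never dereferences an element after unlinking it.

```c
#include <stdlib.h>

/* Hypothetical list element; stands in for a vnode or buffer. */
struct node {
	int val;
	struct node *next;
};

/* Prepend a value and return the new head; aborts on allocation failure. */
static struct node *
push(struct node *head, int val)
{
	struct node *n = malloc(sizeof(*n));

	if (n == NULL)
		abort();
	n->val = val;
	n->next = head;
	return n;
}

/* Example predicate: odd values are treated as "to be deleted". */
static int
is_odd(int v)
{
	return (v % 2) != 0;
}

/*
 * Walk the list and delete every element for which dead() is true.
 * pp always points at the link that leads to the current element,
 * so unlinking rewrites that link and the walk continues from the
 * successor without touching freed memory.
 */
static int
prune(struct node **headp, int (*dead)(int))
{
	struct node **pp = headp;
	struct node *n;
	int removed = 0;

	while ((n = *pp) != NULL) {
		if (dead(n->val)) {
			*pp = n->next;	/* unlink first */
			free(n);	/* then release */
			removed++;
		} else {
			pp = &n->next;	/* advance only past survivors */
		}
	}
	return removed;
}
```

For a list built as 1, 2, 3, 4, 5, calling prune(&head, is_odd) removes three elements and leaves 2 and 4 in order.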
 1.124.2.10  10-Nov-2005  skrll Sync with HEAD. Here we go again...
 1.124.2.9  08-Mar-2005  skrll Sync with HEAD.
 1.124.2.8  04-Mar-2005  skrll Sync with HEAD.

Hi Perry!
 1.124.2.7  24-Sep-2004  skrll Sync with HEAD.
 1.124.2.6  21-Sep-2004  skrll Fix the sync with head I botched.
 1.124.2.5  18-Sep-2004  skrll Sync with HEAD.
 1.124.2.4  25-Aug-2004  skrll Sync with HEAD.
 1.124.2.3  24-Aug-2004  skrll Undo part of the ktrace/lwp changes. In particular:
* Remove the "lwp *" argument that was added to vget(). Turns out
that nothing actually used it!
* Remove the "lwp *" arguments that were added to VFS_ROOT(), VFS_VGET(),
and VFS_FHTOVP(); all they did was pass it to vget() (which, as noted
above, didn't use it).
* Remove all of the "lwp *" arguments to internal functions that were added
just to appease the above.
 1.124.2.2  03-Aug-2004  skrll Sync with HEAD
 1.124.2.1  02-Jul-2003  darrenr Apply the aborted ktrace-lwp changes to a specific branch. This is just for
others to review, I'm concerned that patch fuzziness may have resulted in some
errant code being generated but I'll look at that later by comparing the diff
from the base to the branch with the file I attempt to apply to it. This will,
at the very least, put the changes in a better context for others to review
them and attempt to tinker with removing passing of 'struct lwp' through
the kernel.
 1.152.4.1  10-May-2005  riz Pull up the following revisions (requested by perseant in ticket #1281):

1.8 sys/ufs/lfs/TODO
1.75 sys/ufs/lfs/lfs.h (via patch)
1.74 sys/ufs/lfs/lfs_alloc.c (via patch)
1.49, 1.51 sys/ufs/lfs/lfs_balloc.c (1.51 via patch)
1.78 sys/ufs/lfs/lfs_bio.c
1.62 sys/ufs/lfs/lfs_extern.h (via patch)
1.156 sys/ufs/lfs/lfs_segment.c (via patch)
1.48 sys/ufs/lfs/lfs_subr.c
1.101 sys/ufs/lfs/lfs_syscalls.c
1.163 sys/ufs/lfs/lfs_vfsops.c (via patch)
1.134 sys/ufs/lfs/lfs_vnops.c (via patch)
1.61 sys/ufs/ufs/ufs_readwrite.c (via patch)

1.20 libexec/lfs_cleanerd/clean.h (via patch)
1.52 libexec/lfs_cleanerd/cleanerd.c (via patch)
1.41 libexec/lfs_cleanerd/library.c (via patch)

1.4 regress/sys/fs/lfs/newfs_fsck/Makefile
1.2 regress/sys/fs/lfs/newfs_fsck/mkfs_mount
1.2 regress/sys/fs/lfs/newfs_fsck/smallfiles
1.3 sbin/fsck_lfs/bufcache.c
1.3 sbin/fsck_lfs/bufcache.h
1.3 sbin/fsck_lfs/lfs.h
1.8 sbin/fsck_lfs/lfs.c (via patch)
1.8 sbin/fsck_lfs/pass3.c (via patch)
1.18 sbin/fsck_lfs/pass0.c (via patch)
1.18 sbin/fsck_lfs/utilities.c (via patch)
1.7 sbin/fsck_lfs/segwrite.c
1.19 sbin/fsck_lfs/setup.c (via patch)
1.3 sbin/newfs_lfs/Makefile
0 sbin/newfs_lfs/lfs.c (yes, remove it)
1.1 sbin/newfs_lfs/make_lfs.c
1.15 sbin/newfs_lfs/newfs.c (via patch)

Various minor LFS improvements.

Kernel:

* Note when lfs_putpages(9) thinks it is not going to be writing any
pages before calling genfs_putpages(9). This prevents a situation in
which blocks can be queued for writing without a segment header.
* Correct computation of NRESERVE(), though it is still a gross
overestimate in most cases. Note that if NRESERVE() is too high, it
may be impossible to create files on the filesystem. We catch this
case on filesystem mount and refuse to mount r/w.
* Allow filesystems to be mounted whose block size is == MAXBSIZE.
* Somewhere along the line, ufs_bmaparray(9) started mangling UNWRITTEN
entries in indirect blocks again, triggering a failed assertion "daddr
<= LFS_MAX_DADDR". Explicitly convert to and from int32_t to correct
this. Should fix PR #29045.
* Add a high-water mark for the number of dirty pages any given LFS can
hold before triggering a flush. This is settable by sysctl, but off
(zero) by default.
* Be more careful about the MAX_BYTES and MAX_BUFS computations so we
shouldn't see "please increase to at least zero" messages.
* Note that VBLK and VCHR vnodes can have nonzero values in di_db[0]
even though their v_size == 0. Don't panic when we see this.
Fixes PR #26680.
* Change lfs_bfree to a signed quantity. The manner in which it is
processed before being passed to the cleaner means that sometimes it
may drop below zero, and the cleaner must be aware of this.
* Never report bfree < 0 (or higher than lfs_dsize) through
lfs_statfs(9). This prevents df(1) from ever telling us that our full
filesystems have 16TB free.
* Account space allocated through lfs_balloc(9) that does not have
associated buffer headers, so that the pagedaemon doesn't run us out
of segments.
* Return ENOSPC from lfs_balloc(9) when bfree drops to zero.
* Address a deadlock in lfs_bmapv/lfs_markv when the filesystem is being
unmounted. Because vfs_busy() is a shared lock, and
lfs_bmapv/lfs_markv mark the filesystem vfs_busy(), the cleaner can be
holding the lock that umount() is blocking on, then try to vfs_busy()
again in getnewvnode().

Cleaner:

* Adapt lfs_cleanerd to use the fcntl call to get the Ifile filehandle,
so it need not be in the namespace.
* Make lfs_cleanerd be more careful when there are very few available
segments.
* Make lfs_cleanerd less verbose when the filesystem is unmounted.

newfs_lfs, fsck_lfs, and regression:

* Extend the lfs library from fsck_lfs(8) so that it can be used with a
not-yet-existent LFS. Make newfs_lfs(8) use this library, so it can
create LFSs whose Ifile is larger than one segment. Addresses PR #11110.
* Make newfs_lfs(8) use strsuftoi64() for its arguments, a la newfs(8).
* Make fsck_lfs(8) respect the "file system is clean" flag.
* Don't let fsck_lfs(8) think it has dirty blocks when invoked with the
-n flag.
* Remove the Ifile from the filesystem namespace. The cleaner now uses
a fcntl call on the root inode to find the Ifile filehandle. (As a
side-effect, addresses PR #29144.)
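
The kernel note above about never reporting bfree < 0 (or higher than lfs_dsize) through lfs_statfs(9) amounts to a clamp on a signed quantity. A minimal sketch follows, assuming 64-bit signed block counts; clamp_bfree() is a hypothetical helper name and not a function in the LFS sources.

```c
#include <stdint.h>

/*
 * Clamp a signed free-block count into [0, dsize] before it is
 * reported to userland.  Both arguments are in the same unit.
 */
static int64_t
clamp_bfree(int64_t bfree, int64_t dsize)
{
	if (bfree < 0)
		return 0;	/* internal accounting may dip below zero */
	if (bfree > dsize)
		return dsize;	/* never report more than the data area */
	return bfree;
}
```

A negative count that is later interpreted as unsigned would look like a very large amount of free space, which is presumably the df(1) symptom the note refers to.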
 1.155.6.1  19-Mar-2005  yamt sync with head. xen and whitespace. xen part is not finished.
 1.155.4.1  29-Apr-2005  kent sync with -current
 1.158.2.13  10-Aug-2006  tron Apply patch (requested by fair in perseant #1457):
Bring LFS up to current, including a patch (1.95 lfs_alloc.c) that
should prevent the inode free list errors seen on the STABLE branch
subsequent to pullup ticket #1327.
 1.158.2.12  20-May-2006  riz Pull up following revision(s) (requested by perseant in ticket #1327):
sys/ufs/lfs/lfs_alloc.c: revision 1.93
sys/ufs/lfs/lfs.h: revision 1.106
sys/ufs/lfs/lfs_vfsops.c: revision 1.209
sys/ufs/lfs/lfs_vnops.c: revision 1.175
sys/ufs/lfs/lfs_segment.c: revision 1.178
Fixes to address the "vinvalbuf: dirty blocks" panic that can occur when
many inodes are cleaned at once. Make sure that we write all the pages
on vnodes that are being flushed, even if we don't think there's room;
drain v_numoutput before lfs_vflush() completes.
Also, don't allow a vnode that is in the process of being cleaned to be
chosen by getnewvnode(); this avoids a segment accounting panic in the case
that a large number of inodes are fed to lfs_markv() all at once.
 1.158.2.11  20-May-2006  riz Pull up following revision(s) (requested by perseant in ticket #1327):
sys/ufs/lfs/lfs_vnops.c: revision 1.171
sys/ufs/lfs/lfs_extern.h: revision 1.81
sys/ufs/lfs/lfs_segment.c: revision 1.177
Don't ever partially write dirops, even if we need the cleaner to run.
This increases the chances of the "no clean segments" panic slightly,
but allows us to run the ckckp regression test successfully to completion.
 1.158.2.10  20-May-2006  riz Pull up following revision(s) (requested by perseant in ticket #1327):
sys/ufs/lfs/lfs.h: revision 1.104
sys/ufs/lfs/lfs_vfsops.c: revision 1.206
sys/ufs/lfs/lfs_vnops.c: revision 1.170
sys/ufs/lfs/lfs_extern.h: revision 1.80
sys/ufs/lfs/lfs_segment.c: revision 1.176
sys/ufs/lfs/lfs_inode.c: revision 1.103 via patch
sys/ufs/lfs/lfs_alloc.c: revision 1.90
Postpone the segment accounting changes coming from truncation until the
inode that makes those changes valid is either written to disk by
lfs_writeinode() or discarded by lfs_vfree().
A couple of locking fixes are also included.
 1.158.2.9  20-May-2006  riz Pull up following revision(s) (requested by perseant in ticket #1327):
sys/ufs/lfs/lfs_segment.c: revision 1.175
Regression test improvements:
Move the stop for LFCNWRAPSTOP to the point at which writing at segment 0
is really about to commence, since this is what the test expects (and
incidentally what a snapshotting utility wants as well).
More correctly reconstruct the on-disk state at every checkpoint, rather
than relying on the entire state at the point of wrapping to be accurate
(that is only true the first time we wrap). Add a "make abort" target to
make rerunning the test more convenient when it has failed and we're done
analyzing the failure.
 1.158.2.8  20-May-2006  riz Pull up following revision(s) (requested by perseant in ticket #1327):
sys/ufs/lfs/lfs.h: revision 1.103
sys/ufs/lfs/lfs_segment.c: revision 1.174
sys/ufs/lfs/lfs_vnops.c: revision 1.168
Introduce two fcntl calls that freeze the filesystem right at the point
where segment 0 is being considered for writing. This allows for automated
checkpoint validity scanning, and could be used (in conjunction with the
existing LFCNREWIND) for e.g. snapshot dumps as well.
Include a regression test that does such scanning.
When writing the Ifile, loop through the dirty block list three times to
make sure that the checkpoint is always consistent (the first and second
times the Ifile blocks can cross a segment boundary; not so the third time
unless the segments are very small). Discovered by using the aforementioned
regression test.
 1.158.2.7  20-May-2006  riz Pull up following revision(s) (requested by perseant in ticket #1327):
sys/ufs/lfs/lfs.h: revision 1.102
sys/ufs/lfs/lfs_segment.c: revision 1.173
sys/ufs/lfs/lfs_vnops.c: revision 1.167 via patch
sys/ufs/lfs/lfs_bio.c: revision 1.91
Make lfs_vref/lfs_vunref not need to know about VXLOCK and VFREEING
explicitly (especially since we didn't know about VFREEING at all before),
but notice the EBUSY return from vget() instead.
Fix some more MP locking protocol issues, most of which were pointed out by
Christian Ehrhardt this morning on tech-kern.
 1.158.2.6  20-May-2006  riz Pull up following revision(s) (requested by perseant in ticket #1327):
sys/ufs/lfs/lfs_balloc.c: revision 1.60
sys/ufs/lfs/lfs_syscalls.c: revision 1.111
sys/ufs/lfs/lfs_segment.c: revision 1.172
sys/ufs/lfs/lfs_vnops.c: revision 1.163
Several minor bug fixes:
* Correct (weak) segment lock assertions in lfs_fragextend and lfs_putpages.
* Keep IN_MODIFIED set if we run out of avail in lfs_putpages.
* Don't try to (re)write buffers on a VBLK vnode; fixes a panic I found
while running with an LFS root.
* Raise priority of LFCNSEGWAIT to PVFS; PUSER is way too low for
something the pagedaemon is relying on.
 1.158.2.5  20-May-2006  riz Pull up following revision(s) (requested by perseant in ticket #1327):
sys/ufs/lfs/lfs_vnops.c: revision 1.158
sys/ufs/lfs/lfs_subr.c: revision 1.57
sys/ufs/lfs/lfs_segment.c: revision 1.171
sys/ufs/lfs/lfs.h: revision 1.97
sys/ufs/lfs/lfs_vfsops.c: revision 1.195
sys/ufs/lfs/lfs_extern.h: revision 1.76
Improvements to LFS's paging mechanism, to wit:
* Acknowledge that sometimes there are more dirty pages to be written to
disk than clean segments. When we reach the danger line,
lfs_gop_write() now returns EAGAIN. The caller of VOP_PUTPAGES(), if
it holds the segment lock, drops it and waits for the cleaner to make
room before continuing.
* Note and avoid a three-way deadlock in lfs_putpages (a writer holding
a page busy blocks on the cleaner while the cleaner blocks on the
segment lock while lfs_putpages blocks on the page).
 1.158.2.4  20-May-2006  riz Pull up following revision(s) (requested by perseant in ticket #1327):
sys/ufs/lfs/lfs_segment.c: revision 1.170
sys/ufs/lfs/lfs.h: revision 1.96
sys/ufs/lfs/lfs_vfsops.c: revision 1.194
sys/ufs/lfs/lfs_syscalls.c: revision 1.109
From Konrad Schroeder, in response to strange df output on anoncvs.netbsd.org:
We were returning the wrong value for free space. Now we're not.
 1.158.2.3  20-May-2006  riz Pull up following revision(s) (requested by perseant in ticket #1327):
sys/ufs/lfs/lfs_vnops.c: revision 1.153
sys/ufs/lfs/lfs_debug.c: revision 1.32
sys/ufs/lfs/lfs_alloc.c: revision 1.84
sys/ufs/lfs/lfs_vfsops.c: revision 1.185
sys/ufs/lfs/lfs_segment.c: revision 1.165
64 bit inode changes.
 1.158.2.2  20-May-2006  riz Pull up following revision(s) (requested by perseant in ticket #1327):
sys/ufs/lfs/lfs_vnops.c: revision 1.152
sys/ufs/lfs/lfs_debug.c: revision 1.31
sys/ufs/lfs/lfs_subr.c: revision 1.53
sys/ufs/lfs/lfs_extern.h: revision 1.68
sys/ufs/lfs/lfs_inode.c: revision 1.96
sys/ufs/lfs/lfs_bio.c: revision 1.86
sys/ufs/lfs/lfs_alloc.c: revision 1.83
sys/ufs/lfs/lfs_vfsops.c: revision 1.181
sys/ufs/lfs/lfs.h: revision 1.88
sys/ufs/lfs/lfs_segment.c: revision 1.164
- sprinkle const
- avoid shadow variables.
 1.158.2.1  07-May-2005  tron Apply patch (requested by perseant in ticket #242):
* fsck_lfs buffer cache fixes, including PR #29151
* Change fsck_lfs phase 0 message to reflect reality
* fsck_lfs: check phase 5 (cleanerinfo accounting) even on
roll-forward
* Keep better track of the free list during roll-forward, avoiding
a core dump
* Improve hash table use for fsck_lfs buffer and vnode cache
* Document fsck_lfs flag -f, and implement -q
* Add resize_lfs, including kernel support
* Add LFS to mountd's list of exportable filesystem types
* Make the LFS lkm work again [christos@]
* Add MP locking to the LFS kernel subsystem
* Fix pager_map deadlock in lfs_putpages()
* Avoid incomplete file extension that looks like "partial
truncation" to fsck
* Use lfs_malloc for cleaner malloc, since the cleaner often runs
in low-memory conditions.
* Use splay trees, not hash table, to track page allocation for
write.
* Fix mkdir panic on full fs
* Fix page accounting leak by counting differently.
* Use rightly named structure for lfs_getattr [skrll@]
* Cosmetic changes for readability.
 1.164.2.8  27-Feb-2008  yamt sync with head.
 1.164.2.7  04-Feb-2008  yamt sync with head.
 1.164.2.6  21-Jan-2008  yamt sync with head
 1.164.2.5  27-Oct-2007  yamt sync with head.
 1.164.2.4  03-Sep-2007  yamt sync with head.
 1.164.2.3  26-Feb-2007  yamt sync with head.
 1.164.2.2  30-Dec-2006  yamt sync with head.
 1.164.2.1  21-Jun-2006  yamt sync with head.
 1.168.2.1  15-Jan-2006  yamt sync with head.
 1.169.10.2  24-May-2006  tron Merge 2006-05-24 NetBSD-current into the "peter-altq" branch.
 1.169.10.1  28-Mar-2006  tron Merge 2006-03-28 NetBSD-current into the "peter-altq" branch.
 1.169.8.3  11-May-2006  elad sync with head
 1.169.8.2  06-May-2006  christos - Move kauth_cred_t declaration to <sys/types.h>
- Cleanup struct ucred; forward declarations that are unused.
- Don't include <sys/kauth.h> in any header, but include it in the c files
that need it.

Approved by core.
 1.169.8.1  19-Apr-2006  elad sync with head.
 1.169.6.6  03-Sep-2006  yamt sync with head.
 1.169.6.5  11-Aug-2006  yamt sync with head
 1.169.6.4  26-Jun-2006  yamt sync with head.
 1.169.6.3  24-May-2006  yamt sync with head.
 1.169.6.2  11-Apr-2006  yamt sync with head
 1.169.6.1  01-Apr-2006  yamt sync with head.
 1.169.4.3  01-Jun-2006  kardel Sync with head.
 1.169.4.2  22-Apr-2006  simonb Sync with head.
 1.169.4.1  04-Feb-2006  simonb Adapt for timecounters: mostly use get*time() and use "time_second"
instead of "time.tv_sec".
 1.169.2.1  09-Sep-2006  rpaulo sync with head
 1.180.2.1  19-Jun-2006  chap Sync with head.
 1.182.2.1  13-Jul-2006  gdamore Merge from HEAD.
 1.190.4.3  10-Dec-2006  yamt sync with head.
 1.190.4.2  22-Oct-2006  yamt use workqueue for aiodoned.
 1.190.4.1  22-Oct-2006  yamt sync with head
 1.190.2.2  12-Jan-2007  ad Sync with head.
 1.190.2.1  18-Nov-2006  ad Sync with head.
 1.195.4.1  03-Sep-2007  wrstuden Sync w/ NetBSD-4-RC_1
 1.195.2.1  05-Jun-2007  bouyer Pull up following revision(s) (requested by perseant in ticket #703):
sys/miscfs/genfs/genfs.h 1.21
sys/miscfs/genfs/genfs_vnops.c 1.151
sys/ufs/lfs/lfs.h 1.119, 1.120
sys/ufs/lfs/lfs_bio.c 1.99-101
sys/ufs/lfs/lfs_extern.h 1.89
sys/ufs/lfs/lfs_inode.c 1.108, 1.109
sys/ufs/lfs/lfs_segment.c 1.197, 1.199, 1.200
sys/ufs/lfs/lfs_subr.c 1.69, 1.70
sys/ufs/lfs/lfs_syscalls.c 1.119
sys/ufs/lfs/lfs_vfsops.c 1.234, 1.235
sys/ufs/lfs/lfs_vnops.c 1.195, 1.196, 1.200, 1.202-206

Reduce busy waiting in lfs_putpages(), and other LFS improvements.
 1.196.2.4  17-May-2007  yamt sync with head.
 1.196.2.3  07-May-2007  yamt sync with head.
 1.196.2.2  12-Mar-2007  rmind Sync with HEAD.
 1.196.2.1  27-Feb-2007  yamt - sync with head.
- move sched_changepri back to kern_synch.c as it doesn't know PPQ anymore.
 1.198.4.1  11-Jul-2007  mjf Sync with head.
 1.198.2.13  01-Oct-2007  ad Make it compile (XXX not correct).
 1.198.2.12  28-Aug-2007  yamt - mark aiodone workqueue MPSAFE.
- make lfs callbacks acquire kernel_lock by themselves.

ok'ed by Andrew Doran.
 1.198.2.11  28-Aug-2007  yamt make this compilable with DEBUG.
 1.198.2.10  24-Aug-2007  ad Sync with buffer cache locking changes. See buf.h/vfs_bio.c for details.
Some minor portions are incomplete and need to be verified as a whole.
 1.198.2.9  20-Aug-2007  ad Sync with HEAD.
 1.198.2.8  19-Aug-2007  ad - Back out the biodone() changes.
- Eliminate B_ERROR (from HEAD).
 1.198.2.7  15-Jul-2007  ad Sync with head.
 1.198.2.6  23-Jun-2007  ad - Lock v_cleanblkhd, v_dirtyblkhd, v_numoutput with the vnode's interlock.
Get rid of global_v_numoutput_lock. Partially incomplete as the buffer
cache locking doesn't work very well and needs an overhaul.
- Some changes to try and make softdep MP safe. Untested.
 1.198.2.5  17-Jun-2007  ad - Increase the number of thread priorities from 128 to 256. How the space
is set up is to be revisited.
- Implement soft interrupts as kernel threads. A generic implementation
is provided, with hooks for fast-path MD code that can run the interrupt
threads over the top of other threads executing in the kernel.
- Split vnode::v_flag into three fields, depending on how the flag is
locked (by the interlock, by the vnode lock, by the file system).
- Miscellaneous locking fixes and improvements.
 1.198.2.4  08-Jun-2007  ad Sync with head.
 1.198.2.3  13-May-2007  ad - Pass the error number and residual count to biodone(), and let it handle
setting error indicators. Prepare to eliminate B_ERROR.
- Add a flag argument to brelse() to be set into the buf's flags, instead
of doing it directly. Typically used to set B_INVAL.
- Add a "struct cpu_info *" argument to kthread_create(), to be used to
create bound threads. Change "bool mpsafe" to "int flags".
- Allow exit of LWPs in the IDL state when (l != curlwp).
- More locking fixes & conversion to the new API.
 1.198.2.2  21-Mar-2007  ad - Replace more simple_locks, and fix up in a few places.
- Use condition variables.
- LOCK_ASSERT -> KASSERT.
 1.198.2.1  13-Mar-2007  ad Pull in the initial set of changes for the vmlocking branch.
 1.202.2.1  15-Aug-2007  skrll Sync with HEAD.
 1.203.6.2  29-Jul-2007  ad It's not a good idea for device drivers to modify b_flags, as they don't
need to understand the locking around that field. Instead of setting
B_ERROR, set b_error instead. b_error is 'owned' by whoever completes
the I/O request.
 1.203.6.1  29-Jul-2007  ad file lfs_segment.c was added on branch matt-mips64 on 2007-07-29 13:31:15 +0000
 1.203.4.2  26-Oct-2007  joerg Sync with HEAD.

Follow the merge of pmap.c on i386 and amd64 and move
pmap_init_tmp_pgtbl into arch/x86/x86/pmap.c. Modify the ACPI wakeup
code to restore CR4 before jumping back into kernel space as the large
page option might cover that.
 1.203.4.1  16-Aug-2007  jmcneill Sync with HEAD.
 1.204.4.1  14-Oct-2007  yamt sync with head.
 1.204.2.3  23-Mar-2008  matt sync with HEAD
 1.204.2.2  09-Jan-2008  matt sync with HEAD
 1.204.2.1  06-Nov-2007  matt sync with HEAD
 1.206.10.1  02-Jan-2008  bouyer Sync with HEAD
 1.206.6.4  19-Dec-2007  ad Use a global lfs_lock.
 1.206.6.3  19-Dec-2007  ad Fix some more problems w/lfs on this branch.
 1.206.6.2  19-Dec-2007  ad Get lfs mostly working.
 1.206.6.1  04-Dec-2007  ad Pull the vmlocking changes into a new branch.
 1.206.4.1  18-Feb-2008  mjf Sync with HEAD.
 1.209.6.3  05-Jun-2008  mjf Sync with HEAD.

Also fix build.
 1.209.6.2  02-Jun-2008  mjf Sync with HEAD.
 1.209.6.1  03-Apr-2008  mjf Sync with HEAD.
 1.210.4.5  11-Aug-2010  yamt sync with head.
 1.210.4.4  11-Mar-2010  yamt sync with head
 1.210.4.3  19-Aug-2009  yamt sync with head.
 1.210.4.2  04-May-2009  yamt sync with head.
 1.210.4.1  16-May-2008  yamt sync with head.
 1.210.2.2  04-Jun-2008  yamt sync with head
 1.210.2.1  18-May-2008  yamt sync with head.
 1.211.2.1  23-Jun-2008  wrstuden Sync w/ -current. 34 merge conflicts to follow.
 1.213.22.2  09-Nov-2015  snj Fix ticket #1974 fallout.
 1.213.22.1  07-Nov-2015  snj Pull up following revision(s) (requested by dholland in ticket #1974):
sys/ufs/lfs/lfs_segment.c: revision 1.247 via patch
Fix catastrophic bug in lfs_rewind() that changed segment numbers
(lfs_curseg/lfs_nextseg in the superblock) using the wrong units.
These fields are for whatever reason the start addresses of segments
(measured in frags) rather than the segment numbers 0..n.
This only apparently affects dumping from a mounted fs; however, it
trashes the fs.
I would really, really like to have a static analysis tool that can
keep track of the units things are measured in, since fs code is full
of conversion macros and the macros are named inscrutable things like
"sntod" whose letters don't necessarily even correspond to the units
they convert. It is surprising that more of these are not wrong.
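
The unit confusion described in the 1.247 message above (segment start addresses measured in frags versus segment numbers 0..n) is the kind of error that distinct wrapper types can turn into a compile-time failure. The following is a sketch under stated assumptions, not the LFS implementation: segnum_t, fragaddr_t, struct geom, seg_to_frag() and frag_to_seg() are hypothetical names, and the geometry is simplified to equal-sized segments starting at a fixed fragment address.

```c
#include <assert.h>
#include <stdint.h>

/* Distinct struct types: passing one where the other is expected will not compile. */
typedef struct { uint32_t n; } segnum_t;	/* segment index, 0..nseg-1 */
typedef struct { int64_t fsb; } fragaddr_t;	/* address in fragments */

/* Hypothetical geometry standing in for the relevant superblock fields. */
struct geom {
	int64_t start_fsb;	/* fragment address of segment 0 */
	uint32_t frags_per_seg;	/* segment size in fragments */
	uint32_t nseg;		/* number of segments */
};

/* Segment number -> fragment address of the start of that segment. */
static fragaddr_t
seg_to_frag(const struct geom *g, segnum_t s)
{
	fragaddr_t a;

	assert(s.n < g->nseg);
	a.fsb = g->start_fsb + (int64_t)s.n * g->frags_per_seg;
	return a;
}

/* Fragment address -> number of the segment that contains it. */
static segnum_t
frag_to_seg(const struct geom *g, fragaddr_t a)
{
	segnum_t s;

	assert(a.fsb >= g->start_fsb);
	s.n = (uint32_t)((a.fsb - g->start_fsb) / g->frags_per_seg);
	assert(s.n < g->nseg);
	return s;
}
```

With these types, storing a segnum_t into a field declared as fragaddr_t is rejected by the compiler, so every conversion has to go through one of the two named functions.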
 1.213.18.2  09-Nov-2015  snj Fix ticket #1974 fallout.
 1.213.18.1  07-Nov-2015  snj Pull up following revision(s) (requested by dholland in ticket #1974):
sys/ufs/lfs/lfs_segment.c: revision 1.247 via patch
Fix catastrophic bug in lfs_rewind() that changed segment numbers
(lfs_curseg/lfs_nextseg in the superblock) using the wrong units.
These fields are for whatever reason the start addresses of segments
(measured in frags) rather than the segment numbers 0..n.
This only apparently affects dumping from a mounted fs; however, it
trashes the fs.
I would really, really like to have a static analysis tool that can
keep track of the units things are measured in, since fs code is full
of conversion macros and the macros are named inscrutable things like
"sntod" whose letters don't necessarily even correspond to the units
they convert. It is surprising that more of these are not wrong.
 1.213.8.2  09-Nov-2015  sborrill Fix breakage from ticket #1974
 1.213.8.1  07-Nov-2015  snj Pull up following revision(s) (requested by dholland in ticket #1974):
sys/ufs/lfs/lfs_segment.c: revision 1.247 via patch
Fix catastrophic bug in lfs_rewind() that changed segment numbers
(lfs_curseg/lfs_nextseg in the superblock) using the wrong units.
These fields are for whatever reason the start addresses of segments
(measured in frags) rather than the segment numbers 0..n.
This only apparently affects dumping from a mounted fs; however, it
trashes the fs.
I would really, really like to have a static analysis tool that can
keep track of the units things are measured in, since fs code is full
of conversion macros and the macros are named inscrutable things like
"sntod" whose letters don't necessarily even correspond to the units
they convert. It is surprising that more of these are not wrong.
 1.214.2.2  17-Aug-2010  uebayasi Sync with HEAD.
 1.214.2.1  30-Apr-2010  uebayasi Sync with HEAD.
 1.215.2.4  21-Apr-2011  rmind sync with head
 1.215.2.3  05-Mar-2011  rmind sync with head
 1.215.2.2  03-Jul-2010  rmind sync with head
 1.215.2.1  16-Mar-2010  rmind Change struct uvm_object::vmobjlock to be dynamically allocated with
mutex_obj_alloc(). It allows us to share the locks among UVM objects.
 1.217.2.1  06-Jun-2011  jruoho Sync with HEAD.
 1.220.2.1  23-Jun-2011  cherry Catchup with rmind-uvmplock merge.
 1.222.6.1  18-Feb-2012  mrg merge to -current.
 1.222.2.4  22-May-2014  yamt sync with head.

for a reference, the tree before this commit was tagged
as yamt-pagecache-tag8.

this commit was split into small chunks to avoid
a limitation of cvs. ("Protocol error: too many arguments")
 1.222.2.3  23-Jan-2013  yamt sync with head
 1.222.2.2  17-Apr-2012  yamt sync with head
 1.222.2.1  02-Nov-2011  yamt page cache related changes

- maintain object pages in radix tree rather than rb tree.
- reduce unnecessary page scan in putpages. esp. when an object has a ton of
pages cached but only a few of them are dirty.
- reduce the number of pmap operations by tracking page dirtiness more
precisely in uvm layer.
- fix nfs commit range tracking.
- fix nfs write clustering. XXX hack
 1.223.2.2  15-Nov-2015  bouyer Pull up following revision(s) (requested by dholland in ticket #1319):
sys/ufs/lfs/lfs_segment.c: revision 1.247 via patch
Fix catastrophic bug in lfs_rewind() that changed segment numbers
(lfs_curseg/lfs_nextseg in the superblock) using the wrong units.
These fields are for whatever reason the start addresses of segments
(measured in frags) rather than the segment numbers 0..n.
This only apparently affects dumping from a mounted fs; however, it
trashes the fs.
I would really, really like to have a static analysis tool that can
keep track of the units things are measured in, since fs code is full
of conversion macros and the macros are named inscrutable things like
"sntod" whose letters don't necessarily even correspond to the units
they convert. It is surprising that more of these are not wrong.
 1.223.2.1  17-Mar-2012  bouyer Pull up following revision(s) (requested by perseant in ticket #116):
sys/ufs/lfs/lfs_alloc.c: revision 1.112
tests/fs/vfs/t_rmdirrace.c: revision 1.9
tests/fs/vfs/t_renamerace.c: revision 1.25
sys/ufs/lfs/lfs_vnops.c: revision 1.240
sys/ufs/lfs/lfs_segment.c: revision 1.224
sys/ufs/lfs/lfs_bio.c: revision 1.122
sys/ufs/lfs/lfs_vfsops.c: revision 1.294
sbin/newfs_lfs/make_lfs.c: revision 1.19
sys/ufs/lfs/lfs.h: revision 1.136
Pass t_renamerace and t_rmdirrace tests.
Adapt dholland@'s fix to ufs_rename to fix PR kern/43582. Address several
other MP locking issues discovered during the course of investigating the
same problem.
Removed extraneous vn_lock() calls on the Ifile, since the Ifile writes
are controlled by the segment lock.
Fix PR kern/45982 by deemphasizing the estimate of how much metadata
will fill the empty space on disk when the disk is nearly empty
(t_renamerace creates a lot of inode blocks on a tiny empty disk).
 1.224.2.4  03-Dec-2017  jdolecek update from HEAD
 1.224.2.3  20-Aug-2014  tls Rebase to HEAD as of a few days ago.
 1.224.2.2  23-Jun-2013  tls resync from head
 1.224.2.1  25-Feb-2013  tls resync with head
 1.230.2.2  18-May-2014  rmind sync with head
 1.230.2.1  28-Aug-2013  rmind sync with head
 1.236.6.5  28-Aug-2017  skrll Sync with HEAD
 1.236.6.4  27-Dec-2015  skrll Sync with HEAD (as of 26th Dec)
 1.236.6.3  22-Sep-2015  skrll Sync with HEAD
 1.236.6.2  06-Jun-2015  skrll Sync with HEAD
 1.236.6.1  06-Apr-2015  skrll Sync with HEAD
 1.236.4.1  04-Aug-2015  snj Pull up following revision(s) (requested by dholland in ticket #932):
sys/ufs/lfs/lfs_segment.c: revision 1.247 via patch
Fix catastrophic bug in lfs_rewind() that changed segment numbers
(lfs_curseg/lfs_nextseg in the superblock) using the wrong units.
These fields are for whatever reason the start addresses of segments
(measured in frags) rather than the segment numbers 0..n.
This only apparently affects dumping from a mounted fs; however, it
trashes the fs.
I would really, really like to have a static analysis tool that can
keep track of the units things are measured in, since fs code is full
of conversion macros and the macros are named inscrutable things like
"sntod" whose letters don't necessarily even correspond to the units
they convert. It is surprising that more of these are not wrong.
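
As an illustration of the unit confusion described above, here is a minimal sketch of how distinct wrapper types could make the compiler reject a segment number passed where a frag address is expected. Everything in it (segnum_t, fragaddr_t, struct geom, and both conversion functions) is hypothetical and is not taken from the LFS sources; the geometry fields only stand in for the corresponding superblock values.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical unit-tagged types; not part of the LFS sources. */
typedef struct { uint32_t v; } segnum_t;   /* segment index, 0..nseg-1 */
typedef struct { int64_t v; } fragaddr_t;  /* disk address measured in frags */

/* Hypothetical geometry standing in for superblock fields. */
struct geom {
	int64_t start_frag;     /* frag address where segment 0 begins */
	int64_t frags_per_seg;  /* segment size in frags */
	uint32_t nseg;          /* number of segments */
};

/* Segment number -> start address of that segment, in frags. */
static fragaddr_t
seg_to_fragaddr(const struct geom *g, segnum_t sn)
{
	assert(sn.v < g->nseg);
	return (fragaddr_t){ g->start_frag + (int64_t)sn.v * g->frags_per_seg };
}

/* Frag address -> number of the segment containing it. */
static segnum_t
fragaddr_to_seg(const struct geom *g, fragaddr_t fa)
{
	assert(fa.v >= g->start_frag);
	return (segnum_t){ (uint32_t)((fa.v - g->start_frag) / g->frags_per_seg) };
}
```

Because the two types are distinct structs, assigning a segnum_t to a field declared as fragaddr_t fails to compile, which is the class of mistake the commit message describes.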
 1.263.4.1  21-Apr-2017  bouyer Sync with HEAD
 1.263.2.2  26-Apr-2017  pgoyette Sync with HEAD
 1.263.2.1  20-Mar-2017  pgoyette Sync with HEAD
 1.269.6.1  30-Oct-2017  snj Pull up following revision(s) (requested by maya in ticket #330):
sbin/fsck_lfs/inode.c: 1.69
sbin/fsck_lfs/lfs.c: 1.73
sbin/fsck_lfs/pass6.c: 1.50
sbin/fsck_lfs/segwrite.c: 1.46
sys/ufs/lfs/lfs.h: 1.202-1.203
sys/ufs/lfs/lfs_accessors.h: 1.48
sys/ufs/lfs/lfs_alloc.c: 1.136-1.137
sys/ufs/lfs/lfs_balloc.c: 1.94
sys/ufs/lfs/lfs_bio.c: 1.141
sys/ufs/lfs/lfs_extern.h: 1.113
sys/ufs/lfs/lfs_inode.c: 1.156-1.157
sys/ufs/lfs/lfs_inode.h: 1.20, 1.21, 1.23
sys/ufs/lfs/lfs_itimes.c: 1.20
sys/ufs/lfs/lfs_pages.c: 1.13-1.15
sys/ufs/lfs/lfs_rename.c: 1.22
sys/ufs/lfs/lfs_segment.c: 1.270-1.275
sys/ufs/lfs/lfs_subr.c: 1.94-1.97
sys/ufs/lfs/lfs_syscalls.c: 1.175
sys/ufs/lfs/lfs_vfsops.c: 1.360
sys/ufs/lfs/lfs_vnops.c: 1.316-1.321
sys/ufs/lfs/ulfs_inode.c: 1.20
sys/ufs/lfs/ulfs_inode.h: 1.24
sys/ufs/lfs/ulfs_lookup.c: 1.41
sys/ufs/lfs/ulfs_quota2.c: 1.31
sys/ufs/lfs/ulfs_readwrite.c: 1.24
sys/ufs/lfs/ulfs_vnops.c: 1.49-1.50
Update inode member i_flag --> i_state to keep up with kernel changes
Move definition of IN_ALLMOD near the flag it's a mask for.
Now we can see that it doesn't match all the flags, but changing that will
require more careful thought.
Correct confusion between i_flag and i_flags
These will have to be renamed.
Spotted by Riastradh, thanks!
Add an XXX about the missing flags so it's not buried in a commit
message.
now the XXX count for LFS is 260
Rename i_flag to i_state.
The similarity to i_flags has previously caused errors.
Use continue to denote the no-op loop to match netbsd style
newline for extra clarity.
It isn't safe to drain dirops with seglock held, it'll deadlock if there
are any dirops. drain before grabbing seglock.
lfs_dirops == 0 is always true (as we already drained dirops), so omit
that part of the comparison.
Fixes a lot of LFS deadlocks. PR kern/52301
Many thanks to dholland for help analyzing coredumps
Ifdef out KDASSERT which fires on my machine.
Deduplicate sanity check that seglock is held on segunlock
Revert r1.272 fix to PR kern/52301, the performance hit is making things
unusable.
change lfs_nextsegsleep and lfs_allclean_wakeup to use condvar
XXX had to use lfs_lock in lfs_segwait, removed kernel_lock, is this
appropriate?
fix buffer overflow/KASSERT when cookies are supplied
lfs no longer uses the ffs-style struct direct, use the correct minimum
size
from dholland
XXX more wrong
Consistently use {,UN}MARK_VNODE macros rather than function calls.
Not much point doing anything after a panic call
Ask some question about the code in a XXX comment
XXX question our double-flushing of dirops
Fix typo in comment
 1.275.2.2  06-Sep-2018  pgoyette Sync with HEAD

Resolve a couple of conflicts (result of the uimin/uimax changes)
 1.275.2.1  25-Jun-2018  pgoyette Sync with HEAD
 1.277.2.2  08-Apr-2020  martin Merge changes from current as of 20200406
 1.277.2.1  10-Jun-2019  christos Sync with HEAD
 1.278.4.1  17-Aug-2020  martin Pull up following revision(s) (requested by riastradh in ticket #1050):

sys/ufs/lfs/lfs_subr.c: revision 1.101
sys/ufs/lfs/lfs_subr.c: revision 1.102
sys/ufs/lfs/lfs_inode.c: revision 1.158
sys/ufs/lfs/lfs_inode.h: revision 1.25
sys/ufs/lfs/lfs_balloc.c: revision 1.95
sys/ufs/lfs/lfs_pages.c: revision 1.21
sys/ufs/lfs/lfs_vnops.c: revision 1.330
sys/ufs/lfs/lfs_alloc.c: revision 1.140 (patch)
sys/ufs/lfs/lfs_alloc.c: revision 1.141 (patch)
lib/libp2k/p2k.c: revision 1.72
sys/ufs/lfs/lfs.h: revision 1.205
sys/ufs/lfs/lfs.h: revision 1.206
sys/ufs/lfs/lfs_segment.c: revision 1.284
sys/ufs/lfs/lfs.h: revision 1.207
sys/ufs/lfs/lfs_segment.c: revision 1.285
sys/ufs/lfs/lfs_debug.c: revision 1.55
sys/ufs/lfs/lfs_rename.c: revision 1.23
usr.sbin/dumplfs/dumplfs.c: revision 1.65
sys/ufs/lfs/lfs_vfsops.c: revision 1.371
sys/arch/i386/stand/efiboot/bootx64/Makefile: revision 1.3
sys/ufs/lfs/lfs_vfsops.c: revision 1.372
sys/ufs/lfs/lfs_vfsops.c: revision 1.373
sbin/fsck_lfs/pass1.c: revision 1.46
sys/ufs/lfs/lfs_vnops.c: revision 1.326
sys/ufs/lfs/lfs_vnops.c: revision 1.327
sys/ufs/lfs/lfs_vfsops.c: revision 1.375 (patch)
sys/ufs/lfs/lfs_vnops.c: revision 1.328
sys/ufs/lfs/lfs_subr.c: revision 1.98
sys/ufs/lfs/lfs_extern.h: revision 1.116
sys/ufs/lfs/lfs_vnops.c: revision 1.329
sys/ufs/lfs/lfs_subr.c: revision 1.99
sys/ufs/lfs/lfs_extern.h: revision 1.117
sys/ufs/lfs/lfs_accessors.h: revision 1.49
sys/ufs/lfs/lfs_extern.h: revision 1.118
sys/rump/fs/lib/liblfs/Makefile: revision 1.15
sys/ufs/lfs/lfs_bio.c: revision 1.146 (patch)
sys/ufs/lfs/lfs_bio.c: revision 1.147
sys/ufs/lfs/lfs_subr.c: revision 1.100

Fix kassert in lfs by initializing vp first.

Use a marker node to iterate lfs_dchainhd / i_lfs_dchain.

I believe elements can be removed while the lock is dropped,
including the next node we're hanging on to.
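
The marker-node idea can be sketched in user space with the <sys/queue.h> TAILQ macros. This is only an illustration under assumptions: struct node, walk_with_marker and drop_successor are hypothetical names, no real lock is taken, and the visit callback stands in for the window in which the lock is dropped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical list element; is_marker distinguishes placeholder nodes. */
struct node {
	TAILQ_ENTRY(node) link;
	bool is_marker;
	int value;
};
TAILQ_HEAD(nodelist, node);

/*
 * Visit every real node.  A marker is inserted after the current node
 * before visit() runs, so the walk resumes from the marker even if
 * visit() (standing in for "lock dropped") removes the current node
 * or the node that used to follow it.
 */
static int
walk_with_marker(struct nodelist *head,
    void (*visit)(struct nodelist *, struct node *))
{
	struct node marker = { .is_marker = true };
	struct node *n = TAILQ_FIRST(head);
	int visited = 0;

	while (n != NULL) {
		if (n->is_marker) {	/* skip another walker's marker */
			n = TAILQ_NEXT(n, link);
			continue;
		}
		TAILQ_INSERT_AFTER(head, n, &marker, link);
		visit(head, n);		/* list may change in here */
		visited++;
		n = TAILQ_NEXT(&marker, link);
		TAILQ_REMOVE(head, &marker, link);
	}
	return visited;
}

/*
 * Example callback: removes the real node after the marker, mimicking a
 * concurrent removal of the node a naive loop would have saved as "next".
 */
static void
drop_successor(struct nodelist *head, struct node *n)
{
	struct node *m = TAILQ_NEXT(n, link);	/* the marker */
	struct node *victim = (m != NULL) ? TAILQ_NEXT(m, link) : NULL;

	if (victim != NULL && !victim->is_marker)
		TAILQ_REMOVE(head, victim, link);
}
```

A loop that cached the next pointer before calling the callback would follow a node that is no longer on the list; the marker stays on the list for the whole visit, so its successor is always valid.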

Just use VOP_BWRITE for lfs_bwrite_log.
Hope this doesn't cause trouble with vfs_suspend.

Teach lfs to transition ro<->rw.

Prevent new dirops while we issue lfs_flush_dirops.

lfs_flush_dirops assumes (by KASSERT((ip->i_state & IN_ADIROP) == 0))
that vnodes on the dchain will not become involved in active dirops
even while holding no other locks (lfs_lock, v_interlock), so we must
set lfs_writer here. All other callers already set lfs_writer.

We set fs->lfs_writer++ without explicitly doing lfs_writer_enter
because
(a) we already waited for the dirops to drain, and
(b) we hold lfs_lock and cannot drop it before setting lfs_writer.

Assert lfs_writer where I think we can now prove it.

Serialize access to the splay tree with lfs_lock.

Change some cheap KDASSERT into KASSERT.

Take a reference and fix assertions in lfs_flush_dirops.
Fixes panic:
KASSERT((ip->i_state & IN_ADIROP) == 0) at lfs_vnops.c:1670
lfs_flush_dirops
lfs_check
lfs_setattr
VOP_SETATTR
change_mode
sys_fchmod
syscall

This assertion -- and the assertion that vp->v_uflag has VU_DIROP set
-- is valid only until we release lfs_lock, because we may race with
lfs_unmark_dirop which will remove the nodes and change the flags.

Further, vp itself is valid only as long as it is referenced, which it
is as long as it's on the dchain, but lfs_unmark_dirop drops the
dchain's reference.

Don't lfs_writer_enter while holding v_interlock.

There's no need to lfs_writer_enter at all here, as far as I can see.
lfs_flush_fs will do it for us.

Break deadlock in PR kern/52301.

The lock order is lfs_writer -> lfs_seglock. The problem in 52301 is
that lfs_segwrite violates this lock order by sometimes doing
lfs_seglock -> lfs_writer, either (a) when doing a checkpoint or (b),
opportunistically, when there are no dirops pending. Both cases can
deadlock, because dirops sometimes take the seglock (lfs_truncate,
lfs_valloc, lfs_vfree):
(a) There may be dirops pending, and they may be waiting for the
seglock, so we can't wait for them to complete while holding the
seglock.
(b) The test for fs->lfs_dirops == 0 happens unlocked, and the state
may change by the time lfs_writer_enter acquires lfs_lock.

To resolve this in each case:
(a) Do lfs_writer_enter before lfs_seglock, since we will need it
unconditionally anyway. The worst performance impact of this should
be that some dirops get delayed a little bit.
(b) Create a new lfs_writer_tryenter to use at this point so that the
test for fs->lfs_dirops == 0 and the acquisition of lfs_writer happen
atomically under lfs_lock.
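
Resolution (b) can be sketched in user space with a pthread mutex. This is a simplified illustration, not the kernel code: struct fs_state, writer_tryenter and writer_leave are hypothetical stand-ins, and the mutex only plays the role of lfs_lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for the relevant fields of the file system state. */
struct fs_state {
	pthread_mutex_t lock;	/* plays the role of lfs_lock */
	int dirops;		/* directory operations in progress */
	int writer;		/* writers that exclude new dirops */
};

/*
 * Non-blocking variant: the dirops == 0 test and the writer increment
 * happen under one lock hold, so no dirop can start in between.
 * Returns true if writer status was acquired.
 */
static bool
writer_tryenter(struct fs_state *fs)
{
	bool entered;

	pthread_mutex_lock(&fs->lock);
	entered = (fs->dirops == 0);
	if (entered)
		fs->writer++;
	pthread_mutex_unlock(&fs->lock);
	return entered;
}

/* Release writer status taken by a successful writer_tryenter(). */
static void
writer_leave(struct fs_state *fs)
{
	pthread_mutex_lock(&fs->lock);
	assert(fs->writer > 0);
	fs->writer--;
	pthread_mutex_unlock(&fs->lock);
}
```

A caller that already holds the segment lock can use the try variant and skip the optional work when it returns false, instead of sleeping for dirops that may themselves be waiting for the segment lock.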

Initialize/destroy lfs_allclean_wakeup in modcmd, not lfs_mountfs.

Fixes reloading lfs.kmod.

In lfs_update, hold lfs_writer around lfs_vflush.

Otherwise, we might do
lfs_vflush
-> lfs_seglock
-> lfs_segwait(SEGM_CKP)
-> lfs_writer_enter
which is the reverse of the lfs_writer -> lfs_seglock ordering.

Call lfs_orphan in lfs_rename while we're still in the dirop.
lfs_writer_enter can't fail; keep it simple and don't pretend it can.

Assert that mtsleep can't fail either -- it doesn't catch signals and
there's no timeout.

Teach LFS_ORPHAN_NEXTFREE about lfs64.

Dust off the orphan detection code and try to make it work.

Fix !DIAGNOSTIC compile

Fix userland references to LFS_ORPHAN_NEXTFREE.

Forgot to grep for these or do a full distribution build, oops!

Fix missing <sys/evcnt.h> by removing the evcnts instead.

Just wanted to confirm that a race might happen, and indeed it did.
These serve little diagnostic value otherwise.

OR into bp->b_cflags; don't overwrite.

CTASSERT lfs on-disk structure sizes.

Avoid misaligned access to lfs64 on-disk records in memory.
lfs64 directory entries are only 32-bit aligned in order to conserve
space in directory blocks, and we had a hack to stuff a 64-bit inode
in them. This replaces the hack by __aligned(4) __packed, and goes
further:

1. It's not clear that all the other lfs64 data structures are 64-bit
aligned on disk to begin with. We can go through these later and
upgrade them from
struct foo64 {
...
} __aligned(4) __packed;
union foo {
struct foo64 f64;
...
};
to
struct foo64 {
...
};
union foo {
struct foo64 f64 __aligned(8);
...
} __aligned(4) __packed;
if we really want to take advantage of 64-bit memory accesses.
However, the __aligned(4) __packed must remain on the union
because:
2. We access even the lfs32 data structures via a union that has
lfs64 members, and it turns out that compilers will assume access
through a union with 64-bit aligned members implies the whole
union has 64-bit alignment, even if we're only accessing a 32-bit
aligned member.
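
The effect of the attributes can be checked with a small sketch, assuming a GCC- or Clang-compatible compiler and C11. The record layouts here (struct rec64, struct rec32, union rec, struct block) are invented for illustration and do not match the real lfs64 on-disk structures; __attribute__((aligned(4), packed)) is spelled out in place of the kernel's __aligned(4) __packed macros.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 64-bit record that is only 32-bit aligned on disk. */
struct rec64 {
	uint64_t ino;
	uint32_t reclen;
} __attribute__((aligned(4), packed));

/* Hypothetical 32-bit counterpart. */
struct rec32 {
	uint32_t ino;
	uint32_t reclen;
};

/* Without the attributes the union would inherit 8-byte alignment. */
union rec {
	struct rec64 r64;
	struct rec32 r32;
} __attribute__((aligned(4), packed));

/* A container that places the union at a 4-byte (not 8-byte) offset. */
struct block {
	uint32_t hdr;
	union rec r;
};

_Static_assert(alignof(struct rec64) == 4, "rec64 is 32-bit aligned");
_Static_assert(sizeof(struct rec64) == 12, "no tail padding to 16");
_Static_assert(alignof(union rec) == 4, "union must not claim 8");
_Static_assert(offsetof(struct block, r) == 4, "record at offset 4");

/* The compiler may assume only 4-byte alignment for this load. */
static uint64_t
rec_ino64(const union rec *r)
{
	return r->r64.ino;
}
```

With the attributes in place the compiler has to emit a load that is safe at 4-byte alignment; removing them from the union lets it assume 8-byte alignment even for accesses to the 32-bit member, which is the problem item 2 describes.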

Fix clang build after packed lfs64 accessor change.

Suppress spurious address-of-packed error in rump lfs too.
 1.280.2.2  29-Feb-2020  ad Sync with head.
 1.280.2.1  17-Jan-2020  ad Sync with head.
