History log of /src/sys/netinet/tcp_var.h
Revision  Date  Author  Comments
 1.199  03-Dec-2024  andvar s/packlets/packets/ in comment.
 1.198  28-Oct-2022  ozaki-r inpcb: integrate data structures of PCB into one

Data structures of network protocol control blocks (PCBs), i.e.,
struct inpcb, in6pcb and inpcb_hdr, are not organized well. Users of
the data structures have to handle them separately and thus the code
is cluttered and duplicated.

The commit integrates the data structures into one, struct inpcb. As a
result, users of PCBs only have to handle just one data structure, so
the code becomes simple.

One drawback is that the data size of PCB for IPv4 increases by 40 bytes
(from 248 bytes to 288 bytes).
 1.197  20-Sep-2022  ozaki-r tcp: separate syn cache stuffs into tcp_syncache.[ch] files

No functional change.
 1.196  31-Jul-2021  andvar s/threshhold/threshold
 1.195  08-Mar-2021  christos branches: 1.195.4;
Remove the unused "addin" argument (it was always 0) and go back to using
a random iss by default (instead of rfc1948)
 1.194  03-Feb-2021  roy Sprinkle CTASSERT to enforce on-wire layout without __packed
 1.193  03-Feb-2021  roy Remove __packed from various network structures

They are already network aligned and adding the __packed attribute
just causes needless compiler warnings about accessing members of packed
objects.
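
The idea behind revisions 1.193 and 1.194 can be sketched as follows. This
is an illustration only: struct example_hdr is a hypothetical header, not a
NetBSD structure, and C11 _Static_assert stands in for the kernel's
CTASSERT. Because every member is naturally aligned, the layout matches the
wire format without __packed, and the assertions fail the build if that
ever stops being true.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical header for illustration only; not a NetBSD structure.
 * Every member sits on its natural alignment boundary, so the compiler
 * inserts no padding and a packed attribute is unnecessary.
 */
struct example_hdr {
	uint16_t eh_sport;	/* source port */
	uint16_t eh_dport;	/* destination port */
	uint32_t eh_seq;	/* sequence number */
	uint32_t eh_ack;	/* acknowledgement number */
};

/* Compile-time layout checks, standing in for the kernel's CTASSERT. */
_Static_assert(sizeof(struct example_hdr) == 12, "unexpected padding");
_Static_assert(offsetof(struct example_hdr, eh_seq) == 4, "eh_seq moved");
_Static_assert(offsetof(struct example_hdr, eh_ack) == 8, "eh_ack moved");

/* Returns the size that the layout checks above guarantee. */
size_t
example_hdr_size(void)
{
	return sizeof(struct example_hdr);
}
```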
 1.192  05-Mar-2020  riastradh branches: 1.192.4;
Revert "Include opt_diagnostic.h for DIAGNOSTIC."

This did not do what I thought it did. opt_diagnostic.h is only for
the unused _DIAGNOSTIC, which seems like an abortive attempt to
incrementally convert DIAGNOSTIC to an opt_*.h option rather than a
command-line option.
 1.191  05-Mar-2020  riastradh Include opt_diagnostic.h for DIAGNOSTIC.

...at least, in header files, which may not have already included
libkern.h.
 1.190  27-Dec-2018  maxv Remove unused arguments.
 1.189  14-Sep-2018  maxv Use non-variadic function pointer in protosw::pr_input.
 1.188  03-Sep-2018  riastradh Rename min/max -> uimin/uimax for better honesty.

These functions are defined on unsigned int. The generic name
min/max should not silently truncate to 32 bits on 64-bit systems.
This is purely a name change -- no functional change intended.

HOWEVER! Some subsystems have

#define min(a, b) ((a) < (b) ? (a) : (b))
#define max(a, b) ((a) > (b) ? (a) : (b))

even though our standard name for that is MIN/MAX. Although these
may invite multiple evaluation bugs, these do _not_ cause integer
truncation.

To avoid `fixing' these cases, I first changed the name in libkern,
and then compile-tested every file where min/max occurred in order to
confirm that it failed -- and thus confirm that nothing shadowed
min/max -- before changing it.

I have left a handful of bootloaders that are too annoying to
compile-test, and some dead code:

cobalt ews4800mips hp300 hppa ia64 luna68k vax
acorn32/if_ie.c (not included in any kernels)
macppc/if_gm.c (superseded by gem(4))

It should be easy to fix the fallout once identified -- this way of
doing things fails safe, and the goal here, after all, is to _avoid_
silent integer truncations, not introduce them.

Maybe one day we can reintroduce min/max as type-generic things that
never silently truncate. But we should avoid doing that for a while,
so that existing code has a chance to be detected by the compiler for
conversion to uimin/uimax without changing the semantics until we can
properly audit it all. (Who knows, maybe in some cases integer
truncation is actually intended!)
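
The truncation hazard described in this entry can be shown with a small
sketch. The names example_uimin, EXAMPLE_MIN and example_truncating_min are
hypothetical stand-ins, not the libkern definitions, and the sketch assumes
a platform where unsigned int is 32 bits wide.

```c
#include <stdint.h>

/*
 * Stand-in with the same shape as the helper described above: both
 * arguments are unsigned int, so wider values are converted (and
 * possibly truncated) before the comparison happens.
 */
unsigned int
example_uimin(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/* Type-preserving macro in the MIN style; evaluates arguments twice. */
#define EXAMPLE_MIN(a, b) ((a) < (b) ? (a) : (b))

/*
 * Takes the minimum of two 64-bit values through the 32-bit helper.
 * The casts make explicit the conversion that would otherwise happen
 * silently at the call site.
 */
uint64_t
example_truncating_min(uint64_t x, uint64_t y)
{
	return example_uimin((unsigned int)x, (unsigned int)y);
}
```

With x = 2^32 + 1 and y = 5, the helper compares the truncated value 1
against 5 and returns 1, while the macro compares the full 64-bit values
and yields 5.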
 1.187  22-Aug-2018  msaitoh - Cleanup for dynamic sysctl:
- Remove unused *_NAMES macros for sysctl.
- Remove unused *_MAXID for sysctls.
- Move CTL_MACHDEP sysctl definitions for m68k into m68k/include/cpu.h and
use them on all m68k machines.
 1.186  29-Apr-2018  maxv branches: 1.186.2;
Move struct tcpiphdr from tcpip.h to tcp_var.h, to match UDP (udpiphdr in
udp_var.h).

tcpip.h is now empty, and can be removed.
 1.185  28-Mar-2018  maxv Remove two unused args from syn_cache_get().
 1.184  12-Feb-2018  maxv branches: 1.184.2;
Remove unused argument from tcp_signature_getsav.
 1.183  12-Feb-2018  maxv Remove the 'm' argument from syn_cache_respond(); all it does with it is
free it, so free it in the caller instead.
 1.182  19-Jan-2018  ozaki-r Run tcp_slowtimo in workqueue if NET_MPSAFE

If NET_MPSAFE is enabled, we have to avoid taking softnet_lock in softint as
much as possible to prevent any softint handlers including callout handlers
such as tcp_slowtimo from sticking on softnet_lock because it results in
undesired delays of executing subsequent softint handlers.

NFCI for !NET_MPSAFE
 1.181  15-Nov-2017  ozaki-r Make syn_cache_timer static
 1.180  31-Jul-2017  maxv Fix TCPCTL_NAMES, and remove TCPCTL_VARIABLES.
 1.179  28-Jul-2017  maxv Remove TCP_COMPAT_42. This feature is a workaround for a bug in the TCP
stack of BSD4.2. Having such features just does not make any sense, and
looking at the code, I'm not sure it actually works.
 1.178  07-Jul-2017  ozaki-r Rename key_alloc* functions (NFC)

We shouldn't use the term "alloc" for functions that just look up
data and actually don't allocate memory.
 1.177  14-Feb-2015  he branches: 1.177.10;
Change the new counter variables in struct tcpcb to uint32_t, as
per christos' comments.
 1.176  14-Feb-2015  he Port over the TCP_INFO socket option from FreeBSD, originally from
the Linux 2.6 TCP API. This permits the caller to query certain information
about a TCP connection, and is used by pkgsrc's net/iperf3 test program
if available.

This extends struct tcpcb with three fields to count retransmits,
out-of-sequence receives and zero window announcements, and will
therefore warrant a kernel revision bump (done separately).
 1.175  31-Jul-2014  rtr branches: 1.175.2; 1.175.4;
split PRU_DISCONNECT, PRU_SHUTDOWN and PRU_ABORT function out of
pr_generic() usrreq switches and put into separate functions

xxx_disconnect(struct socket *)
xxx_shutdown(struct socket *)
xxx_abort(struct socket *)

- always KASSERT(solocked(so)) even if not implemented
- replace calls to pr_generic() with req =
PRU_{DISCONNECT,SHUTDOWN,ABORT}
with calls to pr_{disconnect,shutdown,abort}() respectively

rename existing internal functions used to implement above functionality
to permit use of the names for xxx_{disconnect,shutdown,abort}().

- {l2cap,sco,rfcomm}_disconnect() ->
{l2cap,sco,rfcomm}_disconnect_pcb()
- {unp,rip,tcp}_disconnect() -> {unp,rip,tcp}_disconnect1()
- unp_shutdown() -> unp_shutdown1()

patch reviewed by rmind
 1.174  19-May-2014  rmind - Split off PRU_ATTACH and PRU_DETACH logic into separate functions.
- Replace malloc with kmem and eliminate M_PCB while here.
- Sprinkle more asserts.
 1.173  18-May-2014  rmind Add struct pr_usrreqs with a pr_generic function and prepare for the
dismantling of pr_usrreq in the protocols; no functional change intended.
PRU_ATTACH/PRU_DETACH changes will follow soon.

Bump for struct protosw. Welcome to 6.99.62!
 1.172  02-Jan-2014  pooka branches: 1.172.2;
Allow kernels compiled with INET+INET6 to be booted as IPv4-only or IPv6-only.
 1.171  12-Nov-2013  kefren * implement TCP CUBIC congestion control algorithm
* move tcp_sack_newack bits inside reno and newreno_fast_retransmit_newack
* notify ECN peer about cwnd shrink in [new]reno_slow_retransmit

Based on the patch proposed on tech-net@ on Nov 7 with minor improvements:
* adapt wmax for no-fast convergence case
* correct cbrt calculation for big window sizes (>750KB)
 1.170  10-Apr-2013  christos branches: 1.170.4;
Limit the tcp initial window setting to 10, leaving it by default to 4
and simplifying the code in process. Per draft-ietf-initcwnd-08.txt.
 1.169  02-Feb-2012  tls branches: 1.169.6;
Entropy-pool implementation move and cleanup.

1) Move core entropy-pool code and source/sink/sample management code
to sys/kern from sys/dev.

2) Remove use of NRND as test for presence of entropy-pool code throughout
source tree.

3) Remove use of RND_ENABLED in device drivers as microoptimization to
avoid expensive operations on disabled entropy sources; make the
rnd_add calls do this directly so all callers benefit.

4) Fix bug in recent rnd_add_data()/rnd_add_uint32() changes that might
have led to slight entropy overestimation for some sources.

5) Add new source types for environmental sensors, power sensors, VM
system events, and skew between clocks, with a sample implementation
for each.

ok releng to go in before the branch due to the difficulty of later
pullup (widespread #ifdef removal and moved files). Tested with release
builds on amd64 and evbarm and live testing on amd64.
 1.168  31-Oct-2011  yamt branches: 1.168.2; 1.168.6;
tcp_reass_unlock: assertion
 1.167  25-May-2011  gdt Add comment urging a separation of TCP_RTT_SHIFT into separate defines
describing the EWMA calculation and the storage representation.
(No code change.)
 1.166  03-May-2011  dyoung Reduces the resources demanded by TCP sessions in TIME_WAIT-state using
methods called Vestigial Time-Wait (VTW) and Maximum Segment Lifetime
Truncation (MSLT).

MSLT and VTW were contributed by Coyote Point Systems, Inc.

Even after a TCP session enters the TIME_WAIT state, its corresponding
socket and protocol control blocks (PCBs) stick around until the TCP
Maximum Segment Lifetime (MSL) expires. On a host whose workload
necessarily creates and closes down many TCP sockets, the sockets & PCBs
for TCP sessions in TIME_WAIT state amount to many megabytes of dead
weight in RAM.

Maximum Segment Lifetime Truncation (MSLT) assigns each TCP session to
a class based on the nearness of the peer. Corresponding to each class
is an MSL, and a session uses the MSL of its class. The classes are
loopback (local host equals remote host), local (local host and remote
host are on the same link/subnet), and remote (local host and remote
host communicate via one or more gateways). Classes corresponding to
nearer peers have lower MSLs by default: 2 seconds for loopback, 10
seconds for local, 60 seconds for remote. Loopback and local sessions
expire more quickly when MSLT is used.
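
The class selection described above might be sketched like this. All
identifiers are hypothetical and chosen for illustration; only the three
classes and their default MSL values (2, 10 and 60 seconds) come from the
log entry.

```c
#include <stdbool.h>

/* Hypothetical names; the real kernel identifiers may differ. */
enum example_msl_class {
	EXAMPLE_MSL_LOOPBACK,	/* local host equals remote host */
	EXAMPLE_MSL_LOCAL,	/* peers share a link/subnet */
	EXAMPLE_MSL_REMOTE	/* peers communicate via gateways */
};

/* Classify a session by the nearness of its peer. */
enum example_msl_class
example_msl_classify(bool same_host, bool same_link)
{
	if (same_host)
		return EXAMPLE_MSL_LOOPBACK;
	if (same_link)
		return EXAMPLE_MSL_LOCAL;
	return EXAMPLE_MSL_REMOTE;
}

/* Default MSL in seconds for each class, as quoted in the log entry. */
int
example_msl_seconds(enum example_msl_class c)
{
	switch (c) {
	case EXAMPLE_MSL_LOOPBACK:
		return 2;
	case EXAMPLE_MSL_LOCAL:
		return 10;
	default:
		return 60;
	}
}
```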

Vestigial Time-Wait (VTW) replaces a TIME_WAIT session's PCB/socket
dead weight with a compact representation of the session, called a
"vestigial PCB". VTW data structures are designed to be very fast and
memory-efficient: for fast insertion and lookup of vestigial PCBs,
the PCBs are stored in a hash table that is designed to minimize the
number of cacheline visits per lookup/insertion. The memory both
for vestigial PCBs and for elements of the PCB hashtable comes from
fixed-size pools, and linked data structures exploit this to conserve
memory by representing references with a narrow index/offset from the
start of a pool instead of a pointer. When space for new vestigial PCBs
runs out, VTW makes room by discarding old vestigial PCBs, oldest first.
VTW cooperates with MSLT.

It may help to think of VTW as a "FIN cache" by analogy to the SYN
cache.

A 2.8-GHz Pentium 4 running a test workload that creates TIME_WAIT
sessions as fast as it can is approximately 17% idle when VTW is active
versus 0% idle when VTW is inactive. It has 103 megabytes more free RAM
when VTW is active (approximately 64k vestigial PCBs are created) than
when it is inactive.
 1.165  03-May-2011  dyoung *_drain() routines may be called with locks held, so instead of doing
any work in *_drain(), set a drain-needed flag. Do the work in the
fasttimo handler.

Contributed by Coyote Point Systems, Inc.
 1.164  20-Apr-2011  gdt Rewrite comments about TCP RTO calculations.

Long ago, the storage representations of srtt and rttvar were changed
from the 4.4BSD scheme, and the comments are out of sync with the
code. This commit rewrites most of the comments that explain the RTO
calculations, and points out some issues in the code.

Joint work with Bev Schwartz of BBN (original analysis and comments),
but I have rewritten and extended them, so errors are mine.

This material is based upon work supported by the Defense Advanced
Research Projects Agency and Space and Naval Warfare Systems Center,
Pacific, under Contract No. N66001-09-C-2073. Approved for Public
Release, Distribution Unlimited
 1.163  14-Apr-2011  yamt comments
 1.162  16-Sep-2009  pooka branches: 1.162.4; 1.162.6;
Replace a large number of link set based sysctl node creations with
calls from subsystem constructors. Benefits both future kernel
modules and rump.

no change to sysctl nodes on i386/MONOLITHIC & build tested i386/ALL
 1.161  09-Sep-2009  darran Make tcp msl (max segment life) tunable via sysctl net.inet.tcp.msl.
Okayed by tls@.
 1.160  27-May-2009  pooka POOL_INIT -> pool_init
 1.159  29-Jan-2009  pooka branches: 1.159.2;
stinkset purge: POOL_INIT -> pool_init
also, make the syncache pool static in scope
 1.158  06-Aug-2008  plunky branches: 1.158.2; 1.158.4; 1.158.10;
Convert socket options code to use a sockopt structure
instead of laying everything into an mbuf.

approved by core
 1.157  28-Apr-2008  martin branches: 1.157.2; 1.157.6;
Remove clause 3 and 4 from TNF licenses
 1.156  24-Apr-2008  ad branches: 1.156.2;
Merge the socket locking patch:

- Socket layer becomes MP safe.
- Unix protocols become MP safe.
- Allows protocol processing interrupts to safely block on locks.
- Fixes a number of race conditions.

With much feedback from matt@ and plunky@.
 1.155  12-Apr-2008  thorpej branches: 1.155.2;
Make IP, TCP, UDP, and ICMP statistics per-CPU. The stats are collated
when the user requests them via sysctl.
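
A minimal sketch of the per-CPU scheme, assuming a fixed CPU count and two
hypothetical counters (the real statistic indices and percpu machinery are
not shown here): each CPU increments its own row, and the rows are summed
only when a reader asks for the totals.

```c
#include <stdint.h>

#define EXAMPLE_NCPU	4	/* hypothetical CPU count */

/* Hypothetical counter indices, for illustration only. */
enum { EXAMPLE_STAT_SENT, EXAMPLE_STAT_RCVD, EXAMPLE_NSTATS };

/* One private row of counters per CPU, so updates share no cache line. */
static uint64_t example_percpu[EXAMPLE_NCPU][EXAMPLE_NSTATS];

/* Fast path: bump the counter belonging to the current CPU. */
void
example_stat_inc(int cpu, int stat)
{
	example_percpu[cpu][stat]++;
}

/* Slow path: sum every CPU's row into one array for the reader. */
void
example_stat_collate(uint64_t out[EXAMPLE_NSTATS])
{
	for (int s = 0; s < EXAMPLE_NSTATS; s++) {
		out[s] = 0;
		for (int c = 0; c < EXAMPLE_NCPU; c++)
			out[s] += example_percpu[c][s];
	}
}
```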
 1.154  08-Apr-2008  thorpej Change TCP stats from a structure to an array of uint64_t's.

Note: This is ABI-compatible with the old tcpstat structure; old netstat
binaries will continue to work properly.
 1.153  29-Feb-2008  matt Rework tcp congctl selection code so that the congctl entries can be const.
Don't access tcp_congctl stuff outside of tcp_congctl.c, use routines to
update t_congctl. This code is now slightly more complicated.
 1.152  27-Feb-2008  matt Convert stragglers to ansi definitions from old-style definitions.
Remember that func() is not ansi, func(void) is.
 1.151  25-Dec-2007  perry branches: 1.151.2; 1.151.6;
Convert many of the uses of __attribute__ to equivalent
__packed, __unused and __dead macros from cdefs.h
 1.150  02-Aug-2007  rmind branches: 1.150.4; 1.150.10; 1.150.12; 1.150.16; 1.150.20;
TCP socket buffers automatic sizing - ported from FreeBSD.
http://mail-index.netbsd.org/tech-net/2007/02/04/0006.html

! Disabled by default, marked as experimental. Testers are badly needed.
! Someone should thoroughly test this, and improve if possible.

Discussed on <tech-net>:
http://mail-index.netbsd.org/tech-net/2007/07/12/0002.html
Thanks Greg Troxel for comments.

OK by the long silence on <tech-net>.
 1.149  09-Jul-2007  ad branches: 1.149.2;
Merge some of the less invasive changes from the vmlocking branch:

- kthread, callout, devsw API changes
- select()/poll() improvements
- miscellaneous MT safety improvements
 1.148  25-Jun-2007  christos tcpdrop kernel bits (from anon ymous)
 1.147  20-Jun-2007  christos - per socket keepalive settings
- settable connection establishment timeout
 1.146  02-May-2007  dyoung Eliminate address family-specific route caches (struct route, struct
route_in6, struct route_iso), replacing all caches with a struct
route.

The principal benefit of this change is that all of the protocol
families can benefit from route cache-invalidation, which is
necessary for correct routing. Route-cache invalidation fixes an
ancient PR, kern/3508, at long last; it fixes various other PRs,
also.

Discussions with and ideas from Joerg Sonnenberger influenced this
work tremendously. Of course, all design oversights and bugs are
mine.

DETAILS

1 I added to each address family a pool of sockaddrs. I have
introduced routines for allocating, copying, and duplicating,
and freeing sockaddrs:

struct sockaddr *sockaddr_alloc(sa_family_t af, int flags);
struct sockaddr *sockaddr_copy(struct sockaddr *dst,
const struct sockaddr *src);
struct sockaddr *sockaddr_dup(const struct sockaddr *src, int flags);
void sockaddr_free(struct sockaddr *sa);

sockaddr_alloc() returns either a sockaddr from the pool belonging
to the specified family, or NULL if the pool is exhausted. The
returned sockaddr has the right size for that family; sa_family
and sa_len fields are initialized to the family and sockaddr
length---e.g., sa_family = AF_INET and sa_len = sizeof(struct
sockaddr_in). sockaddr_free() puts the given sockaddr back into
its family's pool.

sockaddr_dup() and sockaddr_copy() work analogously to strdup()
and strcpy(), respectively. sockaddr_copy() KASSERTs that the
family of the destination and source sockaddrs are alike.

The 'flags' argument for sockaddr_alloc() and sockaddr_dup() is
passed directly to pool_get(9).

2 I added routines for initializing sockaddrs in each address
family, sockaddr_in_init(), sockaddr_in6_init(), sockaddr_iso_init(),
etc. They are fairly self-explanatory.

3 structs route_in6 and route_iso are no more. All protocol families
use struct route. I have changed the route cache, 'struct route',
so that it does not contain storage space for a sockaddr. Instead,
struct route points to a sockaddr coming from the pool the sockaddr
belongs to. I added a new method to struct route, rtcache_setdst(),
for setting the cache destination:

int rtcache_setdst(struct route *, const struct sockaddr *);

rtcache_setdst() returns 0 on success, or ENOMEM if no memory is
available to create the sockaddr storage.

It is now possible for rtcache_getdst() to return NULL if, say,
rtcache_setdst() failed. I check the return value for NULL
everywhere in the kernel.

4 Each routing domain (struct domain) has a list of live route
caches, dom_rtcache. rtflushall(sa_family_t af) looks up the
domain indicated by 'af', walks the domain's list of route caches
and invalidates each one.
 1.145  04-Mar-2007  christos branches: 1.145.2; 1.145.4;
Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.
 1.144  17-Feb-2007  dyoung KNF: de-__P, bzero -> memset, bcmp -> memcmp. Remove extraneous
parentheses in return statements.

Cosmetic: don't open-code TAILQ_FOREACH().

Cosmetic: change types of variables to avoid oodles of casts: in
in6_src.c, avoid casts by changing several route_in6 pointers
to struct route pointers. Remove unnecessary casts to caddr_t
elsewhere.

Pave the way for eliminating address family-specific route caches:
soon, struct route will not embed a sockaddr, but it will hold
a reference to an external sockaddr, instead. We will set the
destination sockaddr using rtcache_setdst(). (I created a stub
for it, but it isn't used anywhere, yet.) rtcache_free() will
free the sockaddr. I have extracted from rtcache_free() a helper
subroutine, rtcache_clear(). rtcache_clear() will "forget" a
cached route, but it will not forget the destination by releasing
the sockaddr. I use rtcache_clear() instead of rtcache_free()
in rtcache_update(), because rtcache_update() is not supposed
to forget the destination.

Constify:

1 Introduce const accessor for route->ro_dst, rtcache_getdst().

2 Constify the 'dst' argument to ifnet->if_output(). This
led me to constify a lot of code called by output routines.

3 Constify the sockaddr argument to protosw->pr_ctlinput. This
led me to constify a lot of code called by ctlinput routines.

4 Introduce const macros for converting from a generic sockaddr
to family-specific sockaddrs, e.g., sockaddr_in: satocsin6,
satocsin, et cetera.
 1.143  06-Dec-2006  yamt branches: 1.143.2;
add some more tcp mowners.
 1.142  06-Dec-2006  yamt - make tcp_reass static.
- constify.
 1.141  21-Oct-2006  yamt branches: 1.141.2; 1.141.4;
- constify.
- make tcp_dooptions and tcpipqent_pool static.
 1.140  19-Oct-2006  yamt implement RFC3465 appropriate byte counting.
from Kentaro A. Kurahone, with minor adjustments by me.
the ack prediction part of the original patch was omitted because
it's a separate change. reviewed by Rui Paulo.
 1.139  16-Oct-2006  rpaulo Export the tcp_do_rfc1948 variable to userland via sysctl.
The code to generate an ISS via an MD5 hash has been present in the
NetBSD kernel since 2001, but it wasn't even exported to userland at
that time. It was agreed on tech-net with the original author <thorpej>
that we should let the user decide if he wants to enable it or not.
Not enabled by default.
 1.138  09-Oct-2006  rpaulo Modular (I tried ;-) TCP congestion control API. Whenever certain conditions
happen in the TCP stack, this interface calls the specified callback to
handle the situation according to the currently selected congestion
control algorithm.
A new sysctl node was created: net.inet.tcp.congctl.{available,selected}
with obvious meanings.
The old net.inet.tcp.newreno MIB was removed.
The API is discussed in tcp_congctl(9).

In the near future, it will be possible to select a congestion control
algorithm on a per-socket basis.

Discussed on tech-net and reviewed by <yamt>.
 1.137  05-Sep-2006  rpaulo branches: 1.137.2; 1.137.4;
Import of TCP ECN algorithm for congestion control.
Both available for IPv4 and IPv6.
Basic implementation test results are available at
http://netbsd-soc.sourceforge.net/projects/ecn/testresults.html.

Work sponsored by the Google Summer of Code project 2006.
Special thanks to Kentaro Kurahone, Allen Briggs and Matt Thomas for their
help, comments and support during the project.
 1.136  22-Jul-2006  rpaulo revert stuff that shouldn't have gone in.
 1.135  22-Jul-2006  rpaulo TCP RFC is 793, not 783.
 1.134  16-Feb-2006  perry branches: 1.134.2;
Change "inline" back to "__inline" in .h files -- C99 is still too
new, and some apps compile things in C89 mode. C89 keywords stay.

As per core@.
 1.133  24-Dec-2005  perry branches: 1.133.2; 1.133.4; 1.133.6;
Remove leading __ from __(const|inline|signed|volatile) -- it is obsolete.
 1.132  11-Dec-2005  christos merge ktrace-lwp.
 1.131  10-Dec-2005  elad Multiple inclusion protection, as suggested by christos@ on tech-kern@
a few days ago.
 1.130  06-Sep-2005  rpaulo Implement tcp.inet{,6}.tcp{,6}.(debug|debx) when TCP_DEBUG is set. They
can be used to ``transliterate protocol trace'' like trpt(8) does.
 1.129  10-Aug-2005  yamt move {tcp,udp}_do_loopback_cksum back to tcp/udp
so that they can be referenced by ipv6.
 1.128  05-Aug-2005  elad Add sysctls for IP, ICMP, TCP, and UDP statistics.
 1.127  19-Jul-2005  christos Implement PMTU checks from:

http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html

1. Don't act on ICMP-need-frag immediately if adhoc checks on the
advertised MTU fail. The MTU update is delayed until a TCP retransmit
happens.
2. Ignore ICMP Source Quench messages meant for TCP connections.

From OpenBSD.
 1.126  29-May-2005  christos branches: 1.126.2;
- add const
- remove bogus casts
- avoid nested variables
 1.125  05-Apr-2005  kurahone Added sysctl tunable limits for the number of maximum SACK holes
per connection and per system.

Idea taken from FreeBSD.
 1.124  29-Mar-2005  yamt protect tcpipqent with splvm.
 1.123  16-Mar-2005  yamt branches: 1.123.2;
simplify data receiver side sack processing.
- introduce t_segqlen, the number of segments in segq/timeq.
the name is from freebsd.
- rather than maintaining a copy of sack blocks (rcv_sack_block[]),
build it directly from the segment list when needed.
 1.122  16-Mar-2005  yamt - use full sized segments unless we actually have SACKs to send.
- avoid TSO duplicate D-SACK.
- send SACKs regardless of TF_ACKNOW.
- don't clear rcv_sack_num when transmitting.

discussed on tech-net@.
 1.121  09-Mar-2005  atatat gc the tcp_sysctl() prototype since it's completely vestigial
 1.120  02-Mar-2005  mycroft Copyright maintenance.
 1.119  28-Feb-2005  jonathan Commit TCP SACK patches from Kentaro A. Kurahone's patch at:
http://www.sigusr1.org/~kurahone/tcp-sack-netbsd-02152005.diff.gz

Fixes in that patch for pre-existing TCP pcb initializations were already
committed to NetBSD-current, so are not included in this commit.

The SACK patch has been observed to correctly negotiate and respond
to SACKs in wide-area traffic.

There are two independently observed, as-yet-unresolved anomalies:
first, unexplained delays in fast retransmission
(potentially explainable by a 0.2sec RTT between adjacent
ethernet/wifi NICs); and second, peculiar and unexplained TCP
retransmits observed over an ath0 card.

After discussion with several interested developers, I'm committing
this now, as-is, for more eyes to use and look over. Current hypothesis
is that the anomalies above may in fact be due to link/level (hardware,
driver, HAL, firmware) aberrations in the test setup, affecting both
Kentaro's wired-Ethernet NIC and in my two (different) WiFi NICs.
 1.118  06-Feb-2005  pk Update tcp_trace() prototype to match implementation.
 1.117  27-Jan-2005  mycroft Introduce a new state variable, t_partialacks. It has 3 states:
* t_partialacks<0 means we are not in fast recovery.
* t_partialacks==0 means we are in fast recovery, but we have not received
any partial acks yet.
* t_partialacks>0 means we are in fast recovery, and we have received
partial acks.

This is used to implement 2 changes in RFC 3782:
* We keep the notion that we are in fast recovery separate from t_dupacks, so
it is not reset due to out-of-order acks. (This affects both the Reno and
NewReno cases.)
* We only reset the retransmit timer on the first partial ack -- preventing us
from possibly taking one RTO per segment once fast recovery is initiated.

As before, it is hard to measure any difference between Reno and NewReno in the
real-world cases that I've tested.
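
The three states of t_partialacks, and the rule that only the first partial
ack resets the retransmit timer, can be sketched as below. The function and
enum names are hypothetical, and incrementing the counter on each partial
ack is an assumption made for the sketch rather than something stated in
the log entry.

```c
#include <stdbool.h>

/* Hypothetical labels for the three states described above. */
enum example_recovery {
	EXAMPLE_NOT_IN_RECOVERY,	/* t_partialacks < 0 */
	EXAMPLE_RECOVERY_NO_PARTIAL,	/* t_partialacks == 0 */
	EXAMPLE_RECOVERY_PARTIAL	/* t_partialacks > 0 */
};

/* Map the counter value onto the state it encodes. */
enum example_recovery
example_recovery_state(int t_partialacks)
{
	if (t_partialacks < 0)
		return EXAMPLE_NOT_IN_RECOVERY;
	if (t_partialacks == 0)
		return EXAMPLE_RECOVERY_NO_PARTIAL;
	return EXAMPLE_RECOVERY_PARTIAL;
}

/*
 * Record a partial ack and report whether the retransmit timer should
 * be reset. Only the first partial ack in a recovery episode does so.
 * Assumption: the counter is incremented once per partial ack.
 */
bool
example_partial_ack(int *t_partialacks)
{
	bool first;

	if (*t_partialacks < 0)
		return false;	/* not in fast recovery; nothing to do */
	first = (*t_partialacks == 0);
	(*t_partialacks)++;
	return first;
}
```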
 1.116  26-Jan-2005  mycroft Fix two problems in our TCP stack:

1) If an echoed RFC 1323 time stamp appears to be later than the current time,
ignore it and fall back to old-style RTT calculation. This prevents ending
up with a negative RTT and panicking later.

2) Fix NewReno. This involves a few changes:

a) Implement the send_high variable in RFC 2582. Our implementation is
subtly different; it is one *past* the last sequence number transmitted
rather than being equal to it. This simplifies some logic and makes
the code smaller. Additional logic was required to prevent sequence
number wraparound problems; this is not mentioned in RFC 2582.

b) Make sure we reset t_dupacks on new acks, but *not* on a partial ack.
All of the new ack code is pushed out into tcp_newreno(). (Later this
will probably be a pluggable function.) Thus t_dupacks keeps track of
whether we're in fast recovery all the time, with Reno or NewReno, which
keeps some logic simpler.

c) We do not need to update snd_recover when we're not in fast recovery.
See tech-net for an explanation of this.

d) In the gratuitous fast retransmit prevention case, do not send a packet.
RFC 2582 specifically says that we should "do nothing".

e) Do not inflate the congestion window on a partial ack. (This is done by
testing t_dupacks to see whether we're still in fast recovery.)

This brings the performance of NewReno back up to the same as Reno in a few
random test cases (e.g. transferring peer-to-peer over my wireless network).
I have not concocted a good test case for the behavior specific to NewReno.
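
Item 1 above (ignoring an echoed time stamp that appears to lie in the
future) might look roughly like the following. This is a sketch, not the
kernel code: example_ts_rtt is a hypothetical name, time is measured in
abstract ticks, and the cast to int32_t assumes the usual two's-complement
conversion so that the subtraction stays correct across counter wraparound.

```c
#include <stdint.h>

/*
 * Compute an RTT sample from an echoed time stamp.
 * Returns the elapsed ticks, or -1 to tell the caller to fall back to
 * the old-style RTT calculation because the echo is "later than now".
 */
int32_t
example_ts_rtt(uint32_t now, uint32_t ts_ecr)
{
	/* Modular subtraction handles wraparound of the tick counter. */
	int32_t delta = (int32_t)(now - ts_ecr);

	if (delta < 0)
		return -1;	/* echo appears to be from the future */
	return delta;
}
```

Rejecting the sample up front means a negative RTT never reaches the
smoothing code.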
 1.115  21-Dec-2004  yamt branches: 1.115.2; 1.115.4;
factor out receive side tcp/udp checksum handling code so that they
can be used by eg. packet filters.

reviewed by Christos Zoulas on tech-net@.
(slightly tweaked since then to make tcp and udp similar.)
 1.114  15-Dec-2004  thorpej Don't perform checksums on loopback interfaces. They can be reenabled with
the net.inet.*.do_loopback_cksum sysctl.

Approved by: groo
 1.113  15-Sep-2004  yamt fix ipqent pool corruption problems. make tcp reass code use
its own pool of ipqent rather than sharing it with ip reass code.
PR/24782.
 1.112  18-May-2004  itojun fix MD5 signature support to actually validate inbound signature, and
drop packet if fails.
 1.111  26-Apr-2004  itojun make TCP MD5 signature work with KAME IPSEC (#define IPSEC).

support IPv6 if KAME IPSEC (RFC is not explicit about how we make data stream
for checksum with IPv6, but i'm pretty sure using normal pseudo-header is the
right thing).

XXX
current TCP MD5 signature code has giant flaw:
it does not validate signature on input (can't believe it! what is the point?)
 1.110  25-Apr-2004  jonathan Initial commit of a port of the FreeBSD implementation of RFC 2385
(MD5 signatures for TCP, as used with BGP). Credit for original
FreeBSD code goes to Bruce M. Simpson, with FreeBSD sponsorship
credited to sentex.net. Shortening of the setsockopt() name
attributed to Vincent Jardin.

This commit is a minimal, working version of the FreeBSD code, as
MFC'ed to FreeBSD-4. It has received minimal testing with a ttcp
modified to set the TCP-MD5 option; BMS's additions to tcpdump-current
(tcpdump -M) confirm that the MD5 signatures are correct. Committed
as-is for further testing between a NetBSD BGP speaker (e.g., quagga)
and industry-standard BGP speakers (e.g., Cisco, Juniper).


NOTE: This version has two potential flaws. First, I do not see any code
that verifies received TCP-MD5 signatures. Second, the TCP-MD5
options are internally padded and assumed to be 32-bit aligned. A more
space-efficient scheme is to pack all TCP options densely (and
possibly unaligned) into the TCP header ; then do one final padding to
a 4-byte boundary. Pre-existing comments note that accounting for
TCP-option space when we add SACK is yet to be done. For now, I'm
punting on that; we can solve it properly, in a way that will handle
SACK blocks, as a separate exercise.

In case a pullup to NetBSD-2 is requested, this adds sys/netipsec/xform_tcp.c
and modifies:

sys/net/pfkeyv2.h,v 1.15
sys/netinet/files.netinet,v 1.5
sys/netinet/ip.h,v 1.25
sys/netinet/tcp.h,v 1.15
sys/netinet/tcp_input.c,v 1.200
sys/netinet/tcp_output.c,v 1.109
sys/netinet/tcp_subr.c,v 1.165
sys/netinet/tcp_usrreq.c,v 1.89
sys/netinet/tcp_var.h,v 1.109
sys/netipsec/files.netipsec,v 1.3
sys/netipsec/ipsec.c,v 1.11
sys/netipsec/ipsec.h,v 1.7
sys/netipsec/key.c,v 1.11
share/man/man4/tcp.4,v 1.16
lib/libipsec/pfkey.c,v 1.20
lib/libipsec/pfkey_dump.c,v 1.17
lib/libipsec/policy_token.l,v 1.8
sbin/setkey/parse.y,v 1.14
sbin/setkey/setkey.8,v 1.27
sbin/setkey/token.l,v 1.15

Note that the preceding two revisions to tcp.4 will be
required to cleanly apply this diff.
 1.109  21-Apr-2004  itojun no space between function name and paren: foo (blah) -> foo(blah)
 1.108  20-Apr-2004  itojun - respond to RST by ACK, as suggested in NISCC recommendation
- rate-limit ACKs against RSTs and SYNs
 1.107  18-Apr-2004  matt De __P()
 1.106  22-Oct-2003  thorpej branches: 1.106.2;
Rather than zeroing a tcpcb structure and filling in all the fields
individually, create a tcpcb template pre-initialized (and pre-zero'd)
with the static and mostly-static tcpcb parameters. The template is
now copied into the new tcpcb, which zeros and initializes most of the
tcpcb in one pass. The template is kept up-to-date as TCP sysctl
variables are changed.

Combined with the previous sb_max change, TCP socket creation is now
25% faster.
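
The template approach can be sketched in a few lines. struct example_cb and
its fields are hypothetical stand-ins for the tcpcb, and the default values
are arbitrary: one structure assignment both zeros and initializes the new
control block, and a setter keeps the template in step with a tunable.

```c
#include <stddef.h>

/* Hypothetical, much-reduced control block for illustration. */
struct example_cb {
	int ec_rxtcur;		/* mostly-static default */
	int ec_maxseg;		/* default that tracks a tunable */
	int ec_state;		/* starts at zero */
	void *ec_owner;		/* per-connection, set after the copy */
};

/* Pre-initialized template; members not named here are zero. */
static struct example_cb example_template = {
	.ec_rxtcur = 3,		/* arbitrary illustrative value */
	.ec_maxseg = 536	/* arbitrary illustrative value */
};

/* Called when the tunable changes, keeping the template up to date. */
void
example_template_set_maxseg(int maxseg)
{
	example_template.ec_maxseg = maxseg;
}

/* One struct copy zeros and initializes most fields in a single pass. */
void
example_cb_init(struct example_cb *cb, void *owner)
{
	*cb = example_template;
	cb->ec_owner = owner;
}
```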
 1.105  04-Sep-2003  itojun revamp inpcb/in6pcb so that they are more aligned with each other.
in6pcb lookup now uses hash(9).
 1.104  07-Aug-2003  agc Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22364, verified by myself.
 1.103  20-Jul-2003  he As a temporary workaround, apply the fix from PR#20390, thereby
cooperating with the callout code in working around the race
condition caused by the TCP code's use of the callout facility.

Instead of unconditionally releasing memory in tcp_close() and
SYN_CACHE_PUT(), check whether any of the related callout handlers
are about to be invoked (but have not yet done callout_ack()), and
if so, just mark the associated data structure (tcpcb or syn cache
entry) as "dead", and test for this (and release storage) in the
callout handler functions.
 1.102  29-Jun-2003  fvdl branches: 1.102.2;
Back out the lwp/ktrace changes. They contained a lot of collateral damage,
and need to be examined and discussed more.
 1.101  29-Jun-2003  ragge Add code to remember where in the send queue of mbufs the last packet was
sent from. This change avoids a linear search through all mbufs when using
large TCP windows, and therefore permits high-speed connections on long
distances.

Tested on a 1 Gigabit connection between Luleå and San Francisco, a distance
of about 15000km. With TCP windows of just over 20 Mbytes it could keep up
with 950Mbit/s.

After discussions with Matt Thomas and Jason Thorpe.
 1.100  28-Jun-2003  darrenr Pass lwp pointers throughout the kernel, as required, so that the lwpid can
be inserted into ktrace records. The general change has been to replace
"struct proc *" with "struct lwp *" in various function prototypes, pass
the lwp through and use l_proc to get the process pointer when needed.

Bump the kernel rev up to 1.6V
 1.99  26-Jun-2003  christos abuse the mib instead of abusing the new pointer. Idea from simon burge.
It allows the tcp_sysctl_ident to run by non-super-users. No backwards
compatibility provided.
 1.98  23-Jun-2003  martin Make sure to include opt_foo.h if a defflag option FOO is used.
 1.97  19-Apr-2003  christos PR/2352: Tor Egge: Add sysctl to get uid of connected socket.
 1.96  01-Mar-2003  thorpej Allow TCP connections to hosts on a local network to use a larger
slow start initial window. Default this larger initial window to
4 packets, allowing it to be adjusted with net.inet.tcp.init_win_local.
 1.95  26-Feb-2003  matt Add MBUFTRACE kernel option.
Do a little mbuf rework while here. Change all uses of MGET*(*, M_WAIT, *)
to m_get*(M_WAIT, *). These are not performance critical and making them
call m_get saves considerable space. Add m_clget analogue of MCLGET and
make corresponding change for M_WAIT uses.
Modify netinet, gem, fxp, tulip, nfs to support MBUFTRACE.
Begin to change netstat to use sysctl.
 1.94  02-Nov-2002  perry /*CONTCOND*/ while (0)'ed macros
 1.93  30-Jun-2002  thorpej Changes to allow the IPv4 and IPv6 layers to align headers themselves,
as necessary:
* Implement a new mbuf utility routine, m_copyup(), which is like
m_pullup(), except that it always prepends and copies, rather
than only doing so if the desired length is larger than m->m_len.
m_copyup() also allows an offset into the destination mbuf, which
allows space for packet headers, in the forwarding case.
* Add *_HDR_ALIGNED_P() macros for IP, IPv6, ICMP, and IGMP. These
macros expand to 1 if __NO_STRICT_ALIGNMENT is defined, so that
architectures which do not have strict alignment constraints don't
pay for the test or visit the new align-if-needed path.
* Use the new macros to check if a header needs to be aligned, or to
assert that it already is, as appropriate.

Note: This code is still somewhat experimental. However, the new
code path won't be visited if individual device drivers continue
to guarantee that packets are delivered to layer 3 already properly
aligned (which are rules that are already in use).
 1.92  09-Jun-2002  itojun whitespace
 1.91  26-May-2002  itojun path MTU discovery blackhole detection.
PR 12790 (sorry for not committing it for a long time)
 1.90  12-May-2002  matt branches: 1.90.2; 1.90.4;
Eliminate commons.
 1.89  15-Mar-2002  itojun have tcp6_drain
 1.88  24-Jan-2002  itojun place NRL copyright notice itself, not a reference to it.
 1.87  11-Sep-2001  thorpej Use callouts for SYN cache timers, rather than traversing time queues
in tcp_slowtimo().
 1.86  10-Sep-2001  thorpej Use callouts for TCP timers, rather than traversing the list of
all open TCP connections in tcp_slowtimo() (which is called 2x
per second). It's fairly rare for TCP timers to actually fire,
so saving this list traversal is good, especially if you want
to scale to thousands of open connections.
 1.85  10-Sep-2001  thorpej Split tcp_timers() into multiple functions, one for each timer,
and call it directly from tcp_slowtimo() (via a table) rather
than going through tcp_userreq().

This will allow us to call TCP timers directly from callouts,
in a future revision.
 1.84  10-Sep-2001  thorpej Change the way receive idle time and round trip time are measured.
Instead of incrementing t_idle and t_rtt in tcp_slowtimo(), we now
take a timestamp (via tcp_now) and use subtraction to compute the
delta when we actually need it (using unsigned arithmetic so that
tcp_now wrapping is handled correctly).

Based on similar changes in FreeBSD.
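
The wrap-safe subtraction described above can be sketched as follows. This is
an illustration only, not the code from the commit: the helper name
ticks_since() and the 32-bit counter width are assumptions made for the
example.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not from tcp_var.h): number of ticks elapsed
 * between a saved timestamp `then` and the current counter value `now`.
 * Unsigned subtraction is performed modulo 2^32, so the result stays
 * correct even if the counter wrapped between the two samples, as long
 * as fewer than 2^32 ticks really elapsed.
 */
static uint32_t
ticks_since(uint32_t now, uint32_t then)
{
	return now - then;
}
```

For instance, a timestamp taken two ticks before the counter wraps and read
back when the counter is at 5 yields an elapsed time of 7 ticks.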
 1.83  10-Sep-2001  thorpej Use a callout for the delayed ACK timer, and delete tcp_fasttimo().
Expose the delayed ACK timer as net.inet.tcp.delack_ticks.
 1.82  31-Jul-2001  thorpej branches: 1.82.2;
Count the number of times we "self-quench" (ip_output() returns
ENOBUFS), and don't inline tcp_segsize() if profiling.
 1.81  30-May-2001  mrg branches: 1.81.2;
use _KERNEL_OPT
 1.80  26-May-2001  matt Make t_flags a u_int instead of u_short. It's followed by a mbuf pointer
so there's padding around it already. And it increases the amount of bits
available for TF_* flags.
 1.79  13-Apr-2001  thorpej Remove the use of splimp() from the NetBSD kernel. splnet()
and only splnet() is allowed for the protection of data structures
used by network devices.
 1.78  20-Mar-2001  thorpej Two changes, designed to make us even more resilient against TCP
ISS attacks (which we already fend off quite well).

1. First-cut implementation of RFC1948, Steve Bellovin's cryptographic
hash method of generating TCP ISS values. Note, this code is experimental
and disabled by default (experimental enough that I don't export the
variable via sysctl yet, either). There are a couple of issues I'd
like to discuss with Steve, so this code should only be used by people
who really know what they're doing.

2. Per a recent thread on Bugtraq, it's possible to determine a system's
uptime by snooping the RFC1323 TCP timestamp options sent by a host; in
4.4BSD, timestamps are created by incrementing the tcp_now variable
at 2 Hz; there's even a company out there that uses this to determine
web server uptime. According to Newsham's paper "The Problem With
Random Increments", while NetBSD's TCP ISS generation method is much
better than the "random increment" method used by FreeBSD and OpenBSD,
it is still theoretically possible to mount an attack against NetBSD's
method if the attacker knows how many times the tcp_iss_seq variable
has been incremented. By not leaking uptime information, we can make
that much harder to determine. So, we avoid the leak by giving each
TCP connection a timebase of 0.
 1.77  19-Oct-2000  itojun branches: 1.77.2;
remove #ifdef TCP6. it is not likely for us to bring in sys/netinet6/tcp6*.c
(separate TCP/IPv6 stack) into netbsd-current.
 1.76  18-Oct-2000  thorpej Restructure the Path MTU Discovery code somewhat to avoid
entering rtentry's for hosts we're not actually communicating
with.

Do this by invoking the ctlinput for the protocol, which is
responsible for validating the ICMP message:
* TCP -- Lookup the connection based on the address/port
pairs in the ICMP message.
* AH/ESP -- Lookup the SA based on the SPI in the ICMP message.

If validation succeeds, ctlinput is responsible for calling
icmp_mtudisc(). icmp_mtudisc() then invokes callbacks registered
by protocols (such as TCP) which want to take some sort of special
action when a path's MTU changes. For TCP, this is where we now
refresh cached routes and re-enter slow-start.

As a side-effect, this fixes the problem where TCP would not be
notified when a path's MTU changed if AH/ESP were being used.

XXX Note, this is only a fix for the IPv4 case. For the IPv6
XXX case, we need to wait for the KAME folks.

Reviewed by sommerfeld@netbsd.org and itojun@netbsd.org.
 1.75  15-Aug-2000  itojun net.inet.tcp.rstratelimit is deprecated. make it invalid and return
ENOPROTOOPT.
 1.74  28-Jul-2000  itojun nuke the following sysctl variables. "ppsratelimit" should work better.
need to recompile sbin/sysctl after updating /usr/include.
net.inet.tcp.rstratelimit
net.inet.icmp.errratelimit
net.inet6.icmp6.errratelimit
 1.73  27-Jul-2000  itojun implement net.inet.tcp.rstppslimit to limit TCP RSTs by packet-per-second
basis. default: 100pps

set default value for net.inet.tcp.rstratelimit to 0 (disabled),
NOTE: it does not work right for smaller-than-1/hz interval. maybe we should
nuke it, or make it impossible to set smaller-than-1/hz value.
 1.72  15-Feb-2000  thorpej branches: 1.72.4;
Add support for rate-limiting RSTs sent in response to no socket for
an incoming packet. Default minimum interval is 10ms. The interval
is changeable via the "net.inet.tcp.rstratelimit" sysctl variable.
 1.71  13-Dec-1999  itojun sync IPv6 part with latest KAME tree. IPsec part is left unmodified
due to massive changes in KAME side.
- IPv6 output goes through nd6_output
- faith can capture IPv4 packets as well - you can run IPv4-to-IPv6 translator
using heavily modified DNS servers
- per-interface statistics (required for IPv6 MIB)
- interface autoconfig is revisited
- udp input handling has a big change for mapped address support.
- introduce in4_cksum() for non-overwriting checksumming
- introduce m_pulldown()
- neighbor discovery cleanups/improvements
- netinet/in.h strictly conforms to RFC2553 (no extra defs visible to userland)
- IFA_STATS is fixed a bit (not tested)
- and more more more.

TODO:
- cleanup os-independency #ifdef
- avoid rcvif dual use (for IPsec) to help ifdetach

(sorry for jumbo commit, I can't separate this any more...)
 1.70  08-Dec-1999  itojun do not drop from IP header to tcp option until sbappend(), to reduce
requirement to mbuf chain.
part of KAME sync, committed separately for its (possible) impact.
 1.69  19-Nov-1999  bouyer Update protocol and interface stats counters to 64 bits.
RTM_IFINFO is now 0xf, 0xe is RTM_OIFINFO which returns the old (if_msghdr14)
struct with 32bit counters (binary compat, conditioned on COMPAT_14).
Same for sysctl: node 3 is renamed NET_RT_OIFLIST, NET_RT_IFLIST is now node 4.
Change rt_msg1() to add an mbuf to the mbuf chain instead of just panic()
when the message is larger than MHLEN.
 1.68  23-Sep-1999  itojun branches: 1.68.2; 1.68.8;
cleanup and correct TCP MSS consideration with IPsec headers.

MSS advertisement must always be:
max(if mtu) - ip hdr siz - tcp hdr siz
We violated this in the previous code so it was fixed.

tcp_mss_to_advertise() now takes af (af on wire) as its argument,
to compute right ip hdr siz.

tcp_segsize() will take care of IPsec header size.
One thing I'm not really sure is how to handle IPsec header size in
*rxsegsizep (inbound segment size estimation).
The current code subtracts possible *outbound* IPsec size from *rxsegsizep,
hoping that the peer is using the same IPsec policy as me.
It may not be applicable; could a TCP guru please comment...
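
The advertisement rule quoted above (interface MTU minus IP header size minus
TCP header size) can be sketched as follows. This is a minimal illustration,
not the tcp_mss_to_advertise() implementation: the enum, the constants, and
the function name are hypothetical, and the header sizes assume no IP or TCP
options.

```c
#include <stdint.h>

/* Hypothetical stand-in for the address family seen on the wire. */
enum wire_af { WIRE_IPV4, WIRE_IPV6 };

#define SKETCH_IPV4_HDR_LEN	20u	/* IPv4 header without options */
#define SKETCH_IPV6_HDR_LEN	40u	/* fixed IPv6 header */
#define SKETCH_TCP_HDR_LEN	20u	/* TCP header without options */

/*
 * MSS to advertise = interface MTU - IP header size - TCP header size.
 * IPsec overhead is deliberately not subtracted here; the commit message
 * leaves that to the per-segment size computation.
 */
static uint32_t
sketch_mss_to_advertise(uint32_t if_mtu, enum wire_af af)
{
	uint32_t hdrs = SKETCH_TCP_HDR_LEN +
	    (af == WIRE_IPV6 ? SKETCH_IPV6_HDR_LEN : SKETCH_IPV4_HDR_LEN);

	/* Guard against an MTU too small to hold the headers. */
	return if_mtu > hdrs ? if_mtu - hdrs : 0;
}
```

With a 1500-byte MTU this gives 1460 for IPv4 and 1440 for IPv6.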
 1.67  25-Aug-1999  itojun When listening socket goes away, remove associated syn cache entries.
Stale syn cache entries are useless because none of them will be used
if there is no listening socket, as tcp_input looks up listening socket by
in_pcblookup*() before looking into syn cache.

This fixes race condition due to dangling socket pointer from syn cache
entries to listening socket (this was introduced when ipsec is merged in).

This should preserve currently implemented behavior (but not 4.4BSD
behavior prior to syn cache).

Tested in KAME repository before commit, but we'd better run some
regression tests.
 1.66  12-Aug-1999  itojun fix sototcpcb(). this sometimes caused panic on OOB data reception.

the macro may need to be expanded into dedicated function, rather than a macro,
to capture unsupported values.
 1.65  31-Jul-1999  itojun sync with recent KAME.
- loosen ipsec restriction on packet direction.
- revise icmp6 redirect handling on IsRouter bit.
- tcp/udp notification processing (link-local address case)
- cosmetic fixes (better code share across *BSD).
 1.64  22-Jul-1999  itojun - implement IPv6 pmtud, which is necessary for TCP6.
- fix memory leak on SO_DEBUG over TCP.
 1.63  14-Jul-1999  itojun Use proper ip protocol # field and tcp hdr on sending RST against SYN,
when ip header and tcp header are not adjacent to each other
(i.e. when ip6 options are attached).

To test this, try
telnet @::1@::1 port
toward a port without responding server. Prior to the fix, the kernel will
generate broken RST packet.
 1.62  09-Jul-1999  thorpej defopt INET6, and put it in opt_inet.h (most places already include this
file, which is why the file list is so short).
 1.61  01-Jul-1999  itojun IPv6 kernel code, based on KAME/NetBSD 1.4, SNAP kit 19990628.
(Sorry for a big commit, I can't separate this into several pieces...)
Pls check sys/netinet6/TODO and sys/netinet6/IMPLEMENTATION for details.

- sys/kern: do not assume single mbuf, accept chained mbuf on passing
data from userland to kernel (or other way round).
- "midway" ATM card: ATM PVC pseudo device support, like those done in ALTQ
package (ftp://ftp.csl.sony.co.jp/pub/kjc/).
- sys/netinet/tcp*: IPv4/v6 dual stack tcp support.
- sys/netinet/{ip6,icmp6}.h, sys/net/pfkeyv2.h: IETF document assumes those
file to be there so we patch it up.
- sys/netinet: IPsec additions are here and there.
- sys/netinet6/*: most of IPv6 code sits here.
- sys/netkey: IPsec key management code
- dev/pci/pcidevs: regen

In my understanding no code here is subject to export control so it
should be safe.
 1.60  23-May-1999  ad Add new sysctl (net.inet.tcp.log_refused) that when set, causes refused TCP
connections to be logged.
 1.59  29-Apr-1999  thorpej Implement retransmit logic for the SYN cache engine. Fixes a rare condition
where one side can think a connection exists, where the other side thinks
the connection was never established.

The original problem was first reported by Ty Sarna in PR #5909. The
original fix I made to the code didn't cover all cases. The problem this
fix addresses was reported by Christoph Badura via private e-mail.

Many thanks to Bill Sommerfeld for helping me to test this code, and
for finding a subtle bug.
 1.58  24-Jan-1999  thorpej branches: 1.58.2;
Oops, forgot to update copyright notice in previous.
 1.57  24-Jan-1999  thorpej * Completely rewrite syn_cache_respond().
- Don't use tcp_respond(), instead create the tcp/ip header from scratch,
and send it ourself.
- Reuse the mbuf that carried the SYN, or allocate one if that is not
available.
- Cache the route we look up to do the Path MTU Discovery check, and
transfer the reference to that route to the inpcb when the connection
completes.
* Macro'ize a small, but often repeated code fragment.
 1.56  18-Dec-1998  thorpej Add a lock around the TCPCB's sequence queue, to prevent tcp_drain()
from corrupting the queue if called from a device's interrupt context.

Similar in nature to the problem reported in PR #5684.
 1.55  06-Oct-1998  matt Add a sysctl for newreno (default to off).
 1.54  04-Oct-1998  matt Adapt the NEWRENO changes from the UCSB diffs of BSDI 3.0's TCP
to NetBSD. Ignore the SACK & FACK stuff for now.
 1.53  10-Sep-1998  mouse Create tcp.keepidle, tcp.keepintvl, tcp.keepcnt, tcp.slowhz sysctls.
 1.52  09-Sep-1998  thorpej Use an algorithm similar to that in tcp_notify() to determine if
syn_cache_unreach() should remove the entry, or just continue on.

Algorithm is to only remove the entry if we've had more than one unreach
error and have retransmitted 3 or more times. This prevents the following
scenario, as noted in PR #5909 (PR from Ty Sarna, scenario from
Charles Hannum):

* Host A sends a SYN.
* Host A retransmits the SYN.
* Host B gets the first SYN and sends a SYN-ACK.
* Host B gets the second SYN and sends a SYN-ACK.
* One of the SYN-ACK bounces with an
ICMP unreachable, causing the `SYN cache' entry to be
removed with no notification.
* Host A receives the other SYN-ACK, sends an ACK, and goes to
ESTABLISHED state.

Should fix PR #5909.
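
The removal rule described above can be sketched as follows. This is an
illustration under stated assumptions, not the syn_cache_unreach() code: the
structure, its field names, and the function name are hypothetical, and the
sketch assumes the unreachable count is bumped before the test.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a pending-connection (SYN cache) entry. */
struct pending_entry {
	unsigned unreach_seen;	/* ICMP unreachable errors seen so far */
	unsigned rexmits;	/* SYN,ACK retransmissions so far */
};

/*
 * Called when an ICMP unreachable arrives for the entry.  Returns true
 * only if the entry should be removed: more than one unreachable error
 * has been seen AND the SYN,ACK was retransmitted 3 or more times.
 * A single bounced SYN,ACK therefore never tears down the entry.
 */
static bool
unreach_should_remove(struct pending_entry *e)
{
	e->unreach_seen++;
	return e->unreach_seen > 1 && e->rexmits >= 3;
}
```

In the scenario listed above, the one bounced SYN-ACK leaves the entry in
place, so Host A's ACK still finds it.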
 1.51  21-Jul-1998  mycroft Implement a better fix for the `gratuitous FIN' problem, as
mentioned on tcp-impl but with a bit more commentary.
 1.50  11-May-1998  thorpej Nuke TUBA per my note to tech-net; there's no reason to keep it around.
 1.49  07-May-1998  thorpej Rework the syn cache code somewhat:
- Don't use home-grown queue manipulation. Use <sys/queue.h> instead. The
data structures are a little larger, but we are otherwise wasting the
memory chunk anyway (we're already a 64-byte malloc bucket).
- Fix a bug in the cache-is-full case: if the oldest element removed from
the first non-empty bucket was the only element in the bucket, the
bucket wouldn't be removed from the bucket cache, causing queue corruption
later.
- Optimize the syn cache timers by using PRT timers rather than home-grown
decrement-and-propagate timers.

This code is now a fair bit smaller, and significantly easier to read
and understand.
 1.48  06-May-1998  thorpej Use the monotonically increasing slow timer timestamp provided by
the protocol dispatch layer for TCP timers. This saves having to
modify a potentially large number of timer values (which were shorts,
and expanded to ... a lot of code on the Alpha).
 1.47  02-May-1998  thorpej Reintroduce the immediate ACK-on-PUSH behavior removed in revision 1.47,
but make the decision to do this dependent on the sysctl variable
net.inet.tcp.ack_on_push, which is disabled by default.
 1.46  01-May-1998  thorpej Garbage-collect.
 1.45  30-Apr-1998  thorpej In the CWM code, don't use the Floyd initial window computation as
the burst size allowed, but rather a fixed number of packets, as
described in the Internet Draft. Default allowed burst is 4 packets,
per the Draft.

Make the use of CWM and the allowed burst size tunable via sysctl.
 1.44  30-Apr-1998  thorpej Make tcp_compat_42 a sysctl option.
 1.43  29-Apr-1998  matt New TCP reassembly code. The new code reduces the memory needed by
out-of-order packets and builds the infrastructure needed for sending
SACK blocks (to be added shortly).
 1.42  29-Apr-1998  thorpej Make use of the work-arounds for ancient broken TCP peers run-time
conditional (tcp_compat_42). The kernel config option TCP_COMPAT_42
will still enable this by default, or disable this by default if the
option is not included (i.e. current behavior). This will be made a
sysctl soon.
 1.41  13-Apr-1998  kml Fix to ensure that the correct MSS is advertised for loopback
TCP connections by using the MTU of the interface. Also added
a knob, mss_ifmtu, to force all connections to use the MTU of
the interface to calculate the advertised MSS.
 1.40  07-Apr-1998  thorpej Remember any source routes that may have accompanied a SYN.
 1.39  03-Apr-1998  thorpej Now that we have a flags word in the syn cache entry, use a flag to indicate
"peer will do timestamps" rather than a bitfield, and give the now-unused
bit to the hash, making it now 32 bits.
 1.38  03-Apr-1998  thorpej Clean up some comments wrt. the syn cache code.
 1.37  31-Mar-1998  thorpej Fix a potential-congestion case in the larger initial congestion window
code, as clarified in the TCPIMPL WG meeting at IETF #41: If the SYN
(active open) or SYN,ACK (passive open) was retransmitted, the initial
congestion window for the first slow start of that connection must be
one segment.
 1.36  17-Mar-1998  kml Ensure that the TCP segment size reflects the size of TCP options
in the packet. This fixes a bug that was resulting in extra packets
in retransmissions (the second packet would be 12 bytes long,
reflecting the RFC1323 timestamp option size).
 1.35  19-Feb-1998  thorpej Update copyright (sigh, should have done this long ago).
 1.34  10-Feb-1998  perry add/cleanup multiple inclusion protection.
 1.33  05-Jan-1998  thorpej Finishing merging 4.4BSD-Lite2 netinet. At this point, the only changes
left were SCCS IDs and Copyright dates.
 1.32  31-Dec-1997  thorpej Implement a queue for delayed ACK processing. This queue is used in
tcp_fasttimo() in lieu of scanning all open TCP connections.
 1.31  17-Dec-1997  thorpej Keep stats on connections dropped due to excessive persist timeout.
 1.30  13-Dec-1997  thorpej After further examination of traces of bulk transfers (with help from
Kevin Lahey), undo the "defer window update until next delayed ACK".
 1.29  11-Dec-1997  thorpej Implement an infrastructure to allow larger initial congestion windows.
The sysctl'able variable "tcp_init_win", when set to 0, selects an
auto-tuning algorithm for selecting the initial window, based on transmit
segment size, per discussion in the IETF tcpimpl working group.

Default initial window is still 1 segment, but will soon become 2 segments,
per discussion in tcpimpl.
 1.28  11-Dec-1997  thorpej In the PRU_RCVD entry point, if TF_DELACK is set, don't send the window
update now, since it will be sent within 200ms when the delayed ACK is
sent. Instrument how many hits we get on this optimization.
 1.27  10-Dec-1997  thorpej Implement tcp_drain().
 1.26  08-Nov-1997  kml TCP MSS fixes to provide cleaner slow-start and recovery.
 1.25  17-Oct-1997  kml branches: 1.25.2;
Path MTU Discovery support. This is turned off by default.
Use sysctl -w net.inet.icmp.mtudisc=1 to turn on.
Still to come: path removal after some period, black hole detection
 1.24  10-Oct-1997  explorer Add hooks to use the kernel random system to generate TCP sequence numbers.
 1.23  22-Sep-1997  thorpej Fix several annoyances related to MSS handling in BSD TCP:
- Don't overload t_maxseg. Previous behavior was to set it to the min
of the peer's advertised MSS, our advertised MSS, and tcp_mssdflt
(for non-local networks). This breaks PMTU discovery running on
either host. Instead, remember the MSS we advertise, and use it
as appropriate (in silly window avoidance).
- Per last bullet, split tcp_mss() into several functions for handling
MSS (ours and peer's), and performing various tasks when a connection
becomes ESTABLISHED.
- Introduce a new function, tcp_segsize(), which computes the max size
for every segment transmitted in tcp_output(). This will eventually
be used to hook in PMTU discovery.
 1.22  29-Aug-1997  gwr Tweaks to allow operation with an interface address of 0.0.0.0
(needed for NFS mountroot using BOOTP to get boot parameters)
 1.21  28-Jul-1997  thorpej branches: 1.21.2;
Make the following tunable via sysctl, inspired by BSD/OS:
- tcp_sendspace
- tcp_recvspace
- tcp_mssdflt
- tcp_syn_cache_limit
- tcp_syn_bucket_limit
- tcp_syn_cache_timer
 1.20  23-Jul-1997  thorpej Pull SYN_cache_branch down into the main line.
 1.19  10-Dec-1996  mycroft branches: 1.19.8;
Fix RTT scaling problems introduced with Brakmo and Peterson changes.
 1.18  22-May-1996  mycroft Pass a proc pointer down to the usrreq and pcbbind functions for PRU_ATTACH, PRU_BIND and
PRU_CONTROL. The usrreq interface really needs to be split up, but this will have to wait.
Remove SS_PRIV completely.
 1.17  13-Feb-1996  christos branches: 1.17.4;
netinet prototypes
 1.16  31-Jan-1996  mycroft Build a hash table of PCBs. Hash function needs tweaking.
 1.15  21-Nov-1995  cgd make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.
 1.14  30-Sep-1995  thorpej branches: 1.14.2;
Implement tcp_sysctl(). Add a sysctl option to enable/disable RFC1323
extensions to TCP. From John Kohl <jtk@kolvir.blrc.ma.us>.
 1.13  12-Jun-1995  mycroft Various cleanup, including:
* Convert several data structures to use queue.h.
* Split in_pcbnotify() into two parts; one for notifying a specific PCB, and
one for notifying all PCBs for a particular foreign address.
 1.12  11-Jun-1995  mycroft As suggested by Brakmo and Peterson:
* Don't add the extra 1/8 of the mss when ramping up the congestion window.
* Scale the RTT values slightly to adjust for rounding errors.
* Set the lower bound of the RTO to RTT+2.
 1.11  13-Apr-1995  cgd be a bit more careful and explicit with types. (basically a large no-op.)
 1.10  26-Mar-1995  jtc KERNEL -> _KERNEL
 1.9  29-Jun-1994  cgd New RCS ID's, take two. they're more aesthetically pleasant, and use 'NetBSD'
 1.8  13-May-1994  mycroft Update to 4.4-Lite networking code, with a few local changes.
 1.7  10-Jan-1994  mycroft Change the counters to be all the same type -- u_long.
 1.6  10-Jan-1994  mycroft Don't prototype this until it's safe.
 1.5  08-Jan-1994  mycroft Prototypes.
 1.4  08-Jan-1994  mycroft Fix some inconsistent spacing; spaces at the end of lines, etc.
 1.3  20-May-1993  cgd more rcsid additions and file header cleanups
 1.2  19-Apr-1993  mycroft Add consistent multiple-inclusion protection.
 1.1  21-Mar-1993  cgd branches: 1.1.1;
Initial revision
 1.1.1.3  05-Jan-1998  thorpej Import sys/netinet from 4.4BSD-Lite2 for reference purposes.
 1.1.1.2  05-Jan-1998  thorpej Import sys/netinet from 4.4BSD-Lite for reference purposes.
 1.1.1.1  21-Mar-1993  cgd initial import of 386bsd-0.1 sources
 1.14.2.1  02-Feb-1996  mycroft Bring in changes for mondo patch 2.
 1.17.4.2  11-Dec-1996  mycroft From trunk:
Eliminate SS_PRIV; instead, pass down a proc pointer to the usrreq methods
that need it.
Fix numerous memory leaks and bogus return values.
 1.17.4.1  10-Dec-1996  mycroft From trunk:
Fix RTT scaling problems introduced with Brakmo and Peterson changes.
 1.19.8.6  16-Jul-1997  thorpej Declare struct tcp_opt_info here; it's needed by tuba_tcpinput().
 1.19.8.5  29-Jun-1997  thorpej Instrument syn cache hash collisions.
 1.19.8.4  28-Jun-1997  thorpej KNF.
 1.19.8.3  28-Jun-1997  thorpej Use explicit type sizes in struct syn_cache, and add a comment about
this structure being larger than intended on the Alpha.
 1.19.8.2  26-Jun-1997  thorpej tcp_mss() needs to take a u_int, not a u_int16_t.
 1.19.8.1  14-May-1997  mellon More of David Borman's SYN cache patches for Lite2:

- Define syn_cache entry and syn_cache_head structures.
- Add syn_cache statistics to tcpstat structure.
- Declare externs for syn cache variables.
- Update prototypes: tcp_dooptions, tcp_mss, tcp_respond.
- Add prototypes for syn_cache_* functions.
 1.21.2.3  14-Oct-1997  thorpej Update marc-pcmcia branch from trunk.
 1.21.2.2  29-Sep-1997  thorpej Update marc-pcmcia branch from trunk.
 1.21.2.1  01-Sep-1997  thorpej Update marc-pcmcia branch from trunk.
 1.25.2.4  09-May-1998  mycroft Pull up patch from kml.
 1.25.2.3  05-May-1998  mycroft Pull up 1.36, per request of kml.
 1.25.2.2  29-Jan-1998  mellon Pull up 1.27-1.33 (thorpej)
 1.25.2.1  08-Nov-1997  thorpej Pull up from trunk: TCP MSS fixes to provide cleaner slow-start and recovery.
(kml)
 1.58.2.1  29-Apr-1999  perry branches: 1.58.2.1.2; 1.58.2.1.4;
pullup 1.58->1.59 (thorpej)
 1.58.2.1.4.3  30-Nov-1999  itojun bring in latest KAME (as of 19991130, KAME/NetBSD141) into kame branch
just for reference purposes.
This commit includes 1.4 -> 1.4.1 sync for kame branch.

The branch does not compile at all (due to the lack of ALTQ and some other
source code). Please do not try to modify the branch, this is just for
reference purposes.

synchronization to latest KAME will take place on HEAD branch soon.
 1.58.2.1.4.2  06-Jul-1999  itojun KAME/NetBSD 1.4, SNAP kit 1999/07/05.
NOTE: this branch is just for reference purposes (i.e. for taking cvs diff).
do not touch anything on the branch. actual work must be done on HEAD branch.
 1.58.2.1.4.1  28-Jun-1999  itojun KAME/NetBSD 1.4 SNAP kit, dated 19990628.

NOTE: this branch (kame) is used just for reference. this may not compile
due to multiple reasons.
 1.58.2.1.2.3  02-Aug-1999  thorpej Update from trunk.
 1.58.2.1.2.2  01-Jul-1999  thorpej Sync w/ -current.
 1.58.2.1.2.1  21-Jun-1999  thorpej Sync w/ -current.
 1.68.8.1  27-Dec-1999  wrstuden Pull up to last week's -current.
 1.68.2.3  21-Apr-2001  bouyer Sync with HEAD
 1.68.2.2  27-Mar-2001  bouyer Sync with HEAD.
 1.68.2.1  20-Nov-2000  bouyer Update thorpej_scsipi to -current as of a month ago
 1.72.4.3  20-Apr-2004  jmc Pullup patch (requested by itojun in ticket #143)

If a segment is received with RST set and the segment is completely to the
left of the receive window, ignore it. Add some additional comments to
the code that deals with received segments that are completely to the right
of the receive window. If an invalid SYN is received, force an ACK and
drop it; if the other side really sent the SYN, it'll respond with a reset.
Respond to RST by ACK, as suggested in NISCC recommendation.
Rate-limit ACKs against RSTs and SYNs.
If SYN is coming and RCV.NXT == SEG.SEQ, then ACK with value - 1.
 1.72.4.2  24-Jan-2002  he Pull up revision 1.88 (requested by itojun):
Clean up the NRL copyright.
 1.72.4.1  16-Aug-2000  itojun pullup (approved by releng-1-5)

switch from net.inet*.*.*ratelimit to net.inet*.*.ppslimit.

(tags are a rough estimate - we had some trial-and-error in main trunk)
sys/netinet/icmp6.h 1.9 -> 1.11
sys/netinet/icmp_var.h 1.15 -> 1.17
sys/netinet/in_proto.c 1.39 -> 1.42
sys/netinet/ip_icmp.c 1.50 -> 1.51, 1.52 -> 1.54
sys/netinet/tcp_input.c 1.111 -> 1.112, 1.115 -> 1.117
sys/netinet/tcp_usrreq.c 1.52 -> 1.53
sys/netinet/tcp_var.h 1.72 -> 1.75
sys/netinet6/icmp6.c 1.34 -> 1.35, 1.36 -> 1.38
sys/netinet6/in6_proto.c 1.17 -> 1.19
 1.77.2.9  11-Nov-2002  nathanw Catch up to -current
 1.77.2.8  01-Aug-2002  nathanw Catch up to -current.
 1.77.2.7  20-Jun-2002  nathanw Catch up to -current.
 1.77.2.6  01-Apr-2002  nathanw Catch up to -current.
(CVS: It's not just a program. It's an adventure!)
 1.77.2.5  28-Feb-2002  nathanw Catch up to -current.
 1.77.2.4  21-Sep-2001  nathanw Catch up to -current.
 1.77.2.3  24-Aug-2001  nathanw Catch up with -current.
 1.77.2.2  21-Jun-2001  nathanw Catch up to -current.
 1.77.2.1  09-Apr-2001  nathanw Catch up with -current.
 1.81.2.5  06-Sep-2002  jdolecek sync kqueue branch with HEAD
 1.81.2.4  23-Jun-2002  jdolecek catch up with -current on kqueue branch
 1.81.2.3  11-Feb-2002  jdolecek Sync w/ -current.
 1.81.2.2  13-Sep-2001  thorpej Update the kqueue branch to HEAD.
 1.81.2.1  03-Aug-2001  lukem update to -current
 1.82.2.1  01-Oct-2001  fvdl Catch up with -current.
 1.90.4.3  20-Apr-2004  jmc Pullup patch (requested by itojun in ticket #1680)

If a segment is received with RST set and the segment is completely to the
left of the receive window, ignore it. Add some additional comments to
the code that deals with received segments that are completely to the right
of the receive window. If an invalid SYN is received, force an ACK and
drop it; if the other side really sent the SYN, it'll respond with a reset.
Respond to RST by ACK, as suggested in NISCC recommendation.
Rate-limit ACKs against RSTs and SYNs.
If SYN is coming and RCV.NXT == SEG.SEQ, then ACK with value - 1.
 1.90.4.2  22-Oct-2003  jmc Pullup rev 1.03 (requested by he in ticket #1530)


Introduce a new INVOKING status for callouts, and use it to close
a race condition in the TCP code. Fixes PR#20390.
 1.90.4.1  05-Sep-2003  tron Pull up revision 1.91 (requested by tls in ticket #1445):
path MTU discovery blackhole detection.
PR 12790 (sorry for not committing it for a long time)
 1.90.2.3  15-Jul-2002  gehenna catch up with -current.
 1.90.2.2  20-Jun-2002  gehenna catch up with -current.
 1.90.2.1  30-May-2002  gehenna Catch up with -current.
 1.102.2.12  11-Dec-2005  christos Sync with head.
 1.102.2.11  10-Nov-2005  skrll Sync with HEAD. Here we go again...
 1.102.2.10  01-Apr-2005  skrll Sync with HEAD.
 1.102.2.9  04-Mar-2005  skrll Sync with HEAD.

Hi Perry!
 1.102.2.8  07-Feb-2005  skrll Sync with HEAD.
 1.102.2.7  04-Feb-2005  skrll Sync with HEAD.
 1.102.2.6  17-Jan-2005  skrll Sync with HEAD.
 1.102.2.5  18-Dec-2004  skrll Sync with HEAD.
 1.102.2.4  21-Sep-2004  skrll Fix the sync with head I botched.
 1.102.2.3  18-Sep-2004  skrll Sync with HEAD.
 1.102.2.2  03-Aug-2004  skrll Sync with HEAD
 1.102.2.1  02-Jul-2003  darrenr Apply the aborted ktrace-lwp changes to a specific branch. This is just for
others to review, I'm concerned that patch fuzziness may have resulted in some
errant code being generated but I'll look at that later by comparing the diff
from the base to the branch with the file I attempt to apply to it. This will,
at the very least, put the changes in a better context for others to review
them and attempt to tinker with removing passing of 'struct lwp' through
the kernel.
 1.106.2.2  18-Sep-2004  he Pull up revision 1.113 (requested by yamt in ticket #861):
Fix ipqent pool corruption problems. Make the TCP reassembly
code use its own pool of ipqent rather than sharing it with
the IP reassembly code. Fixes PR#24782.
 1.106.2.1  20-Apr-2004  jmc Pullup patch (requested by itojun in ticket #169)

If a segment is received with RST set and the segment is completely to the
left of the receive window, ignore it. Add some additional comments to
the code that deals with received segments that are completely to the right
of the receive window. If an invalid SYN is received, force an ACK and
drop it; if the other side really sent the SYN, it'll respond with a reset.
Respond to RST by ACK, as suggested in NISCC recommendation.
Rate-limit ACKs against RSTs and SYNs.
If SYN is coming and RCV.NXT == SEG.SEQ, then ACK with value - 1.
 1.115.4.2  19-Mar-2005  yamt sync with head. xen and whitespace. xen part is not finished.
 1.115.4.1  12-Feb-2005  yamt sync with head.
 1.115.2.1  29-Apr-2005  kent sync with -current
 1.123.2.2  06-May-2005  tron Pull up revision 1.125 (requested by kurahone in ticket #199):
Added sysctl tunable limits for the number of maximum SACK holes
per connection and per system.
Idea taken from FreeBSD.
 1.123.2.1  04-Apr-2005  tron Pull up revision 1.124 (requested by yamt in ticket #90):
protect tcpipqent with splvm.
 1.126.2.6  17-Mar-2008  yamt sync with head.
 1.126.2.5  21-Jan-2008  yamt sync with head
 1.126.2.4  03-Sep-2007  yamt sync with head.
 1.126.2.3  26-Feb-2007  yamt sync with head.
 1.126.2.2  30-Dec-2006  yamt sync with head.
 1.126.2.1  21-Jun-2006  yamt sync with head.
 1.133.6.1  22-Apr-2006  simonb Sync with head.
 1.133.4.3  09-Sep-2006  rpaulo sync with head
 1.133.4.2  14-Mar-2006  rpaulo Remove in6pcb in parameter list.
 1.133.4.1  14-Mar-2006  rpaulo Remove back pointer to in6pcb.
 1.133.2.1  18-Feb-2006  yamt sync with head.
 1.134.2.2  14-Sep-2006  yamt sync with head.
 1.134.2.1  11-Aug-2006  yamt sync with head
 1.137.4.2  10-Dec-2006  yamt sync with head.
 1.137.4.1  22-Oct-2006  yamt sync with head
 1.137.2.2  12-Jan-2007  ad Sync with head.
 1.137.2.1  18-Nov-2006  ad Sync with head.
 1.141.4.1  03-Jun-2008  skrll Sync with netbsd-4.
 1.141.2.1  21-Jan-2008  bouyer Pull up following revision(s) (requested by ghen in ticket #1039):
sys/netinet/tcp_var.h: revision 1.148
distrib/sets/lists/comp/mi: revision 1.1035
distrib/sets/lists/man/mi: revision 1.1010
usr.sbin/tcpdrop/Makefile: revision 1.1
usr.sbin/tcpdrop/tcpdrop.c: revision 1.1 - 1.3
usr.sbin/tcpdrop/tcpdrop.8: revision 1.1
usr.sbin/Makefile: revision 1.228 via patch
sys/netinet/tcp_usrreq.c: revision 1.133
distrib/sets/lists/base/mi: revision 1.712
Import tcpdrop(8) from OpenBSD
 1.143.2.3  07-May-2007  yamt sync with head.
 1.143.2.2  12-Mar-2007  rmind Sync with HEAD.
 1.143.2.1  27-Feb-2007  yamt - sync with head.
- move sched_changepri back to kern_synch.c as it doesn't know PPQ anymore.
 1.145.4.1  11-Jul-2007  mjf Sync with head.
 1.145.2.4  20-Aug-2007  ad Sync with HEAD.
 1.145.2.3  15-Jul-2007  ad Sync with head.
 1.145.2.2  01-Jul-2007  ad Adapt to callout API change.
 1.145.2.1  08-Jun-2007  ad Sync with head.
 1.149.2.1  15-Aug-2007  skrll Sync with HEAD.
 1.150.20.2  02-Aug-2007  rmind TCP socket buffers automatic sizing - ported from FreeBSD.
http://mail-index.netbsd.org/tech-net/2007/02/04/0006.html

! Disabled by default, marked as experimental. Testers are badly needed.
! Someone should thoroughly test this, and improve if possible.

Discussed on <tech-net>:
http://mail-index.netbsd.org/tech-net/2007/07/12/0002.html
Thanks Greg Troxel for comments.

OK by the long silence on <tech-net>.
 1.150.20.1  02-Aug-2007  rmind file tcp_var.h was added on branch matt-mips64 on 2007-08-02 02:42:43 +0000
 1.150.16.1  02-Jan-2008  bouyer Sync with HEAD
 1.150.12.1  26-Dec-2007  ad Sync with head.
 1.150.10.1  18-Feb-2008  mjf Sync with HEAD.
 1.150.4.2  23-Mar-2008  matt sync with HEAD
 1.150.4.1  09-Jan-2008  matt sync with HEAD
 1.151.6.3  28-Sep-2008  mjf Sync with HEAD.
 1.151.6.2  02-Jun-2008  mjf Sync with HEAD.
 1.151.6.1  03-Apr-2008  mjf Sync with HEAD.
 1.151.2.1  24-Mar-2008  keiichi sync with head.
 1.155.2.1  18-May-2008  yamt sync with head.
 1.156.2.5  11-Mar-2010  yamt sync with head
 1.156.2.4  16-Sep-2009  yamt sync with head
 1.156.2.3  20-Jun-2009  yamt sync with head
 1.156.2.2  04-May-2009  yamt sync with head.
 1.156.2.1  16-May-2008  yamt sync with head.
 1.157.6.1  19-Oct-2008  haad Sync with HEAD.
 1.157.2.1  18-Sep-2008  wrstuden Sync with wrstuden-revivesa-base-2.
 1.158.10.1  21-Apr-2010  matt sync to netbsd-5
 1.158.4.1  26-Sep-2009  snj Pull up following revision(s) (requested by darran in ticket #950):
sys/netinet/tcp_input.c: revision 1.299
sys/netinet/tcp_usrreq.c: revision 1.156
sys/netinet/tcp_var.h: revision 1.161
Make tcp msl (max segment life) tunable via sysctl net.inet.tcp.msl.
Okayed by tls@.
 1.158.2.1  03-Mar-2009  skrll Sync with HEAD.
 1.159.2.1  23-Jul-2009  jym Sync with HEAD.
 1.162.6.1  06-Jun-2011  jruoho Sync with HEAD.
 1.162.4.2  31-May-2011  rmind sync with head
 1.162.4.1  21-Apr-2011  rmind sync with head
 1.168.6.1  18-Feb-2012  mrg merge to -current.
 1.168.2.2  22-May-2014  yamt sync with head.

for a reference, the tree before this commit was tagged
as yamt-pagecache-tag8.

this commit was split into small chunks to avoid
a limitation of cvs. ("Protocol error: too many arguments")
 1.168.2.1  17-Apr-2012  yamt sync with head
 1.169.6.3  03-Dec-2017  jdolecek update from HEAD
 1.169.6.2  20-Aug-2014  tls Rebase to HEAD as of a few days ago.
 1.169.6.1  23-Jun-2013  tls resync from head
 1.170.4.3  18-May-2014  rmind sync with head
 1.170.4.2  28-Aug-2013  rmind Checkpoint work in progress:
- Initial split of the protocol user-request method into the following
methods: pr_attach, pr_detach and pr_generic for the old pr_usrreq.
- Adjust socreate(9) and sonewconn(9) to call pr_attach without the
socket lock held (as a preparation for the locking scheme adjustment).
- Adjust all pr_attach routines to assert that PCB is not set.
- Sprinkle various comments, document some routines and their locking.
- Remove M_PCB, replace with kmem(9).
- Fix few bugs spotted on the way.
 1.170.4.1  17-Jul-2013  rmind Checkpoint work in progress:
- Move PCB structures under __INPCB_PRIVATE, adjust most of the callers
and thus make IPv4 PCB structures mostly opaque. Any volunteers for
merging in6pcb with inpcb (see rpaulo-netinet-merge-pcb branch)?
- Move various global vars to the modules where they belong, make them static.
- Some preliminary work for IPv4 PCB locking scheme.
- Make raw IP code mostly MP-safe. Simplify some of it.
- Rework "fast" IP forwarding (ipflow) code to be mostly MP-safe. It should
run from a software interrupt, rather than hard.
- Rework tun(4) pseudo interface to be MP-safe.
- Work towards making some other interfaces more strict.
 1.172.2.1  10-Aug-2014  tls Rebase.
 1.175.4.2  28-Aug-2017  skrll Sync with HEAD
 1.175.4.1  06-Apr-2015  skrll Sync with HEAD
 1.175.2.1  21-Feb-2015  martin Pull up following revision(s) (requested by he in ticket #530):
sys/netinet/tcp_output.c: revision 1.180
sys/netinet/tcp_input.c: revision 1.336
sys/netinet/tcp_usrreq.c: revision 1.203
share/man/man4/tcp.4: revision 1.30
sys/netinet/tcp.h: revision 1.31
sys/netinet/tcp_subr.c: revision 1.258
sys/netinet/tcp_var.h: revision 1.176
sys/netinet/tcp_var.h: revision 1.177
sys/sys/param.h: bump revision

Port over the TCP_INFO socket option from FreeBSD, originally from
the Linux 2.6 TCP API. This permits the caller to query certain information
about a TCP connection, and is used by pkgsrc's net/iperf3 test program
if available.

This extends struct tcpcb with three fields to count retransmits,
out-of-sequence receives and zero window announcements, and will
therefore warrant a kernel revision bump (done separately).

Change the new counter variables in struct tcpcb to uint32_t, as
per christos' comments.
 1.177.10.2  03-Feb-2018  snj Pull up following revision(s) (requested by ozaki-r in ticket #514):
sys/net/route.c: 1.205
sys/net/rtsock.c: 1.237-1.238
sys/netinet/in.c: 1.215
sys/netinet/tcp_subr.c: 1.272
sys/netinet/tcp_timer.c: 1.93
sys/netinet/tcp_timer.h: 1.29
sys/netinet/tcp_var.h: 1.182
sys/netinet6/in6.c: 1.258
Remove extra pserialize_perform from in_purgeaddr
It's already performed in ifa_remove. Note so there (in in6_unlink_ifa too).
Release rt_so_mtx on updating a rtentry to avoid a deadlock with route_intr
The deadlock happened only if NET_MPSAFE on.
Run tcp_slowtimo in workqueue if NET_MPSAFE
If NET_MPSAFE is enabled, we have to avoid taking softnet_lock in softint as
much as possible to prevent any softint handlers including callout handlers
such as tcp_slowtimo from sticking on softnet_lock because it results in
undesired delays of executing subsequent softint handlers.
NFCI for !NET_MPSAFE
Fix a return value of rt_update_prepare
Callers expect it to be an errno.
Fix another deadlock
When waiting for a route update to finish, a waiter has to release its reference
to the route to avoid a deadlock, because an updater waits for references
to a target route (except for a reference by the updater itself) to be released.
 1.177.10.1  21-Oct-2017  snj Pull up following revision(s) (requested by ozaki-r in ticket #300):
crypto/dist/ipsec-tools/src/setkey/parse.y: 1.19
crypto/dist/ipsec-tools/src/setkey/token.l: 1.20
distrib/sets/lists/tests/mi: 1.754, 1.757, 1.759
doc/TODO.smpnet: 1.12-1.13
sys/net/pfkeyv2.h: 1.32
sys/net/raw_cb.c: 1.23-1.24, 1.28
sys/net/raw_cb.h: 1.28
sys/net/raw_usrreq.c: 1.57-1.58
sys/net/rtsock.c: 1.228-1.229
sys/netinet/in_proto.c: 1.125
sys/netinet/ip_input.c: 1.359-1.361
sys/netinet/tcp_input.c: 1.359-1.360
sys/netinet/tcp_output.c: 1.197
sys/netinet/tcp_var.h: 1.178
sys/netinet6/icmp6.c: 1.213
sys/netinet6/in6_proto.c: 1.119
sys/netinet6/ip6_forward.c: 1.88
sys/netinet6/ip6_input.c: 1.181-1.182
sys/netinet6/ip6_output.c: 1.193
sys/netinet6/ip6protosw.h: 1.26
sys/netipsec/ipsec.c: 1.100-1.122
sys/netipsec/ipsec.h: 1.51-1.61
sys/netipsec/ipsec6.h: 1.18-1.20
sys/netipsec/ipsec_input.c: 1.44-1.51
sys/netipsec/ipsec_netbsd.c: 1.41-1.45
sys/netipsec/ipsec_output.c: 1.49-1.64
sys/netipsec/ipsec_private.h: 1.5
sys/netipsec/key.c: 1.164-1.234
sys/netipsec/key.h: 1.20-1.32
sys/netipsec/key_debug.c: 1.18-1.21
sys/netipsec/key_debug.h: 1.9
sys/netipsec/keydb.h: 1.16-1.20
sys/netipsec/keysock.c: 1.59-1.62
sys/netipsec/keysock.h: 1.10
sys/netipsec/xform.h: 1.9-1.12
sys/netipsec/xform_ah.c: 1.55-1.74
sys/netipsec/xform_esp.c: 1.56-1.72
sys/netipsec/xform_ipcomp.c: 1.39-1.53
sys/netipsec/xform_ipip.c: 1.50-1.54
sys/netipsec/xform_tcp.c: 1.12-1.16
sys/rump/librump/rumpkern/Makefile.rumpkern: 1.170
sys/rump/librump/rumpnet/net_stub.c: 1.27
sys/sys/protosw.h: 1.67-1.68
tests/net/carp/t_basic.sh: 1.7
tests/net/if_gif/t_gif.sh: 1.11
tests/net/if_l2tp/t_l2tp.sh: 1.3
tests/net/ipsec/Makefile: 1.7-1.9
tests/net/ipsec/algorithms.sh: 1.5
tests/net/ipsec/common.sh: 1.4-1.6
tests/net/ipsec/t_ipsec_ah_keys.sh: 1.2
tests/net/ipsec/t_ipsec_esp_keys.sh: 1.2
tests/net/ipsec/t_ipsec_gif.sh: 1.6-1.7
tests/net/ipsec/t_ipsec_l2tp.sh: 1.6-1.7
tests/net/ipsec/t_ipsec_misc.sh: 1.8-1.18
tests/net/ipsec/t_ipsec_sockopt.sh: 1.1-1.2
tests/net/ipsec/t_ipsec_tcp.sh: 1.1-1.2
tests/net/ipsec/t_ipsec_transport.sh: 1.5-1.6
tests/net/ipsec/t_ipsec_tunnel.sh: 1.9
tests/net/ipsec/t_ipsec_tunnel_ipcomp.sh: 1.1-1.2
tests/net/ipsec/t_ipsec_tunnel_odd.sh: 1.3
tests/net/mcast/t_mcast.sh: 1.6
tests/net/net/t_ipaddress.sh: 1.11
tests/net/net_common.sh: 1.20
tests/net/npf/t_npf.sh: 1.3
tests/net/route/t_flags.sh: 1.20
tests/net/route/t_flags6.sh: 1.16
usr.bin/netstat/fast_ipsec.c: 1.22
Do m_pullup before mtod

It may fix panics of some tests on anita/sparc and anita/GuruPlug.
---
KNF
---
Enable DEBUG for babylon5
---
Apply C99-style struct initialization to xformsw
---
Tweak outputs of netstat -s for IPsec

- Get rid of "Fast"
- Use ipsec and ipsec6 for titles to clarify protocol
- Indent outputs of sub protocols

Original outputs were organized like this:

(Fast) IPsec:
IPsec ah:
IPsec esp:
IPsec ipip:
IPsec ipcomp:
(Fast) IPsec:
IPsec ah:
IPsec esp:
IPsec ipip:
IPsec ipcomp:

New outputs are organized like this:

ipsec:
ah:
esp:
ipip:
ipcomp:
ipsec6:
ah:
esp:
ipip:
ipcomp:
---
Add test cases for IPComp
---
Simplify IPSEC_OSTAT macro (NFC)
---
KNF; replace leading whitespaces with hard tabs
---
Introduce and use SADB_SASTATE_USABLE_P
---
KNF
---
Add update command for testing

Updating an SA (SADB_UPDATE) requires that a process issuing
SADB_UPDATE is the same as the process that issued SADB_ADD (or SADB_GETSPI).
This means that update command must be used with add command in a
configuration of setkey. This usage is normally meaningless but
useful for testing (and debugging) purposes.
---
Add test cases for updating SA/SP

The tests require the newly-added update command of setkey.
---
PR/52346: Frank Kardel: Fix checksumming for NAT-T
See XXX for improvements.
---
Remove codes for PACKET_TAG_IPSEC_IN_CRYPTO_DONE

It seems that PACKET_TAG_IPSEC_IN_CRYPTO_DONE is for network adapters
that have IPsec accelerators; a driver sets the mtag to a packet
when its device has already encrypted the packet.

Unfortunately no driver has implemented such offload features for many
years and none seems likely to implement them soon. (Note that neither
FreeBSD nor Linux has such drivers.) Let's remove the related
(unused) code and simplify the IPsec code.
---
Fix usages of sadb_msg_errno
---
Avoid updating sav directly

On SADB_UPDATE a target sav was updated directly, which was unsafe.
Instead allocate another sav, copy variables of the old sav to
the new one and replace the old one with the new one.
---
Simplify; we can assume sav->tdb_xform cannot be NULL while it's valid
---
Rename key_alloc* functions (NFC)

We shouldn't use the term "alloc" for functions that just look up
data and actually don't allocate memory.
---
Use explicit_memset to surely zero-clear key_auth and key_enc
---
Make sure to clear keys on error paths of key_setsaval
---
Add missing KEY_FREESAV
---
Make sure a sav is inserted to a sah list after its initialization completes
---
Remove unnecessary zero-clearing codes from key_setsaval

key_setsaval is now used only for a newly-allocated sav. (It was
used to reset variables of an existing sav.)
---
Correct wrong assumption of sav->refcnt in key_delsah

A sav in a list should basically never have sav->refcnt == 0. And also
KEY_FREESAV assumes sav->refcnt > 0.
---
Let key_getsavbyspi take a reference of a returning sav
---
Use time_mono_to_wall (NFC)
---
Separate sending message routine (NFC)
---
Simplify; remove unnecessary zero-clears

key_freesaval is used only when a target sav is being destroyed.
---
Omit NULL checks for sav->lft_c

sav->lft_c can be NULL only when initializing or destroying sav.
---
Omit unnecessary NULL checks for sav->sah
---
Omit unnecessary check of sav->state

key_allocsa_policy picks a sav of either MATURE or DYING so we
don't need to check its state again.
---
Simplify; omit unnecessary saidx passing

- ipsec_nextisr returns a saidx but no caller uses it
- key_checkrequest is passed a saidx but it can be gotten from
another argument (isr)
---
Fix splx isn't called on some error paths
---
Fix header size calculation of esp where sav is NULL
---
Fix header size calculation of ah in the case sav is NULL

This fix was also needed for esp.
---
Pass sav directly to opencrypto callback

In a callback, use a passed sav as-is by default and look up a sav
only if the passed sav is dead.
---
Avoid examining freshness of sav on packet processing

If a sav list is sorted (by lft_c->sadb_lifetime_addtime) in advance,
we don't need to examine each sav and also don't need to delete one
on the fly and send up a message. Fortunately every sav list is sorted
as we need.

The added key_validate_savlist validates that each sav list is indeed sorted
(run only if DEBUG because it's not cheap).
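
A sortedness check of the kind described above might look like the following sketch. It is not the real key_validate_savlist: struct sav_entry and savlist_is_sorted are hypothetical stand-ins, and the sketch assumes ascending order by add time (oldest first), which the message does not state.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a sav; only the add time matters here. */
struct sav_entry {
	uint64_t addtime;		/* stands in for lft_c->sadb_lifetime_addtime */
	struct sav_entry *next;
};

/*
 * Walk the list once and compare each entry with its successor.
 * Assumption: the list is meant to be in ascending addtime order.
 * The walk is O(n), which is why such a check would be kept out of
 * non-DEBUG kernels.
 */
static bool
savlist_is_sorted(const struct sav_entry *head)
{
	const struct sav_entry *e;

	for (e = head; e != NULL && e->next != NULL; e = e->next) {
		if (e->addtime > e->next->addtime)
			return false;
	}
	return true;
}
```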
---
Add test cases for SAs with different SPIs
---
Prepare to stop using isr->sav

isr is a shared resource and using isr->sav as temporary storage
for each packet processing is racy. And also having a reference from
isr to sav makes the lifetime of sav non-deterministic; such a reference
is removed when a packet is processed and isr->sav is overwritten by
new one. Let's have a sav locally for each packet processing instead of
using shared isr->sav.

However this change doesn't stop using isr->sav yet because there are
some users of isr->sav. isr->sav will be removed after the users find
a way to not use isr->sav.
---
Fix wrong argument handling
---
fix printf format.
---
Don't validate sav lists of LARVAL or DEAD states

We don't sort the lists so the validation will always fail.

Fix PR kern/52405
---
Make sure to sort the list when changing the state by key_sa_chgstate
---
Rename key_allocsa_policy to key_lookup_sa_bysaidx
---
Separate test files
---
Calculate ah_max_authsize on initialization as well as esp_max_ivlen
---
Remove m_tag_find(PACKET_TAG_IPSEC_PENDING_TDB) because nobody sets the tag
---
Restore a comment removed in previous

The comment is valid for the below code.
---
Make tests more stable

sleep command seems to wait longer than expected on anita so
use polling to wait for a state change.
---
Add tests that explicitly delete SAs instead of waiting for expirations
---
Remove invalid M_AUTHIPDGM check on ESP isr->sav

M_AUTHIPDGM flag is set to a mbuf in ah_input_cb. An sav of ESP can
have AH authentication as sav->tdb_authalgxform. However, in that
case esp_input and esp_input_cb are used to do ESP decryption and
AH authentication and M_AUTHIPDGM is never set on an mbuf. So
checking M_AUTHIPDGM of a mbuf on isr->sav of ESP is meaningless.
---
Look up sav instead of relying on unstable sp->req->sav

This code is executed only in an error path so an additional lookup
doesn't matter.
---
Correct a comment
---
Don't release sav if calling crypto_dispatch again
---
Remove extra KEY_FREESAV from ipsec_process_done

It should be done by the caller.
---
Don't bother the case of crp->crp_buf == NULL in callbacks
---
Hold a reference to an SP during opencrypto processing

An SP has a list of isr (ipsecrequest) that represents a sequence
of IPsec encryption/authentication processing. One isr corresponds
to one opencrypto processing. The lifetime of an isr follows its SP.

We pass an isr to a callback function of opencrypto to continue
to a next encryption/authentication processing. However nobody
guaranteed that the isr wasn't freed, i.e., its SP wasn't destroyed.

In order to avoid such unexpected destruction of isr, hold a reference
to its SP during opencrypto processing.
---
Don't make SAs expired on tests that delete SAs explicitly
---
Fix a debug message
---
Dedup error paths (NFC)
---
Use pool to allocate tdb_crypto

For ESP and AH, we need to allocate an extra variable space in addition
to struct tdb_crypto. The fixed size of pool items may be larger than
an actual requisite size of a buffer, but still the performance
improvement by replacing malloc with pool wins.
---
Don't use unstable isr->sav for header size calculations

We may need to optimize to not look up sav here for users that
don't need to know an exact size of headers (e.g., TCP segment size
calculation).
---
Don't use sp->req->sav when handling NAT-T ESP fragmentation

In order to do this we need to look up a sav; however, an additional
look-up degrades performance. A sav is later looked up in
ipsec4_process_packet so delay the fragmentation check until then
to avoid an extra look-up.
---
Don't use key_lookup_sp that depends on unstable sp->req->sav

It provided a fast look-up of SP. We will provide an alternative
method in the future (after basic MP-ification finishes).
---
Stop setting isr->sav on looking up sav in key_checkrequest
---
Remove ipsecrequest#sav
---
Stop setting mtag of PACKET_TAG_IPSEC_IN_DONE because there are no users anymore
---
Skip ipsec_spi_*_*_preferred_new_timeout when running on qemu

Probably due to PR 43997
---
Add localcount to rump kernels
---
Remove unused macro
---
Fix key_getcomb_setlifetime

The fix adjusts a soft limit to be 80% of a corresponding hard limit.

I'm not sure the fix is really correct though, at least the original
code is wrong. A passed comb is zero-cleared before calling
key_getcomb_setlifetime, so
comb->sadb_comb_soft_addtime = comb->sadb_comb_soft_addtime * 80 / 100;
is meaningless.
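
The arithmetic behind this fix can be illustrated with a small sketch. The struct and function names below are hypothetical (this is not the real struct sadb_comb or key_getcomb_setlifetime); the sketch only shows why scaling a zero-cleared field does nothing, and what deriving the soft limit from the hard limit looks like.

```c
#include <stdint.h>

/* Hypothetical stand-in for the add-time lifetime fields of a proposal. */
struct comb_lifetime {
	uint64_t soft_addtime;
	uint64_t hard_addtime;
};

/* The pattern called meaningless above: 0 * 80 / 100 is still 0. */
static void
scale_soft_in_place(struct comb_lifetime *c)
{
	c->soft_addtime = c->soft_addtime * 80 / 100;
}

/* Sketch of the fixed behavior: soft limit is 80% of the hard limit. */
static void
set_soft_from_hard(struct comb_lifetime *c)
{
	c->soft_addtime = c->hard_addtime * 80 / 100;
}
```

With a hard limit of 86400 seconds, set_soft_from_hard yields a soft limit of 69120 seconds, whereas scale_soft_in_place leaves a zero-cleared soft limit at 0.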
---
Provide and apply key_sp_refcnt (NFC)

It simplifies further changes.
---
Fix indentation

Pointed out by knakahara@
---
Use pslist(9) for sptree
---
Don't acquire global locks for IPsec if NET_MPSAFE

Note that the change is just to make testing easy and IPsec isn't MP-safe yet.
---
Let PF_KEY socks hold their own lock instead of softnet_lock

Operations on SAD and SPD are executed via PF_KEY socks. The operations
include deletions of SAs and SPs that will use synchronization mechanisms
such as pserialize_perform to wait for references to SAs and SPs to be
released. It is known that using such mechanisms with holding softnet_lock
causes a deadlock. We should avoid the situation.
---
Make IPsec SPD MP-safe

We use localcount(9), not psref(9), to make the sptree and secpolicy (SP)
entries MP-safe because SPs need to be referenced over opencrypto
processing that executes a callback in a different context.

SPs on sockets aren't managed by the sptree and can be destroyed in softint.
localcount_drain cannot be used in softint so we delay the destruction of
such SPs to a thread context. To do so, a list to manage such SPs is added
(key_socksplist) and key_timehandler_spd deletes dead SPs in the list.

For more details please read the locking notes in key.c.

Proposed on tech-kern@ and tech-net@
---
Fix updating ipsec_used

- key_update_used wasn't called in key_api_spddelete2 and key_api_spdflush
- key_update_used wasn't called if an SP had been added/deleted but
a reply to userland failed
---
Fix updating ipsec_used; turn on when SPs on sockets are added
---
Add missing IPsec policy checks to icmp6_rip6_input

icmp6_rip6_input is quite similar to rip6_input and the same checks exist
in rip6_input.
---
Add test cases for setsockopt(IP_IPSEC_POLICY)
---
Don't use KEY_NEWSP for dummy SP entries

By the change KEY_NEWSP is now not called from softint anymore
and we can use kmem_zalloc with KM_SLEEP for KEY_NEWSP.
---
Comment out unused functions
---
Add test cases that there are SPs but no relevant SAs
---
Don't allow sav->lft_c to be NULL

lft_c of an sav that was created by SADB_GETSPI could be NULL.
---
Clean up clunky eval strings

- Remove unnecessary \ at EOL
- This allows to omit ; too
- Remove unnecessary quotes for arguments of atf_set
- Don't expand $DEBUG in eval
- We expect it's expanded on execution

Suggested by kre@
---
Remove unnecessary KEY_FREESAV in an error path

sav should be freed (unreferenced) by the caller.
---
Use pslist(9) for sahtree
---
Use pslist(9) for sah->savtree
---
Rename local variable newsah to sah

It may not be new.
---
MP-ify SAD slightly

- Introduce key_sa_mtx and use it for some list operations
- Use pserialize for some list iterations
---
Introduce KEY_SA_UNREF and replace KEY_FREESAV with it where sav will never be actually freed in the future

KEY_SA_UNREF is still key_freesav so no functional change for now.

This change reduces diff of further changes.
---
Remove out-of-date log output

Pointed out by riastradh@
---
Use KDASSERT instead of KASSERT for mutex_ownable

Because mutex_ownable is too heavy to run in a fast path
even for DIAGNOSTIC + LOCKDEBUG.

Suggested by riastradh@
---
Assemble global lists and related locks into cache lines (NFCI)

Also rename variable names from *tree to *list because they are
just lists, not trees.

Suggested by riastradh@
---
Move locking notes
---
Update the locking notes

- Add locking order
- Add locking notes for misc lists such as reglist
- Mention pserialize, key_sp_ref and key_sp_unref on SP operations

Requested by riastradh@
---
Describe constraints of key_sp_ref and key_sp_unref

Requested by riastradh@
---
Hold key_sad.lock on SAVLIST_WRITER_INSERT_TAIL
---
Add __read_mostly to key_psz

Suggested by riastradh@
---
Tweak wording (pserialize critical section => pserialize read section)

Suggested by riastradh@
---
Add missing mutex_exit
---
Fix setkey -D -P outputs

The outputs were tweaked (by me), but I forgot updating libipsec
in my local ATF environment...
---
MP-ify SAD (key_sad.sahlist and sah entries)

localcount(9) is used to protect key_sad.sahlist and sah entries
as well as SPD (and will be used for SAD sav).

Please read the locking notes of SAD for more details.
---
Introduce key_sa_refcnt and replace sav->refcnt with it (NFC)
---
Destroy sav only in the loop for DEAD sav
---
Fix KASSERT(solocked(sb->sb_so)) failure in sbappendaddr that is called eventually from key_sendup_mbuf

If key_sendup_mbuf isn't passed a socket, the assertion fails.
Originally in this case sb->sb_so was softnet_lock and callers
held softnet_lock so the assertion was magically satisfied.
Now sb->sb_so is key_so_mtx and also softnet_lock isn't always
held by callers so the assertion can fail.

Fix it by holding key_so_mtx if key_sendup_mbuf isn't passed a socket.

Reported by knakahara@
Tested by knakahara@ and ozaki-r@
---
Fix locking notes of SAD
---
Fix deadlock between key_sendup_mbuf called from key_acquire and localcount_drain

If we call key_sendup_mbuf from key_acquire that is called on packet
processing, a deadlock can happen like this:
- At key_acquire, a reference to an SP (and an SA) is held
- key_sendup_mbuf will try to take key_so_mtx
- Some other thread may try to localcount_drain to the SP with
holding key_so_mtx in say key_api_spdflush
- In this case localcount_drain never returns because key_sendup_mbuf,
which is stuck on key_so_mtx, never releases its reference to the SP

Fix the deadlock by deferring key_sendup_mbuf to the timer
(key_timehandler).
---
Fix that prev isn't cleared on retry
---
Limit the number of mbufs queued for deferred key_sendup_mbuf

Hundreds of mbufs can easily be queued on the list under heavy
network load.
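
One way to picture such a limit is a fixed-capacity queue that refuses new entries once full. The sketch below is only an illustration under stated assumptions: the names, the cap of 4, and the choice to reject (rather than, say, drop the oldest entry) are all hypothetical and not taken from the actual change.

```c
#include <stdbool.h>
#include <stddef.h>

#define DEFERRED_MAX 4	/* hypothetical cap, chosen only for illustration */

/* Hypothetical bounded FIFO of deferred messages (a ring buffer). */
struct deferred_queue {
	void *items[DEFERRED_MAX];
	size_t head;	/* index of the oldest entry */
	size_t count;	/* number of queued entries */
};

/* Returns false when the cap is reached; the caller would then drop m. */
static bool
deferred_enqueue(struct deferred_queue *q, void *m)
{
	if (q->count >= DEFERRED_MAX)
		return false;
	q->items[(q->head + q->count) % DEFERRED_MAX] = m;
	q->count++;
	return true;
}

/* Returns the oldest entry, or NULL if the queue is empty. */
static void *
deferred_dequeue(struct deferred_queue *q)
{
	void *m;

	if (q->count == 0)
		return NULL;
	m = q->items[q->head];
	q->head = (q->head + 1) % DEFERRED_MAX;
	q->count--;
	return m;
}
```

A timer-driven consumer would call deferred_dequeue until it returns NULL, so the memory held by pending messages stays bounded no matter how fast producers run.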
---
MP-ify SAD (savlist)

localcount(9) is used to protect savlist of sah. The basic design is
similar to MP-ifications of SPD and SAD sahlist. Please read the
locking notes of SAD for more details.
---
Simplify ipsec_reinject_ipstack (NFC)
---
Add per-CPU rtcache to ipsec_reinject_ipstack

It reduces route lookups and also reduces rtcache lock contentions
when NET_MPSAFE is enabled.
---
Use pool_cache(9) instead of pool(9) for tdb_crypto objects

The change improves network throughput especially on multi-core systems.
---
Update

ipsec(4), opencrypto(9) and vlan(4) are now MP-safe.
---
Write known issues on scalability
---
Share a global dummy SP between PCBs

It is never changed so it can be pre-allocated and shared safely between PCBs.
---
Fix race condition on the rawcb list shared by rtsock and keysock

keysock now protects itself by its own mutex, which means that
the rawcb list is protected by two different mutexes (keysock's one
and softnet_lock for rtsock), of course it's useless.

Fix the situation by having a discrete rawcb list for each.
---
Use a dedicated mutex for rt_rawcb instead of softnet_lock if NET_MPSAFE
---
fix localcount leak in sav. fixed by ozaki-r@n.o.

I commit on behalf of him.
---
remove unnecessary comment.
---
Fix deadlock between pserialize_perform and localcount_drain

A typical usage of localcount_drain looks like this:

mutex_enter(&mtx);
item = remove_from_list();
pserialize_perform(psz);
localcount_drain(&item->localcount, &cv, &mtx);
mutex_exit(&mtx);

This sequence can cause a deadlock which happens for example on the following
situation:

- Thread A calls localcount_drain which calls xc_broadcast after releasing
a specified mutex
- Thread B enters the sequence and calls pserialize_perform with holding
the mutex while pserialize_perform also calls xc_broadcast
- Thread C (xc_thread) that calls an xcall callback of localcount_drain tries
to hold the mutex

xc_broadcast of thread B doesn't start until xc_broadcast of thread A
finishes, which is a feature of xcall(9). This means that pserialize_perform
never completes until xc_broadcast of thread A finishes. On the other hand,
thread C that is a callee of xc_broadcast of thread A sticks on the mutex.
Finally the threads block each other (A blocks B, B blocks C and C blocks A).

A possible fix is to serialize executions of the above sequence by another
mutex, but adding another mutex makes the code complex, so fix the deadlock
by another way; the fix is to release the mutex before pserialize_perform
and instead use a condvar to prevent pserialize_perform from being called
simultaneously.

Note that the deadlock has happened only if NET_MPSAFE is enabled.
---
Add missing ifdef NET_MPSAFE
---
Take softnet_lock on pr_input properly if NET_MPSAFE

Currently softnet_lock is taken unnecessarily in some cases, e.g.,
icmp_input and encap4_input from ip_input, or not taken even if needed,
e.g., udp_input and tcp_input from ipsec4_common_input_cb. Fix them.

NFC if NET_MPSAFE is disabled (default).
---
- sanitize key debugging so that we don't print extra newlines or unassociated
debugging messages.
- remove unused functions and make internal ones static
- print information in one line per message
---
humanize printing of ip addresses
---
cast reduction, NFC.
---
Fix typo in comment
---
Pull out ipsec_fill_saidx_bymbuf (NFC)
---
Don't abuse key_checkrequest just for looking up sav

It does more than expected for example key_acquire.
---
Fix SP is broken on transport mode

isr->saidx was modified accidentally in ipsec_nextisr.

Reported by christos@
Helped investigations by christos@ and knakahara@
---
Constify isr at many places (NFC)
---
Include socketvar.h for softnet_lock
---
Fix buffer length for ipsec_logsastr
 1.184.2.5  18-Jan-2019  pgoyette Synch with HEAD
 1.184.2.4  30-Sep-2018  pgoyette Sync with HEAD
 1.184.2.3  06-Sep-2018  pgoyette Sync with HEAD

Resolve a couple of conflicts (result of the uimin/uimax changes)
 1.184.2.2  02-May-2018  pgoyette Synch with HEAD
 1.184.2.1  30-Mar-2018  pgoyette Resolve conflicts between branch and HEAD
 1.186.2.1  10-Jun-2019  christos Sync with HEAD
 1.192.4.1  03-Apr-2021  thorpej Sync with HEAD.
 1.195.4.1  01-Aug-2021  thorpej Sync with HEAD.
