History log of /src/sys/netinet/tcp_output.c
Revision  Date  Author  Comments
 1.222  08-Sep-2024  rillig fix a/an grammar in obvious cases
 1.221  05-Jul-2024  rin sys: Drop redundant NULL check before m_freem(9)

m_freem(9) has safely accepted a NULL argument since at least 4.2BSD:
https://www.tuhs.org/cgi-bin/utree.pl?file=4.2BSD/usr/src/sys/sys/uipc_mbuf.c

Compile-tested on amd64/ALL.

Suggested by knakahara@
 1.220  29-Jun-2024  riastradh branches: 1.220.2;
netinet: Use _NET_STAT* API instead of direct array access.

PR kern/58380
 1.219  13-Sep-2023  bouyer Handle EHOSTDOWN the same way as EHOSTUNREACH and ENETDOWN for established
connections. Avoid premature end of tcp connection with "Host is down" error
in case of transient link-layer failure.
Discussed and patch proposed in
http://mail-index.netbsd.org/tech-net/2023/09/11/msg008610.html
and followups.
 1.218  04-Nov-2022  ozaki-r branches: 1.218.2;
inpcb: rename functions to in6pcb_*
 1.217  04-Nov-2022  ozaki-r inpcb: rename functions to inpcb_*

Inspired by rmind-smpnet patches.
 1.216  28-Oct-2022  ozaki-r inpcb: separate inpcb again to reduce the size of PCB for IPv4

The data size of PCB for IPv4 increased because of the merge of
struct in6pcb. The change decreases the size to the original size by
separating struct inpcb (again). struct in4pcb and in6pcb that embed
struct inpcb are introduced.

Even after the separation, users don't need to realize the separation
and only have to use some macros to access dedicated data. For example,
inp->inp_laddr is now accessed through in4p_laddr(inp).
 1.215  28-Oct-2022  ozaki-r inpcb: integrate data structures of PCB into one

Data structures of network protocol control blocks (PCBs), i.e.,
struct inpcb, in6pcb and inpcb_hdr, are not organized well. Users of
the data structures have to handle them separately and thus the code
is cluttered and duplicated.

The commit integrates the data structures into one, struct inpcb. As a
result, users of PCBs only have to handle just one data structure, so
the code becomes simple.

One drawback is that the data size of PCB for IPv4 increases by 40 bytes
(from 248 bytes to 288 bytes).
 1.214  30-Dec-2021  andvar s/bandwith/bandwidth/
 1.213  12-Jun-2020  roy Remove in-kernel handling of Router Advertisements

This is much better handled by a user-land tool.
Proposed on tech-net here:
https://mail-index.netbsd.org/tech-net/2020/04/22/msg007766.html

Note that the ioctl SIOCGIFINFO_IN6 no longer sets flags. That now
needs to be done using the pre-existing SIOCSIFINFO_FLAGS ioctl.

Compat is fully provided where it makes sense, but trying to turn on
RA handling will obviously throw an error as it no longer exists.

Note that if you use IPv6 temporary addresses, this now needs to be
turned on in dhcpcd.conf(5) rather than in sysctl.conf(5).
 1.212  17-Nov-2019  mlelstv Don't allow zero sized segments that will panic the stack.
Reported-by: syzbot+5542516fa4afe7a101e6@syzkaller.appspotmail.com
 1.211  25-Feb-2019  maxv Improve panic messages.
 1.210  27-Dec-2018  maxv Remove unused arguments.
 1.209  03-Sep-2018  riastradh Rename min/max -> uimin/uimax for better honesty.

These functions are defined on unsigned int. The generic name
min/max should not silently truncate to 32 bits on 64-bit systems.
This is purely a name change -- no functional change intended.

HOWEVER! Some subsystems have

#define min(a, b) ((a) < (b) ? (a) : (b))
#define max(a, b) ((a) > (b) ? (a) : (b))

even though our standard name for that is MIN/MAX. Although these
may invite multiple evaluation bugs, these do _not_ cause integer
truncation.

To avoid `fixing' these cases, I first changed the name in libkern,
and then compile-tested every file where min/max occurred in order to
confirm that it failed -- and thus confirm that nothing shadowed
min/max -- before changing it.

I have left a handful of bootloaders that are too annoying to
compile-test, and some dead code:

cobalt ews4800mips hp300 hppa ia64 luna68k vax
acorn32/if_ie.c (not included in any kernels)
macppc/if_gm.c (superseded by gem(4))

It should be easy to fix the fallout once identified -- this way of
doing things fails safe, and the goal here, after all, is to _avoid_
silent integer truncations, not introduce them.

Maybe one day we can reintroduce min/max as type-generic things that
never silently truncate. But we should avoid doing that for a while,
so that existing code has a chance to be detected by the compiler for
conversion to uimin/uimax without changing the semantics until we can
properly audit it all. (Who knows, maybe in some cases integer
truncation is actually intended!)
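
The truncation hazard described in this entry can be illustrated with a small sketch. The names uimin_sketch, MIN_SKETCH and min_via_uint are hypothetical stand-ins, not the libkern definitions, and the example assumes a platform where unsigned int is 32 bits wide:

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for a min() defined on unsigned int only (hypothetical name). */
static unsigned int
uimin_sketch(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/* Type-preserving macro; may evaluate an argument twice, but never narrows. */
#define MIN_SKETCH(a, b) ((a) < (b) ? (a) : (b))

/*
 * Route 64-bit values through the unsigned int helper.  The casts make
 * explicit the narrowing that a plain call would perform silently.
 */
static uint64_t
min_via_uint(uint64_t a, uint64_t b)
{
	return uimin_sketch((unsigned int)a, (unsigned int)b);
}

/* Usage: assumes unsigned int is 32 bits wide. */
void
truncation_demo(void)
{
	uint64_t big = UINT64_C(0x100000001);	/* 2^32 + 1 */
	uint64_t small = 5;

	/* big is narrowed to 1, so the helper reports 1 instead of 5. */
	assert(min_via_uint(big, small) == 1);
	/* The macro compares at full width and returns 5. */
	assert(MIN_SKETCH(big, small) == 5);
}
```

Renaming the function does not change this behavior; it only makes the unsigned int domain visible at every call site.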
 1.208  17-May-2018  maxv branches: 1.208.2;
Remove reference to tcpiphdr in comment.
 1.207  07-May-2018  uwe Fix unsigned wraparound on window size calculations.

This is another instance where tp->rcv_adv - tp->rcv_nxt can wrap
around after successful zero-window probe from the peer. The first
one was fixed by chs@ in revision 1.112 on 2004-05-08.

While here, CSE and de-obfuscate the code a bit.
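
A minimal sketch of the wraparound, assuming 32-bit sequence numbers. The function names are hypothetical and the guard shown is one common way to clamp the difference, not necessarily the code committed in this revision:

```c
#include <assert.h>
#include <stdint.h>

/* Naive difference: wraps to a huge value once rcv_nxt passes rcv_adv. */
static uint32_t
adv_window_naive(uint32_t rcv_adv, uint32_t rcv_nxt)
{
	return rcv_adv - rcv_nxt;
}

/*
 * Guarded difference: view the modular result as signed and clamp at zero.
 * The uint32_t to int32_t conversion is implementation-defined in C, but
 * yields the two's complement value on common compilers.
 */
static uint32_t
adv_window_guarded(uint32_t rcv_adv, uint32_t rcv_nxt)
{
	int32_t diff = (int32_t)(rcv_adv - rcv_nxt);

	return diff > 0 ? (uint32_t)diff : 0;
}

void
window_wrap_demo(void)
{
	/* rcv_nxt is one byte past rcv_adv, as after an accepted probe. */
	assert(adv_window_naive(1000, 1001) == UINT32_MAX);
	assert(adv_window_guarded(1000, 1001) == 0);
	/* The normal case is unchanged. */
	assert(adv_window_guarded(2000, 1000) == 1000);
}
```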
 1.206  03-May-2018  maxv Remove now unused tcpip.h includes. Some were already unused before.
 1.205  03-Apr-2018  maxv bcopy -> memcpy, it's obvious the areas don't overlap.
 1.204  01-Apr-2018  maxv Change the check to be <= instead of <. This fixes one occurrence of an
apparently widespread division-by-zero bug in our TCP code: if a user adds
huge IPv6 options with setsockopt, and if the total size of the options
happens to be equal to the available space calculated for the TCP payload,
t_segsz gets set to zero, and given that we then divide several things by
it, the kernel crashes.
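
The effect of the comparison can be sketched as follows. The helper names and the fallback parameter are assumptions for illustration; the entry does not say what value the kernel uses when the options leave no room:

```c
#include <assert.h>

/*
 * Payload bytes left after options, or fallback when options leave no room.
 * With '<' instead of '<=', avail == optlen would pass the check and the
 * function would return 0.
 */
static int
segsz_after_options(int avail, int optlen, int fallback)
{
	if (avail <= optlen)
		return fallback;
	return avail - optlen;
}

/* Number of segments needed for len bytes; this is a division by segsz. */
static int
segments_needed(int len, int segsz)
{
	assert(segsz > 0);	/* a zero segment size would divide by zero */
	return (len + segsz - 1) / segsz;
}
```

For example, segments_needed(3000, segsz_after_options(1220, 20, 1)) evaluates to 3, and the boundary case segsz_after_options(1220, 1220, 1) returns the fallback rather than zero.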
 1.203  01-Apr-2018  maxv Reorder and style, for clarity.
 1.202  30-Mar-2018  maxv Remove dead code. It was introduced in rev1 (25 years ago), and is
irrelevant today.
 1.201  30-Mar-2018  maxv Style, use NULL for pointers, use KASSERT, and don't inline huge functions,
we want to debug them with DDB (and not just with GPROF).
 1.200  29-Mar-2018  maxv Remove #ifdef INET. Same as tcp_input.c. Makes the code easier to
understand.

Also make tcp6_mtudisc() static in tcp_subr.c.
 1.199  10-Mar-2018  khorben Fix spello in a comment
 1.198  12-Feb-2018  maxv branches: 1.198.2;
Remove unused argument from tcp_signature_getsav.
 1.197  03-Aug-2017  ozaki-r Introduce KEY_SA_UNREF and replace KEY_FREESAV with it where sav will never be actually freed in the future

KEY_SA_UNREF is still key_freesav so no functional change for now.

This change reduces diff of further changes.
 1.196  02-Jun-2017  ozaki-r branches: 1.196.2;
Assert inph_locked on ipsec_pcb_skip_ipsec (was IPSEC_PCB_SKIP_IPSEC)

The assertion confirms SP caches are accessed under inph lock (solock).
 1.195  03-Mar-2017  ozaki-r Pass inpcb/in6pcb instead of socket to ip_output/ip6_output

- Passing a socket to Layer 3 is layer violation and even unnecessary
- The change makes codes of callers and IPsec a bit simple
 1.194  04-Jan-2017  martin branches: 1.194.2;
Fix optlen calculation for the SACK block - 2 bytes too few were
calculated, causing corruption in PR kern/51767.
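
For reference, a SACK option occupies a kind octet, a length octet, and 8 bytes per block. The entry does not say which two bytes were missed; the sketch below only shows the standard length arithmetic, with hypothetical names:

```c
#include <assert.h>

#define SACK_HDR_LEN	2	/* kind octet + length octet */
#define SACK_BLOCK_LEN	8	/* left edge + right edge, 32 bits each */

/* Length in bytes of a SACK option carrying nblocks blocks. */
static int
sack_optlen(int nblocks)
{
	assert(nblocks >= 1 && nblocks <= 4);	/* at most 4 fit in 40 bytes */
	return SACK_HDR_LEN + nblocks * SACK_BLOCK_LEN;
}

/* Round an option length up to the next 32-bit boundary. */
static int
optlen_padded(int optlen)
{
	return (optlen + 3) & ~3;
}
```

One block therefore needs 10 bytes, or 12 once padded; leaving out the 2-byte header undercounts every SACK option by exactly that amount.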
 1.193  04-Jan-2017  kre Remove redundant tests: if optlen == 0, then optlen % 4 != 2 (it is 0)
so there is no need to test both.
 1.192  03-Jan-2017  christos use symbolic constants; no functional change.
 1.191  03-Jan-2017  christos put it the way we had it before; since we check for the resulting size after
we added the extra space we can be equal to the size of the buffer.
 1.190  03-Jan-2017  christos fix off-by-one
 1.189  02-Jan-2017  christos make sure that the reset label is defined without TCP_SIGNATURE.
 1.188  02-Jan-2017  christos Fix TCP signature code:
1. pack options more tightly instead of being generous with no/op
2. put TCP_SIGNATURE option before SACK
3. fix computation of options length, by deferring it
XXX: Really we should move the options setting code in one place instead
of having two copies one for input and one for output.
XXX: tcp_optlen/tcp_hdrsiz need to be fixed; they were wrong before too.
 1.187  08-Dec-2016  ozaki-r Add rtcache_unref to release points of rtentry stemming from rtcache

In the MP-safe world, a rtentry stemming from a rtcache can be freed at any
point. So we need to protect rtentries somehow, say by reference counting or
passive references. Regardless of the method, we need to call some release
function of a rtentry after using it.

The change adds a new function rtcache_unref to release a rtentry. At this
point, this function does nothing because for now we don't add a reference
to a rtentry when we get one from a rtcache. We will add something useful
in a further commit.

This change is a part of changes for MP-safe routing table. It is separated
to avoid one big change that is difficult to debug by bisecting.
 1.186  10-Jun-2016  ozaki-r branches: 1.186.2;
Introduce m_set_rcvif and m_reset_rcvif

The API is used to set (or reset) a received interface of a mbuf.
They are counterpart of m_get_rcvif, which will come in another
commit, hide internal of rcvif operation, and reduce the diff of
the upcoming change.

No functional change.
 1.185  24-Aug-2015  pooka sprinkle _KERNEL_OPT
 1.184  24-Jul-2015  matt If we are sending a window probe and there's unacked data in the socket, make
sure at least the persist timer is running.
 1.183  16-May-2015  kefren Don't put segment on the wire if security request can't be fulfilled
 1.182  27-Apr-2015  christos Apply Revision 220794 from FreeBSD to avoid dup ACKs:

When checking to see if a window update should be sent to the remote peer,
don't force a window update if the window would not actually grow due to
window scaling. Specifically, if the window scaling factor is larger than
2 * MSS, then after the local reader has drained 2 * MSS bytes from the
socket, a window update can end up advertising the same window. If this
happens, the supposed window update actually ends up being a duplicate ACK.
This can result in an excessive number of duplicate ACKs when using a
higher maximum socket buffer size.

Pointed out by Ricky Charlet, in tech-net.
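
The check described in this entry can be sketched as a comparison of the scaled values. The function name is hypothetical and this is an illustration of the idea, not the imported FreeBSD code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True when the peer would see a larger window after scaling. */
static bool
window_update_visible(uint32_t oldwin, uint32_t newwin, unsigned int rcv_scale)
{
	assert(rcv_scale <= 14);	/* largest shift TCP window scaling allows */
	return (newwin >> rcv_scale) > (oldwin >> rcv_scale);
}
```

With rcv_scale = 14 the advertised unit is 16384 bytes, so draining 2 * 1460 bytes from a 65536-byte window yields window_update_visible(65536, 68456, 14) == false: both values scale to 4, and sending the update would only produce a duplicate ACK.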
 1.181  27-Apr-2015  ozaki-r Introduce in6_selecthlim_rt to consolidate an idiom for rt->rt_ifp

It consolidates a scattered routine:
(rt = rtcache_validate(&in6p->in6p_route)) != NULL ? rt->rt_ifp : NULL
 1.180  14-Feb-2015  he Port over the TCP_INFO socket option from FreeBSD, originally from
the Linux 2.6 TCP API. This permits the caller to query certain information
about a TCP connection, and is used by pkgsrc's net/iperf3 test program
if available.

This extends struct tcpcb with three fields to count retransmits,
out-of-sequence receives and zero window announcements, and will
therefore warrant a kernel revision bump (done separately).
 1.179  10-Nov-2014  maxv branches: 1.179.2;
Do not uselessly include <sys/malloc.h>.
 1.178  25-Oct-2014  christos Avoid stack overflow when SACK and TCP_SIGNATURE are both present. Thanks
to Jonathan Looney for pointing this out.
 1.177  21-Oct-2014  hikaru Fix wrong condition checking TSO capability.
ipsec_used is not necessary condition.
IPsec outbound policy will not be checked when ipsec_used is false.
 1.176  30-May-2014  christos branches: 1.176.2;
Introduce 2 new variables: ipsec_enabled and ipsec_used.
ipsec_enabled is controlled by sysctl and determines if IPsec is allowed.
ipsec_used is set automatically based on ipsec being enabled, and
rules existing.
 1.175  05-Jun-2013  christos branches: 1.175.2; 1.175.6;
IPSEC has not come in two speeds for a long time now (IPSEC == kame,
FAST_IPSEC). Make everything refer to IPSEC to avoid confusion.
 1.174  22-Mar-2012  drochner branches: 1.174.2;
remove KAME IPSEC, replaced by FAST_IPSEC
 1.173  31-Dec-2011  christos branches: 1.173.2; 1.173.6; 1.173.8;
- fix offsetof usage, and redundant defines
- kill pointer casts to 0
 1.172  19-Dec-2011  drochner rename the IPSEC in-kernel CPP variable and config(8) option to
KAME_IPSEC, and make IPSEC define it so that existing kernel
config files work as before
Now the default can easily be changed to FAST_IPSEC just by
setting the IPSEC alias to FAST_IPSEC.
 1.171  14-Apr-2011  yamt branches: 1.171.4; 1.171.8;
simplify a compile-time assertion
 1.170  21-Mar-2011  matt Clean up setting ECN bit in TOS. Fixes PR 44742
 1.169  26-Jan-2010  pooka branches: 1.169.4; 1.169.6;
tcp sockbuf autoscaling was initially added turned off because it
was experimental. People (including myself) have been running with
it turned on for eons now, so flip the default to enabled.
 1.168  18-Mar-2009  cegger bzero -> memset
 1.167  28-Apr-2008  martin branches: 1.167.8; 1.167.10; 1.167.14; 1.167.16; 1.167.20;
Remove clause 3 and 4 from TNF licenses
 1.166  12-Apr-2008  thorpej branches: 1.166.2; 1.166.4;
Make IP, TCP, UDP, and ICMP statistics per-CPU. The stats are collated
when the user requests them via sysctl.
 1.165  08-Apr-2008  thorpej Change TCP stats from a structure to an array of uint64_t's.

Note: This is ABI-compatible with the old tcpstat structure; old netstat
binaries will continue to work properly.
 1.164  14-Jan-2008  dyoung branches: 1.164.6;
Use rtcache_validate() instead of rtcache_getrt(). Shorten staircase
in in_losing().
 1.163  20-Dec-2007  dyoung Poison struct route->ro_rt uses in the kernel by changing the name
to _ro_rt. Use rtcache_getrt() to access a route cache's struct
rtentry *.

Introduce struct ifnet->if_dl that always points at the interface
identifier/link-layer address. Make code that treated the first
ifaddr on struct ifnet->if_addrlist as the interface address use
if_dl, instead.

Remove stale debugging code from net/route.c. Move the rtflush()
code into rtcache_clear() and delete rtflush(). Delete rtalloc(),
because nothing uses it any more.

Make ND6_HINT an inline, lowercase subroutine, nd6_hint.

I've done my best to convert IP Filter, the ISO stack, and the
AppleTalk stack to rtcache_getrt(). They compile, but I have not
tested them. I have given the changes to PF, GRE, IPv4 and IPv6
stacks a lot of exercise.
 1.162  02-Sep-2007  dyoung branches: 1.162.6; 1.162.8; 1.162.12;
m_copy() was deprecated, apparently, long ago. m_copy(...) ->
m_copym(..., M_DONTWAIT).
 1.161  02-Aug-2007  yamt branches: 1.161.2; 1.161.4; 1.161.6;
make rfbuf_ts a tcp timestamp so that calculations in tcp_input make sense.
 1.160  02-Aug-2007  rmind TCP socket buffers automatic sizing - ported from FreeBSD.
http://mail-index.netbsd.org/tech-net/2007/02/04/0006.html

! Disabled by default, marked as experimental. Testers are very needed.
! Someone should thoroughly test this, and improve if possible.

Discussed on <tech-net>:
http://mail-index.netbsd.org/tech-net/2007/07/12/0002.html
Thanks Greg Troxel for comments.

OK by the long silence on <tech-net>.
 1.159  18-May-2007  riz branches: 1.159.2;
Fix compilation in the TCP_SIGNATURE case:

- don't use void * for pointer arithmetic
- don't try to modify const parameters

A kernel with 'options TCP_SIGNATURE' works as well as it ever did, now.
(ie, clunky, but passable)
 1.158  02-May-2007  dyoung Eliminate address family-specific route caches (struct route, struct
route_in6, struct route_iso), replacing all caches with a struct
route.

The principal benefit of this change is that all of the protocol
families can benefit from route cache-invalidation, which is
necessary for correct routing. Route-cache invalidation fixes an
ancient PR, kern/3508, at long last; it fixes various other PRs,
also.

Discussions with and ideas from Joerg Sonnenberger influenced this
work tremendously. Of course, all design oversights and bugs are
mine.

DETAILS

1 I added to each address family a pool of sockaddrs. I have
introduced routines for allocating, copying, duplicating,
and freeing sockaddrs:

struct sockaddr *sockaddr_alloc(sa_family_t af, int flags);
struct sockaddr *sockaddr_copy(struct sockaddr *dst,
const struct sockaddr *src);
struct sockaddr *sockaddr_dup(const struct sockaddr *src, int flags);
void sockaddr_free(struct sockaddr *sa);

sockaddr_alloc() returns either a sockaddr from the pool belonging
to the specified family, or NULL if the pool is exhausted. The
returned sockaddr has the right size for that family; sa_family
and sa_len fields are initialized to the family and sockaddr
length---e.g., sa_family = AF_INET and sa_len = sizeof(struct
sockaddr_in). sockaddr_free() puts the given sockaddr back into
its family's pool.

sockaddr_dup() and sockaddr_copy() work analogously to strdup()
and strcpy(), respectively. sockaddr_copy() KASSERTs that the
family of the destination and source sockaddrs are alike.

The 'flags' argument for sockaddr_alloc() and sockaddr_dup() is
passed directly to pool_get(9).

2 I added routines for initializing sockaddrs in each address
family, sockaddr_in_init(), sockaddr_in6_init(), sockaddr_iso_init(),
etc. They are fairly self-explanatory.

3 structs route_in6 and route_iso are no more. All protocol families
use struct route. I have changed the route cache, 'struct route',
so that it does not contain storage space for a sockaddr. Instead,
struct route points to a sockaddr coming from the pool the sockaddr
belongs to. I added a new method to struct route, rtcache_setdst(),
for setting the cache destination:

int rtcache_setdst(struct route *, const struct sockaddr *);

rtcache_setdst() returns 0 on success, or ENOMEM if no memory is
available to create the sockaddr storage.

It is now possible for rtcache_getdst() to return NULL if, say,
rtcache_setdst() failed. I check the return value for NULL
everywhere in the kernel.

4 Each routing domain (struct domain) has a list of live route
caches, dom_rtcache. rtflushall(sa_family_t af) looks up the
domain indicated by 'af', walks the domain's list of route caches
and invalidates each one.
 1.157  04-Mar-2007  christos branches: 1.157.2; 1.157.4;
Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.
 1.156  22-Feb-2007  thorpej TRUE -> true, FALSE -> false
 1.155  21-Feb-2007  thorpej Replace the Mach-derived boolean_t type with the C99 bool type. A
future commit will replace use of TRUE and FALSE with true and false.
 1.154  10-Feb-2007  degroote branches: 1.154.2;
Commit my SoC work
Add ipv6 support for fast_ipsec
Note that currently, packet with extensions headers are not correctly
supported
Change the ipcomp logic
 1.153  25-Nov-2006  yamt branches: 1.153.2; 1.153.4;
move tso-by-software code to their own files. no functional changes.
 1.152  23-Nov-2006  martin Make it compile on IPv4-only kernels
 1.151  23-Nov-2006  yamt implement ipv6 TSO.
partly from Matthias Scheler. tested by him.
 1.150  17-Oct-2006  yamt tcp_output: as a comment in tcp_sack_newack says, actually send
one or two segments on partial acks. even if sack_bytes_rxmt==0,
if we are in fast recovery with sack, snd_cwnd has somewhat special
meaning here. PR/34749.
 1.149  09-Oct-2006  rpaulo Modular (I tried ;-) TCP congestion control API. Whenever certain conditions
happen in the TCP stack, this interface calls the specified callback to
handle the situation according to the currently selected congestion
control algorithm.
A new sysctl node was created: net.inet.tcp.congctl.{available,selected}
with obvious meanings.
The old net.inet.tcp.newreno MIB was removed.
The API is discussed in tcp_congctl(9).

In the near future, it will be possible to select a congestion control
algorithm on a per-socket basis.

Discussed on tech-net and reviewed by <yamt>.
 1.148  08-Oct-2006  yamt tcp_output: don't make TSO duplicate CWR/ECE.
 1.147  08-Oct-2006  yamt tcp_output: don't try to send SACK option larger than txsegsize.
fix a panic like "panic: m_copydata: off 0, len -7".
 1.146  07-Oct-2006  yamt tcp_output: remove duplicated code and tweak indent. no functional changes.
 1.145  01-Oct-2006  dbj back out revision 1.144 calculating txsegsizep since it unmasks
other bugs. See PR kern/34674
 1.144  28-Sep-2006  dbj consider sb_lowat when limiting the transmit length to keep acks on the wire
 1.143  05-Sep-2006  rpaulo branches: 1.143.2; 1.143.4;
Import of TCP ECN algorithm for congestion control.
Both available for IPv4 and IPv6.
Basic implementation test results are available at
http://netbsd-soc.sourceforge.net/projects/ecn/testresults.html.

Work sponsored by the Google Summer of Code project 2006.
Special thanks to Kentaro Kurahone, Allen Briggs and Matt Thomas for their
help, comments and support during the project.
 1.142  25-Mar-2006  seanb Slight simplification of hdr len calculation in tcp_segsize().
No functional change.
 1.141  24-Dec-2005  perry branches: 1.141.4; 1.141.6; 1.141.8; 1.141.10; 1.141.12;
Remove leading __ from __(const|inline|signed|volatile) -- it is obsolete.
 1.140  11-Dec-2005  christos merge ktrace-lwp.
 1.139  10-Aug-2005  yamt wrap INET-only code by #if defined(INET).
 1.138  10-Aug-2005  yamt ipv6 tx checksum offloading. reviewed by Jason Thorpe.
 1.137  19-Jul-2005  christos Implement PMTU checks from:

http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html

1. Don't act on ICMP-need-frag immediately if adhoc checks on the
advertised MTU fail. The MTU update is delayed until a TCP retransmit
happens.
2. Ignore ICMP Source Quench messages meant for TCP connections.

From OpenBSD.
 1.136  28-Jun-2005  drochner branches: 1.136.2;
typo in comment
 1.135  29-May-2005  christos - add const
- remove bogus casts
- avoid nested variables
 1.134  08-May-2005  yamt tcp_output: account FIN when building sack option.
 1.133  08-May-2005  yamt tcp_output: don't try to send more data than we have. PR/30160.
 1.132  08-May-2005  yamt tcp_output: clear TH_FIN where appropriate. related to PR/30160.
 1.131  18-Apr-2005  yamt add a function to handle M_CSUM_TSOv4 by software.
 1.130  18-Apr-2005  yamt fix problems related to loopback interface checksum omission. PR/29971.

- for ipv4, defer decision to ip layer as h/w checksum offloading does
so that it can check the actual interface the packet is going to.
- for ipv6, disable it.
(maybe will be revisited when it implements h/w checksum offloading.)

ok'ed by Jason Thorpe.
 1.129  29-Mar-2005  yamt tcp_output: lock reass queue when building sack.
 1.128  16-Mar-2005  yamt branches: 1.128.2;
simplify data receiver side sack processing.
- introduce t_segqlen, the number of segments in segq/timeq.
the name is from freebsd.
- rather than maintaining a copy of sack blocks (rcv_sack_block[]),
build it directly from the segment list when needed.
 1.127  16-Mar-2005  yamt - use full sized segments unless we actually have SACKs to send.
- avoid TSO duplicate D-SACK.
- send SACKs regardless of TF_ACKNOW.
- don't clear rcv_sack_num when transmitting.

discussed on tech-net@.
 1.126  12-Mar-2005  yamt don't try to use TSO to transmit a single segment.
- there's no benefit.
- rtl8169 seems to be stuck with it.
 1.125  09-Mar-2005  matt For AF_INET, always set m->m_pkthdr.csum_data. Don't or TSOv4, just set it.
 1.124  07-Mar-2005  yamt tcp_sack_option: the max number of sack blocks in a packet is 4, not 3.
 1.123  06-Mar-2005  thorpej Add a /*CONSTCOND*/ to last.
 1.122  06-Mar-2005  matt Fix typo. Opposite of >= is <, not ==.
 1.121  06-Mar-2005  matt Replace some gotos with a do while (0) and breaks. No functional change.
 1.120  06-Mar-2005  matt Add IPv4/TCP hooks for TCP Segment Offload on transmit.
 1.119  02-Mar-2005  mycroft Copyright maintenance.
 1.118  28-Feb-2005  jonathan Commit TCP SACK patches from Kentaro A. Kurahone's patch at:
http://www.sigusr1.org/~kurahone/tcp-sack-netbsd-02152005.diff.gz

Fixes in that patch for pre-existing TCP pcb initializations were already
committed to NetBSD-current, so are not included in this commit.

The SACK patch has been observed to correctly negotiate and respond
to SACKs in wide-area traffic.

There are two independently-observed, as-yet-unresolved anomalies:
First, unexplained delays in fast retransmission
(potentially explainable by a 0.2sec RTT between adjacent
ethernet/wifi NICs); and second, peculiar and unexplained TCP
retransmits observed over an ath0 card.

After discussion with several interested developers, I'm committing
this now, as-is, for more eyes to use and look over. Current hypothesis
is that the anomalies above may in fact be due to link-level (hardware,
driver, HAL, firmware) aberrations in the test setup, affecting both
Kentaro's wired-Ethernet NIC and my two (different) WiFi NICs.
 1.117  26-Feb-2005  perry nuke trailing whitespace
 1.116  03-Feb-2005  perry ANSIfy function declarations
 1.115  15-Dec-2004  thorpej branches: 1.115.2; 1.115.4;
Don't perform checksums on loopback interfaces. They can be reenabled with
the net.inet.*.do_loopback_cksum sysctl.

Approved by: groo
 1.114  20-May-2004  jonathan With FAST_IPSEC, include <netipsec/key.h>, as Itojun's recent changes
now require KEY_FREESAV() to be in scope.
 1.113  18-May-2004  itojun fix MD5 signature support to actually validate inbound signature, and
drop packet if fails.
 1.112  08-May-2004  chs work around an LP64 problem where we report an excessively large window
due to incorrect mixing of types.
 1.111  26-Apr-2004  itojun make TCP MD5 signature work with KAME IPSEC (#define IPSEC).

support IPv6 if KAME IPSEC (RFC is not explicit about how we make data stream
for checksum with IPv6, but i'm pretty sure using normal pseudo-header is the
right thing).

XXX
current TCP MD5 signature code has giant flaw:
it does not validate signature on input (can't believe it! what is the point?)
 1.110  25-Apr-2004  jonathan Initial commit of a port of the FreeBSD implementation of RFC 2385
(MD5 signatures for TCP, as used with BGP). Credit for original
FreeBSD code goes to Bruce M. Simpson, with FreeBSD sponsorship
credited to sentex.net. Shortening of the setsockopt() name
attributed to Vincent Jardin.

This commit is a minimal, working version of the FreeBSD code, as
MFC'ed to FreeBSD-4. It has received minimal testing with a ttcp
modified to set the TCP-MD5 option; BMS's additions to tcpdump-current
(tcpdump -M) confirm that the MD5 signatures are correct. Committed
as-is for further testing between a NetBSD BGP speaker (e.g., quagga)
and industry-standard BGP speakers (e.g., Cisco, Juniper).


NOTE: This version has two potential flaws. First, I do not see any code
that verifies received TCP-MD5 signatures. Second, the TCP-MD5
options are internally padded and assumed to be 32-bit aligned. A more
space-efficient scheme is to pack all TCP options densely (and
possibly unaligned) into the TCP header ; then do one final padding to
a 4-byte boundary. Pre-existing comments note that accounting for
TCP-option space when we add SACK is yet to be done. For now, I'm
punting on that; we can solve it properly, in a way that will handle
SACK blocks, as a separate exercise.

In case a pullup to NetBSD-2 is requested, this adds sys/netipsec/xform_tcp.c,
and modifies:

sys/net/pfkeyv2.h,v 1.15
sys/netinet/files.netinet,v 1.5
sys/netinet/ip.h,v 1.25
sys/netinet/tcp.h,v 1.15
sys/netinet/tcp_input.c,v 1.200
sys/netinet/tcp_output.c,v 1.109
sys/netinet/tcp_subr.c,v 1.165
sys/netinet/tcp_usrreq.c,v 1.89
sys/netinet/tcp_var.h,v 1.109
sys/netipsec/files.netipsec,v 1.3
sys/netipsec/ipsec.c,v 1.11
sys/netipsec/ipsec.h,v 1.7
sys/netipsec/key.c,v 1.11
share/man/man4/tcp.4,v 1.16
lib/libipsec/pfkey.c,v 1.20
lib/libipsec/pfkey_dump.c,v 1.17
lib/libipsec/policy_token.l,v 1.8
sbin/setkey/parse.y,v 1.14
sbin/setkey/setkey.8,v 1.27
sbin/setkey/token.l,v 1.15

Note that the preceding two revisions to tcp.4 will be
required to cleanly apply this diff.
 1.109  30-Mar-2004  christos Make sure we disarm the persist timer before we arm the rexmit
timer, otherwise there is a tiny window where both timers are
active, and this is not correct according to the comments in the
code. I believe that this is the cause of the to_ticks <= 0 assertion
failure in callout_schedule() that I've been getting.
 1.108  03-Mar-2004  thorpej branches: 1.108.2;
Use IPSEC_PCB_SKIP_IPSEC() to short-circuit calls to ipsec{4,6}_hdrsiz_tcp().
 1.107  04-Feb-2004  itojun deal with IPv6 path MTU < 1280 (RFC2460 section 5 last paragraph).
check if there really is room for TCP data.
 1.106  12-Nov-2003  ragge Remove the FAST_MBSEARCH ifdef, send packet prediction is now default.
 1.105  24-Oct-2003  ragge Fix the bug in the tcp transmit prediction code.
During testing the prediction counters show a hit-rate of about 85% for
packets sent on a local LAN, and better than 99% for intercontinental
high-speed bulk traffic (!).
 1.104  24-Oct-2003  enami Make this file compile again when TCP_OUTPUT_COUNTERS defined.
 1.103  23-Oct-2003  thorpej Oops, FAST_MBSEARCH counters were swapped; fix it. Pointed out by yamt@.
 1.102  21-Oct-2003  thorpej Add event counters that measure FAST_MBSEARCH.
 1.101  22-Aug-2003  itojun remove ipsec_set/getsocket. now we explicitly pass socket * to ip{,6}_output.
 1.100  22-Aug-2003  itojun change the additional arg to be passed to ip{,6}_output to struct socket *.

this fixes KAME policy lookup which was broken by the previous commit.
 1.99  22-Aug-2003  jonathan Replace the set_socket() method of passing an extra struct socket*
argument to ip6_output() with a new explicit struct in6pcb* argument.
(The underlying socket can be obtained via in6pcb->inp6_socket.)

In preparation for fast-ipsec. Reviewed by itojun.
 1.98  15-Aug-2003  jonathan (fast-ipsec): Add hooks to pass IPv4 IPsec traffic into fast-ipsec, if
configured with ``options FAST_IPSEC''. Kernels with KAME IPsec or
with no IPsec should work as before.

All calls to ip_output() now always pass an additional compulsory
argument: the inpcb associated with the packet being sent,
or 0 if no inpcb is available.

Fast-ipsec tested with ICMP or UDP over ESP. TCP doesn't work, yet.
 1.97  07-Aug-2003  agc Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22364, verified by myself.
 1.96  02-Jul-2003  ragge Make the fast-search stuff an option. There are still reports on
problem with it.
 1.95  02-Jul-2003  ragge Fix previous bug. Thanks to Enami for spotting the (obvious) error, and
to other people with much help with bug reports etc.
While fixing, change some of the code I added last time to make it
cleaner and simpler.
 1.94  30-Jun-2003  ragge branches: 1.94.2;
Disable the code I checked in yesterday; reports that samba (!) are crashing
machines with it. Will do some more tests.
 1.93  29-Jun-2003  fvdl Back out the lwp/ktrace changes. They contained a lot of collateral damage,
and need to be examined and discussed more.
 1.92  29-Jun-2003  ragge Add code to remember where in the send queue of mbufs the last packet was
sent from. This change avoids a linear search through all mbufs when using
large TCP windows, and therefore permits high-speed connections over long
distances.

Tested on a 1 Gigabit connection between Luleå and San Francisco, a distance
of about 15000km. With TCP windows of just over 20 Mbytes it could keep up
with 950Mbit/s.

After discussions with Matt Thomas and Jason Thorpe.
 1.91  17-May-2003  itojun no need for ip_v recovery in output path too
(tcp_template includes ip_v setting)
 1.90  01-Mar-2003  thorpej Allow TCP connections to hosts on a local network to use a larger
slow start initial window. Default this larger initial window to
4 packets, allowing it to be adjusted with net.inet.tcp.init_win_local.
 1.89  26-Feb-2003  matt Add MBUFTRACE kernel option.
Do a little mbuf rework while here. Change all uses of MGET*(*, M_WAIT, *)
to m_get*(M_WAIT, *). These are not performance critical and making them
call m_get saves considerable space. Add m_clget analogue of MCLGET and
make corresponding change for M_WAIT uses.
Modify netinet, gem, fxp, tulip, nfs to support MBUFTRACE.
Begin to change netstat to use sysctl.
 1.88  24-Nov-2002  scw Fix a genuine uninitialised variable warning.
 1.87  02-Nov-2002  itojun cleanup ipsec.h dependency. commented by perry, sync w/kame
 1.86  13-Sep-2002  mycroft In the txsegsize bounding code, it is not necessary to adjust for the options
length.
 1.85  20-Aug-2002  thorpej Never send more than half a socket buffer of data. This ensures that
we can always keep 2 packets on the wire, no matter what SO_SNDBUF is,
and therefore ACKs will never be delayed unless we run out of data to
transmit. The problem is quite easy to tickle when the MTU of the
outgoing interface is larger than the socket buffer size (e.g. loopback).

Fix from Charles Hannum.
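
The bound described in 1.85 can be sketched in C roughly as below. This is an illustration under stated assumptions, not the code committed to tcp_output.c: the helper name bound_txsegsize() and its parameters are hypothetical, with sb_hiwat standing for the send buffer size set via SO_SNDBUF.

```c
#include <assert.h>

/*
 * Hypothetical helper: cap the transmit segment size at half of the
 * send socket buffer, so that two segments always fit in the buffer
 * and the peer never has to delay an ACK while waiting for a second one.
 */
static unsigned long
bound_txsegsize(unsigned long txsegsize, unsigned long sb_hiwat)
{
	unsigned long half = sb_hiwat / 2;

	/* A buffer smaller than 2 bytes cannot hold two segments at all. */
	assert(sb_hiwat >= 2);

	return txsegsize < half ? txsegsize : half;
}
```

With a loopback-sized segment (for example 33000 bytes) and a 16384-byte send buffer, the sketch yields 8192-byte segments; an ordinary 1460-byte segment with a 32768-byte buffer is left unchanged.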
 1.84  14-Aug-2002  itojun avoid swapping endian of ip_len and ip_off on mbuf, to meet with M_LEADINGSPACE
optimization made last year. should solve PR 17867 and 10195.

IP_HDRINCL behavior of raw ip socket is kept unchanged. we may want to
provide IP_HDRINCL variant that does not swap endian.
 1.83  13-Jun-2002  thorpej Disable TCP Congestion Window Monitoring by default; there are
performance problems in the face of tinygrams.
 1.82  09-Jun-2002  itojun whitespace
 1.81  29-May-2002  itojun attach nd_ifinfo structure into if_afdata.
split IPv6 link MTU (advertised by RA) from real link MTU.
sync with kame
 1.80  26-May-2002  itojun path MTU discovery blackhole detection.
PR 12790 (sorry for not committing it for a long time)
 1.79  27-Apr-2002  thorpej branches: 1.79.2; 1.79.4;
* Instrument tcp_build_datapkt().
* Remove the code that allocates a cluster if the packet would
fit in one; it totally defeats doing references to M_EXT mbufs
in the socket buffer. This drastically reduces the number of
data copies in the tcp_output() path for applications which use
large writes. Kudos to Matt Thomas for pointing me in the right
direction.
 1.78  01-Mar-2002  thorpej In tcp_segsize(), move a label so that option length is considered
when using the default TCP MSS as well. From Matt Thomas.
 1.77  24-Jan-2002  itojun place NRL copyright notice itself, not a reference to it.
 1.76  03-Dec-2001  jmcneill Fix TCP segment size computation. From Rick Byersm, PR kern/14799.
 1.75  13-Nov-2001  lukem add RCSIDs
 1.74  10-Sep-2001  thorpej Use callouts for TCP timers, rather than traversing the list of
all open TCP connections in tcp_slowtimo() (which is called 2x
per second). It's fairly rare for TCP timers to actually fire,
so saving this list traversal is good, especially if you want
to scale to thousands of open connections.
 1.73  10-Sep-2001  thorpej Change the way receive idle time and round trip time are measured.
Instead of incrementing t_idle and t_rtt in tcp_slowtimo(), we now
take a timestamp (via tcp_now) and use subtraction to compute the
delta when we actually need it (using unsigned arithmetic so that
tcp_now wrapping is handled correctly).

Based on similar changes in FreeBSD.
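
The unsigned-subtraction idea in 1.73 can be sketched in C as follows. This is only an illustration, not the code from the tree: the type tcp_ticks_t and the helper ticks_since() are hypothetical names, and a 32-bit tick counter is assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the type of the tcp_now tick counter. */
typedef uint32_t tcp_ticks_t;

/* The trick relies on modular arithmetic, so the type must be unsigned. */
static_assert((tcp_ticks_t)-1 > 0, "tick counter must be unsigned");

/*
 * Ticks elapsed between taking `stamp` and the current value `now`.
 * Unsigned subtraction is performed modulo 2^32, so the result is
 * still correct if the counter wrapped once after `stamp` was taken.
 */
static tcp_ticks_t
ticks_since(tcp_ticks_t now, tcp_ticks_t stamp)
{
	return (tcp_ticks_t)(now - stamp);
}
```

For instance, a stamp taken at 0xFFFFFFFE and read back when the counter is at 5 gives a delta of 7 ticks, with no special case for the wrap.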
 1.72  10-Sep-2001  thorpej Enable Congestion Window Monitoring by default.
 1.71  10-Sep-2001  thorpej Use a callout for the delayed ACK timer, and delete tcp_fasttimo().
Expose the delayed ACK timer as net.inet.tcp.delack_ticks.
 1.70  31-Jul-2001  thorpej branches: 1.70.2;
Carve off the code that builds a TCP data packet into its own
function, and inline it, except when profiling... so we can
profile it.
 1.69  31-Jul-2001  thorpej Count the number of times we "self-quench" (ip_output() returns
ENOBUFS), and don't inline tcp_segsize() if profiling.
 1.68  26-Jul-2001  thorpej Slight cosmetic change.
 1.67  08-Jul-2001  abs branches: 1.67.2;
Rename TCPDEBUG to TCP_DEBUG, defopt TCP_DEBUG and TCP_NDEBUG, and
make all usage of tcp_trace dependent on TCP_DEBUG - resulting in
a 31K saving on an INET enabled i386 kernel.
 1.66  02-Jun-2001  thorpej Implement support for IP/TCP/UDP checksum offloading provided by
network interfaces. This works by pre-computing the pseudo-header
checksum and caching it, delaying the actual checksum to ip_output()
if the hardware cannot perform the sum for us. In-bound checksums
can either be fully-checked by hardware, or summed up for final
verification by software. This method was modeled after how this
is done in FreeBSD, although the code is significantly different in
most places.

We don't delay checksums for IPv6/TCP, but we do take advantage of the
cached pseudo-header checksum.

Note: hardware-assisted checksumming defaults to "off". It is
enabled with ifconfig(8). See the manual page for details.

Implement hardware-assisted checksumming on the DP83820 Gigabit Ethernet,
3c90xB/3c90xC 10/100 Ethernet, and Alteon Tigon/Tigon2 Gigabit Ethernet.
 1.65  03-Apr-2001  itojun check ip_mtudisc only for TCP over IPv4.
PMTUD is mandatory for TCP over IPv6 (if packets > 1280).
 1.64  20-Mar-2001  thorpej Two changes, designed to make us even more resilient against TCP
ISS attacks (which we already fend off quite well).

1. First-cut implementation of RFC1948, Steve Bellovin's cryptographic
hash method of generating TCP ISS values. Note, this code is experimental
and disabled by default (experimental enough that I don't export the
variable via sysctl yet, either). There are a couple of issues I'd
like to discuss with Steve, so this code should only be used by people
who really know what they're doing.

2. Per a recent thread on Bugtraq, it's possible to determine a system's
uptime by snooping the RFC1323 TCP timestamp options sent by a host; in
4.4BSD, timestamps are created by incrementing the tcp_now variable
at 2 Hz; there's even a company out there that uses this to determine
web server uptime. According to Newsham's paper "The Problem With
Random Increments", while NetBSD's TCP ISS generation method is much
better than the "random increment" method used by FreeBSD and OpenBSD,
it is still theoretically possible to mount an attack against NetBSD's
method if the attacker knows how many times the tcp_iss_seq variable
has been incremented. By not leaking uptime information, we can make
that much harder to determine. So, we avoid the leak by giving each
TCP connection a timebase of 0.
 1.63  24-Jan-2001  itojun branches: 1.63.2;
- record IPsec packet history into m_aux structure.
- let ipfilter look at wire-format packet only (not the decapsulated ones),
so that VPN setting can work with NAT/ipfilter settings.
sync with kame.

TODO: use header history for stricter inbound validation
 1.62  06-Nov-2000  itojun fix IPv4 TTL selection with AF_INET6 API. sync with kame. From: jdc
 1.61  19-Oct-2000  itojun remove #ifdef TCP6. it is not likely for us to bring in sys/netinet6/tcp6*.c
(separate TCP/IPv6 stack) into netbsd-current.
 1.60  17-Oct-2000  itojun be more friendly with INET-less build.
XXX we need to do more to do a working INET-less build
 1.59  17-Oct-2000  thorpej Add an IP_MTUDISC flag to the flags that can be passed to
ip_output(). This flag, if set, causes ip_output() to set
DF in the IP header if the MTU in the route is not locked.

This allows a bunch of redundant code, which I was never
really all that happy about adding in the first place, to
be eliminated.

Inspired by a similar change made by provos@openbsd.org when
he integrated NetBSD's Path MTU Discovery code into OpenBSD.
 1.58  28-Jul-2000  itojun forgot to call tcp6_quench(). sync with kame.
 1.57  30-Jun-2000  itojun remove old mbuf assumption (ip header and tcp header are on the same mbuf).
this is for m_pulldown use. (sync with kame)
 1.56  30-Mar-2000  augustss branches: 1.56.4;
Remove register declarations.
 1.55  01-Mar-2000  itojun introduce m->m_pkthdr.aux to hold random data which needs to be passed
between protocol handlers.

ipsec socket pointers, ipsec decryption/auth information, tunnel
decapsulation information are in my mind - there can be several other usage.
at this moment, we use this for ipsec socket pointer passing. this will
avoid reuse of m->m_pkthdr.rcvif in ipsec code.

due to the change, MHLEN will be decreased by sizeof(void *) - for example,
for i386, MHLEN was 100 bytes, but is now 96 bytes.
we may want to increase MSIZE from 128 to 256 for some of our architectures.

take caution if you use it for keeping some data item for long period
of time - use extra caution on M_PREPEND() or m_adj(), as they may result
in loss of m->m_pkthdr.aux pointer (and mbuf leak).

this will bump kernel version.

(as discussed in tech-net, tested in kame tree)
 1.54  09-Feb-2000  itojun optimize mbuf allocation for ip/tcp/tcpopt part.
 1.53  13-Dec-1999  itojun sync IPv6 part with latest KAME tree. IPsec part is left unmodified
due to massive changes in KAME side.
- IPv6 output goes through nd6_output
- faith can capture IPv4 packets as well - you can run IPv4-to-IPv6 translator
using heavily modified DNS servers
- per-interface statistics (required for IPv6 MIB)
- interface autoconfig is revisited
- udp input handling has a big change for mapped address support.
- introduce in4_cksum() for non-overwriting checksumming
- introduce m_pulldown()
- neighbor discovery cleanups/improvements
- netinet/in.h strictly conforms to RFC2553 (no extra defs visible to userland)
- IFA_STATS is fixed a bit (not tested)
- and more more more.

TODO:
- cleanup os-independency #ifdef
- avoid rcvif dual use (for IPsec) to help ifdetach

(sorry for jumbo commit, I can't separate this any more...)
 1.52  23-Sep-1999  itojun branches: 1.52.2; 1.52.8;
cleanup and correct TCP MSS consideration with IPsec headers.

MSS advertisement must always be:
max(if mtu) - ip hdr siz - tcp hdr siz
We violated this in the previous code so it was fixed.

tcp_mss_to_advertise() now takes af (af on wire) as its argument,
to compute right ip hdr siz.

tcp_segsize() will take care of IPsec header size.
One thing I'm not really sure about is how to handle IPsec header size in
*rxsegsizep (inbound segment size estimation).
The current code subtracts possible *outbound* IPsec size from *rxsegsizep,
hoping that the peer is using the same IPsec policy as me.
It may not be applicable, could a TCP guru please comment...
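
The MSS advertisement rule stated in 1.52 can be written out as a small C sketch. The names below (mss_to_advertise(), enum wire_af, the header-length constants) are hypothetical and chosen for illustration; the real tcp_mss_to_advertise() takes kernel arguments not reproduced here. The sketch assumes option-free headers and that if_mtu is already the largest interface MTU.

```c
#include <assert.h>

/* Fixed header sizes without options, in bytes. */
enum { IPV4_HDR_LEN = 20, IPV6_HDR_LEN = 40, TCP_HDR_LEN = 20 };

/* Hypothetical tag for the address family seen on the wire. */
enum wire_af { WIRE_AF_INET, WIRE_AF_INET6 };

/*
 * MSS to advertise = interface MTU - IP header size - TCP header size.
 * IPsec overhead is deliberately not subtracted here; per the commit
 * message, that is left to the per-segment size computation.
 */
static unsigned int
mss_to_advertise(unsigned int if_mtu, enum wire_af af)
{
	unsigned int iphlen =
	    (af == WIRE_AF_INET6) ? IPV6_HDR_LEN : IPV4_HDR_LEN;

	/* The MTU must leave room for at least one byte of payload. */
	assert(if_mtu > iphlen + TCP_HDR_LEN);

	return if_mtu - iphlen - TCP_HDR_LEN;
}
```

On a 1500-byte Ethernet MTU this gives 1460 for IPv4 and 1440 for IPv6, which is why the address family on the wire has to be passed in.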
 1.51  09-Jul-1999  thorpej defopt IPSEC and IPSEC_ESP (both into opt_ipsec.h).
 1.50  02-Jul-1999  fvdl Fix for -Wuninitialized warnings broke compiles without INET6, refix.
 1.49  02-Jul-1999  itojun avoid "variable not initialized" warnings on some of the platforms.
 1.48  01-Jul-1999  itojun IPv6 kernel code, based on KAME/NetBSD 1.4, SNAP kit 19990628.
(Sorry for a big commit, I can't separate this into several pieces...)
Pls check sys/netinet6/TODO and sys/netinet6/IMPLEMENTATION for details.

- sys/kern: do not assume single mbuf, accept chained mbuf on passing
data from userland to kernel (or other way round).
- "midway" ATM card: ATM PVC pseudo device support, like those done in ALTQ
package (ftp://ftp.csl.sony.co.jp/pub/kjc/).
- sys/netinet/tcp*: IPv4/v6 dual stack tcp support.
- sys/netinet/{ip6,icmp6}.h, sys/net/pfkeyv2.h: IETF document assumes those
files to be there so we patch it up.
- sys/netinet: IPsec additions are here and there.
- sys/netinet6/*: most of IPv6 code sits here.
- sys/netkey: IPsec key management code
- dev/pci/pcidevs: regen

In my understanding no code here is subject to export control so it
should be safe.
 1.47  20-Jan-1999  thorpej branches: 1.47.4; 1.47.6;
Fix a problem pointed out by Charles Hannum; DF wasn't being set in
SYN,ACK packets during Path MTU Discovery. Fix tcp_respond() to do the
appropriate route lookup and set DF as appropriate.

Also, fixup similar code in tcp_output() to relookup the route if it
is down.
 1.46  16-Dec-1998  thorpej Delay sending if SS_MORETOCOME is set in so_state. This avoids the case
where the user issued a write with a length greater than MLEN but less
than MINCLSIZE, thus causing two mbufs to be used. The loop in sosend()
would then call PRU_SEND twice, causing TCP to transmit 2 packets when
it could have transmitted one.

Suggested by Justin Walker <justin@apple.com> on the freebsd-net
mailing list.
 1.45  06-Oct-1998  matt Add a sysctl for newreno (default to off).
 1.44  04-Oct-1998  matt Adapt the NEWRENO changes from the UCSB diffs of BSDI 3.0's TCP
to NetBSD. Ignore the SACK & FACK stuff for now.
 1.43  21-Jul-1998  mycroft Implement a better fix for the `gratuitous FIN' problem, as
mentioned on tcp-impl but with a bit more commentary.
 1.42  17-Jul-1998  thorpej Add a comment wrt. a current issue w/ CWM.
 1.41  17-Jul-1998  thorpej Comment where the Restart Window is computed, and in the non-CWM case,
make sure it never _increases_ cwnd.
 1.40  07-Jul-1998  sommerfe Delete bogus (void) cast of m_freem (which is already a void function..)
 1.39  11-May-1998  thorpej Nuke TUBA per my note to tech-net; there's no reason to keep it around.
 1.38  06-May-1998  thorpej Use macros from tcp_timer.h to manipulate TCP timers, so that their
implementation can be changed easily.
 1.37  02-May-1998  thorpej Correct a comment related to Congestion Window Monitoring.
 1.36  30-Apr-1998  thorpej In the CWM code, don't use the Floyd initial window computation as
the burst size allowed, but rather a fixed number of packets, as
described in the Internet Draft. Default allowed burst is 4 packets,
per the Draft.

Make the use of CWM and the allowed burst size tunable via sysctl.
 1.35  29-Apr-1998  kml Add support for deletion of routes added by path MTU discovery;
uses new generic route timeout code. Add sysctl for timeout period.
 1.34  13-Apr-1998  kml Fix to ensure that the correct MSS is advertised for loopback
TCP connections by using the MTU of the interface. Also added
a knob, mss_ifmtu, to force all connections to use the MTU of
the interface to calculate the advertised MSS.
 1.33  01-Apr-1998  thorpej Implement Congestion Window Monitoring as described in the TCPIMPL
meeting of IETF #41 by Amy Hughes <ahughes@isi.edu>, and in an upcoming
internet draft from Hughes, Touch, and Heidemann.

CWM eliminates line-rate bursts after idle periods by counting pending
(unacknowledged) packets and limiting the congestion window to the
initial congestion window plus the pending packet count. This has the
effect of allowing us to use the window as long as we continue to transmit,
but as soon as we stop transmitting, we go back to a slow-start (also known
as `use it or lose it').

This is not enabled by default. You can enable this behavior by patching
the "tcp_cwm" global (set it to non-zero) or by building a kernel with the
TCP_CWM option.
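
The window limit described in 1.33 can be sketched in C as below. This is a simplified illustration, not the kernel implementation: the function cwm_limit() and its parameters are hypothetical, and it assumes the pending packet count is converted to bytes by multiplying with the segment size.

```c
#include <assert.h>

/*
 * Hypothetical helper for Congestion Window Monitoring: the usable
 * congestion window is capped at the initial window plus the data
 * still in flight (pending, unacknowledged packets). After an idle
 * period nothing is pending, so the cap falls back to the initial
 * window and the connection effectively slow-starts again.
 */
static unsigned long
cwm_limit(unsigned long cwnd, unsigned long init_win,
    unsigned long pending_pkts, unsigned long segsize)
{
	unsigned long cap;

	assert(segsize > 0);

	cap = init_win + pending_pkts * segsize;
	return cwnd < cap ? cwnd : cap;
}
```

With a 65535-byte cwnd, a one-segment initial window of 1460 bytes and nothing in flight, the sketch allows only 1460 bytes; with 10 packets pending it allows 16060 bytes.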
 1.32  31-Mar-1998  thorpej Fix a potential-congestion case in the larger initial congestion window
code, as clarified in the TCPIMPL WG meeting at IETF #41: If the SYN
(active open) or SYN,ACK (passive open) was retransmitted, the initial
congestion window for the first slow start of that connection must be
one segment.
 1.31  24-Mar-1998  kml Ensure that we take the IP option length into account when we calculate
the effective maximum send size for TCP. ip_optlen() and tcp_optlen()
should probably be inlined for efficiency.
 1.30  19-Mar-1998  kml Fix a retransmission bug introduced by the Brakmo and Peterson
RTO estimation changes. Under some circumstances it would return a value
of 0, while the old Van Jacobson RTO code would return a minimum of 3.
This would result in 12 retransmissions, each 1 second apart.
This takes care of those instances, and ensures that t_rttmin is
used everywhere as a lower bound.
 1.29  17-Mar-1998  kml Ensure that the TCP segment size reflects the size of TCP options
in the packet. This fixes a bug that was resulting in extra packets
in retransmissions (the second packet would be 12 bytes long,
reflecting the RFC1323 timestamp option size).
 1.28  19-Feb-1998  thorpej Update copyright (sigh, should have done this long ago).
 1.27  05-Jan-1998  thorpej Finishing merging 4.4BSD-Lite2 netinet. At this point, the only changes
left were SCCS IDs and Copyright dates.
 1.26  31-Dec-1997  thorpej Implement a queue for delayed ACK processing. This queue is used in
tcp_fasttimo() in lieu of scanning all open TCP connections.
 1.25  17-Dec-1997  thorpej From 4.4BSD-Lite2:
- If we fail to allocate mbufs for the outgoing segment, free the header
and abort.

From Stevens:
- Ensure the persist timer is running if the send window reaches zero.
Part of the fix for kern/2335 (pete@daemon.net).
 1.24  11-Dec-1997  thorpej Implement an infrastructure to allow larger initial congestion windows.
The sysctl'able variable "tcp_init_win", when set to 0, selects an
auto-tuning algorithm for selecting the initial window, based on transmit
segment size, per discussion in the IETF tcpimpl working group.

Default initial window is still 1 segment, but will soon become 2 segments,
per discussion in tcpimpl.
 1.23  11-Dec-1997  thorpej Count delayed ACKs after they have been successfully transmitted.
 1.22  20-Nov-1997  thorpej Add missing (implied) int to a variable declaration.
 1.21  08-Nov-1997  kml TCP MSS fixes to provide cleaner slow-start and recovery.
 1.20  18-Oct-1997  kml branches: 1.20.2;
change sysctl net.inet.icmp.mtudisc to net.inet.ip.mtudisc
 1.19  17-Oct-1997  kml Path MTU Discovery support. This is turned off by default.
Use sysctl -w net.inet.icmp.mtudisc=1 to turn on.
Still to come: path removal after some period, black hole detection
 1.18  08-Oct-1997  thorpej Fix an oversight in my previous MSS-related changes:

Basically, in silly window avoidance, don't use the raw MSS we advertised
to the peer. What we really want here is the _expected_ size of received
segments, so we need to account for the path MTU (eventually; right now,
the interface MTU for "local" addresses and loopback or tcp_mssdflt for
non-local addresses). Without this, silly window avoidance would never
kick in if we advertised a very large (e.g. ~64k) MSS to the peer.
 1.17  22-Sep-1997  thorpej Fix several annoyances related to MSS handling in BSD TCP:
- Don't overload t_maxseg. Previous behavior was to set it to the min
of the peer's advertised MSS, our advertised MSS, and tcp_mssdflt
(for non-local networks). This breaks PMTU discovery running on
either host. Instead, remember the MSS we advertise, and use it
as appropriate (in silly window avoidance).
- Per last bullet, split tcp_mss() into several functions for handling
MSS (ours and peer's), and performing various tasks when a connection
becomes ESTABLISHED.
- Introduce a new function, tcp_segsize(), which computes the max size
for every segment transmitted in tcp_output(). This will eventually
be used to hook in PMTU discovery.
 1.16  03-Jun-1997  kml branches: 1.16.4;
Fix urgent pointer overflow problems when used with large windows
 1.15  10-Dec-1996  mycroft Fix RTT scaling problems introduced with Brakmo and Peterson changes.
 1.14  13-Feb-1996  christos branches: 1.14.4;
netinet prototypes
 1.13  13-Apr-1995  cgd oops; missed the chance to fix a cast, that then became a compiler warning.
 1.12  13-Apr-1995  cgd be a bit more careful and explicit with types. (basically a large no-op.)
 1.11  23-Jan-1995  mycroft Fix a condition where we sometimes sent a FIN too early. Also, a small
optimization.
 1.10  29-Jun-1994  cgd New RCS ID's, take two. they're more aesthetically pleasant, and use 'NetBSD'
 1.9  13-May-1994  mycroft Update to 4.4-Lite networking code, with a few local changes.
 1.8  12-Apr-1994  mycroft Acks with no data should have the highest sequence number sent.
 1.7  10-Jan-1994  mycroft Should compile now with or without `options MULTICAST'.
 1.6  08-Jan-1994  mycroft Prototypes.
 1.5  08-Jan-1994  mycroft Fix some inconsistent spacing; spaces at the end of lines, etc.
 1.4  18-Dec-1993  mycroft Canonicalize all #includes.
 1.3  22-May-1993  cgd add include of select.h if necessary for protos, or delete if extraneous
 1.2  18-May-1993  cgd make kernel select interface be one-stop shopping & clean it all up.
 1.1  21-Mar-1993  cgd branches: 1.1.1;
Initial revision
 1.1.1.3  05-Jan-1998  thorpej Import sys/netinet from 4.4BSD-Lite2 for reference purposes.
 1.1.1.2  05-Jan-1998  thorpej Import sys/netinet from 4.4BSD-Lite for reference purposes.
 1.1.1.1  21-Mar-1993  cgd initial import of 386bsd-0.1 sources
 1.14.4.1  10-Dec-1996  mycroft From trunk:
Fix RTT scaling problems introduced with Brakmo and Peterson changes.
 1.16.4.2  14-Oct-1997  thorpej Update marc-pcmcia branch from trunk.
 1.16.4.1  29-Sep-1997  thorpej Update marc-pcmcia branch from trunk.
 1.20.2.6  09-May-1998  mycroft Pull up patch from kml.
 1.20.2.5  05-May-1998  mycroft Pull up 1.29, per request of kml.
 1.20.2.4  05-May-1998  mycroft Pull up 1.30, per request of kml.
 1.20.2.3  29-Jan-1998  mellon Pull up 1.24-1.27 (thorpej)
 1.20.2.2  21-Nov-1997  thorpej Sync w/ trunk: add a missing (previously implied) int.
 1.20.2.1  08-Nov-1997  thorpej Pull up from trunk: TCP MSS fixes to provide cleaner slow-start and recovery.
(kml)
 1.47.6.3  30-Nov-1999  itojun bring in latest KAME (as of 19991130, KAME/NetBSD141) into kame branch
just for reference purposes.
This commit includes 1.4 -> 1.4.1 sync for kame branch.

The branch does not compile at all (due to the lack of ALTQ and some other
source code). Please do not try to modify the branch, this is just for
reference purposes.

synchronization to latest KAME will take place on HEAD branch soon.
 1.47.6.2  06-Jul-1999  itojun KAME/NetBSD 1.4, SNAP kit 1999/07/05.
NOTE: this branch is just for reference purposes (i.e. for taking cvs diff).
do not touch anything on the branch. actual work must be done on HEAD branch.
 1.47.6.1  28-Jun-1999  itojun KAME/NetBSD 1.4 SNAP kit, dated 19990628.

NOTE: this branch (kame) is used just for reference. this may not compile
due to multiple reasons.
 1.47.4.2  02-Aug-1999  thorpej Update from trunk.
 1.47.4.1  01-Jul-1999  thorpej Sync w/ -current.
 1.52.8.1  27-Dec-1999  wrstuden Pull up to last week's -current.
 1.52.2.5  21-Apr-2001  bouyer Sync with HEAD
 1.52.2.4  27-Mar-2001  bouyer Sync with HEAD.
 1.52.2.3  11-Feb-2001  bouyer Sync with HEAD.
 1.52.2.2  22-Nov-2000  bouyer Sync with HEAD.
 1.52.2.1  20-Nov-2000  bouyer Update thorpej_scsipi to -current as of a month ago
 1.56.4.5  24-Jan-2002  he Pull up revision 1.77 (requested by itojun):
Clean up the NRL copyright.
 1.56.4.4  06-Apr-2001  he Pull up revision 1.63 (requested by itojun):
Record IPsec packet history in m_aux structure. Let ipfilter
look at wire-format packet only (not the decapsulated ones), so
that VPN setting can work with NAT/ipfilter settings.
 1.56.4.3  10-Nov-2000  tv Pullup 1.62 [itojun]:
fix IPv4 TTL selection with AF_INET6 API. sync with kame. From: jdc
 1.56.4.2  15-Aug-2000  itojun pullup 1.57 -> 1.58 (approved by releng-1-5)

> forgot to call tcp6_quench(). sync with kame.
 1.56.4.1  23-Jul-2000  itojun pullup from main trunk (approved by releng-1-5)

remove old mbuf assumption (ip header and tcp header are on the same mbuf).
this is for m_pulldown use. (sync with kame)

1.108 -> 1.109 syssrc/sys/netinet/tcp_input.c
1.56 -> 1.57 syssrc/sys/netinet/tcp_output.c
1.91 -> 1.92 syssrc/sys/netinet/tcp_subr.c
 1.63.2.14  11-Dec-2002  thorpej Sync with HEAD.
 1.63.2.13  11-Nov-2002  nathanw Catch up to -current
 1.63.2.12  17-Sep-2002  nathanw Catch up to -current.
 1.63.2.11  27-Aug-2002  nathanw Catch up to -current.
 1.63.2.10  20-Jun-2002  nathanw Catch up to -current.
 1.63.2.9  04-May-2002  thorpej Update from trunk.
 1.63.2.8  01-Apr-2002  nathanw Catch up to -current.
(CVS: It's not just a program. It's an adventure!)
 1.63.2.7  28-Feb-2002  nathanw Catch up to -current.
 1.63.2.6  08-Jan-2002  nathanw Catch up to -current.
 1.63.2.5  14-Nov-2001  nathanw Catch up to -current.
 1.63.2.4  21-Sep-2001  nathanw Catch up to -current.
 1.63.2.3  24-Aug-2001  nathanw Catch up with -current.
 1.63.2.2  21-Jun-2001  nathanw Catch up to -current.
 1.63.2.1  09-Apr-2001  nathanw Catch up with -current.
 1.67.2.8  10-Oct-2002  jdolecek sync kqueue with -current; this includes merge of gehenna-devsw branch,
merge of i386 MP branch, and part of autoconf rototil work
 1.67.2.7  06-Sep-2002  jdolecek sync kqueue branch with HEAD
 1.67.2.6  23-Jun-2002  jdolecek catch up with -current on kqueue branch
 1.67.2.5  16-Mar-2002  jdolecek Catch up with -current.
 1.67.2.4  11-Feb-2002  jdolecek Sync w/ -current.
 1.67.2.3  10-Jan-2002  thorpej Sync kqueue branch with -current.
 1.67.2.2  13-Sep-2001  thorpej Update the kqueue branch to HEAD.
 1.67.2.1  03-Aug-2001  lukem update to -current
 1.70.2.1  01-Oct-2001  fvdl Catch up with -current.
 1.79.4.5  07-Feb-2004  jmc Pullup rev 1.107 (requested by itojun in ticket #1605)

Deal with IPv6 path MTU < 1280 (RFC2460 section 5 last paragraph)
Check if there really is room for TCP data.
 1.79.4.4  05-Sep-2003  tron Pull up revision 1.80 (requested by tls in ticket #1445):
path MTU discovery blackhole detection.
PR 12790 (sorry for not committing it for a long time)
 1.79.4.3  30-Nov-2002  he Pull up revision 1.86 (requested by thorpej in ticket #795):
In the txsegsize bounding code, it is not necessary to adjust
for the options length.
 1.79.4.2  21-Nov-2002  he Pull up revision 1.85 (requested by thorpej in ticket #707):
Never send more than half a socket buffer of data in a
segment. This ensures that we can always keep 2 packets
on the wire, and we will therefore not cause any delayed
ACKs. Otherwise, this causes performance problems when
using large-MTU interfaces, such as the loopback interface.
 1.79.4.1  14-Jun-2002  lukem Pull up revision 1.83 (requested by thorpej in ticket #267):
Disable TCP Congestion Window Monitoring by default; there are
performance problems in the face of tinygrams.
 1.79.2.3  29-Aug-2002  gehenna catch up with -current.
 1.79.2.2  20-Jun-2002  gehenna catch up with -current.
 1.79.2.1  30-May-2002  gehenna Catch up with -current.
 1.94.2.9  10-Nov-2005  skrll Sync with HEAD. Here we go again...
 1.94.2.8  01-Apr-2005  skrll Sync with HEAD.
 1.94.2.7  08-Mar-2005  skrll Sync with HEAD.
 1.94.2.6  04-Mar-2005  skrll Sync with HEAD.

Hi Perry!
 1.94.2.5  04-Feb-2005  skrll Sync with HEAD.
 1.94.2.4  18-Dec-2004  skrll Sync with HEAD.
 1.94.2.3  21-Sep-2004  skrll Fix the sync with head I botched.
 1.94.2.2  18-Sep-2004  skrll Sync with HEAD.
 1.94.2.1  03-Aug-2004  skrll Sync with HEAD
 1.108.2.1  11-May-2004  tron Pull up revision 1.112 (requested by chs in ticket #292):
work around an LP64 problem where we report an excessively large window
due to incorrect mixing of types.
 1.115.4.2  19-Mar-2005  yamt sync with head. xen and whitespace. xen part is not finished.
 1.115.4.1  12-Feb-2005  yamt sync with head.
 1.115.2.1  29-Apr-2005  kent sync with -current
 1.128.2.5  11-May-2005  tron Pull up revision 1.134 (requested by yamt in ticket #294):
tcp_output: account FIN when building sack option.
 1.128.2.4  11-May-2005  tron Pull up revision 1.133 (requested by yamt in ticket #293):
tcp_output: don't try to send more data than we have. PR/30160.
 1.128.2.3  11-May-2005  tron Pull up revision 1.132 (requested by yamt in ticket #293):
tcp_output: clear TH_FIN where appropriate. related to PR/30160.
 1.128.2.2  06-May-2005  tron Pull up revision 1.130 (requested by yamt in ticket #251):
fix problems related to loopback interface checksum omission. PR/29971.
- for ipv4, defer decision to ip layer as h/w checksum offloading does
so that it can check the actual interface the packet is going to.
- for ipv6, disable it.
(maybe will be revisited when it implements h/w checksum offloading.)
ok'ed by Jason Thorpe.
 1.128.2.1  04-Apr-2005  tron Pull up revision 1.129 (requested by yamt in ticket #89):
tcp_output: lock reass queue when building sack.
 1.136.2.5  21-Jan-2008  yamt sync with head
 1.136.2.4  03-Sep-2007  yamt sync with head.
 1.136.2.3  26-Feb-2007  yamt sync with head.
 1.136.2.2  30-Dec-2006  yamt sync with head.
 1.136.2.1  21-Jun-2006  yamt sync with head.
 1.141.12.1  28-Mar-2006  tron Merge 2006-03-28 NetBSD-current into the "peter-altq" branch.
 1.141.10.1  19-Apr-2006  elad sync with head.
 1.141.8.2  14-Sep-2006  yamt sync with head.
 1.141.8.1  01-Apr-2006  yamt sync with head.
 1.141.6.1  22-Apr-2006  simonb Sync with head.
 1.141.4.2  09-Sep-2006  rpaulo sync with head
 1.141.4.1  05-Feb-2006  rpaulo <netinet6/in6_pcb.h> went away. Bye!
 1.143.4.2  10-Dec-2006  yamt sync with head.
 1.143.4.1  22-Oct-2006  yamt sync with head
 1.143.2.2  12-Jan-2007  ad Sync with head.
 1.143.2.1  18-Nov-2006  ad Sync with head.
 1.153.4.1  04-Jun-2007  wrstuden Update to today's netbsd-4.
 1.153.2.2  03-Apr-2011  riz Pull up following revision(s) (requested by spz in ticket #1424):
sys/netinet/tcp_output.c: revision 1.170
Clean up setting ECN bit in TOS. Fixes PR 44742
 1.153.2.1  24-May-2007  pavel branches: 1.153.2.1.4;
Pull up following revision(s) (requested by degroote in ticket #667):
sys/netinet/tcp_input.c: revision 1.260
sys/netinet/tcp_output.c: revision 1.154
sys/netinet/tcp_subr.c: revision 1.210
sys/netinet6/icmp6.c: revision 1.129
sys/netinet6/in6_proto.c: revision 1.70
sys/netinet6/ip6_forward.c: revision 1.54
sys/netinet6/ip6_input.c: revision 1.94
sys/netinet6/ip6_output.c: revision 1.114
sys/netinet6/raw_ip6.c: revision 1.81
sys/netipsec/ipcomp_var.h: revision 1.4
sys/netipsec/ipsec.c: revision 1.26 via patch,1.31-1.32
sys/netipsec/ipsec6.h: revision 1.5
sys/netipsec/ipsec_input.c: revision 1.14
sys/netipsec/ipsec_netbsd.c: revision 1.18,1.26
sys/netipsec/ipsec_output.c: revision 1.21 via patch
sys/netipsec/key.c: revision 1.33,1.44
sys/netipsec/xform_ipcomp.c: revision 1.9
sys/netipsec/xform_ipip.c: revision 1.15
sys/opencrypto/deflate.c: revision 1.8
Commit my SoC work
Add ipv6 support for fast_ipsec
Note that currently, packets with extension headers are not correctly
supported
Change the ipcomp logic

Add sysctl tree to modify the fast_ipsec options related to ipv6. Similar
to the sysctl kame interface.

Choose the right default policy, depending on the address family of the
desired policy

Increase the refcount for the default ipv6 policy so nobody can reclaim it

Always compute the sp index even if we don't have any sp in spd. It will
let us choose the right default policy (based on the address family
requested).
While here, fix an error message

Use a dynamic array instead of a static array to decompress. It lets us
decompress any data, whatever the ratio of decompressed data to compressed
data.
It fixes the last issues with fast_ipsec and ipcomp.
While here, bzero -> memset, bcopy -> memcpy, FREE -> free
Reviewed a long time ago by sam@
 1.153.2.1.4.1  03-Apr-2011  riz Pull up following revision(s) (requested by spz in ticket #1424):
sys/netinet/tcp_output.c: revision 1.170
Clean up setting ECN bit in TOS. Fixes PR 44742
 1.154.2.3  07-May-2007  yamt sync with head.
 1.154.2.2  12-Mar-2007  rmind Sync with HEAD.
 1.154.2.1  27-Feb-2007  yamt - sync with head.
- move sched_changepri back to kern_synch.c as it doesn't know PPQ anymore.
 1.157.4.1  11-Jul-2007  mjf Sync with head.
 1.157.2.3  09-Oct-2007  ad Sync with head.
 1.157.2.2  20-Aug-2007  ad Sync with HEAD.
 1.157.2.1  08-Jun-2007  ad Sync with head.
 1.159.2.2  03-Sep-2007  skrll Sync with HEAD.
 1.159.2.1  15-Aug-2007  skrll Sync with HEAD.
 1.161.6.2  02-Aug-2007  yamt make rfbuf_ts a tcp timestamp so that calculations in tcp_input make sense.
 1.161.6.1  02-Aug-2007  yamt file tcp_output.c was added on branch matt-mips64 on 2007-08-02 13:12:36 +0000
 1.161.4.3  23-Mar-2008  matt sync with HEAD
 1.161.4.2  09-Jan-2008  matt sync with HEAD
 1.161.4.1  06-Nov-2007  matt sync with HEAD
 1.161.2.1  03-Sep-2007  jmcneill Sync with HEAD.
 1.162.12.2  19-Jan-2008  bouyer Sync with HEAD
 1.162.12.1  02-Jan-2008  bouyer Sync with HEAD
 1.162.8.1  26-Dec-2007  ad Sync with head.
 1.162.6.1  18-Feb-2008  mjf Sync with HEAD.
 1.164.6.1  02-Jun-2008  mjf Sync with HEAD.
 1.166.4.3  11-Mar-2010  yamt sync with head
 1.166.4.2  04-May-2009  yamt sync with head.
 1.166.4.1  16-May-2008  yamt sync with head.
 1.166.2.1  18-May-2008  yamt sync with head.
 1.167.20.2  24-Jul-2015  martin Pull up following revision(s) (requested by matt in ticket #1973):
sys/netinet/tcp_output.c: revision 1.184
sys/netinet/tcp_input.c: revision 1.343

If we are sending a window probe and there's unacked data in the
socket, make sure at least the persist timer is running.
Make sure that snd_win doesn't go negative.
 1.167.20.1  29-Mar-2011  riz Pull up following revision(s) (requested by spz in ticket #1586):
sys/netinet/tcp_output.c: revision 1.170
Clean up setting ECN bit in TOS. Fixes PR 44742
 1.167.16.1  29-Mar-2011  riz Pull up following revision(s) (requested by spz in ticket #1586):
sys/netinet/tcp_output.c: revision 1.170
Clean up setting ECN bit in TOS. Fixes PR 44742
 1.167.14.1  13-May-2009  jym Sync with HEAD.

Commit is split, to avoid a "too many arguments" protocol error.
 1.167.10.2  24-Jul-2015  martin Pull up following revision(s) (requested by matt in ticket #1973):
sys/netinet/tcp_output.c: revision 1.184
sys/netinet/tcp_input.c: revision 1.343

If we are sending a window probe and there's unacked data in the
socket, make sure at least the persist timer is running.
Make sure that snd_win doesn't go negative.
 1.167.10.1  29-Mar-2011  riz branches: 1.167.10.1.2;
Pull up following revision(s) (requested by spz in ticket #1586):
sys/netinet/tcp_output.c: revision 1.170
Clean up setting ECN bit in TOS. Fixes PR 44742
 1.167.10.1.2.1  24-Jul-2015  martin Pull up following revision(s) (requested by matt in ticket #1973):
sys/netinet/tcp_output.c: revision 1.184
sys/netinet/tcp_input.c: revision 1.343

If we are sending a window probe and there's unacked data in the
socket, make sure at least the persist timer is running.
Make sure that snd_win doesn't go negative.
 1.167.8.1  28-Apr-2009  skrll Sync with HEAD.
 1.169.6.1  06-Jun-2011  jruoho Sync with HEAD.
 1.169.4.1  21-Apr-2011  rmind sync with head
 1.171.8.2  05-Apr-2012  mrg sync to latest -current.
 1.171.8.1  18-Feb-2012  mrg merge to -current.
 1.171.4.2  22-May-2014  yamt sync with head.

for a reference, the tree before this commit was tagged
as yamt-pagecache-tag8.

this commit was split into small chunks to avoid
a limitation of cvs. ("Protocol error: too many arguments")
 1.171.4.1  17-Apr-2012  yamt sync with head
 1.173.8.2  24-Jul-2015  martin Pull up following revision(s) (requested by matt in ticket #1315):
sys/netinet/tcp_output.c: revision 1.184
sys/netinet/tcp_input.c: revision 1.343

If we are sending a window probe and there's unacked data in the
socket, make sure at least the persist timer is running.
Make sure that snd_win doesn't go negative.
 1.173.8.1  03-Nov-2014  msaitoh Pull up following revision(s) (requested by christos in ticket #1174):
sys/netinet/tcp_output.c: revision 1.178
Avoid stack overflow when SACK and TCP_SIGNATURE are both present. Thanks
to Jonathan Looney for pointing this out.
 1.173.6.2  24-Jul-2015  martin Pull up following revision(s) (requested by matt in ticket #1315):
sys/netinet/tcp_output.c: revision 1.184
sys/netinet/tcp_input.c: revision 1.343

If we are sending a window probe and there's unacked data in the
socket, make sure at least the persist timer is running.
Make sure that snd_win doesn't go negative.
 1.173.6.1  03-Nov-2014  msaitoh Pull up following revision(s) (requested by christos in ticket #1174):
sys/netinet/tcp_output.c: revision 1.178
Avoid stack overflow when SACK and TCP_SIGNATURE are both present. Thanks
to Jonathan Looney for pointing this out.
 1.173.2.2  24-Jul-2015  martin Pull up following revision(s) (requested by matt in ticket #1315):
sys/netinet/tcp_output.c: revision 1.184
sys/netinet/tcp_input.c: revision 1.343

If we are sending a window probe and there's unacked data in the
socket, make sure at least the persist timer is running.
Make sure that snd_win doesn't go negative.
 1.173.2.1  03-Nov-2014  msaitoh Pull up following revision(s) (requested by christos in ticket #1174):
sys/netinet/tcp_output.c: revision 1.178
Avoid stack overflow when SACK and TCP_SIGNATURE are both present. Thanks
to Jonathan Looney for pointing this out.
 1.174.2.3  03-Dec-2017  jdolecek update from HEAD
 1.174.2.2  20-Aug-2014  tls Rebase to HEAD as of a few days ago.
 1.174.2.1  23-Jun-2013  tls resync from head
 1.175.6.1  10-Aug-2014  tls Rebase.
 1.175.2.1  17-Jul-2013  rmind Checkpoint work in progress:
- Move PCB structures under __INPCB_PRIVATE, adjust most of the callers
and thus make IPv4 PCB structures mostly opaque. Any volunteers for
merging in6pcb with inpcb (see rpaulo-netinet-merge-pcb branch)?
- Move various global vars to the modules where they belong, make them static.
- Some preliminary work for IPv4 PCB locking scheme.
- Make raw IP code mostly MP-safe. Simplify some of it.
- Rework "fast" IP forwarding (ipflow) code to be mostly MP-safe. It should
run from a software interrupt, rather than hard.
- Rework tun(4) pseudo interface to be MP-safe.
- Work towards making some other interfaces more strict.
 1.176.2.5  24-Jul-2015  martin Pull up following revision(s) (requested by matt in ticket #886):
sys/netinet/tcp_output.c: revision 1.184
sys/netinet/tcp_input.c: revision 1.343

If we are sending a window probe and there's unacked data in the
socket, make sure at least the persist timer is running.
Make sure that snd_win doesn't go negative.
 1.176.2.4  21-Feb-2015  martin Pull up following revision(s) (requested by he in ticket #530):
sys/netinet/tcp_output.c: revision 1.180
sys/netinet/tcp_input.c: revision 1.336
sys/netinet/tcp_usrreq.c: revision 1.203
share/man/man4/tcp.4: revision 1.30
sys/netinet/tcp.h: revision 1.31
sys/netinet/tcp_subr.c: revision 1.258
sys/netinet/tcp_var.h: revision 1.176
sys/netinet/tcp_var.h: revision 1.177
sys/sys/param.h: bump revision

Port over the TCP_INFO socket option from FreeBSD, originally from
the Linux 2.6 TCP API. This permits the caller to query certain information
about a TCP connection, and is used by pkgsrc's net/iperf3 test program
if available.

This extends struct tcpcb with three fields to count retransmits,
out-of-sequence receives and zero window announcements, and will
therefore warrant a kernel revision bump (done separately).

Change the new counter variables in struct tcpcb to uint32_t, as
per christos' comments.
 1.176.2.3  17-Jan-2015  martin Pull up following revision(s) (requested by maxv in ticket #427):
sys/compat/svr4/svr4_schedctl.c: revision 1.8
sys/netinet/tcp_timer.c: revision 1.88
sys/miscfs/genfs/layer_vfsops.c: revision 1.45
sys/compat/svr4/svr4_ioctl.c: revision 1.37
sys/ufs/chfs/chfs_vfsops.c: revision 1.14
sys/miscfs/fdesc/fdesc_vfsops.c: revision 1.91
sys/compat/linux/arch/i386/linux_ptrace.c: revision 1.30
sys/compat/common/kern_time_50.c: revision 1.28
sys/netinet6/ip6_forward.c: revision 1.74
sys/miscfs/umapfs/umap_vnops.c: revision 1.57
sys/compat/svr4/svr4_fcntl.c: revision 1.74
distrib/sets/lists/comp/mi: revision 1.1931
sys/netinet6/udp6_output.c: revision 1.46
sys/fs/puffs/puffs_compat.c: revision 1.3
sys/fs/udf/udf_rename.c: revision 1.11
sys/compat/svr4/svr4_filio.c: revision 1.24
sys/fs/udf/udf_rename.c: revision 1.12
sys/netinet/tcp_usrreq.c: revision 1.202
sys/miscfs/umapfs/umap_subr.c: revision 1.29
sys/compat/linux/common/linux_fadvise64.c: revision 1.3
sys/netinet/if_atm.c: revision 1.34
sys/miscfs/procfs/procfs_subr.c: revision 1.106
sys/miscfs/genfs/layer_subr.c: revision 1.37
sys/netinet/tcp_sack.c: revision 1.30
sys/compat/freebsd/freebsd_misc.c: revision 1.33
sys/compat/freebsd/freebsd_file.c: revision 1.33
sys/ufs/chfs/chfs_vnode.c: revision 1.12
sys/compat/svr4/svr4_ttold.c: revision 1.34
sys/compat/linux/common/linux_file.c: revision 1.114
sys/compat/linux/arch/mips/linux_machdep.c: revision 1.43
sys/compat/linux/common/linux_signal.c: revision 1.76
sys/compat/common/compat_util.c: revision 1.46
sys/compat/linux/arch/arm/linux_ptrace.c: revision 1.18
sys/compat/svr4/svr4_sockio.c: revision 1.36
sys/compat/linux/arch/arm/linux_machdep.c: revision 1.32
sys/compat/svr4/svr4_signal.c: revision 1.66
sys/kern/kern_exec.c: revision 1.410
sys/fs/puffs/puffs_vfsops.c: revision 1.115
sys/compat/svr4/svr4_exec_elf64.c: revision 1.15
sys/compat/linux/arch/i386/linux_machdep.c: revision 1.159
sys/compat/linux/arch/alpha/linux_machdep.c: revision 1.50
sys/compat/linux32/common/linux32_misc.c: revision 1.24
sys/netinet/in_pcb.c: revision 1.153
sys/sys/malloc.h: revision 1.116
sys/compat/common/if_43.c: revision 1.9
share/man/man9/Makefile: revision 1.380
sys/netinet/tcp_vtw.c: revision 1.12
sys/miscfs/umapfs/umap_vfsops.c: revision 1.95
sys/ufs/ext2fs/ext2fs_vfsops.c: revision 1.186
sys/compat/common/uipc_syscalls_43.c: revision 1.46
sys/ufs/ext2fs/ext2fs_vnops.c: revision 1.115
sys/fs/puffs/puffs_msgif.c: revision 1.97
sys/compat/svr4/svr4_ipc.c: revision 1.27
sys/compat/linux/common/linux_exec.c: revision 1.117
sys/ufs/ext2fs/ext2fs_readwrite.c: revision 1.66
sys/netinet/tcp_output.c: revision 1.179
sys/compat/svr4/svr4_termios.c: revision 1.28
sys/fs/udf/udf_strat_bootstrap.c: revision 1.4
sys/fs/puffs/puffs_subr.c: revision 1.67
sys/fs/puffs/puffs_node.c: revision 1.36
sys/miscfs/overlay/overlay_vnops.c: revision 1.21
sys/fs/cd9660/cd9660_node.c: revision 1.34
sys/netinet/raw_ip.c: revision 1.146
sys/sys/mallocvar.h: revision 1.13
sys/miscfs/overlay/overlay_vfsops.c: revision 1.63
share/man/man9/malloc.9: revision 1.50
sys/netinet6/dest6.c: revision 1.18
sys/compat/linux/common/linux_uselib.c: revision 1.33
sys/compat/linux/common/linux_socket.c: revision 1.120
share/man/man9/malloc.9: revision 1.51
sys/netinet/tcp_subr.c: revision 1.257
sys/compat/linux/common/linux_socketcall.c: revision 1.45
sys/compat/linux/common/linux_fadvise64_64.c: revision 1.3
sys/compat/freebsd/freebsd_ipc.c: revision 1.17
sys/compat/linux/common/linux_misc_notalpha.c: revision 1.109
sys/compat/linux/arch/alpha/linux_pipe.c: revision 1.17
sys/netinet6/in6_pcb.c: revision 1.132
sys/netinet6/in6_ifattach.c: revision 1.94
sys/compat/svr4/svr4_exec_elf32.c: revision 1.15
sys/miscfs/nullfs/null_vfsops.c: revision 1.90
sys/fs/cd9660/cd9660_util.c: revision 1.12
sys/compat/linux/arch/powerpc/linux_machdep.c: revision 1.48
sys/compat/freebsd/freebsd_exec_elf32.c: revision 1.20
sys/miscfs/procfs/procfs_vfsops.c: revision 1.94
sys/compat/linux/arch/powerpc/linux_ptrace.c: revision 1.28
sys/compat/linux/common/linux_sched.c: revision 1.67
sys/compat/linux/common/linux_exec_aout.c: revision 1.67
sys/compat/linux/common/linux_pipe.c: revision 1.67
sys/compat/linux/common/linux_llseek.c: revision 1.34
sys/compat/linux/arch/mips/linux_ptrace.c: revision 1.10
Do not uselessly include <sys/malloc.h>.
Cleanup:
- remove struct kmembuckets (dead)
- correctly deadify MALLOC_XX
- remove MALLOC_DEFINE_LIMIT and MALLOC_JUSTDEFINE_LIMIT (dead)
- remove malloc_roundup(), malloc_type_setlimit(), MALLOC_DEFINE_LIMIT()
and MALLOC_JUSTDEFINE_LIMIT() from man 9 malloc
New sentence, new line. Bump date for previous.
Obsolete malloc_roundup(9), malloc_type_setlimit(9) and MALLOC_DEFINE_LIMIT(9)
man pages.
 1.176.2.2  26-Oct-2014  martin Pull up following revision(s) (requested by christos in ticket #157):
sys/netinet/tcp_output.c: revision 1.178
Avoid stack overflow when SACK and TCP_SIGNATURE are both present. Thanks
to Jonathan Looney for pointing this out.
 1.176.2.1  24-Oct-2014  martin Pull up following revision(s) (requested by hikaru in ticket #154):
sys/netinet/tcp_output.c: revision 1.177
Fix wrong condition checking TSO capability.
ipsec_used is not necessary condition.
IPsec outbound policy will not be checked when ipsec_used is false.
 1.179.2.6  28-Aug-2017  skrll Sync with HEAD
 1.179.2.5  05-Feb-2017  skrll Sync with HEAD
 1.179.2.4  09-Jul-2016  skrll Sync with HEAD
 1.179.2.3  22-Sep-2015  skrll Sync with HEAD
 1.179.2.2  06-Jun-2015  skrll Sync with HEAD
 1.179.2.1  06-Apr-2015  skrll Sync with HEAD
 1.186.2.2  20-Mar-2017  pgoyette Sync with HEAD
 1.186.2.1  07-Jan-2017  pgoyette Sync with HEAD. (Note that most of these changes are simply $NetBSD$
tag issues.)
 1.194.2.1  21-Apr-2017  bouyer Sync with HEAD
 1.196.2.1  21-Oct-2017  snj Pull up following revision(s) (requested by ozaki-r in ticket #300):
crypto/dist/ipsec-tools/src/setkey/parse.y: 1.19
crypto/dist/ipsec-tools/src/setkey/token.l: 1.20
distrib/sets/lists/tests/mi: 1.754, 1.757, 1.759
doc/TODO.smpnet: 1.12-1.13
sys/net/pfkeyv2.h: 1.32
sys/net/raw_cb.c: 1.23-1.24, 1.28
sys/net/raw_cb.h: 1.28
sys/net/raw_usrreq.c: 1.57-1.58
sys/net/rtsock.c: 1.228-1.229
sys/netinet/in_proto.c: 1.125
sys/netinet/ip_input.c: 1.359-1.361
sys/netinet/tcp_input.c: 1.359-1.360
sys/netinet/tcp_output.c: 1.197
sys/netinet/tcp_var.h: 1.178
sys/netinet6/icmp6.c: 1.213
sys/netinet6/in6_proto.c: 1.119
sys/netinet6/ip6_forward.c: 1.88
sys/netinet6/ip6_input.c: 1.181-1.182
sys/netinet6/ip6_output.c: 1.193
sys/netinet6/ip6protosw.h: 1.26
sys/netipsec/ipsec.c: 1.100-1.122
sys/netipsec/ipsec.h: 1.51-1.61
sys/netipsec/ipsec6.h: 1.18-1.20
sys/netipsec/ipsec_input.c: 1.44-1.51
sys/netipsec/ipsec_netbsd.c: 1.41-1.45
sys/netipsec/ipsec_output.c: 1.49-1.64
sys/netipsec/ipsec_private.h: 1.5
sys/netipsec/key.c: 1.164-1.234
sys/netipsec/key.h: 1.20-1.32
sys/netipsec/key_debug.c: 1.18-1.21
sys/netipsec/key_debug.h: 1.9
sys/netipsec/keydb.h: 1.16-1.20
sys/netipsec/keysock.c: 1.59-1.62
sys/netipsec/keysock.h: 1.10
sys/netipsec/xform.h: 1.9-1.12
sys/netipsec/xform_ah.c: 1.55-1.74
sys/netipsec/xform_esp.c: 1.56-1.72
sys/netipsec/xform_ipcomp.c: 1.39-1.53
sys/netipsec/xform_ipip.c: 1.50-1.54
sys/netipsec/xform_tcp.c: 1.12-1.16
sys/rump/librump/rumpkern/Makefile.rumpkern: 1.170
sys/rump/librump/rumpnet/net_stub.c: 1.27
sys/sys/protosw.h: 1.67-1.68
tests/net/carp/t_basic.sh: 1.7
tests/net/if_gif/t_gif.sh: 1.11
tests/net/if_l2tp/t_l2tp.sh: 1.3
tests/net/ipsec/Makefile: 1.7-1.9
tests/net/ipsec/algorithms.sh: 1.5
tests/net/ipsec/common.sh: 1.4-1.6
tests/net/ipsec/t_ipsec_ah_keys.sh: 1.2
tests/net/ipsec/t_ipsec_esp_keys.sh: 1.2
tests/net/ipsec/t_ipsec_gif.sh: 1.6-1.7
tests/net/ipsec/t_ipsec_l2tp.sh: 1.6-1.7
tests/net/ipsec/t_ipsec_misc.sh: 1.8-1.18
tests/net/ipsec/t_ipsec_sockopt.sh: 1.1-1.2
tests/net/ipsec/t_ipsec_tcp.sh: 1.1-1.2
tests/net/ipsec/t_ipsec_transport.sh: 1.5-1.6
tests/net/ipsec/t_ipsec_tunnel.sh: 1.9
tests/net/ipsec/t_ipsec_tunnel_ipcomp.sh: 1.1-1.2
tests/net/ipsec/t_ipsec_tunnel_odd.sh: 1.3
tests/net/mcast/t_mcast.sh: 1.6
tests/net/net/t_ipaddress.sh: 1.11
tests/net/net_common.sh: 1.20
tests/net/npf/t_npf.sh: 1.3
tests/net/route/t_flags.sh: 1.20
tests/net/route/t_flags6.sh: 1.16
usr.bin/netstat/fast_ipsec.c: 1.22
Do m_pullup before mtod

It may fix panics of some tests on anita/sparc and anita/GuruPlug.
---
KNF
---
Enable DEBUG for babylon5
---
Apply C99-style struct initialization to xformsw
---
Tweak outputs of netstat -s for IPsec

- Get rid of "Fast"
- Use ipsec and ipsec6 for titles to clarify protocol
- Indent outputs of sub protocols

Original outputs were organized like this:

(Fast) IPsec:
IPsec ah:
IPsec esp:
IPsec ipip:
IPsec ipcomp:
(Fast) IPsec:
IPsec ah:
IPsec esp:
IPsec ipip:
IPsec ipcomp:

New outputs are organized like this:

ipsec:
ah:
esp:
ipip:
ipcomp:
ipsec6:
ah:
esp:
ipip:
ipcomp:
---
Add test cases for IPComp
---
Simplify IPSEC_OSTAT macro (NFC)
---
KNF; replace leading whitespaces with hard tabs
---
Introduce and use SADB_SASTATE_USABLE_P
---
KNF
---
Add update command for testing

Updating an SA (SADB_UPDATE) requires that the process issuing
SADB_UPDATE is the same as the process that issued SADB_ADD (or SADB_GETSPI).
This means that update command must be used with add command in a
configuration of setkey. This usage is normally meaningless but
useful for testing (and debugging) purposes.
---
Add test cases for updating SA/SP

The tests require the newly-added update command of setkey.
---
PR/52346: Frank Kardel: Fix checksumming for NAT-T
See XXX for improvements.
---
Remove codes for PACKET_TAG_IPSEC_IN_CRYPTO_DONE

It seems that PACKET_TAG_IPSEC_IN_CRYPTO_DONE is for network adapters
that have IPsec accelerators; a driver sets the mtag to a packet
when its device has already encrypted the packet.

Unfortunately no driver has implemented such offload features for many
years, and none seems likely to implement them soon. (Note that neither
FreeBSD nor Linux has such drivers.) Let's remove the related
(unused) code and simplify the IPsec code.
---
Fix usages of sadb_msg_errno
---
Avoid updating sav directly

On SADB_UPDATE a target sav was updated directly, which was unsafe.
Instead allocate another sav, copy variables of the old sav to
the new one and replace the old one with the new one.
---
Simplify; we can assume sav->tdb_xform cannot be NULL while it's valid
---
Rename key_alloc* functions (NFC)

We shouldn't use the term "alloc" for functions that just look up
data and actually don't allocate memory.
---
Use explicit_memset to surely zero-clear key_auth and key_enc
---
Make sure to clear keys on error paths of key_setsaval
---
Add missing KEY_FREESAV
---
Make sure a sav is inserted to a sah list after its initialization completes
---
Remove unnecessary zero-clearing codes from key_setsaval

key_setsaval is now used only for a newly-allocated sav. (It was
used to reset variables of an existing sav.)
---
Correct wrong assumption of sav->refcnt in key_delsah

A sav in a list should basically never have sav->refcnt == 0. And also
KEY_FREESAV assumes sav->refcnt > 0.
---
Let key_getsavbyspi take a reference of a returning sav
---
Use time_mono_to_wall (NFC)
---
Separate sending message routine (NFC)
---
Simplify; remove unnecessary zero-clears

key_freesaval is used only when a target sav is being destroyed.
---
Omit NULL checks for sav->lft_c

sav->lft_c can be NULL only when initializing or destroying sav.
---
Omit unnecessary NULL checks for sav->sah
---
Omit unnecessary check of sav->state

key_allocsa_policy picks a sav of either MATURE or DYING so we
don't need to check its state again.
---
Simplify; omit unnecessary saidx passing

- ipsec_nextisr returns a saidx but no caller uses it
- key_checkrequest is passed a saidx but it can be obtained from
another argument (isr)
---
Fix splx not being called on some error paths
---
Fix header size calculation of esp where sav is NULL
---
Fix header size calculation of ah in the case sav is NULL

This fix was also needed for esp.
---
Pass sav directly to opencrypto callback

In a callback, use a passed sav as-is by default and look up a sav
only if the passed sav is dead.
---
Avoid examining freshness of sav on packet processing

If a sav list is sorted (by lft_c->sadb_lifetime_addtime) in advance,
we don't need to examine each sav and also don't need to delete one
on the fly and send up a message. Fortunately every sav list is sorted
as we need.

The newly added key_validate_savlist validates that each sav list is indeed sorted
(run only if DEBUG because it's not cheap).
---
Add test cases for SAs with different SPIs
---
Prepare to stop using isr->sav

isr is a shared resource and using isr->sav as a temporal storage
for each packet processing is racy. And also having a reference from
isr to sav makes the lifetime of sav non-deterministic; such a reference
is removed when a packet is processed and isr->sav is overwritten by
new one. Let's have a sav locally for each packet processing instead of
using shared isr->sav.

However this change doesn't stop using isr->sav yet because there are
some users of isr->sav. isr->sav will be removed after the users find
a way to not use isr->sav.
---
Fix wrong argument handling
---
fix printf format.
---
Don't validate sav lists of LARVAL or DEAD states

We don't sort the lists so the validation will always fail.

Fix PR kern/52405
---
Make sure to sort the list when changing the state by key_sa_chgstate
---
Rename key_allocsa_policy to key_lookup_sa_bysaidx
---
Separate test files
---
Calculate ah_max_authsize on initialization as well as esp_max_ivlen
---
Remove m_tag_find(PACKET_TAG_IPSEC_PENDING_TDB) because nobody sets the tag
---
Restore a comment removed in previous

The comment is valid for the below code.
---
Make tests more stable

sleep command seems to wait longer than expected on anita so
use polling to wait for a state change.
---
Add tests that explicitly delete SAs instead of waiting for expirations
---
Remove invalid M_AUTHIPDGM check on ESP isr->sav

M_AUTHIPDGM flag is set to a mbuf in ah_input_cb. An sav of ESP can
have AH authentication as sav->tdb_authalgxform. However, in that
case esp_input and esp_input_cb are used to do ESP decryption and
AH authentication and M_AUTHIPDGM is never set to a mbuf. So
checking M_AUTHIPDGM of a mbuf on isr->sav of ESP is meaningless.
---
Look up sav instead of relying on unstable sp->req->sav

This code is executed only in an error path so an additional lookup
doesn't matter.
---
Correct a comment
---
Don't release sav if calling crypto_dispatch again
---
Remove extra KEY_FREESAV from ipsec_process_done

It should be done by the caller.
---
Don't bother the case of crp->crp_buf == NULL in callbacks
---
Hold a reference to an SP during opencrypto processing

An SP has a list of isr (ipsecrequest) that represents a sequence
of IPsec encryption/authentication processing. One isr corresponds
to one opencrypto processing. The lifetime of an isr follows its SP.

We pass an isr to a callback function of opencrypto to continue
to a next encryption/authentication processing. However nobody
guaranteed that the isr wasn't freed, i.e., its SP wasn't destroyed.

In order to avoid such unexpected destruction of isr, hold a reference
to its SP during opencrypto processing.
---
Don't make SAs expired on tests that delete SAs explicitly
---
Fix a debug message
---
Dedup error paths (NFC)
---
Use pool to allocate tdb_crypto

For ESP and AH, we need to allocate an extra variable space in addition
to struct tdb_crypto. The fixed size of pool items may be larger than
an actual requisite size of a buffer, but still the performance
improvement by replacing malloc with pool wins.
---
Don't use unstable isr->sav for header size calculations

We may need to optimize to not look up sav here for users that
don't need to know an exact size of headers (e.g., TCP segment size
calculation).
---
Don't use sp->req->sav when handling NAT-T ESP fragmentation

In order to do this we need to look up a sav however an additional
look-up degrades performance. A sav is later looked up in
ipsec4_process_packet so delay the fragmentation check until then
to avoid an extra look-up.
---
Don't use key_lookup_sp that depends on unstable sp->req->sav

It provided a fast look-up of SP. We will provide an alternative
method in the future (after basic MP-ification finishes).
---
Stop setting isr->sav on looking up sav in key_checkrequest
---
Remove ipsecrequest#sav
---
Stop setting mtag of PACKET_TAG_IPSEC_IN_DONE because there are no users anymore
---
Skip ipsec_spi_*_*_preferred_new_timeout when running on qemu

Probably due to PR 43997
---
Add localcount to rump kernels
---
Remove unused macro
---
Fix key_getcomb_setlifetime

The fix adjusts a soft limit to be 80% of a corresponding hard limit.

I'm not sure the fix is really correct though, at least the original
code is wrong. A passed comb is zero-cleared before calling
key_getcomb_setlifetime, so
comb->sadb_comb_soft_addtime = comb->sadb_comb_soft_addtime * 80 / 100;
is meaningless.
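
The problem described above can be illustrated with a minimal sketch. The struct and function names below are hypothetical stand-ins (they are not the actual definitions in key.c or pfkeyv2.h), and the sketch only assumes what the log states: the comb is zero-cleared beforehand and the fix makes a soft limit 80% of the corresponding hard limit.

```c
#include <stdint.h>

/* Hypothetical stand-in for the lifetime fields of a comb entry. */
struct comb_lifetime {
	uint64_t hard_addtime;	/* hard limit, in seconds */
	uint64_t soft_addtime;	/* soft limit, in seconds */
};

/*
 * Pattern the log calls meaningless: the soft limit is scaled from
 * itself, but the structure was zero-cleared, so it stays zero.
 */
static void
set_lifetime_buggy(struct comb_lifetime *c)
{
	c->soft_addtime = c->soft_addtime * 80 / 100;
}

/*
 * Pattern matching the described fix: derive the soft limit from the
 * corresponding hard limit (80% of it).
 */
static void
set_lifetime_fixed(struct comb_lifetime *c, uint64_t hard)
{
	c->hard_addtime = hard;
	c->soft_addtime = c->hard_addtime * 80 / 100;
}
```

With a zero-cleared structure, the first function leaves the soft limit at zero regardless of the hard limit, while the second yields a soft limit below the hard limit.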
---
Provide and apply key_sp_refcnt (NFC)

It simplifies further changes.
---
Fix indentation

Pointed out by knakahara@
---
Use pslist(9) for sptree
---
Don't acquire global locks for IPsec if NET_MPSAFE

Note that the change is just to make testing easy and IPsec isn't MP-safe yet.
---
Let PF_KEY socks hold their own lock instead of softnet_lock

Operations on SAD and SPD are executed via PF_KEY socks. The operations
include deletions of SAs and SPs that will use synchronization mechanisms
such as pserialize_perform to wait for references to SAs and SPs to be
released. It is known that using such mechanisms with holding softnet_lock
causes a dead lock. We should avoid the situation.
---
Make IPsec SPD MP-safe

We use localcount(9), not psref(9), to make the sptree and secpolicy (SP)
entries MP-safe because SPs need to be referenced over opencrypto
processing that executes a callback in a different context.

SPs on sockets aren't managed by the sptree and can be destroyed in softint.
localcount_drain cannot be used in softint so we delay the destruction of
such SPs to a thread context. To do so, a list to manage such SPs is added
(key_socksplist) and key_timehandler_spd deletes dead SPs in the list.

For more details please read the locking notes in key.c.

Proposed on tech-kern@ and tech-net@
---
Fix updating ipsec_used

- key_update_used wasn't called in key_api_spddelete2 and key_api_spdflush
- key_update_used wasn't called if an SP had been added/deleted but
a reply to userland failed
---
Fix updating ipsec_used; turn on when SPs on sockets are added
---
Add missing IPsec policy checks to icmp6_rip6_input

icmp6_rip6_input is quite similar to rip6_input and the same checks exist
in rip6_input.
---
Add test cases for setsockopt(IP_IPSEC_POLICY)
---
Don't use KEY_NEWSP for dummy SP entries

By the change KEY_NEWSP is now not called from softint anymore
and we can use kmem_zalloc with KM_SLEEP for KEY_NEWSP.
---
Comment out unused functions
---
Add test cases that there are SPs but no relevant SAs
---
Don't allow sav->lft_c to be NULL

lft_c of an sav that was created by SADB_GETSPI could be NULL.
---
Clean up clunky eval strings

- Remove unnecessary \ at EOL
- This allows to omit ; too
- Remove unnecessary quotes for arguments of atf_set
- Don't expand $DEBUG in eval
- We expect it's expanded on execution

Suggested by kre@
---
Remove unnecessary KEY_FREESAV in an error path

sav should be freed (unreferenced) by the caller.
---
Use pslist(9) for sahtree
---
Use pslist(9) for sah->savtree
---
Rename local variable newsah to sah

It may not be new.
---
MP-ify SAD slightly

- Introduce key_sa_mtx and use it for some list operations
- Use pserialize for some list iterations
---
Introduce KEY_SA_UNREF and replace KEY_FREESAV with it where sav will never be actually freed in the future

KEY_SA_UNREF is still key_freesav so no functional change for now.

This change reduces diff of further changes.
---
Remove out-of-date log output

Pointed out by riastradh@
---
Use KDASSERT instead of KASSERT for mutex_ownable

Because mutex_ownable is too heavy to run in a fast path
even for DIAGNOSTIC + LOCKDEBUG.

Suggested by riastradh@
---
Assemble global lists and related locks into cache lines (NFCI)

Also rename variable names from *tree to *list because they are
just lists, not trees.

Suggested by riastradh@
---
Move locking notes
---
Update the locking notes

- Add locking order
- Add locking notes for misc lists such as reglist
- Mention pserialize, key_sp_ref and key_sp_unref on SP operations

Requested by riastradh@
---
Describe constraints of key_sp_ref and key_sp_unref

Requested by riastradh@
---
Hold key_sad.lock on SAVLIST_WRITER_INSERT_TAIL
---
Add __read_mostly to key_psz

Suggested by riastradh@
---
Tweak wording (pserialize critical section => pserialize read section)

Suggested by riastradh@
---
Add missing mutex_exit
---
Fix setkey -D -P outputs

The outputs were tweaked (by me), but I forgot updating libipsec
in my local ATF environment...
---
MP-ify SAD (key_sad.sahlist and sah entries)

localcount(9) is used to protect key_sad.sahlist and sah entries
as well as SPD (and will be used for SAD sav).

Please read the locking notes of SAD for more details.
---
Introduce key_sa_refcnt and replace sav->refcnt with it (NFC)
---
Destroy sav only in the loop for DEAD sav
---
Fix KASSERT(solocked(sb->sb_so)) failure in sbappendaddr that is called eventually from key_sendup_mbuf

If key_sendup_mbuf isn't passed a socket, the assertion fails.
Originally in this case sb->sb_so was softnet_lock and callers
held softnet_lock so the assertion was magically satisfied.
Now sb->sb_so is key_so_mtx and also softnet_lock isn't always
held by callers so the assertion can fail.

Fix it by holding key_so_mtx if key_sendup_mbuf isn't passed a socket.

Reported by knakahara@
Tested by knakahara@ and ozaki-r@
---
Fix locking notes of SAD
---
Fix deadlock between key_sendup_mbuf called from key_acquire and localcount_drain

If we call key_sendup_mbuf from key_acquire that is called on packet
processing, a deadlock can happen like this:
- At key_acquire, a reference to an SP (and an SA) is held
- key_sendup_mbuf will try to take key_so_mtx
- Some other thread may try to localcount_drain to the SP with
holding key_so_mtx in say key_api_spdflush
- In this case localcount_drain never returns because key_sendup_mbuf,
which is stuck on key_so_mtx, never releases its reference to the SP

Fix the deadlock by deferring key_sendup_mbuf to the timer
(key_timehandler).
---
Fix that prev isn't cleared on retry
---
Limit the number of mbufs queued for deferred key_sendup_mbuf

Hundreds of mbufs can easily be queued on the list under heavy
network load.
---
MP-ify SAD (savlist)

localcount(9) is used to protect savlist of sah. The basic design is
similar to MP-ifications of SPD and SAD sahlist. Please read the
locking notes of SAD for more details.
---
Simplify ipsec_reinject_ipstack (NFC)
---
Add per-CPU rtcache to ipsec_reinject_ipstack

It reduces route lookups and also reduces rtcache lock contentions
when NET_MPSAFE is enabled.
---
Use pool_cache(9) instead of pool(9) for tdb_crypto objects

The change improves network throughput especially on multi-core systems.
---
Update

ipsec(4), opencrypto(9) and vlan(4) are now MP-safe.
---
Write known issues on scalability
---
Share a global dummy SP between PCBs

It is never changed so it can be pre-allocated and shared safely between PCBs.
---
Fix race condition on the rawcb list shared by rtsock and keysock

keysock now protects itself by its own mutex, which means that
the rawcb list is protected by two different mutexes (keysock's one
and softnet_lock for rtsock), of course it's useless.

Fix the situation by having a discrete rawcb list for each.
---
Use a dedicated mutex for rt_rawcb instead of softnet_lock if NET_MPSAFE
---
fix localcount leak in sav. fixed by ozaki-r@n.o.

I commit on behalf of him.
---
remove unnecessary comment.
---
Fix deadlock between pserialize_perform and localcount_drain

A typical usage of localcount_drain looks like this:

mutex_enter(&mtx);
item = remove_from_list();
pserialize_perform(psz);
localcount_drain(&item->localcount, &cv, &mtx);
mutex_exit(&mtx);

This sequence can cause a deadlock which happens for example on the following
situation:

- Thread A calls localcount_drain which calls xc_broadcast after releasing
a specified mutex
- Thread B enters the sequence and calls pserialize_perform with holding
the mutex while pserialize_perform also calls xc_broadcast
- Thread C (xc_thread) that calls an xcall callback of localcount_drain tries
to hold the mutex

xc_broadcast of thread B doesn't start until xc_broadcast of thread A
finishes, which is a feature of xcall(9). This means that pserialize_perform
never completes until xc_broadcast of thread A finishes. On the other hand,
thread C, which is a callee of xc_broadcast of thread A, is stuck on the mutex.
Finally the threads block each other (A blocks B, B blocks C and C blocks A).

A possible fix is to serialize executions of the above sequence by another
mutex, but adding another mutex makes the code complex, so fix the deadlock
by another way; the fix is to release the mutex before pserialize_perform
and instead use a condvar to prevent pserialize_perform from being called
simultaneously.

Note that the deadlock happens only if NET_MPSAFE is enabled.
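
The shape of the fix described above (drop the mutex around the synchronization call and use a condvar so that only one caller runs it at a time) can be sketched in user space with POSIX threads. This is only an analogy under stated assumptions, not the kernel code: fake_sync() stands in for pserialize_perform, and every name here is hypothetical.

```c
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
static bool sync_in_progress = false;	/* protected by mtx */
static int sync_calls = 0;		/* touched only by the flag holder */

/* Hypothetical stand-in for pserialize_perform(); runs without mtx held. */
static void
fake_sync(void)
{
	sync_calls++;
}

static void
remove_and_sync(void)
{
	pthread_mutex_lock(&mtx);
	/* (the item would be removed from its list here, under mtx) */

	/* Wait until no other thread is inside the synchronization call. */
	while (sync_in_progress)
		pthread_cond_wait(&cv, &mtx);
	sync_in_progress = true;

	/* Release the mutex so callbacks that need it cannot be blocked. */
	pthread_mutex_unlock(&mtx);
	fake_sync();
	pthread_mutex_lock(&mtx);

	sync_in_progress = false;
	pthread_cond_broadcast(&cv);
	/* (references to the item would be drained here) */
	pthread_mutex_unlock(&mtx);
}
```

The flag plus condvar serializes callers of the synchronization step without holding the mutex across it, which is the property the commit message relies on.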
---
Add missing ifdef NET_MPSAFE
---
Take softnet_lock on pr_input properly if NET_MPSAFE

Currently softnet_lock is taken unnecessarily in some cases, e.g.,
icmp_input and encap4_input from ip_input, or not taken even if needed,
e.g., udp_input and tcp_input from ipsec4_common_input_cb. Fix them.

NFC if NET_MPSAFE is disabled (default).
---
- sanitize key debugging so that we don't print extra newlines or unassociated
debugging messages.
- remove unused functions and make internal ones static
- print information in one line per message
---
humanize printing of ip addresses
---
cast reduction, NFC.
---
Fix typo in comment
---
Pull out ipsec_fill_saidx_bymbuf (NFC)
---
Don't abuse key_checkrequest just for looking up sav

It does more than expected for example key_acquire.
---
Fix SP is broken on transport mode

isr->saidx was modified accidentally in ipsec_nextisr.

Reported by christos@
Helped investigations by christos@ and knakahara@
---
Constify isr at many places (NFC)
---
Include socketvar.h for softnet_lock
---
Fix buffer length for ipsec_logsastr
 1.198.2.6  18-Jan-2019  pgoyette Synch with HEAD
 1.198.2.5  06-Sep-2018  pgoyette Sync with HEAD

Resolve a couple of conflicts (result of the uimin/uimax changes)
 1.198.2.4  21-May-2018  pgoyette Sync with HEAD
 1.198.2.3  07-Apr-2018  pgoyette Sync with HEAD. 77 conflicts resolved - all of them $NetBSD$
 1.198.2.2  30-Mar-2018  pgoyette Resolve conflicts between branch and HEAD
 1.198.2.1  15-Mar-2018  pgoyette Synch with HEAD
 1.208.2.2  13-Apr-2020  martin Mostly merge changes from HEAD up to 20200411
 1.208.2.1  10-Jun-2019  christos Sync with HEAD
 1.218.2.1  21-Sep-2023  martin Pull up following revision(s) (requested by bouyer in ticket #377):

sys/netinet/tcp_output.c: revision 1.219

Handle EHOSTDOWN the same way as EHOSTUNREACH and ENETDOWN for established
connections. Avoid premature end of tcp connection with "Host is down" error
in case of transient link-layer failure.

Discussed and patch proposed in
http://mail-index.netbsd.org/tech-net/2023/09/11/msg008610.html
and followups.
 1.220.2.1  02-Aug-2025  perseant Sync with HEAD
