Home | History | Annotate | Download | only in netinet6
History log of /src/sys/netinet6/ip6_output.c
RevisionDateAuthorComments
 1.235  19-Apr-2024  riastradh ip6_output: Initialize plen for ip6_hopopts_input.

This funny little block in ip6_process_hopopts assumes it is
initialized as and behaves differently depending on whether it's zero
or not:

https://nxr.netbsd.org/xref/src/sys/netinet6/ip6_input.c?r=1.227#976

In the other call site, it is initialized to ip6->ip6_plen:

https://nxr.netbsd.org/xref/src/sys/netinet6/ip6_input.c?r=1.227#561

Reported-by: syzbot+587e3b707bdfe533283f@syzkaller.appspotmail.com
https://syzkaller.appspot.com/bug?extid=587e3b707bdfe533283f
 1.234  03-Aug-2023  ozaki-r in6: don't send any IPv6 packets over a disabled interface
 1.233  20-Mar-2023  ozaki-r in6: reject setting negative values but -1 via setsockopt(IPV6_CHECKSUM)

Same as OpenBSD.
 1.232  27-Jan-2023  ozaki-r ipsec: remove unnecessary splsoftnet

Because the code of IPsec itself is already MP-safe.
 1.231  28-Oct-2022  ozaki-r branches: 1.231.2;
inpcb: separate inpcb again to reduce the size of PCB for IPv4

The data size of PCB for IPv4 increased because of the merge of
struct in6pcb. The change decreases the size to the original size by
separating struct inpcb (again). struct in4pcb and in6pcb that embed
struct inpcb are introduced.

Even after the separation, users don't need to realize the separation
and only have to use some macros to access dedicated data. For example,
inp->inp_laddr is now accessed through in4p_laddr(inp).
 1.230  28-Oct-2022  ozaki-r inpcb: integrate data structures of PCB into one

Data structures of network protocol control blocks (PCBs), i.e.,
struct inpcb, in6pcb and inpcb_hdr, are not organized well. Users of
the data structures have to handle them separately and thus the code
is cluttered and duplicated.

The commit integrates the data structures into one, struct inpcb. As a
result, users of PCBs only have to handle just one data structure, so
the code becomes simple.

One drawback is that the data size of PCB for IPv4 increases by 40 bytes
(from 248 bytes to 288 bytes).
 1.229  21-Sep-2021  christos don't opencode kauth_cred_get()
 1.228  17-Aug-2021  andvar fix multiplei repetitive typos in comments, messages and documentation. mainly because copy paste code big amount of files are affected.
 1.227  10-Mar-2021  christos byte-flipping a random number is not very useful.
 1.226  08-Sep-2020  christos branches: 1.226.2;
Add IP_BINDANY, IPV6_BINDANY which can be used to bind to any address in
order to implement transparent proxies.
 1.225  28-Aug-2020  ozaki-r inet6: reduce silent packet discards
 1.224  28-Aug-2020  ozaki-r inet, inet6: count packets dropped by IPsec

The counters count packets dropped due to security policy checks.
 1.223  12-Jun-2020  roy Remove in-kernel handling of Router Advertisements

This is much better handled by a user-land tool.
Proposed on tech-net here:
https://mail-index.netbsd.org/tech-net/2020/04/22/msg007766.html

Note that the ioctl SIOCGIFINFO_IN6 no longer sets flags. That now
needs to be done using the pre-existing SIOCSIFINFO_FLAGS ioctl.

Compat is fully provided where it makes sense, but trying to turn on
RA handling will obviously throw an error as it no longer exists.

Note that if you use IPv6 temporary addresses, this now needs to be
turned on in dhcpcd.conf(5) rather than in sysctl.conf(5).
 1.222  13-Nov-2019  ozaki-r Get rid of unnecessary NULL checks for rt_ifa and ifa_ifp

They are always non-NULL nowadays.
 1.221  01-Nov-2019  knakahara Fix ipsecif(4) IPV6_MINMTU does not work correctly.
 1.220  15-May-2019  ozaki-r branches: 1.220.2;
Get rid of IFNET_LOCK for if_mcast_op to avoid a deadlock

The IFNET_LOCK was added to avoid data races on if_flags for IFF_ALLMULTI.
Unfortunatetly it caused a deadlock instead. A known scenario causing a
deadlock is to occur the following two operations concurrently: (a) a removal of
an IP adddres assigned to an interface and (b) a manipulation of multicast
groups to the interface. The resource dependency graph is like this:
softnet_lock => IFNET_LOCK => psref_target_destroy => softint => softnet_lock

Thanks to the previous commit that avoids data races on if_flags for
IFF_ALLMULTI by another approach, we can remove IFNET_LOCK and defuse the
deadlock.

PR kern/54189
 1.219  13-May-2019  ozaki-r Count packets dropped by pfil
 1.218  03-Apr-2019  maxv Fix small read overflow; harmless, because since I removed RH0, the memory
access on IPV6_RTHDR that would normally be illegal is not needed, and GCC
automatically removes it.
 1.217  04-Feb-2019  mrg rework the #ifdef IPSEC code to not use fallthru.
same number of lines with more local context.
 1.216  22-Dec-2018  maxv Replace M_ALIGN and MH_ALIGN by m_align.
 1.215  22-Dec-2018  maxv Replace: M_MOVE_PKTHDR -> m_move_pkthdr. No functional change, since the
former is a macro to the latter.
 1.214  12-Dec-2018  rin Simplify logic in ip{,6}_output().

Now, we have M_CSUM_TSOv[46] bit in ifp->if_csum_flags_tx when
TSO[46] is enabled for the interface. So we can simply check
whether TSO[46] is required in a packet but missing in the
interface by (sw_csum & M_CSUM_TSOv[46]).

Note that this is a very rare case where TSO[46] is suddenly
turned off during a packet passing b/w TCP and IP.

part of PR kern/53562
OK msaitoh
 1.213  29-Nov-2018  ozaki-r Don't validate the source address of forwarding IPv6 packets (same as IPv4)
 1.212  10-Aug-2018  maxv Rename

ip6_undefer_csum -> in6_undefer_cksum
in6_delayed_cksum -> in6_undefer_cksum_tcpudp

The two previous names were inconsistent and misleading.

Put the two functions into in6_offload.c. Add comments to explain what
we're doing.

Same as IPv4.
 1.211  01-Jun-2018  maxv branches: 1.211.2;
Rename

M_CSUM_DATA_IPv6_HL -> M_CSUM_DATA_IPv6_IPHL
M_CSUM_DATA_IPv6_HL_SET -> M_CSUM_DATA_IPv6_SET

Reduces the diff against IPv4. Also, clarify the definitions.
 1.210  29-May-2018  maxv Remove dead code, we don't care.
 1.209  09-May-2018  maxv Replace
m_copym(m, 0, M_COPYALL, M_DONTWAIT)
by
m_copypacket(m, M_DONTWAIT)
when it is clear that we are copying a packet (that has M_PKTHDR) and not
a raw mbuf chain.
 1.208  01-May-2018  maxv Remove now unused net_osdep.h includes, the other BSDs did the same.
 1.207  29-Apr-2018  maxv Remove unused and misleading argument from ipsec_set_policy.
 1.206  26-Apr-2018  maxv Stop using m_copy(), use m_copym() directly. m_copy is useless,
undocumented and confusing.
 1.205  23-Apr-2018  maxv Remove the kernel RH0 code. RH0 is deprecated by RFC5095, for security
reasons. RH0 was already removed in the kernel's input path, but some
parts were still present in the output path: they are now removed.

Sent on tech-net@ a few days ago.
 1.204  18-Apr-2018  maxv Remove unused netipsec/xform.h includes.
 1.203  27-Feb-2018  maxv branches: 1.203.2;
Dedup: merge ipsec4_set_policy and ipsec6_set_policy. The content of the
original ipsec_set_policy function is inlined into the new one.
 1.202  27-Feb-2018  maxv Dedup: merge

ipsec4_get_policy and ipsec6_get_policy
ipsec4_delete_pcbpolicy and ipsec6_delete_pcbpolicy

The already-existing ipsec_get_policy() function is inlined in the new
one.
 1.201  12-Feb-2018  maxv Replace bcopy -> memcpy when it is obvious that the areas don't overlap.
Rearrange ip6_splithdr() for clarity.
 1.200  31-Jan-2018  maxv Correct the check; we want to find IPPROTO_HOPOPTS, not IPV6_HOPOPTS. This
just couldn't work.

By the way, I'm wondering what is the point of this block. Calling
ip6_hopopts_input() won't achieve anything useful, and it could actually
be a problem, because there are several paths in it that call icmp6_error,
which calls ip6_output, and then we're back in the same function. Besides
it is possible to reach icmp6_error with a packet we emitted (as opposed
to a packet we are forwarding), and in that case we are sending an ICMP
error back to ourselves.
 1.199  31-Jan-2018  maxv Remove a misleading instruction. We don't care about increasing
m_pkthdr.len in ip6_insertfraghdr(), it gets recomputed after calling
this function.

If we cared there would be a bug, since we don't increase it in the
other branches.
 1.198  31-Jan-2018  maxv Try to sound a little less pessimistic, there is nothing wrong here.
 1.197  31-Jan-2018  maxv Style, localify, constify, and reorder a bit. No real functional change.
 1.196  15-Dec-2017  ozaki-r Ensure to call if_mcast_op with holding IFNET_LOCK

Note that CARP doesn't deal with IFNET_LOCK yet.
 1.195  25-Nov-2017  kre Attempt to restore v6 networking. Not 100% certain that these
changes are all that is needed, but they're certainly a big part of it
(especially the ip6_input.c change.)
 1.194  24-Nov-2017  roy Allow local communication over DETACHED addresses.
Allow binding to DETACHED or TENTATIVE addresses as we deny
sending upstream from them anyway.
Prefer non DETACHED or TENTATIVE addresses.
 1.193  02-Aug-2017  ozaki-r Make IPsec SPD MP-safe

We use localcount(9), not psref(9), to make the sptree and secpolicy (SP)
entries MP-safe because SPs need to be referenced over opencrypto
processing that executes a callback in a different context.

SPs on sockets aren't managed by the sptree and can be destroyed in softint.
localcount_drain cannot be used in softint so we delay the destruction of
such SPs to a thread context. To do so, a list to manage such SPs is added
(key_socksplist) and key_timehandler_spd deletes dead SPs in the list.

For more details please read the locking notes in key.c.

Proposed on tech-kern@ and tech-net@
 1.192  26-Jun-2017  ozaki-r Fix usage of ip6_get_membership

It may set nothing to ifp even if returning 0. So we need to NULL-clear
ifp before calling it.

Fix PR kern/52324
 1.191  03-Mar-2017  ozaki-r branches: 1.191.6;
Pass inpcb/in6pcb instead of socket to ip_output/ip6_output

- Passing a socket to Layer 3 is layer violation and even unnecessary
- The change makes codes of callers and IPsec a bit simple
 1.190  02-Mar-2017  ozaki-r Make sure im6o_memberships is protected by in6p's lock (solock)
 1.189  02-Mar-2017  ozaki-r Make usages of ifp MP-safe in some functions of IP multicast
 1.188  02-Mar-2017  ozaki-r Use LIST_* macros

No functional change.
 1.187  01-Mar-2017  ozaki-r Provide in6_multi_group

Use it when checking if we belong to the group, instead of in6_lookup_multi.

No functional change.
 1.186  22-Feb-2017  ozaki-r Stop using useless IN6_*_MULTI macros
 1.185  22-Feb-2017  ozaki-r Add assertions and comments for lock states of socket and pcb
 1.184  17-Feb-2017  ozaki-r Rename if_acquire_NOMPSAFE to if_acquire

It can be used in MP-safe ways. So let's remove the confusing postfix.
If it's used in a unsafe way, warn NOMPSAFE in a comment.
 1.183  14-Feb-2017  ozaki-r Do ND in L2_output in the same manner as arpresolve

The benefits of this change are:
- The flow is consistent with IPv4 (and FreeBSD and OpenBSD)
- old: ip6_output => nd6_output (do ND if needed) => L2_output (lookup a stored cache)
- new: ip6_output => L2_output (lookup a cache. Do ND if cache not found)
- We can remove some workarounds in nd6_output
- We can move L2 specific operations to their own place
- The performance slightly improves because one cache lookup is reduced
 1.182  16-Jan-2017  christos ip6_sprintf -> IN6_PRINT so that we pass the size.
 1.181  16-Jan-2017  ryo Make ip6_sprintf(), in_fmtaddr(), lla_snprintf() and icmp6_redirect_diag() mpsafe.

Reviewed by ozaki-r@
 1.180  11-Jan-2017  ozaki-r branches: 1.180.2;
Get rid of unnecessary header inclusions
 1.179  08-Dec-2016  ozaki-r Add rtcache_unref to release points of rtentry stemming from rtcache

In the MP-safe world, a rtentry stemming from a rtcache can be freed at any
points. So we need to protect rtentries somehow say by reference couting or
passive references. Regardless of the method, we need to call some release
function of a rtentry after using it.

The change adds a new function rtcache_unref to release a rtentry. At this
point, this function does nothing because for now we don't add a reference
to a rtentry when we get one from a rtcache. We will add something useful
in a further commit.

This change is a part of changes for MP-safe routing table. It is separated
to avoid one big change that makes difficult to debug by bisecting.
 1.178  10-Nov-2016  ozaki-r Tidy up in6_select*

This change tidies up in6_select* functions, especially
selectroute.

selectroute is annoying because:
- It returns both/either of a rtentry and/or an ifp
- Yes, it may return only an ifp!
- It is valid but selectroute shouldn't handle the case
- Such conditional behavior makes it difficult
to apply locking/psref thingy
- It may return a rtentry even if error
- It may use opt->ip6po_nextroute rtcache implicitly
- The caller can know if it is used
by rtcache_validate(&opt->ip6po_nextroute)
but it's racy in MP-safe world
- Even if it uses opt->ip6po_nextroute, it may
return a rtentry that isn't derived from the rtcache

The change includes:
- Rename selectroute to in6_selectroute
- Let a remaining caller of selectroute, in6_selectif,
use in6_selectroute instead
- Let in6_selectroute return only an rtentry
- If error, it doesn't return an rtentry
- A caller gets an ifp from a returned rtentry
- Allow in6_selectroute to modify a passed rtcache
and a caller can know if opt->ip6po_nextroute is
used via the rtcache
- Let callers (ip6_output and in6_selectif) handle
the case that only an ifp is required

Inspired by OpenBSD
Proposed on tech-kern and tech-net
LGTM by roy@
 1.177  07-Nov-2016  ozaki-r Pull routing header handling out of ip6_output

No functional change.
 1.176  07-Nov-2016  ozaki-r Tidy up ip6_getpmtu

Pull rtcache thing out of ip6_getpmtu; that isn't an essential
of the function. Add comments inspired by FreeBSD.

No functional change.
 1.175  20-Sep-2016  roy Drop UDP packets as well as TCP without error when sending from detached or
tentative addresses.
 1.174  15-Sep-2016  roy Ensure that packets are sent from a valid address.
If the packet is TCP and the address is detached or tentative then
it's just dropped, otherwise an error is returned.

This is needed because you can bind to a valid address and it can then
become invalid.

This satisfies RFC 4862 section 5.5.4.
 1.173  01-Aug-2016  ozaki-r Apply pserialize and psref to struct ifaddr and its variants

This change makes struct ifaddr and its variants (in_ifaddr and in6_ifaddr)
MP-safe by using pserialize and psref. At this moment, pserialize_perform
and psref_target_destroy are disabled because (1) we don't need them
because of softnet_lock (2) they cause a deadlock because of softnet_lock.
So we'll enable them when we remove softnet_lock in the future.
 1.172  29-Jul-2016  ozaki-r Avoid memset and rtcache_free if unnecessary

It's the same as ip_output.
 1.171  27-Jun-2016  christos branches: 1.171.2;
CID 1362905: Initialize ifp early, so that we don't if_put garbage in the
IPSEC case.
 1.170  21-Jun-2016  ozaki-r Make sure returning ifp from in6_select* functions psref-ed

To this end, callers need to pass struct psref to the functions
and the fuctions acquire a reference of ifp with it. In some cases,
we can simply use if_get_byindex, however, in other cases
(say rt->rt_ifp and ia->ifa_ifp), we have no MP-safe way for now.
In order to take a reference anyway we use non MP-safe function
if_acquire_NOMPSAFE for the latter cases. They should be fixed in
the future somehow.
 1.169  21-Jun-2016  ozaki-r Protect if_byindex with pserialize
 1.168  21-Jun-2016  ozaki-r Replace ifp of ip_moptions and ip6_moptions with if_index

The motivation is the same as the mbuf's rcvif case; avoid having a pointer
of an ifnet object in ip_moptions and ip6_moptions, which is not MP-safe.

ip_moptions and ip6_moptions can be stored in a PCB for inet or inet6
that's life time is different from ifnet one and so an ifnet object can be
disappeared anytime we get it via them. Thus we need to look up an ifnet
object by if_index every time for safe.
 1.167  10-Jun-2016  ozaki-r Introduce m_set_rcvif and m_reset_rcvif

The API is used to set (or reset) a received interface of a mbuf.
They are counterpart of m_get_rcvif, which will come in another
commit, hide internal of rcvif operation, and reduce the diff of
the upcoming change.

No functional change.
 1.166  24-Aug-2015  pooka sprinkle _KERNEL_OPT
 1.165  27-Apr-2015  ozaki-r Add missing error checks on rtcache_setdst

It can fail with ENOMEM.
 1.164  24-Apr-2015  ozaki-r Avoid NULL checks for a variable that is definitely NULL
 1.163  02-Feb-2015  christos CID/1267860: Missing break in switch
 1.162  20-Jan-2015  roy Fix IPV6_USE_MIN_MTU set by setsockopt(2) being ignored when
IPV6_PKTINFO is set as a control with sendmsg(2).
 1.161  20-Jan-2015  roy Add net.inet6.ip6.prefer_tempaddr sysctl knob so that we can prefer
IPv6 temporary addresses as the source address.

Fixes PR kern/47100 based on a patch by Dieter Roelants.
 1.160  12-Oct-2014  christos branches: 1.160.2;
Refactor the multicast membership code so that we can handle v4 mapped
addresses using the v6 membership ioctls.
 1.159  11-Oct-2014  christos Make IPV4 mapped addresses able to do IPV4 multicast. Fixes needed:

- allow binding to mapped v4 multicast addresses
- define v4moptions, allow setting it via ioctl, pass it to ip_output,
free it when killing the pcb.

Ideally we would allow the IPV6 multicast setsockopts work on mapped addresses
too, but this is a lot more work and linux does not do it either.
 1.158  16-Aug-2014  maxv http://m00nbsd.net/ae123a9bae03f7dde5c6d654412daf5a.html#Report-2

#03-0x02: Memory leak

ok ozaki-r@
 1.157  30-May-2014  christos branches: 1.157.2;
Introduce 2 new variables: ipsec_enabled and ipsec_used.
Ipsec enabled is controlled by sysctl and determines if is allowed.
ipsec_used is set automatically based on ipsec being enabled, and
rules existing.
 1.156  17-May-2014  rmind Replace open-coded access (and boundary checking) of ifindex2ifnet with
if_byindex() function.
 1.155  03-Oct-2013  christos branches: 1.155.2;
check sockopt_get() error, from logan.
 1.154  29-Jun-2013  rmind - Rewrite parts of pfil(9): use array to store hooks and thus be more cache
friendly (there are only few hooks in the system). Make the structures
opaque and the interface more strict.
- Remove PFIL_HOOKS option by making pfil(9) mandatory.
 1.153  05-Jun-2013  christos branches: 1.153.2;
IPSEC has not come in two speeds for a long time now (IPSEC == kame,
FAST_IPSEC). Make everything refer to IPSEC to avoid confusion.
 1.152  18-Mar-2013  gdt Initialize variable used as (conditional) result parameter.

ip6_insertfraghdr either sets a result parameter or returns an error.
While the caller only uses the result parameter in the non-error case,
knowing that requires cross-module static analysis, and that's not
robust against distant code changes. Therfore, set ip6f to NULL
before the function call that maybe sets it, avoiding a spuruious
warning and changing the future possible bug from an unitialized
dereference to a NULL deferrence.
 1.151  25-Jan-2013  kefren don't return hlim when asked for multicast loop flag
 1.150  21-Jul-2012  gdt branches: 1.150.2;
Add comments describing parameter handling for ip6_insertfraghdr.

Depending on compiler options, this code can be involved in an
(apparently) spurious compiler warning. However, it was not
immediately obvious the the compiler was wrong.
 1.149  25-Jun-2012  christos rename rfc6056 -> portalgo, requested by yamt
 1.148  22-Jun-2012  christos PR/46602: Move the rfc6056 port randomization to the IP layer.
 1.147  22-Mar-2012  drochner remove KAME IPSEC, replaced by FAST_IPSEC
 1.146  13-Mar-2012  elad Replace the remaining KAUTH_GENERIC_ISSUSER authorization calls with
something meaningful. All relevant documentation has been updated or
written.

Most of these changes were brought up in the following messages:

http://mail-index.netbsd.org/tech-kern/2012/01/18/msg012490.html
http://mail-index.netbsd.org/tech-kern/2012/01/19/msg012502.html
http://mail-index.netbsd.org/tech-kern/2012/02/17/msg012728.html

Thanks to christos, manu, njoly, and jmmv for input.

Huge thanks to pgoyette for spinning these changes through some build
cycles and ATF.
 1.145  05-Feb-2012  rmind branches: 1.145.2; 1.145.6; 1.145.8;
ip6_output: check for rtcache_setdst() error, which may happen if running
out of memory.
 1.144  10-Jan-2012  drochner remove conditionals which can't succeed, and also shouldn't because
one would get a kernel NULL dereference immediately
 1.143  10-Jan-2012  drochner add patch from Arnaud Degroote to handle IPv6 extended options with
(FAST_)IPSEC, tested lightly with a DSTOPTS header consisting
of PAD1
 1.142  31-Dec-2011  christos - fix offsetof usage, and redundant defines
- kill pointer casts to 0
 1.141  19-Dec-2011  drochner rename the IPSEC in-kernel CPP variable and config(8) option to
KAME_IPSEC, and make IPSEC define it so that existing kernel
config files work as before
Now the default can be easily be changed to FAST_IPSEC just by
setting the IPSEC alias to FAST_IPSEC.
 1.140  25-Apr-2011  yamt branches: 1.140.4; 1.140.8;
undefer csum in looutput.
looutput is used by various code (ether_output, mcast) to loopback packets.
 1.139  07-May-2009  elad branches: 1.139.4; 1.139.6;
Remove some more "priv" variable usage in favor of kauth(9) calls.
 1.138  06-May-2009  elad Remove some usage of "priv" and "privileged" variables and instead pass
around credentials. Also push down kauth(9) calls closer to where the
operation is done.

Mailing list reference:

http://mail-index.netbsd.org/tech-net/2009/04/30/msg001270.html
 1.137  18-Apr-2009  drochner fix traversing of a control mbuf in the case that a message len
is not aligned wrt CMSG_ALIGN - the length counter drops below 0
in this case which was not checked for,
fixes crashes (with isc_dhcrelay4) reported by Uwe in tech-net
(subject: netbsd5-rc3 crash caused by isc_dhcrelay)
 1.136  18-Mar-2009  cegger bzero -> memset
 1.135  27-Oct-2008  plunky branches: 1.135.2; 1.135.6;
sockopt_getmbuf() may fail, handle that possibility
 1.134  12-Oct-2008  plunky branches: 1.134.2;
ip6_pcbopts() is called with the socket lock held, use M_NOWAIT
 1.133  12-Oct-2008  plunky ip6_pcbopt() is in the ctloutput path, we should not
sleep here because socket lock is held. use M_NOWAIT
 1.132  12-Oct-2008  plunky convert ip6_[sg]etmoptions() to use sockopt(9) API
should be no functional change
 1.131  12-Oct-2008  plunky do not sleep while allocating memory, socket lock is held
(use ENOBUFS for failure)
 1.130  06-Aug-2008  plunky Convert socket options code to use a sockopt structure
instead of laying everything into an mbuf.

approved by core
 1.129  23-Apr-2008  thorpej branches: 1.129.2; 1.129.4; 1.129.8;
Make IPSEC and FAST_IPSEC stats per-cpu. Use <net/net_stats.h> and
netstat_sysctl().
 1.128  15-Apr-2008  thorpej branches: 1.128.2;
Make ip6 and icmp6 stats per-cpu.
 1.127  08-Apr-2008  thorpej Change IPv6 stats from a structure to an array of uint64_t's.

Note: This is ABI-compatible with the old ip6stat structure; old netstat
binaries will continue to work properly.
 1.126  14-Jan-2008  dyoung branches: 1.126.2; 1.126.6;
Use rtcache_validate() instead of rtcache_getrt(). Shorten staircase
in in6_losing().
 1.125  10-Jan-2008  dyoung Save some rtcache_getrt() calls.
 1.124  20-Dec-2007  dyoung Poison struct route->ro_rt uses in the kernel by changing the name
to _ro_rt. Use rtcache_getrt() to access a route cache's struct
rtentry *.

Introduce struct ifnet->if_dl that always points at the interface
identifier/link-layer address. Make code that treated the first
ifaddr on struct ifnet->if_addrlist as the interface address use
if_dl, instead.

Remove stale debugging code from net/route.c. Move the rtflush()
code into rtcache_clear() and delete rtflush(). Delete rtalloc(),
because nothing uses it any more.

Make ND6_HINT an inline, lowercase subroutine, nd6_hint.

I've done my best to convert IP Filter, the ISO stack, and the
AppleTalk stack to rtcache_getrt(). They compile, but I have not
tested them. I have given the changes to PF, GRE, IPv4 and IPv6
stacks a lot of exercise.
 1.123  06-Nov-2007  dyoung branches: 1.123.2; 1.123.6;
Use sockaddr_in6_init().
 1.122  01-Nov-2007  dyoung branches: 1.122.2;
De-__P().
 1.121  19-Sep-2007  dyoung branches: 1.121.4;
1) Introduce a new socket option, (SOL_SOCKET, SO_NOHEADER), that
tells a socket that it should both add a protocol header to tx'd
datagrams and remove the header from rx'd datagrams:

int onoff = 1, s = socket(...);
setsockopt(s, SOL_SOCKET, SO_NOHEADER, &onoff);

2) Add an implementation of (SOL_SOCKET, SO_NOHEADER) for raw IPv4
sockets.

3) Reorganize the protocols' pr_ctloutput implementations a bit.
Consistently return ENOPROTOOPT when an option is unsupported,
and EINVAL if a supported option's arguments are incorrect.
Reorganize the flow of code so that it's more clear how/when
options are passed down the stack until they are handled.

Shorten some pr_ctloutput staircases for readability.

4) Extract common mbuf code into subroutines, add new sockaddr
methods, and introduce a new subroutine, fsocreate(), for reuse
later; use it first in sys_socket():

struct mbuf *m_getsombuf(struct socket *so)

Create an mbuf and make its owner the socket `so'.

struct mbuf *m_intopt(struct socket *so, int val)

Create an mbuf, make its owner the socket `so', put the
int `val' into it, and set its length to sizeof(int).


int fsocreate(..., int *fd)

Create a socket, a la socreate(9), put the socket into the
given LWP's descriptor table, return the descriptor at `fd'
on success.

void *sockaddr_addr(struct sockaddr *sa, socklen_t *slenp)
const void *sockaddr_const_addr(const struct sockaddr *sa, socklen_t *slenp)

Extract a pointer to the address part of a sockaddr. Write
the length of the address part at `slenp', if `slenp' is
not NULL.

socklen_t sockaddr_getlen(const struct sockaddr *sa)

Return the length of a sockaddr. This just evaluates to
sa->sa_len. I only add this for consistency with code that
appears in a portable userland library that I am going to
import.

const struct sockaddr *sockaddr_any(const struct sockaddr *sa)

Return the "don't care" sockaddr in the same family as
`sa'. This is the address a client should sobind(9) if it
does not care the source address and, if applicable, the
port et cetera that it uses.

const void *sockaddr_anyaddr(const struct sockaddr *sa, socklen_t *slenp)

Return the "don't care" sockaddr in the same family as
`sa'. This is the address a client should sobind(9) if it
does not care the source address and, if applicable, the
port et cetera that it uses.
 1.120  02-Jun-2007  alc branches: 1.120.6; 1.120.8;
don't increment `ip6stat.ip6s_noroute' here, it has already been done in
in6_src:in6_selectroute().

ok dyoung@
 1.119  23-May-2007  christos Ansify + add a few comments, from Karl Sjödahl
 1.118  02-May-2007  dyoung Eliminate address family-specific route caches (struct route, struct
route_in6, struct route_iso), replacing all caches with a struct
route.

The principle benefit of this change is that all of the protocol
families can benefit from route cache-invalidation, which is
necessary for correct routing. Route-cache invalidation fixes an
ancient PR, kern/3508, at long last; it fixes various other PRs,
also.

Discussions with and ideas from Joerg Sonnenberger influenced this
work tremendously. Of course, all design oversights and bugs are
mine.

DETAILS

1 I added to each address family a pool of sockaddrs. I have
introduced routines for allocating, copying, and duplicating,
and freeing sockaddrs:

struct sockaddr *sockaddr_alloc(sa_family_t af, int flags);
struct sockaddr *sockaddr_copy(struct sockaddr *dst,
const struct sockaddr *src);
struct sockaddr *sockaddr_dup(const struct sockaddr *src, int flags);
void sockaddr_free(struct sockaddr *sa);

sockaddr_alloc() returns either a sockaddr from the pool belonging
to the specified family, or NULL if the pool is exhausted. The
returned sockaddr has the right size for that family; sa_family
and sa_len fields are initialized to the family and sockaddr
length---e.g., sa_family = AF_INET and sa_len = sizeof(struct
sockaddr_in). sockaddr_free() puts the given sockaddr back into
its family's pool.

sockaddr_dup() and sockaddr_copy() work analogously to strdup()
and strcpy(), respectively. sockaddr_copy() KASSERTs that the
family of the destination and source sockaddrs are alike.

The 'flags' argumet for sockaddr_alloc() and sockaddr_dup() is
passed directly to pool_get(9).

2 I added routines for initializing sockaddrs in each address
family, sockaddr_in_init(), sockaddr_in6_init(), sockaddr_iso_init(),
etc. They are fairly self-explanatory.

3 structs route_in6 and route_iso are no more. All protocol families
use struct route. I have changed the route cache, 'struct route',
so that it does not contain storage space for a sockaddr. Instead,
struct route points to a sockaddr coming from the pool the sockaddr
belongs to. I added a new method to struct route, rtcache_setdst(),
for setting the cache destination:

int rtcache_setdst(struct route *, const struct sockaddr *);

rtcache_setdst() returns 0 on success, or ENOMEM if no memory is
available to create the sockaddr storage.

It is now possible for rtcache_getdst() to return NULL if, say,
rtcache_setdst() failed. I check the return value for NULL
everywhere in the kernel.

4 Each routing domain (struct domain) has a list of live route
caches, dom_rtcache. rtflushall(sa_family_t af) looks up the
domain indicated by 'af', walks the domain's list of route caches
and invalidates each one.
 1.117  04-Mar-2007  christos branches: 1.117.2; 1.117.4;
Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.
 1.116  21-Feb-2007  thorpej Replace the Mach-derived boolean_t type with the C99 bool type. A
future commit will replace use of TRUE and FALSE with true and false.
 1.115  17-Feb-2007  dyoung KNF: de-__P, bzero -> memset, bcmp -> memcmp. Remove extraneous
parentheses in return statements.

Cosmetic: don't open-code TAILQ_FOREACH().

Cosmetic: change types of variables to avoid oodles of casts: in
in6_src.c, avoid casts by changing several route_in6 pointers
to struct route pointers. Remove unnecessary casts to caddr_t
elsewhere.

Pave the way for eliminating address family-specific route caches:
soon, struct route will not embed a sockaddr, but it will hold
a reference to an external sockaddr, instead. We will set the
destination sockaddr using rtcache_setdst(). (I created a stub
for it, but it isn't used anywhere, yet.) rtcache_free() will
free the sockaddr. I have extracted from rtcache_free() a helper
subroutine, rtcache_clear(). rtcache_clear() will "forget" a
cached route, but it will not forget the destination by releasing
the sockaddr. I use rtcache_clear() instead of rtcache_free()
in rtcache_update(), because rtcache_update() is not supposed
to forget the destination.

Constify:

1 Introduce const accessor for route->ro_dst, rtcache_getdst().

2 Constify the 'dst' argument to ifnet->if_output(). This
led me to constify a lot of code called by output routines.

3 Constify the sockaddr argument to protosw->pr_ctlinput. This
led me to constify a lot of code called by ctlinput routines.

4 Introduce const macros for converting from a generic sockaddr
to family-specific sockaddrs, e.g., sockaddr_in: satocsin6,
satocsin, et cetera.
 1.114  10-Feb-2007  degroote branches: 1.114.2;
Commit my SoC work
Add ipv6 support for fast_ipsec
Note that currently, packet with extensions headers are not correctly
supported
Change the ipcomp logic
 1.113  29-Jan-2007  dyoung Cosmetic: bzero -> memset, remove gratuitous cast, compare pointer
with NULL instead of 0.
 1.112  29-Jan-2007  dyoung In In ip6_setmoptions(), don't leave a route cache (struct route_in6)
on the stack if we exit with EADDRNOTAVAIL.

(I already fixed this bug once tonight. Clearly, ip6_setmoptions
was cut-and-pasted from ip_setmoptions.)
 1.111  04-Jan-2007  elad Consistent usage of KAUTH_GENERIC_ISSUSER.
 1.110  27-Dec-2006  alc CID-3317: check for 'm != NULL' before using it (rework the code path to
explicitly return `EINVAL'. Before, it was done but later in
ip6_setpktopt() when checking for 'len < ...')
CID-3316: check for 'm != NULL' before using it

ok christos@
 1.109  15-Dec-2006  joerg Introduce new helper functions to abstract the route caching.
rtcache_init and rtcache_init_noclone lookup ro_dst and store
the result in ro_rt, taking care of the reference counting and
calling the domain specific route cache.
rtcache_free checks if a route was cashed and frees the reference.
rtcache_copy copies ro_dst of the given struct route, checking that
enough space is available and incrementing the reference count of the
cached rtentry if necessary.
rtcache_check validates that the cached route is still up. If it isn't,
it tries to look it up again. Afterwards ro_rt is either a valid again
or NULL.
rtcache_copy is used internally.

Adjust to callers of rtalloc/rtflush in the tree to check the sanity of
ro_dst first (if necessary). If it doesn't fit the expectations, free
the cache, otherwise check if the cached route is still valid. After
that combination, a single check for ro_rt == NULL is enough to decide
whether a new lookup needs to be done with a different ro_dst.
Make the route checking in gre stricter by repeating the loop check
after revalidation.
Remove some unused RADIX_MPATH code in in6_src.c. The logic is slightly
changed here to first validate the route and check RTF_GATEWAY
afterwards. This is sementically equivalent though.
etherip doesn't need sc_route_expire similiar to the gif changes from
dyoung@ earlier.

Based on the earlier patch from dyoung@, reviewed and discussed with
him.
 1.108  09-Dec-2006  dyoung Here are various changes designed to protect against bad IPv4
routing caused by stale route caches (struct route). Route caches
are sprinkled throughout PCBs, the IP fast-forwarding table, and
IP tunnel interfaces (gre, gif, stf).

Stale IPv6 and ISO route caches will be treated by separate patches.

Thank you to Christoph Badura for suggesting the general approach
to invalidating route caches that I take here.

Here are the details:

Add hooks to struct domain for tracking and for invalidating each
domain's route caches: dom_rtcache, dom_rtflush, and dom_rtflushall.

Introduce helper subroutines, rtflush(ro) for invalidating a route
cache, rtflushall(family) for invalidating all route caches in a
routing domain, and rtcache(ro) for notifying the domain of a new
cached route.

Chain together all IPv4 route caches where ro_rt != NULL. Provide
in_rtcache() for adding a route to the chain. Provide in_rtflush()
and in_rtflushall() for invalidating IPv4 route caches. In
in_rtflush(), set ro_rt to NULL, and remove the route from the
chain. In in_rtflushall(), walk the chain and remove every route
cache.

In rtrequest1(), call rtflushall() to invalidate route caches when
a route is added.

In gif(4), discard the workaround for stale caches that involves
expiring them every so often.

Replace the pattern 'RTFREE(ro->ro_rt); ro->ro_rt = NULL;' with a
call to rtflush(ro).

Update ipflow_fastforward() and all other users of route caches so
that they expect a cached route, ro->ro_rt, to turn to NULL.

Take care when moving a 'struct route' to rtflush() the source and
to rtcache() the destination.

In domain initializers, use .dom_xxx tags.

KNF here and there.
 1.107  02-Dec-2006  dyoung Use the queue(3) macros instead of open-coding them. Shorten
staircases. Remove unnecessary casts. Where appropriate, s/8/NBBY/.
De-__P(). KNF.

No functional changes intended.
 1.106  25-Nov-2006  yamt branches: 1.106.2; 1.106.4;
move tso-by-software code to their own files. no functional changes.
 1.105  23-Nov-2006  yamt implement ipv6 TSO.
partly from Matthias Scheler. tested by him.
 1.104  16-Nov-2006  christos __unused removal on arguments; approved by core.
 1.103  12-Oct-2006  christos - sprinkle __unused on function decls.
- fix a couple of unused bugs
- no more -Wno-unused for i386
 1.102  30-Aug-2006  christos branches: 1.102.2; 1.102.4;
remove impossible comparisons.
 1.101  23-Jul-2006  ad Use the LWP cached credentials where sane.
 1.100  12-Jul-2006  tron Add diagnostic checks for hardware-assisted checksum related flags in
the mbuf which supposed to get sent out:
- Complain in ip_output() if any of the IPv6 related flags are set.
- Complain in ip6_output() if any of the IPv4 related flags are set.
- Complain in both functions if the flags indicate that both a TCP and
UCP checksum should be calculated by the hardware.
 1.99  08-Jul-2006  rpaulo Add a missing piece from RFC 3542. KAME-NetBSD-current branch
revision 1.1.1.2.2.5:
do not call pfctlinput2(PRC_MSGSIZE) on fragmentation to avoid
notification storm

From Keiichi SHIMA:
"In the current NetBSD code, the PRC_MSGSIZE message will be generated
for every fragmented packets when a node is trying to send a big
packet. That was the intermediate behavior while RFC3542 was under
discussion."

By (obviously) the KAME project.
 1.98  14-May-2006  elad branches: 1.98.4;
integrate kauth.
 1.97  05-May-2006  rpaulo Add support for RFC 3542 Adv. Socket API for IPv6 (which obsoletes 2292).
* RFC 3542 isn't binary compatible with RFC 2292.
* RFC 2292 support is on by default but can be disabled.
* update ping6, telnet and traceroute6 to the new API.

From the KAME project (www.kame.net).
Reviewed by core.
 1.96  15-Apr-2006  christos Coverity CID 608: #ifdef out dead code.
 1.95  05-Mar-2006  rpaulo branches: 1.95.2; 1.95.4;
NDP-related improvements:
RFC4191
- supports host-side router-preference

RFC3542
- if DAD fails on a interface, disables IPv6 operation on the
interface
- don't advertise MLD report before DAD finishes

Others
- fixes integer overflow for valid and preferred lifetimes
- improves timer granularity for MLD, using callout-timer.
- reflects rtadvd's IPv6 host variable information into kernel
(router only)
- adds a sysctl option to enable/disable pMTUd for multicast
packets
- performs NUD on PPP/GRE interface by default
- Redirect works regardless of ip6_accept_rtadv
- removes RFC1885-related code

From the KAME project via SUZUKI Shinsuke.
Reviewed by core.
 1.94  21-Jan-2006  rpaulo branches: 1.94.2; 1.94.4; 1.94.6;
Better support of IPv6 scoped addresses.

- most of the kernel code will not care about the actual encoding of
scope zone IDs and won't touch "s6_addr16[1]" directly.
- similarly, most of the kernel code will not care about link-local
scoped addresses as a special case.
- scope boundary check will be stricter. For example, the current
*BSD code allows a packet with src=::1 and dst=(some global IPv6
address) to be sent outside of the node, if the application do:
s = socket(AF_INET6);
bind(s, "::1");
sendto(s, some_global_IPv6_addr);
This is clearly wrong, since ::1 is only meaningful within a single
node, but the current implementation of the *BSD kernel cannot
reject this attempt.
- and, while there, don't try to remove the ff02::/32 interface route
entry in in6_ifdetach() as it's already gone.

This also includes some level of support for the standard source
address selection algorithm defined in RFC3484, which will be
completed on in the future.

From the KAME project via JINMEI Tatuya.
Approved by core@.
 1.93  11-Dec-2005  christos branches: 1.93.2;
merge ktrace-lwp.
 1.92  23-Sep-2005  christos change bcopy to memmove since this was supposed to be an ovbcopy (from kre)
 1.91  18-Aug-2005  yamt - introduce M_MOVE_PKTHDR and use it where appropriate.
intended to be mostly API compatible with openbsd/freebsd.
- remove a glue #define in netipsec/ipsec_osdep.h.
 1.90  10-Aug-2005  yamt re-implement ipv6 tx loopback checksum omission.
 1.89  10-Aug-2005  yamt ipv6 tx checksum offloading. reviewed by Jason Thorpe.
 1.88  28-Feb-2005  itojun branches: 1.88.4;
make ip6_getpmtu back to static
 1.87  21-Dec-2004  drochner branches: 1.87.2; 1.87.4;
fix ifindex argument checks for IPV6_JOIN_GROUP,
IPV6_LEAVE_GROUP and IPV6_MULTICAST_IF -
0 is always legal
 1.86  04-Dec-2004  peter Convert lo(4) to a clonable device.

This also removes the loif array and changes all code to use the new
lo0ifp pointer which points to the lo0 ifnet structure.

Approved by christos.
 1.85  14-Jul-2004  itojun - update ro_pmtu on IPsec tunnel encapsulation. ro != ro_pmtu is used as the
sign for the existence of routing header.
- fragment to 1280 on IPv6-over-IPv6 encapsulation, as ICMPv6 too big may not
give you enough information to update pmtu cache.

from iij seil team, via kame.
 1.84  06-Jul-2004  minoura Remove broken code for now: getsockopt(s, IPPROTO_IP, IP_IPSEC_POLICY,...).
It returned EINVAL, now returns ENOPROTOOPT.
Ok'd by itojun.
 1.83  11-Jun-2004  itojun implement IPV6_USE_MIN_MTU sockopt. needed by bind9 + EDNS0 + big receive buffer.
 1.82  23-Mar-2004  martti branches: 1.82.2;
Make ip6_getpmtu() globally visible. This is needed by IPFilter 4.x.
 1.81  02-Mar-2004  thorpej Use the new IPSEC_PCB_SKIP_IPSEC() to bypass a socket policy lookup
when possible. This shaves several cycles from the output path for
non-IPsec connections, even if the policy is cached in the PCB.
 1.80  01-Mar-2004  itojun knf
 1.79  06-Feb-2004  itojun remove unneeded #ifdef
 1.78  04-Feb-2004  itojun strictly follow RFC2460 section 5 last paragraph
(sending rule when PMTU < 1280). pointed out by guninski at guninski.com
 1.77  24-Jan-2004  darrenr make ip6_getpmtu() externally visible
 1.76  19-Jan-2004  itojun do not lookup security policy if IPV6_FORWARDING.
avoids possible infinite ipsec encapsulation on
ip6_input -> ip6_forward -(tunnel mode)-> ip6_output
case. from kame
 1.75  10-Dec-2003  itojun fix cases where pktinfo specifies outgoing interface of "0".
 1.74  10-Dec-2003  itojun use if_indexlim (instead of if_index) and ifindex2ifnet[x] != NULL
to check if interface exists, as (1) if_index has different meaning
(2) ifindex2ifnet could become NULL when interface gets destroyed,
since when we have introduced dynamically-created interfaces. from kame
 1.73  06-Nov-2003  itojun correct behavior when ipv6mr_interface is 0. Matthias Drochner
 1.72  30-Oct-2003  simonb Remove some assigned-to but otherwise unused variables.
 1.71  03-Oct-2003  itojun when dropping M_PKTHDR, need to free m_tag associated with it.
 1.70  06-Sep-2003  itojun randomize IPv4/v6 fragment ID and IPv6 flowlabel. avoids predictability
of these fields. ip_id.c is from openbsd. ip6_id.c is adapted by kame.
 1.69  05-Sep-2003  itojun u_short -> u_int16_t. sync w/ kame.
don't set ip6_plen where unneeded (i.e. before calling ip6_output)
 1.68  04-Sep-2003  itojun don't use m_cat to mbuf of different types. KAME-PR-495
 1.67  25-Aug-2003  itojun don't commit value into ip6_ptkopts until the validation is done.
(note: the code will be updated with 2292bis definition soon, hopefully)
 1.66  22-Aug-2003  itojun remove ipsec_set/getsocket. now we explicitly pass socket * to ip{,6}_output.
 1.65  22-Aug-2003  itojun change the additional arg to be passed to ip{,6}_output to struct socket *.

this fixes KAME policy lookup which was broken by the previous commit.
 1.64  22-Aug-2003  jonathan Change KAME code for ip_output()/ip6_output() to obtain struct socket*
from the explicit inpcb*/in6pcb* argument. set_socket() becomes redundant.
 1.63  22-Aug-2003  jonathan Replace the set_socket() method of passing an extra struct socket*
argument to ip6_output() with a new explicit struct in6pcb* argument.
(The underlying socket can be obtained via in6pcb->inp6_socket.)

In preparation for fast-ipsec. Reviewed by itojun.
 1.62  07-Aug-2003  agc Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22364, verified by myself.
 1.61  06-Jun-2003  itojun branches: 1.61.2;
- sync up MLD declaration with RFC3542 (s/MLD6/MLD/)
- routing header declaration with RFC3542
(note: sizeof(ip6_rthdr0) has changed!)
also, sync up with RFC2460 routing header definition (no "strict" source
routing mode any more)

part of advanced API update (RFC2292 -> 3542).
 1.60  02-Nov-2002  perry /*CONTCOND*/ while (0)'ed macros
 1.59  31-Oct-2002  itojun plug a memory leak. from sam leffler. sync w/kame
 1.58  23-Sep-2002  itojun length field on PADN option, before jumbo payload option was wrong.
sync w/kame
 1.57  11-Sep-2002  itojun KNF - return is not a function. sync w/kame.
 1.56  11-Sep-2002  itojun correct signedness mixup in pointer passing. sync w/kame
 1.55  09-Jun-2002  itojun whitespace cleanup
 1.54  08-Jun-2002  itojun sync with latest KAME in6_ifaddr/prefix/default router manipulation.
behavior changes:
- two iocts used by ndp(8) are now obsolete (backward compat provided).
use sysctl path instead.
- lo0 does not get ::1 automatically. it will get ::1 when lo0 comes up.
 1.53  07-Jun-2002  itojun sync IPV6_CHECKSUM handling with kame.
 1.52  07-Jun-2002  itojun comment
 1.51  07-Jun-2002  itojun whitespace
 1.50  07-Jun-2002  itojun remove #if 0'ed portion
 1.49  07-Jun-2002  itojun KNF a bit
 1.48  07-Jun-2002  itojun typo
 1.47  07-Jun-2002  itojun 'fall through' is not a valid LINT keyword.
 1.46  31-May-2002  itojun do not try to update rmx_mtu if rmx_mtu == 0 (obey ifmtu)
 1.45  29-May-2002  itojun attach nd_ifinfo structure into if_afdata.
split IPv6 link MTU (advertised by RA) from real link MTU.
sync with kame
 1.44  28-Mar-2002  itojun branches: 1.44.2; 1.44.4;
make sure to check address family in route cache
(I really hate IPv4 mapped address...)
 1.43  20-Dec-2001  itojun centralize multicast group management (in6_join/leavegroup).
have a flag for ip6_output() to fragment to minimum MTU.
sync with kame
 1.42  18-Dec-2001  itojun reduce white space/cosmetic diffs w/kame.
 1.41  13-Nov-2001  lukem add RCSIDs
 1.40  24-Oct-2001  itojun more whitespace sync with kame
 1.39  18-Oct-2001  itojun branches: 1.39.2;
reduce diffs with kame (mostly cosmetic).
move IPV6_CHECKSUM processing to sys/netinet6/raw_ip6.c.
constify a couple of places.
 1.38  17-Oct-2001  itojun unifdef OLDIP6OUTPUT
 1.37  15-Oct-2001  itojun implement IPV6_V6ONLY socket option from draft-ietf-ipngwg-rfc2553bis-03.txt.
IPV6_BINDV6ONLY (netbsd only) is deprecated, but still work just like before.
 1.36  11-Jun-2001  itojun branches: 1.36.2;
remove IPV6FIREWALL case, which is never used
 1.35  11-Apr-2001  itojun disallow userland programs from specifying addresses with IPV6_PKTINFO
setsockopt, if:
- the address is not verified by DAD (= not ready)
- the address is an anycast address (= not permitted as source)
sync with kame
 1.34  30-Mar-2001  itojun enable FAKE_LOOPBACK_IF case by default.
now traffic on loopback interface will be presented to bpf as normal wire
format packet (without KAME scopeid in s6_addr16[1]).

fix KAME PR 250 (host mistakenly accepts packets to fe80::x%lo0).

sync with kame.
 1.33  25-Mar-2001  itojun re-initialize mopt in ip6_insert_jumboopt(). sync with kame
From: csapuntz@stanford.edu
 1.32  21-Mar-2001  itojun set rmx_mtu to L2 interface mtu, instead of 0, on mtudisc timeout.
ip6_output() change is for safety. sync with kame
 1.31  10-Feb-2001  itojun branches: 1.31.2;
to sync with kame better, (1) remove register declaration for variables,
(2) sync whitespaces, (3) update comments. (4) bring in some of portability
and logging enhancements. no functional changes here.
 1.30  06-Feb-2001  itojun bad semicolon after "if" conditional. sync with kame
 1.29  02-Feb-2001  itojun avoid panic when a packet with nonexistent link-local address is issued.
kame 1.151 -> 1.152.
 1.28  24-Jan-2001  itojun - record IPsec packet history into m_aux structure.
- let ipfilter look at wire-format packet only (not the decapsulated ones),
so that VPN setting can work with NAT/ipfilter settings.
sync with kame.

TODO: use header history for stricter inbound validation
 1.27  11-Nov-2000  thorpej Restructure the PFIL_HOOKS mechanism a bit:
- All packets are passed to PFIL_HOOKS as they come off the wire, i.e.
fields in protocol headers in network order, etc.
- Allow for multiple hooks to be registered, using a "key" and a "dlt".
The "dlt" is a BPF data link type, indicating what type of header is
present.
- INET and INET6 register with key == AF_INET or AF_INET6, and
dlt == DLT_RAW.
- PFIL_HOOKS now take an argument for the filter hook, and mbuf **,
an ifnet *, and a direction (PFIL_IN or PFIL_OUT), thus making them
less IP (really, IP Filter) centric.

Maintain compatibility with IP Filter by adding wrapper functions for
IP Filter.
 1.26  23-Oct-2000  itojun make IFA_STATS really work on IPv6.
 1.25  19-Aug-2000  itojun - icmp6 nodeinfo: remove possibility of unaligned pointer access.
- jumbo payload output: fix incorrect mbuf manipulation
- pedant: align issues, mbuf assumption
(sync with kame)
 1.24  06-Jul-2000  itojun remove unnecessary #include <netkey/key_debug.h>. from kame.
 1.23  20-Jun-2000  itojun branches: 1.23.2;
avoid possible mbuf leaks on ipsec policy violation.(sync with kame)
 1.22  03-Jun-2000  itojun sync with kame.
- use latest source address selection code - in6_src.c.
- correct frag header insertion.
- deep copy ip6 header portion in ip6_mloopback to avoid overwrite.
- do not bark when we forward packet to loopback.
- some cosmetics.
 1.21  19-May-2000  itojun branches: 1.21.2;
correct manipulation of link-local scoped address on loopback.
now "telnet fe80::1%lo0" should work again.
(we have another bug near here - will attack it soon)
 1.20  19-May-2000  thorpej NULL != 0
 1.19  19-May-2000  itojun do not mistakingly forward link-local scoped packet (the bug was added
with "beyondscope" icmp6 support).
"options FAKE_LOOPBACK_IF" will honor scope on loopback outputs. rcvif will
be real interface, not the loopback, just like when multicast loopback.

(sync with kame)
 1.18  29-Mar-2000  simonb Remove duplicate declaration of ifindex2ifnet - it's in <net/if.h>.
 1.17  01-Mar-2000  itojun introduce m->m_pkthdr.aux to hold random data which needs to be passed
between protocol handlers.

ipsec socket pointers, ipsec decryption/auth information, tunnel
decapsulation information are in my mind - there can be several other usage.
at this moment, we use this for ipsec socket pointer passing. this will
avoid reuse of m->m_pkthdr.rcvif in ipsec code.

due to the change, MHLEN will be decreased by sizeof(void *) - for example,
for i386, MHLEN was 100 bytes, but is now 96 bytes.
we may want to increase MSIZE from 128 to 256 for some of our architectures.

take caution if you use it for keeping some data item for long period
of time - use extra caution on M_PREPEND() or m_adj(), as they may result
in loss of m->m_pkthdr.aux pointer (and mbuf leak).

this will bump kernel version.

(as discussed in tech-net, tested in kame tree)
 1.16  20-Feb-2000  darrenr pass "struct pfil_head *" to pfil_add_hook and pfil_remove hook rather
than "struct protosw *".
 1.15  17-Feb-2000  darrenr Change the use of pfil hooks. There is no longer a single list of all
pfil information, instead, struct protosw now contains a structure
which caontains list heads, etc. The per-protosw pfil struct is passed
to pfil_hook_get(), along with an in/out flag to get the head of the
relevant filter list. This has been done for only IPv4 and IPv6, at
present, with these patches only enabling filtering for IPPROTO_IP and
IPPROTO_IPV6, although it is possible to have tcp/udp, etc, dedicated
filters now also. The ipfilter code has been updated to only filter
IPv4 packets - next major release of ipfilter is required for ipv6.
 1.14  06-Feb-2000  itojun fix include pathname for better rfc2292 compliance.
 1.13  31-Jan-2000  itojun bring in latest KAME ipsec tree.
- interop issues in ipcomp is fixed
- padding type (after ESP) is configurable
- key database memory management (need more fixes)
- policy specification is revisited

XXX m->m_pkthdr.rcvif is still overloaded - hope to fix it soon
 1.12  26-Jan-2000  itojun make setsockopt(IPV6_PORTRANGE) work. obeys IPNOPRIVPORTS.
 1.11  06-Jan-2000  itojun remove extra portability #ifdef (like #ifdef __FreeBSD__) in KAME IPv6/IPsec
code, from netbsd-current repository.
#ifdef'ed version is always available from ftp.kame.net.

XXX please do not make too many diff-unfriendly changes, we'll need to take
bunch of diffs on upgrade...
 1.10  06-Jan-2000  itojun make IPV6_BINDV6ONLY setsockopt available. it controls behavior of
AF_INET6 wildcard listening socket. heavily documented in ip6(4).
net.inet6.ip6.bindv6only defines default value. default is 1.

"options INET6_BINDV6ONLY" removes any code fragment that supports
IPV6_BINDV6ONLY == 0 case (not defopt'ed as use of this is rare).
 1.9  13-Dec-1999  itojun sync IPv6 part with latest KAME tree. IPsec part is left unmodified
due to massive changes in KAME side.
- IPv6 output goes through nd6_output
- faith can capture IPv4 packets as well - you can run IPv4-to-IPv6 translator
using heavily modified DNS servers
- per-interface statistics (required for IPv6 MIB)
- interface autoconfig is revisited
- udp input handling has a big change for mapped address support.
- introduce in4_cksum() for non-overwriting checksumming
- introduce m_pulldown()
- neighbor discovery cleanups/improvements
- netinet/in.h strictly conforms to RFC2553 (no extra defs visible to userland)
- IFA_STATS is fixed a bit (not tested)
- and more more more.

TODO:
- cleanup os-independency #ifdef
- avoid rcvif dual use (for IPsec) to help ifdetach

(sorry for jumbo commit, I can't separate this any more...)
 1.8  31-Jul-1999  itojun branches: 1.8.2; 1.8.8;
sync with recent KAME.
- loosen ipsec restriction on packet diredtion.
- revise icmp6 redirect handling on IsRouter bit.
- tcp/udp notification processing (link-local address case)
- cosmetic fixes (better code share across *BSD).
 1.7  30-Jul-1999  itojun remove reference to in6_systm.h (file itself will be removed afterwords)
 1.6  22-Jul-1999  itojun - implement IPv6 pmtud, which is necessary for TCP6.
- fix memory leak on SO_DEBUG over TCP.
 1.5  22-Jul-1999  itojun change unnecessary u_long/long into u_int32_t or something relevant.
more fixes should follow.
 1.4  09-Jul-1999  thorpej defopt IPSEC and IPSEC_ESP (both into opt_ipsec.h).
 1.3  03-Jul-1999  thorpej RCS ID police.
 1.2  01-Jul-1999  itojun branches: 1.2.2;
IPv6 kernel code, based on KAME/NetBSD 1.4, SNAP kit 19990628.
(Sorry for a big commit, I can't separate this into several pieces...)
Pls check sys/netinet6/TODO and sys/netinet6/IMPLEMENTATION for details.

- sys/kern: do not assume single mbuf, accept chained mbuf on passing
data from userland to kernel (or other way round).
- "midway" ATM card: ATM PVC pseudo device support, like those done in ALTQ
package (ftp://ftp.csl.sony.co.jp/pub/kjc/).
- sys/netinet/tcp*: IPv4/v6 dual stack tcp support.
- sys/netinet/{ip6,icmp6}.h, sys/net/pfkeyv2.h: IETF document assumes those
file to be there so we patch it up.
- sys/netinet: IPsec additions are here and there.
- sys/netinet6/*: most of IPv6 code sits here.
- sys/netkey: IPsec key management code
- dev/pci/pcidevs: regen

In my understanding no code here is subject to export control so it
should be safe.
 1.1  28-Jun-1999  itojun branches: 1.1.2;
file ip6_output.c was initially added on branch kame.
 1.1.2.2  30-Nov-1999  itojun bring in latest KAME (as of 19991130, KAME/NetBSD141) into kame branch
just for reference purposes.
This commit includes 1.4 -> 1.4.1 sync for kame branch.

The branch does not compile at all (due to the lack of ALTQ and some other
source code). Please do not try to modify the branch, this is just for
referenre purposes.

synchronization to latest KAME will take place on HEAD branch soon.
 1.1.2.1  28-Jun-1999  itojun KAME/NetBSD 1.4 SNAP kit, dated 19990628.

NOTE: this branch (kame) is used just for refernce. this may not compile
due to multiple reasons.
 1.2.2.3  02-Aug-1999  thorpej Update from trunk.
 1.2.2.2  01-Jul-1999  thorpej Sync w/ -current.
 1.2.2.1  01-Jul-1999  thorpej file ip6_output.c was added on branch chs-ubc2 on 1999-07-01 23:48:28 +0000
 1.8.8.1  27-Dec-1999  wrstuden Pull up to last week's -current.
 1.8.2.5  21-Apr-2001  bouyer Sync with HEAD
 1.8.2.4  27-Mar-2001  bouyer Sync with HEAD.
 1.8.2.3  11-Feb-2001  bouyer Sync with HEAD.
 1.8.2.2  22-Nov-2000  bouyer Sync with HEAD.
 1.8.2.1  20-Nov-2000  bouyer Update thorpej_scsipi to -current as of a month ago
 1.21.2.1  22-Jun-2000  minoura Sync w/ netbsd-1-5-base.
 1.23.2.7  23-Sep-2002  itojun Correct length field on PADN option, before jumbo payload option.
 1.23.2.6  26-Feb-2002  he Apply patch (requested by martti):
Fix it so that IPFilter handles IPv6 traffic.
 1.23.2.5  22-Apr-2001  he Pull up revision 1.35 (requested by itojun):
Disallow addresses that are not supposed to be put into IPv6
source, on IPV6_PKTINFO setsockopt.
 1.23.2.4  06-Apr-2001  he Pull up revision 1.28 (requested by itojun):
Record IPsec packet history in m_aux structure. Let ipfilter
look at wire-format packet only (not the decapsulated ones), so
that VPN setting can work with NAT/ipfilter settings.
 1.23.2.3  26-Feb-2001  he Pull up revision 1.30 (requested by itojun):
Remove a misplaced semicolon after ``if'' conditional.
 1.23.2.2  04-Feb-2001  he Pull up revision 1.29 (via patch, requested by itojun):
Avoid panic when a packet with nonexistent link-local address is
issued.
 1.23.2.1  20-Jun-2000  he file ip6_output.c was added on branch netbsd-1-5 on 2001-02-04 19:22:08 +0000
 1.31.2.14  11-Nov-2002  nathanw Catch up to -current
 1.31.2.13  18-Oct-2002  nathanw Catch up to -current.
 1.31.2.12  17-Sep-2002  nathanw Catch up to -current.
 1.31.2.11  12-Jul-2002  nathanw No longer need to pull in lwp.h; proc.h pulls it in for us.
 1.31.2.10  24-Jun-2002  nathanw Curproc->curlwp renaming.

Change uses of "curproc->l_proc" back to "curproc", which is more like the
original use. Bare uses of "curproc" are now "curlwp".

"curproc" is now #defined in proc.h as ((curlwp) ? (curlwp)->l_proc) : NULL)
so that it is always safe to reference curproc (*de*referencing curproc
is another story, but that's always been true).
 1.31.2.9  20-Jun-2002  nathanw Catch up to -current.
 1.31.2.8  01-Apr-2002  nathanw Catch up to -current.
(CVS: It's not just a program. It's an adventure!)
 1.31.2.7  08-Jan-2002  nathanw Catch up to -current.
 1.31.2.6  14-Nov-2001  nathanw Catch up to -current.
 1.31.2.5  22-Oct-2001  nathanw Catch up to -current.
 1.31.2.4  21-Jun-2001  nathanw Catch up to -current.
 1.31.2.3  09-Apr-2001  nathanw Catch up with -current.
 1.31.2.2  13-Mar-2001  nathanw Be more careful not to dereference curproc when there might not be
a process context.
 1.31.2.1  05-Mar-2001  nathanw Initial commit of scheduler activations and lightweight process support.
 1.36.2.3  10-Oct-2002  jdolecek sync kqueue with -current; this includes merge of gehenna-devsw branch,
merge of i386 MP branch, and part of autoconf rototil work
 1.36.2.2  23-Jun-2002  jdolecek catch up with -current on kqueue branch
 1.36.2.1  10-Jan-2002  thorpej Sync kqueue branch with -current.
 1.39.2.1  12-Nov-2001  thorpej Sync the thorpej-mips-cache branch with -current.
 1.44.4.5  14-Jun-2004  jmc Pullup rev 1.83 (requested by itojun in ticket #1709)

Implement IPV6_USE_MIN_MTU sockopt.
 1.44.4.4  07-Feb-2004  jmc Pullup rev 1.78 (requested by itojun in ticket #1605)

Strictly follow RFC2460 section 5 last paragraph (sending rule when
PMTU < 1280).
 1.44.4.3  02-Oct-2003  tron Pull up revision 1.54 via patch (requested by itojun in ticket #1491):
sync with latest KAME in6_ifaddr/prefix/default router manipulation.
behavior changes:
- two iocts used by ndp(8) are now obsolete (backward compat provided).
use sysctl path instead.
- lo0 does not get ::1 automatically. it will get ::1 when lo0 comes up.
 1.44.4.2  13-Oct-2002  lukem Pull up revision 1.58 (requested by itojun in ticket #855):
length field on PADN option, before jumbo payload option was wrong.
sync w/kame
 1.44.4.1  05-Jun-2002  lukem Pull up revision 1.46 (via patch) (requested by itojun in ticket #123):
do not try to update rmx_mtu if rmx_mtu == 0 (obey ifmtu)
 1.44.2.2  20-Jun-2002  gehenna catch up with -current.
 1.44.2.1  30-May-2002  gehenna Catch up with -current.
 1.61.2.7  10-Nov-2005  skrll Sync with HEAD. Here we go again...
 1.61.2.6  04-Mar-2005  skrll Sync with HEAD.

Hi Perry!
 1.61.2.5  17-Jan-2005  skrll Sync with HEAD.
 1.61.2.4  18-Dec-2004  skrll Sync with HEAD.
 1.61.2.3  21-Sep-2004  skrll Fix the sync with head I botched.
 1.61.2.2  18-Sep-2004  skrll Sync with HEAD.
 1.61.2.1  03-Aug-2004  skrll Sync with HEAD
 1.82.2.1  14-Jun-2004  tron Pull up revision 1.83 (requested by itojun in ticket #468):
implement IPV6_USE_MIN_MTU sockopt. needed by bind9 + EDNS0 + big receive buffer.
 1.87.4.1  19-Mar-2005  yamt sync with head. xen and whitespace. xen part is not finished.
 1.87.2.1  29-Apr-2005  kent sync with -current
 1.88.4.7  21-Jan-2008  yamt sync with head
 1.88.4.6  15-Nov-2007  yamt sync with head.
 1.88.4.5  27-Oct-2007  yamt sync with head.
 1.88.4.4  03-Sep-2007  yamt sync with head.
 1.88.4.3  26-Feb-2007  yamt sync with head.
 1.88.4.2  30-Dec-2006  yamt sync with head.
 1.88.4.1  21-Jun-2006  yamt sync with head.
 1.93.2.1  01-Feb-2006  yamt sync with head.
 1.94.6.4  03-Sep-2006  yamt sync with head.
 1.94.6.3  11-Aug-2006  yamt sync with head
 1.94.6.2  24-May-2006  yamt sync with head.
 1.94.6.1  13-Mar-2006  yamt sync with head.
 1.94.4.2  01-Jun-2006  kardel Sync with head.
 1.94.4.1  22-Apr-2006  simonb Sync with head.
 1.94.2.5  09-Sep-2006  rpaulo sync with head
 1.94.2.4  23-Feb-2006  rpaulo ip6_raw_ctloutput(): s/in6pcb/inpcb
ip6_optlen(): convert to inpcb
 1.94.2.3  14-Feb-2006  rpaulo in6pcb -> inpcb.
 1.94.2.2  07-Feb-2006  rpaulo sotoinpcb_hdr -> sotoinpcb.
 1.94.2.1  07-Feb-2006  rpaulo remove in6_pcb.h and include in_pcb.h.
 1.95.4.1  24-May-2006  tron Merge 2006-05-24 NetBSD-current into the "peter-altq" branch.
 1.95.2.6  12-May-2006  elad adapt to kauth kpi, include sys/kauth.h where needed..
 1.95.2.5  11-May-2006  elad sync with head
 1.95.2.4  06-May-2006  christos - Move kauth_cred_t declaration to <sys/types.h>
- Cleanup struct ucred; forward declarations that are unused.
- Don't include <sys/kauth.h> in any header, but include it in the c files
that need it.

Approved by core.
 1.95.2.3  19-Apr-2006  elad sync with head.
 1.95.2.2  10-Mar-2006  elad generic_authorize() -> kauth_authorize_generic().
 1.95.2.1  08-Mar-2006  elad Adapt to kernel authorization KPI.
 1.98.4.1  13-Jul-2006  gdamore Merge from HEAD.
 1.102.4.3  18-Dec-2006  yamt sync with head.
 1.102.4.2  10-Dec-2006  yamt sync with head.
 1.102.4.1  22-Oct-2006  yamt sync with head
 1.102.2.3  01-Feb-2007  ad Sync with head.
 1.102.2.2  12-Jan-2007  ad Sync with head.
 1.102.2.1  18-Nov-2006  ad Sync with head.
 1.106.4.1  04-Jun-2007  wrstuden Update to today's netbsd-4.
 1.106.2.1  24-May-2007  pavel Pull up following revision(s) (requested by degroote in ticket #667):
sys/netinet/tcp_input.c: revision 1.260
sys/netinet/tcp_output.c: revision 1.154
sys/netinet/tcp_subr.c: revision 1.210
sys/netinet6/icmp6.c: revision 1.129
sys/netinet6/in6_proto.c: revision 1.70
sys/netinet6/ip6_forward.c: revision 1.54
sys/netinet6/ip6_input.c: revision 1.94
sys/netinet6/ip6_output.c: revision 1.114
sys/netinet6/raw_ip6.c: revision 1.81
sys/netipsec/ipcomp_var.h: revision 1.4
sys/netipsec/ipsec.c: revision 1.26 via patch,1.31-1.32
sys/netipsec/ipsec6.h: revision 1.5
sys/netipsec/ipsec_input.c: revision 1.14
sys/netipsec/ipsec_netbsd.c: revision 1.18,1.26
sys/netipsec/ipsec_output.c: revision 1.21 via patch
sys/netipsec/key.c: revision 1.33,1.44
sys/netipsec/xform_ipcomp.c: revision 1.9
sys/netipsec/xform_ipip.c: revision 1.15
sys/opencrypto/deflate.c: revision 1.8
Commit my SoC work
Add ipv6 support for fast_ipsec
Note that currently, packet with extensions headers are not correctly
supported
Change the ipcomp logic

Add sysctl tree to modify the fast_ipsec options related to ipv6. Similar
to the sysctl kame interface.

Choose the good default policy, depending of the adress family of the
desired policy

Increase the refcount for the default ipv6 policy so nobody can reclaim it

Always compute the sp index even if we don't have any sp in spd. It will
let us to choose the right default policy (based on the adress family
requested).
While here, fix an error message

Use dynamic array instead of an static array to decompress. It lets us to
decompress any data, whatever is the radio decompressed data / compressed
data.
It fixes the last issues with fast_ipsec and ipcomp.
While here, bzero -> memset, bcopy -> memcpy, FREE -> free
Reviewed a long time ago by sam@
 1.114.2.3  07-May-2007  yamt sync with head.
 1.114.2.2  12-Mar-2007  rmind Sync with HEAD.
 1.114.2.1  27-Feb-2007  yamt - sync with head.
- move sched_changepri back to kern_synch.c as it doesn't know PPQ anymore.
 1.117.4.1  11-Jul-2007  mjf Sync with head.
 1.117.2.3  09-Oct-2007  ad Sync with head.
 1.117.2.2  09-Jun-2007  ad Sync with head.
 1.117.2.1  08-Jun-2007  ad Sync with head.
 1.120.8.4  23-Mar-2008  matt sync with HEAD
 1.120.8.3  09-Jan-2008  matt sync with HEAD
 1.120.8.2  08-Nov-2007  matt sync with -HEAD
 1.120.8.1  06-Nov-2007  matt sync with HEAD
 1.120.6.3  11-Nov-2007  joerg Sync with HEAD.
 1.120.6.2  04-Nov-2007  jmcneill Sync with HEAD.
 1.120.6.1  02-Oct-2007  joerg Sync with HEAD.
 1.121.4.1  13-Nov-2007  bouyer Sync with HEAD
 1.122.2.3  18-Feb-2008  mjf Sync with HEAD.
 1.122.2.2  27-Dec-2007  mjf Sync with HEAD.
 1.122.2.1  19-Nov-2007  mjf Sync with HEAD.
 1.123.6.3  19-Jan-2008  bouyer Sync with HEAD
 1.123.6.2  10-Jan-2008  bouyer Sync with HEAD
 1.123.6.1  02-Jan-2008  bouyer Sync with HEAD
 1.123.2.1  26-Dec-2007  ad Sync with head.
 1.126.6.3  17-Jan-2009  mjf Sync with HEAD.
 1.126.6.2  28-Sep-2008  mjf Sync with HEAD.
 1.126.6.1  02-Jun-2008  mjf Sync with HEAD.
 1.126.2.1  22-Feb-2008  keiichi imported Mobile IPv6 code developed by the SHISA project
(http://www.mobileip.jp/).
 1.128.2.1  18-May-2008  yamt sync with head.
 1.129.8.2  13-Dec-2008  haad Update haad-dm branch to haad-dm-base2.
 1.129.8.1  19-Oct-2008  haad Sync with HEAD.
 1.129.4.1  18-Sep-2008  wrstuden Sync with wrstuden-revivesa-base-2.
 1.129.2.2  16-May-2009  yamt sync with head
 1.129.2.1  04-May-2009  yamt sync with head.
 1.134.2.2  28-Apr-2009  skrll Sync with HEAD.
 1.134.2.1  19-Jan-2009  skrll Sync with HEAD.
 1.135.6.1  13-May-2009  jym Sync with HEAD.

Commit is split, to avoid a "too many arguments" protocol error.
 1.135.2.2  27-Aug-2014  msaitoh Pull up following revision(s) (requested by maxv in ticket #1920):
sys/netinet6/ip6_output.c 1.158 via patch

Fix a memory leak in calling setsockopt() on an INET6 socket.
 1.135.2.1  20-Apr-2009  snj branches: 1.135.2.1.6; 1.135.2.1.10;
Pull up following revision(s) (requested by drochner in ticket #713):
sys/netinet6/ip6_output.c: revision 1.137
fix traversing of a control mbuf in the case that a message len
is not aligned wrt CMSG_ALIGN - the length counter drops below 0
in this case which was not checked for,
fixes crashes (with isc_dhcrelay4) reported by Uwe in tech-net
(subject: netbsd5-rc3 crash caused by isc_dhcrelay)
 1.135.2.1.10.1  27-Aug-2014  msaitoh Pull up following revision(s) (requested by maxv in ticket #1920):
sys/netinet6/ip6_output.c 1.158 via patch

Fix a memory leak in calling setsockopt() on an INET6 socket.
 1.135.2.1.6.1  27-Aug-2014  msaitoh Pull up following revision(s) (requested by maxv in ticket #1920):
sys/netinet6/ip6_output.c 1.158 via patch

Fix a memory leak in calling setsockopt() on an INET6 socket.
 1.139.6.1  06-Jun-2011  jruoho Sync with HEAD.
 1.139.4.1  31-May-2011  rmind sync with head
 1.140.8.2  05-Apr-2012  mrg sync to latest -current.
 1.140.8.1  18-Feb-2012  mrg merge to -current.
 1.140.4.3  22-May-2014  yamt sync with head.

for a reference, the tree before this commit was tagged
as yamt-pagecache-tag8.

this commit was splitted into small chunks to avoid
a limitation of cvs. ("Protocol error: too many arguments")
 1.140.4.2  30-Oct-2012  yamt sync with head
 1.140.4.1  17-Apr-2012  yamt sync with head
 1.145.8.1  27-Aug-2014  msaitoh Pull up following revision(s) (requested by maxv in ticket #1114):
sys/netinet6/ip6_output.c 1.158 via patch

Fix a memory leak in calling setsockopt() on an INET6 socket.
 1.145.6.1  27-Aug-2014  msaitoh Pull up following revision(s) (requested by maxv in ticket #1114):
sys/netinet6/ip6_output.c 1.158 via patch

Fix a memory leak in calling setsockopt() on an INET6 socket.
 1.145.2.1  27-Aug-2014  msaitoh Pull up following revision(s) (requested by maxv in ticket #1114):
sys/netinet6/ip6_output.c 1.158 via patch

Fix a memory leak in calling setsockopt() on an INET6 socket.
 1.150.2.4  03-Dec-2017  jdolecek update from HEAD
 1.150.2.3  20-Aug-2014  tls Rebase to HEAD as of a few days ago.
 1.150.2.2  23-Jun-2013  tls resync from head
 1.150.2.1  25-Feb-2013  tls resync with head
 1.153.2.3  18-May-2014  rmind sync with head
 1.153.2.2  28-Aug-2013  rmind sync with head
 1.153.2.1  17-Jul-2013  rmind Checkpoint work in progress:
- Move PCB structures under __INPCB_PRIVATE, adjust most of the callers
and thus make IPv4 PCB structures mostly opaque. Any volunteers for
merging in6pcb with inpcb (see rpaulo-netinet-merge-pcb branch)?
- Move various global vars to the modules where they belong, make them static.
- Some preliminary work for IPv4 PCB locking scheme.
- Make raw IP code mostly MP-safe. Simplify some of it.
- Rework "fast" IP forwarding (ipflow) code to be mostly MP-safe. It should
run from a software interrupt, rather than hard.
- Rework tun(4) pseudo interface to be MP-safe.
- Work towards making some other interfaces more strict.
 1.155.2.1  10-Aug-2014  tls Rebase.
 1.157.2.3  14-Feb-2015  snj Pull up following revision(s) (requested by roy in ticket #509):
sys/netinet6/ip6_output.c: revision 1.163
CID/1267860: Missing break in switch
 1.157.2.2  23-Jan-2015  martin Pull up following revision(s) (requested by pettai in ticket #441):
sys/netinet6/ip6_var.h: revision 1.64
sys/netinet6/in6.h: revision 1.82
sys/netinet6/in6_src.c: revision 1.56
sys/netinet6/mld6.c: revision 1.62
sys/netinet6/ip6_input.c: revision 1.150
sys/netinet6/ip6_output.c: revision 1.161
Add net.inet6.ip6.prefer_tempaddr sysctl knob so that we can prefer
IPv6 temporary addresses as the source address.
Fixes PR kern/47100 based on a patch by Dieter Roelants.
 1.157.2.1  24-Aug-2014  martin Pull up following revision(s) (requested by maxv in ticket #51):
sys/netinet6/ip6_output.c: revision 1.158
sys/rump/librump/rumpvfs/rumpfs.c: revision 1.130
Fix memory leaks in error cases
 1.160.2.8  28-Aug-2017  skrll Sync with HEAD
 1.160.2.7  05-Feb-2017  skrll Sync with HEAD
 1.160.2.6  05-Dec-2016  skrll Sync with HEAD
 1.160.2.5  05-Oct-2016  skrll Sync with HEAD
 1.160.2.4  09-Jul-2016  skrll Sync with HEAD
 1.160.2.3  22-Sep-2015  skrll Sync with HEAD
 1.160.2.2  06-Jun-2015  skrll Sync with HEAD
 1.160.2.1  06-Apr-2015  skrll Sync with HEAD
 1.171.2.4  20-Mar-2017  pgoyette Sync with HEAD
 1.171.2.3  07-Jan-2017  pgoyette Sync with HEAD. (Note that most of these changes are simply $NetBSD$
tag issues.)
 1.171.2.2  04-Nov-2016  pgoyette Sync with HEAD
 1.171.2.1  06-Aug-2016  pgoyette Sync with HEAD
 1.180.2.1  21-Apr-2017  bouyer Sync with HEAD
 1.191.6.6  04-Aug-2023  martin Pull up following revision(s) (requested by ozaki-r in ticket #1884):

sys/netinet6/in6.c: revision 1.289
sys/netinet6/ip6_output.c: revision 1.234

in6: clear ND6_IFF_IFDISABLED to allow DAD again on link-up

in6: don't send any IPv6 packets over a disabled interface
 1.191.6.5  23-Mar-2023  martin Pull up following revision(s) (requested by ozaki-r in ticket #1808):

sys/netinet6/raw_ip6.c: revision 1.183 (via patch)
sys/netinet6/ip6_output.c: revision 1.233

in6: reject setting negative values but -1 via setsockopt(IPV6_CHECKSUM)
Same as OpenBSD.

in6: make sure a user-specified checksum field is within a packet
From OpenBSD
 1.191.6.4  02-Jan-2018  snj Pull up following revision(s) (requested by ozaki-r in ticket #456):
sys/arch/arm/sunxi/sunxi_emac.c: 1.9
sys/dev/ic/dwc_gmac.c: 1.43-1.44
sys/dev/pci/if_iwm.c: 1.75
sys/dev/pci/if_wm.c: 1.543
sys/dev/pci/ixgbe/ixgbe.c: 1.112
sys/dev/pci/ixgbe/ixv.c: 1.74
sys/kern/sys_socket.c: 1.75
sys/net/agr/if_agr.c: 1.43
sys/net/bpf.c: 1.219
sys/net/if.c: 1.397, 1.399, 1.401-1.403, 1.406-1.410, 1.412-1.416
sys/net/if.h: 1.242-1.247, 1.250, 1.252-1.257
sys/net/if_bridge.c: 1.140 via patch, 1.142-1.146
sys/net/if_etherip.c: 1.40
sys/net/if_ethersubr.c: 1.243, 1.246
sys/net/if_faith.c: 1.57
sys/net/if_gif.c: 1.132
sys/net/if_l2tp.c: 1.15, 1.17
sys/net/if_loop.c: 1.98-1.101
sys/net/if_media.c: 1.35
sys/net/if_pppoe.c: 1.131-1.132
sys/net/if_spppsubr.c: 1.176-1.177
sys/net/if_tun.c: 1.142
sys/net/if_vlan.c: 1.107, 1.109, 1.114-1.121
sys/net/npf/npf_ifaddr.c: 1.3
sys/net/npf/npf_os.c: 1.8-1.9
sys/net/rtsock.c: 1.230
sys/netcan/if_canloop.c: 1.3-1.5
sys/netinet/if_arp.c: 1.255
sys/netinet/igmp.c: 1.65
sys/netinet/in.c: 1.210-1.211
sys/netinet/in_pcb.c: 1.180
sys/netinet/ip_carp.c: 1.92, 1.94
sys/netinet/ip_flow.c: 1.81
sys/netinet/ip_input.c: 1.362
sys/netinet/ip_mroute.c: 1.147
sys/netinet/ip_output.c: 1.283, 1.285, 1.287
sys/netinet6/frag6.c: 1.61
sys/netinet6/in6.c: 1.251, 1.255
sys/netinet6/in6_pcb.c: 1.162
sys/netinet6/ip6_flow.c: 1.35
sys/netinet6/ip6_input.c: 1.183
sys/netinet6/ip6_output.c: 1.196
sys/netinet6/mld6.c: 1.90
sys/netinet6/nd6.c: 1.239-1.240
sys/netinet6/nd6_nbr.c: 1.139
sys/netinet6/nd6_rtr.c: 1.136
sys/netipsec/ipsec_output.c: 1.65
sys/rump/net/lib/libnetinet/netinet_component.c: 1.9-1.10
kmem_intr_free kmem_intr_[z]alloced memory
the underlying pools are the same but api-wise those should match
Unify IFEF_*_MPSAFE into IFEF_MPSAFE
There are already two flags for if_output and if_start, however, it seems such
MPSAFE flags are eventually needed for all if_XXX operations. Having discrete
flags for each operation is wasteful of if_extflags bits. So let's unify
the flags into one: IFEF_MPSAFE.
Fortunately IFEF_*_MPSAFE flags have never been included in any releases, so
we can change them without breaking backward compatibility of the releases
(though the kernel version of -current should be bumped).
Note that if an interface have both MP-safe and non-MP-safe operations at a
time, we have to set the IFEF_MPSAFE flag and let callees of non-MP-safe
opeartions take the kernel lock.
Proposed on tech-kern@ and tech-net@
Provide macros for softnet_lock and KERNEL_LOCK hiding NET_MPSAFE switch
It reduces C&P codes such as "#ifndef NET_MPSAFE KERNEL_LOCK(1, NULL); ..."
scattered all over the source code and makes it easy to identify remaining
KERNEL_LOCK and/or softnet_lock that are held even if NET_MPSAFE.
No functional change
Hold KERNEL_LOCK on if_ioctl selectively based on IFEF_MPSAFE
If IFEF_MPSAFE is set, hold the lock and otherwise don't hold.
This change requires additions of KERNEL_LOCK to subsequence functions from
if_ioctl such as ifmedia_ioctl and ifioctl_common to protect non-MP-safe
components.
Proposed on tech-kern@ and tech-net@
Ensure to hold if_ioctl_lock when calling if_flags_set
Fix locking against myself on ifpromisc
vlan_unconfig_locked could be called with holding if_ioctl_lock.
Ensure to not turn on IFF_RUNNING of an interface until its initialization completes
And ensure to turn off it before destruction as per IFF_RUNNING's description
"resource allocated". (The description is a bit doubtful though, I believe the
change is still proper.)
Ensure to hold if_ioctl_lock on if_up and if_down
One exception for if_down is if_detach; in the case the lock isn't needed
because it's guaranteed that no other one can access ifp at that point.
Make if_link_queue MP-safe if IFEF_MPSAFE
if_link_queue is a queue to store events of link state changes, which is
used to pass events from (typically) an interrupt handler to
if_link_state_change softint. The queue was protected by KERNEL_LOCK so far,
but if IFEF_MPSAFE is enabled, it becomes unsafe because (perhaps) an interrupt
handler of an interface with IFEF_MPSAFE doesn't take KERNEL_LOCK. Protect it
by a spin mutex.
Additionally with this change KERNEL_LOCK of if_link_state_change softint is
omitted if NET_MPSAFE is enabled.
Note that the spin mutex is now ifp->if_snd.ifq_lock as well as the case of
if_timer (see the comment).
Use IFADDR_WRITER_FOREACH instead of IFADDR_READER_FOREACH
At that point no other one modifies the list so IFADDR_READER_FOREACH
is unnecessary. Use of IFADDR_READER_FOREACH is harmless in general though,
if we try to detect contract violations of pserialize, using it violates
the contract. So avoid using it makes life easy.
Ensure to call if_addr_init with holding if_ioctl_lock
Get rid of outdated comments
Fix build of kernels without ether
By throwing out if_enable_vlan_mtu and if_disable_vlan_mtu that
created a unnecessary dependency from if.c to if_ethersubr.c.
PR kern/52790
Rename IFNET_LOCK to IFNET_GLOBAL_LOCK
IFNET_LOCK will be used in another lock, if_ioctl_lock (might be renamed then).
Wrap if_ioctl_lock with IFNET_* macros (NFC)
Also if_ioctl_lock perhaps needs to be renamed to something because it's now
not just for ioctl...
Reorder some destruction routines in if_detach
- Destroy if_ioctl_lock at the end of the if_detach because it's used in various
destruction routines
- Move psref_target_destroy after pr_purgeif because we want to use psref in
pr_purgeif (otherwise destruction procedures can be tricky)
Ensure to call if_mcast_op with holding IFNET_LOCK
Note that CARP doesn't deal with IFNET_LOCK yet.
Remove IFNET_GLOBAL_LOCK where it's unnecessary because IFNET_LOCK is held
Describe which lock is used to protect each member variable of struct ifnet
Requested by skrll@
Write a guideline for converting an interface to IFEF_MPSAFE
Requested by skrll@
Note that IFNET_LOCK must not be held in softint
Don't set IFEF_MPSAFE unless NET_MPSAFE at this point
Because recent investigations show that interfaces with IFEF_MPSAFE need to
follow additional restrictions to work with the flag safely. We should enable it
on an interface by default only if the interface surely satisfies the
restrictions, which are described in if.h.
Note that enabling IFEF_MPSAFE solely gains a few benefit on performance because
the network stack is still serialized by the big kernel locks by default.
 1.191.6.3  10-Dec-2017  snj Pull up following revision(s) (requested by roy in ticket #390):
sys/netinet/ip_input.c: 1.363
sys/netinet6/ip6_input.c: 1.184-1.185
sys/netinet6/ip6_output.c: 1.194-1.195
sys/netinet6/in6_src.c: 1.83-1.84
Allow local communication over DETACHED addresses.
Allow binding to DETACHED or TENTATIVE addresses as we deny
sending upstream from them anyway.
Prefer non DETACHED or TENTATIVE addresses.
--
Attempt to restore v6 networking. Not 100% certain that these
changes are all that is needed, but they're certainly a big part of it
(especially the ip6_input.c change.)
--
Treat unvalidated addresses as deprecated in rule 3.
 1.191.6.2  21-Oct-2017  snj Pull up following revision(s) (requested by ozaki-r in ticket #300):
crypto/dist/ipsec-tools/src/setkey/parse.y: 1.19
crypto/dist/ipsec-tools/src/setkey/token.l: 1.20
distrib/sets/lists/tests/mi: 1.754, 1.757, 1.759
doc/TODO.smpnet: 1.12-1.13
sys/net/pfkeyv2.h: 1.32
sys/net/raw_cb.c: 1.23-1.24, 1.28
sys/net/raw_cb.h: 1.28
sys/net/raw_usrreq.c: 1.57-1.58
sys/net/rtsock.c: 1.228-1.229
sys/netinet/in_proto.c: 1.125
sys/netinet/ip_input.c: 1.359-1.361
sys/netinet/tcp_input.c: 1.359-1.360
sys/netinet/tcp_output.c: 1.197
sys/netinet/tcp_var.h: 1.178
sys/netinet6/icmp6.c: 1.213
sys/netinet6/in6_proto.c: 1.119
sys/netinet6/ip6_forward.c: 1.88
sys/netinet6/ip6_input.c: 1.181-1.182
sys/netinet6/ip6_output.c: 1.193
sys/netinet6/ip6protosw.h: 1.26
sys/netipsec/ipsec.c: 1.100-1.122
sys/netipsec/ipsec.h: 1.51-1.61
sys/netipsec/ipsec6.h: 1.18-1.20
sys/netipsec/ipsec_input.c: 1.44-1.51
sys/netipsec/ipsec_netbsd.c: 1.41-1.45
sys/netipsec/ipsec_output.c: 1.49-1.64
sys/netipsec/ipsec_private.h: 1.5
sys/netipsec/key.c: 1.164-1.234
sys/netipsec/key.h: 1.20-1.32
sys/netipsec/key_debug.c: 1.18-1.21
sys/netipsec/key_debug.h: 1.9
sys/netipsec/keydb.h: 1.16-1.20
sys/netipsec/keysock.c: 1.59-1.62
sys/netipsec/keysock.h: 1.10
sys/netipsec/xform.h: 1.9-1.12
sys/netipsec/xform_ah.c: 1.55-1.74
sys/netipsec/xform_esp.c: 1.56-1.72
sys/netipsec/xform_ipcomp.c: 1.39-1.53
sys/netipsec/xform_ipip.c: 1.50-1.54
sys/netipsec/xform_tcp.c: 1.12-1.16
sys/rump/librump/rumpkern/Makefile.rumpkern: 1.170
sys/rump/librump/rumpnet/net_stub.c: 1.27
sys/sys/protosw.h: 1.67-1.68
tests/net/carp/t_basic.sh: 1.7
tests/net/if_gif/t_gif.sh: 1.11
tests/net/if_l2tp/t_l2tp.sh: 1.3
tests/net/ipsec/Makefile: 1.7-1.9
tests/net/ipsec/algorithms.sh: 1.5
tests/net/ipsec/common.sh: 1.4-1.6
tests/net/ipsec/t_ipsec_ah_keys.sh: 1.2
tests/net/ipsec/t_ipsec_esp_keys.sh: 1.2
tests/net/ipsec/t_ipsec_gif.sh: 1.6-1.7
tests/net/ipsec/t_ipsec_l2tp.sh: 1.6-1.7
tests/net/ipsec/t_ipsec_misc.sh: 1.8-1.18
tests/net/ipsec/t_ipsec_sockopt.sh: 1.1-1.2
tests/net/ipsec/t_ipsec_tcp.sh: 1.1-1.2
tests/net/ipsec/t_ipsec_transport.sh: 1.5-1.6
tests/net/ipsec/t_ipsec_tunnel.sh: 1.9
tests/net/ipsec/t_ipsec_tunnel_ipcomp.sh: 1.1-1.2
tests/net/ipsec/t_ipsec_tunnel_odd.sh: 1.3
tests/net/mcast/t_mcast.sh: 1.6
tests/net/net/t_ipaddress.sh: 1.11
tests/net/net_common.sh: 1.20
tests/net/npf/t_npf.sh: 1.3
tests/net/route/t_flags.sh: 1.20
tests/net/route/t_flags6.sh: 1.16
usr.bin/netstat/fast_ipsec.c: 1.22
Do m_pullup before mtod

It may fix panicks of some tests on anita/sparc and anita/GuruPlug.
---
KNF
---
Enable DEBUG for babylon5
---
Apply C99-style struct initialization to xformsw
---
Tweak outputs of netstat -s for IPsec

- Get rid of "Fast"
- Use ipsec and ipsec6 for titles to clarify protocol
- Indent outputs of sub protocols

Original outputs were organized like this:

(Fast) IPsec:
IPsec ah:
IPsec esp:
IPsec ipip:
IPsec ipcomp:
(Fast) IPsec:
IPsec ah:
IPsec esp:
IPsec ipip:
IPsec ipcomp:

New outputs are organized like this:

ipsec:
ah:
esp:
ipip:
ipcomp:
ipsec6:
ah:
esp:
ipip:
ipcomp:
---
Add test cases for IPComp
---
Simplify IPSEC_OSTAT macro (NFC)
---
KNF; replace leading whitespaces with hard tabs
---
Introduce and use SADB_SASTATE_USABLE_P
---
KNF
---
Add update command for testing

Updating an SA (SADB_UPDATE) requires that a process issuing
SADB_UPDATE is the same as a process issued SADB_ADD (or SADB_GETSPI).
This means that update command must be used with add command in a
configuration of setkey. This usage is normally meaningless but
useful for testing (and debugging) purposes.
---
Add test cases for updating SA/SP

The tests require newly-added udpate command of setkey.
---
PR/52346: Frank Kardel: Fix checksumming for NAT-T
See XXX for improvements.
---
Remove codes for PACKET_TAG_IPSEC_IN_CRYPTO_DONE

It seems that PACKET_TAG_IPSEC_IN_CRYPTO_DONE is for network adapters
that have IPsec accelerators; a driver sets the mtag to a packet
when its device has already encrypted the packet.

Unfortunately no driver implements such offload features for long
years and seems unlikely to implement them soon. (Note that neither
FreeBSD nor Linux doesn't have such drivers.) Let's remove related
(unused) codes and simplify the IPsec code.
---
Fix usages of sadb_msg_errno
---
Avoid updating sav directly

On SADB_UPDATE a target sav was updated directly, which was unsafe.
Instead allocate another sav, copy variables of the old sav to
the new one and replace the old one with the new one.
---
Simplify; we can assume sav->tdb_xform cannot be NULL while it's valid
---
Rename key_alloc* functions (NFC)

We shouldn't use the term "alloc" for functions that just look up
data and actually don't allocate memory.
---
Use explicit_memset to surely zero-clear key_auth and key_enc
---
Make sure to clear keys on error paths of key_setsaval
---
Add missing KEY_FREESAV
---
Make sure a sav is inserted to a sah list after its initialization completes
---
Remove unnecessary zero-clearing codes from key_setsaval

key_setsaval is now used only for a newly-allocated sav. (It was
used to reset variables of an existing sav.)
---
Correct wrong assumption of sav->refcnt in key_delsah

A sav in a list is basically not to be sav->refcnt == 0. And also
KEY_FREESAV assumes sav->refcnt > 0.
---
Let key_getsavbyspi take a reference of a returning sav
---
Use time_mono_to_wall (NFC)
---
Separate sending message routine (NFC)
---
Simplify; remove unnecessary zero-clears

key_freesaval is used only when a target sav is being destroyed.
---
Omit NULL checks for sav->lft_c

sav->lft_c can be NULL only when initializing or destroying sav.
---
Omit unnecessary NULL checks for sav->sah
---
Omit unnecessary check of sav->state

key_allocsa_policy picks a sav of either MATURE or DYING so we
don't need to check its state again.
---
Simplify; omit unnecessary saidx passing

- ipsec_nextisr returns a saidx but no caller uses it
- key_checkrequest is passed a saidx but it can be gotton by
another argument (isr)
---
Fix splx isn't called on some error paths
---
Fix header size calculation of esp where sav is NULL
---
Fix header size calculation of ah in the case sav is NULL

This fix was also needed for esp.
---
Pass sav directly to opencrypto callback

In a callback, use a passed sav as-is by default and look up a sav
only if the passed sav is dead.
---
Avoid examining freshness of sav on packet processing

If a sav list is sorted (by lft_c->sadb_lifetime_addtime) in advance,
we don't need to examine each sav and also don't need to delete one
on the fly and send up a message. Fortunately every sav lists are sorted
as we need.

Added key_validate_savlist validates that each sav list is surely sorted
(run only if DEBUG because it's not cheap).
---
Add test cases for SAs with different SPIs
---
Prepare to stop using isr->sav

isr is a shared resource and using isr->sav as a temporal storage
for each packet processing is racy. And also having a reference from
isr to sav makes the lifetime of sav non-deterministic; such a reference
is removed when a packet is processed and isr->sav is overwritten by
new one. Let's have a sav locally for each packet processing instead of
using shared isr->sav.

However this change doesn't stop using isr->sav yet because there are
some users of isr->sav. isr->sav will be removed after the users find
a way to not use isr->sav.
---
Fix wrong argument handling
---
fix printf format.
---
Don't validate sav lists of LARVAL or DEAD states

We don't sort the lists so the validation will always fail.

Fix PR kern/52405
---
Make sure to sort the list when changing the state by key_sa_chgstate
---
Rename key_allocsa_policy to key_lookup_sa_bysaidx
---
Separate test files
---
Calculate ah_max_authsize on initialization as well as esp_max_ivlen
---
Remove m_tag_find(PACKET_TAG_IPSEC_PENDING_TDB) because nobody sets the tag
---
Restore a comment removed in previous

The comment is valid for the below code.
---
Make tests more stable

sleep command seems to wait longer than expected on anita so
use polling to wait for a state change.
---
Add tests that explicitly delete SAs instead of waiting for expirations
---
Remove invalid M_AUTHIPDGM check on ESP isr->sav

M_AUTHIPDGM flag is set to a mbuf in ah_input_cb. An sav of ESP can
have AH authentication as sav->tdb_authalgxform. However, in that
case esp_input and esp_input_cb are used to do ESP decryption and
AH authentication and M_AUTHIPDGM never be set to a mbuf. So
checking M_AUTHIPDGM of a mbuf on isr->sav of ESP is meaningless.
---
Look up sav instead of relying on unstable sp->req->sav

This code is executed only in an error path so an additional lookup
doesn't matter.
---
Correct a comment
---
Don't release sav if calling crypto_dispatch again
---
Remove extra KEY_FREESAV from ipsec_process_done

It should be done by the caller.
---
Don't bother the case of crp->crp_buf == NULL in callbacks
---
Hold a reference to an SP during opencrypto processing

An SP has a list of isr (ipsecrequest) that represents a sequence
of IPsec encryption/authentication processing. One isr corresponds
to one opencrypto processing. The lifetime of an isr follows its SP.

We pass an isr to a callback function of opencrypto to continue
to a next encryption/authentication processing. However nobody
guaranteed that the isr wasn't freed, i.e., its SP wasn't destroyed.

In order to avoid such unexpected destruction of isr, hold a reference
to its SP during opencrypto processing.
---
Don't make SAs expired on tests that delete SAs explicitly
---
Fix a debug message
---
Dedup error paths (NFC)
---
Use pool to allocate tdb_crypto

For ESP and AH, we need to allocate an extra variable space in addition
to struct tdb_crypto. The fixed size of pool items may be larger than
an actual requisite size of a buffer, but still the performance
improvement by replacing malloc with pool wins.
---
Don't use unstable isr->sav for header size calculations

We may need to optimize to not look up sav here for users that
don't need to know an exact size of headers (e.g., TCP segmemt size
caclulation).
---
Don't use sp->req->sav when handling NAT-T ESP fragmentation

In order to do this we need to look up a sav however an additional
look-up degrades performance. A sav is later looked up in
ipsec4_process_packet so delay the fragmentation check until then
to avoid an extra look-up.
---
Don't use key_lookup_sp that depends on unstable sp->req->sav

It provided a fast look-up of SP. We will provide an alternative
method in the future (after basic MP-ification finishes).
---
Stop setting isr->sav on looking up sav in key_checkrequest
---
Remove ipsecrequest#sav
---
Stop setting mtag of PACKET_TAG_IPSEC_IN_DONE because there is no users anymore
---
Skip ipsec_spi_*_*_preferred_new_timeout when running on qemu

Probably due to PR 43997
---
Add localcount to rump kernels
---
Remove unused macro
---
Fix key_getcomb_setlifetime

The fix adjusts a soft limit to be 80% of a corresponding hard limit.

I'm not sure the fix is really correct though, at least the original
code is wrong. A passed comb is zero-cleared before calling
key_getcomb_setlifetime, so
comb->sadb_comb_soft_addtime = comb->sadb_comb_soft_addtime * 80 / 100;
is meaningless.
---
Provide and apply key_sp_refcnt (NFC)

It simplifies further changes.
---
Fix indentation

Pointed out by knakahara@
---
Use pslist(9) for sptree
---
Don't acquire global locks for IPsec if NET_MPSAFE

Note that the change is just to make testing easy and IPsec isn't MP-safe yet.
---
Let PF_KEY socks hold their own lock instead of softnet_lock

Operations on SAD and SPD are executed via PF_KEY socks. The operations
include deletions of SAs and SPs that will use synchronization mechanisms
such as pserialize_perform to wait for references to SAs and SPs to be
released. It is known that using such mechanisms with holding softnet_lock
causes a dead lock. We should avoid the situation.
---
Make IPsec SPD MP-safe

We use localcount(9), not psref(9), to make the sptree and secpolicy (SP)
entries MP-safe because SPs need to be referenced over opencrypto
processing that executes a callback in a different context.

SPs on sockets aren't managed by the sptree and can be destroyed in softint.
localcount_drain cannot be used in softint so we delay the destruction of
such SPs to a thread context. To do so, a list to manage such SPs is added
(key_socksplist) and key_timehandler_spd deletes dead SPs in the list.

For more details please read the locking notes in key.c.

Proposed on tech-kern@ and tech-net@
---
Fix updating ipsec_used

- key_update_used wasn't called in key_api_spddelete2 and key_api_spdflush
- key_update_used wasn't called if an SP had been added/deleted but
a reply to userland failed
---
Fix updating ipsec_used; turn on when SPs on sockets are added
---
Add missing IPsec policy checks to icmp6_rip6_input

icmp6_rip6_input is quite similar to rip6_input and the same checks exist
in rip6_input.
---
Add test cases for setsockopt(IP_IPSEC_POLICY)
---
Don't use KEY_NEWSP for dummy SP entries

By the change KEY_NEWSP is now not called from softint anymore
and we can use kmem_zalloc with KM_SLEEP for KEY_NEWSP.
---
Comment out unused functions
---
Add test cases that there are SPs but no relevant SAs
---
Don't allow sav->lft_c to be NULL

lft_c of an sav that was created by SADB_GETSPI could be NULL.
---
Clean up clunky eval strings

- Remove unnecessary \ at EOL
- This allows to omit ; too
- Remove unnecessary quotes for arguments of atf_set
- Don't expand $DEBUG in eval
- We expect it's expanded on execution

Suggested by kre@
---
Remove unnecessary KEY_FREESAV in an error path

sav should be freed (unreferenced) by the caller.
---
Use pslist(9) for sahtree
---
Use pslist(9) for sah->savtree
---
Rename local variable newsah to sah

It may not be new.
---
MP-ify SAD slightly

- Introduce key_sa_mtx and use it for some list operations
- Use pserialize for some list iterations
---
Introduce KEY_SA_UNREF and replace KEY_FREESAV with it where sav will never be actually freed in the future

KEY_SA_UNREF is still key_freesav so no functional change for now.

This change reduces diff of further changes.
---
Remove out-of-date log output

Pointed out by riastradh@
---
Use KDASSERT instead of KASSERT for mutex_ownable

Because mutex_ownable is too heavy to run in a fast path
even for DIAGNOSTIC + LOCKDEBUG.

Suggested by riastradh@
---
Assemble global lists and related locks into cache lines (NFCI)

Also rename variable names from *tree to *list because they are
just lists, not trees.

Suggested by riastradh@
---
Move locking notes
---
Update the locking notes

- Add locking order
- Add locking notes for misc lists such as reglist
- Mention pserialize, key_sp_ref and key_sp_unref on SP operations

Requested by riastradh@
---
Describe constraints of key_sp_ref and key_sp_unref

Requested by riastradh@
---
Hold key_sad.lock on SAVLIST_WRITER_INSERT_TAIL
---
Add __read_mostly to key_psz

Suggested by riastradh@
---
Tweak wording (pserialize critical section => pserialize read section)

Suggested by riastradh@
---
Add missing mutex_exit
---
Fix setkey -D -P outputs

The outputs were tweaked (by me), but I forgot updating libipsec
in my local ATF environment...
---
MP-ify SAD (key_sad.sahlist and sah entries)

localcount(9) is used to protect key_sad.sahlist and sah entries
as well as SPD (and will be used for SAD sav).

Please read the locking notes of SAD for more details.
---
Introduce key_sa_refcnt and replace sav->refcnt with it (NFC)
---
Destroy sav only in the loop for DEAD sav
---
Fix KASSERT(solocked(sb->sb_so)) failure in sbappendaddr that is called eventually from key_sendup_mbuf

If key_sendup_mbuf isn't passed a socket, the assertion fails.
Originally in this case sb->sb_so was softnet_lock and callers
held softnet_lock so the assertion was magically satisfied.
Now sb->sb_so is key_so_mtx and also softnet_lock isn't always
held by callers so the assertion can fail.

Fix it by holding key_so_mtx if key_sendup_mbuf isn't passed a socket.

Reported by knakahara@
Tested by knakahara@ and ozaki-r@
---
Fix locking notes of SAD
---
Fix deadlock between key_sendup_mbuf called from key_acquire and localcount_drain

If we call key_sendup_mbuf from key_acquire that is called on packet
processing, a deadlock can happen like this:
- At key_acquire, a reference to an SP (and an SA) is held
- key_sendup_mbuf will try to take key_so_mtx
- Some other thread may try to localcount_drain to the SP with
holding key_so_mtx in say key_api_spdflush
- In this case localcount_drain never return because key_sendup_mbuf
that has stuck on key_so_mtx never release a reference to the SP

Fix the deadlock by deferring key_sendup_mbuf to the timer
(key_timehandler).
---
Fix that prev isn't cleared on retry
---
Limit the number of mbufs queued for deferred key_sendup_mbuf

It's easy to be queued hundreds of mbufs on the list under heavy
network load.
---
MP-ify SAD (savlist)

localcount(9) is used to protect savlist of sah. The basic design is
similar to MP-ifications of SPD and SAD sahlist. Please read the
locking notes of SAD for more details.
---
Simplify ipsec_reinject_ipstack (NFC)
---
Add per-CPU rtcache to ipsec_reinject_ipstack

It reduces route lookups and also reduces rtcache lock contentions
when NET_MPSAFE is enabled.
---
Use pool_cache(9) instead of pool(9) for tdb_crypto objects

The change improves network throughput especially on multi-core systems.
---
Update

ipsec(4), opencrypto(9) and vlan(4) are now MP-safe.
---
Write known issues on scalability
---
Share a global dummy SP between PCBs

It's never be changed so it can be pre-allocated and shared safely between PCBs.
---
Fix race condition on the rawcb list shared by rtsock and keysock

keysock now protects itself by its own mutex, which means that
the rawcb list is protected by two different mutexes (keysock's one
and softnet_lock for rtsock), of course it's useless.

Fix the situation by having a discrete rawcb list for each.
---
Use a dedicated mutex for rt_rawcb instead of softnet_lock if NET_MPSAFE
---
fix localcount leak in sav. fixed by ozaki-r@n.o.

I commit on behalf of him.
---
remove unnecessary comment.
---
Fix deadlock between pserialize_perform and localcount_drain

A typical ussage of localcount_drain looks like this:

mutex_enter(&mtx);
item = remove_from_list();
pserialize_perform(psz);
localcount_drain(&item->localcount, &cv, &mtx);
mutex_exit(&mtx);

This sequence can cause a deadlock which happens for example on the following
situation:

- Thread A calls localcount_drain which calls xc_broadcast after releasing
a specified mutex
- Thread B enters the sequence and calls pserialize_perform with holding
the mutex while pserialize_perform also calls xc_broadcast
- Thread C (xc_thread) that calls an xcall callback of localcount_drain tries
to hold the mutex

xc_broadcast of thread B doesn't start until xc_broadcast of thread A
finishes, which is a feature of xcall(9). This means that pserialize_perform
never complete until xc_broadcast of thread A finishes. On the other hand,
thread C that is a callee of xc_broadcast of thread A sticks on the mutex.
Finally the threads block each other (A blocks B, B blocks C and C blocks A).

A possible fix is to serialize executions of the above sequence by another
mutex, but adding another mutex makes the code complex, so fix the deadlock
by another way; the fix is to release the mutex before pserialize_perform
and instead use a condvar to prevent pserialize_perform from being called
simultaneously.

Note that the deadlock has happened only if NET_MPSAFE is enabled.
---
Add missing ifdef NET_MPSAFE
---
Take softnet_lock on pr_input properly if NET_MPSAFE

Currently softnet_lock is taken unnecessarily in some cases, e.g.,
icmp_input and encap4_input from ip_input, or not taken even if needed,
e.g., udp_input and tcp_input from ipsec4_common_input_cb. Fix them.

NFC if NET_MPSAFE is disabled (default).
---
- sanitize key debugging so that we don't print extra newlines or unassociated
debugging messages.
- remove unused functions and make internal ones static
- print information in one line per message
---
humanize printing of ip addresses
---
cast reduction, NFC.
---
Fix typo in comment
---
Pull out ipsec_fill_saidx_bymbuf (NFC)
---
Don't abuse key_checkrequest just for looking up sav

It does more than expected for example key_acquire.
---
Fix SP is broken on transport mode

isr->saidx was modified accidentally in ipsec_nextisr.

Reported by christos@
Helped investigations by christos@ and knakahara@
---
Constify isr at many places (NFC)
---
Include socketvar.h for softnet_lock
---
Fix buffer length for ipsec_logsastr
 1.191.6.1  01-Jul-2017  snj Pull up following revision(s) (requested by ozaki-r in ticket #73):
sys/netinet6/ip6_output.c: revision 1.192
Fix usage of ip6_get_membership
It may set nothing to ifp even if returning 0. So we need to NULL-clear
ifp before calling it.
Fix PR kern/52324
 1.203.2.6  26-Dec-2018  pgoyette Sync with HEAD, resolve a few conflicts
 1.203.2.5  06-Sep-2018  pgoyette Sync with HEAD

Resolve a couple of conflicts (result of the uimin/uimax changes)
 1.203.2.4  25-Jun-2018  pgoyette Sync with HEAD
 1.203.2.3  21-May-2018  pgoyette Sync with HEAD
 1.203.2.2  02-May-2018  pgoyette Synch with HEAD
 1.203.2.1  22-Apr-2018  pgoyette Sync with HEAD
 1.211.2.2  13-Apr-2020  martin Mostly merge changes from HEAD upto 20200411
 1.211.2.1  10-Jun-2019  christos Sync with HEAD
 1.220.2.2  04-Aug-2023  martin Pull up following revision(s) (requested by ozaki-r in ticket #1707):

sys/netinet6/in6.c: revision 1.289
sys/netinet6/ip6_output.c: revision 1.234

in6: clear ND6_IFF_IFDISABLED to allow DAD again on link-up

in6: don't send any IPv6 packets over a disabled interface
 1.220.2.1  23-Mar-2023  martin Pull up following revision(s) (requested by ozaki-r in ticket #1615):

sys/netinet6/raw_ip6.c: revision 1.183 (via patch)
sys/netinet6/ip6_output.c: revision 1.233

in6: reject setting negative values but -1 via setsockopt(IPV6_CHECKSUM)
Same as OpenBSD.

in6: make sure a user-specified checksum field is within a packet
From OpenBSD
 1.226.2.1  03-Apr-2021  thorpej Sync with HEAD.
 1.231.2.2  04-Aug-2023  martin Pull up following revision(s) (requested by ozaki-r in ticket #310):

sys/netinet6/in6.c: revision 1.289
sys/netinet6/ip6_output.c: revision 1.234

in6: clear ND6_IFF_IFDISABLED to allow DAD again on link-up

in6: don't send any IPv6 packets over a disabled interface
 1.231.2.1  23-Mar-2023  martin Pull up following revision(s) (requested by ozaki-r in ticket #125):

sys/netinet6/raw_ip6.c: revision 1.183
sys/netinet6/ip6_output.c: revision 1.233

in6: reject setting negative values but -1 via setsockopt(IPV6_CHECKSUM)
Same as OpenBSD.

in6: make sure a user-specified checksum field is within a packet
From OpenBSD

RSS XML Feed