History log of /src/sys/netinet/ip_output.c
Revision  Date  Author  Comments
 1.330  17-Jul-2025  ozaki-r in: avoid racy ifa_acquire(rt->rt_ifa) in ip_output()

If a rtentry is being destroyed asynchronously, the ifa referenced by rt_ifa
can be destructed, and calling ifa_acquire(rt->rt_ifa) then aborts with a
KASSERT failure. Fortunately, the ifa is not actually freed because rt_ifa
still holds a reference to it, so it remains available (except for some
functions like psref) as long as the rtentry is held.

PR kern/59527
 1.329  11-Jun-2025  ozaki-r in: narrow the scope of ifa in ip_output (NFC)
 1.328  11-Jun-2025  ozaki-r in: take a reference of ifp on IP_ROUTETOIF

The ifp could be released after ia4_release(ia).
 1.327  11-Jun-2025  ozaki-r in: get rid of unused argument from ip_newid() and ip_newid_range()
 1.326  19-Apr-2023  mlelstv branches: 1.326.6;
Again allow multicast packets to be sent from unnumbered interfaces.
 1.325  19-Apr-2023  ozaki-r Revert "Fix panic on packet sending via a route with rt_ifa of AF_LINK."

The fix was mistakenly upstreamed.
 1.324  21-Nov-2022  knakahara branches: 1.324.2;
Fix panic on packet sending via a route with rt_ifa of AF_LINK.

A route with rt_ifa of AF_LINK can be set by some routing daemons when
it adds a route that has a gateway of AF_LINK. If there is no address on
a target interface, the kernel sets an AF_LINK address of the interface to
rt_ifa of the route. In that case, a variable of a local address in
ip_output (ia) can be NULL and we need more NULL-checks of it.
 1.323  04-Nov-2022  ozaki-r inpcb: rename functions to inpcb_*

Inspired by rmind-smpnet patches.
 1.322  28-Oct-2022  ozaki-r inpcb: separate inpcb again to reduce the size of PCB for IPv4

The data size of PCB for IPv4 increased because of the merge of
struct in6pcb. The change decreases the size to the original size by
separating struct inpcb (again). struct in4pcb and in6pcb that embed
struct inpcb are introduced.

Even after the separation, users don't need to be aware of it; they
only have to use some macros to access dedicated data. For example,
inp->inp_laddr is now accessed through in4p_laddr(inp).
 1.321  28-Oct-2022  ozaki-r inpcb: integrate data structures of PCB into one

Data structures of network protocol control blocks (PCBs), i.e.,
struct inpcb, in6pcb and inpcb_hdr, are not organized well. Users of
the data structures have to handle them separately and thus the code
is cluttered and duplicated.

The commit integrates the data structures into one, struct inpcb. As a
result, users of PCBs only have to handle just one data structure, so
the code becomes simple.

One drawback is that the data size of PCB for IPv4 increases by 40 bytes
(from 248 bytes to 288 bytes).
 1.320  08-Sep-2020  christos Add IP_BINDANY, IPV6_BINDANY which can be used to bind to any address in
order to implement transparent proxies.
 1.319  28-Aug-2020  christos Don't cache the sa, because we are dealing with multiple mbufs (from ozaki-r)
 1.318  28-Aug-2020  ozaki-r inet: reduce silent packet discards
 1.317  28-Aug-2020  ozaki-r inet: reduce indents of a normal path to improve readability (NFCI)
 1.316  28-Aug-2020  ozaki-r inet, inet6: count packets dropped by IPsec

The counters count packets dropped due to security policy checks.
 1.315  27-Dec-2019  msaitoh s/referece/reference/ in comment.
 1.314  05-Jun-2019  knakahara The packets which will be esp-fragmented should not be applied pfil. Pointed out by ohishi@IIJ, thanks.
 1.313  05-Jun-2019  knakahara Fix rtcache cannot be released once an esp-fragmented packet is sent. Pointed out by ohishi@IIJ, thanks.
 1.312  15-May-2019  ozaki-r Get rid of IFNET_LOCK for if_mcast_op to avoid a deadlock

The IFNET_LOCK was added to avoid data races on if_flags for IFF_ALLMULTI.
Unfortunately it caused a deadlock instead. A known scenario causing a
deadlock is when the following two operations occur concurrently: (a) a removal of
an IP address assigned to an interface and (b) a manipulation of multicast
groups on the interface. The resource dependency graph is like this:
softnet_lock => IFNET_LOCK => psref_target_destroy => softint => softnet_lock

Thanks to the previous commit that avoids data races on if_flags for
IFF_ALLMULTI by another approach, we can remove IFNET_LOCK and defuse the
deadlock.

PR kern/54189
 1.311  13-May-2019  ozaki-r Count packets dropped by pfil
 1.310  04-Feb-2019  mrg rework the #ifdef IPSEC code to not use fallthru.
same number of lines with more local context.
 1.309  22-Dec-2018  maxv Replace: M_MOVE_PKTHDR -> m_move_pkthdr. No functional change, since the
former is a macro to the latter.
 1.308  12-Dec-2018  rin Simplify logic in ip{,6}_output().

Now, we have M_CSUM_TSOv[46] bit in ifp->if_csum_flags_tx when
TSO[46] is enabled for the interface. So we can simply check
whether TSO[46] is required in a packet but missing in the
interface by (sw_csum & M_CSUM_TSOv[46]).

Note that this is a very rare case where TSO[46] is suddenly
turned off during a packet passing b/w TCP and IP.

part of PR kern/53562
OK msaitoh
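The check described in r1.308 can be sketched as follows. This is only an illustration: the flag values and the sk_* names are hypothetical, and the way sw_csum is derived (packet flags with the interface's transmit capabilities masked out) is an assumption, not a quote from ip_output.c.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag values for illustration; the kernel's M_CSUM_* differ. */
#define SK_CSUM_TCPv4	0x01u
#define SK_CSUM_TSOv4	0x02u

/*
 * Offload work requested by the packet that the interface cannot do.
 * Assumption: this mirrors how sw_csum is derived in the output path.
 */
static uint32_t
sk_sw_csum(uint32_t pkt_csum_flags, uint32_t if_csum_flags_tx)
{
	return pkt_csum_flags & ~if_csum_flags_tx;
}

/* True if the packet asks for TSOv4 but the interface no longer offers it. */
static bool
sk_tso_missing(uint32_t pkt_csum_flags, uint32_t if_csum_flags_tx)
{
	return (sk_sw_csum(pkt_csum_flags, if_csum_flags_tx) &
	    SK_CSUM_TSOv4) != 0;
}
```

Because the TSO bit now lives in the same capability word as the checksum bits, a single mask answers the question and no separate capability lookup is needed.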
 1.307  11-Jul-2018  maxv Rename

ip_undefer_csum -> in_undefer_cksum
in_delayed_cksum -> in_undefer_cksum_tcpudp

The two previous names were inconsistent and misleading.

Put the two functions into in_offload.c. Add comments to explain what
we're doing.

The same could be done for IPv6.
 1.306  02-Jun-2018  maxv branches: 1.306.2;
Copy more mbuf flags.
 1.305  29-May-2018  maxv Fix an XXX of mine, be clearer about what we're doing. Basically we want to
preserve the fragment offset and flags. That's necessary if the packet
we're fragmenting is itself a fragment.
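A minimal sketch of the idea in r1.305, assuming host byte order and using the standard IPv4 flag bits. The sk_* names are hypothetical and this is not the code from ip_fragment(); it only shows why the original offset and flags have to be carried into each new fragment.

```c
#include <stdint.h>

#define SK_IP_RF	0x8000u	/* reserved fragment flag */
#define SK_IP_DF	0x4000u	/* don't fragment */
#define SK_IP_MF	0x2000u	/* more fragments */
#define SK_IP_OFFMASK	0x1fffu	/* fragment offset, in 8-byte units */

/*
 * Compute ip_off (host order) for a new fragment that starts off_bytes
 * into the payload of a packet whose own ip_off is orig_off. `last` is
 * non-zero for the final piece cut from this packet.
 */
static uint16_t
sk_frag_off(uint16_t orig_off, unsigned off_bytes, int last)
{
	uint16_t flags = (uint16_t)(orig_off & (SK_IP_RF | SK_IP_DF | SK_IP_MF));
	uint16_t base = (uint16_t)(orig_off & SK_IP_OFFMASK);
	uint16_t off = (uint16_t)((base + (off_bytes >> 3)) & SK_IP_OFFMASK);

	if (!last)
		off |= SK_IP_MF;	/* more pieces of this packet follow */
	/* Keep the original flags: MF stays set if the input was a fragment. */
	return (uint16_t)(off | flags);
}
```

For an input that is already a middle fragment, the last piece still has MF set and its offset is relative to the original datagram, which is what the receiver needs for reassembly.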
 1.304  29-Apr-2018  maxv Remove unused and misleading argument from ipsec_set_policy.
 1.303  21-Apr-2018  maxv Remove #ifndef __vax__.

The check enforces a 4-byte-aligned size for the option mbuf. If the size
is not multiple of 4, the computation of ip_hl gets truncated in the
output path. There is no reason for this check not to be present on VAX.

While here add a KASSERT in ip_insertoptions to enforce the assumption.

Discussed briefly on tech-net@
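The truncation mentioned in r1.303 is plain integer arithmetic and can be illustrated like this. The helper names are hypothetical; the only facts relied on are that an option-less IPv4 header is 20 bytes and that ip_hl counts 32-bit words.

```c
#include <stddef.h>

#define SK_IPHDR_LEN	20u	/* IPv4 header without options, in bytes */

/* ip_hl is expressed in 32-bit words, so the byte length is divided by 4. */
static unsigned
sk_ip_hl(size_t optlen)
{
	return (unsigned)((SK_IPHDR_LEN + optlen) >> 2);
}

/* The enforced assumption: the option length is a multiple of 4. */
static int
sk_optlen_ok(size_t optlen)
{
	return (optlen & 3) == 0;
}
```

With 6 bytes of options the real header is 26 bytes long, but sk_ip_hl(6) is 6 words, i.e. 24 bytes, so the header length field no longer covers the last two option bytes.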
 1.302  13-Apr-2018  maxv Remove useless comment and style.
 1.301  13-Apr-2018  maxv Reduce the diff between similar blocks.
 1.300  13-Apr-2018  maxv Reorder a few instructions to clarify. Replace two bcopy by memcpy.
 1.299  30-Mar-2018  maya correct typo: and and -> and (comments only)

heads up on this being a common typo from chris28.
 1.298  03-Mar-2018  maxv branches: 1.298.2;
Add KASSERTs, we don't want m_nextpkt in ipsec{4/6}_process_packet.
 1.297  27-Feb-2018  maxv Dedup: merge ipsec4_set_policy and ipsec6_set_policy. The content of the
original ipsec_set_policy function is inlined into the new one.
 1.296  27-Feb-2018  maxv Dedup: merge

ipsec4_get_policy and ipsec6_get_policy
ipsec4_delete_pcbpolicy and ipsec6_delete_pcbpolicy

The already-existing ipsec_get_policy() function is inlined in the new
one.
 1.295  12-Feb-2018  christos Keep a pointer to the interface of the multicast membership, because the
multicast element itself might go away in in_delmulti (but the interface
can't because we hold the lock). From ozaki-r@
 1.294  07-Feb-2018  mrg ip_add_membership() has a missing {} issue, but solve it by
dropping the "goto out" that would have happened immediately
next anyway, ie, should be NFC.
 1.293  06-Feb-2018  maxv Several changes, mostly cosmetic:

* Add a KASSERT in ip_output(), we expect (at least) the IP header to be
here.

* In ip_fragment(), declare two variables instead of recomputing the
values each time. Add an XXX for ipoff, it seems to me we should also
remove IP_RF.

* Rename the arguments of ip_optcopy().

* Style: use NULL for pointers, remove ()s for return statements, and
add whitespaces for clarity.

No real functional change.
 1.292  10-Jan-2018  christos from ozaki-r: use the proper ifp.
XXX: perhaps push the lock in in_delmulti()?
 1.291  10-Jan-2018  christos - this is not python, we need braces
- protect ifp locking against NULL
 1.290  01-Jan-2018  christos Remove comment now that the getsockopt code passes the size.
 1.289  01-Jan-2018  christos 1) "#define ipi_spec_dst ipi_addr" in <netinet/in.h>
2) Change the IP_RECVPKTINFO option to control the generation of
IP_PKTINFO control messages, the way it's done in Solaris.
3) Remove the superfluous IP_RECVPKTINFO control message.
4) Change the IP_PKTINFO option to do different things depending on
the parameter it's supplied with:
- If it's sizeof(int), assume it's being used as in Linux:
- If it's non-zero, turn on the IP_RECVPKTINFO option.
- If it's zero, turn off the IP_RECVPKTINFO option.
- If it's sizeof(struct in_pktinfo), assume it's being used as in
Solaris, to set a default for the source interface and/or
source address for outgoing packets on the socket.
5) Return what Linux or Solaris compatible code expects, depending
on data size, and just added a fallback to a Linux (and current NetBSD)
compatible value if the size is unknown (as it is now), or,
in the future, if the calling application specifies a receiving
buffer that doesn't match either data item.

From: Tom Ivar Helbekkmo
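The size-based dispatch for setting IP_PKTINFO described in item 4 can be sketched as below. Everything named sk_* is hypothetical, including the stand-in structure (the real struct in_pktinfo lives in <netinet/in.h>), and the "unknown" result for other sizes is an assumption of this sketch rather than the kernel's documented behaviour.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct in_pktinfo, to keep the sketch portable. */
struct sk_in_pktinfo {
	uint32_t ipi_addr;	/* source address */
	uint32_t ipi_ifindex;	/* source interface index */
};

enum sk_pktinfo_action {
	SK_RECVPKTINFO_ON,	/* Linux style, non-zero int */
	SK_RECVPKTINFO_OFF,	/* Linux style, zero int */
	SK_SET_DEFAULT_SRC,	/* Solaris style, full structure */
	SK_PKTINFO_UNKNOWN	/* any other size */
};

/* Decide what a setsockopt(IP_PKTINFO) call means from its argument size. */
static enum sk_pktinfo_action
sk_classify_pktinfo(const void *optval, size_t optlen)
{
	if (optlen == sizeof(int)) {
		int v;

		memcpy(&v, optval, sizeof(v));
		return v != 0 ? SK_RECVPKTINFO_ON : SK_RECVPKTINFO_OFF;
	}
	if (optlen == sizeof(struct sk_in_pktinfo))
		return SK_SET_DEFAULT_SRC;
	return SK_PKTINFO_UNKNOWN;
}
```

The approach works because the two argument types have different sizes, so the size alone tells the two calling conventions apart.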
 1.288  22-Dec-2017  ozaki-r Fix usage of curlwp_bind in ip_output

curlwp_bindx must be called in LIFO order, i.e., we can't call curlwp_bind
and curlwp_bindx like this:
bound1 = curlwp_bind();
bound2 = curlwp_bind();
curlwp_bindx(bound1);
curlwp_bindx(bound2);

ip_output did so if NET_MPSAFE. Fix it.
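Why the order matters can be shown with a toy userland model. This is not the kernel implementation: sk_bind() and sk_bindx() are hypothetical stand-ins that only model "save the previous bound state, restore it later".

```c
/* 1 while the simulated LWP is bound to its CPU. */
static int sk_bound;

/* Model of curlwp_bind(): bind and return the previous state. */
static int
sk_bind(void)
{
	int old = sk_bound;

	sk_bound = 1;
	return old;
}

/* Model of curlwp_bindx(): restore the state saved by the matching bind. */
static void
sk_bindx(int old)
{
	sk_bound = old;
}

/* Correct nesting: the last bind is undone first. */
static int
sk_release_lifo(void)
{
	sk_bound = 0;
	int bound1 = sk_bind();
	int bound2 = sk_bind();
	sk_bindx(bound2);
	sk_bindx(bound1);
	return sk_bound;
}

/* The order from the commit message: the states are restored crosswise. */
static int
sk_release_fifo(void)
{
	sk_bound = 0;
	int bound1 = sk_bind();
	int bound2 = sk_bind();
	sk_bindx(bound1);
	sk_bindx(bound2);
	return sk_bound;
}
```

In this model sk_release_lifo() ends unbound, as it started, while sk_release_fifo() ends with the LWP still marked as bound, because the outer saved state is restored before the inner one.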
 1.287  15-Dec-2017  ozaki-r Ensure to call if_mcast_op with holding IFNET_LOCK

Note that CARP doesn't deal with IFNET_LOCK yet.
 1.286  11-Dec-2017  ryo As is the case with IPV6_PKTINFO, IP_PKTINFO can be sent without EADDRINUSE
even if the UDP address:port in use is specified.
 1.285  17-Nov-2017  ozaki-r Provide macros for softnet_lock and KERNEL_LOCK hiding NET_MPSAFE switch

It reduces C&P codes such as "#ifndef NET_MPSAFE KERNEL_LOCK(1, NULL); ..."
scattered all over the source code and makes it easy to identify remaining
KERNEL_LOCK and/or softnet_lock that are held even if NET_MPSAFE.

No functional change
 1.284  10-Aug-2017  ryo Add support IP_PKTINFO for sendmsg(2).

The source address or output interface can be specified by adding IP_PKTINFO
to the control part of the message on a SOCK_DGRAM or SOCK_RAW socket.

Reviewed by ozaki-r@ and christos@. thanks.
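From userland, attaching such a control message might look like the sketch below. The helper name sk_attach_pktinfo and its buffer handling are assumptions for illustration; struct in_pktinfo, IP_PKTINFO and the CMSG_* macros are the standard interfaces. The source address goes into ipi_spec_dst, which r1.289 defines as an alias for ipi_addr on NetBSD.

```c
#define _GNU_SOURCE	/* glibc only: exposes struct in_pktinfo */
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <string.h>

/*
 * Prepare `msg` so that a later sendmsg(2) carries an IP_PKTINFO control
 * message asking for source address `src`. `cbuf` must be suitably
 * aligned and at least CMSG_SPACE(sizeof(struct in_pktinfo)) bytes.
 * Returns 0 on success, -1 if the buffer is too small.
 */
static int
sk_attach_pktinfo(struct msghdr *msg, void *cbuf, size_t cbuflen,
    struct in_addr src)
{
	struct in_pktinfo pi;
	struct cmsghdr *cm;

	if (cbuflen < CMSG_SPACE(sizeof(pi)))
		return -1;

	memset(cbuf, 0, cbuflen);
	msg->msg_control = cbuf;
	msg->msg_controllen = CMSG_SPACE(sizeof(pi));

	cm = CMSG_FIRSTHDR(msg);
	cm->cmsg_level = IPPROTO_IP;
	cm->cmsg_type = IP_PKTINFO;
	cm->cmsg_len = CMSG_LEN(sizeof(pi));

	memset(&pi, 0, sizeof(pi));
	pi.ipi_spec_dst = src;	/* requested source address */
	memcpy(CMSG_DATA(cm), &pi, sizeof(pi));
	return 0;
}
```

The caller still has to fill in msg_name and msg_iov and then call sendmsg(2) on a SOCK_DGRAM or SOCK_RAW socket; setting ipi_ifindex instead would select the output interface.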
 1.283  23-Jul-2017  para kmem_intr_free kmem_intr_[z]alloced memory

the underlying pools are the same but api-wise those should match
 1.282  04-Jul-2017  roy Rename u to udst, .dst to .sa and .dst4 to sin.
Create sockaddr for the source address in usrc so it won't stamp on udst.

This fixes a regression caused in r1.280
 1.281  03-Jul-2017  khorben Typo
 1.280  03-Jul-2017  roy When outputting, search for the sending address on the sending interface
rather than blindly picking the first matching address from any interface
when testing source address validity.

This allows another interface to have the same address, but be detached.
 1.279  12-May-2017  ryo branches: 1.279.2;
replace in_fmtaddr() by IN_PRINT(), and delete function in_fmtaddr()
 1.278  10-May-2017  ozaki-r Stop ipsec4_output returning SP to the caller

SP isn't used by the caller (ip_output) and also holding its
reference looks unnecessary.
 1.277  07-May-2017  christos PR/52074: Frank Kardel: current npf map directive broken
Don't filter packets that can't be resolved to source interfaces because
they could have been generated by a packet filter.
 1.276  05-Mar-2017  ozaki-r branches: 1.276.4;
Fix the position of curlwp_bindx; it should be after if_put
 1.275  03-Mar-2017  ozaki-r Pass inpcb/in6pcb instead of socket to ip_output/ip6_output

- Passing a socket to Layer 3 is layer violation and even unnecessary
- The change makes codes of callers and IPsec a bit simple
 1.274  02-Mar-2017  ozaki-r Make sure imo_membership is protected by inp's lock (solock)
 1.273  02-Mar-2017  ozaki-r Make usages of ifp MP-safe in some functions of IP multicast
 1.272  22-Feb-2017  ozaki-r Add assertions and comments for lock states of socket and pcb
 1.271  17-Feb-2017  ozaki-r Make NOMPSAFE comments informative
 1.270  13-Feb-2017  ozaki-r Use IFQ_LOCK instead of splnet for if_snd
 1.269  16-Jan-2017  christos rename arplog -> ARPLOG to make it clear that it is a macro and tuck-in the
buffer used for address formatting.
 1.268  16-Jan-2017  ryo Make ip6_sprintf(), in_fmtaddr(), lla_snprintf() and icmp6_redirect_diag() mpsafe.

Reviewed by ozaki-r@
 1.267  11-Jan-2017  ozaki-r branches: 1.267.2;
Get rid of unnecessary header inclusions
 1.266  10-Jan-2017  knakahara avoid double rtcache_unref().

reviewed by ozaki-r@n.o.
 1.265  12-Dec-2016  ozaki-r Make the routing table and rtcaches MP-safe

See the following descriptions for details.

Proposed on tech-kern and tech-net


Overview
 1.264  08-Dec-2016  ozaki-r Add rtcache_unref to release points of rtentry stemming from rtcache

In the MP-safe world, a rtentry stemming from a rtcache can be freed at any
point. So we need to protect rtentries somehow, say by reference counting or
passive references. Regardless of the method, we need to call some release
function of a rtentry after using it.

The change adds a new function rtcache_unref to release a rtentry. At this
point, this function does nothing because for now we don't add a reference
to a rtentry when we get one from a rtcache. We will add something useful
in a further commit.

This change is a part of changes for MP-safe routing table. It is separated
to avoid one big change that is difficult to debug by bisecting.
 1.263  20-Sep-2016  roy Drop UDP packets as well as TCP without error when sending from detached or
tentative addresses.
 1.262  18-Sep-2016  christos Dealing with arplog is a bit more complicated...
 1.261  15-Sep-2016  roy Ensure that packets are sent from a valid address.
If the packet is TCP and the address is detached or tentative then
it's just dropped, otherwise an error is returned.

This is needed because you can bind to a valid address and it can then
become invalid.

This satisfies RFC 4862 section 5.5.4.
 1.260  01-Aug-2016  ozaki-r Apply pserialize and psref to struct ifaddr and its variants

This change makes struct ifaddr and its variants (in_ifaddr and in6_ifaddr)
MP-safe by using pserialize and psref. At this moment, pserialize_perform
and psref_target_destroy are disabled because (1) we don't need them
because of softnet_lock (2) they cause a deadlock because of softnet_lock.
So we'll enable them when we remove softnet_lock in the future.
 1.259  08-Jul-2016  ozaki-r branches: 1.259.2;
Replace macros to get an IP address with proper inline functions

The inline functions are more friendly for applying psz/psref;
they consist of only simple interations.
 1.258  21-Jun-2016  ozaki-r Replace ifp of ip_moptions and ip6_moptions with if_index

The motivation is the same as the mbuf's rcvif case; avoid having a pointer
of an ifnet object in ip_moptions and ip6_moptions, which is not MP-safe.

ip_moptions and ip6_moptions can be stored in a PCB for inet or inet6
whose lifetime is different from that of the ifnet, so an ifnet object can
disappear at any time after we get it via them. Thus, for safety, we need to
look up an ifnet object by if_index every time.
 1.257  20-Jun-2016  knakahara apply if_output_lock() to L3 callers which call ifp->if_output() of L2(or L3 tunneling).
 1.256  10-Jun-2016  ozaki-r Introduce m_set_rcvif and m_reset_rcvif

The API is used to set (or reset) a received interface of a mbuf.
They are counterpart of m_get_rcvif, which will come in another
commit, hide internal of rcvif operation, and reduce the diff of
the upcoming change.

No functional change.
 1.255  09-May-2016  ozaki-r Fix compilation for ppc
 1.254  04-May-2016  christos fix compilation for ppc.
 1.253  28-Apr-2016  ozaki-r Constify rtentry of if_output

We no longer need to change rtentry below if_output.

The change makes it clear where rtentries are changed (or not)
and helps forthcoming locking (or psrefing) rtentries.
 1.252  26-Apr-2016  ozaki-r Stop using rt_gwroute on packet sending paths

rt_gwroute of rtentry is a reference to a rtentry of the gateway
for a rtentry with RTF_GATEWAY. That was used by L2 (arp and ndp)
to look up L2 addresses. By separating L2 nexthop caches, we don't
need a route for the purpose and we can stop using rt_gwroute.
By doing so, we can reduce referencing and modifying rtentries,
which makes it easy to apply a lock (and/or psref) to the
routing table and rtentries.

One issue to do this is to keep RTF_REJECT behavior. It seems it
was broken when we moved rtalloc1 things from L2 output routines
(e.g., ether_output) to ip_hresolv_output, but (fortunately?)
it works unexpectedly. What we mistook are:
- RTF_REJECT was checked for any routes in L2 output routines,
but in ip_hresolv_output it is checked only when the route
is RTF_GATEWAY
- The RTF_REJECT check wasn't copied to IPv6 (nd6_output)

It seems that rt_gwroute checks hid the mistakes so that it looked like
it worked (unexpectedly), and removing rt_gwroute checks unveils the
issue. So we need to fix RTF_REJECT checks in ip_hresolv_output
and also add them to nd6_output.

One more point we have to care is returning an errno; we need
to mimic looutput behavior. Originally RTF_REJECT check was
done either in L2 output routines or in looutput. The latter is
applied when a reject route directs to a loopback interface.
However, now RTF_REJECT check is done before looutput so to keep
the original behavior we need to return an errno which looutput
chooses. Added rt_check_reject_route does such tweaks.
 1.251  19-Apr-2016  ozaki-r Fix error path
 1.250  19-Apr-2016  ozaki-r Separate MPLS-related routines from ip_hresolv_output

No functional changes.
 1.249  18-Apr-2016  ozaki-r Get rid of meaningless RTF_UP check from ip_hresolv_output

The check is meaningless because
- An obtained rtentry is ensured that it's always RTF_UP by rtcache,
rtalloc1 and rtlookup. If the rtentry isn't changed (i.e., RTF_UP gets
dropped) during processing, the check should be unnecessary
- Even if not, i.e., an obtained rtentry can be changed during processing,
checking only at the point doesn't help; the rtentry can be changed after
the check

Instead we have to ensure that RTF_UP isn't dropped if someone is using it
somehow. Note that we already ensure that a rtentry being used isn't freed
by rt_refcnt.

Proposed on tech-kern and tech-net.
 1.248  20-Jan-2016  riastradh Give proper prototype to ip_output.
 1.247  02-Sep-2015  ozaki-r Do rt_refcnt++ when setting a rtentry to another rtentry's rt_gwroute

And also do rtfree when dereferencing a rtentry from rt_gwroute.
 1.246  24-Aug-2015  pooka sprinkle _KERNEL_OPT
 1.245  07-Aug-2015  ozaki-r Use time_uptime instead of time_second to avoid time leaps

Some codes in sys/net* use time_second to manage time periods such as
cache expirations. However, time_second doesn't increase monotonically
and can leap by say settimeofday(2) according to time_second(9). We
should use time_uptime instead of it to avoid such time leaps.

This change replaces time_second with time_uptime. Additionally it
converts a time based on time_uptime to a time based on time_second
when the kernel passes the time to userland programs that expect
the latter, and vice versa.

Note that we shouldn't leak time_uptime to other hosts over the
network. My investigation shows there is no such leak:
http://mail-index.netbsd.org/tech-net/2015/08/06/msg005332.html

Discussed on tech-kern and tech-net.
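The two-clock bookkeeping from r1.245 can be sketched as follows. The structure and function names are hypothetical; the sketch assumes only that expirations are stored on a monotonic uptime clock and converted when a wall-clock value has to be handed to userland.

```c
#include <stdbool.h>
#include <time.h>

/* A cache entry whose expiry is stored on the monotonic (uptime) clock. */
struct sk_entry {
	time_t expire_uptime;
};

/* Expiry checks use uptime only, so settimeofday(2) cannot affect them. */
static bool
sk_expired(const struct sk_entry *e, time_t now_uptime)
{
	return now_uptime >= e->expire_uptime;
}

/* Convert the uptime-based expiry to wall-clock time for userland. */
static time_t
sk_expire_wallclock(const struct sk_entry *e, time_t now_uptime,
    time_t now_wall)
{
	return e->expire_uptime - now_uptime + now_wall;
}

/* The reverse direction, for a wall-clock expiry supplied by userland. */
static time_t
sk_expire_uptime(time_t expire_wall, time_t now_uptime, time_t now_wall)
{
	return expire_wall - now_wall + now_uptime;
}
```

In userland the analogous pair of clocks would be CLOCK_MONOTONIC and CLOCK_REALTIME from clock_gettime(2); only the remaining interval is carried across, never the raw uptime value.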
 1.244  17-Jul-2015  ozaki-r Reform use of rt_refcnt

rt_refcnt of rtentry was used in bad manners, for example, direct rt_refcnt++
and rt_refcnt-- outside route.c, "rt->rt_refcnt++; rtfree(rt);" idiom, and
touching rt after rt->rt_refcnt--.

These abuses seem to be needed because rt_refcnt manages only references
between rtentry and doesn't take care of references during packet processing
(IOW references from local variables). In order to reduce the above abuses,
the latter cases should be counted by rt_refcnt as well as the former cases.

This change improves consistency of use of rt_refcnt:
- rtentry is always accessed with rt_refcnt incremented
- rtentry's rt_refcnt is decremented after use (rtfree is always used instead
of rt_refcnt--)
- functions returning rtentry increment its rt_refcnt (and caller rtfree it)

Note that rt_refcnt prevents rtentry from being freed but doesn't prevent
rtentry from being updated. Toward MP-safe, we need to provide another
protection for rtentry, e.g., locks. (Or introduce a better data structure
allowing concurrent readers during updates.)
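The convention listed above (a function returning a rtentry takes a reference, and the caller drops it with rtfree) can be illustrated with a toy model. The sk_* types and functions are hypothetical and much simpler than the kernel's rtentry handling; in particular, the real rtfree() has further conditions before it frees an entry.

```c
/* Toy route entry: a reference count and a flag set when it is "freed". */
struct sk_rtentry {
	int refcnt;
	int freed;
};

/* A lookup returns the entry with its reference count incremented. */
static struct sk_rtentry *
sk_rt_lookup(struct sk_rtentry *rt)
{
	rt->refcnt++;
	return rt;
}

/* Every user releases through one function, never via a bare refcnt--. */
static void
sk_rtfree(struct sk_rtentry *rt)
{
	if (--rt->refcnt == 0)
		rt->freed = 1;
}
```

With the routing table itself holding one reference, a lookup followed by sk_rtfree() leaves the count where it started, and the entry is only released when the last holder lets go. As the commit notes, this prevents freeing but does not protect against concurrent updates.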
 1.243  14-Jul-2015  ozaki-r Move rt_gwroute operation out of stripoutput

We should do it in ip_hresolv_needed.
 1.242  01-Jul-2015  ozaki-r Use ip_hresolv_output for if_token as well

I thought we cannot apply ip_hresolv_output to if_token because
rt0 looked being needed by arpresolve in token_output. However,
rt0 is actually not used by arpresolve in NetBSD (see obsolete
ARPRESOLVE macro).
 1.241  08-Jun-2015  roy errno -> error, spotted by the hawk skrll
 1.240  08-Jun-2015  roy It's possible we could not have any ready addresses.
 1.239  04-Jun-2015  ozaki-r Pull out route lookups from L2 output routines

Route lookups for routes of RTF_GATEWAY were done in L2 output
routines such as ether_output, but they should be done in L3
i.e., before L2 output routines. This change places the lookups
between L3 output routines (say ip_output) and the L2 output
routines.

The change is based on dyoung's patch submitted in the thread:
https://mail-index.netbsd.org/tech-net/2013/02/01/msg003847.html
You can find out detailed investigations by dyoung about the
issue in there.

Note that the change introduces a workaround for MPLS. ether_output
knew that it needs to fill the ethertype of a frame as MPLS,
based on a tag of an original route (rtentry), but now we don't
pass it to ether_output. So we have to tell that in another way.
We use mtag to do so for now, which introduces some overhead.
We should fix it somehow in the future.

Discussed on tech-kern and tech-net.
 1.238  27-Apr-2015  ozaki-r Add missing error checks on rtcache_setdst

It can fail with ENOMEM.
 1.237  24-Apr-2015  ozaki-r KNF
 1.236  03-Apr-2015  ozaki-r Don't grab KERNEL_LOCK during if_output when NET_MPSAFE

The change makes L3 MP-safe work easy. At this point
we deal with only IP forwarding.

No functional change when NET_MPSAFE isn't enabled.
 1.235  31-Mar-2015  ozaki-r Add missing ifdef IPSEC
 1.234  23-Mar-2015  roy Add RTF_BROADCAST to mark routes used for the broadcast address when
they are created on the fly. This makes it clear what the route is for
and allows an optimisation in ip_output() by avoiding a call to
in_broadcast() because most of the time we do talk to a host.
It also avoids a needless allocation for the storage of llinfo_arp and
thus vanishes from arp(8) - it showed as incomplete anyway so this
is a nice side effect.

Guard against this and routes marked with RTF_BLACKHOLE in
ip_fastforward().
While here, guard against routes marked with RTF_BLACKHOLE in
ip6_fastforward().
RTF_BROADCAST is IPv4 only, so don't bother checking that here.
 1.233  26-Nov-2014  ozaki-r branches: 1.233.2;
Call looutput with holding KERNEL_LOCK

This fixes diagnostic assertion "KERNEL_LOCKED_P()" in if_loop.c.

PR kern/49410
 1.232  12-Oct-2014  christos Refactor the multicast membership code so that we can handle v4 mapped
addresses using the v6 membership ioctls.
 1.231  11-Oct-2014  christos expose multicast option functions which are used by the v6 code now.
 1.230  06-Jun-2014  rmind branches: 1.230.2;
ip_output: zero iproute structure only when needed; reduce the scope
of some variables.
 1.229  30-May-2014  christos Introduce 2 new variables: ipsec_enabled and ipsec_used.
ipsec_enabled is controlled by sysctl and determines if IPsec is allowed.
ipsec_used is set automatically based on ipsec being enabled, and
rules existing.
 1.228  29-May-2014  rmind Make IGMP and multicast group management code MP-safe. Use a read-write
lock to protect the hash table of multicast address records; also, make it
private and eliminate some macros. In the long term, the lookup path ought
to be optimised.
 1.227  23-May-2014  rmind Fix the assert in the previous commit.
 1.226  22-May-2014  rmind - Make ip_setmoptions(), ip_getmoptions() and ip_pcbopts() static.
- ip_output: eliminate 7th variadic argument; IP_RETURNMTU is flag
always used to store MTU size into struct inpcb::inp_errormtu.
- Clean up these routines: reduce #ifdefs, variable scopes, etc.
 1.225  17-May-2014  rmind Replace open-coded access (and boundary checking) of ifindex2ifnet with
if_byindex() function.
 1.224  29-Jun-2013  rmind branches: 1.224.4;
- Rewrite parts of pfil(9): use array to store hooks and thus be more cache
friendly (there are only few hooks in the system). Make the structures
opaque and the interface more strict.
- Remove PFIL_HOOKS option by making pfil(9) mandatory.
 1.223  27-Jun-2013  christos branches: 1.223.2;
implement IP_PKTINFO and IP_RECVPKTINFO.
 1.222  08-Jun-2013  rmind Split IPsec code in ip_input() and ip_forward() into the separate routines
ipsec4_input() and ipsec4_forward(). Tested by christos@.
 1.221  08-Jun-2013  rmind Split IPSec logic from ip_output() into a separate routine - ipsec4_output().
No change to the mechanism intended. Tested by christos@.
 1.220  05-Jun-2013  christos IPSEC has not come in two speeds for a long time now (IPSEC == kame,
FAST_IPSEC). Make everything refer to IPSEC to avoid confusion.
 1.219  04-Jun-2013  christos PR/47886: Dr. Wolfgang Stukenbrock: IPSEC_NAT_T enabled kernels may access
outdated pointers and pass ESP data to UDP sockets.
While here, simplify the code and remove the IPSEC_NAT_T option; always
compile nat-traversal in so that it does not bitrot.
 1.218  02-Feb-2013  kefren get rid of ip_len local variable. Use ntohs(ip->ip_len) like the rest
of the code in the two places this variable was used
 1.217  25-Jun-2012  christos branches: 1.217.2;
rename rfc6056 -> portalgo, requested by yamt
 1.216  22-Jun-2012  christos PR/46602: Move the rfc6056 port randomization to the IP layer.
 1.215  30-Apr-2012  rmind - Replace some malloc(9) uses with kmem(9).
- G/C M_IPMOPTS, M_IPMADDR and M_BWMETER.
 1.214  22-Mar-2012  drochner remove KAME IPSEC, replaced by FAST_IPSEC
 1.213  15-Feb-2012  drochner fix for IPSEC tunnel + NAT-T + esp_frag:
Output packets larger than "esp_frag" are fragmented first
and then reinjected into ip_output for encapsulation
and transfer. The problem was that each packet got a new
ip_id value assigned, so that fragments couldn't be matched
by the receiver. Offset information was overwritten too.
approved by releng
 1.212  31-Dec-2011  christos - fix offsetof usage, and redundant defines
- kill pointer casts to 0
 1.211  19-Dec-2011  drochner rename the IPSEC in-kernel CPP variable and config(8) option to
KAME_IPSEC, and make IPSEC define it so that existing kernel
config files work as before
Now the default can easily be changed to FAST_IPSEC just by
setting the IPSEC alias to FAST_IPSEC.
 1.210  31-Oct-2011  yamt branches: 1.210.2; 1.210.6;
redo ip_output.c rev.1.206 and 1.207 differently. PR/43664.
ok'ed by martin@
 1.209  17-Jul-2011  joerg Retire varargs.h support. Move machine/stdarg.h logic into MI
sys/stdarg.h and expect compiler to provide proper builtins, defaulting
to the GCC interface. lint still has a special fallback.
Reduce abuse of _BSD_VA_LIST_ by defining __va_list by default and
derive va_list as required by standards.
 1.208  14-Apr-2011  yamt after ip_input.c rev.1.285 and 1.286, restore kernel_lock for if_output.
 1.207  09-Apr-2011  martin PR kern/43664:
mlelstv pointed out that we sometimes may use checksums on loopback
interfaces. Make the test consistent with the code path selecting
the checksum operation before invoking fragmentation.
 1.206  09-Apr-2011  martin We do not do checksums on loopback interfaces, not even if fragmenting.
Fixes PR kern/43664.
 1.205  17-Jul-2009  minskim branches: 1.205.4; 1.205.6;
Add the IP_MINTTL socket option.

The IP_MINTTL option may be used on SOCK_STREAM sockets to discard
packets with a TTL lower than the option value. This can be used to
implement the Generalized TTL Security Mechanism (GTSM) according to
RFC 3682.

OK'ed by christos@.
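A possible userland use of the option, as a sketch. The wrapper name sk_set_minttl is hypothetical and the range check is an addition of this sketch; IP_MINTTL and setsockopt(2) are the real interfaces. A value of 255 is what one would typically pick for GTSM between directly connected peers, which is an assumption here rather than something the commit states.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>

/*
 * Ask the stack to discard packets arriving on socket `s` with a TTL
 * lower than `minttl`. Returns 0 on success, -1 on error.
 */
static int
sk_set_minttl(int s, int minttl)
{
	if (minttl < 0 || minttl > 255)
		return -1;	/* TTL is an 8-bit field */
	return setsockopt(s, IPPROTO_IP, IP_MINTTL, &minttl, sizeof(minttl));
}
```

A caller would create its SOCK_STREAM socket as usual and call sk_set_minttl(s, 255) before connect(2) or listen(2).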
 1.204  16-Jul-2009  minskim Add the IP_RECVTTL option support.

If the IP_RECVTTL option is enabled on a SOCK_DGRAM socket, the
recvmsg(2) call will return the TTL of the received datagram. The
msg_control field in the msghdr structure points to a buffer that
contains a cmsghdr structure followed by the TTL value.

Modeled after FreeBSD implementation.
 1.203  01-Jul-2009  martin From Wolfgang Stukenbrock in PR kern/41659: add missing splx().
 1.202  06-May-2009  elad Remove some usage of "priv" and "privileged" variables and instead pass
around credentials. Also push down kauth(9) calls closer to where the
operation is done.

Mailing list reference:

http://mail-index.netbsd.org/tech-net/2009/04/30/msg001270.html
 1.201  18-Mar-2009  cegger bzero -> memset
 1.200  12-Oct-2008  plunky branches: 1.200.2; 1.200.4; 1.200.8; 1.200.10;
update ip_pcbopts() to use sockopt(9) API.

cleans up function and one small fix is that we now stop copying user
options to the mbuf when the _EOL is given, previously this function
would continue to copy options.
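The EOL behaviour can be sketched as a small option walker. This is an illustration only, not ip_pcbopts(): the sk_* names are hypothetical, and malformed input simply ends the copy here, whereas real code would be expected to return an error.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SK_IPOPT_EOL	0	/* end of option list */
#define SK_IPOPT_NOP	1	/* single-byte padding option */

/*
 * Copy IP options from src to dst, stopping at the first EOL.
 * Returns the number of bytes copied (the EOL itself is not copied).
 */
static size_t
sk_copy_opts(uint8_t *dst, const uint8_t *src, size_t len)
{
	size_t i = 0;

	while (i < len) {
		size_t optlen;

		if (src[i] == SK_IPOPT_EOL)
			break;			/* ignore anything after EOL */
		if (src[i] == SK_IPOPT_NOP) {
			optlen = 1;
		} else {
			if (i + 1 >= len)
				break;		/* truncated: no length byte */
			optlen = src[i + 1];
			if (optlen < 2 || optlen > len - i)
				break;		/* malformed length */
		}
		memcpy(dst + i, src + i, optlen);
		i += optlen;
	}
	return i;
}
```

Bytes that follow the EOL marker in the user's buffer never reach the destination, which is the small fix the commit describes.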
 1.199  12-Oct-2008  plunky do not sleep while allocating memory here as socket lock is held
 1.198  16-Aug-2008  plunky constify sockopt in the PRCO_SETOPT path
 1.197  06-Aug-2008  plunky Convert socket options code to use a sockopt structure
instead of laying everything into an mbuf.

approved by core
 1.196  28-Apr-2008  martin branches: 1.196.2; 1.196.6;
Remove clause 3 and 4 from TNF licenses
 1.195  23-Apr-2008  thorpej branches: 1.195.2;
Make IPSEC and FAST_IPSEC stats per-cpu. Use <net/net_stats.h> and
netstat_sysctl().
 1.194  12-Apr-2008  thorpej branches: 1.194.2;
Make IP, TCP, UDP, and ICMP statistics per-CPU. The stats are collated
when the user requests them via sysctl.
 1.193  07-Apr-2008  thorpej Change IP stats from a structure to an array of uint64_t's.

Note: This is ABI-compatible with the old ipstat structure; old netstat
binaries will continue to work properly.
 1.192  06-Feb-2008  matt branches: 1.192.6;
Add a new ip_id generation scheme based on a Fisher-Yates shuffle over a
sliding window. XXX replace use of arc4random RSN.
 1.191  14-Jan-2008  dyoung Use rtcache_validate() instead of rtcache_getrt(). Shorten staircase
in in_losing().
 1.190  12-Jan-2008  dyoung Good-bye, rtcache_check(). Call both rtcache_validate() and
rtcache_update(,1) instead of rtcache_check().
 1.189  29-Dec-2007  degroote Restore correctly the sp level in case of FAST_IPSEC + IPSEC_NAT_T
 1.188  29-Dec-2007  degroote Simplify the FAST_IPSEC output path
Only record an IPSEC_OUT_DONE tag when we have finished the processing
In ip{,6}_output, check this tag to know if we have already processed this
packet.
Remove some dead code (IPSEC_PENDING_TDB is not used in NetBSD)

Fix pr/36870
 1.187  21-Dec-2007  matt Add fix for ip_id information leakage. Since the leakage information is
primarily used with TCP SYN and RST packets and such packets are less than
the smallest sized packet that an IP stack is allowed to fragment, we simply
set ip_id to 0 for all packets 68 bytes or less.
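The rule from r1.187 reduces to a single comparison, sketched here with hypothetical names. The 68-byte limit comes from the commit message; how the fresh ID is generated is left to the caller.

```c
#include <stddef.h>
#include <stdint.h>

#define SK_IP_MINFRAGSIZE	68	/* packets this small are never fragmented */

/*
 * Choose the ip_id for an outgoing packet: small packets get 0 so they
 * reveal nothing about the ID sequence, larger ones get a fresh ID.
 */
static uint16_t
sk_choose_ip_id(size_t pktlen, uint16_t fresh_id)
{
	return pktlen <= SK_IP_MINFRAGSIZE ? 0 : fresh_id;
}
```

Since a packet of 68 bytes or less will not be fragmented, the receiver never needs its ID for reassembly, so a constant value is harmless.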
 1.186  20-Dec-2007  dyoung Poison struct route->ro_rt uses in the kernel by changing the name
to _ro_rt. Use rtcache_getrt() to access a route cache's struct
rtentry *.

Introduce struct ifnet->if_dl that always points at the interface
identifier/link-layer address. Make code that treated the first
ifaddr on struct ifnet->if_addrlist as the interface address use
if_dl, instead.

Remove stale debugging code from net/route.c. Move the rtflush()
code into rtcache_clear() and delete rtflush(). Delete rtalloc(),
because nothing uses it any more.

Make ND6_HINT an inline, lowercase subroutine, nd6_hint.

I've done my best to convert IP Filter, the ISO stack, and the
AppleTalk stack to rtcache_getrt(). They compile, but I have not
tested them. I have given the changes to PF, GRE, IPv4 and IPv6
stacks a lot of exercise.
 1.185  28-Nov-2007  dyoung branches: 1.185.2; 1.185.6;
Move IN_NEED_CHECKSUM() to in_offload.h for re-use.
 1.184  19-Sep-2007  dyoung branches: 1.184.6;
1) Introduce a new socket option, (SOL_SOCKET, SO_NOHEADER), that
tells a socket that it should both add a protocol header to tx'd
datagrams and remove the header from rx'd datagrams:

int onoff = 1, s = socket(...);
setsockopt(s, SOL_SOCKET, SO_NOHEADER, &onoff, sizeof(onoff));

2) Add an implementation of (SOL_SOCKET, SO_NOHEADER) for raw IPv4
sockets.

3) Reorganize the protocols' pr_ctloutput implementations a bit.
Consistently return ENOPROTOOPT when an option is unsupported,
and EINVAL if a supported option's arguments are incorrect.
Reorganize the flow of code so that it's more clear how/when
options are passed down the stack until they are handled.

Shorten some pr_ctloutput staircases for readability.

4) Extract common mbuf code into subroutines, add new sockaddr
methods, and introduce a new subroutine, fsocreate(), for reuse
later; use it first in sys_socket():

struct mbuf *m_getsombuf(struct socket *so)

Create an mbuf and make its owner the socket `so'.

struct mbuf *m_intopt(struct socket *so, int val)

Create an mbuf, make its owner the socket `so', put the
int `val' into it, and set its length to sizeof(int).


int fsocreate(..., int *fd)

Create a socket, a la socreate(9), put the socket into the
given LWP's descriptor table, return the descriptor at `fd'
on success.

void *sockaddr_addr(struct sockaddr *sa, socklen_t *slenp)
const void *sockaddr_const_addr(const struct sockaddr *sa, socklen_t *slenp)

Extract a pointer to the address part of a sockaddr. Write
the length of the address part at `slenp', if `slenp' is
not NULL.

socklen_t sockaddr_getlen(const struct sockaddr *sa)

Return the length of a sockaddr. This just evaluates to
sa->sa_len. I only add this for consistency with code that
appears in a portable userland library that I am going to
import.

const struct sockaddr *sockaddr_any(const struct sockaddr *sa)

Return the "don't care" sockaddr in the same family as
`sa'. This is the address a client should sobind(9) if it
does not care about the source address and, if applicable, the
port et cetera that it uses.

const void *sockaddr_anyaddr(const struct sockaddr *sa, socklen_t *slenp)

Return a pointer to the address part of the "don't care"
sockaddr in the same family as `sa'. Write the length of the
address part at `slenp', if `slenp' is not NULL.
 1.183  02-Sep-2007  dyoung m_copym(..., 0, M_COPYALL, ...) -> m_copypacket(..., ...).
 1.182  02-Sep-2007  dyoung m_copy() was deprecated, apparently, long ago. m_copy(...) ->
m_copym(..., M_DONTWAIT).
 1.181  28-Aug-2007  cube Fix ipv4 multicast that could sometimes send packets with the wrong
Ethernet multicast address.

Reported by jmcneill@, fix discussed with dyoung@, _very_ light testing by
myself, some more money for my dealer of anxiolytics after reading
ip_output()'s twisted code maze.
 1.180  02-May-2007  dyoung branches: 1.180.2; 1.180.6; 1.180.8;
Eliminate address family-specific route caches (struct route, struct
route_in6, struct route_iso), replacing all caches with a struct
route.

The principal benefit of this change is that all of the protocol
families can benefit from route cache-invalidation, which is
necessary for correct routing. Route-cache invalidation fixes an
ancient PR, kern/3508, at long last; it fixes various other PRs,
also.

Discussions with and ideas from Joerg Sonnenberger influenced this
work tremendously. Of course, all design oversights and bugs are
mine.

DETAILS

1 I added to each address family a pool of sockaddrs. I have
introduced routines for allocating, copying, duplicating,
and freeing sockaddrs:

struct sockaddr *sockaddr_alloc(sa_family_t af, int flags);
struct sockaddr *sockaddr_copy(struct sockaddr *dst,
const struct sockaddr *src);
struct sockaddr *sockaddr_dup(const struct sockaddr *src, int flags);
void sockaddr_free(struct sockaddr *sa);

sockaddr_alloc() returns either a sockaddr from the pool belonging
to the specified family, or NULL if the pool is exhausted. The
returned sockaddr has the right size for that family; sa_family
and sa_len fields are initialized to the family and sockaddr
length---e.g., sa_family = AF_INET and sa_len = sizeof(struct
sockaddr_in). sockaddr_free() puts the given sockaddr back into
its family's pool.

sockaddr_dup() and sockaddr_copy() work analogously to strdup()
and strcpy(), respectively. sockaddr_copy() KASSERTs that the
family of the destination and source sockaddrs are alike.

The 'flags' argument for sockaddr_alloc() and sockaddr_dup() is
passed directly to pool_get(9).

2 I added routines for initializing sockaddrs in each address
family, sockaddr_in_init(), sockaddr_in6_init(), sockaddr_iso_init(),
etc. They are fairly self-explanatory.

3 structs route_in6 and route_iso are no more. All protocol families
use struct route. I have changed the route cache, 'struct route',
so that it does not contain storage space for a sockaddr. Instead,
struct route points to a sockaddr coming from the pool the sockaddr
belongs to. I added a new method to struct route, rtcache_setdst(),
for setting the cache destination:

int rtcache_setdst(struct route *, const struct sockaddr *);

rtcache_setdst() returns 0 on success, or ENOMEM if no memory is
available to create the sockaddr storage.

It is now possible for rtcache_getdst() to return NULL if, say,
rtcache_setdst() failed. I check the return value for NULL
everywhere in the kernel.

4 Each routing domain (struct domain) has a list of live route
caches, dom_rtcache. rtflushall(sa_family_t af) looks up the
domain indicated by 'af', walks the domain's list of route caches
and invalidates each one.
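The per-domain invalidation in item 4 can be pictured with toy types. This is a sketch under stated assumptions: every name prefixed toy_ is hypothetical, the list is a plain singly linked list, and reference counting on the cached route is ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for the kernel types; all names here are hypothetical. */
struct toy_rtentry { int unused; };

struct toy_route {
	struct toy_rtentry *rt;		/* cached route, NULL once invalidated */
	struct toy_route *next;		/* link in the domain's list of caches */
};

struct toy_domain {
	int family;
	struct toy_route *rtcache;	/* head of the list of live caches */
};

/* Record a cache that now holds a route on its domain's list. */
static void
toy_rtcache_insert(struct toy_domain *dom, struct toy_route *ro,
    struct toy_rtentry *rt)
{
	ro->rt = rt;
	ro->next = dom->rtcache;
	dom->rtcache = ro;
}

/* Walk the domain's list, invalidating and unlinking every cache. */
static void
toy_rtflushall(struct toy_domain *dom)
{
	struct toy_route *ro;

	while ((ro = dom->rtcache) != NULL) {
		dom->rtcache = ro->next;
		ro->rt = NULL;
		ro->next = NULL;
	}
}
```

After a flush, every user of a cache sees a NULL route and has to perform a fresh lookup, which is what keeps stale routes from being used.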
 1.179  04-Mar-2007  christos branches: 1.179.2; 1.179.4;
Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.
 1.178  17-Feb-2007  dyoung KNF: de-__P, bzero -> memset, bcmp -> memcmp. Remove extraneous
parentheses in return statements.

Cosmetic: don't open-code TAILQ_FOREACH().

Cosmetic: change types of variables to avoid oodles of casts: in
in6_src.c, avoid casts by changing several route_in6 pointers
to struct route pointers. Remove unnecessary casts to caddr_t
elsewhere.

Pave the way for eliminating address family-specific route caches:
soon, struct route will not embed a sockaddr, but it will hold
a reference to an external sockaddr, instead. We will set the
destination sockaddr using rtcache_setdst(). (I created a stub
for it, but it isn't used anywhere, yet.) rtcache_free() will
free the sockaddr. I have extracted from rtcache_free() a helper
subroutine, rtcache_clear(). rtcache_clear() will "forget" a
cached route, but it will not forget the destination by releasing
the sockaddr. I use rtcache_clear() instead of rtcache_free()
in rtcache_update(), because rtcache_update() is not supposed
to forget the destination.

Constify:

1 Introduce const accessor for route->ro_dst, rtcache_getdst().

2 Constify the 'dst' argument to ifnet->if_output(). This
led me to constify a lot of code called by output routines.

3 Constify the sockaddr argument to protosw->pr_ctlinput. This
led me to constify a lot of code called by ctlinput routines.

4 Introduce const macros for converting from a generic sockaddr
to family-specific sockaddrs, e.g., sockaddr_in: satocsin6,
satocsin, et cetera.
 1.177  17-Feb-2007  dyoung branches: 1.177.2;
Join lines.
 1.176  29-Jan-2007  dyoung bzero -> memset.
 1.175  29-Jan-2007  dyoung In ip_setmoptions(), don't leave a route cache (struct route) on
the stack if we exit with EADDRNOTAVAIL.
 1.174  13-Jan-2007  joerg Unconditionally zero and free iproute. Before IPsec tunnel packets e.g.
from ICMP could end up in leaking the reference in iproute, as
ipsec4_output would overwrite the ro pointer in state.

Tested by Juraj Hercek and supposed to fix PR kern/35273 and kern/35318.
 1.173  08-Jan-2007  yamt ip_output: reload ip_len after running pfil_run_hooks.
pf "fragment reassemble" rule can change it, at least.
 1.172  04-Jan-2007  elad Consistent usage of KAUTH_GENERIC_ISSUSER.
 1.171  15-Dec-2006  joerg Introduce new helper functions to abstract the route caching.
rtcache_init and rtcache_init_noclone lookup ro_dst and store
the result in ro_rt, taking care of the reference counting and
calling the domain specific route cache.
rtcache_free checks if a route was cached and frees the reference.
rtcache_copy copies ro_dst of the given struct route, checking that
enough space is available and incrementing the reference count of the
cached rtentry if necessary.
rtcache_check validates that the cached route is still up. If it isn't,
it tries to look it up again. Afterwards ro_rt is either valid again
or NULL.
rtcache_copy is used internally.

Adjust the callers of rtalloc/rtflush in the tree to check the sanity of
ro_dst first (if necessary). If it doesn't fit the expectations, free
the cache, otherwise check if the cached route is still valid. After
that combination, a single check for ro_rt == NULL is enough to decide
whether a new lookup needs to be done with a different ro_dst.
Make the route checking in gre stricter by repeating the loop check
after revalidation.
Remove some unused RADIX_MPATH code in in6_src.c. The logic is slightly
changed here to first validate the route and check RTF_GATEWAY
afterwards. This is semantically equivalent though.
etherip doesn't need sc_route_expire, similar to the gif changes from
dyoung@ earlier.

Based on the earlier patch from dyoung@, reviewed and discussed with
him.
 1.170  09-Dec-2006  dyoung Here are various changes designed to protect against bad IPv4
routing caused by stale route caches (struct route). Route caches
are sprinkled throughout PCBs, the IP fast-forwarding table, and
IP tunnel interfaces (gre, gif, stf).

Stale IPv6 and ISO route caches will be treated by separate patches.

Thank you to Christoph Badura for suggesting the general approach
to invalidating route caches that I take here.

Here are the details:

Add hooks to struct domain for tracking and for invalidating each
domain's route caches: dom_rtcache, dom_rtflush, and dom_rtflushall.

Introduce helper subroutines, rtflush(ro) for invalidating a route
cache, rtflushall(family) for invalidating all route caches in a
routing domain, and rtcache(ro) for notifying the domain of a new
cached route.

Chain together all IPv4 route caches where ro_rt != NULL. Provide
in_rtcache() for adding a route to the chain. Provide in_rtflush()
and in_rtflushall() for invalidating IPv4 route caches. In
in_rtflush(), set ro_rt to NULL, and remove the route from the
chain. In in_rtflushall(), walk the chain and remove every route
cache.

In rtrequest1(), call rtflushall() to invalidate route caches when
a route is added.

In gif(4), discard the workaround for stale caches that involves
expiring them every so often.

Replace the pattern 'RTFREE(ro->ro_rt); ro->ro_rt = NULL;' with a
call to rtflush(ro).

Update ipflow_fastforward() and all other users of route caches so
that they expect a cached route, ro->ro_rt, to turn to NULL.

Take care when moving a 'struct route' to rtflush() the source and
to rtcache() the destination.

In domain initializers, use .dom_xxx tags.

KNF here and there.
 1.169  06-Dec-2006  dyoung Remove stray curly brace. Thanks, yamt!
 1.168  06-Dec-2006  dyoung KNF.
 1.167  25-Nov-2006  yamt branches: 1.167.2;
move tso-by-software code to their own files. no functional changes.
 1.166  13-Nov-2006  dyoung Add a source-address selection policy mechanism to the kernel.

Also, add ioctls SIOCGIFADDRPREF/SIOCSIFADDRPREF to get/set preference
numbers for addresses. Make ifconfig(8) set/display preference
numbers.

To activate source-address selection policies in your kernel, add
'options IPSELSRC' to your kernel configuration.

Miscellaneous changes in support of source-address selection:

1 Factor out some common code, producing rt_replace_ifa().

2 Abbreviate a for-loop with TAILQ_FOREACH().

3 Add the predicates on IPv4 addresses IN_LINKLOCAL() and
IN_PRIVATE(), that are true for link-local unicast
(169.254/16) and RFC1918 private addresses, respectively.
Add the predicate IN_ANY_LOCAL() that is true for link-local
unicast and multicast.

4 Add IPv4-specific interface attach/detach routines,
in_domifattach and in_domifdetach, which build #ifdef
IPSELSRC.

See in_getifa(9) for a more thorough description of source-address
selection policy.
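The address predicates from item 3 can be sketched as plain functions. These are illustrations, not the kernel macros: the lowercase names are hypothetical and the addresses are taken in host byte order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Addresses are in host byte order; function names are hypothetical. */
static bool
in_linklocal(uint32_t a)
{
	return (a & 0xffff0000u) == 0xa9fe0000u;	/* 169.254.0.0/16 */
}

static bool
in_private(uint32_t a)
{
	return (a & 0xff000000u) == 0x0a000000u ||	/* 10.0.0.0/8 */
	    (a & 0xfff00000u) == 0xac100000u ||		/* 172.16.0.0/12 */
	    (a & 0xffff0000u) == 0xc0a80000u;		/* 192.168.0.0/16 */
}
```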
 1.165  23-Jul-2006  ad branches: 1.165.4; 1.165.6;
Use the LWP cached credentials where sane.
 1.164  12-Jul-2006  tron Remove test for M_CSUM_TSOv6 flag which is not (yet) defined in
NetBSD-current.
 1.163  12-Jul-2006  tron Add diagnostic checks for hardware-assisted checksum related flags in
the mbuf which is supposed to get sent out:
- Complain in ip_output() if any of the IPv6 related flags are set.
- Complain in ip6_output() if any of the IPv4 related flags are set.
- Complain in both functions if the flags indicate that both a TCP and
UDP checksum should be calculated by the hardware.
 1.162  15-May-2006  christos branches: 1.162.4;
kauth fallout
 1.161  14-May-2006  elad integrate kauth.
 1.160  23-Feb-2006  christos branches: 1.160.2; 1.160.4; 1.160.6;
Handle IPSEC_NAT_T in the FAST_IPSEC case.
XXX: need to fix the FAST_IPSEC code now.
 1.159  11-Dec-2005  christos branches: 1.159.2; 1.159.4; 1.159.6;
merge ktrace-lwp.
 1.158  19-Sep-2005  dyoung People have to read this code, so I am removing the double-negative
tautology, #ifndef notdef, which is not only superfluous, but easily
misread as #ifdef notyet.
 1.157  11-Sep-2005  seb Replace plain 255 by MAXTTL.
 1.156  11-Sep-2005  christos Allow the multicast_ttl and the multicast_loop options to be set with both
u_char and u_int option variables. Original patch from seb.
 1.155  18-Aug-2005  yamt - introduce M_MOVE_PKTHDR and use it where appropriate.
intended to be mostly API compatible with openbsd/freebsd.
- remove a glue #define in netipsec/ipsec_osdep.h.
 1.154  10-Aug-2005  yamt move {tcp,udp}_do_loopback_cksum back to tcp/udp
so that they can be referenced by ipv6.
 1.153  29-May-2005  christos branches: 1.153.2;
- add const
- remove bogus casts
- avoid nested variables
 1.152  18-Apr-2005  yamt ip_output: handle the case M_CSUM_TSOv4 but !IFCAP_TSOv4.
 1.151  18-Apr-2005  yamt fix problems related to loopback interface checksum omission. PR/29971.

- for ipv4, defer decision to ip layer as h/w checksum offloading does
so that it can check the actual interface the packet is going to.
- for ipv6, disable it.
(maybe will be revisited when it implements h/w checksum offloading.)

ok'ed by Jason Thorpe.
 1.150  07-Apr-2005  yamt when doing TSO, avoid heavy use of duplicated ip_id.
XXX ip_randomid
 1.149  11-Mar-2005  matt branches: 1.149.2;
Set ip_len to 0 in the wm driver when TSO is being used.
 1.148  10-Mar-2005  thorpej In ip_fragment():
- Use the correct IP header length variable for other-than-first packets.
- Remove redundant setting of the original IP header length in the first
packet's csum_data. (It's already set before ip_fragment() is called
in 1.147.)
 1.147  09-Mar-2005  matt Move all the hardware-assisted checksum/segment offload code together.
 1.146  06-Mar-2005  matt Add IPv4/TCP hooks for TCP Segment Offload on transmit.
 1.145  05-Mar-2005  briggs Fix checksum offload for fragmented packets. From John Heasley
on gnats-bugs in PR kern/29544.
Tested with an NFS client using default rwsize on an NFS server
with wm(4) interface configured IP4CSUM,TCP4CSUM,UDP4CSUM.
Prior revision required the server to have checksum offload disabled.
 1.144  26-Feb-2005  perry nuke trailing whitespace
 1.143  18-Feb-2005  heas My last change for pseudo-header checksums was flawed. The pseudo-header
checksum is always in the L4 header by the time we get to this point. It
was occasionally not there due to a bug in tcp_respond, which has since
been fixed.
So, instead just stash the length of the L3 header in the high 16 bits of
csum_data.
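Stashing the L3 header length in the high 16 bits of a 32-bit field can be sketched as below. The helper names are hypothetical, and the meaning of the low 16 bits is left open here because the log entry does not describe it.

```c
#include <assert.h>
#include <stdint.h>

/* Pack the L3 header length into the high half of a csum_data-like word. */
static uint32_t
csum_data_pack(uint16_t l3_hdr_len, uint16_t low16)
{
	return ((uint32_t)l3_hdr_len << 16) | low16;
}

/* Recover the L3 header length from the high 16 bits. */
static uint16_t
csum_data_l3len(uint32_t csum_data)
{
	return (uint16_t)(csum_data >> 16);
}

/* Recover whatever the caller stored in the low 16 bits. */
static uint16_t
csum_data_low16(uint32_t csum_data)
{
	return (uint16_t)(csum_data & 0xffff);
}
```

A driver can then read the header length with a shift instead of parsing the packet.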
 1.142  12-Feb-2005  heas For controllers (eg: hme & gem) that can only perform linear hardware checksums
(from an offset to the end of the packet), the pseudo-header checksum must be
calculated by software. So, provide it in the TCP/UDP header when
M_CSUM_NO_PSEUDOHDR is set in the interface's if_csum_flags_tx.

The start offset, the end of the IP header, is also provided in the high 16
bits of pkthdr.csum_data, so that the driver need not examine the packet
at all.

XXX At the request of Jonathan Stone, note that sharing of if_csum_flags_tx &
pkthdr.csum_flags for checksum quirks should be re-evaluated.
 1.141  12-Feb-2005  manu Add support for IPsec Network Address Translator traversal (NAT-T), as
described by RFC 3947 and 3948.
 1.140  03-Feb-2005  perry ANSIfy function declarations
 1.139  02-Feb-2005  perry de-__P -- will ANSIfy .c files later.
 1.138  15-Dec-2004  thorpej branches: 1.138.2; 1.138.4;
Don't perform checksums on loopback interfaces. They can be reenabled with
the net.inet.*.do_loopback_cksum sysctl.

Approved by: groo
 1.137  04-Dec-2004  peter Convert lo(4) to a clonable device.

This also removes the loif array and changes all code to use the new
lo0ifp pointer which points to the lo0 ifnet structure.

Approved by christos.
 1.136  06-Oct-2004  thorpej Slight simplification to IFA_STATS handling.
 1.135  04-Sep-2004  manu IPv4 PIM support, based on a submission from Pavlin Radoslavov posted on
tech-net@
 1.134  06-Jul-2004  minoura Remove broken code for now: getsockopt(s, IPPROTO_IP, IP_IPSEC_POLICY,...).
It returned EINVAL, now returns ENOPROTOOPT.
Ok'd by itojun.
 1.133  01-Jun-2004  itojun update mtu value if outgoing interface changes with ipsec ops
(draft-touch-vpn case only?) iij seil team
 1.132  18-May-2004  christos Fix buffer overrun in in_pcbopts() (FreeBSD PR/66386)
 1.131  26-Apr-2004  matt Remove #else clause of __STDC__
 1.130  02-Mar-2004  thorpej Use the new IPSEC_PCB_SKIP_IPSEC() to bypass a socket policy lookup
when possible. This shaves several cycles from the output path for
non-IPsec connections, even if the policy is cached in the PCB.
 1.129  10-Dec-2003  itojun use if_indexlim (instead of if_index) and ifindex2ifnet[x] != NULL
to check if interface exists, as (1) if_index has different meaning
(2) ifindex2ifnet could become NULL when interface gets destroyed,
now that we have introduced dynamically-created interfaces. from kame
 1.128  19-Nov-2003  jonathan Patch back support for (badly) randomized IP ids, by request:

* Include "opt_inet.h" everywhere IP-ids are generated with ip_newid(),
so the RANDOM_IP_ID option is visible. Also in ip_id(), to ensure
the prototype for ip_randomid() is made visible.

* Add new sysctl to enable randomized IP-ids, provided the kernel was
configured with RANDOM_IP_ID. (The sysctl defaults to zero, and is
a read-only zero if RANDOM_IP_ID is not configured).

Note that the implementation of randomized IP ids is still defective,
and should not be enabled at all (even if configured) without
very careful deliberation. Caveat emptor.
 1.127  17-Nov-2003  jonathan Revert the (default) ip_id algorithm to the pre-randomid algorithm,
due to demonstrated low-period repeated IDs from the randomized IP_id
code. Consensus is that the low-period repetition (much less than
2^15) is not suitable for general-purpose use.

Allocators of new IPv4 IDs should now call the function ip_newid().
Randomized IP_ids is now a config-time option, "options RANDOM_IP_ID".
ip_newid() can use ip_randomid() if and only if configured
with RANDOM_IP_ID. A sysctl knob should be provided.

This API may be reworked in the near future to support linear ip_id
counters per (src,dst) IP-address pair.
 1.126  17-Oct-2003  enami Increment stats when packet is dropped since there is no room
to put all fragments in the interface's send queue. Some large
UDP packets are dropped here and administrator may want to bump ifqmaxlen.
 1.125  14-Oct-2003  itojun more correction to ip_fragment; free mbuf correctly if ENOBUFS is raised
during fragmenting.
 1.124  14-Oct-2003  itojun avoid mbuf leak on ip_fragment(); obey 4.4bsd mbuf passing rule (mbuf passed
to a function must be freed by the called function on error).
pointed out by enami
 1.123  03-Oct-2003  itojun when dropping M_PKTHDR, need to free m_tag associated with it.
 1.122  01-Oct-2003  itojun correct ip_fragment() wrt ip->ip_off handling.
do not send out incomplete fragment due to ENOBUFS (behavior change from 4.4BSD)
 1.121  19-Sep-2003  jonathan Fast-ipsec can call ip_output() with a null 'struct socket *so'
argument. So check so is non-NULL before doing the pointer-chasing
dance to find the PCB. (Unless and until we rework fast-ipsec and
KAME, to pass a struct in_pcbhdr * instead of the struct socket *).
 1.120  06-Sep-2003  itojun randomize IPv4/v6 fragment ID and IPv6 flowlabel. avoids predictability
of these fields. ip_id.c is from openbsd. ip6_id.c is adapted by kame.
 1.119  27-Aug-2003  itojun don't initialize m by m0, m0 is not initialized (by introduction of ip_fragment)
 1.118  23-Aug-2003  itojun need sys/domain.h for FAST_IPSEC case; jonathan
 1.117  22-Aug-2003  itojun remove ipsec_set/getsocket. now we explicitly pass socket * to ip{,6}_output.
 1.116  22-Aug-2003  itojun change the additional arg to be passed to ip{,6}_output to struct socket *.

this fixes KAME policy lookup which was broken by the previous commit.
 1.115  22-Aug-2003  jonathan Change KAME code for ip_output()/ip6_output() to obtain struct socket*
from the explicit inpcb*/in6pcb* argument. set_socket() becomes redundant.
 1.114  19-Aug-2003  itojun remove unneeded #ifdef __NetBSD__
 1.113  19-Aug-2003  itojun make ip_fragment public (it is for coming PF integration)
 1.112  19-Aug-2003  christos make ip_fragment static and add prototype.
 1.111  19-Aug-2003  itojun correct ip_multicast_if fix to always set ifp (tnx Shiva)
 1.110  18-Aug-2003  itojun fix problem we can't drop membership on !IFF_UP interface.
reported by Shiva Shenoy

while we're here, fix another problem when the same interface address is
assigned to !IFF_MULTICAST and IFF_MULTICAST interface. if ip_multicast_if()
returns the first one, join/leave will fail, which is not a desired effect.
 1.109  15-Aug-2003  jonathan (fast-ipsec): Add hooks to pass IPv4 IPsec traffic into fast-ipsec, if
configured with ``options FAST_IPSEC''. Kernels with KAME IPsec or
with no IPsec should work as before.

All calls to ip_output() now always pass an additional compulsory
argument: the inpcb associated with the packet being sent,
or 0 if no inpcb is available.

Fast-ipsec tested with ICMP or UDP over ESP. TCP doesn't work, yet.
 1.108  07-Aug-2003  agc Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22364, verified by myself.
 1.107  30-Jun-2003  itojun branches: 1.107.2;
freebsd code somehow crept in
 1.106  30-Jun-2003  itojun after pfil_run_hooks, need to fix hlen as well
 1.105  26-Jun-2003  itojun tabify
 1.104  26-May-2003  yamt - don't pass mbufs with M_CSUM_* flags which aren't supported by the interface
to if_output.
- offload ip-checksumming for each fragmented packet as well.
 1.103  26-Feb-2003  matt Add MBUFTRACE kernel option.
Do a little mbuf rework while here. Change all uses of MGET*(*, M_WAIT, *)
to m_get*(M_WAIT, *). These are not performance critical and making them
call m_get saves considerable space. Add m_clget analogue of MCLGET and
make corresponding change for M_WAIT uses.
Modify netinet, gem, fxp, tulip, nfs to support MBUFTRACE.
Begin to change netstat to use sysctl.
 1.102  17-Sep-2002  darrenr From FreeBSD (1.164) courtesy of Maxim Konovalov:
"In rare cases when there is no room for ip options ip_insertoptions()
can fail and corrupt a header length. Initialize len and check what
ip_insertoptions() returns."
 1.101  11-Sep-2002  itojun KNF - return is not a function. sync w/kame.
 1.100  14-Aug-2002  itojun avoid swapping endian of ip_len and ip_off on mbuf, to meet with M_LEADINGSPACE
optimization made last year. should solve PR 17867 and 10195.

IP_HDRINCL behavior of raw ip socket is kept unchanged. we may want to
provide IP_HDRINCL variant that does not swap endian.
 1.99  24-Jun-2002  itojun set ia as well
 1.98  24-Jun-2002  itojun do not consult routing table under the following condition:
- the destination is IPv4 multicast or 255.255.255.255, and
- outgoing interface is specified via socket option

this simplifies operation of routed
(no longer require 224.0.0.0/4 to be set up)
 1.97  09-Jun-2002  itojun whitespace
 1.96  31-May-2002  itojun since if_mtu is u_long, use u_long for mtu.
 1.95  07-Feb-2002  thorpej branches: 1.95.8; 1.95.10;
IFF_POINTTOPOINT interfaces can also transmit packets to broadcast
destinations.
 1.94  06-Feb-2002  thorpej ip_mloopback(): process the delayed checksum on the copy, not
the original mbuf.
 1.93  31-Jan-2002  itojun correct bad ip checksum on multicast loopback packet. PR14597
 1.92  22-Jan-2002  itojun make sure to check address family on route cache. with IPv4 mapped
address we can see both AF_INET/INET6.
 1.91  08-Jan-2002  itojun don't panic when no interface address exists for the specified multicast
outgoing interface (ia == NULL after IFP_TO_IA).

historic behavior (up to revision 1.43) was to use 0.0.0.0 as source address,
but it seems like a mistake according to RFC1112/1122.
 1.90  21-Nov-2001  itojun update outgoing ifp, only if tunnel mode ipsec is used. this is to
honor IP_MULTICAST_IF setsockopt on ipsec-over-multicast. sync with kame
 1.89  13-Nov-2001  lukem add RCSIDs
 1.88  17-Sep-2001  thorpej Split the pre-computed ifnet checksum flags into Tx and Rx directions.
Add capabilities bits that indicate an interface can only perform
in-bound TCPv4 or UDPv4 checksums. There is at least one Gig-E chip
for which this is true (Level One LXT-1001), and this is also the
case for the Intel i82559 10/100 Ethernet chips.
 1.87  11-Aug-2001  yamt branches: 1.87.2;
fix cksum error of udp and tcp packet with ip options
 1.86  02-Jun-2001  thorpej branches: 1.86.2;
Implement support for IP/TCP/UDP checksum offloading provided by
network interfaces. This works by pre-computing the pseudo-header
checksum and caching it, delaying the actual checksum to ip_output()
if the hardware cannot perform the sum for us. In-bound checksums
can either be fully-checked by hardware, or summed up for final
verification by software. This method was modeled after how this
is done in FreeBSD, although the code is significantly different in
most places.

We don't delay checksums for IPv6/TCP, but we do take advantage of the
cached pseudo-header checksum.

Note: hardware-assisted checksumming defaults to "off". It is
enabled with ifconfig(8). See the manual page for details.

Implement hardware-assisted checksumming on the DP83820 Gigabit Ethernet,
3c90xB/3c90xC 10/100 Ethernet, and Alteon Tigon/Tigon2 Gigabit Ethernet.
 1.85  26-May-2001  ragge Remove one #ifdef vax, bugfix another. Should probably be #ifdef i386 also.
 1.84  13-Apr-2001  thorpej Remove the use of splimp() from the NetBSD kernel. splnet()
and only splnet() is allowed for the protection of data structures
used by network devices.
 1.83  27-Feb-2001  itojun branches: 1.83.2;
remove obsolete #if 0'ed section
(IPsec and DF bit interaction - the code was incorrect anyways)
 1.82  24-Jan-2001  itojun - record IPsec packet history into m_aux structure.
- let ipfilter look at wire-format packet only (not the decapsulated ones),
so that VPN setting can work with NAT/ipfilter settings.
sync with kame.

TODO: use header history for stricter inbound validation
 1.81  13-Jan-2001  itojun allow IP_MULTICAST_IF and IP_ADD/DROP_MEMBERSHIP to specify interface
by interface index. if the interface address specified is in 0.0.0.0/8
it will be considered as interface index in network byteorder.

getsockopt(IP_MULTICAST_IF) preserves old behavior if
setsockopt(IP_MULTICAST_IF) was done with interface address, and
returns interface index if setsockopt(IP_MULTICAST_IF) was done with
interface index (again using the form in 0.0.0.0/8).

Suggested by Dave Thaler, based on RIPv2 MIB spec (RFC1724 section 3.3).

http://mail-index.netbsd.org/tech-net/2001/01/13/0003.html
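The 0.0.0.0/8 convention described in this entry can be sketched as a small decoder. The function name and the -1 return value are assumptions made for illustration; the kernel code is organized differently.

```c
#include <assert.h>
#include <stdint.h>
#include <arpa/inet.h>	/* htonl(), ntohl() */

/*
 * If the IPv4 address (network byte order) lies in 0.0.0.0/8, return the
 * interface index carried in its low 24 bits; otherwise return -1 to
 * signal that it is an ordinary interface address.
 */
static int32_t
addr_to_ifindex(uint32_t addr_net)
{
	uint32_t host = ntohl(addr_net);

	if ((host >> 24) != 0)
		return -1;
	return (int32_t)(host & 0x00ffffff);
}
```

A caller that wants interface index 3 would therefore pass the address 0.0.0.3, that is, htonl(3).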
 1.80  13-Jan-2001  itojun on getsockopt(IP_IPSEC_POLICY), make sure to initialize len
 1.79  11-Nov-2000  thorpej Actually, our local ip_off variable isn't needed.
 1.78  11-Nov-2000  thorpej Restructure the PFIL_HOOKS mechanism a bit:
- All packets are passed to PFIL_HOOKS as they come off the wire, i.e.
fields in protocol headers in network order, etc.
- Allow for multiple hooks to be registered, using a "key" and a "dlt".
The "dlt" is a BPF data link type, indicating what type of header is
present.
- INET and INET6 register with key == AF_INET or AF_INET6, and
dlt == DLT_RAW.
- PFIL_HOOKS now take an argument for the filter hook, and mbuf **,
an ifnet *, and a direction (PFIL_IN or PFIL_OUT), thus making them
less IP (really, IP Filter) centric.

Maintain compatibility with IP Filter by adding wrapper functions for
IP Filter.
 1.77  23-Oct-2000  itojun fix IFA_STATS.
- use hashed in_ifaddr lookup.
- correct endianness.
 1.76  17-Oct-2000  thorpej Add an IP_MTUDISC flag to the flags that can be passed to
ip_output(). This flag, if set, causes ip_output() to set
DF in the IP header if the MTU in the route is not locked.

This allows a bunch of redundant code, which I was never
really all that happy about adding in the first place, to
be eliminated.

Inspired by a similar change made by provos@openbsd.org when
he integrated NetBSD's Path MTU Discovery code into OpenBSD.
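The decision this flag encodes can be sketched as follows. The TOY_ names and the flag value for IP_MTUDISC are hypothetical; 0x4000 is the standard position of the DF bit in the IPv4 flags/fragment-offset field, and the "MTU locked" state is reduced to a boolean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical values for this sketch, except that 0x4000 is the DF bit. */
#define TOY_IP_MTUDISC	0x0400
#define TOY_IP_DF	0x4000

/* Set DF only when the caller asked for it and the route MTU is not locked. */
static uint16_t
apply_mtudisc(int flags, bool mtu_locked, uint16_t ip_off)
{
	if ((flags & TOY_IP_MTUDISC) != 0 && !mtu_locked)
		ip_off |= TOY_IP_DF;
	return ip_off;
}
```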
 1.75  28-Jun-2000  mrg remove include of <vm/vm.h>
 1.74  10-May-2000  itojun branches: 1.74.4;
add missing boundary checks to ip options processing.
correct timestamp option validation (len and ptr upper/lower bound
based on RFC791).
fill "pointer" field for parameter problem in timestamp option processing.
 1.73  13-Apr-2000  is Copy M_BCAST and M_MCAST flags when fragmenting a packet (else
Multicast packets won't be sent to the correct link layer address
by the interface driver).
By Artur Grabowski, PR 9772.
 1.72  31-Mar-2000  jdolecek Since last duplicate prototype cleanup, we need to include
<netinet/ip_mroute.h> to get ip_mforward() prototype if MROUTING
is defined.
 1.71  30-Mar-2000  augustss Remove register declarations.
 1.70  22-Mar-2000  itojun tabify a line.
 1.69  01-Mar-2000  itojun introduce m->m_pkthdr.aux to hold random data which needs to be passed
between protocol handlers.

ipsec socket pointers, ipsec decryption/auth information, tunnel
decapsulation information are in my mind - there can be several other usages.
at this moment, we use this for ipsec socket pointer passing. this will
avoid reuse of m->m_pkthdr.rcvif in ipsec code.

due to the change, MHLEN will be decreased by sizeof(void *) - for example,
for i386, MHLEN was 100 bytes, but is now 96 bytes.
we may want to increase MSIZE from 128 to 256 for some of our architectures.

take caution if you use it for keeping some data item for long period
of time - use extra caution on M_PREPEND() or m_adj(), as they may result
in loss of m->m_pkthdr.aux pointer (and mbuf leak).

this will bump kernel version.

(as discussed in tech-net, tested in kame tree)
 1.68  20-Feb-2000  darrenr pass "struct pfil_head *" to pfil_add_hook and pfil_remove hook rather
than "struct protosw *".
 1.67  17-Feb-2000  darrenr Change the use of pfil hooks. There is no longer a single list of all
pfil information, instead, struct protosw now contains a structure
which contains list heads, etc. The per-protosw pfil struct is passed
to pfil_hook_get(), along with an in/out flag to get the head of the
relevant filter list. This has been done for only IPv4 and IPv6, at
present, with these patches only enabling filtering for IPPROTO_IP and
IPPROTO_IPV6, although it is possible to have tcp/udp, etc, dedicated
filters now also. The ipfilter code has been updated to only filter
IPv4 packets - next major release of ipfilter is required for ipv6.
 1.66  31-Jan-2000  itojun bring in latest KAME ipsec tree.
- interop issues in ipcomp are fixed
- padding type (after ESP) is configurable
- key database memory management (need more fixes)
- policy specification is revisited

XXX m->m_pkthdr.rcvif is still overloaded - hope to fix it soon
 1.65  20-Dec-1999  itojun avoid shared cluster mbuf overwrite on multicast packet loopback.
(bsdi and freebsd fixed this a long time ago...)

PR: 9020
From: pavlin@catarina.usc.edu
 1.64  13-Dec-1999  is Handle packets to 255.255.255.255 like multicast packets. Fixes PR 7682 by
Darren Reed.
 1.63  13-Dec-1999  itojun sync IPv6 part with latest KAME tree. IPsec part is left unmodified
due to massive changes in KAME side.
- IPv6 output goes through nd6_output
- faith can capture IPv4 packets as well - you can run IPv4-to-IPv6 translator
using heavily modified DNS servers
- per-interface statistics (required for IPv6 MIB)
- interface autoconfig is revisited
- udp input handling has a big change for mapped address support.
- introduce in4_cksum() for non-overwriting checksumming
- introduce m_pulldown()
- neighbor discovery cleanups/improvements
- netinet/in.h strictly conforms to RFC2553 (no extra defs visible to userland)
- IFA_STATS is fixed a bit (not tested)
- and more more more.

TODO:
- cleanup os-independency #ifdef
- avoid rcvif dual use (for IPsec) to help ifdetach

(sorry for jumbo commit, I can't separate this any more...)
 1.62  09-Jul-1999  thorpej branches: 1.62.2; 1.62.8;
defopt IPSEC and IPSEC_ESP (both into opt_ipsec.h).
 1.61  01-Jul-1999  itojun IPv6 kernel code, based on KAME/NetBSD 1.4, SNAP kit 19990628.
(Sorry for a big commit, I can't separate this into several pieces...)
Pls check sys/netinet6/TODO and sys/netinet6/IMPLEMENTATION for details.

- sys/kern: do not assume single mbuf, accept chained mbuf on passing
data from userland to kernel (or other way round).
- "midway" ATM card: ATM PVC pseudo device support, like those done in ALTQ
package (ftp://ftp.csl.sony.co.jp/pub/kjc/).
- sys/netinet/tcp*: IPv4/v6 dual stack tcp support.
- sys/netinet/{ip6,icmp6}.h, sys/net/pfkeyv2.h: IETF document assumes those
files to be there so we patch it up.
- sys/netinet: IPsec additions are here and there.
- sys/netinet6/*: most of IPv6 code sits here.
- sys/netkey: IPsec key management code
- dev/pci/pcidevs: regen

In my understanding no code here is subject to export control so it
should be safe.
 1.60  07-Jun-1999  mrg oops. move sendit: above the PFIL_HOOKS so that multicast traffic is filtered. from darren reed.
 1.59  04-May-1999  hwr Don't let packets with a Class-D source address escape the host.
Fixes second half of kern/7003 by Jonathan Stone <jonathan@DSG.Stanford.EDU>.
 1.58  27-Mar-1999  aidan branches: 1.58.2; 1.58.4; 1.58.6;
Added per-addr input/output statistics. Currently just supports netatalk
and netinet, and is only tested under netinet.

Disabled by default, enabled by compiling the kernel with option
IFA_STATS. Enabling this feature seems to make the ip_output function
take 13% longer than before, which should be OK for people that need
this feature.
 1.57  12-Mar-1999  perry exterminate ovbcopy. patches provided by Erik Bertelsen, pr-7145
 1.56  19-Jan-1999  mycroft There's just no plausible reason to byte-swap ip_id internally. It's opaque.
 1.55  11-Jan-1999  thorpej Fix byte order and ip_len inconsistencies in ICMP reply code. Also, fix
some formatting and HTONS(foo) vs. foo = htons(foo) inconsistencies.

PR #6602, Darren Reed.
 1.54  19-Dec-1998  thorpej Reverse the copyright-notice-swap. It went against existing practice.
 1.53  26-Oct-1998  ws branches: 1.53.4;
Fix a buglet when looking up an interface for multicast:
Zero out the routing structure before calling the route lookup code
in order to correctly match addresses.
 1.52  20-Oct-1998  matt vax -> __vax__ (and mips to __mips__ in ultrix_misc.c)
 1.51  30-Sep-1998  tls Switch order of TNF and UCB copyrights so UCB copyright is first; this seems more appropriate since UCB wrote the original code, after all.
 1.50  09-Aug-1998  mrg defopt PFIL_HOOKS.
 1.49  17-Jul-1998  sommerfe Fix PR5508: ipfil cut-through forwarding causes panic
 1.48  28-Apr-1998  matt Only transmit fragments if the send queue of interface can actually hold
all of the fragments. Use the mtu of route in preference of the MTU of the
interface when doing fragmentation decisions. (ie. Fragment to the path
mtu if it is available).
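
A minimal sketch of the MTU preference described in 1.48, assuming that a route MTU of zero means "not available". `effective_mtu()` is a hypothetical helper, not a function from ip_output.c:

```c
#include <assert.h>

/* Pick the MTU used for fragmentation decisions: prefer the route's
 * (path) MTU when one is recorded, otherwise fall back to the MTU of
 * the outgoing interface. A route_mtu of 0 is assumed to mean that no
 * path MTU is known. */
static unsigned effective_mtu(unsigned if_mtu, unsigned route_mtu)
{
    assert(if_mtu > 0);
    return route_mtu != 0 ? route_mtu : if_mtu;
}
```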
 1.47  24-Mar-1998  kml Ensure that we take the IP option length into account when we calculate
the effective maximum send size for TCP. ip_optlen() and tcp_optlen()
should probably be inlined for efficiency.
 1.46  19-Mar-1998  mrg convert pfil(9) in and out lists from <sys/queue.h> LISTs to TAILQs, and
change pfil_add_hook to put output filters at the tail of the queue,
while continuing to place input filters at the head of the queue. update
the two users of these functions, and document these changes.

fixes PR#4593.
 1.45  15-Feb-1998  tls Add correct copyright notice for IP address hash change. This code is donated to TNF by the original copyright holder, Panix.
 1.44  13-Feb-1998  tls Change list of interface IP addresses to a hash. Improves performance on hosts with a large number of IP addresses significantly.
 1.43  13-Feb-1998  kleink Fix variable declarations: register -> register int.
 1.42  12-Jan-1998  scottr Use option header file for MROUTING
 1.41  07-Jan-1998  lukem add the following, derived from FreeBSD:
* IP_PORTRANGE socket option, which controls how the ephemeral ports
are allocated. it takes the following settings:
  IP_PORTRANGE_DEFAULT  use anonportmin (49152) -> anonportmax (65535)
  IP_PORTRANGE_HIGH     as IP_PORTRANGE_DEFAULT (retained for FreeBSD
                        compat reasons, where these are separate)
  IP_PORTRANGE_LOW      use 600 -> 1023. only works if uid==0.
* in_pcb flag INP_ANONPORT. set if port was allocated ephemerally
 1.40  14-Oct-1997  matt branches: 1.40.2;
Add support for returning maximum supported MTU when ip_output fails with
EMSGSIZE.
 1.39  15-Apr-1997  christos branches: 1.39.4;
Move the mtod calls *after* we've made sure that the packet has passed the
filter successfully. Otherwise it can be NULL if the filter blocked it,
and we die. How did this ever work?
 1.38  18-Feb-1997  mrg pseudo-device ipfilter brings in PFIL_HOOKS.
 1.37  11-Jan-1997  thorpej branches: 1.37.4;
Implement the IP_RECVIF socket option: supply a datagram packet's incoming
interface using a sockaddr_dl in a control mbuf.

Implement SO_TIMESTAMP for IP datagrams.

Move packet information option processing into a generic function
so that they work with multicast UDP and raw IP as well as unicast UDP.

Contributed by Bill Fenner <fenner@parc.xerox.com>.
 1.36  20-Dec-1996  mrg always reassign ip after calling function.
 1.35  20-Dec-1996  mrg in pfil_hooks: always reassign ip after calling hook.
 1.34  22-Oct-1996  veego Fix a panic from the pfil_hooks.
 1.33  11-Oct-1996  is Fix a mbuf leak in ip_output().

Scenario: If ip_insertoptions() prepends a new mbuf to the chain, the
bad: label's m_freem(m0) still would free only the original mbuf chain
if the transmission failed for, e.g., no route to host; resulting in
one lost mbuf per failed packet. (The original posting included a
demonstration program).

Original report of this bug was by jinmei@isl.rdc.toshiba.co.jp
(JINMEI Tatuya) on comp.bugs.4bsd.
 1.32  14-Sep-1996  mrg move the packet filter hooks in to a saner location. while i'm here, rename
PACKET_FILTER to PFIL_HOOKS.
 1.31  09-Sep-1996  mycroft Add in_nullhost() and in_hosteq() macros, to hide some protocol
details. Also, fix a bug in TCP wrt SYN+URG packets.
 1.30  06-Sep-1996  mrg add packet filter interface code. see pfil(9) for more details. you
need the PACKET_FILTER option to enable this code. currently, ipfilter
version 3.1.1-beta has been converted to use this new interface.
 1.29  26-Feb-1996  mrg branches: 1.29.4;
two more local addr changes, all done differently now (idea from charles)
 1.28  13-Feb-1996  christos netinet prototypes
 1.27  01-Jul-1995  cgd null mbuf pointer could cause system crash; avoid it. From
Torsten Duwe <duwe@immd4.informatik.uni-erlangen.de>.
 1.26  12-Jun-1995  mycroft Various cleanup, including:
* Convert several data structures to use queue.h.
* Split in_pcbnotify() into two parts; one for notifying a specific PCB, and
one for notifying all PCBs for a particular foreign address.
 1.25  04-Jun-1995  mycroft Don't cast things unnecessarily.
 1.24  04-Jun-1995  mycroft Clean up many more casts.
 1.23  01-Jun-1995  mycroft Avoid byte-swapping IP addresses at run time.
 1.22  15-May-1995  cgd simplify ip_output() out-of-memory condition slightly, and style nits.
 1.21  13-Apr-1995  cgd be a bit more careful and explicit with types. (basically a large no-op.)
 1.20  11-Apr-1995  mycroft Remove some explicit references to loif.
 1.19  29-Jun-1994  cgd New RCS ID's, take two. they're more aesthetically pleasant, and use 'NetBSD'
 1.18  13-May-1994  mycroft Update to 4.4-Lite networking code, with a few local changes.
 1.17  02-Feb-1994  hpeyerl Multicast is no longer optional.
 1.16  19-Jan-1994  brezak Fix arguments to ip_getmoptions.
 1.15  18-Jan-1994  brezak Fix some prototype detected warnings/errors.
 1.14  18-Jan-1994  brezak Patch for ip-multicast bugs from mccanne@ee.lbl.gov (Steven McCanne)
 1.13  10-Jan-1994  mycroft Should compile now with or without `options MULTICAST'.
 1.12  09-Jan-1994  mycroft Prototype the rest.
 1.11  08-Jan-1994  mycroft Fix some inconsistent spacing; spaces at the end of lines, etc.
 1.10  07-Jan-1994  cgd kill COMPAT_OLDSOCKOPT
 1.9  06-Jan-1994  ws Apparently no one ever tested the COMPAT_OLDSOCKOPT flag...
 1.8  18-Dec-1993  mycroft Canonicalize all #includes.
 1.7  06-Dec-1993  cgd oops; fix that last...
 1.6  06-Dec-1993  cgd the ugliest compatibility hack i think i've ever seen...
define COMPAT_OLDSOCKOPT to get new kernels to work with the
old args to [sg]sockopt. this is going to go away "soon".
note that this option only has effect if MULTICAST is not defined.
 1.5  06-Dec-1993  hpeyerl multicast support.
From Chris Maeda, cmaeda@cs.washington.edu
These patches are derived from the IP Multicast patches for BSDI.
 1.4  05-Nov-1993  cgd fix from david greenman, davidg@freefall.cdrom.com:
fixed bug where large amounts of unidirectional UDP traffic would fill
the interface output queue and further udp packets would be fragmented
and only partially sent - keeping the output queue full and jamming the
network, but not actually getting any real work done (because you can't
send just 'part' of a udp packet - if you fragment it, you must send
the whole thing). The fix involves adding a check to make sure that the
output queue has sufficient space for all of the fragments.
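
The check described in 1.4 can be sketched as below, assuming standard IPv4 fragmentation where every fragment except the last carries a payload that is a multiple of 8 bytes. `frag_count()` and `queue_has_room()` are hypothetical helpers and do not reproduce the kernel's code:

```c
#include <assert.h>
#include <stdbool.h>

/* Number of fragments needed to carry payload_len bytes of IP payload
 * over a link with the given MTU and IP header length. */
static int frag_count(int payload_len, int hlen, int mtu)
{
    assert(payload_len > 0 && hlen >= 20 && mtu >= hlen + 8);
    int per_frag = (mtu - hlen) & ~7;   /* round down to a multiple of 8 */
    return (payload_len + per_frag - 1) / per_frag;
}

/* Send only if the interface output queue can take every fragment;
 * a partially queued datagram is useless to the receiver. */
static bool queue_has_room(int qlen, int qmaxlen, int nfrags)
{
    return qmaxlen - qlen >= nfrags;
}
```

For example, 4000 bytes of payload with a 20-byte header and a 1500-byte MTU needs three fragments, so a queue holding 48 of at most 50 packets would reject the whole datagram instead of queuing two fragments and dropping the third.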
 1.3  22-May-1993  cgd branches: 1.3.4;
add include of select.h if necessary for protos, or delete if extraneous
 1.2  18-May-1993  cgd make kernel select interface be one-stop shopping & clean it all up.
 1.1  21-Mar-1993  cgd branches: 1.1.1;
Initial revision
 1.1.1.2  05-Jan-1998  thorpej Import sys/netinet from 4.4BSD-Lite for reference purposes.
 1.1.1.1  21-Mar-1993  cgd initial import of 386bsd-0.1 sources
 1.3.4.2  06-Nov-1993  mycroft Merge changes from trunk.
 1.3.4.1  16-Oct-1993  mycroft Nuke references to machine/mtpr.h.
 1.29.4.1  11-Dec-1996  mycroft From trunk:
Fix a mbuf leak when fragmentation fails due to lack of memory.
 1.37.4.1  12-Mar-1997  is Merge in changes from Trunk
 1.39.4.1  14-Oct-1997  thorpej Update marc-pcmcia branch from trunk.
 1.40.2.4  29-Oct-1998  cgd pull up rev 1.53 from trunk (ws)
 1.40.2.3  01-Oct-1998  cgd pull up revisions 1.44-1.45, 1.51 (via patch) from trunk. (tls)
 1.40.2.2  22-Jul-1998  mellon Pull up 1.46 and 1.49 (veego)
 1.40.2.1  09-May-1998  mycroft Pull up patch from kml.
 1.53.4.1  11-Dec-1998  kenh The beginnings of interface detach support. Still some bugs, but mostly
works for me.

This work was originally by Bill Studenmund, and cleaned up by me.
 1.58.6.3  30-Nov-1999  itojun bring in latest KAME (as of 19991130, KAME/NetBSD141) into kame branch
just for reference purposes.
This commit includes 1.4 -> 1.4.1 sync for kame branch.

The branch does not compile at all (due to the lack of ALTQ and some other
source code). Please do not try to modify the branch, this is just for
reference purposes.

synchronization to latest KAME will take place on HEAD branch soon.
 1.58.6.2  06-Jul-1999  itojun KAME/NetBSD 1.4, SNAP kit 1999/07/05.
NOTE: this branch is just for reference purposes (i.e. for taking cvs diff).
do not touch anything on the branch. actual work must be done on HEAD branch.
 1.58.6.1  28-Jun-1999  itojun KAME/NetBSD 1.4 SNAP kit, dated 19990628.

NOTE: this branch (kame) is used just for reference. this may not compile
due to multiple reasons.
 1.58.4.3  02-Aug-1999  thorpej Update from trunk.
 1.58.4.2  01-Jul-1999  thorpej Sync w/ -current.
 1.58.4.1  21-Jun-1999  thorpej Sync w/ -current.
 1.58.2.3  30-Apr-2000  he Pull up revision 1.73 (requested by is):
Pass M_BCAST and M_MCAST flags to fragments. Fixes PR#9772.
 1.58.2.2  20-Dec-1999  he Pull up revision 1.65 (requested by itojun):
Avoid panic caused by shared cluster mbuf overwrite on multicast
packet loopback for packets with certain sizes. Fixes PR#9020.
 1.58.2.1  22-Jun-1999  perry pullup 1.59->1.60 (mrg): ipfilter should filter multicast traffic...
 1.62.8.1  27-Dec-1999  wrstuden Pull up to last week's -current.
 1.62.2.6  21-Apr-2001  bouyer Sync with HEAD
 1.62.2.5  12-Mar-2001  bouyer Sync with HEAD.
 1.62.2.4  11-Feb-2001  bouyer Sync with HEAD.
 1.62.2.3  18-Jan-2001  bouyer Sync with head (for UBC+NFS fixes, mostly).
 1.62.2.2  22-Nov-2000  bouyer Sync with HEAD.
 1.62.2.1  20-Nov-2000  bouyer Update thorpej_scsipi to -current as of a month ago
 1.74.4.4  04-Aug-2003  msaitoh Pull up revision 1.106-1.107 (requested by itojun in ticket #53):
after pfil_run_hooks, need to fix hlen as well.
freebsd code somehow crept in.
 1.74.4.3  15-Dec-2002  he Pull up revision 1.102 (requested by darrenr):
Initialize len and check what ip_insertoptions() returns.
In some rare cases there might not be sufficient room for
the options.
 1.74.4.2  14-Jan-2002  he Pull up revision 1.91 (requested by itojun):
Avoid kernel panic on IPv4 multicast packet transmission if there
is no IPv4 address assigned to the specified outgoing interface.
 1.74.4.1  06-Apr-2001  he Pull up revision 1.82 (via patch, requested by itojun):
Record IPsec packet history in m_aux structure. Let ipfilter
look at wire-format packet only (not the decapsulated ones), so
that VPN setting can work with NAT/ipfilter settings.
 1.83.2.16  20-Sep-2002  thorpej Sync with HEAD.
 1.83.2.15  17-Sep-2002  nathanw Catch up to -current.
 1.83.2.14  27-Aug-2002  nathanw Catch up to -current.
 1.83.2.13  01-Aug-2002  nathanw Catch up to -current.
 1.83.2.12  12-Jul-2002  nathanw No longer need to pull in lwp.h; proc.h pulls it in for us.
 1.83.2.11  24-Jun-2002  nathanw Curproc->curlwp renaming.

Change uses of "curproc->l_proc" back to "curproc", which is more like the
original use. Bare uses of "curproc" are now "curlwp".

"curproc" is now #defined in proc.h as ((curlwp) ? (curlwp)->l_proc : NULL)
so that it is always safe to reference curproc (*de*referencing curproc
is another story, but that's always been true).
 1.83.2.10  20-Jun-2002  nathanw Catch up to -current.
 1.83.2.9  28-Feb-2002  nathanw Catch up to -current.
 1.83.2.8  11-Jan-2002  nathanw More catchup.
 1.83.2.7  08-Jan-2002  nathanw Catch up to -current.
 1.83.2.6  14-Nov-2001  nathanw Catch up to -current.
 1.83.2.5  21-Sep-2001  nathanw Catch up to -current.
 1.83.2.4  24-Aug-2001  nathanw Catch up with -current.
 1.83.2.3  21-Jun-2001  nathanw Catch up to -current.
 1.83.2.2  13-Mar-2001  nathanw Be more careful not to dereference curproc when there might not be
a process context.
 1.83.2.1  05-Mar-2001  nathanw Initial commit of scheduler activations and lightweight process support.
 1.86.2.7  10-Oct-2002  jdolecek sync kqueue with -current; this includes merge of gehenna-devsw branch,
merge of i386 MP branch, and part of autoconf rototil work
 1.86.2.6  06-Sep-2002  jdolecek sync kqueue branch with HEAD
 1.86.2.5  23-Jun-2002  jdolecek catch up with -current on kqueue branch
 1.86.2.4  16-Mar-2002  jdolecek Catch up with -current.
 1.86.2.3  11-Feb-2002  jdolecek Sync w/ -current.
 1.86.2.2  10-Jan-2002  thorpej Sync kqueue branch with -current.
 1.86.2.1  25-Aug-2001  thorpej Merge Aug 24 -current into the kqueue branch.
 1.87.2.1  01-Oct-2001  fvdl Catch up with -current.
 1.95.10.3  30-Jun-2003  grant Pull up revisions 1.106-1.107 (requested by itojun in ticket #1358):

after pfil_run_hooks, need to fix hlen as well

freebsd code somehow crept in
 1.95.10.2  01-Nov-2002  tron Pull up revision 1.98-1.99 (requested by itojun in ticket #356):
do not consult routing table under the following condition:
- the destination is IPv4 multicast or 255.255.255.255, and
- outgoing interface is specified via socket option
this simplifies operation of routed
(no longer require 224.0.0.0/4 to be set up)
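
The condition in this pull-up can be sketched as a predicate. `struct out_req` and `skip_route_lookup()` are hypothetical names, and the destination is assumed to be in host byte order:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical summary of one output request. */
struct out_req {
    uint32_t dst;               /* destination address, host byte order */
    bool     ifp_from_sockopt;  /* outgoing interface chosen via socket option */
};

/* True when the routing table need not be consulted: the destination
 * is IPv4 multicast (224.0.0.0/4) or the limited broadcast address,
 * and the caller already specified the outgoing interface. */
static bool skip_route_lookup(const struct out_req *req)
{
    assert(req != NULL);
    bool mcast = (req->dst & 0xf0000000u) == 0xe0000000u;
    bool bcast = req->dst == 0xffffffffu;
    return (mcast || bcast) && req->ifp_from_sockopt;
}
```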
 1.95.10.1  30-Sep-2002  lukem Pull up revision 1.102 (requested by darrenr in ticket #842):
From FreeBSD (1.164) courtesy of Maxim Konovalov:
"In rare cases when there is no room for ip options ip_insertoptions()
can fail and corrupt a header length. Initialize len and check what
ip_insertoptions() returns."
 1.95.8.3  29-Aug-2002  gehenna catch up with -current.
 1.95.8.2  15-Jul-2002  gehenna catch up with -current.
 1.95.8.1  20-Jun-2002  gehenna catch up with -current.
 1.107.2.11  10-Nov-2005  skrll Sync with HEAD. Here we go again...
 1.107.2.10  01-Apr-2005  skrll Sync with HEAD.
 1.107.2.9  08-Mar-2005  skrll Sync with HEAD.
 1.107.2.8  04-Mar-2005  skrll Sync with HEAD.

Hi Perry!
 1.107.2.7  15-Feb-2005  skrll Sync with HEAD.
 1.107.2.6  04-Feb-2005  skrll Sync with HEAD.
 1.107.2.5  18-Dec-2004  skrll Sync with HEAD.
 1.107.2.4  19-Oct-2004  skrll Sync with HEAD
 1.107.2.3  21-Sep-2004  skrll Fix the sync with head I botched.
 1.107.2.2  18-Sep-2004  skrll Sync with HEAD.
 1.107.2.1  03-Aug-2004  skrll Sync with HEAD
 1.138.4.2  19-Mar-2005  yamt sync with head. xen and whitespace. xen part is not finished.
 1.138.4.1  12-Feb-2005  yamt sync with head.
 1.138.2.1  29-Apr-2005  kent sync with -current
 1.149.2.5  31-Mar-2007  bouyer Pull up following revision(s) (requested by joerg in ticket #1734):
sys/netinet/ip_output.c: revision 1.167.2.2
Unconditionally zero and free iproute. Before IPsec tunnel packets e.g.
from ICMP could end up in leaking the reference in iproute, as
ipsec4_output would overwrite the ro pointer in state.
Tested by Juraj Hercek and supposed to fix PR kern/35273 and kern/35318.
 1.149.2.4  28-Jan-2007  tron Pull up following revision(s) (requested by yamt in ticket #1656):
sys/netinet/ip_output.c: revision 1.173
ip_output: reload ip_len after running pfil_run_hooks.
pf "fragment reassemble" rule can change it, at least.
 1.149.2.3  21-Oct-2005  riz branches: 1.149.2.3.2; 1.149.2.3.4;
Pull up following revision(s) (requested by seb in ticket #903):
sys/netinet/ip_output.c: revisions 1.156 - 1.157
Allow the multicast_ttl and the multicast_loop options to be set with both
u_char and u_int option variables. Original patch from seb.
 1.149.2.2  06-May-2005  tron Pull up revision 1.151 (requested by yamt in ticket #251):
fix problems related to loopback interface checksum omission. PR/29971.
- for ipv4, defer decision to ip layer as h/w checksum offloading does
so that it can check the actual interface the packet is going to.
- for ipv6, disable it.
(maybe will be revisited when it implements h/w checksum offloading.)
ok'ed by Jason Thorpe.
 1.149.2.1  13-Apr-2005  tron Pull up revision 1.150 (requested by yamt in ticket #145):
when doing TSO, avoid using duplicated ip_id heavily.
XXX ip_randomid
 1.149.2.3.4.1  28-Jan-2007  tron Pull up following revision(s) (requested by yamt in ticket #1656):
sys/netinet/ip_output.c: revision 1.173
ip_output: reload ip_len after running pfil_run_hooks.
pf "fragment reassemble" rule can change it, at least.
 1.149.2.3.2.1  28-Jan-2007  tron Pull up following revision(s) (requested by yamt in ticket #1656):
sys/netinet/ip_output.c: revision 1.173
ip_output: reload ip_len after running pfil_run_hooks.
pf "fragment reassemble" rule can change it, at least.
 1.153.2.8  11-Feb-2008  yamt sync with head.
 1.153.2.7  21-Jan-2008  yamt sync with head
 1.153.2.6  07-Dec-2007  yamt sync with head
 1.153.2.5  27-Oct-2007  yamt sync with head.
 1.153.2.4  03-Sep-2007  yamt sync with head.
 1.153.2.3  26-Feb-2007  yamt sync with head.
 1.153.2.2  30-Dec-2006  yamt sync with head.
 1.153.2.1  21-Jun-2006  yamt sync with head.
 1.159.6.2  01-Jun-2006  kardel Sync with head.
 1.159.6.1  22-Apr-2006  simonb Sync with head.
 1.159.4.2  09-Sep-2006  rpaulo sync with head
 1.159.4.1  07-Feb-2006  rpaulo sotoinpcb_hdr -> sotoinpcb.
 1.159.2.1  01-Mar-2006  yamt sync with head.
 1.160.6.1  24-May-2006  tron Merge 2006-05-24 NetBSD-current into the "peter-altq" branch.
 1.160.4.2  10-Mar-2006  elad generic_authorize() -> kauth_authorize_generic().
 1.160.4.1  08-Mar-2006  elad Adapt to kernel authorization KPI.
 1.160.2.2  11-Aug-2006  yamt sync with head
 1.160.2.1  24-May-2006  yamt sync with head.
 1.162.4.1  13-Jul-2006  gdamore Merge from HEAD.
 1.165.6.2  18-Dec-2006  yamt sync with head.
 1.165.6.1  10-Dec-2006  yamt sync with head.
 1.165.4.3  01-Feb-2007  ad Sync with head.
 1.165.4.2  12-Jan-2007  ad Sync with head.
 1.165.4.1  18-Nov-2006  ad Sync with head.
 1.167.2.2  28-Mar-2007  jdc Pull up revision 1.174 (requested by joerg in ticket #524).

Unconditionally zero and free iproute. Before IPsec tunnel packets e.g.
from ICMP could end up in leaking the reference in iproute, as
ipsec4_output would overwrite the ro pointer in state.

Tested by Juraj Hercek and supposed to fix PR kern/35273 and kern/35318.
 1.167.2.1  18-Jan-2007  tron Pull up following revision(s) (requested by yamt in ticket #361):
sys/netinet/ip_output.c: revision 1.173
ip_output: reload ip_len after running pfil_run_hooks.
pf "fragment reassemble" rule can change it, at least.
 1.177.2.4  07-May-2007  yamt sync with head.
 1.177.2.3  12-Mar-2007  rmind Sync with HEAD.
 1.177.2.2  27-Feb-2007  yamt - sync with head.
- move sched_changepri back to kern_synch.c as it doesn't know PPQ anymore.
 1.177.2.1  17-Feb-2007  yamt file ip_output.c was added on branch yamt-idlelwp on 2007-02-27 16:54:56 +0000
 1.179.4.1  11-Jul-2007  mjf Sync with head.
 1.179.2.2  09-Oct-2007  ad Sync with head.
 1.179.2.1  08-Jun-2007  ad Sync with head.
 1.180.8.3  23-Mar-2008  matt sync with HEAD
 1.180.8.2  09-Jan-2008  matt sync with HEAD
 1.180.8.1  06-Nov-2007  matt sync with HEAD
 1.180.6.3  03-Dec-2007  joerg Sync with HEAD.
 1.180.6.2  02-Oct-2007  joerg Sync with HEAD.
 1.180.6.1  03-Sep-2007  jmcneill Sync with HEAD.
 1.180.2.1  03-Sep-2007  skrll Sync with HEAD.
 1.184.6.3  18-Feb-2008  mjf Sync with HEAD.
 1.184.6.2  27-Dec-2007  mjf Sync with HEAD.
 1.184.6.1  08-Dec-2007  mjf Sync with HEAD.
 1.185.6.2  19-Jan-2008  bouyer Sync with HEAD
 1.185.6.1  02-Jan-2008  bouyer Sync with HEAD
 1.185.2.1  26-Dec-2007  ad Sync with head.
 1.192.6.3  17-Jan-2009  mjf Sync with HEAD.
 1.192.6.2  28-Sep-2008  mjf Sync with HEAD.
 1.192.6.1  02-Jun-2008  mjf Sync with HEAD.
 1.194.2.1  18-May-2008  yamt sync with head.
 1.195.2.5  19-Aug-2009  yamt sync with head.
 1.195.2.4  18-Jul-2009  yamt sync with head.
 1.195.2.3  16-May-2009  yamt sync with head
 1.195.2.2  04-May-2009  yamt sync with head.
 1.195.2.1  16-May-2008  yamt sync with head.
 1.196.6.1  19-Oct-2008  haad Sync with HEAD.
 1.196.2.1  18-Sep-2008  wrstuden Sync with wrstuden-revivesa-base-2.
 1.200.10.1  09-Jul-2009  snj branches: 1.200.10.1.2;
Pull up following revision(s) (requested by martin in ticket #847):
sys/netinet/ip_output.c: revision 1.203
From Wolfgang Stukenbrock in PR kern/41659: add missing splx().
 1.200.10.1.2.1  21-Apr-2010  matt sync to netbsd-5
 1.200.8.2  23-Jul-2009  jym Sync with HEAD.
 1.200.8.1  13-May-2009  jym Sync with HEAD.

Commit is split, to avoid a "too many arguments" protocol error.
 1.200.4.1  09-Jul-2009  snj Pull up following revision(s) (requested by martin in ticket #847):
sys/netinet/ip_output.c: revision 1.203
From Wolfgang Stukenbrock in PR kern/41659: add missing splx().
 1.200.2.1  28-Apr-2009  skrll Sync with HEAD.
 1.205.6.1  06-Jun-2011  jruoho Sync with HEAD.
 1.205.4.1  21-Apr-2011  rmind sync with head
 1.210.6.3  02-Jun-2012  mrg sync to latest -current.
 1.210.6.2  05-Apr-2012  mrg sync to latest -current.
 1.210.6.1  18-Feb-2012  mrg merge to -current.
 1.210.2.4  22-May-2014  yamt sync with head.

for a reference, the tree before this commit was tagged
as yamt-pagecache-tag8.

this commit was split into small chunks to avoid
a limitation of cvs. ("Protocol error: too many arguments")
 1.210.2.3  30-Oct-2012  yamt sync with head
 1.210.2.2  23-May-2012  yamt sync with head.
 1.210.2.1  17-Apr-2012  yamt sync with head
 1.217.2.4  03-Dec-2017  jdolecek update from HEAD
 1.217.2.3  20-Aug-2014  tls Rebase to HEAD as of a few days ago.
 1.217.2.2  23-Jun-2013  tls resync from head
 1.217.2.1  25-Feb-2013  tls resync with head
 1.223.2.3  17-Oct-2013  rmind Eliminate some of the splsoftnet() calls, misc clean up.
 1.223.2.2  28-Aug-2013  rmind sync with head
 1.223.2.1  17-Jul-2013  rmind Checkpoint work in progress:
- Move PCB structures under __INPCB_PRIVATE, adjust most of the callers
and thus make IPv4 PCB structures mostly opaque. Any volunteers for
merging in6pcb with inpcb (see rpaulo-netinet-merge-pcb branch)?
- Move various global vars to the modules where they belong, make them static.
- Some preliminary work for IPv4 PCB locking scheme.
- Make raw IP code mostly MP-safe. Simplify some of it.
- Rework "fast" IP forwarding (ipflow) code to be mostly MP-safe. It should
run from a software interrupt, rather than hard.
- Rework tun(4) pseudo interface to be MP-safe.
- Work towards making some other interfaces more strict.
 1.224.4.1  10-Aug-2014  tls Rebase.
 1.230.2.1  01-Dec-2014  martin Pull up following revision(s) (requested by ozaki-r in ticket #277):
sys/netinet/ip_output.c: revision 1.233
Call looutput with holding KERNEL_LOCK
This fixes diagnostic assertion "KERNEL_LOCKED_P()" in if_loop.c.
PR kern/49410
 1.233.2.10  28-Aug-2017  skrll Sync with HEAD
 1.233.2.9  05-Feb-2017  skrll Sync with HEAD
 1.233.2.8  05-Oct-2016  skrll Sync with HEAD
 1.233.2.7  09-Jul-2016  skrll Sync with HEAD
 1.233.2.6  29-May-2016  skrll Sync with HEAD
 1.233.2.5  22-Apr-2016  skrll Sync with HEAD
 1.233.2.4  19-Mar-2016  skrll Sync with HEAD
 1.233.2.3  22-Sep-2015  skrll Sync with HEAD
 1.233.2.2  06-Jun-2015  skrll Sync with HEAD
 1.233.2.1  06-Apr-2015  skrll Sync with HEAD
 1.259.2.4  20-Mar-2017  pgoyette Sync with HEAD
 1.259.2.3  07-Jan-2017  pgoyette Sync with HEAD. (Note that most of these changes are simply $NetBSD$
tag issues.)
 1.259.2.2  04-Nov-2016  pgoyette Sync with HEAD
 1.259.2.1  06-Aug-2016  pgoyette Sync with HEAD
 1.267.2.1  21-Apr-2017  bouyer Sync with HEAD
 1.276.4.2  19-May-2017  pgoyette Resolve conflicts from previous merge (all resulting from $NetBSD
keyword expansion)
 1.276.4.1  11-May-2017  pgoyette Sync with HEAD
 1.279.2.7  18-Mar-2018  martin Pull up following revision(s) (requested by tih in ticket #639):
sys/kern/uipc_socket.c: revision 1.258
sys/kern/uipc_socket.c: revision 1.259
sys/netinet/ip_input.c: revision 1.364 (via patch)
sys/netinet/ip_output.c: revision 1.289
sys/netinet/in.h: revision 1.102
sys/netinet/in_pcb.c: revision 1.181
share/man/man9/sockopt.9: revision 1.11
sys/netinet/in_pcb.h: revision 1.65
sys/sys/socketvar.h: revision 1.146
sys/kern/uipc_syscalls.c: revision 1.189
sys/netinet/ip_output.c: revision 1.290
share/man/man4/ip.4: revision 1.41
share/man/man4/ip.4: revision 1.42
sys/kern/uipc_syscalls.c: revision 1.190

pass valsize for getsockopt like we do for setsockopt
make sure that we have enough space, don't require the exact size
(Tom Ivar Helbekkmo)

1) "#define ipi_spec_dst ipi_addr" in <netinet/in.h>
2) Change the IP_RECVPKTINFO option to control the generation of
IP_PKTINFO control messages, the way it's done in Solaris.
3) Remove the superfluous IP_RECVPKTINFO control message.
4) Change the IP_PKTINFO option to do different things depending on
the parameter it's supplied with:
- If it's sizeof(int), assume it's being used as in Linux:
- If it's non-zero, turn on the IP_RECVPKTINFO option.
- If it's zero, turn off the IP_RECVPKTINFO option.
- If it's sizeof(struct in_pktinfo), assume it's being used as in
Solaris, to set a default for the source interface and/or
source address for outgoing packets on the socket.
5) Return what Linux or Solaris compatible code expects, depending
on data size, and just added a fallback to a Linux (and current NetBSD)
compatible value if the size is unknown (as it is now), or,
in the future, if the calling application specifies a receiving
buffer that doesn't match either data item.
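
The size-based dispatch in item 4 can be sketched as follows. `struct model_in_pktinfo`, `enum pktinfo_action` and `classify_pktinfo()` are hypothetical stand-ins, and rejecting any other size is an assumption of this sketch, not something the commit message states:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct in_pktinfo; only its size matters here. */
struct model_in_pktinfo {
    uint32_t src;
    uint32_t ifindex;
    uint32_t dst;
};

enum pktinfo_action {
    PKTINFO_RECV_ON,      /* Linux style, non-zero int: enable IP_RECVPKTINFO  */
    PKTINFO_RECV_OFF,     /* Linux style, zero int: disable IP_RECVPKTINFO     */
    PKTINFO_SET_DEFAULT,  /* Solaris style: default source interface/address   */
    PKTINFO_INVALID       /* assumption: any other size is rejected            */
};

/* Decide what an IP_PKTINFO setsockopt means from the option size alone. */
static enum pktinfo_action classify_pktinfo(const void *optval, size_t optlen)
{
    assert(optval != NULL);
    if (optlen == sizeof(int)) {
        int on;
        memcpy(&on, optval, sizeof(on));
        return on != 0 ? PKTINFO_RECV_ON : PKTINFO_RECV_OFF;
    }
    if (optlen == sizeof(struct model_in_pktinfo))
        return PKTINFO_SET_DEFAULT;
    return PKTINFO_INVALID;
}
```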

From: Tom Ivar Helbekkmo

new sentence-new line

Remove comment now that the getsockopt code passes the size.

Add a new sockopt member to keep track of the actual size of the option
that should be returned to the caller in getsockopt(2).
(Tom Ivar Helbekkmo)
 1.279.2.6  19-Feb-2018  snj Pull up following revision(s) (requested by ozaki-r in ticket #557):
sys/netinet/ip_output.c: 1.295
Keep a pointer to the interface of the multicast membership, because the
multicast element itself might go away in in_delmulti (but the interface
can't because we hold the lock). From ozaki-r@
 1.279.2.5  13-Jan-2018  snj Pull up following revision(s) (requested by ozaki-r in ticket #494):
sys/netinet/ip_output.c: revision 1.291-1.292
- this is not python, we need braces
- protect ifp locking against NULL
--
from ozaki-r: use the proper ifp.
XXX: perhaps push the lock in in_delmulti()?
 1.279.2.4  02-Jan-2018  snj Pull up following revision(s) (requested by ozaki-r in ticket #463):
sys/netinet/in.c: revision 1.212
sys/netinet/ip_output.c: revision 1.288
sys/netinet6/in6.c: revision 1.256
sys/netinet6/in6_pcb.c: revision 1.163
sys/sys/lwp.h: revision 1.176
Add missing curlwp_bindx
--
Add missing curlwp_bindx
--
Check LP_BOUND is surely set in curlwp_bindx
This may find an extra call of curlwp_bindx.
--
Fix usage of curlwp_bind in ip_output
curlwp_bindx must be called in LIFO order, i.e., we can't call curlwp_bind
and curlwp_bindx like this:
    bound1 = curlwp_bind();
    bound2 = curlwp_bind();
    curlwp_bindx(bound1);
    curlwp_bindx(bound2);
ip_output did so if NET_MPSAFE. Fix it.
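
Why the order matters can be shown with a small userland model, assuming bind returns the previous bound state and bindx restores it. `model_bind()`, `model_bindx()`, `model_pflag` and `MODEL_LP_BOUND` are hypothetical and only imitate the behaviour described in these commit messages; they are not the kernel's implementation:

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_LP_BOUND 0x1   /* illustrative bit, not the kernel's value */

static int model_pflag;      /* stands in for the per-LWP flag word */

/* Mark the thread as bound and return whether it already was. */
static int model_bind(void)
{
    int bound = model_pflag & MODEL_LP_BOUND;
    model_pflag |= MODEL_LP_BOUND;
    return bound;
}

/* Restore the state saved by the matching model_bind().
 * Returns false where an "LP_BOUND is set" assertion would fire. */
static bool model_bindx(int bound)
{
    if (!(model_pflag & MODEL_LP_BOUND))
        return false;
    model_pflag ^= (bound ^ MODEL_LP_BOUND);
    return true;
}
```

In this model, releasing in LIFO order (`bound2` first, then `bound1`) keeps the flag set until the outer release. Releasing `bound1` first clears the flag while the inner section is still active, and the later release of `bound2` then trips the check.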
--
Fix wrong usage of psref_held
We can't use it for checking if a caller does NOT hold a given target.
If you want to do it you should have psref_not_held or something.
 1.279.2.3  02-Jan-2018  snj Pull up following revision(s) (requested by ozaki-r in ticket #456):
sys/arch/arm/sunxi/sunxi_emac.c: 1.9
sys/dev/ic/dwc_gmac.c: 1.43-1.44
sys/dev/pci/if_iwm.c: 1.75
sys/dev/pci/if_wm.c: 1.543
sys/dev/pci/ixgbe/ixgbe.c: 1.112
sys/dev/pci/ixgbe/ixv.c: 1.74
sys/kern/sys_socket.c: 1.75
sys/net/agr/if_agr.c: 1.43
sys/net/bpf.c: 1.219
sys/net/if.c: 1.397, 1.399, 1.401-1.403, 1.406-1.410, 1.412-1.416
sys/net/if.h: 1.242-1.247, 1.250, 1.252-1.257
sys/net/if_bridge.c: 1.140 via patch, 1.142-1.146
sys/net/if_etherip.c: 1.40
sys/net/if_ethersubr.c: 1.243, 1.246
sys/net/if_faith.c: 1.57
sys/net/if_gif.c: 1.132
sys/net/if_l2tp.c: 1.15, 1.17
sys/net/if_loop.c: 1.98-1.101
sys/net/if_media.c: 1.35
sys/net/if_pppoe.c: 1.131-1.132
sys/net/if_spppsubr.c: 1.176-1.177
sys/net/if_tun.c: 1.142
sys/net/if_vlan.c: 1.107, 1.109, 1.114-1.121
sys/net/npf/npf_ifaddr.c: 1.3
sys/net/npf/npf_os.c: 1.8-1.9
sys/net/rtsock.c: 1.230
sys/netcan/if_canloop.c: 1.3-1.5
sys/netinet/if_arp.c: 1.255
sys/netinet/igmp.c: 1.65
sys/netinet/in.c: 1.210-1.211
sys/netinet/in_pcb.c: 1.180
sys/netinet/ip_carp.c: 1.92, 1.94
sys/netinet/ip_flow.c: 1.81
sys/netinet/ip_input.c: 1.362
sys/netinet/ip_mroute.c: 1.147
sys/netinet/ip_output.c: 1.283, 1.285, 1.287
sys/netinet6/frag6.c: 1.61
sys/netinet6/in6.c: 1.251, 1.255
sys/netinet6/in6_pcb.c: 1.162
sys/netinet6/ip6_flow.c: 1.35
sys/netinet6/ip6_input.c: 1.183
sys/netinet6/ip6_output.c: 1.196
sys/netinet6/mld6.c: 1.90
sys/netinet6/nd6.c: 1.239-1.240
sys/netinet6/nd6_nbr.c: 1.139
sys/netinet6/nd6_rtr.c: 1.136
sys/netipsec/ipsec_output.c: 1.65
sys/rump/net/lib/libnetinet/netinet_component.c: 1.9-1.10
kmem_intr_free kmem_intr_[z]alloced memory
the underlying pools are the same but api-wise those should match
Unify IFEF_*_MPSAFE into IFEF_MPSAFE
There are already two flags for if_output and if_start, however, it seems such
MPSAFE flags are eventually needed for all if_XXX operations. Having discrete
flags for each operation is wasteful of if_extflags bits. So let's unify
the flags into one: IFEF_MPSAFE.
Fortunately IFEF_*_MPSAFE flags have never been included in any releases, so
we can change them without breaking backward compatibility of the releases
(though the kernel version of -current should be bumped).
Note that if an interface has both MP-safe and non-MP-safe operations at a
time, we have to set the IFEF_MPSAFE flag and let callees of non-MP-safe
operations take the kernel lock.
Proposed on tech-kern@ and tech-net@
Provide macros for softnet_lock and KERNEL_LOCK hiding NET_MPSAFE switch
It reduces copy-and-paste code such as "#ifndef NET_MPSAFE KERNEL_LOCK(1, NULL); ..."
scattered all over the source code and makes it easy to identify remaining
KERNEL_LOCK and/or softnet_lock that are held even if NET_MPSAFE.
No functional change
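As a rough illustration of the idea of hiding the NET_MPSAFE switch behind a macro, here is a minimal userland sketch. All SKETCH_* and sketch_* names are hypothetical, and the lock is modelled by a plain counter; this is not the kernel's actual macro definition.

```c
#include <assert.h>

/* Hypothetical stand-in for the big kernel lock: just a depth counter. */
static int sketch_kernel_lock_depth;

static void
sketch_kernel_lock(void)
{
	sketch_kernel_lock_depth++;
}

static void
sketch_kernel_unlock(void)
{
	assert(sketch_kernel_lock_depth > 0);
	sketch_kernel_lock_depth--;
}

/*
 * The build-time switch lives in one place; call sites no longer need
 * their own "#ifndef NET_MPSAFE" blocks.
 */
#ifdef NET_MPSAFE
#define SKETCH_LOCK_UNLESS_NET_MPSAFE()		do { } while (0)
#define SKETCH_UNLOCK_UNLESS_NET_MPSAFE()	do { } while (0)
#else
#define SKETCH_LOCK_UNLESS_NET_MPSAFE()		sketch_kernel_lock()
#define SKETCH_UNLOCK_UNLESS_NET_MPSAFE()	sketch_kernel_unlock()
#endif

/* A call site: returns the lock depth seen inside the protected region. */
static int
sketch_ioctl_path(void)
{
	int depth_inside;

	SKETCH_LOCK_UNLESS_NET_MPSAFE();
	depth_inside = sketch_kernel_lock_depth;
	SKETCH_UNLOCK_UNLESS_NET_MPSAFE();
	return depth_inside;
}
```

With this shape, searching for the macro name finds every place where the lock is still taken conditionally, which is the benefit the message describes.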
Hold KERNEL_LOCK on if_ioctl selectively based on IFEF_MPSAFE
If IFEF_MPSAFE is not set, hold the lock; otherwise don't hold it.
This change requires additions of KERNEL_LOCK to subsequent functions from
if_ioctl such as ifmedia_ioctl and ifioctl_common to protect non-MP-safe
components.
Proposed on tech-kern@ and tech-net@
Ensure to hold if_ioctl_lock when calling if_flags_set
Fix locking against myself on ifpromisc
vlan_unconfig_locked could be called with holding if_ioctl_lock.
Ensure not to turn on IFF_RUNNING of an interface until its initialization completes
And ensure to turn it off before destruction as per IFF_RUNNING's description
"resource allocated". (The description is a bit doubtful though, I believe the
change is still proper.)
Ensure to hold if_ioctl_lock on if_up and if_down
One exception for if_down is if_detach; in the case the lock isn't needed
because it's guaranteed that no other one can access ifp at that point.
Make if_link_queue MP-safe if IFEF_MPSAFE
if_link_queue is a queue to store events of link state changes, which is
used to pass events from (typically) an interrupt handler to
if_link_state_change softint. The queue was protected by KERNEL_LOCK so far,
but if IFEF_MPSAFE is enabled, it becomes unsafe because (perhaps) an interrupt
handler of an interface with IFEF_MPSAFE doesn't take KERNEL_LOCK. Protect it
by a spin mutex.
Additionally with this change KERNEL_LOCK of if_link_state_change softint is
omitted if NET_MPSAFE is enabled.
Note that the spin mutex is now ifp->if_snd.ifq_lock as well as the case of
if_timer (see the comment).
Use IFADDR_WRITER_FOREACH instead of IFADDR_READER_FOREACH
At that point no other one modifies the list so IFADDR_READER_FOREACH
is unnecessary. Use of IFADDR_READER_FOREACH is harmless in general, but
if we try to detect contract violations of pserialize, using it violates
the contract. So avoiding it makes life easier.
Ensure to call if_addr_init with holding if_ioctl_lock
Get rid of outdated comments
Fix build of kernels without ether
By throwing out if_enable_vlan_mtu and if_disable_vlan_mtu that
created an unnecessary dependency from if.c to if_ethersubr.c.
PR kern/52790
Rename IFNET_LOCK to IFNET_GLOBAL_LOCK
IFNET_LOCK will be used in another lock, if_ioctl_lock (might be renamed then).
Wrap if_ioctl_lock with IFNET_* macros (NFC)
Also if_ioctl_lock perhaps needs to be renamed to something because it's now
not just for ioctl...
Reorder some destruction routines in if_detach
- Destroy if_ioctl_lock at the end of the if_detach because it's used in various
destruction routines
- Move psref_target_destroy after pr_purgeif because we want to use psref in
pr_purgeif (otherwise destruction procedures can be tricky)
Ensure to call if_mcast_op with holding IFNET_LOCK
Note that CARP doesn't deal with IFNET_LOCK yet.
Remove IFNET_GLOBAL_LOCK where it's unnecessary because IFNET_LOCK is held
Describe which lock is used to protect each member variable of struct ifnet
Requested by skrll@
Write a guideline for converting an interface to IFEF_MPSAFE
Requested by skrll@
Note that IFNET_LOCK must not be held in softint
Don't set IFEF_MPSAFE unless NET_MPSAFE at this point
Because recent investigations show that interfaces with IFEF_MPSAFE need to
follow additional restrictions to work with the flag safely. We should enable it
on an interface by default only if the interface surely satisfies the
restrictions, which are described in if.h.
Note that enabling IFEF_MPSAFE alone gains little performance benefit because
the network stack is still serialized by the big kernel locks by default.
 1.279.2.2  21-Dec-2017  snj Pull up following revision(s) (requested by ryo in ticket #445):
distrib/sets/lists/debug/mi: revision 1.222
distrib/sets/lists/tests/mi: revision 1.760
share/man/man4/ip.4: revision 1.38
sys/netinet/in.c: revision 1.207
sys/netinet/in.h: revision 1.101
sys/netinet/in_pcb.c: revision 1.179
sys/netinet/in_pcb.h: revision 1.64
sys/netinet/ip_output.c: revision 1.284, 1.286
sys/netinet/ip_var.h: revision 1.120-1.121
sys/netinet/raw_ip.c: revision 1.166-1.167
sys/netinet/udp_usrreq.c: revision 1.235-1.236
sys/netinet/udp_var.h: revision 1.42
tests/net/net/Makefile: revision 1.21
tests/net/net/t_pktinfo_send.c: revision 1.1-1.2
Add IP_PKTINFO support for sendmsg(2).
The source address or output interface can be specified by adding IP_PKTINFO
to the control part of the message on a SOCK_DGRAM or SOCK_RAW socket.
Reviewed by ozaki-r@ and christos@. thanks.
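A minimal sketch of how a caller might build such a control message follows. The helper name sketch_set_pktinfo is hypothetical, and the code assumes the struct in_pktinfo field names ipi_addr and ipi_ifindex; which field carries the source address on send can differ between systems, so check ip(4) on the target.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* glibc only: exposes struct in_pktinfo */
#endif
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>

/*
 * Attach one IP_PKTINFO control message to msg, using cbuf as storage.
 * Returns 0 on success, -1 if cbuf is too small.
 */
static int
sketch_set_pktinfo(struct msghdr *msg, void *cbuf, size_t cbuflen,
    struct in_addr src, unsigned int ifindex)
{
	struct in_pktinfo pi;
	struct cmsghdr *cm;

	if (cbuflen < CMSG_SPACE(sizeof(pi)))
		return -1;

	memset(cbuf, 0, cbuflen);
	msg->msg_control = cbuf;
	msg->msg_controllen = CMSG_SPACE(sizeof(pi));

	cm = CMSG_FIRSTHDR(msg);
	assert(cm != NULL);
	cm->cmsg_level = IPPROTO_IP;
	cm->cmsg_type = IP_PKTINFO;
	cm->cmsg_len = CMSG_LEN(sizeof(pi));

	/* Assumption: source address in ipi_addr, output interface in ipi_ifindex. */
	memset(&pi, 0, sizeof(pi));
	pi.ipi_addr = src;
	pi.ipi_ifindex = ifindex;
	memcpy(CMSG_DATA(cm), &pi, sizeof(pi));
	return 0;
}
```

A caller would then fill in msg_name and msg_iov as usual and pass the msghdr to sendmsg(2) on a SOCK_DGRAM or SOCK_RAW socket.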
--
As is the case with IPV6_PKTINFO, IP_PKTINFO can be sent without EADDRINUSE
even if the UDP address:port in use is specified.
 1.279.2.1  07-Jul-2017  martin Pull up following revision(s) (requested by roy in ticket #100):
sys/netinet/ip_output.c: revision 1.280
sys/netinet/ip_output.c: revision 1.282
When outputting, search for the sending address on the sending interface
rather than blindly picking the first matching address from any interface
when testing source address validity.
This allows another interface to have the same address, but be detached.
Rename u to udst, .dst to .sa and .dst4 to sin.
Create sockaddr for the source address in usrc so it won't stamp on udst.
This fixes a regression caused in r1.280
 1.298.2.7  26-Dec-2018  pgoyette Sync with HEAD, resolve a few conflicts
 1.298.2.6  28-Jul-2018  pgoyette Sync with HEAD
 1.298.2.5  25-Jun-2018  pgoyette Sync with HEAD
 1.298.2.4  02-May-2018  pgoyette Synch with HEAD
 1.298.2.3  22-Apr-2018  pgoyette Sync with HEAD
 1.298.2.2  16-Apr-2018  pgoyette Sync with HEAD, resolve some conflicts
 1.298.2.1  07-Apr-2018  pgoyette Sync with HEAD. 77 conflicts resolved - all of them $NetBSD$
 1.306.2.2  08-Apr-2020  martin Merge changes from current as of 20200406
 1.306.2.1  10-Jun-2019  christos Sync with HEAD
 1.324.2.3  29-Jul-2025  martin Pull up following revision(s) (requested by ozaki-r in ticket #1140):

sys/netinet/ip_output.c: revision 1.330
sys/netinet/sctp_output.c: revision 1.39
sys/netinet/ip_mroute.c: revision 1.166
sys/netipsec/ipsecif.c: revision 1.24
sys/netipsec/xform_ipip.c: revision 1.80
sys/netinet/ip_output.c: revision 1.327
sys/netinet/ip_output.c: revision 1.328
sys/netinet/ip_input.c: revision 1.406
sys/netinet/ip_output.c: revision 1.329
sys/netinet/in_var.h: revision 1.105

in: get rid of unused argument from ip_newid() and ip_newid_range()

in: take a reference of ifp on IP_ROUTETOIF
The ifp could be released after ia4_release(ia).

in: narrow the scope of ifa in ip_output (NFC)

sctp: follow the recent change of ip_newid()

in: avoid racy ifa_acquire(rt->rt_ifa) in ip_output()
If a rtentry is being destroyed asynchronously, ifa referenced by rt_ifa
can be destructed and taking ifa_acquire(rt->rt_ifa) aborts with a
KASSERT failure. Fortunately, the ifa is not actually freed because of
a reference by rt_ifa, so it remains usable (except by some functions such as
psref) as long as the rtentry is held.
PR kern/59527

in: avoid racy ia4_acquire(ifatoia(rt->rt_ifa)) in ip_rtaddr()
Same as the case of ip_output(), it's racy and should be avoided.
PR kern/59527
 1.324.2.2  21-Sep-2024  martin Pull up following revision(s) (requested by rin in ticket #903):

sys/netinet/ip_output.c: revision 1.326

Again allow multicast packets to be sent from unnumbered interfaces.
 1.324.2.1  25-Apr-2023  martin Pull up following revision(s) (requested by ozaki-r in ticket #150):

sys/netinet/ip_output.c: revision 1.325

Revert "Fix panic on packet sending via a route with rt_ifa of AF_LINK."

The fix is mistakenly upstreamed.
 1.326.6.1  02-Aug-2025  perseant Sync with HEAD
