TODO.smpnet revision 1.51
$NetBSD: TODO.smpnet,v 1.51 2025/06/17 02:00:25 ozaki-r Exp $

MP-safe components
==================

These components work without the big kernel lock (KERNEL_LOCK), i.e., with the
NET_MPSAFE kernel option.  Some of them scale up and some don't.

 - Device drivers
   - aq(4)
   - awge(4)
   - bcmgenet(4)
   - bge(4)
   - ena(4)
   - iavf(4)
   - ixg(4)
   - ixl(4)
   - ixv(4)
   - mcx(4)
   - rge(4)
   - se(4)
   - sunxi_emac(4)
   - vioif(4)
   - vmx(4)
   - wm(4)
   - xennet(4)
   - usbnet(4) based adapters:
     - axe(4)
     - axen(4)
     - cdce(4)
     - cue(4)
     - kue(4)
     - mos(4)
     - mue(4)
     - smsc(4)
     - udav(4)
     - upl(4)
     - ure(4)
     - url(4)
     - urndis(4)
 - Layer 2
   - Ethernet (if_ethersubr.c)
   - bridge(4)
     - STP
   - Fast forward (ipflow)
 - Layer 3
   - Everything except the items listed in the section below
 - Interfaces
   - canloop(4)
   - gif(4)
   - ipsecif(4)
   - l2tp(4)
   - lagg(4)
   - pppoe(4)
     - if_spppsubr.c
   - tap(4)
   - tun(4)
   - vether(4)
   - vlan(4)
 - Packet filters
   - npf(7)
   - ipf(4)
 - Others
   - bpf(4)
   - ipsec(4)
   - opencrypto(9)
   - pfil(9)

Non MP-safe components and kernel options
=========================================

The following components and options aren't MP-safe yet, i.e., they require the
big kernel lock.  Some of them can be used safely even if NET_MPSAFE is enabled
because they're still protected by the big kernel lock.  The others aren't
protected and so are unsafe, e.g., they may crash the kernel.

Protected ones
--------------

 - Device drivers
   - Most drivers other than ones listed in the above section
 - Layer 4
   - DCCP
   - SCTP
   - TCP
   - UDP

Unprotected ones
----------------

 - Layer 2
   - ARCNET (if_arcsubr.c)
   - IEEE 1394 (if_ieee1394subr.c)
   - IEEE 802.11 (ieee80211(4))
 - Layer 3
   - IPSELSRC
   - MROUTING
   - PIM
   - MPLS (mpls(4))
   - IPv6 address selection policy
 - Interfaces
   - agr(4)
   - carp(4)
   - faith(4)
   - gre(4)
   - ppp(4)
   - sl(4)
   - stf(4)
   - if_srt
 - Packet filters
   - pf(4)
 - Others
   - AppleTalk (sys/netatalk/)
   - Bluetooth (sys/netbt/)
   - altq(4)
   - kttcp(4)
   - NFS

Known issues
============

NOMPSAFE
--------

We use "NOMPSAFE" as a mark that indicates that the code around it isn't MP-safe
yet.  We use it in comments and also as part of function names, for example
m_get_rcvif_NOMPSAFE.  Let's use "NOMPSAFE" to make it easy to find non-MP-safe
code with grep.

bpf
---

MP-ification of bpf requires that all bpf_mtap* calls are made in normal LWP
context or softint context, i.e., not in hardware interrupt context.  For Tx,
all bpf_mtap calls satisfy the requirement.  For Rx, most bpf_mtap calls are
made in softint.  Unfortunately some bpf_mtap calls on Rx are still made in
hardware interrupt context.

This is the list of the functions that have such bpf_mtap calls:

 - sca_frame_process() @ sys/dev/ic/hd64570.c

Ideally we should make the functions run in softint somehow, but we don't have
the actual devices or the time (or interest/love) to work on the task, so
instead we provide a deferred bpf_mtap mechanism that forcibly runs bpf_mtap in
softint context.  It's a workaround and once the functions run in softint, we
should use the original bpf_mtap again.
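
As a rough illustration of the deferral idea, the following userland C model
queues packets while a modelled hardware interrupt handler runs and hands them
to the tap only from a modelled softint.  It is a sketch, not the kernel
implementation: struct deferred_tap, model_tap_defer(), model_tap_softint(),
the in_hardintr flag and the fixed queue length are all hypothetical names and
choices made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Userland model only: none of these names are NetBSD kernel symbols. */
#define DEFER_QLEN 8

struct pkt {
	int id;
};

struct deferred_tap {
	struct pkt *q[DEFER_QLEN];	/* packets waiting for the softint */
	size_t head, count;
	int in_hardintr;		/* 1 while the modelled hard interrupt runs */
	int tapped;			/* packets delivered to the tap */
	int dropped;			/* packets skipped because the queue was full */
};

/* Stand-in for the real tap; it must never run in hard interrupt context. */
static void
model_tap(struct deferred_tap *dt, struct pkt *p)
{
	assert(!dt->in_hardintr);
	(void)p;
	dt->tapped++;
}

/* Called from the modelled hard interrupt handler: enqueue only. */
static void
model_tap_defer(struct deferred_tap *dt, struct pkt *p)
{
	assert(dt->in_hardintr);
	if (dt->count == DEFER_QLEN) {
		dt->dropped++;		/* assumption: a full queue loses the tap */
		return;
	}
	dt->q[(dt->head + dt->count) % DEFER_QLEN] = p;
	dt->count++;
}

/* Modelled softint handler: drain the queue and run the tap. */
static void
model_tap_softint(struct deferred_tap *dt)
{
	assert(!dt->in_hardintr);
	while (dt->count > 0) {
		struct pkt *p = dt->q[dt->head];

		dt->head = (dt->head + 1) % DEFER_QLEN;
		dt->count--;
		model_tap(dt, p);
	}
}
```

In this model the interrupt path never calls model_tap() directly, which is the
property the deferred mechanism is meant to provide; the cost is an extra queue
and the possibility of losing a tap when the queue is full.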

if_mcast_op() - SIOCADDMULTI/SIOCDELMULTI
-----------------------------------------

This helper function is called to add or remove multicast addresses for an
interface.  When called via ioctl it takes IFNET_LOCK(); when called via
sosetopt() it doesn't.

Various network drivers can't assert IFNET_LOCKED() in their if_ioctl
because of this.  Generally drivers still take care to call splnet() even
with NET_MPSAFE before calling ether_ioctl(), but they do not take
KERNEL_LOCK(), so this is actually unsafe.

Lingering obsolete variables
----------------------------

Some obsolete global variables and member variables of structures remain to
avoid breaking old userland programs which directly access such variables via
kvm(3).

The following programs still use kvm(3) to get some information related to
the network stack.

 - netstat(1)
 - vmstat(1)
 - fstat(1)

netstat(1) accesses ifnet_list, the head of a list of interface objects
(struct ifnet), and traverses each object through the ifnet#if_list member
variable.  ifnet_list and ifnet#if_list are obsoleted by ifnet_pslist and
ifnet#if_pslist_entry respectively.  netstat also accesses the IP address list
of an interface through ifnet#if_addrlist.  struct ifaddr, struct in_ifaddr
and struct in6_ifaddr are accessed, so the following obsolete member variables
must remain: ifaddr#ifa_list, in_ifaddr#ia_hash, in_ifaddr#ia_list,
in6_ifaddr#ia_next and in6_ifaddr#_ia6_multiaddrs.  Note that netstat already
implements alternative methods to fetch the above information via sysctl(3).

vmstat(1) shows statistics of hash tables created by hashinit(9) in the kernel.
The statistic information is retrieved via kvm(3).  The global variables
in_ifaddrhash and in_ifaddrhashtbl, which are for a hash table of IPv4
addresses and obsoleted by in_ifaddrhash_pslist and in_ifaddrhashtbl_pslist,
are kept for this purpose.  We should provide a means to fetch statistics of
hash tables via sysctl(3).

fstat(1) shows information of bpf instances.  Each bpf instance (struct bpf) is
obtained via kvm(3).  bpf_d#_bd_next, bpf_d#_bd_filter and bpf_d#_bd_list
member variables are obsolete but remain.  ifnet#if_xname is also accessed
via struct bpf_if, so the obsolete ifnet#if_list has to remain so that the
offset of ifnet#if_xname doesn't change.  The statistic counters (bpf#bd_rcount,
bpf#bd_dcount and bpf#bd_ccount) are also victims of this restriction; for
scalability the statistic counters should be per-CPU and we should stop using
atomic operations for them; however, we have to keep the counters and the
atomic operations.

Scalability
-----------

 - Per-CPU rtcaches (used in, say, IP forwarding) don't scale with multiple
   flows per CPU
 - ipsec(4) doesn't scale with the number of SAs/SPs; the cost of a look-up
   is O(n)
 - opencrypto(9)'s crypto_newsession()/crypto_freesession() don't scale
   because they are serialized by one mutex

ALTQ
----

If ALTQ is enabled in the kernel, it forces the use of just one Tx queue
(if_snd) for packet transmissions, which serializes all Tx packet processing on
that queue.  We should probably design and implement an alternative queuing
mechanism that deals with multi-core systems in the first place, rather than
making the existing ALTQ MP-safe, because that's just annoying.

Using kernel modules
--------------------

Please note that if you enable NET_MPSAFE in your kernel, and you use any
loadable kernel modules (including compat_xx modules or individual network
interface if_xxx device driver modules), you will need to build custom
modules.  For each module you will need to add the following line to its
Makefile:

	CPPFLAGS+=	-DNET_MPSAFE

Failure to do this may result in unpredictable behavior.

IPv4 address initialization atomicity
-------------------------------------

An IPv4 address is referenced by several data structures: an associated
interface, its local route, a connected route (if necessary), the global list,
the global hash table, etc.  These data structures are not updated atomically,
i.e., there can be inconsistent states on an IPv4 address in the kernel during
the initialization of an IPv4 address.

One known failure caused by the issue is that incoming packets destined for an
initializing address can loop in the network stack for a short period of time.
The address initialization creates a local route first and then registers the
initializing address to the global hash table, which is used to decide if an
incoming packet is destined for the host by checking whether the destination of
the packet is registered to the hash table.  So, if the host allows forwarding,
an incoming packet can match the local route of an initializing address at
ip_output while it fails the to-self check described above at ip_input.  Because
the matched local route points to a loopback interface as its destination
interface, the incoming packet is sent to the network stack (ip_input) again,
which results in looping.  The loop stops once the initializing address is
registered to the hash table.

One solution to the issue is to reorder the address initialization steps: first
register the address to the hash table, then create its routes.  Another
solution is to use the routing table for the to-self check instead of using the
global hash table, like IPv6.
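
The effect of the ordering can be modelled in a few lines of userland C.  The
sketch below is not kernel code: struct addr_state, model_input() and the
verdict names are invented for illustration, and the whole forwarding decision
is reduced to two booleans on a host that is assumed to forward packets.

```c
#include <assert.h>
#include <stdbool.h>

/* Userland model; the names are illustrative, not kernel symbols. */
struct addr_state {
	bool local_route;	/* local route via the loopback interface exists */
	bool in_hash;		/* address is registered to the global hash table */
};

enum verdict { DELIVERED, FORWARDED, LOOPED };

/*
 * Model one packet destined for the address on a forwarding host.
 * Each iteration stands for one trip through the modelled ip_input.
 * The state does not change while the packet is in flight, so LOOPED
 * means "still looping after max_passes trips".
 */
static enum verdict
model_input(const struct addr_state *as, int max_passes)
{
	assert(max_passes > 0);
	for (int pass = 0; pass < max_passes; pass++) {
		if (as->in_hash)
			return DELIVERED;	/* to-self check succeeds */
		if (!as->local_route)
			return FORWARDED;	/* ordinary forwarding path */
		/* Local route matched: loopback feeds the packet back in. */
	}
	return LOOPED;
}
```

With the current order the intermediate state is { local_route = true,
in_hash = false }, which this model reports as LOOPED.  With the reordered
initialization the intermediate state is { local_route = false, in_hash = true
}, which is reported as DELIVERED.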

if_flags
--------

To avoid data races on if_flags it should be protected by a lock (currently it's
IFNET_LOCK).  Thus, if_flags should not be accessed during packet processing to
avoid performance degradation from lock contention.  Traditionally the
IFF_RUNNING, IFF_UP and IFF_OACTIVE flags of if_flags are checked during packet
processing.  If you make a driver MP-safe you must remove such checks.

Drivers should not touch IFF_ALLMULTI.  They are tempted to do so when updating
hardware multicast filters on SIOCADDMULTI/SIOCDELMULTI.  Instead, they should
use the ETHER_F_ALLMULTI bit in struct ethercom::ec_flags, under ETHER_LOCK.
ether_ioctl takes care of presenting IFF_ALLMULTI according to the current state
of ETHER_F_ALLMULTI when queried with SIOCGIFFLAGS.

Also IFF_PROMISC is checked in ether_input and we should get rid of it somehow.

Too many kpreempt_disable/kpreempt_enable
-----------------------------------------

Packet counters in the network stack such as if_statadd() and ip_statinc() are
implemented with percpu(9) to avoid atomic operations.  The implementation seems
good for scalability; however, it introduces another issue.  Since percpu(9)
requires kpreempt_{dis,en}able() for each per-CPU operation, we have to call
them for each packet count.  An observation shows that kpreempt_{dis,en}able()
are called more than 10 times for each forwarded packet.  For better performance
on a single flow, we should reduce per-packet operations as much as possible.

One possible solution to the issue is to make the whole network stack
non-preemptive so that we don't need to call kpreempt_{dis,en}able() for each
packet count.
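
To make the per-packet cost concrete, here is a small userland C model that
counts how often a preemption guard is entered when every counter update is
guarded individually, compared with one guard around all updates for a packet.
The second variant only approximates the proposal above, in that the caller is
assumed to already run non-preemptively.  All names (struct cpu_stats,
model_guard_enter(), count_each(), count_batched()) are hypothetical and do not
correspond to kernel functions.

```c
#include <assert.h>

/* Userland model; nothing here is a NetBSD kernel interface. */
#define NCOUNTERS 4

struct cpu_stats {
	unsigned long counters[NCOUNTERS];
	unsigned long guard_calls;	/* modelled kpreempt_{dis,en}able() pairs */
};

/* Stand-ins for disabling and re-enabling preemption. */
static void
model_guard_enter(struct cpu_stats *cs)
{
	cs->guard_calls++;
}

static void
model_guard_exit(struct cpu_stats *cs)
{
	(void)cs;
}

/* One guard pair per counter update, as with per-counter percpu access. */
static void
count_each(struct cpu_stats *cs, const int *idx, int n)
{
	for (int i = 0; i < n; i++) {
		assert(idx[i] >= 0 && idx[i] < NCOUNTERS);
		model_guard_enter(cs);
		cs->counters[idx[i]]++;
		model_guard_exit(cs);
	}
}

/* One guard pair around all counter updates for a packet. */
static void
count_batched(struct cpu_stats *cs, const int *idx, int n)
{
	model_guard_enter(cs);
	for (int i = 0; i < n; i++) {
		assert(idx[i] >= 0 && idx[i] < NCOUNTERS);
		cs->counters[idx[i]]++;
	}
	model_guard_exit(cs);
}
```

Both variants end with the same counter values; only the number of guard pairs
per packet differs, which is the overhead the section is concerned with.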