TODO.smpnet revision 1.48
$NetBSD: TODO.smpnet,v 1.48 2024/04/24 06:44:18 nia Exp $

MP-safe components
==================

They work without the big kernel lock (KERNEL_LOCK), i.e., with the NET_MPSAFE
kernel option. Some components scale up and some don't.

 - Device drivers
   - aq(4)
   - bcmgenet(4)
   - bge(4)
   - ena(4)
   - iavf(4)
   - ixg(4)
   - ixl(4)
   - ixv(4)
   - mcx(4)
   - rge(4)
   - se(4)
   - sunxi_emac(4)
   - vioif(4)
   - vmx(4)
   - wm(4)
   - xennet(4)
   - usbnet(4) based adapters:
     - axe(4)
     - axen(4)
     - cdce(4)
     - cue(4)
     - kue(4)
     - mos(4)
     - mue(4)
     - smsc(4)
     - udav(4)
     - upl(4)
     - ure(4)
     - url(4)
     - urndis(4)
 - Layer 2
   - Ethernet (if_ethersubr.c)
   - bridge(4)
     - STP
   - Fast forward (ipflow)
 - Layer 3
   - All except for items in the below section
 - Interfaces
   - canloop(4)
   - gif(4)
   - ipsecif(4)
   - l2tp(4)
   - lagg(4)
   - pppoe(4)
     - if_spppsubr.c
   - tap(4)
   - tun(4)
   - vether(4)
   - vlan(4)
 - Packet filters
   - npf(7)
 - Others
   - bpf(4)
   - ipsec(4)
   - opencrypto(9)
   - pfil(9)

Non MP-safe components and kernel options
=========================================

The following components and options aren't MP-safe yet, i.e., they require
the big kernel lock.
Some of them can be used safely even if NET_MPSAFE is enabled because
they're still protected by the big kernel lock. The others aren't protected and
so are unsafe, e.g., they may crash the kernel.

Protected ones
--------------

 - Device drivers
   - Most drivers other than ones listed in the above section
 - Layer 4
   - DCCP
   - SCTP
   - TCP
   - UDP

Unprotected ones
----------------

 - Layer 2
   - ARCNET (if_arcsubr.c)
   - IEEE 1394 (if_ieee1394subr.c)
   - IEEE 802.11 (ieee80211(4))
 - Layer 3
   - IPSELSRC
   - MROUTING
     - PIM
   - MPLS (mpls(4))
   - IPv6 address selection policy
 - Interfaces
   - agr(4)
   - carp(4)
   - faith(4)
   - gre(4)
   - ppp(4)
   - sl(4)
   - stf(4)
   - if_srt
 - Packet filters
   - ipf(4)
   - pf(4)
 - Others
   - AppleTalk (sys/netatalk/)
   - Bluetooth (sys/netbt/)
   - altq(4)
   - kttcp(4)
   - NFS

Known issues
============

NOMPSAFE
--------

We use "NOMPSAFE" as a mark that indicates that the code around it isn't MP-safe
yet. We use it in comments and also as part of function names, for example
m_get_rcvif_NOMPSAFE. Let's use "NOMPSAFE" to make it easy to find non-MP-safe
code with grep.

bpf
---

MP-ification of bpf requires that all bpf_mtap* calls be made in normal LWP
context or softint context, i.e., not in hardware interrupt context.
For Tx, all bpf_mtap calls satisfy the requirement. For Rx, most bpf_mtap calls
are made in softint. Unfortunately some bpf_mtap calls on Rx are still made in
hardware interrupt context.

This is the list of the functions that have such bpf_mtap calls:

 - sca_frame_process() @ sys/dev/ic/hd64570.c

Ideally we should make the functions run in softint somehow, but we have
neither the actual devices nor the time (or interest/love) to work on the task,
so instead we provide a deferred bpf_mtap mechanism that forcibly runs bpf_mtap
in softint context. It's a workaround and once the functions run in softint, we
should use the original bpf_mtap again.

if_mcast_op() - SIOCADDMULTI/SIOCDELMULTI
-----------------------------------------
This helper function is called to add or remove multicast addresses for an
interface. When called via ioctl it takes IFNET_LOCK(); when called
via sosetopt() it doesn't.

Various network drivers can't assert IFNET_LOCKED() in their if_ioctl
because of this. Generally drivers still take care to call splnet() even
with NET_MPSAFE before calling ether_ioctl(), but they do not take
KERNEL_LOCK(), so this is actually unsafe.

Lingering obsolete variables
----------------------------

Some obsolete global variables and member variables of structures remain to
avoid breaking old userland programs which directly access such variables via
kvm(3).

The following programs still use kvm(3) to get some information related to
the network stack.

 - netstat(1)
 - vmstat(1)
 - fstat(1)

netstat(1) accesses ifnet_list, the head of a list of interface objects
(struct ifnet), and traverses each object through the ifnet#if_list member
variable. ifnet_list and ifnet#if_list are obsoleted by ifnet_pslist and
ifnet#if_pslist_entry respectively. netstat also accesses the IP address list
of an interface through ifnet#if_addrlist. struct ifaddr, struct in_ifaddr
and struct in6_ifaddr are accessed and the following obsolete member variables
must remain: ifaddr#ifa_list, in_ifaddr#ia_hash, in_ifaddr#ia_list,
in6_ifaddr#ia_next and in6_ifaddr#_ia6_multiaddrs. Note that netstat already
implements alternative methods to fetch the above information via sysctl(3).

vmstat(1) shows statistics of hash tables created by hashinit(9) in the kernel.
The statistics are retrieved via kvm(3). The global variables
in_ifaddrhash and in_ifaddrhashtbl, which are for a hash table of IPv4
addresses and obsoleted by in_ifaddrhash_pslist and in_ifaddrhashtbl_pslist,
are kept for this purpose. We should provide a means to fetch statistics of
hash tables via sysctl(3).

fstat(1) shows information about bpf instances. Each bpf instance (struct bpf)
is obtained via kvm(3). The bpf_d#_bd_next, bpf_d#_bd_filter and bpf_d#_bd_list
member variables are obsolete but remain. ifnet#if_xname is also accessed
via struct bpf_if, and the obsolete ifnet#if_list is required to remain so that
the offset of ifnet#if_xname does not change.
The statistic counters (bpf#bd_rcount,
bpf#bd_dcount and bpf#bd_ccount) are also victims of this restriction: for
scalability the statistic counters should be per-CPU and we should stop using
atomic operations for them, but we have to keep both the counters and the
atomic operations.

Scalability
-----------

 - Per-CPU rtcaches (used in say IP forwarding) aren't scalable on multiple
   flows per CPU
 - ipsec(4) isn't scalable on the number of SA/SP; the cost of a look-up
   is O(n)
 - opencrypto(9)'s crypto_newsession()/crypto_freesession() aren't scalable
   as they are serialized by one mutex

ALTQ
----

If ALTQ is enabled in the kernel, it forces the use of just one Tx queue
(if_snd) for packet transmissions, resulting in serializing all Tx packet
processing on the queue. We should probably design and implement an
alternative queuing mechanism that deals with multi-core systems in the first
place, rather than making the existing ALTQ MP-safe, because that's just
annoying.

Using kernel modules
--------------------

Please note that if you enable NET_MPSAFE in your kernel, and you use any
loadable kernel modules (including compat_xx modules or individual network
interface if_xxx device driver modules), you will need to build custom
modules. For each module you will need to add the following line to its
Makefile:

	CPPFLAGS+=	-DNET_MPSAFE

Failure to do this may result in unpredictable behavior.
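
To illustrate why a module built without the option can misbehave, here is a
minimal user-space sketch, not kernel code. It assumes that some code paths
take the big kernel lock only when NET_MPSAFE is undefined. struct
model_biglock, MODEL_LOCK_UNLESS_NET_MPSAFE and the model_* functions are
hypothetical names invented for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the big kernel lock: counts how often it is taken. */
struct model_biglock {
	int acquisitions;
};

#ifdef NET_MPSAFE
/* MP-safe build: the caller is expected to use finer-grained locking. */
#define MODEL_LOCK_UNLESS_NET_MPSAFE(bl)	do { (void)(bl); } while (0)
#else
/* Default build: fall back to the big lock. */
#define MODEL_LOCK_UNLESS_NET_MPSAFE(bl)	do { (bl)->acquisitions++; } while (0)
#endif

/* Reports how this translation unit was compiled. */
static bool
model_built_mpsafe(void)
{
#ifdef NET_MPSAFE
	return true;
#else
	return false;
#endif
}

/* Hypothetical module entry point that relies on the macro above. */
static void
model_module_entry(struct model_biglock *bl)
{
	assert(bl != NULL);
	MODEL_LOCK_UNLESS_NET_MPSAFE(bl);
	/* ... work that is only safe under the locking this build expects ... */
}
```

Compiled with -DNET_MPSAFE the entry point never touches the lock counter;
compiled without it, every call does. A module and a kernel built with
different settings would therefore disagree about which locking is in effect.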

IPv4 address initialization atomicity
-------------------------------------

An IPv4 address is referenced by several data structures: an associated
interface, its local route, a connected route (if necessary), the global list,
the global hash table, etc. These data structures are not updated atomically,
i.e., there can be inconsistent states on an IPv4 address in the kernel during
the initialization of an IPv4 address.

One known failure caused by the issue is that incoming packets destined for an
initializing address can loop in the network stack for a short period of time.
The address initialization creates a local route first and then registers the
initializing address to the global hash table, which is used to decide whether
an incoming packet is destined for the host by checking whether the
destination of the packet is registered to the hash table. So, if the host
allows forwarding, an incoming packet can match a local route of an
initializing address at ip_output while it fails the to-self check described
above at ip_input. Because a matched local route points to a loopback
interface as its destination interface, the incoming packet is sent to the
network stack (ip_input) again, which results in looping. The loop stops once
the initializing address is registered to the hash table.

One solution to the issue is to reorder the address initialization steps:
first register an address to the hash table, then create its routes. Another
solution is to use the routing table for the to-self check instead of using the
global hash table, like IPv6.
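
The ordering problem above can be sketched as a toy user-space model. It is
only an illustration under simplifying assumptions: the stack is reduced to a
single local route and a single hash entry, the value 0 means "not set", and
the pass limit exists only so the model terminates. struct model_stack,
MODEL_MAX_PASSES and model_input_passes() are hypothetical names, not kernel
interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy view of the two structures that are updated at different times. */
struct model_stack {
	uint32_t local_route;	/* address that has a local route, 0 = none */
	uint32_t hashed_addr;	/* address in the to-self hash table, 0 = none */
	bool forwarding;	/* whether the host forwards packets */
};

enum { MODEL_MAX_PASSES = 8 };	/* artificial bound for the model only */

/*
 * Count how many times a packet for dst enters the input path before it
 * is delivered, dropped or forwarded elsewhere.  A result equal to
 * MODEL_MAX_PASSES means the packet was still looping.
 */
static int
model_input_passes(const struct model_stack *s, uint32_t dst)
{
	int passes = 0;

	assert(dst != 0);
	while (passes < MODEL_MAX_PASSES) {
		passes++;
		if (s->hashed_addr == dst)
			break;	/* to-self check succeeds: deliver locally */
		if (!s->forwarding)
			break;	/* not ours and not forwarding: drop */
		if (s->local_route != dst)
			break;	/* forwarded to another host, leaves model */
		/* Local route points at loopback: packet re-enters input. */
	}
	return passes;
}
```

With the route created but the hash entry missing, the model reports
MODEL_MAX_PASSES. With the hash entry registered first, or with both present,
the packet is handled in a single pass, which is the effect the first proposed
solution aims for.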

if_flags
--------

To avoid data races on if_flags, it should be protected by a lock (currently
it's IFNET_LOCK). Thus, if_flags should not be accessed on packet processing
to avoid performance degradation by lock contention. Traditionally the
IFF_RUNNING, IFF_UP and IFF_OACTIVE flags of if_flags are checked on packet
processing. If you make a driver MP-safe you must remove such checks.

Drivers should not touch IFF_ALLMULTI. They are tempted to do so when updating
hardware multicast filters on SIOCADDMULTI/SIOCDELMULTI. Instead, they should
use the ETHER_F_ALLMULTI bit in struct ethercom::ec_flags, under ETHER_LOCK.
ether_ioctl takes care of presenting IFF_ALLMULTI according to the current state
of ETHER_F_ALLMULTI when queried with SIOCGIFFLAGS.

Also IFF_PROMISC is checked in ether_input and we should get rid of it somehow.
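
The IFF_ALLMULTI rule above can be modelled in user space as follows. This is
a sketch, not driver code: the lock is reduced to an ownership flag checked
with assert(), the flag values are placeholders, and every MODEL_* and
model_* name is hypothetical. It only shows the division of labour: the driver
updates a bit in ec_flags under the lock, and the flags query derives the
all-multicast bit from it without the driver writing if_flags.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_IFF_ALLMULTI	0x200	/* stand-in for IFF_ALLMULTI */
#define MODEL_ETHER_F_ALLMULTI	0x01	/* stand-in for ETHER_F_ALLMULTI */

struct model_ethercom {
	bool lock_held;	/* stand-in for ETHER_LOCK ownership */
	int ec_flags;	/* driver-owned flags, guarded by the lock */
	int if_flags;	/* interface flags the driver must not touch */
};

static void
model_lock(struct model_ethercom *ec)
{
	assert(!ec->lock_held);
	ec->lock_held = true;
}

static void
model_unlock(struct model_ethercom *ec)
{
	assert(ec->lock_held);
	ec->lock_held = false;
}

/* Driver side: called when the hardware multicast filter is reprogrammed. */
static void
model_set_filter(struct model_ethercom *ec, bool filter_overflow)
{
	model_lock(ec);
	if (filter_overflow)
		ec->ec_flags |= MODEL_ETHER_F_ALLMULTI;
	else
		ec->ec_flags &= ~MODEL_ETHER_F_ALLMULTI;
	model_unlock(ec);
}

/* Query side: what a flags request reports; if_flags is left unchanged. */
static int
model_get_flags(struct model_ethercom *ec)
{
	int flags;

	model_lock(ec);
	flags = ec->if_flags & ~MODEL_IFF_ALLMULTI;
	if (ec->ec_flags & MODEL_ETHER_F_ALLMULTI)
		flags |= MODEL_IFF_ALLMULTI;
	model_unlock(ec);
	return flags;
}
```

In this model the driver path never writes if_flags, so no if_flags lock is
needed while the multicast filter is being updated.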