TODO.smpnet revision 1.34
$NetBSD: TODO.smpnet,v 1.34 2020/01/20 18:40:06 thorpej Exp $

MP-safe components
==================

They work without the big kernel lock (KERNEL_LOCK), i.e., with the NET_MPSAFE
kernel option. Some components scale up and some don't.

 - Device drivers
   - aq(4)
   - vioif(4)
   - vmx(4)
   - wm(4)
   - ixg(4)
   - ixl(4)
   - ixv(4)
 - Layer 2
   - Ethernet (if_ethersubr.c)
   - bridge(4)
     - STP
   - Fast forward (ipflow)
 - Layer 3
   - All except for items in the below section
 - Interfaces
   - gif(4)
   - ipsecif(4)
   - l2tp(4)
   - pppoe(4)
     - if_spppsubr.c
   - tun(4)
   - vlan(4)
 - Packet filters
   - npf(7)
 - Others
   - bpf(4)
   - ipsec(4)
   - opencrypto(9)
   - pfil(9)

Non MP-safe components and kernel options
=========================================

The following components and options aren't MP-safe yet, i.e., they require
the big kernel lock. Some of them can be used safely even if NET_MPSAFE is
enabled because they're still protected by the big kernel lock. The others
aren't protected and so are unsafe, e.g., they may crash the kernel.

Protected ones
--------------

 - Device drivers
   - Most drivers other than ones listed in the above section
 - Layer 4
   - DCCP
   - SCTP
   - TCP
   - UDP

Unprotected ones
----------------

 - Layer 2
   - ARCNET (if_arcsubr.c)
   - BRIDGE_IPF
   - IEEE 1394 (if_ieee1394subr.c)
   - IEEE 802.11 (ieee80211(4))
 - Layer 3
   - IPSELSRC
   - MROUTING
     - PIM
   - MPLS (mpls(4))
   - IPv6 address selection policy
 - Interfaces
   - agr(4)
   - carp(4)
   - faith(4)
   - gre(4)
   - ppp(4)
   - sl(4)
   - stf(4)
   - strip(4)
   - if_srt
   - tap(4)
 - Packet filters
   - ipf(4)
   - pf(4)
 - Others
   - AppleTalk (sys/netatalk/)
   - Bluetooth (sys/netbt/)
   - altq(4)
   - CIFS (sys/netsmb/)
   - kttcp(4)
   - NFS

Known issues
============

NOMPSAFE
--------

We use "NOMPSAFE" as a mark that indicates that the code around it isn't MP-safe
yet. We use it in comments and also as part of function names, for example
m_get_rcvif_NOMPSAFE. Let's use "NOMPSAFE" to make it easy to find non-MP-safe
code by grep.

bpf
---

MP-ification of bpf requires that all bpf_mtap* calls be made in normal LWP
context or softint context, i.e., not in hardware interrupt context. For Tx,
all bpf_mtap calls satisfy the requirement. For Rx, most bpf_mtap calls are
made in softint. Unfortunately some bpf_mtap calls on Rx are still made in
hardware interrupt context.

This is the list of the functions that have such bpf_mtap calls:

 - sca_frame_process() @ sys/dev/ic/hd64570.c

Ideally we should make the functions run in softint somehow, but we don't have
actual devices or the time (or interest/love) to work on the task, so instead we
provide a deferred bpf_mtap mechanism that forcibly runs bpf_mtap in softint
context. It's a workaround and once the functions run in softint, we should use
the original bpf_mtap again.
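
The deferral idea described above can be pictured with a small user-space
sketch. This is not the kernel implementation: the ring buffer, struct pkt,
tap_packet(), deferred_tap_enqueue() and deferred_tap_softint() are all
hypothetical names chosen for illustration, and the interrupt-safe
synchronization a real version would need is left out. The point is only that
the hardware-interrupt side queues the packet, while the tap itself runs later
from the softint side.

```c
#include <assert.h>
#include <stddef.h>

#define DEFER_RING_SIZE	8	/* hypothetical capacity; one slot stays empty */

struct pkt {
	int id;			/* stand-in for an mbuf */
};

struct defer_ring {
	struct pkt *slots[DEFER_RING_SIZE];
	size_t head;		/* next slot to drain */
	size_t tail;		/* next slot to fill */
	int tapped;		/* packets handed to the tap */
	int dropped;		/* packets not tapped because the ring was full */
};

/* Stand-in for bpf_mtap(); assumed to be legal only outside hard interrupts. */
static void
tap_packet(struct defer_ring *r, struct pkt *p)
{
	assert(p != NULL);
	r->tapped++;
}

/* Called from (simulated) hardware interrupt context: queue only, no tap. */
static int
deferred_tap_enqueue(struct defer_ring *r, struct pkt *p)
{
	size_t next = (r->tail + 1) % DEFER_RING_SIZE;

	if (next == r->head) {
		r->dropped++;
		return -1;
	}
	r->slots[r->tail] = p;
	r->tail = next;
	return 0;
}

/* Called from (simulated) softint context: run the tap on queued packets. */
static void
deferred_tap_softint(struct defer_ring *r)
{
	while (r->head != r->tail) {
		tap_packet(r, r->slots[r->head]);
		r->head = (r->head + 1) % DEFER_RING_SIZE;
	}
}
```

In this sketch nothing is tapped until deferred_tap_softint() runs, which
mirrors the ordering constraint stated above. A packet that finds the ring
full is counted as dropped from the tap's point of view only; that policy is
an assumption of the sketch.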

Lingering obsolete variables
-----------------------------

Some obsolete global variables and member variables of structures remain to
avoid breaking old userland programs which directly access such variables via
kvm(3).

The following programs still use kvm(3) to get some information related to
the network stack.

 - netstat(1)
 - vmstat(1)
 - fstat(1)

netstat(1) accesses ifnet_list, the head of a list of interface objects
(struct ifnet), and traverses each object through the ifnet#if_list member
variable. ifnet_list and ifnet#if_list are obsoleted by ifnet_pslist and
ifnet#if_pslist_entry respectively. netstat also accesses the IP address list
of an interface through ifnet#if_addrlist. struct ifaddr, struct in_ifaddr
and struct in6_ifaddr are accessed and the following obsolete member variables
are stuck: ifaddr#ifa_list, in_ifaddr#ia_hash, in_ifaddr#ia_list,
in6_ifaddr#ia_next and in6_ifaddr#_ia6_multiaddrs. Note that netstat already
implements alternative methods to fetch the above information via sysctl(3).

vmstat(1) shows statistics of hash tables created by hashinit(9) in the kernel.
The statistics are retrieved via kvm(3). The global variables
in_ifaddrhash and in_ifaddrhashtbl, which are for a hash table of IPv4
addresses and obsoleted by in_ifaddrhash_pslist and in_ifaddrhashtbl_pslist,
are kept for this purpose. We should provide a means to fetch statistics of
hash tables via sysctl(3).

fstat(1) shows information about bpf instances. Each bpf instance (struct bpf)
is obtained via kvm(3). The bpf_d#_bd_next, bpf_d#_bd_filter and bpf_d#_bd_list
member variables are obsolete but remain. ifnet#if_xname is also accessed
via struct bpf_if, and the obsolete ifnet#if_list has to remain so that the
offset of ifnet#if_xname doesn't change. The statistic counters (bpf#bd_rcount,
bpf#bd_dcount and bpf#bd_ccount) are also victims of this restriction; for
scalability the statistic counters should be per-CPU and we should stop using
atomic operations for them, however we have to keep the counters and the atomic
operations.

Scalability
-----------

 - Per-CPU rtcaches (used in, say, IP forwarding) aren't scalable on multiple
   flows per CPU
 - ipsec(4) isn't scalable on the number of SA/SP; the cost of a look-up
   is O(n)
 - opencrypto(9)'s crypto_newsession()/crypto_freesession() aren't scalable
   as they are serialized by one mutex

ALTQ
----

If ALTQ is enabled in the kernel, it forces the use of just one Tx queue
(if_snd) for packet transmissions, resulting in serializing all Tx packet
processing on the queue. We should probably design and implement an alternative
queuing mechanism that deals with multi-core systems in the first place, rather
than making the existing ALTQ MP-safe, because it's just annoying.

Using kernel modules
--------------------

Please note that if you enable NET_MPSAFE in your kernel, and you use any
loadable kernel modules (including compat_xx modules or individual network
interface if_xxx device driver modules), you will need to build custom
modules. For each module you will need to add the following line to its
Makefile:

        CPPFLAGS+= -DNET_MPSAFE

Failure to do this may result in unpredictable behavior.

IPv4 address initialization atomicity
-------------------------------------

An IPv4 address is referenced by several data structures: an associated
interface, its local route, a connected route (if necessary), the global list,
the global hash table, etc. These data structures are not updated atomically,
i.e., there can be inconsistent states on an IPv4 address in the kernel during
the initialization of an IPv4 address.

One known failure caused by the issue is that incoming packets destined for an
initializing address can loop in the network stack for a short period of time.
The address initialization creates a local route first and then registers the
initializing address in the global hash table, which is used to decide whether
an incoming packet is destined for the host by checking whether the destination
of the packet is registered in the hash table. So, if the host allows
forwarding, an incoming packet can match the local route of an initializing
address at ip_output while it fails the to-self check described above at
ip_input. Because a matched local route points to a loopback interface as its
destination interface, the incoming packet is sent to the network stack
(ip_input) again, which results in looping. The loop stops once the
initializing address is registered in the hash table.

One solution to the issue is to reorder the address initialization steps:
first register an address in the hash table, then create its routes. Another
solution is to use the routing table for the to-self check instead of using the
global hash table, like IPv6.

if_flags
--------

To avoid data races on if_flags it should be protected by a lock (currently
it's IFNET_LOCK). Thus, if_flags should not be accessed during packet
processing, to avoid performance degradation from lock contention.
Traditionally the IFF_RUNNING, IFF_UP and IFF_OACTIVE flags of if_flags are
checked during packet processing. If you make a driver MP-safe you must remove
such checks.

IFF_ALLMULTI can be set/unset via if_mcast_op. To protect updates of the flag,
we added IFNET_LOCK around if_mcast_op. However that was not a good
approach because if_mcast_op is typically called in the middle of a call path
and holding IFNET_LOCK in such places is problematic. Actually a deadlock has
been observed. Probably we should remove IFNET_LOCK and manage IFF_ALLMULTI
somewhere other than if_flags, for example in ethercom or the driver itself (or
a common driver framework once it appears). Such a change is feasible because
IFF_ALLMULTI is only set/unset by a driver and not accessed from any common
components such as network protocols.

Also IFF_PROMISC is checked in ether_input and we should get rid of it somehow.
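
The suggestion above of keeping the all-multicast state out of if_flags can be
sketched as follows. This is only an illustration under assumptions: struct
xx_softc, XX_F_ALLMULTI and the xx_* functions are hypothetical names, not
existing kernel interfaces, and the driver-private lock that would guard
sc_flags is omitted so that the example stays self-contained.

```c
#include <assert.h>
#include <stdbool.h>

#define XX_F_ALLMULTI	0x01	/* hypothetical driver-private flag bit */

struct xx_softc {
	unsigned int sc_flags;	/* would be guarded by the driver's own lock */
	int sc_nranges;		/* multicast ranges the filter cannot express */
};

/* A multicast range was added: fall back to receiving all multicast. */
static void
xx_mcast_add_range(struct xx_softc *sc)
{
	sc->sc_nranges++;
	sc->sc_flags |= XX_F_ALLMULTI;
}

/* A range was removed: leave all-multicast mode once none remain. */
static void
xx_mcast_del_range(struct xx_softc *sc)
{
	assert(sc->sc_nranges > 0);
	if (--sc->sc_nranges == 0)
		sc->sc_flags &= ~XX_F_ALLMULTI;
}

/* Query used when programming the receive filter; no if_flags access. */
static bool
xx_allmulti(const struct xx_softc *sc)
{
	return (sc->sc_flags & XX_F_ALLMULTI) != 0;
}
```

Because the state lives in the driver's own structure, the update path in this
sketch never touches if_flags and so would not need IFNET_LOCK; which lock
actually protects the field is left to the driver.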