TODO.smpnet revision 1.40
$NetBSD: TODO.smpnet,v 1.40 2021/01/20 10:26:43 nia Exp $

MP-safe components
==================

They work without the big kernel lock (KERNEL_LOCK), i.e., with the NET_MPSAFE
kernel option. Some components scale up and some don't.

 - Device drivers
   - aq(4)
   - vioif(4)
   - vmx(4)
   - wm(4)
   - ixg(4)
   - ixl(4)
   - ixv(4)
 - Layer 2
   - Ethernet (if_ethersubr.c)
   - bridge(4)
   - STP
   - Fast forward (ipflow)
 - Layer 3
   - All except for items in the below section
 - Interfaces
   - gif(4)
   - ipsecif(4)
   - l2tp(4)
   - pppoe(4)
   - if_spppsubr.c
   - tap(4)
   - tun(4)
   - vlan(4)
 - Packet filters
   - npf(7)
 - Others
   - bpf(4)
   - ipsec(4)
   - opencrypto(9)
   - pfil(9)

Non MP-safe components and kernel options
=========================================

The following components and options aren't MP-safe yet, i.e., they still
require the big kernel lock. Some of them can be used safely even if
NET_MPSAFE is enabled because they're still protected by the big kernel lock.
The others aren't protected and so are unsafe, e.g., they may crash the kernel.

Protected ones
--------------

 - Device drivers
   - Most drivers other than ones listed in the above section
 - Layer 4
   - DCCP
   - SCTP
   - TCP
   - UDP

Unprotected ones
----------------

 - Layer 2
   - ARCNET (if_arcsubr.c)
   - IEEE 1394 (if_ieee1394subr.c)
   - IEEE 802.11 (ieee80211(4))
 - Layer 3
   - IPSELSRC
   - MROUTING
   - PIM
   - MPLS (mpls(4))
   - IPv6 address selection policy
 - Interfaces
   - agr(4)
   - carp(4)
   - faith(4)
   - gre(4)
   - ppp(4)
   - sl(4)
   - stf(4)
   - if_srt
 - Packet filters
   - ipf(4)
   - pf(4)
 - Others
   - AppleTalk (sys/netatalk/)
   - Bluetooth (sys/netbt/)
   - altq(4)
   - kttcp(4)
   - NFS

Known issues
============

NOMPSAFE
--------

We use "NOMPSAFE" as a marker indicating that the code around it isn't MP-safe
yet.
We use it in comments and also as part of function names, for example
m_get_rcvif_NOMPSAFE. Let's use "NOMPSAFE" to make it easy to find non-MP-safe
code by grep.

bpf
---

MP-ification of bpf requires that all bpf_mtap* calls are made in normal LWP
context or softint context, i.e., not in hardware interrupt context. For Tx,
all bpf_mtap calls satisfy the requirement. For Rx, most bpf_mtap calls are
made in softint context. Unfortunately, some bpf_mtap calls on Rx are still
made in hardware interrupt context.

This is the list of the functions that have such bpf_mtap calls:

 - sca_frame_process() @ sys/dev/ic/hd64570.c

Ideally we should make the functions run in softint somehow, but we have
neither the actual devices nor the time (or interest/love) to work on the
task, so instead we provide a deferred bpf_mtap mechanism that forcibly runs
bpf_mtap in softint context. It's a workaround, and once the functions run in
softint, we should use the original bpf_mtap again.

if_mcast_op() - SIOCADDMULTI/SIOCDELMULTI
-----------------------------------------

This helper function is called to add or remove multicast addresses for an
interface. When called via ioctl it takes IFNET_LOCK(); when called via
sosetopt() it doesn't.

Various network drivers can't assert IFNET_LOCKED() in their if_ioctl
because of this. Generally drivers still take care to call splnet() even
with NET_MPSAFE before calling ether_ioctl(), but they do not take
KERNEL_LOCK(), so this is actually unsafe.

Lingering obsolete variables
----------------------------

Some obsolete global variables and member variables of structures remain to
avoid breaking old userland programs which directly access such variables via
kvm(3).

The following programs still use kvm(3) to get some information related to
the network stack.

 - netstat(1)
 - vmstat(1)
 - fstat(1)

netstat(1) accesses ifnet_list, the head of a list of interface objects
(struct ifnet), and traverses each object through the ifnet#if_list member
variable. ifnet_list and ifnet#if_list are obsoleted by ifnet_pslist and
ifnet#if_pslist_entry respectively. netstat also accesses the IP address list
of an interface through ifnet#if_addrlist. struct ifaddr, struct in_ifaddr
and struct in6_ifaddr are accessed, so the following obsolete member variables
are stuck: ifaddr#ifa_list, in_ifaddr#ia_hash, in_ifaddr#ia_list,
in6_ifaddr#ia_next and in6_ifaddr#_ia6_multiaddrs. Note that netstat already
implements alternative methods to fetch the above information via sysctl(3).

vmstat(1) shows statistics of hash tables created by hashinit(9) in the kernel.
The statistics are retrieved via kvm(3). The global variables in_ifaddrhash
and in_ifaddrhashtbl, which are for a hash table of IPv4 addresses and are
obsoleted by in_ifaddrhash_pslist and in_ifaddrhashtbl_pslist, are kept for
this purpose. We should provide a means to fetch statistics of hash tables
via sysctl(3).

fstat(1) shows information about bpf instances. Each bpf instance (struct bpf)
is obtained via kvm(3). The bpf_d#_bd_next, bpf_d#_bd_filter and
bpf_d#_bd_list member variables are obsolete but remain. ifnet#if_xname is
also accessed via struct bpf_if, and the obsolete ifnet#if_list is required to
remain so that the offset of ifnet#if_xname does not change. The statistic
counters (bpf#bd_rcount, bpf#bd_dcount and bpf#bd_ccount) are also victims of
this restriction; for scalability the statistic counters should be per-CPU and
we should stop using atomic operations for them, but for now we have to keep
both the counters and the atomic operations.
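
The per-CPU counter idea mentioned above for the bpf statistics can be
sketched in userland C as follows. This is only an illustration under stated
assumptions, not the kernel's implementation: NCPU, CACHE_LINE,
struct pcpu_counter, pcpu_counter_inc() and pcpu_counter_sum() are
hypothetical names, and the CPU index is passed in explicitly. In the kernel
one would presumably build on percpu(9) instead.

```c
#include <assert.h>
#include <stdint.h>

#define NCPU		4	/* hypothetical number of CPUs */
#define CACHE_LINE	64	/* assumed cache line size in bytes */

/* One slot per CPU, padded so that two slots never share a cache line. */
struct pcpu_slot {
	uint64_t count;
	char pad[CACHE_LINE - sizeof(uint64_t)];
};

struct pcpu_counter {
	struct pcpu_slot slot[NCPU];
};

/*
 * Fast path: each CPU increments only its own slot, so no atomic
 * operation and no lock is needed as long as the caller cannot migrate
 * to another CPU in the middle of the update.
 */
static void
pcpu_counter_inc(struct pcpu_counter *c, unsigned int cpu)
{
	assert(cpu < NCPU);
	c->slot[cpu].count++;
}

/*
 * Slow path: a reader walks all slots and adds them up. The result is
 * only a snapshot, which is normally good enough for statistics.
 */
static uint64_t
pcpu_counter_sum(const struct pcpu_counter *c)
{
	uint64_t sum = 0;

	for (unsigned int i = 0; i < NCPU; i++)
		sum += c->slot[i].count;
	return sum;
}
```

The trade-off is that a reader pays O(NCPU) per read, while the hot path no
longer bounces a shared cache line between CPUs. A kvm(3) consumer that reads
a single counter field directly cannot follow such a layout, which is the
restriction described above.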

Scalability
-----------

 - Per-CPU rtcaches (used in, say, IP forwarding) don't scale with multiple
   flows per CPU
 - ipsec(4) doesn't scale with the number of SAs/SPs; the cost of a look-up
   is O(n)
 - opencrypto(9)'s crypto_newsession()/crypto_freesession() aren't scalable
   as they are serialized by one mutex

ALTQ
----

If ALTQ is enabled in the kernel, it enforces the use of just one Tx queue
(if_snd) for packet transmissions, resulting in serializing all Tx packet
processing on the queue. We should probably design and implement an
alternative queuing mechanism that deals with multi-core systems in the first
place, rather than making the existing ALTQ MP-safe, because that is just
annoying.

Using kernel modules
--------------------

Please note that if you enable NET_MPSAFE in your kernel and you use any
loadable kernel modules (including compat_xx modules or individual network
interface if_xxx device driver modules), you will need to build custom
modules. For each module you will need to add the following line to its
Makefile:

	CPPFLAGS+=	-DNET_MPSAFE

Failure to do this may result in unpredictable behavior.

IPv4 address initialization atomicity
-------------------------------------

An IPv4 address is referenced by several data structures: an associated
interface, its local route, a connected route (if necessary), the global list,
the global hash table, etc. These data structures are not updated atomically,
i.e., an IPv4 address can be in an inconsistent state in the kernel while it
is being initialized.

One known failure caused by this issue is that incoming packets destined for
an initializing address can loop in the network stack for a short period of
time.
The address initialization creates a local route first and then registers the
initializing address to the global hash table, which is used to decide whether
an incoming packet is destined for the host by checking whether the
destination of the packet is registered in the hash table. So, if the host
allows forwarding, an incoming packet can match the local route of an
initializing address at ip_output while it fails the to-self check described
above at ip_input. Because a matched local route points to a loopback
interface as its destination interface, the incoming packet is sent to the
network stack (ip_input) again, which results in looping. The loop stops once
the initializing address is registered to the hash table.

One solution to the issue is to reorder the address initialization steps:
first register an address to the hash table, then create its routes. Another
solution is to use the routing table for the to-self check instead of using the
global hash table, like IPv6.

if_flags
--------

To avoid data races on if_flags, it should be protected by a lock (currently
it's IFNET_LOCK). Thus, if_flags should not be accessed during packet
processing, to avoid performance degradation from lock contention.
Traditionally the IFF_RUNNING, IFF_UP and IFF_OACTIVE flags of if_flags are
checked during packet processing. If you make a driver MP-safe you must
remove such checks.

IFF_ALLMULTI can be set/unset via if_mcast_op. To protect updates of the flag,
we added IFNET_LOCK around if_mcast_op. However, that was not a good approach
because if_mcast_op is typically called in the middle of a call path, and
holding IFNET_LOCK in such places is problematic. In fact, a deadlock has been
observed. Probably we should remove IFNET_LOCK and manage IFF_ALLMULTI
somewhere other than if_flags, for example in ethercom or the driver itself
(or a common driver framework once it appears).
Such a change is feasible because
IFF_ALLMULTI is only set/unset by a driver and not accessed from any common
components such as network protocols.

Also, IFF_PROMISC is checked in ether_input and we should get rid of that
check somehow.
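
The idea of managing the all-multicast state in the driver itself rather than
in if_flags can be sketched in userland C as follows. Everything here is an
assumption made for illustration: struct xx_softc, its members and the xx_*()
functions are hypothetical and do not exist in the tree, and a pthread mutex
stands in for whatever lock a real driver would use for its own state.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-device state; not taken from any real driver. */
struct xx_softc {
	pthread_mutex_t sc_lock;	/* stands in for the driver's own lock */
	bool sc_allmulti;		/* kept here instead of IFF_ALLMULTI */
	unsigned int sc_nmulti;		/* multicast addresses requested */
	unsigned int sc_maxmulti;	/* capacity of the hardware filter */
};

static void
xx_init(struct xx_softc *sc, unsigned int maxmulti)
{
	pthread_mutex_init(&sc->sc_lock, NULL);
	sc->sc_allmulti = false;
	sc->sc_nmulti = 0;
	sc->sc_maxmulti = maxmulti;
}

/* Add an address; fall back to all-multicast once the filter overflows. */
static void
xx_addmulti(struct xx_softc *sc)
{
	pthread_mutex_lock(&sc->sc_lock);
	sc->sc_nmulti++;
	sc->sc_allmulti = (sc->sc_nmulti > sc->sc_maxmulti);
	pthread_mutex_unlock(&sc->sc_lock);
}

/* Remove an address; leave all-multicast mode when the filter fits again. */
static void
xx_delmulti(struct xx_softc *sc)
{
	pthread_mutex_lock(&sc->sc_lock);
	assert(sc->sc_nmulti > 0);	/* caller must not remove too many */
	sc->sc_nmulti--;
	sc->sc_allmulti = (sc->sc_nmulti > sc->sc_maxmulti);
	pthread_mutex_unlock(&sc->sc_lock);
}

static bool
xx_is_allmulti(struct xx_softc *sc)
{
	bool v;

	pthread_mutex_lock(&sc->sc_lock);
	v = sc->sc_allmulti;
	pthread_mutex_unlock(&sc->sc_lock);
	return v;
}
```

Because the state is only touched under the driver's own lock, the multicast
path in this sketch never needs IFNET_LOCK, so it can be called from the
middle of a call path without the lock ordering problem described above.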