TODO.smpnet revision 1.28
$NetBSD: TODO.smpnet,v 1.28 2019/03/26 05:17:17 ozaki-r Exp $

MP-safe components
==================

They work without the big kernel lock (KERNEL_LOCK), i.e., with the NET_MPSAFE
kernel option. Some components scale up and some don't.

 - Device drivers
   - vioif(4)
   - vmx(4)
   - wm(4)
   - ixg(4)
   - ixv(4)
 - Layer 2
   - Ethernet (if_ethersubr.c)
   - bridge(4)
   - STP
   - Fast forward (ipflow)
 - Layer 3
   - All except for items in the below section
 - Interfaces
   - gif(4)
   - ipsecif(4)
   - l2tp(4)
   - pppoe(4)
   - if_spppsubr.c
   - tun(4)
   - vlan(4)
 - Packet filters
   - npf(7)
 - Others
   - bpf(4)
   - ipsec(4)
   - opencrypto(9)
   - pfil(9)

Non MP-safe components and kernel options
=========================================

These components and options aren't MP-safe yet, i.e., they require the big
kernel lock. Some of them can be used safely even if NET_MPSAFE is enabled
because they're still protected by the big kernel lock. The others aren't
protected and so are unsafe, e.g., they may crash the kernel.

Protected ones
--------------

 - Device drivers
   - Most drivers other than ones listed in the above section
 - Layer 4
   - DCCP
   - SCTP
   - TCP
   - UDP

Unprotected ones
----------------

 - Layer 2
   - ARCNET (if_arcsubr.c)
   - BRIDGE_IPF
   - FDDI (if_fddisubr.c)
   - HIPPI (if_hippisubr.c)
   - IEEE 1394 (if_ieee1394subr.c)
   - IEEE 802.11 (ieee80211(4))
   - Token ring (if_tokensubr.c)
 - Layer 3
   - IPSELSRC
   - MROUTING
   - PIM
   - MPLS (mpls(4))
   - IPv6 address selection policy
 - Interfaces
   - agr(4)
   - carp(4)
   - faith(4)
   - gre(4)
   - ppp(4)
   - sl(4)
   - stf(4)
   - strip(4)
   - if_srt
   - tap(4)
 - Packet filters
   - ipf(4)
   - pf(4)
 - Others
   - AppleTalk (sys/netatalk/)
   - Bluetooth (sys/netbt/)
   - altq(4)
   - CIFS (sys/netsmb/)
   - kttcp(4)
   - NFS

Known issues
============

NOMPSAFE
--------

We use "NOMPSAFE" as a mark that indicates that the code around it isn't MP-safe
yet. We use it in comments and also as part of function names, for example
m_get_rcvif_NOMPSAFE. Let's use "NOMPSAFE" to make it easy to find non-MP-safe
code by grep.

bpf
---

MP-ification of bpf requires that all bpf_mtap* calls are made in normal LWP
context or softint context, i.e., not in hardware interrupt context. For Tx,
all bpf_mtap calls satisfy the requirement. For Rx, most bpf_mtap calls are
made in softint. Unfortunately some bpf_mtap calls on Rx are still made in
hardware interrupt context.

This is the list of the functions that have such bpf_mtap calls:

 - sca_frame_process() @ sys/dev/ic/hd64570.c

Ideally we should make the functions run in softint somehow, but we don't have
actual devices or the time (or interest/love) to work on the task, so instead we
provide a deferred bpf_mtap mechanism that forcibly runs bpf_mtap in softint
context.
It's a workaround and once the functions run in softint, we should use
the original bpf_mtap again.

Lingering obsolete variables
----------------------------

Some obsolete global variables and member variables of structures remain to
avoid breaking old userland programs which directly access such variables via
kvm(3).

The following programs still use kvm(3) to get some information related to
the network stack.

 - netstat(1)
 - vmstat(1)
 - fstat(1)

netstat(1) accesses ifnet_list, the head of a list of interface objects
(struct ifnet), and traverses each object through the ifnet#if_list member
variable. ifnet_list and ifnet#if_list are obsoleted by ifnet_pslist and
ifnet#if_pslist_entry respectively. netstat also accesses the IP address list
of an interface through ifnet#if_addrlist. struct ifaddr, struct in_ifaddr
and struct in6_ifaddr are accessed and the following obsolete member variables
are stuck: ifaddr#ifa_list, in_ifaddr#ia_hash, in_ifaddr#ia_list,
in6_ifaddr#ia_next and in6_ifaddr#_ia6_multiaddrs. Note that netstat already
implements alternative methods to fetch the above information via sysctl(3).

vmstat(1) shows statistics of hash tables created by hashinit(9) in the kernel.
The statistics are retrieved via kvm(3). The global variables
in_ifaddrhash and in_ifaddrhashtbl, which are for a hash table of IPv4
addresses and obsoleted by in_ifaddrhash_pslist and in_ifaddrhashtbl_pslist,
are kept for this purpose. We should provide a means to fetch statistics of
hash tables via sysctl(3).

fstat(1) shows information about bpf instances. Each bpf instance (struct bpf)
is obtained via kvm(3). The bpf_d#_bd_next, bpf_d#_bd_filter and bpf_d#_bd_list
member variables are obsolete but remain. ifnet#if_xname is also accessed
via struct bpf_if, so the obsolete ifnet#if_list has to remain in order not to
change the offset of ifnet#if_xname.
The statistic counters (bpf#bd_rcount, bpf#bd_dcount and bpf#bd_ccount) are
also victims of this restriction; for scalability the statistic counters should
be per-CPU and we should stop using atomic operations for them. However, we
have to keep the counters and the atomic operations.

Scalability
-----------

 - Per-CPU rtcaches (used in, say, IP forwarding) aren't scalable on multiple
   flows per CPU
 - ipsec(4) isn't scalable on the number of SA/SP; the cost of a look-up
   is O(n)
 - opencrypto(9)'s crypto_newsession()/crypto_freesession() aren't scalable
   as they are serialized by one mutex

ec_multi* of ethercom
---------------------

ec_multiaddrs and ec_multicnt of struct ethercom and items listed in
ec_multiaddrs must be protected by ec_lock. The core of the ethernet subsystem
is already MP-safe; however, device drivers that use the data should also be
fixed. A typical change is to protect manipulations of the data via ETHER_*
macros such as ETHER_FIRST_MULTI with ETHER_LOCK and ETHER_UNLOCK.

ALTQ
----

If ALTQ is enabled in the kernel, it enforces the use of just one Tx queue
(if_snd) for packet transmissions, which serializes all Tx packet processing on
the queue. We should probably design and implement an alternative queuing
mechanism that deals with multi-core systems in the first place, rather than
making the existing ALTQ MP-safe, because that's just annoying.

Using kernel modules
--------------------

Please note that if you enable NET_MPSAFE in your kernel, and you use any
loadable kernel modules (including compat_xx modules or individual network
interface if_xxx device driver modules), you will need to build custom
modules. For each module you will need to add the following line to its
Makefile:

	CPPFLAGS+=	-DNET_MPSAFE

Failure to do this may result in unpredictable behavior.
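
A minimal userland sketch of why the module and the kernel have to agree on
the flag, assuming a module guards its locking with a macro that depends on
NET_MPSAFE. All names here (SKETCH_LOCK_UNLESS_MPSAFE, kernel_lock_depth,
sketch_module_input) are hypothetical and are not kernel APIs; the counter
merely stands in for the big kernel lock.

```c
#include <assert.h>

/* Stand-in for the big kernel lock; counts nested acquisitions. */
static int kernel_lock_depth;

#ifdef NET_MPSAFE
/* Built with -DNET_MPSAFE: the code path runs without the big lock. */
#define SKETCH_LOCK_UNLESS_MPSAFE()	do { } while (0)
#define SKETCH_UNLOCK_UNLESS_MPSAFE()	do { } while (0)
#else
/* Built without the flag: the code path takes the big lock itself. */
#define SKETCH_LOCK_UNLESS_MPSAFE()	do { kernel_lock_depth++; } while (0)
#define SKETCH_UNLOCK_UNLESS_MPSAFE()	do { kernel_lock_depth--; } while (0)
#endif

/*
 * Hypothetical module entry point.  Returns the lock depth observed inside
 * the critical section: 1 when built without NET_MPSAFE, 0 when built with it.
 */
static int
sketch_module_input(void)
{
	int depth;

	SKETCH_LOCK_UNLESS_MPSAFE();
	depth = kernel_lock_depth;
	SKETCH_UNLOCK_UNLESS_MPSAFE();
	assert(kernel_lock_depth == 0);	/* acquisitions must be balanced */
	return depth;
}
```

Compiled as is, sketch_module_input() observes a depth of 1; compiled with
-DNET_MPSAFE it observes 0. In this model a module built one way and a kernel
built the other way would disagree about whether the lock is held in the same
code path.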

IPv4 address initialization atomicity
-------------------------------------

An IPv4 address is referenced by several data structures: an associated
interface, its local route, a connected route (if necessary), the global list,
the global hash table, etc. These data structures are not updated atomically,
i.e., there can be inconsistent states on an IPv4 address in the kernel during
the initialization of an IPv4 address.

One known failure caused by the issue is that incoming packets destined to an
initializing address can loop in the network stack for a short period of time.
The address initialization creates a local route first and then registers the
initializing address to the global hash table, which is used to decide if an
incoming packet is destined to the host by checking whether the destination of
the packet is registered to the hash table. So, if the host allows forwarding,
an incoming packet can match a local route of an initializing address at
ip_output while it fails the to-self check described above at ip_input. Because
a matched local route points to a loopback interface as its destination
interface, the incoming packet is sent to the network stack (ip_input) again,
which results in looping. The loop stops once the initializing address is
registered to the hash table.

One solution to the issue is to reorder the address initialization steps:
first register an address to the hash table, then create its routes. Another
solution is to use the routing table for the to-self check instead of using the
global hash table, like IPv6.
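
The ordering problem above can be modelled in a few lines of userland C. This
is only a sketch under simplifying assumptions: an address is reduced to two
flags, forwarding is assumed to be enabled, and every name (addr_state,
input_verdict, order_can_loop, ...) is hypothetical rather than taken from the
kernel sources.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one IPv4 address during initialization. */
struct addr_state {
	bool in_hash;		/* registered in the global hash table */
	bool local_route;	/* local route via the loopback interface exists */
};

enum verdict {
	DELIVERED,	/* to-self check at input succeeds */
	FORWARDED,	/* no local route: packet leaves via another route */
	LOOPED		/* local route sends the packet back to input */
};

/* One pass through the simplified input path with forwarding enabled. */
static enum verdict
input_verdict(const struct addr_state *a)
{
	assert(a != NULL);
	if (a->in_hash)
		return DELIVERED;	/* hash table says "this is us" */
	if (a->local_route)
		return LOOPED;		/* forwarded onto loopback, re-enters input */
	return FORWARDED;
}

enum init_order { ROUTE_FIRST, HASH_FIRST };

/* State after only the first initialization step has been applied. */
static struct addr_state
after_first_step(enum init_order order)
{
	struct addr_state a = { false, false };

	if (order == ROUTE_FIRST)
		a.local_route = true;
	else
		a.in_hash = true;
	return a;
}

/* Does the intermediate state of this ordering expose the loop? */
static bool
order_can_loop(enum init_order order)
{
	struct addr_state a = after_first_step(order);

	return input_verdict(&a) == LOOPED;
}
```

In this model order_can_loop(ROUTE_FIRST) is true and
order_can_loop(HASH_FIRST) is false, which matches the first proposed solution:
registering the address before creating its routes leaves no intermediate state
in which a packet matches the local route but fails the to-self check.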