TODO.smpnet revision 1.41
$NetBSD: TODO.smpnet,v 1.41 2021/08/02 23:49:26 mrg Exp $

MP-safe components
==================

These components work without the big kernel lock (KERNEL_LOCK), i.e., with
the NET_MPSAFE kernel option.  Some components scale up and some don't.

 - Device drivers
   - aq(4)
   - bcmgenet(4)
   - iavf(4)
   - ixg(4)
   - ixgbe(4)
   - ixl(4)
   - ixv(4)
   - mcx(4)
   - rge(4)
   - se(4)
   - sunxi_emac(4)
   - vioif(4)
   - vmx(4)
   - wm(4)
   - xennet(4)
   - usbnet(4) based adapters:
     - axe(4)
     - axen(4)
     - cdce(4)
     - cue(4)
     - kue(4)
     - mos(4)
     - mue(4)
     - smsc(4)
     - udav(4)
     - upl(4)
     - ure(4)
     - url(4)
     - urndis(4)
 - Layer 2
   - Ethernet (if_ethersubr.c)
   - bridge(4)
     - STP
   - Fast forward (ipflow)
 - Layer 3
   - All except for items in the below section
 - Interfaces
   - gif(4)
   - ipsecif(4)
   - l2tp(4)
   - pppoe(4)
     - if_spppsubr.c
   - tap(4)
   - tun(4)
   - vlan(4)
 - Packet filters
   - npf(7)
 - Others
   - bpf(4)
   - ipsec(4)
   - opencrypto(9)
   - pfil(9)

Non MP-safe components and kernel options
=========================================

The following components and options aren't MP-safe yet, i.e., they still
require the big kernel lock.  Some of them can be used safely even if
NET_MPSAFE is enabled because they're still protected by the big kernel lock.
The others aren't protected and so are unsafe, e.g., they may crash the kernel.

Protected ones
--------------

 - Device drivers
   - Most drivers other than ones listed in the above section
 - Layer 4
   - DCCP
   - SCTP
   - TCP
   - UDP

Unprotected ones
----------------

 - Layer 2
   - ARCNET (if_arcsubr.c)
   - IEEE 1394 (if_ieee1394subr.c)
   - IEEE 802.11 (ieee80211(4))
 - Layer 3
   - IPSELSRC
   - MROUTING
   - PIM
   - MPLS (mpls(4))
   - IPv6 address selection policy
 - Interfaces
   - agr(4)
   - carp(4)
   - faith(4)
   - gre(4)
   - ppp(4)
   - sl(4)
   - stf(4)
   - if_srt
 - Packet filters
   - ipf(4)
   - pf(4)
 - Others
   - AppleTalk (sys/netatalk/)
   - Bluetooth (sys/netbt/)
   - altq(4)
   - kttcp(4)
   - NFS

Known issues
============

NOMPSAFE
--------

We use "NOMPSAFE" as a mark that indicates that the code around it isn't MP-safe
yet.  We use it in comments and also as part of function names, for example
m_get_rcvif_NOMPSAFE.  Let's use "NOMPSAFE" to make it easy to find non-MP-safe
code by grep.

bpf
---

MP-ification of bpf requires that all bpf_mtap* calls be made in normal LWP
context or softint context, i.e., not in hardware interrupt context.  For Tx,
all bpf_mtap calls satisfy the requirement.  For Rx, most bpf_mtap calls are
made in softint context.  Unfortunately some bpf_mtap calls on Rx are still
made in hardware interrupt context.

This is the list of the functions that have such bpf_mtap calls:

 - sca_frame_process() @ sys/dev/ic/hd64570.c

Ideally we should make these functions run in softint somehow, but we have
neither the actual devices nor the time (or interest/love) to work on the
task, so instead we provide a deferred bpf_mtap mechanism that forcibly runs
bpf_mtap in softint context.  It's a workaround, and once the functions run in
softint we should use the original bpf_mtap again.

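As a rough illustration of the deferral idea described above (not the kernel's
actual implementation), the following user-space C sketch queues packets in
the "hardware interrupt" path and taps them later from a "softint" handler.
All names here (struct defer_queue, defer_tap(), defer_softint(), DEFER_QLEN)
are hypothetical, and a real implementation would also need proper
synchronization between the two contexts.

```c
#include <assert.h>
#include <stddef.h>

#define DEFER_QLEN	8	/* hypothetical queue depth */

struct pkt {
	int id;
};

/* Ring buffer of packets waiting to be tapped (illustrative only). */
struct defer_queue {
	struct pkt *slots[DEFER_QLEN];
	size_t head, tail, count;
	unsigned tapped;	/* packets handed to the tap */
	unsigned dropped;	/* packets dropped because the queue was full */
};

/* Stand-in for the real tap; it only counts the packet. */
static void
tap_packet(struct defer_queue *q, struct pkt *p)
{
	(void)p;
	q->tapped++;
}

/* Called from the "hardware interrupt" path: enqueue only, never tap. */
static int
defer_tap(struct defer_queue *q, struct pkt *p)
{
	assert(p != NULL);
	if (q->count == DEFER_QLEN) {
		q->dropped++;
		return -1;
	}
	q->slots[q->tail] = p;
	q->tail = (q->tail + 1) % DEFER_QLEN;
	q->count++;
	return 0;
}

/* Called from the "softint" path: drain the queue and tap each packet. */
static unsigned
defer_softint(struct defer_queue *q)
{
	unsigned n = 0;

	while (q->count > 0) {
		struct pkt *p = q->slots[q->head];

		q->head = (q->head + 1) % DEFER_QLEN;
		q->count--;
		tap_packet(q, p);
		n++;
	}
	return n;
}
```

In this sketch the interrupt path does nothing but enqueue, so the tap itself
always runs from the handler that stands in for softint context.
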
if_mcast_op() - SIOCADDMULTI/SIOCDELMULTI
-----------------------------------------

This helper function is called to add or remove multicast addresses for an
interface.  When called via ioctl it takes IFNET_LOCK(); when called via
sosetopt() it doesn't.

Various network drivers can't assert IFNET_LOCKED() in their if_ioctl
because of this.  Generally drivers still take care to call splnet() even
with NET_MPSAFE before calling ether_ioctl(), but they do not take
KERNEL_LOCK(), so this is actually unsafe.

Lingering obsolete variables
----------------------------

Some obsolete global variables and member variables of structures remain to
avoid breaking old userland programs that directly access such variables via
kvm(3).

The following programs still use kvm(3) to get some information related to
the network stack.

 - netstat(1)
 - vmstat(1)
 - fstat(1)

netstat(1) accesses ifnet_list, the head of a list of interface objects
(struct ifnet), and traverses each object through the ifnet#if_list member
variable.  ifnet_list and ifnet#if_list are obsoleted by ifnet_pslist and
ifnet#if_pslist_entry respectively.  netstat also accesses the IP address list
of an interface through ifnet#if_addrlist.  struct ifaddr, struct in_ifaddr
and struct in6_ifaddr are accessed, so the following obsolete member variables
are stuck: ifaddr#ifa_list, in_ifaddr#ia_hash, in_ifaddr#ia_list,
in6_ifaddr#ia_next and in6_ifaddr#_ia6_multiaddrs.  Note that netstat already
implements alternative methods to fetch the above information via sysctl(3).

vmstat(1) shows statistics of hash tables created by hashinit(9) in the kernel.
The statistics are retrieved via kvm(3).  The global variables
in_ifaddrhash and in_ifaddrhashtbl, which are for a hash table of IPv4
addresses and are obsoleted by in_ifaddrhash_pslist and in_ifaddrhashtbl_pslist,
are kept for this purpose.  We should provide a means to fetch statistics of
hash tables via sysctl(3).

fstat(1) shows information about bpf instances.  Each bpf instance (struct
bpf_d) is obtained via kvm(3).  The bpf_d#_bd_next, bpf_d#_bd_filter and
bpf_d#_bd_list member variables are obsolete but remain.  ifnet#if_xname is
also accessed via struct bpf_if, so the obsolete ifnet#if_list has to remain
in order not to change the offset of ifnet#if_xname.  The statistic counters
(bpf_d#bd_rcount, bpf_d#bd_dcount and bpf_d#bd_ccount) are also victims of
this restriction; for scalability the counters should be per-CPU and we should
stop using atomic operations for them, but we have to keep the counters and
the atomic operations.

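To make the per-CPU counter idea above concrete, here is a minimal user-space
sketch, assuming a fixed CPU count and a 64-byte cache line.  The names
(struct percpu_counter, pc_inc(), pc_sum(), NCPU, CACHE_LINE) are hypothetical
and not taken from the bpf code; in the kernel such counters would more likely
be built on percpu(9).

```c
#include <assert.h>
#include <stdint.h>

#define NCPU		4	/* hypothetical CPU count */
#define CACHE_LINE	64	/* assumed cache line size in bytes */

/* One slot per CPU, padded so that slots don't share a cache line. */
struct pc_slot {
	uint64_t value;
	char pad[CACHE_LINE - sizeof(uint64_t)];
};

struct percpu_counter {
	struct pc_slot slot[NCPU];
};

/* Each CPU touches only its own slot, so no atomic operation is needed. */
static void
pc_inc(struct percpu_counter *c, unsigned cpu)
{
	assert(cpu < NCPU);
	c->slot[cpu].value++;
}

/* A reader sums all slots; the result is only a snapshot. */
static uint64_t
pc_sum(const struct percpu_counter *c)
{
	uint64_t total = 0;

	for (unsigned i = 0; i < NCPU; i++)
		total += c->slot[i].value;
	return total;
}
```

Such a layout has no single counter word at a fixed offset, which is why it
can't be adopted while the counters are still read directly via kvm(3).
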
Scalability
-----------

 - Per-CPU rtcaches (used in, say, IP forwarding) don't scale with multiple
   flows per CPU
 - ipsec(4) doesn't scale with the number of SAs/SPs; the cost of a look-up
   is O(n)
 - opencrypto(9)'s crypto_newsession()/crypto_freesession() aren't scalable
   as they are serialized by one mutex

ALTQ
----

If ALTQ is enabled in the kernel, it forces the use of just one Tx queue
(if_snd) for packet transmission, which serializes all Tx packet processing on
that queue.  We should probably design and implement an alternative queuing
mechanism that deals with multi-core systems in the first place, rather than
making the existing ALTQ MP-safe, because that's just annoying.

Using kernel modules
--------------------

Please note that if you enable NET_MPSAFE in your kernel and you use any
loadable kernel modules (including compat_xx modules or individual network
interface if_xxx device driver modules), you will need to build custom
modules.  For each module you will need to add the following line to its
Makefile:

	CPPFLAGS+=	-DNET_MPSAFE

Failure to do this may result in unpredictable behavior.

IPv4 address initialization atomicity
-------------------------------------

An IPv4 address is referenced by several data structures: an associated
interface, its local route, a connected route (if necessary), the global list,
the global hash table, etc.  These data structures are not updated atomically,
i.e., the kernel can observe inconsistent states of an IPv4 address while the
address is being initialized.

One known failure caused by this issue is that incoming packets destined for
an initializing address can loop in the network stack for a short period of
time.  The address initialization creates a local route first and then
registers the initializing address in the global hash table, which is used to
decide whether an incoming packet is destined for the host by checking whether
the destination of the packet is registered in the hash table.  So, if the
host allows forwarding, an incoming packet can match the local route of an
initializing address at ip_output while it fails the to-self check described
above at ip_input.  Because a matched local route points to a loopback
interface as its destination interface, the incoming packet is sent to the
network stack (ip_input) again, which results in looping.  The loop stops once
the initializing address is registered in the hash table.

One solution to the issue is to reorder the address initialization steps:
first register an address in the hash table, then create its routes.  Another
solution is to use the routing table for the to-self check instead of the
global hash table, as IPv6 does.

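The effect of the ordering can be shown with a toy user-space model.  This is
only a sketch of the reasoning above, not kernel code: struct addr_state,
input_packet() and the two midway_*() helpers are hypothetical names, and the
model reduces ip_input/ip_output to two boolean checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum verdict { V_DELIVERED, V_LOOPED, V_NOT_OURS };

struct addr_state {
	bool in_hash;		/* address registered in the global hash table */
	bool has_local_route;	/* local route (via loopback) exists */
};

/* Model of what happens to a packet destined for the address. */
static enum verdict
input_packet(const struct addr_state *s, bool forwarding)
{
	assert(s != NULL);
	if (s->in_hash)
		return V_DELIVERED;	/* to-self check succeeds */
	if (forwarding && s->has_local_route)
		return V_LOOPED;	/* local route sends it back to input */
	return V_NOT_OURS;		/* handled as a packet for another host */
}

/* Current order: a packet arrives after the route is created. */
static enum verdict
midway_route_first(bool forwarding)
{
	struct addr_state s = { false, false };

	s.has_local_route = true;
	return input_packet(&s, forwarding);
}

/* Reordered: a packet arrives after the hash table registration. */
static enum verdict
midway_hash_first(bool forwarding)
{
	struct addr_state s = { false, false };

	s.in_hash = true;
	return input_packet(&s, forwarding);
}
```

With forwarding enabled, midway_route_first() yields V_LOOPED while
midway_hash_first() yields V_DELIVERED, which is the window the reordering
is meant to close.
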
if_flags
--------

To avoid data races on if_flags it should be protected by a lock (currently
IFNET_LOCK).  Thus, if_flags should not be accessed during packet processing,
to avoid performance degradation caused by lock contention.  Traditionally the
IFF_RUNNING, IFF_UP and IFF_OACTIVE flags of if_flags are checked during packet
processing.  If you make a driver MP-safe you must remove such checks.

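One possible way to drop such checks, shown here only as a user-space sketch
under assumptions, is for a driver to keep its own "running" state outside
if_flags and read it without IFNET_LOCK in the packet path.  The structure and
function names (struct drv_softc, drv_attach(), drv_init(), drv_stop(),
drv_transmit()) are hypothetical, and C11 atomics stand in for whatever
primitive a real driver would use.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct drv_softc {
	atomic_bool sc_running;	/* driver-private replacement for IFF_RUNNING */
	unsigned sc_tx_sent;
	unsigned sc_tx_dropped;
};

static void
drv_attach(struct drv_softc *sc)
{
	assert(sc != NULL);
	atomic_init(&sc->sc_running, false);
	sc->sc_tx_sent = 0;
	sc->sc_tx_dropped = 0;
}

/* Configuration path: would run with the interface lock held. */
static void
drv_init(struct drv_softc *sc)
{
	atomic_store_explicit(&sc->sc_running, true, memory_order_release);
}

static void
drv_stop(struct drv_softc *sc)
{
	atomic_store_explicit(&sc->sc_running, false, memory_order_release);
}

/* Packet path: reads only the private flag, never if_flags. */
static bool
drv_transmit(struct drv_softc *sc)
{
	if (!atomic_load_explicit(&sc->sc_running, memory_order_acquire)) {
		sc->sc_tx_dropped++;
		return false;
	}
	sc->sc_tx_sent++;
	return true;
}
```

The packet path in this sketch never takes the interface lock; only the
configuration path changes the flag.
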
IFF_ALLMULTI can be set/unset via if_mcast_op.  To protect updates of the flag,
we added IFNET_LOCK around if_mcast_op.  However, that was not a good
approach because if_mcast_op is typically called in the middle of a call path,
and holding IFNET_LOCK in such places is problematic.  In fact, a deadlock has
been observed.  Probably we should remove IFNET_LOCK and manage IFF_ALLMULTI
somewhere other than if_flags, for example in ethercom or the driver itself
(or a common driver framework once it appears).  Such a change is feasible
because IFF_ALLMULTI is only set/unset by a driver and not accessed from any
common components such as network protocols.

Also, IFF_PROMISC is checked in ether_input and we should get rid of that check
somehow.
273