TODO.smpnet revision 1.42
$NetBSD: TODO.smpnet,v 1.42 2021/08/03 01:44:10 msaitoh Exp $

MP-safe components
==================

They work without the big kernel lock (KERNEL_LOCK), i.e., with the NET_MPSAFE
kernel option.  Some components scale up and some don't.

 - Device drivers
   - aq(4)
   - bcmgenet(4)
   - iavf(4)
   - ixg(4)
   - ixl(4)
   - ixv(4)
   - mcx(4)
   - rge(4)
   - se(4)
   - sunxi_emac(4)
   - vioif(4)
   - vmx(4)
   - wm(4)
   - xennet(4)
   - usbnet(4) based adapters:
     - axe(4)
     - axen(4)
     - cdce(4)
     - cue(4)
     - kue(4)
     - mos(4)
     - mue(4)
     - smsc(4)
     - udav(4)
     - upl(4)
     - ure(4)
     - url(4)
     - urndis(4)
 - Layer 2
   - Ethernet (if_ethersubr.c)
   - bridge(4)
     - STP
   - Fast forward (ipflow)
 - Layer 3
   - All except for items in the below section
 - Interfaces
   - gif(4)
   - ipsecif(4)
   - l2tp(4)
   - pppoe(4)
     - if_spppsubr.c
   - tap(4)
   - tun(4)
   - vlan(4)
 - Packet filters
   - npf(7)
 - Others
   - bpf(4)
   - ipsec(4)
   - opencrypto(9)
   - pfil(9)

Non MP-safe components and kernel options
=========================================

The following components and options aren't MP-safe yet, i.e., they still
require the big kernel lock.  Some of them can be used safely even if
NET_MPSAFE is enabled because they're still protected by the big kernel lock.
The others aren't protected and are therefore unsafe, e.g., they may crash the
kernel.

Protected ones
--------------

 - Device drivers
   - Most drivers other than ones listed in the above section
 - Layer 4
   - DCCP
   - SCTP
   - TCP
   - UDP

Unprotected ones
----------------

 - Layer 2
   - ARCNET (if_arcsubr.c)
   - IEEE 1394 (if_ieee1394subr.c)
   - IEEE 802.11 (ieee80211(4))
 - Layer 3
   - IPSELSRC
   - MROUTING
   - PIM
   - MPLS (mpls(4))
   - IPv6 address selection policy
 - Interfaces
   - agr(4)
   - carp(4)
   - faith(4)
   - gre(4)
   - ppp(4)
   - sl(4)
   - stf(4)
   - if_srt
 - Packet filters
   - ipf(4)
   - pf(4)
 - Others
   - AppleTalk (sys/netatalk/)
   - Bluetooth (sys/netbt/)
   - altq(4)
   - kttcp(4)
   - NFS

Known issues
============

NOMPSAFE
--------

We use "NOMPSAFE" as a marker indicating that the code around it isn't MP-safe
yet.  It appears in comments and as part of function names, for example
m_get_rcvif_NOMPSAFE.  Let's keep using "NOMPSAFE" so that non-MP-safe code is
easy to find with grep.

bpf
---

MP-ification of bpf requires that all bpf_mtap* calls are made in normal LWP
context or softint context, i.e., not in hardware interrupt context.  For Tx,
all bpf_mtap calls satisfy the requirement.  For Rx, most bpf_mtap calls are
made in softint context.  Unfortunately some bpf_mtap calls on Rx are still
made in hardware interrupt context.

This is the list of the functions that have such bpf_mtap calls:

 - sca_frame_process() @ sys/dev/ic/hd64570.c

Ideally we should make these functions run in softint somehow, but we have
neither the actual devices nor the time (or interest/love) to work on the
task, so instead we provide a deferred bpf_mtap mechanism that forcibly runs
bpf_mtap in softint context.  It's a workaround; once the functions run in
softint, we should use the original bpf_mtap again.
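
The deferral idea can be sketched in userland C as follows.  This is not the
kernel's implementation: every name in it (demo_pkt, defer_queue,
demo_tap_deferred, demo_softint_drain) is hypothetical, and a plain
single-threaded ring stands in for whatever concurrency-safe queue and softint
handler the kernel uses.  The "interrupt" side only queues a packet; the tap
itself runs later from the "softint" side.

```c
#include <assert.h>
#include <stddef.h>

#define DEFER_QLEN 8		/* hypothetical queue depth (power of two) */

struct demo_pkt {
	int id;
};

struct defer_queue {
	struct demo_pkt *slot[DEFER_QLEN];
	size_t head;		/* next slot to drain */
	size_t tail;		/* next slot to fill */
	int tapped;		/* packets handed to the tap so far */
};

/* Stand-in for the real tap; assumed legal only outside hard interrupts. */
static void
demo_tap(struct defer_queue *q, const struct demo_pkt *p)
{
	(void)p;
	q->tapped++;
}

/* "Hardware interrupt" side: only queue the packet.  Returns 0 if full. */
int
demo_tap_deferred(struct defer_queue *q, struct demo_pkt *p)
{
	assert(q != NULL && p != NULL);
	if (q->tail - q->head == DEFER_QLEN)
		return 0;	/* queue full: skip the tap for this packet */
	q->slot[q->tail++ % DEFER_QLEN] = p;
	return 1;
}

/* "Softint" side: drain the queue and run the tap.  Returns packets tapped. */
int
demo_softint_drain(struct defer_queue *q)
{
	int n = 0;

	assert(q != NULL);
	while (q->head != q->tail) {
		demo_tap(q, q->slot[q->head++ % DEFER_QLEN]);
		n++;
	}
	return n;
}
```

In this sketch a full queue simply skips the tap for that packet; whether the
real mechanism drops or blocks in that case is not stated here.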

if_mcast_op() - SIOCADDMULTI/SIOCDELMULTI
-----------------------------------------

This helper function is called to add or remove multicast addresses for an
interface.  When called via ioctl it takes IFNET_LOCK(); when called
via sosetopt() it doesn't.

Various network drivers can't assert IFNET_LOCKED() in their if_ioctl
because of this.  Generally drivers still take care to call splnet() even
with NET_MPSAFE before calling ether_ioctl(), but they do not take
KERNEL_LOCK(), so this is actually unsafe.
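
The inconsistency can be illustrated with a small userland model.  All names
here are hypothetical, a boolean stands in for IFNET_LOCK, and which layer
actually takes the lock on the ioctl path is an assumption of the sketch.  The
point is only that the same driver handler is reached once with the lock held
and once without, so it cannot assert either state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_ifnet {
	bool ifnet_locked;	/* stand-in for "IFNET_LOCK is held" */
};

/* Driver-side handler: reports whether the lock was held on entry. */
static bool
demo_driver_ioctl(const struct demo_ifnet *ifp)
{
	return ifp->ifnet_locked;
}

/* Shared multicast helper, reached from both paths below. */
static bool
demo_if_mcast_op(struct demo_ifnet *ifp)
{
	return demo_driver_ioctl(ifp);
}

/* ioctl path: the lock is held around the helper. */
bool
demo_mcast_via_ioctl(struct demo_ifnet *ifp)
{
	bool held;

	assert(ifp != NULL);
	ifp->ifnet_locked = true;
	held = demo_if_mcast_op(ifp);
	ifp->ifnet_locked = false;
	return held;
}

/* sosetopt path: the helper is entered without the lock. */
bool
demo_mcast_via_sosetopt(struct demo_ifnet *ifp)
{
	assert(ifp != NULL);
	return demo_if_mcast_op(ifp);
}
```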

Lingering obsolete variables
----------------------------

Some obsolete global variables and member variables of structures remain to
avoid breaking old userland programs which directly access such variables via
kvm(3).

The following programs still use kvm(3) to get some information related to
the network stack.

 - netstat(1)
 - vmstat(1)
 - fstat(1)

netstat(1) accesses ifnet_list, the head of a list of interface objects
(struct ifnet), and traverses each object through the ifnet#if_list member
variable.  ifnet_list and ifnet#if_list are obsoleted by ifnet_pslist and
ifnet#if_pslist_entry respectively.  netstat also accesses the IP address list
of an interface through ifnet#if_addrlist.  struct ifaddr, struct in_ifaddr
and struct in6_ifaddr are accessed, so the following obsolete member variables
are stuck: ifaddr#ifa_list, in_ifaddr#ia_hash, in_ifaddr#ia_list,
in6_ifaddr#ia_next and in6_ifaddr#_ia6_multiaddrs.  Note that netstat already
implements alternative methods to fetch the above information via sysctl(3).

vmstat(1) shows statistics of hash tables created by hashinit(9) in the kernel.
The statistics are retrieved via kvm(3).  The global variables
in_ifaddrhash and in_ifaddrhashtbl, which are for a hash table of IPv4
addresses and are obsoleted by in_ifaddrhash_pslist and
in_ifaddrhashtbl_pslist, are kept for this purpose.  We should provide a means
to fetch statistics of hash tables via sysctl(3).

fstat(1) shows information about bpf instances.  Each bpf instance
(struct bpf_d) is obtained via kvm(3).  The bpf_d#_bd_next, bpf_d#_bd_filter
and bpf_d#_bd_list member variables are obsolete but remain.  ifnet#if_xname
is also accessed via struct bpf_if, and the obsolete ifnet#if_list has to
remain so that the offset of ifnet#if_xname doesn't change.  The statistic
counters (bpf_d#bd_rcount, bpf_d#bd_dcount and bpf_d#bd_ccount) are also
victims of this restriction: for scalability the counters should be per-CPU
and we should stop using atomic operations for them, but we have to keep both
the counters and the atomic operations.
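
The offset constraint can be made concrete with a small userland sketch.  The
two structures below are hypothetical miniatures, not the real struct ifnet,
and XNAME_SIZE is only a stand-in for the real name length.  A kvm(3) consumer
built against the old layout reads the name at the old offset, so dropping the
obsolete member would make it read the wrong bytes.

```c
#include <assert.h>
#include <stddef.h>

#define XNAME_SIZE 16	/* stand-in for the interface name length */

/* Hypothetical "old" layout: the obsolete list linkage is still present. */
struct demo_ifnet_old {
	void *if_softc;
	struct {
		void *next;
		void **prev;
	} if_list;			/* obsolete, kept only as padding */
	char if_xname[XNAME_SIZE];
};

/* Hypothetical layout after dropping the obsolete member. */
struct demo_ifnet_new {
	void *if_softc;
	char if_xname[XNAME_SIZE];
};

/* Removing the member moves if_xname, which breaks old binaries. */
static_assert(offsetof(struct demo_ifnet_old, if_xname) !=
    offsetof(struct demo_ifnet_new, if_xname),
    "dropping if_list changes the offset of if_xname");

/* Offset that a program compiled against the old header would use. */
size_t
demo_old_xname_offset(void)
{
	return offsetof(struct demo_ifnet_old, if_xname);
}

/* Offset in a kernel that has removed the obsolete member. */
size_t
demo_new_xname_offset(void)
{
	return offsetof(struct demo_ifnet_new, if_xname);
}
```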

Scalability
-----------

 - Per-CPU rtcaches (used in, say, IP forwarding) don't scale with multiple
   flows per CPU
 - ipsec(4) doesn't scale with the number of SA/SP; the cost of a look-up
   is O(n)
 - opencrypto(9)'s crypto_newsession()/crypto_freesession() don't scale
   as they are serialized by one mutex

ALTQ
----

If ALTQ is enabled in the kernel, it forces the use of just one Tx queue
(if_snd) for packet transmissions, which serializes all Tx packet processing on
that queue.  We should probably design and implement an alternative queuing
mechanism that deals with multi-core systems in the first place, rather than
making the existing ALTQ MP-safe, because that's just annoying.

Using kernel modules
--------------------

Please note that if you enable NET_MPSAFE in your kernel and you use any
loadable kernel modules (including compat_xx modules or individual network
interface if_xxx device driver modules), you will need to build custom
modules.  For each module you will need to add the following line to its
Makefile:

	CPPFLAGS+=	-DNET_MPSAFE

Failure to do this may result in unpredictable behavior.

IPv4 address initialization atomicity
-------------------------------------

An IPv4 address is referenced by several data structures: an associated
interface, its local route, a connected route (if necessary), the global list,
the global hash table, etc.  These data structures are not updated atomically,
i.e., the kernel can see inconsistent states of an IPv4 address while the
address is being initialized.

One known failure caused by this issue is that incoming packets destined for
an address being initialized can loop in the network stack for a short period
of time.  The address initialization creates a local route first and then
registers the address in the global hash table, which is used to decide
whether an incoming packet is destined for the host by checking whether the
destination of the packet is registered in the hash table.  So, if the host
allows forwarding, an incoming packet can match the local route of the
address at ip_output even though it fails the to-self check described above
at ip_input.  Because the matched local route points to a loopback interface
as its destination interface, the packet is sent to the network stack
(ip_input) again, which results in looping.  The loop stops once the address
is registered in the hash table.

One solution to the issue is to reorder the address initialization steps:
first register the address in the hash table, then create its routes.  Another
solution is to use the routing table for the to-self check instead of the
global hash table, as IPv6 does.
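
The window described above can be modelled in a few lines of userland C.  This
is only a sketch: the structure and function names are hypothetical, two
booleans stand in for the routing table and the global hash table, and
max_hops is an artificial bound added so that the model terminates.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot of what the stack can see for one address. */
struct demo_addr_state {
	bool local_route;	/* local route via loopback exists */
	bool in_hash;		/* address is registered in the global hash */
};

/* Intermediate state with the current order: route first, hash later. */
struct demo_addr_state
demo_midpoint_route_first(void)
{
	return (struct demo_addr_state){ .local_route = true, .in_hash = false };
}

/* Intermediate state with the proposed order: hash first, routes later. */
struct demo_addr_state
demo_midpoint_hash_first(void)
{
	return (struct demo_addr_state){ .local_route = false, .in_hash = true };
}

/*
 * Count how often one packet re-enters the input path while the state
 * stays as given.  max_hops bounds the loop, which would otherwise only
 * end when the state changes.
 */
int
demo_count_reentries(const struct demo_addr_state *s, bool forwarding,
    int max_hops)
{
	int reentries = 0;

	assert(max_hops >= 0);
	while (reentries < max_hops) {
		if (s->in_hash)
			break;		/* to-self check passes: deliver */
		if (!forwarding || !s->local_route)
			break;		/* dropped, or forwarded off-host */
		reentries++;		/* local route -> loopback -> input */
	}
	return reentries;
}
```

With forwarding enabled, the route-first midpoint keeps re-entering until the
bound is hit, while the hash-first midpoint delivers the packet on the first
pass.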

if_flags
--------

To avoid data races on if_flags it should be protected by a lock (currently
IFNET_LOCK).  Thus, if_flags should not be accessed during packet processing,
to avoid performance degradation from lock contention.  Traditionally the
IFF_RUNNING, IFF_UP and IFF_OACTIVE flags of if_flags are checked during packet
processing.  If you make a driver MP-safe you must remove such checks.
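
One possible replacement for such checks, sketched here in userland C with
pthreads, is a driver-private flag guarded by a lock that the Tx path takes
anyway.  This is an assumption about how a driver might do it, not a
description of any particular driver; demo_txq and its functions are
hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_txq {
	pthread_mutex_t lock;	/* per-queue lock, already taken on Tx */
	bool stopping;		/* driver-private, replaces an IFF_RUNNING test */
	int sent;		/* packets accepted for transmission */
};

void
demo_txq_init(struct demo_txq *q)
{
	assert(q != NULL);
	pthread_mutex_init(&q->lock, NULL);
	q->stopping = false;
	q->sent = 0;
}

/* Control path (e.g. interface stop/start): flip the private flag. */
void
demo_txq_set_stopping(struct demo_txq *q, bool stopping)
{
	pthread_mutex_lock(&q->lock);
	q->stopping = stopping;
	pthread_mutex_unlock(&q->lock);
}

/* Tx path: never touches if_flags.  Returns true if the packet was sent. */
bool
demo_txq_transmit(struct demo_txq *q)
{
	bool ok;

	pthread_mutex_lock(&q->lock);
	ok = !q->stopping;
	if (ok)
		q->sent++;
	pthread_mutex_unlock(&q->lock);
	return ok;
}
```

Because the flag lives under the queue's own lock, the packet path neither
takes IFNET_LOCK nor reads if_flags unlocked.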

IFF_ALLMULTI can be set/unset via if_mcast_op.  To protect updates of the flag,
we added IFNET_LOCK around if_mcast_op.  However that was not a good
approach because if_mcast_op is typically called in the middle of a call path
and holding IFNET_LOCK in such places is problematic.  In fact a deadlock has
been observed.  Probably we should remove IFNET_LOCK and manage IFF_ALLMULTI
somewhere other than if_flags, for example in ethercom or in the driver itself
(or in a common driver framework once it appears).  Such a change is feasible
because IFF_ALLMULTI is only set/unset by a driver and not accessed from any
common components such as network protocols.

Also IFF_PROMISC is checked in ether_input and we should get rid of it somehow.
272