TODO.smpnet revision 1.40
$NetBSD: TODO.smpnet,v 1.40 2021/01/20 10:26:43 nia Exp $

MP-safe components
==================

These components work without the big kernel lock (KERNEL_LOCK), i.e., with
the NET_MPSAFE kernel option.  Some components scale up and some don't.

 - Device drivers
   - aq(4)
   - vioif(4)
   - vmx(4)
   - wm(4)
   - ixg(4)
   - ixl(4)
   - ixv(4)
 - Layer 2
   - Ethernet (if_ethersubr.c)
   - bridge(4)
     - STP
   - Fast forward (ipflow)
 - Layer 3
   - All except for the items listed in the section below
 - Interfaces
   - gif(4)
   - ipsecif(4)
   - l2tp(4)
   - pppoe(4)
     - if_spppsubr.c
   - tap(4)
   - tun(4)
   - vlan(4)
 - Packet filters
   - npf(7)
 - Others
   - bpf(4)
   - ipsec(4)
   - opencrypto(9)
   - pfil(9)

Non-MP-safe components and kernel options
=========================================

These components and options aren't MP-safe yet, i.e., they still require the
big kernel lock.  Some of them can be used safely even if NET_MPSAFE is
enabled because they're still protected by the big kernel lock.  The others
aren't protected and so are unsafe, e.g., they may crash the kernel.

Protected ones
--------------

 - Device drivers
   - Most drivers other than the ones listed in the section above
 - Layer 4
   - DCCP
   - SCTP
   - TCP
   - UDP

Unprotected ones
----------------

 - Layer 2
   - ARCNET (if_arcsubr.c)
   - IEEE 1394 (if_ieee1394subr.c)
   - IEEE 802.11 (ieee80211(4))
 - Layer 3
   - IPSELSRC
   - MROUTING
   - PIM
   - MPLS (mpls(4))
   - IPv6 address selection policy
 - Interfaces
   - agr(4)
   - carp(4)
   - faith(4)
   - gre(4)
   - ppp(4)
   - sl(4)
   - stf(4)
   - if_srt
 - Packet filters
   - ipf(4)
   - pf(4)
 - Others
   - AppleTalk (sys/netatalk/)
   - Bluetooth (sys/netbt/)
   - altq(4)
   - kttcp(4)
   - NFS

Known issues
============

NOMPSAFE
--------

We use "NOMPSAFE" as a marker indicating that the code around it isn't MP-safe
yet.  We use it in comments and also as part of function names, for example
m_get_rcvif_NOMPSAFE.  Let's use "NOMPSAFE" to make it easy to find non-MP-safe
code with grep.

bpf
---

MP-ification of bpf requires that all bpf_mtap* calls be made in normal LWP
context or softint context, i.e., not in hardware interrupt context.  For Tx,
all bpf_mtap calls satisfy the requirement.  For Rx, most bpf_mtap calls are
made in softint context.  Unfortunately some bpf_mtap calls on Rx are still
made in hardware interrupt context.

The following functions contain such bpf_mtap calls:

 - sca_frame_process() @ sys/dev/ic/hd64570.c

Ideally we should make these functions run in softint somehow, but we have
neither the actual devices nor the time (or interest/love) to work on the task,
so instead we provide a deferred bpf_mtap mechanism that forcibly runs bpf_mtap
in softint context.  It's a workaround; once the functions run in softint, we
should use the original bpf_mtap again.
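The deferral idea can be sketched in user-space C.  This is only an
illustration under stated assumptions, not the kernel's actual mechanism: the
names (struct defer_queue, hardintr_defer_tap, softint_run_taps, DEFER_QLEN)
are hypothetical, and the plain single-threaded ring stands in for whatever
interrupt-safe queue a real implementation would need.  The interrupt path
only enqueues the packet; the tap runs later, when the softint drains the
queue.

```c
#include <assert.h>
#include <stddef.h>

#define DEFER_QLEN 8	/* hypothetical queue depth */

struct pkt {
	int id;
};

struct defer_queue {
	struct pkt *slots[DEFER_QLEN];
	size_t head;	/* next slot to drain */
	size_t count;	/* packets waiting for the tap */
	int tapped;	/* packets delivered to the tap so far */
	int dropped;	/* packets not tapped because the queue was full */
};

/* Stand-in for the real tap; must only be reached from "softint" context. */
void
tap_packet(struct defer_queue *q, const struct pkt *p)
{
	assert(p != NULL);
	q->tapped++;
}

/* Simulated hardware interrupt path: only enqueue, never tap. */
int
hardintr_defer_tap(struct defer_queue *q, struct pkt *p)
{
	if (q->count == DEFER_QLEN) {
		q->dropped++;
		return -1;
	}
	q->slots[(q->head + q->count) % DEFER_QLEN] = p;
	q->count++;
	return 0;
}

/* Simulated softint handler: drain the queue and run the tap. */
int
softint_run_taps(struct defer_queue *q)
{
	int n = 0;

	while (q->count > 0) {
		tap_packet(q, q->slots[q->head]);
		q->head = (q->head + 1) % DEFER_QLEN;
		q->count--;
		n++;
	}
	return n;
}
```

In this sketch a full queue means the packet is simply not tapped; whether a
real mechanism should drop the tap or block is a design choice the sketch does
not try to settle.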

if_mcast_op() - SIOCADDMULTI/SIOCDELMULTI
-----------------------------------------

This helper function is called to add or remove multicast addresses for an
interface.  When called via ioctl it takes IFNET_LOCK(); when called via
sosetopt() it doesn't.

Various network drivers can't assert IFNET_LOCKED() in their if_ioctl
because of this.  Generally drivers still take care to call splnet() even
with NET_MPSAFE before calling ether_ioctl(), but they do not take
KERNEL_LOCK(), so this is actually unsafe.

Lingering obsolete variables
----------------------------

Some obsolete global variables and member variables of structures remain to
avoid breaking old userland programs which directly access such variables via
kvm(3).

The following programs still use kvm(3) to get some information related to
the network stack.

 - netstat(1)
 - vmstat(1)
 - fstat(1)

netstat(1) accesses ifnet_list, the head of a list of interface objects
(struct ifnet), and traverses each object through the ifnet#if_list member
variable.  ifnet_list and ifnet#if_list are obsoleted by ifnet_pslist and
ifnet#if_pslist_entry respectively.  netstat also accesses the IP address list
of an interface through ifnet#if_addrlist.  struct ifaddr, struct in_ifaddr
and struct in6_ifaddr are accessed, so the following obsolete member variables
are stuck: ifaddr#ifa_list, in_ifaddr#ia_hash, in_ifaddr#ia_list,
in6_ifaddr#ia_next and in6_ifaddr#_ia6_multiaddrs.  Note that netstat already
implements alternative methods to fetch the above information via sysctl(3).

vmstat(1) shows statistics of hash tables created by hashinit(9) in the kernel.
The statistics are retrieved via kvm(3).  The global variables in_ifaddrhash
and in_ifaddrhashtbl, which are for a hash table of IPv4 addresses and are
obsoleted by in_ifaddrhash_pslist and in_ifaddrhashtbl_pslist, are kept for
this purpose.  We should provide a means to fetch statistics of hash tables via
sysctl(3).

fstat(1) shows information about bpf instances.  Each bpf instance (struct
bpf_d) is obtained via kvm(3).  The bpf_d#_bd_next, bpf_d#_bd_filter and
bpf_d#_bd_list member variables are obsolete but remain.  ifnet#if_xname is
also accessed via struct bpf_if, and the obsolete ifnet#if_list must remain so
that the offset of ifnet#if_xname doesn't change.  The statistic counters
(bpf_d#bd_rcount, bpf_d#bd_dcount and bpf_d#bd_ccount) are also victims of this
restriction; for scalability the counters should be per-CPU and we should stop
using atomic operations for them, but we have to keep both the counters and the
atomic operations as they are.
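To illustrate why per-CPU counters would scale better than shared atomic
counters, here is a small user-space sketch.  It is not NetBSD code: struct
percpu_counter, NCPU and the 64-byte cache line size are assumptions made for
the example, and it assumes each slot is only ever updated by its own CPU
without being preempted mid-update.  In the kernel, percpu(9) is the facility
that provides this kind of storage.

```c
#include <assert.h>
#include <stdint.h>

#define NCPU		4	/* hypothetical CPU count */
#define CACHE_LINE	64	/* assumed cache line size in bytes */

/* One slot per CPU, padded so that slots never share a cache line. */
struct percpu_counter {
	struct {
		uint64_t value;
		char pad[CACHE_LINE - sizeof(uint64_t)];
	} slot[NCPU];
};

/* Fast path: a CPU bumps only its own slot, so no atomic op is used. */
void
percpu_counter_inc(struct percpu_counter *c, unsigned int cpu)
{
	assert(cpu < NCPU);
	c->slot[cpu].value++;
}

/* Slow path: a reader sums all slots to obtain the total. */
uint64_t
percpu_counter_sum(const struct percpu_counter *c)
{
	uint64_t total = 0;

	for (unsigned int i = 0; i < NCPU; i++)
		total += c->slot[i].value;
	return total;
}
```

The cost moves from every packet (a contended atomic update) to the rare
reader (a loop over all CPUs).  A userland tool reading a single field through
kvm(3) cannot do that summation, which is the restriction described above.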

Scalability
-----------

 - Per-CPU rtcaches (used in say IP forwarding) aren't scalable on multiple
   flows per CPU
 - ipsec(4) isn't scalable on the number of SA/SP; the cost of a look-up
   is O(n)
 - opencrypto(9)'s crypto_newsession()/crypto_freesession() aren't scalable
   as they are serialized by one mutex

ALTQ
----

If ALTQ is enabled in the kernel, it forces the use of just one Tx queue
(if_snd) for packet transmissions, which serializes all Tx packet processing on
that queue.  We should probably design and implement an alternative queuing
mechanism that is built for multi-core systems in the first place, rather than
making the existing ALTQ MP-safe, because that's just annoying.

Using kernel modules
--------------------

Please note that if you enable NET_MPSAFE in your kernel and you use any
loadable kernel modules (including compat_xx modules or individual network
interface if_xxx device driver modules), you will need to build custom
modules.  For each module you will need to add the following line to its
Makefile:

	CPPFLAGS+=	-DNET_MPSAFE

Failure to do this may result in unpredictable behavior.

IPv4 address initialization atomicity
-------------------------------------

An IPv4 address is referenced by several data structures: an associated
interface, its local route, a connected route (if necessary), the global list,
the global hash table, etc.  These data structures are not updated atomically,
i.e., the kernel can see inconsistent states of an IPv4 address while the
address is being initialized.

One known failure caused by this issue is that incoming packets destined for an
address being initialized can loop in the network stack for a short period of
time.  The address initialization creates a local route first and then
registers the address in the global hash table, which is used to decide whether
an incoming packet is destined for the host by checking whether the packet's
destination is registered in the hash table.  So, if the host allows
forwarding, an incoming packet can match the local route of an address being
initialized at ip_output while it fails the to-self check described above at
ip_input.  Because the matched local route points to a loopback interface as
its destination interface, the packet is sent to the network stack (ip_input)
again, which results in looping.  The loop stops once the address is registered
in the hash table.

One solution to the issue is to reorder the address initialization steps: first
register the address in the hash table, then create its routes.  Another
solution is to use the routing table for the to-self check instead of the
global hash table, as IPv6 does.
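The ordering problem and the first proposed fix can be modelled with a toy
simulation.  Everything in it is hypothetical (struct addr_state, rx_packet,
init_route_first, init_hash_first); it only mirrors the description above,
where the to-self check consults the hash table and forwarding consults the
local route.  It is a sketch of the reasoning, not of the kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct addr_state {
	bool local_route;	/* local route to the address exists */
	bool in_hash;		/* address is in the to-self hash table */
};

enum rx_result { RX_DELIVERED, RX_NOT_OURS, RX_LOOPING };

/*
 * Model one incoming packet addressed to the address.  max_passes bounds
 * the simulation; the loop described in the text ends only when the
 * address is finally registered in the hash table.
 */
enum rx_result
rx_packet(const struct addr_state *a, bool forwarding, int max_passes)
{
	assert(a != NULL);
	for (int pass = 0; pass < max_passes; pass++) {
		if (a->in_hash)
			return RX_DELIVERED;	/* to-self check passes */
		if (!forwarding || !a->local_route)
			return RX_NOT_OURS;	/* handled as a foreign packet */
		/* Forwarded via the local route to loopback: input again. */
	}
	return RX_LOOPING;
}

/* Current order: step 1 creates the route, step 2 fills the hash table. */
void
init_route_first(struct addr_state *a, int step)
{
	a->local_route = step >= 1;
	a->in_hash = step >= 2;
}

/* Reordered: step 1 fills the hash table, step 2 creates the route. */
void
init_hash_first(struct addr_state *a, int step)
{
	a->in_hash = step >= 1;
	a->local_route = step >= 2;
}
```

With init_route_first() stopped at step 1 and forwarding enabled, rx_packet()
reports RX_LOOPING; with init_hash_first() there is no intermediate step at
which a route exists without the hash entry, so the looping state cannot be
reached.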

if_flags
--------

To avoid data races on if_flags it should be protected by a lock (currently
IFNET_LOCK).  Thus, if_flags should not be accessed during packet processing,
to avoid performance degradation from lock contention.  Traditionally the
IFF_RUNNING, IFF_UP and IFF_OACTIVE flags of if_flags are checked during packet
processing.  If you make a driver MP-safe you must remove such checks.
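One way to drop such checks is for the driver to keep its own "stopping"
state and consult that on the packet path instead of if_flags.  The sketch
below is a user-space illustration using C11 atomics; struct drv_softc,
sc_stopping and the drv_* functions are hypothetical names, not taken from any
particular NetBSD driver, and a real driver would more likely read such a flag
under its own per-queue lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical driver state. */
struct drv_softc {
	atomic_bool sc_stopping;	/* set by the control path on stop */
	unsigned long sc_tx_packets;	/* assumed protected by a Tx lock */
};

/* Control path (would hold IFNET_LOCK): the device starts running. */
void
drv_init(struct drv_softc *sc)
{
	assert(sc != NULL);
	atomic_store_explicit(&sc->sc_stopping, false, memory_order_release);
}

/* Control path (would hold IFNET_LOCK): the device is being stopped. */
void
drv_stop(struct drv_softc *sc)
{
	assert(sc != NULL);
	atomic_store_explicit(&sc->sc_stopping, true, memory_order_release);
}

/*
 * Packet path: decide from driver-private state, never from if_flags.
 * Returns true if the packet was accepted for transmission.
 */
bool
drv_transmit(struct drv_softc *sc)
{
	assert(sc != NULL);
	if (atomic_load_explicit(&sc->sc_stopping, memory_order_acquire))
		return false;
	sc->sc_tx_packets++;
	return true;
}
```

The control path still updates if_flags under IFNET_LOCK as before; the point
is only that the per-packet path no longer needs to read it.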

IFF_ALLMULTI can be set/unset via if_mcast_op.  To protect updates of the flag,
we added IFNET_LOCK around if_mcast_op.  However that was not a good approach
because if_mcast_op is typically called in the middle of a call path and
holding IFNET_LOCK in such places is problematic.  In fact a deadlock has been
observed.  Probably we should remove IFNET_LOCK and manage IFF_ALLMULTI
somewhere other than if_flags, for example in ethercom or the driver itself (or
a common driver framework once it appears).  Such a change is feasible because
IFF_ALLMULTI is only set/unset by a driver and not accessed from any common
components such as network protocols.

Also IFF_PROMISC is checked in ether_input and we should get rid of that check
somehow.
251