TODO.smpnet revision 1.29
$NetBSD: TODO.smpnet,v 1.29 2019/03/27 06:56:37 ozaki-r Exp $

MP-safe components
==================

The following components work without the big kernel lock (KERNEL_LOCK), i.e.,
with the NET_MPSAFE kernel option.  Some components scale up and some don't.

 - Device drivers
   - vioif(4)
   - vmx(4)
   - wm(4)
   - ixg(4)
   - ixv(4)
 - Layer 2
   - Ethernet (if_ethersubr.c)
   - bridge(4)
     - STP
   - Fast forward (ipflow)
 - Layer 3
   - All except for items in the below section
 - Interfaces
   - gif(4)
   - ipsecif(4)
   - l2tp(4)
   - pppoe(4)
     - if_spppsubr.c
   - tun(4)
   - vlan(4)
 - Packet filters
   - npf(7)
 - Others
   - bpf(4)
   - ipsec(4)
   - opencrypto(9)
   - pfil(9)

Non MP-safe components and kernel options
=========================================

The following components and options aren't MP-safe yet, i.e., they still
require the big kernel lock.  Some of them can be used safely even if
NET_MPSAFE is enabled because they're still protected by the big kernel lock.
The others aren't protected and so are unsafe, e.g., they may crash the kernel.

Protected ones
--------------

 - Device drivers
   - Most drivers other than ones listed in the above section
 - Layer 4
   - DCCP
   - SCTP
   - TCP
   - UDP

Unprotected ones
----------------

 - Layer 2
   - ARCNET (if_arcsubr.c)
   - BRIDGE_IPF
   - FDDI (if_fddisubr.c)
   - HIPPI (if_hippisubr.c)
   - IEEE 1394 (if_ieee1394subr.c)
   - IEEE 802.11 (ieee80211(4))
   - Token ring (if_tokensubr.c)
 - Layer 3
   - IPSELSRC
   - MROUTING
   - PIM
   - MPLS (mpls(4))
   - IPv6 address selection policy
 - Interfaces
   - agr(4)
   - carp(4)
   - faith(4)
   - gre(4)
   - ppp(4)
   - sl(4)
   - stf(4)
   - strip(4)
   - if_srt
   - tap(4)
 - Packet filters
   - ipf(4)
   - pf(4)
 - Others
   - AppleTalk (sys/netatalk/)
   - Bluetooth (sys/netbt/)
   - altq(4)
   - CIFS (sys/netsmb/)
   - kttcp(4)
   - NFS

Known issues
============

NOMPSAFE
--------

We use "NOMPSAFE" as a mark that indicates that the code around it isn't MP-safe
yet.  We use it in comments and also as part of function names, for example
m_get_rcvif_NOMPSAFE.  Let's use "NOMPSAFE" to make it easy to find non-MP-safe
code with grep.

bpf
---

MP-ification of bpf requires that all bpf_mtap* calls are made in normal LWP
context or softint context, i.e., not in hardware interrupt context.  For Tx,
all bpf_mtap calls satisfy the requirement.  For Rx, most bpf_mtap calls are
made in softint context.  Unfortunately some bpf_mtap calls on Rx are still
made in hardware interrupt context.

This is the list of the functions that have such bpf_mtap calls:

 - sca_frame_process() @ sys/dev/ic/hd64570.c

Ideally we should make the functions run in softint somehow, but we have
neither actual devices nor time (or interest/love) to work on the task, so
instead we provide a deferred bpf_mtap mechanism that forcibly runs bpf_mtap in
softint context.  It's a workaround and once the functions run in softint, we
should use the original bpf_mtap again.

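The deferral idea above can be sketched as a small userland model.  This is
not the kernel implementation: struct pkt, struct defer_queue, defer_mtap(),
defer_softint() and DEFER_QLEN are hypothetical names made up for illustration,
and a real mechanism would also have to protect the queue against the
interrupt handler.  The point is only that the interrupt side queues the
packet and the softint side is the only place where the tap runs.

```c
#include <assert.h>
#include <stddef.h>

#define DEFER_QLEN 8		/* hypothetical queue depth */

struct pkt {			/* stand-in for struct mbuf */
	int id;
};

struct defer_queue {
	struct pkt *slots[DEFER_QLEN];
	size_t head;		/* next slot to drain */
	size_t count;		/* number of queued packets */
	int tapped;		/* packets delivered to the tap so far */
};

/* Stand-in for the real tap; only ever called from "softint" context. */
static void
tap_packet(struct defer_queue *q, struct pkt *p)
{
	(void)p;
	q->tapped++;
}

/*
 * Called from "hardware interrupt" context: only queue the packet.
 * Returns 0 on success, -1 if the queue is full (the packet is then
 * simply not tapped).
 */
static int
defer_mtap(struct defer_queue *q, struct pkt *p)
{
	if (q->count == DEFER_QLEN)
		return -1;
	q->slots[(q->head + q->count) % DEFER_QLEN] = p;
	q->count++;
	return 0;
}

/*
 * Called from "softint" context: drain the queue and run the tap.
 * Returns the number of packets tapped in this run.
 */
static int
defer_softint(struct defer_queue *q)
{
	int n = 0;

	while (q->count > 0) {
		struct pkt *p = q->slots[q->head];

		assert(p != NULL);
		q->head = (q->head + 1) % DEFER_QLEN;
		q->count--;
		tap_packet(q, p);
		n++;
	}
	return n;
}
```

In this model nothing reaches tap_packet() until defer_softint() runs, which is
the property the workaround relies on.
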
Lingering obsolete variables
----------------------------

Some obsolete global variables and member variables of structures remain to
avoid breaking old userland programs which directly access such variables via
kvm(3).

The following programs still use kvm(3) to get some information related to
the network stack.

 - netstat(1)
 - vmstat(1)
 - fstat(1)

netstat(1) accesses ifnet_list, the head of a list of interface objects
(struct ifnet), and traverses each object through the ifnet#if_list member
variable.  ifnet_list and ifnet#if_list are obsoleted by ifnet_pslist and
ifnet#if_pslist_entry respectively.  netstat also accesses the IP address list
of an interface through ifnet#if_addrlist.  struct ifaddr, struct in_ifaddr
and struct in6_ifaddr are accessed and the following obsolete member variables
are therefore stuck in place: ifaddr#ifa_list, in_ifaddr#ia_hash,
in_ifaddr#ia_list, in6_ifaddr#ia_next and in6_ifaddr#_ia6_multiaddrs.  Note
that netstat already implements alternative methods to fetch the above
information via sysctl(3).

vmstat(1) shows statistics of hash tables created by hashinit(9) in the kernel.
The statistics are retrieved via kvm(3).  The global variables in_ifaddrhash
and in_ifaddrhashtbl, which are for a hash table of IPv4 addresses and
obsoleted by in_ifaddrhash_pslist and in_ifaddrhashtbl_pslist, are kept for
this purpose.  We should provide a means to fetch statistics of hash tables
via sysctl(3).

fstat(1) shows information about bpf instances.  Each bpf instance (struct bpf)
is obtained via kvm(3).  The bpf_d#_bd_next, bpf_d#_bd_filter and
bpf_d#_bd_list member variables are obsolete but remain.  ifnet#if_xname is
also accessed via struct bpf_if, and the obsolete ifnet#if_list must remain so
that the offset of ifnet#if_xname does not change.  The statistic counters
(bpf#bd_rcount, bpf#bd_dcount and bpf#bd_ccount) are also victims of this
restriction; for scalability the counters should be per-CPU and we should stop
using atomic operations for them, but we have to keep both the counters and
the atomic operations.

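What "per-CPU counters" would buy us can be shown with a minimal userland
sketch.  It is only a model under stated assumptions: struct percpu_counter,
counter_inc(), counter_read() and NCPU_MODEL are hypothetical names, the CPU
index is passed in by hand, and the caller is assumed not to migrate between
CPUs while it updates its slot.

```c
#include <assert.h>
#include <stdint.h>

#define NCPU_MODEL 4		/* hypothetical CPU count for this model */

/*
 * One counter slot per CPU.  A real implementation would pad each
 * slot to a cache line to avoid false sharing.
 */
struct percpu_counter {
	uint64_t slot[NCPU_MODEL];
};

/*
 * Fast path: a CPU touches only its own slot, so no atomic operation
 * is needed as long as the caller cannot migrate.
 */
static void
counter_inc(struct percpu_counter *c, unsigned int cpu)
{
	assert(cpu < NCPU_MODEL);
	c->slot[cpu]++;
}

/* Slow path: a reader sums all slots to get the total. */
static uint64_t
counter_read(const struct percpu_counter *c)
{
	uint64_t sum = 0;

	for (unsigned int i = 0; i < NCPU_MODEL; i++)
		sum += c->slot[i];
	return sum;
}
```

A reader that fetches a single counter field via kvm(3) cannot perform the
summation in counter_read(), which is why the existing counters cannot simply
be replaced by such a layout.
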
Scalability
-----------

 - Per-CPU rtcaches (used in say IP forwarding) aren't scalable on multiple
   flows per CPU
 - ipsec(4) isn't scalable on the number of SA/SP; the cost of a look-up
   is O(n)
 - opencrypto(9)'s crypto_newsession()/crypto_freesession() aren't scalable
   as they are serialized by one mutex

ec_multi* of ethercom
---------------------

ec_multiaddrs and ec_multicnt of struct ethercom and items listed in
ec_multiaddrs must be protected by ec_lock.  The core of the ethernet subsystem
is already MP-safe; however, device drivers that use the data should also be
fixed.  A typical change is to protect manipulations of the data via ETHER_*
macros such as ETHER_FIRST_MULTI with ETHER_LOCK and ETHER_UNLOCK.

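The locking discipline described above can be illustrated with a userland
model.  Everything prefixed with model_ is a hypothetical stand-in and not the
kernel API: model_lock()/model_unlock() play the role of ETHER_LOCK and
ETHER_UNLOCK, and model_first_multi()/model_next_multi() play the role of the
ETHER_* traversal macros.  The lock is modelled as a plain flag so that the
sketch can assert that every traversal happens with the lock held.

```c
#include <assert.h>
#include <stddef.h>

struct model_multi {			/* stand-in for one multicast entry */
	unsigned char addr[6];
	struct model_multi *next;
};

struct model_ethercom {			/* stand-in for struct ethercom */
	int locked;			/* models ec_lock being held */
	struct model_multi *multiaddrs;	/* models ec_multiaddrs */
	int multicnt;			/* models ec_multicnt */
};

static void
model_lock(struct model_ethercom *ec)
{
	assert(!ec->locked);
	ec->locked = 1;
}

static void
model_unlock(struct model_ethercom *ec)
{
	assert(ec->locked);
	ec->locked = 0;
}

static struct model_multi *
model_first_multi(struct model_ethercom *ec)
{
	assert(ec->locked);	/* traversal without the lock is a bug */
	return ec->multiaddrs;
}

static struct model_multi *
model_next_multi(struct model_ethercom *ec, struct model_multi *enm)
{
	assert(ec->locked);
	return enm->next;
}

/*
 * Driver-side pattern: take the lock, walk the whole list, drop the
 * lock.  Returns the number of entries visited.
 */
static int
driver_count_filters(struct model_ethercom *ec)
{
	int n = 0;

	model_lock(ec);
	for (struct model_multi *enm = model_first_multi(ec); enm != NULL;
	    enm = model_next_multi(ec, enm))
		n++;	/* a real driver would program a hardware filter here */
	assert(n == ec->multicnt);
	model_unlock(ec);
	return n;
}
```

The whole walk sits inside one lock/unlock pair, so the list and the count
cannot change underneath the driver while it programs its filters.
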
ALTQ
----

If ALTQ is enabled in the kernel, it forces the use of just one Tx queue
(if_snd) for packet transmissions, resulting in serializing all Tx packet
processing on the queue.  We should probably design and implement an
alternative queuing mechanism that deals with multi-core systems in the first
place, rather than making the existing ALTQ MP-safe, because doing that is just
annoying.

Using kernel modules
--------------------

Please note that if you enable NET_MPSAFE in your kernel, and you use any
loadable kernel modules (including compat_xx modules or individual network
interface if_xxx device driver modules), you will need to build custom
modules.  For each module you will need to add the following line to its
Makefile:

	CPPFLAGS+=	-DNET_MPSAFE

Failure to do this may result in unpredictable behavior.

IPv4 address initialization atomicity
-------------------------------------

An IPv4 address is referenced by several data structures: an associated
interface, its local route, a connected route (if necessary), the global list,
the global hash table, etc.  These data structures are not updated atomically,
i.e., there can be inconsistent states on an IPv4 address in the kernel during
the initialization of an IPv4 address.

One known failure caused by the issue is that incoming packets destined for an
initializing address can loop in the network stack for a short period of time.
The address initialization creates a local route first and then registers the
initializing address to the global hash table, which is used to decide if an
incoming packet is destined for the host by checking whether the destination
of the packet is registered to the hash table.  So, if the host allows
forwarding, an incoming packet can match a local route of an initializing
address at ip_output while it fails the to-self check described above at
ip_input.  Because a matched local route points to a loopback interface as its
destination interface, the incoming packet is sent to the network stack
(ip_input) again, which results in looping.  The loop stops once the
initializing address is registered to the hash table.

One solution to the issue is to reorder the address initialization steps:
first register an address to the hash table, then create its routes.  Another
solution is to use the routing table for the to-self check instead of using the
global hash table, like IPv6.

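The looping window and the effect of reordering can be seen in a tiny model of
the decision path.  This is a sketch, not the real ip_input/ip_output code:
struct addr_state, model_ip_input(), the verdict names and MAX_PASSES are all
hypothetical, and the state is held fixed for the duration of one call, whereas
in the kernel the window closes as soon as initialization completes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum verdict { DELIVERED, FORWARDED, DROPPED, LOOPING };

struct addr_state {
	bool local_route;	/* local route for the address exists */
	bool in_hash;		/* address is in the global hash table */
};

#define MAX_PASSES 16		/* hypothetical cap so the model terminates */

/*
 * Models what happens to one incoming packet destined for the address
 * while the address is in the given state.
 */
static enum verdict
model_ip_input(const struct addr_state *st, bool forwarding)
{
	assert(st != NULL);
	for (int pass = 0; pass < MAX_PASSES; pass++) {
		/* ip_input: to-self check against the hash table. */
		if (st->in_hash)
			return DELIVERED;
		if (!forwarding)
			return DROPPED;
		/* ip_output: route lookup for the packet to be forwarded. */
		if (!st->local_route)
			return FORWARDED;
		/* The local route points at loopback: back to ip_input. */
	}
	return LOOPING;
}
```

With the current order the intermediate state is { local_route, !in_hash },
which is the only state in the model that loops when forwarding is enabled.
With the reordered initialization the intermediate state is
{ !local_route, in_hash }, and the packet is delivered on the first pass.
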
if_flags
--------

To avoid data races on if_flags it should be protected by a lock (currently
it's IFNET_LOCK).  Thus, if_flags should not be accessed during packet
processing to avoid performance degradation due to lock contention.
Traditionally the IFF_RUNNING, IFF_UP and IFF_OACTIVE flags of if_flags are
checked during packet processing.  If you make a driver MP-safe you must remove
such checks.

IFF_ALLMULTI can be set/unset via if_mcast_op.  To protect updates of the flag,
we added IFNET_LOCK around if_mcast_op.  However that was not a good
approach because if_mcast_op is typically called in the middle of a call path
and holding IFNET_LOCK in such places is problematic.  In fact a deadlock has
been observed.  Probably we should remove IFNET_LOCK and manage IFF_ALLMULTI
somewhere other than if_flags, for example in ethercom or the driver itself (or
a common driver framework once it appears).  Such a change is feasible because
IFF_ALLMULTI is only set/unset by a driver and not accessed from any common
components such as network protocols.

Also IFF_PROMISC is checked in ether_input and we should get rid of it somehow.

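One way to read the "manage IFF_ALLMULTI in the driver itself" idea is sketched
below as a userland model.  It is an assumption about how such a change could
look, not an existing driver: struct model_softc, model_set_multi() and
MODEL_MAX_FILTERS are hypothetical, and the driver-private lock is modelled as
a plain flag.  The all-multicast state lives in the driver's own structure and
is updated under the driver's own lock, so if_flags and IFNET_LOCK are not
involved.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_MAX_FILTERS 2	/* hypothetical number of hardware filter slots */

struct model_softc {
	int locked;		/* models a driver-private lock */
	bool allmulti;		/* driver-owned replacement for IFF_ALLMULTI */
	int nfilters;		/* exact-match filters currently programmed */
};

/*
 * Models a driver's multicast update routine (the if_mcast_op path).
 * naddrs is the number of multicast groups currently joined.
 */
static void
model_set_multi(struct model_softc *sc, int naddrs)
{
	assert(!sc->locked);
	sc->locked = 1;
	if (naddrs > MODEL_MAX_FILTERS) {
		/* Too many groups for the hardware: accept all multicast. */
		sc->allmulti = true;
		sc->nfilters = 0;
	} else {
		sc->allmulti = false;
		sc->nfilters = naddrs;
	}
	sc->locked = 0;
}
```

Because only the driver reads and writes the flag in this arrangement, the
update can happen in the middle of a call path without taking IFNET_LOCK.
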
253