Datagram BIO API revisions for sendmmsg/recvmmsg
================================================

We need to evolve the BIO API surface relevant to BIO_dgram (and the eventual
BIO_dgram_mem) to support APIs which allow multiple datagrams to be sent or
received simultaneously, such as sendmmsg(2)/recvmmsg(2).

The adopted design
------------------

### Design decisions

The adopted design makes the following design decisions:

- We use a sendmmsg/recvmmsg-like API. The alternative API was not considered
  for adoption because it is an explicit goal that the adopted API be suitable
  for concurrent use on the same BIO.

- We define our own structures rather than using the OS's `struct mmsghdr`.
  The motivations for this are:

  - It ensures portability between OSes and allows the API to be used
    on OSes which do not support `sendmmsg` or `sendmsg`.

  - It allows us to use structures in keeping with OpenSSL's existing
    abstraction layers (e.g. `BIO_ADDR` rather than `struct sockaddr`).

  - We do not have to expose functionality which we cannot guarantee
    we can support on all platforms (for example, arbitrary control messages).

  - It avoids the need to include OS headers in our own public headers,
    which would pollute the environment of applications which include
    our headers, potentially undesirably.

- For OSes which do not support `sendmmsg`, we emulate it using repeated
  calls to `sendmsg`. For OSes which do not support `sendmsg`, we emulate it
  using `sendto` to the extent feasible. This avoids the need for code consuming
  these new APIs to define a fallback code path.

- We do not define any flags at this time, as the flags previously considered
  for adoption cannot be supported on all platforms (Win32 does not have
  `MSG_DONTWAIT`).

- We ensure the extensibility of our `BIO_MSG` structure in a way that preserves
  ABI compatibility using a `stride` argument which callers must set to
  `sizeof(BIO_MSG)`. Implementations can examine the stride argument to
  determine whether a given field is part of a `BIO_MSG`. This allows us to add
  optional fields to `BIO_MSG` at a later time without breaking ABI. All new
  fields must be added to the end of the structure.

- The BIO methods are designed to support stateless operation in which they
  are simply calls to the equivalent system calls, where supported, without
  changing BIO state. In particular, this means that things like retry flags are
  not set or cleared by `BIO_sendmmsg` or `BIO_recvmmsg`.

  The motivation for this is that these functions are intended to support
  concurrent use on the same BIO. If they read or modify BIO state, they would
  need to be synchronised with a lock, undermining performance on what (for
  `BIO_dgram`) would otherwise be a straight system call.

- We do not support iovecs. The motivations for this are:

  - Not all platforms can support iovecs (e.g. Windows).

  - The only way we could emulate iovecs on platforms which don't support
    them is by copying the data to be sent into a staging buffer. This would
    defeat all of the advantages of iovecs and prevent us from meeting our
    zero/single-copy requirements. Moreover, it would lead to extremely
    surprising performance variations for consumers of the API.

  - We do not believe iovecs are needed to meet our performance requirements
    for QUIC. The reason for this is that aside from a minimal packet header,
    all data in QUIC is encrypted, so all data sent via QUIC must pass through
    an encrypt step anyway, meaning that all data sent will already be copied
    and there is not going to be any issue depositing the ciphertext in a
    staging buffer together with the frame header.

  - Even if we did support iovecs, we would have to impose a limit
    on the number of iovecs supported, because we translate from our own
    structures (as discussed above) and also intend these functions to be
    stateless and not require locking. Therefore the OS-native iovec structures
    would need to be allocated on the stack.

- Sometimes, an application may wish to learn the local interface address
  associated with a receive operation or specify the local interface address to
  be used for a send operation. We support this, but require this functionality
  to be explicitly enabled before use.

  The reason for this is that enabling this functionality requires that the
  socket be reconfigured using `setsockopt` on most platforms. Doing this
  on-demand would require state in the BIO to determine whether this
  functionality is currently switched on, which would require otherwise
  unnecessary locking, undermining performance in concurrent usage of this API
  on a given BIO. By requiring this functionality to be enabled explicitly
  before use, the initialisation can be done up front without performance
  cost. It also helps users of the API understand that this functionality is
  not always available, and lets them detect its availability in advance.
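As an illustration of the stride rule above, the following self-contained
sketch shows how an implementation might detect whether a caller's structure
includes a later-added field. `BIO_MSG_V2`, `new_field` and `has_new_field`
are hypothetical names invented for this example, not part of the design:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical extended message structure: the first five fields mirror
 * the proposed BIO_MSG, and new_field stands for an optional field
 * appended to the end of the structure in a later release. */
typedef struct bio_msg_v2_st {
    void     *data;
    size_t    data_len;
    void     *peer, *local;  /* BIO_ADDR * in the real structure */
    uint64_t  flags;
    uint64_t  new_field;     /* hypothetical later addition */
} BIO_MSG_V2;

/* An implementation decides whether the caller's structure contains
 * new_field by checking that the caller-supplied stride covers the
 * field's end offset. Callers built against older headers pass the
 * old, smaller sizeof, so the check fails and the field is treated
 * as absent. */
static int has_new_field(size_t stride)
{
    return stride >= offsetof(BIO_MSG_V2, new_field)
                     + sizeof(((BIO_MSG_V2 *)0)->new_field);
}
```

A caller compiled against the old headers passes a stride no larger than
`offsetof(BIO_MSG_V2, new_field)`, so the field is correctly reported absent.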

### Design

The currently proposed design is as follows:

```c
typedef struct bio_msg_st {
    void *data;
    size_t data_len;
    BIO_ADDR *peer, *local;
    uint64_t flags;
} BIO_MSG;

#define BIO_UNPACK_ERRNO(e)     /*...*/
#define BIO_IS_ERRNO(e)         /*...*/

ossl_ssize_t BIO_sendmmsg(BIO *b, BIO_MSG *msg, size_t stride,
                          size_t num_msg, uint64_t flags);
ossl_ssize_t BIO_recvmmsg(BIO *b, BIO_MSG *msg, size_t stride,
                          size_t num_msg, uint64_t flags);
```

The API is used as follows:

- `msg` points to an array of `num_msg` `BIO_MSG` structures.

- Both functions have identical prototypes and return the number of messages
  processed in the array. If no messages were processed due to an error, `-1` is
  returned. If an OS-level socket error occurs, a negative value `v` is
  returned. The caller should determine that `v` is an OS-level socket error by
  calling `BIO_IS_ERRNO(v)` and may obtain the OS-level socket error code by
  calling `BIO_UNPACK_ERRNO(v)`.

- `stride` must be set to `sizeof(BIO_MSG)`.

- `data` points to the buffer of data to be sent or to be filled with received
  data. `data_len` is the size of the buffer in bytes on call. If the
  given message in the array is processed (i.e., if the return value
  exceeds the index of that message in the array), `data_len` is updated
  to the actual amount of data sent or received at return time.

- `flags` in the `BIO_MSG` structure provides per-message flags to
  the `BIO_sendmmsg` or `BIO_recvmmsg` call. If the given message in the array
  is processed, `flags` is written with zero or more result flags at return
  time. The `flags` argument to the call itself provides for global flags
  affecting all messages in the array. Currently, no per-message or global flags
  are defined and all of these fields are set to zero on call and on return.

- `peer` and `local` are optional pointers to `BIO_ADDR` structures into
  which the remote and local addresses are to be filled. If either of these
  is NULL, the given addressing information is not requested. Local address
  support may not be available in all circumstances, in which case processing of
  the message fails. (This means that the function returns the number of
  messages processed before the failing message, or -1 if the failing message
  is the first in the array.)

  Support for `local` must be explicitly enabled before use, otherwise
  attempts to use it fail.
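The per-message processing and return-value convention above, together with
the sendmsg-based fallback described in the design decisions, can be sketched
in a self-contained way. `SKETCH_MSG`, `send_one` and `emulated_sendmmsg` are
names invented for this illustration; `send_one` merely stands in for a real
sendmsg(2) call:

```c
#include <stddef.h>

/* A message in this sketch is just a flag saying whether the underlying
 * (hypothetical) sendmsg-style call would succeed for it. */
typedef struct sketch_msg_st {
    int sendable;
} SKETCH_MSG;

/* Stand-in for a single sendmsg(2) call; returns 1 on success. */
static int send_one(const SKETCH_MSG *msg)
{
    return msg->sendable;
}

/* Emulate a sendmmsg-style call using repeated single-message sends:
 * process the messages in order, stop at the first failure, and return
 * the number of messages processed, or -1 if the very first message
 * fails. This mirrors the BIO_sendmmsg return convention above. */
static int emulated_sendmmsg(const SKETCH_MSG *msgs, size_t num_msg)
{
    size_t i;

    for (i = 0; i < num_msg; ++i)
        if (!send_one(&msgs[i]))
            break;

    return i == 0 ? -1 : (int)i;
}
```

Note that a caller therefore cannot distinguish "the third message failed"
from "only two messages were sent"; in both cases it simply retries from the
first unprocessed message.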

Local address support is enabled as follows:

```c
int BIO_dgram_set_local_addr_enable(BIO *b, int enable);
int BIO_dgram_get_local_addr_enable(BIO *b);
int BIO_dgram_get_local_addr_cap(BIO *b);
```

`BIO_dgram_get_local_addr_cap()` returns 1 if local address support is
available. It is then enabled using `BIO_dgram_set_local_addr_enable()`, which
fails if support is not available.
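The design above leaves the bodies of `BIO_IS_ERRNO` and `BIO_UNPACK_ERRNO`
open. Purely as an illustration of the convention they must satisfy (not the
real definitions), one hypothetical packing scheme encodes an OS errno `e` as
a negative value well below `-1`, so it can never collide with the plain `-1`
"no messages processed" return; all names below are invented for this sketch:

```c
#include <stdint.h>

/* Hypothetical packing scheme, for illustration only. An OS errno e is
 * stored as -(ERRNO_BASE + e); any return value at or below -ERRNO_BASE
 * is therefore a packed errno, while -1 keeps its plain meaning of "no
 * messages processed". */
#define ERRNO_BASE            1000
#define PACK_ERRNO(e)         (-(int64_t)(ERRNO_BASE + (e)))
#define IS_PACKED_ERRNO(v)    ((v) <= -(int64_t)ERRNO_BASE)
#define UNPACK_ERRNO(v)       ((int)(-(v) - ERRNO_BASE))
```

Any scheme with this shape keeps the stateless property: the error travels in
the return value itself, with no per-BIO error state to synchronise.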

Options which were considered
-----------------------------

Options for the API surface which were considered included:

### sendmmsg/recvmmsg-like API

This design was chosen to form the basis of the adopted design, which is
described above.

```c
int BIO_readm(BIO *b, BIO_mmsghdr *msgvec,
              unsigned len, int flags, struct timespec *timeout);
int BIO_writem(BIO *b, BIO_mmsghdr *msgvec,
               unsigned len, int flags, struct timespec *timeout);
```

We can either define `BIO_mmsghdr` as a typedef of `struct mmsghdr` or redefine
an equivalent structure. The former has the advantage that we can just pass the
structures through to the syscall without copying them.

Note that in `BIO_dgram_mem` we will have to process and therefore understand
the contents of `struct mmsghdr` ourselves. Therefore, initially we define a
subset of `struct mmsghdr` as being supported: no control messages;
`msg_name` and `msg_iov` only.

The flags argument is defined by us. Initially we can support something like
`MSG_DONTWAIT` (say, `BIO_DONTWAIT`).

#### Implementation Questions

If we go with this, there are some issues that arise:

- Are `BIO_mmsghdr`, `BIO_msghdr` and `BIO_iovec` simple typedefs
  for OS-provided structures, or our own independent structure
  definitions?

  - If we use OS-provided structures:

    - We would need to include the OS headers which provide these
      structures in our public API headers.

    - If we choose to support these functions when OS support is not available
      (see discussion below), we would need to define our own structures in this
      case (a polyfill approach).

  - If we use our own structures:

    - We would need to translate these structures during every call.

      But we would need to have storage inside the BIO_dgram for *m* `struct
      msghdr`, *m\*v* iovecs, etc. Since we want to support multithreaded use,
      these allocations probably will need to be on the stack, and therefore
      must be limited.

      Limiting *m* isn't a problem, because `sendmmsg` returns the number
      of messages sent, so the existing semantics we are trying to match
      lets us just send or receive fewer messages than we were asked to.

      However, it does seem we will need to limit *v*, the number of iovecs
      per message. So what limit should we give to *v*? We
      will need a fixed stack allocation of OS iovec structures and we can
      allocate from this stack allocation as we iterate through the `BIO_msghdr`
      we have been given. So in practice we could simply send messages
      until we reach our iovec limit, and then return.

      For example, suppose we allocate 64 iovecs internally:

      ```c
      struct iovec vecs[64];
      ```

      If the first message passed to a call to `BIO_writem` has 64 iovecs
      attached to it, no further messages can be sent and `BIO_writem`
      returns 1.

      If three messages are sent, with 32, 32, and 1 iovecs respectively,
      the first two messages are sent and `BIO_writem` returns 2.

      So the only important thing we would need to document in this API
      is the limit of iovecs on a single message; in other words, the
      number of iovecs which must not be exceeded if a forward progress
      guarantee is to be made. e.g. if we allocate 64 iovecs internally,
      `BIO_writem` with a single message with 65 iovecs will never work
      and this becomes part of the API contract.

      Obviously these quantities of iovecs are unrealistically large.
      iovecs are small, so we can afford to set the limit high enough
      that it shouldn't cause any problems in practice. We can increase
      the limit later without a breaking API change, but we cannot decrease
      it later. So we might want to start with something small, like 8.

- We also need to decide what to do for OSes which don't support at least
  `sendmsg`/`recvmsg`.

  - Don't provide these functions and require all users of these functions to
    have an alternate code path which doesn't rely on them?

    - Not providing these functions on OSes that don't support
      at least sendmsg/recvmsg is a simple solution but adds
      complexity to code using BIO_dgram. (Though it does communicate
      to code more realistic performance expectations since it
      knows when these functions are actually available.)

  - Provide these functions and emulate the functionality:

    - However, there is a question here as to how we implement
      the iovec arguments on platforms without `sendmsg`/`recvmsg`. (We cannot
      use `writev`/`readv` because we need peer address information.) Logically,
      implementing these would then have to be done by copying buffers around
      internally before calling `sendto`/`recvfrom`, defeating the point of
      iovecs and providing a performance profile which is surprising to code
      using BIO_dgram.

    - Another option could be a variable limit on the number of iovecs,
      which can be queried from BIO_dgram. This would be a constant set
      when libcrypto is compiled. It would be 1 for platforms not supporting
      `sendmsg`/`recvmsg`. This again adds burdens on the code using
      BIO_dgram, but it seems the only way to avoid the surprising performance
      pitfall of buffer copying to emulate iovec support. There is a fair risk
      of code being written which accidentally works on one platform but not
      another, because the author didn't realise the iovec limit is 1 on some
      platforms. Possibly we could have an iovec limit variable in the
      BIO_dgram which is 1 by default and which can be increased by a call to a
      function `BIO_set_iovec_limit`, but not beyond the fixed size discussed
      above. This call would fail where the requested limit cannot be met,
      giving client code a clear way to determine whether its expectations are
      met.
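The partial-send rule discussed above (a fixed stack budget of iovecs,
stopping before the first message that no longer fits) can be sketched in a
self-contained way. `IOVEC_BUDGET` and `msgs_within_budget` are names invented
for this illustration; the worked numbers match the 64-iovec example above:

```c
#include <stddef.h>

#define IOVEC_BUDGET 64 /* fixed stack allocation of OS iovec structures */

/* Walk the messages in order, consuming the fixed iovec budget, and
 * stop before the first message whose iovecs no longer fit. Returns the
 * number of messages that would be handed to the OS call in one batch,
 * or -1 if even the first message exceeds the budget (the documented
 * forward-progress limit). iovs_per_msg[i] is the iovec count of
 * message i. */
static int msgs_within_budget(const size_t *iovs_per_msg, size_t num_msg)
{
    size_t i, used = 0;

    for (i = 0; i < num_msg; ++i) {
        if (used + iovs_per_msg[i] > IOVEC_BUDGET)
            break;
        used += iovs_per_msg[i];
    }

    return i == 0 ? -1 : (int)i;
}
```

With this rule, a first message carrying 64 iovecs fills the whole budget
(return 1), messages of 32, 32 and 1 iovecs yield 2, and a single message of
65 iovecs can never be sent.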

### Alternate API

Could we use a simplified API? For example, could we have an API that returns
one datagram at a time, where BIO_dgram uses `recvmmsg` internally and queues
the returned datagrams, thereby still avoiding extra syscalls but offering a
simple API?

The problem here is we want to support single-copy (where the data is only
copied as it is decrypted). Thus BIO_dgram needs to know the final resting place
of encrypted data at the time it makes the `recvmmsg` call.

One option would be to allow the user to set a callback on BIO_dgram which it
can use to request a new buffer, then have an API which returns the buffer:

```c
int BIO_dgram_set_read_callback(BIO *b,
                                void *(*cb)(size_t len, void *arg),
                                void *arg);
int BIO_dgram_set_read_free_callback(BIO *b,
                                     void (*cb)(void *buf,
                                                size_t buf_len,
                                                void *arg),
                                     void *arg);
int BIO_read_dequeue(BIO *b, void **buf, size_t *buf_len);
```

The BIO_dgram calls the specified callback when it needs to generate internal
iovecs for its `recvmmsg` call, and the received datagrams can then be popped by
the application and freed as it likes. (The read free callback above is only
used in rare circumstances, such as when calls to `BIO_read` and
`BIO_read_dequeue` are alternated, or when the BIO_dgram is destroyed prior to
all read buffers being dequeued; see below.) For convenience we could have an
extra call to allow a buffer to be pushed back into the BIO_dgram's internal
queue of unused read buffers, which avoids the need for the application to do
its own management of such recycled buffers:

```c
int BIO_dgram_push_read_buffer(BIO *b, void *buf, size_t buf_len);
```
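The flow of this callback-allocated receive path can be sketched in a
self-contained way. All names below (`rx_queue`, `rxq_fill`, `rxq_dequeue`,
`demo_alloc`) are invented for this illustration; `rxq_fill` merely stands in
for the internal `recvmmsg` step:

```c
#include <stdlib.h>
#include <stddef.h>

#define RXQ_MAX 8

/* Application-supplied buffer allocator, as would be registered via the
 * proposed BIO_dgram_set_read_callback. */
typedef void *(*alloc_cb)(size_t len, void *arg);

/* Allocator used in this sketch; counts calls through arg so the
 * behaviour is observable. */
static void *demo_alloc(size_t len, void *arg)
{
    ++*(int *)arg;
    return malloc(len);
}

/* Internal RX queue: one callback-allocated buffer per datagram slot. */
struct rx_queue {
    void  *bufs[RXQ_MAX];
    size_t count, head;
};

/* Stand-in for the internal recvmmsg step: before receiving, the BIO
 * requests one buffer per expected datagram from the callback, so the
 * OS writes each datagram into its final resting place (single-copy). */
static void rxq_fill(struct rx_queue *q, size_t n, size_t dgram_len,
                     alloc_cb cb, void *arg)
{
    q->count = q->head = 0;
    while (q->count < n && q->count < RXQ_MAX)
        q->bufs[q->count++] = cb(dgram_len, arg);
}

/* BIO_read_dequeue analogue: pop the next buffer; ownership passes to
 * the application, which frees or recycles it. */
static void *rxq_dequeue(struct rx_queue *q)
{
    return q->head < q->count ? q->bufs[q->head++] : NULL;
}
```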

On the write side, the application provides buffers and can get a callback when
they are freed. BIO_write_queue just queues for transmission, and the `sendmmsg`
call is made when calling `BIO_flush`. (TBD: whether it is reasonable to
overload the semantics of BIO_flush in this way.)

```c
int BIO_dgram_set_write_done_callback(BIO *b,
                                      void (*cb)(const void *buf,
                                                 size_t buf_len,
                                                 int status,
                                                 void *arg),
                                      void *arg);
int BIO_write_queue(BIO *b, const void *buf, size_t buf_len);
int BIO_flush(BIO *b);
```

The status argument to the write done callback will be 1 on success, some
negative value on failure, and some special negative value if the BIO_dgram is
being freed before the write could be completed.
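The queue-then-flush write path can likewise be sketched in a self-contained
way. All names below (`tx_queue`, `txq_queue`, `txq_flush`, `demo_done`) are
invented for this illustration; `txq_flush` stands in for the single batched
`sendmmsg` call made by `BIO_flush`:

```c
#include <stddef.h>

#define TXQ_MAX 8

/* Write-done callback, as would be registered via the proposed
 * BIO_dgram_set_write_done_callback; status is 1 on success. */
typedef void (*write_done_cb)(const void *buf, size_t buf_len,
                              int status, void *arg);

/* Done callback used in this sketch: counts successful completions. */
static void demo_done(const void *buf, size_t buf_len, int status, void *arg)
{
    (void)buf; (void)buf_len;
    if (status == 1)
        ++*(int *)arg;
}

struct tx_queue {
    const void   *bufs[TXQ_MAX];
    size_t        lens[TXQ_MAX];
    size_t        count;
    write_done_cb cb;
    void         *cb_arg;
};

/* BIO_write_queue analogue: only records the buffer for later sending;
 * returns 0 if the queue is full. */
static int txq_queue(struct tx_queue *q, const void *buf, size_t len)
{
    if (q->count == TXQ_MAX)
        return 0;
    q->bufs[q->count] = buf;
    q->lens[q->count] = len;
    ++q->count;
    return 1;
}

/* BIO_flush analogue: "send" everything queued in one batch (standing
 * in for a single sendmmsg call) and report per-buffer completion
 * through the done callback. Returns the number of buffers flushed. */
static size_t txq_flush(struct tx_queue *q)
{
    size_t i, n = q->count;

    for (i = 0; i < n; ++i)
        q->cb(q->bufs[i], q->lens[i], 1, q->cb_arg);
    q->count = 0;
    return n;
}
```

The application thus keeps ownership of each buffer until the done callback
fires, which is what makes the single-copy path possible on the write side.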

For send/receive addresses, we import the `BIO_(set|get)_dgram_(origin|dest)`
APIs proposed in the sendmsg/recvmsg PR (#5257). `BIO_get_dgram_(origin|dest)`
should be called immediately after `BIO_read_dequeue` and
`BIO_set_dgram_(origin|dest)` should be called immediately before
`BIO_write_queue`.

This approach allows `BIO_dgram` to support myriad options via composition of
successive function calls in a builder style rather than via a single function
call with an excessive number of arguments or pointers to unwieldy ever-growing
argument structures, requiring constant revision of the central read/write
functions of the BIO API.

Note that since `BIO_set_dgram_(origin|dest)` sets data on outgoing packets and
`BIO_get_dgram_(origin|dest)` gets data on incoming packets, it doesn't follow
that these are accessing the same data (they are not setters and getters of
variables called "dgram origin" and "dgram destination", even though their
names make them look like setters and getters of the same variables). We
probably want to separate these as there is no need for a getter for outgoing
packet destination, for example, and by separating these we allow the
possibility of multithreaded use (one thread reads, one thread writes) in the
future. Possibly we should choose less confusing names for these functions.
Maybe `BIO_set_outgoing_dgram_(origin|dest)` and
`BIO_get_incoming_dgram_(origin|dest)`.

Pros of this approach:

  - Application can generate one datagram at a time and still get the advantages
    of sendmmsg/recvmmsg (fewer syscalls, etc.)

    We probably want this for our own QUIC implementation built on top of this
    anyway. Otherwise we will need another piece to do basically the same thing
    and agglomerate multiple datagrams into a single BIO call. Unless we only
    want to use `sendmmsg` constructively in trivial cases (e.g. where we send
    two datagrams from the same function immediately after one another, which
    doesn't seem like a common use case).

  - Flexible support for single-copy (zero-copy).

Cons of this approach:

  - Very different way of doing reads/writes might be strange to existing
    applications. *But* the primary consumer of this new API will be our own
    QUIC implementation, so this is probably not a big deal. We can always
    support `BIO_read`/`BIO_write` as a less efficient fallback for existing
    third party users of BIO_dgram.

#### Compatibility interop

Suppose the following sequence happens:

1. BIO_read (legacy call path)
2. BIO_read_dequeue (`recvmmsg` based call path with callback-allocated buffer)
3. BIO_read (legacy call path)

For (1) we have two options:

a. Use `recvmmsg` and add the received datagrams to an RX queue just as for the
   `BIO_read_dequeue` path. We use an OpenSSL-provided default allocator
   (`OPENSSL_malloc`) and flag these datagrams as needing to be freed by
   OpenSSL, not the application.

   When the application calls `BIO_read`, a copy is performed and the internal
   buffer is freed.

b. Use `recvfrom` directly. This means we have a `recvmmsg` path and a
   `recvfrom` path depending on what API is being used.

The disadvantage of (a) is that it yields an extra copy relative to what we have
now, whereas with (b) the buffer passed to `BIO_read` gets passed through to the
syscall and we do not have to copy anything.

Since we will probably need to support platforms without
`sendmmsg`/`recvmmsg` support anyway, (b) seems like the better option.

For (2) the new API is used. Since the previous call to BIO_read is essentially
stateless (it's just a simple call to `recvfrom`, and doesn't require mutation
of any internal BIO state other than maybe the last datagram source/destination
address fields), BIO_dgram can go ahead and start using the `recvmmsg` code
path. Since the RX queue will obviously be empty at this point, it is
initialised and filled using `recvmmsg`, then one datagram is popped from it.

For (3) we have a legacy `BIO_read` but we have several datagrams still in the
RX queue. In this case we do have to copy; we have no choice. However, this only
happens in circumstances where a user of BIO_dgram alternates between old and
new APIs, which should be very unusual.

Subsequently for (3) we have to free the buffer using the free callback. This is
an unusual case where BIO_dgram is responsible for freeing read buffers and not
the application (the only other case being premature destruction, see below).
But since this seems a very strange API usage pattern, we may just want to fail
in this case.

This is probably not worth supporting, so we can have the following rule:

- After the first call to `BIO_read_dequeue` is made on a BIO_dgram, all
  subsequent calls to ordinary `BIO_read` will fail.

Of course, all of the above applies analogously to the TX side.

#### BIO_dgram_pair

We will also implement from scratch a BIO_dgram_pair. This will be provided as a
BIO pair which provides identical semantics to the BIO_dgram above, both for the
legacy and zero-copy code paths.

#### Thread safety

It is a functional assumption of the above design that we would never want to
have more than one thread doing TX on the same BIO and never have more than one
thread doing RX on the same BIO.

If we did ever want to do this, multiple BIOs on the same FD is one possibility
(for the BIO_dgram case at least). But I don't believe there is any general
intention to support multithreaded use of a single BIO at this time (unless I am
mistaken), so this seems like it isn't an issue.

If we wanted to support multithreaded use of the same FD using the same BIO, we
would need to revisit the set-call-then-execute-call API approach above
(`BIO_(set|get)_dgram_(origin|dest)`), as this would pose a problem. I mention
this mainly for completeness; our recent lessons learnt on cache contention
suggest that this probably wouldn't be a good idea anyway.

#### Other questions

BIO_dgram will call the allocation function to get buffers for `recvmmsg` to
fill. We might want to have a way to specify how many buffers it should offer to
`recvmmsg`, and thus how many buffers it allocates in advance.

#### Premature destruction

If BIO_dgram is freed before all datagrams are read, the read buffer free
callback is used to free any unreturned read buffers.
    488