1 <!doctype html public "-//W3C//DTD HTML 4.01 Transitional//EN" 2 "https://www.w3.org/TR/html4/loose.dtd"> 3 4 <html> 5 6 <head> 7 8 <title>Postfix Queue Scheduler</title> 9 10 <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> 11 <link rel='stylesheet' type='text/css' href='postfix-doc.css'> 12 13 </head> 14 15 <body> 16 17 <h1><img src="postfix-logo.jpg" width="203" height="98" ALT="">Postfix 18 Queue Scheduler</h1> 19 20 <hr> 21 22 <h2> Disclaimer </h2> 23 24 <p> Many of the <i>transport</i>-specific configuration parameters 25 discussed in this document will not show up in "postconf" command 26 output before Postfix version 2.9. This limitation applies to many 27 parameters whose name is a combination of a master.cf service name 28 such as "relay" and a built-in suffix such as 29 "_destination_concurrency_limit". </p> 30 31 <h2> Overview </h2> 32 33 <p> The queue manager is by far the most complex part of the Postfix 34 mail system. It schedules delivery of new mail, retries failed 35 deliveries at specific times, and removes mail from the queue after 36 the last delivery attempt. There are two major classes of mechanisms 37 that control the operation of the queue manager. </p> 38 39 <p> Topics covered by this document: </p> 40 41 <ul> 42 43 <li> <a href="#concurrency"> Concurrency scheduling</a>, concerned 44 with the number of concurrent deliveries to a specific destination, 45 including decisions on when to suspend deliveries after persistent 46 failures. 47 48 <li> <a href="#jobs"> Preemptive scheduling</a>, concerned with 49 the selection of email messages and recipients for a given destination. 50 51 <li> <a href="#credits"> Credits</a>, something this document would not be 52 complete without. 53 54 </ul> 55 56 <!-- 57 58 <p> Once started, the qmgr(8) process runs until "postfix reload" 59 or "postfix stop". As a persistent process, the queue manager has 60 to meet strict requirements with respect to code correctness and 61 robustness. Unlike non-persistent daemon processes, the queue manager 62 cannot benefit from Postfix's process rejuvenation mechanism that 63 limit the impact from resource leaks and other coding errors 64 (translation: replacing a process after a short time covers up bugs 65 before they can become a problem). </p> 66 67 --> 68 69 <h2> <a name="concurrency"> Concurrency scheduling </a> </h2> 70 71 <p> The following sections document the Postfix 2.5 concurrency 72 scheduler, after a discussion of the limitations of the earlier 73 concurrency scheduler. This is followed by results of medium-concurrency 74 experiments, and a discussion of trade-offs between performance and 75 robustness. </p> 76 77 <p> The material is organized as follows: </p> 78 79 <ul> 80 81 <li> <a href="#concurrency_drawbacks"> Drawbacks of the existing 82 concurrency scheduler </a> 83 84 <li> <a href="#concurrency_summary_2_5"> Summary of the Postfix 2.5 85 concurrency feedback algorithm </a> 86 87 <li> <a href="#dead_summary_2_5"> Summary of the Postfix 2.5 "dead 88 destination" detection algorithm </a> 89 90 <li> <a href="#pseudo_code_2_5"> Pseudocode for the Postfix 2.5 91 concurrency scheduler </a> 92 93 <li> <a href="#concurrency_results"> Results for delivery to 94 concurrency limited servers </a> 95 96 <li> <a href="#concurrency_discussion"> Discussion of concurrency 97 limited server results </a> 98 99 <li> <a href="#concurrency_limitations"> Limitations of less-than-1 100 per delivery feedback </a> 101 102 <li> <a href="#concurrency_config"> Concurrency configuration 103 parameters </a> 104 105 </ul> 106 107 <h3> <a name="concurrency_drawbacks"> Drawbacks of the existing 108 concurrency scheduler </a> </h3> 109 110 <p> From the start, Postfix has used a simple but robust algorithm 111 where the per-destination delivery concurrency is decremented by 1 112 after delivery failed due to connection or handshake failure, and 113 incremented by 1 otherwise. Of course the concurrency is never 114 allowed to exceed the maximum per-destination concurrency limit. 115 And when a destination's concurrency level drops to zero, the 116 destination is declared "dead" and delivery is suspended. </p> 117 118 <p> Drawbacks of +/-1 concurrency feedback per delivery are: <p> 119 120 <ul> 121 122 <li> <p> Overshoot due to exponential delivery concurrency growth 123 with each pseudo-cohort(*). This can be an issue with high-concurrency 124 channels. For example, with the default initial concurrency of 5, 125 concurrency would proceed over time as (5-10-20). </p> 126 127 <li> <p> Throttling down to zero concurrency after a single 128 pseudo-cohort(*) failure. This was especially an issue with 129 low-concurrency channels where a single failure could be sufficient 130 to mark a destination as "dead", causing the suspension of further 131 deliveries to the affected destination. </p> 132 133 </ul> 134 135 <p> (*) A pseudo-cohort is a number of delivery requests equal to 136 a destination's delivery concurrency. </p> 137 138 <p> The revised concurrency scheduler has a highly modular structure. 139 It uses separate mechanisms for per-destination concurrency control 140 and for "dead destination" detection. The concurrency control in 141 turn is built from two separate mechanisms: it supports less-than-1 142 feedback per delivery to allow for more gradual concurrency 143 adjustments, and it uses feedback hysteresis to suppress concurrency 144 oscillations. And instead of waiting for delivery concurrency to 145 throttle down to zero, a destination is declared "dead" after a 146 configurable number of pseudo-cohorts reports connection or handshake 147 failure. </p> 148 149 <h3> <a name="concurrency_summary_2_5"> Summary of the Postfix 2.5 150 concurrency feedback algorithm </a> </h3> 151 152 <p> We want to increment a destination's delivery concurrency when 153 some (not necessarily consecutive) number of deliveries complete 154 without connection or handshake failure. This is implemented with 155 positive feedback g(N) where N is the destination's delivery 156 concurrency. With g(N)=1 feedback per delivery, concurrency increases 157 by 1 after each positive feedback event; this gives us the old 158 scheduler's exponential growth in time. With g(N)=1/N feedback per 159 delivery, concurrency increases by 1 after an entire pseudo-cohort 160 N of positive feedback reports; this gives us linear growth in time. 161 Less-than-1 feedback per delivery and integer truncation naturally 162 give us hysteresis, so that transitions to larger concurrency happen 163 every 1/g(N) positive feedback events. </p> 164 165 <p> We want to decrement a destination's delivery concurrency when 166 some (not necessarily consecutive) number of deliveries complete 167 after connection or handshake failure. This is implemented with 168 negative feedback f(N) where N is the destination's delivery 169 concurrency. With f(N)=1 feedback per delivery, concurrency decreases 170 by 1 after each negative feedback event; this gives us the old 171 scheduler's behavior where concurrency is throttled down dramatically 172 after a single pseudo-cohort failure. With f(N)=1/N feedback per 173 delivery, concurrency backs off more gently. Again, less-than-1 174 feedback per delivery and integer truncation naturally give us 175 hysteresis, so that transitions to lower concurrency happen every 176 1/f(N) negative feedback events. </p> 177 178 <p> However, with negative feedback we introduce a subtle twist. 179 We "reverse" the negative hysteresis cycle so that the transition 180 to lower concurrency happens at the <b>beginning</b> of a sequence 181 of 1/f(N) negative feedback events. Otherwise, a correction for 182 overload would be made too late. This makes the choice of f(N) 183 relatively unimportant, as borne out by measurements later in this 184 document. </p> 185 186 <p> In summary, the main ingredients for the Postfix 2.5 concurrency 187 feedback algorithm are a) the option of less-than-1 positive feedback 188 per delivery to avoid overwhelming servers, b) the option of 189 less-than-1 negative feedback per delivery to avoid giving up too 190 fast, c) feedback hysteresis to avoid rapid oscillation, and d) a 191 "reverse" hysteresis cycle for negative feedback, so that it can 192 correct for overload quickly. </p> 193 194 <h3> <a name="dead_summary_2_5"> Summary of the Postfix 2.5 "dead destination" detection algorithm </a> </h3> 195 196 <p> We want to suspend deliveries to a specific destination after 197 some number of deliveries suffers connection or handshake failure. 198 The old scheduler declares a destination "dead" when negative (-1) 199 feedback throttles the delivery concurrency down to zero. With 200 less-than-1 feedback per delivery, this throttling down would 201 obviously take too long. We therefore have to separate "dead 202 destination" detection from concurrency feedback. This is implemented 203 by introducing the concept of pseudo-cohort failure. The Postfix 204 2.5 concurrency scheduler declares a destination "dead" after a 205 configurable number of pseudo-cohorts suffers from connection or 206 handshake failures. The old scheduler corresponds to the special 207 case where the pseudo-cohort failure limit is equal to 1. </p> 208 209 <h3> <a name="pseudo_code_2_5"> Pseudocode for the Postfix 2.5 concurrency scheduler </a> </h3> 210 211 <p> The pseudo code shows how the ideas behind new concurrency 212 scheduler are implemented as of November 2007. The actual code can 213 be found in the module qmgr/qmgr_queue.c. </p> 214 215 <pre> 216 Types: 217 Each destination has one set of the following variables 218 int concurrency 219 double success 220 double failure 221 double fail_cohorts 222 223 Feedback functions: 224 N is concurrency; x, y are arbitrary numbers in [0..1] inclusive 225 positive feedback: g(N) = x/N | x/sqrt(N) | x 226 negative feedback: f(N) = y/N | y/sqrt(N) | y 227 228 Initialization: 229 concurrency = initial_concurrency 230 success = 0 231 failure = 0 232 fail_cohorts = 0 233 234 After success: 235 fail_cohorts = 0 236 Be prepared for feedback > hysteresis, or rounding error 237 success += g(concurrency) 238 while (success >= 1) Hysteresis 1 239 concurrency += 1 Hysteresis 1 240 failure = 0 241 success -= 1 Hysteresis 1 242 Be prepared for overshoot 243 if (concurrency > concurrency limit) 244 concurrency = concurrency limit 245 246 Safety: 247 Don't apply positive feedback unless 248 concurrency < busy_refcount + init_dest_concurrency 249 otherwise negative feedback effect could be delayed 250 251 After failure: 252 if (concurrency > 0) 253 fail_cohorts += 1.0 / concurrency 254 if (fail_cohorts > cohort_failure_limit) 255 concurrency = 0 256 if (concurrency > 0) 257 Be prepared for feedback > hysteresis, rounding errors 258 failure -= f(concurrency) 259 while (failure < 0) 260 concurrency -= 1 Hysteresis 1 261 failure += 1 Hysteresis 1 262 success = 0 263 Be prepared for overshoot 264 if (concurrency < 1) 265 concurrency = 1 266 </pre> 267 268 <h3> <a name="concurrency_results"> Results for delivery to concurrency-limited servers </a> </h3> 269 270 <p> Discussions about the concurrency scheduler redesign started 271 early 2004, when the primary goal was to find alternatives that did 272 not exhibit exponential growth or rapid concurrency throttling. No 273 code was implemented until late 2007, when the primary concern had 274 shifted towards better handling of server concurrency limits. For 275 this reason we measure how well the new scheduler does this 276 job. The table below compares mail delivery performance of the old 277 +/-1 feedback per delivery with several less-than-1 feedback 278 functions, for different limited-concurrency server scenarios. 279 Measurements were done with a FreeBSD 6.2 client and with FreeBSD 280 6.2 and various Linux servers. </p> 281 282 <p> Server configuration: </p> 283 284 <ul> <li> The mail flow was slowed down with 1 second latency per 285 recipient ("smtpd_client_restrictions = sleep 1"). The purpose was 286 to make results less dependent on hardware details, by avoiding 287 slow-downs by queue file I/O, logging I/O, and network I/O. 288 289 <li> Concurrency was limited by the server process limit 290 ("default_process_limit = 5" and "smtpd_client_event_limit_exceptions 291 = static:all"). Postfix was stopped and started after changing the 292 process limit, because the same number is also used as the backlog 293 argument to the listen(2) system call, and "postfix reload" does 294 not re-issue this call. 295 296 <li> Mail was discarded with "local_recipient_maps = static:all" and 297 "local_transport = discard". The discard action in access maps or 298 header/body checks 299 could not be used as it fails to update the in_flow_delay counters. 300 301 </ul> 302 303 <p> Client configuration: </p> 304 305 <ul> 306 307 <li> Queue file overhead was minimized by sending one message to a 308 virtual alias that expanded into 2000 different remote recipients. 309 All recipients were accounted for according to the maillog file. 310 The virtual_alias_expansion_limit setting was increased to avoid 311 complaints from the cleanup(8) server. 312 313 <li> The number of deliveries was maximized with 314 "smtp_destination_recipient_limit = 2". A smaller limit would cause 315 Postfix to schedule the concurrency per recipient instead of domain, 316 which is not what we want. 317 318 <li> Maximum concurrency was limited with 319 "smtp_destination_concurrency_limit = 20", and 320 initial_destination_concurrency was set to the same value. 321 322 <li> The positive and negative concurrency feedback hysteresis was 323 1. Concurrency was incremented by 1 at the END of 1/feedback steps 324 of positive feedback, and was decremented by 1 at the START of 325 1/feedback steps of negative feedback. 326 327 <li> The SMTP client used the default 30s SMTP connect timeout and 328 300s SMTP greeting timeout. 329 330 </ul> 331 332 <h4> Impact of the 30s SMTP connect timeout </h4> 333 334 <p> The first results are for a FreeBSD 6.2 server, where our 335 artificially low listen(2) backlog results in a very short kernel 336 queue for established connections. The table shows that all deferred 337 deliveries failed due to a 30s connection timeout, and none failed 338 due to a server greeting timeout. This measurement simulates what 339 happens when the server's connection queue is completely full under 340 load, and the TCP engine drops new connections. </p> 341 342 <blockquote> 343 344 <table> 345 346 <tr> <th>client<br> limit</th> <th>server<br> limit</th> <th>feedback<br> 347 style</th> <th>connection<br> caching</th> <th>percentage<br> 348 deferred</th> <th colspan="2">client concurrency<br> average/stddev</th> 349 <th colspan=2>timed-out in<br> connect/greeting </th> </tr> 350 351 <tr> <td align="center" colspan="9"> <hr> </td> </tr> 352 353 <tr><td align="center">20</td> <td align="center">5</td> <td 354 align="center">1/N</td> <td align="center">no</td> <td 355 align="center">9.9</td> <td align="center">19.4</td> <td 356 align="center">0.49</td> <td align="center">198</td> <td 357 align="center">-</td> </tr> 358 359 <tr><td align="center">20</td> <td align="center">5</td> <td 360 align="center">1/N</td> <td align="center">yes</td> <td 361 align="center">10.3</td> <td align="center">19.4</td> <td 362 align="center">0.49</td> <td align="center">206</td> <td 363 align="center">-</td> </tr> 364 365 <tr><td align="center">20</td> <td align="center">5</td> <td 366 align="center">1/sqrt(N)</td> <td align="center">no</td> 367 <td align="center">10.4</td> <td align="center">19.6</td> <td 368 align="center">0.59</td> <td align="center">208</td> <td 369 align="center">-</td> </tr> 370 371 <tr><td align="center">20</td> <td align="center">5</td> <td 372 align="center">1/sqrt(N)</td> <td align="center">yes</td> 373 <td align="center">10.6</td> <td align="center">19.6</td> <td 374 align="center">0.61</td> <td align="center">212</td> <td 375 align="center">-</td> </tr> 376 377 <tr><td align="center">20</td> <td align="center">5</td> <td 378 align="center">1</td> <td align="center">no</td> <td 379 align="center">10.1</td> <td align="center">19.5</td> <td 380 align="center">1.29</td> <td align="center">202</td> <td 381 align="center">-</td> </tr> 382 383 <tr><td align="center">20</td> <td align="center">5</td> <td 384 align="center">1</td> <td align="center">yes</td> <td 385 align="center">10.8</td> <td align="center">19.3</td> <td 386 align="center">1.57</td> <td align="center">216</td> <td 387 align="center">-</td> </tr> 388 389 <tr> <td align="center" colspan="9"> <hr> </td> </tr> 390 391 </table> 392 393 <p> A busy server with a completely full connection queue. N is 394 the client delivery concurrency. Failed deliveries time out after 395 30s without completing the TCP handshake. See text for a discussion 396 of results. </p> 397 398 </blockquote> 399 400 <h4> Impact of the 300s SMTP greeting timeout </h4> 401 402 <p> The next table shows results for a Fedora Core 8 server (results 403 for RedHat 7.3 are identical). In this case, the artificially small 404 listen(2) backlog argument does not impact our measurement. The 405 table shows that practically all deferred deliveries fail after the 406 300s SMTP greeting timeout. As these timeouts were 10x longer than 407 with the first measurement, we increased the recipient count (and 408 thus the running time) by a factor of 10 to keep the results 409 comparable. The deferred mail percentages are a factor 10 lower 410 than with the first measurement, because the 1s per-recipient delay 411 was 1/300th of the greeting timeout instead of 1/30th of the 412 connection timeout. </p> 413 414 <blockquote> 415 416 <table> 417 418 <tr> <th>client<br> limit</th> <th>server<br> limit</th> <th>feedback<br> 419 style</th> <th>connection<br> caching</th> <th>percentage<br> 420 deferred</th> <th colspan="2">client concurrency<br> average/stddev</th> 421 <th colspan=2>timed-out in<br> connect/greeting </th> </tr> 422 423 <tr> <td align="center" colspan="9"> <hr> </td> </tr> 424 425 <tr> <td align="center">20</td> <td align="center">5</td> <td 426 align="center">1/N</td> <td align="center">no</td> <td 427 align="center">1.16</td> <td align="center">19.8</td> <td 428 align="center">0.37</td> <td align="center">-</td> <td 429 align="center">230</td> </tr> 430 431 <tr> <td align="center">20</td> <td align="center">5</td> <td 432 align="center">1/N</td> <td align="center">yes</td> <td 433 align="center">1.36</td> <td align="center">19.8</td> <td 434 align="center">0.36</td> <td align="center">-</td> <td 435 align="center">272</td> </tr> 436 437 <tr> <td align="center">20</td> <td align="center">5</td> <td 438 align="center">1/sqrt(N)</td> <td align="center">no</td> 439 <td align="center">1.21</td> <td align="center">19.9</td> <td 440 align="center">0.23</td> <td align="center">4</td> <td 441 align="center">238</td> </tr> 442 443 <tr> <td align="center">20</td> <td align="center">5</td> <td 444 align="center">1/sqrt(N)</td> <td align="center">yes</td> 445 <td align="center">1.36</td> <td align="center">20.0</td> <td 446 align="center">0.23</td> <td align="center">-</td> <td 447 align="center">272</td> </tr> 448 449 <tr> <td align="center">20</td> <td align="center">5</td> <td 450 align="center">1</td> <td align="center">no</td> <td 451 align="center">1.18</td> <td align="center">20.0</td> <td 452 align="center">0.16</td> <td align="center">-</td> <td 453 align="center">236</td> </tr> 454 455 <tr> <td align="center">20</td> <td align="center">5</td> <td 456 align="center">1</td> <td align="center">yes</td> <td 457 align="center">1.39</td> <td align="center">20.0</td> <td 458 align="center">0.16</td> <td align="center">-</td> <td 459 align="center">278</td> </tr> 460 461 <tr> <td align="center" colspan="9"> <hr> </td> </tr> 462 463 </table> 464 465 <p> A busy server with a non-full connection queue. N is the client 466 delivery concurrency. Failed deliveries complete at the TCP level, 467 but time out after 300s while waiting for the SMTP greeting. See 468 text for a discussion of results. </p> 469 470 </blockquote> 471 472 <h4> Impact of active server concurrency limiter </h4> 473 474 <p> The final concurrency-limited result shows what happens when 475 SMTP connections don't time out, but are rejected immediately with 476 the Postfix server's smtpd_client_connection_count_limit feature 477 (the server replies with a 421 status and disconnects immediately). 478 Similar results can be expected with concurrency limiting features 479 built into other MTAs or firewalls. For this measurement we specified 480 a server concurrency limit and a client initial destination concurrency 481 of 5, and a server process limit of 10; all other conditions were 482 the same as with the first measurement. The same result would be 483 obtained with a FreeBSD or Linux server, because the "pushing back" 484 is done entirely by the receiving side. </p> 485 486 <blockquote> 487 488 <table> 489 490 <tr> <th>client<br> limit</th> <th>server<br> limit</th> <th>feedback<br> 491 style</th> <th>connection<br> caching</th> <th>percentage<br> 492 deferred</th> <th colspan="2">client concurrency<br> average/stddev</th> 493 <th>theoretical<br>defer rate</th> </tr> 494 495 <tr> <td align="center" colspan="9"> <hr> </td> </tr> 496 497 <tr> <td align="center">20</td> <td align="center">5</td> <td 498 align="center">1/N</td> <td align="center">no</td> <td 499 align="center">16.5</td> <td align="center">5.17</td> <td 500 align="center">0.38</td> <td align="center">1/6</td> </tr> 501 502 <tr> <td align="center">20</td> <td align="center">5</td> <td 503 align="center">1/N</td> <td align="center">yes</td> <td 504 align="center">16.5</td> <td align="center">5.17</td> <td 505 align="center">0.38</td> <td align="center">1/6</td> </tr> 506 507 <tr> <td align="center">20</td> <td align="center">5</td> <td 508 align="center">1/sqrt(N)</td> <td align="center">no</td> 509 <td align="center">24.5</td> <td align="center">5.28</td> <td 510 align="center">0.45</td> <td align="center">1/4</td> </tr> 511 512 <tr> <td align="center">20</td> <td align="center">5</td> <td 513 align="center">1/sqrt(N)</td> <td align="center">yes</td> 514 <td align="center">24.3</td> <td align="center">5.28</td> <td 515 align="center">0.46</td> <td align="center">1/4</td> </tr> 516 517 <tr> <td align="center">20</td> <td align="center">5</td> <td 518 align="center">1</td> <td align="center">no</td> <td 519 align="center">49.7</td> <td align="center">5.63</td> <td 520 align="center">0.67</td> <td align="center">1/2</td> </tr> 521 522 <tr> <td align="center">20</td> <td align="center">5</td> <td 523 align="center">1</td> <td align="center">yes</td> <td 524 align="center">49.7</td> <td align="center">5.68</td> <td 525 align="center">0.70</td> <td align="center">1/2</td> </tr> 526 527 <tr> <td align="center" colspan="9"> <hr> </td> </tr> 528 529 </table> 530 531 <p> A server with active per-client concurrency limiter that replies 532 with 421 and disconnects. N is the client delivery concurrency. 533 The theoretical defer rate is 1/(1+roundup(1/feedback)). This is 534 always 1/2 with the fixed +/-1 feedback per delivery; with the 535 concurrency-dependent feedback variants, the defer rate decreases 536 with increasing concurrency. See text for a discussion of results. 537 </p> 538 539 </blockquote> 540 541 <h3> <a name="concurrency_discussion"> Discussion of concurrency-limited server results </a> </h3> 542 543 <p> All results in the previous sections are based on the first 544 delivery runs only; they do not include any second etc. delivery 545 attempts. It's also worth noting that the measurements look at 546 steady-state behavior only. They don't show what happens when the 547 client starts sending at a much higher or lower concurrency. 548 </p> 549 550 <p> The first two examples show that the effect of feedback 551 is negligible when concurrency is limited due to congestion. This 552 is because the initial concurrency is already at the client's 553 concurrency maximum, and because there is 10-100 times more positive 554 than negative feedback. Under these conditions, it is no surprise 555 that the contribution from SMTP connection caching is also negligible. 556 </p> 557 558 <p> In the last example, the old +/-1 feedback per delivery will 559 defer 50% of the mail when confronted with an active (anvil-style) 560 server concurrency limit, where the server hangs up immediately 561 with a 421 status (a TCP-level RST would have the same result). 562 Less aggressive feedback mechanisms fare better than more aggressive 563 ones. Concurrency-dependent feedback fares even better at higher 564 concurrencies than shown here, but has limitations as discussed in 565 the next section. </p> 566 567 <h3> <a name="concurrency_limitations"> Limitations of less-than-1 per delivery feedback </a> </h3> 568 569 <p> Less-than-1 feedback is of interest primarily when sending large 570 amounts of mail to destinations with active concurrency limiters 571 (servers that reply with 421, or firewalls that send RST). When 572 sending small amounts of mail per destination, less-than-1 per-delivery 573 feedback won't have a noticeable effect on the per-destination 574 concurrency, because the number of deliveries to the same destination 575 is too small. You might just as well use zero per-delivery feedback 576 and stay with the initial per-destination concurrency. And when 577 mail deliveries fail due to congestion instead of active concurrency 578 limiters, the measurements above show that per-delivery feedback 579 has no effect. With large amounts of mail you might just as well 580 use zero per-delivery feedback and start with the maximal per-destination 581 concurrency. </p> 582 583 <p> The scheduler with less-than-1 concurrency 584 feedback per delivery solves a problem with servers that have active 585 concurrency limiters. This works only because feedback is handled 586 in a peculiar manner: positive feedback will increment the concurrency 587 by 1 at the <b>end</b> of a sequence of events of length 1/feedback, 588 while negative feedback will decrement concurrency by 1 at the 589 <b>beginning</b> of such a sequence. This is how Postfix adjusts 590 quickly for overshoot without causing lots of mail to be deferred. 591 Without this difference in feedback treatment, less-than-1 feedback 592 per delivery would defer 50% of the mail, and would be no better 593 in this respect than the old +/-1 feedback per delivery. </p> 594 595 <p> Unfortunately, the same feature that corrects quickly for 596 concurrency overshoot also makes the scheduler more sensitive for 597 noisy negative feedback. The reason is that one lonely negative 598 feedback event has the same effect as a complete sequence of length 599 1/feedback: in both cases delivery concurrency is dropped by 1 600 immediately. As a worst-case scenario, consider multiple servers 601 behind a load balancer on a single IP address, and no backup MX 602 address. When 1 out of K servers fails to complete the SMTP handshake 603 or drops the connection, a scheduler with 1/N (N = concurrency) 604 feedback stops increasing its concurrency once it reaches a concurrency 605 level of about K, even though the good servers behind the load 606 balancer are perfectly capable of handling more traffic. </p> 607 608 <p> This noise problem gets worse as the amount of positive feedback 609 per delivery gets smaller. A compromise is to use fixed less-than-1 610 positive feedback values instead of concurrency-dependent positive 611 feedback. For example, to tolerate 1 of 4 bad servers in the above 612 load balancer scenario, use positive feedback of 1/4 per "good" 613 delivery (no connect or handshake error), and use an equal or smaller 614 amount of negative feedback per "bad" delivery. The downside of 615 using concurrency-independent feedback is that some of the old +/-1 616 feedback problems will return at large concurrencies. Sites that 617 must deliver mail at non-trivial per-destination concurrencies will 618 require special configuration. </p> 619 620 <h3> <a name="concurrency_config"> Concurrency configuration parameters </a> </h3> 621 622 <p> The Postfix 2.5 concurrency scheduler is controlled with the 623 following configuration parameters, where "<i>transport</i>_foo" 624 provides a transport-specific parameter override. All parameter 625 default settings are compatible with earlier Postfix versions. </p> 626 627 <blockquote> 628 629 <table border="0"> 630 631 <tr> <th> Parameter name </th> <th> Postfix version </th> <th> 632 Description </th> </tr> 633 634 <tr> <td colspan="3"> <hr> </td> </tr> 635 636 <tr> <td> initial_destination_concurrency<br> 637 <i>transport</i>_initial_destination_concurrency </td> <td 638 align="center"> all<br> 2.5 </td> <td> Initial per-destination 639 delivery concurrency </td> </tr> 640 641 <tr> <td> default_destination_concurrency_limit<br> 642 <i>transport</i>_destination_concurrency_limit </td> <td align="center"> 643 all<br> all </td> <td> Maximum per-destination delivery concurrency 644 </td> </tr> 645 646 <tr> <td> default_destination_concurrency_positive_feedback<br> 647 <i>transport</i>_destination_concurrency_positive_feedback </td> 648 <td align="center"> 2.5<br> 2.5 </td> <td> Per-destination positive 649 feedback amount, per delivery that does not fail with connection 650 or handshake failure </td> </tr> 651 652 <tr> <td> default_destination_concurrency_negative_feedback<br> 653 <i>transport</i>_destination_concurrency_negative_feedback </td> 654 <td align="center"> 2.5<br> 2.5 </td> <td> Per-destination negative 655 feedback amount, per delivery that fails with connection or handshake 656 failure </td> </tr> 657 658 <tr> <td> default_destination_concurrency_failed_cohort_limit<br> 659 <i>transport</i>_destination_concurrency_failed_cohort_limit </td> 660 <td align="center"> 2.5<br> 2.5 </td> <td> Number of failed 661 pseudo-cohorts after which a destination is declared "dead" and 662 delivery is suspended </td> </tr> 663 664 <tr> <td> destination_concurrency_feedback_debug</td> <td align="center"> 665 2.5 </td> <td> Enable verbose logging of concurrency scheduler 666 activity </td> </tr> 667 668 <tr> <td colspan="3"> <hr> </td> </tr> 669 670 </table> 671 672 </blockquote> 673 674 <h2> <a name="jobs"> Preemptive scheduling </a> </h2> 675 676 <p> 677 678 The following sections describe the new queue manager and its 679 preemptive scheduler algorithm. Note that the document was originally 680 written to describe the changes between the new queue manager (in 681 this text referred to as <tt>nqmgr</tt>, the name it was known by 682 before it became the default queue manager) and the old queue manager 683 (referred to as <tt>oqmgr</tt>). This is why it refers to <tt>oqmgr</tt> 684 every so often. 685 686 </p> 687 688 <p> 689 690 This document is divided into sections as follows: 691 692 </p> 693 694 <ul> 695 696 <li> <a href="#<tt>nqmgr</tt>_structures"> The structures used by 697 nqmgr </a> 698 699 <li> <a href="#<tt>nqmgr</tt>_pickup"> What happens when nqmgr picks 700 up the message </a> - how it is assigned to transports, jobs, peers, 701 entries 702 703 <li> <a href="#<tt>nqmgr</tt>_selection"> How the entry selection 704 works </a> 705 706 <li> <a href="#<tt>nqmgr</tt>_preemption"> How the preemption 707 works </a> - what messages may be preempted and how and what messages 708 are chosen to preempt them 709 710 <li> <a href="#<tt>nqmgr</tt>_concurrency"> How destination concurrency 711 limits affect the scheduling algorithm </a> 712 713 <li> <a href="#<tt>nqmgr</tt>_memory"> Dealing with memory resource 714 limits </a> 715 716 </ul> 717 718 <h3> <a name="<tt>nqmgr</tt>_structures"> The structures used by 719 nqmgr </a> </h3> 720 721 <p> 722 723 Let's start by recapitulating the structures and terms used when 724 referring to the queue manager and how it operates. Many of these are 725 partially described elsewhere, but it is nice to have a coherent 726 overview in one place: 727 728 </p> 729 730 <ul> 731 732 <li> <p> Each message structure represents one mail message which 733 Postfix is to deliver. The message recipients specify to what 734 destinations is the message to be delivered and what transports are 735 going to be used for the delivery. </p> 736 737 <li> <p> Each recipient entry groups a batch of recipients of one 738 message which are all going to be delivered to the same destination 739 (and over the same transport). 740 </p> 741 742 <li> <p> Each transport structure groups everything what is going 743 to be delivered by delivery agents dedicated for that transport. 744 Each transport maintains a set of queues (describing the destinations 745 it shall talk to) and jobs (referencing the messages it shall 746 deliver). </p> 747 748 <li> <p> Each transport queue (not to be confused with the on-disk 749 "active" queue or "incoming" queue) groups everything what is going be 750 delivered to given destination (aka nexthop) by its transport. Each 751 queue belongs to one transport, so each destination may be referred 752 to by several queues, one for each transport. Each queue maintains 753 a list of all recipient entries (batches of message recipients) 754 which shall be delivered to given destination (the todo list), and 755 a list of recipient entries already being delivered by the delivery 756 agents (the busy list). </p> 757 758 <li> <p> Each queue corresponds to multiple peer structures. Each 759 peer structure is like the queue structure, belonging to one transport 760 and referencing one destination. The difference is that it lists 761 only the recipient entries which all originate from the same message, 762 unlike the queue structure, whose entries may originate from various 763 messages. For messages with few recipients, there is usually just 764 one recipient entry for each destination, resulting in one recipient 765 entry per peer. But for large mailing list messages the recipients 766 may need to be split to multiple recipient entries, in which case 767 the peer structure may list many entries for single destination. 768 </p> 769 770 <li> <p> Each transport job groups everything it takes to deliver 771 one message via its transport. Each job represents one message 772 within the context of the transport. The job belongs to one transport 773 and message, so each message may have multiple jobs, one for each 774 transport. The job groups all the peer structures, which describe 775 the destinations the job's message has to be delivered to. </p> 776 777 </ul> 778 779 <p> 780 781 The first four structures are common to both <tt>nqmgr</tt> and 782 <tt>oqmgr</tt>, the latter two were introduced by <tt>nqmgr</tt>. 783 784 </p> 785 786 <p> 787 788 These terms are used extensively in the text below, feel free to 789 look up the description above anytime you'll feel you have lost a 790 sense what is what. 791 792 </p> 793 794 <h3> <a name="<tt>nqmgr</tt>_pickup"> What happens when nqmgr picks 795 up the message </a> </h3> 796 797 <p> 798 799 Whenever <tt>nqmgr</tt> moves a queue file into the "active" queue, 800 the following happens: It reads all necessary information from the 801 queue file as <tt>oqmgr</tt> does, and also reads as many recipients 802 as possible - more on that later, for now let's just pretend it 803 always reads all recipients. 804 805 </p> 806 807 <p> 808 809 Then it resolves the recipients as <tt>oqmgr</tt> does, which 810 means obtaining (address, nexthop, transport) triple for each 811 recipient. For each triple, it finds the transport; if it does not 812 exist yet, it instantiates it (unless it's dead). Within the 813 transport, it finds the destination queue for the given nexthop; if it 814 does not exist yet, it instantiates it (unless it's dead). The 815 triple is then bound to given destination queue. This happens in 816 qmgr_resolve() and is basically the same as in <tt>oqmgr</tt>. 817 818 </p> 819 820 <p> 821 822 Then for each triple which was bound to some queue (and thus 823 transport), the program finds the job which represents the message 824 within that transport's context; if it does not exist yet, it 825 instantiates it. Within the job, it finds the peer which represents 826 the bound destination queue within this jobs context; if it does 827 not exist yet, it instantiates it. Finally, it stores the address 828 from the resolved triple to the recipient entry which is appended 829 to both the queue entry list and the peer entry list. The addresses 830 for the same nexthop are batched in the entries up to the 831 <i>transport</i>_destination_recipient_limit for that transport. 832 This happens in qmgr_message_assign(), and apart 833 from that it operates with job and peer structures, it is basically the 834 same as in <tt>oqmgr</tt>. 835 836 </p> 837 838 <p> 839 840 When the job is instantiated, it is enqueued on the transport's job 841 list based on the time its message was picked up by <tt>nqmgr</tt>. 842 For first batch of recipients this means it is appended to the end 843 of the job list, but the ordering of the job list by the enqueue 844 time is important as we will see shortly. 845 846 </p> 847 848 <p> 849 850 [Now you should have a pretty good idea what the state of the 851 <tt>nqmgr</tt> is after a couple of messages were picked up, and what the 852 relation is between all those job, peer, queue and entry structures.] 853 854 </p> 855 856 <h3> <a name="<tt>nqmgr</tt>_selection"> How the entry selection 857 works </a> </h3> 858 859 <p> 860 861 Having prepared all those above mentioned structures, the task of 862 the <tt>nqmgr</tt>'s scheduler is to choose the recipient entries 863 one at a time and pass them to the delivery agent for corresponding 864 transport. Now how does this work? 865 866 </p> 867 868 <p> 869 870 The first approximation of the new scheduling algorithm is like this: 871 872 </p> 873 874 <blockquote> 875 <pre> 876 foreach transport (round-robin-by-transport) 877 do 878 if transport busy continue 879 if transport process limit reached continue 880 foreach transport's job (in the order of the transport's job list) 881 do 882 foreach job's peer (round-robin-by-destination) 883 if peer->queue->concurrency < peer->queue->window 884 return next peer entry. 885 done 886 done 887 done 888 </pre> 889 </blockquote> 890 891 <p> 892 893 Now what is the "order of the transport's job list"? As we know 894 already, the job list is by default kept in the order the message 895 was picked up by the <tt>nqmgr</tt>. So by default we get the 896 top-level round-robin transport, and within each transport we get 897 the FIFO message delivery. The round-robin of the peers by the 898 destination is perhaps of little importance in most real-life cases 899 (unless the <i>transport</i>_destination_recipient_limit is reached, 900 in one job there 901 is only one peer structure for each destination), but theoretically 902 it makes sure that even within single jobs, destinations are treated 903 fairly. 904 905 </p> 906 907 <p> 908 909 [By now you should have a feeling you really know how the scheduler 910 works, except for the preemption, under ideal conditions - that is, 911 no recipient resource limits and no destination concurrency problems.] 912 913 </p> 914 915 <h3> <a name="<tt>nqmgr</tt>_preemption"> How the preemption 916 works </a> </h3> 917 918 <p> 919 920 As you might perhaps expect by now, the transport's job list does 921 not remain sorted by the job's message enqueue time all the time. 922 The most cool thing about <tt>nqmgr</tt> is not the simple FIFO 923 delivery, but that it is able to slip mail with little recipients 924 past the mailing-list bulk mail. This is what the job preemption 925 is about - shuffling the jobs on the transport's job list to get 926 the best message delivery rates. Now how is it achieved? 927 928 </p> 929 930 <p> 931 932 First I have to tell you that there are in fact two job lists in 933 each transport. One is the scheduler's job list, which the scheduler 934 is free to play with, while the other one keeps the jobs always 935 listed in the order of the enqueue time and is used for recipient 936 pool management we will discuss later. For now, we will deal with 937 the scheduler's job list only. 938 939 </p> 940 941 <p> 942 943 So, we have the job list, which is first ordered by the time the 944 jobs' messages were enqueued, oldest messages first, the most recently 945 picked one at the end. For now, let's assume that there are no 946 destination concurrency problems. Without preemption, we pick some 947 entry of the first (oldest) job on the queue, assign it to delivery 948 agent, pick another one from the same job, assign it again, and so 949 on, until all the entries are used and the job is delivered. We 950 would then move onto the next job and so on and on. Now how do we 951 manage to sneak in some entries from the recently added jobs when 952 the first job on the job list belongs to a message going to the 953 mailing-list and has thousands of recipient entries? 954 955 </p> 956 957 <p> 958 959 The <tt>nqmgr</tt>'s answer is that we can artificially "inflate" 960 the delivery time of that first job by some constant for free - it 961 is basically the same trick you might remember as "accumulation of 962 potential" from the amortized complexity lessons. For example, 963 instead of delivering the entries of the first job on the job list 964 every time a delivery agent becomes available, we can do it only 965 every second time. If you view the moments the delivery agent becomes 966 available on a timeline as "delivery slots", then instead of using 967 every delivery slot for the first job, we can use only every other 968 slot, and still the overall delivery efficiency of the first job 969 remains the same. So the delivery <tt>11112222</tt> becomes 970 <tt>1.1.1.1.2.2.2.2</tt> (1 and 2 are the imaginary job numbers, . 971 denotes the free slot). Now what do we do with free slots? 972 973 </p> 974 975 <p> 976 977 As you might have guessed, we will use them for sneaking the mail 978 with little recipients in. For example, if we have one four-recipient 979 mail followed by four one recipients mail, the delivery sequence 980 (that is, the sequence in which the jobs are assigned to the 981 delivery slots) might look like this: <tt>12131415</tt>. Hmm, fine 982 for sneaking in the single recipient mail, but how do we sneak in 983 the mail with more than one recipient? Say if we have one four-recipient 984 mail followed by two two-recipient mails? 985 986 </p> 987 988 <p> 989 990 The simple answer would be to use delivery sequence <tt>12121313</tt>. 991 But the problem is that this does not scale well. Imagine you have 992 mail with a thousand recipients followed by mail with a hundred recipients. 993 It is tempting to suggest the delivery sequence like <tt>121212....</tt>, 994 but alas! Imagine there arrives another mail with say ten recipients. 995 But there are no free slots anymore, so it can't slip by, not even 996 if it had only one recipient. It will be stuck until the 997 hundred-recipient mail is delivered, which really sucks. 998 999 </p> 1000 1001 <p> 1002 1003 So, it becomes obvious that while inflating the message to get 1004 free slots is a great idea, one has to be really careful of how the 1005 free slots are assigned, otherwise one might corner himself. So, 1006 how does <tt>nqmgr</tt> really use the free slots? 1007 1008 </p> 1009 1010 <p> 1011 1012 The key idea is that one does not have to generate the free slots 1013 in a uniform way. The delivery sequence <tt>111...1</tt> is no 1014 worse than <tt>1.1.1.1</tt>, in fact, it is even better as some 1015 entries are in the first case selected earlier than in the second 1016 case, and none is selected later! So it is possible to first 1017 "accumulate" the free delivery slots and then use them all at once. 1018 It is even possible to accumulate some, then use them, then accumulate 1019 some more and use them again, as in <tt>11..1.1</tt> . 1020 1021 </p> 1022 1023 <p> 1024 1025 Let's get back to the one hundred recipient example. We now know 1026 that we could first accumulate one hundred free slots, and only 1027 after then to preempt the first job and sneak the one hundred 1028 recipient mail in. Applying the algorithm recursively, we see the 1029 hundred recipient job can accumulate ten free delivery slots, and 1030 then we could preempt it and sneak in the ten-recipient mail... 1031 Wait wait wait! Could we? Aren't we overinflating the original one 1032 thousand recipient mail? 1033 1034 </p> 1035 1036 <p> 1037 1038 Well, despite the fact that it looks so at the first glance, another trick will 1039 allow us to answer "no, we are not!". If we had said that we will 1040 inflate the delivery time twice at maximum, and then we consider 1041 every other slot as a free slot, then we would overinflate in case 1042 of the recursive preemption. BUT! The trick is that if we use only 1043 every n-th slot as a free slot for n>2, there is always some worst 1044 inflation factor which we can guarantee not to be breached, even 1045 if we apply the algorithm recursively. To be precise, if for every 1046 k>1 normally used slots we accumulate one free delivery slot, than 1047 the inflation factor is not worse than k/(k-1) no matter how many 1048 recursive preemptions happen. And it's not worse than (k+1)/k if 1049 only non-recursive preemption happens. Now, having got through the 1050 theory and the related math, let's see how <tt>nqmgr</tt> implements 1051 this. 1052 1053 </p> 1054 1055 <p> 1056 1057 Each job has so called "available delivery slot" counter. Each 1058 transport has a <i>transport</i>_delivery_slot_cost parameter, which 1059 defaults to default_delivery_slot_cost parameter which is set to 5 1060 by default. This is the k from the paragraph above. Each time k 1061 entries of the job are selected for delivery, this counter is 1062 incremented by one. Once there are some slots accumulated, a job which 1063 requires no more than that number of slots to be fully delivered 1064 can preempt this job. 1065 1066 </p> 1067 1068 <p> 1069 1070 [Well, the truth is, the counter is incremented every time an entry 1071 is selected and it is divided by k when it is used. 1072 But to understand, it's good enough to use 1073 the above approximation of the truth.] 1074 1075 </p> 1076 1077 <p> 1078 1079 OK, so now we know the conditions which must be satisfied so one 1080 job can preempt another one. But what job gets preempted, how do 1081 we choose what job preempts it if there are several valid candidates, 1082 and when does all this exactly happen? 1083 1084 </p> 1085 1086 <p> 1087 1088 The answer for the first part is simple. The job whose entry was 1089 selected the last time is the so called current job. Normally, it is 1090 the first job on the scheduler's job list, but destination concurrency 1091 limits may change this as we will see later. It is always only the 1092 current job which may get preempted. 1093 1094 </p> 1095 1096 <p> 1097 1098 Now for the second part. The current job has a certain amount of 1099 recipient entries, and as such may accumulate at maximum some amount 1100 of available delivery slots. It might have already accumulated some, 1101 and perhaps even already used some when it was preempted before 1102 (remember a job can be preempted several times). In either case, 1103 we know how many are accumulated and how many are left to deliver, 1104 so we know how many it may yet accumulate at maximum. Every other 1105 job which may be delivered by less than that number of slots is a 1106 valid candidate for preemption. How do we choose among them? 1107 1108 </p> 1109 1110 <p> 1111 1112 The answer is - the one with maximum enqueue_time/recipient_entry_count. 1113 That is, the older the job is, the more we should try to deliver 1114 it in order to get best message delivery rates. These rates are of 1115 course subject to how many recipients the message has, therefore 1116 the division by the recipient (entry) count. No one shall be surprised 1117 that a message with n recipients takes n times longer to deliver than 1118 a message with one recipient. 1119 1120 </p> 1121 1122 <p> 1123 1124 Now let's recap the previous two paragraphs. Isn't it too complicated? 1125 Why don't the candidates come only among the jobs which can be 1126 delivered within the number of slots the current job already 1127 accumulated? Why do we need to estimate how much it has yet to 1128 accumulate? If you found out the answer, congratulate yourself. If 1129 we did it this simple way, we would always choose the candidate 1130 with the fewest recipient entries. If there were enough single recipient 1131 mails coming in, they would always slip by the bulk mail as soon 1132 as possible, and the two or more recipients mail would never get 1133 a chance, no matter how long they have been sitting around in the 1134 job list. 1135 1136 </p> 1137 1138 <p> 1139 1140 This candidate selection has an interesting implication - that when 1141 we choose the best candidate for preemption (this is done in 1142 qmgr_choose_candidate()), it may happen that we may not use it for 1143 preemption immediately. This leads to an answer to the last part 1144 of the original question - when does the preemption happen? 1145 1146 </p> 1147 1148 <p> 1149 1150 The preemption attempt happens every time next transport's recipient 1151 entry is to be chosen for delivery. To avoid needless overhead, the 1152 preemption is not attempted if the current job could never accumulate 1153 more than <i>transport</i>_minimum_delivery_slots (defaults to 1154 default_minimum_delivery_slots which defaults to 3). If there are 1155 already enough accumulated slots to preempt the current job by the 1156 chosen best candidate, it is done immediately. This basically means 1157 that the candidate is moved in front of the current job on the 1158 scheduler's job list and decreasing the accumulated slot counter 1159 by the amount used by the candidate. If there are not enough slots... 1160 well, I could say that nothing happens and the another preemption 1161 is attempted the next time. But that's not the complete truth. 1162 1163 </p> 1164 1165 <p> 1166 1167 The truth is that it turns out that it is not really necessary to 1168 wait until the jobs counter accumulates all the delivery slots in 1169 advance. Say we have ten-recipient mail followed by two two-recipient 1170 mails. If the preemption happened when enough delivery slots accumulate 1171 (assuming slot cost 2), the delivery sequence becomes 1172 <tt>11112211113311</tt>. Now what would we get if we would wait 1173 only for 50% of the necessary slots to accumulate and we promise 1174 we would wait for the remaining 50% later, after we get back 1175 to the preempted job? If we use such a slot loan, the delivery sequence 1176 becomes <tt>11221111331111</tt>. As we can see, it makes it not 1177 considerably worse for the delivery of the ten-recipient mail, but 1178 it allows the small messages to be delivered sooner. 1179 1180 </p> 1181 1182 <p> 1183 1184 The concept of these slot loans is where the 1185 <i>transport</i>_delivery_slot_discount and 1186 <i>transport</i>_delivery_slot_loan come from (they default to 1187 default_delivery_slot_discount and default_delivery_slot_loan, whose 1188 values are by default 50 and 3, respectively). The discount (resp. 1189 loan) specifies how many percent (resp. how many slots) one "gets 1190 in advance", when the number of slots required to deliver the best 1191 candidate is compared with the number of slots the current slot had 1192 accumulated so far. 1193 1194 </p> 1195 1196 <p> 1197 1198 And that pretty much concludes this chapter. 1199 1200 </p> 1201 1202 <p> 1203 1204 [Now you should have a feeling that you pretty much understand the 1205 scheduler and the preemption, or at least that you will have 1206 after you read the last chapter a couple more times. You shall clearly 1207 see the job list and the preemption happening at its head, in ideal 1208 delivery conditions. The feeling of understanding shall last until 1209 you start wondering what happens if some of the jobs are blocked, 1210 which you might eventually figure out correctly from what had been 1211 said already. But I would be surprised if your mental image of the 1212 scheduler's functionality is not completely shattered once you 1213 start wondering how it works when not all recipients may be read 1214 in-core. More on that later.] 1215 1216 </p> 1217 1218 <h3> <a name="<tt>nqmgr</tt>_concurrency"> How destination concurrency 1219 limits affect the scheduling algorithm </a> </h3> 1220 1221 <p> 1222 1223 The <tt>nqmgr</tt> uses the same algorithm for destination concurrency 1224 control as <tt>oqmgr</tt>. Now what happens when the destination 1225 limits are reached and no more entries for that destination may be 1226 selected by the scheduler? 1227 1228 </p> 1229 1230 <p> 1231 1232 From the user's point of view it is all simple. If some of the peers 1233 of a job can't be selected, those peers are simply skipped by the 1234 entry selection algorithm (the pseudo-code described before) and 1235 only the selectable ones are used. If none of the peers may be 1236 selected, the job is declared a "blocker job". Blocker jobs are 1237 skipped by the entry selection algorithm and they are also excluded 1238 from the candidates for preemption of the current job. Thus the scheduler 1239 effectively behaves as if the blocker jobs didn't exist on the job 1240 list at all. As soon as at least one of the peers of a blocker job 1241 becomes unblocked (that is, the delivery agent handling the delivery 1242 of the recipient entry for the given destination successfully finishes), 1243 the job's blocker status is removed and the job again participates 1244 in all further scheduler actions normally. 1245 1246 </p> 1247 1248 <p> 1249 1250 So the summary is that the users don't really have to be concerned 1251 about the interaction of the destination limits and scheduling 1252 algorithm. It works well on its own and there are no knobs they 1253 would need to control it. 1254 1255 </p> 1256 1257 <p> 1258 1259 From a programmer's point of view, the blocker jobs complicate the 1260 scheduler quite a lot. Without them, the jobs on the job list would 1261 be normally delivered in strict FIFO order. If the current job is 1262 preempted, the job preempting it is completely delivered unless it 1263 is preempted itself. Without blockers, the current job is thus 1264 always either the first job on the job list, or the top of the stack 1265 of jobs preempting the first job on the job list. 1266 1267 </p> 1268 1269 <p> 1270 1271 The visualization of the job list and the preemption stack without 1272 blockers would be like this: 1273 1274 </p> 1275 1276 <blockquote> 1277 <pre> 1278 first job-> 1--2--3--5--6--8--... <- job list 1279 on job list | 1280 4 <- preemption stack 1281 | 1282 current job-> 7 1283 </pre> 1284 </blockquote> 1285 1286 <p> 1287 1288 In the example above we see that job 1 was preempted by job 4 and 1289 then job 4 was preempted by job 7. After job 7 is completed, remaining 1290 entries of job 4 are selected, and once they are all selected, job 1291 1 continues. 1292 1293 </p> 1294 1295 <p> 1296 1297 As we see, it's all very clean and straightforward. Now how does 1298 this change because of blockers? 1299 1300 </p> 1301 1302 <p> 1303 1304 The answer is: a lot. Any job may become a blocker job at any time, 1305 and also become a normal job again at any time. This has several 1306 important implications: 1307 1308 </p> 1309 1310 <ol> 1311 1312 <li> <p> 1313 1314 The jobs may be completed in arbitrary order. For example, in the 1315 example above, if the current job 7 becomes blocked, the next job 1316 4 may complete before the job 7 becomes unblocked again. Or if both 1317 7 and 4 are blocked, then 1 is completed, then 7 becomes unblocked 1318 and is completed, then 2 is completed and only after that 4 becomes 1319 unblocked and is completed... You get the idea. 1320 1321 </p> 1322 1323 <p> 1324 1325 [Interesting side note: even when jobs are delivered out of order, 1326 from a single destination's point of view the jobs are still delivered 1327 in the expected order (that is, FIFO unless there was some preemption 1328 involved). This is because whenever a destination queue becomes 1329 unblocked (the destination limit allows selection of more recipient 1330 entries for that destination), all jobs which have peers for that 1331 destination are unblocked at once.] 1332 1333 </p> 1334 1335 <li> <p> 1336 1337 The idea of the preemption stack at the head of the job list is 1338 gone. That is, it must be possible to preempt any job on the job 1339 list. For example, if the jobs 7, 4, 1 and 2 in the example above 1340 become all blocked, job 3 becomes the current job. And of course 1341 we do not want the preemption to be affected by the fact that there 1342 are some blocked jobs or not. Therefore, if it turns out that job 1343 3 might be preempted by job 6, the implementation shall make it 1344 possible. 1345 1346 </p> 1347 1348 <li> <p> 1349 1350 The idea of the linear preemption stack itself is gone. It's no 1351 longer true that one job is always preempted by only one job at one 1352 time (that is directly preempted, not counting the recursively 1353 nested jobs). For example, in the example above, job 1 is directly 1354 preempted by only job 4, and job 4 by job 7. Now assume job 7 becomes 1355 blocked, and job 4 is being delivered. If it accumulates enough 1356 delivery slots, it is natural that it might be preempted for example 1357 by job 8. Now job 4 is preempted by both job 7 AND job 8 at the 1358 same time. 1359 1360 </p> 1361 1362 </ol> 1363 1364 <p> 1365 1366 Now combine the points 2) and 3) with point 1) again and you realize 1367 that the relations on the once linear job list became pretty 1368 complicated. If we extend the point 3) example: jobs 7 and 8 preempt 1369 job 4, now job 8 becomes blocked too, then job 4 completes. Tricky, 1370 huh? 1371 1372 </p> 1373 1374 <p> 1375 1376 If I illustrate the relations after the above mentioned examples 1377 (but those in point 1), the situation would look like this: 1378 1379 </p> 1380 1381 <blockquote> 1382 <pre> 1383 v- parent 1384 1385 adoptive parent -> 1--2--3--5--... <- "stack" level 0 1386 | | 1387 parent gone -> ? 6 <- "stack" level 1 1388 / \ 1389 children -> 7 8 ^- child <- "stack" level 2 1390 1391 ^- siblings 1392 </pre> 1393 </blockquote> 1394 1395 <p> 1396 1397 Now how does <tt>nqmgr</tt> deal with all these complicated relations? 1398 1399 </p> 1400 1401 <p> 1402 1403 Well, it maintains them all as described, but fortunately, all these 1404 relations are necessary only for the purpose of proper counting of 1405 available delivery slots. For the purpose of ordering the jobs for 1406 entry selection, the original rule still applies: "the job preempting 1407 the current job is moved in front of the current job on the job 1408 list". So for entry selection purposes, the job relations remain 1409 as simple as this: 1410 1411 </p> 1412 1413 <blockquote> 1414 <pre> 1415 7--8--1--2--6--3--5--.. <- scheduler's job list order 1416 </pre> 1417 </blockquote> 1418 1419 <p> 1420 1421 The job list order and the preemption parent/child/siblings relations 1422 are maintained separately. And because the selection works only 1423 with the job list, you can happily forget about those complicated 1424 relations unless you want to study the <tt>nqmgr</tt> sources. In 1425 that case the text above might provide some helpful introduction 1426 to the problem domain. Otherwise I suggest you just forget about 1427 all this and stick with the user's point of view: the blocker jobs 1428 are simply ignored. 1429 1430 </p> 1431 1432 <p> 1433 1434 [By now, you should have a feeling that there are more things going 1435 on under the hood than you ever wanted to know. You decide that 1436 forgetting about this chapter is the best you can do for the sake 1437 of your mind's health and you basically stick with the idea how the 1438 scheduler works in ideal conditions, when there are no blockers, 1439 which is good enough.] 1440 1441 </p> 1442 1443 <h3> <a name="<tt>nqmgr</tt>_memory"> Dealing with memory resource 1444 limits </a> </h3> 1445 1446 <p> 1447 1448 When discussing the <tt>nqmgr</tt> scheduler, we have so far assumed 1449 that all recipients of all messages in the "active" queue are completely 1450 read into memory. This is simply not true. There is an upper 1451 bound on the amount of memory the <tt>nqmgr</tt> may use, and 1452 therefore it must impose some limits on the information it may store 1453 in memory at any given time. 1454 1455 </p> 1456 1457 <p> 1458 1459 First of all, not all messages may be read in-core at once. At any 1460 time, only qmgr_message_active_limit messages may be read in-core 1461 at maximum. When read into memory, the messages are picked from the 1462 "incoming" and "deferred" queues and moved to the "active" queue 1463 (incoming having priority), so if there are more than 1464 qmgr_message_active_limit messages queued in the "active" queue, the 1465 rest will have to wait until (some of) the messages in the "active" queue 1466 are completely delivered (or deferred). 1467 1468 </p> 1469 1470 <p> 1471 1472 Even with the limited amount of in-core messages, there is another 1473 limit which must be imposed in order to avoid memory exhaustion. 1474 Each message may contain a huge number of recipients (tens or hundreds 1475 of thousands are not uncommon), so if <tt>nqmgr</tt> read all 1476 recipients of all messages in the "active" queue, it may easily run 1477 out of memory. Therefore there must be some upper bound on the 1478 amount of message recipients which are read into memory at the 1479 same time. 1480 1481 </p> 1482 1483 <p> 1484 1485 Before discussing how exactly <tt>nqmgr</tt> implements the recipient 1486 limits, let's see how the sole existence of the limits themselves 1487 affects the <tt>nqmgr</tt> and its scheduler. 1488 1489 </p> 1490 1491 <p> 1492 1493 The message limit is straightforward - it just limits the size of 1494 the 1495 lookahead the <tt>nqmgr</tt>'s scheduler has when choosing which 1496 message can preempt the current one. Messages not in the "active" queue 1497 are simply not considered at all. 1498 1499 </p> 1500 1501 <p> 1502 1503 The recipient limit complicates more things. First of all, the 1504 message reading code must support reading the recipients in batches, 1505 which among other things means accessing the queue file several 1506 times and continuing where the last recipient batch ended. This is 1507 invoked by the scheduler whenever the current job has space for more 1508 recipients, subject to transport's refill_limit and refill_delay parameters. 1509 It is also done any time when all 1510 in-core recipients of the message are dealt with (which may also 1511 mean they were deferred) but there are still more in the queue file. 1512 1513 </p> 1514 1515 <p> 1516 1517 The second complication is that with some recipients left unread 1518 in the queue file, the scheduler can't operate with exact counts 1519 of recipient entries. With unread recipients, it is not clear how 1520 many recipient entries there will be, as they are subject to 1521 per-destination grouping. It is not even clear to what transports 1522 (and thus jobs) the recipients will be assigned. And with messages 1523 coming from the "deferred" queue, it is not even clear how many unread 1524 recipients are still to be delivered. This all means that the 1525 scheduler must use only estimates of how many recipients entries 1526 there will be. Fortunately, it is possible to estimate the minimum 1527 and maximum correctly, so the scheduler can always err on the safe 1528 side. Obviously, the better the estimates, the better the results, so 1529 it is best when we are able to read all recipients in-core and turn 1530 the estimates into exact counts, or at least try to read as many 1531 as possible to make the estimates as accurate as possible. 1532 1533 </p> 1534 1535 <p> 1536 1537 The third complication is that it is no longer true that the scheduler 1538 is done with a job once all of its in-core recipients are delivered. 1539 It is possible that the job will be revived later, when another 1540 batch of recipients is read in core. It is also possible that some 1541 jobs will be created for the first time long after the first batch 1542 of recipients was read in core. The <tt>nqmgr</tt> code must be 1543 ready to handle all such situations. 1544 1545 </p> 1546 1547 <p> 1548 1549 And finally, the fourth complication is that the <tt>nqmgr</tt> 1550 code must somehow impose the recipient limit itself. Now how does 1551 it achieve it? 1552 1553 </p> 1554 1555 <p> 1556 1557 Perhaps the easiest solution would be to say that each message may 1558 have at maximum X recipients stored in-core, but such a solution would 1559 be poor for several reasons. With reasonable qmgr_message_active_limit 1560 values, the X would have to be quite low to maintain a reasonable 1561 memory footprint. And with low X lots of things would not work well. 1562 The <tt>nqmgr</tt> would have problems to use the 1563 <i>transport</i>_destination_recipient_limit efficiently. The 1564 scheduler's preemption would be suboptimal as the recipient count 1565 estimates would be inaccurate. The message queue file would have 1566 to be accessed many times to read in more recipients again and 1567 again. 1568 1569 </p> 1570 1571 <p> 1572 1573 Therefore it seems reasonable to have a solution which does not use 1574 a limit imposed on a per-message basis, but which maintains a pool 1575 of available recipient slots, which can be shared among all messages 1576 in the most efficient manner. And as we do not want separate 1577 transports to compete for resources whenever possible, it seems 1578 appropriate to maintain such a recipient pool for each transport 1579 separately. This is the general idea, now how does it work in 1580 practice? 1581 1582 </p> 1583 1584 <p> 1585 1586 First we have to solve a little chicken-and-egg problem. If we want 1587 to use the per-transport recipient pools, we first need to know to 1588 what transport(s) the message is assigned. But we will find that 1589 out only after we first read in the recipients. So it is obvious 1590 that we first have to read in some recipients, use them to find out 1591 to what transports the message is to be assigned, and only after 1592 that can we use the per-transport recipient pools. 1593 1594 </p> 1595 1596 <p> 1597 1598 Now how many recipients shall we read for the first time? This is 1599 what qmgr_message_recipient_minimum and qmgr_message_recipient_limit 1600 values control. The qmgr_message_recipient_minimum value specifies 1601 how many recipients of each message we will read the first time, 1602 no matter what. It is necessary to read at least one recipient 1603 before we can assign the message to a transport and create the first 1604 job. However, reading only qmgr_message_recipient_minimum recipients 1605 even if there are only few messages with few recipients in-core would 1606 be wasteful. Therefore if there are fewer than qmgr_message_recipient_limit 1607 recipients in-core so far, the first batch of recipients may be 1608 larger than qmgr_message_recipient_minimum - as large as is required 1609 to reach the qmgr_message_recipient_limit limit. 1610 1611 </p> 1612 1613 <p> 1614 1615 Once the first batch of recipients was read in core and the message 1616 jobs were created, the size of the subsequent recipient batches (if 1617 any - of course it's best when all recipients are read in one batch) 1618 is based solely on the position of the message jobs on their 1619 corresponding transports' job lists. Each transport has a pool of 1620 <i>transport</i>_recipient_limit recipient slots which it can 1621 distribute among its jobs (how this is done is described later). 1622 The subsequent recipient batch may be as large as the sum of all 1623 recipient slots of all jobs of the message permits (plus the 1624 qmgr_message_recipient_minimum amount which always applies). 1625 1626 </p> 1627 1628 <p> 1629 1630 For example, if a message has three jobs, the first with 1 recipient 1631 still in-core and 4 recipient slots, the second with 5 recipients in-core 1632 and 5 recipient slots, and the third with 2 recipients in-core and 0 1633 recipient slots, it has 1+5+2=8 recipients in-core and 4+5+0=9 jobs' 1634 recipients slots in total. This means that we could immediately 1635 read 2+qmgr_message_recipient_minimum more recipients of that message 1636 in core. 1637 1638 </p> 1639 1640 <p> 1641 1642 The above example illustrates several things which might be worth 1643 mentioning explicitly: first, note that although the per-transport 1644 slots are assigned to particular jobs, we can't guarantee that once 1645 the next batch of recipients is read in core, that the corresponding 1646 amounts of recipients will be assigned to those jobs. The jobs lend 1647 its slots to the message as a whole, so it is possible that some 1648 jobs end up sponsoring other jobs of their message. For example, 1649 if in the example above the 2 newly read recipients were assigned 1650 to the second job, the first job sponsored the second job with 2 1651 slots. The second notable thing is the third job, which has more 1652 recipients in-core than it has slots. Apart from sponsoring by other 1653 job we just saw it can be result of the first recipient batch, which 1654 is sponsored from global recipient pool of qmgr_message_recipient_limit 1655 recipients. It can be also sponsored from the message recipient 1656 pool of qmgr_message_recipient_minimum recipients. 1657 1658 </p> 1659 1660 <p> 1661 1662 Now how does each transport distribute the recipient slots among 1663 its jobs? The strategy is quite simple. As most scheduler activity 1664 happens on the head of the job list, it is our intention to make 1665 sure that the scheduler has the best estimates of the recipient 1666 counts for those jobs. As we mentioned above, this means that we 1667 want to try to make sure that the messages of those jobs have all 1668 recipients read in-core. Therefore the transport distributes the 1669 slots "along" the job list from start to end. In this case the job 1670 list sorted by message enqueue time is used, because it doesn't 1671 change over time as the scheduler's job list does. 1672 1673 </p> 1674 1675 <p> 1676 1677 More specifically, each time a job is created and appended to the 1678 job list, it gets all unused recipient slots from its transport's 1679 pool. It keeps them until all recipients of its message are read. 1680 When this happens, all unused recipient slots are transferred to 1681 the next job (which is now in fact the first such job) on the job 1682 list which still has some recipients unread, or eventually back to 1683 the transport pool if there is no such job. Such a transfer then also 1684 happens whenever a recipient entry of that job is delivered. 1685 1686 </p> 1687 1688 <p> 1689 1690 There is also a scenario when a job is not appended to the end of 1691 the job list (for example it was created as a result of a second or 1692 later recipient batch). Then it works exactly as above, except that 1693 if it was put in front of the first unread job (that is, the job 1694 of a message which still has some unread recipients in the queue file), 1695 that job is first forced to return all of its unused recipient slots 1696 to the transport pool. 1697 1698 </p> 1699 1700 <p> 1701 1702 The algorithm just described leads to the following state: The first 1703 unread job on the job list always gets all the remaining recipient 1704 slots of that transport (if there are any). The jobs queued before 1705 this job are completely read (that is, all recipients of their 1706 message were already read in core) and have at maximum as many slots 1707 as they still have recipients in-core (the maximum is there because 1708 of the sponsoring mentioned before) and the jobs after this job get 1709 nothing from the transport recipient pool (unless they got something 1710 before and then the first unread job was created and enqueued in 1711 front of them later - in such a case, they also get at maximum as many 1712 slots as they have recipients in-core). 1713 1714 </p> 1715 1716 <p> 1717 1718 Things work fine in such a state for most of the time, because the 1719 current job is either completely read in-core or has as many recipient 1720 slots as there are, but there is one situation which we still have 1721 to take care of specially. Imagine if the current job is preempted 1722 by some unread job from the job list and there are no more recipient 1723 slots available, so this new current job could read only batches 1724 of qmgr_message_recipient_minimum recipients at a time. This would 1725 really degrade performance. For this reason, each transport has an 1726 extra pool of <i>transport</i>_extra_recipient_limit recipient 1727 slots, dedicated exactly for this situation. Each time an unread 1728 job preempts the current job, it gets half of the remaining recipient 1729 slots from the normal pool and this extra pool. 1730 1731 </p> 1732 1733 <p> 1734 1735 And that's it. It sure does sound pretty complicated, but fortunately 1736 most people don't really have to care exactly how it works as long 1737 as it works. Perhaps the only important things to know for most 1738 people are the following upper bound formulas: 1739 1740 </p> 1741 1742 <p> 1743 1744 Each transport has at maximum 1745 1746 </p> 1747 1748 <blockquote> 1749 <pre> 1750 max( 1751 qmgr_message_recipient_minimum * qmgr_message_active_limit 1752 + *_recipient_limit + *_extra_recipient_limit, 1753 qmgr_message_recipient_limit 1754 ) 1755 </pre> 1756 </blockquote> 1757 1758 <p> 1759 1760 recipients in core. 1761 1762 </p> 1763 1764 <p> 1765 1766 The total amount of recipients in core is 1767 1768 </p> 1769 1770 <blockquote> 1771 <pre> 1772 max( 1773 qmgr_message_recipient_minimum * qmgr_message_active_limit 1774 + sum( *_recipient_limit + *_extra_recipient_limit ), 1775 qmgr_message_recipient_limit 1776 ) 1777 </pre> 1778 </blockquote> 1779 1780 <p> 1781 1782 where the sum is over all used transports. 1783 1784 </p> 1785 1786 <p> 1787 1788 And this terribly complicated chapter concludes the documentation 1789 of the <tt>nqmgr</tt> scheduler. 1790 1791 </p> 1792 1793 <p> 1794 1795 [By now you should theoretically know the <tt>nqmgr</tt> scheduler 1796 inside out. In practice, you still hope that you will never have 1797 to really understand the last or last two chapters completely, and 1798 fortunately most people really won't. Understanding how the scheduler 1799 works in ideal conditions is more than good enough for the vast majority 1800 of users.] 1801 1802 </p> 1803 1804 <h2> <a name="credits"> Credits </a> </h2> 1805 1806 <ul> 1807 1808 <li> Wietse Venema designed and implemented the initial queue manager 1809 with per-domain FIFO scheduling, and per-delivery +/-1 concurrency 1810 feedback. 1811 1812 <li> Patrik Rak designed and implemented preemption where mail with 1813 fewer recipients can slip past mail with more recipients in a 1814 controlled manner, and wrote up its documentation. 1815 1816 <li> Wietse Venema initiated a discussion with Patrik Rak and Victor 1817 Duchovni on alternatives for the +/-1 feedback scheduler's aggressive 1818 behavior. This is when K/N feedback was reviewed (N = concurrency). 1819 The discussion ended without a good solution for both negative 1820 feedback and dead site detection. 1821 1822 <li> Victor Duchovni resumed work on concurrency feedback in the 1823 context of concurrency-limited servers. 1824 1825 <li> Wietse Venema then re-designed the concurrency scheduler in 1826 terms of the simplest possible concepts: less-than-1 concurrency 1827 feedback per delivery, forward and reverse concurrency feedback 1828 hysteresis, and pseudo-cohort failure. At this same time, concurrency 1829 feedback was separated from dead site detection. 1830 1831 <li> These simplifications, and their modular implementation, helped 1832 to develop further insights into the different roles that positive 1833 and negative concurrency feedback play, and helped to identify some 1834 worst-case scenarios. 1835 1836 </ul> 1837 1838 </body> 1839 1840 </html> 1841