<!doctype html public "-//W3C//DTD HTML 4.01 Transitional//EN"
	"https://www.w3.org/TR/html4/loose.dtd">

<html>

<head>

<title>Postfix Queue Scheduler</title>

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<link rel='stylesheet' type='text/css' href='postfix-doc.css'>

</head>

<body>

<h1><img src="postfix-logo.jpg" width="203" height="98" ALT="">Postfix
Queue Scheduler</h1>

<hr>

<h2> Disclaimer </h2>

<p> Many of the <i>transport</i>-specific configuration parameters
discussed in this document will not show up in "postconf" command
output before Postfix version 2.9. This limitation applies to many
parameters whose name is a combination of a master.cf service name
such as "relay" and a built-in suffix such as
"_destination_concurrency_limit". </p>

<h2> Overview </h2>

<p> The queue manager is by far the most complex part of the Postfix
mail system. It schedules delivery of new mail, retries failed
deliveries at specific times, and removes mail from the queue after
the last delivery attempt.  There are two major classes of mechanisms
that control the operation of the queue manager. </p>

<p> Topics covered by this document: </p>

<ul>

<li> <a href="#concurrency"> Concurrency scheduling</a>, concerned
with the number of concurrent deliveries to a specific destination,
including decisions on when to suspend deliveries after persistent
failures.

<li> <a href="#jobs"> Preemptive scheduling</a>, concerned with
the selection of email messages and recipients for a given destination.

<li> <a href="#credits"> Credits</a>, something this document would not be
complete without.

</ul>

<!--

<p> Once started, the qmgr(8) process runs until "postfix reload"
or "postfix stop".  As a persistent process, the queue manager has
to meet strict requirements with respect to code correctness and
robustness. Unlike non-persistent daemon processes, the queue manager
cannot benefit from Postfix's process rejuvenation mechanism that
limits the impact of resource leaks and other coding errors
(translation: replacing a process after a short time covers up bugs
before they can become a problem).  </p>

-->

<h2> <a name="concurrency"> Concurrency scheduling </a> </h2>

<p> The following sections document the Postfix 2.5 concurrency
scheduler, after a discussion of the limitations of the earlier
concurrency scheduler. This is followed by results of medium-concurrency
experiments, and a discussion of trade-offs between performance and
robustness.  </p>

<p> The material is organized as follows: </p>

<ul>

<li> <a href="#concurrency_drawbacks"> Drawbacks of the existing
concurrency scheduler </a>

<li> <a href="#concurrency_summary_2_5"> Summary of the Postfix 2.5
concurrency feedback algorithm </a>

<li> <a href="#dead_summary_2_5"> Summary of the Postfix 2.5 "dead
destination" detection algorithm </a>

<li> <a href="#pseudo_code_2_5"> Pseudocode for the Postfix 2.5
concurrency scheduler </a>

<li> <a href="#concurrency_results"> Results for delivery to
concurrency limited servers </a>

<li> <a href="#concurrency_discussion"> Discussion of concurrency
limited server results </a>

<li> <a href="#concurrency_limitations"> Limitations of less-than-1
per delivery feedback </a>

<li> <a href="#concurrency_config"> Concurrency configuration
parameters </a>

</ul>

<h3> <a name="concurrency_drawbacks"> Drawbacks of the existing
concurrency scheduler </a> </h3>

<p> From the start, Postfix has used a simple but robust algorithm
where the per-destination delivery concurrency is decremented by 1
after delivery failed due to connection or handshake failure, and
incremented by 1 otherwise.  Of course the concurrency is never
allowed to exceed the maximum per-destination concurrency limit.
And when a destination's concurrency level drops to zero, the
destination is declared "dead" and delivery is suspended.  </p>

<p> Drawbacks of +/-1 concurrency feedback per delivery are: </p>

<ul>

<li> <p> Overshoot due to exponential delivery concurrency growth
with each pseudo-cohort(*). This can be an issue with high-concurrency
channels. For example, with the default initial concurrency of 5,
concurrency would proceed over time as (5-10-20).  </p>

<li> <p> Throttling down to zero concurrency after a single
pseudo-cohort(*) failure. This was especially an issue with
low-concurrency channels where a single failure could be sufficient
to mark a destination as "dead", causing the suspension of further
deliveries to the affected destination. </p>

</ul>

<p> (*) A pseudo-cohort is a number of delivery requests equal to
a destination's delivery concurrency. </p>

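<p> The (5-10-20) progression can be reproduced with a few lines of
Python. This is only an illustrative sketch, not Postfix code: the
function name and its arguments are hypothetical, and it assumes
that every delivery in every pseudo-cohort succeeds. </p>

```python
def old_scheduler_growth(initial, limit, cohorts):
    """Concurrency after each pseudo-cohort under +1 feedback per delivery.

    Hypothetical helper for illustration; assumes every delivery succeeds.
    """
    concurrency = initial
    history = [concurrency]
    for _ in range(cohorts):
        # A pseudo-cohort is as many deliveries as the current concurrency.
        for _ in range(concurrency):
            # Each successful delivery adds 1, capped at the limit.
            concurrency = min(concurrency + 1, limit)
        history.append(concurrency)
    return history


print(old_scheduler_growth(initial=5, limit=100, cohorts=2))  # [5, 10, 20]
```

<p> Because each pseudo-cohort contributes as many +1 steps as there
are deliveries in it, the concurrency doubles per pseudo-cohort until
it reaches the limit. </p>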
<p> The revised concurrency scheduler has a highly modular structure.
It uses separate mechanisms for per-destination concurrency control
and for "dead destination" detection.  The concurrency control in
turn is built from two separate mechanisms: it supports less-than-1
feedback per delivery to allow for more gradual concurrency
adjustments, and it uses feedback hysteresis to suppress concurrency
oscillations.  And instead of waiting for delivery concurrency to
throttle down to zero, a destination is declared "dead" after a
configurable number of pseudo-cohorts report connection or handshake
failure.  </p>

<h3> <a name="concurrency_summary_2_5"> Summary of the Postfix 2.5
concurrency feedback algorithm </a> </h3>

<p> We want to increment a destination's delivery concurrency when
some (not necessarily consecutive) number of deliveries complete
without connection or handshake failure.  This is implemented with
positive feedback g(N) where N is the destination's delivery
concurrency.  With g(N)=1 feedback per delivery, concurrency increases
by 1 after each positive feedback event; this gives us the old
scheduler's exponential growth in time. With g(N)=1/N feedback per
delivery, concurrency increases by 1 after an entire pseudo-cohort
N of positive feedback reports; this gives us linear growth in time.
Less-than-1 feedback per delivery and integer truncation naturally
give us hysteresis, so that transitions to larger concurrency happen
every 1/g(N) positive feedback events.  </p>

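<p> To make the difference between g(N)=1 and g(N)=1/N concrete, the
following sketch counts how many successful deliveries are needed to
grow from concurrency 5 to 20. All names are hypothetical, and exact
fractions are used instead of floating point so that the counts are
not affected by rounding. </p>

```python
from fractions import Fraction


def unit_feedback(n):
    """g(N) = 1: the old +1 per delivery behavior."""
    return Fraction(1)


def inverse_feedback(n):
    """g(N) = 1/N: one increment per pseudo-cohort of successes."""
    return Fraction(1, n)


def deliveries_to_reach(start, target, g):
    """Count successful deliveries needed to grow from start to target."""
    concurrency, success, deliveries = start, Fraction(0), 0
    while concurrency < target:
        deliveries += 1
        success += g(concurrency)
        # Only whole units of accumulated feedback change the concurrency;
        # this truncation is what produces the hysteresis.
        while success >= 1:
            concurrency += 1
            success -= 1
    return deliveries


print(deliveries_to_reach(5, 20, unit_feedback))     # 15
print(deliveries_to_reach(5, 20, inverse_feedback))  # 180
```

<p> With g(N)=1/N the 180 deliveries are the sum 5+6+...+19, that is,
one full pseudo-cohort for each increment. </p>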
<p> We want to decrement a destination's delivery concurrency when
some (not necessarily consecutive) number of deliveries complete
after connection or handshake failure.  This is implemented with
negative feedback f(N) where N is the destination's delivery
concurrency.  With f(N)=1 feedback per delivery, concurrency decreases
by 1 after each negative feedback event; this gives us the old
scheduler's behavior where concurrency is throttled down dramatically
after a single pseudo-cohort failure.  With f(N)=1/N feedback per
delivery, concurrency backs off more gently.  Again, less-than-1
feedback per delivery and integer truncation naturally give us
hysteresis, so that transitions to lower concurrency happen every
1/f(N) negative feedback events.  </p>

<p> However, with negative feedback we introduce a subtle twist.
We "reverse" the negative hysteresis cycle so that the transition
to lower concurrency happens at the <b>beginning</b> of a sequence
of 1/f(N) negative feedback events.  Otherwise, a correction for
overload would be made too late.  This makes the choice of f(N)
relatively unimportant, as borne out by measurements later in this
document.  </p>

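<p> The effect of the reversed cycle can be seen by listing which
negative feedback events trigger a decrement. The sketch below is a
simplification with hypothetical names: it holds the feedback value
f constant (the "y" form of f(N)) and compares the reversed cycle
with a conventional one that decrements at the end. </p>

```python
from fractions import Fraction


def decrement_events(f, events, reverse=True):
    """Return the 1-based failure events at which concurrency drops by 1.

    f is a constant negative feedback value per failed delivery.
    """
    failure, drops = Fraction(0), []
    for event in range(1, events + 1):
        if reverse:
            # Reversed cycle: the counter goes negative on the first
            # failure, so the decrement comes at the start of the cycle.
            failure -= f
            while failure < 0:
                drops.append(event)
                failure += 1
        else:
            # Conventional cycle: decrement only after 1/f failures.
            failure += f
            while failure >= 1:
                drops.append(event)
                failure -= 1
    return drops


print(decrement_events(Fraction(1, 4), 9))                 # [1, 5, 9]
print(decrement_events(Fraction(1, 4), 9, reverse=False))  # [4, 8]
```

<p> In both cases decrements are 1/f events apart, but the reversed
cycle reacts to the very first failure instead of the fourth. </p>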
<p> In summary, the main ingredients for the Postfix 2.5 concurrency
feedback algorithm are a) the option of less-than-1 positive feedback
per delivery to avoid overwhelming servers, b) the option of
less-than-1 negative feedback per delivery to avoid giving up too
fast, c) feedback hysteresis to avoid rapid oscillation, and d) a
"reverse" hysteresis cycle for negative feedback, so that it can
correct for overload quickly.  </p>

<h3> <a name="dead_summary_2_5"> Summary of the Postfix 2.5 "dead destination" detection algorithm </a> </h3>

<p> We want to suspend deliveries to a specific destination after
some number of deliveries suffer connection or handshake failure.
The old scheduler declares a destination "dead" when negative (-1)
feedback throttles the delivery concurrency down to zero. With
less-than-1 feedback per delivery, this throttling down would
obviously take too long.  We therefore have to separate "dead
destination" detection from concurrency feedback.  This is implemented
by introducing the concept of pseudo-cohort failure. The Postfix
2.5 concurrency scheduler declares a destination "dead" after a
configurable number of pseudo-cohorts suffer from connection or
handshake failures. The old scheduler corresponds to the special
case where the pseudo-cohort failure limit is equal to 1.  </p>

<h3> <a name="pseudo_code_2_5"> Pseudocode for the Postfix 2.5 concurrency scheduler </a> </h3>

<p> The pseudocode shows how the ideas behind the new concurrency
scheduler are implemented as of November 2007.  The actual code can
be found in the module qmgr/qmgr_queue.c.  </p>

<pre>
Types:
        Each destination has one set of the following variables
        int concurrency
        double success
        double failure
        double fail_cohorts

Feedback functions:
        N is concurrency; x, y are arbitrary numbers in [0..1] inclusive
        positive feedback: g(N) = x/N | x/sqrt(N) | x
        negative feedback: f(N) = y/N | y/sqrt(N) | y

Initialization:
        concurrency = initial_concurrency
        success = 0
        failure = 0
        fail_cohorts = 0

After success:
        fail_cohorts = 0
        Be prepared for feedback &gt; hysteresis, or rounding error
        success += g(concurrency)
        while (success &gt;= 1)            Hysteresis 1
            concurrency += 1            Hysteresis 1
            failure = 0
            success -= 1                Hysteresis 1
        Be prepared for overshoot
        if (concurrency &gt; concurrency limit)
            concurrency = concurrency limit

Safety:
        Don't apply positive feedback unless
            concurrency &lt; busy_refcount + init_dest_concurrency
        otherwise negative feedback effect could be delayed

After failure:
        if (concurrency &gt; 0)
            fail_cohorts += 1.0 / concurrency
            if (fail_cohorts &gt; cohort_failure_limit)
                concurrency = 0
        if (concurrency &gt; 0)
            Be prepared for feedback &gt; hysteresis, rounding errors
            failure -= f(concurrency)
            while (failure &lt; 0)
                concurrency -= 1        Hysteresis 1
                failure += 1            Hysteresis 1
                success = 0
            Be prepared for overshoot
            if (concurrency &lt; 1)
                concurrency = 1
</pre>

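<p> For readers who want to experiment, the pseudocode can be
transcribed into runnable Python roughly as follows. This is a sketch
under stated assumptions, not the code in qmgr/qmgr_queue.c: the class,
method and field names are hypothetical, x and y are fixed at 1, the
number of deliveries in progress (busy_refcount) is passed in by the
caller, the safety check is applied after resetting fail_cohorts, and
the guard against feedback for a "dead" destination is an addition
that the pseudocode does not spell out. </p>

```python
import math
from dataclasses import dataclass

# Feedback styles from the pseudocode, with x = y = 1 (an assumption).
FEEDBACK = {
    "1": lambda n: 1.0,
    "1/sqrt(N)": lambda n: 1.0 / math.sqrt(n),
    "1/N": lambda n: 1.0 / n,
}


@dataclass
class Destination:
    """Hypothetical per-destination state; concurrency 0 means "dead"."""
    concurrency: int
    limit: int                      # per-destination concurrency limit
    initial: int                    # initial destination concurrency
    cohort_failure_limit: float = 1.0
    positive: str = "1/N"           # style of g(N)
    negative: str = "1/N"           # style of f(N)
    success: float = 0.0
    failure: float = 0.0
    fail_cohorts: float = 0.0

    def on_success(self, busy_refcount):
        self.fail_cohorts = 0.0
        if self.concurrency == 0:
            return                  # added guard: no feedback when dead
        # Safety: skip positive feedback when the concurrency window is
        # already well above what is actually in use.
        if self.concurrency >= busy_refcount + self.initial:
            return
        self.success += FEEDBACK[self.positive](self.concurrency)
        while self.success >= 1:
            self.concurrency += 1
            self.failure = 0.0
            self.success -= 1
        self.concurrency = min(self.concurrency, self.limit)

    def on_failure(self):
        if self.concurrency > 0:
            self.fail_cohorts += 1.0 / self.concurrency
            if self.fail_cohorts > self.cohort_failure_limit:
                self.concurrency = 0        # declare the destination dead
        if self.concurrency > 0:
            self.failure -= FEEDBACK[self.negative](self.concurrency)
            while self.failure < 0:         # reversed hysteresis cycle
                self.concurrency -= 1
                self.failure += 1
                self.success = 0.0
            self.concurrency = max(self.concurrency, 1)


dest = Destination(concurrency=5, limit=20, initial=5)
dest.on_failure()
print(dest.concurrency)   # 4: the first failure lowers concurrency at once
```

<p> In this sketch, repeated failures without an intervening success
let fail_cohorts accumulate until it exceeds cohort_failure_limit, at
which point the concurrency is set to zero. </p>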
<h3> <a name="concurrency_results"> Results for delivery to concurrency-limited servers </a> </h3>

<p> Discussions about the concurrency scheduler redesign started
in early 2004, when the primary goal was to find alternatives that did
not exhibit exponential growth or rapid concurrency throttling.  No
code was implemented until late 2007, when the primary concern had
shifted towards better handling of server concurrency limits. For
this reason we measure how well the new scheduler does this
job.  The table below compares mail delivery performance of the old
+/-1 feedback per delivery with several less-than-1 feedback
functions, for different limited-concurrency server scenarios.
Measurements were done with a FreeBSD 6.2 client and with FreeBSD
6.2 and various Linux servers.  </p>

<p> Server configuration: </p>

<ul> <li> The mail flow was slowed down with 1 second latency per
recipient ("smtpd_client_restrictions = sleep 1"). The purpose was
to make results less dependent on hardware details, by avoiding
slow-downs by queue file I/O, logging I/O, and network I/O.

<li> Concurrency was limited by the server process limit
("default_process_limit = 5" and "smtpd_client_event_limit_exceptions
= static:all"). Postfix was stopped and started after changing the
process limit, because the same number is also used as the backlog
argument to the listen(2) system call, and "postfix reload" does
not re-issue this call.

<li> Mail was discarded with "local_recipient_maps = static:all" and
"local_transport = discard". The discard action in access maps or
header/body checks could not be used as it fails to update the
in_flow_delay counters.

</ul>

<p> Client configuration: </p>

<ul>

<li> Queue file overhead was minimized by sending one message to a
virtual alias that expanded into 2000 different remote recipients.
All recipients were accounted for according to the maillog file.
The virtual_alias_expansion_limit setting was increased to avoid
complaints from the cleanup(8) server.

<li> The number of deliveries was maximized with
"smtp_destination_recipient_limit = 2". A smaller limit would cause
Postfix to schedule the concurrency per recipient instead of domain,
which is not what we want.

<li> Maximum concurrency was limited with
"smtp_destination_concurrency_limit = 20", and
initial_destination_concurrency was set to the same value.

<li> The positive and negative concurrency feedback hysteresis was
1.  Concurrency was incremented by 1 at the END of 1/feedback steps
of positive feedback, and was decremented by 1 at the START of
1/feedback steps of negative feedback.

<li> The SMTP client used the default 30s SMTP connect timeout and
300s SMTP greeting timeout.

</ul>

<h4> Impact of the 30s SMTP connect timeout </h4>

<p> The first results are for a FreeBSD 6.2 server, where our
artificially low listen(2) backlog results in a very short kernel
queue for established connections. The table shows that all deferred
deliveries failed due to a 30s connection timeout, and none failed
due to a server greeting timeout.  This measurement simulates what
happens when the server's connection queue is completely full under
load, and the TCP engine drops new connections.  </p>

<blockquote>

<table>

<tr> <th>client<br> limit</th> <th>server<br> limit</th> <th>feedback<br>
style</th> <th>connection<br> caching</th> <th>percentage<br>
deferred</th> <th colspan="2">client concurrency<br> average/stddev</th>
<th colspan="2">timed-out in<br> connect/greeting </th> </tr>

<tr> <td align="center" colspan="9"> <hr> </td> </tr>

<tr><td align="center">20</td> <td align="center">5</td> <td
align="center">1/N</td> <td align="center">no</td> <td
align="center">9.9</td> <td align="center">19.4</td> <td
align="center">0.49</td> <td align="center">198</td> <td
align="center">-</td> </tr>

<tr><td align="center">20</td> <td align="center">5</td> <td
align="center">1/N</td> <td align="center">yes</td> <td
align="center">10.3</td> <td align="center">19.4</td> <td
align="center">0.49</td> <td align="center">206</td> <td
align="center">-</td> </tr>

<tr><td align="center">20</td> <td align="center">5</td> <td
align="center">1/sqrt(N)</td> <td align="center">no</td>
<td align="center">10.4</td> <td align="center">19.6</td> <td
align="center">0.59</td> <td align="center">208</td> <td
align="center">-</td> </tr>

<tr><td align="center">20</td> <td align="center">5</td> <td
align="center">1/sqrt(N)</td> <td align="center">yes</td>
<td align="center">10.6</td> <td align="center">19.6</td> <td
align="center">0.61</td> <td align="center">212</td> <td
align="center">-</td> </tr>

<tr><td align="center">20</td> <td align="center">5</td> <td
align="center">1</td> <td align="center">no</td> <td
align="center">10.1</td> <td align="center">19.5</td> <td
align="center">1.29</td> <td align="center">202</td> <td
align="center">-</td> </tr>

<tr><td align="center">20</td> <td align="center">5</td> <td
align="center">1</td> <td align="center">yes</td> <td
align="center">10.8</td> <td align="center">19.3</td> <td
align="center">1.57</td> <td align="center">216</td> <td
align="center">-</td> </tr>

<tr> <td align="center" colspan="9"> <hr> </td> </tr>

</table>

<p> A busy server with a completely full connection queue.  N is
the client delivery concurrency.  Failed deliveries time out after
30s without completing the TCP handshake. See text for a discussion
of results. </p>

</blockquote>

<h4> Impact of the 300s SMTP greeting timeout </h4>

<p> The next table shows results for a Fedora Core 8 server (results
for RedHat 7.3 are identical). In this case, the artificially small
listen(2) backlog argument does not impact our measurement.  The
table shows that practically all deferred deliveries fail after the
300s SMTP greeting timeout. As these timeouts were 10x longer than
with the first measurement, we increased the recipient count (and
thus the running time) by a factor of 10 to keep the results
comparable. The deferred mail percentages are a factor of 10 lower
than with the first measurement, because the 1s per-recipient delay
was 1/300th of the greeting timeout instead of 1/30th of the
connection timeout.  </p>

<blockquote>

<table>

<tr> <th>client<br> limit</th> <th>server<br> limit</th> <th>feedback<br>
style</th> <th>connection<br> caching</th> <th>percentage<br>
deferred</th> <th colspan="2">client concurrency<br> average/stddev</th>
<th colspan="2">timed-out in<br> connect/greeting </th> </tr>

<tr> <td align="center" colspan="9"> <hr> </td> </tr>

<tr> <td align="center">20</td> <td align="center">5</td> <td
align="center">1/N</td> <td align="center">no</td> <td
align="center">1.16</td> <td align="center">19.8</td> <td
align="center">0.37</td> <td align="center">-</td> <td
align="center">230</td> </tr>

<tr> <td align="center">20</td> <td align="center">5</td> <td
align="center">1/N</td> <td align="center">yes</td> <td
align="center">1.36</td> <td align="center">19.8</td> <td
align="center">0.36</td> <td align="center">-</td> <td
align="center">272</td> </tr>

<tr> <td align="center">20</td> <td align="center">5</td> <td
align="center">1/sqrt(N)</td> <td align="center">no</td>
<td align="center">1.21</td> <td align="center">19.9</td> <td
align="center">0.23</td> <td align="center">4</td> <td
align="center">238</td> </tr>

<tr> <td align="center">20</td> <td align="center">5</td> <td
align="center">1/sqrt(N)</td> <td align="center">yes</td>
<td align="center">1.36</td> <td align="center">20.0</td> <td
align="center">0.23</td> <td align="center">-</td> <td
align="center">272</td> </tr>

<tr> <td align="center">20</td> <td align="center">5</td> <td
align="center">1</td> <td align="center">no</td> <td
align="center">1.18</td> <td align="center">20.0</td> <td
align="center">0.16</td> <td align="center">-</td> <td
align="center">236</td> </tr>

<tr> <td align="center">20</td> <td align="center">5</td> <td
align="center">1</td> <td align="center">yes</td> <td
align="center">1.39</td> <td align="center">20.0</td> <td
align="center">0.16</td> <td align="center">-</td> <td
align="center">278</td> </tr>

<tr> <td align="center" colspan="9"> <hr> </td> </tr>

</table>

<p> A busy server with a non-full connection queue.  N is the client
delivery concurrency. Failed deliveries complete at the TCP level,
but time out after 300s while waiting for the SMTP greeting.  See
text for a discussion of results.  </p>

</blockquote>

<h4> Impact of active server concurrency limiter </h4>

<p> The final concurrency-limited result shows what happens when
SMTP connections don't time out, but are rejected immediately with
the Postfix server's smtpd_client_connection_count_limit feature
(the server replies with a 421 status and disconnects immediately).
Similar results can be expected with concurrency limiting features
built into other MTAs or firewalls.  For this measurement we specified
a server concurrency limit and a client initial destination concurrency
of 5, and a server process limit of 10; all other conditions were
the same as with the first measurement. The same result would be
obtained with a FreeBSD or Linux server, because the "pushing back"
is done entirely by the receiving side. </p>

<blockquote>

<table>

<tr> <th>client<br> limit</th> <th>server<br> limit</th> <th>feedback<br>
style</th> <th>connection<br> caching</th> <th>percentage<br>
deferred</th> <th colspan="2">client concurrency<br> average/stddev</th>
<th>theoretical<br>defer rate</th> </tr>

<tr> <td align="center" colspan="8"> <hr> </td> </tr>

<tr> <td align="center">20</td> <td align="center">5</td> <td
align="center">1/N</td> <td align="center">no</td> <td
align="center">16.5</td> <td align="center">5.17</td> <td
align="center">0.38</td> <td align="center">1/6</td> </tr>

<tr> <td align="center">20</td> <td align="center">5</td> <td
align="center">1/N</td> <td align="center">yes</td> <td
align="center">16.5</td> <td align="center">5.17</td> <td
align="center">0.38</td> <td align="center">1/6</td> </tr>

<tr> <td align="center">20</td> <td align="center">5</td> <td
align="center">1/sqrt(N)</td> <td align="center">no</td>
<td align="center">24.5</td> <td align="center">5.28</td> <td
align="center">0.45</td> <td align="center">1/4</td> </tr>

<tr> <td align="center">20</td> <td align="center">5</td> <td
align="center">1/sqrt(N)</td> <td align="center">yes</td>
<td align="center">24.3</td> <td align="center">5.28</td> <td
align="center">0.46</td> <td align="center">1/4</td> </tr>

<tr> <td align="center">20</td> <td align="center">5</td> <td
align="center">1</td> <td align="center">no</td> <td
align="center">49.7</td> <td align="center">5.63</td> <td
align="center">0.67</td> <td align="center">1/2</td> </tr>

<tr> <td align="center">20</td> <td align="center">5</td> <td
align="center">1</td> <td align="center">yes</td> <td
align="center">49.7</td> <td align="center">5.68</td> <td
align="center">0.70</td> <td align="center">1/2</td> </tr>

<tr> <td align="center" colspan="8"> <hr> </td> </tr>

</table>

<p> A server with active per-client concurrency limiter that replies
with 421 and disconnects.  N is the client delivery concurrency.
The theoretical defer rate is 1/(1+roundup(1/feedback)).  This is
always 1/2 with the fixed +/-1 feedback per delivery; with the
concurrency-dependent feedback variants, the defer rate decreases
with increasing concurrency. See text for a discussion of results.
</p>

</blockquote>

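<p> The theoretical defer rates in the last column can be recomputed
from the formula 1/(1+roundup(1/feedback)) in the caption. The sketch
below uses a hypothetical function name and takes the number of steps
1/feedback directly (N, sqrt(N) or 1), which avoids floating-point
error when inverting the feedback value. </p>

```python
import math
from fractions import Fraction


def theoretical_defer_rate(steps):
    """Return 1/(1 + roundup(steps)), where steps = 1/feedback."""
    return Fraction(1, 1 + math.ceil(steps))


n = 5  # client delivery concurrency near the server limit used above
for style, steps in (("1/N", n), ("1/sqrt(N)", math.sqrt(n)), ("1", 1)):
    rate = theoretical_defer_rate(steps)
    print(style, rate, f"{float(rate):.1%}")
```

<p> For N=5 this gives 1/6, 1/4 and 1/2, matching the table. One way
to read the formula, assuming steady state at the server limit: each
rejected delivery lowers the concurrency at once, after which
roundup(1/feedback) successful deliveries are needed before the
concurrency rises again and the next delivery is rejected. </p>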
<h3> <a name="concurrency_discussion"> Discussion of concurrency-limited server results </a> </h3>

<p> All results in the previous sections are based on the first
delivery runs only; they do not include second or later delivery
attempts. It's also worth noting that the measurements look at
steady-state behavior only. They don't show what happens when the
client starts sending at a much higher or lower concurrency.
</p>
    1.1      tron
    1.1      tron <p> The first two examples show that the effect of feedback
    1.1      tron is negligible when concurrency is limited due to congestion. This
    1.1      tron is because the initial concurrency is already at the client's
    1.1      tron concurrency maximum, and because there is 10-100 times more positive
    1.1      tron than negative feedback.  Under these conditions, it is no surprise
    1.1      tron that the contribution from SMTP connection caching is also negligible.
    1.1      tron </p>
    1.1      tron
    1.1      tron <p> In the last example, the old +/-1 feedback per delivery will
    1.1      tron defer 50% of the mail when confronted with an active (anvil-style)
    1.1      tron server concurrency limit, where the server hangs up immediately
    1.1      tron with a 421 status (a TCP-level RST would have the same result).
    1.1      tron Less aggressive feedback mechanisms fare better than more aggressive
    1.1      tron ones.  Concurrency-dependent feedback fares even better at higher
    1.1      tron concurrencies than shown here, but has limitations as discussed in
    1.1      tron the next section.  </p>
    1.1      tron
    1.1      tron <h3> <a name="concurrency_limitations"> Limitations of less-than-1 per delivery feedback </a> </h3>
    1.1      tron
    1.1      tron <p> Less-than-1 feedback is of interest primarily when sending large
    1.1      tron amounts of mail to destinations with active concurrency limiters
    1.1      tron (servers that reply with 421, or firewalls that send RST).  When
    1.1      tron sending small amounts of mail per destination, less-than-1 per-delivery
    1.1      tron feedback won't have a noticeable effect on the per-destination
    1.1      tron concurrency, because the number of deliveries to the same destination
    1.1      tron is too small. You might just as well use zero per-delivery feedback
    1.1      tron and stay with the initial per-destination concurrency. And when
    1.1      tron mail deliveries fail due to congestion instead of active concurrency
    1.1      tron limiters, the measurements above show that per-delivery feedback
    1.1      tron has no effect.  With large amounts of mail you might just as well
    1.1      tron use zero per-delivery feedback and start with the maximal per-destination
    1.1      tron concurrency.  </p>
    1.1      tron
    1.1      tron <p> The scheduler with less-than-1 concurrency
    1.1      tron feedback per delivery solves a problem with servers that have active
    1.1      tron concurrency limiters.  This works only because feedback is handled
    1.1      tron in a peculiar manner: positive feedback will increment the concurrency
    1.1      tron by 1 at the <b>end</b> of a sequence of events of length 1/feedback,
    1.1      tron while negative feedback will decrement concurrency by 1 at the
    1.1      tron <b>beginning</b> of such a sequence.  This is how Postfix adjusts
    1.1      tron quickly for overshoot without causing lots of mail to be deferred.
    1.1      tron Without this difference in feedback treatment, less-than-1 feedback
    1.1      tron per delivery would defer 50% of the mail, and would be no better
    1.1      tron in this respect than the old +/-1 feedback per delivery.  </p>
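<p> The asymmetric treatment of feedback described above can be
modelled with a small toy class. This is a sketch under stated
assumptions, not the Postfix implementation: the class, its attribute
names, the reset of the opposite counter on each event, and the floor
at a window of 1 are all choices made for this example. </p>

```python
from fractions import Fraction


class ConcurrencyWindow:
    """Toy model of asymmetric less-than-1 feedback (not Postfix code)."""

    def __init__(self, window: int, feedback: Fraction):
        self.window = window
        self.feedback = feedback
        self.success = Fraction(0)  # positive feedback accumulated so far
        self.fail = Fraction(0)     # credit left before the next decrement

    def good_delivery(self) -> None:
        # Positive feedback takes effect at the END of a run of
        # 1/feedback good deliveries.
        self.fail = Fraction(0)
        self.success += self.feedback
        if self.success >= 1:
            self.window += 1
            self.success -= 1

    def bad_delivery(self) -> None:
        # Negative feedback takes effect at the BEGINNING of a run:
        # the first bad delivery lowers the window right away.
        self.success = Fraction(0)
        self.fail -= self.feedback
        if self.fail < 0:
            # The floor at 1 is a simplification of this sketch.
            self.window = max(1, self.window - 1)
            self.fail += 1


w = ConcurrencyWindow(window=5, feedback=Fraction(1, 4))
for _ in range(4):
    w.good_delivery()
print("after 4 good deliveries:", w.window)
w.bad_delivery()
print("after 1 bad delivery:", w.window)
```

<p> With feedback 1/4, four good deliveries are needed before the
window grows by 1, while a single bad delivery shrinks it by 1
immediately; the next decrement only follows after a further four
bad deliveries. </p>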
    1.1      tron
    1.1      tron <p> Unfortunately, the same feature that corrects quickly for
    1.1      tron concurrency overshoot also makes the scheduler more sensitive to
    1.1      tron noisy negative feedback.  The reason is that one lonely negative
    1.1      tron feedback event has the same effect as a complete sequence of length
    1.1      tron 1/feedback: in both cases delivery concurrency is dropped by 1
    1.1      tron immediately.  As a worst-case scenario, consider multiple servers
    1.1      tron behind a load balancer on a single IP address, and no backup MX
    1.1      tron address.  When 1 out of K servers fails to complete the SMTP handshake
    1.1      tron or drops the connection, a scheduler with 1/N (N = concurrency)
    1.1      tron feedback stops increasing its concurrency once it reaches a concurrency
    1.1      tron level of about K,  even though the good servers behind the load
    1.1      tron balancer are perfectly capable of handling more traffic. </p>
    1.1      tron
    1.1      tron <p> This noise problem gets worse as the amount of positive feedback
    1.1      tron per delivery gets smaller.  A compromise is to use fixed less-than-1
    1.1      tron positive feedback values instead of concurrency-dependent positive
    1.1      tron feedback.  For example, to tolerate 1 of 4 bad servers in the above
    1.1      tron load balancer scenario, use positive feedback of 1/4 per "good"
    1.1      tron delivery (no connect or handshake error), and use an equal or smaller
    1.1      tron amount of negative feedback per "bad" delivery.  The downside of
    1.1      tron using concurrency-independent feedback is that some of the old +/-1
    1.1      tron feedback problems will return at large concurrencies.  Sites that
    1.1      tron must deliver mail at non-trivial per-destination concurrencies will
    1.1      tron require special configuration.  </p>
    1.1      tron
    1.1      tron <h3> <a name="concurrency_config"> Concurrency configuration parameters </a> </h3>
    1.1      tron
    1.1      tron <p> The Postfix 2.5 concurrency scheduler is controlled with the
    1.1      tron following configuration parameters, where "<i>transport</i>_foo"
    1.1      tron provides a transport-specific parameter override.  All parameter
    1.1      tron default settings are compatible with earlier Postfix versions. </p>
    1.1      tron
    1.1      tron <blockquote>
    1.1      tron
    1.1      tron <table border="0">
    1.1      tron
    1.1      tron <tr> <th> Parameter name </th> <th> Postfix version </th> <th>
    1.1      tron Description </th> </tr>
    1.1      tron
    1.1      tron <tr> <td colspan="3"> <hr> </td> </tr>
    1.1      tron
    1.1      tron <tr> <td> initial_destination_concurrency<br>
    1.1      tron <i>transport</i>_initial_destination_concurrency </td> <td
    1.1      tron align="center"> all<br> 2.5 </td> <td> Initial per-destination
    1.1      tron delivery concurrency </td> </tr>
    1.1      tron
    1.1      tron <tr> <td> default_destination_concurrency_limit<br>
    1.1      tron <i>transport</i>_destination_concurrency_limit </td> <td align="center">
    1.1      tron all<br> all </td> <td> Maximum per-destination delivery concurrency
    1.1      tron </td> </tr>
    1.1      tron
    1.1      tron <tr> <td> default_destination_concurrency_positive_feedback<br>
    1.1      tron <i>transport</i>_destination_concurrency_positive_feedback </td>
    1.1      tron <td align="center"> 2.5<br> 2.5 </td> <td> Per-destination positive
    1.1      tron feedback amount, per delivery that does not fail with connection
    1.1      tron or handshake failure </td> </tr>
    1.1      tron
    1.1      tron <tr> <td> default_destination_concurrency_negative_feedback<br>
    1.1      tron <i>transport</i>_destination_concurrency_negative_feedback </td>
    1.1      tron <td align="center"> 2.5<br> 2.5 </td> <td> Per-destination negative
    1.1      tron feedback amount, per delivery that fails with connection or handshake
    1.1      tron failure </td> </tr>
    1.1      tron
    1.1      tron <tr> <td> default_destination_concurrency_failed_cohort_limit<br>
    1.1      tron <i>transport</i>_destination_concurrency_failed_cohort_limit </td>
    1.1      tron <td align="center"> 2.5<br> 2.5 </td> <td> Number of failed
    1.1      tron pseudo-cohorts after which a destination is declared "dead" and
    1.1      tron delivery is suspended </td> </tr>
    1.1      tron
    1.1      tron <tr> <td> destination_concurrency_feedback_debug</td> <td align="center">
    1.1      tron 2.5 </td> <td> Enable verbose logging of concurrency scheduler
    1.1      tron activity </td> </tr>
    1.1      tron
    1.1      tron <tr> <td colspan="3"> <hr> </td> </tr>
    1.1      tron
    1.1      tron </table>
    1.1      tron
    1.1      tron </blockquote>
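<p> As an illustration of the load balancer compromise discussed
earlier (tolerating 1 of 4 bad servers), a main.cf fragment might look
like the following. The values are examples rather than
recommendations; in particular the negative feedback of 1/8 is an
arbitrary choice that merely satisfies "equal or smaller", and "smtp"
is used only as an example transport name. </p>

```
# main.cf: fixed less-than-1 feedback for all transports
default_destination_concurrency_positive_feedback = 1/4
default_destination_concurrency_negative_feedback = 1/8

# Transport-specific override for the "smtp" transport only
smtp_destination_concurrency_positive_feedback = 1/4
```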
    1.1      tron
    1.1      tron <h2> <a name="jobs"> Preemptive scheduling </a> </h2>
    1.1      tron
    1.1      tron <p>
    1.1      tron
    1.1      tron The following sections describe the new queue manager and its
    1.1      tron preemptive scheduler algorithm. Note that the document was originally
    1.1      tron written to describe the changes between the new queue manager (in
    1.1      tron this text referred to as <tt>nqmgr</tt>, the name it was known by
    1.1      tron before it became the default queue manager) and the old queue manager
    1.1      tron (referred to as <tt>oqmgr</tt>). This is why it refers to <tt>oqmgr</tt>
    1.1      tron every so often.
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <p>
    1.1      tron
    1.1      tron This document is divided into sections as follows:
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <ul>
    1.1      tron
    1.1      tron <li> <a href="#<tt>nqmgr</tt>_structures"> The structures used by
    1.1      tron nqmgr </a>
    1.1      tron
    1.1      tron <li> <a href="#<tt>nqmgr</tt>_pickup"> What happens when nqmgr picks
    1.1      tron up the message </a> - how it is assigned to transports, jobs, peers,
    1.1      tron entries
    1.1      tron
    1.1      tron <li> <a href="#<tt>nqmgr</tt>_selection"> How the entry selection
    1.1      tron works </a>
    1.1      tron
    1.1      tron <li> <a href="#<tt>nqmgr</tt>_preemption"> How the preemption
    1.1      tron works </a> - what messages may be preempted and how and what messages
    1.1      tron are chosen to preempt them
    1.1      tron
    1.1      tron <li> <a href="#<tt>nqmgr</tt>_concurrency"> How destination concurrency
    1.1      tron limits affect the scheduling algorithm </a>
    1.1      tron
    1.1      tron <li> <a href="#<tt>nqmgr</tt>_memory"> Dealing with memory resource
    1.1      tron limits </a>
    1.1      tron
    1.1      tron </ul>
    1.1      tron
    1.1      tron <h3> <a name="<tt>nqmgr</tt>_structures"> The structures used by
    1.1      tron nqmgr </a> </h3>
    1.1      tron
    1.1      tron <p>
    1.1      tron
    1.1      tron Let's start by recapitulating the structures and terms used when
1.1.1.4  christos referring to the queue manager and how it operates. Many of these are
    1.1      tron partially described elsewhere, but it is nice to have a coherent
    1.1      tron overview in one place:
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <ul>
    1.1      tron
    1.1      tron <li> <p> Each message structure represents one mail message which
    1.1      tron Postfix is to deliver. The message recipients specify which
    1.1      tron destinations the message is to be delivered to and which transports
    1.1      tron are going to be used for the delivery. </p>
    1.1      tron
    1.1      tron <li> <p> Each recipient entry groups a batch of recipients of one
1.1.1.3      tron message which are all going to be delivered to the same destination
1.1.1.3      tron (and over the same transport).
    1.1      tron </p>
    1.1      tron
    1.1      tron <li> <p> Each transport structure groups everything that is going
    1.1      tron to be delivered by delivery agents dedicated to that transport.
    1.1      tron Each transport maintains a set of queues (describing the destinations
    1.1      tron it shall talk to) and jobs (referencing the messages it shall
    1.1      tron deliver). </p>
    1.1      tron
    1.1      tron <li> <p> Each transport queue (not to be confused with the on-disk
1.1.1.4  christos "active" queue or "incoming" queue) groups everything that is going to be
    1.1      tron delivered to a given destination (aka nexthop) by its transport.  Each
    1.1      tron queue belongs to one transport, so each destination may be referred
    1.1      tron to by several queues, one for each transport.  Each queue maintains
    1.1      tron a list of all recipient entries (batches of message recipients)
    1.1      tron which shall be delivered to the given destination (the todo list), and
    1.1      tron a list of recipient entries already being delivered by the delivery
    1.1      tron agents (the busy list). </p>
    1.1      tron
    1.1      tron <li> <p> Each queue corresponds to multiple peer structures.  Each
    1.1      tron peer structure is like the queue structure, belonging to one transport
    1.1      tron and referencing one destination. The difference is that it lists
    1.1      tron only the recipient entries which all originate from the same message,
    1.1      tron unlike the queue structure, whose entries may originate from various
    1.1      tron messages. For messages with few recipients, there is usually just
    1.1      tron one recipient entry for each destination, resulting in one recipient
    1.1      tron entry per peer. But for large mailing list messages the recipients
    1.1      tron may need to be split into multiple recipient entries, in which case
    1.1      tron the peer structure may list many entries for a single destination.
    1.1      tron </p>
    1.1      tron
    1.1      tron <li> <p> Each transport job groups everything it takes to deliver
    1.1      tron one message via its transport. Each job represents one message
    1.1      tron within the context of the transport. The job belongs to one transport
    1.1      tron and message, so each message may have multiple jobs, one for each
    1.1      tron transport. The job groups all the peer structures, which describe
    1.1      tron the destinations the job's message has to be delivered to. </p>
    1.1      tron
    1.1      tron </ul>
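<p> The relationships between these six structures can be sketched as
plain Python data classes. The class and field names below are
illustrative assumptions and do not mirror the C structures in the
Postfix source; the sketch only shows who references whom. </p>

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Message:
    queue_id: str


@dataclass
class Entry:
    """A batch of recipients of one message for one destination."""
    message: Message
    recipients: List[str]


@dataclass
class Queue:
    """Per-transport destination queue (not the on-disk queue)."""
    nexthop: str
    todo: List[Entry] = field(default_factory=list)
    busy: List[Entry] = field(default_factory=list)


@dataclass
class Peer:
    """Entries of ONE message that go to one destination queue."""
    queue: Queue
    entries: List[Entry] = field(default_factory=list)


@dataclass
class Job:
    """One message within the context of one transport."""
    message: Message
    peers: Dict[str, Peer] = field(default_factory=dict)    # by nexthop


@dataclass
class Transport:
    name: str
    queues: Dict[str, Queue] = field(default_factory=dict)  # by nexthop
    jobs: List[Job] = field(default_factory=list)


msg = Message("4F2A1")
smtp = Transport("smtp")
queue = smtp.queues.setdefault("example.com", Queue("example.com"))
job = Job(msg)
smtp.jobs.append(job)
peer = job.peers.setdefault("example.com", Peer(queue))

# The same entry is listed by both the queue and the peer.
entry = Entry(msg, ["a@example.com", "b@example.com"])
queue.todo.append(entry)
peer.entries.append(entry)
```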
    1.1      tron
    1.1      tron <p>
    1.1      tron
    1.1      tron The first four structures are common to both <tt>nqmgr</tt> and
    1.1      tron <tt>oqmgr</tt>; the last two were introduced by <tt>nqmgr</tt>.
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <p>
    1.1      tron
    1.1      tron These terms are used extensively in the text below. Feel free to
    1.1      tron look up the descriptions above whenever you feel you have lost
    1.1      tron track of what is what.
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <h3> <a name="<tt>nqmgr</tt>_pickup"> What happens when nqmgr picks
    1.1      tron up the message </a> </h3>
    1.1      tron
    1.1      tron <p>
    1.1      tron
1.1.1.4  christos Whenever <tt>nqmgr</tt> moves a queue file into the "active" queue,
    1.1      tron the following happens: It reads all necessary information from the
    1.1      tron queue file as <tt>oqmgr</tt> does, and also reads as many recipients
    1.1      tron as possible - more on that later, for now let's just pretend it
    1.1      tron always reads all recipients.
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <p>
    1.1      tron
    1.1      tron Then it resolves the recipients as <tt>oqmgr</tt> does, which
    1.1      tron means obtaining an (address, nexthop, transport) triple for each
    1.1      tron recipient. For each triple, it finds the transport; if it does not
    1.1      tron exist yet, it instantiates it (unless it's dead). Within the
1.1.1.4  christos transport, it finds the destination queue for the given nexthop; if it
    1.1      tron does not exist yet, it instantiates it (unless it's dead). The
    1.1      tron triple is then bound to the given destination queue. This happens in
    1.1      tron qmgr_resolve() and is basically the same as in <tt>oqmgr</tt>.
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <p>
    1.1      tron
    1.1      tron Then for each triple which was bound to some queue (and thus
    1.1      tron transport), the program finds the job which represents the message
    1.1      tron within that transport's context; if it does not exist yet, it
    1.1      tron instantiates it. Within the job, it finds the peer which represents
    1.1      tron the bound destination queue within this job's context; if it does
    1.1      tron not exist yet, it instantiates it.  Finally, it stores the address
    1.1      tron from the resolved triple in the recipient entry, which is appended
    1.1      tron to both the queue entry list and the peer entry list. The addresses
1.1.1.4  christos for the same nexthop are batched in the entries up to the
1.1.1.4  christos <i>transport</i>_destination_recipient_limit for that transport.
1.1.1.4  christos This happens in qmgr_message_assign(), and apart from the fact
1.1.1.4  christos that it operates on job and peer structures, it is basically the
    1.1      tron same as in <tt>oqmgr</tt>.
    1.1      tron
    1.1      tron </p>
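<p> The batching of addresses into recipient entries can be sketched
as follows. This is a simplified illustration and not what
qmgr_message_assign() literally does: the function name is
hypothetical, and a single recipient limit is applied to every
transport, whereas the real limit is per transport. </p>

```python
from collections import defaultdict


def batch_recipients(resolved, recipient_limit):
    """Group (address, nexthop, transport) triples into recipient entries.

    Returns {(transport, nexthop): [[address, ...], ...]} where each inner
    list stands for one recipient entry of at most recipient_limit addresses.
    """
    entries = defaultdict(list)
    for address, nexthop, transport in resolved:
        batches = entries[(transport, nexthop)]
        # Open a new entry when there is none yet or the last one is full.
        if not batches or len(batches[-1]) >= recipient_limit:
            batches.append([])
        batches[-1].append(address)
    return dict(entries)


resolved = [(f"user{i}@example.com", "example.com", "smtp") for i in range(5)]
resolved.append(("x@example.org", "example.org", "smtp"))
for key, batches in batch_recipients(resolved, recipient_limit=2).items():
    print(key, [len(batch) for batch in batches])
```

<p> Five recipients for the same nexthop with a limit of 2 end up in
three entries, all of which would be listed by the same peer. </p>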
    1.1      tron
    1.1      tron <p>
    1.1      tron
    1.1      tron When the job is instantiated, it is enqueued on the transport's job
    1.1      tron list based on the time its message was picked up by <tt>nqmgr</tt>.
    1.1      tron For the first batch of recipients this means it is appended to the end
    1.1      tron of the job list, but the ordering of the job list by the enqueue
    1.1      tron time is important as we will see shortly.
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <p>
    1.1      tron
1.1.1.4  christos [Now you should have a pretty good idea what the state of the
1.1.1.4  christos <tt>nqmgr</tt> is after a couple of messages were picked up, and what the
1.1.1.4  christos relation is between all those job, peer, queue and entry structures.]
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <h3> <a name="<tt>nqmgr</tt>_selection"> How the entry selection
    1.1      tron works </a> </h3>
    1.1      tron
    1.1      tron <p>
    1.1      tron
    1.1      tron Having prepared all the above-mentioned structures, the task of
    1.1      tron the <tt>nqmgr</tt>'s scheduler is to choose the recipient entries
    1.1      tron one at a time and pass them to the delivery agent for the corresponding
    1.1      tron transport. Now how does this work?
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <p>
    1.1      tron
    1.1      tron The first approximation of the new scheduling algorithm is like this:
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <blockquote>
    1.1      tron <pre>
    1.1      tron foreach transport (round-robin-by-transport)
    1.1      tron do
    1.1      tron     if transport busy continue
    1.1      tron     if transport process limit reached continue
    1.1      tron     foreach transport's job (in the order of the transport's job list)
    1.1      tron     do
    1.1      tron         foreach job's peer (round-robin-by-destination)
    1.1      tron         do
    1.1      tron             if peer-&gt;queue-&gt;concurrency &lt; peer-&gt;queue-&gt;window
    1.1      tron                 return next peer entry.
    1.1      tron         done
    1.1      tron     done
    1.1      tron done
    1.1      tron </pre>
    1.1      tron </blockquote>
    1.1      tron
    1.1      tron <p>
    1.1      tron
    1.1      tron Now what is the "order of the transport's job list"? As we know
    1.1      tron already, the job list is by default kept in the order the message
    1.1      tron was picked up by the <tt>nqmgr</tt>. So by default we get
    1.1      tron top-level round-robin over the transports, and within each transport
    1.1      tron we get FIFO message delivery. The round-robin of the peers by the
    1.1      tron destination is perhaps of little importance in most real-life cases
1.1.1.4  christos (unless the <i>transport</i>_destination_recipient_limit is reached,
1.1.1.4  christos in one job there
    1.1      tron is only one peer structure for each destination), but theoretically
    1.1      tron it makes sure that even within a single job, destinations are treated
    1.1      tron fairly.
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <p>
    1.1      tron
    1.1      tron [By now you should have a feeling you really know how the scheduler
    1.1      tron works, except for the preemption, under ideal conditions - that is,
    1.1      tron no recipient resource limits and no destination concurrency problems.]
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <h3> <a name="<tt>nqmgr</tt>_preemption"> How the preemption
    1.1      tron works </a> </h3>
    1.1      tron
    1.1      tron <p>
    1.1      tron
    1.1      tron As you might perhaps expect by now, the transport's job list does
    1.1      tron not remain sorted by the job's message enqueue time all the time.
    1.1      tron The coolest thing about <tt>nqmgr</tt> is not the simple FIFO
    1.1      tron delivery, but that it is able to slip mail with few recipients
    1.1      tron past the mailing-list bulk mail.  This is what the job preemption
    1.1      tron is about - shuffling the jobs on the transport's job list to get
    1.1      tron the best message delivery rates. Now how is it achieved?
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <p>
    1.1      tron
    1.1      tron First I have to tell you that there are in fact two job lists in
    1.1      tron each transport. One is the scheduler's job list, which the scheduler
    1.1      tron is free to play with, while the other one keeps the jobs always
    1.1      tron listed in the order of the enqueue time and is used for recipient
    1.1      tron pool management, which we will discuss later. For now, we will deal with
    1.1      tron the scheduler's job list only.
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <p>
    1.1      tron
    1.1      tron So, we have the job list, which is first ordered by the time the
    1.1      tron jobs' messages were enqueued, oldest messages first, the most recently
    1.1      tron picked one at the end. For now, let's assume that there are no
    1.1      tron destination concurrency problems. Without preemption, we pick some
    1.1      tron entry of the first (oldest) job on the queue, assign it to a delivery
    1.1      tron agent, pick another one from the same job, assign it again, and so
    1.1      tron on, until all the entries are used and the job is delivered. We
    1.1      tron would then move onto the next job and so on and on. Now how do we
    1.1      tron manage to sneak in some entries from the recently added jobs when
    1.1      tron the first job on the job list belongs to a message going to a
    1.1      tron mailing list and has thousands of recipient entries?
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <p>
    1.1      tron
    1.1      tron The <tt>nqmgr</tt>'s answer is that we can artificially "inflate"
    1.1      tron the delivery time of that first job by some constant for free - it
    1.1      tron is basically the same trick you might remember as "accumulation of
    1.1      tron potential" from the amortized complexity lessons. For example,
    1.1      tron instead of delivering the entries of the first job on the job list
    1.1      tron every time a delivery agent becomes available, we can do it only
    1.1      tron every second time. If you view the moments the delivery agent becomes
    1.1      tron available on a timeline as "delivery slots", then instead of using
    1.1      tron every delivery slot for the first job, we can use only every other
    1.1      tron slot, and still the overall delivery efficiency of the first job
    1.1      tron remains the same. So the delivery <tt>11112222</tt> becomes
    1.1      tron <tt>1.1.1.1.2.2.2.2</tt> (1 and 2 are the imaginary job numbers, .
    1.1      tron denotes the free slot). Now what do we do with free slots?
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <p>
    1.1      tron
    1.1      tron As you might have guessed, we will use them for sneaking the mail
    1.1      tron with few recipients in. For example, if we have one four-recipient
    1.1      tron mail followed by four one-recipient mails, the delivery sequence
    1.1      tron (that is, the sequence in which the jobs are assigned to the
    1.1      tron delivery slots) might look like this: <tt>12131415</tt>. Hmm, fine
    1.1      tron for sneaking in the single recipient mail, but how do we sneak in
    1.1      tron the mail with more than one recipient? Say if we have one four-recipient
    1.1      tron mail followed by two two-recipient mails?
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <p>
    1.1      tron
    1.1      tron The simple answer would be to use delivery sequence <tt>12121313</tt>.
    1.1      tron But the problem is that this does not scale well. Imagine you have
1.1.1.4  christos mail with a thousand recipients followed by mail with a hundred recipients.
    1.1      tron It is tempting to suggest a delivery sequence like <tt>121212....</tt>,
    1.1      tron but alas! Imagine there arrives another mail with say ten recipients.
    1.1      tron But there are no free slots anymore, so it can't slip by, not even
1.1.1.4  christos if it had only one recipient.  It will be stuck until the
    1.1      tron hundred-recipient mail is delivered, which really sucks.
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <p>
    1.1      tron
    1.1      tron So, it becomes obvious that while inflating the message to get
1.1.1.4  christos free slots is a great idea, one has to be really careful about how the
    1.1      tron free slots are assigned, otherwise one might corner oneself. So,
    1.1      tron how does <tt>nqmgr</tt> really use the free slots?
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <p>
    1.1      tron
    1.1      tron The key idea is that one does not have to generate the free slots
    1.1      tron in a uniform way. The delivery sequence <tt>111...1</tt> is no
    1.1      tron worse than <tt>1.1.1.1</tt>, in fact, it is even better as some
    1.1      tron entries are in the first case selected earlier than in the second
    1.1      tron case, and none is selected later! So it is possible to first
    1.1      tron "accumulate" the free delivery slots and then use them all at once.
    1.1      tron It is even possible to accumulate some, then use them, then accumulate
    1.1      tron some more and use them again, as in <tt>11..1.1</tt> .
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <p>
    1.1      tron
    1.1      tron Let's get back to the one hundred recipient example. We now know
    1.1      tron that we could first accumulate one hundred free slots, and only
    1.1      tron then preempt the first job and sneak the one hundred
    1.1      tron recipient mail in. Applying the algorithm recursively, we see the
    1.1      tron hundred recipient job can accumulate ten free delivery slots, and
    1.1      tron then we could preempt it and sneak in the ten-recipient mail...
    1.1      tron Wait wait wait! Could we? Aren't we overinflating the original one
    1.1      tron thousand recipient mail?
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <p>
    1.1      tron
1.1.1.4  christos Well, despite the fact that it looks so at first glance, another trick will
    1.1      tron allow us to answer "no, we are not!". If we had said that we will
    1.1      tron inflate the delivery time at most twofold, and then we consider
    1.1      tron every other slot as a free slot, then we would overinflate in case
    1.1      tron of the recursive preemption. BUT! The trick is that if we use only
    1.1      tron every n-th slot as a free slot for n&gt;2, there is always some worst
    1.1      tron inflation factor which we can guarantee not to be breached, even
    1.1      tron if we apply the algorithm recursively. To be precise, if for every
    1.1      tron k&gt;1 normally used slots we accumulate one free delivery slot, then
    1.1      tron the inflation factor is not worse than k/(k-1) no matter how many
    1.1      tron recursive preemptions happen. And it's not worse than (k+1)/k if
    1.1      tron only non-recursive preemption happens. Now, having got through the
    1.1      tron theory and the related math, let's see how <tt>nqmgr</tt> implements
    1.1      tron this.
    1.1      tron
    1.1      tron </p>
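<p> One way to see where these two bounds come from is to add up the
slots level by level: the original job uses its own slots, the jobs
that preempt it use at most 1/k of that, the jobs that preempt those
use at most 1/k of that again, and so on. The following sketch (the
function name and the "depth" parameter are assumptions of this
illustration, not part of Postfix) sums that series with exact
fractions. </p>

```python
from fractions import Fraction


def inflation_factor(k: int, depth: int) -> Fraction:
    """Total slots used per original slot with `depth` levels of preemption.

    Level 0 is the original job; each level of preempting jobs uses at
    most 1/k of the slots of the level below it.
    """
    return sum(Fraction(1, k) ** level for level in range(depth + 1))


k = 5
print(inflation_factor(k, 1))                        # 6/5, that is (k+1)/k
print(inflation_factor(k, 10) < Fraction(k, k - 1))  # True
```

<p> With one level of preemption the sum is exactly (k+1)/k, and with
any number of levels it stays below the limit k/(k-1) of the geometric
series. </p>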
    1.1      tron
    1.1      tron <p>
    1.1      tron
    1.1      tron Each job has a so-called "available delivery slot" counter. Each
    1.1      tron transport has a <i>transport</i>_delivery_slot_cost parameter, which
    1.1      tron defaults to the default_delivery_slot_cost parameter, which is set to 5
    1.1      tron by default. This is the k from the paragraph above. Each time k
    1.1      tron entries of the job are selected for delivery, this counter is
1.1.1.4  christos incremented by one. Once there are some slots accumulated, a job which
    1.1      tron requires no more than that number of slots to be fully delivered
    1.1      tron can preempt this job.
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <p>
    1.1      tron
    1.1      tron [Well, the truth is, the counter is incremented every time an entry
1.1.1.3      tron is selected and it is divided by k when it is used.
1.1.1.4  christos But for understanding, it's good enough to use
    1.1      tron the above approximation of the truth.]
    1.1      tron
    1.1      tron </p>
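<p> The counter bookkeeping can be sketched like this. The function
and parameter names are hypothetical, and keeping a separate count of
slots already used is an assumption of this sketch rather than a
description of the actual code. </p>

```python
def available_slots(selected_entries: int, slots_used: int, k: int) -> int:
    """Free delivery slots a job can currently offer.

    The counter grows by one for every selected entry and is divided by
    the slot cost k only when it is consulted.
    """
    return selected_entries // k - slots_used


def can_preempt(candidate_entries: int, selected_entries: int,
                slots_used: int = 0, k: int = 5) -> bool:
    # The candidate must fit completely into the accumulated free slots.
    return candidate_entries <= available_slots(selected_entries, slots_used, k)


# After 12 selected entries with k = 5, two free slots are available.
print(can_preempt(candidate_entries=2, selected_entries=12))
print(can_preempt(candidate_entries=3, selected_entries=12))
```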
    1.1      tron
    1.1      tron <p>
    1.1      tron
    1.1      tron OK, so now we know the conditions which must be satisfied so one
    1.1      tron job can preempt another one. But what job gets preempted, how do
    1.1      tron we choose what job preempts it if there are several valid candidates,
    1.1      tron and when does all this exactly happen?
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <p>
    1.1      tron
    1.1      tron The answer for the first part is simple. The job whose entry was
1.1.1.4  christos selected the last time is the so-called current job. Normally, it is
    1.1      tron the first job on the scheduler's job list, but destination concurrency
    1.1      tron limits may change this as we will see later. It is always only the
    1.1      tron current job which may get preempted.
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <p>
    1.1      tron
1.1.1.4  christos Now for the second part. The current job has a certain amount of
    1.1      tron recipient entries, and as such may accumulate at maximum some amount
    1.1      tron of available delivery slots. It might have already accumulated some,
    1.1      tron and perhaps even already used some when it was preempted before
    1.1      tron (remember a job can be preempted several times). In either case,
    1.1      tron we know how many are accumulated and how many are left to deliver,
    1.1      tron so we know how many it may yet accumulate at maximum. Every other
    1.1      tron job which may be delivered in fewer than that number of slots is a
    1.1      tron valid candidate for preemption. How do we choose among them?
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <p>
    1.1      tron
    1.1      tron The answer is - the one with maximum enqueue_time/recipient_entry_count.
    1.1      tron That is, the older the job is, the more we should try to deliver
    1.1      tron it in order to get best message delivery rates. These rates are of
    1.1      tron course subject to how many recipients the message has, therefore
    1.1      tron the division by the recipient (entry) count. No one should be surprised
1.1.1.4  christos that a message with n recipients takes n times longer to deliver than
1.1.1.4  christos a message with one recipient.
    1.1      tron
    1.1      tron </p>
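<p>

The selection rule can be sketched in a few lines. This is only an
illustration, not the code of qmgr_choose_candidate(): the Job class
and its fields are made up for the example, each recipient entry is
assumed to need exactly one delivery slot, and the age of a job stands
in for its enqueue_time.

</p>

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str       # hypothetical label, for the example only
    age: float      # seconds the message has spent in the queue
    entries: int    # recipient entries still to be delivered (at least 1)


def choose_candidate(jobs, max_slots):
    """Return the job with the highest age/entries score among the jobs
    that need fewer than max_slots slots, or None if no job qualifies."""
    candidates = [job for job in jobs if job.entries < max_slots]
    if not candidates:
        return None
    return max(candidates, key=lambda job: job.age / job.entries)


jobs = [Job("a", 30.0, 1), Job("b", 100.0, 2), Job("c", 500.0, 40)]
best = choose_candidate(jobs, max_slots=5)
```

<p>

Here job "c" is the oldest but needs too many slots, and job "b" wins
over job "a" because 100/2 is larger than 30/1.

</p>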
    1.1      tron
    1.1      tron <p>
    1.1      tron
    1.1      tron Now let's recap the previous two paragraphs. Isn't it too complicated?
    1.1      tron Why don't the candidates come only from among the jobs which can be
    1.1      tron delivered within the number of slots the current job already
    1.1      tron accumulated? Why do we need to estimate how much it has yet to
    1.1      tron accumulate? If you figured out the answer, congratulate yourself. If
    1.1      tron we did it this simple way, we would always choose the candidate
1.1.1.4  christos with the fewest recipient entries. If there were enough single-recipient
    1.1      tron mails coming in, they would always slip by the bulk mail as soon
1.1.1.4  christos as possible, and mail with two or more recipients would never get
    1.1      tron a chance, no matter how long it has been sitting around in the
    1.1      tron job list.
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <p>
    1.1      tron
1.1.1.4  christos This candidate selection has an interesting implication: when
    1.1      tron we choose the best candidate for preemption (this is done in
    1.1      tron qmgr_choose_candidate()), it may happen that we cannot use it for
    1.1      tron preemption immediately. This leads to an answer to the last part
    1.1      tron of the original question - when does the preemption happen?
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <p>
    1.1      tron
    1.1      tron The preemption attempt happens every time the transport's next recipient
    1.1      tron entry is to be chosen for delivery. To avoid needless overhead, the
    1.1      tron preemption is not attempted if the current job could never accumulate
    1.1      tron more than <i>transport</i>_minimum_delivery_slots (defaults to
1.1.1.4  christos default_minimum_delivery_slots which defaults to 3). If there are
    1.1      tron already enough accumulated slots to preempt the current job by the
    1.1      tron chosen best candidate, it is done immediately. This basically means
    1.1      tron that the candidate is moved in front of the current job on the
    1.1      tron scheduler's job list and the accumulated slot counter is decreased
1.1.1.4  christos by the amount used by the candidate. If there are not enough slots...
    1.1      tron well, I could say that nothing happens and another preemption
    1.1      tron is attempted the next time. But that's not the complete truth.
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <p>
    1.1      tron
    1.1      tron The truth is that it turns out that it is not really necessary to
    1.1      tron wait until the job's counter accumulates all the delivery slots in
    1.1      tron advance. Say we have a ten-recipient mail followed by two two-recipient
1.1.1.4  christos mails. If the preemption happened when enough delivery slots accumulate
    1.1      tron (assuming slot cost 2), the delivery sequence becomes
    1.1      tron <tt>11112211113311</tt>. Now what would we get if we waited
    1.1      tron only for 50% of the necessary slots to accumulate and promised
    1.1      tron to wait for the remaining 50% later, after we get back
1.1.1.4  christos to the preempted job? If we use such a slot loan, the delivery sequence
1.1.1.4  christos becomes <tt>11221111331111</tt>. As we can see, this is not
    1.1      tron considerably worse for the delivery of the ten-recipient mail, but
    1.1      tron it allows the small messages to be delivered sooner.
    1.1      tron
    1.1      tron </p>
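<p>

The two delivery sequences can be reproduced with a toy simulation.
It is a sketch under simplifying assumptions and not the
<tt>nqmgr</tt> code: there is one large job labelled 1, the small jobs
are tried in the given order, and the minimum delivery slot check and
the candidate selection are left out. The function and parameter names
are invented for the example.

</p>

```python
def delivery_sequence(big_entries, small_jobs, slot_cost=2, discount=0):
    """Simulate one large job (labelled 1) being preempted by small jobs.

    small_jobs is a list of (label, entry_count) pairs, tried in order.
    The counter grows by one per delivered entry of the large job and
    stands for counter / slot_cost accumulated delivery slots.
    """
    sequence = []
    counter = 0
    pending = list(small_jobs)
    remaining = big_entries
    while remaining > 0:
        if pending:
            label, needed = pending[0]
            # Preempt once the discounted share of the needed slots exists.
            if counter * 100 >= needed * slot_cost * (100 - discount):
                sequence.append(str(label) * needed)
                # The counter may go negative: the loan is repaid later.
                counter -= needed * slot_cost
                pending.pop(0)
                continue
        sequence.append("1")
        counter += 1
        remaining -= 1
    return "".join(sequence)


no_loan = delivery_sequence(10, [(2, 2), (3, 2)])
half_loan = delivery_sequence(10, [(2, 2), (3, 2)], discount=50)
```

<p>

With these inputs, no_loan is <tt>11112211113311</tt> and half_loan is
<tt>11221111331111</tt>.

</p>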
    1.1      tron
    1.1      tron <p>
    1.1      tron
    1.1      tron The concept of these slot loans is where the
    1.1      tron <i>transport</i>_delivery_slot_discount and
    1.1      tron <i>transport</i>_delivery_slot_loan parameters come from (they default to
    1.1      tron default_delivery_slot_discount and default_delivery_slot_loan, whose
    1.1      tron values are by default 50 and 3, respectively). The discount (resp.
    1.1      tron loan) specifies how many percent (resp. how many slots) one "gets
    1.1      tron in advance", when the number of slots required to deliver the best
    1.1      tron candidate is compared with the number of slots the current job has
    1.1      tron accumulated so far.
    1.1      tron
    1.1      tron </p>
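<p>

One plausible way to read this comparison is sketched below. How
exactly the discount and the loan are combined is an assumption made
for the example (the loan is added to the accumulated slots, the
discount reduces the slots the candidate needs); the function name is
hypothetical and the <tt>nqmgr</tt> sources remain the reference.

</p>

```python
def can_preempt(accumulated_slots, needed_slots, discount=50, loan=3):
    """True if the best candidate may preempt the current job now.

    Assumption: the loan counts as slots already accumulated and the
    discount is the percentage of needed slots given in advance.
    """
    return (accumulated_slots + loan) * 100 >= needed_slots * (100 - discount)


# With the default discount of 50 and loan of 3, a candidate needing
# 6 slots could preempt right away, one needing 7 slots could not.
six_ok = can_preempt(0, 6)
seven_ok = can_preempt(0, 7)
```

<p>

Setting both the discount and the loan to 0 gives back the plain rule:
all needed slots must be accumulated first.

</p>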
    1.1      tron
    1.1      tron <p>
    1.1      tron
1.1.1.4  christos And that pretty much concludes this chapter.
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <p>
    1.1      tron
    1.1      tron [Now you should have a feeling that you pretty much understand the
1.1.1.4  christos scheduler and the preemption, or at least that you will have
1.1.1.4  christos after you read the last chapter a couple more times. You shall clearly
    1.1      tron see the job list and the preemption happening at its head, in ideal
    1.1      tron delivery conditions. The feeling of understanding shall last until
    1.1      tron you start wondering what happens if some of the jobs are blocked,
    1.1      tron which you might eventually figure out correctly from what has been
    1.1      tron said already. But I would be surprised if your mental image of the
    1.1      tron scheduler's functionality is not completely shattered once you
    1.1      tron start wondering how it works when not all recipients may be read
    1.1      tron in-core.  More on that later.]
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <h3> <a name="<tt>nqmgr</tt>_concurrency"> How destination concurrency
    1.1      tron limits affect the scheduling algorithm </a> </h3>
    1.1      tron
    1.1      tron <p>
    1.1      tron
    1.1      tron The <tt>nqmgr</tt> uses the same algorithm for destination concurrency
    1.1      tron control as <tt>oqmgr</tt>. Now what happens when the destination
    1.1      tron limits are reached and no more entries for that destination may be
    1.1      tron selected by the scheduler?
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <p>
    1.1      tron
1.1.1.4  christos From the user's point of view it is all simple. If some of the peers
    1.1      tron of a job can't be selected, those peers are simply skipped by the
    1.1      tron entry selection algorithm (the pseudo-code described before) and
    1.1      tron only the selectable ones are used. If none of the peers may be
    1.1      tron selected, the job is declared a "blocker job". Blocker jobs are
    1.1      tron skipped by the entry selection algorithm and they are also excluded
1.1.1.4  christos from the candidates for preemption of the current job. Thus the scheduler
    1.1      tron effectively behaves as if the blocker jobs didn't exist on the job
    1.1      tron list at all. As soon as at least one of the peers of a blocker job
    1.1      tron becomes unblocked (that is, the delivery agent handling the delivery
1.1.1.4  christos of the recipient entry for the given destination successfully finishes),
    1.1      tron the job's blocker status is removed and the job again participates
    1.1      tron in all further scheduler actions normally.
    1.1      tron
    1.1      tron </p>
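<p>

The user's view can be sketched as a simple filter. The data layout
below is invented for the example (a job is a name plus the list of
destinations of its peers, and there is a single concurrency limit for
all destinations); the real scheduler keeps this state in its own
structures.

</p>

```python
def selectable_peers(peers, busy, limit):
    """Destinations of a job whose concurrency limit is not yet reached."""
    return [dest for dest in peers if busy.get(dest, 0) < limit]


def is_blocker(peers, busy, limit):
    """A job is a blocker job when none of its peers can be selected."""
    return not selectable_peers(peers, busy, limit)


def next_job(job_list, busy, limit):
    """First job on the job list that is not a blocker job, or None."""
    for name, peers in job_list:
        if not is_blocker(peers, busy, limit):
            return name
    return None


busy = {"example.com": 2, "example.org": 1}   # deliveries in progress
job_list = [("job1", ["example.com"]),
            ("job2", ["example.com", "example.org"])]
chosen = next_job(job_list, busy, limit=2)
```

<p>

Here job1 is a blocker job and is skipped, and job2 is chosen but only
its example.org peer is used.

</p>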
    1.1      tron
    1.1      tron <p>
    1.1      tron
    1.1      tron So the summary is that the users don't really have to be concerned
    1.1      tron about the interaction of the destination limits and scheduling
    1.1      tron algorithm. It works well on its own and there are no knobs they
    1.1      tron would need to control it.
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <p>
    1.1      tron
    1.1      tron From a programmer's point of view, the blocker jobs complicate the
    1.1      tron scheduler quite a lot. Without them, the jobs on the job list would
    1.1      tron be normally delivered in strict FIFO order. If the current job is
    1.1      tron preempted, the job preempting it is completely delivered unless it
    1.1      tron is preempted itself. Without blockers, the current job is thus
    1.1      tron always either the first job on the job list, or the top of the stack
    1.1      tron of jobs preempting the first job on the job list.
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <p>
    1.1      tron
    1.1      tron The visualization of the job list and the preemption stack without
    1.1      tron blockers would be like this:
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <blockquote>
    1.1      tron <pre>
    1.1      tron first job-&gt;    1--2--3--5--6--8--...    &lt;- job list
    1.1      tron on job list    |
    1.1      tron                4    &lt;- preemption stack
    1.1      tron                |
    1.1      tron current job-&gt;  7
    1.1      tron </pre>
    1.1      tron </blockquote>
    1.1      tron
    1.1      tron <p>
    1.1      tron
    1.1      tron In the example above we see that job 1 was preempted by job 4 and
    1.1      tron then job 4 was preempted by job 7. After job 7 is completed, remaining
    1.1      tron entries of job 4 are selected, and once they are all selected, job
    1.1      tron 1 continues.
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <p>
    1.1      tron
    1.1      tron As we see, it's all very clean and straightforward. Now how does
    1.1      tron this change because of blockers?
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <p>
    1.1      tron
1.1.1.4  christos The answer is: a lot. Any job may become a blocker job at any time,
1.1.1.4  christos and also become a normal job again at any time. This has several
    1.1      tron important implications:
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <ol>
    1.1      tron
    1.1      tron <li> <p>
    1.1      tron
    1.1      tron The jobs may be completed in arbitrary order. For example, in the
    1.1      tron example above, if the current job 7 becomes blocked, the next job
    1.1      tron 4 may complete before the job 7 becomes unblocked again. Or if both
    1.1      tron 7 and 4 are blocked, then 1 is completed, then 7 becomes unblocked
    1.1      tron and is completed, then 2 is completed and only after that 4 becomes
    1.1      tron unblocked and is completed... You get the idea.
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <p>
    1.1      tron
    1.1      tron [Interesting side note: even when jobs are delivered out of order,
1.1.1.4  christos from a single destination's point of view the jobs are still delivered
    1.1      tron in the expected order (that is, FIFO unless there was some preemption
    1.1      tron involved). This is because whenever a destination queue becomes
    1.1      tron unblocked (the destination limit allows selection of more recipient
    1.1      tron entries for that destination), all jobs which have peers for that
    1.1      tron destination are unblocked at once.]
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <li> <p>
    1.1      tron
    1.1      tron The idea of the preemption stack at the head of the job list is
    1.1      tron gone.  That is, it must be possible to preempt any job on the job
    1.1      tron list. For example, if the jobs 7, 4, 1 and 2 in the example above
    1.1      tron all become blocked, job 3 becomes the current job. And of course
    1.1      tron we do not want the preemption to be affected by whether there
    1.1      tron are some blocked jobs or not. Therefore, if it turns out that job
    1.1      tron 3 might be preempted by job 6, the implementation shall make it
    1.1      tron possible.
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <li> <p>
    1.1      tron
    1.1      tron The idea of the linear preemption stack itself is gone. It's no
    1.1      tron longer true that one job is always preempted by only one job at one
    1.1      tron time (that is directly preempted, not counting the recursively
    1.1      tron nested jobs). For example, in the example above, job 1 is directly
    1.1      tron preempted by only job 4, and job 4 by job 7. Now assume job 7 becomes
    1.1      tron blocked, and job 4 is being delivered. If it accumulates enough
    1.1      tron delivery slots, it is natural that it might be preempted for example
    1.1      tron by job 8. Now job 4 is preempted by both job 7 AND job 8 at the
    1.1      tron same time.
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron </ol>
    1.1      tron
    1.1      tron <p>
    1.1      tron
    1.1      tron Now combine the points 2) and 3) with point 1) again and you realize
    1.1      tron that the relations on the once linear job list became pretty
    1.1      tron complicated. If we extend the point 3) example: jobs 7 and 8 preempt
    1.1      tron job 4, now job 8 becomes blocked too, then job 4 completes. Tricky,
    1.1      tron huh?
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <p>
    1.1      tron
    1.1      tron If I illustrate the relations after the above mentioned examples
1.1.1.4  christos (except those in point 1), the situation would look like this:
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <blockquote>
    1.1      tron <pre>
    1.1      tron                             v- parent
    1.1      tron
    1.1      tron adoptive parent -&gt;    1--2--3--5--...      &lt;- "stack" level 0
    1.1      tron                       |     |
    1.1      tron parent gone -&gt;        ?     6              &lt;- "stack" level 1
    1.1      tron                      / \
    1.1      tron children -&gt;         7   8   ^- child       &lt;- "stack" level 2
    1.1      tron
    1.1      tron                       ^- siblings
    1.1      tron </pre>
    1.1      tron </blockquote>
    1.1      tron
    1.1      tron <p>
    1.1      tron
    1.1      tron Now how does <tt>nqmgr</tt> deal with all these complicated relations?
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <p>
    1.1      tron
    1.1      tron Well, it maintains them all as described, but fortunately, all these
1.1.1.4  christos relations are necessary only for the purpose of proper counting of
1.1.1.4  christos available delivery slots. For the purpose of ordering the jobs for
    1.1      tron entry selection, the original rule still applies: "the job preempting
    1.1      tron the current job is moved in front of the current job on the job
    1.1      tron list". So for entry selection purposes, the job relations remain
    1.1      tron as simple as this:
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <blockquote>
    1.1      tron <pre>
    1.1      tron 7--8--1--2--6--3--5--..   &lt;- scheduler's job list order
    1.1      tron </pre>
    1.1      tron </blockquote>
    1.1      tron
    1.1      tron <p>
    1.1      tron
    1.1      tron The job list order and the preemption parent/child/siblings relations
    1.1      tron are maintained separately. And because the selection works only
    1.1      tron with the job list, you can happily forget about those complicated
    1.1      tron relations unless you want to study the <tt>nqmgr</tt> sources. In
    1.1      tron that case the text above might provide some helpful introduction
    1.1      tron to the problem domain. Otherwise I suggest you just forget about
    1.1      tron all this and stick with the user's point of view: the blocker jobs
    1.1      tron are simply ignored.
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <p>
    1.1      tron
1.1.1.4  christos [By now, you should have a feeling that there are more things going
1.1.1.4  christos on under the hood than you ever wanted to know. You decide that
    1.1      tron forgetting about this chapter is the best you can do for the sake
    1.1      tron of your mind's health and you basically stick with the idea of how the
    1.1      tron scheduler works in ideal conditions, when there are no blockers,
    1.1      tron which is good enough.]
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <h3> <a name="<tt>nqmgr</tt>_memory"> Dealing with memory resource
    1.1      tron limits </a> </h3>
    1.1      tron
    1.1      tron <p>
    1.1      tron
    1.1      tron When discussing the <tt>nqmgr</tt> scheduler, we have so far assumed
1.1.1.4  christos that all recipients of all messages in the "active" queue are completely
1.1.1.4  christos read into memory. This is simply not true. There is an upper
    1.1      tron bound on the amount of memory the <tt>nqmgr</tt> may use, and
    1.1      tron therefore it must impose some limits on the information it may store
1.1.1.4  christos in memory at any given time.
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <p>
    1.1      tron
    1.1      tron First of all, not all messages may be read in-core at once. At any
    1.1      tron time, only qmgr_message_active_limit messages may be read in-core
    1.1      tron at maximum. When read into memory, the messages are picked from the
1.1.1.4  christos "incoming" and "deferred" queues and moved to the "active" queue
1.1.1.4  christos (incoming having priority), so if there are already
1.1.1.4  christos qmgr_message_active_limit messages in the "active" queue, the
1.1.1.4  christos rest will have to wait until (some of) the messages in the "active" queue
1.1.1.4  christos are completely delivered (or deferred).
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <p>
    1.1      tron
    1.1      tron Even with the limited amount of in-core messages, there is another
    1.1      tron limit which must be imposed in order to avoid memory exhaustion.
1.1.1.4  christos Each message may contain a huge number of recipients (tens or hundreds
    1.1      tron of thousands are not uncommon), so if <tt>nqmgr</tt> read all
1.1.1.4  christos recipients of all messages in the "active" queue, it might easily run
    1.1      tron out of memory. Therefore there must be some upper bound on the
1.1.1.4  christos number of message recipients which are read into memory at the
    1.1      tron same time.
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <p>
    1.1      tron
    1.1      tron Before discussing how exactly <tt>nqmgr</tt> implements the recipient
    1.1      tron limits, let's see how the mere existence of the limits themselves
    1.1      tron affects the <tt>nqmgr</tt> and its scheduler.
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <p>
    1.1      tron
    1.1      tron The message limit is straightforward - it just limits the size of
    1.1      tron the
    1.1      tron lookahead the <tt>nqmgr</tt>'s scheduler has when choosing which
1.1.1.4  christos message can preempt the current one. Messages not in the "active" queue
1.1.1.4  christos are simply not considered at all.
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <p>
    1.1      tron
    1.1      tron The recipient limit complicates more things. First of all, the
    1.1      tron message reading code must support reading the recipients in batches,
    1.1      tron which among other things means accessing the queue file several
    1.1      tron times and continuing where the last recipient batch ended. This is
    1.1      tron invoked by the scheduler whenever the current job has space for more
    1.1      tron recipients, subject to the transport's refill_limit and refill_delay parameters.
    1.1      tron It is also done any time when all
    1.1      tron in-core recipients of the message are dealt with (which may also
    1.1      tron mean they were deferred) but there are still more in the queue file.
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <p>
    1.1      tron
    1.1      tron The second complication is that with some recipients left unread
    1.1      tron in the queue file, the scheduler can't operate with exact counts
    1.1      tron of recipient entries. With unread recipients, it is not clear how
    1.1      tron many recipient entries there will be, as they are subject to
    1.1      tron per-destination grouping. It is not even clear to what transports
    1.1      tron (and thus jobs) the recipients will be assigned. And with messages
1.1.1.4  christos coming from the "deferred" queue, it is not even clear how many unread
    1.1      tron recipients are still to be delivered. This all means that the
    1.1      tron scheduler must use only estimates of how many recipient entries
    1.1      tron there will be.  Fortunately, it is possible to estimate the minimum
    1.1      tron and maximum correctly, so the scheduler can always err on the safe
1.1.1.4  christos side.  Obviously, the better the estimates, the better the results, so
    1.1      tron it is best when we are able to read all recipients in-core and turn
    1.1      tron the estimates into exact counts, or at least try to read as many
    1.1      tron as possible to make the estimates as accurate as possible.
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <p>
    1.1      tron
    1.1      tron The third complication is that it is no longer true that the scheduler
    1.1      tron is done with a job once all of its in-core recipients are delivered.
    1.1      tron It is possible that the job will be revived later, when another
    1.1      tron batch of recipients is read in core. It is also possible that some
    1.1      tron jobs will be created for the first time long after the first batch
    1.1      tron of recipients was read in core. The <tt>nqmgr</tt> code must be
    1.1      tron ready to handle all such situations.
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <p>
    1.1      tron
    1.1      tron And finally, the fourth complication is that the <tt>nqmgr</tt>
    1.1      tron code must somehow impose the recipient limit itself. Now how does
    1.1      tron it achieve it?
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <p>
    1.1      tron
    1.1      tron Perhaps the easiest solution would be to say that each message may
1.1.1.4  christos have at maximum X recipients stored in-core, but such a solution would
    1.1      tron be poor for several reasons. With reasonable qmgr_message_active_limit
1.1.1.4  christos values, the X would have to be quite low to maintain a reasonable
    1.1      tron memory footprint. And with low X lots of things would not work well.
    1.1      tron The <tt>nqmgr</tt> would have trouble using the
    1.1      tron <i>transport</i>_destination_recipient_limit efficiently. The
    1.1      tron scheduler's preemption would be suboptimal as the recipient count
    1.1      tron estimates would be inaccurate. The message queue file would have
    1.1      tron to be accessed many times to read in more recipients again and
    1.1      tron again.
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <p>
    1.1      tron
    1.1      tron Therefore it seems reasonable to have a solution which does not use
1.1.1.4  christos a limit imposed on a per-message basis, but which maintains a pool
    1.1      tron of available recipient slots, which can be shared among all messages
    1.1      tron in the most efficient manner. And as we do not want separate
    1.1      tron transports to compete for resources whenever possible, it seems
1.1.1.4  christos appropriate to maintain such a recipient pool for each transport
    1.1      tron separately. This is the general idea. Now how does it work in
    1.1      tron practice?
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <p>
    1.1      tron
1.1.1.4  christos First we have to solve a little chicken-and-egg problem. If we want
    1.1      tron to use the per-transport recipient pools, we first need to know to
1.1.1.4  christos what transport(s) the message is assigned. But we will find that
1.1.1.4  christos out only after we first read in the recipients. So it is obvious
    1.1      tron that we first have to read in some recipients, use them to find out
1.1.1.4  christos to what transports the message is to be assigned, and only after
1.1.1.4  christos that can we use the per-transport recipient pools.
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <p>
    1.1      tron
    1.1      tron Now how many recipients shall we read for the first time? This is
    1.1      tron what qmgr_message_recipient_minimum and qmgr_message_recipient_limit
    1.1      tron values control. The qmgr_message_recipient_minimum value specifies
1.1.1.4  christos how many recipients of each message we will read the first time,
    1.1      tron no matter what.  It is necessary to read at least one recipient
    1.1      tron before we can assign the message to a transport and create the first
    1.1      tron job. However, reading only qmgr_message_recipient_minimum recipients
    1.1      tron even if there are only a few messages with few recipients in-core would
1.1.1.4  christos be wasteful. Therefore if there are fewer than qmgr_message_recipient_limit
    1.1      tron recipients in-core so far, the first batch of recipients may be
    1.1      tron larger than qmgr_message_recipient_minimum - as large as is required
    1.1      tron to reach the qmgr_message_recipient_limit limit.
    1.1      tron
    1.1      tron </p>
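<p>

As a rough sketch, the size of the first batch follows from the two
parameters like this. The function name is made up, and the default
values of 10 and 20000 are the documented defaults of
qmgr_message_recipient_minimum and qmgr_message_recipient_limit.

</p>

```python
def first_batch_size(in_core_total, minimum=10, limit=20000):
    """Upper bound on the recipients read when a message is first opened.

    in_core_total is the number of recipients of all messages that are
    already in memory.
    """
    return max(minimum, limit - in_core_total)


empty_queue = first_batch_size(0)       # nothing in-core yet
nearly_full = first_batch_size(19995)   # the limit is almost reached
over_limit = first_batch_size(25000)    # already above the limit
```

<p>

The first call allows a batch of 20000 recipients, the other two fall
back to the minimum of 10.

</p>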
    1.1      tron
    1.1      tron <p>
    1.1      tron
    1.1      tron Once the first batch of recipients has been read in core and the message
    1.1      tron jobs were created, the size of the subsequent recipient batches (if
    1.1      tron any - of course it's best when all recipients are read in one batch)
    1.1      tron is based solely on the position of the message jobs on their
    1.1      tron corresponding transports' job lists. Each transport has a pool of
    1.1      tron <i>transport</i>_recipient_limit recipient slots which it can
    1.1      tron distribute among its jobs (how this is done is described later).
    1.1      tron The subsequent recipient batch may be as large as the sum of all
    1.1      tron recipient slots of all jobs of the message permits (plus the
    1.1      tron qmgr_message_recipient_minimum amount which always applies).
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <p>
    1.1      tron
1.1.1.4  christos For example, if a message has three jobs, the first with 1 recipient
1.1.1.4  christos still in-core and 4 recipient slots, the second with 5 recipients in-core
1.1.1.4  christos and 5 recipient slots, and the third with 2 recipients in-core and 0
1.1.1.4  christos recipient slots, it has 1+5+2=8 recipients in-core and 4+5+0=9 job
    1.1      tron recipient slots in total. This means that we could immediately
    1.1      tron read 1+qmgr_message_recipient_minimum more recipients of that message
    1.1      tron in core.
    1.1      tron
    1.1      tron </p>
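<p>

The same bookkeeping, written out for a different and entirely made-up
message. The sketch assumes that the jobs together hold at least as
many slots as they have recipients in-core; the function name and the
data layout are hypothetical.

</p>

```python
def refill_budget(jobs, minimum=10):
    """How many more recipients of one message may be read in core.

    jobs is a list of (recipients_in_core, recipient_slots) pairs, one
    pair for each job of the message.
    """
    in_core = sum(recipients for recipients, _ in jobs)
    slots = sum(slot_count for _, slot_count in jobs)
    # Spare job slots plus the per-message minimum that always applies.
    return slots - in_core + minimum


# Two jobs: 2 recipients in-core with 6 slots, 3 in-core with 3 slots.
budget = refill_budget([(2, 6), (3, 3)])
```

<p>

The message has 5 recipients in-core and 9 slots, so with a minimum of
10 the next batch may hold up to 14 recipients.

</p>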
    1.1      tron
    1.1      tron <p>
    1.1      tron
    1.1      tron The above example illustrates several things which might be worth
    1.1      tron mentioning explicitly: first, note that although the per-transport
    1.1      tron slots are assigned to particular jobs, we can't guarantee that once
    1.1      tron the next batch of recipients is read in core, the corresponding
    1.1      tron amounts of recipients will be assigned to those jobs. The jobs lend
    1.1      tron their slots to the message as a whole, so it is possible that some
    1.1      tron jobs end up sponsoring other jobs of their message. For example,
    1.1      tron if in the example above 2 newly read recipients were assigned
    1.1      tron to the second job, the first job sponsored the second job with 2
    1.1      tron slots. The second notable thing is the third job, which has more
    1.1      tron recipients in-core than it has slots. Apart from the sponsoring by another
    1.1      tron job that we just saw, this can be the result of the first recipient batch, which
    1.1      tron is sponsored from the global recipient pool of qmgr_message_recipient_limit
    1.1      tron recipients. It can also be sponsored from the message recipient
    1.1      tron pool of qmgr_message_recipient_minimum recipients.
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <p>
    1.1      tron
    1.1      tron Now how does each transport distribute the recipient slots among
    1.1      tron its jobs?  The strategy is quite simple. As most scheduler activity
    1.1      tron happens at the head of the job list, it is our intention to make
    1.1      tron sure that the scheduler has the best estimates of the recipient
    1.1      tron counts for those jobs. As we mentioned above, this means that we
    1.1      tron want to try to make sure that the messages of those jobs have all
    1.1      tron recipients read in-core. Therefore the transport distributes the
    1.1      tron slots "along" the job list from start to end. In this case the job
    1.1      tron list sorted by message enqueue time is used, because it doesn't
    1.1      tron change over time as the scheduler's job list does.
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <p>
    1.1      tron
    1.1      tron More specifically, each time a job is created and appended to the
    1.1      tron job list, it gets all unused recipient slots from its transport's
    1.1      tron pool. It keeps them until all recipients of its message are read.
    1.1      tron When this happens, all unused recipient slots are transferred to
1.1.1.4  christos the next job (which is now in fact the first such job) on the job
    1.1      tron list which still has some recipients unread, or eventually back to
1.1.1.4  christos the transport pool if there is no such job. Such a transfer then also
    1.1      tron happens whenever a recipient entry of that job is delivered.
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <p>
    1.1      tron
    1.1      tron There is also a scenario when a job is not appended to the end of
1.1.1.4  christos the job list (for example it was created as a result of a second or
    1.1      tron later recipient batch). Then it works exactly as above, except that
    1.1      tron if it was put in front of the first unread job (that is, the job
1.1.1.4  christos of a message which still has some unread recipients in the queue file),
    1.1      tron that job is first forced to return all of its unused recipient slots
    1.1      tron to the transport pool.
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <p>
    1.1      tron
    1.1      tron The algorithm just described leads to the following state: The first
    1.1      tron unread job on the job list always gets all the remaining recipient
    1.1      tron slots of that transport (if there are any). The jobs queued before
    1.1      tron this job are completely read (that is, all recipients of their
    1.1      tron message were already read in core) and have at most as many slots
    1.1      tron as they still have recipients in-core (the maximum is there because
    1.1      tron of the sponsoring mentioned before) and the jobs after this job get
    1.1      tron nothing from the transport recipient pool (unless they got something
    1.1      tron before and then the first unread job was created and enqueued in
1.1.1.4  christos front of them later - in such a case, they also get at most as many
    1.1      tron slots as they have recipients in-core).
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <p>
    1.1      tron
1.1.1.4  christos Things work fine in such a state most of the time, because the
1.1.1.4  christos current job is either completely read in-core or has as many recipient
    1.1      tron slots as there are, but there is one situation which we still have
    1.1      tron to take care of specially.  Imagine that the current job is preempted
    1.1      tron by some unread job from the job list and there are no more recipient
    1.1      tron slots available, so this new current job could read only batches
    1.1      tron of qmgr_message_recipient_minimum recipients at a time. This would
1.1.1.4  christos really degrade performance. For this reason, each transport has an
    1.1      tron extra pool of <i>transport</i>_extra_recipient_limit recipient
    1.1      tron slots, dedicated to exactly this situation. Each time an unread
    1.1      tron job preempts the current job, it gets half of the remaining recipient
    1.1      tron slots from the normal pool and this extra pool.
    1.1      tron
    1.1      tron </p>
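<p>

As a rough illustration of the "half of the remaining slots" rule, the
sketch below shows one possible reading of it: the unused slots of the
normal pool and of the extra pool are added together and the preempting
job receives half of that sum. The function name is hypothetical, and
both the combined sum and the rounding up are assumptions rather than
something this document specifies.

</p>

```python
def grant_on_preemption(normal_unused, extra_unused):
    """Return (slots granted to the preempting job, slots left over).

    Assumption: the two pools are combined and halved, rounding up.
    """
    available = normal_unused + extra_unused
    granted = (available + 1) // 2
    return granted, available - granted


# Three nested preemptions with an empty normal pool and an extra
# pool of 1000 slots (an illustrative value).
remaining = 1000
grants = []
for _ in range(3):
    granted, remaining = grant_on_preemption(0, remaining)
    grants.append(granted)
print(grants, remaining)
```

<p>

Under these assumptions, handing out only half at a time leaves
something for a further unread job that preempts in turn, instead of
giving the whole extra pool to the first one.

</p>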
    1.1      tron
    1.1      tron <p>
    1.1      tron
    1.1      tron And that's it. It sure does sound pretty complicated, but fortunately
1.1.1.4  christos most people don't really have to care exactly how it works as long
    1.1      tron as it works.  Perhaps the only important things to know for most
    1.1      tron people are the following upper bound formulas:
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <p>
    1.1      tron
    1.1      tron Each transport has at most
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <blockquote>
    1.1      tron <pre>
    1.1      tron max(
    1.1      tron qmgr_message_recipient_minimum * qmgr_message_active_limit
    1.1      tron + *_recipient_limit + *_extra_recipient_limit,
    1.1      tron qmgr_message_recipient_limit
    1.1      tron )
    1.1      tron </pre>
    1.1      tron </blockquote>
    1.1      tron
    1.1      tron <p>
    1.1      tron
    1.1      tron recipients in core.
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <p>
    1.1      tron
    1.1      tron The total number of recipients in core is at most
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <blockquote>
    1.1      tron <pre>
    1.1      tron max(
    1.1      tron qmgr_message_recipient_minimum * qmgr_message_active_limit
    1.1      tron + sum( *_recipient_limit + *_extra_recipient_limit ),
    1.1      tron qmgr_message_recipient_limit
    1.1      tron )
    1.1      tron </pre>
    1.1      tron </blockquote>
    1.1      tron
    1.1      tron <p>
    1.1      tron
    1.1      tron where the sum is over all used transports.
    1.1      tron
    1.1      tron </p>
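<p>

To get a feeling for the magnitudes involved, the two formulas above
can be evaluated directly. The following sketch does this; the
function names are hypothetical, and the parameter values and the two
transports ("smtp" and "relay") are only illustrative. Substitute the
values that "postconf" reports on your own system.

</p>

```python
def transport_recipient_bound(recipient_minimum, active_limit,
                              recipient_limit, extra_recipient_limit,
                              message_recipient_limit):
    """Upper bound on in-core recipients for a single transport."""
    return max(
        recipient_minimum * active_limit
        + recipient_limit + extra_recipient_limit,
        message_recipient_limit,
    )


def total_recipient_bound(recipient_minimum, active_limit,
                          per_transport_limits, message_recipient_limit):
    """Upper bound on in-core recipients over all used transports.

    per_transport_limits holds one (recipient_limit,
    extra_recipient_limit) pair for each transport in use.
    """
    pools = sum(limit + extra for limit, extra in per_transport_limits)
    return max(
        recipient_minimum * active_limit + pools,
        message_recipient_limit,
    )


# Illustrative settings, not read from any real configuration.
qmgr_message_recipient_minimum = 10
qmgr_message_active_limit = 20000
qmgr_message_recipient_limit = 20000
per_transport = {"smtp": (20000, 1000), "relay": (20000, 1000)}

one = transport_recipient_bound(
    qmgr_message_recipient_minimum, qmgr_message_active_limit,
    20000, 1000, qmgr_message_recipient_limit)
total = total_recipient_bound(
    qmgr_message_recipient_minimum, qmgr_message_active_limit,
    per_transport.values(), qmgr_message_recipient_limit)
print(one)    # 221000
print(total)  # 242000
```

<p>

With these example numbers the product of the per-message minimum and
the active message limit dominates both bounds, and each additional
transport adds only its own two pools to the total.

</p>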
    1.1      tron
    1.1      tron <p>
    1.1      tron
    1.1      tron And this terribly complicated chapter concludes the documentation
1.1.1.4  christos of the <tt>nqmgr</tt> scheduler.
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <p>
    1.1      tron
    1.1      tron [By now you should theoretically know the <tt>nqmgr</tt> scheduler
    1.1      tron inside out. In practice, you still hope that you will never have
    1.1      tron to really understand the last or last two chapters completely, and
    1.1      tron fortunately most people really won't. Understanding how the scheduler
1.1.1.4  christos works in ideal conditions is more than good enough for the vast majority
    1.1      tron of users.]
    1.1      tron
    1.1      tron </p>
    1.1      tron
    1.1      tron <h2> <a name="credits"> Credits </a> </h2>
    1.1      tron
    1.1      tron <ul>
    1.1      tron
    1.1      tron <li> Wietse Venema designed and implemented the initial queue manager
    1.1      tron with per-domain FIFO scheduling, and per-delivery +/-1 concurrency
    1.1      tron feedback.
    1.1      tron
    1.1      tron <li> Patrik Rak designed and implemented preemption where mail with
    1.1      tron fewer recipients can slip past mail with more recipients in a
    1.1      tron controlled manner, and wrote up its documentation.
    1.1      tron
    1.1      tron <li> Wietse Venema initiated a discussion with Patrik Rak and Victor
    1.1      tron Duchovni on alternatives for the +/-1 feedback scheduler's aggressive
    1.1      tron behavior. This is when K/N feedback was reviewed (N = concurrency).
    1.1      tron The discussion ended without a good solution for both negative
    1.1      tron feedback and dead site detection.
    1.1      tron
    1.1      tron <li> Victor Duchovni resumed work on concurrency feedback in the
    1.1      tron context of concurrency-limited servers.
    1.1      tron
    1.1      tron <li> Wietse Venema then re-designed the concurrency scheduler in
    1.1      tron terms of the simplest possible concepts: less-than-1 concurrency
    1.1      tron feedback per delivery, forward and reverse concurrency feedback
    1.1      tron hysteresis, and pseudo-cohort failure. At this same time, concurrency
    1.1      tron feedback was separated from dead site detection.
    1.1      tron
    1.1      tron <li> These simplifications, and their modular implementation, helped
    1.1      tron to develop further insights into the different roles that positive
    1.1      tron and negative concurrency feedback play, and helped to identify some
    1.1      tron worst-case scenarios.
    1.1      tron
    1.1      tron </ul>
    1.1      tron
    1.1      tron </body>
    1.1      tron
    1.1      tron </html>