<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
  <title>GMP Itemized Development Tasks</title>
  <link rel="shortcut icon" href="favicon.ico">
  <link rel="stylesheet" href="gmp.css">
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>

<center>
  <h1>
    GMP Itemized Development Tasks
  </h1>
</center>

<font size=-1>
<pre>
Copyright 2000-2004, 2006, 2008, 2009 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.
</pre>
</font>

<hr>
<!-- NB. timestamp updated automatically by emacs -->
  This file is current as of 29 Jan 2014.  An up-to-date version is available at
  <a href="https://gmplib.org/tasks.html">https://gmplib.org/tasks.html</a>.
  Please send comments about this page to gmp-devel<font>@</font>gmplib.org.

<p> These are itemized GMP development tasks.  Not all the tasks
    listed here are suitable for volunteers, but many of them are.
    Please see the <a href="projects.html">projects file</a> for more
    sizeable projects.

<p> CAUTION: This file needs updating.  Many of the tasks here have
    either already been taken care of or have become irrelevant.
<h4>Correctness and Completeness</h4>
<ul>
<li> <code>_LONG_LONG_LIMB</code> in gmp.h is not namespace clean.  Reported
     by Patrick Pelissier.
     <br>
     We sort of mentioned <code>_LONG_LONG_LIMB</code> in past releases, so
     need to be careful about changing it.  It used to be a define
     applications had to set for long long limb systems, but that in
     particular is no longer relevant now that it's established automatically.
<li> The various reuse.c tests need to force reallocation by calling
     <code>_mpz_realloc</code> with a small (1 limb) size.
<li> One reuse case is missing from mpX/tests/reuse.c:
     <code>mpz_XXX(a,a,a)</code>.
<li> Make the string reading functions allow the `0x' prefix when the base is
     explicitly 16.  They currently only allow that prefix when the base is
     unspecified (zero).
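     <br>
     A minimal sketch of the desired behaviour, using a hypothetical helper
     (not GMP code): the prefix is honoured both when the base is
     unspecified and when it is explicitly 16.

```c
#include <assert.h>

/* Hypothetical helper: skip an optional "0x"/"0X" prefix when the base
   is 16 or unspecified (0), reporting the effective base and advancing
   past the prefix.  Illustrative only, not the GMP implementation.  */
static const char *
skip_base_prefix (const char *s, int base, int *eff_base)
{
  if ((base == 0 || base == 16)
      && s[0] == '0' && (s[1] == 'x' || s[1] == 'X'))
    {
      *eff_base = 16;
      return s + 2;
    }
  *eff_base = (base == 0 ? 10 : base);
  return s;
}
```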
<li> <code>mpf_eq</code> is not always correct when one operand is
     1000000000... and the other operand is 0111111111..., i.e., extremely
     close.  There is a special case in <code>mpf_sub</code> for this
     situation; put similar code in <code>mpf_eq</code>.  [In progress.]
<li> <code>mpf_eq</code> doesn't implement what gmp.texi specifies.  It should
     not use just whole limbs, but partial limbs.  [In progress.]
<li> <code>mpf_set_str</code> doesn't validate its exponent; for instance the
     garbage 123.456eX789X is accepted (and an exponent of 0 used), and
     overflow of a <code>long</code> is not detected.
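     <br>
     A sketch of the missing validation, as a hypothetical helper: reject
     trailing garbage and detect <code>long</code> overflow via
     strtol/ERANGE.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical exponent parser: fail on an empty digit string, on
   trailing garbage, and on overflow of a long -- the cases the task
   above says mpf_set_str currently misses.  */
static int
parse_exponent (const char *s, long *e)
{
  char *end;
  errno = 0;
  *e = strtol (s, &end, 10);
  if (end == s || *end != '\0')
    return -1;                  /* no digits, or trailing garbage */
  if (errno == ERANGE)
    return -1;                  /* exponent overflowed a long */
  return 0;
}
```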
<li> <code>mpf_add</code> doesn't check for a carry from truncated portions of
     the inputs, and in that respect doesn't implement the "infinite precision
     followed by truncate" specified in the manual.
<li> Windows DLLs: tests/mpz/reuse.c and tests/mpf/reuse.c initialize global
     variables with pointers to <code>mpz_add</code> etc, which doesn't work
     when those routines are coming from a DLL (because they're effectively
     function pointer global variables themselves).  Need to rearrange, perhaps
     to a set of calls to a test function rather than iterating over an array.
<li> <code>mpz_pow_ui</code>: Detect when the result would need more memory
     than a <code>size_t</code> can represent and raise some suitable
     exception, probably an alloc call asking for <code>SIZE_T_MAX</code>, and
     if that somehow succeeds then an <code>abort</code>.  Various size
     overflows of this kind are not handled gracefully, probably resulting in
     segvs.
     <br>
     In <code>mpz_n_pow_ui</code>, detect when the count of low zero bits
     exceeds an <code>unsigned long</code>.  There's a (small) chance of this
     happening but still having enough memory to represent the value.
     Reported by Winfried Dreckmann for, for instance, <code>mpz_ui_pow_ui (x,
     4UL, 1431655766UL)</code>.
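     <br>
     The kind of check such an allocation path needs might look like this
     (illustrative plain C, not the GMP internals): detect the product
     overflowing a <code>size_t</code> before asking the allocator for it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: return nonzero if a * b would overflow size_t,
   otherwise store the product in *r.  An mpz_pow_ui result-size
   computation could use this before calling the allocator.  */
static int
mul_size_overflows (size_t a, size_t b, size_t *r)
{
  if (b != 0 && a > SIZE_MAX / b)
    return 1;                   /* product would not fit a size_t */
  *r = a * b;
  return 0;
}
```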
<li> <code>mpf</code>: Detect exponent overflow and raise some exception.
     It'd be nice to allow the full <code>mp_exp_t</code> range since that's
     how it's been in the past, but maybe dropping one bit would make it
     easier to test whether e1+e2 goes out of bounds.
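     <br>
     Even with the full range, the e1+e2 test is straightforward; a sketch
     using <code>long</code> to stand in for <code>mp_exp_t</code>
     (illustrative only; dropping one bit as suggested would make it simpler
     still):

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical check: would e1 + e2 overflow the (long-sized) exponent
   range?  Tests against the bound before adding, so no undefined
   signed overflow occurs.  */
static int
exp_add_overflows (long e1, long e2)
{
  if (e2 > 0 && e1 > LONG_MAX - e2) return 1;
  if (e2 < 0 && e1 < LONG_MIN - e2) return 1;
  return 0;
}
```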
</ul>



<h4>Machine Independent Optimization</h4>
<ul>
<li> <code>mpf_cmp</code>: For better cache locality, don't test for low zero
     limbs until the high limbs fail to give an ordering.  Reduce code size by
     turning the three <code>mpn_cmp</code>'s into a single loop stopping when
     the end of one operand is reached (and then looking for a non-zero in the
     rest of the other).
<li> <code>mpf_mul_2exp</code>, <code>mpf_div_2exp</code>: The use of
     <code>mpn_lshift</code> for any size&lt;=prec means repeated
     <code>mul_2exp</code> and <code>div_2exp</code> calls accumulate low zero
     limbs until size==prec+1 is reached.  Those zeros will slow down
     subsequent operations, especially if the value is otherwise small.
     If low bits of the low limb are zero, use <code>mpn_rshift</code> so as
     not to increase the size.
<li> <code>mpn_dc_sqrtrem</code>, <code>mpn_sqrtrem2</code>: Don't use
     <code>mpn_add_1</code> and <code>mpn_sub_1</code> for 1 limb operations;
     use <code>ADDC_LIMB</code> and <code>SUBC_LIMB</code> instead.
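     <br>
     In the style of those macros, single-limb add/sub with carry or borrow
     out comes down to one compare (the names are from gmp-impl.h; these
     bodies are an illustrative sketch, assuming an unsigned limb type):

```c
#include <assert.h>

/* Sketch of ADDC_LIMB/SUBC_LIMB-style macros: w gets the low limb of
   the result, cout the carry (resp. borrow) out.  Illustrative, not
   the gmp-impl.h definitions.  */
typedef unsigned long limb_t;

#define ADDC_LIMB(cout, w, x, y)                \
  do {                                          \
    limb_t __sum = (x) + (y);                   \
    (cout) = __sum < (x);   /* carry out */     \
    (w) = __sum;                                \
  } while (0)

#define SUBC_LIMB(cout, w, x, y)                \
  do {                                          \
    (cout) = (x) < (y);     /* borrow out */    \
    (w) = (x) - (y);                            \
  } while (0)
```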
<li> <code>mpn_sqrtrem2</code>: Use plain variables for <code>sp[0]</code> and
     <code>rp[0]</code> calculations, so the compiler needn't worry about
     aliasing between <code>sp</code> and <code>rp</code>.
<li> <code>mpn_sqrtrem</code>: Some work can be saved in the last step when
     the remainder is not required, as noted in Paul's paper.
<li> <code>mpq_add</code>, <code>mpq_sub</code>: The gcd fits a single limb
     with high probability and in this case <code>binvert_limb</code> could
     be used to calculate the inverse just once for the two exact divisions
     "op1.den / gcd" and "op2.den / gcd", rather than letting
     <code>mpn_bdiv_q_1</code> do it each time.  This would require calling
     <code>mpn_pi1_bdiv_q_1</code>.
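     <br>
     The <code>binvert_limb</code> idea is a Newton iteration that doubles
     the number of correct low bits at each step; for odd d, d*d == 1 mod 8,
     so the start is already correct to 3 bits and five steps reach 64.  An
     illustrative plain-C version (not the GMP internal):

```c
#include <assert.h>
#include <stdint.h>

/* Compute inv with d * inv == 1 (mod 2^64) for odd d, by Newton
   iteration.  Each step inv *= 2 - d*inv doubles the number of correct
   low bits: 3 -> 6 -> 12 -> 24 -> 48 -> 96 >= 64.  */
static uint64_t
binvert64 (uint64_t d)          /* d must be odd */
{
  uint64_t inv = d;             /* correct mod 2^3, since d*d == 1 mod 8 */
  int i;
  for (i = 0; i < 5; i++)
    inv *= 2 - d * inv;
  return inv;
}
```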
<li> <code>mpn_gcdext</code>: Don't test <code>count_leading_zeros</code> for
     zero; instead check the high bit of the operand and avoid invoking
     <code>count_leading_zeros</code>.  This is an optimization on all
     machines, and significant on machines with slow
     <code>count_leading_zeros</code>, though it's possible an already
     normalized operand might not be encountered very often.
<li> Rewrite <code>umul_ppmm</code> to use floating-point for generating the
     most significant limb (if <code>GMP_LIMB_BITS</code> &lt;= 52 bits).
     (Peter Montgomery has some ideas on this subject.)
<li> Improve the default <code>umul_ppmm</code> code in longlong.h: Add partial
     products with fewer operations.
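     <br>
     For reference, the default scheme splits each operand into halves and
     forms four partial products, with the middle terms added so that the
     carry into the high word is not lost.  A sketch on a 32-bit limb
     (illustrative plain C, checkable against a native 64-bit product):

```c
#include <assert.h>
#include <stdint.h>

/* hi:lo = u * v via four 16x16->32 partial products, in the style of
   the longlong.h default umul_ppmm (illustrative sketch).  */
static void
umul_ppmm32 (uint32_t *hi, uint32_t *lo, uint32_t u, uint32_t v)
{
  uint32_t ul = u & 0xffff, uh = u >> 16;
  uint32_t vl = v & 0xffff, vh = v >> 16;
  uint32_t x0 = ul * vl;
  uint32_t x1 = ul * vh;
  uint32_t x2 = uh * vl;
  uint32_t x3 = uh * vh;

  x1 += x0 >> 16;               /* cannot overflow */
  x1 += x2;                     /* may overflow: carry into x3 */
  if (x1 < x2)
    x3 += 0x10000;              /* propagate the middle carry */

  *hi = x3 + (x1 >> 16);
  *lo = (x1 << 16) + (x0 & 0xffff);
}
```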
<li> Consider inlining <code>mpz_set_ui</code>.  This would be both small and
     fast, especially for compile-time constants, but would make application
     binaries depend on having 1 limb allocated to an <code>mpz_t</code>,
     preventing the "lazy" allocation scheme below.
<li> Consider inlining <code>mpz_[cft]div_ui</code> and maybe
     <code>mpz_[cft]div_r_ui</code>.  A <code>__gmp_divide_by_zero</code>
     would be needed for the divide by zero test, unless that could be left to
     <code>mpn_mod_1</code> (not sure currently whether all the risc chips
     provoke the right exception there if using mul-by-inverse).
<li> Consider inlining <code>mpz_fits_s*_p</code>.  The setups for
     <code>LONG_MAX</code> etc would need to go into gmp.h, and on Cray it
     might, unfortunately, be necessary to forcibly include &lt;limits.h&gt;
     since there's no apparent way to get <code>SHRT_MAX</code> with an
     expression (since <code>short</code> and <code>unsigned short</code> can
     be different sizes).
<li> <code>mpz_powm</code> and <code>mpz_powm_ui</code> aren't very fast on one
     or two limb moduli, due to a lot of function call overheads.  These could
     perhaps be handled as special cases.
<li> Make sure <code>mpz_powm_ui</code> is never slower than the corresponding
     computation using <code>mpz_powm</code>.
<li> <code>mpz_powm</code> REDC should do multiplications by <code>g[]</code>
     using the division method when they're small, since the REDC form of a
     small multiplier is normally a full size product.  Probably would need a
     new tuned parameter to say what size multiplier is "small", as a function
     of the size of the modulus.
<li> <code>mpn_gcd</code> might be able to be sped up on small to moderate
     sizes by improving <code>find_a</code>, possibly just by providing an
     alternate implementation for CPUs with slowish
     <code>count_leading_zeros</code>.
<li> <code>mpf_set_str</code> produces low zero limbs when a string has a
     fraction but is exactly representable, e.g. 0.5 in decimal.  These could
     be stripped to save work in later operations.
<li> <code>mpz_and</code>, <code>mpz_ior</code> and <code>mpz_xor</code> should
     use <code>mpn_and_n</code> etc for the benefit of the small number of
     targets with native versions of those routines.  Need to be careful not to
     pass size==0.  Is some code sharing possible between the <code>mpz</code>
     routines?
<li> <code>mpf_add</code>: Don't do a copy to avoid overlapping operands
     unless it's really necessary (currently only sizes are tested, not
     whether r really is u or v).
<li> <code>mpf_add</code>: Under the check for v having no effect on the
     result, perhaps test for r==u and do nothing in that case, rather than,
     as it currently appears, doing an <code>MPN_COPY_INCR</code> to reduce
     prec+1 limbs to prec.
<li> <code>mpf_div_ui</code>: Instead of padding with low zeros, call
     <code>mpn_divrem_1</code> asking for fractional quotient limbs.
<li> <code>mpf_div_ui</code>: Eliminate <code>TMP_ALLOC</code>.  When r!=u
     there's no overlap and the division can be called on those operands.
     When r==u and is prec+1 limbs, then it's an in-place division.  If r==u
     and not prec+1 limbs, then move the available limbs up to prec+1 and do
     an in-place there.
<li> <code>mpf_div_ui</code>: Whether the high quotient limb is zero can be
     determined by testing the dividend for high&lt;divisor.  When non-zero, the
     division can be done on prec dividend limbs instead of prec+1.  The result
     size is also known before the division, so that can be a tail call (once
     the <code>TMP_ALLOC</code> is eliminated).
<li> <code>mpn_divrem_2</code> could usefully accept unnormalized divisors and
     shift the dividend on-the-fly, since this should cost nothing on
     superscalar processors and avoid the need for temporary copying in
     <code>mpn_tdiv_qr</code>.
<li> <code>mpf_sqrt</code>: If r!=u, and if u doesn't need to be padded with
     zeros, then there's no need for the tp temporary.
<li> <code>mpq_cmp_ui</code> could form the <code>num1*den2</code> and
     <code>num2*den1</code> products limb-by-limb from high to low and look at
     each step for values differing by more than the possible carry bit from
     the uncalculated portion.
<li> <code>mpq_cmp</code> could do the same high-to-low progressive multiply
     and compare.  The benefits of karatsuba and higher multiplication
     algorithms are lost, but if it's assumed only a few high limbs will be
     needed to determine an order then that's fine.
<li> <code>mpn_add_1</code>, <code>mpn_sub_1</code>, <code>mpn_add</code>,
     <code>mpn_sub</code>: Internally use <code>__GMPN_ADD_1</code> etc
     instead of the functions, so they get inlined on all compilers, not just
     gcc and others with <code>inline</code> recognised in gmp.h.
     <code>__GMPN_ADD_1</code> etc are meant mostly to support application
     inline <code>mpn_add_1</code> etc and if they don't come out good for
     internal uses then special forms can be introduced, for instance many
     internal uses are in-place.  Sometimes a block of code is executed based
     on the carry-out, rather than using it arithmetically, and those places
     might want to do their own loops entirely.
<li> <code>__gmp_extract_double</code> on 64-bit systems could use just one
     bitfield for the mantissa extraction, not two, when endianness permits.
     Might depend on the compiler allowing <code>long long</code> bit fields
     when that's the only actual 64-bit type.
<li> tal-notreent.c could keep a block of memory permanently allocated.
     Currently the last nested <code>TMP_FREE</code> releases all memory, so
     there's an allocate and free every time a top-level function using
     <code>TMP</code> is called.  Would need
     <code>mp_set_memory_functions</code> to tell tal-notreent.c to release
     any cached memory when changing allocation functions though.
<li> <code>__gmp_tmp_alloc</code> from tal-notreent.c could be partially
     inlined.  If the current chunk has enough room then a couple of pointers
     can be updated.  Only if more space is required then a call to some sort
     of <code>__gmp_tmp_increase</code> would be needed.  The requirement that
     <code>TMP_ALLOC</code> is an expression might make the implementation a
     bit ugly and/or a bit sub-optimal.
<pre>
#define TMP_ALLOC(n)                                            \
  ((ROUND_UP(n) &gt; current-&gt;end - current-&gt;point ?        \
     __gmp_tmp_increase (ROUND_UP (n)) : 0),                    \
   current-&gt;point += ROUND_UP (n),                            \
   current-&gt;point - ROUND_UP (n))
</pre>
<li> <code>__mp_bases</code> has a lot of data for bases which are pretty much
     never used.  Perhaps the table should just go up to base 16, and have
     code to generate data above that, if and when required.  Naturally this
     assumes the code would be smaller than the data saved.
<li> <code>__mp_bases</code> field <code>big_base_inverted</code> is only used
     if <code>USE_PREINV_DIVREM_1</code> is true, and could be omitted
     otherwise, to save space.
<li> <code>mpz_get_str</code>, <code>mtox</code>: For power-of-2 bases, which
     are of course fast, it seems a little silly to make a second pass over
     the <code>mpn_get_str</code> output to convert to ASCII.  Perhaps combine
     that with the bit extractions.
<li> <code>mpz_gcdext</code>: If the caller requests only the S cofactor (of
     A), and A&lt;B, then the code ends up generating the cofactor T (of B) and
     deriving S from that.  Perhaps it'd be possible to arrange to get S in
     the first place by calling <code>mpn_gcdext</code> with A+B,B.  This
     might only be an advantage if A and B are about the same size.
<li> <code>mpz_n_pow_ui</code> does a good job with small bases and stripping
     powers of 2, but it's perhaps a bit too complicated for what it gains.
     The simpler <code>mpn_pow_1</code> is a little faster on small exponents.
     (Note some of the ugliness in <code>mpz_n_pow_ui</code> is due to
     supporting <code>mpn_mul_2</code>.)
     <br>
     Perhaps the stripping of 2s in <code>mpz_n_pow_ui</code> should be
     confined to single limb operands for simplicity and since that's where
     the greatest gain would be.
     <br>
     Ideally <code>mpn_pow_1</code> and <code>mpz_n_pow_ui</code> would be
     merged.  The reason <code>mpz_n_pow_ui</code> writes to an
     <code>mpz_t</code> is that its callers leave it to make a good estimate
     of the result size.  Callers of <code>mpn_pow_1</code> already know the
     size by separate means (<code>mp_bases</code>).
<li> <code>mpz_invert</code> should call <code>mpn_gcdext</code> directly.
</ul>


<h4>Machine Dependent Optimization</h4>
<ul>
<li> <code>invert_limb</code> on various processors might benefit from the
     little Newton iteration done for alpha and ia64.
<li> Alpha 21264: <code>mpn_addlsh1_n</code> could be implemented with
     <code>mpn_addmul_1</code>, since that code at 3.5 is a touch faster than
     a separate <code>lshift</code> and <code>add_n</code> at
     1.75+2.125=3.875.  Or very likely some specific <code>addlsh1_n</code>
     code could beat both.
<li> Alpha 21264: Improve feed-in code for <code>mpn_mul_1</code>,
     <code>mpn_addmul_1</code>, and <code>mpn_submul_1</code>.
<li> Alpha 21164: Rewrite <code>mpn_mul_1</code>, <code>mpn_addmul_1</code>,
     and <code>mpn_submul_1</code> for the 21164.  This should use both integer
     multiplies and floating-point multiplies.  For the floating-point
     operations, the single-limb multiplier should be split into three 21-bit
     chunks, or perhaps even better into four 16-bit chunks.  Probably possible
     to reach 9 cycles/limb.
<li> Alpha: GCC 3.4 will introduce <code>__builtin_ctzl</code>,
     <code>__builtin_clzl</code> and <code>__builtin_popcountl</code> using
     the corresponding CIX <code>ct</code> instructions, and
     <code>__builtin_alpha_cmpbge</code>.  These should give GCC more
     information about scheduling etc than the <code>asm</code> blocks
     currently used in longlong.h and gmp-impl.h.
<li> Alpha Unicos: Apparently there's no <code>alloca</code> on this system,
     making <code>configure</code> choose the slower
     <code>malloc-reentrant</code> allocation method.  Is there a better way?
     Maybe variable-length arrays per notes below.
<li> Alpha Unicos 21164, 21264: <code>.align</code> is not used since it pads
     with garbage.  Does the code get the intended slotting required for the
     claimed speeds?  <code>.align</code> at the start of a function would
     presumably be safe no matter how it pads.
<li> ARM V5: <code>count_leading_zeros</code> can use the <code>clz</code>
     instruction.  For GCC 3.4 and up, do this via <code>__builtin_clzl</code>
     since then gcc knows it's "predicable".
<li> Itanium: GCC 3.4 introduces <code>__builtin_popcount</code> which can be
     used instead of an <code>asm</code> block.  The builtin should give gcc
     more opportunities for scheduling, bundling and predication.
     <code>__builtin_ctz</code> similarly (it just uses popcount as per
     current longlong.h).
<li> UltraSPARC/64: Optimize <code>mpn_mul_1</code>, <code>mpn_addmul_1</code>,
     for s2 &lt; 2^32 (or perhaps for any zero 16-bit s2 chunk).  Not sure how
     much this can improve the speed, though, since the symmetry that we rely
     on is lost.  Perhaps we can just gain cycles when s2 &lt; 2^16, or more
     accurately, when two 16-bit s2 chunks which are 16 bits apart are zero.
<li> UltraSPARC/64: Write native <code>mpn_submul_1</code>, analogous to
     <code>mpn_addmul_1</code>.
<li> UltraSPARC/64: Write <code>umul_ppmm</code>.  Using four
     "<code>mulx</code>"s either with an asm block or via the generic C code is
     about 90 cycles.  Try using fp operations, and also try using karatsuba
     for just three "<code>mulx</code>"s.
<li> UltraSPARC/32: Rewrite <code>mpn_lshift</code>, <code>mpn_rshift</code>.
     Will give 2 cycles/limb.  Trivial modifications of mpn/sparc64 should do.
<li> UltraSPARC/32: Write special mpn_Xmul_1 loops for s2 &lt; 2^16.
<li> UltraSPARC/32: Use <code>mulx</code> for <code>umul_ppmm</code> if
     possible (see commented out code in longlong.h).  This is unlikely to
     save more than a couple of cycles, so perhaps isn't worth bothering with.
<li> UltraSPARC/32: On Solaris gcc doesn't give us <code>__sparc_v9__</code>
     or anything to indicate V9 support when -mcpu=v9 is selected.  See
     gcc/config/sol2-sld-64.h.  Will need to pass something through from
     ./configure to select the right code in longlong.h.  (Currently nothing
     is lost because <code>mulx</code> for multiplying is commented out.)
<li> UltraSPARC/32: <code>mpn_divexact_1</code> and
     <code>mpn_modexact_1c_odd</code> can use a 64-bit inverse and take
     64-bits at a time from the dividend, as per the 32-bit divisor case in
     mpn/sparc64/mode1o.c.  This must be done in assembler, since the full
     64-bit registers (<code>%gN</code>) are not available from C.
<li> UltraSPARC/32: <code>mpn_divexact_by3c</code> can work 64-bits at a time
     using <code>mulx</code>, in assembler.  This would be the same as for
     sparc64.
<li> UltraSPARC: <code>binvert_limb</code> might save a few cycles from
     masking down to just the useful bits at each point in the calculation,
     since <code>mulx</code> speed depends on the highest bit set.  Either
     explicit masks or small types like <code>short</code> and
     <code>int</code> ought to work.
<li> Sparc64 HAL R1 <code>popc</code>: This chip reputedly implements
     <code>popc</code> properly (see gcc sparc.md).  Would need to recognise
     it as <code>sparchalr1</code> or something in configure / config.sub /
     config.guess.  <code>popc_limb</code> in gmp-impl.h could use this (per
     commented out code).  <code>count_trailing_zeros</code> could use it too.
<li> PA64: Improve <code>mpn_addmul_1</code>, <code>mpn_submul_1</code>, and
     <code>mpn_mul_1</code>.  The current code runs at 11 cycles/limb.  It
     should be possible to saturate the cache, which will happen at 8
     cycles/limb (7.5 for mpn_mul_1).  Write special loops for s2 &lt; 2^32;
     it should be possible to make them run at about 5 cycles/limb.
<li> PPC601: See which of the power or powerpc32 code runs better.  Currently
     the powerpc32 is used, but only because it's the default for
     <code>powerpc*</code>.
<li> PPC630: Rewrite <code>mpn_addmul_1</code>, <code>mpn_submul_1</code>, and
     <code>mpn_mul_1</code>.  Use both integer and floating-point operations,
     possibly two floating-point and one integer limb per loop.  Split operands
     into four 16-bit chunks for fast fp operations.  Should easily reach 9
     cycles/limb (using one int + one fp), but perhaps even 7 cycles/limb
     (using one int + two fp).
<li> PPC630: <code>mpn_rshift</code> could do the same sort of unrolled loop
     as <code>mpn_lshift</code>.  Some judicious use of m4 might let the two
     share source code, or with a register to control the loop direction
     perhaps even share object code.
<li> Implement <code>mpn_mul_basecase</code> and <code>mpn_sqr_basecase</code>
     for important machines.  Helping the generic sqr_basecase.c with an
     <code>mpn_sqr_diagonal</code> might be enough for some of the RISCs.
<li> POWER2/POWER2SC: Schedule <code>mpn_lshift</code>/<code>mpn_rshift</code>.
     Will bring time from 1.75 to 1.25 cycles/limb.
<li> X86: Optimize non-MMX <code>mpn_lshift</code> for shifts by 1.  (See
     Pentium code.)
<li> X86: Good authority has it that in the past an inline <code>rep
     movs</code> would upset GCC register allocation for the whole function.
     Is this still true in GCC 3?  It uses <code>rep movs</code> itself for
     <code>__builtin_memcpy</code>.  Examine the code for some simple and
     complex functions to find out.  Inlining <code>rep movs</code> would be
     desirable; it'd be both smaller and faster.
<li> Pentium P54: <code>mpn_lshift</code> and <code>mpn_rshift</code> can come
     down from 6.0 c/l to 5.5 or 5.375 by paying attention to pairing after
     <code>shrdl</code> and <code>shldl</code>, see mpn/x86/pentium/README.
<li> Pentium P55 MMX: <code>mpn_lshift</code> and <code>mpn_rshift</code>
     might benefit from some destination prefetching.
<li> PentiumPro: <code>mpn_divrem_1</code> might be able to use a
     mul-by-inverse, hoping for maybe 30 c/l.
<li> K7: <code>mpn_lshift</code> and <code>mpn_rshift</code> might be able to
     do something branch-free for unaligned startups, and shaving one insn
     from the loop with alternative indexing might save a cycle.
<li> PPC32: Try using fewer registers in the current <code>mpn_lshift</code>.
     The pipeline is now extremely deep, perhaps unnecessarily deep.
<li> Fujitsu VPP: Vectorize main functions, perhaps in assembly language.
<li> Fujitsu VPP: Write <code>mpn_mul_basecase</code> and
     <code>mpn_sqr_basecase</code>.  This should use a "vertical multiplication
     method" to avoid carry propagation, splitting one of the operands into
     11-bit chunks.
<li> Pentium: <code>mpn_lshift</code> by 31 should use the special rshift
     by 1 code, and vice versa <code>mpn_rshift</code> by 31 should use the
     special lshift by 1.  This would be best as a jump across to the other
     routine; both could live in lshift.asm, omitting rshift.asm on finding
     <code>mpn_rshift</code> already provided.
<li> Cray T3E: Experiment with optimization options.  In particular,
     -hpipeline3 seems promising.  We should at least up -O to -O2 or -O3.
<li> Cray: <code>mpn_com</code> and <code>mpn_and_n</code> etc very probably
     want a pragma like <code>MPN_COPY_INCR</code>.
<li> Cray vector systems: <code>mpn_lshift</code>, <code>mpn_rshift</code>,
     <code>mpn_popcount</code> and <code>mpn_hamdist</code> are nice and small
     and could be inlined to avoid function calls.
<li> Cray: Variable length arrays seem to be faster than the tal-notreent.c
     scheme.  Not sure why; maybe they merely give the compiler more
     information about aliasing (or the lack thereof).  Would like to modify
     <code>TMP_ALLOC</code> to use them, or introduce a new scheme.  Memory
     blocks wanted unconditionally are easy enough; those wanted only
     sometimes are a problem.  Perhaps a special size calculation could ask
     for a dummy length 1 when unwanted, or perhaps an inlined subroutine
     could duplicate code under each conditional.  Don't really want to turn
     everything into a dog's dinner just because Cray doesn't offer an
     <code>alloca</code>.
<li> Cray: <code>mpn_get_str</code> on power-of-2 bases ought to vectorize.
     Does it?  <code>bits_per_digit</code> and the inner loop over bits in a
     limb might prevent it.  Perhaps special cases for binary, octal and hex
     would be worthwhile (very possibly for all processors too).
<li> S390: <code>BSWAP_LIMB_FETCH</code> looks like it could be done with
     <code>lrvg</code>, as per glibc sysdeps/s390/s390-64/bits/byteswap.h.
     Is this only for 64-bit mode, since 32-bit mode has other code?  Also,
     is it worth using for <code>BSWAP_LIMB</code> too, or would that mean a
     store and re-fetch?  Presumably that's what comes out in glibc.
<li> Improve <code>count_leading_zeros</code> for 64-bit machines:
  <pre>
	   if ((x &gt;&gt; 32) == 0) { x &lt;&lt;= 32; cnt += 32; }
	   if ((x &gt;&gt; 48) == 0) { x &lt;&lt;= 16; cnt += 16; }
	   ... </pre>
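     The full binary search continues halving the window down to a single
     bit; a complete generic-C version along those lines (illustrative,
     assuming a nonzero 64-bit operand):

```c
#include <assert.h>
#include <stdint.h>

/* Count leading zeros of a nonzero 64-bit x by binary search: each
   test halves the window, shifting the value up and accumulating the
   count.  Six tests for 64 bits.  */
static int
clz64 (uint64_t x)              /* x must be nonzero */
{
  int cnt = 0;
  if ((x >> 32) == 0) { x <<= 32; cnt += 32; }
  if ((x >> 48) == 0) { x <<= 16; cnt += 16; }
  if ((x >> 56) == 0) { x <<= 8;  cnt += 8; }
  if ((x >> 60) == 0) { x <<= 4;  cnt += 4; }
  if ((x >> 62) == 0) { x <<= 2;  cnt += 2; }
  if ((x >> 63) == 0) { x <<= 1;  cnt += 1; }
  return cnt;
}
```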
    455 <li> IRIX 6 MIPSpro compiler has an <code>__inline</code> which could perhaps
    456      be used in <code>__GMP_EXTERN_INLINE</code>.  What would be the right way
    457      to identify suitable versions of that compiler?
    458 <li> IRIX <code>cc</code> is rumoured to have an <code>_int_mult_upper</code>
    459      (in <code>&lt;intrinsics.h&gt;</code> like Cray), but it didn't seem to
    460      exist on some IRIX 6.5 systems tried.  If it does actually exist
    461      somewhere it would very likely be an improvement over a function call to
    462      umul.asm.
    463 <li> <code>mpn_get_str</code> final divisions by the base with
    464      <code>udiv_qrnd_unnorm</code> could use some sort of multiply-by-inverse
    465      on suitable machines.  This ends up happening for decimal by presenting
    466      the compiler with a run-time constant, but the same for other bases would
    467      be good.  Perhaps use could be made of the fact base&lt;256.
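The multiply-by-inverse idea can be sketched for one fixed base; here base 10 on a 32-bit value, using the standard Granlund/Montgomery style magic constant (0xCCCCCCCD is ceil(2^35/10), and the identity holds for every 32-bit operand). This is only an illustration of the technique, not GMP code:

```c
#include <stdint.h>

/* Divide a 32-bit value by the constant base 10 with one multiply and
   one shift, instead of a hardware divide.  0xCCCCCCCD = ceil(2^35/10);
   the magic-constant condition holds for all 32-bit n. */
static uint32_t
div10 (uint32_t n)
{
  return (uint32_t) (((uint64_t) n * 0xCCCCCCCDULL) >> 35);
}
```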
    468 <li> <code>mpn_umul_ppmm</code>, <code>mpn_udiv_qrnnd</code>: Return a
    469      structure like <code>div_t</code> to avoid going through memory, in
    470      particular helping RISCs that don't do store-to-load forwarding.  Clearly
    471      this is only possible if the ABI returns a structure of two
    472      <code>mp_limb_t</code>s in registers.
    473      <br>
    474      On PowerPC, structures are returned in memory on AIX and Darwin.  In SVR4
    475      they're returned in registers, except that draft SVR4 had said memory, so
    476      it'd be prudent to check which is done.  We can jam the compiler into the
    477      right mode if we know how, since all this is purely internal to libgmp.
    478      (gcc has an option, though of course gcc doesn't matter since we use
    479      inline asm there.)
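The struct-return idea can be shown in miniature with 32-bit limbs, where a 64-bit product is portable C; a real GMP limb would need a wider multiply, and the names here are illustrative only:

```c
#include <stdint.h>

/* div_t-style struct return for umul_ppmm: both product halves come
   back in one small struct, which a suitable ABI keeps in registers
   rather than going through memory. */
typedef struct { uint32_t hi, lo; } umul_result;

static umul_result
umul_ppmm_struct (uint32_t a, uint32_t b)
{
  uint64_t p = (uint64_t) a * b;
  umul_result r;
  r.hi = (uint32_t) (p >> 32);
  r.lo = (uint32_t) p;
  return r;
}
```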
    480 </ul>
    481 
    482 <h4>New Functionality</h4>
    483 <ul>
    484 <li> Maybe add <code>mpz_crr</code> (Chinese Remainder Reconstruction).
    485 <li> Let `0b' and `0B' mean binary input everywhere.
    486 <li> <code>mpz_init</code> and <code>mpq_init</code> could do lazy allocation.
    487      Set <code>ALLOC(var)</code> to 0 to indicate nothing allocated, and let
    488      <code>_mpz_realloc</code> do the initial alloc.  Set
    489      <code>z-&gt;_mp_d</code> to a dummy that <code>mpz_get_ui</code> and
    490      similar can unconditionally fetch from.  Niels Möller has had a go at
    491      this.
    492      <br>
    493      The advantages of the lazy scheme would be:
    494      <ul>
    495      <li> Initial allocate would be the size required for the first value
    496           stored, rather than getting 1 limb in <code>mpz_init</code> and then
    497           more or less immediately reallocating.
    498      <li> <code>mpz_init</code> would only store magic values in the
    499           <code>mpz_t</code> fields, and could be inlined.
    500      <li> A fixed initializer could even be used by applications, like
    501           <code>mpz_t z = MPZ_INITIALIZER;</code>, which might be convenient
    502           for globals.
    503      </ul>
    504      The advantages of the current scheme are:
    505      <ul>
    506      <li> <code>mpz_set_ui</code> and other similar routines needn't check the
    507           size allocated and can just store unconditionally.
    508      <li> <code>mpz_set_ui</code> and perhaps others like
    509           <code>mpz_tdiv_r_ui</code> and a prospective
    510           <code>mpz_set_ull</code> could be inlined.
    511      </ul>
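The lazy scheme might be sketched on a miniature stand-in type (not the real mpz internals; all names here are hypothetical):

```c
/* Lazy-init sketch: alloc==0 means "nothing allocated yet", and d
   points at a shared dummy limb so that get-type functions can fetch
   unconditionally without checking the allocation. */
typedef struct { int alloc; int size; unsigned long *d; } lazy_int;

static unsigned long dummy_limb = 0;

static void
lazy_init (lazy_int *z)
{
  z->alloc = 0;          /* first real store triggers the allocation */
  z->size = 0;
  z->d = &dummy_limb;    /* always safe to fetch from */
}

static unsigned long
lazy_get_ui (const lazy_int *z)
{
  /* No alloc check needed: d always points at at least one limb. */
  return z->size != 0 ? z->d[0] : 0;
}

static int
lazy_demo (void)
{
  lazy_int z;
  lazy_init (&z);
  return lazy_get_ui (&z) == 0 && z.alloc == 0;
}
```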
    512 <li> Add <code>mpf_out_raw</code> and <code>mpf_inp_raw</code>.  Make sure
    513      format is portable between 32-bit and 64-bit machines, and between
    514      little-endian and big-endian machines.  A format which MPFR can use too
    515      would be good.
    516 <li> <code>mpn_and_n</code> ... <code>mpn_copyd</code>: Perhaps make the mpn
    517      logops and copys available in gmp.h, either as library functions or
    518      inlines, with the availability of library functions instantiated in the
    519      generated gmp.h at build time.
    520 <li> <code>mpz_set_str</code> etc variants taking string lengths rather than
    521      null-terminators.
    522 <li> <code>mpz_andn</code>, <code>mpz_iorn</code>, <code>mpz_nand</code>,
    523      <code>mpz_nior</code>, <code>mpz_xnor</code> might be useful additions,
    524      if they could share code with the current such functions (which should be
    525      possible).
    526 <li> <code>mpz_and_ui</code> etc might be of use sometimes.  Suggested by
    527      Niels Möller.
    528 <li> <code>mpf_set_str</code> and <code>mpf_inp_str</code> could usefully
    529      accept 0x, 0b etc when base==0.  Perhaps the exponent could default to
    530      decimal in this case, with a further 0x, 0b etc allowed there.
    531      Eg. 0xFFAA@0x5A.  A leading "0" for octal would match the integers, but
    532      probably something like "0.123" ought not mean octal.
    533 <li> <code>GMP_LONG_LONG_LIMB</code> or some such could become a documented
    534      feature of gmp.h, so applications could know whether to
    535      <code>printf</code> a limb using <code>%lu</code> or <code>%Lu</code>.
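How an application might use such a macro (GMP_LONG_LONG_LIMB is the proposed feature, not an existing one, so a local stand-in is defined here for illustration; "%llu" is the C99 spelling of the "%Lu" mentioned above):

```c
#include <stdio.h>
#include <string.h>

#define GMP_LONG_LONG_LIMB 1    /* stand-in for the proposed gmp.h macro */

#ifdef GMP_LONG_LONG_LIMB
typedef unsigned long long limb_t;
#define LIMB_FMT "%llu"         /* long long limb */
#else
typedef unsigned long limb_t;
#define LIMB_FMT "%lu"          /* plain long limb */
#endif

static int
format_limb (char *buf, size_t n, limb_t x)
{
  return snprintf (buf, n, LIMB_FMT, x);
}

static int
limb_format_demo (void)
{
  char buf[32];
  format_limb (buf, sizeof buf, (limb_t) 12345);
  return strcmp (buf, "12345") == 0;
}
```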
    536 <li> <code>GMP_PRIdMP_LIMB</code> and similar defines following C99
    537      &lt;inttypes.h&gt; might be of use to applications printing limbs.  But
    538      if <code>GMP_LONG_LONG_LIMB</code> or whatever is added then perhaps this
    539      can easily enough be left to applications.
    540 <li> <code>gmp_printf</code> could accept <code>%b</code> for binary output.
    541      It'd be nice if it worked for plain <code>int</code> etc too, not just
    542      <code>mpz_t</code> etc.
    543 <li> <code>gmp_printf</code> in fact could usefully accept an arbitrary base,
    544      for both integer and float conversions.  A base either in the format
    545      string or as a parameter with <code>*</code> should be allowed.  Maybe
    546      <code>&amp;13b</code> (b for base) or something like that.
    547 <li> <code>gmp_printf</code> could perhaps accept <code>mpq_t</code> for float
    548      conversions, eg. <code>"%.4Qf"</code>.  This would be merely for
    549      convenience, but still might be useful.  Rounding would be the same as
    550      for an <code>mpf_t</code> (ie. currently round-to-nearest, but not
    551      actually documented).  Alternately, perhaps a separate
    552      <code>mpq_get_str_point</code> or some such might be more use.  Suggested
    553      by Pedro Gimeno.
    554 <li> <code>mpz_rscan0</code> or <code>mpz_revscan0</code> or some such
    555      searching towards the low end of an integer might match
    556      <code>mpz_scan0</code> nicely.  Likewise for <code>scan1</code>.
    557      Suggested by Roberto Bagnara.
    558 <li> <code>mpz_bit_subset</code> or some such to test whether one integer is a
    559      bitwise subset of another might be of use.  Some sort of return value
    560      indicating whether it's a proper or non-proper subset would be good and
    561      wouldn't cost anything in the implementation.  Suggested by Roberto
    562      Bagnara.
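On plain words the predicate and its free proper/non-proper distinction look like this (a sketch only; an mpz version would use mpz_and and mpz_cmp):

```c
/* Bitwise-subset test: returns 0 if a is not a subset of b, 1 if a is
   a proper subset, and 2 if a == b -- the proper/non-proper
   information costs nothing extra. */
static int
bit_subset (unsigned long a, unsigned long b)
{
  if ((a & ~b) != 0)
    return 0;                   /* some bit of a is missing from b */
  return a == b ? 2 : 1;
}
```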
    563 <li> <code>mpf_get_ld</code>, <code>mpf_set_ld</code>: Conversions between
    564      <code>mpf_t</code> and <code>long double</code>, suggested by Dan
    565      Christensen.  Other <code>long double</code> routines might be desirable
    566      too, but <code>mpf</code> would be a start.
    567      <br>
    568      <code>long double</code> is an ANSI-ism, so everything involving it would
    569      need to be suppressed on a K&amp;R compiler.
    570      <br>
    571      There'd be some work to be done by <code>configure</code> to recognise
    572      the format in use; MPFR has a start on this.  Often <code>long
    573      double</code> is the same as <code>double</code>, which is easy but
    574      pretty pointless.  A single float format detector macro could look at
    575      <code>double</code> then <code>long double</code>.
    576      <br>
    577      Sometimes there's a compiler option for the size of a <code>long
    578      double</code>, eg. xlc on AIX can use either 64-bit or 128-bit.  It's
    579      probably simplest to regard this as a compiler compatibility issue, and
    580      leave it to users or sysadmins to ensure application and library code is
    581      built the same.
    582 <li> <code>mpz_sqrt_if_perfect_square</code>: When
    583      <code>mpz_perfect_square_p</code> does its tests it calculates a square
    584      root and then discards it.  For some applications it might be useful to
    585      return that root.  Suggested by Jason Moxham.
    586 <li> <code>mpz_get_ull</code>, <code>mpz_set_ull</code>,
    587      <code>mpz_get_sll</code>, <code>mpz_set_sll</code>: Conversions for
    588      <code>long long</code>.  These would aid interoperability, though a
    589      mixture of GMP and <code>long long</code> would probably not be too
    590      common.  Since <code>long long</code> is not always available (it's in
    591      C99 and GCC though), disadvantages of using <code>long long</code> in
    592      libgmp.a would be
    593      <ul>
    594      <li> Library contents vary according to the build compiler.
    595      <li> gmp.h would need an ugly <code>#ifdef</code> block to decide if the
    596           application compiler could take the <code>long long</code>
    597           prototypes.
    598      <li> Some sort of <code>LIBGMP_HAS_LONGLONG</code> might be wanted to
    599           indicate whether the functions are available.  (Applications using
    600           autoconf could probe the library too.)
    601      </ul>
    602      It'd be possible to defer the need for <code>long long</code> to
    603      application compile time, by having something like
    604      <code>mpz_set_2ui</code> called with two halves of a <code>long
    605      long</code>.  Disadvantages of this would be,
    606      <ul>
    607      <li> Bigger code in the application, though perhaps not if a <code>long
    608           long</code> is normally passed as two halves anyway.
    609      <li> <code>mpz_get_ull</code> would be a rather big inline, or would have
    610           to be two function calls.
    611      <li> <code>mpz_get_sll</code> would be a worse inline, and would put the
    612           treatment of <code>-0x10..00</code> into applications (see
    613           <code>mpz_get_si</code> correctness above).
    614      <li> Although having libgmp.a independent of the build compiler is nice,
    615           it sort of sacrifices the capabilities of a good compiler to
    616           uniformity with inferior ones.
    617      </ul>
    618      Plain use of <code>long long</code> is probably the lesser evil, if only
    619      because it makes best use of gcc.  In fact perhaps it would suffice to
    620      guarantee <code>long long</code> conversions only when using GCC for both
    621      application and library.  That would cover free software, and we can
    622      worry about selected vendor compilers later.
    623      <br>
    624      In C++ the situation is probably clearer: we demand fairly recent C++, so
    625      <code>long long</code> should be available always.  We'd probably prefer
    626      to have the C and C++ the same in respect of <code>long long</code>
    627      support, but it would be possible to have it unconditionally in gmpxx.h,
    628      by some means or another.
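The two-halves deferral can be shown in miniature on plain words (set_2ui is a hypothetical name from the text above; recombining into a uint64_t stands in for storing into an mpz_t):

```c
#include <stdint.h>

/* mpz_set_2ui idea: the application splits its long long into two
   halves at its own compile time, so the library never needs the
   long long type itself. */
static uint64_t
set_2ui (uint32_t hi, uint32_t lo)
{
  /* A real mpz_set_2ui would store these limbs into an mpz_t. */
  return ((uint64_t) hi << 32) | lo;
}

static uint64_t
roundtrip_ull (uint64_t x)
{
  /* Application-side splitting, library-side reassembly. */
  return set_2ui ((uint32_t) (x >> 32), (uint32_t) x);
}
```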
    629 <li> <code>mpz_strtoz</code> parsing the same as <code>strtol</code>.
    630      Suggested by Alexander Kruppa.
    631 </ul>
    632 
    633 
    634 <h4>Configuration</h4>
    635 
    636 <ul>
    637 <li> Alpha ev7, ev79: Add code to config.guess to detect these.  Believe ev7
    638      will be "3-1307" in the current switch, but need to verify that.  (On
    639      OSF, current configfsf.guess identifies ev7 using psrinfo, we need to do
    640      it ourselves for other systems.)
    641 <li> Alpha OSF: Libtool (version 1.5) doesn't seem to recognise this system is
    642      "pic always" and ends up running gcc twice with the same options.  This
    643      is wasteful, but harmless.  Perhaps a newer libtool will be better.
    644 <li> ARM: <code>umul_ppmm</code> in longlong.h always uses <code>umull</code>,
    645      but is that available only for M series chips or some such?  Perhaps it
    646      should be configured in some way.
    647 <li> HPPA: config.guess should recognize 7000, 7100, 7200, and 8x00.
    648 <li> HPPA: gcc 3.2 introduces a <code>-mschedule=7200</code> etc parameter,
    649      which could be driven by an exact hppa cpu type.
    650 <li> Mips: config.guess should say mipsr3000, mipsr4000, mipsr10000, etc.
    651      "hinv -c processor" gives lots of information on Irix.  Standard
    652      config.guess appends "el" to indicate endianness, but
    653      <code>AC_C_BIGENDIAN</code> seems the best way to handle that for GMP.
    654 <li> PowerPC: The function descriptor nonsense for AIX is currently driven by
    655      <code>*-*-aix*</code>.  It might be more reliable to do some sort of
    656      feature test, examining the compiler output perhaps.  It might also be
    657      nice to merge the aix.m4 files into powerpc-defs.m4.
    658 <li> config.m4 is generated only by the configure script; it won't be
    659      regenerated by config.status.  Creating it as an <code>AC_OUTPUT</code>
    660      would work, but it might upset "make" to have things like <code>L$</code>
    661      get into the Makefiles through <code>AC_SUBST</code>.
    662      <code>AC_CONFIG_COMMANDS</code> would be the alternative.  With some
    663      careful m4 quoting the <code>changequote</code> calls might not be
    664      needed, which might free up the order in which things had to be output.
    665 <li> Automake: Latest automake has a <code>CCAS</code>, <code>CCASFLAGS</code>
    666      scheme.  Though we probably wouldn't be using its assembler support we
    667      could try to use those variables in compatible ways.
    668 <li> <code>GMP_LDFLAGS</code> could probably be done with plain
    669      <code>LDFLAGS</code> already used by automake for all linking.  But with
    670      a bit of luck the next libtool will pass pretty much all
    671      <code>CFLAGS</code> through to the compiler when linking, making
    672      <code>GMP_LDFLAGS</code> unnecessary.
    673 <li> mpn/Makeasm.am uses <code>-c</code> and <code>-o</code> together in the
    674      .S and .asm rules, but apparently that isn't completely portable (there's
    675      an autoconf <code>AC_PROG_CC_C_O</code> test for it).  So far we've not
    676      had problems, but perhaps the rules could be rewritten to use "foo.s" as
    677      the temporary, or to do a suitable "mv" of the result.  The only danger
    678      from using foo.s would be if a compile failed and the temporary foo.s
    679      then looked like the primary source.  Hopefully if the
    680      <code>SUFFIXES</code> are ordered to have .S and .asm ahead of .s that
    681      wouldn't happen.  Might need to check.
    682 </ul>
    683 
    684 
    685 <h4>Random Numbers</h4>
    686 <ul>
    687 <li> <code>_gmp_rand</code> is not particularly fast on the linear
    688      congruential algorithm and could stand various improvements.
    689      <ul>
    690      <li> Make a second seed area within <code>gmp_randstate_t</code> (or
    691           <code>_mp_algdata</code> rather) to save some copying.
    692      <li> Make a special case for a single limb <code>2exp</code> modulus, to
    693           avoid <code>mpn_mul</code> calls.  Perhaps the same for two limbs.
    694      <li> Inline the <code>lc</code> code, to avoid a function call and
    695           <code>TMP_ALLOC</code> for every chunk.
    696      <li> Perhaps the <code>2exp</code> and general LC cases should be split,
    697           for clarity (if the general case is retained).
    698      </ul>
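The single-limb <code>2exp</code> special case amounts to replacing the modular reduction with a mask, roughly as follows (constants and names are illustrative, not GMP's):

```c
#include <stdint.h>

/* One linear congruential step X <- (a*X + c) mod 2^m with the modulus
   a power of 2 fitting in one limb: the "mod" is just a mask, and no
   mpn_mul call is needed. */
static uint64_t
lc_step_2exp (uint64_t x, uint64_t a, uint64_t c, unsigned m)
{
  uint64_t mask = (m >= 64) ? ~(uint64_t) 0 : (((uint64_t) 1 << m) - 1);
  return (a * x + c) & mask;
}
```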
    699 <li> <code>gmp_randstate_t</code> used for parameters perhaps should become
    700      <code>gmp_randstate_ptr</code> the same as other types.
    701 <li> Some of the empirical randomness tests could be included in a "make
    702      check".  They ought to work everywhere, for a given seed at least.
    703 </ul>
    704 
    705 
    706 <h4>C++</h4>
    707 <ul>
    708 <li> <code>mpz_class(string)</code>, etc: Use the C++ global locale to
    709      identify whitespace.
    710      <br>
    711      <code>mpf_class(string)</code>: Use the C++ global locale decimal point,
    712      rather than the C one.
    713      <br>
    714      Consider making these variant <code>mpz_set_str</code> etc forms
    715      available for <code>mpz_t</code> too, not just <code>mpz_class</code>
    716      etc.
    717 <li> <code>mpq_class operator+=</code>: Don't emit an unnecessary
    718      <code>mpq_set(q,q)</code> before <code>mpz_addmul</code> etc.
    719 <li> Put various bits of gmpxx.h into libgmpxx, to avoid excessive inlining.
    720      Candidates for this would be,
    721      <ul>
    722      <li> <code>mpz_class(const char *)</code>, etc: since they're normally
    723           not fast anyway, and we can hide the exception <code>throw</code>.
    724      <li> <code>mpz_class(string)</code>, etc: to hide the <code>cstr</code>
    725           needed to get to the C conversion function.
    726      <li> <code>mpz_class string, char*</code> etc constructors: likewise to
    727           hide the throws and conversions.
    728      <li> <code>mpz_class::get_str</code>, etc: to hide the <code>char*</code>
    729           to <code>string</code> conversion and free.  Perhaps
    730           <code>mpz_get_str</code> can write directly into a
    731           <code>string</code>, to avoid copying.
    732           <br>
    733           Consider making such <code>string</code> returning variants
    734           available for use with plain <code>mpz_t</code> etc too.
    735      </ul>
    736 </ul>
    737 
    738 <h4>Miscellaneous</h4>
    739 <ul>
    740 <li> <code>mpz_gcdext</code> and <code>mpn_gcdext</code> ought to document
    741      what range of values the generated cofactors can take, and preferably
    742      ensure the definition uniquely specifies the cofactors for given inputs.
    743      A basic extended Euclidean algorithm or multi-step variant leads to
    744      |x|&lt;|b| and |y|&lt;|a| or something like that, but there's probably
    745      two solutions under just those restrictions.
    746 <li> demos/factorize.c: use <code>mpz_divisible_ui_p</code> rather than
    747      <code>mpz_tdiv_qr_ui</code>.  (Of course dividing multiple primes at a
    748      time would be better still.)
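The point in miniature, on plain words where the cheap divisibility test is <code>n % p == 0</code> (with <code>mpz_t</code> it would be <code>mpz_divisible_ui_p</code>):

```c
/* Remove all factors p from *n: test divisibility first, and divide
   only when the division is known to be exact, rather than computing
   a quotient/remainder pair for every candidate. */
static int
count_prime_factor (unsigned long *n, unsigned long p)
{
  int count = 0;
  while (*n % p == 0)   /* the cheap divisibility test */
    {
      *n /= p;          /* divide only on success */
      count++;
    }
  return count;
}

static int
factor_demo (void)
{
  unsigned long n = 360;        /* 2^3 * 3^2 * 5 */
  int a = count_prime_factor (&n, 2);
  int b = count_prime_factor (&n, 3);
  int c = count_prime_factor (&n, 5);
  return a == 3 && b == 2 && c == 1 && n == 1;
}
```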
    749 <li> The various test programs use quite a bit of the main
    750      <code>libgmp</code>.  This establishes good cross-checks, but it might be
    751      better to use simple reference routines where possible.  Where it's not
    752      possible some attention could be paid to the order of the tests, so a
    753      <code>libgmp</code> routine is only used for tests once it seems to be
    754      good.
    755 <li> <code>MUL_FFT_THRESHOLD</code> etc: the FFT thresholds should allow a
    756      return to a previous k at certain sizes.  This arises basically due to
    757      the step effect caused by size multiples effectively used for each k.
    758      Looking at a graph makes it fairly clear.
    759 <li> <code>__gmp_doprnt_mpf</code> does a rather unattractive round-to-nearest
    760      on the string returned by <code>mpf_get_str</code>.  Perhaps some variant
    761      of <code>mpf_get_str</code> could be made which would better suit.
    762 </ul>
    763 
    764 
    765 <h4>Aids to Development</h4>
    766 <ul>
    767 <li> Add <code>ASSERT</code>s at the start of each user-visible mpz/mpq/mpf
    768      function to check the validity of each <code>mp?_t</code> parameter, in
    769      particular to check they've been <code>mp?_init</code>ed.  This might
    770      catch elementary mistakes in user programs.  Care would need to be taken
    771      over <code>MPZ_TMP_INIT</code>ed variables used internally.  If nothing
    772      else then consistency checks like size&lt;=alloc, ptr not
    773      <code>NULL</code> and ptr+size not wrapping around the address space,
    774      would be possible.  A more sophisticated scheme could track
    775      <code>_mp_d</code> pointers and ensure only a valid one is used.  Such a
    776      scheme probably wouldn't be reentrant, not without some help from the
    777      system.
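The basic consistency checks might be sketched on a miniature stand-in type (not the real mpz internals; all names here are hypothetical):

```c
#include <stddef.h>
#include <stdint.h>

typedef struct { int alloc; int size; unsigned long *d; } fake_mpz;

/* Validity check of the kind suggested above: size within alloc,
   non-NULL pointer, and ptr+size not wrapping the address space. */
static int
fake_mpz_valid_p (const fake_mpz *z)
{
  int absize = z->size >= 0 ? z->size : -z->size;
  if (z->d == NULL)
    return 0;
  if (absize > z->alloc)
    return 0;
  if ((uintptr_t) z->d + (size_t) absize * sizeof (z->d[0])
      < (uintptr_t) z->d)
    return 0;                   /* ptr + size wrapped around */
  return 1;
}

static int
valid_demo (void)
{
  unsigned long limb = 0;
  fake_mpz ok  = { 1, 1, &limb };
  fake_mpz bad = { 1, 2, &limb };       /* size exceeds alloc */
  return fake_mpz_valid_p (&ok) == 1 && fake_mpz_valid_p (&bad) == 0;
}
```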
    778 <li> tune/time.c could try to determine at runtime whether
    779      <code>getrusage</code> and <code>gettimeofday</code> are reliable.
    780      Currently we pretend in configure that the dodgy m68k netbsd 1.4.1
    781      <code>getrusage</code> doesn't exist.  If a test might take a long time
    782      to run then perhaps cache the result in a file somewhere.
    783 <li> tune/time.c could choose the default precision based on the
    784      <code>speed_unittime</code> determined, independent of the method in use.
    785 <li> Cray vector systems: CPU frequency could be determined from
    786      <code>sysconf(_SC_CLK_TCK)</code>, since it seems to be clock cycle
    787      based.  Is this true for all Cray systems?  Would like some documentation
    788      or something to confirm.
    789 </ul>
    790 
    791 
    792 <h4>Documentation</h4>
    793 <ul>
    794 <li> <code>mpz_inp_str</code> (etc) doesn't say when it stops reading digits.
    795 <li> <code>mpn_get_str</code> isn't terribly clear about how many digits it
    796      produces.  It'd probably be possible to say at most one leading zero,
    797      which is what both it and <code>mpz_get_str</code> currently do.  But
    798      want to be careful not to bind ourselves to something that might not suit
    799      another implementation.
    800 <li> <code>va_arg</code> doesn't do the right thing with <code>mpz_t</code>
    801      etc directly, but instead needs a pointer type like <code>MP_INT*</code>.
    802      It'd be good to show how to do this, but we'd either need to document
    803      <code>mpz_ptr</code> and friends, or perhaps fallback on something
    804      slightly nasty with <code>void*</code>.
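The point in miniature: variadic fetching must name a pointer type, never the array-of-one-struct type itself. Here <code>fake_mpz_ptr</code> stands in for <code>mpz_ptr</code> / <code>MP_INT *</code>:

```c
#include <stdarg.h>

typedef struct { int size; long value; } fake_mpz_struct;
typedef fake_mpz_struct *fake_mpz_ptr;

/* Sum the values of n variadic arguments, fetching each with the
   pointer type -- va_arg (ap, fake_mpz_struct[1]) would be wrong. */
static long
sum_values (int n, ...)
{
  va_list ap;
  long total = 0;
  va_start (ap, n);
  while (n-- > 0)
    {
      fake_mpz_ptr p = va_arg (ap, fake_mpz_ptr);  /* pointer type */
      total += p->value;
    }
  va_end (ap);
  return total;
}

static long
sum_demo (void)
{
  fake_mpz_struct a = { 1, 10 }, b = { 1, 32 };
  return sum_values (2, &a, &b);
}
```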
    805 </ul>
    806 
    807 
    808 <h4>Bright Ideas</h4>
    809 
    810 <p> The following may or may not be feasible, and aren't likely to get done in the
    811 near future, but are at least worth thinking about.
    812 
    813 <ul>
    814 <li> Reorganize longlong.h so that we can inline the operations even for the
    815      system compiler.  When there is no such compiler feature, make calls to
    816      stub functions.  Write such stub functions for as many machines as
    817      possible.
    818 <li> longlong.h could declare when it's using, or would like to use,
    819      <code>mpn_umul_ppmm</code>, and the corresponding umul.asm file could be
    820      included in libgmp only in that case, the same as is effectively done for
    821      <code>__clz_tab</code>.  Likewise udiv.asm and perhaps cntlz.asm.  This
    822      would only be a very small space saving, so perhaps not worth the
    823      complexity.
    824 <li> longlong.h could be built at configure time by concatenating or
    825      #including fragments from each directory in the mpn path.  This would
    826      select CPU specific macros the same way as CPU specific assembler code.
    827      Code used would no longer depend on cpp predefines, and the current
    828      nested conditionals could be flattened out.
    829 <li> <code>mpz_get_si</code> returns 0x80000000 for -0x100000000, whereas it's
    830      sort of supposed to return the low 31 (or 63) bits.  But this is
    831      undocumented, and perhaps not too important.
    832 <li> <code>mpz_init_set*</code> and <code>mpz_realloc</code> could allocate
    833      say an extra 16 limbs over what's needed, so as to reduce the chance of
    834      having to do a reallocate if the <code>mpz_t</code> grows a bit more.
    835      This could only be an option, since it'd badly bloat memory usage in
    836      applications using many small values.
    837 <li> <code>mpq</code> functions could perhaps check for numerator or
    838      denominator equal to 1, on the assumption that integers or
    839      denominator-only values might be expected to occur reasonably often.
    840 <li> <code>count_trailing_zeros</code> is used on more or less uniformly
    841      distributed numbers in a couple of places.  For some CPUs
    842      <code>count_trailing_zeros</code> is slow and it's probably worth handling
    843      the frequently occurring 0 to 2 trailing zeros cases specially.
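On a uniformly random limb, 0, 1 or 2 trailing zeros cover 7/8 of all inputs, so the special case might look like this (a sketch only; the fallback loop stands in for whatever slow generic method the CPU forces, and as with the real macro the result for 0 is undefined):

```c
#include <stdint.h>

/* count_trailing_zeros with the frequent cases peeled off before the
   slow generic path.  ctz64(0) is undefined, as with the real macro. */
static int
ctz64 (uint64_t x)
{
  int cnt;
  if (x & 1) return 0;          /* probability 1/2 */
  if (x & 2) return 1;          /* probability 1/4 */
  if (x & 4) return 2;          /* probability 1/8 */
  /* Slow generic path for the remaining 1/8 of inputs. */
  cnt = 3;
  x >>= 3;
  while ((x & 1) == 0)
    {
      x >>= 1;
      cnt++;
    }
  return cnt;
}
```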
    844 <li> <code>mpf_t</code> might like to let the exponent be undefined when
    845      size==0, instead of requiring it be 0 as now.  It should be possible to do
    846      size==0 tests before paying attention to the exponent.  The advantage is
    847      not needing to set exp in the various places a zero result can arise,
    848      which avoids some tedium but is otherwise perhaps not too important.
    849      Currently <code>mpz_set_f</code> and <code>mpf_cmp_ui</code> depend on
    850      exp==0, maybe elsewhere too.
    851 <li> <code>__gmp_allocate_func</code>: Could use GCC <code>__attribute__
    852      ((malloc))</code> on this, though don't know if it'd do much.  GCC 3.0
    853      allows that attribute on functions, but not function pointers (see info
    854      node "Attribute Syntax"), so would need a new autoconf test.  This can
    855      wait until there's a GCC that supports it.
    856 <li> <code>mpz_add_ui</code> contains two <code>__GMPN_COPY</code>s, one from
    857      <code>mpn_add_1</code> and one from <code>mpn_sub_1</code>.  If those two
    858      routines were opened up a bit maybe that code could be shared.  When a
    859      copy needs to be done there's no carry to append for the add, and if the
    860      copy is non-empty no high zero for the sub.
    861 </ul>
    862 
    863 
    864 <h4>Old and Obsolete Stuff</h4>
    865 
    866 <p> The following tasks apply to chips or systems that are old and/or obsolete.
    867 It's unlikely anything will be done about them unless anyone is actively using
    868 them.
    869 
    870 <ul>
    871 <li> Sparc32: The integer based udiv_nfp.asm used to be selected by
    872      <code>configure --nfp</code> but that option is gone now that autoconf is
    873      used.  The file could go somewhere suitable in the mpn search if any
    874      chips might benefit from it, though it's possible we don't currently
    875      differentiate enough exact cpu types to do this properly.
    876 <li> VAX D and G format <code>double</code> floats are straightforward and
    877      could perhaps be handled directly in <code>__gmp_extract_double</code>
    878      and maybe in <code>mpn_get_d</code>, rather than falling back on the
    879      generic code.  (Both formats are detected by <code>configure</code>.)
    880 </ul>
    881 
    882 
    883 <hr>
    884 
    885 </body>
    886 </html>
    887 
    888 <!--
    889 Local variables:
    890 eval: (add-hook 'write-file-hooks 'time-stamp)
    891 time-stamp-start: "This file current as of "
    892 time-stamp-format: "%:d %3b %:y"
    893 time-stamp-end: "\\."
    894 time-stamp-line-limit: 50
    895 End:
    896 -->
    897