Home | History | Annotate | Line # | Download | only in doc
      1 <h1>TRE Regexp Syntax</h1>
      2 
      3 <p>
      4 This document describes the POSIX 1003.2 extended RE (ERE) syntax and
      5 the basic RE (BRE) syntax as implemented by TRE, and the TRE extensions
      6 to the ERE syntax.  A simple Extended Backus-Naur Form (EBNF) style
      7 notation is used to describe the grammar.
      8 </p>
      9 
     10 <h2>ERE Syntax</h2>
     11 
     12 <h3>Alternation operator</h3>
     13 <a name="alternation"></a>
     14 <a name="extended-regexp"></a>
     15 
     16 <table bgcolor="#e0e0f0" cellpadding="10">
     17 <tr><td>
     18 <pre>
     19 <i>extended-regexp</i> ::= <a href="#branch"><i>branch</i></a>
     20                 |   <i>extended-regexp</i> <b>"|"</b> <a href="#branch"><i>branch</i></a>
     21 </pre>
     22 </td></tr>
     23 </table>
     24 <p>
     25 An extended regexp (ERE) is one or more <i>branches</i>, separated by
     26 <tt>|</tt>.  An ERE matches anything that matches one or more of the
     27 branches.
     28 </p>
     29 
     30 <h3>Catenation of REs</h3>
     31 <a name="catenation"></a>
     32 <a name="branch"></a>
     33 
     34 <table bgcolor="#e0e0f0" cellpadding="10">
     35 <tr><td>
     36 <pre>
     37 <i>branch</i> ::= <i>piece</i>
     38        |   <i>branch</i> <i>piece</i>
     39 </pre>
     40 </td></tr>
     41 </table>
     42 <p>
     43 A branch is one or more <i>pieces</i> concatenated.  It matches a
     44 match for the first piece, followed by a match for the second piece,
     45 and so on.
     46 </p>
     47 
     48 
     49 <table bgcolor="#e0e0f0" cellpadding="10">
     50 <tr><td>
     51 <pre>
     52 <i>piece</i> ::= <i>atom</i>
     53       |   <i>atom</i> <a href="#repeat-operator"><i>repeat-operator</i></a>
     54       |   <i>atom</i> <a href="#approx-settings"><i>approx-settings</i></a>
     55 </pre>
     56 </td></tr>
     57 </table>
     58 <p>
     59 A piece is an <i>atom</i> possibly followed by a repeat operator or an
     60 expression controlling approximate matching parameters for the <i>atom</i>.
     61 </p>
     62 
     63 
     64 <table bgcolor="#e0e0f0" cellpadding="10">
     65 <tr><td>
     66 <pre>
     67 <i>atom</i> ::= <b>"("</b> <i>extended-regexp</i> <b>")"</b>
     68      |   <a href="#bracket-expression"><i>bracket-expression</i></a>
     69      |   <b>"."</b>
     70      |   <a href="#assertion"><i>assertion</i></a>
     71      |   <a href="#literal"><i>literal</i></a>
     72      |   <a href="#backref"><i>back-reference</i></a>
     73      |   <b>"(?#"</b> <i>comment-text</i> <b>")"</b>
     74      |   <b>"(?"</b> <a href="#options"><i>options</i></a> <b>")"</b> <i>extended-regexp</i>
     75      |   <b>"(?"</b> <a href="#options"><i>options</i></a> <b>":"</b> <i>extended-regexp</i> <b>")"</b>
     76 </pre>
     77 </td></tr>
     78 </table>
     79 <p>
     80 An atom is either an ERE enclosed in parenthesis, a bracket
     81 expression, a <tt>.</tt> (period), an assertion, or a literal.
     82 </p>
     83 
     84 <p>
     85 The dot (<tt>.</tt>) matches any single character.
     86 If the <code>REG_NEWLINE</code> compilation flag (see <a
     87 href="api.html">API manual</a>) is specified, the newline
     88 character is not matched.
     89 </p>
     90 
     91 <p>
     92 <tt>Comment-text</tt> can contain any characters except for a closing parenthesis <tt>)</tt>. The text in the comment is
     93 completely ignored by the regex parser and it used solely for readability purposes.
     94 </p>
     95 
     96 <h3>Repeat operators</h3>
     97 <a name="repeat-operator"></a>
     98 
     99 <table bgcolor="#e0e0f0" cellpadding="10">
    100 <tr><td>
    101 <pre>
    102 <i>repeat-operator</i> ::= <b>"*"</b>
    103                 |   <b>"+"</b>
    104                 |   <b>"?"</b>
    105                 |   <i>bound</i>
    106                 |   <b>"*?"</b>
    107                 |   <b>"+?"</b>
    108                 |   <b>"??"</b>
    109                 |   <i>bound</i> <b>?</b>
    110 </pre>
    111 </td></tr>
    112 </table>
    113 
    114 <p>
    115 An atom followed by <tt>*</tt> matches a sequence of 0 or more matches
    116 of the atom.  <tt>+</tt> is similar to <tt>*</tt>, matching a sequence
    117 of 1 or more matches of the atom.  An atom followed by <tt>?</tt>
    118 matches a sequence of 0 or 1 matches of the atom.
    119 </p>
    120 
    121 <p>
    122 A <i>bound</i> is one of the following, where <i>m</i> and <i>m</i>
    123 are unsigned decimal integers between <tt>0</tt> and
    124 <tt>RE_DUP_MAX</tt>:
    125 </p>
    126 
    127 <ol>
    128 <li><tt>{</tt><i>m</i><tt>,</tt><i>n</i><tt>}</tt></li>
    129 <li><tt>{</tt><i>m</i><tt>,}</tt></li>
    130 <li><tt>{</tt><i>m</i><tt>}</tt></li>
    131 </ol>
    132 
    133 <p>
    134 An atom followed by [1] matches a sequence of <i>m</i> through <i>n</i>
    135 (inclusive) matches of the atom.  An atom followed by [2]
    136 matches a sequence of <i>m</i> or more matches of the atom.  An atom
    137 followed by [3] matches a sequence of exactly <i>m</i> matches of the
    138 atom.
    139 </p>
    140 
    141 
    142 <p>
    143 Adding a <tt>?</tt> to a repeat operator makes the subexpression minimal, or
    144 non-greedy.  Normally a repeated expression is greedy, that is, it matches as
    145 many characters as possible.  A non-greedy subexpression matches as few
    146 characters as possible.  Note that this does not (always) mean the same thing
    147 as matching as many or few repetitions as possible.  Also note
    148 that <strong>minimal repetitions are not currently supported for approximate
    149 matching</strong>.
    150 </p>
    151 
    152 <h3>Approximate matching settings</h3>
    153 <a name="approx-settings"></a>
    154 
    155 <table bgcolor="#e0e0f0" cellpadding="10">
    156 <tr><td>
    157 <pre>
    158 <i>approx-settings</i> ::= <b>"{"</b> <i>count-limits</i>* <b>","</b>? <i>cost-equation</i>? <b>"}"</b>
    159 
    160 <i>count-limits</i> ::= <b>"+"</b> <i>number</i>?
    161              |   <b>"-"</b> <i>number</i>?
    162              |   <b>"#"</b> <i>number</i>?
    163              |   <b>"~"</b> <i>number</i>?
    164 
    165 <i>cost-equation</i> ::= ( <i>cost-term</i> "+"? " "? )+ <b>"&lt;"</b> <i>number</i>
    166 
    167 <i>cost-term</i> ::= <i>number</i> <b>"i"</b>
    168           |   <i>number</i> <b>"d"</b>
    169           |   <i>number</i> <b>"s"</b>
    170 
    171 </pre>
    172 </td></tr>
    173 </table>
    174 
    175 <p>
    176 The approximate matching settings for a subpattern can be changed
    177 by appending <i>approx-settings</i> to the subpattern.  Limits for
    178 the number of errors can be set and an expression for specifying and
    179 limiting the costs can be given.
    180 </p>
    181 
    182 <p>
    183 The <i>count-limits</i> can be used to set limits for the number of
    184 insertions (<tt>+</tt>), deletions (<tt>-</tt>), substitutions
    185 (<tt>#</tt>), and total number of errors (<tt>~</tt>).  If the
    186 <i>number</i> part is omitted, the specified error count will be
    187 unlimited.
    188 </p>
    189 
    190 <p>
    191 The <i>cost-equation</i> can be thought of as a mathematical equation,
    192 where <tt>i</tt>, <tt>d</tt>, and <tt>s</tt> stand for the number of
    193 insertions, deletions, and substitutions, respectively.  The equation
    194 can have a multiplier for each of <tt>i</tt>, <tt>d</tt>, and
    195 <tt>s</tt>.  The multiplier is the cost of the error, and the number
    196 after <tt>&lt;</tt> is the maximum allowed cost of a match.  Spaces
    197 and pluses can be inserted to make the equation readable.  In fact, when
    198 specifying only a cost equation, adding a space after the opening <tt>{</tt>
    199 is <strong>required</strong>.
    200 </p>
    201 
    202 <p>
    203 Examples:
    204 <dl>
    205 <dt><tt>{~}</tt></dt>
    206 <dd>Sets the maximum number of errors to unlimited.</dd>
    207 <dt><tt>{~3}</tt></dt>
    208 <dd>Sets the maximum number of errors to three.</dd>
    209 <dt><tt>{+2~5}</tt></dt>
    210 <dd>Sets the maximum number of errors to five, and the maximum number
    211 of insertions to two.</dd>
    212 <dt><tt>{&lt;3}</tt></dt>
    213 <dd>Sets the maximum cost to three.
    214 <dt><tt>{ 2i + 1d + 2s &lt; 5 }</tt></dt>
    215 <dd>Sets the cost of an insertion to two, a deletion to one, a
    216 substitution to two, and the maximum cost to five.
    217 </dl>
    218 
    219 
    220 <h3>Bracket expressions</h3>
    221 <a name="bracket-expression"></a>
    222 
    223 <table bgcolor="#e0e0f0" cellpadding="10">
    224 <tr><td>
    225 <pre>
    226 <i>bracket-expression</i> ::= <b>"["</b> <i>item</i>+ <b>"]"</b>
    227                    |   <b>"[^"</b> <i>item</i>+ <b>"]"</b>
    228 </pre>
    229 </td></tr>
    230 </table>
    231 
    232 <p>
    233 A bracket expression specifies a set of characters by enclosing a
    234 nonempty list of items in brackets.  Normally anything matching any
    235 item in the list is matched.  If the list begins with <tt>^</tt> the
    236 meaning is negated; any character matching no item in the list is
    237 matched.
    238 </p>
    239 
    240 <p>
    241 An item is any of the following:
    242 </p>
    243 <ul>
    244 <li>A single character, matching that character.</li>
    245 <li>Two characters separated by <tt>-</tt>.  This is shorthand for the
    246 full range of characters  between those two (inclusive) in the
    247 collating sequence.  For example, <tt>[0-9]</tt> in ASCII matches any
    248 decimal digit.</li>
    249 <li>A collating element enclosed in <tt>[.</tt> and <tt>.]</tt>,
    250 matching the collating element.  This can be used to include a literal
    251 <tt>-</tt> or a multi-character collating element in the list.</li>
    252 <li>A collating element enclosed in <tt>[=</tt> and <tt>=]</tt> (an
    253 equivalence class), matching all collating elements with the same
    254 primary collation weight as that element, including the element
    255 itself.</li>
    256 <li>The name of a character class enclosed in <tt>[:</tt> and
    257 <tt>:]</tt>, matching any character belonging to the class.  The set
    258 of valid names depends on the <code>LC_CTYPE</code> category of the
    259 current locale, but the following names are valid in all locales:
    260 <ul>
    261 <li><tt>alnum</tt> - alphanumeric characters</li>
    262 <li><tt>alpha</tt> - alphabetic characters</li>
    263 <li><tt>blank</tt> - blank characters</li>
    264 <li><tt>cntrl</tt> - control characters</li>
    265 <li><tt>digit</tt> - decimal digits (0 through 9)</li>
    266 <li><tt>graph</tt> - all printable characters except space</li>
    267 <li><tt>lower</tt> - lower-case letters</li>
    268 <li><tt>print</tt> - printable characters including space</li>
    269 <li><tt>punct</tt> - printable characters not space or alphanumeric</li>
    270 <li><tt>space</tt> - white-space characters</li>
    271 <li><tt>upper</tt> - upper case letters</li>
    272 <li><tt>xdigit</tt> - hexadecimal digits</li>
    273 </ul>
    274 </ul>
    275 <p>
    276 To include a literal <tt>-</tt> in the list, make it either the first
    277 or last item, the second endpoint of a range, or enclose it in
    278 <tt>[.</tt> and <tt>.]</tt> to make it a collating element.  To
    279 include a literal <tt>]</tt> in the list, make it either the first
    280 item, the second endpoint of a range, or enclose it in <tt>[.</tt> and
    281 <tt>.]</tt>.  To use a literal <tt>-</tt> as the first
    282 endpoint of a range, enclose it in <tt>[.</tt> and <tt>.]</tt>.
    283 </p>
    284 
    285 
    286 <h3>Assertions</h3>
    287 <a name="assertion"></a>
    288 
    289 <table bgcolor="#e0e0f0" cellpadding="10">
    290 <tr><td>
    291 <pre>
    292 <i>assertion</i> ::= <b>"^"</b>
    293           |   <b>"$"</b>
    294           |   <b>"\"</b> <i>assertion-character</i>
    295 </pre>
    296 </td></tr>
    297 </table>
    298 
    299 <p>
    300 The expressions <tt>^</tt> and <tt>$</tt> are called "left anchor" and
    301 "right anchor", respectively.  The left anchor matches the empty
    302 string at the beginning of the string.  The right anchor matches the
    303 empty string at the end of the string.  The behaviour of both anchors
    304 can be varied by specifying certain execution and compilation flags;
    305 see the <a href="api.html">API manual</a>.
    306 </p>
    307 
    308 <p>
    309 An assertion-character can be any of the following:
    310 </p>
    311 
    312 <ul>
    313 <li><tt>&lt;</tt> - Beginning of word
    314 <li><tt>&gt;</tt> - End of word
    315 <li><tt>b</tt> - Word boundary
    316 <li><tt>B</tt> - Non-word boundary
    317 <li><tt>d</tt> - Digit character (equivalent to <tt>[[:digit:]]</tt>)</li>
    318 <li><tt>D</tt> - Non-digit character (equivalent to <tt>[^[:digit:]]</tt>)</li>
    319 <li><tt>s</tt> - Space character (equivalent to <tt>[[:space:]]</tt>)</li>
    320 <li><tt>S</tt> - Non-space character (equivalent to <tt>[^[:space:]]</tt>)</li>
    321 <li><tt>w</tt> - Word character (equivalent to <tt>[[:alnum:]_]</tt>)</li>
    322 <li><tt>W</tt> - Non-word character (equivalent to <tt>[^[:alnum:]_]</tt>)</li>
    323 </ul>
    324 
    325 
    326 <h3>Literals</h3>
    327 <a name="literal"></a>
    328 
    329 <table bgcolor="#e0e0f0" cellpadding="10">
    330 <tr><td>
    331 <pre>
    332 <i>literal</i> ::= <i>ordinary-character</i>
    333         |   <b>"\x"</b> [<b>"1"</b>-<b>"9"</b> <b>"a"-<b>"f"</b> <b>"A"</b>-<b>"F"</b>]{0,2}
    334         |   <b>"\x{"</b> [<b>"1"</b>-<b>"9"</b> <b>"a"-<b>"f"</b> <b>"A"</b>-<b>"F"</b>]* <b>"}"</b>
    335         |   <b>"\"</b> <i>character</i>
    336 </pre>
    337 </td></tr>
    338 </table>
    339 <p>
    340 A literal is either an ordinary character (a character that has no
    341 other significance in the context), an 8 bit hexadecimal encoded
    342 character (e.g. <tt>\x1B</tt>), a wide hexadecimal encoded character
    343 (e.g. <tt>\x{263a}</tt>), or an escaped character.  An escaped
    344 character is a <tt>\</tt> followed by any character, and matches that
    345 character.  Escaping can be used to match characters which have a
    346 special meaning in regexp syntax.  A <tt>\</tt> cannot be the last
    347 character of an ERE.  Escaping also allows you to include a few
    348 non-printable characters in the regular expression.  These special
    349 escape sequences include:
    350 </p>
    351 
    352 <ul>
    353 <li><tt>\a</tt> - Bell character (ASCII code 7)
    354 <li><tt>\e</tt> - Escape character (ASCII code 27)
    355 <li><tt>\f</tt> - Form-feed character (ASCII code 12)
    356 <li><tt>\n</tt> - New-line/line-feed character (ASCII code 10)
    357 <li><tt>\r</tt> - Carriage return character (ASCII code 13)
    358 <li><tt>\t</tt> - Horizontal tab character (ASCII code 9)
    359 </ul>
    360 
    361 <p>
    362 An ordinary character is just a single character with no other
    363 significance, and matches that character.  A <tt>{</tt> followed by
    364 something else than a digit is considered an ordinary character.
    365 </p>
    366 
    367 
    368 <h3>Back references</h3>
    369 <a name="backref"></a>
    370 
    371 <table bgcolor="#e0e0f0" cellpadding="10">
    372 <tr><td>
    373 <pre>
    374 <i>back-reference</i> ::= <b>"\"</b> [<b>"1"</b>-<b>"9"</b>]
    375 </pre>
    376 </td></tr>
    377 </table>
    378 <p>
    379 A back reference is a backslash followed by a single non-zero decimal
    380 digit <i>d</i>.  It matches the same sequence of characters
    381 matched by the <i>d</i>th parenthesized subexpression.
    382 </p>
    383 
    384 <p>
    385 Back references are not defined for POSIX EREs (for BREs they are),
    386 but many matchers, including TRE, implement back references for both
    387 EREs and BREs.
    388 </p>
    389 
    390 <h3>Options</h3>
    391 <a name="options"></a>
    392 <table bgcolor="#e0e0f0" cellpadding="10">
    393 <tr><td>
    394 <pre>
    395 <i>options</i> ::= [<b>"i" "n" "r" "U"</b>]* (<b>"-"</b> [<b>"i" "n" "r" "U"</b>]*)?
    396 </pre>
    397 </td></tr>
    398 </table>
    399 
    400 Options allow compile time options to be turned on/off for particular parts of the
    401 regular expression. The options equate to several compile time options specified to
    402 the regcomp API function. If the option is specified in the first section, it is
    403 turned on. If it is specified in the second section (after the <tt>-</tt>), it is
    404 turned off.
    405 <ul>
    406 <li>i - Case insensitive.
    407 <li>n - Forces special handling of the new line character. See the REG_NEWLINE flag in
    408 the <a href="tre-api.html">API Manual</a>.
    409 <li>r - Causes the regex to be matched in a right associative manner rather than the normal
    410 left associative manner.
    411 <li>U - Forces repetition operators to be non-greedy unless a <tt>?</tt> is appended.
    412 </ul>
    413 <h2>BRE Syntax</h2>
    414 
    415 <p>
    416 The obsolete basic regexp (BRE) syntax differs from the ERE syntax as
    417 follows:
    418 </p>
    419 
    420 <ul>
    421 <li><tt>|</tt> is an ordinary character, and there is no equivalent
    422 for its functionality.  <tt>+</tt>, and <tt>?</tt> are ordinary
    423 characters.</li>
    424 <li>The delimiters for bounds are <tt>\{</tt> and <tt>\}</tt>, with
    425 <tt>{</tt> and <tt>}</tt> by themselves ordinary characters.</li>
    426 <li>The parentheses for nested subexpressions are <tt>\(</tt> and
    427 <tt>\)</tt>, with <tt>(</tt> and <tt>)</tt> by themselves ordinary
    428 characters.</li>
    429 <li><tt>^</tt> is an ordinary character except at the beginning of the
    430 RE or the beginning of a parenthesized subexpression.  Similarly,
    431 <tt>$</tt> is an ordinary character except at the end of the
    432 RE or the end of a parenthesized subexpression.</li>
    433 </ul>
    434