<h1>TRE Regexp Syntax</h1>

<p>
This document describes the POSIX 1003.2 extended RE (ERE) syntax and
the basic RE (BRE) syntax as implemented by TRE, and the TRE extensions
to the ERE syntax.  A simple Extended Backus-Naur Form (EBNF) style
notation is used to describe the grammar.
</p>

<h2>ERE Syntax</h2>

<h3>Alternation operator</h3>
<a name="alternation"></a>
<a name="extended-regexp"></a>

<table bgcolor="#e0e0f0" cellpadding="10">
<tr><td>
<pre>
<i>extended-regexp</i> ::= <a href="#branch"><i>branch</i></a>
                |   <i>extended-regexp</i> <b>"|"</b> <a href="#branch"><i>branch</i></a>
</pre>
</td></tr>
</table>
<p>
An extended regexp (ERE) is one or more <i>branches</i>, separated by
<tt>|</tt>.  An ERE matches anything that matches one or more of the
branches.
</p>
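<p>
A minimal sketch of alternation, assuming Python's <code>re</code> module
as a stand-in for TRE.  Python's matcher is not POSIX-conformant, so the
example uses <code>re.fullmatch</code>, where the choice of branch cannot
change the result.  The helper name <code>matches_any_branch</code> is
hypothetical.
</p>

```python
import re

def matches_any_branch(pattern: str, text: str) -> bool:
    """Return True if the whole of `text` matches one of the branches."""
    return re.fullmatch(pattern, text) is not None

print(matches_any_branch("cat|dog|bird", "dog"))   # True
print(matches_any_branch("cat|dog|bird", "fish"))  # False
```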

<h3>Catenation of REs</h3>
<a name="catenation"></a>
<a name="branch"></a>

<table bgcolor="#e0e0f0" cellpadding="10">
<tr><td>
<pre>
<i>branch</i> ::= <i>piece</i>
       |   <i>branch</i> <i>piece</i>
</pre>
</td></tr>
</table>
<p>
A branch is one or more <i>pieces</i> concatenated.  It matches a
match for the first piece, followed by a match for the second piece,
and so on.
</p>


<table bgcolor="#e0e0f0" cellpadding="10">
<tr><td>
<pre>
<i>piece</i> ::= <i>atom</i>
      |   <i>atom</i> <a href="#repeat-operator"><i>repeat-operator</i></a>
      |   <i>atom</i> <a href="#approx-settings"><i>approx-settings</i></a>
</pre>
</td></tr>
</table>
<p>
A piece is an <i>atom</i> possibly followed by a repeat operator or an
expression controlling approximate matching parameters for the <i>atom</i>.
</p>


<table bgcolor="#e0e0f0" cellpadding="10">
<tr><td>
<pre>
<i>atom</i> ::= <b>"("</b> <i>extended-regexp</i> <b>")"</b>
     |   <a href="#bracket-expression"><i>bracket-expression</i></a>
     |   <b>"."</b>
     |   <a href="#assertion"><i>assertion</i></a>
     |   <a href="#literal"><i>literal</i></a>
     |   <a href="#backref"><i>back-reference</i></a>
     |   <b>"(?#"</b> <i>comment-text</i> <b>")"</b>
     |   <b>"(?"</b> <a href="#options"><i>options</i></a> <b>")"</b> <i>extended-regexp</i>
     |   <b>"(?"</b> <a href="#options"><i>options</i></a> <b>":"</b> <i>extended-regexp</i> <b>")"</b>
</pre>
</td></tr>
</table>
<p>
An atom is either an ERE enclosed in parentheses, a bracket
expression, a <tt>.</tt> (period), an assertion, a literal, a back
reference, a comment, or an ERE preceded by an options specification.
</p>

<p>
The dot (<tt>.</tt>) matches any single character.
If the <code>REG_NEWLINE</code> compilation flag (see <a
href="api.html">API manual</a>) is specified, the newline
character is not matched.
</p>

<p>
The <i>comment-text</i> can contain any characters except for a closing
parenthesis <tt>)</tt>.  The text in the comment is completely ignored by
the regex parser and is used solely for readability purposes.
</p>
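<p>
The dot and comment atoms can be tried out with Python's <code>re</code>
module, used here only as a stand-in for TRE.  Note the assumption spelled
out in the code: Python's default is the opposite of the POSIX one, so
<code>REG_NEWLINE</code> is modelled as the absence of
<code>re.DOTALL</code>.  The function names are hypothetical.
</p>

```python
import re

def dot_matches(text: str, reg_newline: bool) -> bool:
    """Match "a.b" against `text`, mimicking the REG_NEWLINE behaviour."""
    # Assumption: in Python "." skips newlines unless re.DOTALL is given,
    # so REG_NEWLINE corresponds to *not* passing re.DOTALL.
    flags = 0 if reg_newline else re.DOTALL
    return re.fullmatch("a.b", text, flags) is not None

def comment_is_ignored() -> bool:
    """A (?#...) comment contributes nothing to the match."""
    return re.fullmatch(r"ab(?#this text is ignored)c", "abc") is not None

print(dot_matches("a\nb", reg_newline=False))  # True
print(dot_matches("a\nb", reg_newline=True))   # False
print(comment_is_ignored())                    # True
```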

<h3>Repeat operators</h3>
<a name="repeat-operator"></a>

<table bgcolor="#e0e0f0" cellpadding="10">
<tr><td>
<pre>
<i>repeat-operator</i> ::= <b>"*"</b>
                |   <b>"+"</b>
                |   <b>"?"</b>
                |   <i>bound</i>
                |   <b>"*?"</b>
                |   <b>"+?"</b>
                |   <b>"??"</b>
                |   <i>bound</i> <b>"?"</b>
</pre>
</td></tr>
</table>

<p>
An atom followed by <tt>*</tt> matches a sequence of 0 or more matches
of the atom.  <tt>+</tt> is similar to <tt>*</tt>, matching a sequence
of 1 or more matches of the atom.  An atom followed by <tt>?</tt>
matches a sequence of 0 or 1 matches of the atom.
</p>

<p>
A <i>bound</i> is one of the following, where <i>m</i> and <i>n</i>
are unsigned decimal integers between <tt>0</tt> and
<tt>RE_DUP_MAX</tt>:
</p>

<ol>
<li><tt>{</tt><i>m</i><tt>,</tt><i>n</i><tt>}</tt></li>
<li><tt>{</tt><i>m</i><tt>,}</tt></li>
<li><tt>{</tt><i>m</i><tt>}</tt></li>
</ol>

<p>
An atom followed by [1] matches a sequence of <i>m</i> through <i>n</i>
(inclusive) matches of the atom.  An atom followed by [2]
matches a sequence of <i>m</i> or more matches of the atom.  An atom
followed by [3] matches a sequence of exactly <i>m</i> matches of the
atom.
</p>
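<p>
The three bound forms can be summarised as a (minimum, maximum) pair.
The following is a rough sketch in Python, not TRE's parser; the name
<code>parse_bound</code> is hypothetical and the <tt>RE_DUP_MAX</tt>
limit is deliberately not checked.
</p>

```python
import re

# "{m,n}", "{m,}" or "{m}" with unsigned decimal integers.
_BOUND = re.compile(r"\{(\d+)(,(\d*))?\}")

def parse_bound(bound: str):
    """Return (minimum, maximum); maximum is None when unbounded."""
    m = _BOUND.fullmatch(bound)
    if m is None:
        raise ValueError(f"not a bound: {bound!r}")
    low = int(m.group(1))
    if m.group(2) is None:      # form [3]: exactly m
        return low, low
    if m.group(3) == "":        # form [2]: m or more
        return low, None
    return low, int(m.group(3))  # form [1]: m through n

print(parse_bound("{2,4}"))  # (2, 4)
print(parse_bound("{2,}"))   # (2, None)
print(parse_bound("{3}"))    # (3, 3)
```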


<p>
Adding a <tt>?</tt> to a repeat operator makes the subexpression minimal, or
non-greedy.  Normally a repeated expression is greedy, that is, it matches as
many characters as possible.  A non-greedy subexpression matches as few
characters as possible.  Note that this does not (always) mean the same thing
as matching as many or few repetitions as possible.  Also note
that <strong>minimal repetitions are not currently supported for approximate
matching</strong>.
</p>
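<p>
The difference between greedy and minimal repetition shows up in a small
example.  This sketch assumes Python's <code>re</code> module, whose
<tt>*?</tt> operator behaves the same way for this simple case; it is not
a statement about TRE's matcher in general.
</p>

```python
import re

text = "<b>bold</b>"

# Greedy: ".*" takes as many characters as possible.
greedy = re.search(r"<.*>", text).group(0)
# Minimal: ".*?" takes as few characters as possible.
minimal = re.search(r"<.*?>", text).group(0)

print(greedy)   # <b>bold</b>
print(minimal)  # <b>
```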

<h3>Approximate matching settings</h3>
<a name="approx-settings"></a>

<table bgcolor="#e0e0f0" cellpadding="10">
<tr><td>
<pre>
<i>approx-settings</i> ::= <b>"{"</b> <i>count-limits</i>* <b>","</b>? <i>cost-equation</i>? <b>"}"</b>

<i>count-limits</i> ::= <b>"+"</b> <i>number</i>?
             |   <b>"-"</b> <i>number</i>?
             |   <b>"#"</b> <i>number</i>?
             |   <b>"~"</b> <i>number</i>?

<i>cost-equation</i> ::= ( <i>cost-term</i> <b>"+"</b>? <b>" "</b>? )+ <b>"&lt;"</b> <i>number</i>

<i>cost-term</i> ::= <i>number</i> <b>"i"</b>
          |   <i>number</i> <b>"d"</b>
          |   <i>number</i> <b>"s"</b>
</pre>
</td></tr>
</table>

<p>
The approximate matching settings for a subpattern can be changed
by appending <i>approx-settings</i> to the subpattern.  Limits for
the number of errors can be set and an expression for specifying and
limiting the costs can be given.
</p>

<p>
The <i>count-limits</i> can be used to set limits for the number of
insertions (<tt>+</tt>), deletions (<tt>-</tt>), substitutions
(<tt>#</tt>), and total number of errors (<tt>~</tt>).  If the
<i>number</i> part is omitted, the specified error count will be
unlimited.
</p>

<p>
The <i>cost-equation</i> can be thought of as a mathematical equation,
where <tt>i</tt>, <tt>d</tt>, and <tt>s</tt> stand for the number of
insertions, deletions, and substitutions, respectively.  The equation
can have a multiplier for each of <tt>i</tt>, <tt>d</tt>, and
<tt>s</tt>.  The multiplier is the cost of the error, and the number
after <tt>&lt;</tt> is the maximum allowed cost of a match.  Spaces
and pluses can be inserted to make the equation readable.  In fact, when
specifying only a cost equation, adding a space after the opening <tt>{</tt>
is <strong>required</strong>.
</p>

<p>
Examples:
</p>
<dl>
<dt><tt>{~}</tt></dt>
<dd>Sets the maximum number of errors to unlimited.</dd>
<dt><tt>{~3}</tt></dt>
<dd>Sets the maximum number of errors to three.</dd>
<dt><tt>{+2~5}</tt></dt>
<dd>Sets the maximum number of errors to five, and the maximum number
of insertions to two.</dd>
<dt><tt>{&lt;3}</tt></dt>
<dd>Sets the maximum cost to three.</dd>
<dt><tt>{ 2i + 1d + 2s &lt; 5 }</tt></dt>
<dd>Sets the cost of an insertion to two, a deletion to one, a
substitution to two, and the maximum cost to five.</dd>
</dl>
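<p>
To make the cost equation concrete, the sketch below splits an equation
into its per-error costs and its maximum, then computes the weighted cost
of a given mix of errors.  It is an illustration in plain Python, not
TRE's parser.  The names <code>parse_cost_equation</code> and
<code>total_cost</code> are hypothetical, and treating an unmentioned
error type as costing zero is an assumption made for this example.
</p>

```python
import re

# A cost term is a number followed by "i", "d" or "s".
_TERM = re.compile(r"(\d+)([ids])")

def parse_cost_equation(equation: str):
    """Split e.g. '2i + 1d + 2s < 5' into ({'i': 2, ...}, 5)."""
    terms, _, limit = equation.partition("<")
    costs = {kind: int(number) for number, kind in _TERM.findall(terms)}
    return costs, int(limit)

def total_cost(costs, insertions=0, deletions=0, substitutions=0):
    """Weighted cost of a match with the given error counts."""
    counts = {"i": insertions, "d": deletions, "s": substitutions}
    # Assumption: an error type missing from the equation costs 0.
    return sum(costs.get(kind, 0) * n for kind, n in counts.items())

costs, max_cost = parse_cost_equation("2i + 1d + 2s < 5")
print(costs, max_cost)                                  # {'i': 2, 'd': 1, 's': 2} 5
print(total_cost(costs, insertions=1, substitutions=1)) # 4
```

<p>
With these costs, one insertion plus one substitution costs four, while
two insertions plus one substitution would cost six and exceed the
maximum of five.
</p>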


<h3>Bracket expressions</h3>
<a name="bracket-expression"></a>

<table bgcolor="#e0e0f0" cellpadding="10">
<tr><td>
<pre>
<i>bracket-expression</i> ::= <b>"["</b> <i>item</i>+ <b>"]"</b>
                   |   <b>"[^"</b> <i>item</i>+ <b>"]"</b>
</pre>
</td></tr>
</table>

<p>
A bracket expression specifies a set of characters by enclosing a
nonempty list of items in brackets.  Normally anything matching any
item in the list is matched.  If the list begins with <tt>^</tt> the
meaning is negated; any character matching no item in the list is
matched.
</p>

<p>
An item is any of the following:
</p>
<ul>
<li>A single character, matching that character.</li>
<li>Two characters separated by <tt>-</tt>.  This is shorthand for the
full range of characters between those two (inclusive) in the
collating sequence.  For example, <tt>[0-9]</tt> in ASCII matches any
decimal digit.</li>
<li>A collating element enclosed in <tt>[.</tt> and <tt>.]</tt>,
matching the collating element.  This can be used to include a literal
<tt>-</tt> or a multi-character collating element in the list.</li>
<li>A collating element enclosed in <tt>[=</tt> and <tt>=]</tt> (an
equivalence class), matching all collating elements with the same
primary collation weight as that element, including the element
itself.</li>
<li>The name of a character class enclosed in <tt>[:</tt> and
<tt>:]</tt>, matching any character belonging to the class.  The set
of valid names depends on the <code>LC_CTYPE</code> category of the
current locale, but the following names are valid in all locales:
<ul>
<li><tt>alnum</tt> - alphanumeric characters</li>
<li><tt>alpha</tt> - alphabetic characters</li>
<li><tt>blank</tt> - blank characters</li>
<li><tt>cntrl</tt> - control characters</li>
<li><tt>digit</tt> - decimal digits (0 through 9)</li>
<li><tt>graph</tt> - all printable characters except space</li>
<li><tt>lower</tt> - lower-case letters</li>
<li><tt>print</tt> - printable characters including space</li>
<li><tt>punct</tt> - printable characters not space or alphanumeric</li>
<li><tt>space</tt> - white-space characters</li>
<li><tt>upper</tt> - upper-case letters</li>
<li><tt>xdigit</tt> - hexadecimal digits</li>
</ul>
</li>
</ul>
<p>
To include a literal <tt>-</tt> in the list, make it either the first
or last item, the second endpoint of a range, or enclose it in
<tt>[.</tt> and <tt>.]</tt> to make it a collating element.  To
include a literal <tt>]</tt> in the list, make it either the first
item, the second endpoint of a range, or enclose it in <tt>[.</tt> and
<tt>.]</tt>.  To use a literal <tt>-</tt> as the first
endpoint of a range, enclose it in <tt>[.</tt> and <tt>.]</tt>.
</p>
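<p>
The placement rules for a literal <tt>-</tt> and <tt>]</tt> can be
checked with a short sketch.  It assumes Python's <code>re</code> module,
which follows the same first-or-last conventions but does not support the
<tt>[.</tt>&nbsp;<tt>.]</tt>, <tt>[=</tt>&nbsp;<tt>=]</tt>, or
<tt>[:</tt>&nbsp;<tt>:]</tt> items, so those are not shown.  The helper
name <code>in_set</code> is hypothetical.
</p>

```python
import re

def in_set(bracket_expression: str, char: str) -> bool:
    """Return True if the single character belongs to the set."""
    return re.fullmatch(bracket_expression, char) is not None

print(in_set("[0-9]", "5"))   # True: range item
print(in_set("[-a]", "-"))    # True: "-" as the first item
print(in_set("[a-]", "-"))    # True: "-" as the last item
print(in_set("[]a]", "]"))    # True: "]" as the first item
print(in_set("[^]a]", "]"))   # False: negated list containing "]"
```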


<h3>Assertions</h3>
<a name="assertion"></a>

<table bgcolor="#e0e0f0" cellpadding="10">
<tr><td>
<pre>
<i>assertion</i> ::= <b>"^"</b>
          |   <b>"$"</b>
          |   <b>"\"</b> <i>assertion-character</i>
</pre>
</td></tr>
</table>

<p>
The expressions <tt>^</tt> and <tt>$</tt> are called "left anchor" and
"right anchor", respectively.  The left anchor matches the empty
string at the beginning of the string.  The right anchor matches the
empty string at the end of the string.  The behaviour of both anchors
can be varied by specifying certain execution and compilation flags;
see the <a href="api.html">API manual</a>.
</p>

<p>
An assertion-character can be any of the following:
</p>

<ul>
<li><tt>&lt;</tt> - Beginning of word</li>
<li><tt>&gt;</tt> - End of word</li>
<li><tt>b</tt> - Word boundary</li>
<li><tt>B</tt> - Non-word boundary</li>
<li><tt>d</tt> - Digit character (equivalent to <tt>[[:digit:]]</tt>)</li>
<li><tt>D</tt> - Non-digit character (equivalent to <tt>[^[:digit:]]</tt>)</li>
<li><tt>s</tt> - Space character (equivalent to <tt>[[:space:]]</tt>)</li>
<li><tt>S</tt> - Non-space character (equivalent to <tt>[^[:space:]]</tt>)</li>
<li><tt>w</tt> - Word character (equivalent to <tt>[[:alnum:]_]</tt>)</li>
<li><tt>W</tt> - Non-word character (equivalent to <tt>[^[:alnum:]_]</tt>)</li>
</ul>
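<p>
The word assertions can be illustrated by listing the offsets at which
they hold.  This is a sketch using Python's <code>re</code> module as a
stand-in.  Python has <tt>\b</tt> and <tt>\B</tt> but no <tt>\&lt;</tt> or
<tt>\&gt;</tt>, so the last two patterns emulate them with lookarounds;
that emulation is an assumption of this example, not TRE syntax.
</p>

```python
import re

TEXT = "cat concat cats"

def starts(pattern: str):
    """Offsets in TEXT at which `pattern` matches."""
    return [m.start() for m in re.finditer(pattern, TEXT, re.ASCII)]

print(starts(r"\bcat\b"))    # [0]   "cat" as a whole word
print(starts(r"\Bcat\b"))    # [7]   the "cat" ending "concat"
print(starts(r"\b(?=\w)"))   # [0, 4, 11]   stand-in for beginning of word
print(starts(r"\b(?<=\w)"))  # [3, 10, 15]  stand-in for end of word
```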


<h3>Literals</h3>
<a name="literal"></a>

<table bgcolor="#e0e0f0" cellpadding="10">
<tr><td>
<pre>
<i>literal</i> ::= <i>ordinary-character</i>
        |   <b>"\x"</b> [<b>"0"</b>-<b>"9"</b> <b>"a"</b>-<b>"f"</b> <b>"A"</b>-<b>"F"</b>]{0,2}
        |   <b>"\x{"</b> [<b>"0"</b>-<b>"9"</b> <b>"a"</b>-<b>"f"</b> <b>"A"</b>-<b>"F"</b>]* <b>"}"</b>
        |   <b>"\"</b> <i>character</i>
</pre>
</td></tr>
</table>
<p>
A literal is either an ordinary character (a character that has no
other significance in the context), an 8-bit hexadecimal encoded
character (e.g. <tt>\x1B</tt>), a wide hexadecimal encoded character
(e.g. <tt>\x{263a}</tt>), or an escaped character.  An escaped
character is a <tt>\</tt> followed by any character, and matches that
character.  Escaping can be used to match characters which have a
special meaning in regexp syntax.  A <tt>\</tt> cannot be the last
character of an ERE.  Escaping also allows you to include a few
non-printable characters in the regular expression.  These special
escape sequences include:
</p>

<ul>
<li><tt>\a</tt> - Bell character (ASCII code 7)</li>
<li><tt>\e</tt> - Escape character (ASCII code 27)</li>
<li><tt>\f</tt> - Form-feed character (ASCII code 12)</li>
<li><tt>\n</tt> - New-line/line-feed character (ASCII code 10)</li>
<li><tt>\r</tt> - Carriage return character (ASCII code 13)</li>
<li><tt>\t</tt> - Horizontal tab character (ASCII code 9)</li>
</ul>

<p>
An ordinary character is just a single character with no other
significance, and matches that character.  A <tt>{</tt> followed by
something other than a digit is considered an ordinary character.
</p>
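<p>
The two hexadecimal forms can be decoded with a few lines of Python.
This is an illustrative sketch, not TRE's parser.  The name
<code>decode_hex_literal</code> is hypothetical, and mapping an empty
digit string to character code 0 is an assumption made here because the
grammar permits zero digits.
</p>

```python
import re

# "\x" + up to two hex digits, or "\x{" + any number of hex digits + "}".
_HEX = re.compile(r"\\x(?:\{([0-9a-fA-F]*)\}|([0-9a-fA-F]{0,2}))")

def decode_hex_literal(literal: str) -> str:
    """Return the character denoted by a \\xHH or \\x{H...} literal."""
    m = _HEX.fullmatch(literal)
    if m is None:
        raise ValueError(f"not a hexadecimal literal: {literal!r}")
    digits = m.group(1) if m.group(1) is not None else m.group(2)
    # Assumption: no digits at all is read as character code 0.
    return chr(int(digits or "0", 16))

print(ord(decode_hex_literal(r"\x1B")))      # 27 (the escape character)
print(ord(decode_hex_literal(r"\x{263a}")))  # 9786 (U+263A)
```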


<h3>Back references</h3>
<a name="backref"></a>

<table bgcolor="#e0e0f0" cellpadding="10">
<tr><td>
<pre>
<i>back-reference</i> ::= <b>"\"</b> [<b>"1"</b>-<b>"9"</b>]
</pre>
</td></tr>
</table>
<p>
A back reference is a backslash followed by a single non-zero decimal
digit <i>d</i>.  It matches the same sequence of characters
matched by the <i>d</i>th parenthesized subexpression.
</p>

<p>
Back references are not defined for POSIX EREs (for BREs they are),
but many matchers, including TRE, implement back references for both
EREs and BREs.
</p>
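<p>
A brief illustration of a back reference, assuming Python's
<code>re</code> module as a stand-in (it uses the same
<tt>\</tt><i>d</i> notation): the pattern below finds a word that is
immediately repeated.
</p>

```python
import re

# \1 must match the same text that the first parenthesized group matched.
doubled_word = re.compile(r"([a-z]+) \1")

found = doubled_word.search("it is is fine")
print(found.group(0))  # is is
print(found.group(1))  # is

# The whole string must be "a"s, a "b", then the same run of "a"s again.
print(re.fullmatch(r"(a+)b\1", "aabaa") is not None)  # True
print(re.fullmatch(r"(a+)b\1", "aaba") is not None)   # False
```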

<h3>Options</h3>
<a name="options"></a>
<table bgcolor="#e0e0f0" cellpadding="10">
<tr><td>
<pre>
<i>options</i> ::= [<b>"i" "n" "r" "U"</b>]* (<b>"-"</b> [<b>"i" "n" "r" "U"</b>]*)?
</pre>
</td></tr>
</table>

<p>
Options allow compile-time options to be turned on or off for particular
parts of the regular expression.  They correspond to several of the
compile-time flags passed to the <code>regcomp</code> API function.  If an
option is specified in the first section, it is turned on.  If it is
specified in the second section (after the <tt>-</tt>), it is turned off.
</p>
<ul>
<li><tt>i</tt> - Case insensitive.</li>
<li><tt>n</tt> - Forces special handling of the new line character. See the
<code>REG_NEWLINE</code> flag in the <a href="tre-api.html">API Manual</a>.</li>
<li><tt>r</tt> - Causes the regex to be matched in a right associative manner
rather than the normal left associative manner.</li>
<li><tt>U</tt> - Forces repetition operators to be non-greedy unless a
<tt>?</tt> is appended.</li>
</ul>
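<p>
The on/off scoping of the <tt>i</tt> option can be sketched with Python's
<code>re</code> module, which accepts the same <tt>(?i:...)</tt> and
<tt>(?-i:...)</tt> group forms.  This is an analogy only: Python has no
<tt>n</tt>, <tt>r</tt>, or <tt>U</tt> options with these meanings, and it
accepts the unscoped <tt>(?i)</tt> form only at the start of the pattern.
</p>

```python
import re

# "i" is turned on for the group only; " regex" stays case sensitive.
scoped_on = re.compile(r"(?i:tre) regex")
print(scoped_on.fullmatch("TRE regex") is not None)  # True
print(scoped_on.fullmatch("TRE REGEX") is not None)  # False

# "i" is on for the whole pattern, then turned off inside the group.
scoped_off = re.compile(r"(?i)tre (?-i:regex)")
print(scoped_off.fullmatch("Tre regex") is not None)  # True
print(scoped_off.fullmatch("Tre REGEX") is not None)  # False
```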
<h2>BRE Syntax</h2>

<p>
The obsolete basic regexp (BRE) syntax differs from the ERE syntax as
follows:
</p>

<ul>
<li><tt>|</tt> is an ordinary character, and there is no equivalent
for its functionality.  <tt>+</tt> and <tt>?</tt> are ordinary
characters.</li>
<li>The delimiters for bounds are <tt>\{</tt> and <tt>\}</tt>, with
<tt>{</tt> and <tt>}</tt> by themselves ordinary characters.</li>
<li>The parentheses for nested subexpressions are <tt>\(</tt> and
<tt>\)</tt>, with <tt>(</tt> and <tt>)</tt> by themselves ordinary
characters.</li>
<li><tt>^</tt> is an ordinary character except at the beginning of the
RE or the beginning of a parenthesized subexpression.  Similarly,
<tt>$</tt> is an ordinary character except at the end of the
RE or the end of a parenthesized subexpression.</li>
</ul>
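<p>
The first three differences amount to swapping which spelling of a
character is the operator.  The following rough sketch rewrites a BRE as
an equivalent ERE on that basis.  The function <code>bre_to_ere</code> is
hypothetical and deliberately incomplete: it does not treat bracket
expressions specially and ignores the context-dependent rules for
<tt>^</tt> and <tt>$</tt>.
</p>

```python
def bre_to_ere(bre: str) -> str:
    """Rewrite a simple BRE using ERE spellings of the operators."""
    out = []
    i = 0
    while i < len(bre):
        ch = bre[i]
        if ch == "\\" and i + 1 < len(bre):
            nxt = bre[i + 1]
            # \( \) \{ \} are operators in a BRE; an ERE writes them bare.
            out.append(nxt if nxt in "(){}" else ch + nxt)
            i += 2
        elif ch in "(){}|+?":
            # Ordinary characters in a BRE, so escape them for the ERE.
            out.append("\\" + ch)
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)

print(bre_to_ere(r"\(ab\)\{2\}"))  # (ab){2}
print(bre_to_ere("a+b|c"))         # a\+b\|c
```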