1 <h1>TRE Regexp Syntax</h1> 2 3 <p> 4 This document describes the POSIX 1003.2 extended RE (ERE) syntax and 5 the basic RE (BRE) syntax as implemented by TRE, and the TRE extensions 6 to the ERE syntax. A simple Extended Backus-Naur Form (EBNF) style 7 notation is used to describe the grammar. 8 </p> 9 10 <h2>ERE Syntax</h2> 11 12 <h3>Alternation operator</h3> 13 <a name="alternation"></a> 14 <a name="extended-regexp"></a> 15 16 <table bgcolor="#e0e0f0" cellpadding="10"> 17 <tr><td> 18 <pre> 19 <i>extended-regexp</i> ::= <a href="#branch"><i>branch</i></a> 20 | <i>extended-regexp</i> <b>"|"</b> <a href="#branch"><i>branch</i></a> 21 </pre> 22 </td></tr> 23 </table> 24 <p> 25 An extended regexp (ERE) is one or more <i>branches</i>, separated by 26 <tt>|</tt>. An ERE matches anything that matches one or more of the 27 branches. 28 </p> 29 30 <h3>Catenation of REs</h3> 31 <a name="catenation"></a> 32 <a name="branch"></a> 33 34 <table bgcolor="#e0e0f0" cellpadding="10"> 35 <tr><td> 36 <pre> 37 <i>branch</i> ::= <i>piece</i> 38 | <i>branch</i> <i>piece</i> 39 </pre> 40 </td></tr> 41 </table> 42 <p> 43 A branch is one or more <i>pieces</i> concatenated. It matches a 44 match for the first piece, followed by a match for the second piece, 45 and so on. 46 </p> 47 48 49 <table bgcolor="#e0e0f0" cellpadding="10"> 50 <tr><td> 51 <pre> 52 <i>piece</i> ::= <i>atom</i> 53 | <i>atom</i> <a href="#repeat-operator"><i>repeat-operator</i></a> 54 | <i>atom</i> <a href="#approx-settings"><i>approx-settings</i></a> 55 </pre> 56 </td></tr> 57 </table> 58 <p> 59 A piece is an <i>atom</i> possibly followed by a repeat operator or an 60 expression controlling approximate matching parameters for the <i>atom</i>. 61 </p> 62 63 64 <table bgcolor="#e0e0f0" cellpadding="10"> 65 <tr><td> 66 <pre> 67 <i>atom</i> ::= <b>"("</b> <i>extended-regexp</i> <b>")"</b> 68 | <a href="#bracket-expression"><i>bracket-expression</i></a> 69 | <b>"."</b> 70 | <a href="#assertion"><i>assertion</i></a> 71 | <a href="#literal"><i>literal</i></a> 72 | <a href="#backref"><i>back-reference</i></a> 73 | <b>"(?#"</b> <i>comment-text</i> <b>")"</b> 74 | <b>"(?"</b> <a href="#options"><i>options</i></a> <b>")"</b> <i>extended-regexp</i> 75 | <b>"(?"</b> <a href="#options"><i>options</i></a> <b>":"</b> <i>extended-regexp</i> <b>")"</b> 76 </pre> 77 </td></tr> 78 </table> 79 <p> 80 An atom is either an ERE enclosed in parenthesis, a bracket 81 expression, a <tt>.</tt> (period), an assertion, or a literal. 82 </p> 83 84 <p> 85 The dot (<tt>.</tt>) matches any single character. 86 If the <code>REG_NEWLINE</code> compilation flag (see <a 87 href="api.html">API manual</a>) is specified, the newline 88 character is not matched. 89 </p> 90 91 <p> 92 <tt>Comment-text</tt> can contain any characters except for a closing parenthesis <tt>)</tt>. The text in the comment is 93 completely ignored by the regex parser and it used solely for readability purposes. 94 </p> 95 96 <h3>Repeat operators</h3> 97 <a name="repeat-operator"></a> 98 99 <table bgcolor="#e0e0f0" cellpadding="10"> 100 <tr><td> 101 <pre> 102 <i>repeat-operator</i> ::= <b>"*"</b> 103 | <b>"+"</b> 104 | <b>"?"</b> 105 | <i>bound</i> 106 | <b>"*?"</b> 107 | <b>"+?"</b> 108 | <b>"??"</b> 109 | <i>bound</i> <b>?</b> 110 </pre> 111 </td></tr> 112 </table> 113 114 <p> 115 An atom followed by <tt>*</tt> matches a sequence of 0 or more matches 116 of the atom. <tt>+</tt> is similar to <tt>*</tt>, matching a sequence 117 of 1 or more matches of the atom. An atom followed by <tt>?</tt> 118 matches a sequence of 0 or 1 matches of the atom. 119 </p> 120 121 <p> 122 A <i>bound</i> is one of the following, where <i>m</i> and <i>m</i> 123 are unsigned decimal integers between <tt>0</tt> and 124 <tt>RE_DUP_MAX</tt>: 125 </p> 126 127 <ol> 128 <li><tt>{</tt><i>m</i><tt>,</tt><i>n</i><tt>}</tt></li> 129 <li><tt>{</tt><i>m</i><tt>,}</tt></li> 130 <li><tt>{</tt><i>m</i><tt>}</tt></li> 131 </ol> 132 133 <p> 134 An atom followed by [1] matches a sequence of <i>m</i> through <i>n</i> 135 (inclusive) matches of the atom. An atom followed by [2] 136 matches a sequence of <i>m</i> or more matches of the atom. An atom 137 followed by [3] matches a sequence of exactly <i>m</i> matches of the 138 atom. 139 </p> 140 141 142 <p> 143 Adding a <tt>?</tt> to a repeat operator makes the subexpression minimal, or 144 non-greedy. Normally a repeated expression is greedy, that is, it matches as 145 many characters as possible. A non-greedy subexpression matches as few 146 characters as possible. Note that this does not (always) mean the same thing 147 as matching as many or few repetitions as possible. Also note 148 that <strong>minimal repetitions are not currently supported for approximate 149 matching</strong>. 150 </p> 151 152 <h3>Approximate matching settings</h3> 153 <a name="approx-settings"></a> 154 155 <table bgcolor="#e0e0f0" cellpadding="10"> 156 <tr><td> 157 <pre> 158 <i>approx-settings</i> ::= <b>"{"</b> <i>count-limits</i>* <b>","</b>? <i>cost-equation</i>? <b>"}"</b> 159 160 <i>count-limits</i> ::= <b>"+"</b> <i>number</i>? 161 | <b>"-"</b> <i>number</i>? 162 | <b>"#"</b> <i>number</i>? 163 | <b>"~"</b> <i>number</i>? 164 165 <i>cost-equation</i> ::= ( <i>cost-term</i> "+"? " "? )+ <b>"<"</b> <i>number</i> 166 167 <i>cost-term</i> ::= <i>number</i> <b>"i"</b> 168 | <i>number</i> <b>"d"</b> 169 | <i>number</i> <b>"s"</b> 170 171 </pre> 172 </td></tr> 173 </table> 174 175 <p> 176 The approximate matching settings for a subpattern can be changed 177 by appending <i>approx-settings</i> to the subpattern. Limits for 178 the number of errors can be set and an expression for specifying and 179 limiting the costs can be given. 180 </p> 181 182 <p> 183 The <i>count-limits</i> can be used to set limits for the number of 184 insertions (<tt>+</tt>), deletions (<tt>-</tt>), substitutions 185 (<tt>#</tt>), and total number of errors (<tt>~</tt>). If the 186 <i>number</i> part is omitted, the specified error count will be 187 unlimited. 188 </p> 189 190 <p> 191 The <i>cost-equation</i> can be thought of as a mathematical equation, 192 where <tt>i</tt>, <tt>d</tt>, and <tt>s</tt> stand for the number of 193 insertions, deletions, and substitutions, respectively. The equation 194 can have a multiplier for each of <tt>i</tt>, <tt>d</tt>, and 195 <tt>s</tt>. The multiplier is the cost of the error, and the number 196 after <tt><</tt> is the maximum allowed cost of a match. Spaces 197 and pluses can be inserted to make the equation readable. In fact, when 198 specifying only a cost equation, adding a space after the opening <tt>{</tt> 199 is <strong>required</strong>. 200 </p> 201 202 <p> 203 Examples: 204 <dl> 205 <dt><tt>{~}</tt></dt> 206 <dd>Sets the maximum number of errors to unlimited.</dd> 207 <dt><tt>{~3}</tt></dt> 208 <dd>Sets the maximum number of errors to three.</dd> 209 <dt><tt>{+2~5}</tt></dt> 210 <dd>Sets the maximum number of errors to five, and the maximum number 211 of insertions to two.</dd> 212 <dt><tt>{<3}</tt></dt> 213 <dd>Sets the maximum cost to three. 214 <dt><tt>{ 2i + 1d + 2s < 5 }</tt></dt> 215 <dd>Sets the cost of an insertion to two, a deletion to one, a 216 substitution to two, and the maximum cost to five. 217 </dl> 218 219 220 <h3>Bracket expressions</h3> 221 <a name="bracket-expression"></a> 222 223 <table bgcolor="#e0e0f0" cellpadding="10"> 224 <tr><td> 225 <pre> 226 <i>bracket-expression</i> ::= <b>"["</b> <i>item</i>+ <b>"]"</b> 227 | <b>"[^"</b> <i>item</i>+ <b>"]"</b> 228 </pre> 229 </td></tr> 230 </table> 231 232 <p> 233 A bracket expression specifies a set of characters by enclosing a 234 nonempty list of items in brackets. Normally anything matching any 235 item in the list is matched. If the list begins with <tt>^</tt> the 236 meaning is negated; any character matching no item in the list is 237 matched. 238 </p> 239 240 <p> 241 An item is any of the following: 242 </p> 243 <ul> 244 <li>A single character, matching that character.</li> 245 <li>Two characters separated by <tt>-</tt>. This is shorthand for the 246 full range of characters between those two (inclusive) in the 247 collating sequence. For example, <tt>[0-9]</tt> in ASCII matches any 248 decimal digit.</li> 249 <li>A collating element enclosed in <tt>[.</tt> and <tt>.]</tt>, 250 matching the collating element. This can be used to include a literal 251 <tt>-</tt> or a multi-character collating element in the list.</li> 252 <li>A collating element enclosed in <tt>[=</tt> and <tt>=]</tt> (an 253 equivalence class), matching all collating elements with the same 254 primary collation weight as that element, including the element 255 itself.</li> 256 <li>The name of a character class enclosed in <tt>[:</tt> and 257 <tt>:]</tt>, matching any character belonging to the class. The set 258 of valid names depends on the <code>LC_CTYPE</code> category of the 259 current locale, but the following names are valid in all locales: 260 <ul> 261 <li><tt>alnum</tt> - alphanumeric characters</li> 262 <li><tt>alpha</tt> - alphabetic characters</li> 263 <li><tt>blank</tt> - blank characters</li> 264 <li><tt>cntrl</tt> - control characters</li> 265 <li><tt>digit</tt> - decimal digits (0 through 9)</li> 266 <li><tt>graph</tt> - all printable characters except space</li> 267 <li><tt>lower</tt> - lower-case letters</li> 268 <li><tt>print</tt> - printable characters including space</li> 269 <li><tt>punct</tt> - printable characters not space or alphanumeric</li> 270 <li><tt>space</tt> - white-space characters</li> 271 <li><tt>upper</tt> - upper case letters</li> 272 <li><tt>xdigit</tt> - hexadecimal digits</li> 273 </ul> 274 </ul> 275 <p> 276 To include a literal <tt>-</tt> in the list, make it either the first 277 or last item, the second endpoint of a range, or enclose it in 278 <tt>[.</tt> and <tt>.]</tt> to make it a collating element. To 279 include a literal <tt>]</tt> in the list, make it either the first 280 item, the second endpoint of a range, or enclose it in <tt>[.</tt> and 281 <tt>.]</tt>. To use a literal <tt>-</tt> as the first 282 endpoint of a range, enclose it in <tt>[.</tt> and <tt>.]</tt>. 283 </p> 284 285 286 <h3>Assertions</h3> 287 <a name="assertion"></a> 288 289 <table bgcolor="#e0e0f0" cellpadding="10"> 290 <tr><td> 291 <pre> 292 <i>assertion</i> ::= <b>"^"</b> 293 | <b>"$"</b> 294 | <b>"\"</b> <i>assertion-character</i> 295 </pre> 296 </td></tr> 297 </table> 298 299 <p> 300 The expressions <tt>^</tt> and <tt>$</tt> are called "left anchor" and 301 "right anchor", respectively. The left anchor matches the empty 302 string at the beginning of the string. The right anchor matches the 303 empty string at the end of the string. The behaviour of both anchors 304 can be varied by specifying certain execution and compilation flags; 305 see the <a href="api.html">API manual</a>. 306 </p> 307 308 <p> 309 An assertion-character can be any of the following: 310 </p> 311 312 <ul> 313 <li><tt><</tt> - Beginning of word 314 <li><tt>></tt> - End of word 315 <li><tt>b</tt> - Word boundary 316 <li><tt>B</tt> - Non-word boundary 317 <li><tt>d</tt> - Digit character (equivalent to <tt>[[:digit:]]</tt>)</li> 318 <li><tt>D</tt> - Non-digit character (equivalent to <tt>[^[:digit:]]</tt>)</li> 319 <li><tt>s</tt> - Space character (equivalent to <tt>[[:space:]]</tt>)</li> 320 <li><tt>S</tt> - Non-space character (equivalent to <tt>[^[:space:]]</tt>)</li> 321 <li><tt>w</tt> - Word character (equivalent to <tt>[[:alnum:]_]</tt>)</li> 322 <li><tt>W</tt> - Non-word character (equivalent to <tt>[^[:alnum:]_]</tt>)</li> 323 </ul> 324 325 326 <h3>Literals</h3> 327 <a name="literal"></a> 328 329 <table bgcolor="#e0e0f0" cellpadding="10"> 330 <tr><td> 331 <pre> 332 <i>literal</i> ::= <i>ordinary-character</i> 333 | <b>"\x"</b> [<b>"1"</b>-<b>"9"</b> <b>"a"-<b>"f"</b> <b>"A"</b>-<b>"F"</b>]{0,2} 334 | <b>"\x{"</b> [<b>"1"</b>-<b>"9"</b> <b>"a"-<b>"f"</b> <b>"A"</b>-<b>"F"</b>]* <b>"}"</b> 335 | <b>"\"</b> <i>character</i> 336 </pre> 337 </td></tr> 338 </table> 339 <p> 340 A literal is either an ordinary character (a character that has no 341 other significance in the context), an 8 bit hexadecimal encoded 342 character (e.g. <tt>\x1B</tt>), a wide hexadecimal encoded character 343 (e.g. <tt>\x{263a}</tt>), or an escaped character. An escaped 344 character is a <tt>\</tt> followed by any character, and matches that 345 character. Escaping can be used to match characters which have a 346 special meaning in regexp syntax. A <tt>\</tt> cannot be the last 347 character of an ERE. Escaping also allows you to include a few 348 non-printable characters in the regular expression. These special 349 escape sequences include: 350 </p> 351 352 <ul> 353 <li><tt>\a</tt> - Bell character (ASCII code 7) 354 <li><tt>\e</tt> - Escape character (ASCII code 27) 355 <li><tt>\f</tt> - Form-feed character (ASCII code 12) 356 <li><tt>\n</tt> - New-line/line-feed character (ASCII code 10) 357 <li><tt>\r</tt> - Carriage return character (ASCII code 13) 358 <li><tt>\t</tt> - Horizontal tab character (ASCII code 9) 359 </ul> 360 361 <p> 362 An ordinary character is just a single character with no other 363 significance, and matches that character. A <tt>{</tt> followed by 364 something else than a digit is considered an ordinary character. 365 </p> 366 367 368 <h3>Back references</h3> 369 <a name="backref"></a> 370 371 <table bgcolor="#e0e0f0" cellpadding="10"> 372 <tr><td> 373 <pre> 374 <i>back-reference</i> ::= <b>"\"</b> [<b>"1"</b>-<b>"9"</b>] 375 </pre> 376 </td></tr> 377 </table> 378 <p> 379 A back reference is a backslash followed by a single non-zero decimal 380 digit <i>d</i>. It matches the same sequence of characters 381 matched by the <i>d</i>th parenthesized subexpression. 382 </p> 383 384 <p> 385 Back references are not defined for POSIX EREs (for BREs they are), 386 but many matchers, including TRE, implement back references for both 387 EREs and BREs. 388 </p> 389 390 <h3>Options</h3> 391 <a name="options"></a> 392 <table bgcolor="#e0e0f0" cellpadding="10"> 393 <tr><td> 394 <pre> 395 <i>options</i> ::= [<b>"i" "n" "r" "U"</b>]* (<b>"-"</b> [<b>"i" "n" "r" "U"</b>]*)? 396 </pre> 397 </td></tr> 398 </table> 399 400 Options allow compile time options to be turned on/off for particular parts of the 401 regular expression. The options equate to several compile time options specified to 402 the regcomp API function. If the option is specified in the first section, it is 403 turned on. If it is specified in the second section (after the <tt>-</tt>), it is 404 turned off. 405 <ul> 406 <li>i - Case insensitive. 407 <li>n - Forces special handling of the new line character. See the REG_NEWLINE flag in 408 the <a href="tre-api.html">API Manual</a>. 409 <li>r - Causes the regex to be matched in a right associative manner rather than the normal 410 left associative manner. 411 <li>U - Forces repetition operators to be non-greedy unless a <tt>?</tt> is appended. 412 </ul> 413 <h2>BRE Syntax</h2> 414 415 <p> 416 The obsolete basic regexp (BRE) syntax differs from the ERE syntax as 417 follows: 418 </p> 419 420 <ul> 421 <li><tt>|</tt> is an ordinary character, and there is no equivalent 422 for its functionality. <tt>+</tt>, and <tt>?</tt> are ordinary 423 characters.</li> 424 <li>The delimiters for bounds are <tt>\{</tt> and <tt>\}</tt>, with 425 <tt>{</tt> and <tt>}</tt> by themselves ordinary characters.</li> 426 <li>The parentheses for nested subexpressions are <tt>\(</tt> and 427 <tt>\)</tt>, with <tt>(</tt> and <tt>)</tt> by themselves ordinary 428 characters.</li> 429 <li><tt>^</tt> is an ordinary character except at the beginning of the 430 RE or the beginning of a parenthesized subexpression. Similarly, 431 <tt>$</tt> is an ordinary character except at the end of the 432 RE or the end of a parenthesized subexpression.</li> 433 </ul> 434