tre-syntax.html revision 1.1 1 <h1>TRE Regexp Syntax</h1>
2
3 <p>
4 This document describes the POSIX 1003.2 extended RE (ERE) syntax and
5 the basic RE (BRE) syntax as implemented by TRE, and the TRE extensions
6 to the ERE syntax. A simple Extended Backus-Naur Form (EBNF) style
7 notation is used to describe the grammar.
8 </p>
9
10 <h2>ERE Syntax</h2>
11
12 <h3>Alternation operator</h3>
13 <a name="alternation"></a>
14 <a name="extended-regexp"></a>
15
16 <table bgcolor="#e0e0f0" cellpadding="10">
17 <tr><td>
18 <pre>
19 <i>extended-regexp</i> ::= <a href="#branch"><i>branch</i></a>
20 | <i>extended-regexp</i> <b>"|"</b> <a href="#branch"><i>branch</i></a>
21 </pre>
22 </td></tr>
23 </table>
24 <p>
25 An extended regexp (ERE) is one or more <i>branches</i>, separated by
26 <tt>|</tt>. An ERE matches anything that matches one or more of the
27 branches.
28 </p>
29
30 <h3>Catenation of REs</h3>
31 <a name="catenation"></a>
32 <a name="branch"></a>
33
34 <table bgcolor="#e0e0f0" cellpadding="10">
35 <tr><td>
36 <pre>
37 <i>branch</i> ::= <i>piece</i>
38 | <i>branch</i> <i>piece</i>
39 </pre>
40 </td></tr>
41 </table>
42 <p>
43 A branch is one or more <i>pieces</i> concatenated. It matches a
44 match for the first piece, followed by a match for the second piece,
45 and so on.
46 </p>
47
48
49 <table bgcolor="#e0e0f0" cellpadding="10">
50 <tr><td>
51 <pre>
52 <i>piece</i> ::= <i>atom</i>
53 | <i>atom</i> <a href="#repeat-operator"><i>repeat-operator</i></a>
54 | <i>atom</i> <a href="#approx-settings"><i>approx-settings</i></a>
55 </pre>
56 </td></tr>
57 </table>
58 <p>
59 A piece is an <i>atom</i> possibly followed by a repeat operator or an
60 expression controlling approximate matching parameters for the <i>atom</i>.
61 </p>
62
63
64 <table bgcolor="#e0e0f0" cellpadding="10">
65 <tr><td>
66 <pre>
67 <i>atom</i> ::= <b>"("</b> <i>extended-regexp</i> <b>")"</b>
68 | <a href="#bracket-expression"><i>bracket-expression</i></a>
69 | <b>"."</b>
70 | <a href="#assertion"><i>assertion</i></a>
71 | <a href="#literal"><i>literal</i></a>
72 | <a href="#backref"><i>back-reference</i></a>
73 | <b>"(?#"</b> <i>comment-text</i> <b>")"</b>
74 | <b>"(?"</b> <a href="#options"><i>options</i></a> <b>")"</b> <i>extended-regexp</i>
75 | <b>"(?"</b> <a href="#options"><i>options</i></a> <b>":"</b> <i>extended-regexp</i> <b>")"</b>
76 </pre>
77 </td></tr>
78 </table>
79 <p>
80 An atom is either an ERE enclosed in parenthesis, a bracket
81 expression, a <tt>.</tt> (period), an assertion, or a literal.
82 </p>
83
84 <p>
85 The dot (<tt>.</tt>) matches any single character.
86 If the <code>REG_NEWLINE</code> compilation flag (see <a
87 href="api.html">API manual</a>) is specified, the newline
88 character is not matched.
89 </p>
90
91 <p>
92 <tt>Comment-text</tt> can contain any characters except for a closing parenthesis <tt>)</tt>. The text in the comment is
93 completely ignored by the regex parser and it used solely for readability purposes.
94 </p>
95
96 <h3>Repeat operators</h3>
97 <a name="repeat-operator"></a>
98
99 <table bgcolor="#e0e0f0" cellpadding="10">
100 <tr><td>
101 <pre>
102 <i>repeat-operator</i> ::= <b>"*"</b>
103 | <b>"+"</b>
104 | <b>"?"</b>
105 | <i>bound</i>
106 | <b>"*?"</b>
107 | <b>"+?"</b>
108 | <b>"??"</b>
109 | <i>bound</i> <b>?</b>
110 </pre>
111 </td></tr>
112 </table>
113
114 <p>
115 An atom followed by <tt>*</tt> matches a sequence of 0 or more matches
116 of the atom. <tt>+</tt> is similar to <tt>*</tt>, matching a sequence
117 of 1 or more matches of the atom. An atom followed by <tt>?</tt>
118 matches a sequence of 0 or 1 matches of the atom.
119 </p>
120
121 <p>
122 A <i>bound</i> is one of the following, where <i>m</i> and <i>m</i>
123 are unsigned decimal integers between <tt>0</tt> and
124 <tt>RE_DUP_MAX</tt>:
125 </p>
126
127 <ol>
128 <li><tt>{</tt><i>m</i><tt>,</tt><i>n</i><tt>}</tt></li>
129 <li><tt>{</tt><i>m</i><tt>,}</tt></li>
130 <li><tt>{</tt><i>m</i><tt>}</tt></li>
131 </ol>
132
133 <p>
134 An atom followed by [1] matches a sequence of <i>m</i> through <i>n</i>
135 (inclusive) matches of the atom. An atom followed by [2]
136 matches a sequence of <i>m</i> or more matches of the atom. An atom
137 followed by [3] matches a sequence of exactly <i>m</i> matches of the
138 atom.
139 </p>
140
141
142 <p>
143 Adding a <tt>?</tt> to a repeat operator makes the subexpression minimal, or
144 non-greedy. Normally a repeated expression is greedy, that is, it matches as
145 many characters as possible. A non-greedy subexpression matches as few
146 characters as possible. Note that this does not (always) mean the same thing
147 as matching as many or few repetitions as possible. Also note
148 that <strong>minimal repetitions are not currently supported for approximate
149 matching</strong>.
150 </p>
151
152 <h3>Approximate matching settings</h3>
153 <a name="approx-settings"></a>
154
155 <table bgcolor="#e0e0f0" cellpadding="10">
156 <tr><td>
157 <pre>
158 <i>approx-settings</i> ::= <b>"{"</b> <i>count-limits</i>* <b>","</b>? <i>cost-equation</i>? <b>"}"</b>
159
160 <i>count-limits</i> ::= <b>"+"</b> <i>number</i>?
161 | <b>"-"</b> <i>number</i>?
162 | <b>"#"</b> <i>number</i>?
163 | <b>"~"</b> <i>number</i>?
164
165 <i>cost-equation</i> ::= ( <i>cost-term</i> "+"? " "? )+ <b>"<"</b> <i>number</i>
166
167 <i>cost-term</i> ::= <i>number</i> <b>"i"</b>
168 | <i>number</i> <b>"d"</b>
169 | <i>number</i> <b>"s"</b>
170
171 </pre>
172 </td></tr>
173 </table>
174
175 <p>
176 The approximate matching settings for a subpattern can be changed
177 by appending <i>approx-settings</i> to the subpattern. Limits for
178 the number of errors can be set and an expression for specifying and
179 limiting the costs can be given.
180 </p>
181
182 <p>
183 The <i>count-limits</i> can be used to set limits for the number of
184 insertions (<tt>+</tt>), deletions (<tt>-</tt>), substitutions
185 (<tt>#</tt>), and total number of errors (<tt>~</tt>). If the
186 <i>number</i> part is omitted, the specified error count will be
187 unlimited.
188 </p>
189
190 <p>
191 The <i>cost-equation</i> can be thought of as a mathematical equation,
192 where <tt>i</tt>, <tt>d</tt>, and <tt>s</tt> stand for the number of
193 insertions, deletions, and substitutions, respectively. The equation
194 can have a multiplier for each of <tt>i</tt>, <tt>d</tt>, and
195 <tt>s</tt>. The multiplier is the cost of the error, and the number
196 after <tt><</tt> is the maximum allowed cost of a match. Spaces
197 and pluses can be inserted to make the equation readable. In fact, when
198 specifying only a cost equation, adding a space after the opening <tt>{</tt>
199 is <strong>required</strong>.
200 </p>
201
202 <p>
203 Examples:
204 <dl>
205 <dt><tt>{~}</tt></dt>
206 <dd>Sets the maximum number of errors to unlimited.</dd>
207 <dt><tt>{~3}</tt></dt>
208 <dd>Sets the maximum number of errors to three.</dd>
209 <dt><tt>{+2~5}</tt></dt>
210 <dd>Sets the maximum number of errors to five, and the maximum number
211 of insertions to two.</dd>
212 <dt><tt>{<3}</tt></dt>
213 <dd>Sets the maximum cost to three.
214 <dt><tt>{ 2i + 1d + 2s < 5 }</tt></dt>
215 <dd>Sets the cost of an insertion to two, a deletion to one, a
216 substitution to two, and the maximum cost to five.
217 </dl>
218
219
220 <h3>Bracket expressions</h3>
221 <a name="bracket-expression"></a>
222
223 <table bgcolor="#e0e0f0" cellpadding="10">
224 <tr><td>
225 <pre>
226 <i>bracket-expression</i> ::= <b>"["</b> <i>item</i>+ <b>"]"</b>
227 | <b>"[^"</b> <i>item</i>+ <b>"]"</b>
228 </pre>
229 </td></tr>
230 </table>
231
232 <p>
233 A bracket expression specifies a set of characters by enclosing a
234 nonempty list of items in brackets. Normally anything matching any
235 item in the list is matched. If the list begins with <tt>^</tt> the
236 meaning is negated; any character matching no item in the list is
237 matched.
238 </p>
239
240 <p>
241 An item is any of the following:
242 </p>
243 <ul>
244 <li>A single character, matching that character.</li>
245 <li>Two characters separated by <tt>-</tt>. This is shorthand for the
246 full range of characters between those two (inclusive) in the
247 collating sequence. For example, <tt>[0-9]</tt> in ASCII matches any
248 decimal digit.</li>
249 <li>A collating element enclosed in <tt>[.</tt> and <tt>.]</tt>,
250 matching the collating element. This can be used to include a literal
251 <tt>-</tt> or a multi-character collating element in the list.</li>
252 <li>A collating element enclosed in <tt>[=</tt> and <tt>=]</tt> (an
253 equivalence class), matching all collating elements with the same
254 primary collation weight as that element, including the element
255 itself.</li>
256 <li>The name of a character class enclosed in <tt>[:</tt> and
257 <tt>:]</tt>, matching any character belonging to the class. The set
258 of valid names depends on the <code>LC_CTYPE</code> category of the
259 current locale, but the following names are valid in all locales:
260 <ul>
261 <li><tt>alnum</tt> - alphanumeric characters</li>
262 <li><tt>alpha</tt> - alphabetic characters</li>
263 <li><tt>blank</tt> - blank characters</li>
264 <li><tt>cntrl</tt> - control characters</li>
265 <li><tt>digit</tt> - decimal digits (0 through 9)</li>
266 <li><tt>graph</tt> - all printable characters except space</li>
267 <li><tt>lower</tt> - lower-case letters</li>
268 <li><tt>print</tt> - printable characters including space</li>
269 <li><tt>punct</tt> - printable characters not space or alphanumeric</li>
270 <li><tt>space</tt> - white-space characters</li>
271 <li><tt>upper</tt> - upper case letters</li>
272 <li><tt>xdigit</tt> - hexadecimal digits</li>
273 </ul>
274 </ul>
275 <p>
276 To include a literal <tt>-</tt> in the list, make it either the first
277 or last item, the second endpoint of a range, or enclose it in
278 <tt>[.</tt> and <tt>.]</tt> to make it a collating element. To
279 include a literal <tt>]</tt> in the list, make it either the first
280 item, the second endpoint of a range, or enclose it in <tt>[.</tt> and
281 <tt>.]</tt>. To use a literal <tt>-</tt> as the first
282 endpoint of a range, enclose it in <tt>[.</tt> and <tt>.]</tt>.
283 </p>
284
285
286 <h3>Assertions</h3>
287 <a name="assertion"></a>
288
289 <table bgcolor="#e0e0f0" cellpadding="10">
290 <tr><td>
291 <pre>
292 <i>assertion</i> ::= <b>"^"</b>
293 | <b>"$"</b>
294 | <b>"\"</b> <i>assertion-character</i>
295 </pre>
296 </td></tr>
297 </table>
298
299 <p>
300 The expressions <tt>^</tt> and <tt>$</tt> are called "left anchor" and
301 "right anchor", respectively. The left anchor matches the empty
302 string at the beginning of the string. The right anchor matches the
303 empty string at the end of the string. The behaviour of both anchors
304 can be varied by specifying certain execution and compilation flags;
305 see the <a href="api.html">API manual</a>.
306 </p>
307
308 <p>
309 An assertion-character can be any of the following:
310 </p>
311
312 <ul>
313 <li><tt><</tt> - Beginning of word
314 <li><tt>></tt> - End of word
315 <li><tt>b</tt> - Word boundary
316 <li><tt>B</tt> - Non-word boundary
317 <li><tt>d</tt> - Digit character (equivalent to <tt>[[:digit:]]</tt>)</li>
318 <li><tt>D</tt> - Non-digit character (equivalent to <tt>[^[:digit:]]</tt>)</li>
319 <li><tt>s</tt> - Space character (equivalent to <tt>[[:space:]]</tt>)</li>
320 <li><tt>S</tt> - Non-space character (equivalent to <tt>[^[:space:]]</tt>)</li>
321 <li><tt>w</tt> - Word character (equivalent to <tt>[[:alnum:]_]</tt>)</li>
322 <li><tt>W</tt> - Non-word character (equivalent to <tt>[^[:alnum:]_]</tt>)</li>
323 </ul>
324
325
326 <h3>Literals</h3>
327 <a name="literal"></a>
328
329 <table bgcolor="#e0e0f0" cellpadding="10">
330 <tr><td>
331 <pre>
332 <i>literal</i> ::= <i>ordinary-character</i>
333 | <b>"\x"</b> [<b>"1"</b>-<b>"9"</b> <b>"a"-<b>"f"</b> <b>"A"</b>-<b>"F"</b>]{0,2}
334 | <b>"\x{"</b> [<b>"1"</b>-<b>"9"</b> <b>"a"-<b>"f"</b> <b>"A"</b>-<b>"F"</b>]* <b>"}"</b>
335 | <b>"\"</b> <i>character</i>
336 </pre>
337 </td></tr>
338 </table>
339 <p>
340 A literal is either an ordinary character (a character that has no
341 other significance in the context), an 8 bit hexadecimal encoded
342 character (e.g. <tt>\x1B</tt>), a wide hexadecimal encoded character
343 (e.g. <tt>\x{263a}</tt>), or an escaped character. An escaped
344 character is a <tt>\</tt> followed by any character, and matches that
345 character. Escaping can be used to match characters which have a
346 special meaning in regexp syntax. A <tt>\</tt> cannot be the last
347 character of an ERE. Escaping also allows you to include a few
348 non-printable characters in the regular expression. These special
349 escape sequences include:
350 </p>
351
352 <ul>
353 <li><tt>\a</tt> - Bell character (ASCII code 7)
354 <li><tt>\e</tt> - Escape character (ASCII code 27)
355 <li><tt>\f</tt> - Form-feed character (ASCII code 12)
356 <li><tt>\n</tt> - New-line/line-feed character (ASCII code 10)
357 <li><tt>\r</tt> - Carriage return character (ASCII code 13)
358 <li><tt>\t</tt> - Horizontal tab character (ASCII code 9)
359 </ul>
360
361 <p>
362 An ordinary character is just a single character with no other
363 significance, and matches that character. A <tt>{</tt> followed by
364 something else than a digit is considered an ordinary character.
365 </p>
366
367
368 <h3>Back references</h3>
369 <a name="backref"></a>
370
371 <table bgcolor="#e0e0f0" cellpadding="10">
372 <tr><td>
373 <pre>
374 <i>back-reference</i> ::= <b>"\"</b> [<b>"1"</b>-<b>"9"</b>]
375 </pre>
376 </td></tr>
377 </table>
378 <p>
379 A back reference is a backslash followed by a single non-zero decimal
380 digit <i>d</i>. It matches the same sequence of characters
381 matched by the <i>d</i>th parenthesized subexpression.
382 </p>
383
384 <p>
385 Back references are not defined for POSIX EREs (for BREs they are),
386 but many matchers, including TRE, implement back references for both
387 EREs and BREs.
388 </p>
389
390 <h3>Options</h3>
391 <a name="options"></a>
392 <table bgcolor="#e0e0f0" cellpadding="10">
393 <tr><td>
394 <pre>
395 <i>options</i> ::= [<b>"i" "n" "r" "U"</b>]* (<b>"-"</b> [<b>"i" "n" "r" "U"</b>]*)?
396 </pre>
397 </td></tr>
398 </table>
399
400 Options allow compile time options to be turned on/off for particular parts of the
401 regular expression. The options equate to several compile time options specified to
402 the regcomp API function. If the option is specified in the first section, it is
403 turned on. If it is specified in the second section (after the <tt>-</tt>), it is
404 turned off.
405 <ul>
406 <li>i - Case insensitive.
407 <li>n - Forces special handling of the new line character. See the REG_NEWLINE flag in
408 the <a href="tre-api.html">API Manual</a>.
409 <li>r - Causes the regex to be matched in a right associative manner rather than the normal
410 left associative manner.
411 <li>U - Forces repetition operators to be non-greedy unless a <tt>?</tt> is appended.
412 </ul>
413 <h2>BRE Syntax</h2>
414
415 <p>
416 The obsolete basic regexp (BRE) syntax differs from the ERE syntax as
417 follows:
418 </p>
419
420 <ul>
421 <li><tt>|</tt> is an ordinary character, and there is no equivalent
422 for its functionality. <tt>+</tt>, and <tt>?</tt> are ordinary
423 characters.</li>
424 <li>The delimiters for bounds are <tt>\{</tt> and <tt>\}</tt>, with
425 <tt>{</tt> and <tt>}</tt> by themselves ordinary characters.</li>
426 <li>The parentheses for nested subexpressions are <tt>\(</tt> and
427 <tt>\)</tt>, with <tt>(</tt> and <tt>)</tt> by themselves ordinary
428 characters.</li>
429 <li><tt>^</tt> is an ordinary character except at the beginning of the
430 RE or the beginning of a parenthesized subexpression. Similarly,
431 <tt>$</tt> is an ordinary character except at the end of the
432 RE or the end of a parenthesized subexpression.</li>
433 </ul>
434