<h1>TRE API reference manual</h1>

<h2>The <tt>regcomp()</tt> functions</h2>
<a name="regcomp"></a>

<div class="code">
<code>
#include &lt;tre/regex.h&gt;
<br>
<br>
<font class="type">int</font>
<font class="func">regcomp</font>(<font
class="type">regex_t</font> *<font class="arg">preg</font>,
<font class="qual">const</font> <font class="type">char</font>
*<font class="arg">regex</font>, <font class="type">int</font>
<font class="arg">cflags</font>);
<br>
<font class="type">int</font> <font
class="func">regncomp</font>(<font class="type">regex_t</font>
*<font class="arg">preg</font>, <font class="qual">const</font>
<font class="type">char</font> *<font class="arg">regex</font>,
<font class="type">size_t</font> <font class="arg">len</font>,
<font class="type">int</font> <font class="arg">cflags</font>);
<br>
<font class="type">int</font> <font
class="func">regwcomp</font>(<font class="type">regex_t</font>
*<font class="arg">preg</font>, <font class="qual">const</font>
<font class="type">wchar_t</font> *<font
class="arg">regex</font>, <font class="type">int</font> <font
class="arg">cflags</font>);
<br>
<font class="type">int</font> <font
class="func">regwncomp</font>(<font class="type">regex_t</font>
*<font class="arg">preg</font>, <font class="qual">const</font>
<font class="type">wchar_t</font> *<font
class="arg">regex</font>, <font class="type">size_t</font>
<font class="arg">len</font>, <font class="type">int</font>
<font class="arg">cflags</font>);
<br>
<font class="type">void</font> <font
class="func">regfree</font>(<font class="type">regex_t</font>
*<font class="arg">preg</font>);
<br>
</code>
</div>

<p>
The <tt><font class="func">regcomp</font>()</tt> function compiles
the regex string pointed to by <tt><font
class="arg">regex</font></tt> to an internal representation and
stores the result in the pattern buffer structure pointed to by
<tt><font class="arg">preg</font></tt>. The <tt><font
class="func">regncomp</font>()</tt> function is like <tt><font
class="func">regcomp</font>()</tt>, but <tt><font
class="arg">regex</font></tt> is not terminated with the null
byte. Instead, the <tt><font class="arg">len</font></tt> argument
is used to give the length of the string, and the string may contain
null bytes. The <tt><font class="func">regwcomp</font>()</tt> and
<tt><font class="func">regwncomp</font>()</tt> functions work like
<tt><font class="func">regcomp</font>()</tt> and <tt><font
class="func">regncomp</font>()</tt>, respectively, but take a
wide-character (<tt><font class="type">wchar_t</font></tt>) string
instead of a byte string.
</p>

<p>
The <tt><font class="arg">cflags</font></tt> argument is the
bitwise inclusive OR of zero or more of the following flags (defined
in the header <tt>&lt;tre/regex.h&gt;</tt>):
</p>

<blockquote>
<dl>
<dt><tt>REG_EXTENDED</tt></dt>
<dd>Use POSIX Extended Regular Expression (ERE) compatible syntax when
compiling <tt><font class="arg">regex</font></tt>. The default
syntax is the POSIX Basic Regular Expression (BRE) syntax, but it is
considered obsolete.</dd>

<dt><tt>REG_ICASE</tt></dt>
<dd>Ignore case. Subsequent searches with the <a
href="#regexec"><tt>regexec</tt></a> family of functions using this
pattern buffer will be case insensitive.</dd>

<dt><tt>REG_NOSUB</tt></dt>
<dd>Do not report submatches. Subsequent searches with the <a
href="#regexec"><tt>regexec</tt></a> family of functions will only
report whether a match was found or not and will not fill the submatch
array.</dd>

<dt><tt>REG_NEWLINE</tt></dt>
<dd>Normally the newline character is treated as an ordinary
character. When this flag is used, the newline character
(<tt>'\n'</tt>, ASCII code 10) is treated specially as follows:
<ol>
<li>The match-any-character operator (dot <tt>"."</tt> outside a
bracket expression) does not match a newline.</li>
<li>A non-matching list (<tt>[^...]</tt>) not containing a newline
does not match a newline.</li>
<li>The match-beginning-of-line operator <tt>^</tt> matches the empty
string immediately after a newline as well as the empty string at the
beginning of the string (but see the <code>REG_NOTBOL</code>
<code>regexec()</code> flag below).</li>
<li>The match-end-of-line operator <tt>$</tt> matches the empty
string immediately before a newline as well as the empty string at the
end of the string (but see the <code>REG_NOTEOL</code>
<code>regexec()</code> flag below).</li>
</ol>
</dd>

<dt><tt>REG_LITERAL</tt></dt>
<dd>Interpret the entire <tt><font class="arg">regex</font></tt>
argument as a literal string, that is, all characters will be
considered ordinary. This is a nonstandard extension, compatible with
but not specified by POSIX.</dd>

<dt><tt>REG_NOSPEC</tt></dt>
<dd>Same as <tt>REG_LITERAL</tt>. This flag is provided for
compatibility with BSD.</dd>

<dt><tt>REG_RIGHT_ASSOC</tt></dt>
<dd>By default, concatenation is left associative in TRE, as per
the grammar given in the <a
href="http://www.opengroup.org/onlinepubs/007904975/basedefs/xbd_chap09.html">base
specifications on regular expressions</a> of Std 1003.1-2001 (POSIX).
This flag flips associativity of concatenation to right associative.
Associativity can have an effect on how a match is divided into
submatches, but does not change what is matched by the entire regexp.
</dd>

<dt><tt>REG_UNGREEDY</tt></dt>
<dd>By default, repetition operators are greedy in TRE as per Std 1003.1-2001 (POSIX) and
can be forced to be non-greedy by appending a <tt>?</tt> character. This flag reverses this behavior
by making the operators non-greedy by default and greedy when a <tt>?</tt> is specified.</dd>
</dl>
</blockquote>
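
<p>
As a rough illustration of how <tt>REG_NEWLINE</tt> and <tt>REG_ICASE</tt>
change matching, the following sketch wraps compile, match, and free in one
helper. The helper name <tt>matches()</tt> is hypothetical, and the sketch
includes the POSIX header <tt>&lt;regex.h&gt;</tt> on the assumption that the
standard functions behave as described above; with TRE installed,
<tt>&lt;tre/regex.h&gt;</tt> would be included instead.
</p>

```c
#include <regex.h>   /* assumption: POSIX header; swap for <tre/regex.h> when using TRE */
#include <stddef.h>

/* Hypothetical helper: returns 1 if `pattern` matches somewhere in `text`,
 * 0 if it does not (or matching failed), and -1 if compilation failed. */
static int matches(const char *pattern, const char *text, int cflags)
{
    regex_t preg;
    int rc;

    /* REG_NOSUB: only a yes/no answer is needed, so skip submatch reporting. */
    if (regcomp(&preg, pattern, cflags | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&preg, text, 0, NULL, 0);
    regfree(&preg);
    return rc == 0 ? 1 : 0;
}
```

<p>
For the text <tt>"a\nb"</tt>, the pattern <tt>^b</tt> is expected to match only
when <tt>REG_NEWLINE</tt> is set, while <tt>a.b</tt> is expected to match only
when it is not set.
</p>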

<p>
After a successful call to <tt><font class="func">regcomp</font></tt> it is
possible to use the <tt><font class="arg">preg</font></tt> pattern buffer for
searching for matches in strings (see below). Once the pattern buffer is no
longer needed, the memory allocated for it should be released with <tt><font
class="func">regfree</font></tt>.
</p>


<p>
The <tt><font class="type">regex_t</font></tt> structure has the
following fields that the application can read:
</p>
<blockquote>
<dl>
<dt><tt><font class="type">size_t</font> <font
class="arg">re_nsub</font></tt></dt>
<dd>Number of parenthesized subexpressions in <tt><font
class="arg">regex</font></tt>.
</dd>
</dl>
</blockquote>

<p>
The <tt><font class="func">regcomp</font></tt> function returns
zero if the compilation was successful, or one of the following error
codes if there was an error:
</p>
<blockquote>
<dl>
<dt><tt>REG_BADPAT</tt></dt>
<dd>Invalid regexp. TRE returns this only if a multibyte character
set is used in the current locale, and <tt><font
class="arg">regex</font></tt> contained an invalid multibyte
sequence.</dd>
<dt><tt>REG_ECOLLATE</tt></dt>
<dd>Invalid collating element referenced. TRE returns this whenever
equivalence classes or multicharacter collating elements are used in
bracket expressions (they are not supported yet).</dd>
<dt><tt>REG_ECTYPE</tt></dt>
<dd>Unknown character class name in <tt>[[:<i>name</i>:]]</tt>.</dd>
<dt><tt>REG_EESCAPE</tt></dt>
<dd>The last character of <tt><font class="arg">regex</font></tt>
was a backslash (<tt>\</tt>).</dd>
<dt><tt>REG_ESUBREG</tt></dt>
<dd>Invalid back reference; number in <tt>\<i>digit</i></tt>
invalid.</dd>
<dt><tt>REG_EBRACK</tt></dt>
<dd><tt>[]</tt> imbalance.</dd>
<dt><tt>REG_EPAREN</tt></dt>
<dd><tt>\(\)</tt> or <tt>()</tt> imbalance.</dd>
<dt><tt>REG_EBRACE</tt></dt>
<dd><tt>\{\}</tt> or <tt>{}</tt> imbalance.</dd>
<dt><tt>REG_BADBR</tt></dt>
<dd><tt>{}</tt> content invalid: not a number, more than two numbers,
first larger than second, or number too large.</dd>
<dt><tt>REG_ERANGE</tt></dt>
<dd>Invalid character range, e.g. ending point is earlier in the
collating order than the starting point.</dd>
<dt><tt>REG_ESPACE</tt></dt>
<dd>Out of memory, or an internal limit exceeded.</dd>
<dt><tt>REG_BADRPT</tt></dt>
<dd>Invalid use of repetition operators: two or more repetition operators have
been chained in an undefined way.</dd>
</dl>
</blockquote>
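
<p>
The compile, inspect, and free cycle described above might look roughly like
the sketch below. The helper name <tt>count_groups()</tt> is hypothetical, and
the POSIX header <tt>&lt;regex.h&gt;</tt> is used on the assumption that it
provides the same <tt>regcomp()</tt>, <tt>regfree()</tt>, and
<tt>re_nsub</tt> interface; with TRE, <tt>&lt;tre/regex.h&gt;</tt> would be
included instead.
</p>

```c
#include <regex.h>   /* assumption: POSIX header; swap for <tre/regex.h> when using TRE */
#include <stddef.h>

/* Hypothetical helper: compiles `pattern` as an ERE and, on success, stores
 * the number of parenthesized subexpressions in *nsub.
 * Returns 0 on success or the regcomp() error code on failure. */
static int count_groups(const char *pattern, size_t *nsub)
{
    regex_t preg;
    int rc = regcomp(&preg, pattern, REG_EXTENDED);

    if (rc != 0)
        return rc;          /* e.g. REG_EPAREN for unbalanced parentheses */

    *nsub = preg.re_nsub;   /* readable field documented above */
    regfree(&preg);         /* release the compiled pattern buffer */
    return 0;
}
```

<p>
A caller would typically compare the return value against the error codes
listed above and report a problem before attempting any search.
</p>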


<h2>The <tt>regexec()</tt> functions</h2>
<a name="regexec"></a>

<div class="code">
<code>
#include &lt;tre/regex.h&gt;
<br>
<br>
<font class="type">int</font> <font
class="func">regexec</font>(<font class="qual">const</font>
<font class="type">regex_t</font> *<font
class="arg">preg</font>, <font class="qual">const</font> <font
class="type">char</font> *<font class="arg">string</font>,
<font class="type">size_t</font> <font
class="arg">nmatch</font>,
<br>
<font class="type">regmatch_t</font> <font
class="arg">pmatch</font>[], <font class="type">int</font>
<font class="arg">eflags</font>);
<br>
<font class="type">int</font> <font
class="func">regnexec</font>(<font class="qual">const</font>
<font class="type">regex_t</font> *<font
class="arg">preg</font>, <font class="qual">const</font> <font
class="type">char</font> *<font class="arg">string</font>,
<font class="type">size_t</font> <font class="arg">len</font>,
<br>
<font class="type">size_t</font> <font
class="arg">nmatch</font>, <font class="type">regmatch_t</font>
<font class="arg">pmatch</font>[], <font
class="type">int</font> <font class="arg">eflags</font>);
<br>
<font class="type">int</font> <font
class="func">regwexec</font>(<font class="qual">const</font>
<font class="type">regex_t</font> *<font
class="arg">preg</font>, <font class="qual">const</font> <font
class="type">wchar_t</font> *<font class="arg">string</font>,
<font class="type">size_t</font> <font
class="arg">nmatch</font>,
<br>
<font class="type">regmatch_t</font> <font
class="arg">pmatch</font>[], <font class="type">int</font>
<font class="arg">eflags</font>);
<br>
<font class="type">int</font> <font
class="func">regwnexec</font>(<font class="qual">const</font>
<font class="type">regex_t</font> *<font
class="arg">preg</font>, <font class="qual">const</font> <font
class="type">wchar_t</font> *<font class="arg">string</font>,
<font class="type">size_t</font> <font class="arg">len</font>,
<br>
<font class="type">size_t</font> <font
class="arg">nmatch</font>, <font class="type">regmatch_t</font>
<font class="arg">pmatch</font>[], <font
class="type">int</font> <font class="arg">eflags</font>);
</code>
</div>

<p>
The <tt><font class="func">regexec</font>()</tt> function matches
the null-terminated string against the compiled regexp <tt><font
class="arg">preg</font></tt>, initialized by a previous call to
any one of the <a href="#regcomp"><tt>regcomp</tt></a> functions. The
<tt><font class="func">regnexec</font>()</tt> function is like
<tt><font class="func">regexec</font>()</tt>, but <tt><font
class="arg">string</font></tt> is not terminated with a null byte.
Instead, the <tt><font class="arg">len</font></tt> argument is used
to give the length of the string, and the string may contain null
bytes. The <tt><font class="func">regwexec</font>()</tt> and
<tt><font class="func">regwnexec</font>()</tt> functions work like
<tt><font class="func">regexec</font>()</tt> and <tt><font
class="func">regnexec</font>()</tt>, respectively, but take a wide
character (<tt><font class="type">wchar_t</font></tt>) string
instead of a byte string. The <tt><font
class="arg">eflags</font></tt> argument is a bitwise OR of zero or
more of the following flags:
</p>
<blockquote>
<dl>
<dt><code>REG_NOTBOL</code></dt>
<dd>
<p>
When this flag is used, the match-beginning-of-line operator
<tt>^</tt> does not match the empty string at the beginning of
<tt><font class="arg">string</font></tt>. If
<code>REG_NEWLINE</code> was used when compiling
<tt><font class="arg">preg</font></tt>, the empty string
immediately after a newline character will still be matched.
</p>
</dd>

<dt><code>REG_NOTEOL</code></dt>
<dd>
<p>
When this flag is used, the match-end-of-line operator
<tt>$</tt> does not match the empty string at the end of
<tt><font class="arg">string</font></tt>. If
<code>REG_NEWLINE</code> was used when compiling
<tt><font class="arg">preg</font></tt>, the empty string
immediately before a newline character will still be matched.
</p>
</dd>

</dl>

<p>
These flags are useful when different portions of a string are passed
to <code>regexec</code> and the beginning or end of the partial string
should not be interpreted as the beginning or end of a line.
</p>

</blockquote>
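
<p>
One common situation where <code>REG_NOTBOL</code> matters is scanning a
string piece by piece for repeated matches. The sketch below is one possible
way to do that; the helper name <tt>count_matches()</tt> is hypothetical, and
the POSIX header <tt>&lt;regex.h&gt;</tt> stands in for
<tt>&lt;tre/regex.h&gt;</tt> on the assumption that both behave as described
above.
</p>

```c
#include <regex.h>   /* assumption: POSIX header; swap for <tre/regex.h> when using TRE */
#include <stddef.h>

/* Hypothetical helper: counts successive matches of `pattern` in `text` by
 * restarting the search just after each match. Returns -1 if the pattern
 * does not compile. */
static int count_matches(const char *pattern, const char *text)
{
    regex_t preg;
    regmatch_t m[1];
    const char *p = text;
    int eflags = 0;
    int count = 0;

    if (regcomp(&preg, pattern, REG_EXTENDED) != 0)
        return -1;

    while (regexec(&preg, p, 1, m, eflags) == 0) {
        count++;
        if (m[0].rm_eo == 0) {
            /* Empty match at the current position: step one byte forward
             * so the loop always makes progress. */
            if (*p == '\0')
                break;
            p++;
        } else {
            p += m[0].rm_eo;
        }
        /* Later pieces do not start at the real beginning of the string. */
        eflags = REG_NOTBOL;
    }

    regfree(&preg);
    return count;
}
```

<p>
Without <code>REG_NOTBOL</code> on the later calls, a pattern such as
<tt>^a</tt> would match again at the start of every remaining piece.
</p>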

<p>
If <code>REG_NOSUB</code> was used when compiling <tt><font
class="arg">preg</font></tt>, <tt><font
class="arg">nmatch</font></tt> is zero, or <tt><font
class="arg">pmatch</font></tt> is <code>NULL</code>, then the
<tt><font class="arg">pmatch</font></tt> argument is ignored.
Otherwise, the submatches corresponding to the parenthesized
subexpressions are stored in the elements of <tt><font
class="arg">pmatch</font></tt>, which must have
at least <tt><font class="arg">nmatch</font></tt> elements.
</p>

<p>
The <tt><font class="type">regmatch_t</font></tt> structure contains
at least the following fields:
</p>
<blockquote>
<dl>
<dt><tt><font class="type">regoff_t</font> <font
class="arg">rm_so</font></tt></dt>
<dd>Offset from start of <tt><font class="arg">string</font></tt> to start of
substring. </dd>
<dt><tt><font class="type">regoff_t</font> <font
class="arg">rm_eo</font></tt></dt>
<dd>Offset from start of <tt><font class="arg">string</font></tt> to the first
character after the substring. </dd>
</dl>
</blockquote>

<p>
The length of a submatch can be computed by subtracting <code>rm_so</code> from
<code>rm_eo</code>. If a parenthesized subexpression did not participate in a
match, the <code>rm_so</code> and <code>rm_eo</code> fields for the
corresponding <code>pmatch</code> element are set to <code>-1</code>. Note
that when a multibyte character set is in effect, the submatch offsets are
given as byte offsets, not character offsets.
</p>

<p>
The <code>regexec()</code> functions return zero if a match was found,
otherwise they return <code>REG_NOMATCH</code> to indicate no match,
or <code>REG_ESPACE</code> to indicate that enough temporary memory
could not be allocated to complete the matching operation.
</p>
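
<p>
Reading submatches out of <tt>pmatch</tt> could be sketched as follows. The
helper name <tt>extract_group()</tt>, the fixed array size of 10, and the
<tt>-1</tt> failure convention are all choices made for this illustration, and
the POSIX header <tt>&lt;regex.h&gt;</tt> is assumed to be interchangeable with
<tt>&lt;tre/regex.h&gt;</tt> for these calls.
</p>

```c
#include <regex.h>   /* assumption: POSIX header; swap for <tre/regex.h> when using TRE */
#include <stddef.h>
#include <string.h>

#define MAX_GROUPS 10   /* illustrative limit on reported submatches */

/* Hypothetical helper: copies submatch number `group` of the first match of
 * `pattern` in `text` into `out` (capacity `outsz`, NUL-terminated).
 * Returns the submatch length, or -1 on any failure, including the case
 * where the subexpression did not participate in the match. */
static int extract_group(const char *pattern, const char *text,
                         size_t group, char *out, size_t outsz)
{
    regex_t preg;
    regmatch_t pmatch[MAX_GROUPS];
    size_t len;
    int rc;

    if (group >= MAX_GROUPS || regcomp(&preg, pattern, REG_EXTENDED) != 0)
        return -1;

    rc = regexec(&preg, text, MAX_GROUPS, pmatch, 0);
    regfree(&preg);
    if (rc != 0)
        return -1;                       /* REG_NOMATCH or REG_ESPACE */

    if (pmatch[group].rm_so == -1)
        return -1;                       /* group took no part in the match */

    /* Length is rm_eo minus rm_so; offsets are relative to `text`. */
    len = (size_t)(pmatch[group].rm_eo - pmatch[group].rm_so);
    if (len >= outsz)
        return -1;                       /* output buffer too small */

    memcpy(out, text + pmatch[group].rm_so, len);
    out[len] = '\0';
    return (int)len;
}
```

<p>
Element 0 of <tt>pmatch</tt> describes the whole match, and elements 1 and up
describe the parenthesized subexpressions in order.
</p>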



<h3>reguexec()</h3>

<div class="code">
<code>
#include &lt;tre/regex.h&gt;
<br>
<br>
<font class="qual">typedef struct</font> {
<br>
<font class="type">int</font> (*get_next_char)(<font
class="type">tre_char_t</font> *<font class="arg">c</font>, <font
class="type">unsigned int</font> *<font class="arg">pos_add</font>,
<font class="type">void</font> *<font class="arg">context</font>);
<br>
<font class="type">void</font> (*rewind)(<font
class="type">size_t</font> <font class="arg">pos</font>, <font
class="type">void</font> *<font class="arg">context</font>);
<br>
<font class="type">int</font> (*compare)(<font
class="type">size_t</font> <font class="arg">pos1</font>, <font
class="type">size_t</font> <font class="arg">pos2</font>, <font
class="type">size_t</font> <font class="arg">len</font>, <font
class="type">void</font> *<font class="arg">context</font>);
<br>
<font class="type">void</font> *<font
class="arg">context</font>;
<br>
} <font class="type">tre_str_source</font>;
<br>
<br>
<font class="type">int</font> <font
class="func">reguexec</font>(<font class="qual">const</font>
<font class="type">regex_t</font> *<font
class="arg">preg</font>, <font class="qual">const</font> <font
class="type">tre_str_source</font> *<font class="arg">string</font>,
<font class="type">size_t</font> <font class="arg">nmatch</font>,
<br>
<font class="type">regmatch_t</font> <font
class="arg">pmatch</font>[], <font class="type">int</font>
<font class="arg">eflags</font>);
</code>
</div>

<p>
The <tt><font class="func">reguexec</font>()</tt> function works just
like the other <tt>regexec()</tt> functions, except that the input
string is read from user-specified callback functions instead of a
character array. This makes it possible, for example, to match
regexps over arbitrary user-specified data structures.
</p>

<p>
The <tt><font class="type">tre_str_source</font></tt> structure
contains the following fields:
</p>
<blockquote>
<dl>
<dt><tt>get_next_char</tt></dt>
<dd>This function must retrieve the next available character. If a
character is not available, the space pointed to by
<tt><font class="arg">c</font></tt> must be set to zero and the function
must return a nonzero value. If a character is available, it must be stored
in the space pointed to by
<tt><font class="arg">c</font></tt>, the integer pointed to by
<tt><font class="arg">pos_add</font></tt> must be set to the
number of units advanced in the input (the value must be
<tt>&gt;=1</tt>), and zero must be returned.</dd>

<dt><tt>rewind</tt></dt>
<dd>This function must rewind the input stream to the position
specified by <tt><font class="arg">pos</font></tt>. Unless the regexp
uses back references, <tt>rewind</tt> is not needed and can be set to
<tt>NULL</tt>.</dd>

<dt><tt>compare</tt></dt>
<dd>This function compares two substrings in the input stream
starting at the positions specified by <tt><font
class="arg">pos1</font></tt> and <tt><font
class="arg">pos2</font></tt> of length <tt><font
class="arg">len</font></tt>. If the substrings are equal,
<tt>compare</tt> must return zero, otherwise a nonzero value must be
returned. Unless the regexp uses back references, <tt>compare</tt> is
not needed and can be set to <tt>NULL</tt>.</dd>

<dt><tt>context</tt></dt>
<dd>This is a context variable, passed as the last argument to
all of the above functions for keeping track of the internal state of
the user's code.</dd>

</dl>
</blockquote>

<p>
The position in the input stream is measured in <tt><font
class="type">size_t</font></tt> units. The current position is the
sum of the increments returned through <tt><font
class="arg">pos_add</font></tt> (plus the position of the last
<tt>rewind</tt>, if any). The starting position is zero. Submatch
positions filled in the <tt><font class="arg">pmatch</font>[]</tt>
array are given using positions computed in this way.
</p>
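
<p>
The callback contract described above can be sketched over a plain memory
buffer. This is only an illustration that compiles without TRE:
<tt>demo_char_t</tt> is a stand-in for <tt>tre_char_t</tt> (its real
definition is not given on this page), and <tt>buf_ctx</tt> and the
<tt>buf_*</tt> function names are hypothetical.
</p>

```c
#include <stddef.h>
#include <string.h>

/* Assumption: stand-in for tre_char_t so the sketch builds without TRE. */
typedef unsigned char demo_char_t;

/* Hypothetical context: a byte buffer plus the current read position. */
typedef struct {
    const char *data;
    size_t len;
    size_t pos;
} buf_ctx;

/* get_next_char: store the next character and advance by one unit, or
 * store zero and return nonzero when the input is exhausted. */
static int buf_get_next_char(demo_char_t *c, unsigned int *pos_add,
                             void *context)
{
    buf_ctx *ctx = context;

    if (ctx->pos >= ctx->len) {
        *c = 0;
        return 1;
    }
    *c = (demo_char_t)ctx->data[ctx->pos++];
    *pos_add = 1;           /* one unit consumed per character */
    return 0;
}

/* rewind: move the read position back (or forward) to `pos`. */
static void buf_rewind(size_t pos, void *context)
{
    buf_ctx *ctx = context;
    ctx->pos = pos;
}

/* compare: zero if the two substrings of length `len` are equal. */
static int buf_compare(size_t pos1, size_t pos2, size_t len, void *context)
{
    buf_ctx *ctx = context;

    if (pos1 + len > ctx->len || pos2 + len > ctx->len)
        return 1;           /* out of range counts as "not equal" */
    return memcmp(ctx->data + pos1, ctx->data + pos2, len) != 0;
}
```

<p>
With TRE itself, functions of this shape (using <tt>tre_char_t</tt>) would be
assigned to the <tt>get_next_char</tt>, <tt>rewind</tt>, and <tt>compare</tt>
fields of a <tt>tre_str_source</tt>, with a pointer to the context structure
in <tt>context</tt>.
</p>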

<p>
For an example of how to use <tt>reguexec()</tt>, see the
<tt>tests/test-str-source.c</tt> file in the TRE source code
distribution.
</p>

<h2>The approximate matching functions</h2>
<a name="regaexec"></a>

<div class="code">
<code>
#include &lt;tre/regex.h&gt;
<br>
<br>
<font class="qual">typedef struct</font> {<br>
<font class="type">int</font>
<font class="arg">cost_ins</font>;<br>
<font class="type">int</font>
<font class="arg">cost_del</font>;<br>
<font class="type">int</font>
<font class="arg">cost_subst</font>;<br>
<font class="type">int</font>
<font class="arg">max_cost</font>;<br><br>
<font class="type">int</font>
<font class="arg">max_ins</font>;<br>
<font class="type">int</font>
<font class="arg">max_del</font>;<br>
<font class="type">int</font>
<font class="arg">max_subst</font>;<br>
<font class="type">int</font>
<font class="arg">max_err</font>;<br>
} <font class="type">regaparams_t</font>;<br>
<br>
<font class="qual">typedef struct</font> {<br>
<font class="type">size_t</font>
<font class="arg">nmatch</font>;<br>
<font class="type">regmatch_t</font>
*<font class="arg">pmatch</font>;<br>
<font class="type">int</font>
<font class="arg">cost</font>;<br>
<font class="type">int</font>
<font class="arg">num_ins</font>;<br>
<font class="type">int</font>
<font class="arg">num_del</font>;<br>
<font class="type">int</font>
<font class="arg">num_subst</font>;<br>
} <font class="type">regamatch_t</font>;<br>
<br>
<font class="type">int</font> <font
class="func">regaexec</font>(<font class="qual">const</font>
<font class="type">regex_t</font> *<font
class="arg">preg</font>, <font class="qual">const</font> <font
class="type">char</font> *<font class="arg">string</font>,<br>

<font class="type">regamatch_t</font>
*<font class="arg">match</font>,
<font class="type">regaparams_t</font>
<font class="arg">params</font>,
<font class="type">int</font>
<font class="arg">eflags</font>);
<br>
<font class="type">int</font> <font
class="func">reganexec</font>(<font class="qual">const</font>
<font class="type">regex_t</font> *<font
class="arg">preg</font>, <font class="qual">const</font> <font
class="type">char</font> *<font class="arg">string</font>,
<font class="type">size_t</font> <font class="arg">len</font>,<br>

<font class="type">regamatch_t</font>
*<font class="arg">match</font>,
<font class="type">regaparams_t</font>
<font class="arg">params</font>,
<font class="type">int</font> <font class="arg">eflags</font>);
<br>
<font class="type">int</font> <font
class="func">regawexec</font>(<font class="qual">const</font>
<font class="type">regex_t</font> *<font
class="arg">preg</font>, <font class="qual">const</font> <font
class="type">wchar_t</font> *<font class="arg">string</font>,<br>

<font class="type">regamatch_t</font>
*<font class="arg">match</font>,
<font class="type">regaparams_t</font>
<font class="arg">params</font>,
<font class="type">int</font>
<font class="arg">eflags</font>);
<br>
<font class="type">int</font>
<font class="func">regawnexec</font>(
<font class="qual">const</font>
<font class="type">regex_t</font>
*<font class="arg">preg</font>,
<font class="qual">const</font>
<font class="type">wchar_t</font>
*<font class="arg">string</font>,
<font class="type">size_t</font>
<font class="arg">len</font>,<br>

<font class="type">regamatch_t</font>
*<font class="arg">match</font>,
<font class="type">regaparams_t</font>
<font class="arg">params</font>,
<font class="type">int</font>
<font class="arg">eflags</font>);
<br>
</code>
</div>

<p>
The <tt><font class="func">regaexec</font>()</tt> function searches for
the best match in <tt><font class="arg">string</font></tt>
against the compiled regexp <tt><font
class="arg">preg</font></tt>, initialized by a previous call to
any one of the <a href="#regcomp"><tt>regcomp</tt></a> functions.
</p>

<p>
The <tt><font class="func">reganexec</font>()</tt> function is like
<tt><font class="func">regaexec</font>()</tt>, but <tt><font
class="arg">string</font></tt> is not terminated by a null byte.
Instead, the <tt><font class="arg">len</font></tt> argument is used to
give the length of the string, and the string may contain null
bytes. The <tt><font class="func">regawexec</font>()</tt> and
<tt><font class="func">regawnexec</font>()</tt> functions work like
<tt><font class="func">regaexec</font>()</tt> and <tt><font
class="func">reganexec</font>()</tt>, respectively, but take a wide
character (<tt><font class="type">wchar_t</font></tt>) string instead
of a byte string.
</p>

<p>
The <tt><font class="arg">eflags</font></tt> argument has the same
meaning as for the <tt>regexec()</tt> functions.
</p>

<p>
The <tt><font class="arg">params</font></tt> struct controls the
approximate matching parameters:
</p>
<blockquote>
<dl>
<dt><tt><font class="type">int</font></tt>
<tt><font class="arg">cost_ins</font></tt></dt>
<dd>The default cost of an inserted character, that is, an extra
character in <tt><font class="arg">string</font></tt>.</dd>

<dt><tt><font class="type">int</font></tt>
<tt><font class="arg">cost_del</font></tt></dt>
<dd>The default cost of a deleted character, that is, a character
missing from <tt><font class="arg">string</font></tt>.</dd>

<dt><tt><font class="type">int</font></tt>
<tt><font class="arg">cost_subst</font></tt></dt>
<dd>The default cost of a substituted character.</dd>

<dt><tt><font class="type">int</font></tt>
<tt><font class="arg">max_cost</font></tt></dt>
<dd>The maximum allowed cost of a match. If this is set to zero,
an exact matching is searched for, and results equivalent to
those returned by the <tt>regexec()</tt> functions are
returned.</dd>

<dt><tt><font class="type">int</font></tt>
<tt><font class="arg">max_ins</font></tt></dt>
<dd>Maximum allowed number of inserted characters.</dd>

<dt><tt><font class="type">int</font></tt>
<tt><font class="arg">max_del</font></tt></dt>
<dd>Maximum allowed number of deleted characters.</dd>

<dt><tt><font class="type">int</font></tt>
<tt><font class="arg">max_subst</font></tt></dt>
<dd>Maximum allowed number of substituted characters.</dd>

<dt><tt><font class="type">int</font></tt>
<tt><font class="arg">max_err</font></tt></dt>
<dd>Maximum allowed number of errors (inserts + deletes +
substitutes).</dd>
</dl>
</blockquote>

<p>
The <tt><font class="arg">match</font></tt> argument points to a
<tt><font class="type">regamatch_t</font></tt> structure. The
<tt><font class="arg">nmatch</font></tt> and <tt><font
class="arg">pmatch</font></tt> fields must be filled by the caller. If
<code>REG_NOSUB</code> was used when compiling the regexp, or
<code>match-&gt;nmatch</code> is zero, or
<code>match-&gt;pmatch</code> is <code>NULL</code>, the
<code>match-&gt;pmatch</code> field is ignored. Otherwise, the
submatches corresponding to the parenthesized subexpressions are
stored in the elements of <code>match-&gt;pmatch</code>, which must
have at least <code>match-&gt;nmatch</code> elements.
The <code>match-&gt;cost</code> field is set to the cost of the match
found, and the <code>match-&gt;num_ins</code>,
<code>match-&gt;num_del</code>, and <code>match-&gt;num_subst</code>
fields are set to the number of inserts, deletes, and substitutes in
the match, respectively.
</p>
666 1.1 agc
667 1.1 agc <p>
668 1.1 agc The <tt>regaexec()</tt> functions return zero if a match whose cost
669 1.1 agc does not exceed <code>params->max_cost</code> was found. Otherwise
670 1.1 agc they return <code>REG_NOMATCH</code> to indicate no match, or
671 1.1 agc <code>REG_ESPACE</code> to indicate that enough temporary memory could
672 1.1 agc not be allocated to complete the matching operation.
673 1.1 agc </p>
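<p>
As a rough illustration of how the cost and count limits described
above relate to one another, the following sketch checks a set of edit
counts against a set of limits. The <tt>approx_limits</tt> and
<tt>edit_counts</tt> structures and the <tt>total_cost()</tt> and
<tt>within_limits()</tt> functions are hypothetical names that only
mirror the documented fields; they are not part of the TRE API. The
sketch assumes that the cost of a match is the sum of the default
per-edit costs. TRE computes the actual cost internally.
</p>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mirror of the cost and limit fields described above. */
struct approx_limits {
    int cost_ins, cost_del, cost_subst;
    int max_cost, max_ins, max_del, max_subst, max_err;
};

/* Hypothetical edit counts, named after num_ins, num_del and num_subst. */
struct edit_counts {
    int num_ins, num_del, num_subst;
};

/* Total cost, assuming every edit is charged its default cost. */
int total_cost(const struct approx_limits *lim,
               const struct edit_counts *ec)
{
    assert(lim != NULL && ec != NULL);
    return ec->num_ins * lim->cost_ins
         + ec->num_del * lim->cost_del
         + ec->num_subst * lim->cost_subst;
}

/* Returns 1 if the edit counts satisfy every limit, 0 otherwise. */
int within_limits(const struct approx_limits *lim,
                  const struct edit_counts *ec)
{
    /* max_err bounds inserts + deletes + substitutes taken together. */
    int num_err = ec->num_ins + ec->num_del + ec->num_subst;

    return total_cost(lim, ec) <= lim->max_cost
        && ec->num_ins <= lim->max_ins
        && ec->num_del <= lim->max_del
        && ec->num_subst <= lim->max_subst
        && num_err <= lim->max_err;
}
```

<p>
With <tt>max_cost</tt> set to zero, only a set of counts with no edits
at all passes the check, which corresponds to the exact matching case.
</p>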
674 1.1 agc
675 1.1 agc <h2>Miscellaneous</h2>
676 1.1 agc
677 1.1 agc <div class="code">
678 1.1 agc <code>
679 1.1 agc #include &lt;tre/regex.h&gt;
680 1.1 agc <br>
681 1.1 agc <br>
682 1.1 agc <font class="type">int</font> <font
683 1.1 agc class="func">tre_have_backrefs</font>(<font class="qual">const</font>
684 1.1 agc <font class="type">regex_t</font> *<font class="arg">preg</font>);
685 1.1 agc <br>
686 1.1 agc <font class="type">int</font> <font
687 1.1 agc class="func">tre_have_approx</font>(<font class="qual">const</font>
688 1.1 agc <font class="type">regex_t</font> *<font class="arg">preg</font>);
689 1.1 agc <br>
690 1.1 agc </code>
691 1.1 agc </div>
692 1.1 agc
693 1.1 agc <p>
694 1.1 agc The <tt><font class="func">tre_have_backrefs</font>()</tt> and
695 1.1 agc <tt><font class="func">tre_have_approx</font>()</tt> functions return
696 1.1 agc 1 if the compiled pattern has back references or uses approximate
697 1.1 agc matching, respectively, and 0 if not.
698 1.1 agc </p>
699 1.1 agc
700 1.1 agc
701 1.1 agc <h2>Checking build time options</h2>
702 1.1 agc
703 1.1 agc <a name="tre_config"></a>
704 1.1 agc <div class="code">
705 1.1 agc <code>
706 1.1 agc #include &lt;tre/regex.h&gt;
707 1.1 agc <br>
708 1.1 agc <br>
709 1.1 agc <font class="type">char</font> *<font
710 1.1 agc class="func">tre_version</font>(<font class="type">void</font>);
711 1.1 agc <br>
712 1.1 agc <font class="type">int</font> <font
713 1.1 agc class="func">tre_config</font>(<font class="type">int</font> <font
714 1.1 agc class="arg">query</font>, <font class="type">void</font> *<font
715 1.1 agc class="arg">result</font>);
716 1.1 agc <br>
717 1.1 agc </code>
718 1.1 agc </div>
719 1.1 agc
720 1.1 agc <p>
721 1.1 agc The <tt><font class="func">tre_config</font>()</tt> function can be
722 1.1 agc used to find out which optional features have been compiled into
723 1.1 agc the TRE library, and to retrieve other parameters that may change
724 1.1 agc between releases.
725 1.1 agc </p>
726 1.1 agc
727 1.1 agc <p>
728 1.1 agc The <tt><font class="arg">query</font></tt> argument is an integer
729 1.1 agc that selects which information is requested. The <tt><font
730 1.1 agc class="arg">result</font></tt> argument is a pointer to a variable
731 1.1 agc in which the information is returned. The return value of
732 1.1 agc <tt><font class="func">tre_config</font>()</tt> is zero if <tt><font
733 1.1 agc class="arg">query</font></tt> was recognized, and <code>REG_NOMATCH</code> otherwise.
734 1.1 agc </p>
735 1.1 agc
736 1.1 agc <p>
737 1.1 agc The following values are recognized for <tt><font
738 1.1 agc class="arg">query</font></tt>:</p>
739 1.1 agc
740 1.1 agc <blockquote>
741 1.1 agc <dl>
742 1.1 agc <dt><tt>TRE_CONFIG_APPROX</tt></dt>
743 1.1 agc <dd>The result is an integer that is set to one if approximate
744 1.1 agc matching support is available, zero if not.</dd>
745 1.1 agc <dt><tt>TRE_CONFIG_WCHAR</tt></dt>
746 1.1 agc <dd>The result is an integer that is set to one if wide character
747 1.1 agc support is available, zero if not.</dd>
748 1.1 agc <dt><tt>TRE_CONFIG_MULTIBYTE</tt></dt>
749 1.1 agc <dd>The result is an integer that is set to one if multibyte character
750 1.1 agc set support is available, zero if not.</dd>
751 1.1 agc <dt><tt>TRE_CONFIG_SYSTEM_ABI</tt></dt>
752 1.1 agc <dd>The result is an integer that is set to one if TRE has been
753 1.1 agc compiled to be compatible with the system regex ABI, zero if not.</dd>
754 1.1 agc <dt><tt>TRE_CONFIG_VERSION</tt></dt>
755 1.1 agc <dd>The result is a pointer to a static character string that gives
756 1.1 agc the version of the TRE library.</dd>
757 1.1 agc </dl>
758 1.1 agc </blockquote>
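<p>
The query and result convention described above can be wrapped in a
small helper for the integer-valued queries. The sketch below is only
an illustration: <tt>config_fn</tt>, <tt>feature_enabled()</tt>,
<tt>fake_config()</tt>, and <tt>FAKE_QUERY_APPROX</tt> are hypothetical
names, and the stand-in function exists solely so that the sketch can
be compiled without linking against TRE.
</p>

```c
#include <assert.h>
#include <stddef.h>

/* Function pointer type matching the documented tre_config() prototype. */
typedef int (*config_fn)(int query, void *result);

/*
 * Asks `config` about an integer-valued query.
 * Returns 1 if the feature is available, 0 if it is not, and -1 if
 * `config` did not recognize the query (nonzero return value).
 */
int feature_enabled(config_fn config, int query)
{
    int flag = 0;

    assert(config != NULL);
    if (config(query, &flag) != 0)
        return -1;
    return flag != 0;
}

/* Hypothetical query value understood only by the stand-in below. */
enum { FAKE_QUERY_APPROX = 0 };

/* Stand-in for tre_config(): knows one query and rejects all others. */
int fake_config(int query, void *result)
{
    if (query != FAKE_QUERY_APPROX)
        return 1; /* the real function is documented to return REG_NOMATCH */
    *(int *)result = 1;
    return 0;
}
```

<p>
With TRE linked in, the call would be
<code>feature_enabled(tre_config, TRE_CONFIG_APPROX)</code>. For
<tt>TRE_CONFIG_VERSION</tt> the result is a character pointer rather
than an integer, so this helper does not apply to that query.
</p>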
759 1.1 agc
760 1.1 agc
761 1.1 agc <p>
762 1.1 agc The <tt><font class="func">tre_version</font>()</tt> function returns
763 1.1 agc a short human-readable character string that gives the software name,
764 1.1 agc version, and license.</p>
765 1.1 agc
766 1.1 agc <h2>Preprocessor definitions</h2>
767 1.1 agc
768 1.1 agc <p>The header <tt>&lt;tre/regex.h&gt;</tt> defines certain
769 1.1 agc C preprocessor symbols.</p>
770 1.1 agc
771 1.1 agc <h3>Version information</h3>
772 1.1 agc
773 1.1 agc <p>The following definitions may be useful for checking that a
774 1.1 agc sufficiently recent version is in use. In Autoconf scripts, the
775 1.1 agc <tt>pkg-config</tt> tool is the recommended way to perform version
776 1.1 agc and other checks.</p>
777 1.1 agc
778 1.1 agc <blockquote>
779 1.1 agc <dl>
780 1.1 agc <dt><tt>TRE_VERSION</tt></dt>
781 1.1 agc <dd>The version string.</dd>
782 1.1 agc
783 1.1 agc <dt><tt>TRE_VERSION_1</tt></dt>
784 1.1 agc <dd>The major version number (first part of version string).</dd>
785 1.1 agc
786 1.1 agc <dt><tt>TRE_VERSION_2</tt></dt>
787 1.1 agc <dd>The minor version number (second part of version string).</dd>
788 1.1 agc
789 1.1 agc <dt><tt>TRE_VERSION_3</tt></dt>
790 1.1 agc <dd>The micro version number (third part of version string).</dd>
791 1.1 agc
792 1.1 agc </dl>
793 1.1 agc </blockquote>
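<p>
A compile-time version check built on these symbols might look like
the sketch below. <tt>VERSION_AT_LEAST</tt> and
<tt>version_at_least()</tt> are hypothetical helpers, not something
defined by TRE, and the sketch assumes that the three numeric symbols
expand to plain integer constants.
</p>

```c
#include <assert.h>

/*
 * Hypothetical helper: nonzero if version (h1, h2, h3) is at least
 * (w1, w2, w3). Only integer comparisons are used, so the macro can
 * also appear inside a preprocessor conditional.
 */
#define VERSION_AT_LEAST(h1, h2, h3, w1, w2, w3) \
    ((h1) > (w1) || \
     ((h1) == (w1) && ((h2) > (w2) || \
                       ((h2) == (w2) && (h3) >= (w3)))))

/* Self-check showing the macro inside #if, with made-up numbers. */
#if !VERSION_AT_LEAST(1, 2, 3, 1, 2, 0)
#error "unexpected: 1.2.3 should be at least 1.2.0"
#endif

/* Function form of the same comparison, for use at run time. */
int version_at_least(int h1, int h2, int h3, int w1, int w2, int w3)
{
    return VERSION_AT_LEAST(h1, h2, h3, w1, w2, w3);
}
```

<p>
After including <tt>&lt;tre/regex.h&gt;</tt>, a guard such as
<code>#if !VERSION_AT_LEAST(TRE_VERSION_1, TRE_VERSION_2, TRE_VERSION_3, 0, 7, 0)</code>
followed by <code>#error</code> would reject older headers. The version
numbers in this guard are arbitrary examples.
</p>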
794 1.1 agc
795 1.1 agc <h3>Features</h3>
796 1.1 agc
797 1.1 agc <p>The following definitions may be useful for checking whether all
798 1.1 agc necessary features are enabled. Use these only if compile-time
799 1.1 agc checking suffices, that is, when linking statically with TRE. When linking
800 1.1 agc dynamically, <a href="#tre_config"><tt>tre_config()</tt></a> should be used
801 1.1 agc instead.</p>
802 1.1 agc
803 1.1 agc <blockquote>
804 1.1 agc <dl>
805 1.1 agc <dt><tt>TRE_APPROX</tt></dt>
806 1.1 agc <dd>This is defined if approximate matching support is enabled. The
807 1.1 agc prototypes for approximate matching functions are defined only if
808 1.1 agc <tt>TRE_APPROX</tt> is defined.</dd>
809 1.1 agc
810 1.1 agc <dt><tt>TRE_WCHAR</tt></dt>
811 1.1 agc <dd>This is defined if wide character support is enabled. The
812 1.1 agc prototypes for wide character matching functions are defined only if
813 1.1 agc <tt>TRE_WCHAR</tt> is defined.</dd>
814 1.1 agc
815 1.1 agc <dt><tt>TRE_MULTIBYTE</tt></dt>
816 1.1 agc <dd>This is defined if multibyte character set support is enabled.
817 1.1 agc If it is not defined, any locale settings are ignored, and the default
818 1.1 agc locale is used when parsing regexps and matching strings.</dd>
819 1.1 agc
820 1.1 agc </dl>
821 1.1 agc </blockquote>
822