NAME
regcomp, regexec, regerror, regfree - regular-expression library

SYNOPSIS
#include <regex.h>

int regcomp(regex_t *preg, const char *pattern, int cflags);

int regexec(const regex_t *preg, const char *string, size_t nmatch,
            regmatch_t pmatch[], int eflags);

size_t regerror(int errcode, const regex_t *preg, char *errbuf,
                size_t errbuf_size);

void regfree(regex_t *preg);
DESCRIPTION
These routines implement POSIX 1003.2 regular expressions (``RE''s);
see regex(7). Regcomp compiles an RE written as a string into an
internal form, regexec matches that internal form against a string and
reports results, regerror transforms error codes from either into
human-readable messages, and regfree frees any dynamically-allocated
storage used by the internal form of an RE.
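
The compile, match, free cycle described above can be sketched as follows. This is a minimal illustration, not part of the library: the helper name re_matches and its return convention are assumptions made for the example.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper (not part of the library): returns 1 if text
 * matches the extended RE in pattern, 0 if it does not, and -1 if the
 * pattern fails to compile or regexec reports some other error. */
int re_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    /* REG_NOSUB: only success or failure is wanted here. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);               /* release the compiled form */

    if (rc == 0)
        return 1;
    return rc == REG_NOMATCH ? 0 : -1;
}
```

A caller might write re_matches("^ab+c$", line) and treat a negative result as a bad pattern. Compiling on every call keeps the sketch short; real code would normally compile once and reuse the regex_t.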
The header <regex.h> declares two structure types, regex_t and
regmatch_t, the former for compiled internal forms and the latter for
match reporting. It also declares the four functions, a type regoff_t,
and a number of constants with names starting with ``REG_''.

Regcomp compiles the regular expression contained in the pattern
string, subject to the flags in cflags, and places the results in the
regex_t structure pointed to by preg. Cflags is the bitwise OR of zero
or more of the following flags:
REG_EXTENDED  Compile modern (``extended'') REs, rather than the
              obsolete (``basic'') REs that are the default.

REG_BASIC     This is a synonym for 0, provided as a counterpart to
              REG_EXTENDED to improve readability. This is an
              extension, compatible with but not specified by POSIX
              1003.2, and should be used with caution in software
              intended to be portable to other systems.

REG_NOSPEC    Compile with recognition of all special characters turned
              off. All characters are thus considered ordinary, so the
              ``RE'' is a literal string. This is an extension,
              compatible with but not specified by POSIX 1003.2, and
              should be used with caution in software intended to be
              portable to other systems. REG_EXTENDED and REG_NOSPEC
              may not be used in the same call to regcomp.

REG_ICASE     Compile for matching that ignores upper/lower case
              distinctions. See regex(7).

REG_NOSUB     Compile for matching that need only report success or
              failure, not what was matched.

REG_NEWLINE   Compile for newline-sensitive matching. By default,
              newline is a completely ordinary character with no
              special meaning in either REs or strings. With this flag,
              `[^' bracket expressions and `.' never match newline, a
              `^' anchor matches the null string after any newline in
              the string in addition to its normal function, and the
              `$' anchor matches the null string before any newline in
              the string in addition to its normal function.

REG_PEND      The regular expression ends, not at the first NUL, but
              just before the character pointed to by the re_endp
              member of the structure pointed to by preg. The re_endp
              member is of type const char *. This flag permits
              inclusion of NULs in the RE; they are considered ordinary
              characters. This is an extension, compatible with but
              not specified by POSIX 1003.2, and should be used with
              caution in software intended to be portable to other
              systems.
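
As a hedged illustration of how the portable flags change matching, the sketch below compiles the same pattern with and without REG_ICASE or REG_NEWLINE. The helper first_match_offset is hypothetical, and only flags specified by POSIX 1003.2 are used so that the example does not depend on the extensions listed above.

```c
#include <regex.h>

/* Hypothetical helper: compile pattern as an extended RE with the given
 * extra cflags and return the offset at which the first match in text
 * starts, -1 if there is no match, or -2 on any error. Do not pass
 * REG_NOSUB here, since the match offsets are needed. */
long first_match_offset(const char *pattern, const char *text,
                        int extra_cflags)
{
    regex_t re;
    regmatch_t m[1];
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED | extra_cflags) != 0)
        return -2;
    rc = regexec(&re, text, 1, m, 0);
    regfree(&re);
    if (rc == REG_NOMATCH)
        return -1;
    if (rc != 0)
        return -2;
    return (long)m[0].rm_so;
}
```

With this helper, the pattern `^two' finds nothing in the two-line string "one\ntwo" unless REG_NEWLINE is passed, and `a.c' stops matching "a\nc" once REG_NEWLINE is passed.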
When successful, regcomp returns 0 and fills in the structure pointed
to by preg. One member of that structure (other than re_endp) is
publicized: re_nsub, of type size_t, contains the number of
parenthesized subexpressions within the RE (except that the value of
this member is undefined if the REG_NOSUB flag was used). If regcomp
fails, it returns a non-zero error code; see DIAGNOSTICS.
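
A short sketch of reading re_nsub after a successful compile. The helper name count_subexpressions is an assumption of this example; REG_NOSUB is deliberately not used, because re_nsub would then be undefined.

```c
#include <regex.h>

/* Hypothetical helper: returns the number of parenthesized
 * subexpressions in an extended RE, or -1 if it does not compile. */
long count_subexpressions(const char *pattern)
{
    regex_t re;
    long n;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    n = (long)re.re_nsub;       /* defined: REG_NOSUB was not given */
    regfree(&re);
    return n;
}
```

A caller can use re_nsub + 1 as the number of regmatch_t entries needed to receive every subexpression plus the overall match.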
Regexec matches the compiled RE pointed to by preg against the string,
subject to the flags in eflags, and reports results using nmatch,
pmatch, and the returned value. The RE must have been compiled by a
previous invocation of regcomp. The compiled form is not altered
during execution of regexec, so a single compiled RE can be used
simultaneously by multiple threads.

By default, the NUL-terminated string pointed to by string is
considered to be the text of an entire line, with the NUL indicating
the end of the line. (That is, any other end-of-line marker is
considered to have been removed and replaced by the NUL.) The eflags
argument is the bitwise OR of zero or more of the following flags:
REG_NOTBOL    The first character of the string is not the beginning of
              a line, so the `^' anchor should not match before it.
              This does not affect the behavior of newlines under
              REG_NEWLINE.

REG_NOTEOL    The NUL terminating the string does not end a line, so
              the `$' anchor should not match before it. This does not
              affect the behavior of newlines under REG_NEWLINE.

REG_STARTEND  The string is considered to start at string +
              pmatch[0].rm_so and to have a terminating NUL located at
              string + pmatch[0].rm_eo (there need not actually be a
              NUL at that location), regardless of the value of nmatch.
              See below for the definition of pmatch and nmatch. This
              is an extension, compatible with but not specified by
              POSIX 1003.2, and should be used with caution in software
              intended to be portable to other systems. Note that a
              non-zero rm_so does not imply REG_NOTBOL; REG_STARTEND
              affects only the location of the string, not how it is
              matched.
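
One common use of REG_NOTBOL is resuming a search partway through a line. The sketch below counts successive matches by advancing a pointer and setting REG_NOTBOL on every call after the first. The helper count_matches and its handling of empty matches are assumptions of this example, not behavior prescribed by the library.

```c
#include <regex.h>

/* Hypothetical helper: counts successive non-overlapping matches of an
 * extended RE in text. Returns -1 if the pattern does not compile. */
long count_matches(const char *pattern, const char *text)
{
    regex_t re;
    regmatch_t m[1];
    const char *p = text;
    long count = 0;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    for (;;) {
        /* After the first call, p is no longer the start of the line. */
        int eflags = (p == text) ? 0 : REG_NOTBOL;

        if (regexec(&re, p, 1, m, eflags) != 0)
            break;
        count++;
        p += m[0].rm_eo;                /* resume after this match */
        if (m[0].rm_so == m[0].rm_eo) { /* empty match */
            if (*p == '\0')
                break;
            p++;                        /* step forward to make progress */
        }
    }
    regfree(&re);
    return count;
}
```

Without REG_NOTBOL, a pattern such as `^a' would match again at every resumed position in "aaa"; with it, only the true start of the line counts.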
See regex(7) for a discussion of what is matched in situations where an
RE or a portion thereof could match any of several substrings of the
string.

Normally, regexec returns 0 for success and the non-zero code
REG_NOMATCH for failure. Other non-zero error codes may be returned in
exceptional situations; see DIAGNOSTICS.
If REG_NOSUB was specified in the compilation of the RE, or if nmatch
is 0, regexec ignores the pmatch argument (but see below for the case
where REG_STARTEND is specified). Otherwise, pmatch points to an array
of nmatch structures of type regmatch_t. Such a structure has at least
the members rm_so and rm_eo, both of type regoff_t (a signed arithmetic
type at least as large as an off_t and a ssize_t), containing
respectively the offset of the first character of a substring and the
offset of the first character after the end of the substring. Offsets
are measured from the beginning of the string argument given to
regexec. An empty substring is denoted by equal offsets, both
indicating the character following the empty substring.
The 0th member of the pmatch array is filled in to indicate what
substring of string was matched by the entire RE. Remaining members
report what substring was matched by parenthesized subexpressions
within the RE; member i reports subexpression i, with subexpressions
counted (starting at 1) by the order of their opening parentheses in
the RE, left to right. Unused entries in the array--corresponding
either to subexpressions that did not participate in the match at all,
or to subexpressions that do not exist in the RE (that is, i >
preg->re_nsub)--have both rm_so and rm_eo set to -1. If a
subexpression participated in the match several times, the reported
substring is the last one it matched. (Note, as an example in
particular, that when the RE `(b*)+' matches `bbb', the parenthesized
subexpression matches the three `b's and then an infinite number of
empty strings following the last `b', so the reported substring is one
of the empties.)
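
The pmatch conventions above can be sketched as follows. The helper group_span and the MAX_GROUPS limit are assumptions made for the example; the -1 offsets for unused entries are the behavior described in the text.

```c
#include <regex.h>

#define MAX_GROUPS 10   /* illustrative limit, an assumption of this sketch */

/* Hypothetical helper: match text against an extended RE and store the
 * offsets of subexpression number group (0 = the whole match) in *so
 * and *eo. Returns 0 on a match, 1 on no match, -1 on any error.
 * A group that took no part in the match yields -1 for both offsets. */
int group_span(const char *pattern, const char *text, int group,
               long *so, long *eo)
{
    regex_t re;
    regmatch_t m[MAX_GROUPS];
    int rc;

    if (group < 0 || group >= MAX_GROUPS)
        return -1;
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    rc = regexec(&re, text, MAX_GROUPS, m, 0);
    regfree(&re);
    if (rc == REG_NOMATCH)
        return 1;
    if (rc != 0)
        return -1;
    *so = (long)m[group].rm_so;
    *eo = (long)m[group].rm_eo;
    return 0;
}
```

For the RE `([a-z]+)=([0-9]+)' and the string "x: key=42", group 0 spans offsets 3 to 9 and group 2 spans 7 to 9. For `(a)|(b)' matched against "b", group 1 is unused and reports -1 for both offsets.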
If REG_STARTEND is specified, pmatch must point to at least one
regmatch_t (even if nmatch is 0 or REG_NOSUB was specified), to hold
the input offsets for REG_STARTEND. Use for output is still entirely
controlled by nmatch; if nmatch is 0 or REG_NOSUB was specified, the
value of pmatch[0] will not be changed by a successful regexec.
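
A sketch of searching a sub-range of a buffer with REG_STARTEND. Because the flag is an extension, the code is guarded with #ifdef and simply reports an error where <regex.h> does not define it. The helper search_range is hypothetical.

```c
#include <regex.h>

/* Hypothetical helper: search only text[start..end) and return the
 * offset of the match start measured from text, -1 on no match, or -2
 * on error or when this <regex.h> does not provide REG_STARTEND. */
long search_range(const char *pattern, const char *text,
                  long start, long end)
{
#ifdef REG_STARTEND
    regex_t re;
    regmatch_t m[1];
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -2;
    m[0].rm_so = (regoff_t)start;   /* input: where the string starts */
    m[0].rm_eo = (regoff_t)end;     /* input: where the virtual NUL is */
    rc = regexec(&re, text, 1, m, REG_STARTEND);
    regfree(&re);
    if (rc == REG_NOMATCH)
        return -1;
    if (rc != 0)
        return -2;
    return (long)m[0].rm_so;        /* output: measured from text */
#else
    (void)pattern; (void)text; (void)start; (void)end;
    return -2;
#endif
}
```

The same pmatch[0] entry carries the input range on entry and the match offsets on return, which is why nmatch is 1 here.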
Regerror maps a non-zero errcode from either regcomp or regexec to a
human-readable, printable message. If preg is non-NULL, the error code
should have arisen from use of the regex_t pointed to by preg, and if
the error code came from regcomp, it should have been the result from
the most recent regcomp using that regex_t. (Regerror may be able to
supply a more detailed message using information from the regex_t.)
Regerror places the NUL-terminated message into the buffer pointed to
by errbuf, limiting the length (including the NUL) to at most
errbuf_size bytes. If the whole message won't fit, as much of it as
will fit before the terminating NUL is supplied. In any case, the
returned value is the size of buffer needed to hold the whole message
(including terminating NUL). If errbuf_size is 0, errbuf is ignored
but the return value is still correct.
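
Because the return value is the required buffer size, a message can be fetched in two calls: one to measure, one to fill. The sketch below assumes a hypothetical helper named regex_error_message that returns heap storage the caller must free.

```c
#include <regex.h>
#include <stdlib.h>

/* Hypothetical helper: returns a malloc'd copy of the message for
 * errcode (the caller frees it), or NULL if allocation fails. */
char *regex_error_message(int errcode, const regex_t *preg)
{
    /* First call: errbuf_size is 0, so errbuf is ignored and the
     * return value is the size needed, including the NUL. */
    size_t need = regerror(errcode, preg, NULL, 0);
    char *buf = malloc(need);

    if (buf != NULL)
        regerror(errcode, preg, buf, need);   /* second call: fill */
    return buf;
}
```

Typical use is to call this with the non-zero value returned by regcomp and the regex_t that was passed to it, print the result, and free it.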
If the errcode given to regerror is first ORed with REG_ITOA, the
``message'' that results is the printable name of the error code, e.g.
``REG_NOMATCH'', rather than an explanation thereof. If errcode is
REG_ATOI, then preg shall be non-NULL and the re_endp member of the
structure it points to must point to the printable name of an error
code; in this case, the result in errbuf is the decimal digits of the
numeric value of the error code (0 if the name is not recognized).
REG_ITOA and REG_ATOI are intended primarily as debugging facilities;
they are extensions, compatible with but not specified by POSIX 1003.2,
and should be used with caution in software intended to be portable to
other systems. Be warned also that they are considered experimental
and changes are possible.
Regfree frees any dynamically-allocated storage associated with the
compiled RE pointed to by preg. The remaining regex_t is no longer a
valid compiled RE and the effect of supplying it to regexec or regerror
is undefined.
None of these functions references global variables except for tables
of constants; all are safe for use from multiple threads if the
arguments are safe.
IMPLEMENTATION CHOICES
There are a number of decisions that 1003.2 leaves up to the
implementor, either by explicitly saying ``undefined'' or by virtue of
them being forbidden by the RE grammar. This implementation treats them
as follows.

See regex(7) for a discussion of the definition of case-independent
matching.
There is no particular limit on the length of REs, except insofar as
memory is limited. Memory usage is approximately linear in RE size,
and largely insensitive to RE complexity, except for bounded
repetitions. See BUGS for one short RE using them that will run almost
any system out of memory.

A backslashed character other than one specifically given a magic
meaning by 1003.2 (such magic meanings occur only in obsolete
[``basic''] REs) is taken as an ordinary character.
Any unmatched [ is a REG_EBRACK error.

Equivalence classes cannot begin or end bracket-expression ranges. The
endpoint of one range cannot begin another.

RE_DUP_MAX, the limit on repetition counts in bounded repetitions, is
A repetition operator (?, *, +, or bounds) cannot follow another
repetition operator. A repetition operator cannot begin an expression
or subexpression or follow `^' or `|'.

`|' cannot appear first or last in a (sub)expression or after another
`|', i.e. an operand of `|' cannot be an empty subexpression. An empty
parenthesized subexpression, `()', is legal and matches an empty
(sub)string. An empty string is not a legal RE.

A `{' followed by a digit is considered the beginning of bounds for a
bounded repetition, which must then follow the syntax for bounds. A
`{' not followed by a digit is considered an ordinary character.

`^' and `$' beginning and ending subexpressions in obsolete (``basic'')
REs are anchors, not ordinary characters.
SEE ALSO
POSIX 1003.2, sections 2.8 (Regular Expression Notation) and B.5 (C
Binding for Regular Expression Matching).
DIAGNOSTICS
Non-zero error codes from regcomp and regexec include the following:

REG_NOMATCH   regexec() failed to match
REG_BADPAT    invalid regular expression
REG_ECOLLATE  invalid collating element
REG_ECTYPE    invalid character class
REG_EESCAPE   \ applied to unescapable character
REG_ESUBREG   invalid backreference number
REG_EBRACK    brackets [ ] not balanced
REG_EPAREN    parentheses ( ) not balanced
REG_EBRACE    braces { } not balanced
REG_BADBR     invalid repetition count(s) in { }
REG_ERANGE    invalid character range in [ ]
REG_ESPACE    ran out of memory
REG_BADRPT    ?, *, or + operand invalid
REG_EMPTY     empty (sub)expression
REG_ASSERT    ``can't happen''--you found a bug
REG_INVARG    invalid argument, e.g. negative-length string
HISTORY
Written by Henry Spencer, henry@zoo.toronto.edu.
BUGS
This is an alpha release with known defects. Please report problems.

There is one known functionality bug. The implementation of
internationalization is incomplete: the locale is always assumed to be
the default one of 1003.2, and only the collating elements etc. of that
locale are available.

The back-reference code is subtle and doubts linger about its
correctness in complex cases.

Regexec performance is poor. This will improve with later releases.
Nmatch exceeding 0 is expensive; nmatch exceeding 1 is worse. Regexec
is largely insensitive to RE complexity except that back references are
massively expensive. RE length does matter; in particular, there is a
strong speed bonus for keeping RE length under about 30 characters,
with most special characters counting roughly double.

Regcomp implements bounded repetitions by macro expansion, which is
costly in time and space if counts are large or bounded repetitions are
nested. An RE like, say,
`((((a{1,100}){1,100}){1,100}){1,100}){1,100}' will (eventually) run
almost any existing machine out of swap space.
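
To see why nesting is so costly, the rough estimate below multiplies the upper bounds of the nested repetitions. It assumes that each bound {1,max} expands to as many as max copies of its operand, which is an assumption about how the macro expansion scales rather than a description of the actual code. The helper expanded_copies is hypothetical.

```c
#include <stdint.h>

/* Hypothetical estimate: if each bounded repetition {1,max} expands
 * into up to max copies of its operand, nesting multiplies the counts.
 * Returns the product of the maxima, saturating at UINT64_MAX. */
uint64_t expanded_copies(const unsigned *maxima, int depth)
{
    uint64_t total = 1;
    int i;

    for (i = 0; i < depth; i++) {
        if (maxima[i] != 0 && total > UINT64_MAX / maxima[i])
            return UINT64_MAX;      /* product would overflow */
        total *= maxima[i];
    }
    return total;
}
```

Under that assumption, five nested bounds of 100 give 100 to the fifth power, or ten billion copies of the innermost `a'.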
There are suspected problems with response to obscure error conditions.
Notably, certain kinds of internal overflow, produced only by truly
enormous REs or by multiply nested bounded repetitions, are probably
handled badly.
Due to a mistake in 1003.2, things like `a)b' are legal REs because `)'
is a special character only in the presence of a previous unmatched
`('. This can't be fixed until the spec is fixed.

The standard's definition of back references is vague. For example,
does `a\(\(b\)*\2\)*d' match `abbbd'? Until the standard is clarified,
behavior in such cases should not be relied on.

The implementation of word-boundary matching is a bit of a kludge, and
bugs may lurk in combinations of word-boundary matching and anchoring.

25 Sept 1997                                                  REGEX(3)