|
back to the help index
Parenthetical Constructs
Capturing Parentheses
Capturing Parentheses are of the form `(<regex>)' and can be used to group
arbitrarily complex regular expressions. Parentheses can be nested, but the
total number of parentheses, nested or otherwise, is limited to 50 pairs.
The text that is matched by the regular expression between a matched set of
parentheses is captured and available for text substitutions and
backreferences (see below.) Capturing parentheses carry a fairly high
overhead both in terms of memory used and execution speed, especially if
quantified by `*' or `+'.
Non-Capturing Parentheses
Non-Capturing Parentheses are of the form `(?:<regex>)' and facilitate
grouping only and do not incur the overhead of normal capturing parentheses.
They should not be counted when determining numbers for capturing parentheses
which are used with backreferences and substitutions. Because of the limit
on the number of capturing parentheses allowed in a regex, it is advisable to
use non-capturing parentheses when possible.
Positive Look-Ahead
Positive look-ahead constructs are of the form `(?=<regex>)' and implement a
zero width assertion of the enclosed regular expression. In other words, a
match of the regular expression contained in the positive look-ahead
construct is attempted. If it succeeds, control is passed to the next
regular expression atom, but the text that was consumed by the positive
look-ahead is first unmatched (backtracked) to the place in the text where
the positive look-ahead was first encountered.
One application of positive look-ahead is the manual implementation of a
first character discrimination optimization. You can include a positive
look-ahead that contains a character class which lists every character that
the following (potentially complex) regular expression could possibly start
with. This will quickly filter out match attempts that can not possibly
succeed.
Negative Look-Ahead
Negative look-ahead takes the form `(?!<regex>)' and is exactly the same as
positive look-ahead except that the enclosed regular expression must NOT
match. This can be particularly useful when you have an expression that is
general, and you want to exclude some special cases. Simply precede the
general expression with a negative look-ahead that covers the special cases
that need to be filtered out.
Positive Look-Behind
Positive look-behind constructs are of the form `(?<=<regex>)' and implement
a zero width assertion of the enclosed regular expression in front of the
current matching position. It is similar to a positive look-ahead assertion,
except for the fact the the match is attempted on the text preceeding the
current position, possibly even in front of the start of the matching range
of the entire regular expression.
A restriction on look-behind expressions is the fact that the expression
must match a string of a bounded size. In other words, `*', `+', and `{n,}'
quantifiers are not allowed inside the look-behind expression. Moreover,
matching performance is sensitive to the difference between the upper and
lower bound on the matching size. The smaller the difference, the better the
performance. This is especially important for regular expressions used in
highlight patterns.
Another (minor) restriction is the fact that look-ahead patterns, nor
any construct that requires look-ahead information (such as word boundaries)
are supported at the end of a look-behind pattern (no error is raised, but
matching behaviour is unspecified). It is always possible to place these
look-ahead patterns immediately after the look-behind pattern, where they
will work as expected.
Positive look-behind has similar applications as positive look-ahead.
Negative Look-Behind
Negative look-behind takes the form `(?<!<regex>)' and is exactly the same as
positive look-behind except that the enclosed regular expression must
NOT match. The same restrictions apply.
Note however, that performance is even more sensitive to the distance
between the size boundaries: a negative look-behind must not match for
any possible size, so the matching engine must check every size.
Case Sensitivity
There are two parenthetical constructs that control case sensitivity:
(?i<regex>) Case insensitive; `AbcD' and `aBCd' are
equivalent.
(?I<regex>) Case sensitive; `AbcD' and `aBCd' are
different.
Regular expressions are case sensitive by default, that is, `(?I<regex>)' is
assumed. All regular expression token types respond appropriately to case
insensitivity including character classes and backreferences. There is some
extra overhead involved when case insensitivity is in effect, but only to the
extent of converting each character compared to lower case.
Matching Newlines
NEdit regular expressions by default handle the matching of newlines in a way
that should seem natural for most editing tasks. There are situations,
however, that require finer control over how newlines are matched by some
regular expression tokens.
By default, NEdit regular expressions will NOT match a newline character for
the following regex tokens: dot (`.'); a negated character class (`[^...]');
and the following shortcuts for character classes:
`\d', `\D', `\l', `\L', `\s', `\S', `\w', `\W', `\Y'
The matching of newlines can be controlled for the `.' token, negated
character classes, and the `\s' and `\S' shortcuts by using one of the
following parenthetical constructs:
(?n<regex>) `.', `[^...]', `\s', `\S' match newlines
(?N<regex>) `.', `[^...]', `\s', `\S' don't match
newlines
`(?N<regex>)' is the default behavior.
Notes on New Parenthetical Constructs
Except for plain parentheses, none of the parenthetical constructs capture
text. If that is desired, the construct must be wrapped with capturing
parentheses, e.g. `((?i<regex))'.
All parenthetical constructs can be nested as deeply as desired, except for
capturing parentheses which have a limit of 50 sets of parentheses,
regardless of nesting level.
Back References
Backreferences allow you to match text captured by a set of capturing
parenthesis at some later position in your regular expression. A
backreference is specified using a single backslash followed by a single
digit from 1 to 9 (example: \3). Backreferences have similar syntax to
substitutions (see below), but are different from substitutions in that they
appear within the regular expression, not the substitution string. The number
specified with a backreference identifies which set of text capturing
parentheses the backreference is associated with. The text that was most
recently captured by these parentheses is used by the backreference to
attempt a match. As with substitutions, open parentheses are counted from
left to right beginning with 1. So the backreference `\3' will try to match
another occurrence of the text most recently matched by the third set of
capturing parentheses. As an example, the regular expression `(\d)\1' could
match `22', `33', or `00', but wouldn't match `19' or `01'.
A backreference must be associated with a parenthetical expression that is
complete. The expression `(\w(\1))' contains an invalid backreference since
the first set of parentheses are not complete at the point where the
backreference appears.
Substitution
Substitution strings are used to replace text matched by a set of capturing
parentheses. The substitution string is mostly interpreted as ordinary text
except as follows.
The escape sequences described above for special characters, and octal and
hexadecimal escapes are treated the same way by a substitution string. When
the substitution string contains the `&' character, NEdit will substitute the
entire string that was matched by the `Find...' operation. Any of the first
nine sub-expressions of the match string can also be inserted into the
replacement string. This is done by inserting a `\' followed by a digit from
1 to 9 that represents the string matched by a parenthesized expression
within the regular expression. These expressions are numbered left-to-right
in order of their opening parentheses.
The capitalization of text inserted by `&' or `\1', `\2', ... `\9' can be
altered by preceding them with `\U', `\u', `\L', or `\l'. `\u' and `\l'
change only the first character of the inserted entity, while `\U' and `\L'
change the entire entity to upper or lower case, respectively.
back to the help index
|