[ nedit-Bugs-1760116 ] Negated escape sequences misinterpreted in character class
Tony Balinski
ajbj at free.fr
Tue Jul 31 16:26:35 CEST 2007
Quoting Joerg Fischer <jf505 at gmx.de>:
> > Interestingly, I notice that (?n\W) does not match newlines (my
> > patch allows (?n[\W]) to do so, which is rather inconsistent). This
> > is true also for \L, \D. Also \y without (?n ) around it will match
> > newline. I believe these to be faults. What about you?
>
> Well, what's right and what's wrong? Here is another quote from
> NEdit's help:
>
> By default, NEdit regular expressions will NOT match a
> newline character for the following regex tokens: dot
> (`.'); a negated character class (`[^...]'); and the
> following shortcuts for character classes:
>
> `\d', `\D', `\l', `\L', `\s', `\S', `\w', `\W', `\Y'
>
> The matching of newlines can be controlled for the `.'
> token, negated character classes, and the `\s' and `\S'
> shortcuts by using one of the following parenthetical
> constructs:
>
> (?n<regex>) `.', `[^...]', `\s', `\S' match newlines
>
> (?N<regex>) `.', `[^...]', `\s', `\S' don't match
> newlines
>
> `(?N<regex>)' is the default behavior.
>
>
> This is unusual, since in Perl only the anchors ^ and $ and the any
> character . are affected by a m(ultiple) or s(ingle) line modifier.
...
An interesting analysis, but it doesn't really say what we should do.
My reaction is that the \D, \L, \s, \W and \y classes should
behave as [^\d] etc, and thus should never match newlines outside of a
newline-matching context. This means that, without the newline-matching
context, newlines can only ever be matched explicitly, either as
individual matches or as a character included in a character class (I
would allow it to appear in a range, though, eg "[^\x01-\x1F]" - that's
explicit enough for me).
I think this matches with the principle of least surprise: a single
simple rule saying, "outside of a (?n ) context, there is no implicit
matching of newlines by ., [...], [^...] or the shorthand builtin
character class matches \D, \L, \s, \S, \W, \y, \Y." (Note that,
since newline is not a digit, letter or word character, the classes
\d, \l, \w will never match it.)
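The proposed rule can be approximated in another regex engine by rewriting each shorthand class into a bracket expression that excludes newline explicitly. The following is only a sketch using Python's `re` module, not NEdit code: the `NO_NEWLINE` table and the `without_implicit_newline` helper are hypothetical names introduced for illustration, only the shorthands that Python shares with NEdit are covered (`\L`, `\y` and `\Y` have no Python equivalent), and shorthands that already sit inside a `[...]` expression are not handled.

```python
import re

# Hypothetical table: newline-excluding equivalents for Python's `re`.
# Assumption: only escapes that exist in both NEdit and Python are listed.
NO_NEWLINE = {
    "D": r"[^\d\n]",   # non-digit, but not newline
    "W": r"[^\w\n]",   # non-word character, but not newline
    "s": r"[^\S\n]",   # whitespace, but not newline
}


def without_implicit_newline(pattern: str) -> str:
    """Rewrite \\D, \\W and \\s so they can no longer match a newline.

    Limitation: a shorthand that is already inside a bracket expression
    would be rewritten incorrectly; this sketch does not parse brackets.
    """
    def swap(match: re.Match) -> str:
        # Unknown escapes (including an escaped backslash) are kept as-is.
        return NO_NEWLINE.get(match.group(1), match.group(0))

    # Consume escapes pairwise so that "\\\\D" (literal backslash, then D)
    # is not mistaken for the \D shorthand.
    return re.sub(r"\\(.)", swap, pattern)
```

With the text "foo\nbar baz", the pattern `\w+\W\w+` finds "foo\nbar", while the rewritten pattern `\w+[^\w\n]\w+` skips the line break and finds "bar baz".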
As for Perl's newline handling, it is also consistent: an inverted
character class will include \n if the original range did not, and
\s will always match it.
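The Perl conventions described here can be checked in any engine that follows them. The sketch below uses Python's `re` module, which behaves like Perl on these points; it says nothing about NEdit's own engine, and the variable names are made up for the example.

```python
import re

text = "ab\ncd"

# A negated class matches newline because "\n" is not in the excluded set.
negated_matches_newline = re.search(r"b[^x]c", text) is not None

# \s, \D and \W all match newline: it is whitespace, a non-digit
# and a non-word character.
shorthand_matches_newline = {
    name: re.search("b" + name + "c", text) is not None
    for name in (r"\s", r"\D", r"\W")
}

# The dot is the one token that skips newline by default; DOTALL
# (the counterpart of Perl's /s modifier) switches that off.
dot_default = re.search(r"b.c", text) is not None
dot_dotall = re.search(r"b.c", text, re.DOTALL) is not None
```

Only `dot_default` comes out false here, which is the Perl behaviour the quoted message refers to: the modifier affects the dot, not the character classes.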
...
> However, these conventions are incompatible with standards (ie, Perl),
> and I vaguely recall that from time to time some folks complain
> because NEdit regexes aren't more Perl compatible...
>
> OK, so what to do now? I don't know, really. I think character
> classes, or these shorthand notations, should generally match what
> they say. This holds also for the NEdit specific \y and \Y which IMO
> shouldn't be affected by the (?n<regex>) or (?N<regex>) constructs.
>
> Moreover, \S should never match newlines (I believe), and \D, \L, \W
> should match newlines -- at least with the help of (?n<regex>).
>
> Cheers,
> Jörg
Thanks for your comments: a useful summary and a decent opinion. I
disagree slightly with your recommendation; I don't want to see something
like "<\w+\y\w+>" matching two words separated by a newline outside of a
(?n ) grouping. Likewise for a \W instead of the \y. I think this makes
it clear that \n has a special status (usually).
In Perl, you're often just treating text already broken into lines
(eg using constructs like "while (<ARGV>)"); in NEdit, that is usually
not the case - the text is just one long string, so the newline
exception seems (at least to me) to make a lot more sense. It allows you
to write concise regexes which won't span lines, which, I think, makes
more sense in the editor context. It forces you to make a bit more of an
effort when writing multiline-matching regexes too, which I also think
is correct.
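The difference between the two settings can be illustrated with a small sketch, again in Python's `re` rather than NEdit. The sample text and the names `buffer`, `pair`, `per_line` and `whole_buffer` are invented for the example. The same Perl-style pattern is applied once per line, as a "while (<ARGV>)" loop would do, and once to the whole text as a single string, as an editor does.

```python
import re

buffer = "alpha beta\ngamma\ndelta epsilon\n"
pair = re.compile(r"\w+\W+\w+")  # two words separated by non-word characters

# Line-at-a-time processing: each line is searched on its own,
# so no match can ever span a line break.
per_line = [
    m.group()
    for line in buffer.splitlines()
    for m in pair.finditer(line)
]

# Whole-buffer search: the text is one long string, so \W+ is free
# to cross "\n" under Perl-style semantics.
whole_buffer = [m.group() for m in pair.finditer(buffer)]
```

Here `per_line` is `["alpha beta", "delta epsilon"]`, while `whole_buffer` is `["alpha beta", "gamma\ndelta"]`: the second search pairs words across a line break and, as a side effect, misses "delta epsilon".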
I don't think it would be too difficult to adjust the code to provide
consistency (either Perl-like or as I suggest), and I think that work
should be done. More comments, anybody?
Tony