[ nedit-Bugs-1760116 ] Negated escape sequences misinterpreted in character class

Joerg Fischer jf505 at gmx.de
Fri Jul 27 20:05:12 CEST 2007


> Interestingly, I notice that (?n\W) does not match newlines (my patch
> allows (?n[\W]) to do so, which is rather inconsistent). This is true also
> for \L, \D. Also \y without (?n ) around it will match newline. I believe
> these to be faults. What about you?

Well, what's right and what's wrong?  Here is another quote from
NEdit's help:

  By default, NEdit regular expressions will NOT match a
  newline character for the following regex tokens: dot
  (`.'); a negated character class (`[^...]'); and the
  following shortcuts for character classes:

   `\d', `\D', `\l', `\L', `\s', `\S', `\w', `\W', `\Y'

  The matching of newlines can be controlled for the `.'
  token, negated character classes, and the `\s' and `\S'
  shortcuts by using one of the following parenthetical
  constructs:

     (?n<regex>)  `.', `[^...]', `\s', `\S' match newlines

     (?N<regex>)  `.', `[^...]', `\s', `\S' don't match
                                            newlines

  `(?N<regex>)' is the default behavior. 

  
This is unusual, since in Perl only the anchors ^ and $ (via the /m
modifier) and the any character . (via the /s modifier) are affected
by a m(ultiple) or s(ingle) line modifier.

That is, in Perl \s = [ \t\n\r\f] and \S = [^ \t\n\r\f], so \s matches
newlines (always) and \S does not (never), which is what one would
normally expect.  Similarly, for the other escape sequences.
(In NEdit by default \s = [ \t\v\r\f], which includes vertical tabs,
and \S = [^ \t\n\v\r\f], where we needn't list the \n, because by 
NEdit's default a negated character class will not match \n anyway.)
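
To make the Perl-style convention concrete, here is a small sketch.  It
uses Python's re module as a stand-in for Perl (that substitution is my
assumption; the helper name matches_newline is made up for the example).
As far as I know Python follows the same rule on this point: its DOTALL
flag, like Perl's /s, changes only the dot and leaves the shorthand
classes alone.

```python
import re

# Python's re is used here as a stand-in for Perl-style semantics
# (an assumption for illustration; this is not NEdit's regex engine).

def matches_newline(pattern, flags=0):
    """Return True if the pattern matches a single newline character."""
    return re.fullmatch(pattern, "\n", flags) is not None

perl_style = {
    # Shorthand classes: fixed meaning, independent of any line modifier.
    r"\s": matches_newline(r"\s"),
    r"\S": matches_newline(r"\S"),
    r"\S + DOTALL": matches_newline(r"\S", re.DOTALL),
    r"\D": matches_newline(r"\D"),
    r"\W": matches_newline(r"\W"),
    # A negated class matches newline unless \n is listed explicitly.
    r"[^a]": matches_newline(r"[^a]"),
    # Only the dot is switched by the single-line modifier.
    r".": matches_newline(r"."),
    r". + DOTALL": matches_newline(r".", re.DOTALL),
}

for token, hit in perl_style.items():
    print(f"{token:12} matches newline: {hit}")
```

Here \s, \D, \W and [^a] match a newline regardless of flags, \S never
does, and the dot does so only with DOTALL.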

Now, we are not Perl but inside a text editor, and the preceding NEdit
convention was adopted for the reason the help gives:

  MATCHING NEWLINES

  NEdit regular expressions by default handle the matching
  of newlines in a way that should seem natural for most
  editing tasks.  There are situations, however, that
  require finer control over how newlines are matched by
  some regular expression tokens.

However, the fact that (?n\S) matches newlines just like (?n\s) seems
somehow upside down.  In effect, the (?N<regex>) construct excludes
newlines from all character classes, and (?n<regex>) does just the
opposite, ie, it even adds newlines to classes where they normally
wouldn't belong.
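
For comparison, here is a toy lookup table for the behavior the help
text describes.  It is only a reading of the quoted help, not NEdit's
actual implementation; the names SWITCHABLE, FIXED and
nedit_matches_newline are invented for this sketch, and treating the
remaining shortcuts as never matching newlines is an assumption based
on the help not listing them as controllable.

```python
# Toy model of the quoted help text, NOT NEdit's real regex engine.

# Tokens the help says (?n<regex>) / (?N<regex>) can switch.
SWITCHABLE = {".", "[^...]", r"\s", r"\S"}

# Shortcuts the help lists as not matching newlines by default and does
# not list as controllable (assumed here to stay that way under (?n...)).
FIXED = {r"\d", r"\D", r"\l", r"\L", r"\w", r"\W", r"\Y"}

def nedit_matches_newline(token, n_mode=False):
    """Return True if, per the help text, the token would match a newline.

    n_mode=True models the token inside (?n<regex>); False models the
    default, which is the same as (?N<regex>).
    """
    if token in SWITCHABLE:
        return n_mode
    if token in FIXED:
        return False
    raise ValueError(f"token not covered by the help text: {token!r}")

for token in sorted(SWITCHABLE | FIXED):
    print(f"{token:7} default: {nedit_matches_newline(token)!s:5} "
          f"(?n...): {nedit_matches_newline(token, n_mode=True)}")
```

In this reading \s and \S both flip together under (?n<regex>), which is
the "upside down" part, while \D, \L and \W stay without newlines.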

I can't tell whether or not these conventions are natural, or at least
handy - in any case they can be no more than that, because you can
always define a regex doing what you want, if need be by mixing
(?n<regex>) and (?N<regex>) constructs.

However, these conventions are incompatible with standards (ie, Perl),
and I vaguely recall that from time to time some folks complain
because NEdit regexes aren't more Perl compatible...


OK, so what to do now?  I don't know, really.  I think character
classes, or these shorthand notations, should generally match what
they say.  This holds also for the NEdit specific \y and \Y which IMO
shouldn't be affected by the (?n<regex>) or (?N<regex>) constructs.

Moreover, \S should never match newlines (I believe), and \D, \L, \W
should match newlines -- at least with the help of (?n<regex>).

Cheers,
Jörg

