Negated regular expressions in Oniguruma and Ruby

Suraj N. Kurapati


  1. Overview
    1. Wholly negated regular expressions
      1. Partly negated regular expressions
        1. Conclusion

          About a week ago, I submitted this feature request asking for the ability to negate regular expressions in Ruby. I have since implemented the feature myself by patching Ruby trunk and patching Oniguruma 5.9.2 beside it. And now, I shall explain to you how it all works.

          Overview

          Regular expressions can be negated either as a whole (/.../v) or in part /(?v:...)/) through the negation flag (v). The letter v was chosen for the negation flag because it comes from the ed(1) program of UNIX heritage, where it serves a global command that negates the sense of the regular expression passed in. This v heritage can also be found in grep(1), whose -v command-line option inverts its matching behavior.

          Wholly negated regular expressions

          This is the simplest part of the implementation. We match the negated regular expression as usual, treating it as if it were not negated, and then negate the result of that match if there were no matching errors.

          Partly negated regular expressions

          At parse time, a partly negated regular expression, say r in /(?v:r)/, is expanded into /(?:rN)?.*?/ where N is a special instruction (OP_NEGATE) that causes the regular expression matching engine to treat the input matched thus far as a mismatch.

          For example, the /a(?v:b)c/ parse tree looks like this after expansion:

          PATTERN: /a(?v:b)c/ (US-ASCII)
          <list:275ca40>
             <string:275c9f0>a
             <enclose:275c950> option:4
                <quantifier:2755640>{0,1}
                   <string:275c9a0>b
             <quantifier:2755690>{0,-1}?
                <anychar:275af20>
             <string:2755550>c
          

          Since r is wrapped in an optional non-capturing group (/(?:rN)?/), the matching engine is able to continue onward when r does not match. However, if r did not match because the matching engine reached non-matching input characters, then those characters will obstruct the matching engine from continuing onward. That is why /.*?/ is provided.

          For example, when /a(?v:b)c/ is matched against the input "axyzc", the non-matching input characters ("xyz") will obstruct the matching engine from matching the rest of the regular expression (/c/) against the rest of the input ("xyzc"). However, since /.*?/ consumes "xyz", the matching engine can proceed to match /c/ against "c".

          Conclusion

          In all, this was a challenging and fun learning experience for me. I brushed up on my rusty (thanks Ruby!) knowledge of the C language and the GDB debugger while also adding some shiny new tools to my repertoire:

          I have submitted my patches simultaneously to the Ruby developers and to the author of Oniguruma, and now await their response. If all goes well, we might see this feature in Ruby and Oniguruma someday. Enjoy! :-]


          Updates