Regular Expressions Will Stab You in the Back (cueup.com)
19 points by pjscott on Oct 26, 2012 | hide | past | favorite | 12 comments


If you're really worried about the worst-case behavior of regular expressions be sure to select an engine designed specifically to not choke on pathological inputs, such as Russ Cox's RE2:

http://code.google.com/p/re2/

The rather interesting analysis is here:

http://swtch.com/~rsc/regexp/regexp1.html

Not that it's a free lunch, since you're gaining a constant amount of overhead in exchange for a sane growth pattern.

EDIT: Of course the article would mention RE2. That's what I get for reading the comments before TFA...


This gets a section in the article, but it's worth repeating, since RE2 is just that good. It's as close to Perl-compatible as a non-backtracking regular expression engine is likely to get, pleasantly fast, and stable enough for Google to use in production.

Regarding speed, I've actually found that re2 is faster on some non-pathological regexps, and a little slower on most. As usual, if you care about this, benchmark it. I've managed to get some easy 2x speedups by switching from Python's re to re2, even though re is usually faster.


Good to hear that RE2's constant factor doesn't seem too large, I know the Rust guys were looking to potentially use it as the regex engine in their stdlib, but were torn on whether that constant factor would prove troublesome.


Does anyone know if the Go standard library implements RE2? I'm asking because I know that Russ Cox is on the Go team and the documentation mentions RE2: http://golang.org/pkg/regexp/


From looking at the source code, Go's regexp package uses O(n) NFA-based matching, as described here:

http://swtch.com/~rsc/regexp/regexp2.html

It isn't nearly as highly optimized as RE2, but it works in a similar way.
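The set-of-states simulation Cox describes can be sketched in a few lines of Python. This toy matcher handles only literal characters and `?`, and it is not Go's actual implementation, but it shows why tracking reachable pattern positions (instead of backtracking) stays linear even on the article's pathological pattern:

```python
# Toy linear-time matcher for patterns built from literal chars and '?'.
# Rather than backtracking, it tracks the SET of pattern positions
# reachable after each input character -- the NFA simulation that
# Go's regexp package (and RE2) are built around.

def parse(pattern):
    """Split a pattern like 'a?a?aa' into (char, optional) tokens."""
    tokens = []
    i = 0
    while i < len(pattern):
        if i + 1 < len(pattern) and pattern[i + 1] == '?':
            tokens.append((pattern[i], True))
            i += 2
        else:
            tokens.append((pattern[i], False))
            i += 1
    return tokens

def add_state(tokens, states, pos):
    """Add pos, following epsilon moves that skip optional tokens."""
    while pos not in states:
        states.add(pos)
        if pos < len(tokens) and tokens[pos][1]:  # optional: may be skipped
            pos += 1
        else:
            break

def match(pattern, text):
    tokens = parse(pattern)
    states = set()
    add_state(tokens, states, 0)
    for ch in text:
        next_states = set()
        for pos in states:
            if pos < len(tokens) and tokens[pos][0] == ch:
                add_state(tokens, next_states, pos + 1)
        states = next_states
    return len(tokens) in states  # accept if the pattern's end is reachable

# The article's pathological case is instant here:
print(match('a?' * 27 + 'a' * 27, 'a' * 27))  # True, in linear time
```

Each input character touches each pattern position at most once, so the whole match is O(n·m) regardless of how the optionals interleave.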


Or they won't, as anyone who worked on a Perl-based website circa 2000 will tell you. The title and lede are massively overstated.

However, the solutions given, if you do end up with this problem, are good and detailed.

They did miss Perl's solution, Special Backtracking Control Verbs: http://perldoc.perl.org/perlre.html#Special-Backtracking-Con...


Perl also has some clever memoization which handles a lot of cases that choke less mature regular expression engines. But neither of these is a satisfying solution; backtracking control verbs can only help you if you use them, and the memoization is not the solution to all woes, as the benchmark in the "Performance" section of this article shows:

http://swtch.com/~rsc/regexp/regexp1.html

The sky isn't falling, but it's still possible to get struck by lightning. (And of course, you're vulnerable to DoS attacks if you run user-supplied regexps with backreferences -- this is provably NP-hard.)
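As a small illustration of why backreferences force backtracking: the pattern below matches only strings of the form w + w, a language no finite automaton can recognize, which is why RE2 refuses backreferences entirely. (A sketch using Python's backtracking re module:)

```python
import re

# \1 must repeat exactly what group 1 captured, so this matches only
# strings consisting of the same chunk twice -- beyond any DFA/NFA.
pattern = re.compile(r'(a+b)\1')

print(bool(pattern.fullmatch('aabaab')))  # the two halves agree
print(bool(pattern.fullmatch('aabab')))   # halves differ: no match
```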


Not to gainsay everything you've written here, because I think there's plenty of room in my universe for new regex engines, but the article is five years old and benchmarks Perl 5.8.7.

We are of course up to 5.16.1 these days, and I'd be more inclined to pay attention to benchmarks if they were performed with remotely current versions of the software.

Otherwise, thanks for the enlightening post.


Just ran the same benchmark on Perl 5.12.3 (because it came installed by default on my machine), and got the same exponential-time matching behavior. If you have a more recent version of Perl than I do, I'd be interested in hearing the results. This takes about 10 seconds on my machine, but should be nigh-instantaneous if they've fixed the problem in a later version of Perl:

    "aaaaaaaaaaaaaaaaaaaaaaaaaaa" =~ /a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaaaaaaa/


I have 5.8.8 and 5.14.2. I've also taken the liberty of converting the string of a?'s into a{0,27}.

I recognize that you're trying to demonstrate pathologically bad behavior, but as written, you've got a pathologically bad regex.

Note the speed-up from 24 and 30 seconds to seven one-thousandths of a second in both cases. Again, I like the idea of a regex engine with different design considerations than Perl's.

Your article would have been much better if it had used a class of pathological data with real-world examples, instead of contrived examples that I rewrote in less time than it took to run them.

So here are my results:

5.14.2:

     $ time perl -e 'print "aaaaaaaaaaaaaaaaaaaaaaaaaaa" =~ /a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaaaaaaa/ '
     1
     real    0m24.227s
     user    0m24.210s
     sys     0m0.008s

     $ time perl -e 'print "aaaaaaaaaaaaaaaaaaaaaaaaaaa" =~ /a{0,27}aaaaaaaaaaaaaaaaaaaaaaaaaaa/ '
     1
     real    0m0.007s
     user    0m0.004s
     sys     0m0.004s
----------------------------------

5.8.8:

     $ time perl -e 'print "aaaaaaaaaaaaaaaaaaaaaaaaaaa" =~ /a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaaaaaaa/ '
     1
     real    0m30.065s
     user    0m30.046s
     sys     0m0.016s

     $ time perl -e 'print "aaaaaaaaaaaaaaaaaaaaaaaaaaa" =~ /a{0,27}aaaaaaaaaaaaaaaaaaaaaaaaaaa/ '
     1
     real    0m0.007s
     user    0m0.004s
     sys     0m0.004s
--------------------------------
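For what it's worth, Python's stock re engine shows the same contrast: the a{0,27} form only backtracks over the repetition counter (a few dozen attempts), while the a? form explores exponentially many combinations of independent choices. A rough sketch, with n cut down to 18 so the slow case still finishes promptly (timings are machine-dependent):

```python
import re
import time

def time_match(pattern, text):
    """Return (matched, seconds) for matching pattern against text."""
    start = time.perf_counter()
    matched = re.match(pattern, text) is not None
    return matched, time.perf_counter() - start

n = 18  # kept small: the a? form is exponential in n
slow_pat = 'a?' * n + 'a' * n          # pathological: 2**n backtracking paths
fast_pat = 'a{0,%d}' % n + 'a' * n     # equivalent, but backtracks linearly
text = 'a' * n

slow_ok, slow_t = time_match(slow_pat, text)
fast_ok, fast_t = time_match(fast_pat, text)
print(slow_ok, fast_ok)   # both patterns match...
print(slow_t > fast_t)    # ...but the a? form takes far longer
```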


Real-world examples tend to be subtler, and not quite as easy to explain in an introductory article. Of course you wouldn't really use the for-demonstration-only regex I mentioned. Instead, you would write something reasonable-looking that works just fine unless it encounters a highly unlikely bad input. That's the way I've always seen it happen.


In the "real world" I usually try to optimize for the average case, and deal with the "highly unlikely bad input" as a special case.

I'm still impressed with the new regex engine, but I have a pathological hatred of contrived examples.

It's not you, it's me.



