If you're really worried about the worst-case behavior of regular expressions be sure to select an engine designed specifically to not choke on pathological inputs, such as Russ Cox's RE2:
This gets a section in the article, but it's worth repeating, since RE2 is just that good. It's as close to Perl-compatible as a non-backtracking regular expression engine is likely to get, pleasantly fast, and stable enough for Google to use in production.
Regarding speed, I've actually found that re2 is faster on some non-pathological regexps, and a little slower on most. As usual, if you care about this, benchmark it. I've managed to get some easy 2x speedups by switching from Python's re to re2, even though re is usually faster.
Good to hear that RE2's constant factor doesn't seem too large, I know the Rust guys were looking to potentially use it as the regex engine in their stdlib, but were torn on whether that constant factor would prove troublesome.
Does anyone know if the Go standard library implements RE2? I'm asking because I know that Russ Cox is on the Go team and the documentation mentions RE2: http://golang.org/pkg/regexp/
Perl also has some clever memoization which handles a lot of cases that choke less mature regular expression engines. But neither of these is a satisfying solution; backtracking control verbs can only help you if you use them, and the memoization is not the solution to all woes, as the benchmark in the "Performance" section of this article shows:
The sky isn't falling, but it's still possible to get struck by lightning. (And of course, you're vulnerable to DoS attacks if you run user-supplied regexps with backreferences -- this is provably NP-hard.)
Not to gainsay everything you've written here, because I think that I think there's plenty of room in my universe for new regex engines, but the article is five years old and benchmarks Perl 5.8.7.
We are of course up to 5.16.1 these days, and I'd be more inclined to pay attention to benchmarks if they were performed with remotely current versions of the software.
Just ran the same benchmark on Perl 5.12.3 (because it came installed by default on my machine), and got the same exponential-time matching behavior. If you have a more recent version of Perl than I do, I'd be interested in hearing the results. This takes about 10 seconds on my machine, but should be nigh-instantaneous if they've fixed the problem in a later version of Perl:
I have 5.8.8, and I have 5.14.2. I've also taken the liberty of converting the string of a?'s into a{0,27}.
I recognize that you're trying to demonstrate pathologically bad behavior, but as written, you've got a pathologically bad regex.
Note the speed up from 24 & 30 seconds to seven one-thousandths of a second in both cases. Again, I like the idea of a regex engine with different design considerations than Perl's.
Your article would have been much better if it has used a class of pathological data with real world examples, and not used contrived examples that I rewrote in less time than it took to run.
So here's my results:
5.14.2:
$ time perl -e 'print "aaaaaaaaaaaaaaaaaaaaaaaaaaa" =~ /a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaaaaaaa/ '
1
real 0m24.227s
user 0m24.210s
sys 0m0.008s
$ time perl -e 'print "aaaaaaaaaaaaaaaaaaaaaaaaaaa" =~ /a{0,27}aaaaaaaaaaaaaaaaaaaaaaaaaaa/ '
1
real 0m0.007s
user 0m0.004s
sys 0m0.004s
----------------------------------
5..8.8:
$ time perl -e 'print "aaaaaaaaaaaaaaaaaaaaaaaaaaa" =~ /a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaa aaaaaaaaaaaaaaaa/ '
1
real 0m30.065s
user 0m30.046s
sys 0m0.016s
$ time perl -e 'print "aaaaaaaaaaaaaaaaaaaaaaaaaaa" =~ /a{0,27}aaaaaaaaaaaaaaaaaaaaaaaaaaa/ '
1
real 0m0.007s
user 0m0.004s
sys 0m0.004s
Real-world examples tend to be subtler, and not quite as easy to explain in an introductory article. Of course you wouldn't really use the for-demonstration-only regex I mentioned. Instead, you would write something reasonable-looking that works just fine unless it encounters a highly unlikely bad input. That's the way I've always seen it happen.
http://code.google.com/p/re2/
The rather interesting analysis is here:
http://swtch.com/~rsc/regexp/regexp1.html
Not that it's a free lunch, since you're gaining a constant amount of overhead in exchange for a sane growth pattern.
EDIT: Of course the article would mention RE2. That's what I get for reading the comments before TFA...