ReDoS: Regular Expression Denial of Service (levelup.gitconnected.com)
83 points by davisjam314 11 days ago | 37 comments





"Your regex engine uses a slow regex matching algorithm. This is the case for the regex engines built into most programming language runtimes — this includes Java, JavaScript (various flavors) and Node.js, C#/.NET, Ruby, Python, Perl, and PHP. On the safe side, Go and Rust both use fast regex engines, and Google’s RE2 library is fast as well."

I don't feel like "slow" and "fast" are the right adjectives here. Perhaps some form of "backtracking capable" instead of "slow".

Edit: For example, the JIT version of PCRE2 is often faster than Rust's regex engine, but it still supports backtracking and can be "DoS-ed".


I don't know if it has a particular name, but it is a common problem: fast average case, slow worst case.

A common example is the quicksort algorithm. It is often considered the fastest sorting algorithm in the general case, but it is still O(n²) in the worst case. Statistically that worst case should almost never happen on random data, yet with cleverly ordered input it can be triggered deliberately. That's why some libraries prefer a slightly slower algorithm (e.g. heap sort) with an O(n log n) guarantee, so that it can't be DoSed.
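A rough sketch of that gap (assuming Python, with a deliberately naive first-element-pivot quicksort; the counter just tracks how many elements get compared against a pivot, and the numbers are illustrative):

  import random

  def quicksort(xs, counter):
      # Naive quicksort: first element as pivot, counting partition comparisons.
      if len(xs) <= 1:
          return xs
      pivot, rest = xs[0], xs[1:]
      counter[0] += len(rest)
      left = [x for x in rest if x < pivot]
      right = [x for x in rest if x >= pivot]
      return quicksort(left, counter) + [pivot] + quicksort(right, counter)

  n = 500
  for name, data in [("random", random.sample(range(n), n)),
                     ("sorted", list(range(n)))]:   # sorted input is adversarial here
      counter = [0]
      quicksort(data, counter)
      print(name, counter[0])
  # random input: a few thousand comparisons (~n log n)
  # sorted input: ~125,000 comparisons (~n^2/2)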

Unbalanced binary trees can have the same problem: on random data they work fine, but fed sorted or otherwise adversarial insertion orders they degenerate into much slower linked lists. Hash tables can also be exploited if they don't use a cryptographically secure hash function, and such functions are rarely used because they can be so slow as to negate the advantage of using a hash table in the first place.


>cryptographically secure hash function

I doubt that would help you. Hash tables take the hash and mod it by the size of the array. That will lead to collisions after the mod even if it's essentially impossible to generate collisions before the mod.

What you actually need is a secret hash function, so attackers don't know what the values hash to. This is done using a keyed hash function with a secret key.
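A minimal sketch of that idea (Python; blake2b is just a convenient keyed hash for illustration, real hash tables typically use something lighter such as SipHash, which is what CPython seeds with a per-process random key for exactly this reason):

  import os
  from hashlib import blake2b

  # Secret per-process key: an attacker who can't read it can't predict
  # which bucket a given input will land in.
  SECRET_KEY = os.urandom(16)

  def bucket_index(data: bytes, num_buckets: int) -> int:
      # Keyed hash first, then the usual reduction modulo the table size.
      digest = blake2b(data, key=SECRET_KEY, digest_size=8).digest()
      return int.from_bytes(digest, "big") % num_buckets

  print(bucket_index(b"hello", 64))
  print(bucket_index(b"world", 64))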


As for sorting, a lot of libraries use Timsort these days, since it's faster than quicksort on real-world data and has an O(n log n) worst case. Or B-trees for big storage and sorting, but that's generally a different use case (databases et al.).

If your threat model is "users who create hash collisions to slowly encumber data operations", you've either royally pissed someone off or you have a very determined competitor. No one who cares about performance would deploy a crypto hash function for a dictionary unless they absolutely had to. You can always add more buckets to reduce collision probability. There's an entire class of hash functions dedicated to non-crypto purposes (xxHash, MurmurHash, etc.).


The name for this in the security literature is "algorithmic complexity attacks". They were first proposed by Crosby & Wallach in 2003, with a running example of degraded hash tables.

Yes, those terms are definitely imprecise.

My target audience for that blog post was "anyone vaguely technical, from programmers to managers". I therefore attempted to minimize the amount of technical detail and to focus on the core algorithmic concepts.


You could have just used one sentence to explain that regular expressions in (list of languages) have a feature called backtracking which can be exploited to slow down and DoS a site. I mean, even that sentence, right there. Instead, you boil it down so the reader knows Go and Rust are good and other languages are bad.

Maybe instead of "a slow regex matching algorithm" you could just say "a backtracking algorithm".

And instead of "a fast regex engine" you could say "a linear algorithm".

That way it's both more concise and more technically correct.


I suggest using the term "regular language" in contrast to "regex" or "regular expression".

https://news.ycombinator.com/item?id=23598078

"Regular language" is an accurate mathematical term that's not "corrupted" by programming. I think you could also use "automata-based" but it's probably less clear and refers more to the implementation [1]

If you scroll down about 3 pages here, there is a table explaining regexes vs. regular languages:

https://www.oilshell.org/share/05-31-pres.html

It's the same stuff that's in RSC's articles but those are too dense for a general programming audience (they're more for regex engine implementers). And the existence of the OP's article shows that people haven't gotten it :)

You could also use the terms:

- regex engine (Perl/Python/PCRE/Java/etc.)

- regular language engine (libc, RE2, rust/regex)

Or "hybrid engine" may be more accurate for some of those. That is, they may use either automata-based techniques or backtracking depending on the situation.

-----

BTW, the Oil shell has a new syntax Eggex for both regular languages and regexes that makes the difference clear. In other words, you should be able to see from the syntax if a construct may cause exponential backtracking.

https://www.oilshell.org/release/0.8.pre6/doc/eggex.html

In other words, it's a static property that's easily detected (if you have a parser for the regex language itself, which is not hard at all in theory, but maybe it is a little in practice :) )

-----

[1] Aside: The terms NFA and DFA were also corrupted by Friedl in a popular O'Reilly book. I have this book and remember being super confused by it, since I read it long after I learned what a DFA and NFA were.

https://swtch.com/~rsc/regexp/regexp1.html

What little text it devotes to implementation issues perpetuates the widespread belief that recursive backtracking is the only way to simulate an NFA. Friedl makes it clear that he neither understands nor respects the underlying theory.

http://regex.info/blog/2006-09-15/248


The "corrupted" term is better here because it describes what the article is about: how to avoid security issues with real-world software tools, some of which match true regular expressions and some which match extended pseudo-"regular expressions".

See where the article says, "Faster engines may not support advanced features, most notably backreferences and lookaround assertions."

In a discussion about sucrose, fructose, aspartame, and stevia, you wouldn't tell someone to stop using the term "sweetener" and instead use the narrower and more precise term "sugar". If you did, you'd be advising them to use a word with the wrong meaning.


I'm replying to the parent, so the article could be called:

Regular Expression Denial of Service Can be Avoided by Using Regular Languages

In other words, accepting regular expressions / regexes exposes you to a DoS. Accepting regular languages does not (as long as you're using an appropriate engine.)


This seems like a quibble, but putting the emphasis on "using an appropriate engine" in parentheses at the end is burying the lede. Sticking to regular languages will not, in itself, help - you can have catastrophic backtracking in a backtracking engine even with a regular expression that uses only "regular language" concepts.
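A minimal illustration with CPython's backtracking re module; the pattern below uses only ordinary regular-language constructs (nested quantifiers, no backreferences or lookaround) and is equivalent to a+, yet it still blows up on a failing input:

  import re
  import time

  # On a non-matching input, a backtracking engine tries every way of splitting
  # the run of a's between the inner and outer quantifier: ~2^n attempts.
  pattern = re.compile(r'^(a+)+$')

  for n in (18, 20, 22, 24, 26):
      s = 'a' * n + '!'                  # the trailing '!' forces the failure
      t0 = time.perf_counter()
      pattern.match(s)
      print(n, round(time.perf_counter() - t0, 2))
  # Each extra character roughly doubles the running time.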

It's also at least theoretically possible that an engine could implement a subset of the non-regular-language "regex" constructs and still avoid backtracking. We considered this in Hyperscan - there are a number of backreference cases that are tractable in practice:

First, sometimes the backreferences just don't contain that many different possibilities - sometimes they are used to match quotes (or absence of quotes) so the three alternatives are ("|'|epsilon). This can be done by expansion.
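A quick sketch of that expansion for the quote case (Python re, with purely illustrative patterns): the backreference form and the hand-expanded, backreference-free form accept the same strings.

  import re

  # Backreference form: an optional quote, digits, then the *same* quote again.
  with_backref = re.compile(r"""^(["']?)\d+\1$""")

  # Expanded form: spell out the three alternatives (", ', or no quote).
  expanded = re.compile(r"""^("\d+"|'\d+'|\d+)$""")

  for s in ['"42"', "'42'", "42", '"42\'', "'42"]:
      assert bool(with_backref.match(s)) == bool(expanded.match(s)), s
  print("patterns agree on these inputs")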

Second, a number of backreference uses are structured so that you can only have a small - fixed size - number of back-references "alive" at once. For example, /\s+(\d+)\s+\1/ in libpcre notation matches whitespace, then a number, then more whitespace, then that number again. The thing being referred to by the backref can't ever have multiple live simultaneous values - an automaton could note the boundary-crossings in and out of this particular backref and make a note of the value, then do a simple compare later.

There are quite a number of backreferences out there that work this way.

No-one, to my knowledge, does this, but it could be done.

The point is - it's the engine. Not the language. Fully general support for nasty regular expression constructs does lead in practice to ugly blowouts but it doesn't have to.

Notably, the received wisdom that matching backreferences is exponential is based on a false assumption. The proof that it's NP-hard requires you to arbitrarily increase the number of backreferences, which is counter to the way most people would imagine regex matching or string matching is analyzed - you either assume that the regex or string is a fixed workload or you give it another parameter 'm' and talk about O(nm) things.


Fair point about the parenthetical, though I would say both the engine and the language are crucial, depending on what role you're playing in the system.

The setup I think of: you have "domain experts" writing policies with regexes, software engineers who embed the engine in an application, and operators of the entire service. My usages of regexes for "big data" followed that pattern, and I imagine the DPI applications of Hyperscan do too.

A lot of those rules are accumulated over time, are not easy to rewrite, and require testing. If you have a thousand regexes in one dialect painfully crafted over years, you can't just change the engine. It would break the whole application.

So that is why I emphasize the language. Backtracking engines are more common, and the language and the "code" written in the language has to be changed before you can change the engine. That happened at least partially at Google 10+ years ago, migrating from PCRE to RE2.

But yes, if you're an operator, you will be looking at it from the engine perspective. Obviously it is not enough to change the patterns themselves.

(The backreference point is interesting, although I'll probably leave it out of my mental model until someone implements it :) )

----

Also the point of Eggex is to treat regexes more abstractly, like a real language. Just like I don't write C++ code for GCC or Clang, I don't write regexes for a specific engine. I have ported regexes between engines many times.

The typical application I've worked on uses multiple engines. Hell a one line shell script using grep and awk uses multiple engines!!!

Tangent: reminds me of this paper: https://www3.cs.stonybrook.edu/~dongyoon/papers/FSE-19-Lingu...


> This seems like a quibble, but putting the emphasis on "using an appropriate engine" in parentheses at the end is burying the lede. Sticking to regular languages will not, in itself, help.

Yes. I attempted to make this point at the end of my article, while avoiding phrases like "regular languages".

> There are quite a number of backreferences out there that work this way

Yes, this appears to be a common usage of backreferences. In the regex corpus from the FSE'19 Lingua Franca paper, many backreference-using regexes are of this form.

I believe a loose bound on the worst-case time complexity for regexes with backreferences is n^{mX} where X is the number of backreferenced capture groups. The goal is to describe the number of distinct values the capture groups can take on. If a capture group can take on arbitrary sub-strings (e.g. .(.)\1) then m=2; if it is fixed but ambiguous (e.g. .*(')\1) then m=1; if it is unambiguous then m=0. Often still a polynomial though.

The exception proves the rule; I've seen regexes with 20 backreferences, many of them ambiguous, but the vast majority cover simple use cases and are quite sane.


[ disclosure: I designed Hyperscan and worked on it 2006-2017 ]

This is a rather old concept, but as long as people keep using backtracking regular expression matchers, I suppose it's relevant.

What would be intriguing would be ReDoS against robust matchers like RE2 and Hyperscan. Both have their own potential performance corners that could be exploited, especially when matching large numbers of patterns (more Hyperscan's bailiwick than RE2's).

For a sufficiently large pattern (or, rather, one that builds a sufficiently large DFA), RE2 could be forced to stay in its slow path by an input that prioritizes trying to head for an unbuilt DFA state. An attacker could attempt to tune this to stay on the edge of RE2's heuristics to fall back to NFA matching.

Alternately, NFA matching in RE2 is quite slow, and there are presumably ways to ensure that the slowest paths (probably involving large numbers of states being on) are traversed.

Hyperscan has its own performance corners - the dumps produced by Hyperscan's compiler could be examined to discover which "literal factors" it's using, then inputs crafted to blast those literal factors. There's a good deal of trickery to try to avoid getting caught by these kinds of "overly frequent literal factors", so it's probably harder than one might think - but I don't doubt an attacker could make life difficult for Hyperscan.

Once into the NFA/DFA engines in Hyperscan, an attacker could ensure that the engines can't ever use their fast acceleration paths.

Note these aren't exponential blowups in performance. However, hitting one of these systems (RE2 or Hyperscan) with a 'constant factor' of extra work could be quite significant.


This is a "technical two-pager" of my doctoral research about regex denial of service.

I looked through this. I'm curious as to how you managed to slot in a small reference to Hyperscan (my baby for many years) while writing a thesis about regular expression performance problems.

You cite us twice; once as a source on what "dominated" means in a graph theoretic sense (?) and once to falsely suggest that Hyperscan is somehow "domain specific". To the extent that Hyperscan is focused on scaling up to larger-scale pattern matching (unlike RE2) it is suitable for a domain that other regular expression matchers are not suitable for, but there's no reason that you can't use Hyperscan in a general purpose domain.

The fact of the matter is that large-scale regular expression matching, where it occurs in domains that are highly performance-sensitive, is an extremely difficult problem that is largely solved - in a limited way. No-one sensible builds an IPS or an NGFW and expects to be able to do backreferences and the usual menagerie of backtracking-only features (recursive subpatterns!).

The various outages that have happened since - people tripping over performance problems that were widely understood in the industry in at least 2006, if not earlier - are examples of people wilfully using the wrong tools for the job and being surprised when they don't work.

There's a niche for faster backtracking matchers, but most of the usage of regular expressions where anyone cares about performance is already squarely in the "automata" world and is done by either Hyperscan, hand-rolled software at network shops or acceleration hardware like the cards from TitanIC (now Mellanox). The backtracking formulation is not only subject to all the superlinear guff, it's also incredibly poorly suited to matching large numbers of regular expressions at once.


(This is a repetition of my replies to your thread on Twitter).

My goal was not to prove that super-linear behavior is possible, but rather to understand questions like:

1. How common are such regexes in practice? (~10%)

2. Do developers know it? (<50%)

3. Do they use specialized regex engines like RE2 or HyperScan? (Not often)

These are not theoretical advances. But I believe that my work improved our understanding of practice, and that it will let us ground future decisions in large-scale data instead of opinion.

Are there other regex questions you think are worth a look? I'm all ears!

I apologize if my treatment of HyperScan was offensive to you. My understanding from surveys is that developers typically use the regex engine built into their programming languages. I therefore focused on those engines, not on standalones like RE2 and HyperScan.

Once RE2 and HyperScan reach widespread use (I hope my research moves folks in that direction!), understanding their weaknesses will be an interesting next step.


In the automata world there are also re2c and Ragel.

Edit: Never mind, clicked on the article, lots of sources!

Is your paper available for reading? I'm very interested in it.


A quick Google search turned up http://hdl.handle.net/10919/98593.

In one place, you say that programmers should pre-validate that inputs are not long. I would suggest quantifying that somehow. I suspect some developers will not realize that a 30 character input is "long".

At the same time, some sites/systems impose size limits that are too strict, which makes it impossible to enter e.g. a long name or a long address. Such problems are almost inevitable if a programmer makes assumptions about input data based on limited personal experience.

https://github.com/kdeldycke/awesome-falsehood tries to address this.


Thank you for the suggestion. I added notes about typical sizes depending on the complexity of the regex.

TLDR: ~30 chars is too long for exponential regexes. 1K-5K chars are too long for quadratic regexes. Higher polynomials blow up somewhere between 100 and 1000 characters.
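For the quadratic case, a rough illustration with CPython's re and the classic trailing-whitespace pattern: every space is a candidate starting position, and each attempt backtracks through the remaining spaces, so doubling the input roughly quadruples the time (timings illustrative only).

  import re
  import time

  trailing_ws = re.compile(r'\s+$')

  for n in (5_000, 10_000, 20_000):
      s = ' ' * n + 'x'              # almost matches: spaces, then a non-space
      t0 = time.perf_counter()
      trailing_ws.search(s)          # finds nothing, but does O(n^2) work
      print(n, round(time.perf_counter() - t0, 2))
  # Contrast with the exponential case, where ~30 characters is already enough.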


The paper Regular Expression Matching Can Be Simple And Fast [1] has a time complexity plot that clearly shows how backtracking engines are more exploitable than finite-automata engines.

[1]: https://swtch.com/~rsc/regexp/regexp1.html


Interestingly, the backtracking engines can also be viewed as using automata. The trouble is actually in how they simulate the automaton. They use backtracking -- like a depth-first search. Since they don't remember where they've been, they end up redoing a lot of work. Engines like RE2 use a lockstep approach instead -- like a breadth-first search. They work their way down the search tree level-by-level and so don't have to repeat themselves.

Practically speaking, the slow regex engines don't always build NFAs. Instead they may follow a grammar-directed approach. But the NFA is a valid model of that simulation.
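A minimal sketch of the lockstep idea (a hand-built epsilon-NFA for (a|a)*b, simulated with a set of current states; illustrative only, not how RE2 is actually structured):

  # state -> list of (label, next_state); label None means an epsilon transition.
  NFA = {
      0: [(None, 1), (None, 6)],    # star: enter the alternation or skip ahead
      1: [(None, 2), (None, 4)],    # alternation: first 'a' or second 'a'
      2: [('a', 3)],
      4: [('a', 5)],
      3: [(None, 1), (None, 6)],    # loop back or leave the star
      5: [(None, 1), (None, 6)],
      6: [('b', 7)],                # trailing 'b'
      7: [],                        # accept state
  }
  START, ACCEPT = 0, 7

  def eps_closure(states):
      # Everything reachable from `states` via epsilon transitions.
      stack, seen = list(states), set(states)
      while stack:
          for label, nxt in NFA[stack.pop()]:
              if label is None and nxt not in seen:
                  seen.add(nxt)
                  stack.append(nxt)
      return seen

  def lockstep_match(text):
      # Breadth-first: one pass over the input, carrying a set of live states.
      current = eps_closure({START})
      for ch in text:
          current = eps_closure({nxt for s in current
                                 for label, nxt in NFA[s] if label == ch})
      return ACCEPT in current

  print(lockstep_match('a' * 40 + 'b'))   # True, in time linear in the input
  print(lockstep_match('a' * 40 + '!'))   # False, still linear: no backtracking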


I learned about ReDoS through a PostgreSQL bug. On an old pg version (9.4.5), the following regex made pg hang in an infinite loop.

  select 'anystring' ~ '($^)+';
Of course that was patched a long time ago, but since then I've avoided exposing regex functionality to end users.

This isn't a cheat-sheet. I expected a PDF with one or two pages of very condensed key information, not a long web article.

Ok, we've de-cheat-sheeted the title above.

Great. Much better now.

Thanks, folks!

It's great to see more introductory ReDoS material! I took a deep-dive on ReDoS myself recently and found the material available to be somewhat lacking, especially for beginners. Over the course of my investigation I found some interesting bugs in big name projects and created a few blog posts as an introduction to the bug class:

* https://blog.r2c.dev/2020/finding-python-redos-bugs-at-scale...

* https://blog.r2c.dev/2020/improving-redos-detection-with-dli...

The culmination of this work was a Python regex linter that could automatically detect ReDoS expressions with fairly high accuracy - Dlint's DUO138 rule: https://github.com/dlint-py/dlint/blob/master/docs/linters/D....

In my opinion, the best solution, as this article mentions, is to avoid the bug class altogether by using something like RE2 when possible. Nevertheless, I found ReDoS to be a really cool bug class at the intersection of computer science and software engineering.



