BBCode won't protect you from XSS

Cross site scripting (XSS or HTML injection) is a vulnerability class that allows an attacker to enter malicious HTML code into your document. Usually that's a Javascript code for e.g. stealing the cookies or self spreading worm used for social networking sites. All the big sites (Facebook, Twitter, Myspace, Wikipedia to name the few) have had an XSS hole in the past. What that can teach us is that XSS protection is usually done wrong. Developers invent yet-another-perfect XSS filters, WAFs update their signatures, even the browsers feature XSS protection nowadays. And still, most of them fail.

One of the wrong ways to protect from XSS is to use BBCode. Today I encountered ~~yet another blog post~~ (update: post deleted by author, see comment #1) claiming that author used a BBCode implementation so that "The code is encoded and perfectly safe and resilient to possible XSS attacks". I just just had to put that to a test. Read on to find out how "perfect" was the code and what is wrong with using BBCode for XSS protection.

Look into details

The described project uses a TinyMCE WYSYWIG editor with bbcode plugin to convert a HTML text entered by a user into bbcode, which is then persisted (in a database I suppose). To convert it back to HTML they use a snippet of 3rd party Javascript code (yes, the conversion happens on the client, not on a server!). A quick look at the code is enough to convince me that it's vulnerable.

Regexp for protection - just don't do it!

The ~50 lines code uses a few regular expressions to scan for bbcode patterns and convert them to HTML. It's just doomed to fail - and so it did:

Yet another XSS

It took just 5 minutes to exploit this "perfect" protection. Not that I'm smart, it's the protection that is so lame. I tried just a simple vector attack, one of many that would possibly fit for this particular script.

This particular script failed for the following reasons:

Regular expressions

Regular expressions, while being a brilliant tool to process text data, are pretty clumsy when they're given nested markup language to parse (like XML or HTML, or BBCode). It's just not the right tool to deal with it. It's not easy to account for:

proper tag nesting,
proper escaping,
different allowed attributes,
multi-line text,
different encodings.

It gets even worse because browsers are usually forgiving for the invalid HTML syntax. There are many XSS attack vectors that target an invalid regex that did not account for some special case. Recent Twitter @onmouseover flaw was most likely possible because of an invalid regex. When you're trying regexps to do your filtering, it will quickly evolve into complicated, almost cryptic code that noone discovered holes in yet.

Another conversion layer

When you're trying to clean valid HTML from malicious content, you should take an approach to choose only allowed tags, only allowed attributes etc. Once your HTML can be invalid (improperly nested tags, not closed tags etc, mixed quoting for attributes), your task gets much much harder, but still doable. And when you're introducing a different, intermediary syntax to handle (BBCode in this case) - well, you're toasted. You need to make sure that:

Your HTML to BBCode corrects invalid HTML code
Your HTML to BBCode filters out malicious HTML code
Your HTML to BBCode always produces valid BBCode
The conversion from BBode to HTML is flawless and produces valid HTML
Both conversions do not introduce their own attack vectors

That gets really complicated and it is in practice, that's why we've had so many bbcode XSS flaws in the past. Developer, as it was in this case, just includes a 3rd party bbcode library and hopes for the best, thinking he's secure because of using some buzzword. Library is not magic - it's an additional code and more code = more errors. Seriously - it's just harder to be secure down that road, so let's just not go there.

The correct approach is...

If you need to allow some HTML from users on your pages - use a full blown parser! Start from the perfectly clean HTML root node and keep adding the valid nodes (if they match your whitelist) until you're done. Or throw all your HTML into a proper HTML parser object and then traverse the tree leaving only safe nodes/attributes. Whitelist is the key word for both algorithms.

I can only recommend PHP projects that do that, namely HTMLPurifier and Wibble but I'm sure that there are equivalents in your environment. For example, Wibble:

Converts all HTML input to HTML safe UTF-8
Loads the HTML into a DOMDocument object
Applies one or more filters (DOM manipulators) to the HTML DOM
Extracts the filtered HTML from DOM and applies HTML Tidy (ext/tidy)
Converts the final HTML to the user's selected character encoding (if not UTF-8)

There is no place for a malicious code coming from almost-valid HTML syntax to pass through this. If there's anything wrong with the HTML, it would most likely be your incomplete whitelist (or some weird html/tidy attack vector, but I doubt it).

And be skeptical about 3rd party code you're including.

Bonus

Server side BBCode parsing may also be wrong (and likely from the same reasons that the JS version) - try http://www.bbcode-to-html.com/ with my example code.

As usual - the code displayed here is on my github.

3 comments:

Łukasz said...: Hello.. I'm Łukasz from B21. I closed our post yesterday, as I read your article. I think it will be better not to show it on the web, because it might be confusing for others. Thank you very much for pointing us the mistake we made, this is really great piece of article and we appriciate it! Regars; September 27, 2010 at 7:16 AM
Sam Bronson said...: Hmm, I don't think you put enough; March 15, 2011 at 8:26 PM
steve prismn said...: The Internet has created a new economic ecosystem, the
E-commerce marketplace,
and it has become the virtual main street of the world.

E-commerce; January 31, 2012 at 6:52 AM

the world. according to koto

Sunday, September 26, 2010