One of the wrong ways to protect against XSS is to use BBCode. Today I encountered yet another example:

[code listing: "Regexp for protection - just don't do it!"]
[code listing: "Yet another XSS"]

This particular script failed for the following reasons:
Regular expressions

Regular expressions, while a brilliant tool for processing text, are pretty clumsy when they're given a nested markup language to parse (like XML, HTML, or BBCode). They're just not the right tool for the job. It's not easy to account for:
- proper tag nesting,
- proper escaping,
- different allowed attributes,
- multi-line text,
- different encodings.
It gets even worse because browsers are usually forgiving of invalid HTML syntax. Many XSS attack vectors target exactly the special case that a filtering regex failed to account for. The recent Twitter @onmouseover flaw was most likely possible because of a flawed regex. When you try to do your filtering with regexps, the code quickly evolves into complicated, almost cryptic patterns that no one has discovered the holes in yet.
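To illustrate the point, here is a hypothetical blacklist filter of the kind described above (the function and its patterns are made up for this example, not taken from the script in question), together with two inputs that slip straight through it:

```python
import re

# A naive, hypothetical regex blacklist: strip <script> blocks and
# double-quoted on* event-handler attributes.
def naive_filter(html):
    html = re.sub(r'<script.*?>.*?</script>', '', html, flags=re.I | re.S)
    html = re.sub(r'\son\w+\s*=\s*"[^"]*"', '', html)
    return html

# Bypass 1: an unquoted attribute value - the pattern only matches quotes.
print(naive_filter('<img src=x onerror=alert(1)>'))

# Bypass 2: an unclosed tag - browsers forgive it, the regex insists on
# a closing </script> and therefore never matches.
print(naive_filter('<script>alert(1)'))
```

Both payloads survive untouched, and each new special case you patch in makes the regex harder to audit, not safer.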
Another conversion layer

When you're trying to clean valid HTML of malicious content, the right approach is to whitelist: keep only allowed tags, only allowed attributes, and so on. Once your input HTML can be invalid (improperly nested tags, unclosed tags, mixed quoting for attributes, etc.), your task gets much, much harder, but it's still doable. But when you introduce a different, intermediary syntax to handle (BBCode in this case) - well, you're toast. You need to make sure that:
- Your HTML-to-BBCode conversion corrects invalid HTML code
- Your HTML-to-BBCode conversion filters out malicious HTML code
- Your HTML-to-BBCode conversion always produces valid BBCode
- The conversion from BBCode to HTML is flawless and produces valid HTML
- Neither conversion introduces its own attack vectors
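The last two requirements are the ones that usually break. As a sketch of how the intermediary layer itself becomes an attack vector, here is a hypothetical, naive BBCode-to-HTML converter (invented for this example) where each tag is a simple regex substitution:

```python
import re

# A hypothetical, naive BBCode-to-HTML converter: one regex per tag.
def bbcode_to_html(text):
    text = re.sub(r'\[b\](.*?)\[/b\]', r'<b>\1</b>', text)
    # The url tag interpolates user input straight into an attribute...
    text = re.sub(r'\[url=(.*?)\](.*?)\[/url\]',
                  r'<a href="\1">\2</a>', text)
    return text

# ...so a crafted "URL" breaks out of the attribute and injects a handler:
payload = '[url=" onmouseover="alert(1)]click me[/url]'
print(bbcode_to_html(payload))
# -> <a href="" onmouseover="alert(1)">click me</a>
```

No HTML tag characters ever appeared in the input, yet the output contains an event handler: the conversion layer manufactured the XSS all by itself.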
The correct approach is...

If you need to allow some HTML from users on your pages - use a full-blown parser! Start from a perfectly clean HTML root node and keep adding valid nodes (if they match your whitelist) until you're done. Or throw all of the HTML into a proper HTML parser object and then traverse the tree, leaving only safe nodes and attributes. Whitelist is the key word in both algorithms.
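A minimal sketch of the traverse-and-whitelist idea, using Python's standard-library parser (the tag and attribute lists are arbitrary choices for this example; a real application should use a maintained sanitizer library instead):

```python
from html.parser import HTMLParser
from html import escape

# Whitelist: everything not listed here is dropped, never "fixed".
ALLOWED_TAGS = {'b', 'i', 'em', 'strong', 'p', 'a'}
ALLOWED_ATTRS = {'a': {'href'}}

class WhitelistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            return  # unknown tag: drop it entirely
        safe = ''
        for name, value in attrs:
            value = value or ''
            if name in ALLOWED_ATTRS.get(tag, set()) \
                    and not value.lower().strip().startswith('javascript:'):
                safe += ' {}="{}"'.format(name, escape(value, quote=True))
        self.out.append('<{}{}>'.format(tag, safe))

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.out.append('</{}>'.format(tag))

    def handle_data(self, data):
        self.out.append(escape(data))  # re-escape all text content

def sanitize(html):
    parser = WhitelistSanitizer()
    parser.feed(html)
    parser.close()
    return ''.join(parser.out)

print(sanitize('<p onclick="x()">hi <b>bold</b><script>alert(1)</script></p>'))
```

Note the limitations of a sketch this small: text inside dropped tags survives as escaped plain text, and tag balancing is not enforced - a production sanitizer rebuilds the whole tree, which is exactly what the projects below do.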
I can only recommend the PHP projects that do this, namely HTMLPurifier and Wibble, but I'm sure there are equivalents in your environment. For example, Wibble:
- Converts all HTML input to HTML safe UTF-8
- Loads the HTML into a DOMDocument object
- Applies one or more filters (DOM manipulators) to the HTML DOM
- Extracts the filtered HTML from DOM and applies HTML Tidy (ext/tidy)
- Converts the final HTML to the user's selected character encoding (if not UTF-8)
And be skeptical about third-party code you're including.
Bonus

Server-side BBCode parsing may also be wrong (likely for the same reasons as the JS version) - try http://www.bbcode-to-html.com/ with my example code.
As usual, the code displayed here is on my GitHub.