One of the wrong ways to protect from XSS is to use BBCode. Today I encountered
Look into details
The described project uses a TinyMCE WYSYWIG editor with bbcode plugin to convert a HTML text entered by a user into bbcode, which is then persisted (in a database I suppose). To convert it back to HTML they use a snippet of 3rd party Javascript code (yes, the conversion happens on the client, not on a server!). A quick look at the code is enough to convince me that it's vulnerable.Regexp for protection - just don't do it! |
Yet another XSS |
This particular script failed for the following reasons:
Regular expressions
Regular expressions, while being a brilliant tool to process text data, are pretty clumsy when they're given nested markup language to parse (like XML or HTML, or BBCode). It's just not the right tool to deal with it. It's not easy to account for:- proper tag nesting,
- proper escaping,
- different allowed attributes,
- multi-line text,
- different encodings.
It gets even worse because browsers are usually forgiving for the invalid HTML syntax. There are many XSS attack vectors that target an invalid regex that did not account for some special case. Recent Twitter @onmouseover flaw was most likely possible because of an invalid regex. When you're trying regexps to do your filtering, it will quickly evolve into complicated, almost cryptic code that noone discovered holes in yet.
Another conversion layer
When you're trying to clean valid HTML from malicious content, you should take an approach to choose only allowed tags, only allowed attributes etc. Once your HTML can be invalid (improperly nested tags, not closed tags etc, mixed quoting for attributes), your task gets much much harder, but still doable. And when you're introducing a different, intermediary syntax to handle (BBCode in this case) - well, you're toasted. You need to make sure that:- Your HTML to BBCode corrects invalid HTML code
- Your HTML to BBCode filters out malicious HTML code
- Your HTML to BBCode always produces valid BBCode
- The conversion from BBode to HTML is flawless and produces valid HTML
- Both conversions do not introduce their own attack vectors
The correct approach is...
If you need to allow some HTML from users on your pages - use a full blown parser! Start from the perfectly clean HTML root node and keep adding the valid nodes (if they match your whitelist) until you're done. Or throw all your HTML into a proper HTML parser object and then traverse the tree leaving only safe nodes/attributes. Whitelist is the key word for both algorithms.I can only recommend PHP projects that do that, namely HTMLPurifier and Wibble but I'm sure that there are equivalents in your environment. For example, Wibble:
- Converts all HTML input to HTML safe UTF-8
- Loads the HTML into a DOMDocument object
- Applies one or more filters (DOM manipulators) to the HTML DOM
- Extracts the filtered HTML from DOM and applies HTML Tidy (ext/tidy)
- Converts the final HTML to the user's selected character encoding (if not UTF-8)
And be skeptical about 3rd party code you're including.
Bonus
Server side BBCode parsing may also be wrong (and likely from the same reasons that the JS version) - try http://www.bbcode-to-html.com/ with my example code.As usual - the code displayed here is on my github.
Hello.. I'm Łukasz from B21. I closed our post yesterday, as I read your article. I think it will be better not to show it on the web, because it might be confusing for others. Thank you very much for pointing us the mistake we made, this is really great piece of article and we appriciate it! Regars
ReplyDeleteHmm, I don't think you put enough
ReplyDeleteThe Internet has created a new economic ecosystem, the
ReplyDeleteE-commerce marketplace,
and it has become the virtual main street of the world.
E-commerce