One thing that makes web development both fascinating and exhausting is how the same subjects keep popping up, over and over, without resulting in any clear answers. On the one hand it’s remarkably easy to put up your own website. But building a site that can handle a lot of traffic and is easy to change and modify is not so easy.

What’s got me started is this recent blog post by Sam Ruby, owner of the job to die for at IBM, whom PHP can thank for the Java extension, and who’s been a member of the Apache group and has had a part in countless other web innovations and groups.

The issue? How to publish content submitted to your site by its visitors. This one is as old as that most dated of web apps – the Guestbook – and if you trawl through the comments on Sam’s site, you’ll quickly get the idea that, still, no one’s too sure of the answer.

The basic problem, as you no doubt know, is that to allow visitors to your blog or forum to submit more than just plain, unformatted text, you need to give them some kind of mechanism to add structure. But if you give them access to the entire HTML vocab (plus JavaScript and CSS), not only will your site be an ever-changing mess but you’ll also potentially be exposing visitors to things like XSS exploits (side note – Chris Shiflett: Foiling Cross-Site Attacks).

I’ve been developing a fetish recently for knocking up lists that describe common development problems, as a means to really nail them down. Here’s a guess at what a good solution to this problem needs to do (subject to my opinion / limited vision):

Good Smells

1. Prevents the structure of your site from being broken

2. Poses no threat to your site’s or its visitors’ security

3. Provides visitors with enough power, in terms of how they are able to format their submissions, to be happy.

4. Is easy to parse (extracting submitted formatting and handling it should not require a PhD)

5. Is easy to use. Believe it or not, there are people “out there” who have no idea of HTML.

6. Preserves the intent of the formatting. Not quite sure how to explain what I mean here but the thought test might be: “Is it possible to transform submitted content to other output types?” – i.e. is generating a PDF document, as opposed to HTML, at least feasible.

Any more / less?

Meeting all those requirements is probably an impossibility – it’s going to be a compromise at some level.

Some of the common solutions to this, off the top of my head, are:

Common Styles

a. Allowing a limited subset of “safe” HTML. This addresses points 3. and 6. pretty well and, assuming a basic knowledge of HTML, places no additional requirements on users to learn new markup syntaxes. Also on the plus side (depending on your point of view), there are plenty of WYSIWYG “plugins” these days, such as Editize or JavaScript based solutions. The downside is it’s very easy to get wrong, particularly in terms of security (see PHP’s strip_tags() function and the comments that follow on “evil attributes”). The other problem is how to parse it. Unless you require users to submit well formed XML, your standard XML parser will choke on HTML, and using regular expressions to parse HTML is often a recipe for nightmares. Most languages used on the web have evolved an HTML capable parser or two by now, though. That said, it’s almost shocking that PHP, in particular, has come so far with, essentially, no built in HTML parser (thankfully PHP5 brings HTML Tidy to the fray, plus the DOM extension can now handle HTML).
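To make the “safe subset” idea a bit more concrete, here’s a rough sketch in Python using the standard library’s html.parser, which doesn’t choke on badly formed HTML. The whitelist, the allowed URL schemes and the names (ALLOWED_TAGS, WhitelistSanitizer, sanitize) are all my own assumptions for illustration, not a vetted security filter. The point is the approach: parse, keep only tags and attributes you explicitly allow, and escape everything else.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist: tag name -> attributes allowed on that tag.
ALLOWED_TAGS = {"b": set(), "i": set(), "em": set(), "strong": set(),
                "p": set(), "br": set(), "a": {"href"}}
SAFE_SCHEMES = ("http://", "https://", "mailto:")  # assumed policy for links
VOID_TAGS = {"br"}                   # tags that never get a closing tag
DROP_CONTENT = {"script", "style"}   # drop these tags and their contents


class WhitelistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []        # output fragments
        self.open_tags = []  # allowed tags still waiting to be closed
        self.skip = 0        # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip += 1
            return
        if self.skip or tag not in ALLOWED_TAGS:
            return  # drop the tag itself, keep any text inside it
        kept = []
        for name, value in attrs:
            if name not in ALLOWED_TAGS[tag] or value is None:
                continue  # "evil attributes" such as onclick end up here
            if name == "href" and not value.strip().lower().startswith(SAFE_SCHEMES):
                continue  # e.g. javascript: URLs
            kept.append(' %s="%s"' % (name, escape(value, quote=True)))
        self.out.append("<%s%s>" % (tag, "".join(kept)))
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip = max(0, self.skip - 1)
            return
        if tag in self.open_tags:
            # Close anything opened after this tag so the output stays nested.
            while self.open_tags:
                top = self.open_tags.pop()
                self.out.append("</%s>" % top)
                if top == tag:
                    break

    def handle_data(self, data):
        if not self.skip:
            self.out.append(escape(data, quote=False))


def sanitize(html_text):
    parser = WhitelistSanitizer()
    parser.feed(html_text)
    parser.close()
    # Close whatever the visitor forgot to close (protects the page structure).
    while parser.open_tags:
        parser.out.append("</%s>" % parser.open_tags.pop())
    return "".join(parser.out)
```

Something like sanitize('&lt;b onclick="evil()"&gt;hi&lt;/b&gt;') keeps the bold but loses the attribute, and unclosed tags are closed for you, which goes some way towards points 1. and 2.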

b. Wiki style. Using “markup” like *this for bold* and _this for italic_. Wiki style often starts out well, being easy to use and secure, but is perhaps weak on point 3. Things go downhill the more formatting options you provide to users, with the parsing getting progressively harder to manage and the syntax weirder, like !!!this for some large text. Users are required to learn this alien markup and may find it difficult to express their precise formatting intent (intent thereby being lost or becoming arbitrary, as text like McDougals gets automatically turned into a link to a new wiki page). In the end, I don’t think wikis do much to address those beyond their original target audience – software developers (flame on…)
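As a rough illustration of why wiki style starts out easy, here’s a minimal Python sketch that handles just the two examples above. The function name wiki_to_html and the exact rules (no spaces just inside the markers, underscores inside words left alone) are my own assumptions rather than any particular wiki’s syntax.

```python
import re
from html import escape


def wiki_to_html(text):
    # Escape first, so any HTML a visitor types is displayed, not interpreted.
    html = escape(text, quote=False)
    # *bold*: the marked text must start and end with a non-space character.
    html = re.sub(r"\*(\S(?:[^*\n]*\S)?)\*", r"<b>\1</b>", html)
    # _italic_: same rule, and underscores inside words are left untouched.
    html = re.sub(r"(?<!\w)_(\S(?:[^_\n]*\S)?)_(?!\w)", r"<i>\1</i>", html)
    return html
```

Two rules, two regular expressions. Every extra rule you add (headings, lists, links, nesting) has to be checked against all the others, which is where the parsing starts to get hard to manage.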

c. Implied formatting. This is less frequently used as a standalone mechanism but turns up often as part of other styles. Essentially whitespace takes on a meaning it doesn’t normally have with HTML. PHP offers nl2br() for example. It’s definitely easy to use and fairly safe (depending on what you do with URLs, for example). It’s also easy to parse. Where it fails is that it typically offers little power to the user and it’s very easy to lose the intent of the formatting, hence it’s often augmented with one or more of the other styles.
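For what it’s worth, implied formatting fits in a few lines. The sketch below is Python rather than PHP, and goes one step beyond what nl2br() does: treating a blank line as a paragraph break is my own assumption, as is the name implied_format.

```python
import re
from html import escape


def implied_format(text):
    # Normalise line endings from different platforms.
    text = text.replace("\r\n", "\n").replace("\r", "\n").strip()
    if not text:
        return ""
    rendered = []
    # Assumed rule: one or more blank lines separate paragraphs.
    for para in re.split(r"\n\s*\n", text):
        safe = escape(para, quote=False)
        # A single newline inside a paragraph becomes a line break.
        rendered.append("<p>" + safe.replace("\n", "<br>\n") + "</p>")
    return "\n".join(rendered)
```

Note how little the visitor can actually express here: paragraphs and line breaks, nothing more.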

d. BBCode style. Essentially you use your own custom markup, one which will be ignored by web browsers completely should any unparsed fragments turn up in the finished page. Although this can be a little tricky for users who have never run into it before, it’s tried, tested and successful, as forum apps like vBulletin and phpBB have proved, to the point where BBCode is almost an (unwritten) standard. Surprisingly, on Sam’s blog, no one mentioned it, but perhaps that reflects the common divide between PHP developers and the rest of the web: doing it vs. talking about it. For end users, it generally means that basic HTML tags have been translated more or less one to one to BBCode – simply replace
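Here’s a minimal sketch of the BBCode idea in Python. The tag set ([b], [i], [u], [url=...]) and the name bbcode_to_html are assumptions for illustration; real forum software supports far more. Everything is escaped first, then only complete, recognised tag pairs are translated, so an unparsed fragment simply shows up as harmless text.

```python
import re
from html import escape

# Hypothetical mapping of BBCode tag -> HTML tag.
SIMPLE_TAGS = {"b": "b", "i": "i", "u": "u"}


def bbcode_to_html(text):
    # Escape everything, so the only HTML in the output is what we generate.
    html = escape(text, quote=True)
    for bb, tag in SIMPLE_TAGS.items():
        pattern = re.compile(r"\[%s\](.*?)\[/%s\]" % (bb, bb),
                             re.IGNORECASE | re.DOTALL)
        html = pattern.sub(r"<%s>\1</%s>" % (tag, tag), html)
    # Links: only http(s) URLs are accepted (an assumed policy).
    html = re.sub(r"\[url=(https?://[^\]\s\"]+)\](.*?)\[/url\]",
                  r'<a href="\1">\2</a>', html,
                  flags=re.IGNORECASE | re.DOTALL)
    return html
```

This is deliberately naive: badly nested input such as [b][i]x[/b][/i] produces badly nested HTML, so anything serious wants a parser that tracks open tags on a stack rather than a pile of regular expressions.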

Any more?

One notable hybrid of them all is Textile markup, which throws in a little of everything. Those times I’ve been subjected to it, the result was “Yuck!”. Another hybrid seems to be Markdown.

Practical Notes

Couple of quick points, outside of security issues;

– When storing visitor submitted content in a database, for later display, apply the parsing operations after the content has been stored, not before. In other words, don’t parse, INSERT then SELECT but INSERT, SELECT then parse (if performance is an issue, cache the HTML resulting from the parse). The basic reason for this is it makes editing the content later (either by you as a site admin or by the visitor themselves) easy – you display their content (more or less) as is in a textarea rather than having to reverse the parsing operation to give them back what they started with (a recipe for headaches). You also stand a better chance of preserving the intent of the formatting, which is easy to lose if you’re required to reverse the parsing. You might consider filtering the content before storing it – certainly for SQL injections and possibly for stuff like “bad word filters” but don’t transform or add to the content.
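The “INSERT, SELECT then parse” order might look something like this sketch, in Python with SQLite. The table layout, the CommentStore class and the stand-in render() function are all hypothetical; swap in whichever parser you actually use. The raw text is what gets stored (via bound parameters, which takes care of SQL injection), and the HTML is only ever a cached by-product.

```python
import sqlite3
from html import escape


def render(raw):
    # Stand-in for the real markup parser: escape, then newlines to <br>.
    return "<p>" + escape(raw, quote=False).replace("\n", "<br>\n") + "</p>"


class CommentStore:
    def __init__(self, conn):
        self.conn = conn
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS comments "
            "(id INTEGER PRIMARY KEY, raw TEXT NOT NULL)")
        self.cache = {}  # comment id -> rendered HTML

    def add(self, raw):
        # Store exactly what the visitor typed; placeholders handle quoting.
        cur = self.conn.execute("INSERT INTO comments (raw) VALUES (?)", (raw,))
        self.conn.commit()
        return cur.lastrowid

    def raw(self, comment_id):
        # What goes back into the textarea when someone edits the comment.
        row = self.conn.execute(
            "SELECT raw FROM comments WHERE id = ?", (comment_id,)).fetchone()
        return row[0] if row else None

    def update(self, comment_id, raw):
        self.conn.execute(
            "UPDATE comments SET raw = ? WHERE id = ?", (raw, comment_id))
        self.conn.commit()
        self.cache.pop(comment_id, None)  # rendered copy is now stale

    def html(self, comment_id):
        # Parse after SELECT, caching the result if performance matters.
        if comment_id not in self.cache:
            raw = self.raw(comment_id)
            if raw is None:
                return None
            self.cache[comment_id] = render(raw)
        return self.cache[comment_id]
```

Editing never has to reverse the parse: raw() hands back the original text untouched, and html() just re-renders after an update.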

– Document your markup. The number of blogs I see that expect visitors to guess (nudge nudge Sitepoint ;))…

Any more?

While I’m here, some PEAR projects that can help in this area;

PEAR::HTML_BBCodeParser – you don’t even need to write your own (this has even become a WACT Tag). Note stuff like converting HTML entities and handling linefeeds is still your job.

PEAR::Text_Wiki – in effect, an abstraction layer for wiki markup. Text_Wiki “captures” all the common document structuring requirements end users may have as “rules”, and can translate whatever markup you like to those rules, with the rules rendering (X)HTML. Very clever project. Would also work as a BBCode parser (and pretty much anything else in fact).

XML_HTMLSax – a SAX parser which won’t choke on HTML (badly formed XML). In fact the name HTMLSax is a little misleading, as it has no specific knowledge of HTML vocab. It’s much like Python’s HTMLParser, although tags which are closed implicitly result in a fourth argument to the open tag handler with XML_HTMLSax, as well as a call to the close handler, while Python’s HTMLParser has a “startendtag” callback for this situation. A couple of projects I’ve seen but never tried are HTML Parser for PHP-4, which provides a state based API, and PHP HTML Parser, which does have some knowledge of HTML and seems to be designed to transform HTML in a single pass (from the user’s point of view). Note also Simple Test has a (you guessed it) simple SAX based parser for HTML – it uses regular expressions, based on the Lexer in lamplib – I still need to benchmark it against HTMLSax, which uses a string position based approach to parsing, just for interest.
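To show the Python side of that comparison, here’s a small sketch using HTMLParser (which lives in the html.parser module in current Python versions). The EventLogger class and parse_events function are just names I’ve made up for the demonstration, and the br tag in the usage note is my own choice of example.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Records the SAX-style callbacks the parser fires, in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_startendtag(self, tag, attrs):
        # Fired for self-closing tags instead of separate start/end calls.
        self.events.append(("startend", tag))

    def handle_data(self, data):
        if data.strip():
            self.events.append(("data", data))


def parse_events(html_text):
    parser = EventLogger()
    parser.feed(html_text)
    parser.close()  # flush any buffered text
    return parser.events
```

Feeding it something like "&lt;p&gt;one&lt;br /&gt;two" gives a start event for p, a single startend event for br, and no complaint about the paragraph never being closed.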

Anyway – long rant. Enough already.