
Switching to XHTML

difficulty 2 comments 0 added Feb 28, 2011 category web

A couple of years ago I decided to catch up with the times and make my web projects compatible with web standards. I chose to switch to the stricter, more structured XHTML and had to fight my way through all the intricacies, which included XML syntax, document structure, JavaScript pitfalls and server HTTP response header tuning. I know that each issue probably deserves a separate article, but you can easily google the details yourselves. My goal here is a short all-in-one overview of the problems you may face, with some hints.

XML syntax

The XML syntax was probably the simplest part, so I won't dwell on it. Just make sure that

  • tag and attribute names are lowercase
  • attribute values are enclosed in quotes ("")
  • every element is closed, with a closing slash on empty elements (<br />)
  • script and style content is wrapped in <![CDATA[...]]> rather than hidden inside <!-- ... --> comments
  • a bare ampersand appears only as the start of an entity; everywhere else it is written as &amp;

As I remember, ampersands cost me a couple of hours. I was trying to insert a Google Maps code snippet that contained ampersands in a URL as GET parameter delimiters. You must make sure that everywhere in the document (scripts, URLs, anywhere else), except inside CDATA sections, of course, an ampersand is written as the &amp; entity.
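
As a rough illustration of the ampersand rule, here is a small JavaScript sketch that escapes bare ampersands in a URL before it is pasted into an XHTML attribute. The function name escapeAmpersands and the sample URL are made up for this example, and the regular expression only recognizes the general shape of an entity; it does not check that a named entity really exists.

```javascript
// Hypothetical helper: replace every ampersand that does not already
// start an entity reference (&name; &#123; &#x1F;) with &amp;.
function escapeAmpersands(text) {
  return text.replace(/&(?![A-Za-z][A-Za-z0-9]*;|#[0-9]+;|#x[0-9A-Fa-f]+;)/g, '&amp;');
}

// Example URL with GET parameters (illustrative, not a real endpoint).
var url = 'http://example.com/maps?q=Berlin&z=12&hl=en';
console.log(escapeAmpersands(url));
```

Text that is already escaped passes through unchanged, so the helper can be applied more than once without producing &amp;amp;.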

Document structure

An XML document should start with an XML declaration:

<?xml version="1.0" encoding="utf-8" standalone="no" ?>

This is an example of a prolog with its attributes set to their default values. Using the prolog is always recommended, but there is one annoying exception: IE6. A prolog automatically switches IE6 into quirks mode.

The main practical value of the prolog is its encoding "attribute". If your encoding is utf-8, you can skip the prolog without losing anything in practice, though it makes your commitment to standards look questionable. Specifying the encoding is discussed further below.

The next stop is the DOCTYPE declaration, which connects your document with a specific DTD. I'd say you must use it, if only because IE needs it to render your document correctly. And I always keep standards in mind!

If you are determined to use pure XHTML, you can use the strict DTD:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

Otherwise you can use the looser "transitional" DTD, intended to make the switch to the standard less painful:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

The main difference between the strict and transitional DTDs is that the strict one contains fewer of the old presentational HTML tags and attributes and keeps the features that matter for logical markup.

The last thing about XHTML document structure that matters for successful validation is the root element:

<html xmlns="http://www.w3.org/1999/xhtml">

You just need to add the xmlns attribute, which sets the XHTML namespace as the default for your XML document.
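
To see how the pieces of this section fit together, here is a JavaScript sketch that assembles a minimal document skeleton from them. The function buildXhtmlSkeleton and its option names are hypothetical, invented only for this illustration, and the title is inserted as is, so it is assumed to be already escaped.

```javascript
// Hypothetical helper that assembles the parts described in this section:
// optional XML declaration, DOCTYPE, and the namespaced root element.
function buildXhtmlSkeleton(options) {
  var dtd = options.strict ? 'Strict' : 'Transitional';
  var lines = [];
  if (options.includeProlog) {
    // Leave this out for IE6, which falls back to quirks mode when it sees it.
    lines.push('<?xml version="1.0" encoding="utf-8"?>');
  }
  lines.push('<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 ' + dtd + '//EN"');
  lines.push('  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-' + dtd.toLowerCase() + '.dtd">');
  lines.push('<html xmlns="http://www.w3.org/1999/xhtml">');
  lines.push('<head><title>' + options.title + '</title></head>');
  lines.push('<body><p>Hello</p></body>');
  lines.push('</html>');
  return lines.join('\n');
}

console.log(buildXhtmlSkeleton({ strict: true, includeProlog: true, title: 'Test page' }));
```

Passing includeProlog: false gives the variant without the XML declaration, which is the one to send to IE6.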

JavaScript pitfalls

Let's start with something simple. As I already mentioned, you should use lowercase when working with XHTML. One silly mistake is using uppercase letters in event attributes:

<img src="some.jpg" alt="some picture" onClick="javascript:someFunc()" />

It is an easy slip when you're used to camel case. Use "onclick" instead.

Also wrap the script body in a CDATA section, <![CDATA[ ... ]]>, if it is placed inside the XML document, because JavaScript may contain special characters that would otherwise be parsed as part of the XML (such as ampersands and less-than signs). And don't forget to comment out the CDATA markers with JavaScript comment markers:

//<![CDATA[
function someFunc()
{
...
}
//]]>
// or, even better, this:
/* <![CDATA[ */
function someFunc()
{
...
}
/*]]>*/

Finally, the most painful problem is that some scripts may stop working because they use the document.write() JavaScript function, which browsers disable when the document is served as an XML application.

If you use document.write() yourself, you had better stop anyway; altering a document in such a frivolous way doesn't fit the spirit of web standards. Instead you can use the innerHTML property of the corresponding DOM element.

If you are allowed to, you can change third-party code (such as counters) by wrapping the counter code in a named div, so that instead of

<script type="text/javascript">
/*<![CDATA[*/ //don't forget to comment out
document.write(
'<a href="http://www.counterprovider.com/click" target="_blank"><img src="...
 /*]]>*/</script>

write:

<script type="text/javascript">
/*<![CDATA[*/ //don't forget to comment out
document.getElementById('counter').innerHTML +=
'<a href="http://www.counterprovider.com/click" target="_blank"><img src="...
 /*]]>*/</script>
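
Another possible replacement for document.write() is to build the nodes with DOM methods, which work the same way whether the page is parsed as HTML or as XML. The sketch below assumes a wrapper element with the id "counter", as in the example above. The function name appendCounterLink and the image path in the usage comment are hypothetical placeholders, and the document is passed in as a parameter only so that the function can be tried outside a browser.

```javascript
// Namespace that XHTML elements belong to.
var XHTML_NS = 'http://www.w3.org/1999/xhtml';

// Hypothetical helper: append <a href="..."><img src="..." alt="counter" /></a>
// to the element with the given id, without document.write() or innerHTML.
function appendCounterLink(doc, containerId, href, imgSrc) {
  var container = doc.getElementById(containerId);
  if (!container) {
    return false; // wrapper div not found, nothing to do
  }
  var link = doc.createElementNS(XHTML_NS, 'a');
  link.setAttribute('href', href);
  var img = doc.createElementNS(XHTML_NS, 'img');
  img.setAttribute('src', imgSrc);
  img.setAttribute('alt', 'counter');
  link.appendChild(img);
  container.appendChild(link);
  return true;
}

// In a browser this could be called as (the image path is a placeholder):
// appendCounterLink(document, 'counter', 'http://www.counterprovider.com/click', '/img/counter.png');
```

Because the attribute values are set as plain strings, ampersands in the URL need no escaping here.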

The most troublesome JavaScript code is the kind you can't change for some reason, like Google AdSense. There is no universal recipe for that case, and each problem has to be explored separately.

As for AdSense, a solution was found: embed an object that holds an HTML document containing the script. This works well in all browsers except IE, which would open the target page inside the embedded frame. But since IE is not sent the document as application/xhtml+xml anyway, we need not bother and can use the script as it is. For other browsers you can use:

 
<object id="adSense" data="/google/adSense.html" 
type="text/html" width="124" height="640" border="0" />
 

where adSense.html is a document containing the script in its HTML body, and it must be served as text/html (not XHTML).

You can implement browser detection and branching in different ways, for example as Stu Nicholls did, or in your template language, or purely in PHP or whatever you use. I use PHP and Open Power Template: in my PHP code I detect the user's client by analyzing the $_SERVER["HTTP_USER_AGENT"] variable and then pass a browser variable to my template for branching.
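
The detection step itself can be sketched in a few lines. The example below is in JavaScript rather than my PHP, and the function name isInternetExplorer is made up for illustration. User-agent sniffing is fragile by nature, so treat this as a starting point rather than a reliable test.

```javascript
// Hypothetical helper: true when the User-Agent string looks like Internet Explorer.
// "MSIE " appears in IE 10 and older, "Trident/" in IE 11.
function isInternetExplorer(userAgent) {
  if (typeof userAgent !== 'string') {
    return false; // header missing
  }
  // Old Opera versions also sent "MSIE" while identifying as Opera, so exclude them.
  if (userAgent.indexOf('Opera') !== -1) {
    return false;
  }
  return /MSIE \d|Trident\//.test(userAgent);
}

console.log(isInternetExplorer('Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)'));
```

The result would then be passed to the template, which picks either the plain script for IE or the object element for everything else.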

Briefly about encoding

The encoding can (and should) be specified in the following places:

  • In the server response header: Content-Type: application/xhtml+xml; charset=utf-8. This can be set by your server-side scripting language or with mod_rewrite in an .htaccess file. See the next section.
  • In the XML declaration: <?xml version="1.0" encoding="utf-8"?>
  • In a meta tag: <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

The last two guarantee that the page is displayed properly after a user has saved it locally. But, as you remember, the XML declaration may be omitted to avoid IE6 quirks mode, so the meta tag is the last resort in that case.

Serving document as application/xhtml+xml

Everything mentioned above wouldn't make much sense if your document, even a properly composed one, isn't served as application/xhtml+xml. Only then does the browser apply strict rules and checks to your document and show whether you comply with the standard. As I said, IE normally can't handle application/xhtml+xml, but there is a workaround to turn on its XML parser even in that case.

How a browser processes your document depends on the browser's capabilities and on the content type header that the server sends along with the document. If a browser accepts the application/xhtml+xml content type, it sends a corresponding Accept header. For example, my Google Chrome sends:

Accept:application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5

You can check this header and set the Content-Type response header accordingly:

Content-Type:application/xhtml+xml;charset=utf-8

I prefer to do it in PHP or another server-side language. The following code gives a hint of how to implement it:

 
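// Send XHTML as XML only to clients that advertise support for it.
$accept = isset($_SERVER["HTTP_ACCEPT"]) ? $_SERVER["HTTP_ACCEPT"] : "";
$xml = stristr($accept, "application/xhtml+xml") !== false;
header("Content-Type: " . ($xml ? "application/xhtml+xml" : "text/html") . "; charset=utf-8");
header("Vary: Accept"); // the response depends on the Accept header, so tell caches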
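
The same negotiation can be sketched in JavaScript for anyone not using PHP. Unlike a plain substring check, this version also skips application/xhtml+xml when the client lists it with q=0, which means "not acceptable". The function name chooseContentType is hypothetical, and hooking it into a real server is left out.

```javascript
// Hypothetical helper: pick the Content-Type to send based on the Accept header.
function chooseContentType(acceptHeader) {
  var xhtml = 'application/xhtml+xml';
  var parts = String(acceptHeader || '').split(',');
  for (var i = 0; i < parts.length; i++) {
    var fields = parts[i].split(';');
    if (fields[0].trim().toLowerCase() !== xhtml) {
      continue; // some other media type
    }
    var q = 1; // quality defaults to 1 when no q parameter is given
    for (var j = 1; j < fields.length; j++) {
      var m = /^\s*q\s*=\s*([0-9.]+)\s*$/i.exec(fields[j]);
      if (m) {
        q = parseFloat(m[1]);
      }
    }
    if (q > 0) {
      return xhtml + '; charset=utf-8';
    }
  }
  return 'text/html; charset=utf-8';
}

// The Accept header quoted above, as sent by my Google Chrome:
console.log(chooseContentType('application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5'));
```

A missing or empty Accept header falls through to text/html, which is the safe choice for old browsers.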
 

It can also be done with Apache mod_rewrite, enabled in an .htaccess file (I snatched this snippet from codingforums.com):

 
# serve .xhtml as xml - this could equally be .html
AddType  application/xhtml+xml xhtml
# serve as tag soup if necessary
RewriteEngine On
RewriteBase /
RewriteCond %{HTTP_ACCEPT} !application/xhtml\+xml
RewriteCond %{HTTP_ACCEPT} (text/html|\*/\*)
RewriteCond %{REQUEST_FILENAME} .*\.xhtml
RewriteRule ^.*$ - "[T=text/html,L]"

This is probably the only way to serve static content on third-party hosting.

Most of my projects use the PHP-based Open Power Template engine, which has a convenient interface for setting this header.

P.S. Don't forget to validate your document!


I also highly recommend this book; it is quite helpful and a nice read:

Developing with Web Standards