XML.com: XML From the Inside Out

Raising the Bar on RSS Feed Quality

November 19, 2002

RSS is an XML-based syntax for facilitating the exchange of information in a lightweight fashion through the distribution (or feeding) of resources. Publishers can use this versatile and increasingly essential format to assist end users in tracking and consuming content. Netscape originally developed the format but lost interest and eventually abandoned work on it. This created an identity crisis that devolved into varying interpretations, with dispute over even the meaning of the RSS acronym: RDF Site Summary, Rich Site Summary, or Really Simple Syndication. And as divergent efforts work to develop RSS, one result has been a diminished overall quality in RSS feeds.

In this article, I provide an overview of RSS's core syntax, then I examine the poor state of RSS feed quality and provide some recommendations for authoring more useful and effective feeds. This examination is not a review of the RSS specification, nor is it an emphatic plea for strict compliance. Instead, this article provides an approach to authoring RSS feeds that is neutral, practical, and conservative. RSS feeds are simply too useful a mechanism for information exchange services to neglect; it is imperative that we improve their effectiveness.

RSS Syntax Basics

Despite differences between RSS 1.0 and RSS 0.9x/2.0, these formats share a common core of elements. Let's take a quick look at the primary elements of RSS syntax. (Although these elements are common to all versions, this review does not describe any one specific version. Some of these elements are required and some are not, depending on which specification you're referencing.)

It all begins with the <channel> tag, which contains metadata about the feed. The primary metadata elements are the <title>, <description>, and <link> tags, though others have been specified. Inside the <channel> tagset are one or more <item>s. Like <channel>, each <item> contains a <title>, a <description>, and a <link> tag. Here is a sample RSS feed.


<rss>
    <channel>
       <title>tima thinking outloud.</title>
       <description>The thoughts of Timothy Appnel on emerging technology and trends.</description>
       <link>http://tima.mplode.com/</link>
       <item>
          <link>http://www.mplode.com/tima/archives/000137.html</link>
          <title>Released: mt-rssfeed v1.0 and mt-list v0.2</title>
          <description>I released two new versions of my MovableType plugins today.</description>
       </item>
    </channel>
</rss>

All the RSS specifications wrap <channel> with a root tag. I've used <rss>, but the 0.90 and 1.0 specifications use <rdf:RDF> as the root tag. At this point, however, these distinctions are not important from a functional point of view.
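To make the core structure concrete, here is a minimal sketch in Python that reads the sample feed above and pulls out the <title>, <description>, and <link> of the channel and of each item. It uses the standard library's xml.etree.ElementTree. The names SAMPLE_FEED and read_core_elements are my own, invented for illustration, and the sketch assumes the un-namespaced <rss> layout shown above, with items nested inside <channel>. A feed rooted at <rdf:RDF> would need namespace-aware lookups.

```python
import xml.etree.ElementTree as ET

# The sample feed from this article (hypothetical constant name).
SAMPLE_FEED = """<rss>
    <channel>
       <title>tima thinking outloud.</title>
       <description>The thoughts of Timothy Appnel on emerging technology and trends.</description>
       <link>http://tima.mplode.com/</link>
       <item>
          <link>http://www.mplode.com/tima/archives/000137.html</link>
          <title>Released: mt-rssfeed v1.0 and mt-list v0.2</title>
          <description>I released two new versions of my MovableType plugins today.</description>
       </item>
    </channel>
</rss>"""

CORE_ELEMENTS = ("title", "description", "link")


def core_of(node):
    """Collect the three core child elements of a <channel> or <item>."""
    return {name: (node.findtext(name) or "").strip() for name in CORE_ELEMENTS}


def read_core_elements(feed_text):
    """Return the core metadata of the channel and of every item in it.

    Assumes an un-namespaced feed with <item>s nested inside <channel>.
    """
    root = ET.fromstring(feed_text)
    channel = root.find("channel")
    return {
        "channel": core_of(channel),
        "items": [core_of(item) for item in channel.findall("item")],
    }


if __name__ == "__main__":
    feed = read_core_elements(SAMPLE_FEED)
    print(feed["channel"]["title"])
    for item in feed["items"]:
        print(item["title"], item["link"])
```

Because the lookups are by element name rather than by position, the order of <link>, <title>, and <description> inside an <item> does not matter.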

That should give you a taste of RSS syntax. Now let's dive into my recommendations for improving the quality of RSS feeds.

Improving RSS Feed Quality

Over the centuries, we have developed time-tested best practices for the written word. More recently, through mass media, we have developed information layering techniques. Different RSS formats aside, most aggregators and toolkits make a good effort to extract information from any feed remotely like RSS. And with the significant adoption of RSS as well as resources like the Syndic8 directory, we can examine usage patterns and make informed assumptions. With some care and consideration we can publish effective, useful, and reliable content feeds with RSS today. Here are my recommendations:

All RSS feeds must be well formed XML.

I wasn't telling the whole truth when I said these were recommendations. This one is a real requirement because RSS is an XML format after all, and well-formedness is XML's baseline. If it's not well formed, it's not XML. The most common offenses are improperly encoded HTML and the use of HTML entities in RSS feeds. (HTML is typically not well-formed XML, and XML predefines only five named entities, &lt;, &gt;, &amp;, &quot;, and &apos;, whereas HTML defines many more.) If you're not sure whether a feed is well formed, try an online XML well-formedness checker such as RUWF. End users may not care about standards compliance, but they do want their content, and a feed that is not well formed may never reach them. It's not that hard to consistently comply with the XML standard. The remaining tips will help.
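If you would rather check locally, the same test can be sketched in a few lines of Python with the standard library's xml.etree.ElementTree, which raises ParseError on input that is not well formed. The function name check_well_formed and the three sample strings are my own inventions for illustration. This only tests well-formedness; it says nothing about whether the document is valid RSS.

```python
import xml.etree.ElementTree as ET


def check_well_formed(feed_text):
    """Return (True, None) if feed_text parses as XML, else (False, reason)."""
    try:
        ET.fromstring(feed_text)
    except ET.ParseError as err:
        return False, str(err)
    return True, None


# Hypothetical samples covering the two common offenses described above.
GOOD = "<rss><channel><title>A &amp; B</title></channel></rss>"
HTML_ENTITY = "<rss><channel><title>A&nbsp;B</title></channel></rss>"
UNCLOSED_TAG = (
    "<rss><channel><description>line one<br>line two"
    "</description></channel></rss>"
)

if __name__ == "__main__":
    for label, text in [
        ("good", GOOD),
        ("HTML entity", HTML_ENTITY),
        ("unclosed tag", UNCLOSED_TAG),
    ]:
        ok, reason = check_well_formed(text)
        print(label, "->", "well formed" if ok else "NOT well formed: " + reason)
```

The second sample fails because &nbsp; is an HTML entity that XML does not predefine, and the third fails because HTML's bare <br> is never closed.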

Use the RSS Validator Service.

Testing your RSS feeds' syntax for errors is now much easier thanks to the RSS Validator Service. Developed by Mark Pilgrim, along with Sam Ruby and Bill Kearney, it checks RSS feeds for problems and generates friendly and instructive messages for fixing them. The service is optimized for RSS 2.0, but also supports other versions of the format. This recent development is significant because it provides a much-needed tool for alerting publishers to issues in their syndication feeds. (The RSS Validator Service can also tell you if your feed is well-formed XML.)

Use CDATA for embedding HTML in <description> tags.

This is perhaps the most important recommendation I can make because it goes a long way toward avoiding malformed XML/RSS files, with almost no fuss. The alternative, entity-encoded HTML (also known as double entity-encoding), is quite common and is not going away anytime soon, but avoiding it will save you and others some headaches. Besides being a nonstandard practice within the XML specification, this method requires more processing cycles and unnecessarily adds to the file size. It's also prone to occasional error.

Consider this example:

Original HTML fragment: <b>&quot;foo!&quot;</b>
With the entity-encoded HTML method: &lt;b&gt;&amp;quot;foo!&amp;quot;&lt;/b&gt;
With CDATA to encode the HTML: <![CDATA[<b>&quot;foo!&quot;</b>]]>

The original HTML is untouched when you use CDATA, and the file-size advantage becomes increasingly clear as the amount of HTML increases. When you consider that the entity-encoded example above could equally be content that was never HTML-encoded in the first place, you can begin to see the kinds of errors that entity-encoding can introduce.
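The comparison above can be reproduced with a short Python sketch. It builds both forms of the <description> payload from the same HTML fragment and confirms that an XML parser hands back the same text either way. The helper names as_cdata, as_entity_encoded, and description_text are hypothetical, chosen for this illustration. One detail the example above does not cover: a CDATA section cannot contain the sequence ]]>, so the sketch splits any such occurrence across two adjacent CDATA sections, which is a common workaround rather than anything prescribed by an RSS specification.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def as_cdata(html):
    """Wrap html in a CDATA section, leaving the markup itself untouched.

    "]]>" would end the section early, so it is split across two sections.
    """
    return "<![CDATA[" + html.replace("]]>", "]]]]><![CDATA[>") + "]]>"


def as_entity_encoded(html):
    """Entity-encode html: escape &, <, and > so it survives as XML text."""
    return escape(html)


def description_text(payload):
    """Parse a <description> holding payload and return its text content."""
    return ET.fromstring("<description>" + payload + "</description>").text


if __name__ == "__main__":
    html = "<b>&quot;foo!&quot;</b>"
    for label, payload in [
        ("entity-encoded", as_entity_encoded(html)),
        ("CDATA", as_cdata(html)),
    ]:
        # Both payloads decode to the original fragment; only their size differs.
        print(label, len(payload), payload, description_text(payload) == html)
```

The fixed cost of the CDATA wrapper stays the same as the fragment grows, while the entity-encoded form grows with every <, >, and & it contains.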
