July 14, 2003

Why We Need Echo's Content Model

Recent off-blog discussions with Greg Reinacker and Matt Berther about my problems posting to NewsGator from Outlook underline the need for a true content model for the data in RSS/Echo feeds, rather than just a list of syntactic tags.

The problem arises from the way I create the xhtml:body part of my postings: I take the *raw* contents of the blog body and wrap it in an xhtml:body parent tag. What I want, therefore, is for the data entered through Outlook and the NewsGator posting plugins to be pure, valid XHTML.

So, for example, a simple posting (from here) might look like:

<item>
    <pubDate>Thu, 03 Jul 2003 21:31:16 GMT</pubDate>
    <title>Test for Greg</title>
    <description><![CDATA[This &amp; that.&nbsp; 3 &lt; 5....]]></description>
    <body xmlns="http://www.w3.org/1999/xhtml">
        <p>This &amp; that.&nbsp; 3 &lt; 5.</p>
    </body>
</item>

That's all well and good until we consider the entities in the data (which actually make the above invalid XML). The specific case I hit was the plugins passing &nbsp; entities through in the data. Now &nbsp; is a perfectly valid HTML / SGML entity, but it is not available in plain old XML because it is not one of the five predefined entities (&amp;, &lt;, &gt;, &quot; and &apos;). The XHTML schemas make some mention of the entity definition list files, but of course those are DTD-based and not XML Schema-based. We could try including the definitions in-line in a DOCTYPE declaration in the instance document, but of course there isn't one in RSS/Echo data files.
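To make the failure concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree parser (my choice for illustration; none of the tools discussed here necessarily use it). The helper name try_parse is hypothetical. It feeds the parser the same paragraph three ways: with the named entity, with the equivalent numeric character reference, and with the entity declared in an in-line DOCTYPE.

```python
import xml.etree.ElementTree as ET


def try_parse(fragment: str) -> str:
    """Return the element's text content, or the parser's error message."""
    try:
        return ET.fromstring(fragment).text
    except ET.ParseError as err:
        return f"ParseError: {err}"


# Named HTML entity: plain XML has no definition for it, so parsing fails.
named = "<p>This &amp; that.&nbsp; 3 &lt; 5.</p>"

# Numeric character reference for the same character (U+00A0): accepted.
numeric = "<p>This &amp; that.&#160; 3 &lt; 5.</p>"

# Entity declared in an internal DTD subset: accepted, but it needs a
# DOCTYPE declaration, which RSS/Echo data files do not carry.
declared = '<!DOCTYPE p [<!ENTITY nbsp "&#160;">]>' + named

for label, doc in [("named", named), ("numeric", numeric), ("declared", declared)]:
    print(f"{label:9}", repr(try_parse(doc)))
```

Only the first form is rejected; the other two should both yield the text with a real no-break space in it.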

So, the conclusion I am left with is that although we can use xhtml content in RSS/Echo we can't use any (X)HTML entities without escaping them.

Of course, escaping the data is exactly what I was trying to avoid in the first place, both in the "tunnelled in CDATA" form (like the description element in the example above) and in the escaped entity (e.g. "&amp;nbsp;") form. It just doesn't feel right to me to include an XML fragment format in the data and then mask significant parts of it.
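A small sketch of what that masking costs a consumer, again assuming Python's standard library purely for illustration. Both escaped forms hide the markup from the XML parser, so the reader gets back a string of HTML source and has to run a second, HTML-aware decoding pass to reach the actual characters.

```python
import html
import xml.etree.ElementTree as ET

# The same content in the two escaped forms described above.
cdata_form = "<description><![CDATA[This &amp; that.&nbsp; 3 &lt; 5.]]></description>"
escaped_form = "<description>This &amp;amp; that.&amp;nbsp; 3 &amp;lt; 5.</description>"

# First pass: the XML parser only removes the outer layer of escaping.
from_cdata = ET.fromstring(cdata_form).text
from_escaped = ET.fromstring(escaped_form).text

# Both forms hand back the same HTML source string, entities still in place.
print(repr(from_cdata))

# Second pass: an HTML-aware step is needed to reach the real characters.
decoded = html.unescape(from_cdata)
print(repr(decoded))
```

Nothing in the feed itself tells the consumer that the second pass is required, which is the ambiguity at issue.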

This problem with XHTML entities has already been spotted by other people during discussions in the Echo wiki around Content (see "issue: dangers of html"). It is also part of the reason why the discussions in the Echo wiki about Escaped HTML are so important. Even after all the discussion among Greg, Matt and myself, it is still very unclear who (if anyone) is doing the right or wrong thing here.

So, without an unambiguous content model for RSS/Echo, there can never be real and guaranteed interoperability of this data.

Note: All references to "Echo" should be taken as meaning whatever Sam's little project is called this week ;-)

Entry categories: RSS Standards Weblog XML
Posted by Jorgen Thelin at July 14, 2003 03:58 PM - [PermaLink]
 
Comments
&nbsp; is a synonym for &#160;. If you are interested in how I handle this currently, look at html2xml in http://www.intertwingly.net/code/mombo/template/__init__.py
Posted by: Sam Ruby on July 14, 2003 03:38 PM

Hi Sam,

Yes, there are a whole set of translations that can and should be applied for all the (X)HTML entities, but the big problem is that it is not clear _who_ should be doing this, due to the ambiguity of the content model. This "data scrub" could potentially be done by:

- NewsGator
- The individual NewsGator plugin(s)
- MoveableType
- The formatting plugin(s) in MoveableType

The ultimate problem is the ambiguity of the content model(s) at the various interface boundaries the data passes across, but thanks for the pointer on how Mombo handles this.
Posted by: Jorgen Thelin on July 14, 2003 09:33 PM
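
For illustration, the kind of entity translation discussed in these comments could look roughly like the sketch below. This is not Sam's html2xml and not anything NewsGator or MoveableType is known to do; it is an assumed, minimal approach that uses Python's standard html.entities table, and the function name html_entities_to_numeric is hypothetical. It rewrites named (X)HTML entities as numeric character references and leaves the five XML predefined entities, and any unrecognised names, untouched.

```python
import re
from html.entities import name2codepoint

# Entities every XML parser already understands; leave them alone.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

# Matches a named entity reference such as &nbsp; or &eacute;
ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def html_entities_to_numeric(text: str) -> str:
    """Rewrite named HTML entities as numeric character references."""

    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            # Already valid XML, or not a known HTML entity: keep as-is.
            return match.group(0)
        return f"&#{name2codepoint[name]};"

    return ENTITY_RE.sub(replace, text)


print(html_entities_to_numeric("This &amp; that.&nbsp; 3 &lt; 5."))
```

Whichever component runs a step like this, it has to run exactly once, which is why the question of who owns the scrub matters.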