Recent off-blog discussions with Greg Reinacker and Matt Berther about my problems posting to NewsGator from Outlook underline the need for a true content model for the data in RSS/Echo feeds, rather than just a list of syntactic tags.
The problem arises from the way I create the xhtml:body part of my postings: I take the *raw* contents of the blog body and wrap them in an xhtml:body parent tag. I therefore need the data entered through Outlook and the NewsGator posting plugins to be pure, valid XHTML.
So, for example, a simple posting (from here) might look like:
<item>
<pubDate>Thu, 03 Jul 2003 21:31:16 GMT</pubDate>
<title>Test for Greg</title>
<description><![CDATA[This & that. 3 < 5....]]></description>
<body xmlns="http://www.w3.org/1999/xhtml">
<p>This & that. 3 < 5.</p>
</body>
</item>
That's all fine and good until we consider the entities in the data (which actually make the above invalid XML). The specific case I hit was the plugins passing &nbsp; entities through in the data. Now, &nbsp; is a perfectly valid HTML / SGML entity, but it is not available in plain old XML data because it is not in the list of predefined entities (&amp;, &lt;, &gt;, &quot; and &apos;). The XHTML schemas do mention the entity definition list files, but of course those are DTD-based and not XML Schema-based. We could try including the definitions inline in a DOCTYPE declaration in the instance document, but of course there isn't one in RSS/Echo data files.
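The failure is easy to reproduce. Here is a minimal sketch in Python, assuming the standard library's xml.etree.ElementTree parser stands in for whatever XML parser a feed reader uses; the helper names is_well_formed and wrap are hypothetical:

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def is_well_formed(fragment: str) -> bool:
    """Return True if the fragment parses as stand-alone XML with no DTD."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


def wrap(body: str) -> str:
    """Wrap raw body text in an xhtml:body element, as the post describes."""
    return f'<body xmlns="{XHTML_NS}"><p>{body}</p></body>'


# Predefined XML entities need no declaration.
predefined = wrap("This &amp; that. 3 &lt; 5.")
# &nbsp; is an (X)HTML entity, undeclared when there is no DOCTYPE.
named = wrap("This&nbsp;that")
# A numeric character reference for the same character needs no declaration.
numeric = wrap("This&#160;that")

for label, doc in [("predefined", predefined), ("named", named), ("numeric", numeric)]:
    print(label, is_well_formed(doc))
```

With this parser, only the fragment containing the named &nbsp; entity is rejected.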
So, the conclusion I am left with is that although we can use XHTML content in RSS/Echo, we can't use any (X)HTML entities without escaping them.
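One workaround this post does not explore is to rewrite named HTML entities as numeric character references, which XML accepts without any DTD. A rough sketch in Python, assuming the html.entities table in the standard library is an acceptable stand-in for the XHTML entity sets; the function name to_numeric_refs is hypothetical and is not something the NewsGator plugins provide:

```python
import re
from html.entities import name2codepoint

# The five entities every XML parser understands without a DTD.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def to_numeric_refs(markup: str) -> str:
    """Replace named HTML entities with numeric character references.

    Predefined XML entities and unknown names are left untouched.
    """
    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)
        return f"&#{name2codepoint[name]};"

    return _ENTITY.sub(replace, markup)


converted = to_numeric_refs("<p>This&nbsp;that &amp; more &copy;</p>")
print(converted)
```

This keeps the markup as real XML elements, at the cost of a rewriting step somewhere between the authoring tool and the feed.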
Of course, escaping the data is exactly what I was trying to avoid in the first place, both in the "tunnelled in CDATA" form (like the description element in the example above) and in the escaped entity form (e.g. "&amp;nbsp;"). It just doesn't feel right to me to include an XML fragment format in the data and then mask significant parts of it.
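To make the two escaping forms concrete, here is a small Python sketch using only the standard library; the sample string raw_html is made up for illustration. Both forms parse cleanly, but the consumer receives an opaque string rather than element structure, which is the masking described above:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw_html = "<p>This&nbsp;that</p>"

# Form 1: tunnel the markup through a CDATA section.
# (This breaks if the payload itself contains "]]>".)
cdata_item = f"<description><![CDATA[{raw_html}]]></description>"

# Form 2: escape the markup characters, so &nbsp; becomes &amp;nbsp;.
escaped_item = f"<description>{escape(raw_html)}</description>"

cdata_el = ET.fromstring(cdata_item)
escaped_el = ET.fromstring(escaped_item)

# Either way the parser hands back plain text with no child elements;
# the reader must re-parse that text as HTML to recover the structure.
print(escaped_item)
print(cdata_el.text, len(cdata_el))
print(escaped_el.text, len(escaped_el))
```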
This problem with XHTML entities has already been spotted by other people during the discussions in the Echo wiki around Content (see "issue: dangers of html"). It is also part of the reason why the discussions in the Echo wiki about Escaped HTML are so important. Even after all the discussion among Greg, Matt and myself, it is still very unclear who (if anyone) is doing the right or wrong thing here.
So, without an unambiguous content model for RSS/Echo, there can never be real and guaranteed interoperability of this data.
Note: All references to "Echo" should be taken as meaning whatever Sam's little project is called this week ;-)
All content is
Copyright (c) 2008 Jorgen Thelin. All rights reserved.
The opinions expressed here represent my own views
and not necessarily those of my current, prior or future employer(s).
Content is provided "as-is", without any representations or warranties of any kind.
Contents of the Weblog Feed are
licensed under a
Creative Commons License.