June 03, 2003

Is an InfoSet typed?

This interesting question came up yesterday as an underlying theme in some discussions about XML object models. My answer yesterday was slightly confused at best (I said an unconditional "yes" at the time), but having looked at it some more I now realise the correct answer is really "sometimes"!

To recap: an infoset is an abstract representation of some XML content, canonicalized and independent of the syntactic markup symbols. As an XML parser validates the XML data against its associated schema to ensure that all of the data is valid and the structure is sound, it can also convert the data internally to the datatypes specified in the schema to produce the PSVI (post-schema-validation infoset).

So, as an XML document is parsed, the parser will initially produce a representation of that data as a "raw" InfoSet object model (rather than using, say, DOM).
At this point, the waters get increasingly murky, though!
The InfoSet spec refers to Attribute Information Items having an "[attribute type]" property:

An indication of the type declared for this attribute in the DTD. Legitimate values are ID, IDREF, IDREFS, ENTITY, ENTITIES, NMTOKEN, NMTOKENS, NOTATION, CDATA, and ENUMERATION. ...

However, there is no similar property mentioned for Element Information Items as far as I can see. And given that at this stage we have not applied any validation using any DTD or Schema, I believe the real answer (despite what the spec implies) is that before the PSVI stage, InfoSets can only ever be purely structural and not in fact typed at all.
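To make the "purely structural" point concrete, here is a minimal sketch in Python. It uses the standard library's ElementTree as a stand-in for a raw InfoSet object model (an assumption for illustration only; ElementTree is not an InfoSet implementation), and the sample document and its element names are made up. With no DTD or Schema involved, every attribute value and every piece of element content comes back as plain character data:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; no DTD or Schema is associated with it.
doc = '<order id="42"><qty>3</qty><shipped>true</shipped></order>'
root = ET.fromstring(doc)

# Without validation, the parser has no basis for treating "42" or "3"
# as numbers, or "true" as a boolean: everything is a string.
id_value = root.get("id")
qty_value = root.find("qty").text
shipped_value = root.find("shipped").text

print(type(id_value).__name__, repr(id_value))
print(type(qty_value).__name__, repr(qty_value))
print(type(shipped_value).__name__, repr(shipped_value))
```

Any interpretation of these values as integers or booleans has to come from somewhere outside the parsed structure itself.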

The next step in the processing is to apply validation against a particular Schema, and this produces a specific PSVI (based on the particular Schema used), and this PSVI clearly is typed:

PSVI provides not only information on validation of an XML document but also type information on elements and attributes as well as default values of them.
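As a rough illustration of what "typed" means here, the sketch below walks a parsed document and attaches typed values using a lookup table. Everything in it is hypothetical: `TOY_SCHEMA` is a hand-written stand-in for a real Schema, `annotate` is not any library's API, and the resulting dictionary is only a toy analogue of a PSVI, not the structure the specs define.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a schema: maps element names (and attribute
# names prefixed with "@") to a converter for the declared type.
TOY_SCHEMA = {
    "@id": int,
    "qty": int,
    "shipped": lambda s: s in ("true", "1"),
}

def annotate(elem, schema):
    """Return a nested dict with typed values (a toy analogue of a PSVI)."""
    node = {
        "name": elem.tag,
        # Attributes with no declared type stay as strings.
        "attributes": {
            name: schema.get("@" + name, str)(value)
            for name, value in elem.attrib.items()
        },
        "children": [annotate(child, schema) for child in elem],
    }
    text = (elem.text or "").strip()
    if text and len(elem) == 0:
        # Leaf element: convert its content using the declared type, if any.
        node["value"] = schema.get(elem.tag, str)(text)
    return node

doc = '<order id="42"><qty>3</qty><shipped>true</shipped></order>'
typed = annotate(ET.fromstring(doc), TOY_SCHEMA)
print(typed)
```

The same document yields different typed results under a different schema table, which is the sense in which a PSVI is specific to the particular Schema used.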

The picture potentially becomes even more confusing when discussions about the typing of the InfoSet model itself surface too (rather than an instance InfoSet representing a particular XML document). Totally confused yet??

So based on this, I believe the true answer to the question "Is an InfoSet typed?" is actually "maybe"!
A "pre schema validation infoset" has some of the placeholders for type information (primarily the [attribute type] property of an AII), but these properties cannot in fact be populated until validation occurs. At that point the PSVI (post schema validation infoset) is produced anyway, and it also adds oodles of type information for elements (EIIs) as well as attributes (AIIs).

As usual, the individual specs are sufficiently ambiguous when combined and applied together to land me and others in muddy water up to our armpits!

This subject area of XML typing becomes increasingly important with the introduction of the XQuery 1.0 and XPath 2.0 Data Model, which brings the PSVI and data typing increasingly to the fore, and which Dare Obasanjo has described as "the best candidate for the Data Model for XML".

There is also an interesting article by Erik Wilde from last year's XML 2002 conference, which describes the extensibility aspect of InfoSets and approaches to layering them, and is worth a read too.

I think I grok all the facets and interrelationships of the situation now, but please correct me if I am wrong!

Entry categories: XML
Posted by Jorgen Thelin at June 3, 2003 03:28 PM