RedGrittyBrick

XML: The Good the Bad and the Ugly

I’ve used XML on a variety of projects and have come to some firmly held views about what it is good at and what it is bad at.

What is good about XML?

XML is designed for data interchange between software written by different groups of people. Prior to XML you had either poorly documented proprietary binary formats or variations of CSV. XML promised to be a more self-documenting format that would ease interchange of data.

XML is a variation of plain text and can be written to be somewhat readable. If the element and attribute names are well-chosen it can be easy to understand the internal structure of the data.

What is bad about XML?

Attributes vs Elements

This is a minor one.

There don’t seem to be any widely adopted standard approaches to choosing between attributes and sub-elements.

It is therefore possible for two people to independently create an XML document type for the same document and arrive at completely different structures.

Let me invent a couple of examples:

<Book>
  <Language>English</Language>
  <Author>Raymond Chandler</Author>
  <Title>The High Window</Title>
</Book>

compared to

<book language="EN-US" author="Chandler, Raymond">The High Window</book>

It might be better if attributes didn’t exist. It seems they are often overused.
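To make the cost concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree. The two reader functions are hypothetical names of my own; the point is only that each invented structure needs its own extraction code, and that even the values ("English" vs "EN-US") don't line up.

```python
import xml.etree.ElementTree as ET

# The two invented structures for the same book.
ELEMENT_STYLE = """<Book>
  <Language>English</Language>
  <Author>Raymond Chandler</Author>
  <Title>The High Window</Title>
</Book>"""

ATTRIBUTE_STYLE = (
    '<book language="EN-US" author="Chandler, Raymond">The High Window</book>'
)


def read_element_style(xml_text):
    """Hypothetical reader: every field is a child element."""
    book = ET.fromstring(xml_text)
    return {
        "language": book.findtext("Language"),
        "author": book.findtext("Author"),
        "title": book.findtext("Title"),
    }


def read_attribute_style(xml_text):
    """Hypothetical reader: fields are attributes, the title is the text."""
    book = ET.fromstring(xml_text)
    return {
        "language": book.get("language"),
        "author": book.get("author"),
        "title": book.text,
    }


if __name__ == "__main__":
    print(read_element_style(ELEMENT_STYLE))
    print(read_attribute_style(ATTRIBUTE_STYLE))
```

Neither reader can process the other document, so anyone consuming both has to write and maintain a mapping between them.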

Definitions

You often want to validate an XML document against a definition that specifies what elements can or must exist etc. This is the best way to ensure a document can be used by multiple programs.

But there are too many different ways to specify an XML document type. There were Document Type Definitions (DTDs), there is XML Schema (XSD) and there are others such as RELAX NG. At least some of these don’t have a good way to specify business rules, so you get supplementary or alternative ways to define those - e.g. Schematron, …

There are many ways to define XML and it may be that not all toolsets support all of them. So you can have interoperability problems and end up tied to specific toolsets.

Namespaces

You sometimes need to wrap an XML document in another XML document. For example, to send an order for a book inside a communications message. This is a problem if the two XML documents were developed by different groups working independently and there is a clash of element names - for example an order <header> and a message <header>.

To solve this problem, namespaces were introduced:

 <m:message xmlns:m="acme.com/message">
   <m:header>
     <m:sender>Joe Smith</m:sender>
   </m:header>
   <m:body>
     <bda:order xmlns:bda="example.org/order">
       <bda:header>
          <bda:order_no>12345678</bda:order_no>
       </bda:header>
     </bda:order>
   </m:body>
 </m:message>

This way, automated tools can work out whether a <header> is a message header or an order header.

In practice, it seems to me that you could probably infer the type of <header> from its position in the document.

This adds to the verbosity of XML, and a crucial problem it introduces is the mutability of namespace prefixes. In my example we have the prefixes m: and bda:, but an application might have to change one of these to avoid a clash of prefixes. Suppose someone supplied an order.xml to be sent and it used the m: prefix?
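A small sketch of why prefixes can be changed at all, again with Python's standard xml.etree.ElementTree. The two order fragments are cut-down, hypothetical versions of my example. The parser resolves each prefix to its namespace URI, so a namespace-aware program sees the same names whichever prefix was used.

```python
import xml.etree.ElementTree as ET

# The same order fragment, written with two different prefixes.
ORDER_WITH_BDA = (
    '<bda:order xmlns:bda="example.org/order"><bda:header/></bda:order>'
)
ORDER_WITH_M = (
    '<m:order xmlns:m="example.org/order"><m:header/></m:order>'
)


def expanded_names(xml_text):
    """Return each element's name in ElementTree's {uri}local form."""
    return [element.tag for element in ET.fromstring(xml_text).iter()]


if __name__ == "__main__":
    print(expanded_names(ORDER_WITH_BDA))
    print(expanded_names(ORDER_WITH_M))
    # Both print ['{example.org/order}order', '{example.org/order}header']
```

The prefix is only a local abbreviation, which is exactly why tools feel free to rewrite it.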

Digital signatures

In order to provide for message authentication, it is desirable to be able to digitally sign an XML document with a private key to prove that the document has not been forged or altered.

Often you want to sign only a part of a document, so that you can include the signature in the document itself.

This means you need to consider what happens to the namespace declarations.

In the example above, if we signed the <bda:order> and placed the signature in the <m:header>, we would be OK. But if the message somehow got transformed into the fully equivalent form below, we could no longer check the signature:

 <m:message xmlns:m="acme.com/message" 
            xmlns:bda="example.org/order">
   <m:header>
     <m:sender>Joe Smith</m:sender>
   </m:header>
   <m:body>
     <bda:order>
       <bda:header>
          <bda:order_no>12345678</bda:order_no>
       </bda:header>
     </bda:order>
   </m:body>
 </m:message>

Notice that the xmlns:bda declaration was moved earlier in the XML. This is legal in terms of XML rules and does not affect the meaning of the content. However, it changes the text that was signed, so the digital signature no longer verifies.
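Here is a rough sketch of the effect in Python. It is not real XML signing: it assumes a plain SHA-256 over the text of the whole message stands in for the digest step, and the two messages are trimmed, hypothetical versions of my examples. The parsed content is identical, yet the digests differ.

```python
import hashlib
import xml.etree.ElementTree as ET

# Trimmed versions of the two forms: xmlns:bda declared on the order
# element, or moved up to the message element.
DECLARED_ON_ORDER = (
    '<m:message xmlns:m="acme.com/message"><m:body>'
    '<bda:order xmlns:bda="example.org/order">'
    '<bda:order_no>12345678</bda:order_no>'
    '</bda:order></m:body></m:message>'
)
DECLARED_ON_MESSAGE = (
    '<m:message xmlns:m="acme.com/message" xmlns:bda="example.org/order">'
    '<m:body><bda:order>'
    '<bda:order_no>12345678</bda:order_no>'
    '</bda:order></m:body></m:message>'
)


def digest(xml_text):
    """Stand-in for the digest step: hash the serialised text as-is."""
    return hashlib.sha256(xml_text.encode("utf-8")).hexdigest()


def content(xml_text):
    """Expanded element names and text, ignoring prefixes and layout."""
    return [(el.tag, el.text) for el in ET.fromstring(xml_text).iter()]


if __name__ == "__main__":
    print(content(DECLARED_ON_ORDER) == content(DECLARED_ON_MESSAGE))  # True
    print(digest(DECLARED_ON_ORDER) == digest(DECLARED_ON_MESSAGE))    # False
```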

To fix this problem it is necessary to define a standard form of XML.

Canonicalisation

Canonicalisation aims to transform any equivalent forms of an XML document into exactly one “canonical form”. This is so that no matter how much the XML has been changed in ways that don’t affect its meaning, you can still transform it to canonical form and use that to calculate the same hash value for all forms of the same XML.

So canonical XML is XML transformed so that it meets certain extra rules about details such as the order of attributes and how namespace declarations are written.

However, the rules for XML canonicalisation mostly leave white-space between elements untouched rather than normalising it. This means it is still easy to accidentally create an XML document that looks the same as the original (for example, by re-indenting it) but which has a different canonical form and so fails digital signature verification.
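Both effects can be seen with the canonicalize() function in Python's standard xml.etree.ElementTree (Python 3.8 or later, implementing C14N 2.0). This is only a sketch: the messages are trimmed, hypothetical versions of my examples, and other canonicalisation variants may differ in detail.

```python
from xml.etree.ElementTree import canonicalize  # Python 3.8+

DECLARED_ON_ORDER = (
    '<m:message xmlns:m="acme.com/message"><m:body>'
    '<bda:order xmlns:bda="example.org/order">'
    '<bda:order_no>12345678</bda:order_no>'
    '</bda:order></m:body></m:message>'
)
DECLARED_ON_MESSAGE = (
    '<m:message xmlns:m="acme.com/message" xmlns:bda="example.org/order">'
    '<m:body><bda:order>'
    '<bda:order_no>12345678</bda:order_no>'
    '</bda:order></m:body></m:message>'
)
# The first form again, re-indented by some well-meaning tool.
REINDENTED = (
    '<m:message xmlns:m="acme.com/message">\n'
    '  <m:body>\n'
    '    <bda:order xmlns:bda="example.org/order">\n'
    '      <bda:order_no>12345678</bda:order_no>\n'
    '    </bda:order>\n'
    '  </m:body>\n'
    '</m:message>'
)

if __name__ == "__main__":
    # Moving the namespace declaration does not change the canonical form.
    print(canonicalize(DECLARED_ON_ORDER) == canonicalize(DECLARED_ON_MESSAGE))
    # Re-indenting does: the added white-space is kept.
    print(canonicalize(DECLARED_ON_ORDER) == canonicalize(REINDENTED))
    # strip_text is an ElementTree convenience option, not something a
    # signature verifier can be assumed to apply.
    print(canonicalize(DECLARED_ON_ORDER)
          == canonicalize(REINDENTED, strip_text=True))
```

So canonicalisation rescues the signature from the moved declaration, but not from a harmless-looking pretty-print.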

The whole canonicalisation process is pretty involved.

The whole nasty mess could be avoided if we just accepted that programs that process XML shouldn’t rearrange the XML in any way even if the rearrangement is fully equivalent in meaning.

When we sign an email message, we don’t canonicalise the content because we don’t expect any email software to replace CR with LF or CRLF or other permutations. We don’t expect extra spaces to be inserted or tabs to be replaced with spaces or vice versa. Even though none of these things would normally affect the meaning of the message, we would expect such changes to cause a failure when checking a digital signature of the message.

Why so different for XML? The result is that handling XML is made needlessly complex.


Other people’s views: