XML Canonicalization, Part 2
by Bilal Siddiqui
Excluding the Ancestor Context
We have seen that the ancestor context is included when canonicalizing XML document subsets. However, doing so can cause problems under certain circumstances. To explain the scenarios in which including the ancestor context creates problems, we first need to discuss enveloping, a concept of paramount importance in web services interoperability.
Enveloping
SOAP is fast becoming the de facto standard for XML messaging over the Internet. SOAP defines a format for wrapping XML data inside envelopes.
Look at Listing 7,
which wraps the PackageBooking element inside the
SOAP:Body element. Listing 7 demonstrates a
simple enveloping mechanism, in which the message payload (i.e. the
message that needs to be sent across the Internet) is wrapped inside a
SOAP:Body element and the entire SOAP:Body is
wrapped by the SOAP:Envelope element.
The advantage of this simple enveloping lies in its ability to enable vertical stacking of XML-based protocols. Vertical stacking means that protocols and message formats can be defined for specific low-level tasks (such as signing, encrypting, routing etc.) and higher protocol layers will use the services provided by lower layers. For example, WS-Security, a high-level XML security protocol being developed by an OASIS Technical Committee, uses the SOAP format to utilize the signing and encrypting mechanisms provided by W3C's XML Digital Signature and XML Encryption specifications respectively.
Listing 7 also
contains a SOAP:Header element in addition to the
SOAP:Body element. The SOAP:Header element is
optional and is meant to contain protocol-specific information. This
effectively means that the message payloads are contained inside
SOAP:Body elements and protocol headers are contained in the
SOAP:Header element. For instance, WS-Security uses the SOAP
Header to wrap signature related information.
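The enveloping structure described above can be sketched in a few lines of Python. This is only an illustration, not the article's Listing 7: the `wrap` helper and the `urn:example:tour` payload namespace are assumptions made for the example, while the envelope namespace is the SOAP 1.1 one.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
TOUR_NS = "urn:example:tour"  # hypothetical payload namespace, for illustration only

ET.register_namespace("SOAP", SOAP_NS)
ET.register_namespace("tour", TOUR_NS)


def wrap(payload, header_blocks=()):
    """Wrap a payload element in SOAP:Body inside SOAP:Envelope.

    The optional SOAP:Header is only created when protocol-specific
    header blocks are supplied, and it is placed before the Body.
    """
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    if header_blocks:
        header = ET.SubElement(envelope, f"{{{SOAP_NS}}}Header")
        header.extend(header_blocks)
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    body.append(payload)
    return envelope


payload = ET.Element(f"{{{TOUR_NS}}}PackageBooking")
envelope = wrap(payload)
print(ET.tostring(envelope, encoding="unicode"))
```

A protocol such as WS-Security would pass its own elements as `header_blocks`, leaving the payload in the Body untouched.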
Envelope Handling
The application that receives a SOAP message is likely to open the envelope (wrapper) and extract the XML payload (the XML message) in order to process it. Opening a SOAP envelope and extracting the XML payload is referred to as de-enveloping. The receiving application might then need to re-envelope the XML message in a new envelope.
The need for re-enveloping emerges in federated web services, which rely on partner applications to do part of a job, thus integrating, for example, an entire supply chain into interoperable and loosely coupled systems.
As an example of federated applications, let's consider a tourism industry B2B scenario. A tourist wants to know the details of a vacation tour being offered by a tour operator's web service. The tourist sends an XML message containing information about the places he would like to visit and the dates on which he is planning to travel.
Naturally the tourist's XML message will be authored by some client-side XML- and SOAP-aware application, which will author and wrap all information inside a SOAP envelope without requiring the tourist to know anything about XML and SOAP.
Upon receipt of this SOAP message, the tour operator's web service will extract the information related to time and place of travel from the SOAP envelope. The tour operator's service will need to send pieces of the travel information to different partner hotels and car rental companies. Therefore, the tour operator's service will author fresh SOAP envelopes containing the relevant pieces of information and forward them to partner hotels and car rental companies.
In a similar fashion, upon receipt of the response from partner hotels and car rental companies, the tour operator's service will re-envelope the information received before sending the fresh envelope back to the tourist.
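As a rough sketch of what the tour operator's service does, the following Python fragment de-envelopes an incoming message and re-envelopes its payload. The helper names `de_envelope` and `re_envelope` and the `urn:example:booking` namespace are assumptions made for this example; a real service would also handle headers, faults and routing.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
ET.register_namespace("SOAP", SOAP_NS)


def de_envelope(envelope):
    """Return the payload elements found inside SOAP:Body."""
    body = envelope.find(f"{{{SOAP_NS}}}Body")
    if body is None:
        raise ValueError("message has no SOAP:Body")
    return list(body)


def re_envelope(payloads):
    """Place payload elements inside a freshly authored envelope."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    body.extend(payloads)
    return envelope


# Hypothetical incoming message from a partner hotel.
incoming = ET.fromstring(
    '<SOAP:Envelope xmlns:SOAP="http://schemas.xmlsoap.org/soap/envelope/">'
    '<SOAP:Body><booking xmlns="urn:example:booking" unitCharge="50"/></SOAP:Body>'
    '</SOAP:Envelope>'
)
payloads = de_envelope(incoming)
outgoing = re_envelope(payloads)
print(ET.tostring(outgoing, encoding="unicode"))
```

The payload itself is unchanged by this round trip, but its ancestors are now different elements from a different document, which is exactly what matters for canonicalization of document subsets.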
Exclusive XML Canonicalization
With the above discussion in mind, have a look at Listing 10, which is a SOAP response message that a fictitious partner hotel has just sent back to the tour operator's web service.
The tour operator would have also received SOAP response messages from other partner hotels and car rental companies. These messages need to be combined to form a complete packaged vacation tour.
You may now have a look again at Listing 7, which is actually
a packaged vacation tour that the tour operator's web service will
ultimately send back to the tourist. The first booking
element of Listing 7
(whose unitCharge attribute shows a value of "50" and which
we canonicalized in Listing
9) is the same as the booking element of Listing 10.
In order to demonstrate the role of canonicalization in federated web service applications, let's assume that the partner hotel wanted to sign the booking element of Listing 10 while sending the SOAP message to the tour operator, thus allowing the tourist to verify that the booking is not fake.
Listing 11 shows the canonical form of the booking element of Listing 10. This is the canonical form that the partner hotel will use to create a message digest. On the other hand, recall that Listing 9 was the canonical form of the same booking element, when it was part of Listing 7. Therefore, the tourist will use Listing 9 to verify the message digest of the partner hotel.
Compare Listing 11 with Listing 9 and you will find that they are different from each other. The difference comes from the fact that we conserved the ancestor contexts from two different XML documents while canonicalizing the same booking element.
Therefore, message digest and signature verification will fail at the tourist's client application end. This clearly establishes the need to exclude ancestor context while employing canonicalization concepts in federated web service applications. W3C has released the Exclusive XML Canonicalization recommendation for this purpose.
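The effect on the digest is easy to reproduce. In the sketch below, the two strings are hypothetical stand-ins for the two inclusive canonical forms of one booking element (they are not the article's Listings 9 and 11); the only difference is an extra namespace declaration inherited from the second document's ancestors. SHA-256 is chosen arbitrarily for the illustration.

```python
import hashlib

# Hypothetical canonical form computed by the partner hotel:
# only the namespaces in scope in the hotel's own message appear.
canonical_at_hotel = (
    '<bs:booking xmlns:SOAP="http://schemas.xmlsoap.org/soap/envelope/" '
    'xmlns:bs="urn:example:booking" unitCharge="50"></bs:booking>'
)

# Hypothetical canonical form computed by the tourist's client:
# the packaged tour adds an ancestor namespace (xmlns:tour).
canonical_at_tourist = (
    '<bs:booking xmlns:SOAP="http://schemas.xmlsoap.org/soap/envelope/" '
    'xmlns:bs="urn:example:booking" xmlns:tour="urn:example:tour" '
    'unitCharge="50"></bs:booking>'
)


def digest(canonical_form):
    """Hash the UTF-8 bytes of a canonical form."""
    return hashlib.sha256(canonical_form.encode("utf-8")).hexdigest()


digests_match = digest(canonical_at_hotel) == digest(canonical_at_tourist)
print(digests_match)  # False: the inherited declaration changes the bytes
```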
Exclusive canonicalization applies only while canonicalizing fragments of XML files and differs from (inclusive) canonical XML in the following two ways:
- Attributes from the xml namespace are not imported from ancestors into orphan nodes.
- Omitted namespace declarations are included in the
exclusive canonical form of an element only if both of the following hold:
  - The namespace declaration is used by the element or any of its attributes.
  - The namespace declaration is not already in effect in the exclusive canonical form.
Note that the second point above also applies to empty default
namespace declarations (xmlns=""). This means that the
exclusive canonical form of an element will include the
xmlns="" declaration if the element belongs to the default
empty namespace, and the nearest ancestor occurrence of the default
namespace declaration in the exclusive canonical form has some non-empty
default namespace (xmlns="http://someURI...").
Applying these rules to the booking element of Listing 10, the exclusive
canonical form is as shown in Listing 12. You may notice
that the exclusive canonical form of the first booking element of Listing 7 (whose
unitCharge attribute has the value "50") is also exactly the
same as Listing 12.
Problematic Scenarios
I should point out two important problems that may result from applying the Canonical XML specification:
-
If the XML file being canonicalized contains external parsed entity references, the external entity references will be replaced with external content during canonicalization as already discussed above under the heading "External Entity References." However, if the external content contains some relative URIs, the URIs may become non-operational after the replacement (since the DTD declaration will be removed during canonicalization and there will be no way to reach the external content after XML canonicalization).
Non-operational URIs may create problems in signature applications, as there is no way to detect whether the original operational XML document or its non-operational canonical form was intended to be signed. If an application considers that this ambiguity could defeat the purpose of the signature, such scenarios should be resolved prior to canonicalization (e.g. by converting relative URIs to absolute URIs before canonicalizing).
-
There may be some application specific equivalence criteria that cannot be covered in a generalized specification. For example, an XML file carrying an invoice with all prices in French francs will not produce the same canonical form as that of the same invoice with equivalent prices in Euros (although the two invoices will be logically equivalent). Therefore, such application specific issues need to be resolved in an application specific manner.
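For the first problem, the suggested fix of converting relative URIs to absolute ones before canonicalizing can be sketched with Python's standard `urllib.parse.urljoin`. The `absolutize` helper, the base URI and the sample references are hypothetical; how an application finds the URIs inside its external content is application specific and not shown.

```python
from urllib.parse import urljoin


def absolutize(base_uri, references):
    """Resolve each URI reference against the URI of the external entity."""
    return [urljoin(base_uri, reference) for reference in references]


# Hypothetical location of the external parsed entity.
base = "http://example.com/entities/hotel-info.xml"
resolved = absolutize(base, ["images/lobby.png", "../terms.html", "http://example.org/x"])
for uri in resolved:
    print(uri)
```

References that are already absolute pass through unchanged, so the helper can be applied to every URI found in the external content.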
- great piece of information about xml canonicalization
2004-10-18 05:15:23 kasamajay [Reply]
This article covered
1. need for xml canonicalization
2. real world application where xml canonicalization is needed
3. steps for canonicalization
4. problems in canonicalization
5. solutions to the problems in canonicalization
detailed examples for understanding each of the above listed points.
Thanks for taking your valuable time to share this great piece of information
Regards
Ajay Kumar Kasam
+919845557312
- Learn XPath
2002-11-24 20:15:37 Michael Brundage [Reply]
The XPath (//. | //@* | //namespace::*)[ancestor-or-self::booking[@unitCharge="50"]] is both hideously inefficient and incorrect.
First of all, you explain that the parenthesized expression is to select all nodes in the document. But there's a much more efficient way to do this: //node()
Secondly, your XPath matches every node whose ancestor or self is the booking element in question -- not a single element. If you pick only the first such element (through a call to selectSingleNode), then you happen to get back only the booking element itself. However, there is a much better way to do this: Use the standard pattern //booking[@unitCharge=50].
This kind of search is what the // abbreviation was invented for.
And finally, none of these XPaths actually work, because of namespaces. You really have to do
//x:booking[@unitCharge=50] where the prefix x is bound to the namespace "http://www.FicticiousTourismInterface/BookingService", or equivalently,
//*[namespace-uri()='http://www.FicticiousTourismInterface/BookingService' and local-name() = 'booking' and @unitCharge=50]
- Learn XPath: Author's Response
2002-11-27 06:58:24 Bilal Siddiqui [Reply]
Hi Michael,
This is the author's response to your "learn XPath" advice. There are three points in your comment and I will respond to each separately:
1. The comment on the use of ((//. | //@* |
//namespace::*)[ancestor-or-self::booking[@unitCharge="50"]] ) XPath expression:
You say that the expression is *inefficient* and *incorrect*. Your statement is not entirely true i.e. the expression is correct and in 100 percent compliance with the XPath specification. It finds all the nodes of the document and then applies the predicate.
In fact, this same expression is used by the W3C people who designed the process of XML canonicalization. Please refer to sections 2.1 and 3.7 of Canonical XML spec. Following link will take you direct to the spec.
http://www.w3.org/TR/xml-c14n
Focus of the article under consideration is XML canonicalization and we are discussing the W3C specifications on Canonical XML and Exclusive XML Canonicalization in this article. That's why I used the same expression that W3C guys used in their document. While writing articles, I normally try to keep focus and refrain from anything off-topic. I simply don't understand why you expect me to find XPath alternates and discuss efficiency of XPath expressions in an article on XML canonicalization, especially when W3C has already used an expression in their official spec.
Let me conclude here by saying:
A. The expression ((//. | //@* | //namespace::*)[predicate] ) that I used is correct and in full compliance with the XPath 1.0 W3C recommendation.
B. If you don't like this expression, I think your difference of opinion here is not with me, but with the W3C guys who designed the Canonical XML spec. You should better post your comment at W3.org
*********************
2. Your suggestion of finding a single element instead of every node whose ancestor or self is the booking element by using the XPath expression: //booking[@unitCharge=50]
Immediately after explaining what the XPath expression (//. | //@* | //namespace::*)[predicate] does, I have included Listing 8, which is the result of applying this expression to Listing 7. Now if we were to follow your suggestion and use //booking[@unitCharge=50], we will not get Listing 8. Instead the result will be just the booking element without any of its child nodes (yes, even without any of its attribute child nodes)!! The result will be something like <booking/> instead of what we see in Listing 8.
Why is this so? XPath expressions evaluate to objects of one of the four basic types defined by XPath. Out of the four types, one is node-set, which is a set of nodes. A node in an XPath node-set can be of several types e.g. element nodes. An XPath element node contained in a node-set does not include in itself any of its children.
I think you are assuming that every node in a node-set resulting from an XPath expression, when directly serialized to a string value will yield all of its child nodes (element nodes, attribute nodes etc.). Your assumption is entirely wrong.
For example, it is possible to write an XPath expression that will include a parent node in the serialized result and not include one or more of its children. Section 3.7 of Canonical XML spec. that I mentioned above gives a simple demonstration of this fact.
If we were to follow your assumption that every node in an XPath node-set comes complete with all of its children, then it would not be possible to write an XPath expression that includes a parent node and excludes any of its children in the resulting XML fragment.
Therefore, your suggested expression does not perform the required job that we are trying to accomplish (extract the document fragment of Listing 8 from Listing 7, so that it can be canonicalized). Use the expression that I have suggested in my article!
*********************
3. Namespaces missing in the XPaths:
Yes you are right. I somehow forgot to compensate for the namespaces used. Thanks for pointing out. The correct expression should be:
(//. | //@* | //namespace::*)[ancestor-or-self::bs:booking[@unitCharge="50"]]
with declaration xmlns:bs="http://www.FictitiousTourismInterface/BookingService"
I have requested the XML.com editor to edit the published version of the article accordingly.
I will welcome any further thoughts from you, but your comments will look more professional if they are less arrogant :)
Cheers,
Bilal
- Reference code?
2002-10-10 10:41:46 Richard Hough [Reply]
Great article. Is there any source code for a reference implementation of XML Canonicalization? Some of this stuff, especially handling namespaces, is tricky enough that i'd prefer not to write my own.
