Combining Schematron with other XML Schema languages

By Eddie Robertsson
June 10, 2002

Abstract

This article shows how Schematron can be combined with other XML Schema languages to create powerful validation possibilities for business applications.

Table of contents

Introduction
Introduction to Schematron
    Schematron hierarchy
        Assertions
        Rules
        Patterns
    Schematron processing
Embedded Schematron Rules in W3C XML Schema
    Dependent attributes
    Interleaving of elements
    Co-occurrence constraints
    Dependency between XML documents
Embedded Schematron Rules in RELAX-NG
    Co-occurrence constraints
    Dependency between XML documents
Processing
Summary
Acknowledgements

Introduction

Since the W3C ratified W3C XML Schema as a full Recommendation on May 2nd 2001, it has become clear that this is the most widely used XML Schema language in the development community. Many believed that W3C XML Schema would solve most problems encountered when validating XML documents, but this is not the case, and in fact it was never the goal of W3C XML Schema. The purpose section of the specification clearly states:

"However, the language defined by this specification does not attempt to provide all the facilities that might be needed by any application. Some applications may require constraint capabilities not expressible in this language, and so may need to perform their own additional validations."

When W3C XML Schema is not powerful enough there are other options for developers. One option is to find a different XML Schema language that can express all the needed constraints, and RELAX-NG has become increasingly popular due to its simplicity and expressive power. In many areas RELAX-NG is more powerful than W3C XML Schema, but there are still areas where both languages fall short. One such area is the ability to express constraints between components in an XML document, which are known as co-occurrence constraints.

The best XML Schema language for expressing co-occurrence constraints is Schematron. Schematron is a rule-based schema language, and although you can define structure using Schematron, doing so is often cumbersome. However, since defining structure in both W3C XML Schema and RELAX-NG is easy, the perfect solution would be to combine the schema languages. This way we can use each language for what it is best at: define structure with W3C XML Schema or RELAX-NG, and define co-occurrence constraints with Schematron.

This article will provide an explanation and several examples of how Schematron rules can easily be embedded within W3C XML Schemas and RELAX-NG to perform validation tasks not possible in W3C XML Schema or RELAX-NG alone.

The following four areas, which W3C XML Schema does not fully address, will be covered in the section Embedded Schematron Rules in W3C XML Schema:

  1. Dependent attributes
  2. Interleaving of elements
  3. Co-occurrence constraints
  4. Dependency between XML documents

The first two of the above examples (dependent attributes and interleaving of elements) are handled by RELAX-NG without having to rely on embedded Schematron rules. However, when it comes to defining advanced co-occurrence constraints and dependencies between XML documents, RELAX-NG also falls short. Some examples of how this can be achieved are shown in the section Embedded Schematron Rules in RELAX-NG.

Introduction to Schematron

The Schematron schema language differs from most other XML schema languages in that it is a rule-based language that uses path-expressions instead of grammars. This means that instead of creating a grammar for an XML document a Schematron schema will make assertions applied to a specific context within the document. If the assertion fails, a diagnostic message that is supplied by the author of the schema can be displayed.

One advantage of this rule-based approach is that in many cases the Schematron rules can be created simply by rewording the desired constraint as written in plain English. For example, a simple content model can be written in plain English like this: "The Person element should in the XML instance document have an attribute Title and contain the elements Name and Sex in that order. If the value of the Title attribute is 'Mr' then the value of the Sex element must be 'Male'".

In this sentence the context in which the assertions should be applied is clearly stated as the Person element, and there are four different assertions:

  1. The context element (Person) should have an attribute Title
  2. The context element should contain two child elements, Name and Sex
  3. The child element Name should appear before the child element Sex
  4. If attribute Title has the value 'Mr' then the element Sex must have the value 'Male'

In order to implement the path-expressions used in the rules in Schematron, the W3C XPath language (XPath) is used with various extensions provided by XSLT (Extensible Stylesheet Language Transformations). Since the path-expressions are built on top of XPath and XSLT it is also trivial to implement Schematron using XSLT, which is shown in the section Schematron processing below.

It has already been mentioned that Schematron makes various assertions based on a specific context in a document. The assertions and the context make up two of the four layers in Schematron's fixed four-layer hierarchy, which consists of phases (top level), patterns, rules (which define the context) and assertions.

Schematron hierarchy

In this introduction only three of these layers (patterns, rules and assertions) will be covered since these are most important for using embedded Schematron rules in W3C XML Schemas and RELAX-NG. For a full description of the Schematron schema language see the Schematron specification.

In short the three layers covered in this section are constructed so that each assertion is grouped into rules and each rule defines a context. Each rule is then grouped into patterns, which are given a name that is displayed together with the error message (there is really more to patterns than just a grouping mechanism but for this introduction this is sufficient).

The example in the introduction specified a very simple content model (see below) that will be used to explain the three layers in the hierarchy.

<Person Title="Mr">
   <Name>Eddie</Name>
   <Sex>Male</Sex>
</Person>

Assertions

The bottom layer in the hierarchy is the assertions, which specify the constraints that should be checked within a specific context of the XML instance document. In a Schematron schema the element typically used to define an assertion is assert. The assert element has a test attribute, whose value is a modified XPath expression [1]. In the above example four assertions were made on the document in order to specify the content model, namely:

  1. The context element (Person) should have an attribute Title
  2. The context element should contain two child elements, Name and Sex
  3. The child element Name should appear before the child element Sex
  4. If attribute Title has the value 'Mr' then the element Sex must have the value 'Male'

Written using Schematron assertions this would be:

<assert test="@Title">The element Person must have a Title attribute.</assert>
<assert test="count(*) = 2 and count(Name) = 1 and count(Sex)= 1">The element Person should have the child elements Name and Sex.</assert>
<assert test="*[1] = Name">The element Name must appear before element Sex.</assert>
<assert test="(@Title = 'Mr' and Sex = 'Male') or @Title != 'Mr'">If the Title is "Mr" then the sex of the person must be "Male".</assert>

For people familiar with XPath these assertions are easy to understand, but even for people with limited XPath experience they are rather straightforward. The first assertion simply tests for the occurrence of an attribute Title. The second assertion tests that the total number of children is equal to two and that there is one Name element and one Sex element. The third assertion tests that the first child element is Name, and the last assertion tests that if the Title is 'Mr' then the sex of the person must be 'Male'.

If the condition in the test attribute is not fulfilled the content of the assertion element will be displayed to the user. So, for example, if the third condition was broken (*[1] = Name) then the following message would be displayed:

The element Name must appear before element Sex.
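
The third test deserves a closer look. In XPath 1.0, comparing two node-sets with = compares their string values, so *[1] = Name is true whenever the first child has the same text as some Name child, whatever the first child is called. A stricter variant, offered here as a sketch and not part of the original schema, tests the element itself through the self axis:

```xml
<!-- Sketch: check that the first child IS a Name element, not merely equal in value to one -->
<assert test="*[1][self::Name]">The element Name must appear before element Sex.</assert>
```

With this form, a Person whose Sex and Name children happen to hold the same text in the wrong order is still reported.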

Each of the above assertions has a condition that is evaluated, but the assertion does not define where in the XML instance document this condition should be checked. For example, the first assertion tests for the occurrence of the attribute Title, but it does not specify on which element in the XML instance document this assertion should be applied. The next layer in the hierarchy, the rules, specifies this location (the context of the assertion).

Rules

The rules in Schematron are declared by using the rule element, which has a context attribute. The value of the context attribute is the same kind of modified XPath expression as the test attribute on the assertions. As the name suggests, the context attribute is used to specify the context in the XML instance document where the assertions should be applied. In the above example the context was specified to be the Person element, and a Schematron rule with the Person element as context would simply be:

<rule context="Person"></rule>

Since the rules are used to group together all the assertions that share the same context the rules are designed so that the assertions are declared as children of the rule element. For the above example this means that the complete Schematron rule would be:

<rule context="Person">
   <assert test="@Title">The element Person must have a Title attribute.</assert>
   <assert test="count(*) = 2 and count(Name) = 1 and count(Sex) = 1">The element Person should have the child elements Name and Sex.</assert>
   <assert test="*[1] = Name">The element Name must appear before element Sex.</assert>
   <assert test="(@Title = 'Mr' and Sex = 'Male') or @Title != 'Mr'">If the Title is "Mr" then the sex of the person must be "Male".</assert>
</rule>

This means that all the assertions in the rule will be tested on every Person element in the XML instance document. If the context should not be all the Person elements, it is easy to change the XPath to define a more restricted context. The value Database/Person would, for example, set the context to be all the Person elements that have the element Database as their parent.
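
As a sketch of the restricted context just described (the Database wrapper element is only the illustration used above, not part of the example document), the rule could be written like this:

```xml
<!-- Sketch: only Person elements whose parent is Database are checked -->
<rule context="Database/Person">
   <assert test="@Title">The element Person must have a Title attribute.</assert>
</rule>
```

A Person element located anywhere else in the instance document would not be tested by this rule.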

Patterns

The third layer in the hierarchy is the pattern, declared using the pattern element, which is used to group together different rules. The pattern element also has a name attribute that will be displayed in the output when the pattern is checked. For the above assertions you could, for example, have two patterns, one for checking the structure and one for checking the co-occurrence constraint. Since patterns group together different rules, Schematron is designed so that rules are declared as children of the pattern element. This means that the above example, using the two patterns, would look like this:

<pattern name="Check structure">
   <rule context="Person">
      <assert test="@Title">The element Person must have a Title attribute.</assert>
      <assert test="count(*) = 2 and count(Name) = 1 and count(Sex) = 1">The element Person should have the child elements Name and Sex.</assert>
      <assert test="*[1] = Name">The element Name must appear before element Sex.</assert>
   </rule>
</pattern>
<pattern name="Check co-occurrence constraints">
   <rule context="Person">
      <assert test="(@Title = 'Mr' and Sex = 'Male') or @Title != 'Mr'">If the Title is "Mr" then the sex of the person must be "Male".</assert>
   </rule>
</pattern>

The name of the pattern will always be displayed in the output regardless of whether the assertions fail or succeed and if the assertion fails the output will also contain the content of the assertion element. However, there is also additional information displayed together with the assertion text to help the user locate the source of the failed assertion. For example, if the co-occurrence constraint above was violated by having Title='Mr' and Sex='Female' then the following diagnostic would be generated by Schematron:

From pattern "Check structure":

From pattern "Check co-occurrence constraints":
   Assertion fails: "If the Title is "Mr" then the sex of the person must be "Male"." at
      /Person[1]
         <Person Title="Mr">...</>

So, the pattern names are always displayed while the assertion text is only displayed when the assertion fails. The additional information starts with an XPath that shows the location of the context element in the instance document (in this case the first Person element) and then on a new line the start tag of the context element is displayed.

The assertion to test the co-occurrence constraint is not trivial and in fact this rule could be written in a simpler way by using an XPath predicate when selecting the context. Instead of having the context set to all Person elements the co-occurrence constraint can be simplified by only specifying the context to be all the Person elements that have the attribute Title='Mr'. If the rule was specified using this technique the co-occurrence constraint could be described like this:

<rule context="Person[@Title='Mr']">
   <assert test="Sex = 'Male'">If the Title is "Mr" then the sex of the person must be "Male".</assert>
</rule>

So, by moving some of the logic from the assertion to the specification of the context the complexity of the rule has been decreased. This is a technique that often is very useful when writing Schematron schemas.

This concludes this introduction about patterns and now all that is left to do is to wrap the patterns in the Schematron schema in a schema element and specify that all the Schematron elements used should be defined in the Schematron namespace, http://www.ascc.net/xml/schematron. This means that the complete Schematron schema for the example would be:

<?xml version="1.0" encoding="UTF-8"?>
<sch:schema xmlns:sch="http://www.ascc.net/xml/schematron">
   <sch:pattern name="Check structure">
      <sch:rule context="Person">
         <sch:assert test="@Title">The element Person must have a Title attribute</sch:assert>
         <sch:assert test="count(*) = 2 and count(Name) = 1 and count(Sex) = 1">The element Person should have the child elements Name and Sex.</sch:assert>
         <sch:assert test="*[1] = Name">The element Name must appear before element Sex.</sch:assert>
      </sch:rule>
   </sch:pattern>
   <sch:pattern name="Check co-occurrence constraints">
      <sch:rule context="Person">
         <sch:assert test="(@Title = 'Mr' and Sex = 'Male') or @Title != 'Mr'">If the Title is "Mr" then the sex of the person must be "Male".</sch:assert>
      </sch:rule>
   </sch:pattern>
</sch:schema>

Schematron can also be used to validate XML instance documents that use namespaces. Each namespace used in the XML instance document should be declared in the Schematron schema. The element used to declare a namespace is the ns element, which should appear as a child of the schema element. The ns element has two attributes, uri and prefix, which are used to define the namespace URI and the namespace prefix. So, if the XML instance document in the example had been defined in the namespace www.topologi.com/example then the Schematron schema would look like this:

<?xml version="1.0" encoding="UTF-8"?>
<sch:schema xmlns:sch="http://www.ascc.net/xml/schematron">
   <sch:ns uri="www.topologi.com/example" prefix="ex"/>
   <sch:pattern name="Check structure">
      <sch:rule context="ex:Person">
         <sch:assert test="@Title">The element Person must have a Title attribute</sch:assert>
         <sch:assert test="count(ex:*) = 2 and count(ex:Name) = 1 and count(ex:Sex) = 1">The element Person should have the child elements Name and Sex.</sch:assert>
         <sch:assert test="ex:*[1] = ex:Name">The element Name must appear before element Sex.</sch:assert>
      </sch:rule>
   </sch:pattern>
   <sch:pattern name="Check co-occurrence constraints">
      <sch:rule context="ex:Person">
         <sch:assert test="(@Title = 'Mr' and ex:Sex = 'Male') or @Title != 'Mr'">If the Title is "Mr" then the sex of the person must be "Male".</sch:assert>
      </sch:rule>
   </sch:pattern>
</sch:schema>

Note that all XPath expressions that test element values now include the namespace prefix ex.
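
For illustration, an instance document that this schema would accept could look like the sketch below. The prefix used in the instance document does not have to be ex, since only the namespace URI is compared, and the Title attribute stays unprefixed because it is not in any namespace:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch: the Person example placed in the namespace www.topologi.com/example -->
<p:Person Title="Mr" xmlns:p="www.topologi.com/example">
   <p:Name>Eddie</p:Name>
   <p:Sex>Male</p:Sex>
</p:Person>
```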

Schematron processing

One of the major advantages with Schematron is that you do not need a specially written Schematron processor in order to validate the XML instance documents. Since Schematron is built using XPath and XSLT functions all you need is an XSLT processor. The Schematron processing then works in two steps (see Figure 1):

  1. The Schematron schema is first turned into a validating XSLT stylesheet by transforming it with an XSLT stylesheet provided by Academia Sinica Computing Centre. These stylesheets (schematron-basic.xsl, schematron-message.xsl and schematron-report.xsl) can be found at the Schematron website, and the different stylesheets generate different output. For example, schematron-basic.xsl is used to generate simple text output like in the example above.
  2. This validating stylesheet is then used on the XML instance document and the result will be a report that is based on the rules and assertions in the original Schematron schema.

This means that it is very easy to set up a Schematron processor because the only thing needed is an XSLT processor together with one of the Schematron stylesheets. Here is an example of how to validate the example used above, where the XML instance document is called Person.xml and the Schematron schema is called Person.sch. The example uses Saxon as the XSLT processor:

C:\>saxon -o validate_person.xsl Person.sch schematron-basic.xsl

C:\>saxon Person.xml validate_person.xsl

From pattern "Check structure":

From pattern "Check co-occurrence constraints":
   Assertion fails: "If the Title is "Mr" then the sex of the person must be "Male"." at
      /Person[1]
         <Person Title="Mr">...</>
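
To give a feeling for what the first step produces, here is a hand-written and heavily simplified sketch of a validating stylesheet for the co-occurrence pattern. It is an illustration under stated assumptions only: the stylesheet actually generated by schematron-basic.xsl is structured differently and produces more detailed output, and the mode name M1 is made up for this sketch.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
   <xsl:output method="text"/>

   <!-- One block per pattern: print the pattern name, then visit the nodes to check -->
   <xsl:template match="/">
      <xsl:text>From pattern "Check co-occurrence constraints":&#10;</xsl:text>
      <xsl:apply-templates select="//Person" mode="M1"/>
   </xsl:template>

   <!-- The rule context becomes a match pattern (M1 is a hypothetical mode name) -->
   <xsl:template match="Person" mode="M1">
      <!-- An assert reports when its test is NOT true -->
      <xsl:if test="not((@Title = 'Mr' and Sex = 'Male') or @Title != 'Mr')">
         <xsl:text>   Assertion fails: If the Title is "Mr" then the sex of the person must be "Male".&#10;</xsl:text>
      </xsl:if>
   </xsl:template>
</xsl:stylesheet>
```

Applying a stylesheet of this kind to Person.xml corresponds to the second step.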

Embedded Schematron Rules in W3C XML Schema

One really good thing about W3C XML Schema is that it is very easy to extend, and one way to do so is to use the annotation element. The annotation element can have two child elements, namely documentation and appinfo. The documentation element is mainly intended to provide humans with information about the schema, while the appinfo element is intended for applications. The appinfo element is defined so that it can have any well-formed XML content from any namespace. Since a Schematron rule uses XML syntax, this is the perfect place to embed rules from Schematron.

Almost all elements defined by the W3C XML Schema specification can have the annotation child element, and the most logical place to put a Schematron rule is on the element declaration to which the rule applies. This means that the W3C XML Schema element declaration and the Schematron rule that applies to the element are declared in the same place. However, since the Schematron rules add more code to the already verbose W3C XML Schema, you can just as easily include all the Schematron rules in, for example, the annotation element for the schema element itself. This may improve readability of the schema by concentrating the Schematron rules at the beginning of the W3C XML Schema.

Here is a very simple W3C XML Schema that defines only one element:

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
   <xs:element name="Root" type="xs:string">
   </xs:element>
</xs:schema>

Now, if a Schematron rule should have the Root element as its context this rule could be added as an embedded Schematron rule within the appinfo element of the declaration like this:

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
   <xs:element name="Root" type="xs:string">
      <xs:annotation>
         <xs:appinfo>
            <sch:pattern name="Test constraints on the Root element" xmlns:sch="http://www.ascc.net/xml/schematron">
               <sch:rule context="Root">
                  <sch:assert test="test-condition">Error message when the assertion condition is broken...</sch:assert>
               </sch:rule>
            </sch:pattern>
         </xs:appinfo>
      </xs:annotation>
   </xs:element>
</xs:schema>

As can be seen from the example all embedded Schematron rules must be added on the pattern level and all Schematron elements must be declared in the Schematron namespace, http://www.ascc.net/xml/schematron. The rules are embedded on a pattern level because this way the pattern name will be included in the output which helps identify which rule was broken if there is a validation problem in the XML instance document.
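
The alternative placement mentioned earlier, with the patterns collected in the annotation of the schema element itself, could be sketched like this. The pattern is the same placeholder as above; only its position changes:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
   <!-- Sketch: all embedded Schematron patterns gathered at the top of the schema -->
   <xs:annotation>
      <xs:appinfo>
         <sch:pattern name="Test constraints on the Root element" xmlns:sch="http://www.ascc.net/xml/schematron">
            <sch:rule context="Root">
               <sch:assert test="test-condition">Error message when the assertion condition is broken...</sch:assert>
            </sch:rule>
         </sch:pattern>
      </xs:appinfo>
   </xs:annotation>
   <xs:element name="Root" type="xs:string"/>
</xs:schema>
```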

Now that we know how to write Schematron schemas and we have seen an example of an embedded Schematron rule in a W3C XML Schema we can have a look at how to solve the different problems stated in the introduction.

Dependent attributes

To illustrate we will use an example where we have a socket element with two attributes, hostName and hostAddress. The requirement is that these two attributes are mutually exclusive, so that if one is present the other cannot be present. It is also required that at least one of the attributes must appear.

W3C XML Schema will be used to declare the socket element and also that the socket element can have two attributes, hostName and hostAddress. The closest we can get to the above constraint in W3C XML Schema is to declare both attributes as optional since neither hostName nor hostAddress is required. This schema could look like the following:

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
   <xs:element name="socket">
      <xs:complexType>
         <xs:attribute name="hostName" type="xs:string" use="optional"/>
         <xs:attribute name="hostAddress" type="xs:string" use="optional"/>
      </xs:complexType>
   </xs:element>
</xs:schema>

A Schematron rule can now be embedded on the socket element to add the extra constraints. In this case the constraint is divided into two assertions, which allows for a separate error message for each assertion:

  1. Both hostName and hostAddress cannot be present at the same time
  2. At least one of hostName and hostAddress must be present

The W3C XML Schema with an embedded Schematron rule for this example would look like this:

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
   <xs:element name="socket">
      <xs:annotation>
         <xs:appinfo>
            <sch:pattern name="Mutually exclusive attributes on the socket element" xmlns:sch="http://www.ascc.net/xml/schematron">
               <sch:rule context="socket">
                  <sch:report test="@hostName and @hostAddress">On a socket element only one of the attributes hostName and hostAddress is allowed, not both.</sch:report>
                  <sch:assert test="@hostName | @hostAddress">One of the attributes hostName or hostAddress must be present on the socket element</sch:assert>
               </sch:rule>
            </sch:pattern>
         </xs:appinfo>
      </xs:annotation>
      <xs:complexType>
         <xs:attribute name="hostName" type="xs:string" use="optional"/>
         <xs:attribute name="hostAddress" type="xs:string" use="optional"/>
      </xs:complexType>
   </xs:element>
</xs:schema>

With this schema the following two instance documents are valid:

<?xml version="1.0" encoding="UTF-8"?>
<socket hostAddress="192.168.200.76"/>

<?xml version="1.0" encoding="UTF-8"?>
<socket hostName="pc100"/>

while the following two instance documents are invalid:

<?xml version="1.0" encoding="UTF-8"?>
<socket hostAddress="192.168.200.76" hostName="pc100"/>

<?xml version="1.0" encoding="UTF-8"?>
<socket/>
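
If a single error message is acceptable, the two checks can also be folded into one assertion. This is only a sketch of an alternative and not the form used in the schema above; it counts the nodes in the union of the two attributes and requires exactly one:

```xml
<!-- Sketch: exactly one of hostName and hostAddress may be present -->
<sch:rule context="socket" xmlns:sch="http://www.ascc.net/xml/schematron">
   <sch:assert test="count(@hostName | @hostAddress) = 1">Exactly one of the attributes hostName and hostAddress must be present on the socket element.</sch:assert>
</sch:rule>
```

The price is that the message can no longer say whether an attribute was missing or one too many was given, which is why keeping two separate tests can be preferable.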

Interleaving of elements

The all group requires that each element declared in its content has its maxOccurs attribute fixed to 1. This simplifies processing but limits the usefulness of the group. For example, the following content model is not allowed:

<xs:element name="Root">
   <xs:complexType>
      <xs:all>
         <xs:element name="child1" type="xs:string" minOccurs="5" maxOccurs="5"/>
         <xs:element name="child2" type="xs:string" minOccurs="2" maxOccurs="2"/>
         <xs:element name="child3" type="xs:string" minOccurs="0"/>
         <xs:element name="child4" type="xs:string" minOccurs="3" maxOccurs="7"/>
      </xs:all>
   </xs:complexType>
</xs:element>

By changing the all group to a choice group, and by making the choice group itself optional and repeatable, a content model is created in which the different child elements can appear in any order. If, for example, all the child1 elements should be grouped together in the instance document, the minOccurs constraint can be kept as it is. If the child elements do not have to be grouped together, the minOccurs constraint can be set to 1 to allow for a full mixture of the elements:

<xs:element name="Root">
   <xs:complexType>
      <xs:choice minOccurs="0" maxOccurs="unbounded">
         <xs:element name="child1" type="xs:string" minOccurs="1" maxOccurs="5"/>
         <xs:element name="child2" type="xs:string" minOccurs="1" maxOccurs="2"/>
         <xs:element name="child3" type="xs:string" minOccurs="0"/>
         <xs:element name="child4" type="xs:string" minOccurs="1" maxOccurs="7"/>
      </xs:choice>
   </xs:complexType>
</xs:element>

Unfortunately this also removes the occurrence constraints on the children. Sometimes this is not a very important requirement and if that is the case the above will probably be sufficient. If, however, the occurrence constraints on the child elements are important it is trivial to add a Schematron rule to check this. The following schema will illustrate:

<xs:element name="Root">
   <xs:annotation>
      <xs:appinfo>
         <sch:pattern name="Extended_all" xmlns:sch="http://www.ascc.net/xml/schematron">
            <sch:rule context="Root">
               <sch:assert test="count(child1) = 5">You must have exactly 5 child1 elements.</sch:assert>
               <sch:assert test="count(child2) = 2">You must have exactly 2 child2 elements.</sch:assert>
               <sch:assert test="count(child3) &lt;= 1">You can only have one child3 element.</sch:assert>
               <sch:assert test="count(child4) &gt;= 3 and count(child4) &lt;= 7">You must have at least 3 child4 elements but you can't have more than 7.</sch:assert>
            </sch:rule>
         </sch:pattern>
      </xs:appinfo>
   </xs:annotation>
   <xs:complexType>
      <xs:choice minOccurs="0" maxOccurs="unbounded">
         <xs:element name="child1" type="xs:string" minOccurs="1" maxOccurs="5"/>
         <xs:element name="child2" type="xs:string" minOccurs="1" maxOccurs="2"/>
         <xs:element name="child3" type="xs:string" minOccurs="0"/>
         <xs:element name="child4" type="xs:string" minOccurs="1" maxOccurs="7"/>
      </xs:choice>
   </xs:complexType>
</xs:element>

This schema would validate a true mixture of all the child elements. If the child elements should be grouped together the only change would be to preserve the minOccurs constraint on each child element (5 for child1, 2 for child2 and 3 for child4). However, since child4's occurrence is a range a new Schematron rule is needed to assert that all child4 elements are grouped together. The new schema would look like this:

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
   <xs:element name="Root">
      <xs:annotation>
         <xs:appinfo>
            <sch:pattern name="Extended_all" xmlns:sch="http://www.ascc.net/xml/schematron">
               <sch:rule context="Root">
                  <sch:assert test="count(child1) = 5">You must have exactly 5 child1 elements.</sch:assert>
                  <sch:assert test="count(child2) = 2">You must have exactly 2 child2 elements.</sch:assert>
                  <sch:assert test="count(child3) &lt;= 1">You can only have one child3 element.</sch:assert>
                  <sch:assert test="count(child4) &gt;= 3 and count(child4) &lt;= 7">You must have at least 3 child4 elements but you can't have more than 7.</sch:assert>
               </sch:rule>
               <sch:rule context="Root/*">
                  <sch:assert test="not(preceding-sibling::*[1][name() != name(current())][preceding-sibling::*[name() = name(current())]])">All <sch:name/> elements must be grouped with the other <sch:name/> elements.</sch:assert>
               </sch:rule>
            </sch:pattern>
         </xs:appinfo>
      </xs:annotation>
      <xs:complexType>
         <xs:choice minOccurs="0" maxOccurs="unbounded">
            <xs:element name="child1" type="xs:string" minOccurs="5" maxOccurs="5"/>
            <xs:element name="child2" type="xs:string" minOccurs="2" maxOccurs="2"/>
            <xs:element name="child3" type="xs:string" minOccurs="0"/>
            <xs:element name="child4" type="xs:string" minOccurs="3" maxOccurs="7"/>
         </xs:choice>
      </xs:complexType>
   </xs:element>
</xs:schema>

The new rule has each of the child elements of Root as its context so this rule will apply to all the children of Root. The assertion in the rule uses the preceding sibling axis to assert that all the child elements must be grouped together. In this case it would have been enough to apply this rule to child4 (since it's the only element with an occurrence range) but it is just as easy to apply the same rule for all the children.
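
The following instance document, constructed here purely for illustration, shows why the extra rule is needed. As far as the structure goes it should satisfy the W3C XML Schema part (the choice is simply repeated, with child4 chosen twice), and it satisfies all four count assertions (there are 6 child4 elements). The child4 elements are, however, split into two groups, so the grouping assertion fails for the fourth child4 element:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<Root>
   <child4>a</child4>
   <child4>b</child4>
   <child4>c</child4>
   <child1>a</child1>
   <child1>b</child1>
   <child1>c</child1>
   <child1>d</child1>
   <child1>e</child1>
   <!-- Second child4 group: the next element follows a child1 that itself follows a child4 -->
   <child4>d</child4>
   <child4>e</child4>
   <child4>f</child4>
   <child2>a</child2>
   <child2>b</child2>
</Root>
```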

Co-occurrence constraints

The number of examples for co-occurrence constraints is more or less unlimited and one example was used in the Introduction to Schematron section above. In that example the co-occurrence constraint was that if the Title attribute on element Person had the value 'Mr' then the value of the Sex sub-element must be 'Male'. Instead of defining everything using a Schematron schema, this example will show how to do the structure in W3C XML Schema and the co-occurrence constraint with a Schematron rule. The W3C XML Schema for this simple example is straightforward:

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
   <xs:element name="Person">
      <xs:complexType>
         <xs:sequence>
            <xs:element name="Name" type="xs:string"/>
            <xs:element name="Sex">
               <xs:simpleType>
                  <xs:restriction base="xs:string">
                     <xs:enumeration value="Male"/>
                     <xs:enumeration value="Female"/>
                  </xs:restriction>
               </xs:simpleType>
            </xs:element>
         </xs:sequence>
         <xs:attribute name="Title" type="xs:string" use="required"/>
      </xs:complexType>
   </xs:element>
</xs:schema>

This schema defines the structure of the XML instance document and the only thing the Schematron rule needs to define is the co-occurrence constraint. The complete schema with an embedded Schematron rule would look like this:

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
   <xs:element name="Person">
      <xs:annotation>
         <xs:appinfo>
            <sch:pattern name="Co-occurrence constraint on attribute Title" xmlns:sch="http://www.ascc.net/xml/schematron">
               <sch:rule context="Person[@Title='Mr']">
                  <sch:assert test="Sex = 'Male'">If the Title is "Mr" then the sex of the person must be "Male".</sch:assert>
               </sch:rule>
            </sch:pattern>
         </xs:appinfo>
      </xs:annotation>
      <xs:complexType>
         <xs:sequence>
            <xs:element name="Name" type="xs:string"/>
            <xs:element name="Sex">
               <xs:simpleType>
                  <xs:restriction base="xs:string">
                     <xs:enumeration value="Male"/>
                     <xs:enumeration value="Female"/>
                  </xs:restriction>
               </xs:simpleType>
            </xs:element>
         </xs:sequence>
         <xs:attribute name="Title" type="xs:string" use="required"/>
      </xs:complexType>
   </xs:element>
</xs:schema>
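To see what the embedded rule adds, consider an instance that the W3C XML Schema alone would accept. The document below is a hypothetical example (it is not one of the downloadable files): it has the required Title attribute and an enumerated value for Sex, so it is structurally valid, yet it breaks the co-occurrence constraint and should only be reported by the Schematron assertion:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical instance: valid against the W3C XML Schema structure -->
<Person Title="Mr">
   <Name>Eddie</Name>
   <!-- Fails the embedded rule: Title is 'Mr' but Sex is not 'Male' -->
   <Sex>Female</Sex>
</Person>
```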

Dependency between XML documents

By using the document() function in XSLT it is also possible to apply constraints between XML instance documents and not just within a single document. To illustrate this we use two simple XML instance documents: one contains a single Person element with a Name sub-element and the other contains a single Car element with an Owner attribute. The W3C XML Schemas for these documents would be:

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
   <xs:element name="Person">
      <xs:complexType>
         <xs:sequence>
            <xs:element name="Name" type="xs:string"/>
         </xs:sequence>
      </xs:complexType>
   </xs:element>
</xs:schema>

and

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
   <xs:element name="Car">
      <xs:complexType>
         <xs:attribute name="Owner" type="xs:string" use="required"/>
      </xs:complexType>
   </xs:element>
</xs:schema>

The instance documents would be:

<?xml version="1.0" encoding="UTF-8"?>
<Person>
   <Name>Eddie</Name>
</Person>

and

<?xml version="1.0" encoding="UTF-8"?>
<Car Owner="Eddie"/>

Now we want to make sure that the value of the Owner attribute in Car.xml matches the value of Person/Name in Person.xml. This can be done by inserting a Schematron rule in the W3C XML Schema that defines the Car document:

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
   <xs:element name="Car">
      <xs:annotation>
         <xs:appinfo>
            <sch:pattern name="Car owner must link to a person" xmlns:sch="http://www.ascc.net/xml/schematron">
               <sch:rule context="Car">
                  <sch:assert test="document('Person.xml')/Person/Name = @Owner">The owner of the car must match the name of the person in Person.xml.</sch:assert>
               </sch:rule>
            </sch:pattern>
         </xs:appinfo>
      </xs:annotation>
      <xs:complexType>
         <xs:attribute name="Owner" type="xs:string" use="required"/>
      </xs:complexType>
   </xs:element>
</xs:schema>

The document() function will bring in the elements from the Person.xml file and the assertion will make sure that the value of the Owner attribute matches the value of the Person/Name element.
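As a small illustration, the hypothetical Car document below is valid against the W3C XML Schema but should fail the assertion, since no Name in Person.xml equals 'Bob'. Note that in XSLT 1.0 a relative URI passed to document() as a string is resolved against the stylesheet that contains the call, so in this setup Person.xml would normally need to be reachable relative to the generated validating stylesheet, or be referenced with an absolute URI:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical instance: structurally valid, but 'Bob' does not match
     the Name element ('Eddie') in Person.xml, so the assertion fails -->
<Car Owner="Bob"/>
```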

Embedded Schematron Rules in RELAX-NG

Unlike in W3C XML Schema, the embedded Schematron rules in a RELAX-NG schema do not have to be declared within a specific element. Since a RELAX-NG validator will ignore all elements not in the RELAX-NG namespace (http://relaxng.org/ns/structure/1.0), the Schematron rules can be declared between any RELAX-NG elements.

Here is a very simple RELAX-NG schema:

<?xml version="1.0" encoding="UTF-8"?>
<element name="Root" xmlns="http://relaxng.org/ns/structure/1.0">
   <text/>
</element>

Now, if a Schematron rule should have the Root element as its context this rule could be added as an embedded Schematron rule like this:

<?xml version="1.0" encoding="UTF-8"?>
<element name="Root" xmlns="http://relaxng.org/ns/structure/1.0">
   <sch:pattern name="Test constraints on the Root element" xmlns:sch="http://www.ascc.net/xml/schematron">
      <sch:rule context="Root">
         <sch:assert test="test-condition">Error message when the assertion condition is broken...</sch:assert>
      </sch:rule>
   </sch:pattern>
   <text/>
</element>

The Schematron rules embedded in a RELAX-NG schema are inserted on the pattern level and need to be declared in the Schematron namespace (http://www.ascc.net/xml/schematron), just as for W3C XML Schemas.

Co-occurrence constraints

Although RELAX-NG has better support for co-occurrence constraints than W3C XML Schema there are still many types of co-occurrence constraints that cannot be expressed by RELAX-NG. One such example is identity constraints, which have been left out of the current version of RELAX-NG.

As an example we are going to use a schema that defines a sports tournament. The tournament has a name, a number of teams which each have a unique id, and a number of matches that define which teams will meet in each match. Typically such a schema would validate that every team in a match is also one of the teams registered in the tournament. Although some basic identity constraints can be expressed using the DTD-style ID and IDREF types, more complex identity constraints will have to be checked with embedded Schematron rules.

A RELAX-NG schema for the tournament described above could look like this:

<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0">
   <start>
      <ref name="Tournament"/>
   </start>
   <define name="Tournament">
      <element name="Tournament">
         <element name="Name"><text/></element>
         <element name="Teams">
            <!-- We must have at least two teams -->
            <ref name="Team"/>
            <oneOrMore>
               <ref name="Team"/>
            </oneOrMore>
         </element>
         <element name="Matches">
            <oneOrMore>
               <element name="Match">
                  <element name="Team"><text/></element>
                  <element name="Team"><text/></element>
                  <attribute name="id"/>
               </element>
            </oneOrMore>
         </element>
      </element>
   </define>
   <define name="Team">
      <element name="Team">
         <attribute name="id"/>
         <optional>
            <attribute name="Name"/>
         </optional>
      </element>
   </define>
</grammar>

An XML instance document that would be valid against this schema is:

<?xml version="1.0" encoding="UTF-8"?>
<Tournament>
   <Name>FIFA World Cup</Name>
   <Teams>
      <Team Name="Sweden" id="t1"/>
      <Team Name="Argentina" id="t2"/>
      <Team Name="Nigeria" id="t3"/>
      <Team Name="England" id="t4"/>
   </Teams>
   <Matches>
      <Match id="m1">
         <Team>t1</Team>
         <Team>t4</Team>
      </Match>
      <Match id="m2">
         <Team>t2</Team>
         <Team>t3</Team>
      </Match>
   </Matches>
</Tournament>

Unfortunately the RELAX-NG schema will also validate the XML instance document even if the id for one of the teams playing in a match doesn't match the id of a team that has been registered in the tournament (that is, one that appears as a child of the Teams element). It is very easy to add a Schematron rule to check this extra constraint, for example by adding an embedded rule to the definition of the pattern that matches the Match element:

<element name="Matches">
   <oneOrMore>
      <element name="Match">
         <sch:pattern name="Check that each team is registered in the tournament" xmlns:sch="http://www.ascc.net/xml/schematron">
            <sch:rule context="Matches/Match/Team">
               <sch:assert test="text() = ../../../Teams/Team/@id"
>Each Team in a Match must be a registered Team in the tournament.</sch:assert>
            </sch:rule>
         </sch:pattern>
         <element name="Team"><text/></element>
         <element name="Team"><text/></element>
         <attribute name="id"/>
      </element>
   </oneOrMore>
</element>

With this new definition for the pattern the following XML instance document would be invalid

<?xml version="1.0" encoding="UTF-8"?>
<Tournament>
   <Name>FIFA World Cup</Name>
   <Teams>
      <Team Name="Sweden" id="t1"/>
      <Team Name="Argentina" id="t2"/>
   </Teams>
   <Matches>
      <Match id="m1">
         <Team>t1</Team>
         <Team>t4</Team>
      </Match>
   </Matches>
</Tournament>

since a team with an id="t4" is not registered in the tournament.
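The same approach could cover the requirement that each team has a unique id, which the RELAX-NG schema above does not enforce either. The following is only a sketch: the pattern name and error message are illustrative assumptions, and the rule could be embedded, for instance, in the definition of the Team pattern:

```xml
<!-- Hypothetical additional rule: a Team id must not repeat the id
     of any earlier Team sibling inside the same Teams element -->
<sch:pattern name="Check that team ids are unique" xmlns:sch="http://www.ascc.net/xml/schematron">
   <sch:rule context="Teams/Team">
      <sch:assert test="not(@id = preceding-sibling::Team/@id)">Each Team must have a unique id.</sch:assert>
   </sch:rule>
</sch:pattern>
```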

Dependency between XML documents

Neither RELAX-NG nor W3C XML Schema was designed to handle dependencies between XML instance documents but sometimes this is a necessary requirement. For example, if the teams in the previous example were put in a separate XML instance document we would still need to validate that each team in a match is registered as a child of the Teams element.

With this new design the RELAX-NG schema for the tournament would be:

<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0">
   <start>
      <ref name="Tournament"/>
   </start>
   <define name="Tournament">
      <element name="Tournament">
         <element name="Name"><text/></element>
         <element name="Matches">
            <oneOrMore>
               <element name="Match">
                  <element name="Team"><text/></element>
                  <element name="Team"><text/></element>
                  <attribute name="id"/>
               </element>
            </oneOrMore>
         </element>
      </element>
   </define>
</grammar>

with the corresponding instance:

<?xml version="1.0" encoding="UTF-8"?>
<Tournament>
   <Name>FIFA World Cup</Name>
   <Matches>
      <Match id="m1">
         <Team>t1</Team>
         <Team>t4</Team>
      </Match>
      <Match id="m2">
         <Team>t2</Team>
         <Team>t3</Team>
      </Match>
   </Matches>
</Tournament>

The schema that defines the teams would be:

<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0">
   <start>
      <ref name="Teams"/>
   </start>
   <define name="Teams">
      <element name="Teams">
         <!-- We must have at least two teams -->
         <ref name="Team"/>
         <oneOrMore>
            <ref name="Team"/>
         </oneOrMore>
      </element>
   </define>
   <define name="Team">
      <element name="Team">
         <attribute name="id"/>
         <optional>
            <attribute name="Name"/>
         </optional>
      </element>
   </define>
</grammar>

with the instance:

<?xml version="1.0" encoding="UTF-8"?>
<Teams>
   <Team Name="Sweden" id="t1"/>
   <Team Name="Argentina" id="t2"/>
   <Team Name="Nigeria" id="t3"/>
   <Team Name="England" id="t4"/>
</Teams>

Now, when the XML instance document with the tournament information is validated, we still want to make sure that each team in a match is declared in the XML instance document that contains the teams. As in the previous example, the embedded Schematron rule can be defined on the pattern for the Match element. The only difference is that this time the document() function is used to access the instance where the teams are defined:

<element name="Matches">
   <oneOrMore>
      <element name="Match">
         <sch:pattern name="Check that each team is registered in the tournament" xmlns:sch="http://www.ascc.net/xml/schematron">
            <sch:rule context="Matches/Match/Team">
               <sch:assert test="text() = document('Teams.xml')/Teams/Team/@id"
>Each Team in a Match must be a registered Team in the tournament.</sch:assert>
            </sch:rule>
         </sch:pattern>
         <element name="Team"><text/></element>
         <element name="Team"><text/></element>
         <attribute name="id"/>
      </element>
   </oneOrMore>
</element>

Processing

Neither a W3C XML Schema nor a RELAX-NG processor will recognize and perform the validation constraints expressed by the embedded Schematron rules. In fact, the embedded Schematron rules will be completely ignored by both processors since for W3C XML Schema they are declared within the appinfo element and for RELAX-NG they are declared in the Schematron namespace [2]. This means that in order to use the Schematron rules for validation they need to be extracted from the host schema and concatenated into a Schematron schema. Since all three schema languages use XML syntax, XSLT is a perfect tool for this.

The XSD2Schtrn.xsl stylesheet will extract embedded Schematron rules from a W3C XML Schema document and merge them into a complete Schematron schema. It will also extract Schematron rules that have been declared in W3C XML Schema modules that are imported, included or redefined in the base schema. Similarly, the RNG2Schtrn.xsl stylesheet will extract embedded Schematron rules from a RELAX-NG schema document. It will also extract Schematron rules that have been declared in RELAX-NG modules that are included in or referenced from the base schema.
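The core idea behind such an extractor can be sketched in a few lines of XSLT 1.0. The stylesheet below is not the actual XSD2Schtrn.xsl: it is a simplified illustration that assumes all rules live in a single schema document, and it does not follow import, include or redefine:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Simplified sketch of an extractor (not the real XSD2Schtrn.xsl) -->
<xsl:stylesheet version="1.0"
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   xmlns:xs="http://www.w3.org/2001/XMLSchema"
   xmlns:sch="http://www.ascc.net/xml/schematron"
   exclude-result-prefixes="xs">
   <xsl:output method="xml" indent="yes"/>
   <xsl:template match="/">
      <!-- Wrap every embedded pattern in one Schematron schema element -->
      <sch:schema>
         <xsl:copy-of select="//xs:appinfo/sch:pattern"/>
      </sch:schema>
   </xsl:template>
</xsl:stylesheet>
```

Applied to the Person schema shown earlier, a stylesheet along these lines would produce a sch:schema element containing the single pattern "Co-occurrence constraint on attribute Title". For a RELAX-NG host schema the select expression could, under the same assumptions, simply be //sch:pattern.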

The result from the scripts is a complete Schematron schema that can be validated using the two-step XSLT process described in the Introduction to Schematron section above. This means that validation results are available from both Schematron validation and W3C XML Schema or RELAX-NG validation and if needed the results can be merged into one report. The whole process is described in the following picture:

As can be seen in the picture, there are two distinct paths in the processing, which means that if timing is important the two paths could be implemented as separate processes and executed in parallel.

A batch file that would (using XSV and Saxon) validate an XML instance document against both W3C XML Schema and its embedded Schematron rules can look like this:

echo Running XSV validation on Person_bad.xml...

   xsv Person_bad.xml

echo Creating Schematron schema from appinfo in Person.xsd...

   saxon -o Person.sch Person.xsd XSD2Schtrn.xsl

echo Running Basic Schematron validation on file Person_bad.xml...

   saxon -o validate.xsl Person.sch schematron-basic.xsl
   saxon Person_bad.xml validate.xsl

So, first the XML instance document is validated against the W3C XML Schema using XSV and then it is validated against the embedded Schematron rules using Saxon. An output example could look like this:

Running XSV validation on Person.xml...

<?xml version='1.0'?>
<xsv docElt='{None}Person' instanceAssessed='true' instanceErrors='0' rootType='[Anonymous]' schemaErrors='0' schemaLocs='None -> Person.xsd' target='file:/E:/Work/XMLSchema/XML-DEV/Schtrn+W3C/Person.xml' validation='strict' version='XSV 1.203.2.16/1.106.2.8 of 2001/10/28 17:39:15' xmlns='http://www.w3.org/2000/05/xsv'>
<schemaDocAttempt URI='file://C:/Person.xsd' outcome='success' source='schemaLoc'/>
</xsv>

Done.

Creating Schematron schema from appinfo in Person.xsd...
Running Basic Schematron validation on file Person.xml...

   From pattern "Check structure":

   From pattern "Check co-occurrence constraints":
Assertion fails: "If the Title is "Mr" then the sex of the person must be "Male"." at
   /Person[1]
   <Person Title="Mr">...</>

Similarly, a batch file that would (using the Win32 executable of Jing and Saxon) validate an XML instance document against a RELAX-NG schema and its embedded Schematron rules can look like this:

echo Running Jing validation on Tournament_bad.xml...

   jing Tournament.rng Tournament_bad.xml

echo Creating Schematron schema from Tournament.rng...

   saxon -o Tournament.sch Tournament.rng RNG2Schtrn.xsl

echo Running Basic Schematron validation on file Tournament_bad.xml...

   saxon -o validate.xsl Tournament.sch schematron-basic.xsl
   saxon Tournament_bad.xml validate.xsl

An output example could look like this:

Running Jing validation on Tournament_bad.xml...

Error at URL "file:/D:/Work/XMLSchema/XML-DEV/Schtrn+W3C/Article/Emb_Schtrn/Tournament_bad.xml", line number 7: unknown element "BugusTeam"

Creating Schematron schema from Tournament.rng...

Running Basic Schematron validation on file Tournament_bad.xml...

From pattern "Check that each team is registered in the tournament":
   Assertion fails: "Each Team in a Match must be a registered Team in the tournament." at    /Tournament[1]/Matches[1]/Match[1]/Team[2]
   <Team>...</>

Done.

The Topologi Schematron Validator is a graphical validator that can validate an XML instance document using both W3C XML Schemas and RELAX-NG schemas with embedded Schematron rules.

Summary

Schematron is a very good complement to both W3C XML Schema and RELAX-NG and there seems little that cannot be validated by the combination. This article has shown how to extract the embedded Schematron rules and validate the resulting Schematron schema using a three-step XSLT process. The examples shown can be downloaded in a zip-file that also contains Saxon, XSV and Jing so you can try them out yourself (only Windows is supported and Jing needs Microsoft Java VM).

It is up to each project and use-case to evaluate whether this is a suitable technique for achieving more powerful validation. Some of the advantages and disadvantages to take into account are:

+ By combining the power of W3C XML Schema and Schematron the limit for what can be done in terms of validation is raised to a new level.
+ Many of the constraints that previously had to be checked in the application can now be moved out of the application and into the schema.
+ Since Schematron lets you provide your own error messages (the content of the assertion elements) you can assure that each message is as explanatory as it needs to be.

- In time-critical applications the overhead of processing the embedded Schematron rules may be too large.
- Since the extraction of Schematron rules from a RELAX-NG schema is performed with XSLT, embedded Schematron rules are only supported in RELAX-NG schemas that use the full XML syntax.

For W3C XML Schema it should also be noted that, at this stage, Schematron rules can only be applied to specific elements in the XML instance document. It is not yet possible to apply a Schematron rule to a type definition in W3C XML Schema, which would make this technique even more powerful. Depending on how much of the PSVI [3] will be available in the next version of XPath, this is something that may be possible in the future.

If you do not mind adding two more XSLT processes to the processing chain this is in fact possible to do with the help of Francis Norton's typeTagger. The basic idea is that it annotates the XML instance document with extra attributes containing, among other things, the element type information from the W3C XML Schema.

Instead of using the RNG2Schtrn.xsl stylesheet there is an alternative way to validate embedded Schematron rules in a RELAX-NG schema. One version of Sun's MSV has an add-on that will validate XML documents against RELAX-NG schemas annotated with Schematron rules.

The ability to combine embedded Schematron rules is not unique to W3C XML Schema and RELAX-NG; in fact it should be possible in all XML Schema languages that use XML syntax and have an extensibility mechanism. The only thing needed is to modify the XSLT extractor stylesheet to accommodate the extension mechanism in the XML Schema language.

Acknowledgements

I would like to thank Rick Jelliffe for taking the time to review this paper.


1. The test attribute allows XPath expressions to be combined in or groups (using the | operator and parentheses for grouping).

2. A RELAX-NG processor will ignore all elements that are defined in a different namespace than the RELAX-NG namespace (http://relaxng.org/ns/structure/1.0).

3. PSVI is shorthand for Post Schema Validation Infoset, which is a modified version of the original document's infoset, providing additional information such as default values, datatypes, etc.


Copyright © 2002, Eddie Robertsson
This is a draft paper that can be used privately but do not repost publicly.