Rick Jelliffe
ricko@topologi.com
24/02/2002
XML 1.0's success and widespread adoption are in large part
because it solves a problem that no other technology has:
it makes the character set problem go away.
XML 1.0 solves the character set problem by adopting three
measures:
- It adopts Unicode as its document character set.
- It allows any encoding, and requires the encoding be labelled
(or defaulted).
- It provides error-detection mechanisms for mislabelled
encodings, as part of well-formedness.
The last of these measures is not well-recognized. Yet it
provides a check which gives XML 1.0 a fundamental robustness.
The measures are integrated: use a different character
set than Unicode, and coverage is compromised; allow guessing
of encodings, and fragility is increased; provide no error-checking,
and fragility is also increased. The XML 1.1 discussions have
brought this issue more to light.
The error-detection mechanisms in XML are three:
- The behaviour is specified for the case where an infeasible
codepoint or transition is found in the incoming codes.
- The Unicode characters allowed in a document are restricted.
- The Unicode characters allowed in XML names are restricted.
The particular policies in place for XML 1.0 for using these
mechanisms are:
- A transcoding error is fatal. However, in many important
cases (for example, between the ISO 8859-n family) this
will catch no errors. Furthermore, many and perhaps
most existing transcoder libraries silently strip out infeasible
code sequences. Furthermore, the requirement to fail
at encoding errors was only clarified as part of XML 1.0
second edition, so even where transcoding libraries could
flag an error, deployed implementations could have chosen
to continue processing.
- Only a small selection of document characters are restricted:
in particular the C0 range.
- A large selection of characters are restricted from use
as names.
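
The difference between a fatal transcoding error and silent stripping
can be illustrated with Python's built-in codecs. This is only a sketch:
the byte string is a made-up example, and Python's `errors` argument
stands in for whatever policy a particular transcoder library applies.

```python
# A made-up byte sequence: 0xFF can never occur in well-formed UTF-8.
data = b"<doc>caf\xff</doc>"

def decode_strict(raw):
    """Fatal policy: an infeasible sequence raises an error."""
    return raw.decode("utf-8", errors="strict")

def decode_lenient(raw):
    """Lenient policy: infeasible bytes are dropped without warning."""
    return raw.decode("utf-8", errors="ignore")

try:
    decode_strict(data)
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

stripped = decode_lenient(data)  # the bad byte disappears silently
```

Under the lenient policy the document still parses, so the mislabelling
goes unnoticed unless some later check catches it.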
In this note, I will not deal with other important considerations
relevant to characters in XML: that there are accessibility
considerations which legitimize banning symbol characters
that have no pronunciation (i.e. in any particular locale);
that the character U+0000 will cause problems in zero-terminated
strings; that MIME requirements for "textual" content means
that control characters are inappropriate for use in text/*
documents; etc.
Instead, I want to provide additional information to help
gauge the effectiveness of XML's current error-detection mechanisms,
and to see if this information allows us to come up with simpler
rules which give nearly as useful coverage.
Probability
Before starting, it is useful to consider that statistical
methods are the basis of much quality assurance and quality
control. Even in data communications, probability plays
a role: for example, the Cyclic Redundancy Check on Internet
protocols and other checksums are not 100% reliable.
However, they do not need to be; foremost, because if the
probability of detecting an error in one sample is e/t, the probability
that n independent samples will all fail to detect it is (t-e)^n/t^n.
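
A minimal sketch of that expression, assuming the samples are
independent (the function name `miss_probability` is hypothetical,
introduced only for illustration):

```python
from fractions import Fraction

def miss_probability(e, t, n):
    """Probability that n independent samples all fail to reveal an error,
    when each sample reveals it with probability e/t."""
    return Fraction(t - e, t) ** n

# Example: a check that fires half the time, applied ten times.
p = miss_probability(1, 2, 10)
```

Even a weak per-sample check becomes strong quickly as the number of
samples grows, which is the property the rest of this note relies on.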
There are two rational approaches to error-detection policy
in XML:
- Restrict characters to the smallest possible number, based
on Unicode properties. This gives the maximum possible number
of redundant characters. For names, this is the current
policy in XML 1.0.
- Enumerate and analyze the most common possible classes
of transcoding errors, and determine whether the natures
of the codes themselves allow effective rules to be formulated
to detect errors. That is the approach in this note.
The approach of simply removing checks for encoding errors
is simply bad engineering, in the absence of any other layers
or methods to perform error detection.
Encoding Errors
The causes of encoding errors include:
- human error or ignorance: a user may not know that the
encoding they are using is not correct;
- webserver error: web servers will send data as 8859-1
or ASCII by default; if the server is set up with a different
default, particular files may still be sent out with
incorrect encoding;
- programmer error: most programming language IO methods
output data using the locale's encoding by default;
- proxy error: a transcoding proxy recodes the document
without changing the XML header; if the document is saved
as-is, it will be in error even if the proxy sent the correct
MIME header information.
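
The programmer-error case is the easiest to avoid at the source. The
sketch below, which uses a hypothetical helper `write_xml` and a
throwaway file name, writes a document with the encoding passed
explicitly so that the bytes on disk match the XML declaration instead
of depending on the locale default.

```python
import os
import tempfile

def write_xml(path, body, encoding="utf-8"):
    """Write body with an XML declaration naming the encoding actually used."""
    declaration = '<?xml version="1.0" encoding="%s"?>\n' % encoding
    # Passing encoding= explicitly avoids falling back to the locale default.
    with open(path, "w", encoding=encoding, newline="\n") as out:
        out.write(declaration + body)

path = os.path.join(tempfile.mkdtemp(), "example.xml")
write_xml(path, "<名前>é</名前>")
with open(path, "rb") as f:
    raw = f.read()
```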
Let us take these as a working set of classes of errors we
should consider.
- UTF-16 mislabelled as UTF-8, and vice versa.
- Windows code pages mislabelled as ISO 8859-n, and vice
versa.
- Mislabelling as 8859-1, from webserver defaults.
UTF-8 labelled as UTF-16
There is no problem here. No delimiters will be detected,
and the document cannot be WF.
UTF-16 labelled as UTF-8
There is no problem here. U+0000 is not allowed in
the document character set, and an error
will be detected in every case of an incoming XML entity that
contains any markup.
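
Both UTF-8/UTF-16 mislabellings can be reproduced in a few lines. This
is a sketch using Python's codecs and a made-up four-character
document; it assumes UTF-16 without a byte order mark (with one, the
UTF-8 decoder fails outright on the first byte).

```python
markup = "<a/>"

# UTF-8 bytes read as UTF-16: each pair of bytes fuses into one character,
# so no '<' delimiter survives.
as_utf16 = markup.encode("utf-8").decode("utf-16-le")

# UTF-16 bytes (no byte order mark) read as UTF-8: every ASCII character
# is followed by a zero byte, which decodes to U+0000.
as_utf8 = markup.encode("utf-16-le").decode("utf-8")
```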
UTF-8 mislabelled as ISO 8859-n
UTF-8 continuation bytes occupy the range 0x80 to 0xBF, so half
of them fall in 0x80 to 0x9F, which ISO 8859-n maps to the C1
controls. Taking code points as equally likely, the probability
that a random two-byte character contains such a byte is 1/2;
for a three-byte character it is 3/4; and for a new character
> U+FFFF (four bytes) it is 7/8.
So restricting the document character set to disallow C1 will
be effective to catch UTF-8 mislabelling, except for documents
with very few (in repertoire, not in frequency) non-ASCII
characters. It would be interesting to check whether
the Euro is caught.
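
The Euro question, and the share of characters that a C1 restriction
would catch, can be checked by brute force. The sketch below assumes
ISO 8859-1 as the wrong label and treats every code point as equally
likely; the helper names are hypothetical.

```python
from fractions import Fraction

def has_c1(text):
    """True if text contains a C1 control (U+0080 to U+009F)."""
    return any(0x80 <= ord(ch) <= 0x9F for ch in text)

def mislabel_utf8_as_latin1(text):
    """Encode as UTF-8, then decode as if labelled ISO 8859-1."""
    return text.encode("utf-8").decode("iso-8859-1")

def caught_fraction(first, last):
    """Share of code points in [first, last] whose UTF-8 bytes yield a C1 control."""
    total = caught = 0
    for cp in range(first, last + 1):
        if 0xD800 <= cp <= 0xDFFF:  # surrogates cannot be encoded
            continue
        total += 1
        caught += has_c1(mislabel_utf8_as_latin1(chr(cp)))
    return Fraction(caught, total)

two_byte = caught_fraction(0x80, 0x7FF)
three_byte = caught_fraction(0x800, 0xFFFF)
euro_caught = has_c1(mislabel_utf8_as_latin1("\u20ac"))    # bytes E2 82 AC
e_acute_caught = has_c1(mislabel_utf8_as_latin1("\u00e9"))  # bytes C3 A9
```

The Euro is caught because its second byte is 0x82, whereas a document
whose only non-ASCII character is é slips through; this is the
"very few non-ASCII characters" exception.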
ISO 8859-n mislabelled as a different ISO 8859-n
The ISO 8859-n encodings are mutually feasible: no errors
will be detected by either a transcoder or by checking the
document character set for unallocated or deprecated characters.
The only method available to detect encoding errors is by
restricting the name rules.
An important case here is when a document is deemed incorrectly
to be ISO 8859-1. ISO 8859-1 has the useful property that
it does not have XML 1.0 name characters in the A0 to BF range.
Character encodings which do have name characters (ref http://www.kostis.net/charsets/)
in that range include:
- ISO 8859-2
- ISO 8859-3
- ISO 8859-4
- ISO 8859-5
- ISO 8859-7
In the cases of ISO 8859-2, 3 and 4, the characters that
would be detected will be, to a great extent, language-dependent.
In the case of Greek, it comes down to whether diacritical
or tone marks are used: if native-language markup is used
with tonos marks, then restricting the range U+00A0 to U+00BF
will reliably detect encoding errors, given the high incidence
of the use of tonos in Greek words.
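
To see which byte positions matter, the sketch below lists the bytes
from 0xA0 to 0xBF that each of those encodings assigns to a letter, and
then misreads one name. Two assumptions are mine: the Unicode general
category "L" is used as a rough stand-in for the XML 1.0 name rules,
and the Polish word is just an illustrative name.

```python
import unicodedata

ENCODINGS = ["iso-8859-2", "iso-8859-3", "iso-8859-4",
             "iso-8859-5", "iso-8859-7"]

def letters_in_a0_bf(encoding):
    """Bytes 0xA0 to 0xBF that the encoding assigns to a letter."""
    found = []
    for b in range(0xA0, 0xC0):
        # Unassigned bytes decode to U+FFFD, which is not a letter.
        ch = bytes([b]).decode(encoding, errors="replace")
        if unicodedata.category(ch).startswith("L"):
            found.append(b)
    return found

def name_is_suspect(name):
    """True if a name contains U+00A0 to U+00BF after decoding."""
    return any(0xA0 <= ord(ch) <= 0xBF for ch in name)

counts = {enc: len(letters_in_a0_bf(enc)) for enc in ENCODINGS}

# A Polish name stored as ISO 8859-2 but read as ISO 8859-1.
misread = "Łódź".encode("iso-8859-2").decode("iso-8859-1")
```

A name is only flagged when it happens to use one of those positions,
which is why detection here depends on the language.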
Each of the ISO 8859-n character sets has holes with non-name
characters. These provide additional potential error-detection
points.
For ISO 8859-1, the characters 0xD7 and 0xF7 are examples.
For the non-Latin ISO 8859 character sets, one
or both of these two codes are used for common name characters.
Restricting these characters should be effective in catching
errors in those non-Latin scripts. (Greek, Cyrillic, Arabic,
Hebrew). (Russian KOI8 may be in this class too.)
Most of the Latin character sets share these same characters,
so again, Latin
Windows code pages labelled as ISO 8859-n
The Windows code pages allow characters in the 0x80 to 0x9F
range. When labelled as ISO 8859-1, these occupy the
currently ambiguous C1 range in Unicode: this range is for
privately defined control characters, unless the higher-level
protocol specifies a particular control character set. One
C1 control character that has a special significance is NEL.
The introduction of the Euro as 0x80 in CP1252 means that
the previously harmless practice where documents created by
"ANSI" tools could (if they only used the Latin 1 characters)
be labelled ISO 8859-1 is no longer appropriate.
All mislabelling of ANSI as ISO 8859-1 would be caught by
disallowing the C1 controls from the document character set.
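
That claim can be checked by comparing the two encodings byte by byte.
The sketch below assumes CP1252 as the "ANSI" code page and uses
Python's codec tables; `differing_bytes` is a hypothetical helper.

```python
def differing_bytes(encoding_a="cp1252", encoding_b="iso-8859-1"):
    """Byte values that the two single-byte encodings decode differently."""
    diffs = []
    for b in range(256):
        raw = bytes([b])
        # Bytes left undefined by an encoding decode to U+FFFD here.
        a = raw.decode(encoding_a, errors="replace")
        c = raw.decode(encoding_b, errors="replace")
        if a != c:
            diffs.append(b)
    return diffs

diffs = differing_bytes()

# The Euro is byte 0x80 in CP1252; read as ISO 8859-1 it becomes U+0080.
euro_misread = "\u20ac".encode("cp1252").decode("iso-8859-1")
```

Every byte on which the two encodings disagree lies in 0x80 to 0x9F, so
a document that is affected by the mislabelling at all will contain a
C1 character.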
Big 5 mislabelled as ISO 8859-1
Big 5 is the character set used in Taiwan and Hong Kong,
and increasingly in Mainland China due to trade.
Assume character frequency is random in the Big 5 file. A
Big5 character has a 1/5 possibility of containing a code
point in its first byte in the A0 to BF range (this is slightly
less actually, because A0 to A4 are not used for Han characters;
however, the second byte may also contain these characters,
so we will let them cancel each other out). Assume a document
using native language encoding has 20 elements and 20 attributes,
each with two characters: without duplicates, that gives 80
characters.
For all possible DTDs with these qualities, the chances that
restricting the range of name characters by disallowing U+00A0
to U+00BF will not detect an encoding problem are therefore
(4/5)^80 = 4^80/5^80 =
approx 1.8e-8: very low.
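
The figure is easy to re-evaluate, and it is also worth seeing how few
name characters are needed before the check becomes dependable. The
sketch takes the note's 1/5 per-character assumption at face value;
`chars_needed` is a hypothetical helper.

```python
import math

p_miss_per_char = 4 / 5   # the note's assumption: 1/5 chance a character is caught
name_chars = 80           # 20 elements + 20 attributes, two characters each

p_undetected = p_miss_per_char ** name_chars

def chars_needed(threshold, p_miss=p_miss_per_char):
    """Smallest character count whose miss probability is at or below threshold."""
    return math.ceil(math.log(threshold) / math.log(p_miss))

n_for_one_percent = chars_needed(0.01)
```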
In the case of the Big5 superset Big 5 Plus, detecting C1
characters will also detect the encoding errors. However,
because these are the rarer characters in Big 5 Plus, I doubt that
this has much effect in practice.
Shift JIS mislabelled as ISO 8859-1
Shift JIS is the character set used for external text on
Japanese PCs.
It uses the code points 0x81 to 0x9F. If the C1
range in Unicode is disallowed as document characters, encoding
errors should be detected for even small documents. The
probability of not detecting a problem is
around 5^n/24^n, where n is the number of Japanese characters
used in names in some typical document type. This is very
reliable.
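
A short illustration, using Python's `shift_jis` codec and a
three-character sample word of my own choosing: each kanji's lead byte
lands in 0x81 to 0x9F, so reading the bytes as ISO 8859-1 produces C1
controls.

```python
def c1_hits(text, source_encoding):
    """C1 controls seen when source_encoding bytes are read as ISO 8859-1."""
    misread = text.encode(source_encoding).decode("iso-8859-1")
    return [ch for ch in misread if 0x80 <= ord(ch) <= 0x9F]

hits = c1_hits("日本語", "shift_jis")
ascii_hits = c1_hits("abc", "shift_jis")
```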
Other Encodings
Looking through the code tables in Lunde's CJKV Information
Processing and the website http://czyborra.com/charsets/ it
seems that, for many other character sets too, mislabelling
as ISO 8859-1 can be detected reliably:
Discussion
Restricted document and naming rules provide an effective
method of catching encoding errors in significant cases.
Restrictions to the document character set are better than
restrictions to the naming rules, because a document may not
be using native language markup.
Examining various encodings reveals some critical ranges.
Dealing with these ranges appropriately would maintain XML's
current robustness while allowing well-formedness to be decoupled
from specific versions of Unicode (See below for recommendations
to achieve this). Indeed, the robustness of XML would
probably be increased, while the implementation complexity
significantly decreased.
Errors detected by these methods should come under the category
"encoding errors" if detected immediately after transcoding,
and "bad document character" and "bad name character" otherwise.
Finally, I note that errors relating to encoding belong to
well-formedness of a document. Errors relating to which
characters are allowed in a well-formed document relate to
validity. Therefore detailed prescription or proscription
of name characters related to policy should become
a validity issue, not a WF issue, though sometimes encoding
errors may be the cause of an invalid name. An important consideration
for validation is that only the element and attribute names
in a schema need to be validated for name-rule consistency,
and not every element in an instance (in the absence of ANY
content types): so it is possible for instances to be parsed
fast while still getting the benefit of strict name rules--errors
are detected when no schema rule can be found for an element
or attribute.
Recommendations
- The character U+0000 NUL should not be an allowed character
in XML documents.
- The C1 characters U+0080 to U+009F should not be allowed
characters in XML documents, with the exception of NEL if
needs be.
- The Latin1 non-name characters U+00A0 to U+00BF and U+00D7
and U+00F7 should not be allowed in XML Names.
- Other restrictions to characters may be useful for particular
circumstances, but these will tend to be specific to the
encoding confusion involved.
- Users of the ISO 8859 character sets for Latin, other
than ISO 8859-1, should be warned to pay particular attention
to encoding issues, as the chances that an encoding error
will be detected will depend on the language they use and
may even depend on whether they use rarer characters.
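
The first three recommendations amount to two small range checks. The
following is a sketch only, not a conforming XML character checker:
the function names and the `allow_nel` switch are hypothetical, and the
full XML 1.0 document and name rules are not reproduced.

```python
NEL = "\u0085"

def bad_document_chars(text, allow_nel=True):
    """Characters that the recommendations bar from a document:
    U+0000 and the C1 range U+0080 to U+009F (optionally sparing NEL)."""
    bad = []
    for ch in text:
        cp = ord(ch)
        is_c1 = 0x80 <= cp <= 0x9F
        if cp == 0 or (is_c1 and not (allow_nel and ch == NEL)):
            bad.append(ch)
    return bad

def bad_name_chars(name):
    """Characters that the recommendations bar from names:
    U+00A0 to U+00BF, U+00D7 and U+00F7."""
    return [ch for ch in name
            if 0xA0 <= ord(ch) <= 0xBF or ord(ch) in (0xD7, 0xF7)]

doc_errors = bad_document_chars("a\x00b\x85c\x93")
name_errors = bad_name_chars("prix\u00d72")
```

A processor could report hits from the first function as "bad document
character" and hits from the second as "bad name character", matching
the error categories suggested in the discussion.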