xtext — encoding declarations for text

Rick Jelliffe
Topologi, Pty. Ltd.
2003-05-20

This paper proposes xtext, an encoding header system similar to XML's but for use with almost any computer language using text.

Background

XML's encoding headers have proven successful over the last five years in taming the previously intractable charset problem: they provide a syntax and an algorithm for clearly stating the character encoding used by a file. By providing an in-band signal, as markup, they overcome the deficiencies in text APIs and the incompatible defaulting regimes of operating systems and Internet standards.

XML's mechanism can be seen as a logical step from various systems such as UNIX magic numbers, Hayes auto-baud detection, Adobe's MIF locale-detection, and ISO 2022 character set announcement.

Problem

However, not all text is XML. Indeed, as XML's attribute-value-tree-with-links (AVTWL) approach becomes more ubiquitous at the database level, we can expect more compact syntaxes to be developed for applications where the need for typability is strong. Existing non-XML formats will also continue to be used.

Is there a version of XML's encoding header syntax and algorithm that is suitable for non-XML uses?

Solution

An xtext file is a text document in some encoding. The encoding is detected as per XML's appendix F with the following changes:

The third change is the key. You choose the delimiters to suit the artificial language being used in the file: a CSS file could use @, while a Java file could use //.
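To make the role of the delimiter concrete, here is a minimal Python sketch of reading the declaration from the first line of a file. The function name parse_xtext_header, the opener parameter, and the name="value" pattern are assumptions made for illustration, inferred from the examples in this paper; they are not part of the proposal.

```python
import re

# Assumed shape of a pseudo-attribute: whitespace, a name, then ="value".
_PAIR = re.compile(r'\s+([A-Za-z]+)="([^"]*)"')


def parse_xtext_header(first_line: str, opener: str) -> dict:
    """Return the xtext pseudo-attributes on first_line, or {} if there are none.

    opener is the delimiter chosen for the host language,
    for example "<!--" for HTML, "//" for Java or "@" for CSS.
    """
    prefix = opener + "xtext"
    if not first_line.startswith(prefix):
        return {}
    # Collect every name="value" pair that follows the keyword.
    return dict(_PAIR.findall(first_line[len(prefix):]))
```

For instance, parse_xtext_header('@xtext encoding="utf-8"', "@") would give a dictionary holding only the encoding name. Anything after the pairs, such as a closing --> or */, is simply ignored by this sketch.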

The nice thing about the XML encoding PI is that it did not involve any new syntax, so existing tools (SGML tools) would continue to work with it. For xtext to be successful, it must have the same non-disruptive property, allowing it to be retrofitted to existing files and used by different systems.
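The general shape of the detection step can be sketched in Python as follows. This is only a rough illustration of the XML-style approach (byte order mark first, then an ASCII-compatible read of the first line); it does not attempt to reproduce the changes to appendix F, and it does not handle UTF-16 without a byte order mark or EBCDIC. The function name detect_encoding and the default of utf-8 are assumptions.

```python
import codecs
import re

# Assumed shape of the declaration, searched for in the raw bytes of line one.
_ENCODING = re.compile(rb'xtext\s+encoding="([A-Za-z0-9._-]+)"')


def detect_encoding(data: bytes, default: str = "utf-8") -> str:
    """Guess the encoding of data from a byte order mark or an xtext header."""
    # UTF-32 must be tested before UTF-16: the little-endian marks share a prefix.
    if data.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # Otherwise assume an ASCII-compatible encoding long enough to read line one.
    first_line = data.split(b"\n", 1)[0]
    match = _ENCODING.search(first_line)
    if match is None:
        return default
    return match.group(1).decode("ascii")
```

A caller would then decode the whole file with data.decode(detect_encoding(data)). Because the header sits inside a comment or other existing construct, tools that know nothing about xtext still see a valid file.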

Examples


HTML

<!--xtext encoding="utf-8"-->
<html>
<p>Some kind of HTML</p>
</html>

C, C++, Java, C# etc.

//xtext encoding="ISO-8859-1" refo="%x" refc="%"
package com.topologi.tme1.editor;
// some text with a character reference here:  %x4444% 
import blah...
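
The refo and refc pseudo-attributes in the example above delimit character references. A minimal Python sketch of expanding them follows; it assumes, by analogy with XML's &#x...; form, that the digits between the delimiters are hexadecimal. The function name expand_references is hypothetical.

```python
import re


def expand_references(text: str, refo: str, refc: str) -> str:
    """Replace each refo + hex digits + refc sequence with the character it names."""
    # Escape the delimiters, since characters such as % or . may be chosen.
    pattern = re.compile(re.escape(refo) + r"([0-9A-Fa-f]+)" + re.escape(refc))
    # Assumption: the digits are a hexadecimal code point.
    return pattern.sub(lambda m: chr(int(m.group(1), 16)), text)
```

With refo="%x" and refc="%", the sequence %x4444% would be replaced by the single character U+4444, while text without a matching sequence passes through unchanged.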


CSS

@xtext encoding="utf-8" 

html, body {
   background: #fff;
   color: #000;
}

or

/*xtext encoding="utf-8"*/

html, body {
   background: #fff;
   color: #000;
}