What is the Lexical Reporter?
The lexical reporter is a Topologi TreeWorld service that uses an indexing tool to tokenize XML data into terms. It parses the XML with a SAX parser and therefore understands the document syntax, which enables it to report the terms that come from a particular context in the document.
As part of this service, the lexical reporter can produce the list of terms that belong to any context of a document or a collection of documents, and can identify the files that contain particular terms in a given context. For each term, the lexical reporter returns statistics on the usage of the term, such as its number of occurrences and its frequency per document.
Like any other tree of nodes in TreeWorld, the results can be sorted on those statistics, so terms can be presented in alphabetical order, by occurrence, by breadth of usage (across all the documents) or by any other available metadata.
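As a rough illustration of the statistics and sort orders described above, the following Python sketch tokenizes the text nodes of a few in-memory documents with the standard xml.sax module. It is not TreeWorld's implementation: the function names, the sample documents and the simple word-splitting rule are all assumptions made for the example.

```python
import re
import xml.sax
from collections import Counter


class TextCollector(xml.sax.ContentHandler):
    """Collects the character data of every text node."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def startElement(self, name, attrs):
        self.chunks.append(" ")  # keep words of adjacent elements apart

    def endElement(self, name):
        self.chunks.append(" ")

    def characters(self, content):
        self.chunks.append(content)


def terms_in(xml_text):
    """Tokenize all text nodes of one XML document into lower-case words."""
    handler = TextCollector()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return re.findall(r"\w+", "".join(handler.chunks).lower())


def term_statistics(documents):
    """Map each term to (total occurrences, number of documents using it)."""
    occurrences = Counter()
    document_count = Counter()
    for xml_text in documents.values():
        terms = terms_in(xml_text)
        occurrences.update(terms)
        document_count.update(set(terms))
    return {term: (occurrences[term], document_count[term]) for term in occurrences}


# Hypothetical sample documents, used only for this illustration.
docs = {
    "a.xml": "<doc><title>Allergy and anaphylaxis</title><p>Anaphylaxis is severe.</p></doc>",
    "b.xml": "<doc><title>Allergy basics</title></doc>",
}
stats = term_statistics(docs)

alphabetical = sorted(stats)
by_occurrence = sorted(stats, key=lambda t: (-stats[t][0], t))
by_document_count = sorted(stats, key=lambda t: (-stats[t][1], t))
```

Each sort order is just a different key over the same statistics, which mirrors how the results tree can be reordered without re-indexing.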
What does it do?
At present, there are two sets of actions provided with this service:
- Listing index terms: on XML files, it is possible to list the words used in all the text nodes, only within or outside some elements, and within attributes.
- Finding terms in documents: once a list of terms is produced, the service can search for these terms in any specified context.
What can it be used for?
The lexical reporter can be used to:
- identify at a glance the most important terms in your document(s) and quickly produce metadata and keywords
- produce indexes for a given context: this is particularly useful when the terms are highly specialised such as medical or pharmaceutical terms
- assist in producing a thesaurus
- support quality assurance by checking that certain words are always marked up as desired in a particular context, or never marked up in certain situations
- perform spell-checking when using a specialised terminology: browsing through the content of the index allows for a quick detection of typos and anomalies.
The task to list terms applies only to XML files and always returns terms.
Terms can be listed in four different contexts of the selected XML document(s):
- in all the text nodes
- inside a particular element and its children
- outside a particular element and its children
- inside a particular attribute.
The indexer is namespace aware, so if a namespace URI is specified then only elements or attributes from that namespace will be taken into account.
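The four contexts and the namespace filter can be sketched with a namespace-aware SAX handler. This is a minimal illustration in Python using the standard xml.sax module, not the actual indexer: the class name, the mode strings, the 'any' default and the sample document are assumptions made for the example.

```python
import io
import re
import xml.sax
from xml.sax.handler import ContentHandler, feature_namespaces


class ContextCollector(ContentHandler):
    """Collects text inside or outside one element, or from one attribute."""

    def __init__(self, mode, local_name=None, namespace_uri="any"):
        super().__init__()
        self.mode = mode  # "all", "inside", "outside" or "attribute"
        self.local_name = local_name
        self.namespace_uri = namespace_uri
        self.depth = 0  # greater than 0 while inside the target element
        self.chunks = []

    def _matches(self, name):
        uri, local = name  # uri is None when the name has no namespace
        return local == self.local_name and self.namespace_uri in ("any", uri)

    def startElementNS(self, name, qname, attrs):
        self.chunks.append(" ")  # keep words of adjacent elements apart
        if self.mode == "attribute":
            for attr_name, value in attrs.items():
                if self._matches(attr_name):
                    self.chunks.append(value + " ")
        elif self.depth or self._matches(name):
            self.depth += 1  # the target element or one of its children

    def endElementNS(self, name, qname):
        self.chunks.append(" ")
        if self.depth:
            self.depth -= 1

    def characters(self, content):
        inside = self.depth > 0
        if (self.mode == "all"
                or (self.mode == "inside" and inside)
                or (self.mode == "outside" and not inside)):
            self.chunks.append(content)


def list_terms(xml_text, mode, local_name=None, namespace_uri="any"):
    """Return the sorted, de-duplicated lower-case terms of one context."""
    handler = ContextCollector(mode, local_name, namespace_uri)
    parser = xml.sax.make_parser()
    parser.setFeature(feature_namespaces, True)
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(xml_text.encode("utf-8")))
    return sorted(set(re.findall(r"\w+", "".join(handler.chunks).lower())))


# Hypothetical sample document with one namespaced TITLE element.
sample = (
    '<doc xmlns:m="urn:example:med">'
    '<TITLE lang="en">Severe allergy</TITLE>'
    '<m:TITLE>Drug index</m:TITLE>'
    '<p note="review pending">Body text</p>'
    '</doc>'
)
```

For instance, list_terms(sample, "inside", "TITLE") considers both TITLE elements, while passing "urn:example:med" as the namespace URI restricts the result to the namespaced one.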
Context-specific indexing also accepts options, such as whether numbers should be included, whether the index should be case sensitive, and the minimum length of the terms to take into account. The lexical reporter comes with a built-in set of common words that it ignores unless specified otherwise.
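The effect of these options can be pictured as a filter applied to each token. The sketch below is only an assumption about how such a filter might behave: the parameter names are invented for the example, and the short COMMON_WORDS set is a stand-in for the built-in list, whose actual contents are not documented here.

```python
import re

# Hypothetical stand-in for the built-in list of common words.
COMMON_WORDS = frozenset({"a", "an", "and", "in", "is", "of", "the", "to"})


def filter_terms(text, include_numbers=False, case_sensitive=False,
                 min_length=1, ignore_common_words=True):
    """Tokenize text and keep only the terms allowed by the options."""
    terms = []
    for token in re.findall(r"\w+", text):
        # Only purely numeric tokens count as numbers; "5mg" is kept.
        if not include_numbers and token.isdigit():
            continue
        if len(token) < min_length:
            continue
        if ignore_common_words and token.lower() in COMMON_WORDS:
            continue
        terms.append(token if case_sensitive else token.lower())
    return terms
```

With min_length=5 and the other defaults, the text "The Treatment of anaphylaxis in 2004" is reduced to the two terms "treatment" and "anaphylaxis".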
An Example: step by step procedure
The following example will index the content of all the 'TITLE' elements in a collection of XML files.
- Select the files that need to be indexed.
- Select the corresponding action 'List Terms in an Element'.
- Choose the appropriate values for the parameters.
In this example, no namespace is defined so we can ignore this parameter and leave it as 'any'. We do not want to include numbers, nor do we care about the case of the terms. We are also going to ignore any word shorter than 5 characters.
- The lexical reporter returns the results.
In this example, we have displayed the results in 'show pretty' mode; the terms are returned in alphabetical order first.
- Get the information that you need.
The list of terms can easily be reordered according to different criteria, and statistics can be displayed for a specific term.
Finding Terms in Files
The task to find terms applies only to terms and always returns XML files. It uses the files that were submitted for indexing in the previous action and returns only those that contain the selected term(s) in the specified context.
This feature is particularly useful when a list of terms from a particular element has been returned and we need to know whether one of those terms also appears in a different context.
Suppose, for example, that an abbreviation or an acronym should always be marked up as such so that the back end can provide its definition. By indexing the terms within the 'acronym' element and then searching for files where one of those terms appears outside the 'acronym' element, we can find where the markup has been forgotten.
Of course, it can also simply be used to search for the term in a collection of documents.
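The acronym check described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the service's own logic: the helper names and sample documents are hypothetical, namespaces are ignored, and terms are compared case-sensitively because acronyms are usually written in capitals.

```python
import re
import xml.sax


class ElementSplitter(xml.sax.ContentHandler):
    """Separates text inside the target element from text outside it."""

    def __init__(self, element):
        super().__init__()
        self.element = element
        self.depth = 0  # greater than 0 while inside the target element
        self.inside, self.outside = [], []

    def _boundary(self):
        # Keep words of adjacent elements apart in both buffers.
        self.inside.append(" ")
        self.outside.append(" ")

    def startElement(self, name, attrs):
        self._boundary()
        if self.depth or name == self.element:
            self.depth += 1

    def endElement(self, name):
        self._boundary()
        if self.depth:
            self.depth -= 1

    def characters(self, content):
        (self.inside if self.depth else self.outside).append(content)


def split_terms(xml_text, element):
    """Return (terms inside the element, terms outside it) for one document."""
    handler = ElementSplitter(element)
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    inside = set(re.findall(r"\w+", "".join(handler.inside)))
    outside = set(re.findall(r"\w+", "".join(handler.outside)))
    return inside, outside


def files_with_unmarked_terms(documents, element="acronym"):
    """Map each file to the marked-up terms that also occur outside the element."""
    per_file = {name: split_terms(text, element) for name, text in documents.items()}
    marked = set().union(*(inside for inside, _ in per_file.values()))
    return {name: sorted(marked & outside)
            for name, (_, outside) in per_file.items() if marked & outside}


# Hypothetical sample collection.
docs = {
    "a.xml": "<p>Use <acronym>SAX</acronym> to parse <acronym>XML</acronym>.</p>",
    "b.xml": "<p>SAX is an event-based API.</p>",
    "c.xml": "<p>Nothing relevant here.</p>",
}
unmarked = files_with_unmarked_terms(docs)
```

Here only b.xml is reported, because it uses 'SAX' in plain text while a.xml marks the same term up.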
An Example: step by step procedure
The following example will return the files containing the term 'anaphylaxis' in the collection of XML files that was searched previously.
- Select the term 'anaphylaxis' (more than one term could be selected at once).
- Select the corresponding action 'show Files with this Term'.
- The lexical reporter will return the list of files.
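The search step can be approximated with the short Python sketch below. For brevity it uses the standard xml.etree.ElementTree module rather than SAX, and it is not the service's implementation: the function name and sample files are hypothetical, matching is case-insensitive, and a file is returned as soon as it contains any one of the selected terms, which is an assumption.

```python
import re
import xml.etree.ElementTree as ET


def files_with_terms(documents, terms, element=None):
    """Return the names of the files that contain any of the given terms.

    If element is given, only the text inside elements of that name
    (including their children) is searched; otherwise all text nodes are.
    """
    wanted = {term.lower() for term in terms}
    matches = []
    for name, xml_text in documents.items():
        root = ET.fromstring(xml_text)
        scopes = root.iter(element) if element else [root]
        found = set()
        for scope in scopes:
            text = " ".join(scope.itertext()).lower()
            found.update(re.findall(r"\w+", text))
        if wanted & found:
            matches.append(name)
    return sorted(matches)


# Hypothetical sample collection.
docs = {
    "a.xml": "<doc><TITLE>Managing anaphylaxis</TITLE><p>Text</p></doc>",
    "b.xml": "<doc><TITLE>Asthma</TITLE><p>May lead to anaphylaxis.</p></doc>",
    "c.xml": "<doc><TITLE>Eczema</TITLE></doc>",
}
```

Searching all text nodes for 'anaphylaxis' returns a.xml and b.xml, whereas restricting the search to the 'TITLE' element returns a.xml only.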