XML
Contents
- 1 Introduction
- 2 Before You Start
- 3 Sample XML Source
- 4 Only XML Required
- 5 Sample Environment (or Configuration File)
- 6 Invoking ConTeXt
- 7 XML Typesetting
- 8 Mapping Selections
- 9 XML Paths (or LPaths)
- 10 Defining ConTeXt Setups
- 11 Map Registration
- 12 Format (Setups Configuration)
- 13 See Also
- 14 Notes
Introduction
It is possible to typeset XML sources with ConTeXt. Only an environment (or configuration file) is needed.
ConTeXt has been dealing with XML since MkII. This also means that the documentation for that version is obsolete for MkIV (or LMTX).[1]
The wiki contains only an explanation for the current way of dealing with XML in ConTeXt (MkIV and later).
Before You Start
It might be obvious, but there are two basic requirements to typeset XML sources with ConTeXt:
- Familiarity with XML. You don’t have to type XML directly, but ConTeXt isn’t able to compile XML that is not well–formed.[2]
- At least some knowledge of ConTeXt commands, since otherwise formatting what you select from the XML source would be impossible.
XML is far more than a source format for typesetting with ConTeXt, and the two are completely independent of each other. It is important to deal with XML on its own terms first, without seeing it through ConTeXt lenses.
As for not typing XML sources directly, there are some lightweight tagging (or markup) languages, such as AsciiDoc or Markdown.[3] There are tools (Pandoc being just one of them) that generate XML from these lightweight markup formats. It is possible that in some cases these tools generate malformed XML (due to bugs in them). In that case, you will have to find out what is wrong with your XML source.[4]
Knowing ConTeXt is required too, because typesetting XML may be explained as having two parts:
- Selecting what you want from the XML file(s).
- Defining how you want your selections in the final PDF document.
It is better to start learning standard ConTeXt first (if required) and then acquire some experience with XML files.
Sample XML Source
An XML sample borrowed and adapted from the net reads:
<TEI xml:lang="en">
  <teiHeader>
    <!-- stuff omitted here -->
  </teiHeader>
  <text>
    <body>
      <div type="essay">
        <head>An Essay on Summer</head>
        <p>Summer school in <date when="1990">MCMXC</date> was never easy; it went by too quickly and left us wanting more.</p>
        <p>But, as my friend <name type="person">Peter</name> said with his inimitable <foreign xml:lang="fr">je ne sais quoi</foreign>, <said>It never pays to think too hard</said>. Or, as I would rather put it, <quote xml:lang="it">Que sera, sera</quote>.</p>
      </div>
      <div type="essay">
        <head>An Essay on Winter</head>
        <p xml:lang="es">¡Hasta la vista…!</p>
      </div>
    </body>
  </text>
</TEI>
Only XML Required
This previous sample is written using the TEI markup. It is correct XML and valid (TEI) XML.
You might think of XML correctness[5] as the set of orthographic rules common to all European languages. Some of these rules may be:[6]
- All words are separated using at least a blank space.
- Single dots mark different sentences.
- Blank vertical space separates paragraphs (when available).
XML rules describe how the tags inside the characters <…> are to be used. To these rules belong:
- Markup is defined by the string inside the characters < >.
- Any blank space separates attributes (<element attribute="value" attribute1="value1">).
- The name is the only required part of the <…> tag.
- Elements have an opening tag and a matching closing tag (<…> and </…>); otherwise the opening tag must close itself (<…/>).[7]
- The name must come first in the tag (before the first space, if any attribute is given).
- Attributes have their values assigned with the equal sign (customarily with no blank space before or after the sign).
- Attributes have their values enclosed in quotes.
Validity is related to a document type. XML validity is properly the document validity.
A document type (such as XHTML or TEI) defines a limited set of elements (of element names). Each element may contain one or more attributes with different values.
This specification is called the document type definition. You may think of it as the set of grammar rules of one particular European language.
For example, <whatever> is a correct XML name, but it is an invalid XHTML or TEI element.
An even more extreme sample of correct XML would read:
<τεχτ>
  <βοδυ>
    <διβ type="essay">
      <ἡαδ>An Essay on Summer</ἡαδ>
      <π>Summer school in <δατη when="1990">MCMXC</δατη> was never easy; it went by too quickly and left us wanting more.</π>
      <π>But, as my friend <ναμη type="person">Peter</ναμη> said with his inimitable <ξένον xml:lang="fr">je ne sais quoi</ξένον>, <ἔφα>It never pays to think too hard</ἔφα>. Or, as I would rather put it, <λεγόμενον xml:lang="it">Que sera, sera</λεγόμενον>.</π>
    </διβ>
    <διβ type="essay">
      <ἡαδ>An Essay on Winter</ἡαδ>
      <π xml:lang="es">¡Hasta la vista…!</π>
    </διβ>
  </βοδυ>
</τεχτ>
This is invalid TEI. But ConTeXt only requires correct (or valid, as it describes them) XML sources to compile them.
Sample Environment (or Configuration File)
A minimal configuration file or environment to typeset the previous sample may read:
\startxmlsetups xml:presets:all
    \xmlsetsetup {#1} {*} {xml:*}
\stopxmlsetups

\xmlregistersetup{xml:presets:all}

\startxmlsetups xml:TEI
    \mainlanguage[\xmlatt{#1}{xml:lang}]
    \xmlflush{#1}
\stopxmlsetups

\startxmlsetups xml:body
    \xmlflush{#1}
\stopxmlsetups

\startxmlsetups xml:date
    \xmlflush{#1}
\stopxmlsetups

\startxmlsetups xml:div
    \startchapter[title=\xmltext{#1}{head}]
        \xmlflush{#1}
    \stopchapter
\stopxmlsetups

\startxmlsetups xml:foreign
    \bgroup\language[\xmlatt{#1}{xml:lang}]\em\xmlflush{#1}\egroup
\stopxmlsetups

\startxmlsetups xml:name
    \bgroup\sc\xmlflush{#1}\egroup
\stopxmlsetups

\startxmlsetups xml:p
    \startparagraph
        \xmlflush{#1}
    \stopparagraph
\stopxmlsetups

\startxmlsetups xml:p:date
    \xmlflush{#1}
\stopxmlsetups

\startxmlsetups xml:quote
    \bgroup\language[\xmlatt{#1}{xml:lang}]\quotation{\xmlflush{#1}}\egroup
\stopxmlsetups

\startxmlsetups xml:said
    \xmlflush{#1}
\stopxmlsetups

\startxmlsetups xml:teiHeader
    \xmlflush{#1}
\stopxmlsetups

\startxmlsetups xml:text
    \xmlflush{#1}
\stopxmlsetups
A proper explanation follows in XML Typesetting.
Invoking ConTeXt
The XML source may be saved as source.xml and the environment or configuration file could be saved as environment.tex.[8]
context --environment=environment.tex source.xml
This invocation will generate an output file named source.pdf.
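As an alternative to the command-line switch, the XML source can presumably also be loaded from a small ConTeXt wrapper file. The sketch below assumes the \xmlprocessfile command as described in the MkIV XML manual; the file name wrapper.tex and the document identifier main are arbitrary choices for this illustration and are not part of the sample above.

```tex
% wrapper.tex (hypothetical file name, chosen for this sketch)
% load environment.tex, which holds the xml setups
\environment environment

\starttext
    % "main" is an arbitrary identifier for the loaded XML document;
    % the empty third argument leaves the choice of setups to the
    % registered mapping from the environment
    \xmlprocessfile{main}{source.xml}{}
\stoptext
```

Under these assumptions, running context wrapper.tex would produce wrapper.pdf instead of source.pdf.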
XML Typesetting
Formatting XML sources with ConTeXt (or properly typesetting them) requires:
- Selecting which parts you want to be typeset. At least, these selections will cover elements by their name.
- Assigning these parts to single configuration commands (otherwise all will be displayed the same).
In practice, the ConTeXt configuration for XML (or environment file) contains:
- A set of XML (node) selections mapped or assigned to ConTeXt setups (or configurations).
- The registration of this mapping (or assignation set).
- The configuration of each setup.
A basic skeleton showing the three tasks would read:
\startxmlsetups xml:whatever
    \xmlsetsetup {#1} {*} {xml:*}
\stopxmlsetups

\xmlregistersetup{xml:whatever}

\startxmlsetups xml:body
    \xmlflush{#1}
\stopxmlsetups
% and so many definitions as XML selections
The two blank lines separate the three parts listed above.
Mapping Selections
The first thing to define is a list of selections from the XML source linked to individual ConTeXt configurations.
This minimal sample contains it:
\startxmlsetups xml:whatever
    \xmlsetsetup {#1} {*} {xml:*}
\stopxmlsetups
- The first line \startxmlsetups creates the list (named xml:whatever).
  - The same identifier will be required to register the list.
  - It is customary to use xml: as namespace, but any character string (such as οὑδέν:) would do.
  - Both parts of the name are free, but the identifier should match completely in the registration.
- The third line \stopxmlsetups closes the \startxmlsetups (as customary in ConTeXt).
- The second line \xmlsetsetup maps individual selections in the XML source to ConTeXt setups.
  - In \xmlsetsetup, the second pair of braces defines the individual XML selection, the third pair of braces defines the ConTeXt setup.
  - The content of the first pair of braces (\xmlsetsetup{#1}) is required in all cases.
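The wildcard is not the only way to fill such a list. As a minimal sketch, the same list could name elements from the sample source explicitly. The alternation with | in the second line is taken from the MkIV XML manual as far as I recall it, so treat it as an assumption to verify.

```tex
\startxmlsetups xml:whatever
    % one element name mapped to one explicitly named setup
    \xmlsetsetup{#1}{p}{xml:p}
    % several element names at once (assumed "|" alternation),
    % each mapped to a setup named xml:<element name>
    \xmlsetsetup{#1}{name|foreign|quote}{xml:*}
\stopxmlsetups

% the identifier matches the one given to \startxmlsetups
\xmlregistersetup{xml:whatever}
```

Explicit names make the list longer, but they document at a glance which elements the environment is meant to handle.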
XML Paths (or LPaths)
You define what you want from the XML sources using XML Paths, known as XPaths. Since ConTeXt accesses these paths using Lua, they are LPaths.
We are handling the contents of the second pair of braces from the command:
\xmlsetsetup{#1}{*}{xml:*}
The most basic path is the one used in the sample, {*}, which stands for any XML element.
Other path types may be:
- {element[@attribute]} selects <element attribute="…"> (<element> with attribute set, regardless of its value).
- {element[@attribute='value']} selects <element attribute="value">, but not <element attribute="value1"> (or even <element attribute="value another-value">).
- {container/element} selects all <element> children (or direct descendants) of <container>.
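Applied to the sample source, these three path types could be used as in the following sketch. The list name xml:paths is made up for this illustration, and the fragment only covers three selections; it is meant to be combined with mappings for the remaining elements.

```tex
\startxmlsetups xml:paths
    % any <div> carrying a type attribute, whatever its value
    \xmlsetsetup{#1}{div[@type]}{xml:div}
    % only <name type="person">
    \xmlsetsetup{#1}{name[@type='person']}{xml:name}
    % <date> elements that are children of <p>
    \xmlsetsetup{#1}{p/date}{xml:p:date}
\stopxmlsetups

\xmlregistersetup{xml:paths}
```

The last mapping would give the xml:p:date setup from the sample environment a selection of its own, distinct from xml:date.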
There are a bunch of other possibilities. A separate page on LPaths would make more sense.
Defining ConTeXt Setups
The third and last pair of braces from \xmlsetsetup{#1}{*}{xml:*} defines the matching setup for the given element.
If you use the wildcard (*), the setup name will take the element name from the path (when a path is selected).
It is up to you which namespace you use to name ConTeXt setups,[9] but they must match the individual formatting command.
A way of getting rid of some content (which otherwise would be selected) is to map a path to a setup that is not defined.[10]
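The following sketch illustrates the naming requirement with a namespace other than xml:. The namespace tei: and the list name tei:maps are arbitrary choices for this example.

```tex
\startxmlsetups tei:maps
    % every element is mapped to a setup named tei:<element name>
    \xmlsetsetup{#1}{*}{tei:*}
\stopxmlsetups

\xmlregistersetup{tei:maps}

% <p> is mapped to tei:p, so the setup must carry exactly that name;
% a setup named xml:p would never be reached by this mapping
\startxmlsetups tei:p
    \startparagraph
        \xmlflush{#1}
    \stopparagraph
\stopxmlsetups
```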
Map Registration
After defining the list of XML setups (XML paths matched with ConTeXt setups), it must be registered. The registration command reads:
\xmlregistersetup{xml:whatever}
The only requirement is that the identifier (xml:whatever in the sample) is exactly the same as the one defined in \startxmlsetups.
Format (Setups Configuration)
Last (but not least, as they say) comes the format of XML selections. Without this step, the selections will be lost in the transition to the output document.
As already explained in Defining ConTeXt Setups, these names (contained in the last pair of braces of \xmlsetsetup) should match each individual setup configuration.
For a setup named {xml:body} in the selection mapping, its configuration may read:
\startxmlsetups xml:body
    \xmlflush{#1}
\stopxmlsetups
Flushing the contents of the element (the node) is the most basic operation.
It is required for the child elements of the element to be processed at all.
Flushing only adds the content of the element; for formatting, one needs standard ConTeXt commands.
Compare the previous setup to these other ones:
\startxmlsetups xml:p
    \startparagraph
        \xmlflush{#1}
    \stopparagraph
\stopxmlsetups

\startxmlsetups xml:name
    \bgroup\sc\xmlflush{#1}\egroup
\stopxmlsetups
The xml:p setup adds the required commands so that <p> elements are typeset as paragraphs.
For xml:name, small caps are added. \bgroup … \egroup is similar to enclosing the contents in braces (but more explicit and readable).
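Any other setup can be given formatting in the same way. As a small sketch, the xml:said setup, which only flushes its content in the sample environment, could wrap that content in quotation marks. This is a variation for illustration, not part of the original environment.

```tex
\startxmlsetups xml:said
    % \quotation adds the quotation marks of the current language
    \quotation{\xmlflush{#1}}
\stopxmlsetups
```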
As mentioned, \xmlflush{#1} flushes the current selection (or node).
This is the most basic operation, but there are other commands as well.
\xmltext adds the text from a path, such as in:
\xmltext{#1}{head}
\xmltext{#1}{.}
The first command from the sample gets the text from a child <head> element.
The second command gets the text from the current element ({.} is the path for it).
Attributes may be accessed with:
- \xmlatt{#1}{name} gets the value of the attribute {name} from the current element.
- \xmlattribute{#1}{path}{name} gets the value of the attribute {name} from the selected {path}.
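Both commands can be tried on the when attribute of the <date> element from the sample source. The setups below are variations written for this illustration and replace the plain xml:date and xml:p setups of the sample environment; the parentheses and brackets are only there to make the attribute value visible.

```tex
% from the <date> element itself:
% its content followed by the value of its own "when" attribute
\startxmlsetups xml:date
    \xmlflush{#1} (\xmlatt{#1}{when})
\stopxmlsetups

% from the parent <p>:
% the same attribute, reached through the path "date"
\startxmlsetups xml:p
    \startparagraph
        \xmlflush{#1}
        \space[\xmlattribute{#1}{date}{when}]
    \stopparagraph
\stopxmlsetups
```

Note that a <p> without a <date> child has no attribute to fetch, so the second setup would presumably leave the brackets empty there.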
A more detailed list (with sample explanations) deserves a separate page on XML Setup Commands.
See Also
Notes
- ↑ In general, old MkII code includes the uppercase XML string in its commands (as in \getXMLcode[name]), while new MkIV code uses lowercase xml (as in \xmlflush{#1}).
- ↑ If this is all Greek to you, consider it as incorrect XML.
- ↑ For a detailed list, see a feature comparison list in Wikipedia.
- ↑ ConTeXt will complain with a message in the PDF document starting with “invalid xml file”.
- ↑ I’m aware that the technical term is well–formedness, but I could not avoid looking for a more expressive replacement. Correctness seems to be a suitable candidate.
- ↑ This is not more than a fancy example, in no way an exhaustive description (or list).
- ↑ With or without space before the slash.
- ↑ Of course, the two files need different names. Although it is not mandatory (as far as I can recall), it is a good idea to keep a different file extension for each file format: .xml for XML files and .tex for ConTeXt files.
- ↑ The part of the identifier with the form xml:, which may contain any string of letters (no digits).
- ↑ This is exactly what happens with the <head> element in the sample. There is no setup defined as \startxmlsetups xml:head \xmlflush{#1} \stopxmlsetups. It would be redundant (the heading would appear twice in the output document), since it is already included in xml:div with \xmltext{#1}{head}.