Using XML and XSLT to build LinuxFocus.org(/Nederlands)

ArticleCategory: [Article category]

Applications

AuthorImage:[Image of the author]

TranslationInfo:[Author and translation history]

AboutTheAuthor:[About the author]

Will receive his master's degree this year and will start his PhD research on chemometrics. Still enjoys basketball a lot, as he does LinuxFocus and Linux in general.

Abstract:[Abstract]

This article contains the presentation given at the Libre Software Meeting in Bordeaux in July. It explains the XML database used for automatic generation of the LinuxFocus.org(/Nederlands) web site.

ArticleIllustration:[Title image of the article]

ArticleBody:[The actual article]

Introduction

The system used for document and translation management in the LinuxFocus project consists of several ASCII files, including resdb.txt, issuedb.txt and maindb.txt. These files have a fixed format, and they're used to generate web pages. However, they are difficult to extend, and the separated nature of the data makes it hard to manage all the information available for an article.

LinuxFocus did not automatically generate much web content when I started the new database. As an editor on the Dutch team, I was eager to have the index.html files on the web site dynamically generated. Editing several HTML files each time a new article was translated took a lot of effort and caused many broken links. Therefore, I wanted a new system to which I could add information easily, and from which I could easily generate index pages for the web site. I started working on it sometime in the summer of 2000.

The choice of XML was somewhat arbitrary. Suggestions had been made to use a relational database, but I was experienced in XML and preferred a system of text-based files. It soon turned out that a new numbering scheme would be useful, because the database could then use one type of ID instead of the two or three schemes then in use. Guido Socher did all the renumbering, which was quite an effort (my thanks!).

The Document Type Definition (DTD) was already in development, and a little bit of content was in the database for testing purposes. With the new uniform numbering scheme, the time was right to load the database with content. After having added about 20 articles, it became clear that this was an enormous project. Writing scripts to use the old files was possible, but not all information that the new database could contain was available, and, as explained, the information that was available was distributed over several files. Fortunately, Floris Lambrechts got involved, and I have to thank him deeply for adding most of the content to the database. Without his help, the system would not be what it is today.

The new format also brought the ability to add new information, and over the past year several new kinds of data have been added to the database. Early extensions were a table of authors, translators, editors and other people involved in LinuxFocus, and file locations. File locations were added because several file naming schemes had been used since the beginning of LinuxFocus. The renumbering reduced these to two schemes: some files use server side includes and have the .shtml extension, while older articles have the .html extension. The <file> tag can be used to override the default. (The current default uses the format "article" + article number + ".shtml". This might include an optional ".meta" in case the file is in LinuxFocus' meta format.)
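The default naming rule can be sketched in a few lines of Python (used here only for illustration; the project's own tools are Perl and XSLT). The function name default_filename is hypothetical, and the position of the optional ".meta" part, between the article number and the extension, is an assumption.

```python
def default_filename(article_number, meta=False):
    """Build the default file name for an article.

    Assumption: the optional ".meta" part sits between the article
    number and the ".shtml" extension.
    """
    middle = ".meta" if meta else ""
    return f"article{article_number}{middle}.shtml"


print(default_filename(206))
print(default_filename(206, meta=True))
```

An article whose file does not follow this rule would carry an explicit <file> element instead.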

Now that the database had reached critical mass, I finally got around to benchmarking the software I was writing. The current XSLT stylesheets are not the first implementation; they were preceded by Perl-based code. With the growing size of the database, performance became important, and the first attempt was simply not fast enough. But before I explain the tools, I'll explain the database format.

The Document Type Definition

XML, first of all, is a syntax specification for markup languages. XML defines how markup should look. The syntax describes the sequence of characters allowed in a "well-formed" XML document. It declares that a document has one root element and that an element consists of a start tag, content (text, child elements, or both), and an end tag. A tag consists of a "<" character, followed by a name, and a closing ">" character. An end tag must have a "/" just in front of the name. Empty tags, like HTML's <br>, take a "/" after the name. A start tag may contain attributes, and these also have a specific syntax. XML tags look like this:

<greeting>Hello, world!</greeting>

or for an empty tag

<br/>

Besides syntax, languages also have semantics, which describe how elements relate to each other. The semantics of HTML declare that the <body> element must be contained in the <html> element, and not the other way around. They also state that the <img> element is empty, as is the <br> element. If these rules are given in a formal notation, a program can parse them and use them to validate a document. One of these formal notations is called Document Type Definition, or DTD for short. If a document passes the validation process, it is called a valid document. Be careful with XML, because validation is very strict: a single missing or misplaced element makes the whole document invalid.
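The difference between "well-formed" and "valid" can be tried out with a short Python sketch. It only illustrates the first half: the standard library parser used here checks well-formedness and does not validate against a DTD, so validity still needs a validating parser. The function name is_well_formed is made up for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as well-formed XML, else False."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# A start tag, content and an end tag.
print(is_well_formed("<greeting>Hello, world!</greeting>"))
# An empty tag written the XML way.
print(is_well_formed("<p>line one<br/>line two</p>"))
# HTML-style <br> without the "/": the element is never closed.
print(is_well_formed("<p>line one<br>line two</p>"))
```

The first two calls print True and the third prints False, because the unclosed <br> no longer matches the </p> end tag.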

Now that we know what a DTD is, let's have a look at the LinuxFocus XML Database DTD. For several of the declarations we will provide an example. By examining these examples you will get an idea of how the information is stored in LinuxFocus' XML database.

<database>

The root element in the LinuxFocus XML database, or one of its extensions/localizations, is the <database> element.

<!ELEMENT database    (themes?, persons?, issues?, articles?)>

First, note that the "?" means the child element may occur zero or one times. Thus, the database may contain information about LinuxFocus' themes, persons, issues and articles. Since this is very straightforward, I'll move on to a more interesting example.

<themes>

The themes are contained within the <themes> element, which is a child element of <database>. Each theme has a unique ID, a title, and optionally a description and an image.

<!ELEMENT themes      (theme+)>
  <!ELEMENT theme       (title*, desc?, img?)>
    <!ELEMENT title       (#PCDATA)>
    <!ELEMENT desc        (#PCDATA)>
    <!ELEMENT img         EMPTY>

Some of these elements must have attributes, which are also declared in the DTD. Any textual content is contained in an element with the xml:lang attribute. The value of that attribute is a two-letter ISO 639 language code, such as "en", "fr" or "nl". Both the ID attribute type used for @id and the xml:lang attribute are defined in the XML specification.

<!ATTLIST theme       id            ID            #REQUIRED>
<!ATTLIST title       xml:lang      NMTOKEN       #REQUIRED>
<!ATTLIST desc        xml:lang      NMTOKEN       #REQUIRED>
<!ATTLIST img         src           CDATA         #REQUIRED>

An example database might look like this:

<database>
  <themes>
    <theme id="hw">
      <title xml:lang="en">Hardware</title>
      <img src="Hardware.jpg"/>
    </theme>
  </themes>
</database>
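To show how such a fragment can be read by a program, here is a minimal Python sketch using the standard library (not one of the LinuxFocus tools). The function name theme_titles is hypothetical. Note that the parser reports the xml:lang attribute under the reserved XML namespace.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

SAMPLE = """
<database>
  <themes>
    <theme id="hw">
      <title xml:lang="en">Hardware</title>
      <img src="Hardware.jpg"/>
    </theme>
  </themes>
</database>
"""


def theme_titles(root, lang):
    """Map each theme id to its title in the requested language."""
    titles = {}
    for theme in root.iter("theme"):
        for title in theme.findall("title"):
            if title.get(XML_LANG) == lang:
                titles[theme.get("id")] = title.text
    return titles


root = ET.fromstring(SAMPLE)
print(theme_titles(root, "en"))
```

Asking for a language that has no title yet, such as "nl" in this sample, simply yields an empty mapping.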

<issues>

Issues are contained in the <issues> element. Like themes, each issue has a unique ID.

<!ELEMENT issues      (issue+)>
  <!ELEMENT issue       (title+, published?, file*)>
    <!ELEMENT title       (#PCDATA)>
    <!ELEMENT published   EMPTY>
    <!ELEMENT file        (#PCDATA)>

The <published> element flags published issues. The next issue and the SomeLanguage2Eng pseudo issues do not have this element. The <title> element again has the @xml:lang attribute. The <file> element denotes the directory in which the issue is located. It must not point to the index.html file, because it is used to determine file locations.

An example (note that we use the @code attribute for sorting):

    <issue id="ToBeWritten" code="999996">
      <title xml:lang="en">Not yet written articles</title>
    </issue>
    <issue id="September2001" code="200109">
      <title xml:lang="en">September2001</title>
    </issue>
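Sorting on @code could look like the following Python sketch. It is an illustration only, not the stylesheet code, and the function name issues_by_code is made up. The <issues> wrapper is added so that the fragment has a single root element.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<issues>
  <issue id="ToBeWritten" code="999996">
    <title xml:lang="en">Not yet written articles</title>
  </issue>
  <issue id="September2001" code="200109">
    <title xml:lang="en">September2001</title>
  </issue>
</issues>
"""


def issues_by_code(root):
    """Return the issue ids ordered by their numeric @code attribute."""
    issues = sorted(root.iter("issue"), key=lambda issue: int(issue.get("code")))
    return [issue.get("id") for issue in issues]


root = ET.fromstring(SAMPLE)
print(issues_by_code(root))
```

Because real issues use a year-and-month code, they sort chronologically, and a pseudo issue with a high code such as 999996 ends up at the end of the list.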

<persons>

Information about authors and translators is stored in <person> elements. Each person must have a unique ID.

 <!ELEMENT persons (person+)>
 <!ELEMENT person 
              ((name|email)*,(homepage|nickname|desc|team)*)>
 <!ELEMENT email (#PCDATA)>
 <!ELEMENT name (#PCDATA)>
 <!ELEMENT homepage (#PCDATA)>
 <!ELEMENT nickname (#PCDATA)>
 <!ELEMENT desc (#PCDATA|%html-els;)*>
 <!ELEMENT team EMPTY>

Each person can have the following information: a name, one or more email addresses, homepages and nicknames. If the person is also part of a translation team, we add a <team> element. For example, the line <team xml:lang="nl"/> in Floris' <person> element means that he belongs to the Dutch team. Finally, each person can have a description, which may contain additional web links.

An example:

    <person id="nl-ew">
      <name>Egon Willighagen</name>
      <email>egonw@linuxfocus.org</email>
      <team xml:lang="nl"/>
    </person>
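As a sketch of how the <team> element can be used, the Python fragment below lists the members of one translation team. The second person in the sample is invented for contrast, and the function name team_members is hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

SAMPLE = """
<persons>
  <person id="nl-ew">
    <name>Egon Willighagen</name>
    <email>egonw@linuxfocus.org</email>
    <team xml:lang="nl"/>
  </person>
  <person id="xx-example">
    <!-- invented entry without a team, for illustration only -->
    <name>Example Author</name>
  </person>
</persons>
"""


def team_members(root, lang):
    """Return the names of all persons that belong to the given team."""
    names = []
    for person in root.iter("person"):
        teams = {team.get(XML_LANG) for team in person.findall("team")}
        if lang in teams:
            names.append(person.findtext("name"))
    return names


root = ET.fromstring(SAMPLE)
print(team_members(root, "nl"))
```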

<articles>

The articles are of course the most interesting part of the database.

  <!ELEMENT articles    (article+)>
    <!ELEMENT article     (title+, 
        (file|personref|abstract|issueref|themeref|
         nometa|nohtml|translation|proofread)*)>
      <!ELEMENT abstract    (#PCDATA)>
      <!ELEMENT nohtml      EMPTY>
      <!ELEMENT nometa      EMPTY>
      <!ELEMENT translation 
                   (personref*, (reserved|finished|proofread)*)>
      <!ELEMENT reserved    (#PCDATA)>
      <!ELEMENT finished    (#PCDATA)>
      <!ELEMENT proofread   (personref*, (reserved|finished)*)>
<!ATTLIST article     id            ID            #REQUIRED
                      xml:lang      NMTOKEN       #IMPLIED
                      type          (article|coverpage)
                                                  "article"
                      next          IDREF         #IMPLIED
                      prev          IDREF         #IMPLIED>
<!ATTLIST file        xml:lang      NMTOKEN       #REQUIRED
                      type          (target|meta) "target">
<!ATTLIST translation from          NMTOKEN       #REQUIRED
                      to            NMTOKEN       #REQUIRED>

Each article has at least one title, one for each language. The <file> element can be used to give the article's file location, for both the META format and the HTML version (see example below). In cases where no META or HTML version is available, the optional <nometa/> and <nohtml/> elements may be used. Each article can have an abstract. Having the abstract in the database means it can be used to create index web pages.

The <article> element has five attributes: the required @id; an optional xml:lang attribute that denotes the language in which the article was originally written; a @type attribute that marks cover pages, which are treated as articles for translation purposes; and two optional attributes, @next and @prev, which tie the articles of a series together.

An article is associated with an issue and a theme through the <issueref> and <themeref> elements, both of which have an @href attribute. The value of this attribute must be the ID of the associated issue or theme.

An example:

<article id="article206" xml:lang="en">
  <title xml:lang="en">Using XML and XSLT to build 
    LinuxFocus.org(/Nederlands)</title>
  <personref href="nl-ew"/>
  <issueref href="ToBeWritten"/>
  <themeref href="appl"/>
  <abstract xml:lang="en">
This article shows you how parts of the Dutch web site of LinuxFocus is 
generated with XSLT tools from the XML database. It compares this with 
the (very) much slower DOM tools in Perl.
  </abstract>
</article>
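Since every @href has to match an existing ID, a small consistency check is easy to write. The Python sketch below is an illustration and not part of the LinuxFocus tools. The function name dangling_refs, the title of the "appl" theme and the deliberately broken reference "missing" are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<database>
  <themes>
    <theme id="appl"><title xml:lang="en">Applications</title></theme>
  </themes>
  <persons>
    <person id="nl-ew"><name>Egon Willighagen</name></person>
  </persons>
  <issues>
    <issue id="ToBeWritten" code="999996">
      <title xml:lang="en">Not yet written articles</title>
    </issue>
  </issues>
  <articles>
    <article id="article206" xml:lang="en">
      <title xml:lang="en">Using XML and XSLT</title>
      <personref href="nl-ew"/>
      <issueref href="ToBeWritten"/>
      <themeref href="appl"/>
      <themeref href="missing"/>
    </article>
  </articles>
</database>
"""


def dangling_refs(root):
    """List (article id, tag, href) for references that match no id."""
    known_ids = {elem.get("id") for elem in root.iter() if elem.get("id")}
    problems = []
    for article in root.iter("article"):
        # iter() also visits references nested inside <translation>.
        for ref in article.iter():
            if ref.tag.endswith("ref") and ref.get("href") not in known_ids:
                problems.append((article.get("id"), ref.tag, ref.get("href")))
    return problems


root = ET.fromstring(SAMPLE)
print(dangling_refs(root))
```

Only the invented themeref to "missing" is reported; the other three references resolve to a person, an issue and a theme.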

A localized <article> element looks like:

<article id="article52">
  <title xml:lang="nl">Enlightenment</title>
  <file xml:lang="nl">Nederlands/July1998/article52.html</file>
  <translation from="en" to="nl">
    <personref href="nl-tu"/>
    <reserved>2000-09-06</reserved>
    <finished>2000-10-04</finished>
    <proofread>
      <personref href="nl-fl"/>
      <reserved>2000-10-04</reserved>
      <finished>2000-10-04</finished>
    </proofread>
  </translation>
  <abstract xml:lang="nl">
Enlightenment is een Linux window-manager met
uitgebreide mogelijkheden.  Dit artikel bespreekt
ze, samen met de installatie en de instelling
van E.  Dit alles is niet voor beginners daar
E op het moment nog in beta-stadium
verkeert.
  </abstract>
</article>

Note that this article was reserved for translation on a certain date, that the translation has been finished, and that it has also been proofread. In each case the person who did the work is linked with a <personref> element.
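The dates in a <translation> element are plain ISO dates, so they are easy to process. The Python sketch below summarizes one translation; it is an illustration only, and the function name translation_status is made up.

```python
import xml.etree.ElementTree as ET
from datetime import date

SAMPLE = """
<translation from="en" to="nl">
  <personref href="nl-tu"/>
  <reserved>2000-09-06</reserved>
  <finished>2000-10-04</finished>
  <proofread>
    <personref href="nl-fl"/>
    <reserved>2000-10-04</reserved>
    <finished>2000-10-04</finished>
  </proofread>
</translation>
"""


def translation_status(translation):
    """Summarize who translated and proofread, and how long it took."""
    # find() and findtext() look at direct children only, so the dates
    # nested inside <proofread> are not picked up by mistake.
    reserved = date.fromisoformat(translation.findtext("reserved"))
    finished = date.fromisoformat(translation.findtext("finished"))
    proofread = translation.find("proofread")
    return {
        "translator": translation.find("personref").get("href"),
        "proofreader": proofread.find("personref").get("href"),
        "days": (finished - reserved).days,
    }


print(translation_status(ET.fromstring(SAMPLE)))
```

For this sample the translation took 28 days from reservation to completion.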

For all elements, the best tutorial is the current database itself.

Automagically make web pages

One of the reasons for creating this new format was to automatically create web indices from it. Now that we understand (?) the database format, let's see how we can use it to generate web pages.

First, a bit of history. The first implementation used Perl modules to interface with the database. Though the interface was very clean, the implementation was very slow. The information was held in memory as a Document Object Model (DOM) tree. Most DOM implementations, however, are very slow, at least much slower than the alternative, the Simple API for XML (SAX).

But if the task is just web page generation, a third alternative seems best: XSLT. This is an XML-based transformation language. Many XSLT processors currently exist, and most programming languages are supported. Some time ago there was a LinuxFocus article on XML::XSLT, one of the Perl XSLT implementations. Since the publication of that article, more implementations have emerged, and there are a few that I recommend.

The examples in the remainder of this article will use Sablotron.

An XSLT processor takes two files for input. One is the XML source to transform. The other is the XSLT stylesheet that defines the transformation. For generation of LinuxFocus web pages the following XSLT stylesheets are available:

issues.xslt
This stylesheet generates a list of issues, with their respective articles.
issuetoc.xslt
This one generates the table of contents for a certain issue.
issuetoc_full.xslt
Like the previous, but with more information.
mainindex.xslt
Generates a list of articles with information on the translation status.
previssues.xslt
A list of all issues that have been published.
recently_translated.xslt
The ten most recently translated articles.
rss.xslt
Generates an RSS file with the ten most recently translated articles.
theme.xslt
This stylesheet generates the index page for a certain theme.
themes_index.xslt
Generates an index of all themes.
vertaald.xslt
Shows all translated articles for a certain language.

Note that these stylesheets are not the latest versions. Contact me or one of the editors of the Dutch translation team to get up-to-date versions.

To generate mainindex.html, for example, the Dutch team runs:

sabcmd stylesheets/mainindex.xslt db/lfdb.nl.xml > ../mainindex.html

The stylesheets know where the English root database is, and just need the localized database as XML input. Some sheets need an additional parameter:

sabcmd stylesheets/theme.xslt db/lfdb.nl.xml '$theme=appl' > ../Themes/appl.html
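When one stylesheet has to be run once per theme, the command lines can be generated instead of typed. The Python sketch below only builds the argument lists, using the same paths and the same '$name=value' parameter form as the commands above. The helper names sabcmd_args and render are hypothetical, and the two theme ids are simply the ones that appear in this article.

```python
import subprocess


def sabcmd_args(stylesheet, database, **params):
    """Build the argument list for one sabcmd run."""
    args = ["sabcmd", f"stylesheets/{stylesheet}.xslt", database]
    # Stylesheet parameters are passed as $name=value.
    args += [f"${name}={value}" for name, value in params.items()]
    return args


def render(args, target):
    """Run sabcmd and write its output to target (needs Sablotron installed)."""
    with open(target, "w") as out:
        subprocess.run(args, stdout=out, check=True)


# Print the commands only; call render() to actually produce the pages.
for theme in ["appl", "hw"]:
    args = sabcmd_args("theme", "db/lfdb.nl.xml", theme=theme)
    print(" ".join(args), ">", f"../Themes/{theme}.html")
```

Passing the arguments as a list means no shell is involved, so the "$" in the parameter needs no quoting here.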

The Dutch index.html is also generated from the database, but with a slightly more complex setup. The index.html is made with Guido Socher's lfpagecomposer from a set of preprocessed input files. These preprocessed input files are in turn generated from a set of .pre files such as:

<H2>Vorige nummers</H2>
 
<p>Dit zijn de uitgaven van LinuxFocus in het Nederlands:
<ul>
<!-- macro xslt previssues -->
</ul>

<H2>Recent vertaalde artikelen</H2>
<!-- macro xslt recently_translated -->

These files are simply HTML fragments with macros that apply a stylesheet to your localized database. The processing is done by a program called apply_stylesheets.pl, which looks for <!-- macro xslt ... --> commands and runs the named stylesheet on the database. Note that the .xslt extension is omitted in the macro. Our Makefile contains:

%.shtml: %.pre
        @echo "Making $*..."
        @../../xml/bin/apply_stylesheets.pl $*.pre

The resulting *.shtml files are used by the lfpagecomposer script. The stylesheets that are used to generate the index.html are: issuetoc.xslt, previssues.xslt and recently_translated.xslt.
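The macro expansion step can be sketched as follows. This is not the code of apply_stylesheets.pl, which is a Perl script; it is a Python illustration of the idea. The regular expression, the function names and the stand-in fake_run, which replaces the real XSLT run, are all assumptions.

```python
import re

# Matches comments of the form <!-- macro xslt NAME -->.
MACRO = re.compile(r"<!--\s*macro\s+xslt\s+(\w+)\s*-->")


def expand_macros(text, run_stylesheet):
    """Replace each macro comment with the output of run_stylesheet(name)."""
    return MACRO.sub(lambda match: run_stylesheet(match.group(1)), text)


def fake_run(name):
    # Stand-in for running the XSLT processor on stylesheets/<name>.xslt.
    return f"[output of {name}.xslt]"


PRE = """<ul>
<!-- macro xslt previssues -->
</ul>
<!-- macro xslt recently_translated -->
"""

print(expand_macros(PRE, fake_run))
```

Everything outside the macro comments is copied unchanged, which is why the .pre files can stay ordinary HTML fragments.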

Localizing

To use this system for other languages, you need to do the following:

localize the XML database (like lfdb.nl.xml)
localize the stylesheets

The second step is a bit unfortunate. In principle only the text in the output needs to be localized, but the stylesheets do not have localization properties yet. This is possible, however, and I would like to see it implemented.

I recommend using a DTD-aware XML editor. In Emacs you can, for example, use the psgml major mode. This gives you the ability to validate the document (with nsgmls), which helps a lot in avoiding mistakes. In Emacs you can then also right-click to see the elements and attributes you can insert at that specific place in the XML file. (Thanks to Jaime Villate for his excellent talk at the LSM conference in Bordeaux this year.)

Another great help is the Dutch localization of the XML database. If you run into trouble you can consult that file. Though the content is mostly Dutch, you can see how the database elements are organized. If that does not help, you can always email me.

Localizing the stylesheets is probably a bit tricky, because text is intermingled with XSLT commands. You must not touch the commands (unless you know what you're doing), so that the stylesheets keep working. I plan to localize the stylesheets in the future, which would mean that you only need to edit a file that contains your translations and no XSLT commands, but this is not yet done.

Future plans

OK, this should help you to get started. Most things you can copy and paste from the Dutch files. All files are available under the FDL and GPL. For the next year, these are my plans for this system:

localize stylesheets
add new stylesheets (for top_authors.html, top_translaters.html and other things we would like to see as web pages)
possibly an interface to a daemon based relational database, like MySQL.
integrate the system with other LinuxFocus tools (like getticket etc)