Displaying XML files using XSLT, Perl, and Java

This document gives a very quick introduction to XML and related buzzwords (XSLT, DTDs, and XML schemas), but focuses on displaying XML files using XSLT templates, Perl, and Java. This is NOT a definitive document on any of these topics. It's really more like a "memo to self", but maybe someone else will also find it useful somewhere down the road.

The main idea is to give a quick example of how an XML file relates to its XSLT files, its DTD and its XML schema. Following the introduction, short programs to access an XML file from a Perl CGI script and a Java stand-alone program are presented.

An XML example

Here is an XML file showing an extract from a partial list of some programs I've written over the last 10 years or so.

    <?xml version="1.0" ?>
    <?xml-stylesheet type="text/xsl" href="http://people.cc.ku.edu/~grobe/history/prog-list.xsl" ?>
    <!DOCTYPE list-of-programs SYSTEM "http://people.cc.ku.edu/~grobe/history/prog-list.dtd">
    <list-of-programs xmlns:HTML="http://www.w3.org/Profiles/XHTML-transitional">
      <application>
        <app_name>Course Catalog</app_name>
        <description>
          searches an online version of the 1996 Undergraduate Course Catalog.
          Cloned for 1998 and 2000 editions of the catalog.
        </description>
        <language>Perl</language>
        <url>http://lark.cc.ukans.edu/cgiwrap/catalog/lookup-course96.pl</url>
        <date_of_origin>1997</date_of_origin>
      </application>
      <application>
        <app_name>Parallel CPU Use</app_name>
        <description>
          shows current activity on each CPU within the KU supercomputer
          in an easy-to-read form.
        </description>
        <language>Perl</language>
        <url>http://heron.cc.ukans.edu/cgi-bin/cpu-use.pl</url>
        <date_of_origin>1997</date_of_origin>
      </application>
      <application>
        <app_name>Netometer</app_name>
        <description>
          evaluates the current status of KU connectivity to the Internet
          using multiple pings to multiple sites located around the country.
        </description>
        <language>Perl</language>
        <url>http://lark.cc.ukans.edu/cgiwrap/grobe/netometer.pl</url>
        <date_of_origin>1996</date_of_origin>
      </application>
      <application>
        <app_name>Document Archive Management System (DAMS)</app_name>
        <description>implements a document database allowing web document
          construction and editing entirely via the web itself, the
          Document Archive Management System (note the clever reuse of acronym).
        </description>
        <language>Perl</language>
        <url>http://raven.cc.ku.edu/~dams</url>
        <date_of_origin>1997</date_of_origin>
      </application>
      <application>
        <app_name>Calculet</app_name>
        <description>implements a simple calculator in Java2. A state machine
          is used to organize responses to GUI button input.
        </description>
        <language>Java</language>
        <url>http://condor.cc.ku.edu/~grobe/calculet</url>
        <date_of_origin>2001</date_of_origin>
      </application>
      <application>
        <app_name>MyKU</app_name>
        <description>
          allowed KU users to build customized views of online resources
          related to KU and/or useful in their work at KU. (design only)
        </description>
        <language>PHP</language>
        <url>http://www.ku.edu/~myku</url>
        <date_of_origin>1999</date_of_origin>
      </application>
      <application>
        <app_name>StereoBounce</app_name>
        <description>shows two stereo windows containing bouncing balls.
          Designed to simplify alignment of stereo projectors.
        </description>
        <language>Java</language>
        <url>http://condor.cc.ku.edu/~grobe/StereoBouncexyz/StereoBouncexyz.java</url>
        <date_of_origin>2000</date_of_origin>
      </application>
    </list-of-programs>

This file starts with some pointers to other information that will be discussed below, and defines content for the root element "list-of-programs" and for each "application" element within the root element. The application element, in turn, contains the following elements:
app_name         name of the program
description      description of its purpose
language         language in which it is written
url              URL for more information or to run the application
date_of_origin   date it was written

You can see a larger version of this file at

XSLT

XSLT stands for XSL Transformations, where XSL is the Extensible Stylesheet Language. XSLT is a language used to control the translation of XML files from XML to some other file format (or to some other XML format). It's kind of a combination of ASP/PHP/ColdFusion with CSS.

For example, when used to translate XML to HTML for display on the Web, an XSL "template" will include HTML statements along with embedded XSLT commands for selecting and displaying data from the XML file that invokes it.

Here is an example that is referenced by the XML file above.

    <?xml version="1.0"?>
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
      <xsl:output method="html" indent="yes"/>
      <xsl:template match="/">
        <center>
          <h2>Selected programming projects</h2>
          <h3> 1992 - 2002</h3>
          <table cellpadding="10" style="border:1px solid blue">
            <tr>
              <th style="border:1px solid black">Program</th>
              <th style="border:1px solid black">Language</th>
              <th style="border:1px solid black">Description</th>
            </tr>
            <xsl:for-each select="list-of-programs/application">
              <tr>
                <td style="border:1px solid black">
                  <xsl:value-of select="app_name"/>
                </td>
                <td style="border:1px solid black" align="center" >
                  <xsl:value-of select="language"/>
                </td>
                <td style="border:1px solid black">
                  <xsl:value-of select="description"/>
                </td>
              </tr>
            </xsl:for-each>
          </table>
        </center>
      </xsl:template>
    </xsl:stylesheet>

Eliminate the cruft and this file reduces to a for loop embedded in HTML. Here are the major loop statements:

    <xsl:for-each select="list-of-programs/application">
      <xsl:value-of select="app_name"/>
      <xsl:value-of select="language"/>
      <xsl:value-of select="description"/>
    </xsl:for-each>

The commands within the for loop are executed once for each application element within the list-of-programs element. Within the loop, each xsl:value-of statement is translated to a single value during each iteration: the value of the element specified by its "select" attribute.

For example, the XSLT statement:

    <td style="border:1px solid black">
      <xsl:value-of select="app_name"/>
    </td>

will become:

    <td style="border:1px solid black">Course Catalog</td>

during the first pass through the loop (if "Course Catalog" is the first entry in the XML file).
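The same loop can also be run outside a browser. Here is a minimal sketch, assuming a current JDK, that applies a cut-down version of the stylesheet with the standard javax.xml.transform (TrAX) classes. The class name ForEachDemo, the two-entry sample data, and the stripped-down stylesheet are made up for illustration; this is not how the pages above were produced.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class ForEachDemo {
    // Hypothetical two-entry extract of the program list.
    static final String XML =
        "<list-of-programs>"
      + "<application><app_name>Course Catalog</app_name>"
      + "<language>Perl</language></application>"
      + "<application><app_name>Calculet</app_name>"
      + "<language>Java</language></application>"
      + "</list-of-programs>";

    // Cut-down stylesheet: one table row per application element.
    static final String XSL =
        "<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0'>"
      + "<xsl:output method='html' indent='no'/>"
      + "<xsl:template match='/'>"
      + "<table><xsl:for-each select='list-of-programs/application'>"
      + "<tr><td><xsl:value-of select='app_name'/></td></tr>"
      + "</xsl:for-each></table>"
      + "</xsl:template></xsl:stylesheet>";

    // Apply the stylesheet text to the XML text and return the result.
    static String transform(String xml, String xsl) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(xsl)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)),
                              new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws TransformerException {
        // One <tr> is emitted per application, in document order.
        System.out.println(transform(XML, XSL));
    }
}
```

Running the transformation on the server like this trades away the client-side processing, but it works for clients that cannot apply stylesheets themselves.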

Note that there is a W3C standard for identifying XML elements within XSLT templates called XPATH. In the example above the XML structure is so simple that the power of XPATH is not well demonstrated.

The language may seem somewhat awkward, unless you are used to writing ASP/PHP/ColdFusion scripts, but it also seems reasonably powerful. At least one author (Erik Ray) describes XSLT as powerful enough to do 90% of what users are likely to want to do, but either unable to do the rest or tortuously difficult to use for it.

Overall, the XSLT approach seems to offer several general advantages. First, it further separates a document's content from its format, allowing multiple agents to work with the same data in the same format, obviating the need to keep or export data in multiple formats.

Second, it moves some processing to the client from the server. This may be an advantage in some data delivery situations, where, for example, large hit rates severely overload a server. On the other hand large datafiles will probably not be efficiently processed using this approach.

Third, it "democratizes" database access. Site developers can build database driven pages even though they have no server to work with.

Note that there are additional XSLT commands. For example, there is an "xsl:if" Boolean conditional; a switch construct, "xsl:choose", that relies on "xsl:when" clauses; some ability to construct variables with "xsl:variable" (though they have limited functionality); and many more features.
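To make those commands a little more concrete, here is a small sketch that runs a stylesheet using xsl:variable, xsl:choose/xsl:when/xsl:otherwise, and xsl:if from Java. The class name ChooseDemo, the sample data, and the "1990s"/"2000s" labels are invented for illustration and are not part of the stylesheet discussed above.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class ChooseDemo {
    // Hypothetical two-entry extract of the program list.
    static final String XML =
        "<list-of-programs>"
      + "<application><app_name>Course Catalog</app_name>"
      + "<language>Perl</language><date_of_origin>1997</date_of_origin></application>"
      + "<application><app_name>Calculet</app_name>"
      + "<language>Java</language><date_of_origin>2001</date_of_origin></application>"
      + "</list-of-programs>";

    static final String XSL =
        "<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0'>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='/'>"
      + "<xsl:for-each select='list-of-programs/application'>"
        // xsl:variable: bound once per iteration and cannot be reassigned.
      + "<xsl:variable name='year' select='number(date_of_origin)'/>"
      + "<xsl:value-of select='app_name'/>"
      + "<xsl:text>: </xsl:text>"
        // xsl:choose: the first matching xsl:when wins, else xsl:otherwise.
      + "<xsl:choose>"
      + "<xsl:when test='$year &lt; 2000'>1990s</xsl:when>"
      + "<xsl:otherwise>2000s</xsl:otherwise>"
      + "</xsl:choose>"
        // xsl:if: a plain conditional with no else branch.
      + "<xsl:if test='language = \"Java\"'> (Java)</xsl:if>"
      + "<xsl:text>&#10;</xsl:text>"
      + "</xsl:for-each>"
      + "</xsl:template></xsl:stylesheet>";

    static String run() throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(XSL)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(XML)),
                              new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws TransformerException {
        System.out.print(run());
    }
}
```

Note that "<" has to be written as "&lt;" inside the test attribute, because the stylesheet is itself an XML document.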

Note also that there is additional functionality within the commands described earlier. For example, select takes XPath expressions, which may include Boolean conditionals (predicates) that can incorporate a broad set of functions and give detailed control over element selection. These conditionals are defined within XPATH.
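The kind of selection these conditionals allow can be tried directly from Java with the standard javax.xml.xpath package. The sketch below is only an illustration: the class name XPathDemo, the three-entry sample data, and the two expressions are assumptions of mine, but the same expressions could be used in a "select" attribute.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathDemo {
    // Hypothetical three-entry extract of the program list.
    static final String XML =
        "<list-of-programs>"
      + "<application><app_name>Course Catalog</app_name>"
      + "<description>searches the course catalog</description>"
      + "<language>Perl</language><date_of_origin>1997</date_of_origin></application>"
      + "<application><app_name>Netometer</app_name>"
      + "<description>evaluates connectivity</description>"
      + "<language>Perl</language><date_of_origin>1996</date_of_origin></application>"
      + "<application><app_name>Calculet</app_name>"
      + "<description>implements a simple calculator</description>"
      + "<language>Java</language><date_of_origin>2001</date_of_origin></application>"
      + "</list-of-programs>";

    // Evaluate an XPath expression and return the text of each selected node.
    static List<String> select(String expression) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
            expression, new InputSource(new StringReader(XML)), XPathConstants.NODESET);
        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            values.add(nodes.item(i).getTextContent());
        }
        return values;
    }

    public static void main(String[] args) throws XPathExpressionException {
        // A predicate in square brackets filters the application elements.
        System.out.println(select(
            "/list-of-programs/application[language='Perl']/app_name"));
        // Predicates can combine comparisons and functions such as contains().
        System.out.println(select(
            "/list-of-programs/application"
          + "[date_of_origin > 1999 and contains(description, 'calculator')]/app_name"));
    }
}
```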

You can see this file at

The DTD

Note that the original XML document also references a Document Type Definition called prog-list.dtd. DTDs define the elements that make up an XML document. Actually, they were originally used to define SGML documents (such as the structure of HTML documents), but have been adapted to serve XML, as well.

Here is the DTD for the XML document above:

    <!ELEMENT list-of-programs (application)+>
    <!ATTLIST list-of-programs
              xmlns:HTML CDATA #FIXED "http://www.w3.org/Profiles/XHTML-transitional">
    <!ELEMENT application (app_name,description,language,url,date_of_origin)>
    <!ELEMENT app_name (#PCDATA) >
    <!ELEMENT description (#PCDATA)>
    <!ELEMENT language (#PCDATA)>
    <!ELEMENT url (#PCDATA)>
    <!ELEMENT date_of_origin (#PCDATA)>

It is simple and straight to the point, at least once you get used to reading DTDs. This one defines the "list-of-programs" element in two ways. First, it defines list-of-programs as a collection of one or more "application" elements (as directed by the "+" sign). Then it defines the attributes that can appear within the list-of-programs start tag. In this case the list-of-programs element may carry one attribute, "xmlns:HTML" (an XML namespace declaration), to which a value may be assigned. In this version a "fixed" value is assigned within the DTD. (As of this writing #IMPLIED did not work on my usual browser, so I was forced to use #FIXED.)

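Next, the "application" element is defined as a sequence of exactly 5 elements, which must appear in the order listed earlier in this document. Each of those elements is declared to hold plain character data (#PCDATA).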
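To see what the DTD actually enforces, here is a small sketch that validates two tiny documents with a validating DOM parser from the JDK. The class name DtdCheck and the sample documents are made up, and the DTD is placed inline in the DOCTYPE (minus the ATTLIST) so that nothing has to be fetched over the network.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class DtdCheck {
    // The element declarations from the DTD, as an internal subset.
    static final String DOCTYPE =
        "<!DOCTYPE list-of-programs [\n"
      + "<!ELEMENT list-of-programs (application)+>\n"
      + "<!ELEMENT application (app_name,description,language,url,date_of_origin)>\n"
      + "<!ELEMENT app_name (#PCDATA)>\n"
      + "<!ELEMENT description (#PCDATA)>\n"
      + "<!ELEMENT language (#PCDATA)>\n"
      + "<!ELEMENT url (#PCDATA)>\n"
      + "<!ELEMENT date_of_origin (#PCDATA)>\n"
      + "]>\n";

    // Elements in the order the DTD requires.
    static final String GOOD = DOCTYPE
      + "<list-of-programs><application><app_name>Calculet</app_name>"
      + "<description>a calculator</description><language>Java</language>"
      + "<url>http://condor.cc.ku.edu/~grobe/calculet</url>"
      + "<date_of_origin>2001</date_of_origin></application></list-of-programs>";

    // Same data, but url appears before language.
    static final String BAD = DOCTYPE
      + "<list-of-programs><application><app_name>Calculet</app_name>"
      + "<description>a calculator</description>"
      + "<url>http://condor.cc.ku.edu/~grobe/calculet</url>"
      + "<language>Java</language>"
      + "<date_of_origin>2001</date_of_origin></application></list-of-programs>";

    // Parse with validation on and collect any validity problems reported.
    static List<String> validate(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        List<String> problems = new ArrayList<>();
        builder.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) { problems.add(e.getMessage()); }
            public void error(SAXParseException e) { problems.add(e.getMessage()); }
            public void fatalError(SAXParseException e) throws SAXException { throw e; }
        });
        builder.parse(new InputSource(new StringReader(xml)));
        return problems;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("good: " + validate(GOOD));
        System.out.println("bad:  " + validate(BAD));
    }
}
```

The second document is well-formed, so a non-validating parser would accept it; only the validating parser complains about the element order.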

You can see this file at

The XML Schema

Unfortunately DTDs do not have enough power to define explicit data content within elements. This is a significant handicap for people who want to use XML to translate data files among different formats.

Enter "schemas", or DTDs on steroids. Here is one for the XML document above:

    <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
      <xsd:element name="list-of-programs" type="List-of-Programs"/>
      <xsd:complexType name="List-of-Programs">
        <xsd:sequence>
          <xsd:element name="application" type="Application" maxOccurs="unbounded"/>
        </xsd:sequence>
      </xsd:complexType>
      <xsd:complexType name="Application">
        <xsd:sequence>
          <xsd:element name="app_name" type="xsd:string"/>
          <xsd:element name="description" type="xsd:string"/>
          <xsd:element name="language" type="xsd:string"/>
          <xsd:element name="url" type="xsd:anyURI"/>
          <xsd:element name="date_of_origin" type="xsd:gYear"/>
        </xsd:sequence>
      </xsd:complexType>
    </xsd:schema>

Note that each leaf element is defined using a datatype. "xsd:string" is roughly equivalent to "#PCDATA" within a DTD, but "xsd:anyURI" and "xsd:gYear" (Gregorian year) are much more specific. It is also possible to declare integer and real values within ranges.

As of this writing (2002) browsers are not using schemas to validate XML documents. In fact, it is difficult to find a free tool or web site that will validate against schemas. (They will come.)
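Current JDKs include a schema validator in the javax.xml.validation package, so the datatype checking can be tried from a short program. The following is a sketch only: the class name SchemaCheck and the single-element schema are simplifications I made up for the example, not the full schema for the program list.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

class SchemaCheck {
    // Simplified, hypothetical schema: a single application element.
    static final String XSD =
        "<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'>"
      + "<xsd:element name='application'>"
      + "<xsd:complexType><xsd:sequence>"
      + "<xsd:element name='app_name' type='xsd:string'/>"
      + "<xsd:element name='url' type='xsd:anyURI'/>"
      + "<xsd:element name='date_of_origin' type='xsd:gYear'/>"
      + "</xsd:sequence></xsd:complexType>"
      + "</xsd:element></xsd:schema>";

    // Build a sample document with the given date_of_origin content.
    static String sample(String year) {
        return "<application><app_name>Calculet</app_name>"
             + "<url>http://condor.cc.ku.edu/~grobe/calculet</url>"
             + "<date_of_origin>" + year + "</date_of_origin></application>";
    }

    // Return "valid", or "invalid: ..." with the validator's message.
    static String check(String xml) throws IOException {
        try {
            Schema schema = SchemaFactory
                .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(XSD)));
            Validator validator = schema.newValidator();
            validator.validate(new StreamSource(new StringReader(xml)));
            return "valid";
        } catch (SAXException e) {
            return "invalid: " + e.getMessage();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(check(sample("1997")));
        // Accepted as #PCDATA by a DTD, but rejected as an xsd:gYear.
        System.out.println(check(sample("sometime in 1997")));
    }
}
```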

Note also that the schema is itself defined in XML. (Hmmm...it would be good to include the schema schema here.)

Using Perl to read XML

One of the promises of XML is that an XML file can be accessed by multiple agents. For example, one might want to deliver the same data using either XML/XSLT or some run-of-the-mill programming language.

In particular, one might want to use some CGI script to deliver a document that would be tortuously difficult to deliver using XSLT.

There seem to be roughly 3 approaches to accessing XML from programming languages:

The Document Object Model (DOM) approach copies a complete XML file into an in-memory data structure (a Perl data structure, in this case). XML::Simple appears to belong to this class, and XML::LibXML is another example.

The Simple API for XML (SAX) is an event-oriented approach where programmers define routines for handling each element, etc. as it arrives within an XML stream.

The third category collects eccentric approaches. The Perl RAX, and PYX packages appear to fit in such a category. RAX will deal with XML files meant to be used like record-oriented relational databases. With RAX, you simply set up a while loop to read each "record" and RAX parses each record and returns values of requested elements as they come through the input stream. PYX converts XML files to a simple text stream that can be handled by Unix (and possibly Windows) filters.
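For comparison with the event-oriented style, here is roughly what the first (DOM) approach looks like, sketched in Java rather than Perl so that it can be set beside the SAX program later in this document. The class name DomDemo and the sample data are made up for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomDemo {
    // Hypothetical two-entry extract of the program list.
    static final String XML =
        "<list-of-programs>"
      + "<application><app_name>Course Catalog</app_name>"
      + "<language>Perl</language></application>"
      + "<application><app_name>Calculet</app_name>"
      + "<language>Java</language></application>"
      + "</list-of-programs>";

    // Load the whole document into memory, then walk the tree.
    static List<String> appNames(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        NodeList apps = doc.getElementsByTagName("application");
        List<String> names = new ArrayList<>();
        for (int i = 0; i < apps.getLength(); i++) {
            Element app = (Element) apps.item(i);
            // Each application is a subtree that can be searched on its own.
            names.add(app.getElementsByTagName("app_name")
                         .item(0).getTextContent().trim());
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(appNames(XML));
    }
}
```

The tree can be revisited in any order once it is loaded, at the cost of holding the entire file in memory. A SAX handler sees each element only once, as it streams by.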

The granddaddy of approaches to reading XML in Perl is XML::Parser::Expat, a C package underlying many other Perl packages, such as XML::Parser, which can deliver element streams and/or document objects. XML::Parser is used under the covers by XML::Simple to implement its DOM approach.

Here is an example using the XML::Simple module to access the XML example above. It uses XMLin to read the XML file and build an internal hash, referenced by "$programs", containing all the data in the file. The program then prints an HTML table containing only program names and descriptions (ignoring other information in the file):

    #!/usr/bin/perl
    # program to read an XML file, prog-list.xml, and display it as HTML.
    # XML::Simple is used to read the file into a complex hash of hashes of
    # arrays (or some such structure).
    # taken from "Perl and XML" by Erik T. Ray, and modified extensively.

    use lib '/home/grobe/public_html/history/XML-Parser-2.29/blib/lib';
    use lib '/home/grobe/public_html/history/XML-Parser-2.29/blib/arch';
    use lib '/home/grobe/public_html/history/XML-Simple-1.06/blib/lib';
    use lib '/home/grobe/public_html/history/XML-Simple-1.06/blib/arch';

    use strict;
    use warnings;
    use XML::Simple;

    # ask XML::Simple to use XML::Parser under the covers.
    $XML::Simple::PREFERRED_PARSER = 'XML::Parser';

    my $item;

    # turn the file into a hash; use forcearray so that all elements are arrays.
    # we might want to eval this in real life to check for "well-formedness",
    # depending on efficiency considerations.
    my $programs = XMLin('../history/prog-list.xml', forcearray => 1);

    use Data::Dumper;
    #print Dumper($programs);   # use this for debugging.

    # print a header for the web page. the blank line after the
    # Content-type header is required to end the CGI headers.
    print <<HERE;
    Content-type: text/html

    <html>
    <head>
    <title>Selected programming projects</title>
    </head>
    <body bgcolor=lightblue>
    <font size=10 face="arial">
    <center>
    <h2>Selected programming projects</h2>
    <h3>1992-2002</h3>
    </center>
    <table border=1>
    <tr><th> Program</th><th>Description</th></tr>
    HERE

    # loop over each program sub-hash. they are all stored
    # as an anonymous list under the 'application' key.
    # ("defined @array" is no longer legal Perl, so test the reference.)
    if ( !$programs->{application} || !@{$programs->{application}} )
    {
      exit;
    }

    for my $program ( @{$programs->{application}} )
    {
      print "\n<tr>\n";
      print "<th rowspan=2>$program->{'app_name'}->[0] </th>\n";
      print "<td>$program->{'description'}->[0] </td>\n";
      print "</tr>\n<tr>\n";

      # if you are taking multiple pieces of data within a single tag,
      # you need to read the array structure. For example, if the
      # URL tag includes multiple URLs, you could use:

      #if ( $program->{'url'} )
      #{
      #  my $max = @{$program->{'url'}};   # count the number of URLs.
      #
      #  print "<td>\n";
      #  if ( $max > 0 )
      #  {
      #    for ($item=0; $item < $max; $item=$item+1)
      #    {
      #      print "<a href=\"$program->{'url'}->[$item]\">
      #             $program->{'url'}->[$item]</a>\n";
      #      if ( $item < ($max - 1) )
      #      {
      #        print "<br>";
      #      }
      #    }
      #  }
      #  print "\n&nbsp;</td>\n";
      #  print "\n</tr>";
      #}
    }

    print "\n</table></body></html>\n";

    # now, if we had made any changes here we could do:
    #open (FILEHANDLE, "> /tmp/test.xml") || die "couldn't open";
    #print FILEHANDLE XMLout($programs) ;

    exit;

Note that accessing the data structure produced by XMLin is less than straightforward for all but the hairiest Perl programmers. Nonetheless, the program is workable for getting data in and out using XML formats.

You can see the document produced by this script at:

Using Java to read an XML file

Here is a Java program that reads a program listing on a URL-accessible file using the standard Java 2 SAX implementation. This program takes a single argument: the URL of any XML file defined using the program list DTD, fetches the file from its remote location, and displays apposite information contained therein.

This program defines 3 classes:

    import javax.xml.parsers.*;
    import org.xml.sax.*;
    import org.xml.sax.helpers.*;
    import java.io.*;
    import java.util.*;
    import java.net.*;

    // This class takes a URL from the command line and processes
    // its contents as an XML document containing a list of programs
    // according to the DTD at
    //    http://condor.cc.ku.edu/~grobe/history/prog-list.dtd
    // This class will build an ArrayList of Application objects,
    // which can then be processed ad lib.
    // Each start tag is pushed on a stack which is not popped until
    // the matching ending element is encountered. The "characters"
    // method then peeks at the stack to decide where to put contents.

    public class GetApplicationList2 extends DefaultHandler
    {
      public static ArrayList applicationObjectList = new ArrayList( 701 );

      public static void main( String[] arguments )
      {
        if ( arguments.length > 0 )
        {
          try
          {
            // Set up an input stream for reading the URL.
            URL programsURL = new URL( arguments[ 0 ] );
            InputStream programsStream = programsURL.openStream();
            BufferedInputStream programsIn =
                new BufferedInputStream( programsStream );

            // Get a parser factory, and instruct that factory to
            // generate parsers that validate the incoming XML file with
            // respect to its DTD as the file is read.
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating( true );

            // Get a parser, and set it up to read from the input stream
            // and call methods within GetApplicationListHandler2 to
            // handle generated events.
            SAXParser sax = factory.newSAXParser();
            sax.parse( programsIn, new GetApplicationListHandler2() );
          }
          catch ( MalformedURLException e )
          {
            System.err.println( arguments[ 0 ] + " is not a viable URL" );
          }
          catch ( ParserConfigurationException pce )
          {
            System.out.println( "Could not create that parser." );
            System.out.println( pce.getMessage() );
          }
          catch ( SAXException se )
          {
            System.out.println( "Problem with the SAX parser." );
            System.out.println( se.getMessage() );
          }
          catch ( IOException ioe )
          {
            System.out.println( "Error reading file." );
            System.out.println( ioe.getMessage() );
          }

          // While the parser is working, it calls methods that will
          // populate the object list, so when it completes,
          // the list should be ready to be processed. In this case
          // it will simply be printed to the console.
          for( int k = 0; k < applicationObjectList.size(); k++ )
          {
            Application application =
                ( Application ) applicationObjectList.get( k );
            System.out.println( "\nApplication: " + application.app_name );
            System.out.println( "Description: " + application.description );
            System.out.println( "Language: " + application.language );
            System.out.println( "URL: " + application.url );
            System.out.println( "Date of origin: " + application.date_of_origin );
          }
        }
        else
        {
          System.out.println( "Usage: java GetApplicationList2 URL" );
        }
      }
    } // end GetApplicationList2

    class GetApplicationListHandler2 extends DefaultHandler
    {
      // Here are the methods that will be called by the SAX parser...

      Application application = new Application();
      Stack tagStack = new Stack();

      public void startElement( String uri, String localName,
                                String qName, Attributes attributes )
      {
        // Process a start tag..
        tagStack.push( qName );  // record the start tag on the stack.

        // If a tag has attributes they must be picked out of the
        // attribute map while processing the start tag. For example,
        // the following if statement could be used to pick out
        // hypothetical "isbn" and "edition" attributes:
        /*
        if ( qName.equals( "book_info" ) )
        {
          libraryBook.isbn = attributes.getValue( "isbn" );
          libraryBook.edition = attributes.getValue( "edition" );
        }
        */
      } // end startElement tag.

      public void characters( char[] characterArray, int start, int length )
      {
        // get the stuff between start and end tags. Note that it
        // may involve multiple lines.
        if( length <= 0 ) // it's a null section.
        {
          return;
        }

        String value = new String( characterArray, start, length );
        String blanklessString = value.trim();
        if( blanklessString.length() <= 0 ) // nothing but whitespace
        {
          return;
        }

        // ...but if the char string actually holds something besides
        // whitespace, find out which tag is being processed...
        String currentTag = ( String ) tagStack.peek();

        // ...and store the char string in the proper variable.
        // tags of no interest to this application can be ignored,
        // so they will not be placed into the application object.
        if ( currentTag.equals( "app_name" ) )
          application.app_name =
              application.app_name.concat( value.trim() );
        if ( currentTag.equals( "description" ) )
          application.description =
              application.description.concat( value.trim() );
        if ( currentTag.equals( "language" ) )
          application.language =
              application.language.concat( value.trim() );
        if ( currentTag.equals( "url" ) )
          application.url =
              application.url.concat( value.trim() );
        if ( currentTag.equals( "date_of_origin" ) )
          application.date_of_origin =
              application.date_of_origin.concat( value.trim() );
      }

      public void endElement( String uri, String localName, String qName )
      {
        // Process an ending tag...
        // First check to see if it matches its start tag.
        String poppedTag = ( String ) tagStack.pop();
        if( ! qName.equals( poppedTag ) )
        {
          System.out.println( "Popped tag (" + poppedTag +
                              ") does not match end tag(" + qName + ")\n" );
          System.exit( 1 );  // exit with a non-zero status to signal the error.
        }

        // Then check to see if it ends a set of application sub-tags,
        // and, if so, store the object in the ArrayList of objects...
        if( qName.equals( "application" ) )
        {
          GetApplicationList2.applicationObjectList.add( application );

          // ...and create a new object to start a new collection.
          application = new Application();
        }
      }
    } // end GetApplicationListHandler2

    class Application // holds the data collected for a single application.
    {
      String app_name = "";
      String description = "";
      String language = "";
      String url = "";
      String date_of_origin = "";
    } // end Application

This program can be easily modified to read XML files defined using other DTDs. DTDs including tags with attributes can be handled after some simple modifications suggested by clues within the program comments.

Remember, however, that getValue() will return a null value if it requests an attribute that is not supplied within the XML stream being processed.
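Here is a short sketch of that null check, using the hypothetical "book_info" tag with "isbn" and "edition" attributes from the program comments. The class name AttributeDemo, the sample XML, and the "unknown" default are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class AttributeDemo extends DefaultHandler {
    // One "isbn/edition" string per book_info element encountered.
    final List<String> seen = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes attributes) {
        if (qName.equals("book_info")) {
            String isbn = attributes.getValue("isbn");
            // getValue() returns null when the attribute is absent,
            // so substitute a default before using the value.
            String edition = attributes.getValue("edition");
            if (edition == null) {
                edition = "unknown";
            }
            seen.add(isbn + "/" + edition);
        }
    }

    // Parse the given XML text and return what the handler collected.
    static List<String> read(String xml) throws Exception {
        AttributeDemo handler = new AttributeDemo();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return handler.seen;
    }

    public static void main(String[] args) throws Exception {
        // The second book_info has no edition attribute.
        System.out.println(read(
            "<books><book_info isbn='123' edition='2'/>"
          + "<book_info isbn='456'/></books>"));
    }
}
```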

Much of the code presented above was, in fact, used to parse incoming XML within a program, XMLRelaySQL, which relays SQL commands embedded within XML from connecting clients to an SQL server via JDBC. It includes several examples of attribute processing.

Additional info

For more information about using XML with Perl see "Perl and XML" by Erik T. Ray. For more info about XSLT see the "XSLT Developer's Guide," by Chris vonSee, Osborne, 2002.

For more information about using XML with Java see "Processing XML with Java: A Guide to SAX, DOM, JDOM, JAXP, and TrAX" by Elliotte Rusty Harold. This is an amazingly thorough book.

You could also look at Mapping XML to Java by Robert Hustead, which is a short online tutorial on using XML via Java.

Michael Grobe
Academic Computing Services
The University of Kansas
September 2003
October 2002
January 2006