Learn XML in 30 minutes
by Alex Rosen, Principal Engineer, Novell
Date Created: 2000-12-04 09:58:00.000
  Introduction
  XML
  DTDs and Validation
  exteNd XML MetaData
  Namespaces, XSL, XSchema, etc.
  appendix
introduction

How do you learn XML? Don't try to learn it by reading the XML specification. Like any specification, its main goal was to be very precise, so that every XML parser performs exactly the same as every other XML parser. In this particular case, the result was a specification that is particularly hard to read. 

This is a problem for anyone who wants to use the SilverCmd feature. SilverCmd takes input files in the XML format. This article attempts to teach you the basics of XML - everything you'll need to use XML with SilverCmd - in 30 minutes or less. I'll skip over lots of esoteric, less-frequently-used parts, such as notations, IDREFs, NMTOKENs, etc. Hopefully, if I don't cover it here, you won't really need to know it (at least when using XML with SilverCmd).

XML

Elements and Attributes

The good news is that XML documents are easy to read, and not too hard to write either. Here's a simple XML document (we'll call it the "EmployeesList" sample). Anyone familiar with HTML should be able to understand the basics of what it's saying, without too much trouble.  

<EmployeesList>

   <Employee>

      <FirstName>Alex</FirstName>

      <LastName>Rosen</LastName>

      <Salary rate="hourly" currency="dollars">5.15</Salary>

   </Employee>

</EmployeesList> 

Like HTML, XML consists of tags and text content. Tags are everything inside the angle brackets, and content is everything else. The content is the "data" of the document, and the tags are the "meta-data"  - they describe the data.  

Most tags in XML are "elements." Elements and their content are the main components of an XML document. The "EmployeesList" sample document, of course, represents a list of employees. There's only one employee in this particular list. The employee's first name is "Alex", his last name is "Rosen," and he doesn't get paid enough. Simple.[1]

In XML, there must be exactly one top-most tag. If you want an XML document that is a list of employees, you have to define a top-level container element like "EmployeesList," because you can't have more than one "Employee" element at the top level. (Again, this is just like in HTML, which always has a single top-level element: <HTML>.) 

Elements may have attributes. In the "EmployeesList" example, the "Salary" element has two attributes: "rate" and "currency". An attribute is exactly what it sounds like: an attribute (or property) of the element it modifies. This should also be familiar to HTML users, who are used to seeing attributes like: <IMG SRC="image.gif" WIDTH="32" HEIGHT="32" ALT="Welcome icon">. 

Generally, the main data of a document goes inside the elements (like "Alex", "Rosen", and "5.15"), while information about the main data goes into the attributes (like "hourly" and "dollars").  
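
As a rough illustration of the difference between element content and attributes, here is a minimal sketch in Python that reads the "EmployeesList" sample with the standard library's ElementTree parser. Python is only an assumption made for these examples (SilverCmd doesn't require it), and the variable names are made up.

```python
import xml.etree.ElementTree as ET

# The "EmployeesList" sample, as a Python string.
doc = """
<EmployeesList>
   <Employee>
      <FirstName>Alex</FirstName>
      <LastName>Rosen</LastName>
      <Salary rate="hourly" currency="dollars">5.15</Salary>
   </Employee>
</EmployeesList>
"""

root = ET.fromstring(doc)          # the single top-level element
employee = root.find("Employee")   # first child element named "Employee"
salary = employee.find("Salary")

first_name = employee.findtext("FirstName")  # element content (the data)
amount = float(salary.text)                  # element content, converted to a number
rate = salary.get("rate")                    # attribute (information about the data)
currency = salary.get("currency")
```

The data ("Alex", "5.15") comes back as element text, while "hourly" and "dollars" come back as attribute values.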

At this point, we've already gotten most of the way through XML. I should point out some ways that XML is different from HTML, relating to elements and attributes: 

-       XML is case-sensitive. For example, the tags <item> and <ITEM> are different.

-       The value of an attribute must always be placed in quotes. In HTML, you can often get away without the quotes. However, in XML, quotes around attribute values are mandatory.

-       Some HTML tags don't require separate start and end tags (such as <P>, <BR>, and <IMG>). In XML, you must indicate these kinds of tags by adding a trailing slash, like this: <BR/>. However, these types of tags are much less common in XML than in HTML.

-      Although it's not technically legal HTML, most browsers will accept badly-nested tags, such as this example: <B>Hi<I>there</B>mom!</I>. This is definitely not allowed in XML. Tags must nest properly (that is, you may only specify an end-tag for the most recent start-tag).
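
The rules in this list can be tried out with any XML parser. The sketch below assumes Python's standard ElementTree parser; the helper name is_well_formed is made up for this example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text as XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

ok_nested = is_well_formed("<B>Hi<I>there</I>mom!</B>")    # properly nested
bad_nested = is_well_formed("<B>Hi<I>there</B>mom!</I>")   # overlapping tags
unquoted = is_well_formed("<IMG WIDTH=32/>")                # attribute value without quotes
empty_tag = is_well_formed("<BR/>")                         # trailing slash
case_mismatch = is_well_formed("<item>text</ITEM>")         # names differ in case
two_roots = is_well_formed("<Employee/><Employee/>")        # more than one top-level element
```

Only the properly nested example and the <BR/> example are accepted; the parser rejects the rest.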

XML Prolog

The "EmployeesList" example is a legal XML document. But you'll often see XML documents with a "prolog" at the top, like this: 

<?xml version="1.0" encoding="ISO-8859-1"?>

or

<?xml version="1.0" standalone="yes"?> 

This prolog is encouraged, but it is only required if the document's character encoding is not UTF-8 or UTF-16. (Since ASCII is a strict subset of UTF-8, this also applies to ASCII encoding.) If you use any other encoding, you must specify that encoding in the prolog, as seen above. (The "version" attribute is the XML version that you're using - make sure it's 1.0, since that's the current standard, and it should be forward-compatible. The "version" attribute is required, if you use a prolog. The "standalone" attribute is just an optional hint to the XML parser. When in doubt, it's safest if you just leave it out.)
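
To see why the encoding declaration matters, here is a small sketch (again assuming Python's standard ElementTree parser, with a made-up name as the data). The same ISO-8859-1 bytes are parsed once with a prolog and once without.

```python
import xml.etree.ElementTree as ET

# "Jos\u00e9" encoded as ISO-8859-1: the accented letter becomes the single
# byte 0xE9, which is not valid UTF-8 on its own.
declared = ('<?xml version="1.0" encoding="ISO-8859-1"?>'
            '<FirstName>Jos\u00e9</FirstName>').encode("iso-8859-1")
undeclared = '<FirstName>Jos\u00e9</FirstName>'.encode("iso-8859-1")

name = ET.fromstring(declared).text   # the prolog tells the parser how to decode

try:
    ET.fromstring(undeclared)          # no prolog, so the parser assumes UTF-8
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False
```

With the prolog, the name is decoded correctly; without it, the parser treats the bytes as UTF-8 and gives up.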

With SilverStream, you'll often see a second line like this: 

<?AgMetaXML 1.0?>

This just specifies which version of the SilverStream "metadata" format is being used. Don't worry about this; it's not required.

Comments

As in HTML, XML comments start with <!-- and  end with -->. 

<!-- For example, this is a comment. -->

Character data

The content of an XML document (everything that's not in a tag) is the "character data" of a document. Any characters are OK in the character data, except the left angle bracket ("<") and the ampersand ("&"). These must be replaced by "&lt;" and "&amp;" respectively, as in HTML.
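
If you generate XML from a program, you usually don't do this replacement by hand. As a sketch, assuming Python, the standard library's xml.sax.saxutils.escape function does it (the element name "Note" is made up):

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

raw = "AT&T < IBM"
escaped = escape(raw)                         # replaces & and < (and also >)
element_text = "<Note>" + escaped + "</Note>"
round_trip = ET.fromstring(element_text).text  # the parser undoes the escaping
```

The escaped form is what goes in the file; the program reading the file gets the original characters back.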

Whitespace

Generally, if you want to make an XML document more readable, it's fine to add extra whitespace between elements, but not inside elements (where the document's actual data content is). In the "EmployeesList" example, note that we have nicely formatted the XML by adding a newline after every tag, except inside of element content. If we tried to add whitespace there, it would be associated with the content text. For example, if we did this: 

      <FirstName>

       Alex

      </FirstName>

...then the name would be read as "[newline][tab][tab]Alex[newline]", which is not what we want.
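
Here is a quick way to check this, sketched with Python's standard ElementTree parser (an assumption for illustration; other parsers report the same text). The whitespace in the string is written with \n and \t so it is visible.

```python
import xml.etree.ElementTree as ET

spread_out = ET.fromstring("<FirstName>\n\t\tAlex\n</FirstName>")
compact = ET.fromstring("<FirstName>Alex</FirstName>")

raw_text = spread_out.text          # the whitespace is part of the content
trimmed = spread_out.text.strip()   # a program reading the file may choose to trim it
```

Whether the extra whitespace gets trimmed is up to the program reading the file, so it's safer not to add it in the first place.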

DTDs and validation

What is XML?

I've described how to write XML. But in order to move on to the next level, we need to take a step back and figure out what XML actually is. 

XML is a language for writing documents that can hold any arbitrary data. But you can also think of XML as a language for writing languages. You can write a document that's just in XML, as in the "EmployeesList" example above. But you can also define a tag-set and rules, to create a language with a specific purpose, such as a language for describing a chemical compound, or a magazine article, or the user interface components of a program. Then you can write a document of that particular kind - such as a Chemical Markup Language (CML) document, or an XML User-Interface Language (XUL) document.  

Let's compare XML with HTML. HTML is designed for one particular purpose: to format data for presentation to a human. HTML defines which tags are allowed, what they mean, and which tags can go inside which other tags. Similarly, CML was designed for one particular purpose: to describe chemical compounds. It also has a defined set of tags, rules, and meanings. 

XML, by contrast, is the parent of CML, XUL, and lots of other special-purpose languages. XML itself does not define any tags or rules or meanings. You can use any tags you want -  or, you can define a particular language based on XML, like CML or XUL, which has tags, rules, and meanings that you specify. The file that stores these tags and rules is called a DTD (Document Type Definition). For example, the CML DTD defines the tags and rules that a CML document must use. (The DTD doesn't explicitly define the meaning of the tags - you have to document these (in English or another human language). You can put this documentation in comments in your DTD, though.) 

You can think of an analogy with files on your hard drive. You can store any arbitrary bytes you want in a file, just as you can write an XML document with any tags and any structure you like. However, if you want to make a file on your hard drive a particular kind of file, such as a GIF, JPEG, or a Microsoft Word document, you have to make sure that its bytes follow the rules of the particular kind of document you're trying to write. Similarly in XML, if you want to write a CML document, you must make sure that the document follows the rules for CML.

DTD overview

A correctly-written XML document is called "well-formed". It's very easy to write a well-formed XML document: start tags must match end tags, tags must be properly nested, and so on. All XML documents must be well-formed in order for them to work correctly. 

But there's another level that an XML document can aspire to: it can be "valid." This means that not only is the document a well-formed XML document, but it also follows the rules for its particular kind of XML document. For example, you may want to write a document that follows the rules of the "XML User-Interface Language" (XUL), created by the Mozilla/Netscape people. (XUL is used to specify a user interface; it knows about windows, menus, and the like.) 

In order to be valid, an XML document must specify which DTD it follows. In this case, it's the XUL DTD. This DTD specifies that a XUL document talks about "windows", and that windows can contain "menubars", and menubars contain "menus", etc. The software that reads the XML document can then validate it against the specified DTD. If it finds, for example, that a menubar contains a window, or that a "hippopotamus" tag shows up anywhere, then the document is not a valid XUL document (even though it might be well-formed).

DTD Elements

SilverStream has provided DTDs which specify the XML input for each SilverCmd. Here's the "itemlist.dtd," which is the DTD for several SilverCmds, including Delete, Build, and Publish. 

<!-- This is the itemlist DTD -->

<!ELEMENT obj_ItemList (Items)>

<!ELEMENT Items (el*)>

<!ATTLIST Items type (StringArray) "StringArray">

<!ELEMENT el (#PCDATA)> 

To help you follow along, here's a sample XML file that conforms to this DTD: 

<!-- This is a sample itemlist document -->

<obj_ItemList>

   <Items type="StringArray">

      <el>Forms/test1</el>

      <el>Objects/com/myco/pkg1/MyObject</el>

   </Items>

</obj_ItemList> 

First, notice that comments are the same in DTDs as in XML documents. (That's about all that's the same.) 

Element types are declared in the DTD by the <!ELEMENT> tags. The first line declares that there is an element type named "obj_ItemList," which must contain exactly one "Items" tag, and nothing else. In turn, "Items" may contain zero or more "el" tags, and nothing else.

By combining symbols (* + , | ?) a DTD can express many different types of structures: 

A*       The item 'A' may occur zero or more times - it's optional, and can appear more than once.

A+      The item 'A' may occur one or more times - it's required, and can appear more than once.

A?       The item 'A' may occur zero or one times - it's optional, and can't appear more than once.

A        The item 'A' must occur exactly once - it's required, and can't appear more than once. 

A, B    The items 'A' and 'B' must occur in the order specified.

A | B    Either item 'A' or 'B' may occur at this position. 

Here are some other sample element declarations, to show you all the possibilities: 

<!ELEMENT A (B, C, D?)>     

Element A must contain a B, followed by a C, optionally followed by a D. That is, either BCD or BC. 

<!ELEMENT E (F*, G+, H)>      

Element E must contain zero or more Fs, followed by one or more Gs, followed by exactly one H. For example, FFFFGGGGH, GH, or FGH. However, FH is not valid. 

<!ELEMENT I (J | K | (L, M?))> 

Element I can contain either a J, or a K, or an L optionally followed by an M. That is, either J, K, L, or LM. However, JK is not valid. 

Jumping down to the last line of our example, it specifies that the "el" element contains text data ("parsed character data"). Instead of containing other elements, the "el" element contains text (such as "Forms/test1").
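
If you know regular expressions, the symbols in element declarations will look familiar. As a rough sketch (in Python, and not how a real validating parser works), the three sample declarations for A, E, and I can be written as regular expressions over one-letter child names. The function name children_allowed is made up, and the trick only works because every element name in these samples is a single letter.

```python
import re

# Each sample content model, rewritten as a regular expression.
CONTENT_MODELS = {
    "A": "BCD?",      # (B, C, D?)
    "E": "F*G+H",     # (F*, G+, H)
    "I": "J|K|LM?",   # (J | K | (L, M?))
}

def children_allowed(parent, children):
    """Check a string of child names, such as "FGH", against the parent's model."""
    return re.fullmatch(CONTENT_MODELS[parent], children) is not None
```

For example, children_allowed("E", "FGH") is true, while children_allowed("E", "FH") and children_allowed("I", "JK") are false.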

DTD Attributes

The other type of declaration in our example is the <!ATTLIST> declaration, which describes the attributes that an element can have. Here's the declaration again: 

<!ATTLIST Items type (StringArray) "StringArray"> 

This declares an attribute named "type," which applies to the element named "Items." It says that the value of this attribute can only be "StringArray," and furthermore that the default value is also "StringArray." (In other words, the "type" of an "Items" element is always "StringArray".) 

Here's how ATTLIST works: 

<!ATTLIST [element-name]

          [attribute-name]

          [CDATA or enumeration]

          [default value or #REQUIRED or #IMPLIED]> 

The name of the element that the attribute belongs to is listed first, followed by the name of the attribute we're describing. The next parameter describes the allowed values of the attribute: this can be either the keyword CDATA, which means that any text string is allowed, or an enumeration of allowed values, like (apple | orange | pear). The final parameter can be one of three things: a default value (for example "pear"); the keyword #REQUIRED, which means that there's no default value, and that the XML file must provide a value for this attribute; or the keyword #IMPLIED, which means that there's no default value, and it's OK if the XML file doesn't provide a value for this attribute. (The word "implied" is kind of confusing here; perhaps "optional" would have been clearer.)

Here are some examples, with their meanings: 

<!ATTLIST Items type CDATA "Ohio">  

The element named "Items" contains an attribute called "type". The attribute's value can be any text ("character data"), and it has a default value of "Ohio." (If the attribute is omitted in the XML document, the value of "Ohio" is assumed.) 

<!ATTLIST Items type (red | green | blue) #REQUIRED>

The element named "Items" contains an attribute called "type". The attribute's value can be either "red", "green", or "blue", and it is required. (Because this attribute is required, there is no default value. The attribute must always be specified and given a value in the XML document.) 

<!ATTLIST Items type (table | chair | lamp) #IMPLIED>

The element named "Items" contains an attribute called "type". The attribute's value can be either "table", "chair", or "lamp". (#IMPLIED specifies that the attribute is not required and there's no default value. If the attribute is omitted in the XML, then the element "Items" simply has no value for attribute "type".) 

Unlike element declarations, attributes can only use the | symbol ("or"). They cannot use any of (+ ? , *). To put it another way, an attribute value may either be CDATA (meaning any text), or an enumeration of possible values. 

You might wonder, why is it called "ATTLIST"? Because a single ATTLIST declaration can describe more than one attribute of an element. For example, the following two snippets are equivalent: 

<!ATTLIST Item name CDATA #REQUIRED size (small | medium | large) "medium"> 

and 

<!ATTLIST Item name CDATA #REQUIRED>

<!ATTLIST Item size (small | medium | large) "medium">

 Also, note that like XML, DTDs are case-sensitive. "ELEMENT", "ATTLIST", "CDATA", etc. must be all upper-case.

More on DTDs and Validation

As I mentioned earlier, XML is designed so it can be used with or without a DTD. But this can get a bit confusing. An XML document must have a DTD if you want to validate it. But a DTD can also change the meaning (value) of the XML document, with or without validation.  

For example, suppose you define an attribute "size" with the default value "medium." If an element in the XML document doesn't contain this attribute explicitly, it still contains the implied value "medium" - as long as the DTD is present. If the DTD is not present, then the element has no value at all for this attribute.  
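
Here is a minimal sketch of that effect, assuming Python's standard ElementTree parser. One caveat: that parser does not read separate DTD files by default, so this example puts the declarations directly inside the DOCTYPE line (an "internal subset", which this article doesn't otherwise cover). The element "Item" and its content are made up.

```python
import xml.etree.ElementTree as ET

# The same document twice: once with a DTD that declares a default for
# "size", and once with no DTD at all.
with_dtd = """<!DOCTYPE Item [
  <!ELEMENT Item (#PCDATA)>
  <!ATTLIST Item size (small | medium | large) "medium">
]>
<Item>T-shirt</Item>"""

without_dtd = "<Item>T-shirt</Item>"

size_with_dtd = ET.fromstring(with_dtd).get("size")        # default supplied by the DTD
size_without_dtd = ET.fromstring(without_dtd).get("size")  # None: no value at all
```

Neither document writes the "size" attribute explicitly; only the declaration makes the difference.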

You'll notice this in SilverStream XML documents. For example, as we saw earlier, the itemlist DTD defines an attribute named "type" for the element "Items", and it specifies that the default value (and in fact, the only allowed value) of that attribute is "StringArray". But if you look at delete_sample.xml (which follows the itemlist DTD), you'll notice that it explicitly gives a value of "StringArray" for the "type" attribute. It does this in case the XML file is used without the DTD. 

How does an XML document specify that it uses a DTD? By adding a line near the top of the file (after the <?xml?> section, if there is one), like this: 

<!DOCTYPE obj_ItemList SYSTEM "c:\SilverStream\DTDs\itemlist.dtd"> 

The second part ("obj_ItemList") must match the topmost (root) element in the XML file. Basically, the DOCTYPE declaration is saying, "This XML document contains an element called 'obj_ItemList'. You can find the definition of the 'obj_ItemList' element (and all of the elements it contains) in the following DTD file."  
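
A program can read the DOCTYPE declaration without validating anything. The sketch below assumes Python's standard xml.dom.minidom module, which records the declaration but does not load the DTD file; the relative location "itemlist.dtd" is just a placeholder.

```python
from xml.dom import minidom

doc = minidom.parseString(
    '<!DOCTYPE obj_ItemList SYSTEM "itemlist.dtd">'
    '<obj_ItemList><Items type="StringArray">'
    '<el>Forms/test1</el></Items></obj_ItemList>'
)

doctype_name = doc.doctype.name           # the name given in the DOCTYPE line
dtd_location = doc.doctype.systemId       # where the DTD is said to live
root_name = doc.documentElement.tagName   # the actual topmost element
names_match = doctype_name == root_name
```

Here names_match is true because the DOCTYPE name and the root element are both "obj_ItemList".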

Why do the sample XML files that SilverStream provides in version 3.0 lack a DOCTYPE definition? There are a few reasons, but the main reason is to cut down on confusion over the location of the DTD. The DTD location can be either an absolute or a relative path to the DTD file. We can't put an absolute path in the sample XML files, because we don't know where you're going to install SilverStream. We could put a relative path (just "itemlist.dtd"), which would work fine for the examples, since they're in the same directory as the DTD. But as soon as someone modified a sample for their own use, and put it in a new location, it would fail. (We also could have used a URL to a location on our Web site, instead of a filename, but then you'd have to be connected to the Internet to use the XML file.) 

So we decided to leave the DOCTYPE out, and simply add a comment to each sample XML file that says which DTD it uses. This means that the XML document will not be validated when it's edited or used, unless you add a DOCTYPE statement to it. This is usually fine when an XML file is used (parsed), but if you use an XML-aware editor to edit your files (instead of a plain text editor), you might want to add the DOCTYPE statement. That way, the XML editor can check to make sure that the XML you're writing is valid, according to that DTD. 

Note that, even if the XML document has a DTD, the software that reads it doesn't necessarily have to parse it. It's up to the particular program that's reading the XML. If it tries to validate the XML, and the XML doesn't specify a DTD, it will cause an error. (SilverCmd doesn't perform XML validation on its inputs, although of course it will give you an error if some required piece of information is missing.)

exteNd XML MetaData

The input to a SilverCmd is an XML file. This is an XML representation of SilverStream's internal "MetaData" format. Because it must be converted from XML to SilverStream's MetaData representation, there are a few quirks that I'll go over here. 

Some XML elements describe MetaData "objects" and some describe MetaData "properties". Objects can contain properties and other objects. Properties can only contain data values (or arrays of data values). 

This is a little confusing. MetaData objects are similar to XML elements, and MetaData properties are similar to XML attributes, yet both are represented in XML as elements. To distinguish the two, an XML element uses the prefix "obj_" to indicate a MetaData object. All other elements are MetaData properties. 

MetaData properties have data types: String, Integer, Boolean, Double, Color, StringArray, IntegerArray, or ByteArray. The type of a property is indicated by the XML attribute "type". If no "type" attribute is specified, the type is assumed to be String. For example: 

   <Filename>c:\temp\xxx.txt</Filename>

   <Version type="Integer">3</Version>

   <Cancelable type="Boolean">true</Cancelable>

   <BackgroundColor type="Color">FF0000</BackgroundColor> 

Boolean values are represented by using the strings "true" or "false". Color values are represented by 6-character hex strings, as in HTML (e.g. "FF0000" or "808080"). Array values are represented by child elements named "el", each of which represents a single array element. For example: 

   <Items type="StringArray">

      <el>Forms/frmMain</el>

      <el>Views/</el>

      <el>Objects/com/myco/mypkg/MyObj</el>

      <el>Pages/myPage</el>

      <el>Media/Images/test.gif</el>

   </Items> 

This last example also illustrates how to identify items in a database on the SilverStream server. The syntax uses the same hierarchy as seen in SilverStream's Main Designer Window, with each piece separated by a forward-slash. This example refers to the form named "frmMain", the Views directory (i.e. all views in the database), the object "com.myco.mypkg.MyObj", the page named "myPage", and the image named "test.gif".
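
To make the "type" rules concrete, here is a sketch of how a program might turn a property element into a value. It is written in Python purely for illustration and is not SilverStream's actual conversion code. The function name property_value is made up, and ByteArray is left out because its text format isn't described here.

```python
import xml.etree.ElementTree as ET

def property_value(elem):
    """Convert a MetaData property element to a Python value, using its "type"."""
    kind = elem.get("type", "String")   # no "type" attribute means String
    text = elem.text or ""
    if kind == "String":
        return text
    if kind == "Integer":
        return int(text)
    if kind == "Double":
        return float(text)
    if kind == "Boolean":
        return text == "true"
    if kind == "Color":
        # 6-character hex string -> (red, green, blue)
        return tuple(int(text[i:i + 2], 16) for i in (0, 2, 4))
    if kind in ("StringArray", "IntegerArray"):
        items = [el.text or "" for el in elem.findall("el")]
        return [int(s) for s in items] if kind == "IntegerArray" else items
    raise ValueError("unhandled type: " + kind)

filename = property_value(ET.fromstring(r'<Filename>c:\temp\xxx.txt</Filename>'))
version = property_value(ET.fromstring('<Version type="Integer">3</Version>'))
cancelable = property_value(ET.fromstring('<Cancelable type="Boolean">true</Cancelable>'))
color = property_value(ET.fromstring('<BackgroundColor type="Color">FF0000</BackgroundColor>'))
items = property_value(ET.fromstring(
    '<Items type="StringArray"><el>Forms/frmMain</el><el>Views/</el></Items>'))
```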

Namespaces, XSL, XSchema, etc.

These are specifications that you won't need for basic XML usage, but they're good to know a little bit about.

Namespaces

XML Namespaces are intended to prevent element and attribute names from clashing. For example, if an XML document contains a tag named <title>, it might be a book title, a Web page title, a diplomatic title, etc. Namespaces allow you to prefix the tag with a unique identifier, specifying which tagset it came from. This way, a program can be confident that when it sees a tag in a particular namespace, it knows exactly what that tag means.  

For example, suppose that someone defined an XML syntax for book reviews. This would be the book review namespace. You could then specify in your XML document that the prefix "br" will indicate that the element or attribute name belongs to that namespace. For example,  

<br:title br:isbn="0-553-57240-7">SilverStream for Dummies</br:title> 

There's also a way of specifying that every element in the XML document belongs to the specified namespace, to avoid having to type all of those "br:" prefixes. 
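
Both forms can be sketched like this, assuming Python's standard ElementTree parser. The namespace identifier http://example.com/bookreview is made up for this example; a real book review language would define its own. The xmlns:br attribute is what ties the "br" prefix to the namespace, and the plain xmlns attribute sets the namespace for every unprefixed element.

```python
import xml.etree.ElementTree as ET

# A made-up namespace identifier for the hypothetical book review language.
BOOK_REVIEW_NS = "http://example.com/bookreview"

prefixed = ET.fromstring(
    '<br:title xmlns:br="http://example.com/bookreview" '
    'br:isbn="0-553-57240-7">SilverStream for Dummies</br:title>'
)
# The same element, using a default namespace instead of a prefix.
defaulted = ET.fromstring(
    '<title xmlns="http://example.com/bookreview">SilverStream for Dummies</title>'
)

prefixed_tag = prefixed.tag   # ElementTree writes names as "{namespace}local-name"
isbn = prefixed.get("{" + BOOK_REVIEW_NS + "}isbn")
same_name = prefixed.tag == defaulted.tag
```

The parser reports the same element name in both cases, because the prefix itself is only shorthand for the namespace.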

The XML Namespaces specification was completed in January 1999.

XSL

Plain XML is great for machine-to-machine communication. For example, in B-to-B commerce, one computer might ask another to place an order for 5000 pencils. But what if you want to display XML data to a human?  

HTML specifies how to display data to people - it combines the data with the presentation specification. (Or, if you use style sheets, you can put the data in the HTML and the presentation in a CSS style sheet.) XML, by contrast, contains only data. The information about how to display XML data can be contained in a separate XSL document. (XSL stands for eXtensible Stylesheet Language, just as XML stands for eXtensible Markup Language.)  

If you try to look at an XML file in an older browser, you'll just see the XML text (like <FirstName>Alex</FirstName>.) But Internet Explorer 5.0 understands XML and XSL. So you can write an XSL document that describes how to display the XML data to the user. Both the XML file and the XSL file get sent down to the browser, which uses them to determine how to display the data to the user. Or, for browsers that don't understand XML/XSL, you can use the same XSL document on the server to transform the XML into HTML, and send the HTML to the browser. 

The XSL specification is mostly finished (as of Feb 2000).

XML Schemas

XML Schemas (or XSchema) is a replacement for DTDs. DTDs are great, but they have at least three major short-comings, which are addressed in XML Schemas: 

DTDs have no concept of data types. All data is just text strings. Suppose you had an XML document like: 

<Reservation>

   <Meal>Dinner</Meal>

   <Time>12/9/2000 7:30PM</Time>

   <NumberOfPeople>2</NumberOfPeople>

</Reservation> 

You can create a DTD that specifies that a Reservation contains a Meal, Time, and NumberOfPeople. But you'd also like to specify that Meal must either be "Lunch" or "Dinner", that Time must be a date/time value in a particular format, and that NumberOfPeople must be a positive integer. But you can't. The data types in XSchema let you do all of these things. 

DTDs have a strange syntax. To learn XML, you have to learn both the XML syntax and the DTD syntax, and XML parsers must be able to read both syntaxes. XSchema solves this problem by - you guessed it - using XML syntax. This can be quite confusing at first, but once you get your head around it, it makes things a lot easier. Here's a simple schema - notice how easy it is to understand, once you know XML: 

<element name="ShoppingCart" type="ShoppingCartType"/>

<type name="ShoppingCartType">

   <element name="Item" type="ItemType" minOccurs="0" maxOccurs="*"/>

</type>

<type name="ItemType">

   <element name="ProductName" type="string"/>

   <element name="Quantity" type="integer"/>

</type>  

This describes an element named "ShoppingCart", which (via the ShoppingCartType) contains zero or more "Items", which in turn contain (via the ItemType) a "ProductName" string and a "Quantity" integer. 

DTDs don't allow much flexibility in defining structures. XML Schemas allow you to define your own datatypes; for example, you could define a "MailingAddress" type that contains name, address, city, state, and zip code, and then use it as the type for both the "HomeAddress" and "WorkAddress" tags. There's even a form of inheritance, so that one type can "refine" another. 

The XML Schemas specification is still under construction (as of Feb 2000).

appendix

Other XML features

This article attempted to describe the 20% of XML that is all you'll need 80% of the time. Some other areas of XML may be useful, if you work with XML a lot. These are the features that are useful the other 19% of the time. (The stuff in the last 1%, like NOTATIONS, IDREFS, etc, cannot be easily understood by mere mortals.) 

-       Internal entities and parameter entities, which provide a macro facility. (There are also other types of entities, which seem to be completely unrelated.)

-       Character references, which allow you to write characters by specifying their Unicode value, like this: &#x571F; (the character used in Japanese to abbreviate "Saturday").

-       CDATA sections, which allow you to paste in lots of literal text, without having to worry about escaping & or < (such as in Java code or mathematical equations).
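
Here is a brief taste of all three features, sketched with Python's standard ElementTree parser. The element names and the entity name "co" are made up for this example.

```python
import xml.etree.ElementTree as ET

# Internal entity: "&co;" works like a macro that expands to "SilverStream".
macro = ET.fromstring(
    '<!DOCTYPE Note [<!ENTITY co "SilverStream">]>'
    '<Note>&co; rocks</Note>'
).text

# Character reference: one character written by its Unicode value.
char_ref = ET.fromstring("<Day>&#x571F;</Day>").text

# CDATA section: literal text, with no need to escape & or <.
cdata = ET.fromstring(
    "<Code><![CDATA[if (a < b && c > d) { run(); }]]></Code>"
).text
```

In each case the program reading the XML sees only the resulting text, not the notation that was used to write it.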

Technicalities

Here are some technicalities that I glossed over in the main article. They're very minor, but I should probably mention them here, just so nobody can accuse me of giving you wrong information: 

-       You can't use a double-hyphen in a comment. For example:

   <!-- This is an illegal comment -- it contains a double-hyphen. -->

-       In addition to &lt; and &amp;, you must also replace > with &gt; if it appears in the sequence ]]>. You must say ]]&gt; instead. (The sequence ]]> is used in CDATA sections, which I didn't get into here.)

-       There's one other keyword that can go in ATTLIST declarations: the keyword #FIXED before a default value specifies that the attribute's value MUST be equal to the default value.

Resources

Here are some good URLs to help you learn more about XML: 

-       The XML section of the W3C, the standards body that created XML. It has all the official specs for XML, XSL, Namespaces, Schemas, and other related standards. 

-       The Annotated XML Spec. This is the XML spec, plus comments from the editor of the spec, which gives you some prayer of understanding what it's saying. 

-       Here's a great introduction to XML, which is clear and easy to read, but it does go into some of the more esoteric features of XML that I've tried to avoid. 

-       Other sites: xml.org, xml.com, xml.about.com, Robin Cover's XML page, an XML FAQ.



[1] For those who aren't familiar with HTML, let me go through this in a little more detail. Tags are the things in angle brackets, like this start-tag: <mytag> and this end-tag: </mytag>. In XML, most tags define "elements". The start-tag indicates the beginning of an element, and the end-tag signals its end.

For example, at the top-most level of this example is an element called "EmployeesList". Inside this "EmployeesList" element is a child element called "Employee". (Normally there would be more than one of these inside the "EmployeesList" element, since most companies have more than one employee, but I wanted to keep this example short.) Inside this "Employee" element are 3 child elements: "FirstName", "LastName", and "Salary". Inside these elements is the actual content (data) of the document. For example, the content of the "Salary" element is the text "5.15".