Chapter: XML and Web Services : Essentials of XML : Validating XML with the Document Type Definition (DTD)

Structure of a Document Type Definition and Declaration

The Document Type Declaration

Structure of a Document Type Definition

The structure of a DTD consists of a Document Type Declaration, elements, attributes, entities, and several other minor keywords. We will take a look at each of these topics, in that order. As we progress from topic to topic, we will follow a mini case study about the use of XML to store employee records by the Human Resources department of a ficti-tious company.

Our coverage of the DTD structure shall begin with the Document Type Declaration.

The Document Type Declaration

In order to reference a DTD from an XML document, a Document Type Declaration must be included in the XML document. Listings 3.1, 3.2, and 3.3 gave some examples and brief explanations of using a Document Type Declaration to reference a DTD. There may be one Document Type Declaration per XML document. The syntax is as follows:

<!DOCTYPE rootelement SYSTEM | PUBLIC DTDlocation [ internalDTDelements ] >

The exclamation mark (!) is used to signify the beginning of the declaration.

DOCTYPE is the keyword used to denote this as a Document Type Definition.

rootelement is the name of the root element or document element of the XML document.

SYSTEM and PUBLIC are keywords used to designate that the DTD is contained in an external document. Although the use of these keywords is optional, to reference an external DTD you would have to use one or the other. The SYSTEM keyword is used in tandem with a URL to locate the DTD. The PUBLIC keyword specifies some public location that will usually be some application-specific resource reference.

internalDTDelements are internal DTD declarations. These declarations will always be placed within opening ([) and closing (]) brackets.

It is possible for a Document Type Declaration to contain both an external DTD subset and an internal DTD subset. In this situation, the internal declarations take precedence over the external ones. In other words, if both the external and internal DTDs define a rule for the same element, the rule of the internal element will be the one used. Consider the Document Type Declaration fragment shown in Listing 3.4.

LISTING 3.4 Internal and External DTDs

<!DOCTYPE rootelement SYSTEM “http://www.myserver.com/mydtd.dtd”

[

<!ELEMENT element1 (element2,element3)> <!ELEMENT element2 (#PCDATA)>

<!ELEMENT element3 (#PCDATA)> ]>

Here in Listing 3.4, we see that the Document Type Declaration references an external DTD. There is also an internal subset of the DTD contained in the Document Type Declaration. Any rules in the external DTD that apply to elements defined in the internal DTD will be overridden by the rules of the internal DTD.

Now that you have seen how to reference a DTD from an XML document, we will begin our coverage of the items that make up the declarations in DTDs.

DTD Elements

All elements in a valid XML document are defined with an element declaration in the DTD. An element declaration defines the name and all allowed contents of an element. Element names must start with a letter or an underscore and may contain any combina-tion of letters, numbers, underscores, dashes, and periods. Element names must never start with the string “xml”. Colons should not be used in element names because they are normally used to reference namespaces.

Each element in the DTD should be defined with the following syntax:

<!ELEMENT elementname rule >

ELEMENT is the tag name that specifies that this is an element definition.

elementname is the name of the element.

rule is the definition to which the element’s data content must conform.

In a DTD, the elements are processed from the top down. A validating XML parser will expect the order of the appearance of elements in the XML document to match the order of elements defined in the DTD. Therefore, elements in a DTD should appear in the order you want them to appear in an XML document. If the elements in an XML docu-ment do not match the order of the DTD, the XML document will not be considered valid by a validating parser.

Listing 3.5 demonstrates a DTD, contactlist.dtd, that defines the ordering of elements for referencing XML documents.

LISTING 3.5 contactlist.dtd

<!ELEMENT contactlist (fullname, address, phone, email) >

<!ELEMENT fullname (#PCDATA)>

<!ELEMENT address (addressline1, addressline2)>

<!ELEMENT addressline1 (#PCDATA)>

<!ELEMENT addressline2 (#PCDATA)>

<!ELEMENT phone (#PCDATA)>

<!ELEMENT email (#PCDATA)>

The first element in the DTD, contactlist, is the document element. The rule for this element is that it contains (is the parent element of) the fullname, address, phone, and email elements. The rule for the fullname element, the phone element, and the email element is that each contains parsed character data (#PCDATA). This means that the ele-ments will contain marked-up character data that the XML parser will interpret. The address element has two child elements: addressline1 and addressline2. These two children elements contain #PCDATA. This DTD defines an XML structure that is nested two levels deep. The root element, contactlist, has four child elements. The address element is, in turn, parent to two more elements. In order for an XML document that ref-erences this DTD to be valid, it must be laid out in the same order, and it must have the same depth of nesting.

The XML document in Listing 3.6 is a valid document because it follows the rules laid out in Listing 3.5 for contactlist.dtd.

<?xml version=”1.0”?>

<!DOCTYPE contactlist SYSTEM “contactlist.dtd”>

<fullname>Bobby Soninlaw</fullname>

<addressline1>101 South Street</addressline1>

<addressline2>Apartment #2</addressline2> </address>

</contactlist>

The second line of this XML document is the Document Type Declaration that refer-ences contactlist.dtd. This is a valid XML document because it is well formed and complies with the structural definition laid out in the DTD.

The element rules govern the types of data that may appear in an element.

DTD Element Rules

All data contained in an element must follow a set rule. As stated previously, the rule is the definition to which the element’s data content must conform. There are two basic types of rules that elements must fall into. The first type of rule deals with content. The second type of rule deals with structure. First, we will look at element rules that deal with content.

Content Rules

The content rules for .elements deal with the actual data that defined elements may con-tain. These rules include the ANY rule, the EMPTY rule, and the #PCDATA rule.

The ANY Rule

An element may be defined. using the ANY rule. This rule is just what it sounds like: The element may contain other elements and/or normal character data (just about anything as long as it is well formed). An element using the ANY rule would appear as follows:

<!ELEMENT elementname ANY>

The drawback to this rule is that it is so wide open that it defeats the purpose of valida-tion. A DTD that defines all its elements using the ANY rule will always be valid as long as the XML is well formed. This really precludes any effective validation. The XML fragments as shown in Listing 3.7 are all valid given the definition of elementname.

LISTING 3.7 XML Fragments Using the ANY Rule

This is valid content </elementname>

This is more valid content </anotherelement>

This is still valid content </elementname>

This is still valid content! </yetanotherelement>

Here is more valid content </elementname>

You should see from this listing why it is not always a great idea to use the ANY rule. All three fragments containing the element elementname are valid. There is, in effect, no val-idation for this element. Use of the ANY rule should probably be limited to instances where the XML data will be freeform text or other types of data that will be highly variable and have difficulty conforming to a set structure.

The EMPTY Rule

This rule is the exact opposite of the ANY rule. An element that is defined with this rule will contain no data. However, an element with the EMPTY rule could still contain attrib-utes (more on attributes in a bit). The following element is an example of the EMPTY rule:

<!ELEMENT elementname EMPTY>

This concept is seen a lot in HTML. There are many tags such as the break tag (<br />) and the paragraph tag (<p />) that follow this rule. Neither one of these tags contains any data, but both are very important in HTML documents. The best example of an empty tag used in HTML is the image tag (<img>). Even though the image tag does not contain any data, it does have attributes that describe the location and display of an image for a Web browser.

In XML, the EMPTY rule might be used to define empty elements that contain diagnostic information for the processing of data. Empty elements could also be created to hold metadata describing the contents of the XML document for indexing purposes. Empty elements could even be used to provide clues for applications that will render the data for viewing (such as an empty “gender” tag, which designates an XML record as “male” or “female”—male records could be rendered in blue, and female records could be rendered in pink) .

The #PCDATA Rule

The #PCDATA rule indicates that parsed character data will be contained in the element. Parsed character data is data that may contain normal markup and will be interpreted and parsed by any XML parser accessing the document. The following element demonstrates the #PCDATA rule:

<!ELEMENT elementname (#PCDATA)>

An element in an XML document that adheres to the #PCDATA rule might appear as follows:

<data>

This is some parsed character data </data>

It is possible in an element using the #PCDATA rule to use the CDATA keyword to prevent the character data from being parsed. You can see an example of this in Listing 3.8.

LISTING 3.8 CDATA

<data>

<![CDATA[<tag>This will not be parsed</tag>]]> </data>

</sample>

All the data between <![CDATA[ and ]]> will be ignored by the parser and treated as nor-mal characters (markup ignored).

Structure Rules

Whereas the content rules. deal with the actual content of the data contained in defined elements, structure rules deal with how that data may be organized. There are two types of structure rules we will look at here. The first is the “element only” rule. The second rule is the “mixed” rule.

The “Element Only” Rule

The “element only” rule .specifies that only elements may appear as children of the cur-rent element. The child element sequences should be separated by commas and listed in the order they should appear. If there are to be options for which elements will appear, the listed elements should be separated by the pipe symbol (|). The following element definition demonstrates the “element only” rule:

<!ELEMENT elementname (element1, element2, element3)>

You can see here that a list of elements are expected to appear as child elements of ele-mentname when the referencing XML document is parsed. All these child elements must be present and in the specified order. Here is how an element that is listing a series of options will appear:

<!ELEMENT elementname (element1 | element2)>

The element defined here will have a single child element: either element1 or element2.

The “Mixed” Rule

The “mixed” rule is used to help define elements that may have both character data (#PCDATA) and child elements in the data they contain. A list of options or a sequential list will be enclosed by parentheses. Options will be separated by the pipe symbol (|), whereas sequential lists will be separated by commas. The following element is an example of the “mixed” rule:

<!ELEMENT elementname (#PCDATA | childelement1 | childelement2)*>

In this example, the element may contain a mixture of character data and child elements. The pipe symbol is used here to indicate that there is a choice between #PCDATA and each of the child elements. However, the asterisk symbol (*) is added here to indicate that each of the items within the parentheses may appear zero or more times (we will cover the use of element symbols in the next section). This can be useful for describing data sets that have optional values. Consider the following element definition:

<!ELEMENT Son (#PCDATA | Name | Age)*>

This definition defines an element, Son, for which there may be character data, elements, or both. A man might have a son, but he might not. If there is no son, then normal char-acter data (such as “N/A”) could be used to describe this condition. Alternatively, the man might have an adopted son and would like to indicate this. Consider the XML frag-ments shown in Listing 3.9 in relation to the definition for the element Son.

LISTING 3.9 The “Mixed” Rule

<Son>

N/A

</Son>

<Son> Adopted Son

<Name>Bobby</Name>

</Son>

The first fragment contains only character data. The second fragment contains a mixture of character data and the two defined child elements. Both fragments conform to the def-inition and are valid.

Element Symbols

In addition to the normal rules that apply to element definitions, element symbols can be used to control the occurrence of data. Table 3.1 shows the symbols that are available for use in DTDs.

TABLE 3.1 Element Symbols

Element symbols can be added to element definitions for another level of control over the XML documents that are being validated against it. Consider the DTD in Listing 3.10, which makes very limited use of XML symbols.

LISTING 3.10 Limited Use of Symbols

<!ELEMENT contactlist (contact) >

<!ELEMENT contact (name, age, sex, address, city, state, zip, children) >

<!ELEMENT name (#PCDATA) >

<!ELEMENT age (#PCDATA) >

<!ELEMENT sex (#PCDATA) >

<!ELEMENT address (#PCDATA) >

<!ELEMENT city (#PCDATA) >

<!ELEMENT state (#PCDATA) >

<!ELEMENT zip (#PCDATA) >

<!ELEMENT children (child) >

<!ELEMENT child (childname, childage, childsex) >

<!ELEMENT childname (#PCDATA) >

<!ELEMENT childage (#PCDATA) >

<!ELEMENT childsex (#PCDATA) >

You can see in Listing 3.10 that a contact record for a contactlist file is being laid out. It is very straight forward and includes the basic address information you would expect to see in this type of file. Information on the contact’s children is also included. This looks like a well-laid-out, easy-to-use file format. However, there are several problems. What if you are not sure about a contact’s address? What if the contact does not have children? What if the user is a lady and you are afraid to ask her age? The way that this DTD is laid out, it will be very difficult for a referencing XML document to be deemed valid if any of this information is unknown.

Using element symbols, you can create a more flexible DTD that will take into account the possibility that you might not always know all of a contact’s personal information. Take a look at a similar DTD laid out in Listing 3.11.

LISTING 3.11 Broader Use of Symbols

<!ELEMENT contactlist (contact+) >

<!ELEMENT contact (name, age?, sex, address?, city?, state?, zip?, children?) >

<!ELEMENT name (#PCDATA) >

<!ELEMENT age (#PCDATA) >

<!ELEMENT sex (#PCDATA) >

<!ELEMENT address (#PCDATA) >

<!ELEMENT city (#PCDATA) >

<!ELEMENT state (#PCDATA) >

<!ELEMENT zip (#PCDATA) >

<!ELEMENT children (child*) >

<!ELEMENT child (childname, childage?, childsex) >

<!ELEMENT childname (#PCDATA) >

<!ELEMENT childage (#PCDATA) >

<!ELEMENT childsex (#PCDATA) >

Listing 3.11 is much more flexible than Listing 3.10. There is still a single root element, contactlist, which will contain one or more instances (+) of the element contact. Under each contact element is a list of child elements that make up the description of the contact record. It is assumed here that the name and sex of the contact will be known. However, the definition indicates that there will be zero or one occurrence (?) of the age, address, city, state, zip, and children elements. These elements are set for zero or one occurrence because the definition is taking into account that this information might not be known. Looking further down the listing, you see that the children element is marked to have zero or more instances (*) of the child element. This is because a person might have no children or many children (or we might not know how many children the person has). Under the child element, it is assumed that childname and childsex infor-mation will be known (if there is at least one child element). However, the childage element is marked as zero or one (?), just in case it is unknown how old the child is.

You can easily see how Listing 3.11 is more flexible than Listing 3.10. Listing 3.11 takes into account that much of the contact data could be missing or unknown. An XML docu-ment being validated against the DTD in Listing 3.10 could still be validated and accepted by a validating parser even though it might not have all the contact’s personal data. However, an XML document being validated against the DTD in Listing 3.10 would be rejected as invalid if it did not include the children element.

Now that you have seen how DTDs define element declarations, let’s take a look at how attributes are used in a mini case study.

Zippy Human Resources: XML for Employee Records, Part I

Now that you have seen how elements are defined in a DTD, you have enough tools to follow along with a mini case study that shows how a company could use XML in its Human Resources department.

The Human Resources department for a small but growing company, Zippy Delivery Service, has decided that in order to make their employee data flexible across all the applications used by the company, the employee data should be stored in XML. The Zippy Human Resources department’s first task is to decide on the fields to be included in the XML structure:

Employee Name

Position

Age

Sex

Race

Marital Status

Address Line 1

Address Line 2

City

State

Zip Code

Phone Number

E-Mail Address

After determining which elements are needed, they decide to put together a DTD in order to ensure that the structure of the employee records in the XML data file never changes. Additionally, the decision is made that multiple employee records should be stored in a single file. Because this is the case, they need to declare a document (root) element to hold employee records and a par-ent element for the elements making up each individual employee record. The Human Resources department also realizes that some of the data might not be applicable to all employees. Therefore, they need to use element symbols to account for varying occurrences of data. They’ve come up with the following DTD structure as the first draft:

Employees1.dtd

<!ELEMENT employees (employee+) >

<!ELEMENT employee (name, position, age, sex, race, m_status, address1, address2?, city, state, zip, phone?, email?) >

<!ELEMENT name (#PCDATA) >

<!ELEMENT position (#PCDATA) >

<!ELEMENT age (#PCDATA) >

<!ELEMENT sex (#PCDATA) >

<!ELEMENT race (#PCDATA) >

<!ELEMENT m_status (#PCDATA) >

<!ELEMENT address1 (#PCDATA) >

<!ELEMENT address2 (#PCDATA) >

<!ELEMENT city (#PCDATA) >

<!ELEMENT state (#PCDATA) >

<!ELEMENT zip (#PCDATA) >

<!ELEMENT phone (#PCDATA) >

<!ELEMENT email (#PCDATA) >

The Human Resources department has decided that the document element employees is required to have one or more (+) child elements (employee). The employee element would be the container element for each individual employee’s data. Out of the elements comprising the employee data, the Human Resources department knows that not all employees have a second line to their street address. Also, some employees do not have home telephone numbers or e-mail addresses. Therefore, the elements address2, phone, and email are marked to appear zero or one time in each record (?). The new DTD structure is saved in a file named employees1.dtd (which, by the way, you can download from the Sams Web site).

The first several employee records are then entered into an XML document, called Employees1.xml:

<?xml version=”1.0”?>

<!DOCTYPE employees SYSTEM “employees1.dtd”> <employees>

<name>Bob Jones</name> <position>Dispatcher</position> <age>37</age>

<race>African American</race>

<m_status>Married</m_status>

<address1>202 Carolina St.</address1>

<city>Oklahoma City</city>

<email>bobjones@mail.com</email>

</employee>

<name>Mary Parks</name>

<position>Delivery Person</position>

<sex>Female</sex>

<race>Caucasian</race> <m_status>Single</m_status> <address1>1015 Empire Blvd.</address1>

<address2>Apt. D3</address2> <city>Oklahoma City</city> <state>OK</state>

<email>maryparks@mail.com</email>

</employee>

<name>Jimmy Griffin</name>

<position>Delivery Person</position> <age>23</age>

<race>African American</race>

<m_status>Single</m_status> <address1>1720 Maple St.</address1>

<city>Oklahoma City</city> <state>OK</state>

</employee>

</employees>

The XML document Employees1.xml (also available for download from the Sams Web site) initially has three employee records entered into it. The Document Type Declaration is entered after the XML declaration and before the document element, employees, and it uses the SYSTEM keyword to denote that it is refer-encing the DTD, employees1.dtd, externally.

The Human Resources department at Zippy Delivery Service feels that they are off to a good start. They have defined a DTD, employees1.dtd, for their XML data structure and have created an XML document, Employees1.xml (containing three employee records), that is valid according to the DTD. However, you’ll find out during the course of this chapter that the Human Resources department’s DTD can be improved.

Study Material, Lecturing Notes, Assignment, Reference, Wiki description explanation, brief detail

XML and Web Services : Essentials of XML : Validating XML with the Document Type Definition (DTD) : Structure of a Document Type Definition and Declaration |