Structure of a Document Type Definition
The
structure of a DTD consists of a Document Type Declaration, elements,
attributes, entities, and several other minor keywords. We will take a look at
each of these topics, in that order. As we progress from topic to topic, we
will follow a mini case study about the use of XML to store employee records by
the Human Resources department of a ficti-tious company.
Our coverage
of the DTD structure shall begin with the Document Type Declaration.
The Document Type Declaration
In order
to reference a DTD from an XML document, a Document Type Declaration must be
included in the XML document. Listings 3.1, 3.2, and 3.3 gave some examples and
brief explanations of using a Document Type Declaration to reference a DTD.
There may be one Document Type Declaration per XML document. The syntax is as
follows:
<!DOCTYPE rootelement
SYSTEM | PUBLIC
DTDlocation [ internalDTDelements ] >
The exclamation mark (!) is used to signify the beginning of the
declaration.
DOCTYPE is the keyword used to denote this as a Document Type Definition.
rootelement is the name of the root element or document element of the XML document.
SYSTEM and PUBLIC are keywords used to designate that the DTD is contained in an external document. Although
the use of these keywords is optional, to reference an external DTD you would
have to use one or the other. The SYSTEM keyword is used in tandem with a URL to locate
the DTD. The PUBLIC keyword specifies some public location that will usually be some
application-specific resource reference.
internalDTDelements are internal DTD declarations. These
declarations will always be placed within opening ([) and closing (]) brackets.
It is
possible for a Document Type Declaration to contain both an external DTD subset
and an internal DTD subset. In this situation, the internal declarations take
precedence over the external ones. In other words, if both the external and
internal DTDs define a rule for the same element, the rule of the internal
element will be the one used. Consider the Document Type Declaration fragment
shown in Listing 3.4.
LISTING 3.4 Internal
and External DTDs
<!DOCTYPE rootelement
SYSTEM “http://www.myserver.com/mydtd.dtd”
[
<!ELEMENT
element1 (element2,element3)> <!ELEMENT element2 (#PCDATA)>
<!ELEMENT
element3 (#PCDATA)> ]>
Here in
Listing 3.4, we see that the Document Type Declaration references an external
DTD. There is also an internal subset of the DTD contained in the Document Type
Declaration. Any rules in the external DTD that apply to elements defined in
the internal DTD will be overridden by the rules of the internal DTD.
Now that
you have seen how to reference a DTD from an XML document, we will begin our
coverage of the items that make up the declarations in DTDs.
DTD Elements
All
elements in a valid XML document are defined with an element declaration in the
DTD. An element declaration defines the name and all allowed contents of an
element. Element names must start with a letter or an underscore and may
contain any combina-tion of letters, numbers, underscores, dashes, and periods.
Element names must never start with the string “xml”. Colons should not be used
in element names because they are normally used to reference namespaces.
Each
element in the DTD should be defined with the following syntax:
<!ELEMENT elementname
rule >
ELEMENT is the tag name that specifies that this is an element definition.
elementname is the name of the element.
rule is the definition to which the element’s data content must conform.
In a
DTD, the elements are processed from the top down. A validating XML parser will
expect the order of the appearance of elements in the XML document to match the
order of elements defined in the DTD. Therefore, elements in a DTD should
appear in the order you want them to appear in an XML document. If the elements
in an XML docu-ment do not match the order of the DTD, the XML document will
not be considered valid by a validating parser.
Listing
3.5 demonstrates a DTD, contactlist.dtd, that defines the ordering of elements for referencing XML
documents.
LISTING 3.5 contactlist.dtd
<!ELEMENT
contactlist (fullname, address, phone, email) >
<!ELEMENT
fullname (#PCDATA)>
<!ELEMENT
address (addressline1, addressline2)>
<!ELEMENT
addressline1 (#PCDATA)>
<!ELEMENT
addressline2 (#PCDATA)>
<!ELEMENT
phone (#PCDATA)>
<!ELEMENT
email (#PCDATA)>
The
first element in the DTD, contactlist, is the document element. The rule for this element is that it
contains (is the parent element of) the fullname, address, phone, and email elements. The rule for the fullname element, the phone element, and the email element is that each contains
parsed character data (#PCDATA). This means that the ele-ments will contain marked-up character
data that the XML parser will interpret. The address element has two child elements: addressline1 and addressline2. These two children elements contain #PCDATA. This DTD defines an XML
structure that is nested two levels deep. The root element, contactlist, has four child elements.
The address element is, in turn, parent
to two more elements. In order for an XML document that ref-erences this DTD to
be valid, it must be laid out in the same order, and it must have the same
depth of nesting.
The XML
document in Listing 3.6 is a valid document because it follows the rules laid
out in Listing 3.5 for contactlist.dtd.
<?xml version=”1.0”?>
<!DOCTYPE
contactlist SYSTEM “contactlist.dtd”>
<contactlist>
<fullname>Bobby
Soninlaw</fullname>
<address>
<addressline1>101
South Street</addressline1>
<addressline2>Apartment
#2</addressline2> </address>
<phone>(405)
555-1234</phone>
<email>bs@mail.com</email>
</contactlist>
The
element rules govern the types of data that may appear in an element.
DTD Element Rules
All data
contained in an element must follow a set rule. As stated previously, the rule
is the definition to which the element’s data content must conform. There are
two basic types of rules that elements must fall into. The first type of rule
deals with content. The second type of rule deals with structure. First, we
will look at element rules that deal with content.
Content Rules
The
content rules for .elements deal with the actual data that defined elements may
con-tain. These rules include the ANY rule, the EMPTY rule, and the #PCDATA rule.
The ANY
Rule
An
element may be defined. using the ANY rule. This rule is just what it sounds like:
The element may contain other elements and/or normal character data (just about
anything as long as it is well formed). An element using the ANY rule would appear as
follows:
<!ELEMENT elementname
ANY>
The
drawback to this rule is that it is so wide open that it defeats the purpose of
valida-tion. A DTD that defines all its elements using the ANY rule will always be valid as
long as the XML is well formed. This really precludes any effective validation.
The XML fragments as shown in Listing 3.7 are all valid given the definition of
elementname.
LISTING 3.7 XML
Fragments Using the ANY
Rule
<elementname>
This is
valid content </elementname>
<elementname>
<anotherelement>
This is
more valid content </anotherelement>
This is
still valid content </elementname>
<elementname>
<emptyelement /> <yetanotherelement>
This is
still valid content! </yetanotherelement>
Here is
more valid content </elementname>
You
should see from this listing why it is not always a great idea to use the ANY rule. All three fragments
containing the element elementname are valid. There is, in effect, no val-idation for this element.
Use of the ANY rule should probably be limited to instances where the XML data
will be freeform text or other types of data that will be highly variable and
have difficulty conforming to a set structure.
The EMPTY
Rule
This
rule is the exact opposite of the ANY rule. An element that is defined with this
rule will contain no data. However, an element with the EMPTY rule could still contain
attrib-utes (more on attributes in a bit). The following element is an example
of the EMPTY rule:
<!ELEMENT elementname
EMPTY>
This
concept is seen a lot in HTML. There are many tags such as the break tag (<br />) and the paragraph tag (<p />) that follow this rule.
Neither one of these tags contains any data, but both are very important in
HTML documents. The best example of an empty tag used in HTML is the image tag
(<img>). Even though the image tag
does not contain any data, it does have attributes that describe the location
and display of an image for a Web browser.
In XML,
the EMPTY rule might be used to define
empty elements that contain diagnostic information for the processing of data.
Empty elements could also be created to hold metadata describing the contents
of the XML document for indexing purposes. Empty elements could even be used to
provide clues for applications that will render the data for viewing (such as
an empty “gender” tag, which designates an XML record as “male” or
“female”—male records could be rendered in blue, and female records could be
rendered in pink) .
The #PCDATA
Rule
The #PCDATA rule indicates that parsed
character data will be contained in the element. Parsed character data is data
that may contain normal markup and will be interpreted and parsed by any XML
parser accessing the document. The following element demonstrates the #PCDATA rule:
<!ELEMENT elementname
(#PCDATA)>
An
element in an XML document that adheres to the #PCDATA rule might appear as follows:
<data>
This is
some parsed character data </data>
It is possible in an element using the #PCDATA rule to
use the CDATA keyword to prevent the character data from
being parsed. You can see an example of this in Listing 3.8.
LISTING 3.8 CDATA
<sample>
<data>
<![CDATA[<tag>This
will not be parsed</tag>]]> </data>
</sample>
All the
data between <![CDATA[ and ]]> will be ignored by the parser and treated as nor-mal characters
(markup ignored).
Structure Rules
Whereas
the content rules. deal with the actual content of the data contained in
defined elements, structure rules deal with how that data may be organized.
There are two types of structure rules we will look at here. The first is the
“element only” rule. The second rule is the “mixed” rule.
The “Element Only” Rule
The
“element only” rule .specifies that only elements may appear as children of the
cur-rent element. The child element sequences should be separated by commas and
listed in the order they should appear. If there are to be options for which
elements will appear, the listed elements should be separated by the pipe
symbol (|). The following element definition demonstrates the “element only”
rule:
<!ELEMENT elementname
(element1, element2, element3)>
You can
see here that a list of elements are expected to appear as child elements of ele-mentname when the referencing XML
document is parsed. All these child elements must be present and in the specified order. Here is
how an element that is listing a series of options will appear:
<!ELEMENT elementname
(element1 | element2)>
The
element defined here will have a single child element: either element1 or element2.
The “Mixed” Rule
The
“mixed” rule is used to help define elements that may have both character data
(#PCDATA) and child elements in the
data they contain. A list of options or a sequential list will be enclosed by
parentheses. Options will be separated by the pipe symbol (|), whereas sequential lists
will be separated by commas. The following element is an example of the “mixed”
rule:
<!ELEMENT elementname
(#PCDATA | childelement1
| childelement2)*>
In this
example, the element may contain a mixture of character data and child
elements. The pipe symbol is used here to indicate that there is a choice
between #PCDATA and each of the child elements. However, the asterisk symbol (*) is added here to indicate
that each of the items within the parentheses may appear zero or more times (we
will cover the use of element symbols in the next section). This can be useful
for describing data sets that have optional values. Consider the following
element definition:
<!ELEMENT Son
(#PCDATA | Name | Age)*>
This
definition defines an element, Son, for which there may be character data, elements, or both. A man
might have a son, but he might not. If there is no son, then normal char-acter
data (such as “N/A”) could be used to describe this condition. Alternatively,
the man might have an adopted son and would like to indicate this. Consider the
XML frag-ments shown in Listing 3.9 in relation to the definition for the
element Son.
LISTING 3.9 The
“Mixed” Rule
<Son>
N/A
</Son>
<Son>
Adopted Son
<Name>Bobby</Name>
<Age>12</Age>
</Son>
The
first fragment contains only character data. The second fragment contains a
mixture of character data and the two defined child elements. Both fragments
conform to the def-inition and are valid.
Element Symbols
In
addition to the normal rules that apply to element definitions, element symbols
can be used to control the occurrence of data. Table 3.1 shows the symbols that
are available for use in DTDs.
TABLE 3.1 Element
Symbols
Element
symbols can be added to element definitions for another level of control over
the XML documents that are being validated against it. Consider the DTD in
Listing 3.10, which makes very limited use of XML symbols.
LISTING 3.10 Limited
Use of Symbols
<!ELEMENT contactlist
(contact) >
<!ELEMENT
contact (name, age, sex, address, city, state, zip, children) >
<!ELEMENT
name (#PCDATA) >
<!ELEMENT
age (#PCDATA) >
<!ELEMENT
sex (#PCDATA) >
<!ELEMENT
address (#PCDATA) >
<!ELEMENT
city (#PCDATA) >
<!ELEMENT
state (#PCDATA) >
<!ELEMENT
zip (#PCDATA) >
<!ELEMENT
children (child) >
<!ELEMENT
child (childname, childage, childsex) >
<!ELEMENT
childname (#PCDATA) >
<!ELEMENT
childage (#PCDATA) >
<!ELEMENT
childsex (#PCDATA) >
You can
see in Listing 3.10 that a contact record for a contactlist file is being laid
out. It is very straight forward and includes the basic address information you
would expect to see in this type of file. Information on the contact’s children
is also included. This looks like a well-laid-out, easy-to-use file format.
However, there are several problems. What if you are not sure about a contact’s
address? What if the contact does not have children? What if the user is a lady
and you are afraid to ask her age? The way that this DTD is laid out, it will
be very difficult for a referencing XML document to be deemed valid if any of
this information is unknown.
Using
element symbols, you can create a more flexible DTD that will take into account
the possibility that you might not always know all of a contact’s personal
information. Take a look at a similar DTD laid out in Listing 3.11.
LISTING 3.11 Broader
Use of Symbols
<!ELEMENT contactlist
(contact+) >
<!ELEMENT
contact (name, age?, sex, address?, city?, state?, zip?, children?) >
<!ELEMENT
name (#PCDATA) >
<!ELEMENT
age (#PCDATA) >
<!ELEMENT
sex (#PCDATA) >
<!ELEMENT
address (#PCDATA) >
<!ELEMENT
city (#PCDATA) >
<!ELEMENT
state (#PCDATA) >
<!ELEMENT
zip (#PCDATA) >
<!ELEMENT
children (child*) >
<!ELEMENT
child (childname, childage?, childsex) >
<!ELEMENT
childname (#PCDATA) >
<!ELEMENT
childage (#PCDATA) >
<!ELEMENT
childsex (#PCDATA) >
Listing
3.11 is much more flexible than Listing 3.10. There is still a single root
element, contactlist, which will contain one or more instances (+) of the element contact. Under each contact element is a list of child
elements that make up the description of the contact record. It is assumed here
that the name and sex of the contact will be known. However, the definition
indicates that there will be zero or one occurrence (?) of the age, address, city, state, zip, and children elements. These elements are
set for zero or one occurrence because the definition is taking into account that
this information might not be known. Looking further down the listing, you see
that the children element is marked to have zero or more instances (*) of the child element. This is because a
person might have no children or many children (or we might not know how many
children the person has). Under the child element, it is assumed that childname and childsex infor-mation will be known
(if there is at least one child element). However, the childage element is marked as zero or one (?), just in case it is unknown
how old the child is.
You can
easily see how Listing 3.11 is more flexible than Listing 3.10. Listing 3.11
takes into account that much of the contact data could be missing or unknown.
An XML docu-ment being validated against the DTD in Listing 3.10 could still be
validated and accepted by a validating parser even though it might not have all
the contact’s personal data. However, an XML document being validated against
the DTD in Listing 3.10 would be rejected as invalid if it did not include the children element.
Now that
you have seen how DTDs define element declarations, let’s take a look at how
attributes are used in a mini case study.
Zippy Human Resources: XML for Employee
Records, Part I
Now
that you have seen how elements are defined in a DTD, you have enough tools to
follow along with a mini case study that shows how a company could use XML in
its Human Resources department.
The
Human Resources department for a small but growing company, Zippy Delivery
Service, has decided that in order to make their employee data flexible across
all the applications used by the company, the employee data should be stored in
XML. The Zippy Human Resources department’s first task is to decide on the
fields to be included in the XML structure:
Employee
Name
Position
Age
Sex
Race
Marital
Status
Address
Line 1
Address
Line 2
City
State
Zip
Code
Phone
Number
E-Mail
Address
After
determining which elements are needed, they decide to put together a DTD in
order to ensure that the structure of the employee records in the XML data file
never changes. Additionally, the decision is made that multiple employee
records should be stored in a single file. Because this is the case, they need
to declare a document (root) element to hold employee records and a par-ent
element for the elements making up each individual employee record. The Human
Resources department also realizes that some of the data might not be
applicable to all employees. Therefore, they need to use element symbols to
account for varying occurrences of data. They’ve come up with the following DTD
structure as the first draft:
Employees1.dtd
<!ELEMENT employees
(employee+) >
<!ELEMENT
employee (name, position, age, sex, race, m_status, address1, address2?, city,
state, zip, phone?, email?) >
<!ELEMENT
name (#PCDATA) >
<!ELEMENT
position (#PCDATA) >
<!ELEMENT
age (#PCDATA) >
<!ELEMENT
sex (#PCDATA) >
<!ELEMENT
race (#PCDATA) >
<!ELEMENT
m_status (#PCDATA) >
<!ELEMENT
address1 (#PCDATA) >
<!ELEMENT
address2 (#PCDATA) >
<!ELEMENT
city (#PCDATA) >
<!ELEMENT
state (#PCDATA) >
<!ELEMENT
zip (#PCDATA) >
<!ELEMENT
phone (#PCDATA) >
<!ELEMENT
email (#PCDATA) >
The
Human Resources department has decided that the document element employees is
required to have one or more (+)
child elements (employee). The employee element would be the container element for each
individual employee’s
data. Out of the elements comprising the employee data, the Human Resources
department knows that not all employees have a second line to their street address.
Also, some employees do not have home telephone numbers or e-mail addresses.
Therefore, the elements address2, phone, and email are marked to appear zero or one time in each
record (?).
The new DTD structure
is saved in a file named employees1.dtd (which, by the way, you can download from the
Sams Web site).
The
first several employee records are then entered into an XML document, called Employees1.xml:
<?xml version=”1.0”?>
<!DOCTYPE
employees SYSTEM “employees1.dtd”> <employees>
<employee>
<name>Bob
Jones</name> <position>Dispatcher</position>
<age>37</age>
<sex>Male</sex>
<race>African
American</race>
<m_status>Married</m_status>
<address1>202
Carolina St.</address1>
<city>Oklahoma
City</city>
<state>OK</state>
<zip>73114</zip>
<phone>4055554321</phone>
<email>bobjones@mail.com</email>
</employee>
<employee>
<name>Mary
Parks</name>
<position>Delivery
Person</position>
<age>19</age>
<sex>Female</sex>
<race>Caucasian</race>
<m_status>Single</m_status> <address1>1015 Empire
Blvd.</address1>
<address2>Apt.
D3</address2> <city>Oklahoma City</city>
<state>OK</state>
<zip>73107</zip>
<phone>4055559876</phone>
<email>maryparks@mail.com</email>
</employee>
<employee>
<name>Jimmy
Griffin</name>
<position>Delivery
Person</position> <age>23</age>
<sex>Male</sex>
<race>African
American</race>
<m_status>Single</m_status>
<address1>1720 Maple St.</address1>
<city>Oklahoma
City</city> <state>OK</state>
<zip>73107</zip>
<phone>4055556633</phone>
</employee>
</employees>
The
XML document Employees1.xml (also available for download from the Sams Web
site) initially has three employee records entered into it. The Document Type
Declaration is entered after the XML declaration and before the document
element, employees, and it uses the SYSTEM keyword
to denote that it is refer-encing the DTD, employees1.dtd, externally.
The
Human Resources department at Zippy Delivery Service feels that they are off to
a good start. They have defined a DTD, employees1.dtd, for their XML data structure and have created
an XML document, Employees1.xml (containing three employee records), that is
valid according to the DTD. However, you’ll find out during the course of this
chapter that the Human Resources department’s DTD can be improved.
Related Topics
Privacy Policy, Terms and Conditions, DMCA Policy and Compliant
Copyright © 2018-2023 BrainKart.com; All Rights Reserved. Developed by Therithal info, Chennai.