XML Documents, DTD, and XML Schema
1. Well-Formed and Valid XML Documents and XML DTD
In Figure 12.3, we saw what a simple XML document may look like. An XML
document is well formed if it follows a few conditions. In particular, it
should start with an XML declaration to indicate the version of XML being used
as well as any other relevant attributes, as shown in the first line in
Figure 12.3. It must also follow the syntactic guidelines of the tree data
model. This means that there should be a single root element, and every
element must include a matching pair of start and end tags within the start
and end tags of the parent element. This ensures that the nested elements
specify a well-formed tree structure.
A well-formed XML document is syntactically correct. This allows it to
be processed by generic processors that traverse the document and create an
internal tree representation. A standard model with an associated set of API
(application programming interface) functions called DOM (Document Object Model) allows programs to manipulate the
resulting tree representation corresponding to a well-formed XML document.
However, the whole document must be parsed beforehand when using DOM in order
to convert the document to that standard DOM internal data structure
representation. Another API called SAX
(Simple API for XML) allows processing of XML documents on the fly by notifying
the processing program through callbacks whenever a start or end tag is
encountered. This makes it easier to process large documents and allows for
processing of so-called streaming XML
documents, where the processing
program can process the tags as they are encountered. This is also known as event-based processing.
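The contrast between the two APIs can be sketched with the DOM and SAX modules in the Python standard library. This is a minimal illustration, not a recommended design: the sample document, the project names in it, and the NameCollector class are assumptions made up for the example. The DOM half builds the whole tree before any query runs; the SAX half only receives callbacks as tags and text are encountered.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical sample document (not Figure 12.3 itself).
DOC = """<?xml version="1.0"?>
<Projects>
  <Project><Name>ProductX</Name><Number>1</Number></Project>
  <Project><Name>ProductY</Name><Number>2</Number></Project>
</Projects>"""

# DOM: the whole document is parsed into a tree first, then navigated.
dom = minidom.parseString(DOC)
dom_names = [node.firstChild.data for node in dom.getElementsByTagName("Name")]


class NameCollector(xml.sax.ContentHandler):
    """SAX: callbacks fire as each start tag, end tag, or text run is read."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._in_name = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "Name":
            self._in_name = True
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so it is buffered until the end tag.
        if self._in_name:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "Name":
            self.names.append("".join(self._buffer))
            self._in_name = False


handler = NameCollector()
xml.sax.parseString(DOC.encode("utf-8"), handler)
sax_names = handler.names

print(dom_names, sax_names)
```

Both approaches collect the same names here; the difference is that the SAX handler never holds more than the current Name text in memory, which is what makes event-based processing suitable for large or streaming documents.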
A well-formed XML document can be schemaless; that is, it can have any
tag names for the elements within the document. In this case, there is no
predefined set of elements (tag names) that a program processing the document
knows to expect. This gives the document creator the freedom to specify new
elements, but limits the possibilities for automatically interpreting the
meaning or semantics of the elements within the document.
A stronger criterion is for an XML document to be valid. In this case, the document must be well formed, and it must
follow a particular schema. That is, the element names used in the start and
end tag pairs must follow the structure specified in a separate XML DTD (Document Type Definition) file or XML schema file. We first discuss
XML DTD here, and then we give an overview of XML schema in Section 12.3.2.
Figure 12.4 shows a simple XML DTD file, which specifies the elements (tag
names) and their nested structures. Any valid documents conforming to this DTD
should follow the specified structure. A special syntax exists for specifying
DTD files, as illustrated in Figure 12.4. First, a name is given to the root tag of the document, which is
called Projects in the first line in Figure 12.4. Then the elements and their nested
structure are specified.
<!DOCTYPE Projects [
    <!ELEMENT Projects (Project+)>
    <!ELEMENT Project (Name, Number, Location, Dept_no?, Workers)>
    <!ATTLIST Project
        ProjId ID #REQUIRED>
    <!ELEMENT Name (#PCDATA)>
    <!ELEMENT Number (#PCDATA)>
    <!ELEMENT Location (#PCDATA)>
    <!ELEMENT Dept_no (#PCDATA)>
    <!ELEMENT Workers (Worker*)>
    <!ELEMENT Worker (Ssn, Last_name?, First_name?, Hours)>
    <!ELEMENT Ssn (#PCDATA)>
    <!ELEMENT Last_name (#PCDATA)>
    <!ELEMENT First_name (#PCDATA)>
    <!ELEMENT Hours (#PCDATA)>
] >
Figure 12.4 An XML DTD file called Projects
When specifying elements, the following notation is used:
A * following the element name means that the element can be repeated zero or
more times in the document. This kind of element is known as an optional
multivalued (repeating) element.
A + following the element name means that the element can be repeated one or
more times in the document. This kind of element is a required multivalued
(repeating) element.
A ? following the element name means that the element can be repeated zero or
one times. This kind is an optional single-valued (nonrepeating) element.
An element appearing without any of the preceding three symbols must appear
exactly once in the document. This kind is a required single-valued
(nonrepeating) element.
The type of the element is specified via parentheses following the element.
If the parentheses include names of other elements, these latter elements are
the children of the element in the tree structure. If the parentheses include
the keyword #PCDATA or one of the other data types available in XML DTD, the
element is a leaf node. PCDATA stands for parsed character data, which is
roughly similar to a string data type.
The list of attributes that can appear within an element can also be
specified via the keyword !ATTLIST. In Figure 12.4, the Project element has an
attribute ProjId. If the type of an attribute is ID, then it can be referenced
from another attribute whose type is IDREF within another element. Notice that
attributes can also be used to hold the values of simple data elements of type
#PCDATA.
Parentheses can be nested when specifying elements.
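The notation above behaves like a regular expression over the sequence of child elements, which the following Python sketch makes explicit. It is only an illustration under stated assumptions: the CONTENT_MODELS table is a hand translation of the element declarations in Figure 12.4, the conforms function and the two sample documents are hypothetical, and attributes, #PCDATA content, and ID/IDREF rules are not checked at all. It is not a DTD validator.

```python
import re
import xml.etree.ElementTree as ET

# Content models from Figure 12.4, hand-translated to regular expressions
# over a sequence of child tag names, each followed by one space.
CONTENT_MODELS = {
    "Projects": r"(Project )+",                                # one or more
    "Project": r"Name Number Location (Dept_no )?Workers ",    # Dept_no optional
    "Workers": r"(Worker )*",                                  # zero or more
    "Worker": r"Ssn (Last_name )?(First_name )?Hours ",
}


def conforms(element):
    """Return True if element and all its descendants match their content model."""
    model = CONTENT_MODELS.get(element.tag)
    if model is None:
        # Anything else is treated as a #PCDATA leaf: no child elements allowed.
        return len(element) == 0
    children = "".join(child.tag + " " for child in element)
    if re.fullmatch(model, children) is None:
        return False
    return all(conforms(child) for child in element)


# Hypothetical sample documents.
GOOD = (
    '<Projects><Project ProjId="P1"><Name>ProductX</Name><Number>1</Number>'
    "<Location>Bellaire</Location><Workers><Worker><Ssn>123456789</Ssn>"
    "<Hours>32.5</Hours></Worker></Workers></Project></Projects>"
)
BAD = (  # Location is required exactly once but is missing here
    '<Projects><Project ProjId="P1"><Name>ProductX</Name><Number>1</Number>'
    "<Workers></Workers></Project></Projects>"
)

good_result = conforms(ET.fromstring(GOOD))
bad_result = conforms(ET.fromstring(BAD))
print(good_result, bad_result)
```

The first document omits the optional Dept_no, Last_name, and First_name elements and is still accepted; the second is rejected because a required single-valued element is absent.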
We can see that the tree structure in Figure 12.1 and the XML document
in Figure 12.3 conform to the XML DTD in Figure 12.4. To require that an XML
document be checked for conformance to a DTD, we must specify this in the
declaration of the document. For example, we could change the first line in
Figure 12.3 to the following:
<?xml version="1.0" standalone="no"?>
<!DOCTYPE Projects SYSTEM "proj.dtd">
When the value of the standalone attribute in an XML document is “no”, the document needs to be checked against a separate DTD document or
XML schema document (see below). The DTD file shown in Figure 12.4 should be
stored in the same file system as the XML document, and should be given the
file name proj.dtd. Alternatively, we could include the DTD document text at the beginning
of the XML document itself to allow the checking.
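As a small illustration of what such a declaration carries, the sketch below parses a document with the DOCTYPE line shown above and reads back the declared root name and DTD file name through the DOM. It uses the minidom module from the Python standard library, whose parser is non-validating: it records the declaration but does not open proj.dtd or check the document against it, so a separate validating processor would still be needed. The empty Projects element is a placeholder assumed for the example.

```python
from xml.dom import minidom

# Placeholder document: only the declaration lines come from the text above.
DOC = """<?xml version="1.0" standalone="no"?>
<!DOCTYPE Projects SYSTEM "proj.dtd">
<Projects></Projects>"""

doc = minidom.parseString(DOC)

doctype_name = doc.doctype.name         # root element name given in the DOCTYPE
dtd_file = doc.doctype.systemId         # external DTD file named after SYSTEM
root_tag = doc.documentElement.tagName  # actual root element of the document

# A valid document must use the declared name for its root element.
print(doctype_name, dtd_file, root_tag == doctype_name)
```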
Although XML DTD is quite adequate for specifying tree structures with
required, optional, and repeating elements, and with various types of
attributes, it has several limitations. First, the data types in DTD are not
very general. Second, DTD has its own special syntax and thus requires specialized
processors. It would be advantageous to specify XML schema documents using the
syntax rules of XML itself so that the same processors used for XML documents
could process XML schema descriptions. Third, all DTD elements are always
forced to follow the specified ordering of the document, so unordered elements
are not permitted. These drawbacks led to the development of XML schema, a
more general but also more complex language for specifying the structure and
elements of XML documents.
2. XML Schema
The XML schema language is a
standard for specifying the structure of XML documents. It uses the same
syntax rules as regular XML documents, so that the same processors can be used
on both. To distinguish the two types of documents, we will use the term XML instance document or XML document for a regular XML
document, and XML schema document
for a document that specifies an XML schema. Figure 12.5 shows an XML schema
document corresponding to the COMPANY database shown in Figures 3.5
and 7.2. Although it is unlikely that we would want to display the whole
database as a single document, there have been proposals to store data in native XML format as an alternative to
storing the data in relational databases. The schema in Figure 12.5 would serve
the purpose of specifying the structure of the COMPANY database
if it were stored in a native XML system. We discuss this topic further in
Section 12.4.
As with XML DTD, XML schema is based on the tree data model, with
elements and attributes as the main structuring concepts. However, it borrows
additional concepts from database and object models, such as keys, references,
and identifiers. Here we describe the features of XML schema in a step-by-step
manner, referring to the sample XML schema document in Figure 12.5 for
illustration. We introduce and describe some of the schema concepts in the
order in which they are used in Figure 12.5.
Figure 12.5 An XML schema file called company
<?xml version="1.0" encoding="UTF-8" ?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
    <xsd:annotation>
        <xsd:documentation xml:lang="en">Company Schema (Element Approach) - Prepared by Babak Hojabri</xsd:documentation>
    </xsd:annotation>
    <xsd:element name="company">
        <xsd:complexType>
            <xsd:sequence>
                <xsd:element name="department" type="Department" minOccurs="0" maxOccurs="unbounded" />
                <xsd:element name="employee" type="Employee" minOccurs="0" maxOccurs="unbounded">
                    <xsd:unique name="dependentNameUnique">
                        <xsd:selector xpath="employeeDependent" />
                        <xsd:field xpath="dependentName" />
                    </xsd:unique>
                </xsd:element>
                <xsd:element name="project" type="Project" minOccurs="0" maxOccurs="unbounded" />
            </xsd:sequence>
        </xsd:complexType>
        <xsd:unique name="departmentNameUnique">
            <xsd:selector xpath="department" />
            <xsd:field xpath="departmentName" />
        </xsd:unique>
        <xsd:unique name="projectNameUnique">
            <xsd:selector xpath="project" />
            <xsd:field xpath="projectName" />
        </xsd:unique>
        <xsd:key name="projectNumberKey">
            <xsd:selector xpath="project" />
            <xsd:field xpath="projectNumber" />
        </xsd:key>
        <xsd:key name="departmentNumberKey">
            <xsd:selector xpath="department" />
            <xsd:field xpath="departmentNumber" />
        </xsd:key>
        <xsd:key name="employeeSSNKey">
            <xsd:selector xpath="employee" />
            <xsd:field xpath="employeeSSN" />
        </xsd:key>
        <xsd:keyref name="departmentManagerSSNKeyRef" refer="employeeSSNKey">
            <xsd:selector xpath="department" />
            <xsd:field xpath="departmentManagerSSN" />
        </xsd:keyref>
        <xsd:keyref name="employeeDepartmentNumberKeyRef" refer="departmentNumberKey">
            <xsd:selector xpath="employee" />
            <xsd:field xpath="employeeDepartmentNumber" />
        </xsd:keyref>
        <xsd:keyref name="employeeSupervisorSSNKeyRef" refer="employeeSSNKey">
            <xsd:selector xpath="employee" />
            <xsd:field xpath="employeeSupervisorSSN" />
        </xsd:keyref>
        <xsd:keyref name="projectDepartmentNumberKeyRef" refer="departmentNumberKey">
            <xsd:selector xpath="project" />
            <xsd:field xpath="projectDepartmentNumber" />
        </xsd:keyref>
        <xsd:keyref name="projectWorkerSSNKeyRef" refer="employeeSSNKey">
            <xsd:selector xpath="project/projectWorker" />
            <xsd:field xpath="SSN" />
        </xsd:keyref>
        <xsd:keyref name="employeeWorksOnProjectNumberKeyRef" refer="projectNumberKey">
            <xsd:selector xpath="employee/employeeWorksOn" />
            <xsd:field xpath="projectNumber" />
        </xsd:keyref>
    </xsd:element>
    <xsd:complexType name="Department">
        <xsd:sequence>
            <xsd:element name="departmentName" type="xsd:string" />
            <xsd:element name="departmentNumber" type="xsd:string" />
            <xsd:element name="departmentManagerSSN" type="xsd:string" />
            <xsd:element name="departmentManagerStartDate" type="xsd:date" />
            <xsd:element name="departmentLocation" type="xsd:string" minOccurs="0" maxOccurs="unbounded" />
        </xsd:sequence>
    </xsd:complexType>
    <xsd:complexType name="Employee">
        <xsd:sequence>
            <xsd:element name="employeeName" type="Name" />
            <xsd:element name="employeeSSN" type="xsd:string" />
            <xsd:element name="employeeSex" type="xsd:string" />
            <xsd:element name="employeeSalary" type="xsd:unsignedInt" />
            <xsd:element name="employeeBirthDate" type="xsd:date" />
            <xsd:element name="employeeDepartmentNumber" type="xsd:string" />
            <xsd:element name="employeeSupervisorSSN" type="xsd:string" />
            <xsd:element name="employeeAddress" type="Address" />
            <xsd:element name="employeeWorksOn" type="WorksOn" minOccurs="1" maxOccurs="unbounded" />
            <xsd:element name="employeeDependent" type="Dependent" minOccurs="0" maxOccurs="unbounded" />
        </xsd:sequence>
    </xsd:complexType>
    <xsd:complexType name="Project">
        <xsd:sequence>
            <xsd:element name="projectName" type="xsd:string" />
            <xsd:element name="projectNumber" type="xsd:string" />
            <xsd:element name="projectLocation" type="xsd:string" />
            <xsd:element name="projectDepartmentNumber" type="xsd:string" />
            <xsd:element name="projectWorker" type="Worker" minOccurs="1" maxOccurs="unbounded" />
        </xsd:sequence>
    </xsd:complexType>
    <xsd:complexType name="Dependent">
        <xsd:sequence>
            <xsd:element name="dependentName" type="xsd:string" />
            <xsd:element name="dependentSex" type="xsd:string" />
            <xsd:element name="dependentBirthDate" type="xsd:date" />
            <xsd:element name="dependentRelationship" type="xsd:string" />
        </xsd:sequence>
    </xsd:complexType>
    <xsd:complexType name="Address">
        <xsd:sequence>
            <xsd:element name="number" type="xsd:string" />
            <xsd:element name="street" type="xsd:string" />
            <xsd:element name="city" type="xsd:string" />
            <xsd:element name="state" type="xsd:string" />
        </xsd:sequence>
    </xsd:complexType>
    <xsd:complexType name="Name">
        <xsd:sequence>
            <xsd:element name="firstName" type="xsd:string" />
            <xsd:element name="middleName" type="xsd:string" />
            <xsd:element name="lastName" type="xsd:string" />
        </xsd:sequence>
    </xsd:complexType>
    <xsd:complexType name="Worker">
        <xsd:sequence>
            <xsd:element name="SSN" type="xsd:string" />
            <xsd:element name="hours" type="xsd:float" />
        </xsd:sequence>
    </xsd:complexType>
    <xsd:complexType name="WorksOn">
        <xsd:sequence>
            <xsd:element name="projectNumber" type="xsd:string" />
            <xsd:element name="hours" type="xsd:float" />
        </xsd:sequence>
    </xsd:complexType>
</xsd:schema>
1. Schema descriptions and XML namespaces. It is necessary to identify the
specific set of XML schema language elements (tags) being used by specifying a
namespace, which is named by a URI that looks like a Web site location. The
second line in Figure 12.5 specifies the namespace used in this example, which
is http://www.w3.org/2001/XMLSchema. This is the standard namespace for XML
schema commands. Each such definition is called an XML namespace, because it
defines the set of commands (names) that can be used. The namespace URI is
bound to the prefix xsd (XML schema description) using the attribute xmlns
(XML namespace), and this prefix is placed before all XML schema commands (tag
names). For example, in Figure 12.5, when we write xsd:element or
xsd:sequence, we are referring to the definitions of the element and sequence
tags as defined in the namespace http://www.w3.org/2001/XMLSchema.
2. Annotations, documentation, and language used. The next couple of
lines in Figure 12.5 illustrate the XML schema elements (tags) xsd:annotation and xsd:documentation, which are used for providing comments and other descriptions in the XML document. The attribute xml:lang of the xsd:documentation element specifies the language being used, where en stands for the English
language.
3. Elements and types. Next, we specify the root element of our XML schema. In
XML schema, the name attribute of the xsd:element tag specifies the element
name, which is called company for the root element in our example (see Figure
12.5). The structure of the company root element can then be specified, which
in our example is xsd:complexType. This is further specified to be a sequence
of departments, employees, and projects using the xsd:sequence structure of
XML schema. It is important to note here that this is not the only way to
specify an XML schema for the COMPANY database. We will discuss other options
in Section 12.6.
4. First-level elements in the COMPANY database. Next, we specify the three
first-level elements under the company root element in Figure 12.5. These
elements are named employee, department, and project, and each is specified in
an xsd:element tag. Notice that if a tag has only attributes and no further
subelements or data within it, it can be ended with the slash symbol (/>)
directly instead of having a separate matching end tag. These are called empty
elements; examples are the xsd:element elements named department and project
in Figure 12.5.
5. Specifying element type and minimum and maximum occurrences. In XML schema,
the attributes type, minOccurs, and maxOccurs in the xsd:element tag specify
the type and multiplicity of each element in any document that conforms to the
schema specifications. If we specify a type attribute in an xsd:element, the
structure of the element must be described separately, typically using the
xsd:complexType element of XML schema. This is illustrated by the employee,
department, and project elements in Figure 12.5. On the other hand, if no type
attribute is specified, the element structure can be defined directly
following the tag, as illustrated by the company root element in Figure 12.5.
The minOccurs and maxOccurs attributes are used for specifying lower and upper
bounds on the number of occurrences of an element in any XML document that
conforms to the schema specifications. If they are not specified, the default
is exactly one occurrence. These serve a similar role to the *, +, and ?
symbols of XML DTD.
6. Specifying keys. In XML schema, it is possible to specify constraints that
correspond to unique and primary key constraints in a relational database (see
Section 3.2.2), as well as foreign key (or referential integrity) constraints
(see Section 3.2.4). The xsd:unique tag specifies elements that correspond to
unique attributes in a relational database. We can give each such uniqueness
constraint a name, and we must specify xsd:selector and xsd:field tags for it.
Through their xpath attributes, these identify the element type that contains
the unique element and the element name within it that must be unique. This is
illustrated by the departmentNameUnique and projectNameUnique elements in
Figure 12.5. For specifying primary keys, the tag xsd:key is used instead of
xsd:unique, as illustrated by the projectNumberKey, departmentNumberKey, and
employeeSSNKey elements in Figure 12.5. For specifying foreign keys, the tag
xsd:keyref is used, as illustrated by the six xsd:keyref elements in Figure
12.5. When specifying a foreign key, the attribute refer of the xsd:keyref tag
specifies the referenced primary key, whereas the tags xsd:selector and
xsd:field specify the referencing element type and foreign key (see Figure
12.5).
7. Specifying the structures of complex elements via complex types.
The next part of our example specifies the structures of the complex elements Department, Employee, Project, and Dependent, using the tag xsd:complexType (see Figure 12.5). We specify each of these as a sequence of subelements corresponding to the database attributes of each entity type (see Figure 3.7) by using the xsd:sequence and xsd:element tags of XML schema. Each element is given a name and type via the attributes name and type of xsd:element. We can also specify minOccurs and maxOccurs attributes if we need to change the default of exactly one occurrence. For (optional) database attributes where null is allowed, we need to specify minOccurs = 0, whereas for multivalued database attributes we need to specify maxOccurs = "unbounded" on the corresponding element. Notice that if we were not going to specify any key constraints, we could have embedded the subelements within the parent element definitions directly without having to specify complex types. However, when unique, primary key, and foreign key constraints need to be specified, we must define complex types to specify the element structures.
8. Composite (compound) attributes. Composite attributes from Figure 7.2 are also specified as complex types in Figure 12.5, as illustrated by the Address, Name, Worker, and WorksOn complex types. These could have been directly embedded within their parent elements.
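The key and foreign key constraints described in the list above can be sketched in Python as plain checks over an instance document. This is a hand-written illustration of what xsd:key and xsd:keyref require, not a schema validator: the helper functions are hypothetical, the sample instance is invented and heavily trimmed (it omits most elements that the schema in Figure 12.5 requires), and the selector and field paths are copied from the departmentNumberKey, employeeSSNKey, departmentManagerSSNKeyRef, and employeeDepartmentNumberKeyRef definitions.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed instance document; only key-related elements are kept.
INSTANCE = """<company>
  <department>
    <departmentNumber>5</departmentNumber>
    <departmentManagerSSN>333445555</departmentManagerSSN>
  </department>
  <employee>
    <employeeSSN>333445555</employeeSSN>
    <employeeDepartmentNumber>5</employeeDepartmentNumber>
  </employee>
  <employee>
    <employeeSSN>123456789</employeeSSN>
    <employeeDepartmentNumber>4</employeeDepartmentNumber>
  </employee>
</company>"""


def field_values(root, selector, field):
    """Collect the text of `field` under every element matched by `selector`."""
    return [elem.findtext(field) for elem in root.findall(selector)]


def key_ok(values):
    """xsd:key: the field is present in every selected element and is distinct."""
    return None not in values and len(values) == len(set(values))


def dangling_refs(ref_values, key_values):
    """xsd:keyref: every referencing value that is present must match a key value."""
    keys = set(key_values)
    return [value for value in ref_values if value is not None and value not in keys]


root = ET.fromstring(INSTANCE)
dept_keys = field_values(root, "department", "departmentNumber")
emp_keys = field_values(root, "employee", "employeeSSN")

keys_valid = key_ok(dept_keys) and key_ok(emp_keys)
bad_managers = dangling_refs(
    field_values(root, "department", "departmentManagerSSN"), emp_keys)
bad_departments = dangling_refs(
    field_values(root, "employee", "employeeDepartmentNumber"), dept_keys)

print(keys_valid, bad_managers, bad_departments)
```

In this sample both keys hold and the manager reference resolves, but the second employee refers to department 4, which has no matching departmentNumber; a validating schema processor would be expected to report that as a keyref violation.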
This example illustrates some of the main features of XML schema. There
are other features, but they are beyond the scope of our presentation. In the
next section, we discuss the different approaches to creating XML documents from
relational databases and storing XML documents.
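Two of the points covered in this section, namespace prefixes and occurrence bounds, can be made concrete with a short Python sketch using the standard library's ElementTree module. The fragment below borrows three declarations from Figure 12.5; the fourth, which marks middleName as optional, is a hypothetical variation added only to show the ? case. The dtd_symbol helper and the DTD_SYMBOL table are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"

# The last declaration is a hypothetical variation; in Figure 12.5,
# middleName has no minOccurs attribute.
FRAGMENT = """<xsd:sequence xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="departmentName" type="xsd:string" />
  <xsd:element name="departmentLocation" type="xsd:string"
               minOccurs="0" maxOccurs="unbounded" />
  <xsd:element name="employeeWorksOn" type="WorksOn"
               minOccurs="1" maxOccurs="unbounded" />
  <xsd:element name="middleName" type="xsd:string" minOccurs="0" />
</xsd:sequence>"""

# (minOccurs, maxOccurs) pairs that have a one-symbol DTD counterpart.
DTD_SYMBOL = {
    ("1", "1"): "",           # required single-valued
    ("0", "1"): "?",          # optional single-valued
    ("0", "unbounded"): "*",  # optional multivalued
    ("1", "unbounded"): "+",  # required multivalued
}


def dtd_symbol(declaration):
    """Map an xsd:element's occurrence bounds to a DTD symbol (None if no match)."""
    bounds = (declaration.get("minOccurs", "1"), declaration.get("maxOccurs", "1"))
    return DTD_SYMBOL.get(bounds)


root = ET.fromstring(FRAGMENT)

# The parser replaces the xsd prefix with the namespace URI in braces.
root_tag = root.tag

# Any prefix bound to the same URI selects the same elements.
declarations = root.findall("s:element", {"s": XSD_NS})
symbols = {d.get("name"): dtd_symbol(d) for d in declarations}

print(root_tag)
print(symbols)
```

The lookup works with the prefix s even though the fragment uses xsd, which shows that the prefix is only a local shorthand for the namespace URI. Bounds such as minOccurs="2" fall outside the table and return None, reflecting that XML schema can express occurrence ranges that the three DTD symbols cannot.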