Chapter: Fundamentals of Database Systems : Object, Object-Relational, and XML: Concepts, Models, Languages, and Standards : XML: Extensible Markup Language

XML Documents, DTD, and XML Schema

1. Well-Formed and Valid XML Documents and XML DTD 2. XML Schema

XML Documents, DTD, and XML Schema

1. Well-Formed and Valid XML Documents and XML DTD

In Figure 12.3, we saw what a simple XML document may look like. An XML document is well formed if it follows a few conditions. In particular, it must start with an XML declaration to indicate the version of XML being used as well as any other relevant attributes, as shown in the first line in Figure 12.3. It must also follow the syn-tactic guidelines of the tree data model. This means that there should be a single root element, and every element must include a matching pair of start and end tags within the start and end tags of the parent element. This ensures that the nested elements specify a well-formed tree structure.

A well-formed XML document is syntactically correct. This allows it to be processed by generic processors that traverse the document and create an internal tree representation. A standard model with an associated set of API (application programming interface) functions called DOM (Document Object Model) allows programs to manipulate the resulting tree representation corresponding to a well-formed XML document. However, the whole document must be parsed beforehand when using DOM in order to convert the document to that standard DOM internal data structure representation. Another API called SAX (Simple API for XML) allows processing of XML documents on the fly by notifying the processing program through callbacks whenever a start or end tag is encountered. This makes it easier to process large documents and allows for processing of so-called streaming XML documents, where the processing program can process the tags as they are encountered. This is also known as event-based processing.

A well-formed XML document can be schemaless; that is, it can have any tag names for the elements within the document. In this case, there is no predefined set of elements (tag names) that a program processing the document knows to expect. This gives the document creator the freedom to specify new elements, but limits the possibilities for automatically interpreting the meaning or semantics of the elements within the document.

A stronger criterion is for an XML document to be valid. In this case, the document must be well formed, and it must follow a particular schema. That is, the element names used in the start and end tag pairs must follow the structure specified in a separate XML DTD (Document Type Definition) file or XML schema file. We first discuss XML DTD here, and then we give an overview of XML schema in Section 12.3.2. Figure 12.4 shows a simple XML DTD file, which specifies the elements (tag names) and their nested structures. Any valid documents conforming to this DTD should follow the specified structure. A special syntax exists for specifying DTD files, as illustrated in Figure 12.4. First, a name is given to the root tag of the document, which is called Projects in the first line in Figure 12.4. Then the elements and their nested structure are specified.

<!DOCTYPE Projects [

<!ELEMENT Projects (Project+)>

<!ELEMENT Project (Name, Number, Location, Dept_no?, Workers) <!ATTLIST Project

ProjId ID #REQUIRED>

<!ELEMENT Name (#PCDATA)> <!ELEMENT Number (#PCDATA) <!ELEMENT Location (#PCDATA)> <!ELEMENT Dept_no (#PCDATA)> <!ELEMENT Workers (Worker*)>

<!ELEMENT Worker (Ssn, Last_name?, First_name?, Hours)> <!ELEMENT Ssn (#PCDATA)>

<!ELEMENT Last_name (#PCDATA)> <!ELEMENT First_name (#PCDATA)> <!ELEMENT Hours (#PCDATA)>

] >

Figure 12.4 An XML DTD file called Projects

When specifying elements, the following notation is used:

A _* following the element name means that the element can be repeated zero or more times in the document. This kind of element is known as an optional multivalued (repeating) element.

A + following the element name means that the element can be repeated one or more times in the document. This kind of element is a required multival-ued (repeating) element.

A ? following the element name means that the element can be repeated zero or one times. This kind is an optional single-valued (nonrepeating) element.

An element appearing without any of the preceding three symbols must appear exactly once in the document. This kind is a required single-valued (nonrepeating) element.

The type of the element is specified via parentheses following the element. If the parentheses include names of other elements, these latter elements are the children of the element in the tree structure. If the parentheses include the keyword #PCDATA or one of the other data types available in XML DTD, the element is a leaf node. PCDATA stands for parsed character data, which is roughly similar to a string data type.

The list of attributes that can appear within an element can also be specified

via the keyword !ATTLIST. In Figure 12.3, the Project element has an attribute ProjId. If the type of an attribute is ID, then it can be referenced from another attribute whose type is IDREF within another element. Notice that attributes can also be used to hold the values of simple data elements of type #PCDATA.

Parentheses can be nested when specifying elements.

A bar symbol ( e₁ | e₂ ) specifies that either e₁ or e₂ can appear in the document.

We can see that the tree structure in Figure 12.1 and the XML document in Figure 12.3 conform to the XML DTD in Figure 12.4. To require that an XML document be checked for conformance to a DTD, we must specify this in the declaration of the document. For example, we could change the first line in Figure 12.3 to the following:

<?xml version=“1.0” standalone=“no”?> <!DOCTYPE Projects SYSTEM “proj.dtd”>

When the value of the standalone attribute in an XML document is “no”, the document needs to be checked against a separate DTD document or XML schema document (see below). The DTD file shown in Figure 12.4 should be stored in the same file system as the XML document, and should be given the file name proj.dtd. Alternatively, we could include the DTD document text at the beginning of the XML document itself to allow the checking.

Although XML DTD is quite adequate for specifying tree structures with required, optional, and repeating elements, and with various types of attributes, it has several limitations. First, the data types in DTD are not very general. Second, DTD has its own special syntax and thus requires specialized processors. It would be advantageous to specify XML schema documents using the syntax rules of XML itself so that the same processors used for XML documents could process XML schema descriptions. Third, all DTD elements are always forced to follow the specified ordering of the document, so unordered elements are not permitted. These draw-backs led to the development of XML schema, a more general but also more complex language for specifying the structure and elements of XML documents.

2. XML Schema

The XML schema language is a standard for specifying the structure of XML documents. It uses the same syntax rules as regular XML documents, so that the same processors can be used on both. To distinguish the two types of documents, we will use the term XML instance document or XML document for a regular XML document, and XML schema document for a document that specifies an XML schema. Figure 12.5 shows an XML schema document corresponding to the COMPANY database shown in Figures 3.5 and 7.2. Although it is unlikely that we would want to display the whole database as a single document, there have been proposals to store data in native XML format as an alternative to storing the data in relational data-bases. The schema in Figure 12.5 would serve the purpose of specifying the struc-ture of the COMPANY database if it were stored in a native XML system. We discuss this topic further in Section 12.4.

As with XML DTD, XML schema is based on the tree data model, with elements and attributes as the main structuring concepts. However, it borrows additional concepts from database and object models, such as keys, references, and identifiers. Here we describe the features of XML schema in a step-by-step manner, referring to the sample XML schema document in Figure 12.5 for illustration. We introduce and describe some of the schema concepts in the order in which they are used in Figure 12.5.

Figure 12.5

An XML schema file called company.

<?xml version=“1.0” encoding=“UTF-8” ?>

<xsd:schema xmlns:xsd=“http://www.w3.org/2001/XMLSchema”> <xsd:annotation>

<xsd:documentation xml:lang=“en”>Company Schema (Element Approach) - Prepared by Babak Hojabri</xsd:documentation>

</xsd:annotation> <xsd:element name=“company”>

<xsd:complexType>

<xsd:sequence>

<xsd:element name=“department” type=“Department” minOccurs=“0” maxOccurs= “unbounded” /> <xsd:element name=“employee” type=“Employee” minOccurs=“0” maxOccurs= “unbounded”>

<xsd:unique name=“dependentNameUnique”> <xsd:selector xpath=“employeeDependent” /> <xsd:field xpath=“dependentName” />

</xsd:unique>

</xsd:element>

<xsd:element name=“project” type=“Project” minOccurs=“0” maxOccurs=“unbounded” /> </xsd:sequence>

</xsd:complexType>

<xsd:unique name=“departmentNameUnique”> <xsd:selector xpath=“department” /> <xsd:field xpath=“departmentName” />

</xsd:unique>

<xsd:unique name=“projectNameUnique”> <xsd:selector xpath=“project” /> <xsd:field xpath=“projectName” />

</xsd:unique>

<xsd:key name=“projectNumberKey”> <xsd:selector xpath=“project” /> <xsd:field xpath=“projectNumber” />

</xsd:key>

<xsd:key name=“departmentNumberKey”> <xsd:selector xpath=“department” /> <xsd:field xpath=“departmentNumber” />

</xsd:key>

<xsd:key name=“employeeSSNKey”> <xsd:selector xpath=“employee” /> <xsd:field xpath=“employeeSSN” />

</xsd:key>

<xsd:keyref name=“departmentManagerSSNKeyRef” refer=“employeeSSNKey”> <xsd:selector xpath=“department” />

<xsd:field xpath=“departmentManagerSSN” /> </xsd:keyref>

<xsd:keyref name=“employeeDepartmentNumberKeyRef” refer=“departmentNumberKey”>

<xsd:selector xpath=“employee” />

<xsd:field xpath=“employeeDepartmentNumber” /> </xsd:keyref>

<xsd:keyref name=“employeeSupervisorSSNKeyRef” refer=“employeeSSNKey”> <xsd:selector xpath=“employee” />

<xsd:field xpath=“employeeSupervisorSSN” /> </xsd:keyref>

<xsd:keyref name=“projectDepartmentNumberKeyRef” refer=“departmentNumberKey”> <xsd:selector xpath=“project” />

<xsd:field xpath=“projectDepartmentNumber” /> </xsd:keyref>

<xsd:keyref name=“projectWorkerSSNKeyRef” refer=“employeeSSNKey”> <xsd:selector xpath=“project/projectWorker” />

<xsd:field xpath=“SSN” /> </xsd:keyref>

<xsd:keyref name=“employeeWorksOnProjectNumberKeyRef” refer=“projectNumberKey”>

<xsd:selector xpath=“employee/employeeWorksOn” /> <xsd:field xpath=“projectNumber” />

</xsd:keyref>

</xsd:element>

<xsd:complexType name=“Department”> <xsd:sequence>

<xsd:element name=“departmentName” type=“xsd:string” /> <xsd:element name=“departmentNumber” type=“xsd:string” /> <xsd:element name=“departmentManagerSSN” type=“xsd:string” /> <xsd:element name=“departmentManagerStartDate” type=“xsd:date” />

<xsd:element name=“departmentLocation” type=“xsd:string” minOccurs=“0” maxOccurs=“unbounded” /> </xsd:sequence>

</xsd:complexType> <xsd:complexType name=“Employee”>

<xsd:sequence>

<xsd:element name=“employeeName” type=“Name” /> <xsd:element name=“employeeSSN” type=“xsd:string” /> <xsd:element name=“employeeSex” type=“xsd:string” /> <xsd:element name=“employeeSalary” type=“xsd:unsignedInt” /> <xsd:element name=“employeeBirthDate” type=“xsd:date” />

<xsd:element name=“employeeDepartmentNumber” type=“xsd:string” /> <xsd:element name=“employeeSupervisorSSN” type=“xsd:string” /> <xsd:element name=“employeeAddress” type=“Address” />

<xsd:element name=“employeeWorksOn” type=“WorksOn” minOccurs=“1” maxOccurs=“unbounded” /> <xsd:element name=“employeeDependent” type=“Dependent” minOccurs=“0” maxOccurs=“unbounded” />

</xsd:sequence>

</xsd:complexType>

<xsd:complexType name=“Project”>

<xsd:sequence>

<xsd:element name=“projectName” type=“xsd:string” />

<xsd:element name=“projectNumber” type=“xsd:string” />

<xsd:element name=“projectLocation” type=“xsd:string” />

<xsd:element name=“projectDepartmentNumber” type=“xsd:string” />

<xsd:element name=“projectWorker” type=“Worker” minOccurs=“1” maxOccurs=“unbounded” /> </xsd:sequence>

</xsd:complexType>

<xsd:complexType name=“Dependent”> <xsd:sequence>

<xsd:element name=“dependentName” type=“xsd:string” /> <xsd:element name=“dependentSex” type=“xsd:string” /> <xsd:element name=“dependentBirthDate” type=“xsd:date” /> <xsd:element name=“dependentRelationship” type=“xsd:string” />

</xsd:sequence>

</xsd:complexType> <xsd:complexType name=“Address”>

<xsd:sequence>

<xsd:element name=“number” type=“xsd:string” /> <xsd:element name=“street” type=“xsd:string” /> <xsd:element name=“city” type=“xsd:string” /> <xsd:element name=“state” type=“xsd:string” />

</xsd:sequence>

</xsd:complexType> <xsd:complexType name=“Name”>

<xsd:sequence>

<xsd:element name=“firstName” type=“xsd:string” /> <xsd:element name=“middleName” type=“xsd:string” /> <xsd:element name=“lastName” type=“xsd:string” />

</xsd:sequence>

</xsd:complexType> <xsd:complexType name=“Worker”>

<xsd:sequence>

<xsd:element name=“SSN” type=“xsd:string” /> <xsd:element name=“hours” type=“xsd:float” />

</xsd:sequence>

</xsd:complexType> <xsd:complexType name=“WorksOn”>

<xsd:sequence>

<xsd:element name=“projectNumber” type=“xsd:string” /> <xsd:element name=“hours” type=“xsd:float” />

</xsd:sequence>

</xsd:complexType>

</xsd:schema>

Schema descriptions and XML namespaces. It is necessary to identify the specific set of XML schema language elements (tags) being used by specify-ing a file stored at a Web site location. The second line in Figure 12.5 specifies the file used in this example, which is http://www.w3.org/2001/XMLSchema. This is a commonly used standard for XML schema commands. Each such definition is called an XML namespace, because it defines the set of commands (names) that can be used. The file name is assigned to the variable xsd (XML schema description) using the attribute xmlns (XML namespace), and this variable is used as a prefix to all XML schema commands (tag names). For example, in Figure 12.5, when we write xsd:element or xsd:sequence, we are referring to the definitions of the element and sequence tags as defined in the file http://www.w3.org/2001/XMLSchema.

2. Annotations, documentation, and language used. The next couple of lines in Figure 12.5 illustrate the XML schema elements (tags) xsd:annotation and xsd:documentation, which are used for providing comments and other descriptions in the XML document. The attribute xml:lang of the xsd:documentation element specifies the language being used, where en stands for the English language.

Elements and types. Next, we specify the root element of our XML schema. In XML schema, the name attribute of the xsd:element tag specifies the ele-ment name, which is called company for the root element in our example (see Figure 12.5). The structure of the company root element can then be speci-fied, which in our example is xsd:complexType. This is further specified to be a sequence of departments, employees, and projects using the xsd:sequence structure of XML schema. It is important to note here that this is not the only way to specify an XML schema for the COMPANY database. We will dis-cuss other options in Section 12.6.

First-level elements in the COMPANY database. Next, we specify the three first-level elements under the company root element in Figure 12.5. These elements are named employee, department, and project, and each is specified in an xsd:element tag. Notice that if a tag has only attributes and no further subelements or data within it, it can be ended with the backslash symbol (/>) directly instead of having a separate matching end tag. These are called empty elements; examples are the xsd:element elements named department and project in Figure 12.5.

Specifying element type and minimum and maximum occurrences. In

XML schema, the attributes type, minOccurs, and maxOccurs in the xsd:element tag specify the type and multiplicity of each element in any doc-ument that conforms to the schema specifications. If we specify a type attrib-ute in an xsd:element, the structure of the element must be described separately, typically using the xsd:complexType element of XML schema. This is illustrated by the employee, department, and project elements in Figure 12.5. On the other hand, if no type attribute is specified, the element structure can be defined directly following the tag, as illustrated by the company root ele-ment in Figure 12.5. The minOccurs and maxOccurs tags are used for specify-ing lower and upper bounds on the number of occurrences of an element in any XML document that conforms to the schema specifications. If they are not specified, the default is exactly one occurrence. These serve a similar role to the *, +, and ? symbols of XML DTD.

Specifying keys. In XML schema, it is possible to specify constraints that correspond to unique and primary key constraints in a relational database (see Section 3.2.2), as well as foreign keys (or referential integrity) con-straints (see Section 3.2.4). The xsd:unique tag specifies elements that correspond to unique attributes in a relational database. We can give each such uniqueness constraint a name, and we must specify xsd:selector and xsd:field tags for it to identify the element type that contains the unique element and the element name within it that is unique via the xpath attribute. This is illustrated by the departmentNameUnique and projectNameUnique elements in

Figure 12.5. For specifying primary keys, the tag xsd:key is used instead of xsd:unique, as illustrated by the projectNumberKey, departmentNumberKey, and employeeSSNKey elements in Figure 12.5. For specifying foreign keys, the tag xsd:keyref is used, as illustrated by the six xsd:keyref elements in Figure 12.5. When specifying a foreign key, the attribute refer of the xsd:keyref tag specifies the referenced primary key, whereas the tags xsd:selector and xsd:field specify the referencing element type and foreign key (see Figure 12.5).

7. Specifying the structures of complex elements via complex types.

The next part of our example specifies the structures of the complex elements Department, Employee, Project, and Dependent, using the tag xsd:complexType (see Figure 12.5 on page 428). We specify each of these as a sequence of subelements corresponding to the database attributes of each entity type (see Figure 3.7) by using the xsd:sequence and xsd:element tags of XML schema. Each element is given a name and type via the attributes name and type of xsd:element. We can also specify minOccurs and maxOccurs attributes if we need to change the default of exactly one occurrence. For (optional) data-base attributes where null is allowed, we need to specify minOccurs = 0, whereas for multivalued database attributes we need to specify maxOccurs = “unbounded” on the corresponding element. Notice that if we were not going to specify any key constraints, we could have embedded the subelements within the parent element definitions directly without having to specify complex types. However, when unique, primary key and foreign key constraints need to be specified; we must define complex types to specify the ele-ment structures.

Composite (compound) attributes. Composite attributes from Figure 7.2 are also specified as complex types in Figure 12.7, as illustrated by the Address, Name, Worker, and WorksOn complex types. These could have been directly embedded within their parent elements.

This example illustrates some of the main features of XML schema. There are other features, but they are beyond the scope of our presentation. In the next section, we discuss the different approaches to creating XML documents from relational data-bases and storing XML documents.

Study Material, Lecturing Notes, Assignment, Reference, Wiki description explanation, brief detail

Fundamentals of Database Systems : Object, Object-Relational, and XML: Concepts, Models, Languages, and Standards : XML: Extensible Markup Language : XML Documents, DTD, and XML Schema |