[Definition: A data object is an XML document if it is well-formed, as defined in this specification. A well-formed XML document may in addition be valid if it meets certain further constraints.]
Each XML document has both a logical and a physical structure. Physically, the docu-ment is composed of units called entities. An entity may refer to other entities to cause their inclusion in the document. A document begins in a “root” or document entity.
Logically, the document is composed of declarations, elements, comments, character ref-erences, and processing instructions, all of which are indicated in the document by explicit markup. The logical and physical structures must nest properly, as described in 4.3.2 Well-Formed Parsed Entities.
1 Well-Formed XML Documents
[Definition: A textual object is a well-formed XML document if:]
Taken as a whole, it matches the production labeled document.
It meets all the well-formedness constraints given in this specification.
Each of the parsed entities which is referenced directly or indirectly within the document is well-formed.
Matching the document production implies that:
It contains one or more elements.
[Definition: There is exactly one element, called the root, or document element, no part of which appears in the content of any other element.] For all other elements, if the start-tag is in the content of another element, the end-tag is in the content of the same element. More simply stated, the elements, delimited by start- and end-tags, nest properly within each other.
[Definition: As a consequence of this, for each non-root element C in the docu-ment, there is one other element P in the document such that C is in the content of P, but is not in the content of any other element that is in the content of P. P is referred to as the parent of C, and C as a child of P.]
[Definition: A parsed entity contains text, a sequence of characters, which may represent markup or character data.] [Definition: A character is an atomic unit of text as specified by ISO/IEC 10646 [ISO/IEC 10646] (see also [ISO/IEC 10646-2000]). Legal characters are tab, carriage return, line feed, and the legal characters of Unicode and ISO/IEC 10646. The versions of these standards cited in A.1 Normative References were current at the time this document was prepared. New characters may be added to these standards by amendments or new editions. Consequently, XML processors must accept any charac-ter in the range specified for Char. The use of “compatibility characters”, as defined in section 6.8 of [Unicode] (see also D21 in section 3.6 of [Unicode3]), is discouraged.]
 Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
➥ [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
The mechanism for encoding character code points into bit patterns may vary from entity to entity. All XML processors must accept the UTF-8 and UTF-16 encodings of 10646; the mechanisms for signaling which of the two is in use, or for bringing other encodings into play, are discussed later, in 4.3.3 Character Encoding in Entities.
3 Common Syntactic Constructs
This section defines some symbols used widely in the grammar.
S (white space) consists of one or more space (#x20) characters, carriage returns, line feeds, or tabs.
 S ::= (#x20 | #x9 | #xD | #xA)+
Characters are classified for convenience as letters, digits, or other characters. A letter consists of an alphabetic or syllabic base character or an ideographic character. Full defi-nitions of the specific characters in each class are given in B Character Classes.
[Definition: A Name is a token beginning with a letter or one of a few punctuation char-acters, and continuing with letters, digits, hyphens, underscores, colons, or full stops, together known as name characters.] Names beginning with the string “xml”, or any string which would match ((‘X’|’x’) (‘M’|’m’) (‘L’|’l’)), are reserved for standardization in this or future versions of this specification.
An Nmtoken (name token) is any mixture of name characters.
Literal data is any quoted string not containing the quotation mark used as a delimiter for that string. Literals are used for specifying the content of internal entities (EntityValue), the values of attributes (AttValue), and external identifiers (SystemLiteral). Note that a SystemLiteral can be parsed without scanning for markup.
4 Character Data and Markup
Text consists of intermingled character data and markup. [Definition: Markup takes the form of start-tags, end-tags, empty-element tags, entity references, character references, comments, CDATA section delimiters, document type declarations, processing instruc-tions, XML declarations, text declarations, and any white space that is at the top level of the document entity (that is, outside the document element and not inside any other markup).]
[Definition: All text that is not markup constitutes the character data of the document.]
The ampersand character (&) and the left angle bracket (<) may appear in their literal form only when used as markup delimiters, or within a comment, a processing instruc-tion, or a CDATA section. If they are needed elsewhere, they must be escaped using either numeric character references or the strings “&” and “<” respectively. The right angle bracket (>) may be represented using the string “>”, and must, for com-patibility, be escaped using “>” or a character reference when it appears in the string “]]>” in content, when that string is not marking the end of a CDATA section.
In the content of elements, character data is any string of characters which does not con-tain the start-delimiter of any markup. In a CDATA section, character data is any string of characters not including the CDATA-section-close delimiter, “]]>”.
To allow attribute values to contain both single and double quotes, the apostrophe or sin-gle-quote character (‘) may be represented as “'”, and the double-quote character (“) as “"”.
 CharData ::= [^<&]* - ([^<&]* ‘]]>’ [^<&]*)
[Definition: Comments may appear anywhere in a document outside other markup; in addition, they may appear within the document type declaration at places allowed by the grammar. They are not part of the document’s character data; an XML processor may, but need not, make it possible for an application to retrieve the text of comments. For compatibility, the string “—” (double-hyphen) must not occur within comments.] Parameter entity references are not recognized within comments.
 Comment ::= ‘<!--’ ((Char - ‘-’) | (‘-’ (Char - ‘-’)))* ‘-->’
An example of a comment:
<!-- declarations for <head> & <body> -->
6 Processing Instructions
[Definition: Processing instructions (PIs) allow documents to contain instructions for applications.]
PIs are not part of the document’s character data, but must be passed through to the application. The PI begins with a target (PITarget) used to identify the application to which the instruction is directed. The target names “XML”, “xml”, and so on are reserved for standardization in this or future versions of this specification. The XML
Notation mechanism may be used for formal declaration of PI targets. Parameter entity references are not recognized within processing instructions.
7 CDATA Sections
[Definition: CDATA sections may occur anywhere character data may occur; they are used to escape blocks of text containing characters which would otherwise be recognized as markup. CDATA sections begin with the string “<![CDATA[“ and end with the string “]]>”:]
Within a CDATA section, only the CDEnd string is recognized as markup, so that left angle brackets and ampersands may occur in their literal form; they need not (and can-not) be escaped using “<” and “&”. CDATA sections cannot nest.
An example of a CDATA section, in which “<greeting>” and “</greeting>” are recog-nized as character data, not markup:
8 Prolog and Document Type Declaration
[Definition: XML documents should begin with an XML declaration which specifies the version of XML being used.] For example, the following is a complete XML document, well-formed but not valid:
<?xml version=”1.0”?> <greeting>Hello, world!</greeting>
and so is this:
The version number “1.0” should be used to indicate conformance to this version of this specification; it is an error for a document to use the value “1.0” if it does not conform to this version of this specification. It is the intent of the XML working group to give later versions of this specification numbers other than “1.0”, but this intent does not indicate a commitment to produce any future versions of XML, nor if any are produced, to use any particular numbering scheme. Since future versions are not ruled out, this construct is provided as a means to allow the possibility of automatic version recognition, should it become necessary. Processors may signal an error if they receive documents labeled with versions they do not support.
The function of the markup in an XML document is to describe its storage and logical structure and to associate attribute-value pairs with its logical structures. XML provides a mechanism, the document type declaration, to define constraints on the logical structure and to support the use of predefined storage units. [Definition: An XML document is valid if it has an associated document type declaration and if the document complies with the constraints expressed in it.]
The document type declaration must appear before the first element in the document.
[Definition: The XML document type declaration contains or points to markup declara-tions that provide a grammar for a class of documents. This grammar is known as a doc-ument type definition, or DTD. The document type declaration can point to an external subset (a special kind of external entity) containing markup declarations, or can contain the markup declarations directly in an internal subset, or can do both. The DTD for a document consists of both subsets taken together.]
[Definition: A markup declaration is an element type declaration, an attribute-list decla-ration, an entity declaration, or a notation declaration.] These declarations may be con-tained in whole or in part within parameter entities, as described in the well-formedness and validity constraints below. For further information, see 4 Physical Structures.
Document Type Definition
Note that it is possible to construct a well-formed document containing a doctypedecl that neither points to an external subset nor contains an internal subset.
The markup declarations may be made up in whole or in part of the replacement text of parameter entities. The productions later in this specification for individual nonterminals (elementdecl, AttlistDecl, and so on) describe the declarations after all the parameter entities have been included.
Parameter entity references are recognized anywhere in the DTD (internal and external subsets and external parameter entities), except in literals, processing instructions, com-ments, and the contents of ignored conditional sections (see 3.4 Conditional Sections).
They are also recognized in entity value literals. The use of parameter entities in the internal subset is restricted as described below.
Validity constraint: Root Element Type
The Name in the document type declaration must match the element type of the root element.
Validity constraint: Proper Declaration/PE Nesting
Parameter-entity replacement text must be properly nested with markup declarations. That is to say, if either the first character or the last character of a markup declaration (markupdecl above) is contained in the replacement text for a parameter-entity reference, both must be contained in the same replacement text.
Well-formedness constraint: PEs in Internal Subset
In the internal DTD subset, parameter-entity references can occur only where markup declarations can occur, not within markup declarations. (This does not apply to refer-ences that occur in external parameter entities or to the external subset.)
Well-formedness constraint: External Subset
The external subset, if any, must match the production for extSubset.
Well-formedness constraint: PE Between Declarations
The replacement text of a parameter entity reference in a DeclSep must match the production extSubsetDecl.
Like the internal subset, the external subset and any external parameter entities refer-enced in a DeclSep must consist of a series of complete markup declarations of the types allowed by the non-terminal symbol markupdecl, interspersed with white space or para-meter-entity references. However, portions of the contents of the external subset or of these external parameter entities may conditionally be ignored by using the conditional section construct; this is not allowed in the internal subset.
The external subset and external parameter entities also differ from the internal subset in that in them, parameter-entity references are permitted within markup declarations, not only between markup declarations.
An example of an XML document with a document type declaration:
<?xml version=”1.0”?> <!DOCTYPE greeting SYSTEM “hello.dtd”> <greeting>Hello,
The system identifier “hello.dtd” gives the address (a URI reference) of a DTD for the document.
The declarations can also be given locally, as in this example:
<?xml version=”1.0” encoding=”UTF-8” ?> <!DOCTYPE greeting [
<!ELEMENT greeting (#PCDATA)>
If both the external and internal subsets are used, the internal subset is considered to occur before the external subset. This has the effect that entity and attribute-list declara-tions in the internal subset take precedence over those in the external subset.
9 Standalone Document Declaration
Markup declarations can affect the content of the document, as passed from an XML processor to an application; examples are attribute defaults and entity declarations. The standalone document declaration, which may appear as a component of the XML decla-ration, signals whether or not there are such declarations which appear external to the document entity or in parameter entities. [Definition: An external markup declaration is defined as a markup declaration occurring in the external subset or in a parameter entity (external or internal, the latter being included because non-validating processors are not required to read them).]
Standalone Document Declaration
In a standalone document declaration, the value “yes” indicates that there are no external markup declarations which affect the information passed from the XML processor to the application. The value “no” indicates that there are or may be such external markup dec-larations. Note that the standalone document declaration only denotes the presence of external declarations; the presence, in a document, of references to external entities, when those entities are internally declared, does not change its standalone status.
If there are no external markup declarations, the standalone document declaration has no meaning. If there are external markup declarations but there is no standalone document declaration, the value “no” is assumed.
Any XML document for which standalone=”no” holds can be converted algorithmically to a standalone document, which may be desirable for some network delivery applica-tions.
Validity constraint: Standalone Document Declaration
The standalone document declaration must have the value “no” if any external markup declarations contain declarations of:
attributes with default values, if elements to which these attributes apply appear in the document without specifications of values for these attributes, or
entities (other than amp, lt, gt, apos, quot), if references to those entities appear in the document, or
attributes with values subject to normalization, where the attribute appears in the document with a value which will change as a result of normalization, or
element types with element content, if white space occurs directly within any instance of those types.
An example XML declaration with a standalone document declaration:
<?xml version=”1.0” standalone=’yes’?>
10 White Space Handling
In editing XML documents, it is often convenient to use “white space” (spaces, tabs, and blank lines) to set apart the markup for greater readability. Such white space is typically not intended for inclusion in the delivered version of the document. On the other hand, “significant” white space that should be preserved in the delivered version is common, for example in poetry and source code.
An XML processor must always pass all characters in a document that are not markup through to the application. A validating XML processor must also inform the application which of these characters constitute white space appearing in element content.
A special attribute named xml:space may be attached to an element to signal an intention that in that element, white space should be preserved by applications. In valid docu-ments, this attribute, like any other, must be declared if it is used. When declared, it must be given as an enumerated type whose values are one or both of “default” and “pre-serve”. For example:
<!ATTLIST poem xml:space (default|preserve) ‘preserve’>
<!ATTLIST pre xml:space (preserve) #FIXED ‘preserve’>
The value “default” signals that applications’ default white-space processing modes are acceptable for this element; the value “preserve” indicates the intent that applications preserve all the white space. This declared intent is considered to apply to all elements within the content of the element where it is specified, unless overriden with another instance of the xml:space attribute.
The root element of any document is considered to have signaled no intentions as regards application space handling, unless it provides a value for this attribute or the attribute is declared with a default value.
11 End-of-Line Handling
XML parsed entities are often stored in computer files which, for editing convenience, are organized into lines. These lines are typically separated by some combination of the characters carriage-return (#xD) and line-feed (#xA).
To simplify the tasks of applications, the characters passed to an application by the XML processor must be as if the XML processor normalized all line breaks in external parsed entities (including the document entity) on input, before parsing, by translating both the two-character sequence #xD #xA and any #xD that is not followed by #xA to a single #xA character.
12 Language Identification
In document processing, it is often useful to identify the natural or formal language in which the content is written. A special attribute named xml:lang may be inserted in docu-ments to specify the language used in the contents and attribute values of any element in an XML document. In valid documents, this attribute, like any other, must be declared if it is used. The values of the attribute are language identifiers as defined by [IETF RFC 1766], Tags for the Identification of Languages, or its successor on the IETF Standards Track.
<p xml:lang=”en”>The quick brown fox jumps over the lazy dog.</p> <p xml:lang=”en-GB”>What colour is it?</p>
<p xml:lang=”en-US”>What color is it?</p> <sp who=”Faust” desc=’leise’ xml:lang=”de”>
<l>Habe nun, ach! Philosophie,</l> <l>Juristerei, und Medizin</l> <l>und leider auch Theologie</l>
<l>durchaus studiert mit heißem Bemüh’n.</l> </sp>
The intent declared with xml:lang is considered to apply to all attributes and content of the element where it is specified, unless overridden with an instance of xml:lang on another element within that content.
A simple declaration for xml:lang might take the form
xml:lang NMTOKEN #IMPLIED
but specific default values may also be given, if appropriate. In a collection of French poems for English students, with glosses and notes in English, the xml:lang attribute might be declared this way:
<!ATTLIST poem xml:lang NMTOKEN ‘fr’>
<!ATTLIST gloss xml:lang NMTOKEN ‘en’>
<!ATTLIST note xml:lang NMTOKEN ‘en’>