Chapter: XML and Web Services : Essentials of XML : The Fundamentals of XML

Basics of Reading and Processing XML

Now that you have learned the basics of writing well-formed XML documents (writing valid XML documents is covered in the chapters on DTDs and XML Schema), it is important to learn how to process and handle these documents. After all, the value of XML is not in its creation but in its use.

Processing XML involves a few major steps: parsing the XML document, processing and making use of the parsed elements, and integrating with other systems and programming languages. Because XML is just a text document format and not a programming language, it provides no mechanism to instruct machines how to process the content contained within it. That’s actually a good thing. Because there are no specific processing requirements, XML documents can be processed by all types of devices, operating systems, clients, servers, and other information consumers, all of which only need to understand how to read XML. XML has not only separated presentation from data, it has also separated strict processing requirements from data. In essence, XML is as pure a data format as possible.

The following sections explore the various steps of processing XML and the tools available to accomplish these tasks.

Parsers

The first step for any system that plans to make use of XML documents is to actually read the documents into memory. Although this may seem like a simple task, the structured nature of XML imposes several requirements on parsers. In addition, the behavior of parsing applications needs to be consistent so that XML documents can be reliably exchanged between disparate systems. As a result, XML parsers must adhere to a certain accepted level of compliance.

Because an XML document is just a text file, any developer can write a program to read the XML text and take it apart for use in an application. However, the time and complexity involved in writing such an XML document reader (which, by the way, would have to be written over and over again for the different programs that need access to the information in XML documents) would make the adoption of XML an onerous task. The W3C (the XML standardization body) recognized that a standard mechanism was needed to parse these XML documents and promoted the use of compliant XML parsers. As a result, a number of widely available XML parsers exist that allow the application developer to focus on application-specific code rather than on XML document reading or processing.
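
As a rough illustration of what an existing parser saves the application developer, the sketch below hands a small document to a ready-made parser and then works only with the parsed result. It uses Python's standard-library xml.etree.ElementTree module purely as one example of such a parser; the sample order document, its element and attribute names, and the summarize_order function are hypothetical and not taken from this chapter.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for illustration only.
ORDER_XML = """<order id="1001">
  <item sku="A-1" quantity="2">Widget</item>
  <item sku="B-7" quantity="1">Gadget</item>
</order>"""


def summarize_order(xml_text):
    """Return the order id and a list of (sku, quantity, name) tuples."""
    # The parser does all the text handling; the application only
    # sees elements, attributes, and character content.
    root = ET.fromstring(xml_text)
    items = [
        (item.get("sku"), int(item.get("quantity")), item.text)
        for item in root.findall("item")
    ]
    return root.get("id"), items


order_id, items = summarize_order(ORDER_XML)
print(order_id, items)
```

All of the application-specific logic lives in summarize_order; none of it deals with angle brackets, quoting, or character escaping.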

There are two types of XML parsers: validating parsers and nonvalidating parsers. Nonvalidating parsers merely read XML documents and verify that the documents are well formed. Validating parsers also check well-formed documents for compliance with a DTD, XML Schema, or other validation set. Nonvalidating parsers are therefore much easier to program and can be made extremely efficient and space conserving. The first generation of XML parsers was nonvalidating because the DTD and XML Schema proposals were far from stable. As the specifications became more stable, the number of validating parsers increased. As a result, many of the parsers currently on the market (commercial or open source) are validating parsers that have progressively become more robust and efficient.

Because of the added complexity of ensuring validity and compliance with a DTD or schema, validating parsers tend to be much larger in memory and processing footprint than nonvalidating parsers. If most of the XML in a particular system is well formed and doesn’t need to be checked for validity, the use of a nonvalidating parser may be a better idea.
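
The difference between the two parser types can be sketched with a well-formedness check. The example below assumes Python's standard-library xml.parsers.expat module, which wraps the nonvalidating expat parser; the is_well_formed helper and the sample documents are hypothetical. The third document is well formed but breaks its own DTD, so a nonvalidating parser accepts it, whereas a validating parser would be expected to reject it.

```python
from xml.parsers import expat


def is_well_formed(xml_text):
    """Return True if a nonvalidating parser accepts the document."""
    parser = expat.ParserCreate()
    try:
        # The second argument marks this as the final chunk of input.
        parser.Parse(xml_text, True)
    except expat.ExpatError:
        return False
    return True


GOOD = "<note><to>Ann</to></note>"

# Not well formed: <to> is never closed.
BAD = "<note><to>Ann</note>"

# Well formed, but invalid against its own DTD, which allows only <to>.
INVALID_BUT_WELL_FORMED = (
    "<!DOCTYPE note [<!ELEMENT note (to)> <!ELEMENT to (#PCDATA)>]>"
    "<note><from>Bob</from></note>"
)

for doc in (GOOD, BAD, INVALID_BUT_WELL_FORMED):
    print(is_well_formed(doc))
```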

Examples of nonvalidating parsers include James Clark’s expat, XP, and Lark. Examples of validating parsers include IBM’s XML for Java, the DataChannel XML Parser (DXP), Daniel Veillard’s libXML, and Apache’s Xerces. Microsoft’s MSXML includes both validating and nonvalidating parsers that support a variety of platforms. These parsers run the gamut from open source efforts to commercial products, from extremely tiny implementations to large, robust efforts. Information about these tools, along with links to further details, is included in the chapters that cover them in more detail.

Event-based parsers such as SAX provide a view of XML documents that is data centric and event driven. When an application reads an XML document using SAX, elements encountered by the parser are read, processed, and then forgotten. The event-based parser reads the elements from the document and returns them to the application with a list of attributes and content. This approach gives the user a more efficient means of processing XML documents: processing is faster and requires less code and memory, primarily because an in-memory tree representation of the XML document is not required. Event-based APIs merely report parsing events such as the start and end of XML markup, which are processed by application event handlers through callbacks. This mechanism is widely used in many “process-and-forget” systems and is especially appropriate for XML-based messaging and transaction systems, where keeping the XML tree in memory is simply not appropriate.
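
The callback mechanism described above can be sketched with Python's standard-library xml.sax module, used here as one available SAX implementation. The handler class, the total_quantity function, and the sample order document are hypothetical. The handler keeps only a running total and a short event log, and no tree of the document is ever built.

```python
import xml.sax

# Hypothetical sample document, invented for illustration only.
ORDER_XML = b"""<order id="1001">
  <item sku="A-1" quantity="2">Widget</item>
  <item sku="B-7" quantity="1">Gadget</item>
</order>"""


class QuantityTotalHandler(xml.sax.ContentHandler):
    """Accumulates item quantities as start-element events arrive."""

    def __init__(self):
        super().__init__()
        self.total_quantity = 0
        self.events = []

    def startElement(self, name, attrs):
        # Called by the parser each time an opening tag is read.
        self.events.append("start:" + name)
        if name == "item":
            self.total_quantity += int(attrs["quantity"])

    def endElement(self, name):
        # Called by the parser each time a closing tag is read.
        self.events.append("end:" + name)


def total_quantity(xml_bytes):
    """Parse the document once and return (total, event log)."""
    handler = QuantityTotalHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.total_quantity, handler.events


total, events = total_quantity(ORDER_XML)
print(total)
print(events)
```

Each element is handled at the moment its event fires and is then discarded, which is the "process-and-forget" pattern; the event log is kept here only to make the order of callbacks visible.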
