The Promise of XML
· Advantages of XML over SGML
· Advantages of XML over HTML
· Advantages of XML over EDI
· Advantages of XML over Databases and Flat Files
· Drawbacks to XML
· XML-Based Standards
What can XML offer that these other various formats have been unable to deliver at this point? How will XML make our lives better, make our systems more efficient, lower our costs, and increase our revenues? How will XML make the task of representing, storing, and exchanging data an easier process than using SGML, HTML, or EDI?
Benefits of XML
The very nature of XML is that it is a structured document format that represents not only the information to be exchanged but also the metadata encapsulating its meaning. Most information has structure of some type. For example, information about a book contains information about the title, author, chapters, body text, and index. In turn, body text contains paragraphs, line text, and footnotes. This information is structured because a document that describes a book needs to describe that information in a way that a person or machine can understand. Author information should not be contained within the index section, and vice versa. Although SGML has provided this functionality to XML by virtue of being a “parent language,” XML has simplified the process of defining and using this metadata.
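The book example above can be sketched in a few lines of Python using the standard library's xml.etree.ElementTree module. The element names (book, chapter, body, and so on) are assumptions made up for illustration, not part of any standard vocabulary.

```python
import xml.etree.ElementTree as ET

# Hypothetical book document; the element names are illustrative only.
BOOK_XML = """<book>
  <title>Example Title</title>
  <author>A. Writer</author>
  <chapter number="1">
    <body>
      <paragraph>First paragraph of body text.</paragraph>
      <footnote>A footnote attached to the body text.</footnote>
    </body>
  </chapter>
  <index>
    <entry>metadata</entry>
  </index>
</book>"""

book = ET.fromstring(BOOK_XML)

# The tags are the metadata: they state what each piece of text means
# and where it belongs in the structure of the book.
title = book.findtext("title")
author = book.findtext("author")
paragraphs = [p.text for p in book.iter("paragraph")]
index_entries = [e.text for e in book.findall("index/entry")]

print(title, "by", author)
print(paragraphs)
print(index_entries)
```

Because the author sits in its own element rather than at some position in a line of text, a program can ask for it by name and cannot confuse it with an index entry.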
Although XML is fairly simple in nature, in that it only needs to follow basic syntax rules to be considered “well-formed,” one of the biggest features of the language is its ability to provide a means for guaranteeing the validity of a document. This means that not only can you send a document to a receiving party but you can also send criteria, in the form of Document Type Definitions (DTDs) or other schema formats, with which the document must comply. For example, criteria may specify that an XML document should contain only the listed set of elements and attributes in a specific order and in given quantities. In this sense, XML documents come with error and validity checking built in. The DTD or schema that is referred to by an XML document can guarantee, at the time of document creation, that all the elements are correctly specified and in the correct order. Furthermore, the use of a more advanced validity-guaranteeing mechanism such as XML Schema can help guarantee that the values of the element content itself are valid and fall within acceptable ranges. Documents can be validated at their time of creation or at their time of receipt, and they can be rejected or accepted on an automated basis without human intervention. At design time, errors can be fixed before transmission; upon receipt, an invalid document can be sent back to the sender for further human processing, with the location of each error pinpointed exactly.
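As a minimal sketch of automated acceptance and rejection, the Python below checks only well-formedness and reports where an error occurred. The parser in Python's standard library is non-validating, so checking a document against a DTD or XML Schema would need a validating parser, which is not shown here. The function name check_well_formed and the sample documents are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

def check_well_formed(text):
    """Return None if text is well-formed XML, else (line, column, message)."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        line, column = err.position  # where the parser gave up
        return (line, column, str(err))
    return None

good = "<order><item>widget</item></order>"
bad = "<order>\n  <item>widget</order>"  # <item> is never closed

print(check_well_formed(good))  # None: the document is accepted
print(check_well_formed(bad))   # a tuple locating the error on line 2
```

A receiving system could run a check like this on every incoming document and return the reported line and column to the sender without human involvement.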
Validity-checking software is also very low cost, if not free. Most parsers on the market are available in open-source form and come with validation capabilities built in. Although many of these are currently only DTD compliant, the move to XML Schema–based validity checking is well under way. Batches of documents can be checked for compliance against a single DTD or schema, or they can be checked against different schemas based on their destination or origination. Although the use of a DTD or schema does not guarantee 100-percent validity, it goes a long way toward ensuring that the vast majority of documents exchanged and received fit an acceptable policy.
One benefit of using XML with DTDs or schemas is that XML editors provide structured editing “for free.” As a developer, how many times have you run a processor on some formatted file only to get a complaint about a syntax error at line 37? Editing software that only allows you to enter valid XML will catch many of these errors as you type them. From another perspective, editors can automatically create a form-style interface from a DTD or schema. Therefore, XML can provide a simpler user interface and eliminate some of the complexity of creating XML documents.
XML takes advantage of existing Internet protocols, and as such, designers choosing to use XML in their solutions don’t have to create new protocols as a means for transporting their documents. Designing a new protocol today may not make sense when existing and well-understood protocols such as HTTP exist. Using these protocols makes the document more portable across multiple platforms, easier to debug, and easier to qualify and route. In addition, HTTP as a protocol is well understood, and IT engineers know how to manage HTTP traffic. Using a new protocol would require inventing a protocol to go over the wires, which would necessitate identifying new data streams for firewalls, management of the traffic, and a whole ball of wax that is simply not necessary for a structured data format.
Because XML is a structured document that shares many of the same processing and parsing requirements as SGML and HTML, plenty of generally available parsers have been built. Many of these parsers are now built in to general browsers and server-side agents. Chapter 2 talks about these various client-side and server-side parsers and processors and explains which tools are available for use today.
In addition, the Document Object Model (DOM) has been created by the W3C as a general model for how parsers and processors should interact and process XML documents for representation as a data-bound tree. As a result, the DOM has produced a generic, universal method for processing XML documents. Applications that require XML processing can access this wealth of tools and specifications and thus add parsing in a relatively pain-free way. Developers do not have to write new parsers, unless they really want to. Many parsers exist in a wide variety of languages, and many of these are free.
Another oft-cited benefit of XML is its ability to be read and written by humans, rather than created by applications in a machine-only readable format. Although many say that XML will be primarily used for machine-to-machine communication and can be created using visual tools that don’t necessitate the actual editing of the code, experience with HTML has shown that there are numerous occasions when a developer has to “dip in” to the actual document and make adjustments. It is for this reason that XML is plain text and uses elements that represent actual words or phrases that contain some semantic meaning.
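To give a feel for the DOM's tree model, here is a small sketch using xml.dom.minidom, a lightweight DOM implementation that ships with Python. The catalog document and its element names are assumptions for illustration.

```python
from xml.dom.minidom import parseString

# Hypothetical catalog document used only for this example.
doc = parseString(
    "<catalog><book id='b1'><title>XML Basics</title></book></catalog>"
)
root = doc.documentElement

# Navigate the tree: catalog -> book -> title -> text node.
book = root.getElementsByTagName("book")[0]
title_node = book.getElementsByTagName("title")[0]
title_text = title_node.firstChild.data

# Modify the tree through the same interface.
new_book = doc.createElement("book")
new_book.setAttribute("id", "b2")
root.appendChild(new_book)

book_ids = [b.getAttribute("id") for b in root.getElementsByTagName("book")]
print(title_text, book_ids)
```

The same method names (getElementsByTagName, createElement, appendChild) are defined by the DOM for other languages as well, which is what makes the model generic.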
XML represents information and the metadata about that information; therefore, it does not specify any particular manner for how the data should be processed or provide any constraints for mechanisms with which to handle the information. This is in contrast to other formats, such as EDI, certain types of text files, and databases, that explicitly require accessing the documents in a specific manner. In those formats, the files themselves define how the information is to be processed and what requirements systems must have in order to make sense of the documents. XML documents, by contrast, simply encode information and its metadata without specifying how the information is to be processed or displayed.
Often, the capability of XML to separate its process and data content is known as being future-proof or loosely coupled, depending on which end of the marketing spectrum you stand. Future-proof in this instance means that no future changes in the data-exchange layer should affect the programming layer, and vice versa. Loosely coupled systems allow for “arms-length” exchange of information, where one party does not need to know details of how the other party plans to process the information. These systems are then “loosely coupled” from the existing systems they need to integrate with or whatever system is to be in place in the future. This allows for changes in the presentation, process, and data layers without affecting the other layers.
Due to XML’s popularity, ease of use, and increasing proliferation of tools, the number of individuals and organizations skilled in XML use is growing rapidly. It is becoming considerably easier to find skilled employees and contractors who are familiar with XML, the standards, and best practices for implementing XML in multiple environments. Perhaps one of the best arguments for the use of XML is that the more people there are who make use of the language, the more it will be supported and capable of meeting your needs. Sometimes the best technologies are the ones that are the most in use, regardless of their technological advantages.
Advantages of XML over SGML
Although XML borrows much of its functionality from SGML, it provides a number of distinct advantages. Although SGML may still be suitable for content and data representation, the tide of public opinion is definitely shifting in XML’s favor. As such, it makes sense to at least consider XML in place of existing or proposed SGML implementations.
XML permits well-formed documents to be parsed without the need for a DTD, whereas many SGML implementations require some DTD for processing. XML is much simpler and more permissive in its syntax than SGML. The XML specification is very small, includes a bare-bones set of features (rather than a bunch of optional features that can make implementation costs difficult to judge), and avoids some of the stigma associated with the SGML name.
XML was created because a direct implementation of SGML on the Internet was difficult. SGML simply did too much. One of SGML’s benefits is that it provides significant flexibility for a diverse community of users by offering a wide array of choices, but this resulted in a wide range of syntactical variations for documents and produced a specification that was very difficult for developers to implement. XML 1.0 simplified the specification by eliminating unnecessary flexibility. This resulted in a specification that was both powerful and easy to implement. The goal was to meet the majority of users’ needs, without aiming to meet all of them.
Advantages of XML over HTML
HTML was created to meet a very different need than XML. It is clear that XML will not now, or perhaps ever, completely replace HTML, except of course in the form of XHTML, the XML-enabled version of HTML. HTML was designed as a language to present hyperlinked, formatted information in a Web browser. It has no capability to represent metadata, provide validation, support extensibility by users, or support even the basic needs of e-business. Fundamentally, the difference is that HTML is intended for consumption by humans, whereas XML is meant for both machine and human consumption.
Advantages of XML over EDI
EDI adoption has been fairly widespread, though mainly among larger businesses. The cost of EDI implementation and ongoing maintenance can be measured in the billions in aggregate. Millions of dollars in transactions occur on a daily basis using EDI-mediated messages. It would be very difficult, if not impossible, to uproot all this activity and replace it with exclusively XML-based transactions. These businesses have so much money and time invested in ANSI X12/EDI that they will be fairly slow to adopt a new standard, which would necessitate new processing technology, mapping software, and back-end integration. To them, it would seem that they would need to discard their existing, working technology in favor of an unproven and still immature technology.
However, XML offers a number of clear advantages over EDI, which has long had its time in the sun. XML is a good replacement for EDI because it uses the Internet for the data exchange. There have been efforts to provide mechanisms for EDI to also be transported over the Internet, but many of these have not met with much success. Recent efforts have attempted to make use of Internet protocols such as SMTP, FTP, and HTTP to transport EDI, but it is clear that the format was not originally designed or intended for such use.
Compared to EDI and other electronic commerce and data-interchange standards, XML offers serious cost savings and efficiency enhancements that make implementation of XML good for the bottom line. There are many components to document exchange and electronic commerce systems: document creation tools, processing components, validity checking, data mapping, back-end integration, access to a communications backbone, security, and other pieces of the commerce puzzle. XML greatly simplifies, if not eliminates, many of these steps.
XML’s built-in validity checking, low-cost parsers and processing tools, Extensible Stylesheet Language (XSL) based mapping, and use of the Internet keep down much of the e-commerce chain cost. In many cases, general XML tools can be found that are not only applicable to the problem to be solved, but are flexible and very inexpensive. Whereas EDI is a specific domain of knowledge and expertise that comes with a comparable price tag, XML makes use of technology that has been in use for years, if not decades. Systems that take advantage of this wealth of available processing power and know-how will greatly reduce not only their costs but also their time to implementation.
The use of the Internet itself greatly lowers the barrier for small and medium-sized companies that have found EDI too costly to implement. Simple functionality and low-cost tools will go a long way in helping these companies afford to exchange high-quality, structured documents that are capable of supporting commercial exchange and back-end integration.
As one XML user states, “XML is hip, happening, now.” EDI is perceived as crusty and old. Text files are blasé, and databases have increasingly become a staple of data storage locked in a proprietary format. The idea that XML represents a new, fresh approach to solving many lingering problems in a flexible manner appeals to many in senior management. In many instances, buying into a new technology requires the approval of the senior levels of IT, if not the corporate and management levels. With XML’s continuing positive exposure, getting management approval on an XML project is becoming an increasingly simple endeavor.
Another of the drawbacks to EDI and some text file and database formats is that they don’t easily support the needs for internationalization and localization. Specifically, in those formats it is difficult to represent information written in a non-Latin alphabet. XML, as part of its initial specification, supports these needs inherently.
XML syntax allows for international characters that follow the Unicode standard to be included as content in any XML element. These can then be marked up and included in any XML-based exchange. The use of internationalization features helps to overcome one of the early problems of other formats, which caused unnecessary schism and conflict between different geographies. For example, it is not fair that an English technical manual can be marked up in a file format if a Japanese manual can’t be likewise formatted. XML sought to solve this problem from the get-go.
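The following sketch shows non-Latin text surviving a round trip through XML serialization and parsing. It uses Python's standard library; the manual and title element names and the plain lang attribute are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Build a small document holding the same title in English and Japanese.
manual = ET.Element("manual")
ET.SubElement(manual, "title", {"lang": "en"}).text = "User Guide"
ET.SubElement(manual, "title", {"lang": "ja"}).text = "ユーザーガイド"

# Serialize to UTF-8 bytes, as would be written to disk or sent over a network.
encoded = ET.tostring(manual, encoding="utf-8")

# Parse the bytes back and collect the titles by language.
decoded = ET.fromstring(encoded)
titles = {t.get("lang"): t.text for t in decoded.findall("title")}
print(titles)
```

Both titles are marked up in exactly the same way, and neither needs special handling by the application.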
Advantages of XML over Databases and Flat Files
XML is a structured document format that includes not only the data but also metadata that describes that data’s content and context. Most text files simply cannot offer this clear advantage. They either represent simply the information to be exchanged without metadata or include metadata in a flat, one-level manner. Common file exchange formats such as comma-delimited and tab-delimited text files merely contain data in predefined locations or delimitations in the files. Complex file formats such as Microsoft Excel contain more structured information but are machine-readable only and still do not contain the level of structuring present in XML.
Relational and object-oriented databases and formats can represent data as well as metadata, but for the most part, their formats are not text based. Most databases use a proprietary binary format to represent their information. There are other text-based formats that include metadata regarding information and are structured in a hierarchical representation, but they have not caught on in popularity nearly to the extent that XML or even SGML has.
Although text files can also be transmitted via e-mail and over the Web, structured formats such as relational and object-oriented databases are not easily accessible over the Internet. Their binary-based formats and proprietary connection mechanisms preclude their ability to be easily accessible via the Internet. Many times, gateway software and other mechanisms are needed to access these formats, and when they are made accessible it usually is through one particular transport protocol, such as HTTP. Other means for accessing the data, such as through e-mail and FTP, are simply not available.
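To make the contrast concrete, the sketch below reads the same hypothetical customer record from a comma-delimited line and from an XML document. The field layout and element names are assumptions for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# The same record in two formats (hypothetical data).
CSV_TEXT = "1001,Ada Lovelace,London,UK\n"
XML_TEXT = """<customer id="1001">
  <name>Ada Lovelace</name>
  <address>
    <city>London</city>
    <country>UK</country>
  </address>
</customer>"""

# Delimited text: the meaning of a value depends on its position,
# and that knowledge lives outside the file.
row = next(csv.reader(io.StringIO(CSV_TEXT)))
csv_city = row[2]

# XML: the value is found by name, and nesting shows that the city
# belongs to the address rather than to the customer directly.
customer = ET.fromstring(XML_TEXT)
xml_city = customer.findtext("address/city")

print(csv_city, xml_city)
```

If a column were inserted into the delimited file, row[2] would silently return the wrong field, whereas the lookup by element name would be unaffected.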
One of the primary issues faced by alternate file format and database languages is that processing tools are custom, proprietary, or expensive. When tools are widespread, they are usually specific to the particular file format in question. One of XML’s greatest strengths is that processing tools have become relatively widespread and inexpensive, if not free.
Drawbacks to XML
One of the most notable and significant “knocks” against XML is that it’s huge. XML takes up lots of space to represent data that could be similarly modeled using a binary format or a simpler text file format. The reason for this is simple: It’s the price we pay for human-readable, platform-neutral, process-separated, metadata-enhanced, structured, validated code.
And this space difference is not insignificant. XML documents can be 3 to 20 times as large as a comparable binary or alternate text file representation. The effects of this overhead should not be underestimated. It’s possible that 1GB of database information can result in as much as 20GB of XML-encoded information. This information then needs to be stored and transmitted over the network, facts that should make computer, storage, and network hardware manufacturers very happy indeed!
Let’s also not forget that computers need to process this information. Large XML documents may need to be loaded into memory before processing, and some XML documents can be gigabytes in size! This can result in sluggish processing, unnecessary reparsing of documents, and otherwise heavy system loads. In addition, much of the “stack” of protocols requires fairly heavy processing to make it work as intended. For example, the Simple Object Access Protocol (SOAP), which is a cross-platform messaging and communication protocol for use in remote procedure calls (RPCs) between and within server systems, is a very heavy protocol to manipulate on-the-fly. The marshalling that occurs in the process of working with the protocol can cause system performance to be quite poor because XML is, after all, a text-based format that is being used to make RPCs between systems. Using XML in this transactional, real-time manner may impose more parsing and processing requirements than the system can handle.
In addition, a problem of many current XML parsers is that they read an entire XML document into memory before processing. This practice can be disastrous for XML documents of very large sizes. XML is not only a data language but a complicated one at that (from a parsing perspective). It oftentimes increases code complexity, because XML can be more difficult to parse than a simpler data format such as comma- or tab-delimited fields.
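The size overhead is easy to measure on a small scale. The sketch below serializes the same made-up records as comma-delimited text and as XML and compares the byte counts. The record layout and element names are assumptions, and the ratio depends heavily on tag length and data shape, so the number it prints is not a general figure.

```python
import csv
import io
import xml.etree.ElementTree as ET

# 1,000 hypothetical records: (id, name, balance).
records = [(i, f"customer-{i}", i * 10) for i in range(1000)]

# Comma-delimited representation.
buf = io.StringIO()
csv.writer(buf).writerows(records)
csv_bytes = len(buf.getvalue().encode("utf-8"))

# XML representation with one element per field.
root = ET.Element("customers")
for ident, name, balance in records:
    c = ET.SubElement(root, "customer")
    ET.SubElement(c, "id").text = str(ident)
    ET.SubElement(c, "name").text = name
    ET.SubElement(c, "balance").text = str(balance)
xml_bytes = len(ET.tostring(root, encoding="utf-8"))

ratio = xml_bytes / csv_bytes
print(f"CSV: {csv_bytes} bytes, XML: {xml_bytes} bytes, ratio: {ratio:.1f}x")
```

Every value in the XML version carries an opening and a closing tag, which is where the extra bytes come from.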
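Not every parser has to hold the whole document at once. As a minimal sketch of the alternative, the Python below uses the standard library's iterparse function to handle one element at a time. The orders document, the amount attribute, and the sum_amounts helper are assumptions for illustration.

```python
import io
import xml.etree.ElementTree as ET

def sum_amounts(stream):
    """Add up the amount attribute of every <order> without keeping
    the content of already-processed elements in memory."""
    total = 0
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "order":
            total += int(elem.get("amount"))
            elem.clear()  # discard this element's content once handled
    return total

# Build a small hypothetical document with orders 1..100.
data = (
    b"<orders>"
    + b"".join(b'<order amount="%d"/>' % n for n in range(1, 101))
    + b"</orders>"
)
total = sum_amounts(io.BytesIO(data))
print(total)
```

The same loop works on a file opened from disk, so the approach scales to documents far larger than this one, at the cost of giving up random access to the tree.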
Despite all the added value in representing data and metadata in a structured manner, some projects simply don’t require the complexity that XML introduces. In these cases, simple text files do the job more efficiently. For example, a configuration file that includes a short list of a few commands and their values doesn’t require a multilevel, metadata-enhanced file format for its communication. Therefore, one shouldn’t take the stance that simply because XML contains structure and metadata it should be used for all file formatting and document-exchange needs.
Although XML does offer validation technology, it is not currently as sophisticated as many of the EDI syntax checkers. XML editors often lack the detail and helpfulness found in common EDI editors. Many EDI syntax editors can report error details throughout a document and can complete the parsing of the entire document. Many XML editors are unable to proceed beyond the first syntax error.
In addition, XML inherits the notorious security issues associated with the Internet, but it also inherits the possible solutions to those problems. As long as a system is designed with security in mind, exchanging XML over the Internet should be fairly problem free.
We have already discussed the advantages of the “ML” in XML, but the “X” presents advantages of its own. Extensibility, as applied to XML, is the ability for the language to be used to define specific vocabularies and metadata. Rather than being fixed in describing a particular set of data, XML, in conjunction with its DTDs and schemas, is able to define any number of documents that together form a language of their own.
Indeed, hundreds, if not thousands, of specific document vocabularies have been created based on XML to meet the different needs of healthcare, manufacturing, user interface design, petroleum refining, and even chess games. Text files and relational database schemas are rigid in that they are meant to represent the information contained within and nothing more. It would be a difficult proposition at best to add a new set of information to a text file or relational database management system (RDBMS). XML files, especially those created using an “open content model,” can easily be extended by adding additional elements and attributes. Whole classes of documents can be defined simply by sending a document with a new DTD or schema. Sharing a DTD or schema within a user community results in a joint specification, if not a de facto or explicit standard.
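As a small sketch of this kind of extension, the code below shows a consumer written against one version of a document that continues to work when a later version adds an element. The product vocabulary, the warranty element, and the read_price helper are assumptions for illustration, and the example presumes an open content model in which unknown elements are permitted.

```python
import xml.etree.ElementTree as ET

def read_price(text):
    """A consumer that knows only about <price>; it ignores everything else."""
    return float(ET.fromstring(text).findtext("price"))

# Version 1 of a hypothetical product document.
V1 = "<product><name>Widget</name><price>9.50</price></product>"

# Version 2 adds a <warranty> element that the consumer has never seen.
V2 = (
    "<product><name>Widget</name><price>9.50</price>"
    "<warranty months='12'/></product>"
)

print(read_price(V1), read_price(V2))
```

A positional format would typically need every reader updated when a field is added, whereas here only the programs that care about the new element have to change.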