The development of XML was not an epiphany that came to a lone inventor working in isolation, nor was it conceived of as part of a corporation’s product-development efforts. Rather, XML is an evolution of data formats that existed previously but solved problems of different sorts. To understand XML, one needs to understand these formats and how their limitations prevented their widespread adoption.


Standard Generalized Markup Language (SGML)


With roots reaching back to 1969 and ISO standardization in 1986, SGML is the forefather of all markup languages. It introduced the notion that data processing and document processing could be one and the same thing, but we’re getting ahead of ourselves here.

Note: SGML is formally standardized as ISO specification 8879:1986. You can obtain more information from the World Wide Web Consortium (W3C) Web site at http://www.w3.org/MarkUp/SGML/.

Computers have long been used for document and text processing. In the early days, computers assisted in document preparation and typesetting. They allowed copy creators and editors to quickly prototype how a specific document would look before it was printed on a traditional printing press. As computing progressed, so did its application in the document-preparation industry. The advent of word processors necessitated a means to indicate how content was to be formatted for printing. Because software applications at the time were text based with no graphical capabilities to speak of, the text contained in the documents was “marked up” using textual commands that were later processed by the final printing destination. These so-called markups surrounded the text and explained how it was to be handled for printing. They included notations for boldface, underline, font sizing, placement, and other such commands. Word processors didn’t invent markup as a concept; markup has long been common in document creation and editing. Editors have used markup for decades, if not centuries, to indicate their revisions and changes to text. Word processors merely implemented a way to encode markup in a computer-based system.


The number and type of such markups proliferated with the number of word processing formats. Markup languages such as troff, Rich Text Format (RTF), and LaTeX were created to meet these needs. An example of LaTeX can be found in Listing 1.1. (This code is from a Web site on LaTeX at http://www.oxy.edu/~jquinn/home/Math400/LaTeX/thesis-example-latexcode.html.) Eventually, the development of graphical WYSIWYG (What You See Is What You Get) systems eliminated the need for textual markup of documents to indicate their final presentation format. However, the legacy of markup lives on.



LISTING 1.1 LaTeX Example

\setlength{\evensidemargin}{0.2in}
\setlength{\topmargin}{-.6in}
\begin{document}
\newtheorem{lemma}{Lemma}[section]

\title{Title For a Sample Comprehensive Paper}

\author{ Your Name Here \\Department of Mathematics \\Occidental College \\ \\
{\it Submitted in partial fulfillment of the requirements for the degree}\\
{\sc Bachelor of Arts}}

Every paper needs to begin with an abstract. This is a brief overview of the entire paper. It should be independent of the body of the paper (i.e. no referencing things to come). If you feel a definition is needed to make the ideas here clear, then by all means include it. A lazy reader should be able to get the entire gist of your work by reading the abstract to be able to determine if it is worth reading more of the paper.

The introduction serves to acquaint the reader with your topic and place it in a greater perspective. Notation and definitions which are used throughout the work should be presented here. You may find yourself repeating the ideas in
the abstract --- that's okay. They should be more fleshed out in the introduction.

\section{Main Body}

SGML built upon this markup history by providing a common format for defining and exchanging markup between systems that do not inherently share the same markup language. In 1969, IBM sought to simplify the tasks of creating, archiving, searching, and managing legal documents. Charles Goldfarb led the effort to create the system and define a format to meet these needs. In the process, Goldfarb, along with his coworkers Ed Mosher and Ray Lorie, realized that IBM’s multiple systems stored their information in different formats. Producing an application and data format that would work across these systems and produce a unified result meant that a standard format would have to be created. The solution took the form of the Generalized Markup Language (GML), the initials of which are also the initials of its creators’ surnames. GML was designed to provide a standard means for marking up content that could then be archived, managed, and searched, and it became the foundation on which SGML was later built. See Listing 1.2 for an example of an SGML document.

LISTING 1.2 SGML Example

<!DOCTYPE book [
<!ELEMENT book O O ((title & subject & author & ISBN?), body)>
<!ELEMENT body - O (bodylines+)>
<!ELEMENT bodylines O O (#PCDATA)>
<!ELEMENT (title, subject, author, ISBN) - O (#PCDATA)>
]>

<title>Little Miss Muffet</title>
<subject>Children’s fairy tale</subject>
<author>Mother Goose</author>

<bodylines>Little Miss Muffet</bodylines>
<bodylines>Sat on her tuffet</bodylines>



SGML also introduced the notion of a generalized document format. Rather than relying on proprietary, custom markup languages that could not be exchanged between systems, SGML defined a common means for defining markup. Systems that complied with the SGML specification could communicate with each other, even if competing vendors created them. SGML also brought forth the idea that documents can have custom types that indicate the nature and purpose of the information they contain. Rather than imposing a single, monolithic specification to be used across all industries, SGML anticipated that each industry would be concerned with its own way of representing information. Each of these industries would be able to maintain a Document Type Definition (DTD) for itself and thus be able to exchange documents in an even more specific, standardized manner.


Together, these features transformed the simple document into a representation of text content and its associated data. SGML proved very early on that document processing and data processing could be one and the same. This idea would be carried forward in the development of its successor formats: XML and HTML.


However, as SGML development progressed, the language became increasingly bloated and complicated. Both creating and parsing SGML documents were difficult, and the various “optional” features of SGML started to hamper its widespread adoption. The SGML specification was inevitably pulled in different directions by many conflicting industry groups, each of which wanted to make sure the language met its needs. As a result, the creation of a simple, generic parser for the language was a difficult proposition at best.


Nevertheless, the legacy of SGML lives on, not only in the number of documents created in the language, but also in subsequent formats that borrowed heavily from its design while attempting to sidestep some of its complexities.


Hypertext Markup Language (HTML)


SGML could have continued its steady growth as the only generalized markup language in use if it weren’t for the sudden emergence of the Web and its own format for data exchange—the Hypertext Markup Language (HTML).


Although the Internet has been around since the late 1960s, it was the development of the Web that truly brought the Internet into its current prominence and widespread usage. The Web finally put a visual, interactive, and easy-to-use front end on a network system that had formerly been dominated by applications such as Telnet, FTP, and Gopher. The Web provided users a means to easily create repositories of knowledge that could be linked with one another as well as contain graphical images and well-formatted layouts. What’s more, the Web was based, in part, on SGML.


In 1989, a researcher named Tim Berners-Lee at CERN, the European Organization for Nuclear Research, proposed that information collected and produced by the facility could be shared in a more interactive and visual manner. Berners-Lee looked at what SGML had to offer on this subject and, upon further exploration, realized that he could create a simple DTD based on SGML that would allow users to create simple hypertext-linked documents. He named this DTD and its subsequent development the Hypertext Markup Language (HTML), a sample of which can be seen in Listing 1.3.

LISTING 1.3 HTML Example

<TITLE>This is an HTML Hello World!</TITLE>

<H1>Hello World!</H1>

<FONT SIZE="2">Using a Font Tag, with <B>Boldface</B> and <I>Italics</I></FONT>






However, HTML is nothing like SGML when it comes to the strictness and complexity of the language. HTML was developed relatively quickly and was meant to solve a fairly simple job. It was created with ease of authoring in mind; therefore, “sloppiness” was allowed to thrive. In fact, this sloppiness may be the very reason why the Web exists in the first place. Because it was so easy to create HTML documents and browsers, the format flourished, filling a vacuum on the Internet. Users were craving a document format that could express their ideas in a visual, linked manner, and HTML met this need.

Because it borrows much of its functionality from SGML, HTML provides many similar features: the use of angle-bracketed elements and attributes as well as a structure defined by a DTD that is independent of display mechanisms. This latter part became increasingly fuzzy as the various Internet browser vendors started to battle over control of the market. In particular, Microsoft and Netscape sought to add their own proprietary elements to the HTML language that would be understandable only by their respective browser platforms. This violated the basic tenet of SGML that a markup language should be standardized and generalized.


In addition, HTML solved only one part of the SGML realm of problems: the presentational and layout aspects. HTML was aimed squarely at representing information for display on a browser or other display devices such as cell phones and handheld devices. The language was never intended as a means for storing data and metadata (information that describes data) or for providing a framework for users to exchange data in a structured manner. In effect, HTML separated data processing from document processing, the two notions that SGML had unified.
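To make the point concrete, the following minimal sketch runs a fragment modeled on Listing 1.3 through Python's standard html.parser module and collects the tags it finds. The class name TagCollector and the variable names are illustrative assumptions, not part of any HTML tooling discussed here. Every tag that comes back (h1, font, b, i) says how the text should look; none says what the text is.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record every start tag and its attributes in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        self.tags.append((tag, dict(attrs)))


# Fragment modeled on Listing 1.3 (straight quotes around the attribute value).
fragment = (
    '<H1>Hello World!</H1>'
    '<FONT SIZE="2">Using a Font Tag, with <B>Boldface</B> '
    'and <I>Italics</I></FONT>'
)

collector = TagCollector()
collector.feed(fragment)
collector.close()

for tag, attrs in collector.tags:
    print(tag, attrs)
```

A program reading this markup can learn that some text is a heading or is bold, but it has no way to tell whether that text is a title, a price, or an author name. That is the gap a data-oriented markup language needs to fill.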


It soon became clear that a language such as SGML was once again needed on the Internet. HTML was not adequate for the extensible, data-oriented nature of information exchange, and SGML was too complex and not native to the Internet environment.


Electronic Data Interchange


Of course, HTML and SGML were not the only data formats in existence prior to the emergence of XML. In the electronic commerce and business communities, another acronym held even more sway than SGML.


The Transportation Data Coordinating Committee (TDCC) developed the Electronic Data Interchange (EDI) format in the early 1970s as a means for transportation industry vendors to specify transaction sets that enabled electronic processing of purchase orders and bills. At the time, computing power was concentrated in isolated mainframes that had low storage capacity and even lower bandwidth capabilities for exchanging information.


Because freight transactions were dominated by high-volume, low-dollar transactions, transportation suppliers were early adopters of EDI standards. Many large carriers and shippers achieved significant productivity gains by switching their internal, paper-oriented systems to electronic transactions enabled by EDI.


Because no established message-transport infrastructure, standardized business process rules, or file formats existed in EDI’s formative years, the EDI format carried with it specifications for how the messages were to be exchanged and processed. Before the Internet came into widespread use, EDI messages were sent across private value-added networks (VANs) that ensured that transactional messages reached their destination with security, integrity, and messaging validity, along with receipts that guaranteed the messages were received. The EDI transaction sets also contained strict business rules on how the messages were to be handled.


The EDI file format used a fairly arcane syntax that was unintelligible to most humans. Just looking at Listing 1.4 is enough to give many of us headaches. The structure favored efficiency and compactness over flexibility and human readability. Consequently, dedicated EDI parsers and processors were needed to create, read, and manage these files. In general, two parties that wished to conduct an EDI transaction would need to enter into a trading agreement, choose a VAN for message delivery, build or buy software to map between internal data formats and EDI messages, and build translators to interpret the sender’s message into the company’s native data format. Each of these steps had to be repeated for every new trading partner added to the network. In addition, VANs charged monthly and per-transaction fees for handling these messages. It is no wonder that the implementation cost and complexity of EDI systems were so high. It is also no wonder that only the large manufacturers were able to afford to participate!

LISTING 1.4 EDI Example

ISA*00* *00* *01*003897733 *12*PARTNER ID*980923*1804*U*00200*000000002*0*T*@
GS*PO*MFUS*PARTNER ID*19980924*0937*3*X*004010

CUR*BY*USD
TAX*1-00-123456-6
FOB*DF***02*DDP

ITD*01*ZZ*****45*****NET 45 - Payment due 45 days from Document Date
TD5*Z****Ship via Airborne

N1*ST*Acme Hardware Corporation*92*0000002924
N3*123 Random Hill Rd














Each of the EDI transaction sets defines which fields of data are contained in a specific transactional message. The format defines the fields themselves, their order of appearance, and the length of the information contained within. A number of “implementation guidelines” are also applied to the transaction sets to assist in the development of valid EDI messages.
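The positional nature of these fields can be illustrated with a minimal Python sketch that splits two of the segments from Listing 1.4 on the asterisk separator. The function name parse_segment is a hypothetical helper introduced here, and the sketch assumes that "*" is the element separator and that each segment arrives as one string. It does not check field lengths or any transaction-set rules, which a real EDI processor would have to do.

```python
from typing import List, Tuple


def parse_segment(segment: str, element_sep: str = "*") -> Tuple[str, List[str]]:
    """Split one EDI segment into its identifier and its ordered data elements.

    Hypothetical helper for illustration: it only splits on the separator
    and performs no validation of lengths or allowed values.
    """
    segment_id, *elements = segment.strip().split(element_sep)
    return segment_id, elements


# Two segments taken from Listing 1.4.
name_id, name_elements = parse_segment("N1*ST*Acme Hardware Corporation*92*0000002924")
fob_id, fob_elements = parse_segment("FOB*DF***02*DDP")

print(name_id, name_elements)
print(fob_id, fob_elements)
```

Note that the FOB segment yields empty strings for its second and third elements. Because meaning is carried by position rather than by a name, an unused field must still hold its place, which is part of what makes the format compact but hard for people to read.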


The EDI transaction sets were developed by two separate bodies: the American National Standards Institute’s (ANSI) Accredited Standards Committee (ASC) X12 and the United Nations Electronic Data Interchange for Administration, Commerce, and Transport (UN/EDIFACT) effort. Whereas ANSI X12 met the needs of North American commerce users, EDIFACT was focused on meeting more international needs. Later, the ANSI ASC X12 effort was moved to the Data Interchange Standards Association (DISA) for ongoing management. With two bodies maintaining them, the specifications diverged somewhat, and the “standard” nature of EDI was rapidly degraded.


EDI has been used as the basis for a number of industry-specific standards efforts. In particular, the healthcare industry has used EDI to define its Health Level Seven (HL7) standard, which is in use by most of the world’s hospitals and insurance companies for exchanging healthcare and health insurance information. In addition, other groups including automotive, insurance, government, retail, and grocery industries have looked to EDI as a format on which to base their business-to-business interactions.


However, many of the supposed gains that EDI was to deliver were never realized because the electronic applications could not eliminate the paper processes needed to support the business. EDI exhibits the “80/20 rule,” which states that the last 20 percent of a company’s trading partners to be implemented in EDI will represent 80 percent of its savings. The reason for this is simple: The trading partners that still conduct business in paper formats and processes still need to be supported. That means dual and somewhat redundant processes, one electronic and one paper, need to be supported. This is very inefficient in the long run. In addition, EDI was never really able to help small and medium-sized trading partners participate in the electronic commerce game. This is primarily due to EDI’s cost and the complexity of implementation. It was simply too expensive to get all the small-business suppliers to switch from their paper processes to EDI. This meant that the returns for everyone were greatly diminished.


Another of EDI’s problems is its reliance on fixed transaction sets. The rigidity of these transaction sets makes EDI resistant to the natural changes that occur in business processes and methodologies. This rigidity is reflected in the strict manner in which EDI messages must be processed and in the standardization process by which these transaction sets are defined. Transaction sets have a well-defined field format and structure. Companies are not free to add their own data elements or redefine data structures. As a result, many users have had to implement EDI in a nonstandard manner in order for it to serve their business needs.


The EDI industry has sought to fix many of these shortcomings by embracing the Internet as a means of transport and by relaxing many of the strict processing requirements. EDI has actually made some significant strides in the past five or so years in trying to adapt to the rapidly changing business frontier. In this regard, it is unlikely that EDI is going to disappear entirely. Rather, we may find that within EDI’s already large community base, its use will solidify. However, as a means for transporting data in general or as a solution for e-business for the community at large, EDI has had its day in the sun, and now XML is due to bask in some of the sunlight.


The investment that many companies have made in EDI is not going to simply be thrown away, however. Many companies are looking to leverage their EDI expertise into crafting XML solutions that take advantage of the EDI infrastructure, business processes, and architecture. In fact, a number of XML proposals seek to “XML-enable” EDI by simply replacing the arcane EDI format with XML tags. Others seek to mirror the transaction sets using a similar XML-based element structure. In any case, many companies are seeking to soften the transition from EDI to XML-based systems by utilizing the decades of experience in EDI systems and using this experience to create robust XML-based systems.
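As a rough illustration of what “XML-enabling” an EDI message might look like, the sketch below wraps the elements of one segment from Listing 1.4 in XML tags using Python's standard xml.etree.ElementTree module. The function name segment_to_xml and the element-naming scheme (segment identifier plus a two-digit position, such as N301) are assumptions made for this example; they do not come from any particular XML/EDI proposal.

```python
import xml.etree.ElementTree as ET


def segment_to_xml(segment: str, element_sep: str = "*") -> ET.Element:
    """Wrap one EDI segment's data elements in XML child elements.

    Hypothetical mapping for illustration: each non-empty data element
    becomes a child named <segment id><two-digit position>.
    """
    segment_id, *elements = segment.strip().split(element_sep)
    node = ET.Element(segment_id)
    for position, value in enumerate(elements, start=1):
        if value:  # skip empty positions; the tag name records the position
            child = ET.SubElement(node, f"{segment_id}{position:02d}")
            child.text = value
    return node


# The N3 (address) segment from Listing 1.4.
xml_text = ET.tostring(segment_to_xml("N3*123 Random Hill Rd"), encoding="unicode")
print(xml_text)
```

A mapping like this keeps the EDI transaction structure intact while making each field self-describing, so empty positions no longer need placeholder separators. Proposals that mirror the transaction sets with more descriptive element names take the same idea one step further.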

