Scientific and Engineering
In the early 1990s, the Internet was used mainly by scientific and educational establishments rather than by commercial entities. The foundation technologies for the Web, the Hypertext Transfer Protocol (HTTP) and the Hypertext Markup Language (HTML), were created not for the sake of online commerce but to exchange research papers in the field of physics. It therefore makes sense that XML would be a hotbed of activity in various scientific, mathematical, and engineering fields.
This section touches on two standards that have leveraged XML as their means of document exchange.
The rapid increase in the use and exchange of data in the biological fields has demanded a better way of representing, storing, and exchanging this information. The use of information in biology has spawned its own field of study, bioinformatics, and the recent explosion in genetics research has likewise required increasing attention to standardizing information storage and exchange. XML provides the technology to meet many of these needs.
Bioinformatic Sequence Markup Language (BSML)
Like every industry that has large data requirements, the bioinformatics industry faces the challenge of integrating large quantities of heterogeneous information gathered from different sources and distributed locally and over the Internet. A bioinformatic sequence is a textual encoding of a string of nucleotides, the chemical building blocks of DNA and RNA. The individual bases adenine, cytosine, guanine, thymine, and uracil are encoded as the letters a, c, g, t, and u, respectively. A sequence is an arbitrarily long string of these characters that corresponds to a particular encoding of genetic material. As researchers expand their knowledge of a particular organism’s genetic structure, the exchange of these strings of genetic encoding becomes increasingly important. XML can facilitate the discovery process by enabling researchers to integrate and annotate these sequences. XML also enables the integration of this “genomic” information with related, or “extragenomic,” information such as literature, images, and documents that support the particular genetic information being researched.
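As a minimal sketch of the encoding described above, the following Python snippet checks that a string uses only the five nucleotide letters and counts how often each one occurs. The names `NUCLEOTIDES`, `is_valid_sequence`, and `base_counts` are illustrative assumptions; they are not part of BSML or of any bioinformatics library.

```python
from collections import Counter

# One-letter codes for adenine, cytosine, guanine, thymine, and uracil.
NUCLEOTIDES = frozenset("acgtu")


def is_valid_sequence(seq: str) -> bool:
    """Return True if seq is non-empty and uses only the letters a, c, g, t, u."""
    return bool(seq) and set(seq.lower()) <= NUCLEOTIDES


def base_counts(seq: str) -> dict:
    """Count how often each nucleotide letter occurs in a valid sequence."""
    if not is_valid_sequence(seq):
        raise ValueError("not a nucleotide sequence: %r" % seq)
    return dict(Counter(seq.lower()))


if __name__ == "__main__":
    print(is_valid_sequence("acgtacgt"))
    print(base_counts("aacgt"))
```

A check of this kind is only a first filter; annotating sequences and linking them to related documents is the part that a markup vocabulary such as BSML is meant to address.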
Developed by the National Human Genome Research Institute (NHGRI) and promoted by LabBook, Inc., the Bioinformatic Sequence Markup Language (BSML) is a proposed XML standard for the communication of bioinformatics data. The BSML standard is divided into two logical parts: Definitions and Display. The Definitions section encodes the bioinformatic data, including sequences, sets, sequence features, analytical outputs, relationships, and annotations. The optional Display section encodes information for graphic representation of the bioinformatic data. Multiple users can simultaneously access the same data and examine different links, files, and sequence views without having to alter the source documents. In addition, BSML allows users to include multiple annotations, such as documents, tables, charts, and sequence features and graphs aligned to sequence maps. Although the BSML specification doesn’t require any specific browser or graphical interpretation technology, LabBook provides a viewer tailored to BSML. LabBook also develops and provides freely available tools that help create and manipulate BSML files.
The BSML specification’s main goal is to represent genetic sequences and their graphic display properties. In particular, the specification describes the features of genetic sequences, represents relationships among sequences and their features, defines graphic objects that represent sequence features and relationships, provides representation of the relationships between sequences and source documents (such as sequence and genetic marker databases), and defines methods for storing and transmitting encoded sequence and graphic information. Listing 22.9 shows a sample BSML XML instance.
LISTING 22.9 Sample BSML Instance
<!DOCTYPE Bsml SYSTEM "bsml.dtd">
<Bsml>
  <Sequence id="SEQ1" title="ECRPOBC" seq-type="dna" units="bp" length="12337" shape="linear" strands="2">
  ...
  <View id="VEW1" seqref="SEQ1">
  </View>
  ...
Even though LabBook has wrapped commercial products around the standard, BSML remains in the public domain and is supported by the LabBook efforts.
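To illustrate how a program might consume such an instance, here is a hedged Python sketch that reads a small fragment modeled on Listing 22.9 and resolves each View to the Sequence it references through `seqref`. The fragment is an assumption made for illustration: its elements are closed so that it parses, and the sketch searches with `iter()` because the exact nesting under the Definitions and Display parts is not shown in the listing. The helper names `index_sequences` and `resolve_views` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment modeled on Listing 22.9. The elements are closed here
# so the text parses; a real BSML document contains far more structure.
BSML_FRAGMENT = """
<Bsml>
  <Sequence id="SEQ1" title="ECRPOBC" seq-type="dna" units="bp"
            length="12337" shape="linear" strands="2"/>
  <View id="VEW1" seqref="SEQ1"/>
</Bsml>
"""


def index_sequences(root):
    """Map each Sequence id to a dictionary of that element's attributes."""
    return {seq.get("id"): dict(seq.attrib) for seq in root.iter("Sequence")}


def resolve_views(root):
    """Map each View id to the title of the Sequence named by its seqref."""
    sequences = index_sequences(root)
    resolved = {}
    for view in root.iter("View"):
        target = sequences.get(view.get("seqref"))
        resolved[view.get("id")] = target["title"] if target else None
    return resolved


root = ET.fromstring(BSML_FRAGMENT)
sequences = index_sequences(root)
views = resolve_views(root)

if __name__ == "__main__":
    print(sequences["SEQ1"]["length"], sequences["SEQ1"]["units"])
    print(views)
```

Keeping the view separate from the sequence data, and joining the two by identifier, mirrors the split between definition and display that the specification describes.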
Like biological information, chemistry and materials information also needs to be exchanged. This is especially vital in the pharmaceutical, materials processing, plastics, petroleum, and other industries that rely on accurate chemical information to perform their tasks adequately. However, as in many other industries, these processes have historically been dominated by paper rather than electronic interchange. Various chemistry industry specifications, such as the Chemical Markup Language covered next, hope to change this by providing a deep level of specification for chemical properties as well as the required vocabularies for defining chemical industry interchange.
Chemical Markup Language
The foundations of the Chemical Markup Language (CML, more officially known as XML-CML) can be traced all the way back to the early days of HTML, when the Internet was frequented mainly by academics rather than individuals and corporations. The original concept was to provide a platform-neutral means of exchanging information about chemical compositions. Originally formatted as an SGML DTD, CML began pursuing the XML direction soon after the language’s development in 1996. Subsequently, CML became one of the first acknowledged domain-specific DTDs published for XML.
CML itself doesn’t cover the entire spectrum of possibilities in the chemical industry. Rather, it focuses on representing molecules, which the CML Web site defines as “discrete entities representable by a formula and usually a connection table.” CML further specifies a hierarchy for compound molecules, such as clathrates and macromolecules, reactions, and macromolecular structures/sequences. In addition, CML “has no specific support for physicochemical concepts but can support labeled numeric data types of several sorts, which can cover a wide range of requirements. It allows quantities and properties to be specifically attached to molecules, atoms, or bonds.”
In many respects, CML forms a common basis for most chemical-domain XML vocabularies in much the same way that MathML forms the basis for many mathematical and scientific-domain XML vocabularies. CML also makes use of and leverages a number of other XML specifications, including Resource Description Framework (RDF), XHTML, SVG, PlotML, MathML, Dublin Core, and XML Schema, as its schema base.
CML supports spectra and other instrumental output, crystallography, organic and inorganic molecules, physicochemical quantities (including units), MO calculations, macromolecules (such as sequence protein and ligand), molecular hyperglossaries (including text and molecules), and hyperlinks. CML accomplishes this by specifying a core set of elements: molecule, which describes a connected set of atoms; bond, which describes a link between atoms within a molecule; atomArray and bondArray, which provide containers for atoms and bonds; and electron, which provides details of electrons in atoms, bonds, and molecules. Also specified are macromolecular, reaction, crystallography, and formula elements that describe the interaction of these core elements. The macromolecular elements are sequence, which describes a macromolecular sequence, and feature, which describes features in a sequence. The reaction element describes a reaction that contains molecules and the links between them. The crystal element describes the crystallographic unit cell and symmetry, with fractional coordinates for atoms, and the formula element provides a container for representing arbitrary chemical formulas as a text string with a convention attribute.
LISTING 22.10 Sample CML Document
<molecule convention="MDLMol" id="adrenalin" title="EPINEPHRINE">
  <date day="22" month="11" year="1995">
  <atomArray>
    <atom id="a1">
      <string builtin="elementType">C</string>
      <float builtin="x2">-0.2969</float>
      <float builtin="y2">0.8979</float>
      ...
      <string builtin="elementType">C</string>
      <float builtin="x2">-0.2969</float>
      <float builtin="y2">-0.6121</float>
      ...
      <string builtin="elementType">H</string>
      <float builtin="x2">2.144</float>
      <float builtin="y2">2.8844</float>
    </atom>
  <bondArray>
    <bond id="b1">
      <string builtin="atomRef">a1</string>
      <string builtin="atomRef">a2</string>
      <string builtin="order">1</string>
      ...
      <string builtin="atomRef">a1</string>
      <string builtin="atomRef">a3</string>
      <string builtin="order">2</string>
      ...
      <string builtin="atomRef">a4</string>
      <string builtin="atomRef">a14</string>
      <string builtin="order">1</string>
      <string builtin="stereo">H</string>
      ...
<reaction title="Diels-Alder cycloaddition" id="simple_rxn_1" convention="stepwise">
  <string title="description">Simple example of a A + B -> C reaction. See source for further information.</string>
  <float title="yield" units="%">88</float>
  <string title="notes">taken from Vollhardt and Schore</string>
  <list title="reactionStep" id="simple_s_1">
    <string title="description">cycloaddition</string>
    <float title="yield" convention="%">88</float>
    <string title="notes">one step</string>
    <link title="reactant" href="simple_mol_reactant1" id="simple_lk_1"/>
    <link title="reactant" href="simple_mol_reactant2" id="simple_lk_2"/>
    <link title="reagent" id="simple_lk_3">
      <string title="temperature" convention="degC">100</string>
      <string title="duration" convention="hours">3</string>
      <string title="notes">reflux</string>
    </link>
    <link title="reagent" id="simple_lk_4">
      <integer title="index">2</integer>
      <string title="notes">workup</string>
    </link>
    <link title="product" href="simple_mol_product" id="simple_lk_5"/>
    <!-- also catalyst, intermediate, transition state as needed -->
  </list>
</reaction>
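The molecule markup in this listing amounts to a connection table: a set of atoms plus the bonds that join them. As a rough sketch of how that table could be recovered, the Python code below parses a small, well-formed fragment written in the same style and collects atoms and bonds through their `builtin` attributes. The fragment itself is hypothetical (the `demo` molecule, its three atoms, and the oxygen at `a3` are invented for illustration and are not the epinephrine data), and `builtin_values` and `connection_table` are assumed helper names.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment in the style of the listing; it is not the
# epinephrine molecule, and the ids and elements are invented.
CML_FRAGMENT = """
<molecule convention="MDLMol" id="demo" title="illustrative fragment">
  <atomArray>
    <atom id="a1"><string builtin="elementType">C</string></atom>
    <atom id="a2"><string builtin="elementType">C</string></atom>
    <atom id="a3"><string builtin="elementType">O</string></atom>
  </atomArray>
  <bondArray>
    <bond id="b1">
      <string builtin="atomRef">a1</string>
      <string builtin="atomRef">a2</string>
      <string builtin="order">1</string>
    </bond>
    <bond id="b2">
      <string builtin="atomRef">a1</string>
      <string builtin="atomRef">a3</string>
      <string builtin="order">2</string>
    </bond>
  </bondArray>
</molecule>
"""


def builtin_values(element, name):
    """Return the text of each direct child whose builtin attribute is name."""
    return [child.text for child in element if child.get("builtin") == name]


def connection_table(molecule):
    """Return (atoms, bonds): atom id -> element symbol, and bond triples."""
    atoms = {
        atom.get("id"): builtin_values(atom, "elementType")[0]
        for atom in molecule.iter("atom")
    }
    bonds = []
    for bond in molecule.iter("bond"):
        first, second = builtin_values(bond, "atomRef")
        order = int(builtin_values(bond, "order")[0])
        bonds.append((first, second, order))
    return atoms, bonds


atoms, bonds = connection_table(ET.fromstring(CML_FRAGMENT))

if __name__ == "__main__":
    print(atoms)
    print(bonds)
```

Because each bond refers to its atoms by id rather than by position, the atoms and bonds can be processed independently and joined afterward.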