Chapter: XML and Web Services : The Semantic Web : The Semantic Web for Information Owners

Architecture of the Semantic Web

W3C has described a seven-layer architecture for the Semantic Web, from the bottom to the top:

· Unicode and URIs

· XML, XML Schema, and XML Namespaces

· RDF, RDF Schema, and Topic Maps

· Ontologies

· Logic

· Proof

· Trust

As you can see, these layers move from the very concrete to the abstract, even ethereal. Most people have seen a URI (layer 1), but few (maybe none) have thought through what trust might mean, let alone how it might be specified so that it can be processed by machines. The bottom three layers are well specified enough that some issues can be raised with them—particularly at the interfaces of the layers. The top four layers are (like our elephant) very much under construction. This section describes the seven layers. Needless to say, there is a lot more detail in the concrete layers than in the blue-sky thinking at the top.

Unicode and URIs

At the lowest layer of the Semantic Web are specifications for the characters used by documents on the Semantic Web (Unicode) and for identifying documents (URIs). Because URIs are made from characters, we’ll look at characters first.

Unicode

In the beginning, there was ASCII, and that was good. When the World Wide Web was starting out, that is. The original HTML specifications used the SGML Reference Concrete Syntax, which boiled down to using ASCII (the American Standard Code for Information Interchange), and the browser manufacturers followed along. They used the SGML entity facility to map characters that were not included in the ASCII character set, such as curly quotes, bullets, and so forth, using (whether or not they knew it) the famous ISO entity sets with which so many consultants were able to pad their deliverables.

ASCII was good because it was a very successful standard (or vice versa). It was totally transparent and portable, as long as you used the Western European alphabet and didn’t need any accented characters. If you needed accents (in French, for example) or Cyrillic, Greek, or Hebrew characters, then the SGML entity system became more than a little verbose. (Imagine using &beta; whenever you wanted to write β throughout an entire document!) However, if you used a nonalphabetic writing system, ASCII was totally unhelpful.

Enter Asian character sets, which are often nonalphabetic and run to many thousands of characters. After all, WWW doesn’t stand for Western-European Wide Web—it stands for World Wide Web. What to do?

The answer was to build a bigger code space. ASCII proper is a 7-bit code with 128 characters, and even its 8-bit extensions provide at most 256: barely enough for English, far too few for even one Asian character set, and absurdly small when Chinese, Japanese, Hangul, Thai, their many historical variations, and all the other Asian character sets (including Khmer and Vietnamese) are taken into account.

Unicode built the bigger code space. (One reason for XML’s worldwide success is that it uses Unicode in the form of UTF-8.) Unicode is, essentially, a giant code table. It maps characters, which are abstract entities (such as “LATIN CAPITAL LETTER A,” a Japanese Hiragana syllable, or a Chinese ideogram), to code points, which are unique numeric values (“U+0041” for Latin “A”), and maps code points to byte serializations or encodings. Rendering engines, in turn, map byte serializations to visual representations of characters (such as capital “A” in the font Garamond), called glyphs. Unicode does not standardize glyphs.

There are three Unicode encodings: UTF-32, with one 32-bit code unit per code point; UTF-16, with one or two 16-bit code units per code point; and UTF-8, with one to four 8-bit code units per code point. UTF-8 is the default encoding for XML documents (XML processors are also required to accept UTF-16). It aims to preserve the characteristics of ASCII so that file systems, parsers, and other software that rely on ASCII remain backward compatible with Unicode.
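The difference between the three encodings can be seen by counting bytes. The following is a minimal Python sketch, not part of the original text; the function name encoded_sizes and the sample characters are chosen here purely for illustration. The big-endian codec variants are used so that no byte order mark is counted.

```python
def encoded_sizes(text: str) -> dict:
    """Return the number of bytes each Unicode encoding needs for `text`."""
    return {
        "UTF-8": len(text.encode("utf-8")),
        "UTF-16": len(text.encode("utf-16-be")),  # "-be" avoids a byte order mark
        "UTF-32": len(text.encode("utf-32-be")),
    }


# Latin A, e with acute accent, a Chinese ideograph, and a character
# outside the Basic Multilingual Plane (illustrative samples only).
for ch in ("\u0041", "\u00e9", "\u4e2d", "\U0001F600"):
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

Note that the ASCII character occupies a single byte in UTF-8, identical to its ASCII byte, which is what keeps ASCII-based software working.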

As already mentioned, Unicode is required by XML. Other languages and specifications that require Unicode are Java, ECMAScript (JavaScript), LDAP, and CORBA 3.0. Companies that have adopted Unicode include Apple, HP, IBM, Microsoft, Oracle, SAP, Sun, Sybase, Unisys, and other industry leaders. It is supported in all modern browsers, numerous products, and most operating systems. Therefore, if you’re a developer writing on a platform that doesn’t handle Unicode or an information owner making the assumption that all characters are ASCII, take a hard look at your assumptions. Importantly, Unicode “guarantees” that no characters will be removed or reinterpreted in ways that are incompatible with the existing standard.

Up until now, we’ve been using the word Unicode for the standard that solves our code space problem. In fact, there are two standards:

· Unicode

· ISO 10646

Unicode is a semicommercial effort; ISO 10646 is an international standard. Fortunately, in 1991 both efforts decided that the world didn’t need two competing solutions for the code space problem and agreed to keep their code tables in sync with each other and coordinate future extensions. They have kept their agreement. All characters are at the same positions in the code tables and have the same names in Unicode and ISO 10646.

Are there differences between Unicode and ISO 10646? Yes. ISO 10646 is focused mainly on its code tables. Unicode also gives information relevant to implementers, particularly for implementers of high-end composition systems. It provides rendering algorithms for scripts (such as Arabic), mixing bidirectional text (such as Latin left-to-right text and Hebrew right-to-left text), sorting and string comparisons, and so forth. However, ISO 10646 has more complete coverage of Chinese, Japanese, and Korean sample glyphs.

Are there differences between Unicode and ISO 10646 that affect the Semantic Web directly? Again, yes, and again the answer turns on code space.

First, there are some differences between the XML view of the world at W3C and the Unicode view of the world. For example, W3C does not consider some Unicode characters suitable for XML. W3C feels it is better to use the HTML <BR> tag or some other markup equivalent than to use Unicode line and paragraph separators. Other issues arise when Unicode characters are specified to handle functions that are handled, or better handled, by markup. Examples here include list item marker characters (better handled with a style sheet), bidirectional text (specified in HTML 4.0), object replacement (better handled with an HTML src attribute or equivalent), and others.

Potential users of Unicode also have issues with it, especially Chinese/Japanese/Korean/Vietnamese (CJKV) users. Character set issues in general, and East Asian character sets in particular, are extremely intricate and culturally bound. (There is one, possibly apocryphal, story of country representatives feuding over who would be “first” in the code table.)

Some CJKV issues are process issues. Due to the large number of Asian character sets, and the huge numbers of ideographs within each set, Unicode does not cover many characters. Unicode 3.0 has almost 28,000 ideographic characters. However, by some estimates, there are 160,000 ideographs yet to be standardized. Even if this estimate is high, given that the Unicode process, like any standards process, is by nature slow, some CJKV users are bound to remain skeptical of Unicode for some time.

Some of these skeptics will, of course, be classical scholars who study works that use characters or character sets that are no longer widely used. However, here we should consider that these works may have great cultural significance. How would English writers and speakers feel if the Unicode process had not yet standardized on the characters used to represent the works of Shakespeare or the New Testament in Greek?

URIs

Unicode is the first part of level 1 of the Semantic Web; the URI is the second part. Now, URIs are made out of characters, so the issues of internationalizing character sets, as we just discussed, potentially impact URI interoperability. Furthermore, URIs pose interesting technical and philosophical issues in their own right (known as the identification problem). So, let’s look at URIs and their impact on the Semantic Web in detail.

On the second day, Tim Berners-Lee created the Uniform Resource Locator (URL). It, too, was good. Having in mind the overarching goal of creating a system whereby scientists could communicate their results, he sought to create an addressing system that could be scribbled on a cocktail napkin in a bar; in this he succeeded. URLs are definitely human readable, as opposed to the multiline incantations and arcane sequences of digits that preceded the URL.

Let’s process a URL, just to get the terminology clear. Here is a sample (fake) URL:

http://jane.books.com/hardbacks/chinese

If you typed this into your browser (and the URL weren’t fake) here is what would happen: The browser would divide the URL into scheme, domain name, and path name, and then resolve the result.

First, the browser would split the URL on the scheme (before the :// part). That scheme tells the browser how and to whom to delegate locating a resource. (This “how and to whom” is called a transfer protocol; HTTP is an example.) Next, the browser looks for a “top-level domain” in the URL (here, .com) and, reading backwards, passes the domain name jane.books.com to a Domain Name System (DNS) resolver, which translates the name into the network address of the jane.books.com server so that the browser can connect to it. Finally, Jane’s server resolves the remaining path name into a Web page, and returns that document to the browser, which displays it.
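The splitting step just described can be sketched with Python’s standard urllib.parse module. This is only an illustration of the decomposition, using the same fake URL; it performs no DNS lookup and retrieves nothing.

```python
from urllib.parse import urlsplit

# Split the sample (fake) URL into the parts the browser works with.
parts = urlsplit("http://jane.books.com/hardbacks/chinese")

scheme = parts.scheme           # how, and to whom, to delegate retrieval
domain_name = parts.hostname    # what gets handed to DNS
path_name = parts.path          # what Jane's server resolves to a page

# Reading the domain name "backwards": the top-level domain comes last.
top_level_domain = domain_name.split(".")[-1]

print(scheme, domain_name, path_name, top_level_domain)
```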

So far, so good. The URL certainly works. But today, we have transitioned from the notion of a URL that locates to a URI that identifies, and in that transition some problems have arisen:

· The internationalization problem (a scalability problem)

· The privatization problem (a scalability problem)

· The terminology problem (a semantic problem)

· The identity problem (a semantic problem)

Let’s turn to the internationalization problem first. Again, character set issues raise their ugly heads. On the one hand, URIs are to be 7-bit ASCII (as specified in RFC 2396). On the other hand, we have URIs embedded in XML documents that not only are UTF-8 Unicode but may contain XML general entities such as &amp;. What to do? The answer: Perform a mini data conversion effort on-the-fly in the browser, in which the general entities in the URL sequence are converted to Unicode characters, and the resulting sequence is converted to UTF-8. In that sequence, each UTF-8 byte is used as is if it corresponds to a 7-bit ASCII character allowed in URIs. Otherwise, it is escaped into hex and written like %hh, where hh are the two hexadecimal digits (themselves 7-bit ASCII characters) that give the value of that byte in the UTF-8 encoding of the desired character.
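The conversion just described can be sketched in a few lines of Python. This is a rough illustration under stated assumptions rather than what any particular browser does: the function name to_ascii_uri_path is hypothetical, and html.unescape only knows the HTML named entities and numeric character references, not entities declared in an arbitrary XML DTD.

```python
from html import unescape
from urllib.parse import quote


def to_ascii_uri_path(raw_path: str) -> str:
    """Resolve entities, encode as UTF-8, and %hh-escape non-ASCII bytes."""
    text = unescape(raw_path)      # step 1: entities -> Unicode characters
    return quote(text, safe="/")   # steps 2 and 3: UTF-8, then %hh escapes


# A path containing two Chinese characters (U+4E2D, U+6587).
print(to_ascii_uri_path("/hardbacks/\u4e2d\u6587"))
# A path containing an entity reference for e with acute accent.
print(to_ascii_uri_path("/caf&eacute;"))
```

Each Chinese character becomes three escaped bytes, which is the morass of percent signs that a reader of the resulting URI has to face.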

In this process, the original use case behind URLs, human readability, has been left far behind. A netizen in the world’s largest emerging market and oldest continuous commercial civilization (that is, a Chinese person) who wished to scribble a URI on a cocktail napkin would see nothing but a morass of percent signs (even assuming that all his characters were in UTF-8)! Furthermore, if he wanted to use his URI for marketing and print it on a billboard (as Western firms do), which version of his URL would he use? The one in the characters his customers could read, or the one that they saw in the browser’s address box after conversion to UTF-8? Our Chinese netizen can’t use his own character set for his own URI! This is the internationalization problem for URLs in a nutshell.

Enter the second problem: the privatization problem. Because there is a market for internationalized domain names, at least one registry service (a service that ensures the uniqueness and hence the value of domain names as assets) has opened an internationalized domain service intended to “catalyze” the work of the Internet Engineering Task Force (IETF) in writing the specifications for internationalized domain names. Well and good, but what will happen if the catalysis fails? Will the registries decide to deny their customers services, or will they fragment the World Wide Web by introducing domain names that are not universally resolvable?

These problems would seem simple enough to resolve, at least in theory (just as with Unicode: make a bigger name space). Enter problem three: Not all the people who are charged with resolving these problems use the same terminology. One key nonstandardized piece of terminology, at least in the world of URIs, is the term URI reference.

Recall from earlier the sample URI

http://jane.books.com/hardbacks/chinese

which addresses an entire document. What if we only want to address a fragment of that document? We would use a URI reference, which would look like this:

http://jane.books.com/hardbacks/chinese#id42

Here, id42, in an XML document, would be the ID of an XML element, or fragment identifier. However, although the W3C world uses the term URI reference in RDF for the combination of a URI and a fragment identifier, this term is not recognized in the IETF world where URIs are defined. The impact? Think back to internationalization for a moment—URIs, being in the scope of IETF efforts, are 7-bit ASCII, but fragment identifiers, being outside IETF, are not! Our Chinese entrepreneur, still trying to write a legal URI reference on his dampening cocktail napkin, would end up with percent signs everywhere up to the hash mark, and then nice, readable (to him) characters after the hash mark.
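To make the two halves of a URI reference concrete, here is a small Python sketch using the standard library’s urldefrag function on the sample reference above. It only separates the strings; it says nothing about which standards body governs which half.

```python
from urllib.parse import urldefrag

# Separate the URI proper from the fragment identifier after the hash mark.
uri, fragment = urldefrag("http://jane.books.com/hardbacks/chinese#id42")

print(uri)       # the part that addresses the entire document
print(fragment)  # the ID of the element inside that document
```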

Finally, we have the identity problem. Between the original definition of URLs in RFC 1630 and their standardization in RFC 2396, URLs became a subset of URIs; what was location became a subset of identification. But what is it, exactly, that URIs identify?

Now, the change to URIs was motivated by the perfectly reasonable desire to solve the huge infrastructure problem that locations on the Web, like it or not, change. If the path name portion of the URL represents a real path to a real document on a real machine, the URL will break if the document is renamed or moved, if the machine gets a new domain name, or if the machine is down. The concept of a URI introduced a level of indirection, in that the identifier may remain stable, even if the path name of the document (its location) changes. Of course, it turns out that most of this indirection can be managed at the file-system level by the server, and so URI, which was supposed to turn into an umbrella concept for other UR*s, such as URC (Uniform Resource Citation) and URN (Uniform Resource Name), really turned into a fancier name for a URL.

However, there is a crucial semantic difference between a URL and a URI. Because the conceptual association between the URI and a physical file has been severed, the URI became free to identify anything, including resources that are not available on the Net, such as physical books, the person Jane, and abstract ideas. RDF uses URI references in just this way (see Chapter 23).

However, let’s put ourselves in the place of a browser and server once more. When we-the-browser are given a URI reference, how do we know, from the URI alone (which is all we do know about), whether the URI is a document to display or an identification of a nonretrievable resource such as the person Jane or the concept of love? Using the HTTP protocol, how do we-the-server distinguish between a URI that identifies the person Jane and an Error 404? In Zen terms, how do we distinguish the pointing finger from the moon?

The answer is that now we can’t. As of this writing, there is no standard way to distinguish between URIs that have failed to retrieve resources and URIs that are not meant to retrieve resources at all. Today, the Semantic Web suffers from one of the most basic confusions possible: between the identity of a thing and the thing itself. There are at least two alternative solutions: URIs could change to include such a semantic, or the semantic could be handled at a higher layer above URIs.

XML Specifications

XML, XML namespaces, and XML Schema are covered elsewhere in this book. In this section, we will focus on one XML specification, XML Topic Maps (XTM), that addresses the issues that bubble up to the middle layers of the Semantic Web architecture from the Unicode and URI layers. As you’ll recall, two of those issues are internationalization and identity.

First, we’ll briefly describe XTM topic maps. Then we’ll compare and contrast the topic map approach to the problem with the RDF approach.

Topic maps are a bibliographic solution. They were originally designed to provide an interchange syntax for finding aids such as indexes, thesauruses, glossaries, and taxonomies. Like most solutions that involve bibliographic issues (see the discussions of DDC and AECMA 1000D), topic maps evolved into a solution for modeling relations between information resources. Topic maps were evolved rapidly by the members of a small team working closely together.

Topic maps create associations among topics, which are electronic proxies for subjects (subjects being subject matter, the stuff people talk about). Subjects can be addressable (a Web page) or nonaddressable (the person Jane or the concept of love). Users are encouraged to develop and use Published Subject Indicators (PSIs) for their nonaddressable subjects. Unlike RDF, the topic map data model has a number of “pre-reified” constructs optimized for modeling; these constructs include the topic basename associations, where a topic can be given a label within a scope.

Topic maps address the internationalization problem in a way that RDF labels and comments do not. In topic maps, basenames that can be scoped by human languages are built in to the data model. Therefore, a topic map information overlay can be adjusted to give topics names that are appropriate to the user, whether the user’s preference is for English, French, or Chinese. Because RDF labels and comments use the XML lang attribute, which is not part of the RDF data model, RDF applications must resort to ad hoc solutions for a function that topic maps build in.

Topic maps address the identity problem that bubbles up from URIs. RDF has no built-in way to distinguish between resources that fail to be retrieved (such as a document on a downed server) and resources that can never be retrieved (such as the concept love). Topic maps, by explicitly distinguishing in their XML markup between addressable (that is, retrievable) and nonaddressable subjects, solve the identity problem.

Ontology

Ontology (like semantics) is another one of those words that, having been appropriated by the software community, has gained new meanings. In philosophy, ontology is the study of being; a formal account of what exists. For the Web, ontology combines taxonomy with inference rules. A taxonomy is a system that classifies things that exist, often in the form of a tree. For example, the taxonomy of the animal kingdom descends from kingdom through phylum, class, order, family, genus, and species. Because animals are classified on the basis of their physical characteristics, we can make inferences: If all fish have gills, and a trout is a fish, then trout have gills.
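The combination of a taxonomy with an inference rule can be sketched in a few lines of Python. Everything here is illustrative: the tiny PARENT tree, the TRAITS table, and the function names are assumptions made for the example, not a real biological classification or an ontology language.

```python
# A toy taxonomy: each taxon points to its broader taxon (hypothetical data).
PARENT = {"trout": "fish", "fish": "vertebrate", "vertebrate": "animal"}

# Characteristics asserted directly on a taxon (hypothetical data).
TRAITS = {"fish": {"has gills"}, "vertebrate": {"has a backbone"}}


def ancestors(taxon: str) -> list:
    """Walk up the tree and return every broader taxon, nearest first."""
    chain = []
    while taxon in PARENT:
        taxon = PARENT[taxon]
        chain.append(taxon)
    return chain


def inferred_traits(taxon: str) -> set:
    """Inference rule: a taxon has its own traits plus those of its ancestors."""
    traits = set(TRAITS.get(taxon, set()))
    for broader in ancestors(taxon):
        traits |= TRAITS.get(broader, set())
    return traits


print(inferred_traits("trout"))  # includes "has gills", inherited from fish
```

Nothing in the data says that trout have gills; the conclusion emerges from the tree plus the rule, which is the sense in which an ontology is more than a classification.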

Ontologists also have a notion of an ontological commitment: the ontology we choose to make the basis of our inferences. For example, assume we must choose between a universe that is Euclidean, where the geometry of space is flat, and a universe that is Riemannian, where it is curved. This ontological commitment will affect our inferencing rules: If the universe is flat, parallel lines will never meet. If it is curved, parallel lines will meet.

Pragmatically, ontologies exist just so communities of interest can use them to share commitments. A large part of the library community, for example, has made an ontological commitment to the Dewey Decimal Classification (DDC).

However, the boundaries of the twin notions “taxonomy” and “inferencing rules” are just a bit fuzzy. After all, a DTD could be seen as a classification system for element types in markup languages, and a DTD enables inferencing rules for instances of those types in its content models. Even a URI can be viewed as a taxonomy of URLs, URCs, URNs, and so forth.

Fortunately, the W3C has formed an ontology working group, so we may hope for a more precise approach. This will be necessary for the Semantic Web, because without ontologies to allow communities of interest to agree formally on the meaning of XML namespaces, there is no possibility of creating meaning that machines can understand, because the predicates of RDF statements are implemented with namespaces.

There are many existing ontologies the Semantic Web could leverage. Two of the most important are DDC and MeSH (Medical Subject Headings). There are also many meta-ontologies: ontology languages to create ontologies. Three of the most important are Cyc, Conceptual Graphs, and OIL. Let’s look at these ontologies in detail. Next, we’ll look at the meta-ontologies.

DDC

DDC is the numerical classification scheme that one sees on the spines of paper books in libraries. Of course, because DDC classifies the content of books, it could just as well be used online—for example, as an ontology in the Semantic Web.

Devised by Melvil Dewey in 1873, DDC uses a taxonomy of Arabic numerals separated by dots to create classifications based on subject matter. For example, 500 represents natural sciences and mathematics in general, 530 represents physics, and 531 represents classical mechanics. (The addressing system of the HyTime-compliant specification AECMA 1000D uses the same numbering principles, but for a taxonomy of part-whole relationships for aircraft assemblies and subassemblies.)

An interesting artifact of the numbering system is that short numbers are higher in the taxonomic hierarchy than long ones, because the more specific the subject matter, the more digits are required to describe it. DDC therefore implies inferencing rules, as well as a taxonomy, and is therefore a first-class ontology.
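The inferencing rule implied by the numbering can be sketched as follows. This is a simplified Python illustration that assumes only three-digit class numbers such as 500, 530, and 531, where trailing zeros mark the broader levels; the function names are hypothetical, and decimal extensions after the dot are deliberately ignored.

```python
def broader_class(code: str):
    """Return the next broader three-digit class, or None for a main class."""
    if len(code) != 3 or not code.isdigit():
        raise ValueError("this sketch handles three-digit class numbers only")
    significant = code.rstrip("0")     # "530" -> "53", "500" -> "5"
    if len(significant) <= 1:
        return None                    # already a main class such as 500
    return significant[:-1].ljust(3, "0")


def is_narrower(code: str, candidate: str) -> bool:
    """True if `code` sits somewhere below `candidate` in the hierarchy."""
    parent = broader_class(code)
    while parent is not None:
        if parent == candidate:
            return True
        parent = broader_class(parent)
    return False


print(broader_class("531"))        # the class for physics
print(is_narrower("531", "500"))   # classical mechanics is a natural science
```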

MeSH

MeSH is an ontology, too, but organized quite differently from DDC. It is a controlled vocabulary thesaurus, where (as in the writer’s tool Roget’s) terms are associated with links such as synonym, antonym, homonym, and so forth. (Chapter 23 shows how such associations could be represented in RDF with the terms as nodes connected with labeled arcs: True and false would be two words connected by an arc labeled “antonym,” for example.)

MeSH terms can be organized both alphabetically and in a conceptual hierarchy. In the former, ankle would follow anatomy; in the latter, ankle would be a narrower subject under the broader subject anatomy. Therefore, there are multiple points of entry into the MeSH ontology; this is a general characteristic of information overlays such as RDF and topic maps. Like DDC, MeSH’s hierarchy implies inferencing rules, and MeSH is therefore a first-class ontology.

Conceptual Graphs

Conceptual Graphs (CG) is a language in which ontologies can be created. It uses graph structures to express meaning in a way that humans can read and machines can process. CGs are similar to RDF when considered as syntax, in that they can be represented both graphically and textually. Here is an example of CG syntax in text form:

[Go]-
   (Agnt)->[Person: Joe]
   (Dest)->[City: Manhattan]
   (Inst)->[Train].

The words in square brackets ([Go]) are concepts; the words in parentheses ((Dest)) are relations. Arcs that connect relations to concepts are shown as arrows (->). The preceding CG statement translates to “Joe is going to Manhattan by train.” The essential concept, the verb, is [Go]; the individual [Person: Joe] is an agent (Agnt) that can perform actions such as [Go]ing to the individual [City: Manhattan] in an (Inst)rumental relation with a [Train].
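Because a conceptual graph is just concepts joined by labeled relations, it maps naturally onto simple data structures. The Python sketch below stores the example as a head concept plus (relation, concept) pairs and prints it back in the text form; the function name linear_form and the three-space indent are assumptions of this sketch, not part of any CG tool.

```python
def linear_form(head: str, relations: list) -> str:
    """Render a head concept and its (relation, concept) pairs as CG text."""
    lines = [f"[{head}]-"]
    for relation, concept in relations:
        lines.append(f"   ({relation})->[{concept}]")
    return "\n".join(lines) + "."


# "Joe is going to Manhattan by train."
going = [
    ("Agnt", "Person: Joe"),      # who performs the action
    ("Dest", "City: Manhattan"),  # where the action is headed
    ("Inst", "Train"),            # the instrument used
]

print(linear_form("Go", going))
```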

CG enables the definition of taxonomies in the form of relations between concepts. CG also supports inferencing rules in the form of First Order Logic (FOL), which we will briefly cover when we examine the next layer of the Semantic Web.

Cyc

Cyc (pronounced psych), like CG, is an ontology and an ontology definition language, but its approach is completely different. Cyc intends to realize the artificial intelligence (AI) dream of enabling machines to use common-sense reasoning. However, instead of working from the top down (the classical AI approach), with the hierarchy’s root and the axioms for inferencing rules, Cyc works from the bottom up. The Cyc project collects and formalizes common-sense rules (over one million so far) and allows common-sense conclusions to emerge from the interactions between the rules.

For example, a common sense rule is, “If it rains, take an umbrella.” Or so it would seem. In fact, this rule can be heavily qualified by contextual rules, which also need to be entered. (For example, the rule presumes that the agent who may or may not take the umbrella is sane, not dead, not quadriplegic, not on the planet Venus, and so on.)

Fortunately, after 17 years of development, enough rules have been collected to allow Cyc to be productized and a significant portion of the rules base made public. It may be that Semantic Web developers will consider that this particular wheel need not be reinvented; if so, they will find Murray Altheim’s XML version of Cyc a useful tool.

OIL

Ontology Inference Layer (OIL) is our last ontology definition language. Rather than create an FOL-enabled syntax from the top down (like CG) or a vast inferencing system from the bottom up (like Cyc), OIL is a layer on top of the W3C RDF specification (see Chapter 23).

OIL (not yet a W3C effort, although cited there) inherits concepts from three communities. First, the AI community supplies OIL with the notion of “frames with slots,” or, in object-oriented (OO) terms, with classes that have attributes. Second, the knowledge representation community brings the notions of concepts and roles in description logic, which OIL maps to the notions of classes and attributes, respectively. Description logic (unlike the OO notation) has well-understood mathematical properties that enable inferencing rules. Finally, the markup technology community brings XML syntax and the RDF modeling primitives (instanceOf and subClassOf). OIL extends RDF to create a full-fledged modeling language.

Here is an example of OIL syntax:

class-def Book
slot-def Price
    domain Product
class-def Janes Book
    subclass-of Book
    slot-constraint PublishedBy
        has-value “Janes Publisher”

This is a typical OO class hierarchy. Because Jane’s book is a subclass of book, it inherits the properties of book (such as price). Whether the OO hierarchy is necessary or sufficient to create an ontology seems an open question: neither CG nor Cyc is considered an OO technology.

Logic

If RDF provides for a statement using a simple subject/predicate/object model, we can think of First Order Logic (FOL) in the Logic layer as enhancing RDF statements with richer syntax and more semantic power, within a world or universe of discourse specified by the ontological commitments made in the Ontological layer. (If our ontological commitment is to the DDC, we will reason from different premises than we would if our commitment was to MeSH, Cyc, or DAML.)

FOL statements are richer syntactically than RDF statements because they can use constructs that are like natural language conjunctions (Boolean operators, such as “and”) and demonstratives (variables, such as “this”). FOL statements are more powerful because they can explicitly assert what is true of part or all of a world (the quantifiers, such as “there exists”). Other forms of logic may coexist with FOL on the Logic layer of the Semantic Web, but FOL provides a baseline functionality.

FOL typically uses a formal notation of its own (called Peano notation), but in this section, for accessibility, we’ll just use English-like phrases in italic. For the existential quantifier ∃, for example, we will write the words there exists.

Briefly, here is an informal synthesis of the key concepts in FOL. We will elaborate the statement “Roses are red” in our examples.

The following concepts are defined by the user:

· Constants. Individuals in the world (resources, such as “rose”).

· Functions. Mapping resource to resource (properties, such as “color-of(rose) = red”).

· Relations. Mapping resources to truth values (true and false).

These concepts are supplied by FOL itself:

· Variable symbols. x and y.

· Boolean values. True and false.

· Conjunction. Rose is red, and violet is blue.

· Disjunction. Rose is red, or rose is yellow.

· Negation. Rose is not blue.

· Inference. If rose is red, then violet is blue.

· Equivalence. Rose is red if and only if sugar is sweet.

There are two kinds of quantifiers:

· Universal. For all x.

· Existential. There is an x.

FOL also has assembly rules for building up sentences from terms and atoms:

1. A term (denoting a resource) is a constant symbol, a variable symbol, or a function of n terms.

2. An atom (which has a truth value) is either a relation of n terms or two atoms connected with and or or.

3. A sentence is an atom, or, if S is a sentence and x is a variable, a sentence preceded by an existential quantifier: There exists an x such that x is S. (Note that the existential quantifier implicitly applies only within our world.)

4. A well-formed formula (WFF) is a sentence with all variables “bound” by universal or existential quantifiers. For example, “There is a rose such that roses are red” has “rose” as an existentially quantified variable, but not “red.”

Notice that rules 1 through 3 are recursive: Terms are defined in terms of, well, terms, atoms in terms of atoms, and sentences in terms of sentences. We’ll use this characteristic to break the following example into its components.

Now, why on earth would these rules matter to the Semantic Web? Let’s take the sentence with which we began this chapter:

Buy me a hard-cover copy of Jane’s book, in Chinese, if available for less than $39.95.

This is a pretty complicated sentence, and a machine might well wish to break it down into its component parts. Those parts turn out to be easily expressible in FOL.

First, let’s replace “Jane’s book” with the variable book and make that variable explicit where the rules of English (unlike the rules of logic) allow it to be implicit:

Buy me a hard-cover copy of book, in Chinese, if book is available for less than $39.95.

Now, because the definition of “sentence” is recursive, let’s break this sentence down into sentences. The English is deceptive—it looks like we are saying “There is a hard-cover copy of book”, but, in fact, because we can’t buy the book if it doesn’t exist, we’ll translate the phrase to a relation, because a relation has a truth value. We’ll also assign each sentence to a constant:

A := book (hard-cover)

B := book (Chinese)

X := book (available)

Y := book (price, < $39.95)

Now, because sentences are atoms, and atoms can be connected with Booleans, we can construct the following inference:

If ((A and B) and (X and Y)) then…

This means that we will purchase Jane’s book only if it is available, in Chinese, in a hard-cover version, and at the right price. Notice that all the sentences can be easily represented by RDF statements (as shown in Chapter 23), but the FOL variables abstract away from the RDF syntax, which shows the layered design of the Semantic Web in action to a good effect.

But how to express the imperative “Buy me the book”? First, we need to make a little more explicit what is implicit. How would we represent the purchase? With the following sentence:

P := book (purchased)

Our sentence now reads like so:

If ((A and B) and (X and Y)) then P.

As it turns out, we have now represented a transaction in FOL. But as you know, an entire transaction has a truth value, because it takes place in the time between the “if” and “then”: The last copy of Jane’s book might be sold, in which case we would want to roll back the credit card purchase statement in P. Can we represent this in FOL? For convenience, we would put our transaction sentence into a variable:

T := if ((A and B) and (X and Y)) then P.

However, all we really need to do is assert:

T

Why? A sentence is an atom, and atoms have truth values. Therefore, if T is true, our Semantic Web–aware system should commit the transaction. Otherwise, it should not.
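As a rough illustration of how a machine might evaluate these sentences, the Python sketch below computes the truth values of A, B, X, and Y for a candidate book and purchases only when the whole condition holds. The Book record, its field names, and the function names are hypothetical, introduced only for this example; a real system would draw the facts from RDF statements and would need genuine transaction handling for the rollback case.

```python
from dataclasses import dataclass


@dataclass
class Book:
    """Hypothetical record holding the facts the sentences refer to."""
    hard_cover: bool
    language: str
    available: bool
    price: float


def purchase_condition(book: Book, price_limit: float = 39.95) -> bool:
    """Evaluate (A and B) and (X and Y) for one candidate book."""
    a = book.hard_cover                 # A := book (hard-cover)
    b = book.language == "Chinese"      # B := book (Chinese)
    x = book.available                  # X := book (available)
    y = book.price < price_limit        # Y := book (price, < $39.95)
    return (a and b) and (x and y)


def buy_if_possible(book: Book) -> bool:
    """Return the truth value of P: purchase only if the condition holds."""
    return purchase_condition(book)


janes_book = Book(hard_cover=True, language="Chinese", available=True, price=35.00)
print(buy_if_possible(janes_book))
```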

Proof

The Logic layer provides a language for describing the truth or falsity of statements we might make in a universe of discourse. But suppose we want to question the conclusions of the Logic layer? We would get the Proof layer to expose the steps in the reasoning that led the Logic layer to make the inference it did. For example, the truth of the FOL relation

Book (Chinese)

might be proved by exhibiting the value “Mandarin” in the book title’s XML lang attribute.

The vision is that once an XML-based interchange syntax for proofs is developed, Semantic Web users (whether machine or human) will begin to exchange proofs as well as to mix and match them—and in a process akin to evolutionary programming, good proofs will drive out bad ones.

If the Logic layer of the Semantic Web enters uncharted territory, the Proof layer enters the whitespace on the map. There are no W3C specifications in process for it. The closest implementation experiences to the Proof layer seem to fall into two disciplines:

Formal methods for proving programs correct

Automated theorem proving

Neither technology has gained broad acceptance, although both have been used with success on extremely large-scale projects. Automated theorem proving is used in hardware verification by chip manufacturers, for example.

Trust

If the Proof layer enters the whitespace of the Semantic Web’s conceptual map, the Trust layer is deep within it. Remember, again, that the Logic layer provides a language for describing the truth or falsity about statements we might make in a universe of discourse. Suppose, again, that we don’t trust the conclusions of the Logic layer, but we don’t have the time or the inclination to run a proof. What to do? In the physical world, we might ask a friend whose judgment we trust whether she trusts the Logic layer to come to the right conclusion on the facts given to it. On the Semantic Web, we ask a network of friends the same question. This is the notion of a “web of trust.”
