Natural Language Processing:
Developing programs to understand natural language
is important in AI because a natural form of communication with systems is
essential for user acceptance. One of the most critical tests for intelligent
behavior is the ability to communicate effectively. This was the test proposed
by Alan Turing. AI programs must be able to communicate with their human
counterparts in a natural way, and natural language is one of the most
important mediums for that purpose. A program understands a natural language if
it behaves by taking a correct or acceptable action in response to the input.
For example, we say a child demonstrates understanding if it responds with the
correct answer to a question. The action taken need not be the external
response. It may be the creation of some internal data structures. The
structures created should be meaningful and correctly interact with the world
model representation held by the program. In this chapter we explore many of
the important issues related to natural language understanding, along with
several techniques that are used to enable humans to interact with computers
via natural human languages.
Natural languages are the languages used by humans for communication (among
other functions). They are distinctly different from formal languages, such as
C++, Java, and PROLOG. One of the main differences, which we will examine in
some detail in this chapter, is that natural languages are ambiguous, meaning
that a given sentence can have more than one possible meaning, and in some
cases the correct meaning can be very hard to determine. Formal languages are
almost always designed to ensure that ambiguity cannot occur. Hence, a given
program written in C++ can have only one interpretation. This is clearly
desirable because otherwise the computer would have to make an arbitrary
decision as to which interpretation to work with. It is becoming increasingly
important for computers to be able to understand natural languages. Telephone
systems that can understand a narrow range of commands and questions are now
widespread, assisting callers to large call centers without the need for human
operators. Additionally, the quantity of unstructured textual data
that exists in the world (and in particular, on the Internet) has reached
unmanageable proportions. For humans to search through these data using
traditional techniques such as Boolean queries or the database query language
SQL is impractical. The idea that people should be able to pose questions in
their own language, or something similar to it, is an increasingly popular one.
Of course, English is not the only natural language. A great deal of research
in natural language processing and information retrieval is carried out in
English, but many human languages differ enormously from English. Languages
such as Chinese, Finnish, and Navajo have almost nothing in common with English
(although of course Finnish uses the same alphabet). Hence, a system that can
work with one human language cannot necessarily deal with any other human
language. In this section we will explore two main topics. First, we will
examine natural language processing, which is a collection of techniques used
to enable computers to “understand” human language. In general, they are
concerned with extracting grammatical information as well as meaning from human
utterances but they are also concerned with understanding those utterances, and
performing useful tasks as a result. Two of the earliest goals of natural
language processing were automated translation (which is explored in this
chapter) and database access. The idea here was that if a user wanted to find
some information from a database, it would make much more sense if he or she
could query the database in his or her own language, rather than needing to
learn a new formal language such as SQL. Second, we will examine information
retrieval, which is a collection of techniques used to try to
match a query (or a command) to a set of documents from an existing corpus of
documents. Systems such as the search engines that we use to find data on the
Internet use information retrieval (albeit of a fairly simple nature).
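As a rough illustration of matching a query to a set of documents, the following minimal sketch ranks documents by the number of words they share with the query. The function names, the three-document corpus, and the scoring scheme are all assumptions made for this example; real search engines use far more refined techniques.

```python
def tokenize(text):
    """Lowercase the text and return the set of alphabetic words it contains."""
    words = set()
    for raw in text.lower().split():
        word = "".join(ch for ch in raw if ch.isalpha())
        if word:
            words.add(word)
    return words


def rank_documents(query, corpus):
    """Return (doc_id, score) pairs, where score counts the shared words."""
    query_words = tokenize(query)
    scored = []
    for doc_id, text in corpus.items():
        score = len(query_words & tokenize(text))
        if score > 0:
            scored.append((doc_id, score))
    # Highest score first; ties are broken by document id for a stable order.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# A hypothetical three-document corpus.
corpus = {
    "doc1": "The cat sat on the mat.",
    "doc2": "Telephone systems understand a narrow range of commands.",
    "doc3": "The black cat eats stale bread.",
}
results = rank_documents("Which cat eats bread?", corpus)
print(results)
```

Here the query is posed in something close to ordinary English, yet the matching is purely word overlap: no grammar or meaning is involved.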
of linguistics
In dealing with natural language, a computer system
needs to be able to process and manipulate language at a number of levels.
Phonology. This is needed only if the computer is
required to understand spoken language. Phonology is the study of the sounds
that make up words and is used to identify words from sounds. We will explore
this in a little more detail later, when we look at the ways in which computers
can understand speech.
Morphology. This is the first stage of analysis
that is applied to words, once they have been identified from speech, or input
into the system. Morphology looks at the ways in which words break down into
components and how that affects their grammatical status. For example, the
letter “s” on the end of a word can often either indicate that it is a plural
noun or a third-person present-tense verb.
Syntax. This stage involves applying the rules of
the grammar from the language being used. Syntax determines the role of each word
in a sentence and, thus, enables a computer system to convert sentences into a
structure that can be more easily manipulated.
Semantics. This involves the examination of the
meaning of words and sentences. As we will see, it is possible for a sentence
to be syntactically correct but to be semantically meaningless. Conversely, it
is desirable that a computer system be able to understand sentences with
incorrect syntax but that still convey useful information semantically.
Pragmatics. This is the application of human-like
understanding to sentences and discourse to determine meanings that are not
immediately clear from the semantics. For example, if someone says, “Can you
tell me the time?”, most people know that “yes” is not a suitable answer.
Pragmatics enables a computer system to give a sensible answer to questions
like this.
In addition to these levels of analysis, natural
language processing systems must apply some kind of world knowledge. In most
real-world systems, this world knowledge is limited to a specific domain (e.g.,
a system might have detailed knowledge about the Blocks World and be able to
answer questions about this world). The ultimate goal of natural language
processing would be to have a system with enough world knowledge to be able to
engage a human in discussion on any subject. This goal is still a long way off.
In studying the English language, morphology is
relatively simple. We have endings such as -ing, -s, and -ed, which are applied
to verbs; endings such as -s and -es, which are applied to nouns; we also have
the ending -ly, which usually indicates that a word is an adverb.
We also have prefixes such as anti-, non-, un-, and
in-, which tend to indicate negation, or opposition.
We also have a number of other prefixes and
suffixes that provide a variety of semantic and syntactic information.
In practice, however, morphologic analysis for the
English language is not terribly complex, particularly when compared with
languages such as German, which tend to combine words together into single
compound words to indicate combinations of meaning.
Morphologic analysis is mainly useful in natural
language processing for identifying parts of speech (nouns, verbs, etc.) and
for identifying which words belong together.
In English, however, word order tends to provide more of
this information than morphology. In languages such as Latin, word
order is almost entirely superficial, and the morphology is extremely
important. Languages such as French, Italian, and Spanish lie somewhere between
these two extremes.
As we will see in the following sections, being
able to identify the part of speech for each word is essential to understanding
a sentence. This can partly be achieved by simply looking up each word in a
dictionary, which might contain for example the following entries:
(swims, verb, present, singular, third person)
(swimmer, noun, singular)
(swim, verb, present, singular, first and second persons)
(swim, verb, present, plural, first, second, and third persons)
(swimming, participle)
(swimmingly, adverb)
(swam, verb, past)
Clearly, a complete dictionary of this kind would
be unfeasibly large. A more practical approach is to include information about
standard endings, such as:
(-ly, adverb)
(-ed, verb, past)
(-s, noun, plural)
This works fine for regular verbs, such as walk,
but for all natural languages there are large numbers of irregular verbs, which
do not follow these rules. Verbs such as to be and to do are particularly
difficult in English as they do not seem to follow any morphologic rules.
The most sensible approach to morphologic analysis
is thus to include a set of rules that work for most regular words and then a
list of irregular words.
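The approach just described, a list of irregular words consulted before a set of general ending rules, might be sketched as follows. The table contents, the function name analyze, and the minimum stem length of two letters are assumptions chosen for illustration only.

```python
# Irregular forms are looked up first (a tiny, hypothetical list).
IRREGULAR = {
    "swam": ("verb", "past"),
    "was": ("verb", "past"),
    "did": ("verb", "past"),
}

# Standard endings, checked in order, taken from the rules listed above.
SUFFIX_RULES = [
    ("ly", ("adverb",)),
    ("ed", ("verb", "past")),
    ("s", ("noun", "plural")),
]


def analyze(word):
    """Return a tuple of grammatical tags guessed for a single word."""
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix, tags in SUFFIX_RULES:
        # Require a stem of at least two letters so very short words
        # are not mistaken for a suffixed form (an arbitrary choice).
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return tags
    return ("unknown",)


for sample in ["walked", "swimmingly", "chickens", "swam", "swim"]:
    print(sample, analyze(sample))
```

Note that these rules alone would label swims as a plural noun, because the -s ending is ambiguous between nouns and verbs. A rule set of this kind is only a first guess that later stages of analysis must refine.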
For a system that was designed to converse on any
subject, this second list would be extremely long. Most natural language
systems currently are designed to discuss fairly limited domains and so do not
need to include over-large look-up tables.
In most natural languages, morphologic analysis faces two difficulties:
word order tends to have more importance than morphology, and
individual words can themselves be ambiguous.
This kind of ambiguity can be seen in particular in
words such as trains, which could be a plural noun or a singular verb, and set,
which can be a noun, verb, or adjective.
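One simple way to represent this word-level ambiguity is to let each dictionary entry hold a set of possible parts of speech rather than a single one. The sketch below assumes a small hypothetical lexicon; the names LEXICON, possible_tags, and ambiguous_words are illustrative.

```python
# Each word maps to every part of speech it might take (hypothetical entries).
LEXICON = {
    "trains": {"noun", "verb"},
    "set": {"noun", "verb", "adjective"},
    "the": {"article"},
    "cat": {"noun"},
}


def possible_tags(word):
    """Return the set of parts of speech recorded for a word (empty if unknown)."""
    return LEXICON.get(word.lower(), set())


def ambiguous_words(sentence):
    """List the words in a sentence that have more than one possible tag."""
    return [w for w in sentence.split() if len(possible_tags(w)) > 1]


print(ambiguous_words("the cat trains"))
```

Choosing among the candidate tags is left to syntactic analysis, which can use the surrounding words to rule out readings that do not fit.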
Parsing involves mapping a linear piece of text
onto a hierarchy that represents the way the various words interact with each
other syntactically.
First, we will look at grammars, which are used to
represent the rules that define how a specific language is built up.
Most natural languages are made up of a number of
parts of speech, mainly the following:
Verb
Noun
Adjective
Adverb
Conjunction
Pronoun
In fact it is useful when parsing to combine words
together to form syntactic groups. Hence, the words, a dog, which consist of an
article and a noun, can also be described as a noun phrase.
A noun phrase is one or more words that combine
together to represent an object or thing that can be described by a noun.
Hence, the following are valid noun phrases: Christmas, the dog, that packet of
chips, the boy who had measles last year and nearly died, my favorite color.
A noun phrase is not a sentence; it is part of a sentence.
A verb phrase is one or more words that represent
an action. The following are valid verb phrases: swim, eat that packet of
chips, walking.
A simple way to describe a sentence is to say that
it consists of a noun phrase and a verb phrase. Hence, for example: That dog is
eating my packet of chips.
In this sentence, that dog is a noun phrase, and is
eating my packet of chips is a verb phrase. Note that the verb phrase is in
fact made up of a verb phrase, is eating, and a noun phrase, my packet of
chips.
A language is defined partly by its grammar. The
rules of grammar for a language such as English can be written out in full,
although it would be a complex process to do so.
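The breakdown of the example sentence That dog is eating my packet of chips into a noun phrase and a verb phrase can be written down as a nested structure. This is a minimal sketch in which each node is a tuple holding a label followed by its children; the representation and the helper name leaves are assumptions made for illustration.

```python
# Each node is (label, child, child, ...); plain strings are the words.
tree = (
    "Sentence",
    ("NounPhrase", "that", "dog"),
    (
        "VerbPhrase",
        ("VerbPhrase", "is", "eating"),
        ("NounPhrase", "my", "packet", "of", "chips"),
    ),
)


def leaves(node):
    """Collect the words of a tree from left to right."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1:]:  # node[0] is the label, not a word
        words.extend(leaves(child))
    return words


print(" ".join(leaves(tree)))
```

Reading the leaves from left to right recovers the original linear text, while the nesting records which words group together.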
To allow a natural language processing system to
parse sentences, it needs to have knowledge of the rules that describe how a
valid sentence can be constructed.
These rules are often written in what is known as
Backus–Naur form (also known as Backus normal form—both names are abbreviated
as BNF).
BNF is widely used by computer scientists to define
formal languages such as C++ and Java. We can also use it to define the grammar
of a natural language.
A grammar specified in BNF consists of the
following components:
Terminal symbols. Each terminal symbol is a symbol or word that appears
in the language itself. In English, for example, the terminal symbols are our
dictionary words such as the, cat, dog, and so on. In formal languages, the
terminal symbols include variable names such as x, y, and so on, but for our
purposes we will consider the terminal symbols to be the words in the language.
Nonterminal symbols. These are the symbols such as noun, verb,
and conjunction that are used to define words and phrases of the language. A
nonterminal symbol is so-named because it is used to represent one or more
terminal symbols.
The start symbol. The start symbol is used to
represent a complete sentence in the language. In our case, the start symbol is
simply sentence, but in first-order predicate logic, for example, the start
symbol would be expression.
Rewrite rules. The rewrite rules define the
structure of the grammar. Each rewrite rule details what symbols (terminal or
nonterminal) can be used to make up each nonterminal symbol.
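The four components listed above map naturally onto a small data structure. The following sketch is one possible representation, not a standard one: the class name Grammar, its field names, the problems method, and the toy rules are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Grammar:
    terminals: set      # words of the language
    nonterminals: set   # symbols such as Noun or Verb
    start: str          # symbol representing a complete sentence
    rules: dict         # nonterminal -> list of right-hand sides

    def problems(self):
        """Return a list of consistency problems (empty if none are found)."""
        found = []
        if self.start not in self.nonterminals:
            found.append(f"start symbol {self.start!r} is not a nonterminal")
        for left, alternatives in self.rules.items():
            if left not in self.nonterminals:
                found.append(f"rule defined for unknown symbol {left!r}")
            for right in alternatives:
                for symbol in right:
                    known = symbol in self.terminals or symbol in self.nonterminals
                    if not known:
                        found.append(f"unknown symbol {symbol!r} in rule for {left!r}")
        return found


# A deliberately tiny, hypothetical grammar.
toy = Grammar(
    terminals={"the", "cat", "dog", "swims"},
    nonterminals={"Sentence", "Article", "Noun", "Verb"},
    start="Sentence",
    rules={
        "Sentence": [["Article", "Noun", "Verb"]],
        "Article": [["the"]],
        "Noun": [["cat"], ["dog"]],
        "Verb": [["swims"]],
    },
)
print(toy.problems())
```

Each right-hand side is a list of symbols, so a nonterminal with several alternatives simply has several lists.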
Let us now look at rewrite rules in more detail. We
saw above that a sentence could take the following form: noun phrase verb
phrase.
We thus write the following rewrite rule:
Sentence→NounPhrase VerbPhrase
This does not mean that every sentence must be
of this form, but simply that a string of symbols that takes on the form of the
right-hand side can be rewritten in the form of the left-hand side. Hence, if
we see the words
The cat sat on the mat
we might identify that the cat is a noun phrase and
that sat on the mat is a verb phrase. We can thus conclude that this string
forms a sentence.
We can also use BNF to define a number of possible
noun phrases.
Note how we use the “|” symbol to separate the
possible right-hand sides in BNF:
NounPhrase→ Noun
| Article Noun
| Adjective Noun
| Article Adjective Noun
Similarly, we can define a verb phrase:
VerbPhrase→ Verb
| Verb NounPhrase
| Adverb Verb NounPhrase
The structure of human languages varies
considerably. Hence, a set of rules like this will be valid for one language,
but not necessarily for any other language.
For example, in English it is usual to place the
adjective before the noun (black cat, stale bread), whereas in French, it is
often the case that the adjective comes after the noun (moulin rouge).
Thus far, the rewrite rules we have written consist solely of nonterminal symbols.
Rewrite rules are also used to describe the parts
of speech of individual words (or terminal symbols):
Noun→ cat
| dog
| Mount Rushmore
| chickens
Verb→ swims
| eats
| climbs
Article→ the
| a
Adjective→ black
| brown
| green
| stale
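To show how rewrite rules of this kind can be applied mechanically, here is a minimal sketch of a top-down recognizer that tries each alternative in turn. It uses the rules and words given in this section, with three assumptions: a bare Verb is accepted as a verb phrase, no adverbs are listed (so the Adverb alternative can never match), and multiword terminals such as Mount Rushmore are stored as sequences of lowercase words. The names GRAMMAR, match, and is_sentence are illustrative.

```python
GRAMMAR = {
    "Sentence": [["NounPhrase", "VerbPhrase"]],
    "NounPhrase": [
        ["Noun"],
        ["Article", "Noun"],
        ["Adjective", "Noun"],
        ["Article", "Adjective", "Noun"],
    ],
    # The bare ["Verb"] alternative is an assumption of this sketch.
    "VerbPhrase": [["Verb"], ["Verb", "NounPhrase"], ["Adverb", "Verb", "NounPhrase"]],
    "Noun": [["cat"], ["dog"], ["mount", "rushmore"], ["chickens"]],
    "Verb": [["swims"], ["eats"], ["climbs"]],
    "Article": [["the"], ["a"]],
    "Adjective": [["black"], ["brown"], ["green"], ["stale"]],
    "Adverb": [],  # no adverbs are listed in the rules above
}


def match(symbol, words, start):
    """Return every position where `symbol` can end if it begins at `start`."""
    if symbol not in GRAMMAR:
        # A terminal symbol matches only the identical word.
        if start < len(words) and words[start] == symbol:
            return {start + 1}
        return set()
    ends = set()
    for alternative in GRAMMAR[symbol]:
        positions = {start}
        for part in alternative:
            positions = {end for pos in positions for end in match(part, words, pos)}
        ends |= positions
    return ends


def is_sentence(text):
    """True if the whole text can be rewritten to the start symbol Sentence."""
    words = text.lower().split()
    return len(words) in match("Sentence", words, 0)


print(is_sentence("The black cat climbs Mount Rushmore"))
print(is_sentence("cat the eats"))
```

This grammar checks word order only: it accepts chickens eats, for example, because nothing in the rules enforces agreement between noun and verb.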