Web Search and Analysis
The emergence of the Web has brought millions of users to search for
information, which is stored in a very large number of active sites. To make
this information accessible, search engines such as Google and Yahoo! must
crawl these sites and add their document collections to their index databases.
Moreover, search engines have to regularly update their indexes given the
dynamic nature of the Web as new Web sites are created and current ones are
updated or deleted. Since there are many millions of pages available on the Web
on different topics, search engines have to apply many sophisticated techniques
such as link analysis to identify the importance of pages.
There are other types of search engines besides the ones that regularly
crawl the Web and create automatic indexes: human-powered search engines,
vertical search engines, and metasearch engines. Human-powered search engines
are developed with the help of computer-assisted systems that aid the curators
with the process of assigning indexes. They consist of manually created,
specialized Web directories that are hierarchically organized indexes guiding
user navigation to different resources on the Web. Vertical
search engines are customized
topic-specific search engines that crawl and index a specific collection of documents on the Web and
provide search results from that specific collection. Metasearch engines are built on top of search engines: they query
different search engines simultaneously and aggregate and provide search
results from these sources.
Another source of searchable Web documents is digital libraries. Digital libraries can be broadly
defined as collections of electronic resources and services for the delivery of
materials in a variety of formats. These collections may include a
university’s library catalog, catalogs from a group of participating
universities as in the State of Florida University System, or a compilation of
multiple external resources on the World Wide Web such as Google Scholar or the
IEEE/ACM index. These interfaces provide universal access to different types of
content—such as books, articles, audio, and video—situated in different
database systems and remote repositories. Similar to real libraries, these
digital collections are maintained via a catalog and organized in categories
for online reference. Digital libraries “include personal, distributed, and
centralized collections such as online public access catalogs (OPACs) and
bibliographic databases, distributed document databases, scholarly and
professional discussion lists and electronic journals, other online databases,
forums, and bulletin boards.”
1. Web Analysis and Its Relationship to Information Retrieval
In addition to browsing and searching the Web, another important
activity closely related to information retrieval is to analyze or mine
information on the Web for new information of interest. (We discuss mining of
data from files and databases in Chapter 28.) Application of data analysis
techniques for discovery and analysis of useful information from the Web is
known as Web analysis. Over the past
few years the World Wide Web has emerged as an important repository of
information for many day-to-day applications for individual consumers, as well
as a significant platform for e-commerce and for social networking. These
properties make it an interesting target for data analysis applications. The
Web mining and analysis field is an integration of a wide range of fields
spanning information retrieval, text analysis, natural language processing,
data mining, machine learning, and statistical analysis.
The goals of Web analysis are to improve and personalize the relevance of
search results and to identify trends that may be of value to various
businesses and organizations. We elaborate on these goals next.
Finding relevant information. People
usually search for specific information on the Web by entering keywords in a
search engine or browsing information portals and using services. Search
services are constrained by search relevance problems, since they have to map
and approximate the information needs of millions of users as an a priori task.
Low precision (see Section 27.6) ensues because of results that are not relevant
to the user. In the case of the Web, high recall (see Section 27.6) is impossible
to determine due to the inability to index all the pages on the Web. Also,
measuring recall does not make much sense, since the user is concerned with only
the top few documents; the most relevant feedback for the user typically comes
from only the top few results.
Personalization of the information. Different
people have different content and
presentation preferences. By collecting personal information and then
generating user-specific dynamic Web pages, a system can personalize its pages
for the user. The customization tools used in various Web-based applications
and services, such as click-through monitoring, eyeball tracking, explicit or
implicit user profile learning, and dynamic service composition using Web APIs,
support service adaptation and personalization. A personalization engine
typically has algorithms that make use of the user’s personalization
information, collected by various tools, to generate user-specific search
results. (A minimal sketch of such a re-ranking step appears after this list
of goals.)
Finding information of commercial value. This problem deals with finding
interesting patterns in users’ interests, behaviors, and their use of
products and services, which may be of commercial value. For example,
businesses such as the automobile industry, clothing, shoes, and cosmetics may
improve their services by identifying patterns such as usage trends and user
preferences using various Web analysis techniques.
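As an illustration of the personalization goal above, the following minimal Python sketch re-ranks a set of search results by boosting pages whose topics overlap with a user profile. The result list, the topic labels, the profile weights, and the scoring rule are all hypothetical; a real personalization engine would use far richer signals.

    # Minimal personalization sketch: re-rank results by boosting pages whose
    # topics overlap with a (hypothetical) user profile. The profile weights
    # and the scoring rule are illustrative assumptions only.

    def personalize(results, profile):
        """results: list of (url, base_score, topics); profile: {topic: weight}."""
        def score(item):
            url, base, topics = item
            boost = sum(profile.get(t, 0.0) for t in topics)
            return base * (1.0 + boost)      # simple multiplicative boost
        return sorted(results, key=score, reverse=True)

    results = [
        ("example.com/cameras", 0.70, {"photography", "shopping"}),
        ("example.com/news",    0.75, {"news"}),
    ]
    profile = {"photography": 0.5}           # e.g., learned from past clicks
    print(personalize(results, profile))     # the camera page now ranks first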
Based on the above goals, we can classify Web analysis into three
categories: Web content analysis, which deals with extracting useful
information/knowledge from Web page
contents; Web structure analysis,
which discovers knowledge from hyperlinks representing the structure of the
Web; and Web usage analysis, which
mines user access patterns from usage logs that record the activity of every
user.
2. Searching the Web
The World Wide Web is a huge corpus of information, but locating
resources that are both high quality and relevant to the needs of the user is
very difficult. The set of Web pages taken as a whole has almost no unifying
structure, with variability in authoring style and content, thereby making it
more difficult to precisely locate needed information. Index-based search
engines have been one of the prime tools by which users search for information
on the Web. Web search engines crawl
the Web and create an index to the Web for searching purposes. When a user
specifies his need for information by supplying keywords, these Web search
engines query their repository of indexes and produce links or URLs with
abbreviated content as search results. There may be thousands of pages relevant
to a particular query, so the problem becomes returning only the few most
relevant results to the user. The discussion of querying and relevance-based
ranking in IR systems in Sections 27.2 and 27.3 is applicable to Web search
engines; these ranking algorithms also explore the link structure of the Web.
Web pages, unlike standard text collections, contain connections to
other Web pages or documents (via the use of hyperlinks), allowing users to
browse from page to page. A hyperlink
has two components: a destination page
and an anchor text describing the
link. For example, a person can link to the Yahoo! Website on his Web page with
anchor text such as “My favorite Website.” Anchor texts can be thought of as
being implicit endorsements. They provide very important latent human
annotation. A person linking to other Web pages from his Web page is assumed
to have some relation to those Web pages. Web search engines aim to distill
results according to their relevance and authority. There are many redundant
hyperlinks, such as the links to the homepage that appear on every Web page of
a Website. Such hyperlinks must be eliminated from the search results by the
search engines.
A hub is a Web page or a
Website that links to a collection of prominent sites (authorities) on a common
topic. A good authority is a page
that is pointed to by many good hubs, while a good hub is a page that points to
many good authorities. These ideas are used by the HITS ranking algorithm,
which is described in Section 27.7.3. It is often found that authoritative
pages are not very self-descriptive, and authorities on broad topics seldom
link directly to one another. These properties of hyperlinks are being actively
used to improve Web search engine result ranking and organize the results as
hubs and authorities. We briefly discuss a couple of ranking algorithms below.
3. Analyzing the Link
Structure of Web Pages
The goal of Web structure
analysis is to generate a structural summary of a Website and its Web pages.
It focuses on the inner structure of documents and deals with the hyperlink
structure at the interdocument level. The structure and
content of Web pages are often combined for information retrieval by Web search
engines. Given a collection of interconnected Web documents, interesting and
informative facts describing their connectivity in the Web subset can be
discovered. Web structure analysis is also used to reveal the structure of Web
pages, which helps with navigation and makes it possible to compare/integrate
Web page schemes. This aspect of Web structure analysis facilitates Web
document classification and clustering on the basis of structure.
The PageRank Ranking Algorithm. As
discussed earlier, ranking algorithms are used to
order search results based on relevance and authority. Google uses the
well-known PageRank algorithm, which is based on the
“importance” of each page. Every Web page has a number of forward links
(out-edges) and backlinks (in-edges). It is very difficult to determine all the
backlinks of a Web page, while it is relatively straightforward to determine
its forward links. According to the PageRank algorithm, highly linked pages are
more important (have greater authority) than pages with fewer links. However,
not all backlinks are important. A backlink to a page from a credible source is
more important than a link from some arbitrary page. Thus a page has a high
rank if the sum of the ranks of its backlinks is high. PageRank was an attempt
to see how good an approximation to the “importance” of a page can be obtained
from the link structure.
The computation of page ranking follows an iterative approach. PageRank
of a Web page is calculated as a sum of the PageRanks of all its backlinks.
PageRank treats the Web like a Markov
model. An imaginary Web surfer visits an infinite string of pages by
clicking randomly. The PageRank of a page is an estimate of how often the
surfer winds up at a particular page. PageRank is a measure of the
query-independent importance of a page/node. For example, let P(X)
be the PageRank of any page X, let C(X)
be the number of outgoing links from page X,
and let d be the damping factor in
the range 0 < d < 1; usually d is set to 0.85. Then the PageRank of a
page A can be calculated as:
P(A) = (1 − d) + d (P(T1)/C(T1) + ... + P(Tn)/C(Tn))
Here T1, T2, ..., Tn are the pages
that point to page A (that is, are
citations to page A). PageRank forms
a probability distribution over Web pages, so the sum of all Web pages’ PageRanks is one.
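The following minimal Python sketch implements this iterative computation, following the formula above, on a small made-up four-page link graph; the fixed number of iterations stands in for a proper convergence test.

    # Iterative PageRank following the formula in the text:
    # P(A) = (1 - d) + d * sum(P(Ti)/C(Ti)) over pages Ti linking to A.
    # The four-page link graph below is a made-up example.

    def pagerank(links, d=0.85, iterations=50):
        """links: dict mapping each page to the list of pages it links to."""
        pages = list(links)
        rank = {p: 1.0 for p in pages}            # initial ranks
        for _ in range(iterations):
            new_rank = {}
            for p in pages:
                backlink_sum = sum(
                    rank[q] / len(links[q])       # P(Ti)/C(Ti)
                    for q in pages if p in links[q]
                )
                new_rank[p] = (1 - d) + d * backlink_sum
            rank = new_rank
        return rank

    toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
    print(pagerank(toy_web))   # C, with the most backlinks, gets the highest rank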
The HITS Ranking Algorithm. The HITS algorithm, proposed by Jon Kleinberg, is another ranking algorithm that exploits the link
structure of the Web. The algorithm presumes that a good hub is a document that
points to many good authorities, and a good authority is a document that is
pointed to by many good hubs. The algorithm contains two main steps: a sampling
component and a weight-propagation component. The sampling component constructs
a focused collection S of pages with
the following properties:
S is relatively small.
S is rich in relevant pages.
S contains most (or a majority) of
the strongest authorities.
The weight-propagation component recursively calculates the hub and authority
values for each document as follows:
Initialize the hub and authority values of all pages in S by setting them to 1.
While (hub and authority values do not converge):
For each page in S, calculate authority value = sum of the hub values of all pages pointing to the current page.
For each page in S, calculate hub value = sum of the authority values of all pages pointed at by the current page.
Normalize the hub and authority values such that the sum of all hub values in S
equals 1 and the sum of all authority values in S equals 1.
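The following minimal Python sketch carries out these weight-propagation steps on a small, made-up focused collection S; a fixed number of iterations stands in for the convergence test.

    # Sketch of the HITS weight-propagation step described above, on a small
    # made-up focused collection S. Hub/authority values start at 1 and are
    # normalized each round so that each set of scores sums to 1.

    def hits(links, iterations=20):
        """links: dict mapping each page in S to the pages in S it points to."""
        pages = list(links)
        hub = {p: 1.0 for p in pages}
        auth = {p: 1.0 for p in pages}
        for _ in range(iterations):
            # authority value = sum of hub values of pages pointing to the page
            auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
            # hub value = sum of authority values of pages pointed at by the page
            hub = {p: sum(auth[q] for q in links[p]) for p in pages}
            # normalize so that each score set sums to 1
            auth_total, hub_total = sum(auth.values()), sum(hub.values())
            auth = {p: v / auth_total for p, v in auth.items()}
            hub = {p: v / hub_total for p, v in hub.items()}
        return hub, auth

    S = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
    hub, auth = hits(S)
    print(max(auth, key=auth.get))   # a1: pointed to by both hubs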
4. Web Content Analysis
As mentioned earlier, Web content
analysis refers to the process of discovering useful information from Web
content/data/documents. The Web content
data consists of unstructured data such as free text from electronically
stored documents, semi-structured data typically found as HTML documents with
embedded image data, and more structured data such as tabular data and pages
in HTML, XML, or other markup languages generated as output from databases.
More generally, the term Web content refers
to any real data in the Web page that is intended for the user accessing that page. This usually
consists of but is not limited to text and graphics.
We will first discuss some preliminary Web content analysis tasks and
then look at the traditional analysis tasks of Web page classification and
clustering.
Structured Data Extraction.
Structured data on the Web is often very important as it represents essential information, such as a structured table
showing the airline flight schedule between two cities. There are several
approaches to structured data extraction. One involves writing a wrapper, a
program that looks for different structural characteristics of the information
on the page and extracts the right content. Another approach is to manually
write an extraction program for each Website based on observed format patterns
of the site, which is very labor intensive and time consuming and does not
scale to a large number of sites. A third approach is wrapper induction or
wrapper learning, where the user first manually labels a set of training pages,
and the learning system generates rules, based on those training pages, that
are applied to extract target items from other Web pages. A fourth approach is
the automatic approach, which aims to find patterns/grammars in the Web pages
and then uses wrapper generation to produce a wrapper to extract data
automatically.
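As a concrete, deliberately simplified illustration of the first approach, the following sketch is a tiny hand-written wrapper built on Python's standard html.parser module; it exploits one structural characteristic, table rows, to extract cells from a made-up flight-schedule fragment.

    # Minimal hand-written "wrapper": it looks for one structural pattern
    # (rows of an HTML table) and extracts the cells. The flight-schedule
    # markup is a made-up example; real wrappers must handle messier pages.
    from html.parser import HTMLParser

    class TableWrapper(HTMLParser):
        def __init__(self):
            super().__init__()
            self.rows, self._row, self._in_cell = [], None, False
        def handle_starttag(self, tag, attrs):
            if tag == "tr":
                self._row = []
            elif tag in ("td", "th"):
                self._in_cell = True
        def handle_endtag(self, tag):
            if tag == "tr" and self._row:
                self.rows.append(self._row)
            elif tag in ("td", "th"):
                self._in_cell = False
        def handle_data(self, data):
            if self._in_cell:
                self._row.append(data.strip())

    page = """<table>
      <tr><th>Flight</th><th>Departs</th><th>Arrives</th></tr>
      <tr><td>XY123</td><td>08:10</td><td>10:45</td></tr>
    </table>"""
    wrapper = TableWrapper()
    wrapper.feed(page)
    print(wrapper.rows)   # [['Flight', 'Departs', 'Arrives'], ['XY123', '08:10', '10:45']]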
Web Information
Integration. The Web is immense and has
millions of documents, authored by many different persons and organizations.
Because of this, Web pages that contain similar information may have different
syntax and different words that describe the same concepts. This creates the
need for integrating information from diverse Web pages. Two popular approaches
for Web information integration are:
Web query interface integration, to
enable querying multiple Web databases that are not visible in external interfaces and are hidden in the
“deep Web.” The deep Web consists of those pages that do
not exist until they are created dynamically as the result of a specific
database search, which produces some of the information in the page (see
Chapter 14). Since traditional search engine crawlers cannot probe and collect
information from such pages, the deep Web has heretofore been hidden from
crawlers.
Schema matching, such as integrating directories
and catalogs to come up with a
global schema for applications. An example of such an application would be to
assemble an individual’s personal health record by matching and collecting
data from various sources dynamically, cross-linking health records from
multiple systems.
These approaches remain an area of active research and a detailed
discussion of them is beyond the scope of this book. Consult the Selected
Bibliography at the end of this chapter for further details.
Ontology-Based Information
Integration. This task involves using
ontologies
to effectively combine information from multiple
heterogeneous sources. Ontologies—formal models of representation with
explicitly defined concepts and named relationships linking them—are used to
address the issues of semantic heterogeneity in data sources. Different
classes of approaches are used for information integration using ontologies.
Single ontology approaches use one
global ontology that provides a shared vocabulary
for the specification of the semantics. They work if all information sources
to be integrated provide nearly the same view on a domain of knowledge. For
example, UMLS (described in Section 27.4.3) can serve as a common ontology for
biomedical applications.
In a multiple ontology approach, each information source is described by
its own ontology. In principle, the “source ontology” can be a combination of
several other ontologies but it cannot be assumed that the different “source
ontologies” share the same vocabulary. Dealing with multiple, partially
overlapping, and potentially conflicting ontologies is a very difficult
problem faced by many applications, including those in bioinformatics and other
complex areas of knowledge.
Hybrid ontology approaches are similar to multiple ontology approaches: the semantics of each source is described by its own ontology. But
in order to make the source ontologies comparable to each other, they are built
upon one global shared vocabulary. The shared vocabulary contains basic terms
(the primitives) of a domain of knowledge. Because each term of a source
ontology is based on the primitives, the terms become more easily comparable
than in multiple ontology approaches. The advantage of a hybrid approach is that new
sources can be easily added without the need to modify the mappings or the
shared vocabulary. In multiple and hybrid approaches, several research issues,
such as ontology mapping, alignment, and merging, need to be addressed.
Building Concept Hierarchies.
One common way of organizing search results is via a linear ranked list of documents. But for some users and
applications, a better way to display results would be to create groupings of
related documents in the search result. One way of organizing documents in a
search result, and for organizing information in general, is by creating a
concept hierarchy, in which the documents in a search result are organized into
groups in a hierarchical fashion. Other related techniques for organizing
documents are classification and clustering (see Chapter 28). Clustering
creates groups of documents such that the documents in each group share many
common concepts.
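The following minimal sketch illustrates grouping documents that share common concepts; it uses scikit-learn (an assumed dependency, not mentioned in the text) to build TF-IDF vectors for a few made-up snippets and cluster them with k-means.

    # Minimal sketch of grouping result documents by shared concepts:
    # TF-IDF vectors plus k-means with two clusters, on made-up snippets.
    # scikit-learn is an assumed dependency.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    docs = [
        "jaguar speed and habitat in the rainforest",
        "jaguar car dealership and engine specs",
        "big cats of south america: jaguar and puma",
        "used car prices and engine maintenance",
    ]
    X = TfidfVectorizer(stop_words="english").fit_transform(docs)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(labels)   # cluster labels: ideally animal-related vs. car-related groups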
Segmenting Web Pages and
Detecting Noise. There are many superfluous parts in a Web document, such as advertisements and navigation panels.
The information and text in these superfluous parts should be eliminated as
noise before classifying the documents based on their content. Hence, before
applying classification or clustering algorithms to a set of documents, the
areas or blocks of the documents that contain noise should be removed.
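The following minimal sketch illustrates one simple way to drop such noise blocks before classification or clustering; the list of tags treated as noise is a heuristic assumption, and real pages require more sophisticated segmentation.

    # Simple noise-removal pass: drop text inside blocks that are commonly
    # navigation, ads, or scripting (the tag list is a heuristic assumption)
    # and keep the remaining text for classification/clustering.
    from html.parser import HTMLParser

    NOISE_TAGS = {"script", "style", "nav", "aside", "footer", "header"}

    class NoiseStripper(HTMLParser):
        def __init__(self):
            super().__init__()
            self.depth, self.text = 0, []
        def handle_starttag(self, tag, attrs):
            if tag in NOISE_TAGS:
                self.depth += 1
        def handle_endtag(self, tag):
            if tag in NOISE_TAGS and self.depth > 0:
                self.depth -= 1
        def handle_data(self, data):
            if self.depth == 0 and data.strip():
                self.text.append(data.strip())

    stripper = NoiseStripper()
    stripper.feed("<nav>Home | About</nav><p>Main article text.</p>")
    print(" ".join(stripper.text))   # "Main article text."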
5. Approaches to Web
Content Analysis
The two main approaches to Web content analysis are (1) agent-based (IR view)
and (2) database-based (DB view).
The agent-based approach involves the development of sophisticated artificial intelligence systems that can act autonomously or
semi-autonomously on behalf of a particular user, to discover and process
Web-based information. Generally, the agent-based Web analysis systems can be
placed into the following three categories:
Intelligent Web agents are
software agents that search for relevant information using characteristics of
a particular application domain (and possibly a user profile) to organize and
interpret the discovered information. For example, an intelligent agent may
retrieve product information from a variety of vendor sites using only
general information about the product domain.
Information Filtering/Categorization is
another technique that utilizes Web agents
for categorizing Web documents. These Web agents use methods from information
retrieval, as well as semantic information based on the links among documents,
to organize documents into a concept hierarchy.
Personalized Web agents are another type of Web agent that utilizes the personal preferences of
users to organize search results, or to discover information and documents
that could be of value for a particular user. User preferences could be learned
from previous user choices, or from other individuals who are considered to
have similar preferences to the user.
The database-based approach aims to infer the structure of a Website or to
transform the site to organize it as a database so that better information
management and querying on the Web become possible. This approach to Web
content analysis primarily tries to model the data on the Web and integrate it
so that more sophisticated queries than keyword-based search can be performed.
This can be achieved by finding the schema of Web documents and building a Web
document warehouse, a Web
knowledge base, or a virtual database. The database-based approach may use a
model such as the Object Exchange Model (OEM) that represents semi-structured
data by a labeled graph. The data in the OEM is viewed as a graph, with objects
as the vertices and labels on the edges. Each object is identified by an object
identifier and a value that is either atomic—such as integer, string, GIF
image, or HTML document—or complex in the form of a set of object references.
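The following minimal Python sketch mirrors the OEM idea described above: each object carries an identifier and either an atomic value or a set of labeled references to other objects. The bookstore objects are made up for illustration.

    # Minimal sketch of the Object Exchange Model idea: each object has an
    # identifier and either an atomic value or labeled references to other
    # objects. The bookstore objects below are made-up examples.

    class OEMObject:
        def __init__(self, oid, value=None):
            self.oid = oid          # object identifier
            self.value = value      # atomic value (int, str, ...) or None
            self.edges = []         # list of (label, OEMObject) references

        def add(self, label, obj):
            self.edges.append((label, obj))
            return obj

    store = OEMObject("o1")                                   # complex object
    book = store.add("book", OEMObject("o2"))                 # complex object
    book.add("title", OEMObject("o3", "Database Systems"))    # atomic value
    book.add("price", OEMObject("o4", 79))                    # atomic value

    # follow labeled edges, e.g. collect every 'title' reachable from the book
    titles = [obj.value for label, obj in book.edges if label == "title"]
    print(titles)   # ['Database Systems']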
The main focus of the database-based approach has been with the use of
multilevel databases and Web query systems. A multilevel database at its lowest level is a database containing
primitive semistructured information stored in various Web repositories, such
as hypertext documents. At the higher levels, metadata or generalizations are
extracted from lower levels and organized in structured collections such as
relational or object-oriented databases. In a Web query system, information about the content and structure of
Web documents is extracted and organized using database-like techniques. Query
languages similar to SQL can then be used to search and query Web documents.
Such queries combine structural queries, based on the organization of hypertext
documents, with content-based queries.
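The following sketch illustrates the Web query system idea: extracted page content and link structure are stored using database techniques (here, an in-memory SQLite database, an assumed choice) and queried with SQL, combining a structural condition (incoming links) with a content-based condition (title keywords). The two-table schema and sample rows are illustrative assumptions.

    # Store extracted page metadata and link structure, then query with SQL.
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE page(url TEXT PRIMARY KEY, title TEXT);
        CREATE TABLE link(src TEXT, dst TEXT);
        INSERT INTO page VALUES ('a.com', 'Database tutorials'),
                                ('b.com', 'Cooking recipes');
        INSERT INTO link VALUES ('b.com', 'a.com'), ('c.com', 'a.com');
    """)

    # Combined structural + content-based query: pages whose title mentions
    # 'database' and that have at least two incoming links.
    rows = db.execute("""
        SELECT p.url, COUNT(l.src) AS in_links
        FROM page p JOIN link l ON l.dst = p.url
        WHERE p.title LIKE '%database%'
        GROUP BY p.url HAVING in_links >= 2
    """).fetchall()
    print(rows)   # [('a.com', 2)]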
6. Web Usage Analysis
Web usage analysis is the application of data analysis techniques to discover usage patterns from Web data, in order to
understand and better serve the needs of Web-based applications. This activity
does not directly contribute to information retrieval, but it is important for
improving or enhancing the users’ search experience. Web usage data describes the pattern of usage of Web pages, such as
IP addresses, page references, and
the date and time of accesses for a user, user group, or an application. Web
usage analysis typically consists of three main phases: preprocessing, pattern
discovery, and pattern analysis.
Preprocessing. Preprocessing converts the
information collected about usage
statistics and patterns into a form that can be utilized by the pattern
discovery methods. We use the term “page view” to refer to pages viewed or
visited by a user. There are several different types of preprocessing
techniques available:
Usage preprocessing analyzes the available collected data about the usage patterns of users,
applications, and groups of users. Because this data is often incomplete, the
process is difficult. Data cleaning techniques are necessary to eliminate the
impact of irrelevant items on the analysis result. Frequently, usage data is
identified by an IP address and consists of clickstreams collected at the
server. Better data is available if a usage tracking process is installed at
the client side. (A minimal sessionization sketch appears after this list.)
Content preprocessing is the process
of converting text, images, scripts, and other
content into a form that can be used by the usage analysis. Often, this
consists of performing content analysis such as classification or clustering.
The clustering or classification techniques can group usage information for
similar types of Web pages, so that usage patterns can be discovered for
specific classes of Web pages that describe particular topics. Page views can
also be classified according to their intended use, such as for sales, for
discovery, or for other purposes.
Structure preprocessing: Structure preprocessing can be done by parsing and
reformatting the information about hyperlinks and the structure among viewed
pages. One difficulty
is that the site structure may be dynamic and may have to be constructed for
each server session.
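The following minimal usage-preprocessing sketch parses made-up server log lines (IP address, timestamp, page) and groups page views into per-IP sessions using a 30-minute inactivity timeout, a commonly assumed heuristic.

    # Minimal usage-preprocessing sketch: parse made-up log lines and group
    # page views into per-IP sessions using a 30-minute inactivity timeout.
    from datetime import datetime, timedelta

    LOG = [
        "10.0.0.1 2024-01-05T09:00:00 /home",
        "10.0.0.1 2024-01-05T09:05:00 /products",
        "10.0.0.1 2024-01-05T11:00:00 /home",      # new session after long gap
        "10.0.0.2 2024-01-05T09:01:00 /about",
    ]

    def sessionize(log, timeout=timedelta(minutes=30)):
        sessions, last_seen, current = [], {}, {}
        for line in log:
            ip, ts, page = line.split()
            t = datetime.fromisoformat(ts)
            if ip not in current or t - last_seen[ip] > timeout:
                current[ip] = []                    # start a new session
                sessions.append((ip, current[ip]))
            current[ip].append(page)
            last_seen[ip] = t
        return sessions

    for ip, pages in sessionize(LOG):
        print(ip, pages)
    # 10.0.0.1 ['/home', '/products'] ; 10.0.0.1 ['/home'] ; 10.0.0.2 ['/about']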
Pattern Discovery
The techniques that are used in pattern discovery are based on methods
from the fields of statistics, machine learning, pattern recognition, data
analysis, data mining, and other similar areas. These techniques are adapted so
that they take into consideration the specific knowledge and characteristics of
Web analysis. For example, in association rule discovery (see Section 28.2), the
notion of a transaction for market-basket analysis considers the items to be
unordered. But the order in which Web pages are accessed is important, and so
it should be considered in Web usage analysis. Hence, pattern discovery involves
mining sequences of page views. In general, using Web usage data, the following
types of data mining activities may be performed for pattern discovery.
Statistical analysis. Statistical
techniques are the most common means of
extracting knowledge about visitors to a Website. By analyzing the session
log, it is possible to apply statistical measures such as mean, median, and
frequency count to parameters such as pages viewed, viewing time per page,
length of navigation paths between pages, and other parameters that are
relevant to Web usage analysis. (A combined sketch of this and the following
association rule and sequential pattern techniques appears after this list.)
Association rules. In the context of Web usage
analysis, association rules refer to
sets of pages that are accessed together with a support value exceeding some
specified threshold. (See Section 28.2 on association rules.) These pages may
not be directly connected to one another via hyperlinks. For example,
association rule discovery may reveal a correlation between users who visited a
page about electronic products and those who visited a page about sporting
equipment.
Clustering. In the Web usage domain, there
are two kinds of interesting clusters
to be discovered: usage clusters and page clusters. Clustering of users tends
to establish groups of users exhibiting similar browsing patterns.
Such knowledge is especially useful for inferring user demographics in
order to perform market segmentation in e-commerce applications or to provide
personalized Web content to the users. Clustering
of pages is based on the content of the pages, and pages with similar
contents are grouped together. This type of clustering can be utilized in
Internet search engines, and in tools that provide assistance to Web browsing.
Classification. In the Web domain, one goal is to
develop a profile of users belonging
to a particular class or category. This requires extraction and selection of
features that best describe the properties of a given class or category of
users. As an example, an interesting pattern that may be discovered would be:
60% of users who placed an online order in /Product/Books are in the 18-25 age
group and live in rented apartments.
Sequential patterns. These
kinds of patterns identify sequences of Web
accesses, which may be used to predict the next set of Web pages to be
accessed by a certain class of users. These patterns can be used by marketers
to produce targeted advertisements on Web pages. Another type of sequential
pattern pertains to which items are typically purchased following the purchase
of a particular item. For example, after purchasing a computer, a printer is
often purchased.
Dependency modeling. Dependency
modeling aims to determine and model
significant dependencies among the various variables in the Web domain. As an
example, one may be interested in building a model representing the different
stages a visitor undergoes while shopping in an online store based on the
actions chosen (e.g., from a casual visitor to a serious potential buyer).
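The following combined sketch illustrates three of the pattern discovery activities above on made-up sessions (each a list of pages viewed during one visit): simple frequency statistics, support counting for page pairs accessed together, and counts of next-page transitions as a crude form of sequential pattern. The support threshold is an illustrative assumption.

    # Combined pattern discovery sketch on made-up sessions: statistics,
    # page-pair "association" support, and next-page sequential counts.
    from collections import Counter
    from itertools import combinations

    sessions = [
        ["/home", "/electronics", "/sports"],
        ["/home", "/electronics"],
        ["/home", "/sports"],
    ]

    # Statistical analysis: frequency count of page views
    page_freq = Counter(p for s in sessions for p in s)

    # Association rules (support only): page pairs accessed in the same session
    pair_support = Counter()
    for s in sessions:
        for pair in combinations(sorted(set(s)), 2):
            pair_support[pair] += 1
    frequent_pairs = {p: c / len(sessions) for p, c in pair_support.items()
                      if c / len(sessions) >= 0.5}     # support threshold 0.5

    # Sequential patterns: which page tends to follow which
    transitions = Counter((s[i], s[i + 1]) for s in sessions for i in range(len(s) - 1))

    print(page_freq.most_common(1))     # [('/home', 3)]
    print(frequent_pairs)               # pairs with support of about 0.67
    print(transitions.most_common(1))   # [(('/home', '/electronics'), 2)]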
Pattern Analysis
The final step is to filter out, from the discovered patterns, those rules or
patterns that are considered not to be of interest. The particular analysis
methodology depends on the application. One common technique for
pattern analysis is to use a query language such as SQL to detect various
patterns and relationships. Another technique involves loading usage data
into a data warehouse with ETL tools and performing OLAP operations to view it
along multiple dimensions (see Section 29.3). It is common to use visualization
techniques, such as graphing patterns or assigning colors to different values,
to highlight patterns or trends in the data.
7. Practical
Applications of Web Analysis
Web Analytics. The goal of Web analytics is to understand and optimize the performance of Web usage. This
requires collecting, analyzing, and monitoring Internet usage
data. On-site Web analytics measures the performance of a Website in a
commercial context. This data is typically compared against key performance
indicators to measure effectiveness or performance of the Website as a whole,
and can be used to improve a Website or improve the marketing strategies.
Web Spamming. It has become increasingly important for companies and individuals to
have their Websites/Web pages appear in the top search results. To achieve
this, it is essential to understand search engine ranking algorithms and to
present the information in one’s page in such a way that the page is ranked
high when the respective keywords are queried. There is a thin line between
legitimate page optimization for business purposes and spamming. Web spamming is thus defined as a
deliberate activity to promote one’s page by manipulating the results returned
by the search engines. Web analysis may be used to detect such pages and
discard them from search results.
Web Security. Web analysis can be used to find interesting usage patterns of Websites. If any flaw in a Website has been exploited, it can be
inferred using Web analysis, thereby allowing the design of more robust
Websites. For example, backdoors or information leaks in Web servers can be
detected by applying Web analysis techniques to abnormal Web application log
data. Security analysis techniques such as intrusion detection and the
detection of denial-of-service attacks are based on Web access pattern analysis.
Web Crawlers. Web crawlers are programs that visit Web pages and create copies of all the visited pages so that a search engine can
index the downloaded pages and provide fast searches. Another use of crawlers
is to automatically check and maintain Websites. For example, the HTML
code and the links in a Website can be checked and validated by the crawler.
Another unfortunate use of crawlers is to collect e-mail addresses from Web
pages, so they can be used for spam e-mails later.
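The following minimal single-host crawler sketch, built only on Python's standard library, fetches a page, extracts its links, and visits them breadth-first up to a small limit, keeping a copy of each page for later indexing. The seed URL is a placeholder, and a real crawler must also respect robots.txt, rate limits, and content types.

    # Minimal single-host crawler sketch: fetch, extract links, visit
    # breadth-first up to a small limit, and keep copies for indexing.
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse
    from urllib.request import urlopen
    from collections import deque

    class LinkExtractor(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seed, max_pages=10):
        host = urlparse(seed).netloc
        seen, queue, pages = {seed}, deque([seed]), {}
        while queue and len(pages) < max_pages:
            url = queue.popleft()
            try:
                html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
            except OSError:
                continue                        # skip unreachable pages
            pages[url] = html                   # store a copy for indexing
            parser = LinkExtractor()
            parser.feed(html)
            for href in parser.links:
                absolute = urljoin(url, href)
                if urlparse(absolute).netloc == host and absolute not in seen:
                    seen.add(absolute)
                    queue.append(absolute)
        return pages

    # pages = crawl("https://example.com")   # placeholder seed URL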