Indexing the Right Stuff
So, let's get back to whether you need a
search engine. Let's assume that you do intend to slap a search engine on top
of your web site. Shouldn't be a problem, right? Just point the indexer at the
directory where all the pages live, and, voilà! Searchable site!
Of course, you knew it wasn't that simple.
Searching only works well when the stuff that's being searched is the same as
the stuff that users want. This means you may not want to index the entire
site. We'll explain.
1. Indexing the Entire Site
Search engines are frequently used to index an
entire site without regard for the content and how it might vary - every word
of every page, whether it contains real content or help information,
advertising, navigation menus, and so on.
However, searching works much better when the
information space is defined narrowly and contains homogeneous content. In
other words, the more you search through indices that combine apples and oranges, the worse your retrieval results will be. After all, when
you search a site, you're probably looking for apples only, not oranges.
As already discussed, a site's content is usually a mix of apples, oranges, kumquats, bell peppers, chainsaws, and
Barbie dolls to begin with. So, when you tell your search engine to index your
entire site, the site's users will be performing searches against all kinds of
stuff - navigation, destination, and other kinds of pages - all at once. What
they retrieve can often be ugly.
Let's try an example to see what happens.
Searching Netscape's site for plug-ins,
what do we find? Exactly 100 documents. Of these:
• 58 documents are Welcome to Netscape Navigator version X.X pages for just
about every version of Netscape Navigator and include information about
plug-ins.
• 16 documents are in German (a language I don't read).
• 6 documents contain the potentially relevant term application in their
titles, but 5 of these 6 have exactly the same title (Netscape Handbook:
Application Features).
• 2 documents actually contain plug-in in their titles.
• 18 other assorted documents may be relevant, but are not labeled in a way
that indicates whether this is the case.
Analyzing these search results, we find two
common problems. First, we are presented with documents that clearly don't
belong. If the site had been selectively indexed with audience differences in
mind, the 16 German documents (16% of the results) would not have been
displayed at all. Second,
regarding relevant documents, it's not clear why we need 58 versions of the
same type of document. It would have been useful to index pages more
selectively, such as files relevant to Windows or Macintosh users, or recent
versions versus older versions of the software. Are very many people still
interested in old Netscape Beta versions? So, our search is less successful
than it could have been; it gave us a lot of irrelevant documents, and too many
that could be relevant.
Our search performed poorly because all the
content in the site was indexed together. By doing so, the site's architects
chose to ignore two very important things: that the information in their site
isn't all the same, and that it makes good sense to respect the lines already
drawn between different types of content. For example, it's clear that German
and English content are vastly different and that their audiences overlap very
little (if at all), so why not create separately searchable indices along those
divisions?
The site designers at Netscape are already
doing this, in a limited way. They have put a lot of effort into helping you
download the right version of the software from the nearest location. To
download the software, you get asked several questions (not unlike those in a
reference interview). As shown in Figure 6.15, the
site asks the user:
• What operating system does your computer use?
• What language do you speak?
• Which of our products do you need?
The result is a list of links to download
sites that provide the user the right information (i.e., software appropriate
to the user's platform), taking into account his or her geographic location and
language. Why not apply this same careful approach, matching users with the
right information, to the entire site instead of just to this specific
situation?
Figure 6.15. Three pull-down menus perform a brief reference
interview sufficient to help users download the appropriate software product.
2. Search Zones: Selectively Indexing the Right Content
Search zones are subsets of a web site that
have been indexed separately from the rest of the site's content. When you
search a search zone, you have, through interaction with the site, already
identified yourself as a member of a particular audience or as someone
searching for a particular type of information. The search zones in a site
match those specific needs, and the result is improved retrieval performance.
The user is simply less likely to retrieve irrelevant information.
The Microsoft site has a good example of
search zone use. Although this site suffers from other searching problems, it
compares favorably to the Netscape site when searching for our old stand-by, plug-ins. On the search page you're
asked where you want to search in the Microsoft site, and are provided with the
options on a pull-down menu (Figure 6.16).
Figure 6.16. Microsoft's site employs search zones to help focus
the user's search before submitting a query to the search engine.
You've got many options to review, but you can
quickly find the Internet Explorer
area of the site where you'd want to look for plug-ins. Consider how favorably
the small effort of reviewing and selecting from this menu compares to
the much greater effort of searching the entire site and then sifting through a
far larger retrieval set. Also note the Full Site Search option; sometimes it
does make sense to maintain
an index of the entire site, especially for users who are unsure where to look,
who are doing a comprehensive leave-no-stone-unturned search, or who just
haven't had any luck searching the more narrowly defined indices.
How is search zone indexing set up? It depends
on the search engine software used. Most support the creation of search zones,
but some provide interfaces that make this process easier, while others require
you to manually provide a list of pages to index. In either case, search zone
indexing requires more work on your part than simply pointing the search engine
at the entire site: you'll need to review and mark each page that should be
indexed. To make this easier, you might design your site so that pages that
should be indexed together are located in the same directory; that way, you
would mark for indexing a directory (and, implicitly, its contents) instead of
its individual pages. You may also be working with pages that are generated
from a database. In this case, you could design the database to include a field
for each record denoting which index the generated page should belong to.
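The directory-based setup described above can be sketched in a few lines of Python. This is only an illustration, not how any particular search engine does it: the directory names, the ZONE_BY_DIRECTORY mapping, the sample pages, and the helper functions are all hypothetical, and a real engine would supply its own indexing interface.

```python
import re
from collections import defaultdict

# Hypothetical mapping from a directory prefix to the search zone it feeds.
ZONE_BY_DIRECTORY = {
    "/products/": "products",
    "/support/": "support",
}

def zone_for(path):
    """Return the zone for a page path, or None if the page is not marked."""
    for prefix, zone in ZONE_BY_DIRECTORY.items():
        if path.startswith(prefix):
            return zone
    return None

def build_zone_indices(pages):
    """Build one inverted index (word -> set of paths) per search zone."""
    indices = defaultdict(lambda: defaultdict(set))
    for path, text in pages.items():
        zone = zone_for(path)
        if zone is None:
            continue  # pages outside every marked directory are not indexed
        for word in re.findall(r"[a-z0-9-]+", text.lower()):
            indices[zone][word].add(path)
    return indices

def search(indices, zone, word):
    """Look up a single word in one zone's index."""
    return sorted(indices.get(zone, {}).get(word.lower(), set()))

# Hypothetical pages: path -> page text.
pages = {
    "/products/quicken.html": "Quicken product page with pricing",
    "/support/plug-ins.html": "Installing plug-ins for the browser",
    "/index.html": "Welcome: products, support, plug-ins",
}
indices = build_zone_indices(pages)
print(search(indices, "support", "plug-ins"))
print(search(indices, "products", "plug-ins"))
```

For database-generated pages, the same idea would apply with the zone read from a field on each record instead of being derived from the path.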
You can create search zones in many ways.
Examples of four common approaches are:
• by content type
• by audience
• by subject
• by date
Note that these approaches are similar to the
organization schemes discussed in Chapter 3.
The decisions you made in selecting your site's organization scheme will often
work for determining search zones as well. You could also try other ways; the
most important consideration is to choose an approach appropriate to your
site's audiences and their information needs.
2.1 Apples and apples: indexing similar content types
Most web sites contain, at minimum, two major
and dissimilar types of pages: navigation
and destination. Destination pages
contain the actual information you want from a web site: sport scores, book
reviews, software documentation, and so on. The primary purpose of a site's
navigation pages is to get you to the
destination pages. Navigation pages
may include main pages, search pages, and pages that help you browse a site.
When a user searches a site, he or she is
generally looking for destination pages. If navigation pages are part of the
retrieval, they will just clutter up the retrieval results. In fact, the user
may be searching rather than browsing precisely because
the navigation system is performing poorly in the first place. So why keep
showing the user navigation pages that don't work and aren't relevant to the
search?
Let's take a simple example: your company
sells computer products via its web site. The destination pages consist of
descriptions, pricing, and ordering information, one page for each product.
Also, a number of navigation pages help users find products, such as listings
of products for different platforms (e.g., Macintosh versus Windows), listings
of products for different applications (e.g., word processing, bookkeeping),
listings of business versus home products, and listings of hardware versus
software products. If the user is searching for Intuit's Quicken, what's likely
to happen? Instead of simply retrieving Quicken's product page, they might get
all these pages:
Financial Products Index Page
Home Products Index Page
Macintosh Products Index Page
Quicken Product Page
Software Products Index Page
Windows Products Index Page
The user retrieves the right destination page
(i.e., the Quicken Product Page), but also five more that are purely navigation
pages. In other words, 83% of the retrieval is in the way. And keep in mind
that this example is simple; what if the user had to ignore 83% of a much
larger retrieval set, say, 200 documents?
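To make the arithmetic concrete, here is a minimal Python sketch of the Quicken example. It assumes each page carries a page_type label of "navigation" or "destination"; that label is hypothetical and would have to be assigned when pages are marked for indexing.

```python
# The six pages retrieved in the example, each with an assumed page_type label.
results = [
    ("Financial Products Index Page", "navigation"),
    ("Home Products Index Page", "navigation"),
    ("Macintosh Products Index Page", "navigation"),
    ("Quicken Product Page", "destination"),
    ("Software Products Index Page", "navigation"),
    ("Windows Products Index Page", "navigation"),
]

def destination_only(hits):
    """Keep only the destination pages from a retrieval set."""
    return [title for title, page_type in hits if page_type == "destination"]

def navigation_share(hits):
    """Percentage of the retrieval set made up of navigation pages."""
    nav = sum(1 for _, page_type in hits if page_type == "navigation")
    return round(100 * nav / len(hits))

print(navigation_share(results))   # 5 of 6 pages, i.e. 83
print(destination_only(results))   # ['Quicken Product Page']
```

Filtering at query time, as here, is only one option; leaving navigation pages out of the index altogether has the same effect on what the user sees.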
Of course, indexing similar content isn't
always easy, because "similar" is a highly relative term. It's not
always clear where to draw the line between navigation and destination pages.
In some cases, a page can be considered both. For example, we tried the
approach described here for the SIGGRAPH 96 Conference web site.13
We found that some pages didn't really fit the navigation/destination
breakdown. For example, the Exhibition Hall Map page appears to be navigation.
It links to pages for each of the five sections of the hall. These five pages
appear to be destination, presenting detailed maps of their respective
sections, including booth numbers and the names of exhibitors. But their parent
page also provides important information, such as where the hall entrances are,
and where the five sections are in relation to one another. So isn't the main
Exhibition Hall Map page destination as well as navigation? The best solution,
in this particular case, was to index these hybrid pages, but it wasn't ideal.
The more important lesson from this experience
was to test out the navigation/destination distinctions before actually
applying them. The weakness of the navigation/destination approach is that it
is essentially an exact organization scheme (discussed in Chapter 3), which
requires each page to be either one
thing (in this case destination) or another (navigation). In the following
three approaches, the organization schemes are ambiguous, and therefore more
forgiving of pages that fit into multiple categories.
2.2 Who's going to care? Indexing for specific audiences
If you've already decided to create an
architecture for your site that uses an audience-oriented organization scheme,
it may make sense to create search zones by audience breakdown as well. We
found this a useful approach for the original Library of Michigan web site.
The Library of Michigan has three primary
audiences: members of the Michigan state legislature and their staffs, Michigan
libraries and their librarians, and the citizens of Michigan. The information
needed from this site is different for each of these audiences; for example,
each has a very different circulation policy. Why would a state legislator care
how long a citizen can check a book out for?
So we created four indices: one for the
content relevant to each audience, and one unified index of the entire site in
case the audience-specific indices didn't do the trick for a particular search.
Here are the results from running a query on the word circulation against each of the four indices:
As with any search zone, less overlap between
indices improves performance. If the sizes of retrieval results were reduced by
a very small figure, let's say, 10% or 20%, it may not be worth the overhead of
creating separate audience-oriented indices. But in this case, much of the
site's content is specific to one of the audiences.
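The arrangement of audience-specific indices backed by a unified index can be sketched as follows. The index contents, page names, and function names below are invented for illustration and are not the Library of Michigan's actual data or results.

```python
# Hypothetical indices: zone name -> term -> list of matching pages.
indices = {
    "legislature": {"circulation": ["legislature/circulation-policy"]},
    "libraries": {"circulation": ["libraries/circulation-policy",
                                  "libraries/interlibrary-loan"]},
    "citizens": {"circulation": ["citizens/circulation-policy"]},
    "full_site": {
        "circulation": ["legislature/circulation-policy",
                        "libraries/circulation-policy",
                        "libraries/interlibrary-loan",
                        "citizens/circulation-policy"],
        "directions": ["about/directions"],
    },
}

def search_with_fallback(indices, audience, term, full_site="full_site"):
    """Search the audience zone first; use the unified index if it finds nothing."""
    hits = indices.get(audience, {}).get(term, [])
    if hits:
        return audience, hits
    return full_site, indices.get(full_site, {}).get(term, [])

def reduction_percent(zone_hits, full_hits):
    """How much smaller the zone's retrieval set is than the full-site one."""
    return round(100 * (1 - zone_hits / full_hits))

print(search_with_fallback(indices, "citizens", "circulation"))
print(search_with_fallback(indices, "citizens", "directions"))
print(reduction_percent(1, 4))   # 75
```

A reduction figure like the last one is a quick way to judge whether a zone earns its keep: with this made-up data the citizens zone cuts the retrieval set by 75%, well above the 10% or 20% range where separate indices may not be worth the overhead.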
2.3 Drilling down: Indexing by subject
If your site uses a strong subject-oriented or
topical organization scheme, you've already distinguished many of the site's
search zones. Yahoo! is perhaps the most popular site to employ
subject-oriented search zones. Every subject category and subcategory in Yahoo!
can be searched individually. For example, let's say you're looking for sites
that deal with science fiction movies. If you search for science fiction against the whole Yahoo! search index, you'll
retrieve a lot of stuff: 35 category and subcategory matches and 816 site
matches. But you're not looking for science fiction in general; you're looking
for science fiction movies. So, instead you can run the same science fiction search against the index
for the Yahoo! subcategory Movies and
Films. This time you'll be happier
with your retrieval: 2 category and subcategory matches and 19 site matches. This is another excellent
example of how hierarchical search zones allow for increased specificity, and
therefore improved retrieval results.
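A hierarchical subject zone can be modeled as a category path, with a zone search limited to entries at or below that path. The sketch below is illustrative only: the catalog entries, category paths, and function names are made up and do not reflect Yahoo!'s actual data or implementation.

```python
# Hypothetical catalog entries: (category path, site title).
CATALOG = [
    ("Entertainment/Movies and Films/Genres", "Science Fiction Film Archive"),
    ("Entertainment/Movies and Films/Reviews", "Weekly Science Fiction Movie Reviews"),
    ("Arts/Literature/Genres", "Science Fiction Short Stories"),
    ("Entertainment/Television", "Science Fiction TV Guide"),
]

def in_zone(path, category):
    """True if path is the category itself or one of its subcategories."""
    return category == "" or path == category or path.startswith(category + "/")

def search_zone(catalog, query, category=""):
    """Match the query against titles, restricted to one category subtree.

    An empty category means the whole catalog is searched.
    """
    q = query.lower()
    return [title for path, title in catalog
            if in_zone(path, category) and q in title.lower()]

everything = search_zone(CATALOG, "science fiction")
movies = search_zone(CATALOG, "science fiction", "Entertainment/Movies and Films")
print(len(everything), len(movies))   # 4 2
```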
2.4 Yesterday's news: Indexing recent content
Chronologically organized content allows for
perhaps the easiest implementation of search zones. (Not surprisingly, it's
probably the most common example of search zones.) Because dated materials are
generally not ambiguous, indexing them by date is straightforward.
News.Com is a
great example (Figure 6.17); it supports highly
flexible chronological searching by:
Date Range (e.g., from 5/20/97 to 6/26/97)
3 Days Back
7 Days Back
14 Days Back
21 Days Back
30 Days Back
60 Days Back
90 Days Back
Figure 6.17. News.com's search interface uses two components
(Date range and Number of days back) to allow for powerful chronological
searching.
Regular users can return to the site and catch up on the news published since
their last visit, whether that was one, two, or three weeks ago. Users who are
looking for news during a particular
date range can essentially generate a custom search zone on the fly. The only
negative in News.Com's implementation
is that it doesn't seem to support a search against all news articles,
regardless of age.
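Both kinds of chronological zone, a date range and a number of days back, come down to a filter on each article's date. The following Python sketch uses made-up headlines and function names and is not News.Com's implementation; it also shows how an all-articles search could be offered as the widest possible range.

```python
from datetime import date, timedelta

# Hypothetical articles: (publication date, headline).
ARTICLES = [
    (date(1997, 5, 10), "Headline A"),
    (date(1997, 6, 1), "Headline B"),
    (date(1997, 6, 24), "Headline C"),
]

def in_date_range(articles, start, end):
    """Headlines of articles published from start to end, inclusive."""
    return [title for published, title in articles if start <= published <= end]

def days_back(articles, n, today):
    """Headlines of articles published within the last n days."""
    return in_date_range(articles, today - timedelta(days=n), today)

def all_articles(articles):
    """Search zone covering every article, regardless of age."""
    return in_date_range(articles, date.min, date.max)

today = date(1997, 6, 26)
print(in_date_range(ARTICLES, date(1997, 5, 20), date(1997, 6, 26)))
print(days_back(ARTICLES, 7, today))
print(all_articles(ARTICLES))
```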