Home | | **Design and Analysis of Algorithms** | Input Enhancement in String Matching: Horspool’s and Boyer-Moore Algorithm

Horspool’s algorithm and Boyer-Moore Algorithm. This is exactly the idea behind the two best-known algorithms of this type: the Knuth-Morris-Pratt algorithm [Knu77] and the Boyer-Moore algorithm [Boy77].

**Input Enhancement in String Matching**

In this section, we see how the technique of input enhancement can be applied to the problem of string matching. Recall that the problem of string matching requires finding an occurrence of a given string of ** m** characters called the

Several faster algorithms have been discovered. Most of them exploit the input-enhancement idea: preprocess the pattern to get some information about it, store this information in a table, and then use this information during an actual search for the pattern in a given text. This is exactly the idea behind the two best-known algorithms of this type: the Knuth-Morris-Pratt algorithm [Knu77] and the Boyer-Moore algorithm [Boy77].

The principal difference between these two algorithms lies in the way they compare characters of a pattern with their counterparts in a text: the Knuth-Morris-Pratt algorithm does it left to right, whereas the Boyer-Moore algorithm does it right to left. Since the latter idea leads to simpler algorithms, it is the only one that we will pursue here. (Note that the Boyer-Moore algorithm starts by aligning the pattern against the beginning characters of the text; if the first trial fails, it shifts the pattern to the right. It is comparisons within a trial that the algorithm does right to left, starting with the last character in the pattern.)

Although the underlying idea of the Boyer-Moore algorithm is simple, its actual implementation in a working method is less so. Therefore, we start our discussion with a simplified version of the Boyer-Moore algorithm suggested by R. Horspool [Hor80]. In addition to being simpler, Horspool’s algorithm is not necessarily less efficient than the Boyer-Moore algorithm on random strings.

**Horspool’s Algorithm**

** **Consider, as an example, searching for the pattern BARBER in some text:

Starting with the last R of the pattern and moving right to left, we compare the corresponding pairs of characters in the pattern and the text. If all the pattern’s characters match successfully, a matching substring is found. Then the search can be either stopped altogether or continued if another occurrence of the same pattern is desired.

If a mismatch occurs, we need to shift the pattern to the right. Clearly, we would like to make as large a shift as possible without risking the possibility of missing a matching substring in the text. Horspool’s algorithm determines the size of such a shift by looking at the character ** c** of the text that is aligned against the last character of the pattern. This is the case even if character

In general, the following four possibilities can occur.

** ****Case 1 **If there are no** **** c**’s in the pattern—e.g.,

** ****Case 2 **If there are occurrences of character** ***c*** **in the pattern but it is not the last** **one there—e.g., ** c** is letter B in our example—the shift should align the rightmost occurrence of

** ****Case 3 **If** ***c*** **happens to be the last character in the pattern but there are no** **** c**’s

** ****Case 4 **Finally, if** ***c*** **happens to be the last character in the pattern and there** **are other ** c**’s among its first

** **These examples clearly demonstrate that right-to-left character comparisons can lead to farther shifts of the pattern than the shifts by only one position** **always made by the brute-force algorithm. However, if such an algorithm had to check all the characters of the pattern on every trial, it would lose much of this superiority. Fortunately, the idea of input enhancement makes repetitive comparisons unnecessary. We can precompute shift sizes and store them in a table. The table will be indexed by all possible characters that can be encountered in a text, including, for natural language texts, the space, punctuation symbols, and other special characters. (Note that no other information about the text in which eventual searching will be done is required.) The table’s entries will indicate the shift sizes computed by the formula

For example, for the pattern BARBER, all the table’s entries will be equal to 6, except for the entries for E, B, R, and A, which will be 1, 2, 3, and 4, respectively.

Here is a simple algorithm for computing the shift table entries. Initialize all the entries to the pattern’s length ** m** and scan the pattern left to right repeating the following step

**ALGORITHM** *ShiftTable**(P** *[0*..m** *−* *1]*)*

//Fills the shift table used by Horspool’s and Boyer-Moore algorithms //Input: Pattern ** P** [0

**for ***i*** **←** **0** to ***size*** **−** **1** do ***Table*[** i**]

**for ***j*** **←** **0** to ***m*** **−** **2** do ***Table*[*P*** **[*j*** **]]** **←** ***m*** **−** **1** **−** ***j*** return ***Table*

Now, we can summarize the algorithm as follows:

**Horspool’s algorithm**

**Step 1 **For a given pattern of length** ***m*** **and the alphabet used in both the** **pattern and text, construct the shift table as described above.

**Step 2 **Align the pattern against the beginning of the text.

**Step 3 **Repeat the following until either a matching substring is found or the** **pattern reaches beyond the last character of the text. Starting with the last character in the pattern, compare the corresponding characters in the pattern and text until either all ** m** characters are matched (then stop) or a mismatching pair is encountered. In the latter case, retrieve the entry

Here is pseudocode of Horspool’s algorithm.

**ALGORITHM** *HorspoolMatching**(P** *[0*..m** *−* *1]*, T** *[0*..n** *−* *1]*)*

//Implements Horspool’s algorithm for string matching //Input: Pattern ** P** [0

//Output: The index of the left end of the first matching substring

or −1 if there are no matches

*ShiftTable**(P** *[0*..m** *−* *1]** )** //generate

** i **←

**while ***i*** **≤** ***n*** **−** **1** do**

** k **←

**while ***k*** **≤** ***m*** **−** **1** and ***P*** **[*m*** **−** **1** **−** **** k**]

**if ***k*** **=** ***m*

**return ***i*** **−** ***m*** **+** **1** else ***i*** **←** ***i*** **+** ***Table*[*T*** **[** i**]]

**return **−1

**EXAMPLE **As an example of a complete application of Horspool’s algorithm,** **consider searching for the pattern BARBER in a text that comprises English letters and spaces (denoted by underscores). The shift table, as we mentioned, is filled as follows:

A simple example can demonstrate that the worst-case efficiency of Hor-spool’s algorithm is in ** O(nm)** (Problem 4 in this section’s exercises). But for random texts, it is in

**Boyer-Moore Algorithm**

Now we outline the Boyer-Moore algorithm itself. If the first comparison of the rightmost character in the pattern with the corresponding character ** c** in the text fails, the algorithm does exactly the same thing as Horspool’s algorithm. Namely, it shifts the pattern to the right by the number of characters retrieved from the table precomputed as explained earlier.

The two algorithms act differently, however, after some positive number *k*** (**0

In this situation, the Boyer-Moore algorithm determines the shift size by consid-ering two quantities. The first one is guided by the text’s character ** c** that caused a mismatch with its counterpart in the pattern. Accordingly, it is called the

For example, if we search for the pattern BARBER in some text and match the last two characters before failing on letter S in the text, we can shift the pattern by ** t**1

The same formula can also be used when the mismatching character ** c** of the text occurs in the pattern, provided

If ** t**1

To summarize, the bad-symbol shift ** d**1 is computed by the Boyer-Moore algorithm either as

The second type of shift is guided by a successful match of the last ** k >** 0 characters of the pattern. We refer to the ending portion of the pattern as its suffix of size

Let us first consider the case when there is another occurrence of *suff* ** (k)** in the pattern or, to be more accurate, there is another occurrence of

What is to be done if there is no other occurrence of *suff* ** (k)** not preceded by the same character as in its rightmost occurrence? In most cases, we can shift the pattern by its entire length

Unfortunately, shifting the pattern by its entire length when there is no other occurrence of *suff* ** (k)** not preceded by the same character as in its rightmost occurrence is not always correct. For example, for the pattern ABCBAB and

Note that the shift by 6 is correct for the pattern DBCBAB but not for ABCBAB, because the latter pattern has the same substring AB as its prefix (beginning part of the pattern) and as its suffix (ending part of the pattern). To avoid such an erroneous shift based on a suffix of size ** k,** for which there is no other occurrence in the pattern not preceded by the same character as in its rightmost occurrence, we need to find the longest prefix of size

Now we are prepared to summarize the Boyer-Moore algorithm in its entirety.

**The Boyer-Moore algorithm**

**Step 1 **For a given pattern and the alphabet used in both the pattern and the** **text, construct the bad-symbol shift table as described earlier.

**Step 2 **Using the pattern, construct the good-suffix shift table as described** **earlier.

**Step 3 **Align the pattern against the beginning of the text.

**Step 4 **Repeat the following step until either a matching substring is found or** **the pattern reaches beyond the last character of the text. Starting with the last character in the pattern, compare the corresponding characters in the pattern and the text until either all *m* character pairs are matched (then stop) or a mismatching pair is encountered after ** k** ≥ 0 character pairs are matched successfully. In the latter case, retrieve the entry

Shifting by the maximum of the two available shifts when ** k >** 0 is quite log-ical. The two shifts are based on the observations—the first one about a text’s mismatched character, and the second one about a matched group of the pattern’s rightmost characters—that imply that shifting by less than

**EXAMPLE **As a complete example, let us consider searching for the pattern** **BAOBAB in a text made of English letters and spaces. The bad-symbol table looks as follows:

The actual search for this pattern in the text given in Figure 7.3 proceeds as follows. After the last B of the pattern fails to match its counterpart K in the text, the algorithm retrieves ** t**1

The next try successfully matches just one pair of B’s. After the failure of the next comparison on the space character in the text, the algorithm retrieves ** t**1

the pattern by max{** d**1

When searching for the first occurrence of the pattern, the worst-case effi-ciency of the Boyer-Moore algorithm is known to be linear. Though this algorithm runs very fast, especially on large alphabets (relative to the length of the pattern), many people prefer its simplified versions, such as Horspool’s algorithm, when dealing with natural-language–like strings.

**Exercises 7.2**

**1. **Apply Horspool’s algorithm to search for the pattern** **BAOBAB** **in the text

BESS KNEW ABOUT BAOBABS

** **Consider the problem of searching for genes in DNA sequences using Hor-spool’s algorithm. A DNA sequence is represented by a text on the alphabet {A, C, G, T}, and the gene or gene segment is the pattern.

** **Construct the shift table for the following gene segment of your chromo-some 10:

TCCTATTCTT

Apply Horspool’s algorithm to locate the above pattern in the following DNA sequence:

TTATAGATCTCGTATTCTTTTATAGATCTCCTATTCTT

** **How many character comparisons will be made by Horspool’s algorithm in searching for each of the following patterns in the binary text of 1000 zeros?

**a. **00001 **b. **10000 **c. **01010

** **For searching in a text of length ** n** for a pattern of length

**a. **worst-case input. **b. **best-case input.

** **Is it possible for Horspool’s algorithm to make more character comparisons than the brute-force algorithm would make in searching for the same pattern in the same text?

** **If Horspool’s algorithm discovers a matching substring, how large a shift should it make to search for a next possible match?

** **How many character comparisons will the Boyer-Moore algorithm make in searching for each of the following patterns in the binary text of 1000 zeros?

**a. **00001 **b. **10000 **c. **01010

** ****a. **Would the Boyer-Moore algorithm work correctly with just the bad-symbol** **table to guide pattern shifts?

** **Would the Boyer-Moore algorithm work correctly with just the good-suffix table to guide pattern shifts?

** ****a. **If the last characters of a pattern and its counterpart in the text do match,** **does Horspool’s algorithm have to check other characters right to left, or can it check them left to right too?

** **Answer the same question for the Boyer-Moore algorithm.

** **Implement Horspool’s algorithm, the Boyer-Moore algorithm, and the brute-force algorithm of Section 3.2 in the language of your choice and run an experiment to compare their efficiencies for matching

** **random binary patterns in random binary texts.

** **random natural-language patterns in natural-language texts.

** **You are given two strings ** S** and

** **Design a space-efficient algorithm for the task. Indicate the space and time efficiencies of your algorithm.

Design a time-efficient algorithm for the task. Indicate the time and space efficiencies of your algorithm.

Study Material, Lecturing Notes, Assignment, Reference, Wiki description explanation, brief detail

Introduction to the Design and Analysis of Algorithms : Space and Time Trade-Offs : Input Enhancement in String Matching: Horspool’s and Boyer-Moore Algorithm |

**Related Topics **

Privacy Policy, Terms and Conditions, DMCA Policy and Compliant

Copyright © 2018-2023 BrainKart.com; All Rights Reserved. Developed by Therithal info, Chennai.