
Chapter: Internet & World Wide Web HOW TO PROGRAM - Perl and CGI (Common Gateway Interface)


String Processing and Regular Expressions - Perl

 

One of Perl’s most powerful capabilities is the processing of textual data easily and efficiently, which allows for straightforward searching, substitution, extraction and concatenation of strings. Text manipulation in Perl is usually done with a regular expression: a series of characters that serves as a pattern-matching template (or search criterion) in strings, text files and databases. This feature allows complicated searching and string processing to be performed using relatively simple expressions.

 

Many string-processing tasks can be accomplished by using Perl’s equality and comparison operators (Fig. 27.6, fig27_06.pl). Line 5 declares and initializes array @fruits. Operator qw (“quote word”) splits the contents of the parentheses on whitespace and returns a list of the resulting words, each treated as a single-quoted string. In this example, qw( apple orange banana ) is equivalent to ( 'apple', 'orange', 'banana' ).
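The equivalence between qw and an explicitly quoted list can be checked with a minimal sketch such as the following. The array name @explicit is a hypothetical name chosen for illustration and is not part of Fig. 27.6.

```perl
use strict;
use warnings;

# qw splits its contents on whitespace and returns a list of words.
my @fruits   = qw( apple orange banana );

# The same list written out with explicit quotes and commas
# (@explicit is a hypothetical name used only in this sketch).
my @explicit = ( 'apple', 'orange', 'banana' );

# Interpolating an array joins its elements with single spaces.
print( "qw list:       @fruits\n" );
print( "explicit list: @explicit\n" );
print( "Element count: ", scalar( @fruits ), "\n" );
```

Both print statements should show the same three words, which suggests that qw is simply a shorthand that saves typing quotes and commas.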

 

Lines 7–24 demonstrate our first example of Perl control structures. The foreach structure (line 7) iterates sequentially through the elements in @fruits. Each element’s value is assigned to variable $item, and the body of the foreach is executed once for each element in the array. Notice that a semicolon does not terminate the foreach.
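As a small sketch of the same loop form, the following iterates through an array and records the length of each element. It assumes the strict and warnings pragmas, so the loop variable is declared with my; the array @lengths is a hypothetical name introduced here.

```perl
use strict;
use warnings;

my @fruits = qw( apple orange banana );
my @lengths;    # hypothetical array that collects one result per element

# $item takes the value of each element in turn; the body runs once per element.
foreach my $item ( @fruits ) {
    push( @lengths, length( $item ) );
    print( "'$item' has ", length( $item ), " characters\n" );
}   # no semicolon is needed after the closing brace

print( "Lengths: @lengths\n" );
```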

 

 

    #!/usr/bin/perl
    # Fig. 27.6: fig27_06.pl
    # Program to demonstrate the eq, ne, lt, gt operators.

    @fruits = qw( apple orange banana );

    foreach $item ( @fruits ) {

        if ( $item eq "banana" ) {
            print( "String '$item' matches string 'banana'\n" );
        }

        if ( $item ne "banana" ) {
            print( "String '$item' does not match string 'banana'\n" );
        }

        if ( $item lt "banana" ) {
            print( "String '$item' is less than string 'banana'\n" );
        }

        if ( $item gt "banana" ) {
            print( "String '$item' is greater than string 'banana'\n" );
        }
    }

 

 

String 'apple' does not match string 'banana'

String 'apple' is less than string 'banana'

String 'orange' does not match string 'banana'

String 'orange' is greater than string 'banana'

String 'banana' matches string 'banana'

 

Fig. 27.6 Using the eq, ne, lt and gt operators.

 

Line 9 introduces another control structure: the if structure. Parentheses surround the condition being tested, and mandatory curly braces surround the block of code that is executed when the condition is true. In Perl, every scalar value is true except the number 0, the string "0", the empty string and undef. In our example, $item is tested against "banana" for equality (line 9); when $item contains "banana", the condition evaluates to true and the print command (line 10) is executed.
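The truth rules can be explored with a short sketch like the one below. The subroutine truth is a hypothetical helper written for this illustration; it is not a Perl built-in.

```perl
use strict;
use warnings;

# Hypothetical helper: reports how Perl treats a value in a condition.
sub truth {
    my ( $value ) = @_;
    return $value ? "true" : "false";
}

# The values that Perl treats as false.
my @false_values = ( 0, "0", "", undef );

# Values that look "empty" or "zero" but are still true strings.
my @true_values = ( 1, "banana", "0.0", "00", " " );

foreach my $value ( @false_values ) {
    my $shown = defined( $value ) ? "'$value'" : "undef";
    print( "$shown is ", truth( $value ), "\n" );
}

foreach my $value ( @true_values ) {
    print( "'$value' is ", truth( $value ), "\n" );
}
```

Note that the strings "0.0" and "00" are true: only the exact string "0" is treated as false.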

 

The remaining if statements (lines 13, 17 and 21) demonstrate the other string comparison operators. Operators eq, ne, lt and gt test strings for equality, inequality, less-than and greater-than, respectively. These operators are used only with strings. When comparing numeric values, operators ==, !=, <, <=, > and >= are used.
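Choosing the wrong family of operators can silently change a result. The following sketch compares the same values with string and numeric operators; the variable names are hypothetical.

```perl
use strict;
use warnings;

# String operators compare character by character;
# numeric operators compare by numeric value.
my $string_lt  = ( "10" lt "9" )  ? "true" : "false";   # '1' sorts before '9'
my $numeric_lt = ( 10 < 9 )       ? "true" : "false";
my $string_eq  = ( "1.0" eq "1" ) ? "true" : "false";   # different characters
my $numeric_eq = ( "1.0" == "1" ) ? "true" : "false";   # same numeric value

print( "\"10\" lt \"9\"   is $string_lt\n" );
print( "10 < 9        is $numeric_lt\n" );
print( "\"1.0\" eq \"1\"  is $string_eq\n" );
print( "\"1.0\" == \"1\"  is $numeric_eq\n" );
```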

 

For more powerful string comparisons, Perl provides the match operator (m//), which uses regular expressions to search a string for a specified pattern. Figure 27.7 uses the match operator to perform a variety of regular expression tests.

 

We begin by assigning the string "Now is is the time" to variable $search (line 5). The expression on line 8 uses the m// match operator to search for the literal characters Now inside variable $search. Note that the m character preceding the slashes of the m// operator is optional in most cases and thus is omitted here.

 

The match operator takes two operands. The first operand is the regular-expression pattern to search for (Now), which is placed between the slashes of the m// operator. The second operand is the string within which to search, which is assigned to the match operator using the =~ operator. The =~ operator is sometimes called a binding operator, because it binds whatever is on its left side to a regular-expression operator on its right.
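A brief sketch of the binding operator follows. It also shows two things the text does not cover and that are added here for illustration: the m may be omitted only when the delimiters are forward slashes, and !~ is the negated form of =~. The pattern Later and the variable names are hypothetical.

```perl
use strict;
use warnings;

my $search = "Now is is the time";

# Three equivalent ways to write the same match.
my $plain  = ( $search =~ /Now/ )  ? 1 : 0;
my $with_m = ( $search =~ m/Now/ ) ? 1 : 0;
my $braces = ( $search =~ m{Now} ) ? 1 : 0;   # other delimiters require the m

# !~ binds like =~ but is true when the pattern is NOT found.
my $missing = ( $search !~ /Later/ ) ? 1 : 0;

print( "/Now/ matched:      $plain\n" );
print( "m/Now/ matched:     $with_m\n" );
print( "m{Now} matched:     $braces\n" );
print( "'Later' is missing: $missing\n" );
```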

 

In our example, the pattern Now is found in the string "Now is is the time". The match operator returns true, and the body of the if statement is executed. In addition to literal characters like Now, which match only themselves, regular expressions can include special characters called metacharacters, which specify patterns or contexts that cannot be defined using literal characters. For example, the caret metacharacter (^) matches the beginning of a string. The next regular expression (line 12) searches the beginning of $search for the pattern Now.

 

The $ metacharacter searches the end of a string for a pattern (line 17). Because the pattern Now is not found at the end of $search, the body of the if statement (line 18) is not executed. Note that Now$ is not a variable; it is a search pattern that uses $ to search for Now at the end of a string.
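The two anchors can be tried side by side in a minimal sketch such as this one. The pattern time$ is added here to show a case in which the end anchor does match; the variable names are hypothetical.

```perl
use strict;
use warnings;

my $search = "Now is is the time";

# ^ anchors the pattern to the beginning of the string, $ to the end.
my $starts_with_now = ( $search =~ /^Now/ )  ? "yes" : "no";
my $ends_with_now   = ( $search =~ /Now$/ )  ? "yes" : "no";
my $ends_with_time  = ( $search =~ /time$/ ) ? "yes" : "no";

print( "Starts with 'Now': $starts_with_now\n" );
print( "Ends with 'Now':   $ends_with_now\n" );
print( "Ends with 'time':  $ends_with_time\n" );
```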

 

 

    #!/usr/bin/perl
    # Fig 27.7: fig27_07.pl
    # Searches using the matching operator and regular expressions.

    $search = "Now is is the time";
    print( "Test string is: '$search'\n\n" );

    if ( $search =~ /Now/ ) {
        print( "String 'Now' was found.\n" );
    }

    if ( $search =~ /^Now/ ) {
        print( "String 'Now' was found at the beginning of the line." );
        print( "\n" );
    }

    if ( $search =~ /Now$/ ) {
        print( "String 'Now' was found at the end of the line.\n" );
    }

    if ( $search =~ /\b ( \w+ ow ) \b/x ) {
        print( "Word found ending in 'ow': $1 \n" );
    }

    if ( $search =~ /\b ( \w+ ) \s ( \1 ) \b/x ) {
        print( "Repeated words found: $1 $2\n" );
    }

    @matches = ( $search =~ / \b ( t \w+ ) \b /gx );
    print( "Words beginning with 't' found: @matches\n" );

 

Test string is: 'Now is is the time'

 

String 'Now' was found.

 

String 'Now' was found at the beginning of the line.

Word found ending in 'ow': Now

Repeated words found: is is

Words beginning with 't' found: the time

 

Fig. 27.7       Using the matching operator

 

The condition on line 21 searches (from left to right) for the first word ending with the letters ow. As in strings, backslashes in regular expressions escape characters with special significance. For example, the \b expression does not match the literal characters “\b.” Instead, the expression matches any word boundary. A word boundary is a boundary between an alphanumeric character (0–9, a–z, A–Z and the underscore character) and something that is not an alphanumeric character. Between the \b characters is a set of parentheses, which will be explained momentarily.

The expression inside the parentheses, \w+ ow, indicates that we are searching for patterns ending in ow. The first part, \w+, is a combination of \w (an escape sequence that matches a single alphanumeric character) and the + quantifier, which instructs Perl to match the preceding character one or more times. Thus, \w+ matches one or more alphanumeric characters. The characters ow are taken literally. Collectively, the expression /\b ( \w+ ow ) \b/x matches one or more alphanumeric characters ending with ow, with word boundaries at the beginning and end. See Fig. 27.8 for a description of several Perl regular-expression quantifiers and Fig. 27.9 for a list of regular-expression metacharacters.
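The effect of \b and \w+ can be seen on a test string chosen to contain near misses. In the sketch below, the string "snowman ow now" and the variable names are hypothetical: snowman contains ow but not at a word boundary, and ow alone has no character in front of it for \w+ to match.

```perl
use strict;
use warnings;

# Hypothetical test string with two near misses before the real match.
my $text = "snowman ow now";

# In list context the match returns the text captured by the parentheses.
my ( $word ) = $text =~ /\b(\w+ow)\b/;

# \w+ needs at least one character before "ow", so "ow" alone fails.
my $bare_ow = ( "ow" =~ /\b\w+ow\b/ ) ? "match" : "no match";

print( "First word ending in 'ow': $word\n" );
print( "Pattern against 'ow' alone: $bare_ow\n" );
```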

 

Parentheses indicate that the text matching the pattern is to be saved in a special Perl variable (e.g., $1, etc.). The parentheses (line 21 of Fig. 27.7) cause Now to be stored in variable $1. Multiple sets of parentheses may be used in regular expressions, where each match results in a new Perl variable ($1, $2, $3, etc.). The value matched in the first set of parentheses is stored in variable $1, the value matched in the second set of parentheses is stored in variable $2, and so on.
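A minimal sketch of multiple capture groups follows, using the test string from Fig. 27.7. The patterns and variable names are hypothetical. The capture variables are copied only after the match succeeds, because a failed match leaves $1 and $2 holding values from an earlier successful match.

```perl
use strict;
use warnings;

my $search = "Now is is the time";
my ( $first, $second ) = ( "", "" );

# Two sets of parentheses fill $1 and $2, numbered from left to right.
if ( $search =~ /^(\w+)\s+(\w+)/ ) {
    ( $first, $second ) = ( $1, $2 );
    print( "First word:  $first\n" );
    print( "Second word: $second\n" );
}

# In list context a successful match returns the captured values directly.
my ( $third, $fourth ) = $search =~ /(\w+)\s+(\w+)$/;
print( "Last two words: $third $fourth\n" );
```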



Adding modifying characters after a regular expression refines the pattern-matching process. Modifying characters (Fig. 27.10) placed to the right of the forward slash that delimits the regular expression instruct the interpreter how to treat the preceding expression. For example, the i after the regular expression

 

/computer/i

 

tells the interpreter to ignore case when searching, thus matching computer, COMPUTER, Computer, CoMputER, etc.
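The following sketch counts how many of those spellings match with and without the i modifying character. The array @variants and the counter names are hypothetical.

```perl
use strict;
use warnings;

my @variants = qw( computer COMPUTER Computer CoMputER );

# grep in scalar context returns the number of elements that match.
my $with_i    = grep { /computer/i } @variants;
my $without_i = grep { /computer/ } @variants;

print( "Matches with /i:    $with_i\n" );
print( "Matches without /i: $without_i\n" );
```

With the i modifying character all four spellings match; without it, only the all-lowercase spelling does.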


When added to the end of a regular expression, the x modifying character indicates that whitespace characters are to be ignored. This allows programmers to add space characters to their regular expressions for readability without affecting the search. If the expression were written as

 

$search =~ /\b ( \w+ ow ) \b/

 

—that is, without the x modifying character—then the script would be searching for a word boundary, two spaces, one or more alphanumeric characters, one space, the characters ow, two spaces and a word boundary. The expression would not match $search’s value.
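The difference can be confirmed by running both forms against the test string from Fig. 27.7, as in this short sketch (the variable names are hypothetical).

```perl
use strict;
use warnings;

my $search = "Now is is the time";

# With x, the spaces inside the pattern are ignored.
my $with_x = ( $search =~ /\b ( \w+ ow ) \b/x ) ? "match" : "no match";

# Without x, every space in the pattern must match a literal space.
my $without_x = ( $search =~ /\b ( \w+ ow ) \b/ ) ? "match" : "no match";

print( "With x:    $with_x\n" );
print( "Without x: $without_x\n" );
```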

 

The condition on line 25 uses the memory function (i.e., parentheses) in a regular expression. The first parenthetical expression matches any string containing one or more alphanumeric characters. The expression \1 then evaluates to the word that was matched in the first parenthetical expression. The regular expression searches for two identical, consecutive words, separated by a whitespace character (\s), in this case “is is.”
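As a sketch of the same technique on other input, the following collects every doubled word in a sentence. The test strings and variable names are hypothetical, and the g modifying character is used so that the search continues after the first match. The second test suggests why the trailing \b matters: without it, "is island" would be reported as a repeated word.

```perl
use strict;
use warnings;

# Hypothetical test string containing two doubled words.
my $text = "the the quick brown fox fox";

# \1 must match exactly the text that the first parentheses captured.
my @doubled = ( $text =~ /\b(\w+)\s+\1\b/g );

# The trailing \b stops "is" + "is" from matching inside "is island".
my $false_hit = ( "is island" =~ /\b(\w+)\s\1\b/ ) ? 1 : 0;

print( "Doubled words: @doubled\n" );
print( "'is island' reported as a repeat: $false_hit\n" );
```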

 

The condition in line 29 searches for words beginning with the letter t in the string $search. Modifying character g indicates a global search—a search that does not stop after the first match is found. The array @matches is assigned the value of a list of all matching words.
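The effect of the g modifying character depends on context, which the following sketch illustrates with the test string from Fig. 27.7. The variable names are hypothetical: in list context a global match returns every captured value at once, while in a while condition it finds one match per iteration.

```perl
use strict;
use warnings;

my $search = "Now is is the time";

# List context with g: all captured words are returned.
my @matches = ( $search =~ /\b(t\w+)\b/g );

# List context without g: only the first captured word is returned.
my ( $first ) = ( $search =~ /\b(t\w+)\b/ );

# Scalar context with g: each evaluation resumes after the previous match.
my $count = 0;
while ( $search =~ /\b(t\w+)\b/g ) {
    $count++;
}

print( "All matches:  @matches\n" );
print( "First match:  $first\n" );
print( "Match count:  $count\n" );
```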

 



Copyright © 2018-2020 BrainKart.com; All Rights Reserved. Developed by Therithal info, Chennai.