Regular Expression Processing
The java.util.regex package supports regular expression processing. As
the term is used here, a regular
expression is a string of characters that describes a character sequence.
This general description, called a pattern,
can then be used to find matches in other character sequences. Regular
expressions can specify wildcard characters, sets of characters, and various
quantifiers. Thus, you can specify a regular expression that represents a
general form that can match several different specific character sequences.
There are two classes that
support regular expression processing: Pattern
and Matcher. These classes work
together. Use Pattern to define a
regular expression. Match the pattern against another sequence using Matcher.
Pattern
The Pattern class defines no constructors. Instead, a pattern is created
by calling the compile( ) factory
method. One of its forms is shown here:
static Pattern compile(String pattern)
Here, pattern is the regular expression that you want to use. The compile( ) method transforms the string
in pattern into a pattern that can be
used for pattern matching by the Matcher
class. It returns a Pattern object
that contains the pattern.
Once you have created a Pattern object, you will use it to
create a Matcher. This is done by
calling the matcher( ) factory
method defined by Pattern. It is
shown here:
Matcher matcher(CharSequence str)
Here, str is the character sequence that the pattern will be matched
against. This is called the input
sequence. CharSequence is an
interface that defines a read-only sequence of characters. It is implemented by the String class, among others. Thus, you
can pass a string to matcher( ).
Matcher
The Matcher class has no constructors. Instead, you create a Matcher by calling the matcher( ) factory method defined by Pattern, as just explained. Once you
have created a Matcher, you will use
its methods to perform various pattern matching operations.
The simplest pattern matching
method is matches( ), which
determines whether the character sequence matches the pattern. It is shown
here:
boolean matches( )
It returns true if the sequence and the pattern
match, and false otherwise.
Understand that the entire sequence must match the pattern, not just a
subsequence of it.
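As a minimal sketch of the steps described so far, the following compiles a pattern, creates a Matcher, and calls matches( ). The class name CompileSketch and the helper wholeMatch are made up for illustration. The sketch also passes a StringBuilder as the input sequence, which works because StringBuilder implements CharSequence.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of the original text.
class CompileSketch {
  // Compiles regex, builds a Matcher for input, and tests the whole input.
  static boolean wholeMatch(String regex, CharSequence input) {
    Pattern pat = Pattern.compile(regex);
    Matcher mat = pat.matcher(input);
    return mat.matches();
  }

  public static void main(String[] args) {
    // A String is a CharSequence.
    System.out.println(wholeMatch("Java", "Java"));
    // So is a StringBuilder, so it can be passed without conversion.
    System.out.println(wholeMatch("Java", new StringBuilder("Java")));
    // The entire input must match, so extra characters cause a failure.
    System.out.println(wholeMatch("Java", "Java 8"));
  }
}
```

Run as written, this should print true, true, and false, in that order.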
To determine if a subsequence
of the input sequence matches the pattern, use find( ). One version is shown here:
boolean find( )
It returns true if there is a matching subsequence
and false otherwise. This method can
be called repeatedly, allowing it to find all matching subsequences. Each call
to find( ) begins where the previous
one left off.
You can obtain a string
containing the last matching sequence by calling group( ). One of its forms is shown here:
String group( )
The matching string is
returned. If no match exists, then an IllegalStateException
is thrown. You can obtain the index within the input sequence of the current
match by calling
start( ). The index one past the end of the current match is obtained by
calling end( ). The forms used in this chapter are shown
here:
int start( )
int end( )
Both throw IllegalStateException if no match
exists.
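The following sketch pulls find( ), group( ), start( ), and end( ) together. The class FindSketch and its two helpers are hypothetical names chosen for this illustration. The first helper records each match with its index range; the second shows that calling group( ) before any match has been attempted results in an IllegalStateException.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of the original text.
class FindSketch {
  // Returns one entry per match, formatted as text[start,end).
  static List<String> describeMatches(String regex, String input) {
    List<String> found = new ArrayList<>();
    Matcher mat = Pattern.compile(regex).matcher(input);
    while (mat.find()) {
      // end() is the index one past the last matched character.
      found.add(mat.group() + "[" + mat.start() + "," + mat.end() + ")");
    }
    return found;
  }

  // Returns true if group() throws because no match exists yet.
  static boolean groupFailsBeforeFind(String regex, String input) {
    Matcher mat = Pattern.compile(regex).matcher(input);
    try {
      mat.group();
      return false;
    } catch (IllegalStateException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    System.out.println(describeMatches("cat", "cat concat"));
    System.out.println(groupFailsBeforeFind("cat", "cat concat"));
  }
}
```

For the input "cat concat", the two matches cover indexes 0 through 2 and 7 through 9, so the list contains cat[0,3) and cat[7,10).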
You can replace all
occurrences of a matching sequence with another sequence by calling replaceAll( ), shown here:
String replaceAll(String newStr)
Here, newStr specifies the new character sequence that will replace the
ones that match the pattern. The updated input sequence is returned as a
string.
Regular Expression Syntax
Before demonstrating Pattern and Matcher, it is necessary to explain how to construct a regular
expression. Although no rule is complicated by itself, there are a large number
of them, and a complete discussion is beyond the scope of this chapter.
However, a few of the more commonly used constructs are described here.
In general, a regular
expression is composed of normal characters, character classes (sets of characters),
wildcard characters, and quantifiers. A normal character is matched as-is.
Thus, if a pattern consists of "xy", then the only input sequence
that will match it is "xy". Characters such as newline and tab are
specified using the standard escape sequences, which begin with a \ . For example, a newline is specified
by \n. In the language of regular
expressions, a normal character is also called a literal.
A character class is a set of
characters. A character class is specified by putting the characters in the
class between brackets. For example, the class [wxyz] matches w, x, y, or z. To
specify an inverted set, precede the characters with a ^. For example, [^wxyz] matches any character except w, x, y, or z.
You can specify a range of characters using a hyphen. For example, to specify a
character class that will match the digits 1 through 9, use [1-9].
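The three character classes just mentioned can be tried out with a small helper. This is only a sketch: CharClassSketch and matchesClass are invented names, and the helper simply tests whether a whole input string matches the given class.

```java
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of the original text.
class CharClassSketch {
  // Tests whether the entire input matches the given regular expression.
  static boolean matchesClass(String regex, String input) {
    return Pattern.compile(regex).matcher(input).matches();
  }

  public static void main(String[] args) {
    System.out.println(matchesClass("[wxyz]", "x"));   // x is in the set
    System.out.println(matchesClass("[wxyz]", "a"));   // a is not in the set
    System.out.println(matchesClass("[^wxyz]", "a"));  // inverted set accepts a
    System.out.println(matchesClass("[^wxyz]", "w"));  // inverted set rejects w
    System.out.println(matchesClass("[1-9]", "5"));    // 5 is in the range
    System.out.println(matchesClass("[1-9]", "0"));    // 0 is outside the range
  }
}
```

Note that a character class by itself matches exactly one character, so an input such as "xy" would not match [wxyz] as a whole.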
The wildcard character is the
. (dot), and it matches any
character. (By default, the dot does not match line terminators such as newline.) Thus, a pattern that consists of "." will match these (and
other) input sequences: "A", "a", "x", and so on.
A quantifier determines how
many times an expression is matched. The quantifiers are shown here:
+  Match one or more.
*  Match zero or more.
?  Match zero or one.
For example, the pattern
"x+" will match "x", "xx", and "xxx",
among others. One other point: In general, if you specify an invalid
expression, a
PatternSyntaxException will be thrown.
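A short sketch can make the three quantifiers and the exception concrete. The names QuantifierSketch, whole, and isValid are assumptions made for this example. The pattern "[abc" is used as the invalid expression because its character class is never closed.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical illustration class; not part of the original text.
class QuantifierSketch {
  // Tests whether the entire input matches the given regular expression.
  static boolean whole(String regex, String input) {
    return Pattern.compile(regex).matcher(input).matches();
  }

  // Returns false if compiling the expression throws PatternSyntaxException.
  static boolean isValid(String regex) {
    try {
      Pattern.compile(regex);
      return true;
    } catch (PatternSyntaxException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println(whole("x+", "xxx")); // one or more: matches
    System.out.println(whole("x+", ""));    // one or more: needs at least one x
    System.out.println(whole("x*", ""));    // zero or more: empty input is fine
    System.out.println(whole("x?", "x"));   // zero or one: matches
    System.out.println(whole("x?", "xx"));  // zero or one: two is too many
    System.out.println(isValid("x+"));      // well-formed expression
    System.out.println(isValid("[abc"));    // unclosed character class
  }
}
```

Because PatternSyntaxException is an unchecked exception, the try block in isValid is optional from the compiler's point of view; it is shown here only to make the failure visible.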
Demonstrating Pattern Matching
The best way to understand
how regular expression pattern matching operates is to work through some
examples. The first, shown here, looks for a match with a literal pattern:
// A simple pattern matching demo.
import java.util.regex.*;
class RegExpr {
  public static void main(String args[]) {
    Pattern pat;
    Matcher mat;
    boolean found;
    pat = Pattern.compile("Java");
    mat = pat.matcher("Java");
    found = mat.matches(); // check for a match
    System.out.println("Testing Java against Java.");
    if(found) System.out.println("Matches");
    else System.out.println("No Match");
    System.out.println();
    System.out.println("Testing Java against Java 8.");
    mat = pat.matcher("Java 8"); // create a new matcher
    found = mat.matches(); // check for a match
    if(found) System.out.println("Matches");
    else System.out.println("No Match");
  }
}
The output from the program
is shown here:
Testing Java against Java.
Matches

Testing Java against Java 8.
No Match
Let’s look closely at this
program. The program begins by creating the pattern that contains the sequence
"Java". Next, a Matcher is
created for that pattern that has the input sequence "Java". Then,
the matches( ) method is called to
determine if the input sequence matches the pattern. Because the sequence and
the pattern are the same, matches( )
returns true. Next, a new Matcher is created with the input
sequence "Java 8" and matches( ) is called again. In this case, the pattern and the input sequence differ,
and no match is found. Remember, the matches( ) method returns true only
when the entire input sequence matches the pattern. It will not return true just because a subsequence
matches.
You can use find( ) to determine if the input
sequence contains a subsequence that matches the pattern. Consider the
following program:
// Use find() to find a subsequence.
import java.util.regex.*;
class RegExpr2 {
  public static void main(String args[]) {
    Pattern pat = Pattern.compile("Java");
    Matcher mat = pat.matcher("Java 8");
    System.out.println("Looking for Java in Java 8.");
    if(mat.find()) System.out.println("subsequence found");
    else System.out.println("No Match");
  }
}
The output is shown here:
Looking for Java in Java 8.
subsequence found
In this case, find( ) finds the subsequence
"Java".
The find( ) method can be used to search the input sequence for
repeated occurrences of the pattern because each call to find( ) picks up where the previous one left off. For example, the
following program finds two occurrences of the pattern "test":
// Use find() to find multiple subsequences.
import java.util.regex.*;
class RegExpr3 {
  public static void main(String args[]) {
    Pattern pat = Pattern.compile("test");
    Matcher mat = pat.matcher("test 1 2 3 test");
    while(mat.find()) {
      System.out.println("test found at index " + mat.start());
    }
  }
}
The output is shown here:
test found at index 0
test found at index 11
As the output shows, two
matches were found. The program uses the start( )
method to obtain the index of each match.
Using Wildcards and Quantifiers
Although the preceding
programs show the general technique for using Pattern and Matcher,
they don’t show their power. The real benefit of regular expression processing
is not seen until wildcards and
quantifiers are used. To begin, consider the following example that uses the +
quantifier to match any arbitrarily long sequence of Ws:
// Use a quantifier.
import java.util.regex.*;
class RegExpr4 {
  public static void main(String args[]) {
    Pattern pat = Pattern.compile("W+");
    Matcher mat = pat.matcher("W WW WWW");
    while(mat.find())
      System.out.println("Match: " + mat.group());
  }
}
The output from the program is
shown here:
Match: W
Match: WW
Match: WWW
As the output shows, the
regular expression pattern "W+" matches any arbitrarily long sequence
of Ws.
The next program uses a
wildcard to create a pattern that will match any sequence that begins with e and ends with d. To do this, it uses the dot wildcard character along with the + quantifier.
// Use wildcard and quantifier.
import java.util.regex.*;
class RegExpr5 {
  public static void main(String args[]) {
    Pattern pat = Pattern.compile("e.+d");
    Matcher mat = pat.matcher("extend cup end table");
    while(mat.find())
      System.out.println("Match: " + mat.group());
  }
}
You might be surprised by the
output produced by the program, which is shown here:
Match: extend cup end
Only one match is found, and
it is the longest sequence that begins with e
and ends with d. You might have
expected two matches: "extend" and "end". The reason that
the longer sequence is found is that,
by default, a quantifier such as + matches the
longest sequence that fits the pattern. This is called greedy behavior. You can specify reluctant behavior by adding the ? quantifier after the + in the pattern, as shown in this version of the
program. It causes the shortest matching sequence to be obtained.
// Use the ? quantifier.
import java.util.regex.*;
class RegExpr6 {
  public static void main(String args[]) {
    // Use reluctant matching behavior.
    Pattern pat = Pattern.compile("e.+?d");
    Matcher mat = pat.matcher("extend cup end table");
    while(mat.find())
      System.out.println("Match: " + mat.group());
  }
}
The output from the program
is shown here:
Match: extend
Match: end
As the output shows, the
pattern "e.+?d" will match the shortest sequence that begins with e and ends with d. Thus, two matches are found.
Working with Classes of Characters
Sometimes you will want to
match any sequence that contains one or more characters, in any order, that are
part of a set of characters. For example, to match whole words, you want to
match any sequence of the letters of the alphabet. One of the easiest ways to
do this is to use a character class, which defines a set of characters. Recall
that a character class
is created by putting the
characters you want to match between brackets. For example, to match the
lowercase characters a through z, use [a-z].
The following program demonstrates this technique:
// Use a character class.
import java.util.regex.*;
class RegExpr7 {
  public static void main(String args[]) {
    // Match lowercase words.
    Pattern pat = Pattern.compile("[a-z]+");
    Matcher mat = pat.matcher("this is a test.");
    while(mat.find())
      System.out.println("Match: " + mat.group());
  }
}
The output is shown here:
Match: this
Match: is
Match: a
Match: test
Using replaceAll( )
The replaceAll( ) method supplied by Matcher lets you perform powerful search and replace operations
that use regular expressions. For example, the following program replaces all
occurrences of sequences that begin with "Jon" with "Eric":
// Use replaceAll().
import java.util.regex.*;
class RegExpr8 {
  public static void main(String args[]) {
    String str = "Jon Jonathan Frank Ken Todd";
    Pattern pat = Pattern.compile("Jon.*? ");
    Matcher mat = pat.matcher(str);
    System.out.println("Original sequence: " + str);
    str = mat.replaceAll("Eric ");
    System.out.println("Modified sequence: " + str);
  }
}
The output is shown here:
Original sequence: Jon Jonathan Frank Ken Todd
Modified sequence: Eric Eric Frank Ken Todd
Because the regular
expression "Jon.*? " matches any string that begins with Jon followed
by zero or more characters, ending in a space, it can be used to match and
replace both Jon and Jonathan with the name Eric. Such a substitution is not
easily accomplished without pattern matching capabilities.
Using split( )
You can reduce an input
sequence into its individual tokens by using the split( ) method defined by Pattern.
One form of the split( ) method is
shown here:
String[ ] split(CharSequence str)
It processes the input
sequence passed in str, reducing it
into tokens based on the delimiters specified by the pattern.
For example, the following
program finds tokens that are separated by spaces, commas, periods, and
exclamation points:
// Use split().
import java.util.regex.*;
class RegExpr9 {
  public static void main(String args[]) {
    // Split on spaces, commas, periods, and exclamation points.
    Pattern pat = Pattern.compile("[ ,.!]");
    String strs[] = pat.split("one two,alpha9 12!done.");
    for(int i=0; i < strs.length; i++)
      System.out.println("Next token: " + strs[i]);
  }
}
The output is shown here:
Next token: one
Next token: two
Next token: alpha9
Next token: 12
Next token: done
As the output shows, the
input sequence is reduced to its individual tokens. Notice that the delimiters
are not included.
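One detail worth checking when using a single-character delimiter class is what happens when two delimiters appear side by side, such as a comma followed by a space. The sketch below, which uses the made-up names SplitSketch and tokens, compares the pattern from the program above with a variant that adds the + quantifier so that a run of delimiters is treated as one separator.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of the original text.
class SplitSketch {
  // Splits input around matches of regex and returns the tokens as a list.
  static List<String> tokens(String regex, String input) {
    return Arrays.asList(Pattern.compile(regex).split(input));
  }

  public static void main(String[] args) {
    // Each delimiter character separates tokens on its own, so the
    // comma-space pair yields an empty token between "one" and "two".
    System.out.println(tokens("[ ,.!]", "one, two"));
    // With +, the comma-space pair counts as a single delimiter.
    System.out.println(tokens("[ ,.!]+", "one, two"));
  }
}
```

The first call returns three tokens (the middle one is an empty string), and the second returns only "one" and "two". Which behavior is wanted depends on the data being tokenized.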
Two Pattern-Matching Options
Although the pattern-matching
techniques described so far offer the greatest flexibility and power,
there are two alternatives that you might find useful in some circumstances.
If you only need to perform a one-time pattern match, you can use the matches( ) method defined by Pattern. It is shown here:
static boolean matches(String pattern, CharSequence str)
It returns true if pattern matches str and false otherwise. This method
automatically compiles pattern and
then looks for a match. If you will be using the same pattern repeatedly, then using matches( ) is less efficient than compiling the pattern and using
the pattern-matching methods defined by Matcher,
as described previously.
You can also perform a
pattern match by using the matches( )
method implemented by String. It is
shown here:
boolean matches(String pattern)
If the invoking string
matches the regular expression in pattern,
then matches( ) returns true. Otherwise, it returns false.
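Both one-time alternatives can be compared side by side. In this sketch, OneShotSketch, viaPattern, and viaString are hypothetical names; the only library calls are Pattern.matches( ) and String.matches( ) as described above.

```java
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of the original text.
class OneShotSketch {
  // One-time match using the static method defined by Pattern.
  static boolean viaPattern(String regex, String input) {
    return Pattern.matches(regex, input);
  }

  // One-time match using the method implemented by String.
  static boolean viaString(String regex, String input) {
    return input.matches(regex);
  }

  public static void main(String[] args) {
    System.out.println(viaPattern("[a-z]+", "hello"));
    System.out.println(viaString("[a-z]+", "hello"));
    // As with Matcher's matches( ), the whole input must match, so the
    // space in this input causes both calls to return false.
    System.out.println(viaPattern("[a-z]+", "hello world"));
    System.out.println(viaString("[a-z]+", "hello world"));
  }
}
```

Both forms give the same answer for the same arguments, so the choice between them is mostly a matter of which reads better at the call site.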
Exploring Regular Expressions
The overview of regular
expressions presented in this section only hints at their power. Since text
parsing, manipulation, and tokenization are a large part of programming, you
will likely find Java’s regular expression subsystem a powerful tool that you
can use to your advantage. It is, therefore, wise to explore the capabilities
of regular expressions. Experiment with several different types of patterns and
input sequences. Once you understand how regular expression pattern matching
works, you will find it useful in many of your programming endeavors.