Java Matching Newlines in Text

Problem

You need to match newlines in text.

Solution

Use \n or \r in the pattern.

See also the flags constant Pattern.MULTILINE, which makes newlines match as
beginning-of-line and end-of-line (^ and $).
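
As a minimal sketch of matching line endings explicitly, the following counts the line breaks in a string. The class name LineBreakCount and the method countLineBreaks are made up for illustration. The alternation tries \r\n first so that a Windows-style ending is counted once rather than twice.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LineBreakCount {
    // One line break: \r\n is listed first so a CR-LF pair is a single match.
    static final Pattern LINE_BREAK = Pattern.compile("\\r\\n|\\r|\\n");

    /** Count how many line breaks occur in the text (hypothetical helper). */
    static int countLineBreaks(String text) {
        Matcher m = LINE_BREAK.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "unix\nmac\rwindows\r\nend";
        System.out.println(countLineBreaks(text)); // prints 3
    }
}
```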

Explained

While line-oriented tools from Unix such as sed and grep match regular expressions
one line at a time, not all tools do. The sam text editor from Bell Laboratories was
the first interactive tool I know of to allow multiline regular expressions; the Perl
scripting language followed shortly. In the Java API, the newline character by default
has no special significance. The BufferedReader method readLine( ) normally strips
out whichever newline characters it finds. If you read in gobs of characters using
some method other than readLine( ), you may have some number of \n, \r, or \r\n
sequences in your text string. Normally all of these are treated as equivalent to \n. If
you want only \n to be recognized as a line terminator, use the UNIX_LINES flag to the
Pattern.compile( ) method.
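
To make the effect of UNIX_LINES concrete, here is a small sketch using text that contains a lone \r. The class and method names are invented for this illustration. MULTILINE is included so that $ can match before a line terminator in the middle of the string; the only difference between the two calls is whether \r counts as one.

```java
import java.util.regex.Pattern;

public class UnixLinesDemo {
    /** True if "first" is found immediately before an end-of-line position. */
    static boolean findsLineEnd(String text, int flags) {
        return Pattern.compile("first$", flags).matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "first\rsecond";

        // By default a lone \r is a line terminator, so $ matches before it.
        System.out.println(findsLineEnd(text, Pattern.MULTILINE)); // prints true

        // With UNIX_LINES only \n is a line terminator, so $ no longer matches here.
        System.out.println(
            findsLineEnd(text, Pattern.MULTILINE | Pattern.UNIX_LINES)); // prints false
    }
}
```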

In Unix, ^ and $ are commonly used to match the beginning or end of a line,
respectively. In this API, the regex metacharacters ^ and $ by default ignore line
terminators inside the text and match only at the beginning and the end, respectively,
of the entire string. However, if you pass the MULTILINE flag into Pattern.compile( ),
these expressions match just after or just before, respectively, a line terminator; $
also matches the very end of the string. The line ending can also be matched directly:
\n or \r in the pattern matches it wherever it occurs, and . matches it too if you pass
the DOTALL flag (by default, . matches any character except a line terminator). See the
sidebar “Pattern.compile( ) Flags”. An example of newline matching is shown in
Example 4-6.
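
Before the full example, here is a short sketch of what MULTILINE changes for ^. It collects the word at the start of each line; the class name FirstWords and its helper method are hypothetical, chosen only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FirstWords {
    /** Collect every run of word characters that starts at a ^ position. */
    static List<String> firstWords(String text, int flags) {
        Matcher m = Pattern.compile("^\\w+", flags).matcher(text);
        List<String> words = new ArrayList<>();
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        String text = "one two\nthree four\nfive";

        // Default: ^ matches only at the start of the whole string.
        System.out.println(firstWords(text, 0)); // prints [one]

        // MULTILINE: ^ also matches just after each line terminator.
        System.out.println(firstWords(text, Pattern.MULTILINE)); // prints [one, three, five]
    }
}
```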

Example 4-6. NLMatch.java
import java.util.regex.*;

/**
 * Show line ending matching using regex class.
 * @author Ian F. Darwin, ian@darwinsys.com
 * @version $Id: ch04,v 1.4 2004/05/04 20:11:27 ian Exp $
 */
public class NLMatch {
    public static void main(String[] argv) {
        String input = "I dream of engines\nmore engines, all day long";
        System.out.println("INPUT: " + input);
        System.out.println();

        String[] patt = {
            "engines.more engines",
            "engines$"
        };

        for (int i = 0; i < patt.length; i++) {
            System.out.println("PATTERN " + patt[i]);
            boolean found;

            // First try the pattern with the default flags.
            Pattern p1l = Pattern.compile(patt[i]);
            found = p1l.matcher(input).find();
            System.out.println("DEFAULT match " + found);

            // DOTALL lets . match a line terminator;
            // MULTILINE lets ^ and $ match at line boundaries.
            Pattern pml = Pattern.compile(patt[i],
                Pattern.DOTALL|Pattern.MULTILINE);
            found = pml.matcher(input).find();
            System.out.println("MultiLine match " + found);
            System.out.println();
        }
    }
}

If you run this code, the first pattern (with the wildcard character .) matches only
when DOTALL is set, because by default . does not match a line terminator. The second
pattern (with $) matches only when MULTILINE is set. The second test of each pattern
sets both flags.

> java NLMatch
INPUT: I dream of engines
more engines, all day long

PATTERN engines.more engines
DEFAULT match false
MultiLine match true

PATTERN engines$
DEFAULT match false
MultiLine match true
