Java Matching “Accented” or Composite Characters


Problem

You want characters to match regardless of the form in which they are entered.

Solution

Compile the Pattern with the flags argument Pattern.CANON_EQ for “canonical
equality.”

Explained

Composite characters can be entered in various forms. Consider, as a single example, the letter e with an acute accent. This character may be found in various forms in Unicode text, such as the single character é (Unicode character \u00e9) or as the two-character sequence e\u0301 (e followed by the Unicode combining acute accent, \u0301).
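To make the difference between the two forms concrete, here is a minimal sketch. It assumes Java 6 or later, since it relies on java.text.Normalizer, which the original recipe does not use; the class name CanonFormsSketch and the helper equalAfterNfc are made up for this illustration.

```java
import java.text.Normalizer;

/** Sketch: the precomposed and decomposed forms are different strings until normalized. */
public class CanonFormsSketch {

    /** Hypothetical helper: true if both strings are identical after NFC normalization. */
    static boolean equalAfterNfc(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFC);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFC);
        return na.equals(nb);
    }

    public static void main(String[] args) {
        String precomposed = "\u00e9";  // single code point: e with acute
        String decomposed = "e\u0301";  // two chars: e + combining acute accent

        // A plain comparison sees two different char sequences.
        System.out.println(precomposed.equals(decomposed));          // false
        // After normalizing both to NFC they compare equal.
        System.out.println(equalAfterNfc(precomposed, decomposed));  // true
        // The lengths differ as well: 1 char versus 2 chars.
        System.out.println(precomposed.length() + " vs " + decomposed.length());
    }
}
```

A plain String comparison treats the two forms as different, which is why a regex compiled without any flags will not match one form against the other.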

To let you match such characters regardless of which of the possibly multiple “fully decomposed” forms was used to enter them, the regex package has an option for “canonical matching,” which treats all of these forms as equivalent. You enable it by passing CANON_EQ as one of the flags in the second argument to Pattern.compile(). This program shows CANON_EQ being used to match several forms:

import java.util.regex.*;

/**
 * CanonEqDemo - show use of Pattern.CANON_EQ, by comparing various ways of
 * entering the French word for "equal" and seeing if they are considered equal
 * by the regex-matching engine.
 */
public class CanonEqDemo {
    public static void main(String[] args) {
        String pattStr = "\u00e9gal"; // égal
        String[] input = {
            "\u00e9gal",  // égal - this one had better match :-)
            "e\u0301gal", // e + "Combining acute accent"
            "e\u02cagal", // e + "modifier letter acute accent"
            "e'gal",      // e + single quote
            "e\u00b4gal", // e + Latin-1 "acute"
        };
        // CANON_EQ makes canonically equivalent forms match each other
        Pattern pattern = Pattern.compile(pattStr, Pattern.CANON_EQ);
        for (int i = 0; i < input.length; i++) {
            if (pattern.matcher(input[i]).matches()) {
                System.out.println(pattStr + " matches input " + input[i]);
            } else {
                System.out.println(pattStr + " does not match input " + input[i]);
            }
        }
    }
}

When you run this program on JDK 1.4 or later, it correctly matches the precomposed character and the “combining accent” form, and rejects the other inputs. Some of those, unfortunately, look like the accent on a printer, but they are not considered “combining accent” characters. (A ? in the output stands for a character the console could not display.)

égal matches input égal
égal matches input e?gal
égal does not match input e?gal
égal does not match input e'gal
égal does not match input e´gal
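One way to see why only \u0301 counts as a combining accent is to look at each character's Unicode general category. The following is a rough sketch rather than part of the original recipe; the class name AccentKindSketch and the helper isCombiningMark are made up for this illustration, and it uses only Character.getType from the standard library.

```java
/** Sketch: classify the accent-like characters from the demo by Unicode category. */
public class AccentKindSketch {

    /** Hypothetical helper: true if c belongs to one of the Unicode "mark" categories. */
    static boolean isCombiningMark(char c) {
        int type = Character.getType(c);
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK
            || type == Character.ENCLOSING_MARK;
    }

    public static void main(String[] args) {
        char[] accents = {
            '\u0301', // combining acute accent
            '\u02ca', // modifier letter acute accent
            '\'',     // single quote
            '\u00b4', // Latin-1 acute accent
        };
        for (int i = 0; i < accents.length; i++) {
            // Print the code point in hex and whether it is a combining mark.
            System.out.println("U+" + Integer.toHexString(accents[i])
                + " combining mark? " + isCombiningMark(accents[i]));
        }
    }
}
```

Only the first character reports true. The other three are a modifier letter, a punctuation character, and a modifier symbol, so they stand on their own and are not folded into the preceding e.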
