Java Taking Strings Apart with StringTokenizer


Problem 

You need to take a string apart into words or tokens. 

Solution 

Construct a StringTokenizer around your string and call its methods hasMoreTokens( ) and nextToken( ). The StringTokenizer methods implement the Iterator design pattern (see Recipe 7.4):

// StrTokDemo.java
StringTokenizer st = new StringTokenizer("Hello World of Java");
while (st.hasMoreTokens( ))
 System.out.println("Token: " + st.nextToken( ));

StringTokenizer also implements the Enumeration interface directly (also in Recipe 7.4), but if you use the methods thereof you need to cast the results to String.

By default, a StringTokenizer breaks the String into tokens at whitespace (space, tab, newline, carriage return, and form feed), which is roughly what we would think of as “word boundaries” in European languages. Sometimes you want to break at some other character. No problem. When you construct your StringTokenizer, in addition to passing in the string to be tokenized, pass in a second string that lists the “break characters.” For example:


// StrTokDemo2.java
StringTokenizer st = new StringTokenizer("Hello, World|of|Java", ", |");
while (st.hasMoreElements( ))
 System.out.println("Token: " + st.nextElement( ));
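The cast mentioned above is easy to overlook when the token goes straight into a println. Here is a minimal sketch that stores each token in a String variable, so the cast from nextElement( ) becomes visible. The class name TokenListDemo and the helper method tokens are made up for this illustration and are not part of the recipe's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

/** Hypothetical helper: collect the tokens of a line into a List. */
public class TokenListDemo {

    /** Splits line at any character listed in delims. */
    public static List<String> tokens(String line, String delims) {
        StringTokenizer st = new StringTokenizer(line, delims);
        // countTokens() reports how many tokens remain, so it can size the list
        List<String> result = new ArrayList<>(st.countTokens());
        while (st.hasMoreElements()) {
            // nextElement() is declared to return Object, hence the cast;
            // nextToken() would return a String directly.
            String token = (String) st.nextElement();
            result.add(token);
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [Hello, World, of, Java]
        System.out.println(tokens("Hello, World|of|Java", ", |"));
    }
}
```

If you do not need the Enumeration interface for some other reason, calling hasMoreTokens( ) and nextToken( ) avoids the cast altogether.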


But wait, there’s more! What if you are reading lines like:


FirstName|LastName|Company|PhoneNumber


and your dear old Aunt Begonia hasn’t been employed for the last 38 years? Her “Company” field will in all probability be blank. If you look very closely at the previous code example, you’ll see that it has two delimiters together (the comma and the space), but if you run it, there are no “extra” tokens. That is, the StringTokenizer normally treats a run of adjacent delimiters as a single separator.

For cases like the phone list, where you need to preserve null fields, there is good news and bad news. The good news is you can do it: you simply add a third argument of true when constructing the StringTokenizer, meaning that you wish to see the delimiters as tokens. The bad news is that you now get to see the delimiters as tokens, so you have to do the arithmetic yourself. Want to see it? Run this program:


// StrTokDemo3.java
StringTokenizer st =
 new StringTokenizer("Hello, World|of|Java", ", |", true);
while (st.hasMoreElements( ))
 System.out.println("Token: " + st.nextElement( ));


and you get this output (the third token, which looks empty, is the single space character):


C:\javasrc>java StrTokDemo3
Token: Hello
Token: ,
Token:
Token: World
Token: |
Token: of
Token: |
Token: Java


This isn’t how you’d like StringTokenizer to behave, ideally, but it is serviceable enough most of the time. Example 3-1 asks for the delimiters as tokens and uses each one to advance to the next field, so that consecutive delimiters leave a null field in between; it returns the results as an array of Strings.


Example 3-1. StrTokDemo4.java (StringTokenizer)
import java.util.*;

/** Show using a StringTokenizer including getting the delimiters back */
public class StrTokDemo4 {
    public final static int MAXFIELDS = 5;
    public final static String DELIM = "|";

    /** Processes one String; returns it as an array of Strings */
    public static String[] process(String line) {
        String[] results = new String[MAXFIELDS];
        // Unless you ask StringTokenizer to return the delimiters,
        // it silently discards the empty fields between them.
        StringTokenizer st = new StringTokenizer(line, DELIM, true);
        int i = 0;
        // stuff each token into the current slot in the array
        while (st.hasMoreTokens( )) {
            String s = st.nextToken( );
            if (s.equals(DELIM)) {
                // Each delimiter moves us to the next slot; check the
                // new index before anything is stored there.
                if (++i >= MAXFIELDS)
                    // This is messy: See StrTokDemo4b which uses
                    // a Vector to allow any number of fields.
                    throw new IllegalArgumentException("Input line " +
                        line + " has too many fields");
                continue;
            }
            results[i] = s;
        }
        return results;
    }

    public static void printResults(String input, String[] outputs) {
        System.out.println("Input: " + input);
        for (int i=0; i<outputs.length; i++)
            System.out.println("Output " + i + " was: " + outputs[i]);
    }

    public static void main(String[] a) {
        printResults("A|B|C|D", process("A|B|C|D"));
        printResults("A||C|D", process("A||C|D"));
        printResults("A|||D|E", process("A|||D|E"));
    }
}

When you run this, you will see that A is always in output 0, B (if present) is in output 1, and so on. In other words, the null fields are being handled properly:

Input: A|B|C|D
Output 0 was: A
Output 1 was: B
Output 2 was: C
Output 3 was: D
Output 4 was: null
Input: A||C|D
Output 0 was: A
Output 1 was: null
Output 2 was: C
Output 3 was: D
Output 4 was: null
Input: A|||D|E
Output 0 was: A
Output 1 was: null
Output 2 was: null
Output 3 was: D
Output 4 was: E
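The array-based example caps the number of fields at MAXFIELDS. As a rough sketch of how the same delimiter-counting idea works without that limit, the version below collects the fields in a List. It is not the StrTokDemo4b program that the code comment mentions; the class name FieldListDemo and the method fields are hypothetical names chosen for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

/** Hypothetical variant: any number of fields, empty ones kept as null. */
public class FieldListDemo {
    public static final String DELIM = "|";

    public static List<String> fields(String line) {
        List<String> results = new ArrayList<>();
        // true as the third argument returns the delimiters as tokens
        StringTokenizer st = new StringTokenizer(line, DELIM, true);
        String current = null; // text of the field being read; null if empty
        while (st.hasMoreTokens()) {
            String s = st.nextToken();
            if (s.equals(DELIM)) {
                results.add(current); // a delimiter closes the current field
                current = null;
            } else {
                current = s;
            }
        }
        results.add(current); // the last field has no trailing delimiter
        return results;
    }

    public static void main(String[] args) {
        System.out.println(fields("A|B|C|D"));  // prints [A, B, C, D]
        System.out.println(fields("A||C|D"));   // prints [A, null, C, D]
        System.out.println(fields("A|||D|E"));  // prints [A, null, null, D, E]
    }
}
```

One design consequence worth noting: because every delimiter closes a field, a line that ends in a delimiter yields a trailing null entry, and an empty line yields a single null field.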

See Also 

Now that Java includes regular expressions (as of JDK 1.4), many uses of StringTokenizer can be replaced with regular expressions, which offer considerably more flexibility. For example, to extract all the numbers from a String, you can use this code:

Matcher toke = Pattern.compile("\\d+").matcher(inputString);
while (toke.find( )) {
    String courseString = toke.group(0);
    int courseNumber = Integer.parseInt(courseString);
    ...
}

This allows user input to be more flexible than you could easily handle with a StringTokenizer. Assuming that the numbers represent course numbers at some educational institution, the inputs “471,472,570” or “Courses 471 and 472, 570” or just “471 472 570” should all give the same results.
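To check that claim, here is a small self-contained sketch that runs the same \d+ pattern over the three sample inputs. The class name CourseNumberDemo and the method courseNumbers are assumptions made for this illustration; only the pattern and the Matcher loop come from the snippet above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical demo: pull every run of digits out of free-form input. */
public class CourseNumberDemo {
    // Compile once and reuse; one or more consecutive digits
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    public static List<Integer> courseNumbers(String input) {
        List<Integer> numbers = new ArrayList<>();
        Matcher toke = DIGITS.matcher(input);
        while (toke.find()) {
            // group(0) is the whole match, i.e. one run of digits
            numbers.add(Integer.parseInt(toke.group(0)));
        }
        return numbers;
    }

    public static void main(String[] args) {
        // Each line prints [471, 472, 570]
        System.out.println(courseNumbers("471,472,570"));
        System.out.println(courseNumbers("Courses 471 and 472, 570"));
        System.out.println(courseNumbers("471 472 570"));
    }
}
```

Note that Integer.parseInt throws NumberFormatException if a run of digits is too long to fit in an int, so input that may contain very long numbers would need a length check or a different numeric type.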
