Java Parsing Comma-Separated Data


You have a string or a file of lines containing comma-separated values (CSV) that you need to read. Many Windows-based spreadsheets and some databases use CSV to export data. 


Use my CSV class or a regular expression. 


CSV is deceptive. It looks simple at first glance, but the values may be quoted or unquoted. If quoted, they may further contain escaped quotes. This far exceeds the capabilities of the StringTokenizer class (Recipe 3.2). Either considerable Java coding or the use of regular expressions is required. I’ll show both ways.

First, a Java program. Assume for now that we have a class called CSV that has a no-argument constructor and a method called parse() that takes a string representing one line of the input file. For flexibility, parse() returns the fields as a List, from which you can obtain an Iterator. I simply use the Iterator’s hasNext() method to control the loop and its next() method to get the next object:

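    import java.util.*;

    /* Simple demo of CSV parser class. */
    public class CSVSimple {
        public static void main(String[] args) {
            CSV parser = new CSV();
            // Input: the sample line given in the CSV class's header comment
            List list = parser.parse(
                "\"LU\",86.25,\"11/4/1998\",\"2:19PM\",+4.0625");
            Iterator it = list.iterator();
            while (it.hasNext()) {
                System.out.println(it.next());
            }
        }
    }

Once the compiler has processed the backslash escapes, the string being parsed is actually the following:

    "LU",86.25,"11/4/1998","2:19PM",+4.0625

Running CSVSimple yields the following output:

    > java CSVSimple
    LU
    86.25
    11/4/1998
    2:19PM
    +4.0625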
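To make the earlier remark about StringTokenizer concrete, here is a small experiment. It is only a sketch and not part of the original recipe: the class name NaiveSplitDemo, the helper countTokens(), and the "Lucent, Inc." value are all invented for illustration.

```java
import java.util.StringTokenizer;

/** Hypothetical demo: why comma-only tokenizing is not enough for CSV. */
class NaiveSplitDemo {

    /** Count the tokens StringTokenizer yields with ',' as the only delimiter. */
    static int countTokens(String line) {
        return new StringTokenizer(line, ",").countTokens();
    }

    public static void main(String[] args) {
        // Three logical fields; the second has a comma inside its quotes.
        String quoted = "\"LU\",\"Lucent, Inc.\",86.25";
        System.out.println(countTokens(quoted));       // prints 4, not 3
        System.out.println(quoted.split(",").length);  // prints 4 as well

        // Three logical fields; the middle one is empty.
        String sparse = "a,,b";
        System.out.println(countTokens(sparse));       // prints 2: the empty field vanishes
    }
}
```

Both failures come from treating every comma as a separator: the quoted comma splits one field in two, and StringTokenizer silently drops the empty field, so the column positions no longer line up.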

But what about the CSV class itself? The code in Example 3-10 started as a translation of a CSV program written in C++ by Brian W. Kernighan and Rob Pike that appeared in their book The Practice of Programming (Addison-Wesley). Their version commingled the input processing with the parsing; my CSV class does only the parsing, since the input could be coming from any of a variety of sources. It has also been substantially rewritten over time. The main work is done in parse(), which delegates handling of individual fields to advQuoted() in cases where the field begins with a quote; otherwise, to advPlain().
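The division of labor just described can be sketched in a few dozen lines. What follows is a minimal illustration and not the book's listing: the class name CsvSketch is hypothetical, and it assumes that a doubled quote ("") inside a quoted field stands for one literal quote and that a closing quote is followed by the separator or the end of the line.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in showing parse() dispatching to two field scanners. */
class CsvSketch {
    private final char sep;

    CsvSketch(char sep) { this.sep = sep; }

    /** Break one line into fields, choosing a scanner by each field's first char. */
    List<String> parse(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder sb = new StringBuilder();
        int i = 0;
        do {
            sb.setLength(0);
            if (i < line.length() && line.charAt(i) == '"') {
                i = advQuoted(line, sb, i + 1);  // start after the opening quote
            } else {
                i = advPlain(line, sb, i);
            }
            fields.add(sb.toString());
            i++;                                 // step over the separator
        } while (i <= line.length());            // <= keeps a trailing empty field
        return fields;
    }

    /** Quoted field: copy up to the closing quote; return index of next separator. */
    private int advQuoted(String s, StringBuilder sb, int i) {
        int len = s.length();
        int j = i;
        while (j < len) {
            char c = s.charAt(j);
            if (c == '"') {
                if (j + 1 < len && s.charAt(j + 1) == '"') {
                    sb.append('"');              // doubled quote: one literal quote
                    j += 2;
                    continue;
                }
                j++;                             // closing quote
                break;
            }
            sb.append(c);
            j++;
        }
        int k = s.indexOf(sep, j);               // ignore stray text before separator
        return k == -1 ? len : k;
    }

    /** Unquoted field: copy up to the next separator; return its index. */
    private int advPlain(String s, StringBuilder sb, int i) {
        int j = s.indexOf(sep, i);
        if (j == -1) {
            j = s.length();
        }
        sb.append(s, i, j);
        return j;
    }

    public static void main(String[] args) {
        CsvSketch p = new CsvSketch(',');
        for (String f : p.parse("\"LU\",86.25,\"11/4/1998\",\"2:19PM\",+4.0625")) {
            System.out.println(f);
        }
    }
}
```

Both scanners return the index of the separator that ends the field (or the line length), which is what lets the caller advance with a single increment regardless of which scanner ran.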
Example 3-10. CSV.java

    import java.util.*;
    import com.darwinsys.util.Debug;

    /** Parse comma-separated values (CSV), a common Windows file format.
     * Sample input: "LU",86.25,"11/4/1998","2:19PM",+4.0625
     *
     * Inner logic adapted from a C++ original that was
     * Copyright (C) 1999 Lucent Technologies
     * Excerpted from 'The Practice of Programming'
     * by Brian W. Kernighan and Rob Pike.
     *
     * Included by permission of the web site,
     * which says:
     * "You may use this code for any purpose, as long as you leave
     * the copyright notice and book citation attached." I have done so.
     * @author Brian W. Kernighan and Rob Pike (C++ original)
     * @author Ian F. Darwin (translation into Java and removal of I/O)
     * @author Ben Ballard (rewrote advQuoted to handle '""' and for readability)
     */
    public class CSV {

        public static final char DEFAULT_SEP = ',';

        /** Construct a CSV parser, with the default separator (','). */
        public CSV() {
            this(DEFAULT_SEP);
        }

        /** Construct a CSV parser with a given separator.
         * @param sep The single char for the separator (not a list of
         * separator characters)
         */
        public CSV(char sep) {
            fieldSep = sep;
        }

        /** The fields in the current String */
        protected List list = new ArrayList();

        /** the separator char for this parser */
        protected char fieldSep;

        /** parse: break the input String into fields
         * @return java.util.List containing each field
         * from the original as a String, in order.
         */
        public List parse(String line) {
            StringBuffer sb = new StringBuffer();
            list.clear();            // recycle to initial state
            int i = 0;

            if (line.length() == 0) {
                list.add(line);
                return list;
            }

            do {
                sb.setLength(0);
                if (i < line.length() && line.charAt(i) == '"')
                    i = advQuoted(line, sb, ++i);    // skip quote
                else
                    i = advPlain(line, sb, i);
                list.add(sb.toString());
                Debug.println("csv", sb.toString());
                i++;
            } while (i < line.length());

            return list;
        }

        /** advQuoted: quoted field; return index of next separator */
        protected int advQuoted(String s, StringBuffer sb, int i) {
            int j;
            int len = s.length();
            for (j = i; j ...
            // [listing truncated here in this copy: the rest of advQuoted()
            //  and all of advPlain() are missing]

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    /* Simple demo of CSV matching using Regular Expressions.
     * Does NOT use the "CSV" class defined in the Java CookBook, but uses
     * a regex pattern simplified from Chapter 7 of Mastering Regular
     * Expressions (p. 205, first edn.)
     * @version $Id: ch03,v 1.3 2004/05/04 18:03:14 ian Exp $
     */
    public class CSVRE {
        /** The rather involved pattern used to match CSV's consists of three
         * alternations: the first matches a quoted field, the second unquoted,
         * the third a null field.
         */
        public static final String CSV_PATTERN = "\"([^\"]+?)\",?|([^,]+),?|,";

        private static Pattern csvRE;

        public static void main(String[] argv) throws IOException {
            System.out.println(CSV_PATTERN);
            // Read lines from standard input
            new CSVRE().process(new BufferedReader(new InputStreamReader(System.in)));
        }

        /** Construct a regex-based CSV parser. */
        public CSVRE() {
            csvRE = Pattern.compile(CSV_PATTERN);
        }

        /** Process one file. Delegates to parse() a line at a time */
        public void process(BufferedReader in) throws IOException {
            String line;

            // For each line...
            while ((line = in.readLine()) != null) {
                System.out.println("line = `" + line + "'");
                List l = parse(line);
                System.out.println("Found " + l.size() + " items.");
                for (int i = 0; i < l.size(); i++) {
                    System.out.print(l.get(i) + ",");
                }
                System.out.println();
            }
        }

        /** Parse one line.
         * @return List of Strings, minus their double quotes
         */
        public List parse(String line) {
            List list = new ArrayList();
            Matcher m = csvRE.matcher(line);
            // For each field
            while (m.find()) {
                System.out.println(m.groupCount());
                String match = m.group();      // the whole text of this match
                if (match == null)
                    break;
                if (match.endsWith(",")) {     // trim trailing ,
                    match = match.substring(0, match.length() - 1);
                }
                if (match.startsWith("\"")) {  // assume also ends with
                    match = match.substring(1, match.length() - 1);
                }
                if (match.length() == 0)
                    match = null;
                list.add(match);
            }
            return list;
        }
    }
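The parse() method in CSVRE trims the trailing comma and the surrounding quotes with substring calls. Because CSV_PATTERN already wraps the quoted and unquoted alternatives in parentheses, the same result can probably be had from the capture groups alone. The following is a sketch of that variation, not the author's code; the class name CsvGroups is hypothetical, and the pattern text is copied unchanged from CSV_PATTERN.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical variation on CSVRE that reads fields from capture groups. */
class CsvGroups {
    // Same pattern text as CSV_PATTERN: quoted field | unquoted field | bare comma
    static final Pattern CSV = Pattern.compile("\"([^\"]+?)\",?|([^,]+),?|,");

    /** Parse one line; an empty field (bare comma) is stored as null, as in CSVRE. */
    static List<String> parse(String line) {
        List<String> fields = new ArrayList<>();
        Matcher m = CSV.matcher(line);
        while (m.find()) {
            if (m.group(1) != null) {
                fields.add(m.group(1));   // quoted: group 1 excludes the quotes
            } else if (m.group(2) != null) {
                fields.add(m.group(2));   // unquoted: group 2 excludes the comma
            } else {
                fields.add(null);         // third alternative matched: empty field
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(parse("\"LU\",86.25,\"11/4/1998\",\"2:19PM\",+4.0625"));
    }
}
```

This inherits the limits of the simplified pattern: a quoted field may contain commas, but the first alternative has no provision for doubled quotes inside a quoted field, so input of that kind still needs the hand-written parser.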

It is sometimes “downright scary” how much mundane code you can eliminate with a single, well-formulated regular expression.

