Web Development and Design | Tutorial for Java, PHP, HTML, Javascript: Java


Java Rounding Floating-Point Numbers



You need to round floating-point numbers to integers or to a particular precision.


If you simply cast a floating value to an integer value, Java truncates the value. A
value like 3.999999 cast to an int or long becomes 3, not 4. To round floating-point
numbers properly, use Math.round( ) . It has two forms: if you give it a double , you get
a long result; if you give it a float , you get an int .
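To make the difference concrete, here is a minimal sketch contrasting a plain cast with Math.round( ). The class name RoundVsCast and its two helper methods are invented for this illustration.

```java
class RoundVsCast {
    /** Truncates toward zero, exactly as a plain cast does. */
    static long truncate(double d) {
        return (long) d;
    }

    /** Rounds to the nearest long; ties go toward positive infinity. */
    static long nearest(double d) {
        return Math.round(d);
    }

    public static void main(String[] args) {
        double d = 3.999999;
        System.out.println("cast:  " + truncate(d)); // 3
        System.out.println("round: " + nearest(d));  // 4

        // The float overload returns an int rather than a long.
        int fromFloat = Math.round(2.5f);
        System.out.println("round(2.5f): " + fromFloat); // 3

        // Negative ties also move toward positive infinity.
        System.out.println("round(-2.5): " + Math.round(-2.5)); // -2
    }
}
```

Note that the cast truncates toward zero, so a negative value such as -3.9 becomes -3 rather than -4.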

What if you don’t like the rounding rules used by round( )? If for some bizarre reason
you wanted to round up only when the fractional part is greater than 0.54, instead of
the normal 0.5, you could write your own version of round( ):

/** Fractional part above which we round up (0.54, per the text above). */
static final double THRESHOLD = 0.54;

/**
 * Round floating values to integers.
 * @param d A non-negative value to be rounded.
 * @return The closest int to the argument.
 */
static int round(double d) {
    if (d < 0) {
        throw new IllegalArgumentException("Value must be non-negative");
    }
    int di = (int)Math.floor(d); // integral value below (or ==) d
    if ((d - di) > THRESHOLD) {
        return di + 1;
    } else {
        return di;
    }
}

If you need to display a number with less precision than it normally gets, you probably want to use a DecimalFormat object or a Formatter object.
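As a rough sketch of both display options, the class below formats a value with DecimalFormat and with the Formatter syntax behind String.format( ). The class and method names are invented for this example. The locale is pinned to Locale.ROOT so that the decimal separator does not vary with the machine's settings.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

class DisplayPrecision {
    /** Formats with DecimalFormat, showing at most two fraction digits. */
    static String withDecimalFormat(double d) {
        DecimalFormat df =
            new DecimalFormat("0.##", DecimalFormatSymbols.getInstance(Locale.ROOT));
        return df.format(d);
    }

    /** Formats with java.util.Formatter syntax, always two fraction digits. */
    static String withFormatter(double d) {
        return String.format(Locale.ROOT, "%.2f", d);
    }

    public static void main(String[] args) {
        double d = 2.0 / 3.0;
        System.out.println(withDecimalFormat(d)); // 0.67
        System.out.println(withFormatter(d));     // 0.67
        System.out.println(withDecimalFormat(1.5)); // 1.5
        System.out.println(withFormatter(1.5));     // 1.50
    }
}
```

Both approaches change only the text that is displayed; the underlying double keeps its full precision.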

Java Comparing Floating-Point Numbers



You want to compare two floating-point numbers for equality.


Based on what we’ve just discussed, you probably won’t just go comparing two
floats or doubles for equality. You might expect the floating-point wrapper classes,
Float and Double, to override the equals( ) method, which they do. The equals( )
method returns true if the two values are the same bit for bit, that is, if and only if
the numbers are the same or are both NaN. It returns false otherwise, including if the
argument passed in is null, or if one object is +0.0 and the other is –0.0.
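The following minimal sketch contrasts the == operator with the wrapper's equals( ) method for the two special cases just described. The class name WrapperEquals and its helper method are made up for illustration.

```java
class WrapperEquals {
    /** Compares two doubles the way Double.equals( ) does. */
    static boolean wrapperEquals(double a, double b) {
        return Double.valueOf(a).equals(Double.valueOf(b));
    }

    public static void main(String[] args) {
        double nan = Double.NaN;
        System.out.println(nan == nan);               // false
        System.out.println(wrapperEquals(nan, nan));  // true

        System.out.println(0.0 == -0.0);              // true
        System.out.println(wrapperEquals(0.0, -0.0)); // false

        System.out.println(Double.valueOf(1.5).equals(null)); // false
    }
}
```

In both special cases the wrapper method gives the opposite answer to the primitive comparison.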

If this sounds weird, remember that the complexity comes partly from the nature of
doing real number computations in the less-precise floating-point hardware, and
partly from the details of the IEEE Standard 754, which specifies the floating-point
functionality that Java tries to adhere to, so that underlying floating-point processor
hardware can be used even when Java programs are being interpreted.

To actually compare floating-point numbers for equality, it is generally desirable to
compare them within some tiny range of allowable differences; this range is often
referred to as a tolerance or as epsilon. The following example shows an equals( )
method you can use to do this comparison, as well as comparisons on values of NaN.
When run, it prints that the first two numbers are equal within epsilon:

$ java FloatCmp
True within epsilon 1.0E-7

Example. FloatCmp.java
/**
 * Floating-point comparisons.
 */
public class FloatCmp {
    final static double EPSILON = 0.0000001;

    public static void main(String[] argv) {
        double da = 3 * .3333333333;
        double db = 0.99999992857;
        // Compare two numbers that are expected to be close.
        if (da == db) {
            System.out.println("Java considers " + da + "==" + db);
        // else compare with our own equals method
        } else if (equals(da, db, EPSILON)) {
            System.out.println("True within epsilon " + EPSILON);
        } else {
            System.out.println(da + " != " + db);
        }
        // Show that comparing two NaNs is not a good idea:
        double d1 = Double.NaN;
        double d2 = Double.NaN;
        if (d1 == d2)
            System.err.println("Comparing two NaNs incorrectly returns true.");
        if (!Double.valueOf(d1).equals(Double.valueOf(d2)))
            System.err.println("Double(NaN).equal(NaN) incorrectly returns false.");
    }

    /** Compare two doubles within a given epsilon */
    public static boolean equals(double a, double b, double eps) {
        if (a==b) return true;
        // If the difference is less than epsilon, treat as equal.
        return Math.abs(a - b) < eps;
    }

    /** Compare two doubles, using default epsilon */
    public static boolean equals(double a, double b) {
        if (a==b) return true;
        // Scale epsilon by the larger magnitude, making this a relative tolerance.
        return Math.abs(a - b) < EPSILON * Math.max(Math.abs(a), Math.abs(b));
    }
}

Note that neither of the System.err messages about “incorrect returns” prints. The point of this example with NaNs is that you should always make sure values are not NaN before entrusting them to Double.equals( ).

Java Ensuring the Accuracy of Floating-Point Numbers



You want to know if a floating-point computation generated a sensible result.


Compare with the INFINITY constants, and use isNaN( ) to check for “not a number.”
Integer (fixed-point) operations that do things like divide by zero result in Java
notifying you abruptly by throwing an exception. This is because integer division by
zero is considered a logic error.

Floating-point operations, however, do not throw an exception because they are
defined over an (almost) infinite range of values. Instead, they signal errors by
producing the constant POSITIVE_INFINITY if you divide a positive floating-point
number by zero, the constant NEGATIVE_INFINITY if you divide a negative floating-point
value by zero, and NaN (Not a Number) if you otherwise generate an invalid result.
Values for these three public constants are defined in both the Float and the Double wrapper classes.

The value NaN has the unusual property that it is not equal to itself,
that is, NaN != NaN. Thus, it would hardly make sense to compare a (possibly
suspect) number against NaN, because the expression:

x == NaN

can never be true. Instead, the methods Float.isNaN(float) and Double.isNaN(double) must be used:

// InfNaN.java
public class InfNaN {
    public static void main(String[] argv) {
        double d = 123;
        double e = 0;
        if (d/e == Double.POSITIVE_INFINITY)
            System.out.println("Check for POSITIVE_INFINITY works");
        double s = Math.sqrt(-1);
        // This comparison is always false, so nothing is printed here.
        if (s == Double.NaN)
            System.out.println("Comparison with NaN incorrectly returns true");
        if (Double.isNaN(s))
            System.out.println("Double.isNaN( ) correctly returns true");
    }
}

Note that this, by itself, is not sufficient to ensure that floating-point calculations have been done with adequate accuracy. For example, the following program demonstrates a contrived calculation (Heron’s formula for the area of a triangle) both in float and in double. The double values are correct, but the float value comes out as zero due to rounding errors. This happens because, in Java, operations involving only float values are performed as 32-bit calculations. Related languages such as C automatically promote these to double during the computation, which can eliminate some loss of accuracy.

/** Compute the area of a triangle using Heron's Formula.
 * Code and values from Prof W. Kahan and Joseph D. Darcy.
 * See http://www.cs.berkeley.edu/~wkahan/JAVAhurt.pdf.
 * Derived from listing in Rick Grehan's Java Pro article (October 1999).
 * Simplified and reformatted by Ian Darwin.
 */
public class Heron {
    public static void main(String[] args) {
        // Sides for triangle in float
        float af, bf, cf;
        float sf, areaf;
        // Ditto in double
        double ad, bd, cd;
        double sd, aread;

        // Area of triangle in float
        af = 12345679.0f;
        bf = 12345678.0f;
        cf = 1.01233995f;
        sf = (af+bf+cf)/2.0f;
        areaf = (float)Math.sqrt(sf * (sf - af) * (sf - bf) * (sf - cf));
        System.out.println("Single precision: " + areaf);

        // Area of triangle in double
        ad = 12345679.0;
        bd = 12345678.0;
        cd = 1.01233995;
        sd = (ad+bd+cd)/2.0d;
        aread = Math.sqrt(sd * (sd - ad) * (sd - bd) * (sd - cd));
        System.out.println("Double precision: " + aread);
    }
}

Let’s run it. To ensure that the rounding is not an implementation artifact, I’ll try it both with Sun’s JDK and with Kaffe:

$ java Heron
Single precision:
Double precision:
$ kaffe Heron
Single precision:
Double precision:

If in doubt, use double!

To ensure consistency of very large magnitude double computations on different Java implementations, Java provides the keyword strictfp, which can apply to classes, interfaces, or methods within a class. If a computation is Strict-FP, then it must always, for example, return the value INFINITY if a calculation would overflow the value of Double.MAX_VALUE (or underflow the value Double.MIN_VALUE). Non-Strict-FP calculations, the default, are allowed to perform calculations on a greater range and can return a valid final result that is in range even if the interim product was out of range. This is pretty esoteric and affects only computations that approach the bounds of what fits into a double.

Java Taking a Fraction of an Integer Without Using Floating Point



You want to multiply an integer by a fraction without converting the fraction to a
floating-point number.


Multiply the integer by the numerator and divide by the denominator.
This technique should be used only when efficiency is more important than clarity,
as it tends to detract from the readability—and therefore the maintainability—of
your code.
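A minimal sketch of that multiply-then-divide order follows. The helper name fractionOf is invented for this example, and the widening to long is an added precaution that the text above does not require. It keeps the intermediate product from overflowing when the integer is large.

```java
class IntFraction {
    /**
     * Computes value * numerator / denominator using only integer
     * arithmetic. The result is truncated toward zero.
     */
    static int fractionOf(int value, int numerator, int denominator) {
        // Multiply first, in long, so the product cannot overflow an int.
        long product = (long) value * numerator;
        return (int) (product / denominator);
    }

    public static void main(String[] args) {
        System.out.println(fractionOf(5, 2, 3));             // 3
        System.out.println(fractionOf(2_000_000_000, 2, 3)); // 1333333333
    }
}
```

Dividing before multiplying would lose the fraction entirely, since 2/3 is 0 in integer arithmetic.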


Since integers and floating-point numbers are stored differently, it may sometimes be
desirable and feasible, for efficiency purposes, to multiply an integer by a fractional
value without converting the values to floating point and back, and without requiring a “cast”:

/** Compute the value of 2/3 of 5 */
public class FractMult {
    public static void main(String u[]) {
        double d1 = 0.666 * 5;  // fast but obscure and inaccurate: convert
        System.out.println(d1); // 2/3 to 0.666 in programmer's head
        double d2 = 2/3 * 5;    // wrong answer - 2/3 == 0, 0*5 = 0
        System.out.println(d2);
        double d3 = 2d/3d * 5;  // "normal"
        System.out.println(d3);
        double d4 = (2*5)/3d;   // one step done as integers, almost same answer
        System.out.println(d4);
        int i5 = 2*5/3;         // fast, approximate integer answer
        System.out.println(i5);
    }
}

Running it looks like this:
$ java FractMult

Java Converting Numbers to Objects and Vice Versa



You need to convert numbers to objects and objects to numbers.


Use the object wrapper classes listed in the table at the beginning.


Often you have a primitive number and you need to pass it into a method where an
Object is required. This frequently happens when using the Collection classes.

To convert between an int and an Integer object, or vice versa, you can use the following:

// IntObject.java
// int to Integer
Integer i1 = Integer.valueOf(42); // new Integer(42) also works but is deprecated in current JDKs
System.out.println(i1.toString( )); // or just i1
// Integer to int
int i2 = i1.intValue( );
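Since JDK 1.5 the compiler can also insert these conversions for you (autoboxing and auto-unboxing). The sketch below assumes Java 7 or later for the diamond operator. The class name BoxingDemo and its sum method are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class BoxingDemo {
    /** Adds up a list of Integer objects, unboxing each one to an int. */
    static int sum(List<Integer> values) {
        int total = 0;
        for (Integer v : values) {
            total += v; // auto-unboxing: the compiler calls v.intValue()
        }
        return total;
    }

    public static void main(String[] args) {
        List<Integer> list = new ArrayList<>();
        list.add(42);                  // autoboxing: int becomes Integer
        list.add(Integer.valueOf(7));  // the explicit equivalent
        System.out.println(sum(list)); // 49
    }
}
```

Unboxing a null Integer throws a NullPointerException, so the explicit forms are still worth knowing.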

Java Storing a Larger Number in a Smaller Number



You have a number of a larger type and you want to store it in a variable of a smaller type.


Cast the number to the smaller type. (A cast is a type listed in parentheses before a
value that causes the value to be treated as though it were of the listed type.)
For example, to store a long in an int, you need a cast. To store a double in a float,
you also need a cast.


This causes newcomers some grief, as the default type for a number with a decimal
point is double , not float . So code like:

float f = 3.0;

won’t even compile! It’s as if you had written:

double tmp = 3.0;
float f = tmp;

You can fix it by making f a double , by making the 3.0 a float , by putting in a cast, or by assigning an integer value of 3:

double f = 3.0;
float f = 3.0f;
float f = 3f;
float f = (float)3.0;
float f = 3;

The same applies when storing an int into a short, char, or byte:

// CastNeeded.java
public static void main(String argv[]) {
    int i;
    double j = 2.75;
    i = j;          // EXPECT COMPILE ERROR
    i = (int)j;     // with cast; i gets 2
    System.out.println("i =" + i);
    byte b;
    b = i;          // EXPECT COMPILE ERROR
    b = (byte)i;    // with cast; b gets 2
    System.out.println("b =" + b);
}

The lines marked EXPECT COMPILE ERROR do not compile unless either commented out or changed to be correct. The lines marked “with cast” show the correct forms.
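A cast compiles, but it silently discards the high-order bits when the value does not fit. The sketch below shows that effect, along with Math.toIntExact( ), which is available from Java 8 onward and throws instead of wrapping. The class name NarrowingDemo and the helper checkedToInt are invented for this example.

```java
class NarrowingDemo {
    /** Narrows a long to an int, throwing instead of silently wrapping. */
    static int checkedToInt(long value) {
        return Math.toIntExact(value); // ArithmeticException on overflow
    }

    public static void main(String[] args) {
        int i = 300;
        byte b = (byte) i;             // keeps only the low 8 bits
        System.out.println(b);         // 44

        long big = 3_000_000_000L;
        System.out.println((int) big); // -1294967296

        try {
            checkedToInt(big);
        } catch (ArithmeticException e) {
            System.out.println("Does not fit in an int: " + big);
        }
    }
}
```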

Java Checking Whether a String Is a Valid Number



You need to check whether a given string contains a valid number, and, if so,
convert it to binary (internal) form.


Use the appropriate wrapper class’s conversion routine and catch the
NumberFormatException . This code converts a string to a double :

// StringToDouble.java
public static void main(String argv[]) {
    String aNumber = argv[0];    // not argv[1]
    double result;
    try {
        result = Double.parseDouble(aNumber);
    } catch(NumberFormatException exc) {
        System.out.println("Invalid number " + aNumber);
        return;    // result was never assigned, so stop here
    }
    System.out.println("Number is " + result);
}

Of course, that lets you validate only numbers in the format that the designers of the wrapper classes expected. If you need to accept a different definition of numbers, you could use regular expressions to make the determination.

There may also be times when you want to tell if a given number is an integer number or a floating-point number. One way is to check for the characters ., d, e, or f in the input; if one of these characters is present, convert the number as a double. Otherwise, convert it as an int:

// Part of GetNumber.java
private static Number NAN = Double.valueOf(Double.NaN);

/** Process one String, returning it as a Number subclass */
public Number process(String s) {
    // matches( ) must match the whole string, so allow text on both sides
    if (s.matches(".*[.dDeEfF].*")) {
        try {
            double dValue = Double.parseDouble(s);
            System.out.println("It's a double: " + dValue);
            return Double.valueOf(dValue);
        } catch (NumberFormatException e) {
            System.out.println("Invalid double: " + s);
            return NAN;
        }
    } else { // did not contain . d e or f, so try as int.
        try {
            int iValue = Integer.parseInt(s);
            System.out.println("It's an int: " + iValue);
            return Integer.valueOf(iValue);
        } catch (NumberFormatException e2) {
            System.out.println("Not a number: " + s);
            return NAN;
        }
    }
}

See Also 

A more involved form of parsing is offered by the DecimalFormat class. JDK 1.5 also features the Scanner class.
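As a small sketch of the Scanner approach, the class below classifies a token without catching any exceptions. The class and method names are invented for this example. The locale is set to Locale.US as an assumption, so that a period is always accepted as the decimal separator.

```java
import java.util.Locale;
import java.util.Scanner;

class ScanNumbers {
    /** Classifies one token as "int", "double", or "not a number". */
    static String classify(String token) {
        Scanner sc = new Scanner(token).useLocale(Locale.US);
        try {
            if (sc.hasNextInt()) {
                return "int";
            }
            if (sc.hasNextDouble()) {
                return "double";
            }
            return "not a number";
        } finally {
            sc.close();
        }
    }

    public static void main(String[] args) {
        System.out.println(classify("42"));  // int
        System.out.println(classify("3.5")); // double
        System.out.println(classify("abc")); // not a number
    }
}
```

Scanner has its own idea of what a number looks like, which is not identical to what Double.parseDouble( ) accepts, so check its Javadoc before relying on it for validation.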

Java Numbers


Numbers are basic to just about any computation. They’re used for array indexes,
temperatures, salaries, ratings, and an infinite variety of things. Yet they’re not as
simple as they seem. With floating-point numbers, how accurate is accurate? With
random numbers, how random is random? With strings that should contain a
number, what actually constitutes a number?

Java has several built-in types that can be used to represent numbers, summarized in
Table. Note that unlike languages such as C or Perl, which don’t specify the size
or precision of numeric types, Java—with its goal of portability—specifies these
exactly and states that they are the same on all platforms.

As you can see, Java provides a numeric type for just about any purpose. There are
four sizes of signed integers for representing various sizes of whole numbers. There
are two sizes of floating-point numbers to approximate real numbers. There is also a
type specifically designed to represent and allow operations on Unicode characters.

When you read a string from user input or a text file, you need to convert it to the
appropriate type. The object wrapper classes in the second column have several
functions, but one of the most important is to provide this basic conversion
functionality, replacing the C programmer’s atoi/atof family of functions and the
numeric arguments to scanf.

Going the other way, you can convert any number (indeed, anything at all in Java) to
a string just by using string concatenation. If you want a little bit of control over
numeric formatting, Recipe 5.8 shows you how to use some of the object wrappers’
conversion routines. And if you want full control, it also shows the use of
NumberFormat and its related classes.

As the name object wrapper implies, these classes are also used to “wrap” a number
in a Java object, as many parts of the standard API are defined in terms of objects.
Later on, Recipe 10.16 shows using an Integer object to save an int ’s value to a file
using object serialization, and retrieving the value later.

But I haven’t yet mentioned the issues of floating point. Real numbers, you may
recall, are numbers with a fractional part. There is an infinity of possible real
numbers. A floating-point number (what a computer uses to approximate a real
number) is not the same as a real number. The number of floating-point numbers is
finite, with only 2^32 different bit patterns for floats, and 2^64 for doubles. Thus,
most real values have only an approximate correspondence to floating point.
Printing the real number 0.3 works correctly, as in:

// RealValues.java
System.out.println("The real value 0.3 is " + 0.3);

results in this printout:

The real value 0.3 is 0.3

But the difference between a real value and its floating-point approximation can accumulate if the value is used in a computation; this is often called a rounding error. Continuing the previous example, the real 0.3 multiplied by 3 yields:

The real 0.3 times 3 is 0.89999999999999991

Surprised? More surprising is this: you’ll get the same output on any conforming Java implementation. I ran it on machines as disparate as a Pentium with OpenBSD, a Pentium with Windows and Sun’s JDK, and on Mac OS X with JDK 1.4.1. Always the same answer.
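The effect is easy to reproduce. The minimal sketch below also shows java.math.BigDecimal as one way to avoid it when exact decimal results matter. That class is not part of the discussion above, and the class and method names here are invented for illustration.

```java
import java.math.BigDecimal;

class RoundingError {
    /** Reports whether the double product 0.3 * 3 equals the double 0.9. */
    static boolean doubleProductIsExact() {
        return 0.3 * 3 == 0.9;
    }

    /** Computes 0.3 * 3 in decimal arithmetic, which is exact here. */
    static BigDecimal decimalProduct() {
        return new BigDecimal("0.3").multiply(new BigDecimal("3"));
    }

    public static void main(String[] args) {
        System.out.println("The real 0.3 times 3 is " + (0.3 * 3));
        System.out.println("Equal to 0.9? " + doubleProductIsExact()); // false
        System.out.println("BigDecimal gives " + decimalProduct());    // 0.9
    }
}
```

The BigDecimal values are built from strings on purpose: new BigDecimal(0.3) would start from the already-approximated double.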

And what about random numbers? How random are they? You have probably heard the expression “pseudo-random numbers.” All conventional random number generators, whether written in Fortran, C, or Java, generate pseudo-random numbers. That is, they’re not truly random! True randomness comes only from specially built hardware: an analog source of Brownian noise connected to an analog-to-digital converter, for example. This is not your average PC! However, pseudo-random number generators (PRNGs for short) are good enough for most purposes, so we use them. Java provides one random generator in the base library java.lang.Math, and several others; we’ll examine these in Recipe 5.13.

The class java.lang.Math contains an entire “math library” in one class, including trigonometry, conversions (including degrees to radians and back), rounding, truncating, square root, minimum, and maximum. It’s all there. Check the Javadoc for java.lang.Math.

The package java.math contains support for “big numbers” (those larger than the normal built-in long integers, for example). See Recipe 5.19.

Java works hard to ensure that your programs are reliable. The usual ways you’d notice this are in the common requirement to catch potential exceptions (all through the Java API) and in the need to “cast” or convert when storing a value that might or might not fit into the variable you’re trying to store it in. I’ll show examples of these.

Overall, Java’s handling of numeric data fits well with the ideals of portability, reliability, and ease of programming.

See Also 

The Java Language Specification. The Javadoc page for java.lang.Math .

Java Program: Full Grep


Now that we’ve seen how the regular expressions package works, it’s time to write
Grep2, a full-blown version of the line-matching program with option parsing.
The table below lists some typical command-line options that a Unix implementation
of grep might include.

Table. Grep command-line options

Option       Meaning
-c           Count only: don’t print lines, just count them.
-C           Context: print some lines above and below each line that matches
             (not implemented in this version; left as an exercise for the reader).
-f pattern   Take pattern from file named after -f instead of from command line.
-h           Suppress printing filename ahead of lines.
-i           Ignore case.
-l           List filenames only: don’t print lines, just the names they’re found in.
-n           Print line numbers before matching lines.
-s           Suppress printing certain error messages.
-v           Invert: print only lines that do NOT match the pattern.

We have already discussed the GetOpt class. Here we use it to control the operation
of an application program. As usual, since main( ) runs in a static context but our
application main line does not, we could wind up passing a lot of information into
the constructor. Because we have so many options, and it would be inconvenient to
keep expanding the options list as we add new functionality to the program, we use a
kind of collection called a BitSet to pass all the true/false arguments: true to print
line numbers, false to print filenames, etc. A BitSet is much like a Vector but is
specialized to store only Boolean values and is ideal for handling command-line arguments.

The program basically just reads lines, matches the pattern in them, and, if a match
is found (or not found, with -v), prints the line (and optionally some other stuff,
too). Having said all that, the code is shown in the Grep2.java example.

Example. Grep2.java
import com.darwinsys.util.*;
import java.io.*;
import java.util.*;
import java.util.regex.*;

/** A command-line grep-like program. Accepts some options and takes a pattern
 * and a list of text files.
 */
public class Grep2 {
    /** The pattern we're looking for */
    protected Pattern pattern;
    /** The matcher for that pattern, reset for each input line */
    protected Matcher matcher;
    /** The Reader for the current file */
    protected BufferedReader d;
    /** Are we to only count lines, instead of printing? */
    protected boolean countOnly = false;
    /** Are we to ignore case? */
    protected boolean ignoreCase = false;
    /** Are we to suppress printing of filenames? */
    protected boolean dontPrintFileName = false;
    /** Are we to only list names of files that match? */
    protected boolean listOnly = false;
    /** Are we to print line numbers? */
    protected boolean numbered = false;
    /** Are we to be silent about errors? */
    protected boolean silent = false;
    /** Are we to print only lines that DON'T match? */
    protected boolean inVert = false;

    /** Construct a Grep object for each pattern, and run it
     * on all input files listed in argv.
     */
    public static void main(String[] argv) {
        if (argv.length < 1) {
            System.err.println("Usage: Grep2 pattern [filename...]");
            System.exit(1);
        }
        String pattern = null;
        GetOpt go = new GetOpt("cf:hilnsv");
        BitSet args = new BitSet( );
        char c;
        while ((c = go.getopt(argv)) != 0) {
            // Each flag is recorded under its uppercase letter, which is
            // what the constructor below looks up.
            switch(c) {
            case 'c': args.set('C'); break;
            case 'f':
                try {
                    BufferedReader b = new BufferedReader(
                        new FileReader(go.optarg( )));
                    pattern = b.readLine( );
                    b.close( );
                } catch (IOException e) {
                    System.err.println("Can't read pattern file " +
                        go.optarg( ));
                    System.exit(1);
                }
                break;
            case 'h': args.set('H'); break;
            case 'i': args.set('I'); break;
            case 'l': args.set('L'); break;
            case 'n': args.set('N'); break;
            case 's': args.set('S'); break;
            case 'v': args.set('V'); break;
            }
        }
        int ix = go.getOptInd( );
        if (pattern == null)
            pattern = argv[ix-1];
        Grep2 pg = new Grep2(pattern, args);
        if (argv.length == ix) {
            pg.process(new InputStreamReader(System.in), "(standard input)");
        } else {
            for (int i=ix; i<argv.length; i++) {
                try {
                    pg.process(new FileReader(argv[i]), argv[i]);
                } catch(Exception e) {
                    System.err.println(e);
                }
            }
        }
    }

    /** Construct a Grep2 object. */
    public Grep2(String patt, BitSet args) {
        if (args.get('C'))
            countOnly = true;
        if (args.get('H'))
            dontPrintFileName = true;
        if (args.get('I'))
            ignoreCase = true;
        if (args.get('L'))
            listOnly = true;
        if (args.get('N'))
            numbered = true;
        if (args.get('S'))
            silent = true;
        if (args.get('V'))
            inVert = true;
        // compile the regular expression
        int caseMode = ignoreCase ? Pattern.UNICODE_CASE | Pattern.CASE_INSENSITIVE : 0;
        pattern = Pattern.compile(patt, caseMode);
        matcher = pattern.matcher("");
    }

    /** Do the work of scanning one file
     * @param ifile Reader object already open
     * @param fileName String Name of the input file
     */
    public void process(Reader ifile, String fileName) {
        String line;
        int matches = 0;
        try {
            d = new BufferedReader(ifile);
            while ((line = d.readLine( )) != null) {
                matcher.reset(line);
                if (matcher.find( )) {
                    if (countOnly)
                        matches++;
                    else {
                        if (!dontPrintFileName)
                            System.out.print(fileName + ": ");
                        System.out.println(line);
                    }
                } else if (inVert) {
                    System.out.println(line);
                }
            }
            if (countOnly)
                System.out.println(matches + " matches in " + fileName);
            d.close( );
        } catch (IOException e) { System.err.println(e); }
    }
}

Java Program: Data Mining


Suppose that I, as a published author, want to track how my book is selling in
comparison to others. This information can be obtained for free just by clicking on the
page for my book on any of the major bookseller sites, reading the sales rank
number off the screen, and typing the number into a file, but that’s too tedious. As I
wrote in the book that this example looks for, “computers get paid to extract
relevant information from files; people should not have to do such mundane tasks.” This
program uses the Regular Expressions API and, in particular, newline matching to
extract a value from an HTML page on the hypothetical QuickBookShops.web web
site. It also reads from a URL object. The pattern to look for is
something like this (bear in mind that the HTML may change at any time, so I want
to keep the pattern fairly general):

<b>QuickBookShop.web Sales Rank: </b>

As the pattern may extend over more than one line, I read the entire web page from the URL into a single long string using my FileIO.readerToString( ) method (see Recipe 10.8) instead of the more traditional line-at-a-time paradigm. I then plot a graph using an external program; this could (and should) be changed to use a Java graphics program. The complete program is shown in the BookRank.java example.
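Before the full program, here is a minimal sketch of matching across a line break once the page is held in a single string. The page fragment, the rank value 26,252, the regex, and the names RankExtract and extractRank are all assumptions made for this example. The real page layout may differ.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RankExtract {
    /** Returns the sales rank found in the page text, or -1 if none is found. */
    static int extractRank(String html) {
        // \s matches newlines too, so the number may sit on the next line.
        Pattern p = Pattern.compile("Sales Rank:\\s*</b>\\s*([\\d,]+)");
        Matcher m = p.matcher(html);
        if (!m.find()) {
            return -1;
        }
        // Strip the thousands separators before parsing.
        return Integer.parseInt(m.group(1).replace(",", ""));
    }

    public static void main(String[] args) {
        String page = "<b>QuickBookShop.web Sales Rank: </b>\n26,252\n";
        System.out.println(extractRank(page)); // 26252
    }
}
```

A line-at-a-time reader would see the label and the number as two separate inputs and never match.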

Example. BookRank.java
// Standard imports not shown
import com.darwinsys.io.FileIO;
import com.darwinsys.util.FileProperties;

/** Graph of a book's sales rank on a given bookshop site.
 * @author Ian F. Darwin, http://www.darwinsys.com/, Java Cookbook author,
 *      originally translated fairly literally from Perl into Java.
 * @author Patrick Killelea <p@patrick.net>: original Perl version,
 *      from the 2nd edition of his book "Web Performance Tuning".
 * @version $Id: ch04,v 1.4 2004/05/04 20:11:27 ian Exp $
 */
public class BookRank {
    public final static String DATA_FILE = "book.sales";
    public final static String GRAPH_FILE = "book.png";

    /** Grab the sales rank off the web page and log it. */
    public static void main(String[] args) throws Exception {
        // The first argument, if given, names the properties file.
        Properties p = new FileProperties(
            args.length == 0 ? "bookrank.properties" : args[0]);
        String title = p.getProperty("title", "NO TITLE IN PROPERTIES");
        // The url must have the "isbn=" at the very end, or otherwise
        // be amenable to being string-catted to, like the default.
        String url = p.getProperty("url", "http://test.ing/test.cgi?isbn=");
        // The 10-digit ISBN for the book.
        String isbn = p.getProperty("isbn", "0000000000");
        // The regex pattern (MUST have ONE capture group for the number)
        String pattern = p.getProperty("pattern", "Rank: (\\d+)");
        // Looking for something like this in the input:
        //     <b>QuickBookShop.web Sales Rank: </b>
        Pattern r = Pattern.compile(pattern);

        // Open the URL and get a Reader from it.
        BufferedReader is = new BufferedReader(new InputStreamReader(
            new URL(url + isbn).openStream( )));
        // Read the URL looking for the rank information, as
        // a single long string, so can match regex across multi-lines.
        String input = FileIO.readerToString(is);
        // System.out.println(input);

        // If found, append to sales data file.
        Matcher m = r.matcher(input);
        if (m.find( )) {
            PrintWriter pw = new PrintWriter(
                new FileWriter(DATA_FILE, true));
            // Same layout as `date +'%m %d %H %M %S %Y'`; HH is the 24-hour
            // clock, matching the %H in the gnuplot timefmt below.
            String date =
                new SimpleDateFormat("MM dd HH mm ss yyyy ").
                format(new Date( ));
            // Paren 1 is the digits (and maybe ','s) that matched; remove comma
            Matcher noComma = Pattern.compile(",").matcher(m.group(1));
            pw.println(date + noComma.replaceAll(""));
            pw.close( );
        } else {
            System.err.println("WARNING: pattern `" + pattern +
                "' did not match in `" + url + isbn + "'!");
        }

        // Whether current data found or not, draw the graph, using
        // external plotting program against all historical data.
        // Could use gnuplot, R, any other math/graph program.
        // Better yet: use one of the Java plotting APIs.
        String gnuplot_cmd =
            "set term png\n" +
            "set output \"" + GRAPH_FILE + "\"\n" +
            "set xdata time\n" +
            "set ylabel \"Book sales rank\"\n" +
            "set bmargin 3\n" +
            "set logscale y\n" +
            "set yrange [1:60000] reverse\n" +
            "set timefmt \"%m %d %H %M %S %Y\"\n" +
            "plot \"" + DATA_FILE +
            "\" using 1:7 title \"" + title + "\" with lines\n";
        Process proc = Runtime.getRuntime( ).exec("/usr/local/bin/gnuplot");
        PrintWriter gp = new PrintWriter(proc.getOutputStream( ));
        gp.print(gnuplot_cmd);    // send the commands to gnuplot's standard input
        gp.close( );
    }
}

Java Program: Apache Logfile Parsing


The Apache web server is the world’s leading web server and has been for most of
the web’s history. It is one of the world’s best-known open source projects, and one
of many fostered by the Apache Foundation. But the name Apache is a pun on the
origins of the server; its developers began with the free NCSA server and kept
hacking at it or “patching” it until it did what they wanted. When it was sufficiently
different from the original, a new name was needed. Since it was now “a patchy server,”
the name Apache was chosen. One place this patchiness shows through is in the log
file format. Consider this entry:

- - [27/Oct/2000:09:27:09 -0400] "GET /java/javaResources.html HTTP/1.0"
200 10450 "-" "Mozilla/4.6 [en] (X11; U; OpenBSD 2.8 i386; Nav)"

The file format was obviously designed for human inspection but not for easy parsing. The problem is that different delimiters are used: square brackets for the date, quotes for the request line, and spaces sprinkled all through. Consider trying to use a StringTokenizer; you might be able to get it working, but you’d spend a lot of time fiddling with it. However, this somewhat contorted regular expression makes it easy to parse:

^([\d.]+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(.+?)" (\d{3}) (\d+) "([^"]+)" "([^"]+)"

You may find it informative to refer back to Table 4-2 and review the full syntax used here. Note in particular the use of the non-greedy quantifier +? in "(.+?)" to match a quoted string; you can’t just use .+ since that would match too much (up to the quote at the end of the line). Code to extract the various fields such as IP address, request, referer URL, and browser version is shown in the LogRegExp example that follows.
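The difference between the two quantifiers can be seen in isolation with a short sketch. The sample line is a shortened, made-up log fragment, and the class name GreedyDemo and its helper firstQuoted are invented for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyDemo {
    /** Returns capture group 1 of the first match, or null if there is none. */
    static String firstQuoted(String line, String regex) {
        Matcher m = Pattern.compile(regex).matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String line = "\"GET /index.html HTTP/1.0\" 200 10450 \"-\" \"Mozilla/4.6\"";

        // Non-greedy: stops at the first closing quote.
        System.out.println(firstQuoted(line, "\"(.+?)\""));
        // prints: GET /index.html HTTP/1.0

        // Greedy: runs on to the last quote on the line.
        System.out.println(firstQuoted(line, "\"(.+)\""));
        // prints: GET /index.html HTTP/1.0" 200 10450 "-" "Mozilla/4.6
    }
}
```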

import java.util.regex.*;

/**
 * Parse an Apache log file with Regular Expressions
 */
public class LogRegExp implements LogExample {
    public static void main(String argv[]) {
        String logEntryPattern =
            "^([\\d.]+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(.+?)\" (\\d{3}) " +
            "(\\d+) \"([^\"]+)\" \"([^\"]+)\"";
        // logEntryLine and NUM_FIELDS are not declared in this class; they are
        // taken to come from the LogExample interface.
        System.out.println("Using regex Pattern:");
        System.out.println(logEntryPattern);
        System.out.println("Input line is:");
        System.out.println(logEntryLine);
        Pattern p = Pattern.compile(logEntryPattern);
        Matcher matcher = p.matcher(logEntryLine);
        if (!matcher.matches( ) ||
            NUM_FIELDS != matcher.groupCount( )) {
            System.err.println("Bad log entry (or problem with regex?):");
            System.err.println(logEntryLine);
            return;
        }
        System.out.println("IP Address: " + matcher.group(1));
        System.out.println("Date&Time: " + matcher.group(4));
        System.out.println("Request: " + matcher.group(5));
        System.out.println("Response: " + matcher.group(6));
        System.out.println("Bytes Sent: " + matcher.group(7));
        if (!matcher.group(8).equals("-"))
            System.out.println("Referer: " + matcher.group(8));
        System.out.println("Browser: " + matcher.group(9));
    }
}

The implements clause is for an interface that just defines the input string; it was used in a demonstration to compare the regular expression mode with the use of a StringTokenizer. The source for both versions is in the online source for this chapter. Running the program against the sample input shown above gives this output:

Using regex Pattern:
^([\d.]+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(.+?)" (\d{3}) (\d+) "([^"]+)" "([^"]+)"
Input line is: - - [27/Oct/2000:09:27:09 -0400] "GET /java/javaResources.html HTTP/1.0"
200 10450 "-" "Mozilla/4.6 [en] (X11; U; OpenBSD 2.8 i386; Nav)"
IP Address:
Date&Time: 27/Oct/2000:09:27:09 -0400
Request: GET /java/javaResources.html HTTP/1.0
Response: 200
Bytes Sent: 10450
Browser: Mozilla/4.6 [en] (X11; U; OpenBSD 2.8 i386; Nav)

The program successfully parsed the entire log file format with one call to matcher.matches( ).
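The listing above depends on the LogExample interface for its input line and field count. As a rough, self-contained sketch of the same idea, the class below applies the identical nine-group pattern to one sample entry. The class name LogLineSketch, the parse( ) helper, and the sample line are all made up for illustration; in particular the address 192.0.2.1 is a reserved documentation address, not the one from the original input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogLineSketch {
    // Same nine-group pattern as in the listing above, compiled once.
    public static final Pattern LOG_PATTERN = Pattern.compile(
        "^([\\d.]+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(.+?)\" (\\d{3}) "
        + "(\\d+) \"([^\"]+)\" \"([^\"]+)\"");

    // Hypothetical sample entry; 192.0.2.1 is a placeholder address.
    public static final String SAMPLE =
        "192.0.2.1 - - [27/Oct/2000:09:27:09 -0400] "
        + "\"GET /java/javaResources.html HTTP/1.0\" 200 10450 \"-\" "
        + "\"Mozilla/4.6 [en] (X11; U; OpenBSD 2.8 i386; Nav)\"";

    /** Returns the captured fields in order, or null if the line does not match. */
    public static String[] parse(String line) {
        Matcher m = LOG_PATTERN.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] fields = new String[m.groupCount()];
        for (int i = 1; i <= m.groupCount(); i++) {
            fields[i - 1] = m.group(i);   // group 0 is the whole match, so skip it
        }
        return fields;
    }

    public static void main(String[] args) {
        String[] f = parse(SAMPLE);
        System.out.println("IP Address: " + f[0]);
        System.out.println("Date&Time: " + f[3]);
        System.out.println("Request: " + f[4]);
        System.out.println("Response: " + f[5]);
        System.out.println("Garbage line parsed? " + (parse("hello") != null));
    }
}
```

Returning null for a non-matching line is just one possible design choice; throwing an exception, as some later examples do, works equally well.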

Java Matching “Accented” or Composite Characters



You want characters to match regardless of the form in which they are entered.


Compile the Pattern with the flags argument Pattern.CANON_EQ for “canonical equivalence.”


Composite characters can be entered in various forms. Consider, as a single example, the letter e with an acute accent. This character may be found in various forms in Unicode text, such as the single character é (Unicode character \u00e9) or as the two-character sequence e\u0301 (e followed by the Unicode combining acute accent, \u0301).

To allow you to match such characters regardless of which of possibly multiple “fully decomposed” forms are used to enter them, the regex package has an option for “canonical matching,” which treats any of the forms as equivalent. This option is enabled by passing CANON_EQ as (one of) the flags in the second argument to Pattern.compile( ). This program shows CANON_EQ being used to match several forms:

import java.util.regex.*;

/**
 * CanonEqDemo - show use of Pattern.CANON_EQ, by comparing various ways of
 * entering the French word for "equal" and see if they are considered equal
 * by the regex-matching engine.
 */
public class CanonEqDemo {
    public static void main(String[] args) {
        String pattStr = "\u00e9gal"; // égal
        String[] input = {
            "\u00e9gal", // égal - this one had better match :-)
            "e\u0301gal", // e + "Combining acute accent"
            "e\u02cagal", // e + "modifier letter acute accent"
            "e'gal", // e + single quote
            "e\u00b4gal", // e + Latin-1 "acute"
        };
        Pattern pattern = Pattern.compile(pattStr, Pattern.CANON_EQ);
        for (int i = 0; i < input.length; i++) {
            if (pattern.matcher(input[i]).matches( )) {
                System.out.println(pattStr + " matches input " + input[i]);
            } else {
                System.out.println(pattStr + " does not match input " + input[i]);
            }
        }
    }
}

When you run this program on JDK 1.4 or later, it correctly matches the “combining accent” and rejects the other characters, some of which, unfortunately, look like the accent on a printer, but are not considered “combining accent” characters.

égal matches input égal
égal matches input e?gal
égal does not match input e?gal
égal does not match input e'gal
égal does not match input e´gal

Java Controlling Case in Regular Expressions



You want to find text regardless of case.


Compile the Pattern passing in the flags argument Pattern.CASE_INSENSITIVE to indicate that matching should be case-independent (“fold” or ignore differences in case). If your code might run in different locales (see Chapter 15), add Pattern.UNICODE_CASE. Without these flags, the default is normal, case-sensitive matching behavior. These flags (and others) are passed to the Pattern.compile( ) method, as in:

// CaseMatch.java
Pattern reCaseInsens = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE |
    Pattern.UNICODE_CASE);
reCaseInsens.matcher(input).matches( ); // will match case-insensitively

This flag must be passed when you create the Pattern; as Pattern objects are immutable, they cannot be changed once constructed. The full source code for this example is online as CaseMatch.java.
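To see what each flag contributes, here is a minimal sketch (not the CaseMatch.java program itself). The class name CaseFlagSketch and its matches( ) helper are invented for this illustration. It compares the default behavior, CASE_INSENSITIVE alone, and CASE_INSENSITIVE combined with UNICODE_CASE on an accented word.

```java
import java.util.regex.Pattern;

public class CaseFlagSketch {
    /** True if the whole input matches the regex under the given flags. */
    public static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        int ascii = Pattern.CASE_INSENSITIVE;
        int unicode = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;

        // Default: case-sensitive, so this does not match.
        System.out.println(matches("darwin", 0, "DARWIN"));
        // CASE_INSENSITIVE folds the ASCII letters.
        System.out.println(matches("darwin", ascii, "DARWIN"));
        // An accented letter is outside ASCII, so the plain flag is not enough...
        System.out.println(matches("\u00e9gal", ascii, "\u00c9GAL"));
        // ...but adding UNICODE_CASE folds it too.
        System.out.println(matches("\u00e9gal", unicode, "\u00c9GAL"));
    }
}
```

On a standard JDK this should print false, true, false, true, which is why UNICODE_CASE is worth adding whenever the text may contain non-ASCII letters.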

Pattern.compile( ) Flags

The following flags can be passed as the second argument to Pattern.compile( ). If more than one value is needed, they can be or’d together using the | bitwise or operator. In alphabetical order, the flags are:

CANON_EQ
Enables so-called “canonical equivalence,” that is, characters are matched by their base character, so that the character e followed by the “combining character mark” for the acute accent (\u0301) can be matched either by the composite character é or the letter e followed by the character mark for the accent (see Recipe 4.8).

CASE_INSENSITIVE
Turns on case-insensitive matching (see Recipe 4.7).

COMMENTS
Causes whitespace and comments (from # to end-of-line) to be ignored in the pattern.

DOTALL
Allows dot (.) to match any character, including the newline, not just non-newline characters (see Recipe 4.9).

MULTILINE
Specifies multiline mode (see Recipe 4.9).

UNICODE_CASE
Enables Unicode-aware case folding (see Recipe 4.7).

UNIX_LINES
Makes \n the only valid “newline” sequence for MULTILINE mode (see Recipe 4.9).
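The effect of or’ing flags together is easiest to see side by side. The following is only a sketch; the class name FlagsSketch, the count( ) helper, and the sample text are assumptions made for the illustration. It counts matches with and without MULTILINE, CASE_INSENSITIVE, DOTALL, and COMMENTS.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FlagsSketch {
    /** Counts how many times the regex is found in the input under the given flags. */
    public static int count(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String text = "Tiny trumpets\ntitanic tubas\nTriumphant trombones";

        // No flags: ^ anchors only at the start of the input, and t is lowercase only.
        System.out.println(count("^t\\w+", 0, text));
        // Two flags or'd together: ^ anchors at every line start, and case is ignored.
        System.out.println(count("^t\\w+", Pattern.MULTILINE | Pattern.CASE_INSENSITIVE, text));

        // Without DOTALL the dot stops at the newline; with it, the dot matches it.
        System.out.println(count("tubas.Triumphant", 0, text));
        System.out.println(count("tubas.Triumphant", Pattern.DOTALL, text));

        // COMMENTS lets the pattern carry whitespace and # comments.
        String phone = "\\d{3}  # exchange\n"
                     + "-       # separator\n"
                     + "\\d{4}  # number";
        System.out.println(count(phone, Pattern.COMMENTS, "call 555-1212 now"));
    }
}
```

The five counts printed should be 0, 3, 0, 1, and 1.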

Java Printing Lines Containing a Pattern



You need to look for lines matching a given regex in one or more files.


Write a simple grep-like program.


As I’ve mentioned, once you have a regex package, you can write a grep-like program. I gave an example of the Unix grep program earlier. grep is called with some optional arguments, followed by one required regular expression pattern, followed by an arbitrary number of filenames. It prints any line that contains the pattern, differing from Recipe 4.5, which prints only the matching text itself. For example:

grep "[dD]arwin" *.txt

searches for lines containing either darwin or Darwin in every file whose name ends in .txt.* Example 4-5 is the source for the first version of a program to do this, called Grep0. It reads lines from the standard input and doesn’t take any optional arguments, but it handles the full set of regular expressions that the Pattern class implements (it is, therefore, not identical with the Unix programs of the same name).

We haven’t covered the java.io package for input and output yet (see Chapter 10), but our use of it here is simple enough that you can probably intuit it. The online source includes Grep1 , which does the same thing but is better structured (and therefore longer). Later in this chapter, Recipe 4.12 presents a Grep2 program that uses my GetOpt (see Recipe 2.6) to parse command-line options.

* On Unix, the shell or command-line interpreter expands *.txt to match all the filenames, but the normal Java
interpreter does this for you on systems where the shell isn’t energetic or bright enough to do it.

Example 4-5. Grep0.java
import java.io.*;
import java.util.regex.*;

/** Grep0 - Match lines from stdin against the pattern on the command line. */
public class Grep0 {
    public static void main(String[] args) throws IOException {
        BufferedReader is =
            new BufferedReader(new InputStreamReader(System.in));
        if (args.length != 1) {
            System.err.println("Usage: Grep0 pattern");
            System.exit(1);
        }
        Pattern patt = Pattern.compile(args[0]);
        Matcher matcher = patt.matcher("");
        String line = null;
        while ((line = is.readLine( )) != null) {
            // Point the reusable Matcher at the current line before searching it
            matcher.reset(line);
            if (matcher.find( )) {
                System.out.println("MATCH: " + line);
            }
        }
    }
}

Java Printing All Occurrences of a Pattern



You need to find all the strings that match a given regex in one or more files or other sources.


This example reads through a file one line at a time. Whenever a match is found, I extract it from the line and print it.

This code takes the group( ) methods from Recipe 4.3, the substring( ) method from String, and the find( ) method from Matcher, and simply puts them all together. I coded it to extract all the “names” from a given file; in running the program through itself, it prints the words “import”, “java”, “util”, “regex”, and so on:

> jikes +E -d . ReaderIter.java
> java ReaderIter ReaderIter.java

I interrupted it here to save paper. This can be written two ways: a traditional “line at a time” pattern shown in Example 4-3, and a more compact form using “new I/O” shown in Example 4-4.

Example 4-3. ReaderIter.java
import java.util.regex.*;
import java.io.*;

/**
 * Print all the strings that match a given pattern from a file.
 */
public class ReaderIter {
    public static void main(String[] args) throws IOException {
        // The regex pattern
        Pattern patt = Pattern.compile("[A-Za-z][a-z]+");
        // A FileReader (see the I/O chapter)
        BufferedReader r = new BufferedReader(new FileReader(args[0]));
        // For each line of input, try matching in it.
        String line;
        while ((line = r.readLine( )) != null) {
            // For each match in the line, extract and print it.
            Matcher m = patt.matcher(line);
            while (m.find( )) {
                // Simplest method:
                // System.out.println(m.group(0));
                // Get the starting position of the text
                int start = m.start(0);
                // Get ending position
                int end = m.end(0);
                // Print whatever matched.
                System.out.println("start=" + start + "; end=" + end);
                // Use String.substring(start, end)
                System.out.println(line.substring(start, end));
            }
        }
        r.close( );
    }
}

Example 4-4. GrepNIO.java
import java.io.*;
import java.nio.*;
import java.nio.channels.*;
import java.nio.charset.*;
import java.util.regex.*;

/* Grep-like program using NIO, but NOT LINE BASED.
 * Pattern and file name(s) must be on command line.
 */
public class GrepNIO {
    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("Usage: GrepNIO patt file [...]");
            System.exit(1);
        }
        Pattern p = Pattern.compile(args[0]);
        for (int i = 1; i < args.length; i++)
            process(p, args[i]);
    }

    static void process(Pattern pattern, String fileName) throws IOException {
        // Get a FileChannel from the given file.
        FileChannel fc = new FileInputStream(fileName).getChannel( );
        // Map the file's content
        ByteBuffer buf = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size( ));
        // Decode ByteBuffer into CharBuffer
        CharBuffer cbuf =
            Charset.forName("ISO-8859-1").newDecoder( ).decode(buf);
        Matcher m = pattern.matcher(cbuf);
        while (m.find( )) {
            // Print each piece of text that matched
            System.out.println(m.group(0));
        }
        fc.close( );
    }
}

The NIO version shown in Example 4-4 relies on the fact that an NIO CharBuffer can be used as a CharSequence. This program is more general in that the pattern argument is taken from the command-line argument. It prints the same output as the previous example if invoked with the pattern argument from the previous program on the command line:

java GrepNIO "[A-Za-z][a-z]+" ReaderIter.java

You might think of using \w+ as the pattern; the only difference is that my pattern looks for well-formed capitalized words while \w+ would include Java-centric oddities like theVariableName, which have capitals in nonstandard positions. Also note that the NIO version will probably be more efficient since it doesn’t reset the Matcher to a new input source on each line of input as ReaderIter does.
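The difference between the two patterns shows up clearly on a single line of Java source. This is just a sketch; the class name WordPatternSketch, its findAll( ) helper, and the sample line are assumptions made for the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordPatternSketch {
    /** Collects every piece of text in the input that the regex finds. */
    public static List<String> findAll(String regex, CharSequence input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String code = "int theVariableName = 42;";
        // Splits the identifier at each capital letter and skips the number.
        System.out.println(findAll("[A-Za-z][a-z]+", code));
        // Keeps the identifier whole and also picks up the number.
        System.out.println(findAll("\\w+", code));
    }
}
```

The first call should print [int, the, Variable, Name] and the second [int, theVariableName, 42].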

Java Replacing the Matched Text


As we saw in the previous recipe, regex patterns involving multipliers can match a lot of characters with very few metacharacters. We need a way to replace the text that the regex matched without changing other text before or after it. We could do this manually using the String method substring( ). However, because it’s such a common requirement, the JDK 1.4 Regular Expression API provides some substitution methods. In all these methods, you pass in the replacement text or “right-hand side” of the substitution (this term is historical: in a command-line text editor’s substitute command, the left-hand side is the pattern and the right-hand side is the replacement text). The replacement methods are:

replaceAll(newString)
Replaces all occurrences that matched with the new string.

appendReplacement(StringBuffer, newString)
Copies up to before the first match, plus the given newString.

appendTail(StringBuffer)
Appends text after the last match (normally used after appendReplacement).

Example 4-2 shows use of these three methods.

// class ReplaceDemo
// Quick demo of substitution: correct "demon" and other
// spelling variants to the correct, non-satanic "daemon".
// Make a regex pattern to match almost any form (deamon, demon, etc.).
String patt = "d[ae]{1,2}mon"; // i.e., 1 or 2 'a' or 'e' any combo
// A test string.
String input = "Unix hath demons and deamons in it!";
System.out.println("Input: " + input);
// Run it from a regex instance and see that it works
Pattern r = Pattern.compile(patt);
Matcher m = r.matcher(input);
System.out.println("ReplaceAll: " + m.replaceAll("daemon"));
// Show the appendReplacement method
m.reset( );
StringBuffer sb = new StringBuffer( );
System.out.print("Append methods: ");
while (m.find( )) {
    // Copy to before first match,
    // plus the word "daemon"
    m.appendReplacement(sb, "daemon");
}
m.appendTail(sb); // copy remainder
System.out.println(sb.toString( ));

Sure enough, when you run it, it does what we expect:

Input: Unix hath demons and deamons in it!
ReplaceAll: Unix hath daemons and daemons in it!
Append methods: Unix hath daemons and daemons in it!

Java Finding the Matching Text



You need to find the text that the regex matched.


Sometimes you need to know more than just whether a regex matched a string. In
editors and many other tools, you want to know exactly what characters were
matched. Remember that with multipliers such as * , the length of the text that was
matched may have no relationship to the length of the pattern that matched it. Do
not underestimate the mighty .* , which happily matches thousands or millions of
characters if allowed to. As you saw in the previous recipe, you can find out whether
a given match succeeds just by using find( ) or matches( ) . But in other applications,
you will want to get the characters that the pattern matched.

After a successful call to one of the above methods, you can use these “information”
methods to get information on the match:

start(), end( )
Returns the character position in the string of the starting and ending characters
that matched.

groupCount( )
Returns the number of parenthesized capture groups if any; returns 0 if no
groups were used.

group(int i)
Returns the characters matched by group i of the current match, if i is less than
or equal to the return value of groupCount( ) . Group 0 is the entire match, so
group(0) (or just group( ) ) returns the entire portion of the string that matched.

The notion of parentheses or “capture groups” is central to regex processing. Regexes
may be nested to any level of complexity. The group(int) method lets you retrieve
the characters that matched a given parenthesis group. If you haven’t used any
explicit parens, you can just treat whatever matched as “level zero.” For example:

// Part of REmatch.java
String patt = "Q[^u]\\d+\\.";
Pattern r = Pattern.compile(patt);
String line = "Order QT300. Now!";
Matcher m = r.matcher(line);
if (m.find( )) {
    System.out.println(patt + " matches \"" +
        m.group(0) +
        "\" in \"" + line + "\"");
} else {
    System.out.println("NO MATCH");
}

When run, this prints:

Q[^u]\d+\. matches "QT300." in "Order QT300. Now!"

It is also possible to get the starting and ending indexes and the length of the text that the pattern matched (remember that terms with multipliers, such as the \d+ in this example, can match an arbitrary number of characters in the string). You can use these in conjunction with the String.substring( ) methods as follows:

// Part of regexsubstr.java -- Prints exactly the same as REmatch.java
Pattern r = Pattern.compile(patt);
String line = "Order QT300. Now!";
Matcher m = r.matcher(line);
if (m.find( )) {
    System.out.println(patt + " matches \"" +
        line.substring(m.start(0), m.end(0)) +
        "\" in \"" + line + "\"");
} else {
    System.out.println("NO MATCH");
}
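The information methods are easier to follow once nested groups are involved. The sketch below is not part of REmatch.java; the class name GroupInfoSketch, the describe( ) helper, and the extra parentheses added to the pattern are all assumptions for the illustration. Groups are numbered by the position of their opening parenthesis, counting from the left.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupInfoSketch {
    /** Reports the span of the first match and the text of every group. */
    public static String describe(String regex, String line) {
        Matcher m = Pattern.compile(regex).matcher(line);
        if (!m.find()) {
            return "NO MATCH";
        }
        StringBuilder sb = new StringBuilder();
        sb.append("span=").append(m.start()).append("..").append(m.end());
        // Group 0 is the whole match; 1..groupCount() are the parenthesized groups.
        for (int i = 0; i <= m.groupCount(); i++) {
            sb.append(" group").append(i).append('=').append(m.group(i));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Same pattern as above, with nested parentheses added:
        // group 1 is the code, group 2 its letter, group 3 its digits.
        System.out.println(describe("(Q([^u])(\\d+))\\.", "Order QT300. Now!"));
        System.out.println(describe("(Q([^u])(\\d+))\\.", "Order Qu300. Now!"));
    }
}
```

The first call should report span=6..12 with group0=QT300., group1=QT300, group2=T, and group3=300; the second prints NO MATCH.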

Suppose you need to extract several items from a string. If the input is:

Smith, John
Adams, John Quincy

and you want to get out:

John Smith
John Quincy Adams

just use:

// from REmatchTwoFields.java
// Construct a regex with parens to "grab" both field1 and field2
Pattern r = Pattern.compile("(.*), (.*)");
Matcher m = r.matcher(inputLine);
if (!m.matches( ))
throw new IllegalArgumentException("Bad input: " + inputLine);
System.out.println(m.group(2) + ' ' + m.group(1));
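For completeness, here is a runnable sketch built around that fragment. The class name NameSwapSketch and both helper methods are hypothetical. The second helper shows an alternative that I am adding here, not something taken from REmatchTwoFields.java: String.replaceAll( ) accepts $1 and $2 in the replacement text to refer back to the capture groups.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NameSwapSketch {
    static final Pattern NAME = Pattern.compile("(.*), (.*)");

    /** Turns "Last, First" into "First Last" using the group( ) method. */
    public static String swap(String inputLine) {
        Matcher m = NAME.matcher(inputLine);
        if (!m.matches()) {
            throw new IllegalArgumentException("Bad input: " + inputLine);
        }
        return m.group(2) + ' ' + m.group(1);
    }

    /** Same result using group references in the replacement text. */
    public static String swapWithReplace(String inputLine) {
        return inputLine.replaceAll("(.*), (.*)", "$2 $1");
    }

    public static void main(String[] args) {
        String[] names = { "Smith, John", "Adams, John Quincy" };
        for (String name : names) {
            System.out.println(swap(name) + " / " + swapWithReplace(name));
        }
    }
}
```

Note one difference in behavior: swap( ) rejects a line that has no comma, while swapWithReplace( ) simply returns such a line unchanged.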

Java Regular Expression Syntax



You need to learn the syntax of JDK 1.4 regular expressions.


Consult Table 4-2 for a list of the regular expression characters.


These pattern characters let you specify regexes of considerable power. In building patterns, you can use any combination of ordinary text and the metacharacters, or special characters, in Table 4-2. These can all be used in any combination that makes sense. For example, a+ means any number of occurrences of the letter a, from one up to a million or a gazillion. The pattern Mrs?\. matches Mr. or Mrs. (with the period). And .* means “any character, any number of times,” and is similar in meaning to most command-line interpreters’ meaning of the * alone. The pattern \d+ means any number of numeric digits. \d{2,3} means a two- or three-digit number.
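These small patterns can be tried out directly with the static Pattern.matches( ) method, which requires the whole string to match. The class name SyntaxSketch and the sample strings below are made up for this illustration.

```java
import java.util.regex.Pattern;

public class SyntaxSketch {
    /** True if the entire input matches the regex. */
    public static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // a+ needs at least one a
        System.out.println(matches("a+", "aaa") + " " + matches("a+", ""));
        // s? makes the s optional; the backslash makes the dot literal
        System.out.println(matches("Mrs?\\.", "Mr.") + " " + matches("Mrs?\\.", "Mrs.")
            + " " + matches("Mrs?\\.", "Ms."));
        // {2,3} allows two or three digits, no fewer and no more
        System.out.println(matches("\\d{2,3}", "42") + " " + matches("\\d{2,3}", "4")
            + " " + matches("\\d{2,3}", "4242"));
        // .* accepts anything on one line, including nothing at all
        System.out.println(matches(".*", "anything at all") + " " + matches(".*", ""));
    }
}
```

Notice that each backslash in the regex is doubled inside the Java string literal.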

Regexes match anyplace possible in the string. Patterns followed by a greedy multiplier (the only type that existed in traditional Unix regexes) consume (match) as much as possible without compromising any subexpressions which follow; patterns followed by a possessive multiplier match as much as possible without regard to following subexpressions; patterns followed by a reluctant multiplier consume as few characters as possible to still get a match.
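The three kinds of multiplier are written *, *+, and *? (and likewise for + and ?). A quick way to see the difference is to run all three against the same input. This is a minimal sketch; the class name QuantifierSketch, the firstMatch( ) helper, and the sample string are assumptions made for the illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuantifierSketch {
    /** Returns the text of the first match found, or null if there is none. */
    public static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "xfooxxxxxxfoo";
        // Greedy: grabs everything, then backs off just enough for the final foo.
        System.out.println(firstMatch(".*foo", input));
        // Reluctant: takes as little as possible, so it stops at the first foo.
        System.out.println(firstMatch(".*?foo", input));
        // Possessive: grabs everything and never gives any back, so foo cannot match.
        System.out.println(firstMatch(".*+foo", input));
    }
}
```

This should print xfooxxxxxxfoo, then xfoo, then null.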

Also, unlike regex packages in some other languages, the JDK 1.4 package was
designed to handle Unicode characters from the beginning. And the standard Java
escape sequence \unnnn is used to specify a Unicode character in the pattern. We use
methods of java.lang.Character to determine Unicode character properties, such as
whether a given character is a space.

To help you learn how regexes work, I provide a little program called REDemo. In the online directory javasrc/RE, you should be able to type either ant REDemo, or javac REDemo.java followed by java REDemo, to get the program running.

In the uppermost text box, type the regex pattern you want to test.
Note that as you type each character, the regex is checked for syntax; if the syntax is
OK, you see a checkmark beside it. You can then select Match, Find, or Find All.
Match means that the entire string must match the regex, while Find means the regex
must be found somewhere in the string (Find All counts the number of occurrences that are found). Below that, you type a string that the regex is to match against.

Experiment to your heart’s content. When you have the regex the way you want it, you can paste it into your Java program. You’ll need to escape (backslash) any characters that are treated specially by both the Java compiler and the JDK 1.4 regex package, such as the backslash itself, double quotes, and others (see the sidebar “Remember This!”).

Remember This!

Remember that because a regex compiles strings that are also compiled by javac, you
usually need two levels of escaping for any special characters, including backslash,
double quotes, and so on. For example, the regex:

"You said it\."

has to be typed like this to be a Java language String :

"\"You said it\\.\""

I can’t tell you how many times I’ve made the mistake of forgetting the extra backslash in \d+, \w+, and their kin!
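Printing the string is a handy way to check what the regex package will actually see once javac has stripped one level of escaping. The following sketch uses the sidebar’s own example; the class name EscapeSketch and the test strings are invented for the illustration.

```java
import java.util.regex.Pattern;

public class EscapeSketch {
    // What javac sees: a string literal with escaped quotes and an escaped backslash.
    public static final String REGEX = "\"You said it\\.\"";

    public static void main(String[] args) {
        // What the regex engine sees, after javac has removed one level of escaping.
        System.out.println(REGEX);
        // The escaped dot matches only a literal period.
        System.out.println(Pattern.matches(REGEX, "\"You said it.\""));
        System.out.println(Pattern.matches(REGEX, "\"You said it!\""));
        // The same doubling applies to shorthands such as the digit class.
        System.out.println(Pattern.matches("\\d+", "12345"));
    }
}
```

The first line printed is the 15-character regex with its quotes and a single backslash; the remaining lines should be true, false, and true.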

I typed qu into the REDemo program’s Pattern box, which is a syntactically valid regex pattern: any ordinary characters stand as regexes for themselves, so this looks for the letter q followed by u. In the top version, I typed only a q into the string, which is not matched. In the second, I have typed quack and the q of a second quack. Since I have selected Find All, the count shows one match. As soon as I type the second u, the count is updated to two, as shown in the third version.

Regexes can do far more than just character matching. For example, the two-character regex ^T would match beginning of line (^) immediately followed by a capital T, i.e., any line beginning with a capital T. It doesn’t matter whether the line begins with Tiny trumpets, Titanic tubas, or Triumphant slide trombones, as long as the capital T is present in the first position.

But here we’re not very far ahead. Have we really invested all this effort in regex technology just to be able to do what we could already do with the java.lang.String method startsWith( )? Hmmm, I can hear some of you getting a bit restless. Stay in your seats! What if you wanted to match not only a letter T in the first position, but also a vowel (a, e, i, o, or u) immediately after it, followed by any number of letters in a word, followed by an exclamation point? Surely you could do this in Java by checking startsWith("T") and charAt(1) == 'a' || charAt(1) == 'e', and so on? Yes, but by the time you did that, you’d have written a lot of very highly specialized code that you couldn’t use in any other application. With regular expressions, you can just give the pattern ^T[aeiou]\w*! . That is, ^ and T as before, followed by a character class listing the vowels, followed by any number of word characters (\w*), followed by the exclamation point.
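Here is roughly what using that pattern from Java looks like. It is only a sketch; the class name VowelPatternSketch, the found( ) helper, and the sample lines are assumptions made for the illustration.

```java
import java.util.regex.Pattern;

public class VowelPatternSketch {
    // ^T, then one vowel, then any number of word characters, then !
    static final Pattern PATT = Pattern.compile("^T[aeiou]\\w*!");

    /** True if the pattern is found at the start of the line. */
    public static boolean found(String line) {
        return PATT.matcher(line).find();
    }

    public static void main(String[] args) {
        String[] lines = {
            "Tubas!",                // T, vowel, word characters, !
            "Titanic! What a ship",  // text after the ! does not matter to find()
            "Trombones!",            // r is not a vowel
            "Tiny trumpets!",        // word characters stop at the space, before the !
            "tubas!"                 // lowercase t
        };
        for (String line : lines) {
            System.out.println(found(line) + "  " + line);
        }
    }
}
```

Only the first two lines should be reported as found.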

“But wait, there’s more!” as my late, great boss Yuri Rubinsky used to say. What if you want to be able to change the pattern you’re looking for at runtime? Remember all that Java code you just wrote to match T in column 1, plus a vowel, some word characters, and an exclamation point? Well, it’s time to throw it out. Because this morning we need to match Q, followed by a letter other than u, followed by a number of digits, followed by a period. While some of you start writing a new function to do that, the rest of us will just saunter over to the RegEx Bar & Grille, order a ^Q[^u]\d+\. from the bartender, and be on our way.

OK, the [^u] means “match any one character that is not the character u.” The \d+ means one or more numeric digits. The + is a multiplier or quantifier meaning one or more occurrences of what it follows, and \d is any one numeric digit. So \d+ means a number with one, two, or more digits. Finally, the \. ? Well, . by itself is a metacharacter. Most single metacharacters are switched off by preceding them with an escape character. Not the ESC key on your keyboard, of course. The regex “escape” character is the backslash. Preceding a metacharacter like . with escape turns off its special meaning. Preceding a few selected alphabetic characters (e.g., n, r, t, s, w) with escape turns them into metacharacters.

Consider the ^Q[^u]\d+\. regex in action in REDemo. In the first frame, I have typed part of the regex as ^Q[^u and, since there is an unclosed square bracket, the Syntax OK flag is turned off; when I complete the regex, it will be turned back on. In the second frame, I have finished the regex and typed the string as QA577 (which you should expect to match the ^Q[^u]\d+ , but not the period since I haven’t typed it). In the third frame, I’ve typed the period so the Matches flag is set to Yes.
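The same three steps can be reproduced without the GUI. This sketch is not how REDemo itself is written; the class name OrderCodeSketch and both helpers are hypothetical. It relies on the fact that Pattern.compile( ) throws PatternSyntaxException for a malformed regex, which is one plausible way to drive a “Syntax OK” indicator.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class OrderCodeSketch {
    /** True if the regex compiles, i.e., its syntax is OK. */
    public static boolean isValidSyntax(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    /** True if the entire input matches the regex. */
    public static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String partial = "^Q[^u";            // unclosed square bracket
        String full = "^Q[^u]\\d+\\.";
        System.out.println(isValidSyntax(partial));   // first frame
        System.out.println(isValidSyntax(full));
        System.out.println(matches(full, "QA577"));   // second frame: no period yet
        System.out.println(matches(full, "QA577."));  // third frame: period typed
        System.out.println(matches(full, "Qu577."));  // u after Q is excluded
    }
}
```

This should print false, true, false, true, false.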

One good way to think of regular expressions is as a “little language” for matching patterns of characters in text contained in strings. Give yourself extra points if you’ve already recognized this as the design pattern known as Interpreter. A regular expression API is an interpreter for matching regular expressions.

So now you should have at least a basic grasp of how regexes work in practice. The rest of this chapter gives more examples and explains some of the more powerful topics, such as capture groups. As for how regexes work in theory (and there are a lot of theoretical details and differences among regex flavors), the interested reader is referred to the book Mastering Regular Expressions. Meanwhile, let’s start learning how to write Java programs that use regular expressions.