Java Program: Apache Logfile Parsing


The Apache web server has been among the world’s most widely used web servers for
most of the web’s history. It is one of the world’s best-known open source projects,
and one of many fostered by the Apache Software Foundation. The name Apache is a pun
on the origins of the server: its developers began with the free NCSA server and kept
hacking at it, or “patching” it, until it did what they wanted. When it was
sufficiently different from the original, a new name was needed. Since it was now
“a patchy server,” the name Apache was chosen. One place this patchiness shows
through is in the log file format. Consider this entry:

123.45.67.89 - - [27/Oct/2000:09:27:09 -0400] "GET /java/javaResources.html HTTP/1.0"
200 10450 "-" "Mozilla/4.6 [en] (X11; U; OpenBSD 2.8 i386; Nav)"

The file format was obviously designed for human inspection, not for easy parsing. The problem is that several different delimiters are used: square brackets for the date, quotes for the request line, and spaces sprinkled all through. Consider trying to use a StringTokenizer; you might be able to get it working, but you’d spend a lot of time fiddling with it. However, this somewhat contorted regular expression makes the entry easy to parse:

^([\d.]+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(.+?)" (\d{3}) (\d+) "([^"]+)"
"([^"]+)"

You may find it informative to refer back to Table 4-2 and review the full syntax used here. Note in particular the use of the non-greedy quantifier +? in "(.+?)" to match a quoted string; you can’t just use .+ since, on its own, that would match too much (everything up to the quote at the end of the line). Code to extract the various fields, such as IP address, request, referer URL, and browser version, is shown in the example below:

LogRegExp.java
import java.util.regex.*;
/**
 * Parse an Apache log file with Regular Expressions
 */
public class LogRegExp implements LogExample {
    public static void main(String argv[]) {
        // logEntryLine and NUM_FIELDS are constants inherited from LogExample.
        // The pattern is one string, split here with + only to fit the page.
        String logEntryPattern =
            "^([\\d.]+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(.+?)\" (\\d{3}) " +
            "(\\d+) \"([^\"]+)\" \"([^\"]+)\"";
        System.out.println("Using regex Pattern:");
        System.out.println(logEntryPattern);
        System.out.println("Input line is:");
        System.out.println(logEntryLine);
        Pattern p = Pattern.compile(logEntryPattern);
        Matcher matcher = p.matcher(logEntryLine);
        // matches( ) must succeed on the whole line and yield the expected group count
        if (!matcher.matches( ) ||
            NUM_FIELDS != matcher.groupCount( )) {
            System.err.println("Bad log entry (or problem with regex?):");
            System.err.println(logEntryLine);
            return;
        }
        // Groups 2 and 3 (the two "-" fields in the sample entry) are not printed
        System.out.println("IP Address: " + matcher.group(1));
        System.out.println("Date&Time: " + matcher.group(4));
        System.out.println("Request: " + matcher.group(5));
        System.out.println("Response: " + matcher.group(6));
        System.out.println("Bytes Sent: " + matcher.group(7));
        // A referer of "-" means none was logged, so skip it
        if (!matcher.group(8).equals("-"))
            System.out.println("Referer: " + matcher.group(8));
        System.out.println("Browser: " + matcher.group(9));
    }
}
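
LogRegExp refers to logEntryLine and NUM_FIELDS without declaring them; both come from the LogExample interface, which is not reproduced in this section. The following is only a minimal sketch of what such an interface might look like, assuming it holds nothing but those two constants. The value 9 is inferred from the nine capture groups in the pattern, and the input line is the sample entry shown earlier.

```java
// Hypothetical reconstruction: the real LogExample ships with the chapter's
// online source and may differ from this sketch.
interface LogExample {
    // Number of capture groups the pattern is expected to produce (assumed)
    int NUM_FIELDS = 9;

    // The sample log entry, as a single line
    String logEntryLine =
        "123.45.67.89 - - [27/Oct/2000:09:27:09 -0400] "
        + "\"GET /java/javaResources.html HTTP/1.0\" 200 10450 \"-\" "
        + "\"Mozilla/4.6 [en] (X11; U; OpenBSD 2.8 i386; Nav)\"";
}
```

Fields declared in an interface are implicitly public, static, and final, which is why the static main method of LogRegExp can use them directly.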

The implements clause is for an interface that just defines the input string; it was used in a demonstration to compare the regular expression approach with the use of a StringTokenizer. The source for both versions is in the online source for this chapter. Running the program against the sample input shown above gives this output:

Using regex Pattern:
^([\d.]+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(.+?)" (\d{3}) (\d+) "([^"]+)"
"([^"]+)"
Input line is:
123.45.67.89 - - [27/Oct/2000:09:27:09 -0400] "GET /java/javaResources.html HTTP/1.0"
200 10450 "-" "Mozilla/4.6 [en] (X11; U; OpenBSD 2.8 i386; Nav)"
IP Address: 123.45.67.89
Date&Time: 27/Oct/2000:09:27:09 -0400
Request: GET /java/javaResources.html HTTP/1.0
Response: 200
Bytes Sent: 10450
Browser: Mozilla/4.6 [en] (X11; U; OpenBSD 2.8 i386; Nav)

The program successfully parsed the entire log entry with one call to matcher.matches( ).
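
The earlier remark about the non-greedy quantifier can be checked in isolation. The following is a minimal sketch, not part of the chapter's source; the class name QuantifierDemo and the helper firstGroup are hypothetical names chosen for illustration. It applies "(.+)" and "(.+?)" on their own to the quoted tail of the sample entry and prints what each one captures.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original example's source.
class QuantifierDemo {

    /** Returns group 1 of the first match of regex in input, or null if none. */
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // The part of the sample log entry that contains the quoted fields
        String tail = "\"GET /java/javaResources.html HTTP/1.0\" 200 10450 \"-\" "
                + "\"Mozilla/4.6 [en] (X11; U; OpenBSD 2.8 i386; Nav)\"";

        // Greedy: .+ runs to the end of the input, then backs up to the LAST quote
        System.out.println("Greedy:    " + firstGroup("\"(.+)\"", tail));

        // Reluctant: .+? grows one character at a time and stops at the FIRST closing quote
        System.out.println("Reluctant: " + firstGroup("\"(.+?)\"", tail));
    }
}
```

The greedy version captures everything between the first and the last quote on the line, while the reluctant version captures only the request.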
