The Webalizer is a web logfile analysis program written by Bradford Barrett, and it's widely used to produce statistics about website traffic. It is highly customizable and very fast.

It includes facilities for reporting IP-to-name DNS translations, but this is a terribly slow and inefficient process. Even webazolver, a companion preprocessor that does nothing but populate a name cache via multiple child processes, is badly impacted by unreachable nameservers, which are very common.

Our own website sees relatively little traffic, but log processing still took much too long, and it was clear that DNS inverse lookups were the hangup. So we studied how The Webalizer used the cache and wrote our own program using asynchronous DNS. It's dramatically faster and more efficient.

On a dual-processor 500MHz Linux machine, fastzolver achieved more than 100 resolutions per second when run in unlimited-queries mode, and it does this in a single thread (though unlimited mode ought not be typical, as it's very hard on a nameserver).

Asynchronous DNS

Most DNS lookups are synchronous, which means that the requesting process blocks until a response is received or a timeout occurs. These timeouts are typically fairly long - 30 seconds or one minute - and absolutely nothing is done by the requesting process during this time.
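
For reference, here is what a conventional blocking reverse lookup looks like using the standard getnameinfo() call. This is only an illustrative sketch (it's not code from fastzolver or The Webalizer): the process sits inside the call until the resolver returns an answer or gives up.

// Sketch of a synchronous reverse lookup: the process is stuck inside
// getnameinfo() until the nameserver answers or the resolver times out.
#include <arpa/inet.h>
#include <netdb.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <cstring>
#include <string>

// Returns the hostname for a dotted-quad IP, or the IP itself on failure.
std::string blocking_reverse_lookup(const std::string &ip)
{
    sockaddr_in sa;
    std::memset(&sa, 0, sizeof sa);
    sa.sin_family = AF_INET;
    inet_pton(AF_INET, ip.c_str(), &sa.sin_addr);

    char host[NI_MAXHOST];

    // NI_NAMEREQD: report failure rather than echoing the IP when no
    // PTR record exists
    int rc = getnameinfo(reinterpret_cast<sockaddr *>(&sa), sizeof sa,
                         host, sizeof host, 0, 0, NI_NAMEREQD);

    return (rc == 0) ? std::string(host) : ip;  // fall back to the IP string
}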

When inverse DNS servers respond immediately, a series of lookups can proceed rapidly, but it doesn't take many unreachable servers before runtime shoots up dramatically. This can be ameliorated somewhat with multiple child processes, but that requires a lot of coordination and still suffers from the waiting-for-reply hangs.

Our approach uses asynchronous DNS lookups, which send a request and immediately go on to other work (such as performing other lookups). A note about the outstanding request is kept on a list, and when a reply arrives, it's mated with its associated request and the name is resolved. A timeout generates a pseudo-reply that likewise completes the process.

By taking this approach, nothing is blocked while waiting for a reply that arrives late (or never at all), and we can keep as many requests outstanding as memory, bandwidth, and the nameserver will allow.

Rolling our own asynchronous DNS with the standard resolver libraries is a daunting task, so we instead use the wonderful adns library written by Ian Jackson. It provides exactly the support a program of this kind needs, and its integration has been very smooth in this and other projects.
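
To give a flavor of the technique, here is a compressed sketch: submit reverse queries through adns, keep the outstanding ones on a list capped at a pending limit (the idea behind the -L option described below), and harvest answers as they complete. This is illustrative only, not fastzolver's actual source; the adns calls used (adns_init, adns_submit_reverse, adns_wait, adns_finish) are the library's documented interface, and the addresses are placeholders.

// Illustrative sketch of asynchronous reverse lookups with adns: many
// queries are in flight at once, and no single slow or dead nameserver
// stalls the rest.  adns handles retries and timeouts internally.
#include <adns.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <list>
#include <string>

struct Pending {
    std::string ip;     // dotted-quad address being resolved
    adns_query  query;  // handle returned by adns_submit_reverse()
};

int main()
{
    adns_state adns;
    if (adns_init(&adns, (adns_initflags)0, stderr)) return 1;

    const char *ips[] = { "192.0.2.1", "198.51.100.7", "203.0.113.9" };
    const size_t nips = sizeof ips / sizeof *ips;
    const size_t maxpending = 40;          // cap on outstanding queries (cf. -L)

    std::list<Pending> pending;
    size_t next = 0;

    while (next < nips || !pending.empty()) {
        // Keep the pipeline full: submit while under the pending limit
        while (next < nips && pending.size() < maxpending) {
            sockaddr_in sa;
            std::memset(&sa, 0, sizeof sa);
            sa.sin_family = AF_INET;
            inet_pton(AF_INET, ips[next], &sa.sin_addr);

            Pending p;
            p.ip = ips[next++];
            adns_submit_reverse(adns, (sockaddr *)&sa, adns_r_ptr,
                                (adns_queryflags)0, 0, &p.query);
            pending.push_back(p);
        }

        // Harvest the oldest outstanding answer.  adns_wait() returns as
        // soon as that query finishes (successfully or by timeout), while
        // every other submitted query keeps making progress meanwhile.
        Pending p = pending.front();
        pending.pop_front();

        adns_answer *ans = 0;
        if (adns_wait(adns, &p.query, &ans, 0) == 0) {
            if (ans->status == adns_s_ok && ans->nrrs > 0)
                std::printf("%s -> %s\n", p.ip.c_str(), ans->rrs.str[0]);
            else
                std::printf("%s -> (unresolved)\n", p.ip.c_str());
            std::free(ans);
        }
    }

    adns_finish(adns);
    return 0;
}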

Logfiles and the DNS Cache

Like The Webalizer, this program reads webserver logfiles, but fastzolver has virtually no real knowledge of their format. It knows only that the first word on each line is an IP address; the rest of the line is not considered in any way. This makes fastzolver suitable for general "resolve this list of addresses" purposes.

The whole process revolves around a database cache file maintained in Berkeley DB format: fastzolver populates it with IP-to-name translations, and webalizer reads it while processing the logs in detail. Each cache entry contains four items:

The key
This is the IP address being looked up, in dotted-quad string form. This is a unique key.
Hostname
This is either a looked-up hostname or a string IP address. The former is used when a DNS lookup is successful, while the latter is used both for an unsuccessful lookup and for one still in progress.
Type
This is a flag that indicates whether the hostname field contains an actual hostname, or just an IP address. This affects the cache expiry time.
Timestamp
This is a 32-bit binary UNIX time that remembers when the entry was created. We don't consider a looked-up entry to be valid forever, so the timestamp allows for expiry of older data.
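
As a concrete illustration of this record, here is one plausible packing of a cache entry through the v1.85 Berkeley DB interface. The exact on-disk layout is whatever The Webalizer's own source expects, so treat this as a sketch of the idea rather than an authoritative format.

// Sketch: write one cache entry (32-bit timestamp + type flag + hostname)
// keyed by the dotted-quad IP string, using the v1.85 Berkeley DB API.
#include <db_185.h>   // v1.85 compatibility header (db4 built with --enable-compat185)
#include <fcntl.h>
#include <cstdint>
#include <cstring>
#include <ctime>
#include <string>
#include <vector>

bool put_cache_entry(DB *db, const std::string &ip,
                     const std::string &hostname, bool is_real_name)
{
    // Value: 32-bit UNIX timestamp, then the type flag, then the
    // NUL-terminated hostname (or IP string placeholder)
    std::vector<char> rec(sizeof(uint32_t) + 1 + hostname.size() + 1);
    uint32_t now = (uint32_t)std::time(0);
    std::memcpy(&rec[0], &now, sizeof now);
    rec[sizeof now] = is_real_name ? 1 : 0;
    std::memcpy(&rec[sizeof now + 1], hostname.c_str(), hostname.size() + 1);

    DBT key, val;
    key.data = (void *)ip.c_str();   // the IP string is the unique key
    key.size = ip.size() + 1;
    val.data = &rec[0];
    val.size = rec.size();

    return db->put(db, &key, &val, 0) == 0;
}

// Typical use (hash-format database assumed):
//   DB *db = dbopen("dns_cache.db", O_RDWR|O_CREAT, 0644, DB_HASH, 0);
//   put_cache_entry(db, "192.0.2.1", "host.example.com", true);
//   db->close(db);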

While processing the log, fastzolver extracts an IP address from the start of each line and, as an optimization, always skips any line with the same IP as the previous line: runs of the same IP address are very common.
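
That reading loop is simple enough to sketch in full. This version is ours, not fastzolver's source: it pulls the first whitespace-delimited token from each line as the IP address and skips consecutive repeats.

// Illustrative reading loop: the only "parsing" required is taking the
// first word of each line, plus skipping runs of the same IP address.
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

int main(int argc, char **argv)
{
    if (argc < 2) return 1;
    std::ifstream log(argv[1]);
    std::string line, ip, previous_ip;

    while (std::getline(log, line)) {
        std::istringstream fields(line);
        if (!(fields >> ip)) continue;     // blank line: nothing to do

        if (ip == previous_ip) continue;   // same IP as last line: skip it
        previous_ip = ip;

        std::cout << ip << '\n';           // here the cache would be consulted
    }
    return 0;
}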

The cache is consulted using the IP address, and one of three results is obtained:

  1. An entry is found that is still within its cache-valid time
  2. A cached but stale entry is found
  3. No entry is found

In case #1, this IP address is considered "translated" and no more work need be done: fastzolver moves on to the next entry in the file.

Otherwise, any existing stale entry is deleted outright, and a new entry is saved that uses the IP address string itself as the hostname. This is really just a placeholder, and it's replaced later with the actual hostname if one is found. Otherwise the IP address string remains as a kind of negative cache entry so the address isn't repeatedly looked up without success.

The cache algorithm treats a valid hostname differently from a not-found IP address string on the assumption that the former is likely to be valid longer, and that the latter could be due to a connectivity issue that may be resolved soon. Default cache expiry times are 5 days for a hostname and one day for an IP address, though both can be modified on the command line.
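
Put together, the per-address decision reduces to something like the following sketch. The in-memory map stands in for the Berkeley DB cache purely to keep the example self-contained, and the expiry constants are the defaults described above.

// Sketch of the cache decision: valid entry -> done; stale entry -> delete
// and re-resolve; missing entry -> resolve.  A placeholder entry holding the
// IP string doubles as a negative cache until it expires.
#include <ctime>
#include <map>
#include <string>

struct CacheEntry {
    std::string hostname;   // real hostname, or the IP string as a placeholder
    bool        is_name;    // type flag: true means a looked-up hostname
    std::time_t created;    // when the entry was written
};

std::map<std::string, CacheEntry> cache;      // stand-in for dns_cache.db

const std::time_t HOST_TTL = 5 * 24 * 3600;   // -H default: five days
const std::time_t ADDR_TTL = 1 * 24 * 3600;   // -U default: one day

// Returns true if an asynchronous lookup should be submitted for this IP.
bool needs_lookup(const std::string &ip)
{
    std::time_t now = std::time(0);
    std::map<std::string, CacheEntry>::iterator it = cache.find(ip);

    if (it != cache.end()) {
        std::time_t ttl = it->second.is_name ? HOST_TTL : ADDR_TTL;
        if (now - it->second.created < ttl)
            return false;                     // case 1: valid, nothing to do
        cache.erase(it);                      // case 2: stale, delete outright
    }

    // Cases 2 and 3: save a placeholder using the IP string as the hostname;
    // a successful lookup replaces it later, otherwise it remains as a
    // negative cache entry until it expires.
    CacheEntry placeholder = { ip, false, now };
    cache[ip] = placeholder;
    return true;
}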

Curiously, there is no explicit process that does large-scale expiry of cache entries. Though it would certainly be possible to create one that made a standalone expiration pass, it wouldn't produce any meaningful difference from the existing behavior that expires upon new lookup.

Though deleting records in Berkeley DB files makes room available for other records in the future, it won't ever return the space to the filesystem: this means that files grow, but they don't shrink. We recommend removing the DNS cache file periodically (perhaps once a month) to clean this out.

Command-line Parameters

This program can be run standalone from the command line or included in the same daily log-rotation scripts that drive The Webalizer. These are the options supported on the command line:

-V
Show the program's version information and exit.
-d
Increment the debug level by one.
-L maxpending
Limit the number of pending DNS queries to maxpending; this is meant to avoid mounting what amounts to a denial-of-service attack on the nameserver. The default is a small number (40-ish), and zero disables the limit entirely. Note that this is not a thread or subprocess count: fastzolver runs one thread in one process. Running with -L1 simulates a single-threaded, synchronous lookup.
-D cachefile
Keep the cache of looked-up names in cachefile, which defaults to dns_cache.db in the current directory. This file's name will also be passed to webalizer, which uses the cache while reporting.
-H ndays
Expire valid hostnames after ndays days.
-U ndays
Expire unknown lookups (saved as literal IP addresses) after ndays days.
-s
Show quite a few statistics while running: every second, it displays the current progress, and at the end of the run it displays the ultimate resolution rate.
...
STATS: 359949 lines, 85 pending, 32425 results (21524 success):
STATS: 362172 lines, 99 pending, 32621 results (21659 success):
STATS: 364409 lines, 105 pending, 32835 results (21794 success):
STATS: 365875 lines, 83 pending, 32981 results (21878 success):
STATS: 367334 lines, 80 pending, 33120 results (21954 success):
STATS: 368603 lines, 87 pending, 33245 results (22026 success):
STATS: 370203 lines, 83 pending, 33377 results (22097 success):
...
Processed 35668 queries in 307 seconds (116.18 q/sec)

Integration with The Webalizer

Generally speaking, fastzolver is nearly a drop-in replacement for webazolver and is easy to integrate into an existing log-processing scheme. This is not a tutorial on The Webalizer itself, however.

Most users perform their log processing at midnight from cron with a common set of steps: rotate the logfile, signal the webserver to reopen its logs, resolve the DNS names into the cache, run The Webalizer, and compress the old logfile.

A prototype of this scheme (lacking quite a bit of error checking) could be written as a shell script and saved in /usr/local/bin/cron-run-webalizer; cron would then launch the script nightly just after midnight:

/usr/local/bin/cron-run-webalizer
#!/bin/sh
#
# Just a prototype!
#
# Name the rotated logfile after today's date (e.g. access_log.2005-08-06)
OLDLOG=`date +access_log.%Y-%m-%d`

cd /home/apache/logs

# Rotate the current logfile out of the way
mv access_log $OLDLOG

# Tell Apache to finish in-flight requests and reopen its logs, then
# give the old children a minute to drain into the renamed file
apachectl graceful
sleep 60

# Resolve the IP addresses into the shared DNS cache, then run
# The Webalizer against the same cache (-N0 disables its own lookups)
fastzolver -L100 -D/tmp/dns_cache.db $OLDLOG
webalizer -p -N0 -D/tmp/dns_cache.db $OLDLOG

# Compress the processed logfile
gzip $OLDLOG

There are, of course, other scenarios, such as those built on top of the logrotate utility or those that support more than one website (and therefore more than one logfile). In the latter case, there is no reason all the log processing can't share a common DNS cache file so that every site benefits from prior lookups.

Download & Build

Source code to fastzolver is available from this website.

It requires a Linux/Unix system, a C++ compiler, GNU make, the ADNS library, and the Berkeley DB library. It may optionally use ZLIB to process compressed logfiles.

ZLIB support is enabled by default, but may be disabled by commenting out the CFLAGS += -DSUPPORT_ZLIB line in the makefile.

The Berkeley DB library must be the same version used with webalizer, since the shared DNS cache database is in that format. We're using the version 1.85 compatibility mode: most default installations provide this, but those building db4 from source must include the --enable-compat185 option when configuring the tree.

First published: 2005/08/06