The Apache HTTP server allows a system administrator to configure how it should log requests. This is good in terms of flexibility, but it’s horrid in terms of parsing: every installation can be different.

I was tasked with getting Apache logs into Graylog and discovered that $CUST has different Apache log formats even between Apache instances which run on a single machine. I certainly didn’t want to have to write extractors for all of those, and I can’t imagine people here wanting to maintain those …

People have tried submitting JSON directly from Apache, but I find that a bit cumbersome to write, and I have the feeling it’s brittle: an unexpected character in the request (which ought to be possible) could render the JSON invalid.
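One concrete way this can go wrong: according to the mod_log_config documentation, Apache writes non-printable bytes in %r as \xhh sequences, and \x is not an escape JSON knows about. The little check below (Python 3) is only a sketch; the logged line is a made-up example of what a hand-built JSON LogFormat might produce, and is_valid_json is a hypothetical helper.

```python
import json

# Hypothetical line from a JSON-style LogFormat. Apache has escaped a raw
# byte in the request as \xe9, which is not a legal JSON escape sequence.
logged = '{"request": "GET /caf\\xe9 HTTP/1.1", "status": 404}'
harmless = '{"request": "GET /cafe HTTP/1.1", "status": 404}'


def is_valid_json(text):
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False


print(is_valid_json(harmless))
print(is_valid_json(logged))
```

A strict parser rejects the whole line, so a single odd request is enough to lose the log entry.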

apache-logger

I settled on what I think is a much simpler and rather flexible format: a TAB-separated (\t) list of key=value pairs configured like this in httpd.conf:

LogFormat "clientaddr=%h\trequest=%r\tstatus=%s\toctets=%b\ttime=%t\truntime=%D\treferer=%{Referer}i\tuseragent=%{User-Agent}i\tinstance=nsd9" graylog
CustomLog "|/usr/local/apache-logger.py" graylog
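To make the format concrete, here is a small sketch (Python 3) of what one such line could look like and how it falls apart into pairs. The sample line and the split_pairs helper are made up for illustration and are not part of apache-logger. According to the mod_log_config documentation, whitespace in %r and %{...}i is written in C-style notation, so a literal TAB should not turn up inside a value.

```python
# Hypothetical sample line, as Apache might write it with the LogFormat above.
sample = (
    "clientaddr=192.0.2.10\trequest=GET /index.html?a=b HTTP/1.1\tstatus=200\t"
    "octets=282\ttime=[20/Mar/2015:06:41:36 +0000]\truntime=501\t"
    "referer=-\tuseragent=curl/7.19.7\tinstance=nsd9\n"
)


def split_pairs(line):
    """Split a TAB-separated line of key=value pairs into a dict of strings."""
    pairs = {}
    for part in line.rstrip("\n").split("\t"):
        # Split on the first '=' only: values such as query strings
        # may contain further '=' characters.
        key, sep, value = part.partition("=")
        if sep:  # ignore anything that is not key=value
            pairs[key] = value
    return pairs


fields = split_pairs(sample)
print(fields["request"])
print(sorted(fields))
```

Splitting on the first = only is what keeps a query string like ?a=b in one piece.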

The apache-logger program splits those up, adds fields required for GELF, and fires that off to a Graylog server configured with an appropriate GELF input.

#!/usr/bin/env python
# JPMens, March 2015 filter for special Apache log format to GELF

import sys
import json
import gelf    # https://github.com/jspaulding/gelf-python/blob/master/gelf.py
import socket
import fileinput
from geoip import open_database    # http://pythonhosted.org/python-geoip/

my_hostname = socket.gethostname()  # GELF "host" (i.e. source)

try:
    geodb = open_database('GeoLite2-City.mmdb')
except Exception:    # don't swallow KeyboardInterrupt / SystemExit
    sys.exit("Cannot open GeoLite2-City database")

c = gelf.Client(server='192.168.1.133', port=10002)

def isnumber(s):
    try:
        float(s)
        return True
    except ValueError:
        pass

    return False

for line in fileinput.input():
    parts = line.rstrip().split('\t')
    data = {}
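    for p in parts:
        if '=' not in p:    # skip anything that isn't key=value (e.g. an empty line)
            continue
        key, value = p.split('=', 1)

        # Store numeric values as numbers so Graylog can compute on them
        if isnumber(value):
            try:
                value = int(value)
            except ValueError:    # numeric, but not an integer
                value = float(value)

        if value != '':
            data[key] = value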

    data['host']        = my_hostname    # overwrite with GELF source
    data['type']        = 'special'

    request = data.get('request', 'GET I dunno')
    method = request.split(' ', 1)[0]

    data['short_message']  = request
    data['method']         = method
    if 'request' in data:
        del data['request']

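    # Geo-location is best effort: no clientaddr or a failed lookup is not fatal
    try:
        g = geodb.lookup(data['clientaddr'])
        if g is not None:
            data['country_code'] = g.country
    except Exception:
        pass

    # Never let a delivery problem kill the logger: Apache is writing to our stdin
    try:
        c.log(json.dumps(data))
    except Exception:
        pass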

Graylog effectively receives something like this (the Geo-location having been added by apache-logger):

{
    "clientaddr": "62.x.x.x",
    "host": "tiggr",
    "instance": "nsd9",
    "method": "GET",
    "country_code": "GB",
    "octets": 282,
    "referer": "-",
    "runtime": 501,
    "short_message": "GET /barbo HTTP/1.1",
    "status": 404,
    "time": "[20/Mar/2015:06:41:36 +0000]",
    "type": "special",
    "useragent": "curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.13.1.0 zlib/1.2.3 libidn/1.18 libssh2/1.2.2"
}
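If you would rather not depend on a GELF library, the wire format is simple enough to build by hand. What follows is a rough sketch (Python 3) against the GELF 1.1 payload specification, not a description of what gelf.py does internally: the spec wants version, host and short_message, takes an optional timestamp in seconds since the epoch, and expects additional fields to carry a leading underscore. The to_gelf and send_udp helpers are hypothetical names, and the server address in the usage note is simply the one from the script above.

```python
import json
import socket
from datetime import datetime


def to_gelf(data, host):
    """Build a GELF 1.1 payload (a dict) from parsed log fields."""
    fields = dict(data)  # work on a copy, leave the caller's dict alone
    payload = {
        "version": "1.1",
        "host": host,
        "short_message": fields.pop("request", "-"),
    }

    # Apache's %t looks like [20/Mar/2015:06:41:36 +0000];
    # GELF wants seconds since the epoch.
    stamp = fields.pop("time", None)
    if stamp:
        try:
            parsed = datetime.strptime(stamp, "[%d/%b/%Y:%H:%M:%S %z]")
            payload["timestamp"] = parsed.timestamp()
        except ValueError:
            fields["time"] = stamp  # unparseable: keep it as a plain field

    # Everything else becomes an additional field with a leading underscore.
    for key, value in fields.items():
        if key != "id":  # the spec does not allow an additional field named _id
            payload["_" + key] = value
    return payload


def send_udp(payload, server, port):
    """Send one uncompressed GELF message as a single UDP datagram."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(json.dumps(payload).encode("utf-8"), (server, port))
    finally:
        sock.close()


example = to_gelf(
    {"request": "GET /barbo HTTP/1.1", "status": 404,
     "time": "[20/Mar/2015:06:41:36 +0000]", "instance": "nsd9"},
    "tiggr",
)
print(json.dumps(example, indent=4, sort_keys=True))
```

Sending is then a matter of send_udp(example, '192.168.1.133', 10002). A single uncompressed datagram is fine for short messages like these; longer ones would need the compression or chunking the spec describes, which this sketch leaves out.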

You’ll have noted that the LogFormat allows me to specify any number of fields (e.g. instance) and values.