Hit Filtering
When processing a log file, not all hits should be included in the generated statistics. Log files can span multiple months and/or contain hits that are of no interest to you (for example, requests for CSS files, JavaScript files, image files, favicon.ico or robots.txt). It is therefore desirable to filter the hits so that the statistics only include those you are interested in. Hit filtering allows you to do that.
Hit filtering only occurs if a hit filter is provided by specifying a class in the hitFilterClass property.
The hit filter implementation must implement the org.polliwog.filters.HitFilter interface. For each hit, the accept(org.polliwog.data.Hit) method is called; if it returns true the hit is added to the statistics, otherwise it is filtered, i.e. rejected.
Basic Hit Filter
A basic hit filtering implementation is provided with polliwog, implemented by the class org.polliwog.filters.BasicHitFilter. To make polliwog use this filter, set the hitFilterClass property to org.polliwog.filters.BasicHitFilter.
The basic hit filter uses an XML file to configure a number of rules that it applies to each hit to determine whether the hit should be filtered.
Rules are applied sequentially. Each rule defines whether a matching hit should be accepted or rejected (which corresponds to whether accept returns true or false).
The XML file has the following elements/attributes:
XML Definition Help
The Children column shows the child elements that can be used within the specified element. Child elements can appear in any order (there is no enforcement via a DTD).
- A + after the element name indicates that at least 1 child element with that name must be present.
- A * after the element name indicates that 0 or more elements can be present.
- A ? after the element name indicates that either 1 or no elements with that name can be present.
If no symbol is provided after the name then one element must be provided.
The Attributes column shows the attributes that can be used on the specified element. Attribute definitions are written as: name(value_type,required|optional), where value_type is one of:
- string - A string value, this can be anything.
- integer - An integer value.
Required and optional are represented as: R and O respectively.
| Name | Root | Children | Parent(s) | Attributes | Description |
|---|---|---|---|---|---|
| hit-filter | Y | rule+ | NONE | NONE | The root element; each child rule element defines a rule that should be applied to a hit. |
| rule | N | ANY | hit-filter | type(string,R) action(string,R) | Defines a rule to be applied to a hit. The type attribute defines the type of rule that is created; the following values are supported: date, url. Depending upon the type of rule created, extra attributes may be needed on the rule element; see the relevant rule for details. The action attribute can be either accept or reject, depending upon whether you want to accept or reject the hit if the rule matches. |
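As an illustration of the structure described in the table, a complete filter file might look like the sketch below. It only uses the hit-filter and rule elements and the type and action attributes defined above, plus rule-specific attributes taken from the date and url rule descriptions on this page. The XML declaration and the comments are assumptions added for readability. How the outcomes of several rules are combined is not spelled out here, so treat this as a sketch of the file layout rather than a tested configuration.

```xml
<?xml version="1.0"?>
<!-- Hypothetical filter file: one root element containing one or more rules -->
<hit-filter>
  <!-- url rule: reject requests for stylesheets -->
  <rule type="url"
        action="reject"
        endsWith=".css" />
  <!-- date rule: accept hits that occurred in the current month -->
  <rule type="date"
        action="accept"
        currentMonth="true" />
</hit-filter>
```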
Date Rules
A date rule is created when the type attribute on a rule element is date; an instance of org.polliwog.filters.DateRule is created. A date rule filters hits based on the date on which the hit occurred.
The date rule uses the following extra optional attributes, which must be specified on the rule element, to initialize itself (the value in brackets indicates the type of value that should be specified):
- currentMonth(boolean) - when specified with a true value, only hits with a date in the current month will be accepted/rejected. The current month is taken to be the time between 00:00:00.000 on the 1st of the calendar month and 23:59:59.999 on the last day of the calendar month.
- currentWeek(boolean) - when specified with a true value, only hits with a date in the current week will be accepted/rejected. The current week is taken to be the time between 00:00:00.000 on the first day of the current week and 23:59:59.999 six days later. For most locales this means either Monday 00:00:00.000 - Sunday 23:59:59.999 or Sunday 00:00:00.000 - Saturday 23:59:59.999; either way, the period spans 7 days (minus 1 millisecond).
Note: the first day of the week is determined by calling getFirstDayOfWeek(); the week is then taken to be the seven days starting on that day.
- today(boolean) - when specified with a true value, only hits for the current date will be accepted/rejected.
- after(string) - indicates that only hits after the specified date should be accepted/rejected. The default date format is dd/MMM/yyyy, but this can be overridden by using the format attribute.
- before(string) - indicates that only hits before the specified date should be accepted/rejected. The default date format is dd/MMM/yyyy, but this can be overridden by using the format attribute.
- format(string) - indicates that the specified format should be used instead of the default. The format should be suitable for use with a SimpleDateFormat object.
- month(string) - when specified, only hits for the specified month (of the current year) will be accepted/rejected. The month value should be one of the standard three-letter abbreviations for your locale, i.e. Jan, Feb, Mar, Apr etc.
Example
Only accept hits for the current month.
<rule type="date"
action="accept"
currentMonth="true" />
Example
Reject hits that are not between 3rd April 2007 and 23rd April 2007. Note that this is the same as swapping the two dates (so that after is 03/Apr/2007 and before is 23/Apr/2007) and using an action value of accept.
<rule type="date"
action="reject"
before="03/Apr/2007"
after="23/Apr/2007" />
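Example
The equivalent accept form mentioned in the previous example could be written as follows. This is a sketch derived from that description, with the two dates swapped between the before and after attributes; whether the boundary dates themselves are included is not stated here.

```xml
<!-- Accept hits after 3rd April 2007 and before 23rd April 2007 -->
<rule type="date"
      action="accept"
      after="03/Apr/2007"
      before="23/Apr/2007" />
```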
Example
Accept hits in June.
<rule type="date"
action="accept"
month="Jun" />
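Example
Accept hits after 3rd April 2007, with the date written in a custom format. This is a sketch that assumes the format attribute applies to the after and before values on the same rule element, as its description suggests; yyyy-MM-dd is a standard SimpleDateFormat pattern.

```xml
<!-- The after value is parsed using the pattern given in format -->
<rule type="date"
      action="accept"
      after="2007-04-03"
      format="yyyy-MM-dd" />
```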
URL Rule
A url rule is created when the type attribute on a rule element is url; an instance of org.polliwog.filters.URLRule is created. A url rule filters hits based on the requested url for the hit.
The url rule uses the following extra optional attributes, which must be specified on the rule element, to initialize itself (the value in brackets indicates the type of value that should be specified):
- startsWith(string) - when specified, indicates a value that the url should start with for the rule to match the hit.
- endsWith(string) - when specified, indicates a value that the url should end with for the rule to match the hit.
- contains(string) - when specified, indicates a value that the url should contain for the rule to match the hit.
- ignoreCase(boolean) - when specified with a true value, indicates that comparisons should be case-insensitive.
Note: request urls start with /.
Example
Reject hits for the admin part of the site.
<rule type="url"
action="reject"
startsWith="/admin/" />
Example
Only accept .php hits.
<rule type="url"
action="accept"
endsWith=".php" />
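Example
Reject hits for stylesheets regardless of the case of the file extension, and reject hits whose url contains an images directory. This is a sketch using the endsWith, contains and ignoreCase attributes listed above; the /images/ path is a hypothetical value chosen for illustration.

```xml
<!-- Matches urls ending in .css, .CSS, .Css and so on -->
<rule type="url"
      action="reject"
      endsWith=".css"
      ignoreCase="true" />
<!-- Matches any url with /images/ somewhere in the path -->
<rule type="url"
      action="reject"
      contains="/images/" />
```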
© Gary Bentley 2004-2007. All Rights Reserved.