Hit Filtering
When a log file is processed, not every hit should be included in the generated statistics. Log files can span multiple months and can contain hits that are of no interest to you (for example, requests for css files, javascript files, image files, favicon.ico or robots.txt). Hit filtering lets you restrict the statistics to the hits that you are interested in.

Hit filtering only occurs if a hit filter is provided, by specifying a class in the property:
hitFilterClass

The hit filter implementation must implement the interface: org.polliwog.filters.HitFilter. For each hit, the method: accept(org.polliwog.data.Hit) is called. If the method returns true the hit is added to the statistics, otherwise it is filtered, i.e. rejected.
Basic Hit Filter
A basic hit filtering implementation is provided with polliwog. It is implemented by class: org.polliwog.filters.BasicHitFilter. To make polliwog use this filter use value:
org.polliwog.filters.BasicHitFilter
for the: hitFilterClass property.

The basic hit filter uses an xml file to configure a number of rules that it applies to each hit to determine whether it should be filtered.

Each rule is applied sequentially. Each rule defines whether a matching hit should be accepted or rejected (which corresponds to whether the accept method returns true or false).

The xml file has the following elements/attributes:
Name | Root | Children | Parent(s) | Attributes | Description
hit-filter | Y | rule+ | NONE | NONE | The root element. Each child rule element defines a rule that should be applied to a hit.
rule | N | ANY | hit-filter | type(string,R), action(string,R) | Defines a rule to be applied to a hit.

The type attribute defines the type of rule that is created. The rule types described on this page are: josql, date and url (the date and url types are deprecated). Depending upon the type of rule created there may be extra attributes needed for the rule element, see the relevant rule for details.

The action attribute can be either: accept or reject, depending upon whether you want to accept or reject the hit if the rule matches.
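Example
As an illustration, a complete filter file might look like the sketch below. It uses only the hit-filter and rule elements described above, with two JoSQL rules taken from the examples later on this page. How the outcome is decided when more than one rule matches a hit is not covered here, so the ordering shown is an assumption rather than a recommendation.

```xml
<hit-filter>
  <!-- Rules are applied in the order in which they appear. -->
  <!-- Reject requests for image files. -->
  <rule type="josql"
        action="reject">
    pageType $IN ('gif', 'jpg', 'png')
  </rule>
  <!-- Accept hits dated in the current month. -->
  <rule type="josql"
        action="accept">
    currentMonth (date)
  </rule>
</hit-filter>
```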
JoSQL Rules
A JoSQL rule is created when the type attribute on a rule element is: josql. An instance of: org.polliwog.filters.JoSQLRule is created.

A josql rule uses a JoSQL WHERE clause to perform the filtering (you do not need to provide the WHERE keyword). The WHERE clause should be placed as the content of the rule element.

Instances of Hit will be passed to the rule and the WHERE clause applied. If the WHERE clause evaluates to true then the hit is accepted or rejected according to the action attribute.

The functions from JoSQLFunctionHandler are available for use in the WHERE clause.
Information available in the Hit object
Note that whilst the Hit object can contain a large amount of information, at the point where the filter rules are applied only non-derived information (i.e. information that is available in the log file) has a value. For instance, the site area, hit page and visit summary won't have a value, so they should not be used in the WHERE clause.

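In general, you should restrict the WHERE clause to accessors for values read directly from the log file. The examples below use: date, hostname, pageType, status, requestMethod and size.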
Example
Only accept hits for the current month.
<rule type="josql"
      action="accept">
  currentMonth (date)
</rule>
Example
Reject all hits from the 192.168.0.X IP address range.
<rule type="josql"
      action="reject">
  hostname LIKE '192.168.0.%'
</rule>
Example
Reject hits for image files.
<rule type="josql"
      action="reject">
  pageType $IN ('gif', 'jpg', 'png')
</rule>
Example
Reject 404 hits made with the POST method where the size is greater than 10000 bytes.
<rule type="josql"
      action="reject">
  status = '404'
  AND
  requestMethod $= 'POST'
  AND
  size > 10000
</rule>
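Example
Because the WHERE clause is the text content of an xml element, a clause containing < or & cannot be written literally. One way to handle this, sketched below, is to wrap the clause in a CDATA section; escaping the character as &lt; is the standard xml alternative. The rule is illustrative only: it reuses the size accessor from the example above and assumes the filter file is read with an ordinary xml parser.

```xml
<!-- Reject hits smaller than 100 bytes. The CDATA section stops the < being parsed as markup. -->
<rule type="josql"
      action="reject"><![CDATA[
  size < 100
]]></rule>
```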
Date Rules
*Deprecated
This rule is deprecated as of version 0.7 and should not be used; use a JoSQL rule instead.
A date rule is created when the type attribute on a rule element is: date. An instance of: org.polliwog.filters.DateRule is created. A date rule will filter hits based on the date that the hit occurred.

The date rule initializes itself from the following extra attributes. All of them are optional and, when used, are specified on the rule element (the value in brackets indicates the type of value that should be specified):
  • currentMonth(boolean) - when specified with a true value only hits with a date in the current month will be accepted/rejected. The current month runs from 00:00:00.000 on the 1st of the calendar month to 23:59:59.999 on the last day of the calendar month.
  • currentWeek(boolean) - when specified with a true value only hits with a date in the current week will be accepted/rejected. The current week runs from 00:00:00.000 on the first day of the current week to 23:59:59.999 six days later. For most locales this means either Monday 00:00:00.000 to Sunday 23:59:59.999, or Sunday 00:00:00.000 to Saturday 23:59:59.999; either way, the period spans 7 days (minus 1 millisecond).

    Note: the first day of the week is determined by calling: getFirstDayOfWeek(), the week is then taken as the seven days starting on that day.
  • today(boolean) - when specified with a true value only hits for the current date will be accepted/rejected.
  • after(string) - indicates that only hits after the date specified should be accepted/rejected. The default date format is: dd/MMM/yyyy but this can be overridden by using the format attribute.
  • before(string) - indicates that only hits before the date specified should be accepted/rejected. The default date format is: dd/MMM/yyyy but this can be overridden by using the format attribute.
  • format(string) - indicates that the format specified should be used instead of the default. The format should be suitable for use with a SimpleDateFormat object.
  • month(string) - when specified only hits for the specified month (for the current year) will be accepted/rejected. The month value should be one of the standard three-letter abbreviations for your locale, i.e. Jan, Feb, Mar, Apr etc.
Example
Only accept hits for the current month.
<rule type="date"
      action="accept"
      currentMonth="true" />

Example
Reject hits not between 3rd April 2007 and 23rd April 2007. Note that this is the same as swapping the before and after dates and using an action value of: accept.
<rule type="date"
      action="reject"
      before="03/Apr/2007"
      after="23/Apr/2007" />

Example
Accept hits in June.
<rule type="date"
      action="accept"
      month="Jun" />
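Example
A sketch of the format attribute, assuming the pattern applies to both the after and before values: accept hits between 3rd April 2007 and 23rd April 2007, with the dates written as yyyy-MM-dd. Whether the boundary dates themselves are included is not stated here. As the date rule is deprecated, this is mainly of use when reading existing filter files.

```xml
<!-- format is a SimpleDateFormat pattern; after and before are written using that pattern. -->
<rule type="date"
      action="accept"
      format="yyyy-MM-dd"
      after="2007-04-03"
      before="2007-04-23" />
```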
URL Rule
*Deprecated
This rule is deprecated as of version 0.7 and should not be used; use a JoSQL rule instead.
A url rule is created when the type attribute on a rule element is: url. An instance of: org.polliwog.filters.URLRule is created. A url rule filters based on the requested url for the hit.

The url rule initializes itself from the following extra attributes. All of them are optional and, when used, are specified on the rule element (the value in brackets indicates the type of value that should be specified):
  • startsWith(string) - when specified indicates a value that the url should start with for the rule to match the hit.
  • endsWith(string) - when specified indicates a value that the url should end with for the rule to match the hit.
  • contains(string) - when specified indicates a value that the url should contain for the rule to match the hit.
  • ignoreCase(boolean) - when specified with a true value it indicates that comparisons should be case-insensitive.
Note: request urls start with /.

Example
Reject hits for the admin part of the site.
<rule type="url"
      action="reject"
      startsWith="/admin/" />

Example
Only accept .php hits.
<rule type="url"
      action="accept"
      endsWith=".php" />
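
Example
A sketch combining two of the attributes above: reject any hit whose url contains /images/, ignoring case. It assumes that ignoreCase applies to the contains comparison, as the attribute descriptions suggest; the /images/ path is purely illustrative.

```xml
<rule type="url"
      action="reject"
      contains="/images/"
      ignoreCase="true" />
```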