AGREPY: Python port of agrep string matching with errors
Version 1.2 August 2002
INTRODUCTION
Agrep, written by Sun Wu and Udi Manber (described in "Fast
Text Searching Allowing Errors", CACM, 35(10), 1992), is a suite
of C functions which together perform various string matching
operations under UNIX (i.e. specified at the commandline). The
most recent version is 2.04, and is available from
http://glimpse.cs.arizona.edu/software.html.
The suite contains functions for exact string matching, matching allowing
a small maximum number of insertions, deletions or substitutions,
and functions for matching patterns containing metacharacters
(i.e. regular expressions). Inexact matching of patterns involving
regular expressions is also allowed, and another function performs
matching on a file of input string, rather than just a single
string specificied on the command-line. Finally, while the default
text unit is the \n (or \r) terminated line, multiline matching is
available through the specification (again on the command-line) of a
"record" separator. Given an input pattern and a file, agrep prints
on standard output all records that match the pattern (so a single
match suffices to have a record printed).
Agrepy
This port takes agrep from its current, user level setting and
makes it available as a Python module. However, in recent Python
implementations, much of the functionality described above is already
covered: exact matching is already found in the string module and
regular expression matching is in the regsub and re modules. Furthermore,
in the context of embedded functions (rather than command-line), IO and
the defintion of a text string are handled by the surrounding application
and are therefore no longer relevant. Therefore, what this port implements
are solely those functions relating to inexact matching of text strings
which contain no metacharacters. (Inexact matching of regular expressions
has been ignored because the semantics are not clear - at least to me;
concurrent matching of multiple input patterns is deferred to another day).
On the other hand, apgrepy extends agrep in the following sense:
given a pattern and a text string, agrepy and returns a list of all,
non-overlapping pairs of text indexes such that the start index is the
first character of the text that matches the earliest pattern character
exactly, and the end is the last text character that matches exactly. The end
index of each match is 1 place greater than the actual index so it
can be immediately used to construct a slice. (agrep itself is
content with recognizing that the input line contains a match,
but does not say where or differentiate multiple matches.)
Specifically, AGREPY Pythonizes two functions from file sgrep.c
of agrep version 2.04. "agrep", now called, sagrep, deals with
"short" pattern strings (setable by a header constant, currently 24),
and "a_monkey", now called lagrep, which deals with longer strings.
Each of these is supported by functions which set up the data structures
used during matching. SWIG is used to create
the interface module. As far as possible, the original code of agrep
has been preserved, except where required by the new circumstances (e.g.
to support determination of match end-points) or forced by SWIG
(e.g. move from K&R headers to ANSI prototypes, elimination of
function-like C preprocessor macros). I also indulged in a little
tidying up of the code to make it easier to maintain.
Why the New Version?
The orginal version had an end-of-text bug (sagrepy.c), and both lagrepy.c
and sagrepy.c had problems at times finding the correct ends of a match.
There also can be a genuine ambiguity. The methodology now is that sagrepy/lagrepy
find the ends of matches fairly accurately, and a separate, recursive function
firms up the end position and finds the start position.
Tested Systems
Agrepy has been developed on a PC running Linux, and successfully tested on
a SGI Challenge running IRIX.
Source Code
The code for port is available from:
http://www.pam1.bcs.uwa.edu.au/~michaelw/ftp/src/agrepy_1.2.tar.gz.