AGREPY: Python port of agrep string matching with errors

Version 1.2 August 2002


INTRODUCTION

Agrep, written by Sun Wu and Udi Manber (described in "Fast Text Searching Allowing Errors", CACM, 35(10), 1992), is a suite of C functions which together perform various string matching operations under UNIX (i.e. specified at the commandline). The most recent version is 2.04, and is available from http://glimpse.cs.arizona.edu/software.html. The suite contains functions for exact string matching, matching allowing a small maximum number of insertions, deletions or substitutions, and functions for matching patterns containing metacharacters (i.e. regular expressions). Inexact matching of patterns involving regular expressions is also allowed, and another function performs matching on a file of input string, rather than just a single string specificied on the command-line. Finally, while the default text unit is the \n (or \r) terminated line, multiline matching is available through the specification (again on the command-line) of a "record" separator. Given an input pattern and a file, agrep prints on standard output all records that match the pattern (so a single match suffices to have a record printed).

Agrepy

This port takes agrep from its current, user level setting and makes it available as a Python module. However, in recent Python implementations, much of the functionality described above is already covered: exact matching is already found in the string module and regular expression matching is in the regsub and re modules. Furthermore, in the context of embedded functions (rather than command-line), IO and the defintion of a text string are handled by the surrounding application and are therefore no longer relevant. Therefore, what this port implements are solely those functions relating to inexact matching of text strings which contain no metacharacters. (Inexact matching of regular expressions has been ignored because the semantics are not clear - at least to me; concurrent matching of multiple input patterns is deferred to another day). On the other hand, apgrepy extends agrep in the following sense: given a pattern and a text string, agrepy and returns a list of all, non-overlapping pairs of text indexes such that the start index is the first character of the text that matches the earliest pattern character exactly, and the end is the last text character that matches exactly. The end index of each match is 1 place greater than the actual index so it can be immediately used to construct a slice. (agrep itself is content with recognizing that the input line contains a match, but does not say where or differentiate multiple matches.)

Specifically, AGREPY Pythonizes two functions from file sgrep.c of agrep version 2.04. "agrep", now called, sagrep, deals with "short" pattern strings (setable by a header constant, currently 24), and "a_monkey", now called lagrep, which deals with longer strings. Each of these is supported by functions which set up the data structures used during matching. SWIG is used to create the interface module. As far as possible, the original code of agrep has been preserved, except where required by the new circumstances (e.g. to support determination of match end-points) or forced by SWIG (e.g. move from K&R headers to ANSI prototypes, elimination of function-like C preprocessor macros). I also indulged in a little tidying up of the code to make it easier to maintain.

Why the New Version?

The orginal version had an end-of-text bug (sagrepy.c), and both lagrepy.c and sagrepy.c had problems at times finding the correct ends of a match. There also can be a genuine ambiguity. The methodology now is that sagrepy/lagrepy find the ends of matches fairly accurately, and a separate, recursive function firms up the end position and finds the start position.

Tested Systems

Agrepy has been developed on a PC running Linux, and successfully tested on a SGI Challenge running IRIX.

Source Code

The code for port is available from: http://www.pam1.bcs.uwa.edu.au/~michaelw/ftp/src/agrepy_1.2.tar.gz.