This program searches text corpora for arbitrary regular expressions and
produces a report in HTML format. It can read local files, or those
available by HTTP or FTP, and it knows how to unpack ZIP files. It requires
Perl 5, and the following network modules: Net::FTP,
LWP::Simple, and LWP::UserAgent.
The program requires a parameter file, search-params, that describes the
search. The file is read in and evaluated as Perl code, so it can alter any
of the program's code or variables, but it is intended specifically for
setting the following variables:
- $search_pattern
This must be set to a string containing the regular expression to search
for. The expression uses Perl 5 extended regular expression syntax, so
whitespace in the pattern is ignored and comments can be included in it.
Searches are normally case sensitive, but prefixing the pattern with (?i)
makes it insensitive to case.
- $pre_not_patterns
This can be set to a reference to an array of strings that contain regular
expressions for material that one does not want to precede the search
target.
- $post_not_patterns
This can be set to a reference to an array of strings that contain regular
expressions for material that one does not want to follow the search
target.
- $pre_patterns
This can be set to a reference to an array of strings that contain regular
expressions for material that one wants to precede the search target.
- $post_patterns
This can be set to a reference to an array of strings that contain regular
expressions for material that one wants to follow the search target.
- $fileset_name
This can be set to a string that will prefix the names of the HTML files
the program generates, which allows the results of several different
searches to be kept in the same directory. If it is not set, all page names
begin with the prefix "report".
- $files
This is the only other mandatory parameter besides $search_pattern. It
should be set to a reference to an array of strings, each naming a file to
be searched. If a name begins with http:// or ftp://, the file is fetched
using that protocol; otherwise it is assumed to be a local file. A name can
end in /, in which case the entire directory is searched. If a name ends in
.zip, the file is unzipped before searching.
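
As an illustration of the variables above, a search-params file might look
like the following sketch. Every value in it (the pattern, the file names,
the example.com addresses, the "while" prefix) is made up for the example,
and exactly how the program combines the optional pattern lists is an
assumption here. The variables are assigned without "my", on the assumption
that the program reads them as its own variables after evaluating the file.

```perl
# search-params (hypothetical example; all values are illustrative)

# Mandatory. Extended syntax: whitespace and comments inside the pattern
# are ignored. The leading (?i) makes the search case-insensitive.
$search_pattern = q{(?i)      # ignore case
    \b whil (?: e | st ) \b   # "while" or "whilst"
};

# Optional. Material that should not precede or follow the search target.
$pre_not_patterns  = [ q{worth\s+} ];
$post_not_patterns = [ q{\s+ago\b} ];

# Optional. Material that should precede or follow the search target
# (left commented out in this example).
# $pre_patterns  = [ q{\bfor\s+a\s+} ];
# $post_patterns = [ q{\s+longer\b} ];

# Optional. Prefix for the generated HTML file names.
$fileset_name = 'while';

# Mandatory. Local files, a directory (trailing /), a ZIP archive,
# and remote files fetched by HTTP or FTP.
$files = [
    'texts/',
    'archive/sample.zip',
    'http://www.example.com/corpus/book.txt',
    'ftp://ftp.example.com/pub/book.zip',
];

1;  # harmless true value, customary for files that are loaded as Perl code
```

Because the file is ordinary Perl, single-quoted forms such as q{...} are a
convenient way to write patterns: backslashes like \b and \s reach the
regular expression engine unchanged.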
The program reads each file into memory before processing it. It also scans
for a line containing the string *END*. If one is found, it is taken to be
the end of a Project Gutenberg header, and the text above that line is
ignored. The program makes a simple attempt to break the text into
sentences, and it searches for the pattern within the context of individual
sentences.
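
The header handling and sentence splitting described above can be sketched
roughly as follows. This is not the program's own code: the helper names
strip_header and split_sentences are hypothetical, the sketch assumes the
first line containing *END* marks the end of the header, and the rule of
breaking after ".", "!" or "?" followed by whitespace is only one plausible
reading of "a simple attempt".

```perl
use strict;
use warnings;

# Hypothetical helper: discard everything up to and including the first
# line that contains the literal string *END*. Text without such a line
# is returned unchanged.
sub strip_header {
    my ($text) = @_;
    # \Q...\E quotes the asterisks so they match literally; /m lets ^
    # match at the start of any line.
    if ($text =~ /^.*\Q*END*\E.*\n?/m) {
        return substr($text, $+[0]);   # $+[0] is the offset just past the match
    }
    return $text;
}

# Hypothetical helper: a naive sentence splitter.
sub split_sentences {
    my ($text) = @_;
    $text =~ s/\s+/ /g;          # join wrapped lines into one run of text
    $text =~ s/^\s+|\s+$//g;     # trim leading and trailing whitespace
    # Break at whitespace that follows sentence-ending punctuation.
    my @parts = split /(?<=[.!?])\s+/, $text;
    return grep { /\S/ } @parts;
}

my $raw = "Header line\n*END*THE SMALL PRINT*END*\n"
        . "It was late. Was it\nraining? Yes!\n";
print "$_\n" for split_sentences(strip_header($raw));
```

A splitter this simple will also break after abbreviations such as "Mr.",
which is in keeping with a simple attempt rather than a full parser.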
The output uses a separate HTML page for each distinct string matched by
$search_pattern, ignoring case distinctions. (This is why there are separate
variables for $pre_patterns and $post_patterns: these are searched for, but
are not included in the reported string.) Each of those pages contains a
separate report for each sentence in which the string was found. A master
index file, named $fileset_name.html, is also generated; it lists the
matched strings in alphabetical order, with links to the relevant pages.
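
The grouping behind this layout can be pictured with a short sketch. It is
an assumption about how such grouping could be done, not the program's
actual code: group_matches is a hypothetical helper that files each sentence
under the lower-cased text of every match found in it, so that "While" and
"while" end up on the same page.

```perl
use strict;
use warnings;

# Hypothetical helper: map each distinct matched string (lower-cased) to
# the list of sentences in which it occurs.
sub group_matches {
    my ($pattern, @sentences) = @_;
    my (%by_string, %seen);
    for my $sentence (@sentences) {
        # /x gives the extended syntax; /g finds every match in the sentence.
        while ($sentence =~ /$pattern/gx) {
            # $-[0] and $+[0] are the start and end offsets of the match.
            my $key = lc substr($sentence, $-[0], $+[0] - $-[0]);
            # List a sentence only once per string, even if it matches twice.
            push @{ $by_string{$key} }, $sentence
                unless $seen{$key}{$sentence}++;
        }
    }
    return \%by_string;
}

my $groups = group_matches(
    q{(?i) \b whil(?:e|st) \b},
    'Wait a While.',
    'Rest while you can, whilst I work.',
);

# The index lists the strings alphabetically; each would link to its page.
for my $string (sort keys %$groups) {
    printf "%s: %d sentence(s)\n", $string, scalar @{ $groups->{$string} };
}
```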