search

This program searches text corpora for arbitrary regular expressions and produces a report in HTML format. It can read local files, or those available by HTTP or FTP, and it knows how to unpack ZIP files. It requires Perl 5, and the following network modules: Net::FTP, LWP::Simple, and LWP::UserAgent.

The program requires a parameter file, search-params, that specifies information about the search. That file is read in and evaluated as Perl code, so it can alter any of the program's code or variables, but it is intended specifically for setting the following variables:

$search_pattern
This must be set to a string containing the regular expression to be searched for. The search expression uses Perl 5 extended regular expression syntax, so whitespace in the pattern is ignored, and comments can be included in the search pattern. Searches are normally case sensitive, but one can prefix the pattern with (?i) to make it insensitive to case.

$pre_not_patterns
This can be set to a reference to an array of strings that contain regular expressions for material that one does not want to precede the search target.

$post_not_patterns
This can be set to a reference to an array of strings that contain regular expressions for material that one does not want to follow the search target.

$pre_patterns
This can be set to a reference to an array of strings that contain regular expressions for material that one wants to precede the search target.

$post_patterns
This can be set to a reference to an array of strings that contain regular expressions for material that one wants to follow the search target.

$fileset_name
This can be set to a string that specifies a prefix that will begin the names of the HTML files the program generates. That allows one to keep the results of several different searches in the same directory. If it is not set, all pages will begin with the prefix "report".

$files
This is the other mandatory parameter besides $search_pattern. It should be set to a reference to an array of strings, each naming a file to be searched. If the name begins with http:// or ftp://, the appropriate protocol will be used to fetch the file; otherwise it will be assumed to be a local file. The file name can end in a /, in which case the entire directory will be searched. If the file name ends in .zip, it will be unzipped before searching.
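
As an illustration of the variables described above, a search-params file might look something like the following. This is only a sketch: the pattern, the file names, and the example.com addresses are invented for the example, and the context patterns are written without extra spacing because only $search_pattern is documented as using extended syntax.

```perl
# search-params: a hypothetical example; every value below is illustrative.

# Mandatory: the expression to search for. Whitespace is ignored and
# comments are allowed, so the pattern can be laid out over several lines.
$search_pattern = q{
    (?i)            # ignore case
    \b colou?r      # "color" or "colour"
    (?: s | ed )?   # optional ending
    \b
};

# Optional: material that must precede the target, and material that
# must not follow it.
$pre_patterns      = [ '\bthe\s+' ];
$post_not_patterns = [ '\s+blind' ];

# Optional: prefix for the generated HTML files (instead of "report").
$fileset_name = 'colour';

# Mandatory: the files to search.
$files = [
    'texts/essay.txt',                       # a local file
    'texts/letters/',                        # an entire directory
    'http://www.example.com/novel.txt',      # fetched by HTTP
    'ftp://ftp.example.com/pub/poems.zip',   # fetched by FTP, then unzipped
];
```

The variables are assigned without "my", on the assumption that the program itself owns them and the parameter file only fills them in.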

The program reads each file into memory before processing. It also scans for a line containing the string *END*. If found, that is considered to be the end of a Project Gutenberg header, and text above that line is ignored. It makes a simple effort to break the text into sentences and search for words within the context of individual sentences.
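
The header handling and sentence splitting described above could be sketched roughly as follows. This is not the program's actual code: the subroutine names strip_header and split_sentences are made up for the illustration, the first line containing *END* is assumed to be the one that counts, and the sentence rule (split after ., ! or ? followed by whitespace) is a guess at what a "simple effort" might mean.

```perl
use strict;
use warnings;

# Hypothetical helper: discard everything up to and including the first
# line that contains the literal string *END*.
sub strip_header {
    my ($text) = @_;
    # \Q...\E quotes the asterisks so that they are matched literally.
    if ($text =~ /^.*\Q*END*\E.*$/m) {
        $text = substr($text, $+[0]);   # $+[0] is the offset just past the match
        $text =~ s/\A\r?\n//;           # drop the line terminator that follows
    }
    return $text;
}

# Hypothetical helper: a deliberately naive sentence splitter.
sub split_sentences {
    my ($text) = @_;
    $text =~ s/\s+/ /g;        # fold line breaks into single spaces
    $text =~ s/^\s+|\s+$//g;   # trim both ends
    return split /(?<=[.!?])\s+/, $text;
}

my $raw = "Title page\n*END*THE SMALL PRINT*END*\n"
        . "It was dark. The rain fell\non the roof! Was it late?\n";

my @sentences = split_sentences(strip_header($raw));
print "$_\n" for @sentences;
# Prints:
#   It was dark.
#   The rain fell on the roof!
#   Was it late?
```

A splitter this simple will break on abbreviations such as "Mr." and on decimal numbers, which is one reason to treat it only as an approximation.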

The output format uses a separate HTML page for each distinct string that is matched by the $search_pattern, ignoring case distinctions. (This is why there are separate variables for $pre_patterns and $post_patterns: these are searched for, but not included in the reported string.) In those files, there is a separate report for each sentence in which the string was found. A master index file, named $fileset_name.html, is also generated; it lists the various strings in alphabetical order, providing links to the relevant pages.
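
To make the grouping concrete, here is a rough sketch of how matches might be collected per distinct string and turned into index entries. It is an illustration under assumptions, not the program's own logic: the sample sentences are invented, the page naming scheme (prefix, hyphen, running number) is hypothetical, and nothing is written to disk.

```perl
use strict;
use warnings;

my $fileset_name   = 'report';
my $search_pattern = q{ (?i) \b colou?r s? \b };

my @sentences = (
    'The Colour was bright.',
    'No color at all.',
    'Colours fade; colour stays.',
    'Nothing here.',
);

# Map each matched string, lower-cased so that case distinctions are
# ignored, to the sentences in which it occurs.
my %hits;
for my $sentence (@sentences) {
    my %seen;   # avoid listing a sentence twice under the same string
    while ($sentence =~ /($search_pattern)/gx) {
        my $key = lc $1;
        push @{ $hits{$key} }, $sentence unless $seen{$key}++;
    }
}

# One index entry per distinct string, in alphabetical order. A real
# report would also need to HTML-escape the matched strings.
my $n = 0;
my @index_lines;
for my $string (sort keys %hits) {
    my $page  = sprintf '%s-%d.html', $fileset_name, ++$n;   # assumed naming
    my $count = @{ $hits{$string} };
    push @index_lines, qq{<li><a href="$page">$string</a> ($count)</li>};
}

print "$fileset_name.html would list:\n";
print "$_\n" for @index_lines;
```

With the sample data, "colour" collects two sentences while "color" and "colours" collect one each, so the index would have three entries.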