=head1 search This program searches text corpora for arbitrary regular expressions and produces a report in HTML format. It can read local files, or those available by HTTP or FTP, and it knows how to unpack ZIP files. It requires Perl 5, and the following network modules: F, F, and F. The program requires the presence of a parameter file, F, that specifies information about the search. That is read in and evaluated as Perl code, and so can be used to alter any of the code or variables; but it is intended specifically for the setting of the following variables: =over 2 =item $search_pattern This must be set to a string containing the regular expression to be searched. The search expression uses Perl 5 extended regular expression syntax, so spacing is ignored, and comments can be included in the search pattern. Searches are normally case sensitive, but one can prefix them with (?i) to make them insensitive to case. =item $pre_not_patterns This can be set to a reference to an array of strings that contain regular expressions for material that one does not want to precede the search target. =item $post_not_patterns This can be set to a reference to an array of strings that contain regular expressions for material that one does not want to follow the search target. =item $pre_patterns This can be set to a reference to an array of strings that contain regular expressions for material that one wants to precede the search target. =item $post_patterns This can be set to a reference to an array of strings that contain regular expressions for material that one wants to follow the search target. =item $fileset_name This can be set to a string that specifies a prefix that will begin the names of the HTML files the program generates. That allows one to keep the results of several different searches in the same directory. Otherwise, all pages will begin with the prefix "report". =item $files This is the other mandatory parameter besides C<$search_pattern>. It should be set to a reference to an array of strings, each naming a file to be searched. If the name begins C or C, the appropriate protocols will be used to fetch the file; otherwise it will be assumed to be a local file. The file name can end in a C in which case the entire directory will be searched. If the file name ends in C<.zip>, it will be unzipped before searching. =back The program reads each file into memory before processing. It also scans for a line containing the string C<*END*>. If found, that is considered to be the end of a Project Gutenberg header, and text above that line is ignored. It makes a simple effort to break the text into sentences and search for words within the context of individual sentences. The output format uses a separate HTML page for each distinct string that is matched by the C<$search_pattern>, ignoring case distinctions. (This is why there are separate variables for C<$pre_patterns> and C<$post_patterns>: these are searched, but not included in the reported string.) In those files, there is a separate report for each sentence in which the string was found. A master index file is also generated (by the name of F<$fileset_name.html>), which lists the various strings in alphabetical order, providing links to the relevant pages. =cut use strict; use Net::FTP; use LWP::Simple; use LWP::UserAgent; $main::fileset_name = 'report'; $main::pre_not_patterns = []; $main::post_not_patterns = []; $main::pre_patterns = []; $main::post_patterns = []; $main::text = []; $main::temp_file_no = 0; $main::current_host = ''; $main::left_over = ''; $main::all_caps = 0; &Go; sub Go { require "search-params"; if (!$main::search_pattern) {die "Must specify \$search_pattern string";} if (!$main::files) {die "Must specify \$files";} print STDERR "Recording hits in files beginning '$main::fileset_name'\n"; my $f; foreach $f (@$main::files) {&Process_Arg($f);} } sub Process_Arg { #The arg could be local or ftp or http; a file or a directory. # http URLs are accepted, but without enumeration. my($rl) = @_; my $host; my $path; my $dirs; my $leaf; if ($rl =~ m{^http://}i) { &Process_File(undef, undef, $rl, undef); return; } if ($rl =~ m{^\s*ftp://(.+?)/(.*)}) { $host = $1; $path = $2; } else { $path = $rl; } if ($path =~ m{(.*)/(.*)$}) { $dirs = $1; $leaf = $2; } else { $leaf = $path; } my $ftp; if (defined $host) { $ftp = Open_FTP($host); if (!$ftp) { warn "Couldn't open FTP host $host: skipping $rl"; return; } } my $files; if (defined $dirs) { if (!&CD($dirs, $ftp)) { warn "Couldn't change to directory $dirs: skipping $rl"; return; } } $files = &Enumerate_Files($dirs, $leaf, $ftp); if (!defined($files)) { warn "Couldn't read any files in directory: skipping $rl"; return; } my $file; for $file (@$files) {&Process_File($host, $dirs, $file, $ftp);} } sub Really_Open_FTP { $main::current_ftp = Net::FTP->new($main::current_host); if (!defined $main::current_ftp) {return 0;} if (!$main::current_ftp->login('anonymous', 'poulin@csli.stanford.edu')) { warn "Anonymous login to $main::current_host failed."; return 0; } print STDERR "Anonymous FTP to $main::current_host\n"; $main::current_ftp; } sub Open_FTP { my($host) = @_; if ($main::current_host) { if ($main::current_host eq $host) { if (!&ftp->cwd('/')) {return 0;} return $main::current_ftp; } # We're assuming connection stays open: might be stupid. Probably # should try some simple ping operation. $main::current_ftp->quit; #We won't be greedy: just keep one connection open at a time. } $main::current_host = $host; &Really_Open_FTP; } sub CD { my($dir, $ftp) = @_; if (defined $ftp) { if (!$ftp->cwd($dir)) {return 0;} } 1; } sub Enumerate_Files { my($dirs, $pattern, $ftp) = @_; if (!$pattern) {$pattern = '*';} my $files = []; #print STDERR "Enumerating contents of directory $dirs for |$pattern|\n"; if (defined $ftp) { $files = $ftp->ls($pattern); } else { if (defined($dirs) and $dirs) {$pattern = "$dirs/$pattern";} @$files = glob("$pattern"); } if (scalar @$files > 1 or (scalar @$files == 1 and $files->[0] ne $pattern)) { print STDERR "Directory $dirs contains the following files:\n"; my $f; foreach $f (@$files) {print STDERR " $f\n";} } $files; } sub Check_All_Caps() { #Sample the file to see whether it appears to be in all caps. my $upper = 0; my $lower = 0; my $tries = 0; while ($upper + $lower < 1000 and $tries++ < 1000) { my $line = $main::text->[int(rand(scalar(@$main::text)))]; $lower += $line =~ tr{a-z}{a-z}; $upper += $line =~ tr{A-Z}{A-Z}; } if ($tries >= 1000) { print STDERR "File seems to have little text.\n"; } my $sum = $upper + $lower; if (!$sum) {return;} if (($upper / $sum) > 0.5) {$main::all_caps = 1;} } sub Process_File { my($host, $dirs, $file_name, $ftp) = @_; my $long_name = ''; if (defined $host) { $long_name = "ftp://$host/"; if (defined $dirs) {$long_name = "$dirs/";} } else { if ($file_name !~ m{^http:}i) {$long_name = "file:"} } $long_name .= $file_name; $long_name = "$file_name"; #print STDERR "\n$long_name, $file_name:\n"; if (!&Read_File($file_name, $ftp)) { warn "Can't read $file_name: skipping $long_name"; return; } if (!scalar(@$main::text)) { print STDERR "Empty file.\n"; return; } &Remove_Header; #print "text starts line $main::first_line_num\n"; Check_All_Caps(); my $index = &Search; &Print($index, $long_name); } sub Read_File { my($file_name, $ftp) = @_; $main::text = []; print STDERR "\n$file_name:\n"; if (defined $ftp) { print STDERR "Retrieving by FTP...\n"; if (!&get_by_ftp($ftp, $file_name)) {return 0;} } elsif ($file_name =~ m{^http://}i) { if (!&Read_HTTP($file_name)) {return 0;} } elsif ($file_name =~ m{\.zip$}) { &Read_Local_Zip_File($file_name); } else { print STDERR "Reading...\n"; if (!open(IN, "<$file_name")) {return 0;} &Consume_IN; } 1; } sub Remove_Header { #Looks for a Project Gutenberg header my $i; for ($i = 0; $i < @$main::text; $i++) { if ($main::text->[$i] =~ m{\*END\*}) { $main::line_num = $i + 1; splice(@$main::text, 0, $main::line_num); 1; } last if $i > 1000; } $main::line_num = 1; #TODO: there may be fallback tests for other types of headers, e.g., ABU 0; } sub Search { $main::sent = ''; $main::sent_start = $main::line_num; $main::positions = {}; my $last_word = ''; my $last_token = ''; my $space; my $new_word; while (1) { ($space, $new_word) = &Consume_Word; last unless defined $space; # terminal punct mark, e.g., -- for quotes. if ($last_word =~ m{[.!?:]\W*$} and $new_word =~ m{^\W*[A-Z]} and ($main::all_caps or !($last_token =~ m{[A-Z]\S*\.$} and $space =~ m{^(?:\ |\ *\n)$} and $new_word =~ m{^[A-Z]}))) { #Exclude Mr. Ed and I. Magnin #TODO: or we could consult a list of known abbreviations. That would # normally give better & shorter sentences, but be very bad when an # unknown abbrev is encountered. &Search_sentence; $main::sent = $new_word; $main::sent_start = $main::line_num; } else { $main::sent .= "$space$new_word"; } $last_word = $new_word unless $new_word !~ m{[A-Za-z.?!]}; $last_token = $new_word; } if ($main::sent) {&Search_sentence;} $main::positions; } sub Consume_Word { #print STDERR "Consume_Word($main::text->[0])\n"; my $space = ''; while (1) { if (!@$main::text) {return (undef, undef);} #print STDERR "main::text->[0] = |$main::text->[0]|\n"; if ($main::text->[0] =~ s{^(\S+)}{}) { my $word = $1; return($space, $word); } if ($main::text->[0] =~ s{^(\s+)}{}) { #print STDERR "Leading space = |$1|\n"; $space .= $1; } if (!$main::text->[0]) { #print STDERR "Empty line\n"; #print STDERR "shift\n"; shift @$main::text; $main::line_num++; } } } sub Search_sentence { #TODO: consider splitting up ridiculously long sentences: perhaps look for # internal ; or : or perhaps for !?:. not followed by cap. Or # internal blank lines or indented lines. #print STDERR "Search_sentence($main::sent)\n"; while ($main::sent =~ m{$main::search_pattern}gox) { my $pre = $`; my $word = $&; my $post = $'; #print STDERR "Prelim: |$pre|, $word, |$post|\n"; next unless &Secondary($pre, $main::pre_patterns, 1) and &Secondary($pre, $main::pre_not_patterns, 0) and &Secondary($post, $main::post_patterns, 1) and &Secondary($post, $main::post_not_patterns, 0); print STDERR "$word\n"; my $cap; ($cap = $word) =~ tr{[a-zàâéèêëîïôûç]}{[A-ZÀÂÉÈÊËÎÏÔÛÇ]}; if (!exists $main::positions->{$cap}) {$main::positions->{$cap} = [];} push(@{$main::positions->{$cap}}, [$main::sent_start, "$pre$word$post"]); } } sub Secondary { my($str, $pats, $want_match) = @_; #print STDERR " Secondary($str,$want_match)\n"; my $pat; foreach $pat (@$pats) { #print STDERR " {$pat}\n"; if ($str =~ m{$pat}x) { return 0 unless $want_match; } else { return 0 if $want_match; } } 1; } sub Print { my($positions, $file_name) = @_; #my @words = sort alphabetically keys %$positions; my @words = keys %$positions; my $word; foreach $word (@words) { &OpenFile($word, $file_name); my $poses = $positions->{$word}; my $pos; foreach $pos (@$poses) {&Print_Sentence($pos);} } &Update_Report; } sub alphabetically { my $ax; ($ax = $a) =~ tr{ÀÂÉÈÊËÎÏÔÛÇ}{AAEEEEIIOUC}; my $bx; ($bx = $b) =~ tr{ÀÂÉÈÊËÎÏÔÛÇ}{AAEEEEIIOUC}; $ax cmp $bx; } sub OpenFile { my($word, $file_name) = @_; my $name; $main::word = $word; if (!exists $main::file_names->{$word}) { $name = "$main::fileset_name$main::temp_file_no.html"; $main::file_names->{$word} = $name; $main::temp_file_no++; open(TEMP, ">$name") or die "Can't append to file $name"; print TEMP <<"END"; $word

$word

END } else { $name = $main::file_names->{$word}; open(TEMP, ">>$name") or die "Can't append to file $name"; } $main::print_me = "$file_name"; } sub Print_Sentence { my($pos) = @_; my($line_num, $sent) = @$pos; #print "Print_Sentence($line_num, $char_num, $len)\n"; #TODO: consider bumping line_num to the line where the word itself # appears if ($main::print_me) { print TEMP "

$main::print_me

\n\n"; $main::print_me = ''; } if (!exists($main::word_counts->{$main::word})) { $main::word_counts->{$main::word} = 0; } $main::word_counts->{$main::word}++; $sent =~ s{\s+}{ }g; my $pr = "\n

$line_num: $sent"; #TODO: consider using built-in \n for a breaking point? while (length($pr) > 79) { my $line = substr($pr, 0, 79); my $rem = substr($pr, 79); if ($line =~ m{(.*)\s+(.*)}) { print TEMP "$1\n"; $pr = "$2$rem"; } else { print TEMP "$line\n"; $pr = $rem; } } if ($pr) {print TEMP "$pr\n";} return; } sub Update_Report { my @words = sort alphabetically keys %$main::file_names; open(REP, ">$main::fileset_name.html") or die; print REP <<'END'; search report

END my $word; foreach $word (@words) { my $count = $main::word_counts->{$word}; next unless $count; my $file = $main::file_names->{$word}; print REP "
  • $word [$count]\n"; } print REP <<'END';
  • END close(REP); } sub get_by_ftp { my($ftp,$remote) = @_; if ($remote =~ m{\.zip$}) { return &Get_Zip_By_FTP($ftp,$remote); } my($loc,$len,$buf,$data); delete ${*$ftp}{'net_ftp_port'}; delete ${*$ftp}{'net_ftp_pasv'}; $data = $ftp->retr($remote) or return undef; $buf = ''; $main::left_over = ''; do { $len = $data->read($buf,1024); } while($len > 0 && &Store($buf)); if ($main::left_over) {push(@$main::text, $main::left_over);} $data->close(); # implied $ftp->response return 1; } sub Store { my($data) = @_; $data = $main::left_over . $data; if ($data =~ s{((?:.|\n)*\n)(.*)}{$1}) { $main::left_over = $2; } else { $main::left_over = $data; return; } my(@lines) = $data =~ m{(.*\n)}g; push(@$main::text, @lines); } sub Get_Zip_By_FTP { my($ftp,$remote) = @_; my $result = $ftp->get($remote, "temp.zip"); if (!$result) {return $result;} $result = &Read_Local_Zip_File("temp.zip"); unlink("temp.zip"); return $result; } sub Read_Local_Zip_File { my($name) = @_; open(IN, "unzip -p $name|") or return 0; print STDERR "Unzipping and reading...\n"; &Consume_IN; close(IN); } sub Consume_IN { my $line; while (defined($line = )) { push(@$main::text, $line); } } sub Read_HTTP { my($name) = @_; if ($name =~ m{\.zip$}) { if (!&Read_HTTP_to_file($name, 'temp.zip')) {return 0;} if (!&Read_Local_Zip_File('temp.zip')) {return 0;} unlink('temp.zip'); return 1; } else { return &Read_HTTP_to_memory($name); } } sub Read_HTTP_to_file { my($url, $local) = @_; my $ua = new LWP::UserAgent; my $req = new HTTP::Request('GET', $url); my $res = $ua->request($req, $local); if ($res->is_success) { return 1; } 0; } sub Read_HTTP_to_memory { my($url) = @_; my $ua = new LWP::UserAgent; $main::left_over = ''; my $res = $ua->request(HTTP::Request->new('GET', $url), \&Store); if (!$res->is_success) { return 0; } if ($main::left_over) {push(@$main::text, $main::left_over);} 1; } __END__ To do: Gutenberg texts are in: ftp://sunsite.unc.edu/pub/docs/books/gutenberg/etext90/ .. etext97/ Can we find a nearer and more robust server? (Also ftp://uiarchive.cso.uiuc.edu/pub/etext/gutenberg/... ) Some French texts are available via: http://un2sg1.unige.ch/www/athena/html/francaut.html http://www.cnam.fr/ABU/ Instead of reading in whole text, do a floating window of text --- enough for a reasonable context buffer. Problem here is that some things become tricky, such as scanning for a header. (Perhaps it's safer not to try looking for a header anyway, unless we explicitly ask for it.) This is very Gutenberg-centric anyway. Will double-spaced text pose a problem? Maybe not. Sentence boundary test is ignoring paragraph boundaries. Perhaps that's not important anyway; a fix might break in poetry or double-spaced text. Sentence boundaries: Often we'll get -- between sentences to mark change of speaker. Perhaps we could ignore all punct between . and cap'ed word.