Index of /~bkessler/sanskrit-thesis/prog/derive
Name Last modified Size Description
Makefile 25-Jul-2004 18:16 10k
alternatives.c 25-Jul-2004 11:06 1k
alternatives.h 25-Jul-2004 11:06 1k
attribute.c 25-Jul-2004 11:06 1k
attribute.h 25-Jul-2004 11:06 2k
attributes.c 25-Jul-2004 11:06 1k
attributes.h 25-Jul-2004 11:06 1k
avm.c 25-Jul-2004 11:06 2k
avm.h 25-Jul-2004 11:06 2k
big-data 25-Jul-2004 11:06 10k
derive 25-Jul-2004 11:06 61k
derive.c 25-Jul-2004 11:06 2k
dlist.c 25-Jul-2004 11:06 2k
dlist.h 25-Jul-2004 11:06 3k
emeneau-data 25-Jul-2004 11:06 2k
examples.tex 25-Jul-2004 11:06 76k
factor.c 25-Jul-2004 11:06 9k
factor.h 25-Jul-2004 11:06 5k
feature.c 25-Jul-2004 11:06 2k
feature.h 25-Jul-2004 11:06 2k
features.c 25-Jul-2004 11:06 2k
features.h 25-Jul-2004 11:06 3k
goldman-data 25-Jul-2004 11:06 3k
indexedAvm.c 25-Jul-2004 11:06 2k
indexedAvm.h 25-Jul-2004 11:06 2k
item.c 25-Jul-2004 11:06 6k
item.h 25-Jul-2004 11:06 2k
match.c 25-Jul-2004 11:06 1k
match.h 25-Jul-2004 11:06 4k
output.txt.iso8859-1 25-Jul-2004 18:16 57k
output.txt.utf8 25-Jul-2004 18:16 58k
plainSignedFeature.c 25-Jul-2004 11:06 8k
plainSignedFeature.h 25-Jul-2004 11:06 2k
readWordsLex-epilog 25-Jul-2004 11:06 1k
readWordsLex-prolog 25-Jul-2004 11:06 1k
readWordsYacc.h 25-Jul-2004 11:06 1k
readWordsYacc.y 25-Jul-2004 11:06 5k
rule.c 25-Jul-2004 11:06 5k
rule.h 25-Jul-2004 11:06 2k
ruleSystem.c 25-Jul-2004 11:06 2k
ruleSystem.h 25-Jul-2004 11:06 2k
rules.c 25-Jul-2004 11:06 1k
rules.h 25-Jul-2004 11:06 1k
segment.h 25-Jul-2004 11:06 1k
segments.c 25-Jul-2004 11:06 1k
segments.h 25-Jul-2004 11:06 2k
sign.c 25-Jul-2004 11:06 1k
sign.h 25-Jul-2004 11:06 3k
terminal.c 25-Jul-2004 11:06 8k
terminal.h 25-Jul-2004 11:06 4k
test-data 25-Jul-2004 11:06 8k
test-data-old 25-Jul-2004 11:06 8k
test-data-orig 25-Jul-2004 11:06 7k
utfize.pl 25-Jul-2004 18:16 1k
# -*- mode: Fundamental -*- -------------------------------------------- #
# File: derive/README
# Description: Describes the derive/ directory
# Author: Brett Kessler
# Created: 3-May-92
# Modified: Sun May 3 21:22:25 1992 (Brett Kessler)
# Language: English
##############################################################################
The derive/ directory is the final link in the chain of programs. It
produces a program, derive, which can read in a sequence of words,
run the sandhi rules over them, display the exact derivation rule by rule,
and score against an expected output also read in from standard input.
The input is written in the ASCII transcription described in the SEGMENTS
file. A dash "-" must be inserted between syllables and a plus "+"
between words; an equals sign "=" introduces the expected
sandhi; a semicolon ";" ends each example. In addition, where individual
words may have lexical properties that might influence sandhi, they are
written immediately after the word, in square brackets, using the attribute
names defined in SEGMENTS.
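The delimiter conventions above can be illustrated with a small scanner.
This is only a sketch, not the Yacc/Lex parser used by derive: the names
scan_example and example_shape are hypothetical, the segment letters in the
usage note are placeholders rather than real SEGMENTS transcriptions, and
the skipping of whitespace is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Counts of the delimited units found in one example (hypothetical type). */
struct example_shape {
    int input_words;      /* words before the "=" */
    int input_syllables;  /* syllables before the "=" */
    int has_expected;     /* 1 if an "=" was seen */
    int terminated;       /* 1 if the closing ";" was seen */
};

/* Scan one example and count its units.  Text inside [...] is treated
 * as lexical attributes and skipped.  Whitespace is ignored (assumption). */
struct example_shape scan_example(const char *s)
{
    struct example_shape shape = {0, 0, 0, 0};
    int in_brackets = 0;   /* inside a [...] attribute list */
    int in_syllable = 0;   /* inside a run of segment characters */
    int in_word = 0;       /* at least one segment seen since last "+" */

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        char c = *s;
        if (in_brackets) {
            if (c == ']') in_brackets = 0;
        } else if (c == '[') {
            in_brackets = 1;
        } else if (c == ';') {
            shape.terminated = 1;
            break;
        } else if (c == '=') {
            shape.has_expected = 1;
        } else if (shape.has_expected || c == ' ' || c == '\t' || c == '\n') {
            /* expected-output side and whitespace are not counted */
        } else if (c == '-') {
            in_syllable = 0;               /* syllable boundary */
        } else if (c == '+') {
            in_syllable = 0;               /* word boundary */
            in_word = 0;
        } else {
            if (!in_word) { shape.input_words++; in_word = 1; }
            if (!in_syllable) { shape.input_syllables++; in_syllable = 1; }
        }
    }
    return shape;
}
```

For the placeholder example "ka-ta[x]+a-pi=ka-taa-pi;" this scanner reports
two input words and four input syllables, with both the "=" and the ";" seen.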
The program applies each rule at each position in the input string, in
temporal order. After attempting to apply the rule beginning at the last
position, it then sees whether the rule has changed the string (not whether
the rule applied: vacuous applications are ignored). If there is a change,
the program prints out the name of the rule, and its output. When all
the rules have been applied, it checks to see whether the final form matches
the expected output. If not, it flags the error and tells what the
expectation was. At the end of the entire test suite, the program tells
how many errors were encountered.
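The control flow just described can be sketched in a few lines of C. This
is a toy model under stated assumptions, not the code in rule.c or
ruleSystem.c: the type toy_rule and the functions apply_rule and
derive_example are hypothetical, and a "rule" here is merely a same-length
literal replacement instead of a feature-based structural description.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical toy rule: replace `from` with `to` (same length). */
struct toy_rule {
    const char *name;
    const char *from;
    const char *to;
};

/* Try the rule at every position, left to right.  Returns 1 only if the
 * string differs afterwards, so vacuous applications count as no change. */
int apply_rule(const struct toy_rule *r, char *s)
{
    size_t n = strlen(r->from);
    size_t len = strlen(s);
    char before[256];
    size_t i;

    assert(strlen(r->to) == n && len < sizeof before);
    strcpy(before, s);
    for (i = 0; i + n <= len; i++) {
        if (memcmp(s + i, r->from, n) == 0)
            memcpy(s + i, r->to, n);
    }
    return strcmp(before, s) != 0;
}

/* Run all rules in order, printing each rule that changed the string,
 * then compare with the expected form.  Returns 1 on a mismatch, 0 if
 * the final form matches. */
int derive_example(const struct toy_rule *rules, size_t nrules,
                   char *s, const char *expected)
{
    size_t i;

    for (i = 0; i < nrules; i++) {
        if (apply_rule(&rules[i], s))
            printf("%s: %s\n", rules[i].name, s);
    }
    if (strcmp(s, expected) != 0) {
        printf("ERROR: expected %s\n", expected);
        return 1;
    }
    return 0;
}
```

A test driver would add up the return values of derive_example over the
whole suite to obtain the error count reported at the end.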
The output can be written in a format that is easier to view on an
ASCII terminal (default form, or choose "-c ascii"), or in a form that
generates LaTeX output with the proper IPA symbols (provided that the
CSLI phonetic founts and the WSU IPA founts are loaded). This latter
option is "-c latex". Alternatively, the program utfize.pl can be
used to convert the ASCII output to IPA in UTF-8. The default make
rule generates "output.txt.iso8859-1", the ASCII version, and
"output.txt.utf8", the UTF-8 version.
The directory includes several test suites. test-data generates about a
hundred examples, one for each final-initial pair demonstrated in Emeneau
or Goldman. emeneau-data and goldman-data are all the examples in those
two texts (for the latter, in the chapter on sandhi); big-data is a
concatenation of all those, the ultimate stress test of the rule suite.
The code is organized as follows. derive.c is the main implementation
module, which initializes the rule system compiled in from frag/rules.c
and then calls the parser via the interface readWordsYacc.h.
The actual parser is the Yacc program readWordsYacc.y. This
incorporates the tokenizer readWordsLex.c, which is compiled by Lex
from readWordsLex.l. That file in turn is automatically generated by
concatenating the stubs readWordsLex-prolog and readWordsLex-epilog with
frag/segmentsLex and frag/attributesLex, recognizers for segment
transcriptions and word attribute names that were transduced by other
programs from the SEGMENTS file. When the parser reads in an
example, it hands the input and expected output over to
ruleSystem.c. ruleSystem.c manages the actual running of the
test suite, reporting whether the end result matches the expectation.
It calls rules.c to run all the rules against the input string, and
rules.c calls rule.c with all the rules in order. rule.c actually does
the pattern matching, trying the structural description at each terminal
of the input string, in temporal order. The bulk of the C files implement
the individual parts of a rule, and are largely analogous to the files
in rules/. Among these are item.* (sequences of pattern parts; the structural
description and the change are both items), factor.* (each element of a
pattern, including alternatives.*, avm.*, indexedAvm.*), feature.* (top-level
elements in an avm.*, each an alternative-list of plainSignedFeature.*, which
consists of a sign.* and a feature.* ID), and attributes.* (lexical attributes
attached to a word boundary; a list of attribute.*). The data string
is represented as a string type defined in terminal.*, with the terminals
linked together in a doubly linked list (dlist.*). features.* gives an inventory of
information about each feature type (derived from FEATURES), and segments.*
does the same for segments (from SEGMENTS).
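As an illustration of the data-string representation, here is a minimal
doubly linked list of terminals. It is a sketch only: the type toy_terminal
and the functions below are hypothetical and do not reproduce the
interfaces of terminal.h or dlist.h, and a single character stands in for
whatever feature information a real terminal carries.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical terminal: one segment plus links to both neighbours. */
struct toy_terminal {
    char segment;
    struct toy_terminal *prev;
    struct toy_terminal *next;
};

/* Append a new terminal after `tail` (which may be NULL); returns it. */
struct toy_terminal *append_terminal(struct toy_terminal *tail, char segment)
{
    struct toy_terminal *t = malloc(sizeof *t);
    if (t == NULL) abort();
    t->segment = segment;
    t->prev = tail;
    t->next = NULL;
    if (tail != NULL) tail->next = t;
    return t;
}

/* Build a list with one terminal per character; returns the head. */
struct toy_terminal *terminals_from_string(const char *s)
{
    struct toy_terminal *head = NULL, *tail = NULL;
    for (; *s != '\0'; s++) {
        tail = append_terminal(tail, *s);
        if (head == NULL) head = tail;
    }
    return head;
}

/* Unlink and free `t`; returns the (possibly new) head of the list. */
struct toy_terminal *delete_terminal(struct toy_terminal *head,
                                     struct toy_terminal *t)
{
    assert(t != NULL);
    if (t->prev != NULL) t->prev->next = t->next; else head = t->next;
    if (t->next != NULL) t->next->prev = t->prev;
    free(t);
    return head;
}

/* Copy the segments into buf, truncating to fit, NUL-terminated. */
void terminals_to_string(const struct toy_terminal *head,
                         char *buf, size_t size)
{
    size_t i = 0;
    assert(size > 0);
    for (; head != NULL && i + 1 < size; head = head->next)
        buf[i++] = head->segment;
    buf[i] = '\0';
}
```

With links in both directions, a terminal can be removed in constant time
and a pattern can inspect context on either side of the current position.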
Finally, match.* contains types and a function central to performing
the pattern matching operations. But most of the actual logic for the
pattern matching is found in the .c files for each of the data types.
Each of those files is implicitly divided into three sections: one for
creating an item of that structure at initialization time; one for
doing pattern matching of that type of component; and one for doing a
structural change for that kind of object.
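The three-part layout can be pictured with a toy component. The sketch
below is hypothetical throughout (the type toy_factor and its three
functions are not taken from factor.c or any other file here); it only
shows one file-sized unit offering a constructor, a matcher, and a change
operation.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical component: matches one segment and can rewrite it. */
struct toy_factor {
    char wanted;       /* segment the pattern requires */
    char replacement;  /* segment written by the structural change */
};

/* Section 1: creation at initialization time. */
struct toy_factor *toy_factor_create(char wanted, char replacement)
{
    struct toy_factor *f = malloc(sizeof *f);
    if (f == NULL) abort();
    f->wanted = wanted;
    f->replacement = replacement;
    return f;
}

/* Section 2: pattern matching for this kind of component. */
int toy_factor_match(const struct toy_factor *f, char segment)
{
    assert(f != NULL);
    return segment == f->wanted;
}

/* Section 3: structural change for this kind of component. */
void toy_factor_change(const struct toy_factor *f, char *segment)
{
    assert(f != NULL && segment != NULL);
    *segment = f->replacement;
}
```

A caller would create the component once, then call the matcher at each
candidate position and the change operation only where the match succeeds.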
##############################################################################
## End of README
##############################################################################