Index of /~bkessler/sanskrit-thesis/prog
Name Last modified Size Description
Parent Directory 26-Jul-2004 18:40 -
FEATURES 25-Jul-2004 11:05 1k
Makefile 25-Jul-2004 18:33 1k
RULES 25-Jul-2004 11:06 9k
SEGMENTS 25-Jul-2004 11:05 3k
derive/ 25-Jul-2004 18:16 -
features/ 25-Jul-2004 11:06 -
frag/ 25-Jul-2004 11:06 -
include/ 25-Jul-2004 11:06 -
lib/ 25-Jul-2004 11:06 -
rules/ 25-Jul-2004 11:06 -
segments/ 25-Jul-2004 11:06 -
# -*- mode: Fundamental -*- -------------------------------------------- #
# File: README
# Description: Describes the sandhi directory.
# Author: Brett Kessler
# Created: 2-May-92
# Modified: Sat Sep 26 09:45:47 1992 (Brett Kessler)
# Language: English.
##############################################################################
This directory contains code for testing a generative description of
the sandhi system of classical Sanskrit. Snapshot overview: the
program derive/derive reads the rules in the RULES file, applies them
in sequential order to a test suite, and generates output as in
/derive/output.txt(.utf8), which shows which rules apply to transform the
underlying forms to a surface form.
The tester is the program derive in the subdirectory derive; for example,

    cd derive
    ./derive < test-data

The data files in this directory are text files that are meant to be
comfortably manipulable by human beings. The intention is that the
researcher can test alternative theories of the sandhi system by editing
these data files and not altering any computer code; these files in turn
could be shown to colleagues without requiring them to read computer code
either. It is however necessary to recompile the system after making
changes (i.e., by typing `make` in this directory).
All the data files recognize the double-solidus // to mean that all the
remaining text on a line is a comment to be ignored by the processing
programs.
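
The comment convention can be sketched in C as follows. This is only an illustration, not the code the processing programs actually use; the function name strip_comment is hypothetical, and the sketch assumes that "//" never occurs inside a token that should be kept.

```c
#include <string.h>

/* Truncate `line` in place at the first "//", so that the comment
 * (and the marker itself) is dropped.  Returns `line` for convenience.
 * Assumption: "//" never occurs inside a token that should be kept. */
char *strip_comment(char *line)
{
    char *marker = strstr(line, "//");

    if (marker != NULL)
        *marker = '\0';
    return line;
}
```

Text before the marker, including any trailing blanks, is left untouched; a line that consists only of a comment becomes the empty string.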
FEATURES describes the phonemic features that will be used in the system.
It assumes a simple sort of feature geometry, where a feature can either
be binary, or takes as value a set of features. This is coded as an
attribute-value matrix (AVM), with the attributes specifying a feature
name, and the value being either "binary" or a sub-AVM. Attributes are
separated from values by a colon; attribute-value pairs are separated from
each other by a semicolon; and square brackets enclose each AVM, including
the top-level one. As a "magic number", the first non-comment token in
the file is the word "FEATURES". White space (blanks, tabs, new lines)
is ignored. For readability, I am capitalizing the names of composite
features, but this is irrelevant to the processors. All feature names
do have to begin with some Roman letter, and consist entirely of
letters, digits, and underscores. Case is distinctive. Feature names
must be distinctive even across composite features.
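
As a rough illustration of the naming rule, the check below accepts a name only if it starts with a letter and continues with letters, digits, or underscores. The function name is hypothetical, the feature names in the comment are invented for the example, and the sketch assumes the default "C" locale, in which isalpha and isalnum recognize only the unaccented Roman letters.

```c
#include <ctype.h>

/* A hypothetical FEATURES file might look like this:
 *
 *   FEATURES
 *   [ Laryngeal : [ voiced : binary; aspirated : binary ];
 *     nasal : binary ]
 *
 * Return 1 if `name` is a well-formed feature name (first character a
 * letter, the rest letters, digits, or underscores), otherwise 0.
 * Assumes the default "C" locale. */
int is_valid_feature_name(const char *name)
{
    const unsigned char *p = (const unsigned char *)name;

    if (!isalpha(*p))
        return 0;               /* also rejects the empty string */
    for (++p; *p != '\0'; ++p) {
        if (!isalnum(*p) && *p != '_')
            return 0;
    }
    return 1;
}
```

The check says nothing about uniqueness; a real processor would also have to reject a name that is declared twice, even under different composite features.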
SEGMENTS describes the feature bundles that appear in the underlying
lexical representations or at some point in the derivation of
utterances; it also declares the lexical attributes that are referred
to by the rules. The description of the segments is introduced by the
token "SEGMENTS". For each segment is declared a sequence of ASCII
characters that will be used to represent the segment in plain text.
This is for example the form that is used in the RULES file, and which
will be used when printing out derivations on ASCII terminals. There
are few restrictions on what characters can be used here, other than
that white space is excluded, but one should probably avoid character
strings that have special meaning in the RULES file. Segmentation
ambiguities on reading will be resolved in favour of the longest string;
so segments "t", "h", and "th" can all be defined, and if the input
contains the sequence "th", only the last will be matched. After the
ASCII representation appears, separated from the former by white space,
the string that should be used in printing the segment in environments
where LaTeX is available. Syntactically, this can consist of anything
except white space, equals sign, semicolon or solidus; semantically, the
system is currently set up to recognize any character in LaTeX, WSU IPA,
or the CSLI phonetic LaTeX macro sets. After the LaTeX specification
appears an equals sign introducing a list of base (leaf, non-composite)
features that are positive for this segment. These features must be
listed in the FEATURES file, and are separated by white space. The
segment declaration is terminated by a semicolon.
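
The longest-match rule for ASCII segment names can be sketched like this. It is a minimal linear search written for clarity; the generated code in frag/ may well use a different technique, and the function name is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Return the index in `segments` of the longest entry that is a prefix
 * of `input`, or -1 if no entry matches.  Empty entries never match. */
int longest_segment(const char *input,
                    const char *const segments[], size_t count)
{
    int best = -1;
    size_t best_len = 0;
    size_t i;

    for (i = 0; i < count; ++i) {
        size_t len = strlen(segments[i]);

        if (len > best_len && strncmp(input, segments[i], len) == 0) {
            best = (int)i;
            best_len = len;
        }
    }
    return best;
}
```

With the segments "t", "h", and "th" declared in that order, the input "tha" selects index 2 ("th"), while "ta" selects index 0 ("t").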
The next section of this file is introduced by the header "ATTRIBUTES"
and consists of a list of lexical features, separated by white space.
The RULES file is more complex. It has two parts, introduced by the
headers "ONSETS" and "RULES". Under ONSETS is given a formulation of
what onsets a syllable may have. This is in the form of what is here
called an "item" pattern. An item is a sequence of factors, normally
separated by white space; the sequence corresponds to the temporal
sequencing of phone segments. A factor is an optional item (enclosed
in parentheses), a repeated item (followed by a star), a set of
alternative items (enclosed in curly brackets and separated by commas),
or some sort of terminal designation, which must match the current
item in the data string. These last include segments (any of the
phones listed in SEGMENTS, designated by their ASCII code) or feature
bundles. Feature bundles are enclosed in square brackets, and consist
of a series of features, separated by white space and/or optional
commas; [] matches any segment. The features may be any of the names
declared in FEATURES. Each is preceded by + or - to show that the
current feature is or is not present; for a composite feature, the
symbols can be taken to mean that some or no subordinate feature is
present. Alternatively, it can be preceded by one of the words
"alpha", "beta", "gamma" or "delta", which are variables, which must
all have the same values within the rule in order to get a successful
match. A single variable can be used on different features within a
rule, except for composite features (since they don't share type with
any other feature). An AVM may also contain an alternation of feature
sets, using the braces syntax as for item alternatives. Finally, an
AVM may be followed by an index (a single lower-case letter preceded
by an underscore); two AVMs having the same index within a rule must
match the same segment.
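
Matching a feature bundle against a segment can be pictured with bit masks, one bit per base feature. The sketch below covers only the + and - cases, not variables, alternations, or indices. The type names, function names, and the three feature masks are all invented for the illustration and are not the identifiers the system generates.

```c
typedef unsigned long FeatureSet;

/* Hypothetical masks for three base features. */
#define F_VOICED    (1UL << 0)
#define F_ASPIRATED (1UL << 1)
#define F_NASAL     (1UL << 2)

typedef struct {
    FeatureSet plus;    /* base features marked + in the bundle */
    FeatureSet minus;   /* base features marked - in the bundle */
} Bundle;

/* A segment matches if it has every + feature and none of the - ones.
 * The empty bundle [] has both masks zero and so matches anything. */
int bundle_matches(const Bundle *b, FeatureSet segment)
{
    return (segment & b->plus) == b->plus
        && (segment & b->minus) == 0;
}

/* For a composite feature, whose mask is the union of the masks of its
 * subfeatures, + means that some subordinate feature is present. */
int composite_present(FeatureSet composite_mask, FeatureSet segment)
{
    return (segment & composite_mask) != 0;
}
```

For example, the bundle [+voiced -nasal] matches a segment whose only positive feature is voiced, but not one that is both voiced and nasal.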
The RULES section is a list of rules, which apply in the order listed.
Each rule consists of a name, which is anything delimited by double
quotes; an equals sign; then the rule proper, which is a source
description, arrow ("->"), change description, and then an
environment, introduced by a solidus. The rule is terminated by a
semicolon. The source description and change description are "items" as
described above, except that the special factor "0" may be used to
designate a null source or change. Unlike ONSETS, RULES may also
mention boundaries. The hierarchical organization of segments into
syllables, words, and utterances is for the sake of convenience here
represented by overt boundary markers that appear in line with the
segment syllables. SB marks the edge of syllables, WB of words, and
UB of utterances. They cannot be matched by segment terminals; that
is, a rule applies only within a syllable except insofar as it
explicitly mentions SB. The same applies mutatis mutandis for WB and UB,
except that in this particular rule system I explicitly "erase word
boundaries" early in the derivation. This is an artifice meaning that
subsequent rules ignore word boundaries. While word boundaries
survive, they may be annotated with any of the attributes mentioned in
the ATTRIBUTES section of the SEGMENTS declaration, preceded by + or -
to show that the preceding word must have or not have the indicated
attribute. These annotations are surrounded by angled brackets.
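
The overall shape of a rule (name, equals sign, source, arrow, change, solidus, environment, semicolon) can be illustrated by a function that merely cuts one rule into its four textual parts. This is a sketch, not the parser in rules/. It assumes comments have already been removed, that the name contains no double quote, and that no part exceeds the fixed buffer size. All identifiers are hypothetical.

```c
#include <stddef.h>
#include <string.h>

#define PART_MAX 128

typedef struct {
    char name[PART_MAX];
    char source[PART_MAX];
    char change[PART_MAX];
    char environment[PART_MAX];
} RuleText;

/* Copy the text in [begin, end) into dst without surrounding white
 * space.  Returns 0 if the text does not fit, 1 otherwise. */
static int copy_trimmed(char *dst, const char *begin, const char *end)
{
    size_t len;

    while (begin < end
           && (*begin == ' ' || *begin == '\t' || *begin == '\n'))
        ++begin;
    while (end > begin
           && (end[-1] == ' ' || end[-1] == '\t' || end[-1] == '\n'))
        --end;
    len = (size_t)(end - begin);
    if (len >= PART_MAX)
        return 0;
    memcpy(dst, begin, len);
    dst[len] = '\0';
    return 1;
}

/* Split   "name" = source -> change / environment ;   into its parts.
 * Returns 1 on success, 0 if a delimiter is missing or a part is too
 * long. */
int split_rule(const char *text, RuleText *out)
{
    const char *open_q, *close_q, *equals, *arrow, *slash, *semi;

    if ((open_q = strchr(text, '"')) == NULL) return 0;
    if ((close_q = strchr(open_q + 1, '"')) == NULL) return 0;
    if ((equals = strchr(close_q + 1, '=')) == NULL) return 0;
    if ((arrow = strstr(equals + 1, "->")) == NULL) return 0;
    if ((slash = strchr(arrow + 2, '/')) == NULL) return 0;
    if ((semi = strchr(slash + 1, ';')) == NULL) return 0;

    return copy_trimmed(out->name, open_q + 1, close_q)
        && copy_trimmed(out->source, equals + 1, arrow)
        && copy_trimmed(out->change, arrow + 2, slash)
        && copy_trimmed(out->environment, slash + 1, semi);
}
```

Given the invented rule text `"devoice" = [+voiced] -> [-voiced] / __ UB ;`, the parts come out as the name devoice, the source [+voiced], the change [-voiced], and the environment "__ UB". The spelling `__` for the environment bar is an assumption made for this example only.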
The rule environment is an item like the source or change description,
except that it furthermore must contain (for each normal disjunction)
an environment bar. The overall functioning of the rule is as one
would expect in the field of linguistics: if the utterance string
contains the structural description (the environment with the source
description substituted for the environment bar), the source
description is replaced by the change description. The source may be
null (or "0") allowing absolute insertion, or the change may be null
(or "0"), allowing for deletion. If the change includes an AVM, the
matched segment is changed to add or delete the features mentioned in
the AVM; if the source matches a sequence of more than one segment, an
AVM may be mentioned in the change only if it is indexed to some other
segment in the rule. The change description may mention a composite
feature only negatively, meaning that it becomes - for all
subfeatures; specifying a + for such a feature would be inherently
ambiguous. In this system, rules apply at every point in the string
where the structural description matches, in the temporal order of the
segment string ("left to right"), so that the output of a rule may
feed a later operation of the rule in the same string.
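
The effect of left-to-right application can be seen with a toy rule, written in the usual linguistic notation as a -> b / b __ ("a becomes b after b"). The sketch below hard-codes that one rule over single-character segments; it is an illustration of the feeding behaviour only, and bears no relation to how derive/ actually represents rules or strings.

```c
#include <stddef.h>

/* Apply the toy rule  a -> b / b __  to `s` in place, scanning from
 * left to right.  Because each change is written back before the scan
 * moves on, a newly created b can trigger the rule on the following
 * segment.  Returns the number of segments changed. */
int apply_left_to_right(char *s)
{
    int changes = 0;
    size_t i;

    for (i = 1; s[i - 1] != '\0' && s[i] != '\0'; ++i) {
        if (s[i] == 'a' && s[i - 1] == 'b') {
            s[i] = 'b';
            ++changes;
        }
    }
    return changes;
}
```

Applied to "baaa", the rule changes three segments and yields "bbbb", since each new b feeds the next application. Had all matches been located in the original string before any change was made, only the first a would qualify and the result would be "bbaa".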
The rest of this directory, other than the Makefile, consists of
subdirectories. Four of them (features/, segments/, rules/, and
derive/) are active code directories. The first three preprocess the
data files that have been discussed above, placing into frag/ code
fragments that help later programs manipulate features, segments,
and rules without the overhead of reading these text files. The code
in the directories must be built in this order (as shown in the
Makefile), since each contributes code to the next. In particular,
within the features/ directory is built the xduceFeatures program,
which reads FEATURES and produces frag/ code (featuresLex, featuresH,
featuresC) that assigns unique IDs and masks to each feature and helps
the segments/ and rules/ code to read references to features, and
helps the derive/ code to manipulate features in rules and data
strings. The segments/ directory builds the xduceSegments program,
which reads the SEGMENTS file and produces frag/ code that helps the
rules/ code read references to segments within RULES, and lets the
derive/ code read word transcriptions and manipulate segments as
integer bit-mask bundles. The rules/ directory builds the xduceRules
program, which reads the RULES file and writes the rules as one huge
C expression which the derive/ code can quickly load into memory at
start-up time. The derive/ code compiles the program derive, which
reads sequences of words, runs the sandhi rule suite over them, displaying
the output of every rule that produced a change, and checking for errors
by comparing the final output against the predicted output it reads from
the input stream.
To build and run "derive", type `make` in the top-level directory.
The C code is written in conformance to the ANSI C standard, using
prototypes. It has been implemented and tested on HP 9000/300 series
workstations running HP-UX 8.0, and also in Linux (SuSE) on Intel,
using GNU gcc. The general organizing principle, which will be
especially noticeable and helpful in the larger subdirectories, has
been to organize by major data types. Thus the data type "AVM" will
be declared in a .h file of the same name (but not capitalized), which
provides the typedef and the prototypes for all external functions on
that data type, which functions are implemented in the corresponding
.c file. All external functions are given names with a prefix
matching the data type name (plus underscore), so one can immediately
infer which .h and .c files the function is declared and implemented
in.
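
The naming convention can be pictured with a cut-down example. Both the struct layout and the three function names below are invented to show the pattern; they are not copied from the real avm.h and avm.c, and the two files are shown in a single listing only for brevity.

```c
/* ---- avm.h (hypothetical contents) ------------------------------ */
#ifndef AVM_H
#define AVM_H

typedef struct AVM {
    unsigned long plus;     /* features required to be present */
    unsigned long minus;    /* features required to be absent  */
} AVM;

/* Every external function on the type carries the prefix "AVM_". */
void AVM_clear(AVM *avm);
void AVM_require(AVM *avm, unsigned long feature);
int  AVM_isEmpty(const AVM *avm);

#endif /* AVM_H */

/* ---- avm.c (hypothetical contents) ------------------------------ */
void AVM_clear(AVM *avm)
{
    avm->plus = 0;
    avm->minus = 0;
}

void AVM_require(AVM *avm, unsigned long feature)
{
    avm->plus |= feature;
}

int AVM_isEmpty(const AVM *avm)
{
    return avm->plus == 0 && avm->minus == 0;
}
```

A reader who meets a call such as AVM_require elsewhere in the code can tell from the prefix alone that its prototype is in avm.h and its body in avm.c.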
##############################################################################
## End of README
##############################################################################