Index of /~bkessler/RelSoundLetMono/src

      Name                    Last modified       Size  Description

[DIR] Parent Directory        27-Aug-2004 14:00      -
[TXT] words-aligned           19-Jul-2004 15:50   118k
[TXT] pgive.c                 19-Jul-2004 15:50    15k
[TXT] cond_cons.pl            19-Jul-2004 15:50    38k
[TXT] Makefile                19-Jul-2004 15:50     1k
[   ] CondCons.tgz            19-Jul-2004 15:50    64k
[TXT] COPYING                 19-Jul-2004 15:50    18k

 -*-  coding: utf-8-unix; -*-

CondCons - computes conditional consistencies on sound-spelling links
Copyright © 2003 by Brett Kessler   

This file is part of CondCons.

CondCons is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.

CondCons is distributed in the hope that it will be useful,
but without any warranty; without even the implied warranty of
merchantability or fitness for a particular purpose.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with CondCons; if not, write to the Free Software
Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA


cond_cons.pl processes a file of lexical data and generates a set of
HTML pages that report the consistencies of letter-to-sound and
sound-to-letter correspondences. Its main task is to tell by how much
the consistency is increased when some other part of the syllable is
taken into consideration, and to compute whether that improvement is
significantly greater than chance. Because the analysis is framed in
terms of the parts of a single syllable, the operation is only
defined for monosyllabic words.

The program as distributed follows the algorithms and uses the data
sets described in the following paper. More information about that
paper can be found at http://brettkessler.com/RelSoundLetMono/ --

Kessler, B., & Treiman, R. (2001). Relationships between sounds and
letters in English monosyllables. Journal of Memory and Language, 44,
592-617.


Requirements:

C compiler.  The program was developed with the GNU Compiler
Collection (gcc) version 3.2; I have not tested what other compilers
it works with, but suspect it will compile under any ANSI C compiler.

Perl 5. The program was developed under Perl 5.8.0.

make - useful but not mandatory

The programs will probably run out of the box on a Linux or Unix
machine, including Mac OS X (although you may first have to install
the developer tools), but Windows does not normally ship with the
required tools.  Perl can be installed for free on all popular
operating systems (see http://www.perl.com). Windows users may be able
to obtain gcc by using Cygwin (http://www.cygwin.com). The programs
are simply command-line tools with no graphical user interface; they
run under a shell (Windows command prompt, Mac Terminal).  Note that I
have made no effort to test or develop for portability.


Installation:

Download the CondCons.tgz file; unpack it (e.g., tar xvzf
CondCons.tgz). Alternatively, download each of the other files in this
directory to a local directory individually.


Running:

The main requirement for running CondCons is having a data file. The
supplied file "words-aligned" contains the data used for the
above-mentioned paper. If you wish to run with data of your own,
you will need to either modify the cond_cons.pl program, or follow the
existing format fairly closely.

1. Phonetic transcription.

The program currently assumes a fairly idiosyncratic phonetic
transcription. If you have a choice, the path of least resistance may
be to follow that transcription in your own work. The key is:

Vowels:
a  At
A  Odd, Arm
c  OUGHt
C  OIntment
e  Ape
E  Ebb
i  EAt
o  OAt
R  IRk
u  OOze
U  pUt
V  Up
W  OUt
y  Id
Y  I

Consonants:
b  Bee
d  Dog
D  Jump
f  Foot
g  Go
G  siNG
h  Hop
j  Yes
k  Kiss
l  Love
m  Mop
n  Nap
p  Pot
q  THick
Q  THis
r  Read
s  Sip
S  SHip
t  Top
T  CHop
v  Vie
w  Wide
z  Zoo
Z  beiGe

No other symbols, such as stress marks, are used.  More details of
application can be seen by inspecting the words-aligned file.
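
Before running cond_cons.pl on your own data, it may help to flag
pronunciations that use symbols outside the key above. The following
Python sketch is not part of CondCons; the function name
unknown_symbols is made up for illustration, and the two symbol sets
are simply copied from the key.

```python
# Symbol inventory copied from the transcription key above.
VOWELS = set("aAcCeEioRuUVWyY")
CONSONANTS = set("bdDfgGhjklmnpqQrsStTvwzZ")


def unknown_symbols(pronunciation):
    """Return the characters of a pronunciation that are not in the key."""
    known = VOWELS | CONSONANTS
    return [ch for ch in pronunciation if ch not in known]


print(unknown_symbols("Syp"))   # "ship" transcribed with the key
print(unknown_symbols("S'yp"))  # a stress mark is not an allowed symbol
```

An empty list means every symbol was recognized.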

CondCons assumes that the data file (such as words-aligned) gives not
only the spelling and pronunciation for each word, but also an
alignment of the letters to the sounds. The program was designed to
work on a fairly fine-grained alignment, ideally one letter to one
sound, except where one-many analyses are unavoidable (as in "ship",
where "sh" = /S/, or "fox", where "x" = /ks/). This is spelled out as a
series of letter=sound pairs, each pair separated by a space, e.g.:

sh=S i=y p=p
   or
f=f o=A x=ks

The transcription also assumes that all letters represent something,
with the possible exception of final (or nearly final) "e", which can
be a marker; this is represented as e=0 (zero), e.g.:

f=f r=r a=e m=m e=0

The first thing the program does is repackage this sort of
transcription so that (1) correspondences are at the level of entire
syllable chunks, i.e., onset, vowel, coda; (2) silent "e" is assigned
to some part of the syllable, either the vowel, coda, or both.
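
To make the repackaging step concrete, here is a rough Python sketch
of splitting an alignment into pairs and grouping them into onset,
vowel, and coda. It is an illustration only, not the algorithm in
cond_cons.pl: the names parse_alignment and regroup are hypothetical,
the vowel is assumed to be the first pair whose sound contains a vowel
symbol, and silent "e" is merely counted rather than assigned, since
assignment depends on spelling conventions not reproduced here.

```python
VOWELS = set("aAcCeEioRuUVWyY")  # vowel symbols from the key above


def parse_alignment(text):
    """Split 'sh=S i=y p=p' into [('sh', 'S'), ('i', 'y'), ('p', 'p')]."""
    pairs = []
    for item in text.split():
        spelling, sep, sound = item.partition("=")
        if not (sep and spelling and sound):
            raise ValueError(f"malformed alignment pair: {item!r}")
        pairs.append((spelling, sound))
    return pairs


def regroup(pairs):
    """Group letter=sound pairs into onset, vowel, and coda chunks.

    Silent "e" (e=0) is only counted here, not assigned to a chunk.
    """
    silent_e = sum(1 for pair in pairs if pair == ("e", "0"))
    sounded = [pair for pair in pairs if pair != ("e", "0")]
    # Assumption: the vowel is the first pair whose sound has a vowel symbol.
    nucleus = next((i for i, (_, sound) in enumerate(sounded)
                    if VOWELS & set(sound)), None)
    if nucleus is None:
        raise ValueError("no vowel found in alignment")

    def join(chunk):
        return ("".join(s for s, _ in chunk), "".join(p for _, p in chunk))

    return {
        "onset": join(sounded[:nucleus]),
        "vowel": join(sounded[nucleus:nucleus + 1]),
        "coda": join(sounded[nucleus + 1:]),
        "silent_e": silent_e,
    }


print(regroup(parse_alignment("f=f r=r a=e m=m e=0")))
```

Each chunk is returned as a (spelling, sound) tuple, so an empty onset
comes back as a pair of empty strings.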

To do this, the program needs to know a good deal about English
spelling conventions, and so your data needs to follow the phonetic
transcriptions it knows about. On the other hand, if your data file is
already set up so that silent "e" is assigned to the right parts --
e.g., you have an alignment like f=f r=r aE=e m=m -- then you can tell
the program to dispense with silent-e assignment. In that case, you
have a great deal more freedom in coming up with your own
transcription system. The main requirement then is that each phoneme
must be represented by a single, unique, one-byte
character. (Technically, the requirement is that vowels must be
one-byte characters, and that consonants must not contain a vowel
symbol as part of their representation.)  This means that
representations like "oi" for the diphthong here coded as /C/ are out
unless you modify the program.
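
If you do design your own symbol set, the technical requirement above
can be checked mechanically. The Python sketch below is an
illustration under two assumptions that are mine, not the program's:
the function name inventory_problems is hypothetical, and "one-byte"
is tested against a UTF-8 encoding of the data file.

```python
def inventory_problems(vowels, consonants):
    """List ways a symbol inventory breaks the stated requirements.

    vowels and consonants are lists of strings, one string per phoneme.
    """
    problems = []
    for v in vowels:
        # Assumption: the data file is UTF-8, so one byte means one ASCII character.
        if len(v.encode("utf-8")) != 1:
            problems.append(f"vowel {v!r} is not a single one-byte character")
    vowel_chars = set("".join(vowels))
    for c in consonants:
        if vowel_chars & set(c):
            problems.append(f"consonant {c!r} contains a vowel symbol")
    return problems


# The diphthong coded as /C/ in the key cannot be written "oi".
print(inventory_problems(["a", "oi"], ["t", "ks"]))
```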

2. Data file.

The lexical file you are processing needs to be a plain text file
(not, e.g., RTF; if you use a word processor like Word, remember to
save in plain text .txt format).  It must have one line per
entry. Each line consists of eight pieces of data with a tab between
each one.  The data are exemplified by the following:

aid	ed	7	1	l	0	VC	ai=e d=d

The first line of the file is a header intended for human consumption,
and can in fact be anything; it is ignored by the program.

In order, these fields are:

spelling. This can be in virtually any format.

pronunciation - as discussed above.

frequency - this should be a fairly small integer; it is intended to
be the logarithm of actual frequency. If this number becomes too large
(e.g., if actual frequencies from a large corpus are used), the program
may run very slowly. If you are not interested
in frequency weighting, this field can be set to any value, in which
case you will want to run cond_cons with the flag --notokens (see below).

child-frequency - an integer telling how frequent the word is in
children's vocabulary. Confusingly, this one is meant to be actual
frequency, rather than log frequency. Normally the program will do
separate additional analyses over the words that have a value of 20 or
more in this field, although that cutoff can be changed
(--children-cutoff=) or the entire analysis can be skipped
(--nochildren); for the options, see below.

lexical stratum - this field is ignored. It is intended to have "l"
for lexical words and "g" for grammatical words, but you can set this
to anything you want.

inflected - this field is ignored. 

pattern - this field is intended to give the C-V pattern of the
word. Words that are marked CVC will be included in special
consonant-vowel-consonant tests. If you are not interested in this
feature, you can place anything in this field. You can then run with
the --nocvc option to skip the reports entirely (see options below).

alignment - space-delimited alignments of spell=pron, as discussed
above. It is intended that this alignment be at the granularity of
individual sounds and letters, insofar as appropriate. But actually,
it can be at any granularity no coarser than Onset-Vowel-Coda, which
is what cond_cons converts the alignment to anyway. The rules are
that each alignment pair is spelling and pronunciation, separated by
a = sign; the alignment pairs are separated by a single space.  E.g.:

f=f r=r au=c d=d
   or
fr=fr au=c d=d

The alignment pair e=0 has a special meaning: it marks silent "e".
Unless cond_cons is run with the flag --noassign-silent-e, it will be
reassigned to the vowel and/or to the coda depending on the other
sound-spelling correspondences in the word.
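
The eight-field layout described above can be checked with a few lines
of Python before handing a file to cond_cons.pl. This is a sketch, not
the parser the program uses: the Entry class and the functions
parse_line and read_entries are hypothetical names, blank lines are
skipped as a convenience, and both frequency fields are assumed to
hold integers.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    spelling: str
    pronunciation: str
    frequency: int        # intended to be log frequency
    child_frequency: int  # actual frequency in children's vocabulary
    stratum: str          # ignored by cond_cons
    inflected: str        # ignored by cond_cons
    pattern: str          # e.g. "CVC" or "VC"
    alignment: list       # list of (spelling, sound) pairs


def parse_line(line):
    """Turn one tab-separated data line into an Entry."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 8:
        raise ValueError(f"expected 8 tab-separated fields, got {len(fields)}")
    pairs = [tuple(item.split("=", 1)) for item in fields[7].split(" ")]
    if any(len(pair) != 2 for pair in pairs):
        raise ValueError(f"malformed alignment: {fields[7]!r}")
    return Entry(fields[0], fields[1], int(fields[2]), int(fields[3]),
                 fields[4], fields[5], fields[6], pairs)


def read_entries(lines):
    """Parse an iterable of lines, skipping the header line."""
    it = iter(lines)
    next(it, None)  # the first line is a header and is ignored
    return [parse_line(line) for line in it if line.strip()]


# The header text here is made up; the data line is the example given above.
sample = ["header line, ignored\n", "aid\ted\t7\t1\tl\t0\tVC\tai=e d=d\n"]
for entry in read_entries(sample):
    print(entry.spelling, entry.alignment, entry.child_frequency >= 20)
```

The last printed value shows whether the word would pass the default
child-frequency cutoff of 20.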


Running the program.

The provided Makefile will compile pgive if necessary, then invoke
cond_cons to analyse the data file "words-aligned" and build an HTML
report in the out/ subdirectory, with main HTML file called
"results.html".  To do this, just cd to the CondCons directory and
type `make`. (You may first need to install the make program if you
are not running Linux or Unix; it comes with Mac OS X developer tools,
or Cygwin under Windows.)  Beware, though, that running the program
can take a long time on some computers, because it runs many analyses,
and for each analysis, it runs a fairly time-consuming Monte Carlo
test. 

Other options will require you to either modify the Makefile or type
commands by hand into a shell (Windows command prompt, Mac
Terminal). The above-mentioned standard run is:

    perl cond_cons.pl < words-aligned > out/results.html

To run with a data file other than words-aligned, or to make the main
HTML page be other than results.html, change those file names in the
command.

It is not possible to tailor the names of the various
files the program generates. However, you can make the output go
to a different directory by adding the option --output=   E.g.:

    perl cond_cons.pl --output=out2 < words-aligned > out2/results.html

Beware, however, that pgive is assumed to exist in the output
directory. If it does not, either put it (or a symlink to it) there,
or tell the program where to find it with a specific option, e.g.:
--pgive=out/pgive

If you use your own phonetic transcription system, you will need to
tell the program which characters are vowels, e.g.:

    perl cond_cons.pl --vowel="aAcCeEioORuUVWyY" < words-aligned > out/results.html

In the unlikely event that your alignments include "e=0" but you do
not want the program to reassign silent "e", you can add the flag
--noassign-silent-e 

The Monte Carlo program uses 10,000 rearrangements of the data to
compute the significance of tables computed by types, and 1,000
rearrangements if counting by tokens. To change the former number (the
tokens count will always be 1/10 of that number), you can add the flag
--iter=5000
specifying any integer number you like.
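
To give a feel for what the iteration count controls, here is a
generic Monte Carlo rearrangement test in Python. It is emphatically
not the computation pgive performs: the statistic conditional_score,
the function monte_carlo_p, the toy data, and the integer division
used for the 1/10 rule are all assumptions made for illustration. The
idea is only that the context labels are shuffled many times and the
real arrangement is compared with the shuffled ones.

```python
import random
from collections import Counter, defaultdict


def conditional_score(contexts, outcomes):
    """Share of items whose outcome is the most common one for their context."""
    by_context = defaultdict(Counter)
    for context, outcome in zip(contexts, outcomes):
        by_context[context][outcome] += 1
    hits = sum(max(counts.values()) for counts in by_context.values())
    return hits / len(outcomes)


def monte_carlo_p(contexts, outcomes, iterations=10000, seed=0):
    """Estimate how often shuffled contexts score at least as well as the real ones."""
    rng = random.Random(seed)
    observed = conditional_score(contexts, outcomes)
    shuffled = list(contexts)
    at_least = 0
    for _ in range(iterations):
        rng.shuffle(shuffled)
        if conditional_score(shuffled, outcomes) >= observed:
            at_least += 1
    return (at_least + 1) / (iterations + 1)


type_iterations = 5000                    # as with --iter=5000
token_iterations = type_iterations // 10  # assumed rounding for the 1/10 rule
contexts = ["t"] * 10 + ["d"] * 10        # toy data, not from words-aligned
outcomes = ["x"] * 10 + ["y"] * 10
print(token_iterations, monte_carlo_p(contexts, outcomes, iterations=token_iterations))
```

Because every iteration recomputes the statistic over the whole data
set, run time grows in proportion to the iteration count, which is why
lowering --iter shortens a run.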

Normally the program computes statistics from both the standpoint of
reading (letters to sounds) and spelling (sounds to letters). Either
can be suppressed by adding the appropriate option:  --noreading or
--nospelling

Normally the program reports on onsets, vowels, and codas. You can
suppress any of these by the arguments  --noonset  --novowel  or --nocoda
E.g., if you specify --noonset --nocoda --noreading  the program will
only report on the spelling of vowels. It still takes the onset and
coda into account for conditional consistencies, however.

Normally the program includes a report on whether the length of the
onset or coda in itself significantly helps predict the coding of the
vowel. This report can be skipped by adding --nolength

Normally the program includes a report on which particular strings
(letters for reading, sounds for spelling) are significantly helped by
context.  To suppress this analysis, add --nostrings

Normally the program computes statistics two ways: counting by types
and by tokens (weighted by log frequency). To suppress either, add the
flag --notypes or --notokens

Normally the program runs an entirely separate set of analyses over
words that have a child-frequency of 20 or more.  To change that
cutoff, specify e.g., --children-cutoff=10. Or to entirely suppress the
analyses, add --nochildren

Normally the program runs an entirely separate set of analyses over
words that are of structure CVC (as specified in the data file). To
suppress this analysis, add --nocvc.

Normally the program runs an entirely separate set of analyses over
words that have no /r/ after the vowel. To suppress these analyses,
add --norless (read "no r-less").
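
If you drive cond_cons.pl from a script rather than by hand, the
options above can be assembled programmatically. The Python sketch
below uses only flags mentioned in this file; the helper names
build_command and run are hypothetical, and the run function assumes
that perl, cond_cons.pl, and pgive are available as described earlier.

```python
import subprocess

# Names that may follow "--no", taken from the options described above.
KNOWN_SKIPS = {"reading", "spelling", "onset", "vowel", "coda", "length",
               "strings", "types", "tokens", "children", "cvc", "rless",
               "assign-silent-e"}


def build_command(output=None, iterations=None, vowels=None, pgive=None,
                  children_cutoff=None, skip=()):
    """Build the argument list for cond_cons.pl."""
    command = ["perl", "cond_cons.pl"]
    if output is not None:
        command.append(f"--output={output}")
    if iterations is not None:
        command.append(f"--iter={iterations}")
    if vowels is not None:
        command.append(f"--vowel={vowels}")
    if pgive is not None:
        command.append(f"--pgive={pgive}")
    if children_cutoff is not None:
        command.append(f"--children-cutoff={children_cutoff}")
    for name in skip:
        if name not in KNOWN_SKIPS:
            raise ValueError(f"unknown analysis to skip: {name!r}")
        command.append(f"--no{name}")
    return command


def run(command, data_file, results_file):
    """Run the command with the data file on stdin and the main page on stdout."""
    with open(data_file, "rb") as source, open(results_file, "wb") as target:
        subprocess.run(command, stdin=source, stdout=target, check=True)


command = build_command(output="out2", iterations=5000,
                        skip=["onset", "coda", "reading"])
print(" ".join(command))
# run(command, "words-aligned", "out2/results.html")  # needs perl and pgive
```

Passing the arguments as a list avoids shell quoting problems with
values such as the vowel string.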


Output.

All output is written to the subdirectory "out" (or the one specified
by --output=). This includes the HTML file explicitly listed on the
command line, plus dozens of other HTML files pointed to by that main
file. Those files have fairly readily interpretable names ending in
.html. In addition, the program generates dozens of data files that
are fed as data to the pgive program. These files all have numeric
names like "0001120-2" and are pretty useless unless you wish to debug
the programs, or perhaps feed the data to other statistical
programs. They can be safely deleted after any run.
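
Clearing out the intermediate data files can be scripted. The Python
sketch below rests on an assumption of mine: that the files in
question are exactly those whose names consist of digits, optionally
followed by hyphen-separated groups of digits, as in "0001120-2". The
function names are hypothetical, and by default nothing is deleted.

```python
import os
import re

# Assumption: intermediate files have names like "0001120-2" (digits and hyphens only).
NUMERIC_NAME = re.compile(r"\d+(-\d+)*")


def intermediate_files(directory="out"):
    """Return the names of files in directory that look like pgive data files."""
    return sorted(name for name in os.listdir(directory)
                  if NUMERIC_NAME.fullmatch(name)
                  and os.path.isfile(os.path.join(directory, name)))


def clean(directory="out", dry_run=True):
    """List the intermediate files, deleting them only when dry_run is False."""
    names = intermediate_files(directory)
    if not dry_run:
        for name in names:
            os.remove(os.path.join(directory, name))
    return names


if os.path.isdir("out"):
    print(clean("out"))  # dry run: only reports what would be removed
```

Run it once with the default dry run and inspect the list before
calling clean with dry_run=False.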

The program also generates in out/ a file called OVC-alignments. This
lists for each word the letters that were taken to spell the onset,
vowel, and coda, respectively.