*******************************************************************************
*******************************************************************************
**                    Non-word Tool Suite
**               (c) 2001-2004 Shane T. Mueller 
**                stmuelle@indiana.edu
**                http://mypages.indiana.edu/~stmuelle
**                Version:  2004-10-13
*******************************************************************************
*******************************************************************************
This suite is a set of tools that allows word-like non-words to be generated.
Additionally, pre-generated non-words have been provided.  


-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
                                 Provided Data
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
data/dict.txt:
The words from the CMU machine-readable dictionary.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
data/cmu-freq.dat:
Likelihood data file based on the CMU machine-readable
dictionary.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
data/kf-freq.dat:
Likelihood data file based on most of the Kucera-Francis
corpus.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
data/both-freq.dat:
Likelihood data file based on the Kucera-Francis corpus and
the CMU dictionary together.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
output/nonwords-04.txt:
output/nonwords-05.txt:
output/nonwords-06.txt:
output/nonwords-07.txt:
output/nonwords-08.txt:
output/nonwords-09.txt:
output/nonwords-10.txt:

Each file contains 1000 non-words generated using the
both-freq.dat frequencies.  The non-words are not guaranteed
to be unique; one may appear more than once in a file.
Some may also be actual words, especially names, that do not
appear in the CMU dictionary.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
nonwords.stats.txt:
All of the words in nonwords-04.txt through nonwords-10.txt,
along with their likelihood according to both-freq.dat and,
for each word, the number of CMU dictionary words found at
each Levenshtein distance.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
words.stats.txt:
All of the words in the CMU dictionary,
along with their likelihood according to both-freq.dat and,
for each word, the number of CMU dictionary words found at
each Levenshtein distance.


-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
                                 The Tools
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
freq
Source: frequencoder.c
Compile: make freq
Invoke as: bin/freq

Purpose:  To extract 2nd order conditional probabilities from a body of text.
It converts all text to upper-case and treats anything other than a letter as
a space.  It automatically generates a file 'freq.dat' that contains the
probability of each letter triplet in the text (27x27x27 entries).
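
The counting step described above can be sketched in C as follows.  This is
only an illustration, not the code in frequencoder.c: the function names, the
symbol numbering (0 = space, 1..26 = A..Z), and the choice to treat the text
as if it were preceded by two spaces are all assumptions, and the sketch says
nothing about the on-disk layout of freq.dat.

```c
#include <assert.h>
#include <ctype.h>

#define NSYM 27   /* assumption: index 0 = space, 1..26 = A..Z */

/* Map a character to a symbol index: letters are folded to upper-case
   and mapped to 1..26; anything else is treated as a space (0). */
int sym_index(int ch)
{
    int u = toupper((unsigned char)ch);
    if (u >= 'A' && u <= 'Z')
        return u - 'A' + 1;
    return 0;
}

/* Add every letter triplet in text to counts.  The text is treated as
   if it were preceded by two spaces (an assumption of this sketch). */
void count_triplets(const char *text,
                    unsigned long counts[NSYM][NSYM][NSYM])
{
    int a = 0, b = 0;
    for (const char *p = text; *p; ++p) {
        int c = sym_index(*p);
        counts[a][b][c]++;
        a = b;
        b = c;
    }
}

/* P(c | a, b): the count of triplet (a, b, c) divided by the count of
   all triplets starting with (a, b); 0 if that context never occurred. */
double cond_prob(unsigned long counts[NSYM][NSYM][NSYM],
                 int a, int b, int c)
{
    assert(a >= 0 && a < NSYM && b >= 0 && b < NSYM && c >= 0 && c < NSYM);
    unsigned long total = 0;
    for (int x = 0; x < NSYM; ++x)
        total += counts[a][b][x];
    return total ? (double)counts[a][b][c] / (double)total : 0.0;
}
```

The caller is expected to zero the counts array first (for example by
declaring it static) and may call count_triplets() once per chunk of input.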



-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
wordgen
Source: wordgenerator.c
Compile:  make bin/wordgen
Invoke as: bin/wordgen

   This program generates word-like nonsense words according to the procedure
   described by Shannon & Weaver (1963), and prints them to stdout.
   If called with no arguments, it produces 20 such non-words, each between
   5 and 9 letters long.  The number of generated non-words can be
   controlled with the -n command-line option, and the lengths of the
   shortest and longest words can be controlled with the -min and -max
   options.

   Examples:
   bin/wordgen -n 1000 -min 5 -max 5    produces 1000 5-letter non-words
   bin/wordgen -min 2 -max 15 -n 100    produces 100 non-words between
                                        2 and 15 letters long

This program requires two data files in the data directory: a 2nd order
conditional probability matrix, stored in freq.dat, and a dictionary, stored
in dict.txt.  The frequency file is generated by the accompanying 'freq'
program.  The data directory also provides .dat files containing 2nd order
probabilities based on the Kucera-Francis corpus, the words in the CMU
machine-readable dictionary, or both.  To change which one is used, copy or
symlink one of these .dat files to freq.dat.
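
As a rough illustration of the generation procedure, the sketch below draws
one letter at a time from the conditional distribution given the previous two
symbols, and stops when a space is drawn.  It is not taken from
wordgenerator.c.  The flat table layout, the IDX macro, the function names,
and the use of rand() are assumptions made for the example; how the real
program enforces -min and -max or uses dict.txt is not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define NSYM 27   /* assumption: index 0 = space, 1..26 = A..Z */
/* Position of P(c | a, b) in a flat table of NSYM*NSYM*NSYM doubles. */
#define IDX(a, b, c) ((((a) * NSYM) + (b)) * NSYM + (c))

/* Pick a symbol from the distribution row[0..NSYM-1] using a uniform
   draw u in [0,1).  Returns 0 (space) if the row sums to u or less. */
int sample_symbol(const double *row, double u)
{
    double cum = 0.0;
    for (int c = 0; c < NSYM; ++c) {
        cum += row[c];
        if (u < cum)
            return c;
    }
    return 0;
}

/* Uniform draw in [0,1) from the C library generator. */
double uniform_rand(void)
{
    return rand() / ((double)RAND_MAX + 1.0);
}

/* Generate one candidate string into out (which must hold max_len + 1
   bytes).  Starts from a context of two spaces, draws each letter given
   the previous two symbols, and stops when a space is drawn or max_len
   letters have been written.  Returns the number of letters written. */
size_t generate_word(const double *table, char *out, size_t max_len,
                     double (*draw)(void))
{
    assert(table && out && draw);
    int a = 0, b = 0;
    size_t n = 0;
    while (n < max_len) {
        int c = sample_symbol(&table[IDX(a, b, 0)], draw());
        if (c == 0)
            break;
        out[n++] = (char)('A' + c - 1);
        a = b;
        b = c;
    }
    out[n] = '\0';
    return n;
}
```

A caller could seed the generator with srand(), call generate_word() in a
loop, and discard results whose length falls outside the wanted range.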


-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
like
Source: likelihood.c
Compile: make bin/like
Invoke as: bin/like

This program computes the likelihood of words using a probability table
stored in data/like.dat, created by bin/freq based on a text
corpus.  The program calculates second-order conditional
probabilities of letters, so it uses a table of the
probabilities of a letter given its previous two letters in
the word.  The program evaluates the first word on
each line of each file specified on the command line, and
prints the results to stdout.
		  
Usage:
bin/like wordfile.txt > wordlike.txt
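
One way to turn such a table into a likelihood for a word is to multiply the
conditional probability of each letter given the two symbols before it.  The
sketch below does that; it is not the code in likelihood.c.  The flat table
layout, the IDX macro, the two-space starting context, and the final factor
for the word ending in a space are all assumptions, and the real program may
report a different quantity (for example a log-likelihood).

```c
#include <assert.h>
#include <ctype.h>

#define NSYM 27   /* assumption: index 0 = space, 1..26 = A..Z */
/* Position of P(c | a, b) in a flat table of NSYM*NSYM*NSYM doubles. */
#define IDX(a, b, c) ((((a) * NSYM) + (b)) * NSYM + (c))

/* Likelihood of the first word in line: the product of P(letter |
   previous two symbols) over its letters, times the probability that a
   space follows the last two symbols.  Reading stops at the first
   character that is not a letter. */
double word_likelihood(const double *table, const char *line)
{
    assert(table && line);
    int a = 0, b = 0;          /* context starts as two spaces */
    double p = 1.0;
    for (const char *s = line; *s; ++s) {
        int u = toupper((unsigned char)*s);
        if (u < 'A' || u > 'Z')
            break;             /* end of the first word */
        int c = u - 'A' + 1;
        p *= table[IDX(a, b, c)];
        a = b;
        b = c;
    }
    p *= table[IDX(a, b, 0)];  /* assumption: include the word ending */
    return p;
}
```

Because only the leading run of letters is read, a line such as "cat 123"
is scored exactly like "cat".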



-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
levdictfile
Source: lev-dict-file.c
Compile:  make bin/levdictfile
Invoke as: bin/levdictfile

   This program finds the distribution of Levenshtein (edit)
   distances between the words in a specified file and
   the words in "data/dict.txt".  It prints out only one line for each word.
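
   For readers unfamiliar with the measure, the sketch below shows the
   standard dynamic-programming Levenshtein distance and a histogram of
   distances from one word to a list of dictionary words.  It is not the
   code in lev-dict-file.c; the function names, the MAX_WORD limit, and
   the folding of large distances into the last bin are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_WORD 64   /* assumed upper bound on word length */

/* Levenshtein distance between s and t: the minimum number of single
   letter insertions, deletions, and substitutions turning s into t.
   Uses two rolling rows of the dynamic-programming table. */
size_t levenshtein(const char *s, const char *t)
{
    size_t n = strlen(s), m = strlen(t);
    assert(m < MAX_WORD);
    size_t prev[MAX_WORD], cur[MAX_WORD];
    for (size_t j = 0; j <= m; ++j)
        prev[j] = j;
    for (size_t i = 1; i <= n; ++i) {
        cur[0] = i;
        for (size_t j = 1; j <= m; ++j) {
            size_t sub = prev[j - 1] + (s[i - 1] != t[j - 1] ? 1u : 0u);
            size_t del = prev[j] + 1;
            size_t ins = cur[j - 1] + 1;
            size_t best = sub < del ? sub : del;
            cur[j] = best < ins ? best : ins;
        }
        memcpy(prev, cur, (m + 1) * sizeof prev[0]);
    }
    return prev[m];
}

/* Fill hist[d] with the number of dictionary words at distance d from
   word.  Distances of nbins - 1 or more are counted in the last bin. */
void distance_histogram(const char *word, const char *const *dict,
                        size_t ndict, size_t *hist, size_t nbins)
{
    assert(nbins > 0);
    memset(hist, 0, nbins * sizeof hist[0]);
    for (size_t k = 0; k < ndict; ++k) {
        size_t d = levenshtein(word, dict[k]);
        hist[d < nbins ? d : nbins - 1]++;
    }
}
```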

  
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
levdict
Source: lev-dict.c
Compile:  make bin/levdict

   This program finds the distribution of Levenshtein (edit)
   distances between the words on the command line and
   the words in "data/dict.txt".  When searching the dictionary, it outputs any
   word that is a 'neighbor', i.e., at a distance of 1 or 2.  It then prints out a
   line for each word describing the distribution of distances between that word
   and every word in the dictionary.



-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
levpartial
Source: leven-partial.c
Compile:  make bin/levpartial

   This computes an experimental 'partial' Levenshtein distance matrix, using the partial letter
   distances specified in data/letters.dat.  It works like the Levenshtein distance, but
   some swapping operations are cheaper than others.  It takes a file with a list
   of words as an argument, optionally followed by a number specifying which column of the
   file to use.  It outputs a matrix of the partial Levenshtein distances between the words
   in the file to stdout.
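
   The idea of letter-dependent costs can be sketched as a Levenshtein
   distance whose substitution cost is looked up in a letter-by-letter
   table.  This is not the code in leven-partial.c.  It assumes that
   'swapping' means substituting one letter for another, that words are
   upper-case A..Z, and that insertions and deletions always cost 1; the
   flat 26x26 cost table and the function names are hypothetical, and the
   format of data/letters.dat is not modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_WORD 64   /* assumed upper bound on word length */
#define NLET 26       /* assumption: upper-case letters A..Z only */

/* Fill a flat NLET*NLET table with ordinary Levenshtein costs:
   0 for identical letters, 1 for any substitution. */
void init_costs(double *cost)
{
    for (int a = 0; a < NLET; ++a)
        for (int b = 0; b < NLET; ++b)
            cost[a * NLET + b] = (a == b) ? 0.0 : 1.0;
}

/* Edit distance where substituting letter a with letter b costs
   cost[a * NLET + b]; insertions and deletions cost 1. */
double partial_levenshtein(const char *s, const char *t,
                           const double *cost)
{
    size_t n = strlen(s), m = strlen(t);
    assert(cost && m < MAX_WORD);
    double prev[MAX_WORD], cur[MAX_WORD];
    for (size_t j = 0; j <= m; ++j)
        prev[j] = (double)j;
    for (size_t i = 1; i <= n; ++i) {
        cur[0] = (double)i;
        for (size_t j = 1; j <= m; ++j) {
            int a = s[i - 1] - 'A', b = t[j - 1] - 'A';
            assert(a >= 0 && a < NLET && b >= 0 && b < NLET);
            double sub = prev[j - 1] + cost[a * NLET + b];
            double del = prev[j] + 1.0;
            double ins = cur[j - 1] + 1.0;
            double best = sub < del ? sub : del;
            cur[j] = best < ins ? best : ins;
        }
        memcpy(prev, cur, (m + 1) * sizeof prev[0]);
    }
    return prev[m];
}
```

   Calling partial_levenshtein() for every pair of words in a list gives
   a distance matrix of the kind described above.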

