-------------------------------------------------------------
The SEPARATE utility.  Copyright 2001 by Shane T. Mueller
Oct. 20, 2001
Contributors:
Shane Mueller (Primary Author)
Michael Kahana (Provided words sets and semantic similarities).
Pablo Gomez (Assistance with graphemic similarity).

This archive is available by contacting me at smueller@umich.edu, 
or via www at "http://www-personal.umich.edu/~smueller", at least
through 2002. 
-------------------------------------------------------------
This utility uses the words from the Toronto Noun Pool 
and attempts to cluster them into groups using a genetic-
algorithm-like process.                                  

You should have the following files/directories:
backup/           Possibly contains backups
bin/              Contains compiled file separate
extra/            Other useful scripts
matrices/         Data matrices
out/              Output Files
sbin/             Backup script
src/              Contains separate.c
readme.txt        This file
Makefile          make command file
run               executable script that runs ./bin/separate
separate.ini      contains parameter settings that program reads

ABOUT:
This program is a utility that helps develop stimuli based on 
their similarity.  Similarity is a viewed as a multi-dimensional
aspect of a pair of words, and thus of a set of words.  Currently,
the matrices/ directory contains information about 480 words from
the Toronto Noun Pool.  This includes 7 dimensions of similarity:

Onset              Mean similarity of syllable onsets
Nucleus            Mean similarity of syllable nuclei
Coda               Mean similarity of syllable codas
Stress             Mean similarity of syllable stress
Initial            Mean similarity of first phonemes in word
Semantic           Mean semantic similarity of words, based on LSA
Graphemic          Mean graphemic similarity of words (using edit 
                   or Levenshtein distance)

All dimensions except for Semantic similarity are distances, so 
low numbers indicate high similarity. (e.g. Mueller, Seymour, Kieras, & Meyer,
under review). Semantic similarity is measured by LSA cos(theta) (Landauer et
al., 1998, Kahana & Howard, 2001). Identical items have cos(theta)=1; dissimilar
items have cos(theta) closer to 0. Graphemic similarity is the 'Edit' or
'Levenshtein' distance--the number of changes one string must undergo to be
transformed into the other, where legal transformations are adding, subtracting,
or changing a single letter.  The spellings in the Noun Pool are Canadian, and
there are some words which are spelled differently in the US (like colour), so 
be cautioned.

The program requires that you specify:
	Set Size
	Trials between report generation
	The total number of Cost Functions/Evaluation Statistics
	2 Simulated Annealing Parameters (baseline, decay proportion)
	Details of Cost Fns/Eval Stats:
		* A vector describing the weighted sum of dimensions
		  that are used to create a cost function
		* The Evaluation Statistic to use for each cost function
		* Which sets of words to apply Eval Stat to.
	
	
General Operation:
The heuristic solution works by breaking the set of 480  words
into smaller sets, and iteratively combining pairs of smaller lists, 
dividing them randomly, and determining if the new organization 
satisfies all criteria. Each criteria is determined by an evaluation 
statistic, which currently include min/max of mean/stdev, along with the
indices of the first and last set of words to which each evaluation stat
should be applied to. This allows you to, for instance, minimize one 
cost function that is applied to sets 1-10, and maximize another cost
function that is applied to sets 11-20.


Simulated Annealing:
When the 1st simulated annealing parameter is greater 
than 0, there is a small probability that for a given Evaluation Stat, the
new configuration will be acceptable even if the overall optimality decreases.
This probability gets smaller as the SA parameter gets closer to 0, and also
decreases as the new goodness-of-fit becomes worse (its more likely to accept
small decreases in fit than large decreases in fit). The first SA parameter is
the baseline parameter, and probably should start out fairly small (below .01). 
The second parameter is the number by which the first parameter gets multiplied
by every time a report gets displayed (the second parameter in the .ini file).
This should be close to 1 (.99 or greater), and should be set to 1 if you
don't want the SA parameter to change during operation. Simulated annealing 
tends to produce better results in less time, and is essential for producing
reasonable results when many Evaluation Stats are used.



Adjustable parameters include:
------------------------------
Set size --must be a divisor of 480 (2,3,4,5,6,8,10,12,15,16,20,24,etc.) 
Number of Cycles between saves/displays: reduces output
Number of Cost Fns/Eval Stats
Weighting contrast: 7 numbers
Evaluation Statistic--one of:
  -es minmean     Splice decreases mean contrast.
  -es maxmean     Splice increases mean contrast.
  -es minstd      Splice decreases std. dev. of contrast.
  -es maxstd      Splice increases std. dev. of contrasts.
Min index of set to apply eval stat to
Max index of set to apply eval stat to

The program's parameters are controlled by a file        
separate.ini. This file should contain the following lines:
<SET SIZE>
<NUMBER OF CYCLES BETWEEN DISPLAY/SAVE>
<WEIGHTING CONTRAST> -es <evalution statistic> MinIndex MaxIndex
Multiple weighting contrasts/evalutation statistics can be
used by including multiple lines.


For example, to produce 5 groups of size 10, whose graphemic
similarity is minimized, displaying output every 1000 cycles, 
without simulated annealing:

10
1000
1
0 1
0 0 0 0 0 0 1 -es minmean 1 5



To produce 40 groups of size 12 whose overall phonological dissimilarity
is low, and whose intra-set semantic similarity is low, with a typical
set of simulated annealing parameters:

12
10000
2
.005 .9992
.5 .3 .3 .2 .2 0 0 -es minmean 1 40
0 0 0 0 0 1 0      -es minmean 1 40

To produce two subsets of 10 lists of size 6; one which is
low in phonological dissimilarity and on which is high; but overall
the sets have about the same levels of semantic and orthographic similarity:

6
10000
5
.005  .9992
.5 1 1 1 .3 0 0 -es maxstd 1 20
.5 1 1 1 .3 0 0 -es minmean 1 10
.5 1 1 1 .3 0 0 -es maxmean 10 20
0 0 0 0 0 1 0 -es minstd 1 20
0 0 0 0 0 0 1 -es minstd 1 20



Included in the out/##-## subdirectoris are some pre-generated sets.
They are named informatively, and their important generating parameters
are found inside the file. The subdirectory indicates the number of sets
and the size of each set.

================================================================================
TECHNICAL DETAILS:
After a user-specified number of trials, the current output file 
 out/datafile.dat is copied to out/datafile.dat.old, and a new 
 out/datafile.dat is produced.  It includes the numbers of the words that
 compose each list, and a lot of output detailing aspects of these lists. 
 When the program is executed, it will look for out/datafile.dat and read in
 the current configuration.  If the file is missing or corrupt, it will alert
 the user, so that the program can be exited and datafile.dat.old can be
 used to replace datafile.dat.  Otherwise, it will use this saved configuration
 in out/datafile.dat and continue, although it may now be using different 
 parameters.  This allows one to stop the execution and start later, or 
 adjust some parameters and restart easily. The program is halted by ctrl-c,
 and so this can corrupt datafile.dat.  I didn't want to introduce additional
 dependencies like the curses library, so this is the consequence.

Very little error checking is done, so incorrect input files can lead to disastrous
results. If things are going bad, try deleting out/datafile.dat and starting
fresh.

================================================================================
To compile separate.c, go to the directory with Makefile and type:
 make
Alternatively, go to src/ and type:
  gcc separate.c -lm -o separate
