This repository contains the code used in Mayer, C. (2020) An algorithm for learning phonological classes from distributional information. Phonology 37(1), 91-131.
I hope that by making the code publicly available, researchers will be able to both extend the algorithm and apply it to their own data sets. A brief description of the components and their usage is given below. See the paper for more details.
This folder contains the code used in the paper.
Minimum Python requirements (earlier versions may work, but have not been tested):
- Python 3 (3.6.5)
- `numpy` package (1.13.3)
- `nltk` package (3.2.5)
- `sklearn` package (0.19.1)
Most Python files can be called from the command line. You can add --help to these commands to get a description of the arguments.
- `HMM.py`: A group of classes that implement a simple Hidden Markov Model, which can be used to generate toy language corpora with specific transition and emission probabilities. It has no command line interface. See `generate_parupa_corpora.py` for an example of its use.
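To make the idea of generating a toy corpus from transition and emission probabilities concrete, here is a minimal sketch of an HMM sampler. It is not the code in `HMM.py`: the class name `ToyHMM`, its arguments, and the made-up CV language (which is not Parupa) are all assumptions chosen for illustration.

```python
import random


class ToyHMM:
    """Minimal HMM sampler (hypothetical; not the classes defined in HMM.py)."""

    def __init__(self, initial, transitions, emissions, end_state="END", seed=None):
        self.initial = initial          # {state: probability}
        self.transitions = transitions  # {state: {next_state: probability}}
        self.emissions = emissions      # {state: {segment: probability}}
        self.end_state = end_state
        self.rng = random.Random(seed)

    def _draw(self, dist):
        # Sample one key from a {outcome: probability} dictionary.
        outcomes = list(dist)
        weights = [dist[o] for o in outcomes]
        return self.rng.choices(outcomes, weights=weights, k=1)[0]

    def generate_word(self, max_len=20):
        # Emit one segment per visited state until the end state is reached.
        word = []
        state = self._draw(self.initial)
        while state != self.end_state and len(word) < max_len:
            word.append(self._draw(self.emissions[state]))
            state = self._draw(self.transitions[state])
        return word


# A made-up language in which consonants and vowels strictly alternate.
hmm = ToyHMM(
    initial={"C": 1.0},
    transitions={"C": {"V": 1.0}, "V": {"C": 0.6, "END": 0.4}},
    emissions={"C": {"p": 0.5, "t": 0.5}, "V": {"a": 0.5, "i": 0.5}},
    seed=0,
)

# One word per line, segments separated by spaces.
corpus = [" ".join(hmm.generate_word()) for _ in range(5)]
print("\n".join(corpus))
```

The output uses the one-word-per-line, space-separated layout that the vectorization scripts expect as input.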
- `generate_parupa_corpora.py`: Generates one or more corpora for the toy language Parupa. This script can be called from the command line with the following arguments.
  - Required positional argument(s): A space-separated list of noise values between 0 and 1. The noise value reflects the proportion of generated tokens that do not follow the phonotactic constraints of Parupa. This option combined with `--corpora_per_level` determines how many corpora will be generated in total.
  - `--corpora_per_level`: The number of corpora that will be generated at each noise level. Optional, default 10.
  - `--corpus_size`: The number of tokens to generate in each corpus. Optional, default 50,000.
  - `--output_dir`: The directory to save the corpora in. Optional, default `../corpora/noisy_parupa/`.
An example of usage is:
    python3 generate_parupa_corpora.py 0 0.25 0.5 0.75 1 --corpora_per_level 5 --corpus_size 10000 --outdir /my/great/dir/

This command will generate 25 corpora: 5 at noise level 0, 5 at noise level 0.25, etc.
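The relationship between noise levels, corpora per level, and the total number of corpora can be sketched as follows. This is an illustration only, not the script's implementation. The function names and the two stand-in word generators are hypothetical, and the sketch assumes that each token is drawn independently from an unconstrained generator with probability equal to the noise value.

```python
import random


def plan_corpora(noise_levels, corpora_per_level=10):
    """List the (noise level, replicate index) pairs that one run would produce."""
    return [(noise, i) for noise in noise_levels for i in range(corpora_per_level)]


def generate_noisy_corpus(legal_word, illegal_word, noise, corpus_size, rng):
    """Draw each token from `illegal_word` with probability `noise` (an assumption)."""
    return [
        illegal_word(rng) if rng.random() < noise else legal_word(rng)
        for _ in range(corpus_size)
    ]


# Hypothetical stand-ins for generators that obey or ignore the phonotactics.
def legal_word(rng):
    return rng.choice(["p a", "t i"])


def illegal_word(rng):
    return rng.choice(["a p", "i t"])


rng = random.Random(0)
plan = plan_corpora([0, 0.25, 0.5, 0.75, 1], corpora_per_level=5)
print(len(plan))  # 25, matching the example command

clean = generate_noisy_corpus(legal_word, illegal_word, 0, 100, rng)
noisy = generate_noisy_corpus(legal_word, illegal_word, 1, 100, rng)
```

At noise level 0 every token comes from the constrained generator, and at noise level 1 none do.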
- `VectorModelBuilder.py`: Generates a vector embedding of a corpus file. The input file consists of one word per line, with the segments in the word separated by a space. Because segments are space-separated, multi-character representations of a segment can be used. See the files in the `corpora` directory for formatting examples.

  The class generates three output files:
  - `.data` file: contains the vector representations of each segment in the input corpus.
  - `.sounds` file: contains the labels of the sounds in the same order as their vectors in the `.data` file.
  - `.contexts` file: contains the labels of the contexts (columns) of the vectors in the `.data` file.
This class can be called from the command line or instantiated in a Python script.
Command line arguments:
- Required positional argument: The path to the corpus file to vectorize.
Optional arguments:
- `--count_method`: The counting method to use when creating the vectors. The program currently supports only the `ngram` method. Default: `ngram`.
- `--n`: The value of `n` to use when `count_method == ngram`. Default: `3`.
- `--weighting`: The weighting method to use on the raw counts when creating the vectors. Options include `probability`, `conditional_probability`, `pmi`, `ppmi`, and `none`. Note that if you use unigrams (`n == 1`), `ppmi` and `pmi` will weight all counts to 0 (because there is only a single context with a probability of `1.0`), and the conditional probability and probability weightings will be equivalent. Default: `ppmi`.
- `--outfile`: The base filename to save the output files as. Optional; if not specified, the base filename will be the same as that of the input corpus file.
- `--outdir`: The directory to save the output files in. Optional, default `../vector_data/`.
An example of usage is:
    python3 VectorModelBuilder.py ../corpora/parupa.txt --n 3 --weighting ppmi --outfile my_vectors --outdir ../vector_data/
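For readers unfamiliar with n-gram counting and PPMI weighting, the following sketch shows one plausible way to compute them from a corpus in the format described above. It is not the implementation in `VectorModelBuilder.py`. The function names, the `#` word-boundary padding, and the definition of a context as an n-gram with the target segment replaced by `_` are assumptions made for illustration.

```python
import math
from collections import Counter


def ngram_context_counts(words, n=3, boundary="#"):
    """Count (segment, context) pairs, where a context is an n-gram with the
    target segment replaced by "_". Padding with `boundary` is an assumption."""
    counts = Counter()
    for word in words:
        padded = [boundary] * (n - 1) + word + [boundary] * (n - 1)
        for start in range(len(padded) - n + 1):
            gram = padded[start:start + n]
            for pos, seg in enumerate(gram):
                if seg == boundary:
                    continue  # boundaries are contexts, never target segments
                context = tuple("_" if i == pos else s for i, s in enumerate(gram))
                counts[(seg, context)] += 1
    return counts


def ppmi(counts):
    """Positive pointwise mutual information for each (segment, context) pair."""
    total = sum(counts.values())
    seg_totals, ctx_totals = Counter(), Counter()
    for (seg, ctx), c in counts.items():
        seg_totals[seg] += c
        ctx_totals[ctx] += c
    weights = {}
    for (seg, ctx), c in counts.items():
        # PMI = log2( p(seg, ctx) / (p(seg) * p(ctx)) )
        pmi = math.log2(c * total / (seg_totals[seg] * ctx_totals[ctx]))
        weights[(seg, ctx)] = max(0.0, pmi)
    return weights


# One word per line, segments separated by spaces.
corpus_text = "p a\nt a\np i\n"
words = [line.split() for line in corpus_text.splitlines() if line.strip()]

trigram_counts = ngram_context_counts(words, n=3)
trigram_weights = ppmi(trigram_counts)
unigram_weights = ppmi(ngram_context_counts(words, n=1))
```

With `n=1` there is only one context, so every entry of `unigram_weights` is 0, which is the behavior the `--weighting` note describes.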
- `clusterer.py`: Takes a vector embedding as input and generates classes of sounds using a combination of PCA and k-means clustering. It prints the discovered classes to the console and saves them to a text file.
Command line arguments:
- Required positional argument: The stem of the set of input files generated by `VectorModelBuilder.py`. For example, if your input files are `parupa_trigram_ppmi.data`, `parupa_trigram_ppmi.sounds`, and `parupa_trigram_ppmi.contexts`, this argument should be `parupa_trigram_ppmi`.
- Required positional argument: Path to the file where the discovered classes will be saved.
- `--v_scalar`: A parameter that controls what proportion of variance a principal component must account for to be used in clustering. The threshold is (this value * the average amount of variance).
- `--no_constrain_initial_partition`: Removes a restriction on the initial partition of the data set: namely, that any partition of the full set of sounds must be into two classes (e.g., consonants vs. vowels, voiced vs. voiceless, etc.).
- `--no_constrain_initial_pcs`: Removes a restriction on the initial partition of the data set: namely, that only the first principal component is considered. Without this restriction, the same classes are detected as with it, but additional partitions of the data set may be discovered as well. Similar results can be gained by increasing the variability scalar, but this will apply to all recursive calls to the clusterer rather than just the top-level call.
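The variance threshold and the two-way initial partition can be illustrated with a short sketch built on `sklearn`. This is a simplified, non-recursive illustration, not the algorithm in `clusterer.py`. The function names, the default `v_scalar=1.0`, the `>=` comparison, the `first_pc_only` parameter, and the toy vectors are all assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA


def select_components(explained_variance, v_scalar=1.0):
    """Keep components whose variance is at least v_scalar * the mean variance."""
    threshold = v_scalar * np.mean(explained_variance)
    return [i for i, var in enumerate(explained_variance) if var >= threshold]


def split_on_components(vectors, labels, v_scalar=1.0, first_pc_only=True):
    """Run PCA, then 2-means on each retained component (hypothetical helper)."""
    pca = PCA()
    projected = pca.fit_transform(vectors)
    kept = select_components(pca.explained_variance_, v_scalar)
    if first_pc_only:
        kept = kept[:1]  # mimic considering only the first principal component
    partitions = []
    for pc in kept:
        km = KMeans(n_clusters=2, n_init=10, random_state=0)
        assignment = km.fit_predict(projected[:, [pc]])  # 1-D clustering
        classes = [
            sorted(lab for lab, a in zip(labels, assignment) if a == cluster)
            for cluster in (0, 1)
        ]
        partitions.append(sorted(classes))
    return partitions


# Toy two-dimensional "embedding": vowels and consonants are well separated.
labels = ["a", "i", "u", "p", "t", "k"]
vectors = np.array([
    [5.0, 0.1], [5.2, -0.1], [4.9, 0.0],
    [-5.0, 0.1], [-5.1, -0.1], [-4.8, 0.0],
])

partitions = split_on_components(vectors, labels)
print(partitions)
```

Passing `first_pc_only=False` together with a smaller `v_scalar` lets additional components, and therefore additional candidate partitions, through.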
- `vectorize_dir.py`: A convenience script that produces vector representations for all corpora in a directory.

  The command line arguments for this script are essentially identical to those for `VectorModelBuilder.py`. The only differences are that the `--outfile` argument has been removed, and the required positional argument specifying the corpus file has been replaced with an optional argument specifying the directory of corpora:
  - `--indir`: The directory of corpus files that will be vectorized. Default: `../corpora/noisy_parupa`.
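Conceptually, vectorizing a directory amounts to running the single-file vectorizer once per corpus. The sketch below builds the equivalent `VectorModelBuilder.py` commands using only the arguments documented above. It is not the code in `vectorize_dir.py`. The helper names are hypothetical, and the sketch assumes that corpus files end in `.txt`.

```python
import subprocess
from pathlib import Path


def build_commands(indir, outdir="../vector_data/", n=3, weighting="ppmi"):
    """Build one VectorModelBuilder.py command per corpus file in `indir`.

    Assumes corpus files end in .txt (an assumption, not stated in this README).
    """
    commands = []
    for corpus in sorted(Path(indir).glob("*.txt")):
        commands.append([
            "python3", "VectorModelBuilder.py", str(corpus),
            "--n", str(n), "--weighting", weighting, "--outdir", outdir,
        ])
    return commands


def run_all(commands):
    """Run each command in turn, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


if __name__ == "__main__":
    for command in build_commands("../corpora/noisy_parupa"):
        print(" ".join(command))
```

Printing the commands before calling `run_all` makes it easy to check which corpora would be processed.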
R files can be run from an IDE like RStudio. Configurable variables are given in upper case at the tops of the files, and have accompanying comments specifying their use.
plot_embeddings.R: Plots and saves 2D PCAs of the full vector embedding, as well as 2D embeddings of the first partition into two by k-means clustering (in general, consonants vs. vowels). This was used to generate many of the figures in the paper.
This directory contains the corpora used in the paper.
This directory contains the vector embeddings of the corpora used in the paper.
This directory will hold .txt files containing the classes discovered by `clusterer.py`.
This directory will hold plots of the vector embeddings generated by `plot_embeddings.R`.