#LyX 2.2 created this file. For more info see http://www.lyx.org/
\lyxformat 508
\begin_document
\begin_header
\save_transient_properties true
\origin unavailable
\textclass article
\begin_preamble
\usepackage{url} 
\usepackage{newunicodechar}
\newunicodechar{」}{~}
\newunicodechar{‐}{-}
\end_preamble
\use_default_options false
\begin_modules
theorems-ams
eqs-within-sections
figs-within-sections
\end_modules
\maintain_unincluded_children false
\language english
\language_package default
\inputencoding utf8
\fontencoding global
\font_roman "times" "default"
\font_sans "helvet" "default"
\font_typewriter "courier" "default"
\font_math "auto" "auto"
\font_default_family default
\use_non_tex_fonts false
\font_sc false
\font_osf false
\font_sf_scale 100 100
\font_tt_scale 100 100
\graphics default
\default_output_format default
\output_sync 0
\bibtex_command default
\index_command default
\paperfontsize default
\spacing single
\use_hyperref true
\pdf_bookmarks true
\pdf_bookmarksnumbered false
\pdf_bookmarksopen false
\pdf_bookmarksopenlevel 1
\pdf_breaklinks true
\pdf_pdfborder true
\pdf_colorlinks true
\pdf_backref false
\pdf_pdfusetitle true
\papersize default
\use_geometry true
\use_package amsmath 2
\use_package amssymb 2
\use_package cancel 1
\use_package esint 0
\use_package mathdots 1
\use_package mathtools 1
\use_package mhchem 0
\use_package stackrel 1
\use_package stmaryrd 1
\use_package undertilde 1
\cite_engine basic
\cite_engine_type default
\biblio_style plain
\use_bibtopic false
\use_indices false
\paperorientation portrait
\suppress_date false
\justification true
\use_refstyle 0
\index Index
\shortcut idx
\color #008000
\end_index
\secnumdepth 2
\tocdepth 2
\paragraph_separation indent
\paragraph_indentation default
\quotes_language english
\papercolumns 1
\papersides 1
\paperpagestyle default
\tracking_changes false
\output_changes false
\html_math_output 0
\html_css_as_file 0
\html_be_strict false
\end_header

\begin_body

\begin_layout Title
Connector Set Distributions 
\end_layout

\begin_layout Author
Linas Vepstas
\end_layout

\begin_layout Date
\begin_inset Box Frameless
position "t"
hor_pos "c"
has_inner_box 1
inner_pos "t"
use_parbox 0
use_makebox 0
width "50col%"
special "none"
height "1in"
height_special "totalheight"
thickness "0.4pt"
separation "3pt"
shadowsize "4pt"
framecolor "black"
backgroundcolor "none"
status open

\begin_layout Plain Layout
Version 1: 
\begin_inset space \hfill{}
\end_inset

 11 May 2017
\begin_inset Newline newline
\end_inset

Version 2: 
\begin_inset space \hfill{}
\end_inset

 6 August 2017
\begin_inset Newline newline
\end_inset

Version 3: 
\begin_inset space \hfill{}
\end_inset

 12 September 2017
\begin_inset Newline newline
\end_inset

Version 4: 
\begin_inset space \hfill{}
\end_inset

 23 July 2018
\end_layout

\end_inset


\end_layout

\begin_layout Abstract
The goal of unsupervised learning of the grammar of natural language is
 to obtain, by automated means, a valid lexis of dependencies between words.
 In the Link Grammar formalism, these dependencies are represented as 
\begin_inset Quotes eld
\end_inset

connector sets
\begin_inset Quotes erd
\end_inset

 (also termed 
\begin_inset Quotes eld
\end_inset

disjuncts
\begin_inset Quotes erd
\end_inset

) that capture how a word connects to its neighbors.
\end_layout

\begin_layout Abstract
One well-known model of statistical dependency parsing is the so-called
 MST-parse model of Deniz Yuret.
 It provides reasonable but imperfect results, and has several unsatisfying
 properties: it fails to label links and it fails to assign words to
 grammatical classes.
 Both of these limitations can be overcome by extracting connector sets
 from MST linkages, and then performing a statistical averaging over many
 observations.
 Arguments for why this might be a good approach to statistical language
 learning are given in a companion report; this report presents experimental
 results.
\end_layout

\begin_layout Abstract
A collection of connector sets (disjuncts) and disjunct vectors are obtained
 from a large statistical sampling of MST-parsed sentences.
 This report surveys the general statistical landscape of these structures.
 The ultimate aim is to use such collections as input data to graph algorithms
 that can extract grammatical classes, perform word-sense disambiguation,
 obtain synonym-sets and provide accurate dependency parsing.
\end_layout

\begin_layout Abstract
This is a revised version of earlier reports.
 It provides a better overview, and provides additional results.
 This is not yet the final version; I'm still waiting for additional computation
s to complete.
 
\end_layout

\begin_layout Section
Introduction
\end_layout

\begin_layout Standard
This report characterizes the statistical distribution of word-disjunct
 pairs extracted from a large block of text, using unsupervised natural
 language learning techniques.
\end_layout

\begin_layout Standard
A disjunct is a sequence of words, resembling an N-gram, but containing
 grammatical information.
 The grammatical information in a disjunct is 
\begin_inset Quotes eld
\end_inset

complete
\begin_inset Quotes erd
\end_inset

, in the sense that it is sufficient to construct an accurate parser of
 language.
 This is unlike the situation encountered with Word2Vec
\begin_inset CommandInset citation
LatexCommand cite
key "Mikolov2013a"

\end_inset

 or AdaGram
\begin_inset CommandInset citation
LatexCommand cite
key "Bartunov2015"

\end_inset

, where one can discover 
\begin_inset Quotes eld
\end_inset

important
\begin_inset Quotes erd
\end_inset

 sequences of words, but where the route to the creation of a grammar is
 less clear.
 The algorithm used to extract disjuncts is also completely different from
 the neural-net techniques; it is a graph technique, as opposed to a gradient-de
scent technique.
 Despite this, there are exploitable conceptual similarities in building
 models of natural language.
\begin_inset CommandInset citation
LatexCommand cite
key "Vepstas2018skippy"

\end_inset

 The goal of this report is to provide a fairly detailed statistical analysis
 of the distribution of disjuncts, thereby characterizing the structure
 of natural language at a grammatical, syntactic level.
\end_layout

\begin_layout Standard
The purpose of this report is to characterize the structure of the disjunct
 dataset to a sufficient degree that it can be used as input to other graph-base
d methods to extract grammatical classes and to perform word-sense disambiguatio
n.
 These techniques and algorithms are presented in 
\begin_inset CommandInset citation
LatexCommand cite
key "Vepstas2018skippy"

\end_inset

.
\end_layout

\begin_layout Standard
The origin of the concept of a 
\begin_inset Quotes eld
\end_inset

disjunct
\begin_inset Quotes erd
\end_inset

 comes from the Link Grammar theory of semantic-syntactic parsing
\begin_inset CommandInset citation
LatexCommand cite
key "Sleator1991,Sleator1993"

\end_inset

.
 Link Grammar is a form of dependency grammar; it is essentially equivalent
 to other forms of dependency grammar, in that Link Grammar dependencies
 can be algorithmically converted into other kinds of dependencies, as well
 as into phrase-structure grammars.
 In Link Grammar, a disjunct is defined as an ordered sequence of connectors;
 those connectors indicate how a word can attach or connect to other words,
 forming links.
 A reasonable conceptual model for a disjunct is to think of it as a jigsaw-puzz
le piece; the connectors correspond to the tabs and slots on the jigsaw-puzzle
 piece.
 Parsing a sentence then consists of assembling jigsaw pieces together,
 in such a way that there are no unconnected tabs or slots at the end of
 the process.
 The links between words then indicate the syntactic and semantic relationships
 between them.
\end_layout

\begin_layout Standard
A similar relationship between words can be obtained by performing a maximum-spa
nning-tree (MST) parse, where links between words are the ones that maximize
 the mutual information between the word-pairs.
\begin_inset CommandInset citation
LatexCommand cite
key "Yuret1998"

\end_inset

 The MST approach to natural language grammar has been well-explored, and
 provides a reasonably accurate model.
 The primary issue is that, when interpreted naively, it does not provide
 any labels that describe the relationships between words, such as 
\begin_inset Quotes eld
\end_inset

subject
\begin_inset Quotes erd
\end_inset

 or 
\begin_inset Quotes eld
\end_inset

object
\begin_inset Quotes erd
\end_inset

 relationships, which are considered to be key to the symbolic, linguistic
 grammatical structure of a language.
 
\end_layout
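
\begin_layout Standard
As a rough illustration of the MST idea, the Python sketch below builds a
 maximum spanning tree over the words of one sentence using Prim's algorithm.
 The function name mst_parse, the MI table and its values are hypothetical,
 invented for this example; the requirement that links do not cross is not
 enforced, and this is not claimed to be the implementation used to produce
 the dataset.
\end_layout

```python
def mst_parse(words, mi):
    """Maximum spanning tree over word positions (Prim's algorithm).

    `mi` maps (left_word, right_word) to a score; pairs missing from
    the table get a large negative score. Returns a sorted list of
    (i, j) links with i < j. No planarity constraint is enforced.
    """
    def score(i, j):
        lo, hi = min(i, j), max(i, j)
        return mi.get((words[lo], words[hi]), -1e9)

    n = len(words)
    in_tree = {0}
    links = []
    while len(in_tree) < n:
        # Candidate edges run from a word in the tree to one outside it.
        candidates = [
            (i, j) for i in in_tree for j in range(n) if j not in in_tree
        ]
        best = max(candidates, key=lambda edge: score(*edge))
        links.append((min(best), max(best)))
        in_tree.add(best[1])
    return sorted(links)


# Hypothetical MI scores for the word pairs of one sentence.
mi = {("Ben", "ate"): 2.0, ("ate", "pizza"): 3.0, ("Ben", "pizza"): 0.5}
links = mst_parse(["Ben", "ate", "pizza"], mi)
```

\begin_layout Standard
For the toy scores above, the tree links Ben to ate and ate to pizza, since
 those are the two highest-scoring pairs.
\end_layout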

\begin_layout Standard
A bridge can be built between the unlabeled MST representation of a
 parsed sentence, and the labeled Link Grammar parse of a sentence.
\begin_inset CommandInset citation
LatexCommand cite
key "Goertzel2014"

\end_inset

 This is done by starting with an unlabeled MST parse of a sentence, and
 then applying a label to each link, that label consisting of the word-pair,
 itself.
 In effect, one obtains a jigsaw-puzzle piece, where each tab and slot is
 labeled by the word that the tab/slot is allowed to connect to.
 After accumulating a large database of statistics on such jigsaw-puzzle
 pieces, one can compare them, looking for similarity.
 By clustering together similar pieces, one is effectively creating a dictionary
 or lexis of grammatical categories (parts of speech) together with the
 grammatical information as to how these can be assembled into grammatically
 correct, semantically meaningful sentences.
\end_layout

\begin_layout Standard
This approach appears to be sufficient to distinguish between different
 meanings associated with a word.
\begin_inset CommandInset citation
LatexCommand cite
key "Vepstas2018stiching"

\end_inset

 The final outcome is not dissimilar to popular neural-net techniques, although
 the path taken is entirely different.
\begin_inset CommandInset citation
LatexCommand cite
key "Vepstas2018skippy"

\end_inset

 The overall approach of discerning graph structure from large bodies of
 statistical data is hypothesized to be generalizable to domains far outside
 of linguistics, including genomics and proteomics.
\begin_inset CommandInset citation
LatexCommand cite
key "Vepstas2017sheaves"

\end_inset


\end_layout

\begin_layout Standard
This report examines one specific dataset of word-disjunct pairs.
 It explores the overall distribution of words and disjuncts, which are
 unsurprisingly Zipfian in nature.
 It explores various different measures of similarity between words, such
 as cosine distance and entropic distance, either of which can be used
 for discerning the grammatical classes into which the word-disjunct pairs
 can be assigned.
 In effect, this provides the background needed for understanding how words
 are assigned to grammatical categories, in practice.
 It characterizes the nature and structure of natural language, when viewed
 from the path of disjuncts derived from MST parses derived from the mutual
 information (MI) distribution of word pairs.
 The analysis is similar in spirit to an earlier analysis of the statistics
 of word-pairs.
\begin_inset CommandInset citation
LatexCommand cite
key "Vepstas2009"

\end_inset


\end_layout

\begin_layout Subsection
Recap
\end_layout

\begin_layout Standard
A detailed description of the process of obtaining (pseudo-)disjuncts from
 large unannotated text corpora is given elsewhere.
 This section provides a short, informal summary.
 
\end_layout

\begin_layout Standard
The story so far: Starting from a large text corpus, word-pairs are counted,
 and the mutual information (MI) of each pair is computed.
 This MI is used to perform a maximum spanning-tree (MST) parse of (a different
 subset of) the corpus.
 From each parse, a pseudo-disjunct is extracted for each word.
 The pseudo-disjunct is like a real LG disjunct, except that each connector
 in the disjunct is the word at the far end of the link.
\end_layout

\begin_layout Standard
So, for example, in an idealized world, the MST parse of the sentence "Ben
 ate pizza" would produce the parse Ben <–> ate <–> pizza and from this,
 we can extract the pseudo-disjunct (Ben- pizza+) on the word "ate".
 Similarly, the sentence "Ben puked pizza" should produce the disjunct (Ben-
 pizza+) on the word "puked".
 Since these two disjuncts are the same, we can conclude that the two words
 "ate" and "puked" are very similar to each other.
 Considering all of the other disjuncts that arise in this example, we can
 conclude that these are the only two words that are similar.
\end_layout
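
\begin_layout Standard
The extraction of pseudo-disjuncts from a parse can be sketched in Python
 as follows.
 This is a minimal sketch under stated assumptions: the function name pseudo_disjuncts,
 the representation of a parse as a list of index pairs, and the left-to-right
 ordering of connectors within a disjunct are all choices made for this
 illustration.
\end_layout

```python
from collections import Counter, defaultdict


def pseudo_disjuncts(words, links):
    """Return one (word, pseudo-disjunct) pair per word of a sentence.

    `words` is the list of words; `links` is a list of (i, j) index
    pairs, one per link in the parse. Each connector is the word at
    the far end of a link, suffixed with '-' if that word lies to the
    left and '+' if it lies to the right.
    """
    left = defaultdict(list)
    right = defaultdict(list)
    for i, j in links:
        lo, hi = min(i, j), max(i, j)
        right[lo].append(hi)
        left[hi].append(lo)
    result = []
    for k, word in enumerate(words):
        conns = [words[n] + "-" for n in sorted(left[k])]
        conns += [words[n] + "+" for n in sorted(right[k])]
        result.append((word, " ".join(conns)))
    return result


# Accumulate observation counts over the two example sentences.
counts = Counter()
for sentence in (["Ben", "ate", "pizza"], ["Ben", "puked", "pizza"]):
    counts.update(pseudo_disjuncts(sentence, [(0, 1), (1, 2)]))
```

\begin_layout Standard
Both "ate" and "puked" receive the disjunct (Ben- pizza+), each with an observation
 count of one.
\end_layout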

\begin_layout Standard
Any given word will have many pseudo-disjuncts attached to it.
 Each disjunct has a count of the number of times it has been observed.
 Thus, this set of disjuncts can be imagined to be a vector in a high-dimensional
 vector space, with each disjunct being a single basis element.
 The similarity of two words can be taken to be the cosine-similarity between
 the disjunct-vectors.
 Other similarities are possible, and it seems that the entropic similarity,
 described below, is superior.
\end_layout

\begin_layout Standard
Equivalently, the set of disjuncts can be thought of as a weighted set:
 each disjunct has a weight, corresponding to the number of times it has
 been observed.
 A weighted set is more or less the same thing as a vector, and these two
 are treated as the same, in what follows.
 Note that the disjunct vectors are sparse: for any given word, almost all
 coefficients will have a count of zero.
 For example, the dataset that will be examined next has over a quarter
 of a million different pseudo-disjuncts in it; most words have fewer than
 a hundred disjuncts on them.
\end_layout
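
\begin_layout Standard
The sparse disjunct vectors and their cosine similarity can be sketched in
 Python by storing each vector as a dictionary from disjunct to count.
 The function name cosine and the counts shown are hypothetical, chosen only
 to illustrate the calculation.
\end_layout

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    # Only disjuncts present in both vectors contribute to the dot product.
    dot = sum(n * v[d] for d, n in u.items() if d in v)
    norm_u = math.sqrt(sum(n * n for n in u.values()))
    norm_v = math.sqrt(sum(n * n for n in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical observation counts for two words.
ate = {"Ben- pizza+": 3, "Ben- soup+": 1}
puked = {"Ben- pizza+": 2}
similarity = cosine(ate, puked)
```

\begin_layout Standard
Storing only the non-zero counts keeps the cost proportional to the support
 of the words, rather than to the total number of distinct disjuncts.
\end_layout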

\begin_layout Subsection
Summary of results
\end_layout

\begin_layout Standard
The primary results reported below are these:
\end_layout

\begin_layout Standard
* Most scores and metrics that can be assigned to connector sets give a
 (scale-free) Zipfian ranking distribution, and are thus fairly boring.
 There are some oddities here and there.
\end_layout

\begin_layout Standard
* The greater the average number of observations per disjunct, the more
 grammatically acceptable (accurate) the disjunct seems to be.
 This is good news: it means that the general technique is not generating
 ungrammatical garbage.
\end_layout

\begin_layout Standard
* Connector sets can be given a mutual information score.
 The distribution for the MI scores appears to be Gaussian (
\emph on
i.e
\emph default
.
 a Bell curve).
 This comes as a bit of a surprise.
 I am not aware of what kind of network theory naturally gives rise to Gaussians.
\end_layout

\begin_layout Standard
* The MI score seems to be quite good at identifying words that participate
 in idioms, set phrases and institutional phrases.
\end_layout

\begin_layout Standard
* The average number of connectors per disjunct, which should have indicated
 the part-of-speech that the word belongs to, fails to do this.
 This seems to be due to the fact that the dataset is polluted with lists
 and tables (including tables-of-contents, and indexes), all of which are
 mis-interpreted as sentences by the processing software.
 This causes some very unusual disjuncts to be constructed.
\end_layout

\begin_layout Standard
* In the earlier sample, derived from Wikipedia, it became clear that there
 were very few verbs that aren't relationship verbs.
 Wikipedia articles describe concepts and events.
 The relationships between these require the copula and other relationship
 verbs: 
\begin_inset Quotes eld
\end_inset

is
\begin_inset Quotes erd
\end_inset

, 
\begin_inset Quotes eld
\end_inset

has
\begin_inset Quotes erd
\end_inset

, 
\begin_inset Quotes eld
\end_inset

was
\begin_inset Quotes erd
\end_inset

.
 Wikipedia is almost completely devoid of narrative verbs: 
\begin_inset Quotes eld
\end_inset

ran
\begin_inset Quotes erd
\end_inset

, 
\begin_inset Quotes eld
\end_inset

jumped
\begin_inset Quotes erd
\end_inset

, 
\begin_inset Quotes eld
\end_inset

hit
\begin_inset Quotes erd
\end_inset

, 
\begin_inset Quotes eld
\end_inset

ate
\begin_inset Quotes erd
\end_inset

, 
\begin_inset Quotes eld
\end_inset

thought
\begin_inset Quotes erd
\end_inset

, 
\begin_inset Quotes eld
\end_inset

took
\begin_inset Quotes erd
\end_inset

.
 Thus, we discern two very different styles of human communication: the
 exchange of facts, and the exchange of stories.
 Narratives contain a far richer selection of verbs, and thus, for language
 learning, a text corpus of narratives is required.
 Ideally, this would be from young-adult literature, which is a bit more
 direct in its kinesthetic content than adult literature might be.
\end_layout

\begin_layout Standard
* Cosine similarity applied to connector sets seems to be an effective way
 of determining the grammatical similarity of words.
 Yet, it is not so unambiguously great, that other kinds of measures shouldn't
 be contemplated.
\end_layout

\begin_layout Standard
* Entropic similarity, essentially a form of symmetric mutual information
 between words, appears to be an even better similarity measure.
 It is particularly appealing because it is naturally additive, and thus
 fits naturally into network theories derived from thermodynamic partition
 functions.
\end_layout

\begin_layout Subsection
Revision History
\end_layout

\begin_layout Standard
The original version of this report, dated 7 May 2017, was prepared on a
 painfully small dataset, which also (may have?) incorporated a fatal bug
 in the disjunct code: disjuncts were being assembled incorrectly, due to
 a reversed sign in the MI calculations.
 This bug was eventually uncovered, and so it seemed best to entirely discard
 the initial analysis, and instead repeat it with a newer and larger dataset
 that was correctly assembled.
 The revised analysis was done mostly in July 2017.
 Sadly, the discovery of this bug required that multiple large datasets
 be discarded and reconstructed.
 This caused a month of effort to be lost.
 Version 1 can be accessed by digging in git, and pulling up commit 27a66643a52c
0985adc5b38caf94fc25f5e2e684 (or maybe a bit earlier, circa late June 2017,
 as that is when the bug was spotted).
\end_layout

\begin_layout Standard
Simultaneously, there was a lot of confusion about the efficacy of the cosine
 similarity measure.
 Initial work on cosine similarity used a filtered dataset, with the goal
 of filtering to reduce 
\begin_inset Quotes eld
\end_inset

noise
\begin_inset Quotes erd
\end_inset

 in the dataset, as well as to manage dataset size.
 It turns out that this filtering also had the undesired side-effect of
 destroying much of the 
\begin_inset Quotes eld
\end_inset

signal
\begin_inset Quotes erd
\end_inset

 as well – it caused many grammatically unrelated words to be judged to
 be very similar.
 Between the accidental sign reversal, and the excessively strong data cuts,
 it was all very confusing, and it has taken another month to recover from
 this – I'm back to where I was in May, just older and wiser, now.
 Versions 2 and 3 of this report provide the status quo.
 They can be found at commit bc48712d3c2ab71383961f4a50dd6b3b4c6fd75f in
 git.
\end_layout

\begin_layout Standard
Version 4 of this report expands it by considering an entropic similarity
 between words, intended to supplant the cosine similarity as an effective
 graph metric.
 The entropic similarity is a form of symmetric mutual information between
 words; it is given this oddball name so as to avoid confusion with a variety
 of other contexts in which mutual information appears.
 And so, Version 4 provides additional statistical analysis comparing the
 entropic similarity to the cosine similarity.
\end_layout

\begin_layout Section
Dataset Characterization
\end_layout

\begin_layout Standard
Some terminology and notation are introduced next, followed by a characterizatio
n of the dataset.
 This is followed by a statistical analysis of the word-disjunct pairs.
 The analysis of word-similarity is left for a later section.
\end_layout

\begin_layout Subsection
Terminology
\end_layout

\begin_layout Standard
It is useful to introduce some notation for counting words, disjuncts, and
 connectors.
 Let 
\begin_inset Formula $N(w)$
\end_inset

 be the number of times that the word 
\begin_inset Formula $w$
\end_inset

 has been observed, in the dataset.
 Let 
\begin_inset Formula $N(w,d)$
\end_inset

 be the number of times that the disjunct 
\begin_inset Formula $d$
\end_inset

 has been observed on word 
\begin_inset Formula $w$
\end_inset

.
 The pair 
\begin_inset Formula $(w,d)$
\end_inset

 is referred to as a 
\begin_inset Quotes eld
\end_inset

connector set
\begin_inset Quotes erd
\end_inset

 or 
\begin_inset Quotes eld
\end_inset

cset
\begin_inset Quotes erd
\end_inset

 in the text below.
 Thus, for a word 
\begin_inset Formula $w$
\end_inset

, there is a set 
\begin_inset Formula $(w,*)=\left\{ (w,d)|N(w,d)>0\right\} $
\end_inset

 of associated csets, called the 
\begin_inset Quotes eld
\end_inset

support
\begin_inset Quotes erd
\end_inset

 of the word.
 The size of this set can be written using the standard notation for set-sizes
 as 
\begin_inset Formula $\left|(w,*)\right|$
\end_inset

.
 Similarly, a disjunct 
\begin_inset Formula $d$
\end_inset

, is supported by the set 
\begin_inset Formula $(*,d)=\left\{ (w,d)|N(w,d)>0\right\} $
\end_inset

 of associated csets.
 
\end_layout

\begin_layout Standard
The primary contents of the database are the counts 
\begin_inset Formula $N(w,d)$
\end_inset

 and everything else of interest in this section can be obtained from this.
 Note that 
\begin_inset Formula $N(w,d)$
\end_inset

 can be understood as a matrix, where the disjuncts identify columns, and
 the words identify rows.
 In general, this is a very sparse matrix: the number of non-zero entries
 
\begin_inset Formula $\left|(*,*)\right|$
\end_inset

 is far less than the number of rows times the number of columns.
\end_layout

\begin_layout Standard
Every time a word is observed in an MST parse, a disjunct is extracted for
 it; thus, word observations and disjunct observations are in one-to-one
 correspondence.
 In notation:
\begin_inset Formula 
\[
\sum_{d}N(w,d)=N(w,*)=N(w)
\]

\end_inset

Similarly, the total number of times that a disjunct was observed is just
\begin_inset Formula 
\[
N(*,d)=\sum_{w}N(w,d)
\]

\end_inset


\end_layout

\begin_layout Standard
Frequencies can be obtained by dividing by the total number of observations,
 so that 
\begin_inset Formula $p(w,d)=N(w,d)/N(*)$
\end_inset

 and 
\begin_inset Formula $p(w)=N(w)/N(*)$
\end_inset

 with 
\begin_inset Formula $N(*)=\sum_{w}N(w)$
\end_inset

 the total number of observations of words.
\end_layout
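
\begin_layout Standard
The wild-card sums, supports and frequencies defined above can be computed
 from the counts in a single pass.
 The Python sketch below uses a small hypothetical table of counts; the variable
 names are invented for this illustration.
\end_layout

```python
from collections import defaultdict

# Hypothetical counts N(w,d), keyed by (word, disjunct).
N = {
    ("ate", "Ben- pizza+"): 3,
    ("ate", "Ben- soup+"): 1,
    ("puked", "Ben- pizza+"): 2,
}

N_word = defaultdict(int)        # N(w,*) = N(w)
N_disjunct = defaultdict(int)    # N(*,d)
support_word = defaultdict(int)  # |(w,*)|
for (w, d), n in N.items():
    if n > 0:
        N_word[w] += n
        N_disjunct[d] += n
        support_word[w] += 1

N_total = sum(N_word.values())                # N(*)
p = {wd: n / N_total for wd, n in N.items()}  # p(w,d)
```

\begin_layout Standard
Here the word "ate" has four observations spread over a support of two disjuncts,
 and the frequencies sum to one.
\end_layout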

\begin_layout Standard
A single disjunct is always composed of a fixed number of connectors, independen
tly of any observations; let 
\begin_inset Formula $C(d,c)$
\end_inset

 be the number of times that connector 
\begin_inset Formula $c$
\end_inset

 appears in disjunct 
\begin_inset Formula $d$
\end_inset

.
 Note that 
\begin_inset Formula $C(d,c)$
\end_inset

 is almost always either zero or one; however, a connector can appear more
 than once in a disjunct, so this count can rise to 2 or 3 or higher.
 The wild-card sum 
\begin_inset Formula $C(d,*)=\sum_{c}C(d,c)$
\end_inset

 is the total number of connectors in the disjunct; it is the degree of
 the disjunct, viewed as a vertex with one edge per connector.
 It is also useful to define 
\begin_inset Formula $C(d,+)$
\end_inset

 and 
\begin_inset Formula $C(d,-)$
\end_inset

 as the total number of right-linking and left-linking connectors.
\end_layout
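
\begin_layout Standard
The connector counts can be illustrated with a short Python sketch.
 It assumes, purely for illustration, that a disjunct is written as a string
 of space-separated connectors, each ending in + or -, as in the (Ben- pizza+)
 notation used earlier; the function name connector_counts is hypothetical.
\end_layout

```python
from collections import Counter


def connector_counts(disjunct):
    """Return (C(d,c) as a Counter, C(d,+), C(d,-)) for one disjunct."""
    connectors = disjunct.split()
    per_connector = Counter(connectors)
    right = sum(1 for c in connectors if c.endswith("+"))
    left = sum(1 for c in connectors if c.endswith("-"))
    return per_connector, right, left


# A made-up disjunct in which one connector appears twice.
C, n_right, n_left = connector_counts("the- big- dog+ dog+")
n_total = sum(C.values())  # C(d,*)
```

\begin_layout Standard
In this made-up disjunct the connector dog+ appears twice, so its count is
 2, while the total number of connectors is 4.
\end_layout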

\begin_layout Subsection
Dataset characterization
\end_layout

\begin_layout Standard
The rest of this report is based on a single dataset, called 'en_pairs_rfive_mtw
o'.
 It was built by performing MST parsing of text from tranche-1 and 2,
 using word-pair statistics gathered from random parses of text from tranche-1,2
,3,4,5.
\end_layout

\begin_layout Standard
These 
\begin_inset Quotes eld
\end_inset

tranches
\begin_inset Quotes erd
\end_inset

 consist of unannotated text corpora downloaded from the net; the specific
 download scripts that were used are the 'download.sh' scripts located in
 github, at https://github.com/opencog/learn/download.
 Word-pair statistics are obtained by 
\begin_inset Quotes eld
\end_inset

random-tree parsing
\begin_inset Quotes erd
\end_inset

.
 A random-tree parse is generated by creating a random planar parse tree
 for a sentence; it is planar in the sense that no links cross.
 It is random in the sense that a uniform distribution is assumed on the
 space of all possible planar parses.
 For each link in such a parse, one simply records the two words at each
 end of the link, and increments the count of that particular word pair
 by one.
 The random-tree method for obtaining word-pair counts results in subtly
 different statistics than what would be obtained by a sliding window of
 some fixed width.
 In particular, the random-tree method will occasionally count long-distance
 links, longer than what would be seen in a sliding window.
 The difference in statistics between random-tree and sliding-window methods
 has not been characterized.
 It is presumed to be minor, and probably immaterial to subsequent results.
\end_layout

\begin_layout Standard
The word-pair dataset, and other similar datasets, are summarized in the
 language learning diary
\begin_inset CommandInset citation
LatexCommand cite
key "Vepstas2013"

\end_inset

, in the section titled 
\begin_inset Quotes eld
\end_inset

Dataset report 3 June 2017
\begin_inset Quotes erd
\end_inset

.
 It is repeated here, as it gives a hint of the foundation for the MST parses.
 The column labels in the table below are the same as those explained there.
 They are:
\end_layout

\begin_layout Description
Size The dimensions of the array.
 This is the number of unique, distinct words observed occurring on the
 left-side of a word pair, times the number of words occurring on the right.
 We expect the dimensions to be approximately equal, as most words will
 typically occur on both the left and right side of a pair.
 
\end_layout

\begin_layout Description
Pairs The total number of distinct word-pairs observed.
\end_layout

\begin_layout Description
Obs'ns The total number of observations of these pairs.
 Most pairs will be observed more than once.
 Distributions are typically Zipfian.
\end_layout

\begin_layout Description
Obs/pr The average number of times each word-pair was observed.
\end_layout

\begin_layout Description
Entropy The total entropy of these pairs in this dataset.
 Denote a word-pair as 
\begin_inset Formula $(w_{L},w_{R})$
\end_inset

 and 
\begin_inset Formula $p(w_{L},w_{R})$
\end_inset

 as the probability (normalized count) for observing the word-pair.
 The total entropy is then 
\begin_inset Formula $H=-\sum_{w_{L},w_{R}}p(w_{L},w_{R})\log_{2}p(w_{L},w_{R})$
\end_inset

.
\end_layout

\begin_layout Description
MI The total mutual information for the pairs in this dataset, defined as:
 
\begin_inset Formula $MI=\sum_{w_{L},w_{R}}p(w_{L},w_{R})\log_{2}\frac{p(w_{L},w_{R})}{p(w_{L},*)\,p(*,w_{R})}$
\end_inset


\end_layout
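
\begin_layout Standard
The entropy and MI columns can be computed directly from a table of pair
 counts.
 The Python sketch below follows the two definitions above; the function
 name entropy_and_mi and the toy counts are hypothetical.
\end_layout

```python
import math
from collections import defaultdict


def entropy_and_mi(pair_counts):
    """Return (H, MI) in bits for a dict {(w_left, w_right): count}."""
    total = sum(pair_counts.values())
    left = defaultdict(float)   # p(w_L, *)
    right = defaultdict(float)  # p(*, w_R)
    for (wl, wr), n in pair_counts.items():
        left[wl] += n / total
        right[wr] += n / total
    h = 0.0
    mi = 0.0
    for (wl, wr), n in pair_counts.items():
        if n == 0:
            continue  # unobserved pairs contribute nothing
        p = n / total
        h -= p * math.log2(p)
        mi += p * math.log2(p / (left[wl] * right[wr]))
    return h, mi


# Two perfectly correlated pairs, then four independent ones.
h_corr, mi_corr = entropy_and_mi({("a", "x"): 1, ("b", "y"): 1})
h_ind, mi_ind = entropy_and_mi(
    {("a", "x"): 1, ("a", "y"): 1, ("b", "x"): 1, ("b", "y"): 1}
)
```

\begin_layout Standard
In the first toy table the left word determines the right word, giving one
 bit of entropy and one bit of MI; in the second the two sides are independent,
 giving two bits of entropy and zero MI.
\end_layout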

\begin_layout Standard
The 'en_pairs_rfive' word-pair dataset can then be summarized as:
\begin_inset VSpace defskip
\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Tabular
<lyxtabular version="3" rows="2" columns="7">
<features tabularvalignment="middle">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Size
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Pairs
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Obs'ns
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Obs/pr
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Entropy
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
MI
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Dataset
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
839K
\begin_inset Formula $\times$
\end_inset

851K
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
30.1M
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1.35G
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
44.9
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
18.54
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1.84
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
en_pairs_rfive
\end_layout

\end_inset
</cell>
</row>
</lyxtabular>

\end_inset


\begin_inset VSpace defskip
\end_inset


\end_layout

\begin_layout Standard
\noindent
The diary provides reports for other word-pair datasets, of varying sizes,
 and some for other languages.
 Perhaps the most interesting and surprising result is that the MI value
 is nearly independent of the size of the dataset, and the language, varying
 between about 1.8 and 3.0.
\end_layout

\begin_layout Standard
A collection of pairs can be thought of as a sparse matrix; the rows of
 the matrix can be taken to be vectors, as can the columns.
 As vectors, they can be thought of as inhabiting some Banach space, having
 an 
\begin_inset Formula $l_{p}$
\end_inset

-norm.
 Given a vector 
\begin_inset Formula $\vec{v}=\sum_{n}a_{n}\hat{e}_{n}$
\end_inset

 decomposed into basis elements 
\begin_inset Formula $\hat{e}_{n}$
\end_inset

 and real-number-valued coefficients 
\begin_inset Formula $a_{n}$
\end_inset

, the 
\begin_inset Formula $l_{p}$
\end_inset

-norm is defined as
\begin_inset Formula 
\[
l_{p}\left(\vec{v}\right)=\sqrt[p]{\sum_{n}\left|a_{n}\right|^{p}}
\]

\end_inset

For 
\begin_inset Formula $p=2$
\end_inset

, this is simply the (Euclidean) 
\begin_inset Quotes eld
\end_inset

length
\begin_inset Quotes erd
\end_inset

 of the vector.
 The 
\begin_inset Formula $p=1$
\end_inset

 norm is sometimes called the 
\begin_inset Quotes eld
\end_inset

Manhattan distance
\begin_inset Quotes erd
\end_inset

; here called simply the 
\begin_inset Quotes eld
\end_inset

count
\begin_inset Quotes erd
\end_inset

.
 For 
\begin_inset Formula $p=0$
\end_inset

, one is, by convention, simply measuring the support of the vector; that is, the number
 of coefficients 
\begin_inset Formula $a_{n}$
\end_inset

 that are non-zero.
 In what follows, the 
\begin_inset Formula $\hat{e}_{n}$
\end_inset

 correspond to words, either to words on the left of a pair, or on the right.
 Thus, the column headings of the table below are defined as follows:
\end_layout

\begin_layout Description
Size The left and right dimensions, as before.
 
\emph on
Videlicet
\emph default
, the number of unique, distinctly different words observed on the left
 and the right side of a pair.
 Viewed as a matrix, this is the number of columns and rows in the matrix.
\end_layout

\begin_layout Description
Support The support is the average number of word-pairs that a word participates
 in (on the left, or on the right).
 Viewed as a matrix, this is the average number of non-zero entries in each
 row or column.
 Viewed as (row or column) vectors, this is the 
\begin_inset Quotes eld
\end_inset

support
\begin_inset Quotes erd
\end_inset

 of a (row or column) vector.
 Formally, this is the 
\begin_inset Formula $l_{0}$
\end_inset

 norm of each vector: 
\begin_inset Formula $\left|(w_{L},*)\right|=\sum_{w_{R}}\left(0<N(w_{L},w_{R})\right)$
\end_inset

 and likewise 
\begin_inset Formula $\left|(*,w_{R})\right|=\sum_{w_{L}}\left(0<N(w_{L},w_{R})\right)$
\end_inset

.
\end_layout

\begin_layout Description
Count The count is the average number of times that a word-pair was observed,
 for a given word.
 Viewed as a matrix, this is the average value of each non-zero entry (averaged
 over rows, or columns).
 Viewed as vectors, this is the 
\begin_inset Formula $l_{1}$
\end_inset

 norm divided by the 
\begin_inset Formula $l_{0}$
\end_inset

 norm.
 The 
\begin_inset Formula $l_{1}$
\end_inset

 norm is just the wild-card counts 
\begin_inset Formula $N(w_{L},*)$
\end_inset

 and 
\begin_inset Formula $N(*,w_{R})$
\end_inset

, where as always, the wild-card counts are defined as 
\begin_inset Formula $N(w_{L},*)=\sum_{w_{R}}N(w_{L},w_{R})$
\end_inset

.
 The count shown in the table is then the average count: 
\begin_inset Formula $N(w_{L},*)/\left|(w_{L},*)\right|$
\end_inset

 for the rows, and likewise for the columns.
\end_layout

\begin_layout Description
Length The length is the average length of the row and column vectors.
 This is the 
\begin_inset Formula $l_{2}$
\end_inset

 norm divided by the 
\begin_inset Formula $l_{0}$
\end_inset

 norm.
 The 
\begin_inset Formula $l_{2}$
\end_inset

 norm is just the standard concept of the length of a vector in Euclidean
 space.
 Here, 
\begin_inset Formula $L(w_{L},*)=\sqrt{\sum_{w_{R}}N^{2}(w_{L},w_{R})}$
\end_inset

, and likewise 
\begin_inset Formula $L(*,w_{R})=\sqrt{\sum_{w_{L}}N^{2}(w_{L},w_{R})}$
\end_inset

.
 The length is interesting, because it 
\begin_inset Quotes eld
\end_inset

penalizes
\begin_inset Quotes erd
\end_inset

 word-pairs with only a small number of counts.
 The act of squaring the count has the effect of giving much higher 
\begin_inset Quotes eld
\end_inset

confidence
\begin_inset Quotes erd
\end_inset

 to large observation counts: a word-pair observed twice as often is given
 four times the credit.
 The length shown in this table is the 
\begin_inset Quotes eld
\end_inset

average
\begin_inset Quotes erd
\end_inset

 length: it is 
\begin_inset Formula $L(w_{L},*)/\left|(w_{L},*)\right|$
\end_inset

 for the rows, and likewise for the columns.
\end_layout
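
\begin_layout Standard
As a worked illustration of these definitions, consider a hypothetical left-word 
\begin_inset Formula $w_{L}$
\end_inset

 that is observed in exactly three distinct pairs, with counts 4, 1 and 1.
 These numbers are invented for illustration, and are not taken from the
 dataset.
 Then
\begin_inset Formula 
\begin{align*}
\left|(w_{L},*)\right| & =3\\
N(w_{L},*) & =4+1+1=6\\
L(w_{L},*) & =\sqrt{4^{2}+1^{2}+1^{2}}=\sqrt{18}\approx4.24
\end{align*}

\end_inset

so that, for this word, the count is 
\begin_inset Formula $6/3=2$
\end_inset

 and the length is 
\begin_inset Formula $4.24/3\approx1.41$
\end_inset

.
 Had the same six observations been spread evenly, as 2, 2 and 2, the count
 would still be 2, but the length would drop to 
\begin_inset Formula $\sqrt{12}/3\approx1.15$
\end_inset

: the length rewards the concentration of observations in a few pairs.
\end_layout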

\begin_layout Standard
The size, support, count and length for the pairs are given below.
\begin_inset VSpace defskip
\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Tabular
<lyxtabular version="3" rows="3" columns="9">
<features tabularvalignment="middle">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<row>
<cell multicolumn="1" alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Size
\end_layout

\end_inset
</cell>
<cell multicolumn="2" alignment="center" valignment="top" topline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell multicolumn="1" alignment="center" valignment="top" topline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Support
\end_layout

\end_inset
</cell>
<cell multicolumn="2" alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell multicolumn="1" alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Count
\end_layout

\end_inset
</cell>
<cell multicolumn="2" alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell multicolumn="1" alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Length
\end_layout

\end_inset
</cell>
<cell multicolumn="2" alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Dataset
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
L
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
R
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
L
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
R
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
L
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
R
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
L
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
R
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Name
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
839K
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
851K
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
80.6K
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
80.6K
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
249
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
230
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
28.2
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
24.5
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
en_pairs_rfive
\end_layout

\end_inset
</cell>
</row>
</lyxtabular>

\end_inset


\begin_inset VSpace defskip
\end_inset


\end_layout

\begin_layout Standard
The disjunct dataset is obtained by collecting statistics on MST parses
 obtained using the word-pair dataset summarized in the tables above.
 MST parsing is performed by considering the set of all possible planar
 trees connecting the words in a sentence.
 One then selects the single tree that maximizes the sum of the MI
 values between all word-pairs.
 This is the MST parse.
 Given an MST parse, one then 
\begin_inset Quotes eld
\end_inset

cuts
\begin_inset Quotes erd
\end_inset

 each link, and labels the cut end of the link with the word that it was
 previously connected to.
 The disjunct is then the collection of these cut, labeled half-edges.
 The collection is ordered, left-to-right, and a plus/minus indicator is
 used to mark each half-edge as having been originally left-going (minus) or
 right-going (plus).
 A word-disjunct pair is then the word, and the dangling links that go with
 it; the word can be thought of as sitting in the middle of the left-going
 connectors, and the right-going connectors.
 (In this sense, a word-disjunct pair resembles an N-gram, except that N
 is not fixed, and there might be skipped intermediate words).
 The dataset is created by accumulating counts for a large number of these
 word-disjunct pairs.
\end_layout

\begin_layout Standard
Just as before, the dataset of pair-counts can be viewed as a sparse matrix.
 The rows correspond to words, the columns to disjuncts.
 The table below summarizes this dataset.
 The column headings are the same as before, with the following additional
 values being reported:
\end_layout

\begin_layout Description
\begin_inset Formula $H_{left}$
\end_inset

,
\begin_inset Formula $H_{right}$
\end_inset

 The left and right entropies.
 These are defined as 
\begin_inset Formula $H_{right}=-\sum_{w}p(w,*)\log_{2}p(w,*)$
\end_inset

 and 
\begin_inset Formula $H_{left}=-\sum_{d}p(*,d)\log_{2}p(*,d)$
\end_inset

.
 Note that 
\begin_inset Formula $MI=H_{left}+H_{right}-H$
\end_inset

 holds, by definition.
 The left and right entropies were not reported for the word-pairs table,
 because these two are nearly equal, and are equal to half the sum
 of the entropy and the MI.
\end_layout

\begin_layout Standard
The dataset is then characterized as:
\begin_inset VSpace defskip
\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Tabular
<lyxtabular version="3" rows="2" columns="9">
<features tabularvalignment="middle">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Size
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Csets
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Obs'ns
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Ob/cs
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Entropy
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $H_{left}$
\end_inset


\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $H_{right}$
\end_inset


\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
MI
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Dataset
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
137K
\begin_inset Formula $\times$
\end_inset

6.24M
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
8.63M
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
18.5M
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2.14
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
20.96
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
19.14
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
9.71
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
7.90
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
en_pairs_rfive_mtwo
\end_layout

\end_inset
</cell>
</row>
</lyxtabular>

\end_inset


\begin_inset VSpace defskip
\end_inset


\end_layout

\begin_layout Subsection
Dataset Analysis
\end_layout

\begin_layout Standard
The word-pair dataset contains 851964 words.
 Not all of these were observed during the subsequent MST parsing, as the
 size of the MST-parsed corpus was smaller.
 However, since the MST corpus was a subset of the word-pair corpus, it
 does not contain any words that were not previously observed.
\end_layout

\begin_layout Standard
The MST dataset has 137078 words that have disjuncts attached to them.
 These words have been observed a total of 18489594 times, for an average
 of 18489594 / 137078 = 134.9 observations per word.
 This dataset contains 6239997 different, unique disjuncts, for an average
 of 18489594 / 6239997 = 2.963 observations per disjunct.
 
\end_layout

\begin_layout Standard
The period (the punctuation mark denoting end-of-sentence) appears 849354
 times, suggesting that this many sentences were observed.
 There are thus an average of 18489594 / 849354 = 21.77 words per
 sentence.
 The precise statistics, however, are governed by the end-of-sentence detector,
 which ignores common abbreviations such as 
\begin_inset Quotes eld
\end_inset

Mr.
\begin_inset Quotes erd
\end_inset

, 
\begin_inset Quotes eld
\end_inset

Mrs.
\begin_inset Quotes erd
\end_inset

, 
\emph on
etc
\emph default
.
 but may make mistakes for uncommon abbreviations, and for non-prose text,
 such as tables and charts.
\end_layout

\begin_layout Standard
As noted above, the dataset contains 6239997 unique disjuncts, for an average
 of 18489594 / 6239997 = 2.96 observations per disjunct.
 It is this number that makes the dataset feel thin and sparse.
 It is not clear how accurate that perception is: an earlier dataset was about
 one-tenth to one-twentieth the size, in the number of words, disjuncts
 and observations, yet it had a ratio of 1.5 observations per cset.
 That is, making more than ten times the number of observations only doubled
 the observations per cset.
 This suggests that the distribution is Zipfian, so that large increases
 in observations result in only small changes to averages.
\end_layout

\begin_layout Standard
The dataset is also sparse in a completely different sense from the above.
 Viewing 
\begin_inset Formula $N(w,d)$
\end_inset

 as a matrix whose size is 
\begin_inset Formula $137078\times6239997$
\end_inset

, only a very small fraction of the entries are non-zero.
 Specifically, the fraction is 
\begin_inset Formula $8629163/(137078\times6239997)=1.009\times10^{-5}$
\end_inset

.
 The sparsity of this matrix can be defined as 
\begin_inset Formula $-\log_{2}$
\end_inset

 of this number; here, it is 
\begin_inset Formula $-\log_{2}\left(1.009\times10^{-5}\right)\approx16.60$
\end_inset

.
 The sparsity appears to increase with the number of observations: the previous,
 ten-times-smaller dataset had a sparsity of 15.
\end_layout

\begin_layout Standard
The total word-entropy for the dataset is defined as
\begin_inset Formula 
\[
H_{word}=-\sum_{w}p(w)\log_{2}p(w)
\]

\end_inset

and was measured to be 
\begin_inset Formula $H_{word}=9.71$
\end_inset

 bits.
\begin_inset Foot
status collapsed

\begin_layout Plain Layout
This and the following entropies were measured with the word-entropy-bits,
 disjunct-entropy-bits, etc.
 functions in 'disjunct-stats.scm'.
 Alternatively, the 'print-matrix-summary-report' function now reports this.
\end_layout

\end_inset

 The total entropy is much larger.
 It is defined as
\begin_inset Formula 
\[
H_{total}=-\sum_{w,d}p(w,d)\log_{2}p(w,d)
\]

\end_inset

and is measured to be 
\begin_inset Formula $H_{total}=20.96$
\end_inset

 bits.
 The disjunct entropy is dual to the word entropy:
\begin_inset Formula 
\[
H_{disjunct}=-\sum_{d}p(*,d)\log_{2}p(*,d)
\]

\end_inset

and is measured to be 
\begin_inset Formula $H_{disjunct}=19.14$
\end_inset

 bits.
 The total mutual information between the words and disjuncts is then 
\begin_inset Formula 
\[
MI_{cset}=\sum_{w,d}p(w,d)\log_{2}\frac{p(w,d)}{p(*,d)p(w,*)}=H_{word}+H_{disjunct}-H_{total}
\]

\end_inset

and is measured to be 
\begin_inset Formula $MI_{cset}=7.897$
\end_inset

 bits.
\end_layout
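
\begin_layout Standard
The second equality above can be verified directly.
 Assuming that 
\begin_inset Formula $p(w)$
\end_inset

 denotes the marginal 
\begin_inset Formula $p(w,*)=\sum_{d}p(w,d)$
\end_inset

, and likewise 
\begin_inset Formula $p(*,d)=\sum_{w}p(w,d)$
\end_inset

, splitting the logarithm gives
\begin_inset Formula 
\begin{align*}
MI_{cset} & =\sum_{w,d}p(w,d)\log_{2}p(w,d)-\sum_{w,d}p(w,d)\log_{2}p(w,*)-\sum_{w,d}p(w,d)\log_{2}p(*,d)\\
 & =-H_{total}-\sum_{w}p(w,*)\log_{2}p(w,*)-\sum_{d}p(*,d)\log_{2}p(*,d)\\
 & =-H_{total}+H_{word}+H_{disjunct}
\end{align*}

\end_inset

As a rough consistency check on the reported values, 
\begin_inset Formula $9.71+19.14-20.96=7.89$
\end_inset

, which agrees with the measured 
\begin_inset Formula $MI_{cset}=7.897$
\end_inset

 to within the rounding of the three entropies.
\end_layout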

\begin_layout Standard
It is interesting to compare 
\begin_inset Formula $MI_{cset}$
\end_inset

 to 
\begin_inset Formula $MI_{word-pair}$
\end_inset

.
 The former is much larger, by just over 6 bits.
 There are several ways to think about this.
 Clearly, the connector-set database contains 
\begin_inset Quotes eld
\end_inset

more information
\begin_inset Quotes erd
\end_inset

; one might say that it is a distillation, an extraction.
 It is an amplification, but done so that the signal-to-noise ratio is much
 improved.
 MST parsing is clearly picking out some sort of structure from natural
 language.
 Consider it this way: 
\begin_inset Formula $2^{1.84}=3.58$
\end_inset

 and so the word-pair dataset contains on average 1.84 more bits of information
 than what one would expect to observe in a completely random, uniformly-distrib
uted word-salad, for which we expect 
\begin_inset Formula $MI=0$
\end_inset

.
 The word-disjunct dataset takes a large step in extracting signal from
 the noise: one gets 
\begin_inset Formula $7.897-1.84=6.057$
\end_inset

 more bits of information, for a factor of 
\begin_inset Formula $2^{6.057}=66.6$
\end_inset

.
 That is a respectable amount of noise reduction.
\end_layout

\begin_layout Section
Distributions
\end_layout

\begin_layout Standard
The previous section characterized the dataset in terms of its total size,
 entropy, and related single-number values that can be extracted from it.
 More interesting is the distribution of columns and rows, that is, the
 distribution of the words and the disjuncts.
 These result in various graphs, most of which are Zipfian in nature,
 although some very much are not.
 The distributions obtained here can be contrasted with the distributions
 obtained from word-pair counts.
\begin_inset CommandInset citation
LatexCommand cite
key "Vepstas2009"

\end_inset

 They are similar in many ways, and also strikingly different.
\end_layout

\begin_layout Standard
In the following, the term 
\begin_inset Quotes eld
\end_inset

connector set
\begin_inset Quotes erd
\end_inset

 is taken as a synonym for 
\begin_inset Quotes eld
\end_inset

disjunct
\begin_inset Quotes erd
\end_inset

.
\end_layout

\begin_layout Subsection
Connector-set distribution
\end_layout

\begin_layout Standard
Some connector-sets will be observed far more often than others.
 Likewise for the two sides of the connector-set: some words will have far
 more observations, and some disjuncts will be seen more often.
\end_layout

\begin_layout Standard
The two graphs below are dual to one another.
 The one on the left shows 
\begin_inset Formula $N(w,*)$
\end_inset

, ranked by count.
 The one on the right shows 
\begin_inset Formula $N(*,d)$
\end_inset

, also ranked.
\begin_inset Foot
status collapsed

\begin_layout Plain Layout
Obtained by running (print-ts-rank sorted-word-obs outport) from the
 disjunct-stats.scm file, on the en_pairs_rfive_mtwo database.
 The second one prints sorted-dj-obs.
 The graphs were generated with ranked.gplot.
\end_layout

\end_inset

 The first follows the canonical Zipf distribution.
 The green line is an eyeballed, approximate fit, of exponent -1.1.
 The second has an exponent of about -0.85.
\begin_inset Separator latexpar
\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Graphics
	filename en-dj/ranked-word-obs.eps
	width 50text%

\end_inset


\begin_inset Graphics
	filename en-dj/ranked-dj-obs.eps
	width 50text%

\end_inset


\end_layout
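
\begin_layout Standard
To make the exponents concrete: assuming that the graphs use logarithmic
 axes, a straight-line fit of slope 
\begin_inset Formula $-\alpha$
\end_inset

 corresponds to a rank-frequency law of the form
\begin_inset Formula 
\[
N(r)\approx C\,r^{-\alpha}
\]

\end_inset

where 
\begin_inset Formula $r$
\end_inset

 is the rank and 
\begin_inset Formula $C$
\end_inset

 is a constant; the symbols 
\begin_inset Formula $r$
\end_inset

, 
\begin_inset Formula $C$
\end_inset

 and 
\begin_inset Formula $\alpha$
\end_inset

 are introduced here only for illustration.
 Under such a law, doubling the rank multiplies the count by 
\begin_inset Formula $2^{-\alpha}$
\end_inset

: roughly 
\begin_inset Formula $2^{-1.1}\approx0.47$
\end_inset

 for the words, and 
\begin_inset Formula $2^{-0.85}\approx0.55$
\end_inset

 for the disjuncts.
 Since the fits are eyeballed, these factors are indicative only, and need
 not hold for the first few ranks.
\end_layout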

\begin_layout Standard
The first ten words in the word ranking are: "LEFT-WALL" "," "." "the" "and"
 "to" "of" "a" """ "in".
 This is the ranking of how often these words appear, overall, in the MST-parsed
 corpus.
 The number of connections to LEFT-WALL should be equal to the number of
 sentences in the corpus, as the parser is set up to make one LEFT-WALL
 connection per sentence.
 Most sentences will end in a period; some with question marks or other
 punctuation.
 Commas and the word 
\begin_inset Quotes eld
\end_inset

the
\begin_inset Quotes erd
\end_inset

 can appear more than once in a sentence.
 The frequent occurrence of the straight double-quote mark is due to the
 fact that the corpus is heavily weighted with dialog: i.e.
 with fictional novels, where the characters are speaking a lot.
\end_layout

\begin_layout Standard
This list is repeated in the table below.
 The support is 
\begin_inset Formula $\left|(w,*)\right|$
\end_inset

, that is, the number of different kinds of disjuncts observed for that
 word.
 The count is 
\begin_inset Formula $N(w,*)$
\end_inset

, that is, the total number of times those disjuncts have been observed
 for that word.
 The frequency is just the count divided by 18489594.
 The length is 
\begin_inset Formula $\mbox{len}(w,*)=\sqrt{\sum_{d}N^{2}(w,d)}$
\end_inset

, that is, the root of the sum of the squares of the observations.
\begin_inset Foot
status collapsed

\begin_layout Plain Layout
A printing utility for these three is `show-counts` in the `disjunct-stats.scm`
 file.
\end_layout

\end_inset


\begin_inset VSpace defskip
\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Tabular
<lyxtabular version="3" rows="11" columns="6">
<features tabularvalignment="middle">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
word
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
support
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
count
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
frequency
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $-\log_{2}$
\end_inset

frequency
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
length
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
LEFT-WALL
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
64215
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
972963
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.05262
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
4.248
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
122353.9
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
,
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
243987
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
957593
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.05179
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
4.271
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
25475.4
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
.
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
106195
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
849354
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.04594
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
4.444
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
55168.0
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
the
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
215324
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
727027
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.03932
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
4.669
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
9264.2
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
and
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
126861
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
420942
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.02277
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
5.457
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
28694.7
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
to
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
117110
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
401967
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.02174
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
5.523
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
11480.0
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
of
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
108951
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
371211
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.02008
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
5.638
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
11047.5
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
a
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
102720
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
289631
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.01566
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
5.996
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
6855.1
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
"
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
51289
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
256785
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.01389
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
6.170
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
21388.8
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
in
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
64011
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
208745
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.01129
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
6.469
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
14758.1
\end_layout

\end_inset
</cell>
</row>
</lyxtabular>

\end_inset


\begin_inset VSpace defskip
\end_inset


\end_layout

\begin_layout Standard
The main point of this table is to demonstrate the log-likelihood column.
 At this point, these numbers won't seem to have much meaning; however,
 they provide an overall scale that will be seen, repeatedly, in the analysis
 below.
 The range of magnitudes – 4 to 7 – is no accident, and similar ranges will
 be seen later.
 
\end_layout

\begin_layout Standard
The first ten pseudo-disjuncts in the disjunct-ranking are ""+" ",-" "the+"
 "He+" "The+" "the-" "LEFT-WALL-" "I+" "“+" "to-".
 The meaning of the plus and minus signs was explained above; but to recap:
 the disjunct 
\begin_inset Quotes eld
\end_inset

xxx+
\begin_inset Quotes erd
\end_inset

 means that there are many words that expect to be followed by the word
 
\begin_inset Quotes eld
\end_inset

xxx
\begin_inset Quotes erd
\end_inset

 (on the right).
 The disjunct 
\begin_inset Quotes eld
\end_inset

the-
\begin_inset Quotes erd
\end_inset

 means that there are many words that want to link to the word 
\begin_inset Quotes eld
\end_inset

the
\begin_inset Quotes erd
\end_inset

 on the left.
 This is grammatically correct: 
\begin_inset Quotes eld
\end_inset

the
\begin_inset Quotes erd
\end_inset

 is a determiner, and it is always the dependent of some noun.
 The disjunct 
\begin_inset Quotes eld
\end_inset

The+
\begin_inset Quotes erd
\end_inset

 at first appears to be grammatical nonsense: it states that
 there are many words that want to link to the word 
\begin_inset Quotes eld
\end_inset

The
\begin_inset Quotes erd
\end_inset

 on the right.
 Naively, this is never correct for English; determiners always precede
 the noun that they modify.
 The capitalization gives away what is really happening: the word 
\begin_inset Quotes eld
\end_inset

The
\begin_inset Quotes erd
\end_inset

 is a sentence opener, and it is being linked by the LEFT-WALL, indicating
 the start of the sentence; 
\emph on
ergo
\emph default
, it is linking backwards.
 Similar remarks apply to ""+" "the+" "He+" "I+" "“+": Clearly, the capitalized
 
\begin_inset Quotes eld
\end_inset

He
\begin_inset Quotes erd
\end_inset

 is a sentence opener, and 
\begin_inset Quotes eld
\end_inset

I
\begin_inset Quotes erd
\end_inset

 is plausibly so.
 The two different styles of quotation marks (symmetrically vertical and
 right-leaning) open up dialog in fictional novels, which make up a large
 portion of the corpus.
 
\end_layout

\begin_layout Standard
The above avoids the question of whether it is syntactically correct to
 link the LEFT-WALL to 
\begin_inset Quotes eld
\end_inset

The
\begin_inset Quotes erd
\end_inset

.
 This is determined not by raw frequency counts, but by mutual information.
 This is explored later.
 At this point we can only say that such a linking is frequent, and cannot
 judge whether it is correct.
\end_layout

\begin_layout Standard
The ranking of connector sets is shown below.
 It's a graph of the ranked counts 
\begin_inset Formula $N(w,d)$
\end_inset

.
 Recall that we define a 
\begin_inset Quotes eld
\end_inset

connector set
\begin_inset Quotes erd
\end_inset

 as the pairing of a word, and one particular disjunct that is associated
 with that word.
\begin_inset Foot
status collapsed

\begin_layout Plain Layout
Graph of sorted-cset-obs, 
\emph on
op cit
\emph default
.
 
\end_layout

\end_inset


\begin_inset Separator latexpar
\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Graphics
	filename en-dj/ranked-cset-obs.eps
	width 70text%

\end_inset


\end_layout

\begin_layout Standard
The top-ten connector-sets are "LEFT-WALL: He+;" "LEFT-WALL: The+;" "LEFT-WALL:
 “+;" "LEFT-WALL: "+;" ".: "+;" "LEFT-WALL: I+;" "and: ,-;" "LEFT-WALL: It+;"
 "LEFT-WALL: She+;" ".: ”+;"
\end_layout

\begin_layout Standard
These are hard to read, so, decoded: the first four are connectors from the left
 side of the sentence to the words 
\begin_inset Quotes eld
\end_inset

He
\begin_inset Quotes erd
\end_inset

 and 
\begin_inset Quotes eld
\end_inset

The
\begin_inset Quotes erd
\end_inset

, and two different styles of double-quote marks: a right-leaning double-quote,
 and a vertical double-quote.
 This was commented on before: the corpus has many novels, and so many sentences
 will begin with quotes.
 Next comes a period which links to a double-quote on its right.
 Last comes a period followed by a leaning double-quote.
 Clearly, this is expected in the corpus.
 Also visible are the sentence openers 
\begin_inset Quotes eld
\end_inset

I
\begin_inset Quotes erd
\end_inset

 
\begin_inset Quotes eld
\end_inset

It
\begin_inset Quotes erd
\end_inset

 
\begin_inset Quotes eld
\end_inset

She
\begin_inset Quotes erd
\end_inset

.
 In that list is the word 
\begin_inset Quotes eld
\end_inset

and
\begin_inset Quotes erd
\end_inset

, which connects on the 
\emph on
left
\emph default
 to a comma.
 Not a surprise.
\end_layout

\begin_layout Standard
Not visible in the top ten, but visible in the top fifty, are mirror images,
 for example: ",: and+;" in 14th place, which states that the comma expects
 to be followed by the 
\begin_inset Quotes eld
\end_inset

and
\begin_inset Quotes erd
\end_inset

.
 The counts differ: the seventh-place "and: ,-;" had 32755 counts, while 14th
 place had 17745.
 Presumably this is due to the comma having a different or more complex linkage
 about half the time.
 Other items in the top-50 that are not sentence openers include "”: ?-;"
 ",: but+;" ".: him-;" "in: the+;" ".: ’+;" "”: .-;" "of: the+;".
 The first disjunct with more than one connector in it is "It: LEFT-WALL-
 was+;" which states that 
\begin_inset Quotes eld
\end_inset

It
\begin_inset Quotes erd
\end_inset

 wants to be a sentence opener, but wants to be followed by the word 
\begin_inset Quotes eld
\end_inset

was
\begin_inset Quotes erd
\end_inset

 on the right.
 Not surprising.
 Proceeding down the list, the disjuncts continue in this manner.
 This does not necessarily mean that they are 
\begin_inset Quotes eld
\end_inset

high quality
\begin_inset Quotes erd
\end_inset

, only that they were frequently observed in MST parses.
 It might be the case that the quality is given by the MI between the word
 and its disjunct; this is explored later.
\end_layout

\begin_layout Subsection
Word distribution
\end_layout

\begin_layout Standard
It is also interesting to turn the word distribution graph 
\begin_inset Quotes eld
\end_inset

on its side
\begin_inset Quotes erd
\end_inset

.
 This is meant to be a simple exercise, so as to place some later graphs
 into context.
 Despite the simplicity, the analysis turns out to be somewhat surprising,
 and somewhat subtle.
 In particular, the last graph elicits some features in the dataset that
 are not otherwise easily visible.
 
\end_layout

\begin_layout Standard
In this dataset, there were 53076 words observed exactly once (out of a
 total of 18.5M observations of 137K words).
 This is quite something: of all the words observed, nearly 40% were seen
 only once.
 More than half were seen at most twice.
 These are presumably rare typos, foreign words, IPA pronunciation guides:
 any word that appears only once must be unusual; and yet, there are a lot
 of them! There are 17120 words that appear twice, 9081 that appear exactly
 3 times, 
\emph on
etc.

\emph default
 These counts are graphed below.
\begin_inset Foot
status collapsed

\begin_layout Plain Layout
Graph of binned-word-counts.dat, generated in disjunct-stats.scm 
\end_layout

\end_inset

 
\begin_inset Separator latexpar
\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Graphics
	filename en-dj/binned-word-counts.eps
	width 70text%

\end_inset


\begin_inset CommandInset label
LatexCommand label
name "graph word frequency"

\end_inset


\end_layout

\begin_layout Standard
This graph indicates that most (almost all) words were observed fewer than
 100 times.
 In this dataset, there were only 9425 words that were observed 100 or more
 times, 5683 words that were observed 200 or more times, and 3263 words
 that were observed 400 or more times.
 Percentage-wise: about 6.8% of the words were observed 100 times or more.
 This should be enough to give confidence in the syntactic usage of the
 commonly-used English words; but most of the rest of this dataset includes
 oddities of various sorts, including place names and given names.
 The challenge will be to see if these can be grouped into grammatical categorie
s.
\end_layout

\begin_layout Standard
Writing 
\begin_inset Formula $N$
\end_inset

 for the number of times that some word was observed, it appears that there
 are approximately 
\begin_inset Formula $53076\times N^{-3/2}$
\end_inset

 words observed that many times.
 In formulas, the size of the set of words 
\begin_inset Formula $\left\{ w|N(w)=N\right\} $
\end_inset

 is given by 
\begin_inset Formula 
\begin{equation}
\left|\left\{ w|N(w)=N\right\} \right|\sim N^{-3/2}\label{eq:word-distrib}
\end{equation}

\end_inset

where 
\begin_inset Formula $\left\{ w|\mbox{cond}\right\} $
\end_inset

 is a set of words (subject to the condition cond) and 
\begin_inset Formula $\left|\left\{ w|\mbox{cond}\right\} \right|$
\end_inset

 denotes the size of that set of words.
\end_layout
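
\begin_layout Standard
As a rough sanity check of the power law, the following Python sketch compares the three counts quoted above with the estimate K N^(-3/2), taking K = 53076.
 It is only an illustration: the function name is made up here, and only the first three values of N are given in the text.
\end_layout

```python
K = 53076  # number of words observed exactly once (from the text)

# Number of words observed exactly N times, as quoted in the text.
observed = {1: 53076, 2: 17120, 3: 9081}


def predicted(n, k=K):
    """Power-law estimate k * n^(-3/2) for the number of words seen n times."""
    return k * n ** -1.5


for n, obs in observed.items():
    est = predicted(n)
    print(f"N={n}: observed {obs}, predicted {est:.0f}, ratio {obs / est:.2f}")
```

\begin_layout Standard
For N = 2 and N = 3 the observed counts fall roughly ten percent below the estimate, which is consistent with reading the formula as an approximate trend rather than an exact fit.
\end_layout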

\begin_layout Standard
There is something interesting about this chart: it is more stable under
 varying dataset sizes than the Zipfian distribution.
 The slope of the Zipfian distribution changes, as datasets grow larger,
 typically trying to approach a slope of 1.0 very slowly.
 By contrast, the above 
\begin_inset Formula $N^{-3/2}$
\end_inset

 behavior seems to provide a much better description, even as the size of
 the dataset varies.
 I cannot demonstrate that assertion here, but have noticed it to be true
 when looking at other datasets.
\end_layout

\begin_layout Standard
The next graph belabors the point, and yet it's important.
\begin_inset Foot
status collapsed

\begin_layout Plain Layout
Generated from binned-word-logli.dat
\end_layout

\end_inset

 It shows nothing new, but it does show it in a format that will recur
 frequently, later.
 Thus, it's worth understanding now.
 This graph shows exactly the same data as the previous graph: it 
\emph on
is
\emph default
 the same graph, except that the x-axis is now labeled differently, and
 some of the counts have been binned together.
 So first: note that 
\begin_inset Formula $-\log_{2}(1/18489594)=24.140$
\end_inset

 and so this is the location of the first spike on the far-right.
 Next, 
\begin_inset Formula $-\log_{2}(2/18489594)=23.140$
\end_inset

 and 
\begin_inset Formula $-\log_{2}(3/18489594)=22.555$
\end_inset

 are the locations of the second and third spikes: these correspond to words
 that have been observed 1, 2 and 3 times.
 Words that have been observed exactly 
\begin_inset Formula $N$
\end_inset

 times will have a log-likelihood of 
\begin_inset Formula $-\log_{2}(N/18489594)=24.140-\log_{2}N$
\end_inset

.
 The formula 
\begin_inset CommandInset ref
LatexCommand ref
reference "eq:word-distrib"

\end_inset

, which was based on the graph above, can be rewritten as 
\end_layout

\begin_layout Standard
\begin_inset Formula 
\[
\left|\left\{ w|N(w)=N\right\} \right|\sim2^{-3/2\times\log_{2}N}
\]

\end_inset

which effectively predicts the height and location of the spikes.
 This is clearly demonstrated by the straight green line.
\begin_inset Separator latexpar
\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Graphics
	filename en-dj/binned-word-logli.eps
	width 70text%

\end_inset


\begin_inset CommandInset label
LatexCommand label
name "logli distribution graph"

\end_inset


\end_layout
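
\begin_layout Standard
The spike locations quoted above can be reproduced with a few lines of Python.
 This is a minimal sketch, assuming only the total of 18489594 observations stated in the text; the function name logli is chosen here for illustration.
\end_layout

```python
import math

T = 18489594  # total number of word observations (from the text)


def logli(n, total=T):
    """Log-likelihood -log2(n / total) of a word observed n times."""
    return -math.log2(n / total)


# Locations of the first three spikes, counted from the far right.
for n in (1, 2, 3):
    print(n, round(logli(n), 3))
```

\begin_layout Standard
Each doubling of N moves a spike exactly one unit to the left, which is why the spikes crowd together as N grows.
\end_layout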

\begin_layout Standard
But then something else happens.
 As long as the bins are narrow, so that they are either full, or empty,
 then the nice power law holds.
 Once the bins become too wide to just hold single, discrete counts, but
 instead lump together different logli's, the apparent distribution changes.
 It is worthwhile to understand this phenomenon.
\end_layout

\begin_layout Standard
This graph was generated by bin-counting.
 The x-axis was divided into 1200 equal-sized bins, and whenever the log-likelih
ood of a word landed within a particular bin, the count was accumulated
 into that bin.
 On the right side of the graph, many (most) bins are empty, because they
 do not correspond to logarithms of integers; this results in the spikes.
 The width of each bin is (24-9)/1200, and so when 
\begin_inset Formula 
\[
\log_{2}N-\log_{2}(N-1)\approx\log_{2}\left(1+\frac{1}{N}\right)\approx\frac{1}{N\log2}<\frac{24-9}{1200}=\frac{15}{1200}
\]

\end_inset

then multiple counts will be shoved into one bin.
 For this chart, this happens when 
\begin_inset Formula $N\approx115$
\end_inset

 so there are about 115 distinct spikes, and then they merge when the logli
 is 
\begin_inset Formula $24.140-\log_{2}115\approx17.3$
\end_inset

 which is the spot where the above graph bends from the green line to the
 blue line.
 From this point on, when multiple counts are being jammed into one bin,
 we expect the distribution measure to be given by the Jacobian determinant
\begin_inset Foot
status collapsed

\begin_layout Plain Layout
https://en.wikipedia.org/wiki/Jacobian_matrix_and_determinant
\end_layout

\end_inset

 of the point measure.
 We can compute this explicitly.
 Changing notation slightly, write 
\begin_inset Formula 
\[
C(N)=KN^{-3/2}
\]

\end_inset

for the number of words that were observed 
\begin_inset Formula $N$
\end_inset

 times.
 This is the same formula as before.
 For this dataset, 
\begin_inset Formula $K=53076$
\end_inset

.
 The bincount at logli
\begin_inset Formula $=x$
\end_inset

 is 
\begin_inset Formula 
\[
\mbox{bincnt}(x)=\sum_{x\le\log_{2}\left(T/N\right)\le x+\epsilon}C(N)\Delta N
\]

\end_inset

For this dataset, 
\begin_inset Formula $T=18489594$
\end_inset

.
 The binsize used in the above graph was 
\begin_inset Formula $\epsilon=15/1200=1/80$
\end_inset

.
 In the above formula, 
\begin_inset Formula $\Delta N=1$
\end_inset

 is a notational trick to allow us to convert the sum into an integral when
 
\begin_inset Formula $N$
\end_inset

 gets large.
 We replace 
\begin_inset Formula $\sum$
\end_inset

 by 
\begin_inset Formula $\int$
\end_inset

 and replace 
\begin_inset Formula $\Delta N$
\end_inset

 by 
\begin_inset Formula $dN$
\end_inset

 and write
\begin_inset Formula 
\[
\mbox{bincnt}(x)\approx\int_{x}^{x+\epsilon}C(N)\frac{dN}{dy}dy
\]

\end_inset

Here, there is a change-of-variable to 
\begin_inset Formula $y=\log_{2}\left(T/N\right)$
\end_inset

 or equivalently 
\begin_inset Formula $N=T2^{-y}$
\end_inset

.
 The Jacobian determinant is then 
\begin_inset Formula $\left|dN/dy\right|=N\log2$
\end_inset

 and so
\begin_inset Formula 
\[
\mbox{bincnt}(x)\approx\frac{\epsilon K\log2}{\sqrt{T}}2^{x/2}
\]

\end_inset

Comparing this to the graph above, with logli
\begin_inset Formula $=x$
\end_inset

, we expect the bincounted region to have a slope of 0.5, and yet, the eyeballed
 fit above clearly shows 0.8.
 What's going on? WTF? Blame the data.
 Try again.
 The graph below shows exactly the same data, but this time there are only
 60 bins grand-total, and so only the first three spikes show.
 After that, the spikes merge together into bins.
 The green line is drawn exactly with the same slope and offset as before:
 that's because the first three spikes are in exactly the same locations
 as before, and have the same height.
 The blue line shows an eyeballed fit to the merged counts, and initially,
 it really does have a slope of 0.5, which is exactly what the Jacobian determinant
 was telling us it should be.
 Yayy! Declare victory and go home!
\begin_inset Separator latexpar
\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Graphics
	filename en-dj/binned-word-logli-60.eps
	width 60text%

\end_inset


\end_layout
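
\begin_layout Standard
The bin-merging argument can be checked numerically.
 The following Python sketch is an illustration, not the code used to generate the graphs: it takes T, K and the 1200-bin width from the text, assumes that the pure power law C(N) = K N^(-3/2) holds for every N, and compares the discrete sum over one bin with the continuum estimate obtained by multiplying the integrand C(N)|dN/dy| by the bin width.
 The function names are made up for this example.
\end_layout

```python
import math

T = 18489594      # total word observations (from the text)
K = 53076         # number of words observed exactly once (from the text)
EPS = 15 / 1200   # bin width of the 1200-bin graph (from the text)


def merge_threshold(eps=EPS):
    """Value of N at which the spike spacing 1 / (N ln 2) equals the bin width."""
    return 1.0 / (eps * math.log(2))


def discrete_bincount(x, eps=EPS):
    """Sum K * N^(-3/2) over the integers N with x <= log2(T/N) <= x + eps."""
    n_lo = math.ceil(T * 2.0 ** (-(x + eps)))
    n_hi = math.floor(T * 2.0 ** (-x))
    return sum(K * n ** -1.5 for n in range(n_lo, n_hi + 1))


def continuum_bincount(x, eps=EPS):
    """Integrand K ln2 T^(-1/2) 2^(x/2), multiplied by the bin width eps."""
    return eps * K * math.log(2) / math.sqrt(T) * 2.0 ** (x / 2)


n_merge = merge_threshold()
print("spikes merge near N =", round(n_merge, 1))
print("which is at logli =", round(math.log2(T / n_merge), 2))
print("discrete  bin count at logli 12:", round(discrete_bincount(12.0), 3))
print("continuum bin count at logli 12:", round(continuum_bincount(12.0), 3))
```

\begin_layout Standard
Under these assumptions the threshold comes out near N = 115, and at logli = 12 the two bin counts agree to within a few percent.
 The continuum estimate doubles for every two units of logli, which is the slope of 0.5 on the base-2 log scale.
\end_layout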

\begin_layout Standard
The purple line is an eyeballed fit to what the data is doing, when the
 number of observations really does become large.
 The knee in the graph is at about logli
\begin_inset Formula $=15.5=24.140-\log_{2}N$
\end_inset

 or, equivalently at 
\begin_inset Formula $N=400$
\end_inset

.
 Thus, we have to revise the apparent distribution.
 It is, for this dataset:
\begin_inset Formula 
\[
\left|\left\{ w|N(w)=N\right\} \right|\sim\begin{cases}
N^{-3/2} & \mbox{for }N<400\\
N^{-1.8} & \mbox{for }N>400
\end{cases}
\]

\end_inset

It was noted before that there are 3263 words in the dataset that were observed
 400 or more times.
 
\end_layout

\begin_layout Standard
What does this mean? What is this saying? It's not entirely clear.
 It seems to suggest that there are about 3.3K words that are used preferentially
 more often than the rest.
 That is, they are used more often not only in absolute terms, but also
 in relative terms: they form a core of the vocabulary, enjoying a popularity
 exceeding the trend line for less-frequently used words.
\end_layout

\begin_layout Standard
It's not clear if this is a generic feature of the English language, or if
 this is peculiar to the particular corpus.
 Let's review the corpus again.
 The corpus comprises assorted late-19th and early-20th century texts from
 Project Gutenberg, a dozen sci-fi/fantasy novels, and a sampling of fan-fiction.
 These texts will contain stray markup, including tables of contents, chapter
 headings, indexes, figure captions and itemized lists.
 Quite often, ASCII artwork is used to delimit chapters or sections.
 Chapter headings are often written in all-upper-case.
 There are stray quotations in Latin, snippets of Latin prose and poetry.
 Travelogues will include miscellaneous foreign sayings and unusual place-names.
 All of this stuff adds up: it will be observed only once, twice, maybe
 a few dozen times.
 It may as well be random text, from the point of view of the word-pair
 MI statistics, and from the point of view of the MST parser.
 There is no easy way to remove this 
\begin_inset Quotes eld
\end_inset

garbage
\begin_inset Quotes erd
\end_inset

 in any 
\emph on
a priori
\emph default
 fashion.
 It is there, and it is unavoidable.
 It is indistinguishable from random sentences.
 However, it seems that the fact that some significant portion is 
\begin_inset Quotes eld
\end_inset

ungrammatical
\begin_inset Quotes erd
\end_inset

 should not affect word-count statistics, unless it just so happens that
 there is a core of 3.3K vocabulary words, followed by 130K words that are
 given names, arcane terms, and other 
\begin_inset Quotes eld
\end_inset

junk
\begin_inset Quotes erd
\end_inset

.
 This does not seem plausible.
\end_layout

\begin_layout Standard
Thus, it is unclear what the meaning of this knee in the graph really
 is, and how it should be explained.
 Note that this knee is NOT visible in the Zipfian distribution – nothing
 happens at 
\begin_inset Formula $N=400$
\end_inset

 – it is smooth as silk.
 It might have been visible in the 
\begin_inset CommandInset ref
LatexCommand vref
reference "graph word frequency"

\end_inset

 graph, except that the far edge of that graph ends at exactly 
\begin_inset Formula $N=400$
\end_inset

, and does not continue past there! This seems to be a fairly subtle effect.
\end_layout

\begin_layout Subsection
Ranked average observations per disjunct
\end_layout

\begin_layout Standard
A more interesting distribution arises by looking at the average number
 of observations per disjunct (per word).
 That is, a single word may have hundreds of disjuncts, observed thousands
 of times; what is the average number of times that a disjunct is observed?
 By 
\begin_inset Quotes eld
\end_inset

average
\begin_inset Quotes erd
\end_inset

, it is explicitly meant 
\begin_inset Formula $N(w,*)/\left|(w,*)\right|$
\end_inset

, the number of observations divided by the support for those observations.
\end_layout

\begin_layout Standard
This number gives a hint of how 
\begin_inset Quotes eld
\end_inset

narrow
\begin_inset Quotes erd
\end_inset

 the grammatical usage of a word is.
 If the average is high, it suggests that the word just does not have very
 many disjuncts on it; the few that it does have are observed a lot.
 Recall that these disjuncts (pseudo-disjuncts) connect to individual words,
 and not to word-classes.
 Thus, if a disjunct is seen a lot, it probably connects to another word,
 forming a high-MI pair.
 This can be explicitly seen in the example further below.
\end_layout
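
\begin_layout Standard
To make the definition concrete, here is a minimal Python sketch of the average N(w,*)/|(w,*)|.
 It is not the Scheme code used for the graphs below: the function name is hypothetical, and the counts in the toy table are made up purely for illustration.
\end_layout

```python
from collections import defaultdict


def average_obs_per_disjunct(cset_counts):
    """Map each word w to N(w,*) / |(w,*)|.

    cset_counts maps (word, disjunct) pairs to observation counts N(w,d).
    """
    total = defaultdict(int)    # N(w,*): total observations of word w
    support = defaultdict(int)  # |(w,*)|: distinct disjuncts observed on w
    for (word, _disjunct), n in cset_counts.items():
        if n > 0:
            total[word] += n
            support[word] += 1
    return {w: total[w] / support[w] for w in total}


# Made-up counts, NOT taken from the dataset.
toy = {
    ("Prince", "Andrew+"): 6,
    ("Prince", "the- Andrew+"): 2,
    ("think", "I-"): 3,
    ("think", "I- that+"): 1,
    ("think", "to-"): 1,
    ("think", "not- so+"): 1,
}

result = average_obs_per_disjunct(toy)
print(result)
```

\begin_layout Standard
In this toy table, the first word has few disjuncts that are each seen often, giving a high average; the second spreads its observations over more disjuncts, giving a low one.
\end_layout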

\begin_layout Standard
A graph of the ranked average number of observations, per disjunct, per
 word, is shown on the left, below.
\begin_inset Foot
status collapsed

\begin_layout Plain Layout
Computed with the sorted-avg list in disjunct-stats.scm
\end_layout

\end_inset

 The ranking is distinctly not Zipfian; this is confirmed by slicing the
 data three ways: excluding words with fewer than 400 observations (leaving
 3263 words), excluding words with fewer than 100 observations (leaving 9425
 words), and excluding words with fewer than 20 observations (leaving 25505
 words out of 137K words).
\end_layout

\begin_layout Standard
The graph on the right expresses an alternate view of the same idea: it
 shows a bin-count of all of the words.
 Reaffirming the graph on the left, it indicates that almost all words have
 an average disjunct observation count of less than four.
 It also conveys the sense that an average disjunct count greater than
 about four is unusual, and perhaps meaningful in some way.
\begin_inset Separator latexpar
\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Graphics
	filename en-dj/ranked-avg.eps
	width 50text%

\end_inset


\begin_inset Graphics
	filename en-dj/binned-avg.eps
	width 50text%

\end_inset


\end_layout

\begin_layout Standard
The first ten on the ranked list are "*" "Literary" "Archive" "Gutenberg"
 "Notes" "...." "|" "Foundation" "Project" "Summary".
 This suggests that these all come from exactly the same parse of a small
 group of sentences having a very regular, formulaic structure, occurring
 repeatedly in multiple texts.
 One of those sentences is easily found; it begins as: 
\begin_inset Quotes eld
\end_inset

The Project Gutenberg Literary Archive Foundation has been created...
\begin_inset Quotes erd
\end_inset

 
\end_layout

\begin_layout Standard
Closer examination indicates that more or less all words having an average
 of more than three observations per disjunct are associated with the Project
 Gutenberg legal boilerplate.
 The first 80 entries in the 400+ list have an average observation count
 of above three, and they are all boilerplate words: "fee" "copies" "trademark"
 "agreement" "electronic" "copyright" "donations", and so on.
 This suggests that pretty much all of the 
\begin_inset Quotes eld
\end_inset

bump
\begin_inset Quotes erd
\end_inset

 on the above-left graph is due to license boilerplate!
\end_layout

\begin_layout Standard
It's entertaining to look at some of these close-up.
\begin_inset Foot
status collapsed

\begin_layout Plain Layout
View disjuncts by saying (filter (lambda (cset) (< 10 (get-count cset)))
 (cog-incoming-by-type (Word "foo") 'Section)) where 10 is the minimum number
 of counts.
\end_layout

\end_inset

 The word 
\begin_inset Quotes eld
\end_inset

Prince
\begin_inset Quotes erd
\end_inset

 shows up 99th on the list, with an average of 2.885 observations per disjunct.
 It has a total of 773 different disjuncts on it.
 The top six disjuncts are just single links, shown in the table below.
 Clearly all are princes.
 This leaves 767 other disjuncts with far fewer counts.
\begin_inset Separator latexpar
\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Tabular
<lyxtabular version="3" rows="7" columns="2">
<features tabularvalignment="middle">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
disjunct
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
number of observations
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Andrew+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
626
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Vasili+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
149
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Andrew's+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
46
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Edouard+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
46
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Hans+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
33
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Bagration+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
32
\end_layout

\end_inset
</cell>
</row>
</lyxtabular>

\end_inset


\end_layout

\begin_layout Standard
The word 
\begin_inset Quotes eld
\end_inset

think
\begin_inset Quotes erd
\end_inset

 appears as number 140 on the list, with an average of 2.6104 observations
 per disjunct.
 It has 5462 disjuncts in total; only six are observed more than 200 times.
 The last in the list below is the first clear appearance of a transitive
 verb.
\begin_inset Separator latexpar
\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Tabular
<lyxtabular version="3" rows="7" columns="2">
<features tabularvalignment="middle">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
disjunct
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
number of observations
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
I-
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1321
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
you-
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
308
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
don't-
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
265
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
to-
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
244
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
do- you-
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
211
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
I- it+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
202
\end_layout

\end_inset
</cell>
</row>
</lyxtabular>

\end_inset


\end_layout

\begin_layout Standard
The word 
\begin_inset Quotes eld
\end_inset

long
\begin_inset Quotes erd
\end_inset

 appears as 195 on the list, with an average of 2.3825 observations per disjunct.
 It has 5412 different disjuncts on it; only five are seen more than 200
 times.
 These are:
\begin_inset Separator latexpar
\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Tabular
<lyxtabular version="3" rows="6" columns="2">
<features tabularvalignment="middle">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
disjunct
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
number of observations
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
a-
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
385
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
as+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
335
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
how-
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
236
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
a- time+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
212
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
enough+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
201
\end_layout

\end_inset
</cell>
</row>
</lyxtabular>

\end_inset


\end_layout
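
\begin_layout Standard
As a rough illustration of how concentrated these counts are, assume that
 the quoted average is the total number of observations divided by the number
 of distinct disjuncts.
 Under that assumption, the totals below are reconstructed from the figures
 quoted above, and so are approximate rather than read off the dataset.
 For 
\begin_inset Quotes eld
\end_inset

think
\begin_inset Quotes erd
\end_inset

, 
\begin_inset Formula 
\[
N(w,*)\approx2.6104\times5462\approx14258,\qquad\frac{1321+308+265+244+211+202}{14258}=\frac{2551}{14258}\approx0.18
\]

\end_inset

and for 
\begin_inset Quotes eld
\end_inset

long
\begin_inset Quotes erd
\end_inset

, 
\begin_inset Formula 
\[
N(w,*)\approx2.3825\times5412\approx12894,\qquad\frac{385+335+236+212+201}{12894}=\frac{1369}{12894}\approx0.11
\]

\end_inset

That is, roughly a tenth to a fifth of all observations fall on five or
 six disjuncts, out of more than five thousand.
\end_layout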

\begin_layout Standard
The skewness appears to be very sharp.
 This suggests that we should not waste time looking at mean-square variations
 in the average, although we'll do this anyway.
 But first, it's worth graphing the skewness directly.
 Again, this is done on a log-log graph, in a Zipfian way.
\end_layout
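
\begin_layout Standard
For reference, this is why a log-log graph is the natural choice.
 Suppose the count of the disjunct of rank 
\begin_inset Formula $r$
\end_inset

 follows a power law with some constant 
\begin_inset Formula $C$
\end_inset

 and exponent 
\begin_inset Formula $\alpha$
\end_inset

 (both are illustrative symbols introduced here, not fitted values).
 Then
\begin_inset Formula 
\[
N(r)=C\,r^{-\alpha}\quad\Longrightarrow\quad\log N(r)=\log C-\alpha\log r
\]

\end_inset

so the ranked counts fall on a straight line of slope 
\begin_inset Formula $-\alpha$
\end_inset

, with the classic Zipf distribution corresponding to 
\begin_inset Formula $\alpha=1$
\end_inset

.
\end_layout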

\begin_layout Subsection
Disjunct count distribution
\end_layout

\begin_layout Standard
The graph below shows the distribution of the disjunct observations on the
 five words 
\begin_inset Quotes eld
\end_inset

Prince
\begin_inset Quotes erd
\end_inset

, 
\begin_inset Quotes eld
\end_inset

think
\begin_inset Quotes erd
\end_inset

, 
\begin_inset Quotes eld
\end_inset

long
\begin_inset Quotes erd
\end_inset

, 
\begin_inset Quotes eld
\end_inset

fact
\begin_inset Quotes erd
\end_inset

 and 
\begin_inset Quotes eld
\end_inset

from
\begin_inset Quotes erd
\end_inset

.
 Indeed, it looks Zipfian; since we know that all but the first three or
 four disjuncts are noise, this graph illustrates 
\begin_inset Quotes eld
\end_inset

pink noise
\begin_inset Quotes erd
\end_inset

 or 
\begin_inset Quotes eld
\end_inset

1/f noise
\begin_inset Quotes erd
\end_inset

.
\begin_inset Foot
status collapsed

\begin_layout Plain Layout
Computed with the dj-prince and dj-think etc.
 arrays in disjunct-stats.scm 
\end_layout

\end_inset


\begin_inset Separator latexpar
\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Graphics
	filename en-dj/ranked-dj-prince.eps
	width 70text%

\end_inset


\end_layout

\begin_layout Standard
Can one get a smoother distribution by summing together these graphs? Sure...
 and one can sum together not just these five words, but all words (that
 have been observed at least 100 times).
 
\end_layout

\begin_layout Standard
That graph is shown below.
\begin_inset Foot
status collapsed

\begin_layout Plain Layout
Computed with the accum-dj-all function in disjunct-stats.scm 
\end_layout

\end_inset


\begin_inset Separator latexpar
\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Graphics
	filename en-dj/ranked-dj-counts.eps
	width 70text%

\end_inset


\end_layout

\begin_layout Standard
It's kind of a strange graph.
 Yes, the x-axis of this graph does imply that there are thousands of words
 with more than a thousand disjuncts on them, and hundreds that have more
 than ten thousand (unique, different) disjuncts on them! Exactly what does
 this mean? This is covered in the next section.
\end_layout

\begin_layout Subsection
Disjunct Support Distribution
\end_layout

\begin_layout Standard
Is it possible that some words have a large number of disjuncts on them?
 
\end_layout

\begin_layout Standard
Yes, it is.
 For example, the comma was observed to have 243987 unique, different disjuncts
 associated with it.
 The word 
\begin_inset Quotes eld
\end_inset

the
\begin_inset Quotes erd
\end_inset

 has 215324 unique, different disjuncts, the word 
\begin_inset Quotes eld
\end_inset

and
\begin_inset Quotes erd
\end_inset

 has 126861.
 Rounding out this list are "to" "of" "." "a" "was" "LEFT-WALL".
 It's not clear what fraction of these disjuncts are grammatically valid,
 and what fraction are junk.
\end_layout

\begin_layout Standard
The graph below shows the distribution of the size of the support: the ranking
 of 
\begin_inset Formula $\left|(w,*)\right|$
\end_inset

.
 Again, the graph appears to be approximately Zipfian.
\begin_inset Foot
status collapsed

\begin_layout Plain Layout
Generated by sorted-support in disjunct-stats.scm 
\end_layout

\end_inset

 The eye-balled fit has a slope of 0.9, but, from the eyeball-perspective,
 this is not all that different from a slope of 1.0.
\begin_inset Separator latexpar
\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Graphics
	filename en-dj/ranked-support.eps
	width 70text%

\end_inset


\end_layout

\begin_layout Standard
Terminology: the 
\begin_inset Quotes eld
\end_inset

support
\begin_inset Quotes erd
\end_inset

 of a vector is the set of basis elements that have a non-zero coefficient.
 This is the set 
\begin_inset Formula $(w,*)$
\end_inset

 defined earlier.
 The size of the support is thus the number of disjuncts associated with
 a word, when counted 
\emph on
without
\emph default
 multiplicity.
\end_layout

\begin_layout Subsection
Ranked Euclidean length (RMS Size)
\end_layout

\begin_layout Standard
A different distribution arises by looking at the ranked RMS sizes of the
 disjunct sets
\begin_inset Foot
status open

\begin_layout Plain Layout
Obtained by running (print-ts-rank sorted-lengths outport) from the disjunct-sta
ts.scm file 
\end_layout

\end_inset

.
 Here, the RMS size
\begin_inset Foot
status collapsed

\begin_layout Plain Layout
The word 
\begin_inset Quotes eld
\end_inset

length
\begin_inset Quotes erd
\end_inset

 can be used to describe the root-mean-square size of the set of disjuncts
 associated with a word.
 That is, each element of the set is a disjunct, and that disjunct has a
 count, the number of times it has been observed.
 The root-mean-square of these counts can be taken as the set-size.
 But this set can also be interpreted as a vector, and so the RMS size is
 the same thing as the Euclidean length of the vector.
 Thus, the word 
\begin_inset Quotes eld
\end_inset

length
\begin_inset Quotes erd
\end_inset

 is sometimes used for the RMS size; they're the same thing.
\end_layout

\end_inset

 is computed from the counts on each disjunct in the set (strictly speaking,
 as the root of the sum of their squares, with no division by the number
 of disjuncts), that is, by computing 
\begin_inset Formula $\sqrt{\sum_{d}N(w,d)^{2}}$
\end_inset

 for each word 
\begin_inset Formula $w$
\end_inset

 and then ranking.
 Interpreting 
\begin_inset Formula $d$
\end_inset

 as a basis element of a vector space, this can be recognized as the Euclidean
 length of the count-vector.
\end_layout
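
\begin_layout Standard
As a small worked example with made-up counts (not taken from the dataset),
 suppose a word 
\begin_inset Formula $w$
\end_inset

 has been observed with four disjuncts, with counts 5, 3, 1 and 1.
 Then
\begin_inset Formula 
\[
\left|(w,*)\right|=4,\qquad N(w,*)=5+3+1+1=10,\qquad\sqrt{\sum_{d}N(w,d)^{2}}=\sqrt{25+9+1+1}=6
\]

\end_inset

so the size of the support, the total count and the length are three different
 measures of the same set of disjuncts.
\end_layout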

\begin_layout Standard
The RMS size of the set is thus larger not only when more disjuncts have
 been observed, but also when most of the observations are made of only
 a small handful of disjuncts.
 That is, the RMS size should be relatively larger, if the word is less
 grammatically flexible.
 So for example, prepositions tend to be very flexible; adjectives, not
 so much.
 Thus, we expect adjectives to appear higher-up on this ranking list, than
 the observation-based list.
 And this might be true, relatively, but certainly not true absolutely.
\begin_inset Separator latexpar
\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Graphics
	filename en-dj/ranked-lengths.eps
	width 70text%

\end_inset


\begin_inset CommandInset label
LatexCommand label
name "graph: ranked length"

\end_inset


\end_layout

\begin_layout Standard
The first dozen words of the RMS-size-list are: "LEFT-WALL" "." "and" ","
 """ "”" "in" "to" "of" "“" "as" "the".
 Not that interesting: these are all words that were observed a lot in the
 text.
 The RMS size is dominated by the total number of observations of a word
 in text.
 In and of itself, it is insufficient to indicate how 
\begin_inset Quotes eld
\end_inset

concentrated
\begin_inset Quotes erd
\end_inset

 the disjuncts are, how grammatically narrow a word is.
 For this, some other quantity is needed.
\end_layout

\begin_layout Standard
The slope appears to be exactly -1.0, continuing the scale-free trend.
\end_layout

\begin_layout Subsection
Mean-square to size ratio
\end_layout

\begin_layout Standard
More interesting is the ratio of the RMS size to the total size.
 In formulas, this means ranking according to 
\begin_inset Formula 
\[
\frac{\sqrt{\sum_{d}N^{2}(w,d)}}{N(w,*)}=\frac{\sqrt{\sum_{d}p^{2}(w,d)}}{p(w,*)}
\]

\end_inset

This seems like the interesting ratio, because the Zipf exponent of -0.65
 would be doubled, when working with mean-square sizes, thus making the
 two rankings comparable.
\end_layout
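
\begin_layout Standard
To get a feel for the range of the ratio exactly as written above, consider
 two idealized extremes for a word with 
\begin_inset Formula $n=\left|(w,*)\right|$
\end_inset

 distinct disjuncts (a sketch; the symbol 
\begin_inset Formula $n$
\end_inset

 is introduced here only for illustration).
 If all of the observations fall on a single disjunct, the ratio is 1.
 If they are spread perfectly evenly, each disjunct has a count of 
\begin_inset Formula $N(w,*)/n$
\end_inset

, and
\begin_inset Formula 
\[
\frac{\sqrt{n\left[N(w,*)/n\right]^{2}}}{N(w,*)}=\frac{1}{\sqrt{n}}
\]

\end_inset

So the ratio lies between 
\begin_inset Formula $1/\sqrt{n}$
\end_inset

 and 1, with larger values indicating counts that are concentrated on fewer
 disjuncts.
\end_layout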

\begin_layout Standard
Words high in this score will be words that have relatively few disjuncts
 on them, or at least, few that matter much, that rise above the level of
 noise.
 The first ten words on this list, excluding punctuation, and excluding
 all words with fewer than 100 observations: "It" "and" "but" "in" "Two"
 "as" "Notes" "not" "There" "Summary".
 A review of the input corpus shows that the word "Summary" appears in only
 one input text: Charles Darwin's 
\begin_inset Quotes eld
\end_inset

On the Origin of Species
\begin_inset Quotes erd
\end_inset

, and then only to indicate an actual summary! The word 
\begin_inset Quotes eld
\end_inset

Notes
\begin_inset Quotes erd
\end_inset

 appears in fewer than a dozen input texts, with almost every usage being
 formulaic and rigid.
 The grammatical usage of these two words in the input corpus is fairly
 constrained, and thus it is no surprise that these have relatively narrower
 disjunct distributions on them.
 
\end_layout

\begin_layout Standard
Once the capitalized words and punctuation are excluded from this list,
 what remains is quite surprising.
 It is shown in the table below.
\begin_inset Foot
status collapsed

\begin_layout Plain Layout
Extracted from ranked-sqlen-norm.dat 
\end_layout

\end_inset


\begin_inset Separator latexpar
\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Tabular
<lyxtabular version="3" rows="11" columns="3">
<features tabularvalignment="middle">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
rank
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
score
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
word
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
7
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1956
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
and
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
10
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1147
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
but
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
12
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1043
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
in
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
14
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
814
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
as
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
16
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
776
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
not
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
21
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
590
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
been
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
22
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
568
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
be
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
26
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
453
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
one
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
27
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
440
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
at
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
30
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
386
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
own
\end_layout

\end_inset
</cell>
</row>
</lyxtabular>

\end_inset

 
\begin_inset space \qquad{}
\end_inset

  
\begin_inset Tabular
<lyxtabular version="3" rows="11" columns="3">
<features tabularvalignment="middle">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
rank
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
score
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
word
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
34
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
329
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
of
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
35
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
328
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
to
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
36
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
324
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
with
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
37
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
319
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
other
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
38
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
315
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
it
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
42
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
284
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
when
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
43
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
275
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
for
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
45
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
271
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
he
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
46
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
265
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
have
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
49
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
248
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
that
\end_layout

\end_inset
</cell>
</row>
</lyxtabular>

\end_inset

 
\end_layout

\begin_layout Standard
It is surprising because, grammatically, we expect most of these words to
 have a large number of varied disjuncts attached to them.
 We expect them to be diffuse, not sharp: we expect that these would have
 a large number of observations smeared over a large variety of disjuncts,
 instead of having their weight concentrated in only a handful of disjuncts,
 the way that 
\begin_inset Quotes eld
\end_inset

Prince
\begin_inset Quotes erd
\end_inset

 was, above.
 So what is going on, here?
\end_layout

\begin_layout Standard
The distribution, shown below, is a power-law.
\begin_inset Separator latexpar
\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Graphics
	filename en-dj/binned-sqlen-norm.eps
	width 70text%

\end_inset


\end_layout

\begin_layout Standard
Are there other interesting measures? One could contemplate the ratio of
 the mean-square size to the support 
\begin_inset Formula $\sum_{d}N(w,d)^{2}/\left|(w,*)\right|$
\end_inset

.
 Another possibility would be this, minus the square of the average, which
 gives the second central moment (the variance), that is, the mean-square
 deviation from the average, specifically
\begin_inset Formula 
\[
\frac{\sum_{d}N(w,d)^{2}}{\left|(w,*)\right|}-\left[\frac{N(w,*)}{\left|(w,*)\right|}\right]^{2}=\frac{1}{\left|(w,*)\right|}\sum_{d}\left[N(w,d)-\frac{N(w,*)}{\left|(w,*)\right|}\right]^{2}
\]

\end_inset

Neither of these variations seems promising; they seem to offer up more of
 the same, at least on this dataset.
 A larger and more refined dataset might reveal otherwise.
\end_layout
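
\begin_layout Standard
The identity above can be checked on a small made-up example (the counts
 are illustrative, not taken from the dataset).
 Suppose a word has four disjuncts with counts 5, 3, 1 and 1, so that 
\begin_inset Formula $\left|(w,*)\right|=4$
\end_inset

 and 
\begin_inset Formula $N(w,*)=10$
\end_inset

.
 The left-hand side is
\begin_inset Formula 
\[
\frac{25+9+1+1}{4}-\left[\frac{10}{4}\right]^{2}=9-6.25=2.75
\]

\end_inset

and the right-hand side is
\begin_inset Formula 
\[
\frac{1}{4}\left[(2.5)^{2}+(0.5)^{2}+(-1.5)^{2}+(-1.5)^{2}\right]=\frac{6.25+0.25+2.25+2.25}{4}=2.75
\]

\end_inset

as expected.
\end_layout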

\begin_layout Subsection
Mutual information
\end_layout

\begin_layout Standard
The concept of the 
\begin_inset Quotes eld
\end_inset

fractional mutual information
\begin_inset Quotes erd
\end_inset

 for a pair is interesting to explore.
 Define this as
\begin_inset Formula 
\begin{equation}
MI_{pair}(w,d)=\log_{2}\frac{p(w,d)}{p(w,*)p(*,d)}\label{eq:mi-word-disjunct}
\end{equation}

\end_inset

This is 
\begin_inset Quotes eld
\end_inset

fractional
\begin_inset Quotes erd
\end_inset

 in the sense that the total MI for the set of all pairs can be written
 as
\begin_inset Formula 
\[
MI_{cset}=\sum_{w,d}p(w,d)MI_{pair}(w,d)
\]

\end_inset

Fractional MI is interesting because it usually has a reasonably nice distributi
on.
 For this particular dataset, it ranges from about -11 to +24.
 The distribution is shown in the graphs below.
 
\begin_inset Foot
status collapsed

\begin_layout Plain Layout
These are graphed by binned-cset-mi and weighted-cset-mi in the disjunct-stats.sc
m file.
\end_layout

\end_inset


\begin_inset Separator latexpar
\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Graphics
	filename en-dj/binned-cset-mi.eps
	width 50col%

\end_inset


\begin_inset Graphics
	filename en-dj/weighted-cset-mi.eps
	width 50col%

\end_inset


\end_layout

\begin_layout Standard
These graphs are generated by computing the value for 
\begin_inset Formula $MI_{pair}(w,d)$
\end_inset

 for each of the 8629163 
\begin_inset Formula $(w,d)$
\end_inset

 pairs (aka 'connector sets'), and approximating its distribution by bin
 counting.
 In each graph above, there are 200 bins, each of width about 35/200, and
 each pair is assigned to one of the bins, according to its MI value.
 The graph on the left then shows how many pairs there are in each bin.
 The graph on the right is similar, but not the same: it sums the frequencies
 for all the pairs in each bin.
 In formulas: the graph on the left shows the value of
\begin_inset Formula 
\[
\mbox{sizeof }\left\{ (w,d)|MI_{pair}(w,d)\mbox{ is in bin}\right\} 
\]

\end_inset

while the graph on the right shows
\begin_inset Formula 
\[
\sum_{MI_{pair}(w,d)\mbox{ is in bin}}p(w,d)
\]

\end_inset

where '
\begin_inset Formula $x\mbox{ is in bin}$
\end_inset

' simply means 
\begin_inset Formula $lo\le x<hi$
\end_inset

 with the bin being the interval 
\begin_inset Formula $[lo,hi)$
\end_inset

.
\end_layout
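
\begin_layout Standard
The bin assignment can be made concrete with a small worked example.
 This is only a sketch: it assumes that the 200 bins tile the interval 
\begin_inset Formula $[-11,24)$
\end_inset

 exactly, whereas the range quoted above is approximate, and the bin index
 
\begin_inset Formula $k$
\end_inset

 is notation introduced here for illustration.
 Under that assumption, each bin has width 
\begin_inset Formula $35/200=0.175$
\end_inset

, and a pair with MI value 
\begin_inset Formula $x$
\end_inset

 falls into bin
\begin_inset Formula 
\[
k=\left\lfloor \frac{x+11}{0.175}\right\rfloor ,\qquad lo=-11+0.175\,k,\qquad hi=lo+0.175
\]

\end_inset

For example, 
\begin_inset Formula $x=0$
\end_inset

 gives 
\begin_inset Formula $k=\left\lfloor 62.857\ldots\right\rfloor =62$
\end_inset

, that is, the bin 
\begin_inset Formula $[-0.15,0.025)$
\end_inset

.
\end_layout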

\begin_layout Standard
Both of these graphs show 
\begin_inset Quotes eld
\end_inset

combs
\begin_inset Quotes erd
\end_inset

 on the right side.
 These combs are exactly the same combs as noted in the last figure in section
 
\begin_inset CommandInset ref
LatexCommand nameref
reference "logli distribution graph"

\end_inset

 
\begin_inset CommandInset ref
LatexCommand vref
reference "logli distribution graph"

\end_inset

.
 The combs are due to the large number of words that have been observed
 only a small handful of times.
 In essence, the combs attest that the bulk of the high-MI pairs have been
 observed more than just a few times; 
\emph on
i.e.

\emph default
 the high MI values are meaningful.
\end_layout

\begin_layout Standard
Both graphs show an eyeballed fit in green.
 The left graph shows that the distribution can be approximated by a Gaussian
 (visually a parabola, due to the logarithmic scale), given by 
\begin_inset Formula $\exp-(0.23(MI_{pair}-11))^{2}/2$
\end_inset

.
 The shape of the graph on the right is harder to pin down.
 It has hints of parabolic behavior, yet the left edge appears more straight
 than curved, that is, to be exponential, given by 
\begin_inset Formula $2^{1.3MI_{pair}}=\exp0.9MI_{pair}$
\end_inset

.
 The eyeballed blue parabola is given by 
\begin_inset Formula $\exp-(0.15(MI_{pair}-12))^{2}/2$
\end_inset

; it clearly misses a large excess in the middle of the graph.
\end_layout

\begin_layout Subsection
Marginal Mutual Information of Words
\end_layout

\begin_layout Standard
The marginal mutual information of a single word can be defined by summing
 the (fractional) mutual information between a word and all of its disjuncts:
\begin_inset Formula 
\[
MI_{word}(w)=\frac{1}{p(w)}\sum_{d}p(w,d)MI_{pair}(w,d)
\]

\end_inset

This is also written in the 
\begin_inset Quotes eld
\end_inset

fractional
\begin_inset Quotes erd
\end_inset

 style, so that, again, the total MI of the entire dataset can be written
 as
\begin_inset Formula 
\[
MI_{cset}=\sum_{w}p(w)MI_{word}(w)
\]

\end_inset

That is, 
\begin_inset Formula $MI_{word}(w)$
\end_inset

 is the fractional contribution of the word to the total MI.
 The fractional marginal MI is very convenient for comparing different words,
 since it factors out the frequency of how often a word is observed: the
 MI of two words with two very different frequencies can be directly compared.
 
\end_layout
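
\begin_layout Standard
The sum over words can be checked directly from the definitions; the only
 step is that the factor of 
\begin_inset Formula $p(w)$
\end_inset

 cancels the normalization 
\begin_inset Formula $1/p(w)$
\end_inset

:
\begin_inset Formula 
\begin{align*}
\sum_{w}p(w)MI_{word}(w) & =\sum_{w}p(w)\frac{1}{p(w)}\sum_{d}p(w,d)MI_{pair}(w,d)\\
 & =\sum_{w,d}p(w,d)MI_{pair}(w,d)=MI_{cset}
\end{align*}

\end_inset


\end_layout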

\begin_layout Standard
As can be seen from the graph below, the fractional marginal MI ranges between
 +3 and +18 for this dataset.
 The total MI for the dataset is measured to be 7.8969 bits.
 The distribution can be visualized in two different ways.
 The graph on the left, below
\begin_inset Foot
status collapsed

\begin_layout Plain Layout
This is graphed by sorted-word-mi-hi-p in disjunct-stats.scm 
\end_layout

\end_inset

 shows the ranked MI of the 9425 words that have been observed more than
 100 times in this dataset.
 Note that it is a semi-log plot; it is NOT Zipfian.
 Note that the MI seems to decay logarithmically, for a good long ways,
 and then drops off a cliff.
 
\end_layout

\begin_layout Standard
The graph on the right shows the distribution, for all 137078 words, bin-counted
 into 100 bins.
 The combs on the far right are again the same combs as noted in the last
 figure in section 
\begin_inset CommandInset ref
LatexCommand nameref
reference "logli distribution graph"

\end_inset

 
\begin_inset CommandInset ref
LatexCommand vref
reference "logli distribution graph"

\end_inset

.
 The distribution appears to be Gaussian, and of approximately the same
 width as before, although located at a different center: the green line
 in the graph is the Gaussian given by 
\begin_inset Formula $\exp-(0.24(MI_{word}-20))^{2}/2$
\end_inset

 as compared to 0.23 for the non-marginal MI distribution above.
 The difference between 0.24 and 0.23 seems to be significant: changing one
 to the other seems to give a noticeably poorer fit.
\begin_inset Separator latexpar
\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Graphics
	filename en-dj/ranked-word-mi-hi-p.eps
	width 50text%

\end_inset


\begin_inset Graphics
	filename en-dj/binned-word-mi.eps
	width 50col%

\end_inset


\end_layout

\begin_layout Standard
\begin_inset VSpace defskip
\end_inset


\end_layout

\begin_layout Standard
What are the disjuncts like at either end of this distribution? The first
 few words in the ranking are: "LICENSE" "FULL" "formats" "BREACH" "AGREE"
 "WARRANTIES" "WARRANTY" and are clearly parts of the set phrases that make
 up the Project Gutenberg license agreement.
 The word 
\begin_inset Quotes eld
\end_inset

Prince
\begin_inset Quotes erd
\end_inset

, examined previously, is 7378th in this rank.
 
\end_layout

\begin_layout Standard
It is entertaining to look at some of these.
\begin_inset Foot
status collapsed

\begin_layout Plain Layout
View disjuncts by saying (filter (lambda (cset) (< 10 (get-count cset)))
 (cog-incoming-by-type (Word "foo") 'Section)) where 10 is the minimum number
 of counts.
\end_layout

\end_inset

 The word "LICENSE" is surely under-sampled, as, in this all-capitals form,
 it only appears in the Gutenberg boilerplate, and nowhere else.
 Thus we cannot expect accurate MST parses, nor accurate disjuncts, for
 this word.
 Yet other set phrases quickly top the list.
 The word 
\begin_inset Quotes eld
\end_inset

San
\begin_inset Quotes erd
\end_inset

, number 33 in the list, is seen only with the disjuncts 
\begin_inset Quotes eld
\end_inset

Francisco+
\begin_inset Quotes erd
\end_inset

 and 
\begin_inset Quotes eld
\end_inset

Antonio+
\begin_inset Quotes erd
\end_inset

.
 The word 
\begin_inset Quotes eld
\end_inset

Tomb
\begin_inset Quotes erd
\end_inset

 is 40th in the list and has three disjuncts observed more than twice:
\begin_inset Separator latexpar
\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Tabular
<lyxtabular version="3" rows="4" columns="2">
<features tabularvalignment="middle">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
count
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
disjunct for 
\begin_inset Quotes eld
\end_inset

Tomb
\begin_inset Quotes erd
\end_inset


\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
19
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Great- of+ Nazarick+
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
15
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Underground- of+ Nazarick+
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
7
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
The- Great- of+ Nazarick+
\end_layout

\end_inset
</cell>
</row>
</lyxtabular>

\end_inset


\end_layout

\begin_layout Standard
\begin_inset VSpace defskip
\end_inset


\end_layout

\begin_layout Standard
It is already clear from this one example that the high marginal-MI words
 will be those that take part in idioms, 
\begin_inset Quotes eld
\end_inset

set phrases
\begin_inset Quotes erd
\end_inset

 or 
\begin_inset Quotes eld
\end_inset

institutional phrases
\begin_inset Quotes erd
\end_inset

, and that the disjunct identifies the words taking part in the setting.
 The word 
\begin_inset Quotes eld
\end_inset

prominently
\begin_inset Quotes erd
\end_inset

 appears 50th in the list, and its disjuncts suggest that it is used only in
 a rather rigid and formulaic way:
\begin_inset Separator latexpar
\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Tabular
<lyxtabular version="3" rows="3" columns="2">
<features tabularvalignment="middle">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
count
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
disjunct for 
\begin_inset Quotes eld
\end_inset

prominently
\begin_inset Quotes erd
\end_inset


\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
51
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
appear- whenever+
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
51
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
without- displaying+
\end_layout

\end_inset
</cell>
</row>
</lyxtabular>

\end_inset


\end_layout

\begin_layout Standard
\begin_inset VSpace defskip
\end_inset


\end_layout

\begin_layout Standard
The word 
\begin_inset Quotes eld
\end_inset

Corps
\begin_inset Quotes erd
\end_inset

 appears 95th in the list.
 Note that both the English and French word-orderings appear: 
\begin_inset Quotes eld
\end_inset

Diplomatic Corps
\begin_inset Quotes erd
\end_inset

 and 
\begin_inset Quotes eld
\end_inset

Corps Diplomatique
\begin_inset Quotes erd
\end_inset

, as witnessed by the different signs on the disjuncts:
\begin_inset Separator latexpar
\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Tabular
<lyxtabular version="3" rows="6" columns="2">
<features tabularvalignment="middle">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
count
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
disjunct for 
\begin_inset Quotes eld
\end_inset

Corps
\begin_inset Quotes erd
\end_inset


\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
8
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Supply-
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
6
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Marine-
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
4
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Medical-
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
3
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Diplomatic-
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
3
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Diplomatique+
\end_layout

\end_inset
</cell>
</row>
</lyxtabular>

\end_inset


\end_layout

\begin_layout Standard
\begin_inset VSpace defskip
\end_inset


\end_layout

\begin_layout Standard
At the other extreme, the ten words with the lowest MI are: "it" "in" "of"
 "that" "to" "and" "the" "LEFT-WALL" "," "." ranging from 5.07 for "it" down
 to 3.33 for the period.
 These are already familiar from previous rankings: they occur with very
 high frequency; the disjunct lists on them will be lengthy, variable and
 diffuse.
 
\end_layout

\begin_layout Standard
These samplings of disjuncts are sharply reminiscent of the technique of
 collocation used in corpus linguistics, but with a big, important difference.
 There, the linguist examines a window that is some 6-8 words wide, and
 examines the frequency of all phrases appearing within that window, containing
 the word-of-interest in the center.
 Here, we again have a word-of-interest, but this time, instead of seeing
 phrases, we see the grammatical structure revealed directly, by means of
 the disjuncts extracted from MST parses.
 The philosophical basis used to justify corpus linguistics, i.e.
 that of frequentism, is accepted and applied here as well.
 In this case, it is used to obtain grammatical structure.
\end_layout

\begin_layout Subsection
Mutual Information of Disjuncts
\end_layout

\begin_layout Standard
Symmetrically, one also has the mutual information of a disjunct, in comparison
 to all of the words it connects to:
\begin_inset Formula 
\[
MI_{disjunct}(d)=\frac{1}{p(*,d)}\sum_{w}p(w,d)MI_{pair}(w,d)
\]

\end_inset

Again, this is presented in the 
\begin_inset Quotes eld
\end_inset

fractional
\begin_inset Quotes erd
\end_inset

 style, so that the total MI of the entire dataset can be written as before:
\begin_inset Formula $MI=\sum_{d}p(*,d)MI_{disjunct}(d)$
\end_inset

.
 The dataset contains 6239997 (6.24 million) unique disjuncts, observed a
 total of 18489594 (18.5 million) times.
 The distribution is shown below, with the disjuncts sorted into 200 equal-sized
 bins.
\begin_inset Foot
status collapsed

\begin_layout Plain Layout
Use binned-dj-mi to get this.
 
\end_layout

\end_inset

 The combs on the far right are again the same combs as noted in the last
 figure in section 
\begin_inset CommandInset ref
LatexCommand nameref
reference "logli distribution graph"

\end_inset

 
\begin_inset CommandInset ref
LatexCommand vref
reference "logli distribution graph"

\end_inset

.
 The green line shows an eyeballed fit to a Gaussian.
 As before, the width appears to be almost 1/4th, but not quite.
 It is given by 
\begin_inset Formula $\exp-(0.27(MI_{disjunct}-13.7))^{2}/2$
\end_inset

.
 That is, the mean MI is 13.7, with a standard deviation of about 
\begin_inset Formula $1/0.27\approx3.7$
\end_inset

.
 
\begin_inset Separator latexpar
\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Graphics
	filename en-dj/binned-dj-mi.eps
	width 60col%

\end_inset


\end_layout

\begin_layout Standard
The origin of the noise that seems to build up for MI<10 is unclear.
 
\end_layout

\begin_layout Subsection
Fractional Entropy
\end_layout

\begin_layout Standard
There is a variant simpler than the mutual information that is also worth
 understanding: the fractional contribution to the total entropy.
 This is given by the sum 
\begin_inset Formula 
\[
H_{word}(w)=-\frac{1}{p(w)}\sum_{d}p(w,d)\log_{2}p(w,d)
\]

\end_inset

This is written in the 
\begin_inset Quotes eld
\end_inset

fractional
\begin_inset Quotes erd
\end_inset

 style, so that the total entropy of the entire dataset can be written as
\begin_inset Formula 
\[
H_{cset}=\sum_{w}p(w)H_{word}(w)
\]

\end_inset

 Analogously, one also has the fractional contribution of the disjuncts:
\end_layout

\begin_layout Standard
\begin_inset Formula 
\[
H_{disjunct}(d)=-\frac{1}{p(*,d)}\sum_{w}p(w,d)\log_{2}p(w,d)
\]

\end_inset

where, again, one has that 
\begin_inset Formula 
\[
H_{cset}=\sum_{d}p(*,d)H_{disjunct}(d)
\]

\end_inset


\end_layout

\begin_layout Standard
The ranked fractional entropy is shown in the left graph below.
\begin_inset Foot
status collapsed

\begin_layout Plain Layout
Generated from sorted-word-ent in disjunct-stats.scm.
\end_layout

\end_inset

 It only shows those words that have been observed 100 times, or more.
 
\end_layout

\begin_layout Standard
It resembles the graph for the ranked fractional marginal MI, above.
 The graph on the right shows the distribution of the entropy, for all of
 the words.
 This affirms (or explains?) the sharp knee in the graph on the left: the
 knee occurs because almost all words have a large disjunct-entropy.
 Remarkably, the distribution is anti-Gaussian, in that it appears to diverge,
 the larger the entropy.
 In that respect, it cannot even be a proper distribution, as it cannot
 be normalized to a probability of 1.0 - the distribution increases without
 bound! Yet, as the graph illustrates, that is what it seems to be.
 The green curve is the anti-Gaussian, given by 
\begin_inset Formula $\exp(0.27(H_{word}-13.7))^{2}/2$
\end_inset

.
 As before, it has a width of approximately 1/4th.
 
\begin_inset Separator latexpar
\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Graphics
	filename en-dj/ranked-word-ent.eps
	width 50text%

\end_inset


\begin_inset Graphics
	filename en-dj/binned-word-ent.eps
	width 50text%

\end_inset


\end_layout

\begin_layout Standard
The top-ranked words (which have been observed 100 times or more) are "clawing"
 "manages" "noting" "anyways" "circling" "choke" "neared" "urging" "pursuing"
 "exited".
 This is not a list that has been exposed with other statistics.
 Almost all of these are verbs, a grammatical class that never appeared
 in any of the previous lists.
 Why these? 
\end_layout

\begin_layout Standard
All of these words have an entropy of exactly 
\begin_inset Formula $24.14021=\log_{2}18489594$
\end_inset

.
 Since there are a total of 
\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\strikeout off
\uuline off
\uwave off
\noun off
\color none
18489594 observations of disjuncts in this dataset, the only way in which
 this entropy is possible is if every disjunct on these words was observed
 once and only once.
\end_layout
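
\begin_layout Standard
This can be seen by rewriting the entropy in terms of counts.
 As a sketch, assume that the probabilities are plain frequencies, 
\begin_inset Formula $p(w,d)=N(w,d)/N(*,*)$
\end_inset

 and 
\begin_inset Formula $p(w)=N(w,*)/N(*,*)$
\end_inset

, where 
\begin_inset Formula $N(*,*)=18489594$
\end_inset

 denotes the total number of observations (the symbol 
\begin_inset Formula $N(*,*)$
\end_inset

 is introduced here for the illustration).
 Then
\begin_inset Formula 
\begin{align*}
H_{word}(w) & =-\frac{1}{N(w,*)}\sum_{d}N(w,d)\left[\log_{2}N(w,d)-\log_{2}N(*,*)\right]\\
 & =\log_{2}N(*,*)-\frac{1}{N(w,*)}\sum_{d}N(w,d)\log_{2}N(w,d)
\end{align*}

\end_inset

The second term is never negative, and it vanishes only when every 
\begin_inset Formula $N(w,d)=1$
\end_inset

, which gives the maximum 
\begin_inset Formula $\log_{2}18489594\approx24.1402$
\end_inset

.
 As a made-up example, a word with 198 disjuncts seen once and one disjunct
 seen twice has 
\begin_inset Formula $N(w,*)=200$
\end_inset

, so the correction is 
\begin_inset Formula $2\log_{2}2/200=0.01$
\end_inset

 and the entropy drops to about 24.1302.
\end_layout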

\begin_layout Standard
Looking deeper into the disjunct set, the impression that the words with
 the highest entropies are almost all verbs is strikingly reinforced.
 A sampling is given in the table below.
 All of these words have a large support (i.e.
 have at least 100 observations of disjuncts on each), but each disjunct
 is observed only once, rarely twice; almost never more than that.
 For example, 
\begin_inset Quotes eld
\end_inset

gripped
\begin_inset Quotes erd
\end_inset

 has only one disjunct that was observed three times, seven that were observed
 twice, and 320 that were observed only once.
 A quick examination shows that many, maybe most, are grammatically reasonable,
 for example, for 
\begin_inset Quotes eld
\end_inset

gripped
\begin_inset Quotes erd
\end_inset

: (He- staff+) is the disjunct observed three times.
 Four of the seven observed twice are (had- him+) (she- his+ hand+) (He-
 staff+ tightly+) (his+ hand+) and look to be a part of rather formulaic
 sentences (possibly from the fan-fiction part of the corpus).
 The other three disjuncts observed twice are strange nonsense: (hands-
 Ross- at+ edge+ vanity+) (hand- jaw- other- top+) (LEFT-WALL- jockeyings-
 by+ them+).
 These seem to be the result of failed MST parses, depending, presumably,
 on word-pairs that were witnessed only a few times.
\end_layout

\begin_layout Standard
This observation, and the table below, reinforce the urgency of clustering
 words: clustering words together into grammatical categories should increase
 the number of observations of category-pairs, improving MST parses, as well
 as increasing the number of observations of disjuncts, hopefully drowning
 out the weak and bizarre disjuncts.
 Given that the high-entropy words seem to be predominantly verbs, this
 suggests that clustering will be absolutely required to pick out the grammatica
l form of verbs.
\begin_inset VSpace defskip
\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Tabular
<lyxtabular version="3" rows="6" columns="6">
<features tabularvalignment="middle">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Entropy
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
rank
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Word
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $\left|(w,*)\right|_{3}$
\end_inset


\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $\left|(w,*)\right|_{2}$
\end_inset


\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $\left|(w,*)\right|_{1}$
\end_inset


\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
24.1402
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
clawing
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
102
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
24.1293
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
41
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
licking
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
183
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
24.1241
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
79
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
toys
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
124
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
24.1202
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
164
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
grabbing
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
3
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
300
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
24.0846
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
567
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
gripped
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
7
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
328
\end_layout

\end_inset
</cell>
</row>
</lyxtabular>

\end_inset


\end_layout

\begin_layout Standard
\begin_inset VSpace defskip
\end_inset


\end_layout

\begin_layout Standard
The table employs a bit of notation worth reviewing.
 Recall the definition of the set that supports a word: 
\begin_inset Formula 
\[
(w,*)=\left\{ (w,d)|N(w,d)>0\right\} 
\]

\end_inset

The notation 
\begin_inset Formula $\left|(w,*)\right|$
\end_inset

 was used to indicate the size of this set.
 Extend this notation as
\begin_inset Formula 
\[
\left|(w,*)\right|_{k}=\mbox{sizeof }\left\{ (w,d)|N(w,d)\ge k\right\} 
\]

\end_inset

That is, 
\begin_inset Formula $\left|(w,*)\right|_{k}$
\end_inset

 is the number of disjuncts on word 
\begin_inset Formula $w$
\end_inset

 that have been seen at least 
\begin_inset Formula $k$
\end_inset

 times.
\end_layout

\begin_layout Standard
From this table, it is now clear why 
\begin_inset Quotes eld
\end_inset

large entropy
\begin_inset Quotes erd
\end_inset

 can be intuitively understood to mean 
\begin_inset Quotes eld
\end_inset

many possibilities
\begin_inset Quotes erd
\end_inset

.
 Each of these words was seen in a very broad setting of possibilities:
 in a sense, the broadest possible.
 Each of these words was observed with a very large set of different disjuncts,
 and this set was as spread-out as possible: the vast majority of disjuncts
 were observed exactly once.
 
\end_layout

\begin_layout Subsubsection*
More terms
\end_layout

\begin_layout Standard
Some curious terms show up in relating the fractional mutual information
 to the fractional entropy.
 Expanding out the above summations, one obtains
\begin_inset Formula 
\[
MI_{word}(w)=H_{word}(w)+\log_{2}p(w,*)+\frac{1}{p(w)}\sum_{d}p(w,d)\log_{2}p(*,d)
\]

\end_inset

 The last term is bizarre...
\end_layout
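
\begin_layout Standard
For reference, the expansion can be carried out step by step.
 This sketch takes 
\begin_inset Formula $p(w)$
\end_inset

 and 
\begin_inset Formula $p(w,*)$
\end_inset

 to denote the same marginal, as the formulas above appear to do.
 Splitting the logarithm in equation 
\begin_inset CommandInset ref
LatexCommand ref
reference "eq:mi-word-disjunct"

\end_inset

 gives 
\begin_inset Formula $MI_{pair}(w,d)=-\log_{2}p(w,d)+\log_{2}p(w,*)+\log_{2}p(*,d)$
\end_inset

.
 Inserting this into the definition of 
\begin_inset Formula $MI_{word}(w)$
\end_inset

, and using 
\begin_inset Formula $\sum_{d}p(w,d)=p(w,*)=p(w)$
\end_inset

, one finds
\begin_inset Formula 
\begin{align*}
MI_{word}(w) & =-\frac{1}{p(w)}\sum_{d}p(w,d)\log_{2}p(w,d)+\frac{\log_{2}p(w,*)}{p(w)}\sum_{d}p(w,d)\\
 & \qquad+\frac{1}{p(w)}\sum_{d}p(w,d)\log_{2}p(*,d)\\
 & =H_{word}(w)+\log_{2}p(w,*)+\frac{1}{p(w)}\sum_{d}p(w,d)\log_{2}p(*,d)
\end{align*}

\end_inset

The last term can be read as the average of 
\begin_inset Formula $\log_{2}p(*,d)$
\end_inset

 over the disjuncts actually seen on 
\begin_inset Formula $w$
\end_inset

, weighted by 
\begin_inset Formula $p(w,d)/p(w)$
\end_inset

.
\end_layout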

\begin_layout Subsection
Vertex degrees and hubiness
\end_layout

\begin_layout Standard
Vertex degrees can be defined as the average number of connectors per disjunct.
 In principle, the vertex degree is an excellent indicator of the part of
 speech.
 For example, determiners, adjectives and adverbs typically have a degree
 of one: they have one connector, which is modifying the noun (or verb)
 that they act on.
 By contrast, nouns typically have a degree of two: one connector to attach
 to a verb, another to a modifier, and that's it.
 Verbs have a degree of three: one connector to a subject, one to a direct
 object, a third to an indirect object or a modifier.
 Of course, nouns might have two or more modifiers, or maybe zero modifiers;
 verbs are also quite variable, but the general concept of vertex degree
 is appealing.
 Closely related to this is the idea of 
\begin_inset Quotes eld
\end_inset

hubiness
\begin_inset Quotes erd
\end_inset

, which can be defined as the second moment of the degree.
 
\end_layout

\begin_layout Standard
Thus, it's worth looking at this.
 Define the average degree as
\begin_inset Formula 
\[
K(w)=\frac{\sum_{d,c}N(w,d)C(d,c)}{N(w,*)}
\]

\end_inset

This is graphed, below, for all words that have at least 100 observations.
\begin_inset Foot
status collapsed

\begin_layout Plain Layout
Computed with sorted-avg-connectors in disjunct-stats.scm 
\end_layout

\end_inset


\begin_inset Separator latexpar
\end_inset


\end_layout
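
\begin_layout Standard
To make the definition concrete, here is a small worked example with hypothetica
l counts.
 It assumes that 
\begin_inset Formula $\sum_{c}C(d,c)$
\end_inset

 is simply the number of connectors in the disjunct 
\begin_inset Formula $d$
\end_inset

, which is one reading of the definition above.
 Suppose that a word was observed three times with a disjunct having one
 connector, and once with a disjunct having three connectors.
 Then
\begin_inset Formula 
\[
K(w)=\frac{3\times1+1\times3}{3+1}=1.5
\]

\end_inset

so that a single observation of a long disjunct is enough to pull the average
 noticeably above one.
\end_layout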

\begin_layout Standard
\align center
\begin_inset Graphics
	filename en-dj/ranked-avg-connectors.eps
	width 70text%

\end_inset


\end_layout

\begin_layout Standard
This graph is unexpected.
 Having an average number of connectors that exceeds 5 or 6 is intuitively
 surprising.
 In proper grammar, it would be hard to reach even this degree without having
 a transitive or ditransitive verb with several modifiers, and a particle
 or preposition.
 The first ten items are out of control.
 What's happening here? The first ten items in the ranking are: "|" "#"
 "+" "||" "_" "ASCII" "electronically" "Northup" "disclaimer" "u".
 The first five are presumably formatting markup or possibly decorative markup
 in the texts.
 Three of the words appear to be license boilerplate.
 The name "Northup" appears in only one text in the corpus: an autobiography,
 
\begin_inset Quotes eld
\end_inset

Twelve Years a Slave
\begin_inset Quotes erd
\end_inset

, by Solomon Northup.
 Besides the title-page, the word Northup appears repeatedly in the table
 of contents, which is bound to make for awkward parsing.
 This presumably 
\begin_inset Quotes eld
\end_inset

explains
\begin_inset Quotes erd
\end_inset

 why the average connector count would be 11.2 for this word.
\end_layout

\begin_layout Standard
Moving further down this list, many of the words and symbols suggest that
 they appear in tables or lists embedded in the corpus.
 The 
\begin_inset Quotes eld
\end_inset

grammar
\begin_inset Quotes erd
\end_inset

 of tables and lists is necessarily awkward, and seems unlikely to get much
 of a meaningful parse from the MST parser.
 This impression is further strengthened by an earlier analysis of a Wikipedia-based
 dataset, re-reported below.
\end_layout

\begin_layout Subsubsection*
Vertex degree in Wikipedia
\begin_inset CommandInset label
LatexCommand label
name "subsec:Vertex-degree-in"

\end_inset


\end_layout

\begin_layout Standard
A similar analysis, performed on a much smaller dataset consisting entirely
 of Wikipedia articles, found similar behavior.
 The first ten items in the ranking were: "-" "de" "y" ":" "(" ")" "General"
 "Department" "x" "Act".
\end_layout

\begin_layout Standard
Consider 
\begin_inset Quotes eld
\end_inset

de
\begin_inset Quotes erd
\end_inset

.
 There are 12 observations of the disjunct 
\begin_inset Quotes eld
\end_inset

Janeiro+
\begin_inset Quotes erd
\end_inset

.
 There are 9 observations of the disjunct 
\begin_inset Quotes eld
\end_inset

la+
\begin_inset Quotes erd
\end_inset

.
 There are 51 observations of a disjunct that has 117 connectors on it!!
 This starts out as 
\begin_inset Quotes eld
\end_inset

Diego- Francisco- Francisco- Alonso- Carlos- Fernández- Carlos-
\begin_inset Quotes erd
\end_inset

 and ends with 
\begin_inset Quotes eld
\end_inset

Figueroa+ (+ (+ y+
\begin_inset Quotes erd
\end_inset

 suggesting that there were possibly 51 really bad parses of a very long
 table of Spanish kings, which was mistaken for being a single sentence.
 Clearly, it's junk; it's frequently-occurring junk, which suggests that the
 table was repeatedly transcluded in maybe 51 different Wikipedia pages.
 
\end_layout

\begin_layout Standard
Similarly, 
\begin_inset Quotes eld
\end_inset

Department
\begin_inset Quotes erd
\end_inset

 has 18 observations of a disjunct with 41 connectors on it.
 It starts with 
\begin_inset Quotes eld
\end_inset

Education- Education- Health- Services- Services- Immigration-
\begin_inset Quotes erd
\end_inset

 and ends with 
\begin_inset Quotes eld
\end_inset

Veterans+ of+ Treasury+ Treasury+
\begin_inset Quotes erd
\end_inset

, again suggesting a bad parse of a table mistaken for a sentence, and included
 in 18 different Wikipedia pages.
 
\end_layout

\begin_layout Standard
The list continues in a similar way, for quite a while.
 The green line suggests that if some 30 or so pathological cases are ignored,
 the system settles down to a more respectable behavior.
 Entries 30 through 50 in the rankings are "Bay" "Street" "Island" "of"
 "century" "right" "Game" "Georgian" "or" "a" ";" "near" "Party" "team"
 "law" "Australia" "her" "research" "Church" "east" "Government".
 Notable is a preponderance of capitalized words, suggesting more tables
 of various sorts, and a complete lack of verbs.
 A spot-check of words like 
\begin_inset Quotes eld
\end_inset

team
\begin_inset Quotes erd
\end_inset

 and 
\begin_inset Quotes eld
\end_inset

law
\begin_inset Quotes erd
\end_inset

 shows that the pathological behavior continues.
 Several conclusions are possible.
\end_layout

\begin_layout Standard
One conclusion is that there is a severe shortage of verbs in Wikipedia
 articles, and this makes sense: it's primarily descriptive, rather than
 active: running, jumping, hitting, putting, mixing, giving, setting are
 not the kinds of verbs that are required to describe a typical encyclopedia
 topic.
 
\end_layout

\begin_layout Standard
Another conclusion is that perhaps the number of observations of pairs is
 insufficient to get deep, reliable MST parsing.
 Junk links get used because there were not enough appropriate word-pairs
 seen to give a good-quality MST parse.
 A related conclusion is that the connector-set dataset is also too thin:
 The grammatically reasonable connectors are observed not even a few dozen
 times, barely pushing them out of the noise-floor of onesie-twosie observations
 of junk.
 
\end_layout

\begin_layout Standard
So: bigger datasets, and an urgent need for non-Wikipedia content.
 Fiction, and presumably teen fiction, should be filled with the kinds of
 active verbs describing human motions and actions, and should be free
 of tables and lists masquerading as sentences.
\end_layout

\begin_layout Subsection
Hubiness
\end_layout

\begin_layout Standard
Similar to the above, hubiness can be defined as the second moment of the
 connector count:
\end_layout

\begin_layout Standard
\begin_inset Formula 
\[
hub(w)=\frac{\sum_{d,c}N(w,d)C^{2}(d,c)}{N(w,*)}-K^{2}(w)
\]

\end_inset

Given the earlier Zipfian results on the average degree 
\begin_inset Formula $K(w)$
\end_inset

, it should be no surprise that a ranked listing of words by hubiness is
 very nearly identical to the listing for average degree.
 This is, after all, what the scale-free nature of the Zipfian distribution
 really means.
 Not only is the ranking nearly the same, but one also has the approximate
 equality 
\begin_inset Formula $hub(w)\approx2K(w)$
\end_inset

 to some ten or twenty percent.
 
\end_layout

\begin_layout Section
Grammatical Classes
\end_layout

\begin_layout Standard
The primary intended use of the connector set dataset is to provide input
 to the measurement of the syntactic (and semantic) similarity of words,
 so that they can be organized into grammatical classes (colloquially, into
 parts of speech).
 Two words are grammatically similar if they share many disjuncts in common,
 that is, if they are used in sentences in much the same way.
 Thus, for example, one might imagine that the words 
\begin_inset Quotes eld
\end_inset

run
\begin_inset Quotes erd
\end_inset

, 
\begin_inset Quotes eld
\end_inset

trot
\begin_inset Quotes erd
\end_inset

, 
\begin_inset Quotes eld
\end_inset

jog
\begin_inset Quotes erd
\end_inset

 and 
\begin_inset Quotes eld
\end_inset

walk
\begin_inset Quotes erd
\end_inset

 might get used in similar ways, sometimes being almost interchangeable.
 Thus, it is interesting to develop metrics by which similarity might be
 gauged.
\end_layout

\begin_layout Standard
A number of different metrics are possible, and many are proposed in the
 section that follows.
 Two are of particular interest.
 The first is the cosine similarity; this is interesting, if for no other
 reason than that it is widely defined and used.
 The second is the entropic similarity.
 This lurks in many contexts under the name of 
\begin_inset Quotes eld
\end_inset

mutual information
\begin_inset Quotes erd
\end_inset

; however, the term 
\begin_inset Quotes eld
\end_inset

mutual information
\begin_inset Quotes erd
\end_inset

 has already been heavily used in the previous sections, and the entropic
 similarity is something different; it resembles the cosine similarity,
 but has much nicer properties, when one is concerned with maximizing the
 information content over large datasets.
 
\end_layout

\begin_layout Standard
A related, but very important, issue when judging the similarity of words
 and attempting to classify them into distinct grammatical classes is that
 sometimes the same word, 
\emph on
viz
\emph default
, the same text-string can belong to two different parts of speech.
 A casual perusal of any dictionary will show that many words can be used
 both as nouns, and as verbs.
 Other words can be nouns or adjectives; yet others can be subclasses: 
\emph on
e.g.

\emph default
 transitive and intransitive verbs.
 Disjuncts can be thought of as hyper-fine-grained parts-of-speech identifiers,
 and so one task of categorization is to place them into larger, coarser
 categories.
 Another important realization is to observe that different disjuncts correlate
 strongly with different semantic meanings; for a particularly brutal example,
 the noun 
\begin_inset Quotes eld
\end_inset

saw
\begin_inset Quotes erd
\end_inset

 has a semantic meaning that is unrelated to the past tense of the verb
 
\begin_inset Quotes eld
\end_inset

to see
\begin_inset Quotes erd
\end_inset

.
 Which is intended is readily discerned from context; disjuncts encode content.
\end_layout

\begin_layout Standard
A more extensive discussion of the relationship between meaning and disjuncts
 is given elsewhere;
\begin_inset CommandInset citation
LatexCommand cite
key "Vepstas2018stiching"

\end_inset

 it is touched on only briefly below, as it informs what properties of a
 similarity metric are desirable, and which are not.
\end_layout

\begin_layout Subsection
Matrix Factorization
\end_layout

\begin_layout Standard
One can pose the problem of organizing words into grammatical categories
 as a matrix factorization problem.
 In the current framework, this means factorizing the matrix
\begin_inset Formula 
\[
p\left(w,d\right)=\sum_{g\in G}p^{\prime}\left(w,g\right)p^{\prime\prime}\left(g,d\right)
\]

\end_inset

so that 
\begin_inset Formula $p^{\prime}\left(w,g\right)$
\end_inset

 gives the probability that the word 
\begin_inset Formula $w$
\end_inset

 belongs to the grammatical category 
\begin_inset Formula $g$
\end_inset

, while 
\begin_inset Formula $p^{\prime\prime}\left(g,d\right)$
\end_inset

 gives the probability that the disjunct 
\begin_inset Formula $d$
\end_inset

 is appropriate for use with the grammatical category 
\begin_inset Formula $g$
\end_inset

.
 This factorization feeds back into itself, as the disjuncts are necessarily
 composed of words, which must now be reassigned into classes.
 
\end_layout
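
\begin_layout Standard
As an illustration only, suppose that the word 
\begin_inset Quotes eld
\end_inset

saw
\begin_inset Quotes erd
\end_inset

 had non-zero membership in just two categories.
 The labels 
\begin_inset Formula $g_{n}$
\end_inset

 (a noun-like class) and 
\begin_inset Formula $g_{v}$
\end_inset

 (a verb-like class) are hypothetical, introduced here only for the example.
 The sum over 
\begin_inset Formula $G$
\end_inset

 then collapses to two terms:
\begin_inset Formula 
\[
p\left(\mbox{saw},d\right)=p^{\prime}\left(\mbox{saw},g_{n}\right)p^{\prime\prime}\left(g_{n},d\right)+p^{\prime}\left(\mbox{saw},g_{v}\right)p^{\prime\prime}\left(g_{v},d\right)
\]

\end_inset

That is, the observed distribution of disjuncts on the word is modeled as
 a weighted mixture of the disjunct distributions of the classes it belongs
 to.
\end_layout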

\begin_layout Standard
There are many widely discussed algorithms for performing the above factorizatio
n, falling under the general name of 
\begin_inset Quotes eld
\end_inset

low-rank matrix approximation
\begin_inset Quotes erd
\end_inset

 (LRMA) techniques.
 These are discussed in greater detail in 
\begin_inset CommandInset citation
LatexCommand cite
key "Vepstas2018skippy"

\end_inset

.
\end_layout

\begin_layout Subsection
Clustering
\end_layout

\begin_layout Standard
As an alternative to applying LRMA techniques, one can try to guess at the
 grammatical categories by looking for words that are similar to each other,
 and so invoke clustering algorithms to obtain grammatical categories.
 As clustering is inherently point-wise and not global, it is a generally
 weaker technique.
 That is, clustering generally looks at just pairs of words, and 
\begin_inset Quotes eld
\end_inset

ignores
\begin_inset Quotes erd
\end_inset

 other evidence in making the decision as to whether two words might go
 into the same grammatical category.
 Worse, if the clustering problem is posed naively, then it has a naive
 and incorrect answer: each word can belong to only one class.
 This ignores the fact that a word might belong to multiple classes 
\emph on
viz.

\emph default
 that class membership is a probability 
\begin_inset Formula $p^{\prime}\left(w,g\right)$
\end_inset

.
 It also ignores the possibility that an observed word-vector might be the
 sum of observations of two very different words that just happen to have
 the same spelling: a canonical example being 
\begin_inset Quotes eld
\end_inset

saw
\begin_inset Quotes erd
\end_inset

, which can be the past tense of 
\begin_inset Quotes eld
\end_inset

to see
\begin_inset Quotes erd
\end_inset

, the noun that is a cutting tool, and the verb that is the cutting action.
 Although in all cases, one has 
\begin_inset Formula $w=\mbox{saw}$
\end_inset

, the collection of disjuncts associated to each is likely to be very different.
 That is, the distribution 
\begin_inset Formula $p^{\prime\prime}\left(g,d\right)$
\end_inset

 for 
\begin_inset Formula $g=\mbox{`cutting tool`}$
\end_inset

 will be very different from 
\begin_inset Formula $p^{\prime\prime}\left(g,d\right)$
\end_inset

 for 
\begin_inset Formula $g=\mbox{`past-tense-observation`}$
\end_inset

.
 
\end_layout

\begin_layout Section
Disjunct Cosine Similarity
\begin_inset CommandInset label
LatexCommand label
name "subsec:Disjunct-Cosine-Similarity"

\end_inset


\end_layout

\begin_layout Standard
The cosine similarity between two vectors is simply their inner product,
 normalized by their lengths.
 In this case, given two words 
\begin_inset Formula $w_{1}$
\end_inset

 and 
\begin_inset Formula $w_{2}$
\end_inset

, it is given by 
\begin_inset Formula 
\begin{equation}
\mbox{sim}(w_{1},w_{2})=\frac{\sum_{d}N(w_{1},d)N(w_{2},d)}{\mbox{len}(w_{1})\mbox{len}(w_{2})}\label{eq:cosine-sim}
\end{equation}

\end_inset

where 
\begin_inset Formula $\mbox{len}(w)$
\end_inset

 is the Euclidean length (the root of the sum of squares) of the connector-set
 vector:
\begin_inset Formula 
\[
\mbox{len}(w)=\sqrt{\sum_{d}N^{2}(w,d)}
\]

\end_inset

This section explores the distribution of the cosine similarity for the
 
\noun on
en_pairs_rfive_mtwo
\noun default
 dataset (same as above).
\end_layout
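
\begin_layout Standard
As a small worked example with hypothetical counts, suppose that only three
 disjuncts are in play, and that two words have the count vectors 
\begin_inset Formula $N(w_{1},\cdot)=(3,4,0)$
\end_inset

 and 
\begin_inset Formula $N(w_{2},\cdot)=(4,3,0)$
\end_inset

.
 Then
\begin_inset Formula 
\[
\mbox{sim}(w_{1},w_{2})=\frac{3\cdot4+4\cdot3+0\cdot0}{\sqrt{3^{2}+4^{2}+0^{2}}\,\sqrt{4^{2}+3^{2}+0^{2}}}=\frac{24}{5\cdot5}=0.96
\]

\end_inset

Because the counts are never negative, the similarity always lies between
 zero and one; it is exactly one only when the two count vectors are proportiona
l.
\end_layout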

\begin_layout Subsubsection
Rank distribution
\end_layout

\begin_layout Standard
The 
\noun on
en_pairs_rfive_mtwo
\noun default
 dataset (same as above) contains 797 words whose length is greater than
 or equal to 128.
 The ranking-by-length was already shown up above, in the graph 
\begin_inset CommandInset ref
LatexCommand nameref
reference "graph: ranked length"

\end_inset


\begin_inset CommandInset ref
LatexCommand vpageref
reference "graph: ranked length"

\end_inset

.
 The similarity between all pairs of these was computed; this resulted in
 
\begin_inset Formula $797\times796/2=317206$
\end_inset

 pairs.
 These can be sorted and ranked.
\begin_inset Foot
status collapsed

\begin_layout Plain Layout
See good-sims and ranked-sims in the disjunct-stats.scm file.
\end_layout

\end_inset

 They are shown below.
\begin_inset Separator latexpar
\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Graphics
	filename en-dj/ranked-sims.eps
	width 70text%

\end_inset


\end_layout

\begin_layout Standard
Well, that's new! The similarity ranking is very well fit by a parabola
 at the high end, and a cubic at the low end! In the above, the green line
 is given by 
\begin_inset Formula $1-\log^{2}(rank)/160$
\end_inset

 while the blue line is given by 
\begin_inset Formula $-\log^{3}(rank/317206).$
\end_inset


\end_layout

\begin_layout Subsubsection
Hyperbolic space
\end_layout

\begin_layout Standard
Recall that
\begin_inset Formula 
\[
\cos\theta=1-\frac{\theta^{2}}{2}+\frac{\theta^{4}}{4!}-\cdots
\]

\end_inset

and so the above graph is stating that 
\begin_inset Formula $\theta\approx\log\left(rank\right)/8.9$
\end_inset

 is a reasonable approximation for the similarity angle, for the first thousand
 pairs, or so.
 Lest there be any hope that the graph as a whole is sine-wave-like, the
 high-rank tail goes as the cube of the log, not the square.
 
\end_layout
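
\begin_layout Standard
Spelling out this step: truncating the cosine series after the quadratic
 term and equating it with the green-line fit gives
\begin_inset Formula 
\[
1-\frac{\theta^{2}}{2}\approx1-\frac{\log^{2}\left(rank\right)}{160}\quad\Longrightarrow\quad\theta\approx\frac{\log\left(rank\right)}{\sqrt{80}}\approx\frac{\log\left(rank\right)}{8.94}
\]

\end_inset

As a rough check on the truncation, and assuming that the logarithm is the
 natural logarithm (the base is not stated above), at rank 1000 one has
 
\begin_inset Formula $\theta\approx6.91/8.94\approx0.77$
\end_inset

, so that the neglected term 
\begin_inset Formula $\theta^{4}/4!\approx0.015$
\end_inset

 is small compared to 
\begin_inset Formula $\theta^{2}/2\approx0.30$
\end_inset

.
\end_layout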

\begin_layout Standard
The similarity angle can be taken as a metric or as a distance.
 If the relationships between words are thought of as a network graph, one
 can ask how many nearest neighbors there are, as a function of the distance.
 Since the rank is effectively a count of nearest neighbors, and since 
\begin_inset Formula $\theta$
\end_inset

 is effectively a distance, one has that 
\begin_inset Formula 
\[
V\sim e^{r}
\]

\end_inset

where 
\begin_inset Formula $V$
\end_inset

 is the number of nearest neighbors (the 
\begin_inset Quotes eld
\end_inset

volume
\begin_inset Quotes erd
\end_inset

) and 
\begin_inset Formula $r$
\end_inset

 is the distance (so 
\begin_inset Formula $r=\theta$
\end_inset

 is just a change of notation).
 This volume-to-radius relationship should be compared to the volume-to-radius
 relationships for flat spaces for fixed dimension.
 For flat, Euclidean two-dimensional space, one has 
\begin_inset Formula $A=\pi r^{2}$
\end_inset

 the area of a circle.
 In three dimensions, 
\begin_inset Formula $V=4\pi r^{3}/3$
\end_inset

.
 For flat 
\begin_inset Formula $N$
\end_inset

-dimensional space, 
\begin_inset Formula $V\sim r^{N}$
\end_inset

.
 Clearly, the 
\begin_inset Quotes eld
\end_inset

space
\begin_inset Quotes erd
\end_inset

 of the network graph cannot be flat.
 This is in contrast to spaces with constant negative curvature, where the
 volume does grow exponentially as a function of the radius.
 Thus, one can hypothesize that the language graph can be mapped (almost)
 isometrically to a hyperbolic surface (for example, the Poincaré disk).
\end_layout
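
\begin_layout Standard
To make the comparison explicit: inverting 
\begin_inset Formula $\theta\approx\log\left(rank\right)/8.9$
\end_inset

, and assuming that the logarithm is the natural logarithm, gives
\begin_inset Formula 
\[
V=rank\approx e^{8.9\,r}
\]

\end_inset

over the range where the fit holds.
 For comparison, it is a standard result that a disk of radius 
\begin_inset Formula $r$
\end_inset

 in the hyperbolic plane of constant curvature 
\begin_inset Formula $-1$
\end_inset

 has area
\begin_inset Formula 
\[
A=2\pi\left(\cosh r-1\right)=4\pi\sinh^{2}\frac{r}{2}
\]

\end_inset

which grows as 
\begin_inset Formula $\pi e^{r}$
\end_inset

 for large 
\begin_inset Formula $r$
\end_inset

.
 This is only a sketch of the analogy; it says nothing yet about whether
 the network graph actually admits such an embedding.
\end_layout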

\begin_layout Standard
This conjecture promptly poses various questions: does the network graph
 have constant negative curvature, or is it bumpy? Numerically, what is
 the curvature? Does it embed better into hyperbolic 3-space or hyperbolic
 4-space? There is no such thing as a hyperbolic 
\begin_inset Formula $N$
\end_inset

-space for classical Riemann spaces, for 
\begin_inset Formula $N<2$
\end_inset

; yet, for the network graph, presumably one has 
\begin_inset Formula $N=0$
\end_inset

.
 How should this be described? 
\end_layout

\begin_layout Subsubsection
Size Cutoffs
\end_layout

\begin_layout Standard
The network graph has some nodes and links that are observed very frequently,
 while others are observed only very rarely, sometimes only once or twice.
 As mentioned in the section 
\begin_inset CommandInset ref
LatexCommand ref
reference "subsec:Vertex-degree-in"

\end_inset

 on the vertex degree, the rarely-observed disjuncts appear to be associated
 with bad parses of lists, tables (such as tables of contents, product tables,
 pricing tables) and indexes.
 Therefore, it is appealing to imagine that by removing this 
\begin_inset Quotes eld
\end_inset

crud
\begin_inset Quotes erd
\end_inset

, one could improve the quality of the dataset.
 One obvious way of performing a cleanup is to banish infrequently-observed
 words.
\end_layout

\begin_layout Standard
In the 
\noun on
en_pairs_rfive_mtwo
\noun default
 dataset, there are 427 words observed 256 or more times; these form 
\begin_inset Formula $427\times426/2=90951$
\end_inset

 pairs.
 There are 245 words observed 512 or more times; these form 
\begin_inset Formula $245\times244/2=29890$
\end_inset

 pairs.
 There are 130 words observed 1024 or more times; these form 
\begin_inset Formula $130\times129/2=8385$
\end_inset

 pairs.
 There are 69 words observed 2048 or more times; these form 
\begin_inset Formula $69\times68/2=2346$
\end_inset

 pairs.
 The rankings of these are shown below.
\begin_inset Separator latexpar
\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Graphics
	filename en-dj/ranked-graded-sims.eps
	width 70text%

\end_inset


\end_layout

\begin_layout Standard
The graph below shows the distribution of cosine similarity.
\begin_inset Foot
status collapsed

\begin_layout Plain Layout
Produced by binned-good-sims.
 
\end_layout

\end_inset

 There are 797 words for which 
\begin_inset Formula $128<\mbox{len}(w)$
\end_inset

.
 This shows the distribution of the 317206 word-pairs formed from these
 words, for which 
\begin_inset Formula $0.1\le\mbox{sim}(w_{1},w_{2})$
\end_inset

.
 The eyeballed fit is for 
\begin_inset Formula $\exp(-3.5\times\mbox{sim})$
\end_inset

.
 
\begin_inset Separator latexpar
\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Graphics
	filename en-dj/binned-sims.eps
	width 70text%

\end_inset


\end_layout

\begin_layout Standard
One can also be less restrictive.
 The graph below shows the distribution of 1058120 pairs, with the length
 restriction loosened to 
\begin_inset Formula $4<\mbox{len}(w)$
\end_inset

.
 It is, however, an incomplete dataset; not all possible pairs were considered.
\begin_inset Separator latexpar
\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Graphics
	filename en-dj/binned-sims-all.eps
	width 60col%

\end_inset


\end_layout

\begin_layout Standard
It has a rather different structure to it.
 It's not clear what this means; this may be an artifact of it being an
 incomplete dataset...
 
\end_layout

\begin_layout Subsubsection
Top Ten Lists
\end_layout

\begin_layout Standard
It's worth getting an idea of what the most similar words are, and why they
 are considered similar.
\end_layout

\begin_layout Standard
The top-ten similarity pairs (for which 
\begin_inset Formula $128<\mbox{len}(w)$
\end_inset

) are: 'Stats ..
 Category' 'Notes ..
 Summary' 'Stats ..
 Rating' 'Category ..
 Rating' 'She ..
 He' 'Category ..
 Fandom' 'Stats ..
 Fandom' 'Category ..
 Character' 'Stats ..
 Character' 'she ..
 he'.
 The first one has a cosine of exactly 1.0, and the rest are above 0.975.
 
\end_layout

\begin_layout Standard
The two words 
\begin_inset Quotes eld
\end_inset

Stats
\begin_inset Quotes erd
\end_inset

 and 
\begin_inset Quotes eld
\end_inset

Category
\begin_inset Quotes erd
\end_inset

 each have exactly one disjunct: it is 'LEFT-WALL- :+'. That is, these appear
 as the first word in a sentence, and are immediately followed by a colon.
 The word 'Notes' has 27 distinct disjuncts, but only three are observed
 more than 6 times: 'Chapter- or LEFT-WALL- or End-'. That is, 'Notes' appears
 either all by itself, or as the phrase 'Chapter Notes' or as 'End Notes'.
 The word 'Summary' has 47 distinct disjuncts, but only two appear more
 than 6 times: 'Chapter- or LEFT-WALL-', i.e.
 either as a solitary word, or as the phrase 'Chapter Summary'.
 This is why 'Notes' and 'Summary' are considered to be so similar.
\end_layout

\begin_layout Standard
Continuing in this fashion: 'Rating' has 7 disjuncts, four of which are
 observed more than 6 times: '(LEFT-WALL- :+) or (LEFT-WALL- :+ Audiences+)
 or (LEFT-WALL- :+ Explicit+) or (LEFT-WALL- :+ Mature+)'. This suggests
 it appears in a table of some sort.
 The first disjunct '(LEFT-WALL- :+)' accounts for its high similarity to
 both 'Stats' and 'Category'.
 So, yes, all these words have been discovered to behave in a grammatically
 similar fashion; however, this behavior is somewhat boring: it arises
 from the regularity of some table listing.
\end_layout

\begin_layout Standard
The first non-capitalized word-pair appears at position ten.
 Of the top thirty, the table below shows the non-capitalized word-pairs.
 
\begin_inset VSpace defskip
\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Tabular
<lyxtabular version="3" rows="13" columns="3">
<features tabularvalignment="middle">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
rank
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
cosine
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
word-pair
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
10
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.977
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
'she ..
 he'
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
16
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.955
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
'guess ..
 suppose'
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
17
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.954
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
'city ..
 house'
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
19
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.950
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
'would ..
 might'
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
20
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.945
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
're ..
 non'
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
21
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.944
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
'village ..
 city'
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
24
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.931
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
'father ..
 mother'
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
25
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.928
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
'village ..
 house'
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
26
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.928
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
'don ..
 sama'
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
27
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.923
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
'world ..
 city'
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
28
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.921
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
'son ..
 daughter'
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
29
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.918
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
'suppose ..
 hope'
\end_layout

\end_inset
</cell>
</row>
</lyxtabular>

\end_inset


\begin_inset VSpace defskip
\end_inset


\end_layout

\begin_layout Standard
The table below shows several more pairs worth closer examination.
\begin_inset VSpace defskip
\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Tabular
<lyxtabular version="3" rows="8" columns="3">
<features tabularvalignment="middle">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
rank
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
cosine
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
word-pair
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
32
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.915
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
'though ..
 but'
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
33
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.914
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
'should ..
 might'
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
54
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.893
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
'should ..
 must'
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
61
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.885
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
'when ..
 until'
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
67
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.881
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
'leave ..
 take'
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
68
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.8794
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
'believe ..
 think'
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
69
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.879
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
'in ..
 by'
\end_layout

\end_inset
</cell>
</row>
</lyxtabular>

\end_inset


\end_layout

\begin_layout Standard
\begin_inset VSpace defskip
\end_inset


\end_layout

\begin_layout Subsubsection
Vector Structure
\end_layout

\begin_layout Standard
It is worth looking at a few of the vectors to see what form they actually
 take.
 
\end_layout

\begin_layout Standard
The word 'she' has 27578 distinct disjuncts on it; the word 'he' has 57330
 disjuncts.
 The table below shows the top-ranked disjuncts for each.
 It makes quite clear why the two have a high similarity: the same seven
 disjuncts head both lists.
\begin_inset Foot
status collapsed

\begin_layout Plain Layout
Print the disjuncts with the `print-disjuncts` function.
\end_layout

\end_inset


\begin_inset VSpace defskip
\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Tabular
<lyxtabular version="3" rows="9" columns="4">
<features tabularvalignment="middle">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<row>
<cell multicolumn="1" alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
she
\end_layout

\end_inset
</cell>
<cell multicolumn="2" alignment="center" valignment="top" topline="true" bottomline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell multicolumn="1" alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
he
\end_layout

\end_inset
</cell>
<cell multicolumn="2" alignment="center" valignment="top" topline="true" bottomline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
count
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
disjunct
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
count
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
disjunct
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2501
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
said+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
4673
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
said+
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1060
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
was+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2643
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
was+
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
983
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
had+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1959
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
had+
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
656
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
asked+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1293
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
could+
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
593
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
”-
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1281
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
asked+
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
474
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
could+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
909
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
that- was+
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
372
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
that- was+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
813
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
”-
\end_layout

\end_inset
</cell>
</row>
</lyxtabular>

\end_inset


\begin_inset VSpace defskip
\end_inset


\end_layout
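
\begin_layout Standard
As a rough check, the cosine can be evaluated on just the seven disjuncts
 shown in the table.
 This is only a sketch: it assumes that the cosine is taken over raw observation
 counts, and it ignores the many thousands of lower-ranked disjuncts, so
 the value for the full vectors will differ.
 Writing 
\begin_inset Formula $N(w,d)$
\end_inset

 for the count of disjunct 
\begin_inset Formula $d$
\end_inset

 on word 
\begin_inset Formula $w$
\end_inset

 (a notation assumed here for illustration), the truncated vectors give
\begin_inset Formula 
\[
\cos\theta=\frac{\sum_{d}N(\mathrm{she},d)\,N(\mathrm{he},d)}{\sqrt{\sum_{d}N(\mathrm{she},d)^{2}}\,\sqrt{\sum_{d}N(\mathrm{he},d)^{2}}}=\frac{18687925}{\sqrt{9489935}\,\sqrt{37460119}}\approx0.99
\]

\end_inset

Each of the seven components is non-zero for both words, and the counts
 for 'he' are roughly twice those for 'she' throughout, so the two truncated
 vectors point in nearly the same direction.
\end_layout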

\begin_layout Standard
The list above contains the pairs 'guess ..
 suppose' and also 'suppose ..
 hope'.
 The word 'guess' has 859 disjuncts on it.
 The word 'suppose' has 805 disjuncts on it.
 The word 'hope' has 1735 disjuncts on it.
 The table below indicates why their cosine similarity is high.
\begin_inset VSpace defskip
\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Tabular
<lyxtabular version="3" rows="10" columns="6">
<features tabularvalignment="middle">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<row>
<cell multicolumn="1" alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
guess
\end_layout

\end_inset
</cell>
<cell multicolumn="2" alignment="center" valignment="top" topline="true" bottomline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell multicolumn="1" alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
suppose
\end_layout

\end_inset
</cell>
<cell multicolumn="2" alignment="center" valignment="top" topline="true" bottomline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell multicolumn="1" alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
hope
\end_layout

\end_inset
</cell>
<cell multicolumn="2" alignment="center" valignment="top" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
</row>
<row>
<cell multicolumn="1" alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
859 total obs
\end_layout

\end_inset
</cell>
<cell multicolumn="2" alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell multicolumn="1" alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
805 total obs
\end_layout

\end_inset
</cell>
<cell multicolumn="2" alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell multicolumn="1" alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1735 total obs
\end_layout

\end_inset
</cell>
<cell multicolumn="2" alignment="center" valignment="top" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
count
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
disjunct
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
count
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
disjunct
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
count
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
disjunct
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
272
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
I-
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
377
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
I-
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
393
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
I-
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
57
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
I- .+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
66
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
I- you+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
160
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
I- you+
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
21
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
I- it's+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
45
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
I- .+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
77
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
that+
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
20
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
I- you're+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
39
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
I- that+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
59
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
of+
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
16
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
I- so+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
33
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
I- it+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
55
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
I- that+
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
15
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
could-
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
27
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
I- ?+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
51
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
we- that+ you+
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
15
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
I- you+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
25
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
I- ,+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
33
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
I- so+
\end_layout

\end_inset
</cell>
</row>
</lyxtabular>

\end_inset


\begin_inset VSpace defskip
\end_inset


\end_layout
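
\begin_layout Standard
The dominance of a single disjunct can be made concrete with a small worked
 example.
 As an assumption made only for illustration, restrict 'guess' and 'suppose'
 to the seven rows shown for each, and treat any disjunct missing from a
 column as zero, although in reality it is merely below the cutoff.
 The shared disjuncts are then 'I-', 'I- .+' and 'I- you+', and the cosine
 over raw counts is
\begin_inset Formula 
\[
\cos\theta=\frac{272\cdot377+57\cdot45+15\cdot66}{\sqrt{78780}\,\sqrt{152474}}=\frac{106099}{\sqrt{78780}\,\sqrt{152474}}\approx0.97
\]

\end_inset

The 'I-' term alone contributes 102544 of the 106099 in the numerator, so
 under these assumptions almost all of the similarity comes from that one
 disjunct.
\end_layout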

\begin_layout Standard
The pair 're ..
 non' seems strange, but closer examination shows that both are usually
 followed by a hyphen, and that this accounts for all of the observed similarity
 between the two.
 The pair 'don ..
 sama' is also strange.
 Their similarity is due entirely to their linking to dashes and other punctuation.
 This appears to be due to a Finnish text that has crept into the corpus.
\end_layout

\begin_layout Standard
Some more verbs are worth looking at.
 Here is 'leave ..
 take', below.
 Almost all of the similarity is due to the infinitive form; the only other
 shared disjunct appearing in the table is 'to- it+', although there are
 more shared disjuncts at lower counts.
\begin_inset VSpace defskip
\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Tabular
<lyxtabular version="3" rows="11" columns="4">
<features tabularvalignment="middle">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<row>
<cell multicolumn="1" alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
leave
\end_layout

\end_inset
</cell>
<cell multicolumn="2" alignment="center" valignment="top" topline="true" bottomline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell multicolumn="1" alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
take
\end_layout

\end_inset
</cell>
<cell multicolumn="2" alignment="center" valignment="top" topline="true" bottomline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
</row>
<row>
<cell multicolumn="1" alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
3017 total obs
\end_layout

\end_inset
</cell>
<cell multicolumn="2" alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell multicolumn="1" alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
5693 total obs
\end_layout

\end_inset
</cell>
<cell multicolumn="2" alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
count
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
disjunct
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
count
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
disjunct
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
298
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
to-
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
619
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
to-
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
40
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
to- the+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
105
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
would-
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
33
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
to- .+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
95
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
to- care+
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
32
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
took-
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
88
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
will-
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
31
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
you-
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
87
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
to- a+
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
28
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
to- him+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
74
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
to- it+
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
26
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
to- room+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
70
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
to- her+
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
25
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
to- it+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
67
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
I- it+
\end_layout

\end_inset
</cell>
</row>
</lyxtabular>

\end_inset


\begin_inset VSpace defskip
\end_inset


\end_layout

\begin_layout Standard
The disjuncts for the pair 'believe ..
 think' are shown below; several constructions are clearly shared, notably
 'I-', 'to-', 'I- it+' and 'I- you+'.
\begin_inset VSpace defskip
\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Tabular
<lyxtabular version="3" rows="10" columns="4">
<features tabularvalignment="middle">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<row>
<cell multicolumn="1" alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
believe
\end_layout

\end_inset
</cell>
<cell multicolumn="2" alignment="center" valignment="top" topline="true" bottomline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell multicolumn="1" alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
think
\end_layout

\end_inset
</cell>
<cell multicolumn="2" alignment="center" valignment="top" topline="true" bottomline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
</row>
<row>
<cell multicolumn="1" alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2190 total obs
\end_layout

\end_inset
</cell>
<cell multicolumn="2" alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell multicolumn="1" alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
5462 total obs
\end_layout

\end_inset
</cell>
<cell multicolumn="2" alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
count
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
disjunct
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
count
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
disjunct
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
339
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
I-
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1321
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\strikeout off
\uuline off
\uwave off
\noun off
\color none
I-
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
116
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
I- that+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
308
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
you-
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
94
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
to-
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
265
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
don't-
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
79
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
to- that+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
244
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
to-
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
54
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
that+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
211
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
do- you-
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
51
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
I- you+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
202
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
I- it+
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
51
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\strikeout off
\uuline off
\uwave off
\noun off
\color none
I- it+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
168
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
I- you+
\end_layout

\end_inset
</cell>
</row>
</lyxtabular>

\end_inset


\begin_inset VSpace defskip
\end_inset


\end_layout

\begin_layout Standard
The first pair of prepositions in the list is 'in ..
 by', and so is worth a look.
 Most of the similarity comes from both words preceding a determiner ('the+',
 'a+', 'his+').
 It is unclear why the link to the determiner is so strong.
 It is likewise unclear whether the determiner is, in turn, strongly linked
 to the noun that follows it.
 
\begin_inset VSpace defskip
\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Tabular
<lyxtabular version="3" rows="10" columns="4">
<features tabularvalignment="middle">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<row>
<cell multicolumn="1" alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
in
\end_layout

\end_inset
</cell>
<cell multicolumn="2" alignment="center" valignment="top" topline="true" bottomline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell multicolumn="1" alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
by
\end_layout

\end_inset
</cell>
<cell multicolumn="2" alignment="center" valignment="top" topline="true" bottomline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
</row>
<row>
<cell multicolumn="1" alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
64011 total obs
\end_layout

\end_inset
</cell>
<cell multicolumn="2" alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell multicolumn="1" alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
20907 total obs
\end_layout

\end_inset
</cell>
<cell multicolumn="2" alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
count
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
disjunct
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
count
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
disjunct
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
12484
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\strikeout off
\uuline off
\uwave off
\noun off
\color none
the+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2064
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
the+
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
4776
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
a+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
497
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
a+
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
4720
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
his+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
271
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
followed-
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1529
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
front+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
266
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
his+
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1429
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
her+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
252
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
surrounded-
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1113
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
this+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
214
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
which+
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1092
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
my+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
207
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
means+
\end_layout

\end_inset
</cell>
</row>
</lyxtabular>

\end_inset


\begin_inset VSpace defskip
\end_inset


\end_layout
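
\begin_layout Standard
As a rough check on the claim about determiners, consider only the disjuncts
 that appear in both columns of the table above: 'the+', 'a+' and 'his+'.
 Assuming that the similarity in question is a cosine-style measure built
 on the dot product of the two disjunct-count vectors (an assumption made
 here for illustration, not something the table itself establishes), the
 contribution of these three shared disjuncts to the dot product is
\begin_inset Formula 
\[
12484\cdot2064+4776\cdot497+4720\cdot266=25766976+2373672+1255520=29396168.
\]

\end_inset

The 'the+' term alone supplies roughly 88% of this partial sum.
 The table lists only the top eight disjuncts for each word, so this is
 a sketch of where the overlap lies rather than the full similarity score;
 the normalization would require the complete vectors.
\end_layout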

\begin_layout Standard
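The pair 'would ..
 might' is accompanied by 'should ..
 might' and 'should ..
 must'.
 The disjuncts for all four words are shown below.
 The reason for the similarity is readily apparent and matches expectation:
 each of these modals is most often followed by 'have+' or 'be+', frequently
 with a pronoun subject to the left.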
\begin_inset VSpace defskip
\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Tabular
<lyxtabular version="3" rows="9" columns="8">
<features tabularvalignment="middle">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<row>
<cell multicolumn="1" alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
would
\end_layout

\end_inset
</cell>
<cell multicolumn="2" alignment="center" valignment="top" topline="true" bottomline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell multicolumn="1" alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
should
\end_layout

\end_inset
</cell>
<cell multicolumn="2" alignment="center" valignment="top" topline="true" bottomline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell multicolumn="1" alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
might
\end_layout

\end_inset
</cell>
<cell multicolumn="2" alignment="center" valignment="top" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell multicolumn="1" alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
must
\end_layout

\end_inset
</cell>
<cell multicolumn="2" alignment="center" valignment="top" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
</row>
<row>
<cell multicolumn="1" alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
20039 total obs
\end_layout

\end_inset
</cell>
<cell multicolumn="2" alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell multicolumn="1" alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
6388 total obs
\end_layout

\end_inset
</cell>
<cell multicolumn="2" alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell multicolumn="1" alignment="center" valignment="top" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
6861 total obs
\end_layout

\end_inset
</cell>
<cell multicolumn="2" alignment="center" valignment="top" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell multicolumn="1" alignment="center" valignment="top" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
5860 total obs
\end_layout

\end_inset
</cell>
<cell multicolumn="2" alignment="center" valignment="top" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
count
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
disjunct
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
count
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
disjunct
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
count
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
disjunct
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
count
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
disjunct
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1648
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
have+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1064
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
have+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
654
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
have+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
666
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
have+
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1510
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
be+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
840
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
be+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
468
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
be+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
558
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
be+
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
621
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
it- be+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
415
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
I-
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
154
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
he- have+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
215
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
it- be+
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
445
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
he- have+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
246
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
I- be+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
150
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
it- be+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
195
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
you-
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
315
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
he- be+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
155
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
we-
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
91
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
he- be+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
183
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\strikeout off
\uuline off
\uwave off
\noun off
\color none
I-
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
290
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
not+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
134
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
he- be+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
89
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
I-
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
137
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
we-
\end_layout

\end_inset
</cell>
</row>
</lyxtabular>

\end_inset


\begin_inset VSpace defskip
\end_inset


\end_layout

\begin_layout Standard
The words 'city', 'house', 'village' and 'world' are all similar to one
 another.
 Their most frequently observed disjuncts are shown below.
\begin_inset VSpace defskip
\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Tabular
<lyxtabular version="3" rows="12" columns="8">
<features tabularvalignment="middle">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<row>
<cell multicolumn="1" alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
city
\end_layout

\end_inset
</cell>
<cell multicolumn="2" alignment="center" valignment="top" topline="true" bottomline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell multicolumn="1" alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
house
\end_layout

\end_inset
</cell>
<cell multicolumn="2" alignment="center" valignment="top" topline="true" bottomline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell multicolumn="1" alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
village
\end_layout

\end_inset
</cell>
<cell multicolumn="2" alignment="center" valignment="top" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell multicolumn="1" alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
world
\end_layout

\end_inset
</cell>
<cell multicolumn="2" alignment="center" valignment="top" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
</row>
<row>
<cell multicolumn="1" alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1332 total obs
\end_layout

\end_inset
</cell>
<cell multicolumn="2" alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell multicolumn="1" alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
3330 total obs
\end_layout

\end_inset
</cell>
<cell multicolumn="2" alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell multicolumn="1" alignment="center" valignment="top" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1019 total obs
\end_layout

\end_inset
</cell>
<cell multicolumn="2" alignment="center" valignment="top" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell multicolumn="1" alignment="center" valignment="top" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
3024 total obs
\end_layout

\end_inset
</cell>
<cell multicolumn="2" alignment="center" valignment="top" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
count
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
disjunct
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
count
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
disjunct
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
count
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
disjunct
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
count
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
disjunct
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
258
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
the-
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
598
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
the-
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
157
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
the-
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
646
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
the-
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
96
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
the- ,+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
275
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
the- ,+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
57
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
the- .+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
265
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
the- .+
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
94
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
the- .+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
226
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
the- .+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
57
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
the- ,+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
202
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
the- ,+
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
27
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
the- and+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
80
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
a-
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
23
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
street+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
158
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
in- the-
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
24
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
this-
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
79
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
into- the-
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
21
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
the- Mbonga+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
117
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
in- the- .+
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
17
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
the- of+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
58
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
the- and+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
14
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
a-
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
112
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
in- the- ,+
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
15
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
The- the- .+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
54
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
his-
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
12
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
their-
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
82
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
this-
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
14
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
the- the- .+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
48
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
my-
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
11
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
the- and+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
43
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
.+
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
13
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
a-
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
45
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
the- was+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
9
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
the- of+
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
39
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
the- is+
\end_layout

\end_inset
</cell>
</row>
</lyxtabular>

\end_inset


\begin_inset VSpace defskip
\end_inset


\end_layout

\begin_layout Standard
The above table clearly indicates why the cosine similarity between these
 four words is high.
 Yet, it is also disappointing: the primary reason is that they all take
 the determiner 'the', and occur at the end of phrases (the- ,+) or at the
 end of sentences (the- .+).
 This seems superficial, at best.
 After that, there's not so much similarity, and what there is still falls
 back onto the presence of determiners.
\end_layout

\begin_layout Standard
The similarity above is predicated on the frequent pairing of a word with
 a determiner.
 The pairing itself is correct, but perhaps not entirely significant: the
 word-pairs that needed to appear in the MST parse to extract these disjuncts
 had almost abysmally low MI values.
 They were clearly high enough to allow the MST parse to move forward, but
 are not otherwise terribly promising.
 They are shown below.
\begin_inset Separator latexpar
\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Tabular
<lyxtabular version="3" rows="9" columns="2">
<features tabularvalignment="middle">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
word-pair
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
fractional mi
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
the city
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2.7827
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
the house
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2.4228
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
the village
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2.6635
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
the world
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
3.0578
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
city .
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1.0167
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
house .
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1.0095
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
village .
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.8765
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
world .
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1.2255
\end_layout

\end_inset
</cell>
</row>
</lyxtabular>

\end_inset


\begin_inset CommandInset label
LatexCommand label
name "Table:prototype table"

\end_inset


\end_layout

\begin_layout Standard
Recall that, for word-pairs, MI scores below 4 are considered to be quite
 poor.
 That the scores above are poor is not surprising: determiners link to almost
 any noun, but not to verbs, and this is enough to get them into the 2-3
 range for MI.
 By contrast, the period can appear after almost any word, except maybe
 for prepositions, adjectives and adverbs.
 This is apparently enough to drive the MI positive, but that's all.
\end_layout
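
\begin_layout Standard
To put these numbers in perspective, here is a rough sketch, assuming (it
 is not restated in this section) that the word-pair MI is the base-2 logarithm
 of the ratio of the observed pair probability to the product of the marginal
 probabilities.
 Under that assumption, an MI of 
\begin_inset Formula $m$
\end_inset

 means that the pair is seen 
\begin_inset Formula $2^{m}$
\end_inset

 times more often than it would be if the two words were independent:
\begin_inset Formula 
\[
2^{2.42}\approx5.4,\qquad2^{3.06}\approx8.3,\qquad2^{0.88}\approx1.8,\qquad2^{1.23}\approx2.3,\qquad2^{4}=16
\]

\end_inset

Thus, the determiner pairs in the table occur roughly 5 to 8 times more
 often than chance, the period pairs only about twice as often, and the
 threshold of 4 corresponds to a factor of 16.
\end_layout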

\begin_layout Standard
The above suggests that the raw observation count is perhaps not the best
 basis for computing the cosine similarity.
 
\end_layout

\begin_layout Subsection
Score Cosine
\end_layout

\begin_layout Standard
The above results suggest that perhaps the correct way to assign a score
 is to total up the word-pair MI scores for each disjunct, and use that
 to form a cosine.
 That is, define
\begin_inset Formula 
\[
sc(w,d)=\sum_{c\in d_{+}}mi(w,c)+\sum_{c\in d_{-}}mi(c,w)
\]

\end_inset

where 
\begin_inset Formula $d_{+}\subseteq d$
\end_inset

 is the subset of connectors that connect to the right, and 
\begin_inset Formula $d_{-}\subseteq d$
\end_inset

 is the subset of connectors that connect to the left.
 Here, 
\begin_inset Formula $c\in d$
\end_inset

 is simply one of the words that the connector is connecting to.
 Then, 
\begin_inset Formula $mi(w,c)$
\end_inset

 and 
\begin_inset Formula $mi(c,w)$
\end_inset

 are the word-pair MI scores.
 The cosine similarity might then be defined as 
\begin_inset Formula 
\[
\mbox{sos}(w_{1},w_{2})=\frac{\sum_{d}sc(w_{1},d)sc(w_{2},d)}{\sqrt{\sum_{d}sc^{2}(w_{1},d)}\sqrt{\sum_{d}sc^{2}(w_{2},d)}}
\]

\end_inset


\end_layout
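
\begin_layout Standard
As a worked illustration of the score, consider the disjunct (the- .+), which
 has one left connector (the) and one right connector (the period).
 Plugging in the word-pair MI values from the table above gives the following
 scores; only the arithmetic is new here:
\begin_inset Formula 
\begin{align*}
sc(\mbox{city},\mbox{the- .+}) & =mi(\mbox{the},\mbox{city})+mi(\mbox{city},\mbox{.})=2.7827+1.0167=3.7994\\
sc(\mbox{house},\mbox{the- .+}) & =2.4228+1.0095=3.4323\\
sc(\mbox{village},\mbox{the- .+}) & =2.6635+0.8765=3.5400\\
sc(\mbox{world},\mbox{the- .+}) & =3.0578+1.2255=4.2833
\end{align*}

\end_inset

For the single-connector disjunct (the-), the score is just the determiner
 MI, for example 
\begin_inset Formula $sc(\mbox{city},\mbox{the-})=2.7827$
\end_inset

.
 The raw counts for these two disjuncts on 'city' are 258 and 94, so the
 scores are far more evenly spread than the counts.
\end_layout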

\begin_layout Standard
The idea here is that connectors to punctuation, and connectors to determiners,
 which might be observed quite often, but have a naturally low MI score,
 will contribute relatively little to the overall cosine similarity.
 
\end_layout

\begin_layout Standard
Why is this a reasonable thing to do? Well, the previous two tables already
 suggested that the four words 'city', 'house', 'village' and 'world' should
 be placed into the same grammatical class, simply because of the high frequency
 with which they connect to a determiner.
 More subtle differences carrying a semantic signal might be getting washed
 out in this process.
 Perhaps the sos cosine score would be more sensitive to those differences.
\end_layout

\begin_layout Section
Word Senses
\end_layout

\begin_layout Standard
Words, considered as text strings, often correspond to multiple senses.
 Conversely, many words have similar senses.
 One would like to induce what these senses and word categories are, based
 on the observed counts.
\end_layout

\begin_layout Subsection
Prototype Theory
\end_layout

\begin_layout Standard
Another way to interpret the table 
\begin_inset CommandInset ref
LatexCommand vpageref
reference "Table:prototype table"

\end_inset

 is that the highest MI entry defines the 
\begin_inset Quotes eld
\end_inset

prototype
\begin_inset Quotes erd
\end_inset

 for the class of similar words.
 That is, the cosine score suggests that (world, city, village, house) should
 be taken as a grammatical category.
 The highest-MI entry suggests that 
\begin_inset Quotes eld
\end_inset

world
\begin_inset Quotes erd
\end_inset

 should be taken as the prototype for this class.
 
\end_layout

\begin_layout Standard
Here, the intended sense for 
\begin_inset Quotes eld
\end_inset

prototype
\begin_inset Quotes erd
\end_inset

 is that of 
\begin_inset Quotes eld
\end_inset


\begin_inset CommandInset href
LatexCommand href
name "prototype theory"
target "https://en.wikipedia.org/wiki/Prototype_theory"

\end_inset


\begin_inset Quotes erd
\end_inset

 of cognitive semantics, which suggests that a good representative of a
 class provides the underlying 
\begin_inset Quotes eld
\end_inset

meaning
\begin_inset Quotes erd
\end_inset

 of an otherwise flat description of the extensional and intensional qualities
 of a set.
 To create a bridge to the statistical tools available here, one might say
 that an intensional or an extensional description of a class is a flat,
 unweighted collection of features or members.
 In fact, some features or members are more important than others, so that
 a 
\begin_inset Quotes eld
\end_inset

robin
\begin_inset Quotes erd
\end_inset

 is a better representative of 
\begin_inset Quotes eld
\end_inset

birds
\begin_inset Quotes erd
\end_inset

 than a 
\begin_inset Quotes eld
\end_inset

penguin
\begin_inset Quotes erd
\end_inset

 is.
 There are at least two ways of judging a 
\begin_inset Quotes eld
\end_inset

better representative
\begin_inset Quotes erd
\end_inset

: frequency, and MI.
 A frequency approach means that 
\begin_inset Quotes eld
\end_inset

robins
\begin_inset Quotes erd
\end_inset

 are more prototypical of birds simply because English speakers
 (Americans, Europeans) see robins more frequently than penguins, and so
 are acculturated to this prototype.
 The MI approach means that robins are more prototypical of birds because
 they have a high MI, when correlating intensional and extensional qualities.
 That is, a large value for 
\begin_inset Formula 
\[
mi(\mbox{robin})=\sum_{\mbox{feat}\in\mbox{bird-features}}mi(\mbox{robin},\mbox{feat})
\]

\end_inset

states that robins have many bird-like qualities.
 Likewise, feathers are more representative of birds than weight or color,
 suggesting a large value for 
\begin_inset Formula 
\[
mi(\mbox{feather})=\sum_{\mbox{bird}\in\mbox{birds}}mi(\mbox{bird},\mbox{feather})
\]

\end_inset

where the individual MI scores are computed over all (thing, feature) pairs,
 ranging over the class of all things and all features.
 
\end_layout

\begin_layout Standard
What determines a category itself? Wikipedia summarizes this aptly:
\end_layout

\begin_layout Itemize
maximize the number of attributes shared by members of the category, and
 
\end_layout

\begin_layout Itemize
minimize the number of attributes shared with other categories.
\end_layout

\begin_layout Standard
Taking 
\begin_inset Quotes eld
\end_inset

attribute
\begin_inset Quotes erd
\end_inset

 to be a disjunct, and 
\begin_inset Quotes eld
\end_inset

member
\begin_inset Quotes erd
\end_inset

 to be a word, this suggests that a grammatical category is a collection
 of words that share many disjuncts (large overlap, large cosine similarity),
 while minimizing the disjuncts shared with other words (large MI).
\end_layout

\begin_layout Standard
The latter point is perhaps the most important: MI scores provide a sense
 of 
\begin_inset Quotes eld
\end_inset

mutual exclusion
\begin_inset Quotes erd
\end_inset

: a word-disjunct pair has a high MI exactly when that disjunct is not shared
 with other words.
 This motivates a measure such as the 
\begin_inset Quotes eld
\end_inset

entropic similarity
\begin_inset Quotes erd
\end_inset

, given below.
\end_layout

\begin_layout Subsection
Entropic Similarity
\end_layout

\begin_layout Standard
The entropic similarity is similar to the cosine similarity, in that both
 start with the dot-product (inner product) between two vectors.
 As pointed out in the 
\begin_inset Quotes eld
\end_inset

Gradient Descent vs.
 Graphical Models
\begin_inset Quotes erd
\end_inset

 paper
\begin_inset CommandInset citation
LatexCommand cite
key "Vepstas2018skippy"

\end_inset

, the cosine similarity 
\begin_inset CommandInset ref
LatexCommand ref
reference "eq:cosine-sim"

\end_inset

 is inappropriate for probability spaces (because probability spaces are
 not orthogonal vector spaces).
 The entropic similarity seems to come closer to the desired form of similarity.
 Both definitions start with the dot-product of two vectors:
\begin_inset Formula 
\begin{equation}
\mbox{dot}\left(w_{1},w_{2}\right)=\sum_{d}p\left(w_{1},d\right)p\left(w_{2},d\right)\label{eq:dot-prod}
\end{equation}

\end_inset

with the disjunct 
\begin_inset Formula $d$
\end_inset

 playing the role of the basis elements.
 The cosine similarity of eqn 
\begin_inset CommandInset ref
LatexCommand ref
reference "eq:cosine-sim"

\end_inset

 can be written as 
\begin_inset Formula 
\[
\mbox{sim}\left(w_{1},w_{2}\right)=\frac{\mbox{dot}\left(w_{1},w_{2}\right)}{\sqrt{\mbox{dot}\left(w_{1},w_{1}\right)\mbox{dot}\left(w_{2},w_{2}\right)}}
\]

\end_inset

By comparison, the entropic similarity is defined as
\begin_inset Formula 
\[
\mbox{dmi}(w_{1},w_{2})=\log_{2}\frac{\mbox{dot}\left(w_{1},w_{2}\right)\mbox{dot}\left(*,*\right)}{\mbox{dot}\left(w_{1},*\right)\mbox{dot}\left(*,w_{2}\right)}
\]

\end_inset


\end_layout

\begin_layout Standard
This looks perhaps a bit awkward.
 Some insight is gained by rewriting this in terms of a joint probability.
 Let 
\begin_inset Formula $M=\left[p\left(w,d\right)\right]$
\end_inset

 be the matrix whose rows correspond to words, and columns correspond to
 disjuncts.
 The value of any given matrix entry is then the frequency 
\begin_inset Formula $p\left(w,d\right)$
\end_inset

.
 The matrix 
\begin_inset Formula $MM^{T}$
\end_inset

 then has matrix entries
\begin_inset Formula 
\[
\left[MM^{T}\right]\left(w_{1},w_{2}\right)=\sum_{d}p\left(w_{1},d\right)p\left(w_{2},d\right)=\mbox{dot}\left(w_{1},w_{2}\right)
\]

\end_inset

Here, the notation 
\begin_inset Formula $M^{T}$
\end_inset

 denotes the matrix-transpose.
 Note that 
\begin_inset Formula $MM^{T}$
\end_inset

 is a symmetric matrix: 
\begin_inset Formula 
\[
\left[MM^{T}\right]\left(w_{1},w_{2}\right)=\left[MM^{T}\right]\left(w_{2},w_{1}\right)
\]

\end_inset

Normalizing this matrix, one obtains a joint probability matrix 
\begin_inset Formula $Q$
\end_inset

:
\begin_inset Formula 
\begin{equation}
Q\left(w_{1},w_{2}\right)=\frac{\left[MM^{T}\right]\left(w_{1},w_{2}\right)}{\sum_{u,v}\left[MM^{T}\right]\left(u,v\right)}=\frac{\left[MM^{T}\right]\left(w_{1},w_{2}\right)}{\left[MM^{T}\right]\left(*,*\right)}=\frac{\mbox{dot}\left(w_{1},w_{2}\right)}{\mbox{dot}\left(*,*\right)}\label{eq:symmetric transpose}
\end{equation}

\end_inset

This really is a joint probability, in that 
\begin_inset Formula $Q\left(*,*\right)=1$
\end_inset

 (by definition) and 
\begin_inset Formula $Q\left(w\right)=Q\left(w,*\right)=Q\left(*,w\right)$
\end_inset

 is a unique marginal probability (by symmetry).
 Here, it makes sense that the left and right marginal probabilities are
 the same: both are words, and there is nothing making the relationship
 asymmetric.
\end_layout

\begin_layout Standard
The point-wise symmetric mutual information between two words, aka the entropic
 similarity, can then be defined as
\begin_inset Formula 
\[
\mbox{dmi}(w_{1},w_{2})=\log_{2}\frac{Q\left(w_{1},w_{2}\right)}{Q\left(w_{1},*\right)Q\left(*,w_{2}\right)}=\log_{2}\frac{Q\left(w_{1},w_{2}\right)}{Q\left(w_{1}\right)Q\left(w_{2}\right)}
\]

\end_inset

Written in this fashion, this is clearly a form of mutual information between
 two words; we choose to call it the 
\begin_inset Quotes eld
\end_inset

entropic similarity
\begin_inset Quotes erd
\end_inset

 here, to avoid confusion with other definitions of 
\begin_inset Quotes eld
\end_inset

mutual information
\begin_inset Quotes erd
\end_inset

 throughout this text.
 The total entropic MI is then
\begin_inset Formula 
\[
\mbox{DMI}=\sum_{w_{1},w_{2}}Q\left(w_{1},w_{2}\right)\log_{2}\frac{Q\left(w_{1},w_{2}\right)}{Q\left(w_{1}\right)Q\left(w_{2}\right)}
\]

\end_inset


\end_layout
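
\begin_layout Standard
As a small worked illustration, with numbers invented purely for the example
 rather than taken from any dataset, consider two words and two disjuncts
 with 
\begin_inset Formula 
\[
M=\left[\begin{array}{cc}
1/2 & 0\\
1/4 & 1/4
\end{array}\right]\qquad MM^{T}=\left[\begin{array}{cc}
1/4 & 1/8\\
1/8 & 1/8
\end{array}\right]
\]

\end_inset

so that 
\begin_inset Formula $\mbox{dot}\left(*,*\right)=5/8$
\end_inset

.
 Normalizing gives 
\begin_inset Formula $Q\left(w_{1},w_{2}\right)=1/5$
\end_inset

, with marginals 
\begin_inset Formula $Q\left(w_{1}\right)=3/5$
\end_inset

 and 
\begin_inset Formula $Q\left(w_{2}\right)=2/5$
\end_inset

, and therefore 
\begin_inset Formula 
\[
\mbox{dmi}(w_{1},w_{2})=\log_{2}\frac{1/5}{\left(3/5\right)\left(2/5\right)}=\log_{2}\frac{5}{6}\approx-0.26
\]

\end_inset

For comparison, the cosine for the same pair is 
\begin_inset Formula $\left(1/8\right)/\sqrt{\left(1/4\right)\left(1/8\right)}=1/\sqrt{2}\approx0.71$
\end_inset

.
 The two words share only one of the two disjuncts, and the entropic similarity
 comes out slightly negative, even though the cosine is fairly large.
\end_layout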

\begin_layout Subsection
Experimental Results
\end_layout

\begin_layout Standard
Both the cosine similarity 
\begin_inset CommandInset ref
LatexCommand ref
reference "eq:cosine-sim"

\end_inset

 and the entropic similarity are built upon the same dot-product 
\begin_inset CommandInset ref
LatexCommand ref
reference "eq:dot-prod"

\end_inset

.
 It is reasonable to think that the two should be correlated.
 This section explores that correlation.
\end_layout

\begin_layout Standard
Begin by redoing the graphs from section 
\begin_inset CommandInset ref
LatexCommand ref
reference "subsec:Disjunct-Cosine-Similarity"

\end_inset

.
 This uses the same dataset, 
\noun on
en_pairs_rfive_mtwo
\noun default
, but trimmed down to a more manageable size, 
\noun on
en_sh_two_sim
\noun default
.
\end_layout

\begin_layout Standard
The first graph shows the distribution of cosine similarities over all pairs
 of the 
\begin_inset Formula $N=640$
\end_inset

 most frequently observed words, for a total of 
\begin_inset Formula $N(N-1)/2=204480$
\end_inset

 pairs.
 This is effectively the same as the figures in section 4.1 but with two
 differences: there is no minimum-length cutoff, and the similarity scores
 have been extended down to zero (rather than a 0.1 cutoff, as before).
\end_layout

\begin_layout Standard
\align center
\begin_inset Graphics
	filename en-sims/binned-sims-dj-cosine.eps
	width 70col%

\end_inset


\end_layout

\begin_layout Standard
\align block
The fit of the linear section is eye-balled.
\begin_inset Foot
status collapsed

\begin_layout Plain Layout
See the `(mksims top-items)` and `sim-dj-cosine` in the `disjunct-stats.scm`
 file.
\end_layout

\end_inset

 It is not clear why the slope differs from the figures in section 4.1.
\end_layout

\begin_layout Standard
The corresponding figure for the entropic similarity is quite different:
 
\end_layout

\begin_layout Standard
\align center
\begin_inset Graphics
	filename en-sims/binned-sims-dj-mi.eps
	width 70col%

\end_inset


\end_layout

\begin_layout Standard
This shows the 
\begin_inset Formula $N=920$
\end_inset

 top words, for a total of 
\begin_inset Formula $N\left(N-1\right)/2=422740$
\end_inset

 pairs.
 The lone spike at the far left corresponds to an 
\begin_inset Formula $MI=-\infty$
\end_inset

, as appropriate for disjuncts with zero overlap (the cosine would be zero,
 too). For this dataset, this 
\begin_inset Formula $MI=-\infty$
\end_inset

 bin holds 14334 items.
 The smooth parabolic line is an eyeballed fit - a Gaussian, centered at
 an MI value of -3.5, and having a standard deviation of 3.
\end_layout

\begin_layout Standard
As both the entropic similarity and the cosine similarity are based on the
 same inner product, one expects them to be (highly?) correlated.
 The figures below show the correlation for the top 
\begin_inset Formula $N=640$
\end_inset

 words:
\end_layout

\begin_layout Standard
\align center
\begin_inset Graphics
	filename en-sims/scatter-dj-mi-cos-dots.eps
	width 49col%

\end_inset


\begin_inset Graphics
	filename en-sims/scatter-dj-mi-cos.eps
	width 49col%

\end_inset


\end_layout

\begin_layout Standard
Both graphs show the same data.
 The left graph displays it as a point cloud; the right graph sorts the
 data by increasing entropic similarity, and then draws a line connecting
 the points, in sequential order.
 There is a very clear and very linear relationship between the two, but
 it is quite broad and noisy.
 There does seem to be a bounding-box effect: for any given log cosine,
 the entropic similarity is guaranteed to lie in a fixed range, and vice-versa.
\end_layout
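
\begin_layout Standard
One way to see why a roughly linear relationship might be expected (offered
 here as a sketch, not as an explanation verified against the data) is to
 substitute the cosine into the entropic similarity.
 Writing 
\begin_inset Formula $\mbox{sim}\left(w_{1},w_{2}\right)$
\end_inset

 for the cosine similarity, one has 
\begin_inset Formula $\mbox{dot}\left(w_{1},w_{2}\right)=\mbox{sim}\left(w_{1},w_{2}\right)\sqrt{\mbox{dot}\left(w_{1},w_{1}\right)\mbox{dot}\left(w_{2},w_{2}\right)}$
\end_inset

.
 Taking the entropic similarity in the form 
\begin_inset Formula $\log_{2}\left[Q\left(w_{1},w_{2}\right)/Q\left(w_{1}\right)Q\left(w_{2}\right)\right]$
\end_inset

, this gives 
\begin_inset Formula 
\[
\mbox{dmi}(w_{1},w_{2})=\log_{2}\mbox{sim}\left(w_{1},w_{2}\right)+\log_{2}\frac{\sqrt{\mbox{dot}\left(w_{1},w_{1}\right)\mbox{dot}\left(w_{2},w_{2}\right)}\,\mbox{dot}\left(*,*\right)}{\mbox{dot}\left(w_{1},*\right)\mbox{dot}\left(*,w_{2}\right)}
\]

\end_inset

whenever 
\begin_inset Formula $\mbox{dot}\left(w_{1},w_{2}\right)>0$
\end_inset

.
 The first term has unit slope in the log cosine.
 The second term is built only from quantities belonging to each word separately,
 and so acts as a pair-dependent offset; variation in that offset from pair
 to pair would broaden the relationship.
\end_layout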

\begin_layout Standard
The distribution for the entropic similarity follows a square-root Zipfian
 distribution: 
\end_layout

\begin_layout Standard
\align center
\begin_inset Graphics
	filename en-sims/ranked-dj-mi.eps
	width 49col%

\end_inset


\end_layout

\begin_layout Standard
As always, the fit is eyeballed, but it appears to be going as 
\begin_inset Formula $ES=const.-\log_{2}\sqrt{rank}$
\end_inset

 (with the const presumably dependent on the size of the dataset).
 The sharp falloff is presumably an artifact of the limited dataset size
 (it has a total of 
\begin_inset Formula $422740$
\end_inset

 pairs for 
\begin_inset Formula $N=920$
\end_inset

 words, same as before).
\end_layout

\begin_layout Standard
The distribution of the cosine similarity is not as well-behaved.
 The figure on the left shows the log of the cosine, so as to make it more
 directly comparable to the entropic similarity, which is effectively logarithmic.
 The figure on the right shows the same data, but with a different vertical
 scale.
\end_layout

\begin_layout Standard
\align center
\begin_inset Graphics
	filename en-sims/ranked-dj-log-cos.eps
	width 49col%

\end_inset


\begin_inset Graphics
	filename en-sims/ranked-dj-cos.eps
	width 49col%

\end_inset


\end_layout

\begin_layout Standard
A linear region, if any, doesn't even extend to the first 1000 ranked similariti
es.
 The figure on the right suggests a possibly parabolic fit, but that does
 not work out very well, either.
 (The cusp in the left figure is where the parabola crosses zero.) The right-hand
 figure is the same sort of graph as given in section 
\begin_inset CommandInset ref
LatexCommand ref
reference "subsec:Disjunct-Cosine-Similarity"

\end_inset

.
\end_layout

\begin_layout Standard
No data cutoffs were used to generate these figures.
 It is already clear that the vectors contain disjuncts with very low observatio
n counts, and that these low counts are often the result of low-quality
 MST parses.
 A leading cause of low-quality MST parses is infrequently observed words.
 A given word might occur in the corpus only a handful of times, which
 is not enough to establish its syntactic behavior.
 Thus, disjuncts containing this word are likely to be incorrect, behaving
 no better than chance.
 It is not clear how much this affects judgements of word-similarity.
\end_layout

\begin_layout Standard
All very interesting, I think.
\end_layout

\begin_layout Section
\start_of_appendix
Other Similarity Scores
\end_layout

\begin_layout Standard
There are a number of different ways of judging similarity between a pair
 of words.
 It seems that these are limited only by the imagination.
 A variety of different similarity measures are given below.
 Each of these can be justified, to some degree, by hand-waving arguments
 and appeals to taste and conceptions of mathematical beauty.
 How to properly anchor these measures more firmly in a consistent framework
 is unclear.
\end_layout

\begin_layout Subsection
Quality Cosine
\end_layout

\begin_layout Standard
Some disjuncts are more important than others.
 This is indicated by the word-disjunct MI, and gives a different perspective
 than the word-disjunct frequency.
 This suggests that either a cosine score, or an MI-weighted overlap similarity
 may capture the grammatical similarity of words better.
\end_layout

\begin_layout Standard
The motivation for using cosine similarity is to find pairs of words that
 act in a grammatically similar fashion: they are used in the same way,
 with the same kinds of disjuncts.
 However, observational counts are subject to the vicissitudes of the input
 text: perhaps, the 
\begin_inset Quotes eld
\end_inset

quality
\begin_inset Quotes erd
\end_inset

 of the disjunct itself is better judged by its MI.
 A disjunct could be judged as being 
\begin_inset Quotes eld
\end_inset

high quality
\begin_inset Quotes erd
\end_inset

 if 
\begin_inset Formula $mi(w,d)=MI_{pair}(w,d)$
\end_inset

 as defined in equation 
\begin_inset CommandInset ref
LatexCommand vref
reference "eq:mi-word-disjunct"

\end_inset

 is high.
 This motivates a 
\begin_inset Quotes eld
\end_inset

quality cosine
\begin_inset Quotes erd
\end_inset

: 
\begin_inset Formula 
\[
\mbox{qim}(w_{1},w_{2})=\frac{\sum_{d}mi(w_{1},d)mi(w_{2},d)}{\sqrt{\sum_{d}mi^{2}(w_{1},d)}\sqrt{\sum_{d}mi^{2}(w_{2},d)}}
\]

\end_inset

But what if the high quality is based on a pathetically small number of
 observations? Whenever one has a small number of observations, one has,
 almost automatically, a high MI value, simply because these two things
 are seen together, and nothing else is.
 Perhaps there should be some observational weighting.
 One can contemplate defining 
\begin_inset Formula $pi(w,d)=p(w,d)mi(w,d)$
\end_inset

 and so the cosine 
\begin_inset Formula 
\[
\mbox{pim}(w_{1},w_{2})=\frac{\sum_{d}pi(w_{1},d)pi(w_{2},d)}{\sqrt{\sum_{d}pi^{2}(w_{1},d)}\sqrt{\sum_{d}pi^{2}(w_{2},d)}}
\]

\end_inset

 The suitability of these different means of judging similarity is not clear.
\end_layout
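
\begin_layout Standard
To make the concern about small observation counts concrete, suppose (as
 an assumption made for this illustration) that the MI has the usual pointwise
 form 
\begin_inset Formula $mi(w,d)=\log_{2}\left[p(w,d)/p(w,*)p(*,d)\right]$
\end_inset

, and that a word and a disjunct are each observed exactly once, together,
 so that 
\begin_inset Formula $N(w,d)=N(w,*)=N(*,d)=1$
\end_inset

.
 Then all three probabilities are equal to 
\begin_inset Formula $1/N(*,*)$
\end_inset

, and 
\begin_inset Formula 
\[
mi(w,d)=\log_{2}\frac{1/N(*,*)}{\left(1/N(*,*)\right)\left(1/N(*,*)\right)}=\log_{2}N(*,*)
\]

\end_inset

which is about 20 for a hypothetical 
\begin_inset Formula $N(*,*)=10^{6}$
\end_inset

, even though it rests on a single observation.
 The weighted value 
\begin_inset Formula $pi(w,d)=\log_{2}N(*,*)/N(*,*)$
\end_inset

 is, by contrast, very small.
\end_layout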

\begin_layout Subsection
Overlap Similarity
\end_layout

\begin_layout Standard
Cosine similarity is perhaps too strict: it judges that two words are similar
 when they share not only the same collection of disjuncts, but have these
 occur with the same frequencies.
 Perhaps it's enough to say that two words are grammatically similar, if
 they simply share the same set of disjuncts.
 This suggests ignoring the relative counts, or at least, sharply filtering
 them.
 Thus, the overlap similarity can be defined as 
\begin_inset Formula 
\[
\mbox{ovl}(w_{1},w_{2})=\frac{\sum_{d}\sigma(N(w_{1},d))\sigma(N(w_{2},d))}{\sum_{d}\sigma(N(w_{1},d)+N(w_{2},d))}
\]

\end_inset

where 
\begin_inset Formula $\sigma(x)$
\end_inset

 is some sigmoid function.
 In the most basic case, it's a step function: 
\begin_inset Formula $\sigma(x)=1$
\end_inset

 if 
\begin_inset Formula $x>C$
\end_inset

 for some constant 
\begin_inset Formula $C$
\end_inset

 and zero otherwise.
 Each term in the numerator sum is one only when the disjunct is present
 on both words.
 Each term in the denominator is one when the disjunct is present on either
 word.
 
\end_layout

\begin_layout Standard
This similarity measure essentially boils down to a normalized cylinder-set
 measure.
 That is, instead of interpreting the disjuncts as a basis for a vector
 space, they can be taken as independent observations in a Cartesian product
 space.
 This makes more sense than pretending that these form a vector space.
 Why is that? What makes the cosine angle, and vector dot products so appealing,
 is that they are preserved by rotations; yet there really is no reason
 to expect or desire rotational symmetry.
 In short, the idea that we are dealing with vector spaces gives the wrong
 idea of what's going on: it's more appropriate to view the set of disjuncts
 as elements of a product space.
 Product spaces have a natural measure: the cylinder-set measure.
 The overlap similarity is essentially 
\begin_inset Formula $\mu(A\cap B)/\mu(A\cup B)$
\end_inset

 with 
\begin_inset Formula $\mu$
\end_inset

 the Borel measure, and the intersection/union being taken over the associated
 disjunct sets.
\end_layout

\begin_layout Standard
Continuing in this vein, what should be taken as the measure 
\begin_inset Formula $\mu(A)$
\end_inset

 of 
\begin_inset Formula $A$
\end_inset

? The counting measure is a natural choice; that is, 
\begin_inset Formula $\mu(A)=\left|A\right|$
\end_inset

 being the number of elements in set 
\begin_inset Formula $A$
\end_inset

.
 Because we also have a count associated with the members of the set, we
 can consider using the 
\begin_inset Formula $l_{p}$
\end_inset

 norms for the measure.
 That is, 
\begin_inset Formula $\left|A\right|$
\end_inset

 is just the 
\begin_inset Formula $l_{0}$
\end_inset

 norm; perhaps 
\begin_inset Formula $\mu(A)=\left\Vert A\right\Vert _{p}$
\end_inset

 could also work.
 For the purposes here, this may be enough; note, however, that only 
\begin_inset Formula $l_{0}$
\end_inset

 and 
\begin_inset Formula $l_{1}$
\end_inset

 satisfy one of the axioms of measure theory: namely, that when 
\begin_inset Formula $A\cap B=\varnothing$
\end_inset

 then 
\begin_inset Formula $\mu(A\cup B)=\mu(A)+\mu(B)$
\end_inset

.
 The other 
\begin_inset Formula $l_{p}$
\end_inset

 norms do not satisfy this.
\end_layout

\begin_layout Standard
The (opencog matrix) module currently implements an overlap similarity which
 computes the 
\begin_inset Formula $l_{0}$
\end_inset

-norm based similarity: 
\begin_inset Formula 
\[
\mbox{ovl}_{0}(w_{1},w_{2})=\frac{\left|\left\{ d\mbox{ s.t. }0<N\left(w_{1},d\right)\right\} \cap\left\{ d\mbox{ s.t. }0<N\left(w_{2},d\right)\right\} \right|}{\left|\left\{ d\mbox{ s.t. }0<N\left(w_{1},d\right)\right\} \cup\left\{ d\mbox{ s.t. }0<N\left(w_{2},d\right)\right\} \right|}
\]

\end_inset

where 
\begin_inset Formula $\left\{ d\mbox{ s.t. }0<N\left(w,d\right)\right\} $
\end_inset

 is the set of disjuncts 
\begin_inset Formula $d$
\end_inset

 such that the cset 
\begin_inset Formula $(w,d)$
\end_inset

 was observed at least once.
 As always, 
\begin_inset Formula $\left|\left\{ x\right\} \right|$
\end_inset

 is the number of elements in the set 
\begin_inset Formula $\left\{ x\right\} $
\end_inset

.
\end_layout
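
\begin_layout Standard
As a hypothetical illustration, with counts made up for the purpose, let
 five disjuncts 
\begin_inset Formula $d_{1},\cdots,d_{5}$
\end_inset

 be observed with the counts 
\begin_inset Formula 
\[
N\left(w_{1},d_{k}\right)=\left(3,1,2,0,0\right)\qquad N\left(w_{2},d_{k}\right)=\left(0,4,1,1,2\right)
\]

\end_inset

The disjuncts seen on 
\begin_inset Formula $w_{1}$
\end_inset

 are 
\begin_inset Formula $\left\{ d_{1},d_{2},d_{3}\right\} $
\end_inset

 and those seen on 
\begin_inset Formula $w_{2}$
\end_inset

 are 
\begin_inset Formula $\left\{ d_{2},d_{3},d_{4},d_{5}\right\} $
\end_inset

.
 The intersection has two elements and the union has five, so that 
\begin_inset Formula 
\[
\mbox{ovl}_{0}(w_{1},w_{2})=\frac{\left|\left\{ d_{2},d_{3}\right\} \right|}{\left|\left\{ d_{1},d_{2},d_{3},d_{4},d_{5}\right\} \right|}=\frac{2}{5}
\]

\end_inset

The counts themselves play no role here, beyond being non-zero.
\end_layout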

\begin_layout Standard
Some experimentation was done with this similarity measure, but the results
 were not particularly impressive.
 What does become quickly apparent is that the most frequently observed
 words will necessarily have a low similarity to the low-frequency words.
 This is because the high-frequency words will have a large number of disjuncts
 observed with them, thus causing the denominator to grow large.
 The numerator, by contrast, stays small, essentially limited by the number
 of disjuncts observed on the low-frequency word.
 This behavior is not what we want.
 Intuition suggests that we really do want to be able to compare words,
 independent of how frequently they occur.
\end_layout

\begin_layout Standard
This suggests a modified form of overlap similarity, by normalizing to a
 common count.
 That is, one should work with 
\begin_inset Formula $N(w,d)/N(w,*)$
\end_inset

, which can be understood as the conditional probability of observing a disjunct
 
\begin_inset Formula $d$
\end_inset

 on some word 
\begin_inset Formula $w$
\end_inset

.
 So, arrange the words such that 
\begin_inset Formula $N\left(w_{1},*\right)>N\left(w_{2},*\right)$
\end_inset

 and define 
\begin_inset Formula $K\left(w_{1},d\right)=N\left(w_{1},d\right)N\left(w_{2},*\right)/N\left(w_{1},*\right)$
\end_inset

 – this makes the counts on 
\begin_inset Formula $w_{1}$
\end_inset

 directly comparable to those on 
\begin_inset Formula $w_{2}$
\end_inset

, giving
\end_layout

\begin_layout Standard
\begin_inset Formula 
\[
\mbox{rovl}(w_{1},w_{2})=\frac{\left|\left\{ d\mbox{ s.t. }1\le K\left(w_{1},d\right)\right\} \cap\left\{ d\mbox{ s.t. }1\le N\left(w_{2},d\right)\right\} \right|}{\left|\left\{ d\mbox{ s.t. }1\le K\left(w_{1},d\right)\right\} \cup\left\{ d\mbox{ s.t. }1\le N\left(w_{2},d\right)\right\} \right|}
\]

\end_inset

That is, the condition 
\begin_inset Formula $1\le N\left(w_{2},d\right)$
\end_inset

 simply tests for the presence or absence of disjunct 
\begin_inset Formula $d$
\end_inset

 on 
\begin_inset Formula $w_{2}$
\end_inset

, while 
\begin_inset Formula $K\left(w_{1},d\right)$
\end_inset

 gives the count that would be expected for disjunct 
\begin_inset Formula $d$
\end_inset

 on 
\begin_inset Formula $w_{1}$
\end_inset

, if 
\begin_inset Formula $w_{1}$
\end_inset

 had been observed as often as 
\begin_inset Formula $w_{2}$
\end_inset

.
\end_layout
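
\begin_layout Standard
For example, with made-up totals 
\begin_inset Formula $N\left(w_{1},*\right)=1000$
\end_inset

 and 
\begin_inset Formula $N\left(w_{2},*\right)=10$
\end_inset

, the rescaling is 
\begin_inset Formula 
\[
K\left(w_{1},d\right)=N\left(w_{1},d\right)\frac{N\left(w_{2},*\right)}{N\left(w_{1},*\right)}=\frac{N\left(w_{1},d\right)}{100}
\]

\end_inset

A disjunct with 
\begin_inset Formula $N\left(w_{1},d\right)=50$
\end_inset

 then has 
\begin_inset Formula $K\left(w_{1},d\right)=0.5$
\end_inset

 and is dropped from the set for 
\begin_inset Formula $w_{1}$
\end_inset

, while a disjunct with 
\begin_inset Formula $N\left(w_{1},d\right)=200$
\end_inset

 has 
\begin_inset Formula $K\left(w_{1},d\right)=2$
\end_inset

 and is kept.
 Only the disjuncts that are common on the high-frequency word survive to
 be compared against the low-frequency word.
\end_layout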

\begin_layout Standard
The above considerations suggest a more appropriate definition for overlap
 similarity that allows for frequency-independent observations: namely
\end_layout

\begin_layout Standard
\begin_inset Formula 
\[
\mbox{ovl}(w_{1},w_{2})=\frac{\sum_{d}\sigma(f(w_{1},d))\sigma(f(w_{2},d))}{\sum_{d}\sigma(f(w_{1},d)+f(w_{2},d))}
\]

\end_inset

with 
\begin_inset Formula $f(w,d)=N(w,d)/N(w,*)$
\end_inset

.
 In this form, the numerator now starts to resemble the numerator of the
 cosine similarity measure: both the cosine numerator, and this numerator
 are computing a kind-of overlap or intersection of sets of disjuncts.
 The denominators differ.
 Based on gut intuition, the above measure may be more suitable for judging
 similarity than the cosine measure, precisely because it emphasizes the
 set-like qualities of connector sets, as opposed to vector-like qualities.
 
\end_layout

\begin_layout Subsection
Disjunct Subsets
\end_layout

\begin_layout Standard
The above line of thinking suggests that another interesting comparison
 can be made by looking for a subset relationship between the disjuncts
 on different words.
 For example, transitive verbs should have all the connectors that intransitive
 verbs do, plus some more.
\end_layout

\begin_layout Standard
The overlap similarity can be adapted for this purpose: two sets obey a
 subset relation 
\begin_inset Formula $A\subset B$
\end_inset

 if 
\begin_inset Formula $A\cap(1-B)=\varnothing$
\end_inset

.
\end_layout

\begin_layout Standard
This re-affirms the previous observation: we expect connector-sets to behave
 in a set-like fashion, not a vector-like fashion.
\end_layout

\begin_layout Subsection
Quality Overlaps
\end_layout

\begin_layout Standard
Overlap alone may be too weak: again, some of the observed disjuncts may
 just be junk, or may correlate poorly with the word.
 This suggests several different approaches.
 One would be to trim the dataset to discard low-MI word-disjunct pairs,
 before computing overlap.
 Another would be to weight the overlap with the MI (discarding negative
 MI values).
\end_layout

\begin_layout Subsection
Connector Similarity
\end_layout

\begin_layout Standard
There is another, dual kind of similarity that is very different from the
 above proposals.
 A single disjunct is an ordered list of (pseudo-)connectors: 
\begin_inset Formula 
\[
d=\left(c_{1},c_{2},\cdots,c_{k}\right)
\]

\end_inset

where each pseudo-connector is a word, and a direction indicator.
 We can judge two connectors to be similar, and thus, two words to be similar,
 if they appear in a large number of disjuncts that would be identical,
 but for that one connector.
\end_layout

\begin_layout Standard
There are some practical difficulties of writing code to discern this.
\end_layout

\begin_layout Subsection
Word-Pair Cosine Similarity
\end_layout

\begin_layout Standard
To correctly gauge the advantage of disjunct-based techniques, one should
 compare them to the same kinds of measures, but applied to word-pairs,
 instead of word-disjunct pairs.
 For example, the disjunct-based cosine similarity, can be contrasted against
 the simpler word-pair-based cosine similarity.
 This is given by 
\begin_inset Formula 
\[
\mbox{sim}_{pair}(w_{1},w_{2})=\frac{\sum_{w}N_{pair}(w_{1},w)N_{pair}(w,w_{2})}{\mbox{len}(w_{1})\mbox{len}(w_{2})}
\]

\end_inset

where 
\begin_inset Formula $\mbox{len}(w)$
\end_inset

 is the root-mean-square length (Euclidean length) of the pair vector:
\begin_inset Formula 
\[
\mbox{len}(w)=\sqrt{\sum_{v}N_{pair}(w,v)N_{pair}(v,w)}
\]

\end_inset

and 
\begin_inset Formula $N_{pair}(w,v)$
\end_inset

 is the count of having observed the ordered word-pair 
\begin_inset Formula $(w,v)$
\end_inset

.
 Equivalently, writing the normalized frequency of observing a word pair
 as 
\begin_inset Formula $p(w,v)=N_{pair}(w,v)/N_{pair}(*,*)$
\end_inset

, this similarity can be written in the form 
\begin_inset Formula 
\[
\mbox{sim}_{pair}(w_{1},w_{2})=\frac{\sum_{w}p(w_{1},w)p(w,w_{2})}{\sqrt{\sum_{v}p(w_{1},v)p(v,w_{1})}\sqrt{\sum_{v}p(w_{2},v)p(v,w_{2})}}
\]

\end_inset

Note that this similarity measure is NOT symmetric: 
\begin_inset Formula $\mbox{sim}(w_{1},w_{2})\ne\mbox{sim}(w_{2},w_{1})$
\end_inset

.
 This is because it's built out of a manifestly non-symmetric count: 
\begin_inset Formula $p(w,v)\ne p(v,w)$
\end_inset

 and should really be written as 
\begin_inset Formula $p(w,v)=p(R;w,v)$
\end_inset

 with the relation 
\begin_inset Formula $R$
\end_inset

 encompassing all of the constraints of pair-wise word relationships (including,
 for example, that the pair might have been extracted from a random planar
 tree parse).
 Of course, one could construct a symmetrized similarity measure.
\end_layout

\begin_layout Standard
The point here is that this measure treats words as similar when they link,
 pair-wise, to the same kinds of words, with the same kinds of frequencies.
 This is not unlike the similarity that the disjunct-cosine is measuring,
 except that the disjunct carries additional grammatical information with
 it: it captures more complex relationships between the words in a sentence.
\end_layout

\begin_layout Subsection
Symmetric Cosine Information
\end_layout

\begin_layout Standard
The cosine similarity was defined as 
\begin_inset Formula 
\[
\mbox{sim}(w_{1},w_{2})=\frac{\sum_{d}N(w_{1},d)N(w_{2},d)}{\sqrt{\sum_{d}N^{2}(w_{1},d)}\sqrt{\sum_{d}N^{2}(w_{2},d)}}
\]

\end_inset

which, after dividing by 
\begin_inset Formula $N(*,*)$
\end_inset

 so that 
\begin_inset Formula $p(w,d)=N(w,d)/N(*,*)$
\end_inset

, gives the equivalent expression 
\begin_inset Formula 
\[
\mbox{sim}(w_{1},w_{2})=\frac{\sum_{d}p(w_{1},d)p(w_{2},d)}{\sqrt{\sum_{d}p^{2}(w_{1},d)}\sqrt{\sum_{d}p^{2}(w_{2},d)}}
\]

\end_inset

Comparing this to the expression for mutual information suggests that using
 the vector support, instead of the vector length, could be interesting.
 In particular, the following might be interesting:
\begin_inset Formula 
\[
\mbox{com}(w_{1},w_{2})=-\log_{2}\frac{\sum_{d}p(w_{1},d)p(w_{2},d)}{p(w_{1},*)p(w_{2},*)}
\]

\end_inset

Written this way, it resembles a kind of mutual information.
\end_layout

\begin_layout Subsection
Symmetric Mutual Information (Entropic Similarity)
\end_layout

\begin_layout Standard
The symmetric mutual information, and related quantities, can be considered
 by working with the symmetric matrix 
\begin_inset Formula $Q$
\end_inset

 defined in eqn 
\begin_inset CommandInset ref
LatexCommand ref
reference "eq:symmetric transpose"

\end_inset

.
 This can be more closely related to the cosine by noting that it shares
 the vector dot product as a basic building block.
 That is, let 
\begin_inset Formula 
\[
\mbox{dot}(w_{1},w_{2})=\sum_{d}p(w_{1},d)p(w_{2},d)
\]

\end_inset

define a process between two 
\begin_inset Quotes eld
\end_inset

independent
\begin_inset Quotes erd
\end_inset

 processes 
\begin_inset Formula $w_{1}$
\end_inset

 and 
\begin_inset Formula $w_{2}$
\end_inset

.
 Looked at this way, the canonical mutual information would then be 
\begin_inset Formula 
\[
\mbox{dmi}(w_{1},w_{2})=\log_{2}\frac{\mbox{dot}(w_{1},w_{2})\mbox{dot}(*,*)}{\mbox{dot}(w_{1},*)\mbox{dot}(*,w_{2})}
\]

\end_inset

One can normalize the process using the cosine similarity; the resulting
 process is different, and thus has a different value for the mutual information
:
\begin_inset Formula 
\[
\mbox{cmi}(w_{1},w_{2})=-\log_{2}\frac{\mbox{sim}(w_{1},w_{2})\mbox{sim}(*,*)}{\mbox{sim}(w_{1},*)\mbox{sim}(*,w_{2})}
\]

\end_inset

The 
\begin_inset Formula $\mbox{sim}(*,*)$
\end_inset

 serves to normalize the entire calculation, so that one is effectively
 computing with 
\begin_inset Formula $\mbox{norm-sim}(w_{1},w_{2})=\mbox{sim}(w_{1},w_{2})/\mbox{sim}(*,*)$
\end_inset

.
 Some experiments using cmi were performed, and it quickly became apparent
 that cmi does 
\begin_inset Quotes eld
\end_inset

the wrong thing
\begin_inset Quotes erd
\end_inset

: it singles out pairs that have a lot to do with each other, but little
 to do with anything else.
 This discriminates 
\emph on
against
\emph default
 the clusters of similar words, which is not what we want.
 We don't want strange, unusual pairings; we want common, likely pairings.
 The level of discrimination can be severe: some really, really bad pairings
 show up, but only because they are unlike anything else.
 This score is strong when sim is weak, and is essentially picking up the
 tail-end of the sim distribution.
\end_layout

\begin_layout Subsection
Mutual Length
\end_layout

\begin_layout Standard
One can be inspired to write some crazy concoctions: 
\begin_inset Formula 
\[
\mbox{mim}(w,d)=-\log_{2}\frac{p(w,d)}{\sum_{w,d}p^{2}(w,d)}
\]

\end_inset

I don't know what to call these; the first seems to be some kind of 
\begin_inset Quotes eld
\end_inset

cosine information
\begin_inset Quotes erd
\end_inset

, the second, some sort of 
\begin_inset Quotes eld
\end_inset

mutual length
\begin_inset Quotes erd
\end_inset

 device.
\end_layout

\begin_layout Section
Cosine Similarity scatterplots
\end_layout

\begin_layout Standard
The scatterplot in 
\begin_inset CommandInset ref
LatexCommand ref
reference "fig:Cosine-similarity-scatterplot"

\end_inset

 visualizes the cosine similarity between all pairs of 797 words.
\begin_inset Foot
status collapsed

\begin_layout Plain Layout
Generated with 'scatter.scm'
\end_layout

\end_inset

 The rows and columns are ranked by frequency of word occurrence, so that
 the single-most-frequently occurring word is at the upper-left, with less
 and less frequently occurring words proceeding to the right, and downwards.
 The color scale is such that red represents 1.0, yellow represents 0.75,
 green is exactly 0.5, and blue is 0.25 or less, fading to black.
 This is for the same 
\noun on
en_pairs_rfive_mtwo
\noun default
 dataset as above, and examined in detail in section 
\begin_inset CommandInset ref
LatexCommand ref
reference "subsec:Disjunct-Cosine-Similarity"

\end_inset


\begin_inset CommandInset ref
LatexCommand vpageref
reference "subsec:Disjunct-Cosine-Similarity"

\end_inset

.
 That is, it computes the similarity between the 797 words whose length
 is greater than 128.
 Note that cosine similarity is symmetric; this figure is necessarily symmetric
 about the diagonal.
 The diagonal can be seen as a red line, representing a similarity of 1.0.
\begin_inset Separator latexpar
\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Float figure
placement h
wide false
sideways false
status open

\begin_layout Plain Layout
\align center
\begin_inset Caption Standard

\begin_layout Plain Layout
Cosine similarity scatterplot
\begin_inset CommandInset label
LatexCommand label
name "fig:Cosine-similarity-scatterplot"

\end_inset


\end_layout

\end_inset


\begin_inset Graphics
	filename scat/scat-cosine-big.png
	width 100col%

\end_inset


\end_layout

\end_inset


\end_layout
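
\begin_layout Standard
For reference, the symmetry and the unit diagonal both follow directly from
 the definition of the cosine.
 As a sketch, write 
\begin_inset Formula $N\left(w,d\right)$
\end_inset

 for the observation count of disjunct 
\begin_inset Formula $d$
\end_inset

 on word 
\begin_inset Formula $w$
\end_inset

; this notation is assumed here for illustration, and may differ from that
 used in earlier sections.
 Then 
\begin_inset Formula 
\[
\cos\left(w,v\right)=\frac{\sum_{d}N\left(w,d\right)N\left(v,d\right)}{\sqrt{\sum_{d}N\left(w,d\right)^{2}}\,\sqrt{\sum_{d}N\left(v,d\right)^{2}}}
\]

\end_inset

 Exchanging 
\begin_inset Formula $w$
\end_inset

 and 
\begin_inset Formula $v$
\end_inset

 leaves the right-hand side unchanged, and setting 
\begin_inset Formula $v=w$
\end_inset

 gives exactly 1.
 Since counts are non-negative, the value always lies between 0 and 1, which
 is the range covered by the color scale.
 Rescaling all of the counts on a word by a common positive factor does
 not change the result, so it does not matter whether counts or frequencies
 are used.
\end_layout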

\begin_layout Standard
The tartan pattern indicates that some words are very unlike others, and
 that most words are very unlike one another.
 
\end_layout

\begin_layout Standard
It would be nice to rearrange (permute) the rows and columns to bring the
 matrix into quasi-diagonal form.
 This is easier said than done.
 Figure 
\begin_inset CommandInset ref
LatexCommand ref
reference "fig:Quasi-diagonal-cosine-similarity"

\end_inset

 shows the same data, with a different ranking.
 This time, the first word is the LEFT-WALL, and the next word is the one
 word, out of the remaining 796, that has the highest cosine similarity
 to the LEFT-WALL.
 Next comes the word, out of the remaining 795, that has the highest cosine
 similarity to the word just placed.
\begin_inset Foot
status collapsed

\begin_layout Plain Layout
Computed with the `dranked` function, and the `dranked-long` list graphed,
 from the `disjunct-stats.scm` file.
\end_layout

\end_inset

 This continues on down the list, so that words with the highest similarity
 are next to each other in the list.
 Effectively, we get the two highest similarities for each word: the highest
 being to the word right before, and the second-highest to the word right
 after.
 The list is organized so that the highest similarity pair occurs in the
 upper-left.
 The words least similar to anything else end up on the lower-right.
 The color scheme is as before.
\begin_inset Separator latexpar
\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Float figure
wide false
sideways false
status open

\begin_layout Plain Layout
\align center
\begin_inset Caption Standard

\begin_layout Plain Layout
Quasi-diagonal cosine similarity
\begin_inset CommandInset label
LatexCommand label
name "fig:Quasi-diagonal-cosine-similarity"

\end_inset


\end_layout

\end_inset


\begin_inset Graphics
	filename scat/scat-fcos.png
	width 100col%

\end_inset


\end_layout

\end_inset


\end_layout
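
\begin_layout Standard
The ordering used for this figure is a greedy chain.
 As a sketch, let 
\begin_inset Formula $w_{k}$
\end_inset

 denote the word in position 
\begin_inset Formula $k$
\end_inset

 and let 
\begin_inset Formula $W$
\end_inset

 be the set of all 797 words; these symbols are introduced here for illustration
 only.
 Then 
\begin_inset Formula 
\[
w_{1}=\text{LEFT-WALL},\qquad w_{k+1}=\underset{v\in W\setminus\left\{ w_{1},\ldots,w_{k}\right\} }{\arg\max}\;\cos\left(w_{k},v\right)
\]

\end_inset

 Each step looks only at the most recently placed word, so the chain is
 not guaranteed to find the best possible block-diagonal arrangement.
 A sharp drop in 
\begin_inset Formula $\cos\left(w_{k},w_{k+1}\right)$
\end_inset

 from one step to the next marks a likely boundary between two blocks.
\end_layout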

\begin_layout Standard
This ranking exposes block-diagonal structure in the dataset.
 The block in the upper-left consists entirely of punctuation and various
 capitalized words.
 The middle portions are quite interesting.
 The full list of 797 words is shown in table 
\begin_inset CommandInset ref
LatexCommand vpageref
reference "tab:Block-ranked-words"

\end_inset

.
 It's sort of interesting to read.
 The word that is most similar to the LEFT-WALL is the double-dash; this
 is perhaps not too surprising, as the double dash is often used to start
 new sentence phrases.
 That various forms of punctuation follow is also not surprising.
 
\end_layout

\begin_layout Standard
After this are various capitalized words.
 As noted previously, capitalized words are observed far less frequently
 than their uncapitalized counterparts: usually there is only one capitalized
 word per sentence! Capitalized words therefore have far fewer disjuncts
 on them, making it easier for them to appear superficially similar when
 they really should not be.
 Despite this, it is reassuring to see 
\begin_inset Quotes eld
\end_inset

Where Who What Why How
\begin_inset Quotes erd
\end_inset

 occur in succession.
 Other confidence-instilling sequences include 
\begin_inset Quotes eld
\end_inset

2 1 4 3 A
\begin_inset Quotes erd
\end_inset

 and some names: 
\begin_inset Quotes eld
\end_inset

Mai Demelza Richard John George
\begin_inset Quotes erd
\end_inset

.
 Then another reassuring sequence: 
\begin_inset Quotes eld
\end_inset

nothing something anything everything
\begin_inset Quotes erd
\end_inset

.
 Why 
\begin_inset Quotes eld
\end_inset

George
\begin_inset Quotes erd
\end_inset

 should be similar to 
\begin_inset Quotes eld
\end_inset

nothing
\begin_inset Quotes erd
\end_inset

 is best left unasked.
 But we'll ask anyway.
 
\begin_inset Separator latexpar
\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Float table
wide false
sideways false
status open

\begin_layout Plain Layout
\begin_inset Caption Standard

\begin_layout Plain Layout
Block-ranked words
\begin_inset CommandInset label
LatexCommand label
name "tab:Block-ranked-words"

\end_inset


\end_layout

\end_inset


\end_layout

\begin_layout Plain Layout
LEFT-WALL – ...
 ..
 ....
 *** Two Bit - The “ And But So Perhaps Though When As Then While If Have
 Can Did Do Will Thank Are Where Who What Why How Or Now Here This That
 It There He She I You We They Some One Most Instead However Besides Well
 Oh Ah No Yes Yeah Good Ham 2 1 4 3 A Mai Demelza Richard John George nothing
 something anything everything it there who which that what where when until
 as because since if before while little small large good great second woman
 man child boy girl king house city village world town country ship sun
 latter land others children people men things them us me him her his my
 your their our the this every another leave take get find see hear speak
 move stop fight be make give show keep bring meet follow call ask tell
 help go live stay talk turn run change use read say understand remember
 know think suppose guess hope promise believe mean am myself fear thought
 knew saw felt found called happy fine certain new single word moment minute
 year day night evening morning way book letter party distance light fire
 forest room air sea earth ground floor table subject story line case question
 place time thing person family soul heart body own wife brother sister
 father mother daughter son hands face feet mouth eyes hand mind friends
 friend thoughts arm arms chest staff words office past water door window
 truth same whole best present first last least once home back down up at
 over on in by from upon under through into for with such having like quite
 only also still born gone already been seen done taken known given brought
 made making taking after above among between of all near against to will
 can cannot may must might would should could did does was is seems seemed
 appeared used began tried wanted needed meant came went turned continued
 said says spoke had has always never ever usual possible well soon far
 much late bad strong big young white black dark blue closed opened open
 hold put stand sleep answer try return come enough ready trying going supposed
 beginning rest sound corner direction presence name side power nature one
 most part kind sort number lot bit couple matter group other two three
 four five ten several many some out length force death life voice hair
 husband lips head shoulder eye bed business point state view sense feeling
 thinking saying now indeed therefore however sir yes God dear lady poor
 old dead wrong right here doing glad afraid sure sorry coming close hard
 long short real different clear probably certainly simply just almost not
 hardly easily feel have shall wish need want seem next following Queen
 captain wall future general London work himself herself smiled sighed nodded
 asked replied cried got stood sat looked look smile chance reason means
 longer doubt idea end edge bottom top account instead perhaps and or but
 though although then thus yet even really heard left lost passed met were
 are do love told gave took held followed behind within half an raised free
 better less more rather often true alone again ) too so very pretty a its
 these those anyone you we they he she North Jack Pitch Ross Jim Dallas
 Telzey Goth Sir san together along off away around round about how why
 – ; , : … ! .
 ? — All From In On By To With After For It’s It's Not Just Even At See
 paragraph spirit cause middle sight front spite charge form attention throat
 breath money law human few hundred thousand years hours minutes days times
 later ago ' " ” 」 ’ _ etc Mr Mrs Dr sama don ul looking especially nor
 either each any no finally suddenly slowly quietly Mary King Lord Lady
 Miss Captain I'm I’m My Her His Your ( ‘ Illustration Stats Category Rating
 Fandom Character Warning Is Really Please Father Mother May course being
 without let set sitting drawing living high heavy cold public fact particular
 full relief order hour individual General outside both talking able happened
 exactly else hell non re self S t six natural than fell deep wide deal
 agree don’t don't didn’t didn't received paid care drop provide York agreement
 ape medium author laws terms works forward forth shook enjoyed originally
 access permission instance chapter comment St Princess Old New Summary
 Notes Our THE # * Of { including Prince notes states copies copy ‐ YOU
 THIS OF OR Section CHAPTER Chapter + | Project eBook eBooks Gutenberg }
 P tax States fee electronic associated Archive Foundation United & copyright
 Literary donations License trademark distribution refund Van distributing
 Ooal Additional Posted archive Tags 
\end_layout

\end_inset


\end_layout

\begin_layout Standard
The block-ranked words are a kind of 
\begin_inset Quotes eld
\end_inset

stream-of-consciousness
\begin_inset Quotes erd
\end_inset

 with regard to similarity.
 This is shown in the table below.
\begin_inset Foot
status collapsed

\begin_layout Plain Layout
Created with the `prt-sim` function.
\end_layout

\end_inset

 So, the numbers are all quite similar to each other, but the letter 'A'
 is not very similar to '3'.
 It just so happens that, of all the words not yet placed, 'A' is the one
 most similar to '3'.
 'A' and 'Mai' are not similar, but then all of the given names are quite
 similar.
 Next, it turns out that 'George' and 'nothing' really don't have very much
 to do with one another.
 And so the block-diagonal structure is exposed.
\begin_inset VSpace defskip
\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Tabular
<lyxtabular version="3" rows="23" columns="2">
<features tabularvalignment="middle">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Pair
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Similarity
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2 ..
 1
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.881
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1 ..
 4
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.791
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
4 ..
 3
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.835
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
3 ..
 A
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.375
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
A ..
 Mai
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.351
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Mai ..
 Demelza
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.615
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Demelza ..
 Richard
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.515
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Richard ..
 John
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.602
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
John ..
 George
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.648
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
George ..
 nothing
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.342
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
nothing ..
 something
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.483
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
something ..
 anything
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.549
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
anything ..
 everything
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.369
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
everything ..
 it
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.496
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
it ..
 there
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.743
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
there ..
 who
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.416
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
who ..
 which
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.618
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
which ..
 that 
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.580
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
that ..
 what
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.730
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
what ..
 where
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.722
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
where ..
 when
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.830
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
when ..
 until
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.886
\end_layout

\end_inset
</cell>
</row>
</lyxtabular>

\end_inset


\begin_inset VSpace defskip
\end_inset


\end_layout

\begin_layout Standard
Reading through the list of 797 words in table 
\begin_inset CommandInset ref
LatexCommand vpageref
reference "tab:Block-ranked-words"

\end_inset

 reveals a lot of interesting runs.
 All this suggests that cosine similarity really is doing the right thing.
 Equally reassuring are the failures of similarity: the end of the list
 of 797 words is populated with Gutenberg license boilerplate words.
 This is good: they are at the end of the list because they don't fit into
 any other grammatical usage patterns.
 They make the cut because they have been observed hundreds of times, which
 propels them into the high-frequency category; yet, since they always appear
 in the same set phrases, they cannot be similar to anything else.
\end_layout

\begin_layout Subsection*
Filtering
\end_layout

\begin_layout Standard
An earlier draft of this report showed a very different figure, which turned
 out to have been created based on what seemed to be a valid assumption,
 but turned out not to be so wise.
 It deserves some discussion here, which is resumed in a later section.
\end_layout

\begin_layout Standard
The idea (recapitulated below) is that it can be a good thing to filter
 out noisy data.
 For example, one may choose to ignore words that have not been seen many
 times.
 This is the cut made above: there are only 797 words observed with a length
 of 128 or greater.
 This is a convenient number only because it allows a 797-pixel-wide color
 plot to be made.
 A different cut might exclude words that were observed fewer than N times.
 
\end_layout

\begin_layout Standard
Another possibility, and this seems to be the fatal one, is to exclude disjuncts
 that are seen only a handful of times.
 Superficially, this seems like a very plausible thing to do.
 In practice, this turns out to be an almost completely disastrous cut.
 Setting the cutoff too high renders almost all capitalized words identical
 to one another, having a similarity of exactly one.
 Capitalized words are observed only infrequently; discarding low-frequency
 disjuncts leaves the capitalized words almost naked, and thus essentially
 identical.
 Just about all of them connect to the LEFT-WALL, since they are the first
 word in a sentence, and so all of them share a very high observation count
 of links to LEFT-WALL, and low observation counts of links to anything
 else.
 Ergo: they are all similar, since they all start sentences.
\end_layout
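
\begin_layout Standard
A worked case makes the collapse explicit.
 Suppose, as an idealization, that after trimming, two capitalized words
 
\begin_inset Formula $w$
\end_inset

 and 
\begin_inset Formula $v$
\end_inset

 each retain only a single disjunct, the one linking to LEFT-WALL, with
 observation counts 
\begin_inset Formula $a>0$
\end_inset

 and 
\begin_inset Formula $b>0$
\end_inset

 respectively; all of these symbols are introduced here for illustration.
 Then 
\begin_inset Formula 
\[
\cos\left(w,v\right)=\frac{ab}{\sqrt{a^{2}}\,\sqrt{b^{2}}}=1
\]

\end_inset

 regardless of how different 
\begin_inset Formula $a$
\end_inset

 and 
\begin_inset Formula $b$
\end_inset

 are.
 The cosine depends only on direction, so any two words reduced to the same
 single disjunct become indistinguishable.
 If a few small counts on other disjuncts survive the cut, the similarity
 falls only slightly below one.
\end_layout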

\begin_layout Standard
The damage doesn't stop there.
 It turns out that such trimming also raises similarity across the board:
 it ends up so that everything seems to be pretty darn similar to everything
 else, thus erasing too much of the 
\begin_inset Quotes eld
\end_inset

signal
\begin_inset Quotes erd
\end_inset

 that we are looking for.
 The original hope was to raise the signal-to-noise ratio by applying judicious
 cuts; but hope is not enough.
 Injudicious cuts can destroy the signal all too easily.
 Careful data analysis is needed; blind trust is not enough.
\end_layout

\begin_layout Standard
\begin_inset CommandInset bibtex
LatexCommand bibtex
bibfiles "lang"
options "tufte"

\end_inset


\end_layout

\end_body
\end_document