#LyX 2.3 created this file. For more info see http://www.lyx.org/
\lyxformat 544
\begin_document
\begin_header
\save_transient_properties true
\origin unavailable
\textclass article
\begin_preamble
\usepackage{url} 
\end_preamble
\use_default_options false
\maintain_unincluded_children false
\language english
\language_package default
\inputencoding utf8
\fontencoding global
\font_roman "times" "default"
\font_sans "helvet" "default"
\font_typewriter "cmtt" "default"
\font_math "auto" "auto"
\font_default_family default
\use_non_tex_fonts false
\font_sc false
\font_osf false
\font_sf_scale 100 100
\font_tt_scale 100 100
\use_microtype false
\use_dash_ligatures false
\graphics default
\default_output_format default
\output_sync 0
\bibtex_command default
\index_command default
\paperfontsize default
\spacing single
\use_hyperref true
\pdf_bookmarks true
\pdf_bookmarksnumbered false
\pdf_bookmarksopen false
\pdf_bookmarksopenlevel 1
\pdf_breaklinks true
\pdf_pdfborder true
\pdf_colorlinks true
\pdf_backref false
\pdf_pdfusetitle true
\papersize default
\use_geometry false
\use_package amsmath 2
\use_package amssymb 2
\use_package cancel 1
\use_package esint 0
\use_package mathdots 1
\use_package mathtools 1
\use_package mhchem 0
\use_package stackrel 1
\use_package stmaryrd 1
\use_package undertilde 1
\cite_engine basic
\cite_engine_type default
\biblio_style plain
\use_bibtopic false
\use_indices false
\paperorientation portrait
\suppress_date false
\justification true
\use_refstyle 0
\use_minted 0
\index Index
\shortcut idx
\color #008000
\end_index
\secnumdepth 3
\tocdepth 3
\paragraph_separation indent
\paragraph_indentation default
\is_math_indent 0
\math_numbering_side default
\quotes_style english
\dynamic_quotes 0
\papercolumns 1
\papersides 1
\paperpagestyle default
\listings_params "basicstyle={\ttfamily},basewidth={0.5em}"
\tracking_changes false
\output_changes false
\html_math_output 0
\html_css_as_file 0
\html_be_strict false
\end_header

\begin_body

\begin_layout Title
Language Learning Diary - Part One
\end_layout

\begin_layout Date
2014-2020
\end_layout

\begin_layout Author
Linas Vepstas
\end_layout

\begin_layout Abstract
The language-learning effort involves research and software development
 to implement the ideas described in ArXiv abs/1401.3372
\begin_inset CommandInset citation
LatexCommand cite
key "Goertzel2014"
literal "true"

\end_inset

.
 This document contains supplementary notes and a loosely-organized semi-chronol
ogical diary of results.
\end_layout

\begin_layout Abstract
Because this document has repeatedly become overly large, it has been split
 into multiple sub-documents.
 Notable ones include the report on connector sets, the preliminary report
 on grammatical classes, and several reports on word-pairs.
 What remains here is assorted ad hoc commentary.
\end_layout

\begin_layout Abstract
Because what remains here is still too long to manage, the diary resumes
 in Part Two, as of 2021.
\end_layout

\begin_layout Section*
Introduction
\end_layout

\begin_layout Standard
The language-learning effort involves research and software development
 to implement the ideas described in ArXiv abs/1401.3372
\begin_inset CommandInset citation
LatexCommand cite
key "Goertzel2014"
literal "true"

\end_inset

.
 This document contains supplementary notes and a loosely-organized semi-chronol
ogical diary of results.
 It is not actually chronological: in general, it is organized so that theory
 precedes data analysis.
 Usually.
\end_layout

\begin_layout Standard
The initial stages of this work require the extraction of word-pair probabilitie
s from raw text, and the use of these to induce a Link Grammar
\begin_inset CommandInset citation
LatexCommand cite
key "Sleator1991,Sleator1993"
literal "true"

\end_inset

.
 This extends prior work on MST parsers
\begin_inset CommandInset citation
LatexCommand cite
key "Yuret1998"
literal "true"

\end_inset

, by inducing link types for word-pair relations.
\end_layout

\begin_layout Standard
Later stages further extend beyond what is possible with Link Grammar by
 inducing synonymous words and phrases.
 The goal here is to unify into a consistent framework various techniques
 for unsupervised semantic discovery that have already been proven in narrower
 contexts
\begin_inset CommandInset citation
LatexCommand cite
key "Poon2009,Lin1998,Lin2001"
literal "true"

\end_inset

.
\end_layout

\begin_layout Standard
The first section of this document is a review of various definitions of
 probabilities that can be obtained from natural language text.
 This is followed by a roughly chronological diary of further observations
 and results.
 Many revisions are made out of chronological order.
\end_layout

\begin_layout Subsection*
Lexical Attraction, Mutual Information, Interaction Information
\end_layout

\begin_layout Standard
The goal of this section is to clarify some of the formulas used by Deniz
 Yuret in his PhD thesis 
\begin_inset Quotes eld
\end_inset


\emph on
Discovery of Linguistic Relations Using Lexical Attraction
\emph default

\begin_inset Quotes erd
\end_inset

, MIT 1998 (
\begin_inset CommandInset href
LatexCommand href
name "http://www2.denizyuret.com/pub/yuretphd.pdf"
target "http://www2.denizyuret.com/pub/yuretphd.pdf"
literal "false"

\end_inset

).
 These formulas are vitally important, because they provide a strong tool
 when working with text; this has been shown by Yuret in his thesis, by
 many others, and by my own practical experience with using them.
\end_layout

\begin_layout Standard
Possibly the most useful formula is the one in the middle of page 40.
 By the time that we get to it, the terms 
\begin_inset Quotes eld
\end_inset

mutual information
\begin_inset Quotes erd
\end_inset

 and 
\begin_inset Quotes eld
\end_inset

lexical attraction
\begin_inset Quotes erd
\end_inset

 are being used interchangeably.
 This formula states the 
\begin_inset Formula $MI(x,y)$
\end_inset

 for two words 
\begin_inset Formula $x$
\end_inset

 and 
\begin_inset Formula $y$
\end_inset

; yet it is manifestly not symmetric in 
\begin_inset Formula $x$
\end_inset

 and 
\begin_inset Formula $y$
\end_inset

, since 
\begin_inset Formula $x$
\end_inset

 is the word on the left, and 
\begin_inset Formula $y$
\end_inset

 is the word on the right.
 By contrast, textbook (Wikipedia) definitions of MI are symmetric in their
 variables.
 Below I try to disentangle the resulting confusion a bit, and give a more
 correct derivation of the formula.
 The key is to observe that the formula contains an implicit pair-wise relations
hip between two words, and that there are actually three variables: two
 words, and their relationship.
 If this implicit relationship is made explicit, then the confusion evaporates.
 It also opens the door to talking about the MI (or the interaction information
 InI) of more complex relationships, not just pair-wise ones.
 
\end_layout

\begin_layout Standard
Being able to correctly write down the MI and the InI for complex relationships
 is important for NLP: relationships can be labelled by types (subject,
 object) and by word classes (noun, verb), and have various dependency constrain
ts between them.
 Thus, we need to be able to talk both about a labelled directed graph,
 and the entropy or mutual information contained in its various sub-graphs.
\end_layout

\begin_layout Standard
In defense of Yuret, he does say, on page 22, that 
\begin_inset Quotes eld
\end_inset

lexical attraction is the likelihood of a syntactic relation.
\begin_inset Quotes erd
\end_inset

 However, the relation starts becoming implicit by eqn 12 on page 29.
 An unexplained leap is then made from eqn 12 to the formula on page 40.
 What follows gets fairly pedantic; this seems necessary to avoid confusion.
\end_layout

\begin_layout Subsubsection*
Definitions 
\end_layout

\begin_layout Standard
Let 
\begin_inset Formula $P(R(w_{l},w_{r}))$
\end_inset

 represent the probability (frequency) of observing two words, 
\begin_inset Formula $w_{l}$
\end_inset

 and 
\begin_inset Formula $w_{r}$
\end_inset

 in some relationship or pattern 
\begin_inset Formula $R$
\end_inset

.
 Typically, 
\begin_inset Formula $R$
\end_inset

 can be a (link-grammar) linkage of type 
\begin_inset Formula $t$
\end_inset

 connecting word 
\begin_inset Formula $w_{l}$
\end_inset

 on the left to word 
\begin_inset Formula $w_{r}$
\end_inset

 on the right; implicitly, both 
\begin_inset Formula $w_{l}$
\end_inset

 and 
\begin_inset Formula $w_{r}$
\end_inset

 occur in the same sentence.
 The goal of this discussion is to enable relations 
\begin_inset Formula $R$
\end_inset

 that are more general than this; for now, though, 
\begin_inset Formula $R$
\end_inset

 is a word-pair occurring in a single sentence.
\end_layout

\begin_layout Standard
The simplest dependency grammar language model has only one type 
\begin_inset Formula $t$
\end_inset

, the ANY type.
 This is the type that Yuret uses: it makes no distinction at all between
 subject, object relations (that is, all dependencies are unlabelled), and
 it does not make a head-dependent distinction (all dependencies are bi-directio
nal).
 Thus, in what follows, we do the same: initially, the relation 
\begin_inset Formula $R(w_{l},w_{r})$
\end_inset

 is simply the statement that the words 
\begin_inset Formula $w_{l}$
\end_inset

 and 
\begin_inset Formula $w_{r}$
\end_inset

 are connected by an unlabelled, un-directed edge.
 For this simplest case, what 
\begin_inset Formula $R(w_{l},w_{r})$
\end_inset

 does is to capture that 
\begin_inset Formula $w_{l}$
\end_inset

 is to the left of 
\begin_inset Formula $w_{r}$
\end_inset

.
 
\end_layout

\begin_layout Standard
In what follows, the relation 
\begin_inset Formula $R=R(w_{l},w_{r})$
\end_inset

 refers to a generic two-word relation, and not necessarily this simplest
 one.
 To regain Yuret's formula, use the simplest relation, the ordered word-pair
 relation, given just above.
\end_layout

\begin_layout Standard
The quantity of interest is the (unconditional) probability 
\begin_inset Formula $P(R(w_{l},w_{r}),w_{l},w_{r})$
\end_inset

 of observing the two words 
\begin_inset Formula $w_{l}$
\end_inset

 and 
\begin_inset Formula $w_{r}$
\end_inset

 in a relation 
\begin_inset Formula $R=R(w_{l},w_{r})$
\end_inset

.
 To correctly understand and work with this quantity, some care must be
 taken with the notation for several related probabilities.
 First, one has 
\begin_inset Formula $P(w)$
\end_inset

, the probability of observing the word 
\begin_inset Formula $w$
\end_inset

 in the data sample.
 Next, one has 
\begin_inset Formula $P(S(w_{1},w_{2}),w_{1},w_{2})$
\end_inset

, the probability that the two words occur in the same sentence.
 Again, 
\begin_inset Formula $S(w_{1},w_{2})$
\end_inset

 denotes a relation between the two words; it differs from 
\begin_inset Formula $R(w_{1},w_{2})$
\end_inset

 in that the word-order does not matter.
 A third kind of pair relation is the unconditional probability of observing
 two words, which can be 
\emph on
defined
\emph default
 as 
\begin_inset Formula $P(w_{1},w_{2})=P(w_{1})P(w_{2})$
\end_inset

.
 In this case, instead of assuming independence of two random variables,
 we define them to be so.
 This is possible, because we have a notation for specifying when there
 is a correlation.
 That is, if there was some correlation (relation) 
\begin_inset Formula $C(w_{1},w_{2})$
\end_inset

 between them, then one should write this explicitly, as 
\begin_inset Formula $P(C,w_{1},w_{2})=P(C(w_{1},w_{2}),w_{1},w_{2})$
\end_inset

.
 The notation here allows the various needed probabilities to be defined
 without ambiguity.
 
\end_layout

\begin_layout Standard
Thus, assumptions of independent variables are now replaced by a notational
 infrastructure.
 Note, in particular, that if one uses a frequentist definition for the
 probabilities (as will be done in what follows), then the probabilities
 are not independent of the data sample from which they are drawn.
 Thus, all probabilities here have an implicit dependence on the data sample.
 This dependency is not explicitly shown.
 Some care must be taken to use the same data sample throughout.
 
\end_layout

\begin_layout Standard
The above notation allows the definition of conditional probabilities, in
 the conventional sense.
 For example, one has that 
\begin_inset Formula 
\[
P(R,w_{l},w_{r})=P(R|w_{l},w_{r})P(w_{l},w_{r})
\]

\end_inset

or that 
\begin_inset Formula 
\[
P(R|w_{l},w_{r})=\frac{P(R,w_{l},w_{r})}{P(w_{l},w_{r})}
\]

\end_inset

as the conditional probability of observing the relation 
\begin_inset Formula $R$
\end_inset

, given that its component parts are observed.
 From the earlier definitions, the denominator factors, and so we conclude
 that the correct expression for the conditional probability is:
\begin_inset Formula 
\begin{equation}
P(R|w_{l},w_{r})=\frac{P(R,w_{l},w_{r})}{P(w_{l})P(w_{r})}\label{eq:cond-pair}
\end{equation}

\end_inset

This is the probability of observing the relationship 
\begin_inset Formula $R$
\end_inset

 given that the individual parts of the relationship have been observed.
 The relation 
\begin_inset Formula $R$
\end_inset

 includes all correlations between the two words: their ordering as well
 as their co-occurrence in a sentence.
 
\end_layout
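
\begin_layout Standard
For later reference, it may help to write eqn 
\begin_inset CommandInset ref
LatexCommand ref
reference "eq:cond-pair"

\end_inset

 in logarithmic form.
 This is only a restatement of the same formula, with nothing new assumed:
\begin_inset Formula 
\[
\log_{2}P(R|w_{l},w_{r})=\log_{2}P(R,w_{l},w_{r})-\log_{2}P(w_{l})-\log_{2}P(w_{r})
\]

\end_inset

The right-hand side has the same algebraic shape as the lexical attraction
 reviewed further below; the two differ mainly in which marginal probabilities
 appear in the denominator.
\end_layout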

\begin_layout Standard
Take care, however: 
\begin_inset Formula $P(R|w_{l},w_{r})$
\end_inset

 is NOT the probability of seeing 
\begin_inset Formula $R$
\end_inset

, given that 
\begin_inset Formula $w_{l}$
\end_inset

 and 
\begin_inset Formula $w_{r}$
\end_inset

 occur in the same sentence.
 This would instead be given by 
\begin_inset Formula $P(R,w_{l},w_{r})/P(S,w_{l},w_{r})$
\end_inset

.
 This is an entirely different quantity.
\end_layout

\begin_layout Subsubsection*
Frequentism - Counting words and pairs
\end_layout

\begin_layout Standard
In order to be usable, a computable definition for the probabilities must
 be given.
 For this, the definition can only be frequentist.
 That is, the probabilities are to be obtained from empirical data; from
 counting frequencies as they occur in data samples taken from nature.
 The frequency 
\begin_inset Formula $P(w)$
\end_inset

 of observing a word 
\begin_inset Formula $w$
\end_inset

 is obvious: 
\begin_inset Formula 
\[
P(w)=\frac{N(w)}{N(*)}
\]

\end_inset

where 
\begin_inset Formula $N(w)$
\end_inset

 is the count of observing word 
\begin_inset Formula $w$
\end_inset

 and 
\begin_inset Formula $N(*)$
\end_inset

 is the total number of words observed.
 That is, by definition, it is the wild-card summation 
\begin_inset Formula 
\[
N(*)=\sum_{w}N(w)
\]

\end_inset

How to count words is not entirely obvious, so even these definitions need
 care.
 There are several ways in which one can count words.
 One way is to simply count how many times a word occurs in the block of
 sample text.
 Another way is to count how many times a word occurs in parses of the sample
 text.
 These are not the same! For example, if a parse connects words by edges
 (by dependency-grammar relations), then one can count each word once, for
 each time that it occurs at the end of an edge.
 In this counting, the word-count is exactly double the word-pair count.
 A word is then counted multiple times, if it participates in multiple edges.
 If the sample text is parsed multiple times, then additional counts can
 result that way.
 To maintain consistency with the definitions given in the previous section,
 
\begin_inset Formula $N(w)$
\end_inset

 is defined to be the number of times that the word 
\begin_inset Formula $w$
\end_inset

 occurs in the data sample, independent of any other relations that
 
\begin_inset Formula $w$
\end_inset

 might be engaged in.
 For now, it is assumed that the segmentation of the text sample into words
 is unambiguous.
 
\end_layout

\begin_layout Standard
Let 
\begin_inset Formula $F(S(w),w)$
\end_inset

 be the number of times (frequency) of observing word 
\begin_inset Formula $w$
\end_inset

 in any sentence 
\begin_inset Formula $S$
\end_inset

.
 This can be computed as
\begin_inset Formula 
\[
F(S,w)=\frac{N(w)}{NS}
\]

\end_inset

where 
\begin_inset Formula $N(w)$
\end_inset

 is the number of times a word 
\begin_inset Formula $w$
\end_inset

 was observed in a data sample, and 
\begin_inset Formula $NS$
\end_inset

 is the number of sentences in that same sample.
 This counts with 
\begin_inset Quotes eld
\end_inset

multiplicity
\begin_inset Quotes erd
\end_inset

, in that 
\begin_inset Formula $w$
\end_inset

 can appear in a sentence more than once.
 That is, 
\begin_inset Formula $F$
\end_inset

 is not a probability; rather, it is an expectation value of the number
 of times that a word is observed.
 This can be made explicit, by writing
\end_layout

\begin_layout Standard
\begin_inset Formula 
\[
F(S,w)=\frac{N(w)}{N(*)}\,\frac{N(*)}{NS}=P(w)L(S)
\]

\end_inset

with 
\begin_inset Formula $L(S)=F(S,*)$
\end_inset

 being the average sentence length (the expectation value of the number
 of words in a sentence).
 
\end_layout
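
\begin_layout Standard
As a small worked illustration (the sample is hypothetical, chosen only
 to keep the arithmetic easy), suppose the data sample consists of the two
 sentences 
\emph on
the dog saw the cat
\emph default
 and 
\emph on
the cat ran
\emph default
.
 Then 
\begin_inset Formula $N(*)=8$
\end_inset

, 
\begin_inset Formula $NS=2$
\end_inset

 and 
\begin_inset Formula $N(\mbox{the})=3$
\end_inset

, so that 
\begin_inset Formula 
\[
P(\mbox{the})=\frac{3}{8},\qquad L(S)=\frac{8}{2}=4,\qquad F(S,\mbox{the})=\frac{3}{2}=\frac{3}{8}\times4
\]

\end_inset

The value 
\begin_inset Formula $F(S,\mbox{the})=1.5$
\end_inset

 exceeds one, which is consistent with it being an expectation value rather
 than a probability.
\end_layout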

\begin_layout Standard
Three different word-pair relationships are interesting.
 First, define the relation 
\begin_inset Formula $S(w_{1},w_{2})$
\end_inset

 as being the relation that both words 
\begin_inset Formula $w_{1}$
\end_inset

 and 
\begin_inset Formula $w_{2}$
\end_inset

 occur in the same sentence, but in arbitrary order.
 It is symmetric: 
\begin_inset Formula $S(w_{1},w_{2})=S(w_{2},w_{1})$
\end_inset

.
 Define 
\begin_inset Formula $A(w_{l},w_{r})$
\end_inset

 as being the relation that both words 
\begin_inset Formula $w_{l}$
\end_inset

 and 
\begin_inset Formula $w_{r}$
\end_inset

 occur in the same sentence, and that 
\begin_inset Formula $w_{l}$
\end_inset

 is to the left of 
\begin_inset Formula $w_{r}.$
\end_inset

 By this definition, the counts for the two are related: one has that 
\begin_inset Formula 
\[
N(S,w_{1},w_{2})=N(A,w_{1},w_{2})+N(A,w_{2},w_{1})
\]

\end_inset

This is the symmetrized count.
\end_layout
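
\begin_layout Standard
To illustrate with a hypothetical one-sentence sample, 
\emph on
the dog saw the cat
\emph default
: the first 
\emph on
the
\emph default
 lies to the left of 
\emph on
dog
\emph default
, while the second lies to its right.
 Counting each ordered occurrence once then gives
\begin_inset Formula 
\[
N(S,\mbox{the},\mbox{dog})=N(A,\mbox{the},\mbox{dog})+N(A,\mbox{dog},\mbox{the})=1+1=2
\]

\end_inset

so the symmetric count sees the pair twice, even though each ordered count
 sees it only once.
\end_layout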

\begin_layout Standard
Neither 
\begin_inset Formula $S$
\end_inset

 nor 
\begin_inset Formula $A$
\end_inset

 is yet the relation 
\begin_inset Formula $R(w_{l},w_{r})$
\end_inset

 mentioned above, which is defined as being the relation that both words
 
\begin_inset Formula $w_{l}$
\end_inset

 and 
\begin_inset Formula $w_{r}$
\end_inset

 occur in the same sentence, that 
\begin_inset Formula $w_{l}$
\end_inset

 is to the left of 
\begin_inset Formula $w_{r}$
\end_inset

, and, most importantly, that there is a link-grammar link (of type 
\begin_inset Quotes eld
\end_inset

R
\begin_inset Quotes erd
\end_inset

) connecting the two.
 Observe that although 
\begin_inset Formula $A$
\end_inset

 can be deduced from 
\begin_inset Formula $S$
\end_inset

, there is no simple or obvious relation between 
\begin_inset Formula $S$
\end_inset

 and 
\begin_inset Formula $R$
\end_inset

; these are essentially independent relations.
\end_layout

\begin_layout Standard
The way that the statistics are collected for 
\begin_inset Formula $A$
\end_inset

 and for 
\begin_inset Formula $R$
\end_inset

 is different.
 To count the 
\begin_inset Formula $A$
\end_inset

-type relations, one tokenizes a sentence into words, and then, counts every
 possible word-pair in the sentence.
 Effectively, one draws a clique of edges between the words, and then counts
 each edge.
 The statistics for 
\begin_inset Formula $R$
\end_inset

 are collected by parsing the sentence into a random planar tree, and then
 counting the edges in the tree.
 The result for this counting is NOT the same as that for type-
\begin_inset Formula $A$
\end_inset

 edges.
 The reason for this is demonstrated in depth, in the section 
\begin_inset CommandInset ref
LatexCommand nameref
reference "sec:Edge-counting"

\end_inset

 
\begin_inset CommandInset ref
LatexCommand vpageref
reference "sec:Edge-counting"

\end_inset

, below.
\end_layout
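
\begin_layout Standard
A rough count already suggests why the two differ.
 As a sketch, assume a single sentence of 
\begin_inset Formula $m$
\end_inset

 words, and assume that the random planar tree is a spanning tree (it touches
 every word and has no cycles).
 Then one pass of each counting method records 
\begin_inset Formula 
\[
N(A,*,*)=\frac{m(m-1)}{2}\qquad\mbox{versus}\qquad N(R,*,*)=m-1
\]

\end_inset

so that for 
\begin_inset Formula $m=5$
\end_inset

 the clique contributes 10 pair-counts, while a single tree contributes
 only 4.
\end_layout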

\begin_layout Standard
Initially, there is only one link relation 
\begin_inset Quotes eld
\end_inset

R
\begin_inset Quotes erd
\end_inset

 between two words: this is the 
\begin_inset Quotes eld
\end_inset

ANY
\begin_inset Quotes erd
\end_inset

 link-type.
 However, in general, 
\begin_inset Quotes eld
\end_inset

R
\begin_inset Quotes erd
\end_inset

 can be other kinds of link-types.
 Note that 
\begin_inset Quotes eld
\end_inset

R
\begin_inset Quotes erd
\end_inset

 can also have a head-tail dependency order: either 
\begin_inset Formula $w_{l}$
\end_inset

 or 
\begin_inset Formula $w_{r}$
\end_inset

 can be the head-word of a directional link.
 Thus, there are three different symmetrizations that can be obtained from
 
\begin_inset Quotes eld
\end_inset

R
\begin_inset Quotes erd
\end_inset

: by failing to make a left-right distinction, by failing to make a head-tail
 distinction, and failing to do either.
\end_layout

\begin_layout Standard
The definition for the probability of observing a relation can be taken
 to be 
\begin_inset Formula 
\begin{equation}
P(R,w_{l},w_{r})=\frac{N(R,w_{l},w_{r})}{N(R,*,*)}\label{eq:prob-relation}
\end{equation}

\end_inset

where
\begin_inset Formula 
\[
N(R,*,*)=\sum_{w_{l},w_{r}}N(R,w_{l},w_{r})
\]

\end_inset

This can be roughly understood as being the conditional probability of observing
 the relation 
\begin_inset Formula $R(w_{l},w_{r})$
\end_inset

 between two specific words, given that the relation 
\begin_inset Formula $R$
\end_inset

 between any two words was seen.
\end_layout
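
\begin_layout Standard
As a consistency check on this definition (assuming only that 
\begin_inset Formula $N(R,*,*)>0$
\end_inset

), summing over all word-pairs gives
\begin_inset Formula 
\[
\sum_{w_{l},w_{r}}P(R,w_{l},w_{r})=\frac{1}{N(R,*,*)}\sum_{w_{l},w_{r}}N(R,w_{l},w_{r})=\frac{N(R,*,*)}{N(R,*,*)}=1
\]

\end_inset

so the normalization is over observed instances of the relation, and not
 over words or sentences.
\end_layout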

\begin_layout Standard
Is it possible to define the unconditional probability 
\begin_inset Formula $P(R,*,*)$
\end_inset

 of seeing the relationship? The path to the answer is not entirely straight-for
ward.
 First consider the probability 
\begin_inset Formula $P(S,w_{1},w_{2})$
\end_inset

 of seeing two words in the same sentence.
 This probability is defined just as in eqn 
\begin_inset CommandInset ref
LatexCommand ref
reference "eq:prob-relation"

\end_inset

; that is, 
\begin_inset Formula $P(S,w_{1},w_{2})=N(S,w_{1},w_{2})/N(S,*,*)$
\end_inset

.
 From this, one can define the frequency of seeing a relation in a sentence,
 as
\begin_inset Formula 
\[
F(R|S,w_{1},w_{2})=\frac{P(R,w_{1},w_{2})}{P(S,w_{1},w_{2})}
\]

\end_inset

This gives the expectation value of seeing the relation 
\begin_inset Formula $R$
\end_inset

 in a sentence, given that the two words are already known to be in the
 sentence.
 That this is an expectation value should be clear, as the relation might
 appear multiple times in one sentence (e.g.
 if one of the words is repeated).
  The sum
\begin_inset Formula 
\[
F(R|S,*,*)=\sum_{w_{l},w_{r}}F(R|S,w_{l},w_{r})
\]

\end_inset

then counts the average number of relations per sentence.
 For the any-type ordered-pair relation, clearly one must have that there
 are at least as many relations as there are words in the sentence, minus
 one, since each word must appear in at least one (distinct) relation.
 That is, 
\begin_inset Formula $F(S,*)-1\le F(R|S,*,*)$
\end_inset

 with 
\begin_inset Formula $F(S,*)$
\end_inset

 the expected length of a sentence.
 
\end_layout

\begin_layout Standard
Similarly, one can consider the ratio
\begin_inset Formula 
\[
F(S,w_{1},w_{2})=\frac{P(S,w_{1},w_{2})}{P(w_{1})P(w_{2})}
\]

\end_inset

which captures the frequency at which two words are seen in the same sentence.
 The summation 
\begin_inset Formula $F(S,*,*)$
\end_inset

 then counts how many pairs are seen per sentence.
 Assuming that the counting was performed with a uniform distribution, this
 should then equal the number of edges in a clique.
 That is, for a sentence of length 
\begin_inset Formula $m$
\end_inset

, there should be 
\begin_inset Formula $m(m-1)/2$
\end_inset

 word-pairs (edges) counted for that sentence.
 This should hold approximately, on average, so that 
\begin_inset Formula $F(S,*,*)\approx F(S,*)(F(S,*)-1)/2$
\end_inset

.
\end_layout

\begin_layout Standard
From the development above, it should be clear that it is not really possible
 to define a quantity 
\begin_inset Formula $P(R,*,*)$
\end_inset

 that is the 
\begin_inset Quotes eld
\end_inset

probability of seeing a relation
\begin_inset Quotes erd
\end_inset

.
 We can count the number 
\begin_inset Formula $N(R,*,*)$
\end_inset

 of times the relation occurs in a data sample.
 We can count the average number of times the relation is seen in a sentence.
 However, as long as the relation occurs at least once in the data sample,
 one would have to say that the 
\begin_inset Quotes eld
\end_inset

probability of seeing the relation in the data sample
\begin_inset Quotes erd
\end_inset

 is one.
 The problem is one of normalization: there is no universe, of which 
\begin_inset Formula $N(R,*,*)$
\end_inset

 is a fractional measure.
\end_layout

\begin_layout Standard
That said, one can still consider an interesting ratio:
\begin_inset Formula 
\[
F(R,*,*)=\sum_{w_{l},w_{r}}F(R,w_{l},w_{r})=\sum_{w_{l},w_{r}}\frac{P(R,w_{l},w_{r})}{P(w_{l})P(w_{r})}
\]

\end_inset

This can be interpreted as a kind of centrality.
 So, for example, for the any-pair relation, every word in the data sample
 must participate in at least one such pair-relation, and thus, we expect
 that 
\begin_inset Formula $F(R,*,*)\approx1$
\end_inset

.
 The precise value is related to the tree-parse that is being used to generate
 the any-relation.
 If the (random) parse-tree is acyclic, then the number of edges is comparable
 to the number of words.
 If the parse-tree contains cycles, then there may be more relations than
 there are words.
\end_layout

\begin_layout Subsubsection*
Yuret's Mutual Information
\end_layout

\begin_layout Standard
Deniz Yuret introduces the concept of 
\begin_inset Quotes eld
\end_inset

lexical attraction
\begin_inset Quotes erd
\end_inset

.
 It is reviewed briefly, here.
 He defines a probability 
\begin_inset Formula $\mathcal{P}(w_{l},w_{r})$
\end_inset

 of seeing an ordered pair; as compared to the above, the relation is implicit.
 To make it explicit, one should write: 
\begin_inset Formula 
\begin{equation}
\mathcal{P}(w_{l},w_{r})=P(A(w_{l},w_{r}),w_{l},w_{r})\label{eq:Yuret-prob}
\end{equation}

\end_inset

which indicates the relation explicitly, as well as noting that the order
 of the positions in the relation matters.
 To avoid confusion, the cursive 
\begin_inset Formula $\mathcal{P}$
\end_inset

 is used for the Yuret notation, instead of the roman 
\begin_inset Formula $P$
\end_inset

 which is reserved for the definitions above.
\end_layout

\begin_layout Standard
The letter 
\begin_inset Formula $A$
\end_inset

 used here reminds us that in Yuret's work, the pair-counting method used
 is the clique-edge-counting mechanism, described above, rather than the
 random-planar-tree relation.
 One expects the two to be similar, but not the same.
\end_layout

\begin_layout Standard
Yuret also uses the notation 
\begin_inset Formula $\mathcal{P}(w_{l},*)$
\end_inset

 and 
\begin_inset Formula $\mathcal{P}(*,w_{r})$
\end_inset

 for wild-card summations, defined as 
\begin_inset Formula 
\[
\mathcal{P}(w_{l},*)=\sum_{w_{r}}\mathcal{P}(w_{l},w_{r})\qquad\mbox{and}\qquad\mathcal{P}(*,w_{r})=\sum_{w_{l}}\mathcal{P}(w_{l},w_{r})
\]

\end_inset

It is tempting to conflate 
\begin_inset Formula $\mathcal{P}(w_{l},*)$
\end_inset

 with 
\begin_inset Formula $P(w_{l})$
\end_inset

 but that would be wrong; not every possible word can occur in the 
\begin_inset Formula $w_{r}$
\end_inset

 position.
 This suggests a different, but tempting, error, that 
\begin_inset Formula $\mathcal{P}(w_{l},*)\le P(w_{l})$
\end_inset

.
 This is also not the case! A word might occur more frequently as the left
 side of a pair, than it does all by itself in the sample text.
 This follows from the frequentist definitions; the denominators for the
 two probabilities are not compatible; they do not range over the same universe.
\end_layout
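
\begin_layout Standard
A small hypothetical example makes this concrete.
 Take the one-sentence sample 
\emph on
the dog saw the cat
\emph default
 and count pairs by the clique method, so that there are 10 ordered pairs
 in all.
 The first 
\emph on
the
\emph default
 is the left member of four of them, and the second 
\emph on
the
\emph default
 is the left member of one, so that
\begin_inset Formula 
\[
\mathcal{P}(\mbox{the},*)=\frac{4+1}{10}=\frac{1}{2}\qquad\mbox{while}\qquad P(\mbox{the})=\frac{N(\mbox{the})}{N(*)}=\frac{2}{5}
\]

\end_inset

Here 
\begin_inset Formula $\mathcal{P}(\mbox{the},*)>P(\mbox{the})$
\end_inset

, because the first is normalized by the number of pairs and the second
 by the number of words.
\end_layout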

\begin_layout Standard
Yuret defines the 
\begin_inset Quotes eld
\end_inset

lexical attraction
\begin_inset Quotes erd
\end_inset

 as 
\begin_inset Formula 
\begin{equation}
\mbox{MI}(w_{l},w_{r})=\log_{2}\,\frac{\mathcal{P}(w_{l},w_{r})}{\mathcal{P}(w_{l},*)\mathcal{P}(*,w_{r})}\label{eq:Yuret's MI}
\end{equation}

\end_inset

so that large positive MI is associated with words that are rarely seen one
 without the other (e.g.
 '
\emph on
Northern Ireland
\emph default
' from his examples.) Note the absence of a minus sign in the above! See
 below for an explanation.
 Large-MI word pairs occur when 
\begin_inset Formula $\mathcal{P}(w_{l},w_{r})$
\end_inset

 is roughly comparable to 
\begin_inset Formula $\mathcal{P}(w_{l},*)\approx\mathcal{P}(*,w_{r})$
\end_inset

.
 
\end_layout

\begin_layout Standard
It is worth reviewing Yuret's example, at this point.
 He looks at the word pair 'Northern Ireland' and states (based on a particular
 corpus that was analyzed) that 
\begin_inset Formula $-\log_{2}P(\mbox{'Northern'})=12.60$
\end_inset

 and that 
\begin_inset Formula $-\log_{2}P(\mbox{'Ireland'})=14.65$
\end_inset

 and finally that 
\begin_inset Formula $-\log_{2}\mathcal{P}(\mbox{'Northern'},\mbox{'Ireland'})=16.13$
\end_inset

.
 What these numbers mean is that 'Northern' occurs about once in 6,000 words
 and 'Ireland' about once in 26,000 words, while the word-pair occurs about
 once in 72,000 words: the pair is seen roughly a third as often as 'Ireland'
 alone.
 Thus, the resulting MI is very large: 
\begin_inset Formula $\mbox{MI}=-16.13+12.60+14.65=11.12$
\end_inset

.
 The choice of sign in eqn 
\begin_inset CommandInset ref
LatexCommand ref
reference "eq:Yuret's MI"

\end_inset

 is such that words that co-occur have a large positive value.
 In practice, the distribution of the MI for word-pairs runs from about
 -15 to about +35, and, when ranked according to MI, the probabilities form
 a rounded mountain-peak, two-sided, each side being linear (Zipfian) with
 the peak at about MI=4 or 6.
 (See my other notes for a graph.)
\end_layout
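
\begin_layout Standard
To get a feel for the size of this number, one can undo the logarithm, taking
 the values quoted above at face value:
\begin_inset Formula 
\[
\frac{\mathcal{P}(\mbox{'Northern'},\mbox{'Ireland'})}{P(\mbox{'Northern'})\,P(\mbox{'Ireland'})}=2^{11.12}\approx2.2\times10^{3}
\]

\end_inset

That is, the pair is seen roughly two thousand times more often than it
 would be if the two words were placed independently of one another.
\end_layout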

\begin_layout Standard
References:
\end_layout

\begin_layout Itemize
Apparently, MI is introduced here: Kenneth Ward Church and Patrick Hanks.
 
\begin_inset Quotes eld
\end_inset

Word association norms, mutual information, and lexicography
\begin_inset Quotes erd
\end_inset

.
 Computational linguistics, 16(1):22–29, 1990.
 
\end_layout

\begin_layout Section*
1 January 2014
\end_layout

\begin_layout Standard
OK, after that side distraction, which helped clear up notation, back to
 the main show ...
\end_layout

\begin_layout Standard
The main show is this: We want to model language, and specifically, find
 a 'minimal' set of relations R that are accurately generative.
 The meaning of 'minimal' seems obvious, intuitively, but is a lot harder to
 pin down mathematically.
 We need to pin it down to get an algorithm that works in a trust-worthy,
 understandable fashion.
\end_layout

\begin_layout Standard
So: what is the total space of relations, and how do we find it? The simplest
 model is then a Zipfian distribution of words, but placed in random order.
 This model has a total entropy of 
\begin_inset Formula 
\[
H=-\sum_{w}P(w)\log_{2}P(w)
\]

\end_inset

For a recent swipe at parsing a few hundred articles from the French wikipedia,
 I get H=7.2.
 This is on 17K words, observed a total of 35M times (actually, observed
 each sentence 100 times, so really just 350K 'true' observations of words).
\end_layout
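
\begin_layout Standard
For scale, here is a back-of-the-envelope comparison (not a measurement).
 If all 17K words were equally likely, the entropy would take its maximum
 value, 
\begin_inset Formula 
\[
H_{\max}=\log_{2}17000\approx14.05\qquad\mbox{whereas}\qquad2^{H}=2^{7.2}\approx147
\]

\end_inset

so the observed entropy is about half of the maximum, and the word distribution
 is as unpredictable as a uniform choice among roughly 150 words.
\end_layout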

\begin_layout Standard
How does one count the entropy of the rule-set? Elucidating this is the
 goal-set.
\end_layout

\begin_layout Standard
But first, step back: describe the rules.
 
\end_layout

\begin_layout Standard
OK ...
 so, once again ...
 sentence structure is to be described via link-grammar, using disjoined
 conjunctions of connectors.
 This is theoretically sound, as it seems to be isomorphic to categorial
 grammars (via type-theory of the connectors; need a formal proof of this
 someday, but for now it seems 'obvious').
 Also link-grammar is fully compatible with dependency grammar.
 This is an old debate, off to the side, and immaterial for now, so let's
 move forward.
\end_layout

\begin_layout Subsection*
How to count relations
\end_layout

\begin_layout Standard
Consider a sentence with 
\begin_inset Formula $n$
\end_inset

 words in it, numbered 
\begin_inset Formula $w_{1},w_{2},\cdots,w_{n}$
\end_inset

 left to right.
 We want to constrain grammar by discovering a set of relations 
\begin_inset Formula $R(w_{1},w_{2},\cdots,w_{n})$
\end_inset

 such that 
\begin_inset Formula $P(R(w_{1},w_{2},\cdots,w_{n}))>0$
\end_inset

 when the sentence is grammatically valid (
\emph on
i.e.

\emph default
 such an 
\begin_inset Formula $R$
\end_inset

 exists), and 
\begin_inset Formula $P$
\end_inset

 is zero when no such 
\begin_inset Formula $R$
\end_inset

 exists (
\emph on
i.e.

\emph default
 the sentence is not grammatically valid.) The first and most obvious simplificati
on rule is to observe that 
\begin_inset Formula $R$
\end_inset

 can be replaced by 
\begin_inset Formula $R(W_{1},W_{2},\cdots,W_{n})$
\end_inset

 where each 
\begin_inset Formula $W_{k}$
\end_inset

 is a set of words.
 That is, instead of listing each sentence individually, we list certain
 classes of sentences.
 In other words, the relations 
\begin_inset Formula $R(w_{1},w_{2},\cdots,w_{n})$
\end_inset

 are in one-to-one correspondence with a list of grammatical sentences 
\begin_inset Formula $(w_{1},w_{2},\cdots,w_{n})$
\end_inset

, so simply listing all possible sentences is a very verbose way of specifying
 a grammar.
 It is linguistically 'obvious' that sentences fall into classes, and so
 the two relations 
\begin_inset Formula $R('this','is','a','dog')$
\end_inset

 and 
\begin_inset Formula $R('this','is','a','cat')$
\end_inset

 can be replaced by 
\begin_inset Formula $R('this','is','a',W_{n})$
\end_inset

 where 
\begin_inset Formula $W_{n}=\{'dog','cat'\}$
\end_inset

.
 In fact, 
\begin_inset Formula $W_{n}$
\end_inset

 can be a rather large set of nouns.
 
\end_layout

\begin_layout Standard
So ...
 the question is: what is the reduction of complexity, by performing this
 classification? What is the correct way of counting? I assume that 'complexity'
 is a synonym for 'entropy', so we are looking to do two things: enumerate
 the states of the system, and provide a measure for complexity.
 So, let's consider a language with 
\begin_inset Formula $N$
\end_inset

 nouns, so that the cardinality of 
\begin_inset Formula $W_{n}$
\end_inset

 is 
\begin_inset Formula $|W_{n}|=N$
\end_inset

 and the only valid sentences are 
\begin_inset Formula $('this','is','a',w)$
\end_inset

 with 
\begin_inset Formula $w\in W_{n}$
\end_inset

 .
 Before simplification, we had 
\begin_inset Formula $N$
\end_inset

 relations 
\begin_inset Formula $R$
\end_inset

, one per sentence.
 We also had 
\begin_inset Formula $N+3$
\end_inset

 sets, each set containing a single word; the 
\begin_inset Formula $N$
\end_inset

 nouns, and the three words 
\begin_inset Formula $'this','is','a'$
\end_inset

.
 After simplification, we have one relation 
\begin_inset Formula $R$
\end_inset

, and four sets; three of the sets have cardinality 1, the fourth set has
 cardinality 
\begin_inset Formula $N$
\end_inset

.
 
\end_layout

\begin_layout Subsubsection*
Revision: July 2014
\end_layout

\begin_layout Standard
There seem to be several ways of counting.
 Some of these seem to give wrong answers.
 Some just seem wrong.
 This is all very confusing, so I've altered the entries to explicitly show
 the different ways of counting.
\end_layout

\begin_layout Paragraph*
Method 1 (naive counting):
\end_layout

\begin_layout Standard
One counting rule is to count set-membership relations on equal footing
 with structural relations.
 Thus, before simplification, we had 
\begin_inset Formula $N+3$
\end_inset

 sets, each a singleton, and thus 
\begin_inset Formula $N+3$
\end_inset

 set membership relations.
 After simplification, we have four sets, but still have 
\begin_inset Formula $N+3$
\end_inset

 set membership relations.
 Thus, this particular simplification step does not reduce the number of
 membership relations at all.
 This seems disconcerting...
 Let's provisionally go with this and see what happens.
 Thus, before simplification, we had 
\begin_inset Formula $2N+3$
\end_inset

 relations grand-total, and afterwards, we have 
\begin_inset Formula $N+4$
\end_inset

 relations grand-total.
\end_layout

\begin_layout Standard
What is the correct 'thermodynamic' picture of what's going on? In this
 toy problem, we have a grand-total state space of size 
\begin_inset Formula $(N+3)^{4}$
\end_inset

 since any of the 
\begin_inset Formula $N+3$
\end_inset

 words can appear in any of the four slots in a four-word sentence (microcanonical
 ensemble).
 The entropy, at 'infinite temperature' where all possible four-word sequences
 occur with equal probability is then 
\begin_inset Formula $4\log_{2}(N+3)$
\end_inset

.
 The entropy of the set of grammatical sentences is 
\begin_inset Formula $\log_{2}N$
\end_inset

 since there are only 
\begin_inset Formula $N$
\end_inset

 possible grammatical sentences.
 In this toy grammar, there are also invalid sentences of length 1,2,3,5,6,7,...
 and so the total size of the space of word-sequences is clearly infinite.
 
\end_layout

\begin_layout Standard
OK, so the space of word-sequences is very concrete, and easy to describe
 and measure, at least for toy grammars.
 What about the space of relations? Well, the claim is that the entropies
 of the before-and-after models are 
\begin_inset Formula $\log_{2}(2N+3)$
\end_inset

 and 
\begin_inset Formula $\log_{2}(N+4)$
\end_inset

, respectively.
 Neither of these matches the entropy of the set of allowed sentences (which
 is 
\begin_inset Formula $\log_{2}N$
\end_inset

), so this seems paradoxical, and raises the questions 'did we count correctly?'
 and 'did we actually simplify anything by making the above change of description
n?' Hmm.
 The correct answer seems to be 'no' and 'no'.
 
\end_layout

\begin_layout Paragraph*
Method 2 (subtract one):
\end_layout

\begin_layout Standard
To 'fix' the oddball results above, an alternative counting methodology
 is to subtract 1 from the cardinality of every set.
 This would then give 
\begin_inset Formula $\log_{2}N$
\end_inset

 as the entropy for both the before and after relation-sets.
 Thus, before, we had 
\begin_inset Formula $N$
\end_inset

 relations and 
\begin_inset Formula $N+3$
\end_inset

 sets, each of weight zero, for a total weighted-relation count of 
\begin_inset Formula $N$
\end_inset

.
 After, we have one relation and four sets; three of the sets have weight
 zero, one set has a weight of 
\begin_inset Formula $N-1$
\end_inset

 so the total weighted relations is again 
\begin_inset Formula $N$
\end_inset

.
 This seems to resolve the paradox.
 But why subtract one? That's a bizarre rule, almost unheard-of in information
 theory.
\end_layout

\begin_layout Paragraph*
Method 3 (naive log addition):
\end_layout

\begin_layout Standard
Total complexity is given by:
\begin_inset Formula 
\[
K=\log_{2}|Rel|+\sum_{W\in Wrds}\log_{2}|W|
\]

\end_inset

where 
\begin_inset Formula $Rel$
\end_inset

 is the set of relations, and 
\begin_inset Formula $Wrds$
\end_inset

 is the set of word-lists, and 
\begin_inset Formula $|W|$
\end_inset

 is the cardinality of each word-list.
 We then get, before simplification, 
\begin_inset Formula $|Rel|=N$
\end_inset

 and 
\begin_inset Formula $|W|=1$
\end_inset

 for each of the word-sets.
 The total complexity is thus 
\begin_inset Formula $K=\log_{2}N$
\end_inset

 as expected (i.e.
 equal to the log of the total number of possible sentences).
 After simplification, there is 
\begin_inset Formula $|Rel|=1$
\end_inset

 and 3 sets with 
\begin_inset Formula $|W|=1$
\end_inset

 and one set with 
\begin_inset Formula $|W|=N$
\end_inset

, thus yielding a total of 
\begin_inset Formula $K=\log_{2}N$
\end_inset

 again.
 This seems to give a plausible answer, and provides a plausible argument.
 
\end_layout

\begin_layout Paragraph*
Method 4 (relational complexity):
\end_layout

\begin_layout Standard
Treating each relation as being equally complex seems odd.
 It would seem to make more sense to have each relation contribute according
 to its complexity, so that the contribution of the relations to the total
 complexity is: 
\begin_inset Formula 
\[
\sum_{R\in Rel}C_{R}
\]

\end_inset

with 
\begin_inset Formula $C_{R}$
\end_inset

 the complexity of each relation, itself the log of some measure.
 But how do we measure complexity? Is it Kolmogorov complexity? There's
 no obvious 
\emph on
a priori
\emph default
 definition for this.
 The definition of this complexity would seem to depend on the particular
 algorithm machinery of the grammar; that is, on the 'programming language'
 used to represent the relation.
 This is the traditional ambiguity attached to the Kolmogorov complexity.
\end_layout

\begin_layout Paragraph*
Method 5 (corpus distribution):
\end_layout

\begin_layout Standard
Rather than measuring the complexity of a grammatical expression (in an as-yet
 unknown grammar), use the corpus frequency as a proxy.
 For the above example, if the 
\begin_inset Formula $N$
\end_inset

 sentences are equi-distributed (i.e.
 are equally likely to occur in the corpus), then, before simplification, each
 of the relations has a complexity 
\begin_inset Formula 
\[
C_{R}=-\frac{1}{N}\log_{2}\frac{1}{N}
\]

\end_inset

so that, before simplification, 
\begin_inset Formula 
\[
K=\sum_{R\in Rel}C_{R}=\log_{2}N
\]

\end_inset

which again seems to be the desired answer.
 After simplification, there is one relation that applies to the entire
 corpus, so that 
\begin_inset Formula $C_{R}=0$
\end_inset

 after simplification.
 
\end_layout

\begin_layout Paragraph*
Method 6 (corpus word-counts):
\end_layout

\begin_layout Standard
If we are taking word-relation frequencies from the corpus, then we should
 be taking word-set frequencies from the corpus as well.
 That is, the word-set contribution 
\begin_inset Formula $\log_{2}|W|$
\end_inset

 is assuming an equi-distribution.
 This should be replaced by the corpus contribution 
\begin_inset Formula 
\[
-\sum_{w\in W}p(w)\log_{2}p(w)
\]

\end_inset


\end_layout

\begin_layout Paragraph*
Summary.
\end_layout

\begin_layout Standard
Provisionally, the last two methods seem to be the best way to move forward.
 To summarize, the complexity is given by 
\begin_inset Formula 
\begin{equation}
K=-\sum_{R\in Rel}P_{R}\log_{2}P_{R}-\sum_{W\in Wrds}\ \sum_{w\in W}P_{w}\log_{2}P_{w}\label{eq:counting complexity}
\end{equation}

\end_inset

where 
\begin_inset Formula $P_{R}=P(R)=P(R(W_{1},W_{2},\cdots,W_{n}))$
\end_inset

 is the probability of observing the relation 
\begin_inset Formula $R$
\end_inset

 in a sample corpus, and 
\begin_inset Formula $P_{w}=P(w|W)$
\end_inset

 is the probability of observing word 
\begin_inset Formula $w$
\end_inset

 in the corpus, conditioned on that occurrence being an instance of the
 word-class 
\begin_inset Formula $W$
\end_inset

.
\end_layout
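
\begin_layout Standard
As a worked check of eqn 
\begin_inset CommandInset ref
LatexCommand eqref
reference "eq:counting complexity"

\end_inset

, take the toy grammar above with 
\begin_inset Formula $N=4$
\end_inset

 nouns, and assume, purely for illustration, that the four sentences occur
 with probabilities 
\begin_inset Formula $\frac{1}{2},\frac{1}{4},\frac{1}{8},\frac{1}{8}$
\end_inset

 (these numbers are made up, not measured from any corpus).
 Before simplification, each sentence is its own relation, and every word-set
 is a singleton with 
\begin_inset Formula $P(w|W)=1$
\end_inset

, so only the first sum contributes: 
\begin_inset Formula 
\[
K_{before}=-\frac{1}{2}\log_{2}\frac{1}{2}-\frac{1}{4}\log_{2}\frac{1}{4}-2\cdot\frac{1}{8}\log_{2}\frac{1}{8}=\frac{1}{2}+\frac{1}{2}+\frac{3}{4}=1.75
\]

\end_inset

After simplification, there is a single relation with 
\begin_inset Formula $P_{R}=1$
\end_inset

, which contributes zero, as do the three singleton sets.
 The noun set 
\begin_inset Formula $W_{n}$
\end_inset

 contributes with 
\begin_inset Formula $P(w|W_{n})$
\end_inset

 equal to the same four probabilities, so that 
\begin_inset Formula $K_{after}=1.75$
\end_inset

 as well.
 The two descriptions agree, and both fall below the equi-distributed value
 
\begin_inset Formula $\log_{2}4=2$
\end_inset

.
\end_layout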

\begin_layout Subsection*
Counting Link-Grammar Relations
\end_layout

\begin_layout Standard
Per link-grammar, each relation is decomposable into pair-wise relations;
 this is the so-called 'parse' of a sentence.
 If the relation is a single word-per-slot sentence relation, then the 'parse'
 is literal.
 We write 
\begin_inset Formula 
\begin{equation}
R(w_{1},w_{2},\cdots,w_{n})=\prod_{j,k,m}R_{\alpha}(w_{j},w_{k},t_{m})\ Q(R_{\alpha},R_{\beta},\cdots,R_{\omega})\label{eq:pair-decompose}
\end{equation}

\end_inset

where 
\begin_inset Formula $R_{\alpha}(w_{j},w_{k},t_{m})$
\end_inset

 is a single connected pair of words, connected by the connector 
\begin_inset Formula $t_{m}$
\end_inset

.
 The product symbol 
\begin_inset Formula $\prod$
\end_inset

 implies that all such binary relations must hold.
 The awkward 
\begin_inset Formula $Q(R_{\alpha},R_{\beta},\cdots,R_{\omega})$
\end_inset

 at the end is the additional no-links-cross constraint in the current link-gram
mar parser.
 It's a non-local constraint involving all of the binary relations.
 It also subsumes any 'post-processing' rules, although, for the language
 learning exercise, there won't be any post-processing rules.
 At any rate, 
\begin_inset Formula $Q$
\end_inset

 is a place where higher order constraints can be applied.
 In particular, the most general form for 
\begin_inset Formula $Q$
\end_inset

 should be 
\begin_inset Formula $Q(R_{\alpha},R_{\beta},\cdots,R_{\omega},w_{1},w_{2},\cdots,w_{n})$
\end_inset

 since, in principle, it could depend on the word-choice, although the no-links-
cross constraint does not.
 
\end_layout

\begin_layout Standard
Yuret proposes a way of discovering the pair-wise relations
\begin_inset CommandInset citation
LatexCommand cite
key "Yuret1998"
literal "true"

\end_inset

.
 He makes the implicit, unvoiced assumption that there is a single, unique
 connector type 
\begin_inset Formula $t_{m}$
\end_inset

 for every ordered pair of words 
\begin_inset Formula $w_{j},w_{k}$
\end_inset

; that is, that 
\begin_inset Formula $t_{m}=t_{m}(w_{j},w_{k})$
\end_inset

.
 Viz, specifically, that such connectors are in 1-1 correspondence with
 word-pairs.
 (I don't think he's aware of this assumption; I don't think anyone has
 ever before realized that he's making such an assumption; certainly, I
 haven't).
 Yuret then makes two claims: first, that the only possible grammatically
 correct parses are those of the above form (eqn 
\begin_inset CommandInset ref
LatexCommand eqref
reference "eq:pair-decompose"

\end_inset

) for which the relations 
\begin_inset Formula $R_{\alpha}(w_{j},w_{k},t_{m}(w_{j},w_{k}))$
\end_inset

 have been previously observed; secondly, that there is a natural ranking
 of such allowed parses by summing the total mutual information associated
 with each word-pair.
\end_layout

\begin_layout Standard
These two concepts give rise to the idea of minimum-spanning-tree parsers.
 Such parsers work in a two-step process: a training phase, and a parse
 phase.
 In the training phase, one gathers a lot of statistics about mutual information.
 The important point here is that this is unsupervised training.
 To parse, one first creates a graph clique, with every word connected to
 every other.
 One uses the gathered MI to define graph edge lengths.
 Finally, the correct parse is then the maximum spanning tree of the graph
 (maximizing the MI, summed over the tree edges in the graph).
\end_layout

\begin_layout Standard
Here, we use the same idea, but then take the next step.
 The spanning tree can be decomposed into a set of link-grammar disjuncts,
 one disjunct per word.
 The disjunct is merely a list of the connections that one word makes.
 It consists of the type, and the direction.
 The direction is left or right.
 The type is the 
\begin_inset Formula $t_{m}=t_{m}(w_{j},w_{k})$
\end_inset

 defined above.
 By parsing a large number of sentences, we can now automatically discover
 a large number of disjuncts, in an unsupervised manner.
\end_layout
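
\begin_layout Standard
A toy illustration of the two steps, using made-up MI values (not measured
 from any corpus): for the three-word sentence 'the dog barked', suppose
 that 
\begin_inset Formula $MI('the','dog')=5$
\end_inset

, 
\begin_inset Formula $MI('dog','barked')=4$
\end_inset

 and 
\begin_inset Formula $MI('the','barked')=1$
\end_inset

.
 The clique on three words has exactly three spanning trees, with total
 MI 
\begin_inset Formula 
\begin{align*}
MI('the','dog')+MI('dog','barked') & =5+4=9\\
MI('the','dog')+MI('the','barked') & =5+1=6\\
MI('dog','barked')+MI('the','barked') & =4+1=5
\end{align*}

\end_inset

so the MST parse keeps the links the-dog and dog-barked.
 Writing 
\begin_inset Formula $t(a,b)$
\end_inset

 for the connector type of the pair, and using the link-grammar convention
 of 
\begin_inset Formula $+$
\end_inset

 for a link to the right and 
\begin_inset Formula $-$
\end_inset

 for a link to the left, the disjuncts read off from this tree are 
\begin_inset Formula 
\begin{align*}
'the' & :\ t('the','dog')^{+}\\
'dog' & :\ t('the','dog')^{-}\ \&\ t('dog','barked')^{+}\\
'barked' & :\ t('dog','barked')^{-}
\end{align*}

\end_inset


\end_layout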

\begin_layout Standard
The goal, the next step, is then to reduce the total number of disjuncts,
 and the total number of types, by clustering and discovering similarities.
\end_layout

\begin_layout Section*
3 January 2014
\end_layout

\begin_layout Subsection*
No-crossing Minimum Spanning Trees
\end_layout

\begin_layout Standard
It turns out that writing an algorithm for a no-crossing minimum spanning
 tree is surprisingly painful; enforcing the no-crossing constraint requires
 treatment of a number of special cases.
 But perhaps this is not actually required! R.
 Ferrer i Cancho in 
\begin_inset Quotes eld
\end_inset

Why do syntactic links not cross?
\begin_inset Quotes erd
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "Ferrer2006"
literal "true"

\end_inset

 shows that, when attempting to arrange a random set of points on a line,
 in such a way as to minimize Euclidean distances between connected points,
 one ends up with trees that almost never cross!
\end_layout

\begin_layout Standard
Other related references:
\end_layout

\begin_layout Itemize
Crossings are rare: Havelka, J.
 (2007).
 Beyond projectivity: multilingual evaluation of constraints and measures
 on non-projective structures.
 In: Proceedings of the 45th Annual Meeting of the Association of Computational
 Linguistics (ACL-07): 608-615.
 Prague, Czech Republic: Association for Computational Linguistics.
 
\end_layout

\begin_layout Itemize
Hubiness is a better model of sentence complexity than mean dependency
 distance: Ramon Ferrer-i-Cancho (2013) 
\begin_inset Quotes eld
\end_inset

Hubiness, length, crossings and their relationships in dependency trees
\begin_inset Quotes erd
\end_inset

, ArXiv 1304.4086 — also states: maximum number of crossings is bounded above
 by mean dependency length.
 Also, mean dependency length is bounded below by variance of degrees of
 vertices (i.e.
 variance in number of connectors a word can have).
\end_layout

\begin_layout Itemize
Language tends to be close to the theoretical minimum possible dependency
 distance, if it were legal to re-arrange words arbitrarily.
 See Temperley, D.
 (2008).
 Dependency length minimization in natural and artificial languages.
 Journal of Quantitative Linguistics, 15(3):256-282.
 
\end_layout

\begin_layout Itemize
Park, Y.
 A.
 and Levy, R.
 (2009).
 Minimal-length linearizations for mildly context-sensitive dependency trees.
 In Proceedings of the North American Chapter of the Association for Computation
al Linguistics - Human Language Technologies (NAACL-HLT) conference.
 
\end_layout

\begin_layout Itemize
Sentences with long dependencies are hard to understand: The original claim
 is from Yngve, 1960, having to do with phrase-structure depth.
 See – Gibson, E.
 (2000).
 The dependency locality theory: A distance-based theory of linguistic complexit
y.
 In Marantz, A., Miyashita, Y., and O'Neil, W., editors, Image, Language, Brain.
 Papers from the first Mind Articulation Project Symposium.
 MIT Press, Cambridge, MA.
 
\end_layout

\begin_layout Itemize
(Cite this, it's good) Mean dependency distance is a good measure of sentence
 complexity – for 20 languages – Haitao Liu gives overview starting from
 Yngve.
 
\begin_inset CommandInset citation
LatexCommand cite
key "Liu2008"
literal "true"

\end_inset

.
 Haitao Liu 
\begin_inset Quotes eld
\end_inset

Dependency distance as a metric of language comprehension difficulty
\begin_inset Quotes erd
\end_inset

, 2008, Journal of Cognitive Science, v9.2 pp 159-191 http://www.lingviko.net/JCS.pd
f 
\end_layout

\begin_layout Itemize
Sentences with long dependencies are rarely spoken: Hawkins, J.
 A.
 (1994).
 A Performance Theory of Order and Constituency.
 Cambridge University Press, Cambridge, UK.
 Hawkins, J.
 A.
 (2004).
 Efficiency and Complexity in Grammars.
 Oxford University Press, Oxford, UK.
 Wasow, T.
 (2002).
 Postverbal Behavior.
 CSLI Publications, Stanford, CA.
 Distributed by University of Chicago Press.
 
\end_layout

\begin_layout Itemize
Dependency-length minimization is universal: Richard Futrell, Kyle Mahowald,
 and Edward Gibson, 
\begin_inset Quotes eld
\end_inset

Large-scale evidence of dependency length minimization in 37 languages
\begin_inset Quotes erd
\end_inset

 (2015), doi: 10.1073/pnas.1502134112
\end_layout

\begin_layout Itemize
The longest links, observed statistically, are of length 6 or less.
 This is based on computing the mutual information of words at different
 distances for the Brown corpus.
 Xuedong Huang, Fileno Alleva, Hsiao-wuen Hon, Mei-Yuh Hwang, Kai-Fu Lee
 and Ronald Rosenfeld.
 The SPHINX-II Speech Recognition System: An Overview.
 Computer Speech and Language, volume 2, pages 137–148, 1993.
\end_layout

\begin_layout Standard
So, rather than imposing no-crossing as a constraint on the parser,
 let it find its own way into the grammar.
 Just implement a plain-old MST parser, and punt on crossing.
\end_layout

\begin_layout Subsection*
Crossing and Copulas
\end_layout

\begin_layout Standard
Here's a great example where multiple linguistic desires cause trouble.
 First, consider the planar diagram, no-links-cross, that indicates both
 the head-noun and the head-verb:
\end_layout

\begin_layout Standard
\begin_inset listings
inline false
status open

\begin_layout Plain Layout

        +-------->WV------->+
\end_layout

\begin_layout Plain Layout

        +---->Wd---->+      |
\end_layout

\begin_layout Plain Layout

        |      +<-Ds-+--Ss--+-->Pa->+
\end_layout

\begin_layout Plain Layout

        |      |     |      |       |
\end_layout

\begin_layout Plain Layout

    LEFT-WALL the  dog.n   was    black 
\end_layout

\end_inset


\end_layout

\begin_layout Standard
The above indicates the head-noun (
\family typewriter
Wd
\family default
), the head-verb (
\family typewriter
WV
\family default
) and a subject-verb relation (
\family typewriter
Ss
\family default
) and that these three form a cycle.
 It's still a DAG, because the arrows point down from the root.
 The 
\family typewriter
Pa
\family default
 link is a predicative-adjective link.
\end_layout

\begin_layout Standard
A plausible alternative diagram is
\end_layout

\begin_layout Standard
\begin_inset listings
inline false
status open

\begin_layout Plain Layout

       +---->Wd---->+      
\end_layout

\begin_layout Plain Layout

       |            +-->adjcomp--->+
\end_layout

\begin_layout Plain Layout

       |            |              |
\end_layout

\begin_layout Plain Layout

       |      +<-Ds-+--Ss--+<-cop<-+
\end_layout

\begin_layout Plain Layout

       |      |     |      |       |
\end_layout

\begin_layout Plain Layout

   LEFT-WALL the  dog.n   was    black
\end_layout

\end_inset


\end_layout

\begin_layout Standard
Here, the 
\family typewriter
adjcomp
\family default
 arrow indicates the adjectival complement, and 
\family typewriter
cop
\family default
 indicates the copula.
 Note that it is 
\emph on
impossible
\emph default
 to draw an arrow from the root to the head-verb, without forcing a link-crossin
g! Interesting, eh? 
\end_layout

\begin_layout Standard
The real question, the major question is: what happens to pair-wise MI?
 What would that show? It seems unlikely that MI(dog, black) is going to
 be all that high.
 But what about MI(physical object, color)? What is the correct way to compute
 MI(physical object, color)?
\end_layout

\begin_layout Standard
The above is perhaps the simplest-possible example of the confusion surrounding
 the transition from SSyntR to DSyntR (or to logical representations, a
 la opencog).
 That is, we want to convert the 
\family typewriter
adjcomp
\family default
 relation to 
\end_layout

\begin_layout Standard
\begin_inset listings
inline false
status open

\begin_layout Plain Layout

  EvaluationLink
\end_layout

\begin_layout Plain Layout

     PredicateNode "has color"
\end_layout

\begin_layout Plain Layout

        ListLink
\end_layout

\begin_layout Plain Layout

           Concept "dog"
\end_layout

\begin_layout Plain Layout

           Concept "black"
\end_layout

\end_inset

 
\end_layout

\begin_layout Standard
Following Poon & Domingos, we really want to convert it to
\end_layout

\begin_layout Standard
\begin_inset listings
inline false
status open

\begin_layout Plain Layout

  LambdaLink
\end_layout

\begin_layout Plain Layout

    VariableList
\end_layout

\begin_layout Plain Layout

       Variable $PHY
\end_layout

\begin_layout Plain Layout

       Variable $COL
\end_layout

\begin_layout Plain Layout

    AndLink
\end_layout

\begin_layout Plain Layout

       EvaluationLink
\end_layout

\begin_layout Plain Layout

          PredicateNode "has color"
\end_layout

\begin_layout Plain Layout

          ListLink
\end_layout

\begin_layout Plain Layout

             Variable $PHY
\end_layout

\begin_layout Plain Layout

             Variable $COL
\end_layout

\begin_layout Plain Layout

       InheritanceLink
\end_layout

\begin_layout Plain Layout

          Variable $PHY
\end_layout

\begin_layout Plain Layout

          Concept "physical object"
\end_layout

\begin_layout Plain Layout

       InheritanceLink
\end_layout

\begin_layout Plain Layout

          Variable $COL
\end_layout

\begin_layout Plain Layout

          Concept "color" 
\end_layout

\end_inset


\end_layout

\begin_layout Standard
Of course, this has to be done based on the evidence of many sentences,
 discovered to be quasi-synonymous (a la Dekang Lin DIRT).
\end_layout

\begin_layout Section*
11 January 2014
\end_layout

\begin_layout Subsection*
Clustering Redux
\end_layout

\begin_layout Standard
OK, so what is the very next algorithmic step? Up to here, we've generated
 a large number of unique disjuncts.
 Now what?
\end_layout

\begin_layout Standard
Back to counting.
 Let's do the French dictionary.
 The database fr_pairs contains table atoms_mi_snapshot.
 So:
\end_layout

\begin_layout Itemize

\family typewriter
select count(*) from atoms_mi_snapshot;
\family default
 returns 415532
\end_layout

\begin_layout Section*
15 January 2014
\end_layout

\begin_layout Subsection*
Embodied Learning
\end_layout

\begin_layout Standard
OK, so maybe learning syntax before semantics puts the cart before the horse.
 Can we learn a world-model first, and then gradually annotate and correct
 it as our linguistic comprehension improves? So, for example, can we start
 with a world-model obtained via document summarization? How do we annotate
 this model with newly discovered data?
\end_layout

\begin_layout Standard
Related question: how to automatically discover ontologies? Automated, unsupervi
sed concept, entity extraction? Semantic context change over time?
\end_layout

\begin_layout Standard
Steps: 
\end_layout

\begin_layout Enumerate
How do I extract entities out of a text? The extraction doesn't have to
 be perfect; having candidate entities is enough.
 How do I put a confidence rating on the entity, and how do I discard the
 low-confidence ones?
\end_layout

\begin_layout Enumerate
Once entities are extracted, I want to start decorating them with attributes
 (adjectives, modifiers), to build a network.
\end_layout

\begin_layout Enumerate
Once a network is built, it needs to be factually reconciled, using logical
 reasoning and an ontology (is-a and has-a relations).
 Need to do this so that upon reading 
\begin_inset Quotes eld
\end_inset

colorless green ideas
\begin_inset Quotes erd
\end_inset

, we can deduce that ideas are either colorless or green, but not both.
\end_layout

\begin_layout Enumerate
How to automatically extract an ontology from free text?
\end_layout

\begin_layout Standard
The above seem to be the central steps/core issues for creating a world-model,
 unsupervised, from text.
\end_layout

\begin_layout Subsection*
Entropy
\end_layout

\begin_layout Standard
Some refresher notes: 
\end_layout

\begin_layout Itemize
\begin_inset Quotes eld
\end_inset

The Boltzmann distribution is the so-called canonical distribution, meaning
 it maximizes entropy subject to a constraint on the expected value of energy.
\begin_inset Quotes erd
\end_inset

 (viz, this is the MaxEnt principle.
 Except for MaxEnt, the constraint is not on energy, but rather a set of
 constraints obtained on some other theoretical grounds.)
\end_layout

\begin_layout Itemize
Define 
\begin_inset Quotes eld
\end_inset

Shannon Entropy
\begin_inset Quotes erd
\end_inset

 as 
\begin_inset Formula $S_{s}=-k_{B}\sum p\log p$
\end_inset


\end_layout

\begin_layout Itemize
The 
\begin_inset Quotes eld
\end_inset

Boltzmann Entropy
\begin_inset Quotes erd
\end_inset

 
\begin_inset Formula $S_{B}$
\end_inset

 is the Shannon entropy of the microcanonical ensemble: it maximizes the
 entropy (MaxEnt) for a fixed value of the energy.
 (MaxEnt: not the energy, but for a fixed set of constraints).
 (viz, 
\begin_inset Formula $S_{B}=k_{B}\log\left(\epsilon\frac{d\Omega}{dE}\right)$
\end_inset

 with 
\begin_inset Formula $\Omega$
\end_inset

 being number of states, E the energy, 
\begin_inset Formula $\epsilon$
\end_inset

 a constant of dimension energy to make arg of the log dimensionless.) (MaxEnt:
 replace E by the individual constraints.
 This suggests that there are many Boltzmann entropies: one for each constraint
 that is applied!)
\end_layout

\begin_layout Itemize
The 
\begin_inset Quotes eld
\end_inset

Gibbs Entropy
\begin_inset Quotes erd
\end_inset

 is the Shannon entropy, maximized for a system held to the constraint that
 energy is less-than-or-equal to E.
 (!) This gives 
\begin_inset Formula $S_{G}=k_{B}\log\Omega$
\end_inset

 (duhh, take 
\begin_inset Formula $p=1/\Omega$
\end_inset

 for 
\begin_inset Formula $\Omega$
\end_inset

 states.
 For a non-sharp cutoff, the Shannon entropy is primal.).
 (MaxEnt: one gets a different Gibbs entropy for each applied constraint.)
\end_layout

\begin_layout Itemize
Gibbs and Boltzmann entropies give different results for N-particle systems
 when N is very small.
 Viz, an off-by-one error for N.
 In some ways, 
\begin_inset Formula $S_{G}$
\end_inset

 is more correct (at low temp, quantum systems).
 See Jörn Dunkel and Stefan Hilbert (2014) 
\begin_inset Quotes eld
\end_inset

Consistent thermostatistics forbids negative absolute temperatures
\begin_inset Quotes erd
\end_inset

 Nature Physics DOI: 10.1038/NPHYS2815 
\end_layout

\begin_layout Subsection*
Why does Yuret's MST work?
\end_layout

\begin_layout Standard
There is an interesting simplification that happens with minimum-spanning
 tree parsers driven by entropy.
 If we use Yuret's definition of the MI of word-pairs, then, Yuret says
 (I should re-read his stuff) that we should maximize the entropy 
\begin_inset Formula 
\begin{equation}
\sum_{w_{l},w_{r}}MI(w_{l},w_{r})\label{eq:MST entropy}
\end{equation}

\end_inset

Why? Why this, instead of the maybe 
\begin_inset Quotes eld
\end_inset

more obviously correct
\begin_inset Quotes erd
\end_inset

 sum: 
\begin_inset Formula 
\begin{equation}
\sum_{w_{l},w_{r}}P(w_{l},w_{r})MI(w_{l},w_{r})\label{eq:wrong MST entropy}
\end{equation}

\end_inset

I think I can hand-wave the answer.
 The answer is that we don't really know the probability of 
\begin_inset Formula $P(w_{l},w_{r})$
\end_inset

 
\emph on
for the given sentence! 
\emph default
We know 
\begin_inset Formula $P(w_{l},w_{r})$
\end_inset

 for a large corpus, but it's somewhat of a mistake to assume that this is
 identical to what it would be for expressing a particular idea in a certain
 specific way.
 It's possible that, to express the idea, the only sentences that one could
 ever possibly use would have 
\begin_inset Formula $P(w_{l},w_{r})$
\end_inset

 that strongly deviate from a large-corpus average.
 Unfortunately, there is no easy way of knowing what this sentence-specific
 
\begin_inset Formula $P(w_{l},w_{r})$
\end_inset

 is.
 So, instead we make the uniform distribution assumption, that they're all
 the same, and thus get eqn 
\begin_inset CommandInset ref
LatexCommand eqref
reference "eq:MST entropy"

\end_inset

 instead of 
\begin_inset CommandInset ref
LatexCommand eqref
reference "eq:wrong MST entropy"

\end_inset

.
 Does Yuret ever make this argument himself? Dunno.
\end_layout
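
\begin_layout Standard
To spell out that last step (a sketch, reading both sums as running over
 the edges of a candidate tree 
\begin_inset Formula $T$
\end_inset

, and taking the uniform assumption at face value): a spanning tree on a
 sentence of 
\begin_inset Formula $n$
\end_inset

 words has 
\begin_inset Formula $n-1$
\end_inset

 edges, so setting 
\begin_inset Formula $P(w_{l},w_{r})=1/(n-1)$
\end_inset

 on every edge gives 
\begin_inset Formula 
\[
\sum_{(w_{l},w_{r})\in T}\frac{1}{n-1}MI(w_{l},w_{r})=\frac{1}{n-1}\sum_{(w_{l},w_{r})\in T}MI(w_{l},w_{r})
\]

\end_inset

The prefactor is the same for every candidate tree on the sentence, so the
 tree that maximizes one sum also maximizes the other.
\end_layout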

\begin_layout Standard
A supporting argument is that we also ignored 3,4,5-point interactions.
 Which brings us to the next point: why should we expect a link-parse to
 work better than an MST parse? Because Yuret-MST ignores the valence of
 words, whereas the disjuncts don't! The disjuncts provide a better, more
 accurate way of capturing valency!
\end_layout

\begin_layout Subsection*
Entity Extraction
\end_layout

\begin_layout Standard
See Oren Etzioni, Michael Cafarella, Doug Downey, Ana-Maria Popescu, Tal
 Shaked, Stephen Soderland, Daniel S.
 Weld, and Alexander Yates (2005) 
\begin_inset Quotes eld
\end_inset

Unsupervised Named-Entity Extraction from the Web: An Experimental Study
\begin_inset Quotes erd
\end_inset

.
 So: KNOWITALL utilizes a set of eight domain-independent extraction patterns
 to generate candidate facts.
 For example, the generic pattern “NP1 such as NPList2” ...
 
\begin_inset Quotes eld
\end_inset

cities such as Paris,...
\begin_inset Quotes erd
\end_inset

.
 Of course, this is not really unsupervised, since it uses human-generated
 search patterns (
\begin_inset Quotes eld
\end_inset

such as
\begin_inset Quotes erd
\end_inset

) and also applies constraints (the targets must be proper nouns, which
 is not known a priori).
\end_layout

\begin_layout Subsection*
Partition Function
\end_layout

\begin_layout Standard
Some notes about the Boltzmann distribution.
 Let 
\begin_inset Formula $F(w_{l},w_{r})$
\end_inset

 be a numeric score associated with the edge 
\begin_inset Formula $(w_{l},w_{r})$
\end_inset

 – for example, this might be (minus) the MI.
 By convention, one introduces a Lagrange multiplier 
\begin_inset Formula $\beta$
\end_inset

 and writes 
\begin_inset Formula $F=\beta f$
\end_inset

.
 The probability of a parse tree 
\begin_inset Formula $T$
\end_inset

 constructed solely out of edge-pairs is then defined as
\begin_inset Formula 
\[
P(T|S)=\frac{\exp\left(-\beta\sum_{(w_{l},w_{r})\in T}f(w_{l},w_{r})\right)}{Z(S)}
\]

\end_inset

Here, 
\begin_inset Formula $S$
\end_inset

 is the sentence: a sequence of words, and 
\begin_inset Formula $Z(S)$
\end_inset

 is the partition function: the sum of the unnormalized weights (the Boltzmann
 factors) of all possible parses for that sentence:
\begin_inset Formula 
\[
Z(S)=\sum_{T}\exp\left(-\beta\sum_{(w_{l},w_{r})\in T}f(w_{l},w_{r})\right)
\]

\end_inset

The MST parse is then the single parse 
\begin_inset Formula $T$
\end_inset

 that maximizes the probability 
\begin_inset Formula $P(T|S)$
\end_inset

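 and, because 
\begin_inset Formula $Z(S)$
\end_inset

 does not depend on 
\begin_inset Formula $T$
\end_inset

, this is easily seen (for 
\begin_inset Formula $\beta>0$
\end_inset

) to be the parse that minimizes the sum 
\begin_inset Formula $\sum_{(w_{l},w_{r})\in T}f(w_{l},w_{r})$
\end_inset

 on the spanning tree 
\begin_inset Formula $T$
\end_inset

.
 With 
\begin_inset Formula $f$
\end_inset

 taken to be minus the MI, this is the tree with the greatest total MI.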
\end_layout
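
\begin_layout Standard
As a quick check of which tree is the most probable one (a sketch, assuming
 
\begin_inset Formula $\beta>0$
\end_inset

 and the choice 
\begin_inset Formula $f=-MI$
\end_inset

 mentioned above), take the logarithm of 
\begin_inset Formula $P(T|S)$
\end_inset

: 
\begin_inset Formula 
\begin{align*}
\log P(T|S) & =-\beta\sum_{(w_{l},w_{r})\in T}f(w_{l},w_{r})-\log Z(S)\\
 & =\beta\sum_{(w_{l},w_{r})\in T}MI(w_{l},w_{r})-\log Z(S)
\end{align*}

\end_inset

The term 
\begin_inset Formula $\log Z(S)$
\end_inset

 is the same for every tree, so ranking the trees by probability is the
 same as ranking them by total MI, and the most probable tree is the one
 with the largest MI sum.
\end_layout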

\begin_layout Standard
Written in this form, it suggests how parsing can be generalized to include
 other scores.
 If, for example, one has some other score 
\begin_inset Formula $g(w_{1},w_{2},w_{3})$
\end_inset

 over triples of words, then one sums as above, using 
\begin_inset Formula $g$
\end_inset

 in place of 
\begin_inset Formula $f$
\end_inset

.
 In general, one can consider scoring functions 
\begin_inset Formula $f=f(R;S)$
\end_inset

 for some relation 
\begin_inset Formula $R$
\end_inset

 over the sentence 
\begin_inset Formula $S$
\end_inset

.
 The prototypical example would be a scoring function that uses Link Grammar
 disjuncts for the relation 
\begin_inset Formula $R$
\end_inset

, together with some weighting for link lengths (so as to avoid long links),
 together with some weighting for word pairs (to give greater weight to
 idioms and set phrases, as opposed to grammatically valid but non-idiomatic
 parses).
\end_layout

\begin_layout Standard
In many papers on supervised training, it is common to call the relation
 
\begin_inset Formula $R$
\end_inset

 a 
\begin_inset Quotes eld
\end_inset

feature
\begin_inset Quotes erd
\end_inset

, and to consider many different possible features 
\begin_inset Formula $R$
\end_inset

.
 Thus, 
\begin_inset Formula $f=f_{R}(S)$
\end_inset

 becomes a vector 
\begin_inset Formula $\vec{f}$
\end_inset

 over the features 
\begin_inset Formula $R$
\end_inset

.
 The multiplier 
\begin_inset Formula $-\beta$
\end_inset

 is replaced by a vector of weights 
\begin_inset Formula $\vec{w}$
\end_inset

, and so one considers the probability 
\begin_inset Formula 
\[
P=\frac{\exp\left(\sum\vec{w}\cdot\vec{f}\right)}{Z(S)}
\]

\end_inset

Supervised training then uses a training corpus marked up with both a feature
 vector 
\begin_inset Formula $\vec{f}$
\end_inset

 and the correct parse; training consists of using various supervised learning
 algos to find the best-possible weight vector 
\begin_inset Formula $\vec{w}$
\end_inset

 that maximizes the fraction of correct parses (or optimizes the ROC curve,
 or other measure of accuracy).
 In unsupervised training, we don't have a training corpus, and thus, do
 not focus on optimizing the weight vector.
\end_layout

\begin_layout Standard
Now we do the funky chicken dance.
 Write
\begin_inset Formula 
\[
Z(S)=\det\,L(S)
\]

\end_inset

This is commonly done when working with fermions; this is the Berezin determinan
t or Berezin integral, so named because one may write 
\begin_inset Formula 
\[
\det A=\int\exp\left[-\theta A\eta\right]d\theta d\eta
\]

\end_inset

for Grassmann variables 
\begin_inset Formula $\theta$
\end_inset

 and 
\begin_inset Formula $\eta$
\end_inset

 and 
\begin_inset Formula $A$
\end_inset

 a matrix.
 Here's the part that surprises me: Koo et al. state that 
\begin_inset Formula $L(S)$
\end_inset

 can be taken to be a Laplacian matrix of a graph.
 Wow! The mind boggles.
\end_layout
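
\begin_layout Standard
A tiny worked case may make this less mysterious.
 This is only a sketch: it uses the undirected, weighted form of Kirchhoff's
 matrix-tree theorem, whereas (as far as I understand it) Koo et al.
 work with a directed variant suited to dependency trees; the weights 
\begin_inset Formula $a,b,c$
\end_inset

 are my own labels.
 Take a three-word sentence, and give the three possible edges the Boltzmann
 weights 
\begin_inset Formula $a=e^{-\beta f(w_{1},w_{2})}$
\end_inset

, 
\begin_inset Formula $b=e^{-\beta f(w_{1},w_{3})}$
\end_inset

 and 
\begin_inset Formula $c=e^{-\beta f(w_{2},w_{3})}$
\end_inset

.
 The weighted graph Laplacian is 
\begin_inset Formula 
\[
L=\left(\begin{array}{ccc}
a+b & -a & -b\\
-a & a+c & -c\\
-b & -c & b+c
\end{array}\right)
\]

\end_inset

Each row sums to zero, so the determinant of the full matrix vanishes; the
 theorem is instead applied to a minor.
 Deleting the last row and column gives 
\begin_inset Formula 
\[
\det\left(\begin{array}{cc}
a+b & -a\\
-a & a+c
\end{array}\right)=(a+b)(a+c)-a^{2}=ab+ac+bc
\]

\end_inset

The three terms are exactly the weights of the three spanning trees of the
 triangle, so the minor reproduces the sum over trees 
\begin_inset Formula $Z(S)$
\end_inset

 for this small case (with every spanning tree allowed).
\end_layout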

\begin_layout Standard
References:
\end_layout

\begin_layout Itemize
Koo, T., Globerson, A., Carreras, X., and Collins, M.
 (2007).
 Structured prediction models via the matrix-tree theorem.
 In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural
 Language Processing and Computational Natural Language Learning (EMNLP-CoNLL),
 pages 141–150, Prague, Czech Republic, June.
 Association for Computational Linguistics.
\end_layout

\begin_layout Section*
3 March 2014
\end_layout

\begin_layout Standard
Start again, after long distraction.
\end_layout

\begin_layout Subsection*
Finding patterns
\end_layout

\begin_layout Standard
Toy problem.
 Consider an alphabet of 
\begin_inset Formula $N=5$
\end_inset

 letters, 
\begin_inset Formula $\alpha=\{A,B,C,D,E\}$
\end_inset

 and a corpus built from those letters.
 The five letters occur with probability 
\begin_inset Formula $p(w)$
\end_inset

 with 
\begin_inset Formula $w\in\alpha$
\end_inset

.
 Assume the corpus consists entirely of pairs AB, CB and DE, each occurring
 equally often: so: 
\begin_inset Formula $p(A,B)=p(C,B)=p(D,E)=1/3$
\end_inset

 .
 From this, we can reconstruct that 
\begin_inset Formula $p(A)=p(C)=p(D)=p(E)=1/6$
\end_inset

 and 
\begin_inset Formula $p(B)=1/3$
\end_inset

.
 This follows because the corpus can be reduced to 
\begin_inset Formula $\{AB,CB,DE\}$
\end_inset

, so A occurs 1 out of 6 times, B two out of 6 times, etc.
 The total single-letter entropy is thus
\begin_inset Formula 
\begin{align*}
h_{SING} & =-\sum_{w\in\alpha}p(w)\log_{2}p(w)\\
 & =-\frac{4}{6}\log_{2}\frac{1}{6}-\frac{1}{3}\log_{2}\frac{1}{3}\\
 & =\frac{2}{3}-\log_{2}\frac{1}{3}=2.25163
\end{align*}

\end_inset

By contrast, in a random 2-letter corpus, we expect all five letters to
 occur equally often, i.e.
 
\begin_inset Formula $p(w)=1/5$
\end_inset

, which would result in 
\begin_inset Formula $h_{RAND}=-\log_{2}1/5=2.321928$
\end_inset

 and so we see that the total entropy for this corpus is less than that
 of the random corpus.
\end_layout

\begin_layout Standard
The total double-word entropy is
\begin_inset Formula 
\begin{align*}
h_{PAIR} & =-\sum_{w_{1},w_{2}\in\alpha}p(w_{1},w_{2})\log_{2}p(w_{1},w_{2})\\
 & =-\log_{2}\frac{1}{3}=1.5849625
\end{align*}

\end_inset

Compare this to 
\begin_inset Formula $h_{PR-RAND}=-\log_{2}1/25=4.643856$
\end_inset

 for the random 2-letter corpus.
 The pair-entropy is sharply lower.
\end_layout

\begin_layout Standard
What do we know about mutual information? We can also deduce that 
\begin_inset Formula $p(*,B)=2/3$
\end_inset

, 
\begin_inset Formula $p(A,*)=1/3$
\end_inset

 and so 
\begin_inset Formula 
\begin{align*}
MI(A,B) & =\log_{2}p(A,B)-\log_{2}p(A,*)-\log_{2}p(*,B)\\
 & =\log_{2}3/2=0.585
\end{align*}

\end_inset

and likewise 
\begin_inset Formula $MI(C,B)=MI(A,B)$
\end_inset

 while 
\begin_inset Formula $MI(D,E)=\log_{2}3=1.585$
\end_inset

.
 
\end_layout

\begin_layout Standard
By contrast, in a random 2-letter corpus, we expect all possible letter
 pairs to occur equally often, i.e.
 
\begin_inset Formula $p(w_{1},w_{2})=1/25$
\end_inset

, which would result in an 
\begin_inset Formula $MI(w_{1},w_{2})=\log_{2}1=0$
\end_inset

 for all word pairs.
 
\end_layout

\begin_layout Standard
Given this corpus, we wish to deduce the following answer: there is a cluster
 
\begin_inset Formula $\gamma=\{A,C\}$
\end_inset

 and two link relations 
\begin_inset Formula $R(\gamma,B)$
\end_inset

 and 
\begin_inset Formula $R(D,E)$
\end_inset

 occurring with probabilities 
\begin_inset Formula $p(\gamma,B)=p(A,B)+p(C,B)=2/3$
\end_inset

 and 
\begin_inset Formula $p(D,E)=1/3$
\end_inset

.
 Note that 
\begin_inset Formula $p(\gamma,*)=p(A,*)+p(C,*)=2/3$
\end_inset

 so that
\end_layout

\begin_layout Standard
\begin_inset Formula 
\begin{align*}
MI(\gamma,B) & =\log_{2}p(\gamma,B)-\log_{2}p(\gamma,*)-\log_{2}p(*,B)\\
 & =\log_{2}3/2=0.585
\end{align*}

\end_inset

So how do we deduce this?
\end_layout

\begin_layout Standard
Well, consider the reduced space, with 
\begin_inset Formula $N=4$
\end_inset

 letters: 
\begin_inset Formula $\beta=\{\gamma,B,D,E\}$
\end_inset

.
 In this space, only two pairs are observed in the corpus, 
\begin_inset Formula $\gamma B$
\end_inset

 and 
\begin_inset Formula $DE$
\end_inset

 with probabilities as above.
 The single-letter probabilities are 
\begin_inset Formula $p(D)=p(E)=1/6$
\end_inset

 and 
\begin_inset Formula $p(\gamma)=p(B)=1/3$
\end_inset

.
 The single-letter entropy is
\begin_inset Formula 
\begin{align*}
h_{SING}^{red} & =-\sum_{w\in\beta}p(w)\log_{2}p(w)\\
 & =-\frac{2}{6}\log_{2}\frac{1}{6}-\frac{2}{3}\log_{2}\frac{1}{3}\\
 & =\frac{1}{3}-\log_{2}\frac{1}{3}=1.9182958
\end{align*}

\end_inset

This can be compared to the entropy of the random 4-word corpus: 
\begin_inset Formula $h_{RAND}^{red}=-\log_{2}1/4=2$
\end_inset

.
 Note that 
\begin_inset Formula 
\[
h_{RAND}^{red}-h_{SING}^{red}=0.081704>0.070298=h_{RAND}-h_{SING}
\]

\end_inset

In other words, the reduced corpus shows more order than the comparable
 unreduced corpus! Interesting! The above can be written as: 
\begin_inset Formula 
\[
h_{SING}-h_{SING}^{red}=0.333334>0.321928=h_{RAND}-h_{RAND}^{red}
\]

\end_inset


\end_layout

\begin_layout Standard
What about the reduced pair entropy? For this case, we have
\begin_inset Formula 
\begin{align*}
h_{PAIR}^{red} & =-\sum_{w_{1},w_{2}\in\beta}p(w_{1},w_{2})\log_{2}p(w_{1},w_{2})\\
 & =-\frac{2}{3}\log_{2}\frac{2}{3}-\frac{1}{3}\log_{2}\frac{1}{3}\\
 & =-\frac{2}{3}-\log_{2}\frac{1}{3}=0.9182958
\end{align*}

\end_inset

which can be compared to the random-pair entropy of 
\begin_inset Formula $h_{PR-RAND}^{red}=-\log_{2}1/16=4$
\end_inset

.
 The comparable reduction is 
\begin_inset Formula 
\[
h_{PR-RAND}^{red}-h_{PAIR}^{red}=3.081704>3.0588935=h_{PR-RAND}-h_{PAIR}
\]

\end_inset

So again, this wins, but not by a lot.
 Re-ordering, this can be written as:
\begin_inset Formula 
\[
h_{PAIR}-h_{PAIR}^{red}=0.6666667>0.643856=h_{PR-RAND}-h_{PR-RAND}^{red}
\]

\end_inset


\end_layout

\begin_layout Standard
So we seem to have two ways of winning: reducing the overall entropy, both
 for single letters and for pairs, and also finding reductions that are
 strong, even compared to the reduced vocab.
\end_layout

\begin_layout Subsection*
Reductio ad absurdum? No.
\end_layout

\begin_layout Standard
What if we continue on this path, and (incorrectly) reduce to 
\begin_inset Formula $N=3$
\end_inset

 letters, with 
\begin_inset Formula $\delta=\{\gamma,\eta,D\}$
\end_inset

 where 
\begin_inset Formula $\eta=\{B,E\}$
\end_inset

? Then 
\begin_inset Formula $p(\eta)=p(B)+p(E)=1/2$
\end_inset

 
\begin_inset Formula 
\begin{align*}
h_{SING}^{rr} & =-\sum_{w\in\delta}p(w)\log_{2}p(w)\\
 & =-\frac{1}{6}\log_{2}\frac{1}{6}-\frac{1}{3}\log_{2}\frac{1}{3}-\frac{1}{2}\log_{2}\frac{1}{2}\\
 & =\frac{2}{3}-\frac{1}{2}\log_{2}\frac{1}{3}=1.4591479
\end{align*}

\end_inset

and the reduction inequality is
\end_layout

\begin_layout Standard
\begin_inset Formula 
\[
h_{SING}^{red}-h_{SING}^{rr}=0.4591479>0.4150375=h_{RAND}^{red}-h_{RAND}^{rr}
\]

\end_inset

So this inequality allows an inappropriate reduction to take place.
 That implies that we must not use the SING inequality to obtain reductions!
\end_layout

\begin_layout Standard
For the pairs, 
\begin_inset Formula $p(\gamma,\eta)=p(\gamma,B)=2/3$
\end_inset

 and 
\begin_inset Formula $p(D,\eta)=p(D,E)=1/3$
\end_inset

 and everything else being zero.
 Thus one gets:
\begin_inset Formula 
\begin{align*}
h_{PAIR}^{rr} & =-\sum_{w_{1},w_{2}\in\delta}p(w_{1},w_{2})\log_{2}p(w_{1},w_{2})\\
 & =-\frac{2}{3}\log_{2}\frac{2}{3}-\frac{1}{3}\log_{2}\frac{1}{3}\\
 & =-\frac{2}{3}-\log_{2}\frac{1}{3}=0.9182958
\end{align*}

\end_inset

so that 
\begin_inset Formula 
\[
h_{PAIR}^{red}-h_{PAIR}^{rr}=0\ngtr0.830075=h_{PR-RAND}^{red}-h_{PR-RAND}^{rr}
\]

\end_inset

Here, nothing is gained, so the pair inequality blocks the inappropriate
 reduction.
 Consider a different inappropriate reduction to 
\begin_inset Formula $N=3$
\end_inset

: let 
\begin_inset Formula $\epsilon=\{\zeta,B,E\}$
\end_inset

 with 
\begin_inset Formula $\zeta=\{D,\gamma\}$
\end_inset

.
 Then the pair probabilities are 
\begin_inset Formula $p(\zeta,B)=p(\gamma,B)=2/3$
\end_inset

 and 
\begin_inset Formula $p(\zeta,E)=p(D,E)=1/3$
\end_inset

 and again, there is no entropy reduction.
 The other groupings look to be equally ineffective.
\end_layout

\begin_layout Subsection*
Finding Patterns, General Formula
\end_layout

\begin_layout Standard
OK, recast the above section for the (semi-)general case of word-pairs (not
 structures in general).
 So, given a vocabulary of 
\begin_inset Formula $N$
\end_inset

 words, we have 
\begin_inset Formula $h_{RAND}=-\log_{2}\frac{1}{N}=\log_{2}N$
\end_inset

 and 
\begin_inset Formula $h_{RAND}^{red}=\log_{2}(N-1)$
\end_inset

 so that for large 
\begin_inset Formula $N,$
\end_inset

 
\begin_inset Formula $h_{RAND}-h_{RAND}^{red}=\log_{2}\left(N/(N-1)\right)=\log_{2}\left(1+1/(N-1)\right)\approx1/(N\ln2)$
\end_inset

 (the factor of 
\begin_inset Formula $\ln2$
\end_inset

 comes from using base-2 logarithms) and so we have a word-combine winner
 if we can combine words A and C into a cluster 
\begin_inset Formula $\gamma=\{A,C\}$
\end_inset

 such that 
\begin_inset Formula 
\begin{align*}
\frac{1}{N\ln2} & \lesssim h_{SING}-h_{SING}^{red}\\
 & =-\sum_{w\in\alpha}p(w)\log_{2}p(w)+\sum_{w\in\beta}p(w)\log_{2}p(w)\\
 & =-p(A)\log_{2}p(A)-p(C)\log_{2}p(C)+p(\gamma)\log_{2}p(\gamma)\\
 & =p(A)\log_{2}\left(1+\frac{p(C)}{p(A)}\right)+p(C)\log_{2}\left(1+\frac{p(A)}{p(C)}\right)
\end{align*}

\end_inset

where 
\begin_inset Formula $p(\gamma)=p(A)+p(C)$
\end_inset

.
 What's not clear: is this inequality 
\emph on
ever
\emph default
 broken? Or does it always hold? At any rate, from the previous example,
 it seems clear that we should not use the SING inequalities to obtain clusters.
\end_layout
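
\begin_layout Standard
One special case is easy to work out by hand (this is just an illustration,
 not a general answer to the question above).
 Suppose the two merged words are equally frequent, 
\begin_inset Formula $p(A)=p(C)=p$
\end_inset

.
 Then the last line of the derivation collapses to 
\begin_inset Formula 
\[
h_{SING}-h_{SING}^{red}=p\log_{2}(1+1)+p\log_{2}(1+1)=2p
\]

\end_inset

For the toy corpus, 
\begin_inset Formula $p=1/6$
\end_inset

, and this gives 
\begin_inset Formula $1/3$
\end_inset

, in agreement with the value 
\begin_inset Formula $h_{SING}-h_{SING}^{red}=0.333334$
\end_inset

 found earlier.
 In this special case, the inequality simply compares 
\begin_inset Formula $2p$
\end_inset

 against the random-corpus threshold on the left-hand side, so whether it
 holds depends on how frequent the two words are relative to 
\begin_inset Formula $1/N$
\end_inset

.
\end_layout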

\begin_layout Standard
For pairs, it's clear that 
\begin_inset Formula $h_{PR-RAND}-h_{PR-RAND}^{red}\approx2/(N\ln2)$
\end_inset

 which follows as above, given that 
\begin_inset Formula $h_{PR-RAND}=2\log_{2}N$
\end_inset

, etc.
 Writing 
\begin_inset Formula $p(\gamma,w)=p(A,w)+p(C,w)$
\end_inset

, and likewise for the other merged pairs, the corresponding inequality
 is now
\begin_inset Formula 
\begin{align*}
\frac{2}{N\ln2} & \lesssim h_{PAIR}-h_{PAIR}^{red}\\
 & =-\sum_{w_{1},w_{2}\in\alpha}p(w_{1},w_{2})\log_{2}p(w_{1},w_{2})+\sum_{w_{1},w_{2}\in\beta}p(w_{1},w_{2})\log_{2}p(w_{1},w_{2})\\
 & =-\sum_{w\in\alpha\setminus\{A,C\}}\left[p(A,w)\log_{2}p(A,w)+p(C,w)\log_{2}p(C,w)-p(\gamma,w)\log_{2}p(\gamma,w)\right]\\
 & \quad-\sum_{w\in\alpha\setminus\{A,C\}}\left[p(w,A)\log_{2}p(w,A)+p(w,C)\log_{2}p(w,C)-p(w,\gamma)\log_{2}p(w,\gamma)\right]\\
 & \quad-p(A,A)\log_{2}p(A,A)-p(C,A)\log_{2}p(C,A)+p(\gamma,\gamma)\log_{2}p(\gamma,\gamma)\\
 & \quad-p(A,C)\log_{2}p(A,C)-p(C,C)\log_{2}p(C,C)
\end{align*}

\end_inset

So...
\end_layout
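
\begin_layout Standard
As a sanity check, the general expression can be evaluated on the toy corpus
 from the previous section (a sketch, using the convention 
\begin_inset Formula $0\log_{2}0=0$
\end_inset

 for pairs that never occur).
 There, A and C only ever appear on the left, and only next to B, so the
 single surviving term is the 
\begin_inset Formula $w=B$
\end_inset

 term of the first sum: 
\begin_inset Formula 
\begin{align*}
h_{PAIR}-h_{PAIR}^{red} & =-p(A,B)\log_{2}p(A,B)-p(C,B)\log_{2}p(C,B)+p(\gamma,B)\log_{2}p(\gamma,B)\\
 & =-\frac{2}{3}\log_{2}\frac{1}{3}+\frac{2}{3}\log_{2}\frac{2}{3}\\
 & =\frac{2}{3}\log_{2}2=\frac{2}{3}
\end{align*}

\end_inset

This agrees with the value 
\begin_inset Formula $0.6666667$
\end_inset

 obtained directly in the toy problem.
\end_layout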

\begin_layout Section*
8 March 2014
\end_layout

\begin_layout Subsection*
Morphology
\end_layout

\begin_layout Standard
Notes: https://en.wikipedia.org/wiki/Nonconcatenative_morphology
\end_layout

\begin_layout Section*
25 March 2014
\end_layout

\begin_layout Subsection*
Information-Theoretic Clustering
\end_layout

\begin_layout Standard
New references:
\end_layout

\begin_layout Itemize
http://www.cs.utexas.edu/users/inderjit/public_papers/kdd_cocluster.pdf Information-
Theoretic Co-clustering Inderjit S.
 Dhillon, Subramanyam Mallela, Dharmendra S.
 Modha 
\end_layout

\begin_layout Itemize
http://pdf.aminer.org/000/472/364/a_generalized_maximum_entropy_approach_to_bregma
n_co_clustering_and.pdf A Generalized Maximum Entropy Approach to Bregman
 Co-clustering and Matrix Approximation Arindam Banerjee, Inderjit Dhillon,
 Joydeep Ghosh, Srujana Merugu, Dharmendra S.
 Modha Journal of Machine Learning Research 8 (2007) 1919-1986 
\end_layout

\begin_layout Section*
30 March 2014
\end_layout

\begin_layout Standard
The below was going to be a brief note, but I'm turning it into a rough
 draft blog post.
 But after sleeping on it, it seems silly.
 
\end_layout

\begin_layout Subsection*
Freedom and Constraint
\end_layout

\begin_layout Standard
The concepts of freedom and constraint are central to the definition of
 algebra in mathematics.
 So for example, in group theory, the algebraic symbols denoting the elements
 of the group may be arranged freely, in any order desired.
 A given group is then defined by a 'presentation': a set of generating
 symbols, together with a set of equivalences between different orderings.
 Thus, there is the notion of a 'free group', which is merely a set of symbols
 that can be written in arbitrary order, and no further constraints other
 than those of it being a group.
 Groups that aren't free are presented by a collection of equations, which
 state that one certain order of symbols is equivalent to another.
 One says that groups are 'equationally presented'.
 
\end_layout
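
\begin_layout Standard
A pair of standard textbook examples may help here (added only as an illustratio
n; the symbols 
\begin_inset Formula $a,b$
\end_inset

 are arbitrary labels).
 The free group on two symbols has no equations at all, while adding the
 single equation 
\begin_inset Formula $ab=ba$
\end_inset

 gives the free abelian group on two generators: 
\begin_inset Formula 
\[
F_{2}=\langle a,b\mid\;\rangle\qquad\mbox{versus}\qquad\langle a,b\mid ab=ba\rangle\cong\mathbb{Z}\times\mathbb{Z}
\]

\end_inset

In the first, 
\begin_inset Formula $ab$
\end_inset

 and 
\begin_inset Formula $ba$
\end_inset

 are different elements; in the second, the equation constrains them to
 be the same, so that only the number of occurrences of each symbol matters.
\end_layout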

\begin_layout Standard
A more complex example is the term algebra, where the terms may be arranged
 in free order; but the combination of the written symbols on the page is
 constrained to those of the 'signature' of the algebra.
 One then has the notion of an 'equational theory', which is a term algebra
 with additional equations between expressions, indicating which expressions
 should be taken as equivalent.
\end_layout

\begin_layout Standard
These have strong, and even precise analogues in linguistics.
 But first, continuing with the mathematical observations: the signature
 of a term algebra can be viewed as defining the 'syntax' of the symbolic
 notation: a Turing machine, tasked with the need to recognize the 'language'
 of the term algebra, would process input symbols one by one.
 It would appear that term algebras have a context-free syntax, and are
 thus recognizable by a push-down automaton.
 That is, one must recognize the function symbol, the open and close parens,
 the commas separating arguments, and the constant symbols.
 The arbitrary-depth recursiveness is the only reason why the push-down
 is needed; otherwise the language seems 'almost regular'.
 (Hmm ...
 is there any formal definition/distinction of this case? i.e.
 for very simple context-free languages, vs.
 'more complex' ones? Not that I know of ...)
\end_layout
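
\begin_layout Standard
To make the context-free claim concrete, here is one possible grammar for
 the written form of terms.
 This is a sketch under my own labelling: 
\begin_inset Formula $T$
\end_inset

 stands for a term, 
\begin_inset Formula $A$
\end_inset

 for an argument list, 
\begin_inset Formula $c$
\end_inset

 for any constant symbol and 
\begin_inset Formula $f$
\end_inset

 for any function symbol of the signature; the parentheses and the comma
 are literal symbols.
 
\begin_inset Formula 
\begin{align*}
T & \to c\;\mid\;f\,(\,A\,)\\
A & \to T\;\mid\;T\,,\,A
\end{align*}

\end_inset

The nesting of 
\begin_inset Formula $T$
\end_inset

 inside 
\begin_inset Formula $A$
\end_inset

 inside 
\begin_inset Formula $T$
\end_inset

 is the arbitrary-depth recursion that calls for the stack.
 As written, this grammar does not check that each function symbol gets
 the number of arguments its arity demands; for a finite signature, that
 could be handled by giving each function symbol its own production.
\end_layout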

\begin_layout Standard
In linguistics, similar notions of freedom and constraint arise, but seem
 to be more of a surprise and mystery to linguists.
 Thus, for example, in 
\begin_inset CommandInset citation
LatexCommand cite
key "Anderson2012"
literal "true"

\end_inset

, Anderson describes the syntax and morphotactics of Kwakw’ala, a Wakashan
 language of coastal British Columbia.
 The syntax of the language (that is, the order in which the words can appear
 in a sentence) is very strict: the verb must be followed by a subject,
 optionally followed by the object, and then a prepositional phrase.
 Similarly adjectives must always precede the noun.
 The language also has a rich morphology: words are assembled from stems
 and suffixes.
 The rules for assembling a word out of stems and affixes are referred to
 as the 'morphotactics'.
 In Kwakw’ala, it would appear that the morphotactics is utterly distinct
 from the syntax: here, object-denoting prefixes can precede verbs, adjective-de
noting suffixes follow a noun.
 Anderson finds this quite remarkable: the language has two distinct kinds
 of structure-imposing systems: the syntax and the morphotactics, and they
 are quite different.
 He notes that this dual structure in turn allows the same thing to be said
 in multiple ways.
 One may take meaning-parts, as morphemes, glue them together morphotactically
 into words, and arrange these in a sentence.
 Alternately, one may take the meaning-parts separately, as individual
 words, and glue them together into a sentence, having a different sequence
 of the meaning-parts.
\end_layout

\begin_layout Standard
The part that struck me with Anderson's analysis is the similarity of the
 phenomena to the analogous behaviour formalized in mathematics.
 Let's first look at a second example: Lithuanian has a rich morphotactical
 structure: verbs and adverbs are conjugated, nouns and adjectives are declined;
 the rules for doing so are rather fixed and uniform, making adjustments
 mostly for phonological reasons (i.e.
 with exceptions based on constraints that come from the natural flow of
 the sound sequences constrained by the use of vocal cords, mouth, tongue
 and lips).
 Curiously, Lithuanian is almost devoid of syntactic constraint: word-order
 can be chosen freely (in the mathematical sense!), and the meaning of the
 resulting sentences is essentially the same (if I am allowed to gloss over
 the notion that different word orders can serve to highlight or emphasize
 different themes and rhemes).
 So again: a language with very distinct syntax and morphotactics; in this
 case, the syntax being almost absent.
 
\end_layout

\begin_layout Standard
I used the theory of Link Grammar for performing structural linguistic analysis.
 The theory was originally developed to model syntactic structure, but it
 also appears to be entirely adequate for morphotactic analysis as well
 (certainly, for 'agglutinative' or 'concatenative' languages, with ongoing
 research into more complex morphologies).
 From the point of view of a linguist, Link Grammar appears to be 'just
 another theory of syntax', being a kind of dependency grammar.
 From the point of view of a mathematician, the situation is entirely more
 remarkable.
 It appears that the mathematical definition of what constitutes a 'link
 grammar' is isomorphic to that of a 'categorical grammar', and that the
 correspondence is immediate and direct.
 Categorical grammars are interesting because they have a direct, formal
 mathematical definition that is studied and classified by mathematicians:
 roughly speaking, categorical grammars are 'non-symmetric compact closed
 monoidal categories'.
 The precise definition here has been championed by Bob Coecke ref [xxx]
 
\end_layout

\begin_layout Standard
It takes some study of category theory to understand what this means, but,
 roughly speaking, it means that sequences of sounds, morphemes, words are
 analyzed in sequential order: by means of short-distance groupings of left-righ
t arrangements.
 This may sound silly, as, of course, sequential things occur in a sequence,
 but it helps highlight the difference between dependency grammars and phrase-st
ructure grammars, or computer-science grammars in general.
 An example of a 'computer-science grammar' is the so-called 'context-free
 grammar'.
 A hallmark of such grammars is that they allow recursion to arbitrary depths.
 An English-language example would be the sequence of sentences: 
\begin_inset Quotes eld
\end_inset

This is a house
\begin_inset Quotes erd
\end_inset

, 
\begin_inset Quotes eld
\end_inset

This thing is a house.
\begin_inset Quotes erd
\end_inset

 
\begin_inset Quotes eld
\end_inset

This thing is a thing that is a house.
\begin_inset Quotes erd
\end_inset

 
\begin_inset Quotes eld
\end_inset

This thing is a thing that is a thing that is ...
 a house.
\begin_inset Quotes erd
\end_inset

 The example is silly because no one ever talks that way.
 The phrase-structure analysis of this would be 
\begin_inset Quotes eld
\end_inset

(S (VP (NP (VP (NP ...
 )))))
\begin_inset Quotes erd
\end_inset

, with the hierarchical arrangement emphasized.
 Dependency grammars can also parse such sentences, but here, the arrangement
 of dependencies is in the form of arrows that point from head word to
 dependent word; the arrows are only rarely long-range, and usually point
 to the immediately-surrounding words.
 There is strong psycho-linguistic evidence for such local structure in
 language, see for example [xxxx].
 That is, the workings of the human mind are not recursive in nature, pushing
 and popping an arbitrarily deep stack as each new noun-phrase or verb-phrase
 is encountered.
 Indeed, psychological studies with constructed sentences similar to the
 above, but varying the 'thing' and 'house' at each depth, show that humans
 quickly lose track after just two or three nestings [need ref].
 In essence, the human mind is adapted for linear sequential analysis, and
 long-range order between words is challenging: this is the psycho-linguistic
 argument for dependency grammars.
 From the mathematical point of view, the statement is that human languages
 are not so much context-free, as they are non-symmetric compact closed
 monoidal categories.
 That Link Grammar is an example of the latter is why it seems so appropriate
 to use for syntactic and morpho-tactic structural analysis.
\end_layout

\begin_layout Standard
Which theories of language are mathematically isomorphic? That is, Link
 Grammar and categorical grammars seem to be isomorphic because there is
 a simple way of translating the one into the other, and vice-versa (although
 no formal mathematical proof of this has been written down).
 A mathematical proof of equivalence is a mechanical device: given one represent
ation, one turns a crank to obtain the other.
 More generally, it's been argued that phrase-structure grammars and dependency
 grammars are equivalent in the same sense: there is an algorithm that converts
 the one into the other, and v.v.[where's the ref for this?].
 Does this mean that non-symmetric compact closed monoidal categories have
 context-free grammars as their internal language, and that every context-free
 language has a corresponding monoidal category? I think not, but the answer
 to that, the 'why not', and the 'what, then, is the difference?' is entirely
 unclear.
 Clarifying these relationships seems important for putting language study
 on a firmer basis.
\end_layout

\begin_layout Standard
Anyway, the point here was to clarify the boundaries between freedom and
 constraint.
 Traditional phrase-structure grammars were inspired by notions from 1960's-era
 computer science, but now seem slavishly wedded to the same ideas, to the
 detriment of closer linguistic understanding.
 Dependency grammars seem to be more psycho-linguistically valid, but have
 suffered from a lack of mathematical formalism that elucidates freedom
 and constraint.
 This lack of formalism makes it hard to explain why some constructions
 are grammatically correct, and others are not.
 It also seems to draw an artificial and confusing line between syntactic
 and morphotactic structure, when, in fact, these really should be taken
 as a part of a continuum of structure.
 I see no reason why a single grammar could not also describe the allomorphic
 variations in pronunciation.
 After all, these are just a set of rules that govern how a morpheme is
 pronounced, and this is essentially a linear, sequential phenomenon, with
 only (mostly?) nearest-neighbor morphemes affecting one-another.
 The nearest-neighbor aspect of this fairly well screams out 'dependency'.
\end_layout

\begin_layout Standard
Another curious and interesting language-constraint structure emerges with
 the study of idioms and institutionalized, set phrases.
 Because these are 'phrases', built of 'words', it would naively seem that
 these lie in the domain of syntax.
 But this is misleading.
 Institutionalized utterances are those where neither the word-choice nor
 the word-order are directly governed by syntax alone, but instead seem
 to be frozen into a fixed form.
 So, one talks of the 'time of day', but never of 'pressure of air' or 'height
 of mountain' – 
\begin_inset Quotes eld
\end_inset

What pressure of air should I put in this tire?
\begin_inset Quotes erd
\end_inset

 
\begin_inset Quotes eld
\end_inset

What height of mountain do you plan to climb?
\begin_inset Quotes erd
\end_inset

 
\begin_inset Quotes eld
\end_inset

What time of day do you expect to come over?
\begin_inset Quotes erd
\end_inset

.
 There is nothing syntactic that prevents such a choice of wording, and
 the semantic meaning is more or less clear: it's just that such word arrangements
 simply don't happen.
 It's as if the lexis for English has a phrase in it: 
\begin_inset Quotes eld
\end_inset

time-of-day
\begin_inset Quotes erd
\end_inset

, which should be treated as a single word, rather than the three words
 it is written as.
 This provides the first hint of the role of probability in this discussion:
 the probability of seeing the phrase 'height of mountain' in English approaches
 zero: in fact, this text that you are reading right now just might be the
 only place ever in the history of the world in which this phrase has appeared
 ...
 despite it being 'grammatically valid'.
 Freedom and constraint aren't just governed by true-false distinctions,
 but by probabilities.
 The question then is, 'what is the most natural way in which to express
 such probabilities?' 
\end_layout

\begin_layout Standard
The last is not just some idle intellectual question, but in fact, an engineerin
g question: the proper structure should have an immediate and direct effect
 on how well, and how quickly a language could be learned, via unsupervised
 machine-learning algorithms.
 A universal but naive attitude in the artificial-intelligence community
 is that 'oh.
 everything is a neural net, and we should use neural nets to build AI.'
 Less frequently, one may see a similar attitude regarding Hidden Markov
 Models (HMMs).
 The fact that such naive approaches lead to algorithms that fail to converge
 quickly leads to ideas such as 'deep learning': a modification that explicitly
 splits a problem into layers, with explicit feedback between layers.
 Another variation used to escape the trap is to explicitly model what is
 un-known: this is the notion of maximum entropy (MaxEnt).
 Traditional AI was also founded on logic and reasoning, and, for many decades,
 AI was dominated by the exploration of boolean-valued logic.
 By this I mean anything with crisp, sharp truth values: whether first-order
 logic, boolean satisfiability, satisfiability-modulo-theories, stable-model
 semantics, and so on.
 Another corner was fuzzy logic, but that didn't seem to have legs.
 Notions of maximum entropy and probability can be unified: thus, one has
 Markov Logic Networks (MLN).
 What I'm wondering about here is that maybe none of these approaches are
 correct, because they are ignoring the actual structure that is in front
 of us.
\end_layout

\begin_layout Standard
So, perhaps, the correct approach is not to marry maximum entropy with first-ord
er logic, but to marry maximum entropy to dependency grammars (or, equivalently,
 to appropriate monoidal categories).
 The question then becomes: what is the appropriate monoidal category? Picking
 the wrong one will lead to disastrous machine learning performance (this,
 I think, is the lesson from neural networks).
 Picking something too easy doesn't get you far enough (the lesson of HMMs
 – excellent for certain classes of problems, but lacking in scale).
 There are more choices than that, but the choices, and their inter-connectedness
, and trade-offs, seem to be unarticulated.
 For any given monoidal category, there would seem to be some probabilistic
 model corresponding to that category's internal language.
 That is, there is a way of describing the transition probabilities from
 state to state.
 Indeed, (finite) monoidal categories, in the form of acts, can be partly
 understood to be finite state machines acting on a set.
 The probabilistic generalization of this leads both to probabilistic and
 quantum finite automata, with the former having a strong resemblance, if
 not identity, to Markov chains, with the corresponding acts being HMMs.
 My hypothesis is that probabilistic dependency grammars will lead to machine
 learning algos that converge more rapidly than the similar-but-different
 HMM that can also be mapped onto the same problem.
 Unfortunately, my hypothesis is impeded by my lack of understanding of
 exactly how the different approaches named above may be equivalent,
 isomorphic, or merely similar.
\end_layout

\begin_layout Section*
2 April 2014
\end_layout

\begin_layout Subsection*
Link Grammar and Finite State Transducers
\end_layout

\begin_layout Standard
Claim: Finite state transducers, such as those used for morphological analysis,
 can be mapped to a Link Grammar.
 This implies that Link Grammar parsing can be used for morphological analysis,
 thus bringing syntactic parsing and morphological analysis into a single
 framework.
 A finite state transducer (FST) is defined as:
\end_layout

\begin_layout Itemize
A set of states 
\begin_inset Formula $Q$
\end_inset


\end_layout

\begin_layout Itemize
A set 
\begin_inset Formula $\Sigma$
\end_inset

 of input symbols (surface form)
\end_layout

\begin_layout Itemize
A set 
\begin_inset Formula $\Gamma$
\end_inset

 of output symbols (lexicalized form)
\end_layout

\begin_layout Itemize
A transition relation 
\begin_inset Formula $\delta\subset Q\times(\Sigma\cup\{\epsilon\})\times(\Gamma\cup\{\epsilon\})\times Q$
\end_inset


\end_layout

\begin_layout Standard
A member 
\begin_inset Formula $(r,a,b,s)\in\delta$
\end_inset

 should be thought of as the arrow from state 
\begin_inset Formula $r$
\end_inset

 to state 
\begin_inset Formula $s$
\end_inset

, the arrow being taken when the input symbol is 
\begin_inset Formula $a$
\end_inset

 and as a result producing the output symbol 
\begin_inset Formula $b$
\end_inset

.
 The corresponding link-grammar dictionary entry for this would be 
\begin_inset listings
inline false
status open

\begin_layout Plain Layout

     a.b: r- & s+;
\end_layout

\end_inset


\end_layout

\begin_layout Standard
This states that no linkage is possible, unless the previous link resulted
 in the emission of the r+ connector.
 No transition to the next state is possible, unless that state has an s-
 connector on it.
\end_layout
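
\begin_layout Standard
As a minimal sketch of this rule, assume a hypothetical FST with just two
 transitions, 
\begin_inset Formula $\delta=\{(q_{0},a,x,q_{1}),(q_{1},b,y,q_{2})\}$
\end_inset

; the state and symbol names are invented for illustration, and the two
 wall entries are an assumption about how the chain would be anchored at
 its start and end states.
 Applying the rule to each transition would give something like: 
\begin_inset listings
inline false
status open

\begin_layout Plain Layout

     LEFT-WALL: q0+;
\end_layout

\begin_layout Plain Layout

     a.x: q0- & q1+;
\end_layout

\begin_layout Plain Layout

     b.y: q1- & q2+;
\end_layout

\begin_layout Plain Layout

     RIGHT-WALL: q2-;
\end_layout

\end_inset

The input sequence a, b would then link as a single chain running from LEFT-WALL
 through a and b to RIGHT-WALL, with one link per state visited, and the
 output x, y read off the word subscripts in order.
\end_layout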

\begin_layout Standard
The current link-grammar notation a.b is awkward for printing, and perhaps
 some new style is needed to distinguish the output to be printed from the
 input that is recognized.
 Thus, it might be better to invent a new notation, perhaps a$b,
 to denote that a is recognized, and that b is printed.
 
\end_layout

\begin_layout Standard
Note that the above definition of link-grammar rules results in a very simple,
 linear linkage: state transitions follow one another in linear order.
 Link grammar allows richer, more complex linkage diagrams, and so the question
 arises: can a given FST be compactified into a smaller system by making
 use of the richer possibilities that link-grammar offers? How can this
 compactification be achieved?
\end_layout

\begin_layout Standard
Suppose that the FST 
\begin_inset Formula $\delta$
\end_inset

 includes as a subset the state transitions 
\begin_inset Formula $\{(r,a,?,s),(s,\epsilon,\epsilon,t),(s,b,?,t),(t,c,?,u)\}$
\end_inset

.
 The symbol ? is used here as a don't-care output symbol, as it is irrelevant
 to the discussion that follows.
 The above state transitions indicate that when the system is in state
 
\begin_inset Formula $s$
\end_inset

, it may spontaneously transition to state 
\begin_inset Formula $t$
\end_inset

, or may do so upon reading 
\begin_inset Formula $b$
\end_inset

.
 That is, the presence of 
\begin_inset Formula $b$
\end_inset

 is optional in the state transition.
 The 
\begin_inset Quotes eld
\end_inset

natural
\begin_inset Quotes erd
\end_inset

 way of indicating this with link-grammar notation is using the link-grammar
 dictionary entries: 
\begin_inset listings
inline false
status open

\begin_layout Plain Layout

   a: r- & s+;
\end_layout

\begin_layout Plain Layout

   b: t+;
\end_layout

\begin_layout Plain Layout

   c: {t-} & s- & u+;
\end_layout

\end_inset

Because the transition 
\begin_inset Formula $(s,\epsilon,\epsilon,t)$
\end_inset

 reads no input, and produces no output, the state transitions would more
 likely be written as 
\begin_inset Formula $\{(r,a,?,t),(r,a,?,s),(s,b,?,t),(t,c,?,u)\}$
\end_inset

, that is, by collapsing the transition 
\begin_inset Formula $(s,\epsilon,\epsilon,t)$
\end_inset

 into the prior state.
 This would have the entries
\end_layout

\begin_layout Standard
\begin_inset listings
inline false
status open

\begin_layout Plain Layout

   a: r- & (s+ or t+);
\end_layout

\begin_layout Plain Layout

   b: s- & t+;
\end_layout

\begin_layout Plain Layout

   c: t- & u+;
\end_layout

\end_inset


\end_layout

\begin_layout Standard
How should this be understood? These are, in fact, two distinct, inequivalent
 LG grammars, as can be seen by considering the parse of the strings 
\begin_inset Quotes eld
\end_inset

ac
\begin_inset Quotes erd
\end_inset

 and 
\begin_inset Quotes eld
\end_inset

abc
\begin_inset Quotes erd
\end_inset

 for the two cases.
\end_layout
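
\begin_layout Standard
To spell this out, here is a sketch of the linkages that each set of entries
 would seem to allow, ignoring the r and u connectors that attach to whatever
 surrounds the fragment.
 It assumes the usual link-grammar conventions that each connector is used
 at most once, and that, of two left-pointing connectors joined by &, the
 one listed first links to the nearer word.
 For the first set of entries: 
\begin_inset listings
inline false
status open

\begin_layout Plain Layout

   ac:   a --s-- c
\end_layout

\begin_layout Plain Layout

   abc:  a --s-- c (passing over b),  b --t-- c
\end_layout

\end_inset

and for the second set: 
\begin_inset listings
inline false
status open

\begin_layout Plain Layout

   ac:   a --t-- c
\end_layout

\begin_layout Plain Layout

   abc:  a --s-- b --t-- c
\end_layout

\end_inset

Both sets accept the same two strings, but the first attaches b as an optional
 side-branch hanging off c, while the second threads b into the linear chain;
 the link that joins a to c in the shorter string also differs (s versus
 t).
\end_layout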

\begin_layout Standard
When would weighting schemes interfere? When would output interfere?
\end_layout

\begin_layout Section*
15 April 2014
\end_layout

\begin_layout Subsection*
Elegant Normal Form
\end_layout

\begin_layout Standard
Or, more precisely, 
\begin_inset Quotes eld
\end_inset

Minimal Normal Form
\begin_inset Quotes erd
\end_inset

.
 Instead of writing out LG disjuncts in long strings of DNF or CNF, where
 they blow up into the thousands or tens of thousands, we really need to
 write them in Craig Holman's "Elegant Normal Form" ( http://www.patterncraft.com
/Blog/Blog-080609.html#ElegantNormalForm ) format.
 This is to be done by entropy minimization, in two different ways: first,
 ENF reduces the total count of terms, for just one single expression.
 Second, and maybe more important: different words will share significant
 subsets of the ENF expression.
 So, for example, the LG English dicts define:
\end_layout

\begin_layout Standard
\begin_inset listings
inline false
status open

\begin_layout Plain Layout

     <verb-rq>: Rw- or ({Ic-} & Q- & <verb-wall>) or [()];
\end_layout

\end_inset

which is (1) in ENF, not DNF or CNF, and (2) shared by several dozen words.
 There should be a strong push to discover such common sub-expressions across
 many words.
\end_layout

\begin_layout Section*
28 April 2014
\end_layout

\begin_layout Subsection*
Isotopy
\end_layout

\begin_layout Standard
The concept of 
\begin_inset Quotes eld
\end_inset

isotopy
\begin_inset Quotes erd
\end_inset

 (https://en.wikipedia.org/wiki/Isotopy_(semiotics)) was introduced by Algirdas
 Greimas in 1966.
 Example: 
\begin_inset Quotes eld
\end_inset

I drink some water
\begin_inset Quotes erd
\end_inset

, with the meanings of 
\begin_inset Quotes eld
\end_inset

drink
\begin_inset Quotes erd
\end_inset

 and 
\begin_inset Quotes eld
\end_inset

water
\begin_inset Quotes erd
\end_inset

 reinforcing one another.
 But this is exactly what the Mihalcea WSD algo does, eh? 
\end_layout

\begin_layout Section*
14 May 2014
\end_layout

\begin_layout Subsection*
Tree similarity
\end_layout

\begin_layout Standard
\begin_inset Quotes eld
\end_inset

Similarity Evaluation on Tree-structured Data
\begin_inset Quotes erd
\end_inset

 Rui Yang, Panos Kalnis, Anthony K.
 H.
 Tung, SIGMOD 2005, June 14-16, 2005.
 Quote from abstract: 
\begin_inset Quotes eld
\end_inset

propose to transform tree-structured data into an approximate numerical
 multidimensional vector
\begin_inset Quotes erd
\end_inset

.
 Funny – that's what Bob Coecke proposes for any kind of monoidal category:
 vector spaces being a special monoidal cat.
 Hmmm.
 
\end_layout

\begin_layout Standard
Approaches:
\end_layout

\begin_layout Itemize
Tree-edit distance: many variants proposed, all highly CPU- and memory-intensive.
\end_layout

\begin_layout Itemize
Convert the tree to pre- or post-order, and use string edit distance.
\end_layout

\begin_layout Itemize
Convert to binary tree.
 For combo trees, this makes sense, due to the associative property of most
 of the operators.
 In particular, in combo any operator that can have multiple siblings is also
 associative and thus convertible to a binary tree.
 What's more, trees with binary branch distance of zero really are equivalent
 for us: see Figure 4 in the above reference.
 Yay! This fits very, very well with combo!
\end_layout

\begin_layout Section*
29 June 2014
\end_layout

\begin_layout Subsection*
Morphology Basic Claims
\end_layout

\begin_layout Standard
We have two tasks to address: the automated discovery of morpheme boundaries,
 and the automated discovery of 
\begin_inset Quotes eld
\end_inset

morphotactics
\begin_inset Quotes erd
\end_inset

, the syntax of connected morphemes.
 We make two claims: first,
 the automated discovery of morpheme boundaries can be accomplished by searching
 for breaks between word-parts that have the lowest mutual information.
 Second, the discovery of morphotactics is identical to the discovery of
 syntax, as outlined above.
\end_layout
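
\begin_layout Standard
As a reminder of the quantity involved, and assuming that the standard pointwise
 form of mutual information is what is meant, the score for splitting a
 word into a left part 
\begin_inset Formula $x$
\end_inset

 and a right part 
\begin_inset Formula $y$
\end_inset

 would be 
\begin_inset Formula 
\[
MI(x,y)=\log_{2}\frac{P(x,y)}{P(x,*)\,P(*,y)}
\]

\end_inset

where 
\begin_inset Formula $P(x,*)=\sum_{y}P(x,y)$
\end_inset

 and 
\begin_inset Formula $P(*,y)=\sum_{x}P(x,y)$
\end_inset

 are the marginal probabilities of the two parts.
 The base-2 logarithm is a choice made here for illustration.
 Under the first claim, the preferred break in a word is the split with
 the smallest such score.
\end_layout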

\begin_layout Standard
The simplest approach to finding the breaks between morphemes is to randomly
 break up words into two parts.
 A worked example of this is given below.
 Several questions present themselves:
\end_layout

\begin_layout Itemize
To discover morphemes of words that split into three or more parts, is it
 better to always split pairwise, and then perform recursion, or is it easier
 to split into multiple parts immediately? Perhaps the answer is language-depend
ent?
\end_layout

\begin_layout Itemize
Does one obtain better morphological splits by immediately including morphotactic
 analysis, or can this be deferred?
\end_layout

\begin_layout Subsection*
Morphology Worked Example
\end_layout

\begin_layout Standard
OK, this will be tedious, but I see no alternative.
 Suppose we have the corpus 
\begin_inset Quotes eld
\end_inset

test gift tester testy gifty tester gifter
\begin_inset Quotes erd
\end_inset

 so that 
\begin_inset Quotes eld
\end_inset

tester
\begin_inset Quotes erd
\end_inset

 appears twice in the corpus.
 Explore all possible splits into two parts.
 The 4-letter words split 3 ways, the 5-letter words split 4 ways, etc.
 so there is a total of N(*,*)=3+3+5+4+4+5+5=29 pairs.
 All pairs appear once, except for those of tester, which appear twice.
 Viz.
\begin_inset Newline newline
\end_inset

 
\end_layout

\begin_layout Standard
P(x,y)=1/29 for (x,y) in {(t,est), (te,st), (tes,t), (g,ift), (gi,ft), (gif,t),
 (t,esty), (te,sty), (tes,ty), (test,y), (g,ifty), (gi,fty), (gif,ty), (gift,y),
 (g,ifter), (gi,fter), (gif,ter), (gift,er), (gifte,r)}
\begin_inset Newline newline
\end_inset


\begin_inset Newline newline
\end_inset

and
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
P(x,y)=2/29 for (x,y) in {(t,ester), (te,ster), (tes,ter), (test,er), (teste,r)}
\begin_inset Newline newline
\end_inset


\begin_inset Newline newline
\end_inset

There is a bit of a procedural error in the above; we would like to discover
 the 
\begin_inset Quotes eld
\end_inset

null suffix
\begin_inset Quotes erd
\end_inset

, that is, that 
\begin_inset Quotes eld
\end_inset

test
\begin_inset Quotes erd
\end_inset

, 
\begin_inset Quotes eld
\end_inset

gift
\begin_inset Quotes erd
\end_inset

, with nothing following it, are morphemes, so that the possible suffixes
 are 
\begin_inset Quotes eld
\end_inset

-y
\begin_inset Quotes erd
\end_inset

, 
\begin_inset Quotes eld
\end_inset

-er
\begin_inset Quotes erd
\end_inset

 and 
\begin_inset Quotes eld
\end_inset

-nothing
\begin_inset Quotes erd
\end_inset

.
 However, the above failed to count this possibility separately.
 Thus, given the above data, what we expect to find are two roots: 
\begin_inset Quotes eld
\end_inset

gif-
\begin_inset Quotes erd
\end_inset

 and 
\begin_inset Quotes eld
\end_inset

tes-
\begin_inset Quotes erd
\end_inset

 and three suffixes: 
\begin_inset Quotes eld
\end_inset

-t
\begin_inset Quotes erd
\end_inset

, 
\begin_inset Quotes eld
\end_inset

-ty
\begin_inset Quotes erd
\end_inset

 and 
\begin_inset Quotes eld
\end_inset

-ter
\begin_inset Quotes erd
\end_inset

.
 This is not so bad.
 If we did split and count in such a way as to allow a null suffix, it would
 be ambiguous as to whether the stems end with a 
\begin_inset Quotes eld
\end_inset

t
\begin_inset Quotes erd
\end_inset

 or not.
 That is, the with-t and without-t stems would have been equally likely...
 Anyway, moving on...
 the possible splits are shown in Table 
\begin_inset CommandInset ref
LatexCommand ref
reference "tab:Word-Split-Table"

\end_inset

:
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
\begin_inset Float table
wide false
sideways false
status open

\begin_layout Plain Layout
\begin_inset Caption Standard

\begin_layout Plain Layout
Word Split Table
\begin_inset CommandInset label
LatexCommand label
name "tab:Word-Split-Table"

\end_inset


\end_layout

\end_inset


\end_layout

\begin_layout Plain Layout
\begin_inset Tabular
<lyxtabular version="3" rows="20" columns="12">
<features tabularvalignment="middle">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<row>
<cell alignment="center" valignment="top" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
g
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
gi
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
gif
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
gift
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
gifte
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
t
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
te
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
tes
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
test
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
teste
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
row total
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
ifter
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
fter
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
ter
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
3
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
er
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
3
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
r
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
3
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
ifty
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
fty
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
ty
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
y
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
ift
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
ft
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
t
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
ester
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
ster
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
esty
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
sty
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
est
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
st
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
column total
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
3
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
3
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
3
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
4
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
4
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
4
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
3
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
29
\end_layout

\end_inset
</cell>
</row>
</lyxtabular>

\end_inset


\end_layout

\begin_layout Plain Layout
The above is a sparse matrix showing the possible word splits.
 Empty cells contain a count of zero.
\end_layout

\end_inset


\end_layout
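
\begin_layout Standard
As a quick sanity check on the grand total, here is a sketch that assumes
 the corpus behind the table is gift, gifty, gifter, test and testy counted
 once each, and tester counted twice.
 This word list is inferred from the row and column totals; it is not stated
 in the table itself.
 A word of 
\begin_inset Formula $n$
\end_inset

 letters has 
\begin_inset Formula $n-1$
\end_inset

 split points, so writing 
\begin_inset Formula $c(w)$
\end_inset

 for the count of word 
\begin_inset Formula $w$
\end_inset

 and 
\begin_inset Formula $N$
\end_inset

 for the total number of splits (both symbols are shorthand introduced only
 for this check): 
\begin_inset Formula 
\[
N=\sum_{w}c(w)\left(\left|w\right|-1\right)=3+4+5+3+4+2\cdot5=29
\]

\end_inset

 which agrees with the lower-right corner of the table.
\end_layout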

\begin_layout Standard
Next, let's do the partial sums.
 Recall the notation for the partial summation of pairs.
 Writing P(x,y) for the probability of observing the 
\emph on
ordered
\emph default
 pair of items (x,y), the partial sums are: 
\begin_inset Formula 
\[
P(x,*)=\sum_{y\in Y}P(x,y)
\]

\end_inset

 and 
\begin_inset Formula 
\[
P(*,y)=\sum_{x\in X}P(x,y)
\]

\end_inset

 The left-hand sums are the column totals in table 
\begin_inset CommandInset ref
LatexCommand ref
reference "tab:Word-Split-Table"

\end_inset

:
\begin_inset Newline newline
\end_inset


\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
P(t,*) = (1+1+2)/29 = 4/29 = P(te,*) = P(tes,*)
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
P(g,*) = (1+1+1)/29 = 3/29 = P(gi,*) = P(gif,*)
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
P(test,*) = (1+2)/29 = 3/29
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
P(teste,*) = 2/29
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
P(gift,*) = 2/29
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
P(gifte,*) = 1/29
\begin_inset Newline newline
\end_inset


\begin_inset Newline newline
\end_inset

Next, the right-hand partial sums.
 These are the row totals in table 
\begin_inset CommandInset ref
LatexCommand ref
reference "tab:Word-Split-Table"

\end_inset

:
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
P(*,est) = 1/29 = P(*,st) = P(*,esty) = P(*,sty) = P(*,ift) = P(*,ft) = P(*,ifty)
 = P(*,fty) = P(*,ifter) = P(*,fter)
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
P(*,t) = (1+1)/29 = 2/29 = P(*,ty) = P(*,y)
\end_layout

\begin_layout Standard
P(*,ester) = 2/29 = P(*,ster)
\end_layout

\begin_layout Standard
P(*,ter) = (1+2)/29 = 3/29 = P(*,er) = P(*,r)
\begin_inset Newline newline
\end_inset


\begin_inset Newline newline
\end_inset
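
As a consistency check, using only the values listed above, both sets of
 partial sums should add up to one, since each runs over all 29 splits.
 Grouping the values by size gives 
\begin_inset Formula 
\[
\sum_{x}P(x,*)=\frac{3\cdot4+3\cdot3+3+2+2+1}{29}=1,\qquad\sum_{y}P(*,y)=\frac{10\cdot1+3\cdot2+2\cdot2+3\cdot3}{29}=1
\]

\end_inset

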

Now, compute the MI (we use log=log_2 in all cases below, so that the entropy
 is measured in units of bits).
 Recall the definition of mutual information for 
\emph on
ordered
\emph default
 pairs, previously discussed and given above: 
\begin_inset Formula 
\[
MI(x,y)=\log_{2}\frac{P(x,y)}{P(x,*)P(*,y)}
\]

\end_inset

So, working these by hand:
\begin_inset Newline newline
\end_inset


\end_layout
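
\begin_layout Standard
Each of the hand calculations that follow uses the same simplification, which
 can be sketched once in terms of raw counts.
 The symbols here are shorthand introduced only for this sketch: 
\begin_inset Formula $n(x,y)$
\end_inset

 is the count in the table cell for the split 
\begin_inset Formula $(x,y)$
\end_inset

, 
\begin_inset Formula $n(x,*)$
\end_inset

 and 
\begin_inset Formula $n(*,y)$
\end_inset

 are the corresponding column and row totals, and 
\begin_inset Formula $N=29$
\end_inset

 is the grand total, so that 
\begin_inset Formula $P(x,y)=n(x,y)/N$
\end_inset

.
 Substituting into the definition gives 
\begin_inset Formula 
\[
MI(x,y)=\log_{2}\frac{n(x,y)/N}{\left(n(x,*)/N\right)\left(n(*,y)/N\right)}=\log_{2}\frac{N\,n(x,y)}{n(x,*)\,n(*,y)}
\]

\end_inset

 For instance, the split (tes,ter) has cell count 2, column total 4 and
 row total 3, giving 
\begin_inset Formula 
\[
MI(\mathrm{tes},\mathrm{ter})=\log_{2}\frac{29\cdot2}{4\cdot3}=\log_{2}\frac{29}{6}\approx2.273018
\]

\end_inset


\end_layout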

\begin_layout Standard
MI(t,est) = log P(t,est)/P(t,*)P(*,est) = log (1/29)(29/4)(29/1) = log(29/4)
 = 2.857981
\end_layout

\begin_layout Standard
= MI(te,st) = MI(t,esty) = MI(te,sty) 
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
MI(g,ift) = log P(g,ift)/P(g,*)P(*,ift) = log (1/29)(29/3)(29/1) = log(29/3)
 = 3.273018
\end_layout

\begin_layout Standard
= MI(gi,ft) = MI(g,ifty) = MI(gi,fty) = MI(g,ifter) = MI(gi,fter)
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
MI(tes,t) = log P(tes,t)/P(tes,*)P(*,t) = log (1/29)(29/4)(29/2) = log(29/8)
 = 1.857981
\end_layout

\begin_layout Standard
= MI(tes,ty)
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
MI(gif,t) = log P(gif,t)/P(gif,*)P(*,t) = log (1/29)(29/3)(29/2) = log(29/6)
 = 2.273018
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
MI(test,y) = log P(test,y)/P(test,*)P(*,y) = log (1/29)(29/3)(29/2) = log(29/6)
 = 2.273018
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
MI(gif,ty) = log P(gif,ty)/P(gif,*)P(*,ty) = log (1/29)(29/3)(29/2) = log(29/6)
 = 2.273018
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
MI(gift,y) = log P(gift,y)/P(gift,*)P(*,y) = log (1/29)(29/2)(29/2) = log(29/4)
 = 2.857981
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
MI(gif,ter) = log P(gif,ter)/P(gif,*)P(*,ter) = log (1/29)(29/3)(29/3) =
 log(29/9) = 1.688056
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
MI(gift,er) = log P(gift,er)/P(gift,*)P(*,er) = log (1/29)(29/2)(29/3) =
 log(29/6) = 2.273018
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
MI(gifte,r) = log P(gifte,r)/P(gifte,*)P(*,r) = log (1/29)(29/1)(29/3) =
 log(29/3) = 3.273018
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
MI(t,ester) = log P(t,ester)/P(t,*)P(*,ester) = log (2/29)(29/4)(29/2) =
 log(29/4) = 2.857981
\end_layout

\begin_layout Standard
= MI(te,ster) 
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
MI(tes,ter) = log P(tes,ter)/P(tes,*)P(*,ter) = log (2/29)(29/4)(29/3) =
 log(29/6) = 2.273018
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
MI(test,er) = log P(test,er)/P(test,*)P(*,er) = log (2/29)(29/3)(29/3) =
 log(58/9) = 2.688056
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
MI(teste,r) = log P(teste,r)/P(teste,*)P(*,r) = log (2/29)(29/2)(29/3) =
 log(29/3) = 3.273018
\begin_inset Newline newline
\end_inset


\begin_inset Newline newline
\end_inset

Phew.
 I think that's all of them.
 So, what can we conclude? The basic claim is that the morpheme boundaries
 occur at the places where the letters are the least sticky, the most likely
 to be de-correlated, i.e.
 those with the lowest MI.
 In the above, these are: MI(gif,ter)=1.69 followed by MI(tes,t)=MI(tes,ty)=1.86.
 These are the most likely splits for three of the words: gifter, test and
 testy.
 Let's look up each possible split, for each word.
 We get:
\begin_inset Newline newline
\end_inset


\begin_inset Newline newline
\end_inset


\begin_inset Tabular
<lyxtabular version="3" rows="7" columns="12">
<features tabularvalignment="middle">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Word
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Split
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
MI
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Split
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
MI
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Split
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
MI
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Split
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
MI
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Split
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
MI
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Best
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
gift
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
(g,ift)
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
3.27
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
(gi,ft)
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
3.27
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
(gif,t)
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2.27
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
(gif,t)
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
gifty
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
(g,ifty)
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
3.27
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
(gi,fty)
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
3.27
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
(gif,ty)
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2.27
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
(gift,y)
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2.86
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
(gif,ty)
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
gifter
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
(g,ifter)
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
3.27
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
(gi,fter)
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
3.27
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
(gif,ter)
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1.69
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
(gift,er)
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2.27
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
(gifte,r)
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
3.27
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
(gif,ter)
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
test
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
(t,est)
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2.86
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
(te,st)
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2.86
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
(tes,t)
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1.86
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
(tes,t)
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
testy
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
(t,esty)
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2.86
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
(te,sty)
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2.86
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
(tes,ty)
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1.86
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
(test,y)
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2.27
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
(tes,ty)
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
tester
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
(t,ester)
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2.86
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
(te,ster)
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2.86
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
(tes,ter)
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2.27
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
(test,er)
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2.69
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
(teste,r)
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
3.27
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
(tes,ter)
\end_layout

\end_inset
</cell>
</row>
</lyxtabular>

\end_inset


\begin_inset Newline newline
\end_inset


\begin_inset Newline newline
\end_inset

The best (lowest-MI) split for each word in the above table is summarized below:
\begin_inset Newline newline
\end_inset


\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
\begin_inset Tabular
<lyxtabular version="3" rows="7" columns="3">
<features tabularvalignment="middle">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Word
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Lowest MI split(s)
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
MI
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
gift
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
(gif,t)
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2.27
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
gifty
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
(gif,ty)
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2.27
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
gifter
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
(gif,ter)
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1.69
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
test
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
(tes,t)
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1.86
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
testy
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
(tes,ty)
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1.86
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
tester
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
(tes,ter)
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2.27
\end_layout

\end_inset
</cell>
</row>
</lyxtabular>

\end_inset


\begin_inset Newline newline
\end_inset


\begin_inset Newline newline
\end_inset

What looks like the best split has been found; it certainly matches what
 was expected.
 Yay! After this, link-type clustering proceeds just as before, as if these
 morphemes were distinct words.
 That is, the above has 6 distinct link types; clustering will then proceed
 to discover a single link type, between the clusters {gif, tes} and {t, ty,
 ter}.
\end_layout
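
\begin_layout Standard
The selection rule used in the table can be written compactly.
 As a sketch, in notation that is an assumption of this note and not part
 of the original text, write 
\begin_inset Formula $w_{1:k}$
\end_inset

 for the first 
\begin_inset Formula $k$
\end_inset

 letters of a word 
\begin_inset Formula $w$
\end_inset

 of length 
\begin_inset Formula $n$
\end_inset

, and 
\begin_inset Formula $w_{k+1:n}$
\end_inset

 for the remainder.
 The chosen split point is then the one with the lowest MI:
\begin_inset Formula 
\[
\hat{k}(w)=\arg\min_{1\le k<n}\mathrm{MI}\left(w_{1:k},w_{k+1:n}\right)
\]

\end_inset

For example, for gifter the five candidate values in the table are 3.27,
 3.27, 1.69, 2.27 and 3.27, so 
\begin_inset Formula $\hat{k}=3$
\end_inset

, which gives the split (gif,ter).
\end_layout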

\begin_layout Subsubsection*
Morfessor
\end_layout

\begin_layout Standard
An alternative algorithm is presented in:
\end_layout

\begin_layout Itemize
Mathias Creutz and Krista Lagus, 
\begin_inset Quotes eld
\end_inset

Unsupervised Morpheme Segmentation and Morphology Induction from Text Corpora
 Using Morfessor 1.0
\begin_inset Quotes erd
\end_inset

 http://users.ics.aalto.fi/mcreutz/papers/Creutz05tr.pdf 
\end_layout

\begin_layout Standard
That algorithm works only for concatenative languages, and does not provide
 a morphotactic structure; that is, it cannot learn the grammar governing
 the morphemes.
 It also requires several (plausible) assumptions about Bayesian priors.
 One assumption is that morpheme frequency follows a modified Zipfian
 distribution; this is used to make estimates for morphemes that are observed
 only once in the corpus.
 Another assumption is that the morpheme length distribution can be
 approximated by either a Poisson or a (two-parameter) gamma distribution.
\end_layout

\begin_layout Section*
12 July 2014
\end_layout

\begin_layout Subsection*
Link-type discovery, worked example
\end_layout

\begin_layout Standard
In keeping with the previous, let's look at a super-simplified version of
 link-type discovery, continuing immediately from the previous morpheme-discovery
 example.
 We begin with the initial observations, given in the table below:
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
\begin_inset Tabular
<lyxtabular version="3" rows="7" columns="3">
<features tabularvalignment="middle">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Pair
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Initial Link Type
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
# observations
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
gif–t
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
GA
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
gif–ty
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
GB
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
gif–ter
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
GC
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
tes–t
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
TA
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
tes–ty
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
TB
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
tes–ter
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
TC
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2
\end_layout

\end_inset
</cell>
</row>
</lyxtabular>

\end_inset


\begin_inset Newline newline
\end_inset


\begin_inset Newline newline
\end_inset

The 
\begin_inset Quotes eld
\end_inset

initial link type
\begin_inset Quotes erd
\end_inset

 is handed out randomly; the actual letter string has no bearing on the
 outcome.
 Notice the above has 6 different, unique link types.
 These correspond to the following link-grammar dictionary, written in the
 classic link-grammar notation:
\begin_inset Newline newline
\end_inset


\begin_inset Float algorithm
placement H
wide false
sideways false
status open

\begin_layout Plain Layout
\begin_inset Caption Standard

\begin_layout Plain Layout
Morpheme grammar
\begin_inset CommandInset label
LatexCommand label
name "alg:Morpheme-grammar"

\end_inset


\end_layout

\end_inset


\end_layout

\begin_layout Plain Layout
\begin_inset listings
inline false
status open

\begin_layout Plain Layout

     gif.=: GA+ or GB+ or GC+;
\end_layout

\begin_layout Plain Layout

     tes.=: TA+ or TB+ or TC+;
\end_layout

\begin_layout Plain Layout

     =t: GA- or TA-;
\end_layout

\begin_layout Plain Layout

     =ty: GB- or TB-;
\end_layout

\begin_layout Plain Layout

     =ter: GC- or TC-;
\end_layout

\begin_layout Plain Layout

\end_layout

\end_inset


\end_layout

\begin_layout Plain Layout

\end_layout

\end_inset


\end_layout

\begin_layout Standard
From the above initial dictionary, we want to deduce that a single link
 type is sufficient to fully describe what is happening.
 That is, we wish to discover the following dictionary:
\end_layout

\begin_layout Standard
\begin_inset listings
inline false
status open

\begin_layout Plain Layout

     gif.= tes.=: LL+;
\end_layout

\begin_layout Plain Layout

     =t =ty =ter: LL-;
\end_layout

\begin_layout Plain Layout

\end_layout

\end_inset

This is intuitively obvious, because the morphemes form a clique:
 each stem has been observed with each suffix.
 Technically, this is a bipartite clique or complete bipartite graph of
 order (2,3).
 Here, we see it immediately; however, in general, it is very hard to search
 for bipartite cliques in a grammar: the general problem is known to be
 NP-complete, and the known exact algorithms run in exponential time in the
 worst case.
\end_layout
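
\begin_layout Standard
To make the clique statement concrete, here is a short sketch; the symbols
 
\begin_inset Formula $S$
\end_inset

, 
\begin_inset Formula $X$
\end_inset

 and 
\begin_inset Formula $E$
\end_inset

 are assumptions introduced for this note.
 Let 
\begin_inset Formula $S=\{\mathrm{gif},\mathrm{tes}\}$
\end_inset

 be the stems, 
\begin_inset Formula $X=\{\mathrm{t},\mathrm{ty},\mathrm{ter}\}$
\end_inset

 the suffixes, and 
\begin_inset Formula $E$
\end_inset

 the set of observed stem-suffix pairs.
 Then
\begin_inset Formula 
\[
E=S\times X,\qquad|E|=|S|\cdot|X|=2\cdot3=6
\]

\end_inset

which is exactly the condition for the complete bipartite graph 
\begin_inset Formula $K_{2,3}$
\end_inset

.
 The six initial link types GA, GB, GC, TA, TB, TC label the six edges one
 by one; the single link type LL labels the entire edge set at once.
\end_layout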

\begin_layout Standard
So how should we find grammar reductions?
\end_layout

\begin_layout Standard
Our vocabulary consists of N=5 morphemes 
\begin_inset Formula $\alpha=$
\end_inset

{gif.=, tes.=,
 =t, =ty, =ter}.
 We begin by recomputing the MI for observed pairs, once again starting with
 the initial corpus 
\begin_inset Quotes eld
\end_inset

test gift tester testy gifty tester gifter
\begin_inset Quotes erd
\end_inset

, same as before, with 
\begin_inset Quotes eld
\end_inset

tester
\begin_inset Quotes erd
\end_inset

 appearing twice in the corpus.
 This time, we split strictly according to the learned morphology.
 The word split table is:
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
\begin_inset Tabular
<lyxtabular version="3" rows="5" columns="4">
<features tabularvalignment="middle">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<row>
<cell alignment="center" valignment="top" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
gif
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
tes
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
row total
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
ter
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
3
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
ty
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
t
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
column total
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
3
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
4
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
7
\end_layout

\end_inset
</cell>
</row>
</lyxtabular>

\end_inset


\begin_inset Newline newline
\end_inset


\begin_inset Newline newline
\end_inset

Note that this table is a strict subset of the previous table; the column
 and row totals are completely unchanged.
 However, the total number of observations has diminished from 29 to 7,
 and so all P and MI values need to be recomputed.
 Proceeding long-hand, as before:
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
P(x,y)=1/7 for (x,y) in {(tes,t), (gif,t), (tes,ty), (gif,ty), (gif,ter)}
\begin_inset Newline newline
\end_inset


\begin_inset Newline newline
\end_inset

and
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
P(x,y)=2/7 for (x,y) in {(tes,ter)}
\begin_inset Newline newline
\end_inset


\begin_inset Newline newline
\end_inset

The partial sums are:
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
P(gif,*) = (1+1+1)/7 = 3/7
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
P(tes,*) = (1+1+2)/7 = 4/7
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
P(*,t) = 2/7 = P(*,ty)
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
P(*,ter) = 3/7
\begin_inset Newline newline
\end_inset


\begin_inset Newline newline
\end_inset

The MI values are all different, as well:
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
MI(gif,t) = log P(gif,t)/P(gif,*)P(*,t) = log (1/7)(7/3)(7/2) = log (7/6)
 = 0.222392
\end_layout

\begin_layout Standard
= MI(gif,ty) 
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
MI(gif,ter) = log P(gif,ter)/P(gif,*)P(*,ter) = log (1/7)(7/3)(7/3) = log
 (7/9) = -0.362570 
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
MI(tes,t) = log P(tes,t)/P(tes,*)P(*,t) = log (1/7)(7/4)(7/2) = log (7/8)
 = -0.192645
\end_layout

\begin_layout Standard
= MI(tes,ty)
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
MI(tes,ter) = log P(tes,ter)/P(tes,*)P(*,ter) = log (2/7)(7/4)(7/3) = log
 (7/6) = 0.222392
\begin_inset Newline newline
\end_inset


\begin_inset Newline newline
\end_inset

Note that three of the MI values are negative, and three are positive.
\end_layout
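
\begin_layout Standard
For reference, a compact restatement of the long-hand computation above,
 assuming that the logarithms are base 2 (which is what the quoted numeric
 values correspond to):
\begin_inset Formula 
\[
MI(x,y)=\log_{2}\frac{P(x,y)}{P(x,*)\,P(*,y)}
\]

\end_inset

Weighting each value by its pair probability gives the average over the
 seven observations:
\begin_inset Formula 
\begin{align*}
\sum_{x,y}P(x,y)\,MI(x,y) & =\frac{1}{7}\left(2(0.222392)-0.362570-2(0.192645)\right)+\frac{2}{7}(0.222392)\\
 & \approx0.020244
\end{align*}

\end_inset

so the stem and the suffix are only very weakly correlated in this small
 sample.
\end_layout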

\begin_layout Standard
Following the previous formulas, we compute the total pair entropy:
\end_layout

\begin_layout Standard
\begin_inset Formula 
\begin{align*}
h_{PAIR}^{observed} & =-\sum_{w_{1},w_{2}\in\alpha}p(w_{1},w_{2})\log_{2}p(w_{1},w_{2})\\
 & =-\frac{5}{7}\log_{2}\frac{1}{7}-\frac{2}{7}\log_{2}\frac{2}{7}=2.521641
\end{align*}

\end_inset

This notation is somewhat misleading: we are actually computing the link
 entropy, so the set is really 
\begin_inset Formula $\beta=\{GA,GB,GC,TA,TB,TC\}$
\end_inset

, the first five of which were observed once, and the last of which was
 observed twice.
 So really we should write:
\begin_inset Formula 
\[
h_{PAIR}^{observed}=-\sum_{t\in\beta}p(t)\log_{2}p(t)
\]

\end_inset

with 
\begin_inset Formula $p(t)$
\end_inset

 being the probability of observing link-type 
\begin_inset Formula $t$
\end_inset

.
\end_layout

\begin_layout Standard
The above is the observed entropy, given the corpus, and the grammar shown
 in listing 
\begin_inset CommandInset ref
LatexCommand ref
reference "alg:Morpheme-grammar"

\end_inset

.
 However, this grammar has no probabilities attached to it, so if it were
 used to generate a corpus, the entropy would be different.
 Each of the link types would be observed with the same probability, and
 so the entropy would be:
\begin_inset Formula 
\begin{align*}
h_{PAIR}^{generated} & =-\sum_{t\in\beta}p(t)\log_{2}p(t)\\
 & =-\frac{6}{6}\log_{2}\frac{1}{6}=2.584963
\end{align*}

\end_inset

This is obtained by observing that there are 6 link types in the set 
\begin_inset Formula $\beta$
\end_inset

 and so, if chosen equi-probably, the resulting entropy is just 
\begin_inset Formula $\log_{2}6$
\end_inset

.
 For a given number 
\begin_inset Formula $N$
\end_inset

 of link types, the entropy of the generated grammar will be 
\begin_inset Formula $\log_{2}N$
\end_inset

, for this extremely simple type of grammar, where all disjuncts have only
 one connector in them.
 The generated entropy is always maximal for the grammar, since the uniform
 distribution maximizes entropy, and the observed distribution will almost
 never be exactly uniform.
 Thus, we have as a general principle:
\begin_inset Formula 
\[
h^{observed}\le h^{generated}
\]

\end_inset

Note that having equi-distributed link types is the same as having each
 of the words in the corpus appear with equal frequency.
 The morphemes, however, do NOT appear with equal frequency (although, taken
 separately, all stems do, and all suffixes do).
 
\end_layout
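
\begin_layout Standard
A short sketch of why the inequality holds, under the assumption made above
 that the generated distribution is uniform over the 
\begin_inset Formula $N$
\end_inset

 link types: the gap between the two entropies is the Kullback-Leibler divergence
 of the observed distribution from the uniform one, which is never negative:
\begin_inset Formula 
\[
h^{generated}-h^{observed}=\log_{2}N+\sum_{t\in\beta}p(t)\log_{2}p(t)=\sum_{t\in\beta}p(t)\log_{2}\frac{p(t)}{1/N}\ge0
\]

\end_inset

For the example above, 
\begin_inset Formula $N=6$
\end_inset

 and the gap is 
\begin_inset Formula $\frac{5}{7}\log_{2}\frac{6}{7}+\frac{2}{7}\log_{2}\frac{12}{7}\approx0.063322$
\end_inset

, which agrees with 
\begin_inset Formula $2.584963-2.521641$
\end_inset

.
\end_layout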

\begin_layout Standard
Link type reductions can be done in many ways.
 In each case, we look to see if adding a new word to a category improves
 the score.
 The possibilities are:
\end_layout

\begin_layout Enumerate
Group =ter and =ty together.
\end_layout

\begin_layout Enumerate
Group =ter and =t together.
\end_layout

\begin_layout Enumerate
Group =ty and =t together.
\end_layout

\begin_layout Enumerate
Group gif.= and tes.= together.
\end_layout

\begin_layout Standard
After this, we have more reductions:
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
1.a.
 Add =t to {=ter, =ty}, and finally group together gif.= and tes.=
\end_layout

\begin_layout Standard
1.b.
 Group together gif.= and tes.=, and finally, add =t to {=ter, =ty}
\end_layout

\begin_layout Standard
2.a, 2.b, 3.a, 3.b.
 Variations of the above.
\end_layout

\begin_layout Standard
4.a.
 Group =ter and =ty together, then add =t.
\end_layout

\begin_layout Standard
4.b.
 Group =ter and =t together, then add =ty.
\end_layout

\begin_layout Standard
4.c.
 Group =ty and =t together, then add =ter.
\begin_inset Newline newline
\end_inset


\begin_inset Newline newline
\end_inset

This gives 9 different orders in which the reductions can take place.
 Actually, only 6: cases 1.b and 4.a are the same, as are 2.b and 4.b, and
 3.b and 4.c.
 Let's do at least some of them.
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Paragraph*
Case 1.
 
\end_layout

\begin_layout Standard
Let 
\begin_inset Formula $\gamma=\{\mbox{=ter, =ty}\}$
\end_inset

.
 Then the link types GB and GC need to be consolidated: GG={GB, GC} and
 likewise TT={TB, TC}.
 The dictionary becomes
\end_layout

\begin_layout Standard
\begin_inset listings
inline false
status open

\begin_layout Plain Layout

     gif.=: GA+ or GG+;
\end_layout

\begin_layout Plain Layout

     tes.=: TA+ or TT+;
\end_layout

\begin_layout Plain Layout

     =t: GA- or TA-;
\end_layout

\begin_layout Plain Layout

     =ty =ter: GG- or TT-;
\end_layout

\begin_layout Plain Layout

\end_layout

\end_inset

The observed pair probabilities become: 
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
p(GA) = p(gif,t) = 1/7 = p(TA) = p(tes,t)
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
p(GG) = p(gif,ter) + p(gif,ty) = 2/7
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
p(TT) = p(tes,ter) + p(tes,ty) = 3/7
\begin_inset Newline newline
\end_inset


\begin_inset Newline newline
\end_inset

So the observed entropy is now
\begin_inset Formula 
\[
h_{PAIR}^{red1.}=-\frac{2}{7}\log_{2}\frac{1}{7}-\frac{2}{7}\log_{2}\frac{2}{7}-\frac{3}{7}\log_{2}\frac{3}{7}=1.842371
\]

\end_inset

The generated entropy is 
\begin_inset Formula $h_{gen}^{red1.}=\log_{2}4=2$
\end_inset

 since there are four total link types in the reduced grammar.
 Pursuant to equation 
\begin_inset CommandInset ref
LatexCommand ref
reference "eq:counting complexity"

\end_inset

, we should add the log of the cardinality of the word-sets.
 Here, only one word-set has a cardinality greater than one: {=ty, =ter}.
 So, one gets:
\begin_inset Formula 
\[
h_{gen}^{wrds1.}=\log_{2}2=1
\]

\end_inset

The conditional entropy, based on the textual observations, is 
\begin_inset Formula 
\[
h_{obs}^{wrds1.}=-\frac{3}{5}\log_{2}\frac{3}{5}-\frac{2}{5}\log_{2}\frac{2}{5}=0.970951
\]

\end_inset

 
\end_layout
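
\begin_layout Standard
The fractions in the last formula are not spelled out above; a plausible
 reading, stated here as an assumption, is that they are the relative frequencies
 of the two suffixes within the merged word-set, taken from the column totals
 (=ter was observed 3 times, =ty twice):
\begin_inset Formula 
\[
p(\mbox{=ter}\mid\gamma)=\frac{3}{3+2}=\frac{3}{5},\qquad p(\mbox{=ty}\mid\gamma)=\frac{2}{3+2}=\frac{2}{5}
\]

\end_inset

Under this reading, the same recipe applied to the two stems (3 and 4 observatio
ns) yields the fractions 3/7 and 4/7 that appear when the stems are merged.
\end_layout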

\begin_layout Paragraph*
Case 1.a.
 
\end_layout

\begin_layout Standard
Let 
\begin_inset Formula $\delta=\{\mbox{=ter, =ty, =t}\}$
\end_inset

.
 Then the link types GA and GG need to be consolidated: G={GA, GG} and likewise
 T={TA, TT}.
 The dictionary becomes
\end_layout

\begin_layout Standard
\begin_inset listings
inline false
status open

\begin_layout Plain Layout

     gif.=: G+;
\end_layout

\begin_layout Plain Layout

     tes.=: T+;
\end_layout

\begin_layout Plain Layout

     =t =ty =ter: G- or T-;
\end_layout

\begin_layout Plain Layout

\end_layout

\end_inset

The observed pair probabilities become: 
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
p(G) = p(gif,t) + p(gif,ty) + p(gif,ter) = 3/7
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
p(T) = p(tes,t) + p(tes,ty) + p(tes,ter) = 4/7
\begin_inset Newline newline
\end_inset


\begin_inset Newline newline
\end_inset

So the observed entropy is now
\begin_inset Formula 
\[
h_{PAIR}^{red1.a.}=-\frac{3}{7}\log_{2}\frac{3}{7}-\frac{4}{7}\log_{2}\frac{4}{7}=0.985228
\]

\end_inset

The generated entropy is 
\begin_inset Formula $\log_{2}2=1$
\end_inset

 since there are only two link types in the grammar.
 The word-counting entropy for the set 
\begin_inset Formula $\delta$
\end_inset

 contributes an additional 
\begin_inset Formula 
\[
h_{gen}^{wrds1.a.}=\log_{2}3=1.584963
\]

\end_inset

while the observed entropy is 
\begin_inset Formula 
\[
h_{obs}^{wrds1.a.}=-\frac{4}{7}\log_{2}\frac{2}{7}-\frac{3}{7}\log_{2}\frac{3}{7}=1.556657
\]

\end_inset


\end_layout

\begin_layout Paragraph*
Case 1.b.
 
\end_layout

\begin_layout Standard
Let 
\begin_inset Formula $\gamma=\{\mbox{=ter, =ty}\}$
\end_inset

 as before, and 
\begin_inset Formula $\epsilon=\{\mbox{gif.=, tes.=}\}$
\end_inset

.
 The link types consolidate: EA={GA, TA} and EM={GG, TT}.
 The dictionary becomes
\end_layout

\begin_layout Standard
\begin_inset listings
inline false
status open

\begin_layout Plain Layout

     gif.= tes.=: EA+ or EM+;
\end_layout

\begin_layout Plain Layout

     =t: EA-;
\end_layout

\begin_layout Plain Layout

     =ty =ter: EM-;
\end_layout

\begin_layout Plain Layout

\end_layout

\end_inset

The observed pair probabilities become: 
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
p(EA) = p(gif,t) + p(tes,t) = 2/7
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
p(EM) = 5/7
\begin_inset Newline newline
\end_inset


\begin_inset Newline newline
\end_inset

So that the entropy is
\end_layout

\begin_layout Standard
\begin_inset Formula 
\[
h_{PAIR}^{red1.b.}=-\frac{2}{7}\log_{2}\frac{2}{7}-\frac{5}{7}\log_{2}\frac{5}{7}=0.863121
\]

\end_inset

The generated entropy is 
\begin_inset Formula $\log_{2}2=1$
\end_inset

 since there are only two link types in the grammar.
 The word-set counting entropy adds 
\begin_inset Formula 
\[
h_{gen}^{wrds.1.b.}=2\log_{2}2=2
\]

\end_inset

 while the observed entropy is
\begin_inset Formula 
\[
h_{obs}^{wrds.1.b.}=-\frac{3}{7}\log_{2}\frac{3}{7}-\frac{4}{7}\log_{2}\frac{4}{7}-\frac{3}{5}\log_{2}\frac{3}{5}-\frac{2}{5}\log_{2}\frac{2}{5}=1.956179
\]

\end_inset

 
\end_layout

\begin_layout Paragraph*
Other cases.
\end_layout

\begin_layout Standard
Case 2.a.
 and case 2.b.
 are identical to cases 1.a.
 and 1.b.
 because =t and =ty are interchangeable, from the probability point of view.
\end_layout

\begin_layout Standard
Case 3.a.
 and case 3.b.
 are similar, but with different probabilities.
\end_layout

\begin_layout Standard
Case 4.
 and the subcases are different, but not illuminating.
\end_layout

\begin_layout Paragraph*
Final Case.
\end_layout

\begin_layout Standard
The final consolidation gives 
\begin_inset Formula $\delta=\{\mbox{=ter, =ty, =t}\}$
\end_inset

, and 
\begin_inset Formula $\epsilon=\{\mbox{gif.=, tes.=}\}$
\end_inset

.
 The dictionary becomes
\end_layout

\begin_layout Standard
\begin_inset listings
inline false
status open

\begin_layout Plain Layout

     gif.= tes.=: LL+;
\end_layout

\begin_layout Plain Layout

     =t =ty =ter: LL-;
\end_layout

\begin_layout Plain Layout

\end_layout

\end_inset

The observed pair probabilities become: 
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
p(LL) = 7/7
\begin_inset Newline newline
\end_inset


\begin_inset Newline newline
\end_inset

So that the entropy is
\end_layout

\begin_layout Standard
\begin_inset Formula 
\[
h_{PAIR}^{final}=-\frac{7}{7}\log_{2}\frac{7}{7}=0
\]

\end_inset

The generated entropy is 
\begin_inset Formula $\log_{2}1=0$
\end_inset

 since there is only one link type in the grammar.
 The word-set counting entropy adds 
\begin_inset Formula 
\[
h_{gen}^{wrds.fin}=\log_{2}2+\log_{2}3=2.584963
\]

\end_inset

while the observed word entropy is 
\begin_inset Formula 
\[
h_{obs}^{wrds.fin}=-\frac{4}{7}\log_{2}\frac{2}{7}-\frac{3}{7}\log_{2}\frac{3}{7}-\frac{3}{7}\log_{2}\frac{3}{7}-\frac{4}{7}\log_{2}\frac{4}{7}=2.541885
\]

\end_inset


\end_layout

\begin_layout Paragraph*
Summary.
\end_layout

\begin_layout Standard
The table below summarizes these results.
 The sum columns show the entropy according to the equation 
\begin_inset CommandInset ref
LatexCommand ref
reference "eq:counting complexity"

\end_inset

 for the observed frequencies, and the generated frequencies.
\end_layout

\begin_layout Standard
\begin_inset Tabular
<lyxtabular version="3" rows="10" columns="7">
<features tabularvalignment="middle">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $h_{obs}^{red}$
\end_inset


\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $h_{gen}^{red}$
\end_inset


\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $h_{obs}^{wrds}$
\end_inset


\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $h_{gen}^{wrds}$
\end_inset


\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $h_{obs}^{red}+h_{obs}^{wrds}$
\end_inset


\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $h_{gen}^{red}+h_{gen}^{wrds}$
\end_inset


\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Initial
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\strikeout off
\uuline off
\uwave off
\noun off
\color none
2.521641
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\strikeout off
\uuline off
\uwave off
\noun off
\color none
2.584963
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\strikeout off
\uuline off
\uwave off
\noun off
\color none
2.521641
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\strikeout off
\uuline off
\uwave off
\noun off
\color none
2.584963
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Case 1.
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\strikeout off
\uuline off
\uwave off
\noun off
\color none
1.842371
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\strikeout off
\uuline off
\uwave off
\noun off
\color none
0.970951
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2.813322
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
3
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Case 1.a.
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\strikeout off
\uuline off
\uwave off
\noun off
\color none
0.985228
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\strikeout off
\uuline off
\uwave off
\noun off
\color none
1.556657
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\strikeout off
\uuline off
\uwave off
\noun off
\color none
1.584963
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2.541885
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\strikeout off
\uuline off
\uwave off
\noun off
\color none
2.584963
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Case 1.b.
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\strikeout off
\uuline off
\uwave off
\noun off
\color none
0.863121
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1.956179
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2.819299
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
3
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Final
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2.541885
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\strikeout off
\uuline off
\uwave off
\noun off
\color none
2.584963
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2.541885
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\strikeout off
\uuline off
\uwave off
\noun off
\color none
2.584963
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Case 3.
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1.950212
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1.0
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2.950212
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Case 3.a.
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.985228
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1.556656
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2.541884
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Case 3.b.
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.985228
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1.985228
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2.970456
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Case 4.
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1.556657
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.985228
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2.541885
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
</row>
</lyxtabular>

\end_inset


\begin_inset Newline newline
\end_inset


\begin_inset Newline newline
\end_inset

Arghhh.
 Such a simple case, so much complexity...
 Anyway, the values for cases 3 and 4 were computed with the script 
\begin_inset Quotes eld
\end_inset

link-type/gifty.scm
\begin_inset Quotes erd
\end_inset

 in this same directory.
\end_layout
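
\begin_layout Standard
One consistency check on the table, offered as an observation and not as
 part of the original computation: in the final case the observed sum is
 just the entropy of the stem distribution plus that of the suffix distribution,
 while the initial value is the joint entropy of the pairs.
 Writing 
\begin_inset Formula $h_{obs}^{initial}$
\end_inset

 and 
\begin_inset Formula $h_{obs}^{final}$
\end_inset

 (labels introduced only for this remark) for the two entries of the observed
 sum column, their difference is the mutual information between stem and
 suffix:
\begin_inset Formula 
\begin{align*}
h_{obs}^{final}-h_{obs}^{initial} & =H(\mbox{stem})+H(\mbox{suffix})-H(\mbox{stem},\mbox{suffix})\\
 & =0.985228+1.556657-2.521641=0.020244
\end{align*}

\end_inset

Since mutual information is never negative, the final observed sum can never
 fall below the initial one, whatever the counts.
\end_layout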

\begin_layout Standard
Conclusions: based purely on entropy maximization, all cases advance, but
 none go to the final case.
 But we are not imposing any 'complexity penalty' on this.
\end_layout

\begin_layout Standard
Results on some alternate distributions, for this ranking: 
\begin_inset Quotes eld
\end_inset

tester testy test gifter gifty gift
\begin_inset Quotes erd
\end_inset


\end_layout

\begin_layout Itemize
Pure Zipf: 
\begin_inset Formula $(rank)^{-1.0}$
\end_inset

: none advance (
\begin_inset Formula $h_{initial}=2.281979$
\end_inset

 and 
\begin_inset Formula $h_{final}=2.293598$
\end_inset

)
\end_layout

\begin_layout Itemize
Zipf 
\begin_inset Formula $(rank)^{-1.05}$
\end_inset

: none advance (
\begin_inset Formula $h_{initial}=2.251204$
\end_inset

 and 
\begin_inset Formula $h_{final}=2.263603$
\end_inset

)
\end_layout

\begin_layout Itemize
Zipf 
\begin_inset Formula $(rank)^{-1.5}$
\end_inset

: none advance (
\begin_inset Formula $h_{initial}=1.930661$
\end_inset

 and 
\begin_inset Formula $h_{final}=1.948128$
\end_inset

)
\end_layout
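
\begin_layout Standard
For concreteness, a sketch of the distributions presumably meant here (the
 normalization is an assumption, since the script itself is not reproduced
 in this text): with exponent 
\begin_inset Formula $s$
\end_inset

, the word of rank 
\begin_inset Formula $r$
\end_inset

 among the six words is given the probability
\begin_inset Formula 
\[
p_{r}=\frac{r^{-s}}{\sum_{k=1}^{6}k^{-s}}
\]

\end_inset

For 
\begin_inset Formula $s=1$
\end_inset

 the normalizer is 
\begin_inset Formula $49/20$
\end_inset

, so the six probabilities are 
\begin_inset Formula $\frac{20}{49},\frac{10}{49},\frac{20}{147},\frac{5}{49},\frac{4}{49},\frac{10}{147}$
\end_inset

, and the entropy of this distribution is approximately 2.28198, in agreement
 with the initial value quoted for the pure Zipf case.
\end_layout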

\begin_layout Standard
None of these advance because the initial and final entropies are so very
 close.
 But, as before, there are advances, with the biggest ones to cases 4.c and
 3.b.
 The alternative rankings 
\begin_inset Quotes eld
\end_inset

tester testy test gift gifty gifter
\begin_inset Quotes erd
\end_inset

 and 
\begin_inset Quotes eld
\end_inset

tester gifter testy gifty test gift
\begin_inset Quotes erd
\end_inset

 give only slightly different results.
\end_layout

\begin_layout Subsection*
Link-type discovery, better example
\end_layout

\begin_layout Standard
In the previous example, the unified link-type discovery is inevitable, so
 a more complex version is needed, with a less obvious outcome.
 So let's take the original example of link-type discovery, and add some
 confounding link types.
 We begin with the initial observations, plus some extras, given in the table
 below:
\begin_inset Newline newline
\end_inset


\begin_inset Float table
placement H
wide false
sideways false
status open

\begin_layout Plain Layout
\begin_inset Caption Standard

\begin_layout Plain Layout
Example Link Frequency Table
\begin_inset CommandInset label
LatexCommand label
name "tab:Example-Link-Frequency"

\end_inset


\end_layout

\end_inset


\begin_inset Tabular
<lyxtabular version="3" rows="9" columns="3">
<features tabularvalignment="middle">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Pair
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Initial Link Type
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
# observations
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
gif–t
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
GA
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
gif–ty
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
GB
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
gif–ter
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
GC
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
tes–t
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
TA
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
tes–ty
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
TB
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
tes–ter
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
TC
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
blo–ty
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
BB
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
3
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
blo–fu
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
BF
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
</row>
</lyxtabular>

\end_inset


\end_layout

\begin_layout Plain Layout
Example distribution of link frequencies obtained from an example corpus.
\end_layout

\begin_layout Plain Layout

\end_layout

\end_inset


\end_layout

\begin_layout Standard
The addition of 
\begin_inset Quotes eld
\end_inset

bloty
\begin_inset Quotes erd
\end_inset

 to the link table, and with a strong weight, will tend to derail the consolidat
ion of the =ty suffix with the others.
 The addition of 
\begin_inset Quotes eld
\end_inset

blofu
\begin_inset Quotes erd
\end_inset

 helps make sure that there's some confusion about the 
\begin_inset Quotes eld
\end_inset

blo=
\begin_inset Quotes erd
\end_inset

 stem.
\end_layout

\begin_layout Standard
The corresponding link-grammar dictionary is: 
\begin_inset CommandInset label
LatexCommand label
name "gifty-grammar"

\end_inset


\begin_inset Newline newline
\end_inset


\begin_inset Float algorithm
wide false
sideways false
status open

\begin_layout Plain Layout

\end_layout

\begin_layout Plain Layout
\begin_inset Caption Standard

\begin_layout Plain Layout
Example morpheme grammar
\begin_inset CommandInset label
LatexCommand label
name "alg:Example-morpheme-grammar"

\end_inset


\end_layout

\end_inset


\end_layout

\begin_layout Plain Layout
\begin_inset listings
inline false
status open

\begin_layout Plain Layout

     gif.=: GA+ or GB+ or GC+;
\end_layout

\begin_layout Plain Layout

     tes.=: TA+ or TB+ or TC+;
\end_layout

\begin_layout Plain Layout

     blo.=: BB+ or BF+;
\end_layout

\begin_layout Plain Layout

     =t: GA- or TA-;
\end_layout

\begin_layout Plain Layout

     =ty: GB- or TB- or BB-;
\end_layout

\begin_layout Plain Layout

     =ter: GC- or TC-;
\end_layout

\begin_layout Plain Layout

     =fu: BF-;
\end_layout

\begin_layout Plain Layout

\end_layout

\end_inset


\end_layout

\begin_layout Plain Layout
A more complex grammar showing morpheme linkages.
\end_layout

\end_inset


\end_layout

\begin_layout Standard
From the above initial dictionary, we hope to deduce one word class that
 contains gif.= and tes.= and another that contains =t and =ter; exactly how
 the rest plays out is unclear.
 Let's begin with the un-clustered entropy, and then see what
 happens if we try various different clusters.
 So, as before, let 
\begin_inset Formula $\beta=\{GA,GB,GC,TA,TB,TC,BB,BF\}$
\end_inset

 and write:
\begin_inset Formula 
\begin{align*}
h_{PAIR}^{observed} & =-\sum_{t\in\beta}p(t)\log_{2}p(t)\\
 & =-\frac{6}{11}\log_{2}\frac{1}{11}-\frac{2}{11}\log_{2}\frac{2}{11}-\frac{3}{11}\log_{2}\frac{3}{11}\\
 & =2.845351
\end{align*}

\end_inset

with 
\begin_inset Formula $p(t)$
\end_inset

 being the probability of observing link-type 
\begin_inset Formula $t$
\end_inset

.
 Since there are 8 different link types, the generated entropy is 
\begin_inset Formula $h_{PAIR}^{generated}=\log_{2}8=3$
\end_inset

.
 The difference between these two is 
\begin_inset Formula $h^{gen}-h^{obs}=0.154649$
\end_inset

.
 The observed corpus also has 8 words in it (not counting multiplicity):
 this is by design; before reduction, there is always exactly one link type
 for each morpheme pair.
\end_layout
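
\begin_layout Standard
Spelling out the arithmetic for the un-clustered entropy term by term may
 help when checking the reduced cases against it.
 This is only a restatement of the sum above, with each logarithm inverted
 so that every term is positive:
\begin_inset Formula 
\begin{align*}
\frac{6}{11}\log_{2}11 & =1.886963\\
\frac{2}{11}\log_{2}\frac{11}{2} & =0.447169\\
\frac{3}{11}\log_{2}\frac{11}{3} & =0.511219
\end{align*}

\end_inset

The three terms add up to 2.845351, as quoted.
 The first term collects the six link types seen once (GA, GB, GC, TA, TB,
 BF), the second is TC, and the third is BB.
\end_layout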

\begin_layout Standard
Let's look at several cases:
\end_layout

\begin_layout Enumerate
Group gif.= and tes.= together.
\end_layout

\begin_layout Enumerate
Group gif.= and blo.= together.
\end_layout

\begin_layout Enumerate
Group =t and =ty together.
\end_layout

\begin_layout Enumerate
Group =ty and =fu together.
\end_layout

\begin_layout Enumerate
Group =ter and =fu together.
\end_layout

\begin_layout Standard
Here, we expect case 1 to go through easily, cases 2 and 3 to be ambiguous or
 blocked, case 4 to be weakly blocked, and case 5 to be strongly blocked.
 So, proceeding:
\end_layout

\begin_layout Paragraph*
Case 1.
\end_layout

\begin_layout Standard
Group gif.= and tes.= together.
 Let 
\begin_inset Formula $\gamma=\{\mbox{gif.=, tes.=}\}$
\end_inset

.
 Then the link types G* and T* need to be consolidated: A={GA,TA} and likewise
 B={GB,TB} and C={GC,TC}.
 The dictionary becomes
\end_layout

\begin_layout Standard
\begin_inset listings
inline false
status open

\begin_layout Plain Layout

     gif.= tes.=: A+ or B+ or C+;
\end_layout

\begin_layout Plain Layout

     blo.=: BB+ or BF+;
\end_layout

\begin_layout Plain Layout

     =t: A-;
\end_layout

\begin_layout Plain Layout

     =ty: B- or BB-;
\end_layout

\begin_layout Plain Layout

     =ter: C-;
\end_layout

\begin_layout Plain Layout

     =fu: BF-;
\end_layout

\begin_layout Plain Layout

\end_layout

\end_inset

The observed pair probabilities become: 
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
p(A) = p(gif,t) + p(tes,t) = 2/11 = p(B) = p(gif,ty) + p(tes,ty) 
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
p(C) = p(gif,ter) + p(tes,ter) = 3/11
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
p(BB) = p(blo,ty) = 3/11
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
p(BF) = p(blo,fu) = 1/11
\begin_inset Newline newline
\end_inset


\begin_inset Newline newline
\end_inset

So the observed entropy is now
\begin_inset Formula 
\[
h_{PAIR}^{red1.}=-\frac{4}{11}\log_{2}\frac{2}{11}-\frac{6}{11}\log_{2}\frac{3}{11}-\frac{1}{11}\log_{2}\frac{1}{11}=2.231270
\]

\end_inset

The generated entropy is 
\begin_inset Formula $h^{gen}=\log_{2}5=2.321928$
\end_inset

.
 The difference is 
\begin_inset Formula $h^{gen}-h^{obs}=0.090658.$
\end_inset

 This clearly brings the entropy closer to the theoretical (equidistributional)
 maximum; the grouping is accepted.
 However, 
\begin_inset Formula $h^{lang}=\log_{2}8=3$
\end_inset

 as before, since the generated language still has 8 words in it.
\end_layout

\begin_layout Paragraph*
Case 2.
 
\end_layout

\begin_layout Standard
Group gif.= and blo.= together.
 Let 
\begin_inset Formula $\delta=\{\mbox{gif.=, blo.=}\}$
\end_inset

.
 Then the link types GB and BB can be consolidated, because they share the
 common suffix =ty: B={GB,BB}.
 No other link consolidation is possible, without permitting impermissible
 (previously unseen) linkages.
 The dictionary becomes
\end_layout

\begin_layout Standard
\begin_inset listings
inline false
status open

\begin_layout Plain Layout

     gif.= blo.=: GA+ or B+ or GC+ or BF+;
\end_layout

\begin_layout Plain Layout

     tes.=: TA+ or TB+ or TC+;
\end_layout

\begin_layout Plain Layout

     =t: GA- or TA-;
\end_layout

\begin_layout Plain Layout

     =ty: B- or TB-;
\end_layout

\begin_layout Plain Layout

     =ter: GC- or TC-;
\end_layout

\begin_layout Plain Layout

     =fu: BF-;
\end_layout

\begin_layout Plain Layout

\end_layout

\end_inset

Note that this dictionary does allow several previously unobserved words:
 giffu, blot, bloter.
 This is what happens when one hypothesizes unions between classes that
 merely overlap, instead of being subsets.
 What happens next depends on whether the overlap was large, or small.
\end_layout

\begin_layout Standard
The observed pair probabilities become: 
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
p(GA) = p(gif,t) = 1/11 = p(GC) = p(TA) = p(TB) 
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
p(TC) = p(tes,ter) = 2/11
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
p(B) = p(gif,ty) + p(blo,ty) = 4/11
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
p(BF) = p(blo,fu) = 1/11
\begin_inset Newline newline
\end_inset


\begin_inset Newline newline
\end_inset

So the observed entropy is now
\begin_inset Formula 
\[
h_{PAIR}^{red2.}=-\frac{5}{11}\log_{2}\frac{1}{11}-\frac{2}{11}\log_{2}\frac{2}{11}-\frac{4}{11}\log_{2}\frac{4}{11}=2.550341
\]

\end_inset

The generated entropy is 
\begin_inset Formula $h^{gen}=\log_{2}7=2.807355$
\end_inset

.
 The difference is 
\begin_inset Formula $h^{gen}-h^{obs}=0.257014$
\end_inset

.
 The entropy is not getting closer to the equidistributional maximum; this
 grammar is rejected.
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Paragraph*
Case 3.
 
\end_layout

\begin_layout Standard
Group =t and =ty together.
 Let 
\begin_inset Formula $\epsilon=\{\mbox{=t, =ty}\}$
\end_inset

.
 Then we may group G={GA, GB} and T={TA, TB}.
 The corresponding link-grammar dictionary is:
\end_layout

\begin_layout Standard
\begin_inset listings
inline false
status open

\begin_layout Plain Layout

     gif.=: G+ or GC+;
\end_layout

\begin_layout Plain Layout

     tes.=: T+ or TC+;
\end_layout

\begin_layout Plain Layout

     blo.=: BB+ or BF+;
\end_layout

\begin_layout Plain Layout

     =t =ty: G- or T- or BB-;
\end_layout

\begin_layout Plain Layout

     =ter: GC- or TC-;
\end_layout

\begin_layout Plain Layout

     =fu: BF-;
\end_layout

\begin_layout Plain Layout

\end_layout

\end_inset

The above again allows a new, unobserved word: 
\begin_inset Quotes eld
\end_inset

blot
\begin_inset Quotes erd
\end_inset

.
 The observed pair probabilities become: 
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
p(G) = p(gif,t) + p(gif,ty) = 2/11 = p(T) = p(tes,t) + p(tes,ty) 
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
p(GC) = p(gif,ter) = 1/11
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
p(TC) = p(tes,ter) = 2/11
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
p(BB) = p(blo,ty) = 3/11
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
p(BF) = p(blo,fu) = 1/11
\begin_inset Newline newline
\end_inset


\begin_inset Newline newline
\end_inset

So the observed entropy is now
\begin_inset Formula 
\[
h_{PAIR}^{red3.}=-\frac{6}{11}\log_{2}\frac{2}{11}-\frac{2}{11}\log_{2}\frac{1}{11}-\frac{3}{11}\log_{2}\frac{3}{11}=2.481715
\]

\end_inset


\begin_inset Newline newline
\end_inset

The equidistributional entropy is 
\begin_inset Formula $h^{gen}=\log_{2}6=2.584963$
\end_inset

.
 The difference is 
\begin_inset Formula $h^{gen}-h^{obs}=0.103248$
\end_inset

.
 This difference means we are getting closer to the maximum; the grouping
 is acceptable! It's really not much worse than case 1, which was unambiguous.
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Paragraph*
Case 4.
 
\end_layout

\begin_layout Standard
Group =ty and =fu together.
 Let 
\begin_inset Formula $\zeta=\{\mbox{=ty, =fu}\}$
\end_inset

.
 Then we must group B={BB,BF} together.
 The dictionary is:
\end_layout

\begin_layout Standard
\begin_inset listings
inline false
status open

\begin_layout Plain Layout

     gif.=: GA+ or GB+ or GC+;
\end_layout

\begin_layout Plain Layout

     tes.=: TA+ or TB+ or TC+;
\end_layout

\begin_layout Plain Layout

     blo.=: B+;
\end_layout

\begin_layout Plain Layout

     =t: GA- or TA-;
\end_layout

\begin_layout Plain Layout

     =ty =fu: GB- or TB- or B-;
\end_layout

\begin_layout Plain Layout

     =ter: GC- or TC-;
\end_layout

\begin_layout Plain Layout

\end_layout

\end_inset

No new unobserved words are allowed by this grouping! The observed pair
 probabilities are:
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
p(GA) = p(gif,t) = 1/11 = p(GB) = p(GC) = p(TA) = p(TB) 
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
p(TC) = p(tes,ter) = 2/11
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
p(B) = p(blo,ty) + p(blo,fu) = 4/11
\begin_inset Newline newline
\end_inset


\begin_inset Newline newline
\end_inset

The observed entropy is then:
\begin_inset Formula 
\[
h_{PAIR}^{red4.}=-\frac{5}{11}\log_{2}\frac{1}{11}-\frac{2}{11}\log_{2}\frac{2}{11}-\frac{4}{11}\log_{2}\frac{4}{11}=2.550341
\]

\end_inset

Curiously, this entropy is identical to the completely different case 2.
 The equidistributional entropy is 
\begin_inset Formula $h^{gen}=\log_{2}7=2.807355$
\end_inset

 and the difference is thus 
\begin_inset Formula $h^{gen}-h^{obs}=0.257014$
\end_inset

 which is sharply further away from the equidistributional maximum.
 Thus, this grouping is rejected.
 This is perhaps surprising ...
 First, this grammar did not generate any new unobserved words; thus, it
 is a faithful grammar.
 Also, it succeeds in reducing the total number of link-types, and thus is
 naively acceptable for that reason.
 However, the frequency distribution of the generated grammar moves away from
 the observed frequency distribution, leading to the rejection.
 This raises a question: when and how might we annotate the grammar with frequency
 information?
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Paragraph*
Case 5.
 
\end_layout

\begin_layout Standard
Group =ter and =fu together, so that 
\begin_inset Formula $\eta=\{\mbox{=ter, =fu}\}$
\end_inset

.
 It does not appear that any link types get consolidated! This is not much
 of a grouping, then ...
\end_layout

\begin_layout Standard
\begin_inset listings
inline false
status open

\begin_layout Plain Layout

     gif.=: GA+ or GB+ or GC+;
\end_layout

\begin_layout Plain Layout

     tes.=: TA+ or TB+ or TC+;
\end_layout

\begin_layout Plain Layout

     blo.=: BB+ or BF+;
\end_layout

\begin_layout Plain Layout

     =t: GA- or TA-;
\end_layout

\begin_layout Plain Layout

     =ty: GB- or TB- or BB-;
\end_layout

\begin_layout Plain Layout

     =ter =fu: GC- or TC- or BF-;
\end_layout

\begin_layout Plain Layout

\end_layout

\end_inset

Many new, unobserved words are allowed: bloter, giffu, tesfu.
 The observed pair probabilities are:
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
p(GA) = p(gif,t) = 1/11 = p(GB) = p(GC) = p(TA) = p(TB) 
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
p(TC) = p(tes,ter) = 2/11
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
p(BB) = p(blo,ty) = 3/11
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
p(BF) = p(blo,fu) = 1/11
\begin_inset Newline newline
\end_inset


\begin_inset Newline newline
\end_inset

The observed entropy is then:
\begin_inset Formula 
\[
h_{PAIR}^{red5.}=-\frac{6}{11}\log_{2}\frac{1}{11}-\frac{2}{11}\log_{2}\frac{2}{11}-\frac{3}{11}\log_{2}\frac{3}{11}=2.845351
\]

\end_inset

This is identical to the unreduced entropy: no surprise, because no link
 consolidation was performed.
 The equidistributional entropy is the same as well: 
\begin_inset Formula $h^{gen}=\log_{2}8=3$
\end_inset

 since there are still 8 link types.
 The language entropy increased: there are now 11 possible words in the
 language, so 
\begin_inset Formula $h^{lang}=\log_{2}11=3.459432$
\end_inset

.
 This is a very unsatisfying situation: the difference in entropies is no
 better or worse than the starting point, and so this seems like a reasonable
 sideways slide, and yet, this grammar allows a bunch of nonsense words
 to be generated.
 That seems wrong.
\end_layout

\begin_layout Paragraph*
Summary
\end_layout

\begin_layout Standard
Of the 5 cases, three are blocked (cases 2,4,5), and two are acceptable
 (1,3).
 Case 1 looks to be the best.
 Let's see what might happen next:
\end_layout

\begin_layout Itemize
Case 1a.
 Group =t and =ty
\end_layout

\begin_layout Itemize
Case 1b.
 Group =t and =ter
\end_layout

\begin_layout Itemize
Case 1c.
 Group =ty and =ter
\end_layout

\begin_layout Standard
Case 1a resembles Case 3 so we expect it to advance.
 Likewise for case 1c.
 It's reasonable to guess that case 1b will be the strongest.
 Let's try some of these.
\end_layout

\begin_layout Paragraph*
Case 1a.
\end_layout

\begin_layout Standard
Group =t and =ty.
 The link merges are T={A,B}; the resulting grammar is:
\end_layout

\begin_layout Standard
\begin_inset listings
inline false
status open

\begin_layout Plain Layout

     gif.= tes.=: T+ or C+;
\end_layout

\begin_layout Plain Layout

     blo.=: BB+ or BF+;
\end_layout

\begin_layout Plain Layout

     =t =ty: T- or BB-;
\end_layout

\begin_layout Plain Layout

     =ter: C-;
\end_layout

\begin_layout Plain Layout

     =fu: BF-;
\end_layout

\begin_layout Plain Layout

\end_layout

\end_inset

This grammar allows a new unobserved word: 
\begin_inset Quotes eld
\end_inset

blot
\begin_inset Quotes erd
\end_inset

.
 The observed pair probabilities become: 
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
p(T) = p(gif,t) + p(tes,t) + p(gif,ty) + p(tes,ty) = 4/11 
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
p(C) = p(gif,ter) + p(tes,ter) = 3/11
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
p(BB) = p(blo,ty) = 3/11
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
p(BF) = p(blo,fu) = 1/11
\begin_inset Newline newline
\end_inset


\begin_inset Newline newline
\end_inset

So the observed entropy is now
\begin_inset Formula 
\[
h_{PAIR}^{red1a.}=-\frac{4}{11}\log_{2}\frac{4}{11}-\frac{6}{11}\log_{2}\frac{3}{11}-\frac{1}{11}\log_{2}\frac{1}{11}=1.867634
\]

\end_inset

The generated entropy is 
\begin_inset Formula $h^{gen}=\log_{2}4=2$
\end_inset

.
 The difference is 
\begin_inset Formula $h^{gen}-h^{obs}=0.132366$
\end_inset

 which is not closer than the previous delta of 
\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\strikeout off
\uuline off
\uwave off
\noun off
\color none
0.090658, so this is rejected.
\end_layout

\begin_layout Paragraph*
Case 1b.
\end_layout

\begin_layout Standard
Group =t and =ter.
 This consolidates links T={A,C} and so the dictionary becomes
\end_layout

\begin_layout Standard
\begin_inset listings
inline false
status open

\begin_layout Plain Layout

     gif.= tes.=: T+ or B+;
\end_layout

\begin_layout Plain Layout

     blo.=: BB+ or BF+;
\end_layout

\begin_layout Plain Layout

     =t =ter: T-;
\end_layout

\begin_layout Plain Layout

     =ty: B- or BB-;
\end_layout

\begin_layout Plain Layout

     =fu: BF-;
\end_layout

\begin_layout Plain Layout

\end_layout

\end_inset

This does not generate any new words.
 The observed pair probabilities become: 
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
p(T) = p(gif,t) + p(tes,t) + p(gif,ter) + p(tes,ter) = 5/11 
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
p(B) = p(gif,ty) + p(tes,ty) = 2/11 
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
p(BB) = p(blo,ty) = 3/11
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
p(BF) = p(blo,fu) = 1/11
\begin_inset Newline newline
\end_inset


\begin_inset Newline newline
\end_inset

So the observed entropy is now
\begin_inset Formula 
\[
h_{PAIR}^{red1b.}=-\frac{5}{11}\log_{2}\frac{5}{11}-\frac{2}{11}\log_{2}\frac{2}{11}-\frac{3}{11}\log_{2}\frac{3}{11}-\frac{1}{11}\log_{2}\frac{1}{11}=1.789929
\]

\end_inset

The generated entropy is 
\begin_inset Formula $h^{gen}=\log_{2}4=2$
\end_inset

.
 The difference is 
\begin_inset Formula $h^{gen}-h^{obs}=0.210071$
\end_inset


\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\strikeout off
\uuline off
\uwave off
\noun off
\color none
 which does not get closer; the best still stands at 0.090658.
 This is surprising: it seems to be blocking the discovery of the clique.
\end_layout

\begin_layout Paragraph*
Case 1c.
\end_layout

\begin_layout Standard
Group =ty and =ter.
 This consolidates T={B,C}, so the dictionary becomes
\end_layout

\begin_layout Standard
\begin_inset listings
inline false
status open

\begin_layout Plain Layout

     gif.= tes.=: A+ or T+;
\end_layout

\begin_layout Plain Layout

     blo.=: BB+ or BF+;
\end_layout

\begin_layout Plain Layout

     =t: A-;
\end_layout

\begin_layout Plain Layout

     =ty =ter: T- or BB-;
\end_layout

\begin_layout Plain Layout

     =fu: BF-;
\end_layout

\begin_layout Plain Layout

\end_layout

\end_inset

This allows a new, unobserved word: bloter (blo.= can now link to =ter through BB).
 The observed pair probabilities become: 
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
p(A) = p(gif,t) + p(tes,t) = 2/11 
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
p(T) = p(gif,ty) + p(tes,ty) + p(gif,ter) + p(tes,ter) = 5/11
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
p(BB) = p(blo,ty) = 3/11
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
p(BF) = p(blo,fu) = 1/11
\begin_inset Newline newline
\end_inset


\begin_inset Newline newline
\end_inset

The changes are the same as for case 1b.
 Again, this is blocked.
\end_layout

\begin_layout Paragraph*
Case 1f.
\end_layout

\begin_layout Standard
This is the 
\begin_inset Quotes eld
\end_inset

final
\begin_inset Quotes erd
\end_inset

 case: group together =t =ty =ter into one.
 This consolidates T={A,B,C}, so that the dictionary becomes
\end_layout

\begin_layout Standard
\begin_inset listings
inline false
status open

\begin_layout Plain Layout

     gif.= tes.=: T+;
\end_layout

\begin_layout Plain Layout

     blo.=: BB+ or BF+;
\end_layout

\begin_layout Plain Layout

     =t =ty =ter: T- or BB-;
\end_layout

\begin_layout Plain Layout

     =fu: BF-;
\end_layout

\begin_layout Plain Layout

\end_layout

\end_inset

This allows new words 
\begin_inset Quotes eld
\end_inset

blot
\begin_inset Quotes erd
\end_inset

, 
\begin_inset Quotes eld
\end_inset

bloter
\begin_inset Quotes erd
\end_inset

.
 The observed pair probabilities become: 
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
p(T) = p(gif,t) + p(tes,t) + p(gif,ty) + p(tes,ty)+ p(gif,ter) + p(tes,ter)
 = 7/11
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
p(BB) = p(blo,ty) = 3/11
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
p(BF) = p(blo,fu) = 1/11
\begin_inset Newline newline
\end_inset


\begin_inset Newline newline
\end_inset

The observed entropy is 
\begin_inset Formula 
\[
h_{PAIR}^{red1f.}=-\frac{7}{11}\log_{2}\frac{7}{11}-\frac{3}{11}\log_{2}\frac{3}{11}-\frac{1}{11}\log_{2}\frac{1}{11}=1.240671
\]

\end_inset

Hmm.
 3 link types
\end_layout
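
\begin_layout Standard
To finish the bookkeeping in the same way as for the earlier cases (this
 step is not carried out in the notes above, so treat it as a sketch): with
 3 link types, the equidistributional entropy and the resulting gap would
 be
\begin_inset Formula 
\begin{align*}
h^{gen} & =\log_{2}3=1.584963\\
h^{gen}-h^{obs} & =1.584963-1.240671=0.344292
\end{align*}

\end_inset

Under the same acceptance test applied to cases 1a through 1c, this gap
 is larger than the case 1 value of 0.090658, so the full merge would appear
 to be blocked as well.
\end_layout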

\begin_layout Paragraph*
Summary
\end_layout

\begin_layout Standard
Moves to cases 1a, 1b and 1c are all blocked.
 This seems surprising.
 The relatively high-frequency observation of =ter makes the distribution
 of the consolidated grammar deviate strongly from the distribution of
 the observed corpus.
 This seems like an undesired effect, as the point of learning how to simplify
 the grammar is to obtain a smaller grammar, rather than to preserve the
 distribution of the corpus.
 Mostly.
\end_layout
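
\begin_layout Standard
For reference, the figures computed so far can be collected in one place.
 This is only a restatement of numbers already given; the row for case 1c
 assumes it matches case 1b, as stated there, and 
\begin_inset Formula $\Delta$
\end_inset

 is shorthand introduced here for 
\begin_inset Formula $h^{gen}-h^{obs}$
\end_inset

.
\begin_inset Formula 
\[
\begin{array}{lcccc}
\mbox{grouping} & \mbox{link types} & h^{obs} & h^{gen} & \Delta\\
\mbox{none} & 8 & 2.845351 & 3 & 0.154649\\
\mbox{case 1} & 5 & 2.231270 & 2.321928 & 0.090658\\
\mbox{case 2} & 7 & 2.550341 & 2.807355 & 0.257014\\
\mbox{case 3} & 6 & 2.481715 & 2.584963 & 0.103248\\
\mbox{case 4} & 7 & 2.550341 & 2.807355 & 0.257014\\
\mbox{case 5} & 8 & 2.845351 & 3 & 0.154649\\
\mbox{case 1a} & 4 & 1.867634 & 2 & 0.132366\\
\mbox{case 1b} & 4 & 1.789929 & 2 & 0.210071\\
\mbox{case 1c} & 4 & 1.789929 & 2 & 0.210071
\end{array}
\]

\end_inset

Read this way, cases 1 and 3 are the only groupings that shrink 
\begin_inset Formula $\Delta$
\end_inset

 relative to the ungrouped starting point, and none of the follow-on merges
 improves on case 1.
\end_layout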

\begin_layout Standard
Intuition suggests that the grammar for 
\begin_inset Quotes eld
\end_inset

common
\begin_inset Quotes erd
\end_inset

 cases should be consolidated.
 The grammar for quite rare cases should indeed be handled distinctly.
 To avoid this seemingly perverse outcome, perhaps the grammar should contain
 frequency information, which is to be consolidated appropriately.
 This is truly tedious, but seems to be necessary.
 So we have to start from scratch.
\end_layout

\begin_layout Standard
And we do, below, and it's a total failure, as now, the corpus frequencies
 are recorded faithfully, so the consolidation process doesn't tell us anything
 we didn't know.
 It's the same calculation done differently.
\end_layout

\begin_layout Subsection*
Worked Link Consolidation Example, with Frequencies (XXX Fail)
\end_layout

\begin_layout Standard
(XXX The below fails, don't bother reading it).
 So we start all over again, using the same corpus frequencies as before,
 namely, those of table 
\begin_inset CommandInset ref
LatexCommand ref
reference "tab:Example-Link-Frequency"

\end_inset

.
 The grammar is essentially identical to that of 
\begin_inset CommandInset ref
LatexCommand ref
reference "alg:Example-morpheme-grammar"

\end_inset

, except that it is now annotated with probabilities.
\end_layout

\begin_layout Standard
\begin_inset listings
inline false
status open

\begin_layout Plain Layout

     gif.=: (GA+)(1/11) or (GB+)(1/11) or (GC+)(1/11);
\end_layout

\begin_layout Plain Layout

     tes.=: (TA+)(1/11) or (TB+)(1/11) or (TC+)(2/11);
\end_layout

\begin_layout Plain Layout

     blo.=: (BB+)(3/11) or (BF+)(1/11);
\end_layout

\begin_layout Plain Layout

     =t: GA- or TA-;
\end_layout

\begin_layout Plain Layout

     =ty: GB- or TB- or BB-;
\end_layout

\begin_layout Plain Layout

     =ter: GC- or TC-;
\end_layout

\begin_layout Plain Layout

     =fu: BF-;
\end_layout

\begin_layout Plain Layout

\end_layout

\end_inset

The above only annotates the +-going links; it seems like annotating the
 –going links would cause double-counting.
 This is somewhat confusing, since the probability has nothing to do with
 directionality.
 A better notation is not obvious.
 Let's go through the cases as before.
\end_layout

\begin_layout Paragraph*
Case 1.
\end_layout

\begin_layout Standard
Group gif.= and tes.= together.
 Let 
\begin_inset Formula $\gamma=\{\mbox{gif.=, tes.=}\}$
\end_inset

.
 Then the link types G* and T* need to be consolidated: A={GA,TA} and likewise
 B={GB,TB} and C={GC,TC}.
 The dictionary becomes
\end_layout

\begin_layout Standard
\begin_inset listings
inline false
status open

\begin_layout Plain Layout

     gif.= tes.=: (A+)(2/11) or (B+)(2/11) or (C+)(3/11);
\end_layout

\begin_layout Plain Layout

     blo.=: (BB+)(3/11) or (BF+)(1/11);
\end_layout

\begin_layout Plain Layout

     =t: A-;
\end_layout

\begin_layout Plain Layout

     =ty: B- or BB-;
\end_layout

\begin_layout Plain Layout

     =ter: C-;
\end_layout

\begin_layout Plain Layout

     =fu: BF-;
\end_layout

\begin_layout Plain Layout

\end_layout

\end_inset

The observational probabilities are unchanged, as the dictionary probabilities
 have no bearing on the parsing.
 However, the entropy of the generated language is different, as it is no
 longer 
\begin_inset Formula $\log_{2}5$
\end_inset

 but instead 
\begin_inset Formula 
\[
h^{gen}=-\frac{4}{11}\log_{2}\frac{2}{11}-\frac{6}{11}\log_{2}\frac{3}{11}-\frac{1}{11}\log_{2}\frac{1}{11}
\]

\end_inset

That is, it is now identical to 
\begin_inset Formula $h^{observed}$
\end_inset

.
 No surprise, as we made it like that, by encoding the frequency information
 in the dictionary.
\end_layout

\begin_layout Paragraph*
Case 2.
 
\end_layout

\begin_layout Standard
Group gif.= and blo.= together.
 Let 
\begin_inset Formula $\delta=\{\mbox{gif.=, blo.=}\}$
\end_inset

.
 Then the link types GB and BB can be consolidated, because they share the
 common suffix =ty: B={GB,BB}.
 No other link consolidation is possible, without permitting impermissible
 (previously unseen) linkages.
 The dictionary becomes
\end_layout

\begin_layout Standard
\begin_inset listings
inline false
status open

\begin_layout Plain Layout

     gif.= blo.=: (GA+)(1/11) or (B+)(4/11) or (GC+)(1/11) or (BF+)(1/11);
\end_layout

\begin_layout Plain Layout

     tes.=: (TA+)(1/11) or (TB+)(1/11) or (TC+)(2/11);
\end_layout

\begin_layout Plain Layout

     =t: GA- or TA-;
\end_layout

\begin_layout Plain Layout

     =ty: B- or TB-;
\end_layout

\begin_layout Plain Layout

     =ter: GC- or TC-;
\end_layout

\begin_layout Plain Layout

     =fu: BF-;
\end_layout

\begin_layout Plain Layout

\end_layout

\end_inset

The generated entropy is
\begin_inset Formula 
\[
h^{gen}=-\frac{5}{11}\log_{2}\frac{1}{11}-\frac{4}{11}\log_{2}\frac{4}{11}-\frac{2}{11}\log_{2}\frac{2}{11}
\]

\end_inset

which is identical to the corpus entropy, again.
 Not surprising, I guess ...
 we seem to be doing the same calculation, but in a different way.
 Dohh.
 Never mind ...
 
\end_layout

\begin_layout Subsection*
Alternate Distributions
\end_layout

\begin_layout Standard
Instead of looking for an equi-distribution, how about a Zipf distribution,
 which seems far more plausible? The distribution is 
\begin_inset Formula 
\[
p(k,n)=\frac{1}{kH_{n}}
\]

\end_inset

 where the normalization is 
\begin_inset Formula $H_{n}=\sum_{k=1}^{n}1/k$
\end_inset

.
 The entropy is then 
\begin_inset Formula 
\begin{align*}
h_{n}^{Zipf} & =-\sum_{k=1}^{n}p(k,n)\log_{2}p(k,n)\\
 & =\frac{1}{H_{n}}\sum_{k=1}^{n}\frac{\log_{2}kH_{n}}{k}\\
 & =\log_{2}H_{n}+\frac{1}{H_{n}}\sum_{k=1}^{n}\frac{\log_{2}k}{k}
\end{align*}

\end_inset

and the first few values are shown below.
 For comparison, 
\begin_inset Formula $h_{n}^{equi}=\log_{2}n$
\end_inset

 is also shown.
\end_layout
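
\begin_layout Standard
As a quick sanity check of the closed form, consider 
\begin_inset Formula $n=2$
\end_inset

, taking 
\begin_inset Formula $H_{n}$
\end_inset

 to be the usual harmonic number 
\begin_inset Formula $\sum_{k=1}^{n}1/k$
\end_inset

 (an assumption about the intended normalization).
 Then 
\begin_inset Formula $H_{2}=3/2$
\end_inset

, the two probabilities are 
\begin_inset Formula $2/3$
\end_inset

 and 
\begin_inset Formula $1/3$
\end_inset

, and
\begin_inset Formula 
\begin{align*}
h_{2}^{Zipf} & =\log_{2}\frac{3}{2}+\frac{2}{3}\left(\frac{\log_{2}1}{1}+\frac{\log_{2}2}{2}\right)\\
 & =0.584963+0.333333\\
 & =0.918296
\end{align*}

\end_inset

which agrees with evaluating 
\begin_inset Formula $-\frac{2}{3}\log_{2}\frac{2}{3}-\frac{1}{3}\log_{2}\frac{1}{3}$
\end_inset

 directly, and lies slightly below 
\begin_inset Formula $h_{2}^{equi}=1$
\end_inset

.
\end_layout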

\begin_layout Standard
\begin_inset Tabular
<lyxtabular version="3" rows="8" columns="4">
<features tabularvalignment="middle">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
n
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $H_{n}$
\end_inset


\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $h_{n}^{Zipf}$
\end_inset


\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $h_{n}^{equi}$
\end_inset


\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1.5
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.918296
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
3
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1.83333
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1.435371
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1.584963
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
4
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2.083333
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1.792488
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
5
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2.283333
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2.063860
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2.321928
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
6
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2.45
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2.281979
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2.584963
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
7
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2.592857
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2.463914
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2.807355
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
8
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2.717857
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2.619715
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
3
\end_layout

\end_inset
</cell>
</row>
</lyxtabular>

\end_inset


\begin_inset Newline newline
\end_inset


\begin_inset Newline newline
\end_inset

The question is then: how would the above cases go if the Zipfian entropy
 were used as the deciding factor, in place of the equi-distribution entropy?
 The comparison is shown below:
\end_layout

\begin_layout Standard
\begin_inset Tabular
<lyxtabular version="3" rows="11" columns="10">
<features tabularvalignment="middle">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Case
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
# lnk
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $h^{observed}$
\end_inset


\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $h^{equi}$
\end_inset


\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $h^{eq}-h^{obs}$
\end_inset


\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
OK
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $h^{Zipf}$
\end_inset


\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $h^{Zipf}-h^{obs}$
\end_inset


\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
C1
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
C2
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Base
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
8
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\strikeout off
\uuline off
\uwave off
\noun off
\color none
2.845351
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
3
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\strikeout off
\uuline off
\uwave off
\noun off
\color none
0.154649
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2.619715
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
-0.225636
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1.
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
5
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\strikeout off
\uuline off
\uwave off
\noun off
\color none
2.231270
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\strikeout off
\uuline off
\uwave off
\noun off
\color none
2.321928
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\strikeout off
\uuline off
\uwave off
\noun off
\color none
0.090658
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Y
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2.063860
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
-0.16741
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Y
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
N
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2.
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
7
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\strikeout off
\uuline off
\uwave off
\noun off
\color none
2.550341
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\strikeout off
\uuline off
\uwave off
\noun off
\color none
2.807355
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\strikeout off
\uuline off
\uwave off
\noun off
\color none
0.257014
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
N
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2.463914
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
-0.086427
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Y
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
N
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
3.
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
6
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\strikeout off
\uuline off
\uwave off
\noun off
\color none
2.481715
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\strikeout off
\uuline off
\uwave off
\noun off
\color none
2.584963
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\strikeout off
\uuline off
\uwave off
\noun off
\color none
0.103248
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Y
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2.281979
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
-0.199736
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Y
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
N
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
4.
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
7
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\strikeout off
\uuline off
\uwave off
\noun off
\color none
2.550341
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\strikeout off
\uuline off
\uwave off
\noun off
\color none
2.807355
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\strikeout off
\uuline off
\uwave off
\noun off
\color none
0.257014
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
N
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2.463914
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
-0.086427
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Y
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
N
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
5.
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
8
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\strikeout off
\uuline off
\uwave off
\noun off
\color none
2.845351
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
3
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\strikeout off
\uuline off
\uwave off
\noun off
\color none
0.154649
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2.619715
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
-0.225636
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1a.
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
4
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\strikeout off
\uuline off
\uwave off
\noun off
\color none
1.867634
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\strikeout off
\uuline off
\uwave off
\noun off
\color none
0.132366
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
N
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1.792488
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
-0.075146
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Y
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
N
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1b.
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
4
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\strikeout off
\uuline off
\uwave off
\noun off
\color none
1.789929
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\strikeout off
\uuline off
\uwave off
\noun off
\color none
0.210071
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
N
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1.792488
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.002559
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Y
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
N
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1c.
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
4
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\strikeout off
\uuline off
\uwave off
\noun off
\color none
1.789929
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\strikeout off
\uuline off
\uwave off
\noun off
\color none
0.210071
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
N
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1.792488
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.002559
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Y
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
N
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1f.
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
3
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\strikeout off
\uuline off
\uwave off
\noun off
\color none
1.240671
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1.584963
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.344292
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
N
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1.435371
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.1947
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
N
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
N
\end_layout

\end_inset
</cell>
</row>
</lyxtabular>

\end_inset


\begin_inset Newline newline
\end_inset


\begin_inset Newline newline
\end_inset

There seem to be two different decision criteria to apply: 
\end_layout

\begin_layout Enumerate
Does the reduced entropy come closer to the Zipfian entropy?
\end_layout

\begin_layout Enumerate
Does the reduced entropy increase, relative to the Zipfian entropy?
\end_layout
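
\begin_layout Standard
One possible way to make the two criteria precise (an interpretation added
 here for illustration; the symbol 
\begin_inset Formula $\Delta$
\end_inset

 is introduced only for this purpose, and it is assumed that each merge
 is compared against the case it starts from) is to write 
\begin_inset Formula $\Delta=h^{Zipf}-h^{obs}$
\end_inset

 before the merge and 
\begin_inset Formula $\Delta^{\prime}$
\end_inset

 for the same quantity after it:
\begin_inset Formula 
\begin{align*}
\mbox{criterion 1:}\quad & \left|\Delta^{\prime}\right|<\left|\Delta\right|\\
\mbox{criterion 2:}\quad & \Delta^{\prime}<\Delta
\end{align*}

\end_inset

For case 1, starting from the base case, 
\begin_inset Formula $\Delta=-0.225636$
\end_inset

 and 
\begin_inset Formula $\Delta^{\prime}=-0.16741$
\end_inset

, so the first inequality holds and the second does not.
\end_layout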

\begin_layout Standard
The first is shown in column C1, the second in C2.
 Naively, C2 seems like the better criterion.
 Does it also work for the simple case (with the original 7-word corpus)?
 Let's see:
\end_layout

\begin_layout Standard
\begin_inset Tabular
<lyxtabular version="3" rows="5" columns="10">
<features tabularvalignment="middle">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Case
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
# lnk
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $h^{observed}$
\end_inset


\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $h^{equi}$
\end_inset


\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $h^{eq}-h^{obs}$
\end_inset


\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
OK
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $h^{Zipf}$
\end_inset


\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $h^{Zipf}-h^{obs}$
\end_inset


\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
C1
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
C2
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Base
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
6
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\strikeout off
\uuline off
\uwave off
\noun off
\color none
2.521641
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\strikeout off
\uuline off
\uwave off
\noun off
\color none
2.584963
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.063322
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2.281979
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
-0.239662
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1.
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
4
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\strikeout off
\uuline off
\uwave off
\noun off
\color none
1.842371
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.157629
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
N
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1.792488
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
-0.049883
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Y
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
N
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1a
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\strikeout off
\uuline off
\uwave off
\noun off
\color none
0.985228
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.014772
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.918296
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
-0.066932
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
N
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Y
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1b.
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\strikeout off
\uuline off
\uwave off
\noun off
\color none
0.863121
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.136879
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.918296
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.055175
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
N
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
N
\end_layout

\end_inset
</cell>
</row>
</lyxtabular>

\end_inset


\begin_inset Newline newline
\end_inset


\begin_inset Newline newline
\end_inset

Basically, this is really irritating.
\end_layout

\begin_layout Subsection*
Thoughts
\end_layout

\begin_layout Standard
What have we learned from the above?
\end_layout

\begin_layout Itemize
The problem of condensing together morphemes into classes which share common
 link types is the bipartite clique problem.
 It is a known-hard problem.
\end_layout

\begin_layout Itemize
Bad grammars increase the size of the language.
 This could be acceptable, if the increase is small.
 What is the criterion? Unclear.
\end_layout

\begin_layout Section*
Consciousness - 27 July 2014
\end_layout

\begin_layout Standard
Two works:
\end_layout

\begin_layout Itemize
Masafumi Oizumi, Larissa Albantakis, Giulio Tononi, 
\begin_inset Quotes eld
\end_inset

From the Phenomenology to the Mechanisms of Consciousness: Integrated Informatio
n Theory 3.0
\begin_inset Quotes erd
\end_inset

 (2014) PLOS Computational Biology, http://www.ploscompbiol.org/article/info%3Adoi
%2F10.1371%2Fjournal.pcbi.1003588
\end_layout

\begin_layout Itemize
Max Tegmark, Consciousness as a State of Matter (27 Feb 2014) arXiv:1401.1219v2
 [quant-ph]
\end_layout

\begin_layout Standard
Curious points and thoughts:
\end_layout

\begin_layout Itemize
CEI – 
\begin_inset Quotes eld
\end_inset

Cause-effect information
\begin_inset Quotes erd
\end_inset

 – Tononi – sounds like a time-ordered variant of mutual information.
 How should this be defined? Answer: my guess is that it's just like the
 mutual information defined in eqn
\begin_inset CommandInset ref
LatexCommand ref
reference "eq:Relation MI"

\end_inset

, right? Because the relational complexity can deal with arbitrary structures,
 so that seems appropriate.
 
\end_layout

\begin_layout Section*
13 Sept 2014
\end_layout

\begin_layout Standard
The Zipfian distribution is typical of a scale-free network.
 So why is language scale-free? Crudely, because we attempt to recycle existing
 concepts/words.
\end_layout

\begin_layout Standard
Next, from this:
\end_layout

\begin_layout Itemize
Christoph Adami 
\begin_inset Quotes eld
\end_inset

Information-Theoretic Considerations Concerning the Origin of Life
\begin_inset Quotes erd
\end_inset

 http://arxiv.org/pdf/1409.0590v1.pdf
\end_layout

\begin_layout Standard
Come the following thoughts:
\end_layout

\begin_layout Itemize
Never assume a uniform distribution of parts.
 A uniform distribution makes it very unlikely that an important assemblage
 of parts can arise at random.
 For Adami, this is used to argue that biotic and abiotic strings should
 have very similar distributions (or rather, the converse: a non-uniform
 abiotic distribution makes it much more likely to find a replicator with
 a similar distribution.) 
\end_layout

\begin_layout Itemize
The information content of grammatical sentences of length L is 
\begin_inset Formula 
\[
I_{grammatical}=-\log_{2}(N_{grammatical}/N_{total})
\]

\end_inset

where 
\begin_inset Formula $N_{total}$
\end_inset

 is the total number of sentences of length L, assuming a 
\emph on
uniform
\emph default
 distribution of words picked from a vocabulary of D words.
 That is, 
\begin_inset Formula $N_{total}=D^{L}$
\end_inset

.
 But this is weird ...
 because the vocabulary isn't really a constant, and the natural distribution
 is not uniform, so it's not clear what kind of 
\begin_inset Quotes eld
\end_inset

information
\begin_inset Quotes erd
\end_inset

 the above actually is ...
 
\end_layout
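
\begin_layout Standard
A worked instance may make the oddity concrete.
 The numbers here are purely illustrative assumptions, and are not taken
 from Adami.
 Since 
\begin_inset Formula $N_{total}=D^{L}$
\end_inset

, the definition expands to 
\begin_inset Formula 
\[
I_{grammatical}=L\log_{2}D-\log_{2}N_{grammatical}
\]

\end_inset

so that for a hypothetical vocabulary of 
\begin_inset Formula $D=6$
\end_inset

 words, sentences of length 
\begin_inset Formula $L=3$
\end_inset

 and 
\begin_inset Formula $N_{grammatical}=4$
\end_inset

 grammatical sentences, one gets 
\begin_inset Formula 
\[
I_{grammatical}=\log_{2}\frac{6^{3}}{4}=\log_{2}54\approx5.75\mbox{\ bits}
\]

\end_inset

The first term grows with D even when the set of grammatical sentences is
 unchanged, which is one way of seeing how strongly the result depends on
 the uniform-vocabulary assumption.
\end_layout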

\begin_layout Section*
Thermodynamics - 24 March 2015
\end_layout

\begin_layout Standard
Some quick short notes: blog post: 
\begin_inset Quotes eld
\end_inset

Thermodynamics with Continuous Information Flow
\begin_inset Quotes erd
\end_inset

 https://johncarlosbaez.wordpress.com/2015/03/21/19395 with arxiv paper: http://ar
xiv.org/pdf/1402.3276v3.pdf Jordan M.
 Horowitz and Massimiliano Esposito study the master equation for a probability
 p(x,y) over the states of two subsystems X,Y, which are connected via a bipartite
 graph.
 The total system is also connected to a thermal bath.
 The eqn is
\begin_inset Formula 
\[
\frac{dp(x,y)}{dt}=\sum_{x',y'}\left[H_{x,x'}^{y,y'}p(x',y')-H_{x',x}^{y',y}p(x,y)\right]
\]

\end_inset

i.e.
 it's Markovian; we've written two indexes, which makes it clearer when H
 is bipartite i.e.
\begin_inset Formula 
\[
H_{x,x'}^{y,y'}=\begin{cases}
H_{x,x'}^{y} & \mbox{\ if\ }y=y'\\
H_{x}^{y,y'} & \mbox{\ if\ }x=x'\\
0 & \mbox{\ otherwise}
\end{cases}
\]

\end_inset

The interesting part is the entropy, and the thermal bath, which is not
 in the master eqn(!) The total entropy is 
\begin_inset Formula $S_{tot}=S_{XY}+S_{env}$
\end_inset

.
 Per usual, the information entropy is 
\begin_inset Formula $S_{XY}=-\sum_{x,y}p(x,y)\log p(x,y)$
\end_inset

.
 Two tricks now happen: (1) taking the time derivative of 
\begin_inset Formula $S_{XY}$
\end_inset

 results in something that naturally splits into an X piece and a Y piece.
 Trick (2) is that 
\begin_inset Formula $S_{env}$
\end_inset

 cannot be written down directly, but its time derivative can be; it is
 proportional to the heat current: 
\begin_inset Formula $\dot{S}_{env}=-\dot{Q}/T$
\end_inset

.
 Observe the tiny dots over S and Q: these are the usual rate-of-change dots
 (i.e.
 just rates, not functions we are taking the time derivative of).
 Q is heat, Q-dot is heat flow, T is temperature.
 Local detailed balance requires that 
\begin_inset Formula 
\[
\log\frac{H_{x,x'}^{y,y'}}{H_{x',x}^{y',y}}=\frac{-(E_{x,y}-E_{x',y'})}{kT}
\]

\end_inset

where 
\begin_inset Formula $E_{x,y}-E_{x',y'}$
\end_inset

 is the change in energy due to a state transition; this change in energy
 is supplied by the heat reservoir.
 Where does this mystery equation come from? Answer:
\end_layout

\begin_layout Standard
Detailed balance requires that, when the system reaches equilibrium, the
 transition rates into and out of the equilibrium state 
\begin_inset Formula $p_{i}=\pi_{i}$
\end_inset

 are equal:
\begin_inset Formula 
\[
H_{ji}\pi_{i}=H_{ij}\pi{}_{j}
\]

\end_inset

(there is NO repeated-index summation).
 Then, just write 
\begin_inset Formula $\pi_{i}\propto\exp(-E_{i}/kT)$
\end_inset

, and turn the crank.
 The general principle: 
\emph on
the log of the ratio of the forward and backward transition rates between
 two states must be proportional to the energy difference between those
 states! 
\end_layout
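
\begin_layout Standard
Turning the crank explicitly, as a sketch: the normalization of 
\begin_inset Formula $\pi$
\end_inset

 cancels in the ratio, so only the Boltzmann factors matter.
 Dividing the detailed-balance condition through gives 
\begin_inset Formula 
\[
\frac{H_{ji}}{H_{ij}}=\frac{\pi_{j}}{\pi_{i}}=\exp\left(-\frac{E_{j}-E_{i}}{kT}\right)
\]

\end_inset

and taking the log, 
\begin_inset Formula 
\[
\log\frac{H_{ji}}{H_{ij}}=\frac{-(E_{j}-E_{i})}{kT}
\]

\end_inset

which is the local detailed balance equation above, with the single index
 i standing for the pair 
\begin_inset Formula $(x',y')$
\end_inset

 and j standing for 
\begin_inset Formula $(x,y)$
\end_inset

.
\end_layout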

\begin_layout Standard
BTW the detailed-balance equation resembles Bayes Theorem, in that, if we
 wrote 
\begin_inset Formula $H_{ji}\rightarrow P(j|i)$
\end_inset

 and 
\begin_inset Formula $\pi_{i}\rightarrow P(i)$
\end_inset

, then detailed balance is written as 
\begin_inset Formula $P(j|i)P(i)=P(i|j)P(j)$
\end_inset

.
 So the master equation describes 
\begin_inset Quotes eld
\end_inset

non-equilibrium Bayes statistics
\begin_inset Quotes erd
\end_inset

, in a strange sense.
 Hmm.
 But, of course, this is just a Markov chain/process.
\end_layout

\begin_layout Section*
26 March 2015
\end_layout

\begin_layout Standard
The Inverse Relationship Principle of Channel theory: 
\begin_inset Quotes eld
\end_inset


\emph on
Whenever there is an increase in available information there is a corresponding
 decrease in possibilities, and vice versa.
\emph default

\begin_inset Quotes erd
\end_inset

 Barwise, 
\begin_inset Quotes eld
\end_inset

Information and Impossibilities.
\begin_inset Quotes erd
\end_inset

 Notre Dame J.
 Formal Logic Volume 38, Number 4, 488-515.
 Barwise, Jon and Jerry Seligman 1997.
 
\begin_inset Quotes eld
\end_inset

Information Flow: The Logic of Distributed Systems
\begin_inset Quotes erd
\end_inset

.
 Cambridge: Cambridge University Press
\end_layout

\begin_layout Section*
Linear networks - 3 May 2015
\end_layout

\begin_layout Standard
Another Baez post: 
\begin_inset Quotes eld
\end_inset


\emph on
A Compositional Framework for Passive Linear Networks
\begin_inset Quotes erd
\end_inset

 
\emph default
blog: https://johncarlosbaez.wordpress.com/2015/04/28/a-compositional-framework-fo
r-passive-linear-networks/
\end_layout

\begin_layout Standard
So first, we have the table:
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
\begin_inset Tabular
<lyxtabular version="3" rows="6" columns="5">
<features tabularvalignment="middle">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top" width="15col%">
<column alignment="center" valignment="top" width="15col%">
<column alignment="center" valignment="top" width="25col%">
<column alignment="center" valignment="top" width="25col%">
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
mechanics
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
electronics
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
information geometry
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
geometric mechanics
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
q
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
position
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
charge
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
entropy
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
point on manifold
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $\dot{q}$
\end_inset


\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
velocity
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
current
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
entropy change
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
tangent vector
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
p
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
momentum
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
flux linkage
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
temperature momentum
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
covector (vector in cotangent bundle)
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $\dot{p}$
\end_inset


\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
force
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
voltage
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
temperature
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
map from tangent bundle to cotangent bundle
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
principle of least action
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
principle of least power dissipation
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
?
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
principle of least action
\end_layout

\end_inset
</cell>
</row>
</lyxtabular>

\end_inset


\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
This table is slightly oversimplified; the first four columns show only
 the linear case.
 The fifth column makes clear that force isn't really p-dot; that only holds
 when the manifold is flat.
 Anyway..
\end_layout

\begin_layout Standard
Key concepts: (*) monoidal categories are needed, and (*) symplectic geometry
 is needed.
\end_layout

\begin_layout Standard
Baez does the linear passive-component electronics example, viz a network
 of passive resistors, capacitors, inductors.
 For the resistor network, voltages at each node are taken from the field
 
\begin_inset Formula $\mathbb{F}=\mathbb{R}$
\end_inset

 while for the inductive network, the field is the field of rational functions
 of one variable 
\begin_inset Formula $\mathbb{F}=\mathbb{R}(t)$
\end_inset

 with 
\begin_inset Formula $t$
\end_inset

 time: i.e.
 voltage varying over time.
 A Dirichlet form is a quadratic form 
\begin_inset Formula 
\[
P(\phi)=\frac{1}{2}\sum_{i,j}\frac{(\phi_{i}-\phi_{j})^{2}}{r_{ij}}
\]

\end_inset

where 
\begin_inset Formula $r_{ij}$
\end_inset

 is the resistance (impedance) between nodes i and j, and 
\begin_inset Formula $\phi_{i}=\phi(i)$
\end_inset

 is the voltage at node i.
 (Actually, we should be summing over edges, so as to handle parallel resistors).
 Note that the space of Dirichlet forms is smaller than the space of quadratic
 forms: Dirichlet forms do not have diagonal entries.
 Note that P is (half) the power dissipation.
\end_layout

\begin_layout Standard
The principle of least power dissipation is this: Given fixed voltages 
\begin_inset Formula $\psi$
\end_inset

 on the boundary of the network, i.e.
 on the input/output terminals, the actual power dissipated is
\begin_inset Formula 
\[
Q(\psi)=\min_{\phi\in\mathbb{R}^{N},\phi|_{\partial N}=\psi}\;P(\phi)
\]

\end_inset

Notation: there are N nodes, so voltages live in 
\begin_inset Formula $\mathbb{R}^{N}$
\end_inset

.
 The boundary of the network (input/output terminals) is written as 
\begin_inset Formula $\partial N$
\end_inset

 and the voltages are held fixed at the boundary.
 Note that Q is also a Dirichlet form.
 Viz. it's a map 
\begin_inset Formula $Q:\mathbb{R}^{\partial N}\to\mathbb{R}$
\end_inset

.
 The black-box principle of equivalent resistor networks is that any two
 resistor networks are black-box equivalent when they have the same Q.
 
\end_layout
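
\begin_layout Standard
As a sanity check, here is a minimal worked example, with the network chosen
 purely for illustration: two resistors 
\begin_inset Formula $r_{1},r_{2}$
\end_inset

 in series, with terminals a and b and a single interior node m.
 Summing over edges, and keeping the factor of one-half, 
\begin_inset Formula 
\[
P(\phi)=\frac{1}{2}\left[\frac{(\phi_{a}-\phi_{m})^{2}}{r_{1}}+\frac{(\phi_{m}-\phi_{b})^{2}}{r_{2}}\right]
\]

\end_inset

Holding 
\begin_inset Formula $\phi_{a}=\psi_{a}$
\end_inset

 and 
\begin_inset Formula $\phi_{b}=\psi_{b}$
\end_inset

 fixed and setting 
\begin_inset Formula $\partial P/\partial\phi_{m}=0$
\end_inset

 gives 
\begin_inset Formula 
\[
\phi_{m}=\frac{r_{2}\psi_{a}+r_{1}\psi_{b}}{r_{1}+r_{2}}
\]

\end_inset

and substituting back, 
\begin_inset Formula 
\[
Q(\psi)=\frac{1}{2}\frac{(\psi_{a}-\psi_{b})^{2}}{r_{1}+r_{2}}
\]

\end_inset

This is the Dirichlet form of a single resistor 
\begin_inset Formula $r_{1}+r_{2}$
\end_inset

, so the series network is black-box equivalent to one resistor: the usual
 series law.
\end_layout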

\begin_layout Standard
For the correct generalization to impedance, it is not enough to just replace
 
\begin_inset Formula $\mathbb{F}=\mathbb{R}$
\end_inset

 by 
\begin_inset Formula $\mathbb{F}=\mathbb{R}(t)$
\end_inset

 because this fails to deal with the time variation correctly.
 Put another way: for the pure resistor network, we are free to fix voltages
 at both the input and output terminals arbitrarily; the internal currents
 are determined entirely by these.
 For the general case with impedance, we are not free to fix both voltages
 and currents at both the input and output terminals.
 Out of the total set of 
\begin_inset Formula $2\dim(\partial N)$
\end_inset

 voltages and currents, we can fix only half the set, i.e.
 a mixture of voltages and currents of total dimension 
\begin_inset Formula $\dim(\partial N)$
\end_inset

.
\end_layout

\begin_layout Standard
To handle this, we need to construct a symplectic vector space, with a symplecti
c form on it, and work in the Lagrangian subspace of it.
 Thus, we have 
\begin_inset Formula $\psi\in\mathbb{F}^{\partial N}$
\end_inset

 as the potentials at the network terminals, and 
\begin_inset Formula $dQ_{\psi}\in\left(\mathbb{F}^{\partial N}\right)^{*}$
\end_inset

 as the conjugate currents.
 Out of the total space 
\begin_inset Formula $\mathbb{F}^{\partial N}\oplus\left(\mathbb{F}^{\partial N}\right)^{*}$
\end_inset

 of states, the subspace of actually attainable states is 
\begin_inset Formula 
\[
{\rm Graph}\left(dQ\right)=\left\{ \left(\psi,dQ_{\psi}\right)\vert\psi\in\mathbb{F}^{\partial N}\right\} \subseteq\mathbb{F}^{\partial N}\oplus\left(\mathbb{F}^{\partial N}\right)^{*}
\]

\end_inset

The set of Lagrangian subspaces is an algebraic variety, the Lagrangian
 Grassmannian.
\end_layout
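
\begin_layout Standard
For a concrete picture, take a single resistor r between terminals a and
 b (an illustrative choice), with 
\begin_inset Formula $\mathbb{F}=\mathbb{R}$
\end_inset

 and the same factor-of-one-half convention as for P above, so that 
\begin_inset Formula $Q(\psi)=(\psi_{a}-\psi_{b})^{2}/2r$
\end_inset

.
 Then 
\begin_inset Formula 
\[
dQ_{\psi}=\left(\frac{\psi_{a}-\psi_{b}}{r},\;-\frac{\psi_{a}-\psi_{b}}{r}\right)
\]

\end_inset

which is just Ohm's law: the two components are the currents at the two
 terminals, equal and opposite.
 The graph of dQ is then a 2-dimensional subspace of the 4-dimensional space
 of (potential, current) pairs, i.e.
 half the dimension, as a Lagrangian subspace must be.
\end_layout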

\begin_layout Standard
Baez's primary result on impedance networks is that the black box is describable
 by the symplectification of ..
 OK I don't get it.
\end_layout

\begin_layout Section*
Mining Grammatical Categories – 20 June 2015
\end_layout

\begin_layout Standard
Now that we have a database filled with disjunct statistics, how do we datamine
 that for grammatical categories, which is, after all, the main point of
 this exercise? Let me explain in several steps; first illustratively,
 and then more precisely.
 So first, consider a corpus containing these sentences:
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard

\family sans
the big tree 
\end_layout

\begin_layout Standard

\family sans
a green tree 
\end_layout

\begin_layout Standard

\family sans
the big bush 
\end_layout

\begin_layout Standard

\family sans
a green bush
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
\noindent
I want to conclude that "tree" is a lot like "bush", and the two should
 be considered as being "similar enough to be merged into a common grammatical
 category".
 That is, the words "tree" and "bush" always occur in similar contexts,
 or even the same contexts.
 The word 
\begin_inset Quotes eld
\end_inset

context
\begin_inset Quotes erd
\end_inset

 here means 
\begin_inset Quotes eld
\end_inset

the dependency parse context
\begin_inset Quotes erd
\end_inset

, and not 
\begin_inset Quotes eld
\end_inset

the n-gram context
\begin_inset Quotes erd
\end_inset

.
 More precisely, it means 
\begin_inset Quotes eld
\end_inset

the accumulated statistics for the disjuncts obtained from MST dependency
 parses
\begin_inset Quotes erd
\end_inset

.
\end_layout

\begin_layout Standard
Suppose the following parses were observed:
\end_layout

\begin_layout Standard

\family typewriter
\begin_inset listings
inline false
status open

\begin_layout Plain Layout

	 +---MA---+
\end_layout

\begin_layout Plain Layout

	 |   +-MB-+
\end_layout

\begin_layout Plain Layout

	 |   |    |
\end_layout

\begin_layout Plain Layout

	the big  tree 
\end_layout

\begin_layout Plain Layout

\end_layout

\begin_layout Plain Layout

	 +---MC---+
\end_layout

\begin_layout Plain Layout

	 |   +-MD-+
\end_layout

\begin_layout Plain Layout

	 |   |    |
\end_layout

\begin_layout Plain Layout

	 a green tree 
\end_layout

\begin_layout Plain Layout

\end_layout

\begin_layout Plain Layout

	 +---ME---+
\end_layout

\begin_layout Plain Layout
\align left

	 |   +-MF-+
\end_layout

\begin_layout Plain Layout

	 |   |    |
\end_layout

\begin_layout Plain Layout

	the big bush
\end_layout

\begin_layout Plain Layout

\end_layout

\begin_layout Plain Layout

	 +---MG---+
\end_layout

\begin_layout Plain Layout

	 |   +-MH-+
\end_layout

\begin_layout Plain Layout

	 |   |    |
\end_layout

\begin_layout Plain Layout

	 a green bush
\end_layout

\end_inset


\end_layout

\begin_layout Standard
Recall that the above parses were obtained by performing a Maximum-Spanning-Tree
 (MST) parse based on word-pair mutual information (MI).
 The MST is obtained by considering the graph clique joining all words in
 the sentence, and then keeping only the spanning tree whose edges have the
 greatest total MI between pairs of words.
 This is the 
\begin_inset Quotes eld
\end_inset

Yuret parse
\begin_inset Quotes erd
\end_inset

.
 The Yuret parse does not have labelled edges, and so we assign arbitrary
 (but unique!) link labels to the edges that were kept.
 Every unique word pair gets a unique link type.
 Then, using the standard Link Grammar theory, each link is broken into
 a + and a - connector, and the ordered set of connectors on a word is
 called a disjunct.
\end_layout

\begin_layout Standard
The disjuncts extracted from the above parses would then be:
\end_layout

\begin_layout Standard
\begin_inset listings
inline false
status open

\begin_layout Plain Layout

	tree: (MA- & MB-) or (MC- & MD-)
\end_layout

\begin_layout Plain Layout

	bush: (ME- & MF-) or (MG- & MH-)
\end_layout

\end_inset


\end_layout

\begin_layout Standard
No two disjuncts are alike, so naively, these seem completely incomparable.
 Of course, this is wrong; we need to compare the 
\begin_inset Quotes eld
\end_inset

decoded disjunct
\begin_inset Quotes erd
\end_inset

.
 The 
\begin_inset Quotes eld
\end_inset

decoded disjunct
\begin_inset Quotes erd
\end_inset

 is NOT a part of the standard Link Grammar theory, so let me explain it
 here: it is simply the disjunct where each connector is replaced by the
 word or word-class that it connects to.
 For example, 
\family typewriter
MA-
\family default
 connects to the word 
\begin_inset Quotes eld
\end_inset

the
\begin_inset Quotes erd
\end_inset

, so the 
\begin_inset Quotes eld
\end_inset

decoded connector
\begin_inset Quotes erd
\end_inset

 for 
\family typewriter
MA-
\family default
 is 
\family typewriter
$the$-
\family default
.
 So, the decoded disjuncts are then:
\end_layout

\begin_layout Standard
\begin_inset listings
inline false
status open

\begin_layout Plain Layout

	tree: ($the$- & $big$-) or ($a$- & $green$-)
\end_layout

\begin_layout Plain Layout

	bush: ($the$- & $big$-) or ($a$- & $green$-)
\end_layout

\end_inset


\end_layout

\begin_layout Standard
Now we can see that the decoded disjuncts are identical, for this example.
 Based on this, we conclude that perhaps 
\begin_inset Quotes eld
\end_inset

tree
\begin_inset Quotes erd
\end_inset

 and 
\begin_inset Quotes eld
\end_inset

bush
\begin_inset Quotes erd
\end_inset

 indeed belong to the same grammatical category.
 The remainder of the clustering algorithm is now 
\begin_inset Quotes eld
\end_inset

obvious
\begin_inset Quotes erd
\end_inset

: rewrite the dictionary so that it has a single entry for both words:
\end_layout

\begin_layout Standard
\begin_inset listings
inline false
status open

\begin_layout Plain Layout

	tree bush: (MA- & MB-) or (MC- & MD-)
\end_layout

\end_inset

This leaves the ME+, MF+, 
\emph on
etc
\emph default
.
 connectors dangling: thus, we need to search for all occurrences of ME+ and
 replace them by MA+, and likewise all occurrences of MF+ need to be replaced
 by MB+, and so on.
\end_layout

\begin_layout Subsubsection*
Similarity metrics
\end_layout

\begin_layout Standard
The above conveys the general idea, but it oversimplifies a few aspects.
 First of all, it is very unlikely that two words will appear in sentence
 contexts that are exactly identical.
 Secondly, some constructions may be very common, and others, very rare;
 that is, some disjuncts may be very common, and some very rare.
 So, for example: suppose we read a text which used the phrase 
\begin_inset Quotes eld
\end_inset


\emph on
the big idea
\emph default

\begin_inset Quotes erd
\end_inset

 a lot, but we also read an obscure linguistics text that said that 
\begin_inset Quotes eld
\end_inset


\emph on
a green idea sleeps furiously
\emph default

\begin_inset Quotes erd
\end_inset

.
 It would probably be a mistake to lump 
\begin_inset Quotes eld
\end_inset

idea
\begin_inset Quotes erd
\end_inset

 in with 
\begin_inset Quotes eld
\end_inset

tree, bush
\begin_inset Quotes erd
\end_inset

, given that 
\begin_inset Quotes eld
\end_inset

green idea
\begin_inset Quotes erd
\end_inset

 is a very rare construction.
 Thus, we need a better way of comparing collections of disjuncts.
 
\end_layout

\begin_layout Standard
One obvious way is to treat a collection of decoded disjuncts as a vector
 in a high-dimensional vector space.
 The similarity between two vectors could be given by the cosine of the
 angle between them.
 Alternatively, perhaps the vectors could be treated as points, and similarity
 be given by the distance between points.
 There are other possibilities; the best choice is not obvious; several
 need to be explored.
\end_layout

\begin_layout Standard
Thus, for example, let 
\begin_inset Formula $\{e_{1},e_{2},e_{3},\cdots\}$
\end_inset

 be the basis of a high-dimensional vector space.
 For the previous example, we let 
\begin_inset Formula $e_{1}$
\end_inset

 correspond to the decoded disjunct 
\family typewriter
($the$- & $big$-)
\family default
 while 
\begin_inset Formula $e_{2}$
\end_inset

 corresponds to 
\family typewriter
($a$- & $green$-)
\family default
.
 The word 
\begin_inset Quotes eld
\end_inset

tree
\begin_inset Quotes erd
\end_inset

 is then some vector ...
 what vector should it be? There are several choices.
 Suppose that 
\family typewriter
($the$- & $big$-)
\family default
 was observed with a frequency 
\begin_inset Formula $p_{1}$
\end_inset

 and that 
\family typewriter
($a$- & $green$-)
\family default
 was observed with frequency 
\begin_inset Formula $p_{2}$
\end_inset

.
 The corresponding vector is then obviously 
\begin_inset Formula $p_{1}e_{1}+p_{2}e_{2}$
\end_inset

 and we can construct another vector that corresponds to the word 
\begin_inset Quotes eld
\end_inset

bush
\begin_inset Quotes erd
\end_inset

, say, for example: 
\begin_inset Formula $q_{1}e_{1}+q_{2}e_{2}$
\end_inset

.
\end_layout

\begin_layout Standard
The dot-product between 
\begin_inset Quotes eld
\end_inset

tree
\begin_inset Quotes erd
\end_inset

 and 
\begin_inset Quotes eld
\end_inset

bush
\begin_inset Quotes erd
\end_inset

 is then given by 
\begin_inset Formula $p_{1}q_{1}+p_{2}q_{2}$
\end_inset

, so that the larger the product, the closer the two words are.
 The cosine of the angle between the two vectors is 
\begin_inset Formula $(p_{1}q_{1}+p_{2}q_{2})/\left|p\right|\left|q\right|$
\end_inset

 where 
\begin_inset Formula $\left|p\right|=\sqrt{p_{1}^{2}+p_{2}^{2}}$
\end_inset

 and so on.
 The closer that the cosine is to 1.0, the closer the two words are.
 There are other possibilities: we have the Euclidean distance 
\begin_inset Formula 
\[
dist(tree,bush)=\sqrt{(p_{1}-q_{1})^{2}+(p_{2}-q_{2})^{2}}
\]

\end_inset

 and we can contemplate 
\emph on
lp
\emph default
-metrics as well.
\end_layout
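
\begin_layout Standard
As a purely illustrative calculation (the frequencies here are made up,
 not measured), suppose that 
\begin_inset Formula $p=(0.6,0.4)$
\end_inset

 for 
\begin_inset Quotes eld
\end_inset

tree
\begin_inset Quotes erd
\end_inset

 and 
\begin_inset Formula $q=(0.5,0.5)$
\end_inset

 for 
\begin_inset Quotes eld
\end_inset

bush
\begin_inset Quotes erd
\end_inset

.
 Then
\begin_inset Formula 
\begin{align*}
p\cdot q & =0.6\times0.5+0.4\times0.5=0.5\\
\cos\theta & =\frac{0.5}{\sqrt{0.52}\sqrt{0.5}}\approx0.981\\
dist(tree,bush) & =\sqrt{0.1^{2}+0.1^{2}}\approx0.141
\end{align*}

\end_inset

so, for these assumed numbers, both the cosine and the distance would rate
 the two words as close.
\end_layout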

\begin_layout Standard
None of the above metrics take into account the mutual information (MI)
 of the disjunct.
 This is almost surely a mistake.
 Due to the vagaries of MST parsing, there will be many disjuncts with a
 low MI value.
 This is not uncommon in sentences with prepositions, where MST gives some
 poor choices for the links to the prepositions, and thus results in disjuncts
 with low MI values.
 Recall, the higher the MI, the stronger the structure is.
 Thus, perhaps a better vector for 
\begin_inset Quotes eld
\end_inset

tree
\begin_inset Quotes erd
\end_inset

 might be
\end_layout

\begin_layout Standard
\begin_inset Formula 
\[
tree=e_{1}m_{1}p_{1}+e_{2}m_{2}p_{2}
\]

\end_inset

The above seems to be the most entropic-like in its expression.
 However, the probabilities might weight the terms too strongly, and so a
 weaker weighting would be the below.
 It is not yet clear to me which of these expressions is the most 
\begin_inset Quotes eld
\end_inset

elegant
\begin_inset Quotes erd
\end_inset

, or which works best...
 
\begin_inset Formula 
\[
tree=e_{1}(m_{1}-\log_{2}p_{1})+e_{2}(m_{2}-\log_{2}p_{2})
\]

\end_inset

Here 
\begin_inset Formula $m_{1}$
\end_inset

 and 
\begin_inset Formula $m_{2}$
\end_inset

 are the mutual information values of the disjuncts 
\family typewriter
(MA- & MB-)
\family default
 and 
\family typewriter
(MC- & MD-)
\family default
, respectively.
 The last two seem to be closer to the intended spirit of the maximum entropy
 principle.
 There are even more possibilities, though.
\end_layout

\begin_layout Subsubsection*
Frequency and Mutual Information
\end_layout

\begin_layout Standard
The above section makes explicit use of the frequency and the mutual information
 of a disjunct.
 It is useful to define these.
 Given a disjunct 
\family typewriter
(MA- & MB-)
\family default
 let N
\family typewriter
(MA- & MB-)
\family default
 be the number of times that this disjunct has been observed.
 It will usually be an integer (except when obtained in certain unusual
 situations not discussed here).
 Let N
\family typewriter
(*- & *-)
\family default
 be the number of times that any two-connector disjunct has been observed,
 as long as both connectors point in the - direction.
 That is,
\begin_inset Formula 
\[
N(*-\&*-)=\sum_{c_{1}\in-,c_{2}\in-}N(c_{1}\&c_{2})
\]

\end_inset

the summation taking place over all connectors in the - direction.
 The frequency of observing 
\family typewriter
(MA- & MB-)
\family default
 is then
\begin_inset Formula 
\[
p(MA-\&MB-)=\frac{N(MA-\&MB-)}{N(*-\&*-)}
\]

\end_inset

The mutual information associated with the disjunct is then 
\begin_inset Formula 
\[
m(MA-\&MB-)=\log_{2}\frac{p(MA-\&MB-)}{p(MA-\&*-)p(*-\&MB-)}
\]

\end_inset

The reason for this possibly unexpected form was developed earlier in this
 diary.
 
\end_layout
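
\begin_layout Standard
The marginals in the denominator are not spelled out above; presumably they
 are the wild-card sums 
\begin_inset Formula $p(MA-\&*-)=\sum_{c_{2}\in-}p(MA-\&c_{2})$
\end_inset

, and likewise for 
\begin_inset Formula $p(*-\&MB-)$
\end_inset

.
 Under that assumption, and with made-up counts chosen only for illustration,
 say 
\begin_inset Formula $N(*-\&*-)=1000$
\end_inset

, 
\begin_inset Formula $N(MA-\&MB-)=20$
\end_inset

, 
\begin_inset Formula $N(MA-\&*-)=50$
\end_inset

 and 
\begin_inset Formula $N(*-\&MB-)=100$
\end_inset

, the arithmetic would run as
\begin_inset Formula 
\[
m(MA-\&MB-)=\log_{2}\frac{0.02}{0.05\times0.1}=\log_{2}4=2
\]

\end_inset

so this disjunct would be seen four times more often than if its two connectors
 occurred independently.
\end_layout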

\begin_layout Subsubsection*
Semantics
\end_layout

\begin_layout Standard
There is another interesting issue that arises in the above discussion:
 the problem of syntax-semantics correspondence.
 Consider, for example, the sentence 
\emph on

\begin_inset Quotes eld
\end_inset

the dog treed the squirrel
\emph default

\begin_inset Quotes erd
\end_inset

.
 Here the word 
\begin_inset Quotes eld
\end_inset

tree
\begin_inset Quotes erd
\end_inset

 is used as a verb, meaning 
\begin_inset Quotes eld
\end_inset

the dog chased the squirrel up into the tree
\begin_inset Quotes erd
\end_inset

.
 Such sentences will cause the word 
\begin_inset Quotes eld
\end_inset

tree
\begin_inset Quotes erd
\end_inset

 to accumulate disjuncts that the word 
\begin_inset Quotes eld
\end_inset

bush
\begin_inset Quotes erd
\end_inset

 will not have.
 Likewise, 
\begin_inset Quotes eld
\end_inset


\emph on
I'm bushed
\emph default

\begin_inset Quotes erd
\end_inset

 is a verb usage that has no analogous 
\begin_inset Quotes eld
\end_inset

tree
\begin_inset Quotes erd
\end_inset

 version.
 Thus, not only do the words 
\begin_inset Quotes eld
\end_inset

bush
\begin_inset Quotes erd
\end_inset

 and 
\begin_inset Quotes eld
\end_inset

tree
\begin_inset Quotes erd
\end_inset

 have different sets of disjuncts, but the differences are hiding semantic
 differences ...
 
\end_layout

\begin_layout Standard
There are several strategies that can be used to deal with this.
 More on this later.
 
\end_layout

\begin_layout Subsubsection*
Finding word pairs
\end_layout

\begin_layout Standard
We need a good way of finding word-pairs that are likely to be related.
 I think that perhaps the pattern matcher may be ideal for this.
 Details are TBD...
 but the basic idea is that the hypergraph for 
\begin_inset Quotes eld
\end_inset

tree: (MA- & MB-)
\begin_inset Quotes erd
\end_inset

 is connected to 
\begin_inset Quotes eld
\end_inset

big
\begin_inset Quotes erd
\end_inset

 because MB- is connected to 
\begin_inset Quotes eld
\end_inset

big
\begin_inset Quotes erd
\end_inset

, and 
\begin_inset Quotes eld
\end_inset

big
\begin_inset Quotes erd
\end_inset

 is connected to other lg-connectors, which in turn are connected to other
 disjuncts, which are then connected to other words.
 Thus, we search the local neighborhood of 
\begin_inset Quotes eld
\end_inset

tree
\begin_inset Quotes erd
\end_inset

, which causes us to discover the word 
\begin_inset Quotes eld
\end_inset

big
\begin_inset Quotes erd
\end_inset

, and then we search the neighborhood of 
\begin_inset Quotes eld
\end_inset

big
\begin_inset Quotes erd
\end_inset

 to find candidates such as 
\begin_inset Quotes eld
\end_inset

bush
\begin_inset Quotes erd
\end_inset

 which might be comparable to 
\begin_inset Quotes eld
\end_inset

tree
\begin_inset Quotes erd
\end_inset

.
 This search graph is not small, but neither is it huge: there may be thousands
 of words that are two hops away from 
\begin_inset Quotes eld
\end_inset

tree
\begin_inset Quotes erd
\end_inset

, but not millions.
\end_layout

\begin_layout Subsubsection*
Putting it all together
\end_layout

\begin_layout Standard
These are the things that need to be done:
\end_layout

\begin_layout Enumerate
Compute the MI for the disjuncts.
\end_layout

\begin_layout Enumerate
Pick a common noun, and compute the similarity scores for that word and every
 word that is linked to it.
 Create a ranked graph of similarity.
\end_layout

\begin_layout Enumerate
Repeat step 2 for several different similarity formulas.
\end_layout

\begin_layout Enumerate
Repeat steps 2 and 3 for several verbs, several adjectives, several adverbs,
 several determiners, several prepositions.
\end_layout

\begin_layout Enumerate
Write code for grouping words into grammatical clusters.
\end_layout

\begin_layout Enumerate
Pick the most promising metric, and start clustering in bulk.
\end_layout

\begin_layout Standard
Step 5 requires writing a lot of code; it can all be written before the
 final metric has been determined.
\end_layout

\begin_layout Subsubsection*
The end.
\end_layout

\begin_layout Standard
That's all for now.
 More later.
\end_layout

\begin_layout Section*
Not LSA – 1 July 2015
\end_layout

\begin_layout Standard
NotLSA – a way to do LSA-like things without actually using LSA (Latent
 Semantic Analysis).
 Two very low-brow approaches, maybe well-known in the industry; I have
 no idea.
 Both of these approaches attempt to automatically extract keywords from
 documents.
 What's cool about this is that it's ...
 unsupervised; requires no training, and is based on very simple, proven
 ideas.
 Obvious, even: compute the mutual information between pairs of things ...
 between words and documents, between words and word-pairs, etc.
 Heh.
 
\end_layout

\begin_layout Standard
But how do we do this? How do we compute the MI between a page of text,
 and a word? No way to answer this without diving into the details.
\end_layout

\begin_layout Subsubsection*
Text-keyword correlation
\end_layout

\begin_layout Standard
Let's take a text, say 1000 pages of ...
 something.
 Some corpus.
 We want to compute the mutual information between the page itself, and
 the words on the page.
 We do this by analogy to MI of word pairs.
\end_layout

\begin_layout Standard
Call the 
\begin_inset Formula $k$
\end_inset

'th page 
\begin_inset Formula $g_{k}$
\end_inset

.
 Count the number of times that word 
\begin_inset Formula $w_{m}$
\end_inset

 appears on this page; let this count be 
\begin_inset Formula $N_{mk}$
\end_inset

.
 Let 
\begin_inset Formula $N_{m}=\sum_{k}N_{mk}$
\end_inset

 be the total number of times that the word 
\begin_inset Formula $w_{m}$
\end_inset

 appears in the corpus, and let 
\begin_inset Formula $N=\sum_{m}N_{m}$
\end_inset

 be the total number of words in the corpus.
 Then, as usual, define probabilities, so that 
\begin_inset Formula 
\[
p_{m}=P(w_{m})=N_{m}/N
\]

\end_inset

is the frequency of observing word 
\begin_inset Formula $w_{m}$
\end_inset

 in the entire corpus, and 
\begin_inset Formula 
\[
p_{mk}=P(w_{m}|g_{k})=N_{mk}/\sum_{m}N_{mk}
\]

\end_inset

is the (relative) frequency of the same word on page 
\begin_inset Formula $g_{k}$
\end_inset

.
 Notice that the definition of 
\begin_inset Formula $p_{mk}$
\end_inset

 is independent of the page size.
 Pages do not all have to be of the same size.
 Define the mutual information as 
\begin_inset Formula 
\[
\mbox{MI}(g_{k},w_{m})=\log_{2}\frac{p_{mk}}{p_{m}}=\log_{2}\frac{N_{mk}N}{\sum_{m}N_{mk}\sum_{k}N_{mk}}=\log_{2}\frac{p(m,k)}{p(m,*)p(*,k)}
\]

\end_inset

This is essentially a measure of how much more often the word 
\begin_inset Formula $w_{m}$
\end_inset

 appears on page 
\begin_inset Formula $g_{k}$
\end_inset

 as compared to its usual frequency.
 The highest-MI words are essentially the topic words for the page.
 The right-most form introduces a new notation, to make it clear that it
 resembles the traditional pair-MI expression.
 The notation is 
\begin_inset Formula 
\[
p(m,k)=\frac{N_{mk}}{N}
\]

\end_inset

so that 
\begin_inset Formula 
\[
p(m,*)=\sum_{k}p(m,k)\qquad\mbox{and}\qquad p(*,k)=\sum_{m}p(m,k)
\]

\end_inset

are the marginal probabilities that appear in the traditional pair-MI expression.
 
\end_layout
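
\begin_layout Standard
A small worked case may help fix the scale; the counts are invented for
 illustration only.
 Suppose the corpus has 
\begin_inset Formula $N=100000$
\end_inset

 words, that 
\begin_inset Formula $w_{m}$
\end_inset

 occurs 
\begin_inset Formula $N_{m}=100$
\end_inset

 times in all, and that page 
\begin_inset Formula $g_{k}$
\end_inset

 holds 500 words, of which 
\begin_inset Formula $N_{mk}=4$
\end_inset

 are 
\begin_inset Formula $w_{m}$
\end_inset

.
 Then
\begin_inset Formula 
\[
p_{m}=\frac{100}{100000}=0.001,\qquad p_{mk}=\frac{4}{500}=0.008,\qquad\frac{p_{mk}}{p_{m}}=8,\qquad\log_{2}8=3
\]

\end_inset

so the word is eight times more frequent on this page than in the corpus
 as a whole, and the logarithm contributes three bits.
 A word whose frequency on the page equals its corpus frequency gives a
 ratio of 1 and an MI of zero.
\end_layout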

\begin_layout Standard
TODO: this does not have the feature-reduction/word-combining aspects of
 LSA...
\end_layout

\begin_layout Subsubsection*
Variants
\end_layout

\begin_layout Standard
Instead of working with words, we could work with word-pairs, which is a
 stand-in for working with (named) entities.
 Thus, we can identify if a named entity occurs in a document more often
 than average.
\end_layout

\begin_layout Section*
Unsupervised Morphology Learning References
\end_layout

\begin_layout Standard
Here are some:
\end_layout

\begin_layout Itemize
John Goldsmith, 
\begin_inset Quotes eld
\end_inset

The unsupervised learning of natural language morphology
\begin_inset Quotes erd
\end_inset

, Computational Linguistics, Volume 27, Issue 2, June 2001, pages 153-198,
 MIT Press.
 http://delivery.acm.org/10.1145/980000/972668/p153-goldsmith.pdf
\end_layout

\begin_layout Itemize
John Goldsmith, 
\begin_inset Quotes eld
\end_inset

An algorithm for unsupervised learning of morphology
\begin_inset Quotes erd
\end_inset

, Natural Language Engineering, Volume 12, Issue 4, December 2006, pages
 353-371, Cambridge University Press.
 DOI: http://dx.doi.org/10.1017/S1351324905004055
 http://people.cs.uchicago.edu/~jagoldsm/Papers/algorithm.pdf
\end_layout

\begin_layout Itemize
Harald Hammarström and Lars Borin, 
\begin_inset Quotes eld
\end_inset

Unsupervised Learning of Morphology
\begin_inset Quotes erd
\end_inset

 (survey article), http://www.mitpressjournals.org/doi/pdf/10.1162/COLI_a_00050
\end_layout

\begin_layout Section*
Predicate-Argument structure
\end_layout

\begin_layout Standard
Here's one:
\end_layout

\begin_layout Itemize
The Darwinian evolution of natural language comes from a combination of
 Expressive FSM's and Lexical predicate-argument FSM's within the human
 brain.
 Shigeru Miyagawa, Robert C.
 Berwick and Kazuo Okanoya 
\begin_inset Quotes eld
\end_inset

The emergence of hierarchical structure in human language
\begin_inset Quotes erd
\end_inset

 Front.
 Psychol., 20 February 2013 | http://dx.doi.org/10.3389/fpsyg.2013.00071 http://alpha-
leonis.lids.mit.edu/wordpress/wp-content/uploads/2014/01/shigeru-berwick-kaz-fronti
ers13.pdf
\end_layout

\begin_layout Section*
Edge-counting – 27 March 2017
\begin_inset CommandInset label
LatexCommand label
name "sec:Edge-counting"

\end_inset


\end_layout

\begin_layout Standard
Counting edges in a clique is not the same as counting edges in planar trees.
 The diagram below shows the clique of a four-word sentence.
 The 
\begin_inset Quotes eld
\end_inset

words
\begin_inset Quotes erd
\end_inset

 are 'a', 'b', 'c' and 'd'.
 There are a total of six edges, with one edge between every possible word-pair.
 Each edge occurs only once.
\begin_inset Separator latexpar
\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Graphics
	filename images/four-clique.eps
	width 30col%

\end_inset


\end_layout

\begin_layout Standard
Pair counting in planar diagrams gives different results.
 The diagram below shows the twelve planar trees, containing no cycles,
 that can be formed by parsing a sentence of four words.
 
\begin_inset Separator latexpar
\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Graphics
	filename images/edge-counting.eps
	width 40col%

\end_inset


\end_layout

\begin_layout Standard
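The general formula for the number of different planar dependency parses
 of a sentence of 
\begin_inset Formula $n$
\end_inset

 words is
\begin_inset Formula 
\[
\frac{1}{2n-1}{3n-3 \choose n-1}
\]

\end_inset

For 
\begin_inset Formula $n=4$
\end_inset

 this gives 
\begin_inset Formula ${9 \choose 3}/7=84/7=12$
\end_inset

, matching the twelve trees in the diagram above.
 This formula is given by Deniz Yuret in 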
\begin_inset Quotes eld
\end_inset


\begin_inset CommandInset href
LatexCommand href
name "Lexical Attraction Models of Language"
target "http://www2.denizyuret.com/pub/lex-attr/lam-iscis06.pdf"
literal "false"

\end_inset


\begin_inset Quotes erd
\end_inset

 ISCIS 2006 (http://www2.denizyuret.com/pub/lex-attr/lam-iscis06.pdf) 
\end_layout

\begin_layout Standard
It is important not to confuse this with the 
\begin_inset Quotes eld
\end_inset

matrix-tree theorem
\begin_inset Quotes erd
\end_inset

 aka 
\begin_inset CommandInset href
LatexCommand href
name "Kirchhoff's Theorem"
target "https://en.wikipedia.org/wiki/Kirchhoff%27s_theorem"
literal "false"

\end_inset

, which counts the number of spanning trees of a graph.
 In short, it states that the number of spanning trees is equal to any cofactor
 of its Laplacian matrix.
 In our case, we are dealing with a complete graph (a clique) and so one
 might hope that 
\begin_inset CommandInset href
LatexCommand href
name "Cayley's formula"
target "https://en.wikipedia.org/wiki/Cayley's_formula"
literal "false"

\end_inset

 applies.
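 In fact, neither theorem applies, because we are interested in non-self-intersecting
 planar trees, constrained by linear word-order.
 For four words, Cayley's formula gives 
\begin_inset Formula $4^{2}=16$
\end_inset

 spanning trees, but four of these contain both of the crossing edges (ac)
 and (bd), leaving the twelve planar trees shown above.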
\end_layout

\begin_layout Standard
The twelve trees contain 36 edges in total (three per tree), and these are
 unequally distributed over the word-pairs.
 The counts are: 
\begin_inset Separator latexpar
\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Tabular
<lyxtabular version="3" rows="7" columns="2">
<features tabularvalignment="middle">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
word-pair
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
count
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
(ab)
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
7
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
(bc)
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
7
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
(cd)
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
7
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
(ac)
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
4
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
(bd)
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
4
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
(ad)
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
7
\end_layout

\end_inset
</cell>
</row>
</lyxtabular>

\end_inset


\end_layout

\begin_layout Standard
Note that the most frequent edges occur almost twice as often as the least-frequ
ent edges.
 The distribution, by length, is:
\begin_inset Separator latexpar
\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Tabular
<lyxtabular version="3" rows="4" columns="2">
<features tabularvalignment="middle">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Length
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Count
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
21
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
8
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
3
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
7
\end_layout

\end_inset
</cell>
</row>
</lyxtabular>

\end_inset


\end_layout

\begin_layout Standard
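Note that progressively longer edges are less frequent in total, partly
 because there are fewer word-pairs at greater separation: three of length
 1, two of length 2, and one of length 3.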
\end_layout

\begin_layout Standard
If graphs with cycles are also allowed (but still no edge crossings), then,
 in addition to the above, there are eleven more diagrams.
 These are shown below.
\begin_inset Separator latexpar
\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Graphics
	filename images/loop-counting.eps
	width 40col%

\end_inset


\end_layout

\begin_layout Standard
Again, we count the number of edges, as before.
 The 'tree' column shows the counts from before; the 'loop' column counts
 the edges from the additional eleven diagrams; the 'total' column is their
 sum.
\begin_inset Separator latexpar
\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Tabular
<lyxtabular version="3" rows="7" columns="4">
<features tabularvalignment="middle">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
word-pair
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
tree
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
loop
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
total
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
(ab)
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
7
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
9
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
16
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
(bc)
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
7
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
9
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
16
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
(cd)
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
7
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
9
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
16
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
(ac)
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
4
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
5
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
9
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
(bd)
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
4
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
5
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
9
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
(ad)
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
7
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
9
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
16
\end_layout

\end_inset
</cell>
</row>
</lyxtabular>

\end_inset


\end_layout

\begin_layout Standard
Likewise, the number of arcs of each length, summed over all 23 diagrams,
 is given below:
\begin_inset Separator latexpar
\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Tabular
<lyxtabular version="3" rows="4" columns="2">
<features tabularvalignment="middle">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Length
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Count
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
48
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
18
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
3
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
16
\end_layout

\end_inset
</cell>
</row>
</lyxtabular>

\end_inset


\end_layout

\begin_layout Standard
What are the actual distributions, for these two cases? Begin by counting
 the number of planar trees.
 I currently do not know of any published work on this, so below, I make
 a half-baked attempt to count these myself.
 It's ...
 incomplete.
 Maybe there's some simpler approach.
\end_layout

\begin_layout Subsection*
Counting planar tree graphs
\end_layout

\begin_layout Standard
Let's count the number of planar tree graphs; i.e.
 those without any loops.
 First, we need a generic formula for sentences of length N.
 This is not so very easy.
 The diagram below shows one way to count.
 (I think what follows is correct, but I might be making a mistake.
 I am unaware of any literature that presents this information).
\begin_inset Separator latexpar
\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Graphics
	filename images/all-count.eps
	width 50col%

\end_inset


\end_layout

\begin_layout Standard
Here, the star represents some planar tree connecting all of the words of
 a smaller sentence.
 Assume that there are 
\begin_inset Formula $T(n)$
\end_inset

 such trees, connecting 
\begin_inset Formula $n$
\end_inset

 words.
 Tree diagrams of Type A are assembled by placing two adjacent smaller
 trees next to each other.
 Naively, one can then count how many such pairs there are; the issue is
 that the Type B diagram will occur multiple times in this pairing; we would
 rather NOT count it with this multiplicity.
 To avoid this problem, we should only allow pairs, such as Type A, to be
 assembled of sub-parts of the shapes C and D.
 Because of the over-arching arc, these can never result in double-counting.
 However, counting only pairs results in an under-counting: graphs of Type
 B never occur.
 Thus, one should count pairs, triples, and so on – graphs of Type E.
 Now we have a way of getting the formula.
 Define 
\begin_inset Formula $D(n)$
\end_inset

 as the count of the number of planar trees, connecting 
\begin_inset Formula $n$
\end_inset

 words, having an arc connecting the first and last word: i.e.
 trees of type C or D.
 (Think 
\begin_inset Quotes eld
\end_inset

D = dome
\begin_inset Quotes erd
\end_inset

) One then has that 
\begin_inset Formula 
\[
D(n)=\sum_{j=1}^{n-1}T(j)T(n-j)
\]

\end_inset

It is convenient, here, to define 
\begin_inset Formula $T(1)=1$
\end_inset

.
 The first and last terms of this sum then correspond to trees of Type C,
 while the middle terms are trees of type D.
\end_layout
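
\begin_layout Standard
As a sanity check, the sum can be evaluated by hand for small 
\begin_inset Formula $n$
\end_inset

.
 This is only a sketch; it assumes the values 
\begin_inset Formula $T(2)=1$
\end_inset

 and 
\begin_inset Formula $T(3)=3$
\end_inset

, which follow from directly enumerating the planar trees on two and three
 words:
\begin_inset Formula 
\begin{align*}
D(2) & =T(1)T(1)=1\\
D(3) & =T(1)T(2)+T(2)T(1)=2\\
D(4) & =T(1)T(3)+T(2)T(2)+T(3)T(1)=3+1+3=7
\end{align*}

\end_inset

For 
\begin_inset Formula $n=3$
\end_inset

, the two domed trees consist of the arc from the first to the last word,
 together with one of the two short links.
\end_layout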

\begin_layout Standard
To count trees of Type E, it is convenient to break this up into the problem
 of counting chains of length 
\begin_inset Formula $k$
\end_inset

, so that there are 
\begin_inset Formula $C_{k}(n)$
\end_inset

 trees, consisting of a sequence of 
\begin_inset Formula $k$
\end_inset

 domes, making up a total of 
\begin_inset Formula $n$
\end_inset

 words.
 One then has that 
\begin_inset Formula 
\[
T(n)=D(n)+C_{2}(n)+\cdots+C_{n-1}(n)
\]

\end_inset

It's convenient, here, to define 
\begin_inset Formula $C_{1}(n)=D(n)$
\end_inset

.
 Writing down the 
\begin_inset Formula $C_{k}(n)$
\end_inset

's requires some combinatorial magic.
 The first one is 
\begin_inset Formula 
\[
C_{2}(n)=\sum_{j=2}^{n-1}D(j)D(n-j+1)
\]

\end_inset

Next comes
\begin_inset Formula 
\[
C_{3}(n)=\sum_{j=2}^{n-2}\;\sum_{m=2}^{n-j}D(j)D(m)D(n-j-m+2)
\]

\end_inset

which is awkward to write down.
 It's easier to count partitions of sets.
 Thus, what really is happening here is that the sums range over all 
\begin_inset Formula $k$
\end_inset

-way partitions of sets containing 
\begin_inset Formula $n+k-1$
\end_inset

 elements.
 Note that the partition is NOT over 
\begin_inset Formula $n$
\end_inset

 elements: to get connected graphs, we have to identify end-points of each
 link in the chain.
 Thus, 
\begin_inset Formula 
\[
C_{k}(n)=\sum_{n_{1}+\cdots+n_{k}=n+k-1}\;\prod_{i=1}^{k}D(n_{i}),\qquad n_{i}\ge2
\]

\end_inset

The table below summarizes the first few sums:
\begin_inset Separator latexpar
\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Tabular
<lyxtabular version="3" rows="7" columns="7">
<features tabularvalignment="middle">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $n$
\end_inset


\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $T(n)$
\end_inset


\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $D(n)$
\end_inset


\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $C_{2}(n)$
\end_inset


\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $C_{3}(n)$
\end_inset


\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $C_{4}(n)$
\end_inset


\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $C_{5}(n)$
\end_inset


\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
3
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
3
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
4
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
12
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
7
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
4
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
5
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
55
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
30
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
18
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
6
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
6
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
143
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
</row>
</lyxtabular>

\end_inset


\end_layout

\begin_layout Standard
Evaluating the recurrences gives 
\begin_inset Formula $T(n)=1,1,3,12,55$
\end_inset

 for 
\begin_inset Formula $n=1,\ldots,5$
\end_inset

.
 These values agree with the start of OEIS sequence A001764, 
\begin_inset Formula $\binom{3n-3}{n-1}/(2n-1)$
\end_inset

, which counts non-crossing trees, so the count of planar trees appears to
 be a known one.
\end_layout
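
\begin_layout Standard
As an illustration of how the chain counts combine, the 
\begin_inset Formula $n=4$
\end_inset

 row can be worked out by hand from 
\begin_inset Formula $D(2)=1$
\end_inset

, 
\begin_inset Formula $D(3)=2$
\end_inset

 and 
\begin_inset Formula $D(4)=7$
\end_inset

.
 Each term is one way of splitting the four words into a chain of domes
 that share end-points; this is a hand-worked sketch of the sums above, not
 an independent derivation:
\begin_inset Formula 
\begin{align*}
C_{2}(4) & =D(2)D(3)+D(3)D(2)=4\\
C_{3}(4) & =D(2)D(2)D(2)=1\\
T(4) & =D(4)+C_{2}(4)+C_{3}(4)=7+4+1=12
\end{align*}

\end_inset


\end_layout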

\begin_layout Subsection*
Counting planar loop graphs
\end_layout

\begin_layout Standard
The above process can be repeated, except that this time, we consider the
 planar graphs containing loops.
 To get started, consider the diagram below.
\begin_inset Separator latexpar
\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Graphics
	filename images/all-loop.eps
	width 60col%

\end_inset


\end_layout

\begin_layout Standard
Here, the stars represent either 
\begin_inset Quotes eld
\end_inset

domed
\begin_inset Quotes erd
\end_inset

 diagrams, or the empty set (a set containing no words and no edges).
 Type F concatenates two domes, and puts a dome over those, in turn.
 Since both of the stars are domed (or empty), it is impossible to add any
 additional edges to this graph.
 So, for graphs constructed out of a pair of domes (one or both possibly empty),
 Type F is all that there is.
 For three domes in a row, there are only three ways of adding edges: these
 are shown in types G and H in this diagram.
 Again, this exhausts all possibilities.
 This process constructs both looped and tree diagrams.
 The general idea is to repeat this, for sequences of four or more stars.
\end_layout

\begin_layout Standard
The counting is similar to that before.
 Let 
\begin_inset Formula $F(n)$
\end_inset

 count the number of domed graphs, connecting 
\begin_inset Formula $n$
\end_inset

 words.
 Let 
\begin_inset Formula $G_{2}(n)$
\end_inset

 count the number of type F graphs, made of two parts, and containing 
\begin_inset Formula $n$
\end_inset

 words.
 Consulting the diagram, we have
\begin_inset Formula 
\[
G_{2}(n)=F(n-1)+\sum_{k=2}^{n-1}F(k)F(n-k+1)+F(n-1)
\]

\end_inset

Likewise, let 
\begin_inset Formula $G_{3}(n)$
\end_inset

 count the number of graphs of type G and H, combined.
 Consulting the diagram, this has a more complex expression:
\begin_inset Formula 
\[
G_{3}(n)=\sum_{k=2}^{n-2}F(k)F(n-k)+\sum\sum F()F()F()...+\sum_{k=2}^{n-2}F(k)F(n-k)+...
\]

\end_inset

The total number of domed graphs having 
\begin_inset Formula $n$
\end_inset

 words is then 
\begin_inset Formula 
\[
F(n)=\sum_{k=2}^{n-1}G_{k}(n)
\]

\end_inset

Let 
\begin_inset Formula $S(n)$
\end_inset

 be the count of a string of domed graphs, but NOT having connecting arcs:
 that is, graphs of type A or E.
\end_layout
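
\begin_layout Standard
As a small check of the expression for 
\begin_inset Formula $G_{2}(n)$
\end_inset

, take 
\begin_inset Formula $n=3$
\end_inset

.
 This sketch assumes 
\begin_inset Formula $F(2)=1$
\end_inset

, the single domed graph on two words:
\begin_inset Formula 
\[
G_{2}(3)=F(2)+F(2)F(2)+F(2)=1+1+1=3
\]

\end_inset

Since 
\begin_inset Formula $k=2$
\end_inset

 is the only term in the sum for 
\begin_inset Formula $F(3)$
\end_inset

, this also gives 
\begin_inset Formula $F(3)=G_{2}(3)=3$
\end_inset

.
\end_layout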

\begin_layout Standard
A table of these is given below.
\begin_inset Separator latexpar
\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Tabular
<lyxtabular version="3" rows="7" columns="8">
<features tabularvalignment="middle">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $n$
\end_inset


\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $L(n)$
\end_inset


\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $S(n)$
\end_inset


\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $F(n)$
\end_inset


\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $G_{2}(n)$
\end_inset


\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $G_{3}(n)$
\end_inset


\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $G_{4}(n)$
\end_inset


\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $G_{5}(n)$
\end_inset


\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
3
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
4
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
3
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
3
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
4
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
23
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
x
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
5
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
156
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
x
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
6
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1162
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
x
\end_layout

\end_inset
</cell>
</row>
</lyxtabular>

\end_inset


\end_layout

\begin_layout Subsection*
Effect on Grammar Induction
\end_layout

\begin_layout Standard
There is a surprising effect on the quality of the learned grammars, as
 shown in the graph below, obtained from Andres Suarez
\begin_inset Foot
status collapsed

\begin_layout Plain Layout
Email on the lang-learn@gmail.com mailing list, 22 July 2019, titled 
\begin_inset Quotes eld
\end_inset

MI-threshold for MST-parses
\begin_inset Quotes erd
\end_inset

.
\end_layout

\end_inset

 via the Kolonin learning project.
\end_layout

\begin_layout Standard
\align center
\begin_inset Graphics
	filename images/andres-mi-cut.png
	width 100col%

\end_inset


\end_layout

\begin_layout Standard
The graph shows the F1-score (harmonic mean of recall and precision) for
 parse trees obtained in four different ways.
 The recall and precision are relative to a 
\begin_inset Quotes eld
\end_inset

golden
\begin_inset Quotes erd
\end_inset

 corpus.
 The horizontal axis shows score dependence on an MI threshold: word pairs
 with an MI of less than the threshold are discarded, before the parse tree
 is generated.
\end_layout
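
\begin_layout Standard
For reference, the F1-score is conventionally defined from the precision
 and recall as shown here.
 The symbols 
\begin_inset Formula $P$
\end_inset

 (precision) and 
\begin_inset Formula $R$
\end_inset

 (recall) are introduced only for this illustration, and the numbers in
 the example are made up rather than read off the graph:
\begin_inset Formula 
\[
F_{1}=\frac{2PR}{P+R},\qquad\mbox{e.g.}\quad P=0.8,\;R=0.5\;\Rightarrow\;F_{1}=\frac{0.8}{1.3}\approx0.62
\]

\end_inset


\end_layout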

\begin_layout Standard
The lines are labeled as:
\end_layout

\begin_layout Itemize

\series bold
Sequential 
\series default
– The parse 
\begin_inset Quotes eld
\end_inset

tree
\begin_inset Quotes erd
\end_inset

 is just a sequence of links connecting neighboring words.
 This is independent of the MI score of individual word-pairs, and so is
 a flat line.
\end_layout

\begin_layout Itemize

\series bold
WIN6
\series default
 – The parse tree is an MST tree obtained using previously-obtained word-pair
 MI scores from a 
\begin_inset Quotes eld
\end_inset

clique-pair counting
\begin_inset Quotes erd
\end_inset

 word-pair generation technique.
\begin_inset Foot
status collapsed

\begin_layout Plain Layout
The code base calls this 
\begin_inset Quotes eld
\end_inset

clique-pair counting
\begin_inset Quotes erd
\end_inset

.
\end_layout

\end_inset

 The counting is done by considering all word-pairs formed between the left-most
 word in a 7-wide window, and all the other words in that window.
 Thus, six word-pairs are counted for each window position, then the window
 is slid over by one.
\end_layout

\begin_layout Itemize
\noindent
\align block

\series bold
WIN6-odist
\series default
 – The parse tree is an MST tree obtained using previously-obtained word-pair
 MI scores from a weighted-window counting technique.
 This counts pairs with a distance penalty of 
\begin_inset Formula $\left\lfloor 6/\mathrm{dist}\right\rfloor $
\end_inset

, as shown below:
\begin_inset Newline newline
\end_inset


\begin_inset Box Frameless
position "t"
hor_pos "c"
has_inner_box 1
inner_pos "t"
use_parbox 0
use_makebox 0
width "100col%"
special "none"
height "1in"
height_special "totalheight"
thickness "0.4pt"
separation "3pt"
shadowsize "4pt"
framecolor "black"
backgroundcolor "none"
status open

\begin_layout Plain Layout
\noindent
\align center
\begin_inset Tabular
<lyxtabular version="3" rows="8" columns="2">
<features tabularvalignment="middle">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
distance
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
weight
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
6
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
3
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
3
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
4
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
5
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
6
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
7
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0
\end_layout

\end_inset
</cell>
</row>
</lyxtabular>

\end_inset


\end_layout

\end_inset


\begin_inset Newline newline
\end_inset

That is, a word pair formed from nearest neighbors is given a weight of
 six, and so on.
\end_layout

\begin_layout Itemize

\series bold
LG-all
\series default
 – The parse tree is an MST tree obtained using previously-obtained word-pair
 MI scores from 100 random planar tree parses created using the LG 
\begin_inset Quotes eld
\end_inset

any
\begin_inset Quotes erd
\end_inset

 language parser.
\end_layout
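
\begin_layout Standard
As a quick arithmetic check on the WIN6-odist weights tabulated above (a
 sketch, assuming that each pair at distance 
\begin_inset Formula $d$
\end_inset

 simply adds its weight to that pair's count, and that pairs are counted
 from a given word to the words on one side of it): the weights over distances
 1 through 6 sum to 
\begin_inset Formula 
\[
\sum_{d=1}^{6}\left\lfloor 6/d\right\rfloor =6+3+2+1+1+1=14
\]

\end_inset

so nearest-neighbor pairs receive 
\begin_inset Formula $6/14\approx43\%$
\end_inset

 of the total weight, and pairs at distance 7 or more receive none, since
 
\begin_inset Formula $\left\lfloor 6/7\right\rfloor =0$
\end_inset

.
 By comparison, the unweighted 7-wide window gives each of its six pairs
 an equal share, 
\begin_inset Formula $1/6\approx17\%$
\end_inset

.
\end_layout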

\begin_layout Standard
Word-pair MI values were obtained from the ULL Project 
\begin_inset Quotes eld
\end_inset

Gutenberg Children's Corpus
\begin_inset Quotes erd
\end_inset

.
 This consists of 200K sentences of average length 13.5, for a total of 2.7M
 words.
 The F1-scores were obtained by evaluating 228 sentences (average length
 of 8) from the 
\begin_inset Quotes eld
\end_inset

golden corpus
\begin_inset Quotes erd
\end_inset

.
\end_layout

\begin_layout Standard
There are two rather surprising results here:
\end_layout

\begin_layout Enumerate
The MST tree always has a lower score than the sequential 
\begin_inset Quotes eld
\end_inset

tree
\begin_inset Quotes erd
\end_inset

, when compared against the golden parse tree.
 The reason for this is not currently understood.
\end_layout

\begin_layout Enumerate
The way in which the MI values were collected has a strong effect on the
 quality of the MST parse.
 This is unexpected.
\end_layout

\begin_layout Standard
Both of these results are unexpected, and lack an adequate explanation.
 Understanding the first is more important.
\end_layout

\begin_layout Section*
English Dataset Sample 28 April 2017
\end_layout

\begin_layout Standard
This section was moved to `word-pairs-redux.lyx` in July 2019, so that all
 the word-pair stuff would be in one place.
\end_layout

\begin_layout Section*
Connector Sets 7 May 2017 (revised July 2017)
\end_layout

\begin_layout Standard
To better manage the size of this diary, this has been moved to its own
 file.
 See the 
\begin_inset Quotes eld
\end_inset

connector-sets-revised.lyx
\begin_inset Quotes erd
\end_inset

 file.
\end_layout

\begin_layout Subsubsection*
Abstract
\end_layout

\begin_layout Standard
This is a report on a dataset of disjuncts and connector sets, extracted
 from MST parses of a batch of sentences.
 First, a recap of what these are, then a characterization of the database
 contents, and finally, a report on the grammatical similarity of words
 in the dataset.
 
\end_layout

\begin_layout Section*
MST parsing algos
\end_layout

\begin_layout Standard
There are multiple MST algos, some better than others.
 Here is a short list, with some references.
\end_layout

\begin_layout Itemize
The current implementation in the (opencog sheaf) directory is an MST algo
 for generating projective MST trees from undirected edges.
 It's a simple-minded projective adaptation of Borůvka's algo (see Wikipedia
 for a description).
 I just measured it to run at 
\begin_inset Formula $O\left(n^{3}\right)$
\end_inset

 for 
\begin_inset Formula $n$
\end_inset

 words.
 See atomspace/opencog/bench-ms/README.md.
\end_layout

\begin_layout Itemize
The 
\begin_inset Formula $O\left(n^{3}\right)$
\end_inset

 is for the case of arbitrary-length links.
 If the scoring function is altered to give bad scores to links longer than
 6, then the algo kicks over to 
\begin_inset Formula $O\left(n^{2.3}\right)$
\end_inset

 for about 
\begin_inset Formula $n>8$
\end_inset

 or so.
 Awesome! See graphs in atomspace/opencog/bench-ms/ for a better look.
\end_layout

\begin_layout Itemize
This isn't bad, per se, since Yuret's best published projective MST algo
 also ran at 
\begin_inset Formula $O\left(n^{3}\right)$
\end_inset

 for 
\begin_inset Formula $n$
\end_inset

 words.
 So we are in the right ballpark...
\end_layout

\begin_layout Itemize
The state-of-the-art projective algo (for directed edges) is supposed to
 be Eisner's algo, which runs at 
\begin_inset Formula $O\left(n^{3}\right)$
\end_inset

 for 
\begin_inset Formula $n$
\end_inset

 words.
 So we are still in the right ballpark.
 Eisner, Jason, 1996.
 “Three New Probabilistic Models for Dependency Parsing.” In Proceedings
 of the 16th Conference on Computational Linguistics (COLING 96).
 Saarbruecken: ACL, 340–345.
 
\end_layout

\begin_layout Itemize
The Chu-Liu-Edmonds algorithm finds (non-projective) spanning trees in directed
 graphs.
 It is described by Ryan McDonald, Fernando Pereira, Kiril Ribarov, Jan Hajič,
 
\begin_inset Quotes eld
\end_inset

Non-projective Dependency Parsing using Spanning Tree Algorithms
\begin_inset Quotes erd
\end_inset

 http://www2.denizyuret.com/ref/mcdonald/nonprojectiveHLT-EMNLP2005.pdf It's
 supposed to run in 
\begin_inset Formula $O\left(n^{2}\right)$
\end_inset

 time.
\end_layout

\begin_layout Itemize
A non-projective algorithm that is 
\begin_inset Quotes eld
\end_inset

super-linear
\begin_inset Quotes erd
\end_inset

 in the number of edges is described by Effi Levi, Roi Reichart, Ari Rappoport,
 
\begin_inset Quotes eld
\end_inset

Edge-Linear First-Order Dependency Parsing with Undirected Minimum Spanning
 Tree Inference
\begin_inset Quotes erd
\end_inset

 (2016) https://www.aclweb.org/anthology/P/P16/P16-1198.pdf Since this is edge-line
ar, I think that, for us, it claims 
\begin_inset Formula $O\left(n^{2}\right)$
\end_inset

 for 
\begin_inset Formula $n$
\end_inset

 words.
 (Since, for us, we don't know, a priori, if we have an edge, or not).
 It's also not projective.
 https://arxiv.org/pdf/1510.07482.pdf
\end_layout
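
\begin_layout Standard
To make the edge-linear remark above concrete (a sketch, assuming that every
 pair of words in an 
\begin_inset Formula $n$
\end_inset

-word sentence is a candidate edge, so that the graph is complete): the
 number of undirected candidate edges is 
\begin_inset Formula 
\[
\left|E\right|=\binom{n}{2}=\frac{n\left(n-1\right)}{2}
\]

\end_inset

so an algorithm that is linear in the number of edges is 
\begin_inset Formula $O\left(n^{2}\right)$
\end_inset

 in the number of words.
 For example, 
\begin_inset Formula $n=10$
\end_inset

 words gives 45 candidate edges.
\end_layout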

\begin_layout Section*
Dataset report 3 June 2017
\begin_inset CommandInset label
LatexCommand label
name "sec:Dataset-report"

\end_inset


\end_layout

\begin_layout Standard
Some summary reports from various different datasets.
 The summary for the word-pair datasets was moved to the `word-pairs-redux.lyx`
 file, so that all word-pair stuff lives together.
\end_layout

\begin_layout Subsection*
Disjunct datasets
\end_layout

\begin_layout Standard
Next, datasets that hold disjuncts.
 This section used to report more data, but it was all flawed: the MI had
 a minus sign in it, causing all computed disjuncts to be maximally bad.
 Despite this, the results were similar to the below: observations and entropy
 fit in line, as expected.
 The 
\begin_inset Formula $H_{left}$
\end_inset

 entropy values were lower, hovered around 15, and the MI was in the 3-5
 range, while 
\begin_inset Formula $H_{right}$
\end_inset

 was unchanged and fit in line.
 You can find the original data in the git commit 9244905afdff191a39af8c5a6deab5
92d5a1558c.
 
\begin_inset VSpace defskip
\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Tabular
<lyxtabular version="3" rows="9" columns="9">
<features tabularvalignment="middle">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Size
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Sects
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Obs'ns
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Ob/sec
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Entropy
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $H_{left}$
\end_inset


\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $H_{right}$
\end_inset


\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
MI
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Notes
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
37K x 291K
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
446K
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
661K
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1.48
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
18.30
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
16.00
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
10.28
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
7.98
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
en_pairs_sim
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
137K x 6.24M
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
8.63M
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
18.5M
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2.14
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
20.96
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
19.14
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
9.71
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
7.90
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
en_pairs_rfive_mtwo
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
522K x 25.2M
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
34.2M
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
77.8M
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2.27
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
22.82
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
20.92
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
10.09
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
8.18
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
en_pairs_rfive_mst
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
445K x 23.4M
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
31.9M
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
69.4M
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2.18
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
23.09
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
21.14
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
10.11
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
8.16
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
en_pairs_cfive_mst
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
60K x 602K
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
801K
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1.19M
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1.48
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
18.86
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
17.99
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
10.13
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
9.26
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
zen_pairs_mst
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
85K x 4.88M
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
7.02M
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
17.8M
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2.54
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
20.48
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
17.06
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
9.52
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
6.10
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
zen_pairs_three_mst
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
563K x 29.9M
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
39.7M
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
en_mpg_18
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
564K x 21.9M
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
34.5M
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
78.2M
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2.26
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
22.41
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
20.00
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
10.12
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
7.72
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
en_mpg_53
\end_layout

\end_inset
</cell>
</row>
</lyxtabular>

\end_inset


\begin_inset VSpace defskip
\end_inset


\end_layout

\begin_layout Standard
An updated legend for the columns:
\end_layout

\begin_layout Description
Size The dimensions of the array.
 The left dimension counts words, the right dimension counts the number
 of unique, distinct pseudo-disjuncts.
\end_layout

\begin_layout Description
Sects The total number of distinct Sections observed.
\end_layout

\begin_layout Description
Obs'ns The total number of observations of these Sections.
 Most Sections will be observed more than once.
 Distributions are typically Zipfian, as pointed out earlier.
\end_layout

\begin_layout Description
Ob/sec The average number of times each Section was observed.
\end_layout

\begin_layout Description
Entropy The total entropy of the Sections in this dataset, as defined previously
: for Sections 
\begin_inset Formula $(w,d)$
\end_inset

 it is 
\begin_inset Formula $H=-\sum_{w,d}p(w,d)\log_{2}p(w,d)$
\end_inset

.
\end_layout

\begin_layout Description
MI The total mutual information for the Sections in this dataset, as defined
 previously: 
\begin_inset Formula $MI=\sum_{w,d}p(w,d)\log_{2}\left[p(w,d)/p(w,*)p(*,d)\right]$
\end_inset


\end_layout

\begin_layout Description
\begin_inset Formula $H_{left}$
\end_inset

,
\begin_inset Formula $H_{right}$
\end_inset

 The left and right entropies, as defined previously.
 Note that 
\begin_inset Formula $MI=H_{left}+H_{right}-H$
\end_inset

 holds, by definition, consistent with the values in the table.
 Not given for the word-pairs table, because these two are nearly equal,
 and each is about half the sum of the entropy and the MI.
\end_layout
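
\begin_layout Standard
As a sanity check on the table (a sketch, using the en_pairs_sim row and
 assuming that the tabulated 
\begin_inset Formula $H_{left}$
\end_inset

 and 
\begin_inset Formula $H_{right}$
\end_inset

 are the positive marginal entropies): the MI column is reproduced by adding
 the two marginal entropies and subtracting the total entropy, 
\begin_inset Formula 
\[
H_{left}+H_{right}-H=16.00+10.28-18.30=7.98
\]

\end_inset

which matches the tabulated MI of 7.98.
 The other rows agree to within rounding; for example, en_pairs_cfive_mst
 gives 
\begin_inset Formula $21.14+10.11-23.09=8.16$
\end_inset

.
\end_layout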

\begin_layout Standard
Note how the MI is considerably larger than that for the word-pairs.
 Higher MI implies a stronger correlation, and this is good: this suggests
 that the disjuncts are capturing meaningful structures in the language.
\end_layout

\begin_layout Standard
The behavior of the 
\begin_inset Quotes eld
\end_inset

zen
\begin_inset Quotes erd
\end_inset

 dataset might be explained by two issues that this dataset has.
 The smaller 
\begin_inset Quotes eld
\end_inset

zen_pairs_mst
\begin_inset Quotes erd
\end_inset

 dataset is tiny, with a large fraction (most?) of the words observed only
 once and most disjuncts observed only once, so the high MI is probably a
 false signal, an artifact of the tiny size of the set.
 By contrast, the unexpectedly low MI on the 
\begin_inset Quotes eld
\end_inset

zen_pairs_three_mst
\begin_inset Quotes erd
\end_inset

 dataset might be blamed on the 3rd-party word segmentation tool.
 It is known not to be very accurate, and the low MI might be a by-product
 of that.
\end_layout

\begin_layout Standard
The descriptions of the datasets can be found in the 
\begin_inset Quotes eld
\end_inset

connector-sets-revised.lyx
\begin_inset Quotes erd
\end_inset

 and 
\begin_inset Quotes eld
\end_inset

word-pairs-redux.lyx
\begin_inset Quotes erd
\end_inset

 files.
 There are several new datasets since then.
\end_layout

\begin_layout Description

\noun on
en_mpg_53
\noun default
 The disjuncts were built using the MPG parser, after discarding all word
 pairs with MI<5.3 (and then recomputing word-pair MI marginals).
 The stats, as given in the table above, are not all that different from
 
\series bold
\noun on
en_pairs_cfive_mst
\series default
\noun default
 and it's not clear if this should be surprising or not...
 Both
\series bold
\noun on
 en_pairs_cfive_mst
\series default
\noun default
 and this are built from the same original word-pair stats.
 If a different set of word-pairs were obtained, how much would things differ?
\end_layout

\begin_layout Description

\noun on
en_mpg_18
\noun default
 As above, but after discarding all word pairs with MI<1.8.
\end_layout

\begin_layout Section*
Thresholding PCA Classifier
\end_layout

\begin_layout Standard
The next step is what I've called 
\begin_inset Quotes eld
\end_inset

clustering
\begin_inset Quotes erd
\end_inset

 in the past, but it really needs to be something more like factor analysis,
 or better yet, sparse PCA.
 Except that's not right, either.
\end_layout

\begin_layout Standard
What is needed is a recognizer, as follows.
 Let 
\begin_inset Formula $\overrightarrow{b}=\sum_{n}b_{n}w_{n}$
\end_inset

 be a vector, with the 
\begin_inset Formula $w_{n}$
\end_inset

 being individual words, and the 
\begin_inset Formula $b_{n}$
\end_inset

 being weights.
 Plain-old Principal Component Analysis (PCA) computes real-valued weights
 
\begin_inset Formula $b_{n}$
\end_inset

.
 It's problematic, because potentially all of the weights are non-zero for
 all of the words.
 Sparse PCA computes real-valued weights 
\begin_inset Formula $b_{n}$
\end_inset

 such that only some small number of them are non-zero.
 This is much better.
 But what is really needed is a classifier: a set of 
\begin_inset Formula $b_{n}$
\end_inset

 that are either zero or one, indicating the membership of a word 
\begin_inset Formula $w_{n}$
\end_inset

 in some class of words.
 (Note, by the way, that a word might belong to multiple classes, for example,
 according to its part-of-speech, or its meaning.) This suggests a neural-netish
 variant on iterative PCA, described below.
 But, before giving this, some general remarks.
\end_layout

\begin_layout Subsection*
Preliminary comments
\end_layout

\begin_layout Standard
The definition of PCA requires a matrix 
\begin_inset Formula $X$
\end_inset

 that connects columns and rows in some way.
 In the conventional definition, it is a matrix connecting variables and
 measurements.
 The variables (the features being measured) are organized in the columns;
 the measurements in rows.
 The PCA algorithm effectively computes the eigenvectors of the matrix 
\begin_inset Formula $X^{T}X$
\end_inset

, with 
\begin_inset Formula $X^{T}$
\end_inset

 denoting the transpose of 
\begin_inset Formula $X$
\end_inset

.
\end_layout

\begin_layout Standard
The matrix 
\begin_inset Formula $X^{T}X$
\end_inset

 is proportional to the covariance matrix between the different features
 being observed.
 The principal component is the direction of the greatest variation
 in this matrix.
 
\end_layout

\begin_layout Standard
What plays the role of 
\begin_inset Formula $X$
\end_inset

 in the current situation, and how should the principal component be understood
 and interpreted? 
\end_layout

\begin_layout Standard
What we have on hand, foundationally, is the frequency matrix 
\begin_inset Formula $P$
\end_inset

 with components 
\begin_inset Formula $p(w,d)$
\end_inset

 connecting words with disjuncts.
 It was defined previously as 
\begin_inset Formula $p(w,d)=N(w,d)/N(*,*)$
\end_inset

, where 
\begin_inset Formula $N(w,d)$
\end_inset

 is the number of times word 
\begin_inset Formula $w$
\end_inset

 has been observed with disjunct 
\begin_inset Formula $d$
\end_inset

.
 As noted earlier, 
\begin_inset Formula $N(w,d)$
\end_inset

 is very large and very sparse: typically 
\begin_inset Formula $200K\times4M$
\end_inset

 in recent datasets, with only 1 entry out of 
\begin_inset Formula $2^{15}$
\end_inset

 being non-zero.
\begin_inset Foot
status collapsed

\begin_layout Plain Layout
I plan to send out the revised, expanded statistical analysis 
\begin_inset Quotes eld
\end_inset

real soon now
\begin_inset Quotes erd
\end_inset

.
\end_layout

\end_inset

 The extreme sparsity indicates that power iteration will be the most
 efficient way of implementing PCA.
\end_layout
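
\begin_layout Standard
A rough count illustrates the point.
 It takes the round figures quoted above at face value, as 
\begin_inset Formula $2\times10^{5}$
\end_inset

 words by 
\begin_inset Formula $4\times10^{6}$
\end_inset

 disjuncts; these are order-of-magnitude values, not measurements.
 The number of non-zero entries is then about
\begin_inset Formula 
\[
\frac{(2\times10^{5})(4\times10^{6})}{2^{15}}\approx2.4\times10^{7}
\]

\end_inset

whereas a product such as 
\begin_inset Formula $PP^{T}$
\end_inset

, if formed explicitly, would have 
\begin_inset Formula $(2\times10^{5})^{2}=4\times10^{10}$
\end_inset

 entries, and need not be sparse.
 Power iteration never forms this product.
 Taking 
\begin_inset Formula $PP^{T}$
\end_inset

 as an example, and starting from some unit vector 
\begin_inset Formula $u_{0}$
\end_inset

 (a symbol introduced here for illustration), it repeats
\begin_inset Formula 
\[
u_{k+1}=\frac{P(P^{T}u_{k})}{\left\Vert P(P^{T}u_{k})\right\Vert }
\]

\end_inset

so that each step costs two sparse matrix-vector products, each of which
 touches every non-zero entry once.
\end_layout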

\begin_layout Standard
We will examine the results for several different kinds of matrices
 
\begin_inset Formula $M$
\end_inset

 derived from (constructed from) the base data matrix 
\begin_inset Formula $P$
\end_inset

.
 In all cases, the features are words, and so in all cases, it is appropriate
 to write 
\begin_inset Formula $X=M^{T}$
\end_inset

; that is, we work mostly with the transpose of the matrix 
\begin_inset Formula $X$
\end_inset

 as usually given in standard texts.
 This follows from the standard Link Grammar dictionary: the word is followed
 by the disjuncts it can employ.
\end_layout

\begin_layout Subsection*
PCA of the frequency matrix
\end_layout

\begin_layout Standard
Should we identify 
\begin_inset Formula $P$
\end_inset

 and 
\begin_inset Formula $X^{T}$
\end_inset

, so that 
\begin_inset Formula $X^{T}X=PP^{T}$
\end_inset

? We can, but then we don't get what we want.
 Computing the principal component of this matrix for a recent dataset (described
 in a later section), we get the vector shown below.
 The 
\begin_inset Quotes eld
\end_inset

weight
\begin_inset Quotes erd
\end_inset

 gives the magnitude of the vector component.
 The other two columns are the support for the word and the number of
 observations; they are shown for comparison.
 
\end_layout

\begin_layout Standard
Caution: the table below is still from the broken corpus, and should be discarded.
\begin_inset Separator latexpar
\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Tabular
<lyxtabular version="3" rows="16" columns="4">
<features tabularvalignment="middle">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
word
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
weight
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $\left|(w,*)\right|$
\end_inset


\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $N(w,*)$
\end_inset


\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
.
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.9358
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2031
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
341112
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
the
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.1212
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1403
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
106378
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
and
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.1098
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1225
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
96276
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
to
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.1035
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1276
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
96308
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
”
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.0920
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
506
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
45809
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
,
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.0904
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1703
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
111982
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
a
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.0842
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
958
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
73760
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
in
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.0808
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
750
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
56751
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
of
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.0783
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
890
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
64753
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
his
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.0666
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
691
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
48728
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
it
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.0567
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
606
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
44211
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
with
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.0531
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
480
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
33681
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
him
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.0482
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
425
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
30345
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
that
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.0464
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
729
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
49714
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
for
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.0450
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
479
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
33092
\end_layout

\end_inset
</cell>
</row>
</lyxtabular>

\end_inset


\end_layout

\begin_layout Standard
What does this mean? What can we do with this? Why is the weight of the
 period so high? In essence, this vector is stating that the greatest variety
 of disjuncts is associated with the period.
 Since periods are sentence enders, and every sentence has one, a link to
 the end of the sentence will attach to just about any word.
 That is, the period, practically single-handedly, accounts for almost all
 of the variance of the disjuncts in the dataset.
 The rest of the list is filled out with words that also attach freely and
 easily to just about anything: 
\begin_inset Quotes eld
\end_inset

the
\begin_inset Quotes erd
\end_inset

 should attach only to nouns, but common nouns wildly outnumber all of the
 other parts of speech put together.
 Similar remarks for 
\begin_inset Quotes eld
\end_inset

and
\begin_inset Quotes erd
\end_inset

 and 
\begin_inset Quotes eld
\end_inset

to
\begin_inset Quotes erd
\end_inset

.
 The comma can connect in a large variety of situations, and the closing
 quotation mark ” behaves much like a sentence-ender (this particular dataset
 contains a lot of dialog).
 The two columns labeled as 
\begin_inset Formula $\left|(w,*)\right|$
\end_inset

 and 
\begin_inset Formula $N(w,*)$
\end_inset

 confirm this interpretation: 
\begin_inset Formula $\left|(w,*)\right|$
\end_inset

 is the total number of unique, different disjuncts that were observed with
 
\begin_inset Formula $w$
\end_inset

, and 
\begin_inset Formula $N(w,*)$
\end_inset

 is the summation over all of the counts with which these disjuncts were
 seen.
\begin_inset Foot
status collapsed

\begin_layout Plain Layout
If these numbers seem small, it is because they were taken from a sharply
 filtered dataset, the en_pairs_ttwo_mst dataset with the (50,30,10) cut
 applied.
 This cut is discussed later, below.
\end_layout

\end_inset

 The top words, in terms of the variety and number of disjuncts, are more
 or less the makeup of the principal component of 
\begin_inset Formula $PP^{T}$
\end_inset

.
 This should not be a surprise.
 Anyway, this is not what we wanted: we want to classify sets of similar
 words; discovering which words account for the greatest variation in disjuncts
 is of secondary interest.
 
\end_layout
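
\begin_layout Standard
One way to see why this is unsurprising is to write out the entries of the
 matrix, which follow directly from the definition of the matrix product:
\begin_inset Formula 
\[
(PP^{T})(w_{1},w_{2})=\sum_{d}p(w_{1},d)p(w_{2},d)
\]

\end_inset

The diagonal entries are 
\begin_inset Formula $\sum_{d}p^{2}(w,d)$
\end_inset

, the squared length of the row for 
\begin_inset Formula $w$
\end_inset

.
 Writing 
\begin_inset Formula $p(w,*)=\sum_{d}p(w,d)=N(w,*)/N(*,*)$
\end_inset

 (a shorthand assumed here by analogy with 
\begin_inset Formula $N(w,*)$
\end_inset

), each diagonal entry is bounded above by 
\begin_inset Formula $p^{2}(w,*)$
\end_inset

, and it tends to be large for frequently observed words.
 No normalization by word frequency has been applied, and so the frequent
 words can be expected to dominate the leading eigenvector.
\end_layout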

\begin_layout Subsection*
PCA of the cosine similarity
\end_layout

\begin_layout Standard
We had previously defined the cosine similarity of two words as
\begin_inset Formula 
\[
\mbox{sim}(w_{1},w_{2})=\frac{\sum_{d}p(w_{1},d)p(w_{2},d)}{\sqrt{\sum_{d}p^{2}(w_{1},d)}\sqrt{\sum_{d}p^{2}(w_{2},d)}}
\]

\end_inset

and so, perhaps we should use this as the basis for judging the similarity
 of words.
 This suggests defining a matrix 
\begin_inset Formula $S$
\end_inset

 with matrix components 
\begin_inset Formula 
\[
S(w,d)=\frac{p(w,d)}{\sqrt{\sum_{d^{\prime}}p^{2}(w,d^{\prime})}}
\]

\end_inset

and then setting 
\begin_inset Formula $X=S^{T}$
\end_inset

 so that 
\begin_inset Formula $X^{T}X=SS^{T}$
\end_inset

.
 The idea here is that PCA allows a whole-set analysis of similarity, rather
 than point-wise similarity.
 That is, for normal clustering algorithms, one computes a large number
 of values for 
\begin_inset Formula $\mbox{sim}(w_{1},w_{2})$
\end_inset

 and then employs a clustering algorithm to categorize these, word by word.
 Here, instead, perhaps PCA can reveal entire clusters in one gulp, by simultane
ously evaluating the similarity between all words in a cluster.
\end_layout
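
\begin_layout Standard
The connection to the cosine similarity can be made explicit.
 Writing out the matrix product with the definition of 
\begin_inset Formula $S$
\end_inset

 given above yields
\begin_inset Formula 
\begin{align*}
(SS^{T})(w_{1},w_{2}) & =\sum_{d}S(w_{1},d)S(w_{2},d)\\
 & =\frac{\sum_{d}p(w_{1},d)p(w_{2},d)}{\sqrt{\sum_{d}p^{2}(w_{1},d)}\sqrt{\sum_{d}p^{2}(w_{2},d)}}\\
 & =\mbox{sim}(w_{1},w_{2})
\end{align*}

\end_inset

so that 
\begin_inset Formula $SS^{T}$
\end_inset

 is exactly the matrix of all pairwise cosine similarities.
 Its diagonal entries are all equal to one, and, because every 
\begin_inset Formula $p(w,d)$
\end_inset

 is non-negative, its off-diagonal entries lie between zero and one.
\end_layout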

\begin_layout Standard
Power iteration converges at about half the rate seen for the frequency
 matrix, which is not a surprise, as the off-diagonal entries are closer
 to one another.
 The PCA vector, however, is not all that different: 0.978 
\begin_inset Quotes eld
\end_inset

.
\begin_inset Quotes erd
\end_inset

 + 0.137 
\begin_inset Quotes eld
\end_inset

,
\begin_inset Quotes erd
\end_inset

 + 0.080 
\begin_inset Quotes eld
\end_inset

the
\begin_inset Quotes erd
\end_inset

 + 0.068 
\begin_inset Quotes eld
\end_inset

to
\begin_inset Quotes erd
\end_inset

 + 0.067 
\begin_inset Quotes eld
\end_inset

and
\begin_inset Quotes erd
\end_inset

 + 0.039 
\begin_inset Quotes eld
\end_inset

a
\begin_inset Quotes erd
\end_inset

 + ...
 and so on, the remaining entries filled out in roughly the same order,
 by the same words, as in the frequency PCA.
\begin_inset Foot
status open

\begin_layout Plain Layout
Created as follows:
\end_layout

\begin_layout Plain Layout
(define fsi (add-subtotal-filter psa 50 30 10)) 
\end_layout

\begin_layout Plain Layout
(define pci (make-cosine-matrix fsi))
\end_layout

\begin_layout Plain Layout
(define pti (make-power-iter-pca pci 'left-unit))
\end_layout

\begin_layout Plain Layout
(define lit (pti 'left-iterate feig 8)) 
\end_layout

\begin_layout Plain Layout
(pti 'left-print lit 20) 
\end_layout

\end_inset

 Why is this? It's worth taking a look at the matrix:
\begin_inset Foot
status collapsed

\begin_layout Plain Layout
Created with 
\end_layout

\begin_layout Plain Layout
(define poi (add-pair-cosine-compute fsi)) 
\end_layout

\begin_layout Plain Layout
(poi 'right-cosine WORD-A WORD-B) 
\end_layout

\end_inset


\begin_inset Separator latexpar
\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Tabular
<lyxtabular version="3" rows="7" columns="7">
<features tabularvalignment="middle">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
.
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
,
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
the
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
to
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
and
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
a
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
.
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.549
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.731
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.6435
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.668
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.627
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
,
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.711
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.824
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.888
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.765
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
the
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.790
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.744
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.896
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
to
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.906
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.857
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
and
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.755
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
a
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
</row>
</lyxtabular>

\end_inset


\end_layout

\begin_layout Standard
So what is this saying? There are plenty of pairs that have greater similarity;
 here's an arbitrary sampling: 
\begin_inset Separator latexpar
\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Tabular
<lyxtabular version="3" rows="9" columns="2">
<features tabularvalignment="middle">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
pair
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
sim
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
(he, she)
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.982
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
(this, that)
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.894
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
(run, walk)
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.878
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
(big, small)
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.908
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
(high, low)
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.910
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
(soft, hard)
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.809
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
(easy, hard)
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.846
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
(easy, soft)
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.749
\end_layout

\end_inset
</cell>
</row>
</lyxtabular>

\end_inset


\end_layout

\begin_layout Standard
So why aren't some of these in the PCA vector?
\end_layout

\begin_layout Subsubsection*
IMPORTANT: 
\end_layout

\begin_layout Standard
Some readers have misunderstood this section.
 We are NOT doing PCA to obtain similarity! We are examining it as an algo
 for CLUSTERING! That is, instead of doing k-means clustering, or agglomerative
 clustering, or something like that, the idea is/was to use a thresholded
 PCA for CLUSTERING! and NOT for similarity, because we've already got reasonably
 adequate similarity.
 Higher-quality similarity might be nice, but that is of secondary importance,
 right now.
\end_layout

\begin_layout Subsection*
A bit of sheaf theory
\end_layout

\begin_layout Standard
I recently realized that much of what is being discussed here can be anchored
 in the vocabulary of a generic mathematical theory, namely, sheaf theory.
 Sheaves allow topological structure to be discussed in a local way: sheaves
 describe how the local neighborhoods of a point glue together, to form
 a manifold as a whole.
 Link Grammar disjuncts and connector sets are really just the stalks and
 germs of sheaf theory, in mild disguise.
 This can be seen as follows.
\end_layout

\begin_layout Standard
A standard way of expressing a graph is to list all of the vertices in the
 graph, and to list all of the edges.
 Knowing these, one knows the graph.
 However, this is a global description: One does not know the local structure
 until one looks at specific vertices, and what they attach to.
\end_layout

\begin_layout Standard
A different way of describing a graph is to make a list of pairs: a vertex
 
\begin_inset Formula $v$
\end_inset

 and all the edges that attach to it.
 More generally, one can consider pairs where a vertex 
\begin_inset Formula $v$
\end_inset

 is attached to a vertex 
\begin_inset Formula $w$
\end_inset

 by means of a path of length 
\begin_inset Formula $N$
\end_inset

 or less.
\begin_inset Formula 
\[
\left(\mbox{vertex }v,\left\{ w\mbox{ s.t. vertex }w\mbox{ is attached to }v\right\} \right)
\]

\end_inset

This describes the graph, as a whole, just as well as the simpler vertex+edge
 list does.
 However, the language is different: these pairs are presheaves, obeying
 all the axioms of a presheaf, e.g.
 the composition of restriction morphisms.
 They become a sheaf because they also obey the gluing or collation axiom
 as well: they can be glued together to form the original graph from which
 they were taken.
\end_layout
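
\begin_layout Standard
The equivalence of the two descriptions can be sketched in a few lines of
 Python.
 This is only an illustration for a simple undirected graph with paths of
 length one; the function names to_local and glue are made up for this note
 and are not taken from any library.
\end_layout

```python
def to_local(vertices, edges):
    """Return {vertex: frozenset of neighbours}: one local pair per vertex."""
    local = {v: set() for v in vertices}
    for a, b in edges:
        local[a].add(b)
        local[b].add(a)
    return {v: frozenset(ns) for v, ns in local.items()}


def glue(local):
    """Reassemble the global (vertices, edges) description from the local pairs."""
    vertices = set(local)
    edges = {frozenset((v, w)) for v, ns in local.items() for w in ns}
    return vertices, edges


# Hypothetical example graph: the dependency parse of a four-word sentence.
vertices = {"this", "is", "a", "test"}
edge_list = [("this", "is"), ("is", "test"), ("a", "test")]

local = to_local(vertices, edge_list)
glued_vertices, glued_edges = glue(local)
```

\begin_layout Standard
Gluing the local pairs recovers exactly the vertex and edge lists that were
 started with, which is the informal content of the gluing axiom in this
 setting.
\end_layout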

\begin_layout Standard
Thus, we can see that the set
\end_layout

\begin_layout Standard
\begin_inset Formula 
\[
\left(\mbox{vertex }v,\left\{ \mbox{ edges attached to }v\right\} \right)
\]

\end_inset

is the same thing as a Link Grammar dictionary entry.
\end_layout

\begin_layout Standard
To be more precise, we need to distinguish the graph-sheaf that arises for
 a single sentence (from the dependency parse of the sentence) from the
 sheaf that arises from the entire language.
 If we take the language to consist of the set of all possible sentences,
 then the sheafification is to parse each of the sentences in the language,
 to get a dependency graph for each sentence, create the individual (word,
 connector-set) lists, and then take the quotient, identifying together
 all words that have the same spelling.
 This gives the sheaf of the entire language.
\end_layout
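
\begin_layout Standard
A minimal Python sketch of this construction follows.
 It assumes each parse is given as a word list plus a list of labelled links
 between word positions; the ordering of connectors within a disjunct (left-poin
ting first, nearest first) is an assumption made for the illustration, and
 the function names are hypothetical.
\end_layout

```python
from collections import defaultdict


def disjuncts(words, links):
    """words: list of tokens; links: list of (left_index, right_index, label).

    Returns one (word, connector-tuple) pair per word in the sentence.
    """
    left = defaultdict(list)   # connectors pointing left, as (distance, name)
    right = defaultdict(list)  # connectors pointing right, as (distance, name)
    for i, j, label in links:
        right[i].append((j - i, label + "+"))
        left[j].append((j - i, label + "-"))
    out = []
    for n, word in enumerate(words):
        conns = [c for _, c in sorted(left[n])] + [c for _, c in sorted(right[n])]
        out.append((word, tuple(conns)))
    return out


def build_dictionary(parses):
    """Take the quotient: identify all words that have the same spelling."""
    lexicon = defaultdict(set)
    for words, links in parses:
        for word, dj in disjuncts(words, links):
            lexicon[word].add(dj)
    return dict(lexicon)


# Two hypothetical parsed sentences sharing most of their words.
parses = [
    (["this", "is", "a", "test"], [(0, 1, "S"), (1, 3, "O"), (2, 3, "D")]),
    (["this", "is", "a", "dog"], [(0, 1, "S"), (1, 3, "O"), (2, 3, "D")]),
]
lexicon = build_dictionary(parses)
```

\begin_layout Standard
Here the two occurrences of each shared word collapse into a single dictionary
 entry, while each word that appears in different contexts would accumulate
 several disjuncts.
\end_layout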

\begin_layout Standard
From what I can tell, this realization that language can be sheafified is
 not new; when the language is not a natural language, but is instead first-order
 logic, then its sheafification gives the Kripke–Joyal semantics.
 According to Wikipedia, this was noted in 1965 for existential quantification.
 I don't know if this was ever noted for natural language before, but, as
 I've blathered on the mailing list before, this provides the 
\begin_inset Quotes eld
\end_inset

answer
\begin_inset Quotes erd
\end_inset

 to why the logic of Link Grammar appears to be modal logic: Link Grammar
 dictionary entries are sheaves, and the disjuncts are the different 
\begin_inset Quotes eld
\end_inset

possible worlds
\begin_inset Quotes erd
\end_inset

 that a given word can inhabit.
 For a natural-language sentence, 
\begin_inset Quotes eld
\end_inset

there exists
\begin_inset Quotes erd
\end_inset

 (existential quantification) a collection of disjuncts that can parse the
 sentence.
 Bingo.
\end_layout

\begin_layout Standard
I used to say that LG disjuncts had something to do with linear logic, because
 linear logic also has the general whiff of 
\begin_inset Quotes eld
\end_inset

possible worlds
\begin_inset Quotes erd
\end_inset

 around it.
 I now see that in fact it's actually modal logic, and it is the language
 of sheaves that provides the direct route from Link Grammar, to modal logic.
 It would be very interesting to see all the details worked out.
\end_layout

\begin_layout Standard
Most interesting is perhaps this: the sentences of a language are observed
 with some a priori frequency or probability.
 What's the correct way of converting this to a probability distribution
 on the sheaf? Next, given a probability distribution on the sheaf, what
 is the corresponding probability distribution on the corresponding modal
 logic? 
\end_layout

\begin_layout Standard
It seems to me that one could make this very generic: every language, and
 not just first-order logic, but any language, as considered in model theory,
 has a set of sentences.
 These sentences are composed of the terms in their term algebra, and these
 terms and how they connect, define a graph.
 That graph can be viewed in terms of sheaves, germs, stalks, etale spaces.
 This implies that every model, of model theory, has a corresponding cohomology.
 Writing this out could be interesting.
 Perhaps this has already been done; perhaps this is what topos theory is.
 But I suspect that it's not been sufficiently popularized: certainly, the
 standard computer-science textbooks that tell you what a language is do
 not tell you that it has a cohomology associated with it.
 And yet, this seems blatantly obvious, in retrospect, and naggingly it
 might actually be important for some reason or another.
\end_layout

\begin_layout Standard
Anyway: this is not just all-talk, no-action.
 I've written some code that implements some sheaf-based parsing on the
 atomspace.
 It is in the 
\begin_inset CommandInset href
LatexCommand href
name "github atomspace repo"
target "https://github.com/opencog/atomspace/tree/master/opencog/sheaf"
literal "false"

\end_inset

.
 The README file there explains more.
\end_layout

\begin_layout Subsection*
Disjuncts are compositional
\end_layout

\begin_layout Standard
One reason that a disjunct representation of a graph is important is that
 disjuncts can be composed, so that the product is again a disjunct.
 This is in contrast to the vertex-edge model,
 where neither vertices nor edges can be composed to obtain a new vertex
 or edge.
 That is, disjuncts form a monoidal category, and, specifically, a compact-close
d category.
 This section tries to spell out very clearly what this means, if it is
 not already apparent.
 
\end_layout

\begin_layout Standard
So, for example: the Link Grammar parse for 
\begin_inset Quotes eld
\end_inset

this is a test
\begin_inset Quotes erd
\end_inset

 involves four disjuncts:
\end_layout

\begin_layout Itemize
this: S+
\end_layout

\begin_layout Itemize
is: S- & O+
\end_layout

\begin_layout Itemize
a: D+
\end_layout

\begin_layout Itemize
test: D- & O-
\end_layout

\begin_layout Standard
The determiner connectors D+ and D- can be composed to form a determiner
 D link, leaving a phrase that is still describable by a disjunct, a single
 object O- connector that can attach to verbs: 
\end_layout

\begin_layout Itemize
\begin_inset Quotes eld
\end_inset

a test
\begin_inset Quotes erd
\end_inset

: O-
\end_layout

\begin_layout Standard
This has only one connector, but it is a perfectly ordinary connector, not
 differing from that which might be found on a single word.
 That is, Link Grammar makes no particular distinction between words and
 word-phrases.
 Using the same argument, it is why Link Grammar can work for morpho-syntax.
 One can continue composing:
\end_layout

\begin_layout Itemize
\begin_inset Quotes eld
\end_inset

is a test
\begin_inset Quotes erd
\end_inset

: S-
\end_layout

\begin_layout Standard
which has a subject connector S- that can connect to any subject.
 One can also, perhaps foolishly, form some not-very-sensical disjuncts:
\end_layout

\begin_layout Itemize
\begin_inset Quotes eld
\end_inset

is a
\begin_inset Quotes erd
\end_inset

: S- & O+ & D+
\end_layout

\begin_layout Standard
or
\end_layout

\begin_layout Itemize
\begin_inset Quotes eld
\end_inset

this ...
 a
\begin_inset Quotes erd
\end_inset

: S+ & D+
\end_layout

\begin_layout Standard
This last has to use ellipses as an awkward notation to indicate the projectivity
 constraint.
 Projectivity can be discarded, provided some other means ensures a tight
 parse.
\end_layout

\begin_layout Standard
The point here is that the category of disjuncts can be taken to be a monoidal
 category, i.e.
 a category with a tensor product 
\begin_inset Formula $\otimes$
\end_inset

, with tensoring simply being the writing of two disjuncts next to each
 other.
 As the first three examples illustrate, the typical usage is not only to
 tensor together two disjuncts, but also to contract some of the connectors,
 as well.
 
\end_layout
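
\begin_layout Standard
The tensoring and contraction described above can be sketched in Python as
 follows.
 This is a toy model only: a disjunct is taken to be a plain tuple of connector
 names, the helper names tensor and contract are invented for this note,
 and the sketch ignores connector ordering and the projectivity constraint.
\end_layout

```python
def tensor(left, right):
    """Write two disjuncts next to each other (left phrase precedes right)."""
    return tuple(left) + tuple(right)


def contract(left, right, label):
    """Tensor two disjuncts, then join one label+ on the left with one
    label- on the right into a link, removing both connectors."""
    plus, minus = label + "+", label + "-"
    if plus not in left or minus not in right:
        raise ValueError(f"cannot form a {label} link")
    remaining_left = list(left)
    remaining_left.remove(plus)
    remaining_right = list(right)
    remaining_right.remove(minus)
    return tuple(remaining_left) + tuple(remaining_right)


# The four disjuncts from the example sentence.
this, is_, a, test = ("S+",), ("S-", "O+"), ("D+",), ("D-", "O-")

a_test = contract(a, test, "D")            # phrase behaving like a bare object
is_a_test = contract(is_, a_test, "O")     # phrase wanting only a subject
is_a = tensor(is_, a)                      # nothing contracted
sentence = contract(this, is_a_test, "S")  # no connectors left over
```

\begin_layout Standard
A complete parse is then a sequence of contractions that leaves the empty
 disjunct, with no unconnected connectors remaining.
\end_layout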

\begin_layout Standard
The contractability of connectors into links means that the Link Grammar
 forms a compact-closed category.
 I've been through this one too many times, so won't try to sketch this
 here.
 It is a good homework exercise for novices.
 
\end_layout

\begin_layout Standard
Bob Coecke has written repeatedly on this topic; any one of his papers on
 pregroup grammars or closed monoidal categories applied to linguistics
 is adequate to grasp the concept.
 His notation is easily and readily translated into Link Grammar notation.
 The primary insight is to understand that the Link Grammar connector letters
 should be understood as type labels: they provide a simple, easy notational
 device, overcoming the notational complexity that is otherwise required
 when presenting categorial grammars.
\end_layout

\begin_layout Standard
What's the point of all this? Well, categorical grammars are all the rage,
 and the fact that LG is a categorical grammar seems to be frequently overlooked
 or misunderstood.
\end_layout

\begin_layout Subsection*
Conclusions
\end_layout

\begin_layout Standard
Conclusions from the above:
\end_layout

\begin_layout Itemize
Pair-wise similarity is very promising.
\end_layout

\begin_layout Itemize
The cosine similarity measure penalizes better, more accurate measurements,
 because better measurements are more likely to find dissimilarity.
 We need a better measure.
\end_layout

\begin_layout Itemize
PCA and sparse PCA, in the naive sense, applied to frequencies, or to cosine
 similarities, are inappropriate for classification.
 It's still possible that perhaps PCA applied to some sigmoid of the cosine
 similarity (e.g.
 cosine to the fourth power) might work better, but the selection of this
 sigmoid seems ad-hoc, and not anchored in any principles.
\end_layout

\begin_layout Itemize
First principles suggest something Bayesian, based on the Gibbs measure,
 maybe some sort of hidden multi-variate logistic regression.
 Hidden, because we don't know the grammatical categories in any a priori
 sense; we must deduce them.
 
\end_layout

\begin_layout Itemize
It would be great if someone worked out the precise details in going from
 sheaves to modal logic.
 This was already done, in 1965, for topos theory; no one has done this
 for natural language, though.
\end_layout

\begin_layout Subsection*
Other similarity measures
\end_layout

\begin_layout Standard
It's worth noting that the other similarity measures discussed previously,
 such as qim and pim, can also be treated in this way.
 Note also that a matrix can be constructed so that 
\begin_inset Formula $X^{T}X$
\end_inset

 becomes explicitly Markovian.
 This is given by 
\begin_inset Formula 
\[
A(w,d)=\frac{p(w,d)}{p(*,d)}\quad\mbox{ and }\quad B(d,w)=\frac{p(w,d)}{p(w,*)}
\]

\end_inset

and then setting 
\begin_inset Formula $X^{T}X=AB$
\end_inset

.
 This has the property that 
\begin_inset Formula 
\[
\sum_{w_{1}}\left[AB\right](w_{1},w_{2})=1
\]

\end_inset

That is, the matrix 
\begin_inset Formula $AB$
\end_inset

 is a Markov matrix.
 It is straight-forward to compute using the standard power-iteration algorithm
 employed throughout this section.
 
\end_layout
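
\begin_layout Standard
The column-sum property can be checked numerically on a toy frequency table.
 The sketch below is plain Python using exact rational arithmetic; the counts
 are invented for illustration, and markov_product is a hypothetical helper
 name, not something from the codebase.
\end_layout

```python
from fractions import Fraction


def markov_product(p):
    """p: dict (word, disjunct) -> probability.

    Returns the sorted word list and AB as a dict keyed by (w1, w2), where
    A(w,d) = p(w,d)/p(*,d) and B(d,w) = p(w,d)/p(w,*).
    """
    words = sorted({w for w, _ in p})
    djs = sorted({d for _, d in p})
    p_w = {w: sum(p.get((w, d), 0) for d in djs) for w in words}  # p(w,*)
    p_d = {d: sum(p.get((w, d), 0) for w in words) for d in djs}  # p(*,d)
    ab = {}
    for w1 in words:
        for w2 in words:
            ab[w1, w2] = sum(
                (p.get((w1, d), 0) / p_d[d]) * (p.get((w2, d), 0) / p_w[w2])
                for d in djs
            )
    return words, ab


# Invented observation counts, normalized to a joint probability p(w,d).
counts = {("cat", "D-"): 3, ("dog", "D-"): 2, ("dog", "S+"): 1, ("ran", "S-"): 4}
total = sum(counts.values())
p = {key: Fraction(n, total) for key, n in counts.items()}

words, ab = markov_product(p)
column_sums = [sum(ab[w1, w2] for w1 in words) for w2 in words]
```

\begin_layout Standard
Every entry of column_sums comes out as exactly one, as the identity above
 requires.
\end_layout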

\begin_layout Subsection*
A modified PCA algorithm
\end_layout

\begin_layout Standard
This suggests a feed-forward-neural-netish variant on iterative PCA, described
 below.
 It is entirely of my own design, cribbed from nowhere at all, just popped
 into my head as I sit still immobilized.
\end_layout

\begin_layout Standard
\paragraph_spacing double
\noindent
\begin_inset space \hspace{}
\length 1.3em
\end_inset

0.
 
\begin_inset space \space{}
\end_inset

Pre-condition, filter the data.
 See step 10, below, for what to filter, and why.
\end_layout

\begin_layout Enumerate
Start with 
\begin_inset Formula $b_{n}=1/\sqrt{\left|w\right|}$
\end_inset

 where 
\begin_inset Formula $\left|w\right|$
\end_inset

 is the number of unique words.
 This starting point is a unit-length vector, i.e.
 
\begin_inset Formula $\left|\overrightarrow{b}\right|=1$
\end_inset

.
 It's convenient to change notation, here, and write 
\begin_inset Formula $b(w)$
\end_inset

 for the value of 
\begin_inset Formula $\overrightarrow{b}$
\end_inset

 at word 
\begin_inset Formula $w$
\end_inset

.
 That is, 
\begin_inset Formula $b(w_{n})=b_{n}$
\end_inset

 is the same thing.
 
\end_layout

\begin_layout Enumerate
Let 
\begin_inset Formula $M$
\end_inset

 be the matrix for which the PCA is to be computed, with matrix components
 
\begin_inset Formula $M(w,d)$
\end_inset

 for word 
\begin_inset Formula $w$
\end_inset

 and disjunct 
\begin_inset Formula $d$
\end_inset

.
 This matrix is derived from (defined in terms of) the frequency matrix
 
\begin_inset Formula $p(w,d)$
\end_inset

 describing the base dataset.
 Compute the double-sum 
\begin_inset Formula 
\[
s(v)=\left[MM^{T}b\right](v)=\sum_{d}M(v,d)\sum_{w}M(w,d)b(w)
\]

\end_inset

which is basically a pair of dot products.
 It's still a large, time-consuming computation, even for sparse vectors.
\end_layout

\begin_layout Enumerate
Normalize: set 
\begin_inset Formula $\overrightarrow{b}\leftarrow\overrightarrow{s}/\left|\overrightarrow{s}\right|$
\end_inset

 so that 
\begin_inset Formula $\overrightarrow{b}$
\end_inset

 is of unit length.
 In theory, this is not needed; in practice, each iteration can sharply
 shrink the value of 
\begin_inset Formula $\overrightarrow{b}$
\end_inset

, making it very small, eventually leading to exponent underflow.
 
\end_layout

\begin_layout Enumerate
Repeat these steps 
\begin_inset Formula $k$
\end_inset

 times: go to step 2 and run the summation again.
 The repetition here is the 'power iteration' or the 'von Mises iteration'
 method for computing the dominant eigenvector (largest eigenvalue) of 
\begin_inset Formula $\left[MM^{T}\right]$
\end_inset

.
 It is not guaranteed to converge, and if it does, it might not do so quickly.
 But we deal with this in the next step, so it's sufficient to keep 
\begin_inset Formula $k$
\end_inset

 small, just enough to get a trend going.
 Another way to think of this is as a Markov process (specifically, a Markov
 chain).
 That is, the matrix 
\begin_inset Formula $\left[MM^{T}\right]$
\end_inset

 will behave essentially as a Markov chain, and iteration on it just identifies
 the primary Perron-Frobenius stable state (step 3 makes it Markovian, by
 preserving the total probability measure).
 That is, 
\begin_inset Formula $\left[MM^{T}\right]$
\end_inset

 defines a weighted adjacency matrix for a graph, and iteration creates
 a measure-preserving process (walk) on this graph.
\end_layout

\begin_layout Enumerate
After the above repetitions, apply some standard neural-net sigmoid function
 to 
\begin_inset Formula $\overrightarrow{b}$
\end_inset

.
 That is, set 
\begin_inset Formula $b(w)\leftarrow\sigma(b(w))$
\end_inset

 for some sigmoid.
 This has the effect of driving some of the elements to zero, and others
 to one.
\end_layout

\begin_layout Enumerate
Repeat this 
\begin_inset Formula $m$
\end_inset

 times: go to step 2, and repeat steps 2-5.
 Viewing this as a dynamical system, the effect of the sigmoid function
 is to force the system into a block-diagonal form, with the vector 
\begin_inset Formula $\overrightarrow{b}$
\end_inset

 identifying a highly-connected block.
 Another way to look at this is as a graph factorization algorithm: the
 vector 
\begin_inset Formula $\overrightarrow{b}$
\end_inset

 is identifying a well-connected subgraph, which is only weakly connected
 to the rest of the graph.
 The vector (viewed as a measure-preserving dynamical system) is spending
 most of its time in one particular block.
 Again, 
\begin_inset Formula $\left[MM^{T}\right]^{k}$
\end_inset

, the 
\begin_inset Formula $k$
\end_inset

-th power iterated matrix from step 4, can be thought of as a surrogate
 for a weighted graph adjacency matrix.
 A third way of thinking of this is as an 
\begin_inset Formula $m$
\end_inset

-layer neural net, with the link weights between one layer and the next
 being given by 
\begin_inset Formula $\left[MM^{T}\right]^{k}$
\end_inset

.
 All three ways of looking at this are essentially equivalent: a measure-preserv
ing dynamical system, a chaotic and mixing process on a graph, or as an
 
\begin_inset Formula $m$
\end_inset

-layer neural net.
 Pick your favorite.
\end_layout

\begin_layout Enumerate
Classify.
 Pass the vector 
\begin_inset Formula $\overrightarrow{b}$
\end_inset

 through the step function, i.e.
 
\begin_inset Formula $b(w)\leftarrow\Theta(b(w))$
\end_inset

 where 
\begin_inset Formula $\Theta(x)=0$
\end_inset

 if 
\begin_inset Formula $x<1/2$
\end_inset

 and 
\begin_inset Formula $\Theta(x)=1$
\end_inset

 if 
\begin_inset Formula $x>1/2$
\end_inset

.
 The step function is a super-sharp sigmoid.
 This step identifies and isolates an active, well-connected subgraph of
 
\begin_inset Formula $\left[MM^{T}\right]$
\end_inset

.
 It identifies a square block, of dimension 
\begin_inset Formula $\left|b\right|\times\left|b\right|$
\end_inset

 where 
\begin_inset Formula $\left|b\right|$
\end_inset

 is the total number of non-zero entries in this final 
\begin_inset Formula $\overrightarrow{b}$
\end_inset

.
 To belabor the point: the block-matrix is explicitly 
\begin_inset Formula 
\[
B(v,w)=b(v)b(w)\sum_{d}M(v,d)M(w,d)
\]

\end_inset

The non-zero elements of this final 
\begin_inset Formula $\overrightarrow{b}$
\end_inset

 identify a class of words that can be considered to be grammatically similar
 or identical.
 This is the 
\begin_inset Quotes eld
\end_inset

clustering
\begin_inset Quotes erd
\end_inset

 step.
\end_layout

\begin_layout Enumerate
Associated with this class of words is a disjunct set, the 
\begin_inset Quotes eld
\end_inset

average disjunct
\begin_inset Quotes erd
\end_inset

 for the class.
 It can be taken to be the set 
\begin_inset Formula $\left\{ d|0<\sum_{w}b(w)N(w,d)\right\} $
\end_inset

.
 The observed counts associated with this set can be taken to be 
\begin_inset Formula $N(b,d)=\sum_{w}b(w)N(w,d)$
\end_inset

 and the frequencies similarly: 
\begin_inset Formula $p(b,d)=\sum_{w}b(w)p(w,d)$
\end_inset

.
 From here-on, the set of words 
\begin_inset Formula $b\equiv\{w|0\ne b(w)\}$
\end_inset

 can be treated as if it was an ordinary word, behaving like any other,
 with the indicated disjuncts, counts and frequencies.
\end_layout

\begin_layout Enumerate
Since words can have multiple meanings, or rather, multiple different
 kinds of grammatical behaviors based on their part of speech, the identified
 words need to be subtracted, en bloc, from the matrix 
\begin_inset Formula $p(w,d)$
\end_inset

, and then the process repeated, to identify another class of words.
 Put another way, if 
\begin_inset Formula $b$
\end_inset

 is to be added to the set of words, as 
\begin_inset Quotes eld
\end_inset

just another word
\begin_inset Quotes erd
\end_inset

, then the frequencies 
\begin_inset Formula $p(b,d)$
\end_inset

 have to be subtracted from the matrix 
\begin_inset Formula $P$
\end_inset

, and shunted to this new 
\begin_inset Quotes eld
\end_inset

word
\begin_inset Quotes erd
\end_inset

, so as not to lose the overall normalization.
 That is, one must preserve the identity 
\begin_inset Formula $\sum_{w,d}p(w,d)=1$
\end_inset

.
 So define, in the next iteration 
\begin_inset Formula 
\[
p(w,d)\leftarrow\begin{cases}
p(b,d) & \mbox{ if }w=b\\
p(w,d)-b(w)p(b,d) & \mbox{ otherwise}
\end{cases}
\]

\end_inset

(Hmmm.
 This may not be right, it's late and I'm tired).
 This still sums to the identity except that now some of the values might
 go negative, and we don't want that.
 
\end_layout

\begin_layout Enumerate
And so we get to what should be called step zero: We want to truncate, and
 discard the negative entries.
 This should have been carried out as an actual step 0: a pre-conditioning
 of the matrix: some noise filtering, e.g.
 discarding all words that were observed less than a handful of times, discardin
g rare or preposterous disjuncts.
 Pre-conditioning in this way will have the effect of removing some (possibly
 many) of the words from the matrix: the size of the matrix shrinks.
 This is the step where the actual dimensional reduction takes place: the
 size of the set of words is shrinking, as they get classified into sets.
 
\end_layout

\begin_layout Enumerate
Go to step 0 and repeat, until the preconditioning and noise-removal has
 left behind an empty matrix (or alternately, a matrix where all words have
 been classified into some group).
 So, for example, words which have only one part-of-speech or meaning would
 (hopefully should) get classified after just one step; words that are more
 complex, and have two parts of speech, would require at least two iterations.
 This is perhaps optimistic; I expect dozens of iterations to get anything
 vaguely accurate.
\end_layout

\begin_layout Enumerate
There's one more step.
 After the formation of the class 
\begin_inset Formula $b$
\end_inset

, we arrive at a situation where no (pseudo-)connectors connect to 
\begin_inset Formula $b$
\end_inset

 directly.
 Instead, all disjuncts connect to words inside of 
\begin_inset Formula $b$
\end_inset

.
 But this is a problem: we don't know if any given connector actually connects
 to some 
\begin_inset Formula $w\in b$
\end_inset

 or if it connects to the same 
\begin_inset Formula $w$
\end_inset

, but outside of 
\begin_inset Formula $b$
\end_inset

.
 (e.g.
 if 
\begin_inset Formula $b$
\end_inset

 are nouns, then does 
\begin_inset Quotes eld
\end_inset

saw+
\begin_inset Quotes erd
\end_inset

 connect to 
\begin_inset Quotes eld
\end_inset

saw
\begin_inset Quotes erd
\end_inset

 the noun, or 
\begin_inset Quotes eld
\end_inset

saw
\begin_inset Quotes erd
\end_inset

 the verb?) Thus, after some small number of iterations of step 11, there
 needs to be a re-parse of the entire text, using these newly discovered
 classes of words.
 
\end_layout

\begin_layout Standard
That's it.
 I think this should work fairly well.
 Clearly, there are many nested loops, and so this is potentially a very
 time-consuming computation.
 The number of iterations 
\begin_inset Formula $k$
\end_inset

 and 
\begin_inset Formula $m$
\end_inset

 need to be kept small, and the classification in step 11 needs to be kept
 greedy, because step 12 is expensive.
 An alternate strategy is to brutally precondition 
\begin_inset Formula $p(w,d)$
\end_inset

 to make it as small as possible; but this risks throwing out the baby with
 the bathwater: early on, we want to cluster together the rare, obscure,
 unused words as best as possible into large bins, and then devote large
 CPU resources to correctly classifying the remaining much smaller set of
 verbs and prepositions, which we know, 
\emph on
a priori,
\emph default
 to be complex and difficult, due to their grammatical variability.
\end_layout
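
\begin_layout Standard
Steps 1 through 7 can be sketched in Python as below.
 This is a toy, dense-matrix illustration and not an implementation that
 exists anywhere: the function names are invented, the matrix is made up,
 and the particular sigmoid (a logistic function of each entry divided by
 the largest entry, centered at one half) is purely an assumption, since
 the choice of sigmoid is left open above.
 The subtraction, filtering and re-parsing of steps 8 through 12 are not
 attempted.
\end_layout

```python
import math


def mmt_apply(M, b):
    """s(v) = sum_d M(v,d) * sum_w M(w,d) b(w), done as two passes (step 2)."""
    n_words, n_djs = len(M), len(M[0])
    t = [sum(M[w][d] * b[w] for w in range(n_words)) for d in range(n_djs)]
    return [sum(M[v][d] * t[d] for d in range(n_djs)) for v in range(n_words)]


def normalize(s):
    """Rescale to unit length (step 3)."""
    norm = math.sqrt(sum(x * x for x in s))
    return [x / norm for x in s]


def sigmoid(b, steepness=10.0):
    """Assumed squashing function: logistic in b/max(b), centered at 1/2."""
    top = max(b)
    return [1.0 / (1.0 + math.exp(-steepness * (x / top - 0.5))) for x in b]


def find_class(M, k=3, m=4):
    n = len(M)
    b = [1.0 / math.sqrt(n)] * n            # step 1: uniform unit vector
    for _ in range(m):                      # step 6: repeat m times
        for _ in range(k):                  # step 4: k power iterations
            b = normalize(mmt_apply(M, b))  # steps 2 and 3
        b = sigmoid(b)                      # step 5
    return [w for w, x in enumerate(b) if x > 0.5]  # step 7: step function


# Made-up word-disjunct matrix: words 0-2 share two disjuncts, words 3-4
# share two others, and the first block carries most of the weight.
M = [
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [3, 3, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 2],
]
cluster = find_class(M)
```

\begin_layout Standard
On this toy input the power iteration concentrates on the heavier block,
 and the thresholded vector picks out words 0, 1 and 2 as a single class.
 Real data is sparse and far larger, so the dense double loop here would
 have to be replaced by sparse sums.
\end_layout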

\begin_layout Subsection*
Dataset
\end_layout

\begin_layout Standard
The previous dataset 
\noun on
en_pairs_sim
\noun default
, analyzed above, proved to be inadequate in many respects.
 Thus, data analysis here resumes with a different, considerably larger
 dataset, collected on a higher-quality corpus.
 This will be the 
\noun on
en_pairs_ttwo_mst
\noun default
 dataset, listed above.
 To recap, it is this one:
\begin_inset Newline newline
\end_inset


\end_layout

\begin_layout Standard
\begin_inset Tabular
<lyxtabular version="3" rows="2" columns="9">
<features tabularvalignment="middle">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Size
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Csets
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Obs'ns
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Ob/cs
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Entropy
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $H_{left}$
\end_inset


\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $H_{right}$
\end_inset


\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
MI
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Notes
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
176K x 3.4M
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
6.43M
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
14.3M
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2.23
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
21.01
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
14.91
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
10.01
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
-3.91
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
en_pairs_ttwo_mst
\end_layout

\end_inset
</cell>
</row>
</lyxtabular>

\end_inset


\begin_inset Newline newline
\end_inset

To avoid accidental corruption of this dataset, a copy was made, in which
 assorted sporadic results are maintained.
 The copy is the 
\noun on
en_pairs_ttwo_sim
\noun default
 dataset.
\end_layout

\begin_layout Subsection*
Filtering, Step 0
\end_layout

\begin_layout Standard
The filtering performed in step 0 (described in step 10, above) removes
 some of the noise in the dataset.
 Basic filtering is implemented in the 
\noun on
(opencog analysis)
\noun default
 scheme module, and specifically in the 
\noun on
filter.scm
\noun default
 file.
 One can remove rows and columns that have subtotal counts less than a cutoff,
 and also remove individual entries that have fewer than some number of
 counts.
 By removing very infrequently observed connector sets, some of the drivers
 of accidental similarity or dis-similarity between words should be ameliorated.
\end_layout
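
\begin_layout Standard
The kind of filtering meant here can be sketched in Python.
 This is not the interface of the scheme module; it is a stand-alone illustratio
n in which filter_counts and its parameter names are hypothetical.
 Two assumptions are made: entries whose count is at or below a cut are
 dropped, and the row and column subtotals are computed once, on the unfiltered
 data.
\end_layout

```python
from collections import defaultdict


def filter_counts(counts, word_cut=0, dj_cut=0, cset_cut=0):
    """counts: dict mapping (word, disjunct) -> observation count.

    Keeps an entry only if its word (row) subtotal, its disjunct (column)
    subtotal and its own count all exceed the corresponding cut.
    """
    row_total = defaultdict(int)
    col_total = defaultdict(int)
    for (word, dj), n in counts.items():
        row_total[word] += n
        col_total[dj] += n
    return {
        (word, dj): n
        for (word, dj), n in counts.items()
        if row_total[word] > word_cut
        and col_total[dj] > dj_cut
        and n > cset_cut
    }


# Invented counts: one rare word and one rare disjunct.
counts = {
    ("the", "D+"): 10,
    ("a", "D+"): 5,
    ("zyx", "Q-"): 1,
    ("the", "Q-"): 1,
}
kept = filter_counts(counts, word_cut=1, dj_cut=2, cset_cut=0)
```

\begin_layout Standard
With these cuts the rare word and the rare disjunct both disappear, leaving
 only the two well-observed entries.
\end_layout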

\begin_layout Standard
How much data does filtering actually discard? This dataset has 175559 rows.
 Each row corresponds to one unique, distinct word (columns correspond to
 disjuncts).
 Of these words, only 84984 were observed twice, or more: slightly less
 than half! Only 64882 words were seen three times or more; only 10% of
 the words were seen 32 times or more.
 The distribution is shown in the graph below.
\begin_inset Foot
status collapsed

\begin_layout Plain Layout
From the cnt-obs-rows function.
\end_layout

\end_inset


\begin_inset Separator latexpar
\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Graphics
	filename images/count-rows.eps
	width 70col%

\end_inset


\end_layout

\begin_layout Standard
The fraction of rows with more than 
\begin_inset Formula $N$
\end_inset

 observations drops a little faster than 
\begin_inset Formula $1/\sqrt{N}$
\end_inset

.
 Note that this graph is not scale-free; for larger datasets, the graph
 should progressively flatten.
 Since cumulative distributions are integrals of distributions, this is
 essentially the integral of some of the graphs shown before.
 A table of plausible cutoffs to use with this dataset is given below.
 There are three cuts one can make: discard words that are observed 
\begin_inset Formula $N$
\end_inset

 or fewer times; discard disjuncts that were observed 
\begin_inset Formula $N$
\end_inset

 or fewer times, and discard connector-sets (word-disjunct pairs) that were
 observed 
\begin_inset Formula $N$
\end_inset

 or fewer times.
 These three cuts are given in the first three columns; the resulting dataset
 is given in the remaining columns.
\begin_inset Foot
status collapsed

\begin_layout Plain Layout
Stats can be gotten by creating the add-support-compute object on the filter
 object, and then invoking 'left-basis-size, 'right-basis-size, 'total-support
 and 'total-count.
 
\end_layout

\end_inset


\begin_inset VSpace defskip
\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Tabular
<lyxtabular version="3" rows="11" columns="7">
<features tabularvalignment="middle">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Word cut
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Dj cut
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Cset cut
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Size
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Csets
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Obs'ns
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Ob/cs
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
176K x 3.4M
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
6.43M
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
14.3M
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2.23
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
85K x 1.15M
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
4.10M
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
12.0M
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2.94
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
4
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
4
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
49K x 145K
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2.81M
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
9.51M
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
3.39
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
10
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
10
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
32.2K x 40.3K
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2.42M
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
8.73M
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
3.60
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
30
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
30
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
17.9K x 14.4K
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1.98M
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
8.05M
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
4.07
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
30
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
10
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
17.9K x 40.3K
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2.26M
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
8.48M
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
3.75
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
30
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
10
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
4
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
17.9K x 40.3K
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
269K
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
5.52M
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
20.5
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
50
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
30
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
13.1K x 14.4K
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1.88M
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
7.86M
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
4.19
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
50
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
30
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
4
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
13.1K x 14.4K
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
256K
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
5.41M
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
21.1
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
50
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
30
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
10
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
13.1K x 14.4K
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
101K
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
4.36M
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
43.4
\end_layout

\end_inset
</cell>
</row>
</lyxtabular>

\end_inset


\end_layout

\begin_layout Standard
\begin_inset VSpace defskip
\end_inset


\end_layout

\begin_layout Standard
The last cut seems plausible for further work: it suggests that each disjunct
 is observed a fairly large number of times; and that, given the word/disjunct
 ratio, a lot of words are using disjuncts in a similar fashion; thus, there
 should be a lot of similarity.
\end_layout
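
\begin_layout Standard
The three cuts can be sketched as follows.
 This is an illustrative Python stand-in, not the add-subtotal-filter code.
 It assumes that the word and disjunct subtotals are taken on the unfiltered
 data, and that an entry is dropped when its count is less than or equal
 to the cut; both are assumptions.
 The function names and the toy counts are hypothetical.
\end_layout

```python
from collections import defaultdict

def filter_csets(counts, word_cut, dj_cut, cset_cut):
    """Apply the three cuts to a {(word, disjunct): count} mapping.

    Assumption: row and column subtotals are computed on the unfiltered
    data, and anything observed `cut` or fewer times is discarded.
    """
    word_total = defaultdict(int)
    dj_total = defaultdict(int)
    for (word, dj), n in counts.items():
        word_total[word] += n
        dj_total[dj] += n

    return {
        (word, dj): n
        for (word, dj), n in counts.items()
        if word_total[word] > word_cut
        and dj_total[dj] > dj_cut
        and n > cset_cut
    }

def summarize(counts):
    """Return (words, disjuncts, csets, observations, obs per cset)."""
    words = {word for word, _ in counts}
    djs = {dj for _, dj in counts}
    csets = len(counts)
    obs = sum(counts.values())
    return len(words), len(djs), csets, obs, (obs / csets if csets else 0.0)

# Hypothetical toy data: one word and one disjunct are seen only once.
toy = {
    ("the", "book+"): 12,
    ("the", "novel+"): 7,
    ("a", "book+"): 5,
    ("zyx", "qqq-"): 1,
}
kept = filter_csets(toy, word_cut=1, dj_cut=1, cset_cut=0)
stats = summarize(kept)
```

\begin_layout Standard
The summary here counts only the words and disjuncts that survive all three
 cuts, which may differ from how the Size column of the table was obtained.
\end_layout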

\begin_layout Standard
Note, by the way, that the previous sections carefully described entropy
 and mutual information distributions that no longer hold for the cut dataset.
 Filtering changes these!
\end_layout

\begin_layout Subsection*
Power iteration, Steps 1-4
\end_layout

\begin_layout Standard
Steps 1-4 are implemented in the 
\noun on
(opencog analysis)
\noun default
 scheme module, and specifically in the 
\noun on
thresh-pca.scm
\noun default
 file.
 The implementation uses lazy evaluation to avoid unneeded computation,
 and caching of evaluation results to avoid repeated evaluations.
 This seems like the best way of working with the extremely sparse matrices
 involved.
\end_layout

\begin_layout Standard
Iteration appears to converge very rapidly.
 After three iterations, the ranking, by weight in the vector, appears to
 be established.
 This is shown in the figures below.
 Six iterations are performed, and the words 
\begin_inset Formula $w$
\end_inset

 are then ranked according to the strength 
\begin_inset Formula $b(w)$
\end_inset

 in the sixth iteration.
 Then the values of 
\begin_inset Formula $b(w)$
\end_inset

 are plotted for the first five iterations, using this rank.
 After the third iteration, there is no discernible change in the weights.
\begin_inset Separator latexpar
\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Graphics
	filename images/power-iter-all.eps
	width 45col%

\end_inset


\begin_inset Graphics
	filename images/power-iter-close.eps
	width 45col%

\end_inset


\end_layout
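
\begin_layout Standard
The rapid convergence described above is typical of power iteration when
 one eigenvalue dominates.
 The sketch below is a generic Python illustration on a tiny sparse matrix,
 not the thresh-pca.scm code: it assumes that one iteration multiplies the
 word vector by the matrix transpose and then by the matrix, followed by
 normalization to unit length.
 The toy matrix and the function name are hypothetical.
\end_layout

```python
import math

def left_iterate(matrix, vec, steps):
    """Power iteration on M M^T for a sparse M given as {(row, col): value}."""
    for _ in range(steps):
        # col_vec = M^T vec
        col_vec = {}
        for (row, col), val in matrix.items():
            col_vec[col] = col_vec.get(col, 0.0) + val * vec.get(row, 0.0)
        # new_vec = M col_vec
        new_vec = {}
        for (row, col), val in matrix.items():
            new_vec[row] = new_vec.get(row, 0.0) + val * col_vec.get(col, 0.0)
        # Renormalize to unit length after every step.
        norm = math.sqrt(sum(x * x for x in new_vec.values()))
        vec = {row: x / norm for row, x in new_vec.items()}
    return vec

# Hypothetical toy counts: rows are words, columns are disjunct labels.
toy = {
    (".", "d1"): 9.0, (".", "d2"): 1.0,
    ("the", "d1"): 1.0, ("the", "d2"): 2.0,
    ("and", "d2"): 1.0,
}
rows = {row for row, _ in toy}
start = {row: 1.0 / math.sqrt(len(rows)) for row in rows}

after3 = left_iterate(toy, start, 3)
after6 = left_iterate(toy, start, 6)
ranking = sorted(after6, key=after6.get, reverse=True)
max_change = max(abs(after6[row] - after3[row]) for row in rows)
```

\begin_layout Standard
In this toy case the heavily observed row dominates the vector, and the
 weights after three and after six steps differ only slightly.
\end_layout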

\begin_layout Standard
In this case, the principal component is revealed to be
\begin_inset Foot
status collapsed

\begin_layout Plain Layout
Computed as follows (simplified, various checks were done to verify correctness)
: 
\end_layout

\begin_layout Plain Layout
(define psa (make-pseudo-cset-api))
\end_layout

\begin_layout Plain Layout
(psa 'fetch-pairs)
\end_layout

\begin_layout Plain Layout
(define fsi (add-subtotal-filter psa 50 30 10))
\end_layout

\begin_layout Plain Layout
(define pti (make-power-iter-pca fsi))
\end_layout

\begin_layout Plain Layout
(define feig (pti 'make-left-unit (fsi 'left-basis))) 
\end_layout

\begin_layout Plain Layout
(define lit (pti 'left-iterate feig 4)) 
\end_layout

\begin_layout Plain Layout
(pti 'left-print lit 20) 
\end_layout

\end_inset


\begin_inset Separator latexpar
\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Tabular
<lyxtabular version="3" rows="6" columns="2">
<features tabularvalignment="middle">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
word
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
weight
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
.
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.9358
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
the
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.1212
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
and
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.1098
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
to
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.1035
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
”
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.0920
\end_layout

\end_inset
</cell>
</row>
</lyxtabular>

\end_inset


\end_layout

\begin_layout Standard
The significance and interpretation of this vector was already discussed
 in a previous section.
\end_layout

\begin_layout Section*
Reboot: 23 June 2017
\end_layout

\begin_layout Standard
The cosine similarity gives OK pair-wise similarity results; the overlap
 similarity gives noticeably worse results.
 There's no obvious matrix-based algorithm that will group together multiple
 words into a cluster, efficiently, in one shot.
 It's time for a rethink.
 
\end_layout

\begin_layout Standard
What are we trying to do here, really?
\end_layout

\begin_layout Itemize
Grammatical categories: By grouping multiple words into categories, we hope
 to discover grammatical categories of words that behave similarly, with
 respect to grammar.
\end_layout

\begin_layout Itemize
Compression: By grouping multiple words into categories, we hope to compress
 the size of the overall dataset of connector-sets, without losing much
 fidelity.
\end_layout

\begin_layout Itemize
Idioms: By observing word-disjunct pairs with high MI scores, we hope to
 discover idioms and set phrases.
 
\end_layout

\begin_layout Itemize
Meaning: By developing a coherent framework for working with graph sections
 (of which the connector sets are a special case), we hope to discover synonymou
s phrases.
\end_layout

\begin_layout Itemize
Reference resolution: By developing a coherent framework for working with
 graph sections, we hope to discover reference resolution (of pronouns and
 of given names) across multiple sentences.
\end_layout

\begin_layout Standard
These seem like they should be achievable.
 Preliminary results look promising, but not yet great.
 What's the grand scheme of things?
\end_layout

\begin_layout Itemize
Discover that certain nouns refer to objects in the physical world.
\end_layout

\begin_layout Itemize
Discover that certain verbs refer to actions in the physical world.
\end_layout

\begin_layout Itemize
Discover that certain nouns refer to abstract, non-physical concepts.
\end_layout

\begin_layout Itemize
Discover the meaning of the verb-phrases 
\begin_inset Quotes eld
\end_inset

is-a
\begin_inset Quotes erd
\end_inset

, 
\begin_inset Quotes eld
\end_inset

has-a
\begin_inset Quotes erd
\end_inset

, 
\begin_inset Quotes eld
\end_inset

is-a-part-of
\begin_inset Quotes erd
\end_inset

, 
\begin_inset Quotes eld
\end_inset

belongs-to
\begin_inset Quotes erd
\end_inset

 ...
 or, generally, discover the meanings of prepositional phrases.
 
\end_layout

\begin_layout Itemize
Perform reasoning on relationships; specifically, on 
\begin_inset Quotes eld
\end_inset

is-a
\begin_inset Quotes erd
\end_inset

, 
\begin_inset Quotes eld
\end_inset

has-a
\begin_inset Quotes erd
\end_inset

, ...
 relationships.
\end_layout

\begin_layout Itemize
Develop a database of common-sense knowledge.
\end_layout

\begin_layout Itemize
Translate between multiple languages, by employing common-sense knowledge
 and reasoning.
\end_layout

\begin_layout Standard
The first two seem impossible without embodiment.
 The third bullet holds out hope that progress might be possible for textual-onl
y analysis.
 The fourth bullet asks for an algebraic structure to be discerned: 
\begin_inset Quotes eld
\end_inset

is-a
\begin_inset Quotes erd
\end_inset

 relations are transitive: 
\begin_inset Quotes eld
\end_inset

A is-a C
\begin_inset Quotes erd
\end_inset

 whenever 
\begin_inset Quotes eld
\end_inset

A is-a B
\begin_inset Quotes erd
\end_inset

 and 
\begin_inset Quotes eld
\end_inset

B is-a C
\begin_inset Quotes erd
\end_inset

, and it should be possible to data-mine such algebraic properties.
 Symmetry is the analogous property for 
\begin_inset Quotes eld
\end_inset

next-to
\begin_inset Quotes erd
\end_inset

 and 
\begin_inset Quotes eld
\end_inset

near
\begin_inset Quotes erd
\end_inset

 relationships.
 It is at least plausible that such relations could be data-mined, with
 no 
\emph on
a priori
\emph default
 knowledge of the words.
 Anyway, this provides the setting for the initial grammatical tasks, to
 which we now return.
\end_layout

\begin_layout Standard
Idioms: based on preliminary evidence, we could make lists of idiomatic
 phrases, now, based only on high-MI word-disjunct relations.
 But these are useless, until we can build a list of synonymous words.
\end_layout

\begin_layout Standard
Discerning synonymous words based on grammatical usage is tricky.
 One difficulty is that it is often antonyms that get observed: e.g.
 the (black,white) pairs reported above.
 To discern antonymy, we would also need to discern is-a relationships,
 apply common-sense reasoning, and notice that antonyms never describe the
 same objects, nor the properties of the same objects.
 So, antonym detection appears to be an advanced topic.
\end_layout

\begin_layout Subsection*
Clustering
\end_layout

\begin_layout Standard
We could go full-speed ahead on trying to discern grammatical categories,
 but for several issues: 
\end_layout

\begin_layout Itemize
Merging two items into one necessarily entails a loss of information.
 That is, one necessarily has that 
\begin_inset Formula $-(p_{a}+p_{b})\log_{2}(p_{a}+p_{b})\le-p_{a}\log_{2}p_{a}-p_{b}\log_{2}p_{b}$
\end_inset

, with equality only when one of the two probabilities is zero.
 How can we minimize the information lost? 
\end_layout

\begin_layout Itemize
If a classification error is made early on, can it be spotted, and later
 corrected? What is the mechanism, and how might it work?
\end_layout
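
\begin_layout Standard
The size of the loss in the first item above is easy to compute directly.
 The following Python sketch evaluates the difference between the two sides
 of the inequality for a pair of probabilities; the function names are
 hypothetical, and the probabilities are made-up values chosen only for
 illustration.
\end_layout

```python
import math

def entropy_term(p):
    """Return -p * log2(p), with the usual convention that 0 log 0 = 0."""
    return 0.0 if p == 0.0 else -p * math.log2(p)

def merge_loss(p_a, p_b):
    """Entropy lost (in bits) when two items with these probabilities merge."""
    return entropy_term(p_a) + entropy_term(p_b) - entropy_term(p_a + p_b)

# Two equally likely items versus a common item merged with a rare one.
loss_equal = merge_loss(0.25, 0.25)
loss_skewed = merge_loss(0.49, 0.01)
loss_trivial = merge_loss(0.3, 0.0)
```

\begin_layout Standard
For the same combined probability, merging a rare item into a common one
 loses far less than merging two equally likely items.
 This is consistent with the idea that absorbing infrequent, noisy observations
 is cheap.
\end_layout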

\begin_layout Standard
We can use the 
\begin_inset Quotes eld
\end_inset

information loss
\begin_inset Quotes erd
\end_inset

 to our advantage if the 
\begin_inset Quotes eld
\end_inset

lost information
\begin_inset Quotes erd
\end_inset

 is in fact just noise in the data.
 That is, the data is necessarily noisy, and the naive calculation of the
 entropy and mutual information encodes both that noise and the signal we
 are searching for.
 Clustering together items, and the associated information loss, is desirable
 if the loss results in the filtering out of noise.
 How can we characterize the noise in the observations?
\end_layout

\begin_layout Subsection*
Word Meanings
\end_layout

\begin_layout Standard
Take as an assumption that word-meaning is strongly correlated with grammatical
 usage.
 For example, 
\begin_inset Quotes eld
\end_inset

saw
\begin_inset Quotes erd
\end_inset

, the noun, has a different meaning than 
\begin_inset Quotes eld
\end_inset

saw
\begin_inset Quotes erd
\end_inset

, the verb.
 Thus, as a hypothesis, write
\begin_inset Formula 
\[
[w,m_{1}]=\{(w,d_{a}),\,(w,d_{b}),\,(w,d_{c}),\,\cdots\}
\]

\end_inset

that is, the word 
\begin_inset Formula $w$
\end_inset

 might have meaning 
\begin_inset Formula $m_{1}$
\end_inset

 whenever it is used with any of the disjuncts 
\begin_inset Formula $d_{k}$
\end_inset

 from the indicated set.
 The meaning 
\begin_inset Formula $m_{2}$
\end_inset

 of a word will be associated with a different set of disjuncts.
 In general, the sets 
\begin_inset Formula $[w,m_{1}]$
\end_inset

 and 
\begin_inset Formula $[w,m_{2}]$
\end_inset

 will overlap.
 
\end_layout

\begin_layout Standard
Individually, 
\begin_inset Formula $[w,m]$
\end_inset

 is just a set, and has no weights or probabilities associated with it.
 However, if the word-disjunct pair 
\begin_inset Formula $(w,d)$
\end_inset

 is observed, one can say that there is a frequency or probability 
\begin_inset Formula $p([w,m]|(w,d))$
\end_inset

 that the meaning 
\begin_inset Formula $m$
\end_inset

 of word 
\begin_inset Formula $w$
\end_inset

 is intended when 
\begin_inset Formula $(w,d)$
\end_inset

 is observed.
 This is written as a conditional probability, so that one has
\begin_inset Formula 
\[
\sum_{m}p([w,m]|(w,d))=1
\]

\end_inset

That is, given that 
\begin_inset Formula $(w,d)$
\end_inset

 was observed, there must be some meaning 
\begin_inset Formula $m$
\end_inset

 that was intended; the list of possible meanings is complete and exhaustive.
 I'm assuming that one possible meaning is 
\begin_inset Quotes eld
\end_inset

nonsense
\begin_inset Quotes erd
\end_inset

 or 
\begin_inset Quotes eld
\end_inset

junk
\begin_inset Quotes erd
\end_inset

 or 
\begin_inset Quotes eld
\end_inset

unknown
\begin_inset Quotes erd
\end_inset

; just add it to the list of possible meanings.
\end_layout

\begin_layout Standard
One of the tasks is to discover the complete set of meanings 
\begin_inset Formula $\{m_{i}\}$
\end_inset

 for a word.
 Another task is to discern the probabilities 
\begin_inset Formula $p([w,m]|(w,d))$
\end_inset

.
\end_layout

\begin_layout Subsection*
Word Classes
\end_layout

\begin_layout Standard
Any given word might belong to one of many different word classes (noun,
 verb, ...) and the collected disjunct usage observations on that word will
 in general be a linear combination of such different usages.
 Distinguishing word-classes requires untangling these relationships.
\end_layout

\begin_layout Standard
The setup for this problem is mostly identical to the problem above, except
 that the 
\begin_inset Quotes eld
\end_inset

meaning
\begin_inset Quotes erd
\end_inset

 
\begin_inset Formula $m$
\end_inset

 is re-interpreted as the word-class 
\begin_inset Formula $g$
\end_inset

, short for 
\begin_inset Quotes eld
\end_inset

grammatical class
\begin_inset Quotes erd
\end_inset

.
 That is, the above did not specify the definition of 
\begin_inset Formula $m$
\end_inset

; rather, it was presented in general terms.
 The same setup applies here, but a stricter definition is proposed for a 
\begin_inset Quotes eld
\end_inset

word class
\begin_inset Quotes erd
\end_inset

.
\end_layout

\begin_layout Standard
A word class 
\begin_inset Formula $g$
\end_inset

 is a set of words 
\begin_inset Formula $\{w_{i}\}$
\end_inset

, together with a set of disjuncts 
\begin_inset Formula $\{d_{j}\}$
\end_inset

, such that all words in the word-class are commonly used with any of the
 disjuncts in the disjunct-set.
 That is, 
\begin_inset Formula 
\[
g=\left(\{w_{i}\},\{d_{j}\}\right)
\]

\end_inset

 subject to the constraint that the vector
\begin_inset Formula 
\[
[w_{1},g]=\left\{ (w_{1},d_{a}),\,(w_{1},d_{b}),\,(w_{1},d_{c}),\,\cdots\right\} 
\]

\end_inset

 is judged to be similar to the vector
\begin_inset Formula 
\[
[w_{2},g]=\left\{ (w_{2},d_{a}),\,(w_{2},d_{b}),\,(w_{2},d_{c}),\,\cdots\right\} 
\]

\end_inset

according to some similarity measure (e.g.
 cosine similarity).
 The idea is that any of the words in 
\begin_inset Formula $g$
\end_inset

 use any of the disjuncts in 
\begin_inset Formula $g$
\end_inset

 in similar ways.
\end_layout
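
\begin_layout Standard
The similarity constraint can be made concrete with a small sketch.
 The Python below restricts each word's counts to the disjuncts of a candidate
 class and compares the resulting vectors by cosine similarity, which is
 only one of the possible similarity measures.
 The counts, the word and disjunct labels, and the function names are hypothetic
al.
\end_layout

```python
import math

def class_vector(counts, word, disjuncts):
    """Counts N(word, d) restricted to the disjuncts of a candidate class."""
    return [counts.get((word, d), 0) for d in disjuncts]

def cosine(u, v):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

# Hypothetical observation counts on three words and three disjuncts.
counts = {
    ("w1", "d_a"): 40, ("w1", "d_b"): 20, ("w1", "d_c"): 10,
    ("w2", "d_a"): 8, ("w2", "d_b"): 4, ("w2", "d_c"): 2,
    ("w3", "d_a"): 1, ("w3", "d_c"): 30,
}
class_djs = ["d_a", "d_b", "d_c"]

sim_12 = cosine(class_vector(counts, "w1", class_djs),
                class_vector(counts, "w2", class_djs))
sim_13 = cosine(class_vector(counts, "w1", class_djs),
                class_vector(counts, "w3", class_djs))
```

\begin_layout Standard
Here the first two words use the three disjuncts in the same proportions,
 so they would be candidates for the same class, while the third word would
 not.
\end_layout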

\begin_layout Standard
Any given word might belong to multiple different classes 
\begin_inset Formula $g$
\end_inset

.
 For example, the verb 
\begin_inset Quotes eld
\end_inset

saw
\begin_inset Quotes erd
\end_inset

 will belong to a different class than the noun 
\begin_inset Quotes eld
\end_inset

saw
\begin_inset Quotes erd
\end_inset

.
 As a general rule, whenever 
\begin_inset Formula $g_{1}\ne g_{2}$
\end_inset

 then the set of disjuncts in 
\begin_inset Formula $g_{1}$
\end_inset

 and 
\begin_inset Formula $g_{2}$
\end_inset

 will not overlap very much, if at all.
\end_layout

\begin_layout Subsection*
Assigning Words to Word Classes
\end_layout

\begin_layout Standard
Assigning a word to a word-class has a knock-on network effect.
 That is, words appear not only in isolation, but also as connectors in
 disjuncts.
 If two words are considered to be similar, then perhaps two connectors
 should be judged to be similar.
 If two connectors are similar, then perhaps the disjuncts they appear in
 are similar.
 If two disjuncts are similar, then perhaps some other pair of words can
 now be considered to be similar.
 The question arises of how far to follow this network effect, and how to
 assign cutoffs.
 
\end_layout

\begin_layout Standard
Note that the network can be traced in either one of two directions.
 Given a pair of similar words, one can ask if any of the disjuncts attached
 to those words are similar to one another, or not.
 In the other direction, one can ask how similar connectors imply similar
 disjuncts.
 This is made more explicit below.
\end_layout

\begin_layout Standard
Starting with a single word, one can examine all of the disjuncts on that
 word, to see if any of them are similar.
 For example, the word 
\begin_inset Quotes eld
\end_inset

the
\begin_inset Quotes erd
\end_inset

 should have disjuncts 
\begin_inset Quotes eld
\end_inset

book+
\begin_inset Quotes erd
\end_inset

 and 
\begin_inset Quotes eld
\end_inset

novel+
\begin_inset Quotes erd
\end_inset

 on it, and one can ask if 
\begin_inset Quotes eld
\end_inset

book
\begin_inset Quotes erd
\end_inset

 and 
\begin_inset Quotes eld
\end_inset

novel
\begin_inset Quotes erd
\end_inset

 are similar.
 If so, they can be merged into a grammatical class 
\begin_inset Formula $g=\left\{ \mbox{book},\mbox{novel}\right\} $
\end_inset

 and the two disjuncts 
\begin_inset Quotes eld
\end_inset

book+
\begin_inset Quotes erd
\end_inset

 and 
\begin_inset Quotes eld
\end_inset

novel+
\begin_inset Quotes erd
\end_inset

 replaced by 
\begin_inset Formula $g+$
\end_inset

.
 The observation count (and likewise the probability) on (the, g+) should
 be the sum of N(the, book+) and N(the, novel+).
 The process is then repeated recursively, examining each of the words appearing
 in the disjuncts.
\end_layout

\begin_layout Standard
Alternately, one may walk the network in the 
\begin_inset Quotes eld
\end_inset

other
\begin_inset Quotes erd
\end_inset

 direction, and merge disjuncts as they appear in similar words.
 Let disjunct 
\begin_inset Formula $d$
\end_inset

 be a sequence of connectors: 
\begin_inset Formula $d=(c_{1},c_{2},c_{3},\cdots)$
\end_inset

 and each connector is a word and a direction indicator: 
\begin_inset Formula $c=(w,\pm)$
\end_inset

.
 Given two similar words 
\begin_inset Formula $w_{a}$
\end_inset

 and 
\begin_inset Formula $w_{b}$
\end_inset

, one can trace through the connectors 
\begin_inset Formula $c_{a+}=(w_{a},+)$
\end_inset

 and 
\begin_inset Formula $c_{b+}=(w_{b},+)$
\end_inset

 and likewise for the - direction.
 One then forms the set of all disjuncts in which 
\begin_inset Formula $c_{a+}$
\end_inset

 appears: 
\begin_inset Formula 
\[
\left\{ d_{k}=(c_{1},c_{2},c_{3},\cdots)|c_{j}=c_{a+}\,\mbox{for some }j\right\} 
\]

\end_inset

Then, given one 
\begin_inset Formula $d_{k}$
\end_inset

, one constructs 
\begin_inset Formula $\widetilde{d_{k}}$
\end_inset

 so that 
\begin_inset Formula $c_{b+}$
\end_inset

 replaces 
\begin_inset Formula $c_{a+}$
\end_inset

.
 One then constructs the set of all words that appear with 
\begin_inset Formula $\widetilde{d_{k}}$
\end_inset

 :
\begin_inset Formula 
\[
\upsilon=\left\{ w|N(w,\widetilde{d_{k}})>0\right\} 
\]

\end_inset

and asks whether any of these words already belong to the same grammatical
 class.
 If not, then they should be compared to one another, to see if they might.
\end_layout

\begin_layout Standard
If a pair of words in 
\begin_inset Formula $\upsilon$
\end_inset

 already belong to the same grammatical class, then the two disjuncts 
\begin_inset Formula $d_{k}$
\end_inset

 and 
\begin_inset Formula $\widetilde{d_{k}}$
\end_inset

 can be merged into one.
 Do this by forming the grammatical class 
\begin_inset Formula $g=\{w_{a},w_{b}\}$
\end_inset

 and construct the connector 
\begin_inset Formula $c_{g+}=(g,+)$
\end_inset

.
 Then construct 
\begin_inset Formula $\overbrace{d_{k}}$
\end_inset

 so that 
\begin_inset Formula $c_{g+}$
\end_inset

 replaces 
\begin_inset Formula $c_{a+}$
\end_inset

, and replace both 
\begin_inset Formula $d_{k}$
\end_inset

 and 
\begin_inset Formula $\widetilde{d_{k}}$
\end_inset

 by 
\begin_inset Formula $\overbrace{d_{k}}$
\end_inset

 in the relevant sections.
 The observation counts are copied over.
 The process is recursive, repeating for each pair of words judged sufficiently
 similar.
 Alternately, one might defer the creation of 
\begin_inset Formula $g$
\end_inset

 until one has walked enough of the network to determine general similarity.
\end_layout

\begin_layout Subsection*
Merging Words to form Word Classes
\end_layout

\begin_layout Standard
After the grammatical behavior of two words is considered to be similar,
 how should a merged word-class be created? How should the merger be performed?
 There are several different ways in which words can be merged together
 to form word classes.
 These are reviewed below.
\end_layout

\begin_layout Standard
Merging is not straightforward, because the process needs to result in
 an orthogonalization of the space for grammatical behavior.
 That is, the disjunct counts on any given word might be partly representative
 of its behavior as one of many different kinds of fine-grained parts of
 speech.
 The goal is therefore to take the disjuncts on any given word, separate
 them into two classes, and merge one class into an existing (or new) grammatical
 class, while leaving the rest as-is; the remainder might subsequently be
 re-organized into some other class, ad infinitum.
\end_layout

\begin_layout Subsubsection*
Linear merging
\end_layout

\begin_layout Standard
Linear merging treats words as vectors, computing their sum to define the
 new merged class, and then computing the perpendicular components as the
 left-over, un-accounted-for remainders.
 More precisely, the disjuncts are considered to be the basis elements of
 the vector space, and the count (or frequency) of each disjunct defines
 the vector.
\end_layout

\begin_layout Standard
Consider merging two words or word-classes 
\begin_inset Formula $w_{a}$
\end_inset

 and 
\begin_inset Formula $w_{b}$
\end_inset

 (that is, each of 
\begin_inset Formula $w_{a}$
\end_inset

 and 
\begin_inset Formula $w_{b}$
\end_inset

 can be either a word, or a word-class).
 Let 
\begin_inset Formula $M(w,d)$
\end_inset

 be a number associated with the word-disjunct pair 
\begin_inset Formula $(w,d)$
\end_inset

.
 Typically it will be the count 
\begin_inset Formula $N(w,d)$
\end_inset

 that the pair was observed (or equivalently, the normalized frequency 
\begin_inset Formula $p(w,d)=N(w,d)/N(*,*)$
\end_inset

).
 The corresponding vector is then
\begin_inset Formula 
\[
\vec{w}=\sum_{d}M(w,d)\;\widehat{d}
\]

\end_inset

where 
\begin_inset Formula $\widehat{d}$
\end_inset

 is the basis element.
 The merged word-class can then be defined as 
\begin_inset Formula $\vec{w}_{c}=\vec{w}_{a}+\vec{w}_{b}$
\end_inset

.
 
\end_layout

\begin_layout Subsubsection*
Erasure vs.
 orthogonal replacement
\end_layout

\begin_layout Standard
After the merger, there are two alternatives for what to do with 
\begin_inset Formula $\vec{w}_{a}$
\end_inset

 and 
\begin_inset Formula $\vec{w}_{b}$
\end_inset

 in the dataset.
 One alternative is to remove both 
\begin_inset Formula $\vec{w}_{a}$
\end_inset

 and 
\begin_inset Formula $\vec{w}_{b}$
\end_inset

 entirely.
 The other alternative is to compute the components of 
\begin_inset Formula $\vec{w}_{a}$
\end_inset

 and 
\begin_inset Formula $\vec{w}_{b}$
\end_inset

 that are orthogonal to 
\begin_inset Formula $\vec{w}_{c}$
\end_inset

, and replace 
\begin_inset Formula $\vec{w}_{a}$
\end_inset

 and 
\begin_inset Formula $\vec{w}_{b}$
\end_inset

 by these orthogonal components.
 That is, given a vector 
\begin_inset Formula $\vec{v}$
\end_inset

 (which might be 
\begin_inset Formula $\vec{w}_{a}$
\end_inset

 or 
\begin_inset Formula $\vec{w}_{b}$
\end_inset

), compute 
\begin_inset Formula 
\[
\vec{v}_{\bot}=\vec{v}-\widehat{w}_{c}\left(\widehat{w}_{c}\cdot\vec{v}\right)
\]

\end_inset

 where 
\begin_inset Formula $\widehat{w}_{c}=\vec{w}_{c}/\left|\vec{w}_{c}\right|$
\end_inset

 is the normalized unit vector pointing in the 
\begin_inset Formula $\vec{w}_{c}$
\end_inset

 direction.
 
\end_layout

\begin_layout Standard
The goal of maintaining the orthogonal components is that perhaps 
\begin_inset Formula $\vec{w}_{a}$
\end_inset

 and 
\begin_inset Formula $\vec{w}_{b}$
\end_inset

 have admixtures of other grammatical categories in them; what these are
 cannot be known a-priori.
 The discard option effectively discards these admixtures, hiding them from
 later iterations.
 This kind of hiding/data-destruction seems undesirable.
\end_layout

\begin_layout Standard
The orthogonal component potentially has negative coefficients appearing
 in it; it seems that these must be zeroed out to preserve 
\begin_inset Quotes eld
\end_inset

physicality
\begin_inset Quotes erd
\end_inset

.
\end_layout
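
\begin_layout Standard
As a small worked illustration of the projection formula (the counts here
 are invented for the sake of the sketch, not taken from any dataset), suppose
 there are only two disjuncts, with basis elements 
\begin_inset Formula $\widehat{d}_{1}$
\end_inset

 and 
\begin_inset Formula $\widehat{d}_{2}$
\end_inset

, and that
\begin_inset Formula 
\[
\vec{w}_{a}=3\widehat{d}_{1}+\widehat{d}_{2},\qquad\vec{w}_{b}=\widehat{d}_{1}+3\widehat{d}_{2},\qquad\vec{w}_{c}=\vec{w}_{a}+\vec{w}_{b}=4\widehat{d}_{1}+4\widehat{d}_{2}
\]

\end_inset

Then 
\begin_inset Formula $\widehat{w}_{c}=(\widehat{d}_{1}+\widehat{d}_{2})/\sqrt{2}$
\end_inset

 and 
\begin_inset Formula $\widehat{w}_{c}\cdot\vec{w}_{a}=\widehat{w}_{c}\cdot\vec{w}_{b}=2\sqrt{2}$
\end_inset

, so that the orthogonal components are
\begin_inset Formula 
\[
\vec{w}_{a\bot}=\vec{w}_{a}-2\sqrt{2}\,\widehat{w}_{c}=\widehat{d}_{1}-\widehat{d}_{2},\qquad\vec{w}_{b\bot}=\vec{w}_{b}-2\sqrt{2}\,\widehat{w}_{c}=-\widehat{d}_{1}+\widehat{d}_{2}
\]

\end_inset

Each remainder has one negative coefficient.
 Zeroing these out leaves 
\begin_inset Formula $\widehat{d}_{1}$
\end_inset

 as the remainder for 
\begin_inset Formula $\vec{w}_{a}$
\end_inset

 and 
\begin_inset Formula $\widehat{d}_{2}$
\end_inset

 as the remainder for 
\begin_inset Formula $\vec{w}_{b}$
\end_inset

.
 Note that, in this toy case at least, the clipped remainders are no longer
 exactly orthogonal to 
\begin_inset Formula $\vec{w}_{c}$
\end_inset

.
\end_layout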

\begin_layout Subsubsection*
Linear overlap
\end_layout

\begin_layout Standard
This would work like linear merging, described above, except that the intersecti
on of the sets of disjuncts on these two words is computed first, and the
 vector basis is taken only over this intersected set.
 The intersection is presumably substantial, if the two words are grammatically
 similar.
 For the replacement step, one has three alternatives: total discard, which
 seems inappropriate (as the non-intersected disjuncts get discarded); partial
 discard, which discards only the intersected components; and orthogonal
 replacement.
\end_layout

\begin_layout Subsection*
Noise
\end_layout

\begin_layout Standard
If an event is normally distributed, then we can characterize the relative uncertainty
 as being 
\begin_inset Formula $1/\sqrt{N}$
\end_inset

 after 
\begin_inset Formula $N$
\end_inset

 observations.
 We don't actually know if our observations are normally distributed.
 It's not even clear how to obtain the distribution.
 But let's assume they are.
 Then, given that 
\begin_inset Formula $p(x,y)=N(x,y)/N(*,*)$
\end_inset

 and estimating the noise to go as 
\begin_inset Formula $\sqrt{N(x,y)}$
\end_inset

 we get that the error in the frequentist probability estimate is given
 by
\begin_inset Formula 
\[
\frac{N(x,y)\pm\sqrt{N(x,y)}}{N(*,*)}=p(x,y)\pm\sqrt{\frac{p(x,y)}{N(*,*)}}
\]

\end_inset

where only the pair observations 
\begin_inset Formula $N(x,y)$
\end_inset

 are considered to be noisy and the value of 
\begin_inset Formula $N(*,*)$
\end_inset

 is held fixed (the natural variation in it is ignored).
\end_layout

\begin_layout Standard
The error in the frequentist estimate of the entropy is then
\begin_inset Formula 
\[
-\log_{2}\left[p(x,y)\pm\sqrt{\frac{p(x,y)}{N(*,*)}}\right]=-\left[\log_{2}p(x,y)\right]\pm\frac{1}{\log2}\sqrt{\frac{1}{p(x,y)N(*,*)}}
\]

\end_inset

where the estimate 
\begin_inset Formula $\log(1+\epsilon)\approx\epsilon$
\end_inset

 is used for small 
\begin_inset Formula $\epsilon$
\end_inset

.
 Summing to estimate the total entropy, one gets
\end_layout

\begin_layout Standard
\begin_inset Formula 
\[
H\pm\Delta H=-\sum_{x,y}\left[p(x,y)\pm\sqrt{\frac{p(x,y)}{N(*,*)}}\right]\log_{2}\left[p(x,y)\pm\sqrt{\frac{p(x,y)}{N(*,*)}}\right]
\]

\end_inset

which expands out to
\begin_inset Formula 
\[
\Delta H=\frac{-1}{\sqrt{N(*,*)}}\sum_{x,y}\sqrt{p(x,y)}\left(\frac{1}{\log2}+\log_{2}p(x,y)\right)
\]

\end_inset


\end_layout
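
\begin_layout Standard
The expansion can be checked term by term.
 As a sketch, write 
\begin_inset Formula $\delta=\sqrt{p/N(*,*)}$
\end_inset

 as a shorthand (introduced here only for this check) for the uncertainty
 in 
\begin_inset Formula $p=p(x,y)$
\end_inset

, take 
\begin_inset Formula $\log$
\end_inset

 to be the natural logarithm, and keep only terms of first order in 
\begin_inset Formula $\delta$
\end_inset

:
\begin_inset Formula 
\[
-\left(p\pm\delta\right)\log_{2}\left(p\pm\delta\right)\approx-\left(p\pm\delta\right)\left(\log_{2}p\pm\frac{\delta}{p\log2}\right)\approx-p\log_{2}p\mp\delta\left(\frac{1}{\log2}+\log_{2}p\right)
\]

\end_inset

Summing over all pairs, the first term gives 
\begin_inset Formula $H$
\end_inset

 and the second gives 
\begin_inset Formula $\Delta H$
\end_inset

.
 Since 
\begin_inset Formula $\log_{2}p<-1/\log2$
\end_inset

 whenever 
\begin_inset Formula $p<1/e$
\end_inset

, each term in the sum is negative, and the leading minus sign makes 
\begin_inset Formula $\Delta H$
\end_inset

 positive.
 The per-pair errors are added linearly here, rather than in quadrature,
 so this is presumably closer to a worst-case estimate than to a standard
 deviation.
\end_layout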

\begin_layout Standard
What might these values be, in practice? As a worked example, consider the
 word-pair (big, deal) in the 
\noun on
en_pairs_rthree
\noun default
 dataset.
 It is observed 1039 times, out of 638845863 pair observations (639M) total.
 Plugging and chugging, one gets 
\begin_inset Formula $H=-\log_{2}p(\mbox{big},\mbox{deal})=19.23$
\end_inset

 and 
\begin_inset Formula $\Delta H=0.552$
\end_inset

 which seems to be eminently reasonable.
 The value of 
\begin_inset Formula $\Delta H$
\end_inset

 depends only on 
\begin_inset Formula $N(x,y)$
\end_inset

 and 
\begin_inset Formula $N(*,*)$
\end_inset

 and is graphed below, for fixed 
\begin_inset Formula $N(*,*)=639M$
\end_inset

.
 It does not take very many observations to drive the uncertainty to a fairly
 small value.
\begin_inset Separator latexpar
\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Graphics
	filename images/error.eps
	width 60col%

\end_inset


\end_layout
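
\begin_layout Standard
The quoted value for (big, deal) can be reproduced as follows; this is a
 reading inferred from the arithmetic, not something stated explicitly above.
 Because 
\begin_inset Formula $\sqrt{p(x,y)/N(*,*)}=p(x,y)/\sqrt{N(x,y)}$
\end_inset

, a single summand of 
\begin_inset Formula $\Delta H$
\end_inset

 is 
\begin_inset Formula $p(x,y)$
\end_inset

 times the factor
\begin_inset Formula 
\[
\frac{1}{\sqrt{N(x,y)}}\left|\frac{1}{\log2}+\log_{2}p(x,y)\right|\approx\frac{\left|1.4427-19.23\right|}{\sqrt{1039}}\approx\frac{17.79}{32.23}\approx0.552
\]

\end_inset

taking 
\begin_inset Formula $\log$
\end_inset

 to be the natural logarithm.
 On this reading, 0.552 is the uncertainty per unit of probability for this
 pair, and it does depend only on 
\begin_inset Formula $N(x,y)$
\end_inset

 and 
\begin_inset Formula $N(*,*)$
\end_inset

.
\end_layout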


\begin_layout Subsection*
Error correction
\end_layout

\begin_layout Standard
If a classification error is made early on, can it be spotted, and later
 corrected? What is the mechanism, and how might it work?
\end_layout

\begin_layout Subsection*
Worked example
\end_layout

\begin_layout Standard
Both the cosine similarity and the overlap similarity suggested that the
 words 
\begin_inset Quotes eld
\end_inset

black
\begin_inset Quotes erd
\end_inset

 and 
\begin_inset Quotes eld
\end_inset

white
\begin_inset Quotes erd
\end_inset

 are similar.
 What happens when these are grouped together? We not only consider throwing
 both words into the same bag, but we then may want to consider what happens
 to other disjuncts that connect to other words.
\end_layout

\begin_layout Standard
Also, when clustering, should we create a single common category 
\begin_inset Quotes eld
\end_inset

bw
\begin_inset Quotes erd
\end_inset

 holding both words, or should we create three categories: bw, white-prime
 and black-prime, where bw just has the common disjuncts, and white-prime
 and black-prime are what's left after taking differences?
\end_layout

\begin_layout Section*
English wordpair small dataset July 2017
\end_layout

\begin_layout Standard
This section moved to `word-pairs-redux.lyx` in July 2019 so that all word-pair
 stuff is in the same place.
\end_layout

\begin_layout Section*
Chinese character pair dataset July 2017
\end_layout

\begin_layout Standard
This section moved to `word-pairs-redux.lyx` in July 2019 so that all word-pair
 stuff is in the same place.
\end_layout

\begin_layout Section*
Idioms and word boundary detection
\end_layout

\begin_layout Standard
Higher-level structures in language are important.
 By 
\begin_inset Quotes eld
\end_inset

higher level
\begin_inset Quotes erd
\end_inset

 I mean both the problem of detecting idioms in English, and segmenting
 words in Chinese.
 I claim that applying traditional algorithms to sheaves is sufficient to
 get good results.
 
\end_layout

\begin_layout Standard
In English, one is interested in discovering idioms, entity names, set phrases
 and institutional phrases: one is looking for a sequence
 of neighboring 
\begin_inset Quotes eld
\end_inset

words
\begin_inset Quotes erd
\end_inset

 that commonly occur together.
 Examples include: 
\begin_inset Quotes eld
\end_inset

Sun Trust Bank
\begin_inset Quotes erd
\end_inset

 (an entity name), 
\begin_inset Quotes eld
\end_inset

gone fishin
\begin_inset Quotes erd
\end_inset

, 
\begin_inset Quotes eld
\end_inset

out to lunch
\begin_inset Quotes erd
\end_inset

 (set phrases), 
\begin_inset Quotes eld
\end_inset

blessing in disguise
\begin_inset Quotes erd
\end_inset

, 
\begin_inset Quotes eld
\end_inset

dime a dozen
\begin_inset Quotes erd
\end_inset

 (idioms).
 The words are not necessarily sequential: there are set circumpositions:
 
\begin_inset Quotes eld
\end_inset

if...
 then...
\begin_inset Quotes erd
\end_inset

, 
\begin_inset Quotes eld
\end_inset

first...
 second...
\begin_inset Quotes erd
\end_inset

, 
\begin_inset Quotes eld
\end_inset

not only ...
 but also ...
\begin_inset Quotes erd
\end_inset


\end_layout

\begin_layout Standard
The Chinese word-segmentation problem is discerning when two hanzi characters
 belong to the same word, or not.
 It is similar to the problem of discerning idioms in English.
\end_layout

\begin_layout Subsubsection*
What is a word?
\end_layout

\begin_layout Standard
As background knowledge, there are multiple definitions of a word.
 Jerome Packard, in 
\begin_inset Quotes eld
\end_inset

The Morphology of Chinese: A Linguistic and Cognitive Approach
\begin_inset Quotes erd
\end_inset

 (Cambridge University Press, 2000), lists the following:
\end_layout

\begin_layout Itemize
Orthographic word
\end_layout

\begin_layout Itemize
Sociological word 
\end_layout

\begin_layout Itemize
Lexical word 
\end_layout

\begin_layout Itemize
Semantic word
\end_layout

\begin_layout Itemize
Phonological word 
\end_layout

\begin_layout Itemize
Morphological word
\end_layout

\begin_layout Itemize
Syntactic word
\end_layout

\begin_layout Itemize
Psycholinguistic word
\end_layout

\begin_layout Subsubsection*
Sheaf structures are important
\end_layout

\begin_layout Standard
The proposal being advanced here is that the general sheaf-theoretic techniques
 can be used to discover all of these structures.
\end_layout

\begin_layout Standard
The simplest case would seem to be word-boundary detection in Chinese.
 Here, a word might be one, two or three hanzi characters in a
 row.
 It seems that basic MST techniques should be enough to discover these.
 So, for example, given a hanzi sequence A B C D E, if the MST parse provides
 a link B-C and C-D but no link A-B and no link D-E, then the sequence BCD
 is a candidate for being identified as a word.
 However, observing this once is not statistics: only if the sequence BCD
 is observed many times, can one consider it to be a word.
 
\end_layout

\begin_layout Standard
More complex structures can be found using sheaf-theoretic techniques.
 The example below is taken from the existing Link-Grammar lexis to illustrate
 a search for circumpositions.
 Consider the sentence 
\begin_inset Quotes eld
\end_inset

I will do it if you say so
\begin_inset Quotes erd
\end_inset

.
 It has the parse:
\begin_inset listings
inline false
status open

\begin_layout Plain Layout

    +------->WV------->+--MVs-+---CV->+
\end_layout

\begin_layout Plain Layout

    +--Wd--+-Sp*i+--I--+Osm+  +Cs+-Sp-+--O-+
\end_layout

\begin_layout Plain Layout

    |      |     |     |   |  |  |    |    |
\end_layout

\begin_layout Plain Layout

LEFT-WALL I.p will.v do.v it if you say.v so 
\end_layout

\end_inset

Inside of this, there is a single 
\begin_inset Quotes eld
\end_inset

germ
\begin_inset Quotes erd
\end_inset

, 
\begin_inset Quotes eld
\end_inset

gerbe
\begin_inset Quotes erd
\end_inset

 or 
\begin_inset Quotes eld
\end_inset

disjunct
\begin_inset Quotes erd
\end_inset

 located at the word 
\begin_inset Quotes eld
\end_inset

if
\begin_inset Quotes erd
\end_inset

: 
\begin_inset listings
inline false
status open

\begin_layout Plain Layout

         +Cs+
\end_layout

\begin_layout Plain Layout

         |  |
\end_layout

\begin_layout Plain Layout

        if  ? 
\end_layout

\end_inset

Extending out from this are numerous 
\begin_inset Quotes eld
\end_inset

sections
\begin_inset Quotes erd
\end_inset

 or 
\begin_inset Quotes eld
\end_inset

partial linkages
\begin_inset Quotes erd
\end_inset

.
 One of these is
\end_layout

\begin_layout Standard
\begin_inset listings
inline false
status open

\begin_layout Plain Layout

   +--MVs-+---CV->+
\end_layout

\begin_layout Plain Layout

   |      +Cs+-Sp-+--O--+
\end_layout

\begin_layout Plain Layout

   |      |  |    |     |
\end_layout

\begin_layout Plain Layout

   ?     if  ?  say     ? 
\end_layout

\end_inset

The above structure might occur in many other sentences, and not just in
 this sentence.
 One can keep an eye out for this structure.
 If it occurs more often than usual, one can deduce that it is some set
 phrase or idiom.
 This particular example is not a set phrase in English, but it does illustrate
 how one can describe structure, and search for it, in a way that is more
 sophisticated than using n-grams.
\end_layout

\begin_layout Section*
Word boundaries - Chinese
\end_layout

\begin_layout Standard
So ...
 how does one find word-boundaries in Chinese? The basic idea is to count
 the frequency of patterns such as the below, where BCD are sequentially
 linked, and there are no links AB or DE.
 There may be additional links from the triple BCD going elsewhere, but
 not to neighboring words.
 Ideally, those links attach to just one morpheme.
 
\end_layout

\begin_layout Standard
\begin_inset listings
inline false
status open

\begin_layout Plain Layout

 ..--+         +-...
  +--...
\end_layout

\begin_layout Plain Layout

     |   +--+--+      |
\end_layout

\begin_layout Plain Layout

     |   |  |  |      |
\end_layout

\begin_layout Plain Layout

     A   B  C  D      E 
\end_layout

\end_inset


\end_layout

\begin_layout Standard
If this were a European language, we would expect any extra links to attach
 to the last morpheme; this is due to the morphology of Indo-European, where
 the semantic (meaning-carrying) stem is always to the left of the grammatically
 active suffixes.
 Note that Japanese, although it has a minimal morphology, is also similarly
 structured; i.e.
 the suffix carries the syntactic structure.
 With Chinese, this is less obviously the case, and ideally, the correct
 attachment will be discovered.
\end_layout

\begin_layout Section*
Meaning
\end_layout

\begin_layout Standard
So here's one approach to meaning.
 It is already clear that disjuncts are correlated with meaning, so one
 provisional approach might be to assign each disjunct a unique meaning.
 Alternately, this can be used as a doorway to the intentional meaning of
 a word.
 
\end_layout

\begin_layout Standard
Consider the phrases 
\begin_inset Quotes eld
\end_inset

the big balloon
\begin_inset Quotes erd
\end_inset

, 
\begin_inset Quotes eld
\end_inset

the red balloon
\begin_inset Quotes erd
\end_inset

, 
\begin_inset Quotes eld
\end_inset

the small balloon
\begin_inset Quotes erd
\end_inset

...
 The pseudo-disjuncts on balloon in these three cases would be 
\begin_inset Quotes eld
\end_inset

the- big-
\begin_inset Quotes erd
\end_inset

, 
\begin_inset Quotes eld
\end_inset

the- red-
\begin_inset Quotes erd
\end_inset

 and 
\begin_inset Quotes eld
\end_inset

the- small-
\begin_inset Quotes erd
\end_inset

 (plus an additional connector to the verb).
 Examining this connector-by-connector, we expect the MI for the word
 pair (the, balloon) to be small, while the MI for the word-pairs (big,
 balloon), (red, balloon) and (small, balloon) to be large(r).
 It's thus tempting to identify the set {big, red, small} as the set of intention
al attributes associated with 
\begin_inset Quotes eld
\end_inset

balloon
\begin_inset Quotes erd
\end_inset

.
 The strength of the MI values to each of the connectors might be taken
 as a judgement of how much that attribute is prototypical of the object
 (see other section on 
\begin_inset Quotes eld
\end_inset

prototype theory
\begin_inset Quotes erd
\end_inset

).
\end_layout

\begin_layout Standard
The disjuncts associated with 
\begin_inset Quotes eld
\end_inset

balloon
\begin_inset Quotes erd
\end_inset

 will also connect to a verb.
 These verb connectors may be taken as another set of intentional attributes,
 for example {floats, drifts, rose, popped}.
 It should be possible to distinguish these as an orthogonal set of attributes,
 in that one might observe 
\begin_inset Quotes eld
\end_inset

the- red- floats+
\begin_inset Quotes erd
\end_inset

 and 
\begin_inset Quotes eld
\end_inset

the- red- drifts+
\begin_inset Quotes erd
\end_inset

 but never observe 
\begin_inset Quotes eld
\end_inset

floats- drifts+
\begin_inset Quotes erd
\end_inset

.
\end_layout

\begin_layout Standard
Meaning bibliography: 
\end_layout

\begin_layout Itemize
\begin_inset Quotes eld
\end_inset

The Molecular Level of Lexical Semantics
\begin_inset Quotes erd
\end_inset

, EA Nida, (1997) International Journal of Lexicography, 10(4): 265–274.
 https://www.academia.edu/36534355/The_Molecular_Level_of_Lexical_Semantics_by_EA_
Nida 
\begin_inset CommandInset citation
LatexCommand cite
key "Nida97"
literal "true"

\end_inset


\end_layout

\begin_layout Itemize
Zellig Harris.
 
\begin_inset Quotes eld
\end_inset

Distributional structure.
\begin_inset Quotes erd
\end_inset

 Word, 10(2–3):146–162, 1954.
 https://www.tandfonline.com/doi/pdf/10.1080/00437956.1954.11659520 
\end_layout

\begin_layout Itemize
John R.
 Firth (1935) 
\begin_inset Quotes eld
\end_inset

The technique of semantics.
\begin_inset Quotes erd
\end_inset

 Transactions of the Philological Society 34(1).
 36–73.
 Quote: 
\begin_inset Quotes eld
\end_inset

the complete meaning of a word is always contextual, and no study of meaning
 apart from context can be taken seriously.
\begin_inset Quotes erd
\end_inset

\end_layout

\begin_layout Itemize
John R.
 Firth (1957) A synopsis of linguistic theory 1930–1955.
 In Studies in linguistic analysis, 1–32.
 Oxford: Blackwell.
 Quote: 
\begin_inset Quotes eld
\end_inset

You shall know a word by the company it keeps
\begin_inset Quotes erd
\end_inset


\end_layout

\begin_layout Section*
Meaning Redux
\end_layout

\begin_layout Standard
(9 June 2018) I keep explaining, over and over, why K-means and SVD cannot
 be used.
 Here's a snapshot from a recent email, explaining it again:
\end_layout

\begin_layout Subsection*
Why SVM and K-means don't work
\end_layout

\begin_layout Standard
Here's WHY both SVM and K-means are fundamentally wrong, and are total failures
 for this particular task.
 Let's start with K-means.
 One minor issue with K-means is that you have to pick K in advance, but
 you don't know what K is.
 But whatever, that's not important.
 It's OK to guess that K=100 or so.
 The big problem is that K-means then takes the 
\emph on
MEAN
\emph default
 (the 
\emph on
AVERAGE
\emph default
) of the vectors.
 That's what the word "mean" in "K-means" means.
 But we already know, a priori, that taking the average is wrong -- it wipes
 out, erases the different word senses.
\end_layout

\begin_layout Standard
For example: the word-token "saw" is going to have a vector that contains
 disjuncts for all of "cutting tool", "the verb cut" and "the past tense of to
 see".
 With K-means, this word token can only be assigned to just one cluster:
 it will be the cluster for nouns, or the cluster for past tenses, or the
 cluster for cutting-manipulation-actions.
 No matter which cluster it's assigned to, when the average/sum of the vector
 is merged into the cluster, the wrong disjuncts will be averaged in as
 well.
 
\end_layout

\begin_layout Standard
So, for example: let's assume k-means places "saw" into the "nouns cluster".
 After averaging, the noun cluster will now contain disjuncts for both past-tens
e verbs, and also disjuncts for present-tense manipulation-verbs.
 Clearly, noun-clusters should not contain these.
 Two bad things happen: (a) the noun cluster is polluted with verb-vector
 components, and (b) the vector has not been factorized, and so "saw" cannot
 also be placed into other clusters as well.
 
\end_layout

\begin_layout Standard
Ergo -- K-means is fundamentally incorrect -- it cannot correctly cluster
 linguistic data! 
\end_layout

\begin_layout Standard
Let's write some formulas: let 
\begin_inset Formula $v$
\end_inset

 be a vector.
 The MST observation counts give us 
\begin_inset Formula $v_{saw}$
\end_inset

.
 We know that, a priori,
\end_layout

\begin_layout Standard
\begin_inset Formula 
\[
v_{saw}=v_{tool}+v_{past-tense}+v_{cutting}
\]

\end_inset

However, we do NOT know what these parts, 
\begin_inset Formula $v_{tool}$
\end_inset

, 
\begin_inset Formula $v_{past-tense}$
\end_inset

, 
\begin_inset Formula $v_{cutting}$
\end_inset

, are.
 We need to factorize them out.
 K-means erases them, lumps them all into one.
 It does not factorize.
\end_layout
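
\begin_layout Standard
To make the averaging step explicit, here is a sketch; the symbols 
\begin_inset Formula $\mu_{noun}$
\end_inset

 (the centroid of the noun cluster) and 
\begin_inset Formula $n$
\end_inset

 (the number of words already in it) are introduced here for illustration
 only.
 Adding 
\begin_inset Quotes eld
\end_inset

saw
\begin_inset Quotes erd
\end_inset

 to the cluster updates the centroid to
\begin_inset Formula 
\[
\mu_{noun}^{\prime}=\frac{n\,\mu_{noun}+v_{saw}}{n+1}=\frac{n\,\mu_{noun}+v_{tool}}{n+1}+\frac{v_{past-tense}+v_{cutting}}{n+1}
\]

\end_inset

The first term is the update one would want.
 The second term is the verb-like contamination of the noun cluster; and
 because 
\begin_inset Formula $v_{saw}$
\end_inset

 is consumed whole, nothing is left over to contribute 
\begin_inset Formula $v_{past-tense}$
\end_inset

 or 
\begin_inset Formula $v_{cutting}$
\end_inset

 to any other cluster.
\end_layout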

\begin_layout Standard
SVM is a little bit better, if you use it correctly.
 I am not convinced that you are using it correctly.
 So, for example, let's say we had only four words: 
\begin_inset Formula $v_{saw}$
\end_inset

, 
\begin_inset Formula $v_{look}$
\end_inset

, 
\begin_inset Formula $v_{heard}$
\end_inset

 and 
\begin_inset Formula $v_{tool}$
\end_inset

.
 Suppose that SVM was told to decompose into three dimensions, and that
 the three that were picked were mostly pointing along the direction of
 
\begin_inset Formula $v_{look}$
\end_inset

 and 
\begin_inset Formula $v_{heard}$
\end_inset

 and 
\begin_inset Formula $v_{tool}$
\end_inset

 -- these were the three principal components.
\end_layout

\begin_layout Standard
Where should 
\begin_inset Formula $v_{saw}$
\end_inset

 go? In traditional SVM, the three principal components are used to define
 three hyperplanes or classifiers, and so the single vector 
\begin_inset Formula $v_{saw}$
\end_inset

 would then be classified as being either "on the same side of the hyperplane
 as 
\begin_inset Formula $v_{look}$
\end_inset

, and thus a part of the 
\begin_inset Formula $v_{look}$
\end_inset

 singular value", or "on the same side of the hyperplane as 
\begin_inset Formula $v_{tool}$
\end_inset

, and thus a part of the 
\begin_inset Formula $v_{tool}$
\end_inset

 singular value", etc.
\end_layout

\begin_layout Standard
But this is again wrong.
 We want to factor or decompose 
\begin_inset Formula $v_{saw}$
\end_inset

 along these three different principal components, thereby automatically
 discovering that some of the disjuncts on 
\begin_inset Formula $v_{saw}$
\end_inset

 are tool-like, that others are verb-like, etc.
\end_layout

\begin_layout Standard

\emph on
THIS
\emph default
 is where word-sense disambiguation comes from.
 It is 
\emph on
NOT
\emph default
 done in some pre-cleaner, pre-disambiguator stage.
 It is done at the clustering stage.
\end_layout

\begin_layout Standard
But, as I hope is now clear, both SVM and K-means are fundamentally wrong
 approaches, because both ERASE word-sense information from the dataset!
\end_layout

\begin_layout Standard
Now, IF you are very careful, you might be able to modify SVM, and after
 finding principal components, go back for a second pass, and perform the
 factorization needed to extract the different word-senses.
 Maybe.
 I can see/guess at a way of doing this, but it's hard.
\end_layout

\begin_layout Subsection*
Dimensional reduction
\end_layout

\begin_layout Standard
There's a completely different issue - "dimensional reduction" - which must
 not be ignored; it's important, a big part of the task.
\end_layout

\begin_layout Standard
So: My large dataset has 24 million disjuncts in it -- that's the dimension
 of the vector space -- all vectors are 24M-dimensional.
 How do you perform dimensional reduction? Well, it's "easy" -- if two disjuncts
 have connectors that belong to the same word-class, replace them by one
 disjunct in that word-class.
 (The vector space is now (24M minus one)-dimensional.) Let's suppose that
 one of the disjuncts is
\end_layout

\begin_layout Standard

\family typewriter
bird: the- & saw-
\end_layout

\begin_layout Standard
\noindent
When doing the dimensional reduction, the saw- needs to be replaced by:
 ??? either 
\family typewriter
TOOL-
\family default
 or 
\family typewriter
PASTTENSE-
\family default
 or by 
\family typewriter
CUTTING-.
\family default
 Obviously, that disjunct was obtained from MST parses of children's-lit
 sentences like "John saw the bird.
 Susan saw the bird too.
 Mary saw it also".
 When you dimensionally reduce the saw- in this disjunct, which cluster
 do you assign it to?
\end_layout

\begin_layout Standard
Well, if the text has the sentences: "John knew the bird was there.
 John heard the bird", and if clustering determined that "knew" and "heard"
 belong to 
\family typewriter
PASTTENSE
\family default
 then the dim reduction is clear:
\end_layout

\begin_layout Standard

\family typewriter
bird: the- & PASTTENSE-
\end_layout

\begin_layout Standard
\noindent
\align left
and we know that the following is wrong:
\end_layout

\begin_layout Standard

\family typewriter
bird: the- & TOOL-
\end_layout

\begin_layout Standard
\noindent
The problem here is that both K-means and SVD are completely ignorant of
 the structure of the basis elements of the vector space.
 Both assume that the basis elements of the vector space are irreducible,
 atomic, indivisible.
 It's a natural assumption for some machine-learning tasks, but completely
 wrong for language-learning where we know, a priori, that the basis elements
 have structure.
 This is kind of a key idea from sheaves!
\end_layout
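
\begin_layout Standard
One way to write the reduction down, as a sketch only: let 
\begin_inset Formula $\pi$
\end_inset

 be a hypothetical map that rewrites each connector in a disjunct by the
 word-class of its word, and assume that the counts on disjuncts that become
 identical are simply added.
 Neither the symbol 
\begin_inset Formula $\pi$
\end_inset

 nor the additive rule is taken from the code; 
\begin_inset Formula $N(w,d)$
\end_inset

 is notation introduced here for the observation count of disjunct 
\begin_inset Formula $d$
\end_inset

 on word 
\begin_inset Formula $w$
\end_inset

.
 
\begin_inset Formula 
\[
\pi(\texttt{the-}\ \&\ \texttt{saw-})=\texttt{the-}\ \&\ \texttt{PASTTENSE-},\qquad N^{\prime}(w,d^{\prime})=\sum_{d\,:\,\pi(d)=d^{\prime}}N(w,d)
\]

\end_inset

The map acts on the connectors inside a basis element, which is exactly
 the structure that K-means and SVD cannot see.
\end_layout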

\begin_layout Section*
Merge Results 5 June 2018
\end_layout

\begin_layout Standard
After a very long hiatus, restart.
 All earlier merge data lost!? 
\end_layout

\begin_layout Standard
Here's a sample of automatically-discovered grammatical classes, using the
 '
\family typewriter
ortho-merge
\family default
' strategy from '
\family typewriter
gram-class.scm
\family default
'.
 I seem to have lost/corrupted a previous, larger dataset, so this was remade
 from scratch over the last few days.
 Source dataset is '
\family typewriter
en_pairs_cfive_class
\family default
'.
 The merge parameters are: cosine-similarity-accept cutoff = 0.65; union-merge-fra
ction = 0.3.
\end_layout
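
\begin_layout Standard
For reference, the cosine similarity behind the accept cutoff can be written
 out as below.
 This is the standard textbook definition, given as a sketch and not as
 a transcription of the code; the notation 
\begin_inset Formula $N(w,d)$
\end_inset

 for the observation count of disjunct 
\begin_inset Formula $d$
\end_inset

 on word 
\begin_inset Formula $w$
\end_inset

 is an assumption introduced here.
 
\begin_inset Formula 
\[
\cos(w_{a},w_{b})=\frac{\sum_{d}N(w_{a},d)\,N(w_{b},d)}{\sqrt{\sum_{d}N(w_{a},d)^{2}}\,\sqrt{\sum_{d}N(w_{b},d)^{2}}}
\]

\end_inset

With the parameters above, a merge of 
\begin_inset Formula $w_{a}$
\end_inset

 and 
\begin_inset Formula $w_{b}$
\end_inset

 would be proposed when this value reaches the cutoff of 0.65 (whether the
 comparison is strict is not stated here).
\end_layout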

\begin_layout Standard
This run took 48 hours (it's very far from done; this is a snapshot) and
 found 230 words that it could classify into 38 classes.
 (Seven more words got classified as I prepared this, so the counts may
 be off.) The table below lists them exhaustively.
 Note that some words appear in multiple classes: for example, 
\begin_inset Quotes eld
\end_inset

mother father
\begin_inset Quotes erd
\end_inset

.
 Some words are clearly mis-classified, but there are not many of those.
 Some classes are a bit confusing as to their content, but most seem very
 clear.
 The classes are clearly semantic in nature; for example, there are two
 distinct classes of prepositions.
 The semantics is entertainingly insightful: 
\begin_inset Quotes eld
\end_inset

voice mother hands heart head father mind face feet
\begin_inset Quotes erd
\end_inset

 are parts of oneself, with some unexpected members: 
\begin_inset Quotes eld
\end_inset

mother, father
\begin_inset Quotes erd
\end_inset

 are not normally considered to be body parts, but are, in some sense, deeply,
 
\begin_inset Quotes eld
\end_inset

parts of oneself
\begin_inset Quotes erd
\end_inset

.
 Similarly, 
\begin_inset Quotes eld
\end_inset

wife arm daughter friend mouth friends brother
\begin_inset Quotes erd
\end_inset

 are mostly relatives and relationships, yet 
\begin_inset Quotes eld
\end_inset

arm mouth
\begin_inset Quotes erd
\end_inset

 are not.
 Perhaps the arm and mouth have a mind of their own, functioning a bit independe
ntly from the true self?
\end_layout

\begin_layout Standard
I'll try to run this a few more days, and present a newer report.
 While reviving this old code, I realized that the classification algorithm
 being used here has multiple faults and is a bit crude.
 I'm writing a nicer algo right now.
 I don't really know how to compare the quality of the algos, at this point.
\end_layout

\begin_layout Standard
Meanwhile, you should be able to get similar results, by applying the code
 in '
\family typewriter
gram-class.scm
\family default
' to a dataset that contains disjuncts derived from MST parses.
\end_layout

\begin_layout Standard
\begin_inset VSpace defskip
\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Tabular
<lyxtabular version="3" rows="38" columns="3">
<features tabularvalignment="middle">
<column alignment="center" valignment="top" width="0pt">
<column alignment="left" valignment="top" width="80col%">
<column alignment="left" valignment="top" width="18col%">
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Size
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Members
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" bottomline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Comments
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
43
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
village city question subject sea town girl public land French English fire
 King war boy air morning words others poor best second world door book
 heart body case night room whole light country people house children last
 present ground water family first other
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Nouns, mostly
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
18
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\strikeout off
\uuline off
\uwave off
\noun off
\color none
for in from at on by of with all towards within near against under through
 over upon into 
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Prepositions
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
12
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
help hear keep leave take find get make see give say go 
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Personal verbs
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
12
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
fine word large moment certain small woman new good man great little
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Adjectives, mostly
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
10
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
fall action history character state position sense force knowledge pleasure
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
9
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
full nature part death power most some out one 
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
9
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
voice mother hands heart head father mind face feet
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Body-parts
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
8
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
till whether since because until where if when
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Time
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
7
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
will would might should may can must could
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Modal verbs
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
7
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
or but perhaps nor though And while
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Conjunctions
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
7
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
wife arm daughter friend mouth friends brother
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Relatives
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
6
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
feel believe myself am know think
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Beingness
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
6
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
rest end body name side power 
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
6
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
heard taken given already done seen
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Past perfect - action
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
5
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
really always still also now
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
5
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
year place same day way
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
5
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
kept held called made found
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Possessive verbs
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
4
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
son arms own eyes 
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
4
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
her me him us
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Anaphora
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
4
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
heard felt knew saw
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Simple past - action
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
3
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
making such like
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
3
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
our its their
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Possessives - plural
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
3
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
five three four
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
3
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
during between among
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Prepositions
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
once least
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
thus sometimes
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Deduction
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
therefore indeed
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Deduction
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
is was
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
to be - Singular
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
are were
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
to be - Plural
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
! ? 
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Sentence end
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
, ; 
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Punct
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
And The
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Sentence start 
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
they we
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Anaphora
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
cannot shall
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Modal verbs
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
mother father
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
sort number
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
France England
\end_layout

\end_inset
</cell>
<cell alignment="left" valignment="top" topline="true" bottomline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
</row>
</lyxtabular>

\end_inset


\end_layout

\begin_layout Standard
\begin_inset VSpace defskip
\end_inset

Here's a graph of the above distribution.
 It's on a log-log scale.
 It looks to be approximately Zipfian.
 That's no surprise.
 Total of 38 classes shown.
 
\begin_inset Note Note
status collapsed

\begin_layout Plain Layout
This and the remaining graphs built with 'word-classes.scm' prt-disjunct-distribu
tion, etc.
 
\end_layout

\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Graphics
	filename word-classes/class-words.eps
	width 80text%

\end_inset

 
\end_layout
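
\begin_layout Standard
Assuming the graph plots class size against rank (an assumption; the axes
 are not spelled out here), 
\begin_inset Quotes eld
\end_inset

Zipfian
\begin_inset Quotes erd
\end_inset

 means a power law, which appears as a straight line on log-log axes: 
\begin_inset Formula 
\[
n(r)\approx C\,r^{-s}\quad\Longleftrightarrow\quad\log n(r)\approx\log C-s\log r
\]

\end_inset

Here 
\begin_inset Formula $n(r)$
\end_inset

 is the size of the 
\begin_inset Formula $r$
\end_inset

-th largest class, and 
\begin_inset Formula $C$
\end_inset

 and 
\begin_inset Formula $s$
\end_inset

 are fit constants introduced only for illustration.
 The classic Zipf law corresponds to 
\begin_inset Formula $s$
\end_inset

 close to 1.
\end_layout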

\begin_layout Standard
Here's a graph of the number of disjuncts in each grammatical class.
 (There were 259 words classified into 42 classes when this was prepared.)
 The 
\begin_inset Quotes eld
\end_inset

number of disjuncts
\begin_inset Quotes erd
\end_inset

 is the same as the 
\begin_inset Quotes eld
\end_inset

support
\begin_inset Quotes erd
\end_inset

 or 
\begin_inset Formula $l_{0}$
\end_inset

 norm of the vector.
\end_layout

\begin_layout Standard
\align center
\begin_inset Graphics
	filename word-classes/class-dj-support.eps
	width 80text%

\end_inset


\end_layout

\begin_layout Standard
Continuing, below is the total number of observations of the disjuncts in
 each class.
 (There were 269 words classified into 44 classes when this was prepared.)
 The 
\begin_inset Quotes eld
\end_inset

count of disjuncts
\begin_inset Quotes erd
\end_inset

 is the same as the 
\begin_inset Quotes eld
\end_inset

count
\begin_inset Quotes erd
\end_inset

 or 
\begin_inset Quotes eld
\end_inset

Manhattan distance
\begin_inset Quotes erd
\end_inset

 or 
\begin_inset Formula $l_{1}$
\end_inset

 norm of the vector.
\end_layout

\begin_layout Standard
\align center
\begin_inset Graphics
	filename word-classes/class-dj-count.eps
	width 80text%

\end_inset


\end_layout

\begin_layout Standard
And again, below is the length of each vector, viz.
 the root-sum-square count of the number of observations of the disjuncts
 (called "RMS" here, though no mean is taken).
 (There were 269 words classified into 44 classes when this was prepared.)
 The 
\begin_inset Quotes eld
\end_inset

RMS count of disjuncts
\begin_inset Quotes erd
\end_inset

 is the same as the 
\begin_inset Quotes eld
\end_inset

length
\begin_inset Quotes erd
\end_inset

 or 
\begin_inset Formula $l_{2}$
\end_inset

 norm of the vector.
 The initial part of this graph is the most Zipfian so far, with a slope
 of almost exactly 1.0, as eyeballed.
\end_layout

\begin_layout Standard
\align center
\begin_inset Graphics
	filename word-classes/class-dj-length.eps
	width 80text%

\end_inset


\end_layout
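
\begin_layout Standard
For reference, the three norms used in the graphs above can be written
 side by side.
 This is a sketch using the standard definitions; 
\begin_inset Formula $N(v,d)$
\end_inset

 is notation introduced here for the observation count of disjunct 
\begin_inset Formula $d$
\end_inset

 in the vector 
\begin_inset Formula $v$
\end_inset

, and counts are assumed to be non-negative, so that no absolute values
 are needed.
 
\begin_inset Formula 
\[
l_{0}(v)=\left|\left\{ d\,:\,N(v,d)>0\right\} \right|,\qquad l_{1}(v)=\sum_{d}N(v,d),\qquad l_{2}(v)=\sqrt{\sum_{d}N(v,d)^{2}}
\]

\end_inset

These correspond to the support, the count and the length, respectively.
\end_layout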

\begin_layout Standard
Conclusion: Looks good, more processing needed; comparison of experiments
 needed.
\end_layout

\begin_layout Section*
Merge Experiments
\end_layout

\begin_layout Standard
Three different merge experiments are being run.
 These are reported below.
 The summary is here:
\end_layout

\begin_layout Description
Block-5x5 Same as above, pushed out farther.
 This is run on a copy of the 
\family typewriter
en_pairs_cfive_mst
\family default
 dataset (all five MST tranches) on the LXC container.
 Using a cutoff of 20 observations minimum per word, this contains 62607
 words to be classified.
 The classifier is the 
\family typewriter
merge-ortho
\family default
 classifier, using a minimum cosine of 0.65 to propose a merge, and a fixed
 fraction of 0.3 for the union-merge.
 The agglomeration algorithm is the block-diagonal algorithm.
 (This dataset has 40M sections, viz. about 80M atoms, viz. about 120GB to
 load in full.)
\end_layout

\begin_layout Description
Fuzz-5x2 This is run on a copy of the 
\family typewriter
en_pairs_cfive_mtwo
\family default
 dataset (only two MST tranches) and thus has fewer words: a total of 25505
 words with more than 20 observations.
 As above, this uses the 
\family typewriter
merge-ortho
\family default
 classifier, using a minimum cosine of 0.65 to propose a merge, and a fixed
 fraction of 0.3 for the union-merge.
 The agglomeration algorithm is the greedy algorithm.
 (This dataset has about 13M sections.) Run this as: `(gram-classify-greedy-fuzz
 0.65 0.3 20)`.
\end_layout

\begin_layout Description
Discrim-5x2 This is run on a copy of the 
\family typewriter
en_pairs_cfive_mtwo
\family default
 dataset, as above: a total of 25505 words with more than 20 observations.
 This uses the 
\family typewriter
merge-discrim
\family default
 classifier, which is like the 
\family typewriter
merge-ortho
\family default
 classifier, but uses a variable fraction for union-merge.
 Because the variable fraction should behave nicely, the minimum cosine
 is set to 0.50.
 The agglomeration algorithm is the greedy algorithm.
 Run this as: `(gram-classify-greedy-discrim 0.5 20)`.
\end_layout

\begin_layout Standard
Basically, the last two are directly comparable: they differ only in the
 merge strategy.
 The first two are harder to compare: they use different datasets and different
 agglomeration algos.
 All three are using the screwy 
\family typewriter
merge-ortho
\family default
 classifier, which is almost right, but alters counts in a somewhat screwy
 way.
 Thus, these experiments need to be repeated ...
 again.
\end_layout

\begin_layout Standard
The table below is a 
\begin_inset Quotes eld
\end_inset

progress report
\begin_inset Quotes erd
\end_inset

 on 
\series bold
Fuzz-5x2
\series default
, as it's being computed.
 Each row represents a snapshot at a different point in time during the computation.
 
\series bold
Fuzz-5x2
\series default
 crashed; the last row gives the stats at the time of the crash.
\end_layout

\begin_layout Standard
\begin_inset VSpace defskip
\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Tabular
<lyxtabular version="3" rows="10" columns="8">
<features tabularvalignment="middle">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
num-classes
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
num-words
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
doubles
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
singletons
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
dupes
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
dup-cls
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
uniq-dup
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
cpu
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
46
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
429
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
18
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
259
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
12
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
24
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
15
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
3400
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
56
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
482
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
21
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
487
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
14
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
28
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
17
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
6230
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
65
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
533
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
26
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
682
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
15
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
30
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
19
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
8800
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
70
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
581
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
28
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
915
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
18
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
36
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
21
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
11760
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
75
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
635
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
28
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1138
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
24
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
48
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
26
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
14760
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
81
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
670
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
31
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1301
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
25
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
50
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
28
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
17200
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
85
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
707
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
33
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1544
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
25
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
50
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
28
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
20720
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
90
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
822
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
33
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1846
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
29
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
58
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
31
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
25256
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
93
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
835
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
33
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1927
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
33
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
66
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
34
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
26624
\end_layout

\end_inset
</cell>
</row>
</lyxtabular>

\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Foot
status open

\begin_layout Plain Layout
Table entries obtained from `(prt-all-classes)` and `(prt-multi-members)`
\end_layout

\end_inset


\end_layout

\begin_layout Standard
\begin_inset VSpace defskip
\end_inset


\end_layout

\begin_layout Standard
The columns are as follows:
\end_layout

\begin_layout Description
num-classes Total number of grammatical categories (word classes).
\end_layout

\begin_layout Description
num-words Total number of words assigned to grammatical classes, with two
 or more words per class.
\end_layout

\begin_layout Description
doubles Total number of classes having exactly two words in them.
\end_layout

\begin_layout Description
singletons Total number of words that were examined but could not be assigned
 to any existing grammatical class.
 These can be thought of as classes that have only one member; they may
 eventually grow to more than one member.
\end_layout

\begin_layout Description
dupes Total number of words belonging to more than one class.
 These roughly correspond to words that have been found to have more than
 one syntactic form (i.e.
 more than one 
\begin_inset Quotes eld
\end_inset

meaning
\begin_inset Quotes erd
\end_inset

 - aka 
\begin_inset Quotes eld
\end_inset

word-sense disambiguation
\begin_inset Quotes erd
\end_inset

).
\end_layout

\begin_layout Description
dup-cls Total number of classes that the dupes belong to, counted with multiplicity.
 Thus, dup-cls / dupes = average number of 
\begin_inset Quotes eld
\end_inset

meanings
\begin_inset Quotes erd
\end_inset

 that a multi-meaning word has.
 Currently, this appears to be always 2 in the above dataset.
\end_layout

\begin_layout Description
uniq-dup Total number of unique classes that have multi-meaning words in
 them.
\end_layout

\begin_layout Description
cpu The CPU-minutes accumulated so far.
 This is ad hoc: it doesn't account for time spent in Postgres or for inefficient
 parallelism.
 It just provides a scale for forward progress.
\end_layout
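
\begin_layout Standard
As a worked check of the dup-cls ratio, using the first and last rows of
 the Fuzz-5x2 table above:
\begin_inset Formula 
\[
\frac{\text{dup-cls}}{\text{dupes}}=\frac{24}{12}=2\quad\text{(first row)},\qquad\frac{66}{33}=2\quad\text{(last row)}.
\]

\end_inset

The intermediate rows give the same ratio.
 Since a dupe belongs to at least two classes by definition, an average
 of exactly 2 means that every multi-class word found so far belongs to
 exactly two classes.
\end_layout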

\begin_layout Standard
Some examples of multi-category words: 
\end_layout

\begin_layout Description
what belongs to <that as when if what before where because until> and also
 <what how why whether> – The first class appears to correspond to propositional
 words, the second to question words.
\end_layout

\begin_layout Description
her belongs to <his her> and <her him me us> – possessives and object pronouns.
\end_layout

\begin_layout Description
with belongs to <of in with for on by from into upon over through under
 among> and also <with like such having> – prepositions and membership-property
 words.
\end_layout

\begin_layout Standard
Here's a progress report for 
\series bold
Discrim-5x2
\series default
.
 A quick look-see shows that this is lower-quality; the cosine=0.50 threshold
 seems to accept too much, mixing nouns and verbs, although it is better
 at placing given names into one category (for example)...
\end_layout
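
\begin_layout Standard
For reference, a sketch of the cosine being thresholded, assuming the usual
 definition over vectors of disjunct counts.
 The notation N(w,d), for the observed count of word w with disjunct d,
 is introduced here for illustration and is not taken from the code:
\begin_inset Formula 
\[
\cos(w_{a},w_{b})=\frac{\sum_{d}N(w_{a},d)\,N(w_{b},d)}{\sqrt{\sum_{d}N(w_{a},d)^{2}}\,\sqrt{\sum_{d}N(w_{b},d)^{2}}}
\]

\end_inset

Because the counts are non-negative, this lies between 0 and 1.
 Under this reading, a cutoff of 0.50 admits any pair whose count vectors
 are within 60 degrees of each other, which is consistent with the impression
 that it accepts too much.
\end_layout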

\begin_layout Standard
\begin_inset VSpace defskip
\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Tabular
<lyxtabular version="3" rows="9" columns="8">
<features tabularvalignment="middle">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
num-classes
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
num-words
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
doubles
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
singletons
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
dupes
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
dup-cls
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
uniq-dup
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
cpu
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
31
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
427
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
8
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
67
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
17
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
34
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
17
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2370
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
47
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
715
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
14
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
89
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
62
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
127
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
23
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
4130
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
63
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
946
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
16
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
169
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
87
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
182
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
30
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
6375
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
82
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1145
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
19
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
343
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
106
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
222
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
41
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
9700
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
96
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1268
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
24
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
573
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
125
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
260
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
44
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
13420
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
106
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1415
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
27
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
757
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
136
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
282
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
49
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
16400
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
121
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1518
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
32
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
994
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
142
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
294
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
54
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
20600
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
125
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1545
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
33
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1020
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
145
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
300
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
56
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
21149
\end_layout

\end_inset
</cell>
</row>
</lyxtabular>

\end_inset


\end_layout

\begin_layout Standard
\begin_inset VSpace defskip
\end_inset


\end_layout

\begin_layout Standard
It's clear that 
\series bold
Discrim-5x2
\series default
 is assigning more words into fewer classes than 
\series bold
Fuzz-5x2
\series default
 is.
 Particularly notable is the much, much smaller size of the 'doubles' classes
 and the 'singletons' classes.
 This crashed mysteriously after running for 21K seconds...
 `(compute-right-cosine (WordNode "”" #)
\end_layout

\begin_layout Standard
Below are distribution graphs for the word-classes obtained using the two
 different datasets, and three different classification schemes.
 The general similarity of the graphs is immediately apparent.
 One can conclude: 
\begin_inset Foot
status collapsed

\begin_layout Plain Layout
Graphs created with `word-classes/distribution.gplot` from stats generated
 with `word-classes.scm` 
\end_layout

\end_inset


\end_layout

\begin_layout Itemize
The different classification schemes all generate the same distribution of
 class sizes, and that distribution is very nearly purely Zipfian.
 (Upper-left graph)
\end_layout

\begin_layout Itemize
The distribution of disjuncts in the classes is bimodal, and the modality
 and inflection are the same for the 
\begin_inset Formula $l_{0}$
\end_inset

, 
\begin_inset Formula $l_{1}$
\end_inset

 and 
\begin_inset Formula $l_{2}$
\end_inset

-norms.
\end_layout

\begin_layout Itemize
The distribution of disjuncts is determined primarily by the dataset, and
 not by the classification algo.
 That is, Fuzz-5x2 and Discrim-5x2 are two different classifiers running
 on the same dataset; they are in many ways similar, and differ a bit
 from the larger dataset, Block-5x5.
\end_layout
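
\begin_layout Standard
To make the terms above concrete, here is a sketch under stated assumptions.
 
\begin_inset Quotes eld
\end_inset

Zipfian
\begin_inset Quotes erd
\end_inset

 is taken in the usual sense: ranking the classes by size, the size of the
 class at rank r falls off as a power law with exponent near 1.
 The norms are assumed to be the standard ones over the disjunct counts
 of a class, matching the support, count and length graphs.
 The symbols below are introduced here for illustration only:
\begin_inset Formula 
\[
n_{r}\propto r^{-s},\qquad s\approx1
\]

\end_inset


\begin_inset Formula 
\[
l_{0}(c)=\left|\{d:N(c,d)>0\}\right|,\qquad l_{1}(c)=\sum_{d}N(c,d),\qquad l_{2}(c)=\sqrt{\sum_{d}N(c,d)^{2}}
\]

\end_inset

Here N(c,d) stands for the count of disjunct d on class c.
 On a log-log plot of class size against rank, a Zipfian distribution appears
 as a straight line whose slope is the negative of the exponent.
\end_layout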

\begin_layout Standard
\align center
\begin_inset Graphics
	filename word-classes/words-in-class.eps
	width 45text%

\end_inset


\begin_inset Graphics
	filename word-classes/support-of-class.eps
	width 45text%

\end_inset


\end_layout

\begin_layout Standard
\align center
\begin_inset Graphics
	filename word-classes/count-of-class.eps
	width 45text%

\end_inset


\begin_inset Graphics
	filename word-classes/length-of-class.eps
	width 45text%

\end_inset


\end_layout

\begin_layout Standard
The overall lack of dramatic differences in the distributions is remarkable.
 Visual inspection of the classes indicates that they are all arriving at
 the same general and mostly-correct classification of words.
 It could be interesting to see how much these classifications differ; measuring
 this, however, is difficult, as they are in distinct datasets, and there's
 no infrastructure for that.
\end_layout

\begin_layout Subsection*
Quality evaluation
\end_layout

\begin_layout Standard
Evaluating the quality is hard.
 Quick looks suggest it's all going as planned...
 unclear how to be quantitative, except by tedious hand-scoring.
\end_layout

\begin_layout Standard
One possibility for an objective external score is to compare these word-classes
 to WordNet synsets.
 
\end_layout

\begin_layout Section*
Connector distribution from MST parses
\end_layout

\begin_layout Standard
(9 June 2018) This is kind-of a repeat of earlier work reported in
 `connector-sets-revised.lyx` but is (a) graphed differently and is (b) for a
 different dataset.
 It's actually a commentary on the quality of data coming out of MST.
 First graph: number of sections having N connectors.
 
\end_layout

\begin_layout Standard
\align center
\begin_inset Graphics
	filename word-classes/dj-size.eps
	width 60text%

\end_inset


\end_layout

\begin_layout Standard
It shows how many sections there are that have the indicated number of connector
s on them.
 That is, each section has one and only one disjunct in it.
 Each disjunct can have N connectors in it.
 So, fixing N, how many sections are there that have N connectors? For 
\begin_inset Quotes eld
\end_inset

real
\begin_inset Quotes erd
\end_inset

 linguistic data, we expect a much, much sharper falloff.
 Determiners (the, a, this, that ...) should have one connector.
 There are few of these, and that is what the chart above suggests.
 (This chart shows 30K such connectors: since this is pre-clustering, those
 30K correspond to some determiner, connecting to some specific word.)
\end_layout

\begin_layout Standard
Nouns should have 2 or 3 or 4: zero or one to a determiner, zero, one or
 two (maybe rarely three) to adjectives, one to a verb.
 Transitive verbs should have 3 or 4 connectors: one to the subject, one
 to the object, one to LEFT-WALL, zero or one to adverbs, particles,
 prepositions, etc.
 Thus, six or more connectors should be very, very rare.
 It's not.
 This suggests that the MST parser is not producing entirely believable
 data (but we knew that, already).
 
\end_layout

\begin_layout Standard
Perhaps the high-connector count disjuncts are observed only infrequently?
 The next graph shows the counts, weighted by the number of observations:
 i.e.
 how often that particular disjunct was observed.
\end_layout

\begin_layout Standard
\align center
\begin_inset Graphics
	filename word-classes/dj-wsize.eps
	width 60text%

\end_inset


\end_layout

\begin_layout Standard
Hmm.
 The good news: the observation counts for 1- and 2-connector disjuncts
 are much higher, with 3-connector disjuncts seen a lot less often.
 The number of 4-connector disjuncts remains dishearteningly high, and the
 fall-off is slower than before.
\end_layout

\begin_layout Standard
Both of the above graphs were generated by considering only the sections
 on word-clusters.
 This comes from the 
\series bold
Block-5x5
\series default
, when it had 700 words assigned to clusters.
 It's not clear if this pattern is true, in general, for all words.
 Note that, to obtain these 700 words, the top-most-frequent 1300 words
 were examined: so the 700 words are among the most frequent.
 Note that, due to the orthogonalization algorithm, not all of the counts
 are transferred from the words to the clusters; only some are.
\end_layout

\begin_layout Section*
Cross-connector stats
\end_layout

\begin_layout Standard
Ran the full cross-connector stats
\end_layout

\begin_layout Section*
Entropic-similarity
\end_layout

\begin_layout Standard
Some stats for entropic-similarity.
 There is good reason to think that the entropic similarity will provide
 a superior judgement of similarity, as compared to cosine similarity.
 The entropic similarity is defined as the MI between word-disjunct vectors.
 Thus, it is similar to the cosine similarity, in being founded on a vector
 dot-product, but has a different normalization in the denominator.
 It's defined below.
\end_layout

\begin_layout Standard
Consider `
\series bold
en_rfive_mtwo
\series default
`.
 There are: 
\end_layout

\begin_layout Itemize
Rows: 137078 – 
\emph on
viz
\emph default
 that many words.
 Viz 
\begin_inset Formula $\sum_{w}1=137078$
\end_inset


\end_layout

\begin_layout Itemize
Columns: 6239997 - 
\emph on
viz
\emph default
 that many unique disjuncts – i.e.
 atoms of the form 
\family typewriter
(Section (*) (dj))
\family default
 with * the wild-card, and 
\family typewriter
dj
\family default
 held fixed.
 Viz 
\begin_inset Formula $\sum_{d}1=6239997$
\end_inset


\end_layout

\begin_layout Itemize
Size: 8629163 - 
\emph on
viz
\emph default
 this many unique sections, viz explicit 
\family typewriter
(Section w dj)
\family default
 for fixed 
\family typewriter
(w,dj)
\family default
.
 
\emph on
Viz
\emph default
 
\begin_inset Formula $\sum_{w,d}\left[0<N(w,d)\right]=8629163$
\end_inset


\end_layout

\begin_layout Standard
Let 
\begin_inset Formula $N(w,d)$
\end_inset

 be the observation count of disjunct 
\begin_inset Formula $d$
\end_inset

 on word 
\begin_inset Formula $w$
\end_inset

.
 Then, for 
\series bold
en_rfive_mtwo
\series default
 we have:
\begin_inset Formula 
\[
N\left(*,*\right)=\sum_{w,d}N(w,d)=18489594.0
\]

\end_inset

viz 18.5M observations total, while
\begin_inset Formula 
\[
\sum_{u,w,d}N(u,d)N(w,d)=\sum_{d}N(*,d)N(*,d)=63598403588.0
\]

\end_inset

viz 63.6G.
 Note that
\begin_inset Formula 
\[
\sum_{d}\frac{N(*,d)}{N\left(*,*\right)}\frac{N(*,d)}{N\left(*,*\right)}=1.8603\times10^{-4}
\]

\end_inset

and 
\begin_inset Formula 
\[
-\log_{2}\sum_{d}\frac{N(*,d)}{N\left(*,*\right)}\frac{N(*,d)}{N\left(*,*\right)}=-\log_{2}\left(1.8603\times10^{-4}\right)=12.4\mbox{ bits}
\]

\end_inset

Does this quantity have a name? What is the name?
\end_layout
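
\begin_layout Standard
A candidate answer, offered as a standard information-theory fact and not as something established in these notes: writing 
\begin_inset Formula $q\left(d\right)=N(*,d)/N\left(*,*\right)$
\end_inset

 for the marginal probability of a disjunct (the symbol 
\begin_inset Formula $q$
\end_inset

 is introduced only for this remark), the quantity above has the form
\begin_inset Formula 
\[
H_{2}=-\log_{2}\sum_{d}q\left(d\right)^{2}
\]

\end_inset

which is the Rényi entropy of order 2, also called the collision entropy.
 The sum inside the logarithm is the probability that two independent draws from 
\begin_inset Formula $q\left(d\right)$
\end_inset

 yield the same disjunct.
\end_layout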

\begin_layout Standard
Define the dot-product (inner product) between words as
\begin_inset Formula 
\[
i\left(u,w\right)=\sum_{d}N(u,d)N(w,d)
\]

\end_inset

Normalize this into a bona-fide joint probability as
\begin_inset Formula 
\[
p(u,w)=i\left(u,w\right)/i\left(*,*\right)
\]

\end_inset

and define the entropic similarity as the MI between these two vectors:
\begin_inset Formula 
\[
MI\left(u,w\right)=\log_{2}\frac{p\left(u,w\right)}{p\left(u\right)p\left(w\right)}
\]

\end_inset

where the marginal probability is
\begin_inset Formula 
\[
p\left(u\right)=p\left(u,*\right)=\frac{i\left(u,*\right)}{i\left(*,*\right)}=\frac{1}{i\left(*,*\right)}\;\sum_{w,d}N(u,d)N(w,d)=\frac{1}{i\left(*,*\right)}\;\sum_{d}N(u,d)N(*,d)
\]

\end_inset

The total entropy from the marginal probability is 
\begin_inset Formula 
\[
H=-\sum_{w}p\left(w\right)\log_{2}p\left(w\right)
\]

\end_inset

and the total MI would be
\begin_inset Formula 
\[
MI=\sum_{u,w}p\left(u,w\right)\log_{2}\frac{p\left(u,w\right)}{p\left(u\right)p\left(w\right)}
\]

\end_inset


\end_layout

\begin_layout Standard
What's this like? Some pairs below.
 Note that 
\begin_inset Formula $-\log_{2}\left(1.8603\times10^{-4}\right)=12.3922\mbox{ bits}$
\end_inset

.
\end_layout
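
\begin_layout Standard
A quick arithmetic check on the first row of the table below (other, same) hints at how the tabulated columns are normalized; this is an inference from the printed numbers, not something stated in these notes.
 Combining the three tabulated logarithms gives
\begin_inset Formula 
\[
21.52+22.13-27.13=16.52
\]

\end_inset

which exceeds the tabulated MI of 4.1232 by about 12.40 bits, close to the 12.3922 bits noted above.
 That offset is what would result if the tabulated probabilities were normalized by 
\begin_inset Formula $N\left(*,*\right)^{2}$
\end_inset

 instead of 
\begin_inset Formula $i\left(*,*\right)$
\end_inset

.
 Writing 
\begin_inset Formula $\tilde{p}$
\end_inset

 for such a hypothetical normalization, with 
\begin_inset Formula $\tilde{p}\left(u,w\right)=i\left(u,w\right)/N\left(*,*\right)^{2}$
\end_inset

 and 
\begin_inset Formula $\tilde{p}\left(u\right)=i\left(u,*\right)/N\left(*,*\right)^{2}$
\end_inset

, one has
\begin_inset Formula 
\[
MI\left(u,w\right)=\log_{2}\frac{\tilde{p}\left(u,w\right)}{\tilde{p}\left(u\right)\tilde{p}\left(w\right)}+\log_{2}\frac{i\left(*,*\right)}{N\left(*,*\right)^{2}}\approx16.52-12.39=4.13
\]

\end_inset


\end_layout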

\begin_layout Standard
\align center
\begin_inset Tabular
<lyxtabular version="3" rows="15" columns="7">
<features tabularvalignment="middle">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $u$
\end_inset


\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $w$
\end_inset


\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $-\log_{2}p\left(u\right)$
\end_inset


\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $-\log_{2}p\left(w\right)$
\end_inset


\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $-\log_{2}p(u,w)$
\end_inset


\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $MI(u,w)$
\end_inset


\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $\cos\left(u,w\right)$
\end_inset


\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
other
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
same
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
21.52
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
22.13
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
27.13
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
4.1232
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.5866
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
nice
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
fine
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
26.94
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
25.21
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
35.24
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
4.5210
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.5256
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
him
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
me
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
20.81
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
21.34
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
25.50
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
4.2583
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.7842
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
men
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
women
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
23.69
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
25.79
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
33.24
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
3.8489
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.6077
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
up
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
down
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
22.46
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
23.31
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
27.87
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
5.5085
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.5630
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
found
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
called
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
23.62
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
24.57
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
32.09
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
3.7066
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.5576
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
came
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
went
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
24.19
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
24.72
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
31.32
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
5.1930
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.5900
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
eyes
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
hand
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
23.47
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
23.19
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
28.91
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
5.3589
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.7284
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
men
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
nice
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
38.71
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
-0.477
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.0233
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
men
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
went
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
36.08
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
-0.061
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.0249
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
nice
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
went
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
38.22
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
+1.0490
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.0353 
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
called
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
eyes
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
38.14
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
-2.499
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.0049
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
nice
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
eyes
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
40.21
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
-2.193
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.0038
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
nice
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
called
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
35.63
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
+3.4810
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.3794
\end_layout

\end_inset
</cell>
</row>
</lyxtabular>

\end_inset


\end_layout

\begin_layout Standard
Casual observation suggests that the MI and the cosine similarity are
 correlated; when one is high, so is the other.
 A detailed examination of this is in section 6 (approximately page 40) of
 the March 2019 version of `connector-sets-revised.pdf`.
\end_layout

\begin_layout Subsection*
Zipf's Law 
\end_layout

\begin_layout Standard
These are interesting:
\end_layout

\begin_layout Itemize
\begin_inset Quotes eld
\end_inset

Zipf’s law holds for phrases, not words
\begin_inset Quotes erd
\end_inset

, Jake Ryland Williams, Paul R.
 Lessard, Suma Desu, Eric M.
 Clark, James P.
 Bagrow, Christopher M.
 Danforth & Peter Sheridan Dodds, Scientific Reports volume 5, Article number:
 12209 (2015)
\end_layout

\begin_layout Standard
Based on my experience, it holds for both ...
 it holds for everything.
\end_layout

\begin_layout Section*
Grammatical Class Redux
\end_layout

\begin_layout Standard
The word-disjunct dataset, containing counts 
\begin_inset Formula $N\left(w,d\right)$
\end_inset

, is far too large to be used as a practical dictionary for parsing.
 It also fails to provide sufficient coverage to parse most sentences.
 Thus, it needs to be simultaneously compressed and generalized.
 This is partly achieved by clustering words 
\begin_inset Formula $w$
\end_inset

 into word-classes 
\begin_inset Formula $g$
\end_inset

.
 More precisely, this is a partitioning of the observation counts 
\begin_inset Formula $N\left(w,d\right)$
\end_inset

 into count-classes 
\begin_inset Formula $N\left(g,d\right)$
\end_inset

 where 
\begin_inset Formula $g$
\end_inset

 is a set of words that behave in a grammatically similar fashion.
\end_layout

\begin_layout Standard
One naive (but approximate and incorrect) way of doing this is to come up
 with a membership function – a membership matrix 
\begin_inset Formula $p\left(g|w\right)$
\end_inset

, behaving like a normalized conditional probability
\begin_inset Formula 
\[
\sum_{g}p\left(g|w\right)=1
\]

\end_inset

which says that every word 
\begin_inset Formula $w$
\end_inset

 belongs to one or more classes 
\begin_inset Formula $g$
\end_inset

.
 A partitioning of counts is then
\end_layout

\begin_layout Standard
\begin_inset Formula 
\[
N\left(g,d\right)=\sum_{w}p\left(g|w\right)N\left(w,d\right)
\]

\end_inset

The intended interpretation is that 
\begin_inset Formula $g$
\end_inset

 is a set of words that behave in a similar fashion, syntactically, so that
 
\begin_inset Formula $p\left(g|w\right)$
\end_inset

 is the conditional probability that a word 
\begin_inset Formula $w$
\end_inset

 belongs to 
\begin_inset Formula $g$
\end_inset

.
 Typically, 
\begin_inset Formula $w$
\end_inset

 will belong to no more than half-a-dozen grammatical classes, with each
 grammatical class corresponding loosely to each word-sense that the word
 has.
 
\end_layout

\begin_layout Standard
The problem with the above is that it is incorrect – a too-strong approximation.
 Different word-senses behave in grammatically different ways – 
\emph on
e.g.

\emph default
 a word can be both a verb and a noun.
 The correct partitioning is then necessarily given by 
\begin_inset Formula $p\left(g|w,d\right)$
\end_inset

 so that, for example, if 
\begin_inset Formula $g$
\end_inset

 is a verb-class, then 
\begin_inset Formula $p\left(g|w,d\right)\ne0$
\end_inset

 only when 
\begin_inset Formula $d$
\end_inset

 is a verb-style disjunct; in particular, 
\begin_inset Formula $p\left(g|w,d\right)=0$
\end_inset

 whenever 
\begin_inset Formula $d$
\end_inset

 is a noun-style disjunct.
 That is, the correct partitioning is given by 
\begin_inset Formula 
\[
N\left(g,d\right)=\sum_{w}p\left(g|w,d\right)N\left(w,d\right)
\]

\end_inset

and word-sense disambiguation fundamentally disallows the assumption that
 
\begin_inset Formula $p\left(g|w,d\right)\sim p\left(g|w\right)$
\end_inset

 since this assumption erases the grammatical differences between different
 word-senses.
\end_layout

\begin_layout Section*
Word-Sense Disambiguation
\end_layout

\begin_layout Standard
How does this work with entropic similarity? Let
\begin_inset Formula 
\[
\vec{w}=\sum_{d}N\left(w,d\right)\widehat{e}_{d}
\]

\end_inset

be the vector defined by the observed counts of disjuncts.
 Suppose it is a linear combination of two distinct word-senses 
\begin_inset Formula $\vec{s}$
\end_inset

 and 
\begin_inset Formula $\vec{t}$
\end_inset

, so that 
\begin_inset Formula 
\[
\vec{w}=\vec{s}+\vec{t}
\]

\end_inset

but we don't know what the counts for these are, and so are faced with the
 challenge of determining the linear decomposition.
 
\end_layout

\begin_layout Standard
Suppose that there is some grammatical class 
\begin_inset Formula $\vec{g}$
\end_inset

 and we've noticed that 
\begin_inset Formula $\vec{w}$
\end_inset

 is similar to 
\begin_inset Formula $\vec{g}$
\end_inset

 and this similarity is probably due to 
\begin_inset Formula $\vec{s}$
\end_inset

 being similar to 
\begin_inset Formula $\vec{g}$
\end_inset

.
 How can we use this to determine 
\begin_inset Formula $\vec{s}$
\end_inset

? A classical strategy is orthogonalization: assume that 
\begin_inset Formula $\vec{s}$
\end_inset

 is parallel to 
\begin_inset Formula $\vec{g}$
\end_inset

, and take the orthogonal complement.
 The primary problem here is that the space is not Euclidean: it lacks the
 spherical (rotational) symmetry that preserves the dot product.
 The vectors 
\begin_inset Formula $\vec{w}$
\end_inset

 inhabit a simplex; the vector coefficients must always remain zero or positive,
 and all decompositions must preserve the total count 
\begin_inset Formula $N\left(*,*\right)$
\end_inset

 as an invariant.
\end_layout

\begin_layout Standard
Recall the definition of 
\begin_inset Formula $i\left(u,v\right)$
\end_inset

 above, and note that 
\begin_inset Formula $i\left(u,v\right)=\vec{u}\cdot\vec{v}$
\end_inset

.
 Note that 
\begin_inset Formula $*$
\end_inset

 behaves like a constant vector; 
\emph on
i.e.

\emph default
 
\begin_inset Formula $\vec{*}=\sum_{d}N\left(*,d\right)\widehat{e}_{d}=\sum_{d}\sum_{w}N\left(w,d\right)\widehat{e}_{d}$
\end_inset

 is a perfectly well-defined vector.
\end_layout

\begin_layout Standard
Thus, given the WSD problem before us, it seems we want to find a vector
 
\begin_inset Formula $\vec{s}$
\end_inset

 such that 
\begin_inset Formula $MI\left(\vec{g},\vec{s}\right)$
\end_inset

 is maximized, and so that 
\begin_inset Formula $MI\left(\vec{g},\vec{t}\right)$
\end_inset

 is minimized, subject to the constraint 
\begin_inset Formula $\vec{w}=\vec{s}+\vec{t}$
\end_inset

.
 Let's work that out.
 By definition,
\begin_inset Formula 
\[
MI\left(\vec{g},\vec{s}\right)=\log_{2}\frac{i\left(g,s\right)i\left(*,*\right)}{i\left(g,*\right)i\left(*,s\right)}
\]

\end_inset

The extrema of this would be given by those values of 
\begin_inset Formula $\vec{s}$
\end_inset

 that are stationary under perturbations 
\begin_inset Formula $\vec{s}+\delta\vec{s}$
\end_inset

, where 
\begin_inset Formula 
\[
\delta\vec{s}=\sum_{d}\delta N\left(s,d\right)\widehat{e}_{d}
\]

\end_inset

so that we must solve 
\begin_inset Formula $\left|d\right|$
\end_inset

 different differential equations, one for each disjunct 
\begin_inset Formula $d$
\end_inset

:
\begin_inset Formula 
\[
0=\left.\frac{\partial MI\left(\vec{g},\vec{s}\right)}{\partial N\left(s,d\right)}\right|_{d\mbox{ held fixed}}
\]

\end_inset

It's a mess.
 Let's do it.
 We have
\begin_inset Formula 
\[
0=\frac{\partial}{\partial N\left(s,d\right)}\,\frac{\sum_{\beta}N\left(s,\beta\right)N\left(g,\beta\right)i\left(*,*\right)}{\sum_{\alpha}N\left(s,\alpha\right)N\left(*,\alpha\right)i\left(g,*\right)}
\]

\end_inset

The two 
\begin_inset Formula $i$
\end_inset

 terms are constants; drop them.
 What is left is
\begin_inset Formula 
\[
0=\frac{N\left(g,d\right)}{i\left(s,*\right)}-\frac{i\left(s,g\right)}{\left(i\left(s,*\right)\right)^{2}}N\left(*,d\right)
\]

\end_inset

Reorganizing:
\begin_inset Formula 
\[
0=N\left(g,d\right)i\left(s,*\right)-N\left(*,d\right)i\left(s,g\right)
\]

\end_inset

or
\begin_inset Formula 
\[
\frac{i\left(s,g\right)}{i\left(s,*\right)}=\frac{N\left(g,d\right)}{N\left(*,d\right)}
\]

\end_inset

This is mis-constrained, in several ways.
 First, the RHS depends on 
\begin_inset Formula $d$
\end_inset

 while the LHS does not.
 Not only is it impossible to satisfy, but even if it were, it tells us
 nothing about the components of 
\begin_inset Formula $\vec{s}$
\end_inset

, a formula for which we were hoping to find.
 
\end_layout
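
\begin_layout Standard
One way to see why the stationarity condition cannot pin down 
\begin_inset Formula $\vec{s}$
\end_inset

 is the following sketch, an added observation that assumes only the definitions
 above.
 The ratio 
\begin_inset Formula $i\left(s,g\right)/i\left(s,*\right)$
\end_inset

 is unchanged by the rescaling 
\begin_inset Formula $\vec{s}\to\lambda\vec{s}$
\end_inset

, so contracting the gradient with 
\begin_inset Formula $\vec{s}$
\end_inset

 gives zero identically:
\begin_inset Formula 
\[
\sum_{d}N\left(s,d\right)\left[\frac{N\left(g,d\right)}{i\left(s,*\right)}-\frac{i\left(s,g\right)}{\left(i\left(s,*\right)\right)^{2}}N\left(*,d\right)\right]=\frac{i\left(s,g\right)}{i\left(s,*\right)}-\frac{i\left(s,g\right)\,i\left(s,*\right)}{\left(i\left(s,*\right)\right)^{2}}=0
\]

\end_inset

Thus the gradient carries no information about the overall scale of 
\begin_inset Formula $\vec{s}$
\end_inset

, and all of its components can vanish only if 
\begin_inset Formula $N\left(g,d\right)/N\left(*,d\right)$
\end_inset

 takes the same value for every 
\begin_inset Formula $d$
\end_inset

.
\end_layout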

\begin_layout Standard
Compute again, to second order.
 Up to a constant,
\begin_inset Formula 
\[
MI\left(\vec{g},\vec{s}\right)=\log_{2}i\left(g,s\right)-\log_{2}i\left(s,*\right)+C
\]

\end_inset

so
\begin_inset Formula 
\[
\frac{\partial}{\partial N\left(s,d\right)}\,\log_{2}i\left(s,g\right)=\frac{N\left(g,d\right)}{i\left(s,g\right)\ln2}
\]

\end_inset

so
\begin_inset Formula 
\[
\frac{\partial}{\partial N\left(s,d\right)}\frac{\partial}{\partial N\left(s,d^{\prime}\right)}\,\log_{2}i\left(s,g\right)=-\frac{N\left(g,d\right)N\left(g,d^{\prime}\right)}{\left(i\left(s,g\right)\right)^{2}\ln2}
\]

\end_inset

so
\begin_inset Formula 
\[
0=\frac{\partial}{\partial N\left(s,d\right)}\frac{\partial}{\partial N\left(s,d^{\prime}\right)}\,MI\left(\vec{g},\vec{s}\right)=-\frac{N\left(g,d\right)N\left(g,d^{\prime}\right)}{\left(i\left(s,g\right)\right)^{2}}+\frac{N\left(*,d\right)N\left(*,d^{\prime}\right)}{\left(i\left(s,*\right)\right)^{2}}
\]

\end_inset

or
\begin_inset Formula 
\[
\frac{\left(i\left(s,g\right)\right)^{2}}{\left(i\left(s,*\right)\right)^{2}}=\frac{N\left(g,d\right)N\left(g,d^{\prime}\right)}{N\left(*,d\right)N\left(*,d^{\prime}\right)}
\]

\end_inset

which is equally absurd; it's just the square of the earlier result.
 WTF.
 But of course, that's obvious, in a way.
 But what to do? Why is this not working?
\end_layout

\begin_layout Standard
Repeating the first-order variation calculation for 
\begin_inset Formula $\cos\theta=i\left(s,g\right)/\sqrt{i\left(s,s\right)i\left(g,g\right)}$
\end_inset

 promptly gives the standard orthogonalization: 
\emph on
viz
\emph default
 that 
\begin_inset Formula $N\left(s,d\right)\sim N\left(g,d\right)$
\end_inset

 is the condition for parallel vectors.
 This is driven not by the numerator 
\begin_inset Formula $i\left(s,g\right)$
\end_inset

, which is common to both 
\begin_inset Formula $\cos\theta$
\end_inset

 and 
\begin_inset Formula $MI\left(s,g\right)$
\end_inset

, but rather by the quadratic 
\begin_inset Formula $i\left(s,s\right)$
\end_inset

 in the denominator.
 In a sense, it is a 
\begin_inset Quotes eld
\end_inset

second order effect
\begin_inset Quotes erd
\end_inset

.
\begin_inset Foot
status open

\begin_layout Plain Layout
A quick sketch might help.
 The quadratic term contributes 
\begin_inset Formula 
\[
\frac{\partial}{\partial N\left(s,d\right)}i\left(s,s\right)=2N\left(s,d\right)
\]

\end_inset

 and so 
\begin_inset Formula 
\[
0=\frac{\partial}{\partial N\left(s,d\right)}\cos\theta=i\left(s,s\right)N\left(g,d\right)-i\left(s,g\right)N\left(s,d\right)
\]

\end_inset

after dropping a constant factor of 
\begin_inset Formula $1/\sqrt{i\left(g,g\right)}$
\end_inset

 and multiplying by 
\begin_inset Formula $i\left(s,s\right)^{3/2}$
\end_inset

.
 Then, up to scale factor, one gets 
\begin_inset Formula $N\left(s,d\right)\sim N\left(g,d\right)$
\end_inset

 which is the desired result: 
\begin_inset Formula $\vec{s}$
\end_inset

 is collinear with 
\begin_inset Formula $\vec{g}$
\end_inset

.
 To maintain the constraint that 
\begin_inset Formula $\vec{w}=\vec{s}+\vec{t}$
\end_inset

 one must set 
\begin_inset Formula $N\left(s,d\right)=0$
\end_inset

 whenever 
\begin_inset Formula $N\left(w,d\right)=0$
\end_inset

; thus, in the end, 
\begin_inset Formula $\vec{s}$
\end_inset

 might not be completely collinear with 
\begin_inset Formula $\vec{g}$
\end_inset

, but only close.
 The MI formulation lacks the quadratic 
\begin_inset Formula $i\left(s,s\right)$
\end_inset

 term and thus is unable to generate some analog to 
\begin_inset Formula $N\left(s,d\right)\sim N\left(g,d\right)$
\end_inset

.
 
\end_layout

\end_inset

 
\end_layout

\begin_layout Standard
OK, so what is the source of the flawed thinking above? What is the root
 cause of the failure of variational principles on MI? Geometrically, the
 answer is simple: 
\begin_inset Formula $\cos\theta$
\end_inset

 is symptomatic of a sphere: smooth, differentiable, having well-defined
 derivatives.
 By contrast, surfaces of minimal (or maximal) MI are polytopes of some
 sort, consisting of smooth facets, together with non-differentiable edges
 where faces meet.
 Variational principles do not apply.
 We'll have to use some kind of linear programming techniques to find a
 solution.
\end_layout

\begin_layout Standard
So with that in mind, let's start again:
\begin_inset Formula 
\[
MI\left(\vec{g},\vec{s}\right)=\log_{2}\frac{\vec{s}\cdot\vec{g}}{\vec{s}\cdot\vec{*}}+C
\]

\end_inset

and so we're dealing with a ratio of two dot products.
 This is now easy to minimize and maximize, conceptually.
 In one case, any 
\begin_inset Formula $\vec{s}$
\end_inset

 that satisfies 
\begin_inset Formula $\vec{s}\cdot\vec{g}=0$
\end_inset

 will do the trick.
 More appropriately, this would be 
\begin_inset Formula $\vec{t}$
\end_inset

 the orthogonal complement, and the question becomes: can we find such a
 
\begin_inset Formula $\vec{t}$
\end_inset

 such that no coefficients are negative, and such that the coefficients
 of 
\begin_inset Formula $\vec{s}=\vec{w}-\vec{t}$
\end_inset

 are also non-negative? Given that we already know that 
\begin_inset Formula $MI\left(\vec{g},\vec{w}\right)$
\end_inset

 is large, it is clear that both share a lot of disjuncts in common; the
 handful that are in 
\begin_inset Formula $\vec{w}$
\end_inset

 that are not also in 
\begin_inset Formula $\vec{g}$
\end_inset

 could serve as a basis for 
\begin_inset Formula $\vec{t}$
\end_inset

.
 That's a start, but not a terribly good one.
\end_layout

\begin_layout Standard
A different extremum would come from finding some 
\begin_inset Formula $\vec{s}$
\end_inset

 that satisfies 
\begin_inset Formula $\vec{s}\cdot\vec{*}=0$
\end_inset

.
 It is fairly straightforward to see that such an 
\begin_inset Formula $\vec{s}$
\end_inset

, with all non-negative coefficients, does not exist: By definition, 
\begin_inset Formula $\vec{*}$
\end_inset

 has all-positive coefficients, 
\emph on
i.e.

\emph default
 
\begin_inset Formula $N\left(*,d\right)>0$
\end_inset

 for all 
\begin_inset Formula $d$
\end_inset

; none are zero.
 Any vector orthogonal to this must have at least one negative coefficient
 (excluding the trivial zero-vector).
 This also suggests what the almost-orthogonal vectors will be: the almost-ortho
gonal vectors will be those 
\begin_inset Formula $\vec{s}$
\end_inset

 for which most 
\begin_inset Formula $N\left(s,d\right)$
\end_inset

 are zero, except for a few 
\begin_inset Formula $d$
\end_inset

 where 
\begin_inset Formula $N\left(*,d\right)$
\end_inset

 is small.
 That is, sort the set of 
\begin_inset Formula $\left\{ d\right\} $
\end_inset

 according to smallest 
\begin_inset Formula $N\left(*,d\right)$
\end_inset

 and then make linear combinations of the first few in that sorted set.
 More specifically, the only allowed combinations are those for which 
\begin_inset Formula $N\left(w,d\right)>0$
\end_inset

, and so then search for that 
\begin_inset Formula $d$
\end_inset

 in that set with the smallest 
\begin_inset Formula $N\left(*,d\right)$
\end_inset

.
 Of course, this only works if that same 
\begin_inset Formula $d$
\end_inset

 has appreciable overlap with 
\begin_inset Formula $\vec{g}$
\end_inset

 i.e.
 if 
\begin_inset Formula $N\left(g,d\right)$
\end_inset

 is large or at least non-zero.
\end_layout

\begin_layout Standard
This is now gelling towards an algorithm: Obtain the set 
\begin_inset Formula $\left\{ d\right\} =\left\{ d|N\left(w,d\right)N\left(g,d\right)>0\right\} $
\end_inset

 and rank that set, from highest to lowest, according to 
\begin_inset Formula $N\left(g,d\right)/N\left(*,d\right)$
\end_inset

.
 Pick the top-ranked 
\begin_inset Formula $d$
\end_inset

 and set 
\begin_inset Formula $N\left(s,d\right)=1$
\end_inset

.
 Now pick the second disjunct in that list, call it 
\begin_inset Formula $d^{\prime}$
\end_inset

 and consider fractions of the form 
\begin_inset Formula 
\[
F=\frac{N\left(g,d\right)+\alpha N\left(g,d^{\prime}\right)}{N\left(*,d\right)+\alpha N\left(*,d^{\prime}\right)}
\]

\end_inset

and solve for the non-negative 
\begin_inset Formula $\alpha$
\end_inset

 that maximizes this fraction.
 It appears that the only possible solution to this is 
\begin_inset Formula $\alpha=0$
\end_inset

; this is forced, since by definition 
\begin_inset Formula $N\left(g,d\right)/N\left(*,d\right)>N\left(g,d^{\prime}\right)/N\left(*,d^{\prime}\right)$
\end_inset

.
 Thus, the way to maximize 
\begin_inset Formula $MI\left(\vec{g},\vec{s}\right)$
\end_inset

 is to set 
\begin_inset Formula $N\left(s,d\right)=N\left(w,d\right)$
\end_inset

 and set all other 
\begin_inset Formula $N\left(s,d^{\prime}\right)=0$
\end_inset

.
 This leaves behind 
\begin_inset Formula $N\left(t,d\right)=0$
\end_inset

 and 
\begin_inset Formula $N\left(t,d^{\prime}\right)=N\left(w,d^{\prime}\right)$
\end_inset

 for all other 
\begin_inset Formula $d^{\prime}$
\end_inset

.
 However, it is likely that the result 
\begin_inset Formula $t$
\end_inset

 still has a large 
\begin_inset Formula $MI\left(\vec{g},\vec{t}\right)$
\end_inset

 and so this has not been minimized.
 Recall, we want to maximize 
\begin_inset Formula $MI\left(\vec{g},\vec{s}\right)$
\end_inset

 and also simultaneously minimize 
\begin_inset Formula $MI\left(\vec{g},\vec{t}\right)$
\end_inset

.
 It seems that there is a trade-off between the two.
 
\end_layout
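
\begin_layout Standard
The claim that 
\begin_inset Formula $\alpha=0$
\end_inset

 is forced can be checked directly; the following is a short worked step,
 assuming only that all 
\begin_inset Formula $N\left(*,d\right)$
\end_inset

 are positive.
 Differentiating,
\begin_inset Formula 
\[
\frac{dF}{d\alpha}=\frac{N\left(g,d^{\prime}\right)N\left(*,d\right)-N\left(g,d\right)N\left(*,d^{\prime}\right)}{\left(N\left(*,d\right)+\alpha N\left(*,d^{\prime}\right)\right)^{2}}
\]

\end_inset

and the numerator is negative exactly when 
\begin_inset Formula $N\left(g,d\right)/N\left(*,d\right)>N\left(g,d^{\prime}\right)/N\left(*,d^{\prime}\right)$
\end_inset

, which is how the list was ranked.
 So 
\begin_inset Formula $F$
\end_inset

 decreases monotonically for 
\begin_inset Formula $\alpha\ge0$
\end_inset

, and its maximum sits at the endpoint 
\begin_inset Formula $\alpha=0$
\end_inset

.
\end_layout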

\begin_layout Standard
In fact, we want to maximize some linear combination of 
\begin_inset Formula $MI\left(\vec{g},\vec{s}\right)$
\end_inset

 and 
\begin_inset Formula $MI\left(\vec{g},\vec{t}\right)$
\end_inset

.
 Since the MI are being interpreted as entropies, it seems like straight
 addition is the thing to do, and so perhaps the thing to maximize is 
\begin_inset Formula $S=MI\left(\vec{g},\vec{s}\right)-MI\left(\vec{g},\vec{t}\right)$
\end_inset

.
 Another possibility might be 
\begin_inset Formula $H=p\left(\vec{s}\right)MI\left(\vec{g},\vec{s}\right)-p\left(\vec{t}\right)MI\left(\vec{g},\vec{t}\right)$
\end_inset

 where 
\begin_inset Formula $p\left(\vec{w}\right)$
\end_inset

 is some probability; possibly, for example, 
\begin_inset Formula $p\left(\vec{w}\right)=i\left(w,*\right)/i\left(*,*\right)$
\end_inset

 seems to be 
\begin_inset Quotes eld
\end_inset

natural
\begin_inset Quotes erd
\end_inset

.
\begin_inset Foot
status collapsed

\begin_layout Plain Layout
Is there any chance that we got lucky, that by adding enough complexity
 to the problem, we arrived where, maybe this time, the variational arguments
 will work? Let's try them on for size, proceeding naively, as before.
 Just as before, but in altered notation, 
\begin_inset Formula 
\[
\frac{\partial}{\partial N\left(s,d\right)}\vec{s}\cdot\vec{w}=\frac{\partial}{\partial N\left(s,d\right)}\sum_{d^{\prime}}N\left(s,d^{\prime}\right)N\left(w,d^{\prime}\right)=N\left(w,d\right)
\]

\end_inset

and so
\begin_inset Formula 
\begin{align*}
0=\frac{\partial S}{\partial N\left(s,d\right)} & =\frac{\partial}{\partial N\left(s,d\right)}\left[\log_{2}\frac{\vec{s}\cdot\vec{g}}{\vec{s}\cdot\vec{*}}-\log_{2}\frac{\left(\vec{w}-\vec{s}\right)\cdot\vec{g}}{\left(\vec{w}-\vec{s}\right)\cdot\vec{*}}\right]\\
 & =\frac{\vec{s}\cdot\vec{*}}{\vec{s}\cdot\vec{g}}\left(\frac{N\left(g,d\right)}{\vec{s}\cdot\vec{*}}-\frac{\vec{s}\cdot\vec{g}\,N\left(*,d\right)}{\left(\vec{s}\cdot\vec{*}\right)^{2}}\right)-\frac{\left(\vec{w}-\vec{s}\right)\cdot\vec{*}}{\left(\vec{w}-\vec{s}\right)\cdot\vec{g}}\left(\frac{-N\left(g,d\right)}{\left(\vec{w}-\vec{s}\right)\cdot\vec{*}}+\frac{\left(\vec{w}-\vec{s}\right)\cdot\vec{g}\,N\left(*,d\right)}{\left(\left(\vec{w}-\vec{s}\right)\cdot\vec{*}\right)^{2}}\right)
\end{align*}

\end_inset

What a mess.
 Continuing as before, we run into the same problem as before: an over-constrain
ed/mis-constrained system of equations:
\begin_inset Formula 
\[
0=N\left(*,d\right)\vec{s}\cdot\vec{g}\vec{w}\cdot\vec{*}\left(\vec{s}-\vec{w}\right)\cdot\vec{g}-N\left(g,d\right)\vec{s}\cdot\vec{*}\vec{w}\cdot\vec{g}\left(\vec{s}-\vec{w}\right)\cdot\vec{*}
\]

\end_inset

 or
\begin_inset Formula 
\[
\frac{N\left(g,d\right)}{N\left(*,d\right)}=\frac{\vec{s}\cdot\vec{g}\,\vec{w}\cdot\vec{*}\,\left(\vec{s}-\vec{w}\right)\cdot\vec{g}}{\vec{s}\cdot\vec{*}\,\vec{w}\cdot\vec{g}\,\left(\vec{s}-\vec{w}\right)\cdot\vec{*}}
\]

\end_inset

so that, again, the LHS depends on 
\begin_inset Formula $d$
\end_inset

 and the RHS does not, making it impossible to satisfy in general.
 Let's try again with 
\begin_inset Formula $H$
\end_inset

, making use of 
\begin_inset Formula 
\[
\frac{\partial}{\partial N\left(s,d\right)}p\left(\vec{s}\right)=\frac{1}{i\left(*,*\right)}\,\frac{\partial}{\partial N\left(s,d\right)}i\left(s,*\right)=\frac{N\left(*,d\right)}{\vec{*}\cdot\vec{*}}=-\frac{\partial}{\partial N\left(s,d\right)}p\left(\vec{t}\right)
\]

\end_inset

It is convenient to write 
\begin_inset Formula $p\left(\vec{s}\right)i\left(*,*\right)=\vec{s}\cdot\vec{*}$
\end_inset

 and drop the 
\begin_inset Formula $i\left(*,*\right)$
\end_inset

 as a constant.
\begin_inset Formula 
\begin{align*}
0=\frac{\partial H}{\partial N\left(s,d\right)}= & \frac{\left(\vec{s}\cdot\vec{*}\right)^{2}}{\vec{s}\cdot\vec{g}}\left(\frac{N\left(g,d\right)}{\vec{s}\cdot\vec{*}}-\frac{\vec{s}\cdot\vec{g}\,N\left(*,d\right)}{\left(\vec{s}\cdot\vec{*}\right)^{2}}\right)+N\left(*,d\right)\log_{2}\frac{\vec{s}\cdot\vec{g}}{\vec{s}\cdot\vec{*}}\\
 & -\frac{\left(\left(\vec{w}-\vec{s}\right)\cdot\vec{*}\right)^{2}}{\left(\vec{w}-\vec{s}\right)\cdot\vec{g}}\left(\frac{-N\left(g,d\right)}{\left(\vec{w}-\vec{s}\right)\cdot\vec{*}}+\frac{\left(\vec{w}-\vec{s}\right)\cdot\vec{g}\,N\left(*,d\right)}{\left(\left(\vec{w}-\vec{s}\right)\cdot\vec{*}\right)^{2}}\right)+N\left(*,d\right)\log_{2}\frac{\left(\vec{w}-\vec{s}\right)\cdot\vec{g}}{\left(\vec{w}-\vec{s}\right)\cdot\vec{*}}
\end{align*}

\end_inset

Again, quite the mess.
 Grouping terms, we see that this too has the same failures as before: It
 lacks any terms that are linear in 
\begin_inset Formula $N\left(s,d\right)$
\end_inset

 which are what's needed to make the equations solvable.
 So we conclude, wtf, what were we thinking?
\end_layout

\end_inset


\end_layout

\begin_layout Standard
The problem is of linear-programming-type, but not linear.
 Maximizing 
\begin_inset Formula $S$
\end_inset

 means maximizing 
\begin_inset Formula 
\[
\frac{\vec{s}\cdot\vec{g}}{\vec{s}\cdot\vec{*}}\frac{\left(\vec{w}-\vec{s}\right)\cdot\vec{*}}{\left(\vec{w}-\vec{s}\right)\cdot\vec{g}}=\frac{\vec{s}\cdot\vec{g}}{\vec{s}\cdot\vec{*}}\frac{\vec{t}\cdot\vec{*}}{\vec{t}\cdot\vec{g}}
\]

\end_inset

subject to 
\begin_inset Formula $0\le\vec{s}\le\vec{w}$
\end_inset

.
 This is certainly not linear; so it's actually a 
\begin_inset Quotes eld
\end_inset

convex programming
\begin_inset Quotes erd
\end_inset

 problem, I guess.
 It's not that bad (I think) because there are only four dot products.
 It is slightly confusing because it is completely projective in all four
 vectors: the magnitudes all cancel.
\end_layout

\begin_layout Standard
Define sets that give the support of each vector.
 Let 
\begin_inset Formula $G=\left\{ d|N\left(g,d\right)>0\right\} $
\end_inset

 and likewise for 
\begin_inset Formula $W$
\end_inset

 and 
\begin_inset Formula $S,T$
\end_inset

.
 By definition 
\begin_inset Formula $N\left(*,d\right)>0$
\end_inset

 for all 
\begin_inset Formula $d$
\end_inset

, so this is the universe 
\begin_inset Formula $U$
\end_inset

.
 The constraint is that 
\begin_inset Formula $W=S\cup T$
\end_inset

.
 By definition, 
\begin_inset Formula $G\cap W\ne\varnothing$
\end_inset

.
 By presumption, 
\begin_inset Formula $G\subset U$
\end_inset

 and 
\begin_inset Formula $W\subset U$
\end_inset

 are proper subsets.
\end_layout

\begin_layout Standard
To proceed, turn this into a decision problem: for any fixed disjunct 
\begin_inset Formula $d$
\end_inset

, select some 
\begin_inset Formula $\alpha$
\end_inset

 such that 
\begin_inset Formula $N\left(s,d\right)=\alpha N\left(w,d\right)$
\end_inset

 and 
\begin_inset Formula $N\left(t,d\right)=\left(1-\alpha\right)N\left(w,d\right)$
\end_inset

.
 To keep things easy for now, make this a binary decision problem: 
\begin_inset Formula $\alpha$
\end_inset

 is zero or one.
 Holding all other 
\begin_inset Formula $N$
\end_inset

 fixed, the optimization problem is to maximize
\begin_inset Formula 
\[
\frac{j_{d}\left(s,g\right)+\alpha N\left(w,d\right)N\left(g,d\right)}{j_{d}\left(s,*\right)+\alpha N\left(w,d\right)N\left(*,d\right)}\;\frac{j_{d}\left(t,*\right)+\left(1-\alpha\right)N\left(w,d\right)N\left(*,d\right)}{j_{d}\left(t,g\right)+\left(1-\alpha\right)N\left(w,d\right)N\left(g,d\right)}
\]

\end_inset

where 
\begin_inset Formula $j_{d}\left(u,v\right)=i\left(u,v\right)-N\left(u,d\right)N\left(v,d\right)$
\end_inset

, that is, the dot product, excluding the 
\begin_inset Formula $d$
\end_inset

 term.
 This really is a binary decision problem: one need only look at the two
 endpoints 
\begin_inset Formula $\alpha=0$
\end_inset

 and 
\begin_inset Formula $\alpha=1$
\end_inset

.
 It is either true that 
\begin_inset Formula 
\[
\frac{j_{d}\left(s,g\right)}{j_{d}\left(s,*\right)}\;\frac{j_{d}\left(t,*\right)+N\left(w,d\right)N\left(*,d\right)}{j_{d}\left(t,g\right)+N\left(w,d\right)N\left(g,d\right)}<\frac{j_{d}\left(s,g\right)+N\left(w,d\right)N\left(g,d\right)}{j_{d}\left(s,*\right)+N\left(w,d\right)N\left(*,d\right)}\;\frac{j_{d}\left(t,*\right)}{j_{d}\left(t,g\right)}
\]

\end_inset

or it is false.
 That is, if LHS < RHS holds true, then 
\begin_inset Formula $\alpha=1$
\end_inset

 maximizes, and the disjunct 
\begin_inset Formula $d$
\end_inset

 belongs to the set 
\begin_inset Formula $S$
\end_inset

; otherwise 
\begin_inset Formula $d$
\end_inset

 belongs in set 
\begin_inset Formula $T$
\end_inset

.
\end_layout
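
\begin_layout Standard
For numerical work it may be convenient to clear the denominators; what
 follows is a sketch under the assumption that all four 
\begin_inset Formula $j_{d}$
\end_inset

 terms are strictly positive, with the shorthand 
\begin_inset Formula $n_{g}=N\left(w,d\right)N\left(g,d\right)$
\end_inset

 and 
\begin_inset Formula $n_{*}=N\left(w,d\right)N\left(*,d\right)$
\end_inset

 introduced here only for brevity.
 The inequality above is then equivalent to 
\begin_inset Formula 
\[
j_{d}\left(s,g\right)\,j_{d}\left(t,g\right)\left(j_{d}\left(t,*\right)+n_{*}\right)\left(j_{d}\left(s,*\right)+n_{*}\right)<j_{d}\left(s,*\right)\,j_{d}\left(t,*\right)\left(j_{d}\left(s,g\right)+n_{g}\right)\left(j_{d}\left(t,g\right)+n_{g}\right)
\]

\end_inset

which involves only products and sums of non-negative counts.
\end_layout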

\begin_layout Standard
This is readily computed numerically.
\end_layout

\begin_layout Standard
So this is actually a binary decision problem or, more narrowly, a binary-integer
 programming problem.
 Such problems are NP-hard in general, but I'm thinking that this is a special
 case that might not be.
 A plausible algorithm is to rank the disjuncts by size of 
\begin_inset Formula $N\left(*,d\right)$
\end_inset

 from greatest to least, and assign the disjuncts, one by one, to either
 of the two classes.
 One can do likewise for 
\begin_inset Formula $H$
\end_inset

 instead of 
\begin_inset Formula $S$
\end_inset

, just using a different, more complex decision.
 Both run in time linear in the size of 
\begin_inset Formula $W\cap G$
\end_inset

.
\end_layout
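
\begin_layout Standard
The linear running time presumably relies on updating the four dot products
 incrementally rather than recomputing them.
 As a sketch of that bookkeeping (an assumption about the implementation,
 not something measured here), assigning a previously unassigned 
\begin_inset Formula $d$
\end_inset

 to 
\begin_inset Formula $S$
\end_inset

 updates 
\begin_inset Formula 
\begin{align*}
i\left(s,g\right) & \to i\left(s,g\right)+N\left(w,d\right)N\left(g,d\right)\\
i\left(s,*\right) & \to i\left(s,*\right)+N\left(w,d\right)N\left(*,d\right)
\end{align*}

\end_inset

while assigning it to 
\begin_inset Formula $T$
\end_inset

 updates 
\begin_inset Formula $i\left(t,g\right)$
\end_inset

 and 
\begin_inset Formula $i\left(t,*\right)$
\end_inset

 in the same way, so each decision costs a constant number of operations.
\end_layout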

\begin_layout Standard
As before, the input data, coming from disjuncts from WSD parses, is noisy,
 and so a binary assignment might be incorrect.
 It might be better to assign only a fraction of the passing disjuncts to
 
\begin_inset Formula $\vec{s}$
\end_inset

 and place the remainder in 
\begin_inset Formula $\vec{t}$
\end_inset

.
 This requires yet more experimentation.
\end_layout

\begin_layout Section*
Link Merge Redux
\end_layout

\begin_layout Standard
The word-disjunct dataset, containing counts 
\begin_inset Formula $N\left(w,d\right)$
\end_inset

, is far too large to be used as a practical dictionary for parsing.
 It also fails to provide sufficient coverage to parse most sentences.
 Thus, it needs to be simultaneously compressed and generalized.
 This is partly achieved by clustering words into word-classes.
 However, the current system leaves the connectors un-clustered.
 That is, each disjunct is an ordered sequence of (pseudo-)connectors: 
\begin_inset Formula 
\[
d=\left(c_{1},c_{2},\cdots,c_{k}\right)
\]

\end_inset

with each (pseudo-)connector being a word plus direction (polarity) indicator:
\begin_inset Formula 
\[
c_{j}=\left[w_{j},p_{j}\right]
\]

\end_inset

with 
\begin_inset Formula $p_{j}\in\left\{ +,-\right\} $
\end_inset

.
 We wish to cluster these, replacing the 
\begin_inset Formula $c_{j}$
\end_inset

 with a 
\begin_inset Formula $\gamma_{j}=\left[g_{j},p_{j}\right]$
\end_inset

 where each 
\begin_inset Formula $g_{j}$
\end_inset

 is a grammatical class, with 
\begin_inset Formula $w_{j}\in g_{j}$
\end_inset

.
 A naive (but wrong) conception of self-consistency is to have the 
\begin_inset Formula $g_{j}$
\end_inset

's appearing in connectors be the same grammatical classes as those used
 to categorize words.
 The notion of self-consistent connector classes is different (given below).
 At any rate, how does one cluster connectors? 
\end_layout

\begin_layout Standard
To proceed, a notation that breaks out the spider diagram is needed.
 Instead of writing 
\begin_inset Formula $N\left(w,d\right)$
\end_inset

, one actually has a collection of observation counts 
\begin_inset Formula 
\[
N\left(w,\left(\left[w_{1},p_{1}\right],\left[w_{2},p_{2}\right],\cdots,\left[w_{k},p_{k}\right]\right)\right)
\]

\end_inset

with 
\begin_inset Formula $k$
\end_inset

 being the arity of the disjunct.
 Square brackets are used for easy reading.
 Prior to clustering, one can convert pseudo-connectors to real connectors
 (and real links) by observing that word 
\begin_inset Formula $w_{a}$
\end_inset

 (on the left) can link to word 
\begin_inset Formula $w_{b}$
\end_inset

 (on the right) if and only if there exist two word-disjunct pairs 
\begin_inset Formula $\left(w_{a},d_{a}\right)$
\end_inset

 and 
\begin_inset Formula $\left(w_{b},d_{b}\right)$
\end_inset

 such that 
\begin_inset Formula $d_{a}=\left(\left[w_{1},p_{1}\right],\cdots,\left[w_{j},p_{j}\right],\cdots,\left[w_{k},p_{k}\right]\right)$
\end_inset

 and 
\begin_inset Formula $d_{b}=\left(\left[w_{1},p_{1}\right],\cdots,\left[w_{m},p_{m}\right],\cdots,\left[w_{n},p_{n}\right]\right)$
\end_inset

 and 
\begin_inset Formula $w_{j}=w_{b}$
\end_inset

 and 
\begin_inset Formula $p_{j}=+$
\end_inset

 and 
\begin_inset Formula $w_{m}=w_{a}$
\end_inset

 and 
\begin_inset Formula $p_{m}=-$
\end_inset

.
 A link can then be defined as any function that provides a unique label
 
\begin_inset Formula $l_{ab}$
\end_inset

 to the word-pair 
\begin_inset Formula $(w_{a},w_{b})$
\end_inset

.
 Given a 
\begin_inset Quotes eld
\end_inset

true link
\begin_inset Quotes erd
\end_inset

 type 
\begin_inset Formula $l$
\end_inset

, the corresponding 
\begin_inset Quotes eld
\end_inset

true connectors
\begin_inset Quotes erd
\end_inset

 are then 
\begin_inset Formula $l+$
\end_inset

 and 
\begin_inset Formula $l-$
\end_inset

.
 Thus, for the above example, 
\begin_inset Formula $\left[w_{j},p_{j}\right]\to l_{ab}+$
\end_inset

 and 
\begin_inset Formula $\left[w_{m},p_{m}\right]\to l_{ab}-$
\end_inset

.
\end_layout

\begin_layout Standard
The clustering task is now to find all other links that can be argued to
 be 
\begin_inset Quotes eld
\end_inset

similar
\begin_inset Quotes erd
\end_inset

.
 One very simple ansatz is to create a link 
\begin_inset Formula $l_{gg^{\prime}}$
\end_inset

 that corresponds to the gram-class pair 
\begin_inset Formula $\left(g,g^{\prime}\right)$
\end_inset

 and is used to connect any word-pair 
\begin_inset Formula $(w_{a},w_{b})$
\end_inset

 whenever 
\begin_inset Formula $w_{a}\in g$
\end_inset

 and 
\begin_inset Formula $w_{b}\in g^{\prime}.$
\end_inset

 This ansatz has a strong 
\begin_inset Quotes eld
\end_inset

broadening
\begin_inset Quotes erd
\end_inset

 effect, as the generic link type allows words in 
\begin_inset Formula $g$
\end_inset

 and 
\begin_inset Formula $g^{\prime}$
\end_inset

 to connect, for which actual links were never observed at earlier stages.
 (This is the variant used in the 
\begin_inset Quotes eld
\end_inset

baseline
\begin_inset Quotes erd
\end_inset

 datasets measured in the grammar baseline report.) 
\end_layout

\begin_layout Standard
How might one determine if two links are similar? The same way that one
 determines if any other vectors are similar.
 In this case, one starts with the double-spider diagram:
\end_layout

\begin_layout Standard
\align center
\begin_inset Graphics
	filename images/seeds-two.eps
	width 60col%

\end_inset


\end_layout

\begin_layout Standard
The disjoint union of the open circles (the unconnected connectors) defines
 a de facto basis element for the two linked words.
 Associated with this is a product of observation counts, which can effectively
 be used to define any distance metric previously considered.
 Notationally, write 
\begin_inset Formula 
\[
d_{a}=\left(c_{1}^{\left(a\right)},\cdots,c_{j}^{\left(a\right)},\cdots,c_{k}^{\left(a\right)}\right)
\]

\end_inset

just as before, but now with a superscript 
\begin_inset Formula $\left(a\right)$
\end_inset

 to tell the connectors apart from those on 
\begin_inset Formula $d_{b}$
\end_inset

.
 The connector set on 
\begin_inset Formula $l_{ab}$
\end_inset

 is then 
\begin_inset Formula 
\[
d_{ab}=\left(c_{1}^{\left(a\right)},\cdots,\widehat{c_{j}^{\left(a\right)}},\cdots,c_{k}^{\left(a\right)},c_{1}^{\left(b\right)},\cdots,\widehat{c_{m}^{\left(b\right)}},\cdots,c_{n}^{\left(b\right)}\right)
\]

\end_inset

with the widehat notation 
\begin_inset Formula $\widehat{c_{i}}$
\end_inset

 indicating that connector 
\begin_inset Formula $c_{i}$
\end_inset

 is absent from the list.
 Both 
\begin_inset Formula $c_{j}^{\left(a\right)}$
\end_inset

 and 
\begin_inset Formula $c_{m}^{\left(b\right)}$
\end_inset

 need to be absent, as those are the two connectors forming the link.
 Thus, associated with every 
\begin_inset Formula $l_{ab}$
\end_inset

 is a (fairly large) set of 
\begin_inset Formula $d_{ab}$
\end_inset

.
 To turn this into a bona-fide vector, a count is needed.
 It seems reasonable to define 
\begin_inset Formula 
\[
N\left(l_{ab},d_{ab}\right)=N\left(w_{a},d_{a}\right)N\left(w_{b},d_{b}\right)
\]

\end_inset

One might consider adjusting the above by counts on connectors, or something
 similar, but this seems like unneeded complexity at this point.
\end_layout
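
\begin_layout Standard
By analogy with the word vectors 
\begin_inset Formula $\vec{w}$
\end_inset

 used earlier, these counts would give a link vector and a dot product between
 links; this is written out here only as a sketch of the intended construction:
\begin_inset Formula 
\[
\vec{l}_{ab}=\sum_{d_{ab}}N\left(l_{ab},d_{ab}\right)\widehat{e}_{d_{ab}}\qquad i\left(l_{ab},l_{ce}\right)=\sum_{\delta}N\left(l_{ab},\delta\right)N\left(l_{ce},\delta\right)
\]

\end_inset

where 
\begin_inset Formula $\delta$
\end_inset

 ranges over connector sets shared by the two links, and 
\begin_inset Formula $l_{ce}$
\end_inset

 is a second, hypothetical link introduced only for illustration.
\end_layout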

\begin_layout Standard
How practical is this? Let's look at the number of potential links
\begin_inset Foot
status collapsed

\begin_layout Plain Layout
Counted with the 
\family typewriter
count-quarter-links
\family default
, 
\family typewriter
count-half-links
\family default
 and 
\family typewriter
count-links
\family default
 functions in 
\family typewriter
count-links.scm.
\end_layout

\end_inset

 in various datasets:
\end_layout

\begin_layout Standard
\align center
\begin_inset Tabular
<lyxtabular version="3" rows="6" columns="7">
<features tabularvalignment="middle">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Dimensions
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
qrt-links
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
half-links
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
pairs
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
full-links
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Secs
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Name
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1610 x 67K
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
146K
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
309K
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
30.4K
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
2.45M
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
184K
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
en_micro_marg
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
7385 x 270K
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
864K
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1.33M
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
88.6K
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
7.8M
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
608K
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
en_mini_marg
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
17K x 947K
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
3.78M
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
4.64M
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1.56M
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
en_large_marg
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
55K x 6.03M
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
21.1M
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
23.9M
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
7.91M
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
en_huge_marg
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
438K x 23.3M
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
76.9M
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
90.1M
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout

\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
31.7M
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
en_full_marg
\end_layout

\end_inset
</cell>
</row>
</lyxtabular>

\end_inset


\end_layout

\begin_layout Standard
The columns in the table are:
\end_layout

\begin_layout Itemize

\series bold
Dimensions
\series default
 – Number of words x Number of disjuncts
\end_layout

\begin_layout Itemize

\series bold
qrt-links
\series default
 – Total number of unique links that can be made between some word in the
 dataset, and some connector in some disjunct.
 This is a 
\begin_inset Quotes eld
\end_inset

quarter-link
\begin_inset Quotes erd
\end_inset

, as any given disjunct might be shared by multiple (word, disjunct) pairs.
 The counting is without multiplicity (
\emph on
i.e.

\emph default
 if a connector appears twice in the sequence, it is counted only once).
\end_layout

\begin_layout Itemize

\series bold
half-links
\series default
 – Total number of unique links that can be made between some word in the
 dataset, and some connector in any (word, disjunct) pair.
 This is a 
\begin_inset Quotes eld
\end_inset

half-link
\begin_inset Quotes erd
\end_inset

, as the relation is not symmetric; to be symmetric, i.e.
 a full-link, there needs to be another matching half-link going in the
 opposite direction.
 The counting is without multiplicity (
\emph on
i.e.

\emph default
 if a connector appears twice in the sequence, it is counted only once).
\end_layout

\begin_layout Itemize

\series bold
pairs
\series default
 – Total number of word-pairs having full-links between them.
\end_layout

\begin_layout Itemize

\series bold
full-links
\series default
 – Total number of unique full-links in the dataset, as defined above.
 
\end_layout

\begin_layout Itemize

\series bold
Secs
\series default
 – Number of sections in the dataset.
\end_layout

\begin_layout Itemize

\series bold
Name
\series default
 – Dataset name
\end_layout
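
\begin_layout Standard
A rough reading of the table, using only the two rows where full-links were
 counted and the rounded figures as printed (so the ratios are approximate):
 for en_micro_marg there are about 
\begin_inset Formula $2.45\mathrm{M}/184\mathrm{K}\approx13.3$
\end_inset

 full-links per section and about 
\begin_inset Formula $2.45\mathrm{M}/30.4\mathrm{K}\approx81$
\end_inset

 full-links per linked word-pair; for en_mini_marg the same ratios are about 
\begin_inset Formula $7.8\mathrm{M}/608\mathrm{K}\approx12.8$
\end_inset

 and 
\begin_inset Formula $7.8\mathrm{M}/88.6\mathrm{K}\approx88$
\end_inset

.
\end_layout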

\begin_layout Section*
Disjunct dataset redo
\end_layout

\begin_layout Standard
The code that converted MST parses to disjuncts had a bug in it – sentences
 with multiple instances of the same word created disjuncts that were incorrect.
 This was fixed in pull request opencog/atomspace#2252.
 However, the databases need to be redone.
\end_layout

\begin_layout Standard
While we are at it, here's a new processing pipeline:
\end_layout

\begin_layout Itemize
Filter the word-pair database
\emph on
 en_pairs_cfive
\emph default
 so that all word-pairs with MI<1.8 are removed.
 The value of 1.8 seems reasonable, as this is the mean MI of that database.
\end_layout

\begin_layout Itemize
Instead of creating MST parses, create MPG parses: 
\begin_inset Quotes eld
\end_inset

Maximum Planar Graph
\begin_inset Quotes erd
\end_inset

 parses.
 These start with the MST parse, and add additional high-MI edges, creating
 a planar graph with loops.
 The resulting disjuncts are larger.
\end_layout

\begin_layout Itemize
At a later time, e.g.
 during clustering, the disjuncts can be trimmed down (if desired) by looking
 up the MI between germ and connector, and discarding the connector if the
 MI falls below some threshold.
 Thus, MPG disjuncts are 
\begin_inset Quotes eld
\end_inset

fuller
\begin_inset Quotes erd
\end_inset

 than MST disjuncts, but can also be trimmed in a fairly coherent fashion,
 offering yet another way to limit dataset size (in exchange for additional
 processing :-o).
\end_layout
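
\begin_layout Standard
The two thresholds in the pipeline above can be written out explicitly.
 This is only a sketch: the symbols 
\begin_inset Formula $\theta$
\end_inset

 and 
\begin_inset Formula $d^{\prime}$
\end_inset

 are notation introduced here for illustration, not something defined elsewhere
 in this diary.
 The pair filter keeps a word-pair only when
\begin_inset Formula 
\[
MI\left(w_{a},w_{b}\right)\ge1.8
\]

\end_inset

and trimming a disjunct 
\begin_inset Formula $d=\left(c_{1},\cdots,c_{k}\right)$
\end_inset

 on germ 
\begin_inset Formula $w$
\end_inset

 with a trimming threshold 
\begin_inset Formula $\theta$
\end_inset

 would give
\begin_inset Formula 
\[
d^{\prime}=\left(c_{i}\in d\,:\,MI\left(w,c_{i}\right)\ge\theta\right)
\]

\end_inset

assuming the surviving connectors keep their original order.
\end_layout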

\begin_layout Section*
Reading list
\end_layout

\begin_layout Standard
Papers that people want me to read and have an opinion on:
\end_layout

\begin_layout Itemize
https://arxiv.org/abs/1904.03746
\end_layout

\begin_layout Section*
TODO
\end_layout

\begin_layout Standard
Explain how mutual exclusion of concepts, as performed by humans when learning
 new concepts, resembles optimal strategies for the channel coding theorem,
 by minimizing confusion between similar concepts.
 This is the 
\begin_inset Quotes eld
\end_inset

mutual exclusion
\begin_inset Quotes erd
\end_inset

 principle.
 Well, MI already provides a certain measure of exclusivity.
\end_layout

\begin_layout Section*
The End of Part One
\end_layout

\begin_layout Standard
This diary became too long to be manageable; it was closed out on 26 February
 2021.
 This diary resumes at learn-lang-diary-2.
\end_layout

\begin_layout Standard
\begin_inset CommandInset bibtex
LatexCommand bibtex
bibfiles "lang"
options "alpha"

\end_inset


\end_layout

\end_body
\end_document