
Commit 4bb2572

Apologia for the system architecture.
1 parent 50cd952 commit 4bb2572

2 files changed

Lines changed: 60 additions & 0 deletions

File tree

README-Architecture.md

Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
A Unified Architectural Framework
=================================

There are many -- many dozens -- of different NLP projects and
frameworks, written in Python, Java, Perl, and another half-dozen
programming ecosystems. The code here uses none of these. Why?

The answer is illustrated by a simple, practical problem encountered
when one tries to perform scientific research within these frameworks.
It is illustrated below, in an excerpt from an email.

A Simple Unsupervised Learning Example
--------------------------------------
One dream that I have is to be able to have feedback between processing
stages. So, for example: tokenization delivers 3 or 4 different
tokenization possibilities, each ranked with a score (ideally, log_2 of
a probability, or similar). All of these possibilities are then run
through the parser, which also assigns scores (again, log_2 p). The sum
of these scores can then be used to select the most likely
tokenization+parse. (A sum -- addition -- because logarithms are
additive.) Thus, the parser helps eliminate bad tokenizations.
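The scoring idea above can be sketched in a few lines. Everything here is invented for illustration -- the candidate tokenizations, the `parser_score` stand-in, and the numbers are all hypothetical, not taken from any actual tokenizer or parser:

```python
# Sketch: pick the tokenization+parse whose summed log_2 scores are largest.
# Because log probabilities add where probabilities multiply, summing the
# tokenizer score and the parser score gives the log of the joint probability.

# (tokenization, log2 P(tokenization)) pairs from a hypothetical tokenizer.
candidates = [
    (["the", "cat", "sat"], -2.0),
    (["the", "ca", "tsat"], -7.5),
    (["th", "e", "cat", "sat"], -9.0),
]

def parser_score(tokens):
    """Stand-in for a real parser returning log2 P(parse | tokens).
    Here it merely penalizes very short tokens, purely for illustration."""
    return -sum(3.0 for t in tokens if len(t) < 3)

def best_parse(candidates):
    # The feedback loop: the parser's score re-ranks the tokenizer's output.
    return max(candidates, key=lambda c: c[1] + parser_score(c[0]))

tokens, tok_score = best_parse(candidates)
```

Here the parser's penalty pushes the fragmented tokenizations below the plain one, which is exactly the "parser helps eliminate bad tokenizations" effect described above.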

That would be the first step. The second step is more interesting:
given the above, one can then automatically search for errors in the
tokenization. That is, one can automatically discover rules that will
correct errors. The final system is then "use the freedom model
algorithm, unless rule A applies, or unless rule B applies ..." where
rules A, B, ... were obtained by statistical analysis of the tokenizer
plus parser.
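The "use the model, unless a rule applies" shape of that final system can be sketched as follows. The rules, the `model_tokenize` stand-in, and the example string are all hypothetical, standing in for rules that would actually be discovered by statistical analysis:

```python
# Sketch: correction rules discovered offline override the base model.
# Each rule is a (predicate, correction) pair; if the predicate matches,
# the correction is applied before the base model runs.

rules = [
    # Hypothetical discovered rule: "tsat" is a known bad token split.
    (lambda text: "tsat" in text, lambda text: text.replace("tsat", "t sat")),
]

def model_tokenize(text):
    """Stand-in for the base statistical tokenizer."""
    return text.split()

def tokenize(text):
    """Use the base model, unless some discovered rule applies first."""
    for applies, correct in rules:
        if applies(text):
            text = correct(text)
    return model_tokenize(text)
```

The point of the architecture question below is precisely where such rules would be stored, in what format, and what code would discover and apply them.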

This exposes the problem: to do the above, one needs to perform a
statistical analysis of the tokenizer+parser, working together. What
code performs this statistical analysis? Where is the statistical data
kept? What converts this analysis into rules? What is the format of the
rules? How are the rules applied? What code applies those rules?

If one is naive, then the answer to all of those questions is difficult,
ugly, messy, confusing, and takes a lot of work. How can we make this
less messy, less difficult? (Less stovepipe code?) My answer, as always,
is that all work must be done in a single, common framework, one that
makes it easy to collect statistical information from a large variety of
sources. A place where the statistical analysis is easy. A place where
inference can be automated. A place where rules can be defined,
generically, and launched, generically. That single location is, for me,
the AtomSpace ...

I'm saying this to explain why, if it wasn't clear before, I do
things the way I do them.

README.md

Lines changed: 12 additions & 0 deletions
@@ -356,6 +356,18 @@ vectors", they are also components of a graph, components of a grammar.
### The Small Idea

System Architecture
===================
There are many -- many dozens -- of different NLP projects and
frameworks, written in Python, Java, Perl, and another half-dozen
programming ecosystems. The code here uses none of these. Why?

The answer is illustrated by a simple, practical problem encountered
when one tries to perform scientific research within these frameworks.
It is illustrated in the [README-Architecture](README-Architecture.md)
file.

Unsupervised Image Learning and Recognition (Vision)
====================================================
A simple sketch of how the above can be applied to vision (2D images) is
