A Unified Architectural Framework
=================================

There are many -- many dozens -- of different NLP projects and
frameworks, written in Python, Java, Perl, and another half-dozen
programming ecosystems. The code here uses none of these. Why?

The answer is illustrated by a simple, practical problem encountered
when one tries to perform scientific research within these frameworks.
The problem is described below, in an excerpt from an email.

A Simple Unsupervised Learning Example
--------------------------------------
One dream that I have is to be able to have feedback between processing
stages. So, for example: tokenization delivers 3 or 4 different
tokenization possibilities, each ranked with a score (ideally, log_2 of
a probability, or similar). All of these possibilities are then run
through the parser, which also assigns scores (again, log_2 p). The sum
of these scores can then be used to select the most likely
tokenization+parse. (A sum -- addition -- because logarithms turn
products of probabilities into sums.) Thus, the parser helps eliminate
bad tokenizations.

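A minimal sketch, in Python, of the scoring pipeline just described.
The tokenize and parse functions are hypothetical stand-ins, not code
from this project; they exist only to show how summed log_2 scores pick
the joint winner:

    import math

    # Hypothetical stand-ins for the two stages; real implementations
    # would return genuinely ranked candidates and real parser scores.
    def tokenize(sentence):
        """Return candidate tokenizations, each with a log_2 probability."""
        return [
            (["the", "quick", "brown", "fox"], math.log2(0.7)),
            (["the", "quick", "brownfox"],     math.log2(0.2)),
            (["thequick", "brown", "fox"],     math.log2(0.1)),
        ]

    def parse(tokens):
        """Return the log_2 probability of the best parse of the tokens."""
        # Fake score: prefer token streams of the expected length.
        return math.log2(1.0 / (1 + abs(len(tokens) - 4)))

    def best_analysis(sentence):
        # Log probabilities add where probabilities multiply, so the
        # joint score is just tokenizer score + parser score.
        scored = [(toks, s + parse(toks)) for toks, s in tokenize(sentence)]
        return max(scored, key=lambda pair: pair[1])

    tokens, joint_log2_p = best_analysis("the quick brown fox")
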
That would be the first step. The second step is more interesting:
given the above, one can then automatically search for errors in the
tokenization. That is, automatically discover rules that will correct
those errors. The final system is then "use the freedom model algorithm,
unless rule A applies, or unless rule B applies ..." where rules A, B, ...
were obtained by statistical analysis of the tokenizer plus parser.

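The "default unless a learned rule fires" dispatch might look like the
following sketch. The rule shown here is invented for illustration; in
the proposal above, such rules would be mined automatically from the
joint tokenizer+parser statistics:

    # Hypothetical learned correction: each rule either returns a
    # corrected tokenization or None when it does not apply.
    def rule_a(sentence):
        if "brownfox" in sentence:
            return sentence.replace("brownfox", "brown fox").split()
        return None

    CORRECTION_RULES = [rule_a]  # rules A, B, ... mined from statistics

    def tokenize_with_rules(sentence, default_tokenizer):
        for rule in CORRECTION_RULES:
            tokens = rule(sentence)
            if tokens is not None:   # a rule applies; override the default
                return tokens
        return default_tokenizer(sentence)  # otherwise, the default model

    print(tokenize_with_rules("the quick brownfox", str.split))
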
This exposes the problem: to do the above, one needs to perform a
statistical analysis of the tokenizer+parser, working together. What
code performs this statistical analysis? Where is the statistical data
kept? What converts this analysis into rules? What is the format of the
rules? How are the rules applied? What code applies those rules?

If one is naive, then the answer to each of those questions is
difficult, ugly, messy, confusing, and takes a lot of work. How can we
make this less messy, less difficult? (Less stovepipe code?) My answer,
as always, is that all work must be done in a single, common framework
that makes it easy to collect statistical information from a large
variety of sources. A place where the statistical analysis is easy. A
place where inference can be automated. A place where rules can be
defined, generically, and launched, generically. That single location
is, for me, the AtomSpace ...

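To make the principle concrete -- as a toy only, not the AtomSpace API --
if every stage records its observations in one shared store, then
cross-stage statistics, and thus rule mining, need no stovepipe glue
code:

    from collections import Counter

    class CommonStore:
        """Toy stand-in for a shared knowledge store like the AtomSpace."""
        def __init__(self):
            self.counts = Counter()

        def observe(self, *event):
            # Any stage may record any event as a typed tuple.
            self.counts[event] += 1

        def count(self, *event):
            return self.counts[event]

    store = CommonStore()
    store.observe("tokenizer", "emitted", "brownfox")
    store.observe("parser", "rejected", "brownfox")
    # A rule miner can now correlate the two stages in one place:
    suspicious = store.count("parser", "rejected", "brownfox")
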
I'm saying this to explain, if it wasn't clear before, why I do
things the way I do them.