#Wikitongues Database The following concepts were developed during the hackathon hosted by the Recurse Center in New York, USA on April 9th, 2016. ##Basic terms
-
Phrase
Text or video unit
-
Phrase pair
Relationship between phrases implying translation between two languages
-
Dictionary
Collection of phrase pairs
-
Book
Container for dictionary with extra user metadata such as titles, descriptions, authorship etc...
##Index table The index table accounts for all languages in the world.
ID | Names | … |
---|---|---|
eng |
[English, Inglês, Anglais,…] | … |
cmn |
[Chinese (Mandarin), Beifang Fangyan, Guanhua,…] | … |
… | … | … |
For language names, an array of all of the strings used in naming a language |
##Method 1 Each language has a corpus table with all of the phrases in that language. Each user owns their authored books.
User Table
ID | User Name | Books (Has many) | … |
---|---|---|---|
# | Johnathan Swift | [Reference to book table] | … |
… | … | … | … |
Book Table
Each book would have it's own table defining which phrases it owns
Source records eng |
Target records spa |
---|---|
Record # | Record # |
… | … |
We know which language corpus to refer to by the table headers.
I wonder if the book reference on the user table couldnt be something along the lines of
{
"eng":"spa",
"dictionary": {
"record #":"record #",
"record #":"record #"
}
}
Eng
Corpus
ID | Value | Type | Meta | … |
---|---|---|---|---|
# | hello | text | … | |
# | link/to/resource | video | olá | … |
… | … | … | … | … |
Each record is a phrase. Phrases can exist in multiple dictionaries or phrase books.
For written languages, you can have both text entries and video entries.
For languages that are not written, video will be used. For video to be indexed by meaning, textual metadata needs to exist. The video meta information will include one of the following:
- if the phrase pair is between text and video, the text is enough to index the video
- if the phrase pair is between video and video, user input will be needed to index the video content.
Question: reference books on user table or author on books table?
##Method 2 A centralized books table references authors. Each language pair has its own unique corpus table. Dictionaries are defined in belongs-to relationships as phrase pair book IDs.
User Table
ID | User Name | OAuth | … |
---|---|---|---|
# | Daniel Udell | Token | … |
# | Frederico Andrade | Token | … |
… | … | … | … |
Books table
ID | Title | Source language ID | Target language ID | User | … |
---|---|---|---|---|---|
# | Russian for beginners | eng |
rus |
Author's # | … |
# | Aprenda Japonês | por |
jap |
Author's # | … |
… | … | … | … | … | … |
eng && rus
Corpus
ID | Language 1 ID (eng ) |
Language 2 ID (rus ) |
Book # | … |
---|---|---|---|---|
# | Hello | Привет! (Privyet!) | Book ID | … |
… | … | … | … | … |
The Language pair corpus or Translation corpus represents all of the phrases that exist between two languages. |
Downsides: 7000^7000 = 49 million language pairs, 49 million tables.
##Method 3
Single universal corpus with all phrase pairs. In this proposal, dictionaries are aggregates of corpus entries, specified by the interface #.
Full corpus
ID | Source Lang ID | Target Lang ID | Source Value | Target Value | Source Type | Target Type | Source Meta | Target Meta | Interface ID | … |
---|---|---|---|---|---|---|---|---|---|---|
# | eng |
rus |
Hello | Привет! (Privyet!) | Text | Text | … | … | Reference to Book Interface Table | … |
# | eng |
rus |
Hello | link/to/video | Text | Video | … | hello | Reference to Book Interface Table | … |
… | … | … | … | … | … | … | … | … | … | … |
Interface Table
ID | Interface # | Book # | … |
---|---|---|---|
# | Interface ID | Book ID | … |
… | … | … | … |
Duplicate problem
The following table illustrates a data duplicate problem.
ID | … | Source Value | Target Value | … |
---|---|---|---|---|
# | … | Hello | Привет! (Privyet!) | … |
# | … | Привет! (Privyet!) | Hello | … |
Both phrase entries are the same in practice. |
##Open questions:
- To have source and target language ids referenced in both book table and corpus table?
- How to define source language / target language positions in phrase display?
#Notes:
ISO 639-3
The International Standards Organization, along with the Summer Institute for Linguistics has devised an ISO code for languages that has seen six revisions so far. Wikitongues uses the third varient, ISO 639-3. Read more on Wikipedia