#Wikitongues Database

The following concepts were developed during the hackathon hosted by the Recurse Center in New York, USA on April 9th, 2016.

##Basic terms

- **Phrase**: Text or video unit.
- **Phrase pair**: Relationship between phrases implying translation between two languages.
- **Dictionary**: Collection of phrase pairs.
- **Book**: Container for a dictionary with extra user metadata such as titles, descriptions, authorship, etc.
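
As a rough illustration of how these four terms nest, here is a minimal Python sketch. The class and field names (`language_id`, `value`, `kind`, `author`, and so on) are assumptions made for the example and are not part of the hackathon notes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Phrase:
    language_id: str    # hypothetical field: language code such as "eng"
    value: str          # the text itself, or a link to a video resource
    kind: str = "text"  # "text" or "video"


@dataclass(frozen=True)
class PhrasePair:
    # A translation relationship between two phrases.
    source: Phrase
    target: Phrase


@dataclass
class Dictionary:
    # A collection of phrase pairs.
    pairs: List[PhrasePair] = field(default_factory=list)


@dataclass
class Book:
    # A dictionary plus user-facing metadata.
    title: str
    author: str
    dictionary: Dictionary
    description: str = ""


hello = PhrasePair(Phrase("eng", "Hello"), Phrase("rus", "Привет!"))
book = Book("Russian for beginners", "author-id", Dictionary([hello]))
```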

##Index table

The index table accounts for all languages in the world.

| ID | Names | … |
| --- | --- | --- |
| `eng` | [English, Inglês, Anglais,…] | … |
| `cmn` | [Chinese (Mandarin), Beifang Fangyan, Guanhua,…] | … |
| … | … | … |

The Names column holds an array of all of the strings used to name a language.
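
One use of the names array is resolving whatever name a user types to a language ID. The sketch below assumes an in-memory dictionary holding only the two sample rows; `LANGUAGE_INDEX` and `find_language_id` are hypothetical names, and the case-insensitive match is a choice made for the example.

```python
# Only the sample rows shown in the index table; a real index would be far larger.
LANGUAGE_INDEX = {
    "eng": ["English", "Inglês", "Anglais"],
    "cmn": ["Chinese (Mandarin)", "Beifang Fangyan", "Guanhua"],
}


def find_language_id(name, index=LANGUAGE_INDEX):
    """Return the ID of the language known by `name`, or None if there is no match."""
    wanted = name.casefold()
    for language_id, names in index.items():
        if any(wanted == candidate.casefold() for candidate in names):
            return language_id
    return None


print(find_language_id("Anglais"))
print(find_language_id("guanhua"))
```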

##Method 1

Each language has a corpus table with all of the phrases in that language. Each user owns their authored books.

User Table

| ID | User Name | Books (Has many) | … |
| --- | --- | --- | --- |
| # | Johnathan Swift | [Reference to book table] | … |
| … | … | … | … |

Book Table

Each book would have its own table defining which phrases it owns.

| Source records `eng` | Target records `spa` |
| --- | --- |
| Record # | Record # |
| … | … |

We know which language corpus to refer to by the table headers.

I wonder whether the book reference on the user table could be something along the lines of:

    {
      "eng": "spa",
      "dictionary": {
        "record #": "record #",
        "record #": "record #"
      }
    }

Eng Corpus

| ID | Value | Type | Meta | … |
| --- | --- | --- | --- | --- |
| # | hello | text |  | … |
| # | link/to/resource | video | olá | … |
| … | … | … | … | … |

Each record is a phrase. Phrases can exist in multiple dictionaries or phrase books.

For written languages, you can have both text entries and video entries.

For languages that are not written, video will be used. For video to be indexed by meaning, textual metadata needs to exist. The video's meta information comes from one of the following:

- If the phrase pair is between text and video, the text is enough to index the video.
- If the phrase pair is between video and video, user input is needed to index the video content.
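
The two cases above can be sketched as a small function. This is only an illustration of the rule: `index_text_for_video` and its parameters are hypothetical names, and raising an error when user input is missing is an assumption about how the gap might be handled.

```python
def index_text_for_video(partner_type, partner_value, user_meta=None):
    """Pick the text used to index a video phrase.

    partner_type / partner_value describe the other phrase in the pair;
    user_meta is text supplied by the user, if any.
    """
    if partner_type == "text":
        # Text-to-video pair: the text side is enough to index the video.
        return partner_value
    if partner_type == "video":
        # Video-to-video pair: nothing textual exists unless the user adds it.
        if not user_meta:
            raise ValueError("video-to-video pairs need user-supplied text")
        return user_meta
    raise ValueError(f"unknown phrase type: {partner_type!r}")
```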

Question: should the user table reference books, or should the books table reference the author?

##Method 2

A centralized books table references authors. Each language pair has its own unique corpus table. Dictionaries are defined in belongs-to relationships as phrase pair book IDs.

User Table

| ID | User Name | OAuth | … |
| --- | --- | --- | --- |
| # | Daniel Udell | Token | … |
| # | Frederico Andrade | Token | … |
| … | … | … | … |

Books table

| ID | Title | Source language ID | Target language ID | User | … |
| --- | --- | --- | --- | --- | --- |
| # | Russian for beginners | `eng` | `rus` | Author's # | … |
| # | Aprenda Japonês | `por` | `jpn` | Author's # | … |
| … | … | … | … | … | … |

eng && rus Corpus

| ID | Language 1 ID (`eng`) | Language 2 ID (`rus`) | Book # | … |
| --- | --- | --- | --- | --- |
| # | Hello | Привет! (Privyet!) | Book ID | … |
| … | … | … | … | … |

The language pair corpus, or translation corpus, represents all of the phrases that exist between two languages.

Downsides: with roughly 7,000 languages, 7000 × 7000 = 49 million language pairs, which means up to 49 million tables.
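
As a quick sanity check on that figure, the sketch below counts the pairs, taking 7,000 as the document's round estimate of the number of languages. It also shows the smaller count that would apply if each unordered pair (eng/rus and rus/eng together) shared one table, which is an assumption these notes do not spell out.

```python
from math import comb

LANGUAGES = 7000  # round estimate used in these notes

# Every (source, target) combination, including a language paired with itself.
ordered_pairs = LANGUAGES * LANGUAGES

# Distinct pairs of two different languages, ignoring direction.
unordered_pairs = comb(LANGUAGES, 2)

print(f"{ordered_pairs:,}")    # 49,000,000
print(f"{unordered_pairs:,}")  # 24,496,500
```

Either way the table count is in the tens of millions.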

##Method 3

Single universal corpus with all phrase pairs. In this proposal, dictionaries are aggregates of corpus entries, specified by the interface #.

Full corpus

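| ID | Source Lang ID | Target Lang ID | Source Value | Target Value | Source Type | Target Type | Source Meta | Target Meta | Interface ID | … |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| # | `eng` | `rus` | Hello | Привет! (Privyet!) | Text | Text | … | … | Reference to Book Interface Table | … |
| # | `eng` | `rus` | Hello | link/to/video | Text | Video | … | hello | Reference to Book Interface Table | … |
| … | … | … | … | … | … | … | … | … | … | … |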

Interface Table

| ID | Interface # | Book # | … |
| --- | --- | --- | --- |
| # | Interface ID | Book ID | … |
| … | … | … | … |
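
To make the aggregation concrete, here is a minimal sketch of Method 3 using Python's built-in `sqlite3` module. It reads the interface table as a join table that maps an interface number to a book, which is one interpretation of the notes. The column names, the sample IDs (`1` and `10`), and the helper `dictionary_for_book` are all assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE interface (
        id           INTEGER PRIMARY KEY,
        interface_no INTEGER NOT NULL,
        book_id      INTEGER NOT NULL
    );
    CREATE TABLE corpus (
        id             INTEGER PRIMARY KEY,
        source_lang_id TEXT NOT NULL,
        target_lang_id TEXT NOT NULL,
        source_value   TEXT NOT NULL,
        target_value   TEXT NOT NULL,
        source_type    TEXT NOT NULL,
        target_type    TEXT NOT NULL,
        source_meta    TEXT,
        target_meta    TEXT,
        interface_no   INTEGER NOT NULL
    );
    """
)

# Hypothetical sample data: interface 1 belongs to book 10.
conn.execute("INSERT INTO interface (interface_no, book_id) VALUES (?, ?)", (1, 10))
conn.executemany(
    "INSERT INTO corpus (source_lang_id, target_lang_id, source_value, target_value,"
    " source_type, target_type, source_meta, target_meta, interface_no)"
    " VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
    [
        ("eng", "rus", "Hello", "Привет!", "text", "text", None, None, 1),
        ("eng", "rus", "Hello", "link/to/video", "text", "video", None, "hello", 1),
    ],
)


def dictionary_for_book(connection, book_id):
    """Return the (source, target) phrase pairs aggregated under one book."""
    return connection.execute(
        """
        SELECT c.source_value, c.target_value
        FROM corpus AS c
        JOIN interface AS i ON i.interface_no = c.interface_no
        WHERE i.book_id = ?
        ORDER BY c.id
        """,
        (book_id,),
    ).fetchall()


print(dictionary_for_book(conn, 10))
```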

Duplicate problem

The following table illustrates a data duplicate problem.

| ID | … | Source Value | Target Value | … |
| --- | --- | --- | --- | --- |
| # | … | Hello | Привет! (Privyet!) | … |
| # | … | Привет! (Privyet!) | Hello | … |

Both phrase entries are the same in practice.
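
One possible way to detect such duplicates, offered only as a sketch, is to derive a direction-independent key for each pair so that both rows map to the same value. The function `pair_key` is a hypothetical helper and is not something these notes propose.

```python
def pair_key(source_lang, source_value, target_lang, target_value):
    """Build a key that is identical for a pair and its reversed form."""
    sides = [(source_lang, source_value), (target_lang, target_value)]
    # Sorting the two sides removes the source/target direction from the key.
    return tuple(sorted(sides))


forward = pair_key("eng", "Hello", "rus", "Привет!")
reverse = pair_key("rus", "Привет!", "eng", "Hello")
print(forward == reverse)  # True
```

A unique constraint on such a key would reject the second row, at the cost of no longer recording which side the author treated as the source.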

##Open questions:

- Should source and target language IDs be referenced in both the book table and the corpus table?
- How should source language / target language positions be defined in phrase display?

#Notes:

ISO 639-3

The International Organization for Standardization (ISO) maintains ISO 639, a family of language-code standards published in several parts. Wikitongues uses the third part, ISO 639-3, whose registration authority is SIL International (formerly the Summer Institute of Linguistics). Read more on Wikipedia.
