datnt88/identity-language

Author: Nguyen Tien Dat
Email: [email protected]

This source implements two techniques, Trigram and Small Word,
to solve the problem of language identification.

Reference:
Gregory Grefenstette, "Comparing Two Language Identification Schemes",
Xerox Research Centre Europe.

http://www.academia.edu/375397/Comparing_Two_Language_Identification_Schemes
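As a rough illustration of the trigram idea, here is a minimal sketch. It is
not the code in src/. The class name TrigramSketch, the '_' word padding, and
the floor probability for unseen trigrams are assumptions made for this
example. Each language model is a map of trigram counts, and a document is
assigned to the language whose model gives the highest summed log relative
frequency.

```java
import java.util.HashMap;
import java.util.Map;

public class TrigramSketch {

    /** Counts character trigrams per word; each word is padded with '_' (an assumption). */
    static Map<String, Integer> countTrigrams(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : text.toLowerCase().split("[^\\p{L}]+")) {
            if (word.isEmpty()) continue;
            String padded = "_" + word + "_";
            for (int i = 0; i + 3 <= padded.length(); i++) {
                counts.merge(padded.substring(i, i + 3), 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Sums the log relative frequencies of the document's trigrams under one model. */
    static double logScore(Map<String, Integer> model, String document) {
        double total = 0;
        for (int c : model.values()) total += c;
        if (total == 0) return Double.NEGATIVE_INFINITY; // empty model can never win
        // Assumption: unseen trigrams get a floor below the rarest seen trigram.
        double floor = Math.log(0.5 / total);
        double score = 0;
        for (Map.Entry<String, Integer> e : countTrigrams(document).entrySet()) {
            Integer seen = model.get(e.getKey());
            double logP = (seen == null) ? floor : Math.log(seen / total);
            score += e.getValue() * logP;
        }
        return score;
    }

    /** Returns the name of the best-scoring language, or null if there are no models. */
    static String identify(Map<String, Map<String, Integer>> models, String document) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Map<String, Integer>> e : models.entrySet()) {
            double s = logScore(e.getValue(), document);
            if (s > bestScore) {
                bestScore = s;
                best = e.getKey();
            }
        }
        return best;
    }
}
```

A model here would be built by calling countTrigrams on the training text of
one language. Real models need far more text than a few sentences.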

========================================================
Implementation:

source:          src/
project file:    pom.xml
training corpus: dataset/

To build the model for a new language:
./target/build-model.jar -c <corpus> -l <language>

Note: there are currently 6 language models:
English, German, French, Spanish, Czech and Vietnamese

To identify the language of a given document with a chosen
technique (ngram or smallword):
./target/indentity-language.jar -d <document> -t <technique>
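
For the smallword option, the sketch below shows one way the idea could work.
It is an illustration only, not the code in src/. The class name
SmallWordSketch, the 5-letter cutoff (MAX_LEN), and the floor probability for
unseen words are assumptions made for this example. Each model counts the
short words of one language, and a document is scored using only its short
tokens.

```java
import java.util.HashMap;
import java.util.Map;

public class SmallWordSketch {

    /** Assumption: a word is "small" if it has at most this many letters. */
    static final int MAX_LEN = 5;

    /** Counts the lowercase tokens of the text that are at most MAX_LEN letters long. */
    static Map<String, Integer> countSmallWords(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : text.toLowerCase().split("[^\\p{L}]+")) {
            if (!word.isEmpty() && word.length() <= MAX_LEN) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Sums the log relative frequencies of the document's small words under one model. */
    static double logScore(Map<String, Integer> model, String document) {
        double total = 0;
        for (int c : model.values()) total += c;
        if (total == 0) return Double.NEGATIVE_INFINITY; // empty model can never win
        // Assumption: unseen words get a floor below the rarest seen word.
        double floor = Math.log(0.5 / total);
        double score = 0;
        for (Map.Entry<String, Integer> e : countSmallWords(document).entrySet()) {
            Integer seen = model.get(e.getKey());
            double logP = (seen == null) ? floor : Math.log(seen / total);
            score += e.getValue() * logP;
        }
        return score;
    }

    /** Returns the name of the best-scoring language, or null if there are no models. */
    static String identify(Map<String, Map<String, Integer>> models, String document) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Map<String, Integer>> e : models.entrySet()) {
            double s = logScore(e.getValue(), document);
            if (s > bestScore) {
                bestScore = s;
                best = e.getKey();
            }
        }
        return best;
    }
}
```

Short function words such as "the" or "le" are frequent and language
specific, which is why a small vocabulary can already separate languages on
documents of reasonable length.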

