GitHub - geetua/language_identification

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
models		models
testdata		testdata
traindata		traindata
README		README
__init__.py		__init__.py
test.py		test.py
train.py		train.py
utils.py		utils.py
utils.pyc		utils.pyc

Repository files navigation

==================================
     Language Identification
==================================

I wrote a simple module to identify language of an input sentence

I grabbed some training data from http://cls.informatik.uni-leipzig.de/
I use 100k sentences for each of the 3 languages eng, fra, deu - data is in
traindata directory. I also took a non intersecting corpus of test sentences 
in each language and these can be found in directory labelled testdata. 

In order to train models, please run 'python train.py'
This script will run through the train files and create unigram, bigram
and trigram statistics for each language and write them to the directory 
labelled 'models'

In order to run over test corpus, run 'python test.py' after setting
sanity_check in the script to 'True' 
The script can also be run on command line with options 
'python test.py -n 2 -i 'est la sélection de joueurs de'
This will output a prediction language using the pretrained models 
from the models directory

-----------
Example: 
python test.py -n 3 -i 'i like ham' (with sanity_check=True)

Accuracy on lm using ngrams( 1 ) :  96.8333333333

Accuracy on lm using ngrams( 2 ) :  99.3333333333

Accuracy on lm using ngrams( 3 ) :  100.0

Predictions: [(1.8920271199218448, 'eng'), (2.2868514076023674, 'deu'), (2.3757705107667095, 'fra')]

Input sentence is identified as:  eng
-----------

Note: 
eng = english
fra = french
deu = german