Skip to content

geetua/language_identification

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

==================================
     Language Identification
==================================

I wrote a simple module to identify language of an input sentence

I grabbed some training data from http://cls.informatik.uni-leipzig.de/
I use 100k sentences for each of the 3 languages eng, fra, deu - data is in
traindata directory. I also took a non intersecting corpus of test sentences 
in each language and these can be found in directory labelled testdata. 

In order to train models, please run 'python train.py'
This script will run through the train files and create unigram, bigram
and trigram statistics for each language and write them to the directory 
labelled 'models'

In order to run over test corpus, run 'python test.py' after setting
sanity_check in the script to 'True' 
The script can also be run on command line with options 
'python test.py -n 2 -i 'est la sélection de joueurs de'
This will output a prediction language using the pretrained models 
from the models directory

-----------
Example: 
python test.py -n 3 -i 'i like ham' (with sanity_check=True)

Accuracy on lm using ngrams( 1 ) :  96.8333333333

Accuracy on lm using ngrams( 2 ) :  99.3333333333

Accuracy on lm using ngrams( 3 ) :  100.0

Predictions: [(1.8920271199218448, 'eng'), (2.2868514076023674, 'deu'), (2.3757705107667095, 'fra')]

Input sentence is identified as:  eng
-----------

Note: 
eng = english
fra = french
deu = german

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages