Panos Louridas edited this page Jul 27, 2018 · 9 revisions

spaCy speaks Greek

Welcome to the home repository of Greek language integration for spaCy.

This project is developed for Google Summer of Code 2018, under the auspices of GFOSS - Open Technologies Alliance.

Problem statement - project goals

We live in the era of data. Every minute, 3.8 billion internet users produce content: more than 120 million emails, 500,000 Facebook comments, and 3 million Google searches. To process that amount of data efficiently, we need to process natural language. Open source projects such as spaCy, TextBlob, and NLTK contribute significantly in that direction, and thus they need to be reinforced.

This project is about improving the quality of Natural Language Processing for the Greek language. The project goals can be categorized as follows:

  1. Integration of Greek language support into the spaCy platform.

  2. Production of models for Part-Of-Speech (POS) tagging, Dependency Analysis (DEP), and Named Entity Recognition (NER), with and without vectors.

  3. A live website (demo) where people can analyze their own Greek text. Ideally, the analyzer will be able to do the following:

    1. Split text into tokens.
    2. Lemmatize each token.
    3. Find POS tag for each token.
    4. Find dependencies between tokens.
    5. Extract the following Named Entities, if any: ORG, PRODUCT, GPE, LOC, EVENT, PERSON.
    6. Classify the text into one of the following categories: Sports, Science, World News, Greek News, Environment, Politics, Art, Health.
    7. Summarize text.
    8. Extract text topics.
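
Steps 1 to 5 of the analyzer list above map onto attributes that spaCy exposes on a parsed document. The following is a minimal sketch, not the demo's actual code: the helper names `load_greek_pipeline` and `analyze` are hypothetical, and it assumes spaCy and a Greek model (here the `el_core_web_sm` name used on this page) are installed.

```python
from typing import Dict, List, Tuple


def load_greek_pipeline(model_name: str = "el_core_web_sm"):
    """Load a Greek spaCy pipeline.

    Assumption: the model name is the small model named on this page and
    is installed locally. spaCy is imported lazily so that the rest of
    this sketch can be read and tested without it.
    """
    import spacy  # third-party dependency: pip install spacy

    return spacy.load(model_name)


def analyze(doc) -> Dict[str, List[Tuple]]:
    """Collect token-level annotations and named entities from a spaCy Doc.

    Each token row holds: text, lemma, POS tag, dependency label and the
    text of the token's syntactic head (steps 1 to 4 of the list above).
    Each entity row holds: entity text and entity label (step 5).
    """
    tokens = [
        (token.text, token.lemma_, token.pos_, token.dep_, token.head.text)
        for token in doc
    ]
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    return {"tokens": tokens, "entities": entities}
```

With a model installed, something like `analyze(load_greek_pipeline()("Η Ελλάδα είναι μέλος της Ευρωπαϊκής Ένωσης."))` would return one row per token plus whatever entities the model finds. Classification, summarization, and topic extraction (steps 6 to 8) are not covered by this sketch.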

Results

We consider the project successful, because we have achieved the majority of the goals mentioned above :)

Integration of Greek language to spaCy

When we started the project, spaCy didn't support the Greek language. Just 2 days before our pull request, a Greek company opened a pull request adding a first version of Greek language support to spaCy. That was a surprise for us, but we quickly discovered that our efforts were complementary. They had mostly worked on the tokenizer, while we had focused our efforts on a rule-based lemmatization approach, the collection of norm exceptions, the construction of a reliable stop-word list, etc.

We opened a pull request, which was merged with really motivating comments and... voilà: the first project goal was achieved! 🏆

You can see the pull request here.
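
To give an idea of what a rule-based lemmatization approach involves, here is a minimal sketch of suffix-rewrite lemmatization: a word is looked up in an exceptions table first, and otherwise suffix rules are tried and only results found in a list of known lemmas are kept. The rules, the lemma index, the exceptions, and the function name `lemmatize` are toy assumptions chosen for illustration; they are not the rules contributed in the pull request.

```python
from typing import Dict, List, Sequence, Set, Tuple

# Hypothetical toy data, NOT the project's actual rules.
# Each rule rewrites a word ending: (old suffix, new suffix).
RULES: Sequence[Tuple[str, str]] = [("οι", "ος"), ("ες", "α")]
# Known lemmas; a rewritten form is accepted only if it appears here.
INDEX: Set[str] = {"άνθρωπος", "γυναίκα"}
# Irregular forms that no suffix rule can handle.
EXCEPTIONS: Dict[str, List[str]] = {"είναι": ["είμαι"]}


def lemmatize(
    word: str,
    rules: Sequence[Tuple[str, str]] = RULES,
    index: Set[str] = INDEX,
    exceptions: Dict[str, List[str]] = EXCEPTIONS,
) -> List[str]:
    """Return candidate lemmas for a word, or the word itself if none apply."""
    word = word.lower()
    # 1. Irregular forms are resolved directly from the exceptions table.
    if word in exceptions:
        return list(exceptions[word])
    # 2. A word that is already a known lemma is returned unchanged.
    if word in index:
        return [word]
    # 3. Try every suffix rule and keep rewritten forms found in the index.
    candidates: List[str] = []
    for old, new in rules:
        if word.endswith(old):
            form = word[: len(word) - len(old)] + new
            if form in index and form not in candidates:
                candidates.append(form)
    # 4. Fall back to the lowercased word when no rule produced a known lemma.
    return candidates or [word]
```

With the toy data above, `lemmatize("άνθρωποι")` yields `["άνθρωπος"]` through the first rule, while `lemmatize("είναι")` yields `["είμαι"]` through the exceptions table.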

Features and models

We have produced two models for the Greek language: el_core_web_sm and el_core_web_lg. A wiki page for the models will be released soon. The models support the following features:

  1. Part of Speech tagging.

  2. Dependency Analysis.

  3. Named Entities Extraction.
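
One way to check which of these features a loaded pipeline provides is to inspect its component names. The sketch below assumes the conventional spaCy component names `tagger`, `parser`, and `ner`, which may differ in a given model; the helper names `supported_features` and `has_vectors` are hypothetical.

```python
from typing import Dict

# Assumption: conventional spaCy component names for each feature.
FEATURE_COMPONENTS: Dict[str, str] = {
    "Part of Speech tagging": "tagger",
    "Dependency Analysis": "parser",
    "Named Entities Extraction": "ner",
}


def supported_features(nlp) -> Dict[str, bool]:
    """Map each feature to whether its component is in the pipeline."""
    names = set(nlp.pipe_names)
    return {
        feature: component in names
        for feature, component in FEATURE_COMPONENTS.items()
    }


def has_vectors(nlp) -> bool:
    """Report whether the pipeline's vocabulary ships word vectors."""
    return nlp.vocab.vectors_length > 0
```

Here `nlp` would be the object returned by `spacy.load("el_core_web_sm")` or `spacy.load("el_core_web_lg")`, assuming those models are installed under the names given on this page.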

Demo

Soon to be released :)

Links to guide you through the project

People

Behind every project, there are people struggling to make it work :)

The contributors to this project are the following:

Google Summer of Code Student: Ioannis Daras

Mentor: Markos Gogoulos

Mentor: Panos Louridas
