A systematic framework that uses diachronic word embeddings to trace semantic shifts in the usage of Greek words over time.
This repository accompanies the paper "Studying the Evolution of Greek Words via Word Embeddings" by V. Barzokas, E. Papagiannopoulou and G. Tsoumakas, published in the proceedings of the 11th Hellenic Conference on Artificial Intelligence (SETN 2020), and contains the set of tools developed and the data prepared for its needs. The paper is available at https://doi.org/10.1145/3411408.3411425
If you use this code and/or data in your research, please cite the following:
```bibtex
@inproceedings{10.1145/3411408.3411425,
  author    = {Barzokas, Vasileios and Papagiannopoulou, Eirini and Tsoumakas, Grigorios},
  title     = {Studying the Evolution of Greek Words via Word Embeddings},
  booktitle = {11th Hellenic Conference on Artificial Intelligence},
  series    = {SETN 2020},
  pages     = {118--124},
  year      = {2020},
  location  = {Athens, Greece},
  isbn      = {9781450388788},
  publisher = {Association for Computing Machinery},
  address   = {New York, NY, USA},
  url       = {https://doi.org/10.1145/3411408.3411425},
  doi       = {10.1145/3411408.3411425}
}
```
- The most relevant word is highlighted; the rest are listed in descending order of relevance.
- Python 3.6.9
- fastText - a library for efficient learning of word representations and sentence classification.
- Clone this repository by running:

  ```shell
  git clone [email protected]:intelligence-csd-auth-gr/greek-words-evolution.git
  ```
- Clone the required fastText repository by running:

  ```shell
  git submodule init
  git submodule update
  ```
- Install the fastText library for your system as described in its documentation, which can be found at https://github.com/facebookresearch/fastText. Note: normally all that is required is:

  ```shell
  cd fastText
  make
  pip install .
  ```
- Install the required Python libraries by running:

  ```shell
  pip install -r requirements.txt
  ```
- If running for the first time, create the text files per period by running:

  ```shell
  python gws.py text --action exportByPeriod
  ```
- Then create the models from those text files by running:

  ```shell
  python gws.py model --action create
  ```
- After the models have been generated, you can see the nearest neighbours of a word by running something similar to this example:

  ```shell
  python gws.py model --action getNN --word ποντίκι --period 2010
  ```

  Output:

  ```
  ['ποντικι', 'φακα', 'πιασμενο', 'ταμπλετ', 'κατσαβιδι', 'μπιλη', 'γατα', 'ποντικοπαγιδα', 'αραχνη', 'βιντεοκασετα', 'κοριο', 'πληκτρολογιο', 'ποντικο', 'κλακ', 'κατεβασεις', 'μιξερ', 'ποντικακι', 'τσιπακι', 'μεγαλουτσικο', 'συνδεθω', 'μυγοσκοτωστρα']
  ```
- Get the 10 words with the highest semantic change, based on their cosine distance:

  ```shell
  python gws.py model --action getCD --fromYear 1980 --toYear 2020
  ```
- Get the 10 words with the highest semantic change, based on their cosine similarity (the cosine-distance list sorted in the opposite order):

  ```shell
  python gws.py model --action getCS --fromYear 1980 --toYear 2020
  ```
The script accepts one of the following positional arguments:

- `website` - allows actions on the websites, such as URL extraction, file downloading etc.
- `metadata` - allows actions on the metadata, such as metadata display or export etc.
- `text` - allows actions on the text, such as text extraction, display or export etc.
- `model` - allows actions on the trained models, such as training, evaluation through nearest neighbours, or tracing shifts of word meanings across periods.
In order to see a full list of the available options and a short description of each one, type:

```shell
python gws.py --help
```
The snippets below give a brief description of the options that each positional argument accepts.
```
usage: gws.py website [-h] [--target {openbook}]
                      [--action {fetchLinks,fetchMetadata,fetchFiles}]

optional arguments:
  -h, --help            show this help message and exit
  --target {openbook}   Target website to scrape data from
  --action {fetchLinks,fetchMetadata,fetchFiles}
                        The action to execute on the selected website
```
```
usage: gws.py metadata [-h] [--corpus {all,openbook,project_gutenberg}]
                       [--action {printStandard,printEnhanced,exportEnhanced}]
                       [--fromYear FROMYEAR] [--toYear TOYEAR]
                       [--splitYearsInterval SPLITYEARSINTERVAL]

optional arguments:
  -h, --help            show this help message and exit
  --corpus {all,openbook,project_gutenberg}
                        The name of the target corpus to work with
  --action {printStandard,printEnhanced,exportEnhanced}
                        Action to perform against the metadata of the selected
                        text corpus
  --fromYear FROMYEAR   The target starting year to extract data from
  --toYear TOYEAR       The target ending year to extract data from
  --splitYearsInterval SPLITYEARSINTERVAL
                        The interval to split the years with and export the
                        extracted data
```
```
usage: gws.py text [-h] [--corpus {all,openbook,project_gutenberg}]
                   [--action {combineByPeriod,extractFromPDF}]
                   [--fromYear FROMYEAR] [--toYear TOYEAR]
                   [--splitYearsInterval SPLITYEARSINTERVAL]

optional arguments:
  -h, --help            show this help message and exit
  --corpus {all,openbook,project_gutenberg}
                        The name of the target corpus to work with
  --action {combineByPeriod,extractFromPDF}
                        Action to perform against the selected text corpus
  --fromYear FROMYEAR   The target starting year to extract data from
  --toYear TOYEAR       The target ending year to extract data from
  --splitYearsInterval SPLITYEARSINTERVAL
                        The interval to split the years with and export the
                        extracted data
```
```
usage: gws.py model [-h] [--action {create,getNN,getCS,getCD}] [--word WORD]
                    [--period PERIOD] [--textsFolder TEXTSFOLDER]
                    [--fromYear FROMYEAR] [--toYear TOYEAR]

optional arguments:
  -h, --help            show this help message and exit
  --action {create,getNN,getCS,getCD}
                        Action to perform against the selected model
  --word WORD           Target word to get nearest neighbours for
  --period PERIOD       The target period to load the model from
  --textsFolder TEXTSFOLDER
                        The target folder that contains the text files
  --fromYear FROMYEAR   The target starting year to create the model for
  --toYear TOYEAR       The target ending year to create the model for
```
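For illustration, the ranking performed by the `getCD` action can be sketched as follows. This is a toy reconstruction, not the repository's actual implementation: the 2-dimensional vectors and the `rank_by_change` helper are invented, and the sketch assumes the two period models already live in comparable vector spaces (independently trained embedding spaces would first need to be aligned):

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity: 0 means identical direction, larger means bigger shift."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

# Hypothetical embeddings of the same words in two period models.
model_1980 = {
    "ποντίκι": [0.9, 0.1],  # toy vector: the animal sense dominates
    "σπίτι":   [0.5, 0.5],  # "house": stable meaning
}
model_2020 = {
    "ποντίκι": [0.1, 0.9],  # toy vector: the computer-mouse sense dominates
    "σπίτι":   [0.5, 0.5],
}

def rank_by_change(old, new, k=10):
    """Words shared by both models, sorted by cosine distance (largest shift first)."""
    shared = old.keys() & new.keys()
    changes = {w: cosine_distance(old[w], new[w]) for w in shared}
    return sorted(changes.items(), key=lambda c: c[1], reverse=True)[:k]

print(rank_by_change(model_1980, model_2020))
```

In this toy setup "ποντίκι" tops the list because its vector rotated between the two periods, while the unchanged "σπίτι" has distance zero; `getCS` would simply present the same list sorted the opposite way.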