Word embeddings have uses in downstream neural networks, and also in a deep reading of a particular corpus. While their results should always be taken with a grain of salt, they can reveal noteworthy insights.
So, I decided to take my favorite collection of short stories and novels, the Sherlock Holmes collection, and generate word embeddings using Google's word2vec model with the help of gensim.
There are two pre-trained word embedding files (`.bin`) in `output/`: one with vectors trained on this corpus using the Continuous Bag of Words (`cbow`) method, and one using the Skip-gram (`skip`) method. My personal insights from the corpus are in the `.txt` files in `output/`. The source text for these vectors is the USA version.
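If you just want to explore the pre-trained vectors, here is a minimal sketch of loading one of the `.bin` files with gensim. The exact file name under `output/` is an assumption based on the example output name used later in this README, and the lowercase vocabulary key is also an assumption about how the corpus was tokenized:

```python
from gensim.models import KeyedVectors

# Load pre-trained vectors; assumes the file was saved in the
# word2vec binary format (the path here is illustrative).
wv = KeyedVectors.load_word2vec_format("output/skip_vectors.bin", binary=True)

# Words most similar to "holmes" by cosine similarity.
print(wv.most_similar("holmes", topn=5))
```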
As mentioned before, this generator uses gensim to create the word embeddings, and nltk to parse the sentences into tokens.
pip install gensim
pip install nltk
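Gensim's Word2Vec expects the corpus as a list of tokenized sentences. Here is a minimal sketch of that nltk step; the lowercasing is my assumption, not necessarily what embeddings.py actually does:

```python
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download("punkt")  # one-time download of the tokenizer models

with open("data/curated.txt", encoding="utf-8") as f:
    text = f.read()

# Split the text into sentences, then each sentence into word tokens;
# word2vec expects a list of token lists.
sentences = [word_tokenize(s.lower()) for s in sent_tokenize(text)]
```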
If you want to make your own vectors, you will need to pass in 3 arguments. These are:
- A source file (e.g. `curated.txt`)
- An output name for your vector file (e.g. `skip_vectors.bin`). This needs to have a `.bin` file extension.
- An embedding type: either `cbow` for Continuous Bag of Words or `skip` for Skip-gram.
Then, run the command in this form:
python embeddings.py <source_file> <output_name.bin> <cbow or skip>
For example:
python embeddings.py "data/curated.txt" "skip_vectors.bin" "skip"
Word2Vec: https://deeplearning4j.org/word2vec.html
Skip-gram methodology: http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
More on Gensim: https://radimrehurek.com/gensim/models/word2vec.html and https://machinelearningmastery.com/develop-word-embeddings-python-gensim/
Cosine Similarity: https://stackoverflow.com/questions/15173225/how-to-calculate-cosine-similarity-given-2-sentence-strings-python
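For reference, gensim's similarity score is just the cosine of the angle between two word vectors. A small sketch checking that by hand with numpy, again assuming the illustrative file name and lowercase keys from above:

```python
import numpy as np
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("output/skip_vectors.bin", binary=True)

def cosine(a, b):
    # cos(theta) = (a . b) / (||a|| * ||b||)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# The two numbers should agree up to floating-point noise.
print(cosine(wv["holmes"], wv["watson"]))
print(wv.similarity("holmes", "watson"))
```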
This project was made using Python 3.5.4