Word embeddings have uses in downstream neural networks, and also in a deep reading of a particular corpus. While their results should always be taken with a grain of salt, they can reveal noteworthy insights.
So, I decided to take my favorite collection of short stories and novels, the Sherlock Holmes collection, and generate word embeddings using Google's word2vec model with the help of gensim.
There are two pre-trained word embedding files (`.bin`) in `output/`: one with vectors trained on this corpus using the Continuous Bag of Words (`cbow`) method, and one using the Skip-gram (`skip`) method. My personal insights from the corpus are in the `.txt` files in `output/`. The source text for these vectors is the USA version.
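If you just want to explore the pre-trained vectors, here is a minimal sketch of loading one of the `.bin` files with gensim. The exact file name under `output/` is an assumption based on the example output name used later in this README, and the lowercase vocabulary key is also an assumption about how the corpus was tokenized:

```python
from gensim.models import KeyedVectors

# Load pre-trained vectors; assumes the file was saved in the
# word2vec binary format (the path here is illustrative).
wv = KeyedVectors.load_word2vec_format("output/skip_vectors.bin", binary=True)

# Words most similar to "holmes" by cosine similarity.
print(wv.most_similar("holmes", topn=5))
```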
As mentioned before, this generator uses gensim to create the word embeddings, and nltk to parse the sentences into tokens.
pip install gensim
pip install nltk
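Gensim's Word2Vec expects the corpus as a list of tokenized sentences. Here is a minimal sketch of that nltk step; the lowercasing is my assumption, not necessarily what embeddings.py actually does:

```python
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download("punkt")  # one-time download of the tokenizer models

with open("data/curated.txt", encoding="utf-8") as f:
    text = f.read()

# Split the text into sentences, then each sentence into word tokens;
# word2vec expects a list of token lists.
sentences = [word_tokenize(s.lower()) for s in sent_tokenize(text)]
```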
If you want to make your own vectors, you will need to pass in 3 arguments. These are:
- A source file (e.g. `curated.txt`)
- An output name for your vector file (e.g. `skip_vectors.bin`). This needs to have a `.bin` file extension.
- An embedding type: either `cbow` for Continuous Bag of Words or `skip` for Skip-gram.
Then, run the command in this form:
python embeddings.py <source_file> <output_name.bin> <cbow or skip>
For example:
python embeddings.py "data/curated.txt" "skip_vectors.bin" "skip"
Word2Vec: https://deeplearning4j.org/word2vec.html
Skip-gram methodology: http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
More on Gensim: https://radimrehurek.com/gensim/models/word2vec.html and https://machinelearningmastery.com/develop-word-embeddings-python-gensim/
Cosine Similarity: https://stackoverflow.com/questions/15173225/how-to-calculate-cosine-similarity-given-2-sentence-strings-python
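For reference, gensim's similarity score is just the cosine of the angle between two word vectors. A small sketch checking that by hand with numpy, again assuming the illustrative file name and lowercase keys from above:

```python
import numpy as np
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("output/skip_vectors.bin", binary=True)

def cosine(a, b):
    # cos(theta) = (a . b) / (||a|| * ||b||)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# The two numbers should agree up to floating-point noise.
print(cosine(wv["holmes"], wv["watson"]))
print(wv.similarity("holmes", "watson"))
```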
This project was made using Python 3.5.4