gschultz49/sherlock-word-embeddings


I think word embeddings are pretty cool

They are useful as inputs to downstream neural networks and for a close reading of a particular corpus. While their results should always be taken with a grain of salt, they can reveal noteworthy insights.

So, I decided to take my favorite collection of short stories and novels, the Sherlock Holmes collection, and generate word embeddings using the Google word2vec model with the help of gensim.

There are two pre-trained word embedding files (.bin) in output/, with vectors trained on this corpus using the Continuous Bag of Words (cbow) method and the Skip-gram (skip) method. My personal insights from the corpus are in the .txt files in output/. The source text for these vectors is from the USA version.
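If you just want to poke at the pre-trained vectors, something like the sketch below should work. It assumes the .bin files are in the standard word2vec binary format that gensim's `KeyedVectors.load_word2vec_format` can read, which is not confirmed here. The helper names `load_vectors` and `show_neighbours` are made up for illustration. The `cosine_similarity` helper is only there to show the measure that gensim's similarity queries are based on.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def load_vectors(path):
    """Load binary word2vec-format vectors (assumption: that is how the
    .bin files were saved)."""
    # Imported lazily so the helper above works without gensim installed.
    from gensim.models import KeyedVectors
    return KeyedVectors.load_word2vec_format(path, binary=True)


def show_neighbours(path, word, topn=5):
    """Print the words closest to `word` by cosine similarity."""
    wv = load_vectors(path)
    # Raises KeyError if `word` is not in the trained vocabulary.
    for neighbour, score in wv.most_similar(word, topn=topn):
        print(neighbour, round(score, 3))
```

For example, `show_neighbours("output/skip_vectors.bin", "holmes")` would list the nearest neighbours of "holmes". The file name is a placeholder, and whether "holmes" is in the vocabulary depends on how the corpus was tokenized.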

Installation

As mentioned above, this generator uses gensim to create the word embeddings. It also uses nltk to tokenize the sentences.

pip install gensim
pip install nltk

Usage

If you want to make your own vectors, you will need to pass in 3 arguments:

  1. A source file (e.g. curated.txt)
  2. An output name for your vector file (e.g. skip_vectors.bin). This needs to have a .bin file extension
  3. An embedding type, either cbow for Continuous Bag of Words or skip for Skip-gram

Then run the script as follows: python embeddings.py <source_file> <output_name.bin> <cbow or skip>

For example: python embeddings.py "data/curated.txt" "skip_vectors.bin" "skip"
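For a rough idea of what a script with this interface could look like, here is a minimal sketch. It is not the actual embeddings.py. The function names, the lowercasing step, and the use of gensim's default hyperparameters are all assumptions made for illustration. The only parts taken from the description above are the three arguments, the .bin requirement, and the cbow/skip choice, which maps onto gensim's `sg` flag.

```python
# gensim's `sg` flag: 0 selects CBOW, 1 selects skip-gram.
EMBEDDING_TYPES = {"cbow": 0, "skip": 1}


def parse_args(argv):
    """Validate the three arguments and return (source, output, sg)."""
    if len(argv) != 3:
        raise ValueError("expected: <source_file> <output_name.bin> <cbow or skip>")
    source, output, kind = argv
    if not output.endswith(".bin"):
        raise ValueError("output name must end in .bin")
    if kind not in EMBEDDING_TYPES:
        raise ValueError("embedding type must be 'cbow' or 'skip'")
    return source, output, EMBEDDING_TYPES[kind]


def train(source, output, sg):
    """Tokenize the source text, train word2vec, and save binary vectors."""
    # Imported lazily so argument parsing works without the libraries.
    import nltk
    from gensim.models import Word2Vec

    with open(source, encoding="utf-8") as handle:
        text = handle.read()
    # Needs the NLTK "punkt" tokenizer data ("punkt_tab" on recent releases).
    # Lowercasing is an assumption, not something the project states.
    sentences = [nltk.word_tokenize(s.lower()) for s in nltk.sent_tokenize(text)]
    # Hyperparameters are left at gensim's defaults (assumption).
    model = Word2Vec(sentences, sg=sg)
    model.wv.save_word2vec_format(output, binary=True)


def main(argv):
    """Entry point: parse the arguments, then train and save."""
    source, output, sg = parse_args(argv)
    train(source, output, sg)
```

Calling `main(sys.argv[1:])` under an `if __name__ == "__main__":` guard would wire this up to the command line shown above. Validation is kept separate from training so that a bad argument fails before the corpus is read.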

More food for thought

Word2Vec: https://deeplearning4j.org/word2vec.html

Skip-gram methodology: http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/

More on Gensim: https://radimrehurek.com/gensim/models/word2vec.html, https://machinelearningmastery.com/develop-word-embeddings-python-gensim/

Cosine Similarity: https://stackoverflow.com/questions/15173225/how-to-calculate-cosine-similarity-given-2-sentence-strings-python

This project was made using Python 3.5.4

About

Word Embeddings generated from the Sherlock Holmes Collection
