- Scans documents for frequently repeated words of interest
- Command Line Tool
- Outputs results to CSV & HTML
- Finds Locations, Countries, Organizations, Events, Persons
- Can additionally search nouns, verbs, and adjectives
- Omits stop words, with the option to add your own as parameters
- Filters out dates, percents, numbers, and ordinal/cardinal entity types (a sketch of this filtering follows the list)
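Under the hood, the entity types above map onto spaCy's named-entity labels. As a rough illustration only (the exact label sets used by the tool are an assumption, not taken from the repo):
import spacy

nlp = spacy.load("en_core_web_sm")

# Assumed mapping of the feature list onto spaCy's NER labels
KEEP = {"GPE", "LOC", "ORG", "EVENT", "PERSON"}    # locations, countries, organizations, events, persons
DROP = {"DATE", "PERCENT", "CARDINAL", "ORDINAL"}  # dates, percents, numbers, ordinal/cardinal types

doc = nlp("Acme Corp opened an office in London on 4 May, hiring 300 people.")
kept = [ent.text for ent in doc.ents if ent.label_ in KEEP]
print(kept)  # e.g. ['Acme Corp', 'London']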
Uses:
- pandas Documentation
- spaCy Documentation
Starting the virtual environment
cd
pip install pipenv
pipenv shell
Installing Dependencies
pipenv install -r requirements.txt
Installing the spaCy model
pipenv run python -m spacy download en_core_web_sm
OR
python -m spacy download en_core_web_sm
OR by adding it to requirements.txt
spacy>=2.2.0,<3.0.0
https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.0/en_core_web_sm-2.2.0.tar.gz#egg=en_core_web_sm
A guide on working with pipenv is available here. Further guidance on installing spaCy can be found here.
Each document to be included in the search should have its path given as a parameter when running the script below. In main.py, each document given as a parameter is instantiated as a Document object. The Document class uses multiple spaCy objects such as docs, spans, matching patterns, and tokens to work with the data. A dictionary of these doc objects is passed to the Batch class, which inherits from the Document class. Note that the Batch class is initialized with the combined text of all the documents given as arguments, as opposed to the file path used in the Document class.
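The repo has the authoritative implementation; as a minimal sketch of that relationship (attribute names and constructor details here are assumptions based on the description above):
import spacy

nlp = spacy.load("en_core_web_sm")

class Document:
    """One input file wrapped in a spaCy Doc (illustrative skeleton)."""
    def __init__(self, path):
        self.path = path
        with open(path, encoding="utf-8") as f:
            self.text = f.read()
        self.doc = nlp(self.text)  # spaCy Doc: tokens, spans, entities

class Batch(Document):
    """All input files combined; initialized with the concatenated text rather than a path."""
    def __init__(self, documents):
        # documents is the dictionary of Document objects built in main.py
        self.text = " ".join(d.text for d in documents.values())
        self.doc = nlp(self.text)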
Both a .csv and a nicely formatted .html file will be produced and output to the following directory: eigen_test/tests/test_output. (Currently, outputs are overwritten if the script is run again.) The file bold.py inserts tags to bold the keywords in the HTML file.
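Conceptually, the tagging done by bold.py amounts to something like the following (function name and regex are illustrative, not the repo's exact code):
import re

def bold_keyword(html, keyword):
    # Wrap every whole-word occurrence of the keyword in <b> tags
    pattern = re.compile(r"\b{}\b".format(re.escape(keyword)))
    return pattern.sub(r"<b>\g<0></b>", html)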
Run: python3 main.py [path/to/document1] [path/to/document2] ...
python3 main.py ../tests/test_docs/doc1.txt ../tests/test_docs/doc2.txt ../tests/test_docs/doc3.txt ../tests/test_docs/doc4.txt ../tests/test_docs/doc5.txt ../tests/test_docs/doc6.txt
A number of the methods in the Document class could act as useful functions in a Jupyter notebook environment with spaCy, but below are a few parameters that can easily be adjusted in main.py (a combined snippet follows the list).
- By default, Entities (essentially proper nouns), Nouns, Verbs, and Adjectives are all counted for frequency, as they could each have some relevance. To omit some or all of the nouns, verbs, and adjectives, adjust doc_all.set_default_phrases("NVA") in main.py, e.g. to "" for none or "N" for just Nouns. To exclude the named entities, simply comment out doc_all.interesting_entities() in main.py.
- The most frequent entries are output in descending order, with the default number of keywords being 20. This can be adjusted via doc_all.set_number_results(20). Large values may reduce performance drastically (more optimization is needed).
- Your own stop words can be added to the default list by calling doc_all.add_stop(['Let']).
- The sentence surrounding each keyword is capped at 40 words. This logic lives in the sentence_cropped(self, word) method of the Document class, which places the keyword at index 30 of 40 to provide reasonable context within a sentence of excessive length.
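Taken together, the adjustable portion of main.py looks roughly like this (the method calls are the ones documented above; the surrounding comments are only explanatory):
# doc_all is the Batch object built from all of the input documents in main.py
doc_all.set_default_phrases("NVA")   # "" = none, "N" = nouns only, up to "NVA" for nouns, verbs, and adjectives
doc_all.interesting_entities()       # comment this out to skip named entities
doc_all.set_number_results(20)       # number of top keywords; large values may hurt performance
doc_all.add_stop(['Let'])            # extend the default stop-word list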
Future features include :
- Scraping recent SEC filings
- Executing diff comparisons on the notes sections of filings
- Topic modeling and sentiment analysis using the Gensim library
Testing that results are within a reasonable tolerance of a simple grep search.
cd test_docs
grep -roh people . | wc -w
Additionally, this one-liner, borrowed and modified from Doug McIlroy, quickly shows the 200 highest-frequency words (omitting the top 30, which are primarily stop words) to give a sense of which words might be relevant when troubleshooting.
cat *.txt >> combined.txt
cat combined.txt | tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | sed 230q | tail -n +31