Project_Maui

Using News Sentiment to Predict Stock Price Movements

A program that lets a user choose a company from the S&P 500 and run a logistic regression model to predict the direction of that company's stock price on the next trading day, based on the current VADER sentiment of Reuters news articles about the company.

UX/UI Showcase

Choose a Company

Required Model Accuracy of 20%

Required Model Accuracy of 63%

Please wait till the end for modified recommendation. Thank you for your patience!

Original Goal

What we would have built with access to hourly data on stock prices and news sentiment:

Predict the intraday movement in a stock's price between the current point in time and the end of the trading day, to determine whether a trade would be profitable by end of day. The machine learning model would take the average news sentiment between h-24 and h0 (h = hour) as features and the intraday change in price between h0 and 4 pm as the target, trained historically on a rolling single-day basis.

NOTE: You must have active keys from the following APIs to run this program:

File to run: project_code > master_function

Link to Project Proposal

Data

Cleaning and Curation

News Sentiments

Pulled 20 Articles per day from news API
- Could only pull for past 30 days
- Request limitations informed Dataframe Structure
Sentiments are placed into four categories:
- compound, positive, negative and neutral

Cleaned up articles using lemmatization and stop-word removal
- Cleaning only marginally affected the polarity score
Applied the VADER sentiment analyzer to return polarity scores

Code (click me)

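import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# assuming the WordNet lemmatizer; the original snippet only references `lemmatizer`
lemmatizer = WordNetLemmatizer()

# function to tokenize text
def tokenizer(text):

    # cleaning text: keeping letters and spaces only
    sw = set(stopwords.words('english'))
    regex = re.compile("[^a-zA-Z ]")
    re_clean = regex.sub('', text)
    words = word_tokenize(re_clean)
    # lemmatizing each word, then lowercasing and dropping stop words
    lem = [lemmatizer.lemmatize(word) for word in words]
    tokens = [word.lower() for word in lem if word.lower() not in sw]

    # exporting tokenized words as output
    return tokens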


Ensured relevance by including both the company name and the ticker in the search keywords

Code (click me)

# establishing keywords for news pull
keyword = f'{company} AND {ticker}'

Stock Prices

Breaks on weekends and holidays? Here we go!

Code

        # iterating through sentiment score / article DataFrame to...
        for index, row in dataframe.iterrows():

            # if daily return is null value for a given day - i.e. a non-trading day,
            if pd.isnull(row['return']):
                
                # then append polarity scores to their respective lists
                compound.append(row['compound'])
                positive.append(row['positive'])
                negative.append(row['negative'])
                neutral.append(row['neutral'])
                dataframe.drop(index=index, inplace=True)
            
            # if there was a return value - i.e. it was a trading day
            elif pd.notnull(row['return']):
                
                # The list of compound polarity scores will be empty if the stock was traded
                # on the previous day; therefore, move along.
                if len(compound) == 0:
                    pass

                # If the list is not empty, then at least one day prior was a non-trading 
                # day. Append the current day's scores to the list and calculate the mean 
                # for each score. Then replace the current day's polarity scores with the 
                # average scores of today and previous non-trading days.
                else:
                    compound.append(row['compound'])
                    compound_mean = np.mean(compound)
                    compound = []

                    positive.append(row['positive'])
                    positive_mean = np.mean(positive)
                    positive = []

                    negative.append(row['negative'])
                    negative_mean = np.mean(negative)
                    negative = []

                    neutral.append(row['neutral'])
                    neutral_mean = np.mean(neutral)
                    neutral = []

                    dataframe.at[index, 'compound'] = compound_mean
                    dataframe.at[index, 'positive'] = positive_mean
                    dataframe.at[index, 'negative'] = negative_mean
                    dataframe.at[index, 'neutral'] = neutral_mean

            else:
                pass
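
The loop above can also be expressed without row-by-row iteration. The following is a minimal sketch, not the project's code: it assumes the same column names (`return`, `compound`, `positive`, `negative`, `neutral`) and a unique index, and the function name `roll_non_trading_days` and the sample values are hypothetical. It averages each run of non-trading days into the next trading day and then drops the non-trading rows.

```python
import numpy as np
import pandas as pd

SCORE_COLS = ["compound", "positive", "negative", "neutral"]

def roll_non_trading_days(df):
    """Average sentiment from non-trading days into the next trading day."""
    is_trading = df["return"].notna()
    # Counting trading days from the bottom up labels every row with the
    # block that ends at the next trading day on or after it
    block = is_trading[::-1].cumsum()[::-1]
    out = df.copy()
    # Replace each score with the mean over its block (equal weight per day)
    out[SCORE_COLS] = df.groupby(block)[SCORE_COLS].transform("mean")
    # Keep trading days only; trailing non-trading days are dropped as well
    return out[is_trading]

# Hypothetical Friday-to-Tuesday sample; the weekend rows have no return
sample = pd.DataFrame(
    {
        "compound": [0.1, 0.2, 0.6, 0.4, 0.5],
        "positive": [0.2, 0.1, 0.3, 0.2, 0.4],
        "negative": [0.1, 0.0, 0.0, 0.3, 0.1],
        "neutral": [0.7, 0.9, 0.7, 0.5, 0.5],
        "return": [1.0, np.nan, np.nan, -0.5, 0.3],
    },
    index=["Fri", "Sat", "Sun", "Mon", "Tue"],
)
rolled = roll_non_trading_days(sample)
print(rolled)
```

In this sample, Monday's compound score becomes the mean of Saturday, Sunday and Monday, while Friday and Tuesday keep their own scores.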

Sample of Pre-model Dataframe

Featuring Lags

Lagging days feature is incorporated into the get_model_data function:

def get_model_data(company, ticker, lag=0):

    # shifting the return column up to adjust for a lag in stock reaction to sentiments
    final_df = cleaned_df(combined_df)
    final_df['return'] = final_df['return'].shift(-lag)
    final_df.dropna(inplace=True)
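
To see what the shift does, here is a small self-contained sketch. The helper name `apply_lag` and the sample values are hypothetical; only the `shift(-lag)` followed by `dropna` pattern comes from the function above. With `lag=1`, each day's sentiment is paired with the following day's return, and the last row is dropped because it has no later return to pair with.

```python
import pandas as pd

def apply_lag(frame, lag=0):
    """Pair each row's sentiment with the return `lag` rows later."""
    out = frame.copy()
    out["return"] = out["return"].shift(-lag)
    # Rows at the end have no future return after the shift, so drop them
    return out.dropna()

# Hypothetical five trading days of compound sentiment and daily returns
df = pd.DataFrame(
    {
        "compound": [0.1, 0.2, 0.3, 0.4, 0.5],
        "return": [1.0, -1.0, 2.0, -2.0, 3.0],
    }
)

lagged = apply_lag(df, lag=1)
print(lagged)
```

Each extra day of lag costs one row of training data, which matters when only about 30 days of articles are available.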

Limitations

Lack of affordable historical intraday stock price data
- Had to change scope of project from intraday predictions to day over day
News API only allows for 30 days of historical articles to be pulled in
- Limited training data likely affects the accuracy of our model

Sources

IEX Finance - historical stock price data
News API / Reuters - historical news articles

Models

We are predicting whether the closing price of a stock will rise (1) or fall (-1) compared to the closing price of the previous trading day. This is supervised machine learning because we have a target variable. The data is split 70/30 into training and testing sets (30% held out for testing) to fit and evaluate the models.

For models, we used Logit (logistic) regression and a Balanced Random Forest Classifier to predict the probability of the binary outcome.

Other models used include LSTM Sequential and 3-Layer Neural Network.

Python Libraries:

Data for News Sentiments

NLTK

Logit, Balanced Random Forest and Miscellaneous Classifiers:

scikit-learn

For Neural Network Sequential and LSTM Models:

keras

tensorflow

Evaluation Results: Which Model Shall We Use?

Based on 31 days of data for Disney (DIS): 3/15/2020 to 4/15/2020

Note: Test statistics change with live data, which may lead to a different choice of model.

a. Logit Regression

The balanced accuracy score is 0.83.

Evaluation (click me)

Code (click me)

# ********* MODEL FITTING *************
   # --------- Logit -----------
   # --------Start-------------
   
M = 'Logit'
from sklearn import linear_model 
lm = linear_model.LogisticRegression(C = 1e5)
lm.fit(X_train, y_train)
lm_pred = lm.predict(X_test)


  # --------- Logit ------------
   # ---------End -------------

b. Balanced Random Forest Classifier Ensemble Learning

The balanced accuracy score is 0.67.
A better choice than a single Decision Tree model, as averaging many trees reduces overfitting

Evaluation

Code

# ********* MODEL FITTING *************
   # -----Balanced Random Forest -------
   # --------Start-------------

# Fit a Balanced Random Forest Classifier, which randomly under-samples each
# bootstrap sample to balance the classes (no separate oversampling step is run here)
from imblearn.ensemble import BalancedRandomForestClassifier
brf = BalancedRandomForestClassifier(n_estimators=100, random_state=42)
brf.fit(X_train, y_train)
brf_pred = brf.predict(X_test)

   # --- Balanced Random Forest --------
   # --------End-------------

c. Decision Tree Resampling

The balanced accuracy score is 0.67.

Evaluation

Code

# ********* MODEL FITTING *************
   # ----- Decision Tree -------
   # --------Start-------------

from sklearn import tree
# Needed for decision tree visualization
import pydotplus
from IPython.display import Image

# Creating the decision tree classifier instance
model_tree = tree.DecisionTreeClassifier()
# Fitting the model
model_tree = model_tree.fit(X_train, y_train)
# Making predictions using the testing data
tree_pred = model_tree.predict(X_test)

  # --- Decision Tree --------
   # --------End-------------

Image

Data Preparation for Models

Code

# Creating training and testing data sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, shuffle=False, random_state=42) 

# For neural network sequential, LSTM and ensemble learning
# Create the StandardScaler instance
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Fit the Standard Scaler with the training data
X_scaler = scaler.fit(X_train)

# Scale the training data - only scale X_train and X_test data 
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)


# Creating validation data sets for deep learning on neural network model training
# (X_train_scaled above was computed before this split, so re-apply
# X_scaler.transform to the new X_train and X_val if scaled inputs are needed)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size = 0.3, shuffle=False)

Model Evaluation

Confusion Matrix and Balanced Accuracy Scores for Logit, Supervised Resampling and Ensemble Learning
Model evaluation will differ by company and time window.
Interpretation of accuracy scores
- The Logit model reports a balanced accuracy score for its predictions on the test dataset, which is 30% of 20 days, or 6 predictions.
  - We usually get 50% as the balanced accuracy score because, with so little test data and a model trained on so little data (14 days), the model is basically guessing at random and getting it right about half the time.
  - Sometimes the score is .25 (25%) or .625 (62.5%), and you may be thinking that no whole number of correct guesses out of 6 returns that fraction. True, but the model does not use a straight accuracy score; it uses a balanced accuracy score, which averages the recall of each class. Read here for explanation.
- If you go into the model and change balanced_accuracy_score to accuracy_score (a simple calculation of correct guesses divided by total guesses) and print the confusion matrix, you'll see that it returns the expected fraction.
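
The difference between the two scores can be checked by hand. The sketch below uses plain Python rather than scikit-learn, with a hypothetical six-day test window; `balanced_accuracy` here follows the usual definition (the mean of per-class recall), and the helper names are illustrative only.

```python
from collections import defaultdict

def accuracy(y_true, y_pred):
    """Share of predictions that match the actual labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean of the recall obtained on each class present in y_true."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    recalls = [hits[label] / totals[label] for label in totals]
    return sum(recalls) / len(recalls)

# Hypothetical test window: four up days (1) and two down days (-1)
y_true = [1, 1, 1, 1, -1, -1]
y_pred = [1, 1, 1, -1, -1, 1]

print(f"accuracy:          {accuracy(y_true, y_pred):.4f}")
print(f"balanced accuracy: {balanced_accuracy(y_true, y_pred):.4f}")
```

Here four of six guesses are correct, so plain accuracy is 4/6. Recall is 3/4 on up days and 1/2 on down days, so balanced accuracy is (0.75 + 0.5) / 2 = 0.625, a value that plain accuracy can never take with six predictions.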

Code

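# Imports needed by this snippet (display is available by default inside Jupyter)
import pandas as pd
from IPython.display import display
from sklearn.metrics import balanced_accuracy_score, classification_report, confusion_matrix

# Score the accuracy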
print("Training vs. Testing - Logit")
print(f"Training Data Score: {lm.score(X_train, y_train):,.04f}")
print(f"Testing Data Score: {lm.score(X_test, y_test):,.04f}")

# Evaluating the Logit model in a nicer format
# Calculating the confusion matrix
cm_lm = confusion_matrix(y_test, lm_pred)
cm_lm_df = pd.DataFrame(
    cm_lm, index=["Actual -1", "Actual 1"], columns=["Predicted -1", "Predicted 1"]
)
# Calculating the accuracy score
acc_lm_score = balanced_accuracy_score(y_test, lm_pred)

# Displaying results
print("Confusion Matrix - Logit")
display(cm_lm_df)
print(f"Balanced Accuracy Score : {acc_lm_score:,.04f}")
print("Classification Report - Logit")
print(classification_report(y_test, lm_pred))

Results: Real vs. Predicted

Logit

Logit predictions without rolling window

Balanced Random Forest

Graph

Three-layer Neural Network

Graph

For more details on models, please click on the link below:

Test Folder

Conclusions

Impact of Positive vs. Negative News Sentiments

Conclusion: Predictions are more consistent with actual directions of returns based on negative compared to positive sentiments, subject to overfitting.

Deep Learning LSTM Model

Positive News Sentiments

Negative News Sentiments

Note: Potentially overfitting. Needed more data.

The parallel-categories plots and OLS regressions below lead to a consistent conclusion: negative and compound sentiment scores hold more predictive power over the direction of daily returns.

Compared to positive and neutral sentiments:

OLS Prediction on Price Move Directions due to News Sentiments in the Past 5 Days

Actual returns (the blue line on top) are more in tune with the predictions based on negative sentiments (the purple line, third from the bottom) and compound sentiments (the brown line, second from the bottom) over the past five trading days.

OLS Predicted Returns on News Sentiments in the Past 5 Days

Note: Inverted due to complications when multiplying positive and negative signs.

Snoozed Sentiments? For How Long?

Impact on Lagged Response to News Sentiments

Conclusion: We found that news sentiment over the past five trading days has the same predictive capacity as one-day sentiment.

OLS with Rolling One-day Training Window

OLS Returns with Rolling Three-day Training Window

OLS with Rolling Five-day Training Window

Discussions

Gradient Boosting Ensemble and SVM models also provide higher accuracy scores (50%) compared to other models.

Gradient Boosting Ensemble Learning on News Sentiments

Returns from predicted directions: does multiplication always work?

What happens when multiplying a prediction of price drop, i.e. -1, with a negative actual return? Please click below to see a solution.

Code

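# standard deviation of daily returns (dividing by 100, presumably to convert percent to a fraction)
sigma = (dis['return']/100).std()
# copying so the original predictions DataFrame is left unchanged
all_predic = all_pred.copy()
# scaling the inverted OLS direction prediction to a return-sized magnitude
all_predic['OLS_predi'] = 0.7 * sigma * (-all_predic['OLS_pred'])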
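
As a rough illustration of the sign question, the sketch below uses plain Python with hypothetical returns and predictions. It first forms the naive product of predicted direction and actual return, then applies the same `0.7 * sigma * (-prediction)` scaling as the snippet above. The variable names are illustrative and the interpretation in the comments is our reading, not a statement of how the plots were produced.

```python
from statistics import stdev

# Hypothetical daily returns in percent and predicted directions (1 = up, -1 = down)
returns_pct = [1.5, -2.0, 0.5, -1.0]
predicted = [1, -1, -1, 1]

# Naive product: a "down" call on a falling day comes out positive, so the
# product shows whether the call was right, not which way the price moved
naive = [p * r for p, r in zip(predicted, returns_pct)]

# Scaling used above: sample standard deviation of returns as a fraction,
# multiplied by 0.7 and by the inverted prediction
sigma = stdev([r / 100 for r in returns_pct])
scaled = [0.7 * sigma * (-p) for p in predicted]

print("naive product:", naive)
print("scaled prediction:", [round(s, 5) for s in scaled])
```

The scaled series depends only on the predicted direction and the overall volatility, so it can be drawn on the same axis as actual returns without the sign flipping that the product introduces.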

Interpretation on Test Statistics

Confusion Matrix

| | Predicted 0 (-1) | Predicted 1 |
|---|---|---|
| Actually 0 (-1) | TN | FP |
| Actually 1 | FN | TP |

Accuracy = (TP+TN)/(TP+TN+FP+FN)
- It treats FP and FN equally and can be misleading for imbalanced data:
  - For COVID-19 tests, for example, most results are true negatives (TN), so they dominate the accuracy score
  - Yet such tests need to focus on minimizing false negatives (FN)
- Therefore, other test statistics need to be considered

Graph Illustration

Other Model Evaluation Statistics (click me)

Precision = TP/(TP+FP)
- Out of all the predictions of "1" for a daily price increase, how many actually increased?
- It focuses on predicted price increases and uses the figures in the second column of the confusion matrix.
Recall = TP/(TP+FN)
- How many actual daily price increases are predicted correctly?
- It uses the second row of the confusion matrix.
- Recall is also the sensitivity of the testing model.
Specificity = TN/(TN+FP)
- How many of the actual downward price moves are predicted correctly?
- It uses the first row of our confusion matrix and examines only the downward price moves in our data.
F1 = 2 x (Precision x Recall)/(Precision + Recall)
- The F1 score is the harmonic mean of precision and recall.
- As precision and recall usually move in opposite directions, the F1 score is a good balance between the two.
- F1 uses the second row and second column, for actual and predicted upward price moves.
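
The formulas above can be computed directly from the four confusion-matrix counts. This is a minimal sketch in plain Python; the function name `classification_metrics`, the zero-division fallback and the example counts are assumptions made for illustration.

```python
def _ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

def classification_metrics(tn, fp, fn, tp):
    """Compute the test statistics described above from confusion-matrix counts."""
    precision = _ratio(tp, tp + fp)      # second column of the matrix
    recall = _ratio(tp, tp + fn)         # second row (sensitivity)
    specificity = _ratio(tn, tn + fp)    # first row
    return {
        "accuracy": _ratio(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": _ratio(2 * precision * recall, precision + recall),
        # Balanced accuracy for two classes is the mean of recall and specificity
        "balanced_accuracy": (recall + specificity) / 2,
    }

# Hypothetical six-day test set: 3 TP, 1 FN, 1 TN, 1 FP
metrics = classification_metrics(tn=1, fp=1, fn=1, tp=3)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```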

Deployment

Jupyter Lab, Jupyter Notebook, Visual Studio Code

Built With

Future Steps

We are working on the following three features to upgrade our machine learning widget.

Our group has been working on several solutions to incorporate a Buy, Sell and Hold feature into the master function widget.

The original code appears as follows:

def signal_column(df):
    df['test'] = None
    for index, row in df.iterrows():
        if pd.isnull(row['test']):
            if df.loc[index]['return'] >= 2:
                df.at[index, 'test'] = 1
            elif df.loc[index]['return'] <= -2:
                df.at[index, 'test'] = -1
            else:
                df.at[index, 'test'] = 0
    return df
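
The same labelling can be done without a row loop. The sketch below is an alternative under stated assumptions, not the version in the repository: the function name `signal_column_vectorized` and the `threshold` parameter are hypothetical, and it returns a copy rather than modifying the input DataFrame. It also yields an integer column, whereas the loop above starts from `None` and so produces an object column.

```python
import numpy as np
import pandas as pd

def signal_column_vectorized(df, threshold=2.0):
    """Label each row 1 (buy), -1 (sell) or 0 (hold) from its return."""
    out = df.copy()
    conditions = [
        out["return"] >= threshold,    # large gain
        out["return"] <= -threshold,   # large loss
    ]
    # Rows matching neither condition (including missing returns) get 0
    out["test"] = np.select(conditions, [1, -1], default=0)
    return out

# Hypothetical daily returns in percent
sample = pd.DataFrame({"return": [2.5, -3.0, 0.4, 2.0, -2.0]})
signals = signal_column_vectorized(sample)
print(signals)
```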

To fit it in, we tried the three versions below:

a. on get_model_data(company, ticker, lag=0) function

b. on on_button_clicked(b) function

c. on the model(df) function

# defining model to run Logit logistic regression model on the feature/target DataFrame
# and export predicted price movement and model accuracy
def model(df):
    # preparing the dataframe: buy (1) if return >= 2, sell (-1) if return <= -2, else hold (0)
    # (integer labels are used because scikit-learn rejects an object-typed target)
    df['return_sign'] = 0
    df.loc[df['return'] >= 2, 'return_sign'] = 1
    df.loc[df['return'] <= -2, 'return_sign'] = -1
    df = df.drop(columns=['text'])
    # creating the features (X) and target (y) sets
    X = df.iloc[:, 0:4]
    y = df["return_sign"]
    # creating training and testing data sets (chronological split, no shuffling)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, shuffle=False, random_state=42)
    # fitting model
    M = 'Logit'
    lm = linear_model.LogisticRegression(solver = 'lbfgs')
    lm.fit(X_train, y_train)
    lm_pred = lm.predict(X_test)
    # calculating confusion matrix over all three classes
    labels = [-1, 0, 1]
    cm_lm = confusion_matrix(y_test, lm_pred, labels=labels)
    cm_lm_df = pd.DataFrame(
        cm_lm,
        index=[f"Actual {label}" for label in labels],
        columns=[f"Predicted {label}" for label in labels],
    )
    # calculating the accuracy score
    acc_lm_score = balanced_accuracy_score(y_test, lm_pred)
    # exporting model accuracy and predicted price movement float variables as output
    return acc_lm_score, lm_pred[-1]

Our API access had run out as of the date of this README, so it remains unknown whether the buy-sell-hold feature works.

Click here for latest version Featuring Buy Sell and Hold

Furthermore, we discussed an options feature based on Black-Scholes pricing that showcases put-call parity. It features an interactive input function. Outputs include put and call prices along with the Greeks, which measure price sensitivities.

Click here for latest version on options feature
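
For orientation, here is a minimal sketch of Black-Scholes pricing and a put-call parity check. It is not the linked notebook's code: it assumes European options on a non-dividend-paying stock, and the function names and input values are hypothetical.

```python
from math import erf, exp, log, sqrt

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def black_scholes(spot, strike, rate, vol, years):
    """Return (call, put) prices for a European option without dividends."""
    d1 = (log(spot / strike) + (rate + 0.5 * vol ** 2) * years) / (vol * sqrt(years))
    d2 = d1 - vol * sqrt(years)
    discounted_strike = strike * exp(-rate * years)
    call = spot * norm_cdf(d1) - discounted_strike * norm_cdf(d2)
    put = discounted_strike * norm_cdf(-d2) - spot * norm_cdf(-d1)
    return call, put

# Hypothetical inputs: spot 100, strike 100, 5% rate, 20% volatility, one year
spot, strike, rate, vol, years = 100.0, 100.0, 0.05, 0.2, 1.0
call, put = black_scholes(spot, strike, rate, vol, years)

# Put-call parity: call - put should equal spot - strike * exp(-rate * years)
parity_gap = (call - put) - (spot - strike * exp(-rate * years))

print(f"call: {call:.4f}, put: {put:.4f}, parity gap: {parity_gap:.2e}")
```

The parity gap should be zero up to floating-point error, which makes it a convenient sanity check on any pricing function.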

Another topic that we spoke about was an Amazon Lex Bot.

Click here for latest version on Lambda function for Amazon Bot

Contributors

Richard Bertrand
Ava Lee
Devin Nigro
Brody Wacker

Files

Data

Dataframe

Stock Data

Models

Models of Good Fit

Scikit-learn Classifiers

Neural Net

LSTM

Future Steps
