Can we find any sentiments on Wall Street?
Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('seaborn')  # renamed to 'seaborn-v0_8' in matplotlib >= 3.6
There are two channels of data provided in this dataset:

- News data: I crawled historical news headlines from the Reddit WorldNews Channel (/r/worldnews). They are ranked by Reddit users' votes, and only the top 25 headlines are considered for a single date. (Range: 2008-06-08 to 2016-07-01)
- Stock data: the Dow Jones Industrial Average (DJIA) is used to "prove the concept". (Range: 2008-08-08 to 2016-07-01)

I provide three data files in .csv format:

- RedditNews.csv: two columns; the first is the "date" and the second is the "news headlines". All news items are ranked from top to bottom based on how hot they are, so there are 25 lines for each date.
- DJIA_table.csv: downloaded directly from Yahoo Finance; check out the web page for more info.
- CombinedNewsDJIA.csv: to make things easier for my students, I provide this combined dataset with 27 columns. The first column is "Date", the second is "Label", and the following ones are news headlines ranging from "Top1" to "Top25".
df = pd.read_csv('data/Combined_news_DJIA.csv')
df[:2]
df.Label.value_counts().rename(index={1:'Next Day Stock Rise',0:'No Rise'}).plot.bar(rot=0,figsize=(5,3));
features = [f'Top{x+1}' for x in range(25)]
target = 'Label'
VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media. It is fully open-sourced under the MIT License (the authors sincerely appreciate all attributions and readily accept most contributions, but ask not to be held liable).
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')
sid = SentimentIntensityAnalyzer()
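Before scoring the whole corpus, it helps to inspect what polarity_scores returns for a single, made-up headline: a dict of neg/neu/pos proportions (which sum to 1) plus a normalized compound score in [-1, 1].
# Hypothetical headline, just to see the output format
sid.polarity_scores('Markets rally as trade fears ease')
# -> {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...} (exact values depend on the lexicon)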
%%time
copy_df = df.copy()
copy_df[:1]
Extract sentiments from each news headline
# Score each headline cell; non-string cells (NaN) become empty dicts
# (DataFrame.applymap was renamed to DataFrame.map in pandas 2.1)
copy_df[features] = copy_df[features].applymap(lambda x: sid.polarity_scores(x) if isinstance(x, str) else {})
copy_df[:1]
A few steps to reshape the data
copy_df = copy_df.set_index('Date')[features].unstack().reset_index()  # wide Top1..Top25 columns -> long format: one row per (headline slot, date)
copy_df[:3]
sentiments = pd.DataFrame.from_dict(copy_df[0].values.tolist())  # expand each score dict into neg/neu/pos/compound columns
sentiments[:3]
copy_df[sentiments.columns] = sentiments
copy_df = copy_df.drop([0],axis=1).rename(columns={'level_0':'news_label'})  # drop the raw dict column; 'level_0' is the Top1..Top25 slot
copy_df[:3]
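If the unstack step looks opaque, here is the same pattern on a tiny made-up frame (all values hypothetical):
# Two dates, two headline slots, each cell already a VADER score dict
toy = pd.DataFrame({'Date': ['2020-01-01', '2020-01-02'],
                    'Top1': [{'compound': 0.5}, {'compound': -0.2}],
                    'Top2': [{'compound': 0.0}, {'compound': 0.3}]})
long = toy.set_index('Date')[['Top1', 'Top2']].unstack().reset_index()
# columns are now: level_0 (the headline slot), Date, and 0 (the score dict)
scores = pd.DataFrame.from_dict(long[0].values.tolist())
long[scores.columns] = scores
long.drop(columns=[0]).rename(columns={'level_0': 'news_label'})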
clean_df = copy_df.groupby('Date',as_index=False).agg({x:['mean','sum'] for x in sentiments.columns}).copy()  # daily mean and sum of each sentiment score
clean_df.columns = ['_'.join(x) if x[1] != '' else x[0].lower() for x in clean_df.columns]  # flatten the MultiIndex columns, e.g. ('compound','mean') -> 'compound_mean'
clean_df = clean_df.merge(df[['Date','Label']],left_on='date',right_on='Date',how='left').drop(['Date'],axis=1)
clean_df['date'] = pd.to_datetime(clean_df['date'])
clean_df[:3]
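The column-flattening line above deals with the MultiIndex columns that a multi-function agg produces; the same pattern on a made-up frame:
demo = pd.DataFrame({'Date': ['d1', 'd1', 'd2'], 'compound': [0.1, 0.3, -0.2]})
agg = demo.groupby('Date', as_index=False).agg({'compound': ['mean', 'sum']})
# columns are now tuples like ('compound', 'mean'); join them into flat names
agg.columns = ['_'.join(c) if c[1] != '' else c[0].lower() for c in agg.columns]
agg  # columns: date, compound_mean, compound_sum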
Now let's enrich those sentiment features with market data
import pandas_datareader.data as web
min_date,max_date = clean_df.date.min(),clean_df.date.max()
min_date,max_date
DJIA stock prices over the same period
stock = web.DataReader('^DJI', 'stooq',start=min_date,end=max_date).reset_index()  # Stooq serves daily quotes most-recent first; we sort by date after the merge
stock.columns = stock.columns.str.lower()
stock[:3]
Merge the clean data with the stock prices
enriched_df = clean_df.merge(stock,on='date',how='left').sort_values(by='date').copy()
enriched_df[:3]
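One caveat of a left merge: any date missing from the Stooq feed silently becomes a row of NaN prices. A quick count is a cheap sanity check (an added check, not part of the original flow):
# Number of dates with no matching Stooq quote (ideally 0)
enriched_df['close'].isna().sum()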
Feature Correlation
plt.figure(figsize=(16,7))
corr = enriched_df.corr(method='spearman', numeric_only=True)  # numeric_only avoids errors on the datetime column in recent pandas
sns.heatmap(corr[corr>0],annot=True,fmt='.2f',cmap='Blues');  # mask negative values to keep the plot readable
Close price and compound mean, highlighted by the given label
sns.scatterplot(x=enriched_df['close'],y=enriched_df['compound_mean'],hue=enriched_df['Label'],alpha=.5);
Now let's use an XGBoost model to see what is driving price changes
import xgboost as xgb
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
We will use the compound mean as a feature to capture the influence of the news in our model
features = ['compound_mean','open','high','low']
target = 'close'
XGBoost Regressor
params = {'eval_metric': 'rmse', 'max_depth': 5, 'n_estimators': 100, 'objective': 'reg:gamma'}  # dropped 'use_label_encoder': it only applies to classifiers and is deprecated
reg = xgb.XGBRegressor(**params)
reg.fit(enriched_df[features],enriched_df[target])
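GridSearchCV and metrics were imported above but never used; here is a hedged sketch of how they could tune the same regressor. The grid values are illustrative assumptions, and for market data a time-ordered split (e.g. sklearn's TimeSeriesSplit) would be more rigorous than plain K-fold:
# Illustrative hyperparameter grid; values are assumptions, not tuned results
grid = GridSearchCV(
    xgb.XGBRegressor(objective='reg:gamma', eval_metric='rmse'),
    param_grid={'max_depth': [3, 5, 7], 'n_estimators': [50, 100, 200]},
    scoring='neg_root_mean_squared_error',
    cv=3,
)
grid.fit(enriched_df[features], enriched_df[target])
print(grid.best_params_)
# In-sample RMSE of the best model (np.sqrt keeps this version-agnostic)
preds = grid.predict(enriched_df[features])
print(np.sqrt(metrics.mean_squared_error(enriched_df[target], preds)))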
Let's look at one of our trees. As a reminder, VADER's conventional thresholds for interpreting the compound score are:
- positive sentiment: compound score >= 0.05
- neutral sentiment: -0.05 < compound score < 0.05
- negative sentiment: compound score <= -0.05
fig,ax = plt.subplots(1,1,figsize=(35,10))
xgb.plot_tree(reg,num_trees=50,ax=ax);  # requires the graphviz package; num_trees is the index of the tree to draw
Feature Importance
xgb.plot_importance(reg);
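Note that plot_importance defaults to importance_type='weight' (how often a feature is used to split); 'gain' weights each split by how much it improves the loss and often tells a different story:
xgb.plot_importance(reg, importance_type='gain');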
Feature Impact
import shap
expl = shap.TreeExplainer(reg)
shap_values = expl.shap_values(enriched_df[features],enriched_df[target])
shap.summary_plot(shap_values,enriched_df[features])
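As a follow-up, a SHAP dependence plot isolates the sentiment feature and shows how the contribution of compound_mean to the predicted close varies across its range:
shap.dependence_plot('compound_mean', shap_values, enriched_df[features])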