Can we find any sentiment on Wall Street?

Imports

Code

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

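# The 'seaborn' style was renamed 'seaborn-v0_8' in matplotlib 3.6; fall back on older versions
plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn')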

About Data

Dataset

There are two channels of data provided in this dataset:

News data: I crawled historical news headlines from Reddit WorldNews Channel (/r/worldnews). They are ranked by reddit users’ votes, and only the top 25 headlines are considered for a single date. (Range: 2008-06-08 to 2016-07-01)
Stock data: Dow Jones Industrial Average (DJIA) is used to “prove the concept”. (Range: 2008-08-08 to 2016-07-01)

I provided three data files in .csv format:

RedditNews.csv: two columns. The first column is the “date”, and second column is the “news headlines”. All news are ranked from top to bottom based on how hot they are. Hence, there are 25 lines for each date.
DJIA_table.csv: Downloaded directly from Yahoo Finance: check out the web page for more info.
CombinedNewsDJIA.csv: To make things easier for my students, I provide this combined dataset with 27 columns. The first column is “Date”, the second is “Label”, and the following ones are news headlines ranging from “Top1” to “Top25”.

Code

df = pd.read_csv('data/Combined_news_DJIA.csv')
df[:2]

	Date	Label	Top1	Top2	Top3	Top4	Top5	Top6	Top7	Top8	...	Top16	Top17	Top18	Top19	Top20	Top21	Top22	Top23	Top24	Top25
0	2008-08-08	0	b"Georgia 'downs two Russian warplanes' as cou...	b'BREAKING: Musharraf to be impeached.'	b'Russia Today: Columns of troops roll into So...	b'Russian tanks are moving towards the capital...	b"Afghan children raped with 'impunity,' U.N. ...	b'150 Russian tanks have entered South Ossetia...	b"Breaking: Georgia invades South Ossetia, Rus...	b"The 'enemy combatent' trials are nothing but...	...	b'Georgia Invades South Ossetia - if Russia ge...	b'Al-Qaeda Faces Islamist Backlash'	b'Condoleezza Rice: "The US would not act to p...	b'This is a busy day: The European Union has ...	b"Georgia will withdraw 1,000 soldiers from Ir...	b'Why the Pentagon Thinks Attacking Iran is a ...	b'Caucasus in crisis: Georgia invades South Os...	b'Indian shoe manufactory - And again in a se...	b'Visitors Suffering from Mental Illnesses Ban...	b"No Help for Mexico's Kidnapping Surge"
1	2008-08-11	1	b'Why wont America and Nato help us? If they w...	b'Bush puts foot down on Georgian conflict'	b"Jewish Georgian minister: Thanks to Israeli ...	b'Georgian army flees in disarray as Russians ...	b"Olympic opening ceremony fireworks 'faked'"	b'What were the Mossad with fraudulent New Zea...	b'Russia angered by Israeli military sale to G...	b'An American citizen living in S.Ossetia blam...	...	b'Israel and the US behind the Georgian aggres...	b'"Do not believe TV, neither Russian nor Geor...	b'Riots are still going on in Montreal (Canada...	b'China to overtake US as largest manufacturer'	b'War in South Ossetia [PICS]'	b'Israeli Physicians Group Condemns State Tort...	b' Russia has just beaten the United States ov...	b'Perhaps the question about the Georgia - R...	b'Russia is so much better at war'	b"So this is what it's come to: trading sex fo...

2 rows × 27 columns
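The preview shows that many headlines still carry a b'...' or b"..." wrapper, left over from how the text was saved. A minimal sketch of how that wrapper could be removed before scoring follows. The helper name strip_bytes_repr is hypothetical, and it only removes the outer wrapper; it does not decode escape sequences inside the text.

```python
def strip_bytes_repr(headline):
    """Remove a leading b' / b" and the matching closing quote, if present.

    Hypothetical helper: non-string values (e.g. missing headlines) and
    strings without the wrapper are returned unchanged.
    """
    if not isinstance(headline, str):
        return headline
    text = headline.strip()
    has_wrapper = (
        len(text) >= 3
        and text[0] == 'b'
        and text[1] in ('"', "'")
        and text[-1] == text[1]
    )
    return text[2:-1] if has_wrapper else headline


cleaned = strip_bytes_repr("b'BREAKING: Musharraf to be impeached.'")
print(cleaned)
```

Applied cell by cell to the Top1 to Top25 columns, this would hand VADER the headline text without the stray prefix.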

Code

df.Label.value_counts().rename(index={1:'Next Day Stock Raise',0:'No Raise'}).plot.bar(rot=0,figsize=(5,3));

Code

features = [f'Top{x+1}' for x in range(25)]
target = 'Label'

Vader on Wall Street

VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media. It is fully open-sourced under the MIT License.
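To get a feel for the compound score used later on, here is a small sketch of the normalization step. It assumes the form x / sqrt(x² + alpha) with alpha = 15, which is my understanding of the VADER reference implementation; the function name normalize_valence is hypothetical, and the real tool applies many lexicon rules before this step.

```python
import math


def normalize_valence(score, alpha=15):
    """Squash an unbounded sum of word valences into the open interval (-1, 1).

    Assumption: alpha=15 mirrors the default in the VADER reference code.
    """
    return score / math.sqrt(score * score + alpha)


# A larger absolute valence sum moves the result closer to -1 or +1.
for raw in (-8.0, -1.0, 0.0, 1.0, 8.0):
    print(raw, round(normalize_valence(raw), 4))
```

Because of this squashing, compound values are not additive, which is worth keeping in mind when we average or sum them per day later on.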

Code

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/cristianexer/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!

True

Code

sid = SentimentIntensityAnalyzer()

Code

%%time
copy_df = df.copy()
copy_df[:1]

CPU times: user 1.08 ms, sys: 425 µs, total: 1.51 ms
Wall time: 1.02 ms

	Date	Label	Top1	Top2	Top3	Top4	Top5	Top6	Top7	Top8	...	Top16	Top17	Top18	Top19	Top20	Top21	Top22	Top23	Top24	Top25
0	2008-08-08	0	b"Georgia 'downs two Russian warplanes' as cou...	b'BREAKING: Musharraf to be impeached.'	b'Russia Today: Columns of troops roll into So...	b'Russian tanks are moving towards the capital...	b"Afghan children raped with 'impunity,' U.N. ...	b'150 Russian tanks have entered South Ossetia...	b"Breaking: Georgia invades South Ossetia, Rus...	b"The 'enemy combatent' trials are nothing but...	...	b'Georgia Invades South Ossetia - if Russia ge...	b'Al-Qaeda Faces Islamist Backlash'	b'Condoleezza Rice: "The US would not act to p...	b'This is a busy day: The European Union has ...	b"Georgia will withdraw 1,000 soldiers from Ir...	b'Why the Pentagon Thinks Attacking Iran is a ...	b'Caucasus in crisis: Georgia invades South Os...	b'Indian shoe manufactory - And again in a se...	b'Visitors Suffering from Mental Illnesses Ban...	b"No Help for Mexico's Kidnapping Surge"

1 rows × 27 columns

Extract Sentiments from each news title

Code

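# Score every headline; non-string cells (missing headlines) get an empty dict.
# Note: DataFrame.applymap is available as DataFrame.map from pandas 2.1 onwards.
copy_df[features] = copy_df[features].applymap(lambda x: sid.polarity_scores(x) if isinstance(x, str) else {})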
copy_df[:1]

	Date	Label	Top1	Top2	Top3	Top4	Top5	Top6	Top7	Top8	...	Top16	Top17	Top18	Top19	Top20	Top21	Top22	Top23	Top24	Top25
0	2008-08-08	0	{'neg': 0.262, 'neu': 0.738, 'pos': 0.0, 'comp...	{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...	{'neg': 0.172, 'neu': 0.828, 'pos': 0.0, 'comp...	{'neg': 0.247, 'neu': 0.753, 'pos': 0.0, 'comp...	{'neg': 0.424, 'neu': 0.576, 'pos': 0.0, 'comp...	{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...	{'neg': 0.149, 'neu': 0.851, 'pos': 0.0, 'comp...	{'neg': 0.107, 'neu': 0.79, 'pos': 0.103, 'com...	...	{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...	{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...	{'neg': 0.078, 'neu': 0.819, 'pos': 0.103, 'co...	{'neg': 0.092, 'neu': 0.78, 'pos': 0.128, 'com...	{'neg': 0.112, 'neu': 0.773, 'pos': 0.116, 'co...	{'neg': 0.351, 'neu': 0.649, 'pos': 0.0, 'comp...	{'neg': 0.406, 'neu': 0.594, 'pos': 0.0, 'comp...	{'neg': 0.14, 'neu': 0.86, 'pos': 0.0, 'compou...	{'neg': 0.65, 'neu': 0.35, 'pos': 0.0, 'compou...	{'neg': 0.0, 'neu': 0.649, 'pos': 0.351, 'comp...

1 rows × 27 columns

Some steps to format the data

Code

copy_df = copy_df.set_index('Date')[features].unstack().reset_index()
copy_df[:3]

	level_0	Date	0
0	Top1	2008-08-08	{'neg': 0.262, 'neu': 0.738, 'pos': 0.0, 'comp...
1	Top1	2008-08-11	{'neg': 0.0, 'neu': 0.668, 'pos': 0.332, 'comp...
2	Top1	2008-08-12	{'neg': 0.169, 'neu': 0.656, 'pos': 0.175, 'co...

Code

sentiments = pd.DataFrame.from_dict(copy_df[0].values.tolist())
sentiments[:3]

	neg	neu	pos	compound
0	0.262	0.738	0.000	-0.5994
1	0.000	0.668	0.332	0.8156
2	0.169	0.656	0.175	0.0258

Code

copy_df[sentiments.columns] = sentiments
copy_df = copy_df.drop([0],axis=1).rename(columns={'level_0':'news_label'})
copy_df[:3]

	news_label	Date	neg	neu	pos	compound
0	Top1	2008-08-08	0.262	0.738	0.000	-0.5994
1	Top1	2008-08-11	0.000	0.668	0.332	0.8156
2	Top1	2008-08-12	0.169	0.656	0.175	0.0258
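The reshaping above is easier to follow on a toy frame. The sketch below uses made-up scores and only two headline slots; the column name scores and the values are assumptions for illustration, not the real data.

```python
import pandas as pd

# Toy stand-in for the scored frame: each cell holds a dict of scores.
wide = pd.DataFrame({
    'Date': ['2008-08-08', '2008-08-11'],
    'Top1': [{'neg': 0.2, 'pos': 0.0}, {'neg': 0.0, 'pos': 0.3}],
    'Top2': [{'neg': 0.1, 'pos': 0.1}, {}],  # {} mimics a missing headline
})

# Wide -> long: one row per (headline slot, date).
long_df = wide.set_index('Date')[['Top1', 'Top2']].unstack().reset_index()
long_df.columns = ['news_label', 'Date', 'scores']

# Expand the dicts into columns; the empty dict becomes a row of NaN.
scores = pd.DataFrame(long_df['scores'].tolist())
long_df = pd.concat([long_df.drop(columns='scores'), scores], axis=1)
print(long_df)
```

The empty dict for a missing headline turns into NaN scores, which pandas skips by default in the aggregation that follows.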

Code

clean_df = copy_df.groupby('Date',as_index=False).agg({x:['mean','sum'] for x in sentiments.columns}).copy()
clean_df.columns = [ '_'.join(x) if x[1] != '' else x[0].lower() for x in clean_df.columns]
clean_df = clean_df.merge(df[['Date','Label']],left_on='date',right_on='Date',how='left').drop(['Date'],axis=1)
clean_df['date'] = pd.to_datetime(clean_df['date'] )
clean_df[:3]

	date	neg_mean	neg_sum	neu_mean	neu_sum	pos_mean	pos_sum	compound_mean	compound_sum	Label
0	2008-08-08	0.19284	4.821	0.76920	19.230	0.03800	0.950	-0.309440	-7.7360	0
1	2008-08-11	0.15028	3.757	0.78260	19.565	0.06708	1.677	-0.120740	-3.0185	1
2	2008-08-12	0.15712	3.928	0.78496	19.624	0.05788	1.447	-0.217556	-5.4389	0
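A toy version of the aggregation shows what the column flattening does. It also shows why the _sum and _mean columns carry the same information whenever every day has the same number of scored headlines: in the output above, neg_sum is 25 times neg_mean. The data below is made up for illustration.

```python
import pandas as pd

long_df = pd.DataFrame({
    'Date': ['2008-08-08', '2008-08-08', '2008-08-11', '2008-08-11'],
    'compound': [-0.6, 0.2, 0.8, 0.0],
})

daily = long_df.groupby('Date', as_index=False).agg({'compound': ['mean', 'sum']})
# Columns are now a MultiIndex: ('Date', ''), ('compound', 'mean'), ('compound', 'sum').
daily.columns = ['_'.join(c) if c[1] != '' else c[0].lower() for c in daily.columns]

# With exactly two headlines per day, the sum is always twice the mean.
daily['ratio'] = daily['compound_sum'] / daily['compound_mean']
print(daily)
```

The two only diverge on days where some headlines are missing, since NaN values are skipped and the per-day count changes.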

Now let’s do some data enrichment for those sentiments

Code

import pandas_datareader.data as web

Code

min_date,max_date = clean_df.date.min(),clean_df.date.max()
min_date,max_date

(Timestamp('2008-08-08 00:00:00'), Timestamp('2016-07-01 00:00:00'))

DJI stock prices from the same period of time

Code

stock = web.DataReader('^DJI', 'stooq',start=min_date,end=max_date).reset_index()
stock.columns = stock.columns.str.lower()
stock[:3]

	date	open	high	low	close	volume
0	2016-07-01	17924.24	18002.38	17916.91	17949.37	82167191
1	2016-06-30	17712.76	17930.61	17711.80	17929.99	133078223
2	2016-06-29	17456.02	17704.51	17456.02	17694.68	106343184

Merge the clean data with the stock price data

Code

enriched_df = clean_df.merge(stock,on='date',how='left').sort_values(by='date').copy()
enriched_df[:3]

	date	neg_mean	neg_sum	neu_mean	neu_sum	pos_mean	pos_sum	compound_mean	compound_sum	Label	open	high	low	close	volume
0	2008-08-08	0.19284	4.821	0.76920	19.230	0.03800	0.950	-0.309440	-7.7360	0	11432.1	11760.0	11388.0	11734.3	212842817
1	2008-08-11	0.15028	3.757	0.78260	19.565	0.06708	1.677	-0.120740	-3.0185	1	11729.7	11867.1	11675.5	11782.3	183186104
2	2008-08-12	0.15712	3.928	0.78496	19.624	0.05788	1.447	-0.217556	-5.4389	0	11781.7	11782.3	11601.5	11642.5	173686814
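A left merge keeps every news date even when no price row matches, so it is worth checking for gaps. The sketch below uses made-up rows and the indicator option of DataFrame.merge to list the dates without a price.

```python
import pandas as pd

news = pd.DataFrame({
    'date': pd.to_datetime(['2008-08-08', '2008-08-11', '2008-08-12']),
    'compound_mean': [-0.31, -0.12, -0.22],
})
# Price rows arrive newest first, as in the stooq output above; one date is missing on purpose.
prices = pd.DataFrame({
    'date': pd.to_datetime(['2008-08-12', '2008-08-08']),
    'close': [11642.5, 11734.3],
})

merged = (
    news.merge(prices, on='date', how='left', indicator=True)
        .sort_values('date')
        .reset_index(drop=True)
)
# Rows tagged 'left_only' had no matching price row.
unmatched = merged.loc[merged['_merge'] == 'left_only', 'date']
print(unmatched.tolist())
```

If the list is empty on the real data, every news day has prices and no NaN values will reach the model through the merge.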

Features Correlation

Code

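plt.figure(figsize=(16,7))
# Restrict to numeric columns so the datetime 'date' column is left out explicitly
corr = enriched_df.select_dtypes('number').corr(method='spearman')
# corr[corr>0] masks zero and negative correlations, so only positive ones are drawn
sns.heatmap(corr[corr>0],annot=True,fmt='.2f',cmap='Blues');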
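Spearman correlation is the Pearson correlation computed on ranks, which makes it robust to the very different scales of sentiment scores and prices. A small sketch on made-up numbers confirms the equivalence:

```python
import pandas as pd

toy = pd.DataFrame({
    'compound_mean': [-0.31, -0.12, -0.22, 0.05, 0.10],
    'close': [11734.3, 11782.3, 11642.5, 11900.0, 12500.0],
})

spearman = toy.corr(method='spearman')
# Same result by ranking each column first and then using Pearson.
pearson_on_ranks = toy.rank().corr(method='pearson')

rho = spearman.loc['compound_mean', 'close']
print(round(rho, 4))
```

Since only the ordering matters, any monotonic rescaling of a column leaves its Spearman correlations unchanged.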

Close Price and Compound Mean highlighted by the given Label

Code

sns.scatterplot(x=enriched_df['close'],y=enriched_df['compound_mean'],hue=enriched_df['Label'],alpha=.5);

Now let’s use an XGBoost model to see what is driving price changes

Code

import xgboost as xgb
from sklearn.model_selection import GridSearchCV
from sklearn import metrics

We will use the compound mean as a feature to capture the news influence in our model

Code

features = ['compound_mean','open','high','low']
target = 'close'

XGBoost Regressor

Code

params = {'eval_metric': 'rmse', 'max_depth': 5, 'n_estimators': 100, 'objective': 'reg:gamma', 'use_label_encoder': False}
reg = xgb.XGBRegressor(**params)
reg.fit(enriched_df[features],enriched_df[target])

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, eval_metric='rmse',
             gamma=0, gpu_id=-1, importance_type='gain',
             interaction_constraints='', learning_rate=0.300000012,
             max_delta_step=0, max_depth=5, min_child_weight=1, missing=nan,
             monotone_constraints='()', n_estimators=100, n_jobs=8,
             num_parallel_tree=1, objective='reg:gamma', random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=None, subsample=1,
             tree_method='exact', use_label_encoder=False,
             validate_parameters=1, verbosity=None)
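The model above is fitted on the full dataset, so it gives no held-out score yet. Below is a minimal sketch of a time-ordered holdout with a naive yardstick, under loud assumptions: the prices are synthetic random-walk data, not the DJIA, and chronological_split and rmse are hypothetical helpers. Since open, high and low are same-day prices, the naive guess "close = open" is a useful reference for any model score.

```python
import numpy as np


def chronological_split(n_rows, test_fraction=0.2):
    """Return (train_idx, test_idx) with the test block at the end of the series."""
    cut = n_rows - int(round(n_rows * test_fraction))
    return np.arange(cut), np.arange(cut, n_rows)


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


# Synthetic prices, purely illustrative (assumption: not the real data).
rng = np.random.default_rng(0)
open_ = 11000 + np.cumsum(rng.normal(0, 50, size=500))
close = open_ + rng.normal(0, 40, size=500)

train_idx, test_idx = chronological_split(len(close))
# Naive "close = open" guess, scored only on the most recent rows.
baseline_rmse = rmse(close[test_idx], open_[test_idx])
print(round(baseline_rmse, 2))
```

On the real data, the same indices could be used to fit on the earlier rows and score the regressor on the later ones, which avoids training on the future.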

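Let’s look at one of our trees

When reading the splits on compound_mean, keep in mind the typical VADER thresholds for the compound score: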

positive sentiment: compound score >= 0.05
neutral sentiment: (compound score > -0.05) and (compound score < 0.05)
negative sentiment: compound score <= -0.05
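These thresholds can be written as a small function. The name sentiment_label is hypothetical, and the 0.05 cut-off is taken from the list above.

```python
def sentiment_label(compound, threshold=0.05):
    """Map a VADER compound score to 'positive', 'neutral' or 'negative'."""
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'


# Compound scores of the first three Top1 headlines shown earlier.
for score in (-0.5994, 0.8156, 0.0258):
    print(score, sentiment_label(score))
```

By this rule the daily compound_mean values in the first three rows of our data (about -0.31, -0.12 and -0.22) would all count as negative days.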

Code

fig,ax = plt.subplots(1,1,figsize=(35,10))
xgb.plot_tree(reg,num_trees=50,ax=ax);

Feature Importance

Code

xgb.plot_importance(reg);

Feature Impact

Code

import shap

Code

expl = shap.TreeExplainer(reg)

Code

shap_values = expl.shap_values(enriched_df[features],enriched_df[target])

Code

shap.summary_plot(shap_values,enriched_df[features])
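To make the summary plot easier to read, here is a brute-force sketch of the quantity SHAP attributes to each feature: the Shapley value. This is not how TreeExplainer computes it (the library uses a much faster tree-specific algorithm); exact_shapley and toy_model are hypothetical names, and the numbers are made up.

```python
from itertools import combinations
from math import factorial


def exact_shapley(predict, x, baseline):
    """Exact Shapley values for one row; 'absent' features take baseline values."""
    n = len(x)

    def value(subset):
        row = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(row)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Weight of every coalition of this size that excludes feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_model(row):
    """Made-up price model with an interaction between the two inputs."""
    open_price, compound_mean = row
    return open_price + 100 * compound_mean + 50 * compound_mean * (open_price > 11500)


x = [11700.0, -0.3]          # row to explain: [open, compound_mean]
baseline = [11400.0, 0.0]    # reference row
phi = exact_shapley(toy_model, x, baseline)
print(phi)
```

The contributions add up to the difference between the prediction for the row and the prediction for the baseline, which is the property the SHAP plots rely on.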