Daniel Fat
February 14, 2021
Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('seaborn')
There are two channels of data provided in this dataset:
News data: I crawled historical news headlines from Reddit WorldNews Channel (/r/worldnews). They are ranked by reddit users’ votes, and only the top 25 headlines are considered for a single date. (Range: 2008-06-08 to 2016-07-01)
Stock data: Dow Jones Industrial Average (DJIA) is used to “prove the concept”. (Range: 2008-08-08 to 2016-07-01)
I provided three data files in .csv format:
RedditNews.csv: two columns. The first column is the “date” and the second column is the “news headlines”. All news items are ranked from top to bottom based on how hot they are, so there are 25 lines for each date.
DJIA_table.csv: Downloaded directly from Yahoo Finance; check out the web page for more info.
CombinedNewsDJIA.csv: To make things easier for my students, I provide this combined dataset with 27 columns. The first column is “Date”, the second is “Label”, and the following ones are news headlines ranging from “Top1” to “Top25”.
| | Date | Label | Top1 | Top2 | Top3 | Top4 | Top5 | Top6 | Top7 | Top8 | ... | Top16 | Top17 | Top18 | Top19 | Top20 | Top21 | Top22 | Top23 | Top24 | Top25 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2008-08-08 | 0 | b"Georgia 'downs two Russian warplanes' as cou... | b'BREAKING: Musharraf to be impeached.' | b'Russia Today: Columns of troops roll into So... | b'Russian tanks are moving towards the capital... | b"Afghan children raped with 'impunity,' U.N. ... | b'150 Russian tanks have entered South Ossetia... | b"Breaking: Georgia invades South Ossetia, Rus... | b"The 'enemy combatent' trials are nothing but... | ... | b'Georgia Invades South Ossetia - if Russia ge... | b'Al-Qaeda Faces Islamist Backlash' | b'Condoleezza Rice: "The US would not act to p... | b'This is a busy day: The European Union has ... | b"Georgia will withdraw 1,000 soldiers from Ir... | b'Why the Pentagon Thinks Attacking Iran is a ... | b'Caucasus in crisis: Georgia invades South Os... | b'Indian shoe manufactory - And again in a se... | b'Visitors Suffering from Mental Illnesses Ban... | b"No Help for Mexico's Kidnapping Surge" |
1 | 2008-08-11 | 1 | b'Why wont America and Nato help us? If they w... | b'Bush puts foot down on Georgian conflict' | b"Jewish Georgian minister: Thanks to Israeli ... | b'Georgian army flees in disarray as Russians ... | b"Olympic opening ceremony fireworks 'faked'" | b'What were the Mossad with fraudulent New Zea... | b'Russia angered by Israeli military sale to G... | b'An American citizen living in S.Ossetia blam... | ... | b'Israel and the US behind the Georgian aggres... | b'"Do not believe TV, neither Russian nor Geor... | b'Riots are still going on in Montreal (Canada... | b'China to overtake US as largest manufacturer' | b'War in South Ossetia [PICS]' | b'Israeli Physicians Group Condemns State Tort... | b' Russia has just beaten the United States ov... | b'Perhaps *the* question about the Georgia - R... | b'Russia is so much better at war' | b"So this is what it's come to: trading sex fo... |
2 rows × 27 columns
VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media. It is fully open-sourced under the MIT License (we sincerely appreciate all attributions and readily accept most contributions, but please don’t hold us liable).
[nltk_data] Downloading package vader_lexicon to
[nltk_data] /Users/cristianexer/nltk_data...
[nltk_data] Package vader_lexicon is already up-to-date!
True
CPU times: user 1.08 ms, sys: 425 µs, total: 1.51 ms
Wall time: 1.02 ms
| | Date | Label | Top1 | Top2 | Top3 | Top4 | Top5 | Top6 | Top7 | Top8 | ... | Top16 | Top17 | Top18 | Top19 | Top20 | Top21 | Top22 | Top23 | Top24 | Top25 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2008-08-08 | 0 | b"Georgia 'downs two Russian warplanes' as cou... | b'BREAKING: Musharraf to be impeached.' | b'Russia Today: Columns of troops roll into So... | b'Russian tanks are moving towards the capital... | b"Afghan children raped with 'impunity,' U.N. ... | b'150 Russian tanks have entered South Ossetia... | b"Breaking: Georgia invades South Ossetia, Rus... | b"The 'enemy combatent' trials are nothing but... | ... | b'Georgia Invades South Ossetia - if Russia ge... | b'Al-Qaeda Faces Islamist Backlash' | b'Condoleezza Rice: "The US would not act to p... | b'This is a busy day: The European Union has ... | b"Georgia will withdraw 1,000 soldiers from Ir... | b'Why the Pentagon Thinks Attacking Iran is a ... | b'Caucasus in crisis: Georgia invades South Os... | b'Indian shoe manufactory - And again in a se... | b'Visitors Suffering from Mental Illnesses Ban... | b"No Help for Mexico's Kidnapping Surge" |
1 rows × 27 columns
Extract Sentiments from each news title
| | Date | Label | Top1 | Top2 | Top3 | Top4 | Top5 | Top6 | Top7 | Top8 | ... | Top16 | Top17 | Top18 | Top19 | Top20 | Top21 | Top22 | Top23 | Top24 | Top25 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2008-08-08 | 0 | {'neg': 0.262, 'neu': 0.738, 'pos': 0.0, 'comp... | {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound... | {'neg': 0.172, 'neu': 0.828, 'pos': 0.0, 'comp... | {'neg': 0.247, 'neu': 0.753, 'pos': 0.0, 'comp... | {'neg': 0.424, 'neu': 0.576, 'pos': 0.0, 'comp... | {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound... | {'neg': 0.149, 'neu': 0.851, 'pos': 0.0, 'comp... | {'neg': 0.107, 'neu': 0.79, 'pos': 0.103, 'com... | ... | {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound... | {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound... | {'neg': 0.078, 'neu': 0.819, 'pos': 0.103, 'co... | {'neg': 0.092, 'neu': 0.78, 'pos': 0.128, 'com... | {'neg': 0.112, 'neu': 0.773, 'pos': 0.116, 'co... | {'neg': 0.351, 'neu': 0.649, 'pos': 0.0, 'comp... | {'neg': 0.406, 'neu': 0.594, 'pos': 0.0, 'comp... | {'neg': 0.14, 'neu': 0.86, 'pos': 0.0, 'compou... | {'neg': 0.65, 'neu': 0.35, 'pos': 0.0, 'compou... | {'neg': 0.0, 'neu': 0.649, 'pos': 0.351, 'comp... |
1 rows × 27 columns
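The scoring step that produces the table above can be sketched roughly as follows. This is a minimal reconstruction, not the original cell: the helper name `score_headlines` and the two-column sample frame are illustrative assumptions, and the placeholder scorer is only a fallback so the sketch still runs where NLTK or the `vader_lexicon` resource is unavailable.

```python
import pandas as pd

try:
    from nltk.sentiment.vader import SentimentIntensityAnalyzer
    # Requires the vader_lexicon resource (see the nltk download log above)
    score = SentimentIntensityAnalyzer().polarity_scores
except (ImportError, LookupError):
    # Placeholder (assumption): neutral scores with the same keys VADER returns
    def score(text):
        return {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0}


def score_headlines(frame):
    """Replace every TopN headline with its sentiment dict; Date and Label are untouched."""
    headline_cols = [c for c in frame.columns if c.startswith("Top")]
    scored = frame.copy()
    for col in headline_cols:
        # Missing headlines are scored as empty strings
        scored[col] = scored[col].fillna("").astype(str).map(score)
    return scored


# Tiny sample using two headlines shown in the table above
sample = pd.DataFrame({
    "Date": ["2008-08-08"],
    "Label": [0],
    "Top2": ["b'BREAKING: Musharraf to be impeached.'"],
    "Top17": ["b'Al-Qaeda Faces Islamist Backlash'"],
})
scored = score_headlines(sample)
print(scored.loc[0, "Top2"])
```

Each headline cell ends up holding a dict with the keys `neg`, `neu`, `pos` and `compound`, which is the shape the formatting steps below expect.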
Some steps to format the data
| | level_0 | Date | 0 |
---|---|---|---|
0 | Top1 | 2008-08-08 | {'neg': 0.262, 'neu': 0.738, 'pos': 0.0, 'comp... |
1 | Top1 | 2008-08-11 | {'neg': 0.0, 'neu': 0.668, 'pos': 0.332, 'comp... |
2 | Top1 | 2008-08-12 | {'neg': 0.169, 'neu': 0.656, 'pos': 0.175, 'co... |
| | neg | neu | pos | compound |
---|---|---|---|---|
0 | 0.262 | 0.738 | 0.000 | -0.5994 |
1 | 0.000 | 0.668 | 0.332 | 0.8156 |
2 | 0.169 | 0.656 | 0.175 | 0.0258 |
| | news_label | Date | neg | neu | pos | compound |
---|---|---|---|---|---|---|
0 | Top1 | 2008-08-08 | 0.262 | 0.738 | 0.000 | -0.5994 |
1 | Top1 | 2008-08-11 | 0.000 | 0.668 | 0.332 | 0.8156 |
2 | Top1 | 2008-08-12 | 0.169 | 0.656 | 0.175 | 0.0258 |
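The three tables above are consistent with a wide-to-long reshape followed by expanding each sentiment dict into its own columns. The sketch below is one way to get there, assuming the scored frame is called `scored` (a hypothetical name); `sentiments` and `copy_df` are the names used by the aggregation code that follows, and the column names `level_0` and `0` are what pandas produces by default for this reshape.

```python
import pandas as pd

# Two scored rows for Top1, using the values shown in the tables above
scored = pd.DataFrame({
    "Date": ["2008-08-08", "2008-08-11"],
    "Top1": [
        {"neg": 0.262, "neu": 0.738, "pos": 0.0, "compound": -0.5994},
        {"neg": 0.0, "neu": 0.668, "pos": 0.332, "compound": 0.8156},
    ],
})

# Wide -> long: one row per (headline slot, date); columns become level_0, Date, 0
long_df = scored.set_index("Date").unstack().reset_index()

# Expand each sentiment dict into neg / neu / pos / compound columns
sentiments = pd.DataFrame(long_df[0].tolist())

# Put the labels and the expanded scores side by side
copy_df = pd.concat(
    [long_df[["level_0", "Date"]].rename(columns={"level_0": "news_label"}), sentiments],
    axis=1,
)
print(copy_df)
```

Keeping `sentiments` as a separate frame is convenient because its column names can then drive the per-date aggregation.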
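# Aggregate the per-headline scores to one row per date (mean and sum of each sentiment column)
clean_df = copy_df.groupby('Date', as_index=False).agg({x: ['mean', 'sum'] for x in sentiments.columns}).copy()
# Flatten the two-level column names: ('neg', 'mean') -> 'neg_mean', ('Date', '') -> 'date'
clean_df.columns = ['_'.join(x) if x[1] != '' else x[0].lower() for x in clean_df.columns]
# Attach the daily Label from the original frame, then drop the duplicate key column
clean_df = clean_df.merge(df[['Date', 'Label']], left_on='date', right_on='Date', how='left').drop(['Date'], axis=1)
clean_df['date'] = pd.to_datetime(clean_df['date'])
clean_df[:3]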
| | date | neg_mean | neg_sum | neu_mean | neu_sum | pos_mean | pos_sum | compound_mean | compound_sum | Label |
---|---|---|---|---|---|---|---|---|---|---|
0 | 2008-08-08 | 0.19284 | 4.821 | 0.76920 | 19.230 | 0.03800 | 0.950 | -0.309440 | -7.7360 | 0 |
1 | 2008-08-11 | 0.15028 | 3.757 | 0.78260 | 19.565 | 0.06708 | 1.677 | -0.120740 | -3.0185 | 1 |
2 | 2008-08-12 | 0.15712 | 3.928 | 0.78496 | 19.624 | 0.05788 | 1.447 | -0.217556 | -5.4389 | 0 |
Now let’s do some data enrichment for those sentiments
(Timestamp('2008-08-08 00:00:00'), Timestamp('2016-07-01 00:00:00'))
DJI stock prices from the same period of time
| | date | open | high | low | close | volume |
---|---|---|---|---|---|---|
0 | 2016-07-01 | 17924.24 | 18002.38 | 17916.91 | 17949.37 | 82167191 |
1 | 2016-06-30 | 17712.76 | 17930.61 | 17711.80 | 17929.99 | 133078223 |
2 | 2016-06-29 | 17456.02 | 17704.51 | 17456.02 | 17694.68 | 106343184 |
Merge the clean data with the stock price data
| | date | neg_mean | neg_sum | neu_mean | neu_sum | pos_mean | pos_sum | compound_mean | compound_sum | Label | open | high | low | close | volume |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2008-08-08 | 0.19284 | 4.821 | 0.76920 | 19.230 | 0.03800 | 0.950 | -0.309440 | -7.7360 | 0 | 11432.1 | 11760.0 | 11388.0 | 11734.3 | 212842817 |
1 | 2008-08-11 | 0.15028 | 3.757 | 0.78260 | 19.565 | 0.06708 | 1.677 | -0.120740 | -3.0185 | 1 | 11729.7 | 11867.1 | 11675.5 | 11782.3 | 183186104 |
2 | 2008-08-12 | 0.15712 | 3.928 | 0.78496 | 19.624 | 0.05788 | 1.447 | -0.217556 | -5.4389 | 0 | 11781.7 | 11782.3 | 11601.5 | 11642.5 | 173686814 |
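A merge like the one above might look like the following sketch. The frame name `prices`, the inner join, and the `validate` check are assumptions made for illustration; the original cell is not shown. The price frame is listed newest first, as in the stock table above, so the result is sorted by date afterwards.

```python
import pandas as pd

# Daily sentiment aggregates (values taken from the tables above)
clean_df = pd.DataFrame({
    "date": pd.to_datetime(["2008-08-08", "2008-08-11"]),
    "compound_mean": [-0.309440, -0.120740],
    "Label": [0, 1],
})

# Daily prices, newest first
prices = pd.DataFrame({
    "date": pd.to_datetime(["2008-08-11", "2008-08-08"]),
    "close": [11782.3, 11734.3],
    "volume": [183186104, 212842817],
})

# Join on the shared datetime key; validate guards against duplicated dates
merged = (
    clean_df.merge(prices, on="date", how="inner", validate="one_to_one")
    .sort_values("date")
    .reset_index(drop=True)
)
print(merged)
```

Both `date` columns need the same dtype before merging, which is why the aggregation step converts `date` with `pd.to_datetime`.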
Features Correlation
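As a rough illustration of how such a correlation matrix can be computed, the sketch below uses the three merged rows shown above and a subset of the columns. The frame name `merged` is an assumption, and three rows are far too few for the numbers to mean anything; the point is only the mechanics.

```python
import pandas as pd

# Three merged rows copied from the table above (subset of columns)
merged = pd.DataFrame({
    "compound_mean": [-0.309440, -0.120740, -0.217556],
    "Label": [0, 1, 0],
    "close": [11734.3, 11782.3, 11642.5],
    "volume": [212842817, 183186104, 173686814],
})

# Pairwise Pearson correlation between the numeric features
corr = merged.select_dtypes("number").corr()
print(corr.round(3))

# With seaborn imported as sns, sns.heatmap(corr, annot=True) would draw the matrix.
```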
Close Price and Compound Mean highlighted by the given Label
Now let’s use an XGBoost model to see what is driving price changes
We will use the compound mean as a feature to capture the influence of the news in our model
XGBoost Regressor
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, eval_metric='rmse',
gamma=0, gpu_id=-1, importance_type='gain',
interaction_constraints='', learning_rate=0.300000012,
max_delta_step=0, max_depth=5, min_child_weight=1, missing=nan,
monotone_constraints='()', n_estimators=100, n_jobs=8,
num_parallel_tree=1, objective='reg:gamma', random_state=0,
reg_alpha=0, reg_lambda=1, scale_pos_weight=None, subsample=1,
tree_method='exact', use_label_encoder=False,
validate_parameters=1, verbosity=None)
Let’s look at one of our trees
Feature Importance
Feature Impact