try:
    import faster_than_requests as req
except ImportError:
    !pip install faster-than-requests
    import faster_than_requests as req
Daniel Fat
January 9, 2020
We use faster-than-requests because it is new, fast and fancy.
We also upgrade gensim at this point so we won't have to restart the runtime after we get the data in.
Here we import the packages we need. It is good to know that we use fake_useragent to change the User-Agent in the header of each request to make it look like IE7. This forces Twitter to fall back to the old web version, without the web-component-based front end, which is much easier to crawl.
More utils which are going to be used later; please use TF <= 2.0.
from gensim.utils import simple_preprocess
import tensorflow as tf
import tensorflow_hub as hub
from sklearn.metrics.pairwise import cosine_similarity
import networkx as nx
from wordcloud import WordCloud
import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.tokenize import RegexpTokenizer
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Package punkt is already up-to-date!
Here is the fancy trick with the user agent
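A minimal sketch of the idea, assuming fake_useragent; the exact call that hands the header to faster-than-requests is not reproduced here, so ie_header only illustrates the trick.
from fake_useragent import UserAgent

ua = UserAgent()
# an old IE-style User-Agent; sending it makes Twitter serve the legacy mobile pages
ie_header = [('User-Agent', ua.ie)]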
So here we have:
- twitter_selectors: a dictionary with the CSS selectors for each tweet
- get_soup: sends a GET request to fetch the HTML content from the URL straight into the BeautifulSoup HTML parser, creating an object which I call soup
- get_tweets: uses a soup to extract the username and content of each tweet
- get_tweets_df: finally, this function uses all the functions above in a loop to mine all the tweets and returns a pandas DataFrame
# extra imports used by the crawler below
import urllib.parse
import pandas as pd
from bs4 import BeautifulSoup

twitter_selectors = {
    'post': '.tweet',
    # 'full_name': '.fullname',
    'username': '.username',
    'content': '.tweet-text'
}

def get_soup(url):
    # fetch the page with faster_than_requests and parse it with BeautifulSoup
    return BeautifulSoup(req.get2str(url), 'html.parser')

def get_tweets(soup):
    tweets = list()
    for tweet in soup.select(twitter_selectors.get('post')):
        tweets.append({
            'username': tweet.select_one(twitter_selectors.get('username')).text,
            'content': tweet.select_one(twitter_selectors.get('content')).text,
        })
    return tweets

def get_tweets_df(keyword, limit=None):
    url = f'https://mobile.twitter.com/search?q={urllib.parse.quote_plus(keyword)}'
    tweets = []
    stop = False
    while not stop:
        try:
            soup = get_soup(url=url)
            tweets += get_tweets(soup=soup)
            # follow the "Load older Tweets" link of the legacy mobile site, if present
            next_link = soup.select_one('.w-button-more a')
            if next_link:
                url = 'https://mobile.twitter.com' + next_link['href']
            else:
                stop = True
        except Exception:
            continue
        if limit is not None and len(tweets) >= limit:
            stop = True
    print(f'{len(tweets)} tweets has been crawled')
    return pd.DataFrame.from_dict(tweets)
Then we call it with a limit of 1000 tweets to get all the tweets for the related topic, in our case coronavirus. Since we are mining tweets page by page, we end up with a few more than the limit.
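A minimal sketch of that call, assuming the resulting DataFrame is named df (the name is reused in the later sketches):
%%time
df = get_tweets_df('coronavirus', limit=1000)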
1015 tweets has been crawled
CPU times: user 2.45 s, sys: 43.4 ms, total: 2.5 s
Wall time: 27.5 s
Further on we use a lambda function to remove any line breaks, @ and # characters.
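One possible shape for such a lambda, shown standalone since the column it is applied to is not specified here:
# illustrative only: strip line breaks, @ and # from a piece of text
clean = lambda t: t.replace('\n', ' ').replace('@', '').replace('#', '')
clean('RT @user: #coronavirus update\n')  # -> 'RT user: coronavirus update '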
Now we use the Universal Sentence Encoder to get the sentence embeddings from the tweets' content and save them as a column.
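A minimal sketch of that step, assuming TF 2.x, the USE v4 module on TF Hub and the df DataFrame from above:
# load the Universal Sentence Encoder and embed every tweet, storing the 512-d vectors as lists
embed = hub.load('https://tfhub.dev/google/universal-sentence-encoder/4')
df['content_sent_vects'] = embed(df['content'].tolist()).numpy().tolist()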
 | username | content | content_wordlist | content_sent_vects
---|---|---|---|---
0 | @allkpop | Jaejoong facing possible punishment by KCDC ... | NaN | [-0.07023757696151733, 0.04490802809596062, -0... |
1 | @Corona_Bot__ | CONFIRMED: Barney tests positive for Coronav... | NaN | [0.05011240020394325, -0.031166449189186096, -... |
2 | @nytimes | The Korean star known as Jaejoong of the K-p... | NaN | [-0.01596658118069172, 0.04994076490402222, 0.... |
3 | @PIB_India | Approx 1800 people related to #TablighJamaat... | NaN | [-0.06460247933864594, 0.05959327518939972, -0... |
4 | @SkyNews | Coronavirus: Woman fined £650 for refusing t... | NaN | [-0.05083870142698288, 0.03122670203447342, 0.... |
... | ... | ... | ... | ... |
1010 | @ASlavitt | Andy Slavitt: slowing spread of coronavirus ... | NaN | [0.02724810130894184, -0.002946038730442524, 0... |
1011 | @fred_guttenberg | Many reasons that I wish @JoeBiden were Pres... | NaN | [0.04280007258057594, -0.05422208085656166, 0.... |
1012 | @FOXLA | Coronavirus droplets could travel 27 feet, wa... | NaN | [0.03585874289274216, 0.06049816682934761, 0.0... |
1013 | @joncoopertweets | Fox News purportedly bracing for “legal bloo... | NaN | [0.03799957036972046, 0.0289924293756485, -0.0... |
1014 | @NBCLatino | On #CesarChavezDay, Democratic presidential ... | NaN | [0.020493853837251663, 0.03346347063779831, 0.... |
1015 rows × 4 columns
More packages
from gensim.models import Phrases
from gensim.models.phrases import Phraser
import spacy
import gensim.corpora as corpora
from gensim.models.ldamodel import LdaModel
from pprint import pprint
from gensim.models import CoherenceModel
from gensim.models.wrappers import LdaMallet
from os.path import exists
import requests, zipfile, io
import os
I've put together a class following this tutorial to handle the LDA model more easily; I still have some issues with the Mallet model inside Colab, but I guess I will try to sort that out later.
Nothing very fancy here, just building and storing everything in a class, as further on we might need to reuse bigrams or trigrams. It is not the best approach, as it uses loads of other packages internally, so passing arguments into the constructor to initialise or change each module's parameters might be a pain; for the moment we will stick with this configuration and hopefully we can find a fast way to work with the Mallet model.
class LDA:
    def __init__(self, sentences, mallet=False):
        self.sentences = sentences
        self.bigram = Phrases(self.sentences, min_count=5, threshold=100)  # higher threshold, fewer phrases
        self.trigram = Phrases(self.bigram[self.sentences], threshold=100)
        self.bigram_mod = Phraser(self.bigram)
        self.trigram_mod = Phraser(self.trigram)
        self.stop_words = stopwords.words('english')
        self.nlp = spacy.load('en', disable=['parser', 'ner'])
        self.download_mallet_path = 'http://mallet.cs.umass.edu/dist/mallet-2.0.8.zip'
        self.mallet_path = '/content/mallet-2.0.8/bin/mallet'
        self.lda_model = None
        self.ldamallet = None
        self.topics = None
        self.java_installed = False
        os.environ['MALLET_HOME'] = '/content/mallet-2.0.8'
        ######################
        self.make()
        # if mallet:
        #     self.build_ldamallet()
        # else:
        #     self.build_lda()
        ######################

    def remove_stopwords(self):
        self.sent_no_stops = [[word for word in simple_preprocess(str(doc)) if word not in self.stop_words]
                              for doc in self.sentences]

    def make_bigrams(self):
        self.bigram_data = [self.bigram_mod[doc] for doc in self.sent_no_stops]

    def make_trigrams(self):
        self.trigram_data = [self.trigram_mod[self.bigram_mod[doc]] for doc in self.sent_no_stops]

    def lemmatization(self, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
        self.lemm_sentences = [[token.lemma_ for token in self.nlp(" ".join(sent)) if token.pos_ in allowed_postags]
                               for sent in self.bigram_data]

    def dictionary(self):
        self.id2word = corpora.Dictionary(self.lemm_sentences)

    def corpus(self):
        # note: the corpus list replaces this bound method after the first call
        self.corpus = [self.id2word.doc2bow(text) for text in self.lemm_sentences]

    def make(self):
        self.remove_stopwords()
        self.make_bigrams()
        self.make_trigrams()
        self.lemmatization()
        self.dictionary()
        self.corpus()

    def build_lda(self, num_topics=20, random_state=100, update_every=1, chunksize=100, passes=10, alpha='auto'):
        self.lda_model = LdaModel(
            corpus=self.corpus,
            id2word=self.id2word,
            num_topics=num_topics,
            random_state=random_state,
            update_every=update_every,
            chunksize=chunksize,
            passes=passes,
            alpha=alpha,
            per_word_topics=True)

    def download_mallet(self):
        r = requests.get(self.download_mallet_path)
        z = zipfile.ZipFile(io.BytesIO(r.content))
        z.extractall()
        if not self.java_installed:
            self.install_java()

    def install_java(self):
        !apt-get install -y openjdk-8-jdk-headless -qq > /dev/null  # install openjdk
        os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"  # set environment variable
        !java -version  # check java version
        self.java_installed = True

    def build_ldamallet(self, num_topics=20):
        if not exists(self.mallet_path):
            self.download_mallet()
        self.ldamallet = LdaMallet(self.mallet_path, corpus=self.corpus, num_topics=num_topics, id2word=self.id2word)

    def coherence_score(self, mallet=False):
        if mallet:
            pprint(self.ldamallet.show_topics(formatted=False))
            md = self.ldamallet
        else:
            md = self.lda_model
            pprint(self.lda_model.print_topics())
        coherence_model = CoherenceModel(model=md, texts=self.lemm_sentences, dictionary=self.id2word, coherence='c_v')
        coh = coherence_model.get_coherence()
        print('\nCoherence Score: ', coh)

    def compute_coherence_values(self, limit, start=2, step=3, mallet=False):
        coherence_values = []
        for num_topics in range(start, limit, step):
            if mallet:
                self.build_ldamallet(num_topics=num_topics)
                md = self.ldamallet
            else:
                self.build_lda(num_topics=num_topics)
                md = self.lda_model
            coherencemodel = CoherenceModel(model=md, texts=self.lemm_sentences, dictionary=self.id2word, coherence='c_v')
            coherence_values.append({
                'num_topics': num_topics,
                'coherence': coherencemodel.get_coherence()
            })
        return coherence_values
We create an object of our LDA class, passing the raw sentences as a list. Then we run a coherence search for the best number of topics for our content. However, since those tweets are freshly mined, this process might give different results on each new batch of data.
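A sketch of those two steps; the start/step/limit values are inferred from the printed results below, and feeding df['content'] as the raw sentences is an assumption:
%%time
lda = LDA(sentences=df['content'].tolist())
coherence_values = lda.compute_coherence_values(start=2, step=6, limit=40)
pprint(coherence_values)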
[{'coherence': 0.24454476285561758, 'num_topics': 2},
{'coherence': 0.46119096489624833, 'num_topics': 8},
{'coherence': 0.4516224532223637, 'num_topics': 14},
{'coherence': 0.42736241847827366, 'num_topics': 20},
{'coherence': 0.3930428008422175, 'num_topics': 26},
{'coherence': 0.4105924951482943, 'num_topics': 32},
{'coherence': 0.39407950264156383, 'num_topics': 38}]
CPU times: user 34.9 s, sys: 183 ms, total: 35.1 s
Wall time: 35.2 s
From the previous search we found the best number of topics (8, the highest coherence score), which we now use to rebuild the model and show the topics we found.
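A sketch of the rebuild, using the 8-topic setting that scored highest above:
lda.build_lda(num_topics=8)
pprint(lda.lda_model.print_topics())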
[(0,
'0.148*"com" + 0.075*"twitter" + 0.061*"pic" + 0.046*"virus" + 0.032*"go" + 0.026*"people" + 0.022*"corona" + 0.018*"test" + 0.018*"make" + 0.018*"day"'),
(1,
'0.026*"would" + 0.025*"fight" + 0.021*"die" + 0.019*"watch" + 0.018*"turn" + 0.017*"hard" + 0.016*"leader" + 0.016*"thank" + 0.013*"presidential" + 0.013*"bar"'),
(2,
'0.036*"week" + 0.027*"tell" + 0.023*"see" + 0.019*"official" + 0.018*"tough" + 0.017*"break" + 0.015*"story" + 0.015*"worker" + 0.014*"end" + 0.014*"good"'),
(3,
'0.025*"shift" + 0.022*"use" + 0.018*"slow" + 0.017*"do" + 0.017*"state" + 0.016*"company" + 0.015*"truth" + 0.015*"crisis" + 0.015*"few" + 0.015*"time"'),
(4,
'0.037*"say" + 0.031*"today" + 0.029*"doctor" + 0.028*"many" + 0.028*"want" + 0.024*"nurse" + 0.021*"work" + 0.015*"can" + 0.013*"american" + 0.012*"job"'),
(5,
'0.069*"trump" + 0.027*"country" + 0.017*"tirelessly" + 0.016*"join" + 0.016*"term" + 0.014*"enemy" + 0.014*"view" + 0.013*"concerned" + 0.013*"republican" + 0.012*"nation"'),
(6,
'0.055*"case" + 0.053*"death" + 0.043*"new" + 0.021*"spread" + 0.021*"high" + 0.018*"back" + 0.017*"call" + 0.016*"important" + 0.014*"look" + 0.013*"country"'),
(7,
'0.031*"incredible" + 0.029*"say" + 0.025*"warn" + 0.024*"rather" + 0.023*"report" + 0.020*"could" + 0.020*"bravery" + 0.016*"force" + 0.016*"think" + 0.016*"care"')]
We use cosine similarity on the extracted vectors to create a similarity matrix.
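A minimal sketch, assuming the embeddings are stored as lists in df['content_sent_vects']:
%%time
vects = df['content_sent_vects'].tolist()
sim_matrix = cosine_similarity(vects)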
CPU times: user 2.88 s, sys: 150 ms, total: 3.03 s
Wall time: 1.96 s
We put this together into a dataframe with the usernames as columns and index, so it looks quite similar to a correlation dataframe or heatmap.
However, we want this unstacked to make it easier to pass into our graph; we also keep just the username, as each one is unique.
So here you can see the similarity between the content of each pair of users.
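A sketch of that reshaping; apart from long_form, which the graph code below uses, the variable names are assumptions:
# square similarity frame indexed by username, then unstacked into (t1, t2, sim) rows
sim_df = pd.DataFrame(sim_matrix,
                      index=df['username'].values,
                      columns=df['username'].values)
long_form = sim_df.unstack().reset_index()
long_form.columns = ['t1', 't2', 'sim']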
 | t1 | t2 | sim
---|---|---|---
1 | @AdamSchefter | @NorbertElekes | 0.133930 |
2 | @AdamSchefter | @BreitbartNews | 0.295476 |
3 | @AdamSchefter | @DarylDust | 0.150086 |
Here we select the similarity threshold used to filter the tweet pairs and create our graph from the formatted dataframe.
import matplotlib.pyplot as plt

sim_weight = 0.95
gdf = long_form[long_form.sim > sim_weight]

plt.figure(figsize=(25, 25))
pd_graph = nx.Graph()
pd_graph = nx.from_pandas_edgelist(gdf, 't1', 't2')
pos = nx.spring_layout(pd_graph)
nx.draw_networkx(pd_graph, pos, with_labels=True, font_size=10, font_color='#fff', edge_color='#f00', node_size=30)
Now we get the connected components into a dataframe. This gives us the number of groups, or clusters, extracted for the chosen similarity threshold. We then add the content of each user to the grouped dataframe.
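A hypothetical sketch of those steps; gcd and the way the content is merged back in are assumptions:
# one row per user, labelled with the id of its connected component
groups = [{'users': user, 'groups': i}
          for i, comp in enumerate(nx.connected_components(pd_graph))
          for user in comp]
gcd = pd.DataFrame(groups)
print(gcd.groups.nunique())  # number of groups/clusters for the chosen threshold
# attach each user's tweet content (assuming one tweet per username, as stated above)
gcd = gcd.merge(df.drop_duplicates('username')[['username', 'content']],
                left_on='users', right_on='username').drop(columns='username')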
 | users | groups | content
---|---|---|---
0 | @AdamSchefter | 0 | Cardinals All-Pro linebacker Chandler Jones ... |
1 | @kfitz134 | 0 | Cardinals All-Pro linebacker Chandler Jones ... |
2 | @Aletteri66 | 1 | The Corona virus can’t kill me because I alr... |
3 | @BrandonM0721 | 1 | The Corona Virus can’t kill me, I already di... |
4 | @realDailyWire | 2 | Epidemiologist Behind Highly-Cited Coronavir... |
We use an NLTK tokenizer to extract just the words and remove the stop words.
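A sketch of that step, assuming it is applied to the content column of the grouped frame gcd:
nltk.download('stopwords')
nltk.download('punkt')
tokenizer = RegexpTokenizer(r'\w+')
stop_words = set(stopwords.words('english'))
gcd['content'] = gcd['content'].apply(
    lambda t: ' '.join(w for w in tokenizer.tokenize(str(t)) if w.lower() not in stop_words))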
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Package punkt is already up-to-date!
Our big_groups is a list of indexes from our grouped dataset, holding the indexes of the top 12 groups sorted descending by size.
Then we iterate through these indexes and create a WordCloud for each group, which is stored in the clouds dictionary.
%%time
clouds = dict()
big_groups = pd.DataFrame({
    'counts': gcd.groups.value_counts()
}).sort_values(by='counts', ascending=False)[:12].index.values.tolist()

for group in big_groups:
    text = gcd[gcd.groups == group].content.values
    wordcloud = WordCloud(width=1000, height=1000).generate(str(text))
    clouds[group] = wordcloud
CPU times: user 28.7 s, sys: 581 ms, total: 29.3 s
Wall time: 29.4 s
def plot_figures(figures, nrows=1, ncols=1):
    fig, axeslist = plt.subplots(ncols=ncols, nrows=nrows, figsize=(20, 20))
    for ind, title in zip(range(len(figures)), figures):
        axeslist.ravel()[ind].imshow(figures[title], cmap=plt.jet())
        axeslist.ravel()[ind].set_title(f'Most Frequent words for the group {title+1}')
        axeslist.ravel()[ind].set_axis_off()
    plt.tight_layout()  # optional
Then we plot them with the help of our plot_figures function.
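A sketch of that call; the 4x3 grid shape for the 12 clouds is an assumption:
plot_figures(clouds, nrows=4, ncols=3)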