Kaggle credentials to download the dataset

%%time
import os
if not os.path.exists('CORD-19-research-challenge.zip'):
  os.environ['KAGGLE_USERNAME'] = "" # username from the json file
  os.environ['KAGGLE_KEY'] = "" # key from the json file
  !kaggle datasets download -d allen-institute-for-ai/CORD-19-research-challenge
CPU times: user 125 µs, sys: 0 ns, total: 125 µs
Wall time: 220 µs

Find the dataset and unpack it

os.listdir('./')
['.config', 'CORD-19-research-challenge.zip', 'sample_data']
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import zipfile
with zipfile.ZipFile("CORD-19-research-challenge.zip") as z:
   with z.open("metadata.csv") as f:
      df = pd.read_csv(f)
df[:3]
sha source_x title doi pmcid pubmed_id license abstract publish_time authors journal Microsoft Academic Paper ID WHO #Covidence has_full_text full_text_file
0 NaN Elsevier Intrauterine virus infections and congenital h... 10.1016/0002-8703(72)90077-4 NaN 4361535.0 els-covid Abstract The etiologic basis for the vast majo... 1972-12-31 Overall, James C. American Heart Journal NaN NaN False custom_license
1 NaN Elsevier Coronaviruses in Balkan nephritis 10.1016/0002-8703(80)90355-5 NaN 6243850.0 els-covid NaN 1980-03-31 Georgescu, Leonida; Diosi, Peter; Buţiu, Ioan;... American Heart Journal NaN NaN False custom_license
2 NaN Elsevier Cigarette smoking and coronary heart disease: ... 10.1016/0002-8703(80)90356-7 NaN 7355701.0 els-covid NaN 1980-03-31 Friedman, Gary D American Heart Journal NaN NaN False custom_license

Remove duplicates and missing values

So far we are interested just in the title, abstract, and perhaps the authors:

  • First, we get rid of rows with missing values
  • Then, we drop the duplicated rows
df = df[['title','abstract','authors']]
df.isnull().sum()
title        224
abstract    8414
authors     3146
dtype: int64
df.dropna(inplace=True)
df.isnull().sum()
title       0
abstract    0
authors     0
dtype: int64
len(df[df.duplicated()])
20
df.drop_duplicates(inplace=True)
len(df[df.duplicated()])
0
df[:3]
title abstract authors
0 Intrauterine virus infections and congenital h... Abstract The etiologic basis for the vast majo... Overall, James C.
3 Clinical and immunologic studies in identical ... Abstract Middle-aged female identical twins, o... Brunner, Carolyn M.; Horwitz, David A.; Shann,...
4 Epidemiology of community-acquired respiratory... Abstract Upper respiratory tract infections ar... Garibaldi, Richard A.

TensorFlow Upgrade

Here we upgrade TensorFlow so that we can load the Universal Sentence Encoder from TensorFlow Hub in the next section.

!pip install --upgrade tensorflow
Requirement already up-to-date: tensorflow in /usr/local/lib/python3.6/dist-packages (2.1.0)
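The encoder in the next section is loaded from TensorFlow Hub; the tensorflow_hub package comes preinstalled on Colab, but if you are running elsewhere you may need to install it first, roughly like this:

!pip install tensorflow-hub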

Text Feature Extraction

import tensorflow as tf
import tensorflow_hub as hub
class UniversalSentenceEncoder:

    def __init__(self, encoder='universal-sentence-encoder', version='4'):
        self.version = version
        self.encoder = encoder
        # load the pretrained model from TensorFlow Hub
        self.embd = hub.load(f"https://tfhub.dev/google/{encoder}/{version}")

    def embed(self, sentences):
        return self.embd(sentences)

    def squeezed(self, sentences):
        # cast to string tensors, squeeze out extra dimensions and return the embeddings as a numpy array
        return np.array(self.embd(tf.squeeze(tf.cast(sentences, tf.string))))
encoder = UniversalSentenceEncoder(encoder='universal-sentence-encoder', version='4')
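As a quick sanity check (illustrative only, not part of the original notebook), the encoder maps every sentence to a 512-dimensional vector:

sample = encoder.embed(["Coronaviruses in Balkan nephritis"])
print(sample.shape)  # expected: (1, 512)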

Since this process is quite time-consuming, we will work only on a slice of the dataset.
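The slicing step itself is not shown in the notebook; a minimal sketch, assuming we keep the first 5,000 rows (the "5k" mentioned further down), would be:

df = df[:5000].copy()  # hypothetical cut-off; adjust to your time/memory budget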

Title Feature Correlation

%%time
df['title_sent_vects'] = encoder.squeezed(df.title.values).tolist()
CPU times: user 8.13 s, sys: 3.23 s, total: 11.4 s
Wall time: 5.22 s
import plotly.graph_objects as go

sents = 50
labels = df[:sents].title.values
features = df[:sents].title_sent_vects.values.tolist()

fig = go.Figure(data=go.Heatmap(
                    z=np.inner(features, features),
                    x=labels,
                    y=labels,
                    colorscale='Viridis',
                    ))

fig.update_layout(
    margin=dict(l=40, r=40, t=40, b=40),
    height=1000,
    xaxis=dict(
        autorange=True,
        showgrid=False,
        ticks='',
        showticklabels=False
    ),
    yaxis=dict(
        autorange=True,
        showgrid=False,
        ticks='',
        showticklabels=False
    )
)

fig.show()

Abstract Features Correlation

# df['abstract_sent_vects'] = encoder.squeezed(df.abstract.values).tolist()
# sents = 30
# labels = df[:sents].abstract.values
# features = df[:sents].abstract_sent_vects.values.tolist()

# fig = go.Figure(data=go.Heatmap(
#                     z=np.inner(features, features),
#                     x=labels,
#                     y=labels,
#                     colorscale='Viridis',
#                     ))

# fig.update_layout(
#     margin=dict(l=140, r=140, t=140, b=140),
#     height=1000,
#     xaxis=dict(
#         autorange=True,
#         showgrid=False,
#         ticks='',
#         showticklabels=False
#     ),
#     yaxis=dict(
#         autorange=True,
#         showgrid=False,
#         ticks='',
#         showticklabels=False
#     ),
#     hoverlabel = dict(namelength = -1)
# )

# fig.show()

PCA & Clustering

from sklearn.cluster import KMeans
import matplotlib.cm as cm
from sklearn.decomposition import PCA
from datetime import datetime
import plotly.express as px

Title Level Clustering

%%time
n_clusters = 10
vectors = df.title_sent_vects.values.tolist()
kmeans = KMeans(n_clusters = n_clusters, init = 'k-means++', random_state = 0)
kmean_indices = kmeans.fit_predict(vectors)

pca = PCA(n_components=512)  # keep all 512 principal components; only the first two are plotted below
scatter_plot_points = pca.fit_transform(vectors)

tmp = pd.DataFrame({
    'Feature space for the 1st feature': scatter_plot_points[:,0],
    'Feature space for the 2nd feature': scatter_plot_points[:,1],
    'labels': kmean_indices,
    'title': df.title.values.tolist()[:len(vectors)]
})

fig = px.scatter(tmp, x='Feature space for the 1st feature', y='Feature space for the 2nd feature', color='labels',
                 size='labels', hover_data=['title'])
fig.update_layout(
    margin=dict(l=20, r=20, t=20, b=20),
    height=1000
)

fig.show()

Abstract Level Clustering

# n_clusters = 10
# vectors = df.abstract_sent_vects.values.tolist()
# kmeans = KMeans(n_clusters = n_clusters, init = 'k-means++', random_state = 0)
# kmean_indices = kmeans.fit_predict(vectors)

# pca = PCA(n_components=512)
# scatter_plot_points = pca.fit_transform(vectors)

# tmp = pd.DataFrame({
#     'Feature space for the 1st feature': scatter_plot_points[:,0],
#     'Feature space for the 2nd feature': scatter_plot_points[:,1],
#     'labels': kmean_indices,
#     'title': df.abstract.values.tolist()[:vectors.__len__()]
# })

# fig = px.scatter(tmp, x='Feature space for the 1st feature', y='Feature space for the 2nd feature', color='labels',
#                  size='labels', hover_data=['title'])
# fig.update_layout(
#     margin=dict(l=20, r=20, t=20, b=20),
#     height=1000
# )

# fig.show()

Graphs

Preparing the data for graphs

import networkx as nx
from sklearn.metrics.pairwise import cosine_similarity

Make a copy of our dataframe and create the similarity matrix for the extracted title vectors

%%time 
sdf = df.copy()
similarity_matrix = cosine_similarity(sdf.title_sent_vects.values.tolist())
CPU times: user 51.4 s, sys: 1min 51s, total: 2min 42s
Wall time: 13.9 s

Add them into a dataframe, similar to what a heatmap looks like

simdf = pd.DataFrame(
    similarity_matrix,
    columns = sdf.title.values.tolist(),
    index = sdf.title.values.tolist()
)

Let's unstack it so that it is easier to add into our graph

long_form = simdf.unstack()
# rename columns and turn into a dataframe
long_form.index.rename(['t1', 't2'], inplace=True)
long_form = long_form.to_frame('sim').reset_index()
long_form = long_form[long_form.t1 != long_form.t2]
long_form[:3]

Plotly Graph

!pip install python-igraph
import igraph as ig
%%time
lng = long_form[long_form.sim > 0.75] 
tuples = [tuple(x) for x in lng.values]
Gm = ig.Graph.TupleList(tuples, edge_attrs = ['sim'])
layt=Gm.layout('kk', dim=3)
Xn = [layt[k][0] for k in range(len(layt))]  # x-coordinates of nodes
Yn = [layt[k][1] for k in range(len(layt))]  # y-coordinates
Zn = [layt[k][2] for k in range(len(layt))]  # z-coordinates
Xe=[]
Ye=[]
Ze=[]
for e in Gm.get_edgelist():
    Xe+=[layt[e[0]][0],layt[e[1]][0], None]# x-coordinates of edge ends
    Ye+=[layt[e[0]][1],layt[e[1]][1], None]
    Ze+=[layt[e[0]][2],layt[e[1]][2], None]
import plotly.graph_objs as go
trace1 = go.Scatter3d(x = Xe,
  y = Ye,
  z = Ze,
  mode = 'lines',
  line = dict(color = 'rgb(0,0,0)', width = 1),
  hoverinfo = 'none'
)

trace2 = go.Scatter3d(x = Xn,
  y = Yn,
  z = Zn,
  mode = 'markers',
  name = 'actors',
  marker = dict(symbol = 'circle',
    size = 6, 
    # color = group,
    colorscale = 'Viridis',
    line = dict(color = 'rgb(50,50,50)', width = 0.5)
  ),
  text = Gm.vs['name'],  # article titles aligned with the node coordinates
  hoverinfo = 'text'
)

axis = dict(showbackground = False,
  showline = False,
  zeroline = False,
  showgrid = False,
  showticklabels = False,
  title = ''
)

layout = go.Layout(
  title = "Network of similarity between CORD-19 Articles(3D visualization)",
  width = 1500,
  height = 1500,
  showlegend = False,
  scene = dict(
    xaxis = dict(axis),
    yaxis = dict(axis),
    zaxis = dict(axis),
  ),
  margin = dict(
    t = 100,
    l = 20,
    r = 20
  ),

)
fig=go.Figure(data=[trace1,trace2], layout=layout)

fig.show()

Finding Communities

Next, we filter the pairs, keeping only titles whose similarity exceeds a threshold

sim_weight = 0.75
gdf = long_form[long_form.sim > sim_weight]

We create our graph from our dataframe

plt.figure(figsize=(50,50))
pd_graph = nx.from_pandas_edgelist(gdf, 't1', 't2')
pos = nx.spring_layout(pd_graph)
nx.draw_networkx(pd_graph,pos,with_labels=True,font_size=10, node_size = 30)

Now let's try to find communities in our graph

betCent = nx.betweenness_centrality(pd_graph, normalized=True, endpoints=True)
node_color = [20000.0 * pd_graph.degree(v) for v in pd_graph]
node_size =  [v * 10000 for v in betCent.values()]
plt.figure(figsize=(35,35))
nx.draw_networkx(pd_graph, pos=pos, with_labels=True,
                 font_size=5,
                 node_color=node_color,
                 node_size=node_size )
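The cell above only visualizes node importance via betweenness centrality; it does not extract communities by itself. As an illustrative aside (the notebook itself simply uses connected components as groups in the next step), an explicit community-detection algorithm such as networkx's greedy modularity maximization could be applied:

from networkx.algorithms.community import greedy_modularity_communities

# each community is a set of article titles; the list is ordered from largest to smallest
communities = greedy_modularity_communities(pd_graph)
print(len(communities), 'communities found')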

Now let's get our groups

# connected components of the similarity graph act as our article groups
l = list(nx.connected_components(pd_graph))

# map every article title in a component to that component's index
L = [dict.fromkeys(y, x) for x, y in enumerate(l)]

# flatten into a list of {'articles': title, 'groups': group_id} records
d = [{'articles': k, 'groups': v} for group_map in L for k, v in group_map.items()]

We've got our 'clustered' dataframe of articles; however, since we filtered the data to keep only the most similar articles, we are left with just 660 of the 5k articles.

Creating word clouds for groups

gcd = pd.DataFrame.from_dict(d)
import nltk
nltk.download('stopwords')
nltk.download('punkt') 
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize 
from nltk.tokenize import RegexpTokenizer

tok = RegexpTokenizer(r'\w+')
stop_words = set(stopwords.words('english')) 
def clean(string):
  return " ".join([w for w in word_tokenize(" ".join(tok.tokenize(string))) if not w in stop_words])

gcd.articles = gcd.articles.apply(lambda x: clean(x))
len(gcd), len(gcd) / len(df)
from wordcloud import WordCloud
%%time
clouds = dict()

big_groups = pd.DataFrame({
    'counts':gcd.groups.value_counts()
    }).sort_values(by='counts',ascending=False)[:9].index.values.tolist()

for group in big_groups:
  text = gcd[gcd.groups == group].articles.values
  wordcloud = WordCloud(width=1000, height=1000).generate(str(text))
  clouds[group] = wordcloud
def plot_figures(figures, nrows = 1, ncols=1):
    """Plot a dictionary of figures.

    Parameters
    ----------
    figures : <title, figure> dictionary
    ncols : number of columns of subplots wanted in the display
    nrows : number of rows of subplots wanted in the figure
    """

    fig, axeslist = plt.subplots(ncols=ncols, nrows=nrows,figsize=(20,20))
    for ind,title in zip(range(len(figures)), figures):
        axeslist.ravel()[ind].imshow(figures[title], cmap=plt.jet())
        axeslist.ravel()[ind].set_title(f'Most frequent words for the group {title+1}')
        axeslist.ravel()[ind].set_axis_off()
    plt.tight_layout() # optional
plot_figures(clouds, 3, 3)
plt.show()