CORD-19: Clustering Articles and Papers Related to the Coronavirus
%%time
import os
if not os.path.exists('CORD-19-research-challenge.zip'):
    os.environ['KAGGLE_USERNAME'] = ""  # username from the kaggle.json file
    os.environ['KAGGLE_KEY'] = ""       # key from the kaggle.json file
    !kaggle datasets download -d allen-institute-for-ai/CORD-19-research-challenge
Find the dataset archive and read the metadata straight from it
os.listdir('./')
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import zipfile
with zipfile.ZipFile("CORD-19-research-challenge.zip") as z:
    with z.open("metadata.csv") as f:
        df = pd.read_csv(f)
df[:3]
For now we are interested only in the title, abstract, and authors columns.
- First we drop the rows with missing values
- Then we drop the duplicated rows
df = df[['title','abstract','authors']]
df.isnull().sum()
df.dropna(inplace=True)
df.isnull().sum()
len(df[df.duplicated()])
df.drop_duplicates(inplace=True)
len(df[df.duplicated()])
df[:3]
Here we prepare a simple sentence-encoder wrapper to be used for clustering later on
!pip install --upgrade tensorflow tensorflow_hub
import tensorflow as tf
import tensorflow_hub as hub
class UniversalSentenceEncoder:
    def __init__(self, encoder='universal-sentence-encoder', version='4'):
        self.version = version
        self.encoder = encoder
        self.embd = hub.load(f"https://tfhub.dev/google/{encoder}/{version}")

    def embed(self, sentences):
        return self.embd(sentences)

    def squeezed(self, sentences):
        # cast to tf.string, drop singleton dimensions, return a plain numpy array
        return np.array(self.embd(tf.squeeze(tf.cast(sentences, tf.string))))
encoder = UniversalSentenceEncoder(encoder='universal-sentence-encoder', version='4')
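As a quick sanity check (the sentences below are made up for illustration), semantically related sentences should score a higher inner product than unrelated ones:
sample = encoder.embed([
    "coronavirus transmission between humans",
    "human-to-human transmission of SARS-CoV-2",
    "stock market volatility in 2020",
])
print(np.inner(sample, sample))  # 3x3 similarity matrix; the first two should be closest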
Since this process is quite time-consuming, we will work with small slices of the dataset when plotting below
%%time
df['title_sent_vects'] = encoder.squeezed(df.title.values).tolist()
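If embedding the whole column in a single call runs out of memory, a batched variant keeps peak usage bounded. This is a minimal sketch; the helper name and the batch size of 512 are arbitrary choices, not part of the original notebook:
def embed_in_batches(encoder, texts, batch_size=512):
    # embed fixed-size chunks and stack the resulting arrays
    chunks = [encoder.embed(list(texts[i:i + batch_size])).numpy()
              for i in range(0, len(texts), batch_size)]
    return np.vstack(chunks)

# df['title_sent_vects'] = embed_in_batches(encoder, df.title.values).tolist()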
import plotly.graph_objects as go
sents = 50
labels = df[:sents].title.values
features = df[:sents].title_sent_vects.values.tolist()
fig = go.Figure(data=go.Heatmap(
    z=np.inner(features, features),
    x=labels,
    y=labels,
    colorscale='Viridis',
))
fig.update_layout(
    margin=dict(l=40, r=40, t=40, b=40),
    height=1000,
    xaxis=dict(
        autorange=True,
        showgrid=False,
        ticks='',
        showticklabels=False
    ),
    yaxis=dict(
        autorange=True,
        showgrid=False,
        ticks='',
        showticklabels=False
    )
)
fig.show()
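Note that np.inner here is only approximately cosine similarity: the Universal Sentence Encoder's outputs are documented as approximately normalized, not exactly. If you want exact cosine similarity, normalize the vectors first (a two-line sketch):
feats = np.array(features)
feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)  # np.inner(feats, feats) is now exact cosine similarity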
# df['abstract_sent_vects'] = encoder.squeezed(df.abstract.values).tolist()
# sents = 30
# labels = df[:sents].abstract.values
# features = df[:sents].abstract_sent_vects.values.tolist()
# fig = go.Figure(data=go.Heatmap(
# z=np.inner(features, features),
# x=labels,
# y=labels,
# colorscale='Viridis',
# ))
# fig.update_layout(
# margin=dict(l=140, r=140, t=140, b=140),
# height=1000,
# xaxis=dict(
# autorange=True,
# showgrid=False,
# ticks='',
# showticklabels=False
# ),
# yaxis=dict(
# autorange=True,
# showgrid=False,
# ticks='',
# showticklabels=False
# ),
# hoverlabel = dict(namelength = -1)
# )
# fig.show()
from sklearn.cluster import KMeans
import matplotlib.cm as cm
from sklearn.decomposition import PCA
from datetime import datetime
import plotly.express as px
%%time
n_clusters = 10
vectors = df.title_sent_vects.values.tolist()
kmeans = KMeans(n_clusters=n_clusters, init='k-means++', random_state=0)
kmean_indices = kmeans.fit_predict(vectors)
pca = PCA(n_components=2)  # only the first two components are plotted
scatter_plot_points = pca.fit_transform(vectors)
tmp = pd.DataFrame({
    'Feature space for the 1st feature': scatter_plot_points[:, 0],
    'Feature space for the 2nd feature': scatter_plot_points[:, 1],
    'labels': kmean_indices,
    'title': df.title.values.tolist()[:len(vectors)]
})
# sizing by 'labels' would render cluster 0 invisible, so only color encodes the cluster
fig = px.scatter(tmp, x='Feature space for the 1st feature', y='Feature space for the 2nd feature',
                 color='labels', hover_data=['title'])
fig.update_layout(
    margin=dict(l=20, r=20, t=20, b=20),
    height=1000
)
fig.show()
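The n_clusters = 10 above is an arbitrary choice. A silhouette sweep is one hedged way to sanity-check it (this is a sketch, not part of the original pipeline; it is slow on the full set, so consider a subsample):
from sklearn.metrics import silhouette_score

X = np.array(vectors)
for k in range(2, 16):
    preds = KMeans(n_clusters=k, init='k-means++', random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, preds), 4))  # higher is better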
# n_clusters = 10
# vectors = df.abstract_sent_vects.values.tolist()
# kmeans = KMeans(n_clusters = n_clusters, init = 'k-means++', random_state = 0)
# kmean_indices = kmeans.fit_predict(vectors)
# pca = PCA(n_components=2)  # only the first two components are plotted
# scatter_plot_points = pca.fit_transform(vectors)
# tmp = pd.DataFrame({
# 'Feature space for the 1st feature': scatter_plot_points[:,0],
# 'Feature space for the 2nd feature': scatter_plot_points[:,1],
# 'labels': kmean_indices,
# 'title': df.abstract.values.tolist()[:len(vectors)]
# })
# fig = px.scatter(tmp, x='Feature space for the 1st feature', y='Feature space for the 2nd feature',
#                  color='labels', hover_data=['title'])
# fig.update_layout(
# margin=dict(l=20, r=20, t=20, b=20),
# height=1000
# )
# fig.show()
import networkx as nx
from sklearn.metrics.pairwise import cosine_similarity
Make a copy of our dataframe and create the similarity matrix for the extracted title vectors
%%time
sdf = df.copy()
similarity_matrix = cosine_similarity(sdf.title_sent_vects.values.tolist())
Add them to a dataframe, similar to what a heatmap looks like
simdf = pd.DataFrame(
similarity_matrix,
columns = sdf.title.values.tolist(),
index = sdf.title.values.tolist()
)
Let's unstack the matrix to make the pairs easier to add to our graph
long_form = simdf.unstack()
# rename columns and turn into a dataframe
long_form.index.rename(['t1', 't2'], inplace=True)
long_form = long_form.to_frame('sim').reset_index()
long_form = long_form[long_form.t1 != long_form.t2]
long_form[:3]
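Unstacking materializes every one of the N² pairs (tens of millions of rows for a few thousand titles). If that becomes a memory problem, thresholding the raw matrix first and building the long form only from the surviving pairs is a lighter alternative. A sketch, using the same 0.75 cutoff applied below (long_form_sparse is a hypothetical name, not used later):
titles = sdf.title.values
i_idx, j_idx = np.where(similarity_matrix > 0.75)
mask = i_idx != j_idx  # drop the self-similarity diagonal
long_form_sparse = pd.DataFrame({
    't1': titles[i_idx[mask]],
    't2': titles[j_idx[mask]],
    'sim': similarity_matrix[i_idx[mask], j_idx[mask]],
})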
!pip install python-igraph
import igraph as ig
%%time
lng = long_form[long_form.sim > 0.75]
tuples = [tuple(x) for x in lng.values]
Gm = ig.Graph.TupleList(tuples, edge_attrs = ['sim'])
layt = Gm.layout('kk', dim=3)
Xn = [layt[k][0] for k in range(len(layt))]  # x-coordinates of nodes
Yn = [layt[k][1] for k in range(len(layt))]  # y-coordinates
Zn = [layt[k][2] for k in range(len(layt))]  # z-coordinates
Xe = []
Ye = []
Ze = []
for e in Gm.get_edgelist():
    Xe += [layt[e[0]][0], layt[e[1]][0], None]  # x-coordinates of edge ends
    Ye += [layt[e[0]][1], layt[e[1]][1], None]
    Ze += [layt[e[0]][2], layt[e[1]][2], None]
import plotly.graph_objs as go
trace1 = go.Scatter3d(x = Xe,
                      y = Ye,
                      z = Ze,
                      mode = 'lines',
                      line = dict(color = 'rgb(0,0,0)', width = 1),
                      hoverinfo = 'none'
                      )
trace2 = go.Scatter3d(x = Xn,
                      y = Yn,
                      z = Zn,
                      mode = 'markers',
                      name = 'articles',
                      marker = dict(symbol = 'circle',
                                    size = 6,
                                    # color = group,
                                    colorscale = 'Viridis',
                                    line = dict(color = 'rgb(50,50,50)', width = 0.5)
                                    ),
                      text = Gm.vs['name'],  # one hover label per node, in vertex order
                      hoverinfo = 'text'
                      )
axis = dict(showbackground = False,
            showline = False,
            zeroline = False,
            showgrid = False,
            showticklabels = False,
            title = ''
            )
layout = go.Layout(
    title = "Network of similarity between CORD-19 articles (3D visualization)",
    width = 1500,
    height = 1500,
    showlegend = False,
    scene = dict(
        xaxis = dict(axis),
        yaxis = dict(axis),
        zaxis = dict(axis),
    ),
    margin = dict(
        t = 100,
        l = 20,
        r = 20
    ),
)
fig=go.Figure(data=[trace1,trace2], layout=layout)
fig.show()
Next, we keep only the pairs of nodes with similarity above a threshold
sim_weight = 0.75
gdf = long_form[long_form.sim > sim_weight]
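The 0.75 threshold is a judgment call; a quick sweep shows how sharply the edge count falls as the cutoff rises:
for t in (0.6, 0.7, 0.75, 0.8, 0.9):
    print(t, len(long_form[long_form.sim > t]))  # surviving pairs at each cutoff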
We create our graph from our dataframe
plt.figure(figsize=(50,50))
pd_graph = nx.from_pandas_edgelist(gdf, 't1', 't2')
pos = nx.spring_layout(pd_graph)
nx.draw_networkx(pd_graph, pos, with_labels=True, font_size=10, node_size=30)
Now let's try to find communities in our graph
betCent = nx.betweenness_centrality(pd_graph, normalized=True, endpoints=True)
node_color = [20000.0 * pd_graph.degree(v) for v in pd_graph]
node_size = [v * 10000 for v in betCent.values()]
plt.figure(figsize=(35,35))
nx.draw_networkx(pd_graph, pos=pos, with_labels=True,
                 font_size=5,
                 node_color=node_color,
                 node_size=node_size)
components = list(nx.connected_components(pd_graph))
groups = [dict.fromkeys(nodes, gid) for gid, nodes in enumerate(components)]
records = [{'articles': k, 'groups': v} for g in groups for k, v in g.items()]
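Connected components are a coarse notion of community: any two reachable nodes land in the same group. NetworkX also ships modularity-based detection, which can split a large component into tighter clusters; a sketch of that alternative (not what the rest of this notebook uses):
from networkx.algorithms.community import greedy_modularity_communities

communities = greedy_modularity_communities(pd_graph)
records_mod = [{'articles': node, 'groups': gid}
               for gid, nodes in enumerate(communities)
               for node in nodes]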
We've got our 'clustered' dataframe of articles. However, since we filtered the data to keep just the most similar articles, we are left with only 660 of the ~5k articles
gcd = pd.DataFrame.from_dict(records)
import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.tokenize import RegexpTokenizer
tok = RegexpTokenizer(r'\w+')
stop_words = set(stopwords.words('english'))
def clean(string):
    # strip punctuation with the regexp tokenizer, then drop English stopwords
    # (lowercasing for the stopword check, since the NLTK list is lowercase)
    return " ".join([w for w in word_tokenize(" ".join(tok.tokenize(string))) if w.lower() not in stop_words])
gcd.articles = gcd.articles.apply(lambda x: clean(x))
len(gcd), len(gcd) / len(df)
from wordcloud import WordCloud
%%time
clouds = dict()
big_groups = gcd.groups.value_counts()[:9].index.tolist()  # the nine largest groups
for group in big_groups:
    text = gcd[gcd.groups == group].articles.values
    # join the titles into one string; str() on a numpy array would truncate it with '...'
    wordcloud = WordCloud(width=1000, height=1000).generate(" ".join(text))
    clouds[group] = wordcloud
def plot_figures(figures, nrows=1, ncols=1):
    """Plot a dictionary of figures.

    Parameters
    ----------
    figures : <title, figure> dictionary
    ncols : number of columns of subplots wanted in the display
    nrows : number of rows of subplots wanted in the figure
    """
    fig, axeslist = plt.subplots(ncols=ncols, nrows=nrows, figsize=(20, 20))
    for ind, title in zip(range(len(figures)), figures):
        axeslist.ravel()[ind].imshow(figures[title])
        axeslist.ravel()[ind].set_title(f'Most frequent words for group {title + 1}')
        axeslist.ravel()[ind].set_axis_off()
    plt.tight_layout()  # optional
plot_figures(clouds, 3, 3)
plt.show()