Tagging Wikipedia Articles with Multi-Label Classification

Tagging Wikipedia Articles with Multi-Label Classification, using Autoencoders and Deep Learning
Data Science
Machine Learning
Deep Learning
Autoencoders
Fraud Detection
Author

Daniel Fat

Published

February 5, 2021

Imports

Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('seaborn')
sns.set_style('whitegrid')

About Data

Dataset

Context

The growing availability of information in the past decade has allowed internet users to find vast amounts of information online, but this has come with more and more deceptive articles designed to advertise or promote a product or ideology. In addition, hidden sponsored news articles have grown in prevalence in recent years as news organizations have shifted their business strategy to account for developments in technology and content consumption. It is for this reason that having a system in place to detect these deceptive practices is more important than ever.

Content

This dataset consists of articles that were tagged by users as having a “promotional tone” (promotional.csv) and of articles that were tagged as “good articles” (good.csv).

Each promotional article can have multiple labels (the quotes below are from the Wikipedia tags):

  • advert - “This article contains content that is written like an advertisement.”
  • coi - “A major contributor to this article appears to have a close connection with its subject.”
  • fanpov - “This article may be written from a fan’s point of view, rather than a neutral point of view.”
  • pr - “This article reads like a press release or a news article or is largely based on routine coverage or sensationalism.”
  • resume - “This biographical article is written like a résumé.”

The “good articles” are articles that were deemed “well written, contain factually accurate and verifiable information, are broad in coverage, neutral in point of view, stable, and illustrated.”

Code
df = pd.read_csv('data/wiki-articles-promo.csv').reset_index().rename(columns={'index':'id'})
df
id text advert coi fanpov pr resume url
0 0 1 Litre no Namida 1, lit. 1 Litre of Tears als... 0 0 1 0 0 https://en.wikipedia.org/wiki/1%20Litre%20no%2...
1 1 1DayLater was free, web based software that wa... 1 1 0 0 0 https://en.wikipedia.org/wiki/1DayLater
2 2 1E is a privately owned IT software and servic... 1 0 0 0 0 https://en.wikipedia.org/wiki/1E
3 3 1Malaysia pronounced One Malaysia in English a... 1 0 0 0 0 https://en.wikipedia.org/wiki/1Malaysia
4 4 The Jerusalem Biennale, as stated on the Bienn... 1 0 0 0 0 https://en.wikipedia.org/wiki/1st%20Jerusalem%...
... ... ... ... ... ... ... ... ...
23832 23832 ZURICH.MINDS is a non profit foundation set up... 1 0 0 0 0 https://en.wikipedia.org/wiki/Zurich.minds
23833 23833 zvelo, Inc. or simply zvelo is a privately hel... 1 0 0 0 0 https://en.wikipedia.org/wiki/Zvelo
23834 23834 Zygote Media Group is a 3D human anatomy conte... 1 1 0 0 0 https://en.wikipedia.org/wiki/Zygote%20Media%2...
23835 23835 Zylom is a distributor of casual games for PC ... 1 0 0 0 0 https://en.wikipedia.org/wiki/Zylom
23836 23836 Zynx Health Incorporated is an American corpor... 1 1 0 0 0 https://en.wikipedia.org/wiki/Zynx%20Health

23837 rows × 8 columns

Code
labels = ['advert','coi','fanpov','pr','resume']
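
Since per-label performance later on will depend heavily on class balance, it is worth checking how often each tag occurs. A minimal sketch (output omitted here; as the support columns in the reports further down confirm, advert dominates the other labels by a wide margin):

Code
# Count how many articles carry each tag
df[labels].sum().sort_values(ascending=False)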

Modeling

Pre-Processing

Code
from sklearn.model_selection import train_test_split
from sklearn import metrics
Code
train,test = train_test_split(df[['text'] + labels],test_size=0.33, random_state=42,stratify=df[labels[1:]])
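
Note that stratify=df[labels[1:]] stratifies on the joint combinations of the coi, fanpov, pr and resume columns. An alternative designed specifically for multi-label data is iterative stratification from scikit-multilearn, sketched below under the same 0.33 test fraction (illustrative, not what the notebook actually ran):

Code
from skmultilearn.model_selection import iterative_train_test_split

# Iterative stratification balances each label across the splits.
# It works on arrays, so we split row indices and index back into df.
X_idx = np.arange(len(df)).reshape(-1, 1)
y = df[labels].values
train_idx, y_train, test_idx, y_test = iterative_train_test_split(X_idx, y, test_size=0.33)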

For the classical ML models we can use the Universal Sentence Encoder (USE) to extract sentence vectors.

The Universal Sentence Encoder encodes text into high-dimensional vectors that can be used for text classification, semantic similarity, clustering and other natural language tasks.

The model is trained and optimized for greater-than-word length text, such as sentences, phrases or short paragraphs. It is trained on a variety of data sources and a variety of tasks with the aim of dynamically accommodating a wide variety of natural language understanding tasks. The input is variable length English text and the output is a 512 dimensional vector. We apply this model to the STS benchmark for semantic similarity, and the results can be seen in the example notebook made available. The universal-sentence-encoder model is trained with a deep averaging network (DAN) encoder.

To learn more about text embeddings, refer to the TensorFlow Embeddings documentation. Our encoder differs from word level embedding models in that we train on a number of natural language prediction tasks that require modeling the meaning of word sequences rather than just individual words. Details are available in the paper “Universal Sentence Encoder”

Paper

Pre-Trained Models

Code
# !pip install -U tensorflow tensorflow-hub 
Code
import tensorflow as tf
import tensorflow_hub as hub

class UniversalSentenceEncoder:
    """Thin wrapper around a Universal Sentence Encoder model from TF Hub."""

    def __init__(self, encoder='universal-sentence-encoder', version='4'):
        self.version = version
        self.encoder = encoder
        self.embd = hub.load(f"https://tfhub.dev/google/{encoder}/{version}")

    def embed(self, sentences):
        # Returns a (len(sentences), 512) tensor of sentence embeddings
        return self.embd(sentences)

    def squeezed(self, sentences):
        # Like embed(), but squeezes/casts the input and returns a numpy array
        return np.array(self.embd(tf.squeeze(tf.cast(sentences, tf.string))))

use = UniversalSentenceEncoder()
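
Before embedding the whole corpus, a quick sanity check that the embeddings capture meaning; the sentences here are made up for illustration:

Code
# Semantically similar sentences should score higher than unrelated ones
a, b, c = np.array(use.embed([
    "The company sells enterprise software.",
    "This firm is a software vendor.",
    "The movie was released in 2004.",
]))
cos = lambda u, v: np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
print(cos(a, b), cos(a, c))  # expect the first value to be noticeably larger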

Now use USE to turn each article's text into a 512-dimensional vector:

Code
%%time
train_use = train.copy()
train_use['text_vect'] = use.squeezed(train_use['text'].tolist()).tolist()
CPU times: user 5min 38s, sys: 7min 41s, total: 13min 20s
Wall time: 16min 45s
Code
%%time
test_use = test.copy()
test_use['text_vect'] = use.squeezed(test_use['text'].tolist()).tolist()
CPU times: user 3min 5s, sys: 3min 4s, total: 6min 10s
Wall time: 6min 20s
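
Embedding a whole split in a single call works, but it is slow and memory hungry. A batched variant (illustrative, not from the original notebook) keeps memory in check on large corpora:

Code
# Embed texts in fixed-size batches and stack the results
def embed_in_batches(texts, batch_size=256):
    chunks = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]
    return np.vstack([np.array(use.embed(chunk)) for chunk in chunks])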

Binary Relevance

Binary Relevance transforms a multi-label classification problem with L labels into L separate single-label binary classification problems, each using the same base classifier provided in the constructor. The prediction output is the union of all per-label classifier outputs.
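
The idea is simple enough to write by hand; here is a minimal sketch of what the library does under the hood (logistic regression is just a stand-in base classifier):

Code
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

# Hand-rolled Binary Relevance: one independent binary classifier per label.
# X is the (n_samples, 512) embedding matrix, Y the (n_samples, 5) label matrix.
def binary_relevance_fit(X, Y, base=LogisticRegression(max_iter=1000)):
    return [clone(base).fit(X, Y[:, j]) for j in range(Y.shape[1])]

def binary_relevance_predict(models, X):
    return np.column_stack([m.predict(X) for m in models])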

Code
# !pip install scikit-multilearn
Code
from skmultilearn.problem_transform import BinaryRelevance
from sklearn.ensemble import RandomForestClassifier

Random Forest

Code
%%time
classifier = BinaryRelevance(
    classifier = RandomForestClassifier()
)

# train
classifier.fit(pd.DataFrame(train_use['text_vect'].tolist()), train_use[labels])
CPU times: user 2min 43s, sys: 1.21 s, total: 2min 44s
Wall time: 2min 45s
BinaryRelevance(classifier=RandomForestClassifier(), require_dense=[True, True])
Code
# predict
predictions = classifier.predict(pd.DataFrame(test_use['text_vect'].tolist()))
Code
# predict probabilities
predictions_proba = classifier.predict_proba(pd.DataFrame(test_use['text_vect'].tolist()))
Code
print(metrics.classification_report(test_use[labels],predictions,zero_division=0))
              precision    recall  f1-score   support

           0       0.83      0.97      0.90      6239
           1       0.00      0.00      0.00       707
           2       0.92      0.14      0.24       493
           3       0.00      0.00      0.00       500
           4       0.74      0.16      0.26       726

   micro avg       0.83      0.72      0.77      8665
   macro avg       0.50      0.25      0.28      8665
weighted avg       0.72      0.72      0.68      8665
 samples avg       0.79      0.75      0.77      8665
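
The rows 0-4 above are the columns of the label matrix, in the order of the labels list. Passing target_names makes the report self-explanatory:

Code
# Same report, but with label names instead of column indices
print(metrics.classification_report(test_use[labels], predictions,
                                    target_names=labels, zero_division=0))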

XGBoost

Code
import xgboost as xgb
Code
%%time
classifier_xgb = BinaryRelevance(
    classifier = xgb.XGBClassifier(objective='binary:logistic',use_label_encoder=False)
)

# train
classifier_xgb.fit(pd.DataFrame(train_use['text_vect'].tolist()), train_use[labels])
[18:22:22] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
(the same warning is repeated once per label, five times in total)
CPU times: user 26min 45s, sys: 5.84 s, total: 26min 51s
Wall time: 3min 40s
BinaryRelevance(classifier=XGBClassifier(base_score=None, booster=None,
                                         colsample_bylevel=None,
                                         colsample_bynode=None,
                                         colsample_bytree=None, gamma=None,
                                         gpu_id=None, importance_type='gain',
                                         interaction_constraints=None,
                                         learning_rate=None,
                                         max_delta_step=None, max_depth=None,
                                         min_child_weight=None, missing=nan,
                                         monotone_constraints=None,
                                         n_estimators=100, n_jobs=None,
                                         num_parallel_tree=None,
                                         random_state=None, reg_alpha=None,
                                         reg_lambda=None, scale_pos_weight=None,
                                         subsample=None, tree_method=None,
                                         use_label_encoder=False,
                                         validate_parameters=None,
                                         verbosity=None),
                require_dense=[True, True])
Code
# predict
xgb_pred = classifier_xgb.predict(pd.DataFrame(test_use['text_vect'].tolist(),columns=[f'f{x}' for x in range(512)]))
Code
# predict probabilities
xgb_pred_proba = classifier_xgb.predict_proba(pd.DataFrame(test_use['text_vect'].tolist(),columns=[f'f{x}' for x in range(512)]))
Code
print(metrics.classification_report(test_use[labels],xgb_pred,zero_division=0))
              precision    recall  f1-score   support

           0       0.86      0.94      0.90      6239
           1       0.45      0.01      0.01       707
           2       0.73      0.29      0.42       493
           3       0.00      0.00      0.00       500
           4       0.61      0.38      0.47       726

   micro avg       0.84      0.73      0.78      8665
   macro avg       0.53      0.32      0.36      8665
weighted avg       0.75      0.73      0.71      8665
 samples avg       0.79      0.76      0.77      8665
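
Binary Relevance also exposes per-label probabilities, so instead of the default 0.5 cutoff we can tune a decision threshold per label. An illustrative sketch (for a fair evaluation the thresholds should be chosen on a separate validation split, not on the test set as done here for brevity):

Code
# Find the F1-maximizing threshold for each label
proba = np.asarray(xgb_pred_proba.todense())   # skmultilearn returns a sparse matrix
y_true = test_use[labels].values
for j, name in enumerate(labels):
    best = max(np.arange(0.1, 0.9, 0.05),
               key=lambda t: metrics.f1_score(y_true[:, j], proba[:, j] > t))
    print(name, round(float(best), 2))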

As we can see, Binary Relevance makes it easy to plug different base models into a multi-label classification problem.

Now let’s try Transformers

Not those…

Original package

Transformers provides thousands of pretrained models to perform tasks on texts such as classification, information extraction, question answering, summarization, translation, text generation, etc in 100+ languages. Its aim is to make cutting-edge NLP easier to use for everyone.

🤗 Transformers provides APIs to quickly download and use those pretrained models on a given text, fine-tune them on your own datasets then share them with the community on our model hub. At the same time, each python module defining an architecture can be used as a standalone and modified to enable quick research experiments.

🤗 Transformers is backed by the two most popular deep learning libraries, PyTorch and TensorFlow, with a seamless integration between them, allowing you to train your models with one then load it for inference with the other.

However, we will use the Simple Transformers package, which lets you quickly train and evaluate Transformer models. We'll save the big guns for other projects and a new PC…

Code
# !pip install simpletransformers
Code
from simpletransformers.classification import MultiLabelClassificationModel
import logging
Code
logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)
Code
train['labels'] = train[labels].values.tolist()
test['labels'] = test[labels].values.tolist()
Code
train_df = train[['text','labels']].copy()
eval_df = test[['text','labels']].copy()
Code
# Create a MultiLabelClassificationModel
model = MultiLabelClassificationModel(
    "roberta",
    "roberta-base",
    num_labels=len(labels),
    use_cuda=True,
    args={"reprocess_input_data": True, "overwrite_output_dir": True, "num_train_epochs": 5},
)
Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForMultiLabelSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight']
- This IS expected if you are initializing RobertaForMultiLabelSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMultiLabelSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForMultiLabelSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.weight', 'classifier.out_proj.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Code
# Train the model; returns (global_step, average training loss)
model.train_model(train_df)
INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.
INFO:simpletransformers.classification.classification_utils: Saving features into cached file cache_dir/cached_train_roberta_128_0_15970
/usr/local/lib/python3.6/dist-packages/torch/optim/lr_scheduler.py:216: UserWarning: Please also save or load the state of the optimizer when saving or loading the scheduler.
  warnings.warn(SAVE_STATE_WARNING, UserWarning)
INFO:simpletransformers.classification.classification_model: Training of roberta model complete. Saved to outputs/.
(9985, 0.23426581945831798)
Code
# Evaluate the model
result, model_outputs, wrong_predictions = model.eval_model(eval_df)
INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.
INFO:simpletransformers.classification.classification_utils: Saving features into cached file cache_dir/cached_dev_roberta_128_0_7867
Code
result
{'LRAP': 0.8831122975015083, 'eval_loss': 0.28330832859690536}
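
LRAP is the Label Ranking Average Precision: for every article it measures how highly the true tags rank among the predicted scores (1.0 is perfect). The value above should be reproducible from the raw model outputs with scikit-learn:

Code
# Recompute LRAP directly from the model's raw scores
metrics.label_ranking_average_precision_score(test[labels].values, model_outputs)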
Code
preds = pd.DataFrame(model_outputs,columns=labels)
preds
advert coi fanpov pr resume
0 0.963867 0.109924 0.001567 0.044006 0.001279
1 0.969727 0.089905 0.002100 0.036438 0.001156
2 0.969238 0.091858 0.002035 0.037048 0.001165
3 0.972168 0.069641 0.003004 0.031860 0.001057
4 0.973145 0.062805 0.003622 0.029816 0.001040
... ... ... ... ... ...
7862 0.938965 0.149658 0.001240 0.074341 0.001810
7863 0.087708 0.089600 0.006958 0.053314 0.870605
7864 0.970215 0.038910 0.007904 0.025513 0.001086
7865 0.969238 0.035828 0.009193 0.025421 0.001078
7866 0.971191 0.041138 0.007122 0.026810 0.001032

7867 rows × 5 columns

Code
# Binarize the predicted probabilities at 0.5, then score
print(metrics.classification_report(test[labels],preds.gt(.5).astype(int),zero_division=0))
              precision    recall  f1-score   support

           0       0.87      0.93      0.90      6239
           1       0.00      0.00      0.00       707
           2       0.60      0.43      0.50       493
           3       0.30      0.02      0.03       500
           4       0.55      0.53      0.54       726

   micro avg       0.83      0.74      0.78      8665
   macro avg       0.46      0.38      0.39      8665
weighted avg       0.72      0.74      0.72      8665
 samples avg       0.81      0.77      0.78      8665

Conclusions

We have used the following tools to tackle the multi-label classification task:

  • Binary Relevance
  • Universal Sentence Encoder
  • Random Forest
  • XGBoost
  • Transformers

The results were very interesting, especially because each approach works quite differently from the others.

Personally, I think the XGBoost model on top of USE embeddings worked best in terms of the accuracy-to-resources trade-off; it was also the only model with non-zero precision on label 1 (coi), where the other models could not score anything.

All of these models can be improved by tuning, but this notebook was mainly designed for learning purposes.