Daniel Fat
February 5, 2021

Imports

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('whitegrid')
plt.style.use('seaborn')
Context
The growing availability of information in the past decade has allowed internet users to find vast amounts of information online, but this has come with more and more deceptive articles designed to advertise or promote a product or ideology. In addition, hidden sponsored news articles have grown in prevalence in recent years as news organizations have shifted their business strategy to account for developments in technology and content consumption. It is for this reason that having a system in place to detect these deceptive practices is more important than ever.
Content
This dataset consists of articles that were tagged by users as having a “promotional tone” (promotional.csv) and of articles that were tagged as “good articles” (good.csv).
Each promotional article can have multiple labels, corresponding to Wikipedia maintenance tags: advert, coi, fanpov, pr, and resume.
The “good articles” are articles that were deemed “well written, contain factually accurate and verifiable information, are broad in coverage, neutral in point of view, stable, and illustrated.”
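The loading cell is not shown in this post; a minimal sketch of how the data might be read, assuming the two CSV files sit next to the notebook (promo and good are placeholder names, not the exact ones used):

```python
import pandas as pd

# Articles tagged as having a promotional tone, with one 0/1 column per label
promo = pd.read_csv('promotional.csv')

# Articles tagged as "good articles"
good = pd.read_csv('good.csv')

promo  # a frame roughly like the one displayed below
```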
|  | id | text | advert | coi | fanpov | pr | resume | url |
|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 Litre no Namida 1, lit. 1 Litre of Tears als... | 0 | 0 | 1 | 0 | 0 | https://en.wikipedia.org/wiki/1%20Litre%20no%2... |
1 | 1 | 1DayLater was free, web based software that wa... | 1 | 1 | 0 | 0 | 0 | https://en.wikipedia.org/wiki/1DayLater |
2 | 2 | 1E is a privately owned IT software and servic... | 1 | 0 | 0 | 0 | 0 | https://en.wikipedia.org/wiki/1E |
3 | 3 | 1Malaysia pronounced One Malaysia in English a... | 1 | 0 | 0 | 0 | 0 | https://en.wikipedia.org/wiki/1Malaysia |
4 | 4 | The Jerusalem Biennale, as stated on the Bienn... | 1 | 0 | 0 | 0 | 0 | https://en.wikipedia.org/wiki/1st%20Jerusalem%... |
... | ... | ... | ... | ... | ... | ... | ... | ... |
23832 | 23832 | ZURICH.MINDS is a non profit foundation set up... | 1 | 0 | 0 | 0 | 0 | https://en.wikipedia.org/wiki/Zurich.minds |
23833 | 23833 | zvelo, Inc. or simply zvelo is a privately hel... | 1 | 0 | 0 | 0 | 0 | https://en.wikipedia.org/wiki/Zvelo |
23834 | 23834 | Zygote Media Group is a 3D human anatomy conte... | 1 | 1 | 0 | 0 | 0 | https://en.wikipedia.org/wiki/Zygote%20Media%2... |
23835 | 23835 | Zylom is a distributor of casual games for PC ... | 1 | 0 | 0 | 0 | 0 | https://en.wikipedia.org/wiki/Zylom |
23836 | 23836 | Zynx Health Incorporated is an American corpor... | 1 | 1 | 0 | 0 | 0 | https://en.wikipedia.org/wiki/Zynx%20Health |
23837 rows × 8 columns
Pre-Processing
For the ML-based models, we can use the Universal Sentence Encoder (USE) to extract sentence vectors.
The Universal Sentence Encoder encodes text into high-dimensional vectors that can be used for text classification, semantic similarity, clustering and other natural language tasks.
The model is trained and optimized for greater-than-word length text, such as sentences, phrases or short paragraphs. It is trained on a variety of data sources and a variety of tasks with the aim of dynamically accommodating a wide variety of natural language understanding tasks. The input is variable length English text and the output is a 512 dimensional vector. We apply this model to the STS benchmark for semantic similarity, and the results can be seen in the example notebook made available. The universal-sentence-encoder model is trained with a deep averaging network (DAN) encoder.
To learn more about text embeddings, refer to the TensorFlow Embeddings documentation. Our encoder differs from word level embedding models in that we train on a number of natural language prediction tasks that require modeling the meaning of word sequences rather than just individual words. Details are available in the paper “Universal Sentence Encoder”.
import tensorflow as tf
import tensorflow_hub as hub

class UniversalSentenceEncoder:
    """Thin wrapper around the Universal Sentence Encoder from TF Hub."""

    def __init__(self, encoder='universal-sentence-encoder', version='4'):
        self.version = version
        self.encoder = encoder
        # Download (or load from the local cache) the requested model
        self.embd = hub.load(f"https://tfhub.dev/google/{encoder}/{version}")

    def embed(self, sentences):
        # Return the 512-dimensional embeddings as a TF tensor
        return self.embd(sentences)

    def squized(self, sentences):
        # Cast to string tensors, squeeze extra dimensions and return a NumPy array
        return np.array(self.embd(tf.squeeze(tf.cast(sentences, tf.string))))

use = UniversalSentenceEncoder()
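As a quick sanity check, the encoder maps any input string to a 512-dimensional vector (a sketch; the example sentence is arbitrary):

```python
vecs = use.embed(["This article reads like an advertisement."])
print(vecs.shape)  # expected: (1, 512)
```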
Use USE to get the sentence vectors from the article text:
CPU times: user 5min 38s, sys: 7min 41s, total: 13min 20s
Wall time: 16min 45s
CPU times: user 3min 5s, sys: 3min 4s, total: 6min 10s
Wall time: 6min 20s
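The embedding cells themselves are collapsed above (only their timings are shown). A minimal sketch of how the vectors might be computed, assuming the placeholder promo frame from the loading sketch; batching keeps memory usage manageable:

```python
import numpy as np

def embed_in_batches(texts, batch_size=256):
    # Run USE over the texts in chunks and stack the resulting vectors
    chunks = []
    for i in range(0, len(texts), batch_size):
        chunks.append(use.embed(texts[i:i + batch_size]).numpy())
    return np.vstack(chunks)

X = embed_in_batches(promo['text'].tolist())                   # shape: (n_articles, 512)
y = promo[['advert', 'coi', 'fanpov', 'pr', 'resume']].values  # shape: (n_articles, 5)
```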
Binary Relevance

Binary Relevance transforms a multi-label classification problem with L labels into L separate single-label binary classification problems, all using the same base classifier provided in the constructor. The prediction output is the union of all per-label classifiers.
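The training cells are collapsed in this post; a Binary Relevance pipeline with scikit-multilearn typically looks like the sketch below (the split ratio and variable names are assumptions, not the exact code used):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from skmultilearn.problem_transform import BinaryRelevance

# Hypothetical split of the USE embeddings and the 5-column label matrix
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# One binary classifier per label, all sharing the same type of base estimator
clf = BinaryRelevance(classifier=RandomForestClassifier(), require_dense=[True, True])
clf.fit(X_train, y_train)

# predict() returns a sparse label-indicator matrix
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred.toarray()))
```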
Random Forest
CPU times: user 2min 43s, sys: 1.21 s, total: 2min 44s
Wall time: 2min 45s
BinaryRelevance(classifier=RandomForestClassifier(), require_dense=[True, True])
precision recall f1-score support
0 0.83 0.97 0.90 6239
1 0.00 0.00 0.00 707
2 0.92 0.14 0.24 493
3 0.00 0.00 0.00 500
4 0.74 0.16 0.26 726
micro avg 0.83 0.72 0.77 8665
macro avg 0.50 0.25 0.28 8665
weighted avg 0.72 0.72 0.68 8665
samples avg 0.79 0.75 0.77 8665
XGBoost
[18:22:22] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[18:23:03] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[18:23:44] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[18:24:27] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[18:25:14] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
CPU times: user 26min 45s, sys: 5.84 s, total: 26min 51s
Wall time: 3min 40s
BinaryRelevance(classifier=XGBClassifier(base_score=None, booster=None,
colsample_bylevel=None,
colsample_bynode=None,
colsample_bytree=None, gamma=None,
gpu_id=None, importance_type='gain',
interaction_constraints=None,
learning_rate=None,
max_delta_step=None, max_depth=None,
min_child_weight=None, missing=nan,
monotone_constraints=None,
n_estimators=100, n_jobs=None,
num_parallel_tree=None,
random_state=None, reg_alpha=None,
reg_lambda=None, scale_pos_weight=None,
subsample=None, tree_method=None,
use_label_encoder=False,
validate_parameters=None,
verbosity=None),
require_dense=[True, True])
precision recall f1-score support
0 0.86 0.94 0.90 6239
1 0.45 0.01 0.01 707
2 0.73 0.29 0.42 493
3 0.00 0.00 0.00 500
4 0.61 0.38 0.47 726
micro avg 0.84 0.73 0.78 8665
macro avg 0.53 0.32 0.36 8665
weighted avg 0.75 0.73 0.71 8665
samples avg 0.79 0.76 0.77 8665
We can see that Binary Relevance lets us easily plug different models into the same multi-label classification setup.
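For instance, switching from Random Forest to XGBoost is just a matter of changing the base estimator passed to BinaryRelevance (a sketch, reusing the placeholder matrices from above):

```python
from skmultilearn.problem_transform import BinaryRelevance
from xgboost import XGBClassifier

clf = BinaryRelevance(classifier=XGBClassifier(use_label_encoder=False), require_dense=[True, True])
clf.fit(X_train, y_train)
```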
Transformers

Not those…
Transformers provides thousands of pretrained models to perform tasks on texts such as classification, information extraction, question answering, summarization, translation, text generation, etc in 100+ languages. Its aim is to make cutting-edge NLP easier to use for everyone.
🤗 Transformers provides APIs to quickly download and use those pretrained models on a given text, fine-tune them on your own datasets then share them with the community on our model hub. At the same time, each python module defining an architecture can be used as a standalone and modified to enable quick research experiments.
🤗 Transformers is backed by the two most popular deep learning libraries, PyTorch and TensorFlow, with a seamless integration between them, allowing you to train your models with one then load it for inference with the other.
However, we will use the Simple Transformers package, which lets you quickly train and evaluate Transformer models. We will keep the big guns for other projects and a new PC…
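The training cell is collapsed; based on the logs below (roberta-base, maximum sequence length 128, five labels), the Simple Transformers setup likely resembles the following sketch. The argument values and variable names are assumptions, not the exact configuration used.

```python
from simpletransformers.classification import MultiLabelClassificationModel

# train_df / eval_df: DataFrames with a 'text' column and a 'labels' column
# holding 5-element lists such as [1, 0, 0, 0, 0] (advert, coi, fanpov, pr, resume)
model = MultiLabelClassificationModel(
    'roberta',
    'roberta-base',
    num_labels=5,
    args={
        'max_seq_length': 128,      # matches the cached feature files in the logs
        'num_train_epochs': 1,      # assumption
        'overwrite_output_dir': True,
    },
)

model.train_model(train_df)
result, model_outputs, wrong_predictions = model.eval_model(eval_df)
```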
Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForMultiLabelSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight']
- This IS expected if you are initializing RobertaForMultiLabelSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMultiLabelSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForMultiLabelSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.weight', 'classifier.out_proj.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.
INFO:simpletransformers.classification.classification_utils: Saving features into cached file cache_dir/cached_train_roberta_128_0_15970
/usr/local/lib/python3.6/dist-packages/torch/optim/lr_scheduler.py:216: UserWarning: Please also save or load the state of the optimizer when saving or loading the scheduler.
warnings.warn(SAVE_STATE_WARNING, UserWarning)
INFO:simpletransformers.classification.classification_model: Training of roberta model complete. Saved to outputs/.
(9985, 0.23426581945831798)
INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.
INFO:simpletransformers.classification.classification_utils: Saving features into cached file cache_dir/cached_dev_roberta_128_0_7867
|  | advert | coi | fanpov | pr | resume |
|---|---|---|---|---|---|
0 | 0.963867 | 0.109924 | 0.001567 | 0.044006 | 0.001279 |
1 | 0.969727 | 0.089905 | 0.002100 | 0.036438 | 0.001156 |
2 | 0.969238 | 0.091858 | 0.002035 | 0.037048 | 0.001165 |
3 | 0.972168 | 0.069641 | 0.003004 | 0.031860 | 0.001057 |
4 | 0.973145 | 0.062805 | 0.003622 | 0.029816 | 0.001040 |
... | ... | ... | ... | ... | ... |
7862 | 0.938965 | 0.149658 | 0.001240 | 0.074341 | 0.001810 |
7863 | 0.087708 | 0.089600 | 0.006958 | 0.053314 | 0.870605 |
7864 | 0.970215 | 0.038910 | 0.007904 | 0.025513 | 0.001086 |
7865 | 0.969238 | 0.035828 | 0.009193 | 0.025421 | 0.001078 |
7866 | 0.971191 | 0.041138 | 0.007122 | 0.026810 | 0.001032 |
7867 rows × 5 columns
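The table above holds the raw per-label probabilities returned by the model. To produce the classification report below, they can be thresholded at 0.5 (a sketch; model_outputs and y_test are placeholder names from the sketch above):

```python
import numpy as np
from sklearn.metrics import classification_report

y_pred = (np.array(model_outputs) > 0.5).astype(int)
print(classification_report(y_test, y_pred))
```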
precision recall f1-score support
0 0.87 0.93 0.90 6239
1 0.00 0.00 0.00 707
2 0.60 0.43 0.50 493
3 0.30 0.02 0.03 500
4 0.55 0.53 0.54 726
micro avg 0.83 0.74 0.78 8665
macro avg 0.46 0.38 0.39 8665
weighted avg 0.72 0.74 0.72 8665
samples avg 0.81 0.77 0.78 8665
We have played with the following packages and techniques to solve the multi-label classification task:

- Binary Relevance
- Universal Sentence Encoder
- Random Forest
- XGBoost
- Transformers

We saw very interesting results, especially since each approach works quite differently from the others.

Personally, I think the XGBoost model on top of the USE embeddings worked best in terms of the accuracy/resources trade-off; it also achieved better precision on label 1, where the other models could not score anything for that label.

All of these models could be improved by tuning, but this notebook was mainly designed for learning purposes.