Tagging Wikipedia Articles
Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('seaborn')
sns.set_style('whitegrid')
Context
The growing availability of information over the past decade has allowed internet users to find vast amounts of content online, but it has also brought more and more deceptive articles designed to advertise or promote a product or ideology. In addition, hidden sponsored news articles have grown in prevalence in recent years as news organizations have shifted their business strategies to account for developments in technology and content consumption. For this reason, having a system in place to detect these deceptive practices is more important than ever.
Content
This dataset consists of articles that were tagged by users as having a "promotional tone" (promotional.csv) and of articles that were tagged as "good articles" (good.csv).
Each promotional article can have multiple labels (quotes from the Wikipedia tags):
- advert - "This article contains content that is written like an advertisement."
- coi - "A major contributor to this article appears to have a close connection with its subject."
- fanpov - "This article may be written from a fan's point of view, rather than a neutral point of view."
- pr - "This article reads like a press release or a news article or is largely based on routine coverage or sensationalism."
- resume - "This biographical article is written like a résumé."
The "good articles" are articles that were deemed "well written, contain factually accurate and verifiable information, are broad in coverage, neutral in point of view, stable, and illustrated."
df = pd.read_csv('data/wiki-articles-promo.csv').reset_index().rename(columns={'index':'id'})
df
labels = ['advert','coi','fanpov','pr','resume']
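Before pre-processing, it is worth looking at how common each tag is and how many tags an article carries, since the classes are likely imbalanced. A minimal sketch, assuming df and labels are defined as above:
# Sketch: inspect the label distribution
print(df[labels].sum())                       # number of articles per tag
print(df[labels].sum(axis=1).value_counts())  # number of tags per article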
Pre-Processing
from sklearn.model_selection import train_test_split
from sklearn import metrics
train,test = train_test_split(df[['text'] + labels],test_size=0.33, random_state=42,stratify=df[labels[1:]])
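As a quick sanity check on the stratified split (a sketch using the train/test frames created above), the per-label frequencies in the two splits should stay close:
# Sketch: compare label frequencies between the splits
print(pd.DataFrame({'train': train[labels].mean(), 'test': test[labels].mean()}))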
For the ML-based models, we can use the Universal Sentence Encoder (USE) to extract sentence vectors.
The Universal Sentence Encoder encodes text into high-dimensional vectors that can be used for text classification, semantic similarity, clustering and other natural language tasks.
The model is trained and optimized for greater-than-word length text, such as sentences, phrases or short paragraphs. It is trained on a variety of data sources and a variety of tasks with the aim of dynamically accommodating a wide variety of natural language understanding tasks. The input is variable length English text and the output is a 512 dimensional vector. We apply this model to the STS benchmark for semantic similarity, and the results can be seen in the example notebook made available. The universal-sentence-encoder model is trained with a deep averaging network (DAN) encoder.
To learn more about text embeddings, refer to the TensorFlow Embeddings documentation. Our encoder differs from word level embedding models in that we train on a number of natural language prediction tasks that require modeling the meaning of word sequences rather than just individual words. Details are available in the paper "Universal Sentence Encoder"
import tensorflow as tf
import tensorflow_hub as hub
class UniversalSentenceEncoder:
    def __init__(self, encoder='universal-sentence-encoder', version='4'):
        self.version = version
        self.encoder = encoder
        # load the pre-trained model from TensorFlow Hub
        self.embd = hub.load(f"https://tfhub.dev/google/{encoder}/{version}")

    def embed(self, sentences):
        # return the 512-dimensional embeddings as a TensorFlow tensor
        return self.embd(sentences)

    def squeezed(self, sentences):
        # cast to string, drop extra dimensions and return the embeddings as a NumPy array
        return np.array(self.embd(tf.squeeze(tf.cast(sentences, tf.string))))
use = UniversalSentenceEncoder()
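As a quick sanity check (an illustrative sketch; the two sentences are made up), each input is mapped to a 512-dimensional vector, and the inner product of the vectors gives a rough semantic similarity score:
# Sketch: embed two toy sentences and compare them
vectors = use.embed(["This article reads like an advertisement.",
                     "The text is written in a promotional tone."])
print(vectors.shape)               # (2, 512)
print(np.inner(vectors, vectors))  # pairwise similarity scores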
Use the USE model to get the vectors from the text.
%%time
train_use = train.copy()
train_use['text_vect'] = use.squeezed(train_use['text'].tolist()).tolist()
%%time
test_use = test.copy()
test_use['text_vect'] = use.squeezed(test_use['text'].tolist()).tolist()
Binary Relevance transforms a multi-label classification problem with L labels into L separate single-label binary classification problems, using the same base classifier provided in the constructor. The prediction output is the union of all per-label classifiers.
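To make the idea concrete, here is an illustrative sketch of what Binary Relevance does under the hood, using synthetic data and logistic regression (not the actual models used below): one independent binary classifier per label, with the predictions stacked column-wise.
# Sketch: manual Binary Relevance on synthetic data
from sklearn.linear_model import LogisticRegression
rng = np.random.RandomState(0)
X_toy = rng.randn(100, 5)                     # 100 samples, 5 features
Y_toy = (rng.rand(100, 3) > 0.7).astype(int)  # 3 independent binary labels
per_label_preds = []
for j in range(Y_toy.shape[1]):
    clf = LogisticRegression().fit(X_toy, Y_toy[:, j])  # one classifier per label
    per_label_preds.append(clf.predict(X_toy))
Y_pred = np.column_stack(per_label_preds)     # predictions stacked into one label matrix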
from skmultilearn.problem_transform import BinaryRelevance
from sklearn.ensemble import RandomForestClassifier
Random Forest
%%time
classifier = BinaryRelevance(
classifier = RandomForestClassifier()
)
# train
classifier.fit(pd.DataFrame(train_use['text_vect'].tolist()), train_use[labels])
predictions = classifier.predict(pd.DataFrame(test_use['text_vect'].tolist()))
predictions_proba = classifier.predict_proba(pd.DataFrame(test_use['text_vect'].tolist()))
print(metrics.classification_report(test_use[labels],predictions,zero_division=0))
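Note that skmultilearn returns the predictions as a sparse matrix; a small sketch to turn them into a readable per-label table:
# Sketch: densify the sparse prediction matrix for inspection
rf_pred_df = pd.DataFrame(predictions.toarray(), columns=labels, index=test_use.index)
print(rf_pred_df.sum())  # predicted positives per label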
XGBoost
import xgboost as xgb
%%time
classifier_xgb = BinaryRelevance(
classifier = xgb.XGBClassifier(objective='binary:logistic',use_label_encoder=False)
)
# train
classifier_xgb.fit(pd.DataFrame(train_use['text_vect'].tolist()), train_use[labels])
xgb_pred = classifier_xgb.predict(pd.DataFrame(test_use['text_vect'].tolist(),columns=[f'f{x}' for x in range(512)]))
xgb_pred_proba = classifier_xgb.predict_proba(pd.DataFrame(test_use['text_vect'].tolist(),columns=[f'f{x}' for x in range(512)]))
print(metrics.classification_report(test_use[labels],xgb_pred,zero_division=0))
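Since we also have predict_proba, we can trade precision for recall by lowering the decision threshold below the default 0.5. A sketch (the 0.3 threshold is an arbitrary example, not a tuned value):
# Sketch: apply a custom probability threshold to the XGBoost outputs
xgb_proba_df = pd.DataFrame(xgb_pred_proba.toarray(), columns=labels)
custom_pred = (xgb_proba_df > 0.3).astype(int)  # more permissive than the default 0.5
print(metrics.classification_report(test_use[labels], custom_pred, zero_division=0))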
We can see that with Binary Relevance we can easily use different models for multi-label classification.
Transformers
Not those...
Transformers provides thousands of pretrained models to perform tasks on texts such as classification, information extraction, question answering, summarization, translation, text generation, etc in 100+ languages. Its aim is to make cutting-edge NLP easier to use for everyone.
🤗 Transformers provides APIs to quickly download and use those pretrained models on a given text, fine-tune them on your own datasets then share them with the community on our model hub. At the same time, each python module defining an architecture can be used as a standalone and modified to enable quick research experiments.
🤗 Transformers is backed by the two most popular deep learning libraries, PyTorch and TensorFlow, with a seamless integration between them, allowing you to train your models with one then load it for inference with the other.
However, we will use the Simple Transformers package, which lets you quickly train and evaluate Transformer models. We'll keep the big guns for other projects and a new PC...
from simpletransformers.classification import MultiLabelClassificationModel
import logging
logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)
train['labels'] = train[labels].values.tolist()
test['labels'] = test[labels].values.tolist()
train_df = train[['text','labels']].copy()
eval_df = test[['text','labels']].copy()
model = MultiLabelClassificationModel(
"roberta",
"roberta-base",
num_labels=len(labels),
use_cuda=True,
args={"reprocess_input_data": True, "overwrite_output_dir": True, "num_train_epochs": 5},
)
model.train_model(train_df)
result, model_outputs, wrong_predictions = model.eval_model(eval_df)
result
preds = pd.DataFrame(model_outputs,columns=labels)
preds
print(metrics.classification_report(test[labels],preds.gt(.5).astype(int),zero_division=0))
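Once trained, the same model can tag new, unseen text directly; a minimal sketch (the sample snippet below is made up):
# Sketch: tag a new piece of text with the fine-tuned model
sample_text = ["Acme Corp is the world's leading provider of innovative, best-in-class solutions."]
sample_preds, sample_outputs = model.predict(sample_text)
print(dict(zip(labels, sample_preds[0])))    # hard 0/1 tags per label
print(dict(zip(labels, sample_outputs[0])))  # raw probabilities per label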
We have played with these packages to solve the multi-label classification task:
- Binary Relevance
- Universal Sentence Encoder
- Random Forest
- XGBoost
- Transformers
We could see very interesting results, especially since each one of them works very differently from the others.
Personally, I think the XGBoost model combined with USE worked best in terms of the accuracy-to-resources trade-off; it also had better precision for label 1, where the other models could not score anything for that label.
All these models can be improved by tuning, but this notebook was mainly designed for learning purposes.