Discuss@GL4L

Multi Label Text Classification with Scikit-Learn



Multi-class classification is a classification task with more than two classes, where the classes are mutually exclusive: each sample is assigned to one and only one label.

Multi-label classification, on the other hand, assigns a set of target labels to each sample. It can be thought of as predicting properties of a data point that are not mutually exclusive; Tim Hortons, for example, is often categorized as both a bakery and a coffee shop. Multi-label text classification has many real-world applications, such as categorizing businesses on Yelp or classifying movies into one or more genres.
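To make the difference concrete, here is a minimal sketch of how the two kinds of targets are usually encoded. The business labels are made-up toy data, and the use of LabelEncoder and MultiLabelBinarizer is an illustrative assumption, not something the rest of this post relies on.

```python
from sklearn.preprocessing import LabelEncoder, MultiLabelBinarizer

# Multi-class: exactly one label per sample (toy, hypothetical data).
multiclass_labels = ["bakery", "coffee shop", "bakery"]
y_multiclass = LabelEncoder().fit_transform(multiclass_labels)

# Multi-label: each sample carries a set of labels, which may be empty.
multilabel_sets = [{"bakery", "coffee shop"}, {"coffee shop"}, set()]
mlb = MultiLabelBinarizer()
y_multilabel = mlb.fit_transform(multilabel_sets)

print(y_multiclass)   # one integer class id per sample
print(mlb.classes_)   # column order of the indicator matrix
print(y_multilabel)   # one 0/1 column per label, one row per sample
```

The multi-label target is a 0/1 indicator matrix, which is the same layout as the label columns in the toxic comment data used below.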

Problem Formulation

Anyone who has been the target of abuse or harassment online will know that it doesn’t go away when you log off or switch off your phone. Researchers at Google are working on tools to study toxic comments online. In this post, we will build a multi-label model capable of detecting different types of toxicity, such as severe toxicity, threats, obscenity, and insults, using supervised classifiers and text representations. A comment may carry any combination of the six labels toxic, severe toxic, obscene, threat, insult, and identity hate, or none of them. The data set can be found on Kaggle.

Exploring

%matplotlib inline
import re
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.multiclass import OneVsRestClassifier
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
import seaborn as sns

df = pd.read_csv("train 2.csv", encoding="ISO-8859-1")
df.head()

Number of comments in each category

# Keep only the label columns
df_toxic = df.drop(['id', 'comment_text'], axis=1)
counts = []
categories = list(df_toxic.columns.values)
# Each label column is 0/1, so its sum is the number of comments with that label
for i in categories:
    counts.append((i, df_toxic[i].sum()))
df_stats = pd.DataFrame(counts, columns=['category', 'number_of_comments'])
df_stats

[Output: table with one row per category and its number_of_comments]
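The same per-category counts can be obtained without an explicit loop. This is a sketch on a small made-up frame: the column names follow the post, but the values and the names toy and toy_stats are hypothetical.

```python
import pandas as pd

# Toy stand-in for the real data; values are made up for illustration.
toy = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "comment_text": ["x", "y", "z", "w"],
    "toxic": [1, 0, 1, 0],
    "obscene": [1, 0, 0, 0],
    "insult": [0, 0, 1, 0],
})

# Summing each 0/1 label column counts the comments carrying that label.
label_counts = toy.drop(columns=["id", "comment_text"]).sum()

# Reshape into the same layout as df_stats: one row per category.
toy_stats = (
    label_counts.rename_axis("category")
    .reset_index(name="number_of_comments")
)
print(toy_stats)
```

On the real data, replacing toy with df should give the same numbers as the loop above.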

df_stats.plot(x='category', y='number_of_comments', kind='bar', legend=False, grid=True, figsize=(8, 5))
plt.title("Number of comments per category")
plt.ylabel('# of Occurrences', fontsize=12)
plt.xlabel('category', fontsize=12)

[Figure: bar chart, "Number of comments per category"]

Multi-Label

How many comments have multiple labels?

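# Number of labels attached to each comment (label columns start at position 2)
rowsums = df.iloc[:, 2:].sum(axis=1)
x = rowsums.value_counts()

# plot (recent seaborn versions require x and y as keyword arguments)
plt.figure(figsize=(8, 5))
ax = sns.barplot(x=x.index, y=x.values)
plt.title("Multiple categories per comment")
plt.ylabel('# of Occurrences', fontsize=12)
plt.xlabel('# of categories', fontsize=12)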

[Figure: bar chart, "Multiple categories per comment"]

The vast majority of comments (about 90%) carry no label at all.

print('Percentage of comments that are not labelled:')
print(len(df[(df['toxic']==0) & (df['severe_toxic']==0) & (df['obscene']==0) & (df['threat']==0) & (df['insult']==0) & (df['identity_hate']==0)]) / len(df))

Percentage of comments that are not labelled:
0.8983211235124177
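The long boolean filter can also be written in terms of the per-comment label count. This is a sketch on made-up data: toy, label_cols and the share variables are hypothetical names, and only three of the six label columns are used to keep it short.

```python
import pandas as pd

# Toy label matrix; values are made up for illustration.
toy = pd.DataFrame({
    "toxic":   [1, 0, 0, 0, 1],
    "obscene": [1, 0, 0, 0, 0],
    "insult":  [0, 0, 0, 0, 1],
})
label_cols = ["toxic", "obscene", "insult"]

# Number of labels attached to each comment.
labels_per_comment = toy[label_cols].sum(axis=1)

# The mean of a boolean Series is the fraction of True values.
unlabelled_share = (labels_per_comment == 0).mean()
multi_label_share = (labels_per_comment > 1).mean()

print(f"not labelled: {unlabelled_share:.1%}")
print(f"more than one label: {multi_label_share:.1%}")
```

Note that the value printed in the post is a fraction between 0 and 1; the format specifier .1% turns it into an actual percentage.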

The distribution of the number of characters in the comment texts.

lens = df.comment_text.str.len()
lens.hist(bins = np.arange(0,5000,50))

[Figure: histogram of comment lengths in characters]

Most comments are under 500 characters long, with some outliers of up to 5,000 characters.
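Since str.len() counts characters rather than words, it can help to compute both and to check the "within 500 characters" claim numerically. The sketch below uses a few made-up comments; toy_comments and the derived names are hypothetical.

```python
import pandas as pd

# Made-up comments standing in for df.comment_text.
toy_comments = pd.Series(["short", "a bit longer comment", "x" * 600, "hi"])

# Length in characters (what the histogram above shows).
char_lengths = toy_comments.str.len()

# Length in whitespace-separated words.
word_counts = toy_comments.str.split().str.len()

# Fraction of comments no longer than 500 characters.
share_within_500 = (char_lengths <= 500).mean()

print(char_lengths.describe())
print(word_counts.describe())
print(share_within_500)
```

Running the same three lines on df.comment_text gives a number to go with the histogram.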

There are no missing comments in the comment_text column.

print('Number of missing comments in comment text:')
df['comment_text'].isnull().sum()

Number of missing comments in comment text:

0

Take a peek at the first comment; the text needs to be cleaned.

df['comment_text'][0]
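As a rough idea of what such cleaning could look like, here is a minimal sketch using only the re module. The function clean_text and its specific rules (lower-casing, dropping everything except letters, collapsing whitespace) are assumptions for illustration and not necessarily the cleaning applied in this post; the sample string is made up.

```python
import re


def clean_text(text: str) -> str:
    """Lower-case, drop non-letter characters, collapse whitespace (illustrative only)."""
    text = text.lower()
    # Replace digits, punctuation and other non-letters with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    # Collapse runs of spaces, tabs and newlines into a single space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


sample = "Hello!!\nWhy were  the edits REVERTED? 12:30"
print(clean_text(sample))  # hello why were the edits reverted
```

With these rules contractions are split at the apostrophe ("don't" becomes "don t"), so a real pipeline may want to expand them first. On the data frame this could be applied with df['comment_text'].map(clean_text).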