Deep Learning Approach for Extreme Multi-label Text Classification

Google+ Pinterest LinkedIn Tumblr

>> Wei-Cheng Chang is
a PhD student at CMU, working on and showing, talking about
the deep learning approach to XC text classification. >> Hello, everyone. This is Wei-Cheng Chang, currently a PhD student
from LTI CMU. First, I would like to thank the workshop organizer inviting my adviser Yiming Yang
to present this talk. However, due to some
personal travelling issues, she cannot make it. So today I'm going to talk about the deep learning approach to extreme Multi-Label

This is a joint work
by Jingzhou, Yuexin, Yiming, and me. So this talk is
divided in two volts. The first half, we focus on our SIGIR 2017 paper
in earlier this year, which is about using deep learning
architecture to extract richer feature representation for multi-label
text classification. And the second half, we introduce some related work in deep structured learning
paper in this ICML. They can also be used to apply on Multi-Label
Classification. So to start with, let me briefly introduce the definition of
multi-label text classification, XMTC in short. So it's about finding a document its most relevant
subset of labels from a very large space
of category. So this is different from the multi-class where each
document only have one label. So I believe everyone know this. So introduce about
the notation in our setting, the training data is
denoted in x y up here.

X is the data, y is the label. And our goal is finding
a mapping from x to y. So what's so special in XMTC? So in extreme multi-labels, typically, it refers
the label space is very large. But, actually, if you
look at the XMTC datasets, it's also that the number
of document n and number of feature D are
also in extreme scale. So this posts
two key challenges. The first one is scalability, both in terms of
training and testing. And another issue is
the data sparsity in terms of the many labels only appear in
very few documents. So in this talk, we try to provide a solution for these two challenge in XMTC. The first one is we will
discuss another way to extract feature
representation besides the traditional
back-and-forth method. And the second half, we'll discuss how to
exploit label dependencies. First, in the literature
of extreme classification, we can roughly divide
the approach into two category.

The first one is
target-embedding, and the second one is
tree-based ensemble. So I believe I don't need
to give more detail on these two part because this is the traditional state of the art in this extreme classification. What I would like to focus on is the deep learning
approach because this may be less familiar with the extreme multi-class
classification community. So in this talk, we will examine
the state-of-theart XMTC by conducting a
comparative evaluation of seven
state-of-the-art method, including the
target-embedding methods and tree-based method and
a deep learning methods. So to start with, let's look at the KimCNN, which is a very famous
convolution network on text classification. So the thing is, instead of using
just a bag-of-word feature, the roll text is
fed into the model, as you can see in
the figure here. The resolution may
not be that good, so here is a sequence of words, and each word is replaced
by the word embedding.

So you have a vector here. It's a word embedding
with the associated word. So the input will
form a n by k matrix, where n is number of
words in a document, and k is dimension of
the word embedding. So since the input is a m by k, you can think of
this like image, and we are doing convolutional
filters on this image. So what they do is they place one deconvolution filter sliding through the time dimension. So this is, implicitly, it's extracting
a generalized version of the n-gram features, compared to the traditional bag-of-word or bigram
or trigram features. So in this red box here, they're using
the filter size of two, but you can also use
three and four and so on. So that way, you are extracting more generalized bigram
or trigram features. So after that you have
a feature map, and what they do, they just do max
pooling to extract the most strongest signal in that feature map and
stack them all together.

And finally is
the fully connected layer. So this is a brief
introduction of the current strongest method in multiclass classification. So what we did here in the extreme multi label
classification is, there are several different
part, roughly three. Let me go through
them one by one. So the first part
is the convolution. As you can see, the red box
is the convolution filter. We slide through
the convolution filter in another dimension. So maybe we can go
back a little bit. So in the previous work, they slide through
the convolution filter in a temporal manner. Here, we do it opposite
direction that we swipe through the convolution filter through the word embedding dimension. That is, you can think of in an image is
a spatial dimension. So the motivation is like each filters is capturing Moore's
global information, given the whole sequence. So flying through
different dimension of the word embedding
is like capturing the most salient features of word among the entire document. So we also have, say, filter of size
248 et cetera that capturing different
spatial relations among the embedding matrix.

So after that, we also
have a feature map. What we do is not
the traditional max pooling. What we do is like adaptive
max pooling that extract, say, two to three most secure known features
in the feature map. So if we are doing that, the final feature
representation here will be even two to three times larger
than the previous method. So in the extreme
multi-label setting, we know that
the final output number of label here is very huge. So we need to make a low-rank
matrix factorization here. So there's a safe 512th
hidden representation in the middle that
first project this long feature map to a lower dimension and
then we do fully connect. And just as a case study of the memory
consumption, since, say, you may be interest in how large the memory consumption
of these neural nets. So, say, we are using
vocabulary size 30K, and the word embedding
dimension is 300, and we cut the document
length to, say, 500, and we analyze
the largest asset, the number of label is 670, which is the largest asset
on the website, and we set the
hidden dimension to 500 and the pulling
units to 128.

So first is the embedding layer, the number of
parameter is roughly V times D, roughly 9 millions. And the number of parameter
in the convolution layer, it's calculated in
this way because we have for the filter size of two, we have 32 of them. For the filter size
of four and eight, we also have 32 copy of them. So in total, it's roughly 200K. And in the hidden layer, I'm referring to
the parameter in this part. In the hidden layer, we fixed out
the max pooling output to be 128 and given
that we have 32, each have three
times the output 500, that's roughly 6.3 million. And the bottlenecks is actually on the
final output layer. That is the 500 times 670K, which is roughly 343 millions. So if you sum this all up, this roughly accounts to 360 million number
of parameters.

So if you're using
floating precision, that's roughly 1.3 gigabytes, doesn't sound that bad. If you have a 12 giga GPU memory
or you have more, this definitely can
fit into your machine. So the analysis here
is just the minimum of memory you use because
when you are doing backpropagation and
the mini batches, that also depends on
the mini batch size you use. So, finally, it's the Choice
of the Loss function. We know that if in
the Multi-class classification, if you use Softmax, the model will favor only one label and pushing
other label to zero. So in the XMTC setting, you mostly put
zero probability output on the other labels. But actually, in
this kind of the dataset, because each document is
only tag with the most relevant, say five or 10 labels, it doesn't mean that other label
is not relevant. So, using Softmax may
not be that good choice, so we consider the most naive Binary
Cross-entropy using the Sigmoid. So, okay, finally it's
the experiment setting, so we consider six
benchmark dataset. The first two is the smallest
dataset RCV1 and EUR-Lex, and the middle two is
the median-sized dataset, Amazon-12K and Wiki-30K, and the last two is the largest one, Amazon-670K and the Wiki-500K.

And here is the results in XMTC. We're usually most interested in the top portion of
the rank labels, so the results are
evaluated by Precision one, three, five and NDCG
one, three, five. Here we present the Precision
one, three, five. So, you can see each column
representing a method, the bold face indicate
the best results on that row, so each row is a dataset and the bold face
is the best one. So, we also okay, the star notation denotes not statistically significantly
different from the best model on that row, using the sample T-test. So we can see here,
there's no star because all the results are
statistical significant. Among 18 rows, so the XML-CNN
is the proposed method. It has the best score
in 10 rows, while the second method
SLEEC has four best rows, and then followed
by the FastXML. But to be honest, the training height
for the XML-CNN would be a lot slower than FastXML, the tree-based ensemble method. So, let's look at the NDCG. So NDCG also provide
similar results but you can see here the star notation
here, and here, that means that this is not statistically significant
compared to this one.

Finally, we study
the Ablation Test because we have
a close relationship with the previous work
with the Kim-CNN. So we conduct Ablation test
building upon this Kim theme, adding one component at a time. So, the lightest color, probably the white here in
the screen, is the Kim-CNN. And as the color grows darker, it's a different versions from the white to
the lighter one, we changed the Softmax
to Binary Cross-entropy, as you can see that
gave us some gain in this dataset and
this dataset but probably not this one,
in the precision one. And the Version two is, we add a middle
bottom hidden layer. So this also will give us some gain in the performance
and finally is, it changed the
convolutions direction from temporal to spatial
through swiping their actions. So that you can see
in this dataset, gave us the most improvement, and also this one, dataset. So, this is my first half of the talk, in the second half, we investigate
some recent works using deep structural prediction
framework to solve the Multi- label
classification problem. So, from now the work is published by
other people in ICML.

So we run the experiments
and compare them with the state of the art method in extreme Multi-label
Classification. So first, so far what we did
is we are given an input X, we path through some function, and we extract
the feature representation. In the final layer,
we simply use Binary Logistic Regression to predict the output
independently, that's so far what
I have described. So from next, we discuss a structure prediction thinking of doing Multi-Label
Classification. So, you can think of here the, when they, what they do is, they try in the
Structure Prediction, they will extract
joint feature representation from both the data
and the label. And what they do is finding
a prediction output Y, that has a lower energy. So this correspond into finding the largest conditional
likelihood of PY given X in the tri-ordinal
graphical model CRF like those kind of thing. So the motivation is
that they want to extract joined feature between
the data X and the output, and explore the label dependency via the design of
the energy network.

So the key is how you design
the energy network here. So, this is, let's look
at each component of the Energy Network that are
designed in David ICML paper. So first, it is a feature network that's extracting the feature
given a feature input X, it's just a two-layer MLP. Next is the Energy Network, is composed by two part, the local energy
and the global one. The local one is
like in the middle, you can see here in
the middle part of the figure. It's a sum of
the L linear model, where L is a number of labels. So this is very
similar to what we did in the previous work. It's just BI is a lower dimensional embedding
to learn for each label. And in the upper part, is the global energy. So what do I mean by global? It's like the score function
is independent of X, but you can see here
in this, in here, the formula here is
trying to extract the lower embedding of
the large output space Y, by that, we hope that we can learn some fine transformation of
the output that captured the salient
feature of labels using this to model
the label dependencies.

So, the final Energy Design is the local part
adding the global part. So how do we learn
the model parameter theta? So say we are using, simply using the
mean-square out loss, the loss function here. First, we need to find the most, find the prediction
yi hat by solving the inference problem and
of the energy network, given the fixed model
parameter theta. So, exact solving the inference
may not be feasible.

So, what they do, in practice, they just do
gradient descent with a fixed maximum iteration
number, say just five. So, they do this
five-time obtaining the prediction yi hat and fix that to calculate
the gradient based on theta. So, actually
calculating the gradient with respect to
model parameter theta could be somehow tricky. So, let's look at
the computational graph here. The input is x and you path this through
some feature network. Use cache the feature and
you will calculate the, you first make a forward path based on this to calculate
the prediction y. So, you have, you need to calculate the gradient
with respect to y here in the pink box
is a model part. But in the backward pass, you also want to
calculate the gradient with respect to
the model parameter theta. So, you need to
visit multiple times through the pink box, like for example,
the first paths it's like this.

And the second part
is like this, because the yi is
sort of like the rn thinking that the final
prediction of y is a dependency among
the previous stat and previous stat is
a input function of the model primacy that's. So, you need to make the
derivative multiple times path. But this is the detail and another work in this year ICML is called Input
Convex Neural Nets, which is very similar to
the previous structure Prediction Energy Network,
except one thing. The design architecture
of energy network. In this work, they design
the energy network to be convex with respect to the y. So, the benefit of that is when you do
the task time optimization, solving inference problems out, this will have
a global optimal solution.

Of course, to design
the network to be convex, you have some assumption or
can say constraint need to made on with parameter w. So, let me go to the
experiment parts in there. For simplicity, we
rerun the experiments in those ICML paper, where they consider only these three relatively small scale
multi-label datasets. The main reason why we want
to rerun their experiments if that actually in
their paper they did not compare with the most strongest state of the art multi-label text
classification baseline. So, we find interesting
to compare with this. The setting is like this. The input feature
representation you just 0,1 binary backup for a feature and they pre-train the feature network based
on the binary logistic loss.

And the evaluation metric
is precision at 1,3,5 which is commonly
used in literature. So, the results like here. So, before jumping
to the number, I would like to emphasize that the hybrid parameters
for each method, it's just follow
the default setting. So, I didn't heavily
tune and not so surprising if you see
the FastXML and a SLEEC, it's either FastXML or SLEEC that reach the best performance. However, if you look at
the ICNN, this column here, you can see in
a Delicious Dataset has the largest number of
Levo among these three. It actually is the second best outperformed the SLEEC
and the PDSparse here. So, the first column is just a multilayer perception and the second column
and third column is using energy networks. So, they outperformed
the MLP saying that exploring
the joined representation between data and
labeled does help. So, if you compare the third
column to the second column, the difference is the design
of the energy network. So, it seems like designing that the energy network to be
complex with respect to y, have some adventures
when you are doing the test
time optimization.

So, there you can
solve that through more high precision with
respect to the prediction y. So, that's have more benefit in a Delicious Dataset
and the Bookmark Datasets. I think that's all, conclude my presentation today. So, in short, the first part
is I introduce some alternative way to extract the feature representation via another way of
convolution you or not. And the second part is I
introduce some recent work in ICML that's used in
structure prediction, but we also can apply them in multi-label
text classification. And they may provide some alternative insights of exploring the label
dependencies. Thank you. >> Thank you so much. Are there any
questions? Hi, Manny. >> Hi. So, it's really nice to see these
learning approaches starting to do so well for
extreme classification. I have a couple of questions though the first one
is about scaling. >> Right. >> So, actually I just wanted to clarify that the largest
dataset available on the repository has actually
3 million labels, not 670K. But, of course, internally, as Edward also mentioning, we've gone beyond a hundred million labels,
well beyond that.

So, how do you see these kinds
of approaches scaling to those kinds of
label settings? So, that was the first question. And the second is as
regards the accuracy, I was wondering if you had
compared against this man called PPDSparse or
FasterXML as well? >> Okay, so maybe I can first try to answer
your first question. So, I think you are asking
whether this kind of model architecture can scale to further larger label space. So, in currently
we are only using dimension reduction through the hidden representation here. So, if the final
output space say the dimension of y go
through say 5 millions, then that may have some trouble, because in terms
of that complexity, it's linear with respect
to the output space. So, if we are trying to apply this kind of framework
on even larger dataset, we need to make
the hidden representation, hidden layer, even smaller.

So, currently I don't have any other more than solution building upon this dibnoring neuronites to
address that issue. And I think your
second question is, do we compare with PDSparse
and SLEEC or a FastXML? >> No, PPDSparse or FasterXML. >> Oh, oh you mean the PPD. No, we haven't compared with the PPDSparse
and the DISMAC. Right, but we. >> [inaudible]. >> Right, so because currently
I'm not working on this. So, this is an earlier
work in this year. So, currently I'm working
on other projects. So, the first author may have conduct
some experiments with on that part but he has
an update with me yet, right. >> Okay. There's
another question. Yep, you. Yup. >> I find the way you do
convolution really curious. So, I was wondering like
whether it's specific to the setting of multi-label
extreme classification or is this like general.

So, for example if you replace the way you
do convolution in the original KIM-CNN which
actually improved performance. >> So, we only investigate that ablation test under the dataset we use
in a multi-label. So, I can't really
say what that would be beyond the
multi-label datasets. So, if you can see here, the difference of v2 and v3 it's really the way the direction
you do convolution. So, actually in
the eurlex dataset and the amazon670K that does improve but not so much in the wiki30K. So, I think that also depends
on the datasets, right. >> Thanks..

As found on YouTube