>> Wei-Cheng Chang is a PhD student at CMU, and he will be talking about the deep learning approach to extreme multi-label text classification. >> Hello, everyone. This is Wei-Cheng Chang, currently a PhD student at LTI, CMU. First, I would like to thank the workshop organizers for inviting my adviser, Yiming Yang, to present this talk. However, due to some personal travel issues, she cannot make it, so today I am going to talk about the deep learning approach to extreme Multi-Label Classification.

This is joint work by Jingzhou, Yuexin, Yiming, and me. The talk is divided into two parts. In the first half, we focus on our SIGIR 2017 paper from earlier this year, which is about using a deep learning architecture to extract richer feature representations for multi-label text classification. In the second half, we introduce some related work on deep structured learning from this year's ICML, which can also be applied to Multi-Label Classification.

To start, let me briefly introduce the definition of multi-label text classification, XMTC for short. The task is to find, for each document, its most relevant subset of labels from a very large space of categories. This is different from the multi-class setting, where each document has only one label; I believe everyone knows this. To introduce the notation for our setting: the training data is denoted as pairs (x, y), where x is the data and y is the label set, and our goal is to find a mapping from x to y.

So what is so special about XMTC? In extreme multi-label classification, "extreme" typically refers to the label space being very large. But actually, if you look at the XMTC datasets, the number of documents n and the number of features D are also at extreme scale. This poses two key challenges. The first is scalability, both in training and in testing. The other is data sparsity, in the sense that many labels appear in only very few documents.

In this talk, we try to provide solutions for these two challenges in XMTC. First, we will discuss another way to extract feature representations besides the traditional bag-of-words method; in the second half, we will discuss how to exploit label dependencies. In the literature of extreme classification, we can roughly divide the approaches into two categories.

The first is target-embedding methods, and the second is tree-based ensembles. I believe I do not need to give more detail on these two, since they are the traditional state of the art in extreme classification. What I would like to focus on is the deep learning approach, which may be less familiar to the extreme classification community. In this talk, we examine the state of the art in XMTC by conducting a comparative evaluation of seven state-of-the-art methods, including target-embedding methods, tree-based methods, and deep learning methods.

To start, let's look at Kim-CNN, a very famous convolutional network for text classification. The idea is that instead of using just bag-of-words features, the raw text is fed into the model, as you can see in the figure here. The resolution may not be that good, so: here is a sequence of words, and each word is replaced by its word embedding, so each word is associated with an embedding vector. The input therefore forms an n-by-k matrix, where n is the number of words in the document and k is the dimension of the word embedding. Since the input is an n-by-k matrix, you can think of it as an image, and we apply convolutional filters to that image. What they do is slide one convolution filter along the time dimension, which implicitly extracts a generalized version of n-gram features, compared to the traditional bag-of-words, bigram, or trigram features. In the red box here they use a filter of size two, but you can also use three, four, and so on; that way you extract more generalized bigram or trigram features. After that you have a feature map, and they simply apply max pooling to extract the strongest signal in each feature map and stack them all together. Finally, there is a fully connected layer. This is a brief introduction to the current strongest method in multi-class classification.
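
To make that pipeline concrete, here is a minimal PyTorch sketch of a Kim-CNN-style model. All the sizes here (vocabulary, embedding dimension, filter widths, number of filters, number of classes) are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KimCNN(nn.Module):
    def __init__(self, vocab=30_000, emb=300, widths=(2, 3, 4), filters=32, classes=10):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        # One temporal convolution per filter width, sliding over word positions.
        self.convs = nn.ModuleList(nn.Conv1d(emb, filters, w) for w in widths)
        self.fc = nn.Linear(filters * len(widths), classes)

    def forward(self, tokens):                    # tokens: (batch, n_words)
        x = self.embed(tokens).transpose(1, 2)    # (batch, emb, n_words) for Conv1d
        # Plain max pooling over time keeps only the strongest n-gram response
        # per filter; the pooled maps are concatenated ("stacked together").
        pooled = [F.relu(c(x)).max(dim=2).values for c in self.convs]
        return self.fc(torch.cat(pooled, dim=1))  # class logits

logits = KimCNN()(torch.randint(0, 30_000, (4, 500)))  # toy batch of 4 documents
```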

For extreme multi-label classification, what we did differs in several parts, roughly three; let me go through them one by one. The first part is the convolution. As you can see, the red box is the convolution filter, and we slide the convolution filter along a different dimension. Maybe we can go back a little bit: in the previous work, they slide the convolution filter in a temporal manner. Here, we do it in the opposite direction, sweeping the convolution filter along the word-embedding dimension, which, if you think of the input as an image, is the spatial dimension. The motivation is that each filter then captures more global information, given the whole sequence: sliding through a given dimension of the word embeddings is like capturing the most salient features of words across the entire document. We also have filters of size 2, 4, 8, et cetera, which capture different spatial relations within the embedding matrix.
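
As a rough sketch of this direction change (my own illustrative rendering, not the authors' released code): the axis a Conv1d slides over is determined by which axis you treat as the channel axis, so transposing the input switches between the temporal and the spatial variant.

```python
import torch
import torch.nn as nn

batch, n_words, emb = 4, 500, 300
x = torch.randn(batch, n_words, emb)            # an embedded document

# Temporal (Kim-CNN): embedding dims are channels, filter slides over words.
temporal = nn.Conv1d(emb, 32, kernel_size=2)(x.transpose(1, 2))
print(temporal.shape)                           # torch.Size([4, 32, 499])

# Spatial variant: word positions are channels, filter slides over embedding
# dimensions, so every filter sees the entire word sequence at once.
spatial = nn.Conv1d(n_words, 32, kernel_size=2)(x)
print(spatial.shape)                            # torch.Size([4, 32, 299])
```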

After that, we again have a feature map. But what we do is not the traditional max pooling; instead, we use an adaptive max pooling that extracts, say, the two or three most salient features in each feature map. If we do that, the final feature representation is two to three times larger than in the previous method. Also, in the extreme multi-label setting, the number of output labels is very large, so we need a low-rank matrix factorization here: there is a, say, 512-dimensional hidden representation in the middle that first projects this long feature map to a lower dimension, and then we apply the fully connected output layer.
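
Here is a hedged sketch of these two changes together, top-k pooling plus a low-rank bottleneck in front of the huge output layer. The dimensions are the illustrative ones used in this talk, and the structure follows my reading of the description rather than the released code.

```python
import torch
import torch.nn as nn

# (batch, filters, positions) feature maps coming out of the convolutions.
feat = torch.randn(4, 96, 299)

# Adaptive/top-k pooling: keep the k strongest responses per filter
# instead of the single maximum, giving a 2-3x larger representation.
k = 3
pooled = feat.topk(k, dim=2).values.flatten(1)   # (batch, 96 * k)

bottleneck = nn.Linear(96 * k, 512)              # low-rank projection
output = nn.Linear(512, 670_000)                 # huge label space (~343M params)
logits = output(torch.relu(bottleneck(pooled)))  # (batch, 670000)
```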

Just as a case study of the memory consumption, since you may be interested in how large the memory footprint of such a neural net is: say we use a vocabulary of size 30K and a word-embedding dimension of 300, and we cut the document length to, say, 500. We analyze the largest dataset on the website, where the number of labels is 670K, and we set the hidden dimension to 500 and the pooling units to 128.

First is the embedding layer: the number of parameters is roughly V times D, about 9 million. The number of parameters in the convolution layer is calculated as follows: for filter size two we have 32 filters, and for filter sizes four and eight we also have 32 copies each, so in total it is roughly 200K. In the hidden layer, by which I mean the bottleneck part, we fix the max-pooling output at 128, and given that we have 32 filters for each of the three sizes, times the hidden output of 500, that is roughly 6.3 million. The memory bottleneck is actually the final output layer: 500 times 670K, which is roughly 343 million. If you sum all of this up, it accounts for roughly 360 million parameters.
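
This back-of-the-envelope count can be reproduced in a few lines. The figures quoted above appear to be rounded, and the convolution term assumes the spatial convolution described earlier, so the totals come out slightly different:

```python
V, D = 30_000, 300                 # vocabulary size, embedding dimension
doc_len, labels = 500, 670_000     # document length, number of labels
n_filters, widths = 32, (2, 4, 8)  # 32 filters per width
pool_k, hidden = 128, 500          # pooling units, hidden dimension

embedding = V * D                                       # 9.0M
conv = sum(n_filters * doc_len * w for w in widths)     # ~224K ("roughly 200K")
bottleneck = n_filters * len(widths) * pool_k * hidden  # ~6.1M ("roughly 6.3M")
output = hidden * labels                                # 335M ("roughly 343M")

total = embedding + conv + bottleneck + output
print(f"{total / 1e6:.0f}M parameters, {total * 4 / 2**30:.2f} GiB in float32")
# -> roughly 350M parameters, about 1.3 GiB
```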

So if you are using single floating-point precision, that is roughly 1.3 gigabytes, which doesn't sound that bad: if you have 12 gigabytes of GPU memory or more, this can definitely fit on your machine. Note that this analysis is only the minimum memory you will use, because backpropagation and mini-batching add more, depending on the mini-batch size you use.

Finally, there is the choice of the loss function. We know that in multi-class classification, if you use softmax, the model will favor only one label and push the other labels toward zero, so in the XMTC setting you would mostly put zero probability on the other labels. But in this kind of dataset, each document is tagged only with the most relevant, say, five or ten labels, which does not mean the other labels are irrelevant. So softmax may not be a good choice, and we instead consider the most naive binary cross-entropy with sigmoid outputs.
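
In PyTorch terms, the swap is roughly the following; this is a schematic comparison with toy shapes, not the actual training code:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 670)         # (batch, labels), toy sizes
targets = torch.zeros(4, 670)
targets[:, :5] = 1.0                 # each document has a handful of positive labels

# Multi-class view: softmax makes labels compete, so only one can "win".
loss_softmax = F.cross_entropy(logits, targets.argmax(dim=1))

# Multi-label view: an independent sigmoid per label with binary cross-entropy,
# so several labels can be predicted relevant at once.
loss_bce = F.binary_cross_entropy_with_logits(logits, targets)
```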

Okay, finally, the experimental setting. We consider six benchmark datasets. The first two are the smallest, RCV1 and EUR-Lex; the middle two are the medium-sized datasets, Amazon-12K and Wiki-30K; and the last two are the largest, Amazon-670K and Wiki-500K.

And here are the results. In XMTC we are usually most interested in the top portion of the ranked labels, so the results are evaluated by Precision at 1, 3, 5 and NDCG at 1, 3, 5.
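
For reference, Precision at k simply measures what fraction of the k highest-scoring labels are truly relevant; here is a minimal sketch with made-up data:

```python
import numpy as np

def precision_at_k(scores, truth, k):
    """Average fraction of the top-k scored labels that are relevant."""
    topk = np.argsort(-scores, axis=1)[:, :k]       # (docs, k) label indices
    hits = np.take_along_axis(truth, topk, axis=1)  # 1 where a top-k label is true
    return hits.mean()

scores = np.random.rand(2, 6)                # model scores for 6 labels
truth = np.array([[1, 0, 1, 0, 0, 0],
                  [0, 1, 0, 0, 1, 1]])       # ground-truth label sets
print(precision_at_k(scores, truth, k=3))
```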

Here we present Precision at 1, 3, and 5. Each column represents a method and each row a dataset; the boldface indicates the best result in that row. The star notation denotes results that are not statistically significantly different from the best model in that row, under a t-test. Here you can see there is no star, because all the differences are statistically significant. Among the 18 rows, XML-CNN, the proposed method, has the best score in 10 rows, while the second-best method, SLEEC, has four best rows, followed by FastXML. But to be honest, training XML-CNN is a lot slower than FastXML, the tree-based ensemble method. Now let's look at NDCG. NDCG gives similar results, but you can see the star notation here and here, which means those results are not statistically significantly different from this one.

Finally, we run an ablation test, because of the close relationship with the previous work, Kim-CNN. We conduct the ablation test by building on Kim-CNN and adding one component at a time. The lightest color, probably white on this screen, is Kim-CNN, and the versions differ as the color grows darker. From white to the next lighter shade, we change softmax to binary cross-entropy; as you can see, that gives some gain on this dataset and this dataset, but probably not this one, in Precision at 1. In version two, we add the middle bottleneck hidden layer, which also gives some gain in performance. Finally, we change the convolution direction from temporal to spatial by swapping the sliding direction. You can see that on this dataset it gives the most improvement, and also on this one. That is the first half of my talk.

In the second half, we investigate some recent work that uses a deep structured prediction framework to solve the multi-label classification problem. From here on, the work was published by other people at ICML; we ran the experiments and compared them with the state-of-the-art methods in extreme Multi-Label Classification. First, recall what we have done so far: given an input x, we pass it through some function to extract a feature representation, and in the final layer we simply use binary logistic regression to predict each output independently. That is what I have described so far.

Next, we discuss a structured-prediction way of thinking about Multi-Label Classification. In structured prediction, they extract a joint feature representation from both the data and the label, and they look for a prediction output y that has low energy. This corresponds to finding the largest conditional likelihood P(y|x) in traditional graphical models such as CRFs. The motivation is to extract joint features between the data x and the output, and to explore the label dependencies via the design of the energy network.

The key is how you design the energy network. Let's look at each component of the energy network designed in David Belanger's ICML paper on Structured Prediction Energy Networks (SPENs). First, there is a feature network that extracts features from the input x; it is just a two-layer MLP. Next, the energy network is composed of two parts, a local energy and a global one. The local one, shown in the middle part of the figure, is a sum of L linear models, where L is the number of labels. This is very similar to what we did in the previous work; each b_i is a low-dimensional embedding learned for each label. The upper part is the global energy. What do I mean by global? The score function is independent of x: the formula here tries to extract a low-dimensional embedding of the large output space y. By doing that, we hope to learn a transformation of the output that captures the salient features of the labels, and use it to model the label dependencies.
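
As a rough rendering of the two energy terms, in the spirit of the SPEN formulation; the shapes and the softplus nonlinearity here are my assumptions:

```python
import torch
import torch.nn.functional as F

L, d, h = 8, 16, 4                           # labels, feature dim, global hidden size
B = torch.randn(L, d)                        # per-label embeddings b_i (local term)
C1, c2 = torch.randn(h, L), torch.randn(h)   # global term parameters

def energy(feat, y):
    """E(x, y) = local + global."""
    local = (y * (B @ feat)).sum()       # sum_i y_i * <b_i, f(x)>
    global_ = c2 @ F.softplus(C1 @ y)    # depends on y alone: label dependencies
    return local + global_

e = energy(torch.randn(d), torch.rand(L))    # scalar energy for one example
```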

So the final energy design is the local part plus the global part. How do we learn the model parameters theta? Say we simply use the mean-squared-error loss as the loss function here. First, we need to find the prediction y_i hat by solving the inference problem of the energy network, given fixed model parameters theta. Solving that inference exactly may not be feasible.

So in practice, what they do is just gradient descent with a fixed maximum number of iterations, say just five. They run these five steps to obtain the prediction y_i hat, then fix it to calculate the gradient with respect to theta.
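
Schematically, that inner inference loop looks like the sketch below. This is a simplification under my own assumptions; the actual papers use more careful projected or entropic updates on the relaxed y.

```python
import torch

def infer_y(energy, x, n_labels, steps=5, lr=0.1):
    """Approximate argmin_y E(x, y) with a few gradient steps on a relaxed y."""
    y = torch.full((n_labels,), 0.5, requires_grad=True)
    for _ in range(steps):                  # fixed, small iteration budget
        grad, = torch.autograd.grad(energy(x, y), y)
        with torch.no_grad():
            y -= lr * grad
            y.clamp_(0.0, 1.0)              # keep the relaxation inside [0, 1]
    return y.detach()

# Toy energy with a random label-interaction matrix (illustrative only).
W = torch.randn(8, 8)
energy = lambda x, y: y @ W @ y - (x * y).sum()
y_hat = infer_y(energy, x=torch.randn(8), n_labels=8)
```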

Actually, calculating the gradient with respect to the model parameters theta can be somewhat tricky. Let's look at the computational graph here. The input is x, and you pass it through some feature network and cache the features. You first make a forward pass based on them to calculate the prediction y, so you need the gradient with respect to y here; the pink box is the model part. But in the backward pass you also want the gradient with respect to the model parameters theta, so you need to visit the pink box multiple times: for example, the first pass is like this, and the second pass is like this. The reason is that y_i behaves somewhat like an RNN, in the sense that the final prediction of y depends on the previous state, and the previous state is itself a function of the model parameters. So you have to take the derivative over multiple passes.

But those are details. Another work from this year's ICML is called Input Convex Neural Networks, which is very similar to the previous Structured Prediction Energy Network except for one thing: the design of the energy network architecture. In this work, they design the energy network to be convex with respect to y. The benefit is that when you do the test-time optimization to solve the inference problem, it has a globally optimal solution.

Of course, to design the network to be convex, some assumptions, or you could say constraints, need to be placed on the parameters W.
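
The usual recipe from the ICNN paper is to keep the weights applied to intermediate activations non-negative and to use convex, non-decreasing activations such as ReLU; here is a tiny sketch of one such layer (my illustration, not the authors' implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvexLayer(nn.Module):
    """z' = relu(Wz @ z + Wy(y)): with Wz >= 0 and a convex, non-decreasing
    activation, the composition stays convex as a function of y."""
    def __init__(self, dim_z, dim_y):
        super().__init__()
        self.Wz = nn.Parameter(torch.randn(dim_z, dim_z) * 0.1)
        self.Wy = nn.Linear(dim_y, dim_z)   # unconstrained skip path from y

    def forward(self, z, y):
        Wz = F.softplus(self.Wz)            # reparameterize to enforce Wz >= 0
        return F.relu(z @ Wz.T + self.Wy(y))

layer = ConvexLayer(dim_z=8, dim_y=8)
z_next = layer(torch.rand(4, 8), torch.rand(4, 8))   # toy batch
```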

Let me move to the experiment part. For simplicity, we reran the experiments from those ICML papers, where they consider only three relatively small-scale multi-label datasets. The main reason we wanted to rerun their experiments is that in their papers they did not compare with the strongest state-of-the-art multi-label text classification baselines, so we found it interesting to make that comparison. The setting is as follows: the input feature representation is just a 0/1 binary bag-of-words feature, and they pre-train the feature network on the binary logistic loss.

The evaluation metric is Precision at 1, 3, and 5, which is commonly used in the literature. The results are shown here. Before jumping to the numbers, I would like to emphasize that the hyperparameters for each method just follow the default settings; I did not tune them heavily. So it is not surprising that it is either FastXML or SLEEC that reaches the best performance.

However, if you look at ICNN, this column here, you can see that on the Delicious dataset, which has the largest number of labels among the three, it is actually the second best, outperforming SLEEC and PD-Sparse. The first column is just a multilayer perceptron, and the second and third columns use energy networks. They outperform the MLP, which suggests that exploiting the joint representation between data and labels does help. If you compare the third column to the second, the difference is the design of the energy network, so it seems that designing the energy network to be convex with respect to y has some advantages when you are doing the test-time optimization.

You can then solve for the prediction y to higher precision, which gives more benefit on the Delicious and Bookmarks datasets. I think that concludes my presentation today. In short, in the first part I introduced an alternative way to extract feature representations via a different way of doing convolution, and in the second part I introduced some recent ICML work on structured prediction that can also be applied to multi-label text classification and may provide some alternative insights for exploiting label dependencies. Thank you. >> Thank you so much.

Are there any questions? Hi, Manny. >> Hi. It's really nice to see these deep learning approaches starting to do so well for extreme classification. I have a couple of questions, though. The first one is about scaling. >> Right. >> So, actually, I just wanted to clarify that the largest dataset available on the repository actually has 3 million labels, not 670K. And of course, internally, as Edward also mentioned, we've gone beyond a hundred million labels, well beyond that.

So how do you see these kinds of approaches scaling to those kinds of label settings? That was the first question. And the second is as regards accuracy: I was wondering if you had compared against the method called PPDSparse, or FasterXML, as well?

>> Okay, maybe I can first try to answer your first question. I think you are asking whether this kind of model architecture can scale to an even larger label space. Currently, we only use dimension reduction through the hidden representation here. So if the dimension of the final output space y goes up to, say, 5 million, that may run into trouble, because the complexity is linear with respect to the output space. If we tried to apply this kind of framework to even larger datasets, we would need to make the hidden representation, the hidden layer, even smaller.

So currently I don't have a more mature solution building on this deep learning network to address that issue. And I think your second question is whether we compared with PDSparse and SLEEC or FastXML? >> No, PPDSparse or FasterXML. >> Oh, you mean the PPD one. No, we haven't compared with PPDSparse and DiSMEC. >> [inaudible]. >> Right. Currently I'm not working on this; it is an earlier work from this year, and I'm now working on other projects. The first author may have conducted some experiments on that part, but he hasn't updated me yet. >> Okay. There's another question. Yep, you.

>> I find the way you do convolution really curious. I was wondering whether it is specific to the setting of extreme multi-label classification or whether it is more general. For example, if you replaced the way convolution is done in the original Kim-CNN, would that actually improve performance?

>> We only investigated that in the ablation tests on the multi-label datasets we used, so I can't really say how it would behave beyond the multi-label datasets. As you can see here, the difference between v2 and v3 is exactly the direction in which you do the convolution. On the EUR-Lex dataset and Amazon-670K it does improve things, but not so much on Wiki-30K. So I think it also depends on the dataset. >> Thanks.