class torch.nn.BCELoss(weight=None, size_average=True)
Creates a criterion that measures the Binary Cross Entropy between the target and the output (i ranges over each class):
$$ \ell(o, t) = -\frac{1}{n}\sum_{i}\left( t[i]\,\log o[i] + (1 - t[i])\,\log(1 - o[i]) \right) $$
or in the case of the weight argument being specified:
$$ \ell(o, t) = -\frac{1}{n}\sum_{i} w[i]\left( t[i]\,\log o[i] + (1 - t[i])\,\log(1 - o[i]) \right) $$
This is used for measuring the error of a reconstruction in for example an auto-encoder. Note that the targets t[i] should be numbers between 0 and 1.
| Parameter | Description |
|---|---|
| weight (Tensor, optional) | a manual rescaling weight given to the loss of each batch element. If given, has to be a Tensor of size "nbatch". |
| size_average (bool, optional) | By default, the losses are averaged over observations for each minibatch. However, if the field size_average is set to False, the losses are instead summed for each minibatch. Default: True |
Shape:
- Input: (N, *), where * means any number of additional dimensions
- Target: (N, *), same shape as the input
Examples:
>>> import torch
>>> from torch import nn, autograd
>>> m = nn.Sigmoid()
>>> loss = nn.BCELoss()
>>> input = autograd.Variable(torch.randn(3), requires_grad=True)
>>> target = autograd.Variable(torch.FloatTensor(3).random_(2))
>>> output = loss(m(input), target)
>>> output.backward()
From «https://pytorch.org/docs/0.3.0/nn.html?highlight=bceloss#torch.nn.BCELoss»
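To connect the quoted formula with the example above, here is a minimal sketch that checks nn.BCELoss numerically against the averaged binary cross-entropy expression (written against a recent PyTorch API rather than the 0.3.0 Variable interface; tensor sizes are arbitrary):

```python
import torch
import torch.nn as nn

# Sketch: compare nn.BCELoss (mean / size_average reduction) with the manual formula
#   -1/n * sum_i ( t[i] * log(o[i]) + (1 - t[i]) * log(1 - o[i]) )
o = torch.sigmoid(torch.randn(3))     # predicted probabilities in (0, 1)
t = torch.empty(3).random_(2)         # binary targets in {0, 1}

manual = -(t * o.log() + (1 - t) * (1 - o).log()).mean()
builtin = nn.BCELoss()(o, t)
print(manual.item(), builtin.item())  # the two values should agree
```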
Learning framework
multi-label indicators:
- label density
- label diversity
- normalized label diversity
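As a quick illustration, here is a hedged sketch of how these three indicators are commonly computed; the notes above do not spell out the definitions, so the standard multi-label ones are assumed here:

```python
from typing import List, Set

def label_indicators(label_sets: List[Set[int]], num_labels: int):
    """Compute label density, label diversity, and normalized label diversity
    for a multi-label dataset, using the usual definitions (assumed here):
      - label density: average fraction of the label space used per example
      - label diversity: number of distinct label sets in the dataset
      - normalized label diversity: label diversity divided by dataset size
    """
    n = len(label_sets)
    density = sum(len(s) for s in label_sets) / (n * num_labels)
    diversity = len({frozenset(s) for s in label_sets})
    return density, diversity, diversity / n

# toy example: 4 examples, 5 possible labels
print(label_indicators([{0, 1}, {1}, {0, 1}, {2, 3, 4}], num_labels=5))
```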
To understand this algorithm, one first needs to know the two models inside word2vec: CBOW and the skip-gram model. For the underlying theory, these two articles explain it well: (1) "Word2Vector中的数学理解" (the mathematics behind word2vec), (2) "Deep Learning 实战之 word2vec" (word2vec in practice).
A word vector is a mathematical representation of language, so that text can be fed into algorithms for learning and prediction. Word vectors are an important building block of the `statistical language model`; the statistical language model is the foundation of all of NLP and is widely used in speech recognition, machine translation, and other tasks. A statistical language model is a model that computes `the probability of a sentence` $p(W)$: $$ p(W) = p(w_1^T) = p(w_1, \dots, w_T) $$ Bayes' rule plays an important role in statistical machine learning; using it, the statistical language model can be rewritten as: $$ p(W) = p(w_1)\, p(w_2 \mid w_1)\, p(w_3 \mid w_1^2) \cdots p(w_T \mid w_1^{T-1}) $$ Once the model is expressed through conditional probabilities, many tools become available; common models for estimating these parameters include n-gram models, neural networks, decision trees, maximum entropy, maximum-entropy Markov models, and conditional random fields. Reference (1) mainly covers the n-gram model and the neural network model.
My understanding of the n-gram model: it assumes that the probability of a word depends only on the n-1 words preceding it, i.e., an (n-1)-order Markov assumption: $$ p(w_k \mid w_1^{k-1}) \approx p(w_k \mid w_{k-n+1}^{k-1}) \approx \frac{\operatorname{count}(w_{k-n+1}^{k})}{\operatorname{count}(w_{k-n+1}^{k-1})} $$ so $p(w_k \mid w_1^{k-1})$ can be estimated from the frequency counts of word combinations. For example, for n = 2: $$ p(w_k \mid w_1^{k-1}) \approx \frac{\operatorname{count}(w_{k-1}, w_k)}{\operatorname{count}(w_{k-1})} $$ (a counting sketch follows the table below). Relation between the number of parameters and n:
| n | Number of parameters |
|---|---|
| 1 | 2×10^5 |
| 2 | 4×10^10 |
| 3 (most commonly used in practice) | 8×10^15 |
| 4 | 16×10^20 |
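To make the counting estimate above concrete, here is a hedged toy sketch of a maximum-likelihood bigram model (n = 2); the corpus and function names are mine, not taken from the references:

```python
from collections import Counter

def bigram_probability(corpus, w_prev, w):
    """MLE bigram estimate: count(w_prev, w) / count(w_prev)."""
    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))
    if unigrams[w_prev] == 0:
        return 0.0
    return bigrams[(w_prev, w)] / unigrams[w_prev]

corpus = "the cat sat on the mat the cat slept".split()
print(bigram_probability(corpus, "the", "cat"))  # 2 counts of "the cat" / 3 counts of "the"
```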
My understanding of the neural network model: first build an objective function $\prod_{w\in C} p(w \mid \operatorname{context}(w))$, where context(w) denotes the context words around w; in the n-gram model this would be $w_{i-n+1}^{i-1}$. Taking the log-likelihood to be maximized gives: $$ \mathcal{L} = \sum_{w \in C} \log p(w \mid \operatorname{context}(w)) $$
The likelihood is then maximized. Up to this point it is very similar to the n-gram model: both build the language model from conditional probabilities. The difference lies in how $p(w \mid \operatorname{context}(w))$ is modeled: instead of word-frequency counts, a neural network is used, $p(w \mid \operatorname{context}(w)) = F(w, \operatorname{context}(w), \theta)$, and this neural network model is where word vectors come in
(the n-gram model does not need them, since it only counts word frequencies and then stores the resulting probabilities; strictly speaking it has no parameters in the usual sense, somewhat like kNN, and the cost is mainly in storage). As for word vectors, there are two kinds of representation: 1) one-hot representation; 2) distributed representation. See reference (1) for the difference.
Here $v(w)$, $W$, $U$, $p$, $q$ are the parameters; note that $v(w)$ (the word vector of the word $w$ itself, not the word vectors of the input context!) also has to be learned through training. Finally, $y_w$ is passed through a softmax normalization to obtain the final probability: $$ p(w \mid \operatorname{context}(w)) = \frac{e^{y_{w, i_w}}}{\sum_{i=1}^{N} e^{y_{w, i}}} $$ where $i_w$ is the index of $w$ in the dictionary and $N$ is the dictionary size.
These two models (CBOW and skip-gram) are mainly used to produce distributed word vectors.
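For intuition, here is a hedged NumPy sketch of the neural-network language model forward pass described above; the layer names W, p, U, q follow the notes, while the sizes, the tanh hidden layer, and the context length are assumptions of mine:

```python
import numpy as np

rng = np.random.default_rng(0)

V, m, h = 1000, 50, 100      # vocabulary size, word-vector dim, hidden size (assumed)
n_context = 4                # number of context words (assumed)

# parameters: word vectors v(w) plus the network weights W, p, U, q
v = rng.normal(size=(V, m))                  # one word vector per vocabulary word
W = rng.normal(size=(h, n_context * m))
p = np.zeros(h)
U = rng.normal(size=(V, h))
q = np.zeros(V)

def nnlm_probs(context_ids):
    """p(w | context) for every w: concatenate context vectors, tanh hidden layer, softmax."""
    x = v[context_ids].reshape(-1)           # x_w: concatenation of the context word vectors
    z = np.tanh(W @ x + p)                   # hidden layer
    y = U @ z + q                            # y_w: one score per vocabulary word
    e = np.exp(y - y.max())                  # softmax normalization
    return e / e.sum()

probs = nnlm_probs([3, 17, 8, 42])
print(probs.shape, probs.sum())              # (1000,) and ~1.0
```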
This paper is a modification of the `language prior [12]` paper; the main change is the addition of a `spatial distribution`, which covers position, size, shape, and distance relations and forms a region-model component. This module is added to the C(*) objective function and to the regularization term. The contribution is not large, and the experimental results are not particularly good either. Although the paper cites some statistics over the data, those statistics only reflect the majority of cases; for unusual relations, I suspect this kind of prior statistical knowledge has a negative effect.
Using weakly labeled web data, the paper builds a visual n-gram model relating language and images; it can predict phrases relevant to the content of an image. The main contribution is the loss function, which originates from the n-gram model in NLP.
About the loss function: given an image I, assign a likelihood p(w|I) to each possible phrase (n-gram) w. "We develop a novel, differentiable `loss function` that optimizes trainable parameters for `frequent n-grams`, whereas for `infrequent n-grams`, the loss is dominated by the predicted likelihood of smaller `"sub-grams"`."
learning from weakly supervised web data
This paper uses the same weakly supervised training data as [28], but unlike [28], it considers not just single words but n-grams. The data come from an image-sharing website, in the form of image-comment pairs.
Relating image content and language
Instead of an RNN, a bilinear model is used; it can likewise output phrase probabilities for a given image and combine the relevant phrases into a caption. What sets this paper apart from similar work is that it can handle a large number of visual concepts, rather than being limited to comment content from Flickr-like datasets, which makes it more applicable to real problems. It is most closely related to [40], but uses an end-to-end, weakly supervised training method.
Language models
An n-gram language model is used, with Jelinek-Mercer smoothing [26].
Whereas n-gram models count the frequency of n-grams in a text corpus to produce a distribution over phrases or sentences, our model measures phrase likelihoods by evaluating inner products between image features and learned parameter vectors.
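A hedged sketch of that scoring scheme (names and dimensions are mine, not from the paper): each in-dictionary n-gram gets a learned embedding column, and the unsmoothed likelihood of an n-gram given image features comes from a softmax over the inner products.

```python
import numpy as np

rng = np.random.default_rng(0)

feat_dim, dict_size = 512, 10000             # image-feature dim and dictionary size (assumed)
E = rng.normal(size=(feat_dim, dict_size))   # one learned column per in-dictionary n-gram

def phrase_likelihoods(image_features):
    """Softmax over inner products between image features and n-gram embeddings."""
    scores = image_features @ E              # one score per n-gram in the dictionary
    scores -= scores.max()                   # numerical stability
    e = np.exp(scores)
    return e / e.sum()

phi = rng.normal(size=feat_dim)              # stand-in for convnet features of one image
p = phrase_likelihoods(phi)
top5 = np.argsort(-p)[:5]                    # indices of the 5 most likely n-grams
print(top5, p[top5])
```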
Training data: the YFCC100M dataset, with the user comments serving as weak labels.
Comments: "We applied a simple language detector to the dataset to select only images with English user comments, leaving a total of 30 million examples for training and testing. We preprocessed the text by removing punctuation, and we added [BEGIN] and [END] tokens at the beginning and end of each sentence."
Images: rescaling them to 256×256 pixels (using bicubic interpolation), cropping the central 224×224, subtracting the mean pixel value of each image, and dividing by the standard deviation of the pixel values.
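A hedged PIL/NumPy sketch of that preprocessing pipeline; the per-image mean/std normalization follows the description above, and the file path is a placeholder:

```python
import numpy as np
from PIL import Image

def preprocess(path):
    """Rescale to 256x256 (bicubic), center-crop 224x224, then normalize
    by the image's own mean and standard deviation, as described above."""
    img = Image.open(path).convert("RGB").resize((256, 256), Image.BICUBIC)
    arr = np.asarray(img, dtype=np.float32)
    top = (256 - 224) // 2
    crop = arr[top:top + 224, top:top + 224, :]          # central 224x224 crop
    return (crop - crop.mean()) / (crop.std() + 1e-8)    # per-image normalization

# x = preprocess("example.jpg")   # placeholder path
```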
Training dictionary: English phrases, from 1-grams up to 5-grams.
The smoothed visual n-gram models are trained and evaluated on all n-grams in the dataset, even if these n-grams are not in the dictionary. However, whereas the probability of in-dictionary n-grams is primarily a function of parameters that are specifically tuned for those n-grams, the probability of out-of-dictionary n-grams is composed from the probability of smaller in-dictionary n-grams (details below). In other words, the probability of a phrase that is not in the dictionary is computed from its sub-phrases, which are guaranteed to be in the dictionary.
We denote the n-gram dictionary that the model uses by $D$, and a comment containing K words by $w \in [1, C]^K$, where C is the total number of words in the (English) language. The n-gram that ends at the i-th word of comment w is denoted $w_{i-n+1}^i$, and the i-th word of comment w is $w_i^i$. The sum over all image-comment pairs in the training/test data is omitted when writing the loss functions (that is, the loss functions below are written for a single comment on a single image). (In short: the n-gram dictionary of the training data is $D$, each image comment has a fixed length K, and a comment is composed of words from the language vocabulary of size C.) Because the naive model is not a conditional probability, it cannot serve as a language model, so a back-off model [6] is added. This builds an n-gram language model over every word of the comment; the resulting loss effectively supplies the weak supervision, namely the language model derived from user comments:
This avoids the two drawbacks of the naive loss. The Jelinek-Mercer smoothing proposed here is differentiable with respect to E and $\theta$, so the loss can be back-propagated through the convolutional network.
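For reference, here is a hedged sketch of plain Jelinek-Mercer smoothing on text counts; the interpolation weight λ and the counting interface are mine, and the paper applies the same interpolation idea to the image-conditioned likelihoods rather than raw corpus counts:

```python
from collections import Counter

def jelinek_mercer(ngram, counts, lam=0.5):
    """p_JM(w_i | history) = lam * p_ML(w_i | history) + (1 - lam) * p_JM(w_i | shorter history).
    `counts` maps word tuples to corpus frequencies; the recursion bottoms out at unigrams."""
    if len(ngram) == 1:
        total = sum(c for k, c in counts.items() if len(k) == 1)
        return counts[ngram] / total if total else 0.0
    history = ngram[:-1]
    p_ml = counts[ngram] / counts[history] if counts[history] else 0.0
    return lam * p_ml + (1 - lam) * jelinek_mercer(ngram[1:], counts, lam)

# toy counts built from a tiny corpus
corpus = "the cat sat on the mat the cat slept".split()
counts = Counter()
for n in (1, 2, 3):
    counts.update(zip(*[corpus[i:] for i in range(n)]))
print(jelinek_mercer(("the", "cat"), counts))
```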
The forward-backward pass is only performed for a random subset (formed by all positive n-grams in the batch) of the columns of E, because updating all of them would make the output too large.
Phrase-level image tagging
Given an image, output phrases relevant to the image content (Table 2); the paper also analyzes the complexity of the models (Table 1).
This part actually only performs multi-class classification, not multi-label classification;
For the results shown in Figure 1: "We predict n-gram phrases for images by outputting the n-grams with the highest calibrated log-likelihood score for an image." A relevance score with the image is computed for every n-gram in the dictionary, and the four highest-scoring phrases are taken; the score is computed from the language model, i.e., from $p(w_i^i \mid w_{i-n+1}^{i-1})$.
The table below shows results on the test set. "We define recall@k as the average percentage of n-grams appearing in the comment that are among the k front-ranked n-grams when the n-grams are sorted according to their score under the model." As a baseline, a linear multi-class classifier over n-grams is considered (i.e., using the naive n-gram loss).
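A hedged per-image sketch of that recall@k definition (the paper averages this quantity over the test set; the scoring function is abstracted away and all names are mine):

```python
import numpy as np

def recall_at_k(comment_ngrams, all_ngrams, scores, k):
    """Fraction of the comment's n-grams that appear among the k highest-scoring
    n-grams when all dictionary n-grams are ranked by their model score."""
    ranked = [all_ngrams[i] for i in np.argsort(-np.asarray(scores))[:k]]
    hits = sum(1 for g in comment_ngrams if g in ranked)
    return hits / len(comment_ngrams) if comment_ngrams else 0.0

# toy example with a 5-entry dictionary and made-up scores
dictionary = ["city park", "park city", "street market", "market street", "dog"]
scores = [0.9, 0.1, 0.7, 0.2, 0.4]
print(recall_at_k(["city park", "dog"], dictionary, scores, k=2))  # -> 0.5
```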
phrase-based image retrieval
Experiments show that the model distinguishes visual concepts well:
The model has learned accurate visual representations for n-grams such as "Market Street" and "street market", as well as for "city park" and "Park City"; it is also able to distinguish visual concepts related to Washington, namely the state, the city, the baseball team, and the hockey team.
relating images and captions
That is, retrieving relevant captions for a given image; note that this paper uses data with much larger vocabularies than the baseline models.
Naturally, on these two tasks the model cannot match work that is dedicated to retrieval.
zero-shot transfer
1) The performance of our models is particularly good on common classes, such as those in the aYahoo dataset, for which many examples are available in the YFCC100M dataset.
2) The performance of our models is worse on datasets that involve fine-grained classification, such as ImageNet, for instance because YFCC100M contains few examples of specific, uncommon dog breeds.