class torch.nn.BCELoss(weight=None, size_average=True)
Creates a criterion that measures the Binary Cross Entropy between the target and the output (i ranges over each class):
$$ \ell(o, t) = -\frac{1}{n}\sum_{i}\left( t[i]\,\log o[i] + (1 - t[i])\,\log(1 - o[i]) \right) $$
or in the case of the weight argument being specified:
$$ \ell(o, t) = -\frac{1}{n}\sum_{i} w[i]\left( t[i]\,\log o[i] + (1 - t[i])\,\log(1 - o[i]) \right) $$
This is used for measuring the error of a reconstruction in for example an auto-encoder. Note that the targets t[i] should be numbers between 0 and 1.
| Parameter | Description |
|---|---|
| weight (Tensor, optional) | a manual rescaling weight given to the loss of each batch element. If given, has to be a Tensor of size "nbatch". |
| size_average (bool, optional) | By default, the losses are averaged over observations for each minibatch. However, if the field size_average is set to False, the losses are instead summed for each minibatch. Default: True |
Shape:
- Input: (N, *), where * means any number of additional dimensions
- Target: (N, *), same shape as the input
Examples:
>>> import torch
>>> from torch import nn, autograd
>>> m = nn.Sigmoid()
>>> loss = nn.BCELoss()
>>> input = autograd.Variable(torch.randn(3), requires_grad=True)
>>> target = autograd.Variable(torch.FloatTensor(3).random_(2))
>>> output = loss(m(input), target)
>>> output.backward()
From «https://pytorch.org/docs/0.3.0/nn.html?highlight=bceloss#torch.nn.BCELoss»
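To connect the quoted formula with the example above, here is a minimal sketch that checks nn.BCELoss numerically against the averaged binary cross-entropy expression (written against a recent PyTorch API rather than the 0.3.0 Variable interface; tensor sizes are arbitrary):

```python
import torch
import torch.nn as nn

# Sketch: compare nn.BCELoss (mean / size_average reduction) with the manual formula
#   -1/n * sum_i ( t[i] * log(o[i]) + (1 - t[i]) * log(1 - o[i]) )
o = torch.sigmoid(torch.randn(3))     # predicted probabilities in (0, 1)
t = torch.empty(3).random_(2)         # binary targets in {0, 1}

manual = -(t * o.log() + (1 - t) * (1 - o).log()).mean()
builtin = nn.BCELoss()(o, t)
print(manual.item(), builtin.item())  # the two values should agree
```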
Learning framework
multi-label indicators:
- label density
- label diversity
- normalized label diversity
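As a quick illustration, here is a hedged sketch of how these three indicators are commonly computed; the notes above do not spell out the definitions, so the standard multi-label ones are assumed here:

```python
from typing import List, Set

def label_indicators(label_sets: List[Set[int]], num_labels: int):
    """Compute label density, label diversity, and normalized label diversity
    for a multi-label dataset, using the usual definitions (assumed here):
      - label density: average fraction of the label space used per example
      - label diversity: number of distinct label sets in the dataset
      - normalized label diversity: label diversity divided by dataset size
    """
    n = len(label_sets)
    density = sum(len(s) for s in label_sets) / (n * num_labels)
    diversity = len({frozenset(s) for s in label_sets})
    return density, diversity, diversity / n

# toy example: 4 examples, 5 possible labels
print(label_indicators([{0, 1}, {1}, {0, 1}, {2, 3, 4}], num_labels=5))
```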
To understand this algorithm, one first needs to know the two models inside word2vec: CBOW and the skip-gram model. For the underlying theory, these two articles explain it well: (1) "Word2Vector中的数学理解" (the mathematics behind word2vec), (2) "Deep Learning 实战之 word2vec" (word2vec in practice).
A word vector is a mathematical representation of language, so that text can be fed into algorithms for learning and prediction. Word vectors are an important building block of the `statistical language model`; the statistical language model is the foundation of all of NLP and is widely used in speech recognition, machine translation, and other tasks. A statistical language model is a model that computes `the probability of a sentence` $p(W)$: $$ p(W) = p(w_1^T) = p(w_1, \dots, w_T) $$ Bayes' rule plays an important role in statistical machine learning; using it, the statistical language model can be rewritten as: $$ p(W) = p(w_1)\, p(w_2 \mid w_1)\, p(w_3 \mid w_1^2) \cdots p(w_T \mid w_1^{T-1}) $$ Once the model is expressed through conditional probabilities, many tools become available; common models for estimating these parameters include n-gram models, neural networks, decision trees, maximum entropy, maximum-entropy Markov models, and conditional random fields. Reference (1) mainly covers the n-gram model and the neural network model.
My understanding of the n-gram model: it assumes that the probability of a word depends only on the n-1 words preceding it, i.e., an (n-1)-order Markov assumption: $$ p(w_k \mid w_1^{k-1}) \approx p(w_k \mid w_{k-n+1}^{k-1}) \approx \frac{\operatorname{count}(w_{k-n+1}^{k})}{\operatorname{count}(w_{k-n+1}^{k-1})} $$ so $p(w_k \mid w_1^{k-1})$ can be estimated from the frequency counts of word combinations. For example, for n = 2: $$ p(w_k \mid w_1^{k-1}) \approx \frac{\operatorname{count}(w_{k-1}, w_k)}{\operatorname{count}(w_{k-1})} $$ (a counting sketch follows the table below). Relation between the number of parameters and n:
| n | Number of parameters |
|---|---|
| 1 | 2×10^5 |
| 2 | 4×10^10 |
| 3 (most commonly used in practice) | 8×10^15 |
| 4 | 16×10^20 |
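To make the counting estimate above concrete, here is a hedged toy sketch of a maximum-likelihood bigram model (n = 2); the corpus and function names are mine, not taken from the references:

```python
from collections import Counter

def bigram_probability(corpus, w_prev, w):
    """MLE bigram estimate: count(w_prev, w) / count(w_prev)."""
    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))
    if unigrams[w_prev] == 0:
        return 0.0
    return bigrams[(w_prev, w)] / unigrams[w_prev]

corpus = "the cat sat on the mat the cat slept".split()
print(bigram_probability(corpus, "the", "cat"))  # 2 counts of "the cat" / 3 counts of "the"
```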
My understanding of the neural network model: first build an objective function $\prod_{w\in C} p(w \mid \operatorname{context}(w))$, where context(w) denotes the context words around w; in the n-gram model this would be $w_{i-n+1}^{i-1}$. Taking the log-likelihood to be maximized gives: $$ \mathcal{L} = \sum_{w \in C} \log p(w \mid \operatorname{context}(w)) $$
The likelihood is then maximized. Up to this point it is very similar to the n-gram model: both build the language model from conditional probabilities. The difference lies in how $p(w \mid \operatorname{context}(w))$ is modeled: instead of word-frequency counts, a neural network is used, $p(w \mid \operatorname{context}(w)) = F(w, \operatorname{context}(w), \theta)$, and this neural network model is where word vectors come in
(the n-gram model does not need them, since it only counts word frequencies and then stores the resulting probabilities; strictly speaking it has no parameters in the usual sense, somewhat like kNN, and the cost is mainly in storage). As for word vectors, there are two kinds of representation: 1) one-hot representation; 2) distributed representation. See reference (1) for the difference.
Here $v(w)$, $W$, $U$, $p$, $q$ are the parameters; note that $v(w)$ (the word vector of the word $w$ itself, not the word vectors of the input context!) also has to be learned through training. Finally, $y_w$ is passed through a softmax normalization to obtain the final probability: $$ p(w \mid \operatorname{context}(w)) = \frac{e^{y_{w, i_w}}}{\sum_{i=1}^{N} e^{y_{w, i}}} $$ where $i_w$ is the index of $w$ in the dictionary and $N$ is the dictionary size.
These two models (CBOW and skip-gram) are mainly used to produce distributed word vectors.
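For intuition, here is a hedged NumPy sketch of the neural-network language model forward pass described above; the layer names W, p, U, q follow the notes, while the sizes, the tanh hidden layer, and the context length are assumptions of mine:

```python
import numpy as np

rng = np.random.default_rng(0)

V, m, h = 1000, 50, 100      # vocabulary size, word-vector dim, hidden size (assumed)
n_context = 4                # number of context words (assumed)

# parameters: word vectors v(w) plus the network weights W, p, U, q
v = rng.normal(size=(V, m))                  # one word vector per vocabulary word
W = rng.normal(size=(h, n_context * m))
p = np.zeros(h)
U = rng.normal(size=(V, h))
q = np.zeros(V)

def nnlm_probs(context_ids):
    """p(w | context) for every w: concatenate context vectors, tanh hidden layer, softmax."""
    x = v[context_ids].reshape(-1)           # x_w: concatenation of the context word vectors
    z = np.tanh(W @ x + p)                   # hidden layer
    y = U @ z + q                            # y_w: one score per vocabulary word
    e = np.exp(y - y.max())                  # softmax normalization
    return e / e.sum()

probs = nnlm_probs([3, 17, 8, 42])
print(probs.shape, probs.sum())              # (1000,) and ~1.0
```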
This paper is a modification of the `language prior [12]` paper; the main change is the addition of a `spatial distribution`, which covers position, size, shape, and distance relations and forms a region-model component. This module is added to the C(*) objective function and to the regularization term. The contribution is not large, and the experimental results are not particularly good either. Although the paper cites some statistics over the data, those statistics only reflect the majority of cases; for unusual relations, I suspect this kind of prior statistical knowledge has a negative effect.
Using weakly labeled web data, the paper builds a visual n-gram model relating language and images; it can predict phrases relevant to the content of an image. The main contribution is the loss function, which originates from the n-gram model in NLP.
About the loss function: given an image I, assign a likelihood p(w|I) to each possible phrase (n-gram) w. "We develop a novel, differentiable `loss function` that optimizes trainable parameters for `frequent n-grams`, whereas for `infrequent n-grams`, the loss is dominated by the predicted likelihood of smaller `"sub-grams"`."
learning from weakly supervised web data
This paper uses the same weakly supervised training data as [28], but unlike [28], it considers not just single words but n-grams. The data come from an image-sharing website, in the form of image-comment pairs.
Relating image content and language
Instead of an RNN, a bilinear model is used; it can likewise output phrase probabilities for a given image and combine the relevant phrases into a caption. What sets this paper apart from similar work is that it can handle a large number of visual concepts, rather than being limited to comment content from Flickr-like datasets, which makes it more applicable to real problems. It is most closely related to [40], but uses an end-to-end, weakly supervised training method.
Language models
An n-gram language model is used, with Jelinek-Mercer smoothing [26].
Whereas n-gram models count the frequency of n-grams in a text corpus to produce a distribution over phrases or sentences, our model measures phrase likelihoods by evaluating inner products between image features and learned parameter vectors.
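A hedged sketch of that scoring scheme (names and dimensions are mine, not from the paper): each in-dictionary n-gram gets a learned embedding column, and the unsmoothed likelihood of an n-gram given image features comes from a softmax over the inner products.

```python
import numpy as np

rng = np.random.default_rng(0)

feat_dim, dict_size = 512, 10000             # image-feature dim and dictionary size (assumed)
E = rng.normal(size=(feat_dim, dict_size))   # one learned column per in-dictionary n-gram

def phrase_likelihoods(image_features):
    """Softmax over inner products between image features and n-gram embeddings."""
    scores = image_features @ E              # one score per n-gram in the dictionary
    scores -= scores.max()                   # numerical stability
    e = np.exp(scores)
    return e / e.sum()

phi = rng.normal(size=feat_dim)              # stand-in for convnet features of one image
p = phrase_likelihoods(phi)
top5 = np.argsort(-p)[:5]                    # indices of the 5 most likely n-grams
print(top5, p[top5])
```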
Training data: the YFCC100M dataset, with the user comments serving as weak labels.
Comments: "We applied a simple language detector to the dataset to select only images with English user comments, leaving a total of 30 million examples for training and testing. We preprocessed the text by removing punctuation, and we added [BEGIN] and [END] tokens at the beginning and end of each sentence."
Images: rescaling them to 256×256 pixels (using bicubic interpolation), cropping the central 224×224, subtracting the mean pixel value of each image, and dividing by the standard deviation of the pixel values.
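A hedged PIL/NumPy sketch of that preprocessing pipeline; the per-image mean/std normalization follows the description above, and the file path is a placeholder:

```python
import numpy as np
from PIL import Image

def preprocess(path):
    """Rescale to 256x256 (bicubic), center-crop 224x224, then normalize
    by the image's own mean and standard deviation, as described above."""
    img = Image.open(path).convert("RGB").resize((256, 256), Image.BICUBIC)
    arr = np.asarray(img, dtype=np.float32)
    top = (256 - 224) // 2
    crop = arr[top:top + 224, top:top + 224, :]          # central 224x224 crop
    return (crop - crop.mean()) / (crop.std() + 1e-8)    # per-image normalization

# x = preprocess("example.jpg")   # placeholder path
```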
Training dictionary: English phrases, from 1-grams up to 5-grams.
The smoothed visual n-gram models are trained and evaluated on all n-grams in the dataset, even if these n-grams are not in the dictionary. However, whereas the probability of in-dictionary n-grams is primarily a function of parameters that are specifically tuned for those n-grams, the probability of out-of-dictionary n-grams is composed from the probability of smaller in-dictionary n-grams (details below). In other words, the probability of a phrase that is not in the dictionary is computed from its sub-phrases, which are guaranteed to be in the dictionary.
We denote the n-gram dictionary that the model uses by $D$, and a comment containing K words by $w \in [1, C]^K$, where C is the total number of words in the (English) language. The n-gram that ends at the i-th word of comment w is denoted $w_{i-n+1}^i$, and the i-th word of comment w is $w_i^i$. The sum over all image-comment pairs in the training/test data is omitted when writing the loss functions (that is, the loss functions below are written for a single comment on a single image). (In short: the n-gram dictionary of the training data is $D$, each image comment has a fixed length K, and a comment is composed of words from the language vocabulary of size C.) Because the naive model is not a conditional probability, it cannot serve as a language model, so a back-off model [6] is added. This builds an n-gram language model over every word of the comment; the resulting loss effectively supplies the weak supervision, namely the language model derived from user comments:
This avoids the two drawbacks of the naive loss. The Jelinek-Mercer smoothing proposed here is differentiable with respect to E and $\theta$, so the loss can be back-propagated through the convolutional network.
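For reference, here is a hedged sketch of plain Jelinek-Mercer smoothing on text counts; the interpolation weight λ and the counting interface are mine, and the paper applies the same interpolation idea to the image-conditioned likelihoods rather than raw corpus counts:

```python
from collections import Counter

def jelinek_mercer(ngram, counts, lam=0.5):
    """p_JM(w_i | history) = lam * p_ML(w_i | history) + (1 - lam) * p_JM(w_i | shorter history).
    `counts` maps word tuples to corpus frequencies; the recursion bottoms out at unigrams."""
    if len(ngram) == 1:
        total = sum(c for k, c in counts.items() if len(k) == 1)
        return counts[ngram] / total if total else 0.0
    history = ngram[:-1]
    p_ml = counts[ngram] / counts[history] if counts[history] else 0.0
    return lam * p_ml + (1 - lam) * jelinek_mercer(ngram[1:], counts, lam)

# toy counts built from a tiny corpus
corpus = "the cat sat on the mat the cat slept".split()
counts = Counter()
for n in (1, 2, 3):
    counts.update(zip(*[corpus[i:] for i in range(n)]))
print(jelinek_mercer(("the", "cat"), counts))
```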
The forward-backward pass is only performed for a random subset (formed by all positive n-grams in the batch) of the columns of E, because updating all of them would make the output too large.
Phrase-level image tagging
Given an image, output phrases relevant to the image content (Table 2); the paper also analyzes the complexity of the models (Table 1).
This part actually only performs multi-class classification, not multi-label classification;
For the results shown in Figure 1: "We predict n-gram phrases for images by outputting the n-grams with the highest calibrated log-likelihood score for an image." A relevance score with the image is computed for every n-gram in the dictionary, and the four highest-scoring phrases are taken; the score is computed from the language model, i.e., from $p(w_i^i \mid w_{i-n+1}^{i-1})$.
The table below shows results on the test set. "We define recall@k as the average percentage of n-grams appearing in the comment that are among the k front-ranked n-grams when the n-grams are sorted according to their score under the model." As a baseline, a linear multi-class classifier over n-grams is considered (i.e., using the naive n-gram loss).
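A hedged per-image sketch of that recall@k definition (the paper averages this quantity over the test set; the scoring function is abstracted away and all names are mine):

```python
import numpy as np

def recall_at_k(comment_ngrams, all_ngrams, scores, k):
    """Fraction of the comment's n-grams that appear among the k highest-scoring
    n-grams when all dictionary n-grams are ranked by their model score."""
    ranked = [all_ngrams[i] for i in np.argsort(-np.asarray(scores))[:k]]
    hits = sum(1 for g in comment_ngrams if g in ranked)
    return hits / len(comment_ngrams) if comment_ngrams else 0.0

# toy example with a 5-entry dictionary and made-up scores
dictionary = ["city park", "park city", "street market", "market street", "dog"]
scores = [0.9, 0.1, 0.7, 0.2, 0.4]
print(recall_at_k(["city park", "dog"], dictionary, scores, k=2))  # -> 0.5
```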
phrase-based image retrieval
Experiments show that the model distinguishes visual concepts well:
The model has learned accurate visual representations for n-grams such as "Market Street" and "street market", as well as for "city park" and "Park City"; it is also able to distinguish visual concepts related to Washington, namely the state, the city, the baseball team, and the hockey team.
relating images and captions
That is, retrieving relevant captions for a given image; note that this paper uses data with much larger vocabularies than the baseline models.
Naturally, on these two tasks the model cannot match work that is dedicated to retrieval.
zero-shot transfer
1) The performance of our models is particularly good on common classes, such as those in the aYahoo dataset, for which many examples are available in the YFCC100M dataset.
2) The performance of our models is worse on datasets that involve fine-grained classification, such as ImageNet, for instance because YFCC100M contains few examples of specific, uncommon dog breeds.