
Paper Reading: VTransE

2018-09-05
Narcissus

A weakness of this paper is that it only uses information from a single pair of objects.

Motivation: the paper is inspired by knowledge graphs, where a translation vector represents the relation between entities (see the TransE family of knowledge representation methods).

From «https://www.jianshu.com/p/6d6132113fa5»

1. Main Contributions

  • 1. A purely visual model; detection + relation prediction, "end-to-end" (not truly end-to-end in practice: the detector is never updated; during training, detections with boxes and class distributions are produced beforehand by Faster R-CNN + ResNet-101 and saved as .npz files, and the relation network is then trained on them). Even so, the paper does more than append a relation classifier to Faster R-CNN: it introduces knowledge transfer, builds a dedicated feature extraction network, and the design plugs easily into other object detection networks.

  • 2. Translation embedding: models visual relations by mapping the features of objects and predicates into a low-dimensional space.

  • 3. Knowledge transfer in relations: object detection and relation detection influence each other, so the paper sets up knowledge transfer between objects and predicates. It designs a special feature extraction layer that extracts three kinds of object features:

    1) class probabilities
    2) bounding box (location and scale)
    3) ROI visual feature

At differentiable coordinate points, bilinear interpolation replaces ROI pooling. Confidence, location, and scale are precisely the knowledge passed between objects and predicates.

  • 4. Outperforms several other strong baselines.

The paper's main contribution really comes down to three points:

  1. Feature selection: extract three features each for the subject and the object (class, box, bilinear visual feature).
  2. Map the subject and object features into a relation space.
  3. Loss design: the loss simply makes the mapped features of a related subject-object pair, after translation, as close as possible in the relation space, as written out below.
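Put compactly (a restatement in the paper's notation, defined fully in Section 3 below: $x_s, x_o \in \mathbb{R}^M$ are the subject and object features, $W_s, W_o$ project them into a low-dimensional relation space, and $t_p$ is the translation vector of predicate $p$), points 2 and 3 say that training should make

$$W_s x_s + t_p \approx W_o x_o$$

hold for every annotated (subject, predicate, object) triple, i.e., drive $\lVert W_s x_s + t_p - W_o x_o \rVert$ toward zero.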

2. Network Design

[Figure 3. The VTransE network overview. An input image first passes through the Object Detection Module, a convolutional localization network that outputs a set of detected objects. Every pair of objects is then fed into the Relation Prediction Module for feature extraction (classeme, location, visual) and visual translation embedding. In particular, the visual feature of an object is smoothly extracted from the last convolutional feature map using bilinear interpolation; the module concatenates the features and takes the element-wise difference between subject and object.]

Idea: could the missing relation annotations be filled in during training, strengthening training while detection and relation prediction reinforce each other? It is worth checking whether existing solutions to incomplete annotation could be applied here, because in practice an image contains very many objects, and only some of the relations between them are annotated, far from exhaustively.

3. Pipeline: object detection -> feature extraction -> visual translation embedding -> softmax

Feature Extraction:

Input: the detection results from Faster R-CNN: object class scores, object locations, and the feature map of VGG-16's last convolutional layer.

Intermediate outputs:

  • 1) Class scores: an (N+1)-d vector (N classes plus background).
  • 2) Object location (4-d): not just the raw (x, y, w, h); a scale-invariant parametrization is used:

$$t_x = \frac{x - x'}{w'}, \qquad t_y = \frac{y - y'}{h'}, \qquad t_w = \log\frac{w}{w'}, \qquad t_h = \log\frac{h}{h'}$$

where $(x, y, w, h)$ is the object's own box and $(x', y', w', h')$ is the box of the other object in the pair.

See [12] for the rationale behind this parametrization; it is particularly helpful for prepositional and verb relations.
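As an illustration, here is a minimal Python sketch (not the authors' code) of this parametrization, assuming, as in the formula above, that each box is given as (x, y, w, h) and encoded relative to its counterpart's box:

import numpy as np

def location_feature(box, ref_box):
    # 4-d scale-invariant location feature of `box` relative to `ref_box`
    x, y, w, h = box
    xr, yr, wr, hr = ref_box
    tx = (x - xr) / wr       # translation, normalized by reference width
    ty = (y - yr) / hr       # translation, normalized by reference height
    tw = np.log(w / wr)      # log-space width ratio -> scale invariant
    th = np.log(h / hr)      # log-space height ratio
    return np.array([tx, ty, tw, th])

# e.g. a person box relative to a horse box in "person-ride-horse"
print(location_feature((120, 40, 60, 160), (100, 120, 200, 140)))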

  • 3) Visual feature (D-d): obtained by bilinear interpolation over the detection network's feature map, the same size as the ROI pooling feature in Faster R-CNN. Bilinear interpolation (see the paper for the exact formula) replaces the last pooling layer of VGG-16 so that feature extraction stays differentiable in the box coordinates.

The important design here really boils down to obtaining the visual feature via bilinear interpolation, sketched below.
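For intuition, a minimal NumPy sketch of bilinear sampling (a stand-in illustration, not the authors' implementation; the paper applies such sampling over a grid of points inside each box to produce a fixed-size, differentiable ROI feature):

import numpy as np

def bilinear_sample(feat, x, y):
    # Sample feature map feat (H x W x C) at fractional coordinates (x, y).
    # The result is a smooth function of (x, y), so gradients can flow back
    # to the box coordinates, unlike hard ROI pooling bins.
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1 = min(x0 + 1, feat.shape[1] - 1)
    y1 = min(y0 + 1, feat.shape[0] - 1)
    dx, dy = x - x0, y - y0
    return ((1 - dx) * (1 - dy) * feat[y0, x0] +
            dx * (1 - dy) * feat[y0, x1] +
            (1 - dx) * dy * feat[y1, x0] +
            dx * dy * feat[y1, x1])

feat = np.arange(4.0).reshape(2, 2, 1)   # tiny 2x2 feature map, C = 1
print(bilinear_sample(feat, 0.5, 0.5))   # [1.5], the mean of the 4 cells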

Final output:

$x_s, x_o \in \mathbb{R}^M$, the concatenation of the three features above, so $M = (N+1) + 4 + D$.

In short, this stage takes the object detection results and extracts further features from them, constructing more useful features on top of the existing ones.

Visual Translation Embedding:

Input: $x_s, x_o \in \mathbb{R}^M$

Relation translation vector: $t_p \in \mathbb{R}^r$ in a low-dimensional relation space ($r < M$), trained so that

$$W_s x_s + t_p \approx W_o x_o$$

Here the parameter $W_o$ maps objects into the relation space (and $W_s$ does the same for subjects); the relation translation vector ties the two together (equivalently, one can view it as choosing a suitable $t_p$ that maps one onto the other).
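A NumPy sketch of how scoring then works at prediction time (all names and weights below are illustrative placeholders, not learned parameters): each predicate p is scored by how well its t_p explains the subject-to-object displacement in the relation space, which is exactly what the TensorFlow code in Section 5 computes with a fully connected layer on the feature difference.

import numpy as np

# Hypothetical shapes: Ws, Wo are (r, M) projections; row p of T is t_p.
rng = np.random.default_rng(0)
M, r, num_predicates = 512, 100, 70
Ws, Wo = rng.normal(size=(r, M)), rng.normal(size=(r, M))
T = rng.normal(size=(num_predicates, r))

def predicate_scores(x_s, x_o):
    # score_p = t_p . (Wo x_o - Ws x_s): large when the projected subject,
    # translated by t_p, lands near the projected object
    return T @ (Wo @ x_o - Ws @ x_s)

x_s, x_o = rng.normal(size=M), rng.normal(size=M)
print(int(predicate_scores(x_s, x_o).argmax()))  # index of best predicate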

[Figure 2. An illustration of translation embedding for learning the predicate ride. Instead of modeling from a variety of ride images, VTransE learns a consistent translation vector in the relation space regardless of the diverse appearances of the subjects and objects (e.g., person, horse, bike, elephant, motor) involved in the predicate.]

Softmax:

Constructing the loss function for the relation classification part:

The straightforward metric-learning loss (Eq. (2) in the paper, with negative sampling) has a flaw: if the annotation is incomplete, it easily samples false negatives, hence the following improvement (quoting the paper):

However, unlike the relations in a knowledge base, which are generally facts (e.g., Alan Turing-bornIn-London), visual relations are volatile to specific visual examples; e.g., the validity of car-taller-person depends on the heights of the specific car and person in an image, resulting in problematic negative sampling if the relation annotation is incomplete. Instead, we propose to use a simple yet efficient softmax for prediction loss that only rewards the deterministically accurate predicates, but not the agnostic object compositions of specific examples:

$$\mathcal{L}_{rel} = \sum_{(s,p,o)} -\log \operatorname{softmax}\left( t_p^{\top} (W_o x_o - W_s x_s) \right) \tag{3}$$

where the softmax is computed over $p$. Although Eq. (3) learns a rotational approximation of the translation model in Eq. (1), the translational property can be retained by proper regularization such as weight decay [31]. In fact, if the annotation of the training samples is complete, VTransE works with the softmax (Eq. (3)) and negative-sampling metric learning (Eq. (2)) interchangeably. The final score for relation detection is the sum of the object detection score and the predicate prediction score from Eq. (3).

4. Final End-to-End Loss

The total training objective combines the object detection loss with the relation prediction loss of Eq. (3):

$$\mathcal{L} = \mathcal{L}_{det} + \mathcal{L}_{rel}$$

5. Implementation Details and Experimental Results

The relation prediction module in the middle is implemented mainly by the following code:

# Assumed imports for this snippet (TF1-style code using slim);
# `cfg` is the project's config object, defined elsewhere in the repo.
import tensorflow as tf
import tensorflow.contrib.slim as slim

def build_rd_network(self):
    # Spatial (location) features of subject and object
    sub_sp_info = self.sub_sp_info
    ob_sp_info = self.ob_sp_info
    # Class probability (classeme) features
    sub_cls_prob = self.predictions['sub_cls_prob']  # subject
    ob_cls_prob = self.predictions['ob_cls_prob']    # object
    # fc7 visual features from the detector
    sub_fc = self.layers['sub_fc7']
    ob_fc = self.layers['ob_fc7']

    # Optionally concatenate location and classeme onto the visual feature
    if self.index_sp:
        sub_fc = tf.concat([sub_fc, sub_sp_info], axis=1)
        ob_fc = tf.concat([ob_fc, ob_sp_info], axis=1)
    if self.index_cls:
        sub_fc = tf.concat([sub_fc, sub_cls_prob], axis=1)
        ob_fc = tf.concat([ob_fc, ob_cls_prob], axis=1)

    # Project subject and object into the relation space (the W_s, W_o maps)
    sub_fc1 = slim.fully_connected(sub_fc, cfg.VTR.VG_R,
                                   activation_fn=tf.nn.relu, scope='RD_sub_fc1')
    ob_fc1 = slim.fully_connected(ob_fc, cfg.VTR.VG_R,
                                  activation_fn=tf.nn.relu, scope='RD_ob_fc1')
    # Element-wise difference W_o x_o - W_s x_s, then score each predicate
    # with a linear layer whose rows play the role of the t_p vectors
    dif_fc1 = ob_fc1 - sub_fc1
    rela_score = slim.fully_connected(dif_fc1, self.num_predicates,
                                      activation_fn=None, scope='RD_fc2')
    rela_prob = tf.nn.softmax(rela_score)
    self.layers['rela_score'] = rela_score
    self.layers['rela_prob'] = rela_prob
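The softmax loss of Eq. (3) is then a plain cross-entropy over rela_score; a minimal sketch of how it could be attached (rela_labels, the ground-truth predicate index per subject-object pair, is an assumed input not shown above):

# Eq. (3) as cross-entropy over predicates; `rela_labels` is assumed.
rela_loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=rela_labels, logits=rela_score))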
 
For the experimental results, see the paper; the experiment code on the server also runs.
