
Paper Reading: VTransE

2018-09-05
Narcissus

A weakness of this paper is that it only uses information from a single pair of objects.

Motivation: the paper is inspired by knowledge graphs, where a translation vector represents the relation between entities (see the TransE family of knowledge representation methods).

From «https://www.jianshu.com/p/6d6132113fa5»

1. Main Contributions

  • 1. A purely visual model; detection + relation prediction, "end-to-end" (not truly end-to-end in practice: the detector is never updated; during training, detections with boxes and class distributions are produced beforehand by Faster R-CNN + ResNet-101 and saved as .npz files, and the relation network is then trained on them). Even so, the paper does more than append a relation classifier to Faster R-CNN: it introduces knowledge transfer, builds a dedicated feature extraction network, and the design plugs easily into other object detection networks.

  • 2. Translation embedding: models visual relations by mapping the features of objects and predicates into a low-dimensional space.

  • 3. Knowledge transfer in relations: object detection and relation detection influence each other, so the paper sets up knowledge transfer between objects and predicates. It designs a special feature extraction layer that extracts three kinds of object features:

    1) class probabilities
    2) bounding box (location and scale)
    3) ROI visual feature

At differentiable coordinate points, bilinear interpolation replaces ROI pooling. Confidence, location, and scale are precisely the knowledge passed between objects and predicates.

  • 4. Outperforms several other strong baselines.

The paper's main contribution really comes down to three points:

  1. Feature selection: extract three features each for the subject and the object (class, box, bilinear visual feature).
  2. Map the subject and object features into a relation space.
  3. Loss design: the loss simply makes the mapped features of a related subject-object pair, after translation, as close as possible in the relation space, as written out below.
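Put compactly (a restatement in the paper's notation, defined fully in Section 3 below: $x_s, x_o \in \mathbb{R}^M$ are the subject and object features, $W_s, W_o$ project them into a low-dimensional relation space, and $t_p$ is the translation vector of predicate $p$), points 2 and 3 say that training should make

$$W_s x_s + t_p \approx W_o x_o$$

hold for every annotated (subject, predicate, object) triple, i.e., drive $\lVert W_s x_s + t_p - W_o x_o \rVert$ toward zero.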

2. Network Design

[Figure 3. The VTransE network overview. An input image first passes through the Object Detection Module, a convolutional localization network that outputs a set of detected objects. Every pair of objects is then fed into the Relation Prediction Module for feature extraction (classeme, location, visual) and visual translation embedding. In particular, the visual feature of an object is smoothly extracted from the last convolutional feature map using bilinear interpolation; the module concatenates the features and takes the element-wise difference between subject and object.]

Idea: could the missing relation annotations be filled in during training, strengthening training while detection and relation prediction reinforce each other? It is worth checking whether existing solutions to incomplete annotation could be applied here, because in practice an image contains very many objects, and only some of the relations between them are annotated, far from exhaustively.

3. Pipeline: object detection -> feature extraction -> visual translation embedding -> softmax

Feature Extraction:

Input: the detection results from Faster R-CNN: object class scores, object locations, and the feature map of VGG-16's last convolutional layer.

Intermediate outputs:

  • 1) Class scores: an (N+1)-d vector (N classes plus background).
  • 2) Object location (4-d): not just the raw (x, y, w, h); a scale-invariant parametrization is used:

$$t_x = \frac{x - x'}{w'}, \qquad t_y = \frac{y - y'}{h'}, \qquad t_w = \log\frac{w}{w'}, \qquad t_h = \log\frac{h}{h'}$$

where $(x, y, w, h)$ is the object's own box and $(x', y', w', h')$ is the box of the other object in the pair.

See [12] for the rationale behind this parametrization; it is particularly helpful for prepositional and verb relations.
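As an illustration, here is a minimal Python sketch (not the authors' code) of this parametrization, assuming, as in the formula above, that each box is given as (x, y, w, h) and encoded relative to its counterpart's box:

import numpy as np

def location_feature(box, ref_box):
    # 4-d scale-invariant location feature of `box` relative to `ref_box`
    x, y, w, h = box
    xr, yr, wr, hr = ref_box
    tx = (x - xr) / wr       # translation, normalized by reference width
    ty = (y - yr) / hr       # translation, normalized by reference height
    tw = np.log(w / wr)      # log-space width ratio -> scale invariant
    th = np.log(h / hr)      # log-space height ratio
    return np.array([tx, ty, tw, th])

# e.g. a person box relative to a horse box in "person-ride-horse"
print(location_feature((120, 40, 60, 160), (100, 120, 200, 140)))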

  • 3) Visual feature (D-d): obtained by bilinear interpolation over the detection network's feature map, the same size as the ROI pooling feature in Faster R-CNN. Bilinear interpolation (see the paper for the exact formula) replaces the last pooling layer of VGG-16 so that feature extraction stays differentiable in the box coordinates.

The important design here really boils down to obtaining the visual feature via bilinear interpolation, sketched below.
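For intuition, a minimal NumPy sketch of bilinear sampling (a stand-in illustration, not the authors' implementation; the paper applies such sampling over a grid of points inside each box to produce a fixed-size, differentiable ROI feature):

import numpy as np

def bilinear_sample(feat, x, y):
    # Sample feature map feat (H x W x C) at fractional coordinates (x, y).
    # The result is a smooth function of (x, y), so gradients can flow back
    # to the box coordinates, unlike hard ROI pooling bins.
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1 = min(x0 + 1, feat.shape[1] - 1)
    y1 = min(y0 + 1, feat.shape[0] - 1)
    dx, dy = x - x0, y - y0
    return ((1 - dx) * (1 - dy) * feat[y0, x0] +
            dx * (1 - dy) * feat[y0, x1] +
            (1 - dx) * dy * feat[y1, x0] +
            dx * dy * feat[y1, x1])

feat = np.arange(4.0).reshape(2, 2, 1)   # tiny 2x2 feature map, C = 1
print(bilinear_sample(feat, 0.5, 0.5))   # [1.5], the mean of the 4 cells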

Final output:

$x_s, x_o \in \mathbb{R}^M$, the concatenation of the three features above, so $M = (N+1) + 4 + D$.

In short, this stage takes the object detection results and extracts further features from them, constructing more useful features on top of the existing ones.

Visual Translation Embedding:

Input: $x_s, x_o \in \mathbb{R}^M$

Relation translation vector: $t_p \in \mathbb{R}^r$ in a low-dimensional relation space ($r < M$), trained so that

$$W_s x_s + t_p \approx W_o x_o$$

Here the parameter $W_o$ maps objects into the relation space (and $W_s$ does the same for subjects); the relation translation vector ties the two together (equivalently, one can view it as choosing a suitable $t_p$ that maps one onto the other).
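A NumPy sketch of how scoring then works at prediction time (all names and weights below are illustrative placeholders, not learned parameters): each predicate p is scored by how well its t_p explains the subject-to-object displacement in the relation space, which is exactly what the TensorFlow code in Section 5 computes with a fully connected layer on the feature difference.

import numpy as np

# Hypothetical shapes: Ws, Wo are (r, M) projections; row p of T is t_p.
rng = np.random.default_rng(0)
M, r, num_predicates = 512, 100, 70
Ws, Wo = rng.normal(size=(r, M)), rng.normal(size=(r, M))
T = rng.normal(size=(num_predicates, r))

def predicate_scores(x_s, x_o):
    # score_p = t_p . (Wo x_o - Ws x_s): large when the projected subject,
    # translated by t_p, lands near the projected object
    return T @ (Wo @ x_o - Ws @ x_s)

x_s, x_o = rng.normal(size=M), rng.normal(size=M)
print(int(predicate_scores(x_s, x_o).argmax()))  # index of best predicate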

[Figure 2. An illustration of translation embedding for learning the predicate ride. Instead of modeling from a variety of ride images, VTransE learns a consistent translation vector in the relation space regardless of the diverse appearances of the subjects and objects (e.g., person, horse, bike, elephant, motor) involved in the predicate.]

Softmax:

Constructing the loss function for the relation classification part:

The straightforward metric-learning loss (Eq. (2) in the paper, with negative sampling) has a flaw: if the annotation is incomplete, it easily samples false negatives, hence the following improvement (quoting the paper):

However, unlike the relations in a knowledge base, which are generally facts (e.g., Alan Turing-bornIn-London), visual relations are volatile to specific visual examples; e.g., the validity of car-taller-person depends on the heights of the specific car and person in an image, resulting in problematic negative sampling if the relation annotation is incomplete. Instead, we propose to use a simple yet efficient softmax for prediction loss that only rewards the deterministically accurate predicates, but not the agnostic object compositions of specific examples:

$$\mathcal{L}_{rel} = \sum_{(s,p,o)} -\log \operatorname{softmax}\left( t_p^{\top} (W_o x_o - W_s x_s) \right) \tag{3}$$

where the softmax is computed over $p$. Although Eq. (3) learns a rotational approximation of the translation model in Eq. (1), the translational property can be retained by proper regularization such as weight decay [31]. In fact, if the annotation of the training samples is complete, VTransE works with the softmax (Eq. (3)) and negative-sampling metric learning (Eq. (2)) interchangeably. The final score for relation detection is the sum of the object detection score and the predicate prediction score from Eq. (3).

4. Final End-to-End Loss

The total training objective combines the object detection loss with the relation prediction loss of Eq. (3):

$$\mathcal{L} = \mathcal{L}_{det} + \mathcal{L}_{rel}$$

5. Implementation Details and Experimental Results

The relation prediction module in the middle is implemented mainly by the following code:

# Assumed imports for this snippet (TF1-style code using slim);
# `cfg` is the project's config object, defined elsewhere in the repo.
import tensorflow as tf
import tensorflow.contrib.slim as slim

def build_rd_network(self):
    # Spatial (location) features of subject and object
    sub_sp_info = self.sub_sp_info
    ob_sp_info = self.ob_sp_info
    # Class probability (classeme) features
    sub_cls_prob = self.predictions['sub_cls_prob']  # subject
    ob_cls_prob = self.predictions['ob_cls_prob']    # object
    # fc7 visual features from the detector
    sub_fc = self.layers['sub_fc7']
    ob_fc = self.layers['ob_fc7']

    # Optionally concatenate location and classeme onto the visual feature
    if self.index_sp:
        sub_fc = tf.concat([sub_fc, sub_sp_info], axis=1)
        ob_fc = tf.concat([ob_fc, ob_sp_info], axis=1)
    if self.index_cls:
        sub_fc = tf.concat([sub_fc, sub_cls_prob], axis=1)
        ob_fc = tf.concat([ob_fc, ob_cls_prob], axis=1)

    # Project subject and object into the relation space (the W_s, W_o maps)
    sub_fc1 = slim.fully_connected(sub_fc, cfg.VTR.VG_R,
                                   activation_fn=tf.nn.relu, scope='RD_sub_fc1')
    ob_fc1 = slim.fully_connected(ob_fc, cfg.VTR.VG_R,
                                  activation_fn=tf.nn.relu, scope='RD_ob_fc1')
    # Element-wise difference W_o x_o - W_s x_s, then score each predicate
    # with a linear layer whose rows play the role of the t_p vectors
    dif_fc1 = ob_fc1 - sub_fc1
    rela_score = slim.fully_connected(dif_fc1, self.num_predicates,
                                      activation_fn=None, scope='RD_fc2')
    rela_prob = tf.nn.softmax(rela_score)
    self.layers['rela_score'] = rela_score
    self.layers['rela_prob'] = rela_prob
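The softmax loss of Eq. (3) is then a plain cross-entropy over rela_score; a minimal sketch of how it could be attached (rela_labels, the ground-truth predicate index per subject-object pair, is an assumed input not shown above):

# Eq. (3) as cross-entropy over predicates; `rela_labels` is assumed.
rela_loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=rela_labels, logits=rela_score))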
 
For the experimental results, see the paper; the experiment code on the server also runs.
