
Paper Reading: PPR-FCN

2018-07-02
Narcissus

1. Main Idea

  • Weakly supervised visual relation detection (WSVRD): the weak supervision here refers to two aspects. The first is weakly supervised object detection, where no bounding boxes are available as labels during training. At first I thought this made the object detection unsupervised, and I did not see why it was called weakly supervised. (Explanation: after reading the WSDDN paper, I understood the notion of weak supervision here: although there are no bounding boxes, a pretrained CNN + a region proposal mechanism + SPP can produce rough object regions, and training with the image-level annotations then supplies a weak supervisory signal.)

"At training time, we discarded the object bounding boxes to conform with the weakly supervised setting"; so presumably neither the WSOD nor the WSPP stage uses bounding-box labels.

  • Then why is it described as image-level classification/annotation? Because the annotation carries very little information: if an object appears in an image, the image is labeled positive for that class, otherwise negative. Classification then has to find the region proposal corresponding to each object under this image-level annotation.

  • The network is split into two modules, both weakly supervised: one performs object detection, the other relation classification.

Figure 2. The overview of the proposed PPR-FCN architecture for WSVRD. It has two modules: WSOD for object detection (Section 3.1) and WSPP for predicate prediction (Section 3.2); each module is composed of a pair selection branch and a classification branch.

2. Three Difficulties of WSVRD

  • 1) Lack of spatial constraints: e.g., for person-wear-hat, the predicted hat may sit on someone else's head;
  • 2) It is more prone to getting stuck in suboptimal solutions than WSOD [21];
  • 3) If WSVRD used fully connected subnetworks, the computation over all region pairs would be very large (see the quick count below).
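A quick back-of-the-envelope count (my own numbers, not from the paper) shows why 3) bites: $N$ region proposals yield $N(N-1)$ ordered subject-object pairs, so $N = 100$ already means $100 \cdot 99 = 9{,}900$ pair features to push through an fc subnetwork per image, which is what motivates a fully convolutional, shared-computation design.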

3. The WSOD Module Adopts WSDDN [3]

See the separate note on WSDDN.

4. The WSPP Module Uses Two Subnetworks

One is used to select relation pairs, the other to classify the predicate:

$$S_r(P_i, P_j) = S^{sel}_r(P_i, P_j) \cdot S^{cls}_r(P_i, P_j)$$

i.e., the product of a (pair) selection branch and a (classification) branch. $S^{sel}_r$ is softmax-normalized over all possible region pairs with respect to a predicate class, i.e., $S^{sel}_r(P_i, P_j) \leftarrow \operatorname{softmax}_{(i,j)} S^{sel}_r(P_i, P_j)$; while $S^{cls}_r$ is softmax-normalized over possible predicate classes for a region pair, i.e., $S^{cls}_r(P_i, P_j) \leftarrow \operatorname{softmax}_{r} S^{cls}_r(P_i, P_j)$. Note that such normalizations assign different objectives to the two branches and hence they are unlikely to learn redundant models [3]. Essentially, the normalized selection score can be considered as a soft-attention mechanism used in weakly supervised vision tasks [35, 5] to determine the likely RoIs.
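A minimal numpy sketch of this two-branch normalization, assuming the branch outputs are stacked as (num_pairs, R) arrays (the shapes and names are mine; the paper only fixes the two softmax axes and the product):

```python
import numpy as np

def predicate_scores(sel_logits, cls_logits):
    """Combine the selection and classification branches.

    sel_logits, cls_logits: (num_pairs, R) raw branch outputs for every
    candidate region pair and every predicate class (shapes assumed).
    Returns S_r(P_i, P_j) = S_r^sel(P_i, P_j) * S_r^cls(P_i, P_j).
    """
    # Selection branch: softmax over all region pairs, per predicate class,
    # i.e. a soft attention over pairs for each predicate.
    sel = np.exp(sel_logits - sel_logits.max(axis=0, keepdims=True))
    sel /= sel.sum(axis=0, keepdims=True)

    # Classification branch: softmax over predicate classes, per region pair.
    cls = np.exp(cls_logits - cls_logits.max(axis=1, keepdims=True))
    cls /= cls.sum(axis=1, keepdims=True)

    return sel * cls
```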

Before the softmax, how are these two scores computed?

  • Important: Position-Sequence-Sensitive Score Map

Two factors are mainly taken into account: position, and the order of the (subject, object) pair:

Figure 3. Illustrations of pairwise RoI pooling with $k^2$ ($k = 3$) spatial grids and $R$ predicates. Left: subject/object pooling for a single RoI. Right: joint pooling. For each score map, $k^2$ colors represent different position channels, and each color channel has $R$ predicate channels. For the joint pooling, uncolored pooling results indicate zero and back-propagation is disabled through these grids. Note that the score maps in subject/object pooling and joint pooling are different, i.e., there are $2 \cdot 2 \cdot k^2 R$ conv-filters.

  • Pairwise RoI pooling (to implement the above idea, the following pooling scheme is designed; the pair score decomposes as:)

$$s_r(P_i, P_j) = s^{sub}_r(P_i) + s^{obj}_r(P_j) + s^{joint}_r(P_i, P_j)$$

  • 1) Subject/object pooling: mainly determines which of the two regions is the subject and which is the object. Taking the subject score as an example, the paper says this design is position-sensitive, which I do not fully understand:

$$s^{sub}_r(P) = \operatorname{vote}_{(i,j)}\Big(\operatorname{pool}_{(x,y) \in g(i,j)} X^{sub}_{r,i,j}(x,y)\Big) \quad (3)$$

where $k = 3$, pool(·) is mean pooling, and vote(·) is average voting (e.g., average pooling over the scores of the grids). $s^{sub}_r(P)$ is position-sensitive because Eq. (3) aggregates the responses for each spatial grid of the RoI from the corresponding one of the $k^2$ maps (e.g., in Figure 3 left, the dark red value pooled from the top-left grid of the RoI) and then votes over all the spatial grids. Therefore, the training will shepherd the $k^2 R$ subject filters to capture subject position in an image.
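To make the position-sensitivity concrete, here is a rough numpy sketch of Eq. (3) (the score-map channel layout and all names are my own assumptions): each grid cell of the RoI is pooled only from the channel block dedicated to that cell's position, and the $k^2$ cell responses are then averaged as the vote.

```python
import numpy as np

def subject_pooling(score_map, roi, k=3):
    """Position-sensitive pooling in the spirit of Eq. (3).

    score_map: (k*k*R, H, W) conv output; the channel block
               (i*k+j)*R : (i*k+j+1)*R is dedicated to grid cell (i, j)
               (layout assumed, not from the paper).
    roi: (x0, y0, x1, y1) in integer feature-map coordinates.
    Returns an (R,) vector: one subject score per predicate.
    """
    x0, y0, x1, y1 = roi
    R = score_map.shape[0] // (k * k)
    cell_scores = np.zeros((k * k, R))
    for i in range(k):          # grid rows
        for j in range(k):      # grid columns
            # Boundaries of grid cell g(i, j) inside the RoI (>= 1 pixel).
            gy0 = y0 + i * (y1 - y0) // k
            gy1 = max(y0 + (i + 1) * (y1 - y0) // k, gy0 + 1)
            gx0 = x0 + j * (x1 - x0) // k
            gx1 = max(x0 + (j + 1) * (x1 - x0) // k, gx0 + 1)
            # pool(.): mean over the cell, read only from this cell's channels.
            c = (i * k + j) * R
            cell_scores[i * k + j] = score_map[c:c + R, gy0:gy1, gx0:gx1].mean(axis=(1, 2))
    # vote(.): average the k^2 positional responses.
    return cell_scores.mean(axis=0)
```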

  • 2) Joint pooling: to capture the relative position of subject and object. Unlike 1), the grids are laid over the joint region of the pair:

$$s^{joint}_r(P_i, P_j) = \operatorname{vote}_{(i',j')}\Big(\operatorname{pool}_{(x,y) \in g(i',j') \cap P_i} X^{s}_{r,i',j'}(x,y) + \operatorname{pool}_{(x,y) \in g(i',j') \cap P_j} X^{o}_{r,i',j'}(x,y)\Big) \quad (4)$$

where $g(i',j') \cap P_i$ denotes the intersected pixels between grid $g(i',j')$ and $P_i$; in particular, if $g(i',j') \cap P_i = \emptyset$, pool(·) is zero and the gradient is not back-propagated. We set $k = 3$, pool(·) to average pooling, and vote(·) to average voting. For example, for the relation person-ride-bike, the pooling result of the person RoI is usually zero at the lower grids of the joint RoI while that of the bike RoI is usually zero at the upper grids.
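A companion sketch of the joint pooling in Eq. (4), under the same assumed channel layout: the grid is laid over the joint region, and each RoI contributes only in the cells it actually intersects; empty intersections pool to zero.

```python
import numpy as np

def _masked_cell_mean(score_map, channel, cell, roi):
    """pool(.) of Eq. (4): mean of one channel over cell ∩ RoI, zero if the
    intersection is empty (in training, no gradient flows through those)."""
    x0 = max(cell[0], roi[0]); y0 = max(cell[1], roi[1])
    x1 = min(cell[2], roi[2]); y1 = min(cell[3], roi[3])
    if x0 >= x1 or y0 >= y1:
        return 0.0
    return float(score_map[channel, y0:y1, x0:x1].mean())

def joint_pooling(sub_map, obj_map, roi_s, roi_o, k=3):
    """Joint pooling over the joint region of a subject/object RoI pair."""
    # Joint region: the tight box covering both RoIs.
    jx0 = min(roi_s[0], roi_o[0]); jy0 = min(roi_s[1], roi_o[1])
    jx1 = max(roi_s[2], roi_o[2]); jy1 = max(roi_s[3], roi_o[3])
    R = sub_map.shape[0] // (k * k)
    cell_scores = np.zeros((k * k, R))
    for i in range(k):
        for j in range(k):
            cell = (jx0 + j * (jx1 - jx0) // k, jy0 + i * (jy1 - jy0) // k,
                    jx0 + (j + 1) * (jx1 - jx0) // k, jy0 + (i + 1) * (jy1 - jy0) // k)
            for r in range(R):
                c = (i * k + j) * R + r
                cell_scores[i * k + j, r] = (
                    _masked_cell_mean(sub_map, c, cell, roi_s)
                    + _masked_cell_mean(obj_map, c, cell, roi_o))
    return cell_scores.mean(axis=0)   # vote(.): average over the k^2 cells
```

For person-ride-bike, the person RoI misses the lower cells of the joint region, so those cells contribute zero on the subject side; that asymmetric zero pattern is exactly the relative-position cue this design encodes.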

5. Loss Function

Following [34], a multi-task loss is used:

$$\mathcal{L} = \mathcal{L}_{img} + \mathcal{L}_{pred} + \alpha\,\mathcal{L}_{reg}$$

where $\alpha$ is empirically set to 0.2. We train the PPR-FCN model by SGD with momentum [19].
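Read this way, the overall objective is just a weighted sum of the three parts defined in this section; a trivial sketch (function and argument names are mine):

```python
def total_loss(loss_img, loss_pred, loss_reg, alpha=0.2):
    # WSOD image-level loss + WSPP predicate loss + smoothness regularizer,
    # with the regularizer weighted by alpha = 0.2 as in the paper.
    return loss_img + loss_pred + alpha * loss_reg
```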

  • 1) WSOD loss (an image-level class-score loss):

$$\mathcal{L}_{img} = -\sum_{c}\Big(\mathbb{1}[c \text{ present}]\log s_c + \mathbb{1}[c \text{ absent}]\log(1 - s_c)\Big) \quad (5)$$

Following [10], spatial relations are also taken into account (the loss above cannot guarantee spatial smoothness):

As in weakly supervised detection [10], we regularize the smoothness as: 1) for each foreground class $c \in \{1, ..., C\}$, the top high-scored regions and their neighborhoods with IoU ≥ 0.5 should both have high scores; we consider them as the pseudo positive regions; and 2) the neighborhoods of the pseudo positive regions with 0.1 ≤ IoU < 0.5 should be pseudo background regions ($c = 0$). In this way, the spatial smoothness regularization loss is:

$$\mathcal{L}_{reg} = -\sum_{c\in\mathcal{C}}\sum_{i\in\mathcal{C}_c}\log s_{i,c} - \sum_{i\in\mathcal{B}}\log s_{i,0}$$

where $\mathcal{C}_c$ is the set of pseudo positive regions for class $c \neq 0$, and $\mathcal{B}$ is the set of pseudo background regions. We follow a similar sampling strategy as in [34]: 256 regions are sampled, where at least 75% are the pseudo background regions.
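A sketch of how these pseudo labels could be assigned (my own reading of the rule above; the IoU thresholds come from the text, while `top_n` and the helper names are hypothetical):

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / float(area(a) + area(b) - inter)

def pseudo_regions(boxes, scores_c, top_n=5):
    """Split regions into pseudo positives / pseudo background for one class.

    boxes: list of (x0, y0, x1, y1); scores_c: np.array of per-region scores
    for class c; top_n: how many top regions to seed with (assumed value).
    """
    top = np.argsort(scores_c)[::-1][:top_n]          # top high-scored regions
    pos, bg = set(top.tolist()), set()
    for i, b in enumerate(boxes):
        best = max(iou(b, boxes[t]) for t in top)
        if best >= 0.5:
            pos.add(i)            # neighbor of a top region: pseudo positive
        elif 0.1 <= best < 0.5:
            bg.add(i)             # ring around the positives: pseudo background
    return sorted(pos), sorted(bg - pos)
```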

  • 2) WSPP loss:

WSPP Loss. Suppose $\mathcal{R}$ is the set of the image-level relation ground-truth triplets, specifically $(s, r, o) \in \mathcal{R}$, where $s, o \in \{1, ..., C\}$ are the labels of the subject and object. Suppose $\mathcal{C}_s$ and $\mathcal{C}_o$ are the region sets of subject $s$ and object $o$, respectively. Denote $S_r = \sum_{i\in\mathcal{C}_s}\sum_{j\in\mathcal{C}_o} S_r(P_i, P_j)$ as the image-level predicate score; the image-level predicate prediction loss is defined as:

$$\mathcal{L}_{pred} = -\sum_{r}\Big(\mathbb{1}[(s,r,o)\in\mathcal{R}]\log S_r + \mathbb{1}[(s,r,o)\notin\mathcal{R}]\log(1 - S_r)\Big) \quad (7)$$
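A small numpy sketch of Eq. (7) under this reading (the pair-score tensor layout and names are assumptions): pair scores over the ground-truth subject/object region sets are summed into an image-level predicate score, which is then hit with a binary log loss.

```python
import numpy as np

def wspp_loss(S_pair, subj_regions, obj_regions, gt_predicates):
    """Image-level predicate loss for one (s, o) ground-truth pair.

    S_pair: (num_regions, num_regions, R) pair scores S_r(P_i, P_j).
    subj_regions / obj_regions: region indices of subject s / object o.
    gt_predicates: set of predicate ids r with (s, r, o) in the ground truth.
    """
    # Image-level predicate score: sum over the candidate region pairs.
    S_img = S_pair[np.ix_(subj_regions, obj_regions)].sum(axis=(0, 1))  # (R,)
    S_img = np.clip(S_img, 1e-8, 1 - 1e-8)   # numerical safety (my addition)
    loss = 0.0
    for r in range(S_img.shape[0]):
        if r in gt_predicates:
            loss -= np.log(S_img[r])          # predicate present in the image
        else:
            loss -= np.log(1.0 - S_img[r])    # predicate absent
    return loss
```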

6. Experiments

  • 1) Object detection results: the improvement mainly comes from the fully convolutional network; moreover, WSOD often ends up discovering the discriminative parts of objects.

  • 2) Predicate prediction results: they show that the network accounting for position and sequence is better, and that fc-based methods outperform pooling-based methods; this paper adopts the latter to reduce computation.

Table 2. Predicate prediction performances (R@50 / R@100) of various methods on VRD and VG, comparing variants such as Pos+Pairwise, Pos+JointBox, PosSeq+Pairwise+fc, and VTransE; the last two rows are fc-based methods.

  • 3) Phrase and relation detection results

Supervised-PPR-FCN outperforms VTransE thanks to its pairwise RoI pooling, which exploits useful spatial context; it is also faster than VTransE and has fewer parameters.

7. Implementation

Data:

VG is annotated by crowd workers and thus the relation labels are noisy, e.g., free-form language and typos. Therefore, we used the pruned version provided by Zhang et al. [48].

