1.主要思想
- weakly supervised visual relation detection (WSVRD)这里无监督指的是两个方面,一个方面是弱监督的物体检测,在训练过程中,没有bounding box作为标签来监督训练,其实我认为这里的物体检测是无监督的,不太懂这里为啥说是弱监督的。(解释,通过阅读WSDDN这篇文章,我理解了其中的弱监督的观念:就是说虽然没有bounding box,但是通过预训练的cnn+region proposal mechanism+spp能得到一个大致的物体区域,利用image-level的标注信息进行训练,就相当于给了弱监督的信息)
At training time, we discarded the object bounding boxes to conform with the weakly supervised setting ,应该是WSOD和WSPP两个过程都没有用bounding box的标签。
- 
    那为什么又说是image-level的分类或者标注呢?因为标注只给了很少的信息,如果一个物体出现在一张图片中,那么图像的标注就是positive,否则为negtive。而分类是要在这种image-level的标注下找到一个物体对应的region proposal。 
- 
    网络分成了两个部分,两个都是弱监督的方法,一个进行物体检测,另一个进行关系分类。 

2.WSVRD的三个难点
- 1)缺乏空间位置的限制,比如人戴帽子,帽子可能在别人头上;
- 2)相对于WSOD更容易陷入次优解[21];
- 3)WSVRD如果采用全连接网络,会有很大的计算量。
3.WSOD模块采用的是WSDDN[3]
参考另一篇WSDDN的文章:
4.WSPP模块采用了两个子网络
一个用来选取关系对,另一个用来进行关系分类:

![cation) branch. ss' I is softmax normalized over all pos-  Sible region pairs with respect to a predicate class, i.e.,  (C:\Users\nian\OneDrive\NoteBook\narcissuscyn.github.io_posts\assets\clip_image003-1547018446020.png)) 4— Pj); while S,.cls is soft-  normalized over possible predicate classes for a re-  gion pair, i.e., P)) 4— softmaxrSrClS(Pi, Note  that such normalizations assign different objectives to two  branches and hence they are unlikely to learn redundant  models [3]. Essentially, the normalized selection score can  be considered as a soft-attention mechanism used in weakly  supervised vision tasks [35, 5] to determine the likely Rols.](https://ws1.sinaimg.cn/large/007tAU6Agy1fz0d2ru63jj30hr08vabq.jpg)
在进行softmax之前,如何计算这两个分数:
 Position-sequence-Sensitive Score Map Position-sequence-Sensitive Score Map
主要考虑到两方面的因素:position和pair对的顺序问题:

- Pairwise roi pooling(为了实现以上思想,设计了以下pooling方案:)

- 1)subject、object pooling:主要判断两个物体哪个是subject,哪个是object,以subject的分数计算为例,文章说这种设计是position-sensitive的,我不是很懂:

- 2)joint pooling 为了捕捉subject和object的相对位置关系,与1)不同的是grids是建立在joint region的

5.损失函数
Follow[34],Multi-task loss:
![reg  mg  tm g  (C:\Users\nian\OneDrive\NoteBook\narcissuscyn.github.io_posts\assets\clip_image009-1547018446020.png)  where a is empirically set to 0.2. We train the PPR-FCN  model by SGI) with momentum [19].](https://ws1.sinaimg.cn/large/007tAU6Agy1fz0d5iywjrj30i104cq3b.jpg)
- 1)WSOD loss:(这个loss是图像级的类别分数)

根据[10]又考虑了位置上的关系(上面这个loss不能保证空间位置上的平滑性):
![detection [10], we regularize the smoothness as: 1) for each  foreground class c G {1, ..., C}, the top high-scored regions  (C:\Users\nian\OneDrive\NoteBook\narcissuscyn.github.io_posts\assets\clip_image011-1547018446021.png) and their neighborhood with IOU 0.5 should  both have high scores; we consider them as the pseudo pos-  itive regions; and 2) the neighborhood of the pseudo posi-  tive regions with 0.1 IOU 0.5 should be pseudo back-  ground regions (c = O). In this way, our spatial smoothness  regularization loss is:  ceC ieCc  iet3  where Cc is the set of pseudo positive regions for class  c # O, and B is the set of pseudo background regions. We  follow a similar sampling strategy as in [34]: 256 regions  are sampled, where at least 75% are the pseudo background  regions.](https://ws1.sinaimg.cn/large/007tAU6Agy1fz0d6btp6gj30hz0er76t.jpg)
- 2)WSPP loss:

6.实验
- 
    1)物体检测结果,提升主要来自于全卷积网络,而且WSOD往往能发现discriminative parts of objects  
- 
    2)谓词检测结果,结果说明考虑position和sequence的网络要更优,用全连接层的要优于pooling-based methods,本文用这类方法是为了减少计算量 

- 3)phrase和关系检测结果
Supervised-PPR-FCN优于VtransE,因为其pair-wise roi pooling,有用的spatial context 速度相对于VtransE更快,参数更少
7.实现
数据:
VG is annotated by crowd workers and thus the relations labeling are noisy, e.g., free-language and typos. Therefore, we used the20pruned version provided by Zhang et al. [48].