Fan Yu, Xin Tan, Tongwei Ren, Gangshan Wu
: Nanjing University
We propose a novel human-centric relation segmentation method based on fine-tuned Mask R-CNN and VTransE models. We first fine-tune Mask R-CNN on the object categories appearing in the training dataset and segment both person and object instances. Because Mask R-CNN may miss some persons in its instance segmentation results, we additionally detect the faces of the missed persons and extend them to locate the corresponding persons. Finally, we fine-tune VTransE on the visual relations appearing in the training dataset and detect the visual relations between each person and every other person or object.
July 31, 2018, 4:54 p.m.
Cluster, Depth, and Greedy
Hsuan-Kung Yang, Anjie Zheng, Kuan-Wei Ho, Tsu-Jui Fu, Chun-Yi Lee
: National Tsing Hua University
We first analyzed the statistics of the training data provided by the Person In Context (PIC) dataset and made two observations.
First, the class imbalance among relation labels is severe: position (geometric) relations occur far more frequently than action (non-geometric) relations.
Second, the class imbalance among object labels is also critical. Only a few human-object categories have enough samples for the model to learn from; the remaining categories have scarce data and therefore contribute little to training. To address these imbalance problems, we introduce additional information as features, combined with the instances and segmentation masks provided by Mask R-CNN, into training. We also observe that some human-object categories share similar relation frequency distributions. As a result, we decided to reduce the 85 human-object categories by clustering them beforehand.
Use a clustering method to group the 85 categories into 8 pseudo-label groups.
We want the clustering result to satisfy three conditions (constraints):
Objects in the same group are sufficiently similar to each other in terms of relation frequency distribution.
The total number of clusters is small enough to benefit the model's computation.
The amount of data inside each cluster is as large as possible.
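The grouping step could be sketched as follows. This is a minimal illustration, not the submitted implementation: the source does not name the clustering algorithm, so plain k-means on normalized relation-frequency vectors is an assumption, and the category names and counts are hypothetical toy data. The sketch only addresses the first two conditions (similar distributions within a group, a small number of groups); it does not balance cluster sizes.

```python
import random

def normalize(counts):
    """Turn raw relation counts into a frequency distribution."""
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=50, seed=0):
    """Plain k-means (assumed algorithm); returns one cluster index per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical counts of three relation types per object category.
toy_counts = {
    "chair": [90, 5, 5], "sofa": [80, 10, 10], "bench": [85, 10, 5],
    "ball": [5, 5, 90], "racket": [10, 10, 80], "bat": [5, 10, 85],
}
names = list(toy_counts)
groups = kmeans([normalize(toy_counts[n]) for n in names], k=2)
pseudo_label = dict(zip(names, groups))  # category name -> pseudo-label group
```

In the setting described above, the input would be 85 frequency vectors and `k` would be 8.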
Use Mask R-CNN as our instance segmentation architecture.
We extract object features from the classification branch and obtain instance masks from the mask branch, so the instance indices of the two branches must be aligned with each other.
Predict depth from a single image (monodepth, CVPR 2017).
Apply each instance mask to the predicted depth map to obtain the instance depth.
Use Tukey's rule to remove outliers, i.e. depth values that differ significantly from the rest.
Compute Q1 and Q3, and take the mean of the values in the interval (Q1, Q3) as the depth feature.
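The depth steps above can be illustrated with a short sketch. It rests on several assumptions not stated in the source: the depth map and mask are nested lists of equal shape, Tukey's rule uses the conventional 1.5 × IQR fences, and the quartiles are recomputed after outlier removal before averaging. The function name `instance_depth_feature` is hypothetical.

```python
import statistics

def instance_depth_feature(depth_map, mask):
    """Depth feature of one instance: the mean of the inter-quartile values."""
    # Keep only the depth values that fall inside the instance mask.
    values = [d for d_row, m_row in zip(depth_map, mask)
              for d, m in zip(d_row, m_row) if m]
    if not values:
        return None
    if len(values) < 2:
        return values[0]
    # Tukey's rule (assumed 1.5 * IQR fences): drop values outside the fences.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    kept = [v for v in values if q1 - 1.5 * iqr <= v <= q3 + 1.5 * iqr]
    if len(kept) < 2:
        return statistics.mean(kept or values)
    # Recompute the quartiles on the cleaned values and average between them.
    q1, _, q3 = statistics.quantiles(kept, n=4)
    core = [v for v in kept if q1 <= v <= q3]
    return statistics.mean(core or kept)

# Toy example: the last column lies outside the mask, 100.0 is an outlier.
toy_depth = [[1.0, 2.0, 3.0, 999.0],
             [4.0, 5.0, 100.0, 999.0]]
toy_mask = [[1, 1, 1, 0],
            [1, 1, 1, 0]]
feature = instance_depth_feature(toy_depth, toy_mask)
```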
Use Gradient boosting as our classifier. We feed 10 values into it:
box difference (y1, x1, y2, x2) - 4 values
box overlap between subject and object
cluster type of object
depth means and medians of subject and object - 4 values
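As a rough sketch, the 10-value input could be assembled as below. The exact definitions are not given in the source, so three choices here are assumptions: "box difference" is taken as the coordinate-wise subtraction of the object box from the subject box, "box overlap" as intersection over union, and the cluster type as the integer pseudo-label of the object's group. The function names are hypothetical.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (y1, x1, y2, x2)."""
    ay1, ax1, ay2, ax2 = a
    by1, bx1, by2, bx2 = b
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter = inter_h * inter_w
    union = (ay2 - ay1) * (ax2 - ax1) + (by2 - by1) * (bx2 - bx1) - inter
    return inter / union if union > 0 else 0.0

def relation_features(subj_box, obj_box, obj_cluster, subj_depth, obj_depth):
    """Build the 10-value vector; each depth argument is a (mean, median) pair."""
    diff = [float(s - o) for s, o in zip(subj_box, obj_box)]  # 4 values
    overlap = box_iou(subj_box, obj_box)                      # 1 value
    return (diff + [overlap, float(obj_cluster)]              # + cluster: 1 value
            + list(subj_depth) + list(obj_depth))             # + depth: 4 values

# Toy example with hypothetical boxes, cluster index, and depth statistics.
features = relation_features(
    subj_box=(0, 0, 10, 10), obj_box=(5, 5, 15, 15),
    obj_cluster=3, subj_depth=(2.5, 2.4), obj_depth=(4.0, 4.1),
)
```

A vector like `features` would then be passed to the gradient boosting classifier, one per subject-object pair.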
We decided to give up on relations that rarely appear; that is, for certain relation types we apply a greedy policy based on their frequency.
We focus instead on the geometric relations, for which we use the classifier described above.
July 31, 2018, 11:46 a.m.
Chen Gao, Yuliang Zou, Jia-Bin Huang
: Virginia Tech
We decompose the problem into two steps: 1) instance segmentation and 2) visual relation recognition:
1) For the segmentation part, we observed that the categories in this dataset contain both object and stuff classes. We use an off-the-shelf semantic segmentation model (https://github.com/CSAILVision/semantic-segmentation-pytorch) for the stuff classes. For the object classes, we train a Mask R-CNN. Lastly, we combine the two segmentation results and convert them into the submission format.
2) For the relation recognition part, we use the model from "iCAN: Instance-Centric Attention Network for Human-Object Interaction Detection", BMVC 2018 (https://gaochen315.github.io/iCAN/). The core idea of our relation recognition module is that the appearance of a person or an object instance contains informative cues on which relevant parts of an image to attend to for facilitating interaction prediction. To exploit this cue, we propose an instance-centric attention module that learns to dynamically highlight regions in an image conditioned on the appearance of each instance. Such an attention-based network allows us to selectively aggregate features relevant for recognizing relations.
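The attention idea described above can be sketched in a few lines. This is a simplified illustration, not the iCAN implementation: it assumes the instance feature and the image region features already live in the same vector space, uses a plain dot product as the similarity, and omits any learned projections the actual model applies. The function name `instance_centric_context` is hypothetical.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def instance_centric_context(instance_feat, region_feats):
    """Attention over image regions, conditioned on one instance's appearance."""
    # Regions that are similar to the instance receive larger weights.
    weights = softmax([dot(instance_feat, r) for r in region_feats])
    dim = len(region_feats[0])
    # The context feature is the attention-weighted sum of the region features.
    context = [sum(w * r[i] for w, r in zip(weights, region_feats))
               for i in range(dim)]
    return weights, context

# Toy example: the first region matches the instance, the second does not.
weights, context = instance_centric_context(
    instance_feat=[1.0, 0.0],
    region_feats=[[5.0, 0.0], [0.0, 5.0]],
)
```

Because the weights depend on the instance feature, each person or object in the same image would obtain a different context feature.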
July 31, 2018, 12:26 p.m.
A context-aware top-down model
Our method has two stages: a Mask R-CNN based method first extracts instance masks, and an LSTM containing a two-layer context-aware module then generates the relations between the masks.