Tianfei Zhou, Wenguan Wang, Jianbing Shen, Zhijie Zhang: Inception Institute of Artificial Intelligence (IIAI)
We propose a Unified Relationship detection Network (URNet) for this contest. We jointly train entity detection/segmentation and relationship inference in an end-to-end manner. URNet is built on Cascade (Mask) R-CNN, extended with a new branch that recognizes the relationship of each human-object pair. The relationship recognition branch operates on rich multi-modal features for accurate inference.
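As a sketch of the pairing step that feeds such a relationship branch (the function name and data layout are our own illustrative assumptions, not the authors' code), each detected human is paired with each detected object before the branch scores the pair:

```python
# Hypothetical sketch: enumerate the human-object pairs that a per-pair
# relationship branch would score. Detections are (label, box, score) tuples.
def enumerate_pairs(detections):
    humans = [d for d in detections if d[0] == "person"]
    objects = [d for d in detections if d[0] != "person"]
    # Every human is paired with every non-human object; the relationship
    # branch then classifies each pair independently.
    return [(h, o) for h in humans for o in objects]

dets = [("person", (0, 0, 50, 100), 0.9),
        ("cigarette", (40, 20, 45, 25), 0.6),
        ("person", (60, 0, 110, 100), 0.8)]
pairs = enumerate_pairs(dets)
print(len(pairs))  # 2 humans x 1 object = 2 pairs
```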
Oct. 17, 2019, 9:08 p.m.
Zanhui Fan, Han Wang, Peng Chen, Yusen Qin, Shengjin Wang, Zichong Chen: Segway Robotics, Tsinghua University, Dalian Neusoft University of Information
The interactions in the HOI-W challenge are highly correlated with object categories, and location information is critical for determining whether an interaction between a person and an object exists. We propose a human-object interaction detection scheme consisting of three parts: (1) object detection; (2) human/object location information based on human face detection and person keypoint detection; (3) a decision network to predict the relations between human subjects and objects. The object detection module is adapted from Mask R-CNN and provides bounding boxes of human and object targets for the subsequent stages. Human face detection based on MTCNN and person keypoint detection based on OpenPose are introduced to calculate absolute and relative location metrics for each human subject and object. Finally, the visual appearance features from the object detection module are combined with the location metrics to predict relations between human subjects and objects. Given sufficient time, this line of research should yield further details and improvements.
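As an illustration of the kind of absolute and relative location metrics such a decision network could consume (the abstract does not specify the exact features; these are common choices, not necessarily the authors'):

```python
# Illustrative absolute/relative location metrics for a (human, object)
# box pair; boxes are (x1, y1, x2, y2) in image coordinates. These are
# common choices, not necessarily the authors' exact metrics.
def location_metrics(human, obj, img_w, img_h):
    hx = (human[0] + human[2]) / 2.0
    hy = (human[1] + human[3]) / 2.0
    ox = (obj[0] + obj[2]) / 2.0
    oy = (obj[1] + obj[3]) / 2.0
    hw, hh = human[2] - human[0], human[3] - human[1]
    ow, oh = obj[2] - obj[0], obj[3] - obj[1]
    return {
        # absolute: normalized object center within the image
        "obj_cx": ox / img_w, "obj_cy": oy / img_h,
        # relative: offset of object center from human center, scaled by human size
        "dx": (ox - hx) / hw, "dy": (oy - hy) / hh,
        # relative scale of the object w.r.t. the human
        "scale": (ow * oh) / (hw * hh),
    }

m = location_metrics((0, 0, 100, 200), (80, 40, 120, 80), 640, 480)
print(round(m["dx"], 2), round(m["scale"], 3))  # 0.5 0.08
```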
Oct. 17, 2019, 5:11 a.m.
Faster Interaction Net
Haoliang Tan, Yu Zhu, Guodong Guo
Institute of Deep Learning, Baidu Research
Xi'an Jiaotong University, Xi'an, China
National Engineering Laboratory for Deep Learning Technology and Application, Beijing, China
We decompose the problem into two steps: 1) human and object bounding-box detection, and 2) human-object interaction recognition.
1) For the object detection part, we observed that the HOIW dataset contains specific object categories such as document and cigarette. These objects vary widely in size, are often very small, and are difficult to detect. We therefore used Faster R-CNN with a ResNet-101 backbone that had already been fine-tuned on the COCO dataset, rather than starting from ImageNet pre-trained weights. After fine-tuning Faster R-CNN on the HOIW dataset, objects with scores higher than 0.4 are kept for the subsequent relationship recognition step.
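The score-based filtering step above can be sketched as follows (the detection dict layout is an assumption for illustration):

```python
# Keep only detections whose score exceeds the 0.4 threshold described in
# the text before passing them to relationship recognition.
def filter_detections(detections, threshold=0.4):
    return [d for d in detections if d["score"] > threshold]

dets = [{"label": "person", "score": 0.95},
        {"label": "cigarette", "score": 0.38},
        {"label": "document", "score": 0.41}]
kept = filter_detections(dets)
print([d["label"] for d in kept])  # ['person', 'document']
```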
2) For the interaction recognition part, our model resembles a modified Faster R-CNN. We use ResNet-50 as the backbone; after the image passes through the res3 block, we apply RoIAlign to the feature map to extract an instance feature for each detected object. These features are then passed through the res4 block to perform the final action recognition.
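Because RoIAlign here operates on res3 features, image-space boxes must first be mapped to feature-map coordinates. In a standard ResNet the res3 output has stride 8 (our assumption; the text does not state the stride), so the mapping is a simple division:

```python
# Map an image-space box to res3 feature-map coordinates for RoIAlign.
# A standard ResNet res3 block has an output stride of 8 (assumption;
# the stride is not stated in the text).
RES3_STRIDE = 8

def box_to_feature_coords(box, stride=RES3_STRIDE):
    x1, y1, x2, y2 = box
    return (x1 / stride, y1 / stride, x2 / stride, y2 / stride)

print(box_to_feature_coords((64, 32, 128, 96)))  # (8.0, 4.0, 16.0, 12.0)
```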
Oct. 17, 2019, 5:01 a.m.
Shuchang Lyu, Lingyun Zeng
We propose a two-stage architecture named Faster Fusion Interaction Network (F2INet) for the human-object interaction (HOI) task. First, a Faster R-CNN pretrained on the COCO dataset is fine-tuned on the HOI dataset to provide reliable object detection results. Second, we use these detections to generate proposals representing the subject and the object. A relation-object feature interaction model with an information-transferring attention module then explores the relations between the subject and the object and predicts the relation categories.
Our method effectively explores the interaction feature through the attention module, so that the relation features incorporate information from the object and subject features. The object and subject features in turn integrate the relation features, carrying additional information beyond their own global features.
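One plausible reading of this information transfer (purely illustrative; the abstract does not give the module's exact formulation) is that the relation feature attends over the subject and object features and absorbs a weighted sum of them:

```python
import math

# Illustrative attention-style fusion: the relation feature attends over
# the subject and object features and absorbs a weighted sum of them.
# This is one plausible reading, not the authors' exact module.
def attend(rel, subj, obj):
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    # softmax over the two attention scores (numerically stabilized)
    scores = [dot(rel, subj), dot(rel, obj)]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    w_s, w_o = exps[0] / z, exps[1] / z
    # fused relation feature = relation + attention-weighted context
    return [r + w_s * s + w_o * o for r, s, o in zip(rel, subj, obj)]

fused = attend([1.0, 0.0], [0.5, 0.5], [0.0, 1.0])
print(len(fused))  # same dimensionality as the relation feature
```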
We apply a hierarchical joint training scheme. To be specific, the framework is trained with hierarchical classification tasks: we first train the network to classify interactive versus non-interactive pairs, and then classify interactive pairs into the different HOI categories.
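At inference, a natural way to combine the two levels of this hierarchy (a sketch of the idea, not the authors' exact rule) is to let the binary interactive score gate the per-category scores:

```python
# Sketch of the hierarchical scheme: a binary "is there any interaction"
# probability gates the per-category HOI probabilities. Values are
# illustrative.
def hierarchical_scores(p_interactive, category_probs):
    # Stage 1 decides interactive vs. non-interactive; stage 2 scores
    # the specific HOI category. The final score is their product.
    return [p_interactive * p for p in category_probs]

scores = hierarchical_scores(0.9, [0.5, 0.3, 0.2])
print([round(s, 2) for s in scores])  # [0.45, 0.27, 0.18]
```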
Our model consists of four streams: a human stream, an object stream, a spatial-pose stream, and a binary-classification stream. The human and object streams each use a residual block with global average pooling followed by four 1024-dimensional FC layers. In contrast, the spatial-pose stream is composed of two convolutional layers with max-pooling and two 1024-dimensional FC layers. The outputs of the first three streams are concatenated and passed through two 1024-dimensional FC layers to perform binary classification. The inputs to the streams are the human feature, object feature, spatial map, and global feature extracted from the COCO pre-trained Faster R-CNN backbone. The final HOI predictions are obtained by late fusion of the four streams' results. We use a single model; no ensemble trick is adopted.
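The late-fusion step can be sketched as follows; the abstract says "late fusion" without specifying the rule, so averaging is used here purely for illustration:

```python
# Late fusion of per-stream prediction vectors by averaging (one common
# choice; the abstract does not specify the fusion rule).
def late_fuse(stream_scores):
    n = len(stream_scores)
    k = len(stream_scores[0])
    return [sum(s[i] for s in stream_scores) / n for i in range(k)]

human_s   = [0.8, 0.1]
object_s  = [0.6, 0.3]
spatial_s = [0.7, 0.2]
binary_s  = [0.9, 0.1]
fused = late_fuse([human_s, object_s, spatial_s, binary_s])
print(fused)  # [0.75, 0.175]
```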
In the test phase, we generate the final HOI score by combining the human detection score, the object detection score, and the network prediction. We also apply prior knowledge to the results to ensure that all HOI detections are reasonable. In addition, we use a weight computed from the label distribution of the training data to adjust the output score distribution.
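A minimal sketch of this test-time combination (the multiplicative form and the per-class prior weight are our reading of the text, not a stated formula):

```python
# Sketch of the test-time score: detection scores for the human and the
# object combined with the network's interaction prediction, then
# re-weighted by a per-class prior derived from the training label
# distribution. The multiplicative form is an illustrative assumption.
def final_score(human_score, object_score, interaction_score,
                class_prior_weight=1.0):
    return human_score * object_score * interaction_score * class_prior_weight

s = final_score(0.9, 0.8, 0.5, class_prior_weight=1.2)
print(round(s, 3))  # 0.432
```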