Bo Chen, Manlin Chu, Yi Zhao, Yuan Gao, Xiao Liu (TAL Education Group) We use a two-stage method to solve the human-object interaction task. In the first stage, we employ Cascade R-CNN to detect humans and objects; the resulting proposals are used as inputs to the second stage. We show that a powerful detector yields considerable improvements on this task. In the second stage, we employ the spatially conditioned graph network (SCG) to infer the relationship between paired human and object proposals. We use Swin-L as the backbone to extract image features and obtain instance features with an RoI align layer. A transformer encoder is applied to enhance each instance feature by considering the global context across instances. To further reduce false positives, we introduce human keypoint information to strengthen the human representation. Finally, the paired node encodings are used to predict the relationship.
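The context-enhancement step above can be sketched as follows (assuming PyTorch): per-instance RoI features are treated as tokens and passed through a transformer encoder so that each instance attends to all others. The feature dimension and module layout here are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class InstanceContextEncoder(nn.Module):
    """Enhance per-instance RoI features with global cross-instance context."""
    def __init__(self, dim=256, heads=8, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, inst_feats):
        # inst_feats: (num_instances, dim) features from the RoI align layer.
        # Treat the instances of one image as a sequence of tokens.
        return self.encoder(inst_feats.unsqueeze(0)).squeeze(0)

feats = torch.randn(12, 256)               # e.g., 12 detected instances
enhanced = InstanceContextEncoder()(feats) # same shape, context-aware
```

The enhanced human and object features would then be paired and fed to the graph network for relationship prediction.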
Zhimin Li, Chi Xie, Shuang Liang, Hongyu Wang, Cheng Zou, Boxun Li, Chi Zhang (MEGVII; Huazhong University of Science and Technology; Tongji University) We propose an end-to-end HOI Transformer with a Swin-B backbone. The model design follows https://github.com/bbepoch/HoiTransformer, but we replace the ResNet-50 backbone with the more powerful Swin-B.
Xu Sun, Hui Jiang, Yaqun Fang, Yunqing He, Tongwei Ren, Gangshan Wu (Nanjing University) We detect objects and exploit multiple models to recognize relationships between human-object pairs. The relationship recognition branch operates on rich multi-modal features for accurate inference.
We propose the Word as Query Transformer (WQTF), built on the great work QPIC. Our model assigns a specific meaning to each HOI query using word embeddings, aiming to provide information for each query instead of random initialization. Further, we use two transformer decoders to obtain more fine-grained information for object detection and relation detection, respectively.
Desen Zhou*, Tao Hu*, Zhichao Liu*, Jian Wang (* indicates equal contribution) (Baidu Inc.) We propose a simple three-stage method for the competition, consisting of an object detection stage, a relationship classification stage, and a confusing-relationship mining stage. The object detection stage is based mainly on the CenterNet2 detector; we perform class-specific NMS and ensemble different configurations of CenterNet2. The relationship classification stage takes the object detection outputs and uses a simple graph neural network (GNN) to perform HOI classification for each human-object pair. Finally, we notice that many relationships are confusing due to confusion between object categories. Our confusing-relationship mining stage takes the object proposal of a human-object pair, crops the object patch, and re-labels its object category. The resulting classification scores are used to re-rank the corresponding relationships by multiplying them with the initial scores.
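The re-ranking step above can be sketched as a simple score multiplication. The function and data layout below are illustrative assumptions, not the authors' implementation.

```python
def rerank(relations, relabel_score):
    """Re-rank relation scores for one human-object pair.

    relations: list of (relation_label, initial_score) tuples.
    relabel_score: classification score of the re-labeled object crop.
    """
    # Multiply each initial relation score by the re-labeled object score.
    return [(label, score * relabel_score) for label, score in relations]

pairs = [("smoke", 0.8), ("call", 0.3)]
print(rerank(pairs, 0.5))  # [('smoke', 0.4), ('call', 0.15)]
```

A low re-labeling score (i.e., the crop likely shows a different object) suppresses all relations predicted for that pair.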
Sai Wang, Zhenyu Xu (DeepBlue Technology (Shanghai) Co., Ltd.) We propose a Position Prior Detection Network (PPDN) for HOI detection. PPDN is composed of an object prior detector and AS-Net. It uses position priors to constrain the original detection results, addressing missed detections and false positives.
HOI-Transformer with object features from a Scene-Graph Detector
Haoran Tang, Jiawei Fan (Peking University) We use a scene-graph detector trained on VG to predict the objects correlated with HOI pairs. We use word2vec and human knowledge to select the related objects and then concatenate them into our HOI-Transformer model. We found a distribution difference between the training and test sets, so we apply a softmax to adjust our loss function. The final submission is the model trained for 33 epochs and fine-tuned for one epoch with a higher weight decay.
Danyuan Liu We use CenterNet2 as the human-object detector and then predict HOI classes based on SCG.
Neuro-Symbolic HOI; Refining AS-Net with Tiny Object Detection
Trong-Tung Nguyen, Huu-Nghia H. Nguyen, Minh-Triet Tran (University of Science, VNU-HCM, Ho Chi Minh City, Vietnam; John von Neumann, VNU-HCM, Ho Chi Minh City, Vietnam (this affiliation only applies to Minh-Triet Tran); Vietnam National University, Ho Chi Minh City, Vietnam) We introduce a simple rule-based method for the HOI task. It achieves a competitive result of 0.6176 mAP with a lightweight architecture and fast inference. We adopt the two-stage mechanism of traditional HOI methods, with our object detector (Faster R-CNN), retrained on the 10 object categories of the challenge, as the first component. Objects are then categorized into multiple bins and paired with humans based on their locations and bounding-box sizes. Each pair is then predicted using specific rules based on the spatial information, category, and bounding box of the object, together with gaze tracking and pose estimation of humans. Besides, we observe that inter-class relationships such as play and call can be discriminated without object appearance features, given prior knowledge of the object category. We therefore propose a simple abstraction-masking mechanism on the original image to force the model to focus on the masked objects and avoid wide variation of the object features. In addition, we make a small modification to a current SOTA method for the HOI-A 2019 dataset, AS-Net. The original method reformulates HOI as a set prediction problem, a trend set by the Detection Transformer. An AS-Net model trained for 90 epochs on the HOI-A 2019 training set achieves 0.7224 mAP on the HOI-A 2021 dataset. One distinctive drawback of both methods is that object detection performance suffers for small objects, e.g., cigarettes.
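The binning-and-pairing rule above can be illustrated with a minimal sketch. The area thresholds and the nearest-human pairing rule are hypothetical choices for illustration, not the authors' exact rules.

```python
def bbox_area(b):          # b = (x1, y1, x2, y2)
    return max(0, b[2] - b[0]) * max(0, b[3] - b[1])

def size_bin(b, small=32 * 32, medium=96 * 96):
    # Bucket an object by bounding-box area (thresholds are assumptions).
    a = bbox_area(b)
    return "small" if a < small else "medium" if a < medium else "large"

def center(b):
    return ((b[0] + b[2]) / 2, (b[1] + b[3]) / 2)

def pair_with_nearest_human(obj, humans):
    # Pair the object with the human whose box center is closest.
    ox, oy = center(obj)
    return min(humans, key=lambda h: (center(h)[0] - ox) ** 2
                                     + (center(h)[1] - oy) ** 2)

cigarette = (100, 100, 110, 108)  # tiny box -> "small" bin
humans = [(80, 60, 160, 260), (400, 50, 480, 250)]
print(size_bin(cigarette), pair_with_nearest_human(cigarette, humans))
```

The bin and the paired human would then feed the hand-written relation rules together with pose and gaze cues.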
To focus AS-Net training more on small-object detection, we also experiment with replacing the original L1 box-regression loss in AS-Net with the box-regression loss used in YOLO. The L1 losses for the box centers are unchanged, but the L1 losses for the box dimensions (i.e., widths and heights) are applied to the square roots of the predicted and ground-truth dimensions. We take the pretrained AS-Net model, freeze all parameters except the MLP object detection head, and train with the new loss for 95 epochs; this model achieves 0.7205 mAP. We also introduce a post-processing procedure to refine cigarette bounding-box predictions. For each cigarette ground-truth annotation, we crop the part of the image centered at the annotation but with a much larger size (e.g., 400% of the annotation size). We then use these crops and the corresponding annotations to train an object detection model (Faster R-CNN) to detect the cigarette within the crop. At test time, given a cigarette prediction from AS-Net, we crop an extended region around the prediction and use this detector to obtain a refined bounding-box prediction. This method achieves the highest mAP, 0.7270. Finally, we integrate co-occurrence information between objects and relations to recover possibly missing relations in the prediction. We compute the conditional probability of each relation given the presence of each object type in the HOI predictions; then, for each HOI triplet, relations whose conditional probability given that triplet's object exceeds 90% are added to the final prediction as new HOI triplets. Applied on top of the bounding-box refinement, this combined solution achieves 0.7245 mAP.
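The modified box-regression loss described above can be sketched as follows: L1 on the centers as before, but L1 on the square roots of the dimensions, which magnifies errors on small boxes. The (cx, cy, w, h) box format is an assumption.

```python
import math

def box_loss(pred, gt):
    """YOLO-style box loss: L1 on centers, L1 on sqrt of dimensions."""
    cx, cy, w, h = pred
    gx, gy, gw, gh = gt
    center = abs(cx - gx) + abs(cy - gy)
    # Square roots make the same absolute size error cost more on small boxes.
    dims = (abs(math.sqrt(w) - math.sqrt(gw))
            + abs(math.sqrt(h) - math.sqrt(gh)))
    return center + dims

# The same 0.01 dimension error is penalized more on a tiny box
# (e.g., a cigarette) than on a large one.
small = box_loss((0, 0, 0.01, 0.01), (0, 0, 0.02, 0.02))
large = box_loss((0, 0, 0.81, 0.81), (0, 0, 0.82, 0.82))
print(small > large)  # True
```

This is why the square-root formulation shifts training emphasis toward small objects such as cigarettes.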
Tao Pan, Guoming Li, Bowen Zheng (AI Lab of China Merchants Bank) We propose an end-to-end human-object interaction (HOI) detection method that directly detects a human and an object in a pairwise manner. Our method is based on Deformable DETR, which is used in object detection. Specifically, we use ResNet-50 as the backbone network and leverage a transformer to aggregate contextual features. We modify the deformable attention module in the transformer decoders to attend to two small sets of sampling locations, on the human and the object separately.