Bo Chen, Manlin Chu, Yi Zhao, Yuan Gao, Xiao Liu
TAL Education Group
We use a two-stage method to solve the human-object interaction task. In the first stage, we employ Cascade R-CNN to detect humans and objects; these proposals serve as inputs to the second stage. We show that a powerful detector yields considerable improvements on this task. In the second stage, we employ the Spatially Conditioned Graph network (SCG) to infer the relationship between paired human and object proposals. We utilize Swin-L as the backbone to extract image features and obtain instance features via an RoI Align layer. A transformer encoder is applied to enhance each instance feature by considering the global context across instances. To further reduce false positives, we introduce human keypoint information to enhance the human representation. Finally, the paired node encodings are used to predict the relationship.
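The core of such two-stage pipelines is exhaustively pairing detected humans with detected objects and attaching a spatial encoding of each pair for the relation head. A minimal sketch of that pairing step (the function name, dict fields, and the particular spatial encoding are illustrative assumptions, not SCG's actual API):

```python
import itertools
import math

def pair_proposals(humans, objects):
    """Enumerate all human-object pairs from detector outputs.

    humans, objects: lists of dicts with 'box' = (x1, y1, x2, y2).
    Returns (human, object, spatial) tuples, where 'spatial' is a
    simple relative-geometry encoding commonly fed to a pairwise
    relation head: normalized offset plus log width/height ratios.
    """
    pairs = []
    for h, o in itertools.product(humans, objects):
        hx1, hy1, hx2, hy2 = h['box']
        ox1, oy1, ox2, oy2 = o['box']
        hw, hh = hx2 - hx1, hy2 - hy1
        ow, oh = ox2 - ox1, oy2 - oy1
        # Offset of the object box relative to the human box size,
        # and log-scale size ratios (scale-invariant features).
        spatial = (
            (ox1 - hx1) / hw,
            (oy1 - hy1) / hh,
            math.log(ow / hw),
            math.log(oh / hh),
        )
        pairs.append((h, o, spatial))
    return pairs
```

Enumerating all pairs is quadratic in the number of proposals, which is why a strong first-stage detector with tight score thresholds matters for the second stage.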
Zhimin Li, Chi Xie, Shuang Liang, Hongyu Wang, Cheng Zou, Boxun Li, Chi Zhang
MEGVII, Huazhong University of Science and Technology, Tongji University
We propose an end-to-end HOI Transformer with a Swin-B backbone. The model design follows
https://github.com/bbepoch/HoiTransformer, but we replace the ResNet-50 backbone with the more powerful Swin-B.
We detect objects and exploit multiple models to recognize relationships between human-object pairs. The relationship
recognition branch operates on rich multi-modal features for accurate inference.
Xian Qu, Zijian Li, Xubin Zhong, Changxing Ding
South China University of Technology
"We propose Word as Query TransFormer (WQTF) which is built on the great work QPIC.
Our model sets specific meaning for each HOI query using word embeddings, aiming to bring information for each query
instead of random initialization.
Furtherly, we use two transformer decoders to get more fine-gained information for object detection and relation
Desen Zhou*, Tao Hu*, Zhichao Liu*, Jian Wang (* indicates equal contribution)
Baidu Inc.
We propose a simple three-stage method for the competition. Our method consists of an object detection stage, a relationship classification stage, and a confusing-relationships mining stage. The object detection stage is mainly based on the CenterNet2 detector. We perform class-specific NMS and ensemble different configurations of CenterNet2. The relationship classification stage takes the outputs of object detection and utilizes a simple graph neural network (GNN) to perform HOI classification for each human-object pair. Finally, we notice that many relationships are confusing due to confusion between different object categories. Our confusing-relationships mining stage takes the object proposal of a human-object pair and crops the object patch to re-label the object category. The classification scores are used to re-rank the corresponding relationships by multiplication with the initial scores.
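The final re-ranking step described above amounts to multiplying each relation score by the re-labeled object-category confidence for its pair. A minimal sketch of that operation (field names and the helper itself are illustrative assumptions, not the authors' code):

```python
def rerank_relations(hoi_preds, relabel_scores):
    """Re-rank HOI predictions by the re-labeled object confidence.

    hoi_preds: list of dicts with 'pair_id', 'relation', 'score'.
    relabel_scores: dict mapping pair_id -> object re-classification
    score from the cropped-patch classifier; pairs without a re-label
    keep their original score (factor 1.0).
    """
    out = []
    for p in hoi_preds:
        factor = relabel_scores.get(p['pair_id'], 1.0)
        out.append({**p, 'score': p['score'] * factor})
    # Highest combined score first.
    return sorted(out, key=lambda p: p['score'], reverse=True)
```

Multiplying rather than replacing keeps the original relation confidence in play, so a confidently wrong object label can demote a relation without discarding it outright.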
Sai Wang, Zhenyu Xu
DeepBlue Technology (Shanghai) Co., Ltd.
We propose a Position Prior Detection Network (PPDN) for HOI detection. PPDN is composed of an object prior detector and AS-Net. It uses position priors to constrain the original detection results, mitigating missed detections and false positives.
HOI-Transformer with object features from Scene-Graph Detector
Haoran Tang, Jiawei Fan
Peking University
We used a scene-graph detector trained on Visual Genome (VG) to predict the objects correlated with HOI pairs. We used word2vec and human knowledge to select the related objects and then concatenated them into our HOI-Transformer model. We found a difference between the distributions of the training and test sets, so we applied a softmax to adjust our loss function. The final submission is the model trained for 33 epochs and fine-tuned for one epoch with a higher weight decay.
We used CenterNet2 as the human-object detector, and then predicted the HOI class based on SCG.
Refining AS-Net with Tiny Object Detection
Huu-Nghia H. Nguyen, Minh-Triet Tran
University of Science, VNU-HCM, Ho Chi Minh City, Vietnam;
John von Neumann Institute, VNU-HCM, Ho Chi Minh City, Vietnam (this affiliation only applies to Minh-Triet Tran);
Vietnam National University, Ho Chi Minh City, Vietnam
"We introduce a simple rule-based method for HOI tasks. The method can achieve the competitive result with a 0.6176 mAP
score using our lightweight model architecture with fast inference time. We adopt the two-stage mechanism in traditional
HOI methods with our retrained object detector - Faster RCNN - on 10 object categories of the challenge as the first
component. After that, objects are categorized into multiple bins and paired with humans based on their locations and
bounding box size. The pairs will then be predicted using specific rules based on the spatial information, category,
bounding box of the objects and gaze tracking, pose estimation of humans. Besides, we observe that we can determine the
discrimination between inter-class relationships such as play and call without object appearance features given prior
knowledge about categories of the object. Therefore, we propose a simple abstraction masking mechanism in the original
image to enforce the model focusing on the masking objects and avoiding wide variation of the object features.
In addition, we also make a small modification on a current SOTA method for the HOI-A 2019 dataset, which is AS-Net. The
original method reformulates the HOI problem as a set prediction problem, a trend set by Detection Transformer. An
AS-Net model trained for 90 epochs on the HOI-A 2019 training set achieves an mAP score of 0.7224 on the HOI-A 2021
dataset. One distinctive drawback of both methods is that the object detection performance suffers for small objects.
To focus the AS-Net training procedure more on the object detection performance on small objects, we also experiment
with replacing the original L1 box regression loss in AS-Net with the box regression loss used in YOLO. The L1 loss for the box centers is unchanged, but the L1 loss for the box dimensions (i.e., widths and heights) is applied to the square roots of the predicted and ground-truth dimensions. We use the pretrained
AS-Net model, freeze all parameters except for the MLP object detection head, and train the model with the new loss for
95 epochs. This model achieves an mAP score of 0.7205.
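The modified loss described above can be written down compactly for a single box; the function name is illustrative, but the computation follows the description: plain L1 on centers, L1 on square-rooted widths and heights:

```python
import math

def sqrt_box_l1(pred, gt):
    """L1 box loss with square-rooted dimensions, YOLO-style.

    pred, gt: boxes as (cx, cy, w, h), normalized to [0, 1].
    Centers use plain L1; widths and heights are compared after a
    square root, so the same absolute error costs relatively more
    on a small box than on a large one.
    """
    pcx, pcy, pw, ph = pred
    gcx, gcy, gw, gh = gt
    center = abs(pcx - gcx) + abs(pcy - gcy)
    size = (abs(math.sqrt(pw) - math.sqrt(gw))
            + abs(math.sqrt(ph) - math.sqrt(gh)))
    return center + size
```

For example, a prediction four times the area of a small ground-truth box incurs a size penalty on the scale of the square-rooted dimensions rather than the raw ones, which is exactly the re-weighting toward small objects motivating the experiment.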
We introduce a post-processing procedure to refine the bounding box predictions for cigarettes. For each cigarette ground truth annotation, we crop out a part of the image centered at the annotation but with a much larger size (e.g., 400% of the annotation size). Then we use the newly cropped images and the corresponding annotations to train an object
detection model (Faster RCNN) to detect the cigarette in the cropped image. During test time, given a cigarette
prediction by AS-Net, we crop an extended part of the image around the prediction and use the object detection model to
get new bounding box prediction results. The highest mAP score is achieved with this method at 0.7270.
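The crop-and-redetect step hinges on computing an expanded, image-clamped window around each predicted box. A small sketch of that geometry (helper name and signature are illustrative; scale=4.0 corresponds to the ~400% context window mentioned above):

```python
def expand_crop(box, scale, img_w, img_h):
    """Compute an expanded crop window around a detected box.

    box: (x1, y1, x2, y2) in pixels. The window keeps the box center
    and scales width and height by 'scale', clamped to image bounds,
    so the second-stage detector sees the object with context.
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    w, h = (x2 - x1) * scale, (y2 - y1) * scale
    return (
        max(0, cx - w / 2),
        max(0, cy - h / 2),
        min(img_w, cx + w / 2),
        min(img_h, cy + h / 2),
    )
```

The second-stage detector's outputs are in crop coordinates, so in practice its boxes must be shifted back by the crop origin before replacing the original prediction.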
We integrate cooccurrence information between objects and relations to augment possibly missing relations in the
prediction. We calculate the conditional probabilities of relations given the existence of an object type in HOI
prediction. Then, for each HOI triplet, relations with a conditional probability of more than 90% given the object in
that triplet are added to the final prediction as a new HOI triplet. This procedure is applied on top of the bounding
box refinement method, and the combined solution achieves an mAP score of 0.7245.
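The co-occurrence augmentation described above reduces to estimating P(relation | object) from training triplets and adding any relation whose conditional probability exceeds the threshold (90% in the text). A minimal sketch under those assumptions (data layout and function names are illustrative):

```python
from collections import Counter

def conditional_relation_probs(train_triplets):
    """Estimate P(relation | object) from (object, relation) pairs
    observed in training triplets."""
    obj_counts = Counter(obj for obj, _ in train_triplets)
    pair_counts = Counter(train_triplets)
    return {
        (obj, rel): n / obj_counts[obj]
        for (obj, rel), n in pair_counts.items()
    }

def augment(preds, cond_probs, thresh=0.9):
    """For each predicted triplet, add a new triplet for every relation
    whose conditional probability given that object exceeds 'thresh'."""
    extra = []
    for p in preds:
        for (obj, rel), prob in cond_probs.items():
            if obj == p['object'] and prob >= thresh and rel != p['relation']:
                extra.append({**p, 'relation': rel})
    return preds + extra
```

This only ever adds triplets for the same human-object pair, so it can recover missed relations but cannot introduce new detections, which is why it composes cleanly with the box refinement step.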
Tao Pan, Guoming Li, Bowen Zheng
AI Lab of China Merchants Bank
We propose an end-to-end Human-Object Interaction (HOI) detection method that directly detects a human and object in a pairwise manner. Our method is based on Deformable DETR, originally used for object detection. Specifically, we use ResNet-50 as the backbone network and leverage a transformer to aggregate contextual features. We modify the deformable attention module in the transformer decoders to attend to two small sets of sampling locations on the human and object