An End-to-End Approach to Detecting Human-Object Interactions Using Transformers

Written By Zach Johnson

AI and tech enthusiast with a background in machine learning.

A team of researchers from MEGVII Technology recently published a paper introducing a new deep learning method for detecting human-object interactions (HOI) in images. HOI detection is an important computer vision task with applications like automated image captioning, human behavior analysis, and assisting the visually impaired.

The key innovation in this paper is the use of a transformer architecture to directly predict HOI instances in an end-to-end manner. Previous methods either break down HOI detection into separate object detection and interaction classification stages, or introduce extra “interaction points” as a workaround. In contrast, the HOI Transformer can reason about relations between humans and objects from the full global context in an image, and directly output HOI detections in one pass.

At a high level, the HOI Transformer consists of a CNN backbone that extracts image features, a transformer encoder-decoder that processes those features, and final MLP heads that produce the actual HOI detections. The encoder condenses the image into a “global memory”, while the decoder takes a fixed set of learned “HOI queries” together with that global memory and transforms each query into an output embedding, from which the heads decode one candidate HOI instance.
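To make the data flow concrete, here is a minimal PyTorch-style sketch of that layout. It is not the authors' implementation: the stand-in backbone, layer sizes, query count, and head names are all illustrative placeholders (the real model uses a standard CNN backbone such as a ResNet and adds positional encodings, among other details).

```python
import torch
import torch.nn as nn

class HOITransformerSketch(nn.Module):
    """Illustrative skeleton only: CNN backbone -> transformer
    encoder-decoder -> MLP prediction heads. All hyperparameters
    here are placeholders, not the paper's settings."""

    def __init__(self, num_actions=117, num_objects=80,
                 hidden_dim=256, num_queries=100):
        super().__init__()
        # Stand-in backbone (the paper uses a standard CNN, e.g. a ResNet).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=4, padding=3), nn.ReLU(),
            nn.Conv2d(64, hidden_dim, 3, stride=4, padding=1),
        )
        # Encoder-decoder over the flattened image features.
        self.transformer = nn.Transformer(
            d_model=hidden_dim, nhead=8,
            num_encoder_layers=6, num_decoder_layers=6,
            batch_first=True,
        )
        # Learned HOI queries, each a slot for one candidate HOI instance.
        self.hoi_queries = nn.Embedding(num_queries, hidden_dim)
        # Heads: interaction class, object class, and the two boxes.
        self.action_head = nn.Linear(hidden_dim, num_actions + 1)  # +1 = "no HOI"
        self.object_head = nn.Linear(hidden_dim, num_objects + 1)
        self.human_box_head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 4), nn.Sigmoid())
        self.object_box_head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 4), nn.Sigmoid())

    def forward(self, images):
        feats = self.backbone(images)              # (B, C, H', W')
        memory = feats.flatten(2).transpose(1, 2)  # (B, H'*W', C): "global memory"
        queries = self.hoi_queries.weight.unsqueeze(0).expand(
            images.size(0), -1, -1)                # (B, Q, C)
        hs = self.transformer(memory, queries)     # decoder outputs, (B, Q, C)
        return {
            "actions": self.action_head(hs),
            "objects": self.object_head(hs),
            "human_boxes": self.human_box_head(hs),
            "object_boxes": self.object_box_head(hs),
        }

# One forward pass on a dummy batch of two 256x256 images.
model = HOITransformerSketch()
out = model(torch.randn(2, 3, 256, 256))
print(out["actions"].shape)  # torch.Size([2, 100, 118])
```

Each query slot yields one candidate quintuple, so the model emits a fixed-size set of detections in a single forward pass.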

A key component is the custom quintuple matching loss designed for this task. Each predicted instance is a quintuple (human, interaction, object, human box, object box); during training, a bipartite matching assigns each prediction to at most one ground-truth quintuple, and the loss is computed over the matched pairs. This single objective captures both classification and localization accuracy, letting HOI instances be trained end to end.
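The paper's full cost combines several terms, but the core mechanism is DETR-style bipartite (Hungarian) matching between predicted and ground-truth quintuples. Here is a simplified sketch for a single image; the `match_hoi_quintuples` helper and its cost weights are illustrative, and the paper's localization cost also includes terms (such as GIoU) omitted here:

```python
import torch
from scipy.optimize import linear_sum_assignment

def match_hoi_quintuples(pred, gt, w_cls=1.0, w_box=5.0):
    """Assign each ground-truth HOI quintuple to its best prediction.

    pred: "actions" (Q, A+1) and "objects" (Q, O+1) logits, plus
          "human_boxes"/"object_boxes" (Q, 4) in normalized coordinates.
    gt:   "actions" (G,) and "objects" (G,) class indices, plus boxes (G, 4).
    """
    prob_act = pred["actions"].softmax(-1)
    prob_obj = pred["objects"].softmax(-1)

    # Classification cost: negative probability of the ground-truth classes.
    cost_cls = -(prob_act[:, gt["actions"]] + prob_obj[:, gt["objects"]])

    # Localization cost: L1 distance for both the human and the object box.
    cost_box = (torch.cdist(pred["human_boxes"], gt["human_boxes"], p=1)
                + torch.cdist(pred["object_boxes"], gt["object_boxes"], p=1))

    cost = w_cls * cost_cls + w_box * cost_box        # (Q, G)
    rows, cols = linear_sum_assignment(cost.detach().numpy())
    return rows, cols  # matched (prediction, ground-truth) index pairs

# Toy example: 100 query slots, 2 ground-truth HOI instances.
pred = {"actions": torch.randn(100, 118), "objects": torch.randn(100, 81),
        "human_boxes": torch.rand(100, 4), "object_boxes": torch.rand(100, 4)}
gt = {"actions": torch.tensor([3, 52]), "objects": torch.tensor([17, 17]),
      "human_boxes": torch.rand(2, 4), "object_boxes": torch.rand(2, 4)}
print(match_hoi_quintuples(pred, gt))
```

Once matched, the classification and box-regression losses are computed over the matched pairs, while unmatched queries are trained toward the "no HOI" background class, as in DETR-style detectors.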

The authors evaluated their model on two HOI detection benchmarks, HICO-DET and V-COCO. Without bells and whistles, the HOI Transformer achieves state-of-the-art results, outperforming more complex prior methods. The gains are especially notable on rare HOI categories with few training examples.

In summary, this work demonstrates how bringing transformers into HOI detection can enable a simpler, more effective end-to-end approach. The HOI Transformer sets a new state-of-the-art, while avoiding the need for hand-designed pipelines. This end-to-end reasoning capability could unlock better performance and generalization for human-centric vision tasks. The code for the HOI Transformer is available on GitHub.
