Zhenzhi Wang, Yixuan Li, Tao Wu, Gangshan Wu, Limin Wang
Multimedia Computing Group (MCG), Nanjing University
Description of the method: A two-stage model is used to tackle the STVG problem. In the first stage, we detect human bounding boxes in all frames and link them to generate candidate action tubes, then extract RoI features for each bounding box to capture motion cues. In the second stage, we use a Contrastive and Compatible Matching Network to refine the temporal region of each candidate tube. The compatible part is similar to 2D-TAN and predicts the vIoU of each proposal in the 2D proposal map (e.g., 16×16), while the contrastive part learns more discriminative features by contrasting the positive tube-sentence pair against negative tube-sentence pairs within and across videos. Unlike 2D-TAN, neither part has early-fusion components; both use only the cosine similarity between the two modalities.
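A minimal sketch of the late-fusion scoring idea described above (shapes, function names, and the temperature value are our own illustrative assumptions, not the authors' released code):

```python
import torch
import torch.nn.functional as F

def compatible_scores(proposal_feats, sent_feat):
    """Compatible part: proposal_feats is an (N, N, D) 2D proposal map
    (e.g. 16x16), sent_feat is a (D,) sentence embedding. Returns an
    (N, N) map of cosine similarities used to predict the vIoU."""
    p = F.normalize(proposal_feats, dim=-1)
    s = F.normalize(sent_feat, dim=-1)
    return torch.einsum('ijd,d->ij', p, s)

def contrastive_loss(tube_feats, sent_feats, temperature=0.07):
    """Contrastive part: tube_feats and sent_feats are (B, D) matched
    tube/sentence pairs; off-diagonal pairs in the batch serve as
    intra-/inter-video negatives (InfoNCE-style)."""
    t = F.normalize(tube_feats, dim=-1)
    s = F.normalize(sent_feats, dim=-1)
    logits = t @ s.t() / temperature          # (B, B) cosine similarities
    targets = torch.arange(t.size(0), device=t.device)
    return F.cross_entropy(logits, targets)
```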
Spatio-Temporal Video Grounding with Human Tracking and Cross-Modality Encoder Transformer
0.300
Contributors
Description
Yiyu, Xinying Wang, Wei Hu, Xun Luo, Cheng Li
MGTV
Quadruple extraction:
We use NLP techniques such as regular-expression matching, keyword extraction, and syntactic analysis to extract four-tuples from the text query. Each extracted quadruple is composed of character, color, clothing, and position. These features are used for two-stage target detection and play an important role.
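A minimal sketch of such rule-based quadruple extraction (the keyword vocabularies below are assumed for illustration and are not MGTV's actual rules):

```python
import re

COLORS = ['black', 'white', 'red', 'blue', 'green', 'yellow', 'grey']
CLOTHES = ['shirt', 'coat', 'dress', 'jacket', 'trousers', 'hat']
CHARACTERS = ['man', 'woman', 'boy', 'girl', 'person']
POSITIONS = ['left', 'right', 'middle', 'front', 'behind']

def extract_quadruple(query):
    """Return the (character, color, clothing, position) slots found in a query."""
    q = query.lower()
    find = lambda vocab: [w for w in vocab if re.search(r'\b' + w + r'\b', q)]
    return {
        'character': find(CHARACTERS),
        'color': find(COLORS),
        'clothing': find(CLOTHES),
        'position': find(POSITIONS),
    }

# e.g. extract_quadruple("the woman in a red dress on the left")
# -> {'character': ['woman'], 'color': ['red'], 'clothing': ['dress'], 'position': ['left']}
```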
Get tubes:
Firstly, we split each video to smaller video cuts by using scenedetect. Secondly, We apply human detection with YoloV5
and multiple object tracking with Deepsort and FastReID to get tubes for each video cut. After that, A gaussian filter
is applied to smooth tube tracks.
(only for test) For each video, we ignore the video cuts which has only one tube while the description contains more
than one character. Also, We delete the tube which does not match the description by using gender classification network
to .
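The track-smoothing step can be sketched as follows (a minimal illustration assuming each tube is stored as a per-frame box array; parameter values are assumptions):

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def smooth_tube(boxes, sigma=2.0):
    """boxes: (T, 4) array of [x1, y1, x2, y2] for one tube, one row per frame.
    Each coordinate is smoothed independently along the time axis."""
    boxes = np.asarray(boxes, dtype=np.float32)
    return gaussian_filter1d(boxes, sigma=sigma, axis=0)
```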
Cross-modal representation
We use the LXMERT framework to learn the vision-and-language connection and use Mask R-CNN to extract visual features from each bounding box. Finally, we output a global classification for each tube, as well as a classification and regression for each frame in the tube.
To tackle the class-imbalance problem, we use the focal loss as the global classification loss; the frame classification loss and regression loss are the cross-entropy loss and the IoU loss, respectively. During the experiments, we found a serious over-fitting problem, which we address in two steps. First, we freeze the language-branch weights of the LXMERT model. Second, we stop training early.
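A minimal sketch of a binary focal loss of the kind described for the global tube classification (the alpha/gamma values and the binary formulation are our assumptions):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """logits, targets: (N,) tensors; targets are 0/1 float tube labels.
    Down-weights easy examples so the rare positive tubes dominate the loss."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction='none')
    p_t = p * targets + (1 - p) * (1 - targets)          # prob of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```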
DOTER
0.298
Contributors
Description
Cristian Rodriguez-Opazo (Australian Institute for Machine Learning), Yicong Hong (Australian National University),
Fatemeh Saleh (Samsung AI Center)
It is a two-stage method that uses DORi (https://arxiv.org/abs/2010.06260) for temporal moment localization and UNITER for spatial grounding. We used only the training split to train both stages, with an ensemble of 10 models for the temporal stage and 3 for the spatial stage. To get better human boxes, we used IterDet (Iterative Scheme for Object Detection in Crowded Environments).
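One plausible reading of the temporal ensembling step, sketched with assumed score-map shapes (the exact aggregation DOTER used is not stated in the description):

```python
import numpy as np

def ensemble_temporal(score_maps):
    """score_maps: list of (T, T) arrays, score_maps[k][s, e] = model k's score
    for the moment spanning clips s..e. Averages the models and returns the
    best-scoring (start, end) span."""
    mean_scores = np.mean(np.stack(score_maps), axis=0)
    mean_scores = np.triu(mean_scores)          # keep only valid spans with s <= e
    s, e = np.unravel_index(np.argmax(mean_scores), mean_scores.shape)
    return int(s), int(e)
```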
Augmented 2D-TAN
0.294
Contributors
Description
Chaolei Tan[1], Zihang Lin[1], Jian-Fang Hu[1], Wei-Shi Zheng[1].
[1] Sun Yat-sen University
We propose a two-stage approach to solve the human-centric spatio-temporal video grounding problem. In the first stage, we design a temporal video grounding network named Augmented 2D-TAN, which is mainly built on top of 2D-TAN. We improve the original 2D-TAN in three aspects. First, a temporal context-aware representation aggregation module is used to aggregate clip-level representations in place of the original max pooling or stacked convolutions. The aggregation module is a BiLSTM with weights shared across all moments (each a sequence of clips), which proves effective in capturing moment-level discriminative information. Second, we randomly concatenate target moment segments in pairs as a strong data augmentation, which prevents early overfitting and reduces the risk of learning a model that simply grounds the salient moments. Finally, a model ensemble strategy is adopted to further boost temporal grounding performance. In the second stage, we use a pretrained MDETR model to generate frame-level bounding boxes from the language query, and design a set of hand-crafted rules to select the best matching bounding box output by MDETR for each frame within our temporal grounding segment.
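A minimal sketch of the context-aware aggregation idea (feature sizes and pooling choice are our assumptions; the point is that one BiLSTM, with shared weights, replaces max pooling over the clips of every candidate moment):

```python
import torch
import torch.nn as nn

class MomentAggregator(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        # One BiLSTM shared across all candidate moments.
        self.rnn = nn.LSTM(dim, dim // 2, batch_first=True, bidirectional=True)

    def forward(self, clip_feats):
        """clip_feats: (num_moments, max_clips, dim) clip features of each
        candidate moment. Returns (num_moments, dim) moment representations."""
        out, _ = self.rnn(clip_feats)     # same LSTM weights for every moment
        return out.mean(dim=1)            # pool over the clips of each moment
```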