Human-centric Spatio-Temporal Video Grounding

Introduction

Given an untrimmed video and a description of an object, the spatio-temporal video grounding (STVG) task aims to localize the spatio-temporal tube of the target object that the description refers to. It is a crucial task that requires visual-language cross-modal comprehension.
Inputs: A 20 s video and a description corresponding to a human.
Outputs: The start and end frame numbers, together with the bounding boxes of the target person in the frames of that clip.
The data are based on the data of the PIC 3.0 challenge, which we have refined and extended. The data format is therefore the same as last year's, and participants can tune their models according to the Data format.
The updated annotations are in the anno_v2 directory. The data are split into train, val and test sets.
Participants should register using this Form before testing.

Prize

1st prize: ¥ 10,000
2nd prize: ¥ 3,000
3rd prize: ¥ 2,000

Important Dates

Time zone: Beijing, UTC+8

June 10th, 2022
June 25th, 2022
June 26th-30th, 2022
July 1st, 2022
July 6th, 2022
Testing set released and submission opened
Objective evaluation
Evaluation results announced

1. The results should be stored in results.json, with the following format:
{
    'video_id': {
        'st_frame': st,
        'ed_frame': ed,
        'bbox': {
            'st': [x1, y1, x2, y2],
            'st+1': [x1, y1, x2, y2],
            ...
            'ed': [x1, y1, x2, y2],
        }
    },
}
2. You have 20 submission chances in total.
3. The evaluation process can take some time. A failed submission does not reduce your remaining submission chances.
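The format in item 1 can be produced with a few lines of Python. The following is a minimal sketch under stated assumptions, not an official submission tool: the helper names `build_entry` and `write_results` and the video id `example_video` are hypothetical, `ed_frame` is treated as inclusive (the template lists an 'ed' key under 'bbox'), and the keys under 'bbox' are assumed to be frame numbers written as decimal strings. Note that JSON itself requires double-quoted strings, which `json.dump` emits automatically.

```python
import json
from typing import Dict, List, Sequence


def build_entry(st: int, ed: int, boxes: Sequence[Sequence[float]]) -> Dict:
    """Build one per-video entry; boxes[i] is [x1, y1, x2, y2] for frame st + i.

    Assumption: ed is inclusive, so exactly ed - st + 1 boxes are expected.
    """
    if ed < st:
        raise ValueError("ed_frame must not be smaller than st_frame")
    if len(boxes) != ed - st + 1:
        raise ValueError("expected one box per frame from st to ed inclusive")
    bbox: Dict[str, List[float]] = {}
    for offset, box in enumerate(boxes):
        if len(box) != 4:
            raise ValueError("each box must be [x1, y1, x2, y2]")
        # Assumption: frame numbers are used as string keys.
        bbox[str(st + offset)] = [float(v) for v in box]
    return {"st_frame": st, "ed_frame": ed, "bbox": bbox}


def write_results(results: Dict[str, Dict], path: str = "results.json") -> None:
    """Write all per-video entries to a single JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(results, f)


if __name__ == "__main__":
    # "example_video" is a placeholder id, not a real dataset entry.
    results = {
        "example_video": build_entry(
            3, 5, [[10, 20, 110, 220], [12, 21, 112, 221], [14, 22, 114, 222]]
        )
    }
    write_results(results)
```

Validating the box count before writing is a cheap way to catch off-by-one errors between the predicted frame range and the list of boxes.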

1. $$vIoU$$: $$vIoU = \frac{1}{\left | S_u \right |} \sum_{t \in S_i} IoU(Box^t, Box^{t'})$$, where $$S_i$$ is the set of frames in the intersection of the predicted and ground-truth tubes, $$S_u$$ is the set of frames in their union, and $$Box^t$$ and $$Box^{t'}$$ are the predicted and ground-truth bounding boxes of frame $$t$$. vIoU directly reflects the spatio-temporal accuracy of the prediction.
2. $$vIoU@R$$ stands for the percentage of samples whose $$vIoU$$ is larger than $$R$$.
3. $$mvIoU$$ stands for the mean value of $$vIoU$$ over all samples.
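The three metrics above can be sketched in Python as follows. This is an illustrative reimplementation, not the official evaluation script: the function names are hypothetical, a tube is assumed to be a dict mapping frame index to [x1, y1, x2, y2], box areas are computed from continuous coordinates (no +1 pixel offset), and the thresholds 0.3 and 0.5 are example values for R that the text above does not specify.

```python
from typing import Dict, Iterable, Sequence

Box = Sequence[float]   # [x1, y1, x2, y2]
Tube = Dict[int, Box]   # frame index -> box


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def viou(pred: Tube, gt: Tube) -> float:
    """Sum per-frame IoU over S_i (shared frames), divide by |S_u| (all frames)."""
    s_i = pred.keys() & gt.keys()
    s_u = pred.keys() | gt.keys()
    if not s_u:
        return 0.0
    return sum(box_iou(pred[t], gt[t]) for t in s_i) / len(s_u)


def summarize(vious: Sequence[float],
              thresholds: Iterable[float] = (0.3, 0.5)) -> Dict[str, float]:
    """Return mvIoU and vIoU@R (fraction of samples with vIoU > R)."""
    n = len(vious)
    if n == 0:
        raise ValueError("no samples to summarize")
    out = {"mvIoU": sum(vious) / n}
    for r in thresholds:
        out[f"vIoU@{r}"] = sum(v > r for v in vious) / n
    return out


if __name__ == "__main__":
    # Prediction covers frames 0-9, ground truth covers frames 5-14,
    # with identical boxes on the 5 shared frames: vIoU = 5 / 15.
    pred = {t: [0, 0, 10, 10] for t in range(0, 10)}
    gt = {t: [0, 0, 10, 10] for t in range(5, 15)}
    print(viou(pred, gt))
    print(summarize([viou(pred, gt)]))
```

Because the sum runs only over shared frames while the denominator counts every frame in either tube, a prediction with perfect boxes but a poorly localized time span is still penalized.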

Dataset

Human-centric Spatio-Temporal Video Grounding (HC-STVG) focuses only on humans in videos. We provide 16k annotation-video pairs covering different movie scenes. Specifically, we annotate the description statement and the full trajectory of the corresponding person (a series of bounding boxes). Note that all of our clips include multiple people, which makes video comprehension more challenging.

Data Statistics

Dataset Total Trainval Test Video_len
HCVG 16685 12000 4665 20s

Organizers

Zongheng Tang
Beihang University
Fengguang Peng
Beihang University
Si Liu
Beihang University
Luoqi Liu
Meitu Inc
Yunpeng Chen
Meitu Inc

Citation

@article{tang2021human,
title={Human-centric spatio-temporal video grounding with visual transformers},
author={Tang, Zongheng and Liao, Yue and Liu, Si and Li, Guanbin and Jin, Xiaojie and Jiang, Hongxu and Yu, Qian and Xu, Dong},
journal={IEEE Transactions on Circuits and Systems for Video Technology},
year={2021},
publisher={IEEE}
}