Given an untrimmed video and a description of an object, the spatio-temporal video grounding (STVG) task aims to localize the spatio-temporal tube of the target object referred to by the description. This is a crucial task that entails visual-language cross-modal comprehension.
Inputs: A 20 s video and a description corresponding to a human.
Outputs: The start and end frame numbers, together with the bounding boxes of the target person in every frame of the localized clip.
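As a minimal sketch of what such an output could look like, the snippet below models one prediction as a temporal span plus one box per frame. The class and field names are illustrative assumptions, not the challenge's official submission schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical container for one prediction; field names are
# illustrative, not the challenge's official submission format.
@dataclass
class TubePrediction:
    start_frame: int  # first frame in which the target appears
    end_frame: int    # last frame in which the target appears
    # one (x1, y1, x2, y2) box per frame in [start_frame, end_frame]
    boxes: List[Tuple[int, int, int, int]] = field(default_factory=list)

    def is_consistent(self) -> bool:
        # the number of boxes must match the temporal span
        return len(self.boxes) == self.end_frame - self.start_frame + 1

pred = TubePrediction(
    start_frame=10,
    end_frame=12,
    boxes=[(5, 5, 50, 90), (6, 5, 52, 91), (7, 6, 53, 92)],
)
```

Here a prediction spanning frames 10 to 12 carries exactly three boxes, one per frame, so `pred.is_consistent()` returns `True`.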
Downloads are available at DATA.
Participants should register via this Form before testing.
1st prize: CN¥ 20,000
2nd prize: CN¥ 5,000
3rd prize: CN¥ 2,000
Human-centric Spatio-Temporal Video Grounding (HC-STVG) focuses only on humans in the videos. We provide 16k annotation-video pairs drawn from diverse movie scenes. Specifically, we annotate each description statement and the full trajectory of the corresponding person (a series of bounding boxes). It is worth noting that all of our clips contain multiple people, which increases the challenge of video comprehension.
Dataset | Total | Trainval | Test | Video length
---|---|---|---|---
HC-STVG | 16685 | 12000 | 4665 | 20 s
@article{tang2020human,
title={Human-centric Spatio-Temporal Video Grounding With Visual Transformers},
author={Tang, Zongheng and Liao, Yue and Liu, Si and Li, Guanbin and Jin, Xiaojie and Jiang, Hongxu and Yu, Qian and Xu, Dong},
journal={arXiv preprint arXiv:2011.05049},
year={2020}
}