Given an untrimmed video and a description depicting an object, the spatio-temporal video grounding (STVG) task aims to localize the spatio-temporal tube of the target object referred to by the description. This is a crucial task that entails visual-language cross-modal comprehension.
Inputs: A 20-second video and a description corresponding to a human.
Outputs: The start and end frame numbers, together with the bounding boxes of the target person in every frame of the grounded clip.
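As a rough illustration only (the field names and layout below are hypothetical, not the official submission schema), a prediction for one video-description pair might be organized like this:

```python
# A minimal sketch of one prediction entry: start/end frame numbers plus
# per-frame boxes in (x1, y1, x2, y2) pixel coordinates. All keys are
# illustrative assumptions, not the challenge's required format.
prediction = {
    "video_id": "example_clip_0001",   # hypothetical clip identifier
    "start_frame": 45,                 # first frame the person is grounded
    "end_frame": 180,                  # last frame the person is grounded
    "bboxes": {                        # one box per frame in [start, end]
        45: [120.0, 60.0, 260.0, 400.0],
        46: [122.5, 61.0, 262.0, 401.0],
        # ...
    },
}
```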
We prepare the data based on the data of the PIC 3.0 challenge, with further refinement and extension. Thus, the data format is the same as last year, and participants can tune their models according to that format.
The updated annotations are in the anno_v2 directory, split into train, val, and test sets.
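A minimal loading sketch, assuming the split annotations are stored as per-split JSON files under anno_v2 (the file names and structure are assumptions, not the official layout):

```python
import json
from pathlib import Path

ANNO_DIR = Path("anno_v2")  # directory with the updated annotations

def load_split(split):
    """Load one split ('train', 'val', or 'test'); file name is an assumption."""
    with open(ANNO_DIR / f"{split}.json", "r", encoding="utf-8") as f:
        return json.load(f)

train_annos = load_split("train")
val_annos = load_split("val")
print(f"train: {len(train_annos)} entries, val: {len(val_annos)} entries")
```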
Participants should register via this Form before testing.
1st prize: ¥ 10,000
2nd prize: ¥ 3,000
3rd prize: ¥ 2,000
Human-centric Spatio-Temporal Video Grounding (HC-STVG) focuses only on humans in the videos. We provide 16k annotation-video pairs covering different movie scenes. Specifically, we annotate the description sentence and the full trajectory of the corresponding person (a series of bounding boxes). It is worth noting that all of our clips include multiple people to increase the challenge of video comprehension.
Dataset | Total | Trainval | Test | Video_len |
---|---|---|---|---|
HCVG | 16685 | 12000 | 4665 | 20s |
@article{tang2021human,
title={Human-centric spatio-temporal video grounding with visual transformers},
author={Tang, Zongheng and Liao, Yue and Liu, Si and Li, Guanbin and Jin, Xiaojie and Jiang, Hongxu and Yu, Qian and Xu, Dong},
journal={IEEE Transactions on Circuits and Systems for Video Technology},
year={2021},
publisher={IEEE}
}