Human-centric Spatio-Temporal Video Grounding


Given an untrimmed video and a description of a target object, the spatio-temporal video grounding (STVG) task aims to localize the spatio-temporal tube of the object that the description refers to. It is a crucial task that requires visual-language cross-modal comprehension.
Inputs: A 20 s video and a description corresponding to a person in the video.
Outputs: The start frame and end frame numbers, along with the bounding boxes of the target person during the video clip.
We are preparing the data based on the data of the PIC 3.0 challenge, which we have refined and extended. The data format is therefore the same as last year's, and participants can tune their models according to the Data format.
The updated annotations are in the anno_v2 directory. The data is split into train, val, and test sets.
Participants should register using this Form before testing.


1st prize: ¥ 10,000
2nd prize: ¥ 3,000
3rd prize: ¥ 2,000

Important Dates

Time zone: Beijing, UTC+8

June 10th, 2022: Testing set released and submission opened
June 25th, 2022: Submission deadline
June 26th-30th, 2022: Objective evaluation
July 1st, 2022: Evaluation results announced
July 6th, 2022: Paper submission deadline

Task Rules

  1. The results should be stored in results.json, in the following format:
     'video_id': {
       'st_frame': st,
       'ed_frame': ed,
       'bbox': {
         'st': [x1, y1, x2, y2],
         'st+1': [x1, y1, x2, y2],
         ...
         'ed': [x1, y1, x2, y2]
       }
     }
  2. You have 20 submission chances in total.
  3. The evaluation process can take some time. A failed submission will not reduce your remaining submission chances.
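
The submission format in rule 1 can be sketched in Python as follows. This is only an illustration under stated assumptions, not an official submission tool: it assumes that results.json holds a single top-level object keyed by video ID, that the keys of 'bbox' are frame numbers written as strings (JSON object keys must be strings), that 'ed_frame' is inclusive, and that there is exactly one box per frame. The helper names build_entry, build_results and save_results are hypothetical.

```python
import json


def build_entry(st_frame, boxes):
    """Build one per-video entry.

    boxes: list of [x1, y1, x2, y2], one per frame, starting at st_frame.
    Assumption: ed_frame is inclusive, so it equals st_frame + len(boxes) - 1.
    """
    ed_frame = st_frame + len(boxes) - 1
    return {
        "st_frame": st_frame,
        "ed_frame": ed_frame,
        # Assumption: bbox keys are frame numbers as strings.
        "bbox": {
            str(st_frame + i): [float(v) for v in box]
            for i, box in enumerate(boxes)
        },
    }


def build_results(predictions):
    """predictions: dict mapping video_id -> (st_frame, boxes)."""
    return {
        video_id: build_entry(st_frame, boxes)
        for video_id, (st_frame, boxes) in predictions.items()
    }


def save_results(results, path="results.json"):
    """Write the results dictionary to disk as JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(results, f)


# Hypothetical prediction: a tube covering frames 10 and 11 of one video.
results = build_results({"example_video": (10, [[1, 2, 3, 4], [2, 3, 4, 5]])})
```

Calling save_results(results) would then write the file to submit. Check the exact key and coordinate conventions against the Data format before submitting.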

Task Metric

  1. \(vIoU\): \(vIoU = \frac{1}{\left | S_u \right |} \sum_{t \in S_i} IoU(Box^t, Box^{t'})\), where \(S_i\) is the set of frames in the intersection of the selected and ground-truth tubes, \(S_u\) is the set of frames in their union, and \(Box^t\) and \(Box^{t'}\) are the predicted and ground-truth bounding boxes of frame \(t\). \(vIoU\) directly reflects the spatio-temporal accuracy of the prediction.
  2. \(vIoU@R\) stands for the percentage of samples whose \(vIoU\) is larger than \(R\).
  3. \(mvIoU\) stands for the mean value of \(vIoU\) over all samples.
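
The three metrics above can be sketched in Python as follows. This is a minimal illustration, not the official evaluation script: it assumes each tube is a dict mapping frame number to [x1, y1, x2, y2], that box areas use continuous coordinates (no +1 pixel offset), and that vIoU@R is returned as a fraction rather than a percentage. The function names are hypothetical.

```python
def box_iou(a, b):
    """IoU of two boxes given as [x1, y1, x2, y2]."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def viou(pred, gt):
    """vIoU of a predicted tube and a ground-truth tube (frame -> box)."""
    s_i = pred.keys() & gt.keys()  # frames in both tubes
    s_u = pred.keys() | gt.keys()  # frames in either tube
    if not s_u:
        return 0.0
    return sum(box_iou(pred[t], gt[t]) for t in s_i) / len(s_u)


def mviou(scores):
    """Mean vIoU over all samples."""
    return sum(scores) / len(scores) if scores else 0.0


def viou_at(scores, r):
    """Fraction of samples whose vIoU is strictly larger than r."""
    return sum(s > r for s in scores) / len(scores) if scores else 0.0


# Hypothetical example: the tubes overlap on frames 1 and 2 out of frames 0..3.
pred = {0: [0, 0, 10, 10], 1: [0, 0, 10, 10], 2: [0, 0, 10, 10]}
gt = {1: [0, 0, 10, 10], 2: [5, 0, 15, 10], 3: [5, 0, 15, 10]}
score = viou(pred, gt)
```

In this example the per-frame IoU is 1 on frame 1 and 1/3 on frame 2, and the union spans four frames, so the score is (1 + 1/3) / 4 = 1/3. Frames that appear in only one tube contribute nothing to the sum but still count in the denominator, which is how temporal errors are penalized.
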

Dataset

    Human-centric Spatio-Temporal Video Grounding (HC-STVG) focuses only on humans in videos. We provide 16k annotation-video pairs covering different movie scenes. Specifically, we annotated the description statement and the full trajectory of the corresponding person (a series of bounding boxes). Note that all of our clips include multiple people, which makes video comprehension more challenging.

    Data Statistics

    Dataset Total Trainval Test Video_len
    HCVG 16685 12000 4665 20s

    Image Example



    title={Human-centric spatio-temporal video grounding with visual transformers},
    author={Tang, Zongheng and Liao, Yue and Liu, Si and Li, Guanbin and Jin, Xiaojie and Jiang, Hongxu and Yu, Qian and Xu, Dong},
    journal={IEEE Transactions on Circuits and Systems for Video Technology},