Written By Zach Johnson

AI and tech enthusiast with a background in machine learning.

Focusing on Humans in Spatio-Temporal Video Localization

This human-centric Spatio-Temporal Video Localization (STVL) task asks participants to localize, in both time and space, the trajectory of the target person that matches a given description. It is an important problem in visual-language cross-modal understanding.

Input: A 20-second video and a human-centric description.
Output: The start and end frame numbers, plus the bounding boxes enclosing the target person in the frames of that segment.
Please download necessary data from DATA.
Prospective participants must complete registration through this Form prior to testing.

1st place: CN¥ 20,000
2nd place: CN¥ 5,000
3rd place: CN¥ 2,000

All times are in Beijing time (UTC+8).
Key Dates:

  • Training/Validation set released: April 1st, 10:00:00, 2021
  • Testing set released and submission opened: May 6th, 10:00:00, 2021
  • Submission deadline: June 4th, 10:00:00, 2021
  • Challenge winners notified: June 8th, 10:00:00, 2021
  • Winners present at CVPR 2021 Workshop: June 20th, 2021

Participants must save their results in a results.json file in the following format:
{
 'video_id': {
  'st_frame': st,
  'ed_frame': ed,
  'bbox': {
   'st': [x1, y1, x2, y2],
   'st+1': [x1, y1, x2, y2],
   ...
   'ed': [x1, y1, x2, y2]
  }
 }
}
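As an illustration of the format above, here is a minimal Python sketch that builds one entry per video and writes results.json. It is not the organizers' tooling. The helper names build_entry and write_results are hypothetical, and the sketch assumes that bounding-box keys are frame numbers written as strings (JSON object keys must be strings), that ed_frame is inclusive, and that each box is given as corner coordinates [x1, y1, x2, y2].

```python
import json


def build_entry(st_frame, boxes):
    """Build one results entry (hypothetical helper).

    boxes holds one [x1, y1, x2, y2] list per frame, starting at st_frame.
    Assumption: ed_frame is inclusive, so it equals st_frame + len(boxes) - 1.
    """
    if not boxes:
        raise ValueError("at least one bounding box is required")
    ed_frame = st_frame + len(boxes) - 1
    return {
        "st_frame": st_frame,
        "ed_frame": ed_frame,
        # Assumption: keys are frame numbers as strings, from st to ed.
        "bbox": {
            str(st_frame + i): [float(v) for v in box]
            for i, box in enumerate(boxes)
        },
    }


def write_results(results, path="results.json"):
    """Write a {video_id: entry} mapping to disk as JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(results, f)


# Example with a made-up video id and three frames of predictions.
results = {
    "video_0001": build_entry(5, [[10, 20, 110, 220],
                                  [12, 21, 112, 221],
                                  [14, 22, 114, 222]]),
}
```

Calling write_results(results) would then produce the file to submit. Note that json.dump writes double-quoted keys, which is what a JSON parser expects.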

Each participant is allowed a total of 10 submission attempts.
Note: Evaluation may take some time. A failed submission does not count against your submission attempts.

We use vIoU to measure the accuracy of spatio-temporal predictions. It computes the intersection over union (IoU) between the predicted and ground-truth bounding boxes for each frame, and then averages these values over all frames.
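The description of vIoU does not say which frames are averaged over, so the following Python sketch fixes one reading as an explicit assumption: per-frame IoU is summed over the frames where the prediction and the ground truth both have a box, and the sum is divided by the number of frames covered by either of them. This is a convention often used for spatio-temporal grounding, but the official evaluator may differ. The function names box_iou and viou are hypothetical.

```python
def box_iou(a, b):
    """IoU of two boxes given as [x1, y1, x2, y2] corner coordinates."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Width and height of the overlap region, clamped at zero.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def viou(pred, gt):
    """vIoU between two {frame_number: box} mappings.

    Assumption: frames present in only one of the two mappings
    contribute an IoU of zero, so the denominator is the size of
    the union of both frame sets.
    """
    union_frames = set(pred) | set(gt)
    if not union_frames:
        return 0.0
    shared_frames = set(pred) & set(gt)
    total = sum(box_iou(pred[t], gt[t]) for t in shared_frames)
    return total / len(union_frames)


# The prediction covers frames 0-1 and the ground truth frames 1-2,
# with identical boxes on the single shared frame.
pred = {0: [0, 0, 10, 10], 1: [0, 0, 10, 10]}
gt = {1: [0, 0, 10, 10], 2: [0, 0, 10, 10]}
score = viou(pred, gt)  # one perfect frame out of three covered frames
```

Under this reading, a prediction is penalized both for loose boxes and for a start or end frame that misses the annotated segment.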

The human-centric Spatio-Temporal Video Localization task (HC-STVL) uses only videos that feature humans. We provide roughly 16,000 annotated video-description pairs drawn from various movie scenes. The annotations include descriptions and trajectories for all persons in the frame. To make the task more challenging, every clip features multiple people.

Dataset Specifications:

  • Dataset: HCVG
  • Total: 16,685
  • Training/Validation: 12,000
  • Test: 4,665
  • Video Length: 20s
