Make-up Temporal Video Grounding


Given an untrimmed make-up video and a step query, Make-up Temporal Video Grounding (MTVG) aims to localize the target make-up step in the video. This task requires models to align fine-grained video-text semantics and distinguish make-up steps with subtle differences.
Inputs: An untrimmed make-up video (ranging from 15s to 1h) and a make-up step description.
Outputs: The temporal boundary of the queried step in the video.
Downloads and baselines are available here.
Testing is hosted on CodaLab.
Participants should register via this form before testing.


Prizes

1st prize: ¥10,000
2nd prize: ¥3,000
3rd prize: ¥2,000

Important Dates

Time zone: Beijing, UTC+8

April 25th, 2022: Training / validation set released
June 10th, 2022: Testing set released and submission opened
June 25th, 2022: Submission deadline
June 26th-30th, 2022: Objective evaluation
July 1st, 2022: Evaluation results announced
July 6th, 2022: Paper submission deadline

Task Rules

  1. Participants should submit a timestamp candidate for each (video, text query) input.
  2. The results should be stored in results.json, mapping each query index to its predicted segment:
     "query_idx": [start_time, end_time],
  3. Each team can submit results.json once per day.
  4. The evaluation process may take some time; a failed submission does not reduce the remaining submission chances.
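The submission format above can be sketched as follows. This is a minimal, illustrative example: the query indices and timestamps are hypothetical, and the official submission instructions take precedence.

```python
import json

# Hypothetical predictions: one [start_time, end_time] segment per
# query index. Indices and times (in seconds) are illustrative only.
results = {
    "0": [12.0, 30.5],
    "1": [45.2, 78.9],
}

# Write the predictions in the required results.json format.
with open("results.json", "w") as f:
    json.dump(results, f)
```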

Task Metric

We adopt \(R@n, IoU=m\) with n in {1} and m in {0.3, 0.5, 0.7} as evaluation metrics: the percentage of queries for which at least one of the top-n predicted segments has an Intersection over Union (IoU) with the ground truth larger than m.
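The metric above can be sketched as follows. This is an assumed implementation for illustration; the official evaluation script may differ in details (e.g. tie-breaking or boundary handling).

```python
def temporal_iou(pred, gt):
    """IoU of two temporal segments [start, end], in seconds."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return intersection / union if union > 0 else 0.0

def recall_at_n(predictions, ground_truths, n=1, iou_threshold=0.5):
    """R@n, IoU=m: fraction of queries whose top-n predictions
    contain a segment with IoU > iou_threshold vs. the ground truth."""
    hits = 0
    for query_idx, gt in ground_truths.items():
        top_n = predictions[query_idx][:n]
        if any(temporal_iou(p, gt) > iou_threshold for p in top_n):
            hits += 1
    return hits / len(ground_truths)

# Toy data: query "0" is localized well, query "1" misses entirely.
preds = {"0": [[12.0, 30.0]], "1": [[50.0, 60.0]]}
gts = {"0": [10.0, 28.0], "1": [100.0, 120.0]}
print(recall_at_n(preds, gts, n=1, iou_threshold=0.5))  # 0.5
```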
Dataset

Make-up instructional videos are naturally more fine-grained than open-domain videos. Different steps share similar backgrounds but contain subtle yet critical differences, such as fine-grained actions, tools, and applied facial areas, all of which can produce different effects on the face.
We use the YouMakeup dataset, which contains 2,800 make-up instructional videos from YouTube, spanning more than 420 hours in total. Each video is annotated with a sequence of steps, including the temporal boundaries, grounded facial areas, and a natural language description of each step. There are 30,626 steps in total, with 10.9 steps per video on average. Video length varies from 15s to 1h, with an average of 9 min.

Data Statistics

Dataset    Total  Train  Val  Test  Video length
YouMakeup  2,800  1,680  280  840   15s-1h



Organizers

Linli Yao, Ludan Ruan, Shuwei Liu, Shunyao Yu, Qin Jin
Renmin University of China


@inproceedings{wang2019youmakeup,
    title={YouMakeup: A Large-Scale Domain-Specific Multimodal Dataset for Fine-Grained Semantic Comprehension},
    author={Wang, Weiying and Wang, Yongcheng and Chen, Shizhe and Jin, Qin},
    booktitle={Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)},
    year={2019}
}