Make-up Temporal Video Grounding (MTVG) Challenge

Written By Zach Johnson

AI and tech enthusiast with a background in machine learning.

Given an untrimmed make-up video and a step query, the Make-up Temporal Video Grounding (MTVG) task aims to localize the target make-up step in the video. This task requires models to align fine-grained video-text semantics and to distinguish make-up steps that differ only subtly.

Inputs: An untrimmed make-up video ranging from 15s to 1h and a make-up step description.
Outputs: The temporal boundary of the step query located in the video.

Downloads and baselines are available here.

Testing takes place on Codalab.

Participants should register using this form before testing.


Prizes

  • 1st prize: ¥ 10,000
  • 2nd prize: ¥ 3,000
  • 3rd prize: ¥ 2,000

Schedule (Beijing, UTC+8)

  • April 25th, 2022 – Training / validation set released
  • June 10th, 2022 – Testing set released and submission opened
  • June 25th, 2022 – Submission deadline
  • June 26th-30th, 2022 – Objective evaluation
  • July 1st, 2022 – Evaluation results announced
  • July 6th, 2022 – Paper submission deadline

Submission Details

Participants should submit the timestamp candidate for each (video, text query) input. The results should be stored in results.json, with the following format:

  'query_idx': [start_time, end_time],

Each team can submit the results.json once a day. The evaluation process can be time-consuming and a failed submission will not reduce the number of submission chances.
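
As a rough illustration of the submission format above, the following Python sketch builds and writes a results.json file. It assumes the file is a single JSON object keyed by query index, and that timestamps are floats in seconds; neither detail is confirmed here, so check the official baselines before submitting. The names build_results and save_results are hypothetical helpers, not part of any challenge toolkit.

```python
import json


def build_results(predictions):
    """Turn {query_idx: (start_time, end_time)} into a JSON-ready dict.

    Assumption: keys are stored as strings and times are floats in seconds.
    """
    results = {}
    for query_idx, (start, end) in predictions.items():
        if end < start:
            raise ValueError(f"query {query_idx}: end_time precedes start_time")
        results[str(query_idx)] = [float(start), float(end)]
    return results


def save_results(predictions, path="results.json"):
    """Write the predictions to disk as one JSON object."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(build_results(predictions), f)


# Hypothetical predictions for two queries.
example = build_results({0: (12.5, 30.0), 1: (81.25, 95.0)})
```

Calling save_results with the same dictionary would write the file to the current directory.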

Evaluation Metrics

We adopt \( R@n, IoU=m \) with \( n \) in {1} and \( m \) in {0.3, 0.5, 0.7} as evaluation metrics. This is the percentage of queries for which at least one of the top-\( n \) results has an Intersection over Union (IoU) with the ground truth larger than \( m \).
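
To make the metric concrete, here is a minimal Python sketch of \( R@1, IoU=m \), assuming one predicted segment per query and segments given as [start_time, end_time]. It is not the official evaluation script: the function names are hypothetical, queries with no prediction are scored as misses by assumption, and the strict "larger than" comparison follows the wording above.

```python
def temporal_iou(pred, gt):
    """IoU of two temporal segments given as [start, end]."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(preds, gts, thresholds=(0.3, 0.5, 0.7)):
    """Fraction of queries whose top-1 prediction has IoU > m, for each m.

    Assumption: a query missing from preds counts as IoU 0.
    """
    ious = [temporal_iou(preds[q], gts[q]) if q in preds else 0.0 for q in gts]
    return {m: sum(iou > m for iou in ious) / len(ious) for m in thresholds}


# Hypothetical ground truth and predictions for two queries.
gts = {"0": [10, 20], "1": [30, 50]}
preds = {"0": [12, 20], "1": [40, 70]}
scores = recall_at_1(preds, gts)
```

In this toy example the first prediction overlaps its ground truth with IoU 0.8 and the second with IoU 0.25, so only the first query counts as a hit at every threshold.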

About the Dataset

Makeup instructional videos are inherently more fine-grained than open-domain videos. Different steps may share similar backgrounds yet differ in subtle but vital ways, such as the specific actions, tools, and facial areas involved, which lead to different facial effects.

We use the YouMakeup dataset, containing 2,800 makeup instructional videos from YouTube, totaling over 420 hours. Each video has annotations for a series of steps, including temporal boundaries, grounded facial areas, and natural language descriptions. There are 30,626 steps, averaging 10.9 steps per video. Videos range from 15s to 1h, averaging 9 minutes.

YouMakeup Dataset Details

Dataset     Total   Train   Val   Test   Video_len
YouMakeup   2,800   1,680   280   840    15s-1h
