Make-up Temporal Video Grounding


Given an untrimmed make-up video and a step query, Make-up Temporal Video Grounding (MTVG) aims to localize the target make-up step in the video. This task requires models to align fine-grained video-text semantics and distinguish make-up steps with subtle differences.
Inputs: An untrimmed make-up video (ranging from 15s to 1h) and a make-up step description.
Outputs: The temporal boundary of the queried step in the video.
Downloads and baselines are available here.
Testing is hosted on CodaLab.
Participants should register via this form before testing.


Prizes

1st prize: ¥10,000
2nd prize: ¥3,000
3rd prize: ¥2,000

Important Dates

Time zone: Beijing, UTC+8

April 25th, 2022: Training / validation set released
June 10th, 2022: Testing set released and submission opened
June 25th, 2022: Submission deadline
June 26th-30th, 2022: Objective evaluation
July 1st, 2022: Evaluation results announced
July 6th, 2022: Paper submission deadline

Task Rules

  1. Participants should submit a timestamp candidate for each (video, text query) input.
  2. The results should be stored in results.json, mapping each query index to its predicted segment:
     "query_idx": [start_time, end_time],
  3. Each team can submit results.json once per day.
  4. The evaluation process may take some time; a failed submission does not reduce the remaining submission chances.
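The submission format above can be sketched as follows. This is a minimal, illustrative example: the query indices and timestamps are hypothetical, and the official submission instructions take precedence.

```python
import json

# Hypothetical predictions: one [start_time, end_time] segment per
# query index. Indices and times (in seconds) are illustrative only.
results = {
    "0": [12.0, 30.5],
    "1": [45.2, 78.9],
}

# Write the predictions in the required results.json format.
with open("results.json", "w") as f:
    json.dump(results, f)
```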

Task Metric

We adopt \(R@n, IoU=m\) with n in {1} and m in {0.3, 0.5, 0.7} as evaluation metrics: the percentage of queries for which at least one of the top-n predicted segments has an Intersection over Union (IoU) with the ground truth larger than m.
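The metric above can be sketched as follows. This is an assumed implementation for illustration; the official evaluation script may differ in details (e.g. tie-breaking or boundary handling).

```python
def temporal_iou(pred, gt):
    """IoU of two temporal segments [start, end], in seconds."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return intersection / union if union > 0 else 0.0

def recall_at_n(predictions, ground_truths, n=1, iou_threshold=0.5):
    """R@n, IoU=m: fraction of queries whose top-n predictions
    contain a segment with IoU > iou_threshold vs. the ground truth."""
    hits = 0
    for query_idx, gt in ground_truths.items():
        top_n = predictions[query_idx][:n]
        if any(temporal_iou(p, gt) > iou_threshold for p in top_n):
            hits += 1
    return hits / len(ground_truths)

# Toy data: query "0" is localized well, query "1" misses entirely.
preds = {"0": [[12.0, 30.0]], "1": [[50.0, 60.0]]}
gts = {"0": [10.0, 28.0], "1": [100.0, 120.0]}
print(recall_at_n(preds, gts, n=1, iou_threshold=0.5))  # 0.5
```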
Dataset

Make-up instructional videos are naturally more fine-grained than open-domain videos. Different steps share similar backgrounds but contain subtle yet critical differences, such as fine-grained actions, tools, and applied facial areas, all of which can produce different effects on the face.
We use the YouMakeup dataset, which contains 2,800 make-up instructional videos from YouTube, spanning more than 420 hours in total. Each video is annotated with a sequence of steps, including the temporal boundaries, grounded facial areas, and a natural language description of each step. There are 30,626 steps in total, with 10.9 steps per video on average. Video length varies from 15s to 1h, with an average of 9 min.

Data Statistics

Dataset    Total  Train  Val  Test  Video length
YouMakeup  2,800  1,680  280  840   15s-1h



Organizers

Linli Yao, Ludan Ruan, Shuwei Liu, Shunyao Yu, Qin Jin
Renmin University of China


@inproceedings{wang2019youmakeup,
    title={YouMakeup: A Large-Scale Domain-Specific Multimodal Dataset for Fine-Grained Semantic Comprehension},
    author={Wang, Weiying and Wang, Yongcheng and Chen, Shizhe and Jin, Qin},
    booktitle={Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)},
    year={2019}
}