Given an untrimmed make-up video, the Make-up Dense Video Captioning (MDVC) task aims to localize and describe a sequence of make-up steps in the target video. This task requires models to both detect and describe fine-grained make-up events in a video.
Inputs: An untrimmed make-up video, ranging from 15s to 1h in length.
Outputs: The temporal boundaries and natural language descriptions of each make-up step.
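To make the expected output concrete, here is a minimal sketch of what a per-video prediction could look like: a list of temporal segments (in seconds) paired with step captions. The field names ("video_id", "steps", "segment", "caption") are illustrative assumptions, not the challenge's official submission format.

```python
# Hypothetical MDVC prediction for one video: each entry pairs a
# temporal segment (start/end in seconds) with a step description.
# All field names are illustrative, not the official schema.
prediction = {
    "video_id": "example_video",  # hypothetical identifier
    "steps": [
        {"segment": [12.0, 45.5], "caption": "Apply foundation evenly on the face."},
        {"segment": [46.0, 80.0], "caption": "Blend eyeshadow on the eyelids."},
    ],
}

def validate(pred):
    """Check that each segment is well-ordered and each caption is non-empty."""
    for step in pred["steps"]:
        start, end = step["segment"]
        assert 0 <= start < end, "segment boundaries must be increasing"
        assert step["caption"].strip(), "caption must be non-empty"
    return True
```

A real submission would contain one such record per test video; consult the challenge page for the authoritative format.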
Downloads and baselines are available Here.
Testing is hosted on CodaLab.
Participants should register via this Form before testing.
1st prize: ¥ 10,000
2nd prize: ¥ 3,000
3rd prize: ¥ 2,000
Make-up instructional videos are naturally more fine-grained than open-domain videos. Different steps share similar backgrounds but contain subtle yet critical differences, such as fine-grained actions, tools, and applied facial areas, all of which can result in different effects on the face.
We utilize the YouMakeup dataset, which contains 2,800 make-up instructional videos from YouTube, spanning more than 420 hours in total. Each video is annotated with a sequence of steps, including temporal boundaries, grounded facial areas, and natural language descriptions of each step. There are 30,626 steps in total, with 10.9 steps per video on average. Video length varies from 15s to 1h, with an average of 9 min.
Dataset | Total | Train | Val | Test | Video Length
---|---|---|---|---|---
YouMakeup | 2,800 | 1,680 | 280 | 840 | 15s-1h
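The per-video statistics above (e.g. 10.9 steps on average) can be computed from such step annotations with small helpers like the following. The record layout and field names ("steps", "segment", "caption") are assumptions for illustration, not the dataset's released schema.

```python
# Hypothetical YouMakeup-style annotation records: each video carries a
# list of steps, each with a temporal segment (seconds) and a caption.
# Field names ("steps", "segment", "caption") are illustrative assumptions.
def average_steps(videos):
    """Mean number of annotated steps per video (reported as 10.9 for YouMakeup)."""
    return sum(len(v["steps"]) for v in videos) / len(videos)

def step_durations(videos):
    """Flat list of step durations in seconds across all videos."""
    return [
        seg[1] - seg[0]
        for v in videos
        for seg in (s["segment"] for s in v["steps"])
    ]
```

For example, two videos with one and two annotated steps respectively would yield an average of 1.5 steps per video.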
@inproceedings{wang2019youmakeup,
title={YouMakeup: A Large-Scale Domain-Specific Multimodal Dataset for Fine-Grained Semantic Comprehension},
author={Wang, Weiying and Wang, Yongcheng and Chen, Shizhe and Jin, Qin},
booktitle={Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)},
pages={5136--5146},
year={2019}
}