|Li Yang, Peixuan Wu, Chunfeng Yuan, Bing Li, Weiming Hu (VSLab, NLPR, CASIA)
||Description of the method:
Our method builds on TubeDETR with several methodological improvements. Specifically, we use two cascaded decoders:
the first decoder spatially grounds the target, and its outputs are fed to the second decoder
for temporal grounding. At inference time, we apply a two-step grounding strategy that first produces a
coarse estimate of the target and then samples keyframes to refine that estimate. Finally, we ensemble
multiple models to obtain more accurate temporal and spatial grounding results.
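
The two-step inference and the ensemble can be illustrated with a minimal sketch of the temporal part only. The description above does not specify how keyframes are sampled or how models are combined, so everything in it is an assumption made for illustration: each model is stood in for by a hypothetical `Scorer` that returns a per-frame "target present" score, the ensemble is a plain average of those scores, and the span is found by thresholding. The names `uniform_sample`, `ensemble_scores`, `temporal_span`, and `two_step_grounding`, and the parameters `coarse_n`, `fine_n`, `margin`, and `threshold`, are hypothetical and are not taken from the submission.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical stand-in for a trained model: maps frame indices to
# per-frame "target present" scores in [0, 1].
Scorer = Callable[[Sequence[int]], List[float]]


def uniform_sample(start: int, end: int, n: int) -> List[int]:
    """Pick up to n roughly evenly spaced frame indices in [start, end]."""
    if end <= start or n <= 1:
        return [start]
    n = min(n, end - start + 1)
    step = (end - start) / (n - 1)
    # A set removes duplicates that rounding may produce.
    return sorted({round(start + i * step) for i in range(n)})


def ensemble_scores(scorers: Sequence[Scorer], frames: Sequence[int]) -> List[float]:
    """Average per-frame scores across models (assumed ensembling rule)."""
    per_model = [scorer(frames) for scorer in scorers]
    return [sum(column) / len(per_model) for column in zip(*per_model)]


def temporal_span(frames: Sequence[int], scores: Sequence[float],
                  threshold: float = 0.5) -> Tuple[int, int]:
    """Return the first and last sampled frame whose score passes the threshold."""
    hits = [f for f, s in zip(frames, scores) if s >= threshold]
    if not hits:
        # Nothing passes: fall back to the single highest-scoring frame.
        best = max(zip(scores, frames))[1]
        return best, best
    return hits[0], hits[-1]


def two_step_grounding(scorers: Sequence[Scorer], num_frames: int,
                       coarse_n: int = 8, fine_n: int = 16, margin: int = 2,
                       threshold: float = 0.5) -> Tuple[int, int]:
    """Coarse estimate on sparse frames, then refinement on resampled keyframes."""
    # Step 1: coarse estimate from sparsely sampled frames.
    coarse_frames = uniform_sample(0, num_frames - 1, coarse_n)
    c_start, c_end = temporal_span(
        coarse_frames, ensemble_scores(scorers, coarse_frames), threshold)

    # Step 2: resample keyframes around the coarse span (widened by margin)
    # and re-estimate the span on that denser set.
    lo = max(0, c_start - margin)
    hi = min(num_frames - 1, c_end + margin)
    key_frames = uniform_sample(lo, hi, fine_n)
    return temporal_span(
        key_frames, ensemble_scores(scorers, key_frames), threshold)
```

In this sketch, `margin` should be at least as large as the coarse sampling stride. Otherwise a boundary that falls between two coarse samples cannot be recovered in the second step.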