Discovering Primary Objects in Videos by Saliency Fusion and Iterative Appearance Estimation
Jiong Yang, Gangqiang Zhao, Junsong Yuan, Xiaohui Shen, Zhe Lin, Brian Price, Jonathan Brandt
Abstract
In this paper, we propose a new method for detecting primary objects in unconstrained videos in a completely automatic setting. Here, we define the primary object in a video as the object that appears saliently in most of the frames. Unlike previous works that only consider local saliency detection or common pattern discovery, the proposed method integrates the local visual/motion saliency extracted from each frame, global appearance consistency throughout the video, and a spatio-temporal smoothness constraint on object trajectories. We first identify a temporally coherent salient region throughout the whole video, and then explicitly learn a discriminative model of the primary object's global appearance against the background, which distinguishes the primary object from salient background regions. In order to obtain high-quality saliency estimates from both appearance and motion cues, we propose a novel self-adaptive saliency map fusion method that learns the reliability of each saliency map from labelled data. As a whole, our method robustly localizes and tracks primary objects in diverse video content, and handles challenges such as fast object and camera motion, large scale and appearance variation, background clutter, and pose deformation. Moreover, compared to existing approaches that assume the object is present in all frames, our approach naturally handles the case where the object is present in only part of the frames, e.g., the object enters the scene in the middle of the video or leaves the scene before the video ends. We also propose a new video dataset containing 51 videos for primary object detection with per-frame ground-truth labeling. Quantitative experiments on several challenging video datasets demonstrate the superiority of our method over recent state-of-the-art approaches.
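To give a concrete flavour of the spatio-temporal smoothness constraint on object trajectories mentioned above, the sketch below selects one candidate box per frame by dynamic programming, trading off per-frame saliency score against inter-frame displacement. This is only an illustration, not the paper's exact formulation: the quadratic motion penalty, the weight `lam`, and the candidate-box scoring are all assumptions.

```python
# Hedged sketch: smooth object trajectory via dynamic programming.
# scores[t, k]  : saliency score of candidate box k in frame t (assumed given)
# centers[t, k]: (x, y) centre of candidate box k in frame t
import numpy as np

def smooth_track(scores, centers, lam=1.0):
    """Pick one box per frame maximizing total saliency minus motion cost."""
    T, K = scores.shape
    dp = np.zeros((T, K))              # best accumulated score ending at (t, k)
    back = np.zeros((T, K), dtype=int) # backpointers for path recovery
    dp[0] = scores[0]
    for t in range(1, T):
        # d[i, j]: distance from box i in frame t-1 to box j in frame t
        d = np.linalg.norm(centers[t][None, :, :] - centers[t - 1][:, None, :], axis=2)
        cand = dp[t - 1][:, None] - lam * d ** 2 + scores[t][None, :]
        back[t] = cand.argmax(axis=0)
        dp[t] = cand.max(axis=0)
    # Backtrack the highest-scoring smooth path.
    path = [int(dp[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy usage: 10 frames, 5 candidate boxes each.
rng = np.random.default_rng(0)
print(smooth_track(rng.random((10, 5)), rng.random((10, 5, 2)) * 100))
```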
Workflow
Results
Examples of the various types of saliency maps and the fusion results obtained by simple averaging and by our proposed SVM-Fusion method (without nonlinear weight adjustment and warping). The first five examples are from the NTU-Adobe dataset; the last example is from the Camo and Hollywood2 datasets.
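The sketch below illustrates the idea behind SVM-Fusion as described above: predict a reliability weight for each candidate saliency map from simple quality cues, then fuse the maps with those weights. The quality features, the F-measure reliability target, and the use of sklearn's SVR are illustrative assumptions, not the paper's exact design.

```python
# Hedged sketch of self-adaptive saliency-map fusion with learned reliability.
import numpy as np
from sklearn.svm import SVR

def quality_features(smap):
    """Simple quality cues for one saliency map: mean, contrast, and spatial
    spread of high-saliency pixels (illustrative feature choices only)."""
    h, w = smap.shape
    ys, xs = np.nonzero(smap > smap.mean())
    if len(xs) == 0:
        return np.zeros(4)
    spread = (xs.std() / w + ys.std() / h) / 2.0
    return np.array([smap.mean(), smap.std(), smap.max() - smap.mean(), spread])

def train_reliability_model(maps, gts):
    """Learn to predict a map's reliability from labelled data.
    maps: list of 2-D saliency maps; gts: matching binary ground-truth masks."""
    X = np.stack([quality_features(m) for m in maps])
    y = []
    for m, g in zip(maps, gts):
        pred = m > m.mean()
        tp = np.logical_and(pred, g).sum()
        prec = tp / max(pred.sum(), 1)
        rec = tp / max(g.sum(), 1)
        y.append(2 * prec * rec / max(prec + rec, 1e-6))  # F-measure target
    return SVR(kernel="linear").fit(X, np.array(y))

def fuse(maps, model):
    """Weight each candidate saliency map by its predicted reliability."""
    w = np.array([model.predict(quality_features(m)[None])[0] for m in maps])
    w = np.clip(w, 0, None)
    w = w / max(w.sum(), 1e-6)
    return sum(wi * m for wi, m in zip(w, maps))
```

Compared with uniform averaging, this scheme can down-weight a saliency map that is diffuse or low-contrast on a given frame, which is the behaviour the figure above contrasts.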
Comparisons of the detection results with and without foreground modeling for cases where purely saliency-based detection fails. Each row corresponds to one video, and the first column shows the overall detection accuracy in CDR. In the subsequent columns, the red and green boxes indicate the detection results before and after foreground modeling, respectively.
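The foreground modeling step compared above can be sketched as an iterative loop: bootstrap confident foreground/background pixels from the fused saliency, fit a discriminative appearance model, and re-score every pixel. In this minimal sketch, logistic regression on raw RGB stands in for the paper's discriminative model, and the 0.8/0.2 confidence thresholds and 50/50 blend with the saliency prior are assumptions.

```python
# Hedged sketch of iterative appearance estimation from fused saliency.
import numpy as np
from sklearn.linear_model import LogisticRegression

def iterate_appearance(frames, saliency, n_iters=3, hi=0.8, lo=0.2):
    """frames: (T, H, W, 3) RGB in [0, 1]; saliency: (T, H, W) fused maps.
    Returns a refined per-pixel foreground probability."""
    prob = saliency.copy()
    for _ in range(n_iters):
        X_fg = frames[prob > hi]   # confident foreground pixels, shape (Nf, 3)
        X_bg = frames[prob < lo]   # confident background pixels, shape (Nb, 3)
        if len(X_fg) == 0 or len(X_bg) == 0:
            break
        X = np.vstack([X_fg, X_bg])
        y = np.r_[np.ones(len(X_fg)), np.zeros(len(X_bg))]
        clf = LogisticRegression(max_iter=200).fit(X, y)
        # Appearance likelihood for all pixels, blended with the saliency prior.
        app = clf.predict_proba(frames.reshape(-1, 3))[:, 1].reshape(prob.shape)
        prob = 0.5 * app + 0.5 * saliency
    return prob
```

The loop captures why foreground modeling helps in the failure cases above: a globally consistent appearance model can suppress salient background regions that fool purely per-frame saliency.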
Video Demo
Dataset
The NTU-Adobe dataset and the ground truth annotations can be downloaded from here.
Note: the videos in this dataset are either downloaded from YouTube or obtained from other public datasets, and may be subject to copyright. We do not own the copyright of the videos and only provide them for non-commercial research purposes.
Publication
J. Yang, G. Zhao, J. Yuan, X. Shen, Z. Lin, B. Price, and J. Brandt, "Discovering Primary Objects in Videos by Saliency Fusion and Iterative Appearance Estimation," accepted by IEEE Transactions on Circuits and Systems for Video Technology, 2015.
Acknowledgement
This work is supported in part by an Adobe gift grant and the Singapore Ministry of Education Academic Research Fund (AcRF) Tier 1 grant M4011272.040. This research was carried out at the Rapid-Rich Object Search (ROSE) Lab at Nanyang Technological University, Singapore. The ROSE Lab is supported by the National Research Foundation, Prime Minister's Office, Singapore, under its IDM Futures Funding Initiative and administered by the Interactive and Digital Media Programme Office.