Our team at IBM Research recently developed a new approach for automated video scene detection. Videos are used today for everything from entertainment and marketing to knowledge-sharing, news, and social journaling. Automated scene detection can help consumers and enterprises utilize this video content in new ways.
Video scene detection is the task of temporally dividing a video into semantic scenes. A scene is defined as a series of consecutive shots that depicts some high-level concept or story, where each shot is a series of frames taken from the same camera at the same time. In a Hollywood film, for example, a scene may be a car chase or a relaxed picnic on a beach. In a talk show or news broadcast, a scene might be defined by a specific topic of discussion.
Once a video has been split into scenes, each scene can be tagged and enriched with metadata—based on what is said, where it occurs, what activity is taking place, the setting, the background—in order to augment the video with consumable information. Dividing a video into scenes and using AI to label these segments enables the creation of an inverted index for video content that is searchable.
We formulated the scene detection task as a generic optimization problem with a novel normalized cost function, aimed at optimally grouping consecutive shots into scenes. We describe the technique in our paper, "Optimally Grouped Deep Features Using Normalized Cost for Video Scene Detection" (by Daniel Rotman, Dror Porat, Gal Ashour and Udi Barzelay), presented at the 2018 ACM International Conference on Multimedia Retrieval. We also published our dataset.
A common approach to video scene detection is to treat the problem as a standard clustering task. But this neglects an important element: the inherent temporal structure of the video. Clustering shots independently of their location and order in the video can mark a scene as a non-contiguous segment, contradicting the definition of a scene as a temporally continuous entity.
To overcome these issues, we formulate the problem of video scene detection as an optimization process of sequential grouping. We choose optimal locations for division based on the pairwise distances between shot features: after applying shot boundary detection, we create a feature vector for each shot using both visual and audio features. Using a distance metric, we measure the distance between each pair of shots, resulting in a distance matrix.
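As an illustration, here is a minimal sketch of how such a distance matrix could be computed, assuming the per-shot feature vectors have already been extracted. The function name and the cosine-distance choice are our own for this sketch, not taken from the paper.

```python
import numpy as np

def distance_matrix(features, metric="cosine"):
    """Pairwise distances between per-shot feature vectors.

    features: (num_shots, dim) array, one row per shot
    (hypothetical pre-extracted visual/audio descriptors).
    """
    X = np.asarray(features, dtype=float)
    if metric == "cosine":
        # Normalize rows; cosine distance = 1 - cosine similarity.
        norms = np.linalg.norm(X, axis=1, keepdims=True)
        Xn = X / np.clip(norms, 1e-12, None)
        D = 1.0 - Xn @ Xn.T
    else:
        # Euclidean distance between every pair of rows.
        diff = X[:, None, :] - X[None, :, :]
        D = np.linalg.norm(diff, axis=2)
    np.fill_diagonal(D, 0.0)  # a shot has zero distance to itself
    return D
```

With features that actually separate scenes, nearby rows belonging to the same scene produce small entries and cross-scene pairs produce large ones, which is what yields the block structure discussed next.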
The intuition is that shots within the same scene will have low distance values, while shots from different scenes will have large ones. With well-chosen features, the matrix will exhibit a diagonal-block-like structure in which each block corresponds to a scene. For real videos the distance matrix is, of course, less ideal, but the block structure can still be observed.
Essentially, the problem then reduces to finding the optimal divisions along the diagonal. The approach is to define a cost function that measures the quality of a given division and then optimize that cost.
When attempting to decide on the cost function, one intuitive way to measure the quality of a division is with an additive cost function (Hadd) that sums up all the values in the blocks on the diagonal.
In the example, we would sum the values in blocks 1-5.
Another option is using an average cost function (Havg) which sums up the average value of each block on the diagonal.
The above cost functions (Hadd, Havg) are biased due to inherent mathematical properties (see the experiment below). Therefore, in our paper we introduce a normalized cost function (Hnrm) that provides the desired behavior by measuring the average value over all the blocks on the diagonal combined.
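To make the three definitions concrete, here is a sketch of Hadd, Havg, and Hnrm as described above: Hadd sums all values inside the diagonal blocks, Havg sums the per-block averages, and Hnrm divides the total in-block sum by the total in-block area. The boundary representation and function names are our own illustration, not the paper's notation.

```python
import numpy as np

def _blocks(boundaries, n):
    """Yield (start, end) ranges of the diagonal blocks.

    boundaries: sorted interior cut points, e.g. [3, 7] splits
    shots 0..n-1 into scenes [0,3), [3,7), [7,n).
    """
    edges = [0] + list(boundaries) + [n]
    for s, e in zip(edges[:-1], edges[1:]):
        yield s, e

def h_add(D, boundaries):
    # Sum of all distance values inside the diagonal blocks.
    return sum(D[s:e, s:e].sum() for s, e in _blocks(boundaries, len(D)))

def h_avg(D, boundaries):
    # Sum of each block's average value.
    return sum(D[s:e, s:e].mean() for s, e in _blocks(boundaries, len(D)))

def h_nrm(D, boundaries):
    # Average value over all diagonal blocks combined:
    # total in-block sum divided by total in-block area.
    total = sum(D[s:e, s:e].sum() for s, e in _blocks(boundaries, len(D)))
    area = sum((e - s) ** 2 for s, e in _blocks(boundaries, len(D)))
    return total / area
```

On a matrix with a clear two-block structure, placing the boundary at the true scene change yields a lower cost than any misplaced boundary.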
Let's examine an experiment that highlights this mathematical bias. Suppose we have a synthetic distance matrix with uniformly distributed values and want to divide it into two scenes. In such a matrix, all points of division should be equally acceptable, so we would expect the cost function to be constant over the point of division. In practice, Figure 7 (which plots the value of each cost function when dividing the matrix into two, as a function of the point of division) shows that only Hnrm exhibits the expected behavior.
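This experiment can be approximated in a few lines. For determinism we use a constant matrix rather than uniformly random values, so the bias of Hadd and the flatness of Hnrm are exact rather than statistical; this is our own simplification, not the paper's exact setup.

```python
import numpy as np

# With a constant distance matrix, every cut point should be equally
# good, so an unbiased cost should be flat across all cut points.
n = 20
D = np.ones((n, n))

def hadd(D, k):
    # Additive cost for a two-way split at shot index k.
    return D[:k, :k].sum() + D[k:, k:].sum()

def hnrm(D, k):
    # Normalized cost: total in-block sum over total in-block area.
    total = D[:k, :k].sum() + D[k:, k:].sum()
    area = k * k + (n - k) * (n - k)
    return total / area

add_costs = [hadd(D, k) for k in range(1, n)]
nrm_costs = [hnrm(D, k) for k in range(1, n)]
# Hadd equals k^2 + (n-k)^2, which is smallest at k = n/2, so minimizing
# it is biased toward a middle split; Hnrm is flat, as expected.
```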
Keeping the cost function mathematically unbiased is especially important when the block structure is less evident. Figure 8 shows a synthetic distance matrix with only a slight inclination toward a division into two scenes (highlighted by the yellow rectangles); only Hnrm identifies this subtle inclination.
Given a cost function, we use dynamic programming to find the division that globally minimizes the cost. Dynamic programming solves complex problems by breaking them down into subproblems and storing their solutions. Using dynamic programming with our normalized cost function is especially difficult, since the function examines the entire block diagonal even when inspecting the division of a subproblem (because of the denominator). To overcome this, we developed a unique scheme for solving the optimization problem, which is described in the paper.
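To illustrate the sequential-grouping idea, here is a standard dynamic-programming sketch for the simpler additive cost Hadd; the paper's scheme for the normalized cost Hnrm is more involved and is not reproduced here. Function names and structure are our own.

```python
import numpy as np

def split_additive(D, K):
    """Optimal division of n shots into K scenes minimizing H_add.

    Returns the sorted interior boundary indices of the best division.
    """
    n = len(D)
    # 2-D prefix sums: P[a, b] = sum of D[:a, :b], so any diagonal
    # block's sum can be read in O(1).
    P = np.zeros((n + 1, n + 1))
    P[1:, 1:] = np.cumsum(np.cumsum(D, axis=0), axis=1)

    def block(i, j):
        # Sum of D over the square block covering shots i..j-1.
        return P[j, j] - P[i, j] - P[j, i] + P[i, i]

    INF = float("inf")
    # cost[k][j]: best cost of grouping the first j shots into k scenes.
    cost = [[INF] * (n + 1) for _ in range(K + 1)]
    back = [[0] * (n + 1) for _ in range(K + 1)]
    cost[0][0] = 0.0
    for k in range(1, K + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):  # i = start of the last scene
                c = cost[k - 1][i] + block(i, j)
                if c < cost[k][j]:
                    cost[k][j] = c
                    back[k][j] = i
    # Recover the interior boundaries by walking the back-pointers.
    bounds, j = [], n
    for k in range(K, 1, -1):
        j = back[k][j]
        bounds.append(j)
    return sorted(bounds)
```

Because the additive cost of a prefix does not depend on how the remaining shots are grouped, the subproblem decomposition above is valid; it is exactly this property that the normalized cost lacks, which is why the paper needed a dedicated optimization scheme.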
Although the task of video scene detection has been studied for several years, IBM is one of the first to bring a solution to market, as part of the IBM Watson Media Video Enrichment product. In Video Enrichment, a key component for adding insights and searchable metadata for videos is the video scene detection technology which divides the video into semantic sections. For these scenes, the visual and audio concepts can be aggregated to create an electronic table of contents for the video.