Our team at IBM Research recently developed a new approach for automated video scene detection. Videos are used today for everything from entertainment and marketing to knowledge-sharing, news, and social journaling. Automated scene detection can help consumers and enterprises utilize this video content in new ways.

Video scene detection is the task of temporally dividing a video into semantic scenes. A scene is defined as a series of consecutive shots that together depict a high-level concept or story (where each shot is a series of frames taken from the same camera at the same time). In a Hollywood film, for example, a scene may be a car chase or a relaxed picnic on a beach. In a talk show or news broadcast, a scene might be defined by the specific topic under discussion.


Figure 1. Videos have an inherent hierarchical structure, where the scenes are the semantic chapters which denote a high-level concept or story.

Once a video has been split into scenes, each scene can be tagged and enriched with metadata—based on what is said, where it occurs, what activity is taking place, the setting, the background—in order to augment the video with consumable information. Dividing a video into scenes and using AI to label these segments enables the creation of an inverted index for video content that is searchable.
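To make the idea concrete, here is a minimal sketch of how scene-level tags could feed such an inverted index. The tags, scene IDs, and helper function below are purely hypothetical illustrations, not output of the actual system:

```python
from collections import defaultdict

def build_inverted_index(scene_tags):
    """Map each tag to the list of scene IDs in which it appears."""
    index = defaultdict(list)
    for scene_id, tags in scene_tags.items():
        for tag in tags:
            index[tag].append(scene_id)
    return index

# Hypothetical per-scene tags produced by enrichment.
scene_tags = {
    "scene_01": ["beach", "picnic", "dialogue"],
    "scene_02": ["car chase", "city street"],
    "scene_03": ["beach", "sunset"],
}

index = build_inverted_index(scene_tags)
print(index["beach"])  # ['scene_01', 'scene_03']
```

A query for "beach" now returns the relevant scenes directly, rather than forcing a viewer to scan the whole video.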


Figure 2. Many content understanding tasks might be difficult or impossible when dealing with heterogeneous video content. By applying video scene detection, the tasks become better defined.

We formulated the scene detection task as a generic optimization problem with a novel normalized cost function, aimed at optimally grouping consecutive shots into scenes. We describe the technique in our paper, Optimally Grouped Deep Features Using Normalized Cost for Video Scene Detection (by Daniel Rotman, Dror Porat, Gal Ashour and Udi Barzelay), presented at the 2018 ACM International Conference on Multimedia Retrieval. We also published our dataset.

A common approach to video scene detection is to treat the problem as a standard clustering task. But such an approach neglects an important element: the inherent temporal structure of the video. Clustering shots independently of their location and order in the video can mark a scene as a non-contiguous segment, in contrast to the definition of a scene as a temporally continuous entity.


Figure 3. Formulating video scene detection as a classic shot-clustering problem ignores the inherent temporal structure of the video.

To overcome these issues, we formulate the problem of video scene detection as an optimization process of sequential grouping. We choose optimal locations for division based on the pairwise distances between shot features: after applying shot boundary detection, we create a feature vector for each shot using both visual and audio features. Using a distance metric, we measure the distance between each pair of shots, resulting in a distance matrix.
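As a rough sketch of this stage (not the paper's exact implementation), one could compute pairwise cosine distances over per-shot feature vectors. The choice of cosine distance, the feature dimensionality, and the shot count below are all placeholder assumptions:

```python
import numpy as np

def shot_distance_matrix(features):
    """Pairwise cosine distances between L2-normalized per-shot feature vectors."""
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    return 1.0 - normed @ normed.T

# Placeholder features: 30 shots, each a 128-dim fused audio-visual vector.
features = np.random.rand(30, 128)
D = shot_distance_matrix(features)  # D[i, j] = distance between shots i and j
```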

The intuition is that shots within the same scene will have low distance values, while shots from different scenes will have large distance values. With well-chosen features, the matrix will exhibit a block-diagonal structure where each block corresponds to a scene. Real videos, of course, do not yield such an ideal distance matrix, but the block structure can still be observed.

Figure 4. (Left) An illustration of an ideal distance matrix of a video with heterogeneous content. Light pixels denote higher distance values, and the dark blocks on the diagonal represent sequences of shots with low pairwise distances. The values t1–t7 represent the desired scene transition locations. (Right) A distance matrix derived from real video data, where the block diagonal structure is still identifiable.

Essentially, the problem reduces to finding the optimal division points along the diagonal. The approach is then to define a cost function that measures the quality of a given division, and to optimize that cost.


Figure 5. The video scene detection pipeline. In the feature extraction stage, audio and visual features are extracted using deep neural networks, and a multi-modal fusion scheme is then incorporated to leverage both the audio and visual properties of the video.

One intuitive way to measure the quality of a division is with an additive cost function (Hadd) that sums all the values in the blocks on the diagonal.


Figure 6. A distance matrix, and the blocks on the diagonal. The cost function should utilize these values to indicate the preference for the point of division.

In the example, we would sum the values in blocks 1-5.

Another option is to use an average cost function (Havg), which sums the average value of each block on the diagonal.

Both of these cost functions (Hadd, Havg) are biased due to inherent mathematical properties (demonstrated below). In our paper we therefore introduce a normalized cost function (Hnrm), which provides the desired behavior by measuring the average value over all the diagonal blocks combined.
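The following sketches express the three cost functions directly from their verbal definitions above. The representation of a division as a list of scene start indices plus the end, and the handling of the (zero) diagonal, are our assumptions; the exact formulation in the paper may differ in such details:

```python
import numpy as np

def blocks(D, boundaries):
    """Yield the diagonal blocks induced by a division.

    `boundaries` lists the first shot index of each scene plus the end,
    e.g. [0, 4, 9, len(D)] for a division into three scenes.
    """
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        yield D[start:end, start:end]

def h_add(D, boundaries):
    """Additive cost: sum of all values inside the diagonal blocks."""
    return sum(b.sum() for b in blocks(D, boundaries))

def h_avg(D, boundaries):
    """Average cost: sum of the per-block average values."""
    return sum(b.mean() for b in blocks(D, boundaries))

def h_nrm(D, boundaries):
    """Normalized cost: average over all diagonal-block values combined."""
    total = sum(b.sum() for b in blocks(D, boundaries))
    count = sum(b.size for b in blocks(D, boundaries))
    return total / count
```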

Let’s examine an experiment that highlights this mathematical bias. Suppose we have a synthetic distance matrix with uniformly distributed values, and we wish to divide it into two scenes. In such a matrix, all points of division should be equally acceptable, so we would expect the cost function to be constant regardless of where the division is placed. In practice, as Figure 7 shows (plotting the value of each cost when dividing the matrix into two, as a function of the point of division), only Hnrm exhibits the expected behavior.

Figure 7. The three aforementioned costs as a function of the point of division, when dividing a matrix with uniformly distributed values into two.
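The experiment is easy to approximate in a few lines, reusing the cost functions from the previous sketch. The matrix construction here is our own assumption rather than the paper's exact setup; per the figure above, only the normalized cost should remain approximately constant across the candidate division points:

```python
import numpy as np

n = 100
rng = np.random.default_rng(0)
U = rng.random((n, n))
D = np.triu(U, 1)
D = D + D.T                  # symmetric, uniform off-diagonal values, zero diagonal

for t in range(10, n, 10):   # candidate points of division into two scenes
    b = [0, t, n]
    print(t, h_add(D, b), h_avg(D, b), h_nrm(D, b))
```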

Keeping the cost function mathematically unbiased is especially important when the block structure is less evident. Figure 8 shows a synthetic distance matrix with only a slight inclination toward a particular division into two scenes (highlighted by the yellow rectangles); only Hnrm picks up this subtle inclination.

Figure 8. When the block diagonal structure is ambiguous, Hnrm allows the point of division to be easily detectable.

Given a cost function, we use dynamic programming to find the division that globally minimizes the cost. Dynamic programming solves complex problems by breaking them down into subproblems and storing the subproblem solutions for reuse. Using dynamic programming with our normalized cost function is especially difficult, since the function examines the entire block diagonal even when inspecting the division of a subproblem (because of the denominator). To overcome this, we developed a unique scheme for solving the optimization problem, which is described in the paper.
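For intuition, here is a plain dynamic-programming sketch for the simpler additive cost Hadd; as noted above, Hnrm cannot be optimized this directly, because its denominator couples the subproblems, which is exactly what the paper's scheme addresses. The function and variable names are ours:

```python
import numpy as np

def optimal_division(D, K):
    """Split n shots into K scenes, minimizing the sum of intra-block distances."""
    n = len(D)
    block = lambda s, e: D[s:e, s:e].sum()  # additive cost of a scene covering shots s..e-1
    INF = float("inf")
    cost = [[INF] * (K + 1) for _ in range(n + 1)]
    prev = [[0] * (K + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for k in range(1, K + 1):
        for i in range(k, n + 1):           # first i shots split into k scenes
            for j in range(k - 1, i):       # last scene covers shots j..i-1
                c = cost[j][k - 1] + block(j, i)
                if c < cost[i][k]:
                    cost[i][k], prev[i][k] = c, j
    # Recover the scene boundaries by walking back through the stored choices.
    boundaries, i = [n], n
    for k in range(K, 0, -1):
        i = prev[i][k]
        boundaries.append(i)
    return boundaries[::-1]                 # e.g. [0, t1, ..., n]
```

Calling optimal_division(D, 5) on a distance matrix returns boundaries in the same format used by the cost sketches above; each subproblem (the first i shots split into k scenes) is solved once and reused, which is what makes the search over all divisions tractable.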

Although video scene detection has been studied for several years, IBM is one of the first to bring a solution to market, as part of the IBM Watson Media Video Enrichment product. In Video Enrichment, a key component for adding insights and searchable metadata to videos is the video scene detection technology, which divides the video into semantic sections. For these scenes, the visual and audio concepts can be aggregated to create an electronic table of contents for the video.

Figure 9. The Video Enrichment user interface shows the division of the video into scenes (top). The ingested video is processed by our video scene detection technology; in this instance, 45 scenes were detected. The bottom part displays the top visual elements associated with scene number 31.


Daniel Rotman, Dror Porat, Gal Ashour and Udi Barzelay, "Optimally Grouped Deep Features Using Normalized Cost for Video Scene Detection," Proceedings of the 2018 ACM International Conference on Multimedia Retrieval (ICMR 2018).
