Due to the coronavirus disease (COVID-19) pandemic, education has become heavily dependent on digital platforms, and recent advances in technology have made a tremendous amount of video content available. Because of this huge volume of video content, content-based information retrieval has become increasingly important. Video content retrieval, like information retrieval in general, requires pre-processing such as indexing, key-frame selection, and, most importantly, accurate detection of video shots. This allows video information to be stored in a manner that permits easy access. Video processing plays a vital role in many large applications, which must perform various manipulations on video streams (on frames, or shots). High-definition video takes a lot of memory to store, so compression techniques are in great demand. Object tracking and object identification are also areas where considerable research has taken place and is still in progress.
In the current scenario, with the rapid development of multimedia technology, the amount of video data available every day is enormous and growing at a high rate. Video is the most consumed data type on the Internet, on platforms such as YouTube, Vimeo, Dailymotion, and Yahoo Video, and on social networking sites like Facebook, Twitter, and Instagram. Searching for videos can be text-based or content-based. The text-based approach is more time-consuming and populates the database with a lot of data; content-based retrieval is therefore the more efficient process and gives more appropriate results. The explosive growth in video content leads to the problem of content management. The retrieval of a video frame from a large database is also possible in video processing, and can be seen as the next step beyond Content-Based Image Retrieval (CBIR). A video is formed from frames, and a group of consecutive frames makes up a shot. The identification of shots is a research area that has proved very useful for many real-time applications. Shot Boundary Detection (SBD) is also necessary for retrieving a desired frame from a video database.
Shot Boundary Detection (SBD) is required for automatic video indexing and browsing, and can be used for a variety of purposes, including video database indexing and video compression. The basic building block of video is the frame. Figure 1 depicts the structure of a video. Frame sequences are indexed by frame number, and the frames obtained by cutting the video are all the same size. Usually, 25 to 30 frames are captured every second. A video shot is a collection of connected frames captured by a single camera at one time; a video is typically created by stitching shots together. A scene in a video may consist of one or more shots that tell a specific story. Shot Boundary Detection is founded on recognizing the visual disparity caused by transitions: a discrepancy between two frames typically appears at a shot change. This dissimilarity manifests in ways that fall into two categories: abrupt (a hard cut) and gradual (dissolve, fade-in, fade-out, wipe). An abrupt change occurs within a single frame.
Figure 1. Structure of the Video
The shot is a series of interrelated, consecutive frames taken continuously by a single camera and representing continuous action in time and space. The shots are considered primitives for higher-level content analysis, indexing, and classification.
Abdulhussain et al. (2018) describe the automatic detection of shot boundaries in a video stream, called "shot boundary detection." It deals with detecting transitions between shots in digital video for temporal segmentation, which is required for content-based purposes such as video content analysis, video browsing, and video retrieval. Here, interrelated consecutive frames are captured continuously by a single camera, giving continuous action in time and space. Shots are efficient for indexing and can serve higher-level content analysis and classification (Hannane et al., 2016; Janwe & Bhoyar, 2013). Video shot boundary detection is the process of identifying the transition between two successive shots; the shot boundary is essentially the connection point between one shot and another. Shot boundary detection deals with two types of shot transitions (Bi et al., 2018).
An object-based shot detection method was proposed by Heng and Ngan (2001). In that paper, a time-stamp transfer technique across multiple frames was proposed as a means of discovering information. The method handled gradual changes effectively and outperformed more conventional algorithms in several ways.
Lu and Shi (2013) developed a video shot boundary detection approach based on segment selection and Singular Value Decomposition (SVD) with pattern matching. Adaptive thresholds were used to locate shot boundaries and determine the length of gradual transitions, while the majority of non-boundary frames were eliminated at the same time.
Figure 2 shows an abrupt transition: a sudden change in which one frame belongs to the first shot and the next frame belongs to the second shot (Zheng & Zhang, 2016). In this transition, the shot change is clearly visible. It is also known as a "hard cut" or "simple cut."
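As a minimal illustration of this idea (not taken from any of the cited papers), a hard cut can be flagged wherever the mean absolute intensity difference between consecutive frames exceeds a fixed threshold. The frame data and the threshold value below are illustrative assumptions:

```python
import numpy as np

def detect_cuts(frames, threshold=30.0):
    """Flag an abrupt cut wherever the mean absolute intensity
    difference between consecutive frames exceeds the threshold."""
    cuts = []
    for i in range(1, len(frames)):
        diff = np.mean(np.abs(frames[i].astype(float) - frames[i - 1].astype(float)))
        if diff > threshold:
            cuts.append(i)  # the cut lies between frame i-1 and frame i
    return cuts

# Synthetic example: shot A (dark frames) followed by shot B (bright frames)
shot_a = [np.full((64, 64), 20, dtype=np.uint8) for _ in range(5)]
shot_b = [np.full((64, 64), 200, dtype=np.uint8) for _ in range(5)]
print(detect_cuts(shot_a + shot_b))  # → [5]
```

A fixed threshold like this is exactly what the adaptive-threshold discussion later in the section improves upon.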
Figure 2. Abrupt Transition
Figure 3 shows a gradual transition, in which the change in visual content takes place slowly and continues over several frames (Janwe & Bhoyar, 2013; Wu & Xu, 2014). Fade-in, fade-out, wipe, and dissolve are some of the types of gradual transition shown in Figure 4.
Figure 3. Gradual Transitions
Video segmentation, setting the length of shots, and feature extraction are the tasks that need to be implemented here to achieve the desired goal (Karpagavalli et al., 2020). It is classified into two criteria,
To achieve accurate video segmentation results, appropriate thresholds must be selected. Shot transition thresholds vary widely across different types of video sources; for example, a cartoon's threshold is usually higher than a teleplay's. Dissolve transition detection, on the other hand, is a challenging problem in the shot boundary detection area (Yi et al., 2012).
2.2.1 Adaptive Threshold Technique
The threshold selection should be based on the frame difference. In practice, the extracted frame features can be histograms, edges, motions, etc. Apart from shot transitions and camera motion, frame differences are usually caused by three factors,
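One common way to make the threshold adapt to the frame-difference signal (a sketch, not the specific method of any cited paper) is to compare each difference value against the mean plus a multiple of the standard deviation of its neighbours in a sliding window. The window size and the multiplier `k` below are illustrative assumptions:

```python
import numpy as np

def adaptive_cut_detection(diffs, window=5, k=3.0):
    """Flag frame i as a cut candidate when its difference value exceeds
    mean + k*std of the surrounding window (excluding i itself).
    Note: degenerates when the neighbourhood is perfectly constant."""
    cuts = []
    n = len(diffs)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        neigh = np.concatenate([diffs[lo:i], diffs[i + 1:hi]])
        if len(neigh) and diffs[i] > neigh.mean() + k * neigh.std():
            cuts.append(i)
    return cuts

# Illustrative frame-difference signal with one spike at index 4
diffs = np.array([1.0, 1.0, 2.0, 1.0, 50.0, 1.0, 2.0, 1.0])
print(adaptive_cut_detection(diffs))  # → [4]
```

Unlike a single global threshold, this tolerates sources (cartoons, teleplays) whose baseline frame differences sit at very different levels.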
2.2.2 Detection of Dissolve Type Transition
Gradual transition detection, especially dissolve transition detection, is a difficult problem in the video shot boundary detection area. An integrated algorithm is proposed that combines two threshold techniques, transparency computation, and a Canny edge operator.
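The two-threshold part of such schemes is usually a variant of the classic twin-comparison approach: a low threshold marks the possible start of a gradual transition, and the transition is confirmed if the accumulated difference reaches a high threshold. The sketch below shows only this two-threshold idea, not the full proposed algorithm; `t_low`, `t_high`, and the difference values are illustrative assumptions:

```python
def twin_comparison(diffs, t_low=2.0, t_high=20.0):
    """Twin-comparison sketch: a gradual transition starts when the
    frame difference exceeds t_low; it is confirmed if the accumulated
    difference reaches t_high before the per-frame difference drops
    back below t_low. Returns (start, end) frame-index pairs."""
    transitions, start, acc = [], None, 0.0
    for i, d in enumerate(diffs):
        if start is None:
            if d >= t_low:
                start, acc = i, d
        elif d >= t_low:
            acc += d
        else:
            if acc >= t_high:
                transitions.append((start, i - 1))
            start, acc = None, 0.0
    if start is not None and acc >= t_high:
        transitions.append((start, len(diffs) - 1))
    return transitions

# Small within-shot differences, then a dissolve spanning frames 2..9
diffs = [0.5, 0.4, 3, 3, 3, 3, 3, 3, 3, 3, 0.5, 0.3]
print(twin_comparison(diffs))  # → [(2, 9)]
```

The transparency-computation and Canny steps mentioned above would then be used to verify that a confirmed candidate really is a dissolve rather than fast motion.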
The proposed system implements several basic steps: pre-processing the frames of a video stream; applying precise matching techniques and comparing several of them so as to choose the most efficient one for the preliminary tasks; selecting either thresholding or a classifier to find matched frames; analyzing frames from the video database using the most suitable tool(s); detecting shot boundaries; and evaluating the algorithm. Figure 5 shows a general flowchart of the steps carried out to achieve the goal.
Figure 5. Flowchart of Proposed System
Shot Boundary Detection techniques extract one or more features from a video frame or from a subset of it, referred to as a "Region of Interest" (ROI). An algorithm can then detect shot changes from these features using various techniques. Almost all shot change detection techniques reduce the video domain's high dimensionality by extracting a small number of features from one or more regions of interest in each video frame. Among these features are,
The average grayscale luminance of a Region of Interest (ROI) is the most basic property that can be used to describe it. A more reliable option is to use one or more statistics (for example, averages) of the values in an appropriate color space, such as HSV.
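For a grayscale-luminance feature over an RGB frame, a common choice (an illustrative one, not mandated by the text) is the ITU-R BT.601 weighting:

```python
import numpy as np

def mean_luminance(frame_rgb):
    """Mean grayscale luminance of a frame (or ROI) using the
    ITU-R BT.601 weights; a single-number ROI descriptor."""
    r = frame_rgb[..., 0].astype(float)
    g = frame_rgb[..., 1].astype(float)
    b = frame_rgb[..., 2].astype(float)
    return 0.299 * r.mean() + 0.587 * g.mean() + 0.114 * b.mean()

white = np.full((8, 8, 3), 255, dtype=np.uint8)
print(mean_luminance(white))  # ≈ 255.0
```

Passing a slice of the frame instead of the whole array gives the same statistic for any rectangular ROI.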
Grayscale or color histograms are more detailed ROI features than luminance or color statistics. Their advantages are that they are highly discriminating, simple to compute, and largely unaffected by translational, rotational, and zooming camera motion; for these reasons they are commonly used.
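A typical histogram-based dissimilarity measure (one of several in use; the bin count here is an illustrative choice) is the L1 distance between normalized grayscale histograms:

```python
import numpy as np

def hist_diff(f1, f2, bins=64):
    """L1 distance between normalized grayscale histograms of two
    frames: near 0 within a shot, large across a cut (max 2.0)."""
    h1, _ = np.histogram(f1, bins=bins, range=(0, 256))
    h2, _ = np.histogram(f2, bins=bins, range=(0, 256))
    h1 = h1 / h1.sum()
    h2 = h2 / h2.sum()
    return np.abs(h1 - h2).sum()

dark = np.full((16, 16), 20, dtype=np.uint8)
bright = np.full((16, 16), 200, dtype=np.uint8)
print(hist_diff(dark, dark), hist_diff(dark, bright))  # → 0.0 2.0
```

Because the histogram discards pixel positions, this measure stays small under the camera translation, rotation, and zoom mentioned above.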
The edge information of an image is a natural choice for describing an ROI. Its benefit is that it corresponds closely to how humans perceive a scene visually and is sufficiently invariant to changes in illumination and to various types of motion. The key drawbacks are its computational expense, noise sensitivity, and high dimensionality without post-processing.
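A common edge-based measure in the SBD literature is the edge change ratio: the fraction of edge pixels that enter or exit between two frames. The sketch below uses a crude finite-difference edge map as a stand-in for a real detector such as Canny; the gradient threshold is an illustrative assumption:

```python
import numpy as np

def edge_map(frame, thresh=30.0):
    """Crude binary edge map from finite-difference gradient magnitude
    (a stand-in for a proper edge detector such as Canny)."""
    gy, gx = np.gradient(frame.astype(float))
    return np.hypot(gx, gy) > thresh

def edge_change_ratio(prev, curr, thresh=30.0):
    """Fraction of edge pixels that appear or disappear between two
    frames; this rises sharply at shot boundaries."""
    e1, e2 = edge_map(prev, thresh), edge_map(curr, thresh)
    entering = np.logical_and(e2, ~e1).sum()
    exiting = np.logical_and(e1, ~e2).sum()
    total = max(e1.sum(), e2.sum(), 1)
    return max(entering, exiting) / total

frame = np.zeros((32, 32))
frame[:, 16:] = 100.0  # one vertical edge
print(edge_change_ratio(frame, frame))  # → 0.0
```

The post-processing cost the text warns about shows up here as the extra morphology and edge registration a production implementation would need.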
Transform coefficients, such as those of the Discrete Fourier Transform (DFT), Discrete Cosine Transform (DCT), and wavelet transforms, are traditional ways to characterize the picture data in an ROI. An added benefit of DCT coefficients is that Moving Picture Experts Group (MPEG)-encoded video streams and files already contain them.
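For reference, the 2-D DCT-II that MPEG applies to 8x8 blocks can be built from the orthonormal DCT basis matrix; a few low-frequency coefficients per block are enough to serve as a compact frame feature. This is a minimal sketch of the transform itself:

```python
import numpy as np

def dct2(block):
    """2-D DCT-II of a square block via the orthonormal DCT basis
    matrix (the transform used on 8x8 blocks in JPEG/MPEG)."""
    n = block.shape[0]
    k = np.arange(n)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)  # orthonormal scaling of the DC row
    return c @ block @ c.T

# A constant 8x8 block has all its energy in the DC coefficient (n * value)
d = dct2(np.ones((8, 8)))
print(round(d[0, 0], 6))  # → 8.0
```

Since compressed-domain methods read these coefficients straight from the bitstream, they avoid full decoding, which is the benefit noted above.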
Motion is occasionally used as a feature for detecting shot transitions, but it is typically paired with others because, on its own, it can be highly discontinuous within a shot (when motion changes quickly), and it is obviously useless when there is no motion in the video.
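A motion feature is usually derived from block-matching motion vectors. The exhaustive block-matching sketch below (block size and search range are illustrative assumptions) reduces the vectors to a single mean-magnitude value per frame pair:

```python
import numpy as np

def block_motion(prev, curr, block=8, search=4):
    """Exhaustive block matching: for each block in `curr`, find the
    best-matching displaced block in `prev` (minimum sum of absolute
    differences) and return the mean motion-vector magnitude."""
    h, w = curr.shape
    mags = []
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            ref = curr[y:y + block, x:x + block].astype(float)
            best, bv = None, (0, 0)
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy <= h - block and 0 <= xx <= w - block:
                        cand = prev[yy:yy + block, xx:xx + block].astype(float)
                        err = np.abs(ref - cand).sum()
                        if best is None or err < best:
                            best, bv = err, (dy, dx)
            mags.append(np.hypot(*bv))
    return float(np.mean(mags))

# Identical frames with distinct pixel values → zero estimated motion
prev = np.add.outer(5.0 * np.arange(32), np.arange(32))
print(block_motion(prev, prev))  # → 0.0
```

The discontinuity problem mentioned above appears here as sudden jumps in the returned magnitude whenever on-screen motion changes abruptly within a single shot.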
Measuring quality, whether of products or processes, requires gathering and analyzing data, typically expressed in terms of measures and metrics. The main goal of measurement is to bring a task under control so that it can be managed. Three measures are used to assess the quality of a Shot Boundary Detection (SBD) algorithm, as follows,
V = C / (C + M) (1)

where C = number of correctly detected cuts (correct hits) and M = number of undetected cuts (missed hits).

P = C / (C + F) (2)

where F = number of falsely detected cuts (false hits).

F1 = 2 * P * V / (P + V) (3)

where V = recall and P = precision.
All these measures are mathematical values ranging between 0 and 1; the basic rule is that the higher the value, the better the algorithm's performance.
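Equations (1)-(3) can be computed directly from the detected and ground-truth cut positions; the positions below are made-up example values:

```python
def sbd_metrics(detected, ground_truth):
    """Recall (V), precision (P), and F1 for detected cut positions
    against ground truth, following Eqs. (1)-(3)."""
    detected, ground_truth = set(detected), set(ground_truth)
    c = len(detected & ground_truth)   # correct hits
    m = len(ground_truth - detected)   # missed hits
    f = len(detected - ground_truth)   # false hits
    v = c / (c + m) if c + m else 0.0  # recall, Eq. (1)
    p = c / (c + f) if c + f else 0.0  # precision, Eq. (2)
    f1 = 2 * p * v / (p + v) if p + v else 0.0  # Eq. (3)
    return v, p, f1

# Two correct hits, one miss (70), one false hit (90)
v, p, f1 = sbd_metrics(detected=[10, 50, 90], ground_truth=[10, 50, 70])
print(v, p, f1)
```

With two of three true cuts found and one false alarm, all three measures here come out to 2/3, illustrating how F1 balances recall and precision.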
Figure 6 shows cut detection, where 1 denotes a hit, i.e., a hard cut that is correctly detected; 2 denotes a missed hit, i.e., a soft cut (dissolve) that was not detected; and 3 denotes a false hit, i.e., a single soft cut that is falsely interpreted as two different hard cuts.
Figure 6. Cut Detection
Table 1 summarizes the performance of video shot boundary detection using different techniques (Yong et al., 2002).
Despite the large number of works on shot boundary detection, many issues remain unresolved and require additional research. For a large database, this is the most essential method for extracting the desired video content. Moreover, well-ordered and effective management of video documents depends on the availability of indexes, and Shot Boundary Detection (SBD) makes the indexing task possible. It also helps in video browsing.