IP video traffic on the network continues to grow. To accommodate a future traffic volume of a million minutes of video per second, it is important to minimize the traffic each video generates. This paper presents a performance comparison of the HEVC, AV1, and xvc (in fast mode) video CODECs, with a focus on minimizing network traffic for common use cases such as video conferences and social media video streaming, where less than optimal videos are often sent. The purpose is to identify which CODEC gives the best quality at the lowest bit rate. This is done by running a test bench with multiple videos of different qualities, encoding and decoding each video with each CODEC. The study shows that xvc (fast mode) and AV1 (double-pass) have similar quality performance on the dataset and show a noticeable improvement over HEVC (double-pass) for videos of less than optimal quality. This research work was conducted in the context of the course II2202 Research Methodology and Scientific Writing at KTH Royal Institute of Technology, Sweden.
Many different video CODECs exist to efficiently encode video to be sent from server to client. IP video traffic constituted 73 percent of consumer internet traffic in 2016 and is predicted to increase to 82 percent by 2021, according to Cisco (Cisco Systems Inc, 2017). The Cisco report also predicts that a million minutes of video content will cross the network every second by 2021. This encourages the development of more efficient video CODECs with the goal of sustainable network usage, as otherwise, operators have a difficult time supplying all the bandwidth that will be needed at the prices that customers are willing to pay.
There are existing studies involving the xvc CODEC that mainly focus on high-quality video resources; however, this quality of video might not be the case for video conferences, amateur home videos, social media, and other video resources among Cisco's predicted million video minutes per second. Thus, investigating the performance of the CODECs from other aspects is essential.
Video traffic will continue to increase and in order to accommodate this, a more reliable and sustainable network is needed. This can be done by increased compression while maintaining an acceptable Quality of Experience (QoE).
Reducing the bit rates used for video streaming is important with regard to total network usage. New open source video CODECs challenge the de facto standard CODEC, HEVC, and it is therefore of interest to study the possible savings of the open source CODECs compared to HEVC.
This research work investigates video quality, using the metrics listed below, and compression at a set of low bit rates for the following video CODECs: HEVC, AV1, and xvc. This will give an indication of which is the more sustainable video CODEC. Following a suggestion from the creators of xvc, part of the dataset consists of low-quality recordings, by which we mean recordings affected by camera shake or motion blur. Part of the solution is to compare the CODECs using their Bjøntegaard Delta (BD) ratings for different bit rates and video qualities. The BD-rate is calculated using the following metrics: Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index Metric (SSIM), and Video Multimethod Assessment Fusion (VMAF).
This section introduces the technical terms necessary to understand the purpose and methods of this work, presents the previous work that laid the foundation for this investigation, and gives an introduction to xvc, the newest of the CODECs used in this research work.
Quality of Experience (QoE), Mean Opinion Score (MOS), and Quality of Service (QoS) are all factors which describe quality (Pal & Vanijja, 2017). QoE is measured through subjective questions to an individual. MOS is the mean of the QoE scores; the number of individuals needed for the MOS to be valid depends on the application area. QoS describes the quality of a service, in this case the service of network delivery. Any negative factor, such as high packet loss, lowers the QoS.
Bjøntegaard Delta (BD) (Bjøntegaard, 2001) is a method for comparing performance/bit-rate curves. It works by comparing only the interval over which both curves are valid. It also proposes a specific method for interpolating the curves, using the bit rate values in logarithmic form.
Peak Signal-to-Noise Ratio (PSNR) (National Instruments, 2013) is a metric that expresses the ratio between the maximum possible power of a signal and the power of the distorting noise that affects the quality of its representation. PSNR is most commonly expressed on the decibel (dB) scale. It is a quantitative measure for comparing data compression, enhancement, and quality. A PSNR value that is as high as possible is desirable, since this means that the differences between the original picture and the reconstructed one are smaller.
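As an illustration, PSNR in its common mean-squared-error form can be sketched in a few lines of Python. This is a minimal sketch and not the tool used in this study: the function name `psnr`, the flat lists of sample values, and the default peak value of 255 (8-bit samples) are assumptions made for the example.

```python
import math


def psnr(reference, distorted, peak=255):
    """Return the PSNR in dB between two equally long sequences of samples.

    `peak` is the maximum possible sample value (255 assumes 8-bit samples).
    """
    if len(reference) != len(distorted) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error between the original and the reconstructed samples.
    mse = sum((r - d) ** 2 for r, d in zip(reference, distorted)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs: no distortion at all
    return 10 * math.log10(peak ** 2 / mse)
```

A smaller mean squared error gives a larger PSNR, which is why higher values indicate a reconstruction closer to the original.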
Structural Similarity Index Metric (SSIM) (Wang, Bovik, Sheikh, & Simoncelli, 2004) is a different approach to quantifying the visible differences between a reference image and a compressed or enhanced one. It bases its quality assessment on the deterioration of the image's structural information. SSIM has proven to simulate subjective ratings well.
Video Multimethod Assessment Fusion (VMAF) is a system for assessing video quality developed by Netflix (Netflix Inc, 2016). By combining well-defined quality metrics, such as Anti-noise SNR, Detail Loss Measure, Visual Information Fidelity, and Mean Co-Located Pixel Difference, with machine learning, VMAF yields a score that correlates closely with human perception of quality (Rassool, 2017). This enables computer evaluation of video quality, and thus research on video CODECs can be conducted more efficiently than by using MOS. A clip with a VMAF score of 93 or more is generally free of disturbing visual artifacts and can, therefore, be considered to give a good or acceptable QoE (Rassool, 2017).
VMAF has been compared to other methods by running these methods on a set of videos and comparing the scores to scores from human subjects (Rassool, 2017). The experiment showed that VMAF has a better correlation with the human scores and the lowest mean square error. Even when the input video has been affected by previous compression, as in the case of video conferences, VMAF evaluates the quality similarly to human subjects.
In an online blog post on EuclidIQ (Wingard, 2014), the dichotomy between using subjective or objective methods to measure video quality is discussed. From a QoE point of view, the need for MOS and other subjective factors is very important. Since the testing to establish a MOS factor is cumbersome and lengthy, the need for objective performance arises to quickly evaluate QoE factors. Classic measurements, such as SNR, PSNR, and MSE give a good indication of how well the picture matches the original, but newer metrics such as SSIM and VMAF provide measurements closer to the QoE scores given by MOS testing (Wingard, 2014).
In (Pal & Vanijja, 2017), a mapping between Quality of Service (QoS) scores and Quality of Experience (QoE) scores is made, along with a relation to Mean Opinion Score (MOS). The conclusion was that high QoS is related to high QoE scores and MOS. If the traffic load from videos on the network is lowered, then the QoS will be higher and, as such, the QoE and MOS will be higher. This indicates that lower bit rates with still acceptable quality scores (a VMAF score of 93 or higher) are beneficial for QoS and thus for QoE. The research used the publicly available SVT High Definition Multi-Format Testset maintained by the Video Quality Experts Group (The Institute for Telecommunication Sciences (ITS), 2018) for the subjective tests, since MOS scores are available for these videos without any extra effects affecting them. To give a good spread for the testset, the chosen videos each had different levels of spatial information (SI) and temporal information (TI). All four videos used were Full HD (1080p) resolution with a frame rate of 30 fps. A total of 7280 subjective MOS scores were collected. The Microsoft network emulator was used to create the desired network effects on the clips. The collected MOS scores were then presented in relation to the different network effects and compared with the provided MOS scores to show how QoS correlates with QoE.
Research (Feldmann, 2018; Zabrovskiy, Feldmann, & Timmerer, 2018) based on an HTTP Adaptive Streaming (HAS) dataset was conducted to compare the performance of HEVC and other CODECs to AV1. This study showed improvements for AV1 compared to the others when comparing the Peak Signal-to-Noise Ratio (PSNR) and Bjøntegaard Delta PSNR (BD-PSNR) metrics (Ozer, 2018). The dataset used is optimized for HAS and adopts a bit rate ladder, ranging from a bit rate of 100 kbit/s at a resolution of 256 x 144 pixels up to a bit rate of 20 Mbit/s at 4K resolution (3840 x 2160 pixels). To compare the encoding speed, the study encoded the AV1 bitstreams at the Institute of Information Technology at the Alpen-Adria-Universität Klagenfurt, and encoded the other three CODECs' bitstreams on the Bitmovin Video Encoding cloud infrastructure, utilizing FFmpeg (FFmpeg, 2018) for all three of these CODECs. The study used a weighted PSNR relative to the original source as the metric and used this to calculate BD-rate values.
The xvc CODEC (Samuelsson & Hermansson, 2018; Divideon, 2018) is a new video CODEC that has shown promising results in earlier studies (Divideon, 2018) competing against AV1 (Bitmovin Inc, 2018) and HEVC (Fraunhofer Heinrich Hertz Institute, 2018). The CODECs were compared using a set of metrics including PSNR, SSIM, and VMAF, which are the ones used in this study. Some of the numbers are outdated, as AV1 and xvc are still under development. All numbers except the xvc results were taken from a database with results from tests with well-defined test sequences and conditions (Daede, Norkin, & Brailovskiy, 2015). The creators of the xvc CODEC have also expressed an interest in a comparative test using lower quality data (VHS quality, shaking from using a handheld camera, etc.).
The problem to be answered is: which CODEC offers the highest quality at a low bit rate? The purpose is to give an indication of the sustainability of the video CODECs, in the hope of reducing the amount of network traffic generated per IP video and thereby allowing for more traffic on the same infrastructure. The goals of this research work are to (1) create a test bench for calculating the indicated metrics of the video CODECs from a video source, (2) use the output of the test bench to create a well-presented performance comparison between the different CODECs, based on the BD-ratings resulting from the test run, and (3) use these results, together with the encoded size of each clip, to suggest which option has the most suitable trade-off between QoE and bit rate.
The structure of this study is based on the guidelines proposed in the IETF draft “Video CODEC Testing and Quality Measurement” (Daede et al., 2015). The method is a quantitative study with the following structure.
The dataset will comprise dark, well lit, and overexposed videos of around 720p, provided by Xiph.org (2018). The videos will be encoded using the three video CODECs (xvc, AV1, and HEVC) at a set of lower bit rates for 720p streaming (∼100 to ∼3500 kbps, with evenly growing intervals). See Appendix A for information on the videos in the dataset and for frame captures of the videos.
The VMAF tool used in this research compares two video files in raw pixel format. The test bench is therefore designed to encode a raw video file into the format given by the video CODEC, and then decode that file into a video file in raw pixel format. The resulting video file has thereby been subjected to steps equivalent to those taken when compressing a video for efficient network utilization or storage, and decompressing it for viewing on the target device. Since the video has been compressed into a smaller format, it has lost data and thus image quality. The original raw video file and the resulting file are therefore compared using the VMAF tool to get a score for how well the video CODEC preserves video quality.
Every step of the test bench is conducted on the same virtual machine on the same device, and the video CODECs are instructed to utilize multiple processor cores of the system and can run in parallel with the other video CODECs without affecting the results.
The test bench is a set of bash scripts running encoding, decoding, and evaluation programs. The following pseudocode describes the steps in more detail.
1) For every image sequence or video file in the dataset:
   a) Convert to a raw video file with a set frame rate and video resolution:
      - Frame rate of 25 frames per second
      - Video resolution same as the source if the width is lower than 720 pixels; otherwise scaled down to a width of 720 pixels, with the height chosen to maintain the width-height ratio
   b) Encode the raw video file using each video CODEC:
      - HEVC implementation in FFmpeg (FFmpeg, 2018)
      - AV1 implementation in FFmpeg (FFmpeg, 2018)
      - xvc implementation from the open source GitHub project (Divideon, 2017)
   c) Store the encoded file size in bytes
   d) Calculate the bit rate from the video length and the encoded file size:
      (fileSize[bytes]) * 8 / (videoLength[ms]) = bitRate[kbit/s]
   e) Decode the encoded file using the video CODEC implementation
   f) Store the decoded file size in bytes
   g) Calculate the SSIM, PSNR, and VMAF scores by comparing the original raw video file and the decoded file, using the VMAF program created by Netflix (Netflix Inc, 2016)
   h) Compile the gathered data into one .xml document, using a naming convention that allows the statistical analysis to be automated:
      (video CODEC)_(dataset name)_(bit rate).xml
2) Wait until every video CODEC has finished the analysis, then begin on the next image sequence or video file in the dataset.
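The bit rate calculation and the naming convention in the pseudocode can be sketched in Python as follows. This is only an illustrative sketch, since the actual test bench is written as bash scripts; the function names `bit_rate_kbit_per_s` and `result_file_name` are hypothetical.

```python
def bit_rate_kbit_per_s(file_size_bytes, video_length_ms):
    """Bit rate in kbit/s from the encoded size in bytes and the length in ms.

    Bytes * 8 gives bits, and bits per millisecond equals kilobits per second.
    """
    if video_length_ms <= 0:
        raise ValueError("video length must be positive")
    return file_size_bytes * 8 / video_length_ms


def result_file_name(codec, dataset_name, bit_rate):
    """Build '(video CODEC)_(dataset name)_(bit rate).xml' from its parts."""
    return f"{codec}_{dataset_name}_{bit_rate}.xml"
```

For example, an encoded file of 125 000 bytes for a clip of 10 000 ms corresponds to 100 kbit/s.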
The settings of the video CODECs used in the test bench and information about the system the test bench ran on can be found in Appendix B.
The data will be processed and presented using the following methods:
3.3.1 Preprocessing and Loading
The constructed test bench takes the .xml output from the VMAF tool and adds the size of the compressed file. The results are stored in a .xml file following the naming scheme CODEC_testSet_bitRate.xml. A Python script iterates through a folder and loads the data from all .xml files in it. The script builds a data structure which can then be used for the analysis proposed in the Statistical Tools section below.
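A loading step of this kind could look roughly like the sketch below. It is not the script used in the study: the function names are hypothetical, and it assumes that the CODEC name contains no underscore and that the bit rate part of the file name is numeric. The content of each .xml file is only parsed and kept as an element tree, because the exact layout of the result files is not described here.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def parse_result_name(path):
    """Split 'CODEC_testSet_bitRate.xml' into (codec, test_set, bit_rate).

    Assumes the CODEC name has no underscore; the test set name may have some.
    """
    codec, rest = Path(path).stem.split("_", 1)
    test_set, bit_rate = rest.rsplit("_", 1)
    return codec, test_set, float(bit_rate)


def load_results(folder):
    """Load every .xml file in `folder`, grouped by (codec, test_set)."""
    results = {}
    for path in sorted(Path(folder).glob("*.xml")):
        codec, test_set, bit_rate = parse_result_name(path)
        root = ET.parse(path).getroot()  # parsed XML content of this result
        results.setdefault((codec, test_set), []).append((bit_rate, root))
    for entries in results.values():
        entries.sort(key=lambda entry: entry[0])  # order each curve by bit rate
    return results
```

Grouping by CODEC and test set yields one score/bit rate curve per key, which is the form needed for a Bjøntegaard Delta comparison.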
3.3.1.1 Independent and dependent variables
Independent variables: CODEC, Video source, Bit rate
Dependent variables: SSIM, PSNR, VMAF, Compression
3.3.2 Statistical Tools
The authors use Bjøntegaard Delta (BD) on the VMAF scores when comparing the CODECs' results with regard to quality improvements and bit rate savings. The BD-quality score describes the difference in VMAF score at the same bit rate between the tested CODECs; positive numbers indicate that the primary CODEC has higher values than the secondary CODEC. A difference of at least 6 in VMAF score indicates a noticeable difference in video quality (Rassool, 2017). The BD-rate score describes the difference, in percent, in the bit rate needed for the same quality between the tested CODECs; for example, −30 indicates that the primary CODEC needs a 30% lower bit rate than the secondary CODEC for the same quality score. To calculate the BD-ratings, a Matlab script (Matyunin, 2013) was translated to Python, using numpy. The translated function was then tested against the Matlab function using the testsets provided in the documentation for the Matlab function. The authors use an interpolation level value of 3 for the quality testing and a value of 1 for the rate testing, in order to interpolate the metric relations as accurately as possible.
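The idea behind the BD-rate score can be illustrated with the simplified sketch below. It is not the translated Matlab script used in this study: it replaces the polynomial fit of the original Bjøntegaard method with piecewise-linear interpolation of log10(bit rate) over quality, uses only the Python standard library, and the names `bd_rate`, `primary`, and `secondary` are chosen for this example. Each curve is assumed to have strictly increasing quality scores.

```python
import math


def _integrate(xs, ys, lo, hi):
    """Integrate the piecewise-linear curve through (xs, ys) over [lo, hi]."""
    def value_at(x):
        for x0, y0, x1, y1 in zip(xs, ys, xs[1:], ys[1:]):
            if x0 <= x <= x1:
                return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
        raise ValueError("x lies outside the curve")

    # Interval ends plus every sample point strictly inside the interval.
    points = [lo] + [x for x in xs if lo < x < hi] + [hi]
    return sum(0.5 * (value_at(a) + value_at(b)) * (b - a)
               for a, b in zip(points, points[1:]))


def bd_rate(primary, secondary):
    """Average bit rate difference in percent of primary versus secondary.

    Both arguments are lists of (bit_rate, quality) pairs. A negative result
    means the primary CODEC needs a lower bit rate for the same quality.
    """
    def prepare(curve):
        pairs = sorted((quality, math.log10(rate)) for rate, quality in curve)
        qualities, log_rates = zip(*pairs)
        if any(b <= a for a, b in zip(qualities, qualities[1:])):
            raise ValueError("quality scores must be strictly increasing")
        return qualities, log_rates

    q_p, lr_p = prepare(primary)
    q_s, lr_s = prepare(secondary)
    # Compare only the quality interval that is valid for both curves.
    lo, hi = max(q_p[0], q_s[0]), min(q_p[-1], q_s[-1])
    if hi <= lo:
        raise ValueError("the curves do not overlap in quality")
    avg_log_diff = (_integrate(q_p, lr_p, lo, hi)
                    - _integrate(q_s, lr_s, lo, hi)) / (hi - lo)
    return (10 ** avg_log_diff - 1) * 100
```

If the primary CODEC reaches every quality level at exactly half the bit rate of the secondary CODEC, this sketch returns approximately −50, in line with the sign convention described above.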
3.3.3 Advantages and Disadvantages
Advantages: By using the same hardware and software for all tests, the results should be comparable. Using the VMAF tool from Netflix instead of creating our own saves verification time and effort, since their software is trustworthy. The comparisons show the results for real-world, non-perfect videos. The use of BD values is common when comparing CODECs.
Disadvantages: Using xvc in fast mode gives it a disadvantage in terms of quality. BD values based on VMAF scores need to be accompanied by the graphical results so that they can be interpreted correctly, since differences between scores at or above the VMAF threshold of 93 are not visible.
The differences in BD-quality ratings and BD-rate are shown in Table 1. The “score/bit rate” curves for each metric and video sequence are shown in Figures 1 to 8. Figures 1, 3, 4, 5, and 8 show that a VMAF score of 93 was reached at bit rates lower than 1000 kbit/s, while the other video sequences required higher bit rates for some of the video CODECs.
Table 1. Difference of VMAF Quality Score (BD) and Bitrate for the same Quality Score (BD) between xvc (Fast Mode), AV1 (Double-Pass) and HEVC (Double-Pass) for different Scenarios, as well as Averages
Figure 1. SSIM, PSNR, and VMAF Analysis Result of Video Sequence “Good Quality, Well Lit” after Encoding and Decoding the Video Sequence using AV1, HEVC, and xvc with a Discrete Set of Bit Rates
Figure 2. SSIM, PSNR, and VMAF Analysis Result of Video Sequence “Good Quality, Dim Lighting” after Encoding and Decoding the Video Sequence using AV1, HEVC, and xvc with a Discrete Set of Bit Rates
Figure 3. SSIM, PSNR, and VMAF Analysis Result of Video Sequence “Good Quality, Dark Lighting” after Encoding and Decoding the Video Sequence using AV1, HEVC, and xvc with a Discrete Set of Bit Rates
Figure 4. SSIM, PSNR, and VMAF Analysis Result of Video Sequence “Underexposed, Clean” after Encoding and Decoding the Video Sequence using AV1, HEVC, and xvc with a Discrete Set of Bit Rates
Figure 5. SSIM, PSNR, and VMAF Analysis Result of Video Sequence “Underexposed, Artifacts” after Encoding and Decoding the Video Sequence using AV1, HEVC, and xvc with a Discrete Set of Bit Rates
Figure 6. SSIM, PSNR, and VMAF Analysis Result of Video Sequence “Overexposed” after Encoding and Decoding the Video Sequence using AV1, HEVC, and xvc with a Discrete Set of Bit Rates
Figure 7. SSIM, PSNR, and VMAF Analysis Result of Video Sequence “Overexposed, Precompressed” after Encoding and Decoding the Video Sequence using AV1, HEVC, and xvc with a Discrete Set of Bit Rates
Figure 8. SSIM, PSNR, and VMAF Analysis Result of Video Sequence “Overexposed, Heavily Precompressed” after Encoding and Decoding the Video Sequence using AV1, HEVC, and xvc with a Discrete Set of Bit Rates
To show the relationship between the CODECs, two candidate/opponent tables have been generated. The BD-quality relationships are presented in Table 2, where a positive number indicates that the candidate has a higher average VMAF score, for the same bit rate, compared to the opponent. The bit rate relationships are presented in Table 3, where a negative number indicates a lower bit rate used by the candidate, for the same quality score, compared to the opponent.
Table 2. Difference in Average BD-quality Scores between Candidate and Opponent, Positive Numbers means Candidate has better VMAF Score for the same Bit Rate
Table 3. Savings in Average BD-rate between Candidate and Opponent, Negative Numbers means Candidate has given Bit Rate Savings in Percent for the same Quality
This section gives an answer to the research problem and then proposes future work.
Table 1 shows that both AV1 and xvc have better BD-quality scores (based on the VMAF metric) than HEVC overall, while giving large savings in the bit rate needed for the same quality. Regarding BD-quality, the difference between AV1 and xvc is small overall, with AV1 having slightly better scores in most cases. It is important to note, however, that xvc ran in fast mode due to the time constraints of this study. The results should therefore not be considered the official xvc performance scores, but rather an indication of its performance for applications requiring faster encoding times. One can expect better performance when running in performance mode (Divideon, 2018).
The outlier case (Figure 1) in the comparison of bit rate savings between AV1 and xvc shows very large bit rate savings for AV1, due to AV1 reaching a higher score. In this case, both xvc and AV1 are well beyond the VMAF threshold of 93, with a difference of less than 6 between them, and thus the difference in video quality between them is not noticeable.
Regarding videos where the original quality is lacking (“Overexposed” and “Mobile, precompressed”), both xvc and AV1 provide visibly better quality (a difference greater than 6) than HEVC. Even a darker video results in almost visible differences, as can be seen from the scores of “Good quality, dark lighting”.
In conclusion, the authors cannot name a clear winner between xvc and AV1 in general, only between AV1 and xvc in fast mode. What we can see is that AV1 with double-pass provides slightly better quality scores, at a substantial 20% lower bit rate, than xvc running in fast mode, which is not optimized for quality. Both open source alternatives are markedly better than HEVC.
In the current state, AV1 and xvc are missing hardware acceleration. Investigating an implementation where the GPU of a mobile device is utilized would provide further information about the suitability for the mobile use case.
Encoding and decoding times are very important factors when focusing on live streaming of videos. In this study, the authors noticed that AV1 was much slower than xvc (fast mode) and HEVC. This might be due to the implementation in FFmpeg still being a beta version under development. Encoding with xvc also took longer than the length of the video and would therefore not be suitable for live streaming. This, however, may also depend on the use of a virtual machine. Investigating the encoding time would further evaluate the suitability for live streaming of videos, and optimizing the implementations to reduce this time would help create sustainable network usage in the future.
To give xvc a fair chance in situations where the encoding time is less important, for example for streaming services such as Netflix that encode each video only once, it would be of interest to run a test with xvc in performance mode with double-pass turned on.
It can be seen in the line graphs that the VMAF threshold of 93 is reached at bit rates ranging from 100 kbit/s to 1500 kbit/s, depending on the video content. It would be interesting to research an adaptive solution for encoding the videos such that the minimum bit rate is used while a certain quality is still maintained. Utilizing the frame-bursting of wireless data transmission would enable dynamic adjustment of the bit rate so that, for each part of the video, the quality score is kept at an average value of, for example, a VMAF score of 95, which would result in stable video quality on mobile devices. This could also further optimize the network usage for throughput and thus reduce delays and increase the quality of service.
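As a starting point for such an adaptive solution, the lowest measured bit rate that reaches a target score can be picked from a score/bit rate curve. The sketch below is only an illustration of that selection step: the function name `lowest_bit_rate_reaching` is hypothetical and the numbers in the usage example are made up, not results from this study.

```python
def lowest_bit_rate_reaching(measurements, target_vmaf=93.0):
    """Return the lowest bit rate whose VMAF score meets the target.

    `measurements` is an iterable of (bit_rate_kbit_s, vmaf_score) pairs.
    Returns None if no measured bit rate reaches the target.
    """
    qualifying = [rate for rate, score in measurements if score >= target_vmaf]
    return min(qualifying) if qualifying else None


# Made-up example curve, for illustration only.
example_curve = [(100, 71.0), (500, 90.5), (1000, 94.2), (1500, 96.8)]
print(lowest_bit_rate_reaching(example_curve))
```

Applied per segment of a video rather than per clip, the same selection would give a bit rate that follows the content, which is the behaviour the adaptive solution aims for.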
Other future work would be to corroborate the validity of the results by using the same procedures while performing an extensive MOS test, which was outside the scope of this research work.
The authors would like to thank Niklas Embretsen and Swetha Varadharajan for their valuable comments and suggestions for improving the quality of the paper. They would also like to thank Gerald Q Maguire Jr for his help with narrowing the scope of the research work and also for his valuable comments and suggestions. They are also grateful to Jonatan Samuelsson, CEO of Divideon, for his suggestions on where to find a suitable dataset and comparison methods.
This document requires readers to be familiar with terms and concepts regarding video compression. For clarity, some of these terms are listed below with a short description.
AV1 - AOMedia Video 1 (Bitmovin Inc, 2018)
BD - Bjøntegaard Delta (Bjøntegaard, 2001)
CODEC - A linguistic blend of the term coder-decoder (Cambridge University Press, 2018)
HEVC - High Efficiency Video Coding (Fraunhofer Heinrich Hertz Institute, 2018)
MOS - Mean Opinion Score
MSE - Mean Squared Error
PSNR - Peak Signal-to-Noise Ratio (National Instruments, 2013)
QoE - Quality of Experience
QoS - Quality of Service
SNR - Signal-to-Noise Ratio (Kieser, Reynisson, & Mulligan, 2005)
SSIM - Structural Similarity Index (Wang et al., 2004)
VMAF - Video Multimethod Assessment Fusion (Rassool, 2017)
xvc - Name of a video CODEC (Divideon, 2018)
Figure 1A. One Frame from the Video described as “Good Quality, Well lit”
Figure 1B. One Frame from the Video described as “Good Quality, Dim Lighting”
Figure 1C. One Frame from the Video described as “Good Quality, Dark Lighting”
Figure 1D. One Frame from the Video described as “Underexposed, Clean”
Figure 1E. One Frame from the Video described as “Underexposed, Artifacts”. Note: the Artifacts are Color Variations from Frame to Frame and thus not Visible in this still Frame
Figure 1F. One Frame from the Video described as “Overexposed”
Figure 1G. One Frame from the Video described as “Overexposed, Precompressed”
Figure 1H. One Frame from the Video described as “Overexposed, Heavily Precompressed”
VMware Workstation 14 Player virtual machine running on the following system specifications:
VM system specifications:
The test bench is a set of GNU bash scripts, run with bash version 4.4.19
The test bench runs: