If ever there was compression needed for digital information, it is needed for digital video. Standard New Zealand PAL video has a luminance resolution of 768x576 and runs at 25 frames a second. The two chrominance components generally have a 384x288 resolution, giving a total raw video throughput of around 16 megabytes per second. Not the sort of thing you can put on a CD in any useful way, unless you can manage something like 90:1 compression. Ah-ha! Enter MPEG compression.
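Just to check that arithmetic before we go on, here it is as a short Python sketch (the figures are the PAL ones quoted above, assuming 8 bits per sample and the 1.5 Mbit/s CD rate discussed later):

    # Raw PAL video data rate, assuming 8 bits (1 byte) per sample.
    LUMA_W, LUMA_H = 768, 576        # luminance (Y) resolution
    CHROMA_W, CHROMA_H = 384, 288    # each chrominance (U, V) component
    FPS = 25                         # PAL frame rate

    bytes_per_frame = LUMA_W * LUMA_H + 2 * CHROMA_W * CHROMA_H
    bytes_per_second = bytes_per_frame * FPS
    print(f"Raw rate: {bytes_per_second / 1e6:.1f} MB/s")       # ~16.6 MB/s

    cd_rate = 1.5e6 / 8              # single-speed CD: 1.5 Mbit/s, in bytes/s
    print(f"Compression needed: {bytes_per_second / cd_rate:.0f}:1")  # ~88:1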
The MPEG format specifies methods for compressing both audio and video information. This module deals with the video part of the format. A bit of TV history is probably called for before we delve into MPEG proper.
The two main analogue television standards are NTSC (used in North America and Japan) and PAL (used in much of Europe, and in New Zealand and Australia). The main difference between the two is that NTSC has a vertical resolution of 525 lines and PAL has 625. NTSC also runs at a speed of 30 frames per second, whereas PAL is 25 fps. The horizontal resolutions are also different, and are generally measured by the bandwidth of the signal. For the purposes of pixel resolution, NTSC and PAL horizontal resolutions are generally accepted as being 640 and 768 respectively (when using square pixels). The full vertical resolution is never displayed on a normal television, however: roughly 45 (NTSC) to 49 (PAL) of the lines carry control information or are 'unused'. This gives final 'usable' NTSC and PAL resolutions of 640x480 and 768x576 pixels respectively.
When these standards were first used, television technology was very primitive and was unable to display an image at full resolution quickly enough without flickering. To get around this problem, each video frame was interlaced into two fields, each holding half the lines (288 lines for PAL, pretty crummy resolution if you ask me). For PAL systems, the odd field (288 lines) is displayed in the first 1/50th of a second and the even field in the second 1/50th. This means that still frames of fast-moving objects can jump forwards and backwards on a TV, or seem to be 'combed' on a computer monitor.
Figure 1. Video Interlacing, showing 'combing' on fast moving objects.
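To see where the 'combing' comes from, here is a minimal sketch (assuming the frame is held as a NumPy array of scanlines) that splits a frame into its two fields:

    import numpy as np

    def split_fields(frame: np.ndarray):
        """Split an interlaced frame into its two fields.
        Rows 0, 2, 4, ... form one field and rows 1, 3, 5, ... the other.
        The two fields are captured 1/50th of a second apart (PAL), so a
        fast-moving object sits in a different place in each field."""
        return frame[0::2], frame[1::2]

    frame = np.zeros((576, 768), dtype=np.uint8)   # one PAL luminance frame
    odd_field, even_field = split_fields(frame)
    print(odd_field.shape, even_field.shape)        # (288, 768) twice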
When colour was added to television broadcasts, they needed to keep backwards compatibility with the existing black and white (actually greyscale) signals, so the YUV colour space was used and the chrominance components were simply added to the existing luminance information (recall the YUV and YCbCr details in the module on JPEG compression). Because there was less bandwidth available for the colour information, both NTSC and PAL broadcast quality signals have a 4:2:2 subsampling ratio (i.e. the U and V components have half the horizontal resolution of the Y component).
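As a rough illustration of 4:2:2 subsampling (a sketch only: the luma weights are the standard BT.601 ones, and averaging pixel pairs is just one simple way to halve the chroma resolution):

    import numpy as np

    def rgb_to_yuv422(rgb: np.ndarray):
        """Convert an RGB image (H x W x 3, floats in 0..1) to Y, U, V
        planes with 4:2:2 subsampling: U and V keep full vertical
        resolution but only half the horizontal resolution."""
        r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
        y = 0.299 * r + 0.587 * g + 0.114 * b    # BT.601 luma weights
        u = 0.492 * (b - y)                      # blue colour difference
        v = 0.877 * (r - y)                      # red colour difference
        # Halve the horizontal chroma resolution by averaging pixel pairs.
        u = (u[:, 0::2] + u[:, 1::2]) / 2
        v = (v[:, 0::2] + v[:, 1::2]) / 2
        return y, u, v

    img = np.random.rand(576, 768, 3)            # stand-in PAL frame
    y, u, v = rgb_to_yuv422(img)
    print(y.shape, u.shape, v.shape)             # (576,768) (576,384) (576,384)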
MPEG-1 was designed to run from a standard single-speed CD at 1.5 Mbit/s, including both the video and audio streams. The format is actually quite flexible: it is possible to encode images up to 4095x4095 in resolution, with bit rates up to 100 Mbit/s, although such extremes are rarely used in practice. A number of sacrifices were made to the video (in particular) to achieve 'almost' VHS quality at the CD rate.
The standard video image size used by MPEG-1 is based on the CCIR-601 digital video standard. This is a 720x576 image at 25 fps (with two fields making up one frame) for PAL, and 720x486 at 30 fps for NTSC. This format does not use square pixels, so the horizontal resolution is slightly worse for PAL and slightly better for NTSC signals. The chrominance components are reduced 2:1 horizontally for both formats. MPEG makes use of a reduced version of CCIR-601 called SIF (Source Input Format). The resolution of SIF images is just 360x288 at 25 fps (no interlacing), with a chrominance resolution of 180x144, for PAL. The NTSC figures are 360x240 at 30 fps, with 180x120 chrominance. (In practice the width is usually trimmed to 352 pixels so that it divides evenly into 16-pixel macroblocks.) These figures are convenient as they give an identical source data rate for both formats, ie. (360 x 288 + 180 x 144 x 2) x 25 = (360 x 240 + 180 x 120 x 2) x 30. Sneaky, huh?
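You can verify that identity in a couple of lines of Python:

    # Source data rate in samples per second for each SIF variant.
    pal  = (360 * 288 + 180 * 144 * 2) * 25
    ntsc = (360 * 240 + 180 * 120 * 2) * 30
    print(pal, ntsc, pal == ntsc)   # 3888000 3888000 True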
MPEG video compression makes use of two easily observed facts about video streams: individual frames contain plenty of spatial redundancy, just as still photographs do, and consecutive frames are usually very similar to one another. The first fact is exploited by I (intra) frames, which are compressed in much the same way as JPEG still images: the frame is divided into macroblocks which are DCT encoded and quantised, without reference to any other frame.
Figure 2. Low quality MPEG sequence of I frames (2x2 macroblocks) and source images.
A valid MPEG video stream can be made up entirely of I frames, and early MPEG capture hardware and encoding software produced exactly this sort of stream. Such streams can't match the compression of a full MPEG implementation, though, or no-one would bother with P and B frames.
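To make the 'JPEG-like' spatial coding concrete, here is a minimal sketch of intra coding a single 8x8 block (assuming SciPy is available; the single uniform quantiser step is a made-up illustration, as real MPEG uses a quantisation matrix and a per-macroblock quantiser scale):

    import numpy as np
    from scipy.fft import dctn, idctn

    def intra_code_block(block: np.ndarray, step: int = 16):
        """Spatially encode one 8x8 pixel block, JPEG-style: level-shift,
        transform with a 2D DCT, then quantise the coefficients."""
        coeffs = dctn(block.astype(float) - 128, norm='ortho')
        return np.round(coeffs / step).astype(int)

    def intra_decode_block(q: np.ndarray, step: int = 16):
        """Invert the quantisation and the DCT to recover the pixels."""
        return idctn(q * float(step), norm='ortho') + 128

    block = np.random.randint(0, 256, (8, 8))
    print(intra_decode_block(intra_code_block(block)).round())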
Figure 3. Low quality MPEG IPPP sequence of frames (2x2 macroblocks) and source images.
P (predicted) frames exploit the similarity between consecutive frames. For each macroblock, the encoder searches the previous frame for a closely matching group of pixels and records a motion vector pointing to it. The differences between the predicted pixel values and their actual values are then calculated, DCT encoded, and the coefficients quantised (more coarsely than I frame DCT coefficients). If a sufficiently similar group of pixels cannot be found in the previous frame, a P frame can simply spatially encode the macroblock as though it were an I frame.
Figure 4. Motion vectors and difference calculation.
(Note that the motion vectors are actually drawn the wrong way around in this diagram.)
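Here's a hedged sketch of the motion search for one macroblock, using an exhaustive sum-of-absolute-differences (SAD) search; real encoders use much faster heuristics, and the 16-pixel block size and the search range of 8 pixels here are just illustrative:

    import numpy as np

    def find_motion_vector(prev, cur, bx, by, size=16, search=8):
        """Find the (dx, dy) offset into the previous frame whose block
        best matches the current macroblock at (bx, by), by minimising
        the sum of absolute differences (SAD)."""
        target = cur[by:by + size, bx:bx + size].astype(int)
        best, best_sad = (0, 0), None
        for dy in range(-search, search + 1):
            for dx in range(-search, search + 1):
                y, x = by + dy, bx + dx
                if y < 0 or x < 0 or y + size > prev.shape[0] or x + size > prev.shape[1]:
                    continue  # candidate block falls outside the frame
                sad = np.abs(prev[y:y + size, x:x + size].astype(int) - target).sum()
                if best_sad is None or sad < best_sad:
                    best, best_sad = (dx, dy), sad
        return best, best_sad  # the residual (difference) block is then DCT coded

    prev = np.random.randint(0, 256, (288, 360), dtype=np.uint8)  # SIF frame
    cur = np.roll(prev, 3, axis=1)           # fake a 3-pixel pan to the right
    print(find_motion_vector(prev, cur, 160, 128))   # -> ((-3, 0), 0)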
A valid MPEG video stream can be made up entirely of I and P frames. IP MPEG encoders only require a buffer for the previous frame.
Finally, a B frame can predict the current frame based on both the previous and following frames. Gosh. Motion vectors are calculated using macroblocks from the frames on either side of the current frame. The resulting predicted blocks are then averaged, and these pixel values are used as the final predictor for the current frame.
Figure 5. Low quality MPEG IBBPBB sequence of frames (2x2 macroblocks) and source images.
Like P frames, the differences between the predicted pixel values and their actual values are calculated, DCT encoded and quantised. Using a B frame requires that both the previous and following frames are buffered for processing.
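The bidirectional prediction itself is just an average of the two motion-compensated blocks; a minimal sketch (assuming the two matching blocks have already been found by a motion search like the one above):

    import numpy as np

    def bidirectional_predict(block_from_prev, block_from_next):
        """Predict a B frame macroblock by averaging the motion-compensated
        blocks from the previous and following reference frames.  The
        encoder then DCT codes only the difference from this prediction."""
        return (block_from_prev.astype(int) + block_from_next.astype(int)) // 2

    prev_block = np.full((16, 16), 100, dtype=np.uint8)
    next_block = np.full((16, 16), 110, dtype=np.uint8)
    print(bidirectional_predict(prev_block, next_block)[0, 0])   # 105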
As with the quantisation tables used in JPEG compression, the MPEG specification leaves it entirely up to the programmer of the encoder to decide how the various frame types (I, P, B-forward and B-backward) will be interleaved and how the motion vectors will be calculated.
Figure 6. Frame display sequence and inter-frame dependencies.
One common frame combination for an MPEG stream is IBBPBBPBBPBB, which supplies a keyframe (I frame) approximately every half second for PAL video. This is the order in which the frames are displayed, not the order in which they occur in the data stream. Because B frames may depend on I or P frames occurring after the current video frame, those I and P frames must be decoded first.
Thus the previous frame pattern IBBP would occur as IPBB in the data stream. The I frame is decoded first and displayed. Next the P frame is decoded for reference, but not immediately shown. The B frames are decoded and displayed, possibly using information from the P frame, before the P frame itself is finally displayed.
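This reordering is easy to express in code; a sketch (the trailing B frames of the last group would really wait for the first I frame of the next group):

    def display_to_stream(display_order):
        """Reorder frames from display order to MPEG stream (decode) order.
        B frames depend on the *next* I or P frame, so that anchor frame
        must appear in the stream before the B frames it helps predict."""
        stream, pending_b = [], []
        for frame in display_order:
            if frame == 'B':
                pending_b.append(frame)  # hold until the next anchor is emitted
            else:                        # I or P frame: emit it, then the held Bs
                stream.append(frame)
                stream.extend(pending_b)
                pending_b = []
        return stream + pending_b

    print(display_to_stream(list("IBBP")))
    # -> ['I', 'P', 'B', 'B'], exactly the IPBB stream order described above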
How did you find this 'lecture'? Was it too hard? Too easy? Was there something in particular you would like graphically illustrated? If you have any sensible comments, email me and let me know.