If ever there was compression needed for digital information, it is needed for digital video. Standard New Zealand PAL video has a luminance resolution of 768x576 and runs at 25 frames a second. The two chrominance components generally have a 384x288 resolution, giving a total raw video throughput of around 16 megabytes per second. Not the sort of thing you can put on a CD in any useful way, unless you can manage something like 90:1 compression. Ah-ha! Enter MPEG compression.
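Just to check that arithmetic before we go on, here it is as a short Python sketch (the figures are the PAL ones quoted above, assuming 8 bits per sample and the 1.5 Mbit/s CD rate discussed later):

    # Raw PAL video data rate, assuming 8 bits (1 byte) per sample.
    LUMA_W, LUMA_H = 768, 576        # luminance (Y) resolution
    CHROMA_W, CHROMA_H = 384, 288    # each chrominance (U, V) component
    FPS = 25                         # PAL frame rate

    bytes_per_frame = LUMA_W * LUMA_H + 2 * CHROMA_W * CHROMA_H
    bytes_per_second = bytes_per_frame * FPS
    print(f"Raw rate: {bytes_per_second / 1e6:.1f} MB/s")       # ~16.6 MB/s

    cd_rate = 1.5e6 / 8              # single-speed CD: 1.5 Mbit/s, in bytes/s
    print(f"Compression needed: {bytes_per_second / cd_rate:.0f}:1")  # ~88:1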
The MPEG format specifies methods for compressing both audio and video information. This module deals with the video part of the format. A bit of TV history is probably called for before we delve into MPEG proper.
The two main analogue television standards are NTSC (used in North America and Japan) and PAL (used in much of Europe, and in New Zealand and Australia). The main difference between the two is that NTSC has a vertical resolution of 525 lines and PAL has 625. NTSC also runs at a speed of 30 frames per second, whereas PAL is 25 fps. The horizontal resolutions are also different, and are generally measured by the bandwidth of the signal. For the purposes of pixel resolution, NTSC and PAL horizontal resolutions are generally accepted as being 640 and 768 respectively (when using square pixels). The full vertical resolution is never displayed on a normal television, however: roughly 45 (NTSC) to 49 (PAL) of the lines carry control information or are 'unused'. This gives final 'usable' NTSC and PAL resolutions of 640x480 and 768x576 pixels respectively.
When these standards were first used, television technology was very primitive and was unable to display an image at full resolution quickly enough without flickering. To get around this problem, each video frame was interlaced into two fields, each holding half the lines (288 lines for PAL, pretty crummy resolution if you ask me). For PAL systems, the odd field (288 lines) is displayed in the first 1/50th of a second and the even field in the second 1/50th. This means that still frames of fast-moving objects can jump forwards and backwards on a TV, or seem to be 'combed' on a computer monitor.
Figure 1. Video Interlacing, showing 'combing' on fast moving objects.
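To see where the 'combing' comes from, here is a minimal sketch (assuming the frame is held as a NumPy array of scanlines) that splits a frame into its two fields:

    import numpy as np

    def split_fields(frame: np.ndarray):
        """Split an interlaced frame into its two fields.
        Rows 0, 2, 4, ... form one field and rows 1, 3, 5, ... the other.
        The two fields are captured 1/50th of a second apart (PAL), so a
        fast-moving object sits in a different place in each field."""
        return frame[0::2], frame[1::2]

    frame = np.zeros((576, 768), dtype=np.uint8)   # one PAL luminance frame
    odd_field, even_field = split_fields(frame)
    print(odd_field.shape, even_field.shape)        # (288, 768) twice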
When colour was added to television broadcasts, they needed to keep backwards compatibility with the existing black and white (actually greyscale) signals, so the YUV colour space was used and the chrominance components were simply added to the existing luminance information (recall the YUV and YCbCr details in the module on JPEG compression). Because there was less bandwidth available for the colour information, both NTSC and PAL broadcast quality signals have a 4:2:2 subsampling ratio (i.e. the U and V components have half the horizontal resolution of the Y component).
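As a rough illustration of 4:2:2 subsampling (a sketch only: the luma weights are the standard BT.601 ones, and averaging pixel pairs is just one simple way to halve the chroma resolution):

    import numpy as np

    def rgb_to_yuv422(rgb: np.ndarray):
        """Convert an RGB image (H x W x 3, floats in 0..1) to Y, U, V
        planes with 4:2:2 subsampling: U and V keep full vertical
        resolution but only half the horizontal resolution."""
        r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
        y = 0.299 * r + 0.587 * g + 0.114 * b    # BT.601 luma weights
        u = 0.492 * (b - y)                      # blue colour difference
        v = 0.877 * (r - y)                      # red colour difference
        # Halve the horizontal chroma resolution by averaging pixel pairs.
        u = (u[:, 0::2] + u[:, 1::2]) / 2
        v = (v[:, 0::2] + v[:, 1::2]) / 2
        return y, u, v

    img = np.random.rand(576, 768, 3)            # stand-in PAL frame
    y, u, v = rgb_to_yuv422(img)
    print(y.shape, u.shape, v.shape)             # (576,768) (576,384) (576,384)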
MPEG-1 was designed to run from a standard single-speed CD at 1.5 Mbit/s, including both the video and audio streams. The format is actually quite flexible: it is possible to encode images up to 4095x4095 in resolution, with bit rates up to 100 Mbit/s, although such extremes are rarely used in practice. A number of sacrifices were made to the video (in particular) to achieve 'almost' VHS quality at the CD rate.
The standard video image size used by MPEG-1 is based on the CCIR-601 digital video standard. This is a 720x576 image at 25 fps (with two fields making up one frame) for PAL, and 720x486 at 30 fps for NTSC. This format does not use square pixels, so the horizontal resolution is slightly worse for PAL and slightly better for NTSC signals. The chrominance components are reduced 2:1 horizontally for both formats. MPEG makes use of a reduced version of CCIR-601 called SIF (Source Input Format). The resolution of SIF images is just 360x288 at 25 fps (no interlacing), with a chrominance resolution of 180x144, for PAL. The NTSC figures are 360x240 at 30 fps, with 180x120 chrominance. (In practice the width is usually trimmed to 352 pixels so that it divides evenly into 16-pixel macroblocks.) These figures are convenient as they give an identical source data rate for both formats, ie. (360 x 288 + 180 x 144 x 2) x 25 = (360 x 240 + 180 x 120 x 2) x 30. Sneaky, huh?
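You can verify that identity in a couple of lines of Python:

    # Source data rate in samples per second for each SIF variant.
    pal  = (360 * 288 + 180 * 144 * 2) * 25
    ntsc = (360 * 240 + 180 * 120 * 2) * 30
    print(pal, ntsc, pal == ntsc)   # 3888000 3888000 True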
MPEG video compression makes use of two easily observed facts about video streams: individual frames contain plenty of spatial redundancy, just as still photographs do, and consecutive frames are usually very similar to one another. The first fact is exploited by I (intra) frames, which are compressed in much the same way as JPEG still images: the frame is divided into macroblocks which are DCT encoded and quantised, without reference to any other frame.
Figure 2. Low quality MPEG sequence of I frames (2x2 macroblocks) and source images.
A valid MPEG video stream can be made up entirely of I frames, and early MPEG capture hardware and encoding software produced exactly this sort of stream. Such streams can't match the compression of a full MPEG implementation, though, or no-one would bother with P and B frames.
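To make the 'JPEG-like' spatial coding concrete, here is a minimal sketch of intra coding a single 8x8 block (assuming SciPy is available; the single uniform quantiser step is a made-up illustration, as real MPEG uses a quantisation matrix and a per-macroblock quantiser scale):

    import numpy as np
    from scipy.fft import dctn, idctn

    def intra_code_block(block: np.ndarray, step: int = 16):
        """Spatially encode one 8x8 pixel block, JPEG-style: level-shift,
        transform with a 2D DCT, then quantise the coefficients."""
        coeffs = dctn(block.astype(float) - 128, norm='ortho')
        return np.round(coeffs / step).astype(int)

    def intra_decode_block(q: np.ndarray, step: int = 16):
        """Invert the quantisation and the DCT to recover the pixels."""
        return idctn(q * float(step), norm='ortho') + 128

    block = np.random.randint(0, 256, (8, 8))
    print(intra_decode_block(intra_code_block(block)).round())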
Figure 3. Low quality MPEG IPPP sequence of frames (2x2 macroblocks) and source images.
P (predicted) frames exploit the similarity between consecutive frames. For each macroblock, the encoder searches the previous frame for a closely matching group of pixels and records a motion vector pointing to it. The differences between the predicted pixel values and their actual values are then calculated, DCT encoded, and the coefficients quantised (more coarsely than I frame DCT coefficients). If a sufficiently similar group of pixels cannot be found in the previous frame, a P frame can simply spatially encode the macroblock as though it were an I frame.
Figure 4. Motion vectors and difference calculation.
(Note that the motion vectors are actually drawn the wrong way around in this diagram.)
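Here's a hedged sketch of the motion search for one macroblock, using an exhaustive sum-of-absolute-differences (SAD) search; real encoders use much faster heuristics, and the 16-pixel block size and the search range of 8 pixels here are just illustrative:

    import numpy as np

    def find_motion_vector(prev, cur, bx, by, size=16, search=8):
        """Find the (dx, dy) offset into the previous frame whose block
        best matches the current macroblock at (bx, by), by minimising
        the sum of absolute differences (SAD)."""
        target = cur[by:by + size, bx:bx + size].astype(int)
        best, best_sad = (0, 0), None
        for dy in range(-search, search + 1):
            for dx in range(-search, search + 1):
                y, x = by + dy, bx + dx
                if y < 0 or x < 0 or y + size > prev.shape[0] or x + size > prev.shape[1]:
                    continue  # candidate block falls outside the frame
                sad = np.abs(prev[y:y + size, x:x + size].astype(int) - target).sum()
                if best_sad is None or sad < best_sad:
                    best, best_sad = (dx, dy), sad
        return best, best_sad  # the residual (difference) block is then DCT coded

    prev = np.random.randint(0, 256, (288, 360), dtype=np.uint8)  # SIF frame
    cur = np.roll(prev, 3, axis=1)           # fake a 3-pixel pan to the right
    print(find_motion_vector(prev, cur, 160, 128))   # -> ((-3, 0), 0)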
A valid MPEG video stream can be made up entirely of I and P frames. IP MPEG encoders only require a buffer for the previous frame.
Finally, a B frame can predict the current frame based on both the previous and following frames. Gosh. Motion vectors are calculated using macroblocks from the frames on either side of the current frame. The resulting predicted blocks are then averaged, and these pixel values are used as the final predictor for the current frame.
Figure 5. Low quality MPEG IBBPBB sequence of frames (2x2 macroblocks) and source images.
Like P frames, the differences between the predicted pixel values and their actual values are calculated, DCT encoded and quantised. Using a B frame requires that both the previous and following frames are buffered for processing.
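The bidirectional prediction itself is just an average of the two motion-compensated blocks; a minimal sketch (assuming the two matching blocks have already been found by a motion search like the one above):

    import numpy as np

    def bidirectional_predict(block_from_prev, block_from_next):
        """Predict a B frame macroblock by averaging the motion-compensated
        blocks from the previous and following reference frames.  The
        encoder then DCT codes only the difference from this prediction."""
        return (block_from_prev.astype(int) + block_from_next.astype(int)) // 2

    prev_block = np.full((16, 16), 100, dtype=np.uint8)
    next_block = np.full((16, 16), 110, dtype=np.uint8)
    print(bidirectional_predict(prev_block, next_block)[0, 0])   # 105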
As with the quantisation tables used in JPEG compression, the MPEG specification leaves it entirely up to the programmer of the encoder to decide how the various frame types (I, P, B-forward and B-backward) will be interleaved and how the motion vectors will be calculated.
Figure 6. Frame display sequence and inter-frame dependencies.
One common frame combination for an MPEG stream is IBBPBBPBBPBB, which supplies a keyframe (I frame) approximately every half second for PAL video. This is the order in which the frames are displayed, not the order in which they occur in the data stream. Because B frames may depend on I or P frames occurring after the current video frame, those I and P frames must be decoded first.
Thus the previous frame pattern IBBP would occur as IPBB in the data stream. The I frame is decoded first and displayed. Next the P frame is decoded for reference, but not immediately shown. The B frames are decoded and displayed, possibly using information from the P frame, before the P frame itself is finally displayed.
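This reordering is easy to express in code; a sketch (the trailing B frames of the last group would really wait for the first I frame of the next group):

    def display_to_stream(display_order):
        """Reorder frames from display order to MPEG stream (decode) order.
        B frames depend on the *next* I or P frame, so that anchor frame
        must appear in the stream before the B frames it helps predict."""
        stream, pending_b = [], []
        for frame in display_order:
            if frame == 'B':
                pending_b.append(frame)  # hold until the next anchor is emitted
            else:                        # I or P frame: emit it, then the held Bs
                stream.append(frame)
                stream.extend(pending_b)
                pending_b = []
        return stream + pending_b

    print(display_to_stream(list("IBBP")))
    # -> ['I', 'P', 'B', 'B'], exactly the IPBB stream order described above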
How did you find this 'lecture'? Was it too hard? Too easy? Was there something in particular you would like graphically illustrated? If you have any sensible comments, email me and let me know.