14 May 2008

Blog moved to multimedia.cx

This blog has been moved to the multimedia.cx server and can now be found here.

10 May 2008

Introducing Photon

A few days ago I was somewhat bored; no particular good ideas for quick x264 improvements, no good games to play, and most of my finals out of the way. So I decided to make my own codec, called Photon.

The basic goal is to make a very fast MPEG-2-like format with better compression and speed than MPEG-2, but without the complexities of things like interlacing and B-frames.
My eventual plans for Photon are a bit more fancy than what I have so far, of course; currently the encoder and decoder are intra-only, for example. And honestly, I'll be shocked if I manage to reach my goals; my main purpose in this is to learn how to write a codec and bitstream format from the ground up and experiment with all sorts of features in the process.

The main philosophy behind Photon is to keep everything as simple as possible; the fewer special cases needed, the better. As such, every macroblock is divided up into 8x8 blocks; 4 for luma and 2 for chroma (the codec uses YV12 colorspace). Every single 8x8 block, luma or chroma, is treated exactly the same with the exact same set of code; the bitstream does not even distinguish them. This makes it possible to have a single loop for decoding all of these blocks. For the transform, the H.264 transform/quant/zigzag process is used.

Preceding the blocks is a 6-bit element, with one bit for each block.
For each block, its associated bit in the Coded Block Pattern (CBP) is set to 0 if there's nothing in the block and set to 1 if there is.

For each block with a
CBP of 1:

1. A delta-quantizer element. Yes, that's right: each 8x8 block has its own quantizer! This increases the effectiveness of adaptive quantization. The delta-quantizer is done with respect to the same numbered block in the previous macroblock and is coded in unary with one bit for the sign (total bit cost: 1 bit to not code a delta quant, N+1 bits to code a delta quant, where N is the delta).
2. A transform element. Set to 0 if the block uses 4x4dct, set to 1 if the block uses 8x8dct.
3. A 4-bit
CBP for the 4 4x4dct blocks, set in the same manner as the macroblock CBP. Omitted in the case of 8x8dct.
4. Residual for the 8x8 block, or for each 4x4 block with a CBP (in raster scan order) as follows:
a. The first coefficient in the block, coded as a signed exponential golomb code.
b. Is this the last coefficient? If so, a 1 is coded and the residual coding ends. Otherwise, a 0 is coded.
c. The run length of 0s until the next coefficient, coded as an unsigned exponential golomb code.
This loop repeats to code the entire block.

The macroblock itself has a 2-to-4-bit header specifying the luma and chroma prediction modes to use, and the frame has a header specifying the frametype and frame quantizer.

... and that's it. That's the whole thing, so far. Yes, the entropy coding is absurdly suboptimal and would benefit dramatically from custom variable-length codes instead of universal codes. Yes, I'm not using an ounce of assembly whatsoever and the encoding and decoding are far slower than they should be (decoding is still realtime for most sane resolutions though). But it works.

For those who want to see my hackneyed, half-copy-pasted-from-x264's-common-library code, the diff so far is here.

09 May 2008

Variable length coding and you

Let's say you want to write a number to a bitstream. The decoder is expecting a number, so it knows exactly what to look for. But how do you format this number?

You could write 16 bits no matter what, putting a max of 65535 on this number. This would also waste tons of bits though; what if the number you're writing here with your encoder is almost always 0, 1, or 2? Writing 16 bits would seem ridiculous, so it only makes sense to use a variable-length code (VLC). But how would the decoder know how *long* the code is? Here's a simple example:

5 -> 101
2 -> 10

Both will look the same to the decoder, since it doesn't know whether the code ends after the first bit, the second, the third, etc. But there's a solution to this problem; a common one lies in the form of exponential Golomb codes. Here's an example:

0 -> 1
2 -> 011
5 -> 00110
8 -> 0001001


The number of zeroes is the number of bits in the code, and a 1 is used to terminate the list of zeroes. This may seem inefficient, but in a sense one has no other choice but to have some sort of coding scheme like this. Of course, exponential golomb codes are only optimal in the case of a specific probability distribution; an optimal variable-length coding system has code lengths chosen based on the probability of each value. This is the primary reason why MPEG-4 Part 2 (Xvid/DivX) is superior to MPEG-2; the VLC tables were much better chosen.

Of course, the ideal system would be to allow custom VLC tables to be put in the header of the video file; its quite trivial to design a twopass encoder to create optimal VLC tables for a given video, or even a given frame. H.264 partially does away with this problem with CABAC; while there would be benefit to including a custom initial CABAC state, the cost of various CABAC modes adapts quickly to the video. Of course, this is not always the case; CABAC still relies on variable-length codes, because it converts numbers to series of binary digits, which are then written into the context-based arithmetic coder (the "binary" part of CABAC). One result of this is that even if one has a B-frame made up entirely of intra blocks, the arithmetic coder won't be able to completely adapt, and each block will still cost more bits; the variable-length coding for the macroblock type is not chosen in such a way that would allow such "complete" adaptation.

06 May 2008

Test clips

Many users have wondered what kind of test clips are used to test x264 during development. In this post, I will attempt to enlighten readers on my personal suite of test clips and why I chose each one.

My first test clip is a roughly 3600-frame-long segment from Pirates of the Carribean: The Curse of the Black Pearl. I often just use the first 500 or so frames of this. The primary reason this test clip was chosen was to serve as an "ordinary" standard-definition video with a reasonable amount of film grain and the same sort of imperfections that most DVD sources have. The clip can be downloaded here.

My second test clip is a section from Elephant's Dream, the free full-CGI short made in Blender and released by the Orange Open Movie Project studio. This sample is useful for a number of reasons: first, it is available in true lossless 1080p, so I can use a completely flawless source as input to the encoder. Second, of course, it is completely free. Finally, it serves as a good example of a CGI source for which to test x264 on. The full lossless original can be found in PNG sequence format here. If I need an extra HD CGI short, I usually use For the Birds or Lifted.

Next, I usually have an anime/cartoon source of some sort. I sometimes use the Flash version of the Azumanga Daioh OP, but this clip is often less useful than one would think; because its Flash, it is completely lossless. Most real anime/cartoon sources are either DVD or TV broadcast and are therefore far less clean than the above clip. Since x264 has to be able to deal with the flaws in its input, I usually use a fansub copy of the Haruhi ED sequence.

My penultimate test clip is from the VC-1 Blu-ray version of "300." This is a serious torture test for any video encoder; the artificial grain in this source is unbelievable and requires a ridiculously high bitrate to maintain at any resolution. This is also one of the main test clips I used in developing my film grain optimization. While I obviously can't upload the whole video nor even a reasonably sized segment of the source, a relatively high quality sample rip can be found here.

My final test clip is the ultimate torture test: a set of 640x480 videos that need over 4 megabits to avoid noticeable visual artifacts. These are lossless FRAPS captures of Touhou Project games, a series of "danmaku" vertical-scrolling shmups with a difficulty ranging from high to completely ridiculous. With this difficulty comes incredibly complex bullet patterns that make can make many video encoders completely choke; it is nearly impossible to effectively compress the sharp-edged bullets enough to reach the target bitrate without completely decimating the backgrounds. I have used these clips for a comparison between x264 and Elecard and a sample video for H.264-in-browser using Flash. I also have a short lossless clip available for those interested.

x264 development: a six month retrospective

These past 6 months have consisted mostly of bugfixes, vast speed improvements, and the beginning of what will hopefully be a series of psychovisual optimizations.

How can I best describe the speed boost? Numbers would do the best job, I think. All values are my internal development build compared to the current version from 6 months ago. Adaptive quantization is disabled to make the results comparable. CRF is used for all encodes.

Max speed settings (no B-frames, subme 1, analyse none, me dia): 29.5% speed boost
Near-max speed settings (
3 B-frames, subme 1, analyse none, me dia): 24.5% speed boost
Medium speed settings: (3 B-frames, subme 5): 18.5% speed boost
Slow speed settings (3 b-frames, subme 6, b-rdo,
me umh, ref 4): 35% speed boost
Very slow speed settings (16 b-frames, subme 7, b-rdo, me esa, ref 16, trellis 2, no fast-pskip, partitions all, mixed-refs): 52% speed boost
Lossless: 15% speed boost

Notable new features:

1. Psy-based adaptive quantization, for improving quality in flat areas of the frame by taking bits from more complex areas of the frame.
2. --me tesa, transformed exhaustive search. Converted from a ridiculously slow initial algorithm by me to a highly optimized thresholded solution by Loren Merritt, resulting in an even slower alternative to --me esa.
3. A massive preprocessor-based abstraction layer for assembly, allowing complete abstraction between 32-bit and 64-bit assembly and even automatic handling of everything from stack offsets to macros that permute their arguments and SSE/MMX abstraction. Written from scratch by Loren Merritt and drastically simplifies all assembly development.

Notable speed increases:

1. Altivec implementations of various functions; much faster PowerPC encoding.
2. Cacheline optimization for SAD-based motion search. Also for luma MC.
3. Much faster exhaustive motion search.
4. Lots more SSE2 assembly. And SSSE3 too. And even more SSE2. Oh wait, more...
5. Skipping stuff.
6. Much much faster CABAC encoding.
7. Tons of small optimizations all over x264. Yes, there's lots more of these. And more of these. And even more... wait, there's more here...

Chroma encoding optimizations

This post revolves around this diff.

The chroma encoding process is an often-neglected one; we focus all too much on the complexities of luma encoding while forgetting the chroma encoding process, which is exactly the same regardless of block type. The 8x8 residual blocks for the two chroma planes (from motion compensation or intra prediction) are DCT'd, quantized, and encoded. However, there are some aspects to chroma encoding that make it subtly different:

1. Like i16x16 blocks, chroma encoding is separated into the DC coefficients of each 4x4 DCT block and the AC coefficients. DC coefficients are "flat" frequencies, that is, a number representing a uniform change over the whole block. AC coefficients are non-flat, and represent a spatial frequency. In most block types, DC coefficients are not treated differently from AC; in chroma and i16x16, they are. In chroma in particular, the 4 DC coefficients (for each of the 4 DCT blocks in a color plane) form a 2x2 array, to which a Hadamard transform is applied, and then the result is quantized. This allows the overall "flat" change in color to be coded separately from any more complex change in color.

2. Chroma blocks are generally very simple and have nearly no residual data. This is part of the reason that 1) is how the H.264 standard implements chroma coding; nearly all the data is probably going to be very simple if it is there at all.

Now, this changes things slightly. First of all, it changes how DCT decimation works; normally, decimate works by picking a DCT block and setting its contents to zero if its contents are sufficiently simple that it would save a lot of bits just to not code it at all (and have a relatively small quality loss relative to the bit savings). For chroma in particular, decimate happens a *lot*, because usually there are nearly no AC coefficients.

But what if, like we often do, we decimate the AC coefficients... but there's DC coefficients left over? We don't decimate these; the visual effect could potentially be large to do so, and the bit savings is small. But we still have to decode the DC-only chroma residual data, that is, perform an inverse DCT, and store the resulting data to represent the decoded frame.

But if we know that there's only a DC coefficient and no AC coefficients, why do an entire iDCT? That would be a waste of time, since having only a DC coefficient literally just means we're adding the same value to every pixel in the block. So I ported the DC-only chroma iDCT from ffh264 and rewrote it to do all 4 4x4 blocks at once, so it can handle an entire 8x8 chroma block in one call. Now, whenever we have a DC-only chroma block, we can skip most of the iDCT process, saving a bit of time. I also cleaned up a lot of the decimate code in the function, speeding things up further. Thanks a lot to Loren for help with the assembly on this one.

03 May 2008

Film grain optimization

Optimizing an encoder for film grain is a tough problem. For one, film grain is, by definition, basically uncorrelated between frames; that is, film grain from a previous frame is totally useless in encoding the current frame's film grain (at least it would seem!). This would suggest that intra blocks are necessary for encoding film grain, which is what generally results. Yet this encounters another problem: film grain is made up of a whole slew of spacial frequencies, many of which cannot be represented at the quantizers often used in P/B-frames! This makes it extremely difficult to efficiently represent the film grain at reasonable bitrates.

But we can cheat.

Previous frames have a lot of the necessary spatial frequencies--why not steal them? Sure, an inter block won't be as efficient as an intra block, but it might work better. Indeed, the initial idea I got from glancing at the results of the film grain optimization in Elecard's encoder (Mainconcept core). Their film grain optimization almost completely disabled I-blocks in P-frames, suggesting that this was indeed the avenue to go down. Of course, their film grain optimization really wasn't that good--so who knows?

To begin with, I tried the obvious; completely disable intra blocks in P-frames for the hell of it. Surprisingly, this actually worked; in many cases it improved grain retention! But if I was to make this practical, I'd have to find a real way of implementing a metric to decide what block type to use, rather than just brute-force disabling an entire category of blocks.

I eventually came back upon an idea I considered a while back--what about NSSD? NSSD, also known as "noise-retaining sum of squared differences," is a block comparison metric that is supposed to promote retaining grain/noise. How exactly does it do this? NSSD is equal to the sum of the ordinary SSD and the absolute value of the difference in "noise" values for the two blocks to be compared. "Noise" is abs( x(i,j) - x(i+1,j) - x(i,j+1) + x(i+1,j+1) ) summed up over all pixels x(i,j) in the source block (ignoring pixels that would result in this formula going over the edge of the block). In other words, it doesn't compare the pixels of the two blocks; it simply measures the "noisiness" of each block, and makes sure that they have a "similar" amount of noise. Keeping the two blocks visually similar is taken care of by the SSD portion of the score.

Amazingly, this worked; replacing the RD metric (SSD) with NSSD, combined with tweaking of the RD thresholds to ensure that modes that tended to retain noise were always analyzed, drastically improved grain retention, and made inter blocks drastically more common in grainy footage. The patch can be found here, complete with mildly optimized MMX assembly for the "noise" operation, ported from ffmpeg (where NSSD is available as a -cmp/-subcmp/-rdcmp option).