<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-2580926098227998794</id><updated>2011-09-04T04:33:52.986-07:00</updated><category term='skip'/><category term='speed'/><category term='memory management'/><category term='loops'/><category term='bugs'/><category term='variable-length coding'/><category term='development'/><category term='test sequences'/><category term='videos'/><category term='cacheline'/><category term='ffmpeg'/><category term='blog'/><category term='assembly'/><category term='exponential golomb codes'/><category term='wordpress'/><category term='stupidity'/><category term='ugly code'/><category term='GSOC'/><category term='finite state machine'/><category term='psychovisual optimizations'/><category term='photon'/><category term='H.264'/><category term='codec'/><category term='summary'/><category term='chroma'/><category term='CABAC'/><category term='film grain'/><category term='Intel'/><category term='noise'/><category term='x264'/><category term='bitstream'/><category term='google'/><category term='rate-distortion optimization'/><title type='text'>Diary of an x264 Developer</title><subtitle type='html'>... 
and other topics ...</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://x264dev.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2580926098227998794/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://x264dev.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>Dark Shikari</name><uri>http://www.blogger.com/profile/03573422480643306284</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>15</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-2580926098227998794.post-8592764940080818213</id><published>2008-05-14T18:13:00.000-07:00</published><updated>2008-05-14T18:17:46.678-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='wordpress'/><category scheme='http://www.blogger.com/atom/ns#' term='ffmpeg'/><category scheme='http://www.blogger.com/atom/ns#' term='blog'/><title type='text'>Blog moved to multimedia.cx</title><content type='html'>&lt;span style="font-family: arial;"&gt;This blog has been moved to the multimedia.cx server and can now be found &lt;/span&gt;&lt;a style="font-family: arial;" href="http://x264dev.multimedia.cx/"&gt;here&lt;/a&gt;&lt;span style="font-family: arial;"&gt;.&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2580926098227998794-8592764940080818213?l=x264dev.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' 
href='http://x264dev.blogspot.com/feeds/8592764940080818213/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2580926098227998794&amp;postID=8592764940080818213' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2580926098227998794/posts/default/8592764940080818213'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2580926098227998794/posts/default/8592764940080818213'/><link rel='alternate' type='text/html' href='http://x264dev.blogspot.com/2008/05/blog-moved-to-multimediacx.html' title='Blog moved to multimedia.cx'/><author><name>Dark Shikari</name><uri>http://www.blogger.com/profile/03573422480643306284</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2580926098227998794.post-2318737404274674225</id><published>2008-05-10T19:58:00.001-07:00</published><updated>2008-05-10T20:25:06.963-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='ffmpeg'/><category scheme='http://www.blogger.com/atom/ns#' term='codec'/><category scheme='http://www.blogger.com/atom/ns#' term='photon'/><title type='text'>Introducing Photon</title><content type='html'>&lt;span style="font-family:arial;"&gt;A few days ago I was somewhat bored; no particular good ideas for quick x264 improvements, no good games to play, and most of my finals out of the way.  So I decided to make my own codec, called Photon.&lt;br /&gt;&lt;br /&gt;The basic goal is to make a very fast MPEG-2-like format with better compression and speed than MPEG-2, but without the complexities of things like interlacing and B-frames.  
&lt;/span&gt;&lt;span style="font-family:arial;"&gt;My eventual plans for Photon are a bit more fancy than what I have so far, of course; currently the encoder and decoder are intra-only, for example.  And honestly, I'll be shocked if I manage to reach my goals; my main purpose in this is to learn how to write a codec and bitstream format from the ground up and experiment with all sorts of features in the process.&lt;br /&gt;&lt;br /&gt;The main philosophy behind Photon is to keep everything as simple as possible; the fewer special cases needed, the better.  As such, every macroblock is divided up into 8x8 blocks; 4 for luma and 2 for chroma (the codec uses YV12 colorspace).  Every single 8x8 block, luma or chroma, is treated exactly the same with the exact same set of code; the bitstream does not even distinguish them.  This makes it possible to have a single loop for decoding all of these blocks.  For the transform, the H.264 transform/quant/zigzag process is used.&lt;br /&gt;&lt;br /&gt;Preceding the blocks is a 6-bit element, with one bit for each block.&lt;/span&gt;&lt;span style="font-family:arial;"&gt;  For each block, its associated bit in the Coded Block &lt;/span&gt;&lt;span style="font-family:arial;"&gt;Pattern (CBP) &lt;/span&gt;&lt;span style="font-family:arial;"&gt;is set to 0 if there's nothing in the block and set to 1 if there is.&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:arial;"&gt;&lt;br /&gt;For each block with a &lt;/span&gt;&lt;span style="font-family:arial;"&gt;CBP&lt;/span&gt;&lt;span style="font-family:arial;"&gt; &lt;/span&gt;&lt;span style="font-family:arial;"&gt;of 1:&lt;br /&gt;&lt;br /&gt;1.  A delta-quantizer element.  Yes, that's right: each 8x8 block has its own quantizer!  This increases the effectiveness of adaptive quantization.  
The delta-quantizer is done with respect to the same numbered block in the previous macroblock and is coded in unary with one bit for the sign (total bit cost: 1 bit to not code a delta quant, N+1 bits to code a delta quant, where N is the delta).&lt;br /&gt;2.  A transform element.  Set to 0 if the block uses 4x4dct, set to 1 if the block uses 8x8dct.&lt;br /&gt;3.  A 4-bit &lt;/span&gt;&lt;span style="font-family:arial;"&gt;CBP&lt;/span&gt;&lt;span style="font-family:arial;"&gt; &lt;/span&gt;&lt;span style="font-family:arial;"&gt;for the 4 4x4dct blocks, set in the same manner as the macroblock CBP.  Omitted in the case of 8x8dct.&lt;br /&gt;4.  Residual for the 8x8 block, or for each 4x4 block with a CBP (in raster scan order) as follows:&lt;br /&gt;a.  The first coefficient in the block, coded as a signed exponential golomb code.&lt;br /&gt;b.  Is this the last coefficient?  If so, a 1 is coded and the residual coding ends.  Otherwise, a 0 is coded.&lt;br /&gt;c.  The run length of 0s until the next coefficient, coded as an unsigned exponential golomb code.&lt;br /&gt;This loop repeats to code the entire block.&lt;br /&gt;&lt;br /&gt;The macroblock itself has a 2-to-4-bit header specifying the luma and chroma prediction modes to use, and the frame has a header specifying the frametype and frame quantizer.&lt;br /&gt;&lt;br /&gt;... and that's it.  That's the whole thing, so far.  Yes, the entropy coding is absurdly suboptimal and would benefit dramatically from custom variable-length codes instead of universal codes.  Yes, I'm not using an ounce of assembly whatsoever and the encoding and decoding are far slower than they should be (decoding is still realtime for most sane resolutions though).  
But it works.&lt;br /&gt;&lt;br /&gt;For those who want to see my hackneyed, half-copy-pasted-from-x264's-common-library code, the diff so far is &lt;a href="http://pastebin.com/f45c9ef15"&gt;here&lt;/a&gt;.&lt;br /&gt;&lt;/span&gt;&lt;span style="font-family:arial;"&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2580926098227998794-2318737404274674225?l=x264dev.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://x264dev.blogspot.com/feeds/2318737404274674225/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2580926098227998794&amp;postID=2318737404274674225' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2580926098227998794/posts/default/2318737404274674225'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2580926098227998794/posts/default/2318737404274674225'/><link rel='alternate' type='text/html' href='http://x264dev.blogspot.com/2008/05/introducing-photon.html' title='Introducing Photon'/><author><name>Dark Shikari</name><uri>http://www.blogger.com/profile/03573422480643306284</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2580926098227998794.post-9008297699690682194</id><published>2008-05-09T11:51:00.000-07:00</published><updated>2008-05-09T17:10:52.112-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='exponential golomb codes'/><category scheme='http://www.blogger.com/atom/ns#' term='CABAC'/><category scheme='http://www.blogger.com/atom/ns#' term='bitstream'/><category scheme='http://www.blogger.com/atom/ns#' term='variable-length 
coding'/><category scheme='http://www.blogger.com/atom/ns#' term='H.264'/><title type='text'>Variable length coding and you</title><content type='html'>&lt;span style="font-family:arial;"&gt;Let's say you want to write a number to a bitstream.  The decoder is expecting a number, so it knows exactly what to look for.  But how do you format this number?&lt;br /&gt;&lt;br /&gt;You could write 16 bits no matter what, putting a max of 65535 on this number.  This would also waste tons of bits though; what if the number you're writing here with your encoder is almost always 0, 1, or 2?  Writing 16 bits would seem ridiculous, so it only makes sense to use a &lt;a href="http://en.wikipedia.org/wiki/Variable_length_code"&gt;variable-length code&lt;/a&gt; (VLC).  But how would the decoder know how *long* the code is?  Here's a simple example:&lt;br /&gt;&lt;br /&gt;5 -&gt; 101&lt;br /&gt;2 -&gt; 10&lt;br /&gt;&lt;br /&gt;The decoder cannot tell these apart as it reads them, since it doesn't know whether the code ends after the first bit, the second, the third, etc.  But there's a solution to this problem; a common one lies in the form of &lt;a href="http://en.wikipedia.org/wiki/Exponential-Golomb_coding"&gt;exponential Golomb codes&lt;/a&gt;.  Here's an example:&lt;br /&gt;&lt;br /&gt;0 -&gt; 1&lt;br /&gt;2 -&gt; 011&lt;br /&gt;5 -&gt; 00110&lt;br /&gt;8 -&gt; 0001001&lt;/span&gt;&lt;span style="font-family:arial;"&gt;&lt;br /&gt;&lt;br /&gt;The number of leading zeroes tells the decoder how many bits follow the 1 that terminates the list of zeroes; that 1 and the bits after it spell out the value plus one in binary (for 5, two zeroes are followed by 110, which is 6).  This may seem inefficient, but in a sense one has no other choice but to have some sort of coding scheme like this.  Of course, exponential Golomb codes are only optimal in the case of a specific probability distribution; an optimal variable-length coding system has code lengths chosen based on the probability of each value.  
This is the primary reason why MPEG-4 Part 2 (Xvid/DivX) is superior to MPEG-2; the VLC tables were much better chosen.&lt;br /&gt;&lt;br /&gt;Of course, the ideal system would be to allow custom VLC tables to be put in the header of the video file; it's quite trivial to design a two-pass encoder to create optimal VLC tables for a given video, or even a given frame.  H.264 partially does away with this problem with &lt;a href="http://en.wikipedia.org/wiki/CABAC"&gt;CABAC&lt;/a&gt;; while there would be benefit to including a custom initial CABAC state, the cost of various CABAC modes adapts quickly to the video.  Of course, this is not always the case; CABAC still relies on variable-length codes, because it converts numbers to series of binary digits, which are then written into the context-based arithmetic coder (the "binary" part of CABAC).  One result of this is that even if one has a B-frame made up entirely of intra blocks, the arithmetic coder won't be able to completely adapt, and each block will still cost more bits than necessary; the variable-length coding for the macroblock type is not chosen in such a way that would allow such "complete" adaptation.&lt;br /&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2580926098227998794-9008297699690682194?l=x264dev.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://x264dev.blogspot.com/feeds/9008297699690682194/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2580926098227998794&amp;postID=9008297699690682194' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2580926098227998794/posts/default/9008297699690682194'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2580926098227998794/posts/default/9008297699690682194'/><link 
rel='alternate' type='text/html' href='http://x264dev.blogspot.com/2008/05/variable-length-coding-and-you.html' title='Variable length coding and you'/><author><name>Dark Shikari</name><uri>http://www.blogger.com/profile/03573422480643306284</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2580926098227998794.post-4446880252585556358</id><published>2008-05-06T17:20:00.000-07:00</published><updated>2008-05-06T17:20:46.397-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='videos'/><category scheme='http://www.blogger.com/atom/ns#' term='test sequences'/><category scheme='http://www.blogger.com/atom/ns#' term='x264'/><title type='text'>Test clips</title><content type='html'>&lt;span style="font-family:arial;"&gt;Many users have wondered what kind of test clips are used to test x264 during development.  In this post, I will attempt to enlighten readers on my personal suite of test clips and why I chose each one.&lt;br /&gt;&lt;br /&gt;My first test clip is a roughly 3600-frame-long segment from &lt;a href="http://en.wikipedia.org/wiki/Pirates_of_the_Caribbean:_The_Curse_of_the_Black_Pearl"&gt;Pirates of the Caribbean: The Curse of the Black Pearl&lt;/a&gt;.  I often just use the first 500 or so frames of this.  The primary reason this test clip was chosen was to serve as an "ordinary" standard-definition video with a reasonable amount of film grain and the same sort of imperfections that most DVD sources have.  The clip can be downloaded &lt;a href="http://www.mediafire.com/?mymhmje0iki"&gt;here&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;My second test clip is a section from &lt;a href="http://orange.blender.org/"&gt;Elephant's Dream&lt;/a&gt;, the free full-CGI short made in Blender and released by the Orange Open Movie Project studio.  
This sample is useful for a number of reasons: first, it is available in true lossless 1080p, so I can use a completely flawless source as input to the encoder.  Second, of course, it is completely free.  Finally, it serves as a good example of a CGI source on which to test x264.  The full lossless original can be found in PNG sequence format &lt;a href="http://media.xiph.org/ED/"&gt;here&lt;/a&gt;.  If I need an extra HD CGI short, I usually use &lt;a href="http://www.imdb.com/title/tt0248808/"&gt;For the Birds&lt;/a&gt; or &lt;a href="http://www.imdb.com/title/tt0945571/"&gt;Lifted&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Next, I usually have an anime/cartoon source of some sort.  I sometimes use the &lt;a href="http://dagobah.biz/flash/Azumanga_OP.swf"&gt;Flash version of the Azumanga Daioh OP&lt;/a&gt;, but this clip is often less useful than one would think; because it's Flash, it is completely lossless.  Most real anime/cartoon sources are either DVD or TV broadcast and are therefore far less clean than the above clip.  Since x264 has to be able to deal with the flaws in its input, I usually use a &lt;a href="http://www.mediafire.com/?jzyecytmdyx"&gt;fansub copy of the Haruhi ED sequence&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;My penultimate test clip is from the VC-1 Blu-ray version of "&lt;a href="http://www.imdb.com/title/tt0416449/"&gt;300&lt;/a&gt;."  This is a serious torture test for any video encoder; the artificial grain in this source is unbelievable and requires a ridiculously high bitrate to maintain at any resolution.  This is also one of the main test clips I used in developing my &lt;a href="http://x264dev.blogspot.com/2008/05/film-grain-optimization.html"&gt;film grain optimization&lt;/a&gt;.  
While I obviously can't upload the whole video nor even a reasonably sized segment of the source, a relatively high quality sample rip can be found &lt;a href="http://www.mediafire.com/?1d1i1m39waz"&gt;here&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;My final test clip is the ultimate torture test: a set of 640x480 videos that need over 4 megabits per second to avoid noticeable visual artifacts.  These are lossless FRAPS captures of &lt;a href="http://en.wikipedia.org/wiki/Touhou_Project"&gt;Touhou Project&lt;/a&gt; games, a series of "&lt;a href="http://en.wikipedia.org/wiki/Danmaku#Manic_vs._methodical"&gt;danmaku&lt;/a&gt;" vertical-scrolling &lt;a href="http://en.wikipedia.org/wiki/Shoot_%27em_up"&gt;shmups&lt;/a&gt; with a difficulty ranging from high to completely ridiculous.  With this difficulty come incredibly complex bullet patterns that can make many video encoders &lt;a href="http://mirror05.x264.nl/Dark/x264vsElecard/xvid.png"&gt;completely choke&lt;/a&gt;; it is nearly impossible to effectively compress the sharp-edged bullets enough to reach the target bitrate without completely decimating the backgrounds.  I have used these clips for a &lt;a href="http://mirror05.x264.nl/Dark/x264vsElecard/"&gt;comparison between x264 and Elecard&lt;/a&gt; and a &lt;a href="http://mirror05.x264.nl/Dark/Flash/index_lowbitrate.html"&gt;sample video for H.264-in-browser using Flash&lt;/a&gt;.  
I also have a &lt;a href="http://mirror05.x264.nl/Dark/force.php?file=./LosslessTouhou.mkv"&gt;short lossless clip&lt;/a&gt; available for those interested.&lt;br /&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2580926098227998794-4446880252585556358?l=x264dev.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://x264dev.blogspot.com/feeds/4446880252585556358/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2580926098227998794&amp;postID=4446880252585556358' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2580926098227998794/posts/default/4446880252585556358'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2580926098227998794/posts/default/4446880252585556358'/><link rel='alternate' type='text/html' href='http://x264dev.blogspot.com/2008/05/test-clips.html' title='Test clips'/><author><name>Dark Shikari</name><uri>http://www.blogger.com/profile/03573422480643306284</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2580926098227998794.post-6709152145629455168</id><published>2008-05-06T03:21:00.001-07:00</published><updated>2008-05-06T15:08:42.876-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='summary'/><category scheme='http://www.blogger.com/atom/ns#' term='development'/><category scheme='http://www.blogger.com/atom/ns#' term='x264'/><title type='text'>x264 development: a six month retrospective</title><content type='html'>&lt;span style="font-family:arial;"&gt;These past 6 months have consisted mostly of bugfixes, vast speed improvements, and 
the beginning of what will hopefully be a series of psychovisual optimizations.&lt;br /&gt;&lt;br /&gt;How can I best describe the speed boost?  Numbers would do the best job, I think.  All values are my internal development build compared to the current version from 6 months ago.  Adaptive quantization is disabled to make the results comparable.  CRF is used for all encodes.&lt;br /&gt;&lt;br /&gt;Max speed settings (no B-frames, subme 1, analyse none, me dia): 29.5% speed boost&lt;br /&gt;Near-max speed settings (&lt;/span&gt;&lt;span style="font-family:arial;"&gt;3 B-frames, subme 1, analyse none, me dia): 24.5% speed boost&lt;br /&gt;Medium speed settings: (3 B-frames, subme 5): 18.5% speed boost&lt;br /&gt;Slow speed settings (3 b-frames, subme 6, b-rdo, &lt;/span&gt;&lt;span style="font-family:arial;"&gt;me umh, ref 4): 35% speed boost&lt;br /&gt;Very slow speed settings (16 b-frames, subme 7, b-rdo, me esa, ref 16, trellis 2, no fast-pskip, partitions all, mixed-refs): 52% speed boost&lt;br /&gt;Lossless: 15% speed boost&lt;br /&gt;&lt;br /&gt;Notable new features:&lt;br /&gt;&lt;br /&gt;1.  &lt;a href="http://git.videolan.org/?p=x264.git;a=commitdiff;h=dc4f40ce74c996a1e15021b82ab71ebcb8652d0b"&gt;Psy-based adaptive quantization&lt;/a&gt;, for improving quality in flat areas of the frame by taking bits from more complex areas of the frame.&lt;br /&gt;2.  &lt;a href="http://git.videolan.org/?p=x264.git;a=commitdiff;h=fa58b842b47a5e7fd3ac91d8141e800ecdbab0c7"&gt;--me tesa, transformed exhaustive search&lt;/a&gt;.  Converted from a ridiculously slow initial algorithm by me to a highly optimized thresholded solution by Loren Merritt, resulting in an even slower alternative to --me esa.&lt;br /&gt;3.  
&lt;a href="http://git.videolan.org/?p=x264.git;a=commitdiff;h=c4680aceec03d3063cbefa5db7ab4404f32578d8"&gt;A massive preprocessor-based abstraction layer for assembly&lt;/a&gt;, allowing complete abstraction between 32-bit and 64-bit assembly and even automatic handling of everything from stack offsets to &lt;a href="http://git.videolan.org/?p=x264.git;a=commitdiff;h=a7eec58b62c43f3417d7523a5e337f2f68602ec9"&gt;macros that permute their arguments&lt;/a&gt; and &lt;a href="http://git.videolan.org/?p=x264.git;a=commitdiff;h=5cc926a4dad3b73da4458a54b194c926f91cacd3"&gt;SSE/MMX abstraction&lt;/a&gt;.  Written from scratch by Loren Merritt and drastically simplifies all assembly development.&lt;br /&gt;&lt;br /&gt;Notable speed increases:&lt;br /&gt;&lt;br /&gt;1.  &lt;a href="http://git.videolan.org/?p=x264.git;a=commitdiff;h=a22fe12c4358ee3bbe205cf6dfded782eedd3886"&gt;Altivec&lt;/a&gt; &lt;a href="http://git.videolan.org/?p=x264.git;a=commit;h=b9eb9117a8f028ec5b727587c823a1c2ae83509e"&gt;implementations&lt;/a&gt; &lt;a href="http://git.videolan.org/?p=x264.git;a=commit;h=edf284268a60110e0ce0474f0b4e3fb772c6935f"&gt;of&lt;/a&gt; &lt;a href="http://git.videolan.org/?p=x264.git;a=commit;h=d4493c32b2d6f1eee5b7df0026a31562665bff07"&gt;various&lt;/a&gt; &lt;a href="http://git.videolan.org/?p=x264.git;a=commitdiff;h=795c3c1c2a4f5d4d2ef9025583a21ab400439fa6"&gt;functions&lt;/a&gt;; &lt;a href="http://git.videolan.org/?p=x264.git;a=commitdiff;h=9fb4e53f4cbf65d5386da7f3321ddcaf8f5250d3"&gt;much&lt;/a&gt; &lt;a href="http://git.videolan.org/?p=x264.git;a=commitdiff;h=e1d815e15cc62b52ed67b4fd1538aaa238c70e97"&gt;faster&lt;/a&gt;&lt;a href="http://git.videolan.org/?p=x264.git;a=commitdiff;h=e1d815e15cc62b52ed67b4fd1538aaa238c70e97"&gt; PowerPC encoding&lt;/a&gt;.&lt;br /&gt;2.  &lt;a href="http://git.videolan.org/?p=x264.git;a=commit;h=6d6092197676cf4949bff2a1e28a79aa1bbab1ea"&gt;Cacheline optimization for SAD-based motion search&lt;/a&gt;.  
&lt;a href="http://git.videolan.org/?p=x264.git;a=commitdiff;h=44361e9514344c4e4685bca45e8c55a28c3cd395"&gt;Also for luma MC&lt;/a&gt;.&lt;br /&gt;3.  &lt;a href="http://git.videolan.org/?p=x264.git;a=commitdiff;h=fd9ab6cee508192fe53ffaa52d3d7586004808fa"&gt;Much faster&lt;/a&gt;&lt;a href="http://git.videolan.org/?p=x264.git;a=commitdiff;h=fd9ab6cee508192fe53ffaa52d3d7586004808fa"&gt; exhaustive motion search.&lt;/a&gt;&lt;br /&gt;4.  &lt;a href="http://git.videolan.org/?p=x264.git;a=commitdiff;h=c32cac21f5f634398ca7135a6f8e304c9ea528b4"&gt;Lots&lt;/a&gt; &lt;a href="http://git.videolan.org/?p=x264.git;a=commitdiff;h=e4059f859e072ef5ac61c101897bcf67f76e27f1"&gt;more&lt;/a&gt; &lt;a href="http://git.videolan.org/?p=x264.git;a=commitdiff;h=91991ba67aa9a7256b4bdf8d1d9be183ec2daa2b"&gt;SSE2&lt;/a&gt; &lt;a href="http://git.videolan.org/?p=x264.git;a=commitdiff;h=dcf2604a00d57c05a2def4551b75b8a08af024e3"&gt;assembly&lt;/a&gt;.  &lt;a href="http://git.videolan.org/?p=x264.git;a=commitdiff;h=1df5f84baf226141548948d94c84a1f3b1792c0b"&gt;And&lt;/a&gt; &lt;a href="http://git.videolan.org/?p=x264.git;a=commitdiff;h=2487cdf92f3050830fe2b7ffc4e04cf292823e9e"&gt;SSSE3 too&lt;/a&gt;.  &lt;a href="http://git.videolan.org/?p=x264.git;a=commitdiff;h=adfab36d395dff335c5a34d050c84ac8e7e1b470"&gt;And&lt;/a&gt; &lt;a href="http://git.videolan.org/?p=x264.git;a=commitdiff;h=02e610262bac2645742cfaa40d018fd43f26e859"&gt;even&lt;/a&gt; &lt;a href="http://git.videolan.org/?p=x264.git;a=commitdiff;h=afba69a247ee3ff4ae9781cb63093529175ec135"&gt;more&lt;/a&gt; &lt;a href="http://git.videolan.org/?p=x264.git;a=commitdiff;h=2487cdf92f3050830fe2b7ffc4e04cf292823e9e"&gt;SSE2&lt;/a&gt;.  &lt;a href="http://git.videolan.org/?p=x264.git;a=commitdiff;h=c9a928bf0b3acf13287147137ea3ceee3c6c81b2"&gt;Oh&lt;/a&gt;&lt;a href="http://git.videolan.org/?p=x264.git;a=commitdiff;h=c9a928bf0b3acf13287147137ea3ceee3c6c81b2"&gt; wait, more&lt;/a&gt;...&lt;br /&gt;5.  
&lt;a href="http://git.videolan.org/?p=x264.git;a=commitdiff;h=f46c3bc88be70eb4ab798cba5e0da75b13038ffc"&gt;Skipping&lt;/a&gt; &lt;a href="http://git.videolan.org/?p=x264.git;a=commitdiff;h=2b3d65e7c2389fc4201dafb69422550e15ef0d85"&gt;stuff.&lt;/a&gt;&lt;br /&gt;6.  &lt;a href="http://git.videolan.org/?p=x264.git;a=commitdiff;h=d4f9f60faf14bd1415f16dd419d1889324db3e1e"&gt;Much&lt;/a&gt; &lt;a href="http://git.videolan.org/?p=x264.git;a=commitdiff;h=bf9bf7acf61da13d9cc45c35291f61e614d7414c"&gt;much&lt;/a&gt; &lt;a href="http://git.videolan.org/?p=x264.git;a=commitdiff;h=1035c63eb9ef354670c3c16f23fd3ac2bb88ec19"&gt;faster&lt;/a&gt; &lt;a href="http://git.videolan.org/?p=x264.git;a=commitdiff;h=6ae335530efb189b00fd6f3b1b7da5eefd856473"&gt;CABAC&lt;/a&gt; &lt;a href="http://git.videolan.org/?p=x264.git;a=commitdiff;h=e403fe9364ad1ea1cd8c3d5055759a538e97bb8b"&gt;encoding&lt;/a&gt;.&lt;br /&gt;&lt;a href="http://git.videolan.org/?p=x264.git;a=commitdiff;h=1df5f84baf226141548948d94c84a1f3b1792c0b"&gt;&lt;/a&gt;7. &lt;a href="http://git.videolan.org/?p=x264.git;a=commitdiff;h=c0b1b1af1cc30fe9ae7ad46e5c1ccd7640ceb889"&gt;Tons&lt;/a&gt; &lt;a href="http://git.videolan.org/?p=x264.git;a=commitdiff;h=f72092d59eeb212bb6db98796c797b53d1f3d966"&gt;of&lt;/a&gt; &lt;a href="http://git.videolan.org/?p=x264.git;a=commitdiff;h=4b199a95c2df54707186d3fab1b8c9511cfe0458"&gt;small&lt;/a&gt; &lt;a href="http://git.videolan.org/?p=x264.git;a=commitdiff;h=cda2dead4d316f2b1015a9cbb37765ad7eb57a13"&gt;optimizations&lt;/a&gt; &lt;a href="http://git.videolan.org/?p=x264.git;a=commitdiff;h=96a59591e1190a3b4dbcbae79f9b150f09977427"&gt;all&lt;/a&gt; &lt;a href="http://git.videolan.org/?p=x264.git;a=commitdiff;h=6cbea64acf3e6641600e7ac04789989ae61ceaf3"&gt;over&lt;/a&gt; &lt;a href="http://git.videolan.org/?p=x264.git;a=commitdiff;h=47ea39dc0e5df989776f7d5a78cb9033fbd72947"&gt;x264&lt;/a&gt;.  
&lt;a href="http://git.videolan.org/?p=x264.git;a=commitdiff;h=ea9eaa7086e4262405b3aff047da621b6edc8814"&gt;Yes&lt;/a&gt;, &lt;a href="http://git.videolan.org/?p=x264.git;a=commitdiff;h=9da2facd2a45c20ff4a225fbbbcba0f3ad644457"&gt;there's&lt;/a&gt; &lt;a href="http://git.videolan.org/?p=x264.git;a=commitdiff;h=c34e1098fec1b950f8265680daa6d3f98d361074"&gt;lots&lt;/a&gt; &lt;a href="http://git.videolan.org/?p=x264.git;a=commitdiff;h=84452e50633eb7d98a2e5f55ff4c799b2bf30f32"&gt;more&lt;/a&gt; &lt;a href="http://git.videolan.org/?p=x264.git;a=commitdiff;h=3f48ce7e838aab6d0701150f93166132cb06c6a0"&gt;of&lt;/a&gt; &lt;a href="http://git.videolan.org/?p=x264.git;a=commitdiff;h=67a130f53f26ac5ecf29ee4bd0ab442e0d87fc70"&gt;these&lt;/a&gt;.  &lt;a href="http://git.videolan.org/?p=x264.git;a=commitdiff;h=36fe32ae368797be584657eed37350faa0e93e78"&gt;And&lt;/a&gt; &lt;a href="http://git.videolan.org/?p=x264.git;a=commitdiff;h=4b18012311fe5e823495169ab9a911e1cb9b3907"&gt;more&lt;/a&gt; &lt;a href="http://git.videolan.org/?p=x264.git;a=commitdiff;h=5e0e058c72e6dcf0f432157b48c0b07566535fe6"&gt;of&lt;/a&gt; &lt;a href="http://git.videolan.org/?p=x264.git;a=commitdiff;h=73cfc9e5b8947e4d279a63da8f18253a4c066c89"&gt;these&lt;/a&gt;.  &lt;a href="http://git.videolan.org/?p=x264.git;a=commitdiff;h=b32dacbaf2294607c31ee3658c85f44b8f40722a"&gt;And&lt;/a&gt; &lt;a href="http://git.videolan.org/?p=x264.git;a=commitdiff;h=9b52ed57a8897f50d193b4ddbefe9fd19d234098"&gt;even&lt;/a&gt; &lt;a href="http://git.videolan.org/?p=x264.git;a=commitdiff;h=f892ee11a99c0b85e577e33bc1e695373b58584a"&gt;more&lt;/a&gt;... 
&lt;a href="http://git.videolan.org/?p=x264.git;a=commitdiff;h=2f75bcc1b74ece41ae39134719bcdefc2a6a0265"&gt;wait&lt;/a&gt;, &lt;a href="http://git.videolan.org/?p=x264.git;a=commitdiff;h=5a562eb9e12ef535be3d8c956b477982707083dd"&gt;there's&lt;/a&gt; &lt;a href="http://git.videolan.org/?p=x264.git;a=commitdiff;h=5dae513218070a3aafc7c56b097bc7ff7ae58526"&gt;more&lt;/a&gt; &lt;a href="http://git.videolan.org/?p=x264.git;a=commitdiff;h=45cc42cc3f5155a9bcbadaaa88c828359884c85b"&gt;here&lt;/a&gt;...&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="font-family:arial;"&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2580926098227998794-6709152145629455168?l=x264dev.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://x264dev.blogspot.com/feeds/6709152145629455168/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2580926098227998794&amp;postID=6709152145629455168' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2580926098227998794/posts/default/6709152145629455168'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2580926098227998794/posts/default/6709152145629455168'/><link rel='alternate' type='text/html' href='http://x264dev.blogspot.com/2008/05/x264-development-six-month.html' title='x264 development: a six month retrospective'/><author><name>Dark Shikari</name><uri>http://www.blogger.com/profile/03573422480643306284</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2580926098227998794.post-5718843159305985503</id><published>2008-05-06T00:53:00.000-07:00</published><updated>2008-05-06T02:35:52.688-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='speed'/><category scheme='http://www.blogger.com/atom/ns#' term='x264'/><category scheme='http://www.blogger.com/atom/ns#' term='chroma'/><title type='text'>Chroma encoding optimizations</title><content type='html'>&lt;span style="font-family:arial;"&gt;This post revolves around &lt;/span&gt;&lt;a style="font-family: arial;" href="http://pastebin.com/f7538d8b7"&gt;this diff&lt;/a&gt;&lt;span style="font-family:arial;"&gt;.&lt;br /&gt;&lt;br /&gt;The chroma encoding process is an often-neglected one; we focus all too much on the complexities of luma encoding while forgetting the chroma encoding process, which is exactly the same regardless of block type.  The 8x8 residual blocks for the two chroma planes (from motion compensation or intra prediction) are DCT'd, quantized, and encoded.  However, there are some aspects to chroma encoding that make it subtly different:&lt;br /&gt;&lt;br /&gt;1.  Like i16x16 blocks, chroma encoding is separated into the DC coefficients of each 4x4 DCT block and the AC coefficients.  DC coefficients are "flat" frequencies, that is, a number representing a uniform change over the whole block.  AC coefficients are non-flat, and represent a spatial frequency.  In most block types, DC coefficients are not treated differently from AC; in chroma and i16x16, they are.  In chroma in particular, the 4 DC coefficients (for each of the 4 DCT blocks in a color plane) form a 2x2 array, to which a &lt;a href="http://en.wikipedia.org/wiki/Hadamard_transform"&gt;Hadamard transform&lt;/a&gt; is applied, and then the result is quantized.  
This allows the overall "flat" change in color to be coded separately from any more complex change in color.&lt;br /&gt;&lt;br /&gt;2.  Chroma blocks are generally very simple and have almost no residual data.  This is part of the reason that 1) is how the H.264 standard implements chroma coding; nearly all the data is probably going to be very simple if it is there at all.&lt;br /&gt;&lt;br /&gt;Now, this changes things slightly.  First of all, it changes how DCT decimation works; normally, decimate works by picking a DCT block and setting its contents to zero if its contents are sufficiently simple that it would save a lot of bits just to not code it at all (and have a relatively small quality loss relative to the bit savings).  For chroma in particular, decimate happens a *lot*, because usually there are almost no AC coefficients.&lt;br /&gt;&lt;br /&gt;But what if, as we often do, we decimate the AC coefficients... but there are DC coefficients left over?  We don't decimate these; doing so could have a large visual effect, and the bit savings are small.  But we still have to decode the DC-only chroma residual data, that is, perform an inverse DCT, and store the resulting data to represent the decoded frame.&lt;br /&gt;&lt;br /&gt;But if we know that there's only a DC coefficient and no AC coefficients, why do an entire iDCT?  That would be a waste of time, since having only a DC coefficient literally just means we're adding the same value to every pixel in the block.  So I ported the DC-only chroma iDCT from ffh264 and rewrote it to do all four 4x4 blocks at once, so it can handle an entire 8x8 chroma block in one call.  Now, whenever we have a DC-only chroma block, we can skip most of the iDCT process, saving a bit of time.  I also cleaned up a lot of the decimate code in the function, speeding things up further.  
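&lt;br /&gt;&lt;br /&gt;As a rough illustration of the DC-only shortcut (a Python sketch, not the actual x264 or ffh264 code), the whole inverse transform collapses to adding one rounded value to every pixel of each 4x4 block.  The function names, the flat pixel buffer with a stride, and the (dc + 32) // 64 rounding of an already dequantized DC value are assumptions made for this sketch:&lt;br /&gt;&lt;/span&gt;&lt;pre&gt;
```python
def clip_pixel(value):
    # Clamp a reconstructed sample to the 8-bit pixel range.
    return max(0, min(255, value))


def add_dc_only_4x4(pixels, stride, x0, y0, dc):
    # Assumption: dc is the dequantized DC coefficient, and a DC-only block
    # reconstructs to the same rounded residual at all 16 positions.
    delta = (dc + 32) // 64
    for y in range(y0, y0 + 4):
        for x in range(x0, x0 + 4):
            i = y * stride + x
            pixels[i] = clip_pixel(pixels[i] + delta)


def add_dc_only_8x8(pixels, stride, dcs):
    # Handle a whole 8x8 chroma block in one call: dcs holds the four DC
    # values, one per 4x4 block, in raster order.
    for n, dc in enumerate(dcs):
        add_dc_only_4x4(pixels, stride, 4 * (n % 2), 4 * (n // 2), dc)
```
&lt;/pre&gt;&lt;span style="font-family:arial;"&gt;Calling add_dc_only_8x8 on the predicted 8x8 chroma block replaces four full inverse transforms with 64 additions and clips.&lt;br /&gt;&lt;br /&gt;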
Thanks a lot to Loren for help with the assembly on this one.&lt;br /&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2580926098227998794-5718843159305985503?l=x264dev.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://x264dev.blogspot.com/feeds/5718843159305985503/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2580926098227998794&amp;postID=5718843159305985503' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2580926098227998794/posts/default/5718843159305985503'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2580926098227998794/posts/default/5718843159305985503'/><link rel='alternate' type='text/html' href='http://x264dev.blogspot.com/2008/05/chroma-encoding-optimizations.html' title='Chroma encoding optimizations'/><author><name>Dark Shikari</name><uri>http://www.blogger.com/profile/03573422480643306284</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2580926098227998794.post-3122882286446421370</id><published>2008-05-03T02:30:00.000-07:00</published><updated>2008-05-07T00:38:56.617-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='psychovisual optimizations'/><category scheme='http://www.blogger.com/atom/ns#' term='noise'/><category scheme='http://www.blogger.com/atom/ns#' term='x264'/><category scheme='http://www.blogger.com/atom/ns#' term='film grain'/><title type='text'>Film grain optimization</title><content type='html'>&lt;span style="font-family:arial;"&gt;Optimizing an encoder for film grain is a tough problem.  
For one, film grain is, by definition, basically uncorrelated between frames; that is, film grain from a previous frame is totally useless in encoding the current frame's film grain (at least it would seem!).  This would suggest that intra blocks are necessary for encoding film grain, which is what generally results.  Yet this encounters another problem: film grain is made up of a whole slew of spatial frequencies, many of which &lt;span style="font-weight: bold;"&gt;cannot&lt;/span&gt; be represented at the quantizers often used in P/B-frames!  This makes it extremely difficult to efficiently represent the film grain at reasonable bitrates.&lt;br /&gt;&lt;br /&gt;But we can cheat.&lt;br /&gt;&lt;br /&gt;Previous frames have a lot of the necessary spatial frequencies--why not steal them?  Sure, an inter block won't be as efficient as an intra block, but it might work better.  Indeed, I got the initial idea from glancing at the results of the film grain optimization in Elecard's encoder (Mainconcept core).  Their film grain optimization almost completely disabled I-blocks in P-frames, suggesting that this was indeed the avenue to go down.  Of course, their film grain optimization really wasn't that good--so who knows?&lt;br /&gt;&lt;br /&gt;To begin with, I tried the obvious: completely disable intra blocks in P-frames for the hell of it.  Surprisingly, this actually worked; in many cases it improved grain retention!  But if I were to make this practical, I'd have to find a real way of implementing a metric to decide what block type to use, rather than just brute-force disabling an entire category of blocks.&lt;br /&gt;&lt;br /&gt;I eventually came back upon an idea I considered a while back--what about NSSD?  NSSD, also known as "noise-retaining sum of squared differences," is a block comparison metric that is supposed to promote retaining grain/noise.  How exactly does it do this?  
NSSD is equal to the sum of the ordinary SSD and the absolute value of the difference in "noise" values for the two blocks to be compared.  "Noise" is abs( x(i,j) - x(i+1,j) - x(i,j+1) + x(i+1,j+1) ) summed up over all pixels x(i,j) in the source block (ignoring pixels that would result in this formula going over the edge of the block).  In other words, it doesn't compare the pixels of the two blocks; it simply measures the "noisiness" of each block, and makes sure that they have a "similar" amount of noise.  Keeping the two blocks visually similar is taken care of by the SSD portion of the score.&lt;br /&gt;&lt;br /&gt;Amazingly, this worked; replacing the RD metric (SSD) with NSSD, combined with tweaking of the RD thresholds to ensure that modes that tended to retain noise were always analyzed, drastically improved grain retention, and made inter blocks drastically more common in grainy footage. The patch can be found &lt;a href="http://pastebin.com/m2e601b27"&gt;here&lt;/a&gt;, complete with mildly optimized MMX assembly for the "noise" operation, ported from ffmpeg (where NSSD is available as a -cmp/-subcmp/-rdcmp option).&lt;br /&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2580926098227998794-3122882286446421370?l=x264dev.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://x264dev.blogspot.com/feeds/3122882286446421370/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2580926098227998794&amp;postID=3122882286446421370' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2580926098227998794/posts/default/3122882286446421370'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2580926098227998794/posts/default/3122882286446421370'/><link rel='alternate' 
type='text/html' href='http://x264dev.blogspot.com/2008/05/film-grain-optimization.html' title='Film grain optimization'/><author><name>Dark Shikari</name><uri>http://www.blogger.com/profile/03573422480643306284</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2580926098227998794.post-3353126353950493681</id><published>2008-05-02T10:04:00.000-07:00</published><updated>2008-05-07T00:39:04.748-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='GSOC'/><category scheme='http://www.blogger.com/atom/ns#' term='x264'/><category scheme='http://www.blogger.com/atom/ns#' term='google'/><title type='text'>Google summer of code</title><content type='html'>&lt;span style="font-family:arial;"&gt;This year, x264 is participating in &lt;a href="http://code.google.com/soc/2008/"&gt;Google Summer of Code&lt;/a&gt;.  We have accepted four applications out of a total of roughly 15 applications (out of Videolan's roughly 80 applications and 14 slots).  
The projects are as follows:&lt;br /&gt;&lt;br /&gt;Robert Deaton (masquerade): Improve B-frame decision in terms of the number of B-frames, direct modes, frame ordering, reference frame usage, etc.&lt;br /&gt;&lt;br /&gt;Joey Degges (keram): Improve inter mode search and decision and experiment with more psychovisual optimizations.&lt;br /&gt;&lt;br /&gt;Aki Jäntti (Kuukunen): Use a "macroblock tree" structure to measure the temporal importance of various parts of the image.&lt;br /&gt;&lt;br /&gt;Holger Lubitz (holger): General assembly optimizations and speed improvements.&lt;br /&gt;&lt;br /&gt;Good luck to all participants!&lt;br /&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2580926098227998794-3353126353950493681?l=x264dev.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://x264dev.blogspot.com/feeds/3353126353950493681/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2580926098227998794&amp;postID=3353126353950493681' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2580926098227998794/posts/default/3353126353950493681'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2580926098227998794/posts/default/3353126353950493681'/><link rel='alternate' type='text/html' href='http://x264dev.blogspot.com/2008/05/google-summer-of-code.html' title='Google summer of code'/><author><name>Dark Shikari</name><uri>http://www.blogger.com/profile/03573422480643306284</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2580926098227998794.post-7268309400765593941</id><published>2008-05-02T00:59:00.000-07:00</published><updated>2008-05-04T12:16:46.392-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='speed'/><category scheme='http://www.blogger.com/atom/ns#' term='x264'/><category scheme='http://www.blogger.com/atom/ns#' term='skip'/><title type='text'>Skipping stuff</title><content type='html'>&lt;span style="font-family:arial;"&gt;It's nice to be able to take shortcuts in the encoding process--it's even better when we can take a shortcut that doesn't even change the output.&lt;br /&gt;&lt;br /&gt;A recent example was my &lt;a href="http://git.videolan.org/?p=x264.git;a=commitdiff;h=f46c3bc88be70eb4ab798cba5e0da75b13038ffc"&gt;intra skip&lt;/a&gt; patch.  The basic idea behind intra skip lies in the method by which intra analysis works.  For each block of an i4x4 or i8x8 macroblock, all 9 prediction modes are analyzed and the best chosen.  The prediction modes use the top, left, and/or top-right pixels to predict the block--so the blocks to the top, left, and top-right of the current block have to be encoded in order to know which pixels we will be predicting with:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_dGhJ_P39Aco/SB4LUttTueI/AAAAAAAAAAo/N8QyaMHjapA/s1600-h/i4x42.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://2.bp.blogspot.com/_dGhJ_P39Aco/SB4LUttTueI/AAAAAAAAAAo/N8QyaMHjapA/s400/i4x42.png" alt="" id="BLOGGER_PHOTO_ID_5196603470527052258" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The numbers show the order in which the blocks are encoded, in order to make sure all the necessary edge pixels are available during analysis.  
The gray blocks are from neighboring macroblocks which have already been encoded.  Note that the "16" block never gets encoded as part of intra analysis, since it isn't needed for the edges of anything else.&lt;br /&gt;&lt;br /&gt;As a result, we have to encode each block as we go along in the analysis process.  Once we're done, we can save the results of the encoding so that later, if that intra mode is chosen, we can just restore the backup, saving over 2500 clocks of predict/DCT/quant/zigzag/dequant/iDCT.&lt;br /&gt;&lt;br /&gt;Next, there's my &lt;a href="http://git.videolan.org/?p=x264.git;a=commitdiff;h=2b3d65e7c2389fc4201dafb69422550e15ef0d85"&gt;p/bskip patch&lt;/a&gt;.  When pskip/bskip analysis is done, motion compensation on the block is done (with respect to the "skip" motion vector) and then analysis follows.  If the skip is chosen, the analysis terminates and the block is encoded... where the motion compensation gets done again.  But we can't just turn off all skip-based motion compensation, because there are some cases where you can have a skip in which the motion compensation wasn't the last thing that was done.  So I added a variable that tells the encoder whether it can skip the redundant motion compensation or not.&lt;br /&gt;&lt;br /&gt;Finally, there's my upcoming &lt;a href="http://pastebin.com/mb8a1f0d"&gt;fast probe_skip patch&lt;/a&gt;, which takes a number of shortcuts in the skip check process:&lt;br /&gt;&lt;br /&gt;1.  The current skip_check does a full 16x16 DCT (16 4x4 DCTs), and then checks each 4x4 block individually.  Quite often, it'll be able to tell after the first block or two that the block won't be a skip; in this case, doing the full 16x16 DCT wastes a few hundred clock cycles.  So I made it do an 8x8 DCT (4 4x4 DCTs) for each 8x8 block.  Why didn't I do each 4x4 DCT as I went along?  
The 8x4 DCT (2 4x4 DCTs) is much faster than the 4x4 because it's SSE2-accelerated, so I just went ahead and did the 8x8, which is coded as two 8x4s.&lt;br /&gt;&lt;br /&gt;2.  When checking for a skip, it's highly likely that if one got past the first DCT block or two without terminating, most of the rest of the blocks will be zero.  As a result, it's efficient to check if they're all zero before zigzag/decimate_score; this only takes a few clocks and quite often allows us to skip the aforementioned somewhat time-consuming steps.&lt;br /&gt;&lt;br /&gt;As you can see, it's quite good when we can skip something entirely; even better than speeding up a function is calling it less often.&lt;br /&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2580926098227998794-7268309400765593941?l=x264dev.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://x264dev.blogspot.com/feeds/7268309400765593941/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2580926098227998794&amp;postID=7268309400765593941' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2580926098227998794/posts/default/7268309400765593941'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2580926098227998794/posts/default/7268309400765593941'/><link rel='alternate' type='text/html' href='http://x264dev.blogspot.com/2008/05/skipping-stuff.html' title='Skipping stuff'/><author><name>Dark Shikari</name><uri>http://www.blogger.com/profile/03573422480643306284</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' 
url='http://2.bp.blogspot.com/_dGhJ_P39Aco/SB4LUttTueI/AAAAAAAAAAo/N8QyaMHjapA/s72-c/i4x42.png' height='72' width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2580926098227998794.post-7286331237754874232</id><published>2008-05-01T22:39:00.001-07:00</published><updated>2008-05-01T23:15:52.212-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='x264'/><category scheme='http://www.blogger.com/atom/ns#' term='Intel'/><category scheme='http://www.blogger.com/atom/ns#' term='assembly'/><category scheme='http://www.blogger.com/atom/ns#' term='cacheline'/><title type='text'>Cacheline splits, aka Intel hell</title><content type='html'>&lt;span style="font-family:arial;"&gt;About 6 months ago Loren Merritt and I, after relatively exhaustive testing on over a dozen types of processors, concluded that Intel chips in general have a problem in the case of a "cacheline split."&lt;br /&gt;&lt;br /&gt;To begin with, modern processors &lt;a href="http://en.wikipedia.org/wiki/Cache_line"&gt;divide their cache into "lines"&lt;/a&gt;; for example, a Pentium 3 has a 32-byte cacheline size, while a Core 2 has a 64-byte cacheline.  The way these work is that whenever the processor is told to fetch data from memory, it will fetch a full cacheline at a time.  This also affects how the data is organized in the cache.&lt;br /&gt;&lt;br /&gt;Now, when you're loading aligned data, that is, data aligned to a 16-byte line, you're guaranteed to never have that data cross a cache line, since the data is loaded in 8 or 16 byte chunks.  But if you're loading unaligned data, which is completely unavoidable in the case of the motion search, you have to do unaligned loads.  In the case of a 16x16 motion search and a 64-byte cacheline, this means that 1 in every 4 motion vectors checked will result in the load spanning a cacheline--that is, some of the data comes from one cacheline, and some from another.&lt;br /&gt;&lt;br /&gt;Why is this bad?  
On AMD chips, it isn't; the penalty is no different from any other unaligned load.  But on Intel chips the penalty is enormous; in fact, the cost of a cacheline-split load is roughly equivalent to the &lt;a href="http://en.wikipedia.org/wiki/L2_cache#Multi-level_caches"&gt;L2 cache&lt;/a&gt; latency, suggesting that a cacheline split actually results in a load from L2 cache.  This may not seem too bad, but this adds roughly 12 clocks per load on a Core 2... boosting the cost of a single 16x16 SAD from 48 clocks to over 230 clocks.  Splits on &lt;a href="http://en.wikipedia.org/wiki/Page_%28computing%29"&gt;page lines&lt;/a&gt; are even worse; a single load can cost over 220 clock cycles, roughly equivalent to the main memory latency!  A 16x16 SAD has now gone from 48 clocks to over 3600 clocks.  The penalty on AMD chips for a pageline split exists, but it's extremely small.&lt;br /&gt;&lt;br /&gt;Since cacheline splits happen only some of the time (and 1 in 64 cacheline boundaries is a page boundary) it isn't as bad as it sounds, but the end result is that SADs take an average of about 100 clocks instead of 48.  Fortunately, Loren Merritt's &lt;a href="http://git.videolan.org/?p=x264.git;a=commitdiff;h=6d6092197676cf4949bff2a1e28a79aa1bbab1ea"&gt;massive cacheline split patch&lt;/a&gt; fixed this for SADs, resulting in a considerably faster motion search.  Investigating his patch, you can see four solutions that he implemented for this problem:&lt;br /&gt;&lt;br /&gt;1.  The easiest, SSE3-based.  The "lddqu" operation tells the processor, instead of performing the unaligned load directly, to load two aligned pieces of data and shift them accordingly to give the same result.  As a result, it has none of the cost of a cacheline split... but unfortunately it only works on Pentium 4Ds and Core 1s.  On Core 2s, lddqu does the same thing as movdqu.&lt;br /&gt;&lt;br /&gt;2.  The second easiest, MMX-based.  
Shift amounts are calculated based on the position of the data to be loaded, the data is loaded in two chunks, and they are shifted and OR'd together to get the result.  This is best expressed in an image:&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_dGhJ_P39Aco/SBqvSNtTucI/AAAAAAAAAAU/b7PRvO_nxsM/s1600-h/cacheline.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://4.bp.blogspot.com/_dGhJ_P39Aco/SBqvSNtTucI/AAAAAAAAAAU/b7PRvO_nxsM/s400/cacheline.png" alt="" id="BLOGGER_PHOTO_ID_5195657847577491906" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;span style="font-family:arial;"&gt;As can be seen, this is quite a bit more complex than an ordinary load; it involves precalculation of two numbers (the shift values) and two shifts plus one OR for every single load!  But it's still far faster than the 12-14 clock latency for cacheline splits.  But wait, you thought this workaround was messy?  Just wait until you see the rest...&lt;br /&gt;&lt;br /&gt;3.  The second hardest, using SSSE3's palignr.  palignr does what the two shifts and the OR did above, all in a single operation!  This would be easier than the MMX version, but there's one complication; the MMX shift described above takes a *variable* value, so we can calculate the shift amount at runtime and then use it.  But palignr takes a constant which must be written into the opcode at compile time!  So for this, Loren Merritt made a set of 15 loops, one for each possible alignment, and had the program jump to the correct loop based on the current alignment.  Since each loop had a constant size, he could literally calculate the position of the loop in the code and jump to its position numerically instead of storing an array of pointers to the various loops.  What a mess!  
But it works, and it is faster than the MMX version, since palignr is so much simpler.&lt;br /&gt;&lt;br /&gt;4.  The worst is SSE; SSE's version of MMX's full-shift has the same problem as palignr; it only takes a constant shift amount.  So we have to do the same as palignr's 15 loops, except with all the same hackiness as in the MMX version.&lt;br /&gt;&lt;br /&gt;So there we have it; the 4 workarounds, all of which give a similar performance boost on Intel processors.  Now you understand why coding assembly for Intel chips can be hell; one has to compensate for ridiculous performance penalties like these.  I &lt;a href="http://git.videolan.org/?p=x264.git;a=commitdiff;h=44361e9514344c4e4685bca45e8c55a28c3cd395"&gt;extended the cacheline patch&lt;/a&gt; to work for the qpel interpolation also, which is used in all of the subpel motion search.  This gave an overall speed boost of 1-2% or so (25-40% for qpel interpolation only).   My patch only uses the MMX and SSE3 workarounds, since the others were unnecessarily complex for this smaller case.  Loren Merritt's &lt;a href="http://git.videolan.org/?p=x264.git;a=commitdiff;h=2487cdf92f3050830fe2b7ffc4e04cf292823e9e"&gt;sse2/ssse3 hpel_filter patch&lt;/a&gt; also used this concept, though in this case the misalignments were precisely known, so using palignr was drastically easier.  
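&lt;br /&gt;&lt;br /&gt;To make the "two aligned loads, two shifts, one OR" idea from the MMX workaround concrete, here is a minimal Python sketch (an illustration only, not the x264 assembly).  The 8-byte load width, the little-endian byte order and the helper names aligned_load and unaligned_load are assumptions of the sketch; the shifts are written as divisions and multiplications by powers of two:&lt;br /&gt;&lt;/span&gt;&lt;pre&gt;
```python
WIDTH = 8  # bytes per aligned load, like one MMX register (assumption)


def aligned_load(mem, addr):
    # Only aligned addresses are accepted, so this load can never straddle
    # a WIDTH-byte boundary (and therefore never a cacheline boundary).
    assert addr % WIDTH == 0
    return int.from_bytes(mem[addr:addr + WIDTH], 'little')


def unaligned_load(mem, addr):
    # Emulate an unaligned load with two aligned loads, two shifts and an OR.
    offset = addr % WIDTH
    base = addr - offset
    low = aligned_load(mem, base)
    if offset == 0:
        return low
    high = aligned_load(mem, base + WIDTH)
    shift = 8 * offset                    # precalculated once per position
    full = 2 ** (8 * WIDTH)
    low_part = low // 2 ** shift                             # right shift
    high_part = (high * 2 ** (8 * WIDTH - shift)) % full     # left shift
    return low_part | high_part
```
&lt;/pre&gt;&lt;span style="font-family:arial;"&gt;In the MMX version, the two shifts and the OR would correspond to instructions such as psrlq, psllq and por, with the shift amounts computed once before the loop.&lt;br /&gt;&lt;br /&gt;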
My &lt;a href="http://akuvian.org/src/x264/lowres.7.diff"&gt;frame_init_lowres&lt;/a&gt; asm patch also provides an SSE2 and SSSE3-based solution for cacheline splits of known alignment.&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2580926098227998794-7286331237754874232?l=x264dev.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://x264dev.blogspot.com/feeds/7286331237754874232/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2580926098227998794&amp;postID=7286331237754874232' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2580926098227998794/posts/default/7286331237754874232'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2580926098227998794/posts/default/7286331237754874232'/><link rel='alternate' type='text/html' href='http://x264dev.blogspot.com/2008/05/cacheline-splits-aka-intel-hell.html' title='Cacheline splits, aka Intel hell'/><author><name>Dark Shikari</name><uri>http://www.blogger.com/profile/03573422480643306284</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_dGhJ_P39Aco/SBqvSNtTucI/AAAAAAAAAAU/b7PRvO_nxsM/s72-c/cacheline.png' height='72' width='72'/><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2580926098227998794.post-5232840398292347516</id><published>2008-05-01T22:30:00.001-07:00</published><updated>2008-05-09T12:12:45.200-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='CABAC'/><category scheme='http://www.blogger.com/atom/ns#' term='ffmpeg'/><category 
scheme='http://www.blogger.com/atom/ns#' term='finite state machine'/><category scheme='http://www.blogger.com/atom/ns#' term='x264'/><category scheme='http://www.blogger.com/atom/ns#' term='H.264'/><title type='text'>Finite state machines and CABAC</title><content type='html'>&lt;span style="font-family:arial;"&gt;During &lt;a href="http://en.wikipedia.org/wiki/CABAC"&gt;CABAC&lt;/a&gt; binarization, the coefficients in each DCT block are converted to binary symbols which are &lt;a href="http://en.wikipedia.org/wiki/Arithmetic_coding"&gt;arithmetically coded&lt;/a&gt; into the bitstream.  Each symbol has a specific context, the choice of which is calculated through a simple algorithm.  This context-choice, of course, has to be the same on encoding and decoding.  One can calculate this context for each coefficient--or one can make a &lt;a href="http://en.wikipedia.org/wiki/Finite_state_machine"&gt;finite state machine&lt;/a&gt; using a set of states and a transition table, so that given the current context, the next context can be calculated using merely an array lookup.&lt;br /&gt;&lt;br /&gt;This improves performance both in &lt;a href="http://git.videolan.org/?p=x264.git;a=commitdiff;h=e403fe9364ad1ea1cd8c3d5055759a538e97bb8b"&gt;CABAC encoding&lt;/a&gt; and &lt;a href="http://svn.mplayerhq.hu/ffmpeg/trunk/libavcodec/h264.c?r1=13017&amp;amp;r2=13060"&gt;CABAC decoding&lt;/a&gt;.  Note the similarities between the two diffs--despite the fact that one is in x264 and designed for encoding, and the other is in ffmpeg and designed for decoding, they are practically the same code.  This is true of most of the CABAC code; in fact, most of x264's CABAC binarization code is simply the code from ffmpeg, except reversed.  
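&lt;br /&gt;&lt;br /&gt;The difference between recomputing a context and looking it up can be sketched in Python with a toy rule.  The rule, the table NEXT_CTX and both function names are hypothetical and chosen only for illustration; they are not the actual tables from x264 or ffmpeg.  Here the context is 0 once any level other than 1 has been seen, and otherwise 1 plus the number of levels equal to 1 seen so far, capped at 4:&lt;br /&gt;&lt;/span&gt;&lt;pre&gt;
```python
def contexts_direct(levels):
    # Recompute the context from running counts for every coefficient.
    ones = 0
    bigger = 0
    out = []
    for level in levels:
        out.append(0 if bigger else min(4, 1 + ones))
        if level == 1:
            ones += 1
        else:
            bigger += 1
    return out


# NEXT_CTX[level != 1][ctx] is the context of the following coefficient.
NEXT_CTX = (
    (0, 2, 3, 4, 4),  # the level just coded was 1
    (0, 0, 0, 0, 0),  # the level just coded was not 1
)


def contexts_fsm(levels):
    # Same result, but each step is a single table lookup on the state.
    ctx = 1
    out = []
    for level in levels:
        out.append(ctx)
        ctx = NEXT_CTX[level != 1][ctx]
    return out
```
&lt;/pre&gt;&lt;span style="font-family:arial;"&gt;Both functions return the same context sequence; the second needs no counters at all, only the current state and one array lookup per symbol.&lt;br /&gt;&lt;br /&gt;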
Original credit for the finite state machine code goes to Loren Merritt, who wrote it to speed up the calculation of CABAC states in x264's trellis quantization algorithm.&lt;br /&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2580926098227998794-5232840398292347516?l=x264dev.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://x264dev.blogspot.com/feeds/5232840398292347516/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2580926098227998794&amp;postID=5232840398292347516' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2580926098227998794/posts/default/5232840398292347516'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2580926098227998794/posts/default/5232840398292347516'/><link rel='alternate' type='text/html' href='http://x264dev.blogspot.com/2008/05/finite-state-machines-and-cabac.html' title='Finite state machines and CABAC'/><author><name>Dark Shikari</name><uri>http://www.blogger.com/profile/03573422480643306284</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2580926098227998794.post-3387783363516164855</id><published>2008-05-01T18:08:00.001-07:00</published><updated>2008-05-01T18:12:25.250-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='x264'/><category scheme='http://www.blogger.com/atom/ns#' term='stupidity'/><category scheme='http://www.blogger.com/atom/ns#' term='Intel'/><category scheme='http://www.blogger.com/atom/ns#' term='assembly'/><title type='text'>Why, Intel, why?</title><content type='html'>&lt;span style="font-family: 
arial;"&gt;&lt;a href="http://git.videolan.org/?p=x264.git;a=commitdiff;h=f892ee11a99c0b85e577e33bc1e695373b58584a"&gt;This diff&lt;/a&gt; is highly related to this post.&lt;br /&gt;&lt;br /&gt;If one looks in Intel's documentation of their assembly, one notices a few things.  In particular, there are a whole bunch of operations which do exactly the same thing but have different opcodes.  Intel introduces "movaps" and "movups" for aligned and unaligned moves in SSE1, and then "movdqa" and "movdqu" in SSE2... to do exactly the same thing.  The same situation occurs with pand and andps... etc.  The end result is a number of things:&lt;br /&gt;1.  Wasted opcode space on opcodes that do exactly the same thing.&lt;br /&gt;2.  Wasted executable size, since movdqa is larger than movaps (3 byte vs 2 byte opcode) despite doing exactly the same thing.&lt;br /&gt;3.  Loss of our sanity.&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2580926098227998794-3387783363516164855?l=x264dev.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://x264dev.blogspot.com/feeds/3387783363516164855/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2580926098227998794&amp;postID=3387783363516164855' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2580926098227998794/posts/default/3387783363516164855'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2580926098227998794/posts/default/3387783363516164855'/><link rel='alternate' type='text/html' href='http://x264dev.blogspot.com/2008/05/why-intel-why.html' title='Why, Intel, why?'/><author><name>Dark Shikari</name><uri>http://www.blogger.com/profile/03573422480643306284</uri><email>noreply@blogger.com</email><gd:image 
rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2580926098227998794.post-8560484501202870230</id><published>2008-05-01T18:04:00.001-07:00</published><updated>2008-05-06T23:32:21.451-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='loops'/><category scheme='http://www.blogger.com/atom/ns#' term='x264'/><title type='text'>Unrolling a loop makes it smaller</title><content type='html'>&lt;span style="font-family:arial;"&gt;The topic of this post is &lt;a href="http://git.videolan.org/?p=x264.git;a=commitdiff;h=45cc42cc3f5155a9bcbadaaa88c828359884c85b"&gt;this recent diff&lt;/a&gt;.  It doesn't at all change the results of the function; I simply unrolled the loop to take advantage of the fact that most of the loop's variables were dependent on the loop constant itself.  The result was only 7 lines (!) down from well over 30.  
Thanks to an earlier ffmpeg patch for the original idea and some of the code.&lt;br /&gt;&lt;br /&gt;Apparently GCC doesn't actually measure the benefit gained from constant propagation in unrolling loops, so one is forced to manually unroll in this sort of case.&lt;br /&gt;&lt;br /&gt;Oh, and I got roughly a 50% speedup in this function by doing this (not counting the encode_decision calls).&lt;br /&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2580926098227998794-8560484501202870230?l=x264dev.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://x264dev.blogspot.com/feeds/8560484501202870230/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2580926098227998794&amp;postID=8560484501202870230' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2580926098227998794/posts/default/8560484501202870230'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2580926098227998794/posts/default/8560484501202870230'/><link rel='alternate' type='text/html' href='http://x264dev.blogspot.com/2008/05/unrolling-loop-makes-it-smaller.html' title='Unrolling a loop makes it smaller'/><author><name>Dark Shikari</name><uri>http://www.blogger.com/profile/03573422480643306284</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2580926098227998794.post-4533928885469717780</id><published>2008-05-01T14:16:00.000-07:00</published><updated>2008-05-01T14:39:28.366-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='bugs'/><category scheme='http://www.blogger.com/atom/ns#' 
term='rate-distortion optimization'/><category scheme='http://www.blogger.com/atom/ns#' term='x264'/><title type='text'>Inter RD refine bugs</title><content type='html'>&lt;span style="font-family:arial;"&gt;I've now found two of these in one week.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:arial;"&gt;For those who don't know, on inter blocks in P-frames, subme 7 works by a process called "qpel RD"; that is, it does an ordinary subpixel refinement of the motion vectors, except that instead of the usual fast metric (&lt;a href="http://en.wikipedia.org/wiki/Sum%20of%20Absolute%20Differences"&gt;SAD&lt;/a&gt; or &lt;a href="http://en.wikipedia.org/wiki/SATD"&gt;SATD&lt;/a&gt;), it does a full rate-distortion comparison on each qpel position using a hexagonal search.  Of course, to increase speed, it uses a SATD threshold above which it won't bother with the whole RD process.  The reason for this is obvious when you see the numbers: a full-macroblock SATD takes about 450 clocks, while a full-macroblock RD takes about 10,000 clocks.  So even if SATD allowed us to avoid doing an RD only 1/20th of the time, it would be worth it; the real numbers are much higher than that, of course.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:arial;"&gt;Now, onto the bugs.  The first one was in CAVLC, where in an 8x8dct inter block the number of non-zero coefficients in each DCT block was not being calculated, resulting in incorrect calculations of bit cost.   This wasn't a problem for CABAC, because CABAC only needs to know whether a block is all-zero or not, while CAVLC needs to know exactly how many non-zero coefficients there are.  
This was resolved as part of my &lt;/span&gt;&lt;a style="font-family: arial;" href="http://git.videolan.org/?p=x264.git;a=commit;h=a9057b503939d763a9f17111c41672bfca8beb7e"&gt;overhaul of&lt;/a&gt;&lt;span style="font-family:arial;"&gt; &lt;/span&gt;&lt;a style="font-family: arial;" href="http://git.videolan.org/?p=x264.git;a=commit;h=5dae513218070a3aafc7c56b097bc7ff7ae58526"&gt;the nnz code&lt;/a&gt;&lt;span style="font-family:arial;"&gt;.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:arial;"&gt;I ran into the second bug just today.  When RD was done on 8x16 and 16x8 blocks, the RD function, for simplicity's sake, encoded two 8x8 blocks separately and then summed the resulting scores.  This isn't a problem per se, but the 8x8 RD function went on to check what type of 8x8 block it was: one 8x8, two 8x4s, two 4x8s, or four 4x4s.  An 8x16 or 16x8 block can't have such subblocks, while an 8x8 block can.  This again isn't a problem by itself, but when a 16x8 or 8x16 block type is chosen, the subpartition types used for the 8x8 search aren't reset.  So if the 8x8 search chose a sub-8x8 block type, the 8x8 block encode now thinks that we have a sub-8x8 block type even though we cannot possibly have one.  
The solution was &lt;/span&gt;&lt;a style="font-family: arial;" href="http://pastebin.com/f7ff01732"&gt;pretty simple&lt;/a&gt;&lt;span style="font-family:arial;"&gt;.&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2580926098227998794-4533928885469717780?l=x264dev.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://x264dev.blogspot.com/feeds/4533928885469717780/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2580926098227998794&amp;postID=4533928885469717780' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2580926098227998794/posts/default/4533928885469717780'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2580926098227998794/posts/default/4533928885469717780'/><link rel='alternate' type='text/html' href='http://x264dev.blogspot.com/2008/05/inter-rd-refine-bugs.html' title='Inter RD refine bugs'/><author><name>Dark Shikari</name><uri>http://www.blogger.com/profile/03573422480643306284</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2580926098227998794.post-3451621554922219569</id><published>2008-05-01T14:03:00.000-07:00</published><updated>2008-05-01T14:25:16.754-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='ugly code'/><category scheme='http://www.blogger.com/atom/ns#' term='x264'/><category scheme='http://www.blogger.com/atom/ns#' term='memory management'/><title type='text'>Array overflows</title><content type='html'>&lt;span style="font-family:arial;"&gt;x264 keeps a set of pointers to various arrays used for custom quantization matrices 
(CQMs), deadzone biases, etc. These are stored in the primary x264_t struct as follows:&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier;"&gt;&lt;br /&gt;uint16_t        (*quant4_bias[4])[16];   /* [4][52][16] */&lt;br /&gt;uint16_t        (*quant8_bias[2])[64];   /* [2][52][64] */&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:arial;"&gt;They are malloced at the start of the program and freed when x264 finishes. Unfortunately, the deletion code looks something like this:&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier;"&gt;&lt;br /&gt;for( i = 0; i &lt; 6; i++ )&lt;br /&gt;{&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;...&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;x264_free( h-&gt;quant4_bias[i] );&lt;br /&gt;}&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:arial;"&gt;A small part of me died when I read this code.  Yes, that's right: it's intentionally overflowing the array of pointers.  quant4_bias has only 4 entries, so iterations 4 and 5 run off its end and free the two quant8_bias pointers, which works only because the two arrays are arranged sequentially in the struct. 
Apparently, according to Loren Merritt, this simplifies the deletion code.&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2580926098227998794-3451621554922219569?l=x264dev.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://x264dev.blogspot.com/feeds/3451621554922219569/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2580926098227998794&amp;postID=3451621554922219569' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2580926098227998794/posts/default/3451621554922219569'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2580926098227998794/posts/default/3451621554922219569'/><link rel='alternate' type='text/html' href='http://x264dev.blogspot.com/2008/05/array-overflows.html' title='Array overflows'/><author><name>Dark Shikari</name><uri>http://www.blogger.com/profile/03573422480643306284</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry></feed>
