From: Carl Worth
Date: Wed, 23 May 2007 16:39:35 +0000 (-0700)
Subject: Add understanding_rectangles blog entry
X-Git-Url: https://git.cworth.org/git?a=commitdiff_plain;h=c71d0b98f5c3de49e6f522f9640e9b659e946662;hp=55e82999d067daa627e489b63a1ce087980880e1;p=cworth.org

Add understanding_rectangles blog entry
---

diff --git a/src/exa/rectangles-512.png b/src/exa/rectangles-512.png
new file mode 100644
index 0000000..a0729fe
Binary files /dev/null and b/src/exa/rectangles-512.png differ
diff --git a/src/exa/rectangles-64.png b/src/exa/rectangles-64.png
new file mode 100644
index 0000000..6f1cd5a
Binary files /dev/null and b/src/exa/rectangles-64.png differ
diff --git a/src/exa/understanding_rectangles.mdwn b/src/exa/understanding_rectangles.mdwn
new file mode 100644
index 0000000..5973620
--- /dev/null
+++ b/src/exa/understanding_rectangles.mdwn
@@ -0,0 +1,125 @@
+[[tag cairo exa xorg]]
+
+# Understanding the cairo rectangles performance test case
+
+About a month ago (can it have been that long already?) I started an
+effort to try to [baseline EXA performance on an r100
+chip](http://article.gmane.org/gmane.comp.freedesktop.xorg/17466). A
+particularly alarming result from that initial test was that cairo's
+rectangles case was running 14 times slower with EXA than with no X
+server acceleration at all.
+
+Afterwards, Eric and Dave [set me
+straight](http://article.gmane.org/gmane.comp.freedesktop.xorg/17502)
+and I got DRI working with EXA. This definitely made it faster in
+general, but the rectangles test was still 8x slower than NoAccel. A
+deeper look was necessary.
+
+Eric had various theories about how cairo's measurement strategy could
+be confounding the results. What cairo's performance suite does is to
+perform an XGetImage of a single pixel as a synchronization barrier,
+(to allow the suite to wait until the X server provides the result as
+a guarantee that all pending rendering has occurred).
One theory is
+that EXA could be doing something extremely inefficient here, (such as
+fetching the entire image instead of just a single pixel).
+
+To alleviate this possible problem, I cranked the number of rectangles
+being rendered between timings from 1000 to 10000. This actually did
+help to some extent. After this change EXA is only 2-3x slower than
+NoAccel instead of 8x slower.
+
+Also, we noticed that this slowdown only occurs when drawing to an
+ARGB Pixmap as opposed to drawing to an RGB Window, (when drawing to a
+window EXA is about 4x faster than NoAccel, whether drawing 1000 or
+10000 rectangles).
+
+So the test with 1000 rectangles was definitely measuring something
+undesired, since a 10x increase in the number of rectangles
+resulted in something close to a 2x increase in rendering time. (For
+EXA to a Pixmap at least---EXA to a Window, and NoAccel to a Window or
+Pixmap all increased by about 10x). I'm still not sure exactly what
+the problem was in the case with 1000 rectangles, but the 1x1
+XGetImage is still a possibility. Eric has suggested adding a new
+wait-for-rendering-to-complete request to the XFixes extension to
+eliminate the need for the 1x1 XGetImage and any problems it might be
+causing.
+
+After seeing the results change so dramatically with the number of
+iterations, I began to wonder about batching effects. The original
+cairo-based rectangles test case looked roughly like this:
+
+    for (i = 0; i < NUM_RECTANGLES; i++) {
+        cairo_rectangle (cr, rect[i].x, rect[i].y, rect[i].width, rect[i].height);
+        cairo_fill (cr);
+    }
+
+That is, each rectangle was being filled individually. I experimented
+with changing this so that many calls were made to cairo_rectangle for
+each call to cairo_fill. The mysterious EXA slowdown I had been
+chasing went away, but only because everything became a lot slower. It
+turns out there's a bad performance bug in cairo when it converts from
+a list of rectangular trapezoids to a pixman_region.
Cairo's pixman
+doesn't expose a function for "create region from list of rectangles"
+so cairo is calling a pixman_region_union function once for every
+rectangle. This looks like an unnecessary O(n**2) algorithm to
+me. Fortunately that should be a simple thing to fix.
+
+So next I rewrote the test case by eliminating cairo calls and calling
+directly into either XRenderFillRectangles or XFillRectangle. I was
+shocked to find that the Render function was much slower than the
+non-Render function, (with no change in the X server). A little
+protocol tracing[*] revealed that XFillRectangle is batching requests
+while XRenderFillRectangles is not. That's a rather nasty trap for an
+unwary Xlib coder like myself. I added batching in chunks of 256
+rectangles around XRenderFillRectangles and it started behaving
+precisely like XFillRectangle.
+
+Finally, I eliminated some non-determinism from the rectangles test
+case. Originally, it was written to choose randomly-sized rectangles
+by independently selecting a width and height from 1 to 50. Instead I
+ran separate tests at power-of-2 sizes from 1 to 512. The results of
+doing this were quite revealing and are best seen graphically:
+
+[[rectangles-512.png]]
+
+And a closer look at the small rectangles:
+
+[[rectangles-64.png]]
+
+As can be seen, there's a break-even point at a rectangle size just
+below 60x60. Above that, EXA performance scales extremely well, with
+the time depending only on the number of rectangles, and
+independent of their size, while NoAccel performance scales quite
+poorly (as expected).
+
+Meanwhile, for the small rectangles, (which my original test case just
+happened to be testing exclusively), EXA is 3 to 4 times slower than
+NoAccel. Perhaps it would make sense for the X server to take an
+alternate approach for these small rectangles? The NoAccel results
+show that the X server does have faster code already.
Or perhaps EXA
+itself could be made faster by having some hardware state caching to
+reduce overhead from one rectangle to the next.
+
+But there are some obvious questions here: What sizes actually matter?
+What would a rectangle-size histogram look like for typical desktop
+loads? There's definitely room to do some measurement work here so
+that we can come up with meaningful benchmarks, (rather than the
+fairly arbitrary things I started with), and focus on optimizing the
+things that really matter.
+
+A similar question applies to batching. I only saw good
+performance when I batched many rectangles into each call to
+XRenderFillRectangles. But is that even a reasonable thing to expect
+applications to be able to do? Do applications actually sequentially
+render dozens of rectangles all of the same color? I'm imagining GTK+
+widget themes with bevelled edges where it's actually much more likely
+that the behavior would be close to toggling back and forth between
+two colors every one or two rectangles. And that kind of behavior will
+exhibit wildly different results from what's being benchmarked here.
+
+Anyway, there's still plenty of interesting work to be done here.
+
+[*] I used wireshark and manually decoded all Render requests. I'm
+looking forward to good protocol tracing tools that decode all
+extensions. And yes, I'm aware of current XCB efforts to provide
+this---should be very nice!
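
For readers who want to picture the "batching in chunks of 256 rectangles" mentioned above, here is a minimal sketch in C. It is not the code from the actual test case: `rect_t`, `fill_rects_fn`, `fill_rects_batched` and `count_rects` are hypothetical names invented for illustration, and the callback stands in for whatever single-request call is being wrapped (in the test described here, that would be XRenderFillRectangles).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical rectangle type, laid out like Xlib's XRectangle. */
typedef struct {
    short x, y;
    unsigned short width, height;
} rect_t;

/* Hypothetical callback: submits `count` rectangles as ONE request.
 * In a real X client this would wrap XRenderFillRectangles. */
typedef void (*fill_rects_fn) (void *closure, const rect_t *rects, size_t count);

/* Submit n rectangles in chunks of at most chunk_size rectangles.
 * Returns the number of requests (callback invocations) issued. */
static size_t
fill_rects_batched (fill_rects_fn fill, void *closure,
                    const rect_t *rects, size_t n, size_t chunk_size)
{
    size_t requests = 0;

    assert (fill != NULL && chunk_size > 0);

    for (size_t i = 0; i < n; i += chunk_size) {
        /* The last chunk may be shorter than chunk_size. */
        size_t count = (n - i < chunk_size) ? n - i : chunk_size;
        fill (closure, rects + i, count);
        requests++;
    }

    return requests;
}

/* Example callback that only tallies the rectangles it is handed;
 * closure must point to a size_t accumulator. */
static void
count_rects (void *closure, const rect_t *rects, size_t count)
{
    (void) rects;
    *(size_t *) closure += count;
}
```

With a chunk size of 256, 1000 rectangles go out as 4 requests instead of 1000. Note that this only works when consecutive rectangles share the same color and operator, which is exactly the assumption questioned above.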