git.cworth.org Git - cworth.org/blob - src/exa/understanding_rectangles.mdwn

   1 [[!meta title="Understanding the cairo rectangles performance test case"]]
   2
   3 [[!tag cairo exa xorg]]
   4
   5 About a month ago (can it have been that long already?) I started an
   6 effort to try to [baseline EXA performance on an r100
   7 chip](http://article.gmane.org/gmane.comp.freedesktop.xorg/17466). A
   8 particularly alarming result from that initial test was that cairo's
   9 rectangles case was running 14 times slower with EXA than with no X
  10 server acceleration at all.
  11
  12 Afterwards, Eric and Dave [set me
  13 straight](http://article.gmane.org/gmane.comp.freedesktop.xorg/17502)
  14 and I got DRI working with EXA. This definitely made it faster in
  15 general, but the rectangles test was still 8x slower than NoAccel. A
  16 deeper look was necessary.
  17
  18 Eric had various theories about how cairo's measurement strategy could
  19 be confounding the results. What cairo's performance suite does is to
  20 perform an XGetImage of a single pixel as a synchronization barrier,
  21 (to allow the suite to wait until the X server provides the result as
  22 a guarantee that all pending rendering has occurred). One theory is
  23 that EXA could be doing something extremely inefficient here, (such as
  24 fetching the entire image instead of just a single pixel).
  25
  26 To alleviate this possible problem, I cranked the number of rectangles
  27 being rendered between timings from 1000 to 10000. This actually did
  28 help to some extent. After this change EXA is only 2-3x slower than
  29 NoAccel instead of 8x slower.
  30
  31 Also, we noticed that this slowdown only occurs when drawing to an
  32 ARGB Pixmap as opposed to drawing to an RGB Window, (when drawing to a
  33 window EXA is about 4x faster than NoAccel, whether drawing 1000 or
  34 10000 rectangles).
  35
  36 So the test with 1000 rectangles was definitely measuring something
  37 undesired, since a 10x increase in the the number of rectangles
  38 resulted in something close to a 2x increase in rendering time. (For
  39 EXA to a Pixmap at least---EXA to a Window, and NoAccel to a Window or
  40 Pixmap all increased by about 10x). I'm still not sure exactly what
  41 the problem was in the case with 1000 rectangles, but the 1x1
  42 XGetImage is still a possibility. Eric has suggested adding a new
  43 wait-for-rendering-to-complete request to the XFixes extension to
  44 eliminate the need for the 1x1 XGetImage and any problems it might be
  45 causing.
  46
  47 After seeing the results change so dramatically with the number of
  48 iterations, I began to wonder about batching effects. The original
  49 cairo-based rectangles test case looked about like this:
  50
  51         for (i = 0; i < NUM_RECTANGLES; i++) {
  52             cairo_rectangle (cr, rect[i].{x,y,width,height});
  53             cairo_fill (cr);
  54         }
  55
  56 That is, each rectangle was being filled individually. I experimented
  57 with changing this so that many calls were made to `cairo_rectangle` for
  58 each call to `cairo_fill`. The mysterious EXA slowdown I had been
  59 chasing went away, but only because everything became a lot slower. It
  60 turns out there's a bad performance bug in cairo when it converts from
  61 a list of rectangular trapezoids to a `pixman_region`. Cairo's pixman
  62 doesn't expose a function for "create region from list of rectangles"
  63 so cairo is calling a `pixman_region_union` function once for every
  64 rectangle. This looks like an unnecessary O(n**2) algorithm to
  65 me. Fortunately that should be a simple thing to fix.
  66
  67 So next I rewrote the test case by eliminating cairo calls and calling
  68 directly into either XRenderFillRectangles or XFillRectangle. I was
  69 shocked to find that the Render function was much slower than the
  70 non-Render function, (with no change in the X server). A little
  71 protocol tracing[*] revealed that XFillRectangle is batching requests
  72 while XRenderFillRectangles is not. That's a rather nasty trap for an
  73 unwary Xlib coder like myself. I added batching in chunks of 256
  74 rectangles around XRenderFillRectangles and it started behaving
  75 precisely like XFillRectangles.
  76
  77 Finally, I eliminated some non-determinism from the rectangles test
  78 case. Originally, it was written to choose randomly-sized rectangles
  79 by independently selecting a width and height from 1 to 50. Instead I
  80 ran separate tests at power-of-2 sizes from 1 to 512. The results of
  81 doing this were quite revealing and are best seen graphically:
  82
  83 [[rectangles-512.png]]
  84
  85 And a closer look at the small rectangles:
  86
  87 [[rectangles-64.png]]
  88
  89 As can be seen, there's a break-even point at a rectangle size just
  90 below 60x60. Above that, EXA performance scales extremely well, with
  91 the time becoming flat based on the number of rectangles, and
  92 independent of their size. While NoAccel performance scales quite
  93 poorly (and as expected).
  94
  95 Meanwhile, for the small rectangles, (which my original test case just
  96 happened to be testing exclusively), EXA is 3 to 4 times slower than
  97 NoAccel. Perhaps it would make sense for the X server to take an
  98 alternate approach for these small rectangles? The NoAccel results
  99 show that the X server does have faster code already. Or perhaps EXA
 100 itself could be made faster by having some hardware state caching to
 101 reduce overhead from one rectangle to the next.
 102
 103 But there are some obvious questions here: What sizes actually matter?
 104 What would a rectangle-size histogram look like for typical desktop
 105 loads? There's definitely room to do some measurement work here so
 106 that we can come up with meaningful benchmarks, (rather than the
 107 fairly arbitrary things I started with), and focus on optimizing the
 108 things that really matter.
 109
 110 A similar issue holds for the batching issue. I only saw good
 111 performance when I batched many rectangles into each call to
 112 XRenderFillRectangles. But is that even a reasonable thing to expect
 113 applications to be able to do? Do applications actually sequentially
 114 render dozens of rectangles all of the same color? I'm imagining GTK+
 115 widget themes with bevelled edges where it's actually much more likely
 116 that the behavior would be close to toggling back and forth between
 117 two colors every one or two rectangles. And that kind of behavior will
 118 exhibit wildly different results than what's being benchmarked here.
 119
 120 Anyway, there's plenty of interesting work to still be done here.
 121
 122 [*] I used wireshark and manually decoded all Render requests. I'm
 123 looking forward to good protocol tracing tools that decode all
 124 extensions. And yes, I'm aware of current XCB efforts to provide
 125 this---should be very nice!