[[!meta title="Avoiding read-modify-write to speed up i965_prepare_composite"]]

[[!tag exa performance xorg]]

I asked for help explaining the
[[slow_assignments|opannotate_i965_prepare_composite]] that opannotate
saw in `i965_prepare_composite` and I got just the help I hoped
for. Peter Lund, Phillip Ezolt, Michel Dänzer, Wang Zhenyu, and Keith
Packard each provided some helpful suggestions. Thanks to each of you!

The big slowdown was due to the various bitfield assignments resulting
in successive read-modify-write cycles to uncached memory (often
to the same location!). Every read would block and force a flush of
the previous write, so we were seeing abysmal performance as the CPU
performed lock-step reads and writes over the AGP bus.

The simplest fix was to just set up the desired state in local buffers
on the stack and to `memcpy` them into place when finished, resulting
in a nice stream of AGP writes that can benefit from write-combining.
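
To make the difference concrete, here is a minimal sketch of the two patterns. The `example_state` structure and both function names are made up for illustration and are not the real i965 state structures or driver functions. The sketch assumes a 32-bit `unsigned int`.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical state layout for illustration only; the real i965
 * structures are different. */
struct example_state {
    unsigned int format       : 9;
    unsigned int blend_enable : 1;
    unsigned int filter       : 3;
    unsigned int pad          : 19;
    unsigned int base_addr;
};

/* Slow pattern: dst points straight at the (uncached) mapping, so each
 * bitfield store may compile to a load, a mask, and a store on that
 * memory. */
static void fill_in_place(struct example_state *dst, unsigned int format,
                          unsigned int filter, unsigned int base_addr)
{
    assert(dst != NULL);
    dst->format = format;
    dst->blend_enable = 1;
    dst->filter = filter;
    dst->pad = 0;
    dst->base_addr = base_addr;
}

/* Fast pattern: do all the bitfield work on a stack copy in ordinary
 * cached memory, then send it to the destination in one write-only pass. */
static void fill_via_local(void *dst, unsigned int format,
                           unsigned int filter, unsigned int base_addr)
{
    struct example_state local;

    assert(dst != NULL);
    memset(&local, 0, sizeof local);
    local.format = format;
    local.blend_enable = 1;
    local.filter = filter;
    local.base_addr = base_addr;

    /* The only access to dst is this copy, which never reads from it. */
    memcpy(dst, &local, sizeof local);
}
```

Both functions leave the same bytes at the destination. The second one never reads from the destination, so the writes are free to stream out back to back.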

Subsequent improvements could involve not writing out data that is
identical from one call of the function to the next. It might also help
to rewrite the driver code to make it clearer when it is performing
IO reads (since unintentional reads can cause such performance
problems by forcing a flush of pending writes).
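
One way the "skip identical data" idea might look is to keep a shadow copy of the last upload in ordinary cached memory and compare against that, so the uncached destination is never read. This is only a sketch of the idea and not something the patches do. The `state_cache` type, the `upload_if_changed` name, and the 64-byte shadow size are all assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical shadow of the last state written to the device, kept in
 * normal cached memory so that comparing against it is cheap. */
struct state_cache {
    unsigned char shadow[64];
    size_t len;
    bool valid;
};

/* Copy src to dst only if it differs from what was last uploaded.
 * Returns true if a write to dst actually happened. */
static bool upload_if_changed(struct state_cache *cache, void *dst,
                              const void *src, size_t len)
{
    assert(cache != NULL && dst != NULL && src != NULL);

    /* Too large to shadow: always write, and forget any cached state. */
    if (len > sizeof cache->shadow) {
        memcpy(dst, src, len);
        cache->valid = false;
        return true;
    }

    /* Compare against the cached shadow, never against dst itself. */
    if (cache->valid && cache->len == len &&
        memcmp(cache->shadow, src, len) == 0)
        return false;

    memcpy(dst, src, len);
    memcpy(cache->shadow, src, len);
    cache->len = len;
    cache->valid = true;
    return true;
}
```

A zero-initialized `struct state_cache` starts out invalid, so the first call always writes. Whether the comparison is actually cheaper than simply rewriting the state would have to be measured.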

I've published a series of "use local structure" patches in a [git
branch](http://cgit.freedesktop.org/~cworth/xf86-video-intel/) and
sent that off to the xorg mailing list for review. (Update: These
improvements have now been pushed out into the upstream repository for
xf86-video-intel.) Here's a chart showing the improvement:

[[!img i965.png]]

<table border="1">
  <tr> <th> Test <th> Tbox <th> TboxGFX <th> English <th> Foreign <th> SVG <th> ALL
  <tr> <th> NoAccel <td> 21.859 <td> 44.698 <td> 12.110 <td> 41.205 <td> 474.750 <td> 24.176
  <tr> <th> EXA <td> 100.777 <td> 133.532 <td> 83.543 <td> 101.258 <td> 473.111 <td> 87.740
  <tr> <th> EXA-patch <td> 69.147 <td> 58.795 <td> 51.450 <td> 79.048 <td> 511.694 <td> 60.086
</table>

So that helped a fair amount for the text-heavy tests (although the
SVG case slowed down a bit for some reason), but the overall
performance is still over 4x slower than NoAccel in some cases. The
[[profile]] isn't showing any single huge bottleneck anymore, just
several things in the 5-8% range. Further progress might require a
series of individual fixes that each chop off another 5% problem.

Michel also made the good suggestion that I separately profile the
worst-behaving test case, so I'll pursue that next.