[[meta title="Avoiding read-modify-write to speed up i965_prepare_composite"]]

[[tag exa performance xorg]]

I asked for help explaining the
[[slow_assignments|opannotate_i965_prepare_composite]] that opannotate
saw in `i965_prepare_composite` and I got just the help I hoped
for. Peter Lund, Phillip Ezolt, Michel Dänzer, Wang Zhenyu, and Keith
Packard each provided some helpful suggestions. Thanks to each of you!

The big slowdown was due to the various bitfield assignments, each of
which resulted in a read-modify-write cycle to uncached memory (often
to the same location!). Every read would block and force a flush of
the previous write, so we were seeing abysmal performance as the CPU
performed lock-step reads and writes over the AGP bus.

The simplest fix was to just set up the desired state in local buffers
on the stack and to memcpy them when finished, resulting in a nice
stream of AGP writes that can benefit from write-combining.
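
To make the difference concrete, here is a minimal sketch of the two
patterns. The `sampler_state` struct, its fields, and both function
names are made up for illustration; they are not the real i965
structures or the code from my patches. In the real driver the `hw`
pointer would refer to the uncached mapping, while here it is just
ordinary memory.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical hardware state layout, for illustration only. */
struct sampler_state {
    struct {
        uint32_t min_filter : 3;
        uint32_t mag_filter : 3;
        uint32_t pad        : 26;
    } ss0;
    struct {
        uint32_t wrap_s : 3;
        uint32_t wrap_t : 3;
        uint32_t pad    : 26;
    } ss1;
};

/* Slow pattern: every bitfield store below is a load, mask, or, store
 * sequence on *hw. If hw points at uncached memory, each assignment
 * becomes a blocking read followed by a write. */
void setup_sampler_direct(struct sampler_state *hw)
{
    assert(hw != NULL);
    memset(hw, 0, sizeof *hw);
    hw->ss0.min_filter = 1;
    hw->ss0.mag_filter = 1;
    hw->ss1.wrap_s = 2;
    hw->ss1.wrap_t = 2;
}

/* Fast pattern: do all the read-modify-write work on a stack copy in
 * ordinary cached memory, then write the result out in a single pass.
 * The destination is only ever written, never read. */
void setup_sampler_buffered(struct sampler_state *hw)
{
    struct sampler_state local;

    assert(hw != NULL);
    memset(&local, 0, sizeof local);
    local.ss0.min_filter = 1;
    local.ss0.mag_filter = 1;
    local.ss1.wrap_s = 2;
    local.ss1.wrap_t = 2;

    memcpy(hw, &local, sizeof local);
}
```

Both functions leave the same bytes in `*hw`; the only difference is
how many times the destination is touched, and whether it is ever
read, along the way.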

Subsequent improvements could involve not writing out data that is
identical from one call of the function to the next. And it might help
to rewrite the driver code to make it clearer when it is performing
IO reads, (since unintentional reads can cause such performance
problems by forcing a flush of pending writes).
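
One way the "don't rewrite identical data" idea might look is
sketched below. This is not something I've implemented; the
`state_cache` struct, `STATE_MAX`, and `upload_if_changed` are all
hypothetical names. The point is that the comparison is done against
a shadow copy in ordinary cached memory, so the check itself never
reads from the uncached destination.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define STATE_MAX 64   /* assumed upper bound on one state block */

/* Shadow copy of whatever was last written to one hardware location. */
struct state_cache {
    unsigned char last[STATE_MAX];
    size_t        len;   /* 0 means nothing has been written yet */
};

/* Copy `state` to `hw` only if it differs from the last upload.
 * Returns true if a write to `hw` actually happened. */
bool upload_if_changed(struct state_cache *cache, void *hw,
                       const void *state, size_t len)
{
    assert(len > 0 && len <= STATE_MAX);

    /* Compare against the cached shadow, not against hw itself. */
    if (cache->len == len && memcmp(cache->last, state, len) == 0)
        return false;

    memcpy(hw, state, len);          /* write-only access to hw */
    memcpy(cache->last, state, len); /* remember what we wrote */
    cache->len = len;
    return true;
}
```

Each `state_cache` is tied to a single destination, and since the
comparison is a plain `memcmp`, the state being compared should be
zero-initialized so that padding bits don't cause spurious
mismatches.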

I've published a series of "use local structure" patches in a [git
branch](http://cgit.freedesktop.org/~cworth/xf86-video-intel/) and
sent that off to the xorg mailing list for review. (Update: These
improvements have now been pushed out into the upstream repository for
xf86-video-intel.) Here's a chart showing the improvement:

[[img i965.png]]

<table border="1">
  <tr> <th> Test <th> Tbox <th> TboxGFX <th> English <th> Foreign <th> SVG <th> ALL
  <tr> <th> NoAccel <td> 21.859 <td> 44.698 <td> 12.110 <td> 41.205 <td> 474.750 <td> 24.176
  <tr> <th> EXA <td> 100.777 <td> 133.532 <td> 83.543 <td> 101.258 <td> 473.111 <td> 87.740
  <tr> <th> EXA-patch <td> 69.147 <td> 58.795 <td> 51.450 <td> 79.048 <td> 511.694 <td> 60.086
</table>

So that helped a fair amount for the text-heavy tests, (although the
SVG case slowed down a bit for some reason), but EXA is still over 4x
slower than NoAccel in some cases (the English test, for example) and
about 2.5x slower overall. The [[profile]] isn't showing any single
huge bottleneck anymore, but several things in the 5-8% range. Fixing
things now might require a series of individual fixes that each chop
another 5% problem off.

Michel also made the good suggestion that I separately profile the
worst-behaving test case, so I'll pursue that next.