Add avoiding_rmw blog entry

diff --git a/src/exa/i965/avoiding_rmw.mdwn b/src/exa/i965/avoiding_rmw.mdwn
new file mode 100644
index 0000000..942a09f
--- /dev/null
+++ b/src/exa/i965/avoiding_rmw.mdwn
@@ -0,0 +1,49 @@
+[[meta title="Avoiding read-modify-write to speed up i965_prepare_composite"]]
+
+[[tag exa performance xorg]]
+
+I asked for help explaining the
+[[slow_assignments|opannotate_i965_prepare_composite]] that opannotate
+saw in `i965_prepare_composite` and I got just the help I hoped
+for. Peter Lund, Phillip Ezolt, Michel Dänzer, Wang Zhenyu, and Keith
+Packard each provided some helpful suggestions. Thanks to each of you!
+
+The big slowdown was due to the various bitfield assignments resulting
+in successive read-modify-write cycles to uncached memory (often to
+the same location!). Every read would block and force a flush of the
+previous write, so we were seeing abysmal performance as the CPU
+performed lock-step reads and writes over the AGP bus.
+
+The simplest fix was to just set up the desired state in local buffers
+on the stack and to memcpy them when finished, resulting in a nice
+stream of AGP writes that can benefit from write-combining.
+
+Subsequent improvements could involve not writing out data that is
+identical from one call of the function to the next. And it might help
+to rewrite the driver code to make it clearer when it is performing
+IO reads (since unintentional reads can cause such performance
+problems by forcing a flush of pending writes).
+
+I've published a series of "use local structure" patches in a
+[[git_branch|http://cgit.freedesktop.org/~cworth/xf86-video-intel/]]
+and sent that off to the xorg mailing list for review. Here's a chart
+showing the improvement:
+
+[[img i965.png]]
+
+<table border="1">
+  <tr> <th> Test <th> Tbox <th> TboxGFX <th> English <th> Foreign <th> SVG <th> ALL
+  <tr> <th> NoAccel <td> 21.859 <td> 44.698 <td> 12.110 <td> 41.205 <td> 474.750 <td> 24.176
+  <tr> <th> EXA <td> 100.777 <td> 133.532 <td> 83.543 <td> 101.258 <td> 473.111 <td> 87.740
+  <tr> <th> EXA-patch <td> 69.147 <td> 58.795 <td> 51.450 <td> 79.048 <td> 511.694 <td> 60.086
+</table>
+
+That helped a fair amount for the text-heavy tests (although the SVG
+case slowed down a bit for some reason), but patched EXA is still
+about 2.5x slower than NoAccel overall and over 4x slower on the
+English test. The [[profile]] no longer shows any single huge
+bottleneck, just several things in the 5-8% range. Further gains may
+require a series of individual fixes that each chop off another 5%.
+
+Michel also made the good suggestion that I separately profile the
+worst-behaving test case, so I'll pursue that next.
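
For readers of this diff, the "local buffers on the stack" fix described in the entry can be sketched roughly as follows. This is only an illustration under assumptions: `struct demo_state`, its fields, and both function names are hypothetical and do not come from the xf86-video-intel driver, and ordinary memory stands in for the uncached AGP mapping.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical state word; the real i965 structures are laid out differently. */
struct demo_state {
    unsigned int min_filter : 3;
    unsigned int mag_filter : 3;
    unsigned int wrap_mode  : 3;
    unsigned int enable     : 1;
    unsigned int pad        : 22;
};

/* Slow pattern: each bitfield assignment reads the destination word,
 * modifies a few bits, and writes it back. When dst points at uncached
 * memory, every one of those reads has to wait for the previous write. */
static void fill_in_place(volatile struct demo_state *dst)
{
    dst->min_filter = 1;
    dst->mag_filter = 1;
    dst->wrap_mode  = 2;
    dst->enable     = 1;
}

/* Faster pattern: build the state in a local on the stack, where the
 * read-modify-write cycles hit cached memory, then emit it with one copy
 * so the destination only ever sees writes. */
static void fill_via_local(void *dst)
{
    struct demo_state local;

    assert(dst != NULL);
    memset(&local, 0, sizeof(local));
    local.min_filter = 1;
    local.mag_filter = 1;
    local.wrap_mode  = 2;
    local.enable     = 1;
    memcpy(dst, &local, sizeof(local));
}
```

Both functions leave the same bits in the destination (given a zeroed destination for the first one); the difference is only in the access pattern the destination memory sees.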