Add opanotate_i965_prepare_composite blog entry

[cworth.org] / src / exa / opannotate_i965_prepare_composite.mdwn
diff --git a/src/exa/opannotate_i965_prepare_composite.mdwn b/src/exa/opannotate_i965_prepare_composite.mdwn

new file mode 100644 (file)

index 0000000..7ff167e
--- /dev/null
+++ b/src/exa/opannotate_i965_prepare_composite.mdwn
@@ -0,0 +1,80 @@
+[[meta title="Using opannotate to make sense of profiles"]]
+
+[[tag exa performance xorg]]
+
+After I recently posted some surprising
+[[profiles|mozilla_i965_profiles]], I received useful feedback from
+Michel Dänzer, Adam Jackson, and Eric Anholt. There was general
+agreement that the i965_prepare_composite function is generally stupid
+about acting synchronously in order to reuse a single state buffer,
+and that this shouldn't be too hard to optimize. Options include using
+a ring of buffers and synchronizing only when wrapping around and also
+optimizing to not send redundant data.
+
+I had mentioned earlier that I had tried eliminating the `i830WaitSync`
+calls, but hadn't noticed any performance change. Well, one problem
+was that I had edited the files on the wrong machine, (I'm still not a
+true X hacker since I'm not totally in the groove of the two-machine
+debugging yet). It certainly did make a difference when I removed
+these calls from the code actually executing, (all the text appears in
+arbitrary colors, giving me more psychedelia than I actually need on
+my desktop). But the performance really didn't improve at all.
+
+Then I received a very helpful email from Roland Dreier, (thanks
+Roland!), cluing me in to opannotate. The results I had posted before
+were from opeport which gives profiling reports with function-level
+granularity. The opannotate utility gives a similar report, but at the
+granularity of either lines of
+[[source_code|i965_prepare_composite.source_annotate]] or
+[[assembly_instructions|i965_prepare_composite.assembly_annotate]].
+
+So these reports make it clear that sometimes there is more going on
+than meets the eye by simple examination of the source code. For
+example, much of the i965_prepare_composite function looks like simple
+assignments such as these:
+
+       memset (cc_viewport, 0, sizeof (*cc_viewport));
+       cc_viewport->min_depth = -1.e35;
+       cc_viewport->max_depth = 1.e35;
+
+       /* Color calculator state */
+       memset(cc_state, 0, sizeof(*cc_state));
+       cc_state->cc0.stencil_enable = 0;   /* disable stencil */
+       cc_state->cc2.depth_test = 0;       /* disable depth test */
+
+But now take a look at the same assignments annotated by
+opannotate. The first two columns are sample counts and percentage of
+total time attributed to each line of code, (recall that we're trying
+to determine why `i965_prepare_composite` is using more than 25% of
+the total time in the test):
+
+         274  0.0098 :    memset (cc_viewport, 0, sizeof (*cc_viewport));
+         124  0.0044 :    cc_viewport->min_depth = -1.e35;
+         122  0.0044 :    cc_viewport->max_depth = 1.e35;
+                     :
+                     :    /* Color calculator state */
+         861  0.0307 :    memset(cc_state, 0, sizeof(*cc_state));
+       18559  0.6623 :    cc_state->cc0.stencil_enable = 0;   /* disable stencil */
+       17836  0.6365 :    cc_state->cc2.depth_test = 0;       /* disable depth test */
+
+Clearly, not all assignments are created equal as the final two
+assignments are a couple of orders of magnitude slower than the first
+two. For a closer look, here's a chunk of the annotated assembly code
+showing some very expensive operations:
+
+                     :    cc_state->cc2.depth_test = 0;       /* disable depth test */
+                     :    cc_state->cc2.logicop_enable = 0;   /* disable logic op */
+                     :    cc_state->cc3.ia_blend_enable = 1;  /* blend alpha just like colors */
+                     :    cc_state->cc3.blend_enable = 1;     /* enable color blend */
+           1 3.6e-05 :   33277:        movzbl 0xd(%ecx),%eax
+       18168  0.6484 :   3327b:        andb   $0x7f,0x3(%ecx)
+       17836  0.6365 :   3327f:        andb   $0x7f,0x9(%ecx)
+       12306  0.4392 :   33283:        andb   $0xfe,0x8(%ecx)
+        7307  0.2608 :   33287:        or     $0x30,%eax
+
+So, we've got some bitfields being used here. Is this uncached memory
+that's causing it to be so expensive?
+
+If I'm as fortunate as I was with my last post, hopefully someone will
+drop a handy note into my inbox telling me how to make this function
+go blisteringly fast. I'm really looking forward to that.