--- /dev/null
+[[meta title="Using opannotate to make sense of profiles"]]
+
+[[tag exa performance xorg]]
+
+After I recently posted some surprising
+[[profiles|mozilla_i965_profiles]], I received useful feedback from
+Michel Dänzer, Adam Jackson, and Eric Anholt. There was general
+agreement that the i965_prepare_composite function is generally stupid
+about acting synchronously in order to reuse a single state buffer,
+and that this shouldn't be too hard to optimize. Options include using
+a ring of buffers and synchronizing only when wrapping around and also
+optimizing to not send redundant data.
+
+I had mentioned earlier that I had tried eliminating the `i830WaitSync`
+calls, but hadn't noticed any performance change. Well, one problem
+was that I had edited the files on the wrong machine, (I'm still not a
+true X hacker since I'm not totally in the groove of the two-machine
+debugging yet). It certainly did make a difference when I removed
+these calls from the code actually executing, (all the text appears in
+arbitrary colors, giving me more psychedelia than I actually need on
+my desktop). But the performance really didn't improve at all.
+
+Then I received a very helpful email from Roland Dreier, (thanks
+Roland!), cluing me in to opannotate. The results I had posted before
+were from opeport which gives profiling reports with function-level
+granularity. The opannotate utility gives a similar report, but at the
+granularity of either lines of
+[[source_code|i965_prepare_composite.source_annotate]] or
+[[assembly_instructions|i965_prepare_composite.assembly_annotate]].
+
+So these reports make it clear that sometimes there is more going on
+than meets the eye by simple examination of the source code. For
+example, much of the i965_prepare_composite function looks like simple
+assignments such as these:
+
+ memset (cc_viewport, 0, sizeof (*cc_viewport));
+ cc_viewport->min_depth = -1.e35;
+ cc_viewport->max_depth = 1.e35;
+
+ /* Color calculator state */
+ memset(cc_state, 0, sizeof(*cc_state));
+ cc_state->cc0.stencil_enable = 0; /* disable stencil */
+ cc_state->cc2.depth_test = 0; /* disable depth test */
+
+But now take a look at the same assignments annotated by
+opannotate. The first two columns are sample counts and percentage of
+total time attributed to each line of code, (recall that we're trying
+to determine why `i965_prepare_composite` is using more than 25% of
+the total time in the test):
+
+ 274 0.0098 : memset (cc_viewport, 0, sizeof (*cc_viewport));
+ 124 0.0044 : cc_viewport->min_depth = -1.e35;
+ 122 0.0044 : cc_viewport->max_depth = 1.e35;
+ :
+ : /* Color calculator state */
+ 861 0.0307 : memset(cc_state, 0, sizeof(*cc_state));
+ 18559 0.6623 : cc_state->cc0.stencil_enable = 0; /* disable stencil */
+ 17836 0.6365 : cc_state->cc2.depth_test = 0; /* disable depth test */
+
+Clearly, not all assignments are created equal as the final two
+assignments are a couple of orders of magnitude slower than the first
+two. For a closer look, here's a chunk of the annotated assembly code
+showing some very expensive operations:
+
+ : cc_state->cc2.depth_test = 0; /* disable depth test */
+ : cc_state->cc2.logicop_enable = 0; /* disable logic op */
+ : cc_state->cc3.ia_blend_enable = 1; /* blend alpha just like colors */
+ : cc_state->cc3.blend_enable = 1; /* enable color blend */
+ 1 3.6e-05 : 33277: movzbl 0xd(%ecx),%eax
+ 18168 0.6484 : 3327b: andb $0x7f,0x3(%ecx)
+ 17836 0.6365 : 3327f: andb $0x7f,0x9(%ecx)
+ 12306 0.4392 : 33283: andb $0xfe,0x8(%ecx)
+ 7307 0.2608 : 33287: or $0x30,%eax
+
+So, we've got some bitfields being used here. Is this uncached memory
+that's causing it to be so expensive?
+
+If I'm as fortunate as I was with my last post, hopefully someone will
+drop a handy note into my inbox telling me how to make this function
+go blisteringly fast. I'm really looking forward to that.