[[!meta title="Using opannotate to make sense of profiles"]]

[[!tag exa performance xorg]]

After I recently posted some surprising
[[profiles|mozilla_i965_profiles]], I received useful feedback from
Michel Dänzer, Adam Jackson, and Eric Anholt. There was general
agreement that the `i965_prepare_composite` function is rather stupid
about acting synchronously in order to reuse a single state buffer,
and that this shouldn't be too hard to optimize. Options include using
a ring of buffers and synchronizing only when wrapping around, and also
optimizing to not send redundant data.

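The ring-of-buffers option can be sketched in C roughly as follows. This
is only an illustration under stated assumptions: `RING_SIZE`,
`struct state_ring`, `ring_wait_for_gpu`, and `ring_acquire` are
hypothetical names that do not come from the i965 driver, and the wait
is simulated with a counter where a real driver would block on
something like `i830WaitSync`.

```c
#include <assert.h>
#include <stddef.h>

#define RING_SIZE 4  /* hypothetical number of state buffers in the ring */

/* Hypothetical bookkeeping for a ring of state buffers. */
struct state_ring {
    size_t next;     /* index of the slot the next caller receives */
    unsigned syncs;  /* how many times we had to wait (simulated) */
};

/* Stand-in for the real wait; here it only counts the synchronizations. */
static void ring_wait_for_gpu(struct state_ring *ring)
{
    ring->syncs++;
}

/* Hand out the next slot, synchronizing only when the ring wraps around. */
static size_t ring_acquire(struct state_ring *ring)
{
    assert(ring->next <= RING_SIZE);
    if (ring->next == RING_SIZE) {
        /* Every slot has been handed out: wait once, then start over. */
        ring_wait_for_gpu(ring);
        ring->next = 0;
    }
    return ring->next++;
}
```
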
I had mentioned earlier that I had tried eliminating the `i830WaitSync`
calls, but hadn't noticed any performance change. Well, one problem
was that I had edited the files on the wrong machine (I'm still not a
true X hacker since I'm not totally in the groove of two-machine
debugging yet). It certainly did make a difference when I removed
these calls from the code actually executing (all the text appears in
arbitrary colors, giving me more psychedelia than I actually need on
my desktop). But the performance really didn't improve at all.

Then I received a very helpful email from Roland Dreier (thanks
Roland!), cluing me in to opannotate. The results I had posted before
were from opreport, which gives profiling reports with function-level
granularity. The opannotate utility gives a similar report, but at the
granularity of either lines of
[[source_code|i965_prepare_composite.source_annotate]] or
[[assembly_instructions|i965_prepare_composite.assembly_annotate]].

These reports make it clear that sometimes more is going on than a
simple reading of the source code reveals. For example, much of the
`i965_prepare_composite` function looks like simple assignments such
as these:

        memset (cc_viewport, 0, sizeof (*cc_viewport));
        cc_viewport->min_depth = -1.e35;
        cc_viewport->max_depth = 1.e35;

        /* Color calculator state */
        memset(cc_state, 0, sizeof(*cc_state));
        cc_state->cc0.stencil_enable = 0;   /* disable stencil */
        cc_state->cc2.depth_test = 0;       /* disable depth test */

But now take a look at the same assignments annotated by
opannotate. The first two columns are sample counts and percentage of
total time attributed to each line of code (recall that we're trying
to determine why `i965_prepare_composite` is using more than 25% of
the total time in the test):

          274  0.0098 :    memset (cc_viewport, 0, sizeof (*cc_viewport));
          124  0.0044 :    cc_viewport->min_depth = -1.e35;
          122  0.0044 :    cc_viewport->max_depth = 1.e35;
                      :
                      :    /* Color calculator state */
          861  0.0307 :    memset(cc_state, 0, sizeof(*cc_state));
        18559  0.6623 :    cc_state->cc0.stencil_enable = 0;   /* disable stencil */
        17836  0.6365 :    cc_state->cc2.depth_test = 0;       /* disable depth test */

Clearly, not all assignments are created equal, as the final two
assignments are a couple of orders of magnitude slower than the first
two. For a closer look, here's a chunk of the annotated assembly code
showing some very expensive operations:

                      :    cc_state->cc2.depth_test = 0;       /* disable depth test */
                      :    cc_state->cc2.logicop_enable = 0;   /* disable logic op */
                      :    cc_state->cc3.ia_blend_enable = 1;  /* blend alpha just like colors */
                      :    cc_state->cc3.blend_enable = 1;     /* enable color blend */
            1 3.6e-05 :   33277:        movzbl 0xd(%ecx),%eax
        18168  0.6484 :   3327b:        andb   $0x7f,0x3(%ecx)
        17836  0.6365 :   3327f:        andb   $0x7f,0x9(%ecx)
        12306  0.4392 :   33283:        andb   $0xfe,0x8(%ecx)
         7307  0.2608 :   33287:        or     $0x30,%eax

So, we've got some bitfields being used here, and each `andb` with a
memory operand is a read-modify-write on that memory. Is this uncached
memory that's causing it to be so expensive?

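If the state buffer does turn out to be uncached, one thing that might
be worth trying is to build the state in an ordinary local copy and
write it to the destination once, so that the bitfield read-modify-writes
hit cached memory. The sketch below only illustrates that idea.
`struct fake_cc_state` is a hypothetical stand-in with made-up field
widths, not the driver's real layout, and `fill_in_place` and
`fill_via_local` are invented names. Whether this helps at all depends
on how the real buffer is mapped, which this example does not model.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for the color calculator state. The field
 * widths and positions are made up for illustration only. */
struct fake_cc_state {
    struct { unsigned int pad:31; unsigned int stencil_enable:1; } cc0;
    struct { unsigned int pad:31; unsigned int depth_test:1; } cc2;
    struct { unsigned int pad:31; unsigned int blend_enable:1; } cc3;
};

/* Original pattern: every bitfield store reads and rewrites *dst. */
static void fill_in_place(struct fake_cc_state *dst)
{
    assert(dst != NULL);
    memset(dst, 0, sizeof(*dst));
    dst->cc0.stencil_enable = 0;  /* disable stencil */
    dst->cc2.depth_test = 0;      /* disable depth test */
    dst->cc3.blend_enable = 1;    /* enable color blend */
}

/* Alternative: do the bitfield updates on a local copy, then copy the
 * finished state to the destination with a single memcpy. */
static void fill_via_local(struct fake_cc_state *dst)
{
    struct fake_cc_state tmp;

    assert(dst != NULL);
    memset(&tmp, 0, sizeof(tmp));
    tmp.cc0.stencil_enable = 0;
    tmp.cc2.depth_test = 0;
    tmp.cc3.blend_enable = 1;
    memcpy(dst, &tmp, sizeof(tmp));
}
```
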
If I'm as fortunate as I was with my last post, hopefully someone will
drop a handy note into my inbox telling me how to make this function
go blisteringly fast. I'm really looking forward to that.