[[!meta title="A new job, but old performance fixes"]]

[[!tag exa performance i965]]

Many readers have heard already, but it will be news to some that I
recently changed jobs. After just short of 4 years with Red Hat, I've
now taken a job working for Intel, (in its Open-source Technology
Center). It was hard to leave Red Hat---I have only fond memories of
working there, and I will always be grateful to Red Hat for first
helping me launch a career out of working on Free Software.

Fortunately, as far as my free-software work is concerned, much of it
will be unaffected by the job change. In fact, since I've been looking
at X/2D/Intel driver graphics performance for the last year already,
this job change should only help me do *much* *more* of that. And as
far as [cairo](http://cairographics.org) goes, I'll continue to
maintain it, but I haven't been doing much feature development there
lately anyway. Instead, the most important thing I feel I could do for
cairo now is to continue to improve X 2D performance. And that's an
explicit job requirement in my new position. So I think the job change
will be neutral to positive for anyone interested in my free-software
efforts.

As my first task at Intel, I took the nice HP 2510p laptop I was given
on the first day, (which has i965 graphics of course), installed Linux
on it, then compiled everything I needed for doing X development. I
would have saved myself some pain if I had used these [build
instructions](http://wiki.x.org/wiki/Development/git). I've since
repeated that exercise with the instructions, and they work quite
well, (though one can save some work by using distribution-provided
development packages for many of the dependencies).

Also, since I want to do development with
[GEM](http://lwn.net/Articles/283793/), I built the drm-gem branches
of the mesa, drm, and xf86-video-intel modules. That's as simple as
doing "git checkout -b drm-gem origin/drm-gem" after the "git clone"
of those three modules, (building the master branch of the xserver
module is just fine). That seemed to build and run, so I quickly
installed it as the X server I'm running regularly. I figured this
would be great motivation for me to fix any bugs I
encountered---since they'd impact everything I tried to do.

Well, it didn't take long to find some performance bugs. Just
switching workspaces was a rather slow experience---I could literally
watch xchat repaint its window with a slow swipe. (Oddly enough,
gnome-terminal and iceweasel could redraw similarly-sized windows much
more quickly.) And it didn't take much investigation to find the
problem since it was something I had [found
before](http://cworth.org/exa/i965/synchronous_composite/), a big,
blocking call to i830WaitSync in every composite operation. My old
favorite, "x11perf -aa10text", was showing only 13,000 glyphs per
second.

I had done some work to alleviate that before, and Dave Airlie had
continued that until the call was entirely eliminated at one
point. That happened on the old "intel-batchbuffer" branch of the
driver. Recall that in January Eric and I had been disappointed to
[report](http://cworth.org/talks/lca_2008/) that even after a recent
2x improvement, the intel-batchbuffer branch was only at 109,000
glyphs per second compared to 186,000 for XAA.

Well, that branch had about a dozen large, unrelated changes in it,
and poor Eric Anholt had been stuck with the job of cleaning them up
and landing them independently to the master branch, (while also
writing a new memory manager and porting the driver to it).

So here was one piece that just hadn't been finished yet. The driver
was still just using a single vertex buffer that it allocates
upfront---and a tiny one at that---just big enough for a single
rectangle for a single composite operation. So the driver was waiting
for each composite operation to finish before reusing the buffer, and
the change to GEM had made this problem even more noticeable. Eric
even had a partially-working patch to fix this---simply allocating a
much larger vertex buffer and only doing the sync when wrapping around
after filling it up. He had just been too busy with other things to
get back to this patch. So this was one of those times when it's great
to have a fresh new co-worker appear in the next cubicle asking how he
could help. I took Eric's patch, broke it up into tiny pieces
to test them independently, and Eric quickly found what was needed to
fix it, (an explicit flush to avoid the hardware caching vertex-buffer
entries that would be filled in on future composite calls).
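
The idea behind the patch can be sketched with a toy model. The Python
below is not driver code (the driver itself is C), and `VertexRing`,
`allocate`, `wait_for_gpu`, and `count_syncs` are hypothetical names
made up for this illustration. It leaves out the explicit flush
entirely and only counts how many blocking waits are needed when the
sync happens on wraparound rather than on every composite operation. A
capacity of 1 stands in for the old single-rectangle buffer.

```python
class VertexRing:
    """Toy model of one vertex buffer shared by many composite operations."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity  # composite operations that fit in the buffer
        self.next_slot = 0        # next unused slot in the buffer
        self.syncs = 0            # blocking waits issued so far

    def wait_for_gpu(self):
        # Stand-in for the blocking wait (i830WaitSync in the post).
        self.syncs += 1

    def allocate(self):
        """Return the slot for one composite's vertices, syncing only on wrap."""
        if self.next_slot == self.capacity:
            # Buffer is full: wait for the hardware before reusing it.
            self.wait_for_gpu()
            self.next_slot = 0
        slot = self.next_slot
        self.next_slot += 1
        return slot


def count_syncs(capacity, composites):
    """Count the blocking waits needed for a run of composite operations."""
    ring = VertexRing(capacity)
    for _ in range(composites):
        ring.allocate()
    return ring.syncs


# A one-slot buffer waits before nearly every operation; a 256-slot
# buffer waits only when it wraps around.
print(count_syncs(1, 1024))    # prints 1023
print(count_syncs(256, 1024))  # prints 3
```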

So, with that in place the only thing left to decide was how large
a vertex buffer to allocate upfront. And that gives me an excuse to
put in a performance plot:

[[!img vertex_buffers.png]]

So the more the better, (obviously), until we get to 256 composite
operations fitting into a single buffer. Then we start losing
performance. So on the drm-gem branch, this takes performance from
13,000 glyphs/second to 100,000 glyphs/second for a 7.7x
speedup. That's a nice improvement for a simple patch, even if the
overall performance isn't astounding yet. It is at least fast enough
that I can now switch workspaces without getting bored.

So I went ahead and applied these patches to the master branch as
well. Interestingly, without any of the drm-gem branches, and even
with the i830WaitSync call on every composite operation, things were
already much better than in the GEM world. I measured 142,000
glyphs/second before my patch, and 208,000 glyphs/second after the
patch. So only a 1.5x speedup there, but for the first time ever I'm
actually measuring EXA text rendering that's faster than XAA text
rendering. Hurrah!
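
For anyone who wants to check the arithmetic, the two speedups quoted
above are just ratios of the glyphs-per-second figures. The helper
name `speedup` is made up for this sketch; the numbers are the ones
measured above.

```python
def speedup(before, after):
    """Ratio of glyphs/second after the patch to glyphs/second before it."""
    return after / before


gem = speedup(13_000, 100_000)      # drm-gem branch
master = speedup(142_000, 208_000)  # master branch

print(f"drm-gem: {gem:.1f}x")   # prints "drm-gem: 7.7x"
print(f"master: {master:.1f}x") # prints "master: 1.5x"
```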

And really, this is still just getting started. The patch I've
described here is still just a bandaid. The real fix is to eliminate
the upfront allocation and reuse of buffers. Instead, now that we have
a real memory manager, (that's the whole point of GEM), we can
allocate buffer objects as needed for vertex buffers, (and for surface
state objects, etc.). That's the work I'll do next and it should let
us finally see some of the benefits of GEM. Or if not, it will point
out some of the remaining issues in GEM and we'll fix those right
up. Either way, performance should just keep getting better and
better.

Stay tuned for more from me, and look forward to faster performance
from every Intel graphics driver release.