From: Carl Worth
Date: Tue, 15 Jul 2008 22:21:28 +0000 (-0700)
Subject: Add New Job Old Tricks post
X-Git-Url: https://git.cworth.org/git?p=cworth.org;a=commitdiff_plain;h=f8304a59f3b8d4f2620b626af03cdb23af88fded

Add New Job Old Tricks post
---

diff --git a/src/exa/i965/new_job_old_tricks.mdwn b/src/exa/i965/new_job_old_tricks.mdwn
new file mode 100644
index 0000000..5b3c99b
--- /dev/null
+++ b/src/exa/i965/new_job_old_tricks.mdwn
@@ -0,0 +1,118 @@
+[[meta title="A new job, but old performance fixes"]]
+
+[[tag exa performance i965]]
+
+Many readers have heard already, but it will be news to some that I
+recently changed jobs. After just short of 4 years with Red Hat, I've
+now taken a job working for Intel, (in its Open-source Technology
+Center). It was hard to leave Red Hat---I have only fond memories of
+working there, and I will always be grateful to Red Hat for first
+helping me launch a career out of working on Free Software.
+
+Fortunately, as far as my free-software work is concerned, much of it
+will be unaffected by the job change. In fact, since I've been looking
+at X/2D/Intel driver graphics performance for the last year already,
+this job change should only help me do *much* *more* of that. And as
+far as [cairo](http://cairographics.org) goes, I'll continue to
+maintain it, but I haven't been doing much feature development there
+lately anyway. Instead, the most important thing I feel I could do for
+cairo now is to continue to improve X 2D performance. And that's an
+explicit job requirement in my new position. So I think the job change
+will be neutral to positive for anyone interested in my free-software
+efforts.
+
+As my first task at Intel, I took the nice HP 2510p laptop I was given
+on the first day, (which has i965 graphics of course), installed Linux
+on it, then compiled everything I needed for doing X development. I
+would have saved myself some pain if I had used these [build
+instructions](http://wiki.x.org/wiki/Development/git).
+I've since repeated that exercise with the instructions, and they work quite
+well, (though one can save some work by using distribution-provided
+development packages for many of the dependencies).
+
+Also, since I want to do development with
+[GEM](http://lwn.net/Articles/283793/), I built the drm-gem branches
+of the mesa, drm, and xf86-video-intel modules. That's as simple as
+doing "git checkout -b drm-gem origin/drm-gem" after the "git clone"
+of those three modules, (building the master branch of the xserver
+module is just fine). That seemed to build and run, so I quickly
+installed it as the X server I'm running regularly. I figured this
+would be great motivation for myself to fix any bugs I
+encountered---since they'd impact everything I tried to do.
+
+Well, it didn't take long to find some performance bugs. Just
+switching workspaces was a rather slow experience---I could literally
+watch xchat repaint its window with a slow swipe. (Oddly enough,
+gnome-terminal and iceweasel could redraw similarly-sized windows much
+more quickly.) And it didn't take much investigation to find the
+problem since it was something I had [found
+before](http://cworth.org/exa/i965/synchronous_composite/), a big,
+blocking call to i830WaitSync in every composite operation. My old
+favorite, "x11perf -aa10text" was showing only 13,000 glyphs per
+second.
+
+I had done some work to alleviate that before, and Dave Airlie had
+continued that until the call was entirely eliminated at one
+point. That happened on the old "intel-batchbuffer" branch of the
+driver. Recall that in January Eric and I had been disappointed to
+[report](http://cworth.org/talks/lca_2008/) that even after a recent
+2x improvement, the intel-batchbuffer branch was only at 109,000
+glyphs per second compared to 186,000 for XAA.
+
+Well, that branch had about a dozen, large, unrelated changes in it,
+and poor Eric Anholt had been stuck with the job of cleaning them up
+and landing them independently to the master branch, (while also
+writing a new memory manager and porting the driver to it).
+
+So here was one piece that just hadn't been finished yet. The driver
+was still just using a single vertex buffer that it allocates
+upfront---and a tiny buffer---just big enough for a single rectangle
+for a single composite operation. And so the driver was waiting for
+each composite operation to finish before reusing the buffer. And the
+change to GEM had made this problem even more noticeable. And Eric
+even had a partially-working patch to fix this---simply allocating a
+much larger vertex buffer and only doing the sync when wrapping around
+after filling it up. He had just been too busy with other things to
+get back to this patch. So this was one of those times when it's great
+to have a fresh new co-worker appear in the next cubicle asking how he
+could help. I tested Eric's patch, broke it up into tiny pieces
+to test them independently, and Eric quickly found what was needed to
+fix it, (an explicit flush to avoid the hardware caching vertex-buffer
+entries that would be filled in on future composite calls).
+
+So, with that in place the only thing left to decide was how large of
+a vertex buffer to allocate upfront. And that gives me an excuse to
+put in a performance plot:
+
+[[img vertex_buffers.png]]
+
+So the more the better, (obviously), until we get to 256 composite
+operations fitting into a single buffer. Then we start losing
+performance. So on the drm-gem branch, this takes performance from
+13,000 glyphs/second to 100,000 glyphs/second for a 7.7x
+speedup. That's a nice improvement for a simple patch, even if the
+overall performance isn't astounding yet. It is at least fast enough
+that I can now switch workspaces without getting bored.
+
+So I went ahead and applied these patches to the master branch as
+well. Interestingly, without any of the drm-gem branches, and even
+with the i830WaitSync call on every composite operation, things were
+already much better than in the GEM world. I measured 142,000
+glyphs/second before my patch, and 208,000 glyphs/second after the
+patch. So only a 1.5x speedup there, but for the first time ever I'm
+actually measuring EXA text rendering that's faster than XAA text
+rendering. Hurrah!
+
+And really, this is still just getting started. The patch I've
+described here is still just a bandaid. The real fix is to eliminate
+the upfront allocation and reuse of buffers. Instead, now that we have
+a real memory manager, (that's the whole point of GEM), we can
+allocate buffer objects as needed for vertex buffers, (and for surface
+state objects, etc.). That's the work I'll do next and it should let
+us finally see some of the benefits of GEM. Or if not, it will point
+out some of the remaining issues in GEM and we'll fix those right
+up. Either way, performance should just keep getting better and
+better.
+
+Stay tuned for more from me, and look forward to faster performance
+from every Intel graphics driver release.
diff --git a/src/exa/i965/new_job_old_tricks/vertex_buffers.png b/src/exa/i965/new_job_old_tricks/vertex_buffers.png
new file mode 100644
index 0000000..3f44cdb
Binary files /dev/null and b/src/exa/i965/new_job_old_tricks/vertex_buffers.png differ