[[meta title="X Acceleration that Finally Works"]]

## Abstract

Meaningful hardware acceleration within the X Window System is
becoming a reality. We present the current state of the art in
accelerated 2D rendering on the Intel 965 graphics device, showing
performance up to 900 times faster than software rendering.

## Presentation

  * [PDF slides](lca-2008.pdf)
  * [HTML slides](html) (OK, so really just a bunch of images with links between them)
  * [SVG slides](svg)
  * [Video (ogg)](http://mirror.linux.org.au/pub/linux.conf.au/2008/Wed/mel8-167.ogg)
  * [Audio (spx)](http://mirror.linux.org.au/pub/linux.conf.au/2008/Wed/mel8-167.spx)

## Summary

## Background

In the beginning, X provided support for graphics that, by today's
standards, are extremely unattractive. But those graphics capabilities
were perhaps not ill-suited to the applications and graphics hardware
of the time. For example, X provided bitwise raster operations like
XOR, and applications used XOR rubber-banding for window-frame
resizing or selection rectangles. X provided non-antialiased line
rendering, and hardware might have provided a Bresenham line
implementation. X also provided an acceleration architecture (XAA)
that exposed these original "core" rendering primitives to the
drivers.

Then time passed. Some things changed (what applications wanted to
draw), while some things didn't (what X and XAA provided). So for a
time, any "modern" application resorted to constructing the graphical
results it wanted manually (in software) and simply sending the final
result to the X server. As far as graphics was concerned, the X
protocol was reduced to simple image transport, and the hardware
remained almost unused by applications (except for a couple of notable
operations such as solid fills and "blitting", or copying pixels, for
things like scrolling).

X graphics support was revitalized with the X Render Extension. This
extension provides a small number of new primitives that are
well-suited for the needs of modern applications. These primitives
include image compositing (blending), support for client-side fonts,
trapezoid rasterization and gradients. The Render extension also
shipped with a remarkably slow software implementation in the X
server.
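
As a rough illustration of what "image compositing" means here, the
sketch below applies the Porter-Duff "over" operator to a single
premultiplied-alpha pixel. The function name `over` and the tuple
layout are assumptions made for this example; they are not part of the
Render protocol or of any X library.

```python
def over(src, dst):
    """Composite premultiplied-alpha pixel src over dst.

    Pixels are (alpha, red, green, blue) tuples of floats in [0, 1],
    with the colour channels already multiplied by alpha.
    """
    src_alpha = src[0]
    # Every channel, alpha included, follows the same rule:
    #   result = src + dst * (1 - src_alpha)
    return tuple(s + d * (1.0 - src_alpha) for s, d in zip(src, dst))


half_red = (0.5, 0.5, 0.0, 0.0)     # 50% opaque red, premultiplied
opaque_blue = (1.0, 0.0, 0.0, 1.0)  # fully opaque blue

result = over(half_red, opaque_blue)
print(result)  # (1.0, 0.5, 0.0, 0.5): opaque, half red and half blue
```

Render defines a family of such operators; "over" is simply the most
familiar one, and the one antialiased text relies on.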
  49
  50 Today, through cairo and similar systems, the standard toolkits (GTK+
  51 and QT) provide applications with easy-to-use drawing APIs that
  52 provide sophisticated effects and pipe everything into the X server
  53 through the Render extension. So, unlike the days of X being used just
  54 to transport the final image, we now have the opportunity to get the
  55 video hardware involved in rendering the things the application
  56 actually wants to draw.

The EXA acceleration architecture (an internal implementation detail
of the X server) allows X server drivers to implement hardware
support for each primitive provided by the Render extension. So
architecturally, everything is in place with EXA. Applications'
rendering operations are making it all the way to the hardware driver,
which should be able to make everything go fast.

## Problems

So if that's all in place, why isn't EXA blisteringly fast yet? Or why
does adding the 'AccelMethod "EXA"' option to xorg.conf actually slow
things down (sometimes dramatically)? There are a variety of possible
causes, but I'll discuss two here:

  * Migration overhead

  * Incomplete drivers
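
For reference, the option mentioned above is set in the Device section
of xorg.conf. The fragment below is a minimal sketch, assuming the
`intel` driver; the Identifier string is a placeholder and should match
whatever the rest of the configuration already uses.

```
Section "Device"
    Identifier "Intel Graphics"        # placeholder name
    Driver     "intel"
    Option     "AccelMethod" "EXA"
EndSection
```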

Migration refers to the problem that if a surface needs to be modified
sometimes by a hardware operation and sometimes by software, then the
X server needs to migrate the surface contents back and forth between
video and system memory. Due to architectural issues in commodity
hardware, reading back from video memory is painfully slow, often
orders of magnitude slower than writing to video memory. Various
clever migration strategies have been attempted within EXA, but it's
clear that no amount of cleverness is going to prevent a significant
amount of the overhead (it turns out to be nearly impossible to
predict in advance how a surface will be used next). The punch line
here is that a little bit of hardware acceleration is often much worse
than none at all (as evidence, see the many reports of people
dramatically improving system performance by disabling hardware
acceleration in XAA with the XAANoOffscreenPixmaps option). The real
answer to the migration problem is to make sure that drivers support
everything needed and basically never fall back to software rendering.
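
The punch line can be made concrete with a toy cost model. This is only
a sketch: the function `frame_cost`, the operation names, and every
cost figure are invented for illustration, chosen so that readback is
far more expensive than upload, as described above. They are not
measurements.

```python
def frame_cost(ops, accelerated, hw_op=1, sw_op=20, upload=5, readback=500):
    """Cost, in arbitrary units, of applying ops to a single surface.

    ops         -- sequence of operation names, in order
    accelerated -- set of operation names the driver handles in hardware
    The surface starts out in system memory and is migrated whenever
    the next operation needs it on the other side.
    """
    cost = 0
    in_video_memory = False
    for op in ops:
        if op in accelerated:
            if not in_video_memory:
                cost += upload      # system memory -> video memory
                in_video_memory = True
            cost += hw_op
        else:
            if in_video_memory:
                cost += readback    # video memory -> system memory (slow)
                in_video_memory = False
            cost += sw_op
    return cost


# Hypothetical workload: fills and glyph drawing alternating on one surface.
ops = ["fill", "glyphs"] * 20

no_accel = frame_cost(ops, accelerated=set())
partial_accel = frame_cost(ops, accelerated={"fill"})
full_accel = frame_cost(ops, accelerated={"fill", "glyphs"})

print(no_accel, partial_accel, full_accel)  # 800 10520 45
```

With these made-up figures, accelerating only the fills is about 13
times slower than pure software, because every fallback pays for a
readback, while accelerating both operations removes the migrations
entirely.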

So this brings us to the second issue, which is that we don't yet have
drivers that do everything we want. Fortunately, we now have
video-hardware manufacturers that are cooperating by providing
complete documentation for several devices (and, it appears, for more
and more devices all the time). On the Intel side, documentation for
the latest device (the i965 or "gen4") was
[released](http://intellinuxgraphics.org/) under a
CC-Attribution-NoDerivs license during LCA. This device is interesting
because, more than any previous Intel device, it should be quite
capable of supporting anything that Render and EXA can throw at it.
Also, the unified memory architecture it uses (system memory is reused
for video memory) should help with migration issues.

However, the currently-available upstream driver for the i965 is
extremely uninteresting performance-wise. It's one of the drivers that
will give a tremendous slowdown if EXA is enabled. The fundamental
problem with the driver is that it uses a single chunk of memory to
set up all the state for each compositing operation. Then, while the
hardware pipeline is started up on that operation, the driver receives
the next operation. But instead of stuffing this into the pipe and
keeping the hardware busy, the driver currently spins on the CPU until
the hardware is completely finished with the previous request. It does
this because it can't modify that shared state object in memory while
the previous operation is still using it. So the CPU stays extremely
busy while doing nothing but waiting, and the GPU stays extremely
idle, doing short bursts of work that occupy only a tiny fraction of
the compute resources on the chip. Not a good state of affairs at all.
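
The cost of this serialization shows up even in a toy timing model.
The sketch below is an illustration only: the two functions and the
per-operation timings are invented, and the pipelined case assumes the
driver has an unlimited supply of independent state buffers, which a
real driver does not.

```python
def serialized_time(ops):
    """Single shared state object: the CPU sets up one operation, then
    waits for the GPU to finish it before touching the state again."""
    return sum(cpu_setup + gpu_exec for cpu_setup, gpu_exec in ops)


def pipelined_time(ops):
    """Separate state per operation: the CPU sets up the next operation
    while the GPU is still executing the previous one."""
    cpu_done = 0  # time at which the CPU has finished setting up
    gpu_done = 0  # time at which the GPU has finished executing
    for cpu_setup, gpu_exec in ops:
        cpu_done += cpu_setup  # the CPU never waits for the GPU
        # The GPU starts once the setup is ready and it is free itself.
        gpu_done = max(gpu_done, cpu_done) + gpu_exec
    return gpu_done


# 100 small composite operations, each (cpu_setup, gpu_exec) in made-up units.
ops = [(2, 3)] * 100

print(serialized_time(ops))  # 500
print(pipelined_time(ops))   # 302
```

In the serialized case the two processors take turns, so the total is
the sum of both costs. In the pipelined case the total approaches
whichever of the two is the bottleneck.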

## Recent Work

Eric Anholt, Dave Airlie and I have been working to fix up the i965
driver to actually be sane. This work depends first on TTM, a new
kernel-supported graphics memory manager implemented as part of DRM
(see Dave Airlie's talk at LCA 2008 for more on TTM). The fundamental
primitives that TTM provides are buffer objects (kernel-allocated
chunks of video memory for the driver to use) and fences, which allow
the driver to set up operations and receive interrupts when operations
are complete rather than busy-waiting.

As it turns out, antialiased text rendering is one of the most
difficult things to accelerate in hardware. Currently, the driver sees
an independent composite operation for every glyph, and the glyph
image might be a 10x10 surface or smaller. With surfaces that small,
any per-composite-operation overhead in the driver becomes quite
significant. And, of course, text is one of the most fundamental
operations in 2D interfaces, so it's important not to render it
extremely slowly. Because of this, over the past several months we've
been focusing on characterizing, profiling, and optimizing the i965
driver with text as the primary benchmark. Other operations (like
image scaling and blending) will all benefit even more from the work
we've been doing specifically for text.

When profiling text rendering, we immediately noticed that all
operations were falling back to software rendering. This was because
the X server was using special system-memory storage for all cached
glyph images. We changed the X server to use ordinary pixmaps instead,
which lets the glyph images live in video memory and so allows for
hardware compositing. Next we discovered that the i965 driver was
claiming it didn't support the render-to-8-bit-alpha-mask operation
that text rendering needs, so that was also forcing fallbacks.
Fortunately, no real code was needed to fix that; we just changed the
driver to properly report its capabilities.

Those were all good improvements, but with those alone, the
performance of text only got worse with the i965 device (now orders of
magnitude slower than software). This is because every single, tiny
glyph rendering was now subject to the bug described earlier, where
the driver would spin the CPU while waiting for each separate
operation to complete on the GPU. So the need for a proper
implementation of compositing in the driver was now much more
pressing.

Dave Airlie made the initial change of the i965 driver to use batch
buffers rather than using a single, shared chunk of memory for the
graphics state for each operation. He also changed it so that the
driver uses TTM to allocate these batch buffers. With this in place,
we did several optimizations so that the driver re-initializes no more
state objects than strictly necessary. All of our code is currently
available in the intel-batchbuffer branch of the xf86-video-intel
driver.

## Results

The current results can be summarized as follows. Here we compare the
performance of the upstream "master" branch of the driver with that of
the "intel-batchbuffer" branch. In both cases we are using EXA, but
the numbers are reported as the speedup relative to XAA (so higher
numbers are better, and numbers less than 1 are performance
regressions).

                        Speedup compared to XAA

        Operation       EXA (master) EXA (intel-batchbuffer)
        ---------       ------------ -----------------------
        aa10text         0.3            0.6
        Blend            9.4          101.6
        .5 scale         7.4           34.3
        2x scale        23.5          200.3
        General scale   20.2          946.1

        Measurements made with "x11perf -aa10text" and with renderbench.

So, with our work, i965 EXA antialiased text for small glyphs is now
twice as fast as it was before, but still only 60% of the speed of XAA
text. That's 109,000 glyphs/second for EXA/intel-batchbuffer compared
to 186,000 for XAA. So it's still a respectable speed for text, but it
could be improved. Eric believes that the remaining problems
preventing text from being faster are cache flushing issues within TTM
itself.

Meanwhile, image blending with various scale factors is greatly
improved. The intel-batchbuffer branch makes EXA on the i965 perform
roughly 5 to 50 times faster than upstream EXA and, in the best case,
more than 900 times faster than XAA. Obviously, this is a very good
result, and it shows the incidental improvements we achieved while
looking closely only at text performance.
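
The ratios quoted above can be checked directly against the table. The
snippet below is just that arithmetic; the variable names are made up
for the example, and the figures are copied from the results as
reported.

```python
# Speedup relative to XAA: (EXA master, EXA intel-batchbuffer)
speedup_vs_xaa = {
    "aa10text": (0.3, 0.6),
    "Blend": (9.4, 101.6),
    ".5 scale": (7.4, 34.3),
    "2x scale": (23.5, 200.3),
    "General scale": (20.2, 946.1),
}

# How much faster intel-batchbuffer is than master for each operation.
gain_over_master = {
    op: round(batch / master, 1)
    for op, (master, batch) in speedup_vs_xaa.items()
}

# Text throughput of EXA/intel-batchbuffer relative to XAA, in glyphs/second.
text_ratio = round(109_000 / 186_000, 2)

for op, gain in gain_over_master.items():
    print(f"{op:14s} {gain:5.1f}x faster than master")
print(f"text speed relative to XAA: {text_ratio}")
```

The gains over master come out at 2.0 for text and between 4.6 and
46.8 for the blending and scaling tests, which is where the "5 to 50
times" range comes from. The text ratio of 0.59 matches the 0.6 in the
table.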

## Future Work

We are currently in the process of merging this work into the master
branch of the upstream Intel driver and plan for it to be part of the
upcoming Intel driver release scheduled for June 2008.

Beyond that, we plan to add support for hardware-accelerated
gradients, trapezoid rasterization, and perhaps polygon
rasterization. Obviously there's similar EXA acceleration work needed
for other drivers as well. Please come join us in the fun!