Add talk from LCA 2008

diff --git a/src/talks/lca_2008.mdwn b/src/talks/lca_2008.mdwn

new file mode 100644 (file)

index 0000000..1b56a94
--- /dev/null
+++ b/src/talks/lca_2008.mdwn
@@ -0,0 +1,213 @@
+[[meta title="X Acceleration that Finally Works"]]
+
+## Abstract
+
+Meaningful hardware acceleration within the X Window System is
+becoming a reality. We present the current state of the art in 2D
+accelerated rendering on the Intel 965 graphics device, showing
+performance gains of up to 900 times over software rendering.
+
+## Presentation
+
+  * [PDF slides](lca-2008.pdf)
+  * [HTML slides](html)
+  * [SVG slides](svg)
+  * [Video (ogg)](http://mirror.linux.org.au/pub/linux.conf.au/2008/Wed/mel8-167.ogg)
+  * [Audio (spx)](http://mirror.linux.org.au/pub/linux.conf.au/2008/Wed/mel8-167.spx)
+
+## Summary
+
+## Background
+
+In the beginning, X provided support for graphics that by today's
+standards are extremely unattractive. But the graphics capabilities
+were perhaps not ill-suited for applications and graphics hardware at
+the time. For example, X provided bitwise raster operations like XOR
+and applications used XOR rubber-banding for window-frame resizing or
+selection rectangles. X provided non-antialiased line rendering, and
+hardware might have provided a Bresenham line implementation. X also
+provided an acceleration architecture (XAA) that exposes these
+original "core" rendering primitives to the drivers.
+
+Then time passed. Some things changed (what applications wanted to
+draw), while some things didn't (what X and XAA provided). So for a
+time, any "modern" application resorted to constructing the graphical
+results it wanted manually, in software, and simply sending the final
+result to the X server. As far as graphics were concerned, the X
+protocol was reduced to simple image transport, and the hardware
+remained almost unused by applications (except for a couple of notable
+operations such as solid fills and "blitting", that is, copying pixels
+for things like scrolling).
+
+X graphics support was revitalized with the X Render Extension. This
+extension provides a small number of new primitives that are
+well-suited for the needs of modern applications. These primitives
+include image compositing (blending), support for client-side fonts,
+trapezoid rasterization and gradients. The Render extension also
+shipped with a remarkably slow software implementation in the X
+server.
+
+Today, through cairo and similar systems, the standard toolkits (GTK+
+and Qt) provide applications with easy-to-use drawing APIs that
+provide sophisticated effects and pipe everything into the X server
+through the Render extension. So, unlike the days of X being used just
+to transport the final image, we now have the opportunity to get the
+video hardware involved in rendering the things the application
+actually wants to draw.
+
+The EXA acceleration architecture (an internal implementation detail
+of the X server) allows X server drivers to implement hardware
+support for each primitive provided by the Render extension. So
+architecturally, everything is in place with EXA. Applications'
+rendering operations make it all the way to the hardware driver,
+which should be able to make everything go fast.
+
+## Problems
+
+So if that's all in place, why isn't EXA blisteringly fast yet? Or why
+does adding the 'AccelMethod "EXA"' option to xorg.conf actually slow
+things down (and sometimes dramatically)? There are a variety of
+possible causes, but I'll discuss two here:
+
+  * Migration overhead
+
+  * Incomplete drivers
+
+Migration refers to the problem that if a surface needs to be modified
+sometimes by a hardware operation and sometimes by software, then the
+X server needs to migrate the surface contents back and forth from
+video to system memory. Due to architectural issues in commodity
+hardware, reading back from video memory is painfully slow---often
+orders of magnitude slower than writing to video memory. Various
+clever migration strategies have been tried within EXA, but it's
+clear that no amount of cleverness is going to prevent a significant
+amount of the overhead (it turns out to be nearly impossible to
+predict in advance how a surface will be used next). A punch line
+here is that often a little bit of hardware acceleration is much
+worse than none at all (as evidence, see the many references to people
+dramatically improving system performance by disabling hardware
+acceleration in XAA with the XAANoOffscreenPixmaps option). The real
+answer for the migration problem is to make sure that drivers support
+everything needed and basically never fall back to software rendering.
+
+So this brings us to the second issue, which is that we don't yet have
+drivers that do everything we want. Fortunately, we now have
+video-hardware manufacturers that are cooperating by providing
+complete documentation for several devices, (and more and more devices
+all the time it appears). On the Intel side, documentation for the
+latest device, (the i965 or "gen4"), was
+[released](http://intellinuxgraphics.org/) under a
+CC-Attribution-NoDerivs license during LCA. This device is interesting
+because more than any previous Intel device it should be quite capable
+of supporting anything that Render and EXA can throw at it. Also, the
+unified memory architecture it uses, (system memory is reused for video
+memory), should help with migration issues.
+
+However, the currently-available upstream driver for the i965 is
+extremely uninteresting performance-wise. It's one of the drivers that
+will give a tremendous slowdown if EXA is enabled. The fundamental
+problem that the driver has is that it's using a single chunk of
+memory to set up all the state for each compositing operation. Then
+while the hardware pipeline is started up on that operation, the
+driver receives the next operation. But instead of stuffing this into
+the pipe and keeping the hardware busy, the driver currently spins in
+the CPU until the hardware is completely finished with the previous
+request. It does this because it can't modify that shared state object
+in memory while the previous operation is still using it. So the CPU
+stays extremely busy while doing nothing but waiting, and the GPU
+stays extremely idle, doing short bursts of work that occupy only a
+tiny fraction of the compute resources on the chip. Not a good state
+of affairs at all.
+
+## Recent Work
+
+Eric Anholt, Dave Airlie and I have been working to fix up the i965
+driver to actually be sane. This work depends first on TTM, a new
+kernel-supported graphics memory manager implemented as part of DRM,
+(see Dave Airlie's talk at LCA 2008 for more on TTM). The fundamental
+primitives that TTM provides are buffer objects, (kernel allocated
+chunks of video memory for the driver to use), and fences, which allow
+the driver to set up operations and receive interrupts when operations
+are complete rather than busy-waiting.
+
+As it turns out, antialiased text rendering is one of the most
+difficult things to accelerate in hardware. Currently, the driver will
+see an independent composite operation for every glyph and the glyph
+image might be as small as a 10x10 surface or smaller. With surfaces
+that small, any per-composite-operation overhead in the driver becomes
+quite significant. And, of course, text is one of the most fundamental
+operations in 2D interfaces, so it's important to not render it
+extremely slowly. Because of this, over the past several months we've
+been focusing on characterizing, profiling, and optimizing the i965
+driver with text as the primary benchmark. Other operations, (like
+image scaling and blending), will all benefit even more from the work
+we've been doing specifically for text.
+
+When profiling text rendering, we immediately noticed that all
+operations were falling back to software rendering. This was because
+the X server was using special system-memory storage for all cached
+glyph images. We changed the X server to use ordinary pixmaps instead,
+which lets the glyph images live in video memory and so allows
+hardware compositing. Next we discovered that the i965
+driver was claiming it didn't support the necessary
+render-to-8-bit-alpha-mask operation that text rendering needed, so
+that was also forcing fallbacks. Fortunately, no real code was needed
+to fix that---we just changed the driver to properly report its
+capabilities.
+
+Those were all good improvements, but with those alone, the
+performance of text only got worse with the i965 device (now orders
+of magnitude slower than software). This is because every single
+tiny glyph rendering was now subject to the bug described earlier,
+where the driver would spin the CPU while waiting for each separate
+operation to complete in the GPU. So a proper implementation of
+compositing in the driver was now much more important.
+
+Dave Airlie made the initial change of the i965 driver to use batch
+buffers rather than using a single, shared chunk of memory for the
+graphics state for each operation. He also changed it so that the
+driver uses TTM to allocate these batch buffers. With this in place,
+we did several optimizations so that the driver didn't needlessly
+re-initialize any more state objects than strictly necessary. All of
+our code is currently available in the intel-batchbuffer branch of the
+xf86-video-intel driver.
+
+## Results
+
+The current results can be summarized as follows. Here we are showing
+the performance difference of the upstream "master" branch of the
+driver compared to the "intel-batchbuffer" branch. In both cases we
+are using EXA, but reporting numbers as the speedup compared to XAA,
+(so higher numbers are better and numbers less than 1 are performance
+regressions).
+
+                       Speedup compared to XAA
+
+       Operation       EXA (master)    EXA (intel-batchbuffer)
+       --------        ------------    -----------------------
+       aa10text         0.3              0.6
+       Blend            9.4            101.6
+       .5 scale         7.4             34.3
+       2x scale        23.5            200.3
+       General scale   20.2            946.1
+
+       Measurements made with "x11perf -aa10text" and with renderbench.
+
+So, with our work, i965 EXA antialiased text for small glyphs is now
+2x faster than it was before, but still only 60% of the speed of XAA
+text. That's 109,000 glyphs/second for EXA/intel-batchbuffer compared
+to 186,000 for XAA. So it's still a respectable speed for text, but it
+could be improved. Eric believes that the remaining problems
+preventing text from being faster are cache flushing issues within TTM
+itself.
+
+Meanwhile, image blending with various scale factors is greatly
+improved. The intel-batchbuffer branch makes EXA on the i965 perform
+roughly 5 to 50 times faster than upstream EXA and, at best, more than
+900 times faster than XAA. Obviously, this is a very good result, and it
+shows the incidental improvements we achieved while looking closely
+only at text performance.
+
+We are currently in the process of merging this work into the master
+branch of the upstream Intel driver and plan for it to be part of the
+upcoming Intel driver release scheduled for June 2008.
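
The ratios quoted in the Results section can be re-derived from the table itself. The following is a minimal sketch, not part of the committed file: the numbers are copied from the table and the text, while the names `speedup_vs_xaa`, `branch_gain`, `gains` and `text_ratio` are illustrative assumptions.

```python
# Speedups relative to XAA, copied from the Results table:
# operation -> (EXA master, EXA intel-batchbuffer).
# All names in this sketch are illustrative; they do not come from the talk.
speedup_vs_xaa = {
    "aa10text":      (0.3, 0.6),
    "Blend":         (9.4, 101.6),
    ".5 scale":      (7.4, 34.3),
    "2x scale":      (23.5, 200.3),
    "General scale": (20.2, 946.1),
}


def branch_gain(master, batchbuffer):
    """Return how many times faster intel-batchbuffer is than master."""
    return batchbuffer / master


# Both columns share the same XAA baseline, so dividing one by the other
# gives the branch-to-branch speedup directly.
gains = {op: branch_gain(m, b) for op, (m, b) in speedup_vs_xaa.items()}

for op, gain in gains.items():
    print(f"{op:14s} {gain:5.1f}x faster than EXA (master)")

# Glyph throughput quoted in the text, in glyphs per second.
text_ratio = 109_000 / 186_000
print(f"aa10text on intel-batchbuffer runs at {text_ratio:.0%} of XAA speed")
```

Because both columns are measured against the same XAA baseline, dividing them cancels the baseline. The image operations come out between about 4.6x and 46.8x, which matches the "5 to 50 times" summary, and the glyph throughput ratio is a little under 0.6.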