From 04a50398d2be37dcd9b5f8ed42b72638b3ad861b Mon Sep 17 00:00:00 2001
From: Carl Worth
Date: Fri, 12 Jun 2009 17:35:59 -0700
Subject: [PATCH] Add blog entry extolling cairo-perf-trace.

Hopefully this is the beginning of the end of bad 2D benchmarks.
---
 src/intel/performance_measurement.mdwn | 196 +++++++++++++++++++++++++
 1 file changed, 196 insertions(+)
 create mode 100644 src/intel/performance_measurement.mdwn

diff --git a/src/intel/performance_measurement.mdwn b/src/intel/performance_measurement.mdwn
new file mode 100644
index 0000000..ce5ffd1
--- /dev/null
+++ b/src/intel/performance_measurement.mdwn
@@ -0,0 +1,196 @@
+[[!meta title="On 2D Performance Measurement"]]
+
+[[!tag intel cairo performance]]
+
+Trying to get a handle on 2D graphics rendering performance can be a
+difficult task. Obviously, people care about the performance of their
+2D applications. Nobody wants to wait for a web browser to scroll past
+tacky banner ads or for an email client to render a pageful of
+spam. And it's easy for users to notice "my programs aren't rendering
+as fast with the latest drivers". But what developers need is a way to
+quantify exactly what that means, in order to track improvements and
+avoid regressions. And that measurement is the hard part. Or at least
+it always has been hard, until Chris Wilson's recent cairo-perf-trace.
+
+# Previous attempts at 2D benchmarking
+
+Various attempts at 2D-rendering benchmark suites have appeared and
+even become popular. Notable examples are x11perf and gtkperf. My
+claim is that these tools range from useless to actively harmful when
+the task is understanding performance of real applications.
+
+These traditional benchmark suites are collections of synthetic
+micro-benchmarks. Within a given benchmark, some tiny operation, (such
+as "render a line of text" or "draw a radio box"), is performed
+hundreds of times in a tight loop and the total time is measured. The
+hope is that these operations will simulate the workload of actual
+applications.
+
+Unfortunately, the workload of things like x11perf and gtkperf rarely
+comes close to simulating practical workloads. In the worst case, the
+operation being tested might never be used at all in modern
+applications, (notice that x11perf tests things like stippled fills
+and wide ellipses which are obsolete graphics operations). Similarly,
+even if the operation is used, (such as a GTK+ radio button), it might
+not represent a significant fraction of time spent rendering by the
+application, (which might spend most of its time drawing its primary
+display area rather than any stock widget).
+
+So that's just the well-known advice not to focus on the performance of
+things other than the primary bottlenecks. But even when we have
+identified a bottleneck in an application, x11perf can still be the
+wrong answer for measurement. For example, "text rendering" is a
+common bottleneck for 2D applications. However, a test like "x11perf
+aa10text", which seems like a tempting way to measure text performance,
+is far from ideal. This benchmark draws a small number of glyphs from
+a single font at a single size over and over. Meanwhile, a real
+application will use many glyphs from many fonts at many sizes. With
+layers and layers of caches throughout the graphics stack, it's really
+not possible to accurately simulate what "text rendering" means for a
+real application without actually running the application itself.
+
+And yes, I myself have used and perhaps indirectly advocated for using
+things like x11perf in the past. I won't recommend it again in the
+future. See below for what I suggest instead.
+
+# What do the 3D folks do?
+
+For 3D performance, everybody knows this lesson already. Nobody
+measures the performance of "draw the same triangles over and
+over". And when a program does that (like glxgears), everybody
+laughs if someone tries to take its frames-per-second report
+seriously. In fact, the phrase "glxgears is not a benchmark" is a
+catchphrase among 3D developers. Instead, 3D measurement is made with
+"benchmark modes" in the 3D applications that people actually care
+about, (which as far as I can tell is just games for some reason). In
+the benchmark mode, a sample session of recorded input is replayed as
+quickly as possible and a performance measurement is reported.
+
+As a rule, our 2D applications don't have similar benchmark
+modes. (There are some exceptions such as the trender utility for
+mozilla and special command-line options for the swfdec player.) And
+coding up application-specific benchmarking code for every interesting
+application isn't something that anyone is signing up to do right now.
+
+# Introducing cairo-perf-trace
+
+Over the past year or so, Chris "ickle" Wilson has been putting a lot
+of work into a debugging utility known as cairo-trace, (inspired by
+work on an earlier tool known as libcairowrap by Benjamin Otte and
+Jeff Muizelaar). The cairo-trace utility produces a trace of all
+cairo-based rendering operations made by an application. The trace is
+complete and accurate enough to allow all operations to be replayed
+with a separate tool.
+
+The cairo-trace utility has long proven invaluable as a way to capture
+otherwise hard-to-reproduce test cases. People with complex
+applications that exhibit cairo bugs can generate a cairo-trace and
+often easily trim it down to a minimal test case. Then, once this
+trace is submitted, a developer can replicate the bug without
+needing a copy of the complex application or its state.
+
+More recently, Chris wrote a new "cairo-trace --profile" mode and a
+tool named [cairo-perf-trace](http://cairographics.org/FAQ/#profiling)
+for replaying traces for benchmarking purposes. These tools are
+currently available by obtaining the [cairo source
+code](http://cairographics.org/download/), (either from git or in the
+1.9.2 development snapshot or eventually the 1.10 release or
+later). Hopefully we'll soon see them packaged up so they're easier to
+use.
+
+With cairo-perf-trace, it's a simple matter to get rendering
+performance measurements of real applications without having to do any
+modification of the application itself. And you can collect a trace
+based on exactly the workload you want, (as long as the application
+you are interested in performs its rendering with cairo). Simply run:
+
+    cairo-trace --profile some-application
+
+This will generate a compressed file named something like
+some-application.$pid.lzma. To later benchmark this trace, first
+uncompress it:
+
+    lzma -cd some-application.$pid.lzma > some-application.trace
+
+And then run cairo-perf-trace on the trace file:
+
+    cairo-perf-trace some-application.trace
+
+The cairo-perf-trace utility will replay several iterations of the
+trace, (waiting for the standard deviation among reported times to
+drop below a threshold), and will report timing results for both the
+"image" backend (cairo's software backend) and whatever native backend
+is compiled into cairo, (xlib, quartz, win32, etc.). So one
+immediately useful result is that it's easy to see if the native backend
+is slower than the all-software backend. Then, after making changes to
+the graphics stack, subsequent runs can be compared to ensure
+regressions are avoided and performance improvements actually help.
+
+Finally, Chris has also established a [cairo-traces git
+repository](http://cgit.freedesktop.org/cairo-traces/) which collects
+useful traces that can be shared and compared. It already contains
+several different browsing sessions with firefox, swfdec traces (one
+with youtube), and traces of poppler, gnome-terminal, and
+evolution. Obviously, anyone should feel free to generate and propose
+new traces to contribute.
+
+# Putting cairo-perf-trace to use
+
+In the few days that cairo-perf-trace has existed, we're already
+seeing great results from it. When Kristian Høgsberg recently proposed
+a [memory-saving
+patch](http://lists.freedesktop.org/archives/intel-gfx/2009-June/002763.html)
+for the Intel driver, Chris Wilson followed up with a
+[cairo-perf-trace
+report](http://lists.freedesktop.org/archives/intel-gfx/2009-June/002770.html)
+showing that the memory saving had no negative impact on a traced
+firefox session, which [addressed the
+concern](http://lists.freedesktop.org/archives/intel-gfx/2009-June/002771.html)
+that Eric had about the patch.
+
+As another example, we've known that there's been a performance
+regression in UXA (compared to EXA) for trapezoid rendering. The
+problem was that UXA was allocating a pixmap only to then use
+software-based rasterization to that pixmap (resulting in slow
+read-modify-write cycles). The obvious fix I implemented is to simply
+malloc a buffer, do the rasterization, and only then copy the result
+to a pixmap.
+
+After I wrote the patch, it was very satisfying to be able to validate
+its real-world impact with a swfdec-based trace. This trace is based
+on using swfdec to view the [Giant
+Steps](http://michalevy.com/wp-content/uploads/Giant%20Steps%202007.swf)
+movie. When running this trace, sysprof makes it obvious that
+trapezoid rendering is the primary bottleneck. Here is the output of
+cairo-perf-trace on a GM965 machine before my patch:
+
+    [ # ]  backend  test                 min(s)  median(s)  stddev.  count
+    [ 0]   image    swfdec-giant-steps   45.766     45.858    0.11%      6
+    [ 0]   xlib     swfdec-giant-steps  194.422    194.602    0.05%      6
+
+The performance problem is quite plain here. Replaying the swfdec
+trace to the X server takes 194 seconds compared to only 45 seconds to
+replay it to cairo's all-software image backend. Note that 194
+seconds is longer than the full video clip, meaning that my system
+isn't going to be able to keep up without skipping here. That's
+obviously not what we want.
+
+Then, after my simple just-use-malloc patch I get:
+
+    [ # ]  backend  test                 min(s)  median(s)  stddev.  count
+    [ 0]   image    swfdec-giant-steps   45.792     46.014    0.37%      6
+    [ 0]   xlib     swfdec-giant-steps   81.505     81.567    0.03%      6
+
+Here the xlib result has improved from 194 seconds to 81
+seconds. That's a 2.4x improvement, and fast enough to now play the
+movie without skipping. It's very satisfying to validate performance
+patches with real-world application code like this. (Of course,
+there's still a 1.8x slowdown of the xlib backend compared to the
+image backend, so there's still more to be fixed here.)
+
+The punchline is that we now have an easy way to benchmark 2D
+rendering in actual, real-world applications. If you see someone
+benchmarking with only toys like x11perf or gtkperf, go ahead and
+point them to this post, or the [cairo-perf-trace
+entry](http://cairographics.org/FAQ/#profiling) in the cairo FAQ, and
+insist on benchmarks from real applications.
-- 
2.43.0