[[!meta title="On 2D Performance Measurement"]]

[[!tag intel cairo performance]]

Trying to get a handle on 2D graphics rendering performance can be a
difficult task. Obviously, people care about the performance of their
2D applications. Nobody wants to wait for a web browser to scroll past
tacky banner ads or for an email client to render a screen full of
spam. And it's easy for users to notice "my programs aren't rendering
as fast with the latest drivers". But what developers need is a way to
quantify exactly what that means, in order to track improvements and
avoid regressions. And that measurement is the hard part. Or at least
it always has been hard, until Chris Wilson's recent cairo-perf-trace.

## Previous attempts at 2D benchmarking

Various attempts at 2D-rendering benchmark suites have appeared and
even become popular. Notable examples are x11perf and gtkperf. My
claim is that these tools range from useless to actively harmful when
the task is understanding performance of real applications.

These traditional benchmark suites are collections of synthetic
micro-benchmarks. Within a given benchmark, some tiny operation, (such
as "render a line of text" or "draw a radio box"), is performed
hundreds of times in a tight loop and the total time is measured. The
hope is that these operations will simulate the workload of actual
applications.

Unfortunately, the workloads of tools like x11perf and gtkperf rarely
come close to simulating practical workloads. In the worst case, the
operation being tested might never be used at all in modern
applications, (notice that x11perf tests things like stippled fills
and wide ellipses, which are obsolete graphics operations). Similarly,
even if the operation is used, (such as a GTK+ radio button), it might
not represent a significant fraction of the time the application
spends rendering, (the application might spend most of its time
drawing its primary display area rather than any stock widget).

So far that's just the well-known advice not to focus on the
performance of anything other than the primary bottlenecks. But even
when we have identified a bottleneck in an application, x11perf can
still be the wrong answer for measurement. For example, "text
rendering" is a common bottleneck for 2D applications. However, a test
like "x11perf aa10text", which seems like a tempting way to measure
text performance, is far from ideal. This benchmark draws a small
number of glyphs from a single font at a single size over and over.
Meanwhile, a real application will use many glyphs from many fonts at
many sizes. With layers and layers of caches throughout the graphics
stack, it's really not possible to accurately simulate what "text
rendering" means for a real application without actually running that
application.
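
To make the caching point concrete, here is a small Python sketch. It
does not model any real glyph cache: the `LRUCache` class, the cache
size of 256 entries, and both workloads are assumptions invented for
illustration. It only shows how a loop over the same few glyphs can hit
a cache almost every time while a varied workload almost never does.

```python
import random
from collections import OrderedDict


class LRUCache:
    """Tiny LRU cache that only counts hits and misses.

    A hypothetical stand-in for one of the caches in the graphics stack.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def lookup(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
        else:
            self.misses += 1
            self.entries[key] = True
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # evict least recently used


def hit_rate(requests, capacity=256):
    """Fraction of requests served from the cache."""
    cache = LRUCache(capacity)
    for key in requests:
        cache.lookup(key)
    return cache.hits / (cache.hits + cache.misses)


def synthetic_workload(n):
    """One font, one size, the same 62 glyphs over and over."""
    glyphs = [(0, 10, g) for g in range(62)]
    return [glyphs[i % len(glyphs)] for i in range(n)]


def varied_workload(n, seed=0):
    """Many fonts, sizes and glyphs, picked uniformly at random."""
    rng = random.Random(seed)
    return [
        (rng.randrange(20), rng.randrange(8, 40), rng.randrange(500))
        for _ in range(n)
    ]


if __name__ == "__main__":
    n = 100_000
    print("synthetic hit rate:", hit_rate(synthetic_workload(n)))
    print("varied hit rate:   ", hit_rate(varied_workload(n)))
```

A uniformly random workload is of course no more realistic than the
tight loop. Real applications land somewhere in between, and where
exactly is the thing a synthetic benchmark cannot tell you.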

And yes, I myself have used and perhaps indirectly advocated for using
things like x11perf in the past. I won't recommend it again in the
future. See below for what I suggest instead.

## What do the 3D folks do?

For 3D performance, everybody knows this lesson already. Nobody
measures the performance of "draw the same triangles over and
over". And if someone does, (by seriously quoting glxgears fps numbers,
for example), then everybody gets a good laugh. In fact, the phrase
"glxgears is not a benchmark" is a catchphrase among 3D
developers. Instead, 3D measurement is made with "benchmark modes" in
the 3D applications that people actually care about, (which as far as
I can tell is just games for some reason). In the benchmark mode, a
sample session of recorded input is replayed as quickly as possible
and a performance measurement is reported.

As a rule, our 2D applications don't have similar benchmark
modes. (There are some exceptions such as the trender utility for
mozilla and special command-line options for the swfdec player.) And
coding up application-specific benchmarking code for every interesting
application isn't something that anyone is signing up to do right now.

## Introducing cairo-perf-trace

Over the past year or so, Chris "ickle" Wilson has been putting a lot
of work into a debugging utility known as cairo-trace, (inspired by
work on an earlier tool known as libcairowrap by Benjamin Otte and
Jeff Muizelaar). The cairo-trace utility produces a trace of all
cairo-based rendering operations made by an application. The trace is
complete and accurate enough to allow all operations to be replayed
with a separate tool.

The cairo-trace utility has long proven invaluable as a way to capture
otherwise hard-to-reproduce test cases. People with complex
applications that exhibit cairo bugs can generate a cairo-trace and
often easily trim it down to a minimal test case. Then, after
this trace is submitted, a developer can replicate the bug without
needing a copy of the complex application or its state.

More recently, Chris wrote a new "cairo-trace --profile" mode and a
tool named [cairo-perf-trace](http://cairographics.org/FAQ/#profiling)
for replaying traces for benchmarking purposes. These tools are
currently available by obtaining the [cairo source
code](http://cairographics.org/download/), (either from git or in the
1.9.2 development snapshot or eventually the 1.10 release or
later). Hopefully they'll soon get packaged up so they're easier to
use.

With cairo-perf-trace, it's a simple matter to get rendering
performance measurements of real applications without having to do any
modification of the application itself. And you can collect a trace
based on exactly the workload you want, (as long as the application
you are interested in performs its rendering with cairo). Simply run:

        cairo-trace --profile some-application

This will generate a compressed file named something like
some-application.$pid.lzma. To later benchmark this trace, first
uncompress it:

        lzma -cd some-application.$pid.lzma > some-application.trace

And then run cairo-perf-trace on the trace file:

        cairo-perf-trace some-application.trace
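
If you want to script the last two steps, a minimal Python sketch might
look like the following. The helper names `decompress_trace` and
`replay_trace` are made up for this example, and the sketch assumes
that cairo-perf-trace is on your PATH and prints its report on standard
output. It uses only the plain `cairo-perf-trace <file>` invocation
shown above.

```python
import lzma
import shutil
import subprocess
from pathlib import Path


def decompress_trace(compressed, output):
    """Rough equivalent of `lzma -cd compressed > output`."""
    # lzma.open auto-detects the legacy .lzma container as well as .xz.
    with lzma.open(compressed, "rb") as src, open(output, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return Path(output)


def replay_trace(trace):
    """Run cairo-perf-trace on a trace file and return its text output.

    Assumption: the tool is installed and writes its report to stdout.
    """
    result = subprocess.run(
        ["cairo-perf-trace", str(trace)],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

Calling `replay_trace(decompress_trace("some-application.1234.lzma",
"some-application.trace"))` would then mirror the two commands above,
with 1234 standing in for whatever pid ended up in the file name.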

The cairo-perf-trace utility will replay several iterations of the
trace, (waiting for the standard deviation among reported times to
drop below a threshold), and will report timing results for both the
"image" backend (cairo's software backend) and whatever native backend
is compiled into cairo, (xlib, quartz, win32, etc.). So one
immediately useful result is that it's easy to see whether the native
backend is slower than the all-software backend. Then, after making
changes to the graphics stack, subsequent runs can be compared to
ensure regressions are avoided and performance improvements actually
help.
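
The "replay until the timings settle" idea can be sketched in a few
lines of Python. This is not cairo-perf-trace's actual implementation:
the function name `measure_until_stable`, the 1% threshold, and the run
limits are all assumptions chosen for illustration. The return value
simply mirrors the min, median, stddev and count columns of the report.

```python
import statistics
import time


def measure_until_stable(run_once, threshold=0.01, min_runs=3,
                         max_runs=50, timer=time.perf_counter):
    """Time run_once() repeatedly until the timings are stable.

    Stops once the relative standard deviation (stddev / mean) drops
    below `threshold`, or after `max_runs` iterations.
    Returns (min, median, relative_stddev, count).
    """
    times = []
    rel_stddev = float("inf")
    while len(times) < max_runs:
        start = timer()
        run_once()
        times.append(timer() - start)
        # stdev() needs at least two samples.
        if len(times) >= max(min_runs, 2):
            mean = statistics.mean(times)
            rel_stddev = statistics.stdev(times) / mean if mean > 0 else 0.0
            if rel_stddev < threshold:
                break
    return min(times), statistics.median(times), rel_stddev, len(times)
```

Here `run_once` would be whatever replays the trace a single time. The
`timer` parameter exists only so that the loop can be exercised with a
fake clock.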

Finally, Chris has also established a [cairo-traces git
repository](http://cgit.freedesktop.org/cairo-traces/) which collects
useful traces that can be shared and compared. It already contains
several different browsing sessions with firefox, swfdec traces (one
with youtube), and traces of poppler, gnome-terminal, and
evolution. Obviously, anyone should feel free to generate and propose
new traces to contribute.

## Putting cairo-perf-trace to use

In the few days that cairo-perf-trace has existed, we're already
seeing great results from it. When Kristian Høgsberg recently proposed
a [memory-saving
patch](http://lists.freedesktop.org/archives/intel-gfx/2009-June/002763.html)
for the Intel driver, Chris Wilson followed up with a
[cairo-perf-trace
report](http://lists.freedesktop.org/archives/intel-gfx/2009-June/002770.html)
showing that the memory-saving change had no negative impact on a
traced firefox session, which [addressed the
concern](http://lists.freedesktop.org/archives/intel-gfx/2009-June/002771.html)
that Eric had about the patch.

As another example, we've known that there's been a performance
regression in UXA (compared to EXA) for trapezoid rendering. The
problem was that UXA was allocating a pixmap only to then use
software-based rasterization to that pixmap (resulting in slow
read-modify-write cycles). The obvious fix I implemented is to simply
malloc a buffer, do the rasterization, and only then copy the result
to a pixmap.

After I wrote the patch, it was very satisfying to be able to validate
its real-world impact with a swfdec-based trace. This trace is based
on using swfdec to view the [Giant
Steps](http://michalevy.com/wp-content/uploads/Giant%20Steps%202007.swf)
movie. When running this trace, sysprof makes it obvious that
trapezoid rendering is the primary bottleneck. Here is the output of
cairo-perf-trace on a GM965 machine before my patch:

        [ # ]  backend                         test   min(s) median(s) stddev. count
        [  0]    image           swfdec-giant-steps   45.766   45.858  0.11%   6
        [  0]     xlib           swfdec-giant-steps  194.422  194.602  0.05%   6

The performance problem is quite plain here. Replaying the swfdec
trace to the X server takes 194 seconds, compared to only 45 seconds
to replay it to cairo's all-software image backend. Note that 194
seconds is longer than the full video clip, meaning that my system
isn't going to be able to keep up without skipping here. That's
obviously not what we want.

Then, after my simple just-use-malloc patch I get:

        [ # ]  backend                         test   min(s) median(s) stddev. count
        [  0]    image           swfdec-giant-steps   45.792   46.014  0.37%   6
        [  0]     xlib           swfdec-giant-steps   81.505   81.567  0.03%   6

Here the xlib result has improved from 194 seconds to 81
seconds. That's a 2.4x improvement, and fast enough to now play the
movie without skipping. It's very satisfying to validate performance
patches with real-world application code like this. This commit is in
the recent 2.7.99.901 release of the Intel driver, by the way. (Of
course, there's still a 1.8x slowdown of the xlib backend compared to
the image backend, so there's still more to be fixed here.)
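
For anyone who wants to check the arithmetic, or to compare two reports
automatically, here is a small Python sketch. The regular expression
and the helper names are my own guesses based only on the two tables
quoted in this post, so treat them as assumptions rather than a
description of a stable output format.

```python
import re

REPORT_BEFORE = """\
[ # ]  backend                         test   min(s) median(s) stddev. count
[  0]    image           swfdec-giant-steps   45.766   45.858  0.11%   6
[  0]     xlib           swfdec-giant-steps  194.422  194.602  0.05%   6
"""

REPORT_AFTER = """\
[ # ]  backend                         test   min(s) median(s) stddev. count
[  0]    image           swfdec-giant-steps   45.792   46.014  0.37%   6
[  0]     xlib           swfdec-giant-steps   81.505   81.567  0.03%   6
"""

# Matches a result row such as "[  0]  xlib  some-test  1.0  1.1  0.05%  6".
# The header row starts with "[ # ]" and is skipped because it has no digits.
ROW = re.compile(
    r"^\[\s*\d+\]\s+(?P<backend>\S+)\s+(?P<test>\S+)\s+"
    r"(?P<min>[\d.]+)\s+(?P<median>[\d.]+)\s+(?P<stddev>[\d.]+)%\s+"
    r"(?P<count>\d+)$"
)


def parse_report(text):
    """Return {(backend, test): median_seconds} for every result row."""
    medians = {}
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            key = (match["backend"], match["test"])
            medians[key] = float(match["median"])
    return medians


def ratio(numerator, denominator):
    """How many times slower `numerator` is than `denominator`."""
    return numerator / denominator


if __name__ == "__main__":
    test = "swfdec-giant-steps"
    before = parse_report(REPORT_BEFORE)
    after = parse_report(REPORT_AFTER)
    speedup = ratio(before[("xlib", test)], after[("xlib", test)])
    remaining = ratio(after[("xlib", test)], after[("image", test)])
    print(f"xlib speedup from the patch: {speedup:.1f}x")
    print(f"xlib vs. image after the patch: {remaining:.1f}x")
```

The two ratios are computed from the median column, and they round to
the 2.4x and 1.8x figures quoted above.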

The punchline is that we now have an easy way to benchmark 2D
rendering in actual, real-world applications. If you see someone
benchmarking with only toys like x11perf or gtkperf, go ahead and
point them to this post, or to the [cairo-perf-trace
entry](http://cairographics.org/FAQ/#profiling) in the cairo FAQ, and
insist on benchmarks from real applications.