From 04a50398d2be37dcd9b5f8ed42b72638b3ad861b Mon Sep 17 00:00:00 2001
From: Carl Worth <cworth@cworth.org>
Date: Fri, 12 Jun 2009 17:35:59 -0700
Subject: [PATCH] Add blog entry extolling cairo-perf-trace.

Hopefully this is the beginning of the end of bad 2D benchmarks.
---
 src/intel/performance_measurement.mdwn | 196 +++++++++++++++++++++++++
 1 file changed, 196 insertions(+)
 create mode 100644 src/intel/performance_measurement.mdwn

diff --git a/src/intel/performance_measurement.mdwn b/src/intel/performance_measurement.mdwn
new file mode 100644
index 0000000..ce5ffd1
--- /dev/null
+++ b/src/intel/performance_measurement.mdwn
@@ -0,0 +1,196 @@
+[[!meta title="On 2D Performance Measurement"]]
+
+[[!tag intel cairo performance]]
+
+Trying to get a handle on 2D graphics rendering performance can be a
+difficult task. Obviously, people care about the performance of their
+2D applications. Nobody wants to wait for a web browser to scroll past
+tacky banner ads or for an email client to render a pageful of
+spam. And it's easy for users to notice "my programs aren't rendering
+as fast with the latest drivers". But what developers need is a way to
+quantify exactly what that means, in order to track improvements and
+avoid regressions. And that measurement is the hard part. Or at least
+it always has been hard, until Chris Wilson's recent cairo-perf-trace.
+
+# Previous attempts at 2D benchmarking
+
+Various attempts at 2D-rendering benchmark suites have appeared and
+even become popular. Notable examples are x11perf and gtkperf.  My
+claim is that these tools range from useless to actively harmful when
+the task is understanding performance of real applications.
+
+These traditional benchmark suites are collections of synthetic
+micro-benchmarks. Within a given benchmark, some tiny operation, (such
+as "render a line of text" or "draw a radio box"), is performed
+hundreds of times in a tight loop and the total time is measured. The
+hope is that these operations will simulate the workload of actual
+applications.
+
+Unfortunately, the workloads of things like x11perf and gtkperf rarely
+come close to simulating practical workloads. In the worst case, the
+operation being tested might never be used at all in modern
+applications, (notice that x11perf tests things like stippled fills
+and wide ellipses which are obsolete graphics operations). Similarly,
+even if the operation is used, (such as a GTK+ radio button), it might
+not represent a significant fraction of time spent rendering by the
+application, (which might spend most of its time drawing its primary
+display area rather than any stock widget).
+
+So that's just the well-known advice not to focus on the performance
+of things other than the primary bottlenecks. But even when we have
+identified a bottleneck in an application, x11perf can still be the
+wrong answer for measurement. For example, "text rendering" is a
+common bottleneck for 2D applications. However, a test like "x11perf
+aa10text" which seems like a tempting way to measure text performance
+is far from ideal. This benchmark draws a small number of glyphs from
+a single font at a single size over and over. Meanwhile, a real
+application will use many glyphs from many fonts at many sizes. With
+layers and layers of caches throughout the graphics stack, it's really
+not possible to accurately simulate what "text rendering" means for a
+real application without actually just running the actual application.
+
+And yes, I myself have used and perhaps indirectly advocated for using
+things like x11perf in the past. I won't recommend it again in the
+future. See below for what I suggest instead.
+
+# What do the 3D folks do?
+
+For 3D performance, everybody knows this lesson already. Nobody
+measures the performance of "draw the same triangles over and
+over". And when a program does that (like glxgears), everybody
+laughs if someone tries to take its frames-per-second report
+seriously. In fact, the phrase "glxgears is not a benchmark" is a
+catchphrase among 3D developers. Instead, 3D measurement is made with
+"benchmark modes" in the 3D applications that people actually care
+about, (which as far as I can tell is just games for some reason). In
+the benchmark mode, a sample session of recorded input is replayed as
+quickly as possible and a performance measurement is reported.
+
+As a rule, our 2D applications don't have similar benchmark
+modes. (There are some exceptions such as the trender utility for
+mozilla and special command-line options for the swfdec player.)  And
+coding up application-specific benchmarking code for every interesting
+application isn't something that anyone is signing up to do right now.
+
+# Introducing cairo-perf-trace
+
+Over the past year or so, Chris "ickle" Wilson has been putting a lot
+of work into a debugging utility known as cairo-trace, (inspired by
+work on an earlier tool known as libcairowrap by Benjamin Otte and
+Jeff Muizelaar). The cairo-trace utility produces a trace of all
+cairo-based rendering operations made by an application. The trace is
+complete and accurate enough to allow all operations to be replayed
+with a separate tool.
+
+The cairo-trace utility has long proven invaluable as a way to capture
+otherwise hard-to-reproduce test cases. People with complex
+applications that exhibit cairo bugs can generate a cairo-trace and
+often easily trim it down to a minimal test case. Then after
+submitting this trace, a developer can replicate this bug without
+needing to have a copy of the complex application or its state.
+
+More recently, Chris wrote a new "cairo-trace --profile" mode and a
+tool named [cairo-perf-trace](http://cairographics.org/FAQ/#profiling)
+for replaying traces for benchmarking purposes. These tools are
+currently available by obtaining the [cairo source
+code](http://cairographics.org/download/), (either from git or in the
+1.9.2 development snapshot or eventually the 1.10 release or
+later). Hopefully we'll see them get packaged up so they're easier to
+use soon.
+
+With cairo-perf-trace, it's a simple matter to get rendering
+performance measurements of real applications without having to do any
+modification of the application itself. And you can collect a trace
+based on exactly the workload you want, (as long as the application
+you are interested in performs its rendering with cairo). Simply run:
+
+	cairo-trace --profile some-application
+
+This will generate a compressed file named something like
+some-application.$pid.lzma. To later benchmark this trace, first
+uncompress it:
+
+	lzma -cd some-application.$pid.lzma > some-application.trace
+
+And then run cairo-perf-trace on the trace file:
+
+	cairo-perf-trace some-application.trace
+
+The cairo-perf-trace utility will replay several iterations of the
+trace, (waiting for the standard deviation among reported times to
+drop below a threshold), and will report timing results for both the
+"image" backend (cairo's software backend) and whatever native backend
+is compiled into cairo, (xlib, quartz, win32, etc.). So one
+immediately useful result: it's easy to see if the native backend
+is slower than the all-software backend. Then, after making changes to
+the graphics stack, subsequent runs can be compared to ensure
+regressions are avoided and performance improvements actually help.
+
+Finally, Chris has also established a [cairo-traces git
+repository](http://cgit.freedesktop.org/cairo-traces/) which collects
+useful traces that can be shared and compared. It already contains
+several different browsing sessions with firefox, swfdec traces (one
+with youtube), and traces of poppler, gnome-terminal, and
+evolution. Obviously, anyone should feel free to generate and propose
+new traces to contribute.
+
+# Putting cairo-perf-trace to use
+
+In the few days that cairo-perf-trace has existed, we're already
+seeing great results from it. When Kristian Høgsberg recently proposed
+a [memory-saving
+patch](http://lists.freedesktop.org/archives/intel-gfx/2009-June/002763.html)
+for the Intel driver, Chris Wilson followed up with a
+[cairo-perf-trace
+report](http://lists.freedesktop.org/archives/intel-gfx/2009-June/002770.html)
+showing that the memory savings had no negative impact on a traced
+firefox session, which [addressed the
+concern](http://lists.freedesktop.org/archives/intel-gfx/2009-June/002771.html)
+that Eric had about the patch.
+
+As another example, we've known that there's been a performance
+regression in UXA (compared to EXA) for trapezoid rendering. The
+problem was that UXA was allocating a pixmap only to then use
+software-based rasterization to that pixmap (resulting in slow
+read-modify-write cycles). The obvious fix I implemented is to simply
+malloc a buffer, do the rasterization, and only then copy the result
+to a pixmap.
+
+After I wrote the patch, it was very satisfying to be able to validate
+its real-world impact with a swfdec-based trace. This trace is based
+on using swfdec to view the [Giant
+Steps](http://michalevy.com/wp-content/uploads/Giant%20Steps%202007.swf)
+movie. When running this trace, sysprof makes it obvious that
+trapezoid rendering is the primary bottleneck. Here is the output of
+cairo-perf-trace on a GM965 machine before my patch:
+
+	[ # ]  backend                         test   min(s) median(s) stddev. count
+	[  0]    image           swfdec-giant-steps   45.766   45.858  0.11%   6
+	[  0]     xlib           swfdec-giant-steps  194.422  194.602  0.05%   6
+
+The performance problem is quite plain here. Replaying the swfdec
+trace to the X server takes 194 seconds compared to only 45 seconds to
+replay it only to cairo's all-software image backend. Note that 194
+seconds is longer than the full video clip, meaning that my system
+isn't going to be able to keep up without skipping here. That's
+obviously not what we want.
+
+Then, after my simple just-use-malloc patch I get:
+
+	[ # ]  backend                         test   min(s) median(s) stddev. count
+	[  0]    image           swfdec-giant-steps   45.792   46.014  0.37%   6
+	[  0]     xlib           swfdec-giant-steps   81.505   81.567  0.03%   6
+
+Here the xlib result has improved from 194 seconds to 81
+seconds. That's a 2.4x improvement, and fast enough to now play the
+movie without skipping. It's very satisfying to validate performance
+patches with real-world application code like this. (Of course,
+there's still a 1.8x slowdown of the xlib backend compared to the
+image backend, so there's still more to be fixed here.)
+
+The punchline is that we now have an easy way to benchmark 2D
+rendering in actual, real-world applications. If you see someone
+benchmarking with only toys like x11perf or gtkperf, go ahead and
+point them to this post, or the [cairo-perf-trace
+entry](http://cairographics.org/FAQ/#profiling) in the cairo FAQ, and
+insist on benchmarks from real applications.
-- 
2.43.0