Add emulating_speedups post
diff --git a/src/exa/i965/emulating_speedups.mdwn b/src/exa/i965/emulating_speedups.mdwn
new file mode 100644
index 0000000..7370a55
--- /dev/null
@@ -0,0 +1,234 @@
+[[meta title="Emulating the future of the i965 driver"]]
+
+[[tag exa performance xorg]]
+
+Earlier this week I [[isolated_some_bugs|synchronous_composite]] that
+are currently causing a 4x slowdown with EXA and the i965 driver
+compared to using the NoAccel option of the X server.
+
+Some people have wondered if the discouraging results I have found so
+far suggest that we should give up on hardware acceleration or that
+EXA as an acceleration architecture is doomed. I think the answer is
+no on both points. I think we're just seeing typical behavior of new
+code that needs some optimization.
+
+# EXA without acceleration
+
+The first experiment is a very simple one to ensure that the 4x
+slowdown isn't an unavoidable aspect of having EXA enabled. In this
+experiment I first
+[[disabled_the_accelerated-compositing_functions|Disable-acceleration-from-i965-EXA-hooks.patch]]
+in the i965 driver, then I
+[[disabled_EXA_migration|Disable-all-EXA-migration.patch]]. The net
+result of this experiment is that the X server will still go through
+the EXA paths, but will basically use all the same software fallbacks
+for compositing that are used in the case of NoAccel. The performance
+with this patch can be compared to the NoAccel case here.
+
+<dl class="chart barchart">
+    <dt><a href="/exa/i965/emulating_speedups/NoAccel/system.oprofile">NoAccel</a> (<a href="/exa/i965/emulating_speedups/NoAccel/timing">14.4 ms.</a>) <a href="/exa/i965/emulating_speedups/NoAccel/system.symbols">symbols profile</a></dt>
+    <dd style="width:83.7209%;">
+        <ul>
+            <li class="libpixman" style="width:45.4062%;"><a href="/exa/i965/emulating_speedups/NoAccel/libpixman.oprofile">libpixman</a><span>45%</span></li>
+            <li class="libxul" style="width:18.1849%;">libxul<span>18%</span></li>
+            <li class="vmlinux" style="width:13.3098%;"><a href="/exa/i965/emulating_speedups/NoAccel/vmlinux.oprofile">vmlinux</a><span>13%</span></li>
+            <li class="Xorg" style="width:7.7082%;"><a href="/exa/i965/emulating_speedups/NoAccel/Xorg.oprofile">Xorg</a><span>8%</span></li>
+            <li class="libc-2_5" style="width:5.1990%;"><a href="/exa/i965/emulating_speedups/NoAccel/libc-2.5.oprofile">libc-2.5</a><span>5%</span></li>
+            <li class="oprofiled" style="width:2.8155%;">oprofiled<span>3%</span></li>
+            <li class="libfb" style="width:2.0193%;"><a href="/exa/i965/emulating_speedups/NoAccel/libfb.oprofile">libfb</a><span>2%</span></li>
+            <li class="other" style="width:5.3571%;">other<span>5%</span></li>
+        </ul>
+    </dd>
+    <dt><a href="/exa/i965/emulating_speedups/EXA-without-accel/system.oprofile">EXA-without-accel</a> (<a href="/exa/i965/emulating_speedups/EXA-without-accel/timing">15.7 ms.</a>) <a href="/exa/i965/emulating_speedups/EXA-without-accel/system.symbols">symbols profile</a></dt>
+    <dd style="width:91.2791%;">
+        <ul>
+            <li class="libpixman" style="width:42.0541%;"><a href="/exa/i965/emulating_speedups/EXA-without-accel/libpixman.oprofile">libpixman</a><span>42%</span></li>
+            <li class="libxul" style="width:15.7266%;">libxul<span>16%</span></li>
+            <li class="vmlinux" style="width:12.6525%;"><a href="/exa/i965/emulating_speedups/EXA-without-accel/vmlinux.oprofile">vmlinux</a><span>13%</span></li>
+            <li class="Xorg" style="width:9.0172%;"><a href="/exa/i965/emulating_speedups/EXA-without-accel/Xorg.oprofile">Xorg</a><span>9%</span></li>
+            <li class="oprofiled" style="width:5.0459%;">oprofiled<span>5%</span></li>
+            <li class="libc-2_5" style="width:4.9173%;"><a href="/exa/i965/emulating_speedups/EXA-without-accel/libc-2.5.oprofile">libc-2.5</a><span>5%</span></li>
+            <li class="libexa" style="width:3.1229%;"><a href="/exa/i965/emulating_speedups/EXA-without-accel/libexa.oprofile">libexa</a><span>3%</span></li>
+            <li class="libfb" style="width:2.1381%;"><a href="/exa/i965/emulating_speedups/EXA-without-accel/libfb.oprofile">libfb</a><span>2%</span></li>
+            <li class="other" style="width:5.3254%;">other<span>5%</span></li>
+        </ul>
+    </dd>
+</dl>
+
+It's worth pointing out that with this change, everything still
+renders correctly. Basically, we're using the same rendering code as
+in the NoAccel case, but we're using EXA to get there. And we can see
+that EXA adds some overhead here, (about 9%: 15.7 ms versus 14.4
+ms), but nothing like the 4x slowdown seen before. There's
+certainly no indication here that EXA is doomed to be horribly
+slow.
+
+Now, this experiment does miss overhead in EXA having to do with
+managing video memory, (since I disabled all migration so everything
+lives in system memory). We'll be able to see this additional overhead
+below.
+
+# Emulating future i965 speedups
+
+The above experiment is still pretty boring---it's still just
+measuring software-fallback performance. A much more interesting
+experiment allows us to start exploring where we will be able to get with
+some un-broken hardware acceleration.
+
+My [[previous_post|synchronous_composite]] highlighted two significant
+problems preventing the current code from having good performance:
+
+ * Time lost migrating pixmaps with memcpy
+
+ * Time wasted while the driver busy-waited between operations
+
+Here's a run-down of what I could find about progress on solving these
+two problems:
+
+## Excessive migration
+
+When I looked closer at what was causing the pixmap migration, I found
+that much of it was due to glyph images being pinned to system memory,
+(recall that the benchmark I'm using is Mozilla Firefox on a page
+consisting of mostly text). I asked Keith Packard about why these
+glyph images are being pinned to system memory, and he explained that
+what was preventing the glyphs from migrating is that the X server has
+not been using straight Pixmaps for glyphs, but something slightly
+different.
+
+Keith is already mostly finished with a change to make the server use
+Pixmaps for glyphs. Apparently there is one slight snag in that
+Pixmaps are a per-screen resource while glyphs are not. For now, that
+could be worked around by using one Pixmap per screen, (until Pixmaps
+and other resources can be made global within the server).
+
+So, hopefully that glyph pinning problem will be fixed. Meanwhile,
+it's fairly silly that there's a bunch of memcpy operations to migrate
+things from "system" to "video" memory on the i965 at all. This card
+doesn't have dedicated video memory; it just uses system memory,
+(all that's needed is for some entries to be set in the GART
+table, and for some alignment constraints to be satisfied). So it
+should be possible to eliminate all of this memcpy time.
+
+I'm told that the long-awaited memory management work, (TTM), is what
+will solve this. I don't know what the status of that work is, but
+hopefully it will be ready soon. Does anyone have some pointers for
+more information on TTM status?
+
+## Synchronous compositing
+
+I characterized this problem fairly well in my previous post. Eric
+Anholt suggested a first quick step toward improving the situation
+would be to use an array of state buffers. With N buffers we could
+make the waiting happen only 1/N as frequently as it's currently
+happening. So that's something that even someone like me without any
+detailed documentation on the i965 could do.
+
+And with a little more smarts, (from someone with more information),
+we could presumably reclaim buffers that the hardware was done with
+without having to do any waiting at all.
+
+So it shouldn't be too long before the waiting can be eliminated or
+reduced to an arbitrarily small amount of time.
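
The array-of-state-buffers idea can be modelled with a toy simulation. This is only a sketch under stated assumptions: `StateBufferRing`, `acquire`, and `count_waits` are hypothetical names invented for illustration, not anything from the i965 driver, and the "wait" here just bumps a counter where a real driver would wait for the hardware. It assumes each operation uses one state buffer and that a single wait makes every buffer reusable.

```python
class StateBufferRing:
    """Toy model of a ring of N state buffers (not real driver code)."""

    def __init__(self, num_buffers):
        if num_buffers < 1:
            raise ValueError("need at least one state buffer")
        self.num_buffers = num_buffers
        self.next_index = 0  # buffer the next operation will use
        self.in_flight = 0   # buffers handed out since the last wait
        self.waits = 0       # number of times we had to wait

    def acquire(self):
        """Return the index of a buffer that is safe to overwrite."""
        if self.in_flight == self.num_buffers:
            # Every buffer may still be in use by the hardware, so wait
            # once; in this model that frees all of them together.
            self.waits += 1
            self.in_flight = 0
        index = self.next_index
        self.next_index = (self.next_index + 1) % self.num_buffers
        self.in_flight += 1
        return index


def count_waits(num_buffers, num_operations):
    """Count the waits needed to issue num_operations operations."""
    ring = StateBufferRing(num_buffers)
    for _ in range(num_operations):
        ring.acquire()
    return ring.waits


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(f"{n} buffer(s): {count_waits(n, 1000)} waits per 1000 operations")
```

With 1000 operations the model counts 999 waits for a single buffer and 124 for eight, which is the roughly 1/N behavior described above. Reclaiming buffers as the hardware finishes with them, without waiting at all, is not modelled here.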
+
+## Results
+
+Given these identified solutions for the current known problems, (with
+much of the work already in progress), the next question I want to ask
+is: what will things look like when these are solved?
+
+I implemented quick patches to both
+[[EXA|Emulate-infinitely-fast-migration-disable-memcpy.patch]] and the
+[[i965_driver|Emulate-infinitely-fast-i965-compositing-make-check.patch]]
+to emulate the time being spent on migration and compositing going to
+zero. That's not totally realistic, but is at least a best-case look
+at where we'll be with these problems fixed. And here's what it looks
+like (with the previous results repeated for comparison):
+
+<dl class="chart barchart">
+    <dt><a href="/exa/i965/emulating_speedups/NoAccel/system.oprofile">NoAccel</a> (<a href="/exa/i965/emulating_speedups/NoAccel/timing">14.4 ms.</a>) <a href="/exa/i965/emulating_speedups/NoAccel/system.symbols">symbols profile</a></dt>
+    <dd style="width:83.7209%;">
+        <ul>
+            <li class="libpixman" style="width:45.4062%;"><a href="/exa/i965/emulating_speedups/NoAccel/libpixman.oprofile">libpixman</a><span>45%</span></li>
+            <li class="libxul" style="width:18.1849%;">libxul<span>18%</span></li>
+            <li class="vmlinux" style="width:13.3098%;"><a href="/exa/i965/emulating_speedups/NoAccel/vmlinux.oprofile">vmlinux</a><span>13%</span></li>
+            <li class="Xorg" style="width:7.7082%;"><a href="/exa/i965/emulating_speedups/NoAccel/Xorg.oprofile">Xorg</a><span>8%</span></li>
+            <li class="libc-2_5" style="width:5.1990%;"><a href="/exa/i965/emulating_speedups/NoAccel/libc-2.5.oprofile">libc-2.5</a><span>5%</span></li>
+            <li class="oprofiled" style="width:2.8155%;">oprofiled<span>3%</span></li>
+            <li class="libfb" style="width:2.0193%;"><a href="/exa/i965/emulating_speedups/NoAccel/libfb.oprofile">libfb</a><span>2%</span></li>
+            <li class="other" style="width:5.3571%;">other<span>5%</span></li>
+        </ul>
+    </dd>
+    <dt><a href="/exa/i965/emulating_speedups/EXA-without-accel/system.oprofile">EXA-without-accel</a> (<a href="/exa/i965/emulating_speedups/EXA-without-accel/timing">15.7 ms.</a>) <a href="/exa/i965/emulating_speedups/EXA-without-accel/system.symbols">symbols profile</a></dt>
+    <dd style="width:91.2791%;">
+        <ul>
+            <li class="libpixman" style="width:42.0541%;"><a href="/exa/i965/emulating_speedups/EXA-without-accel/libpixman.oprofile">libpixman</a><span>42%</span></li>
+            <li class="libxul" style="width:15.7266%;">libxul<span>16%</span></li>
+            <li class="vmlinux" style="width:12.6525%;"><a href="/exa/i965/emulating_speedups/EXA-without-accel/vmlinux.oprofile">vmlinux</a><span>13%</span></li>
+            <li class="Xorg" style="width:9.0172%;"><a href="/exa/i965/emulating_speedups/EXA-without-accel/Xorg.oprofile">Xorg</a><span>9%</span></li>
+            <li class="oprofiled" style="width:5.0459%;">oprofiled<span>5%</span></li>
+            <li class="libc-2_5" style="width:4.9173%;"><a href="/exa/i965/emulating_speedups/EXA-without-accel/libc-2.5.oprofile">libc-2.5</a><span>5%</span></li>
+            <li class="libexa" style="width:3.1229%;"><a href="/exa/i965/emulating_speedups/EXA-without-accel/libexa.oprofile">libexa</a><span>3%</span></li>
+            <li class="libfb" style="width:2.1381%;"><a href="/exa/i965/emulating_speedups/EXA-without-accel/libfb.oprofile">libfb</a><span>2%</span></li>
+            <li class="other" style="width:5.3254%;">other<span>5%</span></li>
+        </ul>
+    </dd>
+    <dt><a href="/exa/i965/emulating_speedups/EXA-emulate-speedups/system.oprofile">EXA-emulate-speedups</a> (<a href="/exa/i965/emulating_speedups/EXA-emulate-speedups/timing">17.2 ms.</a>) <a href="/exa/i965/emulating_speedups/EXA-emulate-speedups/system.symbols">symbols profile</a></dt>
+    <dd style="width:100%;">
+        <ul>
+            <li class="libxul" style="width:19.6803%;">libxul<span>20%</span></li>
+            <li class="libpixman" style="width:16.8573%;"><a href="/exa/i965/emulating_speedups/EXA-emulate-speedups/libpixman.oprofile">libpixman</a><span>17%</span></li>
+            <li class="vmlinux" style="width:13.1026%;"><a href="/exa/i965/emulating_speedups/EXA-emulate-speedups/vmlinux.oprofile">vmlinux</a><span>13%</span></li>
+            <li class="libexa" style="width:11.9970%;"><a href="/exa/i965/emulating_speedups/EXA-emulate-speedups/libexa.oprofile">libexa</a><span>12%</span></li>
+            <li class="libc-2_5" style="width:11.4997%;"><a href="/exa/i965/emulating_speedups/EXA-emulate-speedups/libc-2.5.oprofile">libc-2.5</a><span>11%</span></li>
+            <li class="intel_drv" style="width:9.4984%;"><a href="/exa/i965/emulating_speedups/EXA-emulate-speedups/intel_drv.oprofile">intel_drv</a><span>9%</span></li>
+            <li class="Xorg" style="width:7.3488%;"><a href="/exa/i965/emulating_speedups/EXA-emulate-speedups/Xorg.oprofile">Xorg</a><span>7%</span></li>
+            <li class="oprofiled" style="width:3.3068%;">oprofiled<span>3%</span></li>
+            <li class="other" style="width:6.7091%;"><a href="/exa/i965/emulating_speedups/EXA-emulate-speedups/other.oprofile">other</a><span>7%</span></li>
+        </ul>
+    </dd>
+</dl>
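
As a sanity check on the numbers in the chart, the following Python sketch recomputes the slowdown relative to NoAccel and the bar widths from the three timings. The helper names are made up for illustration, and the assumption that each bar's width is its timing as a fraction of the slowest run is inferred from the chart markup rather than stated anywhere in the post.

```python
# Timings in milliseconds, copied from the chart.
TIMINGS_MS = {
    "NoAccel": 14.4,
    "EXA-without-accel": 15.7,
    "EXA-emulate-speedups": 17.2,
}


def slowdown_percent(ms, baseline_ms):
    """Percentage by which ms exceeds baseline_ms."""
    return (ms / baseline_ms - 1.0) * 100.0


def bar_width_percent(ms, slowest_ms):
    """Bar width as a percentage of the slowest run (assumed scaling)."""
    return ms / slowest_ms * 100.0


if __name__ == "__main__":
    baseline = TIMINGS_MS["NoAccel"]
    slowest = max(TIMINGS_MS.values())
    for name, ms in TIMINGS_MS.items():
        print(
            f"{name:22s} {ms:5.1f} ms  "
            f"{slowdown_percent(ms, baseline):+5.1f}% vs NoAccel  "
            f"bar {bar_width_percent(ms, slowest):.4f}%"
        )
```

Under that assumption the computed widths match the `width` values in the markup (83.7209% and 91.2791%), and the two EXA runs come out about 9% and 19% slower than NoAccel.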
+
+Note that in this experiment, the rendered results are not at all
+correct, (for example, basically no text appears).
+
+And, still, things aren't faster than NoAccel, (17.2 ms versus 14.4
+ms), but there's definitely lots of room for improvement. For example,
+the pixman profile shows compositing, (fbCombineInU and fbFetch_a1),
+that should be moved to the hardware, (particularly when the hardware
+is infinitely fast like it is in my emulation here!).
+
+After that, pixman's rasterization would be at the top of the pixman
+profile. I've been wanting rasterization to show up at the top of a
+profile for a long time so I could have an excuse to implement some
+ideas I have for much faster software rasterization, (and to explore
+using the hardware for rasterization as well). And, for some
+applications doing much more than just rendering text, rasterization
+might already be a lot closer to the top.
+
+So that shows what software operations aren't hooked up to be
+accelerated yet. What else is here? As I pointed out before, (and as is
+much easier to see in this chart than the one from earlier this week),
+libxul is mysteriously getting slower once the i965 gets involved, but
+libxul really shouldn't care. So that will be something to investigate
+by actually building Mozilla with debug symbols.
+
+There's also significantly more overhead in libexa in this chart
+compared to those above. So there's some room for improvement there,
+(ExaOffscreenMarkUsed is at the top of the profile, and as I've
+mentioned before it looks ripe for improvement).
+
+Finally, the i965 driver is still burning a lot of time in its wait
+function here. I'm not sure what the cause of that is this time since
+I've eliminated all calls to the wait function from
+`i965_prepare_composite` and `i965_composite` in this experiment.
+
+Oh, and the big libc time in this chart is from gettimeofday, (which I
+showed how to eliminate earlier). That patch hasn't been accepted
+upstream yet, and it wasn't included in this run.
+
+As always, I've tried to make as much data available as possible, (you
+can even change the .oprofile extensions on the links to .callgraph
+for more data---but I often can't make sense of oprofile callgraph
+reports myself). So I'd be glad for anybody to dig in deeper and
+provide any useful feedback.