[[meta title="Emulating the future of the i965 driver"]]

[[tag exa performance xorg]]

Earlier this week I [[isolated_some_bugs|synchronous_composite]] that
are currently causing a 4x slowdown with EXA and the i965 driver
compared to using the NoAccel option of the X server.

Some people have wondered if the discouraging results I have found so
far suggest that we should give up on hardware acceleration or that
EXA as an acceleration architecture is doomed. I think the answer is
no on both points. I think we're just seeing typical behavior of new
code that needs some optimization.

# EXA without acceleration

The first experiment is a very simple one to confirm that the 4x
slowdown isn't an unavoidable aspect of having EXA enabled. In this
experiment I first
[[disabled_the_accelerated-compositing_functions|Disable-acceleration-from-i965-EXA-hooks.patch]]
in the i965 driver, then I
[[disabled_EXA_migration|Disable-all-EXA-migration.patch]]. The net
result of this experiment is that the X server will still go through
the EXA paths, but will basically use all the same software fallbacks
for compositing that are used in the case of NoAccel. The performance
with these patches can be compared to the NoAccel case here. (Again,
click through to my blog if you're just getting a list of numbers, not
a colorful bar chart.)

<dl class="chart barchart">
    <dt><a href="/exa/i965/emulating_speedups/NoAccel/system.oprofile">NoAccel</a> (<a href="/exa/i965/emulating_speedups/NoAccel/timing">14.4 ms.</a>) <a href="/exa/i965/emulating_speedups/NoAccel/system.symbols">symbols profile</a></dt>
    <dd style="width:83.7209%;">
        <ul>
            <li class="libpixman" style="width:45.4062%;"><a href="/exa/i965/emulating_speedups/NoAccel/libpixman.oprofile">libpixman</a><span>45%</span></li>
            <li class="libxul" style="width:18.1849%;">libxul<span>18%</span></li>
            <li class="vmlinux" style="width:13.3098%;"><a href="/exa/i965/emulating_speedups/NoAccel/vmlinux.oprofile">vmlinux</a><span>13%</span></li>
            <li class="Xorg" style="width:7.7082%;"><a href="/exa/i965/emulating_speedups/NoAccel/Xorg.oprofile">Xorg</a><span>8%</span></li>
            <li class="libc-2_5" style="width:5.1990%;"><a href="/exa/i965/emulating_speedups/NoAccel/libc-2.5.oprofile">libc-2.5</a><span>5%</span></li>
            <li class="oprofiled" style="width:2.8155%;">oprofiled<span>3%</span></li>
            <li class="libfb" style="width:2.0193%;"><a href="/exa/i965/emulating_speedups/NoAccel/libfb.oprofile">libfb</a><span>2%</span></li>
            <li class="other" style="width:5.3571%;">other<span>5%</span></li>
        </ul>
    </dd>
    <dt><a href="/exa/i965/emulating_speedups/EXA-without-accel/system.oprofile">EXA-without-accel</a> (<a href="/exa/i965/emulating_speedups/EXA-without-accel/timing">15.7 ms.</a>) <a href="/exa/i965/emulating_speedups/EXA-without-accel/system.symbols">symbols profile</a></dt>
    <dd style="width:91.2791%;">
        <ul>
            <li class="libpixman" style="width:42.0541%;"><a href="/exa/i965/emulating_speedups/EXA-without-accel/libpixman.oprofile">libpixman</a><span>42%</span></li>
            <li class="libxul" style="width:15.7266%;">libxul<span>16%</span></li>
            <li class="vmlinux" style="width:12.6525%;"><a href="/exa/i965/emulating_speedups/EXA-without-accel/vmlinux.oprofile">vmlinux</a><span>13%</span></li>
            <li class="Xorg" style="width:9.0172%;"><a href="/exa/i965/emulating_speedups/EXA-without-accel/Xorg.oprofile">Xorg</a><span>9%</span></li>
            <li class="oprofiled" style="width:5.0459%;">oprofiled<span>5%</span></li>
            <li class="libc-2_5" style="width:4.9173%;"><a href="/exa/i965/emulating_speedups/EXA-without-accel/libc-2.5.oprofile">libc-2.5</a><span>5%</span></li>
            <li class="libexa" style="width:3.1229%;"><a href="/exa/i965/emulating_speedups/EXA-without-accel/libexa.oprofile">libexa</a><span>3%</span></li>
            <li class="libfb" style="width:2.1381%;"><a href="/exa/i965/emulating_speedups/EXA-without-accel/libfb.oprofile">libfb</a><span>2%</span></li>
            <li class="other" style="width:5.3254%;">other<span>5%</span></li>
        </ul>
    </dd>
</dl>

It's worth pointing out that with this change, everything still
renders correctly. Basically, we're using the same rendering code as
in the NoAccel case, but we're going through EXA to get there. We can
see that EXA adds some overhead here, (about a 9% slowdown, 15.7 ms
versus 14.4 ms), but nothing like the 4x slowdown seen before. There's
certainly no indication here that EXA is doomed to be horribly slow.
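As a quick check on that overhead figure, here is a minimal sketch in Python that derives the slowdown from the two timings shown in the chart. The function and variable names are made up for illustration; only the two millisecond values come from the measurements above.

```python
# Per-iteration timings in milliseconds, copied from the chart above.
timings_ms = {"NoAccel": 14.4, "EXA-without-accel": 15.7}


def slowdown(baseline_ms, measured_ms):
    """Return how many times slower `measured_ms` is than `baseline_ms`."""
    return measured_ms / baseline_ms


# Ratio of the EXA software-fallback run to the NoAccel baseline.
ratio = slowdown(timings_ms["NoAccel"], timings_ms["EXA-without-accel"])

# Express the same ratio as extra time relative to the baseline.
overhead_percent = (ratio - 1.0) * 100.0

print(f"EXA-without-accel: {ratio:.2f}x NoAccel, {overhead_percent:.1f}% overhead")
```

The same ratio applied to a 4x slowdown would be 300% extra time, which is why the two cases are not remotely comparable.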

Now, this experiment does miss overhead in EXA having to do with
managing video memory, (since I disabled all migration so everything
lives in system memory). We'll be able to see this additional overhead
below.

# Emulating future i965 speedups

The above experiment is still pretty boring: it's still just
measuring software-fallback performance. A much more interesting
experiment allows us to start exploring where we will be able to get
with some un-broken hardware acceleration.

My [[previous_post|synchronous_composite]] highlighted two significant
problems preventing the current code from having good performance:

 * Time lost migrating pixmaps with memcpy

 * Time wasted while the driver busy-waited between operations

Here's a run-down of what I could find about progress on solving these
two problems:

## Excessive migration

When I looked closer at what was causing the pixmap migration, I found
that much of it was due to glyph images being pinned to system memory,
(recall that the benchmark I'm using is Mozilla Firefox on a page
consisting of mostly text). I asked Keith Packard why these glyph
images are being pinned to system memory, and he explained that what
was preventing the glyphs from migrating is that the X server has not
been using straight Pixmaps for glyphs, but something slightly
different.

Keith is already mostly finished with a change to make the server use
Pixmaps for glyphs. Apparently there is one slight snag in that
Pixmaps are a per-screen resource while glyphs are not. For now, that
could be worked around by using one Pixmap per screen, (until Pixmaps
and other resources can be made global within the server).

So, hopefully that glyph pinning problem will be fixed. Meanwhile,
it's fairly silly that there's a bunch of memcpy operations to migrate
things from "system" to "video" memory on the i965 at all. This card
doesn't have dedicated video memory, but just uses system memory,
(all that's needed is for some entries to be set in the GART, and for
some alignment constraints to be satisfied). So it should be possible
to eliminate all of this memcpy time.

I'm told that the long-awaited memory management work, (TTM), is what
will solve this. I don't know what the status of that work is, but
hopefully it will be ready soon. Does anyone have some pointers for
more information on TTM status?

## Synchronous compositing

I characterized this problem fairly well in my previous post. Eric
Anholt suggested a first quick step toward improving the situation
would be to use an array of state buffers. With N buffers we could
make the waiting happen only 1/N as frequently as it's currently
happening. So that's something that even someone like me without any
detailed documentation on the i965 could do.

And with a little more smarts, (from someone with more information),
we could presumably reclaim buffers that the hardware was done with
without having to do any waiting at all.

So it shouldn't be too long before the waiting can be eliminated or
reduced to an arbitrarily small amount of time.
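To make the 1/N argument concrete, here is a toy model in Python. It is not driver code: `count_waits` and its parameters are hypothetical names, and the model assumes that each compositing operation claims one state buffer and that a single wait drains the hardware so that every buffer becomes reusable at once.

```python
def count_waits(num_ops, num_buffers):
    """Count how often a toy driver must wait for the hardware.

    Assumption: each compositing operation takes one state buffer, and a
    single wait drains the hardware so that every buffer is free again.
    """
    if num_buffers < 1:
        raise ValueError("need at least one state buffer")
    waits = 0
    in_flight = 0
    for _ in range(num_ops):
        if in_flight == num_buffers:
            waits += 1      # all buffers are in flight: wait for the hardware
            in_flight = 0   # after the wait, every buffer can be reused
        in_flight += 1      # this operation claims the next free buffer
    return waits


if __name__ == "__main__":
    for n in (1, 4, 16):
        print(f"{n:2d} buffers: {count_waits(1000, n)} waits per 1000 operations")
```

With one buffer the model waits before almost every operation; with N buffers it waits roughly once per N operations. Reclaiming buffers as the hardware finishes with them would correspond to never reaching the `in_flight == num_buffers` branch at all.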

## Results

Given these identified solutions for the current known problems, (with
much of the work already in progress), the next question I want to ask
is: what will things look like when these problems are solved?

I implemented quick patches to both
[[EXA|Emulate-infinitely-fast-migration-disable-memcpy.patch]] and the
[[i965_driver|Emulate-infinitely-fast-i965-compositing-make-check.patch]]
to emulate the time being spent on migration and compositing going to
zero. That's not totally realistic, but is at least a best-case look
at where we'll be with these problems fixed. And here's what it looks
like (with the previous results repeated for comparison):

<dl class="chart barchart">
    <dt><a href="/exa/i965/emulating_speedups/NoAccel/system.oprofile">NoAccel</a> (<a href="/exa/i965/emulating_speedups/NoAccel/timing">14.4 ms.</a>) <a href="/exa/i965/emulating_speedups/NoAccel/system.symbols">symbols profile</a></dt>
    <dd style="width:83.7209%;">
        <ul>
            <li class="libpixman" style="width:45.4062%;"><a href="/exa/i965/emulating_speedups/NoAccel/libpixman.oprofile">libpixman</a><span>45%</span></li>
            <li class="libxul" style="width:18.1849%;">libxul<span>18%</span></li>
            <li class="vmlinux" style="width:13.3098%;"><a href="/exa/i965/emulating_speedups/NoAccel/vmlinux.oprofile">vmlinux</a><span>13%</span></li>
            <li class="Xorg" style="width:7.7082%;"><a href="/exa/i965/emulating_speedups/NoAccel/Xorg.oprofile">Xorg</a><span>8%</span></li>
            <li class="libc-2_5" style="width:5.1990%;"><a href="/exa/i965/emulating_speedups/NoAccel/libc-2.5.oprofile">libc-2.5</a><span>5%</span></li>
            <li class="oprofiled" style="width:2.8155%;">oprofiled<span>3%</span></li>
            <li class="libfb" style="width:2.0193%;"><a href="/exa/i965/emulating_speedups/NoAccel/libfb.oprofile">libfb</a><span>2%</span></li>
            <li class="other" style="width:5.3571%;">other<span>5%</span></li>
        </ul>
    </dd>
    <dt><a href="/exa/i965/emulating_speedups/EXA-without-accel/system.oprofile">EXA-without-accel</a> (<a href="/exa/i965/emulating_speedups/EXA-without-accel/timing">15.7 ms.</a>) <a href="/exa/i965/emulating_speedups/EXA-without-accel/system.symbols">symbols profile</a></dt>
    <dd style="width:91.2791%;">
        <ul>
            <li class="libpixman" style="width:42.0541%;"><a href="/exa/i965/emulating_speedups/EXA-without-accel/libpixman.oprofile">libpixman</a><span>42%</span></li>
            <li class="libxul" style="width:15.7266%;">libxul<span>16%</span></li>
            <li class="vmlinux" style="width:12.6525%;"><a href="/exa/i965/emulating_speedups/EXA-without-accel/vmlinux.oprofile">vmlinux</a><span>13%</span></li>
            <li class="Xorg" style="width:9.0172%;"><a href="/exa/i965/emulating_speedups/EXA-without-accel/Xorg.oprofile">Xorg</a><span>9%</span></li>
            <li class="oprofiled" style="width:5.0459%;">oprofiled<span>5%</span></li>
            <li class="libc-2_5" style="width:4.9173%;"><a href="/exa/i965/emulating_speedups/EXA-without-accel/libc-2.5.oprofile">libc-2.5</a><span>5%</span></li>
            <li class="libexa" style="width:3.1229%;"><a href="/exa/i965/emulating_speedups/EXA-without-accel/libexa.oprofile">libexa</a><span>3%</span></li>
            <li class="libfb" style="width:2.1381%;"><a href="/exa/i965/emulating_speedups/EXA-without-accel/libfb.oprofile">libfb</a><span>2%</span></li>
            <li class="other" style="width:5.3254%;">other<span>5%</span></li>
        </ul>
    </dd>
    <dt><a href="/exa/i965/emulating_speedups/EXA-emulate-speedups/system.oprofile">EXA-emulate-speedups</a> (<a href="/exa/i965/emulating_speedups/EXA-emulate-speedups/timing">17.2 ms.</a>) <a href="/exa/i965/emulating_speedups/EXA-emulate-speedups/system.symbols">symbols profile</a></dt>
    <dd style="width:100%;">
        <ul>
            <li class="libxul" style="width:19.6803%;">libxul<span>20%</span></li>
            <li class="libpixman" style="width:16.8573%;"><a href="/exa/i965/emulating_speedups/EXA-emulate-speedups/libpixman.oprofile">libpixman</a><span>17%</span></li>
            <li class="vmlinux" style="width:13.1026%;"><a href="/exa/i965/emulating_speedups/EXA-emulate-speedups/vmlinux.oprofile">vmlinux</a><span>13%</span></li>
            <li class="libexa" style="width:11.9970%;"><a href="/exa/i965/emulating_speedups/EXA-emulate-speedups/libexa.oprofile">libexa</a><span>12%</span></li>
            <li class="libc-2_5" style="width:11.4997%;"><a href="/exa/i965/emulating_speedups/EXA-emulate-speedups/libc-2.5.oprofile">libc-2.5</a><span>11%</span></li>
            <li class="intel_drv" style="width:9.4984%;"><a href="/exa/i965/emulating_speedups/EXA-emulate-speedups/intel_drv.oprofile">intel_drv</a><span>9%</span></li>
            <li class="Xorg" style="width:7.3488%;"><a href="/exa/i965/emulating_speedups/EXA-emulate-speedups/Xorg.oprofile">Xorg</a><span>7%</span></li>
            <li class="oprofiled" style="width:3.3068%;">oprofiled<span>3%</span></li>
            <li class="other" style="width:6.7091%;"><a href="/exa/i965/emulating_speedups/EXA-emulate-speedups/other.oprofile">other</a><span>7%</span></li>
        </ul>
    </dd>
</dl>

Note that in this experiment the rendered results are not at all
correct, (for example, basically no text appears).

Things still aren't faster than NoAccel, (17.2 ms versus 14.4 ms), but
there's definitely still lots of room for improvement. For example,
the pixman profile shows compositing, (`fbCombineInU` and
`fbFetch_a1`), that should be moved to the hardware, (particularly
when the hardware is infinitely fast like it is in my emulation
here!).

After that, pixman's rasterization would be at the top of the pixman
profile. I've been wanting rasterization to show up at the top of a
profile for a long time so I could have an excuse to implement some
ideas I have for much faster software rasterization, (and to explore
using the hardware for rasterization as well). And, for some
applications doing much more than just rendering text, rasterization
might already be a lot closer to the top.

So that shows what software operations aren't hooked up to be
accelerated yet. What else is here? As I pointed out before, (and it's
much easier to see in this chart than in the one from earlier this
week), libxul is mysteriously getting slower once the i965 gets
involved, but libxul really shouldn't care. So that will be something
to investigate by actually building Mozilla with debug symbols.

There's also significantly more overhead in libexa in this chart
compared to those above. So there's some room for improvement there,
(`ExaOffscreenMarkUsed` is at the top of the profile, and as I've
mentioned before it looks ripe for improvement).

Finally, the i965 driver is still burning a lot of time in its wait
function here. I'm not sure what the cause of that is this time since
I've eliminated all calls to the wait function from
`i965_prepare_composite` and `i965_composite` in this experiment.

Oh, and the big libc time in this chart is from `gettimeofday`, (which
I showed how to eliminate earlier). That patch hasn't been accepted
upstream yet, and it wasn't included in this run.
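Since each bar in the chart is scaled to a different total time, the percentages are easier to compare once converted to milliseconds. Here is a rough sketch in Python that does so for libxul, libexa and libc. The totals and shares are copied from the charts; the helper name is made up, and multiplying a profile share by the per-iteration time is only an approximation. A library that is not listed for a run, (libexa under NoAccel), is treated as zero, which is an assumption rather than a measurement.

```python
# Total time per iteration (ms) and selected profile shares (%), from the charts.
runs = {
    "NoAccel": {
        "total_ms": 14.4, "libxul": 18.1849, "libc-2.5": 5.1990,
    },
    "EXA-without-accel": {
        "total_ms": 15.7, "libxul": 15.7266, "libexa": 3.1229, "libc-2.5": 4.9173,
    },
    "EXA-emulate-speedups": {
        "total_ms": 17.2, "libxul": 19.6803, "libexa": 11.9970, "libc-2.5": 11.4997,
    },
}


def absolute_ms(run, library):
    """Convert a profile share into milliseconds; unlisted entries count as zero."""
    data = runs[run]
    return data["total_ms"] * data.get(library, 0.0) / 100.0


for library in ("libxul", "libexa", "libc-2.5"):
    row = ", ".join(f"{run}: {absolute_ms(run, library):.2f} ms" for run in runs)
    print(f"{library:9s} {row}")
```

Read this way, the libxul, libexa and libc entries all grow in absolute terms in the emulated run, not just as a share of a different total.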

As always, I've tried to make as much data available as possible, (you
can even change the .oprofile extensions on the links to .callgraph
for more data, though I often can't make sense of oprofile callgraph
reports myself). So I'd be glad for anybody to dig in deeper and
provide any useful feedback.