[[meta title="Emulating the future of the i965 driver"]]

[[tag exa performance xorg]]

Earlier this week I [[isolated_some_bugs|synchronous_composite]] that
are currently causing a 4x slowdown with EXA and the i965 driver
compared to using the NoAccel option of the X server.

Some people have wondered if the discouraging results I have found so
far suggest that we should give up on hardware acceleration or that
EXA as an acceleration architecture is doomed. I think the answer is
no on both points. I think we're just seeing typical behavior of new
code that needs some optimization.

# EXA without acceleration

The first experiment is a very simple one to ensure that the 4x
slowdown isn't an unavoidable aspect of having EXA enabled. In this
experiment I first
[[disabled_the_accelerated-compositing_functions|Disable-acceleration-from-i965-EXA-hooks.patch]]
in the i965 driver, then I
[[disabled_EXA_migration|Disable-all-EXA-migration.patch]]. The net
result of this experiment is that the X server will still go through
the EXA paths, but will basically use all the same software-fallbacks
for compositing that are used in the case of NoAccel. The performance
with this patch can be compared to the NoAccel case here.

<dl class="chart barchart">
    <dt><a href="/exa/i965/emulating_speedups/NoAccel/system.oprofile">NoAccel</a> (<a href="/exa/i965/emulating_speedups/NoAccel/timing">14.4 ms.</a>) <a href="/exa/i965/emulating_speedups/NoAccel/system.symbols">symbols profile</a></dt>
    <dd style="width:83.7209%;">
        <ul>
            <li class="libpixman" style="width:45.4062%;"><a href="/exa/i965/emulating_speedups/NoAccel/libpixman.oprofile">libpixman</a><span>45%</span></li>
            <li class="libxul" style="width:18.1849%;">libxul<span>18%</span></li>
            <li class="vmlinux" style="width:13.3098%;"><a href="/exa/i965/emulating_speedups/NoAccel/vmlinux.oprofile">vmlinux</a><span>13%</span></li>
            <li class="Xorg" style="width:7.7082%;"><a href="/exa/i965/emulating_speedups/NoAccel/Xorg.oprofile">Xorg</a><span>8%</span></li>
            <li class="libc-2_5" style="width:5.1990%;"><a href="/exa/i965/emulating_speedups/NoAccel/libc-2.5.oprofile">libc-2.5</a><span>5%</span></li>
            <li class="oprofiled" style="width:2.8155%;">oprofiled<span>3%</span></li>
            <li class="libfb" style="width:2.0193%;"><a href="/exa/i965/emulating_speedups/NoAccel/libfb.oprofile">libfb</a><span>2%</span></li>
            <li class="other" style="width:5.3571%;">other<span>5%</span></li>
        </ul>
    </dd>
    <dt><a href="/exa/i965/emulating_speedups/EXA-without-accel/system.oprofile">EXA-without-accel</a> (<a href="/exa/i965/emulating_speedups/EXA-without-accel/timing">15.7 ms.</a>) <a href="/exa/i965/emulating_speedups/EXA-without-accel/system.symbols">symbols profile</a></dt>
    <dd style="width:91.2791%;">
        <ul>
            <li class="libpixman" style="width:42.0541%;"><a href="/exa/i965/emulating_speedups/EXA-without-accel/libpixman.oprofile">libpixman</a><span>42%</span></li>
            <li class="libxul" style="width:15.7266%;">libxul<span>16%</span></li>
            <li class="vmlinux" style="width:12.6525%;"><a href="/exa/i965/emulating_speedups/EXA-without-accel/vmlinux.oprofile">vmlinux</a><span>13%</span></li>
            <li class="Xorg" style="width:9.0172%;"><a href="/exa/i965/emulating_speedups/EXA-without-accel/Xorg.oprofile">Xorg</a><span>9%</span></li>
            <li class="oprofiled" style="width:5.0459%;">oprofiled<span>5%</span></li>
            <li class="libc-2_5" style="width:4.9173%;"><a href="/exa/i965/emulating_speedups/EXA-without-accel/libc-2.5.oprofile">libc-2.5</a><span>5%</span></li>
            <li class="libexa" style="width:3.1229%;"><a href="/exa/i965/emulating_speedups/EXA-without-accel/libexa.oprofile">libexa</a><span>3%</span></li>
            <li class="libfb" style="width:2.1381%;"><a href="/exa/i965/emulating_speedups/EXA-without-accel/libfb.oprofile">libfb</a><span>2%</span></li>
            <li class="other" style="width:5.3254%;">other<span>5%</span></li>
        </ul>
    </dd>
</dl>

It's worth pointing out that with this change, everything still
renders correctly. Basically, we're using the same rendering code as
in the NoAccel case, but we're using EXA to get there. And we can see
that there is some overhead to EXA here, (roughly a 10% slowdown), but
nothing like the 4x slowdown seen before. There's certainly no
indication here that EXA is doomed to be horribly slow, for
example.
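For anyone who wants to check the overhead figure against the timings in the chart, here is a minimal sketch of the arithmetic. The function name `slowdown_percent` is just something I made up for illustration; the two timings are the ones shown in the chart above.

```python
def slowdown_percent(baseline_ms, candidate_ms):
    """Return how much slower candidate is than baseline, in percent."""
    return (candidate_ms / baseline_ms - 1.0) * 100.0


noaccel_ms = 14.4            # NoAccel timing from the chart
exa_without_accel_ms = 15.7  # EXA-without-accel timing from the chart

overhead = slowdown_percent(noaccel_ms, exa_without_accel_ms)
print(f"EXA overhead: {overhead:.1f}%")
```

This comes out a little over 9%, which is consistent with the rough 10% figure quoted above.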

Now, this experiment does miss overhead in EXA having to do with
managing video memory, (since I disabled all migration so everything
lives in system memory). We'll be able to see this additional overhead
below.

# Emulating future i965 speedups

The above experiment is still pretty boring, since it's still just
measuring software-fallback performance. A much more interesting
experiment allows us to start exploring where we will be able to get
with some un-broken hardware acceleration.

My [[previous_post|synchronous_composite]] highlighted two significant
problems preventing the current code from having good performance:

 * Time lost migrating pixmaps with memcpy

 * Time wasted while the driver busy-waited between operations

Here's a run-down of what I could find about progress on solving these
two problems:

## Excessive migration

When I looked closer at what was causing the pixmap migration, I found
that much of it was due to glyph images being pinned to system memory,
(recall that the benchmark I'm using is Mozilla Firefox on a page
consisting of mostly text). I asked Keith Packard about why these
glyph images are being pinned to system memory, and he explained that
what was preventing the glyphs from migrating is that the X server has
not been using straight Pixmaps for glyphs, but something slightly
different.

Keith is already mostly finished with a change to make the server use
Pixmaps for glyphs. Apparently there is one slight snag in that
Pixmaps are a per-screen resource while glyphs are not. For now, that
could be worked around by using one Pixmap per screen, (until Pixmaps
and other resources can be made global within the server).

So, hopefully that glyph pinning problem will be fixed. Meanwhile,
it's fairly silly that there's a bunch of memcpy operations to migrate
things from "system" to "video" memory on the i965 at all. This card
doesn't have dedicated video memory, but just uses system memory,
(all that's needed is for some entries to be set in the GART
table, and for some alignment constraints to be satisfied). So it
should be possible to eliminate all of this memcpy time.

I'm told that the long-awaited memory management work, (TTM), is what
will solve this. I don't know what the status of that work is, but
hopefully it will be ready soon. Does anyone have some pointers for
more information on TTM status?

## Synchronous compositing

I characterized this problem fairly well in my previous post. Eric
Anholt suggested a first quick step toward improving the situation
would be to use an array of state buffers. With N buffers we could
make the waiting happen only 1/N as frequently as it's currently
happening. So that's something that even someone like me without any
detailed documentation on the i965 could do.
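To make the 1/N claim concrete, here is a toy model of the idea, not driver code. It assumes, pessimistically, that the only way to know a state buffer is free again is to wait for the hardware to go idle, so the driver waits once each time it runs out of buffers. The name `count_waits` is hypothetical.

```python
def count_waits(num_operations, num_buffers):
    """Count forced waits when operations cycle through an array of buffers.

    Toy model (an assumption, not the real driver): every operation needs
    its own state buffer, and once all buffers have been handed out the
    driver waits for the hardware to go idle, which frees all of them.
    """
    waits = 0
    buffers_in_use = 0
    for _ in range(num_operations):
        if buffers_in_use == num_buffers:
            waits += 1          # out of buffers: wait for the hardware
            buffers_in_use = 0  # everything is free again after the wait
        buffers_in_use += 1
    return waits


for n in (1, 4, 16):
    print(f"{n:2d} buffers: {count_waits(1000, n)} waits per 1000 operations")
```

With a single buffer this model waits before every operation but the first; with N buffers it waits only once per N operations.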

And with a little more smarts, (from someone with more information),
we could presumably reclaim buffers that the hardware was done with
without having to do any waiting at all.
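One way such reclaiming could work is sketched below. This is purely an illustration under an assumption I can't verify without documentation: that the hardware exposes some monotonically increasing "last completed operation" counter. The class `StateBufferPool` and its methods are hypothetical names, not anything in the i965 driver.

```python
from collections import deque


class StateBufferPool:
    """Hypothetical pool that reuses buffers once the hardware retires them."""

    def __init__(self, num_buffers):
        self.free = deque(range(num_buffers))
        self.in_flight = deque()  # (sequence_number, buffer_id), oldest first
        self.next_sequence = 1

    def reclaim(self, completed_sequence):
        """Return to the free list every buffer whose operation has completed."""
        while self.in_flight and self.in_flight[0][0] <= completed_sequence:
            _, buffer_id = self.in_flight.popleft()
            self.free.append(buffer_id)

    def acquire(self, completed_sequence):
        """Return (buffer_id, sequence), or None if the caller would have to wait."""
        self.reclaim(completed_sequence)
        if not self.free:
            return None
        buffer_id = self.free.popleft()
        sequence = self.next_sequence
        self.next_sequence += 1
        self.in_flight.append((sequence, buffer_id))
        return buffer_id, sequence


pool = StateBufferPool(2)
first = pool.acquire(completed_sequence=0)    # takes buffer 0
second = pool.acquire(completed_sequence=0)   # takes buffer 1
stalled = pool.acquire(completed_sequence=0)  # nothing retired yet
resumed = pool.acquire(completed_sequence=1)  # operation 1 done, buffer 0 reused
print(first, second, stalled, resumed)
```

As long as the hardware keeps retiring operations faster than new ones are queued, `acquire` never returns `None` in this model and no waiting is needed.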

So it shouldn't be too long before the waiting can be eliminated or
reduced to an arbitrarily small amount of time.

## Results

Given these identified solutions for the current known problems, (and
much of the work in progress already), the next question I want to ask
is what will things look like when these are solved?

I implemented quick patches to both
[[EXA|Emulate-infinitely-fast-migration-disable-memcpy.patch]] and the
[[i965_driver|Emulate-infinitely-fast-i965-compositing-make-check.patch]]
to emulate the time being spent on migration and compositing going to
zero. That's not totally realistic, but is at least a best-case look
at where we'll be with these problems fixed. And here's what it looks
like (with the previous results repeated for comparison):

<dl class="chart barchart">
    <dt><a href="/exa/i965/emulating_speedups/NoAccel/system.oprofile">NoAccel</a> (<a href="/exa/i965/emulating_speedups/NoAccel/timing">14.4 ms.</a>) <a href="/exa/i965/emulating_speedups/NoAccel/system.symbols">symbols profile</a></dt>
    <dd style="width:83.7209%;">
        <ul>
            <li class="libpixman" style="width:45.4062%;"><a href="/exa/i965/emulating_speedups/NoAccel/libpixman.oprofile">libpixman</a><span>45%</span></li>
            <li class="libxul" style="width:18.1849%;">libxul<span>18%</span></li>
            <li class="vmlinux" style="width:13.3098%;"><a href="/exa/i965/emulating_speedups/NoAccel/vmlinux.oprofile">vmlinux</a><span>13%</span></li>
            <li class="Xorg" style="width:7.7082%;"><a href="/exa/i965/emulating_speedups/NoAccel/Xorg.oprofile">Xorg</a><span>8%</span></li>
            <li class="libc-2_5" style="width:5.1990%;"><a href="/exa/i965/emulating_speedups/NoAccel/libc-2.5.oprofile">libc-2.5</a><span>5%</span></li>
            <li class="oprofiled" style="width:2.8155%;">oprofiled<span>3%</span></li>
            <li class="libfb" style="width:2.0193%;"><a href="/exa/i965/emulating_speedups/NoAccel/libfb.oprofile">libfb</a><span>2%</span></li>
            <li class="other" style="width:5.3571%;">other<span>5%</span></li>
        </ul>
    </dd>
    <dt><a href="/exa/i965/emulating_speedups/EXA-without-accel/system.oprofile">EXA-without-accel</a> (<a href="/exa/i965/emulating_speedups/EXA-without-accel/timing">15.7 ms.</a>) <a href="/exa/i965/emulating_speedups/EXA-without-accel/system.symbols">symbols profile</a></dt>
    <dd style="width:91.2791%;">
        <ul>
            <li class="libpixman" style="width:42.0541%;"><a href="/exa/i965/emulating_speedups/EXA-without-accel/libpixman.oprofile">libpixman</a><span>42%</span></li>
            <li class="libxul" style="width:15.7266%;">libxul<span>16%</span></li>
            <li class="vmlinux" style="width:12.6525%;"><a href="/exa/i965/emulating_speedups/EXA-without-accel/vmlinux.oprofile">vmlinux</a><span>13%</span></li>
            <li class="Xorg" style="width:9.0172%;"><a href="/exa/i965/emulating_speedups/EXA-without-accel/Xorg.oprofile">Xorg</a><span>9%</span></li>
            <li class="oprofiled" style="width:5.0459%;">oprofiled<span>5%</span></li>
            <li class="libc-2_5" style="width:4.9173%;"><a href="/exa/i965/emulating_speedups/EXA-without-accel/libc-2.5.oprofile">libc-2.5</a><span>5%</span></li>
            <li class="libexa" style="width:3.1229%;"><a href="/exa/i965/emulating_speedups/EXA-without-accel/libexa.oprofile">libexa</a><span>3%</span></li>
            <li class="libfb" style="width:2.1381%;"><a href="/exa/i965/emulating_speedups/EXA-without-accel/libfb.oprofile">libfb</a><span>2%</span></li>
            <li class="other" style="width:5.3254%;">other<span>5%</span></li>
        </ul>
    </dd>
    <dt><a href="/exa/i965/emulating_speedups/EXA-emulate-speedups/system.oprofile">EXA-emulate-speedups</a> (<a href="/exa/i965/emulating_speedups/EXA-emulate-speedups/timing">17.2 ms.</a>) <a href="/exa/i965/emulating_speedups/EXA-emulate-speedups/system.symbols">symbols profile</a></dt>
    <dd style="width:100%;">
        <ul>
            <li class="libxul" style="width:19.6803%;">libxul<span>20%</span></li>
            <li class="libpixman" style="width:16.8573%;"><a href="/exa/i965/emulating_speedups/EXA-emulate-speedups/libpixman.oprofile">libpixman</a><span>17%</span></li>
            <li class="vmlinux" style="width:13.1026%;"><a href="/exa/i965/emulating_speedups/EXA-emulate-speedups/vmlinux.oprofile">vmlinux</a><span>13%</span></li>
            <li class="libexa" style="width:11.9970%;"><a href="/exa/i965/emulating_speedups/EXA-emulate-speedups/libexa.oprofile">libexa</a><span>12%</span></li>
            <li class="libc-2_5" style="width:11.4997%;"><a href="/exa/i965/emulating_speedups/EXA-emulate-speedups/libc-2.5.oprofile">libc-2.5</a><span>11%</span></li>
            <li class="intel_drv" style="width:9.4984%;"><a href="/exa/i965/emulating_speedups/EXA-emulate-speedups/intel_drv.oprofile">intel_drv</a><span>9%</span></li>
            <li class="Xorg" style="width:7.3488%;"><a href="/exa/i965/emulating_speedups/EXA-emulate-speedups/Xorg.oprofile">Xorg</a><span>7%</span></li>
            <li class="oprofiled" style="width:3.3068%;">oprofiled<span>3%</span></li>
            <li class="other" style="width:6.7091%;"><a href="/exa/i965/emulating_speedups/EXA-emulate-speedups/other.oprofile">other</a><span>7%</span></li>
        </ul>
    </dd>
</dl>
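A note on reading the chart: the outer bar widths appear to be each run's time as a percentage of the slowest run. Here is a small sketch of that scaling, using only the three timings shown in the chart (the variable names are mine):

```python
timings_ms = {
    "NoAccel": 14.4,
    "EXA-without-accel": 15.7,
    "EXA-emulate-speedups": 17.2,
}

# Scale every run against the slowest one, which gets the full-width bar.
slowest_ms = max(timings_ms.values())
bar_widths = {name: 100.0 * ms / slowest_ms for name, ms in timings_ms.items()}

for name, width in bar_widths.items():
    print(f"{name}: {timings_ms[name]} ms, bar width {width:.4f}%")
```

The segments inside each bar are then shares of that one run, so they are not directly comparable between runs without scaling by the run's total time.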

Note that in this experiment, rendered results are not at all correct,
(basically, no text appears, for example).

And, still, things aren't faster than NoAccel, but there's definitely
still lots of room for improvement. For example, the pixman profile
shows compositing, (fbCombineInU and fbFetch_a1), that should be moved
to the hardware, (particularly when the hardware is infinitely fast
like it is in my emulation here!).

After that, pixman's rasterization would be at the top of the pixman
profile. I've been wanting rasterization to show up at the top of a
profile for a long time so I could have an excuse to implement some
ideas I have for much faster software rasterization, (and to explore
using the hardware for rasterization as well). And, for some
applications doing much more than just rendering text, rasterization
might already be a lot closer to the top.

So that shows what software operations aren't hooked up to be
accelerated yet. What else is here? As I pointed out before, (and is
much easier to see in this chart than the one from earlier this week),
libxul is mysteriously getting slower once the i965 gets involved, but
libxul really shouldn't care. So that will be something to investigate
by actually building mozilla with debug symbols.

There's also significantly more overhead in libexa in this chart
compared to those above. So there's some room for improvement there,
(ExaOffscreenMarkUsed is at the top of the profile, and as I've
mentioned before it looks ripe for improvement).
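Since the per-library percentages are shares of each run's own total, comparing them across runs means converting them to absolute time first. Here is a rough sketch of that conversion, using the timings and percentages from the chart. It assumes that a library's share of the profile samples is proportional to its share of the measured time, which is only an approximation. The helper `absolute_ms` is a name I made up.

```python
def absolute_ms(total_ms, share_percent):
    """Convert a share of a run's profile into estimated milliseconds."""
    return total_ms * share_percent / 100.0


# Timings and per-library shares copied from the chart. NoAccel has no
# separate libexa entry in the chart, so it is simply left out there.
runs = {
    "NoAccel": (14.4, {"libpixman": 45.4062, "libxul": 18.1849}),
    "EXA-without-accel": (15.7, {"libpixman": 42.0541, "libxul": 15.7266,
                                 "libexa": 3.1229}),
    "EXA-emulate-speedups": (17.2, {"libpixman": 16.8573, "libxul": 19.6803,
                                    "libexa": 11.9970}),
}

estimates = {
    run: {lib: absolute_ms(total_ms, share) for lib, share in shares.items()}
    for run, (total_ms, shares) in runs.items()
}

for run, libs in estimates.items():
    summary = ", ".join(f"{lib} {ms:.2f} ms" for lib, ms in libs.items())
    print(f"{run}: {summary}")
```

Under that assumption, libxul goes from roughly 2.6 ms in the NoAccel run to roughly 3.4 ms with the emulated speedups, and libexa goes from about 0.5 ms to about 2.1 ms, while libpixman drops from about 6.5 ms to about 2.9 ms.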

Finally, the i965 driver is still burning a lot of time in its wait
function here. I'm not sure what the cause of that is this time since
I've eliminated all calls to the wait function from
`i965_prepare_composite` and `i965_composite` in this experiment.

Oh, and the big libc time in this chart is from gettimeofday, (which I
showed how to eliminate earlier). That patch hasn't been accepted
upstream yet, and it wasn't included in this run.

As always, I've tried to make as much data available as possible, (you
can even change the .oprofile extensions on the links to .callgraph
for more data, though I often can't make sense of oprofile callgraph
reports myself). So I'd be glad for anybody to dig in deeper and
provide any useful feedback.