<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html lang="en">
<head>
  <meta http-equiv="content-type" content="text/html; charset=utf-8">
  <title>GL Dispatch in Mesa</title>
  <link rel="stylesheet" type="text/css" href="mesa.css">
</head>
<body>

<div class="header">
  <h1>The Mesa 3D Graphics Library</h1>
</div>

<iframe src="contents.html"></iframe>
<div class="content">

<h1>GL Dispatch in Mesa</h1>

<p>Several factors combine to make efficient dispatch of OpenGL functions
fairly complicated.  This document attempts to explain some of the issues
and introduce the reader to Mesa's implementation.  Readers already familiar
with the issues around GL dispatch can safely skip ahead to the <a
href="#overview">overview of Mesa's implementation</a>.</p>

<h2>1. Complexity of GL Dispatch</h2>

<p>Every GL application has at least one object called a GL <em>context</em>.
This object, which is an implicit parameter to every GL function, stores all
of the GL-related state for the application.  Every texture, every buffer
object, every enable, and much, much more is stored in the context.  Since
an application can have more than one context, the context to be used is
selected by a window-system dependent function such as
<tt>glXMakeContextCurrent</tt>.</p>

<p>In environments that implement OpenGL on the X Window System using GLX,
every GL function, including those reached through pointers returned by
<tt>glXGetProcAddress</tt>, is <em>context independent</em>.  This means
that no matter what context is currently active, the same
<tt>glVertex3fv</tt> function is used.</p>

<p>This creates the first bit of dispatch complexity.  An application can
have two GL contexts.  One context is a direct rendering context where
function calls are routed directly to a driver loaded within the
application's address space.  The other context is an indirect rendering
context where function calls are converted to GLX protocol and sent to a
server.  The same <tt>glVertex3fv</tt> has to do the right thing depending
on which context is current.</p>

<p>Highly optimized drivers or GLX protocol implementations may want to
change the behavior of GL functions depending on current state.  For
example, <tt>glFogCoordf</tt> may operate differently depending on whether
or not fog is enabled.</p>

<p>In multi-threaded environments, it is possible for each thread to have a
different GL context current.  This means that poor old <tt>glVertex3fv</tt>
has to know which GL context is current in the thread where it is being
called.</p>

<h2 id="overview">2. Overview of Mesa's Implementation</h2>

<p>Mesa uses two per-thread pointers.  The first pointer stores the address
of the context current in the thread, and the second pointer stores the
address of the <em>dispatch table</em> associated with that context.  The
dispatch table stores pointers to functions that actually implement
specific GL functions.  Each time a new context is made current in a thread,
these pointers are updated.</p>

<p>The implementation of functions such as <tt>glVertex3fv</tt> becomes
conceptually simple:</p>

<ul>
<li>Fetch the current dispatch table pointer.</li>
<li>Fetch the pointer to the real <tt>glVertex3fv</tt> function from the
table.</li>
<li>Call the real function.</li>
</ul>

<p>This can be implemented in just a few lines of C code.  The file
<tt>src/mesa/glapi/glapitemp.h</tt> contains code very similar to this.</p>

<blockquote>
<table border="1">
<tr><td><pre>
void glVertex3f(GLfloat x, GLfloat y, GLfloat z)
{
    const struct _glapi_table * const dispatch = GET_DISPATCH();

    (*dispatch-&gt;Vertex3f)(x, y, z);
}</pre></td></tr>
<tr><td>Sample dispatch function</td></tr></table>
</blockquote>

<p>The problem with this simple implementation is the large amount of
overhead that it adds to every GL function call.</p>

<p>In a multithreaded environment, a naive implementation of
<tt>GET_DISPATCH</tt> involves a call to <tt>pthread_getspecific</tt> or a
similar function.  Mesa provides a wrapper function called
<tt>_glapi_get_dispatch</tt> that is used by default.</p>
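
<p>The naive scheme can be sketched in self-contained C as follows.  This is
an illustration only, not Mesa's code: every <tt>demo_</tt> name, the
one-slot table, and the two stand-in back ends are assumptions made for the
example.  It shows a single public entry point reaching a "direct" or an
"indirect" implementation depending on which table the calling thread has
bound, at the cost of a <tt>pthread_getspecific</tt> call on every
invocation.</p>

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical miniature dispatch table.  Mesa's struct _glapi_table has
 * one slot per GL entry point; one slot is enough for this sketch. */
struct demo_table {
    void (*Vertex3f)(float x, float y, float z);
};

static pthread_key_t demo_dispatch_key;
static pthread_once_t demo_key_once = PTHREAD_ONCE_INIT;

static void demo_make_key(void)
{
    pthread_key_create(&demo_dispatch_key, NULL);
}

/* Called when a context is made current in the calling thread. */
void demo_set_dispatch(const struct demo_table *table)
{
    pthread_once(&demo_key_once, demo_make_key);
    pthread_setspecific(demo_dispatch_key, table);
}

/* Naive lookup: a library call on every GL entry point. */
const struct demo_table *demo_get_dispatch(void)
{
    pthread_once(&demo_key_once, demo_make_key);
    return pthread_getspecific(demo_dispatch_key);
}

/* Context-independent public entry point.  The sketch assumes a table has
 * already been bound in the calling thread. */
void demoVertex3f(float x, float y, float z)
{
    const struct demo_table *const dispatch = demo_get_dispatch();

    (*dispatch->Vertex3f)(x, y, z);
}

/* Two stand-in back ends that only count how often they are reached. */
static int demo_direct_calls;
static int demo_indirect_calls;

static void direct_Vertex3f(float x, float y, float z)
{
    (void)x; (void)y; (void)z;
    demo_direct_calls++;
}

static void indirect_Vertex3f(float x, float y, float z)
{
    (void)x; (void)y; (void)z;
    demo_indirect_calls++;
}

static const struct demo_table demo_direct_table = { direct_Vertex3f };
static const struct demo_table demo_indirect_table = { indirect_Vertex3f };

/* Usage: the same entry point lands in whichever table is bound. */
void demo_check(void)
{
    demo_set_dispatch(&demo_direct_table);
    demoVertex3f(1.0f, 2.0f, 3.0f);
    assert(demo_direct_calls == 1 && demo_indirect_calls == 0);

    demo_set_dispatch(&demo_indirect_table);
    demoVertex3f(1.0f, 2.0f, 3.0f);
    assert(demo_direct_calls == 1 && demo_indirect_calls == 1);
}
```

<p>Calling <tt>demo_check()</tt> from a program's <tt>main</tt> exercises
both tables through the one entry point.</p>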

<h2>3. Optimizations</h2>

<p>A number of optimizations have been made over the years to diminish the
performance hit imposed by GL dispatch.  This section describes these
optimizations.  The benefits of each optimization and the situations where
each can or cannot be used are listed.</p>

<h3>3.1. Dual dispatch table pointers</h3>

<p>The vast majority of OpenGL applications use the API in a single threaded
manner.  That is, the application has only one thread that makes calls into
the GL.  In these cases, not only do the calls to
<tt>pthread_getspecific</tt> hurt performance, but they are completely
unnecessary!  It is possible to detect this common case and avoid these
calls.</p>

<p>Each time a new dispatch table is set, Mesa examines and records the ID
of the executing thread.  If the same thread ID is always seen, Mesa knows
that the application is, from OpenGL's point of view, single threaded.</p>

<p>As long as an application is single threaded, Mesa stores a pointer to
the dispatch table in a global variable called <tt>_glapi_Dispatch</tt>.
The pointer is also stored in a per-thread location via
<tt>pthread_setspecific</tt>.  When Mesa detects that an application has
become multithreaded, <tt>NULL</tt> is stored in <tt>_glapi_Dispatch</tt>.</p>

<p>Using this simple mechanism the dispatch functions can detect the
multithreaded case by comparing <tt>_glapi_Dispatch</tt> to <tt>NULL</tt>.
The resulting implementation of <tt>GET_DISPATCH</tt> is slightly more
complex, but it avoids the expensive <tt>pthread_getspecific</tt> call in
the common case.</p>

<blockquote>
<table border="1">
<tr><td><pre>
#define GET_DISPATCH() \
    ((_glapi_Dispatch != NULL) \
        ? _glapi_Dispatch : pthread_getspecific(_glapi_Dispatch_key))
</pre></td></tr>
<tr><td>Improved <tt>GET_DISPATCH</tt> Implementation</td></tr></table>
</blockquote>
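
<p>A minimal sketch of the thread-detection logic described above is given
below.  It is not Mesa's implementation: the <tt>demo_</tt> names, the mutex
guarding the bookkeeping, and the one-field table are assumptions made for
the example.  The global pointer stays valid while only one thread ever sets
a table, and is cleared permanently once a second thread is seen.</p>

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct demo_table { int id; };            /* stand-in for struct _glapi_table */

static struct demo_table *demo_Dispatch;  /* global fast-path pointer */
static pthread_key_t demo_key;            /* per-thread slow-path slot */
static pthread_once_t demo_once = PTHREAD_ONCE_INIT;
static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_t demo_first_thread;
static int demo_have_first_thread;
static int demo_multithreaded;

static void demo_make_key(void)
{
    pthread_key_create(&demo_key, NULL);
}

void demo_set_dispatch(struct demo_table *table)
{
    pthread_once(&demo_once, demo_make_key);

    /* The per-thread copy is always stored. */
    pthread_setspecific(demo_key, table);

    pthread_mutex_lock(&demo_lock);
    if (!demo_have_first_thread) {
        demo_first_thread = pthread_self();
        demo_have_first_thread = 1;
    } else if (!pthread_equal(demo_first_thread, pthread_self())) {
        demo_multithreaded = 1;           /* a second thread ID was seen */
    }
    /* The global is only usable while the process looks single threaded. */
    demo_Dispatch = demo_multithreaded ? NULL : table;
    pthread_mutex_unlock(&demo_lock);
}

struct demo_table *demo_get_dispatch(void)
{
    struct demo_table *table = demo_Dispatch;

    if (table != NULL)
        return table;                     /* common case: no library call */
    pthread_once(&demo_once, demo_make_key);
    return pthread_getspecific(demo_key);
}

/* Thread body for the demonstration: bind a table, report what is seen. */
static void *demo_worker(void *arg)
{
    demo_set_dispatch(arg);
    return demo_get_dispatch();
}

static struct demo_table demo_a = { 1 };
static struct demo_table demo_b = { 2 };

void demo_check(void)
{
    pthread_t thread;
    void *seen = NULL;

    demo_set_dispatch(&demo_a);
    assert(demo_Dispatch == &demo_a);     /* still single threaded */

    pthread_create(&thread, NULL, demo_worker, &demo_b);
    pthread_join(thread, &seen);

    assert(seen == &demo_b);              /* worker saw its own table */
    assert(demo_Dispatch == NULL);        /* global path is now disabled */
    assert(demo_get_dispatch() == &demo_a);
}
```

<p>The unlocked read of the global pointer in <tt>demo_get_dispatch</tt> is
a simplification; a production implementation has to consider how that read
is ordered against a concurrent update.</p>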

<h3>3.2. ELF TLS</h3>

<p>Starting with the 2.4.20 Linux kernel, each thread is allocated an area
of per-thread, global storage.  Variables can be put in this area using some
extensions to GCC.  By storing the dispatch table pointer in this area, the
expensive call to <tt>pthread_getspecific</tt> and the test of
<tt>_glapi_Dispatch</tt> can be avoided.</p>

<p>The dispatch table pointer is stored in a new variable called
<tt>_glapi_tls_Dispatch</tt>.  A new variable name is used so that a single
libGL can implement both interfaces.  This allows the libGL to operate with
direct rendering drivers that use either interface.  Once the pointer is
properly declared, <tt>GET_DISPATCH</tt> becomes a simple variable
reference.</p>

<blockquote>
<table border="1">
<tr><td><pre>
extern __thread struct _glapi_table *_glapi_tls_Dispatch
    __attribute__((tls_model("initial-exec")));

#define GET_DISPATCH() _glapi_tls_Dispatch
</pre></td></tr>
<tr><td>TLS <tt>GET_DISPATCH</tt> Implementation</td></tr></table>
</blockquote>

<p>Use of this path is controlled by the preprocessor define
<tt>GLX_USE_TLS</tt>.  Any platform capable of using TLS should use this as
the default dispatch method.</p>
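
<p>The effect of a TLS variable can be seen in a small stand-alone program.
The sketch below assumes a compiler that supports the GCC <tt>__thread</tt>
extension; the <tt>demo_</tt> names are hypothetical, and the
<tt>tls_model</tt> attribute is left out to keep the example portable.  Two
threads bind different tables to the same variable name and each reads back
its own value with a plain variable reference.</p>

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct demo_table { int id; };   /* stand-in for struct _glapi_table */

/* One instance of this pointer exists per thread (GCC/Clang extension). */
static __thread struct demo_table *demo_tls_Dispatch;

/* The lookup is a plain variable reference: no call, no NULL test. */
#define DEMO_GET_DISPATCH() demo_tls_Dispatch

void demo_set_dispatch(struct demo_table *table)
{
    demo_tls_Dispatch = table;
}

/* Thread body: bind a table in this thread and report what is read back. */
static void *demo_worker(void *arg)
{
    demo_set_dispatch(arg);
    return DEMO_GET_DISPATCH();
}

static struct demo_table demo_a = { 1 };
static struct demo_table demo_b = { 2 };

void demo_check(void)
{
    pthread_t thread;
    void *seen = NULL;

    demo_set_dispatch(&demo_a);

    pthread_create(&thread, NULL, demo_worker, &demo_b);
    pthread_join(thread, &seen);

    assert(seen == &demo_b);                /* worker read its own copy */
    assert(DEMO_GET_DISPATCH() == &demo_a); /* this thread is unaffected */
}
```

<p>Unlike the dual-pointer scheme, nothing here depends on how many threads
the application uses, so the lookup costs the same in every case.</p>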

<h3>3.3. Assembly Language Dispatch Stubs</h3>

<p>Many platforms have difficulty properly optimizing the tail-call in the
dispatch stubs.  Platforms like x86 that pass parameters on the stack seem
to have even more difficulty optimizing these routines.  All of the dispatch
routines are very short, and it is trivial to create optimal assembly
language versions.  The amount of optimization provided by using assembly
stubs varies from platform to platform and application to application.
However, by using the assembly stubs, many platforms can use an additional
space optimization (see <a href="#fixedsize">below</a>).</p>

<p>The biggest hurdle to creating assembly stubs is handling the various
ways that the dispatch table pointer can be accessed.  There are four
different methods that can be used:</p>

<ol>
<li>Using <tt>_glapi_Dispatch</tt> directly in builds for non-multithreaded
environments.</li>
<li>Using <tt>_glapi_Dispatch</tt> and <tt>_glapi_get_dispatch</tt> in
multithreaded environments.</li>
<li>Using <tt>_glapi_Dispatch</tt> and <tt>pthread_getspecific</tt> in
multithreaded environments.</li>
<li>Using <tt>_glapi_tls_Dispatch</tt> directly in TLS enabled
multithreaded environments.</li>
</ol>

<p>People wishing to implement assembly stubs for new platforms should focus
on #4 if the new platform supports TLS.  Otherwise, implement #2 followed by
#3.  Environments that do not support multithreading are uncommon and not
terribly relevant.</p>

<p>Selection of the dispatch table pointer access method is controlled by a
few preprocessor defines.</p>

<ul>
<li>If <tt>GLX_USE_TLS</tt> is defined, method #4 is used.</li>
<li>If <tt>HAVE_PTHREAD</tt> is defined, method #3 is used.</li>
<li>If none of the preceding are defined, method #1 is used.</li>
</ul>
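
<p>In C terms, this kind of compile-time selection could look roughly like
the sketch below.  Only the macro names <tt>GLX_USE_TLS</tt> and
<tt>HAVE_PTHREAD</tt> come from this document; the <tt>demo_</tt> functions
and the bodies of each branch are assumptions for illustration and do not
reproduce Mesa's assembly stubs.</p>

```c
#include <assert.h>
#include <stddef.h>

#if defined(HAVE_PTHREAD) && !defined(GLX_USE_TLS)
#include <pthread.h>
#endif

struct demo_table { int id; };   /* stand-in for struct _glapi_table */

#if defined(GLX_USE_TLS)

/* TLS build: a per-thread variable is read directly. */
static __thread struct demo_table *demo_tls_Dispatch;

void demo_set_dispatch(struct demo_table *t) { demo_tls_Dispatch = t; }
struct demo_table *demo_get_dispatch(void) { return demo_tls_Dispatch; }

#elif defined(HAVE_PTHREAD)

/* pthread build: try the global first, then the per-thread slot.  For
 * brevity this branch always publishes the global and leaves out the
 * thread-ID check that would clear it. */
static struct demo_table *demo_Dispatch;
static pthread_key_t demo_key;
static pthread_once_t demo_once = PTHREAD_ONCE_INIT;

static void demo_make_key(void) { pthread_key_create(&demo_key, NULL); }

void demo_set_dispatch(struct demo_table *t)
{
    pthread_once(&demo_once, demo_make_key);
    pthread_setspecific(demo_key, t);
    demo_Dispatch = t;
}

struct demo_table *demo_get_dispatch(void)
{
    if (demo_Dispatch != NULL)
        return demo_Dispatch;
    pthread_once(&demo_once, demo_make_key);
    return pthread_getspecific(demo_key);
}

#else

/* Non-threaded build: the global pointer is the only storage. */
static struct demo_table *demo_Dispatch;

void demo_set_dispatch(struct demo_table *t) { demo_Dispatch = t; }
struct demo_table *demo_get_dispatch(void) { return demo_Dispatch; }

#endif

/* Usage: callers are identical whichever branch was compiled. */
void demo_check(void)
{
    static struct demo_table table = { 7 };

    demo_set_dispatch(&table);
    assert(demo_get_dispatch() == &table);
}
```

<p>Building the same file with <tt>-DGLX_USE_TLS</tt>,
<tt>-DHAVE_PTHREAD</tt>, or neither selects one branch each time without
changing any caller.</p>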

<p>Two different techniques are used to handle the various cases.
On x86 and SPARC, a macro called <tt>GL_STUB</tt> is used.  In the preamble
of the assembly source file different implementations of the macro are
selected based on the defined preprocessor variables.  The assembly code
then consists of a series of invocations of the macro such as:</p>

<blockquote>
<table border="1">
<tr><td><pre>
GL_STUB(Color3fv, _gloffset_Color3fv)
</pre></td></tr>
<tr><td>SPARC Assembly Implementation of <tt>glColor3fv</tt></td></tr></table>
</blockquote>

<p>The benefit of this technique is that changes to the calling pattern
(i.e., addition of a new dispatch table pointer access method) require fewer
changed lines in the assembly code.</p>

<p>However, this technique can only be used on platforms where the function
implementation does not change based on the parameters passed to the
function.  For example, since x86 passes all parameters on the stack, no
additional code is needed to save and restore function parameters around a
call to <tt>pthread_getspecific</tt>.  Since x86-64 passes parameters in
registers, varying amounts of code need to be inserted around the call to
<tt>pthread_getspecific</tt> to save and restore the GL function's
parameters.</p>

<p>The other technique, used by platforms like x86-64 that cannot use the
first technique, is to insert <tt>#ifdef</tt> within the assembly
implementation of each function.  This makes the assembly file considerably
larger (e.g., 29,332 lines for <tt>glapi_x86-64.S</tt> versus 1,155 lines for
<tt>glapi_x86.S</tt>) and causes simple changes to the function
implementation to generate many lines of diffs.  Since the assembly files
are typically generated by scripts (see <a href="#autogen">below</a>), this
isn't a significant problem.</p>

<p>Once a new assembly file is created, it must be inserted in the build
system.  There are two steps to this.  The file must first be added to
<tt>src/mesa/sources</tt>.  That gets the file built and linked.  The second
step is to add the correct <tt>#ifdef</tt> magic to
<tt>src/mesa/glapi/glapi_dispatch.c</tt> to prevent the C version of the
dispatch functions from being built.</p>

<h3 id="fixedsize">3.4. Fixed-Length Dispatch Stubs</h3>

<p>To implement <tt>glXGetProcAddress</tt>, Mesa stores a table that
associates function names with pointers to those functions.  This table is
stored in <tt>src/mesa/glapi/glprocs.h</tt>.  For different reasons on
different platforms, storing all of those pointers is inefficient.  On most
platforms, including all known platforms that support TLS, we can avoid this
added overhead.</p>

<p>If the assembly stubs are all the same size, the pointer need not be
stored for every function.  The location of the function can instead be
calculated by multiplying the size of the dispatch stub by the offset of the
function in the table.  This value is then added to the address of the first
dispatch stub.</p>
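
<p>The address calculation can be sketched in C as shown below.  The stub
size, the byte array standing in for the block of assembly stubs, the
<tt>demo_</tt> names, and the offsets given to the example entries are all
assumptions for illustration; Mesa's actual table layout may differ.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_STUB_SIZE  16u   /* assumed size of every stub, in bytes */
#define DEMO_STUB_COUNT 4u

/* Stand-in for the contiguous block of equally sized assembly stubs. */
static unsigned char demo_stubs[DEMO_STUB_COUNT * DEMO_STUB_SIZE];

/* The name table stores a small offset instead of a full pointer. */
struct demo_proc {
    const char *name;
    unsigned offset;          /* index of the function's stub */
};

static const struct demo_proc demo_procs[] = {
    { "glNewList",   0 },
    { "glEndList",   1 },
    { "glCallList",  2 },
    { "glCallLists", 3 },
};

/* address = address of first stub + stub size * offset */
const void *demo_get_proc_address(const char *name)
{
    size_t i;

    for (i = 0; i < sizeof demo_procs / sizeof demo_procs[0]; i++) {
        if (strcmp(demo_procs[i].name, name) == 0)
            return demo_stubs + (size_t)demo_procs[i].offset * DEMO_STUB_SIZE;
    }
    return NULL;              /* unknown name */
}

/* Usage: the third entry lies two stub sizes past the first stub. */
void demo_check(void)
{
    const unsigned char *p = demo_get_proc_address("glCallList");

    assert(p == demo_stubs + 2 * DEMO_STUB_SIZE);
    assert(demo_get_proc_address("glNoSuchFunction") == NULL);
}
```

<p>In this sketch each table entry needs only an integer small enough to
index the stubs, where the pointer-based table needs a full address per
function.</p>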

<p>This path is activated by adding the correct <tt>#ifdef</tt> magic to
<tt>src/mesa/glapi/glapi.c</tt> just before <tt>glprocs.h</tt> is
included.</p>

<h2 id="autogen">4. Automatic Generation of Dispatch Stubs</h2>

</div>
</body>
</html>