<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html lang="en">
<head>
  <meta http-equiv="content-type" content="text/html; charset=utf-8">
  <title>GL Dispatch in Mesa</title>
  <link rel="stylesheet" type="text/css" href="mesa.css">
</head>
<body>

<div class="header">
  <h1>The Mesa 3D Graphics Library</h1>
</div>

<iframe src="contents.html"></iframe>
<div class="content">

<h1>GL Dispatch in Mesa</h1>

<p>Several factors combine to make efficient dispatch of OpenGL functions
fairly complicated.  This document attempts to explain some of the issues
and introduce the reader to Mesa's implementation.  Readers already familiar
with the issues around GL dispatch can safely skip ahead to the <a
href="#overview">overview of Mesa's implementation</a>.</p>

<h2>1. Complexity of GL Dispatch</h2>

<p>Every GL application has at least one object called a GL <em>context</em>.
This object, which is an implicit parameter to every GL function, stores all
of the GL related state for the application.  Every texture, every buffer
object, every enable, and much, much more is stored in the context.
Since
an application can have more than one context, the context to be used is
selected by a window-system dependent function such as
<tt>glXMakeContextCurrent</tt>.</p>

<p>In environments that implement OpenGL on the X Window System using GLX,
every GL function, including the pointers returned by
<tt>glXGetProcAddress</tt>, is <em>context independent</em>.  This means that
no matter what context is currently active, the same <tt>glVertex3fv</tt>
function is used.</p>

<p>This creates the first bit of dispatch complexity.  An application can
have two GL contexts.  One context is a direct rendering context where
function calls are routed directly to a driver loaded within the
application's address space.  The other context is an indirect rendering
context where function calls are converted to GLX protocol and sent to a
server.  The same <tt>glVertex3fv</tt> has to do the right thing depending
on which context is current.</p>

<p>Highly optimized drivers or GLX protocol implementations may want to
change the behavior of GL functions depending on current state.  For
example, <tt>glFogCoordf</tt> may operate differently depending on whether
or not fog is enabled.</p>

<p>In multi-threaded environments, it is possible for each thread to have a
different GL context current.  This means that poor old <tt>glVertex3fv</tt>
has to know which GL context is current in the thread where it is being
called.</p>

<h2 id="overview">2. Overview of Mesa's Implementation</h2>

<p>Mesa uses two per-thread pointers.
The first pointer stores the address
of the context current in the thread, and the second pointer stores the
address of the <em>dispatch table</em> associated with that context.  The
dispatch table stores pointers to functions that actually implement
specific GL functions.  Each time a new context is made current in a thread,
these pointers are updated.</p>

<p>The implementation of functions such as <tt>glVertex3fv</tt> becomes
conceptually simple:</p>

<ul>
<li>Fetch the current dispatch table pointer.</li>
<li>Fetch the pointer to the real <tt>glVertex3fv</tt> function from the
table.</li>
<li>Call the real function.</li>
</ul>

<p>This can be implemented in just a few lines of C code.  The file
<tt>src/mesa/glapi/glapitemp.h</tt> contains code very similar to this.</p>

<blockquote>
<table border="1">
<tr><td><pre>
void glVertex3f(GLfloat x, GLfloat y, GLfloat z)
{
    const struct _glapi_table * const dispatch = GET_DISPATCH();

    (*dispatch->Vertex3f)(x, y, z);
}</pre></td></tr>
<tr><td>Sample dispatch function</td></tr></table>
</blockquote>

<p>The problem with this simple implementation is the large amount of
overhead that it adds to every GL function call.</p>

<p>In a multithreaded environment, a naive implementation of
<tt>GET_DISPATCH</tt> involves a call to <tt>pthread_getspecific</tt> or a
similar function.
Mesa provides a wrapper function called
<tt>_glapi_get_dispatch</tt> that is used by default.</p>

<h2>3. Optimizations</h2>

<p>A number of optimizations have been made over the years to diminish the
performance hit imposed by GL dispatch.  This section describes these
optimizations.  The benefits of each optimization and the situations where
each can or cannot be used are listed.</p>

<h3>3.1. Dual dispatch table pointers</h3>

<p>The vast majority of OpenGL applications use the API in a single threaded
manner.  That is, the application has only one thread that makes calls into
the GL.  In these cases, not only do the calls to
<tt>pthread_getspecific</tt> hurt performance, but they are completely
unnecessary!  It is possible to detect this common case and avoid these
calls.</p>

<p>Each time a new dispatch table is set, Mesa examines and records the ID
of the executing thread.  If the same thread ID is always seen, Mesa knows
that the application is, from OpenGL's point of view, single threaded.</p>

<p>As long as an application is single threaded, Mesa stores a pointer to
the dispatch table in a global variable called <tt>_glapi_Dispatch</tt>.
The pointer is also stored in a per-thread location via
<tt>pthread_setspecific</tt>.  When Mesa detects that an application has
become multithreaded, <tt>NULL</tt> is stored in <tt>_glapi_Dispatch</tt>.</p>

<p>Using this simple mechanism the dispatch functions can detect the
multithreaded case by comparing <tt>_glapi_Dispatch</tt> to <tt>NULL</tt>.
The resulting implementation of <tt>GET_DISPATCH</tt> is slightly more
complex, but it avoids the expensive <tt>pthread_getspecific</tt> call in
the common case.</p>

<blockquote>
<table border="1">
<tr><td><pre>
#define GET_DISPATCH() \
    ((_glapi_Dispatch != NULL) \
        ? _glapi_Dispatch : pthread_getspecific(_glapi_Dispatch_key))
</pre></td></tr>
<tr><td>Improved <tt>GET_DISPATCH</tt> Implementation</td></tr></table>
</blockquote>

<h3>3.2. ELF TLS</h3>

<p>Starting with the 2.4.20 Linux kernel, each thread is allocated an area
of per-thread, global storage.  Variables can be put in this area using some
extensions to GCC.  By storing the dispatch table pointer in this area, the
expensive call to <tt>pthread_getspecific</tt> and the test of
<tt>_glapi_Dispatch</tt> can be avoided.</p>

<p>The dispatch table pointer is stored in a new variable called
<tt>_glapi_tls_Dispatch</tt>.  A new variable name is used so that a single
libGL can implement both interfaces.  This allows the libGL to operate with
direct rendering drivers that use either interface.
Once the pointer is
properly declared, <tt>GET_DISPATCH</tt> becomes a simple variable
reference.</p>

<blockquote>
<table border="1">
<tr><td><pre>
extern __thread struct _glapi_table *_glapi_tls_Dispatch
    __attribute__((tls_model("initial-exec")));

#define GET_DISPATCH() _glapi_tls_Dispatch
</pre></td></tr>
<tr><td>TLS <tt>GET_DISPATCH</tt> Implementation</td></tr></table>
</blockquote>

<p>Use of this path is controlled by the preprocessor define
<tt>GLX_USE_TLS</tt>.  Any platform capable of using TLS should use this as
the default dispatch method.</p>

<h3>3.3. Assembly Language Dispatch Stubs</h3>

<p>Many platforms have difficulty properly optimizing the tail-call in the
dispatch stubs.  Platforms like x86 that pass parameters on the stack seem
to have even more difficulty optimizing these routines.  All of the dispatch
routines are very short, and it is trivial to create optimal assembly
language versions.  The amount of optimization provided by using assembly
stubs varies from platform to platform and application to application.
However, by using the assembly stubs, many platforms can use an additional
space optimization (see <a href="#fixedsize">below</a>).</p>

<p>The biggest hurdle to creating assembly stubs is handling the various
ways that the dispatch table pointer can be accessed.
There are four
different methods that can be used:</p>

<ol>
<li>Using <tt>_glapi_Dispatch</tt> directly in builds for non-multithreaded
environments.</li>
<li>Using <tt>_glapi_Dispatch</tt> and <tt>_glapi_get_dispatch</tt> in
multithreaded environments.</li>
<li>Using <tt>_glapi_Dispatch</tt> and <tt>pthread_getspecific</tt> in
multithreaded environments.</li>
<li>Using <tt>_glapi_tls_Dispatch</tt> directly in TLS enabled
multithreaded environments.</li>
</ol>

<p>People wishing to implement assembly stubs for new platforms should focus
on #4 if the new platform supports TLS.  Otherwise, implement #2 followed by
#3.  Environments that do not support multithreading are uncommon and not
terribly relevant.</p>

<p>Selection of the dispatch table pointer access method is controlled by a
few preprocessor defines.</p>

<ul>
<li>If <tt>GLX_USE_TLS</tt> is defined, method #4 is used.</li>
<li>If <tt>HAVE_PTHREAD</tt> is defined, method #3 is used.</li>
<li>If none of the preceding are defined, method #1 is used.</li>
</ul>

<p>Two different techniques are used to handle the various different cases.
On x86 and SPARC, a macro called <tt>GL_STUB</tt> is used.  In the preamble
of the assembly source file different implementations of the macro are
selected based on the defined preprocessor variables.
The assembly code
then consists of a series of invocations of the macro, such as:</p>

<blockquote>
<table border="1">
<tr><td><pre>
GL_STUB(Color3fv, _gloffset_Color3fv)
</pre></td></tr>
<tr><td>SPARC Assembly Implementation of <tt>glColor3fv</tt></td></tr></table>
</blockquote>

<p>The benefit of this technique is that changes to the calling pattern
(i.e., addition of a new dispatch table pointer access method) require fewer
changed lines in the assembly code.</p>

<p>However, this technique can only be used on platforms where the function
implementation does not change based on the parameters passed to the
function.  For example, since x86 passes all parameters on the stack, no
additional code is needed to save and restore function parameters around a
call to <tt>pthread_getspecific</tt>.  Since x86-64 passes parameters in
registers, varying amounts of code need to be inserted around the call to
<tt>pthread_getspecific</tt> to save and restore the GL function's
parameters.</p>

<p>The other technique, used by platforms like x86-64 that cannot use the
first technique, is to insert <tt>#ifdef</tt> within the assembly
implementation of each function.  This makes the assembly file considerably
larger (e.g., 29,332 lines for <tt>glapi_x86-64.S</tt> versus 1,155 lines for
<tt>glapi_x86.S</tt>) and causes simple changes to the function
implementation to generate many lines of diffs.
Since the assembly files
are typically generated by scripts (see <a href="#autogen">below</a>), this
isn't a significant problem.</p>

<p>Once a new assembly file is created, it must be inserted in the build
system.  There are two steps to this.  The file must first be added to
<tt>src/mesa/sources</tt>.  That gets the file built and linked.  The second
step is to add the correct <tt>#ifdef</tt> magic to
<tt>src/mesa/glapi/glapi_dispatch.c</tt> to prevent the C version of the
dispatch functions from being built.</p>

<h3 id="fixedsize">3.4. Fixed-Length Dispatch Stubs</h3>

<p>To implement <tt>glXGetProcAddress</tt>, Mesa stores a table that
associates function names with pointers to those functions.  This table is
stored in <tt>src/mesa/glapi/glprocs.h</tt>.  For different reasons on
different platforms, storing all of those pointers is inefficient.  On most
platforms, including all known platforms that support TLS, we can avoid this
added overhead.</p>

<p>If the assembly stubs are all the same size, the pointer need not be
stored for every function.  The location of the function can instead be
calculated by multiplying the size of the dispatch stub by the offset of the
function in the table.  This value is then added to the address of the first
dispatch stub.</p>

<p>This path is activated by adding the correct <tt>#ifdef</tt> magic to
<tt>src/mesa/glapi/glapi.c</tt> just before <tt>glprocs.h</tt> is
included.</p>

<h2 id="autogen">4.
Automatic Generation of Dispatch Stubs</h2>

</div>
</body>
</html>