GL Dispatch
===========

Several factors combine to make efficient dispatch of OpenGL functions
fairly complicated. This document attempts to explain some of the issues
and introduce the reader to Mesa's implementation. Readers already
familiar with the issues around GL dispatch can safely skip ahead to the
:ref:`overview of Mesa's implementation <overview>`.

1. Complexity of GL Dispatch
----------------------------

Every GL application has at least one object called a GL *context*. This
object, which is an implicit parameter to every GL function, stores all
of the GL-related state for the application. Every texture, every buffer
object, every enable, and much, much more is stored in the context.
Since an application can have more than one context, the context to be
used is selected by a window-system dependent function such as
``glXMakeContextCurrent``.

In environments that implement OpenGL with the X Window System using GLX,
every GL function, including those whose pointers are returned by
``glXGetProcAddress``, is *context independent*. This means that no
matter what context is currently active, the same ``glVertex3fv``
function is used.

This creates the first bit of dispatch complexity. An application can
have two GL contexts. One context is a direct rendering context where
function calls are routed directly to a driver loaded within the
application's address space. The other context is an indirect rendering
context where function calls are converted to GLX protocol and sent to a
server. The same ``glVertex3fv`` has to do the right thing depending on
which context is current.

Highly optimized drivers or GLX protocol implementations may want to
change the behavior of GL functions depending on current state. For
example, ``glFogCoordf`` may operate differently depending on whether or
not fog is enabled.
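
One way to picture state-dependent behavior is an entry point that is
re-pointed whenever the relevant state changes. The following is a
minimal sketch under stated assumptions, not Mesa code: every ``toy_``
name is hypothetical, and the split into a "store only" and a "store and
flag" variant is invented purely for illustration.

.. code-block:: c
   :caption: Sketch: swapping an implementation when state changes

   #include <stdbool.h>

   /* Hypothetical context holding a little fog-related state. */
   struct toy_context {
       bool  fog_enabled;
       float current_fog_coord;
       bool  fog_dirty;
   };

   typedef void (*toy_fog_coord_fn)(struct toy_context *ctx, float coord);

   /* Variant used while fog is disabled: only record the value. */
   static void toy_fog_coord_store_only(struct toy_context *ctx, float coord)
   {
       ctx->current_fog_coord = coord;
   }

   /* Variant used while fog is enabled: record the value and flag it. */
   static void toy_fog_coord_store_and_flag(struct toy_context *ctx, float coord)
   {
       ctx->current_fog_coord = coord;
       ctx->fog_dirty = true;
   }

   /* The pointer callers go through. It is a single global here for
    * brevity; it is not per-context in this toy model. */
   static toy_fog_coord_fn toy_fog_coord = toy_fog_coord_store_only;

   /* The state change pays for the decision once, so calls need not. */
   static void toy_set_fog_enabled(struct toy_context *ctx, bool enabled)
   {
       ctx->fog_enabled = enabled;
       toy_fog_coord = enabled ? toy_fog_coord_store_and_flag
                               : toy_fog_coord_store_only;
   }

A caller keeps invoking ``toy_fog_coord`` and never tests the enable
itself. The cost of the check moves from every call to the comparatively
rare state change.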

In multithreaded environments, it is possible for each thread to have a
different GL context current. This means that poor old ``glVertex3fv``
has to know which GL context is current in the thread where it is being
called.

.. _overview:

2. Overview of Mesa's Implementation
------------------------------------

Mesa uses two per-thread pointers. The first pointer stores the address
of the context current in the thread, and the second pointer stores the
address of the *dispatch table* associated with that context. The
dispatch table stores pointers to functions that actually implement
specific GL functions. Each time a new context is made current in a
thread, these pointers are updated.
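
The bookkeeping described above can be sketched in a few lines. This is
an illustrative model only, not Mesa's code: the ``toy_`` types and
functions are hypothetical, the table has a single entry, and C11
``_Thread_local`` stands in for whatever per-thread storage a real build
uses.

.. code-block:: c
   :caption: Sketch: two per-thread pointers updated on make-current

   #include <stddef.h>

   /* Hypothetical, heavily reduced dispatch table with one entry. */
   struct toy_table {
       void (*Vertex3f)(float x, float y, float z);
   };

   /* Hypothetical context: each context refers to its dispatch table. */
   struct toy_context {
       const struct toy_table *table;
   };

   /* The two per-thread pointers: current context and its table. */
   static _Thread_local struct toy_context     *toy_current_context;
   static _Thread_local const struct toy_table *toy_current_dispatch;

   /* Making a context current updates both pointers for this thread. */
   static void toy_make_current(struct toy_context *ctx)
   {
       toy_current_context  = ctx;
       toy_current_dispatch = (ctx != NULL) ? ctx->table : NULL;
   }

   /* Stand-ins for a direct rendering driver and a GLX protocol encoder. */
   static int toy_direct_calls;
   static int toy_indirect_calls;

   static void toy_direct_Vertex3f(float x, float y, float z)
   {
       (void)x; (void)y; (void)z;
       toy_direct_calls++;
   }

   static void toy_indirect_Vertex3f(float x, float y, float z)
   {
       (void)x; (void)y; (void)z;
       toy_indirect_calls++;
   }

With one context whose table points at ``toy_direct_Vertex3f`` and
another whose table points at ``toy_indirect_Vertex3f``, a call through
``toy_current_dispatch->Vertex3f`` reaches whichever implementation
belongs to the context most recently passed to ``toy_make_current`` in
the calling thread.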

The implementation of functions such as ``glVertex3fv`` becomes
conceptually simple:

-  Fetch the current dispatch table pointer.
-  Fetch the pointer to the real ``glVertex3fv`` function from the
   table.
-  Call the real function.

This can be implemented in just a few lines of C code. The file
``src/mesa/glapi/glapitemp.h`` contains code very similar to this.

.. code-block:: c
   :caption: Sample dispatch function

   void glVertex3f(GLfloat x, GLfloat y, GLfloat z)
   {
       const struct _glapi_table * const dispatch = GET_DISPATCH();

       (*dispatch->Vertex3f)(x, y, z);
   }

The problem with this simple implementation is the large amount of
overhead that it adds to every GL function call.

In a multithreaded environment, a naive implementation of
``GET_DISPATCH`` involves a call to ``pthread_getspecific`` or a similar
function. Mesa provides a wrapper function called
``_glapi_get_dispatch`` that is used by default.

3. Optimizations
----------------

A number of optimizations have been made over the years to diminish the
performance hit imposed by GL dispatch. This section describes these
optimizations. The benefits of each optimization and the situations
where each can or cannot be used are listed.

3.1. Dual dispatch table pointers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The vast majority of OpenGL applications use the API in a
single-threaded manner. That is, the application has only one thread that
makes calls into the GL. In these cases, not only do the calls to
``pthread_getspecific`` hurt performance, but they are completely
unnecessary! It is possible to detect this common case and avoid these
calls.

Each time a new dispatch table is set, Mesa examines and records the ID
of the executing thread. If the same thread ID is always seen, Mesa
knows that the application is, from OpenGL's point of view, single
threaded.

As long as an application is single threaded, Mesa stores a pointer to
the dispatch table in a global variable called ``_glapi_Dispatch``. The
pointer is also stored in a per-thread location via
``pthread_setspecific``. When Mesa detects that an application has
become multithreaded, ``NULL`` is stored in ``_glapi_Dispatch``.
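
The setter side of this scheme can be sketched as follows. This is a
simplified model written for this document, not Mesa's source: all
``toy_`` names are hypothetical, a mutex guards the thread bookkeeping
for clarity, and memory-ordering concerns around the global pointer are
ignored.

.. code-block:: c
   :caption: Sketch: recording the thread ID when a table is set

   #include <pthread.h>
   #include <stdbool.h>
   #include <stddef.h>

   struct toy_table { int placeholder; };  /* stand-in dispatch table */

   static struct toy_table *toy_Dispatch;  /* global fast-path pointer */
   static pthread_key_t     toy_dispatch_key;
   static pthread_once_t    toy_key_once = PTHREAD_ONCE_INIT;

   static pthread_mutex_t   toy_lock = PTHREAD_MUTEX_INITIALIZER;
   static pthread_t         toy_first_thread;
   static bool              toy_have_first_thread;
   static bool              toy_multithreaded;

   static void toy_create_key(void)
   {
       (void)pthread_key_create(&toy_dispatch_key, NULL);
   }

   /* Record the caller and report whether a second thread was ever seen. */
   static bool toy_check_multithreaded(void)
   {
       bool result;

       pthread_mutex_lock(&toy_lock);
       if (!toy_have_first_thread) {
           toy_first_thread = pthread_self();
           toy_have_first_thread = true;
       } else if (!pthread_equal(toy_first_thread, pthread_self())) {
           toy_multithreaded = true;       /* sticky: never reset */
       }
       result = toy_multithreaded;
       pthread_mutex_unlock(&toy_lock);
       return result;
   }

   static void toy_set_dispatch(struct toy_table *table)
   {
       pthread_once(&toy_key_once, toy_create_key);

       /* The per-thread slot is always kept up to date. */
       pthread_setspecific(toy_dispatch_key, table);

       /* The global is only meaningful while one thread uses GL. */
       toy_Dispatch = toy_check_multithreaded() ? NULL : table;
   }

   /* Thread entry point used to demonstrate the multithreaded switch. */
   static void *toy_worker(void *table)
   {
       toy_set_dispatch(table);
       return NULL;
   }

After the first call from the main thread, ``toy_Dispatch`` holds the
table. Once ``toy_worker`` has run on a second thread, ``toy_Dispatch``
is ``NULL`` and each thread's table is only reachable through its
per-thread slot.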

Using this simple mechanism the dispatch functions can detect the
multithreaded case by comparing ``_glapi_Dispatch`` to ``NULL``. The
resulting implementation of ``GET_DISPATCH`` is slightly more complex,
but it avoids the expensive ``pthread_getspecific`` call in the common
case.

.. code-block:: c
   :caption: Improved ``GET_DISPATCH`` Implementation

   #define GET_DISPATCH() \
       ((_glapi_Dispatch != NULL) \
           ? _glapi_Dispatch : pthread_getspecific(_glapi_Dispatch_key))

3.2. ELF TLS
~~~~~~~~~~~~

Starting with the 2.4.20 Linux kernel, each thread is allocated an area
of per-thread, global storage. Variables can be put in this area using
some extensions to GCC. By storing the dispatch table pointer in this
area, the expensive call to ``pthread_getspecific`` and the test of
``_glapi_Dispatch`` can be avoided.

The dispatch table pointer is stored in a new variable called
``_glapi_tls_Dispatch``. A new variable name is used so that a single
libGL can implement both interfaces. This allows libGL to operate
with direct rendering drivers that use either interface. Once the
pointer is properly declared, ``GET_DISPATCH`` becomes a simple variable
reference.

.. code-block:: c
   :caption: TLS ``GET_DISPATCH`` Implementation

   extern __thread struct _glapi_table *_glapi_tls_Dispatch
       __attribute__((tls_model("initial-exec")));

   #define GET_DISPATCH() _glapi_tls_Dispatch

Use of this path is controlled by the preprocessor define
``USE_ELF_TLS``. Any platform capable of using ELF TLS should use this
as the default dispatch method.

Windows has a similar concept, and beginning with Windows Vista, shared
libraries can take advantage of compiler-assisted TLS. This TLS data
has no fixed size and does not compete with API-based TLS (``TlsAlloc``)
for the limited number of slots available there, and so ``USE_ELF_TLS`` can
be used on Windows too, even though it's not truly ELF.

3.3. Assembly Language Dispatch Stubs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Many platforms have difficulty properly optimizing the tail call in the
dispatch stubs. Platforms like x86 that pass parameters on the stack
seem to have even more difficulty optimizing these routines. All of the
dispatch routines are very short, and it is trivial to create optimal
assembly language versions. The amount of optimization provided by using
assembly stubs varies from platform to platform and application to
application. However, by using the assembly stubs, many platforms can
use an additional space optimization (see :ref:`below <fixedsize>`).

The biggest hurdle to creating assembly stubs is handling the various
ways that the dispatch table pointer can be accessed. There are four
different methods that can be used:

#. Using ``_glapi_Dispatch`` directly in builds for non-multithreaded
   environments.
#. Using ``_glapi_Dispatch`` and ``_glapi_get_dispatch`` in
   multithreaded environments.
#. Using ``_glapi_Dispatch`` and ``pthread_getspecific`` in
   multithreaded environments.
#. Using ``_glapi_tls_Dispatch`` directly in TLS enabled multithreaded
   environments.

People wishing to implement assembly stubs for new platforms should
focus on #4 if the new platform supports TLS. Otherwise, implement #2
followed by #3. Environments that do not support multithreading are
uncommon and not terribly relevant.

Selection of the dispatch table pointer access method is controlled by a
few preprocessor defines.

-  If ``USE_ELF_TLS`` is defined, method #4 is used.
-  If ``HAVE_PTHREAD`` is defined, method #3 is used.
-  If none of the preceding are defined, method #1 is used.
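
The precedence implied by the list above (TLS first, then pthreads, then
the plain global) can be sketched with the preprocessor. Only
``USE_ELF_TLS`` and ``HAVE_PTHREAD`` come from this document; the
``TOY_`` names are hypothetical labels, and the real sources may well
test additional macros.

.. code-block:: c
   :caption: Sketch: choosing an access method at compile time

   /* Hypothetical labels for how the dispatch pointer is reached. */
   enum toy_access_method {
       TOY_ACCESS_GLOBAL_ONLY,  /* _glapi_Dispatch used directly */
       TOY_ACCESS_PTHREAD,      /* _glapi_Dispatch, then a pthread lookup */
       TOY_ACCESS_TLS           /* _glapi_tls_Dispatch used directly */
   };

   /* The first matching test wins, so TLS takes priority over pthreads. */
   #if defined(USE_ELF_TLS)
   #  define TOY_ACCESS_METHOD TOY_ACCESS_TLS
   #elif defined(HAVE_PTHREAD)
   #  define TOY_ACCESS_METHOD TOY_ACCESS_PTHREAD
   #else
   #  define TOY_ACCESS_METHOD TOY_ACCESS_GLOBAL_ONLY
   #endif

   /* Human-readable name of the method chosen for this build. */
   static const char *toy_access_method_name(void)
   {
       switch (TOY_ACCESS_METHOD) {
       case TOY_ACCESS_TLS:     return "tls";
       case TOY_ACCESS_PTHREAD: return "pthread";
       default:                 return "global-only";
       }
   }

Building the same file with ``-DUSE_ELF_TLS`` or ``-DHAVE_PTHREAD``
changes the value of ``TOY_ACCESS_METHOD`` without touching any other
line, which is the property the assembly stubs rely on as well.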

Two different techniques are used to handle the various cases.
On x86 and SPARC, a macro called ``GL_STUB`` is used. In the preamble of
the assembly source file different implementations of the macro are
selected based on the defined preprocessor variables. The assembly code
then consists of a series of invocations of the macros such as:

.. code-block:: c
   :caption: SPARC Assembly Implementation of ``glColor3fv``

   GL_STUB(Color3fv, _gloffset_Color3fv)

The benefit of this technique is that changes to the calling pattern
(i.e., addition of a new dispatch table pointer access method) require
fewer changed lines in the assembly code.

However, this technique can only be used on platforms where the function
implementation does not change based on the parameters passed to the
function. For example, since x86 passes all parameters on the stack, no
additional code is needed to save and restore function parameters around
a call to ``pthread_getspecific``. Since x86-64 passes parameters in
registers, varying amounts of code need to be inserted around the call
to ``pthread_getspecific`` to save and restore the GL function's
parameters.

The other technique, used by platforms like x86-64 that cannot use the
first technique, is to insert ``#ifdef`` within the assembly
implementation of each function. This makes the assembly file
considerably larger (e.g., 29,332 lines for ``glapi_x86-64.S`` versus
1,155 lines for ``glapi_x86.S``) and causes simple changes to the
function implementation to generate many lines of diffs. Since the
assembly files are typically generated by scripts, this isn't a
significant problem.

Once a new assembly file is created, it must be inserted in the build
system. There are two steps to this. The file must first be added to
``src/mesa/sources``. That gets the file built and linked. The second
step is to add the correct ``#ifdef`` magic to
``src/mesa/glapi/glapi_dispatch.c`` to prevent the C version of the
dispatch functions from being built.

.. _fixedsize:

3.4. Fixed-Length Dispatch Stubs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To implement ``glXGetProcAddress``, Mesa stores a table that associates
function names with pointers to those functions. This table is stored in
``src/mesa/glapi/glprocs.h``. For different reasons on different
platforms, storing all of those pointers is inefficient. On most
platforms, including all known platforms that support TLS, we can avoid
this added overhead.

If the assembly stubs are all the same size, the pointer need not be
stored for every function. The location of the function can instead be
calculated by multiplying the size of the dispatch stub by the offset of
the function in the table. This value is then added to the address of
the first dispatch stub.
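
The arithmetic is small enough to show directly. The helper below is a
hypothetical illustration, not a function from Mesa: it assumes the
stubs are laid out back to back starting at ``first_stub``, and the
caller supplies the stub size, which is platform specific.

.. code-block:: c
   :caption: Sketch: locating a fixed-size stub by index

   #include <stddef.h>
   #include <stdint.h>

   /* Address of the stub at position `index`, given equally sized stubs
    * stored contiguously:  first_stub + stub_size * index. */
   static const void *toy_stub_address(const void *first_stub,
                                       size_t stub_size, size_t index)
   {
       return (const uint8_t *)first_stub + stub_size * index;
   }

For example, with an assumed stub size of 16 bytes, the stub at index 3
starts 48 bytes after the first one. A real implementation would then
convert the result to a function pointer, which is a conversion the
platform ABI has to permit.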

This path is activated by adding the correct ``#ifdef`` magic to
``src/mesa/glapi/glapi.c`` just before ``glprocs.h`` is included.