VC4
===

Mesa's ``vc4`` graphics driver supports multiple implementations of
Broadcom's VideoCore IV GPU. It is notably used in the Raspberry Pi 0
through Raspberry Pi 3 hardware, and the driver is included as an
option as of the 2016-02-09 Raspbian release using ``raspi-config``.
On most other distributions such as Debian or Fedora, you need no
configuration to enable the driver.

This Mesa driver talks directly to the `vc4
<https://www.kernel.org/doc/html/latest/gpu/vc4.html>`__ kernel DRM
driver for scheduling graphics commands, and that module also provides
KMS display support.  The driver makes no use of the closed-source VPU
firmware on the VideoCore IV block, instead talking directly to the
GPU block from Linux.

GLES2 support
-------------

The vc4 driver is a nearly conformant GLES2 driver, and the hardware
has achieved GLES2 conformance with other driver stacks.

OpenGL support
--------------

Along with GLES 2.0, the Mesa driver also exposes OpenGL 2.1, which is
mostly correct but with a few caveats.

* 4-byte index buffers.

GLES 2.0, and vc4, don't have ``GL_UNSIGNED_INT`` index buffers. To support
them in vc4, we create a shadow copy of your index buffer with the
indices truncated to 2 bytes. This is incorrect (and will fail an
assertion in debug builds of Mesa) if any of the indices were >65535. To
fix that, we would need to detect this case and rewrite the index
buffer and vertex buffers to do a series of draws each with small
indices and new vertex attrib bindings.

To avoid this problem, ensure that all index buffers are written using
``GL_UNSIGNED_SHORT``, even at the cost of doing multiple draw calls
with updated vertex attrib bindings.
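
As a rough illustration of that kind of splitting, the Python sketch
below partitions a triangle list so that every batch needs only 16-bit
indices. It is not code from Mesa: the name
``split_indexed_triangles`` and its return layout are hypothetical,
and it assumes a plain ``GL_TRIANGLES`` list and a simple greedy
strategy.

.. code-block:: python

    def split_indexed_triangles(indices, max_index=0xFFFF):
        """Split a triangle list into batches whose local indices fit 16 bits.

        Returns a list of (vertex_ids, local_indices) pairs. vertex_ids names
        the original vertices to copy into the batch's vertex buffer, and
        local_indices indexes into that list with values <= max_index.
        """
        batches = []
        remap = {}       # original vertex index -> index local to this batch
        vertex_ids = []  # original vertices used by the current batch
        local = []       # rewritten indices for the current batch
        # Ignore a trailing partial triangle, if any.
        for i in range(0, len(indices) - len(indices) % 3, 3):
            tri = indices[i:i + 3]
            new_vertices = len({v for v in tri if v not in remap})
            # Start a new batch if this triangle would overflow the local range.
            if len(vertex_ids) + new_vertices > max_index + 1:
                batches.append((vertex_ids, local))
                remap, vertex_ids, local = {}, [], []
            for v in tri:
                if v not in remap:
                    remap[v] = len(vertex_ids)
                    vertex_ids.append(v)
                local.append(remap[v])
        if local:
            batches.append((vertex_ids, local))
        return batches

Each batch would then get its own vertex buffer, built from the
vertices named by ``vertex_ids``, and be drawn with
``GL_UNSIGNED_SHORT`` indices.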

* Occlusion queries

The VC4 hardware has no support for occlusion queries.  GL 2.0
requires that you support the occlusion queries extension, but you can
report 0 from ``glGetQueryiv(GL_SAMPLES_PASSED,
GL_QUERY_COUNTER_BITS)``. This is absurd, but it's how OpenGL handles
"we want the functions to be present everywhere, but we want it to be
optional for hardware to support it." Sadly, gallium doesn't yet allow
the driver to report 0 query bits.

* Primitive mode

VC4 doesn't support reducing triangles/quads/polygons to lines and
points like desktop GL. If the front/back modes matched, we could rewrite
the index buffer to the new primitive type, but we don't. If the
front/back modes don't match, we would need to run the vertex shader in
software, classify the prims, write new index buffers, and emit
(possibly many) new draw calls to rasterize the new prims in the same
order.
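
For the simpler case where the front and back modes match, an
application can do the index rewrite itself. The Python sketch below
turns a ``GL_TRIANGLES`` index list into a ``GL_LINES`` index list for
wireframe drawing. The name ``triangles_to_lines`` is hypothetical,
and the sketch makes no attempt to match desktop GL's edge-flag
behaviour: edges shared by two triangles are simply emitted twice.

.. code-block:: python

    def triangles_to_lines(indices):
        """Rewrite a triangle index list as a line index list (3 edges each)."""
        lines = []
        # Ignore a trailing partial triangle, if any.
        for i in range(0, len(indices) - len(indices) % 3, 3):
            a, b, c = indices[i:i + 3]
            lines += [a, b, b, c, c, a]
        return lines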

Bug Reporting
-------------

VC4 rendering bugs should go to Mesa's GitLab `issues
<https://gitlab.freedesktop.org/mesa/mesa/-/issues>`__ page.

By far the easiest way to communicate bug reports for rendering
problems is to take an apitrace. This passes exactly the drawing you
saw to the developer, without the developer needing to download and
build the application and replicate whatever steps you took to produce
the problem.  Traces attached to bug reports should ideally be small.

For GPU hangs, if you can get a short apitrace that produces the
problem, that's still the best.  If the problem takes a long time to
reproduce or you can't capture it in a trace, describing how to
reproduce and including a GPU hang dump would be the most
useful. Install `vc4-gpu-tools
<https://github.com/anholt/vc4-gpu-tools/>`__ and use
``vc4_dump_hang_state my-app.hang``. Sometimes the hang file will
provide useful information.

Tiled Rendering
---------------

VC4 is a tiled renderer, chopping the screen into 64x64 (non-MSAA) or
32x32 (MSAA) tiles and rendering the scene per tile. Rasterization
looks like::

    (CPU) Allocate space to store a list of draw commands per tile
    (CPU) Set up a command list per tile that does:
        Either load the current tile's color buffer from memory, or clear it.
        Either load the current tile's depth buffer from memory, or clear it.
        Branch into the draw list for the tile
        Store the depth buffer if anybody might read it.
        Store the color buffer if anybody might read it.
    (GPU) Initialize the per-tile draw call lists to empty.
    (GPU) Run all draw calls collecting vertex data
    (GPU) For each tile covered by a draw call's primitive:
        Emit state packets to the list to update it to the current draw call's state.
        Emit a primitive description into the tile's draw call list.

Tiled rendering avoids the need for large render target caches, at the
expense of increasing the cost of vertex processing. Unlike some tiled
renderers, VC4 has no non-tiled rendering mode.
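
The binning step can be modelled on the CPU with a few lines of
Python. This is only a sketch of the idea, not how the hardware is
implemented: the names ``tiles_covered`` and ``bin_primitives`` are
hypothetical, and primitives are approximated by their pixel bounding
boxes, which is an assumption made for simplicity.

.. code-block:: python

    TILE_SIZE = 64  # non-MSAA tile size from the text; MSAA would use 32


    def tiles_covered(xmin, ymin, xmax, ymax, width, height, tile=TILE_SIZE):
        """Return (tx, ty) tile coordinates touched by a pixel bounding box.

        Bounds are inclusive min, exclusive max, clamped to the render target.
        """
        xmin, ymin = max(xmin, 0), max(ymin, 0)
        xmax, ymax = min(xmax, width), min(ymax, height)
        if xmin >= xmax or ymin >= ymax:
            return []
        return [(tx, ty)
                for ty in range(ymin // tile, (ymax - 1) // tile + 1)
                for tx in range(xmin // tile, (xmax - 1) // tile + 1)]


    def bin_primitives(prims, width, height, tile=TILE_SIZE):
        """Build per-tile draw lists from (name, bounding_box) pairs."""
        lists = {}
        for name, bbox in prims:
            for t in tiles_covered(*bbox, width, height, tile):
                lists.setdefault(t, []).append(name)
        return lists

Tiles that never appear in the result have empty draw lists, and
primitives keep their submission order within each tile's list.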

Performance Tricks
------------------

* Reducing memory bandwidth by clearing.

Even if your drawing is going to cover the entire render target, it's
more efficient for VC4 if you emit a ``glClear()`` of the color and
depth buffers. This means we can skip the load of the previous state
from memory, in favor of a cheap GPU-side ``memset()`` of the tile
buffer before we start running the draw calls.

* Reducing memory bandwidth with scissoring.

If all draw calls for the frame are scissored with ``glScissor()`` to
only part of the screen, then we can skip setting up the tiles outside
that area, which means a little less memory used setting up the empty bins,
and a lot less memory used loading/storing the unchanged tiles.
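
To get a feel for the saving, the Python sketch below counts how many
64x64 tiles a scissor rectangle touches compared with the whole render
target. The function names are hypothetical, and the count is only a
proxy for bandwidth: actual load/store traffic also depends on buffer
formats and on which buffers are cleared or stored.

.. code-block:: python

    def ceil_div(a, b):
        """Integer division rounding up, for non-negative a and positive b."""
        return (a + b - 1) // b


    def tile_count(width, height, tile=64):
        """Number of tiles needed to cover a width x height render target."""
        return ceil_div(width, tile) * ceil_div(height, tile)


    def scissor_tiles(x, y, w, h, tile=64):
        """Number of tiles touched by a scissor rectangle at (x, y), size w x h."""
        if w <= 0 or h <= 0:
            return 0
        cols = (x + w - 1) // tile - x // tile + 1
        rows = (y + h - 1) // tile - y // tile + 1
        return cols * rows

For a 1920x1080 target, ``tile_count(1920, 1080)`` is 510, while a
640x480 scissor at the origin touches only 80 tiles.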

* Reducing memory bandwidth with ``glInvalidateFramebuffer()``.

If we don't know who might use the contents of the framebuffer's depth
or color in the future, then we have to store it for later. If you use
``glInvalidateFramebuffer()`` before accessing the results of your
rendering, then we can skip the store of the depth or color
buffer. Note that this is currently unimplemented.

* Avoid non-constant GLSL array indexing

In VC4 the only non-constant-index array access supported in hardware
is uniforms. For everything else (inputs, outputs, temporaries), we
have to lower them to an IF ladder like::

  if (index == 0)
    return array[0]
  else if (index == 1)
    return array[1]
  ...

This is very expensive, as we probably have to execute every branch of
every IF statement because VC4 is a SIMD machine. So, it is
recommended (if you can) to avoid non-constant indexing of anything
other than uniform arrays.

Note that if you do variable indexing within a bounded loop that Mesa
can unroll, that can actually count as constant indexing.

* Increasing GPU memory: increase the CMA pool size

The memory for the VC4 driver is allocated from the standard Linux CMA
pool. The size of this pool defaults to 64 MB.  To increase this, pass
an additional parameter on the kernel command line.  Edit the boot
partition's ``cmdline.txt`` to add::

  cma=256M@256M

``cmdline.txt`` is a single line with whitespace-separated parameters.

The first value is the size of the pool and the second value is
the start address of the pool. The pool size can be increased further,
but it must fit into the memory, so size + start address must be below
1024M (Pi 2, 3, 3+) or 512M (Pi B, B+, Zero, Zero W). Also note that
this reduces the memory available to Linux.
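
The size and start address constraint can be checked before rebooting
with a small Python sketch like the one below. The helper names
``parse_size`` and ``check_cma`` are hypothetical. The sketch only
understands the ``cma=<size>@<start>`` form shown above, and it
assumes that a pool ending exactly at the memory limit still fits and
that no memory is set aside for the firmware.

.. code-block:: python

    import re

    UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


    def parse_size(text):
        """Convert a string such as '256M' to a number of bytes."""
        m = re.fullmatch(r"(\d+)([KMG]?)", text.upper())
        if not m:
            raise ValueError("unrecognized size: %r" % text)
        return int(m.group(1)) * UNITS.get(m.group(2), 1)


    def check_cma(param, ram_bytes):
        """Return (size, start, fits) for a 'cma=<size>@<start>' parameter."""
        m = re.fullmatch(r"cma=([^@]+)@(.+)", param)
        if not m:
            raise ValueError("expected cma=<size>@<start>, got %r" % param)
        size, start = parse_size(m.group(1)), parse_size(m.group(2))
        # The pool occupies [start, start + size) and must end inside RAM.
        return size, start, size + start <= ram_bytes

For example, ``check_cma("cma=256M@256M", 1024 << 20)`` reports that
the pool fits on a board with 1024M of memory.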

* Decrease firmware memory

The firmware allocates a fixed chunk of memory before booting
Linux. If firmware functions are not required, this amount can be
reduced.

In ``config.txt``, set ``gpu_mem`` to 16 if you do not need video
decoding, or to 64 if you do.
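
For instance, a system that does not need video decoding would carry
the following line in ``config.txt`` (a minimal fragment; any other
settings already in the file are assumed to stay as they are)::

  gpu_mem=16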

Performance debugging
---------------------

* Step 1: Known issues

The first tool to look at is running your application with the
environment variable ``VC4_DEBUG=perf`` set. This will report debug
information for many known causes of performance problems on the
console. Not all of them will cause visible performance improvements
when fixed, but it's a good first step to see what might be going wrong.

* Step 2: CPU vs GPU

The primary question is figuring out whether the CPU is busy in your
application, the CPU is busy in the GL driver, the GPU is waiting for
the CPU, or the CPU is waiting for the GPU. Ideally, you get to the
point where the CPU is waiting for the GPU infrequently but for a
significant amount of time (however long it takes the GPU to draw a
frame).

Start with ``top`` while your application is running. Is the CPU usage
around 90%+? If so, then our performance analysis will be with
sysprof. If it's not very high, is the GPU staying busy? We don't have
a clean tool for this yet, but ``cat /debug/dri/0/v3d_regs`` could be
useful. If ``CT0CA`` != ``CT0EA`` or ``CT1CA`` != ``CT1EA``, that
means that the GPU is currently busy processing some rendering job.
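
That comparison can be scripted. The Python sketch below applies it to
the text read from ``v3d_regs``. The exact file format is an
assumption here: the sketch expects one register per line, with the
register name somewhere on the line and its value as the last
hexadecimal number, so adjust the pattern if your kernel prints
something different. The function name ``gpu_busy`` is hypothetical.

.. code-block:: python

    import re

    # Register name (possibly prefixed, e.g. V3D_CT0CA) followed later on the
    # same line by its value as the last hex number. This format is assumed.
    REG_LINE = re.compile(r"(CT[01][CE]A)\b.*?(0x[0-9a-fA-F]+)\s*$")


    def gpu_busy(regs_text):
        """True if either control list's current address differs from its end."""
        regs = {}
        for line in regs_text.splitlines():
            m = REG_LINE.search(line)
            if m:
                regs[m.group(1)] = int(m.group(2), 16)
        missing = {"CT0CA", "CT0EA", "CT1CA", "CT1EA"} - set(regs)
        if missing:
            raise ValueError("registers not found: %s" % sorted(missing))
        return regs["CT0CA"] != regs["CT0EA"] or regs["CT1CA"] != regs["CT1EA"]

A single sample only tells you whether the GPU was busy at that
instant, so polling the file repeatedly gives a better picture.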

* sysprof for CPU usage

If the CPU is totally busy and the GPU isn't terribly busy, there is
an excellent tool for debugging: sysprof. Install, run as root (so you
can get system-wide profiling), hit play and later stop. The top-left
area shows the flat profile sorted by total time of that symbol plus
its descendants. The top few are generally uninteresting (main() and
its descendants consuming a lot), but eventually you can get down to
something interesting. Click it, and to the right you get the
callchains to descendants -- where all that time actually went. On the
other hand, the lower left shows callers -- double-clicking those
selects that as the symbol to view, instead.

Note that you need debug symbols for the callgraphs in sysprof to
work, which is where most of its value is. Most distributions offer
debug symbol packages from their builds which can be installed
separately, and sysprof will find them. I've found that on ARM, the
debug packages are not enough, and if someone could determine what is
necessary for callgraphs in debugging, that would be really helpful.

* perf for CPU waits on GPU

If the CPU is not very busy and the GPU is not very busy, then we're
probably ping-ponging between the two. Most cases of this would be
noticed by ``VC4_DEBUG=perf``, but not all. To see all cases where
this happens, use the perf tool from the Linux kernel (note: unrelated
to ``VC4_DEBUG=perf``)::

    sudo perf record -f -g -e vc4:vc4_wait_for_seqno_begin -c 1 openarena

If you want to see the whole system's stalls for a period of time
(very useful!), use the ``-a`` flag instead of a particular command
name. Just ``^C`` when you're done capturing data.

At exit, you'll have ``perf.data`` in the current directory. You can print
out the results with::

    perf report | less

* Debugging for GPU fully busy

As of Linux kernel 4.17 and Mesa 18.1, we now expose the hardware's
performance counters in OpenGL. Install apitrace, and trace your
application with::

    apitrace trace <application>          # for GLX applications
    apitrace trace -a egl <application>   # for EGL applications

Once you've captured a trace, you can see what counters are available
and replay it while looking at some of those counters::

    apitrace replay <application>.trace --list-metrics

    apitrace replay <application>.trace --pdraw=GL_AMD_performance_monitor:QPU-total-clk-cycles-vertex-coord-shading

Multiple counters can be captured at once with commas separating them.

Once you've found what draw calls are surprisingly expensive in one of
the counters, you can work out which ones they were at the GL level by
opening the trace up in qapitrace and using ``^-G`` to jump to that call
number and ``^-L`` to look up the GL state at that call.
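
The triage in this section can be summarized as a small decision
function. This Python sketch is only a restatement of the steps above:
the function name ``next_step`` and its return strings are
hypothetical, and the 90% threshold is the rough figure from Step 2
rather than a hard limit.

.. code-block:: python

    def next_step(cpu_percent, gpu_is_busy, cpu_busy_threshold=90.0):
        """Suggest which tool to reach for after running VC4_DEBUG=perf."""
        if cpu_percent >= cpu_busy_threshold:
            # CPU-bound: profile the application and the GL driver.
            return "sysprof"
        if gpu_is_busy:
            # GPU-bound: look at hardware counters per draw call.
            return "apitrace performance counters"
        # Neither side is busy: probably ping-ponging between CPU and GPU.
        return "perf record on vc4:vc4_wait_for_seqno_begin"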

shader-db
---------

shader-db is often used as a proxy for real-world app performance when
working on the compiler in Mesa.  On vc4, there is a lot of
state-dependent code in the shaders (like blending or vertex attribute
format handling), so the typical `shader-db
<https://gitlab.freedesktop.org/mesa/shader-db>`__ will miss important
areas for optimization.  Instead, anholt wrote a `new one
<https://cgit.freedesktop.org/~anholt/shader-db-2/>`__ based on
apitraces.  Once you have a collection of traces, starting from
`traces-db <https://gitlab.freedesktop.org/gfx-ci/tracie/traces-db/>`__,
you can test a compiler change in this shader-db with::

  ./run.py > before
  (cd ../mesa && make install)
  ./run.py > after
  ./report.py before after

Hardware Documentation
----------------------

For driver developers, Broadcom publicly released a `specification
<https://docs.broadcom.com/doc/12358545>`__ PDF for the 21553, which
is closely related to the vc4 GPU present in the Raspberry Pi.  They
also released a `snapshot <https://docs.broadcom.com/docs/12358546>`__
of a corresponding Android graphics driver.  That graphics driver was
ported to Raspbian for a demo, but was not expected to have ongoing
development.

Developers who have an NDA with Broadcom or Raspberry Pi can
potentially get access to "simpenrose", the C software simulator of
the GPU.  The Mesa driver includes a backend (``vc4_simulator.c``) to
use simpenrose from an x86 system with the i915 graphics driver, with
all of the vc4 rendering commands emulated on simpenrose and memcpyed
to the real GPU.