VC4
===

Mesa's ``vc4`` graphics driver supports multiple implementations of
Broadcom's VideoCore IV GPU. It is notably used in the Raspberry Pi 0
through Raspberry Pi 3 hardware, and the driver is included as an
option as of the 2016-02-09 Raspbian release using ``raspi-config``.
On most other distributions such as Debian or Fedora, you need no
configuration to enable the driver.

This Mesa driver talks directly to the `vc4
<https://www.kernel.org/doc/html/latest/gpu/vc4.html>`__ kernel DRM
driver for scheduling graphics commands, and that module also provides
KMS display support. The driver makes no use of the closed source VPU
firmware on the VideoCore IV block, instead talking directly to the
GPU block from Linux.

GLES2 support
-------------

The vc4 driver is a nearly conformant GLES2 driver, and the hardware
has achieved GLES2 conformance with other driver stacks.

OpenGL support
--------------

Along with GLES 2.0, the Mesa driver also exposes OpenGL 2.1, which is
mostly correct but with a few caveats.

* 4-byte index buffers.

GLES 2.0, and vc4, don't have ``GL_UNSIGNED_INT`` index buffers. To support
them in vc4, we create a shadow copy of your index buffer with the
indices truncated to 2 bytes. This is incorrect (and will fail an
assertion in debug builds of Mesa) if any of the indices were >65535.
To fix that, we would need to detect this case and rewrite the index
buffer and vertex buffers to do a series of draws each with small
indices and new vertex attrib bindings.

To avoid this problem, ensure that all index buffers are written using
``GL_UNSIGNED_SHORT``, even at the cost of doing multiple draw calls
with updated vertex attrib bindings.

* Occlusion queries

The VC4 hardware has no support for occlusion queries. GL 2.0
requires that you support the occlusion queries extension, but you can
report 0 from ``glGetQueryiv(GL_SAMPLES_PASSED,
GL_QUERY_COUNTER_BITS)``. This is absurd, but it's how OpenGL handles
"we want the functions to be present everywhere, but we want it to be
optional for hardware to support it." Sadly, gallium doesn't yet allow
the driver to report 0 query bits.

* Primitive mode

VC4 doesn't support reducing triangles/quads/polygons to lines and
points like desktop GL. If the front/back modes matched, we could rewrite
the index buffer to the new primitive type, but we don't. If the
front/back modes don't match, we would need to run the vertex shader in
software, classify the prims, write new index buffers, and emit
(possibly many) new draw calls to rasterize the new prims in the same
order.

Bug Reporting
-------------

VC4 rendering bugs should go to Mesa's GitLab `issues
<https://gitlab.freedesktop.org/mesa/mesa/-/issues>`__ page.

By far the easiest way to communicate bug reports for rendering
problems is to take an apitrace.
This passes exactly the drawing you
saw to the developer, without the developer needing to download and
build the application and replicate whatever steps you took to produce
the problem. Traces attached to bug reports should ideally be small.

For GPU hangs, if you can get a short apitrace that produces the
problem, that's still the best. If the problem takes a long time to
reproduce or you can't capture it in a trace, describing how to
reproduce and including a GPU hang dump would be the most
useful. Install `vc4-gpu-tools
<https://github.com/anholt/vc4-gpu-tools/>`__ and use
``vc4_dump_hang_state my-app.hang``. Sometimes the hang file will
provide useful information.

Tiled Rendering
---------------

VC4 is a tiled renderer, chopping the screen into 64x64 (non-MSAA) or
32x32 (MSAA) tiles and rendering the scene per tile. Rasterization
looks like::

    (CPU) Allocate space to store a list of draw commands per tile
    (CPU) Set up a command list per tile that does:
        Either load the current tile's color buffer from memory, or clear it.
        Either load the current tile's depth buffer from memory, or clear it.
        Branch into the draw list for the tile
        Store the depth buffer if anybody might read it.
        Store the color buffer if anybody might read it.
    (GPU) Initialize the per-tile draw call lists to empty.
    (GPU) Run all draw calls collecting vertex data
    (GPU) For each tile covered by a draw call's primitive.
        Emit state packets to the list to update it to the current draw call's state.
        Emit a primitive description into the tile's draw call list.

Tiled rendering avoids the need for large render target caches, at the
expense of increasing the cost of vertex processing. Unlike some tiled
renderers, VC4 has no non-tiled rendering mode.

Performance Tricks
------------------

* Reducing memory bandwidth by clearing.

Even if your drawing is going to cover the entire render target, it's
more efficient for VC4 if you emit a ``glClear()`` of the color and
depth buffers. This means we can skip the load of the previous state
from memory, in favor of a cheap GPU-side ``memset()`` of the tile
buffer before we start running the draw calls.

* Reducing memory bandwidth with scissoring.

If all draw calls for the frame are with a ``glScissor()`` to only
part of the screen, then we can skip setting up the tiles for that
area, which means a little less memory used setting up the empty bins,
and a lot less memory used loading/storing the unchanged tiles.

* Reducing memory bandwidth with ``glInvalidateFramebuffer()``.

If we don't know who might use the contents of the framebuffer's depth
or color in the future, then we have to store it for later. If you use
``glInvalidateFramebuffer()`` before accessing the results of your
rendering, then we can skip the store of the depth or color
buffer. Note that this is unimplemented.
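
To get a rough feel for how much tile setup scissoring can save, the
sketch below counts the tiles covering a render target and the tiles
touched by a scissor rectangle. It assumes only the 64x64 (non-MSAA) and
32x32 (MSAA) tile sizes given under Tiled Rendering. The function names
are hypothetical, the rectangle is assumed to share its origin with the
tile grid, and this is not how the driver computes its own binning setup.

```python
def tile_edge(msaa: bool) -> int:
    """Tile edge length in pixels: 64 without MSAA, 32 with MSAA."""
    return 32 if msaa else 64


def tiles_for_target(width: int, height: int, msaa: bool = False) -> int:
    """Number of tiles needed to cover a width x height render target."""
    edge = tile_edge(msaa)
    cols = (width + edge - 1) // edge   # round up to whole tiles
    rows = (height + edge - 1) // edge
    return cols * rows


def tiles_for_scissor(x: int, y: int, w: int, h: int, msaa: bool = False) -> int:
    """Number of tiles touched by a scissor rectangle at (x, y) of size w x h."""
    if w <= 0 or h <= 0:
        return 0
    edge = tile_edge(msaa)
    first_col, last_col = x // edge, (x + w - 1) // edge
    first_row, last_row = y // edge, (y + h - 1) // edge
    return (last_col - first_col + 1) * (last_row - first_row + 1)


if __name__ == "__main__":
    total = tiles_for_target(1920, 1080)
    touched = tiles_for_scissor(0, 0, 640, 480)
    print(f"full target: {total} tiles, scissored region: {touched} tiles")
```

A scissor that is not aligned to the tile grid still touches every tile
it overlaps, so a small rectangle straddling a tile corner costs four
tiles rather than one.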

* Avoid non-constant GLSL array indexing

In VC4 the only non-constant-index array access supported in hardware
is uniforms. For everything else (inputs, outputs, temporaries), we
have to lower them to an IF ladder like::

    if (index == 0)
        return array[0]
    else if (index == 1)
        return array[1]
    ...

This is very expensive as we probably have to execute every branch of
every IF statement due to it being a SIMD machine. So, it is
recommended (if you can) to avoid non-uniform non-constant array
indexing.

Note that if you do variable indexing within a bounded loop that Mesa
can unroll, that can actually count as constant indexing.

* Increasing GPU memory: increase the CMA pool size

The memory for the VC4 driver is allocated from the standard Linux CMA
pool. The size of this pool defaults to 64 MB. To increase this, pass
an additional parameter on the kernel command line. Edit the boot
partition's ``cmdline.txt`` to add::

    cma=256M@256M

``cmdline.txt`` is a single line with whitespace-separated parameters.

The first value is the size of the pool and the second value is
the start address of the pool. The pool size can be increased further,
but it must fit into the memory, so size + start address must be below
1024M (Pi 2, 3, 3+) or 512M (Pi B, B+, Zero, Zero W). Also, this
reduces the memory available to Linux.
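
The size + start address constraint is simple arithmetic, and a small
helper can check a candidate value before you reboot with it. The sketch
below is hypothetical and not part of any Raspberry Pi or kernel tooling.
It only understands the ``<size>M@<start>M`` form used in the example
above, and it assumes that a pool ending exactly at the top of RAM still
counts as fitting.

```python
import re

# Total RAM in MiB for the two board groups named in the text.
RAM_MIB = {"Pi 2/3/3+": 1024, "Pi B/B+/Zero/Zero W": 512}


def parse_cma(param: str) -> tuple[int, int]:
    """Parse 'cma=<size>M@<start>M' and return (size, start) in MiB.

    Only the 'M' suffix from the example is handled; the kernel accepts
    more forms than this sketch does.
    """
    match = re.fullmatch(r"cma=(\d+)M@(\d+)M", param)
    if match is None:
        raise ValueError(f"unsupported cma parameter: {param!r}")
    return int(match.group(1)), int(match.group(2))


def cma_fits(param: str, ram_mib: int) -> bool:
    """True if the pool's end address (start + size) lies within ram_mib.

    Assumption: the limit is treated as inclusive; use '<' instead of
    '<=' for the stricter reading of "below".
    """
    size, start = parse_cma(param)
    return start + size <= ram_mib


if __name__ == "__main__":
    for board, ram in RAM_MIB.items():
        verdict = "fits" if cma_fits("cma=256M@256M", ram) else "does not fit"
        print(f"{board}: cma=256M@256M {verdict}")
```

Passing the check only means the pool lies inside physical memory; it
says nothing about how much memory remains for Linux or the firmware.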

* Decrease firmware memory

The firmware allocates a fixed chunk of memory before booting
Linux. If firmware functions are not required, this amount can be
reduced.

In ``config.txt``, set ``gpu_mem`` to 16 if you do not need video
decoding, or to 64 if you do.

Performance debugging
---------------------

* Step 1: Known issues

The first tool to look at is running your application with the
environment variable ``VC4_DEBUG=perf`` set. This will report debug
information for many known causes of performance problems on the
console. Not all of them will cause visible performance improvements
when fixed, but it's a good first step to see what might be going wrong.

* Step 2: CPU vs GPU

The primary question is figuring out whether the CPU is busy in your
application, the CPU is busy in the GL driver, the GPU is waiting for
the CPU, or the CPU is waiting for the GPU. Ideally, you get to the
point where the CPU is waiting for the GPU infrequently but for a
significant amount of time (however long it takes the GPU to draw a
frame).

Start with ``top`` while your application is running. Is the CPU usage
around 90%+? If so, then our performance analysis will be with
sysprof. If it's not very high, is the GPU staying busy? We don't have
a clean tool for this yet, but ``cat /debug/dri/0/v3d_regs`` could be
useful.
If ``CT0CA`` != ``CT0EA`` or ``CT1CA`` != ``CT1EA``, that
means that the GPU is currently busy processing some rendering job.

* sysprof for CPU usage

If the CPU is totally busy and the GPU isn't terribly busy, there is
an excellent tool for debugging: sysprof. Install, run as root (so you
can get system-wide profiling), hit play and later stop. The top-left
area shows the flat profile sorted by total time of that symbol plus
its descendants. The top few are generally uninteresting (main() and
its descendants consuming a lot), but eventually you can get down to
something interesting. Click it, and to the right you get the
callchains to descendants -- where all that time actually went. On the
other hand, the lower left shows callers -- double-clicking those
selects that as the symbol to view, instead.

Note that you need debug symbols for the callgraphs in sysprof to
work, which is where most of its value is. Most distributions offer
debug symbol packages from their builds which can be installed
separately, and sysprof will find them. I've found that on Arm, the
debug packages are not enough, and if someone could determine what is
necessary for callgraphs in debugging, that would be really helpful.

* perf for CPU waits on GPU

If the CPU is not very busy and the GPU is not very busy, then we're
probably ping-ponging between the two. Most cases of this would be
noticed by ``VC4_DEBUG=perf``, but not all.
To see all cases where
this happens, use the perf tool from the Linux kernel (note: unrelated
to ``VC4_DEBUG=perf``)::

    sudo perf record -f -g -e vc4:vc4_wait_for_seqno_begin -c 1 openarena

If you want to see the whole system's stalls for a period of time
(very useful!), use the ``-a`` flag instead of a particular command
name. Just ``^C`` when you're done capturing data.

At exit, you'll have ``perf.data`` in the current directory. You can print
out the results with::

    perf report | less

* Debugging for GPU fully busy

As of Linux kernel 4.17 and Mesa 18.1, we now expose the hardware's
performance counters in OpenGL. Install apitrace, and trace your
application with::

    apitrace trace <application>          # for GLX applications
    apitrace trace -a egl <application>   # for EGL applications

Once you've captured a trace, you can see what counters are available
and replay it while looking at some of those counters::

    apitrace replay <application>.trace --list-metrics

    apitrace replay <application>.trace --pdraw=GL_AMD_performance_monitor:QPU-total-clk-cycles-vertex-coord-shading

Multiple counters can be captured at once with commas separating them.

Once you've found what draw calls are surprisingly expensive in one of
the counters, you can work out which ones they were at the GL level by
opening the trace up in qapitrace and using ``^-G`` to jump to that call
number and ``^-L`` to look up the GL state at that call.

shader-db
---------

shader-db is often used as a proxy for real-world app performance when
working on the compiler in Mesa. On vc4, there is a lot of
state-dependent code in the shaders (like blending or vertex attribute
format handling), so the typical `shader-db
<https://gitlab.freedesktop.org/mesa/shader-db>`__ will miss important
areas for optimization. Instead, anholt wrote a `new one
<https://cgit.freedesktop.org/~anholt/shader-db-2/>`__ based on
apitraces. Once you have a collection of traces, starting from
`traces-db <https://gitlab.freedesktop.org/gfx-ci/tracie/traces-db/>`__,
you can test a compiler change in this shader-db with::

    ./run.py > before
    (cd ../mesa && make install)
    ./run.py > after
    ./report.py before after

Hardware Documentation
----------------------

For driver developers, Broadcom publicly released a `specification
<https://docs.broadcom.com/doc/12358545>`__ PDF for the 21553, which
is closely related to the vc4 GPU present in the Raspberry Pi. They
also released a `snapshot <https://docs.broadcom.com/docs/12358546>`__
of a corresponding Android graphics driver. That graphics driver was
ported to Raspbian for a demo, but was not expected to have ongoing
development.

Developers with NDA access with Broadcom or Raspberry Pi can
potentially get access to "simpenrose", the C software simulator of
the GPU.
The Mesa driver includes a backend (``vc4_simulator.c``) to
use simpenrose from an x86 system with the i915 graphics driver with
all of the vc4 rendering commands emulated on simpenrose and memcpyed
to the real GPU.