1/* 2 * Copyright (C) 2019 Collabora, Ltd. 3 * 4 * Permission is hereby granted, free of charge, to any person obtaining a 5 * copy of this software and associated documentation files (the "Software"), 6 * to deal in the Software without restriction, including without limitation 7 * the rights to use, copy, modify, merge, publish, distribute, sublicense, 8 * and/or sell copies of the Software, and to permit persons to whom the 9 * Software is furnished to do so, subject to the following conditions: 10 * 11 * The above copyright notice and this permission notice (including the next 12 * paragraph) shall be included in all copies or substantial portions of the 13 * Software. 14 * 15 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL 18 * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 * SOFTWARE. 22 * 23 * Authors: 24 * Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com> 25 */ 26 27#include "util/u_math.h" 28#include "util/macros.h" 29#include "pan_device.h" 30#include "pan_encoder.h" 31#include "panfrost-quirks.h" 32 33/* Mali GPUs are tiled-mode renderers, rather than immediate-mode. 34 * Conceptually, the screen is divided into 16x16 tiles. Vertex shaders run. 35 * Then, a fixed-function hardware block (the tiler) consumes the gl_Position 36 * results. For each triangle specified, it marks each containing tile as 37 * containing that triangle. This set of "triangles per tile" form the "polygon 38 * list". Finally, the rasterization unit consumes the polygon list to invoke 39 * the fragment shader. 40 * 41 * In practice, it's a bit more complicated than this. On Midgard chips with an 42 * "advanced tiling unit" (all except T720/T820/T830), 16x16 is the logical 43 * tile size, but Midgard features "hierarchical tiling", where power-of-two 44 * multiples of the base tile size can be used: hierarchy level 0 (16x16), 45 * level 1 (32x32), level 2 (64x64), per public information about Midgard's 46 * tiling. In fact, tiling goes up to 4096x4096 (!), although in practice 47 * 128x128 is the largest usually used (though higher modes are enabled). The 48 * idea behind hierarchical tiling is to use low tiling levels for small 49 * triangles and high levels for large triangles, to minimize memory bandwidth 50 * and repeated fragment shader invocations (the former issue inherent to 51 * immediate-mode rendering and the latter common in traditional tilers). 52 * 53 * The tiler itself works by reading varyings in and writing a polygon list 54 * out. Unfortunately (for us), both of these buffers are managed in main 55 * memory; although they ideally will be cached, it is the drivers' 56 * responsibility to allocate these buffers. Varying buffer allocation is 57 * handled elsewhere, as it is not tiler specific; the real issue is allocating 58 * the polygon list. 59 * 60 * This is hard, because from the driver's perspective, we have no information 61 * about what geometry will actually look like on screen; that information is 62 * only gained from running the vertex shader. (Theoretically, we could run the 63 * vertex shaders in software as a prepass, or in hardware with transform 64 * feedback as a prepass, but either idea is ludicrous on so many levels). 65 * 66 * Instead, Mali uses a bit of a hybrid approach, splitting the polygon list 67 * into three distinct pieces. First, the driver statically determines which 68 * tile hierarchy levels to use (more on that later). At this point, we know the 69 * framebuffer dimensions and all the possible tilings of the framebuffer, so 70 * we know exactly how many tiles exist across all hierarchy levels. The first 71 * piece of the polygon list is the header, which is exactly 8 bytes per tile, 72 * plus padding and a small 64-byte prologue. (If that doesn't remind you of 73 * AFBC, it should. See pan_afbc.c for some fun parallels). The next part is 74 * the polygon list body, which seems to contain 512 bytes per tile, again 75 * across every level of the hierarchy. These two parts form the polygon list 76 * buffer. This buffer has a statically determinable size, approximately equal 77 * to the # of tiles across all hierarchy levels * (8 bytes + 512 bytes), plus 78 * alignment / minimum restrictions / etc. 79 * 80 * The third piece is the easy one (for us): the tiler heap. In essence, the 81 * tiler heap is a gigantic slab that's as big as could possibly be necessary 82 * in the worst case imaginable. Just... a gigantic allocation that we give a 83 * start and end pointer to. What's the catch? The tiler heap is lazily 84 * allocated; that is, a huge amount of memory is _reserved_, but only a tiny 85 * bit is actually allocated upfront. The GPU just keeps using the 86 * unallocated-but-reserved portions as it goes along, generating page faults 87 * if it goes beyond the allocation, and then the kernel is instructed to 88 * expand the allocation on page fault (known in the vendor kernel as growable 89 * memory). This is quite a bit of bookkeeping of its own, but that task is 90 * pushed to kernel space and we can mostly ignore it here, just remembering to 91 * set the GROWABLE flag so the kernel actually uses this path rather than 92 * allocating a gigantic amount up front and burning a hole in RAM. 93 * 94 * As far as determining which hierarchy levels to use, the simple answer is 95 * that right now, we don't. In the tiler configuration fields (consistent from 96 * the earliest Midgard's SFBD through the latest Bifrost traces we have), 97 * there is a hierarchy_mask field, controlling which levels (tile sizes) are 98 * enabled. Ideally, the hierarchical tiling dream -- mapping big polygons to 99 * big tiles and small polygons to small tiles -- would be realized here as 100 * well. As long as there are polygons at all needing tiling, we always have to 101 * have big tiles available, in case there are big polygons. But we don't 102 * necessarily need small tiles available. Ideally, when there are small 103 * polygons, small tiles are enabled (to avoid waste from putting small 104 * triangles in the big tiles); when there are not, small tiles are disabled to 105 * avoid enabling more levels than necessary, which potentially costs in memory 106 * bandwidth / power / tiler performance. 107 * 108 * Of course, the driver has to figure this out statically. When tile 109 * hiearchies are actually established, this occurs by the tiler in 110 * fixed-function hardware, after the vertex shaders have run and there is 111 * sufficient information to figure out the size of triangles. The driver has 112 * no such luxury, again barring insane hacks like additionally running the 113 * vertex shaders in software or in hardware via transform feedback. Thus, for 114 * the driver, we need a heuristic approach. 115 * 116 * There are lots of heuristics to guess triangle size statically you could 117 * imagine, but one approach shines as particularly simple-stupid: assume all 118 * on-screen triangles are equal size and spread equidistantly throughout the 119 * screen. Let's be clear, this is NOT A VALID ASSUMPTION. But if we roll with 120 * it, then we see: 121 * 122 * Triangle Area = (Screen Area / # of triangles) 123 * = (Width * Height) / (# of triangles) 124 * 125 * Or if you prefer, we can also make a third CRAZY assumption that we only draw 126 * right triangles with edges parallel/perpendicular to the sides of the screen 127 * with no overdraw, forming a triangle grid across the screen: 128 * 129 * |--w--| 130 * _____ | 131 * | /| /| | 132 * |/_|/_| h 133 * | /| /| | 134 * |/_|/_| | 135 * 136 * Then you can use some middle school geometry and algebra to work out the 137 * triangle dimensions. I started working on this, but realised I didn't need 138 * to to make my point, but couldn't bare to erase that ASCII art. Anyway. 139 * 140 * POINT IS, by considering the ratio of screen area and triangle count, we can 141 * estimate the triangle size. For a small size, use small bins; for a large 142 * size, use large bins. Intuitively, this metric makes sense: when there are 143 * few triangles on a large screen, you're probably compositing a UI and 144 * therefore the triangles are large; when there are a lot of triangles on a 145 * small screen, you're probably rendering a 3D mesh and therefore the 146 * triangles are tiny. (Or better said -- there will be tiny triangles, even if 147 * there are also large triangles. There have to be unless you expect crazy 148 * overdraw. Generally, it's better to allow more small bin sizes than 149 * necessary than not allow enough.) 150 * 151 * From this heuristic (or whatever), we determine the minimum allowable tile 152 * size, and we use that to decide the hierarchy masking, selecting from the 153 * minimum "ideal" tile size to the maximum tile size (2048x2048 in practice). 154 * 155 * Once we have that mask and the framebuffer dimensions, we can compute the 156 * size of the statically-sized polygon list structures, allocate them, and go! 157 * 158 * ----- 159 * 160 * On T720, T820, and T830, there is no support for hierarchical tiling. 161 * Instead, the hardware allows the driver to select the tile size dynamically 162 * on a per-framebuffer basis, including allowing rectangular/non-square tiles. 163 * Rules for tile size selection are as follows: 164 * 165 * - Dimensions must be powers-of-two. 166 * - The smallest tile is 16x16. 167 * - The tile width/height is at most the framebuffer w/h (clamp up to 16 pix) 168 * - There must be no more than 64 tiles in either dimension. 169 * 170 * Within these constraints, the driver is free to pick a tile size according 171 * to some heuristic, similar to units with an advanced tiling unit. 172 * 173 * To pick a size without any heuristics, we may satisfy the constraints by 174 * defaulting to 16x16 (a power-of-two). This fits the minimum. For the size 175 * constraint, consider: 176 * 177 * # of tiles < 64 178 * ceil (fb / tile) < 64 179 * (fb / tile) <= (64 - 1) 180 * tile <= fb / (64 - 1) <= next_power_of_two(fb / (64 - 1)) 181 * 182 * Hence we clamp up to align_pot(fb / (64 - 1)). 183 184 * Extending to use a selection heuristic left for future work. 185 * 186 * Once the tile size (w, h) is chosen, we compute the hierarchy "mask": 187 * 188 * hierarchy_mask = (log2(h / 16) << 6) | log2(w / 16) 189 * 190 * Of course with no hierarchical tiling, this is not a mask; it's just a field 191 * specifying the tile size. But I digress. 192 * 193 * We also compute the polgon list sizes (with framebuffer size W, H) as: 194 * 195 * full_size = 0x200 + 0x200 * ceil(W / w) * ceil(H / h) 196 * offset = 8 * ceil(W / w) * ceil(H / h) 197 * 198 * It further appears necessary to round down offset to the nearest 0x200. 199 * Possibly we would also round down full_size to the nearest 0x200 but 200 * full_size/0x200 = (1 + ceil(W / w) * ceil(H / h)) is an integer so there's 201 * nothing to do. 202 */ 203 204/* Hierarchical tiling spans from 16x16 to 4096x4096 tiles */ 205 206#define MIN_TILE_SIZE 16 207#define MAX_TILE_SIZE 4096 208 209/* Constants as shifts for easier power-of-two iteration */ 210 211#define MIN_TILE_SHIFT util_logbase2(MIN_TILE_SIZE) 212#define MAX_TILE_SHIFT util_logbase2(MAX_TILE_SIZE) 213 214/* The hierarchy has a 64-byte prologue */ 215#define PROLOGUE_SIZE 0x40 216 217/* For each tile (across all hierarchy levels), there is 8 bytes of header */ 218#define HEADER_BYTES_PER_TILE 0x8 219 220/* Likewise, each tile per level has 512 bytes of body */ 221#define FULL_BYTES_PER_TILE 0x200 222 223/* If the width-x-height framebuffer is divided into tile_size-x-tile_size 224 * tiles, how many tiles are there? Rounding up in each direction. For the 225 * special case of tile_size=16, this aligns with the usual Midgard count. 226 * tile_size must be a power-of-two. Not really repeat code from AFBC/checksum, 227 * because those care about the stride (not just the overall count) and only at 228 * a a fixed-tile size (not any of a number of power-of-twos) */ 229 230static unsigned 231pan_tile_count(unsigned width, unsigned height, unsigned tile_width, unsigned tile_height) 232{ 233 unsigned aligned_width = ALIGN_POT(width, tile_width); 234 unsigned aligned_height = ALIGN_POT(height, tile_height); 235 236 unsigned tile_count_x = aligned_width / tile_width; 237 unsigned tile_count_y = aligned_height / tile_height; 238 239 return tile_count_x * tile_count_y; 240} 241 242/* For `masked_count` of the smallest tile sizes masked out, computes how the 243 * size of the polygon list header. We iterate the tile sizes (16x16 through 244 * 2048x2048). For each tile size, we figure out how many tiles there are at 245 * this hierarchy level and therefore many bytes this level is, leaving us with 246 * a byte count for each level. We then just sum up the byte counts across the 247 * levels to find a byte count for all levels. */ 248 249static unsigned 250panfrost_hierarchy_size( 251 unsigned width, 252 unsigned height, 253 unsigned mask, 254 unsigned bytes_per_tile) 255{ 256 unsigned size = PROLOGUE_SIZE; 257 258 /* Iterate hierarchy levels */ 259 260 for (unsigned b = 0; b < (MAX_TILE_SHIFT - MIN_TILE_SHIFT); ++b) { 261 /* Check if this level is enabled */ 262 if (!(mask & (1 << b))) 263 continue; 264 265 /* Shift from a level to a tile size */ 266 unsigned tile_size = (1 << b) * MIN_TILE_SIZE; 267 268 unsigned tile_count = pan_tile_count(width, height, tile_size, tile_size); 269 unsigned level_count = bytes_per_tile * tile_count; 270 271 size += level_count; 272 } 273 274 /* This size will be used as an offset, so ensure it's aligned */ 275 return ALIGN_POT(size, 0x200); 276} 277 278/* Implement the formula: 279 * 280 * 0x200 + bytes_per_tile * ceil(W / w) * ceil(H / h) 281 * 282 * rounding down the answer to the nearest 0x200. This is used to compute both 283 * header and body sizes for GPUs without hierarchical tiling. Essentially, 284 * computing a single hierarchy level, since there isn't any hierarchy! 285 */ 286 287static unsigned 288panfrost_flat_size(unsigned width, unsigned height, unsigned dim, unsigned bytes_per_tile) 289{ 290 /* First, extract the tile dimensions */ 291 292 unsigned tw = (1 << (dim & 0b111)) * 8; 293 unsigned th = (1 << ((dim & (0b111 << 6)) >> 6)) * 8; 294 295 /* tile_count is ceil(W/w) * ceil(H/h) */ 296 unsigned raw = pan_tile_count(width, height, tw, th) * bytes_per_tile; 297 298 /* Round down and add offset */ 299 return 0x200 + ((raw / 0x200) * 0x200); 300} 301 302/* Given a hierarchy mask and a framebuffer size, compute the header size */ 303 304unsigned 305panfrost_tiler_header_size(unsigned width, unsigned height, unsigned mask, bool hierarchy) 306{ 307 if (hierarchy) 308 return panfrost_hierarchy_size(width, height, mask, HEADER_BYTES_PER_TILE); 309 else 310 return panfrost_flat_size(width, height, mask, HEADER_BYTES_PER_TILE); 311} 312 313/* The combined header/body is sized similarly (but it is significantly 314 * larger), except that it can be empty when the tiler disabled, rather than 315 * getting clamped to a minimum size. 316 */ 317 318unsigned 319panfrost_tiler_full_size(unsigned width, unsigned height, unsigned mask, bool hierarchy) 320{ 321 if (hierarchy) 322 return panfrost_hierarchy_size(width, height, mask, FULL_BYTES_PER_TILE); 323 else 324 return panfrost_flat_size(width, height, mask, FULL_BYTES_PER_TILE); 325} 326 327/* On GPUs without hierarchical tiling, we choose a tile size directly and 328 * stuff it into the field otherwise known as hierarchy mask (not a mask). */ 329 330static unsigned 331panfrost_choose_tile_size( 332 unsigned width, unsigned height, unsigned vertex_count) 333{ 334 /* Figure out the ideal tile size. Eventually a heuristic should be 335 * used for this */ 336 337 unsigned best_w = 16; 338 unsigned best_h = 16; 339 340 /* Clamp so there are less than 64 tiles in each direction */ 341 342 best_w = MAX2(best_w, util_next_power_of_two(width / 63)); 343 best_h = MAX2(best_h, util_next_power_of_two(height / 63)); 344 345 /* We have our ideal tile size, so encode */ 346 347 unsigned exp_w = util_logbase2(best_w / 16); 348 unsigned exp_h = util_logbase2(best_h / 16); 349 350 return exp_w | (exp_h << 6); 351} 352 353/* In the future, a heuristic to choose a tiler hierarchy mask would go here. 354 * At the moment, we just default to 0xFF, which enables all possible hierarchy 355 * levels. Overall this yields good performance but presumably incurs a cost in 356 * memory bandwidth / power consumption / etc, at least on smaller scenes that 357 * don't really need all the smaller levels enabled */ 358 359unsigned 360panfrost_choose_hierarchy_mask( 361 unsigned width, unsigned height, 362 unsigned vertex_count, bool hierarchy) 363{ 364 /* If there is no geometry, we don't bother enabling anything */ 365 366 if (!vertex_count) 367 return 0x00; 368 369 if (!hierarchy) 370 return panfrost_choose_tile_size(width, height, vertex_count); 371 372 /* Otherwise, default everything on. TODO: Proper tests */ 373 374 return 0xFF; 375} 376