Logic-Enhanced Memory for 3D Graphics Tile-Based Rasterizers

D. Crisu, S.D. Cotofana and S. Vassiliadis
Computer Engineering Laboratory, EEMCS
Delft University of Technology, Delft, The Netherlands
Mekelweg 4, 2600 GA Delft, The Netherlands
E-mail: {dan, sorin, stavantis}@ce.et.tudelft.nl

P. Liuha
Nokia Research Center
Visiokatu-1, SF-33720
Tampere, Finland
E-mail: petri.liuha@nokia.com

Abstract—An efficient logic-enhanced memory architecture to accelerate primitive traversal in 3D graphics tile-based rasterizers is presented. The memory contains as many bits as there are pixels in the tile. During rasterization it is filled in several clock cycles by a systolic primitive scan-conversion subsystem with the stencil of the primitive: ones are written to memory locations that represent tile pixels covered by the primitive, and zeros are stored elsewhere. Once the shape of the primitive has been coded inside the memory, the internal logic of the memory can deliver, on request, up to four hit positions (positions inside the primitive) per clock cycle to the pixel processing pipelines, and it signals when all the hit positions have been consumed.

The logic-enhanced memory architecture presents the following benefits: it handles "ghost" primitives efficiently, hit positions are communicated in a spatial pattern that increases the hit ratio of texture caches in pull texture architectures, and hit positions can always be mapped to different memory banks in the Z-buffer or color-buffer breaking the "read-modify-write" dependency associated with depth test and color blending, thus allowing efficient pipelining. Hardware implementation in a typical 0.18µm process technology for a QVGA 3D graphics hardware accelerator with a tile size of 32 × 16 pixels has indicated that the memory can be clocked at 200MHz and consumes an area of 120000µm².

I. INTRODUCTION

The challenge posed by the formidable cost constraints on products for the mobile consumer market requires a new breed of graphics rendering hardware with very low power consumption and implementation costs which precludes the utilization of the advanced features and the high throughput achieved in high-end systems. To fulfill these design constraints, tiling or chunking architectures [1] were proposed as a way to save memory bandwidth on framebuffer accesses, since an external memory access typically is one of the most energy-consuming operations, and to counteract the huge increase in storage required by full-scene antialiasing.

In a tiling architecture, the screen is divided into a number of non-overlapping regions, or tiles, which are processed serially. Every frame, primitive geometry is first sorted by screen location and dumped into one or more bins, one bin per tile. Geometry that overlaps a tile boundary is referenced in each tile in which it is visible. When all the primitive geometry has been specified, it is rendered from bin N to tile N before moving to tile N + 1. The advantage of tile-based architectures is that all the data (colors, depth) can be kept in on-chip tile-sized buffers. Accesses to external memory are required only to dump the content of the tile color buffer to the global off-chip frame buffer, once all the primitive geometry for the currently processed tile in the current frame has been rasterized.

In traditional full-screen architectures, efficient rasterization algorithms [2][3] are based on edge functions [4] and rely on the following paradigm: while not all the positions inside the primitive are exhausted, do 1) save the rasterization context, 2) move to a new rasterization position, 3) test the values of the edge functions for that position to see if the position is a hit, 4) if it is a hit, communicate this hit position to the pixel processing pipelines and update the rasterization context, else restore the rasterization context, 5) based on the edge functions computed earlier, try to predict a new hit position.
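The hit test of step 3 can be illustrated with a minimal Python sketch of edge functions. The function names, the vertex order and the sample triangle are assumptions made for illustration; the traversal itself and the saving and restoring of the rasterization context are not modelled.

```python
def edge(ax, ay, bx, by, px, py):
    """Edge function of the directed edge A->B evaluated at P.

    The sign tells on which side of the edge P lies; zero means P is on the edge.
    """
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)


def is_hit(tri, px, py):
    """True when (px, py) is inside or on the border of the triangle.

    Assumes the vertices are ordered so that interior points give
    non-negative values for all three edge functions.
    """
    (x0, y0), (x1, y1), (x2, y2) = tri
    return (edge(x0, y0, x1, y1, px, py) >= 0
            and edge(x1, y1, x2, y2, px, py) >= 0
            and edge(x2, y2, x0, y0, px, py) >= 0)


tri = [(0, 0), (8, 0), (0, 8)]  # hypothetical triangle
print(is_hit(tri, 2, 2))  # position inside the triangle
print(is_hit(tri, 7, 7))  # position outside the triangle
```

Because each edge function is linear in the position, its value at a neighbouring position can be obtained by an addition, which is what makes the predict-and-test loop above attractive in hardware.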

The main difficulty in tile-based rasterization with this algorithm is finding the first hit position in the primitive to be rasterized. In our experiments the overhead can be 50%–300% (including testing whether any of the primitive vertices or the primitive center of gravity lies in the currently rasterized tile, so that it can serve as the starting rasterization position, or hit point). In addition, there is always overhead associated with "ghost" primitives (depicted in Figure 1): primitives that are assigned to the current tile although they have nothing in common with it. This is due to the simple software driver algorithm, which assigns primitives to tiles based on a primitive bounding box test; more complex tests in the software driver were envisaged that eliminate the "ghost" primitive problem completely, but they move the costs to software. In full-screen rasterization this overhead does not arise, because a starting point inside the primitive can always be found, e.g., the center of gravity.
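How a bounding box test produces "ghost" primitives can be seen in a small Python sketch. The 32 × 16 tile size is taken from the text; the function names and the sample triangle are hypothetical, and the brute-force coverage check (pixels at integer coordinates tested with edge functions) is an assumption that merely stands in for a real rasterizer.

```python
TILE_W, TILE_H = 32, 16  # tile size used in the paper


def tiles_by_bbox(tri):
    """Tiles (tx, ty) touched by the bounding box of the triangle."""
    xs = [x for x, _ in tri]
    ys = [y for _, y in tri]
    return [(tx, ty)
            for ty in range(min(ys) // TILE_H, max(ys) // TILE_H + 1)
            for tx in range(min(xs) // TILE_W, max(xs) // TILE_W + 1)]


def edge(a, b, p):
    """Edge function of the directed edge a->b evaluated at p."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def covers_tile(tri, tx, ty):
    """True when at least one pixel of tile (tx, ty) lies in the triangle."""
    a, b, c = tri
    for y in range(ty * TILE_H, (ty + 1) * TILE_H):
        for x in range(tx * TILE_W, (tx + 1) * TILE_W):
            w = (edge(a, b, (x, y)), edge(b, c, (x, y)), edge(c, a, (x, y)))
            # Accept either vertex winding.
            if all(v >= 0 for v in w) or all(v <= 0 for v in w):
                return True
    return False


tri = [(0, 0), (60, 0), (0, 30)]  # hypothetical right triangle
binned = tiles_by_bbox(tri)
ghosts = [t for t in binned if not covers_tile(tri, *t)]
print(binned)  # tiles the driver would assign the triangle to
print(ghosts)  # assigned tiles that contain no pixel of the triangle
```

In this example the bounding box spans four tiles, but the tile opposite the right angle lies entirely beyond the hypotenuse, so the triangle is a "ghost" primitive for that tile.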

In addition, several studies [5]–[7] have revealed that the primitive pixel rasterization order is crucial for low-cost tile-based architectures that do not have dedicated texture memories (pull texture architectures) and rely on a robust texture cache hit ratio to reduce the latency and energy consumption of texel fetches from the external system memory.

To overcome the previously mentioned problems, we propose an efficient logic-enhanced memory architecture to accelerate primitive traversal in 3D graphics tile-based rasterizers. The memory contains as many bits as there are pixels in the tile. During rasterization it is filled in several clock cycles by a systolic primitive scan-conversion subsystem with the stencil of the primitive: ones are written to memory locations that represent tile pixels covered by the primitive, and zeros to the rest. Once the shape of the primitive has been coded inside the memory, the internal logic of the memory is capable of delivering, on request, at least one and up to four hit positions per clock cycle to the pixel processing pipelines, while signaling when all the hit positions have been consumed or if none existed. The logic-enhanced memory architecture presents the following benefits: it handles “ghost” primitives efficiently, hit positions are communicated in a spatial pattern that increases the hit ratio of texture caches in pull texture architectures, and hit positions can always be mapped to different memory banks in the Z-buffer or color-buffer, breaking the “read-modify-write” dependency associated with depth test and color blending and thus allowing efficient pipelining. Hardware implementation in a typical 0.18µm process technology for a QVGA 3D graphics hardware accelerator with a tile size of 32 × 16 pixels has indicated that the memory can be clocked at 200MHz and consumes an area of 120000µm².
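The behaviour seen by the pixel processing pipelines can be approximated with a small Python model. It is only a sketch: the class and method names are hypothetical, the stencil is written by a plain loop rather than by the systolic scan-conversion subsystem, and quads are returned in simple row-major order, which is an assumption and not the order used by the actual memory.

```python
TILE_W, TILE_H = 32, 16  # one bit per tile pixel, 512 bits in total


class StencilMemory:
    def __init__(self):
        self.bits = [[0] * TILE_W for _ in range(TILE_H)]

    def write_stencil(self, inside):
        """Store 1 where inside(x, y) is true and 0 elsewhere."""
        for y in range(TILE_H):
            for x in range(TILE_W):
                self.bits[y][x] = 1 if inside(x, y) else 0

    def more_hits_left(self):
        """True while at least one hit position is still stored."""
        return any(any(row) for row in self.bits)

    def get_quad(self):
        """Return the hits of the first non-empty 2x2 quad and clear them.

        Returns an empty list when no hit positions are stored.
        """
        for qy in range(0, TILE_H, 2):
            for qx in range(0, TILE_W, 2):
                hits = [(x, y)
                        for y in (qy, qy + 1)
                        for x in (qx, qx + 1)
                        if self.bits[y][x]]
                if hits:
                    for x, y in hits:
                        self.bits[y][x] = 0  # read-and-clear
                    return hits
        return []


mem = StencilMemory()
mem.write_stencil(lambda x, y: x + y <= 3)  # hypothetical small triangle
requests = 0
while mem.more_hits_left():
    print(mem.get_quad())  # between one and four hit positions per request
    requests += 1
print(requests)
```

For a "ghost" primitive the stencil is all zeros, so `more_hits_left()` is false before the first request and no cycles are spent searching for a starting position.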

The rest of the paper is organized as follows. The logic-enhanced memory architecture is described in Section II. In Section III, hardware implementation results are presented. Finally, in Section IV, the conclusions are drawn.

II. LOGIC-ENHANCED MEMORY ARCHITECTURE

The quest for an efficient hardware rasterization algorithm has to start from finding a suitable pixel rasterization order. Figure 2 depicts the pixel grid of the tile around the origin of the tile coordinate system, together with a space-filling path, indicated with arrows, that starts from the origin. Space-filling paths are known to improve texel coherency, yielding a high hit ratio in texture caches [1]. In addition, if 2 × 2 regions of fragments can be generated during rasterization, they can be mapped to different memory banks A, B, C, D. Supposing that the shape or stencil of a triangle has been communicated, the bit for that location has to be reset in order [...]
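A minimal Python sketch of the bank assignment for 2 × 2 regions follows. The mapping of bank letters to the parity of the pixel coordinates is an assumption chosen for illustration; the text only states that the fragments of a 2 × 2 region can be mapped to different banks A, B, C, D.

```python
BANKS = "ABCD"


def bank_of(x, y):
    """Bank selected by the low bits of the pixel coordinates (assumed layout)."""
    return BANKS[(x & 1) | ((y & 1) << 1)]


def quad_banks(qx, qy):
    """Banks touched by the 2x2 quad whose first pixel is (2*qx, 2*qy)."""
    return [bank_of(2 * qx + dx, 2 * qy + dy) for dy in (0, 1) for dx in (0, 1)]


print(quad_banks(0, 0))
print(quad_banks(5, 3))
```

With such a mapping the four positions of any aligned quad fall into four different banks, so the four accesses never compete for the same bank.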

domino logic design described in [9]: the 4-bit group priority encoder has one-level lookahead and the 32-bit global priority encoder is constructed from 8-bit priority encoders connected through the third-level lookahead signals.

Figure 4 presents the Quad Cell circuit diagram, in which four bit cells are depicted. Each bit cell consists of a storage cell (transistors M1, M2, M3, M4 and the two cross-coupled inverters), one of the four parallel transistors (M8) of a distributed domino four-input OR gate (which additionally includes transistors M9 and M10), the conditional read circuitry (transistors M5, M6) and the reset transistor M7. For a write operation, when the signal GrpWrEn is asserted, one of the two storage nodes is pulled down and the other is pulled up. This requires the pull-up in both inverters to be weaker than the series pull-down transistors. The storage cell is write-only because the conditional read signal is formed internally, based on the content of the storage cell and the signal QuadRdClr formed outside the Quad Cell (see Figure 5). The role of the OR gate is to detect whether hit locations are stored in the Quad Cell; if so, the signal QuadRdClr will participate in priority encoding in the Group Cell. If, based on the priority encoding scheme, the quad contains the hit locations with the highest priority in the memory, the signal QuadRdClr is asserted for the read-and-clear operation on the quad bits. The static delay buffer DLY_BUFF ensures that there is enough separation in time between reading the quad bits (the precharged read bit line InTriR must discharge far enough to be detected as logic 0 by the charge-redistribution amplifiers) and clearing the quad bits. The size required for DLY_BUFF is small because the memory has only 32 word lines.

The role of the additional logic circuitry of the Group Cell, presented in Figure 5, is to pass the QuadNZ signals from the four Quad Cells forward to the group priority encoder. GrpNZ is connected to the lookahead output port (LA) of the group priority encoder and signals that at least one of the Quad Cells contains hit locations; it is an input to the global priority encoder. If the global priority encoder decides that the Group Cell has the highest priority among the Group Cells, the signal GrpPri is asserted. GrpPri is ANDed with the priority-encoded lines EP from the group priority encoder using four two-input domino AND gates, forming the QuadRdClr signals for the Quad Cells. In addition, two domino OAI gates are used to form the quad code: if a quad having the highest priority exists in the Group Cell, two precharged bit lines are discharged, broadcasting the quad code to the memory output.

Finally, Figure 6 presents the block diagram of the logic-enhanced memory. The signals from the Group Cells are fed into the global priority encoder, which decides which of the Group Cells has the highest priority. MoreNZQuadsLeft is connected to the lookahead output port (LA) of the global priority encoder and signals that at least one of the Quad Cells contains hit locations, thereby indicating outside the memory whether any hit locations are left. The quad code returned is used for multiplexing the highest priority quad bits from the highest priority Group Cell, and logic similar to that presented in Figure 5 is used to generate the block code and group code outputs. The memory input interface contains the GetNZQuad signal, which has to be asserted in order for a quad with hit locations to be read out. The rest of the circuitry is identical to that of any CMOS SRAM read/write memory and thus is not described in further detail.
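The two-level selection performed by the group and global priority encoders can be sketched behaviourally in Python. The organization as 32 Group Cells of four Quad Cells with four bits each is inferred from the 32-bit global and 4-bit group priority encoders and the 32 × 16 tile; the function names, the rule that the lowest index has the highest priority, and the evaluation of the "more left" flag after the clear are assumptions made for illustration only. No circuit-level behaviour (domino precharge, timing) is modelled.

```python
N_GROUPS, QUADS_PER_GROUP = 32, 4  # 32 groups x 4 quads x 4 bits = 512 bits


def priority_encode(flags):
    """Index of the first set flag (assumed highest priority), or None."""
    for i, flag in enumerate(flags):
        if flag:
            return i
    return None


def get_nz_quad(mem):
    """Serve one read request; mem[g][q] is a list of four bits.

    Returns (group, quad, bits, more_left), or (None, None, None, False)
    when no quad holds a hit location.
    """
    quad_nz = [[any(quad) for quad in group] for group in mem]  # QuadNZ
    grp_nz = [any(flags) for flags in quad_nz]                  # GrpNZ
    g = priority_encode(grp_nz)                                 # global encoder
    if g is None:
        return None, None, None, False
    q = priority_encode(quad_nz[g])                             # group encoder
    bits = mem[g][q][:]
    mem[g][q] = [0, 0, 0, 0]                                    # clear after read
    more_left = any(any(quad) for group in mem for quad in group)
    return g, q, bits, more_left


mem = [[[0, 0, 0, 0] for _ in range(QUADS_PER_GROUP)] for _ in range(N_GROUPS)]
mem[3][2] = [1, 0, 1, 1]  # hypothetical stencil content
mem[7][0] = [0, 1, 0, 0]
print(get_nz_quad(mem))
print(get_nz_quad(mem))
print(get_nz_quad(mem))
```

Each request returns the codes of the selected group and quad together with the four quad bits, from which the consumer can reconstruct up to four hit positions.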

III. HARDWARE IMPLEMENTATION RESULTS

The logic-enhanced memory was designed at the physical level in a commercial 0.18µm IC manufacturing technology. After the parasitics were extracted from the layout, the annotated circuits composing the critical path (starting in Quad Cell 11 of Group Cell 31, going through the global priority encoder, then back into the originating cell, then to the Quad_11 Bit Amplifiers, and finally reaching the Quad output port) were simulated using the HSPICE circuit simulator. The results are reported in Table I. The critical path latency translates into a maximum clock frequency of 200 MHz, assuming that the precharge and the evaluation phases each take half the clock cycle.
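The relation between critical path latency and clock frequency can be written out as a short Python check. It assumes that the evaluation phase must cover the whole critical path and occupies exactly half of the clock period; the latency budget printed below is derived from the reported 200 MHz under that assumption and is not a value quoted from Table I.

```python
def max_clock_hz(critical_path_s):
    """Maximum clock frequency when evaluation takes half of the period."""
    period_s = 2.0 * critical_path_s  # precharge half plus evaluation half
    return 1.0 / period_s


def critical_path_budget_s(clock_hz):
    """Largest critical path latency that still fits the evaluation half."""
    return 0.5 / clock_hz


print(critical_path_budget_s(200e6))  # budget in seconds at 200 MHz
print(max_clock_hz(2.5e-9))           # frequency in Hz for a 2.5 ns path
```

Under this reading, a 200 MHz clock leaves about 2.5 ns for the evaluation of the critical path.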

IV. CONCLUSIONS

An efficient logic-enhanced memory architecture to accelerate primitive traversal in 3D graphics tile-based rasterizers was presented. The memory contains as many bits as there are pixels in the tile. During rasterization it is filled in several clock cycles by a systolic primitive scan-conversion subsystem with the stencil of the primitive: ones are written to memory locations that represent tile pixels covered by the primitive, and zeros to the rest of the locations. Once the shape of the primitive has been coded inside the memory, the internal logic of the memory is capable of delivering, on request, at least one and up to four hit positions (positions inside the primitive) per clock cycle to the pixel processing pipelines, signaling when all the hit positions have been consumed or if none initially existed.

The logic-enhanced memory architecture presents the following benefits: it handles “ghost” primitives efficiently, hit positions are communicated in a spatial pattern that increases the hit ratio of texture caches in pull texture architectures, and hit positions can always be mapped to different memory banks in the Z-buffer or color-buffer, breaking the “read-modify-write” dependency associated with depth test and color blending and thus allowing efficient pipelining. Hardware implementation in a typical 0.18µm process technology for a QVGA 3D graphics hardware accelerator with a tile size of 32 × 16 pixels has indicated that the memory can be clocked at 200MHz and consumes an area of 120000µm².

REFERENCES