Conference paper

IBM Telum II: Next Generation 5.5GHz Microprocessor with On-Die Data Processing Unit and Improved AI Accelerator

Abstract

The IBM Z microprocessor, Telum II, has been redesigned for the zNext system to improve performance and system capacity over the previous z16 system [1]. The system topology consists of four Dual-Chip Modules (DCMs) per drawer, each composed of two central-processor (CP) chips. The system can be configured with up to four drawers, for a total of 32 CP chips in a fully coherent shared-memory system. The CP (Fig. 2.2.7) is a 600mm² die containing 43B transistors and is designed in Samsung 5nm bulk technology [2]. It contains over 24 miles of wire and 165B vias spread across 18 layers of metal: 8 narrow-width layers for local interconnect, 8 medium-width high-performance layers, and 2 ultra-thick layers for off-chip signal routing and power/clock distribution. The combination of die-size increase and library improvements allowed us to keep 8 cores per CP while adding a new Data Processing Unit (DPU) onto the chip. Each CP operates at 5.5GHz. Each core has a 128KB L1 instruction cache and a 128KB L1 data cache.

Each CP also has 2 PCIe Gen5 ×16 interfaces, an M-BUS interface to the other CP on the DCM, and an X-BUS interface to every CP in the other 3 DCMs in the drawer. One A-BUS interface connects 6 of the 8 CPs in each drawer to the other drawers in the system. The clock network is designed with a single resonant mesh covering most of the chip and three small asynchronous non-resonant meshes for the memory and PCIe interfaces. In addition, Telum II adds an on-chip voltage control loop for improved dynamic voltage management, which maintains performance without requiring higher voltages across all workloads [3].

Key system-capacity and performance improvements came from enhancements to the core and the increased cache size. Telum II increases the number of L2 cache instances from 8 to 10 and uses Samsung's high-density SRAM cell to grow each L2 instance from 32MB on Telum to 36MB, increasing total on-chip cache capacity by roughly 40%.
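The topology and cache figures above can be cross-checked with a short arithmetic sketch; all inputs are taken from the text, and the names are illustrative only:

```python
# Back-of-the-envelope check of the Telum II topology and cache figures
# quoted above. All constants come from the text; nothing here is from
# the actual design database.

DCMS_PER_DRAWER = 4   # four Dual-Chip Modules per drawer
CPS_PER_DCM = 2       # two CP chips per DCM
DRAWERS_MAX = 4       # up to four drawers per system

cps_per_drawer = DCMS_PER_DRAWER * CPS_PER_DCM   # 8 CP chips per drawer
cps_per_system = cps_per_drawer * DRAWERS_MAX    # 32 CP chips max

# L2 cache: instance count and per-instance size, Telum vs. Telum II
L2_INSTANCES_TELUM, L2_MB_TELUM = 8, 32
L2_INSTANCES_TELUM2, L2_MB_TELUM2 = 10, 36

l3_mb_telum = L2_INSTANCES_TELUM * L2_MB_TELUM       # 256MB virtual L3
l3_mb_telum2 = L2_INSTANCES_TELUM2 * L2_MB_TELUM2    # 360MB virtual L3
growth_pct = 100 * (l3_mb_telum2 - l3_mb_telum) / l3_mb_telum

print(cps_per_system)        # 32
print(l3_mb_telum2)          # 360
print(round(growth_pct, 1))  # 40.6
```

The ~40% figure only emerges from the combination of more instances (8 to 10) and larger instances (32MB to 36MB); neither change alone accounts for it.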
Each of the 8 processor cores and the DPU has a private 36MB L2 cache, and an extra floating L2 brings the total to 10 instances, all fully connected by a 352GB/s ring. The on-chip shared virtual L3 cache thus increases from 256MB to 360MB. A fully populated drawer now contains 2.88GB of virtual L4 cache, up from 2GB on z16. In addition to the high-density SRAM cell, the cache growth was enabled by a 20% core-area shrink from Telum, achieved through microarchitecture enhancements and technology scaling. Beyond shrinking, the core adds to overall system performance through enhancements to branch prediction, I-cache prefetching, additional rename registers, and TLB optimization. The core physical design is constructed of 7 large, fully abutted floorplanned blocks in which the logic boundaries are removed and the design is restructured. This methodology removes 2 levels of physical hierarchy, leading to efficient area and metal usage in the core [4,5].
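The drawer-level virtual L4 figure follows directly from the per-chip virtual L3; a minimal sketch of the arithmetic, using only values stated above:

```python
# Illustrative check of the 2.88GB virtual L4 figure quoted above.
# Values come from the text; decimal (1000-based) GB matches the
# 360MB x 8 = 2.88GB arithmetic the paper implies.

CPS_PER_DRAWER = 8      # 4 DCMs x 2 CP chips
VIRTUAL_L3_MB = 360     # per-CP shared virtual L3 (10 x 36MB L2 instances)

l4_mb = CPS_PER_DRAWER * VIRTUAL_L3_MB   # 2880 MB per fully populated drawer
l4_gb = l4_mb / 1000                     # 2.88 GB

print(l4_gb)  # 2.88
```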