Peek inside the package of AMD’s or Nvidia’s most advanced AI products and you’ll find a familiar arrangement: The GPU is flanked on two sides by high-bandwidth memory (HBM), the most advanced memory chips available. These memory chips are placed as close as possible to the computing chips they serve in order to cut down on the biggest bottleneck in AI computing: the energy and delay in getting billions of bits per second from memory into logic. But what if you could bring computing and memory even closer together by stacking the HBM on top of the GPU?
Imec recently explored this scenario using advanced thermal simulations, and the answer, delivered in December at the 2025 IEEE International Electron Devices Meeting (IEDM), was a bit grim. 3D stacking doubles the operating temperature inside the GPU, rendering it inoperable. But the team, led by Imec’s James Myers, didn’t just give up. They identified several engineering optimizations that ultimately could whittle down the temperature difference to nearly zero.
Imec started with a thermal simulation of a GPU and four HBM dies as you’d find them today, inside what’s called a 2.5D package. That is, both the GPU and the HBM sit on a substrate called an interposer, with minimal distance between them. The two types of chips are linked by thousands of micrometer-scale copper interconnects built into the interposer’s surface. In this configuration, the model GPU consumes 414 watts and reaches a peak temperature of just under 70 °C, typical for a processor. The memory chips consume an additional 40 W or so and get somewhat less hot. The heat is removed from the top of the package by the kind of liquid cooling that’s become common in new AI data centers.
“While this approach is currently used, it doesn’t scale well for the future, especially since it blocks two sides of the GPU, limiting future GPU-to-GPU connections inside the package,” Yukai Chen, a senior researcher at Imec, told engineers at IEDM. In contrast, “the 3D approach leads to higher bandwidth, lower latency… the most important improvement is the package footprint.”
Unfortunately, as Chen and his colleagues found, the most straightforward version of stacking (simply putting the HBM chips on top of the GPU and adding a block of blank silicon to fill in a gap at the center) shot temperatures in the GPU up to a scorching 140 °C, well past a typical GPU’s 80 °C limit.
System Technology Co-optimization
The Imec team set about trying a number of technology and system optimizations aimed at lowering the temperature. The first thing they tried was to throw out a layer of silicon that was now redundant. To understand why, you first have to get a grip on what HBM really is.
This form of memory is a stack of as many as 12 high-density DRAM dies. Each has been thinned down to tens of micrometers and is shot through with vertical connections. These thinned dies are stacked one atop another and connected by tiny balls of solder, and this stack of memory is vertically connected to another piece of silicon, called the base die. The base die is a logic chip designed to multiplex the data, packing it into the limited number of wires that can fit across the millimeter-scale gap to the GPU.
But with the HBM now on top of the GPU, there’s no need for such a data pump. Bits can flow directly into the processor without regard for how many wires happen to fit along the side of the chip. Of course, this change means moving the memory control circuits from the base die into the GPU and therefore changing the processor’s floorplan, says Myers. But there should be ample room, he suggests, because the GPU will no longer need the circuits used to demultiplex incoming memory data.
Cutting out this memory middleman cooled things down by only a little less than 4 °C. But, importantly, it should massively boost the bandwidth between the memory and the processor, which is crucial for another optimization the team tried: slowing down the GPU.
That might seem contrary to the whole purpose of better AI computing, but in this case it’s an advantage. Large language models are what are called “memory bound” problems. That is, memory bandwidth is the main limiting factor. But Myers’s team estimated that 3D stacking HBM on the GPU would boost bandwidth fourfold. With that added headroom, even slowing the GPU’s clock by 50 percent still leads to a performance win, while cooling everything down by more than 20 °C. In practice, the processor might not need to be slowed down quite that much. Increasing the clock frequency to 70 percent led to a GPU that was only 1.7 °C warmer, Myers says.
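The reasoning behind that trade-off can be sketched with a simple roofline-style estimate, in which attainable throughput is the lesser of a compute ceiling and a memory ceiling. This is only an illustration, not Imec’s model: the peak compute rate, baseline bandwidth, and arithmetic intensity below are hypothetical placeholders, and the sketch assumes compute throughput scales linearly with clock frequency. Only the fourfold bandwidth gain and the 50 and 70 percent clock settings come from the reported results.

```python
def attainable_flops(peak_flops, bandwidth_bytes_per_s, intensity_flop_per_byte,
                     clock_scale=1.0):
    """Roofline estimate: throughput is capped by whichever ceiling is lower."""
    compute_roof = peak_flops * clock_scale  # assumes linear scaling with clock
    memory_roof = bandwidth_bytes_per_s * intensity_flop_per_byte
    return min(compute_roof, memory_roof)


# Hypothetical numbers chosen only to make the workload memory bound.
PEAK_FLOPS = 1.0e15   # FLOP/s at full clock (assumption)
BANDWIDTH = 4.0e12    # bytes/s in the 2.5D package (assumption)
INTENSITY = 50.0      # FLOP per byte moved (assumption)

# 2.5D baseline: full clock, baseline bandwidth.
baseline = attainable_flops(PEAK_FLOPS, BANDWIDTH, INTENSITY)

# 3D stack: fourfold bandwidth (reported estimate), clock at 50% and 70%.
stacked_half_clock = attainable_flops(PEAK_FLOPS, 4 * BANDWIDTH, INTENSITY,
                                      clock_scale=0.5)
stacked_70_clock = attainable_flops(PEAK_FLOPS, 4 * BANDWIDTH, INTENSITY,
                                    clock_scale=0.7)

if __name__ == "__main__":
    print(f"baseline:          {baseline:.3e} FLOP/s")
    print(f"3D, 50% clock:     {stacked_half_clock:.3e} FLOP/s "
          f"({stacked_half_clock / baseline:.2f}x)")
    print(f"3D, 70% clock:     {stacked_70_clock:.3e} FLOP/s "
          f"({stacked_70_clock / baseline:.2f}x)")
```

With these placeholder values the baseline sits on the memory ceiling, so quadrupling bandwidth lifts that ceiling far enough that a half-speed GPU still comes out ahead. How large the win is in practice depends entirely on the real workload’s arithmetic intensity.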
Optimized HBM
Another big drop in temperature came from making the HBM stack and the area around it more thermally conductive. That included merging the four stacks into two wider stacks, thereby eliminating a heat-trapping region; thinning the top die of the stack, which is usually thicker; and filling in more of the space around the HBM with blank pieces of silicon to conduct more heat.
With all of that, the stack now operated at about 88 °C. One final optimization brought things back to near 70 °C. Usually, some 95 percent of a chip’s heat is removed from the top of the package, where in this case water carries the heat away. But adding similar cooling to the underside as well drove the stacked chips down a final 17 °C.
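As a quick check on how those figures fit together, the reported temperatures can be tallied in a few lines. This is plain bookkeeping on the numbers quoted in the article, not a thermal model; the 2.5D baseline is rounded to 70 °C here even though the simulation reported a value just under that, and the variable names are illustrative.

```python
# Temperatures in degrees Celsius, as reported from the IEDM presentation.
NAIVE_STACK_C = 140.0        # HBM placed directly on the GPU, no optimizations
GPU_LIMIT_C = 80.0           # typical GPU limit cited in the article
BASELINE_2_5D_C = 70.0       # reported as "just under 70"; rounded (assumption)
AFTER_HBM_OPTIMIZATION_C = 88.0
DOUBLE_SIDED_COOLING_DROP_C = 17.0

# Final temperature once cooling is added to the underside of the package.
final_c = AFTER_HBM_OPTIMIZATION_C - DOUBLE_SIDED_COOLING_DROP_C

# Margin below the GPU limit, gap to the 2.5D baseline, and total reduction.
margin_to_limit_c = GPU_LIMIT_C - final_c
gap_to_baseline_c = final_c - BASELINE_2_5D_C
total_reduction_c = NAIVE_STACK_C - final_c

if __name__ == "__main__":
    print(f"final temperature:      {final_c:.0f} C")
    print(f"margin below GPU limit: {margin_to_limit_c:.0f} C")
    print(f"above 2.5D baseline:    {gap_to_baseline_c:.0f} C")
    print(f"total reduction:        {total_reduction_c:.0f} C")
```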
Although the research presented at IEDM shows it might be possible, HBM-on-GPU isn’t necessarily the best choice, Myers says. “We’re simulating other system configurations to help build confidence that this is or isn’t the best choice,” he says. “GPU-on-HBM is of interest to some in industry,” because it puts the GPU closer to the cooling. But it would likely be a more complex design, because the GPU’s power and data would have to flow vertically through the HBM to reach it.

