A roadmap for the future chip coolant temperature

The adoption of liquid cooling has increased dramatically in recent years, driven by the rapid rise in graphics processing unit/application-specific integrated circuit (GPU/ASIC) power consumption for AI/ML workloads. Adoption by cloud service providers (CSPs) and OEMs also rose quickly as CPU power consumption climbed, first steadily to an estimated 100W over a decade and then to more than 400W in the last five years. GPU/ASIC power has grown far faster than CPU power, rising by hundreds of watts annually. Silicon vendors such as NVIDIA, AMD, and Intel have published GPU power figures greater than 1kW. In addition, CSPs such as Microsoft, Google, Amazon, and Meta are designing their own silicon for AI/ML workloads, and their future silicon will require liquid cooling.

In collaboration with some members of the Open Compute Project (OCP), such as AMD, Intel, Meta, NVIDIA, Samsung, and Vertiv™, we published a white paper titled “30°C Coolant—A Durable Roadmap for the Future.” The research provides business planning and efficiency insights on maintaining optimal temperatures for equipment longevity, supply chain stability, and environmental responsibility. 

GPU architecture: The thermal stack in modern silicon

Modern GPUs have undergone a dramatic shift in silicon package construction, using 2.5-dimensional (2.5D) multi-chiplet stacking (Chip-on-Wafer-on-Substrate, or CoWoS) to enhance computing performance. This approach combines a system-on-chip (SOC) with high bandwidth memory (HBM), allowing different types and numbers of chiplets to be combined into higher-performing GPU packages. Improvements in process technology, package assembly techniques, processing performance requirements, and co-location of memory have largely driven the large power increases seen in GPUs.

GPU construction is far more complex than CPU construction, and it creates unique thermal challenges in the package, namely:

  • The different chiplets or components require different maximum junction temperatures.
  • The different chiplets have different stack heights.

For example, the SOC chiplet typically operates with a maximum junction temperature of 105℃, whereas early generations of HBM require a lower junction temperature: 85℃ for single-refresh operation and 95℃ for double-refresh operation. Later generations of HBM have raised those limits to 95℃ and 105℃, respectively, to better match SOC junction temperature requirements. Each HBM manufacturer is addressing this trend based on its own expertise.
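To make the effect of mixed junction temperature limits concrete, the sketch below finds which chiplet in a package caps the allowable coolant temperature. It assumes a simplified lumped thermal model, Tj = T_coolant + P × R_th, which is not taken from the white paper. The junction limits come from the text above, while the power and thermal resistance values, the `Chiplet` class, and the function names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Chiplet:
    name: str
    tj_max_c: float      # maximum junction temperature, deg C (from the article)
    power_w: float       # HYPOTHETICAL dissipated power, W
    r_th_c_per_w: float  # HYPOTHETICAL junction-to-coolant resistance, deg C/W


def max_coolant_temp_c(chiplet: Chiplet) -> float:
    """Highest coolant temperature that keeps this chiplet at or below Tj max.

    Assumes the lumped model Tj = T_coolant + P * R_th.
    """
    return chiplet.tj_max_c - chiplet.power_w * chiplet.r_th_c_per_w


def limiting_chiplet(chiplets: list[Chiplet]) -> Chiplet:
    """Return the chiplet that allows the lowest coolant temperature."""
    return min(chiplets, key=max_coolant_temp_c)


# Illustrative package: power and resistance values are assumptions.
package = [
    Chiplet("SOC", tj_max_c=105.0, power_w=700.0, r_th_c_per_w=0.08),
    Chiplet("HBM (early generation, single refresh)",
            tj_max_c=85.0, power_w=30.0, r_th_c_per_w=1.5),
]

if __name__ == "__main__":
    for c in package:
        print(f"{c.name}: coolant must stay at or below {max_coolant_temp_c(c):.1f} C")
    print("Limiting chiplet:", limiting_chiplet(package).name)
```

With these assumed numbers, the low-power HBM stack sets the coolant limit even though the SOC dissipates far more heat. A lower junction limit combined with a taller, more resistive stack can dominate the thermal design in this way.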

Durability from a manufacturer’s and operator’s perspective

Silicon manufacturers benefit from a durable data center design, one in which the coolant temperature requirements for silicon do not change with each generation. Next-generation AI products have thermal designs that require significantly cooler fluid temperatures than some data center designers are currently planning for. By agreeing on a standard coolant temperature, silicon manufacturers can design their products with the assurance that they can be sufficiently cooled. A fixed temperature also gives silicon manufacturers a clear design target.

Meanwhile, data center operators benefit from aligning on a durable coolant temperature for several reasons. It takes many years to plan, design, build, and commission a data center, while AI silicon evolves rapidly. The industry benefits from quick iterations of silicon, but if the lower limit of the coolant temperature needed to deploy AI silicon were to change rapidly, new data centers could be obsolete before they are even built. It is also relatively easy and efficient to operate a technology cooling system (TCS) loop at the warmer end of its design temperature range. However, it is very expensive and time-consuming to modify a TCS loop to operate below its existing design range once it is built. Defining the lower temperature limit of a data center’s TCS loop is therefore very important for the long-term viability of, and investment in, data center infrastructure.

Determining the right temperature for the modern data center

Data centers must be designed for a specific coolant operating temperature and flow rate. If the temperature requirement for a particular generation of IT hardware is higher than the design temperature, the data center's operational set points can be adjusted to raise the coolant temperature, which offers an opportunity for improved efficiency. If the temperature requirement is lower than the data center design temperature, expensive and time-consuming modifications to the physical design will be required. For example, lower temperatures may require additional chiller capacity or a new type of chiller. Given the consequences of such a change, it is important to set a coolant temperature requirement that will not change across multiple generations of IT hardware.
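The asymmetry described above can be sketched as a small check. This is only an illustration of the reasoning, not a planning tool. The function names and the `roadmap` values are hypothetical, and the only rule encoded is the one stated in the text: a requirement at or above the design minimum needs a set point change, while one below it needs physical modification.

```python
from typing import Optional


def assess_requirement(design_min_c: float, required_c: float) -> str:
    """Classify what a hardware generation demands of an existing facility."""
    if required_c >= design_min_c:
        # Warmer coolant is acceptable: raise operational set points.
        return "adjust set points"
    # Colder coolant than the facility was designed for.
    return "physical modification"


def first_breaking_generation(
    design_min_c: float, requirements_c: list[tuple[str, float]]
) -> Optional[str]:
    """Return the first generation that needs coolant below the design minimum."""
    for name, required_c in requirements_c:
        if assess_requirement(design_min_c, required_c) == "physical modification":
            return name
    return None


# HYPOTHETICAL roadmap of required coolant temperatures (deg C) per generation.
roadmap = [("gen A", 40.0), ("gen B", 35.0), ("gen C", 30.0)]

if __name__ == "__main__":
    for design_min in (30.0, 35.0):
        print(design_min, first_breaking_generation(design_min, roadmap))
```

In this made-up roadmap, a facility designed for a 30℃ minimum serves every generation, while one designed for 35℃ needs physical changes by the third generation.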

This is especially challenging for AI hardware, given the rapid changes in silicon power and thermal requirements. Figure 2 illustrates the GPU and CPU power trends and associated technical fluid temperature requirements (Figure 6 in the white paper). AI hardware is driving the need for liquid cooling. GPU power is increasing much faster than CPU power. The chart shows the associated fluid temperature requirements for GPUs over time: the asymptotic temperature requirement is 30℃.

Figure 2. GPU and CPU power trends and associated technical fluid temperature requirements at 1.5 lpm per 1 kW. Source: Open Compute Project®, “Coolant Temperatures for Next Generation IT and Durable Data Center Designs” presentation, OCP Regional Summit, April 2023
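The flow rate quoted in the Figure 2 caption can be related to a coolant temperature rise with a steady-state energy balance, ΔT = P / (ṁ · cp). The sketch below assumes water-like fluid properties (density 1.0 kg/L, specific heat 4186 J/kg·K). The article does not state the actual TCS fluid, so these defaults and the function name are assumptions, and a glycol mixture would give a somewhat different result.

```python
def coolant_temp_rise_c(
    power_w: float,
    flow_lpm: float,
    density_kg_per_l: float = 1.0,   # ASSUMED: water-like fluid
    cp_j_per_kg_k: float = 4186.0,   # ASSUMED: specific heat of water
) -> float:
    """Steady-state coolant temperature rise across a heat load.

    Energy balance: delta_T = P / (m_dot * cp), with m_dot in kg/s.
    """
    mass_flow_kg_per_s = flow_lpm * density_kg_per_l / 60.0
    return power_w / (mass_flow_kg_per_s * cp_j_per_kg_k)


if __name__ == "__main__":
    supply_c = 30.0  # minimum TCS coolant temperature discussed in the article
    rise_c = coolant_temp_rise_c(power_w=1000.0, flow_lpm=1.5)
    print(f"Temperature rise: {rise_c:.1f} C")
    print(f"Return temperature: {supply_c + rise_c:.1f} C")
```

Under these assumptions, 1.5 lpm per kW corresponds to a rise of roughly 9.6℃, so coolant supplied at 30℃ would return at about 39.6℃. Because flow scales with power in this ratio, the rise is the same at any load.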

We agreed on 30℃ as the minimum fluid temperature for hardware and data center design. In the figure, the line thickness represents prediction uncertainty. Investments in advanced silicon packaging and liquid cooling performance are needed to maintain 30℃ as a long-term interface specification. The choice of 30℃ matches a common minimum air temperature specification for large-scale data center design. Choosing the same value for air- and liquid-cooled IT hardware allows the industry to maintain the significant power usage effectiveness (PUE) improvements achieved over the past 15 years. The white paper describes the consequences of lowering the fluid temperature from 30℃ to 20℃, including the reduction in annual free-cooling hours in zones classified as having hotter climates.

While liquid cooling within computing systems has been around for many decades, the demand for cooling higher-density workloads is driving liquid cooling from boutique to hyperscale. Because AI solutions evolve much faster than data centers can be built, data center operators and silicon providers need to agree on a coolant temperature that keeps data center designs durable across silicon generations.

A minimum coolant temperature of 30℃ for the TCS loop gives both parties a common operating point. In a simplified view of data center cooling architecture, a liquid-cooling system has at least two cooling loops, running from the TCS to the facility water system (FWS), as shown in Figure 3 (Figure 8 in the white paper).

Figure 3. Cooling system architecture and components

Cooling AI with confidence

The 30℃ TCS coolant temperature limit is not intended to discourage the development of a wide range of future solutions. While there is significant value in aligning on a coolant temperature of 30℃ or greater, there is no desire to limit the types of technology used to deliver cooling capability to the silicon. Freedom to innovate will be needed across many different cooling technologies, including immersion, cold plates, thermal interface materials, and many others, to provide ideal solutions across the industry. There will also be demand for silicon solutions across a range of coolant temperatures, including some opportunities below the 30℃ target. However, the goal of this initiative is to target silicon with the best price-per-performance ratio above the 30℃ coolant limit.

Across the industry, researchers continue to develop solutions and enhancements for critical digital infrastructure, including technologies that can advance both business goals and responsible business efforts in the future. For instance, OCP has been instrumental in promoting cooling solutions that improve data center efficiency and reliability, and it will continue to do so. To find technical insights and business details on maintaining optimal temperatures in the modern data center, and to identify paths for new technology investments toward increased performance, download OCP’s latest white paper, “30°C Coolant—A Durable Roadmap for the Future.” To learn about deployments and implementations that support this goal, visit Vertiv.com.

