The server market is diverse yet standardized. The ICs and components that go into final assemblies must meet the form factor requirements of rack-mount and blade systems. The enclosure form factor and the component placement dictate the thermal-mechanical properties, and hence the thermal cooling limits, which are driven by the energy and power consumption of the system.
The workloads for server applications also vary significantly, but all share elements of reliability, security and manageability. Many server operating systems incorporate virtualization technologies, and the applications can be multi-threaded and amenable to heterogeneous or symmetrical multi-processing.
Simulating the dynamic behavior of complex multi-core CPU designs, with high-reliability storage and memory and high-throughput IO, is important for meeting thermal and power targets. You can over-design and lose on cost, or target narrow use profiles and fail to meet performance or QoS requirements. Many performance and functional simulators do not address IO throughput and do not provide suitable trace information for accurate system-level power estimation. Traditional benchmark workload analysis often does not account for error conditions, fault handling, or user- and network-defined conditions that affect packet throughput and processing, nor for security and virtualization features.
After determining the primary network operating system you will run on the server, estimated number of concurrent users and any storage requirements, the next critical decision to make is selecting the appropriate server form factor.
Servers come in three general form factors: tower, rack and blade.
So how could you go about designing ICs for server requirements with such a diversity of applications?
Build a thermal-mechanical model
Start with the targeted form factors in which your IC design will be used. In some 1U rack enclosures there is no local fan cooling, while many rack enclosures have multiple variable-speed fans and elaborate cooling mechanisms. The enclosure and component placement may dictate the TDP (thermal design power) limit of key components, notably the CPUs, chipset and memory modules.
- Consider the enclosure, PCB/assembly and packaged device volumes in x, y and z coordinates, then the material stack-up, the material properties and the heat transfer coefficients (HTCs).
- Assign the power sources to the associated volumes consuming power and sensors or probes to monitor the temperature.
- For an IC that could be used in several different enclosures, you could have a model of each chassis or enclosure, a model of the PCB/assembly for that enclosure and, of course, the package model of your ICs.
- When you model the HTC you can place sensors or probes at strategic locations such as the inlets, fan exhaust, CPU die/package, memory modules or memory devices.
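As a rough illustration of the steps above, the sketch below assigns a power source to each volume and reads a steady-state "probe" temperature for it. It is a deliberately simplified lumped model, not a substitute for a real thermal-mechanical solver: each volume is assumed to have a single thermal resistance to the inlet air, and the names (ThermalNode, probe_temperatures, max_power_for_limit) and all numbers are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ThermalNode:
    """One powered volume (e.g. a packaged CPU) with a lumped path to the inlet air."""
    name: str
    power_w: float        # power assigned to this volume, in watts
    theta_c_per_w: float  # ASSUMED lumped thermal resistance to inlet air, degC/W

def probe_temperatures(nodes, inlet_c):
    """Steady-state probe temperature per node: T = T_inlet + P * theta."""
    return {n.name: inlet_c + n.power_w * n.theta_c_per_w for n in nodes}

def max_power_for_limit(theta_c_per_w, inlet_c, limit_c):
    """Largest power that keeps a node at or below limit_c in a given enclosure."""
    return (limit_c - inlet_c) / theta_c_per_w

# Hypothetical values: the same CPU and memory module in two different enclosures.
enclosures = {
    "rack_no_local_fan": [ThermalNode("cpu", 80.0, 0.60), ThermalNode("dimm", 6.0, 2.0)],
    "rack_with_fans":    [ThermalNode("cpu", 80.0, 0.35), ThermalNode("dimm", 6.0, 1.2)],
}

for enclosure_name, nodes in enclosures.items():
    temps = probe_temperatures(nodes, inlet_c=35.0)
    print(enclosure_name, {k: round(v, 1) for k, v in temps.items()})
```

Under these assumed resistances, the same 80 W CPU runs about 20 degC hotter in the enclosure without local fans, which is the kind of difference that sets a per-enclosure power limit.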
Power models of the ICs
Pre-existing IP can be reused where the dynamic, leakage and state-dependent power equations are applicable. The power model architect can map the server IC power states and system states to the IP block. The power architect must also account for server-specific functions such as redundancy, ECC, failover and recovery mechanisms. These are quantified in terms of logic area, the corresponding active, idle and standby or leakage power states, and the percentage of residency in each state. New IP power models are created using the same parameterization, recapped as: logic/transistor count or area, power states, power equations per state, and percentage of occupancy or residency in each state as a function of workload and operating condition.
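The parameterization can be sketched as a small data structure: one record per power state, with a power equation and a residency fraction, combined into a residency-weighted average. This is a minimal sketch under assumed numbers. The PowerState and average_power names are hypothetical, and the dynamic_scale factor is only an illustrative stand-in for the extra switching activity of a feature such as ECC.

```python
from dataclasses import dataclass

@dataclass
class PowerState:
    name: str
    dynamic_w: float  # switching power in this state, in watts
    leakage_w: float  # static power in this state, in watts
    residency: float  # fraction of time spent in this state (0..1)

def average_power(states, dynamic_scale=1.0):
    """Residency-weighted average power over all states.

    dynamic_scale is an ASSUMED multiplier on dynamic power, used here to
    represent a server-specific feature (for example ECC) being enabled.
    """
    total_residency = sum(s.residency for s in states)
    if abs(total_residency - 1.0) > 1e-6:
        raise ValueError(f"residencies sum to {total_residency}, expected 1.0")
    return sum(
        s.residency * (s.dynamic_w * dynamic_scale + s.leakage_w) for s in states
    )

# Hypothetical IP block with three power states.
mem_ctrl = [
    PowerState("active",  dynamic_w=2.0, leakage_w=0.4, residency=0.30),
    PowerState("idle",    dynamic_w=0.5, leakage_w=0.4, residency=0.60),
    PowerState("standby", dynamic_w=0.0, leakage_w=0.1, residency=0.10),
]

print(f"baseline:      {average_power(mem_ctrl):.2f} W")
print(f"feature (+10%): {average_power(mem_ctrl, dynamic_scale=1.1):.2f} W")
```

In practice the residency values would come from the workload rather than being fixed constants, which is why they are kept as data separate from the per-state power equations.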
Power stimuli
The stimulus can come from performance or functional simulators running server database and web server applications, in the form of exported CSV or VCD traces. Portions of the trace can also be used to inject error conditions, retries or other activity based on characterized or statistical data. In this way the power architect can obtain the activity factor and dynamic power of the processing, memory and storage subsystems that are unique to server workloads. The IO and connectivity power can also be modeled using bandwidth and traffic generators, with security and reliability features both enabled and disabled. Throughput can be adjusted based on error conditions, retries and packet payload delivery. The user can create complex power traces by adding steps or tasks and concatenating the power stimulus to drive the power model with concurrent and pipelined tasks.
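The idea of building a stimulus from steps, concatenating them, and injecting retries can be sketched as follows. This is an illustrative sketch only: the step format, the function names, the retry model (a failed step is simply repeated) and the linear activity-to-power mapping are all assumptions, not the format of any particular simulator or tool.

```python
import random

def step(task, duration_ms, activity):
    """One stimulus step: task name, duration, and activity factor (0..1)."""
    return {"task": task, "duration_ms": duration_ms, "activity": activity}

def inject_retries(trace, retry_rate, rng):
    """Return a new trace where each step is repeated with probability retry_rate."""
    out = []
    for s in trace:
        out.append(s)
        if rng.random() < retry_rate:
            out.append({**s, "task": s["task"] + "_retry"})
    return out

def energy_mj(trace, idle_w, peak_w):
    """Energy in millijoules, ASSUMING power = idle + activity * (peak - idle)."""
    return sum(
        (idle_w + s["activity"] * (peak_w - idle_w)) * s["duration_ms"] for s in trace
    )

# Hypothetical steps; concatenation builds a longer stimulus from smaller tasks.
network_rx = [step("rx_packet", 2, 0.2), step("decrypt", 1, 0.9)]
database = [step("db_query", 5, 0.6)]
trace = network_rx + database

rng = random.Random(1)  # fixed seed so the injected retries are repeatable
with_errors = inject_retries(trace, retry_rate=0.3, rng=rng)

print("clean trace:", energy_mj(trace, idle_w=10.0, peak_w=60.0), "mJ")
print("with retries:", energy_mj(with_errors, idle_w=10.0, peak_w=60.0), "mJ")
```

A retry rate taken from characterized or statistical data would replace the fixed 0.3 used here.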
For multiple use cases and multiple form factors and layouts, consider using an ESL power-thermal profiling tool flow:
Docea Power provides an ESL power and thermal solution using the Ace Thermal Modeler, which can generate a compact thermal model for running coupled power-thermal simulations, as well as a Thermal Profiling tool, which can be used with power traces from characterized workloads.
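To make "coupled power-thermal simulation" concrete, here is a generic sketch that does not use or represent any Docea tool or API. It assumes a one-node compact thermal model (a single thermal resistance and capacitance to ambient) and an assumed leakage model in which leakage doubles for every fixed temperature rise. The function name and every parameter value are hypothetical.

```python
def simulate(power_trace_w, dt_s, r_c_per_w, c_j_per_c, ambient_c,
             leak_ref_w, leak_ref_c=25.0, leak_doubling_c=20.0):
    """Forward-Euler coupled simulation on a one-node RC thermal model.

    At each time step the leakage power is recomputed from the current
    temperature (ASSUMED to double every leak_doubling_c degrees), added to
    the dynamic power from the trace, and fed back into the thermal model.
    Returns the temperature after each step.
    """
    temp_c = ambient_c
    temps = []
    for dyn_w in power_trace_w:
        leak_w = leak_ref_w * 2.0 ** ((temp_c - leak_ref_c) / leak_doubling_c)
        total_w = dyn_w + leak_w
        heat_out_w = (temp_c - ambient_c) / r_c_per_w
        temp_c += dt_s * (total_w - heat_out_w) / c_j_per_c
        temps.append(temp_c)
    return temps

# Hypothetical constant 40 W dynamic load for 200 s, in 0.1 s steps.
trace = [40.0] * 2000
no_leak = simulate(trace, 0.1, r_c_per_w=0.5, c_j_per_c=20.0,
                   ambient_c=35.0, leak_ref_w=0.0)
coupled = simulate(trace, 0.1, r_c_per_w=0.5, c_j_per_c=20.0,
                   ambient_c=35.0, leak_ref_w=5.0)

print(f"final temperature without leakage: {no_leak[-1]:.1f} degC")
print(f"final temperature with coupling:   {coupled[-1]:.1f} degC")
```

The point of the coupling is that leakage raises temperature and temperature raises leakage, so running the power model and the thermal model separately would underestimate both.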
Summary
Power and thermal modeling for server ICs and SoCs is much like the approach used for SoCs in smartphone, tablet and mobile applications. However, the key items are:
- The IP blocks often have server-specific hardware features such as ECC, security and packet processing acceleration. The corresponding power models need to comprehend these features and account for the power in server-specific power states.
- Server power states are based on high availability and QoS, so throughput is key. Processor and memory active and idle states are highly optimized, and many components can sit at standby power levels while still offering low latency. New IP has been developed for latency-tolerant IO and new bus technologies.
- Server-specific features like security and error handling need to be included in the power model. The security, reliability and manageability functions may add power, but in some instances the power penalty is highly dependent on system software and operating conditions.
- The power stimulus should provide configurable conditions for error injection, congestion and encryption/decryption in the event traces to activate server-specific features.
- The thermal model needs to account for various chassis, PCB orientation, airflow and ambient environmental conditions.