Instance

Array
(
    [title] => Recent Forum Threads
    [title_url] => 
    [ignore_sticky] => 0
    [exclude_current] => 0
    [limit] => 10
    [sluglist] => ["jobs-dashboard"]
    [rw_opt] => Array
        (
            [widget_select] => 1
            [pageid_281769] => 1
            [pageid_281772] => 1
        )

    [display_widget_mobile] => 
    [rw_opt_exclude] => Array
        (
            [pageid_274493] => 1
            [cpt_podcast] => 1
            [cpta_podcast] => 1
            [category_16613] => 1
            [category_16631] => 1
            [taxonomy_series] => 1
            [pageid_354254] => 1
        )

    [node_id] => Array
        (
            [0] => 2
        )

)

Threads

Recent Article Comments

Flynn Was Right: How a 2003 Warning Foretold Today’s Architectural Pivot
Appreciate your take, Rahul. You’re absolutely right that market scale drives architectural investment—scalar dominated when desktop and enterprise ruled, and…

— Jonah McLeod on June 29, 2025
Flynn Was Right: How a 2003 Warning Foretold Today’s Architectural Pivot
Well.. I found this to be a funny article. Flynn's critique is fine and good...but not really the driving factor…

— Rahul Razdan on June 29, 2025
Reachability in Analog and AMS. Innovation in Verification
Apologies for that slip-up on our part. Failing memories!

— Bernard Murphy on June 27, 2025
Reachability in Analog and AMS. Innovation in Verification
swka: This is true, I worked with MunEDA up until the Cadence acquisition. Before that I worked with Solido up…

— Daniel Nenni on June 26, 2025
Reachability in Analog and AMS. Innovation in Verification
One quick correction. WiCkeD was MunEDA tool, which was acquired by Cadence. So it is never part of Synopsys. Synopsy…

— swka on June 26, 2025
Flynn Was Right: How a 2003 Warning Foretold Today’s Architectural Pivot
At Simplex Micro, the name says it all. Founder Dr. Thang Tran chose it to reflect his belief that in…

— Jonah McLeod on June 25, 2025
Flynn Was Right: How a 2003 Warning Foretold Today’s Architectural Pivot
Thanks for the thoughtful read—and you're right, we’re in a fascinating inflection point. On your first point: Lunar Lake doesn’t…

— Jonah McLeod on June 24, 2025
Flynn Was Right: How a 2003 Warning Foretold Today’s Architectural Pivot
An interesting article for sure, as we are in a sea of change. I have perhaps two nitpicks; - Lunar…

— Xebec on June 24, 2025
Intel’s IDM 2.0
I am not writing anything these days due to being over booked in my day job, sorry.

— Scotten Jones on June 19, 2025
Stitched Multi-Patterning for Minimum Pitch Metal in DRAM Periphery
Indeed, it looks like CXMT is going this way: https://ieeexplore.ieee.org/document/9620445

— Fred Chen on June 19, 2025

Ceva webinar AI Arch SEMI 800X100 250625

WP_Term Object
(
    [term_id] => 151
    [name] => General
    [slug] => general
    [term_group] => 0
    [term_taxonomy_id] => 151
    [taxonomy] => category
    [description] => 
    [parent] => 0
    [count] => 444
    [filter] => raw
    [cat_ID] => 151
    [category_count] => 444
    [category_description] => 
    [cat_name] => General
    [category_nicename] => general
    [category_parent] => 0
)

February 13, 2019 by Bernard Murphy

Data Center Optimization Through Game Theory

Data Center Optimization Through Game Theory
by Bernard Murphy on 02-13-2019 at 7:00 am
Categories: General

I always enjoy surprising synergies so I was immediately attracted to a Research Highlight in the Communications of the ACM this month, on a game-theoretic method to balance discretionary speed-ups (known as computational sprints) in data centers. If you don’t have an ACM membership and want to dig deeper, I include an open link at the end of this blog.

This research starts with the widely used trick to reduce job run-time on a processor by speeding up the clock. The downside is that running with a faster clock dissipates more heat so you can’t do it for too long otherwise you’ll fry your processor. Heat dissipates relatively slowly, so there’s a recovery time for cooling during which clock speed has to return to nominal or perhaps even slower. Like sprinting in a long-distance race; you can sprint periodically, but you can’t sprint through the whole race.

Also you’re not running this race alone. In a data center, and particularly in cloud environments, you’re in a load mix with many other jobs which also want to optimize their performance. Your job may swap onto a processor on which another job was just sprinting, or vice-versa; either way the second job loses a sprinting opportunity, at least until recovery. Or maybe multiple jobs in a rack want to sprint at the same time but that risks tripping the rack power supply, perhaps switching over to UPS with its own recovery cycles. Who loses out in this race for extra power?

If you know anything about game theory, this should sound very familiar. You have multiple players with no intrinsic reason to cooperate, all grabbing for more than their fair share of finite resources. The default (greedy) approach is simply to let it work itself out. Jobs grab whatever they can within the bounds of hardware limits. If a chip is going to overheat, hardware enforced slow-down kicks in and can’t be overridden. Similarly, if multiple chips in a rack want to sprint, early requestors make it, up to the power limit of the rack and later requestors are out of luck. Or perhaps they can exceed the power limit, the power supply trips and switches to UPS, as mentioned above, with its own discharge and recovery considerations.

Equally if you know anything about game theory, you’ll know that the greedy method is less efficient overall than methods which take a more comprehensive approach to optimization. For those who don’t know much about game theory, this isn’t about altruism. You might believe that by being better at being greedy you can win out (the Wolf of Wall Street approach where you don’t care what happens to the others) but the probabilities are against you when others are can behave similarly. You are likely to do worse following that strategy than if you more effectively cooperate; check out the prisoner’s dilemma.

A team at Duke have researched several strategies to finding a more effective equilibrium between jobs. Each of these is based on an architecture in which a user job is supported by a selfish (to that job) mechanism to optimize job performance by judicious sprinting in job phases where it will have most advantage. Sprinting profiles are shared in advance with a central coordinator which is responsible for returning recommended profiles that each job should use to ensure an equilibrium strategy across the system. Per job, the mechanism then uses its assigned profile in driving sprinting.

One quick note first. This approach avoids centralized command and control for each sprinting decision, which could be very burdensome in overall performance. Instead jobs and the coordinator need communicate only infrequently while exchanged profiles remain representative of actual activity.

More important, consider how the game plays out. If each job follows its assigned profile and the coordinator calculated an effective equilibrium strategy, jobs should remain in that equilibrium, achieving optimal throughput. What if a job cheats and offers a non-representative profile to the coordinator or chooses not to follow the equilibrium profile the coordinator returns? In either case, the cheating job is likely to suffer because, by not following the computed equilibrium strategy it is most likely to fall into less optimal performance. The only way it could (possibly) avoid this and cheat its way to better performance would require knowing what profiles other jobs have, and it doesn’t have access to that information.

In their experiments, the Duke researchers found the gaming approach provided a 4-6x total task throughput improvement over greedy approaches for data analytics workloads and close to optimal throughput based on a globally optimized policy. Not bad, and an impressive use of a math/economics theory you might never have considered relevant to this domain. You can access an open copy of the Duke paper here.

Share this post via:

Comments

There are no comments yet.

You must register or log in to view/post comments.

Flynn Was Right: How a 2003 Warning Foretold Today’s Architectural Pivot
Appreciate your take, Rahul. You’re absolutely right that market scale drives architectural investment—scalar dominated when desktop and enterprise ruled, and…

— Jonah McLeod on June 29, 2025
Flynn Was Right: How a 2003 Warning Foretold Today’s Architectural Pivot
Well.. I found this to be a funny article. Flynn's critique is fine and good...but not really the driving factor…

— Rahul Razdan on June 29, 2025
Reachability in Analog and AMS. Innovation in Verification
Apologies for that slip-up on our part. Failing memories!

— Bernard Murphy on June 27, 2025
Reachability in Analog and AMS. Innovation in Verification
swka: This is true, I worked with MunEDA up until the Cadence acquisition. Before that I worked with Solido up…

— Daniel Nenni on June 26, 2025
Reachability in Analog and AMS. Innovation in Verification
One quick correction. WiCkeD was MunEDA tool, which was acquired by Cadence. So it is never part of Synopsys. Synopsy…

— swka on June 26, 2025
Flynn Was Right: How a 2003 Warning Foretold Today’s Architectural Pivot
At Simplex Micro, the name says it all. Founder Dr. Thang Tran chose it to reflect his belief that in…

— Jonah McLeod on June 25, 2025
Flynn Was Right: How a 2003 Warning Foretold Today’s Architectural Pivot
Thanks for the thoughtful read—and you're right, we’re in a fascinating inflection point. On your first point: Lunar Lake doesn’t…

— Jonah McLeod on June 24, 2025
Flynn Was Right: How a 2003 Warning Foretold Today’s Architectural Pivot
An interesting article for sure, as we are in a sea of change. I have perhaps two nitpicks; - Lunar…

— Xebec on June 24, 2025

Search Semiwiki

Recent Forum Threads

Recent Article Comments

Recent Podcast Episodes

Comments

Recent Forum Threads

Recent Article Comments