Data Center Optimization Through Game Theory

Data Center Optimization Through Game Theory
by Bernard Murphy on 02-13-2019 at 7:00 am

I always enjoy surprising synergies so I was immediately attracted to a Research Highlight in the Communications of the ACM this month, on a game-theoretic method to balance discretionary speed-ups (known as computational sprints) in data centers. If you don’t have an ACM membership and want to dig deeper, I include an open link at the end of this blog.

This research starts with the widely used trick to reduce job run-time on a processor by speeding up the clock. The downside is that running with a faster clock dissipates more heat so you can’t do it for too long otherwise you’ll fry your processor. Heat dissipates relatively slowly, so there’s a recovery time for cooling during which clock speed has to return to nominal or perhaps even slower. Like sprinting in a long-distance race; you can sprint periodically, but you can’t sprint through the whole race.

22975-marathon-min.jpeg

Also you’re not running this race alone. In a data center, and particularly in cloud environments, you’re in a load mix with many other jobs which also want to optimize their performance. Your job may swap onto a processor on which another job was just sprinting, or vice-versa; either way the second job loses a sprinting opportunity, at least until recovery. Or maybe multiple jobs in a rack want to sprint at the same time but that risks tripping the rack power supply, perhaps switching over to UPS with its own recovery cycles. Who loses out in this race for extra power?

If you know anything about game theory, this should sound very familiar. You have multiple players with no intrinsic reason to cooperate, all grabbing for more than their fair share of finite resources. The default (greedy) approach is simply to let it work itself out. Jobs grab whatever they can within the bounds of hardware limits. If a chip is going to overheat, hardware enforced slow-down kicks in and can’t be overridden. Similarly, if multiple chips in a rack want to sprint, early requestors make it, up to the power limit of the rack and later requestors are out of luck. Or perhaps they can exceed the power limit, the power supply trips and switches to UPS, as mentioned above, with its own discharge and recovery considerations.

Equally if you know anything about game theory, you’ll know that the greedy method is less efficient overall than methods which take a more comprehensive approach to optimization. For those who don’t know much about game theory, this isn’t about altruism. You might believe that by being better at being greedy you can win out (the Wolf of Wall Street approach where you don’t care what happens to the others) but the probabilities are against you when others are can behave similarly. You are likely to do worse following that strategy than if you more effectively cooperate; check out the prisoner’s dilemma.

A team at Duke have researched several strategies to finding a more effective equilibrium between jobs. Each of these is based on an architecture in which a user job is supported by a selfish (to that job) mechanism to optimize job performance by judicious sprinting in job phases where it will have most advantage. Sprinting profiles are shared in advance with a central coordinator which is responsible for returning recommended profiles that each job should use to ensure an equilibrium strategy across the system. Per job, the mechanism then uses its assigned profile in driving sprinting.

One quick note first. This approach avoids centralized command and control for each sprinting decision, which could be very burdensome in overall performance. Instead jobs and the coordinator need communicate only infrequently while exchanged profiles remain representative of actual activity.

More important, consider how the game plays out. If each job follows its assigned profile and the coordinator calculated an effective equilibrium strategy, jobs should remain in that equilibrium, achieving optimal throughput. What if a job cheats and offers a non-representative profile to the coordinator or chooses not to follow the equilibrium profile the coordinator returns? In either case, the cheating job is likely to suffer because, by not following the computed equilibrium strategy it is most likely to fall into less optimal performance. The only way it could (possibly) avoid this and cheat its way to better performance would require knowing what profiles other jobs have, and it doesn’t have access to that information.

In their experiments, the Duke researchers found the gaming approach provided a 4-6x total task throughput improvement over greedy approaches for data analytics workloads and close to optimal throughput based on a globally optimized policy. Not bad, and an impressive use of a math/economics theory you might never have considered relevant to this domain. You can access an open copy of the Duke paper here.