You’ve heard about Meltdown and Spectre and you know they’re really bad security bugs (in different ways). If you’ve dug deeper, you know that these problems are related to the speculative execution common in modern processors, and if you dug deeper still you may have learned that underlying both problems are exploits called timing side-channel attacks which depend on differences in timing between different operations, for example in retrieving data from a cache on a hit or miss. From there on for many of us, certainly for me, the details get a lot harder to follow.
But you probably thought (as I did) “this is a problem for the CPU guys, not my concern”. Bad news – you need to worry about this too. Timing channel exploits are not just for CPUs and caches. Timing exploits are possible through NoCs, in accelerators, between accelerators and their caches, pretty much anywhere an attacker might probe for hints to privileged information. We can’t just punt this to someone else; we need a deeper level of understanding.
So I talked to Jason Oberg, CEO of Tortuga, who has a PhD in timing channels; he managed to drag me through the basics. I say “drag” with feeling because just understanding the basics was hard; trying to reason about whether you have timing channel problems in a large design may earn you an extended stay in a sanitarium. These problems are hard because they inherently span multiple instructions and require you to look at both software and hardware together. To see why, Jason shared a couple of the “easier” examples. These assume a victim process and an attacker process can run on the same system (as virtual machines on a shared host, for example).
First, think about a public-key cryptography system, which could be implemented in hardware or software. This depends on raising a number to a power (the key) and then reducing the result modulo some number. The most efficient standard way to do this uses a square-and-multiply algorithm, progressing over bits in the key. Square and multiply operations take different times, and the multiply is only performed for 1 bits in the key, so the total time taken for the operation (if you have access to a sufficiently accurate clock) reveals the number of ‘1’ bits in the key. So start a timer, run the encryption, stop the timer and read the result. Since the key won’t change, repeat with multiple carefully-selected plain-text inputs, analyze the timing variations for these inputs and you can reconstruct the key one bit at a time.
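To make the leak concrete, here is a minimal Python sketch of left-to-right square-and-multiply that counts operations instead of timing them. The function name, the toy key and the modulus are all made up for illustration; a real implementation would use far larger numbers, and a real attacker would see elapsed time rather than counts.

```python
def square_and_multiply(base, exponent, modulus):
    """Left-to-right binary exponentiation that also counts its operations."""
    result = 1
    squares = 0
    multiplies = 0
    for bit in bin(exponent)[2:]:  # key bits, most significant first
        result = (result * result) % modulus  # every bit costs a square
        squares += 1
        if bit == "1":
            result = (result * base) % modulus  # only 1 bits cost a multiply
            multiplies += 1
    return result, squares, multiplies


key = 0b1011001  # hypothetical 7-bit key containing four 1 bits
value, squares, multiplies = square_and_multiply(5, key, 1009)
print(value == pow(5, key, 1009))  # True: agrees with Python's built-in pow
print(squares, multiplies)         # 7 4
```

The multiply count equals the number of 1 bits in the key, which is exactly what the total running time gives away.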
The crypto-experts figured this out and one came up with a better algorithm called Montgomery’s ladder, which is immune to this kind of attack because it balances operation times for 0 and 1 bits. But then the experts found another way to force timing variations, through data retrieval times from the cache. Hang on to your hats – this is going to get complicated. One approach starts with something called a Prime and Probe attack. Before running the encryption test, the attacker primes the cache by filling it with its own data. Then the attacker times an encryption test, as before. Afterwards the attacker swaps back in and checks whether any of the cache lines it preloaded have been evicted by the encryption. Each evicted line marks a memory access in the encryption that missed in the cache, and so took longer to execute; it also tells the attacker which part of the cache the victim touched.
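The Prime and Probe sequence is easier to follow in a toy model. The sketch below assumes a hypothetical direct-mapped cache with eight sets and a made-up victim whose memory address depends on one secret bit. A hit or miss is modelled as a boolean; on real hardware the attacker would infer it from access latency.

```python
NUM_SETS = 8  # hypothetical direct-mapped cache: 8 sets, one line per set


class ToyCache:
    def __init__(self):
        self.lines = [None] * NUM_SETS  # (owner, address) held in each set

    def access(self, who, address):
        """Return True on a hit; on a miss, evict and install the new line."""
        index = address % NUM_SETS
        hit = self.lines[index] == (who, address)
        self.lines[index] = (who, address)
        return hit


def prime(cache):
    """Attacker fills every set with its own data."""
    for address in range(NUM_SETS):
        cache.access("attacker", address)


def victim(cache, secret_bit):
    """Made-up victim: the address it touches depends on a secret bit."""
    address = 102 if secret_bit else 105  # these map to sets 6 and 1
    cache.access("victim", address)


def probe(cache):
    """Attacker re-reads its data; each miss marks a set the victim used."""
    return [a for a in range(NUM_SETS) if not cache.access("attacker", a)]


cache = ToyCache()
prime(cache)
victim(cache, secret_bit=1)
print(probe(cache))  # [6]
```

With `secret_bit=0` the probe reports set 1 instead, so the attacker learns the bit without ever reading the victim's memory.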
Now back to Montgomery’s ladder. This also steps through the key bitwise. It performs the same sequence of operations for every bit, but which operands it reads and writes depends on whether the bit is 0 or 1. Because of the Prime and Probe setup, timing is now sensitive to memory indexing from the ladder operation in the victim process, and that indexing is still based on successive bits in the key. From there you just continue to run analyses over multiple plain-text samples, analyzing the timing variations for these operations, from which you can ultimately extract the key.
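A small Python sketch shows where the key-dependent indexing comes from. It is a textbook form of the ladder, not any particular library's implementation, and the `trace` list is a hypothetical stand-in for what an attacker might infer from cache behaviour: which of the two working slots gets squared at each step.

```python
def montgomery_ladder(base, exponent, modulus):
    """Montgomery ladder; also records which slot is squared for each bit."""
    r = [1, base % modulus]  # invariant: r[1] == r[0] * base (mod modulus)
    trace = []
    for bit_char in bin(exponent)[2:]:  # key bits, most significant first
        b = int(bit_char)
        # Every bit costs one multiply and one square, so timing is balanced,
        # but the key bit chooses which slot is written by each operation.
        r[1 - b] = (r[0] * r[1]) % modulus
        r[b] = (r[b] * r[b]) % modulus
        trace.append(b)  # index of the squared slot: this is the leak
    return r[0], trace


key = 0b1011001  # hypothetical toy key
value, trace = montgomery_ladder(5, key, 1009)
print(value == pow(5, key, 1009))                 # True
print(int("".join(map(str, trace)), 2) == key)    # True: the trace is the key
```

The operation count no longer depends on the key, but anyone who can observe the slot indices can read the key off directly.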
Reminder – these are just examples; nothing about them is particularly restricted to CPUs, caches or encryption. Timing channel vulnerabilities can happen all over the place, as I mentioned earlier. And you can’t figure out where you might have such a problem without looking at hardware and software together. Formal and other standard security tools really can’t help. Even thinking about where you might have such problems can be difficult. You probably should talk to Tortuga, who have a strong background in this domain and have built tools specifically for finding timing channel problems.