AI, Machine Learning, Deep Learning and neural networks are all hot industry topics in 2019, but you probably want to know if these concepts are changing how we actually design or verify an SoC. What better place to get an answer than from a panel of industry experts who recently gathered at DVCon with moderator Jean-Marie Brunet from Mentor, a Siemens business:
- Raymond Nijssen, Achronix Semiconductor Corp.
- Saad Godil, NVIDIA
- Alex Starr, AMD
- Rob Aitken, Arm, Ltd.
- Ty Garibay, Mythic
AI hardware and software startups commanded a leading portion of venture capital investment in 2018, continuing into 2019, and in China alone there are some 600 AI startups. There’s even a book on the topic from author Dr. Kai-Fu Lee, AI Superpowers: China, Silicon Valley, and the New World Order. Moderator Brunet started things off by asking the panelists to consider two challenges:
Brunet: How is the AI idea reshaping the semiconductor industry and specifically chip design verification?
Nijssen: These chips tend to be very homogeneous with a lot of emphasis on data movement and on elementary arithmetic. In that sense, verifying those chips from a traditional design verification point of view is not very different from what you know and do already. What I will say is that the performance validation of the system is much more complicated, because there’s so much software involved in the systems.
Aitken: I would actually argue that it is very much like verifying a processor, because the machine learning unit itself is doing processing. It’s not a GPU, it’s not a CPU, it’s doing lots of multiplications. I think the interesting verification problem is exactly as someone described it, it’s how these things interact with each other, and what the system is actually doing.
From the machine learning verification perspective, I think it’s an interesting subset of the EDA problem in general. If you look at EDA, for the last 30 years, people have built in huge numbers of heuristics, and heuristics are actually really good.
As an example, some work done at Arm looked at selecting which test vectors were more or less likely to actually add value in a verification suite. That problem turned out to be relatively easy to formulate as a machine learning problem and an image recognition problem, so that a standard set of neural networks can figure out that this looks like a promising set of vectors, this looks like a less promising set of vectors, and we were able to get about 2X improvement in overall verification throughput that way.
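To make the idea of ranking test vectors concrete, here is a minimal sketch, not the Arm flow. It swaps the neural network for a tiny logistic-regression scorer written in plain Python, and every feature name, data value, and test name is a hypothetical stand-in: each past test is described by two made-up features, labelled by whether it found new coverage, and the trained scorer then ranks candidate tests so the most promising run first.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(features, labels, lr=0.5, epochs=2000):
    """Fit a logistic-regression scorer with plain stochastic gradient descent."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            predicted = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            error = predicted - y
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias

def score(weights, bias, x):
    """Estimated probability that a test with features x adds coverage."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

# Hypothetical history: (toggle_rate, opcode_diversity), both scaled to 0..1.
# Label 1 means the test found new coverage, 0 means it did not.
history_features = [(0.9, 0.8), (0.8, 0.9), (0.7, 0.6),
                    (0.2, 0.1), (0.1, 0.3), (0.3, 0.2)]
history_labels = [1, 1, 1, 0, 0, 0]

weights, bias = train_logistic(history_features, history_labels)

# Hypothetical candidate tests waiting in the regression queue.
candidates = {"test_a": (0.85, 0.75), "test_b": (0.15, 0.20), "test_c": (0.50, 0.60)}
ranked = sorted(candidates,
                key=lambda name: score(weights, bias, candidates[name]),
                reverse=True)
print(ranked)  # most promising test first
```

In a real flow the features would come from the testbench and the labels from coverage databases; the point is only that "promising versus less promising" reduces to an ordinary supervised classification problem.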
Starr: We have multi-die designs that are commonplace, certainly for AMD, multi-sockets, so we are already facing the scale challenge, which you certainly see on the AI side of things. All of these designs are going to be measured on how fast they can execute these machine learning algorithms. That’s the hardware and software/firmware problem. I think the industry is going to get their head around that game of how do we optimize for performance, not just the design itself but the whole ecosystem.
On the question of ‘how do I use AI in verification processes’, I think that is a big area of expansion. Today, we can run the large designs on hybrid systems, and process tons of data. Getting high visibility into that is key, and AI can play a role specifically on how to analyze the data we are getting out of the systems and be more targeted in our approaches.
Garibay: The design team and the verification team working on our chips essentially implemented a generative adversarial network with the people. You have the design, you have efficient people trying to attack it, and the design evolves over time as the designers fix it, make it better, and actually create a new chip. The challenge is that this chip is unique and there’s no baseline data or there’s a limited visibility of baseline data from chip to chip, unless you’re doing derivatives or next generation x86 or something like that.
Godil: You ought to construct verification environments to allow you to react in a very short amount of time, and that’s going to give you a competitive advantage in delivering these AI chips. On the topic of building AI chips, in general, the industry has been designing chips that have dealt with scaling already.
The other question is, how will AI impact verification? On that I’m actually very bullish. As we mentioned in the introduction, a lot of people have said that AI is the new electricity, and there are going to be a lot of industries that will be impacted by it.
Brunet: Do you guys think EDA tool vendors are ready with delivering what you need to verify your chips in a particular domain?
Nijssen: In my view, we are only just getting started with AI. This is going to be a long road, and it’s going to be a very exciting time. In the case of verification, you could wonder, what would that training set look like? Is it a big dataset with errors in it that may affect a system, and force the question, is this good or is this bad, because you’ve trained with it?
Garibay: The chips are just a bunch of multipliers with some amount of programming on them. So the verification task of these chips must include the firmware and the software layers, even more so than in a normal processor or SoC.
Aitken: I think there’s another important point regarding curve fitting. It’s really useful to think about it in the verification space, especially, because curve fitting as a process works really well if your training set bounds your curve. Interpolation is good, extrapolation is notoriously bad in curves.
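The interpolation versus extrapolation point can be illustrated with a toy fit. This is a sketch under stated assumptions rather than anything shown on the panel: the "true" behaviour is an arbitrary quadratic, the model is a least-squares straight line, and the training samples only cover the range 0 to 1.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def true_behaviour(x):
    # Hypothetical ground truth that the fitted model never sees directly.
    return x * x

# Training samples only cover the interval 0.0 .. 1.0.
train_x = [i / 10 for i in range(11)]
train_y = [true_behaviour(x) for x in train_x]
slope, intercept = fit_line(train_x, train_y)

def model(x):
    return slope * x + intercept

# Query inside the training range (interpolation) and far outside it (extrapolation).
interp_error = abs(model(0.5) - true_behaviour(0.5))
extrap_error = abs(model(3.0) - true_behaviour(3.0))
print(f"interpolation error: {interp_error:.3f}")
print(f"extrapolation error: {extrap_error:.3f}")
```

The fit is tolerable where the training data bounds the query and drifts badly once the query leaves that range, which is the risk when a learned model is asked about design behaviour it was never trained on.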
Starr: The hardware/software ecosystem, that is what we’ve got to verify these days, it’s not just the design. We pretty much had terrible tools industry wide to address that problem. We’ve got tools that can look into the design and see what the firmware was doing, what the software stack was doing at the same time. And that’s great. But we need a more abstract level of debugging and understanding all of that with a global view.
Godil: One observation that I will make is, I see a lot more custom chips, a lot more domain specific chips not just for AI, and usually those are paired with their own domain specific languages. I do think that for this verification community we’re going to have fewer opportunities to rely on some common standards and common programs.
Brunet: What’s different in the AI space is that you now have new frameworks. We are talking about frameworks that are different if you go to mobile, framework benchmarks for reference, etc. We need help from you guys to tell us what you actually need to extract from the design. Not all emulators are the same. In our case we can extract everything, but what I want to avoid is dumping a massive amount of data that is not going to help you, because it’s too much data.
Garibay: So we have some common layer that is the neural network input but after that it is all unique for each vendor. It’s incumbent on the tool vendors to be able to rapidly adapt to each unique implementation, supplying the ability to generate scoreboards very quickly to track different states within the chip in a very visible way and the ability to understand new data types and new types of operations.
Godil: AI can look at a picture and figure out what the different objects are without knowing what those objects mean or what they are. It’s really good at extracting meaning from sentences, even though it doesn’t understand them. It can’t reason very well, but it’s really good at perception. Perhaps the right answer is, you have a spec that you can build, but what if you could discover the properties of a proprietary design and figure out what would be important and what would not be important?
Aitken: When you’re looking at the actual implementation of how that processing works, if you just have visibility to some random set of edges, and some activation function, and you say, this data appeared here and it fired, you don’t really know what that means, or why, and that’s the benefit with AI. That’s why these things work, and why we don’t have to program every aspect of them, but it’s also why they’re really hard to debug.
Nijssen: I think we’re touching on something important, the distinction between supervised learning and unsupervised learning. In the same sense when machine learning started two decades ago, people had domain specific knowledge to say if you want to recognize a cat, you have to do edge detection, then you have to have some overlap and two triangles on top of it. That’s a rule, and you can get pretty far with it. But they hit a glass ceiling, until they came up with unsupervised learning where basically there were no rules to drive the system. The system had to work by being provided a lot of training vectors, and the tool derived coefficients that made it so that the cat’s tail was recognized without anyone having to specify what the rules were.
For design verification, the risk is that somebody may be saying, the self-learning system may be going to miss something, maybe the training data was just not good enough. That would probably not happen with a rule driven system, where the rules provide a much higher likelihood that the system will catch the deviations from the rules on an enormous amount of data. These distinctions are very important to make.
Audience: Comment on probabilistic computing, analog computing, quantum computing, statistical models, algorithmic benchmarks.
Garibay: We are implementing convolutional math in analog and it does create its own range of verification issues. I think it’s a bit humorous that we’re spending millions of dollars and millions of man hours to target 100% verification of our machines that are intended to be 96.8% accurate, and then treating the final bit of it as if it was meaningful. Oh, that’s wrong because it’s off by this significant bit. No, it’s not wrong, it’s just different from your software model. And that’s really all we have right now as verification golden models.
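The gap between "wrong" and "different from your software model" can be shown with two checkers run over the same data. This is only a sketch: the output values are invented, and the one-LSB tolerance is an assumption for illustration, not a threshold any panelist proposed.

```python
def bit_exact_mismatches(dut, golden):
    """Indices where the hardware output differs from the model at all."""
    return [i for i, (d, g) in enumerate(zip(dut, golden)) if d != g]

def tolerance_mismatches(dut, golden, max_lsb_error=1):
    """Indices where the difference exceeds an allowed number of LSBs."""
    return [i for i, (d, g) in enumerate(zip(dut, golden))
            if abs(d - g) > max_lsb_error]

# Hypothetical quantized outputs: software golden model vs. analog hardware.
golden_outputs = [120, -45, 7, 300, 0]
dut_outputs = [121, -45, 6, 300, 9]

exact_failures = bit_exact_mismatches(dut_outputs, golden_outputs)
real_failures = tolerance_mismatches(dut_outputs, golden_outputs)

print("bit-exact check flags:", exact_failures)
print("tolerance check flags:", real_failures)
```

A bit-exact scoreboard reports every off-by-one sample as a failure, while the tolerance-based one isolates the sample that is genuinely out of range. Choosing that tolerance is the hard part, since it has to be tied back to end-to-end network accuracy rather than picked per signal.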
Aitken: You talked about 96.8%, it’s tempting, especially as a digital designer, to say, oh, well, I fudged the circuit a little bit, and now, instead of 96.8, it’s only 96.7. But, hey, it was only 96.8 before, so really 96.7 is probably fine. But the software people who come up with the networks in the first place will kill for that extra decimal point. Hardware people can’t just give it away. But it does lead to some interesting thoughts.
Nijssen: In terms of the previous question that you asked, we’re going to have to figure out a way to make reliable systems out of unreliable components. From a design verification point of view, nothing could be more daunting than that.
Starr: If you look at what we’re trying to verify when we make these big designs today, it’s got the hardware and the software in it and it is not deterministic when we run it. In fact, we go out of our way to design non-deterministic systems in the spirit of performance, so that we can get as much throughput as possible to verify that everything works.
Godil: I want to comment on this whole idea of neural nets, artificial intelligence, and the probability aspect to it, and what if we get something wrong every now and then. Like Rob was saying, how do you know whether this activation was supposed to fire or not? As someone who stares at a loss function all day to see why the model is not trending, I can guarantee you, you do not want your verification problem to be figured out by this neural network design. You’re not going to make it. I agree with Rob, I think eventually you will be hit with some limits, maybe that’ll be the eventuality.
Audience: My question is how can we improve system validation through hardware/software validation? Is there some kind of AI and Machine Learning approach that can help us?
Godil: I spent a lot of time talking to different groups in NVIDIA, sort of brainstorming on what may be good applications, and one of the first things that we talked about is, don’t tell me about your data, tell me about your problem. What is it that you’re trying to solve? There are certain things that we’re looking for that make something a good AI problem.
Aitken: Look for problems that have rules. If there are rules that enable people to solve the problem, that helps a lot. AI systems can play Go, because Go has a defined set of rules. AI systems cannot play a typical game that your four-year-old kids would invent because the rules change so much.
Nijssen: I’ve been thinking about how would you come up with the training set, how do you come up with enough data to train your AI system? Maybe one way to do that is to take your system and, on purpose, introduce a bug to see what the AI system can generate. This is how Google trained Google Go.
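Deliberately introducing bugs to produce labelled training data could look something like the sketch below. Everything here is hypothetical: the "design" is a toy 8-bit adder, the two mutants are hand-written examples of injected bugs, and the sample format is made up for illustration.

```python
import itertools

def reference_adder(a, b):
    """Known-good model of an 8-bit adder."""
    return (a + b) & 0xFF

def mutant_drop_carry(a, b):
    """Injected bug: carries never propagate between bits."""
    return (a ^ b) & 0xFF

def mutant_stuck_lsb(a, b):
    """Injected bug: least significant output bit stuck at 1."""
    return ((a + b) & 0xFF) | 0x01

def build_training_set(designs, stimulus):
    """Run every design on every stimulus and label each observation."""
    samples = []
    for name, design in designs.items():
        for a, b in stimulus:
            observed = design(a, b)
            samples.append({
                "design": name,
                "a": a,
                "b": b,
                "observed": observed,
                # Label 1 when the observed output deviates from the reference.
                "buggy": int(observed != reference_adder(a, b)),
            })
    return samples

designs = {
    "reference": reference_adder,
    "drop_carry": mutant_drop_carry,
    "stuck_lsb": mutant_stuck_lsb,
}
# Small, fixed stimulus grid: 6 values per operand, 36 pairs in total.
stimulus = list(itertools.product(range(0, 256, 51), repeat=2))

training_set = build_training_set(designs, stimulus)
buggy_count = sum(sample["buggy"] for sample in training_set)
print(f"{len(training_set)} samples, {buggy_count} labelled buggy")
```

Note that not every stimulus exposes every injected bug, so the labels also record which stimulus is actually effective at catching a given class of error.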
Audience: Why isn’t there any sort of capability within the tools right now to integrate any of these deep learning techniques?
Brunet: As I said before, it’s not a problem of how to do it for us, the problem is what to extract. We can very deterministically extract pretty much anything that is in an emulator. The challenge we’re seeing with those frameworks is, we can spend time extracting a lot of stuff that is completely useless.
Audience: The simulator vendors don’t act upon it, there’s no feedback. That’s what I’m saying is missing.
Godil: I think people have looked into that. I think there was a tool called Echo from VCS that addressed that. This is not a new idea in terms of feeding data back.
Audience: How many companies making chips containing special purpose neural network hardware do you think will exist in five years?
Garibay: The AI chip market is a pretty standard market split between China and not China. You’ll have a number one, a number two, and maybe three others that are trying to run a business for a long period of time.
Aitken: I think you will have more than that. If you look at your typical CPU, whether it’s from Arm or whoever, there’s a lot of work that’s going into just conventional CPUs on how to optimize the hardware to do better on matrix multiply and similar neural network calculations. At some minimalistic level, anything with a CPU in five years is going to essentially have custom purpose AI hardware in it.
Nijssen: A few years ago, I was attending a panel where somebody from the audience asked the same question or the same question with a different order of magnitude. Somebody from the panel answered over 100, and there was a lot of laughter in the room because everybody felt it was preposterous. That’s where we are right now.
Brunet: What I believe will happen is, in five years the distribution among the top ten will change, probably not startups, though hopefully one of them is here today. But I think we’ll see companies similar to Google, Amazon, Facebook of today that will emerge within the top 10 semiconductor makers because of that distribution shift.
Audience: Deep learning is divided into two problem scopes, the training and the inference. Do you think that we would need different versions of EDA tools to solve these two different problems?
Nijssen: If you are looking for one example that shows we are in the infancy of this whole thing, it is the dichotomy between training and inference. Right now we’re at the vacuum tube level of AI. We have a cat and if I show my cat a new treat, once, just once, the cat will immediately learn about that treat, and it does that with maybe two watts in its brain. I’m sure nothing got uploaded to a data center, ran overnight, and some coefficients downloaded into my cat’s brain. It did this instantaneously with a couple of watts.
My answer to your question is that the separation will continue for now, but eventually it is going to disappear.
Garibay: I’d say that it’s no different than what happens with mobile chips or edge chips of any kind or chips in other markets at the edge versus data center servers. We all use the same EDA tools.
The AI hardware market is expanding with new products announced every month, around the globe, so automating the design and verification of these chips makes sound business sense for both EDA and IP vendors. The two segments for these AI chips are in the data center where training typically takes place, and then at the edge where inferencing happens, but over time this distinction is becoming more blurred.
I was kind of surprised that none of the panelists talked about how a few EDA tools are using machine learning in the areas of cell characterization and smart Monte-Carlo simulation for SPICE circuit simulators. We are likely to see more AI inside of EDA tools for problems that require long runtimes, like logic synthesis, place and route, and DFM. The other challenge with EDA tools is having deterministic results, otherwise how will a vendor be able to reproduce a bug in order to fix it?