Software abstraction is a huge benefit of a network-on-chip (NoC), but with flexibility comes the potential for runtime errors. Improper addresses and illegal commands can generate unexpected behavior. Timeouts can occur on congested paths. Security violations can arise from oblivious or malicious access attempts.
Runtime errors also tend not to happen in isolation, especially if the first error in a sequence goes unmitigated. With natural causes such as congestion, further errors are likely to pile up as operation continues. With unnatural causes such as a malicious app, small errors can be a precursor to larger exploits. A chain of runtime errors can eventually render part or all of a SoC unable to function.
Not all errors are created equal. Many errors simply happen silently, producing an incorrect response but otherwise undetected. Others are seen but unacted upon. Depending on the source and severity of the error condition, recovery might be possible, or it might be prohibitively expensive in terms of extra gates and layers of software. The last resort is the dreaded hardware reset, an increasingly archaic response that irritates users to no end.
Without the right NoC infrastructure, even the first few phases of error management are difficult, making simple errors hard to handle. In markets such as automotive and the IoT, where real-time and safety-critical operation matters more, error management is becoming a bigger part of SoC design. With the right NoC architecture, built-in features make robust error management easier.
There are five phases in error management: detection, aggregation, logging, reporting, and recovery. In the SonicsGN architecture, detection starts with configurable initiator agents and target agents. A transaction begins at an initiator, flows through routers, is received at a target, and is acknowledged with a response that flows back to the initiator. Each agent has what amounts to a watchdog timer, looking at four situations: burst failure, target flow control, return ack fail, and initiator flow control.
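To make the watchdog idea concrete, here is a minimal Python sketch of an agent that keeps one timeout counter for each of the four situations named above and logs any that expire. It is an illustrative model only, not the SonicsGN implementation: the class names, the cycle-based counting, and the per-situation limits are all assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class TimeoutKind(Enum):
    # The four situations named in the text; exact hardware semantics are not modeled.
    BURST_FAILURE = "burst failure"
    TARGET_FLOW_CONTROL = "target flow control"
    RETURN_ACK_FAIL = "return ack fail"
    INITIATOR_FLOW_CONTROL = "initiator flow control"


@dataclass
class Watchdog:
    kind: TimeoutKind
    limit: int       # cycles allowed without progress (hypothetical parameter)
    count: int = 0

    def tick(self, progressed: bool) -> bool:
        """Advance one cycle; return True if the timeout fires."""
        if progressed:
            self.count = 0
            return False
        self.count += 1
        return self.count >= self.limit


class Agent:
    """Hypothetical agent: detects timeouts and appends them to a local log."""

    def __init__(self, limits):
        self.watchdogs = {kind: Watchdog(kind, limits[kind]) for kind in TimeoutKind}
        self.log = []

    def cycle(self, progress):
        """progress maps a TimeoutKind to False when that condition is stalled."""
        for kind, watchdog in self.watchdogs.items():
            if watchdog.tick(progress.get(kind, True)):
                self.log.append(kind)   # detection feeds logging
                watchdog.count = 0      # re-arm after firing


agent = Agent({kind: 3 for kind in TimeoutKind})
for _ in range(3):
    agent.cycle({TimeoutKind.RETURN_ACK_FAIL: False})
```

After three stalled cycles the return-ack watchdog fires once and the other three stay quiet. In real silicon the limits would presumably be configuration registers rather than constructor arguments.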
Other types of in-band errors can occur. Each initiator agent has a map of the targets it is permitted to reach; an access attempt can fall into an address “hole” in the map, or might be trying to access a powered-down domain. An initiator agent might see an unsupported command, a target agent might see an access violation, or both might report some type of safety error (as in a firewall, or what Sonics terms a protection mechanism). Another common error is the out-of-band variety, such as a violation of the AXI non-modifiable burst. When possible, errors are handled at the initiator agent to minimize network traffic.
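The address-map check described above can be sketched as a simple lookup at the initiator agent. This is a hedged illustration, not Sonics' decode logic: the `Region` fields, the domain names, and the `decode` function are hypothetical, and a real agent would do this in hardware.

```python
from dataclasses import dataclass
from enum import Enum


class DecodeResult(Enum):
    OK = "ok"
    ADDRESS_HOLE = "address hole"
    POWERED_DOWN = "powered-down domain"


@dataclass(frozen=True)
class Region:
    base: int     # first address of the region
    size: int     # region length in bytes
    target: str   # hypothetical target name
    domain: str   # hypothetical power domain name


def decode(address, regions, powered_domains):
    """Classify an access against the initiator's permitted-target map."""
    for region in regions:
        if region.base <= address < region.base + region.size:
            if region.domain not in powered_domains:
                return DecodeResult.POWERED_DOWN, region.target
            return DecodeResult.OK, region.target
    # No region matched: the address fell into a hole in the map.
    return DecodeResult.ADDRESS_HOLE, None


regions = [
    Region(0x0000, 0x1000, "sram", "core"),
    Region(0x2000, 0x1000, "dsp", "media"),
]
powered = {"core"}  # assume the "media" domain is currently off
```

Here `decode(0x1800, regions, powered)` lands in the gap between the two regions, while `decode(0x2004, regions, powered)` hits a mapped target whose domain is off. Both are caught at the initiator, so no request needs to enter the network.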
The SonicsGN agents detect, aggregate, and log errors – but what happens then? Reporting is configurable, with responses ranging from simple in-band messages to sideband techniques up to a processor interrupt. One interesting scenario is an attack on a sensitive IP block. It may be futile to report those errors back to the initiator, which would be generating the attack entirely on purpose. Recovering from errors is also up to the customer. Software can go into the agents and sweep the error logs, looking at different classes of severity and frequency, then decide what to do.
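As a rough sketch of what such a software sweep might look like, the Python below reads a list of logged records, groups them by agent, and picks a response from severity and frequency. The record layout, the severity classes, the threshold, and the action names are all assumptions made for illustration; the actual SonicsGN log format is not described here.

```python
from collections import Counter
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    INFO = 0
    WARNING = 1
    CRITICAL = 2


@dataclass(frozen=True)
class ErrorRecord:
    agent: str          # hypothetical agent name
    code: str           # hypothetical error code
    severity: Severity


def sweep(records, repeat_threshold=3):
    """Map each agent to an action based on worst severity and error count."""
    counts = Counter(record.agent for record in records)
    worst = {}
    for record in records:
        worst[record.agent] = max(worst.get(record.agent, Severity.INFO), record.severity)

    decisions = {}
    for agent, count in counts.items():
        if worst[agent] == Severity.CRITICAL:
            decisions[agent] = "interrupt"   # escalate to the processor
        elif count >= repeat_threshold:
            decisions[agent] = "sideband"    # frequent but not critical
        else:
            decisions[agent] = "log_only"    # note it and move on
    return decisions


records = [
    ErrorRecord("cpu0", "TIMEOUT", Severity.WARNING),
    ErrorRecord("cpu0", "TIMEOUT", Severity.WARNING),
    ErrorRecord("cpu0", "TIMEOUT", Severity.WARNING),
    ErrorRecord("dma", "ACCESS_VIOLATION", Severity.CRITICAL),
    ErrorRecord("gpu", "ADDRESS_HOLE", Severity.INFO),
]
decisions = sweep(records)
```

The policy itself is the part the customer engineers. A safety-critical design might escalate on the first warning, while a cost-sensitive one might only count errors and never interrupt.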
The point is customers can use the SonicsGN capability to engineer as little or as much error management into their product as needed. Much of the original work on NoC error management was done in conjunction with TI on various OMAP family members, and Sonics has a detailed error management microarchitecture (under NDA).
There are always tradeoffs. For a fully certifiable, safety-critical system, the hardware and software investment in a SoC with robust error reporting, and recovery in some scenarios, may be well worth it. Even for less hardened systems where recovery might be expensive in silicon, the ability to recognize and report suspicious activity could be instrumental in IoT and other applications. Imagine an IoT edge device that could tell the provisioning system it is being hacked and going offline – while the attack is in progress, rather than after the fact when bad data has propagated all over the network.
To me, this seems like the early days of the Internet, when IT types combed through logs of traffic from routers, firewalls, packet shapers, load balancers, and other appliances to find out who was trying to do what to whom. The difference is now it is all happening within a single chip running a NoC. Without the type of visibility SonicsGN provides, errors could easily run out of control all over a chip – and users would never know until it was too late. With the error management capability in SonicsGN, SoC designers have a lot more control.