Jay Dawani is the co-founder & CEO at Lemurian Labs, a startup developing a novel processor to enable autonomous robots to fully leverage the capabilities of modern-day AI within their current energy, space, and latency constraints.
Prior to founding Lemurian, Jay founded two other companies in the AI space. He is also the author of the top-rated “mathematics for deep learning” book.
Jay has also served as the CTO of BlocPlay, a public company building a blockchain-based gaming platform, and as Director of AI at GEC, where he led the development of several client projects spanning retail, algorithmic trading, protein folding, robots for space exploration, recommendation systems, and more. In his spare time, he has also been an advisor at NASA FDL.
Can you give us the backstory on Lemurian?
We started Lemurian because we observed that the robotics field is moving towards software-defined robots and away from the large, stationary, fixed-function robots that have been the norm for the last few decades. The main advantage here is the ability to give robots new capabilities over time through training with more simulated data and over-the-air updates. Three of the biggest drivers for this shift are deep learning and reinforcement learning; more powerful compute; and synthetic data. Most robotics companies are unable to fully leverage the advancements in deep learning and reinforcement learning because they lack sufficient compute performance within the power consumption and latency they require. Our roadmap is aligned to these customer needs, and we are focused on building the processor that would address these concerns. In some ways, we are building the processor we would need if we were to launch a robotics company.
Over 100 companies focusing on AI hardware have been created in the last 10 or so years. What makes Lemurian different?
We are developing a processor that enables AI in robots with far less power and lower latency by leveraging custom arithmetic to do matrix multiplication differently so that it is reliable, efficient, and deterministic. Our approach is well suited to the needs of the growing autonomous robotics industry, which can include anything from a home vacuum cleaner to a materials handling robot in a warehouse or a vehicle outdoors performing last-mile delivery. What many of these applications have in common is the need to respond rapidly to changes in their local environment using very low power; they cannot wait for a signal from a data center in the cloud. These applications need to be programmed for their particular context, with high precision and deterministic actions. Determinism in our case means generating the same answer every time given the same inputs, which is essential for safety. General-purpose AI processors, as others are building them, do not address these essential requirements.
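To make the determinism point concrete: one familiar source of run-to-run variation on conventional floating-point hardware is that floating-point addition is not associative, so the order in which a sum is accumulated can change the result. The short Python sketch below illustrates this and contrasts it with integer (fixed-point) accumulation, which is order-independent. It is only an illustration of the general idea, not a description of Lemurian's arithmetic; the function names and the 16 fractional bits are assumptions chosen for the example.

```python
def float_sum(values):
    """Accumulate in floating point, left to right."""
    total = 0.0
    for v in values:
        total += v
    return total


def fixed_point_sum(values, frac_bits=16):
    """Accumulate as scaled integers (hypothetical fixed-point format).

    Each value is rounded once to an integer multiple of 2**-frac_bits.
    Integer addition is exact, so the result does not depend on order.
    """
    scale = 1 << frac_bits
    total = 0
    for v in values:
        total += round(v * scale)
    return total / scale


forward = [0.1, 0.2, 0.3]
backward = list(reversed(forward))

# Floating point: the two orders give slightly different answers.
print(float_sum(forward) == float_sum(backward))

# Fixed point: the two orders give the same answer.
print(fixed_point_sum(forward) == fixed_point_sum(backward))
```

The fixed-point version trades a one-time rounding of each input for a sum that is reproducible regardless of how the work is scheduled.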
Are you saying that the robotics industry needs a dedicated processor that is different from what most AI hardware companies are building?
Absolutely! Most companies focusing on edge AI inference are over-optimized for computer vision, where the objective is conventionally to detect whether something is present in an image or to classify it. The challenge is that robotics is more than computer vision. A robot is something that interacts with the real world. It has to perceive, decide, plan, and act based on information that is often incomplete and uncertain.
For example, a bin-picking and sorting robot needs to be able to perceive the difference between objects and interact with them appropriately, with high speed and accuracy. With a domain-specific compute platform, robots will be able to process more sensor data in less time, which will allow many mobile robots to complete longer missions or tasks and to react more quickly to changes in the environment. In some applications, it is hard to collect enough good data to train a robot, so companies are using behavior cloning, in which a robot learns by observing demonstrations from a human in a supervised setting.
These autonomous robotic applications require an entirely new approach such as the one we are taking with our processor, which has been designed from first principles. Our solution is software-defined, high precision, deterministic, and energy efficient. That is why we are generating so much interest in this market segment from some of the leading companies. Fundamentally, we are doing for deep reinforcement learning inference at the edge what NVIDIA did for deep learning training in the data center.
Very cool. So what is unique about the technology that you are building?
Fundamentally, we are building a software-managed, distributed dataflow machine that leverages custom arithmetic, which overall reduces power consumption and increases silicon efficiency. The demands of AI are now so severe that they are breaking the old way of doing things, and that is creating a renaissance in computer architecture, reviving ideas like dataflow and non-von Neumann designs. A lot of these ideas are commonplace in digital signal processing and high-performance data acquisition because these systems are constrained by silicon area or power.
For our target workloads, we were able to develop an arithmetic that is several orders of magnitude more efficient for matrix multiplications. It is ideally suited to modern-day AI, which depends heavily on linear algebra algorithms, and allows us to make better use of the transistors available. Other linear algebra-dominated application verticals, such as computer-aided engineering or computer graphics, require floating-point. But floating-point arithmetic, as we know, is notoriously energy inefficient and expensive.
What is the benefit of this approach over those being taken by other companies?
The arithmetic we designed has roughly the same precision as a 16-bit float but consumes a fraction of the area. In a nutshell, we are able to get the efficiency of analog while retaining all the nice properties of digital. And once you change the arithmetic as we have, you can back off the memory wall and increase your performance and efficiency levels quite significantly.
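For readers who want a feel for what "the precision of a 16-bit float" means in practice, the sketch below rounds a few numbers to the standard IEEE 754 half-precision format (binary16) and measures the error. It uses Python's `struct` module, whose `"e"` format code packs a half-precision value. This shows only the standard 16-bit float used as the reference point in the answer above; it says nothing about how Lemurian's own number format is encoded, and the helper names are made up for the example.

```python
import struct


def to_half(x):
    """Round a Python float to the nearest IEEE 754 binary16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def relative_error(x):
    """Relative rounding error introduced by storing x as binary16."""
    return abs(to_half(x) - x) / abs(x)


# binary16 has an 11-bit significand, so normal values keep a relative
# rounding error of at most 2**-11 (roughly three decimal digits).
samples = [0.1, 3.14159, 1000.5, 2049.0]
for x in samples:
    print(x, to_half(x), relative_error(x))
```

Integers above 2048 are the first that binary16 cannot all represent exactly, which is why 2049.0 does not survive the round trip.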
Single-precision floats have been very effective for training deep neural networks, as we have seen, but for inference most AI hardware companies are building chips for networks that have been quantized to 8-bit integer weights and activations. Unfortunately, many neural network architectures are not quantizable to anything below 16-bit floats. So if we are to squeeze out more performance from the same amount of silicon as everyone else, we need new arithmetic.
Taking some of the newer neural network topologies as an example, the weights and activations in different layers have different levels of sensitivity to quantization. As a result, most chips are forced to accommodate multi-precision quantization and have multiple arithmetic types in their hardware, which in turn reduces overall silicon efficiency. We took this into account when designing our custom arithmetic. It has high precision, is adaptive, and addresses the needs of deep learning to enable training, inference, and continual learning at the edge.
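As a rough illustration of why some layers tolerate 8-bit quantization and others do not, the sketch below applies a simple symmetric, per-tensor int8 quantization to two made-up weight lists: one with a narrow range, and one where a single large outlier stretches the range. This is a generic textbook-style scheme chosen for the example, not the quantization method of any particular chip or of Lemurian; all names and values are assumptions.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map int8 codes back to approximate real values."""
    return [q * scale for q in quantized]


def max_abs_error(values):
    """Worst-case error after a quantize/dequantize round trip."""
    quantized, scale = quantize_int8(values)
    restored = dequantize(quantized, scale)
    return max(abs(a - b) for a, b in zip(values, restored))


# Hypothetical "layer" with weights in a narrow range.
narrow = [0.01 * i for i in range(-10, 11)]
# Same weights plus one large outlier that stretches the int8 range.
with_outlier = narrow + [50.0]

print(max_abs_error(narrow))
print(max_abs_error(with_outlier))
```

With the outlier present, the step between adjacent int8 codes becomes so coarse that every small weight collapses to zero, which is the kind of sensitivity that pushes hardware toward supporting more than one precision.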
Why do you think other companies haven’t innovated in arithmetic?
High-performance systems always specialize their arithmetic and computational pipeline organization. However, general-purpose processors need to pick a common type and stick with it, and ever since IEEE standardized floating-point arithmetic to improve application interoperability among processor vendors in 1985, these common types have been floating-point and integer arithmetic. They work for the general case, but these types are suboptimal for deep learning.
Over the decades, companies developing GPUs have had many different types and arithmetic optimizations in the lighting equations, geometry stages, and rasterization stages, all optimizing for area because of the need to replicate these units millions of times. The nature of the number system is the true innovation. Recognizing that a particular computation offers an opportunity to sample more efficiently is a nontrivial exercise. But when vertex and pixel shaders made the GPU more general purpose, it moved to the same common arithmetic as CPUs.
So there has been innovation in arithmetic, but we haven’t made the progress in it that we should have. And now we are in an era where we need to innovate not just on microarchitecture and compilers, but arithmetic as well to continue to extract and deliver more performance and efficiency.
You just closed your seed round. What can we expect to see from Lemurian in the next 12-18 months?
We did indeed close an oversubscribed seed round. This was a pleasant surprise given the market situation this spring, but we are starting to hear more use cases and more enthusiasm for our solution from our target customers. And investors are increasingly open to novel approaches which may not have gotten attention years ago before the difficulties of the current approaches were commonly known.
We have built out our core engineering team and are forging ahead to tape out our test chip at the end of the year, which will demonstrate our hypothesis that our hardware, software, and arithmetic built for robotics can deliver superior processing at lower energy usage and in a smaller form factor than competitors. We will be taping out our prototype chip at the end of 2023, which we will get into our early customers' hands for sampling.