LLMs have amazing capabilities, but inference runtimes grow rapidly with the size of the input (prompt) sequence, a significant weakness for some applications in engineering. State space models (SSMs) aim to correct this weakness. Paul Cunningham (GM, Verification at Cadence), Raúl Camposano (Silicon Catalyst, entrepreneur, former Synopsys CTO and now Silvaco CTO) and I continue our series on research ideas. As always, feedback welcome.
The Innovation
This month’s pick is Mamba: Linear-Time Sequence Modeling with Selective State Spaces, published on arXiv in 2023. The authors are from CMU and Princeton.
Judging by recent publications, there is growing interest in a next step beyond transformer architectures, one built on state space models (SSMs). State space modeling is not a new idea; the methods date back to the 1960s (Kalman filters) and are applied to time-series analysis in many disciplines. In essence the method builds a model of internal state for a system based on equation-based constraints or statistical observations.
Research in SSMs for LLMs is quite recent, based on the idea that the relevant history of a sequence can be compressed into a more compact statistical state. Using such a model, inference can predict the next items in a sequence faster than brute-force attention recalculation at each step. Research is already looking at applications in speech generation, DNA sequence analysis, computer vision, and of course LLM methods. While I haven’t yet found research on verification applications, it seems reasonable to assume that if ‘traditional’ LLMs can play a role in a verification problem, then SSM-based LLMs can play a more efficient role.
Paul’s view
The potential for LLMs to dramatically improve verification productivity is clear. How long it will take, and what kinds of tools will achieve it, is actively debated. All EDA vendors including Cadence have significant investments in LLM-based tools. This month we’re blogging about Mamba, a state space model (SSM) rather than an LLM. SSM research has been active for many years, but Mamba puts it in the spotlight as a serious contender to replace LLMs. While ours is not a blog for AI experts, if SSMs are to replace LLMs it would be a big deal for all of us, so we figured we should respect the moment and blog on Mamba!
As a simple teaser here, I like to compare LLMs and SSMs to control and datapath in chip design. Think of an LLM as a massive multi-billion node datapath. The inputs are every word in the prompt concatenated with every word that has been output so far. The output is the next word inferred. The width of the datapath explodes internally as very complex math is used to map numbers denoting each input word into scores for every possible output word, literally the entire English dictionary.
Alongside the datapath is control logic that gates and guides it. In our world, control logic is highly sequential: state machines and control registers. Control logic up-levels the datapath from a calculator into a reasoning machine that can take actions and make decisions.
In LLMs the control logic is not sequential. It’s a combinational “attention” weighting function that weights input words against other input words. In SSMs the control logic is a generic state machine, programmable through training. Sure, it can do attention, but it can do many other things as well.
One key benefit of SSMs is that they don’t have limits on the size of the input prompt. LLMs have an n-squared size/runtime problem since the attention function must compare every input word with every other input word; inference blows up if the context window is too big. SSMs have no hardwired requirement to compare every input word to every other input word. Conceptually they just remember something about the words input so far and use this memory to project weightings on the current input word.
The math and innovations behind SSMs go deep. If you want to zoom in, this blog is a great place to start. Either way, let’s all stay tuned: dramatic improvements in verification productivity may well come through SSMs rather than LLMs. Imagine what we could do if the RTL and testbench for a full-chip SoC and a full wavedump from its simulation could be passed as input to an SSM!
Raúl’s view
Inference in transformers has quadratic complexity arising from the self-attention mechanism: each token in the input sequence must compute its relevance (attention score) to every other token. This means that for an input sequence of length n the attention mechanism requires O(n²) computations. This makes inference expensive, and in practice a state-of-the-art LLM like OpenAI’s GPT-4 reportedly manages sequences of up to 32,000 tokens, while Google’s Gemini can handle up to 8,192 tokens. State Space Models (SSMs) have been developed to address transformers’ computational inefficiency on long sequences, but they have not performed as well as attention on important domains such as language.
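To make the quadratic cost concrete, below is a minimal numpy sketch of single-head self-attention. The toy dimensions, random weights, and the omission of masking and multi-head machinery are my simplifications for illustration; the point is that the score matrix is n × n, which is where the O(n²) compute and memory come from.

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """Single-head self-attention over a sequence of n token embeddings.

    x: (n, d) matrix of token embeddings. The score matrix Q @ K.T is
    (n, n), so compute and memory grow quadratically with sequence length n.
    """
    Q, K, V = x @ Wq, x @ Wk, x @ Wv          # project tokens to queries/keys/values
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # every token scored against every other: O(n^2)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row of scores
    return weights @ V                        # weighted mix of value vectors

# Toy usage: n = 6 tokens, embedding dimension d = 4
rng = np.random.default_rng(0)
n, d = 6, 4
x = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
print(self_attention(x, Wq, Wk, Wv).shape)  # (6, 4); doubling n quadruples the score matrix
```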
The paper we review this month introduces Mamba, an architecture which incorporates a structured SSM to perform context-dependent reasoning while scaling linearly in sequence length, matching or outperforming transformers in many cases. Here is how it works.
A Structured SSM maps an input sequence xₜ to an output yₜ through a state hₜ as follows (discretized): hₜ = A hₜ₋₁ + B xₜ, yₜ = C hₜ, where A, B, and C are matrices (to electrical engineers this is reminiscent of a Moore finite state machine). Such recurrent models are efficient because they have a finite state, implying constant-time inference and linear-time training. However, their effectiveness is limited by how well the state has compressed the context. This shortcoming is addressed by selection, which means making B and C also functions of the input and thus time varying. (*)
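As an illustration only, here is a toy numpy sketch of that recurrence with selection, i.e., with B and C derived from each input xₜ. The dimensions and the linear maps used to produce B(xₜ) and C(xₜ) are my own illustrative choices, not the paper’s parameterization, and the plain loop stands in for Mamba’s hardware-aware parallel scan; the point is one constant-cost state update per token, so the whole pass is linear in sequence length.

```python
import numpy as np

def selective_ssm(xs, A, Wb, Wc, state_dim):
    """Toy selective SSM scan: h_t = A h_{t-1} + B(x_t) x_t, y_t = C(x_t) h_t.

    xs: (n, d) input sequence. B and C are made input-dependent ("selection")
    by deriving them from each x_t via the linear maps Wb and Wc. A single
    pass with a fixed-size state h makes the cost linear in n.
    """
    d = xs.shape[1]
    h = np.zeros(state_dim)
    ys = []
    for x in xs:                              # one constant-cost step per token
        B = (Wb @ x).reshape(state_dim, d)    # input-dependent B(x_t)
        C = (Wc @ x).reshape(d, state_dim)    # input-dependent C(x_t)
        h = A @ h + B @ x                     # update the compressed state
        ys.append(C @ h)                      # read the output from the state
    return np.stack(ys)

# Toy usage: d = 4 channels, state of size 8, sequence of 10 tokens
rng = np.random.default_rng(1)
d, state_dim, n = 4, 8, 10
A = 0.9 * np.eye(state_dim)                   # fixed, stable state transition
Wb = 0.1 * rng.normal(size=(state_dim * d, d))
Wc = 0.1 * rng.normal(size=(d * state_dim, d))
print(selective_ssm(rng.normal(size=(n, d)), A, Wb, Wc, state_dim).shape)  # (10, 4)
```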
Mamba is an architecture that integrates a selective SSM with a Multi-Layer Perceptron (MLP) block. It achieves state-of-the-art results, often matching or surpassing Transformer models, in some cases using 3-4x fewer parameters (which is nice but not game changing). Additionally, it can handle longer contexts, up to sequences of one million tokens (this may allow processing of very long strings, useful in EDA where design data is large). It certainly makes the point that Transformers are not the end of the road.
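For intuition on how such a block might be wired, here is a rough, simplified sketch of a Mamba-like block: project the input, mix along the sequence with a small linear-time scan, gate the result, and add a residual connection. The real Mamba block also includes a causal convolution and the input-dependent (selective) SSM parameters discussed above; this sketch fixes A, B, and C and uses made-up dimensions purely to show the block structure.

```python
import numpy as np

def silu(z):
    """SiLU/swish activation used for the gating branch."""
    return z / (1.0 + np.exp(-z))

def mamba_like_block(x, W_in, W_gate, W_out, A, B, C):
    """Simplified Mamba-style block: project, mix along the sequence with an
    SSM scan, gate the result, project back, and add a residual connection.

    x: (n, d) token sequence; A, B, C here are fixed (non-selective) to keep
    the focus on how the pieces compose rather than on selection itself.
    """
    u = x @ W_in                     # expand channels for the SSM branch
    g = silu(x @ W_gate)             # gating branch computed from the same input
    h = np.zeros(A.shape[0])
    mixed = []
    for u_t in u:                    # linear-time scan along the sequence
        h = A @ h + B @ u_t
        mixed.append(C @ h)
    y = np.stack(mixed) * g          # gate the sequence-mixed features
    return x + y @ W_out             # project back to d channels, add residual

# Toy usage: d = 4 model channels expanded to e = 8, state of size s = 8
rng = np.random.default_rng(2)
n, d, e, s = 10, 4, 8, 8
x = rng.normal(size=(n, d))
W_in, W_gate = 0.1 * rng.normal(size=(d, e)), 0.1 * rng.normal(size=(d, e))
W_out = 0.1 * rng.normal(size=(e, d))
A, B, C = 0.9 * np.eye(s), 0.1 * rng.normal(size=(s, e)), 0.1 * rng.normal(size=(e, s))
print(mamba_like_block(x, W_in, W_gate, W_out, A, B, C).shape)  # (10, 4)
```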
The paper, cited over 1000 times, spans 36 pages with 116 references and requires AI expertise to read. It covers various aspects of SSMs such as architectures, dimensions, use of complex vs. real numbers, discretization, RNN gating mechanisms, and selection effects. Mamba is evaluated on synthetic tasks such as Selective Copying (filter out irrelevant tokens) and Induction Heads (retrieving an answer based on context, e.g., predict Potter after Harry), and on language, DNA, and audio modeling. Mamba is compared to other SSM architectures such as Hyena, SaShiMi, and H3, and to Transformer models such as Transformer++. The models evaluated range from hundreds of thousands to one billion parameters. The authors finish by suggesting that “Mamba is a strong candidate to be a general sequence model backbone”.
(*) The paper uses an overbar to indicate discretized A and B matrices, which I could not translate successfully from my Mac to the SemiWiki site. I used an underbar instead.