Can a 2-stage DeepSeek/Nvidia process yield better results?

Arthur Hanson

Well-known member
Could a combination of Nvidia and DeepSeek in a two-stage process yield better results, lowering hardware, time, and power costs to the benefit of all parties? Could this be an entirely new ecosystem or frontier for further exploration?
 
Don't bet on it.

An AMD/DeepSeek process may yield better results, lowering hardware costs. It seems that DeepSeek may be able to use AMD chips at half the cost of Nvidia's, and if that is the case, Nvidia's stock could drop significantly. This is why DeepSeek is not good news for Nvidia.
 
They don't need to use AMD's chips; they can use Huawei's chips. In addition, there are numerous AI inference chip startups.

Furthermore, with the DeepSeek R1 distilled models, you can perform edge inference at very low cost; my setup came to around $1,000. I also did some market research and found that you can build an Arrow Lake-S machine for approximately $1,000.
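
For anyone curious what that kind of low-cost local setup looks like in practice, here is a minimal sketch of running one of the distilled R1 checkpoints with the Hugging Face transformers stack. It is only an illustration under my assumptions (the 7B Qwen distillation, bf16 weights, a single consumer GPU or CPU), not a description of the exact $1,000 build:

```python
# Minimal sketch: local ("edge") inference with a distilled DeepSeek R1 checkpoint.
# Assumes the transformers + accelerate packages and roughly 16 GB of GPU memory
# (or patience on CPU); the model ID is one of the published R1 distillations.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # half-precision weights to fit consumer hardware
    device_map="auto",            # place layers on GPU/CPU automatically (needs accelerate)
)

prompt = "Explain why overlapping communication with computation speeds up distributed training."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```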

 
Yes - many of the techniques DeepSeek used for efficiency can be applied to any model on any hardware. I found a good post on Reddit explaining the list of "innovations". One of them is definitely a first with DeepSeek, the others were already known, and a few are opportunistic given costs in China.


The first few architectural points compound together for huge savings:

[Items in brackets are my comments]
 
You're missing what is probably responsible for a large part of the improvement, which is reprogramming 20% or so of the GPU engines to build a super-fast on-chip comms/sync network and reduce the effect of restricted inter-chip bandwidth. AFAIK this has not been done before, and it's not possible using CUDA.
 
I think these are more detailed descriptions of the concepts you're referring to:


Despite that skepticism, if you comb through the 53-page paper, there are all kinds of clever optimizations and approaches that DeepSeek has taken to make the V3 model, and we do believe that these cut down on inefficiencies and boost the training and inference performance on the iron DeepSeek has to play with.

The key innovation in the approach taken to train the V3 foundation model, we think, is the use of 20 of the 132 streaming multiprocessors (SMs) on the Hopper GPU to work, for lack of better words, as a communication accelerator and scheduler for data as it passes around a cluster while the training run chews through the tokens and generates the weights for the model from the parameters set. As far as we can surmise, this “overlap between computation and communication to hide the communication latency during computation,” as the V3 paper puts it, uses SMs to create what is in effect an L3 cache controller and a data aggregator between GPUs that are not in the same node.

As the paper puts it, this communication accelerator, which is called DualPipe, has the following tasks:

  • Forwarding data between the InfiniBand and NVLink domain while aggregating InfiniBand traffic destined for multiple GPUs within the same node from a single GPU.
  • Transporting data between RDMA buffers (registered GPU memory regions) and input/output buffers.
  • Executing reduce operations for all-to-all combine.
  • Managing fine-grained memory layout during chunked data transferring to multiple experts across the InfiniBand and NVLink domain.
In another sense, then, DeepSeek has created its own on-GPU virtual DPU for doing all kinds of SHARP-like processing associated with all-to-all communication in the GPU cluster.

Here is an important paragraph about DualPipe:

“As for the training framework, we design the DualPipe algorithm for efficient pipeline parallelism, which has fewer pipeline bubbles and hides most of the communication during training through computation-communication overlap. This overlap ensures that, as the model further scales up, as long as we maintain a constant computation-to-communication ratio, we can still employ fine-grained experts across nodes while achieving a near-zero all-to-all communication overhead. In addition, we also develop efficient cross-node all-to-all communication kernels to fully utilize InfiniBand and NVLink bandwidths. Furthermore, we meticulously optimize the memory footprint, making it possible to train DeepSeek-V3 without using costly tensor parallelism. Combining these efforts, we achieve high training efficiency.”

The paper does not say how much of a boost this DualPipe feature offers, but if a GPU is waiting for data 75 percent of the time because of the inefficiency of communication, then hiding that latency with scheduling tricks, much as L3 caches do for CPU and GPU cores, is worth a lot: if DeepSeek can push computational efficiency from roughly 25 percent to near 100 percent on those 2,048 GPUs, about a 4X gain, the cluster would start acting as if it had 8,192 GPUs (with some of the SMs missing, of course) that were not running as efficiently because they did not have DualPipe. OpenAI’s GPT-4 foundation model was trained on 8,000 of Nvidia’s “Ampere” A100 GPUs, which is like 4,000 H100s (sort of). We are not saying this is the ratio DeepSeek attained; we are just saying this is how you might think about it.
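
To make the computation-communication overlap idea above concrete, here is a minimal PyTorch sketch that hides an all-to-all behind independent math by launching the collective asynchronously and only waiting on it when its result is needed. This is a generic illustration of the principle, not DeepSeek's DualPipe or its SM-level kernels; the buffer sizes and the torchrun launch command are assumptions for the demo:

```python
# Sketch of computation-communication overlap with torch.distributed.
# Launch on one multi-GPU host with, e.g.:
#   torchrun --nproc_per_node=4 overlap_demo.py
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)
    device = torch.device("cuda", rank)

    # Tokens destined for other ranks (think: MoE expert dispatch).
    send_buf = torch.randn(1 << 20, device=device)
    recv_buf = torch.empty_like(send_buf)

    # Independent work that does not depend on the incoming data.
    a = torch.randn(4096, 4096, device=device)
    b = torch.randn(4096, 4096, device=device)

    # 1. Start the all-to-all without blocking (NCCL runs it on its own stream).
    handle = dist.all_to_all_single(recv_buf, send_buf, async_op=True)

    # 2. Overlap: compute while the interconnect moves data.
    local_out = a @ b

    # 3. Block only when the received data is actually needed.
    handle.wait()
    combined = recv_buf.sum() + local_out.sum()
    print(f"rank {rank}: {combined.item():.3f}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

DeepSeek goes much further than this, dedicating a slice of SMs to the data movement itself, which is what the rest of the thread gets at.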

And this:

 

Yes, that's what I was referring to. Another article (I can't remember where from) said that this had to be done by resorting to the equivalent of coding in assembler; it isn't possible in CUDA, which is a high-level language.

Though presumably there would be nothing stopping Nvidia from adding such HLL comms programming to CUDA now that it's been shown to give such a big advantage... ;-)
 
Not being a CUDA implementation expert, I can't say one way or the other, but I suspect you're correct. I wonder, however, how much of this communications pipelining and parallelism must be integrated into the application itself, or how application-specific it is.
Knowing the Nvidia networking team a bit, I suspect they're already at work. :)
 