Could Parallel Processes or Higher Speed Replace Shrink?

Arthur Hanson

Well-known member
With shrink coming to an end, could parallel processes or higher speeds replace shrink? Would integrating memory, maybe by sections, into the chip speed up processing? Could more specialized designs accomplish tasks at increased speeds? Which companies may be working in this area, and what about cost efficiency?
 
In many applications power efficiency, not just density or speed, becomes the key, because power per operation is shrinking more slowly than area with each process node, so power density is rising.

This means the best approach is to go parallel, not faster, which is exactly what's been happening with CPU cores for many years.

The networking chips I work on have massively parallel internal processing, with data buses thousands of bits wide; it's the only way to go when you're processing terabits per second...
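A rough illustration of the go-parallel-not-faster point, using the usual dynamic-power relation P ~ C*V^2*f; the voltages and frequencies below are my own example numbers, not figures from this thread.

#include <stdio.h>

/* dynamic power ~ units * C * V^2 * f (activity factor folded into cap) */
static double dyn_power(double units, double cap, double volts, double freq)
{
    return units * cap * volts * volts * freq;
}

int main(void)
{
    double cap = 1.0;  /* arbitrary switched capacitance, same logic in both cases */

    /* one block at 2 GHz and 1.0 V vs. two parallel blocks at 1 GHz and 0.8 V:
       same nominal throughput (2 G operations/s) in both cases */
    double p_fast = dyn_power(1.0, cap, 1.0, 2.0e9);
    double p_par  = dyn_power(2.0, cap, 0.8, 1.0e9);

    printf("relative power: fast = %.2f, parallel = %.2f\n",
           p_fast / p_fast, p_par / p_fast);   /* prints 1.00 and 0.64 */
    return 0;
}

The parallel version wins because the lower clock usually allows a lower supply voltage, and voltage enters the power equation squared.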
 
Ian, as you are aware, running lots of embedded CPUs at the same time on the same die will generate heat. Do you use parallel chiplets? Do you use high-threshold devices and perhaps lower voltages to reduce short-circuit currents (the flow of current through the complementary FETs)? Do you reduce the resolution of your math functions?
 
We use everything possible to reduce power -- varying Vth, a low supply voltage that is adjusted to track gate delay, just-enough-maths resolution -- the datapaths are not programmable CPUs, they're all custom logic. Chiplets don't really work for internal signal processing when we have data rates of multiple terabits per second; the power overhead to get this much data from one chiplet to another (and the number of wires needed to do it) is just too high...
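A minimal sketch of the "just-enough-maths" idea, assuming a made-up 12-bit fixed-point format; the datapaths described above are custom logic, not C, so this only illustrates trading numeric resolution for a smaller, lower-power datapath.

#include <stdint.h>
#include <stdio.h>

#define FRAC_BITS 12   /* assumed precision; a real design picks this from error analysis */

static int32_t to_fixed(double x)   { return (int32_t)(x * (1 << FRAC_BITS)); }
static double  to_double(int32_t x) { return (double)x / (1 << FRAC_BITS); }

/* fixed-point multiply: product has 2*FRAC_BITS fraction bits, shift back down */
static int32_t fx_mul(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * b) >> FRAC_BITS);
}

int main(void)
{
    /* a multiply-accumulate carried in narrow fixed point instead of 32-bit float;
       in hardware the narrower operands mean far fewer gates toggling per operation */
    double coeff[4]  = {0.25, -0.5, 0.125, 0.75};
    double sample[4] = {1.0, 0.5, -0.25, 0.33};
    int32_t acc = 0;

    for (int i = 0; i < 4; i++)
        acc += fx_mul(to_fixed(coeff[i]), to_fixed(sample[i]));

    printf("fixed-point result: %f (float reference: %f)\n",
           to_double(acc), 0.25*1.0 - 0.5*0.5 + 0.125*-0.25 + 0.75*0.33);
    return 0;
}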
 
Thanks. This makes sense to me. Do you think the big digital companies (AMD, Intel) do this, or are they all about brute force? I suspect the latter.
 
Arthur, I'm not sure exactly what you're asking, but I think you're asking whether parallel processing computing architectures can replace shrink. Are you? If so, the answer is that pretty much every chip designed today already has parallel processing designed into it, some more than others. Nvidia GPUs, for example, have a very high level of parallel processing (thousands of CUDA cores), as do various AI/ML chips and cores, like Apple's, with Intel's Gaudi processor from Habana being another example.

General purpose CPUs have parallel processing at multiple levels: many cores, hardware threads (which are seen as additional cores by an OS, but are really just sharing over-provisioned functional units in a given CPU core), superscalar instruction processing, out-of-order execution, special-purpose CPU instructions for specific workloads (Intel is [in]famous for this, with AVX-512, AMX, and CRC32c instructions), multiple memory channels, multiple I/O channels, and direct memory access engines, to name the functionality off the top of my head.

There are also numerous implementation techniques used to increase execution efficiency, which mostly boil down to replacing software executed by general-purpose CPUs with specific hardware solutions like state-machine logic (Nvidia network adapters, Intel Sapphire Rapids PCIe-connected accelerators) and FPGA logic.

To use an analogy, these design innovations for parallelism or efficiency are like boats, and the fab process is like the ocean. A better process (shrink) raises all boats.
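To make the core/hardware-thread level concrete, here is a minimal sketch (my own illustration, not from the post) of task-level parallelism on a general-purpose CPU using POSIX threads; the thread count of 4 is just an assumed core count.

/* compile with: gcc -O2 -pthread sum.c */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NTHREADS 4          /* assumed number of cores, for illustration only */
#define N (1 << 20)

static double data[N];

struct chunk { size_t lo, hi; double sum; };

/* each thread sums its own slice of the array */
static void *partial_sum(void *arg)
{
    struct chunk *c = arg;
    double s = 0.0;
    for (size_t i = c->lo; i < c->hi; i++)
        s += data[i];       /* the compiler may also vectorize this loop (SIMD) */
    c->sum = s;
    return NULL;
}

int main(void)
{
    for (size_t i = 0; i < N; i++)
        data[i] = 1.0;

    pthread_t tid[NTHREADS];
    struct chunk c[NTHREADS];
    for (int t = 0; t < NTHREADS; t++) {
        c[t].lo = (size_t)t * N / NTHREADS;
        c[t].hi = (size_t)(t + 1) * N / NTHREADS;
        pthread_create(&tid[t], NULL, partial_sum, &c[t]);
    }

    double total = 0.0;
    for (int t = 0; t < NTHREADS; t++) {
        pthread_join(tid[t], NULL);
        total += c[t].sum;
    }
    printf("total = %f\n", total);
    return 0;
}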
 
Moving large amounts of data between parallel compute elements is going to be a key area of focus.

A novel approach to routing data is going to be critical. ;-)
 
Ah, yes. Putting the computation near the storage. Or in the storage. Or in the memory. Or in the network. Multiple decades-old ideas, because any systems architect knows that moving data to compute and back is just a waste of energy and latency. End users get no benefit. The problem is making the point at which computation takes place transparent to the applications, because otherwise you need new applications, which will only happen once there is pervasive industry-standard hardware and software. And that isn't happening for the most part, though the big cloud vendors do this themselves because their systems are so big they are markets unto themselves. For software vendor applications you need compilers which make the placement of the computing transparent. It could be done with middleware, like this proposal for edge computing I was once asked to comment on:


But it still requires application changes if the original version doesn't use a distributed computing model. And of course every different middleware product will have different interfaces and behaviors, so the apps will be locked into the middleware vendor.

I hope you have a different solution in mind.
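A toy sketch of the compute-placement trade-off discussed in this post; the "smart storage" function below is entirely hypothetical and just stands in for running the same predicate next to the data instead of shipping every record to the host.

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

typedef struct { int id; double value; } record_t;
typedef bool (*predicate_t)(const record_t *);

static bool over_threshold(const record_t *r) { return r->value > 100.0; }

/* conventional path: all n records were already moved across the fabric to the host */
static size_t filter_on_host(const record_t *all, size_t n, predicate_t p)
{
    size_t hits = 0;
    for (size_t i = 0; i < n; i++)
        if (p(&all[i]))
            hits++;
    return hits;
}

/* hypothetical near-data path: the same predicate runs where the data lives,
   and only the result would cross the fabric */
static size_t filter_near_data(const record_t *device_data, size_t n, predicate_t p)
{
    return filter_on_host(device_data, n, p);   /* same logic, different location */
}

int main(void)
{
    record_t demo[3] = { {1, 50.0}, {2, 150.0}, {3, 250.0} };
    printf("host filter: %zu matches (all 3 records moved)\n",
           filter_on_host(demo, 3, over_threshold));
    printf("near-data  : %zu matches (only results moved)\n",
           filter_near_data(demo, 3, over_threshold));
    return 0;
}

The transparency problem described above is exactly that nothing in standard toolchains ships over_threshold to the device automatically; that is what a compiler or middleware layer would have to do.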
 
Working on it. We can use low-level system interactions to quickly exchange data (and expect to transparently share system resources). Basic implementations are working well.
That works for remote data access, but it won't solve the distributed computation problem. You have to intercept the instruction stream, either with function-call insertions in applications (meaning code changes) or by creating a proprietary compiler that generates the distributed computing code.
 
Actually, it seems we are working the problem opposite to your understanding. Or opposite to my explanation. Consider the Non-Transparent Bridge for PCIe switching. We have implemented a transparent bridgeswitch. Our implementation of such a bridgeswitch permits system-level transparent exchanges of data and system resources. We don't support remote access. In fact, our implementation of data exchange resets data/resource security to a default secure position (without additional overheads).
 
Your previous posts gave me the impression you were working on the distributed computation problem. Like these guys:


What you're describing sounds like GigaIO:


or one of the several companies working on CXL 3.0 distributed memory fabrics, or even Enfabrica:

 
Actually, our solution ends up solving both challenges.

As to GigaIO and Enfabrica, they seem to continue down the road of doing more to achieve less power and less overhead - it's almost contradictory. It's actually quite remarkable that doing more can result in less power and less latency. Enfabrica implements RDMA and CXL. One of the strengths of CXL is in managing cache... we are looking to remove caching altogether. Normally that can cause issues around timing and misconnects, but we have a solution for that. EnCharge AI has some super sophisticated integrated chip designs. Don't look for complexity from me; I'm not well trained or experienced enough to go chasing complex designs.

I have been able to reduce latency and power by removing nearly everything that gets in the way of efficient computing. So yes - we are looking at addressing the underlying issue of reducing power, latency and computing bottlenecks; however, our target is 90-95% savings, not 50%. And we expect to achieve it by doing much, much less.

Right now we are focused on building a consumer data platform, because it is too long a sales cycle to convince an engineer to unthink 70-some years of proven designs. Presently, we can switch 5 Gbps across 4 ports, consuming 0.02 mW with 40 ns latency. (The FPGA is limited to 100 MHz; increasing to 1 GHz would drop switch latency to 4 ns.) Adding more ports does not increase latency but will add some additional power drain. Still, nothing is as efficient or fast. A side benefit of the platform is that we can arrange for resource transfers between the connected systems transparently, so as to address the distributed compute challenge. Achieving these objectives required us to take a very different approach to data exchange - one that is super simple and completely unexpected.

Am working on publishing some early work in the coming months. We just have a lot on the go.
 
Looking forward to seeing what you guys come up with.
 
Clarification to the above, as we work through the analysis of the switching prototype: switching power consumed for a 10 Gbps switch (2 parallel switches supporting 5 Gbps each) between any/many of 3 ports is 0.02 mW; a further 2.8 mW is consumed by a clock that is not really necessary. Just powering the board takes 27 mW, but we are not using 99% of it, so we will be happy to move to custom silicon. Extrapolating to 100 Gbps, we're looking at 0.40 mW - kinda lower than the 1 W per 100 Gbps industry target, which only works above 1 Tbps.
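Taking the quoted figures at face value (they are the poster's claims, not something verified here), this is the back-of-the-envelope energy-per-bit comparison implied by the numbers above.

#include <stdio.h>

int main(void)
{
    double claimed_w   = 0.02e-3;   /* 0.02 mW claimed for the prototype switch */
    double claimed_bps = 5e9;       /* at 5 Gbps */
    double target_w    = 1.0;       /* "1 W per 100 Gbps" industry target quoted above */
    double target_bps  = 100e9;

    /* energy per bit = power / bit rate */
    printf("claimed: %.1f fJ/bit\n", claimed_w / claimed_bps * 1e15);   /* ~4 fJ/bit */
    printf("target : %.1f pJ/bit\n", target_w / target_bps * 1e12);     /* ~10 pJ/bit */
    return 0;
}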
 