Could Parallel Processes or Higher Speed Replace Shrink?

Arthur Hanson

Well-known member
With shrink coming to an end, could parallel processes or higher speeds replace shrink? Would integrating memory, maybe by sections, into the chip speed up processing? Could more specialized designs accomplish tasks at increased speeds? Which companies may be working in this area, and what about cost efficiency?
 
In many applications power efficiency, not just density or speed, becomes the key, because power per operation is shrinking more slowly than area with each process node, so power density is rising.

This means the best approach is to go parallel, not faster, which is exactly what's been happening with CPU cores for many years.

The networking chips I work on have massively parallel internal processing, with data buses thousands of bits wide; it's the only way to go when you're processing terabits per second...
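A rough illustration of the go-parallel-not-faster point, using the usual dynamic-power relation P ~ C*V^2*f; the voltages and frequencies below are my own example numbers, not figures from this thread.

#include <stdio.h>

/* dynamic power ~ units * C * V^2 * f (activity factor folded into cap) */
static double dyn_power(double units, double cap, double volts, double freq)
{
    return units * cap * volts * volts * freq;
}

int main(void)
{
    double cap = 1.0;  /* arbitrary switched capacitance, same logic in both cases */

    /* one block at 2 GHz and 1.0 V vs. two parallel blocks at 1 GHz and 0.8 V:
       same nominal throughput (2 G operations/s) in both cases */
    double p_fast = dyn_power(1.0, cap, 1.0, 2.0e9);
    double p_par  = dyn_power(2.0, cap, 0.8, 1.0e9);

    printf("relative power: fast = %.2f, parallel = %.2f\n",
           p_fast / p_fast, p_par / p_fast);   /* prints 1.00 and 0.64 */
    return 0;
}

The parallel version wins because the lower clock usually allows a lower supply voltage, and voltage enters the power equation squared.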
 
Ian, as you are aware, running lots of embedded CPUs at the same time on the same die will generate heat. Do you use parallel chiplets? Do you use high-threshold devices and perhaps lower voltages to reduce short-circuit currents (the flow of current through the complementary FETs)? Do you reduce the resolution of your math functions?
 
We use everything possible to reduce power -- varying Vth, a low supply voltage that is adjusted to track gate delay, just-enough-maths resolution -- the datapaths are not programmable CPUs, they're all custom logic. Chiplets don't really work for internal signal processing when we have data rates of multiple terabits per second; the power overhead to get this much data from one chiplet to another (and the number of wires needed to do it) is just too high...
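A minimal sketch of the "just-enough-maths" idea, assuming a made-up 12-bit fixed-point format; the datapaths described above are custom logic, not C, so this only illustrates trading numeric resolution for a smaller, lower-power datapath.

#include <stdint.h>
#include <stdio.h>

#define FRAC_BITS 12   /* assumed precision; a real design picks this from error analysis */

static int32_t to_fixed(double x)   { return (int32_t)(x * (1 << FRAC_BITS)); }
static double  to_double(int32_t x) { return (double)x / (1 << FRAC_BITS); }

/* fixed-point multiply: product has 2*FRAC_BITS fraction bits, shift back down */
static int32_t fx_mul(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * b) >> FRAC_BITS);
}

int main(void)
{
    /* a multiply-accumulate carried in narrow fixed point instead of 32-bit float;
       in hardware the narrower operands mean far fewer gates toggling per operation */
    double coeff[4]  = {0.25, -0.5, 0.125, 0.75};
    double sample[4] = {1.0, 0.5, -0.25, 0.33};
    int32_t acc = 0;

    for (int i = 0; i < 4; i++)
        acc += fx_mul(to_fixed(coeff[i]), to_fixed(sample[i]));

    printf("fixed-point result: %f (float reference: %f)\n",
           to_double(acc), 0.25*1.0 - 0.5*0.5 + 0.125*-0.25 + 0.75*0.33);
    return 0;
}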
 
Thanks. This makes sense to me. Do you think the big digital companies (AMD, Intel) do this, or are they all about brute force? I suspect the latter.
 
Arthur, I'm not sure exactly what you're asking, but I think you're asking whether parallel processing computing architectures can replace shrink. Are you? If so, the answer is that pretty much every chip designed today already has parallel processing designed into it, some more than others. Nvidia GPUs, for example, have a very high level of parallel processing (thousands of CUDA cores), as do various AI/ML chips and cores, like Apple's, with Intel's Gaudi processor from Habana being another example.

General purpose CPUs have parallel processing at multiple levels: many cores, hardware threads (which are seen as additional cores by an OS, but are really just sharing over-provisioned functional units in a given CPU core), superscalar instruction processing, out-of-order execution, special-purpose CPU instructions for specific workloads (Intel is [in]famous for this, with AVX-512, AMX, and CRC32c instructions), multiple memory channels, multiple I/O channels, and direct memory access engines, to name the functionality off the top of my head.

There are also numerous implementation techniques used to increase execution efficiency, which mostly boil down to replacing software executed by general-purpose CPUs with specific hardware solutions like state-machine logic (Nvidia network adapters, Intel Sapphire Rapids PCIe-connected accelerators) and FPGA logic.

To use an analogy, these design innovations for parallelism or efficiency are like boats, and the fab process is like the ocean. A better process (shrink) raises all boats.
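To make the core/hardware-thread level concrete, here is a minimal sketch (my own illustration, not from the post) of task-level parallelism on a general-purpose CPU using POSIX threads; the thread count of 4 is just an assumed core count.

/* compile with: gcc -O2 -pthread sum.c */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NTHREADS 4          /* assumed number of cores, for illustration only */
#define N (1 << 20)

static double data[N];

struct chunk { size_t lo, hi; double sum; };

/* each thread sums its own slice of the array */
static void *partial_sum(void *arg)
{
    struct chunk *c = arg;
    double s = 0.0;
    for (size_t i = c->lo; i < c->hi; i++)
        s += data[i];       /* the compiler may also vectorize this loop (SIMD) */
    c->sum = s;
    return NULL;
}

int main(void)
{
    for (size_t i = 0; i < N; i++)
        data[i] = 1.0;

    pthread_t tid[NTHREADS];
    struct chunk c[NTHREADS];
    for (int t = 0; t < NTHREADS; t++) {
        c[t].lo = (size_t)t * N / NTHREADS;
        c[t].hi = (size_t)(t + 1) * N / NTHREADS;
        pthread_create(&tid[t], NULL, partial_sum, &c[t]);
    }

    double total = 0.0;
    for (int t = 0; t < NTHREADS; t++) {
        pthread_join(tid[t], NULL);
        total += c[t].sum;
    }
    printf("total = %f\n", total);
    return 0;
}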
 
Moving large amounts of data between parallel compute elements is going to be a key area of focus.

A novel approach to routing data is going to be critical. ;-)
 
Ah, yes. Putting the computation near the storage. Or in the storage. Or in the memory. Or in the network. Multiple decades-old ideas, because any systems architect knows that moving data to compute and back is just a waste of energy and latency. End users get no benefit. The problem is making the point at which computation takes place transparent to the applications, because otherwise you need new applications, which will only happen once there is pervasive industry-standard hardware and software. And that isn't happening for the most part, though the big cloud vendors do this themselves because their systems are so big they are markets unto themselves. For software vendor applications you need compilers which make the placement of the computing transparent. It could be done with middleware, like this proposal for edge computing I was once asked to comment on:


But it still requires application changes if the original version doesn't use a distributed computing model. And of course every different middleware product will have different interfaces and behaviors, so the apps will be locked into the middleware vendor.

I hope you have a different solution in mind.
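A toy sketch of the compute-placement trade-off discussed in this post; the "smart storage" function below is entirely hypothetical and just stands in for running the same predicate next to the data instead of shipping every record to the host.

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

typedef struct { int id; double value; } record_t;
typedef bool (*predicate_t)(const record_t *);

static bool over_threshold(const record_t *r) { return r->value > 100.0; }

/* conventional path: all n records were already moved across the fabric to the host */
static size_t filter_on_host(const record_t *all, size_t n, predicate_t p)
{
    size_t hits = 0;
    for (size_t i = 0; i < n; i++)
        if (p(&all[i]))
            hits++;
    return hits;
}

/* hypothetical near-data path: the same predicate runs where the data lives,
   and only the result would cross the fabric */
static size_t filter_near_data(const record_t *device_data, size_t n, predicate_t p)
{
    return filter_on_host(device_data, n, p);   /* same logic, different location */
}

int main(void)
{
    record_t demo[3] = { {1, 50.0}, {2, 150.0}, {3, 250.0} };
    printf("host filter: %zu matches (all 3 records moved)\n",
           filter_on_host(demo, 3, over_threshold));
    printf("near-data  : %zu matches (only results moved)\n",
           filter_near_data(demo, 3, over_threshold));
    return 0;
}

The transparency problem described above is exactly that nothing in standard toolchains ships over_threshold to the device automatically; that is what a compiler or middleware layer would have to do.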
 
Working on it. We can use low-level system interactions to quickly exchange data (and expect to transparently share system resources). Basic implementations are working well.
That works for remote data access, but it won't solve the distributed computation problem. You have to intercept the instruction stream, either with function-call insertions in applications (meaning code changes) or by creating a proprietary compiler that generates the distributed computing code.
 
Actually, it seems we are working the problem opposite to your understanding. Or opposite to my explanation. Consider the Non-Transparent Bridge for PCIe switching. We have implemented a transparent bridgeswitch. Our implementation of such a bridgeswitch permits system-level transparent exchanges of data and system resources. We don't support remote access. In fact, our implementation of data exchange resets data/resource security to a default secure position (without additional overheads).
 
Your previous posts gave me the impression you were working on the distributed computation problem. Like these guys:


What you're describing sounds like GigaIO:


or one of the several companies working on CXL 3.0 distributed memory fabrics, or even Enfabrica:

 
Actually, our solution ends up solving both challenges.

As to GigaIO and Enfabrica, they seem to continue down the road of doing more to achieve less power and less overhead - it's almost contradictory. It's actually quite remarkable that doing more can result in less power and less latency. Enfabrica implements RDMA and CXL. One of the strengths of CXL is in managing cache... we are looking to remove caching altogether. Normally that can cause issues around timing and misconnects, but we have a solution for that. EnCharge AI has some super sophisticated integrated chip designs. Don't look for complexity from me; I'm not well trained or experienced enough to go chasing complex designs.

I have been able to reduce latency and power by removing nearly everything that gets in the way of efficient computing. So yes - we are looking at addressing the underlying issue of reducing power, latency and computing bottlenecks; however, our target is 90-95% savings, not 50%. And we expect to achieve it by doing much, much less.

Right now we are focused on building a consumer data platform, because it is too long a sales cycle to convince an engineer to unthink 70-some years of proven designs. Presently, we can switch 5 Gbps across 4 ports, consuming 0.02 mW with 40 ns latency. (The FPGA is limited to 100 MHz; increasing to 1 GHz would drop switch latency to 4 ns.) Adding more ports does not increase latency but will add some additional power drain. Still, nothing is as efficient or fast. A side benefit of the platform is that we can arrange for resource transfers between the connected systems transparently, so as to address the distributed compute challenge. Achieving these objectives required us to take a very different approach to data exchange - one that is super simple and completely unexpected.

Am working on publishing some early work in the coming months. We just have a lot on the go.
 
Looking forward to seeing what you guys come up with.
 
Clarification to the above, as we work through the analysis of the switching prototype: switching power consumed for a 10 Gbps switch (2 parallel switches supporting 5 Gbps each) between any/many of 3 ports is 0.02 mW; a further 2.8 mW is consumed by a clock that is not really necessary. Just powering the board takes 27 mW, but we are not using 99% of it, so we will be happy to move to custom silicon. Extrapolating to 100 Gbps, we're looking at 0.40 mW - kinda lower than the 1 W per 100 Gbps industry target, which only works above 1 Tbps.
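Taking the quoted figures at face value (they are the poster's claims, not something verified here), this is the back-of-the-envelope energy-per-bit comparison implied by the numbers above.

#include <stdio.h>

int main(void)
{
    double claimed_w   = 0.02e-3;   /* 0.02 mW claimed for the prototype switch */
    double claimed_bps = 5e9;       /* at 5 Gbps */
    double target_w    = 1.0;       /* "1 W per 100 Gbps" industry target quoted above */
    double target_bps  = 100e9;

    /* energy per bit = power / bit rate */
    printf("claimed: %.1f fJ/bit\n", claimed_w / claimed_bps * 1e15);   /* ~4 fJ/bit */
    printf("target : %.1f pJ/bit\n", target_w / target_bps * 1e12);     /* ~10 pJ/bit */
    return 0;
}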
 