
A Revolutionary Massively Parallel Processing Architecture

paul pawlenko

Guest
I would like to start a technical discourse about a massively parallel architecture called Practical Plentiful Parallel Processing, or 4P. These processors are designed by effectively inserting tiny processors between memory cells, thereby creating a very large parallel hardware canvas. The main idea is to keep processing localized unless and until the data needs to be streamed off chip for IO such as graphics, network, or disk. All data is referenced by a data stream consisting of a single address and an element count. This drastically simplifies the hardware by eliminating the need for addressing and three cache levels. The hardware simplicity allows an order of magnitude (minimum) performance increase per unit area of silicon.
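For readers who want something concrete, here is a minimal C++ sketch of what a "single address plus element count" stream reference might look like. The names StreamDescriptor, base, and count are my own invention for illustration, not part of 4P.

#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

// Hypothetical illustration: a data stream referenced only by a starting
// address and an element count, with no per-element addressing.
struct StreamDescriptor {
    const uint32_t* base;   // single starting address
    std::size_t     count;  // number of elements to stream
};

// A consumer sees the data strictly in stream order.
uint64_t sumStream(const StreamDescriptor& s) {
    uint64_t total = 0;
    for (std::size_t i = 0; i < s.count; ++i)
        total += s.base[i];
    return total;
}

int main() {
    std::vector<uint32_t> memory = {1, 2, 3, 4, 5};
    StreamDescriptor s{memory.data(), memory.size()};
    std::cout << sumStream(s) << "\n";  // prints 15
}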

This hardware canvas operates much like a physical IC layout except that it is entirely programmable. Modules are similar to procedures in conventional languages except that they are instantiated by placing them onto a physical area of the hardware canvas, where they operate on data as it streams through them, much like commands in the UNIX shell. Modules can have multiple inputs and outputs and can operate asynchronously by sleeping until a true value on a specified input line signals them to wake. Simple serial modules occupy only a small amount of canvas, while parallel SIMD modules, such as graphics, can have identical modules in rows and columns running in parallel over large portions of the canvas.
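As a software analogy only (on real 4P hardware these would be regions of the canvas, not function calls), the UNIX-pipe-like composition of modules could be sketched like this in C++; Stream, Module, scale, and threshold are names I made up for the sketch.

#include <functional>
#include <iostream>
#include <vector>

// Software analogy: a "module" consumes an input stream and produces an
// output stream, so modules compose like commands in a UNIX pipeline.
using Stream = std::vector<int>;
using Module = std::function<Stream(const Stream&)>;

// Two simple serial modules.
Module scale(int k) {
    return [k](const Stream& in) {
        Stream out;
        for (int v : in) out.push_back(v * k);
        return out;
    };
}

Module threshold(int limit) {
    return [limit](const Stream& in) {
        Stream out;
        for (int v : in) if (v >= limit) out.push_back(v);
        return out;
    };
}

int main() {
    Stream data = {1, 4, 7, 10};
    // "scale | threshold", analogous to piping data through placed modules.
    Stream result = threshold(12)(scale(3)(data));
    for (int v : result) std::cout << v << " ";   // prints 12 21 30
    std::cout << "\n";
}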

The streaming IO makes communicating between physical chips logically identical to communicating on chip, except for the additional latency incurred when starting a new stream. With some exceptions, physical pins are functionally programmable and can be allocated as a resource by the operating system, as can area on the canvas. One exception is the main control pin on each physical chip, which maintains the highest level of authority to control the operation of any area on the physical chip at all times. The security areas of the operating system are typically loaded through this pin via a secure physical or network connection.
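Purely to illustrate the idea that pins (and canvas area) are ordinary resources handed out by the operating system, here is a tiny sketch; CanvasOS, allocatePin, and the pin numbering are all invented for this example and do not come from 4P.

#include <iostream>
#include <set>

// Hypothetical resource manager: general-purpose pins are allocated and
// released like any other OS resource; one control pin stays reserved.
class CanvasOS {
    std::set<int> freePins{0, 1, 2, 3, 4, 5, 6, 7};  // pin 8 = reserved control pin
public:
    // Request any free pin; returns -1 if none are available.
    int allocatePin() {
        if (freePins.empty()) return -1;
        int pin = *freePins.begin();
        freePins.erase(freePins.begin());
        return pin;
    }
    void releasePin(int pin) { freePins.insert(pin); }
};

int main() {
    CanvasOS os;
    int netPin  = os.allocatePin();  // pin repurposed for a network stream
    int diskPin = os.allocatePin();  // pin repurposed for disk IO
    std::cout << "network pin " << netPin << ", disk pin " << diskPin << "\n";
    os.releasePin(netPin);
}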

Contrast this with conventional processor architectures, where address pins are mandated by the hardware specification. By allowing general use of hardware resources such as pins and processing canvas space, the architecture gives tremendous power to the program developer that is simply not possible through conventional, instruction/operand based designs. Those designs are also greatly limited by the serial instruction streams they process. The 4P hardware canvas runs every module in parallel, with every memory cell being updated every clock tick.

I could keep writing for hours on why this architecture is so greatly superior to conventional architectures but I am just trying to get the conversation started. Please feel free to post any questions or comments. I look forward to reading them.
 
Seems like you have just described Automata computing, which is still a work in progress, or managed memory like Micron and Intel are working on. I also recently wrote a forum post about moving memory and processing closer together, which is already being done. Many companies have been working in this direction for a few years now.
 
What language is used to program them? I would be interested in seeing it.

Automata computing is a very general form of computation. 4P has also been similarly compared to an FPGA. Both technologies can be used in many ways so details really matter here.

Do any of the architectures you describe use address buses? Caches? How about instructions and operands? If they do, then they are not really comparable to 4P, as those are some of the major sources of architectural inefficiency that are eliminated by 4P hardware.

If not, then I wonder why they have not been developed and marketed, as I recall learning about automata computing in grad school over 20 years ago. The 4P hardware layout is simple enough for a small, dedicated team to construct in relatively short order, provided the fundamental concepts are clearly understood. Any company truly understanding the power of 4P technology and allocating appropriate resources would find that the effort to implementation is applied engineering for a well-defined set of tasks. No fundamental breakthroughs are required here, just a lot of hard work.

This leads me to my next point. Ideas are great, but working results are better. Do any of the technologies you mention have a working hardware simulator showing a running program? If so, I would like to see it and compare it to mine, shown here:
https://sourcecodecreations.com/4p-in-action-1

That way we can get a really good look and do an apples to apples comparison since, as I stated, details matter.

Finally, the hardware IC layout is a key piece, but only a piece. The hardware provides the massive performance increase, but compilers, operating systems, device drivers, and applications all must be designed and built in the new language. Are those already designed using automata? If so, I would very much like to see the designs.
 
Someone brought up a good point.
Let it be clear that the 4P canvas does not execute instructions with operands.
First, the entire program is loaded onto the canvas.
Then, after the program is locked into place on the canvas, the data is streamed through and operated upon as shown in the video.
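To make that load/lock/stream sequence concrete, here is an illustrative software model only; CanvasProgram and its methods are invented names, and the "module" is a trivial stand-in, not anything from the actual simulator.

#include <cassert>
#include <iostream>
#include <vector>

// Illustrative model of the three phases described above: load the program
// onto the canvas, lock it in place, then stream data through it.
class CanvasProgram {
    bool locked = false;
public:
    void load() { std::cout << "modules placed on canvas\n"; }
    void lock() { locked = true; }            // no further placement allowed
    std::vector<int> run(const std::vector<int>& in) {
        assert(locked && "data may only stream through a locked program");
        std::vector<int> out;
        for (int v : in) out.push_back(v + 1); // stand-in for the real modules
        return out;
    }
};

int main() {
    CanvasProgram prog;
    prog.load();
    prog.lock();
    for (int v : prog.run({10, 20, 30})) std::cout << v << " ";  // 11 21 31
    std::cout << "\n";
}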
 
The video link: https://sourcecodecreations.com/4p-in-action-1

You hit the nail on the head. Indeed, the structure of the processors between memory cells is the IP that I am selling, so it will remain secret until someone buys it and decides to make the design public. I do, however, offer many details on the website regarding how the hardware is programmed and used, and its many advantages beyond performance.

As to what the individual processors do, I can say they perform simple operations between local groups of bits. Since the processing is localized, the hardware remains very simple, with small gate delays and local performance on par with a dedicated ASIC. Combining many such small processors creates a program, as shown in the video.


Transputers and XMOS use a standard opcode/operand von Neumann implementation that creates a serial instruction stream, so, no, they do not exploit hardware parallelism in the same way as 4P.

I am not familiar with PureData beyond a cursory examination, but it appears to be a dataflow GUI that creates a workflow which subsequently compiles to a set of instructions that execute on a standard PC CPU.

The GUI front end of such dataflow software, including PureData (if I understand it correctly), could provide the basis for a drag-and-drop interface for 4P modules as they are placed onto the canvas. For the simulator in the video, I wrote my interface in HLSL since that is what I am comfortable with, but using a more feature-rich dataflow API could provide a much cleaner GUI front end with significantly less programming effort.

Paul
 
From website: "We expect 99% of all text based software development (C, C++, C#, Java, Assembly, etc.) to become obsolete and be replaced by 4P style coding"

This is, for me, the reason why 4P will never succeed. A new architecture that can't be programmed with normal programming languages is, I think, never going to take off. Up to now, all architectures that required the programming language to be adapted to their new way of doing things have failed. FPGAs are only now getting real traction in data centers for acceleration because of OpenCL and similar coding libraries that remove the need to use RTL to program them.

Also, can you elaborate on how (nested) if-then-else is going to work in your architecture, maybe with an extension for procedure calls inside the different branches of the if-then-else?
To me, it seems that for your architecture all problems need to be converted to map-reduce-type algorithms, which I think limits applicability.
 
x86 VMs will undoubtedly be built to allow business as usual. Orders-of-magnitude performance increases have a way of swaying skeptics. OpenCL/CUDA was the market response to developers resorting to treacherous programming techniques, abusing graphics shaders for general purpose programming just to leverage their performance. 4P has the advantage over GPUs of trivial scalability, with additional chips supplementing existing hardware.

Also note that programming will be much simpler and more intuitive, errors will be easier to isolate, and it will (finally) enable logical security instead of security by obscurity.


Nested if-then-else will operate like a train track switch, with the binary data stream being the train cars.
The equivalent of a C procedure is a 4P module, with the argument list replaced by input/output streams that operate on structured data.
C structures are flattened to their binary stream equivalent and then passed into a module (a sketch of this flattening step follows below).
A module is instantiated by 1) programming it onto the canvas, then 2) locking it into place, then 3) running data through it, as shown in the video.
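Here is a minimal sketch of the flattening step mentioned above: a C struct serialized into a flat byte stream that a module input could consume. The Point struct, the byte layout, and the function name flatten are examples of mine, not a 4P-defined format.

#include <cstdint>
#include <cstring>
#include <iostream>
#include <vector>

// Example of flattening a C structure into its binary stream equivalent.
struct Point {
    int32_t x;
    int32_t y;
};

// Serialize a batch of Points into one flat byte stream for a module input.
std::vector<uint8_t> flatten(const std::vector<Point>& pts) {
    std::vector<uint8_t> stream(pts.size() * sizeof(Point));
    std::memcpy(stream.data(), pts.data(), stream.size());
    return stream;
}

int main() {
    std::vector<Point> pts = {{1, 2}, {3, 4}};
    std::vector<uint8_t> stream = flatten(pts);
    std::cout << stream.size() << " bytes streamed\n";  // 16 bytes
}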

An application module can "call" another module in one of two ways.
If the called module is small enough, it can simply be included in the area of the application module similar to an "inline" statement in C++.
If the called module is large, then the data will be streamed as output from the application to input in the module being called.
Large APIs, such as the graphics API, will have control modules directing traffic from the applications, as directed by the OS using dedicated control lines.

So, for the deeply nested if-then-else, the conditional will be some evaluation of the input data that signals the train track switch which way to direct the flow of data for each branch. When the innermost branches are reached, the data is tagged with a header (see below) and then streamed as output from the module to its destination to be further processed.
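For illustration, a software sketch of the train-track-switch idea: a conditional evaluated on the stream steers each element to one of two branch outputs, and nesting a second switch on one branch gives the nested if-then-else. The names streamSwitch, taken, and notTaken are mine; on 4P this would be a routing element on the canvas, not a function.

#include <iostream>
#include <vector>

// Two branch outputs of a single switch.
struct SwitchOutputs {
    std::vector<int> taken;     // branch for condition == true
    std::vector<int> notTaken;  // branch for condition == false
};

// The "train track switch": each element is routed by the conditional.
template <typename Cond>
SwitchOutputs streamSwitch(const std::vector<int>& in, Cond cond) {
    SwitchOutputs out;
    for (int v : in)
        (cond(v) ? out.taken : out.notTaken).push_back(v);
    return out;
}

int main() {
    // Outer switch: negative vs non-negative. Inner switch: even vs odd.
    auto outer = streamSwitch({-3, -2, 1, 4, 7}, [](int v) { return v < 0; });
    auto inner = streamSwitch(outer.notTaken, [](int v) { return v % 2 == 0; });
    std::cout << "negative: " << outer.taken.size()
              << ", even: " << inner.taken.size()
              << ", odd: " << inner.notTaken.size() << "\n";  // 2, 1, 2
}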

Buses will stream large amounts of data around the canvas in a very general way, using protocols similar to network protocols for congestion control. Each data stream will have headers prepended that route it and then instruct the destination API how to process the data. One paradigm I anticipate is a size followed by a series of single-bit branches that direct the data flow inward to the appropriate processing module, with the size telling the branch switches how long to remain in their current state before being reset for the next stream.
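A rough sketch of that header paradigm, assuming a header of an element count plus one bit per routing switch; the RoutedStream layout and the binary-tree interpretation of the branch bits are assumptions of mine for illustration, not a 4P protocol definition.

#include <cstdint>
#include <iostream>
#include <vector>

// Hypothetical routed stream: a size, a series of single-bit branch
// decisions that steer the payload inward, then the payload itself.
struct RoutedStream {
    uint32_t size;                  // how long the switches hold their state
    std::vector<uint8_t> branches;  // one bit per switch on the path
    std::vector<int> payload;       // the data itself
};

// Resolve the destination by walking the branch bits as a path in a
// binary routing tree: leaf index = bits read most-significant first.
int destination(const RoutedStream& s) {
    int leaf = 0;
    for (uint8_t bit : s.branches)
        leaf = (leaf << 1) | (bit & 1);
    return leaf;
}

int main() {
    RoutedStream s{3, {1, 0, 1}, {7, 8, 9}};
    std::cout << "route " << s.payload.size() << " elements to module "
              << destination(s) << "\n";   // route 3 elements to module 5
}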

The resulting program development environment is really very similar to hardware development, except that it is programmable.
This will give hardware developers something to do when all the monstrously complex CPUs are replaced by the simple 4P canvas ;)
 
You are going to have to show more variety of algorithms than TSP. Even with TSP, 8 nodes is tiny. Your local truck delivery service routinely uses routes generated with a hundred stops or more. And yes, they do that with tuned apps that have shortcuts and assumptions - can you do that? But you might be better off if you can show how you do a neural net, a tensor multiply, a layout generator, etc. Can you bootstrap your own layout? These are hard algorithms, but that is kind of the point. Conventional chips are deeply embedded in the real world and run programs with millions of lines of code. You need to show how you get to at least the equivalent of thousands of lines of code to become more than a research project.

Your main competition is not ordinary CPUs, it is FPGAs and ASICs. These will absorb the algorithms that currently flow through CPUs wherever reorganizing the data flow and distributing specialized computation can reduce power and latency by an order of magnitude or more. This is where More-than-Moore is going - sideways, using the sea of gates smarter instead of trying to clock faster or stretch the speculative pipelines beyond useful limits.
 
What you say is true.
The point of the simulation video is not to solve the TSP but rather to show what a 4P program looks like when it is running. The next-generation hardware simulator will have offline capability to allow much larger problems to be shown; that is waiting on funding.

Actually, the CPUs and GPUs are the prime targets for 4P to replace. Yes, ASICs and FPGAs do offer remarkable speedups, but they are typically highly specialized, limiting their use to specific, highly parallel tasks. 4P has the additional advantage of handling serial tasks without underutilizing hardware resources. Serial tasks take a relatively small amount of 4P canvas, allowing the remainder of the canvas to be used for other tasks. The 4P canvas allows the OS, device drivers, and certain APIs to run in their own hardware space concurrently with any applications. There is no "task switching". Every program residing on the canvas is updated every clock tick, allowing near-hardware throughput while being completely programmable.
 
Indeed, pipelines can only go so deep with the von Neumann model.

More efficient and better-utilized hardware is only part of the problem. Trying to implement parallelism "automatically" without involving the programmer is fraught with issues in any general context. That is why the 4P architecture is centered on giving power to the programmer, since they are the ones who really understand which parts can exploit parallelism and which parts are inherently serial.
 