Two weeks ago I attended a seminar at the Daresbury Laboratory near Warrington. Unfortunately I only had the day off, so I couldn't attend the workshops on offer, and Buttercup, my trusty Landy, had to do the trip there and back in one day. Sorry to everyone on the M4, M5 and M6 that day :p
Jack Dongarra presented a rather nice summary of supercomputing / parallel computing, or as he called it, "Throughput Computing". I had never heard the term before and now really like it: it implies you should make the best use of the available resources in order to maximize throughput. It's very easy to get into a mindset where everything has to run in parallel, but in doing so you can fail to notice that some parts work very well in serial.
Of course, serial parts can be a bottleneck: lots of fork-and-join sections leave cores idle and reduce your throughput. A better way of keeping your compute nodes busy while dealing with the inevitable serial sections is to express the work as a directed acyclic graph (DAG) of tasks, so each task can fire as soon as its dependencies are met.
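To make that concrete, here's a minimal sketch of DAG-style task execution using only the Python standard library (the task names and the toy graph are my own invention, not anything from the talk). Independent tasks overlap with each other and with the serial join, instead of everything waiting at a global barrier:

```python
# Minimal DAG task execution sketch (hypothetical tasks, Python stdlib only).
# A task is submitted the moment its last dependency finishes, so independent
# work overlaps the unavoidable serial sections instead of blocking at a join.
from concurrent.futures import ThreadPoolExecutor
import threading

def run(task):
    print(f"running {task}")  # stand-in for real work

def execute(deps, workers=4):
    # deps[task] = set of tasks that must finish before `task` can start
    remaining = {t: set(d) for t, d in deps.items()}
    dependents = {t: [] for t in deps}
    for t, d in deps.items():
        for parent in d:
            dependents[parent].append(t)

    pool = ThreadPoolExecutor(max_workers=workers)
    lock = threading.Lock()
    all_done = threading.Event()
    left = [len(deps)]  # tasks not yet finished

    def submit(task):
        fut = pool.submit(run, task)
        fut.add_done_callback(lambda _f, t=task: finished(t))

    def finished(task):
        ready = []
        with lock:
            left[0] -= 1
            for nxt in dependents[task]:
                remaining[nxt].discard(task)
                if not remaining[nxt]:
                    ready.append(nxt)  # last dependency just cleared
            done = left[0] == 0
        for nxt in ready:
            submit(nxt)
        if done:
            all_done.set()

    for t, d in deps.items():
        if not d:
            submit(t)  # seed the graph with dependency-free tasks
    all_done.wait()
    pool.shutdown()

# Toy graph: two independent load/factor chains, a serial join, two post steps.
execute({
    "load_a": set(), "load_b": set(),
    "factor_a": {"load_a"}, "factor_b": {"load_b"},
    "combine": {"factor_a", "factor_b"},  # the serial section
    "post_a": {"combine"}, "post_b": {"combine"},
})
```

Real runtimes like the ones Dongarra's group works on do this at far larger scale with much lower overhead, but the firing rule is the same.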
He also mentioned that programming these large machines effectively is becoming more and more complex, and therefore costly, and that there is currently no language that can adequately describe the systems needed.
In the current and next generations of supercomputers there will be less memory per core than there is today. Memory is a big consumer of power, and in the drive for power efficiency it will have to be reduced, possibly with small chunks of memory layered onto the processing elements in a 3D fashion. This will of course mean a rethink of existing algorithms, as compute elements may no longer share a common memory area.
Another very interesting topic he mentioned, which I had not considered before, is fault tolerance. We are used to expecting hard drives / storage to fail and have various systems to mitigate the impact of a failure (RAID, mirrors, etc.), but if a compute node fails in a large cluster in the middle of, say, a matrix operation, how do we firstly detect that there has been a problem, and secondly recover from it without having to restart the entire run?
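One common answer today is application-level checkpoint/restart: periodically save the solver state to reliable storage, and resume from the last checkpoint after a failure rather than from iteration zero. A minimal sketch (the file name, state layout and interval are my own invention):

```python
# Checkpoint/restart sketch (hypothetical solver state).
# Detection is usually left to the runtime or scheduler noticing a dead node;
# this only shows the recovery side.
import os
import pickle

CHECKPOINT = "solver_state.pkl"  # assumed path on reliable shared storage

def load_or_init():
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            return pickle.load(f)  # resume after a failure
    return {"iteration": 0, "x": [0.0] * 1000}  # fresh start

def checkpoint(state):
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CHECKPOINT)  # atomic rename: a crash never leaves a half-written file

state = load_or_init()
while state["iteration"] < 10000:
    state["x"] = [v + 1e-6 for v in state["x"]]  # stand-in for one solver step
    state["iteration"] += 1
    if state["iteration"] % 500 == 0:  # checkpoint interval is a tuning knob
        checkpoint(state)
```

The fault-tolerant algorithms he mentioned go further than this: they aim to tolerate or reconstruct the lost piece of the computation without rewinding everyone to a checkpoint.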
Lastly he mentioned:
5 Important Features to Consider:
1) Many-core and hybrid machines will require block data layout and dynamic, data-driven execution.
2) Mixed precision - you don't always need maximum precision (see the sketch after this list)
3) Self-adapting / auto-tuning software
4) Fault-tolerant algorithms
5) Communication-avoiding algorithms
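On point 2, a classic example, and one Dongarra's group has championed, is mixed-precision iterative refinement: do the expensive O(n^3) factorisation in fast single precision, then recover double-precision accuracy with a few cheap O(n^2) residual corrections. A rough numpy sketch of the idea (the test matrix and iteration count are just illustrative):

```python
# Mixed-precision iterative refinement sketch for solving A x = b.
import numpy as np

def mixed_precision_solve(A, b, iters=5):
    A32 = A.astype(np.float32)  # low-precision copy for the expensive solves
    # (a real code would LU-factor A32 once and reuse the factors;
    #  np.linalg.solve refactors on each call, which is fine for a sketch)
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(iters):
        r = b - A @ x  # residual computed in full precision
        d = np.linalg.solve(A32, r.astype(np.float32)).astype(np.float64)
        x += d         # correction step
    return x

rng = np.random.default_rng(0)
n = 500
A = rng.standard_normal((n, n)) + n * np.eye(n)  # well-conditioned test matrix
b = rng.standard_normal(n)
x = mixed_precision_solve(A, b)
print(np.linalg.norm(A @ x - b))  # small residual after refinement
```

Single precision is often twice as fast (or better, on GPUs) and halves memory traffic, yet the refined answer is accurate to double precision for reasonably conditioned systems.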
All in all a very informative talk and well worth the 10 hours of driving time :)
"with small chunks of memory layered onto the processing elements in a 3d fashion", I am highly looking forward to this assuming the interconnect network between the processing elements (PEs) is fast and wide enough. Specifically I'd like to see a board of many chips each of which had many PE/MEM cores.
Programming one of these systems would be a joy IMO. Whereas with current GPUs we end up being very gather-centric when it comes to global communication (and end up doing a huge amount of ALU work either to search for data or to get data into the order required for processing), with the above type of system it would likely be best to be scatter-centric. PE nodes would often follow this pattern:
1. Process data at node.
2. Compute destination coordinate(s) for data in some computational domain.
3. Use a fixed mapping from computational domain to core index(es) based on the number of PEs in the machine. This is effectively a coarse binning operation (each destination core handles a subset of nodes in the DAG).
4. Send data packet to core(s), allowing the high speed PE interconnect to route data in the background of computation.
This push method is one-way and doesn't require any round-trip communication. The interconnect network handles pushing data to the next PE node(s) for computation (push to the next node(s) in the DAG). Further computation at the PEs could easily bin queued data (in linear time) into fine-granularity bins (to gather out-of-order data packets headed for one given node in the DAG), then do the computation for all the data at that DAG join node, and the process continues...
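Roughly, steps 2-3 plus the fine binning might look like this (the domain size, core grid and packet format are all made up for illustration):

```python
# Scatter-centric push sketch (hypothetical 2D domain on an 8x8 PE grid).
# core_index is the fixed mapping of steps 2-3; fine_bin is the linear-time
# counting sort a receiving PE would run over its queue so that all packets
# headed for one DAG node end up contiguous.

DOMAIN = (1024, 1024)  # computational domain extents (assumed)
CORES = (8, 8)         # PE grid (assumed)

def core_index(y, x):
    """Coarse bin: which core owns domain coordinate (y, x)."""
    cy = y * CORES[0] // DOMAIN[0]
    cx = x * CORES[1] // DOMAIN[1]
    return cy * CORES[1] + cx

def fine_bin(packets, num_bins):
    """Linear-time counting sort of queued packets by local bin id,
    so out-of-order arrivals for one DAG node become contiguous."""
    bins = [[] for _ in range(num_bins)]
    for p in packets:
        bins[p["bin"]].append(p)
    return [p for bucket in bins for p in bucket]

# A PE computes where its result lives in the domain, maps that to a core,
# and pushes; no round trip needed.
dest = core_index(300, 700)
print("push payload to core", dest)
```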
Each network push re-converges data and computational locality, and it would be up to the programmer to provide optimal domain mappings so that data routing latency and cost are minimized (given the highly non-uniform network cost based on the distance to the destination PE).
I agree with you; currently, to get decent throughput on GPUs you have to do a large amount of algorithmic redesign, and even then a lot of processing cycles are taken up trying to get the data into a format that is friendly to the global memory system.
Having memory dedicated to the processor - in the case mentioned, directly above it on the package - would mean very fast PE<->mem communication. Dr Dongarra mentioned that this might possibly be photonic in the future. The memory would probably be shared between the cores on the physical socket, effectively giving a high-speed PE<->PE communication link between small groups (those on the same socket).
Unfortunately, communicating between nodes is relatively slow, hence the need for "communication avoiding algorithms". I don't think that even optical interconnects will solve this problem. But I tend to think of these clusters in terms of graphics processing, where there is a need for a huge amount of bandwidth; a lot of scientific systems don't need quite so much for the problems they are interested in.
In the architecture you mention on your blog, with all the interconnects, the DAG could actually be represented by the nodes themselves, with each node firing a certain state/function once its neighbours meet the prerequisites and feeding the data onwards. This is actually doable on the GPU too, but the memory access patterns are far from ideal.
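As a toy sketch of that firing rule (the graph and node functions here are invented): each node counts its outstanding inputs, and the arrival that drops the count to zero fires the node and feeds the result to its successors - no central scheduler, unlike the worker-pool sketch earlier in the post.

```python
# Dataflow firing sketch: the DAG lives in the nodes themselves.
# Each node counts outstanding inputs; the input that empties the count
# fires the node and pushes the result downstream.

class Node:
    def __init__(self, name, fn, num_inputs):
        self.name = name
        self.fn = fn
        self.pending = num_inputs  # prerequisites not yet met
        self.inputs = []
        self.successors = []       # downstream nodes

    def receive(self, value):
        self.inputs.append(value)
        self.pending -= 1
        if self.pending == 0:      # all neighbours delivered: fire
            out = self.fn(self.inputs)
            print(f"{self.name} fired -> {out}")
            for nxt in self.successors:
                nxt.receive(out)   # feed the data onwards

# Tiny diamond-shaped DAG: src feeds a and b, which join at sink.
src = Node("src", lambda xs: 1, 1)
a = Node("a", lambda xs: xs[0] + 10, 1)
b = Node("b", lambda xs: xs[0] * 10, 1)
sink = Node("sink", lambda xs: sum(xs), 2)
src.successors = [a, b]
a.successors = [sink]
b.successors = [sink]

src.receive(0)  # kick off the graph
```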