ISO/IEC JTC1/SC22/WG5-N1754

Implementation Issues for Coarrays
----------------------------------

Nick Maclaren, 30th October 2008


Introduction
------------

I am including this paper because it was circulated to the BSI Fortran panel and some other people, and includes evidence for various statements in other papers.  It is very much a background paper for coarray specialists only.  I apologise for its length, but it is unavoidable if I am to address all of the points raised after I circulated the first draft to the UK panel.  It does NOT represent the BSI Fortran panel's position.

The fundamental issues are described in the first two sections.  The remaining sections analyse the issues in more detail, and provide evidence for the statements in the first two.  Few people will want to read them, or at least will want to look at only the examples.

In almost all cases where I make remarks about subtle technical aspects, I can provide further details on request, but including them here would make this paper incomprehensible as well as several times the length.


A. Implementation problems with coarrays on distributed memory systems
-----------------------------------------------------------------------

Coarrays will be extremely difficult to implement on distributed memory systems, and may be impractical for many developers on the almost ubiquitous 'commodity clusters'[*].  Throughout this document, the references to problems are in the context of coarrays being implemented on that sort of distributed memory system.  Possible solutions to this are described in section B.

There are at least the following major problems:

A.1) There is a known, genuinely portable technique for implementing them (see section E.1 below), but it can be extremely inefficient, needs significant changes to all existing compilers, and does not work reliably together with a companion processor.

A.2) They cannot be implemented reliably and efficiently using any widespread formal or informal standard, including POSIX (IEEE Std 1003.1, 2004) and MPI-2 (Message Passing Interface), without relying on explicitly undefined behaviour.

A.3) Even ignoring standards conformance and portability, it is doubtful that they can be implemented reliably and efficiently on many systems without operating system extensions.  This would cause serious problems to third-party vendors.

A.4) There seems to be little implementation experience of coarrays, except by one complete system vendor[+], and it is unclear whether even that vendor has implemented all of the problematic aspects.

A.5) There appear to be no other widespread parallel languages or interfaces that both introduce all of the implementation problems that coarrays do and have any implementations that resolve them.

Those statements may seem controversial, but this paper provides evidence for them, which accounts for its length.  While coarrays are syntactically simple, their semantics (and hence implementation) are not; in fact, I know of no other parallel interface or language that creates as many difficulties for the implementor.

The problems would, of course, be essentially soluble if coarrays were intended only for cache-coherent shared memory systems, complete system vendors and specialist interconnects, but I believe that is not the objective.

I believe that coarrays are not suitable for inclusion in the Fortran standard until either the above issues can be shown to be soluble by the existence of at least one real implementation by a third-party vendor on a commodity cluster, and some evidence that this will extend to other such systems, or the problems raised here are resolved by changes to the specification.

Terminology:

[*] I use the term 'commodity cluster' to refer to a collection of off-the-shelf workstations or small servers, connected by TCP/IP and Ethernet (or possibly InfiniBand), and running some widely-available operating system (such as a Linux or Unix variant or a Microsoft system).  On such systems, almost all compilers, language run-time systems and applications libraries are written by separate organisations, and run without 'system privileges' and without operating system extensions.

[+] The term 'complete system vendor' is not a happy one, but I cannot think of a better one.  It refers to the vendors who maintain and market at least a modified version of a standard operating system, including specialist networking support, and the Fortran and MPI that run under them.  Such vendors can ensure that the operating system and network interfaces provide the facilities needed for implementing parallel languages, reliably and efficiently.

It will be clear that, to a large extent, there are two disjoint types of system, and it is the former I am concerned about.

[ Aside: specialist interconnects are an intermediate form, but the only one I have found that seems to be adequate for implementing Fortran coarrays, reliably and efficiently, is Quadrics - and that is fairly rare.  It is mentioned wherever it is relevant in this document. ]


B. Summary of the problems
--------------------------

The fundamental problem is that the specification permits two images to communicate by defining and referencing coarrays owned by a third, without any explicit action by the owning process.  That is very difficult to implement when the underlying communication mechanism is based on message passing.  The difficulty is compounded by existing Fortran 2003 features, including companion processors and even I/O, which prevent many of the known solutions to this problem being used for coarrays.

The effect will be either that coarrays are not implemented for most distributed memory systems, or that the implementations have such serious (and probably undocumented) restrictions that many reasonable coarray programs will not work effectively on them.  Users will then justifiably claim that Fortran is an unsuitable language for parallel programming.  In theory, most distributed memory operating systems could be enhanced to enable the efficient implementation of coarrays, but I doubt that Fortran is still important enough to force such a drastic change.

All of these problems can be solved with reasonable efficiency on most current systems (though not all) if VOLATILE coarrays are dropped from the standard.  See sections E.6 and E.7 for details.

An alternative approach is to change the coarray specification to make them implementable, but the necessary changes are drastic (see section E.5 below).  If that approach were to be taken, it should be the subject of another paper; this one is complicated enough as it is.


EVIDENCE FOR THE ABOVE STATEMENTS
---------------------------------

Please note that the examples are written to show the issues, and are not necessarily the best way to perform their task.  Writing clear, realistic examples would take more time than I have.

C. The basic requirement
------------------------

Consider a program like the following:

    PROGRAM Synchronisation
       INTEGER :: value[*] = 0, i
       LOGICAL, VOLATILE :: first[*] = .False., second[*] = .False.
       SELECT CASE(THIS_IMAGE())
          CASE(1)
             DO i = 1,CO_UBOUND(value)
                value[i] = 123*i
             END DO
             SYNC MEMORY          ! One
             first[9] = .True.
          CASE(2)
             DO WHILE (.NOT. first[9])
                CONTINUE
             END DO
             SYNC MEMORY          ! Two
             DO i = 1,CO_UBOUND(value)
                PRINT *, value[i]
             END DO
             second[8] = .True.
          CASE(3)
             ! Assume some intensive serial computation here
             DO WHILE (.NOT. second[8])
                CONTINUE
             END DO
       END SELECT
    END PROGRAM Synchronisation

The processor has two options to ensure consistency, with variations:

C.1) The straightforward implementation involves transferring all of the updates of coarrays from image 1 to all other images 'during' the SYNC MEMORY statement marked One, and transferring them from all other images to image 2 'during' the SYNC MEMORY statement marked Two.

C.2) An alternative approach involves transferring the identities of all updated coarrays from image 1 to all other images 'during' the SYNC MEMORY statement marked One, and then inspecting them in image 2 'during' the SYNC MEMORY statement marked Two.  The reason that they must be transferred to all other images is that any of them could be executing code similar to that of image 2.

Both of those mechanisms are expensive, and the second is not scalable, but that is not the primary point of this paper.  It is important to note that the use of VOLATILE coarrays in the above is not fundamental, and any functionally equivalent synchronisation mechanism would expose the same issues; examples of this are described in section D below.


D. The potential for deadlock
-----------------------------

This example relies on the existence of "user-defined ordering" as permitted in 8.5.1 paragraph 5.  Consider a program like the following:

    PROGRAM Deadlock
       INTERFACE
          SUBROUTINE Mutex_lock (which) BIND(C)
             USE, INTRINSIC :: ISO_C_BINDING
             INTEGER(KIND=C_INT), INTENT(IN), VALUE :: which
          END SUBROUTINE Mutex_lock
          SUBROUTINE Mutex_unlock (which) BIND(C)
             USE, INTRINSIC :: ISO_C_BINDING
             INTEGER(KIND=C_INT), INTENT(IN), VALUE :: which
          END SUBROUTINE Mutex_unlock
       END INTERFACE
       INTEGER :: value[*] = 0, i
       IF (THIS_IMAGE() == 1) THEN
          CALL Mutex_lock(9)
       ELSE IF (THIS_IMAGE() == 3) THEN
          CALL Mutex_lock(8)
       END IF
       SYNC ALL
       SELECT CASE(THIS_IMAGE())
          CASE(1)
             DO i = 1,CO_UBOUND(value)
                value[i] = 123*i
             END DO
             SYNC MEMORY          ! One
             CALL Mutex_unlock(9)
          CASE(2)
             CALL Mutex_lock(9)
             SYNC MEMORY          ! Two
             DO i = 1,CO_UBOUND(value)
                PRINT *, value[i]
             END DO
             CALL Mutex_unlock(8)
          CASE(3)
             CALL Mutex_lock(8)
       END SELECT
    END PROGRAM Deadlock

If the call to Mutex_lock in image 3 blocks, and coindexed objects owned by it cannot be accessed by another image while it is in that state, the above program will deadlock.  The Fortran processor obviously has no control over the code of Mutex_lock and Mutex_unlock, and so cannot prevent them from blocking.
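
Aside: to show how ordinary, and how completely outside the Fortran processor's control, such external routines can be, here is a minimal C sketch of what the companion-processor side might look like.  It is purely illustrative: it assumes named FIFOs /tmp/mutexN, each pre-seeded with one token byte and reachable by every image (for example via a shared file system); the file names and the transport are mine, not part of the example.

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* These are the binding labels of the BIND(C) interfaces in program
       Deadlock.  The details are hypothetical; the point is that the
       blocking happens entirely inside ordinary external code. */

    static int open_fifo (int which, int flags) {
        char name[32];
        snprintf(name, sizeof(name), "/tmp/mutex%d", which);
        return open(name, flags);
    }

    void mutex_lock (int which) {
        char token;
        int fd = open_fifo(which, O_RDONLY);
        read(fd, &token, 1);             /* may block for an unbounded time */
        close(fd);
    }

    void mutex_unlock (int which) {
        char token = 'x';
        int fd = open_fifo(which, O_WRONLY);
        write(fd, &token, 1);            /* releases one blocked mutex_lock */
        close(fd);
    }

The blocking read is perfectly legitimate C, and the Fortran processor cannot see into it, interrupt it or insert polling into it.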

It is important to note that this problem is not caused solely by companion processors, but by any external interface, including Fortran I/O.  My original example was based on Fortran I/O, but its unavoidable complexity caused many of the people who saw it to misunderstand its point.  I have therefore excluded it from this paper, but here is a summary of its design:

D.1) Mutex_unlock() is replaced by writing a record to an I/O unit 10+, followed by a FLUSH statement on that unit, and Mutex_lock() by reading one from an I/O unit 20+.

D.2) Those I/O units are connected to FIFOs (pipes, sockets etc.), where the other ends of units 10+ and 20+ are an input FIFO and an output one to the same program, which simply reads a record from unit 10+ and writes it to 20+.

That is obviously a reasonable configuration and exposes exactly the same problem as the above example does, though the external details use features not in the Fortran standard.  Communication using files and a shared file system like NFS is common in parallel applications on distributed memory systems, though it is usually done in other ways.


E. Implementation possibilities
-------------------------------

Consider a conventional distributed memory system, made up of separate systems (i.e. CPUs and memory), connected by I/O over a network, with a conventional semi-secure operating system (such as specified by POSIX), and where compilers, language run-time systems and applications libraries are unprivileged applications.  Fortran does not require each image to store its own copy of a coarray locally, but it 'almost does', and coarrays are unlikely to be implemented in any other way on such systems.

In the above examples, when image 1 reaches the SYNC MEMORY marked One, it must transfer the coarray updates to each of the systems that hosts the other images.  Similarly, the data must be transferred from them to image 2 when it reaches the SYNC MEMORY marked Two.

I shall call image control statements, the use of VOLATILE and I/O statements 'natural break points', because they all need some special code to ensure cross-image consistency, according to the current draft.  They could obviously be extended to handle cross-image transfers as well.  The examples do not show how this problem can arise while an image is executing code that contains no natural break points, but that is obviously possible, and I do that in my conversion to UPC (see sections F.2 and G below).

The examples show that the processor cannot wait until both images involved in a transfer reach a natural break point, without losing parallelism, or causing livelock and even deadlock.  I know of only six realistic options, and will describe them and their problems separately.

E.1. Polling in compiled code
-----------------------------

The Fortran processor can obviously check for pending transfers (i.e. poll for them) at every natural break point, and insert artificial break points into all long-running loops that do not contain natural break points.  This technique was used on some of the rudimentary operating systems of the 1960s and 1970s to detect looping programs, and experience showed that the 'companion processor' problem caused the technique to be seriously unreliable.  It was dropped as soon as almost all operating systems supported preemptive interrupts.

For reasonable efficiency in high-communication programs, a break point needs to be reached every few microseconds - experience with MPI is that the difference between 1 and 5 microseconds latency makes a vast difference to many applications.  Note that a typical scheduling interval for a modern system is 10 milliseconds, which is far too long, and so implementing the polling efficiently needs great skill and is not portable.  However, most MPI implementations achieve it.

That interferes with many other critical optimisations, in particular the extreme pipelining and pseudo-vectorisation needed for good performance on Intel IA64, Hitachi SR8000 and similar future systems.  It would handicap optimisation even on SPARC and POWER, and possibly even x86.
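
Aside: as an indication of what this option means in practice, here is a rough C sketch of the code a Fortran processor might generate for a long-running loop that contains no natural break points.  The run-time routine name and the polling interval are invented for illustration; the point is that the inserted call must be reached every few microseconds, which is precisely what obstructs the pipelining and pseudo-vectorisation mentioned above.

    /* Hypothetical run-time system routine that services any pending
       one-sided accesses from other images. */
    extern void rts_poll_remote_accesses(void);

    #define POLL_INTERVAL 64     /* iterations between artificial break points */

    void compute_kernel (double *a, const double *b, long n) {
        for (long i = 0; i < n; ++i) {
            a[i] = a[i]*b[i] + 1.0;              /* the user's loop body */
            if (i % POLL_INTERVAL == 0)
                rts_poll_remote_accesses();      /* inserted break point */
        }
        rts_poll_remote_accesses();              /* and one at loop exit */
    }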

More seriously, it does not help with programs that use companion processors, in this context including programs that call optimised versions of the BLAS, LAPACK etc., because the Fortran processor obviously cannot insert polling into external code.  This problem is shown by example Deadlock in section D, but would remain even if the words "or user-defined ordering" were removed.  Obviously, an image executing a companion processor procedure cannot service any memory requests from other images until it returns.  See also section E.5 below.

E.2. Purely one-sided transfers ('RDMA')
----------------------------------------

There are some purely one-sided transfer mechanisms (i.e. ones that need no action whatsoever on the image that owns the data).  In this context, the criterion is that no action is needed by the local application when a remote system reads and updates its data, and not what happens in the hardware, the network interface device driver or the operating system kernel.  Similarly, the relevant interfaces are ones that can be used by unprivileged applications, such as Fortran programs.  Some more details of RDMA are given below.

Most cache-coherent shared memory (SMP) architectures provide suitable mechanisms, but are irrelevant in this context; note that, despite common belief, they do not always do so entirely transparently.  It appears that Cray distributed memory systems provide them, and that the Quadrics interconnect does, too.

TCP/IP and Ethernet do not provide any such mechanism.  There have been several research projects to extend TCP/IP to do this but, as far as I know, they have all fizzled out.  The RDMA Consortium is tackling an unrelated problem.

There is no such mechanism defined by any relevant standard, formal or informal, for reasons implied by sections E.4 and E.6, and I can find no evidence that one is currently provided by any distributed memory operating system or interconnect vendor other than Cray and Quadrics.  It is possible that such mechanisms exist in the OpenFabrics InfiniBand stack, but I baulked at downloading and searching 58 MB of software to find out.  In any case, I can find no evidence of such a mechanism being provided to or used by applications.  Such mechanisms can be provided for systems that do not have them only by non-trivial operating system kernel enhancements.

Aside: some information on RDMA.  The term 'Remote Direct Memory Access' (also called RMA) covers a wide variety of mechanisms, but it always refers to some mechanism for avoiding the use of at least one internal I/O buffer for large transfers.  Despite common belief, the data are not always copied directly from the network to the CPU's memory: incoming packets may be buffered in the network card, decoded there, and the contents then written to the CPU's memory.  That means that a message can reach its destination some time before it becomes visible to an application running on that system.  Unfortunately, the requirement for coarrays is not that the data are written directly to memory, but that the transmission is not acknowledged as complete until the data are actually visible to the application.  The availability of the feature therefore depends more on the interface software (e.g. TCP/IP or OpenFabrics) than on the network hardware (e.g. Ethernet or InfiniBand).

E.3. MPI-2 one-sided communication
----------------------------------

MPI is a widely available de facto standard, and is the dominant interface used on distributed memory systems.

The two-sided and collective mechanisms defined by MPI-1 are available on effectively all parallel systems, but are obviously inadequate for implementing coarrays.  MPI-2 added one-sided communication which, on the face of it, would appear suitable for implementing coarrays.  Very few applications currently use MPI-2 one-sided communication, and it is unclear how reliable, complete and efficient its implementations are.  There are at least the following issues.

Programs in PGAS (Partitioned Global Address Space) languages, like Fortran with coarrays, tend to use a much finer granularity than programs that use message-passing libraries.  It is common for the former to fail when built on the latter, and it was one of the reasons for the failure of HPF.  Intel have a version of OpenMP that supports distributed memory, but it is intended only for low-communication applications like Web servers, and not the high-communication ones common in HPC, according to a personal communication from a developer.

The only appropriate one-sided synchronisation mechanism in MPI-2 is MPI_Win_lock (MPI-2 6.4.3), and MPI allows that to be restricted to memory allocated by MPI_Alloc_mem.  That may not be feasible for all Fortran compilers on all systems.

MPI-2 11.7.2 states that progress with a transfer is not required until the target process next reaches an MPI call, so a transfer may take an unbounded amount of time.  As shown by my conversion to UPC (see sections F.2 and G below), such an implementation would be extremely inefficient on example Synchronisation in section C, and would deadlock on example Deadlock in section D.

MPI's guarantees of no deadlock obviously apply only when MPI procedures are called, and not when external code is.  So any image that is waiting in a companion processor procedure or even Fortran I/O obviously cannot service any memory requests from other images until it returns.
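
Aside: for concreteness, here is a minimal C sketch of how a processor might map a coindexed definition such as value[i] = 123*i onto MPI-2 passive-target operations.  The routine and window names are mine, and it assumes that each image is an MPI process (with image number equal to rank plus one) and that the coarray was allocated with MPI_Alloc_mem and exposed in the window value_win, which, as noted above, MPI-2 permits implementations to require.

    #include <mpi.h>

    /* Hypothetical run-time routine: define one element of an integer
       coarray owned by another image. */
    void rts_put_int (MPI_Win value_win, int image, MPI_Aint element, int value)
    {
        int target = image - 1;
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, target, 0, value_win);
        MPI_Put(&value, 1, MPI_INT, target, element, 1, MPI_INT, value_win);
        /* MPI-2 11.7.2: this unlock need not return until the target
           process next enters the MPI library, so it may block for an
           unbounded time if the target is busy in a compute loop, a
           companion processor procedure or Fortran I/O. */
        MPI_Win_unlock(target, value_win);
    }

A coindexed reference would use MPI_Get and MPI_LOCK_SHARED in the same way, with the same progress caveat.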

E.4. Interrupting the image to complete the transfer
-----------------------------------------------------

In theory, the processor could use some form of unprivileged interrupt mechanism to interrupt the executing image (i.e. the one that owns the data), handle the transfers and then continue processing.  The only currently relevant mechanism is signals, and doing I/O in them is undefined behaviour in C99 (7.14.1.1 The signal function, paragraph 5) and POSIX (sigaction: APPLICATION USAGE, paragraph 3).  Many systems provide extensions to POSIX in this area, but few are much of an improvement, and most do not provide enough supported functionality to implement message passing in an interrupt handler.

From a practical viewpoint, this technique is seriously unreliable under all of the dozen or so current operating systems which I have looked at; it was fiendishly difficult to use and usually unreliable even under the old mainframe systems, where it was fully supported.

I haven't investigated Intel Cluster OpenMP in detail, but its specification states clearly that it uses the page faulting mechanism ('virtual memory') to do this.  That unavoidably needs kernel privilege, though it may use documented system calls.  Equally seriously, it almost certainly will involve the kernel scheduler, and so the latency is likely to be of the order of 10 ms, which is hopeless for VOLATILE coarrays and not good for segments.  See section 11 in:

    http://cache-www.intel.com/cd/00/00/29/78/297875_297875.pdf

E.5. Using the MPI and GASNet progress model
--------------------------------------------

MPI and GASNet (see section F.2 below) have the concept of a progress engine, where natural break points are required to check for and service all pending actions, but no action need be taken between them.  In the Fortran context, a reference or definition from one image to a coindexed object on another might not complete until the latter reached a natural break point.  Examples Synchronisation and Deadlock in sections C and D show that this would not be adequate for implementing coarrays as currently specified.

In order to take this approach, MPI imposes significant constraints on the program to ensure that progress is always possible, and using it would require Fortran to do the same.  Modifying the coarray specification to allow this would certainly be possible, and the simplest approach would be the following:

E.5.1) To eliminate SYNC MEMORY and VOLATILE coarrays in their entirety.  Locking primitives do not cause the same problems, and are more comparable to SYNC IMAGES in this context.

E.5.2) To state that this progress model is used, and that programmers must obey its constraints, in a similar way to MPI.  I say MPI rather than GASNet, because the former standard is more comparable to Fortran in precision and terminology.

E.5.3) To specify that performing unbounded actions using any external interface (including I/O and companion processors), together with using coarrays, is undefined.  Without this, Fortran coarrays will almost certainly develop a reputation for being 'broken'.

These changes would mean that example Synchronisation in section C would usually be extremely inefficient, and example Deadlock in section D would be undefined.

E.6. Using a separate thread to handle the transfers
----------------------------------------------------

A processor could require that there is at least one, permanently running, thread dedicated to message passing per system, and that the operating system provides coherent shared memory between that thread and the image execution threads (or an equivalent mechanism).  This requires cache-coherent threading support and, for practicability, at least one core more than the number of images (per SMP system).  That is not always available on specialist parallel systems, but the constraint is a relatively minor problem nowadays.

However, it requires the hardware and operating system to support coherent shared memory in such a way that an update from one thread is guaranteed to become visible to another thread, expeditiously, with no action by the second thread.  That is undefined behaviour in POSIX (4.10 Memory Synchronization).

It is also unreliable in practice, usually because of thread scheduling, dispatchability and memory consistency problems.  It should be noted that thread scheduling control is optional in POSIX, and its semantics are largely implementation specific (POSIX 4.13 Scheduling Policy).  Problems are rare, but incredibly confusing to the programmer (and even the most experienced system expert) when they occur.  They are often put down to 'cosmic rays', but most of those claims are statistically implausible; I have tracked down a few such failures and can explain how the problems I describe here can occur.

Again, such mechanisms could be made reliable for many or most systems only by non-trivial operating system kernel enhancements.  The alternative approach is to drop VOLATILE coarrays from the standard.  An indication of how that could work is given in section E.7 below.
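
Aside: the pattern that this option relies on, and that POSIX 4.10 leaves undefined, is easy to show.  In this deliberately minimal C sketch (the names are mine), the communication thread stores data received from another image and then raises a flag, and an image thread spins on that flag without calling any POSIX synchronisation function; nothing in POSIX guarantees that the image thread will ever see the flag, let alone see the buffer contents in the right order.

    #include <pthread.h>

    static volatile int incoming_ready = 0;   /* 'volatile' does not rescue
                                                 this under POSIX 4.10 */
    static double incoming_buffer[1024];

    static void *communication_thread (void *arg) {
        for (;;) {
            /* ... receive data from another image into incoming_buffer ... */
            incoming_ready = 1;     /* hope that this, and the buffer contents,
                                       become visible expeditiously */
        }
        return arg;
    }

    static void image_thread_wait (void) {
        while (!incoming_ready)     /* no pthread_mutex_lock() or similar, so
                                       no guaranteed visibility or ordering */
            ;
    }

In practice this usually works on cache-coherent hardware, which is exactly why the failures, when they do occur, are so confusing.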

E.7. A separate thread and no VOLATILE
--------------------------------------

Fortran's ordering rules (8.5.1 paragraph 6) are close enough to POSIX's that enforcing the POSIX rules (POSIX 4.10 Memory Synchronization) will provide Fortran's semantics for non-VOLATILE variables.  Let us consider a single communication thread and one or more image threads on a single, cache-coherent SMP system.  The communication thread transfers local data to and from other SMP systems by message passing (e.g. MPI or TCP/IP), and the fact that there is only one communication thread means that all images external to its SMP system see a consistent view of the local data, whether or not any local image thread accesses it.  Restricting it to a single thread is a scalability issue, of course, but not a catastrophic one.

The problem is how it maintains consistency with the local image threads.  All of Fortran's image control statements need to use one of POSIX's synchronization functions to handshake with the communication thread, and the latter needs to do the same.  Unfortunately, it cannot simply call one of the blocking functions (e.g. sem_wait), as it needs to keep checking for messages from other SMP systems.  There are several ways to resolve this problem, but all need some frequent action by the communication thread, far too frequent to allow the thread to go into a scheduler wait (see section E.1 above).  So using some sort of spin loop and dedicating a core to the communication thread will almost always be essential.

I believe that is enough to implement coarrays without VOLATILE with tolerable efficiency, on multi-core, cache-coherent SMP systems.  However, even this relies on the system being configured for gang scheduling (not usually the default) and for running the number of images that the programmer has requested.  From experience with OpenMP and MPI, it will not work well on multi-user SMP systems.

The above is not enough to implement VOLATILE coarrays, because one of the communication or image threads might read a variable while it was half-updated by another.  So each access to a VOLATILE variable would need to be locked, with the communication thread involved in every lock.  That will not deliver acceptable efficiency on many systems.  Naturally, there are optimisations, but few of them will help with more than a subset of coarray programs, because they rely on certain (quite reasonable) patterns of use being rare.
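
Aside: here is a minimal C sketch of the handshake just described, for one image thread and the communication thread on one SMP system.  All of the names are mine, and a real run-time system would be far more elaborate.  The image thread's side of an image control statement uses ordinary POSIX synchronisation, so POSIX 4.10 provides the consistency needed for non-VOLATILE coarrays; the communication thread spins on a dedicated core, because it must keep servicing the network as well.

    #include <pthread.h>
    #include <stdbool.h>

    static pthread_mutex_t handshake = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  finished  = PTHREAD_COND_INITIALIZER;
    static bool sync_requested = false;    /* both protected by 'handshake' */
    static bool sync_done      = false;

    void image_control_point (void) {      /* e.g. SYNC MEMORY on this image */
        pthread_mutex_lock(&handshake);
        sync_requested = true;
        sync_done = false;
        while (!sync_done)                 /* the image thread may block */
            pthread_cond_wait(&finished, &handshake);
        pthread_mutex_unlock(&handshake);
    }

    void *communication_thread (void *arg) {
        for (;;) {                         /* spin; never a scheduler wait */
            /* ... poll the network and service remote coarray accesses ... */
            pthread_mutex_lock(&handshake);
            if (sync_requested) {
                /* ... complete all pending transfers for this image ... */
                sync_requested = false;
                sync_done = true;
                pthread_cond_broadcast(&finished);
            }
            pthread_mutex_unlock(&handshake);
        }
        return arg;
    }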

F. Other interfaces
-------------------

I have been told that my concerns are unjustified, because there are existing, widely used single-sided message passing interfaces and PGAS languages.  I have done some fairly extensive investigation, and I can find no evidence for those claims - indeed, I have found fairly strong evidence of the converse.

All of the relevant interfaces that I found while searching for evidence were built on top of coherent shared memory, MPI, TCP/IP, SHMEM or GASNet, or some combination.  The first three have already been discussed, so that leaves just the last two.  Some interfaces also support Quadrics and Myrinet, as options.  Quadrics has already been mentioned, but Myrinet currently supports only two-sided communications (MX Specification 1.2, II Concepts).

F.1. Cray SHMEM
---------------

This is the message passing interface that originated on Cray, and was copied on many other parallel systems; it is not the 'System V' shared memory segment interface also called shmem.  The only call that helps with my examples is SHMEM_Quiet (SHMEM_Fence and SHMEM_Wait affect actions on the local node only, and SHMEM_Barrier is a collective).  It appears that SHMEM_Quiet was introduced in the T3E.

There appears to be no adequate implementation of SHMEM for distributed memory systems, except for Cray systems and the specialist interconnect Quadrics.  I have asked several contacts in the parallel programming area, and none of them knows of a version of SHMEM for most distributed memory systems.

A Web search (using Google) on SHMEM_Quiet and Linux had only 28 hits, and none seemed to indicate the existence of any other implementation; indeed, one said that SHMEM_Quiet had not yet been implemented by Scali in 2001.  I changed the "Linux" to "BSD", "Microsoft", "Windows", "InfiniBand", "Voltaire", "Mellanox", "CISCO", "Scali", "Myrinet", "GPSHMEM", "OpenIB" and "Openfabrics"; none got more than 12 hits, and none indicated a relevant implementation.  I tried a few other searches not using SHMEM_Quiet, too.  My attempts to find a current Web page for GPSHMEM failed.

F.2. Unified Parallel C (UPC)
-----------------------------

This is by far the most widespread PGAS (Partitioned Global Address Space) language, but I have found little evidence that it is used for actual applications, as distinct from computer scientists investigating technologies.  I am informed that it is used in the US DoD and related organisations, but I have little contact with their scientific researchers.  I have also been informed that it is used elsewhere, but none of my sources could provide any evidence or any reference that I could investigate.

Its syntax is far more complicated and harder to use than that of coarrays, but this paper is about semantics.  UPC's semantics are much simpler and more restrictive than those of coarrays, in many subtle and poorly documented ways, but I shall omit all aspects but one: that of 'progress'.  It is based on a communication mechanism (GASNet), and the issues raised in this paper relate mostly to that.  My investigations indicate that GASNet lacks the power to implement coarrays efficiently on distributed memory systems.  The current GASNet specification includes the following:

    int gasnet_AMPoll ()

    An explicit call to service the network, process pending messages and
    run handlers as appropriate.  Most of the message-sending primitives in
    GASNet poll the network implicitly.  Purely polling-based
    implementations of GASNet may require occasional calls to this function
    to ensure progress of remote nodes during compute-only loops.  Any
    client code which spin-waits for the arrival of a message should call
    this function within the spin loop to optimize response time.  This
    call may be a no-op on some implementations (e.g. purely
    interrupt-based implementations).  Returns `GASNET_OK' unless an error
    condition was detected.

This makes it clear that GASNet permits (and even assumes) a progress engine like that of MPI-1 (see E.5), and I explain in that section why that is not adequate for coarrays as currently specified.  In section G, I show that this problem is not purely theoretical.  I have enquired further about GASNet, but have not had a response.


G. Conversion to UPC
--------------------

I translated my example into UPC, using a Mandelbrot loop for the computation-intensive section, and ran it with the Berkeley UPC system under SUSE 10.2 on fairly modern Intel/Dell workstations.  It ran efficiently on a single cache-coherent shared-memory system, but failed on a cluster of 5 of the same systems.
If the loop was fairly short, it did not transfer the data until it reached the terminating barrier; if the loop was longer, it detected the lack of progress and aborted.  The results are appended.  The threaded version hung, sometimes unkillably, during process termination, but that is almost certainly an unrelated bug in the Linux pthreads implementation, possibly triggered by one in UPC.  Such problems are not rare with pthreads.

G.1. The UPC code
-----------------

After running this, I noticed that I had forgotten to remove the references to 'second'.  They may be ignored, as they do nothing relevant.

    #include "upc_relaxed.h"
    #include "upc_collective.h"
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    shared [*] int value[THREADS];
    shared [*] volatile int first[THREADS], second[THREADS];

    static time_t start;

    static double delay (void) {
        return difftime(time(NULL),start);
    }

    typedef struct {double re, im;} complex;

    int almond (complex position) {
        complex value_1, value_2;
        int i;
        value_1.re = value_1.im = 0.0;
        for (i = 0; i < 1000*1000*1000; ++i) {
            value_2.re = value_1.re*value_1.re-value_1.im*value_1.im+position.re;
            value_2.im = 2.0*value_1.re*value_1.im+position.im;
            value_1.re = value_2.re*value_2.re-value_2.im*value_2.im+position.re;
            value_1.im = 2.0*value_2.re*value_2.im+position.im;
            if (value_1.re < -2.0 || value_1.re > 2.0 ||
                value_1.im < -2.0 || value_1.im > 2.0) break;
        }
        return (value_1.re*value_1.re+value_1.im*value_1.im <= 4.0);
    }

    int main (void) {
        complex position;
        int i;
        if (THREADS != 5) {
            fprintf(stderr,"This program expects %d threads\n",THREADS);
            exit(EXIT_FAILURE);
        }
        value[MYTHREAD] = i+1;
        first[MYTHREAD] = 0;
        second[MYTHREAD] = 0;
        upc_barrier;
        start = time(NULL);
        if (MYTHREAD == 0) {
            for (i = 0; i < 10; ++i) value[i] = 10*(i+1);
            upc_fence;
            first[4] = 1;
            upc_fence;
            printf("Leaving thread 0 after %.1f seconds\n",delay());
        } else if (MYTHREAD == 1) {
            while (1) {
                upc_fence;
                if (first[4]) break;
            }
            printf("Entering thread 1 after %.1f seconds\n",delay());
            printf("%d %d %d %d %d\n",
                value[0],value[1],value[2],value[3],value[4]);
            upc_fence;
            second[3] = 1;
            upc_fence;
            printf("Leaving thread 1 after %.1f seconds\n",delay());
        } else if (MYTHREAD == 2) {
            position.re = 0.0;
            position.im = 1.0;
            /* These are enabled for the 'big' versions.
            almond(position); almond(position); almond(position);
            almond(position); almond(position); almond(position);
            almond(position); almond(position); almond(position);
            */
            printf("Result = %d\n",almond(position));
            system("ps -fLmu nmm1");
        }
        printf("Thread %d finished after %.1f seconds\n",MYTHREAD,delay());
        upc_barrier;
        printf("Thread %d leaving after %.1f seconds\n",MYTHREAD,delay());
        return EXIT_SUCCESS;
    }

G.2. Results on a shared memory system
--------------------------------------

Note that thread 1 does not wait until thread 2 has finished the computation loop before reading the data that thread 0 has stored in the memory of thread 4.

    pcphxtr13$upcc -pthreads -o SC SC.upc
    pcphxtr13$upcrun -n 5 SC
    WARNING: Node 0 running more threads (5) than there are physical CPU's (2)
             enabling "polite", low-performance synchronization algorithms
    UPCR: UPC threads 0..4 of 5 on pcphxtr13 (process 0 of 1, pid=6486)
    WARNING: Conflicting environment values for GASNET_COLL_GATHER_ALL_DISSEM_LIMIT (1024)
             and GASNET_COLL_GATHER_ALL_DISSEM_LIMIT_PER_THREAD (204)
    WARNING: Using: 204
    WARNING: Conflicting environment values for GASNET_COLL_EXCHANGE_DISSEM_LIMIT (1024)
             and GASNET_COLL_EXCHANGE_DISSEM_LIMIT_PER_THREAD (40)
    WARNING: Using: 40
    Leaving thread 0 after 0.0 seconds
    Thread 0 finished after 0.0 seconds
    Thread 4 finished after 0.0 seconds
    Entering thread 1 after 0.0 seconds
    10 20 30 40 50
    Leaving thread 1 after 0.0 seconds
    Thread 1 finished after 0.0 seconds
    Thread 3 finished after 0.0 seconds
    Result = 1
    UID        PID  PPID   LWP  C NLWP STIME TTY          TIME CMD
    < ps -fLm output omitted for brevity >
    Thread 2 finished after 186.0 seconds
    Thread 0 leaving after 186.0 seconds
    Thread 4 leaving after 186.0 seconds
    Thread 1 leaving after 186.0 seconds
    Thread 3 leaving after 186.0 seconds
    Thread 2 leaving after 186.0 seconds
    < It then hung, but killably >

G.3. Results on a distributed memory system (1)
-----------------------------------------------

Note that thread 1 now waits until thread 2 has finished the computation loop before reading the data that thread 0 has stored in the memory of thread 4.  This is the 'small' version - i.e. with one call of function 'almond'.

    pcphxtr13$upcc -o SC SC.upc
    pcphxtr13$cat ruin
    #!/bin/sh
    set -eu
    export PATH=$PATH:/home/nmm1/UPC/bin
    export UPC_NODES='pcphxtr08 pcphxtr09 pcphxtr10 pcphxtr11 pcphxtr12'
    set -x
    time upcrun -n 5 -c 1 SC
    pcphxtr13$ruin
    + upcrun -n 5 -c 1 SC
    UPCR: UPC thread 1 of 5 on pcphxtr11 (process 1 of 5, pid=5949)
    UPCR: UPC thread 0 of 5 on pcphxtr08 (process 0 of 5, pid=5933)
    UPCR: UPC thread 2 of 5 on pcphxtr12 (process 2 of 5, pid=5940)
    UPCR: UPC thread 3 of 5 on pcphxtr09 (process 3 of 5, pid=5980)
    UPCR: UPC thread 4 of 5 on pcphxtr10 (process 4 of 5, pid=5943)
    UID        PID  PPID   LWP  C NLWP STIME TTY          TIME CMD
    < ps -fLm output omitted for brevity >
    Leaving thread 0 after 18.0 seconds
    Thread 0 finished after 18.0 seconds
    Thread 0 leaving after 18.0 seconds
    Result = 1
    Thread 2 finished after 18.0 seconds
    Thread 2 leaving after 18.0 seconds
    Thread 4 finished after 0.0 seconds
    Thread 4 leaving after 18.0 seconds
    Entering thread 1 after 18.0 seconds
    10 20 30 40 50
    Leaving thread 1 after 18.0 seconds
    Thread 1 finished after 18.0 seconds
    Thread 1 leaving after 18.0 seconds
    Thread 3 finished after 0.0 seconds
    Thread 3 leaving after 18.0 seconds

    real    0m23.458s
    user    0m0.092s
    sys     0m0.124s

G.4. Results on a distributed memory system (2)
-----------------------------------------------

This is the 'large' version - i.e. with ten calls of function 'almond'.

    pcphxtr13$upcc -o SC SC.upc
    pcphxtr13$cat ruin
    #!/bin/sh
    set -eu
    export PATH=$PATH:/home/nmm1/UPC/bin
    export UPC_NODES='pcphxtr08 pcphxtr09 pcphxtr10 pcphxtr11 pcphxtr12'
    set -x
    time upcrun -n 5 -c 1 SC
    pcphxtr13$ruin
    + upcrun -n 5 -c 1 SC
    UPCR: UPC thread 4 of 5 on pcphxtr12 (process 4 of 5, pid=5814)
    UPCR: UPC thread 2 of 5 on pcphxtr09 (process 2 of 5, pid=5863)
    UPCR: UPC thread 3 of 5 on pcphxtr08 (process 3 of 5, pid=5814)
    UPCR: UPC thread 0 of 5 on pcphxtr10 (process 0 of 5, pid=5823)
    UPCR: UPC thread 1 of 5 on pcphxtr11 (process 1 of 5, pid=5829)
    *** FATAL ERROR: An active message was returned to sender,
        and trapped by the default returned message handler (handler 0):
    Error Code: ECONGESTION: Congestion at destination endpoint
    Message type: AM_REQUEST_IM
    Destination: (128.232.253.139:32777) (2)
    Handler: 71
    Tag: 0x80e8fd8f00021711
    Arguments(2): 0xb7c57000 0x080eee70
    Aborting...
    *** Caught a fatal signal: SIGABRT(6) on node 0/5
    NOTICE: Before reporting bugs, run with GASNET_BACKTRACE=1 in the
            environment to generate a backtrace.
    bash: line 1:  5823 Aborted    './SC' '__AMUDP_SLAVE_PROCESS__' 'pcphxtr13:46030'
    pcphxtr13$bash: line 1:  5757 Aborted    './SC' '__AMUDP_SLAVE_PROCESS__' 'pcphxtr13:34854'
    bash: line 1:  5788 Aborted    './SC' '__AMUDP_SLAVE_PROCESS__' 'pcphxtr13:56485'
    bash: line 1:  5863 Aborted    './SC' '__AMUDP_SLAVE_PROCESS__' 'pcphxtr13:46030'


H. References
-------------

POSIX (IEEE Std 1003.1, 2004):
    http://www.opengroup.org/bookstore/catalog/c046.htm
    http://www.opengroup.org/bookstore/catalog/c047.htm

MPI-2 (Message Passing Interface):
    http://www.mpi-forum.org/docs/docs.html

InfiniBand:
    http://www.infinibandta.org/specs/
    http://www.openfabrics.org/

Cray SHMEM:
    http://docs.cray.com/books/S-2383-23/S-2383-23-manual.pdf

UPC and GASNet:
    http://upc.lbl.gov/
    http://gasnet.cs.berkeley.edu/

RDMA Consortium:
    http://www.rdmaconsortium.org/home

Myrinet:
    http://www.myrinet.com/scs/documentation.html

Quadrics:
    http://www.quadrics.com/quadrics/QuadricsHome.nsf/DisplayPages/Homepage