ISO/IEC JTC1/SC22/WG5-N1754

Implementation Issues for Coarrays
----------------------------------

Nick Maclaren, 30th October 2008


Introduction
------------

I am including this paper because it was circulated to the BSI Fortran panel and some other people, and includes evidence for various statements in other papers.  It is very much a background paper for coarray specialists only.  I apologise for its length, but it is unavoidable if I am to address all of the points raised after I circulated the first draft to the UK panel.  It does NOT represent the BSI Fortran panel's position.

The fundamental issues are described in the first two sections.  The remaining sections analyse the issues in more detail, and provide evidence for the statements in the first two.  Few people will want to read them, or at least will want to look at only the examples.

In almost all cases where I make remarks about subtle technical aspects, I can provide further details on request, but including them here would make this paper incomprehensible as well as several times the length.


A. Implementation problems with coarrays on distributed memory systems
-----------------------------------------------------------------------

Coarrays will be extremely difficult to implement on distributed memory systems, and may be impractical for many developers on the almost ubiquitous 'commodity clusters'[*].  Throughout this document, the references to problems are in the context of coarrays being implemented on that sort of distributed memory system.  Possible solutions to this are described in section B.

There are at least the following major problems:

A.1) There is a known, genuinely portable technique for implementing them (see section E.1 below), but it can be extremely inefficient, needs significant changes to all existing compilers, and does not work reliably together with a companion processor.

A.2) They cannot be implemented reliably and efficiently using any widespread formal or informal standard, including POSIX (IEEE Std 1003.1, 2004) and MPI-2 (Message Passing Interface), without relying on explicitly undefined behaviour.

A.3) Even ignoring standards conformance and portability, it is doubtful that they can be implemented reliably and efficiently on many systems without operating system extensions.  This would cause serious problems to third-party vendors.

A.4) There seems to be little implementation experience of coarrays, except by one complete system vendor[+], and it is unclear whether even that vendor has implemented all of the problematic aspects.

A.5) There appear to be no other widespread parallel languages or interfaces that both introduce all of the implementation problems that coarrays do and have any implementations that resolve them.

Those statements may seem controversial, but this paper provides evidence for them, which accounts for its length.  While coarrays are syntactically simple, their semantics (and hence implementation) are not; in fact, I know of no other parallel interface or language that creates as many difficulties for the implementor.

The problems would, of course, be essentially soluble if coarrays were intended only for cache-coherent shared memory systems, complete system vendors and specialist interconnects, but I believe that is not the objective.

I believe that coarrays are not suitable for inclusion in the Fortran standard until either the above issues can be shown to be soluble by the existence of at least one real implementation by a third-party vendor on a commodity cluster, and some evidence that this will extend to other such systems, or the problems raised here are resolved by changes to the specification.

Terminology:

[*] I use the term 'commodity cluster' to refer to a collection of off-the-shelf workstations or small servers, connected by TCP/IP and Ethernet (or possibly InfiniBand), and running some widely-available operating system (such as a Linux or Unix variant or a Microsoft system).  On such systems, almost all compilers, language run-time systems and applications libraries are written by separate organisations, and run without 'system privileges' and without operating system extensions.

[+] The term 'complete system vendor' is not a happy one, but I cannot think of a better one.  It refers to the vendors who maintain and market at least a modified version of a standard operating system, including specialist networking support, and the Fortran and MPI that run under them.  Such vendors can ensure that the operating system and network interfaces provide the facilities needed for implementing parallel languages, reliably and efficiently.

It will be clear that, to a large extent, there are two disjoint types of system, and it is the former I am concerned about.

[ Aside: specialist interconnects are an intermediate form, but the only one I have found that seems to be adequate for implementing Fortran coarrays, reliably and efficiently, is Quadrics - and that is fairly rare.  It is mentioned wherever it is relevant in this document. ]


B. Summary of the problems
--------------------------

The fundamental problem is that the specification permits two images to communicate by defining and referencing coarrays owned by a third, without any explicit action by the owning process.  That is very difficult to implement when the underlying communication mechanism is based on message passing.  The difficulty is compounded by existing Fortran 2003 features, including companion processors and even I/O, which prevent many of the known solutions to this problem being used for coarrays.

The effect will be either that coarrays are not implemented for most distributed memory systems, or that the implementations have such serious (and probably undocumented) restrictions that many reasonable coarray programs will not work effectively on them.  Users will then justifiably claim that Fortran is an unsuitable language for parallel programming.  In theory, most distributed memory operating systems could be enhanced to enable the efficient implementation of coarrays, but I doubt that Fortran is still important enough to force such a drastic change.

All of these problems can be solved with reasonable efficiency on most current systems (though not all) if VOLATILE coarrays are dropped from the standard.  See sections E.6 and E.7 for details.

An alternative approach is to change the coarray specification to make them implementable, but the necessary changes are drastic (see section E.5 below).  If that approach were to be taken, it should be the subject of another paper; this one is complicated enough as it is.


EVIDENCE FOR THE ABOVE STATEMENTS
---------------------------------

Please note that the examples are written to show the issues, and are not necessarily the best way to perform their task.  Writing clear, realistic examples would take more time than I have.

C. The basic requirement
------------------------

Consider a program like the following:

    PROGRAM Synchronisation
       INTEGER :: value[*] = 0, i
       LOGICAL, VOLATILE :: first[*] = .False., second[*] = .False.
       SELECT CASE(THIS_IMAGE())
          CASE(1)
             DO i = 1,CO_UBOUND(value)
                value[i] = 123*i
             END DO
             SYNC MEMORY          ! One
             first[9] = .True.
          CASE(2)
             DO WHILE (.NOT. first[9])
                CONTINUE
             END DO
             SYNC MEMORY          ! Two
             DO i = 1,CO_UBOUND(value)
                PRINT *, value[i]
             END DO
             second[8] = .True.
          CASE(3)
             ! Assume some intensive serial computation here
             DO WHILE (.NOT. second[8])
                CONTINUE
             END DO
       END SELECT
    END PROGRAM Synchronisation

The processor has two options to ensure consistency, with variations:

C.1) The straightforward implementation involves transferring all of the updates of coarrays from image 1 to all other images 'during' the SYNC MEMORY statement marked One, and transferring them from all other images to image 2 'during' the SYNC MEMORY statement marked Two.

C.2) An alternative approach involves transferring the identities of all updated coarrays from image 1 to all other images 'during' the SYNC MEMORY statement marked One, and then inspecting them in image 2 'during' the SYNC MEMORY statement marked Two.  The reason that they must be transferred to all other images is that any of them could be executing code similar to that of image 2.

Both of those mechanisms are expensive, and the second is not scalable, but that is not the primary point of this paper.  It is important to note that the use of VOLATILE coarrays in the above is not fundamental, and any functionally equivalent synchronisation mechanism would expose the same issues; examples of this are described in section D below.


D. The potential for deadlock
-----------------------------

This example relies on the existence of "user-defined ordering" as permitted in 8.5.1 paragraph 5.  Consider a program like the following:

    PROGRAM Deadlock
       INTERFACE
          SUBROUTINE Mutex_lock (which) BIND(C)
             USE, INTRINSIC :: ISO_C_BINDING
             INTEGER(KIND=C_INT), INTENT(IN), VALUE :: which
          END SUBROUTINE Mutex_lock
          SUBROUTINE Mutex_unlock (which) BIND(C)
             USE, INTRINSIC :: ISO_C_BINDING
             INTEGER(KIND=C_INT), INTENT(IN), VALUE :: which
          END SUBROUTINE Mutex_unlock
       END INTERFACE
       INTEGER :: value[*] = 0, i
       IF (THIS_IMAGE() == 1) THEN
          CALL Mutex_lock(9)
       ELSE IF (THIS_IMAGE() == 3) THEN
          CALL Mutex_lock(8)
       END IF
       SYNC ALL
       SELECT CASE(THIS_IMAGE())
          CASE(1)
             DO i = 1,CO_UBOUND(value)
                value[i] = 123*i
             END DO
             SYNC MEMORY          ! One
             CALL Mutex_unlock(9)
          CASE(2)
             CALL Mutex_lock(9)
             SYNC MEMORY          ! Two
             DO i = 1,CO_UBOUND(value)
                PRINT *, value[i]
             END DO
             CALL Mutex_unlock(8)
          CASE(3)
             CALL Mutex_lock(8)
       END SELECT
    END PROGRAM Deadlock

If the call to Mutex_lock in image 3 blocks, and coindexed objects owned by it cannot be accessed by another image while it is in that state, the above program will deadlock.  The Fortran processor obviously has no control over the code of Mutex_lock and Mutex_unlock, and so cannot prevent them from blocking.
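
Aside: to show how ordinary, and how completely outside the Fortran processor's control, such external routines can be, here is a minimal C sketch of what the companion-processor side might look like.  It is purely illustrative: it assumes named FIFOs /tmp/mutexN, each pre-seeded with one token byte and reachable by every image (for example via a shared file system); the file names and the transport are mine, not part of the example.

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* These are the binding labels of the BIND(C) interfaces in program
       Deadlock.  The details are hypothetical; the point is that the
       blocking happens entirely inside ordinary external code. */

    static int open_fifo (int which, int flags) {
        char name[32];
        snprintf(name, sizeof(name), "/tmp/mutex%d", which);
        return open(name, flags);
    }

    void mutex_lock (int which) {
        char token;
        int fd = open_fifo(which, O_RDONLY);
        read(fd, &token, 1);             /* may block for an unbounded time */
        close(fd);
    }

    void mutex_unlock (int which) {
        char token = 'x';
        int fd = open_fifo(which, O_WRONLY);
        write(fd, &token, 1);            /* releases one blocked mutex_lock */
        close(fd);
    }

The blocking read is perfectly legitimate C, and the Fortran processor cannot see into it, interrupt it or insert polling into it.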

It is important to note that this problem is not caused solely by companion processors, but by any external interface, including Fortran I/O.  My original example was based on Fortran I/O, but its unavoidable complexity caused many of the people who saw it to misunderstand its point.  I have therefore excluded it from this paper, but here is a summary of its design:

D.1) Mutex_unlock() is replaced by writing a record to an I/O unit 10+, followed by a FLUSH statement on that unit, and Mutex_lock() by reading one from an I/O unit 20+.

D.2) Those I/O units are connected to FIFOs (pipes, sockets etc.), where the other ends of units 10+ and 20+ are an input FIFO and an output one to the same program, which simply reads a record from unit 10+ and writes it to 20+.

That is obviously a reasonable configuration and exposes exactly the same problem as the above example does, though the external details use features not in the Fortran standard.  Communication using files and a shared file system like NFS is common in parallel applications on distributed memory systems, though it is usually done in other ways.


E. Implementation possibilities
-------------------------------

Consider a conventional distributed memory system, made up of separate systems (i.e. CPUs and memory), connected by I/O over a network, with a conventional semi-secure operating system (such as specified by POSIX), and where compilers, language run-time systems and applications libraries are unprivileged applications.  Fortran does not require each image to store its own copy of a coarray locally, but it 'almost does', and coarrays are unlikely to be implemented in any other way on such systems.

In the above examples, when image 1 reaches the SYNC MEMORY marked One, it must transfer the coarray updates to each of the systems that hosts the other images.  Similarly, the data must be transferred from them to image 2 when it reaches the SYNC MEMORY marked Two.

I shall call image control statements, the use of VOLATILE and I/O statements 'natural break points', because they all need some special code to ensure cross-image consistency, according to the current draft.  They could obviously be extended to handle cross-image transfers as well.  The examples do not show how this problem can arise while an image is executing code that contains no natural break points, but that is obviously possible, and I do that in my conversion to UPC (see sections F.2 and G below).

The examples show that the processor cannot wait until both images involved in a transfer reach a natural break point, without losing parallelism, or causing livelock and even deadlock.  I know of only six realistic options, and will describe them and their problems separately.

E.1. Polling in compiled code
-----------------------------

The Fortran processor can obviously check for pending transfers (i.e. poll for them) at every natural break point, and insert artificial break points into all long-running loops that do not contain natural break points.  This technique was used on some of the rudimentary operating systems of the 1960s and 1970s to detect looping programs, and experience showed that the 'companion processor' problem caused the technique to be seriously unreliable.  It was dropped as soon as almost all operating systems supported preemptive interrupts.

For reasonable efficiency in high-communication programs, a break point needs to be reached every few microseconds - experience with MPI is that the difference between 1 and 5 microseconds latency makes a vast difference to many applications.  Note that a typical scheduling interval for a modern system is 10 milliseconds, which is far too long, and so implementing the polling efficiently needs great skill and is not portable.  However, most MPI implementations achieve it.

That interferes with many other critical optimisations, in particular the extreme pipelining and pseudo-vectorisation needed for good performance on Intel IA64, Hitachi SR8000 and similar future systems.  It would handicap optimisation even on SPARC and POWER, and possibly even x86.
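
Aside: as an indication of what this option means in practice, here is a rough C sketch of the code a Fortran processor might generate for a long-running loop that contains no natural break points.  The run-time routine name and the polling interval are invented for illustration; the point is that the inserted call must be reached every few microseconds, which is precisely what obstructs the pipelining and pseudo-vectorisation mentioned above.

    /* Hypothetical run-time system routine that services any pending
       one-sided accesses from other images. */
    extern void rts_poll_remote_accesses(void);

    #define POLL_INTERVAL 64     /* iterations between artificial break points */

    void compute_kernel (double *a, const double *b, long n) {
        for (long i = 0; i < n; ++i) {
            a[i] = a[i]*b[i] + 1.0;              /* the user's loop body */
            if (i % POLL_INTERVAL == 0)
                rts_poll_remote_accesses();      /* inserted break point */
        }
        rts_poll_remote_accesses();              /* and one at loop exit */
    }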

More seriously, it does not help with programs that use companion processors, in this context including programs that call optimised versions of the BLAS, LAPACK etc., because the Fortran processor obviously cannot insert polling into external code.  This problem is shown by example Deadlock in section D, but would remain even if the words "or user-defined ordering" were removed.  Obviously, an image executing a companion processor procedure cannot service any memory requests from other images until it returns.  See also section E.5 below.

E.2. Purely one-sided transfers ('RDMA')
----------------------------------------

There are some purely one-sided transfer mechanisms (i.e. ones that need no action whatsoever on the image that owns the data).  In this context, the criterion is that no action is needed by the local application when a remote system reads and updates its data, and not what happens in the hardware, the network interface device driver or the operating system kernel.  Similarly, the relevant interfaces are ones that can be used by unprivileged applications, such as Fortran programs.  Some more details of RDMA are given below.

Most cache-coherent shared memory (SMP) architectures provide suitable mechanisms, but are irrelevant in this context; note that, despite common belief, they do not always do so entirely transparently.  It appears that Cray distributed memory systems provide them, and that the Quadrics interconnect does, too.

TCP/IP and Ethernet do not provide any such mechanism.  There have been several research projects to extend TCP/IP to do this but, as far as I know, they have all fizzled out.  The RDMA Consortium is tackling an unrelated problem.

There is no such mechanism defined by any relevant standard, formal or informal, for reasons implied by sections E.4 and E.6, and I can find no evidence that one is currently provided by any distributed memory operating system or interconnect vendor other than Cray and Quadrics.  It is possible that such mechanisms exist in the OpenFabrics InfiniBand stack, but I baulked at downloading and searching 58 MB of software to find out.  In any case, I can find no evidence of such a mechanism being provided to or used by applications.  Such mechanisms can be provided for systems that do not have them only by non-trivial operating system kernel enhancements.

Aside: some information on RDMA.  The term 'Remote Direct Memory Access' (also called RMA) covers a wide variety of mechanisms, but it always refers to some mechanism for avoiding the use of at least one internal I/O buffer for large transfers.  Despite common belief, the data are not always copied directly from the network to the CPU's memory: incoming packets may be buffered in the network card, decoded there, and the contents then written to the CPU's memory.  That means that a message can reach its destination some time before it becomes visible to an application running on that system.  Unfortunately, the requirement for coarrays is not that the data are written directly to memory, but that the transmission is not acknowledged as complete until the data are actually visible to the application.  The availability of the feature therefore depends more on the interface software (e.g. TCP/IP or OpenFabrics) than on the network hardware (e.g. Ethernet or InfiniBand).

E.3. MPI-2 one-sided communication
----------------------------------

MPI is a widely available de facto standard, and is the dominant interface used on distributed memory systems.

The two-sided and collective mechanisms defined by MPI-1 are available on effectively all parallel systems, but are obviously inadequate for implementing coarrays.  MPI-2 added one-sided communication which, on the face of it, would appear suitable for implementing coarrays.  Very few applications currently use MPI-2 one-sided communication, and it is unclear how reliable, complete and efficient its implementations are.  There are at least the following issues.

Programs in PGAS (Partitioned Global Address Space) languages, like Fortran with coarrays, tend to use a much finer granularity than programs that use message-passing libraries.  It is common for the former to fail when built on the latter, and it was one of the reasons for the failure of HPF.  Intel have a version of OpenMP that supports distributed memory, but it is intended only for low-communication applications like Web servers, and not the high-communication ones common in HPC, according to a personal communication from a developer.

The only appropriate one-sided synchronisation mechanism in MPI-2 is MPI_Win_lock (MPI-2 6.4.3), and MPI allows that to be restricted to memory allocated by MPI_Alloc_mem.  That may not be feasible for all Fortran compilers on all systems.

MPI-2 11.7.2 states that progress with a transfer is not required until the target process next reaches an MPI call, so a transfer may take an unbounded amount of time.  As shown by my conversion to UPC (see sections F.2 and G below), such an implementation would be extremely inefficient on example Synchronisation in section C, and would deadlock on example Deadlock in section D.

MPI's guarantees of no deadlock obviously apply only when MPI procedures are called, and not when external code is.  So any image that is waiting in a companion processor procedure or even Fortran I/O obviously cannot service any memory requests from other images until it returns.
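
Aside: for concreteness, here is a minimal C sketch of how a processor might map a coindexed definition such as value[i] = 123*i onto MPI-2 passive-target operations.  The routine and window names are mine, and it assumes that each image is an MPI process (with image number equal to rank plus one) and that the coarray was allocated with MPI_Alloc_mem and exposed in the window value_win, which, as noted above, MPI-2 permits implementations to require.

    #include <mpi.h>

    /* Hypothetical run-time routine: define one element of an integer
       coarray owned by another image. */
    void rts_put_int (MPI_Win value_win, int image, MPI_Aint element, int value)
    {
        int target = image - 1;
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, target, 0, value_win);
        MPI_Put(&value, 1, MPI_INT, target, element, 1, MPI_INT, value_win);
        /* MPI-2 11.7.2: this unlock need not return until the target
           process next enters the MPI library, so it may block for an
           unbounded time if the target is busy in a compute loop, a
           companion processor procedure or Fortran I/O. */
        MPI_Win_unlock(target, value_win);
    }

A coindexed reference would use MPI_Get and MPI_LOCK_SHARED in the same way, with the same progress caveat.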

E.4. Interrupting the image to complete the transfer
-----------------------------------------------------

In theory, the processor could use some form of unprivileged interrupt mechanism to interrupt the executing image (i.e. the one that owns the data), handle the transfers and then continue processing.  The only currently relevant mechanism is signals, and doing I/O in them is undefined behaviour in C99 (7.14.1.1 The signal function, paragraph 5) and POSIX (sigaction: APPLICATION USAGE, paragraph 3).  Many systems provide extensions to POSIX in this area, but few are much of an improvement, and most do not provide enough supported functionality to implement message passing in an interrupt handler.

From a practical viewpoint, this technique is seriously unreliable under all of the dozen or so current operating systems which I have looked at; it was fiendishly difficult to use and usually unreliable even under the old mainframe systems, where it was fully supported.

I haven't investigated Intel Cluster OpenMP in detail, but its specification states clearly that it uses the page faulting mechanism ('virtual memory') to do this.  That unavoidably needs kernel privilege, though it may use documented system calls.  Equally seriously, it almost certainly will involve the kernel scheduler, and so the latency is likely to be of the order of 10 ms, which is hopeless for VOLATILE coarrays and not good for segments.  See section 11 in:

    http://cache-www.intel.com/cd/00/00/29/78/297875_297875.pdf

E.5. Using the MPI and GASNet progress model
--------------------------------------------

MPI and GASNet (see section F.2 below) have the concept of a progress engine, where natural break points are required to check for and service all pending actions, but no action need be taken between them.  In the Fortran context, a reference or definition from one image to a coindexed object on another might not complete until the latter reached a natural break point.  Examples Synchronisation and Deadlock in sections C and D show that this would not be adequate for implementing coarrays as currently specified.

In order to take this approach, MPI imposes significant constraints on the program to ensure that progress is always possible, and using it would require Fortran to do the same.  Modifying the coarray specification to allow this would certainly be possible, and the simplest approach would be the following:

E.5.1) To eliminate SYNC MEMORY and VOLATILE coarrays in their entirety.  Locking primitives do not cause the same problems, and are more comparable to SYNC IMAGES in this context.

E.5.2) To state that this progress model is used, and that programmers must obey its constraints, in a similar way to MPI.  I say MPI rather than GASNet, because the former standard is more comparable to Fortran in precision and terminology.

E.5.3) To specify that performing unbounded actions using any external interface (including I/O and companion processors), together with using coarrays, is undefined.  Without this, Fortran coarrays will almost certainly develop a reputation for being 'broken'.

These changes would mean that example Synchronisation in section C would usually be extremely inefficient, and example Deadlock in section D would be undefined.

E.6. Using a separate thread to handle the transfers
----------------------------------------------------

A processor could require that there is at least one, permanently running, thread dedicated to message passing per system, and that the operating system provides coherent shared memory between that thread and the image execution threads (or an equivalent mechanism).  This requires cache-coherent threading support and, for practicability, at least one core more than the number of images (per SMP system).  That is not always available on specialist parallel systems, but the constraint is a relatively minor problem nowadays.

However, it requires the hardware and operating system to support coherent shared memory in such a way that an update from one thread is guaranteed to become visible to another thread, expeditiously, with no action by the second thread.  That is undefined behaviour in POSIX (4.10 Memory Synchronization).

It is also unreliable in practice, usually because of thread scheduling, dispatchability and memory consistency problems.  It should be noted that thread scheduling control is optional in POSIX, and its semantics are largely implementation specific (POSIX 4.13 Scheduling Policy).  Problems are rare, but incredibly confusing to the programmer (and even the most experienced system expert) when they occur.  They are often put down to 'cosmic rays', but most of those claims are statistically implausible; I have tracked down a few such failures and can explain how the problems I describe here can occur.

Again, such mechanisms could be made reliable for many or most systems only by non-trivial operating system kernel enhancements.  The alternative approach is to drop VOLATILE coarrays from the standard.  An indication of how that could work is given in section E.7 below.
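
Aside: the pattern that this option relies on, and that POSIX 4.10 leaves undefined, is easy to show.  In this deliberately minimal C sketch (the names are mine), the communication thread stores data received from another image and then raises a flag, and an image thread spins on that flag without calling any POSIX synchronisation function; nothing in POSIX guarantees that the image thread will ever see the flag, let alone see the buffer contents in the right order.

    #include <pthread.h>

    static volatile int incoming_ready = 0;   /* 'volatile' does not rescue
                                                 this under POSIX 4.10 */
    static double incoming_buffer[1024];

    static void *communication_thread (void *arg) {
        for (;;) {
            /* ... receive data from another image into incoming_buffer ... */
            incoming_ready = 1;     /* hope that this, and the buffer contents,
                                       become visible expeditiously */
        }
        return arg;
    }

    static void image_thread_wait (void) {
        while (!incoming_ready)     /* no pthread_mutex_lock() or similar, so
                                       no guaranteed visibility or ordering */
            ;
    }

In practice this usually works on cache-coherent hardware, which is exactly why the failures, when they do occur, are so confusing.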

E.7. A separate thread and no VOLATILE
--------------------------------------

Fortran's ordering rules (8.5.1 paragraph 6) are close enough to POSIX's that enforcing the POSIX rules (POSIX 4.10 Memory Synchronization) will provide Fortran's semantics for non-VOLATILE variables.  Let us consider a single communication thread and one or more image threads on a single, cache-coherent SMP system.  The communication thread transfers local data to and from other SMP systems by message passing (e.g. MPI or TCP/IP), and the fact that there is only one communication thread means that all images external to its SMP system see a consistent view of the local data, whether or not any local image thread accesses it.  Restricting it to a single thread is a scalability issue, of course, but not a catastrophic one.

The problem is how it maintains consistency with the local image threads.  All of Fortran's image control statements need to use one of POSIX's synchronization functions to handshake with the communication thread, and the latter needs to do the same.  Unfortunately, it cannot simply call one of the blocking functions (e.g. sem_wait), as it needs to keep checking for messages from other SMP systems.  There are several ways to resolve this problem, but all need some frequent action by the communication thread, far too frequent to allow the thread to go into a scheduler wait (see section E.1 above).  So using some sort of spin loop and dedicating a core to the communication thread will almost always be essential.

I believe that is enough to implement coarrays without VOLATILE with tolerable efficiency, on multi-core, cache-coherent SMP systems.  However, even this relies on the system being configured for gang scheduling (not usually the default) and for running the number of images that the programmer has requested.  From experience with OpenMP and MPI, it will not work well on multi-user SMP systems.

The above is not enough to implement VOLATILE coarrays, because one of the communication or image threads might read a variable while it was half-updated by another.  So each access to a VOLATILE variable would need to be locked, with the communication thread involved in every lock.  That will not deliver acceptable efficiency on many systems.  Naturally, there are optimisations, but few of them will help with more than a subset of coarray programs, because they rely on certain (quite reasonable) patterns of use being rare.
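
Aside: here is a minimal C sketch of the handshake just described, for one image thread and the communication thread on one SMP system.  All of the names are mine, and a real run-time system would be far more elaborate.  The image thread's side of an image control statement uses ordinary POSIX synchronisation, so POSIX 4.10 provides the consistency needed for non-VOLATILE coarrays; the communication thread spins on a dedicated core, because it must keep servicing the network as well.

    #include <pthread.h>
    #include <stdbool.h>

    static pthread_mutex_t handshake = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  finished  = PTHREAD_COND_INITIALIZER;
    static bool sync_requested = false;    /* both protected by 'handshake' */
    static bool sync_done      = false;

    void image_control_point (void) {      /* e.g. SYNC MEMORY on this image */
        pthread_mutex_lock(&handshake);
        sync_requested = true;
        sync_done = false;
        while (!sync_done)                 /* the image thread may block */
            pthread_cond_wait(&finished, &handshake);
        pthread_mutex_unlock(&handshake);
    }

    void *communication_thread (void *arg) {
        for (;;) {                         /* spin; never a scheduler wait */
            /* ... poll the network and service remote coarray accesses ... */
            pthread_mutex_lock(&handshake);
            if (sync_requested) {
                /* ... complete all pending transfers for this image ... */
                sync_requested = false;
                sync_done = true;
                pthread_cond_broadcast(&finished);
            }
            pthread_mutex_unlock(&handshake);
        }
        return arg;
    }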

F. Other interfaces
-------------------

I have been told that my concerns are unjustified, because there are existing, widely used single-sided message passing interfaces and PGAS languages.  I have done some fairly extensive investigation, and I can find no evidence for those claims - indeed, I have found fairly strong evidence of the converse.

All of the relevant interfaces that I found while searching for evidence were built on top of coherent shared memory, MPI, TCP/IP, SHMEM or GASNet, or some combination.  The first three have already been discussed, so that leaves just the last two.  Some interfaces also support Quadrics and Myrinet, as options.  Quadrics has already been mentioned, but Myrinet currently supports only two-sided communications (MX Specification 1.2, II Concepts).

F.1. Cray SHMEM
---------------

This is the message passing interface that originated on Cray, and was copied on many other parallel systems; it is not the 'System V' shared memory segment interface also called shmem.  The only call that helps with my examples is SHMEM_Quiet (SHMEM_Fence and SHMEM_Wait affect actions on the local node only, and SHMEM_Barrier is a collective).  It appears that SHMEM_Quiet was introduced in the T3E.

There appears to be no adequate implementation of SHMEM for distributed memory systems, except for Cray systems and the specialist interconnect Quadrics.  I have asked several contacts in the parallel programming area, and none of them knows of a version of SHMEM for most distributed memory systems.

A Web search (using Google) on SHMEM_Quiet and Linux had only 28 hits, and none seemed to indicate the existence of any other implementation; indeed, one said that SHMEM_Quiet had not yet been implemented by Scali in 2001.  I changed the "Linux" to "BSD", "Microsoft", "Windows", "InfiniBand", "Voltaire", "Mellanox", "CISCO", "Scali", "Myrinet", "GPSHMEM", "OpenIB" and "Openfabrics"; none got more than 12 hits, and none indicated a relevant implementation.  I tried a few other searches not using SHMEM_Quiet, too.  My attempts to find a current Web page for GPSHMEM failed.

F.2. Unified Parallel C (UPC)
-----------------------------

This is by far the most widespread PGAS (Partitioned Global Address Space) language, but I have found little evidence that it is used for actual applications, as distinct from computer scientists investigating technologies.  I am informed that it is used in the US DoD and related organisations, but I have little contact with their scientific researchers.  I have also been informed that it is used elsewhere, but none of my sources could provide any evidence or any reference that I could investigate.

Its syntax is far more complicated and harder to use than that of coarrays, but this paper is about semantics.  UPC's semantics are much simpler and more restrictive than those of coarrays, in many subtle and poorly documented ways, but I shall omit all aspects but one: that of 'progress'.  It is based on a communication mechanism (GASNet), and the issues raised in this paper relate mostly to that.  My investigations indicate that GASNet lacks the power to implement coarrays efficiently on distributed memory systems.  The current GASNet specification includes the following:

    int gasnet_AMPoll ()

    An explicit call to service the network, process pending messages and
    run handlers as appropriate.  Most of the message-sending primitives in
    GASNet poll the network implicitly.  Purely polling-based
    implementations of GASNet may require occasional calls to this function
    to ensure progress of remote nodes during compute-only loops.  Any
    client code which spin-waits for the arrival of a message should call
    this function within the spin loop to optimize response time.  This
    call may be a no-op on some implementations (e.g. purely
    interrupt-based implementations).  Returns `GASNET_OK' unless an error
    condition was detected.

This makes it clear that GASNet permits (and even assumes) a progress engine like that of MPI-1 (see E.5), and I explain in that section why that is not adequate for coarrays as currently specified.  In section G, I show that this problem is not purely theoretical.  I have enquired further about GASNet, but have not had a response.


G. Conversion to UPC
--------------------

I translated my example into UPC, using a Mandelbrot loop for the computation-intensive section, and ran it with the Berkeley UPC system under SUSE 10.2 on fairly modern Intel/Dell workstations.  It ran efficiently on a single cache-coherent shared-memory system, but failed on a cluster of 5 of the same systems.
If the loop was fairly short, it did not transfer the data until it reached the terminating barrier; if the loop was longer, it detected the lack of progress and aborted.  The results are appended.  The threaded version hung, sometimes unkillably, during process termination, but that is almost certainly an unrelated bug in the Linux pthreads implementation, possibly triggered by one in UPC.  Such problems are not rare with pthreads.

G.1. The UPC code
-----------------

After running this, I noticed that I had forgotten to remove the references to 'second'.  They may be ignored, as they do nothing relevant.

    #include "upc_relaxed.h"
    #include "upc_collective.h"
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    shared [*] int value[THREADS];
    shared [*] volatile int first[THREADS], second[THREADS];

    static time_t start;

    static double delay (void) {
        return difftime(time(NULL),start);
    }

    typedef struct {double re, im;} complex;

    int almond (complex position) {
        complex value_1, value_2;
        int i;
        value_1.re = value_1.im = 0.0;
        for (i = 0; i < 1000*1000*1000; ++i) {
            value_2.re = value_1.re*value_1.re-value_1.im*value_1.im+position.re;
            value_2.im = 2.0*value_1.re*value_1.im+position.im;
            value_1.re = value_2.re*value_2.re-value_2.im*value_2.im+position.re;
            value_1.im = 2.0*value_2.re*value_2.im+position.im;
            if (value_1.re < -2.0 || value_1.re > 2.0 ||
                value_1.im < -2.0 || value_1.im > 2.0) break;
        }
        return (value_1.re*value_1.re+value_1.im*value_1.im <= 4.0);
    }

    int main (void) {
        complex position;
        int i;
        if (THREADS != 5) {
            fprintf(stderr,"This program expects %d threads\n",THREADS);
            exit(EXIT_FAILURE);
        }
        value[MYTHREAD] = i+1;
        first[MYTHREAD] = 0;
        second[MYTHREAD] = 0;
        upc_barrier;
        start = time(NULL);
        if (MYTHREAD == 0) {
            for (i = 0; i < 10; ++i) value[i] = 10*(i+1);
            upc_fence;
            first[4] = 1;
            upc_fence;
            printf("Leaving thread 0 after %.1f seconds\n",delay());
        } else if (MYTHREAD == 1) {
            while (1) {
                upc_fence;
                if (first[4]) break;
            }
            printf("Entering thread 1 after %.1f seconds\n",delay());
            printf("%d %d %d %d %d\n",
                value[0],value[1],value[2],value[3],value[4]);
            upc_fence;
            second[3] = 1;
            upc_fence;
            printf("Leaving thread 1 after %.1f seconds\n",delay());
        } else if (MYTHREAD == 2) {
            position.re = 0.0;
            position.im = 1.0;
            /* These are enabled for the 'big' versions.
            almond(position); almond(position); almond(position);
            almond(position); almond(position); almond(position);
            almond(position); almond(position); almond(position);
            */
            printf("Result = %d\n",almond(position));
            system("ps -fLmu nmm1");
        }
        printf("Thread %d finished after %.1f seconds\n",MYTHREAD,delay());
        upc_barrier;
        printf("Thread %d leaving after %.1f seconds\n",MYTHREAD,delay());
        return EXIT_SUCCESS;
    }

G.2. Results on a shared memory system
--------------------------------------

Note that thread 1 does not wait until thread 2 has finished the computation loop before reading the data that thread 0 has stored in the memory of thread 4.

    pcphxtr13$upcc -pthreads -o SC SC.upc
    pcphxtr13$upcrun -n 5 SC
    WARNING: Node 0 running more threads (5) than there are physical CPU's (2)
             enabling "polite", low-performance synchronization algorithms
    UPCR: UPC threads 0..4 of 5 on pcphxtr13 (process 0 of 1, pid=6486)
    WARNING: Conflicting environment values for GASNET_COLL_GATHER_ALL_DISSEM_LIMIT (1024)
             and GASNET_COLL_GATHER_ALL_DISSEM_LIMIT_PER_THREAD (204)
    WARNING: Using: 204
    WARNING: Conflicting environment values for GASNET_COLL_EXCHANGE_DISSEM_LIMIT (1024)
             and GASNET_COLL_EXCHANGE_DISSEM_LIMIT_PER_THREAD (40)
    WARNING: Using: 40
    Leaving thread 0 after 0.0 seconds
    Thread 0 finished after 0.0 seconds
    Thread 4 finished after 0.0 seconds
    Entering thread 1 after 0.0 seconds
    10 20 30 40 50
    Leaving thread 1 after 0.0 seconds
    Thread 1 finished after 0.0 seconds
    Thread 3 finished after 0.0 seconds
    Result = 1
    UID        PID  PPID   LWP  C NLWP STIME TTY          TIME CMD
    < ps -fLm output omitted for brevity >
    Thread 2 finished after 186.0 seconds
    Thread 0 leaving after 186.0 seconds
    Thread 4 leaving after 186.0 seconds
    Thread 1 leaving after 186.0 seconds
    Thread 3 leaving after 186.0 seconds
    Thread 2 leaving after 186.0 seconds
    < It then hung, but killably >

G.3. Results on a distributed memory system (1)
-----------------------------------------------

Note that thread 1 now waits until thread 2 has finished the computation loop before reading the data that thread 0 has stored in the memory of thread 4.  This is the 'small' version - i.e. with one call of function 'almond'.

    pcphxtr13$upcc -o SC SC.upc
    pcphxtr13$cat ruin
    #!/bin/sh
    set -eu
    export PATH=$PATH:/home/nmm1/UPC/bin
    export UPC_NODES='pcphxtr08 pcphxtr09 pcphxtr10 pcphxtr11 pcphxtr12'
    set -x
    time upcrun -n 5 -c 1 SC
    pcphxtr13$ruin
    + upcrun -n 5 -c 1 SC
    UPCR: UPC thread 1 of 5 on pcphxtr11 (process 1 of 5, pid=5949)
    UPCR: UPC thread 0 of 5 on pcphxtr08 (process 0 of 5, pid=5933)
    UPCR: UPC thread 2 of 5 on pcphxtr12 (process 2 of 5, pid=5940)
    UPCR: UPC thread 3 of 5 on pcphxtr09 (process 3 of 5, pid=5980)
    UPCR: UPC thread 4 of 5 on pcphxtr10 (process 4 of 5, pid=5943)
    UID        PID  PPID   LWP  C NLWP STIME TTY          TIME CMD
    < ps -fLm output omitted for brevity >
    Leaving thread 0 after 18.0 seconds
    Thread 0 finished after 18.0 seconds
    Thread 0 leaving after 18.0 seconds
    Result = 1
    Thread 2 finished after 18.0 seconds
    Thread 2 leaving after 18.0 seconds
    Thread 4 finished after 0.0 seconds
    Thread 4 leaving after 18.0 seconds
    Entering thread 1 after 18.0 seconds
    10 20 30 40 50
    Leaving thread 1 after 18.0 seconds
    Thread 1 finished after 18.0 seconds
    Thread 1 leaving after 18.0 seconds
    Thread 3 finished after 0.0 seconds
    Thread 3 leaving after 18.0 seconds

    real    0m23.458s
    user    0m0.092s
    sys     0m0.124s

G.4. Results on a distributed memory system (2)
-----------------------------------------------

This is the 'large' version - i.e. with ten calls of function 'almond'.

    pcphxtr13$upcc -o SC SC.upc
    pcphxtr13$cat ruin
    #!/bin/sh
    set -eu
    export PATH=$PATH:/home/nmm1/UPC/bin
    export UPC_NODES='pcphxtr08 pcphxtr09 pcphxtr10 pcphxtr11 pcphxtr12'
    set -x
    time upcrun -n 5 -c 1 SC
    pcphxtr13$ruin
    + upcrun -n 5 -c 1 SC
    UPCR: UPC thread 4 of 5 on pcphxtr12 (process 4 of 5, pid=5814)
    UPCR: UPC thread 2 of 5 on pcphxtr09 (process 2 of 5, pid=5863)
    UPCR: UPC thread 3 of 5 on pcphxtr08 (process 3 of 5, pid=5814)
    UPCR: UPC thread 0 of 5 on pcphxtr10 (process 0 of 5, pid=5823)
    UPCR: UPC thread 1 of 5 on pcphxtr11 (process 1 of 5, pid=5829)
    *** FATAL ERROR: An active message was returned to sender,
        and trapped by the default returned message handler (handler 0):
    Error Code: ECONGESTION: Congestion at destination endpoint
    Message type: AM_REQUEST_IM
    Destination: (128.232.253.139:32777) (2)
    Handler: 71
    Tag: 0x80e8fd8f00021711
    Arguments(2): 0xb7c57000 0x080eee70
    Aborting...
    *** Caught a fatal signal: SIGABRT(6) on node 0/5
    NOTICE: Before reporting bugs, run with GASNET_BACKTRACE=1 in the
            environment to generate a backtrace.
    bash: line 1:  5823 Aborted    './SC' '__AMUDP_SLAVE_PROCESS__' 'pcphxtr13:46030'
    pcphxtr13$bash: line 1:  5757 Aborted    './SC' '__AMUDP_SLAVE_PROCESS__' 'pcphxtr13:34854'
    bash: line 1:  5788 Aborted    './SC' '__AMUDP_SLAVE_PROCESS__' 'pcphxtr13:56485'
    bash: line 1:  5863 Aborted    './SC' '__AMUDP_SLAVE_PROCESS__' 'pcphxtr13:46030'


H. References
-------------

POSIX (IEEE Std 1003.1, 2004):
    http://www.opengroup.org/bookstore/catalog/c046.htm
    http://www.opengroup.org/bookstore/catalog/c047.htm

MPI-2 (Message Passing Interface):
    http://www.mpi-forum.org/docs/docs.html

InfiniBand:
    http://www.infinibandta.org/specs/
    http://www.openfabrics.org/

Cray SHMEM:
    http://docs.cray.com/books/S-2383-23/S-2383-23-manual.pdf

UPC and GASNet:
    http://upc.lbl.gov/
    http://gasnet.cs.berkeley.edu/

RDMA Consortium:
    http://www.rdmaconsortium.org/home

Myrinet:
    http://www.myrinet.com/scs/documentation.html

Quadrics:
    http://www.quadrics.com/quadrics/QuadricsHome.nsf/DisplayPages/Homepage