ISO/IEC JTC1/SC22/WG5-N1745

VOLATILE Coarrays
-----------------

Nick Maclaren on behalf of the UK panel, 14th October 2008.


0. Introduction
---------------

The concept of 'volatile' objects, as used in C, Fortran etc., has
always been a problem for language semantics, because it introduces
the concept of objects changing value without explicit action by the
program.  Fortran VOLATILE coarrays overload it with another concept,
that of parallel atomicity - i.e. that a value can be changed from one
value to another by one image, and either the new or the old value is
seen by all other images (without an intervening period of being
undefined).  There are several major problems with them.

There are major specification problems with VOLATILE coarrays, which
are discussed below; some are sufficiently serious that resolving them
would mean pervasive changes, which would probably be regarded as too
restrictive to be acceptable.  Even ignoring those, the specification
is too imprecise to know what effects Fortran requires a processor to
deliver to the programmer, and it is therefore almost impossible for a
programmer to write code that is reliably portable.

Experience with similar parallel interfaces shows that few (if any)
ordinary programmers can use volatile objects correctly, and even
experts have difficulty, especially when the specifications are
imprecise.

The introduction of VOLATILE coarrays also means that many existing,
widespread, important serial optimisations cannot be performed without
changing the results, even on code that makes no use of either
coarrays or VOLATILE.  Downgrading the optimisation of serial code
from that which is possible in Fortran 2003 will be unacceptable to
many people.

Lastly, it is unclear whether VOLATILE coarrays can be implemented
with an acceptable degree of efficiency on the now ubiquitous
'commodity clusters'[*].
For these reasons, we feel that VOLATILE coarrays should be removed
from the Fortran standard, and more appropriate (higher-level)
mechanisms included (possibly after more implementation experience).

[*] The term 'commodity cluster' refers to a collection of
off-the-shelf workstations or small servers, connected by TCP/IP and
Ethernet (or possibly InfiniBand), and running some widely-available
operating system (such as a Linux or Unix variant or Microsoft
system).  On such systems, the compilers, language run-time systems
and applications libraries are often written by separate
organisations, and always run without 'system privileges' or
operating system extensions.


1. A Better Approach
--------------------

The major use of volatile data objects in parallelism, in the
languages that have them, is by experts for writing signal handling
and synchronisation primitives.  A second use is for essentially
trivial tasks, such as setting and testing a single global flag
variable or writing a simple parallel reduction.

Fortran is a high-level language, and the cleanest solution would be
to remove VOLATILE coarrays, thus eliminating all the problems they
cause, and to specify the high-level parallelism primitives directly.
These need not be standardised immediately, which would give time to
design them properly, and to obtain experience with implementation and
use.  This paper does not make any proposal for such primitives, but
the following is a description of the sort that are envisaged:

 1) Locks, mutexes, semaphores etc.  Exactly which of these should be
    specified is a matter of taste, but most experience is that simple
    uses can be implemented with any of them.  Paper J3/08-256 makes a
    proposal for locks.

 2) Explicitly atomic datatypes and operations, including global flag
    setting, compare-and-swap etc.
    Separating these from 'normal' Fortran datatypes and operations
    means that the semantic problems described below can be bypassed,
    and makes their implementation a lot easier.

 3) Global reductions (e.g. summation over images).  These have the
    property that the final value does not become visible until some
    appropriate synchronisation is performed, and have similar
    semantic and implementation advantages to explicitly atomic
    actions.

These would provide the facilities that real users need, at a level
that they might manage to use correctly.


2. Specification Issues
-----------------------

In parallel languages that have similar volatile object semantics,
even experts have great difficulty using volatile objects to implement
synchronisation primitives unless they keep their code very simple.
Experience is that it is too hard for most ordinary programmers, and
they usually make serious mistakes by assuming more synchronisation
than is actually specified.

A great many of these problems are caused by imprecise specifications;
these lead to each vendor providing subtly different semantics for
volatile data objects, which causes even well-tested programs written
by experienced users to fail unpredictably, especially when ported to
new systems or when there is a new version of the compiler.

There are several major specification problems with VOLATILE coarrays,
which fall into two classes:

 1) Exactly what is allowed.  Some of the examples given here are
    simple oversights and could be resolved by wording alone, but
    others are not so easy.  Fortran, like most other languages,
    specifies the language largely by imposing constraints on what a
    conforming program may do.  This issue is less about what may be
    done than about exactly what effects conforming actions have; that
    is often not specified.  In some cases of VOLATILE coarrays, the
    effects are almost unspecifiable.
    Note that parallel memory models are much more complicated than
    serial ones, because parallelism exposes issues that are hidden in
    serial languages (except in asynchronous signal handling, which
    Fortran does not have).

 2) The exact effect of actions on VOLATILE coarrays as seen by other
    images, and the interactions of VOLATILE coarray accesses with
    segments.  This is essentially unspecified, and there are some
    serious ambiguities.

Allowing VOLATILE coarrays requires at least a specification of the
granularity of accesses and of the parallel memory model that applies
to them (if not sequential consistency), and some examples of the
issues are given here.  The problem with providing examples is that
simple ones are always unrealistic, and every simple problem can be
resolved by an extra constraint.  Actual experience of shared memory
programs is that the problems arise in code that looks simple but is
very hard to analyse.  As Lamport observed, there is no way to solve
the problem properly except by defining a proper memory model.


2.1 Lack of Safety
------------------

We revisit the example in N1744, "Coarrays and Memory Models", to
illustrate the unpredictable behaviour that is possible with VOLATILE
coarrays.

    PROGRAM Memory_Model_1
        INTEGER, VOLATILE :: one[*] = 0, two[*] = 0
        INTEGER :: p, q
        SELECT CASE(THIS_IMAGE())
            CASE(1)
                one[8] = 123
            CASE(2)
                two[9] = 456
            CASE(3)
                p = one[8]
                q = two[9]
                WRITE (3,*) p, q
            CASE(4)
                q = two[9]
                p = one[8]
                WRITE (4,*) p, q
        END SELECT
    END PROGRAM Memory_Model_1

There is no requirement for images 1 and 2 to check that the new
values have reached images 8 and 9 until after executing SYNC ALL.
Hence the value of one[8] accessed by images 3 and 4 may be either 0
or 123.  Similarly, the value of two[9] may be either 0 or 456.
Furthermore, the combination '123 0' on unit 3 and '0 456' on unit 4
can occur if image 3 has better communication with image 9 than with
image 8, but image 4 has better communication with image 8 than with
image 9.
In fact, all combinations of '0 0', '123 0', '0 456' and '123 456' are
possible, and the result can vary from run to run.  Some combinations
may occur quite rarely, making unexpected results occur in code that
was thought to be tested.  Note that this example is the simplest that
shows the issue; more complex, but still realistic, examples are
available from the author.


2.2 Varying the VOLATILE Attribute of a Coarray Between Scopes
--------------------------------------------------------------

A very simple example of this is:

    PROGRAM Memory_Model_3
        INTEGER :: one[*] = 0, two[*] = 0
        INTEGER :: p, q
        SELECT CASE(THIS_IMAGE())
            CASE(1)
                one[8] = 123
            CASE(2)
                two[9] = 456
            CASE(3)
                p = Get(one,8)
                q = Get(two,9)
                WRITE (3,*) p, q
            CASE(4)
                q = Get(two,9)
                p = Get(one,8)
                WRITE (4,*) p, q
        END SELECT
    CONTAINS
        FUNCTION Get (z, n)
            INTEGER, VOLATILE :: z[*]
            INTEGER :: Get, n
            Get = z[n]
        END FUNCTION Get
    END PROGRAM Memory_Model_3

The question here is whether this changes anything from the previous
example.  The above code seems to meet the liberty allowed in 8.5.1
Image control statements, paragraph 6:

    A coarray that is default integer, default logical or default
    real, and which has the VOLATILE attribute may be referenced
    during the execution of a segment that is unordered relative to
    one in which the coarray is defined.  Otherwise: ...

This sort of problem could be resolved only by requiring a coarray to
have the VOLATILE attribute in all scoping units if it has it in any
of them.
An even nastier example is the following, and it is so nasty that most
compilers reject it as invalid (though it seems to be valid Fortran
2003):

    MODULE Global
        INTEGER, SAVE :: Matthew[*] = 1
    END MODULE Global

    PROGRAM Boggle
        USE Global
        SELECT CASE (THIS_IMAGE())
            CASE(1)
                CALL John()
            CASE(2)
                CALL James()
        END SELECT
        PRINT *, Matthew[9]
    END PROGRAM Boggle

    SUBROUTINE John
        USE Global
        VOLATILE :: Matthew
        PRINT *, Matthew[9]
    END SUBROUTINE John

    SUBROUTINE James
        USE Global
        Matthew[9] = 2
    END SUBROUTINE James

There are some systems where that is effectively implementable only by
providing VOLATILE coarray semantics for all coarrays, with the
consequent loss of efficiency.


2.3 Composite Objects
---------------------

Consider the following program:

    PROGRAM Composite_1
        INTEGER, VOLATILE :: value(100)[*] = 0
        SELECT CASE(THIS_IMAGE())
            CASE(1)
                value(:)[9] = 123
            CASE(2)
                value(:)[9] = 456
        END SELECT
        SYNC ALL
        IF (THIS_IMAGE() == 9) PRINT *, value
    END PROGRAM Composite_1

An array is an object, and 'value' has type INTEGER, so many users
will assume that the elements of 'value' are either all 123 or all
456, but many implementations will deliver a mixture.  There is
nothing in the current wording that states or even implies which.

Another example is:

    PROGRAM Composite_2
        INTEGER, VOLATILE :: value(100)[*] = 123
        SELECT CASE(THIS_IMAGE())
            CASE(1)
                value[9] = SUM(value[9])
            CASE(2)
                PRINT *, value[9]
        END SELECT
    END PROGRAM Composite_2

Is this required to print all values the same, or may some values be
123 and others 12300?  And, in either case, where is it specified?


2.4 Use in Protected Contexts
-----------------------------

There are several contexts where a variable may not be defined or
become undefined except in specific ways, but it is not clear in all
of them whether that covers the VOLATILE coarray case when the dubious
action is performed by another image.
For example, 8.1.7.6.2 paragraph 2 says "..., the DO variable may not
be redefined nor become undefined while the DO construct is active".
But consider the following program:

    PROGRAM Do_what
        INTEGER, VOLATILE :: a[*] = 1
        INTEGER :: b(100) = 0
        SELECT CASE (THIS_IMAGE())
            CASE(1)
                READ (*,*)
                a[2] = 2
            CASE(2)
                DO a = 1,100
                    b(a) = a
                END DO
                WRITE (*,*) "Kilroy was here"
        END SELECT
    END PROGRAM Do_what

What does "while the DO construct is active" mean as applied to a
separate image?  Let us assume that the user did not type a newline
until he saw the message "Kilroy was here" appear, which would mean
that the DO construct had finished.  Would that make the above program
correct?

A variant question relates to the same program, but with the READ and
WRITE removed.  Would the correctness of the program depend on whether
the processor happened to execute image 2 before executing image 1?
And, if not, why would the answer differ from the previous one?

Note that this sort of issue causes a lot of trouble to users who test
their parallel code on workstations, and then run it on massively
parallel computers.  The former often run threads sequentially.
Resolving these issues would be time-consuming, as each case would
need finding and careful consideration.


2.5 Using WHERE and Masks
-------------------------

This may be an oversight, but does not seem to be forbidden.  However,
it is a very good example of how the lack of a precisely defined
memory model causes problems with the interaction of VOLATILE coarrays
and existing Fortran facilities.  Similar issues arise with vector
subscripts, FORALL and in several other constructions; even quite
reasonable programmers might well write elemental functions that
expose this sort of issue.  Hence simply forbidding such uses isn't a
simple task, and would need changes to many parts of the standard.
    PROGRAM Where_is_it_at
        INTEGER, VOLATILE :: a(1000)[*], b(1000)[*]
        INTEGER :: i
        IF (THIS_IMAGE() == 9) THEN
            DO i = 1,1000
                a(i) = MOD(17*17*17*i,1024)
                b(i) = MOD(19*19*19*i,1024)
            END DO
        END IF
        SYNC ALL
        SELECT CASE (THIS_IMAGE())
            CASE(1)
                WHERE (MOD(a(:)[9],13) == 5) b(:)[9] = a(:)[9]
            CASE(2)
                WHERE (MOD(b(:)[9],13) == 5) a(:)[9] = b(:)[9]
        END SELECT
        SYNC ALL
        IF (THIS_IMAGE() == 9) PRINT *, a, b
    END PROGRAM Where_is_it_at

Fortran could either forbid this, or specify what it means, but the
current situation is that it is permitted without having even a
guessable meaning.


3. Reduced serial optimisation
------------------------------

In strict Fortran 2003, adding the VOLATILE attribute adds some
constraints, but no new semantics; extra semantics can be added only
by processor extensions (in this context, including relevant companion
processor support).  In particular, in a sequence of statements, no
object can change value unless the program defines or undefines some
identifier associated with it (possibly in a subprocedure).

However, VOLATILE coarrays can be changed at any time by other images,
without needing any processor extensions, and doing so is defined
behaviour.  A processor therefore needs to allow for this, and not use
any optimisations that would give incorrect results if it happens.  As
referencing VOLATILE coarrays is allowed even in PURE functions, this
has a major impact.

The example shown here is one where an array is initialised 'the wrong
way round'; several compilers currently optimise such things by
reversing the order of the loops.  It is also a case where common
subexpression elimination can save a lot of time; most compilers will
do that, even at low levels of optimisation.  It shows that the
introduction of VOLATILE coarrays means that neither optimisation may
be performed without changing the results, even though there is no use
of either coarrays or VOLATILE in the procedure being compiled.
This problem could be resolved only by making any reference to
VOLATILE coarrays in functions or PURE subroutines undefined
behaviour.  This would probably be regarded as unacceptable.


3.1 An Example
--------------

Consider the case of a Fortran processor that does not define any
semantics for VOLATILE beyond those required by Fortran 2003; that is
the usual case, and is likely to continue to be.  Now consider the
following external subroutine:

    SUBROUTINE Fred (arg, m, n)
        INTEGER, INTENT(IN) :: m, n
        INTEGER :: arg(m,n)
        INTERFACE
            PURE FUNCTION Joe (x)
                INTEGER :: Joe
                INTEGER, INTENT(IN) :: x
            END FUNCTION Joe
        END INTERFACE
        INTEGER :: i, j
        DO i = 1,m
            DO j = 1,n
                arg(i,j) = Joe(arg(i,j))+Joe(0)
            END DO
        END DO
    END SUBROUTINE Fred

Because Joe is marked PURE (and, strictly, even if it had not been),
no objects other than the array arg can become defined in that loop,
in Fortran 2003.  Hence the processor need evaluate Joe(0) only once,
and can arbitrarily reorder the loops for increased memory
performance; many existing compilers do either or both of those.
However, with VOLATILE coarrays, that is no longer possible.  Consider
the following module, program and function Joe.

    MODULE Global
        INTEGER, VOLATILE, SAVE :: Pete[*] = 1
    END MODULE Global

    PROGRAM Main
        USE Global
        INTERFACE
            SUBROUTINE Fred (arg, m, n)
                INTEGER, INTENT(IN) :: m, n
                INTEGER :: arg(m,n)
            END SUBROUTINE Fred
        END INTERFACE
        INTEGER :: array(1000,2000), i, j
        DO i = 1,1000
            DO j = 1,2000
                array(i,j) = 2000*i+j
            END DO
        END DO
        SELECT CASE (THIS_IMAGE())
            CASE(1)
                CALL Fred(array,1000,2000)
            CASE(2)
                DO i = 1,1000000000
                    Pete[9] = i
                END DO
        END SELECT
        PRINT *, array
    END PROGRAM Main

    PURE FUNCTION Joe (x)
        USE Global
        INTEGER :: Joe
        INTEGER, INTENT(IN) :: x
        Joe = Pete[9]+x/3
    END FUNCTION Joe

Here, Joe references (not defines) Pete[9], image 1 calls Fred and
hence Joe, and image 2 defines Pete[9] in open code.  I can find
nothing in the standard that even discourages this.
The order in which Joe is called is now visible to the program, which
contradicts NOTE 8.30 and NOTE 12.51, and prevents some forms of the
above optimisations.  Note that there is no use of either coarrays or
VOLATILE in subroutine Fred; the introduction of VOLATILE coarrays has
therefore reduced the possibilities for optimisation even in code that
does not use either.


4. Behaviour on Commodity Clusters
----------------------------------

Consider the subroutine Refinement in a program fragment like the
following:

    MODULE Data
        INTEGER, VOLATILE :: table(1000)[*]
        INTERFACE
            ELEMENTAL LOGICAL FUNCTION Valid (value)
                INTEGER, INTENT(IN), VALUE :: value
            END FUNCTION Valid
        END INTERFACE
    END MODULE Data

    SUBROUTINE Refinement (index)
        USE Data
        INTEGER :: index(:,:), n
        DO n = 1,UBOUND(index,2)
            IF (index(1,n) > 0 .AND. &
                    .NOT. Valid(table(index(1,n))[index(2,n)])) &
                index(1,n) = -1
        END DO
    END SUBROUTINE Refinement

In general, a compiler cannot be sure that the VOLATILE coarray table
will not be updated by another image, and therefore will need to fetch
each value of 'table' sequentially.  There are better ways to write
this, but all simple versions have similar problems in the case where
'table' is too large to store on a single image.  Doubtless there are
better examples, too.

The issue here is how this sort of code can be implemented, and the
consequences of possible implementation approaches.  Nobody will
expect it to be as efficient as using local data, but the question is
whether it can be implemented reasonably portably and reasonably
efficiently on commodity clusters.  The requirement is for image A to
access data on image B while the latter is occupied doing something
else.  All of the implementation approaches known to the authors are
discussed separately below.

Note that certain implementation strategies can lead to deadlock, even
in programs that contain no deadlock in their logic; consider a
program like the following:

    PROGRAM Deadlock
        INTERFACE
            ! The initialisation of mutexes to an unlocked state is
            ! omitted for clarity; the companion processor is assumed
            ! to create and initialise at least mutexes indexed by
            ! arguments 8 and 9.  Otherwise, these interfaces are
            ! modelled on the POSIX calls pthread_mutex_lock and
            ! pthread_mutex_unlock.
            SUBROUTINE Mutex_lock (which) BIND(C)
                USE, INTRINSIC :: ISO_C_BINDING
                INTEGER(KIND=C_INT), INTENT(IN), VALUE :: which
            END SUBROUTINE Mutex_lock
            SUBROUTINE Mutex_unlock (which) BIND(C)
                USE, INTRINSIC :: ISO_C_BINDING
                INTEGER(KIND=C_INT), INTENT(IN), VALUE :: which
            END SUBROUTINE Mutex_unlock
        END INTERFACE
        INTEGER :: value[*] = 0, i
        IF (THIS_IMAGE() == 1) THEN
            CALL Mutex_lock(9)
        ELSE IF (THIS_IMAGE() == 3) THEN
            CALL Mutex_lock(8)
        END IF
        SYNC ALL
        SELECT CASE(THIS_IMAGE())
            CASE(1)
                DO i = 1,CO_UBOUND(value)
                    value[i] = 123*i
                END DO
                SYNC MEMORY    ! One
                CALL Mutex_unlock(9)
            CASE(2)
                CALL Mutex_lock(9)
                SYNC MEMORY    ! Two
                DO i = 1,CO_UBOUND(value)
                    PRINT *, value[i]
                END DO
                CALL Mutex_unlock(8)
            CASE(3)
                CALL Mutex_lock(8)
        END SELECT
    END PROGRAM Deadlock

If the call to Mutex_lock in image 3 blocks, and coindexed objects
owned by it cannot be accessed by another image while it is in that
state, the above program will deadlock.  The Fortran processor
obviously has no control over the code of Mutex_lock and Mutex_unlock,
and so cannot prevent them from blocking.

In the following, the classification of each implementation strategy
refers to its viability for use on the ubiquitous commodity clusters.


4.1 Cache-coherent Shared Memory
--------------------------------

Currently, there are commodity systems that provide this for up to
about 16 cores (i.e. images), and a few specialist companies provide
it for up to about 1,000.  There are few problems with implementing
VOLATILE coarrays on such systems.
However, note that the specification problems remain, as each
architecture defines a slightly different set of guarantees; see, for
example:

    http://www.intel.com/products/processor/manuals/318147.pdf
    http://download.boulder.ibm.com/ibmdl/pub/software/dw/library/es-archpub2.zip
    http://www.sparc.org/standards/SPARCV9.pdf

There have been many attempts, over many decades, to provide
cache-coherent virtual shared memory over a cluster of separate
systems (i.e. with distributed memory at the hardware level).  None
have succeeded, and any claims that it will be delivered "real soon
now" are implausible.

This is not a viable implementation strategy.


4.2 Special Hardware and Operating System Support
-------------------------------------------------

Many specialist vendors (e.g. Cray) provide hardware or operating
system extensions that have the effect of letting an application on
one system access the memory of another, transparently - that is, so
that none of the applications on the latter need include any logic to
enable such access.  This is often called RDMA (Remote Direct Memory
Access).  Again, experience is that this works.

However, neither commodity hardware nor commodity software supports
any such mechanism, and so it cannot be used on commodity clusters.
The nearest to a commodity interconnect that has any such support is
InfiniBand, where it is generally believed that the protocol enables
such access.  Unfortunately, the specification is 2,000 pages long,
and the general belief may not actually be correct.  More importantly,
its prevalent software implementation for commodity clusters,
Openfabrics, has no such support.

The only interconnects that deliver the appropriate functionality are
vendors' own ones (e.g. Cray's) and Quadrics.  The latter can be
attached to some commodity clusters, but is expensive, specialist and
rare.

This is not a viable implementation strategy, until and unless
Openfabrics delivers such support.
4.3 MPI-2 One-Sided Communication
---------------------------------

On the face of it, this would appear to be a widely available
implementation of RDMA, but investigation shows that to be false.
Very few (if any) applications currently use MPI-2 one-sided
communication, and it is unclear how reliable, complete and efficient
its implementations are.

Even more seriously, the only 'true' one-sided mechanism in MPI-2 is
MPI_Win_lock (MPI-2 6.4.3), and MPI allows that to be restricted to
memory allocated by MPI_Alloc_mem, which may not be feasible for all
Fortran compilers on all systems.  Furthermore, MPI-2 11.7.2 states
that progress with a transfer is not required until the target process
(i.e. the image that owns the data) next reaches an MPI call, and a
transfer may therefore take an unbounded amount of time.  This would
lead to deadlock in some Fortran VOLATILE coarray programs, such as
the program Deadlock above.

Also, Fortran VOLATILE coarray accesses use a much lower granularity
than most MPI transfers, and it is unclear whether a viable MPI-2
implementation would be efficient enough for VOLATILE coarrays.

This is almost certainly not a viable implementation strategy.


4.4 'Cray SHMEM'
----------------

This is the message passing interface that originated on Cray, and was
copied to many other parallel systems; it is not the 'System V' shared
memory segment interface also called shmem.  The only call that would
help is SHMEM_Quiet (SHMEM_Fence and SHMEM_Wait affect actions on the
local node only, and SHMEM_Barrier is a collective).  It appears that
SHMEM_Quiet was introduced for the T3E.

There appears to be no implementation of SHMEM for distributed memory
systems that includes SHMEM_Quiet, except for Cray systems and the
specialist interconnect Quadrics.

This is not a viable implementation strategy.
4.5 Interrupting the Image to Complete the Transfer
---------------------------------------------------

In theory, the processor could use some form of interrupt mechanism to
trap transfers to the executing image (i.e. the one that owns the
data), handle the transfers and then continue processing.  The only
currently relevant mechanism is signals, and doing I/O in them is
undefined behaviour in C99 (7.14.1.1 The signal function, paragraph 5)
and POSIX (sigaction: APPLICATION USAGE, paragraph 3).  Many systems
provide extensions to POSIX in this area, but few are much of an
improvement, and most do not provide enough supported functionality to
implement message passing in an interrupt handler.

Experience with trying to use this mechanism on modern systems is that
it is, at best, hopelessly unreliable.  It could be made reliable only
by major changes to the kernel design (i.e. adopting the old mainframe
designs); that is implausible.

This is not a viable implementation strategy.


4.6 Using a Separate Thread to Handle the Transfers
---------------------------------------------------

A processor could require that there is at least one permanently
running thread dedicated to message passing per system, and that the
operating system provides coherent shared memory between that thread
and the image execution threads.  For performance, this also requires
one core dedicated to message passing, because otherwise the latency
of VOLATILE coarray accesses will be bounded below by the scheduler
interval (typically 10 milliseconds on modern systems, compared with
about 1 microsecond for Ethernet or InfiniBand messages).  A factor of
10,000 on the latency is a very serious performance hit.

It also requires the hardware and operating system to support a memory
update in one thread becoming visible to another thread,
expeditiously, with no action by the second thread.
That is undefined behaviour in POSIX (4.10 Memory Synchronization),
and is somewhat unreliable in practice, because of scheduling, memory
consistency and other problems.  Thread scheduling control is optional
in POSIX, and its semantics are largely implementation specific (POSIX
4.13 Scheduling Policy); it also does not address the memory
consistency issue.  Such mechanisms could be made reliable for many or
most systems only by non-trivial kernel enhancements.

This is perhaps the best implementation strategy for commodity
clusters, but its unreliability, system dependence and potentially
poor performance are serious problems.


4.7 Polling in Compiled Code
----------------------------

The Fortran processor can obviously insert checks for pending
transfers (i.e. poll for them) into the executed code, but needs to
put them inside all long-running loops and all potentially blocking
primitives (e.g. I/O statements).  That could reduce application
performance by a large factor, because it interferes with the
pipelining that is so important on almost all modern processors.  It
also does not address the problem of blocking in a companion
processor.

This is a viable implementation strategy only if performance and the
use of companion processors are not of major consequence.


4.8 Using the MPI and GASNet Progress Model
-------------------------------------------

MPI and GASNet (see section F.2 below) have the concept of a progress
engine, where MPI or GASNet calls check for and service all pending
actions, but no progress need occur between them.  Experience shows
that this is not what most shared memory programmers expect; it is
very hard to explain to them that what appears to be transparent
shared memory must actually be programmed like two-sided message
passing.
It is also arguably a major change of specification for Fortran
coarrays; whether or not it is, Fortran would need to define when
processors are required to progress, so that programmers can write
correct, reasonably portable programs.  On this topic, it should be
noted that the implementation of GASNet is such that UPC programs that
do not make expeditious progress actually fail, rather than merely
running inefficiently.

Specifying that explicit action is needed on the image that owns
coindexed objects to ensure progress (even if that image does not
access the relevant coarray) is a viable strategy, but needs
significant changes to the standard; also, many people will regard it
as unacceptable for VOLATILE coarrays.

This problem is much less serious for segments (i.e. non-VOLATILE
coarrays), because the higher granularity and the need for explicit
image control statements are likely to lead to a more collective
programming style.  The problem does not arise at all for SYNC ALL and
CRITICAL, of course, which are already collective.

This is probably not a viable implementation strategy for VOLATILE
coarrays, but probably is for segments.