ISO/IEC JTC1/SC22/WG5-N1744

Coarrays and Memory Models
--------------------------

Nick Maclaren on behalf of the BSI Fortran panel, 7th October 2008.
This paper reflects the views of the majority of the panel.


0. Introduction
---------------

This paper addresses some points where we feel that the standard is
ambiguous or confusing about the precise semantics of coarrays, mostly
as far as non-VOLATILE objects are concerned; it does not propose any
changes to what seem to be the agreed semantics.

Most of this paper is written assuming that VOLATILE coarrays have been
excluded.  If they were to be included, there are some very tricky
questions about how they should interact with segments, and the SYNC
MEMORY statement in particular.  Issues with VOLATILE are discussed
separately in the paper N1745 "VOLATILE Coarrays", but their basic
memory model aspects are described at the end of this one.

This paper takes the viewpoint of a programmer or implementor reading
the Fortran standard, who has not been involved in any of the coarray
standardisation discussions.  The issues are not about what the standard
intends, but about whether its words (and especially the normative text)
explicitly and correctly state its intent.

Unfortunately, there is no consensus on precisely what memory models are
appropriate for shared memory parallelism, either theoretically or in
practice, and experience with supporting scientific programmers is that
many of their 'hard' difficulties derive from them making reasonable
assumptions that do not happen to be true.

None of the issues raised are purely theoretical, as the author has seen
all of them cause major confusion in reasonable code using similar
interfaces, often written by experienced users.

Section 2.3 of paper J3/08-126, by Mellor-Cummey, Adhianto and Scherer
also says that a precise memory model is a key feature, and Fortran
currently lacks it.

There are 4 proposals in this paper, 2 for normative text and 2 for new
NOTEs, in sections 1.1, 2 and 3.3.


1. Sequential Consistency
-------------------------

Section 8.5.1 specifies how segments are ordered, but does not say
anything about ordered segments where the precise order is not specified
by the standard, implicitly or explicitly.  This issue has been
discussed extensively by theoreticians and practitioners in parallel
programming, but with little consensus.

Consider a program like the following:

        PROGRAM Sequential_Consistency

            INTERFACE
        ! The initialisation of mutexes to an unlocked state is omitted
        ! for clarity; the companion processor is assumed to create and
        ! initialise at least mutexes indexed by arguments 8 and 9.
        ! Otherwise, these interfaces are modelled on the POSIX calls
        ! pthread_mutex_lock and  pthread_mutex_unlock.
                SUBROUTINE Mutex_lock (which) BIND(C)
                    USE, INTRINSIC :: ISO_C_BINDING
                    INTEGER(KIND=C_INT), INTENT(IN), VALUE :: which
                END SUBROUTINE Mutex_lock
                SUBROUTINE Mutex_unlock (which) BIND(C)
                    USE, INTRINSIC :: ISO_C_BINDING
                    INTEGER(KIND=C_INT), INTENT(IN), VALUE :: which
                END SUBROUTINE Mutex_unlock
            END INTERFACE

            INTEGER :: one[*] = 0, two[*] = 0, z(8:9) = 0, i

            SELECT CASE(THIS_IMAGE())
        CASE(1)
                CALL Mutex_lock(8)
                SYNC MEMORY
                one[8] = 123
                SYNC MEMORY
                CALL Mutex_unlock(8)
        CASE(2)
                CALL Mutex_lock(9)
                SYNC MEMORY
                two[9] = 456
                SYNC MEMORY
                CALL Mutex_unlock(9)
        CASE(3)
                CALL Mutex_lock(8)
                SYNC MEMORY
                z(8) = one[8]
                SYNC MEMORY
                CALL Mutex_unlock(8)
                CALL Mutex_lock(9)
                SYNC MEMORY
                z(9) = two[9]
                SYNC MEMORY
                CALL Mutex_unlock(9)
                WRITE (3,*) z(8), z(9)
        CASE(4)
                CALL Mutex_lock(9)
                SYNC MEMORY
                z(9) = two[9]
                SYNC MEMORY
                CALL Mutex_unlock(9)
                CALL Mutex_lock(8)
                SYNC MEMORY
                z(8) = one[8]
                SYNC MEMORY
                CALL Mutex_unlock(8)
                WRITE (4,*) z(8), z(9)
            END SELECT

        END PROGRAM Sequential_Consistency

This can write any of the following pairs of values to units 3 and 4:
'0 0', '123 0', '0 456' and '123 456'.  Most combinations of these pairs
are possible, even with sequential consistency [Lamport] (see below).

A)  The principal question is whether unit 3 can contain '123 0' and
    unit 4 can contain '0 456'; if this is so, then the execution is not
    sequentially consistent.

B)  If it is possible, then the subsidiary question is exactly what
    Fortran requires in terms of consistency between segments.

C)  If it is not possible, then the subsidiary question is what normative
    text states that it is not possible.

There are many ways of phrasing this question, but one of the simpler is
that it asks whether image execution control is required to apply to
memory banks (i.e. access to storage across all images) as well as image
execution.  That is Lamport's second restriction (see below), and does
not appear to be stated in Fortran, even indirectly.


1.1 Proposal
------------

There are three possible options: one is to specify sequential
consistency in the currently unspecified case, another is to specify
exactly what Fortran requires from the processor and guarantees to the
programmer, and the third is to exclude the problematic case.

Option 1 (sequential consistency):
----------------------------------

Because sequential consistency is the best understood parallel memory
model, the one that is generally most favoured for shared memory
designs, and one that is already required by the specification of SYNC
ALL and SYNC IMAGES, Fortran could require it for all segments.

If this is adopted, normative text along the following lines should be
added to section 8.5.1 paragraph 5:

    Where multiple segments are ordered, but they can be executed in
    more than one complete order, the processor shall execute them in
    a sequentially consistent fashion.

The standard could also provide a reference to Lamport's letter (see
below), if the term 'sequential consistency' were felt to be unclear on
its own.

Option 2 (not sequential consistency):
--------------------------------------

There is no consensus among parallelism experts about what alternative
shared memory models are reasonable, and no standard terminology to
define them.  Fortran would therefore have to specify exactly what its
specified behaviour is and when programmers can rely on it.

Regrettably, none of the BSI panel have been able to make a proposal
that they are sure is self-consistent enough for standardisation, but
the matter is being considered.  The minimum requirement is that there
is an identifiable subset of uses that has specified behaviour, and that
the standard clearly specifies the subset and its behaviour.

Option 3 (removal of the difficult case):
-----------------------------------------

SYNC MEMORY could be removed, leaving only SYNC ALL and SYNC IMAGES.
The execution of a CRITICAL construct on different images would be
explicitly stated to be completely ordered.


NOTE 8.31 can be read to imply that this issue also arises with the use
of CRITICAL and SYNC IMAGES.  At best, that note is unhelpful, and may
increase confusion; unless there is clear normative text specifying the
model (or equivalent), it needs deletion or rewriting.


1.2 Reference
-------------

Lamport gave a definition of sequential consistency in a letter in IEEE
Transactions on Computers 28 pp 690-691, 1979, where he defined it as:

    The results of any execution is the same as if the operations of
    all the processors were executed in some sequential order, and the
    operations of each individual processor appear in this sequence in
    the order specified by its program.

It is important to note that that definition, and most other work on
shared-memory parallelism, assume an essentially flat memory model, but
coarrays use a segmented one.  Lamport considered the matter of 'memory
modules' in his letter, and suggested the following restrictions in
order to ensure sequential consistency:

    Each processor issues memory requests in the order specified by its
    program.

    Memory requests from all processors issued to an individual memory
    module are serviced from a single FIFO queue.  Issuing a memory
    request consists of entering the request on this queue.

The coarrays of an image can be considered as a memory module in this
context, and it is fairly clear that the first restriction is imposed,
but there seems to be nothing in the current draft that imposes the
second restriction.  Without that restriction, the coarray support is
not sequentially consistent.

An alternative way of viewing this issue is that sequential consistency
requires that both that the execution on a single image is sequential,
and that access to the data owned by a single image is.  The current
coarray specification requires the former, but does not seem to require
the latter.


2. Data Storage
---------------

Section 2.5.7 describes coarrays, and paragraph 4 allows an image to
access a coarray owned by itself without using cosubscripts.

The problem here is that Fortran does not explicitly state that the
square bracket notation is merely syntactic sugar, with no special
semantics.  While cautious readers of the standard will not assume
anything that is not explicitly promised, experience with similar
parallel language extensions shows that at least some readers will
misunderstand.

This is made more confusing by using the term "access" in 2.5.7
paragraph 4; that is not a defined term, and is not widely used
elsewhere in the standard in normative text.  Because it is not used
entirely consistently, the solution of defining it would be more trouble
than it is worth.

This is a relatively minor point if volatile coarrays are excluded, but
a more serious one if they were to be included.  This issue should be
clarified in a note, such as:

    NOTE 2.18+

    Accessing a coarray on its own image by using a set of cosubscripts
    that map to the image has exactly the same effect as accessing it
    without cosubscripts.  In particular, the segment ordering rules
    (8.5.1) apply whether or not it uses cosubscripts to access the
    coarray.

The authors of paper J3/08-126 also did not understand the intent, and
asked this question in Section 2.3.


3. User-Defined Ordering
------------------------

Section 8.5.1 paragraph 5 permits 'user-defined ordering' to control
whether two segments are ordered, but provides no further specification
of what that means.  Experience with supporting the users of parallel
systems shows that misunderstandings of what ordering means are very
common, and both of the issues described here arise in practice.

Fortunately, the problems are very easy to resolve in Fortran, merely
by stating that the permissible forms of user-defined ordering are
processor dependent.  That will not eliminate the problems, but will
make the situation clear to all Fortran experts, whether or not they
are parallel programming experts.


3.1 User-Defined Ordering and Separation in Time
------------------------------------------------

Many users will expect that temporal ordering (i.e. segment A precedes
segment B in time) will imply consequential ordering (i.e. that the
effects of A will be visible to B).  However, that is a very onerous
constraint to place on implementors, and few currently honour it for any
parallel language.

The chances of confusion are enhanced in Fortran, because section 2.4.1
paragraph 2 says 'Image execution is a sequence, in time, of actions',
and the word 'time' can reasonably be read to mean physical time.
Consider a large, distributed memory cluster with a consistent global
clock.  There are known, practical algorithms for providing such
consistency with a proven accuracy of better than 4.T.log_2(N), where N
is the number of images and T is the one-way latency for small messages.

A user might reasonably assume that, if each segment is separated in
time by more than twice the accuracy of the clock, then purely temporal
ordering is adequate as a form of 'user-defined ordering'; there is
nothing in the standard that even implies that it is not.  While
parallelism experts will know that is not intended, experience shows
that many readers of the standard will not draw that conclusion, and
there is nothing in the words that states they are wrong to do so.

An example of that sort of coding is shown in the following program:

        PROGRAM User_defined_ordering

            INTERFACE
       ! This assumes that the subroutine TIMESTAMP returns the value
       ! of the global clock in seconds from a suitable starting point,
       ! and the global clock is consistent to better than 0.5 seconds.
                SUBROUTINE Timestamp (value) BIND(C)
                    INTEGER, INTENT(OUT) :: value
                END SUBROUTINE Timestamp
            END INTERFACE

            INTEGER :: var[*] = 0, z

            DO
                CALL Timestamp(z)
                IF (z > 3*THIS_IMAGE()) EXIT
            END DO
            SYNC MEMORY
            ! The system and number of images are assumed to be such
            ! that this segment takes considerably less than 1 second
            ! to complete.
            DO n = 1,NUM_IMAGES()
                var[n] = THIS_IMAGE()
            END DO
            SYNC MEMORY
            CALL Timestamp(z)
            WRITE (*,*) THIS_IMAGE(), z

        END PROGRAM User_defined_ordering

[ The function Timestamp could also be implemented as one that combines
the values returned from the intrinsic DATE_AND_TIME into a single real
or integer numeric number of seconds.  It is described here in terms of
a C function only because there are POSIX functions that provide exactly
the semantics it assumes. ]

Unfortunately, supporting this program requires the processor to perform
a pairwise synchronisation around each of the SYNC MEMORY statements
(within the resolution of the clock) with all of the other images, and
not to proceed until the other images have acknowledged the
synchronisation.  That requires N^2 handshakes for the execution of that
program, where N is the number of images.

No parallelism expert would ever write such code, but the above
conditions can reasonably be regarded as amounting to 'user-defined
ordering'; the current wording of the standard implies that they do.
Even quite experienced users will say that the output (i.e. the
timestamps following the completion of each segment) demonstrates that
the segments are ordered.

Assuming that this is not intended, the fact that they do not should be
clarified, assuming that is what is agreed.  We make a suggestion in
section 3.3.


3.2 User-defined Ordering and Progress
--------------------------------------

There is another ambiguity in the current wording - that of 'progress'.
The basic requirement is to ensure that a correct program will
eventually terminate, and that a processor will not introduce artificial
deadlock or indefinite livelock.  The converse of that is to impose
enough constraints on correct programs so that it is always possible.

[ "Artificial deadlock" is used to refer to the case where there is no
deadlock in the logic of the program, but the processor's implementation
is such that the program deadlocks.  "Indefinite livelock" is where a
program gets into a situation where the processor's implementation
causes it to get stuck in an indefinite loop with images waiting for
each other. ]

This is notoriously difficult to specify, and the best description seems
to be in the MPI Report, especially MPI-1 4.12 and MPI-2 11.7.2.  MPI
imposes enough restrictions on programs that an implementation can
always make progress, and explicitly requires implementations to make
progress for all valid programs.  Fortran coarrays are comparable to
MPI-1 in the absence of user-defined ordering, but that introduces
problems of the sort that are described in MPI-2 11.7.2, where MPI does
not require progress.

There is nothing in the wording of coarrays that makes it clear that a
processor must make progress for all conforming programs.  It is likely
that some implementors will take the approach that they need do so only
for all 'reasonable' programs, and each implementor will use a different
interpretation of the word 'reasonable'.

There is more on this issue in the paper N1745 "VOLATILE Coarrays", but
this paper is intended to be comprehensible on its own.

The following is an example of how the problem can arise (the mutex
interface is identical to that used in the previous example):

        PROGRAM Progress

            INTERFACE
        ! The initialisation of mutexes to an unlocked state is omitted
        ! for clarity; the companion processor is assumed to create and
        ! initialise at least mutexes indexed by arguments 8 and 9.
        ! Otherwise, these interfaces are modelled on the POSIX calls
        ! pthread_mutex_lock and  pthread_mutex_unlock.
                SUBROUTINE Mutex_lock (which) BIND(C)
                    USE, INTRINSIC :: ISO_C_BINDING
                    INTEGER(KIND=C_INT), INTENT(IN), VALUE :: which
                END SUBROUTINE Mutex_lock
                SUBROUTINE Mutex_unlock (which) BIND(C)
                    USE, INTRINSIC :: ISO_C_BINDING
                    INTEGER(KIND=C_INT), INTENT(IN), VALUE :: which
                END SUBROUTINE Mutex_unlock
            END INTERFACE

            INTEGER :: value[*] = 0
            SELECT CASE(THIS_IMAGE())
        CASE(1)
               CALL Mutex_lock(8)
        CASE(2)
               CALL Mutex_lock(9)
            END SELECT
            SYNC ALL

            SELECT CASE(THIS_IMAGE())
        CASE(1)
                value[3] = 123
                SYNC MEMORY
                CALL Mutex_unlock(8)
        CASE(2)
                CALL Mutex_lock(8)
                SYNC MEMORY
                PRINT *, value[3]
                CALL Mutex_unlock(9)
        CASE(3)
                CALL Mutex_lock(9)
            END SELECT
        END PROGRAM Progress

Images 1 and 2 need to transfer a non_VOLATILE value that is owned by
image 3, but image 3 is in a companion processor call that will not
return until after images 1 and 2 have transferred that data!  The
question is whether the program is correct and, if not, where in the
Fortran standard that is stated.

Note that this is different from the existing, serial case where a
program goes into an infinite loop, because there is no deadlock in the
logic of the above program, only in the processor's implementation.


3.3 Proposal
------------

The mechanisms that can be used to provide user-defined ordering should
be stated to be processor dependent, which would make it clear that
users cannot assume a particular behaviour without checking the
appropriate documentation.

Normative text along the following lines should be added to section
8.5.1, somewhere around paragraph 5:

    The mechanisms that may be used to provide user-defined ordering are
    processor dependent.

It would also be desirable to encourage implementors to provide the
mechanisms that are used by current parallel programs for ordering
(primarily I/O to external files and using synchronisation objects, like
semaphores).  A note like the following would help, though it is clearly
not critical.

    NOTE 8.29+

    A processor should include at least the following as potential
    user-defined ordering mechanisms:

        Closing a unit connected to an external file, followed by
    opening a unit connected to an external file.

        Writing to a unit connected to an external file, followed by
    executing the FLUSH statement for it, followed by reading from a
    unit connected to an external file.

        Two calls to impure subroutines that are provided by a companion
    processor.

Note that the fact that images have separate units is irrelevant, as
they could be connected to one another through a series of FIFOs.  NOTE
8.39 already gives an example for the third mechanism.


4. VOLATILE Issues
------------------

Specifically VOLATILE issues are described in the paper N1745 "VOLATILE
Coarrays", but some general memory model issues are described here,
including raising the question of interactions between segment and
VOLATILE coarray consistency.  There is a little overlap between the
papers, to avoid the need for cross-referencing.

It is extremely unclear whether VOLATILE coarrays should use the same
memory model as segments.  That could have a major performance impact on
large systems.  However, if they do not use compatible, well-defined
memory models, many people will get confused.  Section 2.3 of paper
J3/08-126 refers to this point.

Realistic programs will mix the use of VOLATILE coarrays and
partially-ordered segments in quite complicated ways, which confuses the
issue further.
primitives.


4.1 Basic Memory Model
----------------------

This is discussed in more detail in the paper N1745 "VOLATILE Coarrays",
but a simplified example is given here:

        PROGRAM Memory_Model_1
            INTEGER, VOLATILE :: one[*] = 0, two[*] = 0
            INTEGER :: p, q
            SELECT CASE(THIS_IMAGE())
        CASE(1)
                one[8] = 123
        CASE(2)
                two[9] = 456
        CASE(3)
                p = one[8]
                q = two[9]
                WRITE (3,*) p, q
        CASE(4)
                q = two[9]
                p = one[8]
                WRITE (4,*) p, q
            END SELECT
        END PROGRAM Memory_Model_1

The question is whether unit 3 can contain '123 0' and unit 4 can
contain '0 456'.  There seems to be no relevant normative text making
this clear.

If we refer back to program Sequential_Consistency, we are asking the
same question for segments.  The question is whether VOLATILE coarray
accesses have the same consistency requirements as ordered segments with
an unspecified ordering.  If they do not, many users will get very
confused.  But, if they do, it means that each access to a VOLATILE
coarray has to be implemented very much as if it were a complete
segment.

Note that this example is the simplest that shows the issue; more
complex, but still realistic, examples are available from the author.


4.2 Summary
-----------

SC22WG21 (C++) has spent a considerable amount of time debating the
consequences of these issues, and defining a precise memory model.
Fortran would need to do the same if VOLATILE coarrays were to be
included.  See, for example, SC22WG21 paper N2556 and the numerous
papers it references:

    http://open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2556.html

This aspect should be considered when discussing the paper N1745
"VOLATILE Coarrays".