ISO/IEC JTC1/SC22/WG5-N1749


VOLATILE Coarrays Break Existing Code
-------------------------------------

Nick Maclaren, 31st October 2008


In my view, one of the most serious faults of VOLATILE coarrays is that
they will cause existing compiled code (that is, code that uses neither
VOLATILE nor coarrays) to break.  There are two reasons for this, both
explained in N1745; this paper gives more detail on one of them (and a
lot less context).


    1) The current draft permits a coarray to be defined on one image in
a scope where it does not have the VOLATILE attribute, and referenced
from another image in a scope where it does, without the segments being
ordered.  This does not work, because many compilers on many
architectures access non-VOLATILE objects in ways that are not safe in
the presence of concurrent access from another processor.  An example
is given below.

The only solution is to require coarrays to have the VOLATILE attribute
in all scopes or none.


    2) Note 12.51 states that constraints C1274 to C1285 are designed to
guarantee that a PURE procedure is free from side effects, and therefore
may be called safely where there is no explicit order of evaluation, and
need not be called if its result is not needed.  The introduction of
VOLATILE coarrays makes those constraints inadequate.  No example is
given here, because one is given in section 3 of N1745.

The only solution is to forbid any reference to a VOLATILE coarray
inside PURE procedures.

N1745 makes similar remarks about functions, but one person has said
that these are incorrect.  The semantics of impure function calls have
always been a source of heated and unproductive debate, and the point is
not critical anyway, so I propose to concede it.


Note that this paper is NOT arguing for the preservation of VOLATILE
coarrays; resolving their other specification problems would be a much
harder task, and is not covered here.


Example of First Issue
----------------------

In the following, I shall use the Intel 64 Architecture as an example,
but similar remarks apply to several other architectures.

Consider the following program:

    PROGRAM Main
        INTEGER :: x(100) = 123456789, y(100)[*] = 0
        IF (THIS_IMAGE() == 1) THEN
            CALL Fred(x,y)
        ELSE IF (THIS_IMAGE() == 2) THEN
            CALL Joe(y)
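        END IF
        ! No image control statement: image 1's definition of y and
        ! image 2's reference to it are in unordered segments.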

    CONTAINS

        SUBROUTINE Fred (in, out)
            INTEGER, INTENT(IN) :: in(100)
            INTEGER, INTENT(OUT), TARGET :: out(100)
            out = in+in
        END SUBROUTINE Fred

        SUBROUTINE Joe (data)
            ! VOLATILE here, although y is declared without it in Main.
            INTEGER, VOLATILE :: data(100)[*]
            PRINT *, data[1]
        END SUBROUTINE Joe

    END PROGRAM Main

Note that SUBROUTINE Fred contains no use of either VOLATILE or
coarrays.  Compiling it on its own (or as part of the whole program with
the cosubscripts removed) with the Intel 10.1 Fortran compiler and the
-fast option generates the following instructions for the assignment
out = in+in:

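         movdqa    (%rdx,%rdi), %xmm0    # aligned 16-byte load: 4 elements of in
         paddd     %xmm0, %xmm0          # add the 4 packed 32-bit integers to themselves
         movdqu    %xmm0, (%rdx,%rsi)    # unaligned 16-byte store: 4 elements of out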
 
The following statements are taken from the Intel 64 Architecture Memory
Ordering White Paper:

    http://www.intel.com/products/processor/manuals/318147.pdf

Section 1 states that aligned loads and stores of 1, 2, 4 and 8 bytes
are implemented atomically, and then includes the following paragraph:

    Other instructions may be implemented with multiple memory
    accesses. From a memory-ordering point of view, there are no
    guarantees regarding the relative order in which the constituent
    memory accesses are made. There is also no guarantee that the
    constituent operations of a store are executed in the same order as
    the constituent operations of a load.

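The above program therefore has undefined behaviour: the movdqu store
writes 16 bytes (four elements of y) in one instruction, so its
constituent bytes may be stored in any order, and image 2 may see an
element of y that is neither its old value (0) nor its new one
(246913578).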
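For contrast, the following is a minimal sketch of the same program with
the segments ordered, assuming the SYNC ALL and ERROR STOP statements as
they appear in Fortran 2008.  Because SYNC ALL orders the segment in
which image 1 defines y before the segment in which image 2 references
it, the 16-byte store is complete before it is read, and Joe no longer
needs VOLATILE.  The check in Joe is illustrative only and is not part of
the original example.

```fortran
    PROGRAM Main
        INTEGER :: x(100) = 123456789, y(100)[*] = 0
        IF (THIS_IMAGE() == 1) CALL Fred(x,y)
        ! SYNC ALL ends the current segment on every image and orders it
        ! before the segments that follow on all other images, so image
        ! 1's store to y is complete before image 2 reads it.
        SYNC ALL
        IF (THIS_IMAGE() == 2) CALL Joe(y)

    CONTAINS

        SUBROUTINE Fred (in, out)
            INTEGER, INTENT(IN) :: in(100)
            INTEGER, INTENT(OUT), TARGET :: out(100)
            out = in+in
        END SUBROUTINE Fred

        ! No VOLATILE: the segments are ordered by SYNC ALL.
        SUBROUTINE Joe (data)
            INTEGER :: data(100)[*]
            ! Every element must be 2*123456789 = 246913578; any other
            ! value would mean a partially observed store.
            IF (ANY(data(:)[1] /= 246913578)) ERROR STOP 'torn store'
            PRINT *, data[1]
        END SUBROUTINE Joe

    END PROGRAM Main
```

Run on two or more images, image 2 prints 100 copies of 246913578; on a
single image the program prints nothing.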


Aside: Using a VOLATILE coarray with two different base types
(especially INTEGER and REAL) would cause similar problems on a few
architectures, which require special action to ensure that the integer
and floating-point memory pipelines are synchronised across processors.
However, this already seems to be excluded by the current draft.