Some comments on the recent WG5 papers on coarrays
===================================================

R. Bader, LRZ               November 6, 2008
                            ISO/IEC JTC1/SC22/WG5-N1756


A number of WG5 papers by Nick Maclaren (N1744, N1745, N1748, N1749, N1751)
contain example programs and a critique of the coarray concept as
defined within the present Fortran 2008 draft, in particular concerning
VOLATILE coarrays and communication with a passive image.

This paper is an attempt to understand some of the identified issues in the
context of the draft standard. 


1. Prerequisites and Assumptions:
=================================

The assessment of the above papers is based on the interpretation of the
relevant parts of the draft Fortran 2008 standard given in this section.

1.1 Definition of VOLATILE (5.3.19):
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The VOLATILE attribute only refers to the possibility of external updates to
an object that has it, not to the manner in which such an update is
performed. In particular, no atomicity of external memory updates on the level
of the object is guaranteed. This interpretation appears to be shared by MR&C,
Fortran 95/2003 Explained:

"Even if only one process is writing to the variable and the Fortran program
is reading from it, ... it is possible to read a partially updated ... value."

In the draft standard, Note 5.24 appears to be a bit misleading; a more
helpful formulation might be

"The Fortran processor should use the most recently available state of a
volatile object when a reference is performed by the processor. Likewise, 
it should make the most recent Fortran state available when a reference
is performed by the external mechanism. It is the programmer’s responsibility
to manage any interaction with non-Fortran processes, including the integrity
of the referenced object." 

For a VOLATILE object, the processor is expected to
* not keep the object in a register across references, and
* not reorder references to or definitions of the object during its
  optimization attempts.
Any code segments involving VOLATILE objects can hence be expected to suffer
(considerable) performance degradation.

1.2 VOLATILE coarrays (8.5.1, para6):
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

As an exception to the rules on definedness of coarray references, which 
normally require an explicit image synchronization between accesses from
different images, it is possible to omit synchronization for VOLATILE
coarrays provided the referenced object is of type default real, default
integer or default logical. 

This restriction, according to N1747, was imposed, 

" ... because memory updates/references to such a variable need to be
atomic: referencing the value on one image concurrently with an update on
another will either get the previous value or the new value. Such atomic
memory operations cannot be guaranteed in general."

As a consequence, the object must additionally be a scalar, since the last
cited sentence would also apply to arrays of the above types. It may be
necessary to change the wording of 8.5.1, para6 to 

"A scalar coarray that is a default integer, default logical, or default
real, and which has the VOLATILE attribute may be referenced during the
execution of a segment that is unordered relative to the execution of a
segment in which the coarray is defined."

to make this clearer. Furthermore (based on e-mail discussion with John
Reid), the following additional Note is suggested for 8.5.1:

-------------------------------------------------------------------------------
NOTE 8.29a
A scalar coarray that is volatile and of type default integer, default real,
or default logical is 'atomic' in the sense that read accesses from any image
other than those altering its value will obtain either the value previous to 
an alteration, or the value after an alteration. 
It remains the programmer's responsibility to prevent race conditions for
such volatile coarray references by suitable formulation of the algorithm
(ref. to example in Note 8.38). 
------------------------------------------------------------------------------- 

In any case, we have here an extension of the semantics of VOLATILE for the
indicated coarray objects, as compared to the original definition in 1.1. This
additional property is needed for the code in Note 8.38 to work. Note that
formally the VOLATILE attribute is only exploited on image Q in that example.
Due to the additional semantics, a reliable implementation for commodity 
clusters will probably incur an even larger overhead than normal VOLATILE 
objects.
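
The spin loop referred to above can be sketched roughly as follows. This is
an illustration in the spirit of Note 8.38, not its verbatim text; the module
and subroutine names and the image indices p and q are hypothetical, and the
code relies on the draft's special semantics for scalar VOLATILE coarrays of
type default logical.

```fortran
module spin_wait_sketch
  implicit none
contains
  ! Image p releases image q; all other images return immediately.
  ! p and q are hypothetical image indices supplied by the caller.
  subroutine notify_and_wait(p, q)
    integer, intent(in) :: p, q
    ! Scalar, default logical, VOLATILE: the kind of coarray for which
    ! the draft permits an unordered definition and reference.
    logical, volatile, save :: locked[*] = .true.
    integer :: me

    ! Nothing to do unless p and q are two distinct, existing images.
    if (p == q .or. min(p, q) < 1 .or. max(p, q) > num_images()) return

    me = this_image()
    if (me == p) then
      ! ... preceding segment on P ...
      sync memory              ! complete P's earlier memory operations
      locked[q] = .false.      ! unordered remote definition of the flag
      sync memory
    else if (me == q) then
      do while (locked)        ! spin; VOLATILE forces a fresh reference
      end do
      sync memory              ! P's preceding segment now precedes
      ! ... subsequent segment on Q ...
    end if
  end subroutine notify_and_wait
end module spin_wait_sketch
```

Every image would call notify_and_wait(p, q) with the same arguments. The
flag is never reset, so the sketch is one-shot.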


2. Comments on paper N1745:
===========================

In section 1, the author advocates three possibilities for a better approach. 

The first of these, locks, will probably be included in the final standard
anyway. However, while it is possible to implement the spin loop from Note 
8.38 in terms of locks, all attempts I've seen incur some additional overhead, 
either due to pre-synchronization, or to potential lock contention. The
solution based on VOLATILE coarrays may hence still be the most efficient one
available. Locks will show their strengths in other situations.
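
The pre-synchronization overhead mentioned above shows up in a lock-based
sketch of the same notification. This assumes that locks enter the standard
in the form of a LOCK_TYPE and LOCK/UNLOCK statements; the names gate,
notify_with_lock, p and q are hypothetical.

```fortran
module lock_notify_sketch
  use, intrinsic :: iso_fortran_env, only: lock_type
  implicit none
contains
  ! Image q waits until image p has finished its work.
  subroutine notify_with_lock(p, q)
    integer, intent(in) :: p, q
    type(lock_type), save :: gate[*]   ! initially unlocked on every image
    integer :: me

    ! Nothing to do unless p and q are two distinct, existing images.
    if (p == q .or. min(p, q) < 1 .or. max(p, q) > num_images()) return

    me = this_image()
    if (me == p) then
      lock (gate[q])       ! P closes the gate on Q first
      sync images (q)      ! pre-synchronization: Q may now try the lock
      ! ... preceding segment on P ...
      unlock (gate[q])     ! releases Q
    else if (me == q) then
      sync images (p)      ! extra handshake the spin loop does not need
      lock (gate)          ! blocks until P has unlocked
      unlock (gate)
      ! ... subsequent segment on Q ...
    end if
  end subroutine notify_with_lock
end module lock_notify_sketch
```

Without the SYNC IMAGES pair, Q could acquire its own lock before P does and
would then proceed too early.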

The second one, atomic datatypes and operations, is in effect what is already
there (see 1.2 above). One could of course consider introducing an additional
separate attribute, say ATOMIC, for such (and only such) an object, and add
special (generic) intrinsics for read, write, test-and-set (TAS),
compare-and-swap (CAS), etc.
The VOLATILE attribute would then still be advantageous since it provides the
effect of an object-specific SYNC MEMORY, saving on overhead if many other
memory operations are outstanding (an additional SYNC MEMORY would otherwise
be required within the spin loop of Note 8.38!).
Furthermore, it appears that anything beyond simple atomic read and write
needs special hardware support, and so may be difficult for distributed memory
systems (DMS) anyway. (Note added in writing: As of Nov 6, there exists
a suggestion by Aleks Donev (N1753) which provides a facility to completely
decouple atomic reads/writes from VOLATILE. This appears to solve the problem
completely from the standardization point of view, although efficiency on DMS
may still be under debate.)

The third one will be addressed in the coarray TR. I agree this is especially
important in the light of being able to map to reduction hardware available in
newer interconnects. 


In section 2, some examples are provided to illustrate inconsistencies in the
standard.

Example 2.1 ("Lack of safety") correctly describes what will happen. It is,
after all, a program with a race condition, in other words, an ill-defined
parallel algorithm. Also note that if VOLATILE were removed as proposed,
the program would be non-conforming. There are then various ways to make it
conforming again, depending on what result one wishes to achieve.

Example 2.2 ("Varying the scope") appears to be addressed by J3/08-290. The
argument is reasonable, especially in the light of the additional atomic
semantics of VOLATILE coarrays. (I'm not sure whether the formulation in
J3/08-290 is sufficient to ensure that a VOLATILE coarray dummy argument is
rejected by the processor.)

Example 2.3 ("Composite object") is non-conforming since the VOLATILE 
coarray is not a scalar and hence the restrictions from 8.5.1 apply.

Example 2.4 ("Protected Context") is non-conforming since the cited
restriction is violated by the object in question being VOLATILE within
the scope of the DO loop. This would apply even if the CASE(1) statements
were not present.

Example 2.5 ("Where and Masks") is non-conforming since the VOLATILE 
coarray is not a scalar and hence the restrictions from 8.5.1 apply.

In section 3, the following points are made:
* Using VOLATILE reduces serial optimization. This is true, and implies 
  that users must be properly educated (like in the use and misuse of 
  other language features).
* The example used to illustrate this uses a PURE function, which of course
  should be optimisable. This issue, an oversight, is addressed in J3/08-284.

In section 4 ("Behaviour on Commodity Clusters"), the example with the
INTEGER, VOLATILE :: table(1000)[*] 
has a high likelihood of being non-conforming since again the object is 
not scalar, and there may be updates coming in from other images in an
unordered segment. This one would probably be a nice candidate for using
locks, by the way (after removing the VOLATILE attribute). 
The example program "Deadlock" will be discussed in the comments on 
N1744 below.
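
A lock-protected variant of the table could look roughly like this. It is
only a sketch: the update pattern (incrementing one entry on an owning image)
is an assumption and is not taken from N1745, locks are assumed to be
available as LOCK_TYPE with LOCK/UNLOCK statements, and the names table_lock,
add_to_entry, owner, idx and delta are hypothetical.

```fortran
module table_lock_sketch
  use, intrinsic :: iso_fortran_env, only: lock_type
  implicit none
  integer, parameter :: table_size = 1000
  ! Same shape as the table in the example, but without VOLATILE.
  integer, save :: table(table_size)[*] = 0
  ! One lock per image, guarding that image's copy of the table.
  type(lock_type), save :: table_lock[*]
contains
  ! Add delta to entry idx of the table on image owner.
  ! The caller must pass a valid image index and a valid table index.
  subroutine add_to_entry(owner, idx, delta)
    integer, intent(in) :: owner, idx, delta
    lock (table_lock[owner])                      ! orders competing updates
    table(idx)[owner] = table(idx)[owner] + delta
    unlock (table_lock[owner])
  end subroutine add_to_entry
end module table_lock_sketch
```

Since every access to an image's table happens between LOCK and UNLOCK on the
same lock, the updates are ordered, and no special atomicity of the array
elements is needed.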

I am not going through the list of implementation choices since large-scale
efficiency of VOLATILE coarrays is not the main point of having them.

Finally, the reference to the C++ standardization efforts with respect
to the memory model (actually at the end of N1744) appears to be relevant
for shared memory processing, but not necessarily for segmented memory
processing.


Final remarks on N1745
~~~~~~~~~~~~~~~~~~~~~~

* An argument can be made to disallow the VOLATILE attribute for
  non-scalar coarray objects as well as coarray objects not of type default
  logical, default integer or default real. This would indeed prevent 
  users from doing stupid things, and it would also obviate the need to
  disambiguate between the two kinds of VOLATILE semantics (atomic
  vs. non-atomic). If the atomic calls suggested by Donev are put into 
  the standard, VOLATILE coarrays could be disallowed completely.

 
3. Concerning Paper N1744:
==========================

Section 1 (Sequential Consistency):
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The author seems to believe that in the absence of a suitable synchronization
sequence defined by the user, the processor should impose one.
This appears contrary to the spirit of coarray programming, which intends to
minimize the constraints as far as possible to achieve improved performance.

Suppose we have a coarray program with 4 images and 2 segments per 
image. The first segment may be a bit load imbalanced (think irregular
lattices) and the second one is a CRITICAL block collecting things.

Under the author's suggestions 1 or 3 we might well get this (unit 
timesteps downward):

 Image    1    2    3    4
-----------------------------
 Segment  1    1    1    1
          1    1    1
          1    1    
          1 
          2
               2
                    2 
                         2
          
while the imbalance would be very nicely hidden 

 Image    1    2    3    4
-----------------------------
 Segment  1    1    1    1
          1    1    1    2
          1    1    2
          1    2
          2

if we simply don't care. OK, I've given the worst case, but on the
other hand there are only very few images ...

So the answer with respect to sequential consistency is: 
* The user must fix the segment ordering if his algorithm requires it,
  and should not do so if it doesn't. Tools for identifying race
  conditions are welcome.
* Sequential consistency may be important with respect to the algorithm,
  i.e., running the algorithm with one image only should, if supported,
  yield results consistent with a many-image run (typically within some
  specified precision, due e.g. to reordering changes in reductions).
 

Section 2 ("Data storage"):
~~~~~~~~~~~~~~~~~~~~~~~~~~~ 

This is addressed in J3/08-290.


Section 3.1 ("User-defined ordering ...")
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

According to 8.5.1, ordering (be it via image synchronization statements
or user-defined constructs using SYNC MEMORY) does imply consequential 
ordering. Otherwise the statements on definability of coarrays in para 6
of that section would not make any sense. For the programmer, this does
mean that e.g. SYNC ALL should be used with great care, since it may force
the transfer of many outstanding buffers that are not yet needed.

In particular, the example program is non-conforming. The SYNC MEMORY
always refers only to the local image (hence there is no N**2 effect),
and segment 2 on image 1 is in fact unordered with respect to segment 2
on all other images. The fact that in many cases "correct" results will
be printed out does not disprove this. Hence, this code *cannot* serve
as an example for user-defined ordering.


Section 3.2 ("... Progress"):
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

I consider the example a bit of a red herring since deadlock situations
with companion processors or I/O may also occur if only two images are
involved. The author has, however, in my opinion exposed a bug in
the specification that must be fixed in 8.5.1 para6. In the first
bullet of para6, the following situations are covered:


----------------------------------------------------------------
Legend:
P, Q, ... are image numbers
a is a coarray
S(XY) is a pairwise sync which induces segment ordering for the 
two involved images.
Time goes downward ... whatever that means.
----------------------------------------------------------------


              P             Q
              |             |
              |             | a = ...
        S(PQ) ~~~~~~~~~~~~~~~                RAW
              |             | 
 ... = a[Q]   |<------------|


and further diagrams covering WAW (push instead of pull by P) and WAR.
What is not covered correctly, however, are (at least) the cases

              P             Q            R
              |             |            |
              |  ... = a[R] |<-----------|
        S(PQ) ~~~~~~~~~~~~~~~            |   WAR (R "passive")
              |             |            |
 a[R] = ...   |------------------------->|
              |             |            |
        S(PR) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~  
              |       S(QR) ~~~~~~~~~~~~~~
              |             |            |
              |  ... = a[R] |<-----------|   RAW (R "passive")
              |             |            |   (or WAW, R "passive", not shown)

Indeed it appears the present formulation of the draft standard requires S(PQ)
twice in the above diagram, making it look like a one-sided MPI call with 
a passive partner (not implemented e.g., in MPICH2 or Intel MPI for good
reason).
In my opinion, R should be passive only with respect to references to a.
For the RAW case it should still be required to execute a

sync images (/P,Q,R/)

(as drawn in the diagram), just as images P and Q do.
For WAR, however, S(PQ) alone is indeed sufficient.

So I think it is appropriate to introduce the concept of an image being
"owner" of a coarray and changing the first bullet to cover all conceivable
transactions with the owner (I may still have overlooked something). 
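
One possible reading of the second diagram, with the owner R taking part in
the synchronization but not touching a itself, is sketched below. The
pairwise SYNC IMAGES statements follow the S(PQ), S(PR), S(QR) lines of the
diagram; the subroutine and argument names are hypothetical, and the sketch
is one-shot.

```fortran
module passive_owner_sketch
  implicit none
contains
  ! Q reads a[R], P then overwrites a[R], Q reads a[R] again.
  ! first_read and second_read are meaningful on image q only.
  subroutine war_then_raw(p, q, r, first_read, second_read)
    integer, intent(in)  :: p, q, r
    integer, intent(out) :: first_read, second_read
    integer, save :: a[*] = 0
    integer :: me

    first_read  = -1           ! sentinel on images other than q
    second_read = -1
    ! Require three distinct, existing images.
    if (any([p, q, r] < 1) .or. any([p, q, r] > num_images())) return
    if (p == q .or. q == r .or. p == r) return

    me = this_image()
    if (me == p) then
      sync images (q)          ! S(PQ): Q's first read precedes the write
      a[r] = 1                 ! write to the owner R
      sync images (r)          ! S(PR)
    else if (me == q) then
      first_read = a[r]        ! read before P's write (WAR)
      sync images (p)          ! S(PQ)
      sync images (r)          ! S(QR)
      second_read = a[r]       ! read after P's write (RAW)
    else if (me == r) then
      sync images (p)          ! S(PR): P's write precedes what follows
      sync images (q)          ! S(QR): and thereby Q's second read
    end if
  end subroutine war_then_raw
end module passive_owner_sketch
```

R executes two separate statements so that the segment between them links
P's write to Q's second read; a single S(PQ) after the write would make R
entirely passive instead.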


Section 3.3 ("... Proposal"):
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

I'd suggest replacing

  "The mechanisms that may be used to provide user-defined ordering are
   processor dependent."

by

  "Additional, processor dependent mechanisms may be used to provide
   user-defined ordering."

since VOLATILE coarrays or atomic intrinsics are already available for 
that purpose.